6246 – SLP vectorization when performing scalar operations on vector elements

LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filled before November 26, 2021. Use github to submit LLVM bugs

Bug 6246 - SLP vectorization when performing scalar operations on vector elements

Summary: SLP vectorization when performing scalar operations on vector elements

Status:	RESOLVED FIXED

Alias:	None

Product:	libraries
Classification:	Unclassified
Component:	Scalar Optimizations (show other bugs)
Version:	2.6
Hardware:	PC Linux

Importance:	P normal
Assignee:	Unassigned LLVM Bugs

URL:
Keywords:

Depends on:
Blocks:

Reported:	2010-02-05 13:46 PST by Henning Thielemann
Modified:	2017-01-08 14:54 PST (History)
CC List:	7 users (show)

See Also:
Fixed By Commit(s):

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Henning Thielemann 2010-02-05 13:46:17 PST

In my automatically generated code it often happens that scalar operations are applied to vector elements that could have been written as vector operations as well. E.g. (due to modularization issues) I generate code like

define <4 x float> @_vadd(<4 x float>, <4 x float>) {
  %a0 = extractelement <4 x float> %0, i32 0
  %b0 = extractelement <4 x float> %1, i32 0
  %c0 = fadd float %a0, %b0
  %a1 = extractelement <4 x float> %0, i32 1
  %b1 = extractelement <4 x float> %1, i32 1
  %c1 = fadd float %a1, %b1
  %a2 = extractelement <4 x float> %0, i32 2
  %b2 = extractelement <4 x float> %1, i32 2
  %c2 = fadd float %a2, %b2
  %a3 = extractelement <4 x float> %0, i32 3
  %b3 = extractelement <4 x float> %1, i32 3
  %c3 = fadd float %a3, %b3
  %d0 = insertelement <4 x float> undef, float %c0, i32 0
  %d1 = insertelement <4 x float> %d0, float %c1, i32 1
  %d2 = insertelement <4 x float> %d1, float %c2, i32 2
  %d3 = insertelement <4 x float> %d2, float %c3, i32 3
  ret <4 x float> %d3
}

I think it would be both correct and more efficient to swap 'fadd's and 'extractelements' by an optimization pass which would yield:

define <4 x float> @_vadd(<4 x float>, <4 x float>) nounwind readnone {
  %c = fadd <4 x float> %0, %1
  %c0 = extractelement <4 x float> %c, i32 0
  %c1 = extractelement <4 x float> %c, i32 1
  %c2 = extractelement <4 x float> %c, i32 2
  %c3 = extractelement <4 x float> %c, i32 3
  %d0 = insertelement <4 x float> undef, float %c0, i32 0
  %d1 = insertelement <4 x float> %d0, float %c1, i32 1
  %d2 = insertelement <4 x float> %d1, float %c2, i32 2
  %d3 = insertelement <4 x float> %d2, float %c3, i32 3
  ret <4 x float> %d3
}

That the remaining extractelements and insertelements are the identity transform is already correctly detected both by the optimizer and the (X86) code generator. The optimizer transforms the last piece of code to something like:

define <4 x float> @_vadd(<4 x float>, <4 x float>) nounwind readnone {
  %c = fadd <4 x float> %0, %1
  ret <4 x float> %c
}

Comment 1 Nadav Rotem 2012-05-26 21:02:42 PDT

I think that this should be handled by the BB vectorizer.

Comment 2 Tobias Grosser 2013-07-18 13:41:05 PDT

Nadav, any plans to handle this in the SLP vectorizer?

Comment 3 Nadav Rotem 2013-07-18 13:47:03 PDT

Yea, it should be easy to do.  We have a few heuristics for starting an SLP chain. We should add a rule for starting a chain at a bunch of inserts.

Comment 4 Simon Pilgrim 2016-10-13 09:26:18 PDT

Is this actually fixed? A quick ToT of either didn't look any better.

Comment 5 Alexey Bataev 2016-10-13 09:45:36 PDT

Yes, I checked it and it works as required

Comment 6 Simon Pilgrim 2016-11-12 09:43:32 PST

Reopening this. If we look at the equivalent c source then it appears that the vectorizer isn't working for non 128-bit vectors:

#include <x86intrin.h>

__m128 _vadd128(__m128 a, __m128 b) {
 return _mm_setr_ps(
   a[0] + b[0],
   a[1] + b[1],
   a[2] + b[2],
   a[3] + b[3]);
}

__m256 _vadd256(__m256 a, __m256 b) {
 return _mm256_setr_ps(
   a[0] + b[0],
   a[1] + b[1],
   a[2] + b[2],
   a[3] + b[3],
   a[4] + b[4],
   a[5] + b[5],
   a[6] + b[6],
   a[7] + b[7]);
}

clang -S -O3 pr6246.c -march=btver2 -o - -emit-llvm

; Function Attrs: norecurse nounwind readnone ssp uwtable
define <4 x float> @_vadd128(<4 x float> %a, <4 x float> %b) local_unnamed_addr #0 {
entry:
  %0 = fadd <4 x float> %a, %b
  ret <4 x float> %0
}

; Function Attrs: norecurse nounwind readnone ssp uwtable
define <8 x float> @_vadd256(<8 x float> %a, <8 x float> %b) local_unnamed_addr #0 {
entry:
  %vecext = extractelement <8 x float> %a, i32 0
  %vecext1 = extractelement <8 x float> %b, i32 0
  %add = fadd float %vecext, %vecext1
  %vecext2 = extractelement <8 x float> %a, i32 1
  %vecext3 = extractelement <8 x float> %b, i32 1
  %add4 = fadd float %vecext2, %vecext3
  %vecext5 = extractelement <8 x float> %a, i32 2
  %vecext6 = extractelement <8 x float> %b, i32 2
  %add7 = fadd float %vecext5, %vecext6
  %vecext8 = extractelement <8 x float> %a, i32 3
  %vecext9 = extractelement <8 x float> %b, i32 3
  %add10 = fadd float %vecext8, %vecext9
  %vecext11 = extractelement <8 x float> %a, i32 4
  %vecext12 = extractelement <8 x float> %b, i32 4
  %add13 = fadd float %vecext11, %vecext12
  %vecext14 = extractelement <8 x float> %a, i32 5
  %vecext15 = extractelement <8 x float> %b, i32 5
  %add16 = fadd float %vecext14, %vecext15
  %vecext17 = extractelement <8 x float> %a, i32 6
  %vecext18 = extractelement <8 x float> %b, i32 6
  %add19 = fadd float %vecext17, %vecext18
  %vecext20 = extractelement <8 x float> %a, i32 7
  %vecext21 = extractelement <8 x float> %b, i32 7
  %add22 = fadd float %vecext20, %vecext21
  %vecinit.i = insertelement <8 x float> undef, float %add, i32 0
  %vecinit1.i = insertelement <8 x float> %vecinit.i, float %add4, i32 1
  %vecinit2.i = insertelement <8 x float> %vecinit1.i, float %add7, i32 2
  %vecinit3.i = insertelement <8 x float> %vecinit2.i, float %add10, i32 3
  %vecinit4.i = insertelement <8 x float> %vecinit3.i, float %add13, i32 4
  %vecinit5.i = insertelement <8 x float> %vecinit4.i, float %add16, i32 5
  %vecinit6.i = insertelement <8 x float> %vecinit5.i, float %add19, i32 6
  %vecinit7.i = insertelement <8 x float> %vecinit6.i, float %add22, i32 7
  ret <8 x float> %vecinit7.i
}

Comment 7 Alexey Bataev 2016-12-01 14:52:21 PST

Must be fixed in r288412

Comment 8 Michael Kuperstein 2016-12-01 16:59:52 PST

r288412 was reverted in r288431, reopening.

Comment 9 Simon Pilgrim 2016-12-02 04:55:52 PST

I've added tests for this (128/256/512 bit float/double vectors) to llvm\test\Transforms\SLPVectorizer\X86\arith-fp.ll at r288492

Comment 10 Alexey Bataev 2016-12-02 09:07:35 PST

Now, I think, it is finally fixed in r288497

Comment 11 Michael Kuperstein 2016-12-02 12:58:16 PST

Reverted again in r288508, reopening.

Comment 12 Simon Pilgrim 2017-01-08 14:54:00 PST

Fixed at rL289043 - it seems to have stuck this time.