In my automatically generated code it often happens that scalar operations are applied to vector elements that could have been written as vector operations as well. E.g. (due to modularization issues) I generate code like define <4 x float> @_vadd(<4 x float>, <4 x float>) { %a0 = extractelement <4 x float> %0, i32 0 %b0 = extractelement <4 x float> %1, i32 0 %c0 = fadd float %a0, %b0 %a1 = extractelement <4 x float> %0, i32 1 %b1 = extractelement <4 x float> %1, i32 1 %c1 = fadd float %a1, %b1 %a2 = extractelement <4 x float> %0, i32 2 %b2 = extractelement <4 x float> %1, i32 2 %c2 = fadd float %a2, %b2 %a3 = extractelement <4 x float> %0, i32 3 %b3 = extractelement <4 x float> %1, i32 3 %c3 = fadd float %a3, %b3 %d0 = insertelement <4 x float> undef, float %c0, i32 0 %d1 = insertelement <4 x float> %d0, float %c1, i32 1 %d2 = insertelement <4 x float> %d1, float %c2, i32 2 %d3 = insertelement <4 x float> %d2, float %c3, i32 3 ret <4 x float> %d3 } I think it would be both correct and more efficient to swap 'fadd's and 'extractelements' by an optimization pass which would yield: define <4 x float> @_vadd(<4 x float>, <4 x float>) nounwind readnone { %c = fadd <4 x float> %0, %1 %c0 = extractelement <4 x float> %c, i32 0 %c1 = extractelement <4 x float> %c, i32 1 %c2 = extractelement <4 x float> %c, i32 2 %c3 = extractelement <4 x float> %c, i32 3 %d0 = insertelement <4 x float> undef, float %c0, i32 0 %d1 = insertelement <4 x float> %d0, float %c1, i32 1 %d2 = insertelement <4 x float> %d1, float %c2, i32 2 %d3 = insertelement <4 x float> %d2, float %c3, i32 3 ret <4 x float> %d3 } That the remaining extractelements and insertelements are the identity transform is already correctly detected both by the optimizer and the (X86) code generator. The optimizer transforms the last piece of code to something like: define <4 x float> @_vadd(<4 x float>, <4 x float>) nounwind readnone { %c = fadd <4 x float> %0, %1 ret <4 x float> %c }
I think that this should be handled by the BB vectorizer.
Nadav, any plans to handle this in the SLP vectorizer?
Yea, it should be easy to do. We have a few heuristics for starting an SLP chain. We should add a rule for starting a chain at a bunch of inserts.
Is this actually fixed? A quick ToT of either didn't look any better.
Yes, I checked it and it works as required
Reopening this. If we look at the equivalent c source then it appears that the vectorizer isn't working for non 128-bit vectors: #include <x86intrin.h> __m128 _vadd128(__m128 a, __m128 b) { return _mm_setr_ps( a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]); } __m256 _vadd256(__m256 a, __m256 b) { return _mm256_setr_ps( a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3], a[4] + b[4], a[5] + b[5], a[6] + b[6], a[7] + b[7]); } clang -S -O3 pr6246.c -march=btver2 -o - -emit-llvm ; Function Attrs: norecurse nounwind readnone ssp uwtable define <4 x float> @_vadd128(<4 x float> %a, <4 x float> %b) local_unnamed_addr #0 { entry: %0 = fadd <4 x float> %a, %b ret <4 x float> %0 } ; Function Attrs: norecurse nounwind readnone ssp uwtable define <8 x float> @_vadd256(<8 x float> %a, <8 x float> %b) local_unnamed_addr #0 { entry: %vecext = extractelement <8 x float> %a, i32 0 %vecext1 = extractelement <8 x float> %b, i32 0 %add = fadd float %vecext, %vecext1 %vecext2 = extractelement <8 x float> %a, i32 1 %vecext3 = extractelement <8 x float> %b, i32 1 %add4 = fadd float %vecext2, %vecext3 %vecext5 = extractelement <8 x float> %a, i32 2 %vecext6 = extractelement <8 x float> %b, i32 2 %add7 = fadd float %vecext5, %vecext6 %vecext8 = extractelement <8 x float> %a, i32 3 %vecext9 = extractelement <8 x float> %b, i32 3 %add10 = fadd float %vecext8, %vecext9 %vecext11 = extractelement <8 x float> %a, i32 4 %vecext12 = extractelement <8 x float> %b, i32 4 %add13 = fadd float %vecext11, %vecext12 %vecext14 = extractelement <8 x float> %a, i32 5 %vecext15 = extractelement <8 x float> %b, i32 5 %add16 = fadd float %vecext14, %vecext15 %vecext17 = extractelement <8 x float> %a, i32 6 %vecext18 = extractelement <8 x float> %b, i32 6 %add19 = fadd float %vecext17, %vecext18 %vecext20 = extractelement <8 x float> %a, i32 7 %vecext21 = extractelement <8 x float> %b, i32 7 %add22 = fadd float %vecext20, %vecext21 %vecinit.i = insertelement <8 x float> undef, float %add, i32 0 %vecinit1.i = insertelement <8 x float> %vecinit.i, float %add4, i32 1 %vecinit2.i = insertelement <8 x float> %vecinit1.i, float %add7, i32 2 %vecinit3.i = insertelement <8 x float> %vecinit2.i, float %add10, i32 3 %vecinit4.i = insertelement <8 x float> %vecinit3.i, float %add13, i32 4 %vecinit5.i = insertelement <8 x float> %vecinit4.i, float %add16, i32 5 %vecinit6.i = insertelement <8 x float> %vecinit5.i, float %add19, i32 6 %vecinit7.i = insertelement <8 x float> %vecinit6.i, float %add22, i32 7 ret <8 x float> %vecinit7.i }
Must be fixed in r288412
r288412 was reverted in r288431, reopening.
I've added tests for this (128/256/512 bit float/double vectors) to llvm\test\Transforms\SLPVectorizer\X86\arith-fp.ll at r288492
Now, I think, it is finally fixed in r288497
Reverted again in r288508, reopening.
Fixed at rL289043 - it seems to have stuck this time.