Prior to r310260, the compiler was optimizing some add/sub + shuffle patterns to horizontal add/sub instructions, but it no longer seems to do so. Consider this example:

__attribute__((noinline)) __m128 add_ps_001(__m128 a, __m128 b) {
  __m128 r = (__m128){ a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
  return __builtin_shufflevector(r, a, -1, 1, 2, 3);
}

When compiled with the options "-S -O2 -march=bdver2" using a compiler prior to r310260, the compiler would generate the following assembly:

vmovaps (%rcx), %xmm0
vhaddps (%rdx), %xmm0, %xmm0
retq

After r310260, the compiler is now generating the following code for the same function:

vmovaps (%rdx), %xmm0
vmovddup 8(%rcx), %xmm1              # xmm1 = mem[0,0]
vinsertps $28, %xmm0, %xmm1, %xmm2   # xmm2 = xmm1[0],xmm0[0],zero,zero
vinsertps $76, %xmm1, %xmm0, %xmm1   # xmm1 = xmm1[1],xmm0[1],zero,zero
vaddps %xmm1, %xmm2, %xmm1
vpermilpd $1, %xmm0, %xmm2           # xmm2 = xmm0[1,0]
vpermilps $231, %xmm0, %xmm0         # xmm0 = xmm0[3,1,2,3]
vaddss %xmm0, %xmm2, %xmm0
vpermilps $208, %xmm1, %xmm1         # xmm1 = xmm1[0,0,1,3]
vinsertps $48, %xmm0, %xmm1, %xmm0   # xmm0 = xmm1[0,1,2],xmm0[0]
retq
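For reference, here is a scalar sketch of what the old codegen computes: the 128-bit VHADDPS puts the pairwise sums of the first source in lanes 0-1 and the pairwise sums of the second source in lanes 2-3, which is exactly the vector r built in the C source (the helper name below is hypothetical):

```c
#include <assert.h>

/* Scalar model of VHADDPS (128-bit form): lanes 0-1 are the pairwise
   sums of a, lanes 2-3 are the pairwise sums of b. This is the same
   vector r that add_ps_001 constructs before its final shuffle. */
static void hadd_ps(const float a[4], const float b[4], float r[4]) {
    r[0] = a[0] + a[1];
    r[1] = a[2] + a[3];
    r[2] = b[0] + b[1];
    r[3] = b[2] + b[3];
}
```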
The IR for this example is identical to bug 34725, so if we view this as a backend problem, one of these would be a duplicate.
For reference: https://reviews.llvm.org/rL310260 (SLP)

And here's the IR that we currently (r314201) produce:

$ ./clang -O2 34724.c -S -o - -mavx -emit-llvm

define <4 x float> @add_ps_001(<4 x float> %a, <4 x float> %b) {
entry:
  %0 = shufflevector <4 x float> %a, <4 x float> %b, <2 x i32> <i32 2, i32 4>
  %1 = shufflevector <4 x float> %a, <4 x float> %b, <2 x i32> <i32 3, i32 5>
  %2 = fadd <2 x float> %0, %1
  %3 = extractelement <2 x float> %2, i32 0
  %vecinit5 = insertelement <4 x float> undef, float %3, i32 1
  %4 = extractelement <2 x float> %2, i32 1
  %vecinit9 = insertelement <4 x float> %vecinit5, float %4, i32 2
  %vecext10 = extractelement <4 x float> %b, i32 2
  %vecext11 = extractelement <4 x float> %b, i32 3
  %add12 = fadd float %vecext10, %vecext11
  %vecinit13 = insertelement <4 x float> %vecinit9, float %add12, i32 3
  ret <4 x float> %vecinit13
}
There's a related theme here and in bug 34730 and bug 34716:

  %ext = extractelement <2 x float> %x, i32 0
  %ins = insertelement <4 x float> undef, float %ext, i32 1
-->
  %shuf = shufflevector <2 x float> %x, <2 x float> undef, <4 x i32> <i32 undef, i32 0, i32 undef, i32 undef>

Can we canonicalize to shuffles more aggressively in instcombine now? If we can recognize insert_subvector, extract_subvector, identity (this is already true, I think), and single-element-repositioning (is there a better name for this example?) as special-case masks, it should reduce complexity in the backend.
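To make the equivalence above concrete, here is a small scalar simulation (the helper name is hypothetical, and -1 stands in for an undef mask element): the extract+insert pair defines only lane 1 of the result, with the value x[0], which is exactly what the shufflevector with mask <undef, 0, undef, undef> produces.

```c
#include <assert.h>

/* Scalar model of shufflevector from a 2-element source into a
   4-element result. A mask entry of -1 means the lane is undef and is
   left unwritten; entries 2-3 would read the (undef) second operand. */
static void shufflevector2to4(const float x[2], const int mask[4],
                              float out[4]) {
    for (int i = 0; i < 4; ++i)
        if (mask[i] >= 0 && mask[i] < 2)
            out[i] = x[mask[i]];
}
```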
(In reply to Sanjay Patel from comment #1)
> The IR for this example is identical to bug 34725, so if we view this as a
> backend problem, one of these would be a duplicate.

That was wrong - the instructions are similar, but the element indexes for the shuffles and insert/extract are not the same.
We could probably make the list of shuffles instcombine recognizes a bit longer, if that helps... we just want to guard against making shuffles the backend won't recognize.
Looking at this report again now, and nothing's changed since the last comments. Here's a better view of the code going into SLP:

define <4 x float> @add_ps_001(<4 x float> %a, <4 x float> %b) {
  %a0 = extractelement <4 x float> %a, i32 0
  %a1 = extractelement <4 x float> %a, i32 1
  %a2 = extractelement <4 x float> %a, i32 2
  %a3 = extractelement <4 x float> %a, i32 3

  %b0 = extractelement <4 x float> %b, i32 0
  %b1 = extractelement <4 x float> %b, i32 1
  %b2 = extractelement <4 x float> %b, i32 2
  %b3 = extractelement <4 x float> %b, i32 3

  %a23 = fadd float %a2, %a3
  %b01 = fadd float %b0, %b1
  %b23 = fadd float %b2, %b3

  %v1 = insertelement <4 x float> undef, float %a23, i32 1
  %v2 = insertelement <4 x float> %v1, float %b01, i32 2
  %v3 = insertelement <4 x float> %v2, float %b23, i32 3
  ret <4 x float> %v3
}

------------------------------------------------------------------------------

There are multiple potential problems contained in this example:

1. The cost model calculation claims that the extract/insert changes + <2 x float> math op are profitable. This may need adjustment either in the generic formulation or in the x86 cost model values.

2. SLP chooses to round down to vectorize to <2 x float> rather than rounding up a potential <3 x float> op to <4 x float>. This is a limitation in SLPVectorizerPass::tryToVectorizeList().

3. SLP chooses to vectorize the more complicated pair of scalar ops rather than the simpler pair. Ie, we could vectorize this pair with elements all from %b:
  %b01 = fadd float %b0, %b1
  %b23 = fadd float %b2, %b3
instead of this pair:
  %a23 = fadd float %a2, %a3
  %b01 = fadd float %b0, %b1

4. As noted in comment 3, instcombine should be able to turn more extract+insert sequences into shuffles. We should probably adjust that independent of this bug, but I don't think it will change the codegen outcome on this example.
If we are viewing this as a backend pattern matching problem, I don't know how we'd solve it after SLP partially vectorizes the IR. Matching a mix of scalar/vector code to try to find a horizontal op is a potentially limitless exercise. I think we either need to get SLP to stop vectorizing on this example or get it to vectorize to a <4 x float> fadd. The backend should recognize either of those possibilities and turn it into 'vhaddps' for x86 AVX.
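For completeness, the <4 x float> form the backend can match is an fadd of the even-element and odd-element shuffles of the two sources: fadd(shuffle(a,b,<0,2,4,6>), shuffle(a,b,<1,3,5,7>)) == hadd(a,b). A scalar sketch of that identity (helper name hypothetical):

```c
#include <assert.h>

/* hadd(a,b) expressed as the fadd of two shuffles over the 8-element
   concatenation ab = a ++ b: even-indexed elements plus odd-indexed
   elements, which is the shuffle+fadd shape a backend can turn into
   a single horizontal-add instruction. */
static void hadd_from_shuffles(const float a[4], const float b[4],
                               float r[4]) {
    float ab[8];
    for (int i = 0; i < 4; ++i) { ab[i] = a[i]; ab[4 + i] = b[i]; }
    const int even[4] = { 0, 2, 4, 6 };
    const int odd[4]  = { 1, 3, 5, 7 };
    for (int i = 0; i < 4; ++i)
        r[i] = ab[even[i]] + ab[odd[i]];
}
```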
Something else that is odd is that the codegen is incredibly different depending upon which element you shuffle as undefined: https://gcc.godbolt.org/z/e-6XUz

Many are generating the haddps, and a load of extra code as well!

__m128 add_ps_u123(__m128 a, __m128 b) {
  __m128 r = (__m128){ a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
  return __builtin_shufflevector(r, a, -1, 1, 2, 3);
}
__m128 add_ps_0u22(__m128 a, __m128 b) {
  __m128 r = (__m128){ a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
  return __builtin_shufflevector(r, a, 0, -1, 2, 3);
}
__m128 add_ps_01u3(__m128 a, __m128 b) {
  __m128 r = (__m128){ a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
  return __builtin_shufflevector(r, a, 0, 1, -1, 3);
}
__m128 add_ps_012u(__m128 a, __m128 b) {
  __m128 r = (__m128){ a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
  return __builtin_shufflevector(r, a, 0, 1, 2, -1);
}

_Z11add_ps_u123Dv4_fS_:                 # @_Z11add_ps_u123Dv4_fS_
vpermilpd $1, %xmm0, %xmm2           # xmm2 = xmm0[1,0]
vinsertps $204, %xmm0, %xmm1, %xmm0  # xmm0 = xmm0[3],xmm1[1],zero,zero
vinsertps $28, %xmm1, %xmm2, %xmm2   # xmm2 = xmm2[0],xmm1[0],zero,zero
vaddps %xmm0, %xmm2, %xmm0
vpermilpd $1, %xmm1, %xmm2           # xmm2 = xmm1[1,0]
vpermilps $231, %xmm1, %xmm1         # xmm1 = xmm1[3,1,2,3]
vaddss %xmm1, %xmm2, %xmm1
vpermilps $208, %xmm0, %xmm0         # xmm0 = xmm0[0,0,1,3]
vinsertps $48, %xmm1, %xmm0, %xmm0   # xmm0 = xmm0[0,1,2],xmm1[0]
retq

_Z11add_ps_0u22Dv4_fS_:                 # @_Z11add_ps_0u22Dv4_fS_
vhaddps %xmm1, %xmm0, %xmm0
retq

_Z11add_ps_01u3Dv4_fS_:                 # @_Z11add_ps_01u3Dv4_fS_
vpermilpd $1, %xmm1, %xmm2           # xmm2 = xmm1[1,0]
vpermilps $231, %xmm1, %xmm1         # xmm1 = xmm1[3,1,2,3]
vhaddps %xmm0, %xmm0, %xmm0
vaddss %xmm1, %xmm2, %xmm1
vinsertps $48, %xmm1, %xmm0, %xmm0   # xmm0 = xmm0[0,1,2],xmm1[0]
retq

_Z11add_ps_012uDv4_fS_:                 # @_Z11add_ps_012uDv4_fS_
vmovshdup %xmm1, %xmm2               # xmm2 = xmm1[1,1,3,3]
vhaddps %xmm0, %xmm0, %xmm0
vaddss %xmm2, %xmm1, %xmm1
vinsertps $32, %xmm1, %xmm0, %xmm0   # xmm0 = xmm0[0,1],xmm1[0],xmm0[3]
retq
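Note that all four variants build the same vector r and their final shuffle takes lane i of r for every defined lane i, so they differ only in which single lane is undef; ideally each would cost one vhaddps plus at most a cheap blend. A scalar restatement of that shared computation (helper name hypothetical):

```c
#include <assert.h>

/* Every add_ps_* variant above builds the same
   r = { a0+a1, a2+a3, b0+b1, b2+b3 } and then keeps lane i of r for
   every lane i its mask leaves defined, so the variants differ only in
   which one lane is undef. */
static float r_lane(const float a[4], const float b[4], int lane) {
    const float r[4] = { a[0] + a[1], a[2] + a[3],
                         b[0] + b[1], b[2] + b[3] };
    return r[lane];
}
```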
I have a very preliminary patch for non-pow2 support in the SLP vectorizer, which will extend <3 x float> to <4 x float>. It is too optimistic in terms of the cost of the vectorization, but generally speaking it looks good. I hope to fix all the problems very soon.
(In reply to Alexey Bataev from comment #9)
> I have a very preliminary patch for non-pow2 support in the SLP vectorizer,
> which will extend <3 x float> to <4 x float>. It is too optimistic in terms
> of the cost of the vectorization, but generally speaking it looks good. I
> hope to fix all the problems very soon.

Great! If we get a <4 x float> math op from SLP, then that should be no problem for the backend to match as a horizontal op (if that's good for the target).

I'm still trying to make instcombine produce code that SLP can handle more easily here:
https://reviews.llvm.org/D53507
@spatel Do you think this is a candidate for the VectorCombine pass?
(In reply to Simon Pilgrim from comment #11)
> @spatel Do you think this is a candidate for the VectorCombine pass?

We already get the extractelement transforms in -vector-combine that I would expect given the IR in comment 6:

  %0 = shufflevector <4 x float> %a, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 3, i32 undef>
  %1 = fadd <4 x float> %0, %a
  %2 = shufflevector <4 x float> %b, <4 x float> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %3 = fadd <4 x float> %2, %b
  %4 = shufflevector <4 x float> %1, <4 x float> %3, <4 x i32> <i32 undef, i32 2, i32 4, i32 undef>
  %5 = shufflevector <4 x float> %b, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 3, i32 undef>
  %6 = fadd <4 x float> %5, %b
  %7 = shufflevector <4 x float> %4, <4 x float> %6, <4 x i32> <i32 undef, i32 1, i32 2, i32 6>
  ret <4 x float> %7

------------------------------------------------------------------------------

So SLP is out of the picture currently. The given IR does result in a horizontal add, but we don't have a way to combine these into a single hadd yet:

vmovshdup %xmm0, %xmm2               ## xmm2 = xmm0[1,1,3,3]
vaddps %xmm0, %xmm2, %xmm0
vhaddps %xmm1, %xmm1, %xmm2
vshufps $200, %xmm2, %xmm0, %xmm0    ## xmm0 = xmm0[0,2],xmm2[0,3]
vmovshdup %xmm1, %xmm2               ## xmm2 = xmm1[1,1,3,3]
vaddps %xmm1, %xmm2, %xmm1
vinsertps $177, %xmm1, %xmm0, %xmm0  ## xmm0 = zero,xmm0[1,2],xmm1[2]
A possible transform for vector-combine that will help:

  %5 = shufflevector <4 x float> %b, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 3, i32 undef>
  %6 = fadd <4 x float> %5, %b
  %7 = shufflevector <4 x float> %4, <4 x float> %6, <4 x i32> <i32 undef, i32 1, i32 2, i32 6>

We chose to shuffle element 3 of %b over to element 2 in the 1st shuffle, and that then gets pushed back over to element 3 in the 2nd shuffle. We want to adjust the 1st shuffle so that we're calculating the binop in element 3, because that creates a select ("blend") shuffle instead:

  %5 = shufflevector <4 x float> %b, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 undef, i32 2>
  %6 = fadd <4 x float> %5, %b
  %7 = shufflevector <4 x float> %4, <4 x float> %6, <4 x i32> <i32 undef, i32 1, i32 2, i32 7>
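A scalar check that the two forms compute the same value in the final lane 3 (helper names are hypothetical): the original sequence computes b[3]+b[2] in lane 2 of %6 and the final mask entry 6 moves it to lane 3, while the adjusted sequence computes b[2]+b[3] directly in lane 3 and the final mask entry 7 keeps it in place, so the last shuffle selects each lane from the same position of one operand, i.e. it is a blend.

```c
#include <assert.h>

/* Original: %5 puts b[3] in lane 2, so %6 lane 2 = b[3] + b[2];
   the final shuffle's mask entry 6 moves %6 lane 2 to result lane 3. */
static float original_lane3(const float b[4]) {
    return b[3] + b[2];
}

/* Adjusted: %5 puts b[2] in lane 3, so %6 lane 3 = b[2] + b[3];
   the final shuffle's mask entry 7 keeps %6 lane 3 in result lane 3,
   making the last shuffle a per-lane select ("blend"). */
static float adjusted_lane3(const float b[4]) {
    return b[2] + b[3];
}
```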
Small hack proposal: https://reviews.llvm.org/D76623
Further backend improvements: https://reviews.llvm.org/D83789