Compiler no longer optimizing some horizontal add/sub patterns after r310260 #34072

Open
dyung opened this issue Sep 26, 2017 · 15 comments
Labels
backend:X86, bugzilla

Comments

@dyung
Collaborator

dyung commented Sep 26, 2017

Bugzilla Link 34724
Version trunk
OS Linux
CC @alexey-bataev,@efriedma-quic,@hfinkel,@RKSimon,@rotateright,@ZviRackover

Extended Description

Prior to r310260, the compiler was optimizing some add/sub + shuffle patterns to horizontal add/sub instructions, but no longer seems to do so.

Consider this example:

__attribute__((noinline))
__m128 add_ps_001(__m128 a, __m128 b) {
  __m128 r = (__m128){ a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
  return __builtin_shufflevector(r, a, -1, 1, 2, 3);
}

When compiled with the options "-S -O2 -march=bdver2" using a compiler built prior to r310260, the compiler would generate the following assembly:

vmovaps (%rcx), %xmm0
vhaddps (%rdx), %xmm0, %xmm0
retq

After r310260, the compiler is now generating the following code for the same function:

vmovaps (%rdx), %xmm0
vmovddup        8(%rcx), %xmm1  # xmm1 = mem[0,0]
vinsertps       $28, %xmm0, %xmm1, %xmm2 # xmm2 = xmm1[0],xmm0[0],zero,zero
vinsertps       $76, %xmm1, %xmm0, %xmm1 # xmm1 = xmm1[1],xmm0[1],zero,zero
vaddps  %xmm1, %xmm2, %xmm1
vpermilpd       $1, %xmm0, %xmm2 # xmm2 = xmm0[1,0]
vpermilps       $231, %xmm0, %xmm0 # xmm0 = xmm0[3,1,2,3]
vaddss  %xmm0, %xmm2, %xmm0
vpermilps       $208, %xmm1, %xmm1 # xmm1 = xmm1[0,0,1,3]
vinsertps       $48, %xmm0, %xmm1, %xmm0 # xmm0 = xmm1[0,1,2],xmm0[0]
retq
@rotateright
Contributor

The IR for this example is identical to bug 34725, so if we view this as a backend problem, one of these would be a duplicate.

@rotateright
Contributor

rotateright commented Sep 26, 2017

For reference: https://reviews.llvm.org/rL310260 (SLP)

And here's the IR that we currently (r314201) produce:

$ ./clang -O2 34724.c -S -o - -mavx -emit-llvm

define <4 x float> @add_ps_001(<4 x float> %a, <4 x float> %b) {
entry:
  %0 = shufflevector <4 x float> %a, <4 x float> %b, <2 x i32> <i32 2, i32 4>
  %1 = shufflevector <4 x float> %a, <4 x float> %b, <2 x i32> <i32 3, i32 5>
  %2 = fadd <2 x float> %0, %1
  %3 = extractelement <2 x float> %2, i32 0
  %vecinit5 = insertelement <4 x float> undef, float %3, i32 1
  %4 = extractelement <2 x float> %2, i32 1
  %vecinit9 = insertelement <4 x float> %vecinit5, float %4, i32 2
  %vecext10 = extractelement <4 x float> %b, i32 2
  %vecext11 = extractelement <4 x float> %b, i32 3
  %add12 = fadd float %vecext10, %vecext11
  %vecinit13 = insertelement <4 x float> %vecinit9, float %add12, i32 3
  ret <4 x float> %vecinit13
}

@rotateright
Contributor

There's a related theme here and in bug 34730 and bug 34716:

%ext = extractelement <2 x float> %x, i32 0
%ins = insertelement <4 x float> undef, float %ext, i32 1

-->

%shuf = shufflevector <2 x float> %x, <2 x float> undef, <4 x i32> <i32 undef, i32 0, i32 undef, i32 undef>

Can we canonicalize to shuffles more aggressively in instcombine now? If we can recognize insert_subvector, extract_subvector, identity (this is already true, I think), and single-element repositioning (is there a better name for this example?) as special-case masks, it should reduce complexity in the backend.
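For illustration only, here is hypothetical IR (value names invented for this sketch, not taken from the report) showing what each of those special-case masks looks like:

; identity (no-op mask)
%id = shufflevector <4 x float> %x, <4 x float> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>

; extract_subvector (low half of %x)
%lo = shufflevector <4 x float> %x, <4 x float> undef, <2 x i32> <i32 0, i32 1>

; insert_subvector (low half of %y placed into the low half of %x)
%ins = shufflevector <4 x float> %x, <4 x float> %y, <4 x i32> <i32 4, i32 5, i32 2, i32 3>

; single-element repositioning (element 0 moved to lane 1, as in the transform above)
%pos = shufflevector <2 x float> %w, <2 x float> undef, <4 x i32> <i32 undef, i32 0, i32 undef, i32 undef>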

@rotateright
Contributor

The IR for this example is identical to bug 34725, so if we view this as a backend problem, one of these would be a duplicate.

That was wrong - the instructions are similar, but the element indexes for the shuffles and insert/extract are not the same.

@efriedma-quic
Collaborator

We could probably make the list of shuffles instcombine recognizes a bit longer, if that helps... we just want to guard against making shuffles the backend won't recognize.

@rotateright
Contributor

rotateright commented Oct 20, 2018

Looking at this report again now, and nothing's changed since the last comments. Here's a better view of the code going into SLP:

define <4 x float> @add_ps_001(<4 x float> %a, <4 x float> %b) {
  %a0 = extractelement <4 x float> %a, i32 0
  %a1 = extractelement <4 x float> %a, i32 1
  %a2 = extractelement <4 x float> %a, i32 2
  %a3 = extractelement <4 x float> %a, i32 3

  %b0 = extractelement <4 x float> %b, i32 0
  %b1 = extractelement <4 x float> %b, i32 1
  %b2 = extractelement <4 x float> %b, i32 2
  %b3 = extractelement <4 x float> %b, i32 3

  %a23 = fadd float %a2, %a3
  %b01 = fadd float %b0, %b1
  %b23 = fadd float %b2, %b3

  %v1 = insertelement <4 x float> undef, float %a23, i32 1
  %v2 = insertelement <4 x float> %v1, float %b01, i32 2
  %v3 = insertelement <4 x float> %v2, float %b23, i32 3
  ret <4 x float> %v3
}

There are multiple potential problems contained in this example:

  1. The cost model calculation claims that the extract/insert changes + <2 x float> math op are profitable. This may need adjustment either in the generic formulation or x86 cost model values.

  2. SLP chooses to round down to vectorize to <2 x float> rather than rounding up a potential <3 x float> op to <4 x float>. This is a limitation in SLPVectorizerPass::tryToVectorizeList().

  3. SLP chooses to vectorize the more complicated pair of scalar ops rather than the simpler pair. I.e., we could vectorize this pair, with elements all from %b:
    %b01 = fadd float %b0, %b1
    %b23 = fadd float %b2, %b3

instead of this pair:
%a23 = fadd float %a2, %a3
%b01 = fadd float %b0, %b1

  4. As noted in comment 3, instcombine should be able to turn more extract+insert sequences into shuffles. We should probably adjust that independently of this bug, but I don't think it will change the codegen outcome on this example.

@rotateright
Contributor

If we are viewing this as a backend pattern matching problem, I don't know how we'd solve it after SLP partially vectorizes the IR.

Matching a mix of scalar/vector code to try to find a horizontal op is a potentially limitless exercise.

I think we either need to get SLP to stop vectorizing on this example or get it to vectorize to a <4 x float> fadd. The backend should recognize either of those possibilities and turn it into 'vhaddps' for x86 AVX.
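For reference, a minimal hand-written sketch (not output from any compiler revision) of that fully vectorized <4 x float> form: even/odd-lane shuffles feeding a single fadd. Computing the extra lane 0 is harmless because the original function leaves that lane undef, and this is the shape the x86 backend should be able to turn into vhaddps, per the comment above:

define <4 x float> @add_ps_001_ideal(<4 x float> %a, <4 x float> %b) {
  %lhs = shufflevector <4 x float> %a, <4 x float> %b, <4 x i32> <i32 0, i32 2, i32 4, i32 6> ; [a0, a2, b0, b2]
  %rhs = shufflevector <4 x float> %a, <4 x float> %b, <4 x i32> <i32 1, i32 3, i32 5, i32 7> ; [a1, a3, b1, b3]
  %hadd = fadd <4 x float> %lhs, %rhs                                                         ; [a0+a1, a2+a3, b0+b1, b2+b3]
  ret <4 x float> %hadd
}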

@RKSimon
Collaborator

RKSimon commented Oct 20, 2018

Something else that is odd is that the codegen is incredibly different depending upon which element you shuffle as undefined:

https://gcc.godbolt.org/z/e-6XUz

Many are generating the haddps and a load of extra code as well!

__m128 add_ps_u123(__m128 a, __m128 b) {
  __m128 r = (__m128){ a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
  return __builtin_shufflevector(r, a, -1, 1, 2, 3);
}

__m128 add_ps_0u22(__m128 a, __m128 b) {
  __m128 r = (__m128){ a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
  return __builtin_shufflevector(r, a, 0, -1, 2, 3);
}

__m128 add_ps_01u3(__m128 a, __m128 b) {
  __m128 r = (__m128){ a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
  return __builtin_shufflevector(r, a, 0, 1, -1, 3);
}

__m128 add_ps_012u(__m128 a, __m128 b) {
  __m128 r = (__m128){ a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
  return __builtin_shufflevector(r, a, 0, 1, 2, -1);
}

_Z11add_ps_u123Dv4_fS_:
        vpermilpd       $1, %xmm0, %xmm2 # xmm2 = xmm0[1,0]
        vinsertps       $204, %xmm0, %xmm1, %xmm0 # xmm0 = xmm0[3],xmm1[1],zero,zero
        vinsertps       $28, %xmm1, %xmm2, %xmm2 # xmm2 = xmm2[0],xmm1[0],zero,zero
        vaddps  %xmm0, %xmm2, %xmm0
        vpermilpd       $1, %xmm1, %xmm2 # xmm2 = xmm1[1,0]
        vpermilps       $231, %xmm1, %xmm1 # xmm1 = xmm1[3,1,2,3]
        vaddss  %xmm1, %xmm2, %xmm1
        vpermilps       $208, %xmm0, %xmm0 # xmm0 = xmm0[0,0,1,3]
        vinsertps       $48, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[0]
        retq
_Z11add_ps_0u22Dv4_fS_:
        vhaddps %xmm1, %xmm0, %xmm0
        retq
_Z11add_ps_01u3Dv4_fS_:
        vpermilpd       $1, %xmm1, %xmm2 # xmm2 = xmm1[1,0]
        vpermilps       $231, %xmm1, %xmm1 # xmm1 = xmm1[3,1,2,3]
        vhaddps %xmm0, %xmm0, %xmm0
        vaddss  %xmm1, %xmm2, %xmm1
        vinsertps       $48, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[0]
        retq
_Z11add_ps_012uDv4_fS_:
        vmovshdup       %xmm1, %xmm2    # xmm2 = xmm1[1,1,3,3]
        vhaddps %xmm0, %xmm0, %xmm0
        vaddss  %xmm2, %xmm1, %xmm1
        vinsertps       $32, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1],xmm1[0],xmm0[3]
        retq

@alexey-bataev
Member

SLP chooses to round down to vectorize to <2 x float> rather than rounding up a potential <3 x float> op to <4 x float>. This is a limitation in SLPVectorizerPass::tryToVectorizeList().

I have a very preliminary patch for non-pow2 support in the SLP vectorizer, which will extend <3 x float> to <4 x float>. It is too optimistic in terms of the cost of the vectorization, but generally speaking it looks good. Hope to fix all the problems very soon.
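For illustration only (a hand-written sketch, not output from that patch): the three scalar fadds bundled as a <3 x float> op and then widened back into the <4 x float> result, with lane 0 left undef, might look like this:

define <4 x float> @add_ps_001_nonpow2(<4 x float> %a, <4 x float> %b) {
  %lhs = shufflevector <4 x float> %a, <4 x float> %b, <3 x i32> <i32 2, i32 4, i32 6> ; [a2, b0, b2]
  %rhs = shufflevector <4 x float> %a, <4 x float> %b, <3 x i32> <i32 3, i32 5, i32 7> ; [a3, b1, b3]
  %hadd = fadd <3 x float> %lhs, %rhs                                                  ; [a2+a3, b0+b1, b2+b3]
  %wide = shufflevector <3 x float> %hadd, <3 x float> undef, <4 x i32> <i32 undef, i32 0, i32 1, i32 2>
  ret <4 x float> %wide
}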

@rotateright
Contributor

I have a very preliminary patch for non-pow2 support in the SLP vectorizer, which will extend <3 x float> to <4 x float>. It is too optimistic in terms of the cost of the vectorization, but generally speaking it looks good. Hope to fix all the problems very soon.

Great! If we get a <4 x float> math op from SLP, then that should be no problem for the backend to match as a horizontal op (if that's good for the target).

I'm still trying to make instcombine produce code that SLP can handle more easily here:
https://reviews.llvm.org/D53507

@RKSimon
Collaborator

RKSimon commented Mar 21, 2020

@spatel Do you think this is a candidate for the VectorCombine pass?

@rotateright
Contributor

@spatel Do you think this is a candidate for the VectorCombine pass?

We already get the extractelement transforms in -vector-combine that I would expect given the IR in comment 6:
%0 = shufflevector <4 x float> %a, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 3, i32 undef>
%1 = fadd <4 x float> %0, %a
%2 = shufflevector <4 x float> %b, <4 x float> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
%3 = fadd <4 x float> %2, %b
%4 = shufflevector <4 x float> %1, <4 x float> %3, <4 x i32> <i32 undef, i32 2, i32 4, i32 undef>
%5 = shufflevector <4 x float> %b, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 3, i32 undef>
%6 = fadd <4 x float> %5, %b
%7 = shufflevector <4 x float> %4, <4 x float> %6, <4 x i32> <i32 undef, i32 1, i32 2, i32 6>
ret <4 x float> %7


So SLP is out of the picture currently. The given IR does result in a horizontal add, but we don't have a way to combine these into a single hadd yet:

vmovshdup	%xmm0, %xmm2    ## xmm2 = xmm0[1,1,3,3]
vaddps	%xmm0, %xmm2, %xmm0
vhaddps	%xmm1, %xmm1, %xmm2
vshufps	$200, %xmm2, %xmm0, %xmm0 ## xmm0 = xmm0[0,2],xmm2[0,3]
vmovshdup	%xmm1, %xmm2    ## xmm2 = xmm1[1,1,3,3]
vaddps	%xmm1, %xmm2, %xmm1
vinsertps	$177, %xmm1, %xmm0, %xmm0 ## xmm0 = zero,xmm0[1,2],xmm1[2]

@rotateright
Contributor

A possible transform for vector-combine that will help:

%5 = shufflevector <4 x float> %b, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 3, i32 undef>
%6 = fadd <4 x float> %5, %b
%7 = shufflevector <4 x float> %4, <4 x float> %6, <4 x i32> <i32 undef, i32 1, i32 2, i32 6>

We chose to shuffle element 3 of %b over to element 2 in the 1st shuffle, and that then gets pushed back over to element 3 in the 2nd shuffle. We want to adjust the 1st shuffle so that we're calculating the binop in element 3 because that creates a select ("blend") shuffle instead:

%5 = shufflevector <4 x float> %b, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 undef, i32 2>
%6 = fadd <4 x float> %5, %b
%7 = shufflevector <4 x float> %4, <4 x float> %6, <4 x i32> <i32 undef, i32 1, i32 2, i32 7>
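For context, a select ("blend") shuffle is one in which every result lane i comes from lane i of one operand or the other, so no element changes position. A hypothetical standalone example (names invented for this sketch):

; lanes 0 and 2 from %x, lanes 1 and 3 from %y - no lane is repositioned
%blend = shufflevector <4 x float> %x, <4 x float> %y, <4 x i32> <i32 0, i32 5, i32 2, i32 7>

With AVX, a mask like that can generally lower to a single vblendps, which is why nudging the fadd into element 3 should pay off.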

@rotateright
Contributor

Small hack proposal:
https://reviews.llvm.org/D76623

@RKSimon
Collaborator

RKSimon commented Jul 14, 2020

Further backend improvements: https://reviews.llvm.org/D83789

@llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021