Compiler no longer optimizing some horizontal add/sub patterns after r310260 #34072

Open
dyung opened this issue Sep 26, 2017 · 15 comments
Labels
backend:X86, bugzilla

Comments

@dyung
Collaborator

dyung commented Sep 26, 2017

Bugzilla Link 34724
Version trunk
OS Linux
CC @alexey-bataev,@efriedma-quic,@hfinkel,@RKSimon,@rotateright,@ZviRackover

Extended Description

Prior to r310260, the compiler was optimizing some add/sub + shuffle patterns to horizontal add/sub instructions, but no longer seems to do so.

Consider this example:

__attribute__((noinline))
__m128 add_ps_001(__m128 a, __m128 b) {
  __m128 r = (__m128){ a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
  return __builtin_shufflevector(r, a, -1, 1, 2, 3);
}

When compiled with the options "-S -O2 -march=bdver2" using a compiler built prior to r310260, the compiler would generate the following assembly:

vmovaps (%rcx), %xmm0
vhaddps (%rdx), %xmm0, %xmm0
retq

After r310260, the compiler is now generating the following code for the same function:

vmovaps (%rdx), %xmm0
vmovddup        8(%rcx), %xmm1  # xmm1 = mem[0,0]
vinsertps       $28, %xmm0, %xmm1, %xmm2 # xmm2 = xmm1[0],xmm0[0],zero,zero
vinsertps       $76, %xmm1, %xmm0, %xmm1 # xmm1 = xmm1[1],xmm0[1],zero,zero
vaddps  %xmm1, %xmm2, %xmm1
vpermilpd       $1, %xmm0, %xmm2 # xmm2 = xmm0[1,0]
vpermilps       $231, %xmm0, %xmm0 # xmm0 = xmm0[3,1,2,3]
vaddss  %xmm0, %xmm2, %xmm0
vpermilps       $208, %xmm1, %xmm1 # xmm1 = xmm1[0,0,1,3]
vinsertps       $48, %xmm0, %xmm1, %xmm0 # xmm0 = xmm1[0,1,2],xmm0[0]
retq
@rotateright
Contributor

The IR for this example is identical to bug 34725, so if we view this as a backend problem, one of these would be a duplicate.

@rotateright
Contributor

rotateright commented Sep 26, 2017

For reference: https://reviews.llvm.org/rL310260 (SLP)

And here's the IR that we currently (r314201) produce:

$ ./clang -O2 34724.c -S -o - -mavx -emit-llvm

define <4 x float> @add_ps_001(<4 x float> %a, <4 x float> %b) {
entry:
  %0 = shufflevector <4 x float> %a, <4 x float> %b, <2 x i32> <i32 2, i32 4>
  %1 = shufflevector <4 x float> %a, <4 x float> %b, <2 x i32> <i32 3, i32 5>
  %2 = fadd <2 x float> %0, %1
  %3 = extractelement <2 x float> %2, i32 0
  %vecinit5 = insertelement <4 x float> undef, float %3, i32 1
  %4 = extractelement <2 x float> %2, i32 1
  %vecinit9 = insertelement <4 x float> %vecinit5, float %4, i32 2
  %vecext10 = extractelement <4 x float> %b, i32 2
  %vecext11 = extractelement <4 x float> %b, i32 3
  %add12 = fadd float %vecext10, %vecext11
  %vecinit13 = insertelement <4 x float> %vecinit9, float %add12, i32 3
  ret <4 x float> %vecinit13
}

@rotateright
Contributor

There's a related theme here and in bug 34730 and bug 34716:

%ext = extractelement <2 x float> %x, i32 0
%ins = insertelement <4 x float> undef, float %ext, i32 1

-->

%shuf = shufflevector <2 x float> %x, <2 x float> undef, <4 x i32> <i32 undef, i32 0, i32 undef, i32 undef>

Can we canonicalize to shuffles more aggressively in instcombine now? If we can recognize insert_subvector, extract_subvector, identity (this is already true, I think), and single-element repositioning (is there a better name for this example?) as special-case masks, it should reduce complexity in the backend.
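For illustration only, here is hypothetical IR (value names invented for this sketch, not taken from the report) showing what each of those special-case masks looks like:

; identity (no-op mask)
%id = shufflevector <4 x float> %x, <4 x float> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>

; extract_subvector (low half of %x)
%lo = shufflevector <4 x float> %x, <4 x float> undef, <2 x i32> <i32 0, i32 1>

; insert_subvector (low half of %y placed into the low half of %x)
%ins = shufflevector <4 x float> %x, <4 x float> %y, <4 x i32> <i32 4, i32 5, i32 2, i32 3>

; single-element repositioning (element 0 moved to lane 1, as in the transform above)
%pos = shufflevector <2 x float> %w, <2 x float> undef, <4 x i32> <i32 undef, i32 0, i32 undef, i32 undef>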

@rotateright
Contributor

The IR for this example is identical to bug 34725, so if we view this as a backend problem, one of these would be a duplicate.

That was wrong - the instructions are similar, but the element indexes for the shuffles and insert/extract are not the same.

@efriedma-quic
Collaborator

We could probably make the list of shuffles instcombine recognizes a bit longer, if that helps... we just want to guard against making shuffles the backend won't recognize.

@rotateright
Contributor

rotateright commented Oct 20, 2018

Looking at this report again now, and nothing's changed since the last comments. Here's a better view of the code going into SLP:

define <4 x float> @add_ps_001(<4 x float> %a, <4 x float> %b) {
  %a0 = extractelement <4 x float> %a, i32 0
  %a1 = extractelement <4 x float> %a, i32 1
  %a2 = extractelement <4 x float> %a, i32 2
  %a3 = extractelement <4 x float> %a, i32 3

  %b0 = extractelement <4 x float> %b, i32 0
  %b1 = extractelement <4 x float> %b, i32 1
  %b2 = extractelement <4 x float> %b, i32 2
  %b3 = extractelement <4 x float> %b, i32 3

  %a23 = fadd float %a2, %a3
  %b01 = fadd float %b0, %b1
  %b23 = fadd float %b2, %b3

  %v1 = insertelement <4 x float> undef, float %a23, i32 1
  %v2 = insertelement <4 x float> %v1, float %b01, i32 2
  %v3 = insertelement <4 x float> %v2, float %b23, i32 3
  ret <4 x float> %v3
}

There are multiple potential problems contained in this example:

  1. The cost model calculation claims that the extract/insert changes + <2 x float> math op are profitable. This may need adjustment either in the generic formulation or x86 cost model values.

  2. SLP chooses to round down to vectorize to <2 x float> rather than rounding up a potential <3 x float> op to <4 x float>. This is a limitation in SLPVectorizerPass::tryToVectorizeList().

  3. SLP chooses to vectorize the more complicated pair of scalar ops rather than the simpler pair. I.e., we could vectorize this pair, with elements all from %b:
    %b01 = fadd float %b0, %b1
    %b23 = fadd float %b2, %b3

instead of this pair:
%a23 = fadd float %a2, %a3
%b01 = fadd float %b0, %b1

  4. As noted in comment 3, instcombine should be able to turn more extract+insert sequences into shuffles. We should probably adjust that independently of this bug, but I don't think it will change the codegen outcome on this example.

@rotateright
Contributor

If we are viewing this as a backend pattern matching problem, I don't know how we'd solve it after SLP partially vectorizes the IR.

Matching a mix of scalar/vector code to try to find a horizontal op is a potentially limitless exercise.

I think we either need to get SLP to stop vectorizing on this example or get it to vectorize to a <4 x float> fadd. The backend should recognize either of those possibilities and turn it into 'vhaddps' for x86 AVX.
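For reference, a minimal hand-written sketch (not output from any compiler revision) of that fully vectorized <4 x float> form: even/odd-lane shuffles feeding a single fadd. Computing the extra lane 0 is harmless because the original function leaves that lane undef, and this is the shape the x86 backend should be able to turn into vhaddps, per the comment above:

define <4 x float> @add_ps_001_ideal(<4 x float> %a, <4 x float> %b) {
  %lhs = shufflevector <4 x float> %a, <4 x float> %b, <4 x i32> <i32 0, i32 2, i32 4, i32 6> ; [a0, a2, b0, b2]
  %rhs = shufflevector <4 x float> %a, <4 x float> %b, <4 x i32> <i32 1, i32 3, i32 5, i32 7> ; [a1, a3, b1, b3]
  %hadd = fadd <4 x float> %lhs, %rhs                                                         ; [a0+a1, a2+a3, b0+b1, b2+b3]
  ret <4 x float> %hadd
}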

@RKSimon
Collaborator

RKSimon commented Oct 20, 2018

Something else that is odd is that the codegen is incredibly different depending upon which element you shuffle as undefined:

https://gcc.godbolt.org/z/e-6XUz

Many are generating the haddps and a load of extra code as well!

__m128 add_ps_u123(__m128 a, __m128 b) {
  __m128 r = (__m128){ a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
  return __builtin_shufflevector(r, a, -1, 1, 2, 3);
}

__m128 add_ps_0u22(__m128 a, __m128 b) {
  __m128 r = (__m128){ a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
  return __builtin_shufflevector(r, a, 0, -1, 2, 3);
}

__m128 add_ps_01u3(__m128 a, __m128 b) {
  __m128 r = (__m128){ a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
  return __builtin_shufflevector(r, a, 0, 1, -1, 3);
}

__m128 add_ps_012u(__m128 a, __m128 b) {
  __m128 r = (__m128){ a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
  return __builtin_shufflevector(r, a, 0, 1, 2, -1);
}

_Z11add_ps_u123Dv4_fS_:
        vpermilpd       $1, %xmm0, %xmm2 # xmm2 = xmm0[1,0]
        vinsertps       $204, %xmm0, %xmm1, %xmm0 # xmm0 = xmm0[3],xmm1[1],zero,zero
        vinsertps       $28, %xmm1, %xmm2, %xmm2 # xmm2 = xmm2[0],xmm1[0],zero,zero
        vaddps  %xmm0, %xmm2, %xmm0
        vpermilpd       $1, %xmm1, %xmm2 # xmm2 = xmm1[1,0]
        vpermilps       $231, %xmm1, %xmm1 # xmm1 = xmm1[3,1,2,3]
        vaddss  %xmm1, %xmm2, %xmm1
        vpermilps       $208, %xmm0, %xmm0 # xmm0 = xmm0[0,0,1,3]
        vinsertps       $48, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[0]
        retq
_Z11add_ps_0u22Dv4_fS_:
        vhaddps %xmm1, %xmm0, %xmm0
        retq
_Z11add_ps_01u3Dv4_fS_:
        vpermilpd       $1, %xmm1, %xmm2 # xmm2 = xmm1[1,0]
        vpermilps       $231, %xmm1, %xmm1 # xmm1 = xmm1[3,1,2,3]
        vhaddps %xmm0, %xmm0, %xmm0
        vaddss  %xmm1, %xmm2, %xmm1
        vinsertps       $48, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[0]
        retq
_Z11add_ps_012uDv4_fS_:
        vmovshdup       %xmm1, %xmm2    # xmm2 = xmm1[1,1,3,3]
        vhaddps %xmm0, %xmm0, %xmm0
        vaddss  %xmm2, %xmm1, %xmm1
        vinsertps       $32, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1],xmm1[0],xmm0[3]
        retq

@alexey-bataev
Member

SLP chooses to round down to vectorize to <2 x float> rather than rounding up a potential <3 x float> op to <4 x float>. This is a limitation in SLPVectorizerPass::tryToVectorizeList().

I have a very preliminary patch for non-pow2 support in the SLP vectorizer, which will extend <3 x float> to <4 x float>. It is too optimistic in terms of the cost of the vectorization, but generally speaking it looks good. Hope to fix all the problems very soon.
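For illustration only (a hand-written sketch, not output from that patch): the three scalar fadds bundled as a <3 x float> op and then widened back into the <4 x float> result, with lane 0 left undef, might look like this:

define <4 x float> @add_ps_001_nonpow2(<4 x float> %a, <4 x float> %b) {
  %lhs = shufflevector <4 x float> %a, <4 x float> %b, <3 x i32> <i32 2, i32 4, i32 6> ; [a2, b0, b2]
  %rhs = shufflevector <4 x float> %a, <4 x float> %b, <3 x i32> <i32 3, i32 5, i32 7> ; [a3, b1, b3]
  %hadd = fadd <3 x float> %lhs, %rhs                                                  ; [a2+a3, b0+b1, b2+b3]
  %wide = shufflevector <3 x float> %hadd, <3 x float> undef, <4 x i32> <i32 undef, i32 0, i32 1, i32 2>
  ret <4 x float> %wide
}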

@rotateright
Contributor

I have a very preliminary patch for non-pow2 support in the SLP vectorizer, which will extend <3 x float> to <4 x float>. It is too optimistic in terms of the cost of the vectorization, but generally speaking it looks good. Hope to fix all the problems very soon.

Great! If we get a <4 x float> math op from SLP, then that should be no problem for the backend to match as a horizontal op (if that's good for the target).

I'm still trying to make instcombine produce code that SLP can handle more easily here:
https://reviews.llvm.org/D53507

@RKSimon
Collaborator

RKSimon commented Mar 21, 2020

@spatel Do you think this is a candidate for the VectorCombine pass?

@rotateright
Contributor

@spatel Do you think this is a candidate for the VectorCombine pass?

We already get the extractelement transforms in -vector-combine that I would expect given the IR in comment 6:
%0 = shufflevector <4 x float> %a, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 3, i32 undef>
%1 = fadd <4 x float> %0, %a
%2 = shufflevector <4 x float> %b, <4 x float> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
%3 = fadd <4 x float> %2, %b
%4 = shufflevector <4 x float> %1, <4 x float> %3, <4 x i32> <i32 undef, i32 2, i32 4, i32 undef>
%5 = shufflevector <4 x float> %b, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 3, i32 undef>
%6 = fadd <4 x float> %5, %b
%7 = shufflevector <4 x float> %4, <4 x float> %6, <4 x i32> <i32 undef, i32 1, i32 2, i32 6>
ret <4 x float> %7


So SLP is out of the picture currently. The given IR does result in a horizontal add, but we don't have a way to combine these into a single hadd yet:

vmovshdup	%xmm0, %xmm2    ## xmm2 = xmm0[1,1,3,3]
vaddps	%xmm0, %xmm2, %xmm0
vhaddps	%xmm1, %xmm1, %xmm2
vshufps	$200, %xmm2, %xmm0, %xmm0 ## xmm0 = xmm0[0,2],xmm2[0,3]
vmovshdup	%xmm1, %xmm2    ## xmm2 = xmm1[1,1,3,3]
vaddps	%xmm1, %xmm2, %xmm1
vinsertps	$177, %xmm1, %xmm0, %xmm0 ## xmm0 = zero,xmm0[1,2],xmm1[2]

@rotateright
Contributor

A possible transform for vector-combine that will help:

%5 = shufflevector <4 x float> %b, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 3, i32 undef>
%6 = fadd <4 x float> %5, %b
%7 = shufflevector <4 x float> %4, <4 x float> %6, <4 x i32> <i32 undef, i32 1, i32 2, i32 6>

We chose to shuffle element 3 of %b over to element 2 in the 1st shuffle, and that then gets pushed back over to element 3 in the 2nd shuffle. We want to adjust the 1st shuffle so that we're calculating the binop in element 3 because that creates a select ("blend") shuffle instead:

%5 = shufflevector <4 x float> %b, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 undef, i32 2>
%6 = fadd <4 x float> %5, %b
%7 = shufflevector <4 x float> %4, <4 x float> %6, <4 x i32> <i32 undef, i32 1, i32 2, i32 7>
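For context, a select ("blend") shuffle is one in which every result lane i comes from lane i of one operand or the other, so no element changes position. A hypothetical standalone example (names invented for this sketch):

; lanes 0 and 2 from %x, lanes 1 and 3 from %y - no lane is repositioned
%blend = shufflevector <4 x float> %x, <4 x float> %y, <4 x i32> <i32 0, i32 5, i32 2, i32 7>

With AVX, a mask like that can generally lower to a single vblendps, which is why nudging the fadd into element 3 should pay off.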

@rotateright
Contributor

Small hack proposal:
https://reviews.llvm.org/D76623

@RKSimon
Collaborator

RKSimon commented Jul 14, 2020

Further backend improvements: https://reviews.llvm.org/D83789

@llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021