Prior to r310260, the compiler was optimizing some add/sub + shuffle patterns to horizontal add/sub instructions, but it no longer seems to do so. Consider this example:

__attribute__((noinline)) __m128 add_ps_001(__m128 a, __m128 b) {
  __m128 r = (__m128){ a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
  return __builtin_shufflevector(r, a, -1, 1, 2, 3);
}

When compiled with the options "-S -O2 -march=bdver2" using a compiler prior to r310260, the compiler would generate the following assembly:

vmovaps (%rcx), %xmm0
vhaddps (%rdx), %xmm0, %xmm0
retq

After r310260, the compiler is now generating the following code for the same function:

vmovaps (%rdx), %xmm0
vmovddup 8(%rcx), %xmm1              # xmm1 = mem[0,0]
vinsertps $28, %xmm0, %xmm1, %xmm2   # xmm2 = xmm1[0],xmm0[0],zero,zero
vinsertps $76, %xmm1, %xmm0, %xmm1   # xmm1 = xmm1[1],xmm0[1],zero,zero
vaddps %xmm1, %xmm2, %xmm1
vpermilpd $1, %xmm0, %xmm2           # xmm2 = xmm0[1,0]
vpermilps $231, %xmm0, %xmm0         # xmm0 = xmm0[3,1,2,3]
vaddss %xmm0, %xmm2, %xmm0
vpermilps $208, %xmm1, %xmm1         # xmm1 = xmm1[0,0,1,3]
vinsertps $48, %xmm0, %xmm1, %xmm0   # xmm0 = xmm1[0,1,2],xmm0[0]
retq
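For reference, here is a scalar sketch of what the old codegen computes: the 128-bit VHADDPS puts the pairwise sums of the first source in lanes 0-1 and the pairwise sums of the second source in lanes 2-3, which is exactly the vector r built in the C source (the helper name below is hypothetical):

```c
#include <assert.h>

/* Scalar model of VHADDPS (128-bit form): lanes 0-1 are the pairwise
   sums of a, lanes 2-3 are the pairwise sums of b. This is the same
   vector r that add_ps_001 constructs before its final shuffle. */
static void hadd_ps(const float a[4], const float b[4], float r[4]) {
    r[0] = a[0] + a[1];
    r[1] = a[2] + a[3];
    r[2] = b[0] + b[1];
    r[3] = b[2] + b[3];
}
```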
The IR for this example is identical to bug 34725, so if we view this as a backend problem, one of these would be a duplicate.
For reference: https://reviews.llvm.org/rL310260 (SLP)

And here's the IR that we currently (r314201) produce:

$ ./clang -O2 34724.c -S -o - -mavx -emit-llvm

define <4 x float> @add_ps_001(<4 x float> %a, <4 x float> %b) {
entry:
  %0 = shufflevector <4 x float> %a, <4 x float> %b, <2 x i32> <i32 2, i32 4>
  %1 = shufflevector <4 x float> %a, <4 x float> %b, <2 x i32> <i32 3, i32 5>
  %2 = fadd <2 x float> %0, %1
  %3 = extractelement <2 x float> %2, i32 0
  %vecinit5 = insertelement <4 x float> undef, float %3, i32 1
  %4 = extractelement <2 x float> %2, i32 1
  %vecinit9 = insertelement <4 x float> %vecinit5, float %4, i32 2
  %vecext10 = extractelement <4 x float> %b, i32 2
  %vecext11 = extractelement <4 x float> %b, i32 3
  %add12 = fadd float %vecext10, %vecext11
  %vecinit13 = insertelement <4 x float> %vecinit9, float %add12, i32 3
  ret <4 x float> %vecinit13
}
There's a related theme here and in bug 34730 and bug 34716:

  %ext = extractelement <2 x float> %x, i32 0
  %ins = insertelement <4 x float> undef, float %ext, i32 1
-->
  %shuf = shufflevector <2 x float> %x, <2 x float> undef, <4 x i32> <i32 undef, i32 0, i32 undef, i32 undef>

Can we canonicalize to shuffles more aggressively in instcombine now? If we can recognize insert_subvector, extract_subvector, identity (this is already true, I think), and single-element-repositioning (is there a better name for this example?) as special-case masks, it should reduce complexity in the backend.
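To make the equivalence above concrete, here is a small scalar simulation (the helper name is hypothetical, and -1 stands in for an undef mask element): the extract+insert pair defines only lane 1 of the result, with the value x[0], which is exactly what the shufflevector with mask <undef, 0, undef, undef> produces.

```c
#include <assert.h>

/* Scalar model of shufflevector from a 2-element source into a
   4-element result. A mask entry of -1 means the lane is undef and is
   left unwritten; entries 2-3 would read the (undef) second operand. */
static void shufflevector2to4(const float x[2], const int mask[4],
                              float out[4]) {
    for (int i = 0; i < 4; ++i)
        if (mask[i] >= 0 && mask[i] < 2)
            out[i] = x[mask[i]];
}
```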
(In reply to Sanjay Patel from comment #1)
> The IR for this example is identical to bug 34725, so if we view this as a
> backend problem, one of these would be a duplicate.

That was wrong - the instructions are similar, but the element indexes for the shuffles and insert/extract are not the same.
We could probably make the list of shuffles instcombine recognizes a bit longer, if that helps... we just want to guard against making shuffles the backend won't recognize.
Looking at this report again now, and nothing's changed since the last comments. Here's a better view of the code going into SLP:

define <4 x float> @add_ps_001(<4 x float> %a, <4 x float> %b) {
  %a0 = extractelement <4 x float> %a, i32 0
  %a1 = extractelement <4 x float> %a, i32 1
  %a2 = extractelement <4 x float> %a, i32 2
  %a3 = extractelement <4 x float> %a, i32 3

  %b0 = extractelement <4 x float> %b, i32 0
  %b1 = extractelement <4 x float> %b, i32 1
  %b2 = extractelement <4 x float> %b, i32 2
  %b3 = extractelement <4 x float> %b, i32 3

  %a23 = fadd float %a2, %a3
  %b01 = fadd float %b0, %b1
  %b23 = fadd float %b2, %b3

  %v1 = insertelement <4 x float> undef, float %a23, i32 1
  %v2 = insertelement <4 x float> %v1, float %b01, i32 2
  %v3 = insertelement <4 x float> %v2, float %b23, i32 3
  ret <4 x float> %v3
}

------------------------------------------------------------------------------

There are multiple potential problems contained in this example:

1. The cost model calculation claims that the extract/insert changes + <2 x float> math op are profitable. This may need adjustment either in the generic formulation or in the x86 cost model values.

2. SLP chooses to round down to vectorize to <2 x float> rather than rounding up a potential <3 x float> op to <4 x float>. This is a limitation in SLPVectorizerPass::tryToVectorizeList().

3. SLP chooses to vectorize the more complicated pair of scalar ops rather than the simpler pair. Ie, we could vectorize this pair with elements all from %b:
  %b01 = fadd float %b0, %b1
  %b23 = fadd float %b2, %b3
instead of this pair:
  %a23 = fadd float %a2, %a3
  %b01 = fadd float %b0, %b1

4. As noted in comment 3, instcombine should be able to turn more extract+insert sequences into shuffles. We should probably adjust that independent of this bug, but I don't think it will change the codegen outcome on this example.
If we are viewing this as a backend pattern matching problem, I don't know how we'd solve it after SLP partially vectorizes the IR. Matching a mix of scalar/vector code to try to find a horizontal op is a potentially limitless exercise. I think we either need to get SLP to stop vectorizing on this example or get it to vectorize to a <4 x float> fadd. The backend should recognize either of those possibilities and turn it into 'vhaddps' for x86 AVX.
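For completeness, the <4 x float> form the backend can match is an fadd of the even-element and odd-element shuffles of the two sources: fadd(shuffle(a,b,<0,2,4,6>), shuffle(a,b,<1,3,5,7>)) == hadd(a,b). A scalar sketch of that identity (helper name hypothetical):

```c
#include <assert.h>

/* hadd(a,b) expressed as the fadd of two shuffles over the 8-element
   concatenation ab = a ++ b: even-indexed elements plus odd-indexed
   elements, which is the shuffle+fadd shape a backend can turn into
   a single horizontal-add instruction. */
static void hadd_from_shuffles(const float a[4], const float b[4],
                               float r[4]) {
    float ab[8];
    for (int i = 0; i < 4; ++i) { ab[i] = a[i]; ab[4 + i] = b[i]; }
    const int even[4] = { 0, 2, 4, 6 };
    const int odd[4]  = { 1, 3, 5, 7 };
    for (int i = 0; i < 4; ++i)
        r[i] = ab[even[i]] + ab[odd[i]];
}
```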
Something else that is odd is that the codegen is incredibly different depending upon which element you shuffle as undefined: https://gcc.godbolt.org/z/e-6XUz

Many are generating the haddps, and a load of extra code as well!

__m128 add_ps_u123(__m128 a, __m128 b) {
  __m128 r = (__m128){ a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
  return __builtin_shufflevector(r, a, -1, 1, 2, 3);
}
__m128 add_ps_0u22(__m128 a, __m128 b) {
  __m128 r = (__m128){ a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
  return __builtin_shufflevector(r, a, 0, -1, 2, 3);
}
__m128 add_ps_01u3(__m128 a, __m128 b) {
  __m128 r = (__m128){ a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
  return __builtin_shufflevector(r, a, 0, 1, -1, 3);
}
__m128 add_ps_012u(__m128 a, __m128 b) {
  __m128 r = (__m128){ a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
  return __builtin_shufflevector(r, a, 0, 1, 2, -1);
}

_Z11add_ps_u123Dv4_fS_:                 # @_Z11add_ps_u123Dv4_fS_
vpermilpd $1, %xmm0, %xmm2           # xmm2 = xmm0[1,0]
vinsertps $204, %xmm0, %xmm1, %xmm0  # xmm0 = xmm0[3],xmm1[1],zero,zero
vinsertps $28, %xmm1, %xmm2, %xmm2   # xmm2 = xmm2[0],xmm1[0],zero,zero
vaddps %xmm0, %xmm2, %xmm0
vpermilpd $1, %xmm1, %xmm2           # xmm2 = xmm1[1,0]
vpermilps $231, %xmm1, %xmm1         # xmm1 = xmm1[3,1,2,3]
vaddss %xmm1, %xmm2, %xmm1
vpermilps $208, %xmm0, %xmm0         # xmm0 = xmm0[0,0,1,3]
vinsertps $48, %xmm1, %xmm0, %xmm0   # xmm0 = xmm0[0,1,2],xmm1[0]
retq

_Z11add_ps_0u22Dv4_fS_:                 # @_Z11add_ps_0u22Dv4_fS_
vhaddps %xmm1, %xmm0, %xmm0
retq

_Z11add_ps_01u3Dv4_fS_:                 # @_Z11add_ps_01u3Dv4_fS_
vpermilpd $1, %xmm1, %xmm2           # xmm2 = xmm1[1,0]
vpermilps $231, %xmm1, %xmm1         # xmm1 = xmm1[3,1,2,3]
vhaddps %xmm0, %xmm0, %xmm0
vaddss %xmm1, %xmm2, %xmm1
vinsertps $48, %xmm1, %xmm0, %xmm0   # xmm0 = xmm0[0,1,2],xmm1[0]
retq

_Z11add_ps_012uDv4_fS_:                 # @_Z11add_ps_012uDv4_fS_
vmovshdup %xmm1, %xmm2               # xmm2 = xmm1[1,1,3,3]
vhaddps %xmm0, %xmm0, %xmm0
vaddss %xmm2, %xmm1, %xmm1
vinsertps $32, %xmm1, %xmm0, %xmm0   # xmm0 = xmm0[0,1],xmm1[0],xmm0[3]
retq
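Note that all four variants build the same vector r and their final shuffle takes lane i of r for every defined lane i, so they differ only in which single lane is undef; ideally each would cost one vhaddps plus at most a cheap blend. A scalar restatement of that shared computation (helper name hypothetical):

```c
#include <assert.h>

/* Every add_ps_* variant above builds the same
   r = { a0+a1, a2+a3, b0+b1, b2+b3 } and then keeps lane i of r for
   every lane i its mask leaves defined, so the variants differ only in
   which one lane is undef. */
static float r_lane(const float a[4], const float b[4], int lane) {
    const float r[4] = { a[0] + a[1], a[2] + a[3],
                         b[0] + b[1], b[2] + b[3] };
    return r[lane];
}
```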
I have a very preliminary patch for non-pow2 support in the SLP vectorizer, which will extend <3 x float> to <4 x float>. It is too optimistic in terms of the cost of the vectorization, but generally speaking it looks good. I hope to fix all the problems very soon.
(In reply to Alexey Bataev from comment #9)
> I have a very preliminary patch for non-pow2 support in the SLP vectorizer,
> which will extend <3 x float> to <4 x float>. It is too optimistic in terms
> of the cost of the vectorization, but generally speaking it looks good. I
> hope to fix all the problems very soon.

Great! If we get a <4 x float> math op from SLP, then that should be no problem for the backend to match as a horizontal op (if that's good for the target).

I'm still trying to make instcombine produce code that SLP can handle more easily here:
https://reviews.llvm.org/D53507
@spatel Do you think this is a candidate for the VectorCombine pass?
(In reply to Simon Pilgrim from comment #11)
> @spatel Do you think this is a candidate for the VectorCombine pass?

We already get the extractelement transforms in -vector-combine that I would expect given the IR in comment 6:

  %0 = shufflevector <4 x float> %a, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 3, i32 undef>
  %1 = fadd <4 x float> %0, %a
  %2 = shufflevector <4 x float> %b, <4 x float> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %3 = fadd <4 x float> %2, %b
  %4 = shufflevector <4 x float> %1, <4 x float> %3, <4 x i32> <i32 undef, i32 2, i32 4, i32 undef>
  %5 = shufflevector <4 x float> %b, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 3, i32 undef>
  %6 = fadd <4 x float> %5, %b
  %7 = shufflevector <4 x float> %4, <4 x float> %6, <4 x i32> <i32 undef, i32 1, i32 2, i32 6>
  ret <4 x float> %7

------------------------------------------------------------------------------

So SLP is out of the picture currently. The given IR does result in a horizontal add, but we don't have a way to combine these into a single hadd yet:

vmovshdup %xmm0, %xmm2               ## xmm2 = xmm0[1,1,3,3]
vaddps %xmm0, %xmm2, %xmm0
vhaddps %xmm1, %xmm1, %xmm2
vshufps $200, %xmm2, %xmm0, %xmm0    ## xmm0 = xmm0[0,2],xmm2[0,3]
vmovshdup %xmm1, %xmm2               ## xmm2 = xmm1[1,1,3,3]
vaddps %xmm1, %xmm2, %xmm1
vinsertps $177, %xmm1, %xmm0, %xmm0  ## xmm0 = zero,xmm0[1,2],xmm1[2]
A possible transform for vector-combine that will help:

  %5 = shufflevector <4 x float> %b, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 3, i32 undef>
  %6 = fadd <4 x float> %5, %b
  %7 = shufflevector <4 x float> %4, <4 x float> %6, <4 x i32> <i32 undef, i32 1, i32 2, i32 6>

We chose to shuffle element 3 of %b over to element 2 in the 1st shuffle, and that then gets pushed back over to element 3 in the 2nd shuffle. We want to adjust the 1st shuffle so that we're calculating the binop in element 3, because that creates a select ("blend") shuffle instead:

  %5 = shufflevector <4 x float> %b, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 undef, i32 2>
  %6 = fadd <4 x float> %5, %b
  %7 = shufflevector <4 x float> %4, <4 x float> %6, <4 x i32> <i32 undef, i32 1, i32 2, i32 7>
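A scalar check that the two forms compute the same value in the final lane 3 (helper names are hypothetical): the original sequence computes b[3]+b[2] in lane 2 of %6 and the final mask entry 6 moves it to lane 3, while the adjusted sequence computes b[2]+b[3] directly in lane 3 and the final mask entry 7 keeps it in place, so the last shuffle selects each lane from the same position of one operand, i.e. it is a blend.

```c
#include <assert.h>

/* Original: %5 puts b[3] in lane 2, so %6 lane 2 = b[3] + b[2];
   the final shuffle's mask entry 6 moves %6 lane 2 to result lane 3. */
static float original_lane3(const float b[4]) {
    return b[3] + b[2];
}

/* Adjusted: %5 puts b[2] in lane 3, so %6 lane 3 = b[2] + b[3];
   the final shuffle's mask entry 7 keeps %6 lane 3 in result lane 3,
   making the last shuffle a per-lane select ("blend"). */
static float adjusted_lane3(const float b[4]) {
    return b[2] + b[3];
}
```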
Small hack proposal: https://reviews.llvm.org/D76623
Further backend improvements: https://reviews.llvm.org/D83789