
[X86] Failure to use HADDPS for partial register result #31780

Closed
RKSimon opened this issue Mar 27, 2017 · 11 comments
Labels
backend:X86 bugzilla Issues migrated from bugzilla

Comments

@RKSimon
Collaborator

RKSimon commented Mar 27, 2017

Bugzilla Link: 32433
Resolution: FIXED
Resolved on: May 13, 2019 09:16
Version: trunk
OS: Windows NT
CC: @anton-afanasyev, @lesshaste, @dtemirbulatov, @rotateright
Fixed by commit(s): rL353923, rL360594, rL360596

Extended Description

While the 256-bit horizontal pair sums work fine (both on btver2 and btver1), the 128-bit version completely fails:

#include <x86intrin.h>
void sum_pairs_128(__m128 f, float *p) {
p[0] = f[0] + f[1];
p[1] = f[2] + f[3];
}
void sum_pairs_256(__m256 f, float *p) {
p[0] = f[0] + f[1];
p[1] = f[2] + f[3];
p[2] = f[4] + f[5];
p[3] = f[6] + f[7];
}

clang -O3 -march=btver2

sum_pairs_128(float __vector(4), float*):
vmovshdup %xmm0, %xmm1 # xmm1 = xmm0[1,1,3,3]
vaddss %xmm1, %xmm0, %xmm1
vmovss %xmm1, (%rdi)
vpermilpd $1, %xmm0, %xmm1 # xmm1 = xmm0[1,0]
vpermilps $231, %xmm0, %xmm0 # xmm0 = xmm0[3,1,2,3]
vaddss %xmm0, %xmm1, %xmm0
vmovss %xmm0, 4(%rdi)
retq

sum_pairs_256(float __vector(8), float*):
vextractf128 $1, %ymm0, %xmm1
vhaddps %xmm1, %xmm0, %xmm0
vmovups %xmm0, (%rdi)
retq
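
For reference (not part of the original report), a minimal hand-written workaround assuming SSE3 is available: _mm_hadd_ps maps directly to haddps, so the two pair sums land in the low two lanes and can be written with a single 64-bit store. The function name is illustrative.

// Hypothetical hand-written equivalent of sum_pairs_128 using the SSE3
// horizontal-add intrinsic. _mm_hadd_ps(f, f) = { f0+f1, f2+f3, f0+f1, f2+f3 },
// so the two sums we want are in lanes 0 and 1.
#include <x86intrin.h>
void sum_pairs_128_hadd(__m128 f, float *p) {
  __m128 sums = _mm_hadd_ps(f, f);
  _mm_storel_pi((__m64 *)p, sums); // store the low 64 bits (two floats)
}

This is roughly the codegen we would like the compiler to produce for sum_pairs_128 (vhaddps + vmovlps).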

@RKSimon
Collaborator Author

RKSimon commented Mar 27, 2017

assigned to @anton-afanasyev

@rotateright
Contributor

The 256-bit case gets vectorized in IR:

define void @sum_pairs_256(<8 x float> %f, float* nocapture %p) local_unnamed_addr #0 {
entry:
%0 = shufflevector <8 x float> %f, <8 x float> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
%1 = shufflevector <8 x float> %f, <8 x float> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
%2 = fadd <4 x float> %0, %1
%3 = bitcast float* %p to <4 x float>*
store <4 x float> %2, <4 x float>* %3, align 4, !tbaa !2
ret void
}


This means we have special-case cost modeling to allow even/odd shuffles? I think the 128-bit case would need to know that <2 x float> ops are ok in this case, or it would have to be pattern-matched in the DAG.
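
As a rough illustration (assumed, not from the report) of the even/odd-lane form that such vectorization corresponds to for the 128-bit case, expressed with intrinsics:

// Sketch of the even/odd-lane vector form of sum_pairs_128; the function
// name and shuffle choices are illustrative.
void sum_pairs_128_evenodd(__m128 f, float *p) {
  __m128 even = _mm_shuffle_ps(f, f, _MM_SHUFFLE(2, 0, 2, 0)); // { f0, f2, f0, f2 }
  __m128 odd  = _mm_shuffle_ps(f, f, _MM_SHUFFLE(3, 1, 3, 1)); // { f1, f3, f1, f3 }
  __m128 sum  = _mm_add_ps(even, odd);                         // { f0+f1, f2+f3, ... }
  _mm_storel_pi((__m64 *)p, sum);                              // store two floats
}

Only the low two lanes of the result are used, which is why the <2 x float> question above matters for the cost model.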

@rotateright
Contributor

define void @sum_pairs_128(<4 x float> %f, float* %p) {
%vecext = extractelement <4 x float> %f, i32 0
%vecext1 = extractelement <4 x float> %f, i32 1
%add = fadd float %vecext, %vecext1
store float %add, float* %p, align 4
%vecext2 = extractelement <4 x float> %f, i32 2
%vecext3 = extractelement <4 x float> %f, i32 3
%add4 = fadd float %vecext2, %vecext3
%arrayidx5 = getelementptr inbounds float, float* %p, i64 1
store float %add4, float* %arrayidx5, align 4
ret void
}


That's the IR currently (r343965), and I'm not sure how the backend would manage to optimize that. It's not like the cases in bug 39195. So we probably need to adjust the cost model to allow SLP to turn that into vector code.

@anton-afanasyev
Contributor

To get the 128-bit horizontal sum, one can turn 64-bit SLP vectorization on:

$ cat t.cpp
#include <x86intrin.h>
void sum_pairs_128(__m128 f, float *p) {
p[0] = f[0] + f[1];
p[1] = f[2] + f[3];
}
$ clang -O3 -mllvm -slp-min-reg-size=64 -march=btver2 -S -o - t.cpp
sum_pairs_128(float __vector(4), float*):
vhaddps xmm0, xmm0, xmm0
vmovlps qword ptr [rdi], xmm0
ret

Though the correct fix is to change one line in the SLPVectorizerPass::vectorizeStores() function:

for (unsigned Size = R.getMaxVecRegSize(); Size >= R.getMinVecRegSize() / 2;

but that change triggers another loop unrolling bug, unrelated to this one.

I'm going to send this fix for review and report the loop unrolling bug separately.

@rotateright
Contributor

https://reviews.llvm.org/D56011
...would produce a late haddps for half of this, but that's much too late to get the optimal code.

@anton-afanasyev
Contributor

https://reviews.llvm.org/D56011
...would produce a late haddps for half of this, but that's much too late to
get the optimal code.

Yes, this case should be processed by SLPVectorizer itself. Here is the patch which fixes this: https://reviews.llvm.org/D56082.

It also generates better code for non-horizontal operations, like this:

void mul_pairs_128(__m128 f, float *p) {
p[0] = f[0] * f[1];
p[1] = f[2] * f[3];
}
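
For comparison, a sketch (not from the patch) of the even/odd-lane vector form that SLP can produce for this non-horizontal case; it is the same pattern as the addition above, just with a vector multiply instead of vhaddps. Names are hypothetical.

// Illustrative even/odd-lane form of mul_pairs_128.
void mul_pairs_128_evenodd(__m128 f, float *p) {
  __m128 even = _mm_shuffle_ps(f, f, _MM_SHUFFLE(2, 0, 2, 0)); // { f0, f2, f0, f2 }
  __m128 odd  = _mm_shuffle_ps(f, f, _MM_SHUFFLE(3, 1, 3, 1)); // { f1, f3, f1, f3 }
  _mm_storel_pi((__m64 *)p, _mm_mul_ps(even, odd));            // { f0*f1, f2*f3 }
}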

@anton-afanasyev
Contributor

Fixed: ca9aff9

@rotateright
Contributor

Reopening because the change was reverted due to an LTO build failure and perf regressions:
https://reviews.llvm.org/rL354434

@RKSimon
Collaborator Author

RKSimon commented May 10, 2019

Current codegen: https://godbolt.org/z/PXToee

@RKSimon
Collaborator Author

RKSimon commented May 10, 2019

We can fix this in DAG pretty trivially: https://reviews.llvm.org/D61782

@RKSimon
Collaborator Author

RKSimon commented May 13, 2019

Resolving: we were able to deal with this in the backend by relaxing the hasOneUse limits in lowerAddSubToHorizontalOp.

@llvmbot transferred this issue from llvm/llvm-bugzilla-archive on Dec 10, 2021
This issue was closed.