
[X86] Failure to use HADDPS for partial register result #31780

Closed
RKSimon opened this issue Mar 27, 2017 · 11 comments
Labels
backend:X86 bugzilla Issues migrated from bugzilla

Comments

@RKSimon
Collaborator

RKSimon commented Mar 27, 2017

Bugzilla Link: 32433
Resolution: FIXED
Resolved on: May 13, 2019 09:16
Version: trunk
OS: Windows NT
CC: @anton-afanasyev, @lesshaste, @dtemirbulatov, @rotateright
Fixed by commit(s): rL353923, rL360594, rL360596

Extended Description

While the 256-bit horizontal pair sums work fine (both on btver2 and btver1), the 128-bit version completely fails:

#include <x86intrin.h>
void sum_pairs_128(__m128 f, float *p) {
p[0] = f[0] + f[1];
p[1] = f[2] + f[3];
}
void sum_pairs_256(__m256 f, float *p) {
p[0] = f[0] + f[1];
p[1] = f[2] + f[3];
p[2] = f[4] + f[5];
p[3] = f[6] + f[7];
}

clang -O3 -march=btver2

sum_pairs_128(float __vector(4), float*):
vmovshdup %xmm0, %xmm1 # xmm1 = xmm0[1,1,3,3]
vaddss %xmm1, %xmm0, %xmm1
vmovss %xmm1, (%rdi)
vpermilpd $1, %xmm0, %xmm1 # xmm1 = xmm0[1,0]
vpermilps $231, %xmm0, %xmm0 # xmm0 = xmm0[3,1,2,3]
vaddss %xmm0, %xmm1, %xmm0
vmovss %xmm0, 4(%rdi)
retq

sum_pairs_256(float __vector(8), float*):
vextractf128 $1, %ymm0, %xmm1
vhaddps %xmm1, %xmm0, %xmm0
vmovups %xmm0, (%rdi)
retq
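
For reference (not part of the original report), a minimal hand-written workaround assuming SSE3 is available: _mm_hadd_ps maps directly to haddps, so the two pair sums land in the low two lanes and can be written with a single 64-bit store. The function name is illustrative.

// Hypothetical hand-written equivalent of sum_pairs_128 using the SSE3
// horizontal-add intrinsic. _mm_hadd_ps(f, f) = { f0+f1, f2+f3, f0+f1, f2+f3 },
// so the two sums we want are in lanes 0 and 1.
#include <x86intrin.h>
void sum_pairs_128_hadd(__m128 f, float *p) {
  __m128 sums = _mm_hadd_ps(f, f);
  _mm_storel_pi((__m64 *)p, sums); // store the low 64 bits (two floats)
}

This is roughly the codegen we would like the compiler to produce for sum_pairs_128 (vhaddps + vmovlps).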

@RKSimon
Collaborator Author

RKSimon commented Mar 27, 2017

assigned to @anton-afanasyev

@rotateright
Contributor

The 256-bit case gets vectorized in IR:

define void @sum_pairs_256(<8 x float> %f, float* nocapture %p) local_unnamed_addr #0 {
entry:
%0 = shufflevector <8 x float> %f, <8 x float> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
%1 = shufflevector <8 x float> %f, <8 x float> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
%2 = fadd <4 x float> %0, %1
%3 = bitcast float* %p to <4 x float>*
store <4 x float> %2, <4 x float>* %3, align 4, !tbaa !2
ret void
}


This means we have special-case cost modeling to allow even/odd shuffles? I think the 128-bit case would need to know that <2 x float> ops are ok in this case, or it would have to be pattern-matched in the DAG.
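
As a rough illustration (assumed, not from the report) of the even/odd-lane form that such vectorization corresponds to for the 128-bit case, expressed with intrinsics:

// Sketch of the even/odd-lane vector form of sum_pairs_128; the function
// name and shuffle choices are illustrative.
void sum_pairs_128_evenodd(__m128 f, float *p) {
  __m128 even = _mm_shuffle_ps(f, f, _MM_SHUFFLE(2, 0, 2, 0)); // { f0, f2, f0, f2 }
  __m128 odd  = _mm_shuffle_ps(f, f, _MM_SHUFFLE(3, 1, 3, 1)); // { f1, f3, f1, f3 }
  __m128 sum  = _mm_add_ps(even, odd);                         // { f0+f1, f2+f3, ... }
  _mm_storel_pi((__m64 *)p, sum);                              // store two floats
}

Only the low two lanes of the result are used, which is why the <2 x float> question above matters for the cost model.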

@rotateright
Contributor

define void @sum_pairs_128(<4 x float> %f, float* %p) {
%vecext = extractelement <4 x float> %f, i32 0
%vecext1 = extractelement <4 x float> %f, i32 1
%add = fadd float %vecext, %vecext1
store float %add, float* %p, align 4
%vecext2 = extractelement <4 x float> %f, i32 2
%vecext3 = extractelement <4 x float> %f, i32 3
%add4 = fadd float %vecext2, %vecext3
%arrayidx5 = getelementptr inbounds float, float* %p, i64 1
store float %add4, float* %arrayidx5, align 4
ret void
}


That's the IR currently (r343965), and I'm not sure how the backend would manage to optimize that. It's not like the cases in bug 39195. So we probably need to adjust the cost model to allow SLP to turn that into vector code.

@anton-afanasyev
Contributor

To get the 128-bit horizontal sum, one can turn 64-bit SLP vectorization on:

$ cat t.cpp
#include <x86intrin.h>
void sum_pairs_128(__m128 f, float *p) {
p[0] = f[0] + f[1];
p[1] = f[2] + f[3];
}
$ clang -O3 -mllvm -slp-min-reg-size=64 -march=btver2 -S -o - t.cpp
sum_pairs_128(float __vector(4), float*):
vhaddps xmm0, xmm0, xmm0
vmovlps qword ptr [rdi], xmm0
ret

Though the correct fix is to change one line in the SLPVectorizerPass::vectorizeStores() function:

for (unsigned Size = R.getMaxVecRegSize(); Size >= R.getMinVecRegSize() / 2;

but that change triggers another loop unrolling bug, unrelated to this one.

I'm going to send this fix for review and report the loop unrolling bug separately.

@rotateright
Contributor

https://reviews.llvm.org/D56011
...would produce a late haddps for half of this, but that's much too late to get the optimal code.

@anton-afanasyev
Contributor

https://reviews.llvm.org/D56011
...would produce a late haddps for half of this, but that's much too late to
get the optimal code.

Yes, this case should be processed by SLPVectorizer itself. Here is the patch which fixes this: https://reviews.llvm.org/D56082.

It also generates better code for non-horizontal operations, like this:

void mul_pairs_128(__m128 f, float *p) {
p[0] = f[0] * f[1];
p[1] = f[2] * f[3];
}
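
For comparison, a sketch (not from the patch) of the even/odd-lane vector form that SLP can produce for this non-horizontal case; it is the same pattern as the addition above, just with a vector multiply instead of vhaddps. Names are hypothetical.

// Illustrative even/odd-lane form of mul_pairs_128.
void mul_pairs_128_evenodd(__m128 f, float *p) {
  __m128 even = _mm_shuffle_ps(f, f, _MM_SHUFFLE(2, 0, 2, 0)); // { f0, f2, f0, f2 }
  __m128 odd  = _mm_shuffle_ps(f, f, _MM_SHUFFLE(3, 1, 3, 1)); // { f1, f3, f1, f3 }
  _mm_storel_pi((__m64 *)p, _mm_mul_ps(even, odd));            // { f0*f1, f2*f3 }
}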

@anton-afanasyev
Contributor

Fixed: ca9aff9

@rotateright
Contributor

Reopening because the change was reverted due to an LTO build failure and perf regressions:
https://reviews.llvm.org/rL354434

@RKSimon
Collaborator Author

RKSimon commented May 10, 2019

Current codegen: https://godbolt.org/z/PXToee

@RKSimon
Collaborator Author

RKSimon commented May 10, 2019

We can fix this in DAG pretty trivially: https://reviews.llvm.org/D61782

@RKSimon
Collaborator Author

RKSimon commented May 13, 2019

Resolving: we were able to deal with this in the backend by relaxing the hasOneUse limits in lowerAddSubToHorizontalOp.

@llvmbot transferred this issue from llvm/llvm-bugzilla-archive on Dec 10, 2021
This issue was closed.