[X86] Failure to use HADDPS for partial register result #31780
Comments
Assigned to @anton-afanasyev
The 256-bit case gets vectorized in IR:
define void @sum_pairs_256(<8 x float> %f, float* nocapture %p) local_unnamed_addr #0 {
Does this mean we have special-case cost modeling to allow even/odd shuffles? I think the 128-bit case would need to know that <2 x float> ops are OK here, or it would have to be pattern-matched in the DAG.
define void @sum_pairs_128(<4 x float> %f, float* %p) {
That's the current IR (r343965), and I'm not sure how the backend would manage to optimize it. It's not like the cases in bug 39195, so we probably need to adjust the cost model to allow SLP to turn that into vector code.
To get the 128-bit horizontal sum, one can switch on 64-bit SLP vectorization:
$ cat t.cpp
The correct fix, though, is to change one line in the SLPVectorizerPass::vectorizeStores() function:
I will send this fix for review and report the loop-unrolling bug.
https://reviews.llvm.org/D56011
Yes, this case should be handled by the SLPVectorizer itself. Here is the patch that fixes it: https://reviews.llvm.org/D56082. It also generates better code for non-horizontal operations such as this:
void mul_pairs_128(__m128 f, float *p) {
Fixed: ca9aff9
Reopening because the change was reverted due to an LTO build failure and perf regressions:
Current codegen: https://godbolt.org/z/PXToee
We can fix this in DAG pretty trivially: https://reviews.llvm.org/D61782
Resolving: we were able to deal with this in the backend by relaxing the hasOneUse limits in lowerAddSubToHorizontalOp.
Extended Description
While the 256-bit horizontal pair sums work fine (on both btver2 and btver1), the 128-bit version fails to use HADDPS and is scalarized instead:
#include <x86intrin.h>
void sum_pairs_128(__m128 f, float *p) {
p[0] = f[0] + f[1];
p[1] = f[2] + f[3];
}
void sum_pairs_256(__m256 f, float *p) {
p[0] = f[0] + f[1];
p[1] = f[2] + f[3];
p[2] = f[4] + f[5];
p[3] = f[6] + f[7];
}
clang -O3 -march=btver2
sum_pairs_128(float __vector(4), float*):
vmovshdup %xmm0, %xmm1 # xmm1 = xmm0[1,1,3,3]
vaddss %xmm1, %xmm0, %xmm1
vmovss %xmm1, (%rdi)
vpermilpd $1, %xmm0, %xmm1 # xmm1 = xmm0[1,0]
vpermilps $231, %xmm0, %xmm0 # xmm0 = xmm0[3,1,2,3]
vaddss %xmm0, %xmm1, %xmm0
vmovss %xmm0, 4(%rdi)
retq
sum_pairs_256(float __vector(8), float*):
vextractf128 $1, %ymm0, %xmm1
vhaddps %xmm1, %xmm0, %xmm0
vmovups %xmm0, (%rdi)
retq
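For reference, here is a minimal scalar sketch of why a single horizontal add would cover the 128-bit case. It models the lane arithmetic of HADDPS in plain C; the names haddps_model and sum_pairs_128_model are hypothetical and exist only for this illustration. It is not the code the compiler emits, just the semantics one would hope to see matched:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar model of HADDPS on two 4-float vectors:
 * out = { a[0]+a[1], a[2]+a[3], b[0]+b[1], b[2]+b[3] } */
static void haddps_model(const float a[4], const float b[4], float out[4]) {
    out[0] = a[0] + a[1];
    out[1] = a[2] + a[3];
    out[2] = b[0] + b[1];
    out[3] = b[2] + b[3];
}

/* sum_pairs_128 expressed through the model: one horizontal add of f
 * with itself, then a store of only the low two lanes (64 bits). */
static void sum_pairs_128_model(const float f[4], float *p) {
    float h[4];
    assert(p != NULL);
    haddps_model(f, f, h);
    p[0] = h[0]; /* f[0] + f[1] */
    p[1] = h[1]; /* f[2] + f[3] */
}
```

Only the low half of the result is stored, so the second operand is irrelevant to the output. That is the "partial register result" in the title: the upper two lanes are computed but never used.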