The number of SIMD loads increases unnecessarily #46902
Comments
Assigned to @rotateright
Thanks for the test case. We may want to limit the transform based on uses. We could also view this as a failure of either instcombine or vector-combine to convert the load+insert+insert into load+splat. But I need to see what that translates to in x86 asm.
I'm curious if this was noticed as a perf regression. If so, which CPU model was it tested on?
Yes, I noticed this as a performance regression on "Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz", which I believe is Haswell.
On second look, instcombine does manage to form the splat, and that's what we are seeing in the x86 asm (movsldup). So we could say this is a phase-ordering problem (we run vector-combine directly after SLP, with no instcombine in between). Or we could blame SLP for creating the poor insert/extract sequences in the first place. Ideally, I think we should be able to fold the load and the splat together (a splat that reads directly from memory), but that doesn't happen with the <2 x float> IR created by SLP, so that should be considered a codegen bug. So there's plenty of blame/opportunity here, but the quick fix will be to limit vector-combine.
Getting the backend to fold the load looks difficult. We don't propagate the dereferenceable number of bytes into SDAG, so we can't widen the load that late. That is, vector-combine is the right place to get this into ideal form, but we need to improve the load combining there. We have the regression test now and TODO comments to track this, so I'll mark this bug as fixed.
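For reference, here is a minimal sketch of the "load + splat" form discussed above (taken from the final IR without the patch, shown later in this report); the hoped-for fold would turn this load-plus-splat sequence into a single broadcast-style load:
%scale = load float, float* %scaleptr, align 16
%1 = insertelement <2 x float> undef, float %scale, i32 0
%2 = shufflevector <2 x float> %1, <2 x float> undef, <2 x i32> zeroinitializer
%3 = fmul <2 x float> %op, %2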
Thanks a lot, Sanjay, for the quick fix!
Extended Description
This patch (commit 8fb0559) increases the number of memory loads in certain cases.
Consider:
target triple = "x86_64-unknown-linux-gnu"
declare dso_local float* @getscaleptr()
define void @foo(<2 x float>* nonnull %resultptr, <2 x float>* nonnull %opptr) {
%scaleptr = call nonnull align 16 dereferenceable(64) float* @getscaleptr()
%op = load <2 x float>, <2 x float>* %opptr, align 4
%scale = load float, float* %scaleptr, align 16
%op0 = extractelement <2 x float> %op, i32 0
%product0 = fmul float %op0, %scale
%result0 = insertelement <2 x float> undef, float %product0, i32 0
%op1 = extractelement <2 x float> %op, i32 1
%product1 = fmul float %op1, %scale
%result1 = insertelement <2 x float> %result0, float %product1, i32 1
store <2 x float> %result1, <2 x float>* %resultptr, align 8
ret void
}
This testcase multiplies a <2 x float> value by a scalar float value
and stores the result back to memory.
Compile like so:
$ clang -O2 -msse4.2 -S bug.ll -o bug.s
Then with and without the patch, I get the following assembly diff:
mulps %xmm0, %xmm1
movlps %xmm1, (%r14)
addq $8, %rsp
popq %rbx
popq %r14
retq
Note that the patch replaces the movsldup with an insertps that reads from memory at the same location as the movaps, increasing the number of loads.
Here is the "IR Dump After Optimize scalar/vector ops".
Without the patch:
%scaleptr = tail call nonnull align 16 dereferenceable(64) float* @getscaleptr()
%op = load <2 x float>, <2 x float>* %opptr, align 4
%scale = load float, float* %scaleptr, align 16
%1 = insertelement <2 x float> undef, float %scale, i32 0
%2 = insertelement <2 x float> %1, float %scale, i32 1
%3 = fmul <2 x float> %op, %2
%4 = extractelement <2 x float> %3, i32 0
%result0 = insertelement <2 x float> undef, float %4, i32 0
%5 = extractelement <2 x float> %3, i32 1
%result1 = insertelement <2 x float> %result0, float %5, i32 1
store <2 x float> %result1, <2 x float>* %resultptr, align 8
ret void
With the patch:
%scaleptr = tail call nonnull align 16 dereferenceable(64) float* @getscaleptr()
%op = load <2 x float>, <2 x float>* %opptr, align 4
%1 = bitcast float* %scaleptr to <4 x float>*
%2 = load <4 x float>, <4 x float>* %1, align 16
%3 = shufflevector <4 x float> %2, <4 x float> undef, <2 x i32> <i32 0, i32 1>
%scale = load float, float* %scaleptr, align 16
%4 = insertelement <2 x float> %3, float %scale, i32 1
%5 = fmul <2 x float> %op, %4
%6 = extractelement <2 x float> %5, i32 0
%result0 = insertelement <2 x float> undef, float %6, i32 0
%7 = extractelement <2 x float> %5, i32 1
%result1 = insertelement <2 x float> %result0, float %7, i32 1
store <2 x float> %result1, <2 x float>* %resultptr, align 8
ret void
Notice the three loads with the patch: the original <2 x float> load of %op, the widened <4 x float> load, and the leftover scalar load of %scale.
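For comparison, here is a hand-written sketch (not actual compiler output) of what improved load combining in vector-combine could produce here: the widened <4 x float> load already holds %scale in lane 0, so the leftover scalar load and insertelement could be replaced by a single splat shuffle of the wide load:
%1 = bitcast float* %scaleptr to <4 x float>*
%2 = load <4 x float>, <4 x float>* %1, align 16
%3 = shufflevector <4 x float> %2, <4 x float> undef, <2 x i32> zeroinitializer
%4 = fmul <2 x float> %op, %3
store <2 x float> %4, <2 x float>* %resultptr, align 8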
Here is the final LLVM IR.
Without the patch:
%scaleptr = tail call nonnull align 16 dereferenceable(64) float* @getscaleptr()
%op = load <2 x float>, <2 x float>* %opptr, align 4
%scale = load float, float* %scaleptr, align 16
%1 = insertelement <2 x float> undef, float %scale, i32 0
%2 = shufflevector <2 x float> %1, <2 x float> undef, <2 x i32> zeroinitializer
%3 = fmul <2 x float> %op, %2
store <2 x float> %3, <2 x float>* %resultptr, align 8
ret void
With the patch:
%scaleptr = tail call nonnull align 16 dereferenceable(64) float* @getscaleptr()
%op = load <2 x float>, <2 x float>* %opptr, align 4
%1 = bitcast float* %scaleptr to <4 x float>*
%2 = load <4 x float>, <4 x float>* %1, align 16
%3 = shufflevector <4 x float> %2, <4 x float> undef, <2 x i32> <i32 0, i32 undef>
%scale = load float, float* %scaleptr, align 16
%4 = insertelement <2 x float> %3, float %scale, i32 1
%5 = fmul <2 x float> %op, %4
store <2 x float> %5, <2 x float>* %resultptr, align 8
ret void