The number of SIMD loads increases unnecessarily #46902

Closed

kazutakahirata opened this issue Sep 17, 2020 · 8 comments

Labels: bugzilla (Issues migrated from bugzilla)

@kazutakahirata
Contributor

Bugzilla Link 47558
Resolution FIXED
Resolved on Sep 17, 2020 13:21
Version trunk
OS Linux
CC @RKSimon,@rotateright

Extended Description

This patch (8fb0559) increases the number of memory loads in certain cases.

Consider:

target triple = "x86_64-unknown-linux-gnu"

declare dso_local float* @getscaleptr() #0

define void @foo(<2 x float>* nonnull %resultptr, <2 x float>* nonnull %opptr) {
%scaleptr = call nonnull align 16 dereferenceable(64) float* @getscaleptr()
%op = load <2 x float>, <2 x float>* %opptr, align 4
%scale = load float, float* %scaleptr, align 16

%op0 = extractelement <2 x float> %op, i32 0
%product0 = fmul float %op0, %scale
%result0 = insertelement <2 x float> undef, float %product0, i32 0

%op1 = extractelement <2 x float> %op, i32 1
%product1 = fmul float %op1, %scale
%result1 = insertelement <2 x float> %result0, float %product1, i32 1

store <2 x float> %result1, <2 x float>* %resultptr, align 8
ret void
}

This testcase multiplies a <2 x float> value by a scalar float value
and stores the result back to memory.

Compile like so:

$ clang -O2 -msse4.2 -S bug.ll -o bug.s

Comparing the output with and without the patch, I get the following assembly diff (lines marked - are without the patch, lines marked + are with it):

  pushq	%r14
  pushq	%rbx
  pushq	%rax
  movq	%rsi, %rbx
  movq	%rdi, %r14
  callq	getscaleptr
  movsd	(%rbx), %xmm0                   # xmm0 = mem[0],zero
- movss	(%rax), %xmm1                   # xmm1 = mem[0],zero,zero,zero
- movsldup	%xmm1, %xmm1            # xmm1 = xmm1[0,0,2,2]
+ movaps	(%rax), %xmm1
+ insertps	$16, (%rax), %xmm1      # xmm1 = xmm1[0],mem[0],xmm1[2,3]
  mulps	%xmm0, %xmm1
  movlps	%xmm1, (%r14)
  addq	$8, %rsp
  popq	%rbx
  popq	%r14
  retq

Note that the patch replaces movsldup with insertps, which reads from the same location (%rax) that the preceding movaps already loads from, increasing the number of loads.

Here is the "IR Dump After Optimize scalar/vector ops".

Without the patch:

%scaleptr = tail call nonnull align 16 dereferenceable(64) float* @getscaleptr()
%op = load <2 x float>, <2 x float>* %opptr, align 4
%scale = load float, float* %scaleptr, align 16
%1 = insertelement <2 x float> undef, float %scale, i32 0
%2 = insertelement <2 x float> %1, float %scale, i32 1
%3 = fmul <2 x float> %op, %2
%4 = extractelement <2 x float> %3, i32 0
%result0 = insertelement <2 x float> undef, float %4, i32 0
%5 = extractelement <2 x float> %3, i32 1
%result1 = insertelement <2 x float> %result0, float %5, i32 1
store <2 x float> %result1, <2 x float>* %resultptr, align 8
ret void

With the patch:

%scaleptr = tail call nonnull align 16 dereferenceable(64) float* @getscaleptr()
%op = load <2 x float>, <2 x float>* %opptr, align 4
%1 = bitcast float* %scaleptr to <4 x float>*
%2 = load <4 x float>, <4 x float>* %1, align 16
%3 = shufflevector <4 x float> %2, <4 x float> undef, <2 x i32> <i32 0, i32 1>
%scale = load float, float* %scaleptr, align 16
%4 = insertelement <2 x float> %3, float %scale, i32 1
%5 = fmul <2 x float> %op, %4
%6 = extractelement <2 x float> %5, i32 0
%result0 = insertelement <2 x float> undef, float %6, i32 0
%7 = extractelement <2 x float> %5, i32 1
%result1 = insertelement <2 x float> %result0, float %7, i32 1
store <2 x float> %result1, <2 x float>* %resultptr, align 8
ret void

Notice the three loads with the patch: besides the original <2 x float> load of %op, there is now a widened <4 x float> load of %scaleptr, and the scalar %scale load survives as well because it still feeds the insertelement at index 1.

Here is the final LLVM IR.

Without the patch:

%scaleptr = tail call nonnull align 16 dereferenceable(64) float* @getscaleptr()
%op = load <2 x float>, <2 x float>* %opptr, align 4
%scale = load float, float* %scaleptr, align 16
%1 = insertelement <2 x float> undef, float %scale, i32 0
%2 = shufflevector <2 x float> %1, <2 x float> undef, <2 x i32> zeroinitializer
%3 = fmul <2 x float> %op, %2
store <2 x float> %3, <2 x float>* %resultptr, align 8
ret void

With the patch:

%scaleptr = tail call nonnull align 16 dereferenceable(64) float* @getscaleptr()
%op = load <2 x float>, <2 x float>* %opptr, align 4
%1 = bitcast float* %scaleptr to <4 x float>*
%2 = load <4 x float>, <4 x float>* %1, align 16
%3 = shufflevector <4 x float> %2, <4 x float> undef, <2 x i32> <i32 0, i32 undef>
%scale = load float, float* %scaleptr, align 16
%4 = insertelement <2 x float> %3, float %scale, i32 1
%5 = fmul <2 x float> %op, %4
store <2 x float> %5, <2 x float>* %resultptr, align 8
ret void

@kazutakahirata
Contributor Author

assigned to @rotateright

@rotateright
Contributor

Thanks for the test case. We may want to limit the transform based on uses.

We could also view this as a failure of either instcombine or vector-combine to convert the load+insert+insert into load+splat. But I need to see what that translates to in x86 asm.
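For reference, here is a minimal sketch (an editor's illustration, not taken from the report) of the two shapes being discussed, in the same typed-pointer IR style as the testcase: the load+insert+insert form that SLP's output boils down to, and the load+splat form instcombine would ideally leave behind. Value names are illustrative.

; one scalar load of the value to broadcast
%scale = load float, float* %scaleptr, align 16

; load + insert + insert (what the SLP output above boils down to)
%v0 = insertelement <2 x float> undef, float %scale, i32 0
%v1 = insertelement <2 x float> %v0, float %scale, i32 1

; load + splat (the canonical broadcast form: one insert plus a splat shufflevector of lane 0)
%ins = insertelement <2 x float> undef, float %scale, i32 0
%splat = shufflevector <2 x float> %ins, <2 x float> undef, <2 x i32> zeroinitializer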

@rotateright
Contributor

I'm curious if this was noticed as a perf regression. If so, which CPU model was it tested on?

@kazutakahirata
Contributor Author

Yes, I noticed this as a performance regression on "Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz", which I believe is Haswell.

@rotateright
Contributor

> We could also view this as a failure of either instcombine or vector-combine to convert the load+insert+insert into load+splat. But I need to see what that translates to in x86 asm.

On second look, instcombine does manage to form the splat, and that's what we are seeing in the x86 asm (movsldup).

So we could say this is a phase-ordering problem (we run vector-combine directly after SLP, with no instcombine in between). Or we could blame SLP for creating the poor insert/extract sequences in the first place.

Ideally, I think we should be able to fold the load and splat together:
movsd (%rbx), %xmm0 # xmm0 = mem[0],zero
movsldup (%rax), %xmm1 # xmm1 = mem[0,0,2,2]
mulps %xmm0, %xmm1
movlps %xmm1, (%r14)

But that doesn't happen with the <2 x float> IR created by SLP, so that should be considered a codegen bug.

So there's plenty of blame/opportunity here, but the quick fix will be to limit vector-combine.

@rotateright
Contributor

Getting the backend to fold the load looks difficult. We don't propagate the dereferenceable number of bytes into SDAG, so we can't widen the load that late.

That is, vector-combine is the right place to get this into ideal form, but we need to improve the load combining there.
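As an illustration only (a hypothetical sketch of that ideal form, not the committed fix; value names are made up): because %scaleptr is align 16 and dereferenceable(64), vector-combine could in principle widen the scalar load and splat it in one step, leaving no separate scalar load behind:

; widened load justified by align 16 + dereferenceable(64)
%p = bitcast float* %scaleptr to <4 x float>*
%wide = load <4 x float>, <4 x float>* %p, align 16
; broadcast lane 0 into a <2 x float> splat
%splat = shufflevector <4 x float> %wide, <4 x float> undef, <2 x i32> zeroinitializer
%result = fmul <2 x float> %op, %splat
store <2 x float> %result, <2 x float>* %resultptr, align 8

With the load feeding only the splat shuffle, the backend could then fold it into the memory-operand movsldup shown earlier.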

We have the regression test now and TODO comments to track this, so I'll mark this bug as fixed.

@kazutakahirata
Contributor Author

Thanks a lot, Sanjay, for the quick fix!

llvmbot transferred this issue from llvm/llvm-bugzilla-archive on Dec 10, 2021
This issue was closed.