The number of SIMD loads increases unnecessarily #46902

Closed

kazutakahirata opened this issue Sep 17, 2020 · 8 comments

Labels: bugzilla (Issues migrated from bugzilla)

@kazutakahirata
Contributor

Bugzilla Link 47558
Resolution FIXED
Resolved on Sep 17, 2020 13:21
Version trunk
OS Linux
CC @RKSimon,@rotateright

Extended Description

This patch (8fb0559) increases the number of memory loads in certain cases.

Consider:

target triple = "x86_64-unknown-linux-gnu"

declare dso_local float* @getscaleptr() #0

define void @foo(<2 x float>* nonnull %resultptr, <2 x float>* nonnull %opptr) {
%scaleptr = call nonnull align 16 dereferenceable(64) float* @getscaleptr()
%op = load <2 x float>, <2 x float>* %opptr, align 4
%scale = load float, float* %scaleptr, align 16

%op0 = extractelement <2 x float> %op, i32 0
%product0 = fmul float %op0, %scale
%result0 = insertelement <2 x float> undef, float %product0, i32 0

%op1 = extractelement <2 x float> %op, i32 1
%product1 = fmul float %op1, %scale
%result1 = insertelement <2 x float> %result0, float %product1, i32 1

store <2 x float> %result1, <2 x float>* %resultptr, align 8
ret void
}

This testcase multiplies a <2 x float> value by a scalar float value
and stores the result back to memory.

Compile like so:

$ clang -O2 -msse4.2 -S bug.ll -o bug.s

Comparing the output with and without the patch, I get the following assembly diff (lines marked - are without the patch, lines marked + are with it):

  pushq	%r14
  pushq	%rbx
  pushq	%rax
  movq	%rsi, %rbx
  movq	%rdi, %r14
  callq	getscaleptr
  movsd	(%rbx), %xmm0                   # xmm0 = mem[0],zero
- movss	(%rax), %xmm1                   # xmm1 = mem[0],zero,zero,zero
- movsldup	%xmm1, %xmm1            # xmm1 = xmm1[0,0,2,2]
+ movaps	(%rax), %xmm1
+ insertps	$16, (%rax), %xmm1      # xmm1 = xmm1[0],mem[0],xmm1[2,3]
  mulps	%xmm0, %xmm1
  movlps	%xmm1, (%r14)
  addq	$8, %rsp
  popq	%rbx
  popq	%r14
  retq

Note that the patch replaces movsldup with insertps, which reads from the same location (%rax) that the preceding movaps already loads from, increasing the number of loads.

Here is the "IR Dump After Optimize scalar/vector ops".

Without the patch:

%scaleptr = tail call nonnull align 16 dereferenceable(64) float* @getscaleptr()
%op = load <2 x float>, <2 x float>* %opptr, align 4
%scale = load float, float* %scaleptr, align 16
%1 = insertelement <2 x float> undef, float %scale, i32 0
%2 = insertelement <2 x float> %1, float %scale, i32 1
%3 = fmul <2 x float> %op, %2
%4 = extractelement <2 x float> %3, i32 0
%result0 = insertelement <2 x float> undef, float %4, i32 0
%5 = extractelement <2 x float> %3, i32 1
%result1 = insertelement <2 x float> %result0, float %5, i32 1
store <2 x float> %result1, <2 x float>* %resultptr, align 8
ret void

With the patch:

%scaleptr = tail call nonnull align 16 dereferenceable(64) float* @getscaleptr()
%op = load <2 x float>, <2 x float>* %opptr, align 4
%1 = bitcast float* %scaleptr to <4 x float>*
%2 = load <4 x float>, <4 x float>* %1, align 16
%3 = shufflevector <4 x float> %2, <4 x float> undef, <2 x i32> <i32 0, i32 1>
%scale = load float, float* %scaleptr, align 16
%4 = insertelement <2 x float> %3, float %scale, i32 1
%5 = fmul <2 x float> %op, %4
%6 = extractelement <2 x float> %5, i32 0
%result0 = insertelement <2 x float> undef, float %6, i32 0
%7 = extractelement <2 x float> %5, i32 1
%result1 = insertelement <2 x float> %result0, float %7, i32 1
store <2 x float> %result1, <2 x float>* %resultptr, align 8
ret void

Notice the three loads with the patch: besides the original <2 x float> load of %op, there is now a widened <4 x float> load of %scaleptr, and the scalar %scale load survives as well because it still feeds the insertelement at index 1.

Here is the final LLVM IR.

Without the patch:

%scaleptr = tail call nonnull align 16 dereferenceable(64) float* @getscaleptr()
%op = load <2 x float>, <2 x float>* %opptr, align 4
%scale = load float, float* %scaleptr, align 16
%1 = insertelement <2 x float> undef, float %scale, i32 0
%2 = shufflevector <2 x float> %1, <2 x float> undef, <2 x i32> zeroinitializer
%3 = fmul <2 x float> %op, %2
store <2 x float> %3, <2 x float>* %resultptr, align 8
ret void

With the patch:

%scaleptr = tail call nonnull align 16 dereferenceable(64) float* @getscaleptr()
%op = load <2 x float>, <2 x float>* %opptr, align 4
%1 = bitcast float* %scaleptr to <4 x float>*
%2 = load <4 x float>, <4 x float>* %1, align 16
%3 = shufflevector <4 x float> %2, <4 x float> undef, <2 x i32> <i32 0, i32 undef>
%scale = load float, float* %scaleptr, align 16
%4 = insertelement <2 x float> %3, float %scale, i32 1
%5 = fmul <2 x float> %op, %4
store <2 x float> %5, <2 x float>* %resultptr, align 8
ret void

@kazutakahirata
Contributor Author

assigned to @rotateright

@rotateright
Contributor

Thanks for the test case. We may want to limit the transform based on uses.

We could also view this as a failure of either instcombine or vector-combine to convert the load+insert+insert into load+splat. But I need to see what that translates to in x86 asm.
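For reference, here is a minimal sketch (an editor's illustration, not taken from the report) of the two shapes being discussed, in the same typed-pointer IR style as the testcase: the load+insert+insert form that SLP's output boils down to, and the load+splat form instcombine would ideally leave behind. Value names are illustrative.

; one scalar load of the value to broadcast
%scale = load float, float* %scaleptr, align 16

; load + insert + insert (what the SLP output above boils down to)
%v0 = insertelement <2 x float> undef, float %scale, i32 0
%v1 = insertelement <2 x float> %v0, float %scale, i32 1

; load + splat (the canonical broadcast form: one insert plus a splat shufflevector of lane 0)
%ins = insertelement <2 x float> undef, float %scale, i32 0
%splat = shufflevector <2 x float> %ins, <2 x float> undef, <2 x i32> zeroinitializer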

@rotateright
Contributor

I'm curious if this was noticed as a perf regression. If so, which CPU model was it tested on?

@kazutakahirata
Contributor Author

Yes, I noticed this as a performance regression on "Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz", which I believe is Haswell.

@rotateright
Contributor

> We could also view this as a failure of either instcombine or vector-combine to convert the load+insert+insert into load+splat. But I need to see what that translates to in x86 asm.

On second look, instcombine does manage to form the splat, and that's what we are seeing in the x86 asm (movsldup).

So we could say this is a phase-ordering problem (we run vector-combine directly after SLP, with no instcombine in between). Or we could blame SLP for creating the poor insert/extract sequences in the first place.

Ideally, I think we should be able to fold the load and splat together:
movsd (%rbx), %xmm0 # xmm0 = mem[0],zero
movsldup (%rax), %xmm1 # xmm1 = mem[0,0,2,2]
mulps %xmm0, %xmm1
movlps %xmm1, (%r14)

But that doesn't happen with the <2 x float> IR created by SLP, so that should be considered a codegen bug.

So there's plenty of blame/opportunity here, but the quick fix will be to limit vector-combine.

@rotateright
Contributor

Getting the backend to fold the load looks difficult. We don't propagate the dereferenceable number of bytes into SDAG, so we can't widen the load that late.

That is, vector-combine is the right place to get this into ideal form, but we need to improve the load combining there.
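As an illustration only (a hypothetical sketch of that ideal form, not the committed fix; value names are made up): because %scaleptr is align 16 and dereferenceable(64), vector-combine could in principle widen the scalar load and splat it in one step, leaving no separate scalar load behind:

; widened load justified by align 16 + dereferenceable(64)
%p = bitcast float* %scaleptr to <4 x float>*
%wide = load <4 x float>, <4 x float>* %p, align 16
; broadcast lane 0 into a <2 x float> splat
%splat = shufflevector <4 x float> %wide, <4 x float> undef, <2 x i32> zeroinitializer
%result = fmul <2 x float> %op, %splat
store <2 x float> %result, <2 x float>* %resultptr, align 8

With the load feeding only the splat shuffle, the backend could then fold it into the memory-operand movsldup shown earlier.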

We have the regression test now and TODO comments to track this, so I'll mark this bug as fixed.

@kazutakahirata
Contributor Author

Thanks a lot, Sanjay, for the quick fix!

llvmbot transferred this issue from llvm/llvm-bugzilla-archive on Dec 10, 2021
This issue was closed.