LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filed before November 26, 2021. Use GitHub to submit LLVM bugs.

Bug 47558 - The number of SIMD loads increases unnecessarily
Summary: The number of SIMD loads increases unnecessarily
Status: RESOLVED FIXED
Alias: None
Product: libraries
Classification: Unclassified
Component: Scalar Optimizations
Version: trunk
Hardware: PC Linux
Importance: P enhancement
Assignee: Sanjay Patel
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-09-16 23:58 PDT by Kazu Hirata
Modified: 2020-09-17 13:21 PDT
CC List: 3 users

See Also:
Fixed By Commit(s):


Description Kazu Hirata 2020-09-16 23:58:10 PDT
This patch:

https://github.com/llvm/llvm-project/commit/8fb055932c085da21f3b721995a06f42006744bd

increases the number of memory loads in certain cases.

Consider:

target triple = "x86_64-unknown-linux-gnu"

declare dso_local float* @getscaleptr() #0

define void @foo(<2 x float>* nonnull %resultptr, <2 x float>* nonnull %opptr) {
  %scaleptr = call nonnull align 16 dereferenceable(64) float* @getscaleptr()
  %op = load <2 x float>, <2 x float>* %opptr, align 4
  %scale = load float, float* %scaleptr, align 16

  %op0 = extractelement <2 x float> %op, i32 0
  %product0 = fmul float %op0, %scale
  %result0 = insertelement <2 x float> undef, float %product0, i32 0

  %op1 = extractelement <2 x float> %op, i32 1
  %product1 = fmul float %op1, %scale
  %result1 = insertelement <2 x float> %result0, float %product1, i32 1

  store <2 x float> %result1, <2 x float>* %resultptr, align 8
  ret void
}

This testcase multiplies a <2 x float> value by a scalar float value
and stores the result back to memory.

Compile like so:

$ clang -O2 -msse4.2 -S bug.ll -o bug.s
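The per-pass IR dumps quoted below can presumably be reproduced by adding -mllvm -print-after-all, e.g.:

$ clang -O2 -msse4.2 -mllvm -print-after-all -S bug.ll -o bug.s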

Comparing the output with and without the patch, I get the following assembly diff:

 	pushq	%r14
 	pushq	%rbx
 	pushq	%rax
 	movq	%rsi, %rbx
 	movq	%rdi, %r14
 	callq	getscaleptr
 	movsd	(%rbx), %xmm0                   # xmm0 = mem[0],zero
-	movss	(%rax), %xmm1                   # xmm1 = mem[0],zero,zero,zero
-	movsldup	%xmm1, %xmm1                    # xmm1 = xmm1[0,0,2,2]
+	movaps	(%rax), %xmm1
+	insertps	$16, (%rax), %xmm1              # xmm1 = xmm1[0],mem[0],xmm1[2,3]
 	mulps	%xmm0, %xmm1
 	movlps	%xmm1, (%r14)
 	addq	$8, %rsp
 	popq	%rbx
 	popq	%r14
 	retq

Note that the patch replaces the movss+movsldup sequence with movaps+insertps, where the insertps reads from the same location as the movaps, increasing the number of loads.

Here is the "IR Dump After Optimize scalar/vector ops" (the vector-combine pass).

Without the patch:

  %scaleptr = tail call nonnull align 16 dereferenceable(64) float* @getscaleptr()
  %op = load <2 x float>, <2 x float>* %opptr, align 4
  %scale = load float, float* %scaleptr, align 16
  %1 = insertelement <2 x float> undef, float %scale, i32 0
  %2 = insertelement <2 x float> %1, float %scale, i32 1
  %3 = fmul <2 x float> %op, %2
  %4 = extractelement <2 x float> %3, i32 0
  %result0 = insertelement <2 x float> undef, float %4, i32 0
  %5 = extractelement <2 x float> %3, i32 1
  %result1 = insertelement <2 x float> %result0, float %5, i32 1
  store <2 x float> %result1, <2 x float>* %resultptr, align 8
  ret void

With the patch:

  %scaleptr = tail call nonnull align 16 dereferenceable(64) float* @getscaleptr()
  %op = load <2 x float>, <2 x float>* %opptr, align 4
  %1 = bitcast float* %scaleptr to <4 x float>*
  %2 = load <4 x float>, <4 x float>* %1, align 16
  %3 = shufflevector <4 x float> %2, <4 x float> undef, <2 x i32> <i32 0, i32 1>
  %scale = load float, float* %scaleptr, align 16
  %4 = insertelement <2 x float> %3, float %scale, i32 1
  %5 = fmul <2 x float> %op, %4
  %6 = extractelement <2 x float> %5, i32 0
  %result0 = insertelement <2 x float> undef, float %6, i32 0
  %7 = extractelement <2 x float> %5, i32 1
  %result1 = insertelement <2 x float> %result0, float %7, i32 1
  store <2 x float> %result1, <2 x float>* %resultptr, align 8
  ret void

Notice the three loads with the patch, versus two without it.

Here is the final LLVM IR.

Without the patch:

  %scaleptr = tail call nonnull align 16 dereferenceable(64) float* @getscaleptr()
  %op = load <2 x float>, <2 x float>* %opptr, align 4
  %scale = load float, float* %scaleptr, align 16
  %1 = insertelement <2 x float> undef, float %scale, i32 0
  %2 = shufflevector <2 x float> %1, <2 x float> undef, <2 x i32> zeroinitializer
  %3 = fmul <2 x float> %op, %2
  store <2 x float> %3, <2 x float>* %resultptr, align 8
  ret void

With the patch:

  %scaleptr = tail call nonnull align 16 dereferenceable(64) float* @getscaleptr()
  %op = load <2 x float>, <2 x float>* %opptr, align 4
  %1 = bitcast float* %scaleptr to <4 x float>*
  %2 = load <4 x float>, <4 x float>* %1, align 16
  %3 = shufflevector <4 x float> %2, <4 x float> undef, <2 x i32> <i32 0, i32 undef>
  %scale = load float, float* %scaleptr, align 16
  %4 = insertelement <2 x float> %3, float %scale, i32 1
  %5 = fmul <2 x float> %op, %4
  store <2 x float> %5, <2 x float>* %resultptr, align 8
  ret void
Comment 1 Sanjay Patel 2020-09-17 05:25:56 PDT
Thanks for the test case. We may want to limit the transform based on uses. 

We could also view this as a failure of either instcombine or vector-combine to convert the load+insert+insert into load+splat. But I need to see what that translates to in x86 asm.
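For reference, the two shapes being compared look roughly like this (a minimal sketch with illustrative value names, not the exact IR from the dumps above):

  ; load + insert + insert of the same scalar ...
  %s = load float, float* %ptr, align 16
  %a = insertelement <2 x float> undef, float %s, i32 0
  %b = insertelement <2 x float> %a, float %s, i32 1

  ; ... versus the equivalent load + splat
  %t = load float, float* %ptr, align 16
  %c = insertelement <2 x float> undef, float %t, i32 0
  %splat = shufflevector <2 x float> %c, <2 x float> undef, <2 x i32> zeroinitializer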
Comment 2 Sanjay Patel 2020-09-17 05:27:37 PDT
I'm curious if this was noticed as a perf regression. If so, which CPU model was it tested on?
Comment 3 Kazu Hirata 2020-09-17 09:33:19 PDT
Yes, I noticed this as a performance regression on "Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz", which I believe is Broadwell.
Comment 4 Sanjay Patel 2020-09-17 10:53:56 PDT
(In reply to Sanjay Patel from comment #1)
> We could also view this as a failure of either instcombine or vector-combine
> to convert the load+insert+insert into load+splat. But I need to see what
> that translates to in x86 asm.

On 2nd look, instcombine does manage to form the splat - and that's what we are seeing in the x86 asm (movsldup). 

So we could say this is a phase ordering problem (we are running vector-combine directly after SLP, no instcombine in-between). Or we could blame SLP for creating the poor insert/extract sequences in the 1st place.

Ideally, I think we should be able to fold the load and splat together:
	movsd	(%rbx), %xmm0                   # xmm0 = mem[0],zero
	movsldup	(%rax), %xmm1                   # xmm1 = mem[0,0,2,2]
	mulps	%xmm0, %xmm1
	movlps	%xmm1, (%r14)

But that doesn't happen with the <2 x float> IR created by SLP, so that should be considered a codegen bug.

So there's plenty of blame/opportunity here, but the quick fix will be to limit vector-combine.
Comment 5 Sanjay Patel 2020-09-17 11:34:55 PDT
 https://reviews.llvm.org/rG48a23bccf373
Comment 6 Sanjay Patel 2020-09-17 11:51:45 PDT
Getting the backend to fold the load looks difficult. We don't propagate the dereferenceable number of bytes into SDAG, so we can't widen the load that late. 

I.e., vector-combine is the right place to get this into ideal form, but we need to improve the load combining there.
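For illustration, one possible "ideal form" for this test (a sketch that assumes vector-combine can use the align 16 / dereferenceable(64) facts to widen the scalar load and emit a splat shuffle; this is not IR produced by any current pass) would be:

  %scaleptr = tail call nonnull align 16 dereferenceable(64) float* @getscaleptr()
  %op = load <2 x float>, <2 x float>* %opptr, align 4
  %1 = bitcast float* %scaleptr to <4 x float>*
  %2 = load <4 x float>, <4 x float>* %1, align 16
  %3 = shufflevector <4 x float> %2, <4 x float> undef, <2 x i32> zeroinitializer
  %4 = fmul <2 x float> %op, %3
  store <2 x float> %4, <2 x float>* %resultptr, align 8
  ret void

That would give the backend a chance to fold the load and splat into the single movsldup (%rax) shown in comment 4.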

We have the regression test now and TODO comments to track this, so I'll mark this bug as fixed.
Comment 7 Kazu Hirata 2020-09-17 13:21:56 PDT
Thanks a lot, Sanjay, for the quick fix!