Follow-up to [Bug #21710] '[X86][AVX] suboptimal expansion of 256 bit vector loads.'

Merging of consecutive loads into a 256-bit ymm register now works well for simple cases, and the loads also fold nicely for bitwise ops (as well as basic float ops - fadd, fsub, etc.). Vector shuffle optimizations, however, attempt to selectively load individual lanes and in doing so prevent the load from being folded into the shuffle. e.g.

__m256d vsht_d4(__m256d foo) {
  return __builtin_shufflevector( foo, foo, 0, 0, 2, 2 );
}

define <4 x double> @_Z7vsht_d4Dv4_d(<4 x double> %foo) #1 {
  %1 = shufflevector <4 x double> %foo, <4 x double> undef, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
  ret <4 x double> %1
}

vpermilpd $0, %ymm0, %ymm0    # ymm0 = ymm0[0,0,2,2]
retq

__m256d vsht_d4_fold(const double* ptr) {
  __m256d foo = (__m256d){ ptr[0], ptr[1], ptr[2], ptr[3] };
  return __builtin_shufflevector( foo, foo, 0, 0, 2, 2 );
}

define <4 x double> @_Z12vsht_d4_foldPKd(double* nocapture readonly %ptr) #0 {
  %1 = load double* %ptr, align 8, !tbaa !1
  %2 = insertelement <4 x double> undef, double %1, i32 0
  %3 = getelementptr inbounds double* %ptr, i64 2
  %4 = load double* %3, align 8, !tbaa !1
  %5 = insertelement <4 x double> %2, double %4, i32 2
  %6 = shufflevector <4 x double> %5, <4 x double> undef, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
  ret <4 x double> %6
}

vmovsd (%rdi), %xmm0
vmovsd 16(%rdi), %xmm1
vinsertf128 $1, %xmm1, %ymm0, %ymm0
vpermilpd $0, %ymm0, %ymm0    # ymm0 = ymm0[0,0,2,2]
retq

Manually editing the IR does permit the fold to occur:

define <4 x double> @_Z12vsht_d4_foldPKd(double* nocapture readonly %ptr) #0 {
  %1 = load double* %ptr, align 8, !tbaa !1
  %2 = insertelement <4 x double> undef, double %1, i32 0
  %3 = getelementptr inbounds double* %ptr, i64 1
  %4 = load double* %3, align 8, !tbaa !1
  %5 = insertelement <4 x double> %2, double %4, i32 1
  %6 = getelementptr inbounds double* %ptr, i64 2
  %7 = load double* %6, align 8, !tbaa !1
  %8 = insertelement <4 x double> %5, double %7, i32 2
  %9 = getelementptr inbounds double* %ptr, i64 3
  %10 = load double* %9, align 8, !tbaa !1
  %11 = insertelement <4 x double> %8, double %10, i32 3
  %12 = shufflevector <4 x double> %11, <4 x double> undef, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
  ret <4 x double> %12
}

vpermilpd $0, (%rdi), %ymm0    # ymm0 = mem[0,0,2,2]
retq
As of r230021, the vpermilpd has become a vmovddup, but we're still not merging the loads:

$ ./llc 21780.ll -o - -mattr=avx
...
vmovsd (%rdi), %xmm0      ## xmm0 = mem[0],zero
vmovsd 16(%rdi), %xmm1    ## xmm1 = mem[0],zero
vinsertf128 $1, %xmm1, %ymm0, %ymm0
vmovddup %ymm0, %ymm0     ## ymm0 = ymm0[0,0,2,2]
retq
I think the problem with doing a full vector load in this case is that we'd be reading unknown bytes of memory beyond the last scalar load. In general, this isn't safe (the extra bytes may be in unmapped memory). If we change the test case to load the scalar elements at both ends of the vector, then the vector load optimization does kick in:

define <4 x double> @_Z12vsht_d4_foldPKd(double* nocapture readonly %ptr) #0 {
  %1 = load double, double* %ptr, align 8
  %2 = insertelement <4 x double> undef, double %1, i32 0
  %3 = getelementptr double, double* %ptr, i64 3        ; Read last vector element
  %4 = load double, double* %3, align 8
  %5 = insertelement <4 x double> %2, double %4, i32 3  ; And insert into last position
  %6 = shufflevector <4 x double> %5, <4 x double> undef, <4 x i32> <i32 0, i32 0, i32 3, i32 3>
  ret <4 x double> %6
}

$ ./llc -mattr=avx 21780.ll -o -
...
vpermilpd $12, (%rdi), %ymm0    ## ymm0 = mem[0,0,3,3]
retq
Ah, now I see the source test case:

__m256d foo = (__m256d){ ptr[0], ptr[1], ptr[2], ptr[3] };
return __builtin_shufflevector( foo, foo, 0, 0, 2, 2 );

So I'll stick with the assessment that the backend is doing all that it can given the IR. It looks like -instcombine is responsible for eliminating the unused loads in IR. To fix this, we'd need to teach instcombine to weigh a potential trade-off: bypass the elimination of some loads in exchange for reducing the total number of ops via a larger load.
Current codegen:

$ ./llc 21780.ll -o - -mattr=avx
vmovddup (%rdi), %xmm0      ## xmm0 = mem[0,0]
vmovddup 16(%rdi), %xmm1    ## xmm1 = mem[0,0]
vinsertf128 $1, %xmm1, %ymm0, %ymm0
retq
I was going back through our splat discussions in relation to bug 27141, and came back to this bug. Let's make this IR test case concrete:

define <4 x double> @load_four_scalars_but_use_two(double* %ptr) {
  %arrayidx0 = getelementptr inbounds double, double* %ptr, i64 0
  %arrayidx1 = getelementptr inbounds double, double* %ptr, i64 1
  %arrayidx2 = getelementptr inbounds double, double* %ptr, i64 2
  %arrayidx3 = getelementptr inbounds double, double* %ptr, i64 3
  %ld0 = load double, double* %arrayidx0
  %ld1 = load double, double* %arrayidx1
  %ld2 = load double, double* %arrayidx2
  %ld3 = load double, double* %arrayidx3
  %ins0 = insertelement <4 x double> undef, double %ld0, i32 0
  %ins1 = insertelement <4 x double> %ins0, double %ld1, i32 1
  %ins2 = insertelement <4 x double> %ins1, double %ld2, i32 2
  %ins3 = insertelement <4 x double> %ins2, double %ld3, i32 3
  %shuffle = shufflevector <4 x double> %ins3, <4 x double> undef, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
  ret <4 x double> %shuffle
}

-----------------------------------------------------------------------------

Here's the current optimization - delete the unused scalar loads and inserts:

$ ./opt -instcombine 21780.ll -S
; ModuleID = '21780.ll'
source_filename = "21780.ll"

define <4 x double> @load_four_scalars_but_use_two(double* %ptr) {
  %arrayidx2 = getelementptr inbounds double, double* %ptr, i64 2
  %ld0 = load double, double* %ptr, align 8
  %ld2 = load double, double* %arrayidx2, align 8
  %ins0 = insertelement <4 x double> undef, double %ld0, i32 0
  %ins2 = insertelement <4 x double> %ins0, double %ld2, i32 2
  %shuffle = shufflevector <4 x double> %ins2, <4 x double> undef, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
  ret <4 x double> %shuffle
}

-----------------------------------------------------------------------------

What we'd prefer to see for an AVX machine:

define <4 x double> @load_four_scalars_but_use_two(double* %ptr) {
  %bc = bitcast double* %ptr to <4 x double>*
  %ldvec = load <4 x double>, <4 x double>* %bc
  %shuffle = shufflevector <4 x double> %ldvec, <4 x double> undef, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
  ret <4 x double> %shuffle
}

-----------------------------------------------------------------------------

To summarize earlier comments: after instcombine removes the loads, no other pass is allowed to recreate them. We've simply lost the information that it was safe to do those operations in the first place. This goes back to the recent llvm-dev thread on reading extra memory.

So this has to be an instcombine transform. Given that the vector IR has fewer instructions, I think that's a reasonable transform compared to what we do today. It should be a backend responsibility to split that vector load into scalar ops if that would be profitable.
Note that there is a "load-combine" pass for IR that is not on by default. It doesn't work on the example IR in comment 5 because it is currently limited to integer loads. If we change the example types to ints, it does this:

define <4 x i64> @load_four_scalars_but_use_two(i64* %ptr) {
  %1 = bitcast i64* %ptr to i256*
  %ld0.combined = load i256, i256* %1, align 8    <--- crazy wide integer type
  %combine.extract.trunc = trunc i256 %ld0.combined to i64
  %combine.extract.shift = lshr i256 %ld0.combined, 64
  %combine.extract.trunc1 = trunc i256 %combine.extract.shift to i64
  %combine.extract.shift2 = lshr i256 %ld0.combined, 128
  %combine.extract.trunc3 = trunc i256 %combine.extract.shift2 to i64
  %combine.extract.shift4 = lshr i256 %ld0.combined, 192
  %combine.extract.trunc5 = trunc i256 %combine.extract.shift4 to i64
  %ins0 = insertelement <4 x i64> undef, i64 %combine.extract.trunc, i32 0
  %ins1 = insertelement <4 x i64> %ins0, i64 %combine.extract.trunc1, i32 1
  %ins2 = insertelement <4 x i64> %ins1, i64 %combine.extract.trunc3, i32 2
  %ins3 = insertelement <4 x i64> %ins2, i64 %combine.extract.trunc5, i32 3
  ret <4 x i64> %ins3
}

Unfortunately, even if we fix that, I don't think this pass is a viable option for the transform we want, because we'd have to run it before instcombine has a chance to eliminate the unused loads. The load-combine pass does provide a model for what is required, though: alias analysis, etc. As we've seen with combining loads in the DAG, this isn't an easy transform in general because of all the potential for error when reordering memory operations.
(In reply to comment #5)
> To summarize earlier comments: after instcombine removes the loads, no other
> pass is allowed to recreate them. We've simply lost the information that it
> was safe to do those operations in the first place. This goes back to the
> recent llvm-dev thread on reading extra memory.
>
> So this has to be an instcombine transform. Given that the vector IR has
> less instructions, I think that's a reasonable transform compared to what we
> do today. It should be a backend responsibility to split that vector load
> into scalar ops if that would be profitable.

Let me refine that statement: instcombine must preserve the fact that it has removed a load of memory in order for subsequent passes to act on that information. It's probably easier to have the SLP vectorizer or some other pass act on that information because instcombine isn't currently equipped for memop combining AFAIK.

There does already appear to be a metadata type that would work for this use case:

"The optional !dereferenceable_or_null metadata must reference a single metadata name <deref_bytes_node> corresponding to a metadata node with one i64 entry. The existence of the !dereferenceable_or_null metadata on the instruction tells the optimizer that the value loaded is known to be either dereferenceable or null. The number of bytes known to be dereferenceable is specified by the integer value in the metadata node."

http://llvm.org/docs/LangRef.html#load-instruction
Current codegen: https://godbolt.org/z/sc6te_
Try to preserve the dereferenceable information before instcombine kills it: https://reviews.llvm.org/D64258