
[X86][AVX] Expansion of 256 bit vector loads fails to fold into shuffles #22154

Open
RKSimon opened this issue Dec 8, 2014 · 11 comments

RKSimon commented Dec 8, 2014

Bugzilla Link 21780
Version trunk
OS All
CC @adibiagio,@rotateright

Extended Description

Follow-up to [Bug #22084] '[X86][AVX] suboptimal expansion of 256 bit vector loads.'

Merging consecutive loads into a 256-bit ymm register now works well for simple cases, and the merged loads also fold nicely into bitwise ops (as well as basic float ops such as fadd and fsub).
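
For reference, a minimal sketch of the kind of case that already works (illustrative only, not part of the original report; the function name and the exact assembly are assumptions and may differ between revisions):

#include <immintrin.h>

__m256d vadd_d4_fold(const double* ptr, __m256d bar) {
  __m256d foo = (__m256d){ ptr[0], ptr[1], ptr[2], ptr[3] };
  /* The four scalar loads merge into one 256-bit load, which is then expected
     to fold as the memory operand of the add, roughly: vaddpd (%rdi), %ymm0, %ymm0 */
  return foo + bar;
}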

The vector shuffle optimizations, however, attempt to load only the individual lanes that are actually used, and in doing so prevent the merged load from folding into the shuffle.

e.g.

__m256d vsht_d4(__m256d foo) { 
  return __builtin_shufflevector( foo, foo, 0, 0, 2, 2 ); 
} 

define <4 x double> @_Z7vsht_d4Dv4_d(<4 x double> %foo) #1 {
  %1 = shufflevector <4 x double> %foo, <4 x double> undef, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
  ret <4 x double> %1
}

vpermilpd $0, %ymm0, %ymm0 # ymm0 = ymm0[0,0,2,2] 
retq 

__m256d vsht_d4_fold(const double* ptr) {
  __m256d foo = (__m256d){ ptr[0], ptr[1], ptr[2], ptr[3] }; 
  return __builtin_shufflevector( foo, foo, 0, 0, 2, 2 ); 
}
 
define <4 x double> @_Z12vsht_d4_foldPKd(double* nocapture readonly %ptr) #0 {
  %1 = load double* %ptr, align 8, !tbaa !1
  %2 = insertelement <4 x double> undef, double %1, i32 0
  %3 = getelementptr inbounds double* %ptr, i64 2
  %4 = load double* %3, align 8, !tbaa !1
  %5 = insertelement <4 x double> %2, double %4, i32 2
  %6 = shufflevector <4 x double> %5, <4 x double> undef, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
  ret <4 x double> %6
}

vmovsd (%rdi), %xmm0 
vmovsd 16(%rdi), %xmm1 
vinsertf128 $1, %xmm1, %ymm0, %ymm0 
vpermilpd $0, %ymm0, %ymm0 # ymm0 = ymm0[0,0,2,2] 
retq 

Manually editing the IR (reinstating the loads of all four elements) does permit the fold to occur:

define <4 x double> @_Z12vsht_d4_foldPKd(double* nocapture readonly %ptr) #0 {
  %1 = load double* %ptr, align 8, !tbaa !1
  %2 = insertelement <4 x double> undef, double %1, i32 0
  %3 = getelementptr inbounds double* %ptr, i64 1
  %4 = load double* %3, align 8, !tbaa !1
  %5 = insertelement <4 x double> %2, double %4, i32 1
  %6 = getelementptr inbounds double* %ptr, i64 2
  %7 = load double* %6, align 8, !tbaa !1
  %8 = insertelement <4 x double> %5, double %7, i32 2
  %9 = getelementptr inbounds double* %ptr, i64 3
  %10 = load double* %9, align 8, !tbaa !1
  %11 = insertelement <4 x double> %8, double %10, i32 3
  %12 = shufflevector <4 x double> %11, <4 x double> undef, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
  ret <4 x double> %12
}

vpermilpd $0, (%rdi), %ymm0 # ymm0 = mem[0,0,2,2]
retq
@rotateright

As of r230021, the vpermilpd has become a vmovddup, but we're still not merging the loads:

$ ./llc 21780.ll -o - -mattr=avx
...
vmovsd (%rdi), %xmm0 ## xmm0 = mem[0],zero
vmovsd 16(%rdi), %xmm1 ## xmm1 = mem[0],zero
vinsertf128 $1, %xmm1, %ymm0, %ymm0
vmovddup %ymm0, %ymm0 ## ymm0 = ymm0[0,0,2,2]
retq

rotateright commented Mar 27, 2015

I think the problem with doing a full vector load in this case is that we'd be reading unknown bytes of memory beyond the last scalar load. In general that isn't safe: for example, if the pointer happens to refer to the last 24 mapped bytes before an unmapped page, the 32-byte load could fault even though the original scalar loads would not.

If we change the test case to load the scalar elements at both ends of the vector, then we do get the vector load optimization to kick in:

define <4 x double> @_Z12vsht_d4_foldPKd(double* nocapture readonly %ptr) #0 {
  %1 = load double, double* %ptr, align 8
  %2 = insertelement <4 x double> undef, double %1, i32 0
  %3 = getelementptr double, double* %ptr, i64 3   ; Read last vector element
  %4 = load double, double* %3, align 8
  %5 = insertelement <4 x double> %2, double %4, i32 3 ; And insert into last position
  %6 = shufflevector <4 x double> %5, <4 x double> undef, <4 x i32> <i32 0, i32 0, i32 3, i32 3>
  ret <4 x double> %6
}


$ ./llc -mattr=avx 21780.ll -o -
...
	vpermilpd	$12, (%rdi), %ymm0 ## ymm0 = mem[0,0,3,3]
	retq

rotateright commented Mar 27, 2015

Ah, now I see the source test case:

  __m256d foo = (__m256d){ ptr[0], ptr[1], ptr[2], ptr[3] }; 
  return __builtin_shufflevector( foo, foo, 0, 0, 2, 2 ); 

So I'll stick with the assessment that the backend is doing all that it can given the IR.

It looks like -instcombine is responsible for eliminating the unused loads in IR. To fix this, we'd need to teach instcombine to weigh a potential trade-off: bypass the elimination of some loads in exchange for reducing the total number of ops via a larger load.
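
Purely as an illustration of that trade-off (hypothetical pseudocode, not LLVM APIs; all names are invented), the question instcombine would have to answer before deleting the "dead" scalar loads is roughly:

// Sketch: keep otherwise-dead scalar loads only if the region they cover is
// contiguous and known dereferenceable, so they can later become one wide
// vector load feeding the existing shuffle.
bool shouldKeepLoadsForWideLoad(unsigned NumLiveScalarLoads, unsigned NumLiveInserts,
                                bool RegionContiguous, bool RegionDereferenceable) {
  unsigned ScalarCost = NumLiveScalarLoads + NumLiveInserts; // what DCE leaves today
  unsigned WideLoadCost = 1;                                 // one vector load
  return RegionContiguous && RegionDereferenceable && WideLoadCost < ScalarCost;
}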

RKSimon commented Feb 20, 2016

Current codegen:

$ ./llc 21780.ll -o  - -mattr=avx

  vmovddup	(%rdi), %xmm0   ## xmm0 = mem[0,0]
  vmovddup	16(%rdi), %xmm1 ## xmm1 = mem[0,0]
  vinsertf128	$1, %xmm1, %ymm0, %ymm0
  retq

rotateright commented Apr 1, 2016

I was going back through our splat discussions in relation to bug 27141, and came back to this bug.

Let's make this IR test case concrete:

define <4 x double> @load_four_scalars_but_use_two(double* %ptr) {
  %arrayidx0 = getelementptr inbounds double, double* %ptr, i64 0
  %arrayidx1 = getelementptr inbounds double, double* %ptr, i64 1
  %arrayidx2 = getelementptr inbounds double, double* %ptr, i64 2
  %arrayidx3 = getelementptr inbounds double, double* %ptr, i64 3

  %ld0 = load double, double* %arrayidx0
  %ld1 = load double, double* %arrayidx1
  %ld2 = load double, double* %arrayidx2
  %ld3 = load double, double* %arrayidx3

  %ins0 = insertelement <4 x double> undef, double %ld0, i32 0
  %ins1 = insertelement <4 x double> %ins0, double %ld1, i32 1
  %ins2 = insertelement <4 x double> %ins1, double %ld2, i32 2
  %ins3 = insertelement <4 x double> %ins2, double %ld3, i32 3

  %shuffle = shufflevector <4 x double> %ins3, <4 x double> undef, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
  ret <4 x double> %shuffle
}

Here's the current optimization - delete the unused scalar loads and inserts:

$ ./opt -instcombine 21780.ll -S
; ModuleID = '21780.ll'
source_filename = "21780.ll"

define <4 x double> @load_four_scalars_but_use_two(double* %ptr) {
  %arrayidx2 = getelementptr inbounds double, double* %ptr, i64 2
  %ld0 = load double, double* %ptr, align 8
  %ld2 = load double, double* %arrayidx2, align 8
  %ins0 = insertelement <4 x double> undef, double %ld0, i32 0
  %ins2 = insertelement <4 x double> %ins0, double %ld2, i32 2
  %shuffle = shufflevector <4 x double> %ins2, <4 x double> undef, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
  ret <4 x double> %shuffle
}

What we'd prefer to see for an AVX machine:

define <4 x double> @load_four_scalars_but_use_two(double* %ptr) {
  %bc = bitcast double* %ptr to <4 x double>*
  %ldvec = load <4 x double>, <4 x double>* %bc
  %shuffle = shufflevector <4 x double> %ldvec, <4 x double> undef, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
  ret <4 x double> %shuffle
}

To summarize earlier comments: after instcombine removes the loads, no other pass is allowed to recreate them. We've simply lost the information that it was safe to do those operations in the first place. This goes back to the recent llvm-dev thread on reading extra memory.

So this has to be an instcombine transform. Given that the vector IR has fewer instructions, I think that's a reasonable transform compared to what we do today. It should be a backend responsibility to split that vector load into scalar ops if that would be profitable.

rotateright commented Apr 2, 2016

Note that there is a "load-combine" pass for IR that is not on by default. It doesn't work on the example IR in comment 5 because it is currently limited to integer loads. If we change the example types to ints, it does this:

define <4 x i64> @load_four_scalars_but_use_two(i64* %ptr) {
  %1 = bitcast i64* %ptr to i256*
  %ld0.combined = load i256, i256* %1, align 8  <--- crazy wide integer type
  %combine.extract.trunc = trunc i256 %ld0.combined to i64
  %combine.extract.shift = lshr i256 %ld0.combined, 64
  %combine.extract.trunc1 = trunc i256 %combine.extract.shift to i64
  %combine.extract.shift2 = lshr i256 %ld0.combined, 128
  %combine.extract.trunc3 = trunc i256 %combine.extract.shift2 to i64
  %combine.extract.shift4 = lshr i256 %ld0.combined, 192
  %combine.extract.trunc5 = trunc i256 %combine.extract.shift4 to i64
  %ins0 = insertelement <4 x i64> undef, i64 %combine.extract.trunc, i32 0
  %ins1 = insertelement <4 x i64> %ins0, i64 %combine.extract.trunc1, i32 1
  %ins2 = insertelement <4 x i64> %ins1, i64 %combine.extract.trunc3, i32 2
  %ins3 = insertelement <4 x i64> %ins2, i64 %combine.extract.trunc5, i32 3
  ret <4 x i64> %ins3
}

Unfortunately, even if we fix that, I don't think this pass is a viable option for the transform we want, because our transform has to run before instcombine has a chance to eliminate the unused loads.

The load-combine pass does provide a model for what is required, though: alias analysis, etc. As we've seen with combining loads in the DAG, this is not an easy transform in general because of the potential for error when reordering memory operations (for example, an intervening store into the combined range would make a widened load placed at the first load's position read stale data).

@rotateright

> To summarize earlier comments: after instcombine removes the loads, no other pass is allowed to recreate them. We've simply lost the information that it was safe to do those operations in the first place. This goes back to the recent llvm-dev thread on reading extra memory.
>
> So this has to be an instcombine transform. Given that the vector IR has fewer instructions, I think that's a reasonable transform compared to what we do today. It should be a backend responsibility to split that vector load into scalar ops if that would be profitable.

Let me refine that statement: instcombine must preserve the fact that it has removed a load of memory in order for subsequent passes to act on that information.

It's probably easier to have the SLP vectorizer or some other pass act on that information because instcombine isn't currently equipped for memop combining AFAIK.

There does already appear to be a metadata type that would work for this use case:

"The optional !dereferenceable_or_null metadata must reference a single metadata name <deref_bytes_node> corresponding to a metadata node with one i64 entry. The existence of the !dereferenceable_or_null metadata on the instruction tells the optimizer that the value loaded is known to be either dereferenceable or null. The number of bytes known to be dereferenceable is specified by the integer value in the metadata node."

http://llvm.org/docs/LangRef.html#load-instruction

RKSimon commented Jun 27, 2019

Current codegen: https://godbolt.org/z/sc6te_

@rotateright

Try to preserve the dereferenceable information before instcombine kills it:
https://reviews.llvm.org/D64258

@llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 9, 2021
RKSimon commented Dec 11, 2022

All we're now missing for this to fold is for the arg to be correctly marked dereferenceable(32):

define <4 x double> @vsht_d4_fold(ptr nocapture noundef readonly %0) {
  %2 = load double, ptr %0, align 8
  %3 = insertelement <4 x double> undef, double %2, i64 0
  %4 = getelementptr inbounds double, ptr %0, i64 2
  %5 = load double, ptr %4, align 8
  %6 = insertelement <4 x double> %3, double %5, i64 2
  %7 = shufflevector <4 x double> %6, <4 x double> poison, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
  ret <4 x double> %7
}

define <4 x double> @vsht_d4_fold_deref(ptr nocapture noundef readonly dereferenceable(32) %0) {
  %2 = load double, ptr %0, align 8
  %3 = insertelement <4 x double> undef, double %2, i64 0
  %4 = getelementptr inbounds double, ptr %0, i64 2
  %5 = load double, ptr %4, align 8
  %6 = insertelement <4 x double> %3, double %5, i64 2
  %7 = shufflevector <4 x double> %6, <4 x double> poison, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
  ret <4 x double> %7
}

opt -O3

define <4 x double> @vsht_d4_fold(ptr nocapture noundef readonly %0) {
  %2 = load double, ptr %0, align 8
  %3 = insertelement <4 x double> undef, double %2, i64 0
  %4 = getelementptr inbounds double, ptr %0, i64 2
  %5 = load double, ptr %4, align 8
  %6 = insertelement <4 x double> %3, double %5, i64 2
  %7 = shufflevector <4 x double> %6, <4 x double> poison, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
  ret <4 x double> %7
}

define <4 x double> @vsht_d4_fold_deref(ptr nocapture noundef readonly dereferenceable(32) %0) {
  %2 = load <4 x double>, ptr %0, align 8
  %3 = shufflevector <4 x double> %2, <4 x double> poison, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
  ret <4 x double> %3
}
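
For context (an assumption on my part, not something verified in this thread): the frontend already marks some arguments dereferenceable(32), e.g. when the pointer is passed as a C++ reference to a complete 4-element array, so a source-level variant along these lines should reach VectorCombine with the attribute in place:

__m256d vsht_d4_fold_ref(const double (&arr)[4]) {
  __m256d foo = (__m256d){ arr[0], arr[1], arr[2], arr[3] };  // same element loads as above
  return __builtin_shufflevector(foo, foo, 0, 0, 2, 2);
}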

rotateright commented Mar 2, 2023

Right - VectorCombine will create the vector load when it knows the full 32 bytes are dereferenceable.

It's the job of the Attributor pass to add that argument attribute (and that's why I abandoned my limited patch years ago). Attributor is capable of doing it today, but it isn't on by default. The reason seems to be a huge compile-time hit:
https://llvm-compile-time-tracker.com/compare.php?from=5680b7570342c3457b80d3129fe60e53ef7ddfd5&to=df13b2a8c23f0c7c2e0ed53d9b3037251097cf01&stat=instructions:u

In case that link dies: it shows a 28%-58% compile-time regression with this patch, which partially enables the Attributor:

diff --git a/llvm/lib/Passes/PassBuilderPipelines.cpp b/llvm/lib/Passes/PassBuilderPipelines.cpp
index adb555ed21b9d..0afb720649739 100644
--- a/llvm/lib/Passes/PassBuilderPipelines.cpp
+++ b/llvm/lib/Passes/PassBuilderPipelines.cpp
@@ -260,7 +260,7 @@ static cl::opt<bool> EnableConstraintElimination(
         "Enable pass to eliminate conditions based on linear constraints"));
 
 static cl::opt<AttributorRunOption> AttributorRun(
-    "attributor-enable", cl::Hidden, cl::init(AttributorRunOption::NONE),
+    "attributor-enable", cl::Hidden, cl::init(AttributorRunOption::MODULE),
     cl::desc("Enable the attributor inter-procedural deduction pass"),
     cl::values(clEnumValN(AttributorRunOption::ALL, "all",
                           "enable all attributor runs"),

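(Untested assumption based on the flag shown in the patch above: the same behavior should be reproducible without changing the default by passing the option explicitly, e.g. $ opt -O3 -attributor-enable=all 21780.ll -S.)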