
[X86][AVX] Expansion of 256 bit vector loads fails to fold into shuffles #22154

Open
RKSimon opened this issue Dec 8, 2014 · 11 comments

RKSimon commented Dec 8, 2014

Bugzilla Link 21780
Version trunk
OS All
CC @adibiagio,@rotateright

Extended Description

Follow-up to [Bug #22084] '[X86][AVX] suboptimal expansion of 256 bit vector loads.'

Merging consecutive loads into a 256-bit ymm register now works well for simple cases, and the merged loads also fold nicely into bitwise ops (as well as basic float ops such as fadd and fsub).
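
For reference, a minimal sketch of the kind of case that already works (illustrative only, not part of the original report; the function name and the exact assembly are assumptions and may differ between revisions):

#include <immintrin.h>

__m256d vadd_d4_fold(const double* ptr, __m256d bar) {
  __m256d foo = (__m256d){ ptr[0], ptr[1], ptr[2], ptr[3] };
  /* The four scalar loads merge into one 256-bit load, which is then expected
     to fold as the memory operand of the add, roughly: vaddpd (%rdi), %ymm0, %ymm0 */
  return foo + bar;
}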

The vector shuffle optimizations, however, attempt to load only the individual lanes that are actually used, and in doing so prevent the merged load from folding into the shuffle.

e.g.

__m256d vsht_d4(__m256d foo) { 
  return __builtin_shufflevector( foo, foo, 0, 0, 2, 2 ); 
} 

define <4 x double> @_Z7vsht_d4Dv4_d(<4 x double> %foo) #1 {
  %1 = shufflevector <4 x double> %foo, <4 x double> undef, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
  ret <4 x double> %1
}

vpermilpd $0, %ymm0, %ymm0 # ymm0 = ymm0[0,0,2,2] 
retq 

__m256d vsht_d4_fold(const double* ptr) {
  __m256d foo = (__m256d){ ptr[0], ptr[1], ptr[2], ptr[3] }; 
  return __builtin_shufflevector( foo, foo, 0, 0, 2, 2 ); 
}
 
define <4 x double> @_Z12vsht_d4_foldPKd(double* nocapture readonly %ptr) #0 {
  %1 = load double* %ptr, align 8, !tbaa !1
  %2 = insertelement <4 x double> undef, double %1, i32 0
  %3 = getelementptr inbounds double* %ptr, i64 2
  %4 = load double* %3, align 8, !tbaa !1
  %5 = insertelement <4 x double> %2, double %4, i32 2
  %6 = shufflevector <4 x double> %5, <4 x double> undef, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
  ret <4 x double> %6
}

vmovsd (%rdi), %xmm0 
vmovsd 16(%rdi), %xmm1 
vinsertf128 $1, %xmm1, %ymm0, %ymm0 
vpermilpd $0, %ymm0, %ymm0 # ymm0 = ymm0[0,0,2,2] 
retq 

Manually editing the IR (reinstating the loads of all four elements) does permit the fold to occur:

define <4 x double> @_Z12vsht_d4_foldPKd(double* nocapture readonly %ptr) #0 {
  %1 = load double* %ptr, align 8, !tbaa !1
  %2 = insertelement <4 x double> undef, double %1, i32 0
  %3 = getelementptr inbounds double* %ptr, i64 1
  %4 = load double* %3, align 8, !tbaa !1
  %5 = insertelement <4 x double> %2, double %4, i32 1
  %6 = getelementptr inbounds double* %ptr, i64 2
  %7 = load double* %6, align 8, !tbaa !1
  %8 = insertelement <4 x double> %5, double %7, i32 2
  %9 = getelementptr inbounds double* %ptr, i64 3
  %10 = load double* %9, align 8, !tbaa !1
  %11 = insertelement <4 x double> %8, double %10, i32 3
  %12 = shufflevector <4 x double> %11, <4 x double> undef, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
  ret <4 x double> %12
}

vpermilpd $0, (%rdi), %ymm0 # ymm0 = mem[0,0,2,2]
retq
@rotateright

As of r230021, the vpermilpd has become a vmovddup, but we're still not merging the loads:

$ ./llc 21780.ll -o - -mattr=avx
...
vmovsd (%rdi), %xmm0 ## xmm0 = mem[0],zero
vmovsd 16(%rdi), %xmm1 ## xmm1 = mem[0],zero
vinsertf128 $1, %xmm1, %ymm0, %ymm0
vmovddup %ymm0, %ymm0 ## ymm0 = ymm0[0,0,2,2]
retq

rotateright commented Mar 27, 2015

I think the problem with doing a full vector load in this case is that we'd be reading unknown bytes of memory beyond the last scalar load. In general that isn't safe: for example, if the pointer happens to refer to the last 24 mapped bytes before an unmapped page, the 32-byte load could fault even though the original scalar loads would not.

If we change the test case to load the scalar elements at both ends of the vector, then we do get the vector load optimization to kick in:

define <4 x double> @_Z12vsht_d4_foldPKd(double* nocapture readonly %ptr) #0 {
  %1 = load double, double* %ptr, align 8
  %2 = insertelement <4 x double> undef, double %1, i32 0
  %3 = getelementptr double, double* %ptr, i64 3   ; Read last vector element
  %4 = load double, double* %3, align 8
  %5 = insertelement <4 x double> %2, double %4, i32 3 ; And insert into last position
  %6 = shufflevector <4 x double> %5, <4 x double> undef, <4 x i32> <i32 0, i32 0, i32 3, i32 3>
  ret <4 x double> %6
}


$ ./llc -mattr=avx 21780.ll -o -
...
	vpermilpd	$12, (%rdi), %ymm0 ## ymm0 = mem[0,0,3,3]
	retq

rotateright commented Mar 27, 2015

Ah, now I see the source test case:

  __m256d foo = (__m256d){ ptr[0], ptr[1], ptr[2], ptr[3] }; 
  return __builtin_shufflevector( foo, foo, 0, 0, 2, 2 ); 

So I'll stick with the assessment that the backend is doing all that it can given the IR.

It looks like -instcombine is responsible for eliminating the unused loads in IR. To fix this, we'd need to teach instcombine to weigh a potential trade-off: bypass the elimination of some loads in exchange for reducing the total number of ops via a larger load.
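
Purely as an illustration of that trade-off (hypothetical pseudocode, not LLVM APIs; all names are invented), the question instcombine would have to answer before deleting the "dead" scalar loads is roughly:

// Sketch: keep otherwise-dead scalar loads only if the region they cover is
// contiguous and known dereferenceable, so they can later become one wide
// vector load feeding the existing shuffle.
bool shouldKeepLoadsForWideLoad(unsigned NumLiveScalarLoads, unsigned NumLiveInserts,
                                bool RegionContiguous, bool RegionDereferenceable) {
  unsigned ScalarCost = NumLiveScalarLoads + NumLiveInserts; // what DCE leaves today
  unsigned WideLoadCost = 1;                                 // one vector load
  return RegionContiguous && RegionDereferenceable && WideLoadCost < ScalarCost;
}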

RKSimon commented Feb 20, 2016

Current codegen:

$ ./llc 21780.ll -o  - -mattr=avx

  vmovddup	(%rdi), %xmm0   ## xmm0 = mem[0,0]
  vmovddup	16(%rdi), %xmm1 ## xmm1 = mem[0,0]
  vinsertf128	$1, %xmm1, %ymm0, %ymm0
  retq

rotateright commented Apr 1, 2016

I was going back through our splat discussions in relation to bug 27141, and came back to this bug.

Let's make this IR test case concrete:

define <4 x double> @load_four_scalars_but_use_two(double* %ptr) {
  %arrayidx0 = getelementptr inbounds double, double* %ptr, i64 0
  %arrayidx1 = getelementptr inbounds double, double* %ptr, i64 1
  %arrayidx2 = getelementptr inbounds double, double* %ptr, i64 2
  %arrayidx3 = getelementptr inbounds double, double* %ptr, i64 3

  %ld0 = load double, double* %arrayidx0
  %ld1 = load double, double* %arrayidx1
  %ld2 = load double, double* %arrayidx2
  %ld3 = load double, double* %arrayidx3

  %ins0 = insertelement <4 x double> undef, double %ld0, i32 0
  %ins1 = insertelement <4 x double> %ins0, double %ld1, i32 1
  %ins2 = insertelement <4 x double> %ins1, double %ld2, i32 2
  %ins3 = insertelement <4 x double> %ins2, double %ld3, i32 3

  %shuffle = shufflevector <4 x double> %ins3, <4 x double> undef, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
  ret <4 x double> %shuffle
}

Here's the current optimization - delete the unused scalar loads and inserts:

$ ./opt -instcombine 21780.ll -S
; ModuleID = '21780.ll'
source_filename = "21780.ll"

define <4 x double> @load_four_scalars_but_use_two(double* %ptr) {
  %arrayidx2 = getelementptr inbounds double, double* %ptr, i64 2
  %ld0 = load double, double* %ptr, align 8
  %ld2 = load double, double* %arrayidx2, align 8
  %ins0 = insertelement <4 x double> undef, double %ld0, i32 0
  %ins2 = insertelement <4 x double> %ins0, double %ld2, i32 2
  %shuffle = shufflevector <4 x double> %ins2, <4 x double> undef, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
  ret <4 x double> %shuffle
}

What we'd prefer to see for an AVX machine:

define <4 x double> @load_four_scalars_but_use_two(double* %ptr) {
  %bc = bitcast double* %ptr to <4 x double>*
  %ldvec = load <4 x double>, <4 x double>* %bc
  %shuffle = shufflevector <4 x double> %ldvec, <4 x double> undef, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
  ret <4 x double> %shuffle
}

To summarize earlier comments: after instcombine removes the loads, no other pass is allowed to recreate them. We've simply lost the information that it was safe to do those operations in the first place. This goes back to the recent llvm-dev thread on reading extra memory.

So this has to be an instcombine transform. Given that the vector IR has fewer instructions, I think that's a reasonable transform compared to what we do today. It should be a backend responsibility to split that vector load into scalar ops if that would be profitable.

rotateright commented Apr 2, 2016

Note that there is a "load-combine" pass for IR that is not on by default. It doesn't work on the example IR in comment 5 because it is currently limited to integer loads. If we change the example types to ints, it does this:

define <4 x i64> @load_four_scalars_but_use_two(i64* %ptr) {
  %1 = bitcast i64* %ptr to i256*
  %ld0.combined = load i256, i256* %1, align 8  <--- crazy wide integer type
  %combine.extract.trunc = trunc i256 %ld0.combined to i64
  %combine.extract.shift = lshr i256 %ld0.combined, 64
  %combine.extract.trunc1 = trunc i256 %combine.extract.shift to i64
  %combine.extract.shift2 = lshr i256 %ld0.combined, 128
  %combine.extract.trunc3 = trunc i256 %combine.extract.shift2 to i64
  %combine.extract.shift4 = lshr i256 %ld0.combined, 192
  %combine.extract.trunc5 = trunc i256 %combine.extract.shift4 to i64
  %ins0 = insertelement <4 x i64> undef, i64 %combine.extract.trunc, i32 0
  %ins1 = insertelement <4 x i64> %ins0, i64 %combine.extract.trunc1, i32 1
  %ins2 = insertelement <4 x i64> %ins1, i64 %combine.extract.trunc3, i32 2
  %ins3 = insertelement <4 x i64> %ins2, i64 %combine.extract.trunc5, i32 3
  ret <4 x i64> %ins3
}

Unfortunately, even if we fix that, I don't think this pass is a viable option for the transform we want, because our transform has to run before instcombine has a chance to eliminate the unused loads.

The load-combine pass does provide a model for what is required, though: alias analysis, etc. As we've seen with combining loads in the DAG, this is not an easy transform in general because of the potential for error when reordering memory operations (for example, an intervening store into the combined range would make a widened load placed at the first load's position read stale data).

@rotateright

> To summarize earlier comments: after instcombine removes the loads, no other pass is allowed to recreate them. We've simply lost the information that it was safe to do those operations in the first place. This goes back to the recent llvm-dev thread on reading extra memory.
>
> So this has to be an instcombine transform. Given that the vector IR has fewer instructions, I think that's a reasonable transform compared to what we do today. It should be a backend responsibility to split that vector load into scalar ops if that would be profitable.

Let me refine that statement: instcombine must preserve the fact that it has removed a load of memory in order for subsequent passes to act on that information.

It's probably easier to have the SLP vectorizer or some other pass act on that information because instcombine isn't currently equipped for memop combining AFAIK.

There does already appear to be a metadata type that would work for this use case:

"The optional !dereferenceable_or_null metadata must reference a single metadata name <deref_bytes_node> corresponding to a metadata node with one i64 entry. The existence of the !dereferenceable_or_null metadata on the instruction tells the optimizer that the value loaded is known to be either dereferenceable or null. The number of bytes known to be dereferenceable is specified by the integer value in the metadata node."

http://llvm.org/docs/LangRef.html#load-instruction

RKSimon commented Jun 27, 2019

Current codegen: https://godbolt.org/z/sc6te_

@rotateright

Try to preserve the dereferenceable information before instcombine kills it:
https://reviews.llvm.org/D64258

@llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 9, 2021
RKSimon commented Dec 11, 2022

All we're now missing for this to fold is for the arg to be correctly marked dereferenceable(32):

define <4 x double> @vsht_d4_fold(ptr nocapture noundef readonly %0) {
  %2 = load double, ptr %0, align 8
  %3 = insertelement <4 x double> undef, double %2, i64 0
  %4 = getelementptr inbounds double, ptr %0, i64 2
  %5 = load double, ptr %4, align 8
  %6 = insertelement <4 x double> %3, double %5, i64 2
  %7 = shufflevector <4 x double> %6, <4 x double> poison, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
  ret <4 x double> %7
}

define <4 x double> @vsht_d4_fold_deref(ptr nocapture noundef readonly dereferenceable(32) %0) {
  %2 = load double, ptr %0, align 8
  %3 = insertelement <4 x double> undef, double %2, i64 0
  %4 = getelementptr inbounds double, ptr %0, i64 2
  %5 = load double, ptr %4, align 8
  %6 = insertelement <4 x double> %3, double %5, i64 2
  %7 = shufflevector <4 x double> %6, <4 x double> poison, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
  ret <4 x double> %7
}

opt -O3

define <4 x double> @vsht_d4_fold(ptr nocapture noundef readonly %0) {
  %2 = load double, ptr %0, align 8
  %3 = insertelement <4 x double> undef, double %2, i64 0
  %4 = getelementptr inbounds double, ptr %0, i64 2
  %5 = load double, ptr %4, align 8
  %6 = insertelement <4 x double> %3, double %5, i64 2
  %7 = shufflevector <4 x double> %6, <4 x double> poison, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
  ret <4 x double> %7
}

define <4 x double> @vsht_d4_fold_deref(ptr nocapture noundef readonly dereferenceable(32) %0) {
  %2 = load <4 x double>, ptr %0, align 8
  %3 = shufflevector <4 x double> %2, <4 x double> poison, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
  ret <4 x double> %3
}
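
For context (an assumption on my part, not something verified in this thread): the frontend already marks some arguments dereferenceable(32), e.g. when the pointer is passed as a C++ reference to a complete 4-element array, so a source-level variant along these lines should reach VectorCombine with the attribute in place:

__m256d vsht_d4_fold_ref(const double (&arr)[4]) {
  __m256d foo = (__m256d){ arr[0], arr[1], arr[2], arr[3] };  // same element loads as above
  return __builtin_shufflevector(foo, foo, 0, 0, 2, 2);
}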

rotateright commented Mar 2, 2023

Right - VectorCombine will create the vector load when it knows the full 32 bytes are dereferenceable.

It's the job of the Attributor pass to add that argument attribute (and that's why I abandoned my limited patch years ago). Attributor is capable of doing it today, but it isn't on by default. The reason seems to be a huge compile-time hit:
https://llvm-compile-time-tracker.com/compare.php?from=5680b7570342c3457b80d3129fe60e53ef7ddfd5&to=df13b2a8c23f0c7c2e0ed53d9b3037251097cf01&stat=instructions:u

In case that link dies: it shows a 28%-58% compile-time regression with this patch, which partially enables the Attributor:

diff --git a/llvm/lib/Passes/PassBuilderPipelines.cpp b/llvm/lib/Passes/PassBuilderPipelines.cpp
index adb555ed21b9d..0afb720649739 100644
--- a/llvm/lib/Passes/PassBuilderPipelines.cpp
+++ b/llvm/lib/Passes/PassBuilderPipelines.cpp
@@ -260,7 +260,7 @@ static cl::opt<bool> EnableConstraintElimination(
         "Enable pass to eliminate conditions based on linear constraints"));
 
 static cl::opt<AttributorRunOption> AttributorRun(
-    "attributor-enable", cl::Hidden, cl::init(AttributorRunOption::NONE),
+    "attributor-enable", cl::Hidden, cl::init(AttributorRunOption::MODULE),
     cl::desc("Enable the attributor inter-procedural deduction pass"),
     cl::values(clEnumValN(AttributorRunOption::ALL, "all",
                           "enable all attributor runs"),

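(Untested assumption based on the flag shown in the patch above: the same behavior should be reproducible without changing the default by passing the option explicitly, e.g. $ opt -O3 -attributor-enable=all 21780.ll -S.)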