[X86][AVX] Expansion of 256 bit vector loads fails to fold into shuffles #22154
Comments
As of r230021, the vpermilpd has become a vmovddup, but we're still not merging the loads:
I think the problem with doing a full vector load in this case is that we'd be reading unknown bytes of memory beyond the last scalar load. In general, this isn't safe (the memory might be unmapped). If we change the test case to load the scalar elements at both ends of the vector, then we do get the vector load optimization to kick in:
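As a sketch of that shape (the function name, types, and shuffle mask here are illustrative, not taken verbatim from the test case): because both the first and last elements of the 32-byte region are loaded, the whole region is known to be dereferenceable, so a single wide load would read no memory that the original code does not:

```llvm
; Sketch only: loads of %p[0] and %p[3] bound the 32-byte region, so merging
; them into one <4 x double> load cannot touch unmapped memory.
define <4 x double> @ends_loaded(ptr %p) {
  %p3 = getelementptr inbounds double, ptr %p, i64 3
  %d0 = load double, ptr %p, align 8
  %d3 = load double, ptr %p3, align 8
  %v0 = insertelement <4 x double> undef, double %d0, i32 0
  %v1 = insertelement <4 x double> %v0, double %d3, i32 3
  %s  = shufflevector <4 x double> %v1, <4 x double> undef, <4 x i32> <i32 0, i32 0, i32 3, i32 3>
  ret <4 x double> %s
}
```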
Ah, now I see the source test case:
So I'll stick with the assessment that the backend is doing all that it can given the IR. It looks like -instcombine is responsible for eliminating the unused loads in IR. To fix this, we'd need to teach instcombine to weigh a potential trade-off: bypass the elimination of some loads in exchange for reducing the total number of ops via a larger load.
Current codegen:
I was going back through our splat discussions in relation to bug 27141, and came back to this bug. Let's make this IR test case concrete:
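A sketch of the kind of IR in question (function name, pointer types, and the exact shuffle mask are my illustration, not necessarily the original test case): four scalar loads and inserts feeding a shuffle that only uses lanes 0 and 2:

```llvm
define <4 x double> @load_shuffle(ptr %p) {
  %p1 = getelementptr inbounds double, ptr %p, i64 1
  %p2 = getelementptr inbounds double, ptr %p, i64 2
  %p3 = getelementptr inbounds double, ptr %p, i64 3
  %d0 = load double, ptr %p, align 8
  %d1 = load double, ptr %p1, align 8
  %d2 = load double, ptr %p2, align 8
  %d3 = load double, ptr %p3, align 8
  %v0 = insertelement <4 x double> undef, double %d0, i32 0
  %v1 = insertelement <4 x double> %v0, double %d1, i32 1
  %v2 = insertelement <4 x double> %v1, double %d2, i32 2
  %v3 = insertelement <4 x double> %v2, double %d3, i32 3
  %s  = shufflevector <4 x double> %v3, <4 x double> undef, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
  ret <4 x double> %s
}
```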
Here's the current optimization - delete the unused scalar loads and inserts:
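Continuing the sketch above: the shuffle only demands lanes 0 and 2, so instcombine deletes the other inserts and their now-dead loads, leaving roughly:

```llvm
define <4 x double> @load_shuffle(ptr %p) {
  %p2 = getelementptr inbounds double, ptr %p, i64 2
  %d0 = load double, ptr %p, align 8
  %d2 = load double, ptr %p2, align 8
  %v0 = insertelement <4 x double> undef, double %d0, i32 0
  %v2 = insertelement <4 x double> %v0, double %d2, i32 2
  %s  = shufflevector <4 x double> %v2, <4 x double> undef, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
  ret <4 x double> %s
}
```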
What we'd prefer to see for an AVX machine:
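Again as a sketch: a single 32-byte load feeding the same shuffle, which an AVX backend should be able to fold into one load-and-duplicate shuffle (e.g. vmovddup with a memory operand):

```llvm
define <4 x double> @load_shuffle(ptr %p) {
  %v = load <4 x double>, ptr %p, align 8
  %s = shufflevector <4 x double> %v, <4 x double> undef, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
  ret <4 x double> %s
}
```

Whether this form is legal, of course, depends on knowing that the full 32 bytes are dereferenceable, which is the crux of the rest of this thread.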
To summarize earlier comments: after instcombine removes the loads, no other pass is allowed to recreate them. We've simply lost the information that it was safe to do those operations in the first place. This goes back to the recent llvm-dev thread on reading extra memory. So this has to be an instcombine transform. Given that the vector IR has fewer instructions, I think that's a reasonable transform compared to what we do today. It should be a backend responsibility to split that vector load into scalar ops if that would be profitable.
Note that there is a "load-combine" pass for IR that is not on by default. It doesn't work on the example IR in comment 5 because it is currently limited to integer loads. If we change the example types to ints, it does this:
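A rough illustration of that behavior (reconstructed from how the old -load-combine pass worked, not copied from the report): adjacent integer loads get merged into one wider integer load, with the original values rebuilt via shift/trunc:

```llvm
; Before load-combine: two adjacent i64 loads.
define i64 @adjacent_ints(ptr %p) {
  %p1 = getelementptr inbounds i64, ptr %p, i64 1
  %a = load i64, ptr %p, align 8
  %b = load i64, ptr %p1, align 8
  %r = add i64 %a, %b
  ret i64 %r
}

; After load-combine (roughly): one wide integer load, pieces recovered with lshr/trunc.
;   %wide = load i128, ptr %p, align 8
;   %a    = trunc i128 %wide to i64
;   %bhi  = lshr i128 %wide, 64
;   %b    = trunc i128 %bhi to i64
```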
Unfortunately, even if we fix that, I don't think this pass is a viable option for the transform we want, because our transform has to run before instcombine has a chance to eliminate the unused loads. The load-combine pass does provide a model for what is required, though: alias analysis, etc. As we've seen with combining loads in the DAG, this isn't an easy transform in general because of the potential for errors when reordering memory operations.
Let me refine that statement: instcombine must preserve the fact that it has removed a load of memory in order for subsequent passes to act on that information. It's probably easier to have the SLP vectorizer or some other pass act on that information because instcombine isn't currently equipped for memop combining AFAIK. There does already appear to be a metadata type that would work for this use case: "The optional !dereferenceable_or_null metadata must reference a single metadata name <deref_bytes_node> corresponding to a metadata node with one i64 entry. The existence of the !dereferenceable_or_null metadata on the instruction tells the optimizer that the value loaded is known to be either dereferenceable or null. The number of bytes known to be dereferenceable is specified by the integer value in the metadata node."
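For reference, a minimal sketch of that metadata's syntax (illustrative function, not from this bug): it attaches to a load whose result is a pointer and records how many bytes of the loaded pointer are known dereferenceable (or null):

```llvm
define void @metadata_example(ptr %pp) {
  ; The loaded pointer %p is known to be null or dereferenceable for 32 bytes.
  %p = load ptr, ptr %pp, align 8, !dereferenceable_or_null !0
  ret void
}

!0 = !{i64 32}
```

In this bug the pointer of interest is a function argument rather than a loaded value, which is why the later comments move toward the dereferenceable(N) argument attribute instead.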
Current codegen: https://godbolt.org/z/sc6te_
Try to preserve the dereferenceable information before instcombine kills it:
All that's now missing for this to fold is for the arg to be marked dereferenceable(32):
opt -O3
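To make that concrete, a sketch of the post-instcombine IR with the attribute in place (the signature and alignment are my assumptions, mirroring the discussion above); given this, VectorCombine should be able to rebuild the <4 x double> load:

```llvm
; With dereferenceable(32) on %p, widening the two scalar loads into one
; <4 x double> load no longer risks reading unmapped memory.
define <4 x double> @load_shuffle(ptr align 8 dereferenceable(32) %p) {
  %p2 = getelementptr inbounds double, ptr %p, i64 2
  %d0 = load double, ptr %p, align 8
  %d2 = load double, ptr %p2, align 8
  %v0 = insertelement <4 x double> undef, double %d0, i32 0
  %v2 = insertelement <4 x double> %v0, double %d2, i32 2
  %s  = shufflevector <4 x double> %v2, <4 x double> undef, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
  ret <4 x double> %s
}
```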
Right - VectorCombine will create the vector load when it knows the full 32 bytes are dereferenceable. It's the job of the Attributor pass to add that argument attribute (and that's why I abandoned my limited patch years ago). The Attributor is capable of doing it today, but it isn't on by default. It seems the reason for that is a huge compile-time hit: in case that link dies, it's showing a 28% - 58% compile-time regression with this patch to partially enable the Attributor:

```diff
diff --git a/llvm/lib/Passes/PassBuilderPipelines.cpp b/llvm/lib/Passes/PassBuilderPipelines.cpp
index adb555ed21b9d..0afb720649739 100644
--- a/llvm/lib/Passes/PassBuilderPipelines.cpp
+++ b/llvm/lib/Passes/PassBuilderPipelines.cpp
@@ -260,7 +260,7 @@ static cl::opt<bool> EnableConstraintElimination(
     "Enable pass to eliminate conditions based on linear constraints"));
 static cl::opt<AttributorRunOption> AttributorRun(
-    "attributor-enable", cl::Hidden, cl::init(AttributorRunOption::NONE),
+    "attributor-enable", cl::Hidden, cl::init(AttributorRunOption::MODULE),
     cl::desc("Enable the attributor inter-procedural deduction pass"),
     cl::values(clEnumValN(AttributorRunOption::ALL, "all",
                           "enable all attributor runs"),
```
Extended Description
Follow-up to [Bug #22084], '[X86][AVX] suboptimal expansion of 256 bit vector loads.'
Merging of consecutive loads into a 256-bit ymm register now works well for simple cases, and the loads also fold nicely for bitwise ops (as well as basic float ops - fadd, fsub, etc.).
Vector shuffle optimizations, however, attempt to selectively load individual lanes, and in doing so they prevent the load from being folded into the shuffle.
e.g.
Manually editing the IR does permit the fold to occur: