Merge consecutive load insertions into a vector where possible #38821
Comments
We've made some progress on this example: the original code actually looks better on paper (llvm-mca) than the suggested asm.
The current IR for those examples is still a chain of scalar loads feeding insertelement instructions.
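A minimal sketch of that shape, assuming a <4 x i16> result and 2-byte element alignment (the actual types and element count in the report may differ):

```llvm
; Illustrative only (assumed types, not the original reproducer):
; each element is loaded and inserted into the vector one at a time.
define <4 x i16> @insert_chain(ptr %data) {
  %p1 = getelementptr inbounds i16, ptr %data, i64 1
  %p2 = getelementptr inbounds i16, ptr %data, i64 2
  %p3 = getelementptr inbounds i16, ptr %data, i64 3
  %e0 = load i16, ptr %data, align 2
  %e1 = load i16, ptr %p1, align 2
  %e2 = load i16, ptr %p2, align 2
  %e3 = load i16, ptr %p3, align 2
  %v0 = insertelement <4 x i16> poison, i16 %e0, i32 0
  %v1 = insertelement <4 x i16> %v0, i16 %e1, i32 1
  %v2 = insertelement <4 x i16> %v1, i16 %e2, i32 2
  %v3 = insertelement <4 x i16> %v2, i16 %e3, i32 3
  ret <4 x i16> %v3
}
```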
Given the source, I don't think there's a way to speculatively load the full vector, since access to data[7] is not guaranteed. Is there anything more we can do to improve this?
Yes - without better dereferenceable-range widening, the scalar load insertions are the best we can do. The codegen, at least for SSE4+, looks pretty good now, although plain SSE2 seems to have regressed again since a high around 13.x, which I think is a SLP/cost issue. We can probably close this, although we need at least some explicit SLP test coverage first.
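To make the dereferenceable point concrete, here is a sketch of what a wider guarantee would unlock; the dereferenceable(16) attribute and the <8 x i16> type are assumptions for illustration, not something the front end currently emits for this source:

```llvm
; Sketch only: with a 16-byte dereferenceable guarantee on %data
; (i.e. data[0..7] are all known readable), a single speculative
; vector load could replace the whole scalar load/insert chain.
define <8 x i16> @merged_full_load(ptr dereferenceable(16) %data) {
  %v = load <8 x i16>, ptr %data, align 2
  ret <8 x i16> %v
}
```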
Note that with AVX2 the suggested merged IR will actually look better, because we can use a broadcast load of i16. But I've been staring at that for a while, and I don't see a way to coerce SLP or the cost model into producing the necessary rearranged loads and shuffles. Once SLP is done, there's no way for the backend to undo it either, AFAIK.
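For reference, the broadcast shape would be something like the following splat pattern (a sketch with assumed types); with AVX2 the x86 backend can fold the load into a vpbroadcastw:

```llvm
; Sketch: one i16 load splatted across the vector. With AVX2 this can
; lower to a vpbroadcastw with a memory operand.
define <8 x i16> @broadcast_i16(ptr %data) {
  %e = load i16, ptr %data, align 2
  %ins = insertelement <8 x i16> poison, i16 %e, i64 0
  %splat = shufflevector <8 x i16> %ins, <8 x i16> poison, <8 x i32> zeroinitializer
  ret <8 x i16> %splat
}
```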
Extended Description
`-O3 -march=btver2`
Many of the loads/insertions could be merged to something like:
https://godbolt.org/z/-HLpsE
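As a rough sketch of the kind of merging being asked for (the linked reproducer has the exact code, so the types and indices here are assumed): the first few scalar loads collapse into one narrow vector load that is shuffled into the wider result, and only the remaining elements still need individual inserts.

```llvm
; Sketch of a partially merged form: one <4 x i16> load covers the
; first four elements; later elements still use scalar load + insert.
define <8 x i16> @partially_merged(ptr %data) {
  %lo = load <4 x i16>, ptr %data, align 2
  %wide = shufflevector <4 x i16> %lo, <4 x i16> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison>
  %p4 = getelementptr inbounds i16, ptr %data, i64 4
  %e4 = load i16, ptr %p4, align 2
  %v = insertelement <8 x i16> %wide, i16 %e4, i32 4
  ret <8 x i16> %v
}
```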