vectorize widening instructions #49600
Comments
This is an interesting example (I'm looking at optimizer pipeline stages in https://reviews.llvm.org/D102002). From the "-print-before-all" debug output, we see:
If we did not fully unroll, then the loop vectorizer could vectorize the loop. As an experiment, I fed the unrolled IR directly to "opt -loop-vectorize". It worked by default for an x86-64 triple, but not for a default aarch64 triple. But the x86 vectorization was not ideal - it looks like this: SLP could also have vectorized the fully unrolled output, and that looks close to ideal, but it bails out because it assumes that this is a pattern that the backend can load-combine into something better (grep for "SLP: Assume load combining for tree"). Either we need to do some kind of (limited) load combining late in IR, or refine the restriction in SLP to allow this case. The backend might be able to merge stores and loads to create wider scalar ops, but that's unlikely to be ideal.
Many thanks for the analysis! Like I said, I haven't had a chance to look at this yet; I just raised it as an interesting case. I might be able to have a look sometime next week, but if someone else wants to look... :-)
This sounds like an SLP cost-model issue, which should make it very doable to fix.
It's not purely a cost-model issue, because the load-combine bailout overrides costs. But I'm not seeing this problem if the target has vector registers/ops that match the size of the final store. I.e., this vectorizes on x86 with AVX2, because that has a 256-bit store, but not on x86 with SSE2. Let's try to refine/restrict the bailout again:
Cool, thanks, will look at this first thing next week.
Should be fixed with:
Extended Description
GCC11 learned a new trick[1] and is now able to vectorise widening instructions much better. Copying the example[2] here for completeness:
void wide1(char * __restrict a, short *__restrict b, int n) {
    for (int x = 0; x < 16; x++)
        b[x] = a[x] << 8;
}
GCC11 generates:
whereas with trunk we generate:
We completely unroll this very early, and then fail to either loop- or SLP-vectorise it (I haven't looked into this yet, so I don't know which one should handle it).
[1] https://community.arm.com/developer/tools-software/tools/b/tools-software-ides-blog/posts/performance-improvements-in-gcc-11
[2] https://godbolt.org/z/KPe6xjfed