Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vectorize widening instructions #49600

Closed
sjoerdmeijer opened this issue May 7, 2021 · 6 comments
Closed

vectorize widening instructions #49600

sjoerdmeijer opened this issue May 7, 2021 · 6 comments
Labels
bugzilla Issues migrated from bugzilla

Comments

@sjoerdmeijer
Copy link
Collaborator

Bugzilla Link 50256
Resolution FIXED
Resolved on May 17, 2021 08:04
Version trunk
OS Linux
CC @davidbolvansky,@fhahn,@RKSimon,@MattPD,@rotateright

Extended Description

GCC11 learned a new trick[1] and is now able to vectorise widening instruction much better. Copying for completeness the example[2] here:

void wide1(char * __restrict a, short *__restrict b, int n) {
for (int x = 0; x < 16; x++)
b[x] = a[x] << 8;
}

GCC11 generates:

    ldr     q0, [x0]
    shll    v1.8h, v0.8b, 8
    shll2   v0.8h, v0.16b, 8
    stp     q1, q0, [x1]
    ret

whereas with trunk we generate:

    ldrb    w8, [x0]
    ldrb    w9, [x0, #&#8203;1]
    ldrb    w10, [x0, #&#8203;15]
    lsl     w8, w8, #&#8203;8
    strh    w8, [x1]
    ldrb    w8, [x0, #&#8203;2]
    lsl     w9, w9, #&#8203;8
    strh    w9, [x1, #&#8203;2]
    ldrb    w9, [x0, #&#8203;3]
    lsl     w8, w8, #&#8203;8
    strh    w8, [x1, #&#8203;4]
    ldrb    w8, [x0, #&#8203;4]
    lsl     w9, w9, #&#8203;8
    strh    w9, [x1, #&#8203;6]
    ldrb    w9, [x0, #&#8203;5]
    lsl     w8, w8, #&#8203;8
    strh    w8, [x1, #&#8203;8]
    ldrb    w8, [x0, #&#8203;6]
    lsl     w9, w9, #&#8203;8
    strh    w9, [x1, #&#8203;10]
    ldrb    w9, [x0, #&#8203;7]
    lsl     w8, w8, #&#8203;8
    strh    w8, [x1, #&#8203;12]
    ldrb    w8, [x0, #&#8203;8]
    lsl     w9, w9, #&#8203;8
    strh    w9, [x1, #&#8203;14]
    ldrb    w9, [x0, #&#8203;9]
    lsl     w8, w8, #&#8203;8
    strh    w8, [x1, #&#8203;16]
    ldrb    w8, [x0, #&#8203;10]
    lsl     w9, w9, #&#8203;8
    strh    w9, [x1, #&#8203;18]
    ldrb    w9, [x0, #&#8203;11]
    lsl     w8, w8, #&#8203;8
    strh    w8, [x1, #&#8203;20]
    ldrb    w8, [x0, #&#8203;12]
    lsl     w9, w9, #&#8203;8
    strh    w9, [x1, #&#8203;22]
    ldrb    w9, [x0, #&#8203;13]
    lsl     w8, w8, #&#8203;8
    strh    w8, [x1, #&#8203;24]
    ldrb    w8, [x0, #&#8203;14]
    lsl     w9, w9, #&#8203;8
    strh    w9, [x1, #&#8203;26]
    lsl     w9, w10, #&#8203;8
    lsl     w8, w8, #&#8203;8
    strh    w8, [x1, #&#8203;28]
    strh    w9, [x1, #&#8203;30]
    ret

We completely unroll this very early, and then fail to loop or slp vectorise this (haven't looked into this yet, don't know yet which one).

[1] https://community.arm.com/developer/tools-software/tools/b/tools-software-ides-blog/posts/performance-improvements-in-gcc-11
[2] https://godbolt.org/z/KPe6xjfed

@rotateright
Copy link
Contributor

This is an interesting example (I'm looking at optimizer pipeline stages in https://reviews.llvm.org/D102002 ).

From the "-print-before-all" debug output, we see:

  1. Full unroll happens.
  2. Loop vectorizer is therefore not involved.
  3. SLP misses the vectorization opportunity.

If we did not full unroll, then loop vectorizer can vectorize the loop. I fed the unrolled IR input directly to "opt -loop-vectorize" as an experiment. It worked by default for an x86-64 triple, but not a default aarch64 triple. But the x86 vectorization was not ideal - it looks like this:
%wide.load = load <2 x i8>, <2 x i8>* %3, align 1, !tbaa !​5
%4 = zext <2 x i8> %wide.load to <2 x i16>
%5 = shl nuw <2 x i16> %4, <i16 8, i16 8>

SLP also could have vectorized the fully unrolled output, and that looks close to ideal, but it bails out because it assumes that this is a pattern that the backend can load-combine into something better (grep for "SLP: Assume load combining for tree").

Either we need to do some kind of (limited) load combining late in IR or refine the restriction in SLP to allow this case. The backend might be able to merge stores and loads to create wider scalar ops, but that's unlikely to be ideal.

@sjoerdmeijer
Copy link
Collaborator Author

Many thanks for the analysis!

Like I said, I didn't have a chance to look at this yet, but just raised it as an interesting case. I might be able to have a look at this sometime next week, but if someone else want to look... :-)

SLP also could have vectorized the fully unrolled output, and that looks close to ideal, but it bails out because it assumes that this is a pattern that the backend can load-combine into something better (grep for "SLP: Assume load combining for tree").

This sounds like a SLP cost-model issue, and would make this very doable to fix.
If this was related to tuning the unroller, I was suspecting it to be a bit more tricky.

@rotateright
Copy link
Contributor

This sounds like a SLP cost-model issue, and would make this very doable to
fix.
If this was related to tuning the unroller, I was suspecting it to be a bit
more tricky.

It's not purely a cost model issue because the load-combine bailout overrides costs.

But I'm not seeing this problem if the target has vector registers/ops that match the size of the final store. Ie, this vectorizes on x86 with AVX2 because that has a 256-bit store, but not on x86 with SSE2.

Let's try to refine/restrict the bailout again:
https://reviews.llvm.org/D102074

@sjoerdmeijer
Copy link
Collaborator Author

Let's try to refine/restrict the bailout again:
https://reviews.llvm.org/D102074

Cool, thanks, will look at this first thing next week.

@davidbolvansky
Copy link
Collaborator

If we did not full unroll, then loop vectorizer can vectorize the loop.

This is sad story of many missed vectorizations.

#46522 , #47070 , #46898 , #46780 , #30920 , #46897 , #46835

It would be interesting to run Loop Vectorizer -> Loop Unroller -> SLP Vectorizer.

@rotateright
Copy link
Contributor

Should be fixed with:
https://reviews.llvm.org/rG49950cb1f6f6

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 11, 2021
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugzilla Issues migrated from bugzilla
Projects
None yet
Development

No branches or pull requests

3 participants