vectorize widening instructions #49600
Comments
This is an interesting example (I'm looking at optimizer pipeline stages in https://reviews.llvm.org/D102002). From the "-print-before-all" debug output, we see:
If we did not fully unroll, then the loop vectorizer could vectorize the loop. As an experiment, I fed the unrolled IR directly to "opt -loop-vectorize". It worked by default for an x86-64 triple, but not for a default aarch64 triple. But the x86 vectorization was not ideal - it looks like this: SLP could also have vectorized the fully unrolled output, and that looks close to ideal, but it bails out because it assumes that this is a pattern that the backend can load-combine into something better (grep for "SLP: Assume load combining for tree"). Either we need to do some kind of (limited) load combining late in IR, or refine the restriction in SLP to allow this case. The backend might be able to merge stores and loads to create wider scalar ops, but that's unlikely to be ideal.
Many thanks for the analysis! Like I said, I haven't had a chance to look at this yet; I just raised it as an interesting case. I might be able to have a look sometime next week, but if someone else wants to look... :-)
This sounds like an SLP cost-model issue, which should make it very doable to fix.
It's not purely a cost-model issue, because the load-combine bailout overrides costs. But I'm not seeing this problem if the target has vector registers/ops that match the size of the final store. I.e., this vectorizes on x86 with AVX2, because that has a 256-bit store, but not on x86 with SSE2. Let's try to refine/restrict the bailout again:
Cool, thanks, will look at this first thing next week.
Should be fixed with:
Extended Description
GCC11 learned a new trick[1] and is now able to vectorise widening instructions much better. Copying the example[2] here for completeness:
void wide1(char * __restrict a, short *__restrict b, int n) {
    for (int x = 0; x < 16; x++)
        b[x] = a[x] << 8;
}
GCC11 generates:
whereas with trunk we generate:
We completely unroll this very early, and then fail to either loop- or SLP-vectorise it (I haven't looked into this yet, so I don't know which one should handle it).
[1] https://community.arm.com/developer/tools-software/tools/b/tools-software-ides-blog/posts/performance-improvements-in-gcc-11
[2] https://godbolt.org/z/KPe6xjfed