LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filed before November 26, 2021. Use GitHub to submit LLVM bugs.

Bug 49933 - Missed SLP vectorization
Summary: Missed SLP vectorization
Status: CONFIRMED
Alias: None
Product: libraries
Classification: Unclassified
Component: Scalar Optimizations
Version: trunk
Hardware: PC Windows NT
Importance: P enhancement
Assignee: Unassigned LLVM Bugs
Reported: 2021-04-12 05:47 PDT by David Bolvansky
Modified: 2021-09-04 15:29 PDT

Description David Bolvansky 2021-04-12 05:47:52 PDT
typedef unsigned char uint8_t;

static inline uint8_t bar( uint8_t x )
{
  return x&(~63) ? -x : x;
}


void foo( uint8_t *__restrict dst, uint8_t *__restrict src)
{
    for( int x = 0; x < 8; x++ )
        dst[x] = bar(src[x]);
}

ICC:
foo(unsigned char*, unsigned char*):
        vpmovzxbd ymm2, QWORD PTR [rsi]                         #5.25
        vpand     ymm0, ymm2, YMMWORD PTR .L_2il0floatpacket.0[rip] #5.12
        vpxor     ymm1, ymm1, ymm1                              #5.25
        vptestmd  k1, ymm0, ymm0                                #5.12
        vpsubd    ymm2{k1}, ymm1, ymm2                          #5.25
        vpmovdb   QWORD PTR [rdi], ymm2                         #13.6
        vzeroupper                                              #14.1
        ret    

LLVM does not vectorize this loop with AVX, AVX2, or AVX-512. Is this a cost model issue?

https://godbolt.org/z/Kheeec4cG
Comment 1 David Bolvansky 2021-04-12 06:28:55 PDT
With `typedef unsigned short uint8_t;` instead, LLVM produces good codegen, so it looks like a cost model issue for 8-bit element types ((U)INT8).
Comment 2 Simon Pilgrim 2021-04-12 09:21:19 PDT
I'm not sure it's purely a cost model issue; it may also have to do with 8 x uint8_t (64 bits) being smaller than the 128-bit vector target minimum.
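
To make the size argument concrete, here is a small C sketch (not part of the original comment) that computes the total width of the 8-element group for an element size and compares it with a 128-bit minimum. The names `MIN_VEC_BITS`, `group_bits`, and `meets_minimum` are hypothetical and exist only for this illustration; they are not LLVM identifiers, and the real SLP vectorizer logic is more involved.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical constant: the 128-bit minimum vector width mentioned above. */
enum { MIN_VEC_BITS = 128 };

/* Hypothetical helper: total width in bits of `lanes` elements,
   each `elem_size` bytes wide. */
static unsigned group_bits(unsigned lanes, unsigned elem_size)
{
    assert(lanes > 0 && elem_size > 0);
    return lanes * elem_size * CHAR_BIT;
}

/* Hypothetical helper: does a group of this width reach the minimum? */
static int meets_minimum(unsigned bits)
{
    return bits >= MIN_VEC_BITS;
}
```

With 8-bit bytes, `group_bits(8, 1)` gives 64 bits, which is below the minimum, while `group_bits(8, 2)` (the `unsigned short` variant from Comment 1) gives exactly 128 bits. That is consistent with only the 16-bit version being vectorized by default.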
Comment 3 David Bolvansky 2021-04-25 13:16:59 PDT
But according to llvm-mca, ICC's codegen is much better: Block RThroughput is 4 for ICC versus 9.8 for LLVM.
Comment 4 David Bolvansky 2021-04-25 13:29:16 PDT
With -mllvm -slp-min-reg-size=64 we get this nice codegen:

foo(unsigned char*, unsigned char*):                             # @foo(unsigned char*, unsigned char*)
        vmovq   xmm0, qword ptr [rsi]           # xmm0 = mem[0],zero
        vpcmpltub       k1, xmm0, xmmword ptr [rip + .LCPI0_0]
        vpxor   xmm1, xmm1, xmm1
        vpsubb  xmm1, xmm1, xmm0
        vmovdqu8        xmm1 {k1}, xmm0
        vmovq   qword ptr [rdi], xmm1
        ret


Block RThroughput: 1.8
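
As a sanity check on why this sequence is a valid lowering of bar(), here is a scalar C sketch of what one byte lane computes, compared against the original function over all 256 inputs. It assumes that the constant at .LCPI0_0 holds 64 in every byte; the listing does not show its contents, so that value is an assumption. The function names are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar reference: bar() exactly as written in the report. */
static uint8_t bar_ref(uint8_t x)
{
    return x & (~63) ? -x : x;
}

/* One byte lane of the vectorized sequence.
   ASSUMPTION: .LCPI0_0 contains 64 in each byte (not shown in the listing). */
static uint8_t bar_lane(uint8_t x)
{
    int keep = x < 64;              /* vpcmpltub: unsigned x < 64 sets the mask bit */
    uint8_t neg = (uint8_t)(0 - x); /* vpxor + vpsubb: 0 - x, wrapping to 8 bits */
    return keep ? x : neg;          /* vmovdqu8 {k1}: x replaces neg where the mask is set */
}

/* Counts the byte values on which the two versions disagree. */
static int count_mismatches(void)
{
    int mismatches = 0;
    for (int v = 0; v < 256; v++) {
        if (bar_ref((uint8_t)v) != bar_lane((uint8_t)v))
            mismatches++;
    }
    return mismatches;
}

/* Aborts if the lane model ever differs from the reference. */
static void check_equivalence(void)
{
    assert(count_mismatches() == 0);
}
```

The equivalence holds because, for an 8-bit unsigned value, `x & ~63` is nonzero exactly when `x >= 64`, so the bit test can be replaced by a single unsigned compare.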
Comment 5 Anton Afanasyev 2021-07-23 13:29:11 PDT
The best way to fix this bug is to wait for this patch to land: https://reviews.llvm.org/D57059 ("non-power-of-2 vectors"). I've checked that it works better:

foo(unsigned char*, unsigned char*):
        vmovq   xmm0, qword ptr [rsi]           # xmm0 = mem[0],zero
        vpcmpltub       k1, xmm0, xmmword ptr [rip + .LCPI0_0]
        vpxor   xmm1, xmm1, xmm1
        vpsubb  xmm1, xmm1, xmm0
        vmovdqu8        xmm1 {k1}, xmm0
        mov     ax, 255
        kmovd   k1, eax
        vmovdqu8        xmmword ptr [rdi] {k1}, xmm1
        ret

This vectorized codegen differs from the `-slp-min-reg-size=64` output, though: Block RThroughput is 2.2 (versus 1.8).

This difference comes from using `@llvm.masked.store` instead of `store`.
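
To illustrate that the two stores have the same effect on memory here, below is a scalar C model (a sketch, not how LLVM or the hardware implements it) of a byte-masked 16-lane store with mask 255, the value loaded by `mov ax, 255`, next to a plain 8-byte store. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of a byte-masked 16-lane store:
   byte i is written only if bit i of `mask` is set. */
static void masked_store16(uint8_t *dst, const uint8_t vec[16], uint16_t mask)
{
    assert(dst != NULL && vec != NULL);
    for (int i = 0; i < 16; i++) {
        if (mask & (1u << i))
            dst[i] = vec[i];
    }
}

/* Scalar model of the plain 8-byte store of the low half of the vector. */
static void store_low8(uint8_t *dst, const uint8_t vec[16])
{
    memcpy(dst, vec, 8);
}

/* Returns 1 if both stores leave identical memory,
   including the untouched bytes 8..15. */
static int stores_agree(void)
{
    uint8_t vec[16], a[16], b[16];
    for (int i = 0; i < 16; i++)
        vec[i] = (uint8_t)(i + 1);
    memset(a, 0xEE, sizeof a);
    memset(b, 0xEE, sizeof b);
    masked_store16(a, vec, 255); /* mask 255 enables lanes 0..7 only */
    store_low8(b, vec);
    return memcmp(a, b, sizeof a) == 0;
}
```

Both variants write the same 8 bytes and leave the rest untouched, so the difference between them is the extra mask setup and the masked store instruction, not the result.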
Comment 6 Anton Afanasyev 2021-09-04 15:29:38 PDT
Added a test to track the issue: https://reviews.llvm.org/rGdd028c359e09