49933 – Missed SLP vectorization

Status:	CONFIRMED

Alias:	None

Product:	libraries
Classification:	Unclassified
Component:	Scalar Optimizations
Version:	trunk
Hardware:	PC Windows NT

Importance:	P enhancement
Assignee:	Unassigned LLVM Bugs

Reported:	2021-04-12 05:47 PDT by David Bolvansky
Modified:	2021-09-04 15:29 PDT
CC List:	5 users

See Also:	49934 47491

Description David Bolvansky 2021-04-12 05:47:52 PDT

typedef unsigned char uint8_t;

static inline uint8_t bar(uint8_t x)
{
    return x & (~63) ? -x : x;
}

void foo(uint8_t *__restrict dst, uint8_t *__restrict src)
{
    for (int x = 0; x < 8; x++)
        dst[x] = bar(src[x]);
}
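
For context, SLP vectorization works on straight-line code, so the loop above presumably becomes a candidate only after it is fully unrolled into eight independent byte operations. The sketch below spells out that unrolled shape by hand. `foo_unrolled` is a hypothetical name used only for illustration, `<stdint.h>` replaces the local typedef, and the IR that LLVM actually produces may look different.

```c
#include <stdint.h>

/* Same scalar kernel as in the report. */
static inline uint8_t bar(uint8_t x)
{
    return x & (~63) ? -x : x;
}

/* Hand-unrolled equivalent of foo(): eight independent byte lanes,
   which is the kind of pattern an SLP vectorizer tries to bundle. */
void foo_unrolled(uint8_t *__restrict dst, uint8_t *__restrict src)
{
    dst[0] = bar(src[0]);
    dst[1] = bar(src[1]);
    dst[2] = bar(src[2]);
    dst[3] = bar(src[3]);
    dst[4] = bar(src[4]);
    dst[5] = bar(src[5]);
    dst[6] = bar(src[6]);
    dst[7] = bar(src[7]);
}
```

Each lane loads one byte, applies the same select, and stores one byte, so the eight statements together cover a single 64-bit load and a single 64-bit store.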



ICC:
foo(unsigned char*, unsigned char*):
        vpmovzxbd ymm2, QWORD PTR [rsi]                         #5.25
        vpand     ymm0, ymm2, YMMWORD PTR .L_2il0floatpacket.0[rip] #5.12
        vpxor     ymm1, ymm1, ymm1                              #5.25
        vptestmd  k1, ymm0, ymm0                                #5.12
        vpsubd    ymm2{k1}, ymm1, ymm2                          #5.25
        vpmovdb   QWORD PTR [rdi], ymm2                         #13.6
        vzeroupper                                              #14.1
        ret    

LLVM does not vectorize this loop with AVX, AVX2, or AVX-512. Is this a cost model issue?

https://godbolt.org/z/Kheeec4cG

Comment 1 David Bolvansky 2021-04-12 06:28:55 PDT

OK, with `typedef unsigned short uint8_t;` LLVM produces good codegen, so it looks like a cost model issue for (u)int8.

Comment 2 Simon Pilgrim 2021-04-12 09:21:19 PDT

Not sure it's purely a cost model issue; it may also be because 8 x uint8_t (64 bits) is smaller than the target's 128-bit minimum vector width.
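
The arithmetic behind this observation, and behind Comment 1, can be sketched as follows. This is a simplified model, not the SLP vectorizer's actual logic: the helper names `group_bits` and `meets_min_reg_size` are hypothetical, and the 128-bit figure is the minimum quoted in this comment.

```c
#include <limits.h>
#include <stddef.h>

/* Total width, in bits, of `lanes` elements of `elem_size` bytes each. */
static size_t group_bits(size_t lanes, size_t elem_size)
{
    return lanes * elem_size * CHAR_BIT;
}

/* Nonzero if the group is at least `min_reg_bits` wide. */
static int meets_min_reg_size(size_t lanes, size_t elem_size,
                              size_t min_reg_bits)
{
    return group_bits(lanes, elem_size) >= min_reg_bits;
}
```

On a target with 8-bit bytes and a 2-byte `unsigned short`, `meets_min_reg_size(8, sizeof(unsigned char), 128)` is false (64 bits), while `meets_min_reg_size(8, sizeof(unsigned short), 128)` is true (exactly 128 bits), which is consistent with the 16-bit variant being vectorized.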

Comment 3 David Bolvansky 2021-04-25 13:16:59 PDT

But according to llvm-mca, ICC's codegen is much better: Block RThroughput is 4 for ICC versus 9.8 for LLVM.

Comment 4 David Bolvansky 2021-04-25 13:29:16 PDT

With `-mllvm -slp-min-reg-size=64` we get this nice codegen:

foo(unsigned char*, unsigned char*):                             # @foo(unsigned char*, unsigned char*)
        vmovq   xmm0, qword ptr [rsi]           # xmm0 = mem[0],zero
        vpcmpltub       k1, xmm0, xmmword ptr [rip + .LCPI0_0]
        vpxor   xmm1, xmm1, xmm1
        vpsubb  xmm1, xmm1, xmm0
        vmovdqu8        xmm1 {k1}, xmm0
        vmovq   qword ptr [rdi], xmm1
        ret


Block RThroughput: 1.8
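
The ICC listing and this listing choose between `x` and `-x` in different ways: ICC widens each byte to 32 bits and tests it against a mask, while this codegen stays in 8-bit lanes and uses an unsigned compare. The sketch below models one lane of each in scalar C and checks both against `bar` for every byte value. It assumes that `.L_2il0floatpacket.0` holds `~63` in each 32-bit lane and that `.LCPI0_0` holds 64 in each byte lane. Neither constant is shown in this thread, and all helper names are hypothetical.

```c
#include <stdint.h>

/* Reference: the scalar function from the description. */
static uint8_t bar(uint8_t x)
{
    return x & (~63) ? -x : x;
}

/* ICC-style lane: widen to 32 bits, test against ~63, negate under
   the mask, then truncate back to a byte. */
static uint8_t bar_widened(uint8_t x)
{
    uint32_t wide = x;                       /* vpmovzxbd */
    uint32_t masked = wide & ~UINT32_C(63);  /* vpand (assumed constant) */
    if (masked != 0)                         /* vptestmd sets the mask bit */
        wide = 0u - wide;                    /* vpsubd under mask */
    return (uint8_t)wide;                    /* vpmovdb truncates */
}

/* LLVM-style lane: stay in 8 bits, unsigned compare against 64, blend. */
static uint8_t bar_compare(uint8_t x)
{
    uint8_t negated = (uint8_t)(0u - x);     /* vpsubb from zero */
    return x < 64 ? x : negated;             /* vpcmpltub + masked vmovdqu8 */
}

/* Counts byte values on which either model disagrees with bar(). */
static int count_mismatches(void)
{
    int mismatches = 0;
    for (int v = 0; v < 256; v++) {
        uint8_t x = (uint8_t)v;
        if (bar_widened(x) != bar(x) || bar_compare(x) != bar(x))
            mismatches++;
    }
    return mismatches;
}
```

Under those assumptions `count_mismatches()` returns 0, because `x & ~63` is nonzero for a byte exactly when `x >= 64`.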

Comment 5 Anton Afanasyev 2021-07-23 13:29:11 PDT

The best way to fix this bug is to wait for this patch to land: https://reviews.llvm.org/D57059 ("non-power-of-2 vectors"). I've checked that it works better:

foo(unsigned char*, unsigned char*):
        vmovq   xmm0, qword ptr [rsi]           # xmm0 = mem[0],zero
        vpcmpltub       k1, xmm0, xmmword ptr [rip + .LCPI0_0]
        vpxor   xmm1, xmm1, xmm1
        vpsubb  xmm1, xmm1, xmm0
        vmovdqu8        xmm1 {k1}, xmm0
        mov     ax, 255
        kmovd   k1, eax
        vmovdqu8        xmmword ptr [rdi] {k1}, xmm1
        ret

This vectorized codegen differs from the `-slp-min-reg-size=64` output, though: Block RThroughput is 2.2.

This difference comes from using `@llvm.masked.store` instead of `store`.
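
To illustrate the difference, the sketch below models a byte-granular masked store in scalar C. The mask value 255 is taken from the `mov ax, 255` / `kmovd k1, eax` pair in the listing above. The function names and the fixed 16-lane layout are illustrative assumptions, not LLVM's implementation of `@llvm.masked.store`.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a 16-lane byte masked store: lane i is written only
   if bit i of `mask` is set; other destination bytes stay untouched. */
static void masked_store_bytes(uint8_t *dst, const uint8_t value[16],
                               uint16_t mask)
{
    for (size_t i = 0; i < 16; i++) {
        if (mask & (1u << i))
            dst[i] = value[i];
    }
}

/* Returns 1 if a store with mask 255 changes exactly the low 8 bytes. */
static int mask_255_writes_low_8_bytes(void)
{
    uint8_t value[16], dst[16];
    for (size_t i = 0; i < 16; i++) {
        value[i] = (uint8_t)(i + 1);
        dst[i] = 0xEE;                      /* sentinel */
    }
    masked_store_bytes(dst, value, 255);
    for (size_t i = 0; i < 16; i++) {
        uint8_t expected = i < 8 ? value[i] : 0xEE;
        if (dst[i] != expected)
            return 0;
    }
    return 1;
}
```

With mask 255 only the low eight bytes are written, which is the same memory effect as the 64-bit `vmovq` store in Comment 4, but the listing needs the extra `mov` and `kmovd` to materialize the mask.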

Comment 6 Anton Afanasyev 2021-09-04 15:29:38 PDT

Added a test to track the issue: https://reviews.llvm.org/rGdd028c359e09