typedef unsigned char uint8_t; static inline uint8_t bar( uint8_t x ) { return x&(~63) ? -x : x; } void foo( uint8_t *__restrict dst, uint8_t *__restrict src) { for( int x = 0; x < 8; x++ ) dst[x] = bar(src[x]); } ICC: foo(unsigned char*, unsigned char*): vpmovzxbd ymm2, QWORD PTR [rsi] #5.25 vpand ymm0, ymm2, YMMWORD PTR .L_2il0floatpacket.0[rip] #5.12 vpxor ymm1, ymm1, ymm1 #5.25 vptestmd k1, ymm0, ymm0 #5.12 vpsubd ymm2{k1}, ymm1, ymm2 #5.25 vpmovdb QWORD PTR [rdi], ymm2 #13.6 vzeroupper #14.1 ret LLVM does not vectorize it with avx/avx2/avx512 - cost model issue? https://godbolt.org/z/Kheeec4cG
Ok, with typedef unsigned short uint8_t; LLVM produces good codegen. So it looks like a cost model issue for (U)INT8.
Not sure if its purely a cost model issue, but also to do with 8 x uint8_t being smaller than the 128-bit vector target minimum.
But according to llvm-mca, ICC's codegen is much better Block RThroughput is 4, for LLVM is 9.8
With -mllvm -slp-min-reg-size=64 We have this nice codegen foo(unsigned char*, unsigned char*): # @foo(unsigned char*, unsigned char*) vmovq xmm0, qword ptr [rsi] # xmm0 = mem[0],zero vpcmpltub k1, xmm0, xmmword ptr [rip + .LCPI0_0] vpxor xmm1, xmm1, xmm1 vpsubb xmm1, xmm1, xmm0 vmovdqu8 xmm1 {k1}, xmm0 vmovq qword ptr [rdi], xmm1 ret Block RThroughput: 1.8
The best way to fix this bug is to wait for this patch to land: https://reviews.llvm.org/D57059 ("non-power-of-2 vectors"). I've checked that it works better: foo(unsigned char*, unsigned char*): vmovq xmm0, qword ptr [rsi] # xmm0 = mem[0],zero vpcmpltub k1, xmm0, xmmword ptr [rip + .LCPI0_0] vpxor xmm1, xmm1, xmm1 vpsubb xmm1, xmm1, xmm0 vmovdqu8 xmm1 {k1}, xmm0 mov ax, 255 kmovd k1, eax vmovdqu8 xmmword ptr [rdi] {k1}, xmm1 ret Although this vectorized codegen is different compared to `-slp-min-reg-size=64`: Block RThroughput: 2.2 This difference comes from using `@llvm.masked.store` instead of `store`.
Added test to track issue: https://reviews.llvm.org/rGdd028c359e09