Related: Bug #45806 and https://stackoverflow.com/q/61601902/

I am trying to produce an AVX2 mask with all-ones in the lower lane and all-zeroes in the upper lane of a YMM register. The code I am using is:

    __m256i mask = _mm256_zextsi128_si256(_mm_set1_epi8(-1));

This should produce a single instruction like `vpcmpeqd %xmm0,%xmm0,%xmm0` (writing an XMM register implicitly zeroes the upper elements of the containing YMM register), but Clang insists on putting the value into memory and loading it.

The behavior in context is even more odd:

    __m256i minmax(__m256i v1, __m256i v2) {
        __m256i comp = _mm256_cmpgt_epi64(v1, v2);
        __m256i mask = _mm256_zextsi128_si256(_mm_set1_epi8(-1));
        return _mm256_blendv_epi8(v2, v1, _mm256_xor_si256(comp, mask));
    }

This goes through a bunch of contortions with extracting, shifting, and expanding 128-bit registers, when I feel like the result I want is pretty straightforward.

Godbolt example: https://gcc.godbolt.org/z/GPhJ6s
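For reference, here is the problem reduced to just the mask materialization (the helper name is mine, purely for illustration):

    #include <immintrin.h>

    // All-ones in the low 128 bits, zeros in the high 128 bits.
    // Hoped-for codegen: a single vpcmpeqd %xmm0,%xmm0,%xmm0, since
    // writing an XMM register zeroes bits 255:128 of the containing YMM.
    __m256i lower_lane_ones() {
        return _mm256_zextsi128_si256(_mm_set1_epi8(-1));
    }

Clang currently compiles even this standalone function to a constant-pool load.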
define <4 x i64> @_Z6minmaxDv4_xS_(<4 x i64> %0, <4 x i64> %1) {
  %3 = icmp sgt <4 x i64> %0, %1
  %4 = xor <4 x i1> %3, <i1 true, i1 true, i1 false, i1 false>
  %5 = select <4 x i1> %4, <4 x i64> %0, <4 x i64> %1
  ret <4 x i64> %5
}
(In reply to Simon Pilgrim from comment #1)
> define <4 x i64> @_Z6minmaxDv4_xS_(<4 x i64> %0, <4 x i64> %1) {
>   %3 = icmp sgt <4 x i64> %0, %1
>   %4 = xor <4 x i1> %3, <i1 true, i1 true, i1 false, i1 false>

So basically the code that handles materialization of all-ones constants as pcmpeq needs to be taught that if the lower portion is all-ones and the rest is zeros, it might still be profitable, I'm guessing?

>   %5 = select <4 x i1> %4, <4 x i64> %0, <4 x i64> %1
>   ret <4 x i64> %5
> }
We might need to improve PromoteMaskArithmetic to better handle selects. CC'ing Florian who did a load of improvements in D72524.
(In reply to Roman Lebedev from comment #2)
> So basically the code that handles materialization of all-ones constants
> as pcmpeq needs to be taught that if the lower portion is all-ones and
> the rest is zeros, it might still be profitable, I'm guessing?

[Bug #42653] discusses something similar for rematerializable lower 'allones' subvector masks, once we avoid the unnecessary packss/pmovsx.
rGfe6f5ba0bffd - added test case
rGb8a725274c22 - fixed PACKSS promotion issues

Current AVX2 Codegen:

.LCPI0_0:
        .quad   1                       # 0x1
        .quad   1                       # 0x1
        .quad   0                       # 0x0
        .quad   0                       # 0x0
_Z6minmaxDv4_xS_:                       # @_Z6minmaxDv4_xS_
        vpcmpgtq        %ymm1, %ymm0, %ymm2
        vpxor   .LCPI0_0(%rip), %ymm2, %ymm2
        vpsllq  $63, %ymm2, %ymm2
        vblendvpd       %ymm2, %ymm0, %ymm1, %ymm0
        retq
(In reply to Simon Pilgrim from comment #5)
> rGfe6f5ba0bffd - added test case
> rGb8a725274c22 - fixed PACKSS promotion issues
>
> Current AVX2 Codegen:
>
> .LCPI0_0:
>         .quad   1                       # 0x1
>         .quad   1                       # 0x1
>         .quad   0                       # 0x0
>         .quad   0                       # 0x0
> _Z6minmaxDv4_xS_:                       # @_Z6minmaxDv4_xS_
>         vpcmpgtq        %ymm1, %ymm0, %ymm2

produces either -1 or 0

>         vpxor   .LCPI0_0(%rip), %ymm2, %ymm2

inverts the lowest bit only

>         vpsllq  $63, %ymm2, %ymm2

moves the lowest bit into the highest bit

>         vblendvpd       %ymm2, %ymm0, %ymm1, %ymm0

uses the highest bit to control blending

Can't we get rid of the vpsllq by using -1 instead of 1 in the xor?

>         retq
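To spell out the equivalence behind that suggestion, here is a scalar sketch (mine, not from the thread): for a lane value m in {0, -1} coming out of vpcmpgtq, (m ^ 1) << 63 and m ^ -1 agree in the sign bit, which is all vblendvpd looks at:

    #include <cassert>
    #include <cstdint>

    int main() {
        for (uint64_t m : {uint64_t(0), ~uint64_t(0)}) { // vpcmpgtq lane results
            uint64_t shifted = (m ^ 1) << 63;    // current: vpxor with 1, then vpsllq $63
            uint64_t direct  = m ^ ~uint64_t(0); // proposed: vpxor with -1, no shift
            assert((shifted >> 63) == (direct >> 63)); // same sign bit either way
        }
    }

The upper lanes, whose xor constant stays 0, are unaffected: for m in {0, -1}, m << 63 and m already have the same sign bit.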
That's the next step - there is plenty of code that tries to do that kind of thing - and nearly all of it ignores vectors :-)
Initial patch: https://reviews.llvm.org/D82257

This will make sure we're using -1/0 sign masks, but it doesn't yet materialize the constant using VPCMPEQ xmm (with implicit zeroing of the upper elements).
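In intrinsics terms, the materialization we'd still like the backend to pick would look roughly like this (a sketch on my part; the helper name is hypothetical):

    #include <immintrin.h>

    __m256i mask_via_vpcmpeq() {
        // vpcmpeqd xmm,xmm,xmm: a register compared with itself is all-ones.
        __m128i ones = _mm_cmpeq_epi32(_mm_setzero_si128(), _mm_setzero_si128());
        // Writing an XMM register implicitly zeroes the upper YMM elements,
        // yielding the -1,-1,0,0 constant with no constant-pool load.
        return _mm256_zextsi128_si256(ones);
    }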
Current AVX2 Codegen:

.LCPI0_0:
        .quad   -1                      # 0xffffffffffffffff
        .quad   -1                      # 0xffffffffffffffff
        .quad   0                       # 0x0
        .quad   0                       # 0x0
_Z6minmaxDv4_xS_:                       # @_Z6minmaxDv4_xS_
        vpcmpgtq        %ymm1, %ymm0, %ymm2
        vpxor   .LCPI0_0(%rip), %ymm2, %ymm2
        vblendvpd       %ymm2, %ymm0, %ymm1, %ymm0
        retq