Suboptimal code for _mm256_zextsi128_si256(_mm_set1_epi8(-1)) #45153

Closed
NemoPublius opened this issue May 5, 2020 · 11 comments
Labels: backend:X86, bugzilla (Issues migrated from bugzilla)

Comments

@NemoPublius

Bugzilla Link: 45808
Version: trunk
OS: Linux
CC: @topperc, @fhahn, @LebedevRI, @RKSimon, @rotateright
Fixed by commit(s): fe6f5ba, b8a725274c22, 3521ecf1f8a3

Extended Description

Related: Bug #45806 and https://stackoverflow.com/q/61601902/

I am trying to produce an AVX2 mask with all-ones in the lower lane and all-zeroes in the upper lane of a YMM register. The code I am using is:

__m256i mask = _mm256_zextsi128_si256(_mm_set1_epi8(-1));

This should produce a single instruction like vpcmpeqd %xmm0,%xmm0,%xmm0, but Clang insists on putting the value into memory and loading it.


The behavior in context is even more odd:

__m256i minmax(__m256i v1, __m256i v2)
{
    __m256i comp = _mm256_cmpgt_epi64(v1, v2);
    __m256i mask = _mm256_zextsi128_si256(_mm_set1_epi8(-1));
    return _mm256_blendv_epi8(v2, v1, _mm256_xor_si256(comp, mask));
}

This goes through a series of contortions (extracting, shifting, and expanding 128-bit registers) even though the result I want is straightforward.

Godbolt example: https://gcc.godbolt.org/z/GPhJ6s
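For context, a self-contained version of the mask construction (a sketch assuming AVX2 is enabled; the function name lower_lane_mask is mine, not from the report). The comment spells out the single instruction being asked for: a VEX-encoded write to an XMM register implicitly zeroes the upper 128 bits of the containing YMM register, so vpcmpeqd on xmm0 alone produces exactly this mask.

#include <immintrin.h>

// All-ones in the lower 128-bit lane, all-zeroes in the upper lane.
// Hoped-for codegen is a single instruction:
//     vpcmpeqd %xmm0, %xmm0, %xmm0
// since writing xmm0 implicitly zeroes ymm0[255:128].
__m256i lower_lane_mask(void) {
    return _mm256_zextsi128_si256(_mm_set1_epi8(-1));
}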

RKSimon commented May 5, 2020

define <4 x i64> @Z6minmaxDv4_xS(<4 x i64> %0, <4 x i64> %1) {
%3 = icmp sgt <4 x i64> %0, %1
%4 = xor <4 x i1> %3, <i1 true, i1 true, i1 false, i1 false>
%5 = select <4 x i1> %4, <4 x i64> %0, <4 x i64> %1
ret <4 x i64> %5
}

@LebedevRI

define <4 x i64> @Z6minmaxDv4_xS(<4 x i64> %0, <4 x i64> %1) {
%3 = icmp sgt <4 x i64> %0, %1
%4 = xor <4 x i1> %3, <i1 true, i1 true, i1 false, i1 false>
%5 = select <4 x i1> %4, <4 x i64> %0, <4 x i64> %1
ret <4 x i64> %5
}

So basically the code that is handling materialization of all-ones constants as pcmpeq needs to be taught that if lower portion is all-ones and the rest is zeros, it might still be profitable, i'm guessing?

RKSimon commented May 5, 2020

We might need to improve PromoteMaskArithmetic to better handle selects.

CC'ing Florian who did a load of improvements in D72524.

RKSimon commented May 5, 2020

So basically the code that is handling materialization of all-ones constants
as pcmpeq needs to be taught that if lower portion is all-ones and the rest
is zeros, it might still be profitable, i'm guessing?

[Bug #42653] discusses something similar for rematerializable lower 'allones' subvector masks once we avoid the unnecessary packss/pmovsx.

RKSimon commented May 8, 2020

rGfe6f5ba0bffd - added test case
rGb8a725274c22 - fixed PACKSS promotion issues

Current AVX2 Codegen:

.LCPI0_0:
.quad 1 # 0x1
.quad 1 # 0x1
.quad 0 # 0x0
.quad 0 # 0x0
Z6minmaxDv4_xS: # @Z6minmaxDv4_xS
vpcmpgtq %ymm1, %ymm0, %ymm2
vpxor .LCPI0_0(%rip), %ymm2, %ymm2
vpsllq $63, %ymm2, %ymm2
vblendvpd %ymm2, %ymm0, %ymm1, %ymm0
retq

@LebedevRI

rGfe6f5ba0bffd - added test case
rGb8a725274c22 - fixed PACKSS promotion issues

Current AVX2 Codegen:

.LCPI0_0:
.quad 1 # 0x1
.quad 1 # 0x1
.quad 0 # 0x0
.quad 0 # 0x0
Z6minmaxDv4_xS: # @Z6minmaxDv4_xS
vpcmpgtq %ymm1, %ymm0, %ymm2 # produces either -1 or 0
vpxor .LCPI0_0(%rip), %ymm2, %ymm2 # Inverts lowest bit only
vpsllq $63, %ymm2, %ymm2 # moves lowest bit into highest bit
vblendvpd %ymm2, %ymm0, %ymm1, %ymm0 # uses highest bit to control blending
retq

Can't we get rid of the vpsllq by using -1 instead of 1 in xor?
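To spell out why that would work (a hedged worked example, not from the thread): vblendvpd reads only bit 63 of each mask element, and for a lower-lane compare result x of 0 or -1, x ^ -1 already has the same bit 63 that the current (x ^ 1) << 63 sequence computes, so the shift adds nothing.

#include <stdint.h>
#include <stdio.h>

// Check, for both possible vpcmpgtq element values (0 and -1) in the lower
// lane, that "xor with 1 then shift into bit 63" and "xor with -1" agree on
// bit 63, which is the only bit vblendvpd looks at.
int main(void) {
    const int64_t compare_results[] = {0, -1};
    for (int i = 0; i < 2; ++i) {
        int64_t x = compare_results[i];
        uint64_t shifted = (uint64_t)(x ^ 1) << 63;   // current codegen
        uint64_t direct  = (uint64_t)(x ^ -1);        // suggested codegen
        printf("x=%lld: bit63 via shift=%llu, via xor -1=%llu\n",
               (long long)x,
               (unsigned long long)(shifted >> 63),
               (unsigned long long)(direct >> 63));
    }
    return 0;
}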

RKSimon commented May 8, 2020

That's the next step - there is plenty of code that tries to do that kind of thing - and nearly all of it ignores vectors :-)

RKSimon commented Jun 22, 2020

Initial patch: https://reviews.llvm.org/D82257

This will make sure we're using -1/0 sign masks, but it doesn't yet materialize the constant using VPCMPEQ xmm (with implicit zeroing of the upper elements).

RKSimon commented Jul 1, 2020

Current AVX2 Codegen:

.LCPI0_0:
.quad -1 # 0xffffffffffffffff
.quad -1 # 0xffffffffffffffff
.quad 0 # 0x0
.quad 0 # 0x0
Z6minmaxDv4_xS: # @Z6minmaxDv4_xS
vpcmpgtq %ymm1, %ymm0, %ymm2
vpxor .LCPI0_0(%rip), %ymm2, %ymm2
vblendvpd %ymm2, %ymm0, %ymm1, %ymm0
retq

llvmbot transferred this issue from llvm/llvm-bugzilla-archive on Dec 10, 2021
RKSimon commented Feb 12, 2022

I think we can close this: https://gcc.godbolt.org/z/E6bfj1vx6

If the mask has a single use, we create a constant that can be folded into the xor.

If the mask has multiple uses then we perform the materialization trick - although neither clang nor gcc recognises the implicit zeroing of the upper bits by vpcmpeq xmm, so both perform an (almost free) vmovdqa to do this.
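To illustrate the single-use vs multi-use distinction (a hypothetical example, not code from this issue; the function name two_uses is mine), this is the kind of pattern where the mask gets materialized once instead of being folded into a single xor constant:

#include <immintrin.h>

// The mask is used twice, so per the comment above it is materialized
// (vpcmpeq xmm plus an almost-free vmovdqa to clear the upper lane) rather
// than folded into one instruction's memory operand.
__m256i two_uses(__m256i a, __m256i b) {
    __m256i mask = _mm256_zextsi128_si256(_mm_set1_epi8(-1));
    // Select the lower lane from a and the upper lane from b.
    return _mm256_or_si256(_mm256_and_si256(a, mask),
                           _mm256_andnot_si256(mask, b));
}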

NemoPublius commented Feb 14, 2022

@RKSimon Loading from memory is not "almost free" compared to a single all-register instruction. That is sort of the entire point of this bug report.

I guess I will have to stick to inline asm for this.
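For reference, one form the inline-asm workaround could take (a hypothetical sketch, not code from this thread; it assumes GCC/Clang extended asm with AVX2 enabled and uses the x86 'x' operand modifier to name the register in its XMM form):

#include <immintrin.h>

// vpcmpeqd on the XMM view of the destination sets the lower 128 bits to
// all-ones, and the VEX encoding implicitly zeroes the upper 128 bits of
// the full YMM register, so no extra zeroing instruction is needed.
static inline __m256i lower_lane_all_ones(void) {
    __m256i mask;
    __asm__("vpcmpeqd %x0, %x0, %x0" : "=x"(mask));
    return mask;
}

Whether this stays a single instruction in context still depends on surrounding register allocation, which is part of why fixing it at the intrinsics level is preferable.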
