Suboptimal code for _mm256_zextsi128_si256(_mm_set1_epi8(-1)) #45153

Closed
NemoPublius opened this issue May 5, 2020 · 11 comments
Labels: backend:X86, bugzilla (Issues migrated from bugzilla)

Comments

@NemoPublius

Bugzilla Link: 45808
Version: trunk
OS: Linux
CC: @topperc, @fhahn, @LebedevRI, @RKSimon, @rotateright
Fixed by commit(s): fe6f5ba, b8a725274c22, 3521ecf1f8a3

Extended Description

Related: Bug #45806 and https://stackoverflow.com/q/61601902/

I am trying to produce an AVX2 mask with all-ones in the lower lane and all-zeroes in the upper lane of a YMM register. The code I am using is:

__m256i mask = _mm256_zextsi128_si256(_mm_set1_epi8(-1));

This should produce a single instruction like vpcmpeqd %xmm0,%xmm0,%xmm0, but Clang insists on putting the value into memory and loading it.


The behavior in context is even more odd:

__m256i minmax(__m256i v1, __m256i v2)
{
    __m256i comp = _mm256_cmpgt_epi64(v1, v2);
    __m256i mask = _mm256_zextsi128_si256(_mm_set1_epi8(-1));
    return _mm256_blendv_epi8(v2, v1, _mm256_xor_si256(comp, mask));
}

This goes through a series of contortions (extracting, shifting, and expanding 128-bit registers) even though the result I want is straightforward.

Godbolt example: https://gcc.godbolt.org/z/GPhJ6s
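For context, a self-contained version of the mask construction (a sketch assuming AVX2 is enabled; the function name lower_lane_mask is mine, not from the report). The comment spells out the single instruction being asked for: a VEX-encoded write to an XMM register implicitly zeroes the upper 128 bits of the containing YMM register, so vpcmpeqd on xmm0 alone produces exactly this mask.

#include <immintrin.h>

// All-ones in the lower 128-bit lane, all-zeroes in the upper lane.
// Hoped-for codegen is a single instruction:
//     vpcmpeqd %xmm0, %xmm0, %xmm0
// since writing xmm0 implicitly zeroes ymm0[255:128].
__m256i lower_lane_mask(void) {
    return _mm256_zextsi128_si256(_mm_set1_epi8(-1));
}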

RKSimon commented May 5, 2020

define <4 x i64> @Z6minmaxDv4_xS(<4 x i64> %0, <4 x i64> %1) {
%3 = icmp sgt <4 x i64> %0, %1
%4 = xor <4 x i1> %3, <i1 true, i1 true, i1 false, i1 false>
%5 = select <4 x i1> %4, <4 x i64> %0, <4 x i64> %1
ret <4 x i64> %5
}

@LebedevRI

define <4 x i64> @Z6minmaxDv4_xS(<4 x i64> %0, <4 x i64> %1) {
%3 = icmp sgt <4 x i64> %0, %1
%4 = xor <4 x i1> %3, <i1 true, i1 true, i1 false, i1 false>
%5 = select <4 x i1> %4, <4 x i64> %0, <4 x i64> %1
ret <4 x i64> %5
}

So basically the code that is handling materialization of all-ones constants as pcmpeq needs to be taught that if lower portion is all-ones and the rest is zeros, it might still be profitable, i'm guessing?

RKSimon commented May 5, 2020

We might need to improve PromoteMaskArithmetic to better handle selects.

CC'ing Florian who did a load of improvements in D72524.

RKSimon commented May 5, 2020

So basically the code that is handling materialization of all-ones constants
as pcmpeq needs to be taught that if lower portion is all-ones and the rest
is zeros, it might still be profitable, i'm guessing?

[Bug #42653] discusses something similar for rematerializable lower 'allones' subvector masks once we avoid the unnecessary packss/pmovsx.

RKSimon commented May 8, 2020

rGfe6f5ba0bffd - added test case
rGb8a725274c22 - fixed PACKSS promotion issues

Current AVX2 Codegen:

.LCPI0_0:
.quad 1 # 0x1
.quad 1 # 0x1
.quad 0 # 0x0
.quad 0 # 0x0
Z6minmaxDv4_xS: # @Z6minmaxDv4_xS
vpcmpgtq %ymm1, %ymm0, %ymm2
vpxor .LCPI0_0(%rip), %ymm2, %ymm2
vpsllq $63, %ymm2, %ymm2
vblendvpd %ymm2, %ymm0, %ymm1, %ymm0
retq

@LebedevRI

rGfe6f5ba0bffd - added test case
rGb8a725274c22 - fixed PACKSS promotion issues

Current AVX2 Codegen:

.LCPI0_0:
.quad 1 # 0x1
.quad 1 # 0x1
.quad 0 # 0x0
.quad 0 # 0x0
Z6minmaxDv4_xS: # @Z6minmaxDv4_xS
vpcmpgtq %ymm1, %ymm0, %ymm2 # produces either -1 or 0
vpxor .LCPI0_0(%rip), %ymm2, %ymm2 # Inverts lowest bit only
vpsllq $63, %ymm2, %ymm2 # moves lowest bit into highest bit
vblendvpd %ymm2, %ymm0, %ymm1, %ymm0 # uses highest bit to control blending
retq

Can't we get rid of the vpsllq by using -1 instead of 1 in xor?
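To spell out why that would work (a hedged worked example, not from the thread): vblendvpd reads only bit 63 of each mask element, and for a lower-lane compare result x of 0 or -1, x ^ -1 already has the same bit 63 that the current (x ^ 1) << 63 sequence computes, so the shift adds nothing.

#include <stdint.h>
#include <stdio.h>

// Check, for both possible vpcmpgtq element values (0 and -1) in the lower
// lane, that "xor with 1 then shift into bit 63" and "xor with -1" agree on
// bit 63, which is the only bit vblendvpd looks at.
int main(void) {
    const int64_t compare_results[] = {0, -1};
    for (int i = 0; i < 2; ++i) {
        int64_t x = compare_results[i];
        uint64_t shifted = (uint64_t)(x ^ 1) << 63;   // current codegen
        uint64_t direct  = (uint64_t)(x ^ -1);        // suggested codegen
        printf("x=%lld: bit63 via shift=%llu, via xor -1=%llu\n",
               (long long)x,
               (unsigned long long)(shifted >> 63),
               (unsigned long long)(direct >> 63));
    }
    return 0;
}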

RKSimon commented May 8, 2020

That's the next step - there is plenty of code that tries to do that kind of thing - and nearly all of it ignores vectors :-)

RKSimon commented Jun 22, 2020

Initial patch: https://reviews.llvm.org/D82257

This will make sure we're using -1/0 sign masks, but it doesn't yet materialize the constant using VPCMPEQ xmm (with implicit zeroing of the upper elements).

RKSimon commented Jul 1, 2020

Current AVX2 Codegen:

.LCPI0_0:
.quad -1 # 0xffffffffffffffff
.quad -1 # 0xffffffffffffffff
.quad 0 # 0x0
.quad 0 # 0x0
Z6minmaxDv4_xS: # @Z6minmaxDv4_xS
vpcmpgtq %ymm1, %ymm0, %ymm2
vpxor .LCPI0_0(%rip), %ymm2, %ymm2
vblendvpd %ymm2, %ymm0, %ymm1, %ymm0
retq

llvmbot transferred this issue from llvm/llvm-bugzilla-archive on Dec 10, 2021
RKSimon commented Feb 12, 2022

I think we can close this: https://gcc.godbolt.org/z/E6bfj1vx6

If the mask has a single use, we create a constant that can be folded into the xor.

If the mask has multiple uses then we perform the materialization trick - although neither clang nor gcc recognises the implicit zeroing of the upper bits by vpcmpeq xmm, so both perform an (almost free) vmovdqa to do this.
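To illustrate the single-use vs multi-use distinction (a hypothetical example, not code from this issue; the function name two_uses is mine), this is the kind of pattern where the mask gets materialized once instead of being folded into a single xor constant:

#include <immintrin.h>

// The mask is used twice, so per the comment above it is materialized
// (vpcmpeq xmm plus an almost-free vmovdqa to clear the upper lane) rather
// than folded into one instruction's memory operand.
__m256i two_uses(__m256i a, __m256i b) {
    __m256i mask = _mm256_zextsi128_si256(_mm_set1_epi8(-1));
    // Select the lower lane from a and the upper lane from b.
    return _mm256_or_si256(_mm256_and_si256(a, mask),
                           _mm256_andnot_si256(mask, b));
}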

NemoPublius commented Feb 14, 2022

@RKSimon Loading from memory is not "almost free" compared to a single all-register instruction. That is sort of the entire point of this bug report.

I guess I will have to stick to inline asm for this.
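For reference, one form the inline-asm workaround could take (a hypothetical sketch, not code from this thread; it assumes GCC/Clang extended asm with AVX2 enabled and uses the x86 'x' operand modifier to name the register in its XMM form):

#include <immintrin.h>

// vpcmpeqd on the XMM view of the destination sets the lower 128 bits to
// all-ones, and the VEX encoding implicitly zeroes the upper 128 bits of
// the full YMM register, so no extra zeroing instruction is needed.
static inline __m256i lower_lane_all_ones(void) {
    __m256i mask;
    __asm__("vpcmpeqd %x0, %x0, %x0" : "=x"(mask));
    return mask;
}

Whether this stays a single instruction in context still depends on surrounding register allocation, which is part of why fixing it at the intrinsics level is preferable.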
