Suboptimal code for _mm256_zextsi128_si256(_mm_set1_epi8(-1)) #45153
define <4 x i64> @Z6minmaxDv4_xS(<4 x i64> %0, <4 x i64> %1) {
We might need to improve PromoteMaskArithmetic to better handle selects. CC'ing Florian, who did a load of improvements in D72524.
[Bug #42653] discusses something similar for rematerializable lower 'allones' subvector masks, once we avoid the unnecessary packss/pmovsx.
rGfe6f5ba0bffd - added test case. Current AVX2 codegen:
.LCPI0_0:
Can't we get rid of the vpsllq by using -1 instead of 1 in the xor?
That's the next step - there is plenty of code that tries to do that kind of thing - and nearly all of it ignores vectors :-)
Initial patch: https://reviews.llvm.org/D82257. This makes sure we're using -1/0 sign masks, but it doesn't yet materialize the constant using VPCMPEQ xmm (with implicit zeroing of the upper elements).
Current AVX2 Codegen: .LCPI0_0: |
I think we can close this: https://gcc.godbolt.org/z/E6bfj1vx6 If we have one use of the mask, we create a constant to fold into the xor. If the mask has multiple uses, then we perform the materialization trick - although neither clang nor gcc recognises the implicit zeroing of the upper bits by vpcmpeq xmm; both perform an (almost free) vmovdqa to do this.
@RKSimon Loading from memory is not "almost free" compared to a single all-register instruction. That is sort of the entire point of this bug report. I guess I will have to stick to inline asm for this. |
Extended Description
Related: Bug #45806 and https://stackoverflow.com/q/61601902/
I am trying to produce an AVX2 mask with all-ones in the lower lane and all-zeroes in the upper lane of a YMM register. The code I am using is:
This should produce a single instruction like
vpcmpeqd %xmm0,%xmm0,%xmm0
, but Clang insists on putting the value into memory and loading it.
The behavior in context is even more odd:
This goes through a bunch of contortions with extracting, shifting, and expanding 128-bit registers, even though the result I want is pretty straightforward.
Godbolt example: https://gcc.godbolt.org/z/GPhJ6s