-
Notifications
You must be signed in to change notification settings - Fork 12.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
_mm_undefined_si128 compiles to incorrect SSE code #31524
Comments
I checked more and Clang generates incorrect code even with -O0. In the case of -O0, the problem is different: Clang is doing a pcmpeqd of two different uninitialized memory addresses. This is still incorrect. It seems to me that Clang is treating _mm_undefined_si128() the same way as uninitialized memory, which is incorrect. Compare these two functions: __m128i ReturnOneBits() bool ReturnTrue() ReturnTrue() always returns false. This is acceptable because the C++ standard states that using the value of uninitialized variables is undefined behavior, so a seemingly impossible result is permissible. However, for _mm_undefined_si128, it has a behavior defined by Intel. It seems reasonable to me that _mm_undefined_si128 returns some unspecified value, but once _mm_undefined_si128 is "called", the value returned becomes fixed. _mm_undefined_si128 really wouldn't serve a purpose if it were like C++'s uninitialize variable rules, because you could just use an uninitialized __m128i for this. |
http://www.playingwithpointers.com/problem-with-undef.html So: define i1 @cmp_undef() { is not 'true', it is: $ ./opt -instsimplify -S cmpundef.ll I agree that we can solve/work-around this as a front-end bug: The suggestion to use setzero seems like a good solution to me, but we'll need to check all of the users of the undef intrinsics to make sure that doesn't cause extra instructions in the asm - or we'd defeat the point of these lovely x86 intrinsics. :) Another option would be to let the intrinsic survive to the backend and clean it up there. This might have the unintended consequence (like some other LLVM optimization intrinsics) of hindering IR optimizations though. |
Btw, in the new poison/undef scheme, we could have _mm_undefined_si128() return |
Sadly, the one-line fix to clang: Index: lib/CodeGen/CGBuiltin.cpp--- lib/CodeGen/CGBuiltin.cpp (revision 297300)
...causes many regression test failures: Failing Tests (9): ...because the x86 headers use _mm_undefined* in the definitions of other intrinsics. Example: #define _mm_permute_pd(A, C) extension ({ I think that's a legitimate undef usage there, but couldn't this have used the __builtin_shufflevector version with only 2 params instead? Ideally, someone would fix those headers... A lazier alternative would be to replace the undefined intrinsic uses in shufflevector with the first param. An even lazier alternative would be to just fix all the test failures. The optimizer should easily see that the zeroinitializer is actually not used in cases like the above? |
There isn't a form of builtin_shufflevector that takes a single input and a list of indices. There's an exactly 2 argument form that takes a vector mask, but I've never seen that used. |
Hmmm...there is one regression test for that form, but I'm not sure how to constant initialize that vector mask. If there's no objection, I'll work on the lazier solution. InstCombine has the smarts to convert a zero operand to an undef in all the cases I see in these tests via SimplifyDemandedVectorElts(). |
And even if instcombine doesn't run due to -O0, I think SelectionDAGBuilder will turn unused inputs to shuffles to undef as well. |
In the longer term would the plan be to try and go back to the 'freeze poison' approach? |
Yes - I haven't followed the freeze discussion very closely, but Sanjoy's description sounds like a match for the undef intrinsic needs. But (correct me if I'm wrong), freeze doesn't exist today, so we need an immediate fix for those undef intrinsics to avoid miscompiles. If we have a few extra xorps in the final code, it probably doesn't matter much since we tend to sprinkle those around anyway. :) |
Yes, sorry I didn't mean to say we shouldn't go ahead now. Just adding a TODO comment about it should be enough, and hopefully someone will remember to change this again... |
Patch posted for review: |
This should be fixed after: If you have any cases where you're using mm_undefined* and clang is producing an unnecessary zero-ing instruction (xorps, etc) in optimized assembly, please file a new bug report. Thanks! |
Extended Description
_mm_undefined_si128 (internally, __builtin_ia32_undef128) is designed to allow writing x86 SSE code that uses the existing values of SSE registers without regard to their current contents. An example is the following code to generate an SSE register with all "1" bits:
__m128i ReturnOneBits()
{
__m128i dummy = _mm_undefined_si128();
return _mm_cmpeq_epi32(dummy, dummy);
}
It should compile to something like this:
pcmpeqd %xmm0, %xmm0
retq
But instead, with -O1, -O2 or -O3, it compiles to this:
xorps %xmm0, %xmm0
retq
In other words, it returns all "0" bits instead of all "1" bits. (With optimizations disabled, the generated code reads uninitialized memory then does pcmpeqd on the two values,
The following function does compile correctly, and clang in fact sees that zeroing a register beforehand is unnecessary:
__m128i ReturnOneBits()
{
__m128i dummy = _mm_setzero_si128();
return _mm_cmpeq_epi32(dummy, dummy);
// -or-
return _mm_set_epi32(-1, -1, -1, -1);
}
These compile to:
pcmpeqd %xmm0, %xmm0
retq
Because clang's optimizer realizes that it doesn't care about the previous value of xmm0, it actually would be an acceptable solution if __builtin_ia32_undef128 were removed from the compiler and _mm_undefined_si128 simply called _mm_setzero_si128. (This is what Microsoft Visual C++ does, in fact.)
I have not tried the other _mm*_undefined* functions.
The text was updated successfully, but these errors were encountered: