Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

_mm_undefined_si128 compiles to incorrect SSE code #31524

Closed
Myriachan mannequin opened this issue Mar 8, 2017 · 12 comments
Closed

_mm_undefined_si128 compiles to incorrect SSE code #31524

Myriachan mannequin opened this issue Mar 8, 2017 · 12 comments
Labels
bugzilla Issues migrated from bugzilla clang:codegen miscompilation

Comments

@Myriachan
Copy link
Mannequin

Myriachan mannequin commented Mar 8, 2017

Bugzilla Link 32176
Resolution FIXED
Resolved on Mar 12, 2017 12:50
Version 4.0
OS All
CC @topperc,@aqjune,@RKSimon,@nunoplopes,@regehr,@sanjoy,@rotateright

Extended Description

_mm_undefined_si128 (internally, __builtin_ia32_undef128) is designed to allow writing x86 SSE code that uses the existing values of SSE registers without regard to their current contents. An example is the following code to generate an SSE register with all "1" bits:

__m128i ReturnOneBits()
{
__m128i dummy = _mm_undefined_si128();
return _mm_cmpeq_epi32(dummy, dummy);
}

It should compile to something like this:

pcmpeqd %xmm0, %xmm0
retq

But instead, with -O1, -O2 or -O3, it compiles to this:

xorps %xmm0, %xmm0
retq

In other words, it returns all "0" bits instead of all "1" bits. (With optimizations disabled, the generated code reads uninitialized memory then does pcmpeqd on the two values,

The following function does compile correctly, and clang in fact sees that zeroing a register beforehand is unnecessary:

__m128i ReturnOneBits()
{
__m128i dummy = _mm_setzero_si128();
return _mm_cmpeq_epi32(dummy, dummy);
// -or-
return _mm_set_epi32(-1, -1, -1, -1);
}

These compile to:

pcmpeqd %xmm0, %xmm0
retq

Because clang's optimizer realizes that it doesn't care about the previous value of xmm0, it actually would be an acceptable solution if __builtin_ia32_undef128 were removed from the compiler and _mm_undefined_si128 simply called _mm_setzero_si128. (This is what Microsoft Visual C++ does, in fact.)

I have not tried the other _mm*_undefined* functions.

@Myriachan
Copy link
Mannequin Author

Myriachan mannequin commented Mar 8, 2017

I checked more and Clang generates incorrect code even with -O0. In the case of -O0, the problem is different: Clang is doing a pcmpeqd of two different uninitialized memory addresses. This is still incorrect.

It seems to me that Clang is treating _mm_undefined_si128() the same way as uninitialized memory, which is incorrect.

Compare these two functions:

__m128i ReturnOneBits()
{
__m128i dummy = _mm_undefined_si128();
return _mm_cmpeq_epi32(dummy, dummy);
}

bool ReturnTrue()
{
int i;
return i == i;
}

ReturnTrue() always returns false. This is acceptable because the C++ standard states that using the value of uninitialized variables is undefined behavior, so a seemingly impossible result is permissible.

However, for _mm_undefined_si128, it has a behavior defined by Intel. It seems reasonable to me that _mm_undefined_si128 returns some unspecified value, but once _mm_undefined_si128 is "called", the value returned becomes fixed. _mm_undefined_si128 really wouldn't serve a purpose if it were like C++'s uninitialize variable rules, because you could just use an uninitialized __m128i for this.

@rotateright
Copy link
Contributor

http://www.playingwithpointers.com/problem-with-undef.html
"an undef value has a potentially new bit pattern of the compiler’s choosing at each use site"

So:

define i1 @​cmp_undef() {
%cmp = icmp eq i32 undef, undef
ret i1 %cmp
}

is not 'true', it is:

$ ./opt -instsimplify -S cmpundef.ll
define i1 @​cmp_undef() {
ret i1 undef
}

I agree that we can solve/work-around this as a front-end bug:
case X86::BI__builtin_ia32_undef128:
case X86::BI__builtin_ia32_undef256:
case X86::BI__builtin_ia32_undef512:
return UndefValue::get(ConvertType(E->getType()));

The suggestion to use setzero seems like a good solution to me, but we'll need to check all of the users of the undef intrinsics to make sure that doesn't cause extra instructions in the asm - or we'd defeat the point of these lovely x86 intrinsics. :)

Another option would be to let the intrinsic survive to the backend and clean it up there. This might have the unintended consequence (like some other LLVM optimization intrinsics) of hindering IR optimizations though.

@sanjoy
Copy link
Contributor

sanjoy commented Mar 8, 2017

Btw, in the new poison/undef scheme, we could have _mm_undefined_si128() return freeze poison (freeze is an instruction, so the IR would be %v = freeze poison; ret %v), which is a non-fluctuating but unspecified value.

@rotateright
Copy link
Contributor

Sadly, the one-line fix to clang:

Index: lib/CodeGen/CGBuiltin.cpp

--- lib/CodeGen/CGBuiltin.cpp (revision 297300)
+++ lib/CodeGen/CGBuiltin.cpp (working copy)
@@ -7381,7 +7381,10 @@
case X86::BI__builtin_ia32_undef128:
case X86::BI__builtin_ia32_undef256:
case X86::BI__builtin_ia32_undef512:

  • return UndefValue::get(ConvertType(E->getType()));

...causes many regression test failures:

Failing Tests (9):
Clang :: CodeGen/avx-builtins.c
Clang :: CodeGen/avx2-builtins.c
Clang :: CodeGen/avx512bw-builtins.c
Clang :: CodeGen/avx512dq-builtins.c
Clang :: CodeGen/avx512f-builtins.c
Clang :: CodeGen/avx512vl-builtins.c
Clang :: CodeGen/avx512vldq-builtins.c
Clang :: CodeGen/sse-builtins.c
Clang :: CodeGen/sse2-builtins.c

...because the x86 headers use _mm_undefined* in the definitions of other intrinsics. Example:

#define _mm_permute_pd(A, C) extension ({
(__m128d)__builtin_shufflevector((__v2df)(__m128d)(A),
(__v2df)_mm_undefined_pd(),
((C) >> 0) & 0x1, ((C) >> 1) & 0x1); })

I think that's a legitimate undef usage there, but couldn't this have used the __builtin_shufflevector version with only 2 params instead?

Ideally, someone would fix those headers...

A lazier alternative would be to replace the undefined intrinsic uses in shufflevector with the first param.

An even lazier alternative would be to just fix all the test failures. The optimizer should easily see that the zeroinitializer is actually not used in cases like the above?

@topperc
Copy link
Collaborator

topperc commented Mar 8, 2017

There isn't a form of builtin_shufflevector that takes a single input and a list of indices. There's an exactly 2 argument form that takes a vector mask, but I've never seen that used.

@rotateright
Copy link
Contributor

There isn't a form of builtin_shufflevector that takes a single input and a
list of indices. There's an exactly 2 argument form that takes a vector
mask, but I've never seen that used.

Hmmm...there is one regression test for that form, but I'm not sure how to constant initialize that vector mask.

If there's no objection, I'll work on the lazier solution. InstCombine has the smarts to convert a zero operand to an undef in all the cases I see in these tests via SimplifyDemandedVectorElts().

@topperc
Copy link
Collaborator

topperc commented Mar 9, 2017

And even if instcombine doesn't run due to -O0, I think SelectionDAGBuilder will turn unused inputs to shuffles to undef as well.

@RKSimon
Copy link
Collaborator

RKSimon commented Mar 9, 2017

In the longer term would the plan be to try and go back to the 'freeze poison' approach?

@rotateright
Copy link
Contributor

Yes - I haven't followed the freeze discussion very closely, but Sanjoy's description sounds like a match for the undef intrinsic needs.

But (correct me if I'm wrong), freeze doesn't exist today, so we need an immediate fix for those undef intrinsics to avoid miscompiles.

If we have a few extra xorps in the final code, it probably doesn't matter much since we tend to sprinkle those around anyway. :)

@RKSimon
Copy link
Collaborator

RKSimon commented Mar 9, 2017

Yes, sorry I didn't mean to say we shouldn't go ahead now. Just adding a TODO comment about it should be enough, and hopefully someone will remember to change this again...

@rotateright
Copy link
Contributor

Patch posted for review:
https://reviews.llvm.org/D30834

@rotateright
Copy link
Contributor

This should be fixed after:
https://reviews.llvm.org/rL297588

If you have any cases where you're using mm_undefined* and clang is producing an unnecessary zero-ing instruction (xorps, etc) in optimized assembly, please file a new bug report. Thanks!

This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugzilla Issues migrated from bugzilla clang:codegen miscompilation
Projects
None yet
Development

No branches or pull requests

4 participants