LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filled before November 26, 2021. Use github to submit LLVM bugs

Bug 32176 - _mm_undefined_si128 compiles to incorrect SSE code
Summary: _mm_undefined_si128 compiles to incorrect SSE code
Status: RESOLVED FIXED
Alias: None
Product: clang
Classification: Unclassified
Component: LLVM Codegen (show other bugs)
Version: 4.0
Hardware: PC All
: P normal
Assignee: Unassigned Clang Bugs
URL:
Keywords: miscompilation
Depends on:
Blocks:
 
Reported: 2017-03-07 16:36 PST by Melissa
Modified: 2017-03-12 12:50 PDT (History)
8 users (show)

See Also:
Fixed By Commit(s):


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Melissa 2017-03-07 16:36:21 PST
_mm_undefined_si128 (internally, __builtin_ia32_undef128) is designed to allow writing x86 SSE code that uses the existing values of SSE registers without regard to their current contents.  An example is the following code to generate an SSE register with all "1" bits:

__m128i ReturnOneBits()
{
  __m128i dummy = _mm_undefined_si128();
  return _mm_cmpeq_epi32(dummy, dummy);
}

It should compile to something like this:

pcmpeqd %xmm0, %xmm0
retq

But instead, with -O1, -O2 or -O3, it compiles to this:

xorps %xmm0, %xmm0
retq

In other words, it returns all "0" bits instead of all "1" bits.  (With optimizations disabled, the generated code reads uninitialized memory then does pcmpeqd on the two values, 

The following function *does* compile correctly, and clang in fact sees that zeroing a register beforehand is unnecessary:

__m128i ReturnOneBits()
{
  __m128i dummy = _mm_setzero_si128();
  return _mm_cmpeq_epi32(dummy, dummy);
  // -or-
  return _mm_set_epi32(-1, -1, -1, -1);
}

These compile to:

pcmpeqd %xmm0, %xmm0
retq

Because clang's optimizer realizes that it doesn't care about the previous value of xmm0, it actually would be an acceptable solution if __builtin_ia32_undef128 were removed from the compiler and _mm_undefined_si128 simply called _mm_setzero_si128.  (This is what Microsoft Visual C++ does, in fact.)

I have not tried the other _mm*_undefined* functions.
Comment 1 Melissa 2017-03-07 17:38:42 PST
I checked more and Clang generates incorrect code even with -O0.  In the case of -O0, the problem is different: Clang is doing a pcmpeqd of two different uninitialized memory addresses.  This is still incorrect.

It seems to me that Clang is treating _mm_undefined_si128() the same way as uninitialized memory, which is incorrect.

Compare these two functions:

__m128i ReturnOneBits()
{
  __m128i dummy = _mm_undefined_si128();
  return _mm_cmpeq_epi32(dummy, dummy);
}

bool ReturnTrue()
{
  int i;
  return i == i;
}

ReturnTrue() always returns false.  This is acceptable because the C++ standard states that using the value of uninitialized variables is undefined behavior, so a seemingly impossible result is permissible.

However, for _mm_undefined_si128, it has a behavior defined by Intel.  It seems reasonable to me that _mm_undefined_si128 returns some unspecified value, but once _mm_undefined_si128 is "called", the value returned becomes fixed.  _mm_undefined_si128 really wouldn't serve a purpose if it were like C++'s uninitialize variable rules, because you could just use an uninitialized __m128i for this.
Comment 2 Sanjay Patel 2017-03-08 11:02:00 PST
http://www.playingwithpointers.com/problem-with-undef.html
"an undef value has a potentially new bit pattern of the compiler’s choosing at each use site"

So:

define i1 @cmp_undef() {
  %cmp = icmp eq i32 undef, undef
  ret i1 %cmp
}

is not 'true', it is:

$ ./opt -instsimplify -S cmpundef.ll 
define i1 @cmp_undef() {
  ret i1 undef
}


I agree that we can solve/work-around this as a front-end bug:
  case X86::BI__builtin_ia32_undef128:
  case X86::BI__builtin_ia32_undef256:
  case X86::BI__builtin_ia32_undef512:
    return UndefValue::get(ConvertType(E->getType()));

The suggestion to use setzero seems like a good solution to me, but we'll need to check all of the users of the undef intrinsics to make sure that doesn't cause extra instructions in the asm - or we'd defeat the point of these lovely x86 intrinsics. :)

Another option would be to let the intrinsic survive to the backend and clean it up there. This might have the unintended consequence (like some other LLVM optimization intrinsics) of hindering IR optimizations though.
Comment 3 Sanjoy Das 2017-03-08 12:10:21 PST
Btw, in the new poison/undef scheme, we could have _mm_undefined_si128() return `freeze poison` (`freeze` is an instruction, so the IR would be `%v = freeze poison; ret %v`), which is a non-fluctuating but unspecified value.
Comment 4 Sanjay Patel 2017-03-08 14:13:26 PST
Sadly, the one-line fix to clang:

Index: lib/CodeGen/CGBuiltin.cpp
===================================================================
--- lib/CodeGen/CGBuiltin.cpp	(revision 297300)
+++ lib/CodeGen/CGBuiltin.cpp	(working copy)
@@ -7381,7 +7381,10 @@
   case X86::BI__builtin_ia32_undef128:
   case X86::BI__builtin_ia32_undef256:
   case X86::BI__builtin_ia32_undef512:
-    return UndefValue::get(ConvertType(E->getType()));
+    // The x86 definition of "undef" is not the same as the LLVM definition
+    // (PR32176). We leave the exercise of optimizing away an unnecessary zero
+    // constant to the backend.
+    return llvm::Constant::getNullValue(ConvertType(E->getType()));

...causes many regression test failures:

Failing Tests (9):
    Clang :: CodeGen/avx-builtins.c
    Clang :: CodeGen/avx2-builtins.c
    Clang :: CodeGen/avx512bw-builtins.c
    Clang :: CodeGen/avx512dq-builtins.c
    Clang :: CodeGen/avx512f-builtins.c
    Clang :: CodeGen/avx512vl-builtins.c
    Clang :: CodeGen/avx512vldq-builtins.c
    Clang :: CodeGen/sse-builtins.c
    Clang :: CodeGen/sse2-builtins.c

...because the x86 headers use _mm_undefined* in the definitions of other intrinsics. Example:

#define _mm_permute_pd(A, C) __extension__ ({ \
  (__m128d)__builtin_shufflevector((__v2df)(__m128d)(A), \
                                   (__v2df)_mm_undefined_pd(), \
                                   ((C) >> 0) & 0x1, ((C) >> 1) & 0x1); })

I think that's a legitimate undef usage there, but couldn't this have used the __builtin_shufflevector version with only 2 params instead?

Ideally, someone would fix those headers...

A lazier alternative would be to replace the undefined intrinsic uses in shufflevector with the first param.

An even lazier alternative would be to just fix all the test failures. The optimizer should easily see that the zeroinitializer is actually not used in cases like the above?
Comment 5 Craig Topper 2017-03-08 14:54:59 PST
There isn't a form of builtin_shufflevector that takes a single input and a list of indices. There's an exactly 2 argument form that takes a vector mask, but I've never seen that used.
Comment 6 Sanjay Patel 2017-03-08 15:58:02 PST
(In reply to Craig Topper from comment #5)
> There isn't a form of builtin_shufflevector that takes a single input and a
> list of indices. There's an exactly 2 argument form that takes a vector
> mask, but I've never seen that used.

Hmmm...there is one regression test for that form, but I'm not sure how to constant initialize that vector mask.

If there's no objection, I'll work on the lazier solution. InstCombine has the smarts to convert a zero operand to an undef in all the cases I see in these tests via SimplifyDemandedVectorElts().
Comment 7 Craig Topper 2017-03-08 17:07:50 PST
And even if instcombine doesn't run due to -O0, I think SelectionDAGBuilder will turn unused inputs to shuffles to undef as well.
Comment 8 Simon Pilgrim 2017-03-09 15:20:25 PST
In the longer term would the plan be to try and go back to the 'freeze poison' approach?
Comment 9 Sanjay Patel 2017-03-09 15:34:12 PST
Yes - I haven't followed the freeze discussion very closely, but Sanjoy's description sounds like a match for the undef intrinsic needs.

But (correct me if I'm wrong), freeze doesn't exist today, so we need an immediate fix for those undef intrinsics to avoid miscompiles. 

If we have a few extra xorps in the final code, it probably doesn't matter much since we tend to sprinkle those around anyway. :)
Comment 10 Simon Pilgrim 2017-03-09 15:36:54 PST
Yes, sorry I didn't mean to say we shouldn't go ahead now. Just adding a TODO comment about it should be enough, and hopefully someone will remember to change this again...
Comment 11 Sanjay Patel 2017-03-10 09:45:59 PST
Patch posted for review:
https://reviews.llvm.org/D30834
Comment 12 Sanjay Patel 2017-03-12 12:50:48 PDT
This should be fixed after:
https://reviews.llvm.org/rL297588

If you have any cases where you're using _mm_undefined_* and clang is producing an unnecessary zero-ing instruction (xorps, etc) in optimized assembly, please file a new bug report. Thanks!