_mm_undefined_si128 compiles to incorrect SSE code #31524

Myriachan · 2017-03-08T00:36:21Z


Bugzilla Link	32176
Resolution	FIXED
Resolved on	Mar 12, 2017 12:50
Version	4.0
OS	All
CC	@topperc,@aqjune,@RKSimon,@nunoplopes,@regehr,@sanjoy,@rotateright

Extended Description

_mm_undefined_si128 (internally, __builtin_ia32_undef128) is designed to allow writing x86 SSE code that uses the existing values of SSE registers without regard to their current contents. An example is the following code to generate an SSE register with all "1" bits:

__m128i ReturnOneBits()
{
__m128i dummy = _mm_undefined_si128();
return _mm_cmpeq_epi32(dummy, dummy);
}

It should compile to something like this:

pcmpeqd %xmm0, %xmm0
retq

But instead, with -O1, -O2 or -O3, it compiles to this:

xorps %xmm0, %xmm0
retq

In other words, it returns all "0" bits instead of all "1" bits. (With optimizations disabled, the generated code reads uninitialized memory and then does pcmpeqd on the two values.)

The following function does compile correctly, and clang in fact sees that zeroing a register beforehand is unnecessary:

__m128i ReturnOneBits()
{
__m128i dummy = _mm_setzero_si128();
return _mm_cmpeq_epi32(dummy, dummy);
// -or-
return _mm_set_epi32(-1, -1, -1, -1);
}

These compile to:

pcmpeqd %xmm0, %xmm0
retq

Because clang's optimizer realizes that it doesn't care about the previous value of xmm0, it actually would be an acceptable solution if __builtin_ia32_undef128 were removed from the compiler and _mm_undefined_si128 simply called _mm_setzero_si128. (This is what Microsoft Visual C++ does, in fact.)
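As a stopgap on affected compilers, the setzero variant above can be wrapped in a small helper. This is only a sketch, assuming an x86 target with SSE2; the names all_ones_sketch and is_all_ones_sketch are made up for illustration and are not part of any header:

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: builds an all-ones register from a defined
   (zeroed) value instead of _mm_undefined_si128(). */
static __m128i all_ones_sketch(void)
{
    __m128i zero = _mm_setzero_si128();
    /* Every 32-bit lane compares equal to itself, so every lane
       becomes 0xFFFFFFFF. */
    return _mm_cmpeq_epi32(zero, zero);
}

/* Hypothetical checker: returns 1 if all 16 bytes of v are 0xFF. */
static int is_all_ones_sketch(__m128i v)
{
    uint8_t bytes[16];
    _mm_storeu_si128((__m128i *)bytes, v);
    for (int i = 0; i < 16; ++i) {
        if (bytes[i] != 0xFF)
            return 0;
    }
    return 1;
}
```

The checker is only there so the result can be verified at run time without inspecting the generated assembly.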

I have not tried the other _mm*_undefined* functions.

Myriachan · 2017-03-08T01:38:42Z

I checked more and Clang generates incorrect code even with -O0. In the case of -O0, the problem is different: Clang is doing a pcmpeqd of two different uninitialized memory addresses. This is still incorrect.

It seems to me that Clang is treating _mm_undefined_si128() the same way as uninitialized memory, which is incorrect.

Compare these two functions:

__m128i ReturnOneBits()
{
__m128i dummy = _mm_undefined_si128();
return _mm_cmpeq_epi32(dummy, dummy);
}

bool ReturnTrue()
{
int i;
return i == i;
}

ReturnTrue() always returns false. This is acceptable because the C++ standard states that using the value of uninitialized variables is undefined behavior, so a seemingly impossible result is permissible.

However, _mm_undefined_si128 has a behavior defined by Intel. It seems reasonable to me that _mm_undefined_si128 returns some unspecified value, but once _mm_undefined_si128 is "called", the value returned becomes fixed. _mm_undefined_si128 really wouldn't serve a purpose if it followed C++'s uninitialized-variable rules, because you could just use an uninitialized __m128i for this.

rotateright · 2017-03-08T19:02:00Z

http://www.playingwithpointers.com/problem-with-undef.html
"an undef value has a potentially new bit pattern of the compiler’s choosing at each use site"

So:

define i1 @cmp_undef() {
%cmp = icmp eq i32 undef, undef
ret i1 %cmp
}

is not 'true', it is:

$ ./opt -instsimplify -S cmpundef.ll
define i1 @cmp_undef() {
ret i1 undef
}

I agree that we can solve/work-around this as a front-end bug:
case X86::BI__builtin_ia32_undef128:
case X86::BI__builtin_ia32_undef256:
case X86::BI__builtin_ia32_undef512:
return UndefValue::get(ConvertType(E->getType()));

The suggestion to use setzero seems like a good solution to me, but we'll need to check all of the users of the undef intrinsics to make sure that doesn't cause extra instructions in the asm - or we'd defeat the point of these lovely x86 intrinsics. :)

Another option would be to let the intrinsic survive to the backend and clean it up there. This might have the unintended consequence (like some other LLVM optimization intrinsics) of hindering IR optimizations though.

sanjoy · 2017-03-08T20:10:21Z

Btw, in the new poison/undef scheme, we could have _mm_undefined_si128() return freeze poison (freeze is an instruction, so the IR would be %v = freeze poison; ret %v), which is a non-fluctuating but unspecified value.

rotateright · 2017-03-08T22:13:26Z

Sadly, the one-line fix to clang:

Index: lib/CodeGen/CGBuiltin.cpp

--- lib/CodeGen/CGBuiltin.cpp (revision 297300)
+++ lib/CodeGen/CGBuiltin.cpp (working copy)
@@ -7381,7 +7381,10 @@
   case X86::BI__builtin_ia32_undef128:
   case X86::BI__builtin_ia32_undef256:
   case X86::BI__builtin_ia32_undef512:
-    return UndefValue::get(ConvertType(E->getType()));
+    // The x86 definition of "undef" is not the same as the LLVM definition
+    // (bug 32176). We leave the exercise of optimizing away an unnecessary zero
+    // constant to the backend.
+    return llvm::Constant::getNullValue(ConvertType(E->getType()));

...causes many regression test failures:

Failing Tests (9):
Clang :: CodeGen/avx-builtins.c
Clang :: CodeGen/avx2-builtins.c
Clang :: CodeGen/avx512bw-builtins.c
Clang :: CodeGen/avx512dq-builtins.c
Clang :: CodeGen/avx512f-builtins.c
Clang :: CodeGen/avx512vl-builtins.c
Clang :: CodeGen/avx512vldq-builtins.c
Clang :: CodeGen/sse-builtins.c
Clang :: CodeGen/sse2-builtins.c

...because the x86 headers use _mm_undefined* in the definitions of other intrinsics. Example:

#define _mm_permute_pd(A, C) __extension__ ({ \
  (__m128d)__builtin_shufflevector((__v2df)(__m128d)(A), \
                                   (__v2df)_mm_undefined_pd(), \
                                   ((C) >> 0) & 0x1, ((C) >> 1) & 0x1); })

I think that's a legitimate undef usage there, but couldn't this have used the __builtin_shufflevector version with only 2 params instead?

Ideally, someone would fix those headers...

A lazier alternative would be to replace the undefined intrinsic uses in shufflevector with the first param.

An even lazier alternative would be to just fix all the test failures. The optimizer should easily see that the zeroinitializer is actually not used in cases like the above?

topperc · 2017-03-08T22:54:59Z

There isn't a form of __builtin_shufflevector that takes a single input and a list of indices. There's an exactly 2 argument form that takes a vector mask, but I've never seen that used.

rotateright · 2017-03-08T23:58:02Z

> There isn't a form of __builtin_shufflevector that takes a single input and a
> list of indices. There's an exactly 2 argument form that takes a vector
> mask, but I've never seen that used.

Hmmm...there is one regression test for that form, but I'm not sure how to constant initialize that vector mask.

If there's no objection, I'll work on the lazier solution. InstCombine has the smarts to convert a zero operand to an undef in all the cases I see in these tests via SimplifyDemandedVectorElts().

topperc · 2017-03-09T01:07:50Z

And even if instcombine doesn't run due to -O0, I think SelectionDAGBuilder will turn unused inputs to shuffles to undef as well.

RKSimon · 2017-03-09T23:20:25Z

In the longer term would the plan be to try and go back to the 'freeze poison' approach?

rotateright · 2017-03-09T23:34:12Z

Yes - I haven't followed the freeze discussion very closely, but Sanjoy's description sounds like a match for the undef intrinsic needs.

But (correct me if I'm wrong), freeze doesn't exist today, so we need an immediate fix for those undef intrinsics to avoid miscompiles.

If we have a few extra xorps in the final code, it probably doesn't matter much since we tend to sprinkle those around anyway. :)

RKSimon · 2017-03-09T23:36:54Z

Yes, sorry I didn't mean to say we shouldn't go ahead now. Just adding a TODO comment about it should be enough, and hopefully someone will remember to change this again...

rotateright · 2017-03-10T17:45:59Z

Patch posted for review:
https://reviews.llvm.org/D30834

rotateright · 2017-03-12T19:50:48Z

This should be fixed after:
https://reviews.llvm.org/rL297588

If you have any cases where you're using _mm_undefined* and clang is producing an unnecessary zeroing instruction (xorps, etc.) in optimized assembly, please file a new bug report. Thanks!

llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021

scottmcm mentioned this issue Feb 12, 2023

Intrinsics like _mm_undefined_si128 should clarify what "undefined" means to them rust-lang/rust#107945
