Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clang++ miscompile at -O1 with revision 274385 #29638

Open
wjristow opened this issue Sep 6, 2016 · 18 comments
Open

clang++ miscompile at -O1 with revision 274385 #29638

wjristow opened this issue Sep 6, 2016 · 18 comments
Labels
bugzilla Issues migrated from bugzilla clang:codegen miscompilation

Comments

@wjristow
Copy link
Collaborator

wjristow commented Sep 6, 2016

Bugzilla Link 30290
Version trunk
OS Linux
CC @adibiagio,@zmodem,@hfinkel,@jmorse,@RKSimon,@MatzeB,@rotateright

Extended Description

We have encountered a miscompile for x86_64-unknown-linux-gnu, with
'-march=btver2' at optimization '-O1'. In the test-cases I found, it passes
with optimizaiton lowered to -O0 or raised to -O2, and it also passes for the
default architecture. I'd expect that the bug can also manifest at higher
optimization levels, but I haven't found an example.

I've reduced the failure to the test-case appended below, and found that the
failure began happening with r274385 (it still fails with current ToT, r280692).
The failing and passing behaviors are:

$ clang++ --version
clang version 4.0.0 (trunk 280692)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: <....>
$ clang++ -o test.elf -O1 -march=btver2 test.cpp
$ ./test.elf
input: 41 42 43 44 45
output: 41 42 45 17 17
FAIL
$
$ clang++ -o test.elf -O1 test.cpp
$ ./test.elf
input: 41 42 43 44 45
output: 43 44 45 41 42
PASS
$
$ clang++ -o test.elf -O2 -march=btver2 test.cpp
$ ./test.elf
input: 41 42 43 44 45
output: 43 44 45 41 42
PASS
$

Although the failure of this test-case began happening with r274385 (as was
the case with the program this was reduced from), I suspect that the
underlying bug is older than that, because in reducing the original program, I
also encountered a test-case that fails in a similar way with significantly
older compilers. Specifically, I found a very similar test-case that fails
with llvm 3.6 (passes with llvm 3.5). I'm guessing that the recent change of
r274385 is correct, and it's just that the old bug is now more likely to
manifest. I haven't bisected down precisely when the older failure began to
happen. I'll put that similar test-case that fails even with older compilers
in the next post here. (I'm filing this against Clang / LLVM Codegen, because
that's what changed with r274385. However, if my guess is right, that's
probably the wrong area.)

Note that bug 28603 is generally about failures appearing since r274385. But
in the discussion there, the failing test-cases were all examples of broken
user-code that was exposed by r274385. I believe the test-case below is legal
code, but as I said above, I'm suspicious that r274385 itself is just exposing
an older latent problem.

With that, here is the test-case that began failing with r274385:

//////////////////////////////// test.cpp ////////////////////////////////
extern "C" void printf(...);

int srcData[] = {41, 42, 43, 44, 45};
const int nElems = sizeof(srcData) / sizeof(srcData[0]);

struct Container {
Container(int i) { mData = i; }
int mData;
};

struct Wrapper {
int unusedInt;
long unusedLong;
Wrapper(Container* p) : mPtr(p) {}
Container* mPtr;
};

struct ContainerVector {
Container* firstElem;
Container* lastElem;
ContainerVector() : firstElem(0), lastElem(0) { }
Container* ptrToElem(int inx) { return firstElem + inx; }
void allocVec(int n_elems) {
Container* newVec = (Container*) operator new(sizeof(Container) * n_elems);
firstElem = newVec;
lastElem = newVec + n_elems;
for (int i = 0; i < n_elems; ++i)
ptrToElem(i)->mData = 17;
}
};

static void printVec(ContainerVector cv, const char* msg)
{
printf("%10s: ", msg);
for (int i = 0; i < nElems; ++i)
printf(" %3d", cv.ptrToElem(i)->mData);
printf("\n");
}

static Wrapper copy(Wrapper wSrcStart, Wrapper wSrcEnd, Wrapper wDst) {
Container *cSrc, *cEnd, *cDst;
cSrc = wSrcStart.mPtr; cEnd = wSrcEnd.mPtr; cDst = wDst.mPtr;
for (; cSrc != cEnd; ++cDst, ++cSrc)
*cDst = *cSrc;
wDst.mPtr = cDst;
return wDst;
}

static void copyWithRotateLeft(Wrapper wStartSrc, Wrapper wStartDstSrc,
Wrapper wEndSrc, Wrapper wDst) {
wDst = copy(wStartDstSrc, wEndSrc, wDst);
copy(wStartSrc, wStartDstSrc, wDst);
}

int main() {
ContainerVector src, dst;
src.allocVec(nElems);
dst.allocVec(nElems);
for (int i = 0; i < nElems; ++i)
src.ptrToElem(i)->mData = srcData[i];
printVec(src, "input");
copyWithRotateLeft(Wrapper(src.firstElem), Wrapper(src.firstElem + 2),
Wrapper(src.lastElem), Wrapper(dst.firstElem));
printVec(dst, "output");
printf("%s\n", (((dst.ptrToElem(4))->mData != 42) ? "FAIL" : "PASS"));
}
//////////////////////////////////////////////////////////////////////////

@wjristow
Copy link
Collaborator Author

wjristow commented Sep 6, 2016

//////////////////////// "old_bug.cpp" ////////////////////////
//
// Miscompile at '-O1 -march=btver2', using any modern Clang (with
// llvm 3.6 through ToT, Target: x86_64-unknown-linux-gnu).
//
// Fails when built as:
//
// clang++ -o test.elf -O1 -march=btver2 test.cpp
//
// Passes if the optimization level is set to -O0 or -O2.
//
extern "C" void printf(...);

class G {
public:
// Test passes if either of these unused members are removed.
int unusedInt;
long unusedLong;

G(int i) : mData(i) {}
int mData;
};

static G combine(G g1, G g2, G g3) {
int i1 = g1.mData, i2 = g2.mData, i3 = g3.mData;
for (; i1 != i2; ++i3, ++i1)
/* empty loop */ ;
g3.mData = i3;
return g3;
}

static G mix(G g1, G g2, G g3, G g4) {
g4 = combine(g2, g3, g4);
return combine(g1, g2, g4);
}

int main() {
G g = mix(G(0), G(0), G(17), G(0));

printf("mData = %d (expect 17). %s.\n",
g.mData, ((g.mData != 17) ? "FAIL" : "PASS"));
}
///////////////////////////////////////////////////////////////

@llvmbot
Copy link
Collaborator

llvmbot commented Sep 6, 2016

In finding pre-existing source bugs that are revealed by r274385, -fsanitize-address-use-after-scope appears to be very useful. Maybe you want to try that first.

@llvmbot
Copy link
Collaborator

llvmbot commented Sep 6, 2016

In finding pre-existing source bugs that are revealed by r274385,
-fsanitize-address-use-after-scope appears to be very useful. Maybe you want
to try that first.

Oops used wrong account... It's me above.

@wjristow
Copy link
Collaborator Author

wjristow commented Sep 6, 2016

After seeing your note at bug 28695, comment 3, I was just about to add a note
here saying I'm no longer confident in my statement:

"I believe the test-case ... is legal code, ..."

But I see you beat me to the punch. Thanks for the advice in comment 3. I'll
look this over more carefully.

@wjristow
Copy link
Collaborator Author

wjristow commented Sep 7, 2016

testcase with auxiliary files
I've now looked this over more carefully, and with the caveat that I don't
consider myself an expert in the lifetimes of C++ temps, I'm unable to find a
user-bug in the above test-cases. There definitely isn't anything as blatant
as what's described at bug 28695, comment 3 (where the address of a temp is
taken as a key part of the trouble), and I also don't see any more subtle
errors. So I think the test-cases above are legal.

I did run the address sanitizer:

-fsanitize=address -fsanitize-address-use-after-scope

on the test-case in comment 0, above, and it pointed out two memory leaks (of
the vector of 5 Container structs for the src and dst being allocated, but never
freed), but it didn't note any issues about "use-after-scope" problems. I
experimented with some other test-cases I constructed that purposely used temps
after they were freed, similar to the note at bug 28695, comment 3, and I did
see messages of the form:

==27001==ERROR: AddressSanitizer: stack-use-after-scope on address ...

So I can see that that approach can be useful, but it didn't find any issues
with the test-cases here. I realize that doesn't prove the test-cases here are
legal code, but it adds to my view that they are.

That said, I still suspect that the underlying issue is an old latent bug that
is exposed by the new lifetime marks on the temps of r274385. The fact that
it's happening with '-march=btver2' in these test-cases, makes it feel more
like a back-end bug to me. With that in mind, I tried an experiment of
disabling the scheduler ('-mllvm -disable-post-ra'), and found that the
test-cases above then passed. Could it be a matter of some alias information
of some temps not properly indicating interference that exists (this is a
complete shot in the dark)?

In analyzing it with the scheduling observation in mind, I found that if I
swapped two instructions in the failing assembly code, the test then passed.
I'm attaching an archive with:

  1. an updated "test.cpp" (it just deletes the allocated vectors at the end,
    to make it a clean ASan run; plus it fixes the prototype of 'printf()'
    which I noticed was over-simplified by creduce as I made the small
    test-case),
  2. a "test.ll" created as shown in the commands below, and
  3. a hacked assembly file ("hack.s") that was created by editing the
    failing assembly file.

Below are commands that show the sequence.


$ clang++ -emit-llvm -S -O1 test.cpp
$ llc -mcpu=btver2 -o fail.s test.ll # btver2 will fail
$ llc -o pass.s test.ll # this will pass
$ llc -mcpu=btver2 -disable-post-ra -o nosc.s test.ll # btver2, no sched passes
$ clang -c fail.s pass.s nosc.s
$ clang++ -o fail.elf fail.o
$ clang++ -o pass.elf pass.o
$ clang++ -o nosc.elf nosc.o
$ ./fail.elf
input: 41 42 43 44 45
output: 41 42 45 17 17
FAIL
$ ./pass.elf
input: 41 42 43 44 45
output: 43 44 45 41 42
PASS
$ ./nosc.elf
input: 41 42 43 44 45
output: 43 44 45 41 42
PASS
$ cp fail.s hack.s
$ vi hack.s # manually edit/hack the failing assembly
... swap 2 lines, and note with a comment "SWAPPED" ...
$ diff -C5 fail.s hack.s
*** fail.s 2016-09-06 23:14:23.381060423 -0700
--- hack.s 2016-09-06 23:16:31.665058801 -0700


*** 265,277 ****
vmovups 136(%rsp), %xmm0
vmovups %xmm0, (%rsp)
callq ZL4copy7WrapperS_S
movq 96(%rsp), %rax
vmovups 80(%rsp), %xmm0
! movq 200(%rsp), %rcx
leaq 80(%rsp), %rdi
! movq %rax, 200(%rsp)
vmovups %xmm0, 184(%rsp)
movq %rcx, 64(%rsp)
vmovups 184(%rsp), %xmm0
vmovups %xmm0, 48(%rsp)
movq 152(%rsp), %rax
--- 265,277 ----
vmovups 136(%rsp), %xmm0
vmovups %xmm0, (%rsp)
callq ZL4copy7WrapperS_S
movq 96(%rsp), %rax
vmovups 80(%rsp), %xmm0
! movq %rax, 200(%rsp) # SWAPPED
leaq 80(%rsp), %rdi
! movq 200(%rsp), %rcx # SWAPPED
vmovups %xmm0, 184(%rsp)
movq %rcx, 64(%rsp)
vmovups 184(%rsp), %xmm0
vmovups %xmm0, 48(%rsp)
movq 152(%rsp), %rax
$ clang -c hack.s
$ clang++ -o hack.elf hack.o
$ ./hack.elf
input: 41 42 43 44 45
output: 43 44 45 41 42
PASS
$


In summary:

  1. This test-case could be illegal code (a user-bug), but I don't see any
    problems. If there is something I missed, I'll be grateful if someone
    else sees it, and points it out to me.

  2. Assuming the test-case is legal, disabling the scheduler (-disable-post-ra)
    fixes the examples I found. This makes me suspicious that it might be a
    latent back-end bug, possibly related to alias analysis (or a bug in
    communicating alias information from the front-end to the back-end).

  3. Again assuming the test-case is legal, all of this still makes me suspect
    that r274385 isn't the ultimate problem here, but that it's exposing some
    longstanding latent problem.

@llvmbot
Copy link
Collaborator

llvmbot commented Sep 7, 2016

Yes, I believe that it's a backend bug, because I removed all lifetime.start and lifetime.end calls in the .ll file, and it still fails.

~ % lli -O1 bad.ll
input: 41 42 43 44 45
output: 43 44 45 41 42
PASS
~ % lli -O2 bad.ll
input: 41 42 43 44 45
output: 41 42 45 17 17
FAIL

Looking at the deviate point of -print-after-all, the bug happens late in the backend, at least as late as post RA scheduling:

~ % diff -u10 <(llc -O1 bad.ll -print-after-all 2>&1) <(llc -O2 bad.ll -print-after-all 2>&1) | head -40
--- /proc/self/fd/13 2016-09-07 14:12:21.604977039 -0700
+++ /proc/self/fd/14 2016-09-07 14:12:21.604977039 -0700
@@ -5106,512 +5106,512 @@

*** IR Dump After Post RA top-down list latency scheduler ***:

...

but I don't know where the issue is.

@​spatel, can you look at this btver2 related issue?

@rotateright
Copy link
Contributor

@​spatel, can you look at this btver2 related issue?

This is not strictly-speaking a btver2 bug; it's just that the post-RA scheduler / critical-antidep-breaker are enabled for btver2, so the bug is visible there because the bug is likely in one of those places.

These passes are also enabled for the default 32-bit x86 CPU model, so I think we could hit a variant of this bug with any "-m32" compilation (see possibly-related bug 27681 / https://reviews.llvm.org/D20456 ).

Here's a reduced (but not minimized) test case to show that the bug is independent of btver2:

%class.G = type <{ i32, [4 x i8], i64, i32, [4 x i8] }>

define void @​foo(%class.G* noalias nocapture sret %agg.result, %class.G* byval nocapture readonly align 8 %g1, %class.G* byval nocapture readonly align 8 %g2, %class.G* byval nocapture readonly align 8 %g3, %class.G* byval nocapture align 8 %g4) {
%ref.tmp = alloca %class.G, align 8
%t0 = bitcast %class.G* %ref.tmp to i8*
%t1 = bitcast %class.G* %g4 to i8*
call fastcc void @​bar(%class.G* nonnull sret %ref.tmp, %class.G* byval nonnull align 8 %g2, %class.G* byval nonnull align 8 %g3, %class.G* byval nonnull align 8 %g4)
call void @​llvm.memcpy.p0i8.p0i8.i64(i8* %t1, i8* %t0, i64 20, i32 8, i1 false)
call fastcc void @​bar(%class.G* sret %agg.result, %class.G* byval nonnull align 8 %g1, %class.G* byval nonnull align 8 %g2, %class.G* byval nonnull align 8 %g4)
ret void
}

declare void @​bar(%class.G* noalias nocapture sret %agg.result, %class.G* byval nocapture readonly align 8 %g1, %class.G* byval nocapture readonly align 8 %g2, %class.G* byval nocapture align 8 %g3)
declare void @​llvm.memcpy.p0i8.p0i8.i64(i8* nocapture writeonly, i8* nocapture readonly, i64, i32, i1) #​1


You can see the problem by diffing the output of:
$ llc -o - 30290.ll -post-RA-scheduler=0 -mattr=avx
$ llc -o - 30290.ll -post-RA-scheduler=1 -mattr=avx

As Warren showed in comment 5, a 64-bit load is getting hoisted ahead of a 32-bit store to that same location:

Good:
movl %eax, 200(%rsp)
...
movq 200(%rsp), %rax

Bad:
movq 200(%rsp), %rcx
...
movl %eax, 200(%rsp)


cc'ing Dave Kreitzer and Matthias Braun in case they have any insight. I think the last person to look at these passes was Mitch Bodart in D20456, but I don't see a bugzilla ID for Mitch. I don't have a good understanding of this layer.

@adibiagio
Copy link
Collaborator

@​spatel, can you look at this btver2 related issue?

This is not strictly-speaking a btver2 bug; it's just that the post-RA
scheduler / critical-antidep-breaker are enabled for btver2, so the bug is
visible there because the bug is likely in one of those places.

Thanks for the small reproducer Sanjay.

I did some investigation, and I am pretty sure that the issue is neither in post-RA scheduler nor in the antidep breaker. For what I can see, recent changes to the antidep breaker have exposed what seemed to be a latent bug.

The problem is caused by an "inaccurate" pointer information (MachinePointerInfo) associated to the loads and stores of an unrolled constant memcpy call. This occurs at SelectionDAGBuilder stage when we the intrinsic memcpy call is expanded.

The machine scheduler uses instances of "MachinePointerInfo" to describe what memory objects are referenced by each load and store. During codegen, the scheduler uses that information to reconstruct the dependency arcs between memory operations; load and store operations that references the same memory objects may alias. The "pointer info" is encapsulated in the machine memory operand, and describes the memory access in terms of base+offset.

Back to our problem:

The memcpy call from the minimal reproducible posted by Sanjay is expanded by the x86 backend into a sequence of loads and stores. Each load/store refers to a stack object (a fixed stack objects, and a non-fixed stack object related to the 'alloca').

%t0 = bitcast %class.G* %ref.tmp to i8*
%t1 = bitcast %class.G* %g4 to i8*
call void @​llvm.memcpy.p0i8.p0i8.i64(i8 *t1, i8 *t0, i64 20, i32 8, i1 false)

Where %ref.tmp is a non-fixed frame object obtained from an alloca; %g4 is a pointer to a value passed by copy (i.e. byval) on the stack by the caller. So, the object pointed by %g4 is a fixed frame object.

When the SelectionDAGBuilder expands the call to memcpy, the pointer info instances related to %t0 and %t1 do not reference the two stack objects!.
Later on, %g4 is also passed 'byval' to another function call. The compiler would therefore emit a copy of %g4 on the stack before calling @​bar.

The inaccurate pointer info for the loads and stores becomes problematic only after regalloc.

At post-RA stage, the antidep-breaker is run to expose more opportunities for scheduling machine instructions. Every time the antidep-breaker successfully manages to break a chain of dependencies, the scheduler constructs a new dependence graph.
Each basic block is traversed bottom-up; the pointer info of each load or store, is compared with the pointer info of every other load and store already visited. If two instructions reference the same memory object, then a dependency is inserted in the graph. In particular, if a load L references the same memory object of a previously seen store instruction, then the two instructions may alias.

In our example, the scheduler thinks that the load/store sequence obtained from unrolling the constant memcpy call does not alias with any of the loads/stores for the byval copy of %g4 on the stack.
The lack of dependency arcs between those loads and stores is what causes the incorrect scheduling of memory operations.

I have a patch that fixes the issue in the minimal reproducible from Sanjay (I plan to send it soon for review).
Note that this issue is also affecting memset/memmove.
If we replace memcpy with memmove, then we also get incorrect codegen.

@rotateright
Copy link
Contributor

The problem is caused by an "inaccurate" pointer information
(MachinePointerInfo) associated to the loads and stores of an unrolled
constant memcpy call. This occurs at SelectionDAGBuilder stage when we the
intrinsic memcpy call is expanded.

Oh wow, so the bug is actually worse than I suspected. It could affect any target regardless of whether they run post-RA-scheduler (although it may be very hard to expose without that?).

Thank you for tracking it down!

@adibiagio
Copy link
Collaborator

The problem is caused by an "inaccurate" pointer information
(MachinePointerInfo) associated to the loads and stores of an unrolled
constant memcpy call. This occurs at SelectionDAGBuilder stage when we the
intrinsic memcpy call is expanded.

Oh wow, so the bug is actually worse than I suspected. It could affect any
target regardless of whether they run post-RA-scheduler (although it may be
very hard to expose without that?).

So, the reason why this bug is only exposed after RA is because we run the "aggressive" antidep breaker as part of the postRAScheduler.

The logic in SchedulePostRATDList::schedule() firstly attempts to "break anti dependencies". On success, it clears the entire scheduleDAG, and then it rebuilds the graph.

The code in ScheduleDAGInstrs::buildSchedGraph is responsible for building the graph by walking the basic block from bottom to top and generating a scheduling unit for each machine instruction.
Dependencies are updated on the fly: the scheduler keeps two set of pointer descriptors for visited stores and loads.
Data dependences between instructions are updated based on the used/defined registers, and the referenced memory objects.

The interesting portion is related to the call to function 'getUnderlyingObjectsForInstr'. In our example, we have that each load and store references a stack object. However, all the MachineMemOperands descriptors are not instances of 'PseudoSourceValue' (which - in our case - would describe a reference to a stack object). Instead, all the loads/stores obtained from expanding the memcpy call are llvm::Value*. Since those value are not 'PseudoSourceValue' objects, they are considered to be disjoint from any stack allocated objects already visited.

@adibiagio
Copy link
Collaborator

Patch (based on r281240)
Here is a patch that fixes the problems with the MachinePointerInfo of loads and stores expanded from constant memcpy/memmove/memset.

The patch is still missing LLVM test cases for the memcpy/memmove/memset.

@adibiagio
Copy link
Collaborator

Mmm.. Unfortunately this bug is much more complicated than I thought.

Strictly speaking, the problem is not related to the lowering of memcpy/memmove calls in selectionDAG.

The problem is "caused" by the presence of both explicit and implicit copies of a same 'byval' argument.

A byval %bv is a reference to a value allocated on the stack by the caller of a function. A fixed frame object is created by SelectionDAG to describe the frame location of %bv.

If %bv is passed by copy in input to a function, then SelectionDAG emits a copy of the value referenced by %bv on the stack. This is done as part of the function call lowering. On x86, a byval copy is a normal (implicit) memcpy; for small memcpy sizes, we expect that the copy gets fully expanded into a optimal sequence of loads and stores.

X86TargetLowering::LowerCallTo
-X86TargetLowering::LowerMemOpCallTo - CreateCopyOfByValArgument.

Internally, function 'CreateCopyOfByValArgument' delegates to SelectionDAG::getMemcpy().
However, CreateCopyOfByValArgument always passes a "null" MachinePointerInfo objects for both source and destination. The two pointer info objects are used by the memcpy expansion logic in SelectionDAG to annotate the optimally expanded sequence of loads and stores.

It is quite common to have 'null' MachinePointerInfo objects for loads and stores that reference fixed frame objects. So, the logic in Method SelectionDag::getLoad() specially handles the case of 'null pointer info'. In particular, getLoad() attempts to reconstruct the memory aliasing information by looking at the SDValue which describes the load address.
In the case of a load from a 'byval' pointer, the implicit (inferred) MachinePointerInfo is always a reference to a FixedFrameObject.

Unfortunately, we have a completely different logic for the case where there is an explicit memcpy/memmove of a 'byval'. The explicit memory copy is either a call to the memcpy/memmove intrinsic, or a sequence of loads and stores.
When SelectionDAGBuilder expands an intrinsic memcpy/memmove call, the MachinePointerInfo objects for source and destination are never null. In the case of byval %bv, the memory object would be the IR pointer Value %bv itself.
Note that the MachinePointerInfo is never null even for the case where there are explicit loads of a byval.

The problem is that the two approaches are inconsistent. The non-null MachinePointerInfo constructed from %bv, and the "inferred" MachinePointerInfo, don't seem to alias the same memory objects according to logic used in method `ScheduleDAGInstrs::buildSchedGraph()'.

As I explained in Comment 10,
ScheduleDAGInstrs::buildSchedGraph() is responsible for building the graph by walking the basic block from bottom to top and generating a scheduling unit for each machine instruction.
Dependencies are updated on the fly: the scheduler keeps two set of pointer descriptors for visited stores and loads.
Data dependences between instructions are updated based on the used/defined registers, and the referenced memory objects.

Method 'buildSchedGraph' uses a function named 'getUnderlyingObjectsForInstr' to discover which memory objects are referenced by which loads and stores. If two memory operations reference the same memory object, then the scheduler will not attempt to reorder them.

In our example, there is a mix of loads and stores obtained from the expansion of an explicit memcpy of a byval, and the expansion of an implicit copy of the same byval. The stores obtained from the explicit memcpy don't reference the same underlying memory object than the loads from the implicit byval copy. This is what causes the problem.

Incidentally, I noticed that a patch for another issue related to byval values and 'buildSchedGraph' was committed at revision 165616.

@zmodem
Copy link
Collaborator

zmodem commented Jan 30, 2017

What's the status here?

Is this a regression from 3.9 or did it exist before?

Is this something that might get fixed for 4.0?

@adibiagio
Copy link
Collaborator

What's the status here?

Hi Hans,

Since this bug is still unassigned, I suspect that nobody is working on it.

Is this a regression from 3.9 or did it exist before?

Is this something that might get fixed for 4.0?

I don't think that we have a fix for this issue. My original patch from comment 11 might be good as a workaround but isn't the right long-term fix. Finding a proper fix might take time (at least for me).

Just to avoid misunderstandings, my original intention was to help triage this issue. Apologies if it gave the impression that I was working on this. I never assigned this bug to myself as I always had high priority items to work on.

@zmodem
Copy link
Collaborator

zmodem commented Feb 1, 2017

What's the status here?

Hi Hans,

Since this bug is still unassigned, I suspect that nobody is working on it.

Is this a regression from 3.9 or did it exist before?

Is this something that might get fixed for 4.0?

I don't think that we have a fix for this issue. My original patch from
comment 11 might be good as a workaround but isn't the right long-term fix.
Finding a proper fix might take time (at least for me).

Just to avoid misunderstandings, my original intention was to help triage
this issue. Apologies if it gave the impression that I was working on this.
I never assigned this bug to myself as I always had high priority items to
work on.

I understand. I was just asking to see if anyone was working on this.

@wjristow
Copy link
Collaborator Author

wjristow commented Feb 2, 2017

As an aside, this problem has become "somewhat latent" in ToT and also
LLVM 4.0, in that the test-case provided in comment 0 no longer fails when
compiled in the "normal user way". That is, for:

clang++ -o pass.elf -O1 -march=btver2 test.cpp

that test-case now passes.

Presumably with some effort, a new test-case could be found that still shows
the problem in this "normal user way". In looking into it, assuming it's
possible to create a test-case that failed at -O2 back when this PR was
reported, then I think that test-case would still fail now.

That said, the test-case provided here still does fail in the same way it
used to when compiled at '-O1' to bitcode, and then running 'llc' manually:

clang++ -emit-llvm -S -O1 test.cpp
llc -mcpu=btver2 -o test.s test.ll
clang -c test.s
clang++ -o fail.elf test.o

In tracking it down, the change that made it start to pass when compiled
"normally" is revision 291276 (Fri, 06 Jan 2017, a bit before branching
for LLVM 4.0):

------------------------------------------------------------------
Use CodegenOpts::less when creating a TargetMachine for clang `-O1`

Summary:
Clang was initializing the TargetMachine with CodeGenOpt::Default
for O1. This change is aligning it on llc:

-O0: OptLevel = CodeGenOpt::None
-O1: OptLevel = CodeGenOpt::Less
-O2 -Os -Oz: OptLevel = CodeGenOpt::Default
-O3: OptLevel = CodeGenOpt::Aggressive
------------------------------------------------------------------

If I understand this correctly, prior to this change, in a -O1 compilation,
the back-end did some -O2 things. And for my test-case the IR Clang generated
at -O1 would trigger the problem in the back-end that had that -O2 aspect
enabled. So explicitly lowering the 'llc' opt level to -O1 would likely
suppress the problem. So I tried the following experiment:

clang++ -emit-llvm -S -O1 test.cpp
llc -O1 -mcpu=btver2 -o test.s test.ll # note: '-O1' explicitly passed
clang -c test.s
clang++ -o test.elf test.o

Running this 'test.elf' passed, somewhat as expected. In fact, it passes with
a compiler prior to the recent change of r291276. It also passes going back
the older compiler (r274395) where I first encountered this.

The bottom line: The bug is still there, although for this test-case a user
won't encounter it.

@jmorse
Copy link
Member

jmorse commented Mar 29, 2018

Hi,

I've eyeballed this a little further, and this issue does appear to still be present in both ToT and the latest release (6.0). Take this example, which passes a struct byval twice, where the expected output is "5":

--------8<--------
#include <string.h>
#include <stdio.h>

#define SIZE 7

struct face {
int bees[SIZE];
};

void
baz(struct face beards)
{
int sum = 0;
for (unsigned int i = 0; i < SIZE; i++)
sum += beards.bees[i];
printf("%d\n", sum);
}

void
foo(struct face jam)
{
jam.bees[0] = 1;
jam.bees[1] = 1;
jam.bees[2] = 1;
jam.bees[3] = 1;
jam.bees[4] = 1;
baz(jam);
}

int
main()
{
struct face bear;
memset(&bear, 0, sizeof(bear));
foo(bear);
return 0;
}
-------->8--------

generates different results depending on the cpu targeted, and AFAIK doesn't contain undefined behaviour:

jmorse@localhost:$ clang-6.0 test.c -O2 -fno-inline -o a.out
jmorse@localhost:
$ ./a.out
5
jmorse@localhost:$ clang-6.0 test.c -O2 -fno-inline -o a.out -march=btver2
jmorse@localhost:
$ ./a.out
4

(The -fno-inline is just to suppress over-optimisation, I think this problem would still be present if each function were in its own translation unit).

Specifically, something similar to what Sanjay posted above occurs, with -march=btver2:

--------8<--------
foo:
subq $40, %rsp
vmovaps .LCPI1_0(%rip), %xmm0 # xmm0 = [1,1,1,1]
vmovaps %xmm0, 48(%rsp) # Write to first portion of "jam"
vmovups 60(%rsp), %xmm0 # Read before "jam" arg fully written
movl $1, 64(%rsp) # Final write to "jam"
vmovups %xmm0, 12(%rsp)
vmovaps 48(%rsp), %xmm0
vmovups %xmm0, (%rsp)
callq baz@PLT
addq $40, %rsp
retq
-------->8--------

without:
--------8<--------
foo:
subq $40, %rsp
movaps .LCPI1_0(%rip), %xmm0 # xmm0 = [1,1,1,1]
movaps %xmm0, 48(%rsp) # First write to "jam"
movl $1, 64(%rsp) # Final write to "jam"
movups 60(%rsp), %xmm0
movups %xmm0, 12(%rsp)
movaps 48(%rsp), %xmm0
movups %xmm0, (%rsp)
callq baz
addq $40, %rsp
retq
-------->8--------

As Andrea points out, the root of this problem is that the source-level assignments to fields in "jam" are synchronised on an LLVM IR Value *, but the assignments generated to pass values down to "baz" are not synchronised. The call-lowering code only identifies that a stack slot is being written to, which the instruction scheduler doesn't know can alias the Value *. IMHO, this isn't generally solvable because (I believe?) SDValue's don't necessarily have a mapping back to Value *'s in the IR. (This is a bit of a guess, I'm not an expert on this).

Instead, IMHO the solution likely lies with MachineFrameInfo::StackSlot's "isAliased" field (currently left false for byval args on X86), the documentation for the field reads:

// If true, an LLVM IR value might point to this [stack] object.
// Normally, spill slots and fixed-offset objects don't alias IR-accessible
// objects, but there are exceptions (on PowerPC, for example, some byval
// arguments have ABI-prescribed offsets).
bool isAliased;

which seems to match what's happening here: any byval parameter's memory can be accessed through the Value * pointer of the formal function argument; but also as an SDValue if the parameter is passed byval-again to another function, and perhaps by other routes. isAliased is also true for statically sized "alloca" stack slots (FunctionLoweringInfo::set) and in MachineFrameInfo::CreateStackObject for any non-spill-slot slots, which are other cases where we can have stack slots pointed at by an IR Value.

Setting X86TargetLowering::LowerMemArgument to mark all byval stack slots as aliased ( https://reviews.llvm.org/D45022 ) fixes the example above; it also fixes the reduced-reproducer that Sanjay posted.

Feedback appreciated; I'll put the patch forward for review if it's likely to have fixed this PR.

@jmorse
Copy link
Member

jmorse commented May 8, 2018

Proposed patch has been accepted (cheers), I've also tried to look at whether or not this is a complete fix for the general issue. It seems that most of the other consumers of PseudoSourceValue's use mayAlias(), which correctly states that FixedStackSlot's might be aliased, whether or not it is. The only users of isAliased() (which could return false) are the instruction scheduler and something similar in MIPS. And looking at MIPS, it turns out that they've experienced exactly this (byval isAliased) problem and fixed it a couple of months ago [0]!

Looking at a few other backends [1-3, linked lines might shift over time] there are several which appear to not mark byval arguments as aliased, thus making them potentially liable to similar errors. However, I'm unfamiliar with those architectures & calling conventions, so they could just as easily be safe.

Possibly this needs some kind of public service announcement on the mailing list that two backends have been bitten by this; or more simply the isAliased argument to CreateFixedObject should be default-true, giving safe behaviour by default.

[0] https://reviews.llvm.org/rL325782
[1] http://llvm.org/doxygen/ARMISelLowering_8cpp_source.html#l03597
[2] http://llvm.org/doxygen/AArch64ISelLowering_8cpp_source.html#l02867
[3] http://llvm.org/doxygen/HexagonISelLowering_8cpp_source.html#l00767

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugzilla Issues migrated from bugzilla clang:codegen miscompilation
Projects
None yet
Development

No branches or pull requests

7 participants