-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
clang++ miscompile at -O1 with revision 274385 #29638
Comments
//////////////////////// "old_bug.cpp" //////////////////////// class G { G(int i) : mData(i) {} static G combine(G g1, G g2, G g3) { static G mix(G g1, G g2, G g3, G g4) { int main() { printf("mData = %d (expect 17). %s.\n", |
In finding pre-existing source bugs that are revealed by r274385, -fsanitize-address-use-after-scope appears to be very useful. Maybe you want to try that first. |
Oops used wrong account... It's me above. |
After seeing your note at bug 28695, comment 3, I was just about to add a note "I believe the test-case ... is legal code, ..." But I see you beat me to the punch. Thanks for the advice in comment 3. I'll |
testcase with auxiliary files I did run the address sanitizer: -fsanitize=address -fsanitize-address-use-after-scope on the test-case in comment 0, above, and it pointed out two memory leaks (of ==27001==ERROR: AddressSanitizer: stack-use-after-scope on address ... So I can see that that approach can be useful, but it didn't find any issues That said, I still suspect that the underlying issue is an old latent bug that In analyzing it with the scheduling observation in mind, I found that if I
Below are commands that show the sequence. $ clang++ -emit-llvm -S -O1 test.cpp *** 265,277 **** In summary:
|
Yes, I believe that it's a backend bug, because I removed all lifetime.start and lifetime.end calls in the .ll file, and it still fails. ~ % lli -O1 bad.ll Looking at the deviate point of -print-after-all, the bug happens late in the backend, at least as late as post RA scheduling: ~ % diff -u10 <(llc -O1 bad.ll -print-after-all 2>&1) <(llc -O2 bad.ll -print-after-all 2>&1) | head -40 *** IR Dump After Post RA top-down list latency scheduler ***:... but I don't know where the issue is. @spatel, can you look at this btver2 related issue? |
This is not strictly-speaking a btver2 bug; it's just that the post-RA scheduler / critical-antidep-breaker are enabled for btver2, so the bug is visible there because the bug is likely in one of those places. These passes are also enabled for the default 32-bit x86 CPU model, so I think we could hit a variant of this bug with any "-m32" compilation (see possibly-related bug 27681 / https://reviews.llvm.org/D20456 ). Here's a reduced (but not minimized) test case to show that the bug is independent of btver2: %class.G = type <{ i32, [4 x i8], i64, i32, [4 x i8] }> define void @foo(%class.G* noalias nocapture sret %agg.result, %class.G* byval nocapture readonly align 8 %g1, %class.G* byval nocapture readonly align 8 %g2, %class.G* byval nocapture readonly align 8 %g3, %class.G* byval nocapture align 8 %g4) { declare void @bar(%class.G* noalias nocapture sret %agg.result, %class.G* byval nocapture readonly align 8 %g1, %class.G* byval nocapture readonly align 8 %g2, %class.G* byval nocapture align 8 %g3) You can see the problem by diffing the output of: As Warren showed in comment 5, a 64-bit load is getting hoisted ahead of a 32-bit store to that same location: Good: Bad: cc'ing Dave Kreitzer and Matthias Braun in case they have any insight. I think the last person to look at these passes was Mitch Bodart in D20456, but I don't see a bugzilla ID for Mitch. I don't have a good understanding of this layer. |
Thanks for the small reproducer Sanjay. I did some investigation, and I am pretty sure that the issue is neither in post-RA scheduler nor in the antidep breaker. For what I can see, recent changes to the antidep breaker have exposed what seemed to be a latent bug. The problem is caused by an "inaccurate" pointer information (MachinePointerInfo) associated to the loads and stores of an unrolled constant memcpy call. This occurs at SelectionDAGBuilder stage when we the intrinsic memcpy call is expanded. The machine scheduler uses instances of "MachinePointerInfo" to describe what memory objects are referenced by each load and store. During codegen, the scheduler uses that information to reconstruct the dependency arcs between memory operations; load and store operations that references the same memory objects may alias. The "pointer info" is encapsulated in the machine memory operand, and describes the memory access in terms of base+offset. Back to our problem: The memcpy call from the minimal reproducible posted by Sanjay is expanded by the x86 backend into a sequence of loads and stores. Each load/store refers to a stack object (a fixed stack objects, and a non-fixed stack object related to the 'alloca'). %t0 = bitcast %class.G* %ref.tmp to i8* Where %ref.tmp is a non-fixed frame object obtained from an alloca; %g4 is a pointer to a value passed by copy (i.e. byval) on the stack by the caller. So, the object pointed by %g4 is a When the SelectionDAGBuilder expands the call to memcpy, the pointer info instances related to %t0 and %t1 do not reference the two stack objects!. The inaccurate pointer info for the loads and stores becomes problematic only after regalloc. At post-RA stage, the antidep-breaker is run to expose more opportunities for scheduling machine instructions. Every time the antidep-breaker successfully manages to break a chain of dependencies, the scheduler constructs a new dependence graph. In our example, the scheduler thinks that the load/store sequence obtained from unrolling the constant memcpy call does not alias with any of the loads/stores for the byval copy of %g4 on the stack. I have a patch that fixes the issue in the minimal reproducible from Sanjay (I plan to send it soon for review). |
Oh wow, so the bug is actually worse than I suspected. It could affect any target regardless of whether they run post-RA-scheduler (although it may be very hard to expose without that?). Thank you for tracking it down! |
So, the reason why this bug is only exposed after RA is because we run the "aggressive" antidep breaker as part of the postRAScheduler. The logic in SchedulePostRATDList::schedule() firstly attempts to "break anti dependencies". On success, it The code in ScheduleDAGInstrs::buildSchedGraph is responsible for building the graph by walking the basic block from bottom to top and generating a scheduling unit for each machine instruction. The interesting portion is related to the call to function 'getUnderlyingObjectsForInstr'. In our example, we have that each load and store references a stack object. However, all the MachineMemOperands descriptors are not instances of 'PseudoSourceValue' (which - in our case - would describe a reference to a stack object). Instead, all the loads/stores obtained from expanding the memcpy call are llvm::Value*. Since those value are not 'PseudoSourceValue' objects, they are considered to be disjoint from any stack allocated objects already visited. |
Patch (based on r281240) The patch is still missing LLVM test cases for the memcpy/memmove/memset. |
Mmm.. Unfortunately this bug is much more complicated than I thought. Strictly speaking, the problem is not related to the lowering of memcpy/memmove calls in selectionDAG. The problem is "caused" by the presence of both explicit and implicit copies of a same 'byval' argument. A byval %bv is a reference to a value allocated on the stack by the caller of a function. A fixed frame object is created by SelectionDAG to describe the frame location of %bv. If %bv is passed by copy in input to a function, then SelectionDAG emits a copy of the value referenced by %bv on the stack. This is done as part of the function call lowering. On x86, a byval copy is a normal (implicit) memcpy; for small memcpy sizes, we expect that the copy gets fully expanded into a optimal sequence of loads and stores. X86TargetLowering::LowerCallTo Internally, function 'CreateCopyOfByValArgument' delegates to SelectionDAG::getMemcpy(). It is quite common to have 'null' MachinePointerInfo objects for loads and stores that reference fixed frame objects. So, the logic in Method SelectionDag::getLoad() specially handles the case of 'null pointer info'. In particular, getLoad() attempts to reconstruct the memory aliasing information by looking at the SDValue which describes the load address. Unfortunately, we have a completely different logic for the case where there is an explicit memcpy/memmove of a 'byval'. The explicit memory copy is either a call to the memcpy/memmove intrinsic, or a sequence of loads and stores. The problem is that the two approaches are inconsistent. The non-null MachinePointerInfo constructed from %bv, and the "inferred" MachinePointerInfo, don't seem to alias the same memory objects according to logic used in method `ScheduleDAGInstrs::buildSchedGraph()'. As I explained in Comment 10, Method 'buildSchedGraph' uses a function named 'getUnderlyingObjectsForInstr' to discover which memory objects are referenced by which loads and stores. If two memory operations reference the same memory object, then the scheduler will not attempt to reorder them. In our example, there is a mix of loads and stores obtained from the expansion of an explicit memcpy of a byval, and the expansion of an implicit copy of the same byval. The stores obtained from the explicit memcpy don't reference the same underlying memory object than the loads from the implicit byval copy. This is what causes the problem. Incidentally, I noticed that a patch for another issue related to byval values and 'buildSchedGraph' was committed at revision 165616. |
What's the status here? Is this a regression from 3.9 or did it exist before? Is this something that might get fixed for 4.0? |
Hi Hans, Since this bug is still unassigned, I suspect that nobody is working on it.
I don't think that we have a fix for this issue. My original patch from comment 11 might be good as a workaround but isn't the right long-term fix. Finding a proper fix might take time (at least for me). Just to avoid misunderstandings, my original intention was to help triage this issue. Apologies if it gave the impression that I was working on this. I never assigned this bug to myself as I always had high priority items to work on. |
I understand. I was just asking to see if anyone was working on this. |
As an aside, this problem has become "somewhat latent" in ToT and also clang++ -o pass.elf -O1 -march=btver2 test.cpp that test-case now passes. Presumably with some effort, a new test-case could be found that still shows That said, the test-case provided here still does fail in the same way it clang++ -emit-llvm -S -O1 test.cpp In tracking it down, the change that made it start to pass when compiled
If I understand this correctly, prior to this change, in a -O1 compilation, clang++ -emit-llvm -S -O1 test.cpp Running this 'test.elf' passed, somewhat as expected. In fact, it passes with The bottom line: The bug is still there, although for this test-case a user |
Hi, I've eyeballed this a little further, and this issue does appear to still be present in both ToT and the latest release (6.0). Take this example, which passes a struct byval twice, where the expected output is "5": --------8<-------- #define SIZE 7 struct face { void void int generates different results depending on the cpu targeted, and AFAIK doesn't contain undefined behaviour: jmorse@localhost: (The -fno-inline is just to suppress over-optimisation, I think this problem would still be present if each function were in its own translation unit). Specifically, something similar to what Sanjay posted above occurs, with -march=btver2: --------8<-------- without: As Andrea points out, the root of this problem is that the source-level assignments to fields in "jam" are synchronised on an LLVM IR Value *, but the assignments generated to pass values down to "baz" are not synchronised. The call-lowering code only identifies that a stack slot is being written to, which the instruction scheduler doesn't know can alias the Value *. IMHO, this isn't generally solvable because (I believe?) SDValue's don't necessarily have a mapping back to Value *'s in the IR. (This is a bit of a guess, I'm not an expert on this). Instead, IMHO the solution likely lies with MachineFrameInfo::StackSlot's "isAliased" field (currently left false for byval args on X86), the documentation for the field reads:
which seems to match what's happening here: any byval parameter's memory can be accessed through the Value * pointer of the formal function argument; but also as an SDValue if the parameter is passed byval-again to another function, and perhaps by other routes. isAliased is also true for statically sized "alloca" stack slots (FunctionLoweringInfo::set) and in MachineFrameInfo::CreateStackObject for any non-spill-slot slots, which are other cases where we can have stack slots pointed at by an IR Value. Setting X86TargetLowering::LowerMemArgument to mark all byval stack slots as aliased ( https://reviews.llvm.org/D45022 ) fixes the example above; it also fixes the reduced-reproducer that Sanjay posted. Feedback appreciated; I'll put the patch forward for review if it's likely to have fixed this PR. |
Proposed patch has been accepted (cheers), I've also tried to look at whether or not this is a complete fix for the general issue. It seems that most of the other consumers of PseudoSourceValue's use mayAlias(), which correctly states that FixedStackSlot's might be aliased, whether or not it is. The only users of isAliased() (which could return false) are the instruction scheduler and something similar in MIPS. And looking at MIPS, it turns out that they've experienced exactly this (byval isAliased) problem and fixed it a couple of months ago [0]! Looking at a few other backends [1-3, linked lines might shift over time] there are several which appear to not mark byval arguments as aliased, thus making them potentially liable to similar errors. However, I'm unfamiliar with those architectures & calling conventions, so they could just as easily be safe. Possibly this needs some kind of public service announcement on the mailing list that two backends have been bitten by this; or more simply the isAliased argument to CreateFixedObject should be default-true, giving safe behaviour by default. [0] https://reviews.llvm.org/rL325782 |
Extended Description
We have encountered a miscompile for x86_64-unknown-linux-gnu, with
'-march=btver2' at optimization '-O1'. In the test-cases I found, it passes
with optimizaiton lowered to -O0 or raised to -O2, and it also passes for the
default architecture. I'd expect that the bug can also manifest at higher
optimization levels, but I haven't found an example.
I've reduced the failure to the test-case appended below, and found that the
failure began happening with r274385 (it still fails with current ToT, r280692).
The failing and passing behaviors are:
$ clang++ --version
clang version 4.0.0 (trunk 280692)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: <....>
$ clang++ -o test.elf -O1 -march=btver2 test.cpp
$ ./test.elf
input: 41 42 43 44 45
output: 41 42 45 17 17
FAIL
$
$ clang++ -o test.elf -O1 test.cpp
$ ./test.elf
input: 41 42 43 44 45
output: 43 44 45 41 42
PASS
$
$ clang++ -o test.elf -O2 -march=btver2 test.cpp
$ ./test.elf
input: 41 42 43 44 45
output: 43 44 45 41 42
PASS
$
Although the failure of this test-case began happening with r274385 (as was
the case with the program this was reduced from), I suspect that the
underlying bug is older than that, because in reducing the original program, I
also encountered a test-case that fails in a similar way with significantly
older compilers. Specifically, I found a very similar test-case that fails
with llvm 3.6 (passes with llvm 3.5). I'm guessing that the recent change of
r274385 is correct, and it's just that the old bug is now more likely to
manifest. I haven't bisected down precisely when the older failure began to
happen. I'll put that similar test-case that fails even with older compilers
in the next post here. (I'm filing this against Clang / LLVM Codegen, because
that's what changed with r274385. However, if my guess is right, that's
probably the wrong area.)
Note that bug 28603 is generally about failures appearing since r274385. But
in the discussion there, the failing test-cases were all examples of broken
user-code that was exposed by r274385. I believe the test-case below is legal
code, but as I said above, I'm suspicious that r274385 itself is just exposing
an older latent problem.
With that, here is the test-case that began failing with r274385:
//////////////////////////////// test.cpp ////////////////////////////////
extern "C" void printf(...);
int srcData[] = {41, 42, 43, 44, 45};
const int nElems = sizeof(srcData) / sizeof(srcData[0]);
struct Container {
Container(int i) { mData = i; }
int mData;
};
struct Wrapper {
int unusedInt;
long unusedLong;
Wrapper(Container* p) : mPtr(p) {}
Container* mPtr;
};
struct ContainerVector {
Container* firstElem;
Container* lastElem;
ContainerVector() : firstElem(0), lastElem(0) { }
Container* ptrToElem(int inx) { return firstElem + inx; }
void allocVec(int n_elems) {
Container* newVec = (Container*) operator new(sizeof(Container) * n_elems);
firstElem = newVec;
lastElem = newVec + n_elems;
for (int i = 0; i < n_elems; ++i)
ptrToElem(i)->mData = 17;
}
};
static void printVec(ContainerVector cv, const char* msg)
{
printf("%10s: ", msg);
for (int i = 0; i < nElems; ++i)
printf(" %3d", cv.ptrToElem(i)->mData);
printf("\n");
}
static Wrapper copy(Wrapper wSrcStart, Wrapper wSrcEnd, Wrapper wDst) {
Container *cSrc, *cEnd, *cDst;
cSrc = wSrcStart.mPtr; cEnd = wSrcEnd.mPtr; cDst = wDst.mPtr;
for (; cSrc != cEnd; ++cDst, ++cSrc)
*cDst = *cSrc;
wDst.mPtr = cDst;
return wDst;
}
static void copyWithRotateLeft(Wrapper wStartSrc, Wrapper wStartDstSrc,
Wrapper wEndSrc, Wrapper wDst) {
wDst = copy(wStartDstSrc, wEndSrc, wDst);
copy(wStartSrc, wStartDstSrc, wDst);
}
int main() {
ContainerVector src, dst;
src.allocVec(nElems);
dst.allocVec(nElems);
for (int i = 0; i < nElems; ++i)
src.ptrToElem(i)->mData = srcData[i];
printVec(src, "input");
copyWithRotateLeft(Wrapper(src.firstElem), Wrapper(src.firstElem + 2),
Wrapper(src.lastElem), Wrapper(dst.firstElem));
printVec(dst, "output");
printf("%s\n", (((dst.ptrToElem(4))->mData != 42) ? "FAIL" : "PASS"));
}
//////////////////////////////////////////////////////////////////////////
The text was updated successfully, but these errors were encountered: