Created attachment 10423 [details] Testcase for InitDAGTopologicalSorting assertion failure Reduced testcase attached. Compiling with "-target i386-unknown-freebsd10 -O2 -c gstatomicqueue-testcase.c" results in: Assertion failed: (Node2Index[SU->NodeNum] > Node2Index[I->getSUnit()->NodeNum] && "Wrong topological sorting"), function InitDAGTopologicalSorting, file /share/dim/src/llvm/trunk/lib/CodeGen/ScheduleDAG.cpp, line 511. Stack dump: 0. Program arguments: /home/dim/obj/llvm-180108M-trunk-freebsd10-i386-aconf-rel-1/Release+Asserts/bin/clang -cc1 -triple i386-unknown-freebsd10 -emit-obj -disable-free -main-file-name gstatomicqueue-testcase.c -mrelocation-model static -mdisable-fp-elim -masm-verbose -mconstructor-aliases -target-cpu i486 -target-linker-version 2.23.1 -coverage-file /share/dim/src/llvm/bugs/gstreamer/gstatomicqueue-testcase.o -resource-dir /home/dim/obj/llvm-180108M-trunk-freebsd10-i386-aconf-rel-1/Release+Asserts/bin/../lib/clang/3.3 -O2 -fdebug-compilation-dir /share/dim/src/llvm/bugs/gstreamer -ferror-limit 19 -fmessage-length 278 -mstackrealign -fobjc-runtime=gnustep -fobjc-default-synthesize-properties -fdiagnostics-show-option -fcolor-diagnostics -backend-option -vectorize-loops -o gstatomicqueue-testcase.o -x c gstatomicqueue-testcase.c 1. <eof> parser at end of file 2. Code generation 3. Running pass 'Function Pass Manager' on module 'gstatomicqueue-testcase.c'. 4. Running pass 'X86 DAG->DAG Instruction Selection' on function '@gst_atomic_queue_push' clang: error: unable to execute command: Abort trap (core dumped) clang: error: clang frontend command failed due to signal (use -v to see invocation) clang version 3.3 (trunk 180108) Target: i386-unknown-freebsd10 Thread model: posix
Reduced test case, compile with llc -march=x86 -mcpu=i486 < t.ll: define void @gst_atomic_queue_push() { entry: br label %while.body while.body: %0 = load volatile i32* undef, align 4 fence seq_cst %1 = load volatile i32* undef, align 4 %cmp = icmp sgt i32 %1, %0 br i1 %cmp, label %while.body, label %if.then if.then: ret void }
Created attachment 11013 [details] fix for assertion failure The attached patch changes the WorkList in InitDAGTopologicalSorting into a FIFO queue. Items that are added earlier to the list need to get a higher index than later items because they may be successors of the later items. This fixes the assertion failure, but does not fix the bug yet. The bug does not appear with -msse2. In that case the mfence instruction is used. So something seems to go wrong with the -mno-sse2 fence (lock orl $0,(%esp)).
Created attachment 11024 [details] proposed fix This patch fixes the bug for me. It creates a new X86ISD::FENCE node and lowers ISD::ATOMIC_FENCE to it instead of to X86::OR32mrLocked. Then during instruction selection X86ISD::FENCE is replaced by X86::LOCK_OR32mi8.
The patch to fix the assertion failure turns not to be necessary but it's still a valid patch I think.
Created attachment 11025 [details] Proposed fix, updated for ToT As-is, the proposed fix doesn't compile with top-of-trunk: llvm[3]: Compiling X86ISelDAGToDAG.cpp for Release+Asserts build lib/Target/X86/X86ISelDAGToDAG.cpp:1867:18: error: no matching member function for call to 'getMachineNode' return CurDAG->getMachineNode(X86::LOCK_OR32mi8, dl, MVT::Other, Ops); ~~~~~~~~^~~~~~~~~~~~~~ include/llvm/CodeGen/SelectionDAG.h:824:18: note: candidate function not viable: no known conversion from 'llvm::DebugLoc' to 'llvm::SDLoc' for 2nd argument MachineSDNode *getMachineNode(unsigned Opcode, SDLoc dl, EVT VT, ^ and so on for all the possible overloads. Apparently llvm:DebugLoc has been replaced with llvm::SDLoc after the 3.3 release. I have updated the patch to compile with top-of-trunk. Now the only thing missing is a good testcase... And I agree, your earlier patch replacing vector with queue should also go in, as a separate commit of course.
For some reason, this fix also changes the behavior of the codegen a little, since it makes test/CodeGen/X86/2012-01-16-mfence-nosse-flags.ll fail. This is because instead of the expected "cmpl $0, %eax", it now generates "testl %eax, %eax". I have no idea why yet. If it is correct, that other test must also be updated.
(In reply to comment #6) > For some reason, this fix also changes the behavior of the codegen a little, > since it makes test/CodeGen/X86/2012-01-16-mfence-nosse-flags.ll fail. This > is because instead of the expected "cmpl $0, %eax", it now generates "testl > %eax, %eax". I have no idea why yet. If it is correct, that other test > must also be updated. I'm not sufficiently familiar with llvm internals to know if this is a correct explanation but I suspect the reason is that the fence used to be "xorl r,r \ lock orl r,(%esp)" whereas now it is "lock orl $O,(%esp)" (in the old code there was getConstant(0) and now it's getTargetConstant(0)). In the test case the comparison comes before the fence which means initially llvm expected to be able to reuse the zero for the fence in the comparison (cmpl r,%eax) and this isn't optimised into a testl. Later the comparison is rescheduled and llvm doesn't know r isn't clobbered so it becomes cmpl $0,%eax. With the patch the fence doesn't require a zero so initially there's cmpl $0,%eax which is optimised into a testl that is then rescheduled. First I tried to keep the getConstant(0), as an operand to X86ISD::FENCE and then instruction selection would choose between LOCK_OR32mi8 and LOCK_OR32mr based on whether the zero was used only once or not. This basically worked but I couldn't figure out what to put in the .td file for X86ISD::FENCE with an operand and I couldn't get it to optimise {__sync_synchronize();return 0;} which had two "xorl %eax,%eax" (llvm doesn't seem to know the src operand in LOCK_OR isn't clobbered). As for a test case, wouldn't Benjamin Kramer's test above be good enough?
Created attachment 11028 [details] Proposed fix, updated for ToT, with testcase Updated the fix with a testcase specific for this PR, taken from comment 1. I also had to update the testcase for bug 11768, which now generates more efficient code, e.g. before the fix it generated: movl ptr, %eax xorl %ecx, %ecx lock orl %ecx, (%esp) cmpl $0, %eax je .LBB0_1 after the fix it generates: movl ptr, %eax lock orl $0, (%esp) testl %eax, %eax je .LBB0_1
I've done some digging on exactly why it's failing, so I'll record my discoveries in case I forget over the weekend. The issue is that SelectionDAG comes across: A = Node(load volatile) MachineNode(ORfence) B Node(load volatile) Node(sub A, B) in chain order. When deciding whether it can merge the sub and the "A load", as is x86's wont, it tries to check if combining those chains would create a cycle. Unfortunately it stops looking when it reaches a MachineNode, thinking that everything beyond that point must be selected and hence OK (SelectionDAG.cpp:1863). That is an incorrect assumption in this case and a loopy SUB32rm is emitted. So the patch works by making sure there isn't a pre-selected MachineNode to find during this check. It would be nice to assume that MachineNodes only exist after selection, but if so, we've got quite a bit of refactoring in multiple backends to get there. I'll see if I can work out what the performance implications of doing a more thorough search here are.
Fixed in r191165. Thanks Tim!