Created attachment 23955 [details] bbi-47255.ll reproducer llc -O1 -o - bbi-47255.ll results in .text .file "bbi-47255.ll" llc: ../lib/CodeGen/SelectionDAG/SelectionDAG.cpp:900: bool llvm::SelectionDAG::RemoveNodeFromCSEMaps(llvm::SDNode *): Assertion `N->getOpcode() != ISD::DELETED_NODE && "DELETED_NODE in CSEMap!"' failed. PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace. Stack dump: 0. Program arguments: build-all/bin/llc -O1 -o - bbi-47255.ll 1. Running pass 'Function Pass Manager' on module 'bbi-47255.ll'. 2. Running pass 'X86 DAG->DAG Instruction Selection' on function '@f' #0 0x00000000025d2c84 PrintStackTraceSignalHandler(void*) (build-all/bin/llc+0x25d2c84) #1 0x00000000025d07fe llvm::sys::RunSignalHandlers() (build-all/bin/llc+0x25d07fe) #2 0x00000000025d2f9c SignalHandler(int) (build-all/bin/llc+0x25d2f9c) #3 0x00007f7f4f6c88a0 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x128a0) #4 0x00007f7f4e171f47 raise /build/glibc-2ORdQG/glibc-2.27/signal/../sysdeps/unix/sysv/linux/raise.c:51:0 #5 0x00007f7f4e1738b1 abort /build/glibc-2ORdQG/glibc-2.27/stdlib/abort.c:81:0 #6 0x00007f7f4e16342a __assert_fail_base /build/glibc-2ORdQG/glibc-2.27/assert/assert.c:89:0 #7 0x00007f7f4e1634a2 (/lib/x86_64-linux-gnu/libc.so.6+0x304a2) #8 0x00000000023be964 llvm::SelectionDAG::RemoveNodeFromCSEMaps(llvm::SDNode*) (build-all/bin/llc+0x23be964) #9 0x00000000023bec4f llvm::SelectionDAG::DeleteNode(llvm::SDNode*) (build-all/bin/llc+0x23bec4f) #10 0x000000000226da96 (anonymous namespace)::DAGCombiner::recursivelyDeleteUnusedNodes(llvm::SDNode*) (build-all/bin/llc+0x226da96) #11 0x000000000226e990 llvm::SelectionDAG::Combine(llvm::CombineLevel, llvm::AAResults*, llvm::CodeGenOpt::Level) (build-all/bin/llc+0x226e990) #12 0x0000000002416372 llvm::SelectionDAGISel::CodeGenAndEmitDAG() (build-all/bin/llc+0x2416372) #13 0x0000000002415392 llvm::SelectionDAGISel::SelectAllBasicBlocks(llvm::Function const&) (build-all/bin/llc+0x2415392) #14 0x00000000024119ce llvm::SelectionDAGISel::runOnMachineFunction(llvm::MachineFunction&) (build-all/bin/llc+0x24119ce) #15 0x00000000012f6da5 (anonymous namespace)::X86DAGToDAGISel::runOnMachineFunction(llvm::MachineFunction&) (build-all/bin/llc+0x12f6da5) #16 0x0000000001abf979 llvm::MachineFunctionPass::runOnFunction(llvm::Function&) (build-all/bin/llc+0x1abf979) #17 0x0000000001eb83bf llvm::FPPassManager::runOnFunction(llvm::Function&) (build-all/bin/llc+0x1eb83bf) #18 0x0000000001ebec58 llvm::FPPassManager::runOnModule(llvm::Module&) (build-all/bin/llc+0x1ebec58) #19 0x0000000001eb898d llvm::legacy::PassManagerImpl::run(llvm::Module&) (build-all/bin/llc+0x1eb898d) #20 0x00000000006c306d main (build-all/bin/llc+0x6c306d) #21 0x00007f7f4e154b97 __libc_start_main /build/glibc-2ORdQG/glibc-2.27/csu/../csu/libc-start.c:344:0 #22 0x00000000006c09aa _start (build-all/bin/llc+0x6c09aa) Abort (core dumped) This is new, it starts crashing with deb4b25807: [DAGCombine] Don't delete the node if it has uses immediately
> This is new, it starts crashing with deb4b25807: > > [DAGCombine] Don't delete the node if it has uses immediately That one is on the 11.x branch, so let's treat this as a blocker. QingShan: can you take a look?
Here we called `RemoveDeadNode` on a constant fp node (in `getNegatedExpression` case `ISD::FADD`), which seems to cause the crash. Is the handling of constants different from other types of nodes?
Sent patch https://reviews.llvm.org/D87614 to fix this.
(In reply to Qiu Chaofan from comment #3) > Sent patch https://reviews.llvm.org/D87614 to fix this. That landed in 2508ef014e8b. I've merged it to 11.x in 88e17a8e9b49bfbcfc1fa70f867b5b56a7a64fc7. Thanks!
Thanks!
Reopening as https://reviews.llvm.org/D87614#2273931 suggests there are still problems. I'm not familiar with the code, but it sounds like we're now trying to fix the fixes... Is it possible to fix these problems in some more principled way, or can we revert back to green?
I reverted the commit and plan to send a new revision.. I thought it was in urgent and committed without additional sanitize check. Sorry for the back-and-forth.
(In reply to Qiu Chaofan from comment #7) > I reverted the commit and plan to send a new revision.. I thought it was in > urgent and committed without additional sanitize check. Sorry for the > back-and-forth. Thanks. Cherry-picked the revert to 11.x as 22dab218407e159631fd0689cb4412646b51515a.
Hmm, we do some big refactor for getNegatedExpression and this is the last step to remove unused DAG Node we created to save compiling time. Though there are several patches but in fact, they are all doing the same thing. We miss the fact that, two DAG nodes can be the same no matter where it comes from. I think, we have clean it now.
> I think, we have clean it now. I see the latest patch is here: https://reviews.llvm.org/D87698
(In reply to Hans Wennborg from comment #10) > > I think, we have clean it now. > > I see the latest patch is here: https://reviews.llvm.org/D87698 That landed and has been cherry-picked to 11.x as b78e5de029c26c309f541ab883fa5d6d953b073d. Please let me know if there are any follow-ups.
Hi! Unfortunately we found a new problem related to this in our downstream testing. Reproduce with: llc -O1 -o - bbi-47958_ppc.ll -mtriple=powerpc-unknown-linux-gnu Result: llc: ../lib/CodeGen/SelectionDAG/SelectionDAG.cpp:900: bool llvm::SelectionDAG::RemoveNodeFromCSEMaps(llvm::SDNode *): Assertion `N->getOpcode() != ISD::DELETED_NODE && "DELETED_NODE in CSEMap!"' failed. This starts happening wiith the fix https://reviews.llvm.org/D87698 submitted as a2fb5446be. If I revert a2fb5446be this crash goes away. I haven't tried this out on 11.x, only on trunk.
Created attachment 24015 [details] bbi-47958_ppc.ll reproducer
Qiu: sounds like there are *still* problems here, despite multiple fixes, and it's holding up the release. Can you please take a look?
> Can you please take a look? In fact, can we revert back to the behavior of 10.0.0? Given the number of fix attempts so far, I'm not confident that just adding another patch right before the release will do the trick here.
(In reply to Mikael Holmén from comment #12) > Hi! > > Unfortunately we found a new problem related to this in our downstream > testing. > > Reproduce with: > llc -O1 -o - bbi-47958_ppc.ll -mtriple=powerpc-unknown-linux-gnu > > Result: > llc: ../lib/CodeGen/SelectionDAG/SelectionDAG.cpp:900: bool > llvm::SelectionDAG::RemoveNodeFromCSEMaps(llvm::SDNode *): Assertion > `N->getOpcode() != ISD::DELETED_NODE && "DELETED_NODE in CSEMap!"' failed. Mikael, can you comment on the nature of the source code and how likely users are to run into this bug? Given the state of this problem, we may have to release with this as a known issue.
Also could someone comment on how severe this assert is. For example, is the bug only that we're trying to remove an already deleted node from this map (which seems benign in itself) or could it lead to miscompilation when asserts are disabled?
(In reply to Hans Wennborg from comment #16) > (In reply to Mikael Holmén from comment #12) > > Hi! > > > > Unfortunately we found a new problem related to this in our downstream > > testing. > > > > Reproduce with: > > llc -O1 -o - bbi-47958_ppc.ll -mtriple=powerpc-unknown-linux-gnu > > > > Result: > > llc: ../lib/CodeGen/SelectionDAG/SelectionDAG.cpp:900: bool > > llvm::SelectionDAG::RemoveNodeFromCSEMaps(llvm::SDNode *): Assertion > > `N->getOpcode() != ISD::DELETED_NODE && "DELETED_NODE in CSEMap!"' failed. > > Mikael, can you comment on the nature of the source code and how likely > users are to run into this bug? > > Given the state of this problem, we may have to release with this as a known > issue. As far as I know we have only seen this in some generated tests involving loops with _Complex float. The latest example looks like _Complex float a; void f(void) { for (int b;;) { a = a - a; a = a + b * a; a = a + a; b = a * a * (a * (a + a) + a - (a + a + a - b)); } } and then we ran clang -emit-llvm-bc -ffast-math -O1 opt -O1 -disable-basic-aa llc -O1 on it for our target. I'm not sure what is really required for it to happen.
(In reply to Mikael Holmén from comment #18) > (In reply to Hans Wennborg from comment #16) > > (In reply to Mikael Holmén from comment #12) > > > Hi! > > > > > > Unfortunately we found a new problem related to this in our downstream > > > testing. > > > > > > Reproduce with: > > > llc -O1 -o - bbi-47958_ppc.ll -mtriple=powerpc-unknown-linux-gnu > > > > > > Result: > > > llc: ../lib/CodeGen/SelectionDAG/SelectionDAG.cpp:900: bool > > > llvm::SelectionDAG::RemoveNodeFromCSEMaps(llvm::SDNode *): Assertion > > > `N->getOpcode() != ISD::DELETED_NODE && "DELETED_NODE in CSEMap!"' failed. > > > > Mikael, can you comment on the nature of the source code and how likely > > users are to run into this bug? > > > > Given the state of this problem, we may have to release with this as a known > > issue. > > As far as I know we have only seen this in some generated tests involving > loops with _Complex float. > > The latest example looks like > > _Complex float a; > > void f(void) > { > for (int b;;) > { > a = a - a; > a = a + b * a; > a = a + a; > b = a * a * (a * (a + a) + a - (a + a + a - b)); > } > } > > and then we ran > clang -emit-llvm-bc -ffast-math -O1 > opt -O1 -disable-basic-aa > llc -O1 > on it for our target. > > I'm not sure what is really required for it to happen. Hi, in patch D87698, will it resolve the crash if only revert the `RemoveDeadNode(CFP)` part? I'll look at it later. Thanks.
(In reply to Qiu Chaofan from comment #19) > Hi, in patch D87698, will it resolve the crash if only revert the > `RemoveDeadNode(CFP)` part? Yes if I just remove that line the crash seems to go away for the new reproducer. I haven't done any more testing so I've no idea if it comes with any negative effects.
(In reply to Hans Wennborg from comment #17) > Also could someone comment on how severe this assert is. > > For example, is the bug only that we're trying to remove an already deleted > node from this map (which seems benign in itself) or could it lead to > miscompilation when asserts are disabled? Qui, do you know the answer to this question?
(In reply to Hans Wennborg from comment #21) > (In reply to Hans Wennborg from comment #17) > > Also could someone comment on how severe this assert is. > > > > For example, is the bug only that we're trying to remove an already deleted > > node from this map (which seems benign in itself) or could it lead to > > miscompilation when asserts are disabled? > > Qui, do you know the answer to this question? As I tested, the program still crashes due to memory error on a non-assert build. At the first look, the reason is that a constant node (combined from a fmul) is deleted but still referenced in `getNegatedExpression`. BTW, removing the `RemoveDeadNode` here will never do harm, although it might not be the root cause. Since the patch introducing it is to avoid time explosion, not crashes..
(In reply to Qiu Chaofan from comment #22) > As I tested, the program still crashes due to memory error on a non-assert > build. Okay, that sounds pretty bad. > > At the first look, the reason is that a constant node (combined from a fmul) > is deleted but still referenced in `getNegatedExpression`. > > BTW, removing the `RemoveDeadNode` here will never do harm, although it > might not be the root cause. Since the patch introducing it is to avoid time > explosion, not crashes.. I thought the RemoveDeadNode calls were added to fix PR46877 which reports an infinite loop in DAGCombiner, not just slow compiles? (compile time "explosion" also doesn't sound good..)
I've been trying to trace the history of these patches: --- The latest patch causing a problem is D87698 That is a re-land of D87614 but with a fix "in rare cases the node to be removed is the same as result of getNode. We missed that" D87614 was in turn a fix of D86183 for PR47517 D86183 was a fix for PR46877 which was caused by D77319 (which was already reverted once) It seems D77319 is another in a line of fixes: "We are struggling to fix such issues on D75501, D70975, D76638, D76439 etc" Some of those landed and some didn't. They all seem targeted at a problem discussed in D70595 (there's no PR). --- This history of patches is what concerns me about adding yet another fix: how confident can we be that it's the last one? Also, there is a high cost to taking another fix at this point. It means spinning another release candidate and pushing the release even further out. I don't want to do that unless we're completely confident that the fix is correct, and that the problem it solves is severe enough. Sadly, it sounds like we don't just assert, but crash here, which is certainly bad. Sanjay, you've been involved in these patches. What are your thoughts on this situation?
(In reply to Hans Wennborg from comment #23) > I thought the RemoveDeadNode calls were added to fix PR46877 which reports > an infinite loop in DAGCombiner, not just slow compiles? (compile time > "explosion" also doesn't sound good..) It was exponentially increasing compile-time after closer inspection, but it went effectively-infinite-loop pretty quickly. (In reply to Hans Wennborg from comment #24) > This history of patches is what concerns me about adding yet another fix: > how confident can we be that it's the last one? > > Also, there is a high cost to taking another fix at this point. It means > spinning another release candidate and pushing the release even further out. > > I don't want to do that unless we're completely confident that the fix is > correct, and that the problem it solves is severe enough. > > Sadly, it sounds like we don't just assert, but crash here, which is > certainly bad. > > > > Sanjay, you've been involved in these patches. What are your thoughts on > this situation? I don't have the current failure root caused, but so far I haven't gotten it to fail on x86 or aarch64, so there may be some target-specific behavior in play. Backing out the whole series of patches would probably be worse than taking 1 more patch since we were fixing a crash to begin with. But I can't say yet whether this is something we would hit in typical code yet. Killing a RemoveDeadNode() call seems like a relatively safe change, but our track record on this series of patches is poor as noted. cc'ing Craig for another opinion. I'll keep trying to reduce/isolate the test.
(In reply to Sanjay Patel from comment #25) > (In reply to Hans Wennborg from comment #23) > > I thought the RemoveDeadNode calls were added to fix PR46877 which reports > > an infinite loop in DAGCombiner, not just slow compiles? (compile time > > "explosion" also doesn't sound good..) > > It was exponentially increasing compile-time after closer inspection, but it > went effectively-infinite-loop pretty quickly. > > > (In reply to Hans Wennborg from comment #24) > > This history of patches is what concerns me about adding yet another fix: > > how confident can we be that it's the last one? > > > > Also, there is a high cost to taking another fix at this point. It means > > spinning another release candidate and pushing the release even further out. > > > > I don't want to do that unless we're completely confident that the fix is > > correct, and that the problem it solves is severe enough. > > > > Sadly, it sounds like we don't just assert, but crash here, which is > > certainly bad. > > > > > > > > Sanjay, you've been involved in these patches. What are your thoughts on > > this situation? > > I don't have the current failure root caused, but so far I haven't gotten it > to fail on x86 or aarch64, so there may be some target-specific behavior in > play. In this particular case a target using AA during DAGCombin/Isel is needed. Our out of tree target does, and PPC too. When I forced useAA==true on x86 it crashed there too.
> I'll keep trying to reduce/isolate the test. Thanks for looking into it!
The cause should be: We're negating `fmul X,Y`, X is an `fmul` but NegX is finally constant-folded in `getNode`, Y is a direct constant so NegY is also constant. Here it's some coincidence that **NegX is equal to NegY**. NegX was kept since the code did not go through the part in `case ISD::ConstantFP` (X is not constant) but for Y, NegY was deleted. So NegX was deleted too, and referenced later..
I have it reduced to this to crash on x86 (but ironically not with the original PPC target): define float @f(float* %p) { store float 0.0, float* %p, align 1 %real = load float, float* %p, align 1 %r2 = fmul fast float %real, %real %t1 = fmul fast float %real, 3.0 %t2 = fmul fast float %t1, %real %mul_ac56 = fmul fast float %t2, %t1 %mul_ac72 = fmul fast float %mul_ac56, %r2 ret float %mul_ac72 } -------------------------------------------------------------------------------- So to trigger the crash, we require fast-math (at least 'reassoc') and some constant folding (the load/store of 0.0 and subsequent fmul) that escaped IR optimization. That constant folding has to follow a very narrow path - even commuting the operands in some of those instructions avoids the bug. I don't know how to assess the risk of that scenario. I want to believe it's not worth delaying the 11.0 release again, but a fix should definitely go into any later builds since it's a compiler crash.
I improved SDAG's constant folding: https://reviews.llvm.org/rG2ccbf3dbd5ba ...so that's going to make it harder to hit the bug in getNegatedExpression(). I think we should still make a change in there to be safer, but the IR repro test here is not crashing now in trunk. It's still a judgement call as to whether we should delay the release for my fix or a change in getNegatedExpression().
(In reply to Sanjay Patel from comment #30) > I improved SDAG's constant folding: > https://reviews.llvm.org/rG2ccbf3dbd5ba > ...so that's going to make it harder to hit the bug in > getNegatedExpression(). I think we should still make a change in there to be > safer, but the IR repro test here is not crashing now in trunk. > > It's still a judgement call as to whether we should delay the release for my > fix or a change in getNegatedExpression(). Thank you, Sanjay! Although we eliminated the failure, I'm afraid there might be cases similar to this but negated results are not constants? (NegX is equal to NegY, NegX was used but NegY was deleted)
We've found another crash relasted to the "RemoveDeadNode(CFP)" part. Unfortunately I haven't been able to reproduce it on PPC or X86. Roughly what happens is that in TargetLowering::getNegatedExpression Op is t36: f32 = ConstantFP<4.218750e+08> and SDValue CFP = DAG.getConstantFP(V, DL, VT); returns a the same node so Op == CFP and then we do RemoveDeadNode(CFP); which will remove Op. But the caller of getNegatedExpression has a handle to Op which then is NegX: t36: f32 = <<Deleted Node!>> and it will use it to build a new fsub node, which will reference a deleted node: Creating new node: t37: f32 = fsub nnan ninf nsz arcp contract afn reassoc t37, ConstantFP:f32<-4.218750e+08> The new x * 0.0 fold doesn't help in this case.
(Bah, forgot to add the last part) Compilation later ends with Operand not processed? t37: f32 = fsub nnan ninf nsz arcp contract afn reassoc t37, ConstantFP:f32<-4.218750e+08> UNREACHABLE executed at ../lib/CodeGen/SelectionDAG/LegalizeTypes.cpp:485!
This looks like the node is its own operand? t37: f32 = fsub nnan ninf nsz arcp contract afn reassoc t37, ConstantFP:f32<-4.218750e+08>
(In reply to Craig Topper from comment #34) > This looks like the node is its own operand? > > t37: f32 = fsub nnan ninf nsz arcp contract afn reassoc t37, > ConstantFP:f32<-4.218750e+08> Yes I think it's due to the input actually being a DELETED NODE.
(In reply to Mikael Holmén from comment #32) > We've found another crash relasted to the "RemoveDeadNode(CFP)" part. Ok, now I've managed to reproduce the latest crash on X86 as well: llc -O0 -o - bbi-47837_x86.ll
Created attachment 24019 [details] bbi-47837_x86.ll reproducer
(In reply to Mikael Holmén from comment #32) > We've found another crash relasted to the "RemoveDeadNode(CFP)" part. > Unfortunately I haven't been able to reproduce it on PPC or X86. > > Roughly what happens is that in TargetLowering::getNegatedExpression Op is > > t36: f32 = ConstantFP<4.218750e+08> > > and > > SDValue CFP = DAG.getConstantFP(V, DL, VT); > > returns a the same node so Op == CFP and then we do > > RemoveDeadNode(CFP); > > which will remove Op. > > But the caller of getNegatedExpression has a handle to Op which then is > > NegX: t36: f32 = <<Deleted Node!>> > > and it will use it to build a new fsub node, which will reference a deleted > node: > > Creating new node: t37: f32 = fsub nnan ninf nsz arcp contract afn reassoc > t37, ConstantFP:f32<-4.218750e+08> > > The new x * 0.0 fold doesn't help in this case. Ah, so it's the same issue as comment #28. I think I need to revert the Remove DeadNode(CFP) part since it's actually not necessary for the original bug of thi s.
(In reply to Sanjay Patel from comment #29) > So to trigger the crash, we require fast-math (at least 'reassoc') and some > constant folding (the load/store of 0.0 and subsequent fmul) that escaped IR > optimization. That constant folding has to follow a very narrow path - even > commuting the operands in some of those instructions avoids the bug. > > I don't know how to assess the risk of that scenario. I want to believe it's > not worth delaying the 11.0 release again, but a fix should definitely go > into any later builds since it's a compiler crash. Mikael, how bad would it be for your group if these issues are not fixed in LLVM 11.0.0, but rather 11.0.1? (In reply to Qiu Chaofan from comment #38) > I think I need to revert the > Remove DeadNode(CFP) part since it's actually not necessary for the original > bug of this. Please go ahead an do that, and include the test cases from this bug so we don't regress.
(In reply to Hans Wennborg from comment #39) > Mikael, how bad would it be for your group if these issues are not fixed in > LLVM 11.0.0, but rather 11.0.1? It's ok for us.
(In reply to Mikael Holmén from comment #37) > Created attachment 24019 [details] > bbi-47837_x86.ll reproducer Reduced to crash with default settings (optimizations on) for most targets: define float @f(float %x, float %y) { %add = fadd fast float %x, 750.0 %sub = fsub fast float %x, %add %mul = fmul fast float %sub, %sub %mul2 = fmul fast float %mul, %sub %add2 = fadd fast float %mul2, 1.0 %add3 = fadd fast float %mul2, %add2 %mul3 = fmul fast float %y, %add3 ret float %mul3 }
(In reply to Hans Wennborg from comment #39) > (In reply to Qiu Chaofan from comment #38) > > I think I need to revert the > > Remove DeadNode(CFP) part since it's actually not necessary for the original > > bug of this. > > Please go ahead an do that, and include the test cases from this bug so we > don't regress. Sure. See rGb326d4ff. (In reply to Sanjay Patel from comment #41) > (In reply to Mikael Holmén from comment #37) > > Created attachment 24019 [details] > > bbi-47837_x86.ll reproducer > > Reduced to crash with default settings (optimizations on) for most targets: > > define float @f(float %x, float %y) { > %add = fadd fast float %x, 750.0 > %sub = fsub fast float %x, %add > %mul = fmul fast float %sub, %sub > %mul2 = fmul fast float %mul, %sub > %add2 = fadd fast float %mul2, 1.0 > %add3 = fadd fast float %mul2, %add2 > %mul3 = fmul fast float %y, %add3 > ret float %mul3 > } Thanks!
(In reply to Qiu Chaofan from comment #42) > (In reply to Hans Wennborg from comment #39) > > (In reply to Qiu Chaofan from comment #38) > > > I think I need to revert the > > > Remove DeadNode(CFP) part since it's actually not necessary for the original > > > bug of this. > > > > Please go ahead an do that, and include the test cases from this bug so we > > don't regress. > > Sure. See rGb326d4ff. Thanks. Pushed to 11.x as 121babae56e9f08acedf3d6d44757e35556d0a37. It would be nice if this code were more reliable, but closing this bug for now since the currently known problems should be fixed now.