LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filled before November 26, 2021. Use github to submit LLVM bugs

Bug 47517 - SelectionDAG.cpp:900: bool llvm::SelectionDAG::RemoveNodeFromCSEMaps(llvm::SDNode *): Assertion `N->getOpcode() != ISD::DELETED_NODE && "DELETED_NODE in CSEMap!"' failed. with llc -O1
Summary: SelectionDAG.cpp:900: bool llvm::SelectionDAG::RemoveNodeFromCSEMaps(llvm::SD...
Status: RESOLVED FIXED
Alias: None
Product: libraries
Classification: Unclassified
Component: Common Code Generator Code (show other bugs)
Version: trunk
Hardware: PC Linux
: P enhancement
Assignee: qshanz
URL:
Keywords:
Depends on:
Blocks: release-11.0.0
  Show dependency tree
 
Reported: 2020-09-14 04:20 PDT by Mikael Holmén
Modified: 2020-10-06 07:10 PDT (History)
8 users (show)

See Also:
Fixed By Commit(s):


Attachments
bbi-47255.ll reproducer (1.78 KB, text/plain)
2020-09-14 04:20 PDT, Mikael Holmén
Details
bbi-47958_ppc.ll reproducer (1.04 KB, text/plain)
2020-10-01 23:27 PDT, Mikael Holmén
Details
bbi-47837_x86.ll reproducer (589 bytes, text/plain)
2020-10-05 00:44 PDT, Mikael Holmén
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Mikael Holmén 2020-09-14 04:20:43 PDT
Created attachment 23955 [details]
bbi-47255.ll reproducer

llc -O1 -o - bbi-47255.ll

results in

	.text
	.file	"bbi-47255.ll"
llc: ../lib/CodeGen/SelectionDAG/SelectionDAG.cpp:900: bool llvm::SelectionDAG::RemoveNodeFromCSEMaps(llvm::SDNode *): Assertion `N->getOpcode() != ISD::DELETED_NODE && "DELETED_NODE in CSEMap!"' failed.
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0.	Program arguments: build-all/bin/llc -O1 -o - bbi-47255.ll 
1.	Running pass 'Function Pass Manager' on module 'bbi-47255.ll'.
2.	Running pass 'X86 DAG->DAG Instruction Selection' on function '@f'
 #0 0x00000000025d2c84 PrintStackTraceSignalHandler(void*) (build-all/bin/llc+0x25d2c84)
 #1 0x00000000025d07fe llvm::sys::RunSignalHandlers() (build-all/bin/llc+0x25d07fe)
 #2 0x00000000025d2f9c SignalHandler(int) (build-all/bin/llc+0x25d2f9c)
 #3 0x00007f7f4f6c88a0 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x128a0)
 #4 0x00007f7f4e171f47 raise /build/glibc-2ORdQG/glibc-2.27/signal/../sysdeps/unix/sysv/linux/raise.c:51:0
 #5 0x00007f7f4e1738b1 abort /build/glibc-2ORdQG/glibc-2.27/stdlib/abort.c:81:0
 #6 0x00007f7f4e16342a __assert_fail_base /build/glibc-2ORdQG/glibc-2.27/assert/assert.c:89:0
 #7 0x00007f7f4e1634a2 (/lib/x86_64-linux-gnu/libc.so.6+0x304a2)
 #8 0x00000000023be964 llvm::SelectionDAG::RemoveNodeFromCSEMaps(llvm::SDNode*) (build-all/bin/llc+0x23be964)
 #9 0x00000000023bec4f llvm::SelectionDAG::DeleteNode(llvm::SDNode*) (build-all/bin/llc+0x23bec4f)
#10 0x000000000226da96 (anonymous namespace)::DAGCombiner::recursivelyDeleteUnusedNodes(llvm::SDNode*) (build-all/bin/llc+0x226da96)
#11 0x000000000226e990 llvm::SelectionDAG::Combine(llvm::CombineLevel, llvm::AAResults*, llvm::CodeGenOpt::Level) (build-all/bin/llc+0x226e990)
#12 0x0000000002416372 llvm::SelectionDAGISel::CodeGenAndEmitDAG() (build-all/bin/llc+0x2416372)
#13 0x0000000002415392 llvm::SelectionDAGISel::SelectAllBasicBlocks(llvm::Function const&) (build-all/bin/llc+0x2415392)
#14 0x00000000024119ce llvm::SelectionDAGISel::runOnMachineFunction(llvm::MachineFunction&) (build-all/bin/llc+0x24119ce)
#15 0x00000000012f6da5 (anonymous namespace)::X86DAGToDAGISel::runOnMachineFunction(llvm::MachineFunction&) (build-all/bin/llc+0x12f6da5)
#16 0x0000000001abf979 llvm::MachineFunctionPass::runOnFunction(llvm::Function&) (build-all/bin/llc+0x1abf979)
#17 0x0000000001eb83bf llvm::FPPassManager::runOnFunction(llvm::Function&) (build-all/bin/llc+0x1eb83bf)
#18 0x0000000001ebec58 llvm::FPPassManager::runOnModule(llvm::Module&) (build-all/bin/llc+0x1ebec58)
#19 0x0000000001eb898d llvm::legacy::PassManagerImpl::run(llvm::Module&) (build-all/bin/llc+0x1eb898d)
#20 0x00000000006c306d main (build-all/bin/llc+0x6c306d)
#21 0x00007f7f4e154b97 __libc_start_main /build/glibc-2ORdQG/glibc-2.27/csu/../csu/libc-start.c:344:0
#22 0x00000000006c09aa _start (build-all/bin/llc+0x6c09aa)
Abort (core dumped)

This is new, it starts crashing with deb4b25807:

    [DAGCombine] Don't delete the node if it has uses immediately
Comment 1 Hans Wennborg 2020-09-14 05:30:56 PDT
> This is new, it starts crashing with deb4b25807:
> 
>     [DAGCombine] Don't delete the node if it has uses immediately

That one is on the 11.x branch, so let's treat this as a blocker.

QingShan: can you take a look?
Comment 2 Qiu Chaofan 2020-09-14 06:58:02 PDT
Here we called `RemoveDeadNode` on a constant fp node (in `getNegatedExpression` case `ISD::FADD`), which seems to cause the crash. Is the handling of constants different from other types of nodes?
Comment 3 Qiu Chaofan 2020-09-14 07:54:15 PDT
Sent patch https://reviews.llvm.org/D87614 to fix this.
Comment 4 Hans Wennborg 2020-09-15 04:49:00 PDT
(In reply to Qiu Chaofan from comment #3)
> Sent patch https://reviews.llvm.org/D87614 to fix this.

That landed in 2508ef014e8b. I've merged it to 11.x in 88e17a8e9b49bfbcfc1fa70f867b5b56a7a64fc7. Thanks!
Comment 5 Mikael Holmén 2020-09-15 06:00:06 PDT
Thanks!
Comment 6 Hans Wennborg 2020-09-15 06:14:07 PDT
Reopening as https://reviews.llvm.org/D87614#2273931 suggests there are still problems.

I'm not familiar with the code, but it sounds like we're now trying to fix the fixes...

Is it possible to fix these problems in some more principled way, or can we revert back to green?
Comment 7 Qiu Chaofan 2020-09-15 07:10:33 PDT
I reverted the commit and plan to send a new revision.. I thought it was in urgent and committed without additional sanitize check. Sorry for the back-and-forth.
Comment 8 Hans Wennborg 2020-09-15 07:59:12 PDT
(In reply to Qiu Chaofan from comment #7)
> I reverted the commit and plan to send a new revision.. I thought it was in
> urgent and committed without additional sanitize check. Sorry for the
> back-and-forth.

Thanks. Cherry-picked the revert to 11.x as 22dab218407e159631fd0689cb4412646b51515a.
Comment 9 qshanz 2020-09-15 18:12:19 PDT
Hmm, we do some big refactor for getNegatedExpression and this is the last step to remove unused DAG Node we created to save compiling time. Though there are several patches but in fact, they are all doing the same thing. We miss the fact that, two DAG nodes can be the same no matter where it comes from. 

I think, we have clean it now.
Comment 10 Hans Wennborg 2020-09-16 05:45:06 PDT
> I think, we have clean it now.

I see the latest patch is here: https://reviews.llvm.org/D87698
Comment 11 Hans Wennborg 2020-09-17 04:42:52 PDT
(In reply to Hans Wennborg from comment #10)
> > I think, we have clean it now.
> 
> I see the latest patch is here: https://reviews.llvm.org/D87698

That landed and has been cherry-picked to 11.x as b78e5de029c26c309f541ab883fa5d6d953b073d.

Please let me know if there are any follow-ups.
Comment 12 Mikael Holmén 2020-10-01 23:26:50 PDT
Hi!

Unfortunately we found a new problem related to this in our downstream testing.

Reproduce with:
 llc -O1 -o - bbi-47958_ppc.ll -mtriple=powerpc-unknown-linux-gnu

Result:
llc: ../lib/CodeGen/SelectionDAG/SelectionDAG.cpp:900: bool llvm::SelectionDAG::RemoveNodeFromCSEMaps(llvm::SDNode *): Assertion `N->getOpcode() != ISD::DELETED_NODE && "DELETED_NODE in CSEMap!"' failed.

This starts happening wiith the fix https://reviews.llvm.org/D87698 submitted as a2fb5446be.

If I revert a2fb5446be this crash goes away.

I haven't tried this out on 11.x, only on trunk.
Comment 13 Mikael Holmén 2020-10-01 23:27:55 PDT
Created attachment 24015 [details]
bbi-47958_ppc.ll reproducer
Comment 14 Hans Wennborg 2020-10-02 00:50:50 PDT
Qiu: sounds like there are *still* problems here, despite multiple fixes, and it's holding up the release.

Can you please take a look?
Comment 15 Hans Wennborg 2020-10-02 00:58:10 PDT
> Can you please take a look?

In fact, can we revert back to the behavior of 10.0.0? Given the number of fix attempts so far, I'm not confident that just adding another patch right before the release will do the trick here.
Comment 16 Hans Wennborg 2020-10-02 00:59:39 PDT
(In reply to Mikael Holmén from comment #12)
> Hi!
> 
> Unfortunately we found a new problem related to this in our downstream
> testing.
> 
> Reproduce with:
>  llc -O1 -o - bbi-47958_ppc.ll -mtriple=powerpc-unknown-linux-gnu
> 
> Result:
> llc: ../lib/CodeGen/SelectionDAG/SelectionDAG.cpp:900: bool
> llvm::SelectionDAG::RemoveNodeFromCSEMaps(llvm::SDNode *): Assertion
> `N->getOpcode() != ISD::DELETED_NODE && "DELETED_NODE in CSEMap!"' failed.

Mikael, can you comment on the nature of the source code and how likely users are to run into this bug?

Given the state of this problem, we may have to release with this as a known issue.
Comment 17 Hans Wennborg 2020-10-02 01:07:49 PDT
Also could someone comment on how severe this assert is.

For example, is the bug only that we're trying to remove an already deleted node from this map (which seems benign in itself) or could it lead to miscompilation when asserts are disabled?
Comment 18 Mikael Holmén 2020-10-02 01:30:17 PDT
(In reply to Hans Wennborg from comment #16)
> (In reply to Mikael Holmén from comment #12)
> > Hi!
> > 
> > Unfortunately we found a new problem related to this in our downstream
> > testing.
> > 
> > Reproduce with:
> >  llc -O1 -o - bbi-47958_ppc.ll -mtriple=powerpc-unknown-linux-gnu
> > 
> > Result:
> > llc: ../lib/CodeGen/SelectionDAG/SelectionDAG.cpp:900: bool
> > llvm::SelectionDAG::RemoveNodeFromCSEMaps(llvm::SDNode *): Assertion
> > `N->getOpcode() != ISD::DELETED_NODE && "DELETED_NODE in CSEMap!"' failed.
> 
> Mikael, can you comment on the nature of the source code and how likely
> users are to run into this bug?
> 
> Given the state of this problem, we may have to release with this as a known
> issue.

As far as I know we have only seen this in some generated tests involving loops with _Complex float.

The latest example looks like

_Complex float a;

void f(void)
{
    for (int b;;)
    {
        a = a - a;
        a = a + b * a;
        a = a + a;
        b = a * a * (a * (a + a) + a - (a + a + a - b));
    }
}

and then we ran
 clang -emit-llvm-bc -ffast-math -O1
 opt -O1 -disable-basic-aa
 llc -O1
on it for our target.

I'm not sure what is really required for it to happen.
Comment 19 Qiu Chaofan 2020-10-02 03:39:33 PDT
(In reply to Mikael Holmén from comment #18)
> (In reply to Hans Wennborg from comment #16)
> > (In reply to Mikael Holmén from comment #12)
> > > Hi!
> > > 
> > > Unfortunately we found a new problem related to this in our downstream
> > > testing.
> > > 
> > > Reproduce with:
> > >  llc -O1 -o - bbi-47958_ppc.ll -mtriple=powerpc-unknown-linux-gnu
> > > 
> > > Result:
> > > llc: ../lib/CodeGen/SelectionDAG/SelectionDAG.cpp:900: bool
> > > llvm::SelectionDAG::RemoveNodeFromCSEMaps(llvm::SDNode *): Assertion
> > > `N->getOpcode() != ISD::DELETED_NODE && "DELETED_NODE in CSEMap!"' failed.
> > 
> > Mikael, can you comment on the nature of the source code and how likely
> > users are to run into this bug?
> > 
> > Given the state of this problem, we may have to release with this as a known
> > issue.
> 
> As far as I know we have only seen this in some generated tests involving
> loops with _Complex float.
> 
> The latest example looks like
> 
> _Complex float a;
> 
> void f(void)
> {
>     for (int b;;)
>     {
>         a = a - a;
>         a = a + b * a;
>         a = a + a;
>         b = a * a * (a * (a + a) + a - (a + a + a - b));
>     }
> }
> 
> and then we ran
>  clang -emit-llvm-bc -ffast-math -O1
>  opt -O1 -disable-basic-aa
>  llc -O1
> on it for our target.
> 
> I'm not sure what is really required for it to happen.

Hi, in patch D87698, will it resolve the crash if only revert the `RemoveDeadNode(CFP)` part? I'll look at it later. Thanks.
Comment 20 Mikael Holmén 2020-10-02 04:11:18 PDT
(In reply to Qiu Chaofan from comment #19)
> Hi, in patch D87698, will it resolve the crash if only revert the
> `RemoveDeadNode(CFP)` part?

Yes if I just remove that line the crash seems to go away for the new reproducer.

I haven't done any more testing so I've no idea if it comes with any negative effects.
Comment 21 Hans Wennborg 2020-10-02 04:42:29 PDT
(In reply to Hans Wennborg from comment #17)
> Also could someone comment on how severe this assert is.
> 
> For example, is the bug only that we're trying to remove an already deleted
> node from this map (which seems benign in itself) or could it lead to
> miscompilation when asserts are disabled?

Qui, do you know the answer to this question?
Comment 22 Qiu Chaofan 2020-10-02 10:48:10 PDT
(In reply to Hans Wennborg from comment #21)
> (In reply to Hans Wennborg from comment #17)
> > Also could someone comment on how severe this assert is.
> > 
> > For example, is the bug only that we're trying to remove an already deleted
> > node from this map (which seems benign in itself) or could it lead to
> > miscompilation when asserts are disabled?
> 
> Qui, do you know the answer to this question?

As I tested, the program still crashes due to memory error on a non-assert build.

At the first look, the reason is that a constant node (combined from a fmul) is deleted but still referenced in `getNegatedExpression`.

BTW, removing the `RemoveDeadNode` here will never do harm, although it might not be the root cause. Since the patch introducing it is to avoid time explosion, not crashes..
Comment 23 Hans Wennborg 2020-10-02 11:42:10 PDT
(In reply to Qiu Chaofan from comment #22)
> As I tested, the program still crashes due to memory error on a non-assert
> build.

Okay, that sounds pretty bad.

> 
> At the first look, the reason is that a constant node (combined from a fmul)
> is deleted but still referenced in `getNegatedExpression`.
> 
> BTW, removing the `RemoveDeadNode` here will never do harm, although it
> might not be the root cause. Since the patch introducing it is to avoid time
> explosion, not crashes..

I thought the RemoveDeadNode calls were added to fix PR46877 which reports an infinite loop in DAGCombiner, not just slow compiles? (compile time "explosion" also doesn't sound good..)
Comment 24 Hans Wennborg 2020-10-02 11:44:09 PDT
I've been trying to trace the history of these patches:

---
The latest patch causing a problem is D87698

That is a re-land of D87614 but with a fix "in rare cases the node to be removed is the same as result of getNode. We missed that"

D87614 was in turn a fix of D86183 for PR47517

D86183 was a fix for PR46877 which was caused by D77319 (which was already reverted once)

It seems D77319 is another in a line of fixes: "We are struggling to fix such issues on D75501, D70975, D76638, D76439 etc"
Some of those landed and some didn't. They all seem targeted at a problem discussed in D70595 (there's no PR).
---

This history of patches is what concerns me about adding yet another fix: how confident can we be that it's the last one?

Also, there is a high cost to taking another fix at this point. It means spinning another release candidate and pushing the release even further out.

I don't want to do that unless we're completely confident that the fix is correct, and that the problem it solves is severe enough.

Sadly, it sounds like we don't just assert, but crash here, which is certainly bad.



Sanjay, you've been involved in these patches. What are your thoughts on this situation?
Comment 25 Sanjay Patel 2020-10-02 12:49:40 PDT
(In reply to Hans Wennborg from comment #23)
> I thought the RemoveDeadNode calls were added to fix PR46877 which reports
> an infinite loop in DAGCombiner, not just slow compiles? (compile time
> "explosion" also doesn't sound good..)

It was exponentially increasing compile-time after closer inspection, but it went effectively-infinite-loop pretty quickly.


(In reply to Hans Wennborg from comment #24)
> This history of patches is what concerns me about adding yet another fix:
> how confident can we be that it's the last one?
> 
> Also, there is a high cost to taking another fix at this point. It means
> spinning another release candidate and pushing the release even further out.
> 
> I don't want to do that unless we're completely confident that the fix is
> correct, and that the problem it solves is severe enough.
> 
> Sadly, it sounds like we don't just assert, but crash here, which is
> certainly bad.
> 
> 
> 
> Sanjay, you've been involved in these patches. What are your thoughts on
> this situation?

I don't have the current failure root caused, but so far I haven't gotten it to fail on x86 or aarch64, so there may be some target-specific behavior in play.

Backing out the whole series of patches would probably be worse than taking 1 more patch since we were fixing a crash to begin with. But I can't say yet whether this is something we would hit in typical code yet.

Killing a RemoveDeadNode() call seems like a relatively safe change, but our track record on this series of patches is poor as noted. cc'ing Craig for another opinion. 

I'll keep trying to reduce/isolate the test.
Comment 26 Mikael Holmén 2020-10-02 12:57:49 PDT
(In reply to Sanjay Patel from comment #25)
> (In reply to Hans Wennborg from comment #23)
> > I thought the RemoveDeadNode calls were added to fix PR46877 which reports
> > an infinite loop in DAGCombiner, not just slow compiles? (compile time
> > "explosion" also doesn't sound good..)
> 
> It was exponentially increasing compile-time after closer inspection, but it
> went effectively-infinite-loop pretty quickly.
> 
> 
> (In reply to Hans Wennborg from comment #24)
> > This history of patches is what concerns me about adding yet another fix:
> > how confident can we be that it's the last one?
> > 
> > Also, there is a high cost to taking another fix at this point. It means
> > spinning another release candidate and pushing the release even further out.
> > 
> > I don't want to do that unless we're completely confident that the fix is
> > correct, and that the problem it solves is severe enough.
> > 
> > Sadly, it sounds like we don't just assert, but crash here, which is
> > certainly bad.
> > 
> > 
> > 
> > Sanjay, you've been involved in these patches. What are your thoughts on
> > this situation?
> 
> I don't have the current failure root caused, but so far I haven't gotten it
> to fail on x86 or aarch64, so there may be some target-specific behavior in
> play.

In this particular case a target using AA during DAGCombin/Isel is needed. Our out of tree target does, and PPC too. When I forced useAA==true on x86 it crashed there too.
Comment 27 Hans Wennborg 2020-10-02 12:59:54 PDT
> I'll keep trying to reduce/isolate the test.

Thanks for looking into it!
Comment 28 Qiu Chaofan 2020-10-03 09:02:25 PDT
The cause should be:

We're negating `fmul X,Y`, X is an `fmul` but NegX is finally constant-folded in `getNode`, Y is a direct constant so NegY is also constant. 

Here it's some coincidence that **NegX is equal to NegY**. NegX was kept since the code did not go through the part in `case ISD::ConstantFP` (X is not constant) but for Y, NegY was deleted. So NegX was deleted too, and referenced later..
Comment 29 Sanjay Patel 2020-10-04 05:15:32 PDT
I have it reduced to this to crash on x86 (but ironically not with the original PPC target):

define float @f(float* %p) {
  store float 0.0, float* %p, align 1
  %real = load float, float* %p, align 1
  %r2 = fmul fast float %real, %real
  %t1 = fmul fast float %real, 3.0
  %t2 = fmul fast float %t1, %real
  %mul_ac56 = fmul fast float %t2, %t1
  %mul_ac72 = fmul fast float %mul_ac56, %r2
  ret float %mul_ac72
}

--------------------------------------------------------------------------------

So to trigger the crash, we require fast-math (at least 'reassoc') and some constant folding (the load/store of 0.0 and subsequent fmul) that escaped IR optimization. That constant folding has to follow a very narrow path - even commuting the operands in some of those instructions avoids the bug.

I don't know how to assess the risk of that scenario. I want to believe it's not worth delaying the 11.0 release again, but a fix should definitely go into any later builds since it's a compiler crash.
Comment 30 Sanjay Patel 2020-10-04 08:48:56 PDT
I improved SDAG's constant folding:
https://reviews.llvm.org/rG2ccbf3dbd5ba
...so that's going to make it harder to hit the bug in getNegatedExpression(). I think we should still make a change in there to be safer, but the IR repro test here is not crashing now in trunk.

It's still a judgement call as to whether we should delay the release for my fix or a change in getNegatedExpression().
Comment 31 Qiu Chaofan 2020-10-04 20:51:00 PDT
(In reply to Sanjay Patel from comment #30)
> I improved SDAG's constant folding:
> https://reviews.llvm.org/rG2ccbf3dbd5ba
> ...so that's going to make it harder to hit the bug in
> getNegatedExpression(). I think we should still make a change in there to be
> safer, but the IR repro test here is not crashing now in trunk.
> 
> It's still a judgement call as to whether we should delay the release for my
> fix or a change in getNegatedExpression().

Thank you, Sanjay!

Although we eliminated the failure, I'm afraid there might be cases similar to this but negated results are not constants? (NegX is equal to NegY, NegX was used but NegY was deleted)
Comment 32 Mikael Holmén 2020-10-04 23:21:07 PDT
We've found another crash relasted to the "RemoveDeadNode(CFP)" part. Unfortunately I haven't been able to reproduce it on PPC or X86.

Roughly what happens is that in TargetLowering::getNegatedExpression Op is

 t36: f32 = ConstantFP<4.218750e+08>

and

    SDValue CFP = DAG.getConstantFP(V, DL, VT);

returns a the same node so Op == CFP and then we do

      RemoveDeadNode(CFP);

which will remove Op.

But the caller of getNegatedExpression has a handle to Op which then is

NegX: t36: f32 = <<Deleted Node!>>

and it will use it to build a new fsub node, which will reference a deleted node:

Creating new node: t37: f32 = fsub nnan ninf nsz arcp contract afn reassoc t37, ConstantFP:f32<-4.218750e+08>

The new x * 0.0 fold doesn't help in this case.
Comment 33 Mikael Holmén 2020-10-04 23:24:04 PDT
(Bah, forgot to add the last part)
Compilation later ends with

Operand not processed?
t37: f32 = fsub nnan ninf nsz arcp contract afn reassoc t37, ConstantFP:f32<-4.218750e+08>

UNREACHABLE executed at ../lib/CodeGen/SelectionDAG/LegalizeTypes.cpp:485!
Comment 34 Craig Topper 2020-10-04 23:31:36 PDT
This looks like the node is its own operand?

t37: f32 = fsub nnan ninf nsz arcp contract afn reassoc t37, ConstantFP:f32<-4.218750e+08>
Comment 35 Mikael Holmén 2020-10-04 23:34:46 PDT
(In reply to Craig Topper from comment #34)
> This looks like the node is its own operand?
> 
> t37: f32 = fsub nnan ninf nsz arcp contract afn reassoc t37,
> ConstantFP:f32<-4.218750e+08>

Yes I think it's due to the input actually being a DELETED NODE.
Comment 36 Mikael Holmén 2020-10-05 00:43:58 PDT
(In reply to Mikael Holmén from comment #32)
> We've found another crash relasted to the "RemoveDeadNode(CFP)" part.

Ok, now I've managed to reproduce the latest crash on X86 as well:

 llc -O0 -o - bbi-47837_x86.ll
Comment 37 Mikael Holmén 2020-10-05 00:44:29 PDT
Created attachment 24019 [details]
bbi-47837_x86.ll reproducer
Comment 38 Qiu Chaofan 2020-10-05 02:48:57 PDT
(In reply to Mikael Holmén from comment #32)
> We've found another crash relasted to the "RemoveDeadNode(CFP)" part.
> Unfortunately I haven't been able to reproduce it on PPC or X86.
> 
> Roughly what happens is that in TargetLowering::getNegatedExpression Op is
> 
>  t36: f32 = ConstantFP<4.218750e+08>
> 
> and
> 
>     SDValue CFP = DAG.getConstantFP(V, DL, VT);
> 
> returns a the same node so Op == CFP and then we do
> 
>       RemoveDeadNode(CFP);
> 
> which will remove Op.
> 
> But the caller of getNegatedExpression has a handle to Op which then is
> 
> NegX: t36: f32 = <<Deleted Node!>>
> 
> and it will use it to build a new fsub node, which will reference a deleted
> node:
> 
> Creating new node: t37: f32 = fsub nnan ninf nsz arcp contract afn reassoc
> t37, ConstantFP:f32<-4.218750e+08>
> 
> The new x * 0.0 fold doesn't help in this case.

Ah, so it's the same issue as comment #28. I think I need to revert the Remove DeadNode(CFP) part since it's actually not necessary for the original bug of thi s.
Comment 39 Hans Wennborg 2020-10-05 07:45:07 PDT
(In reply to Sanjay Patel from comment #29)
> So to trigger the crash, we require fast-math (at least 'reassoc') and some
> constant folding (the load/store of 0.0 and subsequent fmul) that escaped IR
> optimization. That constant folding has to follow a very narrow path - even
> commuting the operands in some of those instructions avoids the bug.
> 
> I don't know how to assess the risk of that scenario. I want to believe it's
> not worth delaying the 11.0 release again, but a fix should definitely go
> into any later builds since it's a compiler crash.

Mikael, how bad would it be for your group if these issues are not fixed in LLVM 11.0.0, but rather 11.0.1?



(In reply to Qiu Chaofan from comment #38)
> I think I need to revert the
> Remove DeadNode(CFP) part since it's actually not necessary for the original
> bug of this.

Please go ahead an do that, and include the test cases from this bug so we don't regress.
Comment 40 Mikael Holmén 2020-10-05 07:54:59 PDT
(In reply to Hans Wennborg from comment #39)
> Mikael, how bad would it be for your group if these issues are not fixed in
> LLVM 11.0.0, but rather 11.0.1?

It's ok for us.
Comment 41 Sanjay Patel 2020-10-05 08:46:10 PDT
(In reply to Mikael Holmén from comment #37)
> Created attachment 24019 [details]
> bbi-47837_x86.ll reproducer

Reduced to crash with default settings (optimizations on) for most targets:

define float @f(float %x, float %y) {
  %add = fadd fast float %x, 750.0
  %sub = fsub fast float %x, %add
  %mul = fmul fast float %sub, %sub
  %mul2 = fmul fast float %mul, %sub
  %add2 = fadd fast float %mul2, 1.0
  %add3 = fadd fast float %mul2, %add2
  %mul3 = fmul fast float %y, %add3
  ret float %mul3
}
Comment 42 Qiu Chaofan 2020-10-05 10:19:55 PDT
(In reply to Hans Wennborg from comment #39)
> (In reply to Qiu Chaofan from comment #38)
> > I think I need to revert the
> > Remove DeadNode(CFP) part since it's actually not necessary for the original
> > bug of this.
> 
> Please go ahead an do that, and include the test cases from this bug so we
> don't regress.

Sure. See rGb326d4ff.

(In reply to Sanjay Patel from comment #41)
> (In reply to Mikael Holmén from comment #37)
> > Created attachment 24019 [details]
> > bbi-47837_x86.ll reproducer
> 
> Reduced to crash with default settings (optimizations on) for most targets:
> 
> define float @f(float %x, float %y) {
>   %add = fadd fast float %x, 750.0
>   %sub = fsub fast float %x, %add
>   %mul = fmul fast float %sub, %sub
>   %mul2 = fmul fast float %mul, %sub
>   %add2 = fadd fast float %mul2, 1.0
>   %add3 = fadd fast float %mul2, %add2
>   %mul3 = fmul fast float %y, %add3
>   ret float %mul3
> }

Thanks!
Comment 43 Hans Wennborg 2020-10-06 07:10:09 PDT
(In reply to Qiu Chaofan from comment #42)
> (In reply to Hans Wennborg from comment #39)
> > (In reply to Qiu Chaofan from comment #38)
> > > I think I need to revert the
> > > Remove DeadNode(CFP) part since it's actually not necessary for the original
> > > bug of this.
> > 
> > Please go ahead an do that, and include the test cases from this bug so we
> > don't regress.
> 
> Sure. See rGb326d4ff.

Thanks. Pushed to 11.x as 121babae56e9f08acedf3d6d44757e35556d0a37.

It would be nice if this code were more reliable, but closing this bug for now since the currently known problems should be fixed now.