Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

X86 ChromeOS devices with kernel 4.4 fails to boot after 4b0aa5724feaa89a9538dcab97e018110b0e4bc3 #46812

Closed
jcai19 opened this issue Sep 9, 2020 · 27 comments
Assignees
Labels
bugzilla Issues migrated from bugzilla llvm:codegen

Comments

@jcai19
Copy link
Member

jcai19 commented Sep 9, 2020

Bugzilla Link 47468
Resolution FIXED
Resolved on Sep 22, 2020 02:45
Version trunk
OS All
Blocks #4440 #46070
CC @jcai19,@cmtice,@topperc,@dwblaikie,@zmodem,@jyknight,@lalozano,@m-gupta,@nathanchance,@nickdesaulniers,@stephenhines

Extended Description

While building with recent LLVM versions, we found X86 ChromeOS images with kernel 4.4 all failed to boot. We bisected it to the following commit.

commit 4b0aa57
Author: James Y Knight jyknight@google.com
Date: Fri May 15 23:43:30 2020 -0400

Change the INLINEASM_BR MachineInstr to be a non-terminating instruction.

Link: https://crbug.com/1123683

@jcai19
Copy link
Member Author

jcai19 commented Sep 9, 2020

assigned to @jyknight

@m-gupta
Copy link
Contributor

m-gupta commented Sep 9, 2020

60433c6 (Remove TwoAddressInstructionPass::sink3AddrInstruction. ) was supposed to be a fix but as per https://crbug.com/1123683 , the boot issue is still present after it.

@jyknight
Copy link
Member

jyknight commented Sep 9, 2020

Are you saying that the issue described in ClangBuiltLinux/linux#1085 was not actually fixed?

Or is this a distinct issue that didn't exhibit there for some reason?

@jcai19
Copy link
Member Author

jcai19 commented Sep 9, 2020

Right, I applied https://reviews.llvm.org/rG60433c63acb71935111304d71e41b7ee982398f8 on http://crrev.com/c/2372837/14 and the issue still reproduced. Another later version LLVM that included 0433c63acb71935111304d71e41b7ee982398f8 also had the same boot failure http://crrev.com/c/2361864/2.

@jyknight
Copy link
Member

jyknight commented Sep 9, 2020

The second link doesn't have any test results? Or do you mean you manually tested it at that revision?

Can you give some more detail? I'm not familiar with chrome's build system.

Perhaphs you could follow the same steps Nick used in the bug I linked to, to find what changed in the assembly output?

@jcai19
Copy link
Member Author

jcai19 commented Sep 9, 2020

< The second link doesn't have any test results? Or do you mean you manually tested it at that revision?

< Can you give some more detail? I'm not familiar with chrome's build system.

Sorry an example of the failure can be found as follows. Search for "System rolled back" in the log.

https://00e9e64bac44fd8b1f073164b070e39c7d3b67d7d5b582cb85-apidata.googleusercontent.com/download/storage/v1/b/chromeos-autotest-results/o/swarming-4e5b5cee0a6c6a10%2Fprovision_chromeos6-row6-rack4-host11%2Fdebug%2Fautoserv.DEBUG?qk=AD5uMEs2QtbXhnrDyvEWABvDxkI31TZKCEntN_pMZC_YxVXitalTlHGSfO1zzo_oipdCAkEu61HjA6U3J8F7Ic7xGE9xfpc1eZ6PPWPbKnylwnx7qVIENybBBiRRXlmPaIBwFzWsapdcd5efVuhJeTND3ElAb_1zKOnAkN9aKrE6n1tnMXzA3iVNRSX_DsLyEfufkGd8MyE9J6Yy2Jyqx9WFyHleb-_e-Mzpj9B40zsrihD5YUX1Qd_Pwaoi_r3yCatzHTBDLMncK4Kd7pWYcVrI6i3cwsJONChaqC7iWFkCD9i572ctlCLZ3ARCpDKNVjENXtpV9awJGVihZrHEXS4rR-SsTO7dfqODQa7GxovexnOfTb9xqMDhjCuoZ4pqY6Vfcaw9vgfqm3dYFop2bhZWDiGTZOPJHLPc4MLUEuW0T7mxg59bQQxl47caJYP6579czEGIlGgojj6qn_GdLs2QwRwNOICUMcqYGmDGaCQZRdnw59xDLUHD9qWrR8lr-QORk86qNRnIdPlwa3eXQI7eXSm2MmvGyCTX-YS5Z_PSp2-qSirYAv5fIcOetZL12vV-0rjurU-TN3GJ9zxBuJqzbwZSvyVXQ8q5WwNWkia7a-oM_yRShQGBCnzN9xBKS_7B8a_m8TNdRSycpzKzlt6MNGAmGYz-oYDMmtEwRSysgoW8qRNZ_7pc8UHCnZ9NyrPKOzLjkk96pBcdQ1677n5rRRcuE68SGd3mHNUN8C5AGG6k-un_Av5iNMCX3sMeRHRn3eL27nFaHZGaTZ_PRqhLCdZd-B0i95_YKJnW1NkGa5QMBtnn1Um9XnFs4ZDVgf0Z0apiQ8aDPaRnrzWNTpwCiYgEL_PiSMTtQKXJ8Rz4zCOr0aLBSK2v51yxSkJPkser3lBFmsq3RtOyArhuZ4KuyhZJosDi1Wpi-Ti9DnzLq6zwRqt6Bms&isca=1

< Perhaphs you could follow the same steps Nick used in the bug I linked to, to find what changed in the assembly output?

SG. I'm working with Nick on this.

@zmodem
Copy link
Collaborator

zmodem commented Sep 15, 2020

Is there any news on this?

@m-gupta
Copy link
Contributor

m-gupta commented Sep 15, 2020

We are still actively debugging this. We have found that the devices actually boot the first time but reboot when system UI is restarted for any reason e.g. tests. And the problem indeed starts from the mentioned LLVM patch.

Lack of access to a physical device has unfortunately made debugging really slow.

@zmodem
Copy link
Collaborator

zmodem commented Sep 17, 2020

As it's currently looking, I don't think we can hold the release for this.

@nickdesaulniers
Copy link
Member

We were able to get physical access to the device.

We have bisected down to the object file affected.

I'm working through pinpointing where things are going wrong now.

@nickdesaulniers
Copy link
Member

Specifically, the linux-4.4.y branch of the stable tree of the Linux kernel with CONFIG_USER_NS enabled; kernel/user_namespace.o seems to the the culprit of an object file bisection performed by Jian. Still digging in what's broken about it.

@nickdesaulniers
Copy link
Member

Comparing the disassemblies between:

  1. commit 78c69a0
  2. commit 78c69a0 + cherry pick 4b0aa57 + cherry pick 78c69a0

I can see a few differences, which look pretty obviously wrong to me, for example the relatively small fn free_user_ns:

example:

-     2ab: 4c 89 f3                             movq    %r14, %rbx
-     2ae: f0                                   lock
-     2af: 41 ff 8e c0 00 00 00                 decl    192(%r14)
-     2b6: 74 d2                                je      0x28a <free_user_ns+0xa>
+     2ab: f0                                   lock
+     2ac: 41 ff 8e c0 00 00 00                 decl    192(%r14)
+     2b3: 74 d5                                je      0x28a <free_user_ns+0xa>
+     2b5: 4c 89 f3                             movq    %r14, %rbx

so the mov was occurring unconditionally before the dec, now it's conditional (oops) and after the dec (double oops)

Since free_user_ns is relatively small, should make the reproducer concise and getting it easy.

@nickdesaulniers
Copy link
Member

free_user_ns__user_namespace.ll
Attaching reproducer extracted via:

$ llvm-extract --func=free_user_ns user_namespace.ll -o free_user_ns__user_namespace.ll

It looks like a BB terminated with callbr has an indirect target back to itself.

Will work on reducing this further via bugpoint or such.

@nickdesaulniers
Copy link
Member

repro.ll
smaller reproducer

$ llc -O2 repro.ll

Specifically:
.Ltmp1:
lock
decl 192(%r14)
je .Ltmp0
#NO_APP
movq %r14, %rbx

It seems like the movq %r14, %rbx would occur before the lock'd decl unconditionally prior to the patches in question in clang-11.

@nickdesaulniers
Copy link
Member

fwiw, dumping the output of llc -O2 -print-before-all repro.ll -o - to a file and comparing that before vs after the changes in question, things start to go visibly wrong after

Eliminate PHI nodes for register allocation

pass. We have:

  • $rbx = MOV64rr $r14 ...
  • INLINEASM_BR ...
  • INLINEASM_BR ...
  • $rbx = MOV64rr killed $r14

(so we were doing mov then inline asm blob; after we have those reversed, which is bad)

@jyknight
Copy link
Member

The bug is in llvm::findPHICopyInsertPoint.

We're correctly hitting the special case for placement in a block with asm-goto or invoke. Then, the code places the copy after the last use or def of the register. Unfortunately, placing after the last use is wrong -- inlineasm_br uses the register, in this example. Yet, the whole point of the special-case is to find a place to insert the copy before that instruction.

I believe just placing the copy after last DEF, not after the last use or def, will be a correct fix. E.g.:

--- a/llvm/lib/CodeGen/PHIEliminationUtils.cpp
+++ b/llvm/lib/CodeGen/PHIEliminationUtils.cpp
@@ -34,7 +34,7 @@ llvm::findPHICopyInsertPoint(MachineBasicBlock* MBB, MachineBasicBlock* SuccMBB,
// Discover any defs/uses in this basic block.
SmallPtrSet<MachineInstr*, 8> DefUsesInMBB;
MachineRegisterInfo& MRI = MBB->getParent()->getRegInfo();

  • for (MachineInstr &RI : MRI.reg_instructions(SrcReg)) {
  • for (MachineInstr &RI : MRI.def_instructions(SrcReg)) {
    if (RI.getParent() == MBB)
    DefUsesInMBB.insert(&RI);
    }

However, that will have the effect of placing the copy earlier than necessary in the block, which might increase register pressure. I'm going to try to make a slightly better fix.

@jcai19
Copy link
Member Author

jcai19 commented Sep 17, 2020

Thanks for the fix. I've verified locally it fixed the issue in a test device.

@jyknight
Copy link
Member

I've got the "Slightly better fix" posted for review now: https://reviews.llvm.org/D87865

Please verify this works for you as well, thanks!

@nickdesaulniers
Copy link
Member

Thanks!

Jian and Manoj are testing this on CrOS kernels.
Nathan is testing this on various linux kernel distro's configs.
I'm testing this with kernel patches that use asm goto w/ outputs, and our 65 various kernel tree and configs: https://travis-ci.com/github/ClangBuiltLinux/continuous-integration.

It will take me at least an hour and a half to two hours to get through all of those.

Hans, I suspect we can have the patch ready and well qualified by tonight (US time, maybe next 6-8 hours).

@jcai19
Copy link
Member Author

jcai19 commented Sep 17, 2020

I verified https://reviews.llvm.org/D87865 fixed the issue locally.

@jcai19
Copy link
Member Author

jcai19 commented Sep 17, 2020

The verification I've done so far is limited to one test device. Will start a more complete test with the patch.

@nickdesaulniers
Copy link
Member

All of my tests with this on ToT LLVM look good. Going to rerun them again with this on top of release/11.x.

@nickdesaulniers
Copy link
Member

All of my kernel build+boot tests with release/11.x + https://reviews.llvm.org/D87865 passed.

@jyknight
Copy link
Member

Fix committed to master as f7a53d8, which cherry-picks cleanly to 11.x.

@​Hans: shall I push to 11.x?

@nickdesaulniers
Copy link
Member

Manoj reports the CrOS tests look good:

https://chromium-review.googlesource.com/c/chromiumos/overlays/chromiumos-overlay/+/2417398/3#message-18cdbb29241e515a94ffc182889a7c9c2b335ec7

Nathan reports clang-11+patch and ToT+patch look good:

tot: https://paste.myself5.de/raw/aciqosofos
clang-11: https://del.dog/raw/xapegagywo
(tot showed one failure, but that's due to a missed cherry pick for an unrelated breakage upstream in kernel sources)

The fixed was merged on master in: https://reviews.llvm.org/rGf7a53d82c0902147909f28a9295a9d00b4b27d38

Let's get this cherry-picked over to release/11.x and close this out. Hans?

@nickdesaulniers
Copy link
Member

Manoj notes that the fix will depend on

commit a0385bd ("[llvm] Add contains(KeyType) -> bool methods to SmallPtrSet")

which doesn't look like it's in release/11.x to me.

So we'll likely want to cherry pick the following 2 patches to close this out:

1:
commit a0385bd ("[llvm] Add contains(KeyType) -> bool methods to SmallPtrSet")

2:
commit f7a53d8 ("#47468 : Fix findPHICopyInsertPoint, so that copies aren't incorrectly inserted after an INLINEASM_BR.")

@zmodem
Copy link
Collaborator

zmodem commented Sep 22, 2020

Cherry-picked to 11.x as 410b0dc and 6250d49. Thanks everyone!

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugzilla Issues migrated from bugzilla llvm:codegen
Projects
None yet
Development

No branches or pull requests

5 participants