-
Notifications
You must be signed in to change notification settings - Fork 12.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
X86 ChromeOS devices with kernel 4.4 fails to boot after 4b0aa5724feaa89a9538dcab97e018110b0e4bc3 #46812
Comments
assigned to @jyknight |
60433c6 (Remove TwoAddressInstructionPass::sink3AddrInstruction. ) was supposed to be a fix but as per https://crbug.com/1123683 , the boot issue is still present after it. |
Are you saying that the issue described in ClangBuiltLinux/linux#1085 was not actually fixed? Or is this a distinct issue that didn't exhibit there for some reason? |
Right, I applied https://reviews.llvm.org/rG60433c63acb71935111304d71e41b7ee982398f8 on http://crrev.com/c/2372837/14 and the issue still reproduced. Another later version LLVM that included 0433c63acb71935111304d71e41b7ee982398f8 also had the same boot failure http://crrev.com/c/2361864/2. |
The second link doesn't have any test results? Or do you mean you manually tested it at that revision? Can you give some more detail? I'm not familiar with chrome's build system. Perhaphs you could follow the same steps Nick used in the bug I linked to, to find what changed in the assembly output? |
< The second link doesn't have any test results? Or do you mean you manually tested it at that revision? < Can you give some more detail? I'm not familiar with chrome's build system. Sorry an example of the failure can be found as follows. Search for "System rolled back" in the log. < Perhaphs you could follow the same steps Nick used in the bug I linked to, to find what changed in the assembly output? SG. I'm working with Nick on this. |
Is there any news on this? |
We are still actively debugging this. We have found that the devices actually boot the first time but reboot when system UI is restarted for any reason e.g. tests. And the problem indeed starts from the mentioned LLVM patch. Lack of access to a physical device has unfortunately made debugging really slow. |
As it's currently looking, I don't think we can hold the release for this. |
We were able to get physical access to the device. We have bisected down to the object file affected. I'm working through pinpointing where things are going wrong now. |
Specifically, the linux-4.4.y branch of the stable tree of the Linux kernel with CONFIG_USER_NS enabled; kernel/user_namespace.o seems to the the culprit of an object file bisection performed by Jian. Still digging in what's broken about it. |
Comparing the disassemblies between: I can see a few differences, which look pretty obviously wrong to me, for example the relatively small fn example:
so the Since |
free_user_ns__user_namespace.ll $ llvm-extract --func=free_user_ns user_namespace.ll -o free_user_ns__user_namespace.ll It looks like a BB terminated with callbr has an indirect target back to itself. Will work on reducing this further via bugpoint or such. |
repro.ll $ llc -O2 repro.ll Specifically: It seems like the |
fwiw, dumping the output of llc -O2 -print-before-all repro.ll -o - to a file and comparing that before vs after the changes in question, things start to go visibly wrong after Eliminate PHI nodes for register allocation pass. We have:
(so we were doing mov then inline asm blob; after we have those reversed, which is bad) |
The bug is in llvm::findPHICopyInsertPoint. We're correctly hitting the special case for placement in a block with asm-goto or invoke. Then, the code places the copy after the last use or def of the register. Unfortunately, placing after the last use is wrong -- inlineasm_br uses the register, in this example. Yet, the whole point of the special-case is to find a place to insert the copy before that instruction. I believe just placing the copy after last DEF, not after the last use or def, will be a correct fix. E.g.: --- a/llvm/lib/CodeGen/PHIEliminationUtils.cpp
However, that will have the effect of placing the copy earlier than necessary in the block, which might increase register pressure. I'm going to try to make a slightly better fix. |
Thanks for the fix. I've verified locally it fixed the issue in a test device. |
I've got the "Slightly better fix" posted for review now: https://reviews.llvm.org/D87865 Please verify this works for you as well, thanks! |
Thanks! Jian and Manoj are testing this on CrOS kernels. It will take me at least an hour and a half to two hours to get through all of those. Hans, I suspect we can have the patch ready and well qualified by tonight (US time, maybe next 6-8 hours). |
I verified https://reviews.llvm.org/D87865 fixed the issue locally. |
The verification I've done so far is limited to one test device. Will start a more complete test with the patch. |
All of my tests with this on ToT LLVM look good. Going to rerun them again with this on top of release/11.x. |
All of my kernel build+boot tests with release/11.x + https://reviews.llvm.org/D87865 passed. |
Fix committed to master as f7a53d8, which cherry-picks cleanly to 11.x. @Hans: shall I push to 11.x? |
Manoj reports the CrOS tests look good: Nathan reports clang-11+patch and ToT+patch look good: tot: https://paste.myself5.de/raw/aciqosofos The fixed was merged on master in: https://reviews.llvm.org/rGf7a53d82c0902147909f28a9295a9d00b4b27d38 Let's get this cherry-picked over to release/11.x and close this out. Hans? |
Manoj notes that the fix will depend on commit a0385bd ("[llvm] Add contains(KeyType) -> bool methods to SmallPtrSet") which doesn't look like it's in release/11.x to me. So we'll likely want to cherry pick the following 2 patches to close this out: 1: 2: |
Extended Description
While building with recent LLVM versions, we found X86 ChromeOS images with kernel 4.4 all failed to boot. We bisected it to the following commit.
commit 4b0aa57
Author: James Y Knight jyknight@google.com
Date: Fri May 15 23:43:30 2020 -0400
Change the INLINEASM_BR MachineInstr to be a non-terminating instruction.
Link: https://crbug.com/1123683
The text was updated successfully, but these errors were encountered: