-
Notifications
You must be signed in to change notification settings - Fork 13.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MCJIT tests fail randomly on ARM #18431
Comments
assigned to @rengolin |
Forgot to say that, those tests pass on Stage1, but fail on Stage2, so it's probably some codegen failure in Clang... |
Compiling lli with line info got me this stack trace: #0 llvm::RemoteTargetExternal::Receive #1 llvm::RemoteTargetExternal::allocateSpace #2 0x000e03b4 in llvm::RemoteMemoryManager::notifyObjectLoaded The Address=@0x4: is due to -O3 being aggressive, since Data above is correct and at the same address than the RemoteAddr from the first call. The assembly of the beginning of that function also seem correct (although with too many moves): _ZN4llvm20RemoteTargetExternal7ReceiveENS_14LLIMessageTypeERy: |
I just tested on x86_64 and Phase2 and Phase3 are stable. It's an ARM specific bug, it seems. |
This assertion reflects a failure in communication with a child process. The actual failure is almost certainly happening on the child side. This assertion just indicates that the child didn't send the expected response. This may be because the child sent too few bytes (and this may be an implementation issue if the pipe read returned with less than 4 bytes but more were being buffered). It may also indicate that the child process crashed. The communication protocol used in this case is fairly raw and potentially brittle. I'm not familiar with the particulars of how pipes work on ARM, so it may be that there is some behavioral difference being exposed by clang's optimizations. The best way to approach this failure is probably to have a close inspection of the pipe communication handling on both child and host side. |
Patch by Andrew Kaylor |
Modified Patch works If I introduce a busy waiting counter to limit the number of times it should try while in interruption, the lock-ups go away. I don't think this is a good solution, since I believe this is still fragile and busy waiting, but at least it highlights that the problem is, indeed, in the communication protocol, as you suggested. |
Patch using select |
Both patches work somewhat, but still fail randomly, which means we're not fixing the problem. I'm working on refactoring the communication layer, not changing any of the protocol, but trying to format the code so to make it harder to not follow the protocol, ie. splitting between header and payload, adding lots of error messages (and checking for EINTR, EPIPE, etc), stack traces, etc. |
Refactoring of remote MCJIT MPI My assumptions:
|
Brilliant! I just skimmed over this in 30 seconds and my refactor is the other half (killing off ChildTarget.inc), so there may be a couple of one-line conflicts but basically the two patches are complementary and can be landed independently. |
Ok, I only started changing the ChildTarget.inc now because I spotted some failures there. I'll stop my progress, wait for your patch, merge and continue. |
ChildTarget.inc cleanup, WIP I'm posting my WIP patch so you can give it a spin on ARM and see where I'm heading. This embeds LLVM's inc modules and removes duplicated code on the ChildTarget side. My feeling is that if any platform-specific code is needed from now on, it should rather go into LLVM's Support and embedded into whatever needs it. The UNIX/Windows abstraction for ChildTarget is gone, was not sound. Patch makes a start on Windows support using the existing platform code instead. |
By the way, InvalidateInstructionCache() isn't even getting called on TOT AFAICT, was did it ever work on ARM? |
Fixed in r199261. |
All recent failures have this type of error: ERROR: ReadBytes: Expecting 4 bytes, Got 0, (RemoteTargetExternal::ReceiveHeader), (RemoteTargetExternal::executeCode) Files affected so far: test-common-symbols-remote.ll Ex: http://lab.llvm.org:8011/builders/clang-native-arm-cortex-a15-self-host/builds/396 |
I've landed my full refactoring series for the remote MCJIT. Along with Renato's work we're looking a lot better. Can you try on ARM again? There's proper error reporting to stderr now. |
My local tests on Clang-compiled LLVM went well, so I'm re-enabling the tests on ARM. Let's see how the self-hosting bot behaves. After a few successful runs, I'll mark this bug as fixed. Thanks for the hard work! |
Extended Description
Tests in ExecutionEngine/MCJIT/remote like:
Fail randomly on a Cortex-A15 buildbot with the Assert:
tools/lli/RemoteTargetExternal.cpp:144:
void llvm::RemoteTargetExternal::Receive(llvm::LLIMessageType, uint64_t &):
Assertion `rc == 4 && "Error reading message type.
The text was updated successfully, but these errors were encountered: