Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MCJIT tests fail randomly on ARM #18431

Closed
rengolin opened this issue Nov 25, 2013 · 18 comments
Closed

MCJIT tests fail randomly on ARM #18431

rengolin opened this issue Nov 25, 2013 · 18 comments
Assignees
Labels
bugzilla Issues migrated from bugzilla

Comments

@rengolin
Copy link
Member

Bugzilla Link 18057
Resolution FIXED
Resolved on Jan 26, 2014 06:39
Version trunk
OS Linux
CC @andykaylor,@TNorthover

Extended Description

Tests in ExecutionEngine/MCJIT/remote like:

  • multi-module-a.ll.script
  • test-fp-no-external-funcs-remote.ll.script
  • test-common-symbols-remote.ll.script
  • cross-module-a.ll.script
  • test-global-init-nonzero-remote.ll.script

Fail randomly on a Cortex-A15 buildbot with the Assert:

tools/lli/RemoteTargetExternal.cpp:144:
void llvm::RemoteTargetExternal::Receive(llvm::LLIMessageType, uint64_t &):
Assertion `rc == 4 && "Error reading message type.

@rengolin
Copy link
Member Author

assigned to @rengolin

@rengolin
Copy link
Member Author

Forgot to say that, those tests pass on Stage1, but fail on Stage2, so it's probably some codegen failure in Clang...

@rengolin
Copy link
Member Author

Compiling lli with line info got me this stack trace:

#​0 llvm::RemoteTargetExternal::Receive
(ExpectedMsgType=llvm::LLI_AllocationResult, Data=@0x7effef98)
at tools/lli/RemoteTargetExternal.cpp:145

#​1 llvm::RemoteTargetExternal::allocateSpace
(Size=, Alignment=, Address=@0x4: )
at tools/lli/RemoteTargetExternal.cpp:32

#​2 0x000e03b4 in llvm::RemoteMemoryManager::notifyObjectLoaded
(EE=0xd887a0, Obj=)
at lli/RemoteMemoryManager.cpp:132

The Address=@0x4: is due to -O3 being aggressive, since Data above is correct and at the same address than the RemoteAddr from the first call.

The assembly of the beginning of that function also seem correct (although with too many moves):

_ZN4llvm20RemoteTargetExternal7ReceiveENS_14LLIMessageTypeERy:
push {r4, r5, r6, r11, lr}
add r11, sp, #​12
sub sp, sp, #​12
mov r5, r0
mov r4, r2
ldr r0, [r5, #​156]
mov r6, r1
add r1, sp, #​8
mov r2, #​4
ldr r0, [r0]
bl read
cmp r0, #​4
bne .LBB2_5

@rengolin
Copy link
Member Author

I just tested on x86_64 and Phase2 and Phase3 are stable. It's an ARM specific bug, it seems.

@andykaylor
Copy link
Contributor

This assertion reflects a failure in communication with a child process. The actual failure is almost certainly happening on the child side. This assertion just indicates that the child didn't send the expected response.

This may be because the child sent too few bytes (and this may be an implementation issue if the pipe read returned with less than 4 bytes but more were being buffered). It may also indicate that the child process crashed.

The communication protocol used in this case is fairly raw and potentially brittle. I'm not familiar with the particulars of how pipes work on ARM, so it may be that there is some behavioral difference being exposed by clang's optimizations.

The best way to approach this failure is probably to have a close inspection of the pipe communication handling on both child and host side.

@rengolin
Copy link
Member Author

Patch by Andrew Kaylor
Patch by Andrew Kaylor, fixes the assert, but introduces lock-ups. I'll have a look and see what (if) I can improve.

@rengolin
Copy link
Member Author

Modified Patch works
Hi Andrew,

If I introduce a busy waiting counter to limit the number of times it should try while in interruption, the lock-ups go away.

I don't think this is a good solution, since I believe this is still fragile and busy waiting, but at least it highlights that the problem is, indeed, in the communication protocol, as you suggested.

@rengolin
Copy link
Member Author

Patch using select
By using select, we guarantee that there will be something to read/write whenever we get to actually call the read/write functions. This has shown to work on x86_64 and ARMv7A.

@rengolin
Copy link
Member Author

rengolin commented Dec 3, 2013

Both patches work somewhat, but still fail randomly, which means we're not fixing the problem.

I'm working on refactoring the communication layer, not changing any of the protocol, but trying to format the code so to make it harder to not follow the protocol, ie. splitting between header and payload, adding lots of error messages (and checking for EINTR, EPIPE, etc), stack traces, etc.

@rengolin
Copy link
Member Author

rengolin commented Dec 3, 2013

Refactoring of remote MCJIT MPI
This is an incomplete implementation and meant only as a communication between me and Alp, since he also has a solution for this problem. My refactoring is less focused on solving the problem and more focused on making it a bit more robust by introducing error handling on all levels.

My assumptions:

  1. We need the low-level messages to have the same format: { header + size + payload }, and to have the same implementation (thus, the higher depth of hierarchy): Send/Receive x Header/Payload + Receive() wrappers
  2. The child should not assert or exit, but it should report errors to the parent via messages, and the parent should terminate the child (LLIMessageStatus)
  3. We need to keep the pipe synchronized and clean, or messages can get misaligned and it's hard to know which is the header (specially true to load section messages): not doing that yet to a full extent
  4. Success returns true, not false. ;)

@llvmbot
Copy link
Collaborator

llvmbot commented Dec 3, 2013

Created attachment 11650 [details]
Refactoring of remote MCJIT MPI

This is an incomplete implementation and meant only as a communication
between me and Alp, since he also has a solution for this problem. My
refactoring is less focused on solving the problem and more focused on
making it a bit more robust by introducing error handling on all levels.

My assumptions:

  1. We need the low-level messages to have the same format: { header + size
  • payload }, and to have the same implementation (thus, the higher depth of
    hierarchy): Send/Receive x Header/Payload + Receive() wrappers
  1. The child should not assert or exit, but it should report errors to the
    parent via messages, and the parent should terminate the child
    (LLIMessageStatus)
  2. We need to keep the pipe synchronized and clean, or messages can get
    misaligned and it's hard to know which is the header (specially true to load
    section messages): not doing that yet to a full extent
  3. Success returns true, not false. ;)

Brilliant! I just skimmed over this in 30 seconds and my refactor is the other half (killing off ChildTarget.inc), so there may be a couple of one-line conflicts but basically the two patches are complementary and can be landed independently.

@rengolin
Copy link
Member Author

rengolin commented Dec 3, 2013

Ok, I only started changing the ChildTarget.inc now because I spotted some failures there.

I'll stop my progress, wait for your patch, merge and continue.

@llvmbot
Copy link
Collaborator

llvmbot commented Dec 3, 2013

ChildTarget.inc cleanup, WIP
Hi Renato,

I'm posting my WIP patch so you can give it a spin on ARM and see where I'm heading.

This embeds LLVM's inc modules and removes duplicated code on the ChildTarget side.

My feeling is that if any platform-specific code is needed from now on, it should rather go into LLVM's Support and embedded into whatever needs it.

The UNIX/Windows abstraction for ChildTarget is gone, was not sound. Patch makes a start on Windows support using the existing platform code instead.

@llvmbot
Copy link
Collaborator

llvmbot commented Dec 3, 2013

By the way, InvalidateInstructionCache() isn't even getting called on TOT AFAICT, was did it ever work on ARM?

@rengolin
Copy link
Member Author

Fixed in r199261.

@rengolin
Copy link
Member Author

All recent failures have this type of error:

ERROR: ReadBytes: Expecting 4 bytes, Got 0, (RemoteTargetExternal::ReceiveHeader), (RemoteTargetExternal::executeCode)

Files affected so far:

test-common-symbols-remote.ll
test-fp-no-external-funcs-remote.ll
test-global-init-nonzero-remote.ll
test-ptr-reloc-remote.ll

Ex: http://lab.llvm.org:8011/builders/clang-native-arm-cortex-a15-self-host/builds/396

@llvmbot
Copy link
Collaborator

llvmbot commented Jan 24, 2014

I've landed my full refactoring series for the remote MCJIT. Along with Renato's work we're looking a lot better.

Can you try on ARM again? There's proper error reporting to stderr now.

@rengolin
Copy link
Member Author

My local tests on Clang-compiled LLVM went well, so I'm re-enabling the tests on ARM. Let's see how the self-hosting bot behaves. After a few successful runs, I'll mark this bug as fixed.

Thanks for the hard work!
--renato

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 9, 2021
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugzilla Issues migrated from bugzilla
Projects
None yet
Development

No branches or pull requests

3 participants