LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filled before November 26, 2021. Use github to submit LLVM bugs

Bug 18057 - MCJIT tests fail randomly on ARM
Summary: MCJIT tests fail randomly on ARM
Status: RESOLVED FIXED
Alias: None
Product: libraries
Classification: Unclassified
Component: Generic Execution Engine Support (show other bugs)
Version: trunk
Hardware: PC Linux
: P normal
Assignee: Renato Golin
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-11-25 05:02 PST by Renato Golin
Modified: 2014-01-26 06:39 PST (History)
6 users (show)

See Also:
Fixed By Commit(s):


Attachments
Patch by Andrew Kaylor (3.11 KB, patch)
2013-11-28 09:04 PST, Renato Golin
Details
Modified Patch works (1.89 KB, patch)
2013-11-28 14:00 PST, Renato Golin
Details
Patch using select (2.26 KB, patch)
2013-11-28 16:46 PST, Renato Golin
Details
Refactoring of remote MCJIT MPI (31.26 KB, patch)
2013-12-03 05:35 PST, Renato Golin
Details
ChildTarget.inc cleanup, WIP (11.19 KB, patch)
2013-12-03 06:21 PST, Alp Toker
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Renato Golin 2013-11-25 05:02:58 PST
Tests in ExecutionEngine/MCJIT/remote like:
 * multi-module-a.ll.script
 * test-fp-no-external-funcs-remote.ll.script
 * test-common-symbols-remote.ll.script
 * cross-module-a.ll.script
 * test-global-init-nonzero-remote.ll.script

Fail randomly on a Cortex-A15 buildbot with the Assert:

tools/lli/RemoteTargetExternal.cpp:144:
 void llvm::RemoteTargetExternal::Receive(llvm::LLIMessageType, uint64_t &): 
  Assertion `rc == 4 && "Error reading message type.
Comment 1 Renato Golin 2013-11-25 05:10:05 PST
Forgot to say that, those tests pass on Stage1, but fail on Stage2, so it's probably some codegen failure in Clang...
Comment 2 Renato Golin 2013-11-25 10:59:58 PST
Compiling lli with line info got me this stack trace:

#0  llvm::RemoteTargetExternal::Receive 
    (ExpectedMsgType=llvm::LLI_AllocationResult, Data=@0x7effef98)
    at tools/lli/RemoteTargetExternal.cpp:145

#1  llvm::RemoteTargetExternal::allocateSpace
    (Size=<optimized out>, Alignment=<optimized out>, Address=@0x4: <error>)
    at tools/lli/RemoteTargetExternal.cpp:32

#2  0x000e03b4 in llvm::RemoteMemoryManager::notifyObjectLoaded
    (EE=0xd887a0, Obj=<optimized out>)
    at lli/RemoteMemoryManager.cpp:132

The Address=@0x4: <error> is due to -O3 being aggressive, since Data above is correct and at the same address than the RemoteAddr from the first call.

The assembly of the beginning of that function also seem correct (although with too many moves):

_ZN4llvm20RemoteTargetExternal7ReceiveENS_14LLIMessageTypeERy:
	push	{r4, r5, r6, r11, lr}
	add	r11, sp, #12
	sub	sp, sp, #12
	mov	r5, r0
	mov	r4, r2
	ldr	r0, [r5, #156]
	mov	r6, r1
	add	r1, sp, #8
	mov	r2, #4
	ldr	r0, [r0]
	bl	read
	cmp	r0, #4
	bne	.LBB2_5
Comment 3 Renato Golin 2013-11-26 04:24:02 PST
I just tested on x86_64 and Phase2 and Phase3 are stable. It's an ARM specific bug, it seems.
Comment 4 Andy Kaylor 2013-11-26 12:58:28 PST
This assertion reflects a failure in communication with a child process.  The actual failure is almost certainly happening on the child side.  This assertion just indicates that the child didn't send the expected response.

This may be because the child sent too few bytes (and this may be an implementation issue if the pipe read returned with less than 4 bytes but more were being buffered).  It may also indicate that the child process crashed.

The communication protocol used in this case is fairly raw and potentially brittle.  I'm not familiar with the particulars of how pipes work on ARM, so it may be that there is some behavioral difference being exposed by clang's optimizations.

The best way to approach this failure is probably to have a close inspection of the pipe communication handling on both child and host side.
Comment 5 Renato Golin 2013-11-28 09:04:50 PST
Created attachment 11625 [details]
Patch by Andrew Kaylor

Patch by Andrew Kaylor, fixes the assert, but introduces lock-ups. I'll have a look and see what (if) I can improve.
Comment 6 Renato Golin 2013-11-28 14:00:36 PST
Created attachment 11627 [details]
Modified Patch works

Hi Andrew,

If I introduce a busy waiting counter to limit the number of times it should try while in interruption, the lock-ups go away.

I don't think this is a good solution, since I believe this is still fragile *and* busy waiting, but at least it highlights that the problem is, indeed, in the communication protocol, as you suggested.
Comment 7 Renato Golin 2013-11-28 16:46:00 PST
Created attachment 11628 [details]
Patch using select

By using select, we guarantee that there will be something to read/write whenever we get to actually call the read/write functions. This has shown to work on x86_64 and ARMv7A.
Comment 8 Renato Golin 2013-12-03 03:08:49 PST
Both patches work somewhat, but still fail randomly, which means we're not fixing the problem.

I'm working on refactoring the communication layer, not changing any of the protocol, but trying to format the code so to make it harder to *not* follow the protocol, ie. splitting between header and payload, adding lots of error messages (and checking for EINTR, EPIPE, etc), stack traces, etc.
Comment 9 Renato Golin 2013-12-03 05:35:01 PST
Created attachment 11650 [details]
Refactoring of remote MCJIT MPI

This is an incomplete implementation and meant only as a communication between me and Alp, since he also has a solution for this problem. My refactoring is less focused on solving the problem and more focused on making it a bit more robust by introducing error handling on all levels.

My assumptions:
 1. We need the low-level messages to have the same format: { header + size + payload }, and to have the same implementation (thus, the higher depth of hierarchy): Send/Receive x Header/Payload + Receive() wrappers
 2. The child should not assert or exit, but it should report errors to the parent via messages, and the parent should terminate the child (LLIMessageStatus)
 3. We need to keep the pipe synchronized and clean, or messages can get misaligned and it's hard to know which is the header (specially true to load section messages): not doing that yet to a full extent
 4. Success returns true, not false. ;)
Comment 10 Alp Toker 2013-12-03 05:37:26 PST
(In reply to comment #9)
> Created attachment 11650 [details]
> Refactoring of remote MCJIT MPI
> 
> This is an incomplete implementation and meant only as a communication
> between me and Alp, since he also has a solution for this problem. My
> refactoring is less focused on solving the problem and more focused on
> making it a bit more robust by introducing error handling on all levels.
> 
> My assumptions:
>  1. We need the low-level messages to have the same format: { header + size
> + payload }, and to have the same implementation (thus, the higher depth of
> hierarchy): Send/Receive x Header/Payload + Receive() wrappers
>  2. The child should not assert or exit, but it should report errors to the
> parent via messages, and the parent should terminate the child
> (LLIMessageStatus)
>  3. We need to keep the pipe synchronized and clean, or messages can get
> misaligned and it's hard to know which is the header (specially true to load
> section messages): not doing that yet to a full extent
>  4. Success returns true, not false. ;)


Brilliant! I just skimmed over this in 30 seconds and my refactor is the other half (killing off ChildTarget.inc), so there may be a couple of one-line conflicts but basically the two patches are complementary and can be landed independently.
Comment 11 Renato Golin 2013-12-03 06:12:39 PST
Ok, I only started changing the ChildTarget.inc now because I spotted some failures there. 

I'll stop my progress, wait for your patch, merge and continue.
Comment 12 Alp Toker 2013-12-03 06:21:07 PST
Created attachment 11652 [details]
ChildTarget.inc cleanup, WIP

Hi Renato,

I'm posting my WIP patch so you can give it a spin on ARM and see where I'm heading.

This embeds LLVM's inc modules and removes duplicated code on the ChildTarget side.

My feeling is that if any platform-specific code is needed from now on, it should rather go into LLVM's Support and embedded into whatever needs it.

The UNIX/Windows abstraction for ChildTarget is gone, was not sound. Patch makes a start on Windows support using the existing platform code instead.
Comment 13 Alp Toker 2013-12-03 06:25:21 PST
By the way, InvalidateInstructionCache() isn't even getting called on TOT AFAICT, was did it ever work on ARM?
Comment 14 Renato Golin 2014-01-14 17:07:23 PST
Fixed in r199261.
Comment 15 Renato Golin 2014-01-15 03:09:25 PST
All recent failures have this type of error:

ERROR: ReadBytes: Expecting 4 bytes, Got 0, (RemoteTargetExternal::ReceiveHeader), (RemoteTargetExternal::executeCode)

Files affected so far:

test-common-symbols-remote.ll
test-fp-no-external-funcs-remote.ll
test-global-init-nonzero-remote.ll
test-ptr-reloc-remote.ll

Ex: http://lab.llvm.org:8011/builders/clang-native-arm-cortex-a15-self-host/builds/396
Comment 16 Alp Toker 2014-01-24 15:21:30 PST
I've landed my full refactoring series for the remote MCJIT. Along with Renato's work we're looking a lot better.

Can you try on ARM again? There's proper error reporting to stderr now.
Comment 17 Renato Golin 2014-01-25 17:39:51 PST
My local tests on Clang-compiled LLVM went well, so I'm re-enabling the tests on ARM. Let's see how the self-hosting bot behaves. After a few successful runs, I'll mark this bug as fixed.

Thanks for the hard work!
--renato