LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filed before November 26, 2021. Use GitHub to submit LLVM bugs.

Bug 5184 - Lazy compilation is not thread-safe
Summary: Lazy compilation is not thread-safe
Status: NEW
Alias: None
Product: libraries
Classification: Unclassified
Component: MCJIT
Version: trunk
Hardware: PC All
Importance: P normal
Assignee: Unassigned LLVM Bugs
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-10-13 21:00 PDT by Jeffrey Yasskin
Modified: 2012-03-07 04:32 PST
CC List: 7 users

See Also:
Fixed By Commit(s):


Description Jeffrey Yasskin 2009-10-13 21:00:51 PDT
Lazy compilation is implemented roughly as:

call @compilationCallback

define @compilationCallback() {
  ...
  store %new_function_address, (%return_address - 4)  ; overwrite the call's 4-byte target
}

which turns the "call @compilationCallback" into a "call @real_function". Unfortunately, while that store into the return address is running, another thread could be executing the call. At least on the x86, there are no guarantees about what can happen in this case. (Data reads and writes are defined, but not instruction execution.) It's entirely possible that the store won't be seen as atomic and will result in a wild jump.

I see three ways to fix this:
1) Instead of codegening a direct call instruction, generate something like:
  %target = atomic load @callsite_target_address
  call %target
and then at the end of compilation atomically store the real function back to @callsite_target_address. This causes every callsite to be an indirect call, which is likely to slow things down a lot, so there should be some way to mark the functions where it's used. On platforms that don't even have atomic pointer writes, this could be implemented by acquiring a lock around every lazily-compilable call. (A sketch of this option follows the list.)
2) Assert that lazy compilation is only used in a single-threaded program. This allows us to delete the JIT lock.
3) Delete lazy compilation. (which would simplify the JIT a fair amount)
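A minimal sketch of option (1) in C++, with hypothetical names standing in for what codegen would actually emit (std::atomic plays the role of the slot; this is an illustration, not the JIT's real structure):

#include <atomic>
#include <cstdio>

using Fn = void (*)();

static void realFunction() { std::puts("compiled body"); }

// Plays the role of @callsite_target_address above.
static std::atomic<Fn> callsiteTarget;

static void compilationCallback() {
  // ... codegen would run here ...
  // Publish the compiled function with one atomic pointer store. A racing
  // caller sees either this callback (benign) or the finished function.
  callsiteTarget.store(realFunction, std::memory_order_release);
  realFunction();  // complete the call that triggered compilation
}

// What every lazily-compilable callsite lowers to: load the slot, call it.
static void callSite() {
  callsiteTarget.load(std::memory_order_acquire)();
}

int main() {
  callsiteTarget.store(compilationCallback, std::memory_order_relaxed);
  callSite();  // first call "compiles" and runs the function
  callSite();  // later calls go straight to the compiled code
}

The direct-call-patching problem disappears because code bytes are never rewritten; only a data pointer changes, at the cost of an indirect call at every site.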
Comment 1 Jeffrey Yasskin 2009-10-22 18:36:10 PDT
Besides the "how to update a call instruction" problem, lazy JITting can race with concurrent modifications of any IR in the same LLVMContext unless you're careful to hold the JIT lock around such modifications. This should be documented.
Comment 2 Rafael Ávila de Espíndola 2009-10-23 09:18:44 PDT
> I see three ways to fix this:
> 1) Instead of codegening a direct call instruction, generate something like:
>   %target = atomic load @callsite_target_address
>   call %target
> and then at the end of compilation atomically store the real function back to
> @callsite_target_address. This causes every callsite to be an indirect call,
> which is likely to slow things down a lot, so there should be some way to mark
> the functions where it's used.  On platforms that don't even have atomic
> pointer writes, this could be implemented by acquiring a lock around every
> lazily-compilable call.

This looks a lot like the way DSOs are handled when compiling with -fPIC. You can probably expect the same kind of overhead. It is interesting to note that when using a DSO that was not compiled with -fPIC (on x86 at least), the resolution is done at load time.

> 2) Assert that lazy compilation is only used in a single-threaded program. This
> allows us to delete the JIT lock.
> 3) Delete lazy compilation. (which would simplify the JIT a fair amount)
> 
Comment 3 varth 2009-10-24 06:40:20 PDT
(In reply to comment #0)
> Lazy compilation is implemented roughly as:
> 
> call @compilationCallback
> 
> define @compilationCallback() {
>   ...
>   store %new_function_address (%return_address-4)
> }
> 
> which turns the "call @compilationCallback" into a "call @real_function".
> Unfortunately, while that store into the return address is running, another
> thread could be executing the call. At least on the x86, there are no
> guarantees about what can happen in this case. (Data reads and writes are
> defined, but not instruction execution.) It's entirely possible that the store
> won't be seen as atomic and will result in a wild jump.

I'm missing some knowledge: the write is a 4-byte write, right? Why is that not atomic on x86?

If the store of a call instruction is not atomic, at least we should be able to store an instruction that does an infinite loop until the update is complete.

The CGO 2006 paper from IBM JVM developers, "Experiences with Multi-threading and Dynamic Class Loading in a Java Just-In-Time Compiler", discusses the issue of lazy compilation and multithreading. You can find the presentation of the paper here:
http://www.cgo.org/cgo2006/html/progslides/session2_talk3_maier.pdf


> 
> I see three ways to fix this:
> 1) Instead of codegening a direct call instruction, generate something like:
>   %target = atomic load @callsite_target_address
>   call %target
> and then at the end of compilation atomically store the real function back to
> @callsite_target_address. This causes every callsite to be an indirect call,
> which is likely to slow things down a lot, so there should be some way to mark
> the functions where it's used.  On platforms that don't even have atomic
> pointer writes, this could be implemented by acquiring a lock around every
> lazily-compilable call.

The infinite loop instruction is a possible way of avoiding this.

> 2) Assert that lazy compilation is only used in a single-threaded program. This
> allows us to delete the JIT lock.

I really don't want this :)

> 3) Delete lazy compilation. (which would simplify the JIT a fair amount)
> 

Indeed, it would simplify things a lot, but, again, I really don't want to drop support for lazy compilation. How can Unladen Swallow afford to disable lazy compilation? Don't you have functions that are not compiled yet and that you only want to compile when they are actually used?
Comment 4 Rafael Ávila de Espíndola 2009-10-24 10:00:15 PDT
> > 3) Delete lazy compilation. (which would simplify the JIT a fair amount)
> > 
> 
> Indeed, it would simplify things a lot, but, again, I really don't want to not
> support lazy compilation. How can Unladen Swallow afford disabling lazy
> compilation? Don't you have functions that are not compiled yet and only want
> to compile when they are actually used?
> 

If I understand it correctly, only hot parts are translated from Python bytecode to LLVM IR.
 
Comment 5 Jeffrey Yasskin 2009-10-25 05:25:20 PDT
>> Unfortunately, while that store into the return address is running, another
>> thread could be executing the call. At least on the x86, there are no
>> guarantees about what can happen in this case. (Data reads and writes are
>> defined, but not instruction execution.) It's entirely possible that the store
>> won't be seen as atomic and will result in a wild jump.
>
> I'm missing some knowledge: the write is a 4-byte write, right? Why is that not
> atomic on x86?

With threading, the question you have to ask is, "why _is_ that atomic?" Absent guarantees from either the Intel and AMD architecture manuals or exhaustive testing on lots of different platforms, assuming something's atomic sets you up to get paged at 3 in the morning.

For example, the "Intel® 64 and IA-32 Architectures Software Developer's Manual" (http://www.intel.com/Assets/PDF/manual/253668.pdf) section [8.2.3.1  Assumptions, Terminology, and Notation] says that aligned 1, 2, 4, and 8-byte reads and writes, and arbitrary locked instructions appear to execute as a single memory access, but that "Other instructions may be implemented with multiple memory accesses."  It doesn't mention instruction fetches, but that makes me think that they can appear to be implemented with multiple memory accesses too. 

The JVMs have done exhaustive testing, so I personally trust copying them, but that does set us up for a maintenance burden as new chips come out.

> If the store of a call instruction is not atomic, at least we should be able to
> store an instruction that does an infinite loop until the update is complete.

This doesn't necessarily help. Say the instruction looks like

[call] [target1] [target2] [target3] [target4]

So we call:
  store [br] [self] to {[call] [target1]}
  mfence
  store [self] [newtarget2] [newtarget3] [newtarget4] to {[self] [target2] [target3] [target4]}
  mfence
  store [call] [newtarget1] to {[br] [self]}

Another thread is concurrently executing this code. Three problems:
1) If storing the call instruction wasn't atomic in the first place, why do we think storing the [br self] over it would be atomic? Specifically, the executing thread could have fetched the [call] part of the original code before the mfence, stalled for a bit while the updating thread wrote [br self], then fetched [self] [target2] [target3] [target4]: crash.
2) Alternately, maybe writing [br self] is atomic, but the executing thread fetched [call] [target1] before that write, then stalled for a bit, then fetched [newtarget2] [newtarget3] [newtarget4] after they're written: crash.
3) Alternately, since the mfence constrains only the writer and not the reader, maybe the instruction fetch happens out of order and reads [call] [newtarget1] [target2] [target3] [target4]: crash.
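For concreteness, here is that sequence written out for a 5-byte x86 "call rel32" (byte 0 is the 0xE8 opcode, bytes 1-4 the displacement). The helper is hypothetical and illustrates the scheme being critiqued, not working code: as the three races above show, the fences order only the writer, never a concurrent instruction fetch.

#include <atomic>
#include <cstdint>

void patchCallUnsafely(volatile uint8_t *site, const uint8_t newTarget[4]) {
  // store [br] [self] over {[call] [target1]}: 0xEB 0xFE encodes "jmp $",
  // a branch that spins on itself. (Race 1: even this first overwrite
  // would itself have to be a single atomic 2-byte store.)
  site[0] = 0xEB;
  site[1] = 0xFE;
  std::atomic_thread_fence(std::memory_order_seq_cst);  // the first mfence

  // store the new trailing bytes over {[target2] [target3] [target4]}.
  // (Race 2: a reader that fetched [call] [target1] earlier may see these.)
  site[2] = newTarget[1];
  site[3] = newTarget[2];
  site[4] = newTarget[3];
  std::atomic_thread_fence(std::memory_order_seq_cst);  // the second mfence

  // store [call] [newtarget1] back over {[br] [self]}.
  // (Race 3: an out-of-order fetch may combine old and new bytes.)
  site[0] = 0xE8;
  site[1] = newTarget[0];
}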

> The paper from IBM JVM developers called "Experiences with Multi-threading and
> Dynamic Class Loading in a Java Just-In-Time Compiler" [CGO2006] discusses
> the issue of lazy compilation and multithreading. You can find the presentation
> of the paper here:
> http://www.cgo.org/cgo2006/html/progslides/session2_talk3_maier.pdf

Thanks, that's helpful. It says that fetches are atomic within "patching boundaries": "A patching boundary is an address that is aligned on the length of a data cache line on P6 microarchitecture (i.e., 32 bytes on Pentium 3 and 64 bytes on Pentium 4) [9] [10] or 8 bytes on K7 and K8 microarchitecture (determined through empirical experimentation) [1]." Of course, we'd need more testing to check that 8 bytes is still sufficient on K10, Core 2, and Nehalem. The right thing to do is probably to cross-reference our patching code to the right functions in OpenJDK so it's easy to imitate them for new processors.

>> 3) Delete lazy compilation. (which would simplify the JIT a fair amount)
>>
>
> Indeed, it would simplify things a lot, but, again, I really don't want to not
> support lazy compilation. How can Unladen Swallow afford disabling lazy
> compilation? Don't you have functions that are not compiled yet and only want
> to compile when they are actually used?

Like Rafael said, Unladen Swallow and Rubinius only JIT-compile hot functions, and we don't define "hot" as "called once". When a not-yet-compiled function is called from JITted code, it calls back into the interpreter. If A calls B, and A becomes hot before B, we currently have to recompile A to get a direct call to B; otherwise the call goes through the interpreter. Rubinius tries to find blocks of functions to compile all at once, so it's less vulnerable to a caller becoming hot only slightly before a callee.

Lazy compilation in the current JIT would also be a latency problem: codegen is slow, and we don't want to block the call to the lazily-compiled function while it codegens (the call can run through the interpreter instead). Instead, we want to compile in a background thread and only switch the call from the interpreter to compiled code when compilation is finished.

I think we _would_ use a more generic thread-safe code patching facility if the JIT offered it. 
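A rough sketch of that arrangement in C++ (hypothetical names; the interpreter and compiled body are stand-ins):

#include <atomic>
#include <thread>

using Fn = long (*)(long);

static long interpret(long x) { return x + 1; }     // stand-in interpreter
static long compiledBody(long x) { return x + 1; }  // stand-in JITted code

// The callsite slot starts out pointing at the interpreter.
static std::atomic<Fn> entry{interpret};

static void backgroundCompile() {
  // ... slow codegen runs here, off every caller's critical path ...
  entry.store(compiledBody, std::memory_order_release);  // flip the switch
}

static long call(long x) {
  // Callers never block on codegen: before the swap they interpret,
  // after it they run compiled code. Both compute the same result.
  return entry.load(std::memory_order_acquire)(x);
}

int main() {
  std::thread compiler(backgroundCompile);
  long r = call(41);  // interpreted or compiled, depending on timing
  compiler.join();
  return r == 42 ? 0 : 1;
}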
Comment 6 varth 2009-10-26 06:41:10 PDT
OK, so if you're not using lazy compilation, then why bother? ;-) No, seriously, I'd rather leave this bug open, wait until an expert implements it, and keep lazy compilation enabled in LLVM. Does that sound reasonable?
Comment 7 Jeffrey Yasskin 2009-10-26 19:30:59 PDT
I think we should turn lazy compilation off by default as long as it's not thread-safe. Besides being potentially crashy, it has confused people trying to helgrind their apps.

I don't think the "lazily compile every function" approach is good for performance for anyone, but if VMKit or other clients are using it, I can't unilaterally delete it.
Comment 8 varth 2009-10-27 02:47:08 PDT
(In reply to comment #7)
> I think we should turn lazy compilation off by default as long as it's not
> thread-safe.

Fine by me!
Comment 9 Torvald 2009-10-28 15:10:22 PDT
I got some feedback (thanks!), at least for AMD processors.
According to the AMD programmer's manual (http://support.amd.com/us/Processor_TechDocs/24593.pdf), it is supposed to be safe to patch the *naturally aligned* address after the call with a store (Section 7.6.1). One could thus use nops to align the call so that the address after it is naturally aligned. Using a lock cmpxchg is not necessarily safe.
Comment 10 Jeffrey Yasskin 2009-10-28 15:50:06 PDT
Thanks Torvald! Specifically, I think you're referring to "Synchronization for cross-modifying code is not required for code that resides within the naturally aligned quadword," which is a really nice guarantee.

I can't find the wording that implies that a lock cmpxchg is not necessarily safe; where do you see that? I don't think we need it, but I'd like to get all the restrictions recorded here if possible.

If anyone else knows the guarantees for powerpc, arm, etc. please link to them here too.
Comment 11 Torvald 2009-11-09 05:54:52 PST
Jeffrey, I guess that the natural alignment is the necessary condition, not the cmpxchg. But I don't have a citation for this, sorry.
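One concrete reading of that condition, as an illustrative predicate (assumptions: x86, a 5-byte "call rel32", and the naturally-aligned-quadword guarantee from comment 9; this is not the manual's wording):

#include <cstdint>

// The 4-byte displacement of a "call rel32" at `site` starts at site + 1.
// It can be patched with a single plain store only if it does not cross
// an aligned 8-byte boundary.
bool displacementIsPatchable(uintptr_t site) {
  uintptr_t disp = site + 1;  // displacement follows the 1-byte opcode
  return (disp & 7) <= 4;     // bytes disp..disp+3 fit in one quadword
}

A codegen that wants patchable callsites would emit nops before the call until this predicate holds, as suggested in comment 9.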