Bug 5184 - Lazy compilation is not thread-safe
: Lazy compilation is not thread-safe
Status: NEW
Product: libraries
Classification: Unclassified
Component: Target-Independent JIT
: trunk
: PC All
: P normal
Assigned To: Unassigned LLVM Bugs
:
:
:
:
  Show dependency treegraph
 
Reported: 2009-10-13 21:00 CDT by Jeffrey Yasskin
Modified: 2012-03-07 04:32 CST (History)
7 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Jeffrey Yasskin 2009-10-13 21:00:51 CDT
Lazy compilation is implemented roughly as:

call @compilationCallback

define @compilationCallback() {
  ...
  store %new_function_address (%return_address-4)
}

which turns the "call @compilationCallback" into a "call @real_function".
Unfortunately, while that store into the return address is running, another
thread could be executing the call. At least on the x86, there are no
guarantees about what can happen in this case. (Data reads and writes are
defined, but not instruction execution.) It's entirely possible that the store
won't be seen as atomic and will result in a wild jump.

I see three ways to fix this:
1) Instead of codegening a direct call instruction, generate something like:
  %target = atomic load @callsite_target_address
  call %target
and then at the end of compilation atomically store the real function back to
@callsite_target_address. This causes every callsite to be an indirect call,
which is likely to slow things down a lot, so there should be some way to mark
the functions where it's used.  On platforms that don't even have atomic
pointer writes, this could be implemented by acquiring a lock around every
lazily-compilable call.
2) Assert that lazy compilation is only used in a single-threaded program. This
allows us to delete the JIT lock.
3) Delete lazy compilation. (which would simplify the JIT a fair amount)
Comment 1 Jeffrey Yasskin 2009-10-22 18:36:10 CDT
Besides the "how to update a call instruction" problem, lazy JITting can race
with concurrent modifications of any IR in the same LLVMContext unless you're
careful to hold the JIT lock around such modifications. This should be
documented.
Comment 2 Rafael Ávila de Espíndola 2009-10-23 09:18:44 CDT
> I see three ways to fix this:
> 1) Instead of codegening a direct call instruction, generate something like:
>   %target = atomic load @callsite_target_address
>   call %target
> and then at the end of compilation atomically store the real function back to
> @callsite_target_address. This causes every callsite to be an indirect call,
> which is likely to slow things down a lot, so there should be some way to mark
> the functions where it's used.  On platforms that don't even have atomic
> pointer writes, this could be implemented by acquiring a lock around every
> lazily-compilable call.

This looks a lot like the way DSOs are handled when compiling with -fPIC. You
can probably expect the same kind of overhead. It is interesting to note that
when using a DSO that was not compiled without -fPIC (on x86 at least) the
resolution is done at load time.

> 2) Assert that lazy compilation is only used in a single-threaded program. This
> allows us to delete the JIT lock.
> 3) Delete lazy compilation. (which would simplify the JIT a fair amount)
> 
Comment 3 varth 2009-10-24 06:40:20 CDT
(In reply to comment #0)
> Lazy compilation is implemented roughly as:
> 
> call @compilationCallback
> 
> define @compilationCallback() {
>   ...
>   store %new_function_address (%return_address-4)
> }
> 
> which turns the "call @compilationCallback" into a "call @real_function".
> Unfortunately, while that store into the return address is running, another
> thread could be executing the call. At least on the x86, there are no
> guarantees about what can happen in this case. (Data reads and writes are
> defined, but not instruction execution.) It's entirely possible that the store
> won't be seen as atomic and will result in a wild jump.

I'm missing some knowledge: the write is a 4-byte write, right? Why is that not
atomic on x86?

If the store of a call instruction is not atomic, at least we should be able to
store an instruction that does an infinite loop until the update is complete.

The paper from IBM JVM developers called "Experiences with Multi-threading and
Dynamic Class Loading in a Java Just-In-Time Compiler" [CGO2006] discuss about
the issue of lazy compilation and multithreading. You can find the presentation
of the paper here:
http://www.cgo.org/cgo2006/html/progslides/session2_talk3_maier.pdf


> 
> I see three ways to fix this:
> 1) Instead of codegening a direct call instruction, generate something like:
>   %target = atomic load @callsite_target_address
>   call %target
> and then at the end of compilation atomically store the real function back to
> @callsite_target_address. This causes every callsite to be an indirect call,
> which is likely to slow things down a lot, so there should be some way to mark
> the functions where it's used.  On platforms that don't even have atomic
> pointer writes, this could be implemented by acquiring a lock around every
> lazily-compilable call.

The infinite loop instruction is a possible way of avoiding this.

> 2) Assert that lazy compilation is only used in a single-threaded program. This
> allows us to delete the JIT lock.

I really don't want this :)

> 3) Delete lazy compilation. (which would simplify the JIT a fair amount)
> 

Indeed, it would simplify things a lot, but, again, I really don't want to not
support lazy compilation. How can Unladden-Swallow afford disabling lazy
compilation? Don't you have functions that are not compiled yet and only want
to compile when they are actually used?
Comment 4 Rafael Ávila de Espíndola 2009-10-24 10:00:15 CDT
> > 3) Delete lazy compilation. (which would simplify the JIT a fair amount)
> > 
> 
> Indeed, it would simplify things a lot, but, again, I really don't want to not
> support lazy compilation. How can Unladden-Swallow afford disabling lazy
> compilation? Don't you have functions that are not compiled yet and only want
> to compile when they are actually used?
> 

If I understand it correctly, only hot parts are translated from python
bytecode to llvm.
Comment 5 Jeffrey Yasskin 2009-10-25 05:25:20 CDT
>> Unfortunately, while that store into the return address is running, another
>> thread could be executing the call. At least on the x86, there are no
>> guarantees about what can happen in this case. (Data reads and writes are
>> defined, but not instruction execution.) It's entirely possible that the store
>> won't be seen as atomic and will result in a wild jump.
>
> I'm missing some knowledge: the write is a 4-byte write, right? Why is that not
> atomic on x86?

With threading, the question you have to ask is, "why _is_ that atomic". Absent
guarantees from either the intel&amd architecture manuals, or exhaustive
testing with lots of different platforms, assuming something's atomic sets you
up to get paged at 3 in the morning.

For example, the "Intel® 64 and IA-32 Architectures Software Developer's
Manual" (http://www.intel.com/Assets/PDF/manual/253668.pdf) section [8.2.3.1 
Assumptions, Terminology, and Notation] says that aligned 1, 2, 4, and 8-byte
reads and writes, and arbitrary locked instructions appear to execute as a
single memory access, but that "Other instructions may be implemented with
multiple memory accesses."  It doesn't mention instruction fetches, but that
makes me think that they can appear to be implemented with multiple memory
accesses too. 

The JVMs have done exhaustive testing, so I personally trust copying them, but
that does set us up for a maintenance burden as new chips come out.

> If the store of a call instruction is not atomic, at least we should be able to
> store an instruction that does an infinite loop until the update is complete.

This doesn't necessarily help. Say the instruction looks like

[call] [target1] [target2] [target3] [target4]

So we call:
  store [br] [self] to {[call] [target1]}
  mfence
  store [self] [newtarget2] [new target3] [new target4] to {[self] [target2]
[target3] [target4]}
  mfence
  store [call] [newtarget1] to {[br] [self]}

Another thread is concurrently executing this code. 3 problems:
1) If storing the call instruction wasn't atomic in the first place, why do we
think storing the [br self] over it would be atomic? Specifically, the
executing thread could have fetched the [call] part of the original code before
the mfence, stalled for a bit while the updating thread wrote [br self], then
fetched [self] [target2] [target3] [target4]: crash
2)  Alternately, maybe writing [br self] is atomic, but the executing thread
fetched [call] [target1] before that write, then stalled for a bit, then
fetched [newtarget2] [new target3] [new target4] after they're written: crash.
3) Alternately, since the mfence constrains only the writer and not the reader,
maybe the instruction fetch happens out of order and reads [call] [newtarget1]
[target2] [target3] [target4]: crash.

> The paper from IBM JVM developers called "Experiences with Multi-threading and
> Dynamic Class Loading in a Java Just-In-Time Compiler" [CGO2006] discuss about
> the issue of lazy compilation and multithreading. You can find the presentation
> of the paper here:
> http://www.cgo.org/cgo2006/html/progslides/session2_talk3_maier.pdf

Thanks, that's helpful. It says that fetches are atomic within "patching
boundaries": "A patching boundary is an address that is aligned on the length
of a data cache line on P6 microarchitecture (i.e., 32 bytes on Pentium 3 and
64 bytes on Pentium 4) [9] [10] or 8 bytes on K7 and K8 microarchitecture
(determined through empirical experimentation) [1]."  Of course, we'd need more
testing to check that 8 bytes is still sufficient on K10, Core 2, and Nehalem.
The right thing to do is probably to cross-reference our patching code to the
right functions in openjdk so it's easy to just imitate them for new
processors.

>> 3) Delete lazy compilation. (which would simplify the JIT a fair amount)
>>
>
> Indeed, it would simplify things a lot, but, again, I really don't want to not
> support lazy compilation. How can Unladden-Swallow afford disabling lazy
> compilation? Don't you have functions that are not compiled yet and only want
> to compile when they are actually used?

Like Rafael said, Unladen Swallow and Rubinius only JIT-compile hot functions,
and we don't define "hot" as "called once". When a not-yet-compiled function is
called from JITted code, it calls back into the interpreter. If A calls B, and
A becomes hot before B, we currently have to recompile A to get a direct call
to B; otherwise the call goes through the interpreter. Rubinius tries to find
blocks of functions to compile all at once, so it's less vulnerable to a caller
becoming hot only slightly before a callee.

Lazy compilation in the current JIT would also be a latency problem: codegen is
slow, and we don't want to block the call to the lazily-compiled function while
it codegens (the call can run through the interpreter instead). Instead, we
want to compile in a background thread and only switch the call from the
interpreter to compiled code when compilation is finished.

I think we _would_ use a more generic thread-safe code patching facility if the
JIT offered it. 
Comment 6 varth 2009-10-26 06:41:10 CDT
OK, so if you're not using lazy compilation, then why bother? ;-) No seriously,
I'd rather let that bug open, wait until an expert implement it, and still
enable lazy compilation in llvm. Does that sound reasonable?
Comment 7 Jeffrey Yasskin 2009-10-26 19:30:59 CDT
I think we should turn lazy compilation off by default as long as it's not
thread-safe. Besides being potentially crashy, it has confused people trying to
helgrind their apps.

I don't think the "lazily compile every function" approach is good for
performance for anyone, but if vmkit or other clients are using it, I can't
unilaterally delete it.
Comment 8 varth 2009-10-27 02:47:08 CDT
(In reply to comment #7)
> I think we should turn lazy compilation off by default as long as it's not
> thread-safe.

Fine by me!
Comment 9 Torvald 2009-10-28 15:10:22 CDT
I got some feedback (thanks!), at least for AMD processors.
According to the AMD programmer manual
(http://support.amd.com/us/Processor_TechDocs/24593.pdf), it is supposed to be
safe to patch the *naturally aligned* address after the call with a store
(Section 7.6.1). So, one could thus align the call before the call address
accordingly with nops. Using a lock cmpxchg is not necessarily safe.
Comment 10 Jeffrey Yasskin 2009-10-28 15:50:06 CDT
Thanks Torvald! Specifically, I think you're referring to "Synchronization for
cross- 
modifying code is not required for code that resides within the naturally
aligned quadword." which is a really nice guarantee.

I can't find the wording that implies that a lock cmpxchg is not necessarily
safe; where do you see that? I don't think we need it, but I'd like to get all
the restrictions recorded here if possible.

If anyone else knows the guarantees for powerpc, arm, etc. please link to them
here too.
Comment 11 Torvald 2009-11-09 05:54:52 CST
Jeffrey, I guess that the natural alignment is the necessary condition, not the
cmpxchg. But I don't have a citation for this, sorry.