Lazy compilation is not thread-safe #5556

llvmbot · 2009-10-14T04:00:51Z


Bugzilla Link	5184
Version	trunk
OS	All
Reporter	LLVM Bugzilla Contributor
CC	@danchr

Extended Description

Lazy compilation is implemented roughly as:

call @compilationCallback

define @compilationCallback() {
...
store %new_function_address (%return_address-4)
}

which turns the "call @compilationCallback" into a "call @real_function". Unfortunately, while that store into the return address is running, another thread could be executing the call. At least on the x86, there are no guarantees about what can happen in this case. (Data reads and writes are defined, but not instruction execution.) It's entirely possible that the store won't be seen as atomic and will result in a wild jump.

I see three ways to fix this:

Instead of codegening a direct call instruction, generate something like:
%target = atomic load @callsite_target_address
call %target
and then at the end of compilation atomically store the real function back to @callsite_target_address. This causes every callsite to be an indirect call, which is likely to slow things down a lot, so there should be some way to mark the functions where it's used. On platforms that don't even have atomic pointer writes, this could be implemented by acquiring a lock around every lazily-compilable call.
Assert that lazy compilation is only used in a single-threaded program. This allows us to delete the JIT lock.
Delete lazy compilation. (which would simplify the JIT a fair amount)

llvmbot · 2009-10-23T01:36:10Z

Besides the "how to update a call instruction" problem, lazy JITting can race with concurrent modifications of any IR in the same LLVMContext unless you're careful to hold the JIT lock around such modifications. This should be documented.

llvmbot · 2009-10-23T16:18:44Z

I see three ways to fix this:

Instead of codegening a direct call instruction, generate something like:
%target = atomic load @callsite_target_address
call %target
and then at the end of compilation atomically store the real function back to
@callsite_target_address. This causes every callsite to be an indirect call,
which is likely to slow things down a lot, so there should be some way to mark
the functions where it's used. On platforms that don't even have atomic
pointer writes, this could be implemented by acquiring a lock around every
lazily-compilable call.

This looks a lot like the way DSOs are handled when compiling with -fPIC. You can probably expect the same kind of overhead. It is interesting to note that when using a DSO that was not compiled without -fPIC (on x86 at least) the resolution is done at load time.

Assert that lazy compilation is only used in a single-threaded program. This
allows us to delete the JIT lock.

Delete lazy compilation. (which would simplify the JIT a fair amount)

llvmbot · 2009-10-24T13:40:20Z

Lazy compilation is implemented roughly as:

call @compilationCallback

define @compilationCallback() {
...
store %new_function_address (%return_address-4)
}

which turns the "call @compilationCallback" into a "call @real_function".
Unfortunately, while that store into the return address is running, another
thread could be executing the call. At least on the x86, there are no
guarantees about what can happen in this case. (Data reads and writes are
defined, but not instruction execution.) It's entirely possible that the store
won't be seen as atomic and will result in a wild jump.

I'm missing some knowledge: the write is a 4-byte write, right? Why is that not atomic on x86?

If the store of a call instruction is not atomic, at least we should be able to store an instruction that does an infinite loop until the update is complete.

The paper from IBM JVM developers called "Experiences with Multi-threading and Dynamic Class Loading in a Java Just-In-Time Compiler" [CGO2006] discuss about the issue of lazy compilation and multithreading. You can find the presentation of the paper here:
http://www.cgo.org/cgo2006/html/progslides/session2_talk3_maier.pdf

I see three ways to fix this:

Instead of codegening a direct call instruction, generate something like:
%target = atomic load @callsite_target_address
call %target
and then at the end of compilation atomically store the real function back to
@callsite_target_address. This causes every callsite to be an indirect call,
which is likely to slow things down a lot, so there should be some way to mark
the functions where it's used. On platforms that don't even have atomic
pointer writes, this could be implemented by acquiring a lock around every
lazily-compilable call.

The infinite loop instruction is a possible way of avoiding this.

Assert that lazy compilation is only used in a single-threaded program. This
allows us to delete the JIT lock.

I really don't want this :)

Delete lazy compilation. (which would simplify the JIT a fair amount)

Indeed, it would simplify things a lot, but, again, I really don't want to not support lazy compilation. How can Unladden-Swallow afford disabling lazy compilation? Don't you have functions that are not compiled yet and only want to compile when they are actually used?

llvmbot · 2009-10-24T17:00:15Z

Delete lazy compilation. (which would simplify the JIT a fair amount)

Indeed, it would simplify things a lot, but, again, I really don't want to not
support lazy compilation. How can Unladden-Swallow afford disabling lazy
compilation? Don't you have functions that are not compiled yet and only want
to compile when they are actually used?

If I understand it correctly, only hot parts are translated from python bytecode to llvm.

llvmbot · 2009-10-25T12:25:20Z

Unfortunately, while that store into the return address is running, another
thread could be executing the call. At least on the x86, there are no
guarantees about what can happen in this case. (Data reads and writes are
defined, but not instruction execution.) It's entirely possible that the store
won't be seen as atomic and will result in a wild jump.

I'm missing some knowledge: the write is a 4-byte write, right? Why is that not
atomic on x86?

With threading, the question you have to ask is, "why is that atomic". Absent guarantees from either the intel&amd architecture manuals, or exhaustive testing with lots of different platforms, assuming something's atomic sets you up to get paged at 3 in the morning.

For example, the "Intel® 64 and IA-32 Architectures Software Developer's Manual" (http://www.intel.com/Assets/PDF/manual/253668.pdf) section [8.2.3.1 Assumptions, Terminology, and Notation] says that aligned 1, 2, 4, and 8-byte reads and writes, and arbitrary locked instructions appear to execute as a single memory access, but that "Other instructions may be implemented with multiple memory accesses." It doesn't mention instruction fetches, but that makes me think that they can appear to be implemented with multiple memory accesses too.

The JVMs have done exhaustive testing, so I personally trust copying them, but that does set us up for a maintenance burden as new chips come out.

If the store of a call instruction is not atomic, at least we should be able to
store an instruction that does an infinite loop until the update is complete.

This doesn't necessarily help. Say the instruction looks like

[call] [target1] [target2] [target3] [target4]

So we call:
store [br] [self] to {[call] [target1]}
mfence
store [self] [newtarget2] [new target3] [new target4] to {[self] [target2] [target3] [target4]}
mfence
store [call] [newtarget1] to {[br] [self]}

Another thread is concurrently executing this code. 3 problems:

If storing the call instruction wasn't atomic in the first place, why do we think storing the [br self] over it would be atomic? Specifically, the executing thread could have fetched the [call] part of the original code before the mfence, stalled for a bit while the updating thread wrote [br self], then fetched [self] [target2] [target3] [target4]: crash
Alternately, maybe writing [br self] is atomic, but the executing thread fetched [call] [target1] before that write, then stalled for a bit, then fetched [newtarget2] [new target3] [new target4] after they're written: crash.
Alternately, since the mfence constrains only the writer and not the reader, maybe the instruction fetch happens out of order and reads [call] [newtarget1] [target2] [target3] [target4]: crash.

The paper from IBM JVM developers called "Experiences with Multi-threading and
Dynamic Class Loading in a Java Just-In-Time Compiler" [CGO2006] discuss about
the issue of lazy compilation and multithreading. You can find the presentation
of the paper here:
http://www.cgo.org/cgo2006/html/progslides/session2_talk3_maier.pdf

Thanks, that's helpful. It says that fetches are atomic within "patching boundaries": "A patching boundary is an address that is aligned on the length of a data cache line on P6 microarchitecture (i.e., 32 bytes on Pentium 3 and 64 bytes on Pentium 4) [9] [10] or 8 bytes on K7 and K8 microarchitecture (determined through empirical experimentation) [1]." Of course, we'd need more testing to check that 8 bytes is still sufficient on K10, Core 2, and Nehalem. The right thing to do is probably to cross-reference our patching code to the right functions in openjdk so it's easy to just imitate them for new processors.

Delete lazy compilation. (which would simplify the JIT a fair amount)

Indeed, it would simplify things a lot, but, again, I really don't want to not
support lazy compilation. How can Unladden-Swallow afford disabling lazy
compilation? Don't you have functions that are not compiled yet and only want
to compile when they are actually used?

Like Rafael said, Unladen Swallow and Rubinius only JIT-compile hot functions, and we don't define "hot" as "called once". When a not-yet-compiled function is called from JITted code, it calls back into the interpreter. If A calls B, and A becomes hot before B, we currently have to recompile A to get a direct call to B; otherwise the call goes through the interpreter. Rubinius tries to find blocks of functions to compile all at once, so it's less vulnerable to a caller becoming hot only slightly before a callee.

Lazy compilation in the current JIT would also be a latency problem: codegen is slow, and we don't want to block the call to the lazily-compiled function while it codegens (the call can run through the interpreter instead). Instead, we want to compile in a background thread and only switch the call from the interpreter to compiled code when compilation is finished.

I think we would use a more generic thread-safe code patching facility if the JIT offered it.

llvmbot · 2009-10-26T13:41:10Z

OK, so if you're not using lazy compilation, then why bother? ;-) No seriously, I'd rather let that bug open, wait until an expert implement it, and still enable lazy compilation in llvm. Does that sound reasonable?

llvmbot · 2009-10-27T02:30:59Z

I think we should turn lazy compilation off by default as long as it's not thread-safe. Besides being potentially crashy, it has confused people trying to helgrind their apps.

I don't think the "lazily compile every function" approach is good for performance for anyone, but if vmkit or other clients are using it, I can't unilaterally delete it.

llvmbot · 2009-10-27T09:47:08Z

I think we should turn lazy compilation off by default as long as it's not
thread-safe.

Fine by me!

llvmbot · 2009-10-28T22:10:22Z

I got some feedback (thanks!), at least for AMD processors.
According to the AMD programmer manual (http://support.amd.com/us/Processor_TechDocs/24593.pdf), it is supposed to be safe to patch the naturally aligned address after the call with a store (Section 7.6.1). So, one could thus align the call before the call address accordingly with nops. Using a lock cmpxchg is not necessarily safe.

llvmbot · 2009-10-28T22:50:06Z

Thanks Torvald! Specifically, I think you're referring to "Synchronization for cross-
modifying code is not required for code that resides within the naturally aligned quadword." which is a really nice guarantee.

I can't find the wording that implies that a lock cmpxchg is not necessarily safe; where do you see that? I don't think we need it, but I'd like to get all the restrictions recorded here if possible.

If anyone else knows the guarantees for powerpc, arm, etc. please link to them here too.

llvmbot · 2009-11-09T13:54:52Z

Jeffrey, I guess that the natural alignment is the necessary condition, not the cmpxchg. But I don't have a citation for this, sorry.

…cb5b8e141c06fa9d4dc288 [lldb] Add an option for playground transformation "high performance" mode

llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 3, 2021

augusto2112 pushed a commit to augusto2112/llvm-project that referenced this issue Nov 10, 2022

Merge pull request llvm#5556 from apple/🍒/rebranch/a83966d65be024d46c…

2205a3b

…cb5b8e141c06fa9d4dc288 [lldb] Add an option for playground transformation "high performance" mode

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Lazy compilation is not thread-safe #5556

Lazy compilation is not thread-safe #5556

llvmbot commented Oct 14, 2009

llvmbot commented Oct 23, 2009

llvmbot commented Oct 23, 2009

llvmbot commented Oct 24, 2009

llvmbot commented Oct 24, 2009

llvmbot commented Oct 25, 2009

llvmbot commented Oct 26, 2009

llvmbot commented Oct 27, 2009

llvmbot commented Oct 27, 2009

llvmbot commented Oct 28, 2009

llvmbot commented Oct 28, 2009

llvmbot commented Nov 9, 2009

Lazy compilation is not thread-safe #5556

Lazy compilation is not thread-safe #5556

Comments

llvmbot commented Oct 14, 2009

Extended Description

llvmbot commented Oct 23, 2009

llvmbot commented Oct 23, 2009

llvmbot commented Oct 24, 2009

llvmbot commented Oct 24, 2009

llvmbot commented Oct 25, 2009

llvmbot commented Oct 26, 2009

llvmbot commented Oct 27, 2009

llvmbot commented Oct 27, 2009

llvmbot commented Oct 28, 2009

llvmbot commented Oct 28, 2009

llvmbot commented Nov 9, 2009