Lazy compilation is not thread-safe #5556
Comments
Besides the "how to update a call instruction" problem, lazy JITting can race with concurrent modifications of any IR in the same LLVMContext unless you're careful to hold the JIT lock around such modifications. This should be documented. |
This looks a lot like the way DSOs are handled when compiling with -fPIC. You can probably expect the same kind of overhead. It is interesting to note that when using a DSO that was compiled without -fPIC (on x86 at least), the resolution is done at load time.
I'm missing some knowledge: the write is a 4-byte write, right? Why is that not atomic on x86? If the store of a call instruction is not atomic, at least we should be able to store an instruction that does an infinite loop until the update is complete. The paper from IBM JVM developers, "Experiences with Multi-threading and Dynamic Class Loading in a Java Just-In-Time Compiler" [CGO2006], discusses the issue of lazy compilation and multithreading. You can find the presentation of the paper here:
The infinite loop instruction is a possible way of avoiding this.
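For concreteness, a rough sketch (not LLVM's code) of what the self-loop patch could look like on x86, assuming a 5-byte call rel32 whose first two bytes do not straddle a patching boundary and a writable code page; real code would also need the serialization protocol from the manuals discussed below:

    #include <atomic>
    #include <cstdint>
    #include <cstring>

    // Sketch only: patch a 5-byte x86 "call rel32" at `site` to call `target`,
    // keeping concurrent executors spinning on a 2-byte self-loop (EB FE)
    // while the displacement bytes are rewritten.
    void patchCall(uint8_t *site, const uint8_t *target) {
      // rel32 is measured from the end of the 5-byte instruction.
      int32_t rel = static_cast<int32_t>(target - (site + 5));

      auto *head = reinterpret_cast<std::atomic<uint16_t> *>(site);

      // 1. Park concurrent threads: overwrite the opcode and first
      //    displacement byte with "jmp $" (bytes EB FE, value 0xFEEB on a
      //    little-endian machine).
      head->store(0xFEEB, std::memory_order_release);

      // 2. Rewrite the last three displacement bytes; nothing can execute
      //    them while the self-loop is in place.
      std::memcpy(site + 2, reinterpret_cast<const uint8_t *>(&rel) + 1, 3);

      // 3. Restore the call opcode (E8) together with the low displacement
      //    byte in a single 2-byte store, releasing the parked threads.
      uint16_t restored = 0x00E8 | static_cast<uint16_t>(rel & 0xFF) << 8;
      head->store(restored, std::memory_order_release);
    }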
I really don't want this :)
Indeed, it would simplify things a lot, but, again, I really don't want to drop support for lazy compilation. How can Unladen Swallow afford disabling lazy compilation? Don't you have functions that are not compiled yet and that you only want to compile when they are actually used?
If I understand it correctly, only hot parts are translated from Python bytecode to LLVM IR.
With threading, the question you have to ask is, "why is that atomic?" Absent guarantees from the Intel and AMD architecture manuals, or exhaustive testing on lots of different platforms, assuming something is atomic sets you up to get paged at 3 in the morning. For example, the "Intel® 64 and IA-32 Architectures Software Developer's Manual" (http://www.intel.com/Assets/PDF/manual/253668.pdf), section 8.2.3.1 "Assumptions, Terminology, and Notation", says that aligned 1-, 2-, 4-, and 8-byte reads and writes, and arbitrary locked instructions, appear to execute as a single memory access, but that "Other instructions may be implemented with multiple memory accesses." It doesn't mention instruction fetches, which makes me think they can appear to be implemented with multiple memory accesses too. The JVMs have done exhaustive testing, so I personally trust copying them, but that does set us up for a maintenance burden as new chips come out.
This doesn't necessarily help. Say the instruction looks like [call] [target1] [target2] [target3] [target4], and another thread is concurrently executing this code while we patch it. There are three problems:
Thanks, that's helpful. It says that fetches are atomic within "patching boundaries": "A patching boundary is an address that is aligned on the length of a data cache line on P6 microarchitecture (i.e., 32 bytes on Pentium 3 and 64 bytes on Pentium 4) [9] [10] or 8 bytes on K7 and K8 microarchitecture (determined through empirical experimentation) [1]." Of course, we'd need more testing to check that 8 bytes is still sufficient on K10, Core 2, and Nehalem. The right thing to do is probably to cross-reference our patching code to the right functions in openjdk so it's easy to just imitate them for new processors.
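As a trivial illustration, a check like the following could gate whether a call site is patchable in place; the 8-byte default comes from the K7/K8 observation quoted above, and any other value would need its own evidence:

    #include <cstddef>
    #include <cstdint>

    // Returns true if all `len` patched bytes fall inside one
    // patching-boundary window of size `boundary`.
    bool withinPatchingBoundary(const void *site, size_t len,
                                size_t boundary = 8) {
      auto addr = reinterpret_cast<uintptr_t>(site);
      return addr / boundary == (addr + len - 1) / boundary;
    }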
Like Rafael said, Unladen Swallow and Rubinius only JIT-compile hot functions, and we don't define "hot" as "called once". When a not-yet-compiled function is called from JITted code, it calls back into the interpreter. If A calls B, and A becomes hot before B, we currently have to recompile A to get a direct call to B; otherwise the call goes through the interpreter. Rubinius tries to find blocks of functions to compile all at once, so it's less vulnerable to a caller becoming hot only slightly before a callee.

Lazy compilation in the current JIT would also be a latency problem: codegen is slow, and we don't want to block the call to the lazily-compiled function while it codegens (the call can run through the interpreter instead). Instead, we want to compile in a background thread and only switch the call from the interpreter to compiled code when compilation is finished, as sketched below. I think we would use a more generic thread-safe code patching facility if the JIT offered it.
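A sketch of that scheme, with invented names throughout: callers keep going through the interpreter, codegen runs on a background thread, and a single atomic entry-point store does the switch, so no instruction bytes are ever rewritten.

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>

    struct Function;
    using Entry = void (*)(Function &);

    struct Function {
      std::atomic<Entry> entry;  // what a call site invokes
      // ... bytecode, call counters, etc. ...
    };

    void interpret(Function &) { std::puts("interpreting"); }
    void compiledCode(Function &) { std::puts("running native code"); }

    void compileInBackground(Function &f) {
      std::thread([&f] {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));  // "codegen"
        f.entry.store(compiledCode, std::memory_order_release);      // the flip
      }).detach();
    }

    int main() {
      Function f{{interpret}};
      compileInBackground(f);
      for (int i = 0; i < 5; ++i) {
        f.entry.load(std::memory_order_acquire)(f);  // callers never block
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
      }
    }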
OK, so if you're not using lazy compilation, then why bother? ;-) No, seriously: I'd rather leave this bug open, wait until an expert implements it, and keep lazy compilation enabled in LLVM. Does that sound reasonable?
I think we should turn lazy compilation off by default as long as it's not thread-safe. Besides being potentially crashy, it has confused people trying to helgrind their apps. I don't think the "lazily compile every function" approach is good for performance for anyone, but if vmkit or other clients are using it, I can't unilaterally delete it.
Fine by me!
I got some feedback (thanks!), at least for AMD processors.
Thanks Torvald! Specifically, I think you're referring to "Synchronization for cross-…". I can't find the wording that implies that a lock cmpxchg is not necessarily safe; where do you see that? I don't think we need it, but I'd like to get all the restrictions recorded here if possible. If anyone else knows the guarantees for PowerPC, ARM, etc., please link to them here too.
Jeffrey, I guess that the natural alignment is the necessary condition, not the cmpxchg. But I don't have a citation for this, sorry.
Extended Description
Lazy compilation is implemented roughly as:
    call @compilationCallback

    define @compilationCallback() {
      ...
      store %new_function_address, (%return_address - 4)
    }
which turns the "call @compilationCallback" into a "call @real_function". Unfortunately, while that store into the return address is running, another thread could be executing the call. At least on the x86, there are no guarantees about what can happen in this case. (Data reads and writes are defined, but not instruction execution.) It's entirely possible that the store won't be seen as atomic and will result in a wild jump.
I see three ways to fix this. One is to make each lazily-compiled call site an indirect call:
    %target = atomic load @callsite_target_address
    call %target
and then at the end of compilation atomically store the real function back to @callsite_target_address. This causes every callsite to be an indirect call, which is likely to slow things down a lot, so there should be some way to mark the functions where it's used. On platforms that don't even have atomic pointer writes, this could be implemented by acquiring a lock around every lazily-compilable call.
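A sketch of that indirect-call approach in C++ rather than IR, with invented names: each lazily-compiled call site loads its target from an atomic slot, and the compilation callback publishes the real function with one atomic pointer store.

    #include <atomic>
    #include <cstdio>

    using Fn = void (*)();

    void realFunction() { std::puts("compiled code"); }

    // Plays the role of @callsite_target_address above.
    std::atomic<Fn> callsiteTarget;

    void compilationCallback() {
      // ... run codegen for the real function here ...
      // Publish the result; callers that loaded the old value just take the
      // slow path once, and no instruction bytes are rewritten.
      callsiteTarget.store(realFunction, std::memory_order_release);
      realFunction();
    }

    void lazilyCompiledCall() {
      callsiteTarget.load(std::memory_order_acquire)();  // indirect call
    }

    int main() {
      callsiteTarget.store(compilationCallback);
      lazilyCompiledCall();  // first call compiles, later calls go direct
      lazilyCompiledCall();
    }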