New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
inline asm "rm" constraint lowered "m" when "r" would be preferable #20571
Comments
Agreed. |
*** Bug #34837 has been marked as a duplicate of this bug. *** |
Would not it be better to at least treat 'mr' as 'r' rather than 'm'? This will probably yield better code in many cases. |
*** Bug #36931 has been marked as a duplicate of this bug. *** |
*** Bug #30873 has been marked as a duplicate of this bug. *** |
"mr" isn't being treated as "m". Consider this C function:
With "clang -O3 -m32", it compiles into this mess:
But if I use "m" instead of "mr", then it compiles into what I wanted:
So the presence of "r" is somehow making the codegen worse even though it's putting the value in memory anyway. |
*** Bug #46874 has been marked as a duplicate of this bug. *** |
[Copy-n-paste of my harebrained idea here] The major issue with supporting multiple constraints is how we model those constraints until register allocation is complete. (Thank you, Capt. Obvious!) The decision of which constraint to use is made during DAG selection. So there's no (easy) way to change this during register allocation. Half-baked idea (all very hand-wavy): Different constraints use different set ups (pre-asm code) and tear downs (post-asm code) for the inline asm. What if we created pseudo-instructions to represent inline asm set up and tear down? Something like this: INLINEASM_SETUP <representing register/memory setup> The register allocator could then try different constraints (going from most restrictive to least restrictive) until it find one that works. One drawback is that the RA needs to process INLINEASM before it can generate the correct code for INLINEASM_SETUP. That might be doable if the three instructions are treated as a single unit. |
*** Bug #48750 has been marked as a duplicate of this bug. *** |
mentioned in issue llvm/llvm-bugzilla-archive#31525 |
mentioned in issue llvm/llvm-bugzilla-archive#35489 |
mentioned in issue llvm/llvm-bugzilla-archive#37583 |
mentioned in issue #4440 |
mentioned in issue llvm/llvm-bugzilla-archive#47530 |
mentioned in issue llvm/llvm-bugzilla-archive#49406 |
I was thinking of this issue while reading |
We need to talk about your reading habits. :-P |
@llvm/issue-subscribers-backend-x86 |
At the llvm dev conf '22 @topperc or @compnerd or @arsenm mentioned that perhaps we could add a new register flag (akin to EDIT: I think @topperc was referring to |
Ah, looks like we haven't made any progress on this since my last report above 9mo ago. Still an issue though. |
So I have something hacked up that is slightly working; it's not ready to be published today but maybe by end of next week I'll have something to demonstrate feasibilty. The idea is:
Basically, it seems that RA doesn't know how to spill INLINEASM (and INLINEASM_BR) register operands. We need to be able to:
Right now, selectiondag picks one of the last two, lacking the ability to express the first, since RA lacks the machinery from 2 below.
There's probably more I need to do for greedy RA to update the live range info (I don't fully understand LiveRangeEdit quite yet). Right now my hacky code is failing to clean up the previously used virtreg which leads to an assertion failure down the line; I hope to have that fixed by EOD but we'll see. cc @qcolombet |
This enables -regalloc=greedy to memfold spillable inline asm MachineOperands. Because no instruction selection framework marks MachineOperands as spillable, no language frontend can observe functional changes from this patch. That will change once instruction selection frameworks are updated. Link: #20571
ok, the core parts have all landed for greedy. (I noticed that Next up:
I noticed that peephole-opt folds loads; we might want to disable that for inline asm...I need to think more about that case. |
In commit b053359 ("[X86InstrInfo] support memfold on spillable inline asm (llvm#70832)"), I had a last minute fix to update the memoperands. I originally did this in the parent foldInlineAsmMemOperand call, updated the mir test via update_mir_test_checks.py, but then decided to move it to the child call of foldInlineAsmMemOperand. But I forgot to rerun update_mir_test_checks.py. That last minute change caused the same memoperand to be added twice when recursion occurred (for tied operands). I happened to get lucky that trailing content omitted from the CHECK line doesn't result in test failure. But rerunning update_mir_test_checks.py on the mir test added in that commit produces updated output. This is resulting in updates to the test that: 1. conflate additions to the test in child commits with simply updating the test as it should have been when first committed. 2. look wrong because the same memoperand is specified twice (we don't deduplicate memoperands when added). Example: INLINEASM ... :: (load (s32) from %stack.0) (load (s32) from %stack.0) Fix the bug, so that in child commits, we don't have additional unrelated test changes (which would be wrong anyways) from simply running update_mir_test_checks.py. Link: llvm#20571
In commit b053359 ("[X86InstrInfo] support memfold on spillable inline asm (#70832)"), I had a last minute fix to update the memoperands. I originally did this in the parent foldInlineAsmMemOperand call, updated the mir test via update_mir_test_checks.py, but then decided to move it to the child call of foldInlineAsmMemOperand. But I forgot to rerun update_mir_test_checks.py. That last minute change caused the same memoperand to be added twice when recursion occurred (for tied operands). I happened to get lucky that trailing content omitted from the CHECK line doesn't result in test failure. But rerunning update_mir_test_checks.py on the mir test added in that commit produces updated output. This is resulting in updates to the test that: 1. conflate additions to the test in child commits with simply updating the test as it should have been when first committed. 2. look wrong because the same memoperand is specified twice (we don't deduplicate memoperands when added). Example: INLINEASM ... :: (load (s32) from %stack.0) (load (s32) from %stack.0) Fix the bug, so that in child commits, we don't have additional unrelated test changes (which would be wrong anyways) from simply running update_mir_test_checks.py. Link: #20571
putting this down for now. If someone else wants to pick up the torch, I can push my WIP branch somewhere which may be useful as a reference (but will probably bit rot). |
Please do. |
As alluded to in llvm#20571, it would be nice if we could mutate operand lists of MachineInstr's more safely.
Are some of my branches (I have 6 locally; they're free) but those 3 appear to be the latest/most developed. Really, the issue I ran into with RegAllocFast was more so: do we try to reassign a register to a stack slot and mutate the inline asm upon discovering we're in a state of register exhaustion (Regalloc fast was not designed to do this, everything is a const reference, and I'd wind up with duplicated instructions inserted in a bunch of places), or try to spill inline asm "rm" variables proactively (shot down in code review). https://gist.github.com/nickdesaulniers/8f20bea3bcdd9fe97219428ab6e8bf8b were a handful of end-to-end tests I was using to test support for all architectures. You may find these useful. |
regallocfact can fail to allocate regs even without this patch. I think it's not supposed to be used in production, just like fast-isel (might be wrong), so another possibility would be just to revert the old (always in-memory) behavior when fast regalloc is selected. |
There is no "revert [to] the old behavior" here. ISEL needs to be changed for this to work. Then which regalloc frameworks runs after needs to be able to handle what ISEL has decided. If we don't change ISEL, we don't fix this bug. If we don't fix regallocfast, then we may fail to compile code that previously we could. |
@nickdesaulniers is there an issue with making isel "deciding" to treat "rm" as "m" (old behavior) when fast regalloc is selected? |
I don't recall where I was told this, but I recall being told in no uncertain terms that all ISEL frameworks need to support all regalloc frameworks. This M:N ISEL frameworks to regalloc frameworks is what makes fixing this bug a PITA. |
This doesn't change the support though. Although I agree that tying some isel-level option like "Treat rm constraints as m" with the selected regalloc feels a bit icky, but that's the problem with fast regalloc. It can fail and it's best-effort. Any optimization that puts something that was previously in memory into registers can potentially trigger this. The question is if running fastregalloc on code with inline asm with "rm" constraints is a common enough scenario that it needs to have so much effort spent on trying to make it better (it already works to my understanding). |
Maybe in this comment? #20571 (comment) |
I'm thinking of following this course of action:
This will take care of (almost) all uses of this feature. What about code from another front-end or hand-written LLVM IR?
|
If supporting this in RegAllocFast is too hard, I think making sure that the problematic patterns don't reach it is okay, assuming this is done in a reliable matter (making this a clang frontend decision is obviously not acceptable). Having all isel work with all regalloc is a nice goal to have, but I don't think it needs to block this feature if it's too hard to achieve. At least that's my 2c. |
While I agree in general, this is how it's done today, and there aren't a lot of alternative methods of input we are desperate to support. The inline ASM syntax is specific to GCC (and only "mostly" supported in Clang). It requires supporting only those languages which either GCC supports or which use its inline ASM syntax (in which case, "What hath God wrought?"). Given that restrictive set, we're then left with IR as the main thing we need to support. It's possible to handle this while reading the IR, and it's important for testing. But I think focusing on the front-end first will give us a greater benefit for our work. |
I haven't followed the details here in this giant thread, but
This doesn't sound too horrible to me
I don't see the issue here? The FastRA philosophy is to just spill everything all the time to avoid doing anything hard |
@MatzeB 's feedback seemed to want to avoid doing a pre-emptive scan over every instruction to find inline asm instructions (which probably don't exist in most programs). Rereading his feedback though, it sounds like he wasn't necessarily against the approach I had proposed (as I had mis-remembered). I don't recall whether I tested his suggestion, and what the results were. |
When using the inline asm constraint string "rm" (or "g"), we generally would like the compiler to choose "r", but it is permitted to choose "m" if there's register pressure. This is distinct from "r" in which the register is not permitted to be spilled to the stack. The decision of which to use must be made at some point. Currently, the instruction selection frameworks (ISELs) make the choice, and the register allocators had better be able to handle the result. Steal a bit from Storage when using register operands to disambiguate between the two cases. Add helpers/getters/setters, and print in MIR when such a register is foldable. The getter will later be used by the register allocation frameworks (and asserted by the ISELs) while the setters will be used by the instruction selection frameworks. Link: llvm/llvm-project#20571
… ops as having infinite spill weight (#70747) This is necessary for RegAllocGreedy support for memory folding inline asm that uses "rm" constraints. Thanks to @qcolombet for the suggestion. Link: llvm/llvm-project#20571
There's a gross error that happens with the current WIP code: long int __attribute__((noinline)) foo(long int y) {
long int x, z;
asm __volatile__ ("pushf\n\t"
"popq %0\n\t"
"pushq %2\n\t"
"popq %1"
: "=rm" (x), "=rm" (z)
: "rm" (y)
: "ax", "bx", "cx", "dx", "di", "si", "bp",
"r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"
);
return x;
}
int main() {
__builtin_printf("flags: 0x%08lx\n", foo(991278));
return 0;
} What should be generated is: movq %rdi, -24(%rsp)
#APP
pushfq
popq -16(%rsp)
pushq -24(%rsp)
popq -8(%rsp)
#NO_APP
movq -16(%rsp), %rax However, this is generated: movq %rdi, -32(%rsp) # 8-byte Spill
#APP # 8-byte Folded Reload
pushfq
popq -32(%rsp)
pushq -32(%rsp)
popq -24(%rsp)
#NO_APP
movq -32(%rsp), %rax # 8-byte Reload |
Extended Description
When multiple alternatives in an inline asm constraint are given we ignore all of them but the most "general". This gives nasty artifacts in the code.
The spilling is totally unnecessary. GCC gets this one right. On 32 bit x86 it's even worse:
GCC knows a better way:
The constraint "g" is just as bad, being translated into "imr" internally.
The text was updated successfully, but these errors were encountered: