
inline asm "rm" constraint lowered "m" when "r" would be preferable #20571

Open
d0k opened this issue Jul 3, 2014 · 61 comments
Labels
bugzilla Issues migrated from bugzilla confirmed Verified by a second party inline-asm llvm:codegen

Comments

@d0k (Member) commented Jul 3, 2014

Bugzilla Link 20197
Version trunk
OS Linux
Blocks #4440
CC @legrosbuffle,@echristo,@isanbard,@josephcsible,@nickdesaulniers,@zygoloid,@tstellar

Extended Description

When multiple alternatives are given in an inline asm constraint, we ignore all but the most "general" one. This produces nasty artifacts in the code.

int bsr(unsigned v) {
  int ret;
  __asm__("bsr %1, %0" : "=&r"(ret) : "rm"(v) : "cc");
  return ret;
}

$ clang -O3 -S -o - t.c
bsr:
	movl	%edi, -4(%rsp)
	#APP
	bsrl	-4(%rsp), %eax
	#NO_APP
	retq

The spilling is totally unnecessary. GCC gets this one right. On 32 bit x86 it's even worse:

$ clang -O3 -S -o - t.c -m32
bsr:
	pushl	%eax
	movl	8(%esp), %eax
	movl	%eax, (%esp)
	#APP
	bsrl	(%esp), %eax
	#NO_APP
	popl	%edx
	retl

GCC knows a better way:

$ gcc-4.8 -O3 -S -o - t.c -m32
bsr:
#APP
	bsr 4(%esp), %eax
#NO_APP
	ret

The constraint "g" is just as bad, being translated into "imr" internally.

@echristo (Contributor) commented Jul 3, 2014

Agreed.

@llvmbot (Collaborator) commented Mar 20, 2018

*** Bug #34837 has been marked as a duplicate of this bug. ***

@llvmbot (Collaborator) commented Mar 20, 2018

Wouldn't it be better to at least treat 'mr' as 'r' rather than 'm'? That would probably yield better code in many cases.

@llvmbot (Collaborator) commented Jun 1, 2018

*** Bug #36931 has been marked as a duplicate of this bug. ***

@llvmbot (Collaborator) commented Jun 1, 2018

*** Bug #30873 has been marked as a duplicate of this bug. ***

@josephcsible commented Jun 6, 2020

Wouldn't it be better to at least treat 'mr' as 'r' rather than 'm'? That would probably yield better code in many cases.

"mr" isn't being treated as "m". Consider this C function:

void f(int x) {
    asm volatile("# x is in %0" :: "mr"(x));
}

With "clang -O3 -m32", it compiles into this mess:

f:
        pushl   %eax
        movl    8(%esp), %eax
        movl    %eax, (%esp)
        # x is in (%esp)
        popl    %eax
        retl

But if I use "m" instead of "mr", then it compiles into what I wanted:

f:
        # x is in 4(%esp)
        retl

So the presence of "r" is somehow making the codegen worse even though it's putting the value in memory anyway.

@efriedma-quic (Collaborator) commented Sep 15, 2020

*** Bug #46874 has been marked as a duplicate of this bug. ***

@isanbard (Contributor) commented

[Copy-n-paste of my harebrained idea here]

The major issue with supporting multiple constraints is how we model those constraints until register allocation is complete. (Thank you, Capt. Obvious!) The decision of which constraint to use is made during DAG selection. So there's no (easy) way to change this during register allocation.

Half-baked idea (all very hand-wavy):

Different constraints use different setup code (pre-asm) and teardown code (post-asm) for the inline asm. What if we created pseudo-instructions to represent inline asm setup and teardown? Something like this:

INLINEASM_SETUP <representing register/memory setup>
INLINEASM <...>
INLINEASM_TEARDOWN <representing copy of results into vregs/pregs>

The register allocator could then try different constraints (going from most restrictive to least restrictive) until it finds one that works.

One drawback is that the RA needs to process INLINEASM before it can generate the correct code for INLINEASM_SETUP. That might be doable if the three instructions are treated as a single unit.

@efriedma-quic (Collaborator) commented Mar 3, 2021

*** Bug #48750 has been marked as a duplicate of this bug. ***

@llvmbot (Collaborator) commented Nov 26, 2021

mentioned in issue llvm/llvm-bugzilla-archive#31525

@llvmbot (Collaborator) commented Nov 27, 2021

mentioned in issue llvm/llvm-bugzilla-archive#35489

@llvmbot (Collaborator) commented Nov 27, 2021

mentioned in issue llvm/llvm-bugzilla-archive#37583

@edwintorok (Contributor) commented

mentioned in issue #4440

@efriedma-quic (Collaborator) commented

mentioned in issue llvm/llvm-bugzilla-archive#47530

@efriedma-quic (Collaborator) commented

mentioned in issue llvm/llvm-bugzilla-archive#49406

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 9, 2021
@nickdesaulniers (Member) commented Dec 13, 2021

I was thinking of this issue while reading ChooseConstraint in llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp.

@bwendling (Collaborator) commented

I was thinking of this issue while reading ChooseConstraint in llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp.

We need to talk about your reading habits. :-P

@llvmbot (Collaborator) commented Aug 2, 2022

@llvm/issue-subscribers-backend-x86

@nickdesaulniers nickdesaulniers changed the title Inline asm constraint alternatives ignored inline asm "rm" constraint lowered "m" when "r" would be preferable Aug 3, 2022
@nickdesaulniers (Member) commented Nov 28, 2022

At the llvm dev conf '22 @topperc or @compnerd or @arsenm mentioned that perhaps we could add a new register flag (akin to early-clobber, def, or kill) and then another pass ("memfold"? or possibly just regalloc) could try to DTRT.

EDIT: I think @topperc was referring to TargetInstrInfo::foldMemoryOperand or InlineSpiller::foldMemoryOperand.

@nickdesaulniers (Member) commented

Ah, looks like we haven't made any progress on this since my last report above 9mo ago. Still an issue though.
https://lore.kernel.org/lkml/CAHk-=whVvD05T0yD5DQj803uETLD6qDq-Vx-SiLPcrL=eO77LQ@mail.gmail.com/

@Endilll Endilll added the confirmed Verified by a second party label Aug 30, 2023
@nickdesaulniers (Member) commented Aug 30, 2023

So I have something hacked up that is partially working; it's not ready to be published today, but maybe by the end of next week I'll have something to demonstrate feasibility.

The idea is:

  1. have SelectionDAG's ChooseConstraint be more aggressive, but add a new InlineAsm::Kind_* value between Kind_Register and Kind_Mem called Kind_SpillableRegister (or whatever), which behaves as Kind_Register in all cases except the one described below; this is what SelectionDAG will set.
  2. have the greedy register allocator check for these kinds of operands when register exhaustion is imminent; if any exist, spill before the inline asm, reload after the inline asm, and update the inline asm to use the newly created stack slot.

Basically, it seems that RA doesn't know how to spill INLINEASM (and INLINEASM_BR) register operands. We need to be able to:

  1. signal to RA the difference between 3 cases:
  • "the constraint string for this operand allows the operand to be spilled" (i.e. "rm")
  • "the constraint string for this operand does not allow the operand to be spilled" (i.e. "r")
  • "the constraint string for this operand does not allow the operand to exist in a register" (i.e. "m")

Right now, SelectionDAG picks one of the last two, lacking the ability to express the first, since RA lacks the machinery from 2 below.

  2. the transformation machinery for INLINEASM:
  • insert spill before INLINEASM
  • insert reload after INLINEASM (or reloads after every indirect branch target of INLINEASM_BR)
  • transform operand from reguse to mem
  • transform operand from virtreg to stack slot

There's probably more I need to do for greedy RA to update the live range info (I don't fully understand LiveRangeEdit quite yet). Right now my hacky code is failing to clean up the previously used virtreg which leads to an assertion failure down the line; I hope to have that fixed by EOD but we'll see.
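The transformation machinery in step 2 above might look roughly like this, in MIR-flavored pseudocode (the operand names, stack slot, and exact syntax are illustrative, not real MIR output):

```
; before RA: %v carries an "rm" operand, but registers are exhausted here
INLINEASM &"bsr $1, $0" ..., def %out, reguse %v

; after the rewrite: spill around the asm, operand becomes a stack slot
ST %v -> %stack.0                 ; spill inserted before INLINEASM
INLINEASM &"bsr $1, $0" ..., def %out, mem %stack.0 :: (load (s32) from %stack.0)
%v = LD %stack.0                  ; reload inserted after INLINEASM
                                  ; (one reload per indirect target for INLINEASM_BR)
```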

cc @qcolombet

nickdesaulniers added a commit that referenced this issue Nov 29, 2023
This enables -regalloc=greedy to memfold spillable inline asm
MachineOperands.

Because no instruction selection framework marks MachineOperands as
spillable, no language frontend can observe functional changes from this
patch. That will change once instruction selection frameworks are
updated.

Link: #20571
@nickdesaulniers (Member) commented Dec 1, 2023

ok, the core parts have all landed for greedy.

(I noticed that peephole-opt will fold loads BEFORE reaching any register allocator, but not any stores).

Next up:

  • regallocfast
  • instruction selectors
  • m68k support (maybe)
  • asm goto
    • opt-phis bug related to asm goto

I noticed that peephole-opt folds loads; we might want to disable that for inline asm...I need to think more about that case.

nickdesaulniers added a commit to nickdesaulniers/llvm-project that referenced this issue Dec 1, 2023
In commit b053359 ("[X86InstrInfo] support memfold on spillable inline asm
(llvm#70832)"), I had a last minute fix to update the memoperands.  I originally
did this in the parent foldInlineAsmMemOperand call, updated the mir test via
update_mir_test_checks.py, but then decided to move it to the child call of
foldInlineAsmMemOperand.

But I forgot to rerun update_mir_test_checks.py. That last minute change caused
the same memoperand to be added twice when recursion occurred (for tied
operands). I happened to get lucky that trailing content omitted from the
CHECK line doesn't result in test failure.

But rerunning update_mir_test_checks.py on the mir test added in that commit
produces updated output. This is resulting in updates to the test that:
1. conflate additions to the test in child commits with simply updating the
   test as it should have been when first committed.
2. look wrong because the same memoperand is specified twice (we don't
   deduplicate memoperands when added). Example:

    INLINEASM ... :: (load (s32) from %stack.0) (load (s32) from %stack.0)

Fix the bug, so that in child commits, we don't have additional unrelated test
changes (which would be wrong anyways) from simply running
update_mir_test_checks.py.

Link: llvm#20571
nickdesaulniers added a commit that referenced this issue Dec 4, 2023
In commit b053359 ("[X86InstrInfo] support memfold on spillable inline asm (#70832)"), I had a last minute fix to update the memoperands. I originally did this in the parent foldInlineAsmMemOperand call, updated the mir test via update_mir_test_checks.py, but then decided to move it to the child call of foldInlineAsmMemOperand.

But I forgot to rerun update_mir_test_checks.py. That last minute change caused the same memoperand to be added twice when recursion occurred (for tied operands). I happened to get lucky that trailing content omitted from the CHECK line doesn't result in test failure.

But rerunning update_mir_test_checks.py on the mir test added in that commit produces updated output. This is resulting in updates to the test that:
1. conflate additions to the test in child commits with simply updating the test as it should have been when first committed.
2. look wrong because the same memoperand is specified twice (we don't deduplicate memoperands when added). Example:

    INLINEASM ... :: (load (s32) from %stack.0) (load (s32) from %stack.0)

Fix the bug, so that in child commits, we don't have additional unrelated test changes (which would be wrong anyways) from simply running update_mir_test_checks.py.

Link: #20571
@nickdesaulniers nickdesaulniers removed their assignment Apr 9, 2024
@nickdesaulniers (Member) commented

Putting this down for now. If someone else wants to pick up the torch, I can push my WIP branch somewhere; it may be useful as a reference (but will probably bit-rot).

@bwendling (Collaborator) commented

I can push my WIP branch somewhere which may be useful as a reference (but will probably bit rot).

Please do.

nickdesaulniers added a commit to nickdesaulniers/llvm-project that referenced this issue Apr 10, 2024
As alluded to in llvm#20571, it would be nice if we could mutate operand
lists of MachineInstr's more safely.
@nickdesaulniers (Member) commented

Here are some of my branches (I have 6 locally; they're free), but those 3 appear to be the latest/most developed. Really, the issue I ran into with RegAllocFast was: do we try to reassign a register to a stack slot and mutate the inline asm upon discovering we're in a state of register exhaustion (RegAllocFast was not designed to do this; everything is a const reference, and I'd wind up with duplicated instructions inserted in a bunch of places), or do we try to spill inline asm "rm" variables proactively (shot down in code review)?

https://gist.github.com/nickdesaulniers/8f20bea3bcdd9fe97219428ab6e8bf8b were a handful of end-to-end tests I was using to test support for all architectures. You may find these useful.

@danilaml (Collaborator) commented

regallocfast can fail to allocate registers even without this patch. I think it's not supposed to be used in production, just like fast-isel (I might be wrong), so another possibility would be to just revert to the old (always in-memory) behavior when the fast register allocator is selected.

@nickdesaulniers (Member) commented

There is no "revert [to] the old behavior" here. ISEL needs to be changed for this to work. Then whichever regalloc framework runs afterwards needs to be able to handle what ISEL has decided. If we don't change ISEL, we don't fix this bug. If we don't fix regallocfast, then we may fail to compile code that we previously could.

@danilaml (Collaborator) commented

@nickdesaulniers is there an issue with making isel "decide" to treat "rm" as "m" (the old behavior) when the fast regalloc is selected?

@nickdesaulniers (Member) commented

I don't recall where I was told this, but I recall being told in no uncertain terms that all ISEL frameworks need to support all regalloc frameworks. This M:N ISEL frameworks to regalloc frameworks is what makes fixing this bug a PITA.

@danilaml (Collaborator) commented

This doesn't change the support, though. I agree that tying an isel-level option like "treat rm constraints as m" to the selected regalloc feels a bit icky, but that's the problem with the fast regalloc: it can fail, and it's best-effort. Any optimization that puts something that was previously in memory into registers can potentially trigger this. The question is whether running regallocfast on code with inline asm "rm" constraints is a common enough scenario to warrant so much effort on trying to make it better (it already works, to my understanding).

@bwendling (Collaborator) commented

I don't recall where I was told this, but I recall being told in no uncertain terms that all ISEL frameworks need to support all regalloc frameworks. This M:N ISEL frameworks to regalloc frameworks is what makes fixing this bug a PITA.

Maybe in this comment? #20571 (comment)

@bwendling (Collaborator) commented

I'm thinking of following this course of action:

  1. Have Clang generate a memory constraint for 'rm' (the current behavior) for non-greedy register allocators.
  2. Have Clang generate a foldable 'r' constraint for the greedy allocator.

This will take care of (almost) all uses of this feature.

What about code from another front-end or hand-written LLVM IR?

  1. Fuck that shit....
  2. Okay, if we MUST, we could modify the inline asm instruction before code-gen.

@nikic (Contributor) commented Apr 16, 2024

If supporting this in RegAllocFast is too hard, I think making sure that the problematic patterns don't reach it is okay, assuming this is done in a reliable manner (making this a clang frontend decision is obviously not acceptable). Having all isel work with all regalloc is a nice goal to have, but I don't think it needs to block this feature if it's too hard to achieve. At least that's my 2c.

@bwendling (Collaborator) commented

making this a clang frontend decision is obviously not acceptable

While I agree in general, this is how it's done today, and there aren't a lot of alternative methods of input we are desperate to support. The inline ASM syntax is specific to GCC (and only "mostly" supported in Clang). It requires supporting only those languages which either GCC supports or which use its inline ASM syntax (in which case, "What hath God wrought?"). Given that restrictive set, we're then left with IR as the main thing we need to support. It's possible to handle this while reading the IR, and it's important for testing. But I think focusing on the front-end first will give us a greater benefit for our work.

@arsenm (Contributor) commented Apr 17, 2024

I haven't followed the details here in this giant thread, but

Okay, if we MUST, we could modify the inline asm instruction before code-gen.

This doesn't sound too horrible to me

Are some of my branches (I have 6 locally; they're free) but those 3 appear to be the latest/most developed. Really, the issue I ran into with RegAllocFast was more so: do we try to reassign a register to a stack slot and mutate the inline asm upon discovering we're in a state of register exhaustion (Regalloc fast was not designed to do this, everything is a const reference, and I'd wind up with duplicated instructions inserted in a bunch of places), or try to spill inline asm "rm" variables proactively (shot down in code review).

I don't see the issue here? The FastRA philosophy is to just spill everything all the time to avoid doing anything hard

@nickdesaulniers (Member) commented

Are some of my branches (I have 6 locally; they're free) but those 3 appear to be the latest/most developed. Really, the issue I ran into with RegAllocFast was more so: do we try to reassign a register to a stack slot and mutate the inline asm upon discovering we're in a state of register exhaustion (Regalloc fast was not designed to do this, everything is a const reference, and I'd wind up with duplicated instructions inserted in a bunch of places), or try to spill inline asm "rm" variables proactively (shot down in code review).

I don't see the issue here? The FastRA philosophy is to just spill everything all the time to avoid doing anything hard

@MatzeB's feedback seemed to favor avoiding a preemptive scan over every instruction to find inline asm instructions (which probably don't exist in most programs).

Rereading his feedback though, it sounds like he wasn't necessarily against the approach I had proposed (as I had mis-remembered). I don't recall whether I tested his suggestion, and what the results were.

@arsenm (Contributor) commented Apr 17, 2024

@MatzeB 's feedback seemed to want to avoid doing a pre-emptive scan over every instruction to find inline asm instructions (which probably don't exist in most programs).

MachineFunction already has hasInlineAsm to skip the walk over the function. Could even refine this if really needed

qihangkong pushed a commit to rvgpu/llvm that referenced this issue Apr 18, 2024
When using the inline asm constraint string "rm" (or "g"), we generally
would like the compiler to choose "r", but it is permitted to choose "m"
if there's register pressure. This is distinct from "r" in which the
register is not permitted to be spilled to the stack.

The decision of which to use must be made at some point.  Currently, the
instruction selection frameworks (ISELs) make the choice, and the
register allocators had better be able to handle the result.

Steal a bit from Storage when using register operands to disambiguate
between the two cases.  Add helpers/getters/setters, and print in MIR
when such a register is foldable.

The getter will later be used by the register allocation frameworks (and
asserted by the ISELs) while the setters will be used by the instruction
selection frameworks.

Link: llvm/llvm-project#20571
qihangkong pushed a commit to rvgpu/llvm that referenced this issue Apr 18, 2024
… ops as having infinite spill weight (#70747)

This is necessary for RegAllocGreedy support for memory folding inline
asm that uses "rm" constraints.

Thanks to @qcolombet for the suggestion.

Link: llvm/llvm-project#20571
@bwendling (Collaborator) commented

There's a gross error that happens with the current WIP code:

long int __attribute__((noinline)) foo(long int y) {
  long int x, z;

  asm __volatile__ ("pushf\n\t"
                    "popq %0\n\t"
                    "pushq %2\n\t"
                    "popq %1"
                    : "=rm" (x), "=rm" (z)
                    : "rm" (y)
                    : "ax", "bx", "cx", "dx", "di", "si", "bp",
                      "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"
                    );
  return x;
}

int main() {
  __builtin_printf("flags: 0x%08lx\n", foo(991278));
  return 0;
}

What should be generated is:

	movq	%rdi, -24(%rsp)
	#APP
	pushfq
	popq	-16(%rsp)
	pushq	-24(%rsp)
	popq	-8(%rsp)
	#NO_APP
	movq	-16(%rsp), %rax

However, this is generated:

	movq	%rdi, -32(%rsp)                 # 8-byte Spill
	#APP                                    # 8-byte Folded Reload
	pushfq
	popq	-32(%rsp)
	pushq	-32(%rsp)
	popq	-24(%rsp)
	#NO_APP
	movq	-32(%rsp), %rax                 # 8-byte Reload
