LLVM does a poor job of rematerializing address offsets which will fold into the addressing mode #22604
Comments
We could also solve this in CodeGenPrep by sinking the GEPs into the blocks where they are used. Manually doing that IR transform and invoking llc gives me good code:

$ cat t.ll
%struct.A = type { i32, i32, i32, i32 }

; Function Attrs: noreturn

for.cond:                               ; preds = %for.cond.backedge, %entry

if.then:                                ; preds = %for.cond

if.then4:                               ; preds = %for.cond

for.cond.backedge:                      ; preds = %if.then4, %if.then12, %if.then8, %if.then, %for.cond

if.then8:                               ; preds = %for.cond

if.then12:                              ; preds = %for.cond

declare void @"\01?assign@@YAXPAI@Z"(i32*) #1

attributes #0 = { noreturn "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }

!llvm.ident = !{!0}
!0 = !{!"clang version 3.6.0 "}

$ llc t.ll -o -
BB#0:                                   # %entry

LBB0_5:                                 # %for.cond.backedge

BB#2:                                   # %for.cond

LBB0_3:                                 # %if.then
Sounds like a job for lazy code motion, if we had it.
I actually don't know what that is, but I mentioned this because I recall that the CUDA guys needed this and implemented it in a target-specific IR pass.
The CodeGenPrepare approach is definitely the way to go here. In fact, I wonder why it is not already doing what we want, since we should have all the logic there to handle this.

Just to give some information on why rematerialization in the register allocator wouldn't help: we would have to rematerialize LEAs. At best, an LEA uses only one register and defines one. I.e., unless the register it uses is a reserved register like SP, we would still need to carry the value of that register over to the related uses (addressing mode or not). Thus, that does not help the allocation problem, and we still have to spill some value.

To summarize, handling this through rematerialization would mean trading a potential spill-size improvement against potentially more expensive computation. Indeed, we trade a reload of the spilled LEA against a reload + LEA.

As a side note, the register allocator currently only handles "simple" rematerialization, i.e., instructions that do not have any uses, like loads of immediates.
Relaying some IRC discussions here...
So, CGP doesn't work well for a more interesting reason. CGP correctly handles the case where we have a load or store that we can fold the GEP into an addressing mode for. That's not the case we have: we have an address operand to a call.

The only reason it makes sense to not spill these is that computing the values and putting them in registers with LEA takes exactly the same resources as loading them from the stack, even if the load is free (same addressing-unit usage). So this really seems like a case where we have an operation that can produce a value as cheaply as or more cheaply than reloading that value, with no 'folding' or other factors. The only reason it makes sense to do this on each loop iteration is that we ran out of registers to hold the precomputed values. So I think this really needs to be more analogous to remat.
This is a great example. I think there are two key reasons why this makes sense:
Put another way: if there is only one of these in the loop, we trade a reload of the spilled LEA against a reload + LEA, but when there are 2 entries in the loop, we trade 2 reloads against 1 reload + 2 LEAs. That makes the regalloc problem simpler in a meaningful way, I think. I'm not sure if "2" or "3" should be the magical number; we can use a reasonable cutoff here, but essentially at some point reducing the number of reloads should be worth the additional operations. This is especially true of an LEA which uses no more of the addressing unit than a reload would (base + offset), as that instruction seems strictly no more expensive than the reload itself. Does that make more sense to you, Quentin?
I wrote a comment here this morning along the lines of what Chandler said. Basically, I agree it's a remat problem, because we don't have a memory access to fold into here.
Thanks Chandler for the update here!
Ah right, I should have looked into the IR and spared the investigation on that aspect. CodeGenPrepare does, indeed, perform this folding for addressing modes only.
I am not sure I follow here. In that case the LEA and the reload do not take the same resources, at least from the register allocator point of view. The LEA still needs a non-free input register (rsi), whereas the reload does not (rbp).
Well, you still need the input value of the LEA to be available, which may not be the case where you want to rematerialize.
Wait, I guess the code feeding the register allocator looks something like the following:

loop:

I.e., the uses of the base are fine; it is the uses of its users that are not.
But those are all before the point where we exceed the register pressure. Anyway, the way the register allocator works is that it tries to split the live-ranges to relax the constraints, then spills if that did not work. When you get to the spill point, you try to produce the shortest possible live-ranges; thus, you would insert the reload right before the uses. There are several problems here:
For #2, we may be able to rely on some existing TableGen annotation, like isAsCheapAsAMove, but I wonder if that would be enough. #5 is a particularly fundamental change for the allocator, and anyway, I do not think it should go in that direction (at least with our current algorithm). My point is that it really looks like we would need some kind of global code motion here, and in particular, I do not think this should be handled as rematerialization. Let me go back to the example to illustrate why. Because of this change, the register pressure problem may just vanish.
In my opinion, this is a code motion/global scheduling problem, not a rematerialization problem. Moreover, I do not think the register allocator, as it exists, is the right place to fix this. Just to clarify, my definitions for general rematerialization and code motion are:
Arguably, with some stretch of the definition of rematerialization, we end up with a code motion definition. However, going in that direction does not seem to fit the design of the current register allocator. The bottom line is that, for a long-term solution, I think it would be more beneficial to work toward a global scheduler with register-pressure awareness (or a register allocator with scheduling awareness) than to fix this in the current register allocator. How important is this use case?
Extended Description
A certain code pattern, where a function mostly consists of a large loop around a switch or if-else chain, results in large live ranges for basic address computations. The register allocator fails to rematerialize these address computations into the addressing modes of memory operands and instead spills and reloads them from the stack, dramatically increasing stack usage and generally causing badness.
Small reproduction:
struct A {
unsigned a;
unsigned b;
unsigned c;
unsigned d;
};
void assign(unsigned *val);
void f(unsigned char *input, A *a) {
for (;;) {
unsigned char t = (++input)[0];
if (t == 0) {
assign(&a->a);
} else if (t == 1) {
assign(&a->b);
} else if (t == 2) {
assign(&a->c);
} else if (t == 3) {
assign(&a->d);
}
}
}
Resulting in:
__Z1fPhP1A: ## @_Z1fPhP1A
.cfi_startproc
BB#0: ## %entry
pushq %rbp
Ltmp0:
.cfi_def_cfa_offset 16
Ltmp1:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp2:
.cfi_def_cfa_register %rbp
pushq %r15
pushq %r14
pushq %r13
pushq %r12
pushq %rbx
subq $24, %rsp
Ltmp3:
.cfi_offset %rbx, -56
Ltmp4:
.cfi_offset %r12, -48
Ltmp5:
.cfi_offset %r13, -40
Ltmp6:
.cfi_offset %r14, -32
Ltmp7:
.cfi_offset %r15, -24
movq %rdi, %rbx
leaq 4(%rsi), %rax
movq %rax, -48(%rbp) ## 8-byte Spill
leaq 8(%rsi), %rax
movq %rax, -56(%rbp) ## 8-byte Spill
leaq 12(%rsi), %rax
movq %rax, -64(%rbp) ## 8-byte Spill
leaq 16(%rsi), %r14
leaq 20(%rsi), %r15
movq %rsi, %r13
incq %rbx
leaq LJTI0_0(%rip), %r12
jmp LBB0_1
[.....]