The tail recursion elimination pass can produce functions with allocas not in the entry block, if it wants them to live across iterations of the tail-recursion-eliminated code. A simple example of the produced code is: void %foo(bool %c) { br label %Next Next: %A = alloca int %B = alloca int %C = alloca int %D = alloca int store int 1, int* %A store int 1, int* %B store int 1, int* %C store int 1, int* %D call void %test(int* %A, int* %B, int* %C, int* %D) br bool %c, label %Next, label %Out Out: ret void } declare void %test(int*,int*,int*,int*) It does this on the assumption that a block of allocas will be compiled into marginally efficient code, on the order of performance as a normal prolog. However, this isn't the case. The code above compiles into: LBB1_1: ;Next mr r2, r1 addi r3, r2, -16 mr r1, r3 mr r7, r1 addi r4, r7, -16 mr r1, r4 mr r8, r1 addi r5, r8, -16 mr r1, r5 mr r9, r1 addi r6, r9, -16 mr r1, r6 li r10, 1 stw r10, -16(r2) stw r10, -16(r7) stw r10, -16(r8) stw r10, -16(r9) rlwinm r29, r30, 0, 31, 31 addi r1, r1, -64 bl L_test$stub cmplwi cr0, r29, 0 addi r1, r1, 64 bne cr0, LBB1_1 ;Next At the top, that is a ton of back-to-back copies into the stack pointer and out. On X86, doing this sort of thing quickly causes spill code to be generated. It would be far better to see these fixed size dynamic allocas and merge them together in the code generator, doing a single stack adjustment and then treating the suballocas as offsets from the base pointer. -Chris
I found a link to this PR in TRE and I found it amusing so I started looking. This is an updated IR dump which works with 4.0 (and trunk) define void @patatino(i1 %c) { entry: br label %next next: %A = alloca i32 %B = alloca i32 %C = alloca i32 %D = alloca i32 store i32 1, i32* %A store i32 1, i32* %B store i32 1, i32* %C store i32 1, i32* %D call void @test(i32* %A, i32* %B, i32* %C, i32* %D) br i1 %c, label %next, label %out out: ret void } declare void @test(i32*, i32*, i32*, i32*)
What does that codegen into?
patatino: [...] .p2align 4, 0x90 .LBB0_1: # %next # =>This Inner Loop Header: Depth=1 movq %rsp, %r8 leaq -16(%r8), %rdi movq %rdi, %rsp movq %rsp, %r9 leaq -16(%r9), %rsi movq %rsi, %rsp movq %rsp, %rax leaq -16(%rax), %rdx movq %rdx, %rsp movq %rsp, %rbx leaq -16(%rbx), %rcx movq %rcx, %rsp movl $1, -16(%r8) movl $1, -16(%r9) movl $1, -16(%rax) movl $1, -16(%rbx) callq test testb $1, %r14b jne .LBB0_1 # BB#2: # %out leaq -16(%rbp), %rsp popq %rbx popq %r14 popq %rbp retq .Lfunc_end0: .size patatino, .Lfunc_end0-patatino This is just running `llc` on that IR. I note that if I run `-O3` on the input IR only loop unswitch triggers, i.e. the loop body is duplicated (but the codegen is the same for each BB as expected, except it's duplicated). I guess some `mov` could still be eliminated but it's not clear to me whether we should prevent TRE in case of dynamic allocas? FWIW, I see TRE kicking in not very often, and many things changed since this PR was opened, including the new SROA. I assume you don't have the original code that triggered this anymore :(
I don't have the original code. Here's a trick to generate test cases: just change the code that detects this to dump out the function name then abort. With that compiler, just build the test suite. Any test case the compiler crashes on is one you want to look at :-) The generated assembly shows the issue: those are multiple dynamic allocations back to back. This is bad for a number of reasons, not least of which is that frame pointer elimination is defeated, and that every time through the loop, more stack space is allocated. Compared to 2006, LLVM now has the stack save/restore intrinsics which would help at least some of these issues. -Chris