962 – Code generator compiles back-to-back fixed-size dynamic allocas into really bad code

LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filled before November 26, 2021. Use github to submit LLVM bugs

Bug 962 - Code generator compiles back-to-back fixed-size dynamic allocas into really bad code

Summary: Code generator compiles back-to-back fixed-size dynamic allocas into really b...

Status:	NEW

Alias:	None

Product:	libraries
Classification:	Unclassified
Component:	Common Code Generator Code (show other bugs)
Version:	1.8
Hardware:	All All

Importance:	P enhancement
Assignee:	Unassigned LLVM Bugs

URL:
Keywords:	code-quality

Depends on:
Blocks:

Reported:	2006-10-22 13:20 PDT by Chris Lattner
Modified:	2018-02-17 09:48 PST (History)
CC List:	3 users (show)

See Also:
Fixed By Commit(s):

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Chris Lattner 2006-10-22 13:20:59 PDT

The tail recursion elimination pass can produce functions with allocas not in the entry block, if it wants 
them to live across iterations of the tail-recursion-eliminated code.  A simple example of the produced 
code is:

void %foo(bool %c) {
        br label %Next
Next:
        %A = alloca int
        %B = alloca int
        %C = alloca int
        %D = alloca int
        store int 1, int* %A
        store int 1, int* %B
        store int 1, int* %C
        store int 1, int* %D
        call void %test(int* %A, int* %B, int* %C, int* %D)
        br bool %c, label %Next, label %Out
Out:
        ret void
}
declare void %test(int*,int*,int*,int*)


It does this on the assumption that a block of allocas will be compiled into marginally efficient code, on 
the order of performance as a normal prolog.  However, this isn't the case.  The code above compiles 
into:

LBB1_1: ;Next
        mr r2, r1
        addi r3, r2, -16
        mr r1, r3
        mr r7, r1
        addi r4, r7, -16
        mr r1, r4
        mr r8, r1
        addi r5, r8, -16
        mr r1, r5
        mr r9, r1
        addi r6, r9, -16
        mr r1, r6
        li r10, 1
        stw r10, -16(r2)
        stw r10, -16(r7)
        stw r10, -16(r8)
        stw r10, -16(r9)
        rlwinm r29, r30, 0, 31, 31
        addi r1, r1, -64
        bl L_test$stub
        cmplwi cr0, r29, 0
        addi r1, r1, 64
        bne cr0, LBB1_1 ;Next

At the top, that is a ton of back-to-back copies into the stack pointer and out.  On X86, doing this sort 
of thing quickly causes spill code to be generated.

It would be far better to see these fixed size dynamic allocas and merge them together in the code 
generator, doing a single stack adjustment and then treating the suballocas as offsets from the base 
pointer.

-Chris

Comment 1 Davide Italiano 2017-07-18 08:53:13 PDT

I found a link to this PR in TRE and I found it amusing so I started looking.
This is an updated IR dump which works with 4.0 (and trunk)

define void @patatino(i1 %c) {
entry:
  br label %next

next:
  %A = alloca i32
  %B = alloca i32
  %C = alloca i32
  %D = alloca i32
  store i32 1, i32* %A
  store i32 1, i32* %B
  store i32 1, i32* %C
  store i32 1, i32* %D
  call void @test(i32* %A, i32* %B, i32* %C, i32* %D)
  br i1 %c, label %next, label %out

out:
  ret void
}

declare void @test(i32*, i32*, i32*, i32*)

Comment 2 Chris Lattner 2017-07-18 18:42:28 PDT

What does that codegen into?

Comment 3 Davide Italiano 2017-07-18 18:59:03 PDT

patatino:
[...]
        .p2align        4, 0x90
.LBB0_1:                                # %next
                                        # =>This Inner Loop Header: Depth=1
        movq    %rsp, %r8
        leaq    -16(%r8), %rdi
        movq    %rdi, %rsp
        movq    %rsp, %r9
        leaq    -16(%r9), %rsi
        movq    %rsi, %rsp
        movq    %rsp, %rax
        leaq    -16(%rax), %rdx
        movq    %rdx, %rsp
        movq    %rsp, %rbx
        leaq    -16(%rbx), %rcx
        movq    %rcx, %rsp
        movl    $1, -16(%r8)
        movl    $1, -16(%r9)
        movl    $1, -16(%rax)
        movl    $1, -16(%rbx)
        callq   test
        testb   $1, %r14b
        jne     .LBB0_1
# BB#2:                                 # %out
        leaq    -16(%rbp), %rsp
        popq    %rbx
        popq    %r14
        popq    %rbp
        retq
.Lfunc_end0:
        .size   patatino, .Lfunc_end0-patatino


This is just running `llc` on that IR. I note that if I run `-O3` on the input IR only loop unswitch triggers, i.e. the loop body is duplicated (but the codegen is the same for each BB as expected, except it's duplicated).

I guess some `mov` could still be eliminated but it's not clear to me whether we should prevent TRE in case of dynamic allocas? FWIW, I see TRE kicking in not very often, and many things changed since this PR was opened, including the new SROA.

I assume you don't have the original code that triggered this anymore :(

Comment 4 Chris Lattner 2017-07-18 22:12:33 PDT

I don't have the original code.  Here's a trick to generate test cases: just change the code that detects this to dump out the function name then abort. With that compiler, just build the test suite.  Any test case the compiler crashes on is one you want to look at :-)


The generated assembly shows the issue: those are multiple dynamic allocations back to back.  This is bad for a number of reasons, not least of which is that frame pointer elimination is defeated, and that every time through the loop, more stack space is allocated.

Compared to 2006, LLVM now has the stack save/restore intrinsics which would help at least some of these issues.

-Chris