Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Early tail dup can cause problems to register allocator #10468

Open
llvmbot opened this issue Jun 7, 2011 · 23 comments
Open

Early tail dup can cause problems to register allocator #10468

llvmbot opened this issue Jun 7, 2011 · 23 comments
Labels
bugzilla Issues migrated from bugzilla llvm:regalloc

Comments

@llvmbot
Copy link
Collaborator

llvmbot commented Jun 7, 2011

Bugzilla Link 10096
Resolution FIXED
Resolved on Jun 30, 2011 20:12
Version trunk
OS Linux
Attachments master2.bc
Reporter LLVM Bugzilla Contributor
CC @atrick,@stoklund

Extended Description

I am reporting a bug to keep track of the recent discussions on llvm-commits.

With the attached bitcode file:

$ time ./bin/llc -disable-early-taildup ~/llvm/master2.bc -o master.o -tail-dup-size=8 -filetype=obj

real 0m1.667s
user 0m1.640s
sys 0m0.024s

$ ls -l master.o
-rw-rw-r--. 1 espindola espindola 126256 Jun 7 01:19 master.o

$ time ./bin/llc ~/llvm/master2.bc -o master.o -filetype=obj

real 0m24.317s
user 0m24.204s
sys 0m0.063s
$ ls -l master.o
-rw-rw-r--. 1 espindola espindola 675912 Jun 7 01:20 master.o

@llvmbot
Copy link
Collaborator Author

llvmbot commented Jun 7, 2011

I tried building all of firefox with

-Xclang -mllvm -Xclang -disable-early-taildup -Xclang -mllvm -Xclang -tail-dup-size=8

The results are amazing. With -O3 -g (as before) jsinterp.o builds in 0m32.132s
(from 63s) and is 1490840 bytes (from 1876288 or 168448 with the clang patch).

A dromaeo run gets 1689.14runs/s (Total), which is the fastest run on this laptop so far :-)

I am tempted to propose shifting a bit of responsibility from the early tail dup to the one that runs after ra. The early one would only duplicate small (2 instructions, no calls, no rets) blocks. The last one would be the one responsible for the more aggressive optimizations like duplicating blocks with indirect gotos.

@stoklund
Copy link
Mannequin

stoklund mannequin commented Jun 7, 2011

We are duplicating the indirectbr block to help the branch predictor. It may be a bigger deal on ARM cores, but Intel cores should benefit too.

In this case, something goes horribly wrong with coalescing and register pressure explodes.

If we can fix that, I would expect even better results with the duplicated indirectbr block.

@llvmbot
Copy link
Collaborator Author

llvmbot commented Jun 7, 2011

Not that the indirectbr was duplicated, just that it was duplicated in the second pass.

@llvmbot
Copy link
Collaborator Author

llvmbot commented Jun 8, 2011

patch I am trying

@llvmbot
Copy link
Collaborator Author

llvmbot commented Jun 9, 2011

I found some problems in the tail dup pass by stress testing it with a max tail size of 8 and checking on. I am currently working on those.

@llvmbot
Copy link
Collaborator Author

llvmbot commented Jun 9, 2011

Updated patch.
This one survived a bootstrap with a dup limit of 8. Will clean it up and commit the refactoring bits.

@llvmbot
Copy link
Collaborator Author

llvmbot commented Jun 9, 2011

rebased patch

@llvmbot
Copy link
Collaborator Author

llvmbot commented Jun 10, 2011

remaining parts
I have committed the refactoring and fixes bits. Back to benchmarking..

@llvmbot
Copy link
Collaborator Author

llvmbot commented Jun 25, 2011

I implemented an IL level indirectbr duplicator to try to isolate the problem. The two attached IL files are very similar, test5.ll having one more instruction duplicated.

The log files are the output of llc testN.ll -relocation-model=pic -disable-early-taildup -print-machineinstrs.

Note how the outputs are mostly in sync until we reach the register allocator.

The end result is that test4.o has a 323 byte __TEXT and test5.o has a 565 __TEXT.

@llvmbot
Copy link
Collaborator Author

llvmbot commented Jun 25, 2011

test4.ll

@llvmbot
Copy link
Collaborator Author

llvmbot commented Jun 25, 2011

test5.ll

@llvmbot
Copy link
Collaborator Author

llvmbot commented Jun 25, 2011

test4.log

@llvmbot
Copy link
Collaborator Author

llvmbot commented Jun 25, 2011

test5.log

@llvmbot
Copy link
Collaborator Author

llvmbot commented Jun 26, 2011

The problem is with coalescing failing to merge some of the copies created by phielim. I am debugging why.

@llvmbot
Copy link
Collaborator Author

llvmbot commented Jun 26, 2011

test4.phielim
Output just after coalescing.

@llvmbot
Copy link
Collaborator Author

llvmbot commented Jun 26, 2011

test5.phielim
Output just after coalescing.

@llvmbot
Copy link
Collaborator Author

llvmbot commented Jun 27, 2011

The problem is that in coalescing when trying to decide if we can combine the live ranges of A and B, we only keep track of copies between A and B. Because of this we conclude that vreg40 and vreg42 cannot be merged when we see:

%vreg40<def> = COPY %vreg45
%vreg42<def> = COPY %vreg45

I have a fairly ad hoc patch that checks for both registers being defined with copies from a common third one. It fixes the reduced test case, but it looks like I am not updating live ranges correctly...(fails an assert in make check)

Perhaps we could solve this problem by reverting the algorithm order: Start by assuming that all candidates can be merged, and start splitting the groups until they can be split no further. This should handle cyclic cases like

%vreg40<def> = COPY %vregX
%vreg42<def> = COPY %vregY

and vregX and vregY can be coalesced but we fail to notice because of

%vregX<def> = COPY %vreg40
%vregY<def> = COPY %vreg42

@llvmbot
Copy link
Collaborator Author

llvmbot commented Jun 27, 2011

test4.ll

@llvmbot
Copy link
Collaborator Author

llvmbot commented Jun 27, 2011

test5.ll

@llvmbot
Copy link
Collaborator Author

llvmbot commented Jun 27, 2011

test4.phielim

@llvmbot
Copy link
Collaborator Author

llvmbot commented Jun 27, 2011

test5.phielim

@stoklund
Copy link
Mannequin

stoklund mannequin commented Jun 27, 2011

This is a tricky problem. The live ranges must still be in SSA form after coalescing. That is why you can't just merge ranges, even when you know they have the same value. Normally, we merge the overlapping value numbers into one, but that only works when one def dominates the other.

@llvmbot
Copy link
Collaborator Author

llvmbot commented Jun 27, 2011

This is after phi elimination, so by SSA you mean that each definition must dominate all its uses, right?

I think this would still be fine, since we can rewrite

%vreg40<def> = COPY %vreg45
%vreg42<def> = COPY %vreg45

as

%vreg40<def> = COPY %vreg45
%vreg42<def> = COPY %vreg40

and vreg40 dominates its new use. Since now vreg42 is defined in terms of vreg40, the existing logic works and we merge the two.

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugzilla Issues migrated from bugzilla llvm:regalloc
Projects
None yet
Development

No branches or pull requests

1 participant