Early tail dup still not as good as duplicating indirectbr in clang #10623

llvmbot · 2011-07-02T21:44:08Z


Bugzilla Link	10251
Version	trunk
OS	All
Attachments	indirect.patch, jsinterp.ii.bz2
Reporter	LLVM Bugzilla Contributor
CC	@asl

Extended Description

Things got a lot better with the new improvements to the coalescer, but duplicating indirectbr in clang still produces better results.

In jsinterp.o, the current trunk produces:

171419 3904 0 0 175323 2acdb

And duplicating indirectbr in clang produces:

117275 3904 0 0 121179 1d95b
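
To put the two rows side by side, here is a minimal sketch that parses them and reports the relative reduction. It assumes the rows are `size`-style output whose last two fields are the decimal and hexadecimal totals; the helper names `parse_size_row` and `percent_smaller` are made up for this illustration.

```python
def parse_size_row(row):
    """Split one row into (section sizes, decimal total, hex total)."""
    *sections, dec, hex_total = row.split()
    return [int(s) for s in sections], int(dec), int(hex_total, 16)


def percent_smaller(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


trunk, trunk_dec, trunk_hex = parse_size_row("171419 3904 0 0 175323 2acdb")
dup, dup_dec, dup_hex = parse_size_row("117275 3904 0 0 121179 1d95b")

# Sanity check: the section sizes add up to both reported totals.
assert sum(trunk) == trunk_dec == trunk_hex
assert sum(dup) == dup_dec == dup_hex

print(f"first column: {percent_smaller(trunk[0], dup[0]):.1f}% smaller")
print(f"total:        {percent_smaller(trunk_dec, dup_dec):.1f}% smaller")
```

Only the first column differs between the two rows, and it shrinks by a little under a third when indirectbr is duplicated in clang.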

llvmbot · 2011-07-03T21:36:23Z

I reduced this a bit. It is still the case that taildup is causing problems for the register allocator, but this time it is not something the coalescer can help with.

The indirectgoto bb has code that looks like

....
JMP64m %vreg25, 8, %vreg26, 0, %noreg

and vreg25 is defined in a loop preheader:

%vreg25 = LEA64r %RIP, 1, %noreg, ga:@_ZZN2js9InterpretEP9JSContextPNS_10StackFrameENS_10InterpModeEE15normalJumpTable, %noreg

We duplicate indirectgoto into the preheader and then duplicate the preheader itself. This turns vreg25 into a phi where all operands are identical, but this is too late to fix it.
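
As a rough illustration of what "a phi where all operands are identical" means, here is a toy check in Python. It is not LLVM code: the phi is modeled as a plain dict from predecessor name to incoming value, the block names are invented, and `simplify_trivial_phi` is a hypothetical helper. How the real passes would compare operands is not stated above.

```python
def simplify_trivial_phi(incoming):
    """Return the common value if every incoming operand is the same, else None.

    `incoming` maps a predecessor block name to the value flowing in from it.
    """
    values = set(incoming.values())
    if len(values) == 1:
        return values.pop()
    return None


# After duplicating the preheader, every copy computes the same address
# (the predecessor names here are invented for the example).
jump_table = "LEA64r %RIP, normalJumpTable"
phi = {"pred.a": jump_table, "pred.b": jump_table, "pred.c": jump_table}

# A phi like this carries no information and could be replaced by its operand.
print(simplify_trivial_phi(phi))

# A phi with genuinely different inputs has to stay.
print(simplify_trivial_phi({"pred.a": jump_table, "pred.b": "other"}))
```

A cleanup of this kind is cheap when it runs, which is why the ordering of the passes matters here.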

Some possible fixes/improvements:

*) Teach the MachineCode taildup the same trick the IL one knows about: moving code to a common dominator instead of copying it.
*) Make early tail duplication a bit less aggressive so that we don't duplicate a loop preheader.
*) Move it earlier, so that other passes can clean up.

I will give the second option a try.
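
For readers unfamiliar with the transformation, the second option can be sketched on a toy CFG. This is a simplified model in Python, not the LLVM pass: `Block`, `make_cfg`, `tail_duplicate`, the `is_preheader` flag, and the block names are all hypothetical, and a real pass would get loop information from an analysis rather than a flag.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    name: str
    insts: list
    succs: list = field(default_factory=list)  # successor block names
    is_preheader: bool = False  # hypothetical stand-in for real loop info


def make_cfg():
    """Toy CFG: a and b enter the preheader, which falls into the dispatch block."""
    blocks = [
        Block("a", ["a_work"], ["preheader"]),
        Block("b", ["b_work"], ["preheader"]),
        Block("preheader", ["vreg25 = LEA table"], ["dispatch"], is_preheader=True),
        Block("h1", ["h1_work"], ["dispatch"]),
        Block("dispatch", ["JMP [vreg25 + 8*vreg26]"], ["h1"]),
    ]
    return {b.name: b for b in blocks}


def tail_duplicate(blocks, tail_name, skip_preheaders=True):
    """Copy `tail` into each predecessor whose only successor is `tail`.

    Returns the names of the blocks that received a copy.
    """
    tail = blocks[tail_name]
    if skip_preheaders and tail.is_preheader:
        return []  # the "less aggressive" variant: leave preheaders alone
    changed = []
    for block in blocks.values():
        if block is not tail and block.succs == [tail_name]:
            block.insts = block.insts + list(tail.insts)
            block.succs = list(tail.succs)
            changed.append(block.name)
    return changed


cfg = make_cfg()
print(tail_duplicate(cfg, "dispatch"))   # dispatch is copied into its predecessors
print(tail_duplicate(cfg, "preheader"))  # refused because of the preheader flag
```

Calling `tail_duplicate(cfg, "preheader", skip_preheaders=False)` instead copies the `vreg25` definition into both `a` and `b`, which is the toy analogue of the situation described above.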

llvmbot · 2011-07-03T21:36:58Z

testcase

llvmbot · 2011-07-04T03:34:56Z

patch I will benchmark

llvmbot · 2011-07-04T06:42:21Z

The previous patch helped with Firefox's jsinterp.o, but it hurt WebKit's interpreter, which probably benefits from the extra tail duplication.

Trying the "move it earlier" option.

llvmbot · 2011-07-04T08:00:21Z

On size at least we are getting close. With indirectbr duplicated in clang __TEXT is 119536 and without it is 126656 (after 134372).

There is still some performance difference with gcc. I am running the benchmarks again, but duplicating indirectbr in clang used to produce the best results.

One difference I noticed is that trunk produces:

leaq (%r14,%rax,8), %r8

....
jmpq *(%r8)

while duplicating indirectbr in clang produces:

jmpq *(%rdi,%r9,8)

which suggests that the tail duplication should be done at the IL level or at least codegenprepare needs to be more aggressive. The code for indirectgoto looks like

%indirect.goto.dest.in = phi i8** ...
%indirect.goto.dest = load i8** %indirect.goto.dest.in, align 8
indirectbr i8* %indirect.goto.dest, [...]

Of the 230 arguments of the indirect.goto.dest.in phi, only one is not a getelementptr.

llvmbot · 2011-07-04T08:02:26Z

master.bc

llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 3, 2021
