17245 – MachineSink Performance Issue

LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filled before November 26, 2021. Use github to submit LLVM bugs

Bug 17245 - MachineSink Performance Issue

Summary: MachineSink Performance Issue

Status:	RESOLVED FIXED

Alias:	None

Product:	new-bugs
Classification:	Unclassified
Component:	new bugs (show other bugs)
Version:	unspecified
Hardware:	PC Linux

Importance:	P normal
Assignee:	Unassigned LLVM Bugs

URL:
Keywords:

Depends on:
Blocks:

Reported:	2013-09-14 16:42 PDT by Will Dietz
Modified:	2013-10-14 12:01 PDT (History)
CC List:	4 users (show)

See Also:
Fixed By Commit(s):

Attachments
Proposed patch to address performance issue (4.11 KB, patch) 2013-09-14 16:42 PDT, Will Dietz	Details
-time-passes output showing MachineSink performance issue (before) (17.51 KB, text/plain) 2013-09-14 16:43 PDT, Will Dietz	Details
-time-passes output showing patched MachineSink fixes compile-time performance issue (17.50 KB, text/plain) 2013-09-14 16:44 PDT, Will Dietz	Details
reduced test case demonstrating MachineSink performance issue (xz'd to attach) (685.47 KB, application/x-xz) 2013-09-14 16:46 PDT, Will Dietz	Details
Updated patch with some lit test fixes (5.78 KB, patch) 2013-09-15 16:26 PDT, Will Dietz	Details
Failing lit tests: {before/after output, debug logs, copy of failing test} (71.97 KB, application/x-xz) 2013-09-16 11:46 PDT, Will Dietz	Details
Show Obsolete (1) Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Will Dietz 2013-09-14 16:42:28 PDT

Created attachment 11197 [details]
Proposed patch to address performance issue

I'm encountering bad performance behavior of the MachineSink pass,
as shown in the attached 'before.log' (MachineSink taking ~25 minutes to execute!).

Investigation led me to create the attached patch which both avoids the n^2 behavior presently exhibited by the pass but also tries to more faithfully implement the heuristic described in the comments of the code itself.

With this patch applied, MachineSink's execution time reduces from 1494.1 seconds down to 1.8 seconds.  I've attached the patch as well as 'after.log' demonstrating the much improved timings.

Unfortunately the original IR is both large and something I'd rather not post, although I've managed to reduce it into the attached 'reduced.bc' example that has before/after timings of 10.5s/0.2s (I used bugpoint filtering on the runtime before being greater than 10s and after being less than 1 second).

Comment 1 Will Dietz 2013-09-14 16:43:13 PDT

Created attachment 11198 [details]
-time-passes output showing MachineSink performance issue (before)

Comment 2 Will Dietz 2013-09-14 16:44:11 PDT

Created attachment 11199 [details]
-time-passes output showing patched MachineSink fixes compile-time performance issue

Comment 3 Will Dietz 2013-09-14 16:46:26 PDT

Created attachment 11200 [details]
reduced test case demonstrating MachineSink performance issue (xz'd to attach)

Compressed 'reduced.bc' to satisfy bugzilla's attachment size requirement.  Apparently it's not reduced enough yet! :)

Comment 4 Will Dietz 2013-09-14 16:48:04 PDT

I've previously discussed this issue and patch on llvm-commits
(filing bug now that I have reduced test case),
see: http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20130909/187556.html .

Comment 5 Will Dietz 2013-09-15 16:26:04 PDT

Created attachment 11208 [details]
Updated patch with some lit test fixes

Looking at the resulting lit failures, I've attached an updated patch
that updates lit tests that work as expected but needed minor changes.

Unfortunately there are still failures I'm not able to resolve:

Failing Tests (4):
    LLVM :: CodeGen/ARM/2011-04-11-MachineLICMBug.ll
    LLVM :: CodeGen/ARM/2011-08-25-ldmia_ret.ll
    LLVM :: CodeGen/ARM/2012-08-30-select.ll
    LLVM :: CodeGen/X86/hoist-common.ll

Here's a run-down of these tests and why I'm not sure how to best resolve them:

* CodeGen/X86/hoist-common.ll:

Looking at the commit introducing this test, it seems BranchFolder is supposed to be hoisting the xorl.  Unfortunately it does not since the two instructions do not get pass "isIdenticalTo()".  Accordingly this test seems either invalid or is legitimately failing.  I'm not sure why it's been passing, but it's not because BranchFolder was hoisting the instruction in question.

* CodeGen/ARM/2011-04-11-MachineLICMBug.ll:

The resulting ASM in my non-expert opinion looks better here (ITT blocks instead of branches, no stack spilling of 'lr'), but I have no idea how to update this to test what was originally intended (or if that functionality is working as expected).

* CodeGen/ARM/2011-08-25-ldmia_ret.ll:

Probably best someone more familiar with ARM assembly look at this one, sorry.  Nothing seems wrong about it to me, but I definitely would benefit from some guidance on how to fix this test.  I'd be happy to provide the ASM on request, it's too much to put inline here.

* CodeGen/ARM/2012-08-30-select.ll:

I'm not sure what to make of this one, although the difference seems straightforward.  Resulting a(short) assembly reproduced below:

---Before -------------------------
        .thumb_func     _select_s_v_v
_select_s_v_v:
@ BB#0:                                 @ %entry
        vld1.8  {d16, d17}, [r1]
        tst.w   r0, #1
        it      ne
        vmovne.i32      q8, #0x0
        vmov    r0, r1, d16
        vmov    r2, r3, d17
        bx      lr
-- After --------------------------
        .thumb_func     _select_s_v_v
_select_s_v_v:
@ BB#0:                                 @ %entry
        tst.w   r0, #1
        beq     LBB0_2
@ BB#1:                                 @ %select.mid
        vmov.i32        q8, #0x0
        b       LBB0_3
LBB0_2:
        vld1.8  {d16, d17}, [r1]
LBB0_3:                                 @ %select.end
        vmov    r0, r1, d16
        vmov    r2, r3, d17
        bx      lr
-----------------------------------

My understanding is the 'it'-based variant would be better than branches on ARM, making this a regression, but a)I really am no ARM assembly expert and b)I'm unsure how this patch caused some other component to be unable to rewrite into an IT block.

Comment 6 Renato Golin 2013-09-16 04:38:24 PDT

Hi Will,

(In reply to comment #5)
> * CodeGen/ARM/2011-04-11-MachineLICMBug.ll:
> 
> The resulting ASM in my non-expert opinion looks better here (ITT blocks
> instead of branches, no stack spilling of 'lr'), but I have no idea how to
> update this to test what was originally intended (or if that functionality
> is working as expected).

IT blocks are not necessarily better than branches, and this is especially true in newer, faster, out-of-order cores. Can you attach here the produced ASM with your patch? I'll compare with trunk and see if I can spot anything.


> * CodeGen/ARM/2011-08-25-ldmia_ret.ll:
> 
> Probably best someone more familiar with ARM assembly look at this one,
> sorry.  Nothing seems wrong about it to me, but I definitely would benefit
> from some guidance on how to fix this test.  I'd be happy to provide the ASM
> on request, it's too much to put inline here.

Feel free to zip and attach to this bug.


> * CodeGen/ARM/2012-08-30-select.ll:
>
> My understanding is the 'it'-based variant would be better than branches on
> ARM, making this a regression, but a)I really am no ARM assembly expert and
> b)I'm unsure how this patch caused some other component to be unable to
> rewrite into an IT block.

I'm suspecting that your patch is breaking the IT block lowering because the pattern is kept at different MBBs, ie. some cases that were correctly sinking, aren't any more. Your heuristics seem a bit too constrained.

It's not clear if the IT version is better in this case. The load (vld1.8  {d16, d17}, [r1]) is expensive, and is happening always now, when in some cases (when r0 == 1), it'll be reset to zero, which will probably be waiting for the load to finish just to clear it. However, in your version, the number of branches to such small blocks can confuse the branch-predictor. It all depends on which core it's running, as they tend to have different costs for different features.

I'm not sure what the test is testing, so I can't tell you what's wrong there. It looks as a bug found on Darwin that just got in alongside the real change. 

Would be good if people could add 2-3 lines of comments on what the test is looking for...

Comment 7 Will Dietz 2013-09-16 11:46:00 PDT

Created attachment 11213 [details]
Failing lit tests: {before/after output, debug logs, copy of failing test}

(In reply to comment #6)

Hi Renato,

Thanks for taking a look and for your response.

I've attached the output from the failing lit tests, packaged together
to hopefully make investigating it easier.  The "orig" files are from ToT yesterday and the "ms" files are with the patched MachineSink pass.  Included in each directory are the failing test case, the resulting assembly and debug logs produced during the test for both configurations.

> Hi Will,
> 
> (In reply to comment #5)
> > * CodeGen/ARM/2011-04-11-MachineLICMBug.ll:
> > 
> > The resulting ASM in my non-expert opinion looks better here (ITT blocks
> > instead of branches, no stack spilling of 'lr'), but I have no idea how to
> > update this to test what was originally intended (or if that functionality
> > is working as expected).
> 
> IT blocks are not necessarily better than branches, and this is especially
> true in newer, faster, out-of-order cores. Can you attach here the produced
> ASM with your patch? I'll compare with trunk and see if I can spot anything.
> 

Ah, thank you for your explanation.  Didn't realize they made out-of-order ARM chips these days!  Anyway, you'll find this test with the others in the attached tarball.

> 
> > * CodeGen/ARM/2011-08-25-ldmia_ret.ll:
> > 
> > Probably best someone more familiar with ARM assembly look at this one,
> > sorry.  Nothing seems wrong about it to me, but I definitely would benefit
> > from some guidance on how to fix this test.  I'd be happy to provide the ASM
> > on request, it's too much to put inline here.
> 
> Feel free to zip and attach to this bug.
> 
> 
> > * CodeGen/ARM/2012-08-30-select.ll:
> >
> > My understanding is the 'it'-based variant would be better than branches on
> > ARM, making this a regression, but a)I really am no ARM assembly expert and
> > b)I'm unsure how this patch caused some other component to be unable to
> > rewrite into an IT block.
> 
> I'm suspecting that your patch is breaking the IT block lowering because the
> pattern is kept at different MBBs, ie. some cases that were correctly
> sinking, aren't any more. Your heuristics seem a bit too constrained.
> 

Taking a look at this some more, it seems to actually be the opposite: we're now sinking in cases we didn't previously and that (apparently) prevents the IT block formation.  This can be seen in the before/after debug logs for this test case (search for "Machine Sinking").  That said, there should be cases where we no longer sink when we would previously but I don't believe any of these failures are due to that.

> It's not clear if the IT version is better in this case. The load (vld1.8 
> {d16, d17}, [r1]) is expensive, and is happening always now, when in some
> cases (when r0 == 1), it'll be reset to zero, which will probably be waiting
> for the load to finish just to clear it. However, in your version, the
> number of branches to such small blocks can confuse the branch-predictor. It
> all depends on which core it's running, as they tend to have different costs
> for different features.

Thank you for the analysis and details, much appreciated.  So is the result to perform some manner of testing on ARM (LNT?) on popular architectures to find out?  Or is it likely a wash overall?  Commit and wait for performance regression reports from iOS developers? O:)

> 
> I'm not sure what the test is testing, so I can't tell you what's wrong
> there. It looks as a bug found on Darwin that just got in alongside the real
> change. 
> 
> Would be good if people could add 2-3 lines of comments on what the test is
> looking for...

I think it's simply trying to verify that vector operations are considered for IT instruction generation? But yes comments would help.  We can always ask the author... (looks like Nadav Rotem) :).

Comment 8 Will Dietz 2013-09-16 12:05:42 PDT

Since I was curious, I went ahead and ran before/after experiments with LNT on x86-64.
The results are (temporarily) available here:

Before: http://wdtz.org:8000/db_default/v4/nts/3
After: http://wdtz.org:8000/db_default/v4/nts/4

Comparison:

http://wdtz.org:8000/db_default/v4/nts/4?show_previous=yes&show_stddev=yes&show_all=yes&show_all_samples=yes&num_comparison_runs=10&test_filter=&test_min_value_filter=&aggregation_fn=min&compare_to=3&submit=Update

I'm new to LNT, so not sure what can be concluded but offhand it seems the impact on both compilation and execution performance is minor and likely well within the noise of the experiment (although min of 3 samples are being compared).  Only executions with tiny duration showed regression.

FWIW tests were run on a calm system (running only sshd and dhclient) with ASLR/power saving/frequency scaling disabled (I've previously invested time in making this machine produce stable numbers) with the following LNT invocation:

lnt-ve/bin/lnt runtest nt --sandbox ~/lnt-ve --cc ~/llvm/install/bin/clang --cxx ~/llvm/install/bin/clang++ --test-suite ~/src/llvm-test-suite DEMO-default  --large --build-threads=4 --multisample=3

Unfortunately don't have access to ARM hardware to produce similar results, and I think it's on ARM that this would likely have a greater impact (if any).

Anyway, thought I'd share JIC the results are useful.

Comment 9 Renato Golin 2013-09-17 05:19:55 PDT

(In reply to comment #8)
> I'm new to LNT, so not sure what can be concluded but offhand it seems the
> impact on both compilation and execution performance is minor and likely
> well within the noise of the experiment (although min of 3 samples are being
> compared).  Only executions with tiny duration showed regression.

You shouldn't trust LNT performance results (yet), as they're not proper benchmarks, just timed execution of largely unaltered code. The amount of noise is so great that even 5% regressions would wash away.


> Unfortunately don't have access to ARM hardware to produce similar results,
> and I think it's on ARM that this would likely have a greater impact (if
> any).

If the change is sound (Andrew seems to agree it is), and the differences in code generated don't produce broken code, I think we can wait for the ARM LNT buildbot to report regressions over time:

http://llvm.org/perf/db_default/v4/nts/machine/10

If it doesn't break straight on, I'd wait a few iterations to check on the regression cases as well as the improvements and see if the new "visual moving average" is off up or down, so than we can look at the specific test.

But don't waste too much of your time with that. Until we're benchmarking it properly, there is little point in trying to improve performance on LNT.

Comment 10 Will Dietz 2013-10-14 12:01:06 PDT

Fixed in r192608, thanks!