MachineSink Performance Issue #17619

Closed
llvmbot opened this issue Sep 14, 2013 · 7 comments
Labels
bugzilla Issues migrated from bugzilla

Comments

@llvmbot
Collaborator

llvmbot commented Sep 14, 2013

Bugzilla Link 17245
Resolution FIXED
Resolved on Oct 14, 2013 12:01
Version unspecified
OS Linux
Attachments:
  • Proposed patch to address performance issue
  • -time-passes output showing MachineSink performance issue (before)
  • -time-passes output showing patched MachineSink fixes compile-time performance issue
  • reduced test case demonstrating MachineSink performance issue (xz'd to attach)
Reporter LLVM Bugzilla Contributor
CC @echristo,@rengolin

Extended Description

I'm encountering bad performance behavior of the MachineSink pass,
as shown in the attached 'before.log' (MachineSink taking ~25 minutes to execute!).

Investigation led me to create the attached patch, which both avoids the n^2 behavior presently exhibited by the pass and tries to more faithfully implement the heuristic described in the comments of the code itself.

With this patch applied, MachineSink's execution time reduces from 1494.1 seconds down to 1.8 seconds. I've attached the patch as well as 'after.log' demonstrating the much improved timings.

Unfortunately the original IR is both large and something I'd rather not post, although I've managed to reduce it into the attached 'reduced.bc' example, which has before/after timings of 10.5s/0.2s (I used bugpoint, filtering on the before-patch runtime being greater than 10s and the after-patch runtime being less than 1 second).
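For illustration only, here's a minimal sketch (with invented names and rules, not the actual patch or LLVM's API) of the general shape of such an n^2 problem: a per-candidate query that rescans the rest of the block, next to the linear alternative that precomputes the answers in a single pass.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Toy model: position p defines value defs[p] and reads value uses[p]
// (-1 = reads nothing). A def is "sinkable" if no later position reads it.

// O(n^2): every candidate rescans the tail of the block.
int countSinkableQuadratic(const std::vector<int> &defs,
                           const std::vector<int> &uses) {
  int sinkable = 0;
  for (std::size_t i = 0; i < defs.size(); ++i) {
    bool usedLater = false;
    for (std::size_t j = i + 1; j < uses.size(); ++j)
      if (uses[j] == defs[i]) { usedLater = true; break; }
    if (!usedLater)
      ++sinkable;
  }
  return sinkable;
}

// O(n): one pass records the last position each value is read; every
// candidate then answers its query in O(1).
int countSinkableLinear(const std::vector<int> &defs,
                        const std::vector<int> &uses) {
  std::unordered_map<int, std::size_t> lastUse;
  for (std::size_t j = 0; j < uses.size(); ++j)
    if (uses[j] >= 0)
      lastUse[uses[j]] = j;
  int sinkable = 0;
  for (std::size_t i = 0; i < defs.size(); ++i) {
    auto it = lastUse.find(defs[i]);
    if (it == lastUse.end() || it->second <= i)
      ++sinkable;
  }
  return sinkable;
}
```

The two functions agree on every input; only the asymptotic cost differs, which is why this kind of fix can take a pass from ~25 minutes down to seconds without changing its decisions.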

@llvmbot
Collaborator Author

llvmbot commented Sep 14, 2013

I've previously discussed this issue and patch on llvm-commits
(filing bug now that I have reduced test case),
see: http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20130909/187556.html .

@llvmbot
Collaborator Author

llvmbot commented Sep 15, 2013

Updated patch with some lit test fixes
Looking at the resulting lit failures, I've attached an updated patch that fixes the lit tests that still work as expected but needed minor changes.

Unfortunately there are still failures I'm not able to resolve:

Failing Tests (4):
LLVM :: CodeGen/ARM/2011-04-11-MachineLICMBug.ll
LLVM :: CodeGen/ARM/2011-08-25-ldmia_ret.ll
LLVM :: CodeGen/ARM/2012-08-30-select.ll
LLVM :: CodeGen/X86/hoist-common.ll

Here's a run-down of these tests and why I'm not sure how to best resolve them:

  • CodeGen/X86/hoist-common.ll:

Looking at the commit that introduced this test, it seems BranchFolder is supposed to be hoisting the xorl. Unfortunately it does not, since the two instructions do not pass "isIdenticalTo()". Accordingly, this test seems either invalid or legitimately failing. I'm not sure why it's been passing, but it's not because BranchFolder was hoisting the instruction in question.

  • CodeGen/ARM/2011-04-11-MachineLICMBug.ll:

The resulting ASM in my non-expert opinion looks better here (ITT blocks instead of branches, no stack spilling of 'lr'), but I have no idea how to update this to test what was originally intended (or if that functionality is working as expected).

  • CodeGen/ARM/2011-08-25-ldmia_ret.ll:

Probably best someone more familiar with ARM assembly look at this one, sorry. Nothing seems wrong about it to me, but I definitely would benefit from some guidance on how to fix this test. I'd be happy to provide the ASM on request, it's too much to put inline here.

  • CodeGen/ARM/2012-08-30-select.ll:

I'm not sure what to make of this one, although the difference seems straightforward. The resulting (short) assembly is reproduced below:

---Before -------------------------
.thumb_func _select_s_v_v
_select_s_v_v:
@ BB#0: @ %entry
vld1.8 {d16, d17}, [r1]
tst.w r0, #1
it ne
vmovne.i32 q8, #0x0
vmov r0, r1, d16
vmov r2, r3, d17
bx lr
-- After --------------------------
.thumb_func _select_s_v_v
_select_s_v_v:
@ BB#0: @ %entry
tst.w r0, #1
beq LBB0_2
@ BB#1: @ %select.mid
vmov.i32 q8, #0x0
b LBB0_3
LBB0_2:
vld1.8 {d16, d17}, [r1]
LBB0_3: @ %select.end
vmov r0, r1, d16
vmov r2, r3, d17
bx lr

My understanding is the 'it'-based variant would be better than branches on ARM, making this a regression, but a) I really am no ARM assembly expert, and b) I'm unsure how this patch caused some other component to be unable to rewrite this into an IT block.
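As a side note on the hoist-common failure above: hoisting a common instruction out of sibling blocks is only attempted when the two copies compare identical, so any pass that leaves one copy in a different form can silently disable the hoist. A toy sketch (invented types, not BranchFolder's real data structures or LLVM's actual isIdenticalTo()):

```cpp
#include <cassert>
#include <string>

// Toy "instruction" with an isIdenticalTo-style comparison.
struct Inst {
  std::string opcode;
  int dst, src;
  bool isIdenticalTo(const Inst &other) const {
    return opcode == other.opcode && dst == other.dst && src == other.src;
  }
};

// Hoisting from two sibling successors into their common predecessor is
// legal in this model only when both copies match exactly; a mismatch
// produces no error, just different codegen (the hoist quietly stops).
bool canHoistCommon(const Inst &left, const Inst &right) {
  return left.isIdenticalTo(right);
}
```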

@rengolin
Member

Hi Will,

>   • CodeGen/ARM/2011-04-11-MachineLICMBug.ll:
>
> The resulting ASM in my non-expert opinion looks better here (ITT blocks
> instead of branches, no stack spilling of 'lr'), but I have no idea how to
> update this to test what was originally intended (or if that functionality
> is working as expected).

IT blocks are not necessarily better than branches, and this is especially true in newer, faster, out-of-order cores. Can you attach here the produced ASM with your patch? I'll compare with trunk and see if I can spot anything.

>   • CodeGen/ARM/2011-08-25-ldmia_ret.ll:
>
> Probably best someone more familiar with ARM assembly look at this one,
> sorry. Nothing seems wrong about it to me, but I definitely would benefit
> from some guidance on how to fix this test. I'd be happy to provide the ASM
> on request, it's too much to put inline here.

Feel free to zip and attach to this bug.

>   • CodeGen/ARM/2012-08-30-select.ll:
>
> My understanding is the 'it'-based variant would be better than branches on
> ARM, making this a regression, but a)I really am no ARM assembly expert and
> b)I'm unsure how this patch caused some other component to be unable to
> rewrite into an IT block.

I suspect that your patch is breaking the IT block lowering because the pattern ends up split across different MBBs, i.e. some cases that were correctly sinking aren't any more. Your heuristics seem a bit too constrained.

It's not clear if the IT version is better in this case. The load (vld1.8 {d16, d17}, [r1]) is expensive and now always happens, even though in some cases (when r0 == 1) the result will just be reset to zero, which probably means waiting for the load to finish only to clear it. However, in your version, the number of branches to such small blocks can confuse the branch predictor. It all depends on which core it's running on, as cores tend to have different costs for different features.
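The tradeoff described above can be put into a back-of-the-envelope cost model (all numbers and the formula here are invented for illustration; real cores differ):

```cpp
#include <cassert>

// Predicated (IT) version: both arms execute every time, results selected.
double predicatedCost(double loadCost, double otherCost) {
  return loadCost + otherCost;
}

// Branch version: only one arm executes, but we pay for the branch itself
// plus an expected misprediction penalty.
double branchedCost(double loadCost, double otherCost, double probLoadPath,
                    double branchCost, double mispredictRate,
                    double mispredictPenalty) {
  double expectedArm =
      probLoadPath * loadCost + (1.0 - probLoadPath) * otherCost;
  return expectedArm + branchCost + mispredictRate * mispredictPenalty;
}
```

With a cheap, well-predicted branch the branchy version wins; crank up the misprediction rate and the always-executed load becomes the cheaper option, which is why the answer is core-dependent.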

I'm not sure what the test is testing, so I can't tell you what's wrong there. It looks like a bug found on Darwin that just got in alongside the real change.

It would be good if people added 2-3 lines of comments on what each test is looking for...

@llvmbot
Collaborator Author

llvmbot commented Sep 16, 2013

Failing lit tests: {before/after output, debug logs, copy of failing test}

Hi Renato,

Thanks for taking a look and for your response.

I've attached the output from the failing lit tests, packaged together
to hopefully make investigating it easier. The "orig" files are from ToT yesterday and the "ms" files are with the patched MachineSink pass. Included in each directory are the failing test case, the resulting assembly and debug logs produced during the test for both configurations.

> Hi Will,
>
>>   • CodeGen/ARM/2011-04-11-MachineLICMBug.ll:
>>
>> The resulting ASM in my non-expert opinion looks better here (ITT blocks
>> instead of branches, no stack spilling of 'lr'), but I have no idea how to
>> update this to test what was originally intended (or if that functionality
>> is working as expected).
>
> IT blocks are not necessarily better than branches, and this is especially
> true in newer, faster, out-of-order cores. Can you attach here the produced
> ASM with your patch? I'll compare with trunk and see if I can spot anything.

Ah, thank you for your explanation. Didn't realize they made out-of-order ARM chips these days! Anyway, you'll find this test with the others in the attached tarball.

>>   • CodeGen/ARM/2011-08-25-ldmia_ret.ll:
>>
>> Probably best someone more familiar with ARM assembly look at this one,
>> sorry. Nothing seems wrong about it to me, but I definitely would benefit
>> from some guidance on how to fix this test. I'd be happy to provide the ASM
>> on request, it's too much to put inline here.
>
> Feel free to zip and attach to this bug.

>>   • CodeGen/ARM/2012-08-30-select.ll:
>>
>> My understanding is the 'it'-based variant would be better than branches on
>> ARM, making this a regression, but a)I really am no ARM assembly expert and
>> b)I'm unsure how this patch caused some other component to be unable to
>> rewrite into an IT block.
>
> I'm suspecting that your patch is breaking the IT block lowering because the
> pattern is kept at different MBBs, ie. some cases that were correctly
> sinking, aren't any more. Your heuristics seem a bit too constrained.

Taking another look, it seems to actually be the opposite: we're now sinking in cases where we didn't previously, and that (apparently) prevents the IT block formation. This can be seen in the before/after debug logs for this test case (search for "Machine Sinking"). That said, there should be cases where we no longer sink when we previously would, but I don't believe any of these failures are due to that.

> It's not clear if the IT version is better in this case. The load (vld1.8
> {d16, d17}, [r1]) is expensive, and is happening always now, when in some
> cases (when r0 == 1), it'll be reset to zero, which will probably be waiting
> for the load to finish just to clear it. However, in your version, the
> number of branches to such small blocks can confuse the branch-predictor. It
> all depends on which core it's running, as they tend to have different costs
> for different features.

Thank you for the analysis and details, much appreciated. So is the upshot that we should perform some manner of testing on ARM (LNT?) across popular cores to find out? Or is it likely a wash overall? Commit and wait for performance regression reports from iOS developers? O:)

> I'm not sure what the test is testing, so I can't tell you what's wrong
> there. It looks as a bug found on Darwin that just got in alongside the real
> change.
>
> Would be good if people could add 2-3 lines of comments on what the test is
> looking for...

I think it's simply trying to verify that vector operations are considered for IT instruction generation? But yes, comments would help. We can always ask the author... (looks like Nadav Rotem) :).

@llvmbot
Collaborator Author

llvmbot commented Sep 16, 2013

Since I was curious, I went ahead and ran before/after experiments with LNT on x86-64.
The results are (temporarily) available here:

Before: http://wdtz.org:8000/db_default/v4/nts/3
After: http://wdtz.org:8000/db_default/v4/nts/4

Comparison:

http://wdtz.org:8000/db_default/v4/nts/4?show_previous=yes&show_stddev=yes&show_all=yes&show_all_samples=yes&num_comparison_runs=10&test_filter=&test_min_value_filter=&aggregation_fn=min&compare_to=3&submit=Update

I'm new to LNT, so I'm not sure what can be concluded, but offhand it seems the impact on both compilation and execution performance is minor and likely well within the noise of the experiment (although the min of 3 samples is being compared). Only executions with tiny durations showed regressions.

FWIW, the tests were run on a calm system (running only sshd and dhclient) with ASLR, power saving, and frequency scaling disabled (I've previously invested time in making this machine produce stable numbers), using the following LNT invocation:

lnt-ve/bin/lnt runtest nt --sandbox ~/lnt-ve --cc ~/llvm/install/bin/clang --cxx ~/llvm/install/bin/clang++ --test-suite ~/src/llvm-test-suite DEMO-default --large --build-threads=4 --multisample=3

Unfortunately I don't have access to ARM hardware to produce similar results, and I think it's on ARM that this would likely have the greater impact (if any).

Anyway, I thought I'd share in case the results are useful.

@rengolin
Member

> I'm new to LNT, so not sure what can be concluded but offhand it seems the
> impact on both compilation and execution performance is minor and likely
> well within the noise of the experiment (although min of 3 samples are being
> compared). Only executions with tiny duration showed regression.

You shouldn't trust LNT performance results (yet), as they're not proper benchmarks, just timed execution of largely unaltered code. The amount of noise is so great that even 5% regressions would wash away.

Unfortunately don't have access to ARM hardware to produce similar results,
and I think it's on ARM that this would likely have a greater impact (if
any).

If the change is sound (Andrew seems to agree it is), and the differences in code generated don't produce broken code, I think we can wait for the ARM LNT buildbot to report regressions over time:

http://llvm.org/perf/db_default/v4/nts/machine/10

If it doesn't break straight away, I'd wait a few iterations to check on the regression cases as well as the improvements and see if the new "visual moving average" trends up or down, so that we can look at the specific tests.

But don't waste too much of your time with that. Until we're benchmarking it properly, there is little point in trying to improve performance on LNT.

@llvmbot
Collaborator Author

llvmbot commented Oct 14, 2013

Fixed in r192608, thanks!

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 9, 2021
This issue was closed.