MachineSink Performance Issue #17619
I've previously discussed this issue and patch on llvm-commits.
Updated patch with some lit test fixes. Unfortunately, there are still failures I'm not able to resolve: Failing Tests (4). Here's a run-down of these tests and why I'm not sure how best to resolve them:
Looking at the commit that introduced this test, it seems BranchFolder is supposed to be hoisting the xorl. Unfortunately it does not, since the two instructions do not pass isIdenticalTo(). Accordingly, this test seems either invalid or legitimately failing. I'm not sure why it's been passing, but it's not because BranchFolder was hoisting the instruction in question.
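For what it's worth, the hoisting opportunity in question can be sketched abstractly (a hypothetical Python model, not BranchFolder's actual C++): an instruction can only be hoisted out of two successor blocks if the copies in each block compare identical, which mirrors the isIdenticalTo() check the xorl pair is failing.

```python
# Hypothetical sketch of common-instruction hoisting, not LLVM's BranchFolder.
# Instructions are modeled as (opcode, *operands) tuples; two copies are
# hoistable only if they compare identical, mirroring isIdenticalTo().

def hoist_common_prefix(block_a, block_b):
    """Move the longest identical leading run of both blocks into a shared prefix."""
    hoisted = []
    while block_a and block_b and block_a[0] == block_b[0]:
        hoisted.append(block_a.pop(0))
        block_b.pop(0)
    return hoisted

a = [("xorl", "eax", "eax"), ("movl", "eax", "ebx")]
b = [("xorl", "eax", "eax"), ("addl", "ecx", "ebx")]
print(hoist_common_prefix(a, b))  # the shared leading xorl is hoisted
```

If the two copies differ in any operand (the "do not pass isIdenticalTo()" case above), nothing is hoisted and each block keeps its own copy.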
The resulting ASM in my non-expert opinion looks better here (ITT blocks instead of branches, no stack spilling of 'lr'), but I have no idea how to update this to test what was originally intended (or if that functionality is working as expected).
Probably best if someone more familiar with ARM assembly looks at this one, sorry. Nothing seems wrong about it to me, but I would definitely benefit from some guidance on how to fix this test. I'd be happy to provide the ASM on request; it's too much to put inline here.
I'm not sure what to make of this one, although the difference seems straightforward. Resulting (short) assembly reproduced below:

(before/after assembly listing not preserved)
Hi Will,
IT blocks are not necessarily better than branches, and this is especially true in newer, faster, out-of-order cores. Can you attach here the produced ASM with your patch? I'll compare with trunk and see if I can spot anything.
Feel free to zip and attach to this bug.
I suspect your patch is breaking the IT block lowering because the pattern is kept in different MBBs, i.e. some cases that were correctly sinking aren't any more. Your heuristics seem a bit too constrained.

It's not clear that the IT version is better in this case. The load (vld1.8 {d16, d17}, [r1]) is expensive, and it now always happens, when in some cases (when r0 == 1) the result will just be reset to zero, which will probably mean waiting for the load to finish only to clear it. On the other hand, in your version, the number of branches to such small blocks can confuse the branch predictor. It all depends on which core it's running on, as cores tend to have different costs for different features.

I'm not sure what the test is testing, so I can't tell you what's wrong there. It looks like a bug found on Darwin that just got in alongside the real change. It would be good if people could add 2-3 lines of comments on what each test is looking for...
Attached: failing lit tests (before/after output, debug logs, copy of each failing test). Hi Renato, thanks for taking a look and for your response. I've attached the output from the failing lit tests, packaged together.
Ah, thank you for your explanation. Didn't realize they made out-of-order ARM chips these days! Anyway, you'll find this test with the others in the attached tarball.
Taking a look at this some more, it seems to actually be the opposite: we're now sinking in cases we didn't previously, and that (apparently) prevents the IT block formation. This can be seen in the before/after debug logs for this test case (search for "Machine Sinking"). That said, there should be cases where we no longer sink when we previously would, but I don't believe any of these failures are due to that.
Thank you for the analysis and details, much appreciated. So is the upshot that we should perform some manner of testing on ARM (LNT?) across popular architectures to find out? Or is it likely a wash overall? Commit and wait for performance regression reports from iOS developers? O:)
I think it's simply trying to verify that vector operations are considered for IT instruction generation? But yes, comments would help. We can always ask the author... (looks like Nadav Rotem) :)
Since I was curious, I went ahead and ran before/after experiments with LNT on x86-64.

Before: http://wdtz.org:8000/db_default/v4/nts/3
Comparison:

I'm new to LNT, so I'm not sure what can be concluded, but offhand it seems the impact on both compilation and execution performance is minor and likely well within the noise of the experiment (although the min of 3 samples is being compared). Only executions with tiny duration showed regression.

FWIW, tests were run on a calm system (running only sshd and dhclient) with ASLR/power saving/frequency scaling disabled (I've previously invested time in making this machine produce stable numbers), with the following LNT invocation:

lnt-ve/bin/lnt runtest nt --sandbox ~/lnt-ve --cc ~/llvm/install/bin/clang --cxx ~/llvm/install/bin/clang++ --test-suite ~/src/llvm-test-suite DEMO-default --large --build-threads=4 --multisample=3

Unfortunately I don't have access to ARM hardware to produce similar results, and I think it's on ARM that this would likely have a greater impact (if any). Anyway, thought I'd share in case the results are useful.
You shouldn't trust LNT performance results (yet), as they're not proper benchmarks, just timed execution of largely unaltered code. The amount of noise is so great that even 5% regressions would wash away.
If the change is sound (Andrew seems to agree it is), and the differences in the generated code don't produce broken code, I think we can wait for the ARM LNT buildbot to report regressions over time: http://llvm.org/perf/db_default/v4/nts/machine/10

If it doesn't break straight away, I'd wait a few iterations to check on the regression cases as well as the improvements and see whether the new "visual moving average" moves up or down, so that we can look at the specific tests. But don't waste too much of your time on that. Until we're benchmarking properly, there is little point in trying to improve performance numbers on LNT.
Fixed in r192608, thanks!
Extended Description
I'm encountering bad performance behavior of the MachineSink pass,
as shown in the attached 'before.log' (MachineSink taking ~25 minutes to execute!).
Investigation led me to create the attached patch, which both avoids the n^2 behavior presently exhibited by the pass and tries to more faithfully implement the heuristic described in the pass's own comments.
With this patch applied, MachineSink's execution time reduces from 1494.1 seconds down to 1.8 seconds. I've attached the patch as well as 'after.log' demonstrating the much improved timings.
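To illustrate the kind of cost involved (a hypothetical Python model, not the pass's actual C++): quadratic behavior of this shape typically arises when every candidate instruction triggers a fresh scan over a block's contents, whereas a table built once per block answers the same query in constant time.

```python
# Hypothetical model of an O(n^2) rescan pattern versus a precomputed table.
# A "block" is a list of (defined_value, [used_values]) pairs.

def block_uses_naive(block, value):
    """Rescan the whole block for a use of `value` -- O(n) per query."""
    return any(value in uses for _, uses in block)

def build_use_set(block):
    """Collect every value the block uses, once -- later queries are O(1)."""
    return {u for _, uses in block for u in uses}

pred = [("a", []), ("b", ["a"]), ("c", ["b"])]
succ = [("d", ["b"]), ("e", ["d"])]

use_set = build_use_set(succ)
for defined, _ in pred:
    # Same answer either way; only the cost differs when the naive query
    # is repeated for every instruction in every predecessor.
    assert block_uses_naive(succ, defined) == (defined in use_set)
```

Asking the naive question once per instruction makes total work grow with (instructions x block size), which matches the blow-up seen on large inputs like this one.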
Unfortunately the original IR is both large and something I'd rather not post, although I've managed to reduce it to the attached 'reduced.bc' example, which has before/after timings of 10.5s/0.2s (I used bugpoint, filtering on the 'before' runtime being greater than 10s and the 'after' runtime being less than 1 second).
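The runtime filter described above can be sketched like this (a hypothetical Python helper; the real reduction used bugpoint's own driver, and the compiler command names below are made up, while the 10s/1s thresholds are the values mentioned above):

```python
# Hypothetical timing-based "interestingness" filter for a reduction:
# keep a candidate only if the unpatched compiler is still slow on it
# while the patched compiler is fast.
import subprocess
import time

def runtime_seconds(cmd):
    """Run `cmd` (a list of argv strings) and return its wall-clock time in seconds."""
    start = time.monotonic()
    subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.monotonic() - start

def still_interesting(before_cmd, after_cmd, slow=10.0, fast=1.0):
    """True if the 'before' compile exceeds `slow` seconds and the
    'after' compile stays under `fast` seconds."""
    return runtime_seconds(before_cmd) > slow and runtime_seconds(after_cmd) < fast

# E.g. (binary names hypothetical):
# still_interesting(["opt.before", "-O2", "reduced.bc", "-o", "/dev/null"],
#                   ["opt.after",  "-O2", "reduced.bc", "-o", "/dev/null"])
```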