New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
-instcombine with fast flag moves div outside of loop into loop, dramatically regressing performance #45460
Comments
There is no cost modeling in instcombine. Arguably, it shouldn't be doing any instruction movement in the first place, but that's a bigger issue. I don't think you can expect to tack passes onto the end of -O3 and expect reasonable results, but I might be missing larger context here? |
To be clear, the godbolt link shows vectorizing after unrolling, so we have non-CSE'd code like this before going into instcombine: |
Clang does not seem to show this behavior: The context is that I was trying to replicate a performance regression I saw in Julia when using fast-math. Julia defines its passes here: So perhaps this is a problem with their pass ordering? |
I recently removed an instantiation of -early-cse for Clang with: See also comments in bug 46065. It seems inefficient to let the backend deal with that CSE, but we need to weigh compile-time costs and decide what, if anything, is better. But this is probably a side-issue. I think what we really care about is figuring out how instructions are being hoisted/sunk for the problem case.
I've dabbled in pass manager changes as above, but don't claim any expertise. But if Clang is doing what we want and Julia is not, then I'd start with Julia's pass manager and see how the fdivs are moving differently there vs. Clang. |
LICM is probably the pass that is responsible for moving 'fdiv fast' out of a loop in IR as shown here: And for the normal 'opt' pipelines that clang uses, the final round of LICM is never followed by instcombine from what I can tell. If I'm seeing the Julia pipeline correctly, it only uses LICM twice like this: But it then adds InstCombine again later. If you agree with that analysis, I think we can close this bug and open a bug against Julia. |
Great, thanks. |
Is there any testcase in llvm tree to ensure that things will not regress if someone in the future tries to add instcombine after licm? |
Good point - added a test based on the example here: That shows the CSE opportunity. This bug is about fdiv sinking/hoisting, so resolving since there's a Julia bug for that now. |
Extended Description
Godbolt:
https://godbolt.org/z/3vuZbN
The code divides one vector by a scalar.
The unoptimized IR has
inverse = fdiv 1, scalar
and then multiplies one vector by the inverse in the loop.
With -instcombine, the division is moved into the loop.
Seems like a failure of cost modeling not to realize that divisions are far more expensive.
With vectors of length 128, this leads to about a 5.6-fold regression in performance on my 10980xe.
The text was updated successfully, but these errors were encountered: