Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

-instcombine with fast flag moves div outside of loop into loop, dramatically regressing performance #45460

Closed
chriselrod opened this issue May 28, 2020 · 8 comments
Labels
bugzilla Issues migrated from bugzilla invalid Resolved as invalid, i.e. not a bug

Comments

@chriselrod
Copy link

Bugzilla Link 46115
Resolution INVALID
Resolved on May 31, 2020 07:40
Version trunk
OS Linux
CC @davidbolvansky,@efriedma-quic,@RKSimon,@nikic,@rotateright

Extended Description

Godbolt:
https://godbolt.org/z/3vuZbN

The code divides one vector by a scalar.
The unoptimized IR has

inverse = fdiv 1, scalar

and then multiplies one vector by the inverse in the loop.

With -instcombine, the division is moved into the loop.
Seems like a failure of cost modeling not to realize that divisions are far more expensive.

With vectors of length 128, this leads to about a 5.6-fold regression in performance on my 10980xe.

@rotateright
Copy link
Contributor

There is no cost modeling in instcombine. Arguably, it shouldn't be doing any instruction movement in the first place, but that's a bigger issue.

I don't think you can expect to tack passes onto the end of -O3 and expect reasonable results, but I might be missing larger context here?

@rotateright
Copy link
Contributor

I don't think you can expect to tack passes onto the end of -O3 and expect
reasonable results, but I might be missing larger context here?

To be clear, the godbolt link shows vectorizing after unrolling, so we have non-CSE'd code like this before going into instcombine:
%71 = fdiv fast double 1.000000e+00, %2
%72 = fdiv fast double 1.000000e+00, %2
%73 = fdiv fast double 1.000000e+00, %2
%74 = fdiv fast double 1.000000e+00, %2

@chriselrod
Copy link
Author

There is no cost modeling in instcombine. Arguably, it shouldn't be doing
any instruction movement in the first place, but that's a bigger issue.

I don't think you can expect to tack passes onto the end of -O3 and expect
reasonable results, but I might be missing larger context here?

Clang does not seem to show this behavior:
https://godbolt.org/z/36kSVh
It does have the non-CSE'd fdiv fasts in the LLVM IR, but the final asm shows just a single vdivpd (and then vmulpds in the loop).

The context is that I was trying to replicate a performance regression I saw in Julia when using fast-math. Julia defines its passes here:
https://github.com/JuliaLang/julia/blob/98d1c3185bdcad02ccf5fd35857abb06e6d5e10e/src/aotcompile.cpp

So perhaps this is a problem with their pass ordering?
What's the correct place for this to get fixed?

@rotateright
Copy link
Contributor

Clang does not seem to show this behavior:
https://godbolt.org/z/36kSVh
It does have the non-CSE'd fdiv fasts in the LLVM IR, but the final asm
shows just a single vdivpd (and then vmulpds in the loop).

I recently removed an instantiation of -early-cse for Clang with:
https://reviews.llvm.org/rG57bb4787d72f

See also comments in bug 46065.

It seems inefficient to let the backend deal with that CSE, but we need to weigh compile-time costs and decide what, if anything, is better.

But this is probably a side-issue. I think what we really care about is figuring out how instructions are being hoisted/sunk for the problem case.

The context is that I was trying to replicate a performance regression I saw
in Julia when using fast-math. Julia defines its passes here:
https://github.com/JuliaLang/julia/blob/
98d1c3185bdcad02ccf5fd35857abb06e6d5e10e/src/aotcompile.cpp

So perhaps this is a problem with their pass ordering?
What's the correct place for this to get fixed?

I've dabbled in pass manager changes as above, but don't claim any expertise. But if Clang is doing what we want and Julia is not, then I'd start with Julia's pass manager and see how the fdivs are moving differently there vs. Clang.

@rotateright
Copy link
Contributor

LICM is probably the pass that is responsible for moving 'fdiv fast' out of a loop in IR as shown here:
https://reviews.llvm.org/D30819

And for the normal 'opt' pipelines that clang uses, the final round of LICM is never followed by instcombine from what I can tell.

If I'm seeing the Julia pipeline correctly, it only uses LICM twice like this:
PM->add(createLICMPass());
PM->add(createLoopUnswitchPass());
PM->add(createLICMPass());

But it then adds InstCombine again later.

If you agree with that analysis, I think we can close this bug and open a bug against Julia.

@chriselrod
Copy link
Author

LICM is probably the pass that is responsible for moving 'fdiv fast' out of
a loop in IR as shown here:
https://reviews.llvm.org/D30819

And for the normal 'opt' pipelines that clang uses, the final round of LICM
is never followed by instcombine from what I can tell.

If I'm seeing the Julia pipeline correctly, it only uses LICM twice like
this:
PM->add(createLICMPass());
PM->add(createLoopUnswitchPass());
PM->add(createLICMPass());

But it then adds InstCombine again later.

If you agree with that analysis, I think we can close this bug and open a
bug against Julia.

Great, thanks.
I opened an issue here: JuliaLang/julia#36084

@davidbolvansky
Copy link
Collaborator

Is there any testcase in llvm tree to ensure that things will not regress if someone in the future tries to add instcombine after licm?

@rotateright
Copy link
Contributor

Is there any testcase in llvm tree to ensure that things will not regress if
someone in the future tries to add instcombine after licm?

Good point - added a test based on the example here:
91b45fb

That shows the CSE opportunity. This bug is about fdiv sinking/hoisting, so resolving since there's a Julia bug for that now.

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugzilla Issues migrated from bugzilla invalid Resolved as invalid, i.e. not a bug
Projects
None yet
Development

No branches or pull requests

3 participants