Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failure to optimize combined popcount pattern #48343

Closed
GabrielRavier opened this issue Feb 2, 2021 · 12 comments
Closed

Failure to optimize combined popcount pattern #48343

GabrielRavier opened this issue Feb 2, 2021 · 12 comments
Labels
bugzilla Issues migrated from bugzilla llvm:optimizations

Comments

@GabrielRavier
Copy link
Contributor

GabrielRavier commented Feb 2, 2021

Bugzilla Link 48999
Version trunk
OS Linux
CC @davidbolvansky,@jsonn,@LebedevRI,@RKSimon,@nikic,@rotateright

Extended Description

int f(unsigned int a, unsigned int b)
{
    return __builtin_popcount(a & 8) + __builtin_popcount(b & 2);
}

This can be optimized to return __builtin_popcount((a & 8) | (b & 2));. This transformation is done by GCC, but not by LLVM.

Godbolt comparison: https://godbolt.org/z/zjP3Ys
(If someone can tell me how to use ctpop in alive2 that'd be nice btw, I couldn't get it to work personally)

@RKSimon
Copy link
Collaborator

RKSimon commented Apr 2, 2021

We should be able to fold:

add(ctpop(A),ctpop(B)) --> ctpop(add(A,B)) --> ctpop(or(A,B))

if A and B have no bits set in common.

@rotateright
Copy link
Contributor

Does the motivating example really look like the code in the description?

We don't need popcount if we're masking a single bit of a value - just shift that bit to the right:
https://alive2.llvm.org/ce/z/ASdf47

define i32 @​src(i32 %x, i32 %y) {
%ax = and i32 %x, 8
%p = call i32 @​llvm.ctpop.i32(i32 %ax)
ret i32 %p
}

define i32 @​tgt(i32 %x, i32 %y) {
%ax = and i32 %x, 8
%p = lshr i32 %ax, 3
ret i32 %p
}

We are missing that fold currently.

@RKSimon
Copy link
Collaborator

RKSimon commented Apr 3, 2021

Yes the single bit popcnt cases would make better sense as a shift+mask.

That just leaves the more general popcnt case:

int f2(unsigned int a, unsigned int b)
{
    return __builtin_popcount(a << 16) + __builtin_popcount(b >> 16);
}

https://godbolt.org/z/sqvY36cbv

which gcc also optimizes

@rotateright
Copy link
Contributor

Yes the single bit popcnt cases would make better sense as a shift+mask.

Get that one out of the way at least:
https://reviews.llvm.org/rGc0645f13243c

...although that means we are potentially missing a codegen optimization to form popcount out of bitwise-logic (if popcount is cheap).

@davidbolvansky
Copy link
Collaborator

GCC:

popcount(X) + popcount(Y) is popcount(X|Y) when X&Y must be zero.

@davidbolvansky
Copy link
Collaborator

Candidate patch:
https://reviews.llvm.org/D101210

@davidbolvansky
Copy link
Collaborator

davidbolvansky commented Apr 25, 2021

Thanks to Joerg Sonnenberger, we could generalize this even more.

ctpop(A) + ctpop(B) =>  ctpop(X|Y) + numberofcommonbits(X, Y)

My simple testcases:

int ok_test(unsigned int a) // thanks Joerg
{
    unsigned b = (a & 4) | 3;
    unsigned c = (a & 8) | 3;
    return __builtin_popcount(b) +  __builtin_popcount(c);
}

int ok_test2(unsigned int a, unsigned int b)
{
    return __builtin_popcount(a << 16) + __builtin_popcount(b >> 16);
}

int neg_test(unsigned int a, unsigned int b)
{
    return __builtin_popcount(a << 16) + __builtin_popcount(b >> 4);
}

So I started with:

if (match(LHS, m_OneUse(m_Intrinsic<Intrinsic::ctpop>(m_Value(A)))) &&
      match(RHS, m_OneUse(m_Intrinsic<Intrinsic::ctpop>(m_Value(B))))) {
    KnownBits AKnown = computeKnownBits(A, 0, &I);
    KnownBits BKnown = computeKnownBits(B, 0, &I);
    KnownBits commonBits = KnownBits::commonBits(AKnown, BKnown);

    llvm::errs() << (commonBits.isUnknown() ? " unknown " : "known some") << " Min/Max: " << commonBits.countMinPopulation() << " / " << commonBits.countMaxPopulation() << "\n";
    

}
ok_test - known some Min/Max: 2 / 4
ok_test2 - unknown  Min/Max: 0 / 32
neg_test - unknown  Min/Max: 0 / 32

On the other side, haveNoCommonBitsSet(A, B) for ok_test2 says true. So I would expect "known some" for "ok_test2" too.

I have never worked with knownbits so any advice is useful. Thanks

@RKSimon
Copy link
Collaborator

RKSimon commented May 3, 2021

I think what you're saying is:

ctpop(A) + ctpop(B) => ctpop(X|Y) + ctpop(X&Y)

https://alive2.llvm.org/ce/z/Hzw8-v


define i4 @​src(i4 %0, i4 %1) {
%2:
%3 = ctpop i4 %0
%4 = ctpop i4 %1
%5 = add i4 %4, %3
ret i4 %5
}
=>
define i4 @​tgt(i4 %0, i4 %1) {
%2:
%3 = or i4 %0, %1
%4 = ctpop i4 %3
%5 = and i4 %0, %1
%6 = ctpop i4 %5
%7 = add i4 %4, %6
ret i4 %7
}
Transformation seems to be correct!

Not sure how we can make good use of this general use case though tbh - anyone got any suggestions?

@LebedevRI
Copy link
Member

I think what you're saying is:

ctpop(A) + ctpop(B) => ctpop(X|Y) + ctpop(X&Y)

https://alive2.llvm.org/ce/z/Hzw8-v


define i4 @​src(i4 %0, i4 %1) {
%2:
%3 = ctpop i4 %0
%4 = ctpop i4 %1
%5 = add i4 %4, %3
ret i4 %5
}
=>
define i4 @​tgt(i4 %0, i4 %1) {
%2:
%3 = or i4 %0, %1
%4 = ctpop i4 %3
%5 = and i4 %0, %1
%6 = ctpop i4 %5
%7 = add i4 %4, %6
ret i4 %7
}
Transformation seems to be correct!

Not sure how we can make good use of this general use case though tbh -
anyone got any suggestions?

In general - we can't, we should have a reverse fold of that.
The case where this could be good, is when ctpop(X&Y) happens to be a constant.
I.e. we'd call KnownBits on X and Y, and their bit pairs sort into 4 categories:

  • known to be set in both
  • known to be unset in both
  • known to be set in one and unset in another
  • unknown in both/either.
    Iff we have the last case, we can't do it.
    Not really sure how worth this pattern is in reality.

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 11, 2021
@RKSimon
Copy link
Collaborator

RKSimon commented Mar 11, 2022

Tentatively closing this

@GabrielRavier @davidbolvansky
Feel free to reopen this if you encounter a real world use for ctpop(A) + ctpop(B) => ctpop(X|Y) + ctpop(X&Y)

@RKSimon RKSimon closed this as completed Mar 11, 2022
@GabrielRavier
Copy link
Contributor Author

GabrielRavier commented Jul 7, 2022

Tentatively closing this

@GabrielRavier @davidbolvansky Feel free to reopen this if you encounter a real world use for ctpop(A) + ctpop(B) => ctpop(X|Y) + ctpop(X&Y)

I have not personally encountered this but I've finally found the patch that made GCC optimize it, and it appears to cite a real world use:

These may seem obscure transformations, but performing these types of
POPCOUNT operations are often the performance critical steps in some cheminformatics
applications.

I've also found slides on a similar subject over here which suggest that this can also be used to speed up chemical similarity calculations when they are calculated as Tanimoto coefficients.

(PS: I do not seem to be able to reopen this issue myself)

@GabrielRavier
Copy link
Contributor Author

GabrielRavier commented Jul 7, 2022

Actually nvm I just realised this was closed w.r.t. the more generic optimisation, whereas the examples I found are for the original testcase which was fixed - maybe I shouldn't look at old bug reports at 12 AM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugzilla Issues migrated from bugzilla llvm:optimizations
Projects
None yet
Development

No branches or pull requests

5 participants