Failure to optimize combined popcount pattern #48343

GabrielRavier · 2021-02-02T13:53:08Z


Bugzilla Link	48999
Version	trunk
OS	Linux
CC	@davidbolvansky,@jsonn,@LebedevRI,@RKSimon,@nikic,@rotateright

Extended Description

int f(unsigned int a, unsigned int b)
{
    return __builtin_popcount(a & 8) + __builtin_popcount(b & 2);
}

This can be optimized to return __builtin_popcount((a & 8) | (b & 2));. This transformation is done by GCC, but not by LLVM.

Godbolt comparison: https://godbolt.org/z/zjP3Ys
(If someone can tell me how to use ctpop in alive2 that'd be nice btw, I couldn't get it to work personally)

The text was updated successfully, but these errors were encountered:

RKSimon · 2021-04-02T19:49:15Z

We should be able to fold:

add(ctpop(A),ctpop(B)) --> ctpop(add(A,B)) --> ctpop(or(A,B))

if A and B have no bits set in common.

rotateright · 2021-04-02T21:15:22Z

Does the motivating example really look like the code in the description?

We don't need popcount if we're masking a single bit of a value - just shift that bit to the right:
https://alive2.llvm.org/ce/z/ASdf47

define i32 @src(i32 %x, i32 %y) {
%ax = and i32 %x, 8
%p = call i32 @llvm.ctpop.i32(i32 %ax)
ret i32 %p
}

define i32 @tgt(i32 %x, i32 %y) {
%ax = and i32 %x, 8
%p = lshr i32 %ax, 3
ret i32 %p
}

We are missing that fold currently.

RKSimon · 2021-04-03T10:14:42Z

Yes the single bit popcnt cases would make better sense as a shift+mask.

That just leaves the more general popcnt case:

int f2(unsigned int a, unsigned int b)
{
    return __builtin_popcount(a << 16) + __builtin_popcount(b >> 16);
}

https://godbolt.org/z/sqvY36cbv

which gcc also optimizes

rotateright · 2021-04-05T14:00:43Z

Yes the single bit popcnt cases would make better sense as a shift+mask.

Get that one out of the way at least:
https://reviews.llvm.org/rGc0645f13243c

...although that means we are potentially missing a codegen optimization to form popcount out of bitwise-logic (if popcount is cheap).

davidbolvansky · 2021-04-23T13:13:53Z

GCC:

popcount(X) + popcount(Y) is popcount(X|Y) when X&Y must be zero.

davidbolvansky · 2021-04-24T00:03:08Z

Candidate patch:
https://reviews.llvm.org/D101210

davidbolvansky · 2021-04-25T00:20:05Z

Thanks to Joerg Sonnenberger, we could generalize this even more.

ctpop(A) + ctpop(B) =>  ctpop(X|Y) + numberofcommonbits(X, Y)

My simple testcases:

int ok_test(unsigned int a) // thanks Joerg
{
    unsigned b = (a & 4) | 3;
    unsigned c = (a & 8) | 3;
    return __builtin_popcount(b) +  __builtin_popcount(c);
}

int ok_test2(unsigned int a, unsigned int b)
{
    return __builtin_popcount(a << 16) + __builtin_popcount(b >> 16);
}

int neg_test(unsigned int a, unsigned int b)
{
    return __builtin_popcount(a << 16) + __builtin_popcount(b >> 4);
}

So I started with:

if (match(LHS, m_OneUse(m_Intrinsic<Intrinsic::ctpop>(m_Value(A)))) &&
      match(RHS, m_OneUse(m_Intrinsic<Intrinsic::ctpop>(m_Value(B))))) {
    KnownBits AKnown = computeKnownBits(A, 0, &I);
    KnownBits BKnown = computeKnownBits(B, 0, &I);
    KnownBits commonBits = KnownBits::commonBits(AKnown, BKnown);

    llvm::errs() << (commonBits.isUnknown() ? " unknown " : "known some") << " Min/Max: " << commonBits.countMinPopulation() << " / " << commonBits.countMaxPopulation() << "\n";
    

}

ok_test - known some Min/Max: 2 / 4
ok_test2 - unknown  Min/Max: 0 / 32
neg_test - unknown  Min/Max: 0 / 32

On the other side, haveNoCommonBitsSet(A, B) for ok_test2 says true. So I would expect "known some" for "ok_test2" too.

I have never worked with knownbits so any advice is useful. Thanks

RKSimon · 2021-05-03T14:34:06Z

I think what you're saying is:

ctpop(A) + ctpop(B) => ctpop(X|Y) + ctpop(X&Y)

https://alive2.llvm.org/ce/z/Hzw8-v

define i4 @src(i4 %0, i4 %1) {
%2:
%3 = ctpop i4 %0
%4 = ctpop i4 %1
%5 = add i4 %4, %3
ret i4 %5
}
=>
define i4 @tgt(i4 %0, i4 %1) {
%2:
%3 = or i4 %0, %1
%4 = ctpop i4 %3
%5 = and i4 %0, %1
%6 = ctpop i4 %5
%7 = add i4 %4, %6
ret i4 %7
}
Transformation seems to be correct!

Not sure how we can make good use of this general use case though tbh - anyone got any suggestions?

LebedevRI · 2021-05-03T14:52:39Z

I think what you're saying is:

ctpop(A) + ctpop(B) => ctpop(X|Y) + ctpop(X&Y)

https://alive2.llvm.org/ce/z/Hzw8-v

define i4 @src(i4 %0, i4 %1) {
%2:
%3 = ctpop i4 %0
%4 = ctpop i4 %1
%5 = add i4 %4, %3
ret i4 %5
}
=>
define i4 @tgt(i4 %0, i4 %1) {
%2:
%3 = or i4 %0, %1
%4 = ctpop i4 %3
%5 = and i4 %0, %1
%6 = ctpop i4 %5
%7 = add i4 %4, %6
ret i4 %7
}
Transformation seems to be correct!

Not sure how we can make good use of this general use case though tbh -
anyone got any suggestions?

In general - we can't, we should have a reverse fold of that.
The case where this could be good, is when ctpop(X&Y) happens to be a constant.
I.e. we'd call KnownBits on X and Y, and their bit pairs sort into 4 categories:

known to be set in both
known to be unset in both
known to be set in one and unset in another
unknown in both/either.
Iff we have the last case, we can't do it.
Not really sure how worth this pattern is in reality.

RKSimon · 2022-03-11T15:21:58Z

Tentatively closing this

@GabrielRavier @davidbolvansky
Feel free to reopen this if you encounter a real world use for ctpop(A) + ctpop(B) => ctpop(X|Y) + ctpop(X&Y)

GabrielRavier · 2022-07-07T21:21:11Z

Tentatively closing this

@GabrielRavier @davidbolvansky Feel free to reopen this if you encounter a real world use for ctpop(A) + ctpop(B) => ctpop(X|Y) + ctpop(X&Y)

I have not personally encountered this but I've finally found the patch that made GCC optimize it, and it appears to cite a real world use:

These may seem obscure transformations, but performing these types of
POPCOUNT operations are often the performance critical steps in some cheminformatics
applications.

I've also found slides on a similar subject over here which suggest that this can also be used to speed up chemical similarity calculations when they are calculated as Tanimoto coefficients.

(PS: I do not seem to be able to reopen this issue myself)

GabrielRavier · 2022-07-07T21:29:22Z

Actually nvm I just realised this was closed w.r.t. the more generic optimisation, whereas the examples I found are for the original testcase which was fixed - maybe I shouldn't look at old bug reports at 12 AM

llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 11, 2021

RKSimon added the llvm:optimizations label Mar 10, 2022

RKSimon closed this as completed Mar 11, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Failure to optimize combined popcount pattern #48343

Failure to optimize combined popcount pattern #48343

GabrielRavier commented Feb 2, 2021 •

edited by RKSimon

RKSimon commented Apr 2, 2021

rotateright commented Apr 2, 2021

RKSimon commented Apr 3, 2021 •

edited

rotateright commented Apr 5, 2021

davidbolvansky commented Apr 23, 2021

davidbolvansky commented Apr 24, 2021

davidbolvansky commented Apr 25, 2021 •

edited by RKSimon

RKSimon commented May 3, 2021

LebedevRI commented May 3, 2021

RKSimon commented Mar 11, 2022

GabrielRavier commented Jul 7, 2022 •

edited

GabrielRavier commented Jul 7, 2022 •

edited

Failure to optimize combined popcount pattern #48343

Failure to optimize combined popcount pattern #48343

Comments

GabrielRavier commented Feb 2, 2021 • edited by RKSimon

Extended Description

RKSimon commented Apr 2, 2021

rotateright commented Apr 2, 2021

RKSimon commented Apr 3, 2021 • edited

rotateright commented Apr 5, 2021

davidbolvansky commented Apr 23, 2021

davidbolvansky commented Apr 24, 2021

davidbolvansky commented Apr 25, 2021 • edited by RKSimon

RKSimon commented May 3, 2021

LebedevRI commented May 3, 2021

RKSimon commented Mar 11, 2022

GabrielRavier commented Jul 7, 2022 • edited

GabrielRavier commented Jul 7, 2022 • edited

GabrielRavier commented Feb 2, 2021 •

edited by RKSimon

RKSimon commented Apr 3, 2021 •

edited

davidbolvansky commented Apr 25, 2021 •

edited by RKSimon

GabrielRavier commented Jul 7, 2022 •

edited

GabrielRavier commented Jul 7, 2022 •

edited