48999 – Failure to optimize combined popcount pattern

LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filled before November 26, 2021. Use github to submit LLVM bugs

Bug 48999 - Failure to optimize combined popcount pattern

Summary: Failure to optimize combined popcount pattern

Status:	NEW

Alias:	None

Product:	libraries
Classification:	Unclassified
Component:	Scalar Optimizations (show other bugs)
Version:	trunk
Hardware:	PC Linux

Importance:	P enhancement
Assignee:	Unassigned LLVM Bugs

URL:
Keywords:

Depends on:
Blocks:

Reported:	2021-02-02 05:53 PST by Gabriel Ravier
Modified:	2021-05-03 07:57 PDT (History)
CC List:	7 users (show)

See Also:	47906
Fixed By Commit(s):

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Gabriel Ravier 2021-02-02 05:53:08 PST

int f(unsigned int a, unsigned int b)
{
    return __builtin_popcount(a & 8) + __builtin_popcount(b & 2);
}

This can be optimized to `return __builtin_popcount((a & 8) | (b & 2));`. This transformation is done by GCC, but not by LLVM.

Godbolt comparison: https://godbolt.org/z/zjP3Ys
(If someone can tell me how to use ctpop in alive2 that'd be nice btw, I couldn't get it to work personally)

Comment 1 Simon Pilgrim 2021-04-02 12:49:15 PDT

We should be able to fold:

  add(ctpop(A),ctpop(B)) --> ctpop(add(A,B)) --> ctpop(or(A,B))

if A and B have no bits set in common.

Comment 2 Sanjay Patel 2021-04-02 14:15:22 PDT

Does the motivating example really look like the code in the description? 

We don't need popcount if we're masking a single bit of a value - just shift that bit to the right:
https://alive2.llvm.org/ce/z/ASdf47

define i32 @src(i32 %x, i32 %y) {
  %ax = and i32 %x, 8
  %p = call i32 @llvm.ctpop.i32(i32 %ax)
  ret i32 %p
}

define i32 @tgt(i32 %x, i32 %y) {
  %ax = and i32 %x, 8
  %p = lshr i32 %ax, 3
  ret i32 %p
}

We are missing that fold currently.

Comment 3 Simon Pilgrim 2021-04-03 03:14:42 PDT

Yes the single bit popcnt cases would make better sense as a shift+mask.

That just leaves the more general popcnt case:

int f2(unsigned int a, unsigned int b)
{
    return __builtin_popcount(a << 16) + __builtin_popcount(b >> 16);
}

https://godbolt.org/z/sqvY36cbv

which gcc also optimizes

Comment 4 Sanjay Patel 2021-04-05 07:00:43 PDT

(In reply to Simon Pilgrim from comment #3)
> Yes the single bit popcnt cases would make better sense as a shift+mask.

Get that one out of the way at least:
https://reviews.llvm.org/rGc0645f13243c

...although that means we are potentially missing a codegen optimization to form popcount out of bitwise-logic (if popcount is cheap).

Comment 5 David Bolvansky 2021-04-23 06:13:53 PDT

GCC:

popcount(X) + popcount(Y) is popcount(X|Y) when X&Y must be zero.

Comment 6 David Bolvansky 2021-04-23 17:03:08 PDT

Candidate patch:
https://reviews.llvm.org/D101210

Comment 7 David Bolvansky 2021-04-24 17:20:05 PDT

Thanks to Joerg Sonnenberger, we could generalize this even more.

ctpop(A) + ctpop(B) =>  ctpop(X|Y) + numberofcommonbits(X, Y)

My simple testcases:
int ok_test(unsigned int a) // thanks Joerg
{
    unsigned b = (a & 4) | 3;
    unsigned c = (a & 8) | 3;
    return __builtin_popcount(b) +  __builtin_popcount(c);
}

int ok_test2(unsigned int a, unsigned int b)
{
    return __builtin_popcount(a << 16) + __builtin_popcount(b >> 16);
}

int neg_test(unsigned int a, unsigned int b)
{
    return __builtin_popcount(a << 16) + __builtin_popcount(b >> 4);
}

So I started with:
if (match(LHS, m_OneUse(m_Intrinsic<Intrinsic::ctpop>(m_Value(A)))) &&
      match(RHS, m_OneUse(m_Intrinsic<Intrinsic::ctpop>(m_Value(B))))) {
    KnownBits AKnown = computeKnownBits(A, 0, &I);
    KnownBits BKnown = computeKnownBits(B, 0, &I);
    KnownBits commonBits = KnownBits::commonBits(AKnown, BKnown);

    llvm::errs() << (commonBits.isUnknown() ? " unknown " : "known some") << " Min/Max: " << commonBits.countMinPopulation() << " / " << commonBits.countMaxPopulation() << "\n";
    

}


ok_test - known some Min/Max: 2 / 4
ok_test2 - unknown  Min/Max: 0 / 32
neg_test - unknown  Min/Max: 0 / 32


On the other side, haveNoCommonBitsSet(A, B) for ok_test2 says true. So I would expect "known some" for "ok_test2" too.

I have never worked with knownbits so any advice is useful. Thanks

Comment 8 Simon Pilgrim 2021-05-03 07:34:06 PDT

I think what you're saying is:

ctpop(A) + ctpop(B) =>  ctpop(X|Y) + ctpop(X&Y)

https://alive2.llvm.org/ce/z/Hzw8-v

----------------------------------------
define i4 @src(i4 %0, i4 %1) {
%2:
  %3 = ctpop i4 %0
  %4 = ctpop i4 %1
  %5 = add i4 %4, %3
  ret i4 %5
}
=>
define i4 @tgt(i4 %0, i4 %1) {
%2:
  %3 = or i4 %0, %1
  %4 = ctpop i4 %3
  %5 = and i4 %0, %1
  %6 = ctpop i4 %5
  %7 = add i4 %4, %6
  ret i4 %7
}
Transformation seems to be correct!

Not sure how we can make good use of this general use case though tbh - anyone got any suggestions?

Comment 9 Roman Lebedev 2021-05-03 07:52:39 PDT

(In reply to Simon Pilgrim from comment #8)
> I think what you're saying is:
> 
> ctpop(A) + ctpop(B) =>  ctpop(X|Y) + ctpop(X&Y)
> 
> https://alive2.llvm.org/ce/z/Hzw8-v
> 
> ----------------------------------------
> define i4 @src(i4 %0, i4 %1) {
> %2:
>   %3 = ctpop i4 %0
>   %4 = ctpop i4 %1
>   %5 = add i4 %4, %3
>   ret i4 %5
> }
> =>
> define i4 @tgt(i4 %0, i4 %1) {
> %2:
>   %3 = or i4 %0, %1
>   %4 = ctpop i4 %3
>   %5 = and i4 %0, %1
>   %6 = ctpop i4 %5
>   %7 = add i4 %4, %6
>   ret i4 %7
> }
> Transformation seems to be correct!
> 
> Not sure how we can make good use of this general use case though tbh -
> anyone got any suggestions?

In general - we can't, we should have a reverse fold of that.
The case where this could be good, is when ctpop(X&Y) happens to be a constant.
I.e. we'd call KnownBits on X and Y, and their bit pairs sort into 4 categories:
* known to be set in both 
* known to be unset in both
* known to be set in one and unset in another
* unknown in both/either.
Iff we have the last case, we can't do it.
Not really sure how worth this pattern is in reality.