
missing simplification for llvm.vector.reduce.and on vector of i1 #50603

Open
zygoloid mannequin opened this issue Jul 29, 2021 · 10 comments
Labels: bugzilla (Issues migrated from bugzilla)


zygoloid mannequin commented Jul 29, 2021

Bugzilla Link: 51259
Version: trunk
OS: All
CC: @topperc, @LebedevRI, @RKSimon, @phoebewang, @rotateright

Extended Description

Live demo: https://godbolt.org/z/a6YaahdPP

Example:

[[gnu::weak]] void do_this() {}
[[gnu::weak]] void do_that() {}

void f1(unsigned char const p[8]) {
  if (p[0] != 0x00 & p[1] != 0x00 & p[2] != 0x00 & p[3] != 0x00 & p[4] != 0x00 &
      p[5] != 0x00 & p[6] != 0x00 & p[7] != 0x00) {
    do_this();
  } else {
    do_that();
  }
}

void f2(unsigned const char *p) {
  using T [[gnu::vector_size(8), gnu::aligned(1)]] = unsigned char;
  T same = *(T *)p == (T){0, 0, 0, 0, 0, 0, 0, 0};
  if ((unsigned long)same == 0) {
    do_this();
  } else {
    do_that();
  }
}

This results in the following:

f1(unsigned char const*):                               # @f1(unsigned char const*)
        vmovq   xmm0, qword ptr [rdi]           # xmm0 = mem[0],zero
        vpxor   xmm1, xmm1, xmm1
        vpcmpeqb        xmm0, xmm0, xmm1
        vpmovmskb       eax, xmm0
        not     eax
        cmp     al, -1
        jne     ...

f2(unsigned char const*):                               # @f2(unsigned char const*)
        vmovq   xmm0, qword ptr [rdi]           # xmm0 = mem[0],zero
        vpxor   xmm1, xmm1, xmm1
        vpcmpeqb        xmm0, xmm0, xmm1
        vmovq   rax, xmm0
        test    rax, rax
        je      ...

I think these should produce the same assembly, and the result from f2 looks better to me (though both are the same size). Presumably we'd need to recognize that after vpcmpeqb, each lane in xmm0 is either all-zeros or all-ones, so the vpmovmskb is redundant.
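At the IR level, f1 boils down to an and-reduction of an <8 x i1> compare result, roughly like this (the %5 line is the one quoted in the next comment; %3 stands for whatever value holds the loaded bytes):

%4 = icmp ne <8 x i8> %3, zeroinitializer                 ; one i1 per byte of p
%5 = call i1 @llvm.vector.reduce.and.v8i1(<8 x i1> %4)    ; true iff all bytes are nonzero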

zygoloid mannequin commented Jul 29, 2021

assigned to @LebedevRI

zygoloid mannequin commented Jul 29, 2021

It looks like we might be missing a simplification of @llvm.vector.reduce.and on a vector of i1: replacing

%5 = call i1 @llvm.vector.reduce.and.v8i1(<8 x i1> %4)

with an icmp looks like it should result in the better IR. (Thanks to Nick Lewycky for pointing this out.)
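For reference, the icmp-based form of an i1 and-reduction would look something like this (names are placeholders; the shape mirrors the canonicalization comments in the instcombine snippet quoted later in this thread):

%mask = bitcast <8 x i1> %4 to i8
%res  = icmp eq i8 %mask, -1      ; and-reduction: true iff every lane is set

and the matching or-reduction would be icmp ne i8 %mask, 0.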

RKSimon commented Jul 29, 2021

Should we always be canonicalizing and/or/xor reductions to bitcasts+icmp/parity do you think?

IIRC we already do this in a couple of cases already.

rotateright (Contributor) commented:

> Should we always be canonicalizing and/or/xor reductions to
> bitcasts+icmp/parity do you think?
>
> IIRC we already do this in a couple of cases already.

Already done with:
https://reviews.llvm.org/D97406

The godbolt link in the description is testing with 12.0.1. Here's trunk:
https://godbolt.org/z/6vE6T81Yn

There's still a difference in the IR and asm though. We might want to change both.

LebedevRI (Member) commented:

> Should we always be canonicalizing and/or/xor reductions to
> bitcasts+icmp/parity do you think?
>
> IIRC we already do this in a couple of cases already.

Surely.

https://godbolt.org/z/fhrPzvxzv
So we are missing xor/mul/comparison handling.
i1 xor is just add: https://alive2.llvm.org/ce/z/GZToKH
i1 mul is just and: https://alive2.llvm.org/ce/z/R4Apqp
i1 umax is just or: https://alive2.llvm.org/ce/z/dSE3x_
i1 smin is just or: https://alive2.llvm.org/ce/z/49Z8yh
i1 umin is just and: https://alive2.llvm.org/ce/z/SfZ2pU
i1 smax is just and: https://alive2.llvm.org/ce/z/ndJ2Ze
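As a concrete sketch of the xor case from the list above (names are placeholders): the xor-reduction of an i1 vector is just the parity of the mask, i.e. the low bit of its popcount:

; assuming %v is some <8 x i1> mask
%red  = call i1 @llvm.vector.reduce.xor.v8i1(<8 x i1> %v)
; ...is equivalent to:
%bits = bitcast <8 x i1> %v to i8
%pop  = call i8 @llvm.ctpop.i8(i8 %bits)
%par  = trunc i8 %pop to i1       ; parity of the set lanes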

LebedevRI (Member) commented:

Ok, i've added all the missing instcombine pieces:
all integer reductions w/ (potentially extended) i1 element type
are now expanded.

We now probably want to adjust the cost model to reflect that,
and i'm not sure if we want to do some similar backend handling?

zygoloid mannequin commented Aug 3, 2021

Awesome, thanks! Minor point: the original two examples still don't canonicalize to the same IR:

f1 uses

%5 = bitcast <8 x i1> %4 to i8
%6 = icmp eq i8 %5, 0

resulting in

    vpmovmskb       eax, xmm0
    test    al, al

f2 uses

%5 = sext <8 x i1> %4 to <8 x i8>
%6 = bitcast <8 x i8> %5 to i64
%7 = icmp eq i64 %6, 0

    vmovq   rax, xmm0
    test    rax, rax

I have no idea which of these is better (if either), but presumably there's an instcombine missing or similar?

LebedevRI (Member) commented:

> Awesome, thanks! Minor point: the original two examples still don't
> canonicalize to the same IR:
>
> f1 uses
>
> %5 = bitcast <8 x i1> %4 to i8
> %6 = icmp eq i8 %5, 0
>
> resulting in
>
>     vpmovmskb       eax, xmm0
>     test    al, al
>
> f2 uses
>
> %5 = sext <8 x i1> %4 to <8 x i8>
> %6 = bitcast <8 x i8> %5 to i64
> %7 = icmp eq i64 %6, 0
>
>     vmovq   rax, xmm0
>     test    rax, rax
>
> I have no idea which of these is better (if either), but presumably there's
> an instcombine missing or similar?

Yep, looks like we fail to drop sext before bitcast-icmp: https://godbolt.org/z/Mf8vMfs5r
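Concretely, the missing fold is of the following shape (names are illustrative, matching the f2 IR above). The i64 produced by widening each i1 lane to a 0/-1 byte is zero exactly when the narrow i1 mask is zero, so

%5 = sext <8 x i1> %4 to <8 x i8>
%6 = bitcast <8 x i8> %5 to i64
%7 = icmp eq i64 %6, 0

should be foldable to the f1 form:

%m = bitcast <8 x i1> %4 to i8
%r = icmp eq i8 %m, 0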

LebedevRI (Member) commented:

Alright, i think the only thing we are missing before we can close this is cost model changes.

This block:

  case Intrinsic::vector_reduce_or:
  case Intrinsic::vector_reduce_and: {
    // Canonicalize logical or/and reductions:
    // Or reduction for i1 is represented as:
    // %val = bitcast <ReduxWidth x i1> to iReduxWidth
    // %res = cmp ne iReduxWidth %val, 0
    // And reduction for i1 is represented as:
    // %val = bitcast <ReduxWidth x i1> to iReduxWidth
    // %res = cmp eq iReduxWidth %val, 11111
    Value *Arg = II->getArgOperand(0);
    Value *Vect;
    if (match(Arg, m_ZExtOrSExtOrSelf(m_Value(Vect)))) {
      if (auto *FTy = dyn_cast<FixedVectorType>(Vect->getType()))
        if (FTy->getElementType() == Builder.getInt1Ty()) {
          Value *Res = Builder.CreateBitCast(
              Vect, Builder.getIntNTy(FTy->getNumElements()));
          if (IID == Intrinsic::vector_reduce_and) {
            Res = Builder.CreateICmpEQ(
                Res, ConstantInt::getAllOnesValue(Res->getType()));
          } else {
            assert(IID == Intrinsic::vector_reduce_or &&
                   "Expected or reduction.");
            Res = Builder.CreateIsNotNull(Res);
          }
          if (Arg != Vect)
            Res = Builder.CreateCast(cast<CastInst>(Arg)->getOpcode(), Res,
                                     II->getType());
          return replaceInstUsesWith(CI, Res);
        }
    }
    LLVM_FALLTHROUGH;
  }
  case Intrinsic::vector_reduce_add: {
    if (IID == Intrinsic::vector_reduce_add) {
      // Convert vector_reduce_add(ZExt(<n x i1>)) to
      // ZExtOrTrunc(ctpop(bitcast <n x i1> to in)).
      // Convert vector_reduce_add(SExt(<n x i1>)) to
      // -ZExtOrTrunc(ctpop(bitcast <n x i1> to in)).
      // Convert vector_reduce_add(<n x i1>) to
      // Trunc(ctpop(bitcast <n x i1> to in)).
      Value *Arg = II->getArgOperand(0);
      Value *Vect;
      if (match(Arg, m_ZExtOrSExtOrSelf(m_Value(Vect)))) {
        if (auto *FTy = dyn_cast<FixedVectorType>(Vect->getType()))
          if (FTy->getElementType() == Builder.getInt1Ty()) {
            Value *V = Builder.CreateBitCast(
                Vect, Builder.getIntNTy(FTy->getNumElements()));
            Value *Res = Builder.CreateUnaryIntrinsic(Intrinsic::ctpop, V);
            if (Res->getType() != II->getType())
              Res = Builder.CreateZExtOrTrunc(Res, II->getType());
            if (Arg != Vect &&
                cast<Instruction>(Arg)->getOpcode() == Instruction::SExt)
              Res = Builder.CreateNeg(Res);
            return replaceInstUsesWith(CI, Res);
          }
      }
    }
    LLVM_FALLTHROUGH;
  }
  case Intrinsic::vector_reduce_xor: {
    if (IID == Intrinsic::vector_reduce_xor) {
      // Exclusive disjunction reduction over the vector with
      // (potentially-extended) i1 element type is actually a
      // (potentially-extended) arithmetic `add` reduction over the original
      // non-extended value:
      //   vector_reduce_xor(?ext(<n x i1>))
      //     -->
      //   ?ext(vector_reduce_add(<n x i1>))
      Value *Arg = II->getArgOperand(0);
      Value *Vect;
      if (match(Arg, m_ZExtOrSExtOrSelf(m_Value(Vect)))) {
        if (auto *FTy = dyn_cast<FixedVectorType>(Vect->getType()))
          if (FTy->getElementType() == Builder.getInt1Ty()) {
            Value *Res = Builder.CreateAddReduce(Vect);
            if (Arg != Vect)
              Res = Builder.CreateCast(cast<CastInst>(Arg)->getOpcode(), Res,
                                       II->getType());
            return replaceInstUsesWith(CI, Res);
          }
      }
    }
    LLVM_FALLTHROUGH;
  }
  case Intrinsic::vector_reduce_mul: {
    if (IID == Intrinsic::vector_reduce_mul) {
      // Multiplicative reduction over the vector with (potentially-extended)
      // i1 element type is actually a (potentially zero-extended)
      // logical `and` reduction over the original non-extended value:
      //   vector_reduce_mul(?ext(<n x i1>))
      //     -->
      //   zext(vector_reduce_and(<n x i1>))
      Value *Arg = II->getArgOperand(0);
      Value *Vect;
      if (match(Arg, m_ZExtOrSExtOrSelf(m_Value(Vect)))) {
        if (auto *FTy = dyn_cast<FixedVectorType>(Vect->getType()))
          if (FTy->getElementType() == Builder.getInt1Ty()) {
            Value *Res = Builder.CreateAndReduce(Vect);
            if (Res->getType() != II->getType())
              Res = Builder.CreateZExt(Res, II->getType());
            return replaceInstUsesWith(CI, Res);
          }
      }
    }
    LLVM_FALLTHROUGH;
  }
  case Intrinsic::vector_reduce_umin:
  case Intrinsic::vector_reduce_umax: {
    if (IID == Intrinsic::vector_reduce_umin ||
        IID == Intrinsic::vector_reduce_umax) {
      // UMin/UMax reduction over the vector with (potentially-extended)
      // i1 element type is actually a (potentially-extended)
      // logical `and`/`or` reduction over the original non-extended value:
      //   vector_reduce_u{min,max}(?ext(<n x i1>))
      //     -->
      //   ?ext(vector_reduce_{and,or}(<n x i1>))
      Value *Arg = II->getArgOperand(0);
      Value *Vect;
      if (match(Arg, m_ZExtOrSExtOrSelf(m_Value(Vect)))) {
        if (auto *FTy = dyn_cast<FixedVectorType>(Vect->getType()))
          if (FTy->getElementType() == Builder.getInt1Ty()) {
            Value *Res = IID == Intrinsic::vector_reduce_umin
                             ? Builder.CreateAndReduce(Vect)
                             : Builder.CreateOrReduce(Vect);
            if (Arg != Vect)
              Res = Builder.CreateCast(cast<CastInst>(Arg)->getOpcode(), Res,
                                       II->getType());
            return replaceInstUsesWith(CI, Res);
          }
      }
    }
    LLVM_FALLTHROUGH;
  }
  case Intrinsic::vector_reduce_smin:
  case Intrinsic::vector_reduce_smax: {
    if (IID == Intrinsic::vector_reduce_smin ||
        IID == Intrinsic::vector_reduce_smax) {
      // SMin/SMax reduction over the vector with (potentially-extended)
      // i1 element type is actually a (potentially-extended)
      // logical `and`/`or` reduction over the original non-extended value:
      //   vector_reduce_s{min,max}(<n x i1>)
      //     -->
      //   vector_reduce_{or,and}(<n x i1>)
      // and
      //   vector_reduce_s{min,max}(sext(<n x i1>))
      //     -->
      //   sext(vector_reduce_{or,and}(<n x i1>))
      // and
      //   vector_reduce_s{min,max}(zext(<n x i1>))
      //     -->
      //   zext(vector_reduce_{and,or}(<n x i1>))
      Value *Arg = II->getArgOperand(0);
      Value *Vect;
      if (match(Arg, m_ZExtOrSExtOrSelf(m_Value(Vect)))) {
        if (auto *FTy = dyn_cast<FixedVectorType>(Vect->getType()))
          if (FTy->getElementType() == Builder.getInt1Ty()) {
            Instruction::CastOps ExtOpc = Instruction::CastOps::CastOpsEnd;
            if (Arg != Vect)
              ExtOpc = cast<CastInst>(Arg)->getOpcode();
            Value *Res = ((IID == Intrinsic::vector_reduce_smin) ==
                          (ExtOpc == Instruction::CastOps::ZExt))
                             ? Builder.CreateAndReduce(Vect)
                             : Builder.CreateOrReduce(Vect);
            if (Arg != Vect)
              Res = Builder.CreateCast(ExtOpc, Res, II->getType());
            return replaceInstUsesWith(CI, Res);
          }
      }
    }
    LLVM_FALLTHROUGH;
  }

needs to be effectively copied here, with an appropriate set of tests.

  • iff we are asking for the cost of a zext/sext from <n x i1>,
    and all users are such integral reductions, then the cost of said extension
    should be 0, since it will be gone (see the sketch below).
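For example (illustrative IR), after the instcombine above, a reduction over an extended mask such as

%e = zext <8 x i1> %v to <8 x i32>
%s = call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> %e)

becomes

%b = bitcast <8 x i1> %v to i8
%p = call i8 @llvm.ctpop.i8(i8 %b)
%s = zext i8 %p to i32

so the <8 x i1> -> <8 x i32> extension never reaches codegen, and charging for it would overestimate the cost.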

@Simon: might you be interested in handling this?
If not, i could finish this i guess.

@LebedevRI mentioned in issue llvm/llvm-bugzilla-archive#51305

@llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 11, 2021
@RKSimon changed the title from "missing simplification for @llvm.vector.reduce.and on vector of i1" to "missing simplification for llvm.vector.reduce.and on vector of i1" Dec 14, 2021