
missing simplification for llvm.vector.reduce.and on vector of i1 #50603

Open
zygoloid mannequin opened this issue Jul 29, 2021 · 10 comments
Labels: bugzilla (Issues migrated from bugzilla)


zygoloid mannequin commented Jul 29, 2021

Bugzilla Link: 51259
Version: trunk
OS: All
CC: @topperc, @LebedevRI, @RKSimon, @phoebewang, @rotateright

Extended Description

Live demo: https://godbolt.org/z/a6YaahdPP

Example:

[[gnu::weak]] void do_this() {}
[[gnu::weak]] void do_that() {}

void f1(unsigned char const p[8]) {
  if (p[0] != 0x00 & p[1] != 0x00 & p[2] != 0x00 & p[3] != 0x00 & p[4] != 0x00 &
      p[5] != 0x00 & p[6] != 0x00 & p[7] != 0x00) {
    do_this();
  } else {
    do_that();
  }
}

void f2(unsigned const char *p) {
  using T [[gnu::vector_size(8), gnu::aligned(1)]] = unsigned char;
  T same = *(T *)p == (T){0, 0, 0, 0, 0, 0, 0, 0};
  if ((unsigned long)same == 0) {
    do_this();
  } else {
    do_that();
  }
}

This results in the following:

f1(unsigned char const*):                               # @f1(unsigned char const*)
        vmovq   xmm0, qword ptr [rdi]           # xmm0 = mem[0],zero
        vpxor   xmm1, xmm1, xmm1
        vpcmpeqb        xmm0, xmm0, xmm1
        vpmovmskb       eax, xmm0
        not     eax
        cmp     al, -1
        jne     ...

f2(unsigned char const*):                               # @f2(unsigned char const*)
        vmovq   xmm0, qword ptr [rdi]           # xmm0 = mem[0],zero
        vpxor   xmm1, xmm1, xmm1
        vpcmpeqb        xmm0, xmm0, xmm1
        vmovq   rax, xmm0
        test    rax, rax
        je      ...

I think these should produce the same assembly, and the result from f2 looks better to me (though both are the same size). Presumably we'd need to recognize that after vpcmpeqb, each lane in xmm0 is either all-zeros or all-ones, so the vpmovmskb is redundant.
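At the IR level, f1 boils down to an and-reduction of an <8 x i1> compare result, roughly like this (the %5 line is the one quoted in the next comment; %3 stands for whatever value holds the loaded bytes):

%4 = icmp ne <8 x i8> %3, zeroinitializer                 ; one i1 per byte of p
%5 = call i1 @llvm.vector.reduce.and.v8i1(<8 x i1> %4)    ; true iff all bytes are nonzero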

zygoloid mannequin commented Jul 29, 2021

assigned to @LebedevRI

zygoloid mannequin commented Jul 29, 2021

It looks like we might be missing a simplification of @llvm.vector.reduce.and on a vector of i1: replacing

%5 = call i1 @llvm.vector.reduce.and.v8i1(<8 x i1> %4)

with an icmp looks like it should result in the better IR. (Thanks to Nick Lewycky for pointing this out.)
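For reference, the icmp-based form of an i1 and-reduction would look something like this (names are placeholders; the shape mirrors the canonicalization comments in the instcombine snippet quoted later in this thread):

%mask = bitcast <8 x i1> %4 to i8
%res  = icmp eq i8 %mask, -1      ; and-reduction: true iff every lane is set

and the matching or-reduction would be icmp ne i8 %mask, 0.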

RKSimon commented Jul 29, 2021

Should we always be canonicalizing and/or/xor reductions to bitcasts+icmp/parity do you think?

IIRC we already do this in a couple of cases already.

rotateright (Contributor) commented:

> Should we always be canonicalizing and/or/xor reductions to
> bitcasts+icmp/parity do you think?
>
> IIRC we already do this in a couple of cases already.

Already done with:
https://reviews.llvm.org/D97406

The godbolt link in the description is testing with 12.0.1. Here's trunk:
https://godbolt.org/z/6vE6T81Yn

There's still a difference in the IR and asm though. We might want to change both.

LebedevRI (Member) commented:

> Should we always be canonicalizing and/or/xor reductions to
> bitcasts+icmp/parity do you think?
>
> IIRC we already do this in a couple of cases already.

Surely.

https://godbolt.org/z/fhrPzvxzv
So we are missing xor/mul/comparison handling.
i1 xor is just add: https://alive2.llvm.org/ce/z/GZToKH
i1 mul is just and: https://alive2.llvm.org/ce/z/R4Apqp
i1 umax is just or: https://alive2.llvm.org/ce/z/dSE3x_
i1 smin is just or: https://alive2.llvm.org/ce/z/49Z8yh
i1 umin is just and: https://alive2.llvm.org/ce/z/SfZ2pU
i1 smax is just and: https://alive2.llvm.org/ce/z/ndJ2Ze
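As a concrete sketch of the xor case from the list above (names are placeholders): the xor-reduction of an i1 vector is just the parity of the mask, i.e. the low bit of its popcount:

; assuming %v is some <8 x i1> mask
%red  = call i1 @llvm.vector.reduce.xor.v8i1(<8 x i1> %v)
; ...is equivalent to:
%bits = bitcast <8 x i1> %v to i8
%pop  = call i8 @llvm.ctpop.i8(i8 %bits)
%par  = trunc i8 %pop to i1       ; parity of the set lanes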

LebedevRI (Member) commented:

Ok, i've added all the missing instcombine pieces:
all integer reductions w/ (potentially extended) i1 element type
are now expanded.

We now probably want to adjust the cost model to reflect that,
and i'm not sure if we want to do some similar backend handling?

zygoloid mannequin commented Aug 3, 2021

Awesome, thanks! Minor point: the original two examples still don't canonicalize to the same IR:

f1 uses

%5 = bitcast <8 x i1> %4 to i8
%6 = icmp eq i8 %5, 0

resulting in

    vpmovmskb       eax, xmm0
    test    al, al

f2 uses

%5 = sext <8 x i1> %4 to <8 x i8>
%6 = bitcast <8 x i8> %5 to i64
%7 = icmp eq i64 %6, 0

    vmovq   rax, xmm0
    test    rax, rax

I have no idea which of these is better (if either), but presumably there's an instcombine missing or similar?

LebedevRI (Member) commented:

> Awesome, thanks! Minor point: the original two examples still don't
> canonicalize to the same IR:
>
> f1 uses
>
> %5 = bitcast <8 x i1> %4 to i8
> %6 = icmp eq i8 %5, 0
>
> resulting in
>
>     vpmovmskb       eax, xmm0
>     test    al, al
>
> f2 uses
>
> %5 = sext <8 x i1> %4 to <8 x i8>
> %6 = bitcast <8 x i8> %5 to i64
> %7 = icmp eq i64 %6, 0
>
>     vmovq   rax, xmm0
>     test    rax, rax
>
> I have no idea which of these is better (if either), but presumably there's
> an instcombine missing or similar?

Yep, looks like we fail to drop sext before bitcast-icmp: https://godbolt.org/z/Mf8vMfs5r
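Concretely, the missing fold is of the following shape (names are illustrative, matching the f2 IR above). The i64 produced by widening each i1 lane to a 0/-1 byte is zero exactly when the narrow i1 mask is zero, so

%5 = sext <8 x i1> %4 to <8 x i8>
%6 = bitcast <8 x i8> %5 to i64
%7 = icmp eq i64 %6, 0

should be foldable to the f1 form:

%m = bitcast <8 x i1> %4 to i8
%r = icmp eq i8 %m, 0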

LebedevRI (Member) commented:

Alright, i think the only thing we are missing before we can close this is cost model changes.

This block:

  case Intrinsic::vector_reduce_or:
  case Intrinsic::vector_reduce_and: {
    // Canonicalize logical or/and reductions:
    // Or reduction for i1 is represented as:
    // %val = bitcast <ReduxWidth x i1> to iReduxWidth
    // %res = cmp ne iReduxWidth %val, 0
    // And reduction for i1 is represented as:
    // %val = bitcast <ReduxWidth x i1> to iReduxWidth
    // %res = cmp eq iReduxWidth %val, 11111
    Value *Arg = II->getArgOperand(0);
    Value *Vect;
    if (match(Arg, m_ZExtOrSExtOrSelf(m_Value(Vect)))) {
      if (auto *FTy = dyn_cast<FixedVectorType>(Vect->getType()))
        if (FTy->getElementType() == Builder.getInt1Ty()) {
          Value *Res = Builder.CreateBitCast(
              Vect, Builder.getIntNTy(FTy->getNumElements()));
          if (IID == Intrinsic::vector_reduce_and) {
            Res = Builder.CreateICmpEQ(
                Res, ConstantInt::getAllOnesValue(Res->getType()));
          } else {
            assert(IID == Intrinsic::vector_reduce_or &&
                   "Expected or reduction.");
            Res = Builder.CreateIsNotNull(Res);
          }
          if (Arg != Vect)
            Res = Builder.CreateCast(cast<CastInst>(Arg)->getOpcode(), Res,
                                     II->getType());
          return replaceInstUsesWith(CI, Res);
        }
    }
    LLVM_FALLTHROUGH;
  }
  case Intrinsic::vector_reduce_add: {
    if (IID == Intrinsic::vector_reduce_add) {
      // Convert vector_reduce_add(ZExt(<n x i1>)) to
      // ZExtOrTrunc(ctpop(bitcast <n x i1> to in)).
      // Convert vector_reduce_add(SExt(<n x i1>)) to
      // -ZExtOrTrunc(ctpop(bitcast <n x i1> to in)).
      // Convert vector_reduce_add(<n x i1>) to
      // Trunc(ctpop(bitcast <n x i1> to in)).
      Value *Arg = II->getArgOperand(0);
      Value *Vect;
      if (match(Arg, m_ZExtOrSExtOrSelf(m_Value(Vect)))) {
        if (auto *FTy = dyn_cast<FixedVectorType>(Vect->getType()))
          if (FTy->getElementType() == Builder.getInt1Ty()) {
            Value *V = Builder.CreateBitCast(
                Vect, Builder.getIntNTy(FTy->getNumElements()));
            Value *Res = Builder.CreateUnaryIntrinsic(Intrinsic::ctpop, V);
            if (Res->getType() != II->getType())
              Res = Builder.CreateZExtOrTrunc(Res, II->getType());
            if (Arg != Vect &&
                cast<Instruction>(Arg)->getOpcode() == Instruction::SExt)
              Res = Builder.CreateNeg(Res);
            return replaceInstUsesWith(CI, Res);
          }
      }
    }
    LLVM_FALLTHROUGH;
  }
  case Intrinsic::vector_reduce_xor: {
    if (IID == Intrinsic::vector_reduce_xor) {
      // Exclusive disjunction reduction over the vector with
      // (potentially-extended) i1 element type is actually a
      // (potentially-extended) arithmetic `add` reduction over the original
      // non-extended value:
      //   vector_reduce_xor(?ext(<n x i1>))
      //     -->
      //   ?ext(vector_reduce_add(<n x i1>))
      Value *Arg = II->getArgOperand(0);
      Value *Vect;
      if (match(Arg, m_ZExtOrSExtOrSelf(m_Value(Vect)))) {
        if (auto *FTy = dyn_cast<FixedVectorType>(Vect->getType()))
          if (FTy->getElementType() == Builder.getInt1Ty()) {
            Value *Res = Builder.CreateAddReduce(Vect);
            if (Arg != Vect)
              Res = Builder.CreateCast(cast<CastInst>(Arg)->getOpcode(), Res,
                                       II->getType());
            return replaceInstUsesWith(CI, Res);
          }
      }
    }
    LLVM_FALLTHROUGH;
  }
  case Intrinsic::vector_reduce_mul: {
    if (IID == Intrinsic::vector_reduce_mul) {
      // Multiplicative reduction over the vector with (potentially-extended)
      // i1 element type is actually a (potentially zero-extended)
      // logical `and` reduction over the original non-extended value:
      //   vector_reduce_mul(?ext(<n x i1>))
      //     -->
      //   zext(vector_reduce_and(<n x i1>))
      Value *Arg = II->getArgOperand(0);
      Value *Vect;
      if (match(Arg, m_ZExtOrSExtOrSelf(m_Value(Vect)))) {
        if (auto *FTy = dyn_cast<FixedVectorType>(Vect->getType()))
          if (FTy->getElementType() == Builder.getInt1Ty()) {
            Value *Res = Builder.CreateAndReduce(Vect);
            if (Res->getType() != II->getType())
              Res = Builder.CreateZExt(Res, II->getType());
            return replaceInstUsesWith(CI, Res);
          }
      }
    }
    LLVM_FALLTHROUGH;
  }
  case Intrinsic::vector_reduce_umin:
  case Intrinsic::vector_reduce_umax: {
    if (IID == Intrinsic::vector_reduce_umin ||
        IID == Intrinsic::vector_reduce_umax) {
      // UMin/UMax reduction over the vector with (potentially-extended)
      // i1 element type is actually a (potentially-extended)
      // logical `and`/`or` reduction over the original non-extended value:
      //   vector_reduce_u{min,max}(?ext(<n x i1>))
      //     -->
      //   ?ext(vector_reduce_{and,or}(<n x i1>))
      Value *Arg = II->getArgOperand(0);
      Value *Vect;
      if (match(Arg, m_ZExtOrSExtOrSelf(m_Value(Vect)))) {
        if (auto *FTy = dyn_cast<FixedVectorType>(Vect->getType()))
          if (FTy->getElementType() == Builder.getInt1Ty()) {
            Value *Res = IID == Intrinsic::vector_reduce_umin
                             ? Builder.CreateAndReduce(Vect)
                             : Builder.CreateOrReduce(Vect);
            if (Arg != Vect)
              Res = Builder.CreateCast(cast<CastInst>(Arg)->getOpcode(), Res,
                                       II->getType());
            return replaceInstUsesWith(CI, Res);
          }
      }
    }
    LLVM_FALLTHROUGH;
  }
  case Intrinsic::vector_reduce_smin:
  case Intrinsic::vector_reduce_smax: {
    if (IID == Intrinsic::vector_reduce_smin ||
        IID == Intrinsic::vector_reduce_smax) {
      // SMin/SMax reduction over the vector with (potentially-extended)
      // i1 element type is actually a (potentially-extended)
      // logical `and`/`or` reduction over the original non-extended value:
      //   vector_reduce_s{min,max}(<n x i1>)
      //     -->
      //   vector_reduce_{or,and}(<n x i1>)
      // and
      //   vector_reduce_s{min,max}(sext(<n x i1>))
      //     -->
      //   sext(vector_reduce_{or,and}(<n x i1>))
      // and
      //   vector_reduce_s{min,max}(zext(<n x i1>))
      //     -->
      //   zext(vector_reduce_{and,or}(<n x i1>))
      Value *Arg = II->getArgOperand(0);
      Value *Vect;
      if (match(Arg, m_ZExtOrSExtOrSelf(m_Value(Vect)))) {
        if (auto *FTy = dyn_cast<FixedVectorType>(Vect->getType()))
          if (FTy->getElementType() == Builder.getInt1Ty()) {
            Instruction::CastOps ExtOpc = Instruction::CastOps::CastOpsEnd;
            if (Arg != Vect)
              ExtOpc = cast<CastInst>(Arg)->getOpcode();
            Value *Res = ((IID == Intrinsic::vector_reduce_smin) ==
                          (ExtOpc == Instruction::CastOps::ZExt))
                             ? Builder.CreateAndReduce(Vect)
                             : Builder.CreateOrReduce(Vect);
            if (Arg != Vect)
              Res = Builder.CreateCast(ExtOpc, Res, II->getType());
            return replaceInstUsesWith(CI, Res);
          }
      }
    }
    LLVM_FALLTHROUGH;
  }

needs to be effectively copied here, with an appropriate set of tests.

  • iff we are asking for the cost of a zext/sext from <n x i1>,
    and all users are such integral reductions, then the cost of said extension
    should be 0, since it will be gone (see the sketch below).
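For example (illustrative IR), after the instcombine above, a reduction over an extended mask such as

%e = zext <8 x i1> %v to <8 x i32>
%s = call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> %e)

becomes

%b = bitcast <8 x i1> %v to i8
%p = call i8 @llvm.ctpop.i8(i8 %b)
%s = zext i8 %p to i32

so the <8 x i1> -> <8 x i32> extension never reaches codegen, and charging for it would overestimate the cost.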

@Simon: might you be interested in handling this?
If not, i could finish this i guess.

@LebedevRI mentioned in issue llvm/llvm-bugzilla-archive#51305

@llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 11, 2021
@RKSimon changed the title from "missing simplification for @llvm.vector.reduce.and on vector of i1" to "missing simplification for llvm.vector.reduce.and on vector of i1" Dec 14, 2021