51259 – missing simplification for @llvm.vector.reduce.and on vector of i1

LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filled before November 26, 2021. Use github to submit LLVM bugs

Bug 51259 - missing simplification for @llvm.vector.reduce.and on vector of i1

Summary: missing simplification for @llvm.vector.reduce.and on vector of i1

Status:	NEW

Alias:	None

Product:	libraries
Classification:	Unclassified
Component:	Scalar Optimizations (show other bugs)
Version:	trunk
Hardware:	PC All

Importance:	P enhancement
Assignee:	Roman Lebedev

URL:
Keywords:

Depends on:
Blocks:

Reported:	2021-07-28 18:43 PDT by Richard Smith
Modified:	2021-08-11 06:09 PDT (History)
CC List:	6 users (show)

See Also:	51312 51315 51305 51306
Fixed By Commit(s):

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Richard Smith 2021-07-28 18:43:18 PDT

Live demo: https://godbolt.org/z/a6YaahdPP

Example:

[[gnu::weak]] void do_this() {}
[[gnu::weak]] void do_that() {}

void f1(unsigned char const p[8]) {
  if (p[0] != 0x00 & p[1] != 0x00 & p[2] != 0x00 & p[3] != 0x00 & p[4] != 0x00 &
      p[5] != 0x00 & p[6] != 0x00 & p[7] != 0x00) {
    do_this();
  } else {
    do_that();
  }
}

void f2(unsigned const char *p) {
  using T [[gnu::vector_size(8), gnu::aligned(1)]] = unsigned char;
  T same = *(T *)p == (T){0, 0, 0, 0, 0, 0, 0, 0};
  if ((unsigned long)same == 0) {
    do_this();
  } else {
    do_that();
  }
}

This results in the following:

f1(unsigned char const*):                               # @f1(unsigned char const*)
        vmovq   xmm0, qword ptr [rdi]           # xmm0 = mem[0],zero
        vpxor   xmm1, xmm1, xmm1
        vpcmpeqb        xmm0, xmm0, xmm1
        vpmovmskb       eax, xmm0
        not     eax
        cmp     al, -1
        jne     ...

f2(unsigned char const*):                               # @f2(unsigned char const*)
        vmovq   xmm0, qword ptr [rdi]           # xmm0 = mem[0],zero
        vpxor   xmm1, xmm1, xmm1
        vpcmpeqb        xmm0, xmm0, xmm1
        vmovq   rax, xmm0
        test    rax, rax
        je      ...

I think these should produce the same assembly, and the result from f2 looks better to me (though both are the same size). Presumably we'd need to recognize that after vpcmpeqb, each lane in xmm0 is either all-zeros or all-ones, so the vpmovmskb is redundant.

Comment 1 Richard Smith 2021-07-28 18:55:14 PDT

It looks like we might be missing a simplification of @llvm.vector.reduce.and on a vector of i1: replacing

  %5 = call i1 @llvm.vector.reduce.and.v8i1(<8 x i1> %4)

with an icmp looks like it should result in the better IR. (Thanks to Nick Lewycky for pointing this out.)

Comment 2 Simon Pilgrim 2021-07-29 05:58:36 PDT

Should we always be canonicalizing <X x i1> and/or/xor reductions to bitcasts+icmp/parity do you think?

IIRC we already do this in a couple of cases already.

Comment 3 Sanjay Patel 2021-07-29 06:09:11 PDT

(In reply to Simon Pilgrim from comment #2)
> Should we always be canonicalizing <X x i1> and/or/xor reductions to
> bitcasts+icmp/parity do you think?
> 
> IIRC we already do this in a couple of cases already.

Already done with:
https://reviews.llvm.org/D97406

The godbolt link in the description is testing with 12.0.1. Here's trunk:
https://godbolt.org/z/6vE6T81Yn

There's still a difference in the IR and asm though. We might want to change both.

Comment 4 Roman Lebedev 2021-07-29 06:21:56 PDT

(In reply to Simon Pilgrim from comment #2)
> Should we always be canonicalizing <X x i1> and/or/xor reductions to
> bitcasts+icmp/parity do you think?
> 
> IIRC we already do this in a couple of cases already.

Surely.

https://godbolt.org/z/fhrPzvxzv
So we are missing xor/mul/comparison handling.
i1 xor is just add: https://alive2.llvm.org/ce/z/GZToKH
i1 mul is just and: https://alive2.llvm.org/ce/z/R4Apqp
i1 umax is just or: https://alive2.llvm.org/ce/z/dSE3x_
i1 smin is just or: https://alive2.llvm.org/ce/z/49Z8yh
i1 umin is just and: https://alive2.llvm.org/ce/z/SfZ2pU
i1 smax is just and: https://alive2.llvm.org/ce/z/ndJ2Ze

Comment 5 Roman Lebedev 2021-08-02 14:56:37 PDT

Ok, i've added all the missing instcombine pieces,
all integer reductions w/ (potentially extended i1 element type
are now expanded.

We now probably want to adjust costmodel to reflect that,
and i'm not sure if we want to do some similar backend handling?

Comment 6 Richard Smith 2021-08-03 00:27:01 PDT

Awesome, thanks! Minor point: the original two examples still don't canonicalize to the same IR:

f1 uses

  %5 = bitcast <8 x i1> %4 to i8
  %6 = icmp eq i8 %5, 0

resulting in

        vpmovmskb       eax, xmm0
        test    al, al

f2 uses

  %5 = sext <8 x i1> %4 to <8 x i8>
  %6 = bitcast <8 x i8> %5 to i64
  %7 = icmp eq i64 %6, 0

        vmovq   rax, xmm0
        test    rax, rax

I have no idea which of these is better (if either), but presumably there's an instcombine missing or similar?

Comment 7 Roman Lebedev 2021-08-03 02:27:00 PDT

(In reply to Richard Smith from comment #6)
> Awesome, thanks! Minor point: the original two examples still don't
> canonicalize to the same IR:
> 
> f1 uses
> 
>   %5 = bitcast <8 x i1> %4 to i8
>   %6 = icmp eq i8 %5, 0
> 
> resulting in
> 
>         vpmovmskb       eax, xmm0
>         test    al, al
> 
> f2 uses
> 
>   %5 = sext <8 x i1> %4 to <8 x i8>
>   %6 = bitcast <8 x i8> %5 to i64
>   %7 = icmp eq i64 %6, 0
> 
>         vmovq   rax, xmm0
>         test    rax, rax
> 
> I have no idea which of these is better (if either), but presumably there's
> an instcombine missing or similar?

Yep, looks like we fail to drop sext before bitcast-icmp: https://godbolt.org/z/Mf8vMfs5r

Comment 8 Roman Lebedev 2021-08-11 06:09:24 PDT

Alright, i think the only thing we are missing before we can close this is cost model changes.

This block:
https://github.com/llvm/llvm-project/blob/01d59c0de822099c62f12f275c41338f6df9f5ac/llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp#L1978-L2147
needs to be effectively copied here:
https://github.com/llvm/llvm-project/blob/01d59c0de822099c62f12f275c41338f6df9f5ac/llvm/include/llvm/CodeGen/BasicTTIImpl.h#L1533
with an appropriate set of tests.
+ iff we are asking for the cost of an zext/sext from `<n x i1>`,
and all users are such integral reductions, then the cost of said extension
should be 0, since it will be gone.

@Simon: might you be interested in handling with this?
If not, i could finish this i guess.