We should be able to combine these into allof/anyof vector reductions; instead we end up with nested branch trees and scalar selects. https://gcc.godbolt.org/z/FImIuY

#include <x86intrin.h>

float test_merge_allof_v4sf(__v4sf t) {
  if ((t[0] < 0.0 && t[1] < 0.0 && t[2] < 0.0 && t[3] < 0.0) ||
      (t[0] > 1.0 && t[1] > 1.0 && t[2] > 1.0 && t[3] > 1.0))
    return 0;
  return t[0] + t[1];
}

float test_merge_anyof_v4sf(__v4sf t) {
  if ((t[0] < 0.0 || t[1] < 0.0 || t[2] < 0.0 || t[3] < 0.0) ||
      (t[0] > 1.0 || t[1] > 1.0 || t[2] > 1.0 || t[3] > 1.0))
    return 0;
  return t[0] + t[1];
}

float test_separate_allof_v4sf(__v4sf t) {
  if (t[0] < 0.0 && t[1] < 0.0 && t[2] < 0.0 && t[3] < 0.0)
    return 0;
  if (t[0] > 1.0 && t[1] > 1.0 && t[2] > 1.0 && t[3] > 1.0)
    return 0;
  return t[0] + t[1];
}

float test_separate_anyof_v4sf(__v4sf t) {
  if (t[0] < 0.0 || t[1] < 0.0 || t[2] < 0.0 || t[3] < 0.0)
    return 0;
  if (t[0] > 1.0 || t[1] > 1.0 || t[2] > 1.0 || t[3] > 1.0)
    return 0;
  return t[0] + t[1];
}

int test_separate_allof_v4si(__v4si t) {
  if (t[0] < 1 && t[1] < 1 && t[2] < 1 && t[3] < 1)
    return 0;
  if (t[0] > 255 && t[1] > 255 && t[2] > 255 && t[3] > 255)
    return 0;
  return t[0] + t[1];
}

int test_separate_anyof_v4si(__v4si t) {
  if (t[0] < 1 || t[1] < 1 || t[2] < 1 || t[3] < 1)
    return 0;
  if (t[0] > 255 || t[1] > 255 || t[2] > 255 || t[3] > 255)
    return 0;
  return t[0] + t[1];
}
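For reference, here is a hand-written sketch of the codegen we would hope for on the "merge_allof" case: one packed compare per range bound plus a movemask, instead of a branch tree. The function name is mine, and this is an illustration of the target lowering under SSE assumptions, not compiler output.

```c
#include <x86intrin.h>

/* Hand-written sketch of the hoped-for "allof" reduction: compare all four
 * lanes at once and test the sign-bit mask, instead of branching per lane. */
float test_merge_allof_v4sf_manual(__v4sf t) {
    __m128 lt = _mm_cmplt_ps((__m128)t, _mm_set1_ps(0.0f)); /* per-lane t[i] < 0.0 */
    __m128 gt = _mm_cmpgt_ps((__m128)t, _mm_set1_ps(1.0f)); /* per-lane t[i] > 1.0 */
    /* movemask packs the four lane results into 4 bits; 0xF == all lanes true */
    if (_mm_movemask_ps(lt) == 0xF || _mm_movemask_ps(gt) == 0xF)
        return 0.0f;
    return t[0] + t[1];
}
```

This matches the scalar short-circuit semantics because the vector compares have no side effects, so evaluating all four lanes eagerly is safe.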
I also just arrived at this.
https://godbolt.org/z/zIyI0_
I ran some benchmarks (not of just the snippet, but of the entire code), and this results in up to a ~2% improvement in some tested cases (and an 18% regression in a few cases, so: cost model!).
Any pointers?
(In reply to Roman Lebedev from comment #1)
> I also just arrived at this.
> https://godbolt.org/z/zIyI0_
> I did some benchmark (not of just the snippet, but of the entire code),
> and this results in up to ~-2% improvement in some tested cases.
> (and a +18% regression in a few cases, so cost model!)
>
> Any pointers?

The more general case:

bool test(const unsigned char* input) {
  return input[0] != 0xFF &&
         input[1] != 0xFF &&
         input[2] != 0xFF &&
         input[3] != 0xFF;
}

...is hard/impossible. If the 1st byte is 0xFF, then we are not allowed to load any further because that could be unmapped memory. I think we would require the 'dereferenceable' attribute to handle that case.
(In reply to Sanjay Patel from comment #2)
> The more general case:
>
> bool test(const unsigned char* input) {
>   return input[0] != 0xFF &&
>          input[1] != 0xFF &&
>          input[2] != 0xFF &&
>          input[3] != 0xFF;
> }
>
> ...is hard/impossible. If the 1st byte is 0xFF, then we are not allowed to
> load any further because that could be unmapped memory.
> I think we would require the 'dereferenceable' attribute to handle that case.

Yeah, the C variant isn't well-preserved. The whole load of i32 is legal there; I just kind of forgot about it.
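When the caller does guarantee that all four bytes are dereferenceable, the merged form can be written by hand: one 32-bit load plus a branchless "has a zero byte" bit trick. This is a sketch of the idea only; the helper name and the particular bit trick are mine, not from the thread.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Sketch: with 4 dereferenceable bytes, the short-circuit loads can be
 * merged into a single 32-bit load.  A byte of x equals 0xFF exactly when
 * the corresponding byte of (x ^ 0xFFFFFFFF) is zero, and the classic
 * "word has a zero byte" trick detects that without branching. */
static bool test_merged(const unsigned char *input) {
    uint32_t x;
    memcpy(&x, input, sizeof x);              /* one 4-byte load */
    uint32_t y = x ^ 0xFFFFFFFFu;             /* zero byte <=> 0xFF byte in x */
    uint32_t has_zero = (y - 0x01010101u) & ~y & 0x80808080u;
    return has_zero == 0;                     /* true iff no input byte is 0xFF */
}
```

The result is byte-order independent, since the trick only asks whether any byte is zero, not which one.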
There are multiple things to potentially fix here, but here's one that I can try to fix: SLP does not actually sort the candidates for a reduction, so the instruction order in the IR affects the groupings and what comes out. That's part of why we see reductions for some examples but not others.
(In reply to Sanjay Patel from comment #4)
> SLP does not actually sort the candidates for a reduction, so the
> instruction order in IR affects the groupings and what comes out. That's
> part of why we see reductions for some examples but not others.

Proposal: https://reviews.llvm.org/D87772
The code is now typically something like:

define float @test_merge_anyof_v4sf(<4 x float> %0) {
  %2 = extractelement <4 x float> %0, i32 0
  %3 = fcmp olt float %2, 0.000000e+00
  %4 = extractelement <4 x float> %0, i32 1
  %5 = fcmp olt float %4, 0.000000e+00
  %6 = select i1 %3, i1 true, i1 %5
  %7 = extractelement <4 x float> %0, i32 2
  %8 = fcmp olt float %7, 0.000000e+00
  %9 = select i1 %6, i1 true, i1 %8
  %10 = extractelement <4 x float> %0, i32 3
  %11 = fcmp olt float %10, 0.000000e+00
  %12 = select i1 %9, i1 true, i1 %11
  %13 = fcmp ogt float %2, 1.000000e+00
  %14 = select i1 %12, i1 true, i1 %13
  %15 = fcmp ogt float %4, 1.000000e+00
  %16 = select i1 %14, i1 true, i1 %15
  %17 = fcmp ogt float %7, 1.000000e+00
  %18 = select i1 %16, i1 true, i1 %17
  %19 = fcmp ogt float %10, 1.000000e+00
  %20 = select i1 %18, i1 true, i1 %19
  %21 = fadd float %2, %4
  %22 = select i1 %20, float 0.000000e+00, float %21
  ret float %22
}

which lowers to:

test_merge_anyof_v4sf:
  vmovss    .LCPI0_0(%rip), %xmm5    # xmm5 = mem[0],zero,zero,zero
  vmovshdup %xmm0, %xmm1             # xmm1 = xmm0[1,1,3,3]
  vpermilps $255, %xmm0, %xmm3       # xmm3 = xmm0[3,3,3,3]
  vpermilpd $1, %xmm0, %xmm2         # xmm2 = xmm0[1,0]
  vaddss    %xmm1, %xmm0, %xmm4
  vcmpltss  %xmm3, %xmm5, %xmm6
  vandnps   %xmm4, %xmm6, %xmm4
  vcmpltss  %xmm2, %xmm5, %xmm6
  vandnps   %xmm4, %xmm6, %xmm4
  vcmpltss  %xmm1, %xmm5, %xmm6
  vcmpltss  %xmm0, %xmm5, %xmm5
  vandnps   %xmm4, %xmm6, %xmm4
  vandnps   %xmm4, %xmm5, %xmm4
  vxorps    %xmm5, %xmm5, %xmm5
  vcmpltss  %xmm5, %xmm3, %xmm3
  vcmpltss  %xmm5, %xmm2, %xmm2
  vcmpltss  %xmm5, %xmm1, %xmm1
  vcmpltss  %xmm5, %xmm0, %xmm0
  vandnps   %xmm4, %xmm3, %xmm3
  vandnps   %xmm3, %xmm2, %xmm2
  vandnps   %xmm2, %xmm1, %xmm1
  vandnps   %xmm1, %xmm0, %xmm0
  retq
We have to be careful in IR, as the

  select i1 %12, i1 true, i1 %13

pattern is sensitive to poison; I don't think we can create reduction intrinsics from it?
(In reply to Simon Pilgrim from comment #7)
> We have to be careful in IR as the
>
> select i1 %12, i1 true, i1 %13
>
> pattern is sensitive to poison, I don't think we can create reduction
> intrinsics from it?

We can do it with freeze:
https://alive2.llvm.org/ce/z/RQ_yhV

So we need to:
1. Have SLP recognize the select form of logic ops.
2. Insert a freeze to make the transform poison-safe.
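To see why the freeze is needed, here is a toy three-valued model of i1-with-poison written in C. This is my simplification for illustration, not LLVM's actual semantics machinery: `or` propagates poison from either operand, the select form shields a poison second operand when the condition is true, and `freeze` replaces poison with some arbitrary defined value, which is why `or %a, freeze(%b)` refines `select i1 %a, i1 true, i1 %b` (as in the alive2 link above).

```c
#include <assert.h>

/* Toy model of LLVM i1 values that may be poison (a simplification). */
typedef enum { V_FALSE = 0, V_TRUE = 1, V_POISON = 2 } i1;

static i1 or_i1(i1 a, i1 b) {            /* or: poison in, poison out */
    if (a == V_POISON || b == V_POISON) return V_POISON;
    return (a == V_TRUE || b == V_TRUE) ? V_TRUE : V_FALSE;
}

static i1 select_true(i1 a, i1 b) {      /* select i1 a, i1 true, i1 b */
    if (a == V_POISON) return V_POISON;
    return (a == V_TRUE) ? V_TRUE : b;   /* poison b is ignored when a is true */
}

static i1 freeze_i1(i1 v) {              /* freeze: pick some defined value */
    return (v == V_POISON) ? V_FALSE : v;  /* any fixed choice would do */
}
```

With this model, `select_true(V_TRUE, V_POISON)` is `V_TRUE` while `or_i1(V_TRUE, V_POISON)` is poison, so a naive select-to-or rewrite is unsound; routing the second operand through `freeze_i1` first restores a defined (and therefore refining) result.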
I drafted a hack for SLP reduction matching, and it partly works on the minimal patterns without obviously breaking anything else. I'll try to clean that up. I'm not sure yet whether we'll still need other changes (SimplifyCFG or codegen?).
https://reviews.llvm.org/D105730
(In reply to Sanjay Patel from comment #10) > https://reviews.llvm.org/D105730 +1 Thank goodness for freeze :)