LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filled before November 26, 2021. Use github to submit LLVM bugs

Bug 41312 - [SLP] Missed icmp/fcmp allof/anyof reductions
Summary: [SLP] Missed icmp/fcmp allof/anyof reductions
Status: NEW
Alias: None
Product: libraries
Classification: Unclassified
Component: Scalar Optimizations (show other bugs)
Version: trunk
Hardware: PC Windows NT
: P enhancement
Assignee: Sanjay Patel
URL:
Keywords:
Depends on:
Blocks: 49930
  Show dependency tree
 
Reported: 2019-03-29 14:15 PDT by Simon Pilgrim
Modified: 2021-07-10 02:22 PDT (History)
6 users (show)

See Also:
Fixed By Commit(s):


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Simon Pilgrim 2019-03-29 14:15:39 PDT
We should be able to combine these to allof/anyof vector reductions, instead we end up with nested branch trees and scalar selects.

https://gcc.godbolt.org/z/FImIuY

#include <x86intrin.h>

float test_merge_allof_v4sf(__v4sf t) {
  if((t[0] < 0.0 && t[1] < 0.0 && t[2] < 0.0 && t[3] < 0.0) || 
     (t[0] > 1.0 && t[1] > 1.0 && t[2] > 1.0 && t[3] > 1.0))
    return 0;
  return t[0] + t[1];
}
float test_merge_anyof_v4sf(__v4sf t) {
  if((t[0] < 0.0 || t[1] < 0.0 || t[2] < 0.0 || t[3] < 0.0) || 
     (t[0] > 1.0 || t[1] > 1.0 || t[2] > 1.0 || t[3] > 1.0))
    return 0;
  return t[0] + t[1];
}

float test_separate_allof_v4sf(__v4sf t) {
  if((t[0] < 0.0 && t[1] < 0.0 && t[2] < 0.0 && t[3] < 0.0))
    return 0;
  if((t[0] > 1.0 && t[1] > 1.0 && t[2] > 1.0 && t[3] > 1.0))
    return 0;
  return t[0] + t[1];
}
float test_separate_anyof_v4sf(__v4sf t) {
  if((t[0] < 0.0 || t[1] < 0.0 || t[2] < 0.0 || t[3] < 0.0))
    return 0;
  if((t[0] > 1.0 || t[1] > 1.0 || t[2] > 1.0 || t[3] > 1.0))
    return 0;
  return t[0] + t[1];
}

int test_separate_allof_v4si(__v4si t) {
  if((t[0] < 1 && t[1] < 1 && t[2] < 1 && t[3] < 1))
    return 0;
  if((t[0] > 255 && t[1] > 255 && t[2] > 255 && t[3] > 255))
    return 0;
  return t[0] + t[1];
}
int test_separate_anyof_v4si(__v4si t) {
  if((t[0] < 1 || t[1] < 1 || t[2] < 1 || t[3] < 1))
    return 0;
  if((t[0] > 255 || t[1] > 255 || t[2] > 255 || t[3] > 255))
    return 0;
  return t[0] + t[1];
}
Comment 1 Roman Lebedev 2019-05-20 12:29:10 PDT
I also just arrived at this.
https://godbolt.org/z/zIyI0_
I did some benchmark (not of just the snippet, but of the entire code),
and this results in up to ~-2% improvement in some tested cases.
(and a +18% regression in a few cases, so cost model!)

Any pointers?
Comment 2 Sanjay Patel 2019-05-21 12:08:03 PDT
(In reply to Roman Lebedev from comment #1)
> I also just arrived at this.
> https://godbolt.org/z/zIyI0_
> I did some benchmark (not of just the snippet, but of the entire code),
> and this results in up to ~-2% improvement in some tested cases.
> (and a +18% regression in a few cases, so cost model!)
> 
> Any pointers?

The more general case:
bool test(const unsigned char* input) {
    return input[0] != 0xFF &&
           input[1] != 0xFF &&
           input[2] != 0xFF &&
           input[3] != 0xFF;
}

...is hard/impossible. If the 1st byte is 0xFF, then we are not allowed to load any further because that could be unmapped memory. I think we would require the 'dereferenceable' attribute to handle that case.
Comment 3 Roman Lebedev 2019-05-21 12:16:43 PDT
(In reply to Sanjay Patel from comment #2)
> (In reply to Roman Lebedev from comment #1)
> > I also just arrived at this.
> > https://godbolt.org/z/zIyI0_
> > I did some benchmark (not of just the snippet, but of the entire code),
> > and this results in up to ~-2% improvement in some tested cases.
> > (and a +18% regression in a few cases, so cost model!)
> > 
> > Any pointers?
> 
> The more general case:
> bool test(const unsigned char* input) {
>     return input[0] != 0xFF &&
>            input[1] != 0xFF &&
>            input[2] != 0xFF &&
>            input[3] != 0xFF;
> }
> 
> ...is hard/impossible. If the 1st byte is 0xFF, then we are not allowed to
> load any further because that could be unmapped memory.

> I think we would
> require the 'dereferenceable' attribute to handle that case.

Yeah, the C variant isn't well-preserved. The whole load if i32 is legal there,
i just kind of forgot to about it.
Comment 4 Sanjay Patel 2020-09-16 07:43:16 PDT
There are multiple things potentially to fix here...but here's one that I can try to fix:

SLP does not actually sort the candidates for a reduction, so the instruction order in IR affects the groupings and what comes out. That's part of why we see reductions for some examples but not others.
Comment 5 Sanjay Patel 2020-09-16 08:56:33 PDT
(In reply to Sanjay Patel from comment #4) 
> SLP does not actually sort the candidates for a reduction, so the
> instruction order in IR affects the groupings and what comes out. That's
> part of why we see reductions for some examples but not others.

Proposal:
https://reviews.llvm.org/D87772
Comment 6 Simon Pilgrim 2021-06-02 03:33:05 PDT
The code is now typically something like:

define float @test_merge_anyof_v4sf(<4 x float> %0) {
  %2 = extractelement <4 x float> %0, i32 0
  %3 = fcmp olt float %2, 0.000000e+00
  %4 = extractelement <4 x float> %0, i32 1
  %5 = fcmp olt float %4, 0.000000e+00
  %6 = select i1 %3, i1 true, i1 %5
  %7 = extractelement <4 x float> %0, i32 2
  %8 = fcmp olt float %7, 0.000000e+00
  %9 = select i1 %6, i1 true, i1 %8
  %10 = extractelement <4 x float> %0, i32 3
  %11 = fcmp olt float %10, 0.000000e+00
  %12 = select i1 %9, i1 true, i1 %11
  %13 = fcmp ogt float %2, 1.000000e+00
  %14 = select i1 %12, i1 true, i1 %13
  %15 = fcmp ogt float %4, 1.000000e+00
  %16 = select i1 %14, i1 true, i1 %15
  %17 = fcmp ogt float %7, 1.000000e+00
  %18 = select i1 %16, i1 true, i1 %17
  %19 = fcmp ogt float %10, 1.000000e+00
  %20 = select i1 %18, i1 true, i1 %19
  %21 = fadd float %2, %4
  %22 = select i1 %20, float 0.000000e+00, float %21
  ret float %22
}

which lowers to:

test_merge_anyof_v4sf:
        vmovss  .LCPI0_0(%rip), %xmm5           # xmm5 = mem[0],zero,zero,zero
        vmovshdup       %xmm0, %xmm1            # xmm1 = xmm0[1,1,3,3]
        vpermilps       $255, %xmm0, %xmm3      # xmm3 = xmm0[3,3,3,3]
        vpermilpd       $1, %xmm0, %xmm2        # xmm2 = xmm0[1,0]
        vaddss  %xmm1, %xmm0, %xmm4
        vcmpltss        %xmm3, %xmm5, %xmm6
        vandnps %xmm4, %xmm6, %xmm4
        vcmpltss        %xmm2, %xmm5, %xmm6
        vandnps %xmm4, %xmm6, %xmm4
        vcmpltss        %xmm1, %xmm5, %xmm6
        vcmpltss        %xmm0, %xmm5, %xmm5
        vandnps %xmm4, %xmm6, %xmm4
        vandnps %xmm4, %xmm5, %xmm4
        vxorps  %xmm5, %xmm5, %xmm5
        vcmpltss        %xmm5, %xmm3, %xmm3
        vcmpltss        %xmm5, %xmm2, %xmm2
        vcmpltss        %xmm5, %xmm1, %xmm1
        vcmpltss        %xmm5, %xmm0, %xmm0
        vandnps %xmm4, %xmm3, %xmm3
        vandnps %xmm3, %xmm2, %xmm2
        vandnps %xmm2, %xmm1, %xmm1
        vandnps %xmm1, %xmm0, %xmm0
        retq
Comment 7 Simon Pilgrim 2021-06-02 03:37:00 PDT
We have to be careful in IR as the 

  select i1 %12, i1 true, i1 %13

pattern is sensitive to poison, I don't think we can create reduction intrinsics from it?
Comment 8 Sanjay Patel 2021-07-07 05:48:39 PDT
(In reply to Simon Pilgrim from comment #7)
> We have to be careful in IR as the 
> 
>   select i1 %12, i1 true, i1 %13
> 
> pattern is sensitive to poison, I don't think we can create reduction
> intrinsics from it?

We can do it with freeze:
https://alive2.llvm.org/ce/z/RQ_yhV

So we need to:
1. Have SLP recognize the select form of logic ops
2. Insert a freeze to make a transform poison-safe.
Comment 9 Sanjay Patel 2021-07-07 13:11:34 PDT
I drafted a hack for SLP reduction matching, and it partly works on the minimal patterns without obviously breaking anything else. 

I'll try to clean that up. 

Not sure yet if we'll still need other changes (SimplifyCFG or codegen?).
Comment 10 Sanjay Patel 2021-07-09 13:56:59 PDT
https://reviews.llvm.org/D105730
Comment 11 Simon Pilgrim 2021-07-10 02:22:10 PDT
(In reply to Sanjay Patel from comment #10)
> https://reviews.llvm.org/D105730

+1 Thank goodness for freeze :)