[SLP] Missed icmp/fcmp allof/anyof reductions #40657

RKSimon · 2019-03-29T21:15:39Z


Bugzilla Link	41312
Version	trunk
OS	Windows NT
Blocks	#49274
CC	@alexey-bataev,@dtemirbulatov,@LebedevRI,@MattPD,@rotateright

Extended Description

We should be able to combine these to allof/anyof vector reductions, instead we end up with nested branch trees and scalar selects.

https://gcc.godbolt.org/z/FImIuY

#include <x86intrin.h>

float test_merge_allof_v4sf(__v4sf t) {
  if((t[0] < 0.0 && t[1] < 0.0 && t[2] < 0.0 && t[3] < 0.0) || 
     (t[0] > 1.0 && t[1] > 1.0 && t[2] > 1.0 && t[3] > 1.0))
    return 0;
  return t[0] + t[1];
}
float test_merge_anyof_v4sf(__v4sf t) {
  if((t[0] < 0.0 || t[1] < 0.0 || t[2] < 0.0 || t[3] < 0.0) || 
     (t[0] > 1.0 || t[1] > 1.0 || t[2] > 1.0 || t[3] > 1.0))
    return 0;
  return t[0] + t[1];
}

float test_separate_allof_v4sf(__v4sf t) {
  if((t[0] < 0.0 && t[1] < 0.0 && t[2] < 0.0 && t[3] < 0.0))
    return 0;
  if((t[0] > 1.0 && t[1] > 1.0 && t[2] > 1.0 && t[3] > 1.0))
    return 0;
  return t[0] + t[1];
}
float test_separate_anyof_v4sf(__v4sf t) {
  if((t[0] < 0.0 || t[1] < 0.0 || t[2] < 0.0 || t[3] < 0.0))
    return 0;
  if((t[0] > 1.0 || t[1] > 1.0 || t[2] > 1.0 || t[3] > 1.0))
    return 0;
  return t[0] + t[1];
}

int test_separate_allof_v4si(__v4si t) {
  if((t[0] < 1 && t[1] < 1 && t[2] < 1 && t[3] < 1))
    return 0;
  if((t[0] > 255 && t[1] > 255 && t[2] > 255 && t[3] > 255))
    return 0;
  return t[0] + t[1];
}
int test_separate_anyof_v4si(__v4si t) {
  if((t[0] < 1 || t[1] < 1 || t[2] < 1 || t[3] < 1))
    return 0;
  if((t[0] > 255 || t[1] > 255 || t[2] > 255 || t[3] > 255))
    return 0;
  return t[0] + t[1];
}

RKSimon · 2019-03-29T21:15:39Z

assigned to @rotateright

LebedevRI · 2019-05-20T19:29:10Z

I also just arrived at this.
https://godbolt.org/z/zIyI0_
I did some benchmark (not of just the snippet, but of the entire code),
and this results in up to ~-2% improvement in some tested cases.
(and a +18% regression in a few cases, so cost model!)

Any pointers?

rotateright · 2019-05-21T19:08:03Z

I also just arrived at this.
https://godbolt.org/z/zIyI0_
I did some benchmark (not of just the snippet, but of the entire code),
and this results in up to ~-2% improvement in some tested cases.
(and a +18% regression in a few cases, so cost model!)

Any pointers?

The more general case:

bool test(const unsigned char* input) {
    return input[0] != 0xFF &&
           input[1] != 0xFF &&
           input[2] != 0xFF &&
           input[3] != 0xFF;
}

...is hard/impossible. If the 1st byte is 0xFF, then we are not allowed to load any further because that could be unmapped memory. I think we would require the 'dereferenceable' attribute to handle that case.

LebedevRI · 2019-05-21T19:16:43Z

I also just arrived at this.
https://godbolt.org/z/zIyI0_
I did some benchmark (not of just the snippet, but of the entire code),
and this results in up to ~-2% improvement in some tested cases.
(and a +18% regression in a few cases, so cost model!)

Any pointers?

The more general case:
bool test(const unsigned char* input) {
    return input[0] != 0xFF &&
           input[1] != 0xFF &&
           input[2] != 0xFF &&
           input[3] != 0xFF;
}
...is hard/impossible. If the 1st byte is 0xFF, then we are not allowed to
load any further because that could be unmapped memory.

I think we would
require the 'dereferenceable' attribute to handle that case.

Yeah, the C variant isn't well-preserved. The whole load if i32 is legal there,
i just kind of forgot to about it.

rotateright · 2020-09-16T14:43:16Z

There are multiple things potentially to fix here...but here's one that I can try to fix:

SLP does not actually sort the candidates for a reduction, so the instruction order in IR affects the groupings and what comes out. That's part of why we see reductions for some examples but not others.

rotateright · 2020-09-16T15:56:33Z

SLP does not actually sort the candidates for a reduction, so the
instruction order in IR affects the groupings and what comes out. That's
part of why we see reductions for some examples but not others.

Proposal:
https://reviews.llvm.org/D87772

RKSimon · 2021-06-02T10:33:05Z

The code is now typically something like:

define float @test_merge_anyof_v4sf(<4 x float> %0) {
  %2 = extractelement <4 x float> %0, i32 0
  %3 = fcmp olt float %2, 0.000000e+00
  %4 = extractelement <4 x float> %0, i32 1
  %5 = fcmp olt float %4, 0.000000e+00
  %6 = select i1 %3, i1 true, i1 %5
  %7 = extractelement <4 x float> %0, i32 2
  %8 = fcmp olt float %7, 0.000000e+00
  %9 = select i1 %6, i1 true, i1 %8
  %10 = extractelement <4 x float> %0, i32 3
  %11 = fcmp olt float %10, 0.000000e+00
  %12 = select i1 %9, i1 true, i1 %11
  %13 = fcmp ogt float %2, 1.000000e+00
  %14 = select i1 %12, i1 true, i1 %13
  %15 = fcmp ogt float %4, 1.000000e+00
  %16 = select i1 %14, i1 true, i1 %15
  %17 = fcmp ogt float %7, 1.000000e+00
  %18 = select i1 %16, i1 true, i1 %17
  %19 = fcmp ogt float %10, 1.000000e+00
  %20 = select i1 %18, i1 true, i1 %19
  %21 = fadd float %2, %4
  %22 = select i1 %20, float 0.000000e+00, float %21
  ret float %22
}

which lowers to:

test_merge_anyof_v4sf:
        vmovss  .LCPI0_0(%rip), %xmm5           # xmm5 = mem[0],zero,zero,zero
        vmovshdup       %xmm0, %xmm1            # xmm1 = xmm0[1,1,3,3]
        vpermilps       $255, %xmm0, %xmm3      # xmm3 = xmm0[3,3,3,3]
        vpermilpd       $1, %xmm0, %xmm2        # xmm2 = xmm0[1,0]
        vaddss  %xmm1, %xmm0, %xmm4
        vcmpltss        %xmm3, %xmm5, %xmm6
        vandnps %xmm4, %xmm6, %xmm4
        vcmpltss        %xmm2, %xmm5, %xmm6
        vandnps %xmm4, %xmm6, %xmm4
        vcmpltss        %xmm1, %xmm5, %xmm6
        vcmpltss        %xmm0, %xmm5, %xmm5
        vandnps %xmm4, %xmm6, %xmm4
        vandnps %xmm4, %xmm5, %xmm4
        vxorps  %xmm5, %xmm5, %xmm5
        vcmpltss        %xmm5, %xmm3, %xmm3
        vcmpltss        %xmm5, %xmm2, %xmm2
        vcmpltss        %xmm5, %xmm1, %xmm1
        vcmpltss        %xmm5, %xmm0, %xmm0
        vandnps %xmm4, %xmm3, %xmm3
        vandnps %xmm3, %xmm2, %xmm2
        vandnps %xmm2, %xmm1, %xmm1
        vandnps %xmm1, %xmm0, %xmm0
        retq

RKSimon · 2021-06-02T10:37:00Z

We have to be careful in IR as the

select i1 %12, i1 true, i1 %13

pattern is sensitive to poison, I don't think we can create reduction intrinsics from it?

rotateright · 2021-07-07T12:48:39Z

We have to be careful in IR as the

select i1 %12, i1 true, i1 %13

pattern is sensitive to poison, I don't think we can create reduction
intrinsics from it?

We can do it with freeze:
https://alive2.llvm.org/ce/z/RQ_yhV

So we need to:

Have SLP recognize the select form of logic ops
Insert a freeze to make a transform poison-safe.

rotateright · 2021-07-07T20:11:34Z

I drafted a hack for SLP reduction matching, and it partly works on the minimal patterns without obviously breaking anything else.

I'll try to clean that up.

Not sure yet if we'll still need other changes (SimplifyCFG or codegen?).

rotateright · 2021-07-09T20:56:59Z

https://reviews.llvm.org/D105730

RKSimon · 2021-07-10T09:22:10Z

https://reviews.llvm.org/D105730

+1 Thank goodness for freeze :)

RKSimon · 2021-11-27T04:16:57Z

mentioned in issue #49274

RKSimon · 2022-02-05T20:46:27Z

Further reduction improvements: https://reviews.llvm.org/D114171

RKSimon mentioned this issue Apr 12, 2021

[Meta] Missed combines using freeze #49274

Open

llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021

RKSimon added llvm:SLPVectorizer llvm:optimizations labels Mar 21, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SLP] Missed icmp/fcmp allof/anyof reductions #40657

[SLP] Missed icmp/fcmp allof/anyof reductions #40657

RKSimon commented Mar 29, 2019 •

edited

Loading

RKSimon commented Mar 29, 2019

LebedevRI commented May 20, 2019 •

edited by RKSimon

Loading

rotateright commented May 21, 2019 •

edited by RKSimon

Loading

LebedevRI commented May 21, 2019 •

edited by RKSimon

Loading

rotateright commented Sep 16, 2020

rotateright commented Sep 16, 2020

RKSimon commented Jun 2, 2021 •

edited

Loading

RKSimon commented Jun 2, 2021

rotateright commented Jul 7, 2021

rotateright commented Jul 7, 2021

rotateright commented Jul 9, 2021

RKSimon commented Jul 10, 2021

RKSimon commented Nov 27, 2021

RKSimon commented Feb 5, 2022

[SLP] Missed icmp/fcmp allof/anyof reductions #40657

[SLP] Missed icmp/fcmp allof/anyof reductions #40657

Comments

RKSimon commented Mar 29, 2019 • edited Loading

Extended Description

RKSimon commented Mar 29, 2019

LebedevRI commented May 20, 2019 • edited by RKSimon Loading

rotateright commented May 21, 2019 • edited by RKSimon Loading

LebedevRI commented May 21, 2019 • edited by RKSimon Loading

rotateright commented Sep 16, 2020

rotateright commented Sep 16, 2020

RKSimon commented Jun 2, 2021 • edited Loading

RKSimon commented Jun 2, 2021

rotateright commented Jul 7, 2021

rotateright commented Jul 7, 2021

rotateright commented Jul 9, 2021

RKSimon commented Jul 10, 2021

RKSimon commented Nov 27, 2021

RKSimon commented Feb 5, 2022

RKSimon commented Mar 29, 2019 •

edited

Loading

LebedevRI commented May 20, 2019 •

edited by RKSimon

Loading

rotateright commented May 21, 2019 •

edited by RKSimon

Loading

LebedevRI commented May 21, 2019 •

edited by RKSimon

Loading

RKSimon commented Jun 2, 2021 •

edited

Loading