Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SLP] Missed icmp/fcmp allof/anyof reductions #40657

Open
RKSimon opened this issue Mar 29, 2019 · 14 comments
Open

[SLP] Missed icmp/fcmp allof/anyof reductions #40657

RKSimon opened this issue Mar 29, 2019 · 14 comments
Assignees
Labels

Comments

@RKSimon
Copy link
Collaborator

RKSimon commented Mar 29, 2019

Bugzilla Link 41312
Version trunk
OS Windows NT
Blocks #49274
CC @alexey-bataev,@dtemirbulatov,@LebedevRI,@MattPD,@rotateright

Extended Description

We should be able to combine these to allof/anyof vector reductions, instead we end up with nested branch trees and scalar selects.

https://gcc.godbolt.org/z/FImIuY

#include <x86intrin.h>

float test_merge_allof_v4sf(__v4sf t) {
  if((t[0] < 0.0 && t[1] < 0.0 && t[2] < 0.0 && t[3] < 0.0) || 
     (t[0] > 1.0 && t[1] > 1.0 && t[2] > 1.0 && t[3] > 1.0))
    return 0;
  return t[0] + t[1];
}
float test_merge_anyof_v4sf(__v4sf t) {
  if((t[0] < 0.0 || t[1] < 0.0 || t[2] < 0.0 || t[3] < 0.0) || 
     (t[0] > 1.0 || t[1] > 1.0 || t[2] > 1.0 || t[3] > 1.0))
    return 0;
  return t[0] + t[1];
}

float test_separate_allof_v4sf(__v4sf t) {
  if((t[0] < 0.0 && t[1] < 0.0 && t[2] < 0.0 && t[3] < 0.0))
    return 0;
  if((t[0] > 1.0 && t[1] > 1.0 && t[2] > 1.0 && t[3] > 1.0))
    return 0;
  return t[0] + t[1];
}
float test_separate_anyof_v4sf(__v4sf t) {
  if((t[0] < 0.0 || t[1] < 0.0 || t[2] < 0.0 || t[3] < 0.0))
    return 0;
  if((t[0] > 1.0 || t[1] > 1.0 || t[2] > 1.0 || t[3] > 1.0))
    return 0;
  return t[0] + t[1];
}

int test_separate_allof_v4si(__v4si t) {
  if((t[0] < 1 && t[1] < 1 && t[2] < 1 && t[3] < 1))
    return 0;
  if((t[0] > 255 && t[1] > 255 && t[2] > 255 && t[3] > 255))
    return 0;
  return t[0] + t[1];
}
int test_separate_anyof_v4si(__v4si t) {
  if((t[0] < 1 || t[1] < 1 || t[2] < 1 || t[3] < 1))
    return 0;
  if((t[0] > 255 || t[1] > 255 || t[2] > 255 || t[3] > 255))
    return 0;
  return t[0] + t[1];
}
@RKSimon
Copy link
Collaborator Author

RKSimon commented Mar 29, 2019

assigned to @rotateright

@LebedevRI
Copy link
Member

LebedevRI commented May 20, 2019

I also just arrived at this.
https://godbolt.org/z/zIyI0_
I did some benchmark (not of just the snippet, but of the entire code),
and this results in up to ~-2% improvement in some tested cases.
(and a +18% regression in a few cases, so cost model!)

Any pointers?

@rotateright
Copy link
Contributor

rotateright commented May 21, 2019

I also just arrived at this.
https://godbolt.org/z/zIyI0_
I did some benchmark (not of just the snippet, but of the entire code),
and this results in up to ~-2% improvement in some tested cases.
(and a +18% regression in a few cases, so cost model!)

Any pointers?

The more general case:

bool test(const unsigned char* input) {
    return input[0] != 0xFF &&
           input[1] != 0xFF &&
           input[2] != 0xFF &&
           input[3] != 0xFF;
}

...is hard/impossible. If the 1st byte is 0xFF, then we are not allowed to load any further because that could be unmapped memory. I think we would require the 'dereferenceable' attribute to handle that case.

@LebedevRI
Copy link
Member

LebedevRI commented May 21, 2019

I also just arrived at this.
https://godbolt.org/z/zIyI0_
I did some benchmark (not of just the snippet, but of the entire code),
and this results in up to ~-2% improvement in some tested cases.
(and a +18% regression in a few cases, so cost model!)

Any pointers?

The more general case:

bool test(const unsigned char* input) {
    return input[0] != 0xFF &&
           input[1] != 0xFF &&
           input[2] != 0xFF &&
           input[3] != 0xFF;
}

...is hard/impossible. If the 1st byte is 0xFF, then we are not allowed to
load any further because that could be unmapped memory.

I think we would
require the 'dereferenceable' attribute to handle that case.

Yeah, the C variant isn't well-preserved. The whole load if i32 is legal there,
i just kind of forgot to about it.

@rotateright
Copy link
Contributor

There are multiple things potentially to fix here...but here's one that I can try to fix:

SLP does not actually sort the candidates for a reduction, so the instruction order in IR affects the groupings and what comes out. That's part of why we see reductions for some examples but not others.

@rotateright
Copy link
Contributor

SLP does not actually sort the candidates for a reduction, so the
instruction order in IR affects the groupings and what comes out. That's
part of why we see reductions for some examples but not others.

Proposal:
https://reviews.llvm.org/D87772

@RKSimon
Copy link
Collaborator Author

RKSimon commented Jun 2, 2021

The code is now typically something like:

define float @test_merge_anyof_v4sf(<4 x float> %0) {
  %2 = extractelement <4 x float> %0, i32 0
  %3 = fcmp olt float %2, 0.000000e+00
  %4 = extractelement <4 x float> %0, i32 1
  %5 = fcmp olt float %4, 0.000000e+00
  %6 = select i1 %3, i1 true, i1 %5
  %7 = extractelement <4 x float> %0, i32 2
  %8 = fcmp olt float %7, 0.000000e+00
  %9 = select i1 %6, i1 true, i1 %8
  %10 = extractelement <4 x float> %0, i32 3
  %11 = fcmp olt float %10, 0.000000e+00
  %12 = select i1 %9, i1 true, i1 %11
  %13 = fcmp ogt float %2, 1.000000e+00
  %14 = select i1 %12, i1 true, i1 %13
  %15 = fcmp ogt float %4, 1.000000e+00
  %16 = select i1 %14, i1 true, i1 %15
  %17 = fcmp ogt float %7, 1.000000e+00
  %18 = select i1 %16, i1 true, i1 %17
  %19 = fcmp ogt float %10, 1.000000e+00
  %20 = select i1 %18, i1 true, i1 %19
  %21 = fadd float %2, %4
  %22 = select i1 %20, float 0.000000e+00, float %21
  ret float %22
}

which lowers to:

test_merge_anyof_v4sf:
        vmovss  .LCPI0_0(%rip), %xmm5           # xmm5 = mem[0],zero,zero,zero
        vmovshdup       %xmm0, %xmm1            # xmm1 = xmm0[1,1,3,3]
        vpermilps       $255, %xmm0, %xmm3      # xmm3 = xmm0[3,3,3,3]
        vpermilpd       $1, %xmm0, %xmm2        # xmm2 = xmm0[1,0]
        vaddss  %xmm1, %xmm0, %xmm4
        vcmpltss        %xmm3, %xmm5, %xmm6
        vandnps %xmm4, %xmm6, %xmm4
        vcmpltss        %xmm2, %xmm5, %xmm6
        vandnps %xmm4, %xmm6, %xmm4
        vcmpltss        %xmm1, %xmm5, %xmm6
        vcmpltss        %xmm0, %xmm5, %xmm5
        vandnps %xmm4, %xmm6, %xmm4
        vandnps %xmm4, %xmm5, %xmm4
        vxorps  %xmm5, %xmm5, %xmm5
        vcmpltss        %xmm5, %xmm3, %xmm3
        vcmpltss        %xmm5, %xmm2, %xmm2
        vcmpltss        %xmm5, %xmm1, %xmm1
        vcmpltss        %xmm5, %xmm0, %xmm0
        vandnps %xmm4, %xmm3, %xmm3
        vandnps %xmm3, %xmm2, %xmm2
        vandnps %xmm2, %xmm1, %xmm1
        vandnps %xmm1, %xmm0, %xmm0
        retq

@RKSimon
Copy link
Collaborator Author

RKSimon commented Jun 2, 2021

We have to be careful in IR as the

select i1 %12, i1 true, i1 %13

pattern is sensitive to poison, I don't think we can create reduction intrinsics from it?

@rotateright
Copy link
Contributor

We have to be careful in IR as the

select i1 %12, i1 true, i1 %13

pattern is sensitive to poison, I don't think we can create reduction
intrinsics from it?

We can do it with freeze:
https://alive2.llvm.org/ce/z/RQ_yhV

So we need to:

  1. Have SLP recognize the select form of logic ops
  2. Insert a freeze to make a transform poison-safe.

@rotateright
Copy link
Contributor

I drafted a hack for SLP reduction matching, and it partly works on the minimal patterns without obviously breaking anything else.

I'll try to clean that up.

Not sure yet if we'll still need other changes (SimplifyCFG or codegen?).

@rotateright
Copy link
Contributor

@RKSimon
Copy link
Collaborator Author

RKSimon commented Jul 10, 2021

https://reviews.llvm.org/D105730

+1 Thank goodness for freeze :)

@RKSimon
Copy link
Collaborator Author

RKSimon commented Nov 27, 2021

mentioned in issue #49274

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021
@RKSimon
Copy link
Collaborator Author

RKSimon commented Feb 5, 2022

Further reduction improvements: https://reviews.llvm.org/D114171

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants