[SLP] Missed icmp/fcmp allof/anyof reductions #40657
assigned to @rotateright |
I also just arrived at this. Any pointers? |
The more general case:
bool test(const unsigned char* input) {
  return input[0] != 0xFF &&
         input[1] != 0xFF &&
         input[2] != 0xFF &&
         input[3] != 0xFF;
}
...is hard/impossible. If the first byte is 0xFF, then we are not allowed to load any further because that could be unmapped memory. I think we would require the 'dereferenceable' attribute to handle that case. |
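To make the dereferenceability point concrete, here is a minimal C++ sketch of a variant where all four bytes are read unconditionally. The function name test_all4 and the precondition that input[0..3] are readable are assumptions for illustration; they are not part of the original test case, and nothing here is meant as the fix adopted in LLVM.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical variant: the caller guarantees that input[0..3] are readable,
// which is roughly the promise a 'dereferenceable(4)' attribute would make.
bool test_all4(const unsigned char* input) {
    assert(input != nullptr);
    unsigned char b[4];
    std::memcpy(b, input, sizeof b);  // unconditional 4-byte read
    // Non-short-circuit '&': every comparison is evaluated, so the four
    // compares form a straight-line all-of chain with no early exit.
    return (b[0] != 0xFF) & (b[1] != 0xFF) & (b[2] != 0xFF) & (b[3] != 0xFF);
}
```

With the short-circuit && form, the source only promises that input[1] is read when input[0] != 0xFF, which is why the compiler cannot widen the load on its own.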
Yeah, the C variant isn't well-preserved. The whole load of i32 is legal there, |
There are multiple things potentially to fix here...but here's one that I can try to fix: SLP does not actually sort the candidates for a reduction, so the instruction order in IR affects the groupings and what comes out. That's part of why we see reductions for some examples but not others. |
Proposal: |
The code is now typically something like:
define float @test_merge_anyof_v4sf(<4 x float> %0) {
%2 = extractelement <4 x float> %0, i32 0
%3 = fcmp olt float %2, 0.000000e+00
%4 = extractelement <4 x float> %0, i32 1
%5 = fcmp olt float %4, 0.000000e+00
%6 = select i1 %3, i1 true, i1 %5
%7 = extractelement <4 x float> %0, i32 2
%8 = fcmp olt float %7, 0.000000e+00
%9 = select i1 %6, i1 true, i1 %8
%10 = extractelement <4 x float> %0, i32 3
%11 = fcmp olt float %10, 0.000000e+00
%12 = select i1 %9, i1 true, i1 %11
%13 = fcmp ogt float %2, 1.000000e+00
%14 = select i1 %12, i1 true, i1 %13
%15 = fcmp ogt float %4, 1.000000e+00
%16 = select i1 %14, i1 true, i1 %15
%17 = fcmp ogt float %7, 1.000000e+00
%18 = select i1 %16, i1 true, i1 %17
%19 = fcmp ogt float %10, 1.000000e+00
%20 = select i1 %18, i1 true, i1 %19
%21 = fadd float %2, %4
%22 = select i1 %20, float 0.000000e+00, float %21
ret float %22
}
which lowers to:
test_merge_anyof_v4sf:
vmovss .LCPI0_0(%rip), %xmm5 # xmm5 = mem[0],zero,zero,zero
vmovshdup %xmm0, %xmm1 # xmm1 = xmm0[1,1,3,3]
vpermilps $255, %xmm0, %xmm3 # xmm3 = xmm0[3,3,3,3]
vpermilpd $1, %xmm0, %xmm2 # xmm2 = xmm0[1,0]
vaddss %xmm1, %xmm0, %xmm4
vcmpltss %xmm3, %xmm5, %xmm6
vandnps %xmm4, %xmm6, %xmm4
vcmpltss %xmm2, %xmm5, %xmm6
vandnps %xmm4, %xmm6, %xmm4
vcmpltss %xmm1, %xmm5, %xmm6
vcmpltss %xmm0, %xmm5, %xmm5
vandnps %xmm4, %xmm6, %xmm4
vandnps %xmm4, %xmm5, %xmm4
vxorps %xmm5, %xmm5, %xmm5
vcmpltss %xmm5, %xmm3, %xmm3
vcmpltss %xmm5, %xmm2, %xmm2
vcmpltss %xmm5, %xmm1, %xmm1
vcmpltss %xmm5, %xmm0, %xmm0
vandnps %xmm4, %xmm3, %xmm3
vandnps %xmm3, %xmm2, %xmm2
vandnps %xmm2, %xmm1, %xmm1
vandnps %xmm1, %xmm0, %xmm0
retq |
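For readers who find the IR above hard to follow, this is a scalar C++ rendering of what it computes. It is a reconstruction from the IR only; the function name merge_anyof_v4sf and the pointer-based signature are assumptions, and the actual source behind the godbolt link may look different.

```cpp
#include <cassert>

// Hypothetical scalar equivalent of @test_merge_anyof_v4sf:
// returns 0 if any lane is < 0 or > 1, otherwise lane 0 + lane 1.
float merge_anyof_v4sf(const float* v) {
    assert(v != nullptr);
    // Each '||' chain corresponds to the 'select i1 %a, i1 true, i1 %b' links.
    bool any_lt = v[0] < 0.0f || v[1] < 0.0f || v[2] < 0.0f || v[3] < 0.0f;
    bool any_gt = v[0] > 1.0f || v[1] > 1.0f || v[2] > 1.0f || v[3] > 1.0f;
    return (any_lt || any_gt) ? 0.0f : v[0] + v[1];
}
```

The two four-wide compare chains are the any-of reductions this issue asks SLP to recognize.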
We have to be careful in IR, as the select i1 %12, i1 true, i1 %13 pattern is sensitive to poison; I don't think we can create reduction intrinsics from it? |
We can do it with freeze: So we need to:
|
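The poison concern and the freeze suggestion can be illustrated with a toy C++ model. This is only a sketch: std::nullopt stands in for poison, and the names I1, select_or, bitwise_or, freeze and check_model are made up for this illustration; it models the IR semantics as I understand them and is not the SLP change itself.

```cpp
#include <cassert>
#include <optional>

// Toy model of an i1 value: std::nullopt stands in for poison.
using I1 = std::optional<bool>;

// select i1 %a, i1 true, i1 %b : %b is only observed when %a is false.
I1 select_or(I1 a, I1 b) {
    if (!a) return std::nullopt;  // a poison condition poisons the result
    return *a ? I1(true) : b;
}

// or i1 %a, %b : poison in either operand poisons the result.
I1 bitwise_or(I1 a, I1 b) {
    if (!a || !b) return std::nullopt;
    return *a || *b;
}

// freeze: poison becomes an arbitrary but fixed value; 'pick' models that choice.
I1 freeze(I1 x, bool pick) { return x ? x : I1(pick); }

void check_model() {
    const I1 t = true, poison = std::nullopt;
    assert(select_or(t, poison) == I1(true));                  // select hides the poison
    assert(!bitwise_or(t, poison).has_value());                // plain 'or' leaks it
    assert(bitwise_or(t, freeze(poison, false)) == I1(true));  // freezing the operand
    assert(bitwise_or(t, freeze(poison, true)) == I1(true));   // restores the result
}
```

In this model, rewriting the select chain into a plain or (and from there into an or-reduction) is only safe once the operands that the select would have skipped are frozen.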
I drafted a hack for SLP reduction matching, and it partly works on the minimal patterns without obviously breaking anything else. I'll try to clean that up. Not sure yet if we'll still need other changes (SimplifyCFG or codegen?). |
+1 Thank goodness for freeze :) |
mentioned in issue #49274 |
Further reduction improvements: https://reviews.llvm.org/D114171 |
Extended Description
We should be able to combine these into allof/anyof vector reductions; instead, we end up with nested branch trees and scalar selects.
https://gcc.godbolt.org/z/FImIuY