We should be able to combine these into allof/anyof vector reductions; instead we end up with nested branch trees and scalar selects. https://gcc.godbolt.org/z/FImIuY

#include <x86intrin.h>

float test_merge_allof_v4sf(__v4sf t) {
  if ((t[0] < 0.0 && t[1] < 0.0 && t[2] < 0.0 && t[3] < 0.0) ||
      (t[0] > 1.0 && t[1] > 1.0 && t[2] > 1.0 && t[3] > 1.0))
    return 0;
  return t[0] + t[1];
}

float test_merge_anyof_v4sf(__v4sf t) {
  if ((t[0] < 0.0 || t[1] < 0.0 || t[2] < 0.0 || t[3] < 0.0) ||
      (t[0] > 1.0 || t[1] > 1.0 || t[2] > 1.0 || t[3] > 1.0))
    return 0;
  return t[0] + t[1];
}

float test_separate_allof_v4sf(__v4sf t) {
  if (t[0] < 0.0 && t[1] < 0.0 && t[2] < 0.0 && t[3] < 0.0)
    return 0;
  if (t[0] > 1.0 && t[1] > 1.0 && t[2] > 1.0 && t[3] > 1.0)
    return 0;
  return t[0] + t[1];
}

float test_separate_anyof_v4sf(__v4sf t) {
  if (t[0] < 0.0 || t[1] < 0.0 || t[2] < 0.0 || t[3] < 0.0)
    return 0;
  if (t[0] > 1.0 || t[1] > 1.0 || t[2] > 1.0 || t[3] > 1.0)
    return 0;
  return t[0] + t[1];
}

int test_separate_allof_v4si(__v4si t) {
  if (t[0] < 1 && t[1] < 1 && t[2] < 1 && t[3] < 1)
    return 0;
  if (t[0] > 255 && t[1] > 255 && t[2] > 255 && t[3] > 255)
    return 0;
  return t[0] + t[1];
}

int test_separate_anyof_v4si(__v4si t) {
  if (t[0] < 1 || t[1] < 1 || t[2] < 1 || t[3] < 1)
    return 0;
  if (t[0] > 255 || t[1] > 255 || t[2] > 255 || t[3] > 255)
    return 0;
  return t[0] + t[1];
}
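For reference, here is a hand-written sketch of the codegen we would hope for on the "merge_allof" case: one packed compare per range bound plus a movemask, instead of a branch tree. The function name is mine, and this is an illustration of the target lowering under SSE assumptions, not compiler output.

```c
#include <x86intrin.h>

/* Hand-written sketch of the hoped-for "allof" reduction: compare all four
 * lanes at once and test the sign-bit mask, instead of branching per lane. */
float test_merge_allof_v4sf_manual(__v4sf t) {
    __m128 lt = _mm_cmplt_ps((__m128)t, _mm_set1_ps(0.0f)); /* per-lane t[i] < 0.0 */
    __m128 gt = _mm_cmpgt_ps((__m128)t, _mm_set1_ps(1.0f)); /* per-lane t[i] > 1.0 */
    /* movemask packs the four lane results into 4 bits; 0xF == all lanes true */
    if (_mm_movemask_ps(lt) == 0xF || _mm_movemask_ps(gt) == 0xF)
        return 0.0f;
    return t[0] + t[1];
}
```

This matches the scalar short-circuit semantics because the vector compares have no side effects, so evaluating all four lanes eagerly is safe.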
I also just arrived at this.
https://godbolt.org/z/zIyI0_
I ran some benchmarks (not of just the snippet, but of the entire code), and this results in up to a ~2% improvement in some tested cases (and an 18% regression in a few cases, so: cost model!).
Any pointers?
(In reply to Roman Lebedev from comment #1)
> I also just arrived at this.
> https://godbolt.org/z/zIyI0_
> I did some benchmark (not of just the snippet, but of the entire code),
> and this results in up to ~-2% improvement in some tested cases.
> (and a +18% regression in a few cases, so cost model!)
>
> Any pointers?

The more general case:

bool test(const unsigned char* input) {
  return input[0] != 0xFF &&
         input[1] != 0xFF &&
         input[2] != 0xFF &&
         input[3] != 0xFF;
}

...is hard/impossible. If the 1st byte is 0xFF, then we are not allowed to load any further because that could be unmapped memory. I think we would require the 'dereferenceable' attribute to handle that case.
(In reply to Sanjay Patel from comment #2)
> The more general case:
>
> bool test(const unsigned char* input) {
>   return input[0] != 0xFF &&
>          input[1] != 0xFF &&
>          input[2] != 0xFF &&
>          input[3] != 0xFF;
> }
>
> ...is hard/impossible. If the 1st byte is 0xFF, then we are not allowed to
> load any further because that could be unmapped memory.
> I think we would require the 'dereferenceable' attribute to handle that case.

Yeah, the C variant isn't well-preserved. The whole load of i32 is legal there; I just kind of forgot about it.
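When the caller does guarantee that all four bytes are dereferenceable, the merged form can be written by hand: one 32-bit load plus a branchless "has a zero byte" bit trick. This is a sketch of the idea only; the helper name and the particular bit trick are mine, not from the thread.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Sketch: with 4 dereferenceable bytes, the short-circuit loads can be
 * merged into a single 32-bit load.  A byte of x equals 0xFF exactly when
 * the corresponding byte of (x ^ 0xFFFFFFFF) is zero, and the classic
 * "word has a zero byte" trick detects that without branching. */
static bool test_merged(const unsigned char *input) {
    uint32_t x;
    memcpy(&x, input, sizeof x);              /* one 4-byte load */
    uint32_t y = x ^ 0xFFFFFFFFu;             /* zero byte <=> 0xFF byte in x */
    uint32_t has_zero = (y - 0x01010101u) & ~y & 0x80808080u;
    return has_zero == 0;                     /* true iff no input byte is 0xFF */
}
```

The result is byte-order independent, since the trick only asks whether any byte is zero, not which one.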
There are multiple things to potentially fix here, but here's one that I can try to fix: SLP does not actually sort the candidates for a reduction, so the instruction order in the IR affects the groupings and what comes out. That's part of why we see reductions for some examples but not others.
(In reply to Sanjay Patel from comment #4)
> SLP does not actually sort the candidates for a reduction, so the
> instruction order in IR affects the groupings and what comes out. That's
> part of why we see reductions for some examples but not others.

Proposal: https://reviews.llvm.org/D87772
The code is now typically something like:

define float @test_merge_anyof_v4sf(<4 x float> %0) {
  %2 = extractelement <4 x float> %0, i32 0
  %3 = fcmp olt float %2, 0.000000e+00
  %4 = extractelement <4 x float> %0, i32 1
  %5 = fcmp olt float %4, 0.000000e+00
  %6 = select i1 %3, i1 true, i1 %5
  %7 = extractelement <4 x float> %0, i32 2
  %8 = fcmp olt float %7, 0.000000e+00
  %9 = select i1 %6, i1 true, i1 %8
  %10 = extractelement <4 x float> %0, i32 3
  %11 = fcmp olt float %10, 0.000000e+00
  %12 = select i1 %9, i1 true, i1 %11
  %13 = fcmp ogt float %2, 1.000000e+00
  %14 = select i1 %12, i1 true, i1 %13
  %15 = fcmp ogt float %4, 1.000000e+00
  %16 = select i1 %14, i1 true, i1 %15
  %17 = fcmp ogt float %7, 1.000000e+00
  %18 = select i1 %16, i1 true, i1 %17
  %19 = fcmp ogt float %10, 1.000000e+00
  %20 = select i1 %18, i1 true, i1 %19
  %21 = fadd float %2, %4
  %22 = select i1 %20, float 0.000000e+00, float %21
  ret float %22
}

which lowers to:

test_merge_anyof_v4sf:
  vmovss    .LCPI0_0(%rip), %xmm5    # xmm5 = mem[0],zero,zero,zero
  vmovshdup %xmm0, %xmm1             # xmm1 = xmm0[1,1,3,3]
  vpermilps $255, %xmm0, %xmm3       # xmm3 = xmm0[3,3,3,3]
  vpermilpd $1, %xmm0, %xmm2         # xmm2 = xmm0[1,0]
  vaddss    %xmm1, %xmm0, %xmm4
  vcmpltss  %xmm3, %xmm5, %xmm6
  vandnps   %xmm4, %xmm6, %xmm4
  vcmpltss  %xmm2, %xmm5, %xmm6
  vandnps   %xmm4, %xmm6, %xmm4
  vcmpltss  %xmm1, %xmm5, %xmm6
  vcmpltss  %xmm0, %xmm5, %xmm5
  vandnps   %xmm4, %xmm6, %xmm4
  vandnps   %xmm4, %xmm5, %xmm4
  vxorps    %xmm5, %xmm5, %xmm5
  vcmpltss  %xmm5, %xmm3, %xmm3
  vcmpltss  %xmm5, %xmm2, %xmm2
  vcmpltss  %xmm5, %xmm1, %xmm1
  vcmpltss  %xmm5, %xmm0, %xmm0
  vandnps   %xmm4, %xmm3, %xmm3
  vandnps   %xmm3, %xmm2, %xmm2
  vandnps   %xmm2, %xmm1, %xmm1
  vandnps   %xmm1, %xmm0, %xmm0
  retq
We have to be careful in IR, as the

  select i1 %12, i1 true, i1 %13

pattern is sensitive to poison; I don't think we can create reduction intrinsics from it?
(In reply to Simon Pilgrim from comment #7)
> We have to be careful in IR as the
>
> select i1 %12, i1 true, i1 %13
>
> pattern is sensitive to poison, I don't think we can create reduction
> intrinsics from it?

We can do it with freeze:
https://alive2.llvm.org/ce/z/RQ_yhV

So we need to:
1. Have SLP recognize the select form of logic ops.
2. Insert a freeze to make the transform poison-safe.
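To see why the freeze is needed, here is a toy three-valued model of i1-with-poison written in C. This is my simplification for illustration, not LLVM's actual semantics machinery: `or` propagates poison from either operand, the select form shields a poison second operand when the condition is true, and `freeze` replaces poison with some arbitrary defined value, which is why `or %a, freeze(%b)` refines `select i1 %a, i1 true, i1 %b` (as in the alive2 link above).

```c
#include <assert.h>

/* Toy model of LLVM i1 values that may be poison (a simplification). */
typedef enum { V_FALSE = 0, V_TRUE = 1, V_POISON = 2 } i1;

static i1 or_i1(i1 a, i1 b) {            /* or: poison in, poison out */
    if (a == V_POISON || b == V_POISON) return V_POISON;
    return (a == V_TRUE || b == V_TRUE) ? V_TRUE : V_FALSE;
}

static i1 select_true(i1 a, i1 b) {      /* select i1 a, i1 true, i1 b */
    if (a == V_POISON) return V_POISON;
    return (a == V_TRUE) ? V_TRUE : b;   /* poison b is ignored when a is true */
}

static i1 freeze_i1(i1 v) {              /* freeze: pick some defined value */
    return (v == V_POISON) ? V_FALSE : v;  /* any fixed choice would do */
}
```

With this model, `select_true(V_TRUE, V_POISON)` is `V_TRUE` while `or_i1(V_TRUE, V_POISON)` is poison, so a naive select-to-or rewrite is unsound; routing the second operand through `freeze_i1` first restores a defined (and therefore refining) result.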
I drafted a hack for SLP reduction matching, and it partly works on the minimal patterns without obviously breaking anything else. I'll try to clean that up. I'm not sure yet whether we'll still need other changes (SimplifyCFG or codegen?).
https://reviews.llvm.org/D105730
(In reply to Sanjay Patel from comment #10) > https://reviews.llvm.org/D105730 +1 Thank goodness for freeze :)