[SLP] Failure to create v2f64 comparison reductions #43090

Open · RKSimon opened this issue Oct 21, 2019 · 12 comments
Labels: bugzilla (Issues migrated from bugzilla), llvm:SLPVectorizer

Comments

RKSimon (Collaborator) commented Oct 21, 2019

Bugzilla Link: 43745
Version: trunk
OS: Windows NT
CC: @alexey-bataev, @adibiagio, @topperc, @MattPD, @rotateright

Extended Description

Current Codegen: https://godbolt.org/z/n0UB_k

Pulled out of the c-ray benchmark:

#define ERR_MARGIN 1e-6

bool cmp_lt_gt(double a, double b, double c) {
  double t1 = (-b + c) / (2.0 * a);
  double t2 = (-b - c) / (2.0 * a);

  if((t1 < ERR_MARGIN && t2 < ERR_MARGIN) || (t1 > 1.0 && t2 > 1.0))
    return 0;
  return 1;
}

SLP fails to create AND reductions for either of these comparison chains, let alone merge them into a single (tweaked) reduction (and branch). Oddly, it does manage to vectorize one of the comparisons, but then fails to form a reduction from the result.

clang -g0 -O3 -march=btver2 -emit-llvm

define i1 @cmp_lt_gt(double %0, double %1, double %2) {
  %4 = fneg double %1
  %5 = fmul double %0, 2.000000e+00
  %6 = insertelement <2 x double> undef, double %4, i32 0
  %7 = insertelement <2 x double> %6, double %2, i32 1
  %8 = insertelement <2 x double> undef, double %2, i32 0
  %9 = insertelement <2 x double> %8, double %1, i32 1
  %10 = fsub <2 x double> %7, %9
  %11 = insertelement <2 x double> undef, double %5, i32 0
  %12 = shufflevector <2 x double> %11, <2 x double> undef, <2 x i32> zeroinitializer
  %13 = fdiv <2 x double> %10, %12
  %14 = extractelement <2 x double> %13, i32 1
  %15 = fcmp olt double %14, 0x3EB0C6F7A0B5ED8D
  %16 = extractelement <2 x double> %13, i32 0
  %17 = fcmp olt double %16, 0x3EB0C6F7A0B5ED8D
  %18 = and i1 %15, %17
  br i1 %18, label %24, label %19

19: ; preds = %3
  %20 = fcmp ule <2 x double> %13, <double 1.000000e+00, double 1.000000e+00>
  %21 = extractelement <2 x i1> %20, i32 0
  %22 = extractelement <2 x i1> %20, i32 1
  %23 = or i1 %21, %22
  ret i1 %23

24: ; preds = %3
  ret i1 false
}
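
For reference, a minimal sketch of the kind of AND reduction that is missing for the first comparison chain. The value names (%t12, %all.lt, the labels) and the use of the llvm.vector.reduce.and intrinsic are illustrative assumptions, not actual SLP output:

  ; %t12 would hold <t1, t2>, as %13 does in the IR above (hypothetical name)
  %cmp.lt = fcmp olt <2 x double> %t12, <double 0x3EB0C6F7A0B5ED8D, double 0x3EB0C6F7A0B5ED8D>
  %all.lt = call i1 @llvm.vector.reduce.and.v2i1(<2 x i1> %cmp.lt)
  br i1 %all.lt, label %ret.false, label %check.gt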

RKSimon (Collaborator, Author) commented Oct 31, 2019

HorizontalReduction.tryToReduce bails out of <2 x X> reduction cases, which seems to prevent the fcmp vectorization as well.

rotateright (Contributor) commented:

> HorizontalReduction.tryToReduce bails out of <2 x X> reduction cases, which seems to prevent the fcmp vectorization as well.

Would this handle it?
https://reviews.llvm.org/D59710

RKSimon (Collaborator, Author) commented Oct 31, 2019

> > HorizontalReduction.tryToReduce bails out of <2 x X> reduction cases, which seems to prevent the fcmp vectorization as well.
>
> Would this handle it?
> https://reviews.llvm.org/D59710

Yes, I think so. If we don't want to introduce <2 x X> reductions for whatever reason, we would still catch this in the backend as long as the fcmp itself had been vectorized.
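
For reference, a sketch of the pattern that would then be left for the backend if only the fcmp were vectorized; %cmp and the other names are hypothetical:

  ; vectorized compare of <t1, t2> against ERR_MARGIN (hypothetical %cmp)
  %e0   = extractelement <2 x i1> %cmp, i32 0
  %e1   = extractelement <2 x i1> %cmp, i32 1
  %both = and i1 %e0, %e1
  ; x86 can lower this to vcmpltpd + vmovmskpd + a compare of the 2-bit mask
  ; against 3 (all lanes set), as in the asm shown later in the thread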

RKSimon (Collaborator, Author) commented Nov 6, 2019

https://godbolt.org/z/EKQcUy

This contains an ideal version that merges the reductions into a single result:

#include <immintrin.h>

bool cmp_lt_gt_sse(double a, double b, double c) {
    double t1 = (-b + c) / (2.0 * a);
    double t2 = (-b - c) / (2.0 * a);

    __m128d t12 = _mm_setr_pd(t1, t2);
    __m128d lo = _mm_cmplt_pd(t12, _mm_set1_pd(ERR_MARGIN));
    __m128d hi = _mm_cmpgt_pd(t12, _mm_set1_pd(1.0));
    __m128d reduce = _mm_and_pd(
        _mm_unpacklo_pd(lo, hi),
        _mm_unpackhi_pd(lo, hi)
    );
    if (_mm_movemask_pd(reduce) != 0)
      return 0;
    return 1;
}

rotateright (Contributor) commented:

D59710 became a restricted first step to avoid regressions on other cases:
https://reviews.llvm.org/rG7ff57705ba196ce649d6034614b3b9df57e1f84f

rotateright (Contributor) commented:

VectorCombine proposal that would create the missing reduction pattern (not intrinsic though):
https://reviews.llvm.org/D82474

I'm not sure how we get from there to the ideal code from comment 4.
We could start by converting reduction patterns to intrinsics in VectorCombine.
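
For reference, a rough sketch (with hypothetical value names) of the two forms being discussed: the shuffle-based reduction pattern first, then the reduction-intrinsic form it could later be canonicalized to, written here with the current llvm.vector.reduce.and name:

  ; shuffle-based reduction pattern (as it later appears in the IR in this thread):
  %shift = shufflevector <2 x i1> %cmp, <2 x i1> undef, <2 x i32> <i32 1, i32 undef>
  %and   = and <2 x i1> %cmp, %shift
  %red   = extractelement <2 x i1> %and, i64 0

  ; equivalent reduction-intrinsic form:
  %red2  = call i1 @llvm.vector.reduce.and.v2i1(<2 x i1> %cmp)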

RKSimon (Collaborator, Author) commented Jun 26, 2020

> I'm not sure how we get from there to the ideal code from comment 4.
> We could start by converting reduction patterns to intrinsics in VectorCombine.

Handling in the backend should be relatively trivial with the recent work on CMP(MOVMSK()) folds.

rotateright (Contributor) commented Jun 29, 2020

> > I'm not sure how we get from there to the ideal code from comment 4.
> > We could start by converting reduction patterns to intrinsics in VectorCombine.
>
> Handling in the backend should be relatively trivial with the recent work on CMP(MOVMSK()) folds.

After https://reviews.llvm.org/rGb6315aee5b42, we have this IR:

define i1 @_Z9cmp_lt_gtddd(double %a, double %b, double %c) {
entry:
  %fneg = fneg double %b
  %mul = fmul double %a, 2.000000e+00
  %0 = insertelement <2 x double> undef, double %c, i32 0
  %1 = insertelement <2 x double> %0, double %fneg, i32 1
  %2 = insertelement <2 x double> undef, double %b, i32 0
  %3 = insertelement <2 x double> %2, double %c, i32 1
  %4 = fsub <2 x double> %1, %3
  %5 = insertelement <2 x double> undef, double %mul, i32 0
  %6 = shufflevector <2 x double> %5, <2 x double> undef, <2 x i32> zeroinitializer
  %7 = fdiv <2 x double> %4, %6
  %8 = fcmp olt <2 x double> %7, <double 0x3EB0C6F7A0B5ED8D, double 0x3EB0C6F7A0B5ED8D>
  %shift = shufflevector <2 x i1> %8, <2 x i1> undef, <2 x i32> <i32 1, i32 undef>
  %9 = and <2 x i1> %8, %shift
  %or.cond = extractelement <2 x i1> %9, i64 0
  br i1 %or.cond, label %cleanup, label %lor.lhs.false

lor.lhs.false:                                    ; preds = %entry
  %10 = fcmp ule <2 x double> %7, <double 1.000000e+00, double 1.000000e+00>
  %shift17 = shufflevector <2 x i1> %10, <2 x i1> undef, <2 x i32> <i32 1, i32 undef>
  %11 = or <2 x i1> %10, %shift17
  %not.or.cond9 = extractelement <2 x i1> %11, i32 0
  ret i1 %not.or.cond9

cleanup:                                          ; preds = %entry
  ret i1 false
}

SDAG can't do anything with the ops split across different basic blocks. This would require SimplifyCFG to speculatively execute the 2nd reduction, plus some instcombine or other transforms, to convert it into a single reduction?
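
For reference, a sketch of what a speculated, merged single-block form could look like, mirroring the unpack trick from the intrinsics version in comment 4; the value names (other than %7, the fdiv result above), the ogt predicate, and the use of llvm.vector.reduce.or are illustrative assumptions:

  ; %7 is the <t1, t2> fdiv result from the IR above
  %lt  = fcmp olt <2 x double> %7, <double 0x3EB0C6F7A0B5ED8D, double 0x3EB0C6F7A0B5ED8D>
  %gt  = fcmp ogt <2 x double> %7, <double 1.000000e+00, double 1.000000e+00>
  ; after the shuffles: lane 0 = (t1 < ERR_MARGIN) & (t2 < ERR_MARGIN),
  ;                     lane 1 = (t1 > 1.0) & (t2 > 1.0)
  %lo  = shufflevector <2 x i1> %lt, <2 x i1> %gt, <2 x i32> <i32 0, i32 2>
  %hi  = shufflevector <2 x i1> %lt, <2 x i1> %gt, <2 x i32> <i32 1, i32 3>
  %and = and <2 x i1> %lo, %hi
  ; a single OR reduction then answers the whole branch condition
  %any = call i1 @llvm.vector.reduce.or.v2i1(<2 x i1> %and)
  %ret = xor i1 %any, true
  ret i1 %ret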

Current x86 AVX asm looks like this:

LCPI0_0:
	.quad	0x8000000000000000      ## double -0
	.quad	0x8000000000000000      ## double -0
LCPI0_1:
	.quad	0x3eb0c6f7a0b5ed8d      ## double 9.9999999999999995E-7
	.quad	0x3eb0c6f7a0b5ed8d      ## double 9.9999999999999995E-7
LCPI0_2:
	.quad	0x3ff0000000000000      ## double 1
	.quad	0x3ff0000000000000      ## double 1
	.section	__TEXT,__text,regular,pure_instructions
	.globl	__Z9cmp_lt_gtddd
	.p2align	4, 0x90
__Z9cmp_lt_gtddd:                       ## @_Z9cmp_lt_gtddd
	.cfi_startproc
## %bb.0:                               ## %entry
	vxorpd	LCPI0_0(%rip), %xmm1, %xmm3
	vaddsd	%xmm0, %xmm0, %xmm0
	vunpcklpd	%xmm3, %xmm2, %xmm3 ## xmm3 = xmm2[0],xmm3[0]
	vunpcklpd	%xmm2, %xmm1, %xmm1 ## xmm1 = xmm1[0],xmm2[0]
	vsubpd	%xmm1, %xmm3, %xmm1
	vmovddup	%xmm0, %xmm0    ## xmm0 = xmm0[0,0]
	vdivpd	%xmm0, %xmm1, %xmm0
	vcmpltpd	LCPI0_1(%rip), %xmm0, %xmm1
	vmovmskpd	%xmm1, %eax
	cmpb	$3, %al
	jne	LBB0_1
## %bb.2:                               ## %cleanup
	xorl	%eax, %eax
	retq
LBB0_1:                                 ## %lor.lhs.false
	vmovapd	LCPI0_2(%rip), %xmm1    ## xmm1 = [1.0E+0,1.0E+0]
	vcmpnltpd	%xmm0, %xmm1, %xmm0
	vmovmskpd	%xmm0, %eax
	testb	%al, %al
	setne	%al
	retq

slacka (Mannequin) commented Jun 4, 2021

In c-ray 1.1, gcc 11 is 94% faster than clang 12.
https://www.phoronix.com/scan.php?page=article&item=clang12-gcc11-icelake&num=4

RKSimon (Collaborator, Author) commented Jun 4, 2021

Current IR:

define i1 @cmp_lt_gtddd(double %0, double %1, double %2) {
  %4 = fneg double %1
  %5 = fsub double %2, %1
  %6 = fmul double %0, 2.000000e+00
  %7 = fdiv double %5, %6
  %8 = fsub double %4, %2
  %9 = fdiv double %8, %6
  %10 = fcmp olt double %7, 0x3EB0C6F7A0B5ED8D
  %11 = fcmp olt double %9, 0x3EB0C6F7A0B5ED8D
  %12 = select i1 %10, i1 %11, i1 false
  br i1 %12, label %17, label %13

13:
  %14 = fcmp ule double %7, 1.000000e+00
  %15 = fcmp ule double %9, 1.000000e+00
  %16 = select i1 %14, i1 true, i1 %15
  ret i1 %16

17:
  ret i1 false
}

llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021
RKSimon (Collaborator, Author) commented Jan 25, 2022

Current IR:

define i1 @_Z9cmp_lt_gtddd(double %0, double %1, double %2) {
  %4 = fneg double %1
  %5 = fmul double %0, 2.000000e+00
  %6 = insertelement <2 x double> poison, double %2, i64 0
  %7 = insertelement <2 x double> %6, double %4, i64 1
  %8 = insertelement <2 x double> poison, double %1, i64 0
  %9 = insertelement <2 x double> %8, double %2, i64 1
  %10 = fsub <2 x double> %7, %9
  %11 = insertelement <2 x double> poison, double %5, i64 0
  %12 = shufflevector <2 x double> %11, <2 x double> poison, <2 x i32> zeroinitializer
  %13 = fdiv <2 x double> %10, %12
  %14 = fcmp olt <2 x double> %13, <double 0x3EB0C6F7A0B5ED8D, double 0x3EB0C6F7A0B5ED8D>
  %15 = shufflevector <2 x i1> %14, <2 x i1> poison, <2 x i32> <i32 1, i32 undef>
  %16 = and <2 x i1> %14, %15
  %17 = extractelement <2 x i1> %16, i64 0
  %18 = fcmp ule <2 x double> %13, <double 1.000000e+00, double 1.000000e+00>
  %19 = shufflevector <2 x i1> %18, <2 x i1> poison, <2 x i32> <i32 1, i32 undef>
  %20 = or <2 x i1> %18, %19
  %21 = extractelement <2 x i1> %20, i64 0
  %22 = select i1 %17, i1 false, i1 %21
  ret i1 %22
}

RKSimon added a commit that referenced this issue Apr 12, 2022
We're failing to vectorize several comparison reduction patterns.

Issue #43090 was based off this, but while that simplified test case is now folding, the original still fails due to poor cost-model values for vXi1 extractions.
RKSimon (Collaborator, Author) commented Sep 25, 2022

Candidate Patch: https://reviews.llvm.org/D134605
