Missed optimisation - horizontal max for vectors is not optimised. #23490
Comments
The optimal solution would use 3 maxps instructions. In your 2nd example, after you get the maxps of the two halves, shift over by 2 elements and maxps again, then shift over 1 more and maxps (or maxss) again. So, with lanes x0..x7:

shift by 4, maxps: max(x0,x4) | max(x1,x5) | max(x2,x6) | max(x3,x7)
shift by 2, maxps: max(x0,x4,x2,x6) | max(x1,x5,x3,x7)
shift by 1, maxss: max(x0,x4,x2,x6,x1,x5,x3,x7)
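For reference, a standalone intrinsics sketch of that 3-step reduction (the function name and lane comments are mine, not from the thread; as discussed below, this reordering is only valid under relaxed NaN/signed-zero semantics):

#include <immintrin.h>

float findMaxShuffle(__m256 v)   /* hypothetical name */
{
    __m128 lo = _mm256_castps256_ps128(v);    /* x0 x1 x2 x3 */
    __m128 hi = _mm256_extractf128_ps(v, 1);  /* x4 x5 x6 x7 */
    __m128 m = _mm_max_ps(lo, hi);            /* shift by 4, maxps */
    m = _mm_max_ps(m, _mm_movehl_ps(m, m));   /* shift by 2, maxps */
    m = _mm_max_ss(m, _mm_movehdup_ps(m));    /* shift by 1, maxss */
    return _mm_cvtss_f32(m);
}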
There are 2 parts to solving this. The first is reassociating the chain of max ops. Consider the scalar equivalent:

float max4(float a, float b, float c, float d)
{
    return max(max(max(a, b), c), d);
}

That should be 2 independent max ops followed by a max of the intermediate results, not the serial chain of 3 dependent maxes we currently generate. This is similar to bug 21768 and bug 17305.
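For illustration, the reassociated shape that comment asks for would be something like this (max4_reassoc is a hypothetical name; max is the same helper as in the original report):

inline float max(float a, float b) { return a > b ? a : b; }

float max4_reassoc(float a, float b, float c, float d)
{
    /* max(a,b) and max(c,d) are independent and can execute in
       parallel; only the final combining max depends on both. */
    return max(max(a, b), max(c, d));
}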
That was wrong. Because of the way the SSE max/min instructions work with NaN and signed zero inputs, they are not commutative. So I think the reassociation optimization can only happen when we have a relaxed FP environment. Currently (r245733), we do get the reassociation when using -ffast-math. It should be possible to do the optimization with just -ffinite-math-only -fno-signed-zeros too, but that doesn't work yet.
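A small demo (mine, not from the thread) of that non-commutativity: when the comparison is unordered, maxss returns its second source operand, so swapping the operands changes the result.

#include <immintrin.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    __m128 nan = _mm_set_ss(NAN);
    __m128 one = _mm_set_ss(1.0f);
    printf("%f\n", _mm_cvtss_f32(_mm_max_ss(one, nan))); /* prints nan */
    printf("%f\n", _mm_cvtss_f32(_mm_max_ss(nan, one))); /* prints 1.0 */
    return 0;
}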
Update: as of r344320, we get close to optimal codegen with -O2 -mavx -ffast-math; the IR is vectorized by the SLP vectorizer. As noted before, without some FP loosening this may be difficult to optimize because of NaN behavior. I think SLP needs to be adjusted for that to happen.
AFAICT the outstanding issues are:
1 - we should be able to use xmm instead of full ymm widths (beneficial to btver2 as it avoids double pumping of the 128-bit ALUs).
2 - the SLP vectorizer shouldn't need the full 'fast' attribute (which attributes must it have?).
Right - we've made several IR and backend changes towards this goal, but this one isn't handled yet. I have an idea to improve this at least partly in instcombine (this assumes SLP has done the vectorization).
The FMF needed for min/max reordering are 'nnan' and 'nsz' (because if we change the order of operations, we don't know how NaNs and signed zeros will shake out). There's no actual math here, so that rules out 'reassoc'. Infinities should be handled correctly in all cases, so we don't need 'ninf'. 'arcp', 'afn', and 'contract' are irrelevant. It's possible that one of the newer, more tightly specified min/max ops makes all FMF unnecessary, but I'm not sure yet.
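To make the 'nsz' requirement concrete, here is a tiny demo (mine, not from the thread): the source-level select pattern is also order-sensitive for signed zeros, so reordering it without 'nsz' could change results.

#include <stdio.h>

static float max_sel(float a, float b) { return a > b ? a : b; }

int main(void)
{
    /* 0.0f > -0.0f is false, so the select returns the second operand;
       swapping the arguments flips the sign of the result. */
    printf("%f\n", max_sel(0.0f, -0.0f));  /* prints -0.000000 */
    printf("%f\n", max_sel(-0.0f, 0.0f));  /* prints 0.000000 */
    return 0;
}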
We manage to scalarize the final max op with:
This takes care of the narrowing:
We need to fix FMF in IR to solve the flags part of this, because 'nsz' doesn't actually make sense on an fcmp. See bug 38086.
We side-stepped that issue by creating reduction intrinsics in IR. But there's still an FMF bug in expandReductions(), so we don't get the expected vectorized shuffle reduction.
After further fixes, we finally have the ideal output with the minimal relaxed FP settings (presumably -ffinite-math-only -fno-signed-zeros, per the earlier comment).
Extended Description
Computing the horizontal max (or min, etc.) of a vector is not optimised into faster code.
For example, the C code:
#include <immintrin.h>
inline float max(float a, float b)
{
return a > b ? a : b;
}
float findMax(__m256 v)
{
return max(max(max(max(max(max(max(v[0], v[1]), v[2]), v[3]), v[4]), v[5]), v[6]), v[7]);
}
is compiled by Clang 3.7/trunk to:
findMax(float __vector(8)): # @findMax(float __vector(8))
vmovshdup %xmm0, %xmm1 # xmm1 = xmm0[1,1,3,3]
vmaxss %xmm1, %xmm0, %xmm1
vpermilpd $1, %xmm0, %xmm2 # xmm2 = xmm0[1,0]
vmaxss %xmm2, %xmm1, %xmm1
vpermilps $231, %xmm0, %xmm2 # xmm2 = xmm0[3,1,2,3]
vmaxss %xmm2, %xmm1, %xmm1
vextractf128 $1, %ymm0, %xmm0
vmaxss %xmm0, %xmm1, %xmm1
vmovshdup %xmm0, %xmm2 # xmm2 = xmm0[1,1,3,3]
vmaxss %xmm2, %xmm1, %xmm1
vpermilpd $1, %xmm0, %xmm2 # xmm2 = xmm0[1,0]
vmaxss %xmm2, %xmm1, %xmm1
vpermilps $231, %xmm0, %xmm0 # xmm0 = xmm0[3,1,2,3]
vmaxss %xmm0, %xmm1, %xmm0
vzeroupper
retq
Which is basically 7 vmaxss instructions in a serial dependency chain (slow).
It could be optimised to something like:
float findMax(__m256 v)
{
    __m128 a = _mm256_extractf128_ps(v, 0);
    __m128 b = _mm256_extractf128_ps(v, 1);
    __m128 m = _mm_max_ps(a, b);                  /* reduce 8 lanes to 4 */
    return max(max(m[0], m[1]), max(m[2], m[3])); /* reduce 4 lanes to 1 */
}
Which compiles to:
findMax(float __vector(8)): # @findMax(float __vector(8))
vextractf128 $1, %ymm0, %xmm1
vmaxps %xmm1, %xmm0, %xmm0
vmovshdup %xmm0, %xmm1 # xmm1 = xmm0[1,1,3,3]
vmaxss %xmm1, %xmm0, %xmm1
vpermilpd $1, %xmm0, %xmm2 # xmm2 = xmm0[1,0]
vpermilps $231, %xmm0, %xmm0 # xmm0 = xmm0[3,1,2,3]
vmaxss %xmm0, %xmm2, %xmm0
vmaxss %xmm0, %xmm1, %xmm0
vzeroupper
retq
See http://goo.gl/jM3KNz for the code.