
Missed optimisation - horizontal max for vectors is not optimised. #23490

Closed
llvmbot opened this issue Apr 3, 2015 · 11 comments
Labels
bugzilla Issues migrated from bugzilla

Comments

@llvmbot
Collaborator

llvmbot commented Apr 3, 2015

Bugzilla Link 23116
Resolution FIXED
Resolved on Feb 04, 2021 10:58
Version trunk
OS Windows NT
Reporter LLVM Bugzilla Contributor
CC @adibiagio,@RKSimon,@rotateright

Extended Description

Computing the horizontal max (or min etc..) of a vector is not optimised to faster code.
For example, the C code:

#include <immintrin.h>

inline float max(float a, float b)
{
return a > b ? a : b;
}

float findMax(__m256 v)
{
return max(max(max(max(max(max(max(v[0], v[1]), v[2]), v[3]), v[4]), v[5]), v[6]), v[7]);
}

is compiled by Clang 3.7/trunk to:

findMax(float __vector(8)): # @findMax(float __vector(8))
vmovshdup %xmm0, %xmm1 # xmm1 = xmm0[1,1,3,3]
vmaxss %xmm1, %xmm0, %xmm1
vpermilpd $1, %xmm0, %xmm2 # xmm2 = xmm0[1,0]
vmaxss %xmm2, %xmm1, %xmm1
vpermilps $231, %xmm0, %xmm2 # xmm2 = xmm0[3,1,2,3]
vmaxss %xmm2, %xmm1, %xmm1
vextractf128 $1, %ymm0, %xmm0
vmaxss %xmm0, %xmm1, %xmm1
vmovshdup %xmm0, %xmm2 # xmm2 = xmm0[1,1,3,3]
vmaxss %xmm2, %xmm1, %xmm1
vpermilpd $1, %xmm0, %xmm2 # xmm2 = xmm0[1,0]
vmaxss %xmm2, %xmm1, %xmm1
vpermilps $231, %xmm0, %xmm0 # xmm0 = xmm0[3,1,2,3]
vmaxss %xmm0, %xmm1, %xmm0
vzeroupper
retq

Which is basically a serial chain of 7 dependent vmaxss instructions (slow).

It could be optimised to something like:

float findMax(__m256 v)
{
__m128 a = _mm256_extractf128_ps(v, 0);
__m128 b = _mm256_extractf128_ps(v, 1);

__m128 c = _mm_max_ps(a, b);

return max(max(c[0], c[1]), max(c[2], c[3]));

}

Which compiles to:

findMax(float __vector(8)): # @findMax(float __vector(8))
vextractf128 $1, %ymm0, %xmm1
vmaxps %xmm1, %xmm0, %xmm0
vmovshdup %xmm0, %xmm1 # xmm1 = xmm0[1,1,3,3]
vmaxss %xmm1, %xmm0, %xmm1
vpermilpd $1, %xmm0, %xmm2 # xmm2 = xmm0[1,0]
vpermilps $231, %xmm0, %xmm0 # xmm0 = xmm0[3,1,2,3]
vmaxss %xmm0, %xmm2, %xmm0
vmaxss %xmm0, %xmm1, %xmm0
vzeroupper
retq

See http://goo.gl/jM3KNz for the code.

@rotateright
Contributor

The optimal solution would use 3 maxps instructions. In your 2nd example, after you get the maxps of the two halves, shift over by 2 elements and maxps again, then shift over 1 more and maxps (or maxss) again.

So shift by 4:

x0 | x1 | x2 | x3
x4 | x5 | x6 | x7

maxps, then shift by 2:

max(x0,x4) | max(x1,x5)
max(x2,x6) | max(x3,x7)

maxps, then shift by 1:

max(x0,x4,x2,x6)
max(x1,x5,x3,x7)

maxss = max(x0,x4,x2,x6,x1,x5,x3,x7)
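
For reference, here is roughly what that 3-step reduction looks like when written by hand with AVX intrinsics. This is only a sketch: the function name is illustrative, and it matches the serial max chain from the report only when NaN and signed-zero ordering effects can be ignored.

#include <immintrin.h>

float findMax_sketch(__m256 v)
{
    __m128 lo = _mm256_castps256_ps128(v);         // x0 x1 x2 x3
    __m128 hi = _mm256_extractf128_ps(v, 1);       // x4 x5 x6 x7
    __m128 m  = _mm_max_ps(lo, hi);                // shift by 4, maxps
    m = _mm_max_ps(m, _mm_movehl_ps(m, m));        // shift by 2, maxps
    m = _mm_max_ss(m, _mm_shuffle_ps(m, m, 0x55)); // shift by 1, maxss
    return _mm_cvtss_f32(m);                       // max of all 8 lanes
}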

@rotateright
Contributor

There are 2 parts to solving this:

  1. Recognize the reassociation opportunity of max/min:

float max4(float a, float b, float c, float d) {
float t1 = a > b ? a : b;
float t2 = t1 > c ? t1 : c;
float t3 = t2 > d ? t2 : d;
return t3;
}

That should be 2 independent max ops followed by a max of the intermediate results (see the sketch after this list), not this:
maxss %xmm1, %xmm0
maxss %xmm2, %xmm0
maxss %xmm3, %xmm0

This is similar to bug 21768 and bug 17305.

  2. Recognize the parallel max functionality available in maxps and use it as described in comment 1.
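
For clarity, the reassociated shape that item 1 asks for looks like this (a sketch; the helper name is illustrative, and as the following comment notes, the rewrite is only legal under relaxed FP assumptions):

float max4_reassoc(float a, float b, float c, float d) {
    float t1 = a > b ? a : b;   // independent of t2
    float t2 = c > d ? c : d;   // independent of t1
    return t1 > t2 ? t1 : t2;   // combine the two partial maxima
}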

@rotateright
Contributor

> float max4(float a, float b, float c, float d) {
> float t1 = a > b ? a : b;
> float t2 = t1 > c ? t1 : c;
> float t3 = t2 > d ? t2 : d;
> return t3;
> }
>
> That should be 2 independent max ops followed by a max of the intermediate results.

That was wrong. Because of the way the SSE max/min instructions work with NaN and signed zero inputs, they are not commutative. So I think that the reassociation optimization can only happen when we have a relaxed FP environment. Currently (r245733), we do get the reassociation when using -ffast-math.

It should be possible to do the optimization with just -ffinite-math-only -fno-signed-zeros too, but that doesn't work yet.
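
To make the order sensitivity concrete, here is a small standalone demonstration (not from the original thread). The scalar select below computes the same result maxss produces for its two operands, and swapping the operands changes the answer when a NaN or differently signed zeros are involved:

#include <math.h>
#include <stdio.h>

static float sel_max(float a, float b) { return a > b ? a : b; }

int main(void) {
    float qnan = nanf("");
    // NaN: the comparison is false, so the second operand always wins.
    printf("%f %f\n", sel_max(qnan, 1.0f), sel_max(1.0f, qnan)); // 1.000000 nan
    // Signed zero: -0.0 and +0.0 compare equal, so operand order picks the result.
    printf("%g %g\n", sel_max(-0.0f, 0.0f), sel_max(0.0f, -0.0f)); // 0 -0
    return 0;
}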

@rotateright
Contributor

Update: as of r344320, we get close to optimal codegen with -O2 -mavx -ffast-math:
vextractf128 $1, %ymm0, %xmm1
vmaxps %ymm1, %ymm0, %ymm0
vpermilpd $1, %xmm0, %xmm1 ## xmm1 = xmm0[1,0]
vmaxps %ymm1, %ymm0, %ymm0
vmovshdup %xmm0, %xmm1 ## xmm1 = xmm0[1,1,3,3]
vmaxps %ymm1, %ymm0, %ymm0

The IR is vectorized by the SLP vectorizer:
define float @findMax(<8 x float> %v) local_unnamed_addr #0 {
entry:
%rdx.shuf = shufflevector <8 x float> %v, <8 x float> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
%rdx.minmax.cmp = fcmp fast olt <8 x float> %rdx.shuf, %v
%rdx.minmax.select = select <8 x i1> %rdx.minmax.cmp, <8 x float> %v, <8 x float> %rdx.shuf
%rdx.shuf33 = shufflevector <8 x float> %rdx.minmax.select, <8 x float> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
%rdx.minmax.cmp34 = fcmp fast ogt <8 x float> %rdx.minmax.select, %rdx.shuf33
%rdx.minmax.select35 = select <8 x i1> %rdx.minmax.cmp34, <8 x float> %rdx.minmax.select, <8 x float> %rdx.shuf33
%rdx.shuf36 = shufflevector <8 x float> %rdx.minmax.select35, <8 x float> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
%rdx.minmax.cmp37 = fcmp fast ogt <8 x float> %rdx.minmax.select35, %rdx.shuf36
%rdx.minmax.select38 = select <8 x i1> %rdx.minmax.cmp37, <8 x float> %rdx.minmax.select35, <8 x float> %rdx.shuf36
%0 = extractelement <8 x float> %rdx.minmax.select38, i32 0
ret float %0
}


As noted before, without some FP loosening this may be difficult to optimize because of NaN behavior. I think SLP needs to be adjusted for that to happen.

@RKSimon
Collaborator

RKSimon commented Dec 6, 2018

> Update: as of r344320, we get close to optimal codegen with -O2 -mavx -ffast-math:
> vextractf128 $1, %ymm0, %xmm1
> vmaxps %ymm1, %ymm0, %ymm0
> vpermilpd $1, %xmm0, %xmm1 ## xmm1 = xmm0[1,0]
> vmaxps %ymm1, %ymm0, %ymm0
> vmovshdup %xmm0, %xmm1 ## xmm1 = xmm0[1,1,3,3]
> vmaxps %ymm1, %ymm0, %ymm0

AFAICT the outstanding issues are:

1 - we should be able to use xmm instead of full ymm widths (beneficial to btver2 as it avoids double pumping of the 128-bit ALUs).

2 - the SLP shouldn't need full 'fast' attribute (which attributes must it have?)

@rotateright
Contributor

> Update: as of r344320, we get close to optimal codegen with -O2 -mavx -ffast-math:
> vextractf128 $1, %ymm0, %xmm1
> vmaxps %ymm1, %ymm0, %ymm0
> vpermilpd $1, %xmm0, %xmm1 ## xmm1 = xmm0[1,0]
> vmaxps %ymm1, %ymm0, %ymm0
> vmovshdup %xmm0, %xmm1 ## xmm1 = xmm0[1,1,3,3]
> vmaxps %ymm1, %ymm0, %ymm0
>
> AFAICT the outstanding issues are:
>
> 1 - we should be able to use xmm instead of full ymm widths (beneficial to btver2 as it avoids double pumping of the 128-bit ALUs).

Right - we've made several IR and backend changes towards this goal, but this one isn't handled yet. I have an idea to improve this at least partly in IR in instcombine (this assumes SLP has done the vectorization).

> 2 - the SLP shouldn't need full 'fast' attribute (which attributes must it have?)

The FMF needed for min/max reordering are 'nnan' and 'nsz' (because if we change the order of operations, then we don't know how those will shake out).

There's no actual math here, so that rules out 'reassoc'. Infinities should be handled correctly in all cases, so we don't need 'ninf'. 'arcp', 'afn', 'contract' are irrelevant.

It's possible that one of the newer, more tightly specified min/max ops makes all FMF unnecessary, but I'm not sure yet.

@rotateright
Contributor

We manage to scalarize the final max op with:
https://reviews.llvm.org/rL355792

@rotateright
Contributor

This takes care of the narrowing:
https://reviews.llvm.org/rL360639

@rotateright
Contributor

We need to fix FMF in IR to solve the flags part of this because 'nsz' doesn't actually make sense on an fcmp. See bug 38086.

@rotateright
Contributor

> We need to fix FMF in IR to solve the flags part of this because 'nsz' doesn't actually make sense on an fcmp. See bug 38086.

We side-stepped that issue by creating reduction intrinsics in IR. With:
https://reviews.llvm.org/rG77adbe6a8c71
...I think we finally have FMF straightened out in SLP, so we produce something like this:
%0 = call nnan ninf nsz float @llvm.vector.reduce.fmax.v8f32(<8 x float> %v)

But there's still an FMF bug in expandReductions(), so we don't get the expected vectorized shuffle reduction.

@rotateright
Contributor

After:
https://reviews.llvm.org/rG056d31dd2a04 (and several other patches as noted)

And with:
$ clang -O2 23116.c -S -o - -fno-signed-zeros -ffinite-math-only -mavx

We finally have the ideal output with the minimal relaxed FP settings:
vextractf128 $1, %ymm0, %xmm1
vmaxps %xmm1, %xmm0, %xmm0
vpermilpd $1, %xmm0, %xmm1 ## xmm1 = xmm0[1,0]
vmaxps %xmm1, %xmm0, %xmm0
vmovshdup %xmm0, %xmm1 ## xmm1 = xmm0[1,1,3,3]
vmaxss %xmm1, %xmm0, %xmm0
vzeroupper
retq

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 9, 2021
This issue was closed.