Missed optimisation - horizontal max for vectors is not optimised. #23490
Comments
The optimal solution would use 3 maxps instructions. In your 2nd example, after you get the maxps of the two halves, shift over by 2 elements and maxps again, then shift over 1 more and maxps (or maxss) again. So, with lanes x0..x7:

shift by 4, maxps: max(x0,x4) | max(x1,x5) | max(x2,x6) | max(x3,x7)
shift by 2, maxps: max(x0,x4,x2,x6) | max(x1,x5,x3,x7)
shift by 1, maxss: max(x0,x4,x2,x6,x1,x5,x3,x7)
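For reference, a standalone intrinsics sketch of that 3-step reduction (the function name and lane comments are mine, not from the thread; as discussed below, this reordering is only valid under relaxed NaN/signed-zero semantics):

#include <immintrin.h>

float findMaxShuffle(__m256 v)   /* hypothetical name */
{
    __m128 lo = _mm256_castps256_ps128(v);    /* x0 x1 x2 x3 */
    __m128 hi = _mm256_extractf128_ps(v, 1);  /* x4 x5 x6 x7 */
    __m128 m = _mm_max_ps(lo, hi);            /* shift by 4, maxps */
    m = _mm_max_ps(m, _mm_movehl_ps(m, m));   /* shift by 2, maxps */
    m = _mm_max_ss(m, _mm_movehdup_ps(m));    /* shift by 1, maxss */
    return _mm_cvtss_f32(m);
}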
There are 2 parts to solving this. The first is reassociating the chain of max ops. Consider the scalar equivalent:

float max4(float a, float b, float c, float d)
{
    return max(max(max(a, b), c), d);
}

That should be 2 independent max ops followed by a max of the intermediate results, not the serial chain of 3 dependent maxes we currently generate. This is similar to bug 21768 and bug 17305.
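For illustration, the reassociated shape that comment asks for would be something like this (max4_reassoc is a hypothetical name; max is the same helper as in the original report):

inline float max(float a, float b) { return a > b ? a : b; }

float max4_reassoc(float a, float b, float c, float d)
{
    /* max(a,b) and max(c,d) are independent and can execute in
       parallel; only the final combining max depends on both. */
    return max(max(a, b), max(c, d));
}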
That was wrong. Because of the way the SSE max/min instructions work with NaN and signed zero inputs, they are not commutative. So I think the reassociation optimization can only happen when we have a relaxed FP environment. Currently (r245733), we do get the reassociation when using -ffast-math. It should be possible to do the optimization with just -ffinite-math-only -fno-signed-zeros too, but that doesn't work yet.
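A small demo (mine, not from the thread) of that non-commutativity: when the comparison is unordered, maxss returns its second source operand, so swapping the operands changes the result.

#include <immintrin.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    __m128 nan = _mm_set_ss(NAN);
    __m128 one = _mm_set_ss(1.0f);
    printf("%f\n", _mm_cvtss_f32(_mm_max_ss(one, nan))); /* prints nan */
    printf("%f\n", _mm_cvtss_f32(_mm_max_ss(nan, one))); /* prints 1.0 */
    return 0;
}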
Update: as of r344320, we get close to optimal codegen with -O2 -mavx -ffast-math; the IR is vectorized by the SLP vectorizer. As noted before, without some FP loosening this may be difficult to optimize because of NaN behavior. I think SLP needs to be adjusted for that to happen.
AFAICT the outstanding issues are:
1 - we should be able to use xmm instead of full ymm widths (beneficial to btver2 as it avoids double pumping of the 128-bit ALUs).
2 - the SLP vectorizer shouldn't need the full 'fast' attribute (which attributes must it have?).
Right - we've made several IR and backend changes towards this goal, but this one isn't handled yet. I have an idea to improve this at least partly in instcombine (this assumes SLP has done the vectorization).
The FMF needed for min/max reordering are 'nnan' and 'nsz' (because if we change the order of operations, we don't know how NaNs and signed zeros will shake out). There's no actual math here, so that rules out 'reassoc'. Infinities should be handled correctly in all cases, so we don't need 'ninf'. 'arcp', 'afn', and 'contract' are irrelevant. It's possible that one of the newer, more tightly specified min/max ops makes all FMF unnecessary, but I'm not sure yet.
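To make the 'nsz' requirement concrete, here is a tiny demo (mine, not from the thread): the source-level select pattern is also order-sensitive for signed zeros, so reordering it without 'nsz' could change results.

#include <stdio.h>

static float max_sel(float a, float b) { return a > b ? a : b; }

int main(void)
{
    /* 0.0f > -0.0f is false, so the select returns the second operand;
       swapping the arguments flips the sign of the result. */
    printf("%f\n", max_sel(0.0f, -0.0f));  /* prints -0.000000 */
    printf("%f\n", max_sel(-0.0f, 0.0f));  /* prints 0.000000 */
    return 0;
}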
We manage to scalarize the final max op with:
This takes care of the narrowing:
We need to fix FMF in IR to solve the flags part of this, because 'nsz' doesn't actually make sense on an fcmp. See bug 38086.
We side-stepped that issue by creating reduction intrinsics in IR. But there's still an FMF bug in expandReductions(), so we don't get the expected vectorized shuffle reduction.
After further fixes, we finally have the ideal output with the minimal relaxed FP settings (presumably -ffinite-math-only -fno-signed-zeros, per the earlier comment).
Extended Description
Computing the horizontal max (or min, etc.) of a vector is not optimised into faster code.
For example, the C code:
#include <immintrin.h>
inline float max(float a, float b)
{
return a > b ? a : b;
}
float findMax(__m256 v)
{
return max(max(max(max(max(max(max(v[0], v[1]), v[2]), v[3]), v[4]), v[5]), v[6]), v[7]);
}
is compiled by Clang 3.7/trunk to:
findMax(float __vector(8)): # @findMax(float __vector(8))
vmovshdup %xmm0, %xmm1 # xmm1 = xmm0[1,1,3,3]
vmaxss %xmm1, %xmm0, %xmm1
vpermilpd $1, %xmm0, %xmm2 # xmm2 = xmm0[1,0]
vmaxss %xmm2, %xmm1, %xmm1
vpermilps $231, %xmm0, %xmm2 # xmm2 = xmm0[3,1,2,3]
vmaxss %xmm2, %xmm1, %xmm1
vextractf128 $1, %ymm0, %xmm0
vmaxss %xmm0, %xmm1, %xmm1
vmovshdup %xmm0, %xmm2 # xmm2 = xmm0[1,1,3,3]
vmaxss %xmm2, %xmm1, %xmm1
vpermilpd $1, %xmm0, %xmm2 # xmm2 = xmm0[1,0]
vmaxss %xmm2, %xmm1, %xmm1
vpermilps $231, %xmm0, %xmm0 # xmm0 = xmm0[3,1,2,3]
vmaxss %xmm0, %xmm1, %xmm0
vzeroupper
retq
Which is basically 7 vmaxss instructions in a serial dependency chain (slow).
It could be optimised to something like:
float findMax(__m256 v)
{
    __m128 a = _mm256_extractf128_ps(v, 0);
    __m128 b = _mm256_extractf128_ps(v, 1);
    __m128 m = _mm_max_ps(a, b);                  /* reduce 8 lanes to 4 */
    return max(max(m[0], m[1]), max(m[2], m[3])); /* reduce 4 lanes to 1 */
}
Which compiles to:
findMax(float __vector(8)): # @findMax(float __vector(8))
vextractf128 $1, %ymm0, %xmm1
vmaxps %xmm1, %xmm0, %xmm0
vmovshdup %xmm0, %xmm1 # xmm1 = xmm0[1,1,3,3]
vmaxss %xmm1, %xmm0, %xmm1
vpermilpd $1, %xmm0, %xmm2 # xmm2 = xmm0[1,0]
vpermilps $231, %xmm0, %xmm0 # xmm0 = xmm0[3,1,2,3]
vmaxss %xmm0, %xmm2, %xmm0
vmaxss %xmm0, %xmm1, %xmm0
vzeroupper
retq
See http://goo.gl/jM3KNz for the code.