Compiler no longer optimizing some horizontal add/sub patterns after r310260 #34072
The IR for this example is identical to bug 34725, so if we view this as a backend problem, one of these would be a duplicate.
For reference: https://reviews.llvm.org/rL310260 (SLP)
And here's the IR that we currently (r314201) produce:
$ ./clang -O2 34724.c -S -o - -mavx -emit-llvm
There's a related theme here and in bug 34730 and bug 34716:
%ext = extractelement <2 x float> %x, i32 0
-->
%shuf = shufflevector <2 x float> %x, <2 x float> undef, <4 x i32> <i32 undef, i32 0, i32 undef, i32 undef>
Can we canonicalize to shuffles more aggressively in instcombine now? If we can recognize insert_subvector, extract_subvector, identity (this is already true, I think), and single-element-repositioning (better name for this example?) as special-case masks, it should reduce complexity in the backend.
That was wrong - the instructions are similar, but the element indexes for the shuffles and insert/extract are not the same.
We could probably make the list of shuffles instcombine recognizes a bit longer, if that helps; we just want to guard against producing shuffles the backend won't recognize.
Looking at this report again now: nothing's changed since the last comments. Here's a better view of the code going into SLP:
There are multiple potential problems contained in this example:
instead of this pair:
If we are viewing this as a backend pattern matching problem, I don't know how we'd solve it after SLP partially vectorizes the IR. Matching a mix of scalar/vector code to try to find a horizontal op is a potentially limitless exercise. I think we either need to get SLP to stop vectorizing on this example or get it to vectorize to a <4 x float> fadd. The backend should recognize either of those possibilities and turn it into 'vhaddps' for x86 AVX.
Something else that is odd is that the codegen is wildly different depending upon which element you shuffle as undefined: https://gcc.godbolt.org/z/e-6XUz
Many variants generate the haddps plus a load of extra code as well!
I have a very preliminary patch for non-pow2 support in the SLP vectorizer, which will extend <3 x float> to <4 x float>. It is too optimistic in terms of the cost of the vectorization, but generally speaking it looks good. I hope to fix all the problems very soon.
Great! If we get a <4 x float> math op from SLP, then that should be no problem for the backend to match as a horizontal op (if that's good for the target). I'm still trying to make instcombine produce code that SLP can handle more easily here:
@spatel Do you think this is a candidate for the VectorCombine pass? |
We already get the extractelement transforms in -vector-combine that I would expect given the IR in comment 6, so SLP is out of the picture currently. The given IR does result in a horizontal add, but we don't have a way to combine these into a single hadd yet:
A possible transform for vector-combine that will help:
%5 = shufflevector <4 x float> %b, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 3, i32 undef>
We chose to shuffle element 3 of %b over to element 2 in the 1st shuffle, and that then gets pushed back over to element 3 in the 2nd shuffle. We want to adjust the 1st shuffle so that we're calculating the binop in element 3, because that creates a select ("blend") shuffle instead:
%5 = shufflevector <4 x float> %b, <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 undef, i32 2>
Small hack proposal: |
Further backend improvements: https://reviews.llvm.org/D83789 |
… with another existing HADD/SUB Fixes some more cases from #34072 where undemanded vector elements prevent HADD/SUB being matched on slow targets
…undef,x) -> shuffle(x,undef). foldInsExtVectorToShuffle is likely to be inserting into an undef value, so make sure we've canonicalized this to the RHS in the folded shuffle to help further VectorCombine folds. Minor tweak to help #34072
Add tests for horizontal add patterns with missing/undemanded elements - which typically prevents folding to the (add (shuffle a, b),(shuffle a, b)) optimal pattern
…huffle(z)),binop(shuffle(y),shuffle(w)) -> binop(shuffle(x,z),shuffle(y,w)) (#120984) Some patterns (in particular horizontal style patterns) can end up with shuffles straddling both sides of a binop/cmp. Where individually the folds aren't worth it, by merging the (oneuse) shuffles we can notably reduce the net instruction count and cost. One of the final steps towards finally addressing #34072
Matches the existing horizontal-add tests, with the additional non-commutable constraint
Closing - the recent improvements to VectorCombine/CostModel have addressed all the 128-bit add/sub 'undefined element' cases.
Extended Description
Prior to r310260, the compiler was optimizing some add/sub + shuffle patterns to horizontal add/sub instructions, but no longer seems to do so.
Consider this example:
When compiled with the options "-S -O2 -march=bdver2" using a compiler prior to r310260, the compiler would generate the following assembly:
After r310260, the compiler is now generating the following code for the same function: