128 bit arithmetic --- inefficient logic test #49541
Comments
Assigned to @nerh
Current codegen: https://godbolt.org/z/bWvGK5vhT
I'd like to fix this issue if there are no objections. The LLVM mid-end generates a comparison of
Then it is combined into the following nodes:
These are legalized into:
And finally combined into:
I see several ways to get rid of the FSHL:
Personally I lean toward the 3rd approach, as it would work for more than just the single setcc case, but I'm not sure whether the patterns it could combine arise frequently enough to make it worth implementing. Note that the issue is not limited to 128-bit variables: the same suboptimal code is emitted for a 64-bit variable on 32-bit architectures (AArch32 in particular).
Review: https://reviews.llvm.org/D111530
In D111530, I suggested that we add some relatively basic pattern-matching folds for shifts and funnel shifts and avoid a more specialized solution if possible. We can start by implementing at least one of these in IR because it's easier to write the code and verify with Alive2: https://alive2.llvm.org/ce/z/qHpmNn

This will need to be adapted/extended for SDAG to handle the motivating bug (#49541) because the patterns only appear later with that example (added some tests: bb850d4). This can be extended within InstSimplify to handle cases where we 'and' with a shift too (in that case, kill the funnel shift). We could also handle patterns where the shift and funnel shift directions are inverted, but I think it's better to canonicalize that instead to avoid a pattern-match case explosion.

Differential Revision: https://reviews.llvm.org/D120253
This is the SDAG equivalent of an instcombine transform added with: fd80760 This is another step towards solving #49541 and part of an alternative set of more general transforms than what is proposed in D111530. https://alive2.llvm.org/ce/z/ToxaE8
Current Codegen:
or (with slowSHLD):
@llvm/issue-subscribers-backend-x86
I have one more SDAG patch in the works for this. AFAICT, we've already fixed the motivating example on targets other than x86. But x86 has legal funnel-shift instructions for scalar types (shrd/shld), so we need one more combine for setcc like this:

It's a minor diff for most x86 targets - just trading a funnel shift for a shift. For vectors, it would eliminate a shift. We could add that in IR too, but it would be too early to help this specific problem. There are 2 patterns * 2 (commutes) * 2 (left/right) * 2 (predicates) = 16 tests needed.
This is another family of patterns based on issue #49541
fshl (or X, Y), X, C ==/!= 0 --> or (shl Y, C), X ==/!= 0
fshl X, (or X, Y), C ==/!= 0 --> or (srl Y, BW-C), X ==/!= 0

This is similar to an existing setcc-of-rotate fold, but the matching requires more checks for the more general funnel op: https://alive2.llvm.org/ce/z/Ab2jDd

We are effectively decomposing the funnel shift into logical shifts, reassociating, and removing a shift. This should get us the final improvements for x86-64 that were originally shown in D111530 (#49541); x86-32 still shows some SHLD/SHRD, so the pattern is not matching there yet.

Differential Revision: https://reviews.llvm.org/D122919
@sqrmax @nerh @rotateright Resolve this now?
The motivating example was fixed, but there's still an issue with comparisons of funnel shifts against zero when the shift operands are 4 (or more) times wider than the target machine's registers (i.e. 128-bit funnel shifts on an ARM target; see https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/ARM/icmp-shift-opt.ll#L139).
OK cheers - I'm making this a generic codegen bug as it's not just x86
Is there a source example of how we would get the "4 or more" pattern to the backend? InstCombine canonicalizes the IR in the remaining ARM test (https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/ARM/icmp-shift-opt.ll#L139) to use a mask instead of a shift:

And the backend seems to get the ideal asm once that is done:

So if there's still some case that is escaping optimization, we could try the shift->and transform in SDAG.
If you replace the left shift with a right shift in the example, then InstCombine will replace the shift with a comparison against a constant:

And DAGCombiner will replace that with a shift:

There are no funnel shifts on ARM, so DAGCombiner doesn't fold the corresponding subtrees into FSHL/FSHR, and as a result we can't use the existing optimizations to eliminate unnecessary shifts.
Hoist funnel shift from logic op:

logic_op (FSH x0, x1, s), (FSH y0, y1, s) --> FSH (logic_op x0, y0), (logic_op x1, y1), s

The transformation improves the code generated for some cases related to issue #49541. Reducing the number of funnel shifts can also improve throughput on x86 CPUs by utilizing more available ports: https://quick-bench.com/q/gC7AKkJJsDZzRrs_JWDzm9t_iDM

Transformation correctness checks: https://alive2.llvm.org/ce/z/TKPULH https://alive2.llvm.org/ce/z/UvTd_9 https://alive2.llvm.org/ce/z/j8qW3_ https://alive2.llvm.org/ce/z/7Wq7gE https://alive2.llvm.org/ce/z/Xr5w8R https://alive2.llvm.org/ce/z/D5xe_E https://alive2.llvm.org/ce/z/2yBZiy

Differential Revision: https://reviews.llvm.org/D130994
Hoist and combine shift operations from a logic-operations tree:

logic (logic (SH x0, s), y), (logic (SH x1, s), z) --> logic (SH (logic x0, x1), s), (logic y, z)

The transformation improves the code generated for some cases related to issue #49541.

Correctness: https://alive2.llvm.org/ce/z/pVqVgY https://alive2.llvm.org/ce/z/YVvT-q https://alive2.llvm.org/ce/z/W5zTBq https://alive2.llvm.org/ce/z/YfJsvJ https://alive2.llvm.org/ce/z/3YSyDM https://alive2.llvm.org/ce/z/Bs2kzk https://alive2.llvm.org/ce/z/EoQpzU https://alive2.llvm.org/ce/z/Jnc_5H https://alive2.llvm.org/ce/z/_LP6k_ https://alive2.llvm.org/ce/z/KvZNC9

Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D131189
Extended Description
See the attached code, compiled with -O2. The condition inside the while() test is unnecessarily evaluated with a 128-bit shift instruction, even though a 64-bit shift instruction suffices. However, changing || to | in the logic condition yields the more efficient code below.

Found with clang-10 on Ubuntu 20.04 LTS; verified for clang 10, 11, and trunk using godbolt. Note that gcc -O2 handles both of these cases, emitting the more efficient code.