128 bit arithmetic --- inefficient logic test #49541

Open · sqrmax opened this issue May 2, 2021 · 12 comments
Labels: bugzilla (issues migrated from Bugzilla), llvm:codegen

@sqrmax (Author) commented May 2, 2021

Bugzilla Link: 50197
Version: trunk
OS: All
Attachments: Sample code
CC: @nerh, @LebedevRI, @RKSimon, @rotateright

Extended Description

See the attached code, compiled with -O2. The condition inside the while () test is unnecessarily evaluated with a 128-bit shift instruction:

square:                                 # @square
        mov     rax, rdi
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        add     rax, 1
        adc     rsi, 0
        mov     rcx, rsi
        shld    rcx, rax, 4
        mov     rdx, rsi
        shr     rdx, 60
        or      rdx, rcx
        jne     .LBB0_1
        ret

even though a 64-bit shift instruction suffices. However, changing || to | in the logic condition yields the more efficient code below:

square:                                 # @square
        mov     rax, rdi
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        add     rax, 1
        adc     rsi, 0
        mov     rcx, rax
        shr     rcx, 60
        or      rcx, rsi
        jne     .LBB0_1
        ret

Found with clang-10 on Ubuntu 20.04 LTS; verified for clang 10, 11, and trunk using godbolt. Note that gcc -O2 handles both of these cases, emitting the more efficient code.
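
The attachment itself did not survive the Bugzilla migration. A hypothetical reconstruction, consistent with the assembly above and with the DAG dump later in the thread (the loop condition is a compare of a 128-bit value against 1 << 60), might look like this; the function name matches the asm labels, but the exact condition is an assumption:

    // Hypothetical reconstruction of the attached sample code. The loop runs
    // while any of bits 60..127 of x are set, i.e. while x >= (u128)1 << 60.
    // With || the two 64-bit tests get merged into a full 128-bit shift; with
    // | the cheaper 64-bit form shown above is emitted.
    typedef unsigned __int128 u128;
    typedef unsigned long long u64;

    u128 square(u128 x) {
        while ((u64)(x >> 64) || (u64)(x >> 60)) // change || to | for the fast form
            ++x;
        return x;
    }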

@sqrmax (Author) commented May 2, 2021

assigned to @nerh

@RKSimon (Collaborator) commented May 3, 2021

Current Codegen: https://godbolt.org/z/bWvGK5vhT

@ghost commented Sep 23, 2021

I'd like to fix this issue if there are no objections.

The LLVM mid-end generates a comparison of j with 1 << 60 for the loop condition, and the following DAG node is created for it:

    t19: i1 = setcc t7, Constant:i128<1152921504606846976>, setult:ch

Then it is combined into the following nodes:

    t33: i128 = srl t7, Constant:i64<60>
  t40: i1 = setcc t33, Constant:i128<0>, setne:ch

These are legalized into:

          t45: i64 = srl t41, Constant:i64<60>
          t44: i64 = shl t42, Constant:i64<4>
        t46: i64 = or t45, t44
        t47: i64 = srl t42, Constant:i64<60>
      t49: i64 = or t46, t47
    t48: i8 = setcc t49, Constant:i64<0>, setne:ch

And finally combined into:

      t53: i64 = fshl t42, t41, Constant:i64<4>
      t47: i64 = srl t42, Constant:i64<60>
    t49: i64 = or t53, t47
  t48: i8 = setcc t49, Constant:i64<0>, setne:ch
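
For reference, ISD::FSHL concatenates its first two operands and keeps the high half of the left-shifted result, so t53 above computes exactly the low word of the expanded i128 srl. A reference implementation for i64:

    #include <cstdint>

    // Reference semantics of the fshl node t53: concatenate hi:lo into a
    // 128-bit value, shift left by c (mod 64), keep the high 64 bits.
    // For c = 4 this is (hi << 4) | (lo >> 60) -- the same value as the
    // srl/shl/or triple (t44/t45/t46) it replaced.
    uint64_t fshl64(uint64_t hi, uint64_t lo, unsigned c) {
        c &= 63;
        return c ? (hi << c) | (lo >> (64 - c)) : hi;
    }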

I see several ways to get rid of the FSHL:

  1. Hook into DAGTypeLegalizer::ExpandShiftByConstant, check that the SRL has only one use and that it's a SetCC Eq/Ne against 0, and then skip the unnecessary shifts;
  2. Explicitly match the expanded SRL and combine it into (setcc (or (srl a), b, C), 0, setne);
  3. Add several peepholes to DAGCombiner to iteratively combine:
    a) (or (fshl a, b, C1), (srl a, C2)) => (or (rotl a), (srl b)) // profitable regardless of the setcc optimization, because rotations are usually faster than funnel shifts
    b) (setcc (or (rotl a), b), 0, setne) => (setcc (or a, b), 0, setne)

Personally, I lean toward the 3rd approach, since it would work beyond this single setcc case, but I'm not sure whether the patterns it could combine arise frequently enough to make it worth implementing; a sketch of peephole (b) follows.
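
A minimal sketch of peephole (b), assuming a DAGCombiner-style entry point (illustrative only; names and placement are assumptions, not code from an actual patch):

    #include "llvm/CodeGen/SelectionDAG.h"
    #include <utility>
    using namespace llvm;

    // (setcc (or (rot A, C), B), 0, eq/ne) --> (setcc (or A, B), 0, eq/ne)
    // A rotate is a bijection, so it never changes whether A is zero and can
    // be dropped under an eq/ne-with-zero test.
    static SDValue combineSetCCOfRotate(SDNode *N, SelectionDAG &DAG) {
      SDValue LHS = N->getOperand(0);
      SDValue RHS = N->getOperand(1);
      ISD::CondCode CC = cast<CondCodeSDNode>(N->getOperand(2))->get();

      if ((CC != ISD::SETEQ && CC != ISD::SETNE) || !isNullConstant(RHS) ||
          LHS.getOpcode() != ISD::OR)
        return SDValue();

      // Accept the rotate on either side of the OR.
      SDValue Rot = LHS.getOperand(0), Other = LHS.getOperand(1);
      if (Rot.getOpcode() != ISD::ROTL && Rot.getOpcode() != ISD::ROTR)
        std::swap(Rot, Other);
      if (Rot.getOpcode() != ISD::ROTL && Rot.getOpcode() != ISD::ROTR)
        return SDValue();

      SDLoc DL(N);
      SDValue NewOr = DAG.getNode(ISD::OR, DL, LHS.getValueType(),
                                  Rot.getOperand(0), Other);
      return DAG.getSetCC(DL, N->getValueType(0), NewOr, RHS, CC);
    }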

Note that the issue is not limited to 128-bit variables: the same suboptimal code is emitted for 64-bit variables on 32-bit architectures (AArch32 in particular).

@ghost commented Oct 11, 2021

Review: https://reviews.llvm.org/D111530

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 11, 2021
rotateright added a commit that referenced this issue Feb 23, 2022
In D111530, I suggested that we add some relatively basic pattern-matching
folds for shifts and funnel shifts and avoid a more specialized solution
if possible.

We can start by implementing at least one of these in IR because it's
easier to write the code and verify with Alive2:
https://alive2.llvm.org/ce/z/qHpmNn

This will need to be adapted/extended for SDAG to handle the motivating
bug ( #49541 ) because the patterns only appear later with that example
(added some tests: bb850d4)

This can be extended within InstSimplify to handle cases where we 'and'
with a shift too (in that case, kill the funnel shift).
We could also handle patterns where the shift and funnel shift directions
are inverted, but I think it's better to canonicalize that instead to
avoid pattern-match case explosion.

Differential Revision: https://reviews.llvm.org/D120253
rotateright added a commit that referenced this issue Feb 27, 2022
This is the SDAG equivalent of an instcombine transform added with:
fd80760

This is another step towards solving #49541 and part of an alternative
set of more general transforms than what is proposed in D111530.

https://alive2.llvm.org/ce/z/ToxaE8
@RKSimon (Collaborator) commented Mar 23, 2022

Current Codegen:

slow128(unsigned __int128):                            # @slow128(unsigned __int128)
        movq    %rdi, %rax
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        addq    $1, %rax
        adcq    $0, %rsi
        movq    %rax, %rcx
        orq     %rsi, %rcx
        shrdq   $60, %rsi, %rcx
        jne     .LBB0_1
        retq

or (with slowSHLD):

slow128(unsigned __int128):                            # @slow128(unsigned __int128)
        movq    %rdi, %rax
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        addq    $1, %rax
        adcq    $0, %rsi
        movq    %rax, %rdx
        movq    %rsi, %rcx
        shrq    $60, %rdx
        rolq    $4, %rcx
        orq     %rcx, %rdx
        jne     .LBB0_1
        retq

@llvmbot (Collaborator) commented Mar 23, 2022

@llvm/issue-subscribers-backend-x86

@rotateright (Contributor) commented

I have one more SDAG patch in the works for this. AFAICT, we've already fixed the motivating example on targets other than x86. But x86 has legal funnel-shift instructions for scalar types (shrd/shld), so we need one more combine for setcc like this:
https://alive2.llvm.org/ce/z/wsbSDw
// fshl (or F1, Y), F1, C ==/!= 0 --> or (shl Y, C), F1 ==/!= 0
// fshl F0, (or Y, F0), C ==/!= 0 --> or (srl Y, BW-C), F0 ==/!= 0

It's a minor diff for most x86 targets - just trading a funnel shift for a shift. For vectors, it would eliminate a shift. We could add that in IR too, but it would be too early to help this specific problem.

There are 2 patterns * 2 (commutes) * 2 (left/right) * 2 (predicates) = 16 tests needed.
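
A quick exhaustive sanity check of the first pattern over i8 (illustrative only -- the Alive2 proof above is the authoritative one):

    #include <cassert>
    #include <cstdint>

    // fshl over i8: concatenate a:b, shift left by c (mod 8), keep the high byte.
    static uint8_t fshl8(uint8_t a, uint8_t b, unsigned c) {
      c &= 7;
      return c ? (uint8_t)((a << c) | (b >> (8 - c))) : a;
    }

    // Verify: (fshl (or F1, Y), F1, C) ==/!= 0 is equivalent to
    // (or (shl Y, C), F1) ==/!= 0 for all i8 inputs and non-zero constant C.
    int main() {
      for (unsigned f1 = 0; f1 < 256; ++f1)
        for (unsigned y = 0; y < 256; ++y)
          for (unsigned c = 1; c < 8; ++c) {
            bool lhs = fshl8((uint8_t)(f1 | y), (uint8_t)f1, c) != 0;
            bool rhs = (uint8_t)((y << c) | f1) != 0;
            assert(lhs == rhs);
          }
      return 0;
    }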

rotateright added a commit that referenced this issue Apr 1, 2022
This is another family of patterns based on issue #49541
rotateright added a commit that referenced this issue Apr 11, 2022
fshl (or X, Y), X, C ==/!= 0 --> or (shl Y, C), X ==/!= 0
fshl X, (or X, Y), C ==/!= 0 --> or (srl Y, BW-C), X ==/!= 0

This is similar to an existing setcc-of-rotate fold, but the
matching requires more checks for the more general funnel op:
https://alive2.llvm.org/ce/z/Ab2jDd

We are effectively decomposing the funnel shift into logical
shifts, reassociating, and removing a shift.

This should get us the final improvements for x86-64 that were
originally shown in D111530
( #49541 );
x86-32 still shows some SHLD/SHRD, so the pattern is not
matching there yet.

Differential Revision: https://reviews.llvm.org/D122919
@RKSimon (Collaborator) commented May 2, 2022

@sqrmax @nerh @rotateright Resolve this now?

@ghost commented May 2, 2022

The motivating example was fixed, but there's still an issue with comparing funnel shifts against zero when the shift operands are 4 (or more) times wider than the target machine's registers (i.e. 128-bit funnel shifts on ARM targets; see https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/ARM/icmp-shift-opt.ll#L139).

@RKSimon (Collaborator) commented May 2, 2022

OK, cheers - I'm making this a generic codegen bug as it's not just x86.

@rotateright (Contributor) commented

Is there a source example of how we would get the "4 or more" pattern to the backend? InstCombine canonicalizes the IR in the remaining ARM test ( https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/ARM/icmp-shift-opt.ll#L139 ) to use a mask instead of a shift:
https://alive2.llvm.org/ce/z/xP-ZEN
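
At the source level that canonicalization looks roughly like this (a hypothetical reduction, not the actual test):

    // (x << 4) != 0 holds exactly when any of the low 124 bits of x are set,
    // so InstCombine rewrites the shift as a mask test:
    // (x & ((unsigned __int128)-1 >> 4)) != 0.
    bool shl_ne_zero(unsigned __int128 x) {
        return (x << 4) != 0;
    }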

And the backend seems to get the ideal asm once that is done:
https://godbolt.org/z/ovMjn6en8

So if there's still some case that is escaping optimization, we could try the shift->and transform in SDAG.

@ghost commented May 3, 2022

If you replace the left shift with a right shift in the example, then InstCombine will replace the shift with a comparison against a constant:
https://alive2.llvm.org/ce/z/WMQWq4

And DAGCombiner will replace it with a shift:
https://godbolt.org/z/b4Wfdjj8f

There are no funnel shifts on ARM, so DAGCombiner doesn't fold the corresponding subtrees into FSHL/FSHR, and as a result we can't use the existing optimizations to eliminate the unnecessary shifts.
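
A hypothetical source-level reduction of that problematic right-shift variant:

    // InstCombine rewrites (x >> 60) != 0 as icmp ugt x, ((u128)1 << 60) - 1,
    // and DAGCombiner expands that compare back into multi-word shifts --
    // which, without a funnel-shift node on ARM, nothing cleans up.
    bool shr_ne_zero(unsigned __int128 x) {
        return (x >> 60) != 0;
    }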

rotateright pushed a commit that referenced this issue Aug 5, 2022
Hoist funnel shift from logic op:
logic_op (FSH x0, x1, s), (FSH y0, y1, s) --> FSH (logic_op x0, y0), (logic_op x1, y1), s

The transformation improves code generated for some cases related to
issue #49541.

Reducing the number of funnel shifts can also improve throughput on x86 CPUs by making use of more of the
available ports: https://quick-bench.com/q/gC7AKkJJsDZzRrs_JWDzm9t_iDM

Transformation correctness checks:
https://alive2.llvm.org/ce/z/TKPULH
https://alive2.llvm.org/ce/z/UvTd_9
https://alive2.llvm.org/ce/z/j8qW3_
https://alive2.llvm.org/ce/z/7Wq7gE
https://alive2.llvm.org/ce/z/Xr5w8R
https://alive2.llvm.org/ce/z/D5xe_E
https://alive2.llvm.org/ce/z/2yBZiy

Differential Revision: https://reviews.llvm.org/D130994
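
A compact property check of the OR instance over i8 (sampled second operands; illustrative only -- the Alive2 links above cover all the logic ops):

    #include <cassert>
    #include <cstdint>

    // fshl over i8: concatenate a:b, shift left by s (mod 8), keep the high byte.
    static uint8_t fshl8(uint8_t a, uint8_t b, unsigned s) {
      s &= 7;
      return s ? (uint8_t)((a << s) | (b >> (8 - s))) : a;
    }

    // or (fshl x0, x1, s), (fshl y0, y1, s) == fshl (or x0, y0), (or x1, y1), s.
    // The two halves of a funnel shift occupy disjoint bit positions, so any
    // bitwise logic op distributes over it.
    int main() {
      for (unsigned x0 = 0; x0 < 256; ++x0)
        for (unsigned y0 = 0; y0 < 256; ++y0)
          for (unsigned s = 0; s < 8; ++s) {
            uint8_t x1 = (uint8_t)(x0 * 37 + 11), y1 = (uint8_t)(y0 * 59 + 3);
            uint8_t lhs = fshl8((uint8_t)x0, x1, s) | fshl8((uint8_t)y0, y1, s);
            uint8_t rhs = fshl8((uint8_t)(x0 | y0), (uint8_t)(x1 | y1), s);
            assert(lhs == rhs);
          }
      return 0;
    }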
@fzhinkin fzhinkin assigned fzhinkin and unassigned ghost Aug 7, 2022
fzhinkin added a commit that referenced this issue Aug 12, 2022
Hoist and combine shift operations from logic operations tree:
logic (logic (SH x0, s), y), (logic (SH x1, s), z)  --> logic (SH (logic x0, x1), s), (logic y, z)

The transformation improves code generated for some cases related to issue #49541.

Correctness:
https://alive2.llvm.org/ce/z/pVqVgY
https://alive2.llvm.org/ce/z/YVvT-q
https://alive2.llvm.org/ce/z/W5zTBq
https://alive2.llvm.org/ce/z/YfJsvJ
https://alive2.llvm.org/ce/z/3YSyDM
https://alive2.llvm.org/ce/z/Bs2kzk
https://alive2.llvm.org/ce/z/EoQpzU
https://alive2.llvm.org/ce/z/Jnc_5H
https://alive2.llvm.org/ce/z/_LP6k_
https://alive2.llvm.org/ce/z/KvZNC9

Reviewed By: spatel

Differential Revision: https://reviews.llvm.org/D131189
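
And a similar check for the XOR instance of this fold (sampled i8 inputs; illustrative only):

    #include <cassert>
    #include <cstdint>

    // xor (xor (shl x0, s), y), (xor (shl x1, s), z)
    //   == xor (shl (xor x0, x1), s), (xor y, z)
    // shl distributes over every bitwise logic op, which is what makes the
    // hoisting legal.
    int main() {
      for (unsigned x0 = 0; x0 < 256; ++x0)
        for (unsigned x1 = 0; x1 < 256; ++x1)
          for (unsigned s = 0; s < 8; ++s) {
            uint8_t y = (uint8_t)(x0 * 29 + 7), z = (uint8_t)(x1 * 43 + 13);
            uint8_t lhs = (uint8_t)(((uint8_t)(x0 << s) ^ y) ^
                                    ((uint8_t)(x1 << s) ^ z));
            uint8_t rhs = (uint8_t)((uint8_t)((x0 ^ x1) << s) ^ (y ^ z));
            assert(lhs == rhs);
          }
      return 0;
    }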
mem-frob pushed a commit to draperlaboratory/hope-llvm-project that referenced this issue Oct 7, 2022