Created attachment 24821 [details]
Sample code

See attached code, compiled with -O2. The condition inside the while () test is unnecessarily evaluated with a 128-bit shift instruction:

square:                                 # @square
        mov     rax, rdi
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        add     rax, 1
        adc     rsi, 0
        mov     rcx, rsi
        shld    rcx, rax, 4
        mov     rdx, rsi
        shr     rdx, 60
        or      rdx, rcx
        jne     .LBB0_1
        ret

even though a 64-bit shift instruction suffices. However, changing || to | in the logic condition yields the more efficient code below:

square:                                 # @square
        mov     rax, rdi
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        add     rax, 1
        adc     rsi, 0
        mov     rcx, rax
        shr     rcx, 60
        or      rcx, rsi
        jne     .LBB0_1
        ret

Found with clang-10 on Ubuntu 20.04 LTS; verified for clang 10, 11, and trunk using godbolt. Note that gcc -O2 handles both cases, emitting the more efficient code.
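The attachment itself is not reproduced here; the following is a hypothetical sketch of the two condition forms, reconstructed from the assembly above (the function names and the split into lo/hi halves are my assumptions, not the actual sample code). Both forms test whether a 128-bit value hi:lo has any bit at position 60 or above set:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reconstruction -- attachment 24821 is not shown here.
 * Both predicates check whether the 128-bit value hi:lo is >= 2^60. */

/* Short-circuit form: lowered via a 128-bit shift (shld) as shown above. */
static int cond_logical(uint64_t lo, uint64_t hi) {
    return (lo >> 60) != 0 || hi != 0;
}

/* Bitwise form: a single 64-bit shr plus an or suffices. */
static int cond_bitwise(uint64_t lo, uint64_t hi) {
    return ((lo >> 60) | hi) != 0;
}
```

The two are semantically identical, since (a != 0 || b != 0) is the same predicate as ((a | b) != 0); only the lowering differs.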
Current Codegen: https://godbolt.org/z/bWvGK5vhT
I'd like to fix this issue if there are no objections.

The LLVM mid-end generates a comparison of `j` with 1 << 60 for the loop condition, and the following DAG node is created for it:

  t19: i1 = setcc t7, Constant:i128<1152921504606846976>, setult:ch

It is then combined into the following nodes:

  t33: i128 = srl t7, Constant:i64<60>
  t40: i1 = setcc t33, Constant:i128<0>, setne:ch

which are legalized into:

  t45: i64 = srl t41, Constant:i64<60>
  t44: i64 = shl t42, Constant:i64<4>
  t46: i64 = or t45, t44
  t47: i64 = srl t42, Constant:i64<60>
  t49: i64 = or t46, t47
  t48: i8 = setcc t49, Constant:i64<0>, setne:ch

and finally combined into:

  t53: i64 = fshl t42, t41, Constant:i64<4>
  t47: i64 = srl t42, Constant:i64<60>
  t49: i64 = or t53, t47
  t48: i8 = setcc t49, Constant:i64<0>, setne:ch

I see several ways to get rid of the FSHL:

1) Hook into DAGTypeLegalizer::ExpandShiftByConstant, check that the SRL has only one use and it is a SetCC Eq/Ne against 0, and then skip the unnecessary shifts;
2) Explicitly match the expanded SRL and combine it into (setcc (or (srl a), b, C), 0, setne);
3) Add several peepholes to DAGCombiner to iteratively combine:
   a) (or (fshl a, b, C1), (srl a, C2)) => (or (rotl a), (srl b)) — this is profitable regardless of the setcc optimization, because rotations are usually faster than funnel shifts;
   b) (setcc (or (rotl a), b), 0, setne) => (setcc (or a, b), 0, setne).

Personally, I lean toward the 3rd approach, since it would work beyond the single setcc case, but I'm not sure whether the patterns it could combine arise frequently enough to make it worth implementing.

Note that the issue is not limited to 128-bit variables: the same suboptimal code is emitted for 64-bit variables on 32-bit architectures (AArch32 in particular).
Review: https://reviews.llvm.org/D111530