clang has __builtin_addc* functions, which are supposed to emit hardware add-with-carry instructions. However, there is no corresponding intrinsic on the LLVM side, so clang emits a sequence of instructions that is only recognized and folded into a single hardware instruction in two cases:

- the carry input is 0, or
- the carry output is unused.

This means that any carry chain longer than 2 results in inefficient code:

void add3(unsigned long long *restrict a,
          unsigned long long *restrict b,
          unsigned long long *restrict c)
{
    unsigned long long cf = 0;
    c[0] = __builtin_addcll(a[0], b[0], cf, &cf);
    c[1] = __builtin_addcll(a[1], b[1], cf, &cf);
    c[2] = __builtin_addcll(a[2], b[2], cf, &cf);
}

compiles to:

add3:                                   # @add3
	.cfi_startproc
# BB#0:
	movq	(%rdi), %rax
	movq	(%rsi), %r8
	leaq	(%rax,%r8), %rcx
	movq	%rcx, (%rdx)
	movq	8(%rdi), %rcx
	addq	8(%rsi), %rcx
	setb	%r9b
	addq	%r8, %rax
	adcq	$0, %rcx
	setb	%al
	orb	%r9b, %al
	movzbl	%al, %eax
	movq	%rcx, 8(%rdx)
	movq	16(%rsi), %rcx
	addq	16(%rdi), %rcx
	addq	%rax, %rcx
	movq	%rcx, 16(%rdx)
	retq

I suppose we're going to need a new target-independent generic intrinsic, say

  { iX, i1 } @llvm.uadd.with.overflow.carry.iX(iX, iX, i1)

(and a corresponding one for subtraction as well) and map it to ISD::ADDE / ISD::ADDCARRY.
I have also run into this. Here is the code I am using:

#include <stdint.h>

struct UInt96 {
    uint32_t d2, d1, d0;
};

void ADD96_v1(const UInt96& s, UInt96& d)
{
    uint64_t sum = uint64_t(d.d0) + uint64_t(s.d0);
    d.d0 = sum;
    sum >>= 32;
    sum += uint64_t(d.d1) + uint64_t(s.d1);
    d.d1 = sum;
    sum >>= 32;
    sum += uint64_t(d.d2) + uint64_t(s.d2);
    d.d2 = sum;
}

void ADD96_v2(const UInt96& s, UInt96& d)
{
    uint32_t carry;
    d.d0 = __builtin_addc(d.d0, s.d0, 0, &carry);
    d.d1 = __builtin_addc(d.d1, s.d1, carry, &carry);
    d.d2 = __builtin_addc(d.d2, s.d2, carry, &carry);
}

ADD96_v1 was my original function, and ADD96_v2 is my attempt to rewrite it using __builtin_addc. While the second version works, I am surprised that it is less optimal than the first: using the Godbolt Compiler Explorer, the generated ARM code contains three seemingly unnecessary additional instructions; worse, the PPC code uses branch instructions! My compiler options are:

  -target arm-unknown-linux-gnueabihf -mcpu=cortex-a9 -O2

and:

  -target powerpc-unknown-linux-gnu -mcpu=603e -O2

It seems that the __builtin_addc* functions don't really have an advantage; is that true?
FYI – https://reviews.llvm.org/D70079
Current Codegen: https://godbolt.org/z/cVSccY
rG257acbf6aee9 solves much of the poor codegen generically in DAGCombine. The x86/ARM code now works as expected; PPC still struggles, as the target needs to be updated to use ADDCARRY/SUBCARRY. Andy - maybe open a PPC-specific bug?
Adding some PPC folks who might know what's best to do to improve PPC support and reduce the branching. Do we need to extend the ISD::ADDCARRY/SUBCARRY combines (where possible) to better support ISD::ADDE/SUBE, or can we move PPC away from ISD::ADDC/SUBC + ISD::ADDE/SUBE?
We aren't currently working on moving away from the old carry-setting/using nodes to the new ones, but we have no fundamental objection to doing so. I am not sure that statement is much help, though.