
Missed optimization: inefficient codegen for __builtin_addc #35591

Open
llvmbot opened this issue Feb 5, 2018 · 6 comments
Labels: bugzilla (Issues migrated from bugzilla), confirmed (Verified by a second party), llvm:codegen

Comments

@llvmbot
Collaborator

llvmbot commented Feb 5, 2018

Bugzilla Link: 36243
Version: trunk
OS: All
Reporter: LLVM Bugzilla Contributor
CC: @davezarzycki, @hfinkel, @RKSimon, @nemanjai, @rotateright
Fixed by commit(s): rG257acbf6aee9

Extended Description

clang has __builtin_addc* builtins, which are supposed to emit hardware add-with-carry instructions. However, there is no corresponding intrinsic on the LLVM side, so clang emits a sequence of instructions that is only recognized and folded into a single hardware instruction in two cases:

  • carry input is 0, or
  • carry output is unused
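
For reference, when both the carry-in and the carry-out are live, each __builtin_addcll call is lowered to roughly the IR below (a simplified sketch; the wrapper function and value names are only for illustration): two @llvm.uadd.with.overflow.i64 calls whose carry bits are or'ed together, which is the pattern the backend fails to fold once the chain grows.

declare { i64, i1 } @llvm.uadd.with.overflow.i64(i64, i64)

; sum = __builtin_addcll(x, y, carry_in, &carry_out), roughly as emitted by clang
define i64 @addc_step(i64 %x, i64 %y, i64 %carry_in, i64* %carry_out) {
  %r1 = call { i64, i1 } @llvm.uadd.with.overflow.i64(i64 %x, i64 %y)
  %s1 = extractvalue { i64, i1 } %r1, 0
  %c1 = extractvalue { i64, i1 } %r1, 1
  %r2 = call { i64, i1 } @llvm.uadd.with.overflow.i64(i64 %s1, i64 %carry_in)
  %s2 = extractvalue { i64, i1 } %r2, 0
  %c2 = extractvalue { i64, i1 } %r2, 1
  %c  = or i1 %c1, %c2          ; either add may have produced a carry
  %cw = zext i1 %c to i64
  store i64 %cw, i64* %carry_out
  ret i64 %s2
}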

This means that any carry chain longer than two results in inefficient code:

void add3(
    unsigned long long *restrict a,
    unsigned long long *restrict b,
    unsigned long long *restrict c
) {
    unsigned long long cf = 0;
    c[0] = __builtin_addcll(a[0], b[0], cf, &cf);
    c[1] = __builtin_addcll(a[1], b[1], cf, &cf);
    c[2] = __builtin_addcll(a[2], b[2], cf, &cf);
}

Compiles to:

add3:
        .cfi_startproc
# BB#0:
        movq    (%rdi), %rax
        movq    (%rsi), %r8
        leaq    (%rax,%r8), %rcx
        movq    %rcx, (%rdx)
        movq    8(%rdi), %rcx
        addq    8(%rsi), %rcx
        setb    %r9b
        addq    %r8, %rax
        adcq    $0, %rcx
        setb    %al
        orb     %r9b, %al
        movzbl  %al, %eax
        movq    %rcx, 8(%rdx)
        movq    16(%rsi), %rcx
        addq    16(%rdi), %rcx
        addq    %rax, %rcx
        movq    %rcx, 16(%rdx)
        retq

I suppose we're going to need a new target-independent generic intrinsic,
say { iX, i1 } @llvm.uadd.with.overflow.carry.iX(iX, iX, i1) (and a corresponding one for subtraction as well), and to map it to ISD::ADDE / ISD::ADDCARRY.
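
For illustration only (the intrinsic does not exist yet; the name follows the suggestion above and the wrapper function is made up), the low two limbs of add3 could then keep the carry as an explicit i1, which maps directly onto ISD::ADDCARRY:

; hypothetical, proposed intrinsic: { sum, carry_out } = a + b + carry_in
declare { i64, i1 } @llvm.uadd.with.overflow.carry.i64(i64, i64, i1)

define { i64, i1 } @add_lo_two_limbs(i64 %a0, i64 %a1, i64 %b0, i64 %b1) {
  ; limb 0: carry-in is 0
  %r0 = call { i64, i1 } @llvm.uadd.with.overflow.carry.i64(i64 %a0, i64 %b0, i1 false)
  %c0 = extractvalue { i64, i1 } %r0, 1
  ; limb 1: carry-in is limb 0's carry-out, with no or/zext glue in between
  ; (the i64 sums would be stored to c[0]/c[1] in add3; omitted here)
  %r1 = call { i64, i1 } @llvm.uadd.with.overflow.carry.i64(i64 %a1, i64 %b1, i1 %c0)
  ret { i64, i1 } %r1
}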

@llvmbot
Collaborator Author

llvmbot commented Sep 6, 2018

I have also run into this. Here is the code I am using:

#include <stdint.h>

struct UInt96
  {
  uint32_t d2, d1, d0;
  };

void ADD96_v1(const UInt96& s, UInt96& d)
  {
  uint64_t sum = uint64_t(d.d0) + uint64_t(s.d0);
  d.d0 = sum; sum >>= 32;
  sum += uint64_t(d.d1) + uint64_t(s.d1);
  d.d1 = sum; sum >>= 32;
  sum += uint64_t(d.d2) + uint64_t(s.d2);
  d.d2 = sum;
  }

void ADD96_v2(const UInt96& s, UInt96& d)
  {
  uint32_t carry;

  d.d0 = __builtin_addc(d.d0, s.d0, 0, &carry);
  d.d1 = __builtin_addc(d.d1, s.d1, carry, &carry);
  d.d2 = __builtin_addc(d.d2, s.d2, carry, &carry);
  }

ADD96_v1 was my original function, and ADD96_v2 is my attempt to rewrite it using __builtin_addc. While the second version works, I am surprised that it produces worse code than the first: in the Godbolt Compiler Explorer, the generated ARM code contains three additional, seemingly unnecessary instructions; worse, the PPC code uses branch instructions!

My compiler options are:

-target arm-unknown-linux-gnueabihf -mcpu=cortex-a9 -O2

and:

-target powerpc-unknown-linux-gnu -mcpu=603e -O2

It seems that the __builtin_addc* functions don't really have an advantage; is that true?

@davezarzycki
Member

FYI – https://reviews.llvm.org/D70079

@RKSimon
Collaborator

RKSimon commented Nov 12, 2019

Current Codegen: https://godbolt.org/z/cVSccY

@RKSimon
Collaborator

RKSimon commented Nov 21, 2019

rG257acbf6aee9 solves much of the poor codegen generically in DAGCombine.

The x86/ARM code now works as expected.

PPC still struggles, as the target needs to be updated to use ADDCARRY/SUBCARRY.

Andy - maybe open a PPC-specific bug?

@RKSimon
Collaborator

RKSimon commented Nov 22, 2019

Adding some PPC guys who might know what's best to do to improve PPC support and reduce the branching.

Do we need to extend the ISD::ADD/SUBCARRY combines (where possible) to better support ISD::ADD/SUBE or can we move PPC away from ISD::ADD/SUBC + ISD::ADD/SUBE?

@nemanjai
Member

We aren't currently working on moving away from the old carry setting/using nodes to the new ones, but we do not have any fundamental issue with doing so. I am not sure this statement is any help though.

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021
@llvmbot llvmbot added the confirmed Verified by a second party label Jan 26, 2022