
Missed optimization: inefficient codegen for __builtin_addc #35591

Open
llvmbot opened this issue Feb 5, 2018 · 6 comments
Labels: bugzilla (Issues migrated from bugzilla), confirmed (Verified by a second party), llvm:codegen

Comments

@llvmbot
Collaborator

llvmbot commented Feb 5, 2018

Bugzilla Link: 36243
Version: trunk
OS: All
Reporter: LLVM Bugzilla Contributor
CC: @davezarzycki, @hfinkel, @RKSimon, @nemanjai, @rotateright
Fixed by commit(s): rG257acbf6aee9

Extended Description

clang has __builtin_addc* builtins, which are supposed to emit hardware add-with-carry instructions. However, there is no corresponding intrinsic on the LLVM side, so clang emits a sequence of instructions that is only recognized and folded into a single hardware instruction in two cases:

  • carry input is 0, or
  • carry output is unused
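
For reference, when both the carry-in and the carry-out are live, each __builtin_addcll call is lowered to roughly the IR below (a simplified sketch; the wrapper function and value names are only for illustration): two @llvm.uadd.with.overflow.i64 calls whose carry bits are or'ed together, which is the pattern the backend fails to fold once the chain grows.

declare { i64, i1 } @llvm.uadd.with.overflow.i64(i64, i64)

; sum = __builtin_addcll(x, y, carry_in, &carry_out), roughly as emitted by clang
define i64 @addc_step(i64 %x, i64 %y, i64 %carry_in, i64* %carry_out) {
  %r1 = call { i64, i1 } @llvm.uadd.with.overflow.i64(i64 %x, i64 %y)
  %s1 = extractvalue { i64, i1 } %r1, 0
  %c1 = extractvalue { i64, i1 } %r1, 1
  %r2 = call { i64, i1 } @llvm.uadd.with.overflow.i64(i64 %s1, i64 %carry_in)
  %s2 = extractvalue { i64, i1 } %r2, 0
  %c2 = extractvalue { i64, i1 } %r2, 1
  %c  = or i1 %c1, %c2          ; either add may have produced a carry
  %cw = zext i1 %c to i64
  store i64 %cw, i64* %carry_out
  ret i64 %s2
}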

This means that any carry chain longer than two results in inefficient code:

void add3(
    unsigned long long *restrict a,
    unsigned long long *restrict b,
    unsigned long long *restrict c
) {
    unsigned long long cf = 0;
    c[0] = __builtin_addcll(a[0], b[0], cf, &cf);
    c[1] = __builtin_addcll(a[1], b[1], cf, &cf);
    c[2] = __builtin_addcll(a[2], b[2], cf, &cf);
}

Compiles to:

add3:
        .cfi_startproc
# BB#0:
        movq    (%rdi), %rax
        movq    (%rsi), %r8
        leaq    (%rax,%r8), %rcx
        movq    %rcx, (%rdx)
        movq    8(%rdi), %rcx
        addq    8(%rsi), %rcx
        setb    %r9b
        addq    %r8, %rax
        adcq    $0, %rcx
        setb    %al
        orb     %r9b, %al
        movzbl  %al, %eax
        movq    %rcx, 8(%rdx)
        movq    16(%rsi), %rcx
        addq    16(%rdi), %rcx
        addq    %rax, %rcx
        movq    %rcx, 16(%rdx)
        retq

I suppose we're going to need a new target-independent generic intrinsic,
say { iX, i1 } @llvm.uadd.with.overflow.carry.iX(iX, iX, i1) (and a corresponding one for subtraction as well), and to map it to ISD::ADDE / ISD::ADDCARRY.
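
For illustration only (the intrinsic does not exist yet; the name follows the suggestion above and the wrapper function is made up), the low two limbs of add3 could then keep the carry as an explicit i1, which maps directly onto ISD::ADDCARRY:

; hypothetical, proposed intrinsic: { sum, carry_out } = a + b + carry_in
declare { i64, i1 } @llvm.uadd.with.overflow.carry.i64(i64, i64, i1)

define { i64, i1 } @add_lo_two_limbs(i64 %a0, i64 %a1, i64 %b0, i64 %b1) {
  ; limb 0: carry-in is 0
  %r0 = call { i64, i1 } @llvm.uadd.with.overflow.carry.i64(i64 %a0, i64 %b0, i1 false)
  %c0 = extractvalue { i64, i1 } %r0, 1
  ; limb 1: carry-in is limb 0's carry-out, with no or/zext glue in between
  ; (the i64 sums would be stored to c[0]/c[1] in add3; omitted here)
  %r1 = call { i64, i1 } @llvm.uadd.with.overflow.carry.i64(i64 %a1, i64 %b1, i1 %c0)
  ret { i64, i1 } %r1
}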

@llvmbot
Collaborator Author

llvmbot commented Sep 6, 2018

I have also run into this. Here is the code I am using:

#include <stdint.h>

struct UInt96
  {
  uint32_t d2, d1, d0;
  };

void ADD96_v1(const UInt96& s, UInt96& d)
  {
  uint64_t sum = uint64_t(d.d0) + uint64_t(s.d0);
  d.d0 = sum; sum >>= 32;
  sum += uint64_t(d.d1) + uint64_t(s.d1);
  d.d1 = sum; sum >>= 32;
  sum += uint64_t(d.d2) + uint64_t(s.d2);
  d.d2 = sum;
  }

void ADD96_v2(const UInt96& s, UInt96& d)
  {
  uint32_t carry;

  d.d0 = __builtin_addc(d.d0, s.d0, 0, &carry);
  d.d1 = __builtin_addc(d.d1, s.d1, carry, &carry);
  d.d2 = __builtin_addc(d.d2, s.d2, carry, &carry);
  }

ADD96_v1 was my original function, and ADD96_v2 is my attempt to rewrite it using __builtin_addc. While the second version works, I am surprised that it produces worse code than the first: in the Godbolt Compiler Explorer, the generated ARM code contains three additional, seemingly unnecessary instructions; worse, the PPC code uses branch instructions!

My compiler options are:

-target arm-unknown-linux-gnueabihf -mcpu=cortex-a9 -O2

and:

-target powerpc-unknown-linux-gnu -mcpu=603e -O2

It seems that the __builtin_addc* functions don't really have an advantage; is that true?

@davezarzycki
Member

FYI – https://reviews.llvm.org/D70079

@RKSimon
Collaborator

RKSimon commented Nov 12, 2019

Current Codegen: https://godbolt.org/z/cVSccY

@RKSimon
Collaborator

RKSimon commented Nov 21, 2019

rG257acbf6aee9 solves much of the poor codegen generically in DAGCombine.

The x86/ARM code now works as expected.

PPC still struggles, as the target needs to be updated to use ADDCARRY/SUBCARRY.

Andy - maybe open a PPC-specific bug?

@RKSimon
Collaborator

RKSimon commented Nov 22, 2019

Adding some PPC guys who might know what's best to do to improve PPC support and reduce the branching.

Do we need to extend the ISD::ADD/SUBCARRY combines (where possible) to better support ISD::ADD/SUBE or can we move PPC away from ISD::ADD/SUBC + ISD::ADD/SUBE?

@nemanjai
Member

We aren't currently working on moving away from the old carry setting/using nodes to the new ones, but we do not have any fundamental issue with doing so. I am not sure this statement is any help though.

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021
@llvmbot llvmbot added the confirmed Verified by a second party label Jan 26, 2022