clang has __builtin_addc* functions, which are supposed to emit hardware add-with-carry instructions. However, there is no corresponding intrinsic on the LLVM side, so clang emits a sequence of instructions that is only recognized and folded into a single hardware instruction in two cases:

- the carry input is 0, or
- the carry output is unused.

This means that any carry chain longer than 2 results in inefficient code:

void add3(unsigned long long *restrict a,
          unsigned long long *restrict b,
          unsigned long long *restrict c)
{
    unsigned long long cf = 0;
    c[0] = __builtin_addcll(a[0], b[0], cf, &cf);
    c[1] = __builtin_addcll(a[1], b[1], cf, &cf);
    c[2] = __builtin_addcll(a[2], b[2], cf, &cf);
}

compiles to:

add3:                                   # @add3
	.cfi_startproc
# BB#0:
	movq	(%rdi), %rax
	movq	(%rsi), %r8
	leaq	(%rax,%r8), %rcx
	movq	%rcx, (%rdx)
	movq	8(%rdi), %rcx
	addq	8(%rsi), %rcx
	setb	%r9b
	addq	%r8, %rax
	adcq	$0, %rcx
	setb	%al
	orb	%r9b, %al
	movzbl	%al, %eax
	movq	%rcx, 8(%rdx)
	movq	16(%rsi), %rcx
	addq	16(%rdi), %rcx
	addq	%rax, %rcx
	movq	%rcx, 16(%rdx)
	retq

I suppose we're going to need a new target-independent generic intrinsic, say

  { iX, i1 } @llvm.uadd.with.overflow.carry.iX(iX, iX, i1)

(and a corresponding one for subtraction as well) and map it to ISD::ADDE / ISD::ADDCARRY.
I have also run into this. Here is the code I am using:

#include <stdint.h>

struct UInt96 {
    uint32_t d2, d1, d0;
};

void ADD96_v1(const UInt96& s, UInt96& d)
{
    uint64_t sum = uint64_t(d.d0) + uint64_t(s.d0);
    d.d0 = sum;
    sum >>= 32;
    sum += uint64_t(d.d1) + uint64_t(s.d1);
    d.d1 = sum;
    sum >>= 32;
    sum += uint64_t(d.d2) + uint64_t(s.d2);
    d.d2 = sum;
}

void ADD96_v2(const UInt96& s, UInt96& d)
{
    uint32_t carry;
    d.d0 = __builtin_addc(d.d0, s.d0, 0, &carry);
    d.d1 = __builtin_addc(d.d1, s.d1, carry, &carry);
    d.d2 = __builtin_addc(d.d2, s.d2, carry, &carry);
}

ADD96_v1 was my original function, and ADD96_v2 is my attempt to rewrite it using __builtin_addc. While the second version works, I am surprised that it is less optimal than the first: using the Godbolt Compiler Explorer, the generated ARM code contains three seemingly unnecessary additional instructions; worse, the PPC code uses branch instructions! My compiler options are:

  -target arm-unknown-linux-gnueabihf -mcpu=cortex-a9 -O2

and:

  -target powerpc-unknown-linux-gnu -mcpu=603e -O2

It seems that the __builtin_addc* functions don't really have an advantage; is that true?
FYI – https://reviews.llvm.org/D70079
Current Codegen: https://godbolt.org/z/cVSccY
rG257acbf6aee9 solves much of the poor codegen generically in DAGCombine. The x86/ARM code now works as expected; PPC still struggles, as the target needs to be updated to use ADDCARRY/SUBCARRY. Andy - maybe open a PPC-specific bug?
Adding some PPC folks who might know what's best to do to improve PPC support and reduce the branching. Do we need to extend the ISD::ADDCARRY/SUBCARRY combines (where possible) to better support ISD::ADDE/SUBE, or can we move PPC away from ISD::ADDC/SUBC + ISD::ADDE/SUBE?
We aren't currently working on moving away from the old carry-setting/using nodes to the new ones, but we have no fundamental objection to doing so. I am not sure that statement is much help, though.