
llvm doesn't generate inline sequence for load/store of unaligned atomic variable on x86_64 #33695

Closed
llvmbot opened this issue Aug 28, 2017 · 18 comments
Labels: bugzilla (Issues migrated from bugzilla), llvm:codegen

Comments


llvmbot commented Aug 28, 2017

Bugzilla Link: 34347
Version: trunk
OS: Linux
Reporter: LLVM Bugzilla Contributor
CC: @DimitryAndric, @emaste, @zygoloid, @TNorthover, @tstellar

Extended Description

-------------------- 1.cc -----------------------
#include <atomic>

struct AM {
  int f1;
  int f2;
};

std::atomic<AM> ip3{AM()};
std::atomic<AM> ip4{AM()};

void foo() {
  auto val = ip4.load(std::memory_order_relaxed);
  ip3.store(val, std::memory_order_relaxed);
}

~/workarea/llvm-r309240/dbuild/bin/clang -O2 -std=c++11 -fno-exceptions -S 1.cc
        .cfi_startproc
# BB#0:                                 # %entry
        pushq   %rax
.Lcfi0:
        .cfi_def_cfa_offset 16
        movl    $local_ip4, %edi
        xorl    %esi, %esi
        callq   __atomic_load_8
        movl    $local_ip3, %edi
        xorl    %edx, %edx
        movq    %rax, %rsi
        popq    %rax
        jmp     __atomic_store_8        # TAILCALL
.Lfunc_end0:
        .size   _Z3foov, .Lfunc_end0-_Z3foov
        .cfi_endproc

g++ -O2 -std=c++11 -fno-exceptions -S 1.cc
        .cfi_startproc
        movq    ip4(%rip), %rax
        movq    %rax, ip3(%rip)
        ret
        .cfi_endproc

If alignas(8) is added:

struct alignas(8) AM {
  int f1;
  int f2;
};

clang will generate an inline sequence for the atomic accesses:

        .cfi_startproc
        movq    ip4(%rip), %rax
        movq    %rax, ip3(%rip)
        retq

zygoloid commented Aug 28, 2017

Should we be using mov for unaligned atomic accesses in general, or only (as in this example) where we have extrinsic knowledge that the access is actually properly aligned?


llvmbot commented Aug 28, 2017

I guess we want to generate mov for the unaligned case in general, since an unaligned load/store causes only a minor slowdown, or none at all, on modern x86 architectures.

zygoloid commented Aug 28, 2017

> I guess we want to generate mov for the unaligned case in general, since
> an unaligned load/store causes only a minor slowdown, or none at all, on
> modern x86 architectures.

But isn't that lowering incorrect -- don't we lose atomicity if the object is actually not aligned (in particular, if it crosses a cache line boundary)?
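
For illustration, a contrived sketch of the concern; the layout and names below are hypothetical and not from this report. An 8-byte object with only 4-byte alignment can straddle a cache-line boundary, and a plain 8-byte mov on such an object is not guaranteed to be atomic.

/* Contrived sketch: force an 8-byte, 4-byte-aligned object to span a
 * 64-byte (cache-line) boundary. */
struct am { int f1, f2; };              /* 8 bytes, 4-byte alignment */

struct holder {
  char pad[60];                         /* places 'val' at offset 60 ...  */
  struct am val;                        /* ... so it occupies bytes 60-67 */
} __attribute__((aligned(64)));         /* of a 64-byte-aligned block     */

struct holder g;

struct am read_relaxed(void) {
  struct am out;
  /* Only 4-byte alignment is known here, so a correct lowering must not
   * use a single 8-byte mov; clang currently emits a __atomic_load call. */
  __atomic_load(&g.val, &out, __ATOMIC_RELAXED);
  return out;
}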


llvmbot commented Aug 28, 2017

You are right. Thanks. Looking at the asm code more, I found that GCC is able to align ip3 and ip4 to 8 bytes, but llvm only aligns them to 4 bytes. This is the root cause of the difference. So we had better increase the alignment of atomic variables as an optimization, instead of generating mov for unaligned accesses.

zygoloid commented Aug 28, 2017

Looks like a frontend bug:

struct AM { int f1, f2; };
alignas(8) AM m;
AM load() {
  AM am;
  __atomic_load(&m, &am, 0);
  return am;
}

results in an atomic load with alignment 4, which the frontend turns into a libcall. We're apparently only considering the type's alignment and not the alignment of the actual atomic operand to the builtin.
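
For comparison, a C spelling of the alignas(8) variant from the top of the report (the names here are illustrative): putting the over-alignment on the type itself is what currently satisfies the frontend's type-based check, so the load can be emitted inline.

/* Over-align the type, not just the object, so the type-based alignment
 * check sees 8 bytes. */
struct __attribute__((aligned(8))) am8 { int f1, f2; };

struct am8 m8;

struct am8 load8(void) {
  struct am8 out;
  /* With the type carrying 8-byte alignment, this can become a single movq. */
  __atomic_load(&m8, &out, __ATOMIC_RELAXED);
  return out;
}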

@DimitryAndric commented:

The fix for this bug (https://reviews.llvm.org/rL314145) has strange side effects on 32-bit x86, which cause quite a number of problems in FreeBSD ports.

For example, take this minimalist test case:

// clang -m32 -march=i486 -O2 -fomit-frame-pointer -S atomic-test-2.c -o -
extern struct {
  unsigned long long a;
} b;

extern unsigned long long c;

void foo(void)
{
  __atomic_add_fetch(&b.a, 1, 0);
  __atomic_add_fetch(&c, 1, 0);
}

With clang r314144, this results in:

foo:                                    # @foo
# BB#0:                                 # %entry
        pushl   $0
        pushl   $0
        pushl   $1
        pushl   $b
        calll   __atomic_fetch_add_8
        addl    $16, %esp
        pushl   $0
        pushl   $0
        pushl   $1
        pushl   $c
        calll   __atomic_fetch_add_8
        addl    $16, %esp
        retl

With clang r314145, it starts producing:

foo:                                    # @foo
# BB#0:                                 # %entry
        pushl   %ebx
        pushl   $0
        pushl   $0
        pushl   $1
        pushl   $b
        calll   __atomic_fetch_add_8
        addl    $16, %esp
        movl    c+4, %edx
        movl    c, %eax
        .p2align        4, 0x90
.LBB0_1:                                # %atomicrmw.start
                                        # =>This Inner Loop Header: Depth=1
        movl    %eax, %ebx
        addl    $1, %ebx
        movl    %edx, %ecx
        adcl    $0, %ecx
        lock    cmpxchg8b c
        jne     .LBB0_1
# BB#2:                                 # %atomicrmw.end
        popl    %ebx
        retl

Why is this a problem? A lot of programs test for the availability of the __atomic primitives by compiling small test cases like the one above, and those tests now suddenly succeed because the libcalls to the __atomic functions have been replaced by an inline lock cmpxchg8b.

Note that these instructions are not even valid with -march=i486 (cmpxchg8b is an i586 instruction), but clang produces them anyway!
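
An illustrative configure-style probe of the kind being described (not taken from any particular port): the check passes when this compiles and links without libatomic, which is exactly what changes once the libcall is replaced by an inline lock cmpxchg8b.

/* Typical feature test: "do 64-bit __atomic builtins work without
 * -latomic?"  With the inlined cmpxchg8b this now links, even though the
 * library would still be needed for other cases. */
#include <stdint.h>

uint64_t counter;

int main(void) {
  __atomic_add_fetch(&counter, 1, __ATOMIC_RELAXED);
  return (int)__atomic_load_n(&counter, __ATOMIC_RELAXED);
}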


llvmbot commented Jan 16, 2018

In https://wmi@llvm.org/svn/llvm-project/cfe/trunk/lib/Basic/Targets/X86.h, there is a FIXME in class X86_32TargetInfo.

// x86-32 has atomics up to 8 bytes
// FIXME: Check that we actually have cmpxchg8b before setting
// MaxAtomicInlineWidth. (cmpxchg8b is an i586 instruction.)
MaxAtomicPromoteWidth = MaxAtomicInlineWidth = 64;

Now we need to fix it. I will prepare a patch for it.
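
One way to observe the user-visible effect of MaxAtomicInlineWidth is to ask the compiler what it promises for an 8-byte object. This probe is illustrative and the exact results depend on the clang version, target and flags, but after the fix, -m32 -march=i486 should no longer report 8-byte atomics as always lock-free, while -march=i586 and later (which have cmpxchg8b) should.

/* Illustrative probe: does the compiler promise lock-free 8-byte atomics? */
#include <stdio.h>

unsigned long long x;

int main(void) {
  printf("always lock-free: %d\n", (int)__atomic_always_lock_free(sizeof x, &x));
  printf("lock-free now:    %d\n", (int)__atomic_is_lock_free(sizeof x, &x));
  return 0;
}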

@DimitryAndric commented:

> In https://wmi@llvm.org/svn/llvm-project/cfe/trunk/lib/Basic/Targets/X86.h,
> there is a FIXME in class X86_32TargetInfo.
>
> // x86-32 has atomics up to 8 bytes
> // FIXME: Check that we actually have cmpxchg8b before setting
> // MaxAtomicInlineWidth. (cmpxchg8b is an i586 instruction.)
> MaxAtomicPromoteWidth = MaxAtomicInlineWidth = 64;
>
> Now we need to fix it. I will prepare a patch for it.

Yes, please note that there is also this quite old review which attempted to solve the case more generally, but it got stalled: https://reviews.llvm.org/D29542

@DimitryAndric commented:

> In https://wmi@llvm.org/svn/llvm-project/cfe/trunk/lib/Basic/Targets/X86.h,
> there is a FIXME in class X86_32TargetInfo.
>
> // x86-32 has atomics up to 8 bytes
> // FIXME: Check that we actually have cmpxchg8b before setting
> // MaxAtomicInlineWidth. (cmpxchg8b is an i586 instruction.)
> MaxAtomicPromoteWidth = MaxAtomicInlineWidth = 64;
>
> Now we need to fix it. I will prepare a patch for it.

Wei, any update on that patch? :)


llvmbot commented Feb 2, 2018

> > In https://wmi@llvm.org/svn/llvm-project/cfe/trunk/lib/Basic/Targets/X86.h,
> > there is a FIXME in class X86_32TargetInfo.
> >
> > // x86-32 has atomics up to 8 bytes
> > // FIXME: Check that we actually have cmpxchg8b before setting
> > // MaxAtomicInlineWidth. (cmpxchg8b is an i586 instruction.)
> > MaxAtomicPromoteWidth = MaxAtomicInlineWidth = 64;
> >
> > Now we need to fix it. I will prepare a patch for it.
>
> Wei, any update on that patch? :)

It is committed at https://reviews.llvm.org/rL323281. Could you verify its effectiveness?

@DimitryAndric commented:

> > > In https://wmi@llvm.org/svn/llvm-project/cfe/trunk/lib/Basic/Targets/X86.h,
> > > there is a FIXME in class X86_32TargetInfo.
> > >
> > > // x86-32 has atomics up to 8 bytes
> > > // FIXME: Check that we actually have cmpxchg8b before setting
> > > // MaxAtomicInlineWidth. (cmpxchg8b is an i586 instruction.)
> > > MaxAtomicPromoteWidth = MaxAtomicInlineWidth = 64;
> > >
> > > Now we need to fix it. I will prepare a patch for it.
> >
> > Wei, any update on that patch? :)
>
> It is committed at https://reviews.llvm.org/rL323281. Could you verify its
> effectiveness?

It does work for the test cases here, but if you use e.g. -march=i586 or higher, there are still sometimes calls to __atomic functions, instead of inlined cmpxchg8b operations. I don't really know where that changed, between clang 5.0 and 6.0.


llvmbot commented Feb 5, 2018

> > > > In https://wmi@llvm.org/svn/llvm-project/cfe/trunk/lib/Basic/Targets/X86.h,
> > > > there is a FIXME in class X86_32TargetInfo.
> > > >
> > > > // x86-32 has atomics up to 8 bytes
> > > > // FIXME: Check that we actually have cmpxchg8b before setting
> > > > // MaxAtomicInlineWidth. (cmpxchg8b is an i586 instruction.)
> > > > MaxAtomicPromoteWidth = MaxAtomicInlineWidth = 64;
> > > >
> > > > Now we need to fix it. I will prepare a patch for it.
> > >
> > > Wei, any update on that patch? :)
> >
> > It is committed at https://reviews.llvm.org/rL323281. Could you verify its
> > effectiveness?
>
> It does work for the test cases here, but if you use e.g. -march=i586 or
> higher, there are still sometimes calls to __atomic functions, instead of
> inlined cmpxchg8b operations. I don't really know where that changed,
> between clang 5.0 and 6.0.

Clang uses "Ptr.getAlignment() % sizeChars == 0" as a precondition for emitting an inline atomic. On x86-32, unsigned long long has 8-byte size but may have only 4-byte alignment; although an atomic load/store of data of that type could be done with cmpxchg8b, clang will use a libcall for it. That seems to be a separate restriction in Clang.

Note: with clang, an unsigned long long inside a structure gets 4-byte alignment, while a standalone unsigned long long gets 8-byte alignment.
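
For illustration (hypothetical, not from the FreeBSD or Fedora reports): the same trick used earlier in this thread, putting the alignment on the declaration itself, also satisfies that precondition for the struct-member case.

/* Variant of atomic-test-2.c above where the member carries 8-byte
 * alignment, so "Ptr.getAlignment() % sizeChars == 0" holds and clang can
 * inline the operation (on targets that have cmpxchg8b). */
extern struct {
  unsigned long long a __attribute__((aligned(8)));
} b8;

void bump(void)
{
  __atomic_add_fetch(&b8.a, 1, 0);
}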

@TNorthover commented:

The implementation of __atomic_load in compiler-rt appears to be buggy: it only looks at the size of the input when deciding whether to take a lock. It (and hence all the other routines) should probably fall back to the locked path for misaligned atomics too.

A locked cmpxchg is theoretically viable on x86, but it looks like that would break libatomic compatibility. It's certainly an ABI issue. Also, if you're using genuinely misaligned atomics you probably deserve whatever you get.

Finally, everyone seems to have decided that mixing libcalls and inlined atomics on the same object is OK. That's probably fine because GCC seems to agree, but be careful!
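
A hedged sketch of the fallback policy discussed above; the function name and shape are illustrative, not compiler-rt's actual internals. The point is that the decision should consider alignment as well as size.

/* Illustrative check: only use the lock-free path when the width is one
 * the target can handle inline AND the object is naturally aligned. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

bool can_use_lockfree_path(const volatile void *ptr, size_t size) {
  if (size != 1 && size != 2 && size != 4 && size != 8)
    return false;                              /* unsupported width */
  return ((uintptr_t)ptr & (size - 1)) == 0;   /* naturally aligned */
}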

@tstellar commented:

We just hit a build failure in Fedora caused by r314145: https://bugzilla.redhat.com/show_bug.cgi?id=1565766

I've reduced the test case from the Fedora bug even further and it's another example where clang will no longer generate a native cmpxchg instruction:

// clang -O2 -c t.c
#include <stdint.h>

struct __attribute__((packed)) baz {
  uint32_t x3;
};

struct foo {
  uint32_t x;
  uint32_t x2;
  struct baz *b;
} fx;

uint32_t foor(struct foo *p, uint32_t y)
{
  return __atomic_fetch_or(&p->b->x3, y, 5);
}

int main() { }

@TNorthover commented:

Atomics: the bugs that keep on giving. It seems everyone also used to agree that misaligned atomics on x86 should be lock-free, which raises 3 issues:

  1. Bare load and store are not lock-free on x86. Clang generates a libcall. GCC generates a non-atomic mov instruction.

  2. LLVM's cmpxchg is specified in LangRef to require an alignment of at least its size. I don't know if any pass makes use of that, but they could. So there's theoretically no way for Clang to actually emit a lock-free misaligned cmpxchg at the moment.

  3. My proposed fix for compiler-rt (https://reviews.llvm.org/D45319) would make all misaligned atomics locked on x86. I'll have to change that.

@pirama-arumuga-nainar mentioned this in issue llvm/llvm-bugzilla-archive#36860


llvmbot transferred this issue from llvm/llvm-bugzilla-archive on Dec 10, 2021
@efriedma-quic commented:

A bunch of stuff has changed here; I think this is mostly fixed. #38194 tracks the issue that LLVM and gcc disagree about what to do if the argument to __atomic_* is misaligned.
