-------------------- 1.cc -----------------------
#include <atomic>

struct AM { int f1; int f2; };
std::atomic<AM> ip3{AM()};
std::atomic<AM> ip4{AM()};

void foo() {
  auto val = ip4.load(std::memory_order_relaxed);
  ip3.store(val, std::memory_order_relaxed);
}
-------------------------------------------------

~/workarea/llvm-r309240/dbuild/bin/clang -O2 -std=c++11 -fno-exceptions -S 1.cc

	.cfi_startproc
# BB#0:                                 # %entry
	pushq	%rax
.Lcfi0:
	.cfi_def_cfa_offset 16
	movl	$local_ip4, %edi
	xorl	%esi, %esi
	callq	__atomic_load_8
	movl	$local_ip3, %edi
	xorl	%edx, %edx
	movq	%rax, %rsi
	popq	%rax
	jmp	__atomic_store_8        # TAILCALL
.Lfunc_end0:
	.size	_Z3foov, .Lfunc_end0-_Z3foov
	.cfi_endproc

g++ -O2 -std=c++11 -fno-exceptions -S 1.cc

	.cfi_startproc
	movq	ip4(%rip), %rax
	movq	%rax, ip3(%rip)
	ret
	.cfi_endproc

If alignas(8) is added:

struct alignas(8) AM { int f1; int f2; };

clang will generate an inline sequence for the atomic accesses:

	.cfi_startproc
	movq	ip4(%rip), %rax
	movq	%rax, ip3(%rip)
	retq
Should we be using mov for unaligned atomic accesses in general, or only (as in this example) where we have extrinsic knowledge that the access is actually properly aligned?
I guess we want to generate mov for the unaligned case in general, since unaligned loads/stores cause only minor or no slowdown on modern x86 architectures.
(In reply to Wei Mi from comment #2)
> I guess we want to generate mov for unaligned case in general, since
> unaligned load/store only cause minor or no slowdown on modern x86
> architecutures.

But isn't that lowering incorrect -- don't we lose atomicity if the object is actually not aligned (in particular, if it crosses a cache line boundary)?
You are right, thanks. Looking at the asm some more, I found that GCC is able to align ip3 and ip4 to 8 bytes, but LLVM only aligns ip3 and ip4 to 4 bytes. This is the root cause of the difference. So we'd better increase the alignment of atomic variables as an optimization, instead of generating mov for unaligned accesses.
Looks like a frontend bug:

struct AM { int f1, f2; };
alignas(8) AM m;

AM load() {
  AM am;
  __atomic_load(&m, &am, 0);
  return am;
}

results in an alignment-4 atomic load, which the frontend turns into a libcall. We're apparently only considering the type alignment and not the alignment of the actual atomic operand to the builtin.
The fix for this bug (https://reviews.llvm.org/rL314145) has strange side effects on 32-bit x86, which cause quite a number of problems in FreeBSD ports. For example, take this minimalist test case:

// clang -m32 -march=i486 -O2 -fomit-frame-pointer -S atomic-test-2.c -o -
extern struct {
  unsigned long long a;
} b;
extern unsigned long long c;

void foo(void) {
  __atomic_add_fetch(&b.a, 1, 0);
  __atomic_add_fetch(&c, 1, 0);
}

With clang r314144, this results in:

foo:                                    # @foo
# BB#0:                                 # %entry
	pushl	$0
	pushl	$0
	pushl	$1
	pushl	$b
	calll	__atomic_fetch_add_8
	addl	$16, %esp
	pushl	$0
	pushl	$0
	pushl	$1
	pushl	$c
	calll	__atomic_fetch_add_8
	addl	$16, %esp
	retl

With clang r314145, it starts producing:

foo:                                    # @foo
# BB#0:                                 # %entry
	pushl	%ebx
	pushl	$0
	pushl	$0
	pushl	$1
	pushl	$b
	calll	__atomic_fetch_add_8
	addl	$16, %esp
	movl	c+4, %edx
	movl	c, %eax
	.p2align	4, 0x90
.LBB0_1:                                # %atomicrmw.start
                                        # =>This Inner Loop Header: Depth=1
	movl	%eax, %ebx
	addl	$1, %ebx
	movl	%edx, %ecx
	adcl	$0, %ecx
	lock	cmpxchg8b	c
	jne	.LBB0_1
# BB#2:                                 # %atomicrmw.end
	popl	%ebx
	retl

Why is this a problem? A lot of programs test for the availability of __atomic primitives by compiling small test cases with code like the above, and these tests now suddenly succeed because libcalls to __atomic functions have been replaced by inline lock cmpxchg8b's. Note that these instructions are even invalid with -march=i486, but clang produces them anyway!
In https://wmi@llvm.org/svn/llvm-project/cfe/trunk/lib/Basic/Targets/X86.h, there is a FIXME in class X86_32TargetInfo:

  // x86-32 has atomics up to 8 bytes
  // FIXME: Check that we actually have cmpxchg8b before setting
  // MaxAtomicInlineWidth. (cmpxchg8b is an i586 instruction.)
  MaxAtomicPromoteWidth = MaxAtomicInlineWidth = 64;

Now we need to fix it. I will prepare a patch for it.
(In reply to Wei Mi from comment #7)
> In https://wmi@llvm.org/svn/llvm-project/cfe/trunk/lib/Basic/Targets/X86.h,
> there is a FIXME in class X86_32TargetInfo.
>
>   // x86-32 has atomics up to 8 bytes
>   // FIXME: Check that we actually have cmpxchg8b before setting
>   // MaxAtomicInlineWidth. (cmpxchg8b is an i586 instruction.)
>   MaxAtomicPromoteWidth = MaxAtomicInlineWidth = 64;
>
> Now we need to fix it. I will prepare a patch for it.

Yes, and please note that there is also this quite old review which attempted to solve the case more generally, but it got stalled: https://reviews.llvm.org/D29542
(In reply to Wei Mi from comment #7)
> In https://wmi@llvm.org/svn/llvm-project/cfe/trunk/lib/Basic/Targets/X86.h,
> there is a FIXME in class X86_32TargetInfo.
> [...]
> Now we need to fix it. I will prepare a patch for it.

Wei, any update on that patch? :)
(In reply to Dimitry Andric from comment #9)
> (In reply to Wei Mi from comment #7)
> > [...]
> > Now we need to fix it. I will prepare a patch for it.
>
> Wei, any update on that patch? :)

It is committed at https://reviews.llvm.org/rL323281. Could you verify its effectiveness?
(In reply to Wei Mi from comment #10)
> (In reply to Dimitry Andric from comment #9)
> > [...]
> > Wei, any update on that patch? :)
>
> It is committed at https://reviews.llvm.org/rL323281. Could you verify its
> effectiveness?

It does work for the test cases here, but if you use e.g. -march=i586 or higher, there are still sometimes calls to __atomic functions, instead of inlined cmpxchg8b operations. I don't really know where that changed, between clang 5.0 and 6.0.
(In reply to Dimitry Andric from comment #11)
> (In reply to Wei Mi from comment #10)
> > [...]
> > It is committed at https://reviews.llvm.org/rL323281. Could you verify its
> > effectiveness?
>
> It does work for the test cases here, but if you use e.g. -march=i586 or
> higher, there are still sometimes calls to __atomic functions, instead of
> inlined cmpxchg8b operations. I don't really know where that changed,
> between clang 5.0 and 6.0.

Clang uses "Ptr.getAlignment() % sizeChars == 0" as a precondition for using an inline atomic. On x86-32, unsigned long long has 8-byte size but may have only 4-byte alignment. Although it is possible to do an atomic load/store of data with such a type using cmpxchg8b, clang will use a libcall for it. That seems to be a separate restriction in Clang.

Note: for clang, an unsigned long long inside a structure has 4-byte alignment; a standalone unsigned long long has 8-byte alignment.
The implementation of __atomic_load in compiler-rt appears to be buggy: it only looks at the size of the input when deciding whether to take a lock. It (and hence all the others) should probably fall back to the locked path for misaligned atomics too. A locked cmpxchg is theoretically viable on x86, but it looks like that would break libatomic compatibility. It's certainly an ABI issue. Also, if you're using genuinely misaligned atomics you probably deserve whatever you get. Finally, everyone seems to have decided that mixing libcalls and inlined atomics on the same object is OK. That's probably fine because GCC seems to agree, but be careful!
We just hit a build failure in Fedora caused by r314145: https://bugzilla.redhat.com/show_bug.cgi?id=1565766

I've reduced the test case from the Fedora bug even further, and it's another example where clang will no longer generate a native cmpxchg instruction:

// clang -O2 -c t.c
#include <stdint.h>

struct __attribute__ ((__packed__)) baz {
  uint32_t x3;
};

struct foo {
  uint32_t x;
  uint32_t x2;
  struct baz *b;
} fx;

uint32_t foor(struct foo *p, uint32_t y) {
  return __atomic_fetch_or(&p->b->x3, y, 5);
}

int main() { }
Atomics: the bugs that keep on giving.

It seems everyone also used to agree that misaligned atomics on x86 should be lock-free, which raises 3 issues:

1. Bare load and store are not lock-free on x86. Clang generates a libcall; GCC generates a non-atomic mov instruction.

2. LLVM's cmpxchg is specified in LangRef to require alignment at least the access size. I don't know if any pass makes use of that, but one could. So there's theoretically no way for Clang to actually emit a lock-free misaligned cmpxchg at the moment.

3. My proposed fix for compiler-rt (https://reviews.llvm.org/D45319) would make all misaligned atomics locked on x86. I'll have to change that.