
llvm doesn't generate inline sequence for load/store of unaligned atomic variable on x86_64 #33695

Closed
llvmbot opened this issue Aug 28, 2017 · 18 comments
Labels: bugzilla (Issues migrated from bugzilla), llvm:codegen

Comments


llvmbot commented Aug 28, 2017

Bugzilla Link: 34347
Version: trunk
OS: Linux
Reporter: LLVM Bugzilla Contributor
CC: @DimitryAndric, @emaste, @zygoloid, @TNorthover, @tstellar

Extended Description

-------------------- 1.cc -----------------------
#include <atomic>

struct AM {
  int f1;
  int f2;
};

std::atomic<AM> ip3{AM()};
std::atomic<AM> ip4{AM()};

void foo() {
  auto val = ip4.load(std::memory_order_relaxed);
  ip3.store(val, std::memory_order_relaxed);
}

~/workarea/llvm-r309240/dbuild/bin/clang -O2 -std=c++11 -fno-exceptions -S 1.cc
        .cfi_startproc
# BB#0:                                 # %entry
        pushq   %rax
.Lcfi0:
        .cfi_def_cfa_offset 16
        movl    $local_ip4, %edi
        xorl    %esi, %esi
        callq   __atomic_load_8
        movl    $local_ip3, %edi
        xorl    %edx, %edx
        movq    %rax, %rsi
        popq    %rax
        jmp     __atomic_store_8        # TAILCALL
.Lfunc_end0:
        .size   _Z3foov, .Lfunc_end0-_Z3foov
        .cfi_endproc

g++ -O2 -std=c++11 -fno-exceptions -S 1.cc
        .cfi_startproc
        movq    ip4(%rip), %rax
        movq    %rax, ip3(%rip)
        ret
        .cfi_endproc

If alignas(8) is added:

struct alignas(8) AM {
  int f1;
  int f2;
};

clang will generate an inline sequence for the atomic accesses:

        .cfi_startproc
        movq    ip4(%rip), %rax
        movq    %rax, ip3(%rip)
        retq

zygoloid commented Aug 28, 2017

Should we be using mov for unaligned atomic accesses in general, or only (as in this example) where we have extrinsic knowledge that the access is actually properly aligned?


llvmbot commented Aug 28, 2017

I guess we want to generate mov for the unaligned case in general, since an unaligned load/store causes only a minor slowdown, or none at all, on modern x86 architectures.

zygoloid commented Aug 28, 2017

> I guess we want to generate mov for the unaligned case in general, since
> an unaligned load/store causes only a minor slowdown, or none at all, on
> modern x86 architectures.

But isn't that lowering incorrect -- don't we lose atomicity if the object is actually not aligned (in particular, if it crosses a cache line boundary)?
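
For illustration, a contrived sketch of the concern; the layout and names below are hypothetical and not from this report. An 8-byte object with only 4-byte alignment can straddle a cache-line boundary, and a plain 8-byte mov on such an object is not guaranteed to be atomic.

/* Contrived sketch: force an 8-byte, 4-byte-aligned object to span a
 * 64-byte (cache-line) boundary. */
struct am { int f1, f2; };              /* 8 bytes, 4-byte alignment */

struct holder {
  char pad[60];                         /* places 'val' at offset 60 ...  */
  struct am val;                        /* ... so it occupies bytes 60-67 */
} __attribute__((aligned(64)));         /* of a 64-byte-aligned block     */

struct holder g;

struct am read_relaxed(void) {
  struct am out;
  /* Only 4-byte alignment is known here, so a correct lowering must not
   * use a single 8-byte mov; clang currently emits a __atomic_load call. */
  __atomic_load(&g.val, &out, __ATOMIC_RELAXED);
  return out;
}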


llvmbot commented Aug 28, 2017

You are right. Thanks. Looking at the asm code more, I found that GCC is able to align ip3 and ip4 to 8 bytes, but llvm only aligns them to 4 bytes. This is the root cause of the difference. So we had better increase the alignment of atomic variables as an optimization, instead of generating mov for unaligned accesses.

zygoloid commented Aug 28, 2017

Looks like a frontend bug:

struct AM { int f1, f2; };
alignas(8) AM m;
AM load() {
  AM am;
  __atomic_load(&m, &am, 0);
  return am;
}

results in an atomic load with alignment 4, which the frontend turns into a libcall. We're apparently only considering the type's alignment and not the alignment of the actual atomic operand to the builtin.
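
For comparison, a C spelling of the alignas(8) variant from the top of the report (the names here are illustrative): putting the over-alignment on the type itself is what currently satisfies the frontend's type-based check, so the load can be emitted inline.

/* Over-align the type, not just the object, so the type-based alignment
 * check sees 8 bytes. */
struct __attribute__((aligned(8))) am8 { int f1, f2; };

struct am8 m8;

struct am8 load8(void) {
  struct am8 out;
  /* With the type carrying 8-byte alignment, this can become a single movq. */
  __atomic_load(&m8, &out, __ATOMIC_RELAXED);
  return out;
}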

@DimitryAndric commented:

The fix for this bug (https://reviews.llvm.org/rL314145) has strange side effects on 32-bit x86, which cause quite a number of problems in FreeBSD ports.

For example, take this minimalist test case:

// clang -m32 -march=i486 -O2 -fomit-frame-pointer -S atomic-test-2.c -o -
extern struct {
  unsigned long long a;
} b;

extern unsigned long long c;

void foo(void)
{
  __atomic_add_fetch(&b.a, 1, 0);
  __atomic_add_fetch(&c, 1, 0);
}

With clang r314144, this results in:

foo:                                    # @foo
# BB#0:                                 # %entry
        pushl   $0
        pushl   $0
        pushl   $1
        pushl   $b
        calll   __atomic_fetch_add_8
        addl    $16, %esp
        pushl   $0
        pushl   $0
        pushl   $1
        pushl   $c
        calll   __atomic_fetch_add_8
        addl    $16, %esp
        retl

With clang r314145, it starts producing:

foo:                                    # @foo
# BB#0:                                 # %entry
        pushl   %ebx
        pushl   $0
        pushl   $0
        pushl   $1
        pushl   $b
        calll   __atomic_fetch_add_8
        addl    $16, %esp
        movl    c+4, %edx
        movl    c, %eax
        .p2align        4, 0x90
.LBB0_1:                                # %atomicrmw.start
                                        # =>This Inner Loop Header: Depth=1
        movl    %eax, %ebx
        addl    $1, %ebx
        movl    %edx, %ecx
        adcl    $0, %ecx
        lock    cmpxchg8b c
        jne     .LBB0_1
# BB#2:                                 # %atomicrmw.end
        popl    %ebx
        retl

Why is this a problem? A lot of programs test for the availability of the __atomic primitives by compiling small test cases like the one above, and those tests now suddenly succeed because the libcalls to the __atomic functions have been replaced by an inline lock cmpxchg8b.

Note that these instructions are not even valid with -march=i486 (cmpxchg8b is an i586 instruction), but clang produces them anyway!
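
An illustrative configure-style probe of the kind being described (not taken from any particular port): the check passes when this compiles and links without libatomic, which is exactly what changes once the libcall is replaced by an inline lock cmpxchg8b.

/* Typical feature test: "do 64-bit __atomic builtins work without
 * -latomic?"  With the inlined cmpxchg8b this now links, even though the
 * library would still be needed for other cases. */
#include <stdint.h>

uint64_t counter;

int main(void) {
  __atomic_add_fetch(&counter, 1, __ATOMIC_RELAXED);
  return (int)__atomic_load_n(&counter, __ATOMIC_RELAXED);
}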


llvmbot commented Jan 16, 2018

In https://wmi@llvm.org/svn/llvm-project/cfe/trunk/lib/Basic/Targets/X86.h, there is a FIXME in class X86_32TargetInfo.

// x86-32 has atomics up to 8 bytes
// FIXME: Check that we actually have cmpxchg8b before setting
// MaxAtomicInlineWidth. (cmpxchg8b is an i586 instruction.)
MaxAtomicPromoteWidth = MaxAtomicInlineWidth = 64;

Now we need to fix it. I will prepare a patch for it.
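
One way to observe the user-visible effect of MaxAtomicInlineWidth is to ask the compiler what it promises for an 8-byte object. This probe is illustrative and the exact results depend on the clang version, target and flags, but after the fix, -m32 -march=i486 should no longer report 8-byte atomics as always lock-free, while -march=i586 and later (which have cmpxchg8b) should.

/* Illustrative probe: does the compiler promise lock-free 8-byte atomics? */
#include <stdio.h>

unsigned long long x;

int main(void) {
  printf("always lock-free: %d\n", (int)__atomic_always_lock_free(sizeof x, &x));
  printf("lock-free now:    %d\n", (int)__atomic_is_lock_free(sizeof x, &x));
  return 0;
}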

@DimitryAndric commented:

> In https://wmi@llvm.org/svn/llvm-project/cfe/trunk/lib/Basic/Targets/X86.h,
> there is a FIXME in class X86_32TargetInfo.
>
> // x86-32 has atomics up to 8 bytes
> // FIXME: Check that we actually have cmpxchg8b before setting
> // MaxAtomicInlineWidth. (cmpxchg8b is an i586 instruction.)
> MaxAtomicPromoteWidth = MaxAtomicInlineWidth = 64;
>
> Now we need to fix it. I will prepare a patch for it.

Yes, please note that there is also this quite old review which attempted to solve the case more generally, but it got stalled: https://reviews.llvm.org/D29542

@DimitryAndric commented:

> In https://wmi@llvm.org/svn/llvm-project/cfe/trunk/lib/Basic/Targets/X86.h,
> there is a FIXME in class X86_32TargetInfo.
>
> // x86-32 has atomics up to 8 bytes
> // FIXME: Check that we actually have cmpxchg8b before setting
> // MaxAtomicInlineWidth. (cmpxchg8b is an i586 instruction.)
> MaxAtomicPromoteWidth = MaxAtomicInlineWidth = 64;
>
> Now we need to fix it. I will prepare a patch for it.

Wei, any update on that patch? :)


llvmbot commented Feb 2, 2018

> > In https://wmi@llvm.org/svn/llvm-project/cfe/trunk/lib/Basic/Targets/X86.h,
> > there is a FIXME in class X86_32TargetInfo.
> >
> > // x86-32 has atomics up to 8 bytes
> > // FIXME: Check that we actually have cmpxchg8b before setting
> > // MaxAtomicInlineWidth. (cmpxchg8b is an i586 instruction.)
> > MaxAtomicPromoteWidth = MaxAtomicInlineWidth = 64;
> >
> > Now we need to fix it. I will prepare a patch for it.
>
> Wei, any update on that patch? :)

It is committed at https://reviews.llvm.org/rL323281. Could you verify its effectiveness?

@DimitryAndric commented:

> > > In https://wmi@llvm.org/svn/llvm-project/cfe/trunk/lib/Basic/Targets/X86.h,
> > > there is a FIXME in class X86_32TargetInfo.
> > >
> > > // x86-32 has atomics up to 8 bytes
> > > // FIXME: Check that we actually have cmpxchg8b before setting
> > > // MaxAtomicInlineWidth. (cmpxchg8b is an i586 instruction.)
> > > MaxAtomicPromoteWidth = MaxAtomicInlineWidth = 64;
> > >
> > > Now we need to fix it. I will prepare a patch for it.
> >
> > Wei, any update on that patch? :)
>
> It is committed at https://reviews.llvm.org/rL323281. Could you verify its
> effectiveness?

It does work for the test cases here, but if you use e.g. -march=i586 or higher, there are still sometimes calls to __atomic functions, instead of inlined cmpxchg8b operations. I don't really know where that changed, between clang 5.0 and 6.0.


llvmbot commented Feb 5, 2018

> > > > In https://wmi@llvm.org/svn/llvm-project/cfe/trunk/lib/Basic/Targets/X86.h,
> > > > there is a FIXME in class X86_32TargetInfo.
> > > >
> > > > // x86-32 has atomics up to 8 bytes
> > > > // FIXME: Check that we actually have cmpxchg8b before setting
> > > > // MaxAtomicInlineWidth. (cmpxchg8b is an i586 instruction.)
> > > > MaxAtomicPromoteWidth = MaxAtomicInlineWidth = 64;
> > > >
> > > > Now we need to fix it. I will prepare a patch for it.
> > >
> > > Wei, any update on that patch? :)
> >
> > It is committed at https://reviews.llvm.org/rL323281. Could you verify its
> > effectiveness?
>
> It does work for the test cases here, but if you use e.g. -march=i586 or
> higher, there are still sometimes calls to __atomic functions, instead of
> inlined cmpxchg8b operations. I don't really know where that changed,
> between clang 5.0 and 6.0.

Clang uses "Ptr.getAlignment() % sizeChars == 0" as a precondition for emitting an inline atomic. On x86-32, unsigned long long has 8-byte size but may have only 4-byte alignment; although an atomic load/store of data of that type could be done with cmpxchg8b, clang will use a libcall for it. That seems to be a separate restriction in Clang.

Note: with clang, an unsigned long long inside a structure gets 4-byte alignment, while a standalone unsigned long long gets 8-byte alignment.
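
For illustration (hypothetical, not from the FreeBSD or Fedora reports): the same trick used earlier in this thread, putting the alignment on the declaration itself, also satisfies that precondition for the struct-member case.

/* Variant of atomic-test-2.c above where the member carries 8-byte
 * alignment, so "Ptr.getAlignment() % sizeChars == 0" holds and clang can
 * inline the operation (on targets that have cmpxchg8b). */
extern struct {
  unsigned long long a __attribute__((aligned(8)));
} b8;

void bump(void)
{
  __atomic_add_fetch(&b8.a, 1, 0);
}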

@TNorthover commented:

The implementation of __atomic_load in compiler-rt appears to be buggy: it only looks at the size of the input when deciding whether to take a lock. It (and hence all the other routines) should probably fall back to the locked path for misaligned atomics too.

A locked cmpxchg is theoretically viable on x86, but it looks like that would break libatomic compatibility. It's certainly an ABI issue. Also, if you're using genuinely misaligned atomics you probably deserve whatever you get.

Finally, everyone seems to have decided that mixing libcalls and inlined atomics on the same object is OK. That's probably fine because GCC seems to agree, but be careful!
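
A hedged sketch of the fallback policy discussed above; the function name and shape are illustrative, not compiler-rt's actual internals. The point is that the decision should consider alignment as well as size.

/* Illustrative check: only use the lock-free path when the width is one
 * the target can handle inline AND the object is naturally aligned. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

bool can_use_lockfree_path(const volatile void *ptr, size_t size) {
  if (size != 1 && size != 2 && size != 4 && size != 8)
    return false;                              /* unsupported width */
  return ((uintptr_t)ptr & (size - 1)) == 0;   /* naturally aligned */
}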

@tstellar commented:

We just hit a build failure in Fedora caused by r314145: https://bugzilla.redhat.com/show_bug.cgi?id=1565766

I've reduced the test case from the Fedora bug even further and it's another example where clang will no longer generate a native cmpxchg instruction:

// clang -O2 -c t.c
#include <stdint.h>

struct __attribute__((packed)) baz {
  uint32_t x3;
};

struct foo {
  uint32_t x;
  uint32_t x2;
  struct baz *b;
} fx;

uint32_t foor(struct foo *p, uint32_t y)
{
  return __atomic_fetch_or(&p->b->x3, y, 5);
}

int main() { }

@TNorthover commented:

Atomics: the bugs that keep on giving. It seems everyone also used to agree that misaligned atomics on x86 should be lock-free, which raises 3 issues:

  1. Bare load and store are not lock-free on x86. Clang generates a libcall. GCC generates a non-atomic mov instruction.

  2. LLVM's cmpxchg is specified in LangRef to require an alignment of at least its size. I don't know if any pass makes use of that, but they could. So there's theoretically no way for Clang to actually emit a lock-free misaligned cmpxchg at the moment.

  3. My proposed fix for compiler-rt (https://reviews.llvm.org/D45319) would make all misaligned atomics locked on x86. I'll have to change that.

@pirama-arumuga-nainar mentioned this in issue llvm/llvm-bugzilla-archive#36860


llvmbot transferred this issue from llvm/llvm-bugzilla-archive on Dec 10, 2021
@efriedma-quic commented:

A bunch of stuff has changed here; I think this is mostly fixed. #38194 tracks the issue that LLVM and gcc disagree about what to do if the argument to __atomic_* is misaligned.
