-------------------- 1.cc -----------------------
#include <atomic>

struct AM { int f1; int f2; };
std::atomic<AM> ip3{AM()};
std::atomic<AM> ip4{AM()};

void foo() {
  auto val = ip4.load(std::memory_order_relaxed);
  ip3.store(val, std::memory_order_relaxed);
}
-------------------------------------------------

~/workarea/llvm-r309240/dbuild/bin/clang -O2 -std=c++11 -fno-exceptions -S 1.cc

	.cfi_startproc
# BB#0:                                 # %entry
	pushq	%rax
.Lcfi0:
	.cfi_def_cfa_offset 16
	movl	$local_ip4, %edi
	xorl	%esi, %esi
	callq	__atomic_load_8
	movl	$local_ip3, %edi
	xorl	%edx, %edx
	movq	%rax, %rsi
	popq	%rax
	jmp	__atomic_store_8        # TAILCALL
.Lfunc_end0:
	.size	_Z3foov, .Lfunc_end0-_Z3foov
	.cfi_endproc

g++ -O2 -std=c++11 -fno-exceptions -S 1.cc

	.cfi_startproc
	movq	ip4(%rip), %rax
	movq	%rax, ip3(%rip)
	ret
	.cfi_endproc

If alignas(8) is added:

struct alignas(8) AM { int f1; int f2; };

clang will generate an inline sequence for the atomic accesses:

	.cfi_startproc
	movq	ip4(%rip), %rax
	movq	%rax, ip3(%rip)
	retq
Should we be using mov for unaligned atomic accesses in general, or only (as in this example) where we have extrinsic knowledge that the access is actually properly aligned?
I guess we want to generate mov for the unaligned case in general, since unaligned loads/stores cause only minor or no slowdown on modern x86 architectures.
(In reply to Wei Mi from comment #2)
> I guess we want to generate mov for unaligned case in general, since
> unaligned load/store only cause minor or no slowdown on modern x86
> architecutures.

But isn't that lowering incorrect -- don't we lose atomicity if the object is actually not aligned (in particular, if it crosses a cache line boundary)?
You are right, thanks. Looking at the asm some more, I found that GCC is able to align ip3 and ip4 to 8 bytes, but LLVM only aligns ip3 and ip4 to 4 bytes. This is the root cause of the difference. So we'd better increase the alignment of atomic variables as an optimization, instead of generating mov for unaligned accesses.
Looks like a frontend bug:

struct AM { int f1, f2; };
alignas(8) AM m;

AM load() {
  AM am;
  __atomic_load(&m, &am, 0);
  return am;
}

results in an alignment-4 atomic load, which the frontend turns into a libcall. We're apparently only considering the type alignment and not the alignment of the actual atomic operand to the builtin.
The fix for this bug (https://reviews.llvm.org/rL314145) has strange side effects on 32-bit x86, which cause quite a number of problems in FreeBSD ports. For example, take this minimalist test case:

// clang -m32 -march=i486 -O2 -fomit-frame-pointer -S atomic-test-2.c -o -
extern struct {
  unsigned long long a;
} b;
extern unsigned long long c;

void foo(void) {
  __atomic_add_fetch(&b.a, 1, 0);
  __atomic_add_fetch(&c, 1, 0);
}

With clang r314144, this results in:

foo:                                    # @foo
# BB#0:                                 # %entry
	pushl	$0
	pushl	$0
	pushl	$1
	pushl	$b
	calll	__atomic_fetch_add_8
	addl	$16, %esp
	pushl	$0
	pushl	$0
	pushl	$1
	pushl	$c
	calll	__atomic_fetch_add_8
	addl	$16, %esp
	retl

With clang r314145, it starts producing:

foo:                                    # @foo
# BB#0:                                 # %entry
	pushl	%ebx
	pushl	$0
	pushl	$0
	pushl	$1
	pushl	$b
	calll	__atomic_fetch_add_8
	addl	$16, %esp
	movl	c+4, %edx
	movl	c, %eax
	.p2align	4, 0x90
.LBB0_1:                                # %atomicrmw.start
                                        # =>This Inner Loop Header: Depth=1
	movl	%eax, %ebx
	addl	$1, %ebx
	movl	%edx, %ecx
	adcl	$0, %ecx
	lock	cmpxchg8b	c
	jne	.LBB0_1
# BB#2:                                 # %atomicrmw.end
	popl	%ebx
	retl

Why is this a problem? A lot of programs test for the availability of __atomic primitives by compiling small test cases with code like the above, and these tests now suddenly succeed because libcalls to __atomic functions have been replaced by inline lock cmpxchg8b's. Note that these instructions are even invalid with -march=i486, but clang produces them anyway!
In https://wmi@llvm.org/svn/llvm-project/cfe/trunk/lib/Basic/Targets/X86.h, there is a FIXME in class X86_32TargetInfo:

  // x86-32 has atomics up to 8 bytes
  // FIXME: Check that we actually have cmpxchg8b before setting
  // MaxAtomicInlineWidth. (cmpxchg8b is an i586 instruction.)
  MaxAtomicPromoteWidth = MaxAtomicInlineWidth = 64;

Now we need to fix it. I will prepare a patch for it.
(In reply to Wei Mi from comment #7)
> In https://wmi@llvm.org/svn/llvm-project/cfe/trunk/lib/Basic/Targets/X86.h,
> there is a FIXME in class X86_32TargetInfo.
>
>   // x86-32 has atomics up to 8 bytes
>   // FIXME: Check that we actually have cmpxchg8b before setting
>   // MaxAtomicInlineWidth. (cmpxchg8b is an i586 instruction.)
>   MaxAtomicPromoteWidth = MaxAtomicInlineWidth = 64;
>
> Now we need to fix it. I will prepare a patch for it.

Yes, and please note that there is also this quite old review which attempted to solve the case more generally, but it got stalled: https://reviews.llvm.org/D29542
(In reply to Wei Mi from comment #7)
> In https://wmi@llvm.org/svn/llvm-project/cfe/trunk/lib/Basic/Targets/X86.h,
> there is a FIXME in class X86_32TargetInfo.
> [...]
> Now we need to fix it. I will prepare a patch for it.

Wei, any update on that patch? :)
(In reply to Dimitry Andric from comment #9)
> (In reply to Wei Mi from comment #7)
> > [...]
> > Now we need to fix it. I will prepare a patch for it.
>
> Wei, any update on that patch? :)

It is committed at https://reviews.llvm.org/rL323281. Could you verify its effectiveness?
(In reply to Wei Mi from comment #10)
> (In reply to Dimitry Andric from comment #9)
> > [...]
> > Wei, any update on that patch? :)
>
> It is committed at https://reviews.llvm.org/rL323281. Could you verify its
> effectiveness?

It does work for the test cases here, but if you use e.g. -march=i586 or higher, there are still sometimes calls to __atomic functions, instead of inlined cmpxchg8b operations. I don't really know where that changed, between clang 5.0 and 6.0.
(In reply to Dimitry Andric from comment #11)
> (In reply to Wei Mi from comment #10)
> > [...]
> > It is committed at https://reviews.llvm.org/rL323281. Could you verify its
> > effectiveness?
>
> It does work for the test cases here, but if you use e.g. -march=i586 or
> higher, there are still sometimes calls to __atomic functions, instead of
> inlined cmpxchg8b operations. I don't really know where that changed,
> between clang 5.0 and 6.0.

Clang uses "Ptr.getAlignment() % sizeChars == 0" as a precondition for using an inline atomic. On x86-32, unsigned long long has 8-byte size but may have only 4-byte alignment. Although it is possible to do an atomic load/store of data with such a type using cmpxchg8b, clang will use a libcall for it. That seems to be a separate restriction in Clang.

Note: for clang, an unsigned long long inside a structure has 4-byte alignment; a standalone unsigned long long has 8-byte alignment.
The implementation of __atomic_load in compiler-rt appears to be buggy: it only looks at the size of the input when deciding whether to take a lock. It (and hence all the others) should probably fall back to the locked path for misaligned atomics too. A locked cmpxchg is theoretically viable on x86, but it looks like that would break libatomic compatibility. It's certainly an ABI issue. Also, if you're using genuinely misaligned atomics you probably deserve whatever you get. Finally, everyone seems to have decided that mixing libcalls and inlined atomics on the same object is OK. That's probably fine because GCC seems to agree, but be careful!
We just hit a build failure in Fedora caused by r314145: https://bugzilla.redhat.com/show_bug.cgi?id=1565766

I've reduced the test case from the Fedora bug even further, and it's another example where clang will no longer generate a native cmpxchg instruction:

// clang -O2 -c t.c
#include <stdint.h>

struct __attribute__ ((__packed__)) baz {
  uint32_t x3;
};

struct foo {
  uint32_t x;
  uint32_t x2;
  struct baz *b;
} fx;

uint32_t foor(struct foo *p, uint32_t y) {
  return __atomic_fetch_or(&p->b->x3, y, 5);
}

int main() { }
Atomics: the bugs that keep on giving.

It seems everyone also used to agree that misaligned atomics on x86 should be lock-free, which raises 3 issues:

1. Bare load and store are not lock-free on x86. Clang generates a libcall; GCC generates a non-atomic mov instruction.

2. LLVM's cmpxchg is specified in LangRef to require alignment at least the access size. I don't know if any pass makes use of that, but one could. So there's theoretically no way for Clang to actually emit a lock-free misaligned cmpxchg at the moment.

3. My proposed fix for compiler-rt (https://reviews.llvm.org/D45319) would make all misaligned atomics locked on x86. I'll have to change that.