LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filled before November 26, 2021. Use github to submit LLVM bugs

Bug 34056 - Regression(r304267): clang miscompiles chrome on arm32
Summary: Regression(r304267): clang miscompiles chrome on arm32
Status: RESOLVED FIXED
Alias: None
Product: libraries
Classification: Unclassified
Component: Backend: ARM (show other bugs)
Version: trunk
Hardware: PC Linux
: P enhancement
Assignee: Matthias Braun
URL:
Keywords:
Depends on:
Blocks: 33849
  Show dependency tree
 
Reported: 2017-08-03 14:40 PDT by Nico Weber
Modified: 2017-08-10 10:13 PDT (History)
5 users (show)

See Also:
Fixed By Commit(s):


Attachments
fairly reduced (10.34 KB, text/x-objcsrc)
2017-08-07 18:08 PDT, Nico Weber
Details
compiled binary, with a vanilla llvm@r310262 clang (83.05 KB, application/x-java)
2017-08-07 18:09 PDT, Nico Weber
Details
compiled binary, with llvm@310262 but with r304267 reverted locally (83.05 KB, application/x-java)
2017-08-07 18:09 PDT, Nico Weber
Details
smaller repro (6.78 KB, text/x-objcsrc)
2017-08-07 18:32 PDT, Nico Weber
Details
slightly smaller repro (5.09 KB, text/x-objcsrc)
2017-08-07 18:46 PDT, Nico Weber
Details
slightly smaller repro (4.89 KB, text/x-objcsrc)
2017-08-07 18:48 PDT, Nico Weber
Details
pretty small repro (2.20 KB, text/x-objcsrc)
2017-08-07 19:14 PDT, Nico Weber
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Nico Weber 2017-08-03 14:40:06 PDT
Chrome and all of its test binaries fail when built in a debug build on arm32. We bisected this down to r304267.

https://bugs.chromium.org/p/chromium/issues/detail?id=742563#c15 has some analysis on what's wrong. The file mentioned there is https://cs.chromium.org/chromium/src/base/threading/thread_local_storage.cc?q=base/threading/thread_local_storage.cc&sq=package:chromium&dr&l=320

  TlsVectorEntry* tls_data = static_cast<TlsVectorEntry*>(
      PlatformThreadLocalStorage::GetTLSValue(
          base::subtle::NoBarrier_Load(&g_native_tls_key)));
  if (!tls_data)
    tls_data = ConstructTlsVector();
  DCHECK_NE(slot_, kInvalidSlotValue);
  DCHECK_LT(slot_, kThreadLocalStorageSize);
  tls_data[slot_].data = value;
  tls_data[slot_].version = version_;

"First ThreadLocalStorage::StaticSlot::Set(void* value) gets called, then when ThreadLocalStorage::StaticSlot::Get() is called later on, tls_data is null, and ConstructTlsVector() is called again." (this is the bug)
Comment 1 Nico Weber 2017-08-03 14:40:34 PDT
matzeb, do you happen to see how your change could've triggered this?
Comment 2 Nico Weber 2017-08-03 14:42:50 PDT
(and GetTLSValue() is just pthread_getspecific() on iOS)
Comment 3 Nico Weber 2017-08-03 14:44:56 PDT
And base::subtle::NoBarrier_Load() is:

typedef volatile std::atomic<Atomic32>* AtomicLocation32;
inline Atomic32 NoBarrier_Load(volatile const Atomic32* ptr) {
  return ((AtomicLocation32)ptr)->load(std::memory_order_relaxed);
}
Comment 4 Nico Weber 2017-08-04 06:58:43 PDT
matzeb, ping?
Comment 5 Reid Kleckner 2017-08-04 09:50:17 PDT
I think argument promotion is the source of the under-aligned alloca:

*** IR Dump After Deduce function attributes ***
; Function Attrs: inlinehint nounwind uwtable
define internal fastcc void @foo(%struct.key* byval align 16 %end) unnamed_addr #2 {
entry:
  %0 = load i8* (%struct.obj*, i64, i32, i64, i64, %struct.key*)**, i8* (%struct.obj*, i64, i32, i64, i64, %struct.key*)*** bitcast (%struct.obj* @obj to i8* (%struc
t.obj*, i64, i32, i64, i64, %struct.key*)***), align 8, !tbaa !2
  %1 = load i8* (%struct.obj*, i64, i32, i64, i64, %struct.key*)*, i8* (%struct.obj*, i64, i32, i64, i64, %struct.key*)** %0, align 8, !tbaa !7
  %call = call i8* %1(%struct.obj* nonnull @obj, i64 0, i32 0, i64 0, i64 0, %struct.key* byval nonnull align 16 %end) #3
  ret void
}
*** IR Dump After Promote 'by reference' arguments to scalars ***
; Function Attrs: inlinehint nounwind uwtable
define internal fastcc void @foo(i128 %end.0) unnamed_addr #2 {
entry:
  %end = alloca %struct.key
  %.0 = getelementptr %struct.key, %struct.key* %end, i32 0, i32 0
  store i128 %end.0, i128* %.0
  %0 = load i8* (%struct.obj*, i64, i32, i64, i64, %struct.key*)**, i8* (%struct.obj*, i64, i32, i64, i64, %struct.key*)*** bitcast (%struct.obj* @obj to i8* (%struct.obj*, i64, i32, i64, i64, %struct.key*)***), align 8, !tbaa !2
  %1 = load i8* (%struct.obj*, i64, i32, i64, i64, %struct.key*)*, i8* (%struct.obj*, i64, i32, i64, i64, %struct.key*)** %0, align 8, !tbaa !7
  %call = call i8* %1(%struct.obj* nonnull @obj, i64 0, i32 0, i64 0, i64 0, %struct.key* byval nonnull align 16 %end) #3
  ret void
}

It should use the 'align' attribute to set the alignment of the alloca it creates, rather than relying on DataLayout to figure it out.
Comment 6 Reid Kleckner 2017-08-04 09:50:37 PDT
Wrong bug :(
Comment 7 Matthias Braun 2017-08-04 11:36:09 PDT
The relevant commit changes the lowering of cmpxchg instructions, however the  mentioned `ThreadLocalStorage::StaticSlot::Set` function does not seem to contain any `cmpxchg`s (at least copying and pasting together the relevant snippets doesn't produce one for me and from reading the code I don't see one either).

So I'd suspect the actual miscompilation happens somewhere else in the code. I'll look around a bit more, but ideally if you can compile the relevant test with and without this change and inspect the difference, I'd suspect it to be pretty small and it should hint at what function is miscompiled.
Comment 8 Nico Weber 2017-08-07 18:08:12 PDT
Created attachment 18931 [details]
fairly reduced

Down to one file. Compile flag (and a bunch of other goop that won't help you):

[1/9] ../../../../llvm-build/bin/clang++ -MMD -MF obj/base/base_unittests_arch_executable_sources/test_support_ios.o.d -DSYSTEM_NATIVE_UTF8 -DV8_DEPRECATION_WARNINGS -DNO_TCMALLOC -DDISABLE_NACL -DCHROMIUM_BUILD -DFIELDTRIAL_TESTING_ENABLED -DCR_CLANG_REVISION=\"307486-1\" -DCR_XCODE_VERSION=0833 -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D_DEBUG -DDYNAMIC_ANNOTATIONS_ENABLED=1 -DWTF_USE_DYNAMIC_ANNOTATIONS=1 -I../.. -Igen -fno-strict-aliasing --param=ssp-buffer-size=4 -fstack-protector -Wno-builtin-macro-redefined -D__DATE__= -D__TIME__= -D__TIMESTAMP__= -fcolor-diagnostics -arch armv7 -Wall -Werror -Wextra -Wundeclared-selector -Wno-missing-field-initializers -Wno-unused-parameter -Wno-c++11-narrowing -Wno-covered-switch-default -Wno-unneeded-internal-declaration -Wno-inconsistent-missing-override -Wno-undefined-var-template -Wno-nonportable-include-path -Wno-address-of-packed-member -Wno-unused-lambda-capture -Wno-user-defined-warnings -O0 -fno-omit-frame-pointer -gdwarf-2 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS10.3.sdk -stdlib=libc++ -miphoneos-version-min=9.0 -fvisibility=hidden -Wheader-hygiene -Wstring-conversion -Wtautological-overlap-compare -Wno-undefined-bool-conversion -Wno-tautological-undefined-compare -std=c++11 -fobjc-call-cxx-cdtors -Wobjc-missing-property-synthesis -fno-rtti -fno-exceptions -fvisibility-inlines-hidden -c ../../base/test/test_support_ios.mm -o obj/base/base_unittests_arch_executable_sources/test_support_ios.o
[2/9] touch obj/base/base_unittests_arch_executable_sources.stamp
[3/9] TOOL_VERSION=1479159887 ../../build/toolchain/mac/linker_driver.py ../../../../llvm-build/bin/clang++  -Xlinker -rpath -Xlinker @executable_path/Frameworks -Xlinker -objc_abi_version -Xlinker 2 -arch armv7 -Werror -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS10.3.sdk -stdlib=libc++ -miphoneos-version-min=9.0 -Wl,-ObjC -o "obj/base/arm/base_unittests" -Wl,-filelist,"obj/base/arm/base_unittests.rsp"  -framework UIKit -framework CoreFoundation -framework CoreGraphics -framework CoreText -framework Foundation
[4/9] touch obj/base/base_unittests_executable.inputdeps.stamp
[5/9] python ../../build/toolchain/mac/linker_driver.py xcrun lipo -create -output obj/base/base_unittests obj/base/arm/base_unittests
[6/9] touch obj/base/base_unittests_executable.stamp
[7/9] touch obj/base/base_unittests.codesigning.inputdeps.stamp
[8/9] python ../../build/config/ios/codesign.py code-sign-bundle -t=iphoneos -i=E4F29B16FFF03137BD658E870CABC14FD4961B7F -e=../../build/config/ios/entitlements.plist -b=obj/base/base_unittests base_unittests.app
[9/9] touch obj/base/base_unittests.stamp


With r304267 applied, this test succeeds and the breakpoint is hit:

    if (this != MyGetTLSMessageLoop()->Get())
      asm("bkpt 0");

With it reverted, it doesn't succeed and the program quits without trouble. I'll also attach the final binary in a second (out/Release-iphoneos/base_unittests.app/base_unittests -- note that this is a debug build, despite the "Release" in the filename).

The disassembly of the two binaries looks identical to me (except for offsets), except that __ZN4base6subtle24NoBarrier_CompareAndSwapEPViii has a bunch of additional 

    movs	r1, #0x0

lines.
Comment 9 Nico Weber 2017-08-07 18:09:05 PDT
Created attachment 18932 [details]
compiled binary, with a vanilla llvm@r310262 clang
Comment 10 Nico Weber 2017-08-07 18:09:40 PDT
Created attachment 18933 [details]
compiled binary, with llvm@310262 but with r304267 reverted locally
Comment 11 Nico Weber 2017-08-07 18:32:37 PDT
Created attachment 18934 [details]
smaller repro
Comment 12 Nico Weber 2017-08-07 18:46:21 PDT
Created attachment 18935 [details]
slightly smaller repro
Comment 13 Nico Weber 2017-08-07 18:48:41 PDT
Created attachment 18936 [details]
slightly smaller repro

(whoops, prior attachment didn't compile)
Comment 14 Nico Weber 2017-08-07 19:14:50 PDT
Created attachment 18937 [details]
pretty small repro

Now 74 lines total. The relevant pieces:


typedef int32_t Atomic32;
typedef volatile std::atomic<Atomic32>* AtomicLocation32;

inline Atomic32 NoBarrier_CompareAndSwap(volatile Atomic32* ptr,
                                         Atomic32 old_value,
                                         Atomic32 new_value) {
  ((AtomicLocation32)ptr)
      ->compare_exchange_strong(old_value,
                                new_value,
                                std::memory_order_relaxed,
                                std::memory_order_relaxed);
  return old_value;
}

inline Atomic32 NoBarrier_Load(volatile const Atomic32* ptr) {
  return ((AtomicLocattion32)ptr)->load(std::memory_order_relaxed);
}

- (void)runTests {
  pthread_key_t key = NoBarrier_Load(&g_native_tls_key);
  if (key == TLS_KEY_OUT_OF_INDEXES) {
    pthread_key_create(&key, OnThreadExit);
    NoBarrier_CompareAndSwap(&g_native_tls_key, TLS_KEY_OUT_OF_INDEXES, key);
    pthread_setspecific(key, new int);
  }
  if (!pthread_getspecific(NoBarrier_Load(&g_native_tls_key)))
    asm("bkpt 0");  // <- this gets hit with r304267 applied
}
Comment 15 Matthias Braun 2017-08-08 16:06:58 PDT
Thanks a lot for providing the reduced program. Indeed zeroing out the status register is the only change I can see by inspecting the assembly and that is exactly one of the things that commit changed, and as far as I can see that zeroed register isn't even read by anyone. Digging further...
Comment 16 Nico Weber 2017-08-09 08:29:52 PDT
(In reply to Matthias Braun from comment #15)
> Thanks a lot for providing the reduced program. Indeed zeroing out the
> status register is the only change I can see by inspecting the assembly and
> that is exactly one of the things that commit changed, and as far as I can
> see that zeroed register isn't even read by anyone. Digging further...

Early in __Z24NoBarrier_CompareAndSwapPViii we have:

0000ba32		.byte  30	@ KIND_JUMP_TABLE8
0000ba33		.byte  30	@ KIND_JUMP_TABLE8
0000ba34		.byte  35	@ KIND_JUMP_TABLE8
0000ba35		.byte  64	@ KIND_JUMP_TABLE8
0000ba36		.byte  70	@ KIND_JUMP_TABLE8
0000ba38	1c 98 	ldr	r0, [sp, #0x70]
0000ba3a	01 68 	ldr	r1, [r0]
0000ba3c	23 9a 	ldr	r2, [sp, #0x8c]
0000ba3e	dd f8 78 c0 	ldr.w	r12, [sp, #0x78]
0000ba42	00 21 	movs	r1, #0x0  # XXX
0000ba44	5c e8 00 3f 	ldrex	r3, [r12]
0000ba48	0b 45 	cmp	r3, r1


The line marked XXX is present in the bad binary but not in the good. Two lines further down r1 is compared against r3. So that zeroed register does look like it's getting read to me (?)
Comment 17 Matthias Braun 2017-08-09 10:59:10 PDT
That part looks like this for me:
       ldr     r0, [sp, #112]          @ 4-byte Reload
        ldr     r1, [r0]
        ldr     r2, [sp, #140]
        ldr.w   r12, [sp, #120]         @ 4-byte Reload
 LBB4_4:                                 @ %monotonic.i
                                         @ =>This Inner Loop Header: Depth=1
-       movs    r9, #0                 @ This appears/disappears depending on patch
        ldrex   r3, [r12]
        cmp     r3, r1
        bne     LBB4_6
Comment 18 Tim Northover 2017-08-09 13:23:18 PDT
The problem is that the Thumb "movs" only accepts registers r0-r8. As luck would have it 9 % 8 == 1, so when LLVM is trying to encode a move to r9 it gets interpreted as a clobber of r1.
Comment 19 Tim Northover 2017-08-09 13:27:26 PDT
Sorry, that should obviously be r0-r7.

Anyway, until recently we could have used "t2MOVi" instead, but unfortunately v8M-baseline has ldrex/strex but no t2MOVi so we need to make sure that whatever we do works on (e.g.) Cortex-M23 too.
Comment 20 Nico Weber 2017-08-09 13:59:34 PDT
Hah, thanks for that insight, Tim :-)

MatzeB, if this is tricky to fix, can we revert for now? Having known miscompiles in the tree is suboptimal.
Comment 21 Matthias Braun 2017-08-09 14:10:32 PDT
I'll put something in this afternoon, I think we can go back to the way before my patch without the mov (while maintaining the parts of my patch that give correct live-in lists, which was my main motivation anyway).
Comment 22 Matthias Braun 2017-08-09 14:18:54 PDT
And for the record:

Turns out that when you immediately compile to a .o file you do indeed get `movs    r1, #0x0` when you go through an assembly file then you have `movs r9, #0` in the assembly file and the assembler turns that into `movs.w  r9, #0x0` which works.
Comment 23 Matthias Braun 2017-08-09 15:23:03 PDT
Fixed in r310534
Comment 24 Matthias Braun 2017-08-09 15:24:15 PDT
@Nico Weber: Can you test and see if this fixes your issues. If it does, I'll merge the fix into the llvm 5.0 branch.
Comment 25 Matthias Braun 2017-08-09 15:28:27 PDT
Let's keep this open until we have the fix in llvm-5.0
Comment 26 Nico Weber 2017-08-09 16:19:25 PDT
Thanks, I can confirm that this fixes my reduced example, and most of our base_unittests now pass for me. The two that don't also fail with the new change and r304267 reverted. So this looks much better now.
Comment 27 Hans Wennborg 2017-08-10 10:13:21 PDT
Merged to 5.0 in r310628.