Chrome and all of its test binaries fail when built in a debug build on arm32. We bisected this down to r304267. https://bugs.chromium.org/p/chromium/issues/detail?id=742563#c15 has some analysis on what's wrong. The file mentioned there is https://cs.chromium.org/chromium/src/base/threading/thread_local_storage.cc?q=base/threading/thread_local_storage.cc&sq=package:chromium&dr&l=320 TlsVectorEntry* tls_data = static_cast<TlsVectorEntry*>( PlatformThreadLocalStorage::GetTLSValue( base::subtle::NoBarrier_Load(&g_native_tls_key))); if (!tls_data) tls_data = ConstructTlsVector(); DCHECK_NE(slot_, kInvalidSlotValue); DCHECK_LT(slot_, kThreadLocalStorageSize); tls_data[slot_].data = value; tls_data[slot_].version = version_; "First ThreadLocalStorage::StaticSlot::Set(void* value) gets called, then when ThreadLocalStorage::StaticSlot::Get() is called later on, tls_data is null, and ConstructTlsVector() is called again." (this is the bug)
matzeb, do you happen to see how your change could've triggered this?
(and GetTLSValue() is just pthread_getspecific() on iOS)
And base::subtle::NoBarrier_Load() is: typedef volatile std::atomic<Atomic32>* AtomicLocation32; inline Atomic32 NoBarrier_Load(volatile const Atomic32* ptr) { return ((AtomicLocation32)ptr)->load(std::memory_order_relaxed); }
matzeb, ping?
I think argument promotion is the source of the under-aligned alloca: *** IR Dump After Deduce function attributes *** ; Function Attrs: inlinehint nounwind uwtable define internal fastcc void @foo(%struct.key* byval align 16 %end) unnamed_addr #2 { entry: %0 = load i8* (%struct.obj*, i64, i32, i64, i64, %struct.key*)**, i8* (%struct.obj*, i64, i32, i64, i64, %struct.key*)*** bitcast (%struct.obj* @obj to i8* (%struc t.obj*, i64, i32, i64, i64, %struct.key*)***), align 8, !tbaa !2 %1 = load i8* (%struct.obj*, i64, i32, i64, i64, %struct.key*)*, i8* (%struct.obj*, i64, i32, i64, i64, %struct.key*)** %0, align 8, !tbaa !7 %call = call i8* %1(%struct.obj* nonnull @obj, i64 0, i32 0, i64 0, i64 0, %struct.key* byval nonnull align 16 %end) #3 ret void } *** IR Dump After Promote 'by reference' arguments to scalars *** ; Function Attrs: inlinehint nounwind uwtable define internal fastcc void @foo(i128 %end.0) unnamed_addr #2 { entry: %end = alloca %struct.key %.0 = getelementptr %struct.key, %struct.key* %end, i32 0, i32 0 store i128 %end.0, i128* %.0 %0 = load i8* (%struct.obj*, i64, i32, i64, i64, %struct.key*)**, i8* (%struct.obj*, i64, i32, i64, i64, %struct.key*)*** bitcast (%struct.obj* @obj to i8* (%struct.obj*, i64, i32, i64, i64, %struct.key*)***), align 8, !tbaa !2 %1 = load i8* (%struct.obj*, i64, i32, i64, i64, %struct.key*)*, i8* (%struct.obj*, i64, i32, i64, i64, %struct.key*)** %0, align 8, !tbaa !7 %call = call i8* %1(%struct.obj* nonnull @obj, i64 0, i32 0, i64 0, i64 0, %struct.key* byval nonnull align 16 %end) #3 ret void } It should use the 'align' attribute to set the alignment of the alloca it creates, rather than relying on DataLayout to figure it out.
Wrong bug :(
The relevant commit changes the lowering of cmpxchg instructions, however the mentioned `ThreadLocalStorage::StaticSlot::Set` function does not seem to contain any `cmpxchg`s (at least copying and pasting together the relevant snippets doesn't produce one for me and from reading the code I don't see one either). So I'd suspect the actual miscompilation happens somewhere else in the code. I'll look around a bit more, but ideally if you can compile the relevant test with and without this change and inspect the difference, I'd suspect it to be pretty small and it should hint at what function is miscompiled.
Created attachment 18931 [details] fairly reduced Down to one file. Compile flag (and a bunch of other goop that won't help you): [1/9] ../../../../llvm-build/bin/clang++ -MMD -MF obj/base/base_unittests_arch_executable_sources/test_support_ios.o.d -DSYSTEM_NATIVE_UTF8 -DV8_DEPRECATION_WARNINGS -DNO_TCMALLOC -DDISABLE_NACL -DCHROMIUM_BUILD -DFIELDTRIAL_TESTING_ENABLED -DCR_CLANG_REVISION=\"307486-1\" -DCR_XCODE_VERSION=0833 -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D_DEBUG -DDYNAMIC_ANNOTATIONS_ENABLED=1 -DWTF_USE_DYNAMIC_ANNOTATIONS=1 -I../.. -Igen -fno-strict-aliasing --param=ssp-buffer-size=4 -fstack-protector -Wno-builtin-macro-redefined -D__DATE__= -D__TIME__= -D__TIMESTAMP__= -fcolor-diagnostics -arch armv7 -Wall -Werror -Wextra -Wundeclared-selector -Wno-missing-field-initializers -Wno-unused-parameter -Wno-c++11-narrowing -Wno-covered-switch-default -Wno-unneeded-internal-declaration -Wno-inconsistent-missing-override -Wno-undefined-var-template -Wno-nonportable-include-path -Wno-address-of-packed-member -Wno-unused-lambda-capture -Wno-user-defined-warnings -O0 -fno-omit-frame-pointer -gdwarf-2 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS10.3.sdk -stdlib=libc++ -miphoneos-version-min=9.0 -fvisibility=hidden -Wheader-hygiene -Wstring-conversion -Wtautological-overlap-compare -Wno-undefined-bool-conversion -Wno-tautological-undefined-compare -std=c++11 -fobjc-call-cxx-cdtors -Wobjc-missing-property-synthesis -fno-rtti -fno-exceptions -fvisibility-inlines-hidden -c ../../base/test/test_support_ios.mm -o obj/base/base_unittests_arch_executable_sources/test_support_ios.o [2/9] touch obj/base/base_unittests_arch_executable_sources.stamp [3/9] TOOL_VERSION=1479159887 ../../build/toolchain/mac/linker_driver.py ../../../../llvm-build/bin/clang++ -Xlinker -rpath -Xlinker @executable_path/Frameworks -Xlinker -objc_abi_version -Xlinker 2 -arch armv7 -Werror -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS10.3.sdk -stdlib=libc++ -miphoneos-version-min=9.0 -Wl,-ObjC -o "obj/base/arm/base_unittests" -Wl,-filelist,"obj/base/arm/base_unittests.rsp" -framework UIKit -framework CoreFoundation -framework CoreGraphics -framework CoreText -framework Foundation [4/9] touch obj/base/base_unittests_executable.inputdeps.stamp [5/9] python ../../build/toolchain/mac/linker_driver.py xcrun lipo -create -output obj/base/base_unittests obj/base/arm/base_unittests [6/9] touch obj/base/base_unittests_executable.stamp [7/9] touch obj/base/base_unittests.codesigning.inputdeps.stamp [8/9] python ../../build/config/ios/codesign.py code-sign-bundle -t=iphoneos -i=E4F29B16FFF03137BD658E870CABC14FD4961B7F -e=../../build/config/ios/entitlements.plist -b=obj/base/base_unittests base_unittests.app [9/9] touch obj/base/base_unittests.stamp With r304267 applied, this test succeeds and the breakpoint is hit: if (this != MyGetTLSMessageLoop()->Get()) asm("bkpt 0"); With it reverted, it doesn't succeed and the program quits without trouble. I'll also attach the final binary in a second (out/Release-iphoneos/base_unittests.app/base_unittests -- note that this is a debug build, despite the "Release" in the filename). The disassembly of the two binaries looks identical to me (except for offsets), except that __ZN4base6subtle24NoBarrier_CompareAndSwapEPViii has a bunch of additional movs r1, #0x0 lines.
Created attachment 18932 [details] compiled binary, with a vanilla llvm@r310262 clang
Created attachment 18933 [details] compiled binary, with llvm@310262 but with r304267 reverted locally
Created attachment 18934 [details] smaller repro
Created attachment 18935 [details] slightly smaller repro
Created attachment 18936 [details] slightly smaller repro (whoops, prior attachment didn't compile)
Created attachment 18937 [details] pretty small repro Now 74 lines total. The relevant pieces: typedef int32_t Atomic32; typedef volatile std::atomic<Atomic32>* AtomicLocation32; inline Atomic32 NoBarrier_CompareAndSwap(volatile Atomic32* ptr, Atomic32 old_value, Atomic32 new_value) { ((AtomicLocation32)ptr) ->compare_exchange_strong(old_value, new_value, std::memory_order_relaxed, std::memory_order_relaxed); return old_value; } inline Atomic32 NoBarrier_Load(volatile const Atomic32* ptr) { return ((AtomicLocattion32)ptr)->load(std::memory_order_relaxed); } - (void)runTests { pthread_key_t key = NoBarrier_Load(&g_native_tls_key); if (key == TLS_KEY_OUT_OF_INDEXES) { pthread_key_create(&key, OnThreadExit); NoBarrier_CompareAndSwap(&g_native_tls_key, TLS_KEY_OUT_OF_INDEXES, key); pthread_setspecific(key, new int); } if (!pthread_getspecific(NoBarrier_Load(&g_native_tls_key))) asm("bkpt 0"); // <- this gets hit with r304267 applied }
Thanks a lot for providing the reduced program. Indeed zeroing out the status register is the only change I can see by inspecting the assembly and that is exactly one of the things that commit changed, and as far as I can see that zeroed register isn't even read by anyone. Digging further...
(In reply to Matthias Braun from comment #15) > Thanks a lot for providing the reduced program. Indeed zeroing out the > status register is the only change I can see by inspecting the assembly and > that is exactly one of the things that commit changed, and as far as I can > see that zeroed register isn't even read by anyone. Digging further... Early in __Z24NoBarrier_CompareAndSwapPViii we have: 0000ba32 .byte 30 @ KIND_JUMP_TABLE8 0000ba33 .byte 30 @ KIND_JUMP_TABLE8 0000ba34 .byte 35 @ KIND_JUMP_TABLE8 0000ba35 .byte 64 @ KIND_JUMP_TABLE8 0000ba36 .byte 70 @ KIND_JUMP_TABLE8 0000ba38 1c 98 ldr r0, [sp, #0x70] 0000ba3a 01 68 ldr r1, [r0] 0000ba3c 23 9a ldr r2, [sp, #0x8c] 0000ba3e dd f8 78 c0 ldr.w r12, [sp, #0x78] 0000ba42 00 21 movs r1, #0x0 # XXX 0000ba44 5c e8 00 3f ldrex r3, [r12] 0000ba48 0b 45 cmp r3, r1 The line marked XXX is present in the bad binary but not in the good. Two lines further down r1 is compared against r3. So that zeroed register does look like it's getting read to me (?)
That part looks like this for me: ldr r0, [sp, #112] @ 4-byte Reload ldr r1, [r0] ldr r2, [sp, #140] ldr.w r12, [sp, #120] @ 4-byte Reload LBB4_4: @ %monotonic.i @ =>This Inner Loop Header: Depth=1 - movs r9, #0 @ This appears/disappears depending on patch ldrex r3, [r12] cmp r3, r1 bne LBB4_6
The problem is that the Thumb "movs" only accepts registers r0-r8. As luck would have it 9 % 8 == 1, so when LLVM is trying to encode a move to r9 it gets interpreted as a clobber of r1.
Sorry, that should obviously be r0-r7. Anyway, until recently we could have used "t2MOVi" instead, but unfortunately v8M-baseline has ldrex/strex but no t2MOVi so we need to make sure that whatever we do works on (e.g.) Cortex-M23 too.
Hah, thanks for that insight, Tim :-) MatzeB, if this is tricky to fix, can we revert for now? Having known miscompiles in the tree is suboptimal.
I'll put something in this afternoon, I think we can go back to the way before my patch without the mov (while maintaining the parts of my patch that give correct live-in lists, which was my main motivation anyway).
And for the record: Turns out that when you immediately compile to a .o file you do indeed get `movs r1, #0x0` when you go through an assembly file then you have `movs r9, #0` in the assembly file and the assembler turns that into `movs.w r9, #0x0` which works.
Fixed in r310534
@Nico Weber: Can you test and see if this fixes your issues. If it does, I'll merge the fix into the llvm 5.0 branch.
Let's keep this open until we have the fix in llvm-5.0
Thanks, I can confirm that this fixes my reduced example, and most of our base_unittests now pass for me. The two that don't also fail with the new change and r304267 reverted. So this looks much better now.
Merged to 5.0 in r310628.