This is an archive of the discontinued LLVM Phabricator instance.

[asan] Default to -fsanitize-address-use-odr-indicator for non-Windows
ClosedPublic

Authored by MaskRay on Nov 1 2022, 10:34 PM.

Details

Summary

This enables odr indicators on all platforms and private aliases on non-Windows.
Note that GCC also uses private aliases: this fixes bogus
The following global variable is not properly aligned. errors for interposed global variables

Fix https://github.com/google/sanitizers/issues/398
Fix https://github.com/google/sanitizers/issues/1017
Fix https://github.com/llvm/llvm-project/issues/36893 (we can restore D46665)

Global variables of non-hasExactDefinition() linkages (i.e.
linkonce/linkonce_odr/weak/weak_odr/common/external_weak) are not instrumented.
If an instrumented variable gets interposed to an uninstrumented variable due to
symbol interposition (e.g. in issue 36893, _ZTS1A in foo.so is resolved to _ZTS1A in
the executable), there may be a bogus error.

With private aliases, the register code will not resolve to a definition in
another module, and thus prevent the issue.

Cons: minor size increase. This is mainly due to extra __odr_asan_gen_* symbols.
(ELF) In addition, in relocatable files private aliases replace some relocations
referencing global symbols with .L symbols and may introduce some STT_SECTION symbols.

For lld, with -g0, the size increase is 0.07~0.09% for many configurations I
have tested: -O0, -O1, -O2, -O3, -O2 -ffunction-sections -fdata-sections
-Wl,--gc-sections. With -g1 or above, the size increase ratio will be even smaller.

This patch obsoletes D92078.

Don't migrate Windows for now: the static data member of a specialization
std::num_put<char>::id is a weak symbol, as well as its ODR indicator.
Unfortunately, link.exe (and lld without -lldmingw) generally doesn't support
duplicate weak definitions (weak symbols in different TUs likely pick different
defined external symbols and conflict).

Diff Detail

Event Timeline

MaskRay created this revision.Nov 1 2022, 10:34 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 1 2022, 10:34 PM
MaskRay requested review of this revision.Nov 1 2022, 10:34 PM
Herald added projects: Restricted Project, Restricted Project, Restricted Project. · View Herald TranscriptNov 1 2022, 10:34 PM
MaskRay updated this revision to Diff 472533.Nov 2 2022, 1:38 AM
MaskRay retitled this revision from [asan] Default to -fsanitize-address-use-odr-indicator to [asan] Default to -fsanitize-address-use-odr-indicator for non-Windows.
MaskRay edited the summary of this revision. (Show Details)

exclude Windows for its weak symbol issue

kstoimenov accepted this revision.Nov 2 2022, 10:09 AM
kstoimenov added inline comments.
llvm/test/Instrumentation/AddressSanitizer/global_with_comdat.ll
8

Remove extra space?

This revision is now accepted and ready to land.Nov 2 2022, 10:09 AM
vitalybuka added inline comments.Nov 2 2022, 6:40 PM
llvm/lib/Transforms/Instrumentation/AddressSanitizer.cpp
777–782

It should be

UseOdrIndicator(ClUseOdrIndicator.getNumOccurrences() > 0 ? ClUseOdrIndicator : UseOdrIndicator),
ClUsePrivateAlias(ClUsePrivateAlias.getNumOccurrences() > 0 ? ClUsePrivateAlias : UseOdrIndicator),

My rule of thumb is Cl flags, which are mostly for testing, must be direct and do exactly what was asked, even if it's not useful, and they should override fronted parameters.

vitalybuka accepted this revision.Nov 2 2022, 6:40 PM
MaskRay updated this revision to Diff 472827.Nov 2 2022, 6:56 PM
MaskRay marked 2 inline comments as done.

address comments

Thanks for the quick review:)

MaskRay edited the summary of this revision. (Show Details)Nov 2 2022, 7:10 PM
MaskRay edited the summary of this revision. (Show Details)Nov 2 2022, 7:16 PM
This revision was landed with ongoing or failed builds.Nov 2 2022, 7:21 PM
This revision was automatically updated to reflect the committed changes.
sbc100 added a subscriber: sbc100.EditedNov 4 2022, 12:26 PM

At least for WebAssembly object files, it looks like the symbol table now contains (what look like) demangled symbols. e.g.:

$ llvm-nm /tmp/emscripten_temp/command_0.o 
00003ef5 D __odr_asan_gen__numargs
000041a6 D __odr_asan_gen__stdcmd<1068>::init
000041c3 D __odr_asan_gen__stdcmd<1069>::init
000041e0 D __odr_asan_gen__stdcmd<1070>::init
000041fd D __odr_asan_gen__stdcmd<1071>::init
0000421a D __odr_asan_gen__stdcmd<1073>::init
0000423f D __odr_asan_gen__stdcmd<1074>::init
0000425c D __odr_asan_gen__stdcmd<1075>::init
00004279 D __odr_asan_gen__stdcmd<1076>::init
0000429e D __odr_asan_gen__stdcmd<1077>::init
000042bc D __odr_asan_gen__stdcmd<1078>::init

Symbol names with colon in them currently cause issues with the emscripten compiler (when parsing the output of llvm-nm). If this is intentional we can look into re-working the parser, but I just wanted to check if this was indented behaviour? Are these characters allowed in symbol names?

MaskRay added a comment.EditedNov 4 2022, 12:34 PM

At least for WebAssembly object files, it looks like the symbol table now contains (what look like) demangled symbols. e.g.:

$ llvm-nm /tmp/emscripten_temp/command_0.o 
00003ef5 D __odr_asan_gen__numargs
000041a6 D __odr_asan_gen__stdcmd<1068>::init
000041c3 D __odr_asan_gen__stdcmd<1069>::init
000041e0 D __odr_asan_gen__stdcmd<1070>::init
000041fd D __odr_asan_gen__stdcmd<1071>::init
0000421a D __odr_asan_gen__stdcmd<1073>::init
0000423f D __odr_asan_gen__stdcmd<1074>::init
0000425c D __odr_asan_gen__stdcmd<1075>::init
00004279 D __odr_asan_gen__stdcmd<1076>::init
0000429e D __odr_asan_gen__stdcmd<1077>::init
000042bc D __odr_asan_gen__stdcmd<1078>::init

Symbol names with colon in them currently cause issues with the emscripten compiler (when parsing the output of llvm-nm). If this is intensional we can look into re-working the parser, but I just wanted to check if this was indented behaviour? Are these character allows in symbol names?

The asan instrumentation just prepends __odr_asan_gen_ to the symbol name to form a new symbol name. For ELF every byte except \0 can be used in a symbol name, and this is totally fine.

I am unfamiliar with WebAssembly. Does the aforementioned parsing tool somehow skip printing _stdcmd<1068>::init symbols?

ModuleAddressSanitizer::shouldInstrumentGlobal encodes the candidate global variables.

llvm/lib/Transforms/Instrumentation/AddressSanitizer.cpp
777–782

Good catch. I agree with this analysis.

The asan instrumentation just prepends __odr_asan_gen_ to the symbol name to form a new symbol name. For ELF every byte except \0 can be used in a symbol name, and this is totally fine.

I am unfamiliar with WebAssembly. Does the aforementioned parsing tool somehow skip printing _stdcmd<1068>::init symbols?

The parser was confused by the presence of a colon in the symbol names and generating an error. I created a patch that makes it a little more robust: https://github.com/emscripten-core/emscripten/pull/18152

So we have a fix for the proximate issue, but I just wanted to check if that new prepended symbol name was supposed to be the demanded C++ name (which it seem to be) rather than the mangled name?

This is the first time we've had these characters appearing in symbol names so I just wanted to flag that as relatively unprecedented (at least in terms of all the codebases that emscripten has been exposed to so far).

ModuleAddressSanitizer::shouldInstrumentGlobal encodes the candidate global variables.

MaskRay added a comment.EditedNov 4 2022, 1:02 PM

The asan instrumentation just prepends __odr_asan_gen_ to the symbol name to form a new symbol name. For ELF every byte except \0 can be used in a symbol name, and this is totally fine.

I am unfamiliar with WebAssembly. Does the aforementioned parsing tool somehow skip printing _stdcmd<1068>::init symbols?

The parser was confused by the presence of a colon in the symbol names and generating an error. I created a patch that makes it a little more robust: https://github.com/emscripten-core/emscripten/pull/18152

So we have a fix for the proximate issue, but I just wanted to check if that new prepended symbol name was supposed to be the demanded C++ name (which it seem to be) rather than the mangled name?

I think so. The instrumentation is done at LLVM layer, not Clang layer. The LLVM layer code generally shouldn't know the different mangling schemes used by different language frontends.
For C++, we can get a mangled variable name with something like

template <typename T>
inline int var = 3;

int *x = &var<int>;

_Z3varIiE

My understanding is that all such symbols are in a COMDAT and have a linkonce_odr/weak_odr linkage: these symbols are skipped by ModuleAddressSanitizer::shouldInstrumentGlobal, so one will not see __odr_asan_gen__Z3varIiE. (And even if such a symbol name appears, it's not a big deal. Unable to demangle such non-interesting symbols is fine, IMHO.)
I do not know what the WebAssembly world does but if you can provide detail instructions for reproduce I am happy to take a look if needed.

This is the first time we've had these characters appearing in symbol names so I just wanted to flag that as relatively unprecedented (at least in terms of all the codebases that emscripten has been exposed to so far).

ModuleAddressSanitizer::shouldInstrumentGlobal encodes the candidate global variables.