[linux] debug information generated by clang much larger than gcc's, making linking clang objects 20% slower than gcc #7926
Comments
Er, that second command line should use "-o profile.clang" ../llvm/Release/bin/clang++ -pthread -O0 -g -c -o profile.clang profile.ii -rw-r--r-- 1 evanm evanm 4835656 2010-07-01 15:44 profile.clang |
g++ -v? Is the size due to debug info? What are the sizes with -g0? |
$ gcc --version |
With -g0, they are pretty close: $ ls -l profile.clang profile.gcc |
Devang, this looks like an example of where Clang is producing much more bloated debug info for c++ code than GCC is. Can you investigate? |
My g++-4.2 does not compile attached preprocessed source file. It'd be great if you could attach an example that I can compile using g++-4.2 for comparison. Otherwise, it'd be easier to analyze size bloat if there is a smaller test case. |
Attached a hacked up version of the source which will build with gcc-4.2 (although the code wouldn't work properly, it should be fine for investigating debug info). On Darwin, Clang is actually doing better than GCC -- it seems like gcc-4.4 has made improvements in this area. ddunbar@giles:tmp$ xclang -v
|
In building Chrome with clang vs gcc and comparing files pair-wise, the size difference distribution was all over -- some were smaller with clang, some larger. The summed result though is a binary larger by a factor of 2. I grabbed this file because it exhibited the largest difference, so I had hoped it would have made the problem more obvious (e.g. maybe it's just one huge symbol, or maybe each symbol is just a little bit larger). I can provide any of the other files, but I'm not sure whether they'll help much. I guess what I'm saying is: I'd like to help, please give me advice on how. |
I agree with your approach, starting with the file with the biggest difference is a good place. The problem is that we can't reproduce the issue on Darwin with GCC 4.2. If you attach the .s files generated with GCC 4.4 and Clang on your platform, Devang might be able to tell which bits of debug info GCC is leaving out. Another approach would be to see if the size difference manifests on Darwin using GCC 4.2. If so, and you can grab the .o file with the biggest difference there, we can probably do a better job of investigating. p.s. Does Chrome built with Clang work? |
Adding Nico, who has successfully built on Mac. Nico: to find files with large differences, I did something like I might be able to try a build on an Ubuntu Hardy machine (also gcc 4.2), but my recollection is that there are other problems there like Chrome on Clang: yeah, it runs! On-page images come out slightly corrupted (gonna be annoying to track down) and there's one last v8 patch that is some template trickiness that they want me to perftest on gcc/MSVC to make sure I don't regress there, and I haven't gotten around to it. |
(Chrome/Mac on Clang: Main binary builds and runs in Debug, but fails to link in Release. unit_tests, the binary which contains most of chrome's automated tests, builds in Debug but reports many test failures when built with clang and eventually crashes.) |
FWIW, on OS X they're in the same ballpark, clang's ~11% bigger: hummer:src thakis$ ls -l clang/sym/Debug/Chromium\ Framework.framework/Versions/Current/Chromium\ Framework |
Here are the ten largest sections and their sizes from Linux Chrome. (Ignore the first number in each line, that's just the section number). gcc 4.4.3: clang: Generated with: |
Er, I meant: ls -lh says gcc produces 589M binary, while clang produces 2.1G binary. |
I've been trying to find a reduced test case for this, basically taking the preprocessed profile.cc, ripping out stuff while it still compiles and still generates a significantly larger .o file with Clang than with GCC. I'm not sure the reasons for the difference in size here are the same as for the difference in size for all of Chrome, but hopefully it could be helpful in some way. Compiling the attached file like this: clang++ -g -c /tmp/reduction.ii -o /tmp/a.o.clang And comparing with GCC (4.2.1 on Darwin, 4.4.3 on Linux) using the same flags yields these file sizes (in bytes): On x86_64-unknown-linux-gnu: On x86_64-apple-darwin10: (This was using Clang built from r123315.) So it seems that Clang generates a larger .o file than GCC on Linux. On Darwin it seems about the same. Not using the -g option makes them come out about the same size. |
4k seems small enough to objdump -D the files and look for clues. |
chandlerc says that gcc uses a version of dwarf debugging information that has been tuned to be small (dwarf 2?), while clang uses the old and bloaty version (dwarf 1?). |
The debug information causes big .o files, which in turn makes ld very slow. This bug makes chrome builds with clang/linux 5x slower than gcc. For now, we've disabled debug information generation on the clang/linux bots because of this. See http://code.google.com/p/chromium/issues/detail?id=70000 |
I know of two things that can be done to improve clang's debug info size:
There might be more advantages in dwarf4, but I am not sure. |
I suspect a large chunk of this problem will be solved by turning on the .cfi_* sections Rafael. That should in particular help the non-type debug information size. The type debug info size is what would be helped most by dwarf4 (from my faded memory...) |
Regarding "comment 20", it is not true. llvm emits dwarf3. Someone with access to a linux machine should take a very small test case and see why debug info is bloated 5x (as per claims) on linux but not on darwin. |
Comments #8 and #11 indicate that gcc 4.2 (the Darwin one) has behavior similar to clang, but current gcc versions are the ones that produce the problem. |
That was even more confusing, sorry.
|
Aha.. thanks for clarification. |
Evan: I'm not sure that's completely right. On the chromium waterfall, the linux/clang bot was way slower than the clang/mac bot before I disabled debug information generation. I believe both bots have similar hardware. |
One big difference is that on darwin most of the debug info is not copied from the .o files, so it is not hit as hard by having large debug info. Are you using gold on linux btw? |
Yes, we use gold. |
Updating the title, since it's no longer 5x as slow. Here's some up-to-date data: (gcc 4.4, clang from last week – 149419 – with memcpy() codegen bug, chromium r120581). thakis@yearofthelinuxdesktop:/usr/local/google/chrome/src$ ls -l out_clang/Debug/chrome thakis@yearofthelinuxdesktop:/usr/local/google/chrome/src$ ls -l out_gcc/Debug/chrome thakis@yearofthelinuxdesktop:/usr/local/google/chrome/src$ objdump -h ./out_clang/Debug/chrome | perl -pe 's/([0-9a-f]{8}).*/hex($1)/eg'| sort -k3 -n | tail -10 | tac thakis@yearofthelinuxdesktop:/usr/local/google/chrome/src$ objdump -h ./out_gcc/Debug/chrome | perl -pe 's/([0-9a-f]{8}).*/hex($1)/eg'| sort -k3 -n | tail -10 | tac |
Preprocessed example file. 3949 kB bigger: out_clang/Debug/obj/chrome/common/common.logging_chrome.o So logging_chrome.cc ( http://code.google.com/codesearch#OAMlx_jo-ck/src/chrome/common/logging_chrome.cc&exact_package=chromium&q=logging_chrome.cc ) is almost 4 MB bigger when built with clang. Here's the breakdown for that file: thakis@yearofthelinuxdesktop:/usr/local/google/chrome/src$ objdump -h out_clang/Debug/obj/chrome/common/common.logging_chrome.o | perl -pe 's/([0-9a-f]{8}).*/hex($1)/eg'| sort -k3 -n | tail -10 | tac I'm attaching the (clang-)preprocessed output of logging_chrome.cc. |
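For anyone puzzling over the perl step in the objdump pipeline above: it rewrites the hexadecimal Size column into a decimal number so that `sort -k3 -n` can rank sections by size. A minimal C++14 sketch of that conversion follows. The names `hex_field` and `larger_section` are hypothetical, and the sketch assumes the size is printed as 8 lowercase hex digits, as the regex implies.

```cpp
// Sketch of the Size-column conversion done by the perl one-liner above.
// Assumption: `objdump -h` prints each section size as 8 lowercase hex digits.

// Converts up to 8 hex digits at the start of `s` into a number, stopping at
// the first character that is not a lowercase hex digit.
constexpr unsigned long hex_field(const char* s) {
    unsigned long value = 0;
    for (int i = 0; i < 8; ++i) {
        const char c = s[i];
        if (c >= '0' && c <= '9') {
            value = value * 16 + static_cast<unsigned long>(c - '0');
        } else if (c >= 'a' && c <= 'f') {
            value = value * 16 + static_cast<unsigned long>(c - 'a' + 10);
        } else {
            break;  // end of the hex field
        }
    }
    return value;
}

// True when the first size field is larger than the second; this is the
// ordering that `sort -k3 -n | tail -10 | tac` relies on.
constexpr bool larger_section(const char* size_a, const char* size_b) {
    return hex_field(size_a) > hex_field(size_b);
}
```

Both functions are `constexpr` only so they can be checked at compile time; a real tool would read the lines of `objdump -h` at run time.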
Interesting. Thanks for the additional information! |
Here are the current chrome build numbers, from the dupe (Nico Weber, 2012-02-07 13:32:21 CST): I did incremental builds (touch one file, measure rebuild time – this measures So while compiling with clang is faster, linking the resulting object files Raw numbers (each is the min of 3 runs): Chrome incremental build times (ninja, gold on by default, debug clang gcc component build (libv8.so instead of libv8.a etc) clang component build (release builds) clang gcc component clang component …and with that, I'm done spamming this bug for a while. |
Hrm. logging_chrome.ii is preprocessed on linux and not on the mac. I'll probably need someone with access to gcc on linux to do a bit of analysis of which types are included and which aren't - and we can try to come up with a "why" and "how" from there. |
FWIW I've got the -wi output of logging_chrome.ii from linux. The formatting is... annoying so it's slow going looking through it. It's obviously much larger though. |
I sent echristo a mach-o version of logging_chrome.o (built with clang, and with a local build of gcc4.6). If anybody else wants that, let me know. (It's too big to attach it.) |
Not looking at this currently. |
I checked if this has improved. It hasn't. clang's debug binary output is now 2GB (due to chrome growing over time I suppose). gcc4.6's has grown a lot, it's now 1.76GB (still 12% smaller than clang's). Linking 'chrome' in debug mode after touching a single file takes 1m42s with gcc (warm cache), but 7m11s with clang, over 400% slower by now. |
First approximation guess would be relocations, but I'll need to get something set up to look at it. I'll move this up my priority list. |
As another side note I'm not quite sure what's going on in the debug information in particular that's making linking so slow. It'll be worth looking into. |
First semi-random example of differences between Clang & GCC: int func(void (*)()); struct foo { int i = func(foo::bar); // neither emit 'foo' struct bar { This affects all the "ChromeUtilityHostMsg" objects that use a registration system not unlike foo+bar here. See "ChromeUtilityHostMsg_ParseJSON_Succeeded" for example. GCC doesn't produce any debug info for that type, even though the ctor called from the global initializer for g_LoggerRegisterHelperChromeUtilityHostMsg_ParseJSON_Succeeded references the type (both the ID enum, and the Log builder function) (side note: why do you have the extra LoggerRegisterHelperChromeUtilityHostMsg_ParseJSON_Succeeded class? Why not have a generic registration type with ctor parameters: LoggerRegisterHelper g_LoggerRegisterHelper(ChromeUtilityHostMsg_ParseJSON_Succeeded::ID, ChromeUtilityHostMsg_ParseJSON_Succeeded::Log); ? that would reduce Clang's debug info (you'd drop one out of every pair of these types), though it'd increase GCC's because it would now actually start emitting the debug info for ChromeUtilityHostMsg_ParseJSON_Succeeded) |
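The generic registration type suggested in the side note above could look roughly like the sketch below. Only the names `LoggerRegisterHelper`, `ID` and `Log` come from the comment; the members, the `LogFn` alias, `ExampleMsg` and the value 42 are made-up stand-ins, and the real Chromium helper registers itself from a global initializer rather than being `constexpr`.

```cpp
// Hypothetical sketch: the message type and the helper's members are
// illustrative, not Chromium's real definitions.
using LogFn = void (*)();

// One generic helper type, configured through constructor arguments, instead
// of one generated helper class per message.
struct LoggerRegisterHelper {
    constexpr LoggerRegisterHelper(unsigned id, LogFn log) : id(id), log(log) {}
    unsigned id;
    LogFn log;
};

// Stand-in for a message class such as ChromeUtilityHostMsg_ParseJSON_Succeeded.
struct ExampleMsg {
    enum { ID = 42 };
    static void Log() {}
};

// A single registration object per message; no per-message helper class, so
// there is one type fewer per message for the compiler to describe.
constexpr LoggerRegisterHelper g_example_helper(ExampleMsg::ID, &ExampleMsg::Log);
```

`constexpr` is used here only so the sketch can be checked at compile time. As the comment notes, this shape refers to the message type directly, so GCC would presumably start describing that type as well.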
Another issue we noticed is that Clang produces debug info for implicit special members that are ODR used, but never codegen'd (because they were frontend inlined*, or used in inline functions that were never called/codegen'd, etc). eg: struct foo { void func(foo*); int main() { Clang produces a description of 'foo' with one member 'i' and one subprogram 'foo' (the ctor) for which there is no definition. GCC only describes 'i' and does not describe a ctor 'foo'. Both compilers, given code such as: void func(foo&(foo::*)(const foo&)); produce debug info describing the type 'foo' with a subprogram 'operator=', the implicit default copy assignment operator. As a minor note - when Clang (either rightly or wrongly) emits the declarations of these implicit members, it provides file and line numbers for those declarations. GCC does not. Also, Clang produces "DW_AT_accessibility" on all members, even with the default accessibility of the type.
|
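The snippets in the comment above lost part of their text, so here is a hedged reconstruction of the two situations it describes. The bodies of `make_one`, `take` and `use_assignment` are assumptions; only `foo`, its member `i`, and the two `func` signatures are taken from the comment (the second one is renamed `take` here to keep the sketch unambiguous).

```cpp
#include <type_traits>

struct foo {
    int i;  // the only member written in the source
};

// foo's default constructor is implicit and trivial, so using it needs no call.
static_assert(std::is_trivially_default_constructible<foo>::value,
              "foo has a trivial implicit default constructor");

void func(foo* p) { p->i = 0; }

// Constructing a foo ODR-uses the implicit constructor, but no code is
// generated for the constructor itself. Per the comment, Clang still described
// a 'foo' subprogram here while GCC described only 'i'.
void make_one() {
    foo f;
    func(&f);
}

// Taking the address of the implicit copy assignment operator forces it to
// exist as a real function. Per the comment, both compilers then describe
// 'operator='.
void take(foo& (foo::*)(const foo&)) {}
void use_assignment() { take(&foo::operator=); }
```

Compiling this with `-g -c` under each compiler and dumping the DWARF is one way to check whether the difference still exists; the sketch does not claim any particular output for current compiler versions.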
Similar to the previous point about special members, Clang emits the declaration of all template specializations, even those that are never codegen'd as in this example: struct foo { inline void func() { int main() { Since 'func' is never codegen'd (as it is never called), func is never generated either - Clang still emits a debug info description for both func and func. GCC only emits func. This issue and the implicit special members can both be addressed in a similar/unified way - don't emit those declarations when emitting the type, but add them after the fact if we end up codegening them. |
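The example in the comment above lost its template arguments in extraction. A sketch of the same shape, with assumed argument types `char` and `int` and assumed function names `unused` and `used`, might be:

```cpp
struct foo {
    template <typename T>
    int func() { return static_cast<int>(sizeof(T)); }
};

// Never called anywhere, so an inline function like this is never
// code-generated, and neither is the foo::func<char> it names.
inline int unused() { return foo().func<char>(); }

// Non-inline, so it is always emitted, and with it foo::func<int>.
int used() { return foo().func<int>(); }
```

Going by the comment, Clang at the time would describe both specializations in the debug info for `foo`, while GCC would describe only the one that was actually emitted.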
Probable GCC bug is producing small debug info when it comes to stream-related code such as: #include int main() { The total object file size from GCC is nearly 1/4th that of Clang, but that's because GCC only produced debug info describing the /declaration/ of basic_ifstream. It didn't describe any of the members (including the 'bad()' function called there). Clang does describe this type (& its dependencies) in full, which seems appropriate. I haven't narrowed this down fully, but it seems to be related to libstdc++'s use of extern template as follows: extern template class basic_ifstream; if that line is not present, the full type information is emitted just like Clang. This might seem sort of reasonable, if the full definition of basic_ifstream was emitted in the debug-built binaries of libstdc++, but so far as I can tell, it is not (I only see another declaration there). But perhaps I've done something wrong in that investigation. If I can come up with a standalone repro, I'll post those details. |
A little more detail on the fstream issue. I've simplified it down to: struct a { template extern template class b; int main() { In GCC's DWARF: 'b' is emitted as a declaration but it contains the DW_TAG_subprograms of 'b's ctor and dtor (if those are explicit, rather than implicit, it does not include them). If the "extern template class b" is proceeded by "template class b" then the debug info for 'func' is emitted (I'm not sure what/where/why the problem exists for this in libstdc++'s debug symbol builds - I may've looked at the wrong libs or in the wrong way, or linked to the wrong ones, etc...) |
Referring to comment 67, looking at gcc 4.6 behaviour, gcc does ship all the debug info for an extern template where it's instantiated, not where it's used. So this program: #include int main() { builds a binary that does not have debug info for ifstream, but its debug info can be found in the libstdc++ .so instead. |
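To make the declaration/definition split concrete, here is a small sketch of an extern template in one file. The class `b` follows the naming of the earlier reduction, but its members and the function `call` are assumptions. In a real project the explicit instantiation definition would live in exactly one other source file (for `basic_ifstream`, inside libstdc++); it is included here only so the sketch links on its own.

```cpp
template <typename T>
struct b {
    virtual ~b() {}
    virtual T func() { return T(); }
};

// Explicit instantiation declaration: a promise that b<int> is instantiated
// in some other translation unit, so this one need not emit its members.
extern template struct b<int>;

// A use site. With only the declaration above in scope, the comments report
// that GCC described b<int> as a bare declaration in this file's debug info.
int call(b<int>& x) { return x.func(); }

// Explicit instantiation definition: the place where the members (and,
// according to the comment above, GCC's full debug info) are emitted.
template struct b<int>;
```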
Worse than this - even if you switch out the registration system for something more like what I described, something else seems to be getting in the way: struct base { GCC emits a declaration for 'foo', not a definition (so GCC's debug info doesn't mention 'base' at all). The absence of a virtual function in 'base' causes this not to happen & the full definition of 'foo' to be emitted, including the DW_AT_inheritance referring to 'base', and the complete 'base' definition. Given the absence of any key function here, I'm not sure what GCC's deal is. These are polymorphic classes that may be seen in no other TU than this one where their code must be emitted (due to the presence of virtual functions). I don't know how GCC could rely on these types to be emitted in any other TU - and if not, this seems like another GCC bug. Perhaps someone from GCC can explain how/why this is correct & what the logic is here & we can implement it, but for now I'm suspicious. I'd like to find a way around this bug or optimization so I can compare Clang v GCC debug info sans this issue & see how much it contributes to the problems with this TU (my theory is that it's the major issue/difference in this TU - the dependencies brought in via the base class & parameter types of these registration objects are substantial & there are many of them) but I'm not entirely sure how to do that. I'll keep experimenting. |
So - a theory to explain Comments 64, 67-70: GCC is emitting type information only when the vtable (& other virtual 'stuff' - such as virtual base handling) is emitted for any type that has a vtable. That's why this looks weird for the explicit template instantiation case - it's not specifically targeting that case, it just gets it for free because the type then ends up with a key function (at least one out of line virtual function) & thus the vtable isn't emitted at the call site. This is a fairly sound idea (one we've considered implementing before) though there are some trivial ways it can fail (whether or not such code is likely to occur in the wild is open for debate): struct foo { int main() { in this case, the virtual functions of 'foo' may never be defined - since no instance of 'foo' is ever constructed, this is valid. Yet the referenced static member will still be uncallable from the debugger because GCC emits no mention of the 'foo' type at all. |
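The failure case described above lost its code in extraction; a sketch of what it plausibly looked like is below. The member names `f` and `answer` and the function `call_static` are assumptions, not the original text.

```cpp
struct foo {
    virtual void f();                   // declared but never defined: the key function
    static int answer() { return 42; }  // usable without ever creating a foo
};

// No foo object is ever constructed, so the vtable is never referenced and the
// missing definition of foo::f is not a link error. Under the theory above,
// the vtable is also never emitted, so GCC would emit no description of foo,
// and a debugger could not call foo::answer by name.
int call_static() { return foo::answer(); }
```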
Just while I've got this here, I've prototyped the "only emit debug info for class definitions along with the vtable for any type with a vtable" idea and here are some numbers: with the original logging TU: Strings with baseline Clang: 30919 If we change the registration system so GCC doesn't win by avoiding the intermediate registration types (due to the ID reference being ignored by GCC): Strings with improved Clang: 6068 So we get pretty close. |
I'm going to consider this resolved by r188576 as it reduces the debug info for the example down to If you have other particularly bad examples, please file new bugs/data & I'll investigate. GCC 4.7: Clang (old): Clang (r188576): |
I finally had the opportunity to check this. With current trunk chromium, gcc 4.6, and clang r192635, a (statically linked) debug build of chrome is 1.9GB with gcc and 1.3GB with clang. Incremental build times after touching a file in the ui/ directory is 3m55s with clang and 4m02s with gcc. So this looks much better now, thanks! |
Awesome! thanks for all your (everyone on this bug) help with repros, reductions, etc. Sorry it was a bit of a while coming. |
Woot. |
As a follow-up (probably not related to this bug): I also measure incremental build times with a component build, and clang is ~10% slower than gcc 4.6 for a full build of chrome (24min vs 22min), and still a good deal slower in incremental builds (1.6s vs 1.2s). So generally I feel that more perf work is needed on linux. |
OK, interesting. If you ever get any information for it let me know, otherwise I'll see what I can work up as I get to it. |
As an update, a lot of things have been improved and we're seeing build and link times (including incremental) under gcc's at this point. Thanks again everyone! |
Extended Description
I am looking into why gcc Chrome is ~600mb while clang builds of Chrome are 1.7gb. (Both with debugging symbols.) I've attached the preprocessed version of one of the files that differs the most between the two compilers.
With the same input file, I can generate two output files:
g++ -pthread -O0 -g -c -o profile.gcc profile.ii
../llvm/Release/bin/clang++ -pthread -O0 -g -c -o profile.gcc profile.ii
Their sizes differ significantly:
-rw-r--r-- 1 evanm evanm 4709032 2010-07-01 15:30 profile.clang
-rw-r--r-- 1 evanm evanm 2570352 2010-07-01 15:30 profile.gcc
gzipped file attached.
llvm$ svn info . tools/clang/ | grep Revision
Revision: 107405
Revision: 107405