r265836, which causes more memcpys to be expanded, grew chrome_child.dll by 240 KB. I suspect that maybe X86TargetLowering::MaxStoresPerMemcpyOptSize should be reduced, at least on 32-bit where we lower calls with smaller code than we used to.
Some notes: I tried bootstrapping Clang on 32-bit x86 with -Os and different values for MaxStoresPerMemcpyOptSize and MaxStoresPerMemmoveOptSize Size MaxStoresPerMemcpyOptSize 67964194 1 68009250 2 68029730 3 68037922 4 (4 is the default) So it seems setting it to 1 would be a size win. But there might be more going on here. Consider the following example: #include <string.h> void f(char *dst, const char *src) { memcpy(dst, src, 9); } At -O2 we will lower it with loads and stores: 00000000 <f>: 0: 8b 44 24 04 mov 0x4(%esp),%eax 4: 8b 4c 24 08 mov 0x8(%esp),%ecx 8: 8a 51 08 mov 0x8(%ecx),%dl b: 88 50 08 mov %dl,0x8(%eax) e: f2 0f 10 01 movsd (%ecx),%xmm0 12: f2 0f 11 00 movsd %xmm0,(%eax) 16: c3 ret At -Os, with the threshold set to 1, we make the call, and save one byte: 00000000 <f>: 0: 83 ec 10 sub $0x10,%esp 3: 6a 09 push $0x9 5: ff 74 24 1c pushl 0x1c(%esp) 9: ff 74 24 1c pushl 0x1c(%esp) d: e8 fc ff ff ff call e <f+0xe> 12: 83 c4 1c add $0x1c,%esp 15: c3 ret However, if dst and src in the source were "int" instead of "char", we'd use rep;movs to copy two 4-byte words, and do one 1-byte move afterwards, causing much larger code: 00000000 <f>: 0: 57 push %edi 1: 56 push %esi 2: 8b 44 24 0c mov 0xc(%esp),%eax 6: 8b 54 24 10 mov 0x10(%esp),%edx a: b9 02 00 00 00 mov $0x2,%ecx f: 89 c7 mov %eax,%edi 11: 89 d6 mov %edx,%esi 13: f3 a5 rep movsl %ds:(%esi),%es:(%edi) 15: 8a 4a 08 mov 0x8(%edx),%cl 18: 88 48 08 mov %cl,0x8(%eax) 1b: 5e pop %esi 1c: 5f pop %edi 1d: c3 ret The rep;movs lowering comes from X86SelectionDAGInfo::EmitTargetCodeForMemcpy. If we were to just 1-byte rep;movs instead, the code would be: 00000000 <f>: 0: 57 push %edi 1: 56 push %esi 2: 8b 7c 24 0c mov 0xc(%esp),%edi 6: 8b 74 24 10 mov 0x10(%esp),%esi a: b9 09 00 00 00 mov $0x9,%ecx f: f3 a4 rep movsb %ds:(%esi),%es:(%edi) 11: 5e pop %esi 12: 5f pop %edi 13: c3 ret Which is the smallest yet. Maybe at -Os we should try to avoid ending up with remaining bytes after rep;movs and instead work with a smaller type, or just make the library call. Perhaps doing this would be better than trying to lower MaxStoresPerMemcpyOptSize. (Mostly notes to self) CMake invocation: $ CC=/work/llvm/build.release/bin/clang CXX=/work/llvm/build.release/bin/clang++ CFLAGS="-m32" CXXFLAGS="-m32" cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS_RELEASE="-Os" -DCMAKE_CXX_FLAGS_RELEASE="-Os" -GNinja .. Needed to apply this patch to avoid crashing: --- a/lib/Target/X86/X86SelectionDAGInfo.cpp +++ b/lib/Target/X86/X86SelectionDAGInfo.cpp @@ -241,6 +241,8 @@ SDValue X86SelectionDAGInfo::EmitTargetCodeForMemcpy( unsigned UBytes = AVT.getSizeInBits() / 8; unsigned CountVal = SizeVal / UBytes; + if (!CountVal) + return SDValue(); SDValue Count = DAG.getIntPtrConstant(CountVal, dl); unsigned BytesLeft = SizeVal % UBytes;
Some measurements on a 32-bit Chrome build on Windows: Threshold Size (chrome_child.dll) 4 50,785,280 2 50,761,728 1 50,719,232