27415 – Evaluate X86TargetLowering::MaxStoresPerMemcpyOptSize

LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filled before November 26, 2021. Use github to submit LLVM bugs

Bug 27415 - Evaluate X86TargetLowering::MaxStoresPerMemcpyOptSize

Summary: Evaluate X86TargetLowering::MaxStoresPerMemcpyOptSize

Status:	NEW

Alias:	None

Product:	libraries
Classification:	Unclassified
Component:	Backend: X86 (show other bugs)
Version:	trunk
Hardware:	PC Linux

Importance:	P normal
Assignee:	Unassigned LLVM Bugs

URL:
Keywords:

Depends on:
Blocks:	26299
	Show dependency tree

Reported:	2016-04-18 17:15 PDT by Hans Wennborg
Modified:	2020-10-12 09:01 PDT (History)
CC List:	2 users (show)

See Also:	47794
Fixed By Commit(s):

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Hans Wennborg 2016-04-18 17:15:16 PDT

r265836, which causes more memcpys to be expanded, grew chrome_child.dll by 240 KB.

I suspect that maybe X86TargetLowering::MaxStoresPerMemcpyOptSize should be reduced, at least on 32-bit where we lower calls with smaller code than we used to.

Comment 1 Hans Wennborg 2016-04-22 15:44:04 PDT

Some notes:

I tried bootstrapping Clang on 32-bit x86 with -Os and different values for MaxStoresPerMemcpyOptSize and MaxStoresPerMemmoveOptSize

Size      MaxStoresPerMemcpyOptSize
67964194  1
68009250  2
68029730  3
68037922  4

(4 is the default)

So it seems setting it to 1 would be a size win.

But there might be more going on here. Consider the following example:

#include <string.h>
void f(char *dst, const char *src) {
  memcpy(dst, src, 9);
}

At -O2 we will lower it with loads and stores:
00000000 <f>:
   0:   8b 44 24 04             mov    0x4(%esp),%eax
   4:   8b 4c 24 08             mov    0x8(%esp),%ecx
   8:   8a 51 08                mov    0x8(%ecx),%dl
   b:   88 50 08                mov    %dl,0x8(%eax)
   e:   f2 0f 10 01             movsd  (%ecx),%xmm0
  12:   f2 0f 11 00             movsd  %xmm0,(%eax)
  16:   c3                      ret

At -Os, with the threshold set to 1, we make the call, and save one byte:

00000000 <f>:
   0:   83 ec 10                sub    $0x10,%esp
   3:   6a 09                   push   $0x9
   5:   ff 74 24 1c             pushl  0x1c(%esp)
   9:   ff 74 24 1c             pushl  0x1c(%esp)
   d:   e8 fc ff ff ff          call   e <f+0xe>
  12:   83 c4 1c                add    $0x1c,%esp
  15:   c3                      ret

However, if dst and src in the source were "int" instead of "char", we'd use rep;movs to copy two 4-byte words, and do one 1-byte move afterwards, causing much larger code:

00000000 <f>:
   0:   57                      push   %edi
   1:   56                      push   %esi
   2:   8b 44 24 0c             mov    0xc(%esp),%eax
   6:   8b 54 24 10             mov    0x10(%esp),%edx
   a:   b9 02 00 00 00          mov    $0x2,%ecx
   f:   89 c7                   mov    %eax,%edi
  11:   89 d6                   mov    %edx,%esi
  13:   f3 a5                   rep movsl %ds:(%esi),%es:(%edi)
  15:   8a 4a 08                mov    0x8(%edx),%cl
  18:   88 48 08                mov    %cl,0x8(%eax)
  1b:   5e                      pop    %esi
  1c:   5f                      pop    %edi
  1d:   c3                      ret

The rep;movs lowering comes from X86SelectionDAGInfo::EmitTargetCodeForMemcpy. If we were to just 1-byte rep;movs instead, the code would be:

00000000 <f>:
   0:   57                      push   %edi
   1:   56                      push   %esi
   2:   8b 7c 24 0c             mov    0xc(%esp),%edi
   6:   8b 74 24 10             mov    0x10(%esp),%esi
   a:   b9 09 00 00 00          mov    $0x9,%ecx
   f:   f3 a4                   rep movsb %ds:(%esi),%es:(%edi)
  11:   5e                      pop    %esi
  12:   5f                      pop    %edi
  13:   c3                      ret

Which is the smallest yet.

Maybe at -Os we should try to avoid ending up with remaining bytes after rep;movs and instead work with a smaller type, or just make the library call. Perhaps doing this would be better than trying to lower MaxStoresPerMemcpyOptSize.




(Mostly notes to self)
CMake invocation:
$ CC=/work/llvm/build.release/bin/clang CXX=/work/llvm/build.release/bin/clang++ CFLAGS="-m32" CXXFLAGS="-m32" cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS_RELEASE="-Os" -DCMAKE_CXX_FLAGS_RELEASE="-Os" -GNinja ..

Needed to apply this patch to avoid crashing:

--- a/lib/Target/X86/X86SelectionDAGInfo.cpp
+++ b/lib/Target/X86/X86SelectionDAGInfo.cpp
@@ -241,6 +241,8 @@ SDValue X86SelectionDAGInfo::EmitTargetCodeForMemcpy(
 
   unsigned UBytes = AVT.getSizeInBits() / 8;
   unsigned CountVal = SizeVal / UBytes;
+  if (!CountVal)
+    return SDValue();
   SDValue Count = DAG.getIntPtrConstant(CountVal, dl);
   unsigned BytesLeft = SizeVal % UBytes;

Comment 2 Hans Wennborg 2016-04-22 17:43:47 PDT

Some measurements on a 32-bit Chrome build on Windows:

Threshold   Size (chrome_child.dll)
4           50,785,280
2           50,761,728
1           50,719,232