Consider: #include <tr1/functional> #include <algorithm> void assign( long* variable, long v) { std::transform( variable, variable + 1, variable, std::tr1::bind( std::plus< long >(), 0L, v ) ); } This compiles to a single store on x86, but a whole ton of code on x86-64. This is because the temporary structs are larger on x86-64, so EmitAggregateCopy in llvm-gcc emits them as a memcpy instead of scalar transfers. The problem is that this later blocks scalarrepl from promoting the structs, causing much worse codegen: __Z6assignRll: # x86-32 movl 8(%esp), %eax movl 4(%esp), %ecx movl %eax, (%ecx) ret __Z6assignRll: # x86-64 subq $88, %rsp movb $0, 64(%rsp) movq $0, 72(%rsp) movq %rsi, 80(%rsp) movq %rsi, 48(%rsp) movq 72(%rsp), %rax movq %rax, 40(%rsp) movq 64(%rsp), %rax movq %rax, 32(%rsp) movq 40(%rsp), %rax movq %rax, 8(%rsp) movq 48(%rsp), %rax movq %rax, 16(%rsp) movq 32(%rsp), %rax movq %rax, (%rsp) movq 16(%rsp), %rax addq 8(%rsp), %rax movq %rax, (%rdi) addq $88, %rsp ret -Chris
Repro with: $ llvm-g++ t.cc -O3 -S -o - -fno-exceptions -fomit-frame-pointer -m64
Here's a reduced testcase: #include <string.h> struct foo { int A, B; }; int test(struct foo *P) { struct foo L; memcpy(&L, P, sizeof(struct foo)); return L.A; } %struct.foo = type { i32, i32 } implementation ; Functions: define i32 @test(%struct.foo* %P) { entry: %L = alloca %struct.foo, align 8 ; <%struct.foo*> [#uses=2] %L2 = bitcast %struct.foo* %L to i8* ; <i8*> [#uses=1] %tmp13 = bitcast %struct.foo* %P to i8* ; <i8*> [#uses=1] call void @llvm.memcpy.i32( i8* %L2, i8* %tmp13, i32 8, i32 4 ) %tmp4 = getelementptr %struct.foo* %L, i32 0, i32 0 ; <i32*> [#uses=1] %tmp5 = load i32* %tmp4 ; <i32> [#uses=1] ret i32 %tmp5 }
This patch contains the (disabled) code to do the SROA. Before this can be enabled, mem2reg needs to be able to promote scalars targetted by memcpy/memset etc. http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20070305/045540.html -Chris
Second half committed here: http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20070305/045718.html Third 'half' will come later.
For my notes: The main part of this is implemented, but the testcase in this bug is not yet implemented. The reason for this is that the lowered structure starts out with an array of bytes, so we get a memcpy ("gep x, 0, 0") instead of memcpy(bitcast). Two issues: 1. Make xform work with GEP. 2. Figure out why we're getting the array of bytes in this trivial case. Another testcase: #include <string.h> struct foo { int A, B; }; struct bar{ struct foo x; long long y; double D; }; int test1(struct foo *P) { struct foo L; memcpy(&L, P, sizeof(struct foo)); return L.A; } int test2() { struct foo L[4]; memset(L, 0, sizeof(struct foo)*4); return L[0].A; } int test3() { struct bar B; memset(&B, 1, sizeof(B)); B.x.A = 1; B.D = 10.0; return B.x.B; } -Chris
Here is the final piece: http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20070312/045911.html Testcases here: Transforms/ScalarRepl/memset-aggregate.ll Transforms/ScalarRepl/memset-aggregate-byte-leader.ll -Chris