I stumbled over a file with excessive stack usage in the Linux kernel when building with clang, but not with gcc. I reduced the test case to:

struct i2c_adapter {
    char name[48];
    int a[1000];
} hexium_attach_hexium;

void hexium_attach(void)
{
    hexium_attach_hexium = (struct i2c_adapter){"hexium gemini"};
}

$ clang-8 -Werror -c hexium-gemini.c -Wframe-larger-than=100 -O2
hexium-gemini.c:5:6: error: stack frame size of 4056 bytes in function 'hexium_attach' [-Werror,-Wframe-larger-than=]

Comparing the output shows that clang not only allocates stack space for the extra copy of the structure, but also fails to optimize the string assignment: https://godbolt.org/z/7jpy_y

I worked around it in the kernel sources by using an explicit memset and strscpy, but this seems like a generally useful optimization that could be added.
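For reference, the workaround pattern can be sketched in portable C as follows. strscpy is kernel-internal, so this sketch substitutes a bounded strncpy; the point is only the shape of the fix, zeroing the object in place and copying the string directly, rather than materializing a full-struct compound literal on the stack:

```c
#include <string.h>

struct i2c_adapter {
    char name[48];
    int a[1000];
};

static struct i2c_adapter hexium_attach_hexium;

/* Workaround sketch: zero the whole global first, then copy the string
 * into the name member, so no 4KB+ temporary is needed on the stack. */
void hexium_attach(void)
{
    memset(&hexium_attach_hexium, 0, sizeof(hexium_attach_hexium));
    strncpy(hexium_attach_hexium.name, "hexium gemini",
            sizeof(hexium_attach_hexium.name) - 1);
}
```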
After playing around with it a little more, I see that clang avoids the stack allocation and memcpy as well if I initialize any other member besides the string to a nonzero value:

struct i2c_adapter {
    char name[48];
    int a[1000];
} s;

void hexium_attach(void)
{
    s = (struct i2c_adapter){"h", {1}};
}

https://godbolt.org/z/-zgFHU
I think this is a bug very early in the optimization pipeline, within SROA. Comparing Arnd's second example with `1` (result optimized) vs. an explicit `0` (result unoptimized) or an implicit `0` (same, unoptimized):

$ clang -O0 -Xclang -disable-O0-optnone x0.c -emit-llvm -c -g0 -S -o x.0.ll
$ clang -O0 -Xclang -disable-O0-optnone x1.c -emit-llvm -c -g0 -S -o x.1.ll
$ opt -O2 x.0.ll -S -print-after-all 2> x.0.txt >/dev/null
$ opt -O2 x.1.ll -S -print-after-all 2> x.1.txt >/dev/null
$ meld x.0.txt x.1.txt

It looks like after the 4th pass (of many) things start to diverge:

*** IR Dump After SROA ***
; x.0.ll
  %1 = alloca %struct.i2c_adapter, align 4
; x.1.ll
  %.sroa.0 = alloca [48 x i8], align 4
  %.sroa.4 = alloca [999 x i32]

For x.1.ll: SROA is able to split the local allocation into 2 allocas, 1 memset, and 2 memcpys. Later optimizations are able to prune this down to 0 allocas, 1 memset, and 1 memcpy.

For x.0.ll: SROA is not able to split the local allocation in two. After optimizations, we're left with 1 (large) alloca, 1 memset, and 2 memcpys (alloca a local, memset the zeros, memcpy the string, memcpy the local to the global).

So the next thing for me to dig into is: why is SROA not able to correctly split the large alloca in this case?
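At the source level, the transformation SROA enables in the x.1 case is roughly equivalent to hand-splitting the initialization into per-member operations, which later passes can then forward straight into the global. This is an illustrative analogue only, not actual compiler output:

```c
#include <string.h>

struct i2c_adapter {
    char name[48];
    int a[1000];
};

static struct i2c_adapter s;

/* Hand-written analogue of what SROA plus later passes achieve in the
 * optimized (x.1) case: no full-struct temporary, just a memset of the
 * zeroed region and direct stores/copies of the initialized members. */
void hexium_attach_split(void)
{
    memset(&s, 0, sizeof(s));
    memcpy(s.name, "h", 2);   /* string member copied directly */
    s.a[0] = 1;               /* nonzero scalar member stored directly */
}
```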
Created attachment 21934 [details] x.0.txt
Created attachment 21935 [details] x.1.txt
Created attachment 21936 [details] x0.ll
Created attachment 21937 [details] x1.ll
I see Arnd submitted a few more kernel patches for this, as it seems to be a recurring issue.

Playing with the example more, I think another issue is the assignment of char[] members via a string literal. That bug may be earlier, in Clang's codegen of LLVM IR. For example:

struct foo {
    char bar[1000];
} my_foo;

void bad_lazy_init(void)
{
    my_foo = (struct foo){ .bar = "h", };
}

produces a 4000B .asciiz string "h<3999 zeros>", but

    .bar = {104, 0}

produces much more concise code (and LLVM IR). For an AST node like:

StringLiteral 0xde7fde8 <line:7:16> 'char [1000]' "h"

in a compound literal, I wonder if we should try to optimize that in clang codegen, or tackle this strictly in SROA?
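To underline that this is purely a codegen-quality question, the two initializer spellings are interchangeable by C semantics, so the emitted constant data should not need to differ. A quick semantic check (names invented here for illustration; 104 is the ASCII code for 'h'):

```c
#include <string.h>

struct foo { char bar[1000]; };

/* Same struct initialized two ways: via a string literal and via an
 * explicit byte list. C guarantees both fill the remaining elements
 * with zeros, so the two objects are bit-for-bit identical. */
static const struct foo via_string = { .bar = "h" };
static const struct foo via_bytes  = { .bar = {104, 0} };

int foo_inits_equal(void)
{
    return memcmp(&via_string, &via_bytes, sizeof(struct foo)) == 0;
}
```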