I stumbled over a file with excessive stack usage in the Linux kernel when building with clang, but not with gcc. I reduced the test case to:

struct i2c_adapter {
    char name[48];
    int a[1000];
} hexium_attach_hexium;

void hexium_attach(void)
{
    hexium_attach_hexium = (struct i2c_adapter){"hexium gemini"};
}

$ clang-8 -Werror -c hexium-gemini.c -Wframe-larger-than=100 -O2
hexium-gemini.c:5:6: error: stack frame size of 4056 bytes in function 'hexium_attach' [-Werror,-Wframe-larger-than=]

Comparing the output shows that clang not only allocates stack space for the extra copy of the structure, but also fails to optimize the string assignment: https://godbolt.org/z/7jpy_y

I worked around it in the kernel sources by using an explicit memset and strscpy, but this seems like a generally useful optimization that could be added.
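For reference, the workaround pattern can be sketched in portable C as follows. strscpy is kernel-internal, so this sketch substitutes a bounded strncpy; the point is only the shape of the fix, zeroing the object in place and copying the string directly, rather than materializing a full-struct compound literal on the stack:

```c
#include <string.h>

struct i2c_adapter {
    char name[48];
    int a[1000];
};

static struct i2c_adapter hexium_attach_hexium;

/* Workaround sketch: zero the whole global first, then copy the string
 * into the name member, so no 4KB+ temporary is needed on the stack. */
void hexium_attach(void)
{
    memset(&hexium_attach_hexium, 0, sizeof(hexium_attach_hexium));
    strncpy(hexium_attach_hexium.name, "hexium gemini",
            sizeof(hexium_attach_hexium.name) - 1);
}
```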
After playing around with it a little more, I see that clang avoids the stack allocation and memcpy as well if I initialize any other member besides the string to a nonzero value:

struct i2c_adapter {
    char name[48];
    int a[1000];
} s;

void hexium_attach(void)
{
    s = (struct i2c_adapter){"h", {1}};
}

https://godbolt.org/z/-zgFHU
I think this is a bug very early in the optimization pipeline, within SROA. Comparing Arnd's second example with `1` (result optimized) vs. an explicit `0` (result unoptimized) or an implicit `0` (same, unoptimized):

$ clang -O0 -Xclang -disable-O0-optnone x0.c -emit-llvm -c -g0 -S -o x.0.ll
$ clang -O0 -Xclang -disable-O0-optnone x1.c -emit-llvm -c -g0 -S -o x.1.ll
$ opt -O2 x.0.ll -S -print-after-all 2> x.0.txt >/dev/null
$ opt -O2 x.1.ll -S -print-after-all 2> x.1.txt >/dev/null
$ meld x.0.txt x.1.txt

It looks like after the 4th pass (of many) things start to diverge:

*** IR Dump After SROA ***
; x.0.ll
  %1 = alloca %struct.i2c_adapter, align 4
; x.1.ll
  %.sroa.0 = alloca [48 x i8], align 4
  %.sroa.4 = alloca [999 x i32]

For x.1.ll: SROA is able to split the local allocation into 2 allocas, 1 memset, and 2 memcpys. Later optimizations are able to prune this down to 0 allocas, 1 memset, and 1 memcpy.

For x.0.ll: SROA is not able to split the local allocation in two. After optimizations, we're left with 1 (large) alloca, 1 memset, and 2 memcpys (alloca a local, memset the zeros, memcpy the string, memcpy the local to the global).

So the next thing for me to dig into is: why is SROA not able to correctly split the large alloca in this case?
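At the source level, the transformation SROA enables in the x.1 case is roughly equivalent to hand-splitting the initialization into per-member operations, which later passes can then forward straight into the global. This is an illustrative analogue only, not actual compiler output:

```c
#include <string.h>

struct i2c_adapter {
    char name[48];
    int a[1000];
};

static struct i2c_adapter s;

/* Hand-written analogue of what SROA plus later passes achieve in the
 * optimized (x.1) case: no full-struct temporary, just a memset of the
 * zeroed region and direct stores/copies of the initialized members. */
void hexium_attach_split(void)
{
    memset(&s, 0, sizeof(s));
    memcpy(s.name, "h", 2);   /* string member copied directly */
    s.a[0] = 1;               /* nonzero scalar member stored directly */
}
```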
Created attachment 21934 [details] x.0.txt
Created attachment 21935 [details] x.1.txt
Created attachment 21936 [details] x0.ll
Created attachment 21937 [details] x1.ll
I see Arnd submitted a few more kernel patches for this, as it seems to be a recurring issue.

Playing with the example more, I think another issue is the assignment of char[] members via a string literal. That bug may be earlier, in Clang's codegen of LLVM IR. For example:

struct foo {
    char bar[1000];
} my_foo;

void bad_lazy_init(void)
{
    my_foo = (struct foo){ .bar = "h", };
}

produces a 4000B .asciiz string "h<3999 zeros>", but

    .bar = {104, 0}

produces much more concise code (and LLVM IR). For an AST node like:

StringLiteral 0xde7fde8 <line:7:16> 'char [1000]' "h"

in a compound literal, I wonder if we should try to optimize that in clang codegen, or tackle this strictly in SROA?
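To underline that this is purely a codegen-quality question, the two initializer spellings are interchangeable by C semantics, so the emitted constant data should not need to differ. A quick semantic check (names invented here for illustration; 104 is the ASCII code for 'h'):

```c
#include <string.h>

struct foo { char bar[1000]; };

/* Same struct initialized two ways: via a string literal and via an
 * explicit byte list. C guarantees both fill the remaining elements
 * with zeros, so the two objects are bit-for-bit identical. */
static const struct foo via_string = { .bar = "h" };
static const struct foo via_bytes  = { .bar = {104, 0} };

int foo_inits_equal(void)
{
    return memcmp(&via_string, &via_bytes, sizeof(struct foo)) == 0;
}
```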