LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filled before November 26, 2021. Use github to submit LLVM bugs

Bug 42551 - excessive stack usage compiling linux amdgpu kernel driver
Summary: excessive stack usage compiling linux amdgpu kernel driver
Status: NEW
Alias: None
Product: clang
Classification: Unclassified
Component: -New Bugs (show other bugs)
Version: trunk
Hardware: PC Linux
: P enhancement
Assignee: Unassigned Clang Bugs
URL:
Keywords:
Depends on:
Blocks: 4068
  Show dependency tree
 
Reported: 2019-07-09 01:27 PDT by Arnd Bergmann
Modified: 2019-07-09 09:24 PDT (History)
8 users (show)

See Also:
Fixed By Commit(s):


Attachments
preprocessed and partially reduced file (49.01 KB, text/x-csrc)
2019-07-09 01:27 PDT, Arnd Bergmann
Details
original source file, preprocessed and compressed (254.18 KB, application/gzip)
2019-07-09 01:30 PDT, Arnd Bergmann
Details
kernel patch to avoid passing structures by value (7.44 KB, patch)
2019-07-09 06:58 PDT, Arnd Bergmann
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Arnd Bergmann 2019-07-09 01:27:31 PDT
Created attachment 22208 [details]
preprocessed and partially reduced file

One file in the linux kernel appears to trigger a failed optimization that leads to large stack usage. Compiling a 32-bit ARM defconfig with the amdgpu driver enabled, I get

drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dce_calcs.c:2987:6: error: stack frame size of 1344 bytes in function 'bw_calcs' [-Werror,-Wframe-larger-than=]
bool bw_calcs(struct dc_context *ctx,
     ^
drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dce_calcs.c:76:13: error: stack frame size of 5328 bytes in function 'calculate_bandwidth' [-Werror,-Wframe-larger-than=]


I managed to partially reduce the preprocessed source file to illustrate the problem better:

https://godbolt.org/z/ZlDp0Z

The stack usage is highly target architecture dependent:

gcc arm-linux-gnueabi:      208 bytes
clang-8 arm-linux-gnueabi: 4144 bytes
clang-9 arm-linux-gnueabi: 4992 bytes
clang-9 aarch64-linux-gnu:  272 bytes
clang-9 powerpc64:         4272 bytes
clang-9 powerpc32:         4112 bytes
clang-9 s390-32:           4168 bytes
clang-9 s390-64:           4168 bytes
clang-9 sparc32:          10272 bytes
clang-9 sparc64:            432 bytes

$ clang-9 -Wframe-larger-than=10 --target=arm-linux-gnu -O2 -fno-strict-overflow -S dce-calcs-clang-noinline.i   -m32 -Wno-unused-value -Wno-logical-op-parentheses -Wno-return-type -Wno-implicit-int
dce-calcs-clang.i:275:4: warning: stack frame size of 4944 bytes in function 'calculate_bandwidth' [-Wframe-larger-than=]
   calculate_bandwidth(struct bw_calcs_dceip *dceip, struct bw_calcs_vbios *vbios,                     struct bw_calcs_data *data) {
   ^

Note that we build 32-bit kernels with -Wframe-larger-than=1024 because of the highly limited available stack space, and using 5KB of stack is likely to result in a kernel crash.
Comment 1 Arnd Bergmann 2019-07-09 01:30:04 PDT
Created attachment 22209 [details]
original source file, preprocessed and compressed
Comment 2 Arnd Bergmann 2019-07-09 06:58:47 PDT
Created attachment 22215 [details]
kernel patch to avoid passing structures by value

I experimentally tried to change all external functions related to this file that pass around 'struct bw_fixed' objects to instead pass plain 64-bit integers. On 32-bit arm, this improves the stack usage from 5000 bytes to around 1000 bytes, but some of that comes back once we turn on the address sanitizer, so it's still beyond the warning limit.

On i386, the patch appears to have no effect.
Comment 3 Arnd Bergmann 2019-07-09 07:12:06 PDT
Got a better reduced test case, see https://godbolt.org/z/z5bVKS

struct bw_fixed { long long value; };

struct bw_fixed bw_min2(struct bw_fixed, struct bw_fixed);
struct bw_fixed bw_max2(struct bw_fixed, struct bw_fixed);
struct bw_fixed bw_mul(struct bw_fixed, struct bw_fixed);

static inline struct bw_fixed bw_max3(struct bw_fixed v1, struct bw_fixed v2, struct bw_fixed v3) {
    return bw_max2(bw_max2(v1, v2), v3);
}

struct bw_fixed bw_int_to_fixed(long long value);

int f(struct bw_fixed _a, struct bw_fixed _b, struct bw_fixed _c)
{
    struct bw_fixed a=_a, b=_b, c=_c;
    struct bw_fixed a1=_a, b1=_b, c1=_c;
    struct bw_fixed a2=_a, b2=_b, c2=_c;

    a1 = bw_mul(bw_int_to_fixed(a.value), bw_int_to_fixed(3));
    b1 = bw_max3(a1, bw_mul(a2, a1), bw_mul(b2, c1));
    c1 = bw_max3(a2, bw_int_to_fixed(3), bw_int_to_fixed(2));

    return bw_max3(a1, b1, c1).value;
}

This just combines a couple of random operations from the original
kernel file. What can be seen here is that clang on ARM allocates
a stack slot for each temporary value resulting from calling 
bw_int_to_fixed() or bw_max(), while it doesn't do that on x86-64,
and gcc never does it.