Created attachment 22208 [details] preprocessed and partially reduced file One file in the linux kernel appears to trigger a failed optimization that leads to large stack usage. Compiling a 32-bit ARM defconfig with the amdgpu driver enabled, I get drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dce_calcs.c:2987:6: error: stack frame size of 1344 bytes in function 'bw_calcs' [-Werror,-Wframe-larger-than=] bool bw_calcs(struct dc_context *ctx, ^ drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dce_calcs.c:76:13: error: stack frame size of 5328 bytes in function 'calculate_bandwidth' [-Werror,-Wframe-larger-than=] I managed to partially reduce the preprocessed source file to illustrate the problem better: https://godbolt.org/z/ZlDp0Z The stack usage is highly target architecture dependent: gcc arm-linux-gnueabi: 208 bytes clang-8 arm-linux-gnueabi: 4144 bytes clang-9 arm-linux-gnueabi: 4992 bytes clang-9 aarch64-linux-gnu: 272 bytes clang-9 powerpc64: 4272 bytes clang-9 powerpc32: 4112 bytes clang-9 s390-32: 4168 bytes clang-9 s390-64: 4168 bytes clang-9 sparc32: 10272 bytes clang-9 sparc64: 432 bytes $ clang-9 -Wframe-larger-than=10 --target=arm-linux-gnu -O2 -fno-strict-overflow -S dce-calcs-clang-noinline.i -m32 -Wno-unused-value -Wno-logical-op-parentheses -Wno-return-type -Wno-implicit-int dce-calcs-clang.i:275:4: warning: stack frame size of 4944 bytes in function 'calculate_bandwidth' [-Wframe-larger-than=] calculate_bandwidth(struct bw_calcs_dceip *dceip, struct bw_calcs_vbios *vbios, struct bw_calcs_data *data) { ^ Note that we build 32-bit kernels with -Wframe-larger-than=1024 because of the highly limited available stack space, and using 5KB of stack is likely to result in a kernel crash.
Created attachment 22209 [details] original source file, preprocessed and compressed
Created attachment 22215 [details] kernel patch to avoid passing structures by value I experimentally tried to change all external functions related to this file that pass around 'struct bw_fixed' objects to instead pass plain 64-bit integers. On 32-bit arm, this improves the stack usage from 5000 bytes to around 1000 bytes, but some of that comes back once we turn on the address sanitizer, so it's still beyond the warning limit. On i386, the patch appears to have no effect.
Got a better reduced test case, see https://godbolt.org/z/z5bVKS struct bw_fixed { long long value; }; struct bw_fixed bw_min2(struct bw_fixed, struct bw_fixed); struct bw_fixed bw_max2(struct bw_fixed, struct bw_fixed); struct bw_fixed bw_mul(struct bw_fixed, struct bw_fixed); static inline struct bw_fixed bw_max3(struct bw_fixed v1, struct bw_fixed v2, struct bw_fixed v3) { return bw_max2(bw_max2(v1, v2), v3); } struct bw_fixed bw_int_to_fixed(long long value); int f(struct bw_fixed _a, struct bw_fixed _b, struct bw_fixed _c) { struct bw_fixed a=_a, b=_b, c=_c; struct bw_fixed a1=_a, b1=_b, c1=_c; struct bw_fixed a2=_a, b2=_b, c2=_c; a1 = bw_mul(bw_int_to_fixed(a.value), bw_int_to_fixed(3)); b1 = bw_max3(a1, bw_mul(a2, a1), bw_mul(b2, c1)); c1 = bw_max3(a2, bw_int_to_fixed(3), bw_int_to_fixed(2)); return bw_max3(a1, b1, c1).value; } This just combines a couple of random operations from the original kernel file. What can be seen here is that clang on ARM allocates a stack slot for each temporary value resulting from calling bw_int_to_fixed() or bw_max(), while it doesn't do that on x86-64, and gcc never does it.