Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

excessive stack usage for FORTIFY_SOURCE building the kernel #45147

Closed
arndb mannequin opened this issue May 5, 2020 · 10 comments
Closed

excessive stack usage for FORTIFY_SOURCE building the kernel #45147

arndb mannequin opened this issue May 5, 2020 · 10 comments
Assignees
Labels
bugzilla Issues migrated from bugzilla clang Clang issues not falling into any other category

Comments

@arndb
Copy link
Mannequin

arndb mannequin commented May 5, 2020

Bugzilla Link 45802
Resolution FIXED
Resolved on May 17, 2020 13:19
Version 10.0
OS Linux
Blocks #44654
CC @efriedma-quic,@gburgessiv,@zx2c4,@m-gupta,@nathanchance,@nickdesaulniers,@zygoloid,@serge-sans-paille,@tstellar
Fixed by commit(s) 2dd17ff 9490808 3ab301b eaae6df
@arndb
Copy link
Mannequin Author

arndb mannequin commented May 5, 2020

assigned to @efriedma-quic

@arndb
Copy link
Mannequin Author

arndb mannequin commented May 5, 2020

I ran into a warning about exceeding the 2048 byte per-function stack limit on a kernel build with clang-10, and produced a reduced test case:

int m;
typedef unsigned long size_t;
void *c(void *dest, const void *src, size_t n) asm("memcpy");
extern inline attribute((gnu_inline)) void *memcpy(void *dest, const void *src, size_t n) { return c(dest, src, n); }
void d();
static void ad(unsigned long long *a, unsigned long long *b) {
long e[] = {0, a[1], a[2], a[3], a[4]};
long k = a[1];
__uint128_t f[5];
__uint128_t *g = f;
unsigned long long *h = b;
g += g[1] = (unsigned long)e * b[0];
g += h[3] += b[0];
g[4] = a[4];
g += a[1];
g[1] += k;
g += h[13] += b[1];
g += 4 * b[1];
d();
long l = *e;
g[1] = (__uint128_t)l * b[2];
g += b[2];
g[1] += (unsigned long)e * b[3];
}
static void j(unsigned long long *a, unsigned long long *b) { a[5] = b[5]; }
void ap() {
long ai[15] = {};
unsigned long long am[40];
unsigned long long *ag = am + 10;
int i;
while (i--)
while (m) {
j(am + 5, ag + 5);
unsigned long long *o = am + 5, *n = ag + 5;
unsigned long long ai[40];
ad(ag, o);
ad(am, n);
unsigned long long *aj = ai + 5;
d(aj);
}
memcpy(ai, am, *am);
}

$ clang-10 -Wframe-larger-than=1 -O2 -c curve25519-hacl64-test.c
curve25519-hacl64-test.c:26:6: warning: stack frame size of 1368 bytes in function 'ap' [-Wframe-larger-than=]

The main culprit here seems to be related to the way that the kernel overrides the memcpy() function when CONFIG_FORTIFY_SOURCE is set, intentionally preventing the memcpy() from being inlined. However, with clang-9 or clang-11, the result is as expected, so this also seems to be a clang-10 specific regression:

$ clang-9 -Wframe-larger-than=1 -O2 -c curve25519-hacl64-test.c
curve25519-hacl64-test.c:26:6: warning: stack frame size of 344 bytes in function 'ap' [-Wframe-larger-than=]

$ clang-11 -Wframe-larger-than=1 -O2 -c curve25519-hacl64-test.c
curve25519-hacl64-test.c:26:6: warning: stack frame size of 344 bytes in function 'ap' [-Wframe-larger-than=]

See also https://godbolt.org/z/Qicn0J for the output

@zx2c4
Copy link
Mannequin

zx2c4 mannequin commented May 5, 2020

clang-10 appears to have a broken optimization stage that doesn't enable the compiler to prove at compile time that certain memcpys are within bounds, and thus the outline memcpy is always called, resulting in horrific performance, and in some cases, excessive stack frame growth.

Here's a simple reproducer:

typedef unsigned long size_t;
void *c(void *dest, const void *src, size_t n) __asm__("memcpy");
extern inline __attribute__((gnu_inline)) void *memcpy(void *dest, const void *src, size_t n) { return c(dest, src, n); }
void blah(char *a)
{
  unsigned long long b[10], c[10];
  int i;

  memcpy(b, a, sizeof(b));
  for (i = 0; i < 10; ++i)
    c[i] = b[i] ^ b[9 - i];
  for (i = 0; i < 10; ++i)
    b[i] = c[i] ^ a[i];
  memcpy(a, b, sizeof(b));
}

Compile this with clang-9 and clang-10 and observe:

zx2c4@thinkpad /tmp/curve25519-hacl64-stack-frame-size-test $ clang-10 -Wframe-larger-than=0 -O3 -c b.c -o c10.o
b.c:5:6: warning: stack frame size of 104 bytes in function 'blah' [-Wframe-larger-than=]
void blah(char *a)
^
1 warning generated.
zx2c4@thinkpad /tmp/curve25519-hacl64-stack-frame-size-test $ clang-9 -Wframe-larger-than=0 -O3 -c b.c -o c9.o

Looking at the disassembly of c10.o and c9.o, one can see that c9.o is properly optimized in the obvious way one would expect, while c10.o has blown up and includes extern calls to memcpy.

@m-gupta
Copy link
Contributor

m-gupta commented May 6, 2020

@nathanchance
Copy link
Member

Tom, could those two patches be added to 10.0.1?

@serge-sans-paille
Copy link
Collaborator

For the record, in clang 9, the inline definition of memcpy was ignored by clang, as showcased by this example: https://godbolt.org/z/8D4iDJ

Since clang-10, clang honors the inline definition, as illustrated by: https://godbolt.org/z/dmunLs.

In the reduced test case proposed, clang-10 does not nonor the inline definition, because it's trivially recursive, and falls back to the memcpy intrinsic.

I don't know if that's the intended behavior for the kernel, just wanted to provide more details.

@tstellar
Copy link
Collaborator

tstellar commented May 7, 2020

Eli, you reviewed this patch initially, any objections to merging this along with the follow up fix?

https://reviews.llvm.org/rG2dd17ff08165e6118e70f00e22b2c36d2d4e0a9a
https://reviews.llvm.org/rG94908088a831141cfbdd15fc5837dccf30cfeeb6

@efriedma-quic
Copy link
Collaborator

LGTM

@tstellar
Copy link
Collaborator

tstellar commented May 7, 2020

Merged: eaae6df

@arndb
Copy link
Mannequin Author

arndb mannequin commented May 17, 2020

I've updated my compiler to the fixed version now, but it seems to run into a related problem in a few other files in the kernel with the new version. This (manually) reduced test case triggers a warning about large stack usage with the fortified memcpy() with clang-10 and clang-11 but not clang-9:

https://godbolt.org/z/E5_4rS

$ clang-10 --target=arm-linux-gnueabi -c br_multicast.c -O2 -
Wframe-larger-than=1
br_multicast.c:65:6: warning: stack frame size of 632 bytes in function 'br_multicast_get_stats' [-Wframe-larger-than=]

typedef unsigned int size_t;
typedef unsigned long __kernel_size_t;
typedef unsigned long long __u64;

extern void attribute((noreturn)) fortify_panic(const char *name);
extern void __write_overflow(void);
extern void __read_overflow2(void);

extern inline attribute((gnu_inline))
void *memcpy(void *p, const void *q, size_t size)
{
size_t p_size = __builtin_object_size(p, 0);
size_t q_size = __builtin_object_size(q, 0);
if (__builtin_constant_p(size)) {
if (p_size < size)
__write_overflow();
if (q_size < size)
__read_overflow2();
}
if (p_size < size || q_size < size)
fortify_panic(func);
return __builtin_memcpy(p, q, size);
}

struct br_mcast_stats {
__u64 igmp_v1queries[2]; __u64 igmp_v2queries[2]; __u64 igmp_v3queries[2];
__u64 igmp_leaves[2]; __u64 igmp_v1reports[2]; __u64 igmp_v2reports[2];
__u64 igmp_v3reports[2]; __u64 igmp_parse_errors;

    __u64 mld_v1queries[2]; __u64 mld_v2queries[2]; __u64 mld_leaves[2];
    __u64 mld_v1reports[2]; __u64 mld_v2reports[2]; __u64 mld_parse_errors;

};

struct bridge_mcast_stats {
struct br_mcast_stats mstats;
};

#define barrier() asm volatile("" : : : "memory")
#define cpu_relax() barrier()
#define smp_rmb(option) asm volatile ("" : : : "memory")
#define READ_ONCE(x) ({ typeof(x) __val = (volatile typeof(x))&(x); __val; })

typedef struct seqcount {
unsigned sequence;
} seqcount_t;

static inline unsigned read_seqcount_begin(const seqcount_t *s)
{
unsigned ret;
repeat:
ret = READ_ONCE(s->sequence);
if (ret & 1) { cpu_relax(); goto repeat; }
smp_rmb();
return ret;
}

static inline int read_seqcount_retry(const seqcount_t *s, unsigned start)
{ smp_rmb(); return s->sequence != start; }

extern void *per_cpu_ptr(void *ptr, unsigned int cpu);

static void mcast_stats_add_dir(__u64 *dst, __u64 *src)
{ dst[0] += src[0]; dst[1] += src[1]; }

void br_multicast_get_stats(struct bridge_mcast_stats *stats,
struct br_mcast_stats *dest, seqcount_t *seq)
{
struct br_mcast_stats tdst;
int i;

    __builtin_memset(dest, 0, sizeof(*dest));
    __builtin_memset(&tdst, 0, sizeof(tdst));
    for (i=0; i<4; i++) {
            struct bridge_mcast_stats *cpu_stats = per_cpu_ptr(stats, i);
            struct br_mcast_stats temp;
            unsigned int start;

            do {
                    start = read_seqcount_begin(seq);
                    memcpy(&temp, &cpu_stats->mstats, sizeof(temp));
            } while (read_seqcount_retry(seq, start));

            mcast_stats_add_dir(tdst.igmp_v1queries, temp.igmp_v1queries);
            mcast_stats_add_dir(tdst.igmp_v2queries, temp.igmp_v2queries);
            mcast_stats_add_dir(tdst.igmp_v3queries, temp.igmp_v3queries);
            mcast_stats_add_dir(tdst.igmp_leaves, temp.igmp_leaves);
            mcast_stats_add_dir(tdst.igmp_v1reports, temp.igmp_v1reports);
            mcast_stats_add_dir(tdst.igmp_v2reports, temp.igmp_v2reports);
            mcast_stats_add_dir(tdst.igmp_v3reports, temp.igmp_v3reports);
            tdst.igmp_parse_errors += temp.igmp_parse_errors;

            mcast_stats_add_dir(tdst.mld_v1queries, temp.mld_v1queries);
            mcast_stats_add_dir(tdst.mld_v2queries, temp.mld_v2queries);
            mcast_stats_add_dir(tdst.mld_leaves, temp.mld_leaves);
            mcast_stats_add_dir(tdst.mld_v1reports, temp.mld_v1reports);
            mcast_stats_add_dir(tdst.mld_v2reports, temp.mld_v2reports);
            tdst.mld_parse_errors += temp.mld_parse_errors;
    }
    memcpy(dest, &tdst, sizeof(*dest));

}

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugzilla Issues migrated from bugzilla clang Clang issues not falling into any other category
Projects
None yet
Development

No branches or pull requests

5 participants