-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8ec7ea3ddce7379e13e8dfb4a5260a6d2004aa1c causes more stack usage on PowerPC #48954
Comments
We're compiling for a (relatively) legacy system (no "stdbrx" - https://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20130325/169706.html )? Let's reduce to the simplest load/swap/store 64-bit sequence: define void @bs(i64* %p) { $ llc -o - bswap.ll -mtriple=powerpc64 There might be a better trick, but we could at least split this into 2 lwz + 2 stwbrx? |
Two other alternatives would be the following... Single load, two swapping stores: Two swapping loads, merge, single store: |
Yes - that would reduce the scope of the problem. We just need to split one of these basic patterns, and that hopefully solves the motivating example: define void @split_store(i64 %x, i64* %p) { define i64 @split_load(i64* %p) { ...but we may want to do both to cover all bases. cc'ing Craig and Simon - does this look like something that belongs in generic combiner or legalization (other targets might have the same issue)? |
FWIW, I think it makes sense for this to be in the generic combiner. If BSWAP is legal for i32 but not for i64, we can emit 2 x (extract_element + bswap) + build_pair Presumably the combiner will already do the right thing when the input is a load and/or the only user is a store. |
Yes I think so - IIRC we do similar things in DAGCombine load/store combining already. |
I was hoping so, but it didn't seem to work. I just hacked this into DAGCombiner::visitBSWAP(): if (VT == MVT::i64) { And neither the load nor store patterns appear to get merged with extract/build_pair with powerpc64 target, so we end up with the same asm. |
Bumping, we're probably going to have to spin down our CI coverage of PPC over this. :( |
I wouldn't expect that DAG combine hack to give the same asm. The original asm had an expanded 64-bit swap. I assume that DAG combine would give two expanded 32 bit swaps. Assuming the BUILD_PAIR becomes a shift/or sequence, there is a combine for split stores, but it requires isMultiStoresCheaperThanBitsMerge to return true. |
Yes, you're right. I saw the string of rotate opcodes, but they're not the same instructions.
I think we would need to change DAGCombiner::splitMergedValStore() to recognize a BUILD_PAIR input. Otherwise, the smaller bswaps are going to get expanded before we split the store. I'm seeing a lot of unexpected diffs across several targets with that change hacked in. It seems like there are gaps in BUILD_PAIR and/or EXTRACT_ELEMENT lowering because I see crashes too. This is going to take multiple changes and at least some of those are specific to PowerPC lowering, so I think we should view this as a PowerPC-specific bug and then see if it is worth generalizing. Nemanja, do you want to take this? |
Nemanja, this came up again on LKML: |
OK, please review the fix in https://reviews.llvm.org/D104836 Sorry for the delay. I realize that the target-independent combine doesn't work because we won't have the bswap(load) or store(bswap) idiom due to the intervening EXTRACT_VALUE and BUILD_PAIR respectively. |
The fix is actually in two commits since I neglected to account for big endian targets in the initial commit. Initial commit: Follow-up fix for BE: |
Ugh, unfortunately this caused failures in the bootstrap build on BE Linux and I had to disable it temporarily. I have now fixed it in yet another follow-up commit. So this also needs the following two commits: https://reviews.llvm.org/rG4e22c7265d86 - disables the change |
Extended Description
Forgive me if I am filing this in the wrong section or tagging the wrong people, this is just based on my investigation and triage.
After 8ec7ea3, there was a balloon in stack usage in a particular function in the PowerPC KVM section of the Linux kernel, which causes a build error because of -Werror.
Prior to 8ec7ea3 (with Linux at 5.12-rc3 for https://git.kernel.org/linus/97e4910232fa1f81e806aa60c25a0450276d99a2):
legacy pass manager
$ make -skj"$(nproc)" ARCH=powerpc CC=clang CROSS_COMPILE=powerpc64-linux-gnu- KCFLAGS="-fno-experimental-new-pass-manager -Wframe-larger-than=1024" O=build/ppc64 distclean pseries_defconfig disable-werror.config arch/powerpc/kvm/book3s_hv_nested.o
arch/powerpc/kvm/book3s_hv_nested.c:264:6: warning: stack frame size of 1728 bytes in function 'kvmhv_enter_nested_guest' [-Wframe-larger-than=]
long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
^
1 warning generated.
new pass manager
$ make -skj"$(nproc)" ARCH=powerpc CC=clang CROSS_COMPILE=powerpc64-linux-gnu- KCFLAGS="-fexperimental-new-pass-manager -Wframe-larger-than=1024" O=build/ppc64 distclean pseries_defconfig disable-werror.config arch/powerpc/kvm/book3s_hv_nested.o
arch/powerpc/kvm/book3s_hv_nested.c:264:6: warning: stack frame size of 1712 bytes in function 'kvmhv_enter_nested_guest' [-Wframe-larger-than=]
long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
^
1 warning generated.
After 8ec7ea3:
legacy pass manager
$ make -skj"$(nproc)" ARCH=powerpc CC=clang CROSS_COMPILE=powerpc64-linux-gnu- KCFLAGS="-fno-experimental-new-pass-manager -Wframe-larger-than=1024" O=build/ppc64 distclean pseries_defconfig disable-werror.config arch/powerpc/kvm/book3s_hv_nested.o
arch/powerpc/kvm/book3s_hv_nested.c:264:6: warning: stack frame size of 2480 bytes in function 'kvmhv_enter_nested_guest' [-Wframe-larger-than=]
long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
^
1 warning generated.
new pass manager
$ make -skj"$(nproc)" ARCH=powerpc CC=clang CROSS_COMPILE=powerpc64-linux-gnu- KCFLAGS="-fexperimental-new-pass-manager -Wframe-larger-than=1024" O=build/ppc64 distclean pseries_defconfig disable-werror.config arch/powerpc/kvm/book3s_hv_nested.o
arch/powerpc/kvm/book3s_hv_nested.c:264:6: warning: stack frame size of 2048 bytes in function 'kvmhv_enter_nested_guest' [-Wframe-larger-than=]
long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
^
1 warning generated.
For reference, GCC 10.2.0:
$ make -skj"$(nproc)" ARCH=powerpc ARCH=powerpc CROSS_COMPILE=powerpc64-linux- KCFLAGS=-Wframe-larger-than=1024 O=build/ppc64 distclean pseries_defconfig disable-werror.config arch/powerpc/kvm/book3s_hv_nested.o
arch/powerpc/kvm/book3s_hv_nested.c: In function 'kvmhv_enter_nested_guest':
arch/powerpc/kvm/book3s_hv_nested.c:387:1: warning: the frame size of 1280 bytes is larger than 1024 bytes [-Wframe-larger-than=]
387 | }
| ^
I tried to reduce this down with cvise and came up with:
struct hv_guest_state {
int vcpu_token;
long pcr;
long amor;
long dpdes;
long hfscr;
long tb_offset;
long srr0;
long srr1;
long sprg[4];
long pidr;
long cfar;
long ppr;
long dawr1;
long dawrx1;
} kvmhv_write_guest_state_and_regs_l2_hv;
struct pt_regs {
struct {
struct {
long gpr[32];
long nip;
long msr;
long orig_gpr3;
long ctr;
long link;
long xer;
long ccr;
long softe;
long trap;
long dar;
long dsisr;
long result;
};
long __pad[4];
};
} __srcu_read_unlock();
struct kvm_vcpu_arch {
struct pt_regs regs;
struct kvmppc_vcore *vcore;
int trap;
};
struct kvmppc_vcore {
long tb_offset;
long pcr;
};
struct kvm_nested_guest {
long l1_gr_to_hr;
};
int kvm_vcpu_read_guest();
static void kvmhv_update_ptbl_cache(struct kvm_nested_guest *gp) {
struct kvm_nested_guest __trans_tmp_14 = *gp;
__srcu_read_unlock(__trans_tmp_14);
}
void byteswap_pt_regs(struct pt_regs *regs) {
unsigned long *addr = (long *)regs;
for (; addr < (unsigned long *)(regs + 1); addr++)
*addr = __builtin_bswap64(*addr);
}
int kvmhv_read_guest_state_and_regs(struct hv_guest_state *l2_hv,
struct pt_regs *l2_regs) {
return kvm_vcpu_read_guest(l2_hv) || kvm_vcpu_read_guest(l2_regs);
}
long kvmhv_enter_nested_guest(struct kvm_vcpu_arch *vcpu) {
struct kvm_nested_guest *l2;
struct pt_regs l2_regs, saved_l1_regs;
struct hv_guest_state l2_hv;
struct kvmppc_vcore *vc = vcpu->vcore;
long regs_ptr = kvmhv_read_guest_state_and_regs(&l2_hv, &l2_regs);
byteswap_pt_regs(&l2_regs);
if (!l2->l1_gr_to_hr)
kvmhv_update_ptbl_cache(l2);
saved_l1_regs = vcpu->regs;
vc->tb_offset += l2_hv.tb_offset;
vcpu->regs = saved_l1_regs;
vc = vcpu->vcore;
vc->pcr = 0;
struct hv_guest_state *hr = 0;
hr->pidr = hr->dawr1 = __builtin_bswap64(hr->dawr1);
hr->dawrx1 = __builtin_bswap64(hr->dawrx1);
byteswap_pt_regs(&l2_regs);
kvm_vcpu_read_guest(kvmhv_write_guest_state_and_regs_l2_hv, regs_ptr);
return vcpu->trap;
}
which produces the same result: https://godbolt.org/z/754Men.
Prior to 8ec7ea3
LPM
$ clang -fno-experimental-new-pass-manager --target=powerpc64-linux-gnu -O2 -Wno-pointer-sign -Wframe-larger-than=512 -c -o /dev/null book3s_hv_nested.i
book3s_hv_nested.i:63:6: warning: stack frame size of 1136 bytes in function 'kvmhv_enter_nested_guest' [-Wframe-larger-than=]
long kvmhv_enter_nested_guest(struct kvm_vcpu_arch *vcpu) {
^
1 warning generated.
NPM
$ clang -fexperimental-new-pass-manager --target=powerpc64-linux-gnu -O2 -Wno-pointer-sign -Wframe-larger-than=512 -c -o /dev/null book3s_hv_nested.i
book3s_hv_nested.i:63:6: warning: stack frame size of 640 bytes in function 'kvmhv_enter_nested_guest' [-Wframe-larger-than=]
long kvmhv_enter_nested_guest(struct kvm_vcpu_arch *vcpu) {
^
1 warning generated.
After 8ec7ea3:
LPM
$ clang -fno-experimental-new-pass-manager --target=powerpc64-linux-gnu -O2 -Wno-pointer-sign -Wframe-larger-than=512 -c -o /dev/null book3s_hv_nested.i
book3s_hv_nested.i:54:6: warning: stack frame size of 944 bytes in function 'byteswap_pt_regs' [-Wframe-larger-than=]
void byteswap_pt_regs(struct pt_regs *regs) {
^
book3s_hv_nested.i:63:6: warning: stack frame size of 2064 bytes in function 'kvmhv_enter_nested_guest' [-Wframe-larger-than=]
long kvmhv_enter_nested_guest(struct kvm_vcpu_arch *vcpu) {
^
2 warnings generated.
NPM
$ clang -fexperimental-new-pass-manager --target=powerpc64-linux-gnu -O2 -Wno-pointer-sign -Wframe-larger-than=512 -c -o /dev/null book3s_hv_nested.i
book3s_hv_nested.i:54:6: warning: stack frame size of 944 bytes in function 'byteswap_pt_regs' [-Wframe-larger-than=]
void byteswap_pt_regs(struct pt_regs *regs) {
^
book3s_hv_nested.i:63:6: warning: stack frame size of 1520 bytes in function 'kvmhv_enter_nested_guest' [-Wframe-larger-than=]
long kvmhv_enter_nested_guest(struct kvm_vcpu_arch *vcpu) {
^
2 warnings generated.
Arnd Bergmann points out that this is most likely because PowerPC does not appear to be using optimal assembly for byte swapping (or at least would help), which I believe should be visible from the Godbolt link above.
The text was updated successfully, but these errors were encountered: