When -g (or just -gline-tables-only) is passed to clang, CFI_INSTRUCTIONs are added to the function prologue by the Prologue/Epilogue insertion pass. These instructions act as scheduling region barriers when the PostRA scheduler is run. This can lead to different schedules for the prologue block, which can further lead to differences in tail merging/duplication. A simple example the illustrates the problem: target triple = "aarch64-linux-gnu" @X1 = global i32 0, align 4 @X2 = global i32 0, align 4 @X3 = global i32 0, align 4 @X4 = global i32 0, align 4 define void @test(i32 %i) #0 { entry: %0 = load i32, i32* @X1, align 4 %x1 = add i32 %0, 1 %x2 = add i32 %0, 2 %x3 = add i32 %0, 3 %x4 = add i32 %0, 4 tail call void @foo() store i32 %x1, i32* @X1, align 4 store i32 %x2, i32* @X2, align 4 store i32 %x3, i32* @X3, align 4 store i32 %x4, i32* @X4, align 4 ret void } declare void @foo() attributes #0 = { nounwind } !llvm.dbg.cu = !{!0} !llvm.module.flags = !{!3, !4, !5} !0 = distinct !DICompileUnit(language: DW_LANG_C99, file: !1, producer: "clang version 7.0.0 (trunk 330790) (llvm/trunk 330787)", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, enums: !2) !1 = !DIFile(filename: "test.c", directory: "") !2 = !{} !3 = !{i32 2, !"Dwarf Version", i32 4} !4 = !{i32 2, !"Debug Info Version", i32 3} !5 = !{i32 1, !"wchar_size", i32 4} when compiled as is (i.e. with debug info), produces the following code: stp x23, x22, [sp, #-48]! // 8-byte Folded Spill stp x21, x20, [sp, #16] // 8-byte Folded Spill stp x19, x30, [sp, #32] // 8-byte Folded Spill .cfi_def_cfa_offset 48 .cfi_offset w30, -8 .cfi_offset w19, -16 .cfi_offset w20, -24 .cfi_offset w21, -32 .cfi_offset w22, -40 .cfi_offset w23, -48 adrp x19, X1 ldr w8, [x19, :lo12:X1] add w20, w8, #1 // =1 add w21, w8, #2 // =2 add w22, w8, #3 // =3 add w23, w8, #4 // =4 bl foo adrp x8, X2 str w20, [x19, :lo12:X1] str w21, [x8, :lo12:X2] ldp x19, x30, [sp, #32] // 8-byte Folded Reload ldp x21, x20, [sp, #16] // 8-byte Folded Reload adrp x9, X3 adrp x10, X4 str w22, [x9, :lo12:X3] str w23, [x10, :lo12:X4] ldp x23, x22, [sp], #48 // 8-byte Folded Reload ret but when the debug metadata is commented out, it produces the following code instead: stp x23, x22, [sp, #-48]! // 8-byte Folded Spill stp x19, x30, [sp, #32] // 8-byte Folded Spill adrp x19, X1 ldr w8, [x19, :lo12:X1] stp x21, x20, [sp, #16] // 8-byte Folded Spill add w20, w8, #1 // =1 add w21, w8, #2 // =2 add w22, w8, #3 // =3 add w23, w8, #4 // =4 bl foo adrp x8, X2 str w20, [x19, :lo12:X1] str w21, [x8, :lo12:X2] ldp x19, x30, [sp, #32] // 8-byte Folded Reload ldp x21, x20, [sp, #16] // 8-byte Folded Reload adrp x9, X3 adrp x10, X4 str w22, [x9, :lo12:X3] str w23, [x10, :lo12:X4] ldp x23, x22, [sp], #48 // 8-byte Folded Reload ret Note the different position in the schedule of the first adrp and ldr instructions.
I was under the impression that there was an existing bug for this, but I'm not finding it, so here's what I know about this situation. The .cfi_* directives are there to allow generating information that allows a debugger (or unwinder, or backtrace, or whatever) to navigate the stack frames, and correctly restore the state of a previous frame. That information has specific details about which register is stored where, and when. Allowing a scheduler to reorder instructions across .cfi_* directives would make that information incorrect. That's why the post RA scheduler treats them as scheduling barriers.
It seems to me that we should be able to model the relevant dependencies (e.g. by adding some implicit reads/writes to the CFI_INSTRUCTION) such that only valid re-orderings would occur. We might also need to teach the scheduler to schedule these instructions as soon as they are available, or in some other way prevent their presence from altering scheduling decisions. That being said, we'd need to weigh the complexity of doing all of this against the benefit of having slightly more consistent code generation w.r.t. debug flags. FWIW, this isn't something I am actively pursuing.
It is a given, among people who do debug info seriously, that changing code generation when you say -g is a Bad Thing. In the worst case, actual bugs can hide behind what look like innocuous differences. And if this actually interferes with other optimizations like tail merging or outlining, then it might get a little higher on someone's radar. Thanks for filing the bug!
The scheduling regions [I, RegionEnd] will be broken by cfg instructions in PostRA Machine Instruction Scheduler (postmisched) pass. track the code when getSchedRegions: https://github.com/llvm/llvm-project/blob/b1cf175271820b17c27edfd483c2ab52ce0afcfb/llvm/lib/CodeGen/MachineScheduler.cpp#L487 ------------------------------------------------ if (isSchedBoundary(&MI, &*MBB, MF, TII)) break; ------ isSchedBoundary() { return MI->isCall() || TII->isSchedulingBoundary(*MI, MBB, *MF); } ------ TII->isSchedulingBoundary(*MI, MBB, *MF); { if (MI.isTerminator() || MI.isPosition()) return true; } ------ // llvm/include/llvm/CodeGen/MachineInstr.h#L1024 bool isPosition() const { return isLabel() || isCFIInstruction(); } ---------------------------------------------- The given CFI instruction is considered as a scheduling boundary when MI isPosition=true. Then, one Region is broke as two different Regions(that makes wrong behavior). I am going to ignore CFG instruction before isSchedBoundary() and push_back to Regions without CFG instructions. Or is there any other better solution to fix this issue?
Chris wrote: > I am going to ignore CFG instruction before isSchedBoundary() and push_back > to Regions without CFG instructions. Or is there any other better solution > to fix this issue? From what Paul wrote in comment 1, I believe the risk is that machine instructions will be rescheduled in such a way that the CFI instructions are no longer correct, which will mislead debuggers. Fully fixing this probably means the scheduler needs to understand CFI instructions, so that it can fix up the changes that it makes. An alternative interpretation would be that it's better to create broken debug info than to change the generated code, but that might be difficult to argue.
In this case, cfi instructions are recognized as SchedulingBoundary and break one Region into two Regions. For example, the su(0) and su(1) are one Region, and su(9) and su(10) is another Region, they should schedule in one Region, but in fact they do not schedule as two Regions caused by cfi barriers. After replaced the MI.isPosition with MI.isLable in isSchedulingBoundary, cfi instructions will not break the Regions then, and those cfi instructions will be pushed in the same Regions with machine instructions and do scheduling together. here is debug information: --------------------- Instruction sequence before schedule: su(0) - machine instruction sunit su(1) su(2~8)-these are cfi instructions su(9) su(10) --------------------- after acheduling (no code change) - scheduling no change, postmisched fails su(0) su(1) su(2~8)-these are cfi instructions su(9) su(10) -------------------- After scheduling (with code change) - the machine instruction is correct, but cfi moved after su(0). su(1) su(9) su(10) su(0) su(2~8) - cfi instruction. code change: TII->isSchedulingBoundary(*MI, MBB, *MF); { - if (MI.isTerminator() || MI.isPosition()) + if (MI.isTerminator() || MI.isLabel()) return true; } Could the issue be fixed with this change? In this way the cfi instruction will join scheduling and reserved in code. Not sure if this has risk somewhere. or could I ignore the cfi instruction just like handling with debug instruction while scheduling(not push to Region, not do scheduling)?
I think your idea is not a valid fix for the problem. Although the problem of having differing assembly will certainly be fixed by your change, it would move cfi instructions in an invalid way and away from stack altering instructions. Please see this discussion on the mailing list: http://lists.llvm.org/pipermail/llvm-dev/2019-September/135433.html And this patch on Phabricator: https://reviews.llvm.org/D68076
Hi Chris, I've been digging around and, with some help, I think I now understand what's going on here. I hope this sheds some light on the matter. There is a TL;DR near the end. The following write-up follows my investigation as someone who previously had no knowledge on cfi instructions. Clickbait title: CFI_INSTRUCTIONs don't need to be scheduling barriers. # Scheduling barriers If an instruction, implicitly or otherwise, modifies the stack pointer register (SP) it is treated as a scheduling barrier. CFI_INSTRUCTIONs in LLVM mir are also always treated as scheduling barriers. # Call Frame Information (CFI) instructions CFI instructions are used to work out offset of the Canonical Frame Address (CFA), the stack slot containing the return address, from SP for the current frame at any given time during program execution. By definition then any SP modifying instruction requires a CFI_INSTRUCTION follow it to describe the modification (if we're emitting emitting CFI, of course). CFI_INSTRUCTIONs that we care about for this example: def_cfa_offset <x: int> : CFA offset <x> from SP offset <r: reg>, <x: int> : Reg <r> value stored at offset <x> from CFA # x86 When we spill registers on x86 we will most likely `push r`. push modifies SP, so it's a scheduling barrier, and if we want CFI it also requires a CFI_INSTRUCTION. Assuming the CFI_INSTRUCTIONs correctly always follow the instruction that they describe it doesn't matter whether they are scheduling barriers or not. Illustrated in this example: --- Without CFI MOV64rr $eax, $r15 PUSH64r $rbp // [WRITE SP: SCHED BARRIER] MOV64rr $eax, $r15 ... --- With CFI MOV64rr $eax, $r15 PUSH64r $rbp // [WRITE SP: SCHED BARRIER] CFI_INSTRUCTION def_cfa_offset 8 // [CFI: SCHED BARRIER] Follow PUSH64r CFI_INSTRUCTION offset $rbp -16 // [CFI: SCHED BARRIER] Follow PUSH64r MOV64rr $eax, $r15 ... # Goeff's aarch64 reproducer Taken from this ticket, note the new comments in the square brackets. stp x23, x22, [sp, #-48]! // 8-byte Folded Spill [READ-WRITE SP] stp x21, x20, [sp, #16] // 8-byte Folded Spill [READ SP] stp x19, x30, [sp, #32] // 8-byte Folded Spill [READ SP] .cfi_def_cfa_offset 48 .cfi_offset w30, -8 .cfi_offset w19, -16 .cfi_offset w20, -24 .cfi_offset w21, -32 .cfi_offset w22, -40 .cfi_offset w23, -48 Here, there are two spills which do not modify SP but do require CFI_INSTRUCTIONs. [TL;DR] This is the disconnect: Not all instructions which require CFI_INSTRUCTIONs modify SP. By definition, instructions which modify SP are both scheduling barriers and require a `CFI_INSTRUCTION def_cfa_offset`. Spills, however, may not modify SP and therefore may not be scheduling barriers. So the instruction `CFI_INSTRUCTION offset` may incorrectly introduce a scheduling barrier and cause a codegen change. # Potential fix (suggested by @jmorse offline) Pull out CFI_INSTRUCTIONS before scheduling. Map them to the instructions which they describe. Then after scheduling, reinsert the CFI_INSTRUCTIONs such that they follow the instructions they describe. This can be done by the scheduler itself, and it seems to be an easier-to-implement version of fix #2 suggested in the llvm-dev discussion here [0]. [0] http://lists.llvm.org/pipermail/llvm-dev/2019-September/135433.html
>Pull out CFI_INSTRUCTIONS before scheduling. Map them to the instructions >which they describe. Then after scheduling, reinsert the CFI_INSTRUCTIONs >such that they follow the instructions they describe. This can be done by >the scheduler itself, and it seems to be an easier-to-implement version of >fix #2 suggested in the llvm-dev discussion here [0]. >[0] http://lists.llvm.org/pipermail/llvm-dev/2019-September/135433.html Thanks Orlando for analyzing, not sure if David or someone else are working on this issue. If not, let me have a try to fix.
Not sure if my solution is workable as below: In getSchedRegions(), the region will be detected from RegionEnd to RegionBegin, during this detecting, when found CFI, then splice it to RegionEnd and set the CFI as RegionEnd. So that it will not be recognized as SchedBoundary to split one Region for scheduling. for example, original MBB: ---------------------------------------------- frame-setup STPXi killed $x20, killed $x19, $sp, 4 ------(Region 2 Begin) frame-setup CFI_INSTRUCTION def_cfa_offset 48 ------(Region 2 End) frame-setup CFI_INSTRUCTION offset $w19, -8 renamable $w23 = ADDWri killed renamable $w8, 4, 0 ------(Region 1 Begin) BL @foo, csr_aarch64_aapcs, implicit-def dead $lr, implicit $sp, implicit-def $sp ------(Region 1 End) ---------------------------------------------- After reorder CFI by splice to END: ---------------------------------------------- frame-setup STPXi killed $x20, killed $x19, $sp, 4 ------(Region 1 Begin) renamable $w23 = ADDWri killed renamable $w8, 4, 0 frame-setup CFI_INSTRUCTION def_cfa_offset 48 ------(Region 1 End) frame-setup CFI_INSTRUCTION offset $w19, -8 BL @foo, csr_aarch64_aapcs, implicit-def dead $lr, implicit $sp, implicit-def $sp ----------------------------------------------
Only aarch64/arm64 meet this issue, and other platform does not, X86 for example disabled post-MI-sched: lib/Target/X86/X86Schedule.td ----------------------------------------------------------- // IssueWidth is analogous to the number of decode units. Core and its // descendents, including Nehalem and SandyBridge have 4 decoders. // Resources beyond the decoder operate on micro-ops and are bufferred // so adjacent micro-ops don't directly compete. // // MicroOpBufferSize > 1 indicates that RAW dependencies can be // decoded in the same cycle. The value 32 is a reasonably arbitrary // number of in-flight instructions. // // HighLatency=10 is optimistic. X86InstrInfo::isHighLatencyDef // indicates high latency opcodes. Alternatively, InstrItinData // entries may be included here to define specific operand // latencies. Since these latencies are not used for pipeline hazards, // they do not need to be exact. // The GenericX86Model contains no instruction schedules // and disables PostRAScheduler. class GenericX86Model : SchedMachineModel { let IssueWidth = 4; let MicroOpBufferSize = 32; let LoadLatency = 4; let HighLatency = 10; let PostRAScheduler = 0; -----disabled let CompleteModel = 0; } -----------------------------------------------------------
Fix available in: https://reviews.llvm.org/D68076