37240 – -g causes different code to be generated due to PostRA scheduling differences

LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filled before November 26, 2021. Use github to submit LLVM bugs

Bug 37240 - -g causes different code to be generated due to PostRA scheduling differences

Summary: -g causes different code to be generated due to PostRA scheduling differences

Status:	NEW

Alias:	None

Product:	libraries
Classification:	Unclassified
Component:	Backend: AArch64 (show other bugs)
Version:	trunk
Hardware:	All All

Importance:	P normal
Assignee:	David Tellenbach

URL:
Keywords:

Depends on:
Blocks:	37728
	Show dependency tree

Reported:	2018-04-25 12:59 PDT by Geoff Berry
Modified:	2020-08-07 17:32 PDT (History)
CC List:	16 users (show)

See Also:
Fixed By Commit(s):

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Geoff Berry 2018-04-25 12:59:26 PDT

When -g (or just -gline-tables-only) is passed to clang, CFI_INSTRUCTIONs are added to the function prologue by the Prologue/Epilogue insertion pass.  These instructions act as scheduling region barriers when the PostRA scheduler is run.  This can lead to different schedules for the prologue block, which can further lead to differences in tail merging/duplication.

A simple example the illustrates the problem:

target triple = "aarch64-linux-gnu"

@X1 = global i32 0, align 4
@X2 = global i32 0, align 4
@X3 = global i32 0, align 4
@X4 = global i32 0, align 4

define void @test(i32 %i) #0 {
entry:
  %0 = load i32, i32* @X1, align 4
  %x1 = add i32 %0, 1
  %x2 = add i32 %0, 2
  %x3 = add i32 %0, 3
  %x4 = add i32 %0, 4
  tail call void @foo()
  store i32 %x1, i32* @X1, align 4
  store i32 %x2, i32* @X2, align 4
  store i32 %x3, i32* @X3, align 4
  store i32 %x4, i32* @X4, align 4
  ret void
}

declare void @foo()

attributes #0 = { nounwind }

!llvm.dbg.cu = !{!0}
!llvm.module.flags = !{!3, !4, !5}
!0 = distinct !DICompileUnit(language: DW_LANG_C99, file: !1, producer: "clang version 7.0.0 (trunk 330790) (llvm/trunk 330787)", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, enums: !2)
!1 = !DIFile(filename: "test.c", directory: "")
!2 = !{}
!3 = !{i32 2, !"Dwarf Version", i32 4}
!4 = !{i32 2, !"Debug Info Version", i32 3}
!5 = !{i32 1, !"wchar_size", i32 4}


when compiled as is (i.e. with debug info), produces the following code:

        stp     x23, x22, [sp, #-48]!   // 8-byte Folded Spill
        stp     x21, x20, [sp, #16]     // 8-byte Folded Spill
        stp     x19, x30, [sp, #32]     // 8-byte Folded Spill
        .cfi_def_cfa_offset 48
        .cfi_offset w30, -8
        .cfi_offset w19, -16
        .cfi_offset w20, -24
        .cfi_offset w21, -32
        .cfi_offset w22, -40
        .cfi_offset w23, -48
        adrp    x19, X1
        ldr     w8, [x19, :lo12:X1]
        add     w20, w8, #1             // =1
        add     w21, w8, #2             // =2
        add     w22, w8, #3             // =3
        add     w23, w8, #4             // =4
        bl      foo
        adrp    x8, X2
        str     w20, [x19, :lo12:X1]
        str     w21, [x8, :lo12:X2]
        ldp     x19, x30, [sp, #32]     // 8-byte Folded Reload
        ldp     x21, x20, [sp, #16]     // 8-byte Folded Reload
        adrp    x9, X3
        adrp    x10, X4
        str     w22, [x9, :lo12:X3]
        str     w23, [x10, :lo12:X4]
        ldp     x23, x22, [sp], #48     // 8-byte Folded Reload
        ret


but when the debug metadata is commented out, it produces the following code instead:

        stp     x23, x22, [sp, #-48]!   // 8-byte Folded Spill
        stp     x19, x30, [sp, #32]     // 8-byte Folded Spill
        adrp    x19, X1
        ldr     w8, [x19, :lo12:X1]
        stp     x21, x20, [sp, #16]     // 8-byte Folded Spill
        add     w20, w8, #1             // =1
        add     w21, w8, #2             // =2
        add     w22, w8, #3             // =3
        add     w23, w8, #4             // =4
        bl      foo
        adrp    x8, X2
        str     w20, [x19, :lo12:X1]
        str     w21, [x8, :lo12:X2]
        ldp     x19, x30, [sp, #32]     // 8-byte Folded Reload
        ldp     x21, x20, [sp, #16]     // 8-byte Folded Reload
        adrp    x9, X3
        adrp    x10, X4
        str     w22, [x9, :lo12:X3]
        str     w23, [x10, :lo12:X4]
        ldp     x23, x22, [sp], #48     // 8-byte Folded Reload
        ret

Note the different position in the schedule of the first adrp and ldr instructions.

Comment 1 Paul Robinson 2018-04-25 14:37:42 PDT

I was under the impression that there was an existing bug for this,
but I'm not finding it, so here's what I know about this situation.

The .cfi_* directives are there to allow generating information that
allows a debugger (or unwinder, or backtrace, or whatever) to navigate 
the stack frames, and correctly restore the state of a previous frame.
That information has specific details about which register is stored
where, and when.  Allowing a scheduler to reorder instructions across
.cfi_* directives would make that information incorrect.  That's why
the post RA scheduler treats them as scheduling barriers.

Comment 2 Geoff Berry 2018-04-25 14:54:07 PDT

It seems to me that we should be able to model the relevant dependencies (e.g. by adding some implicit reads/writes to the CFI_INSTRUCTION) such that only valid re-orderings would occur.  We might also need to teach the scheduler to schedule these instructions as soon as they are available, or in some other way prevent their presence from altering scheduling decisions.

That being said, we'd need to weigh the complexity of doing all of this against the benefit of having slightly more consistent code generation w.r.t. debug flags.  FWIW, this isn't something I am actively pursuing.

Comment 3 Paul Robinson 2018-04-25 15:13:21 PDT

It is a given, among people who do debug info seriously, that changing
code generation when you say -g is a Bad Thing.  In the worst case,
actual bugs can hide behind what look like innocuous differences.

And if this actually interferes with other optimizations like tail merging
or outlining, then it might get a little higher on someone's radar.

Thanks for filing the bug!

Comment 4 Chris Ye 2019-09-05 19:19:39 PDT

The scheduling regions [I, RegionEnd] will be broken by cfg instructions in PostRA Machine Instruction Scheduler (postmisched) pass.

track the code when getSchedRegions: https://github.com/llvm/llvm-project/blob/b1cf175271820b17c27edfd483c2ab52ce0afcfb/llvm/lib/CodeGen/MachineScheduler.cpp#L487
------------------------------------------------
if (isSchedBoundary(&MI, &*MBB, MF, TII))
  break;

------
isSchedBoundary()
{
   return MI->isCall() || TII->isSchedulingBoundary(*MI, MBB, *MF);
}

------
TII->isSchedulingBoundary(*MI, MBB, *MF);
{
  if (MI.isTerminator() || MI.isPosition())
    return true;
}

------
// llvm/include/llvm/CodeGen/MachineInstr.h#L1024
 bool isPosition() const { return isLabel() || isCFIInstruction(); }
----------------------------------------------

The given CFI instruction is considered as a scheduling boundary when MI isPosition=true. Then, one Region is broke as two different Regions(that makes wrong behavior).

I am going to ignore CFG instruction before isSchedBoundary() and push_back to Regions without CFG instructions. Or is there any other better solution to fix this issue?

Comment 5 Jeremy Morse 2019-09-10 02:42:48 PDT

Chris wrote:
> I am going to ignore CFG instruction before isSchedBoundary() and push_back
> to Regions without CFG instructions. Or is there any other better solution
> to fix this issue?

From what Paul wrote in comment 1, I believe the risk is that machine instructions will be rescheduled in such a way that the CFI instructions are no longer correct, which will mislead debuggers. Fully fixing this probably means the scheduler needs to understand CFI instructions, so that it can fix up the changes that it makes.

An alternative interpretation would be that it's better to create broken debug info than to change the generated code, but that might be difficult to argue.

Comment 6 Chris Ye 2019-09-17 02:51:20 PDT

In this case, cfi instructions are recognized as SchedulingBoundary and break one Region into two Regions. For example, the su(0) and su(1) are one Region, and su(9) and su(10) is another Region, they should schedule in one Region, but in fact they do not schedule as two Regions caused by cfi barriers. After replaced the MI.isPosition with MI.isLable in isSchedulingBoundary, cfi instructions will not break the Regions then, and those cfi instructions will be pushed in the same Regions with machine instructions and do scheduling together.

here is debug information:
---------------------
Instruction sequence before schedule: 
 su(0) - machine instruction sunit
 su(1)
 su(2~8)-these are cfi instructions
 su(9)
 su(10)

---------------------
after acheduling (no code change) - scheduling no change, postmisched fails
 su(0)
 su(1)
 su(2~8)-these are cfi instructions
 su(9)
 su(10)

--------------------
After scheduling (with code change) - the machine instruction is correct, but cfi moved after su(0).
  su(1)
  su(9)
  su(10)
  su(0)
  su(2~8) - cfi instruction.

code change:
TII->isSchedulingBoundary(*MI, MBB, *MF);
{
-  if (MI.isTerminator() || MI.isPosition())
+  if (MI.isTerminator() || MI.isLabel())
    return true;
}

Could the issue be fixed with this change? In this way the cfi instruction will join scheduling and reserved in code. Not sure if this has risk somewhere.
or could I ignore the cfi instruction just like handling with debug instruction while scheduling(not push to Region, not do scheduling)?

Comment 7 David Tellenbach 2019-09-26 09:25:57 PDT

I think your idea is not a valid fix for the problem. Although the problem of having differing assembly will certainly be fixed by your change, it would move cfi instructions in an invalid way and away from stack altering instructions.

Please see this discussion on the mailing list: http://lists.llvm.org/pipermail/llvm-dev/2019-September/135433.html

And this patch on Phabricator: https://reviews.llvm.org/D68076

Comment 8 Orlando Cazalet-Hyams 2019-12-05 07:39:55 PST

Hi Chris, I've been digging around and, with some help, I think I now
understand what's going on here. I hope this sheds some light on the matter.

There is a TL;DR near the end. The following write-up follows my investigation
as someone who previously had no knowledge on cfi instructions.

Clickbait title: CFI_INSTRUCTIONs don't need to be scheduling barriers.

# Scheduling barriers
If an instruction, implicitly or otherwise, modifies the stack pointer
register (SP) it is treated as a scheduling barrier. CFI_INSTRUCTIONs in LLVM
mir are also always treated as scheduling barriers.

# Call Frame Information (CFI) instructions
CFI instructions are used to work out offset of the Canonical Frame Address
(CFA), the stack slot containing the return address, from SP for the current
frame at any given time during program execution. By definition then any SP
modifying instruction requires a CFI_INSTRUCTION follow it to describe the
modification (if we're emitting emitting CFI, of course).

CFI_INSTRUCTIONs that we care about for this example:
def_cfa_offset <x: int>    : CFA offset <x> from SP
offset <r: reg>, <x: int>  : Reg <r> value stored at offset  <x> from CFA

# x86
When we spill registers on x86 we will most likely `push r`. push modifies
SP, so it's a scheduling barrier, and if we want CFI it also requires a
CFI_INSTRUCTION.

Assuming the CFI_INSTRUCTIONs correctly always follow the instruction that
they describe it doesn't matter whether they are scheduling barriers or not.
Illustrated in this example:

--- Without CFI
MOV64rr $eax, $r15
PUSH64r $rbp                      // [WRITE SP: SCHED BARRIER]
MOV64rr $eax, $r15
...
--- With CFI
MOV64rr $eax, $r15
PUSH64r $rbp                      // [WRITE SP: SCHED BARRIER]
CFI_INSTRUCTION def_cfa_offset 8  // [CFI: SCHED BARRIER] Follow PUSH64r
CFI_INSTRUCTION offset $rbp -16   // [CFI: SCHED BARRIER] Follow PUSH64r
MOV64rr $eax, $r15
...

# Goeff's aarch64 reproducer
Taken from this ticket, note the new comments in the square brackets.

stp     x23, x22, [sp, #-48]!   // 8-byte Folded Spill [READ-WRITE SP]
stp     x21, x20, [sp, #16]     // 8-byte Folded Spill [READ SP]
stp     x19, x30, [sp, #32]     // 8-byte Folded Spill [READ SP]
.cfi_def_cfa_offset 48
.cfi_offset w30, -8
.cfi_offset w19, -16
.cfi_offset w20, -24
.cfi_offset w21, -32
.cfi_offset w22, -40
.cfi_offset w23, -48

Here, there are two spills which do not modify SP but do require
CFI_INSTRUCTIONs.

[TL;DR]
This is the disconnect: Not all instructions which require CFI_INSTRUCTIONs
modify SP.

By definition, instructions which modify SP are both scheduling barriers
and require a `CFI_INSTRUCTION def_cfa_offset`. Spills, however, may not modify
SP and therefore may not be scheduling barriers. So the instruction
`CFI_INSTRUCTION offset` may incorrectly introduce a scheduling barrier and
cause a codegen change.

# Potential fix (suggested by @jmorse offline)
Pull out CFI_INSTRUCTIONS before scheduling. Map them to the instructions
which they describe. Then after scheduling, reinsert the CFI_INSTRUCTIONs
such that they follow the instructions they describe. This can be done by
the scheduler itself, and it seems to be an easier-to-implement version of
fix #2 suggested in the llvm-dev discussion here [0].

[0] http://lists.llvm.org/pipermail/llvm-dev/2019-September/135433.html

Comment 9 Chris Ye 2019-12-08 22:57:04 PST

>Pull out CFI_INSTRUCTIONS before scheduling. Map them to the instructions
>which they describe. Then after scheduling, reinsert the CFI_INSTRUCTIONs
>such that they follow the instructions they describe. This can be done by
>the scheduler itself, and it seems to be an easier-to-implement version of
>fix #2 suggested in the llvm-dev discussion here [0].
>[0] http://lists.llvm.org/pipermail/llvm-dev/2019-September/135433.html

Thanks Orlando for analyzing, not sure if David or someone else are working on this issue. If not, let me have a try to fix.

Comment 10 Chris Ye 2019-12-10 21:52:09 PST

Not sure if my solution is workable as below:

In getSchedRegions(), the region will be detected from RegionEnd to RegionBegin,
during this detecting, when found CFI, then splice it to RegionEnd and set the CFI as RegionEnd. So that it will not be recognized as SchedBoundary to split one Region for scheduling.


for example, original MBB:
----------------------------------------------
    frame-setup STPXi killed $x20, killed $x19, $sp, 4  ------(Region 2 Begin)
    frame-setup CFI_INSTRUCTION def_cfa_offset 48  ------(Region 2 End)
    frame-setup CFI_INSTRUCTION offset $w19, -8
    renamable $w23 = ADDWri killed renamable $w8, 4, 0 ------(Region 1 Begin)
    BL @foo, csr_aarch64_aapcs, implicit-def dead $lr, implicit $sp, implicit-def $sp  ------(Region 1 End)
----------------------------------------------

After reorder CFI by splice to END:
----------------------------------------------
    frame-setup STPXi killed $x20, killed $x19, $sp, 4 ------(Region 1 Begin)
    renamable $w23 = ADDWri killed renamable $w8, 4, 0 
    frame-setup CFI_INSTRUCTION def_cfa_offset 48 ------(Region 1 End)
    frame-setup CFI_INSTRUCTION offset $w19, -8
    BL @foo, csr_aarch64_aapcs, implicit-def dead $lr, implicit $sp, implicit-def $sp
----------------------------------------------

Comment 11 Chris Ye 2019-12-10 22:04:12 PST

Only aarch64/arm64 meet this issue, and other platform does not, X86 for example disabled post-MI-sched:


lib/Target/X86/X86Schedule.td
-----------------------------------------------------------
// IssueWidth is analogous to the number of decode units. Core and its
// descendents, including Nehalem and SandyBridge have 4 decoders.
// Resources beyond the decoder operate on micro-ops and are bufferred
// so adjacent micro-ops don't directly compete.
//
// MicroOpBufferSize > 1 indicates that RAW dependencies can be
// decoded in the same cycle. The value 32 is a reasonably arbitrary
// number of in-flight instructions.
//
// HighLatency=10 is optimistic. X86InstrInfo::isHighLatencyDef
// indicates high latency opcodes. Alternatively, InstrItinData
// entries may be included here to define specific operand
// latencies. Since these latencies are not used for pipeline hazards,
// they do not need to be exact.

// The GenericX86Model contains no instruction schedules
// and disables PostRAScheduler.
class GenericX86Model : SchedMachineModel {
  let IssueWidth = 4;
  let MicroOpBufferSize = 32;
  let LoadLatency = 4;
  let HighLatency = 10;
  let PostRAScheduler = 0;  -----disabled
  let CompleteModel = 0;
}
-----------------------------------------------------------

Comment 12 hiraditya 2020-08-07 17:32:16 PDT

Fix available in: https://reviews.llvm.org/D68076