
Improve optimized debug info for escaped variables that live in memory #33483

Closed
rnk opened this issue Aug 9, 2017 · 40 comments
Labels: bugzilla (Issues migrated from bugzilla), debuginfo

Comments

@rnk
Collaborator

rnk commented Aug 9, 2017

Bugzilla Link 34136
Version trunk
OS Windows NT
CC @adrian-prantl,@chandlerc,@chrisjbris,@dwblaikie,@dpxf,@echristo,@zmodem,@hfinkel,@jmorse,@jimingham,@jrmuizel,@jdm,@lhames,@jasonmolenda,@nico,@OCHyams,@pogo59

Extended Description

Consider:

int getint(void);
void alias(char *);
void crash(void);
void f() {
  int x = getint();
  int y = getint();
  int z = getint();
  alias((char*)&x);
  alias((char*)&y);
  alias((char*)&z);
  crash();
}

InstCombine runs LowerDbgDeclare on this code, and we end up with very poor debug info. We basically say that x, y, and z live in EAX after each call to getint, and they are immediately clobbered, never to be available again.

Those casts to char* also prevent the transform from adding more dbg.values that would extend the range.

This was reduced from http://crbug.com/753934

@rnk
Collaborator Author

rnk commented Aug 9, 2017

IR before instcombine:

; Function Attrs: nounwind uwtable
define void @​f() #​0 !dbg !​11 {
entry:
  %x = alloca i32, align 4
  %y = alloca i32, align 4
  %z = alloca i32, align 4
  %0 = bitcast i32* %x to i8*, !dbg !​19
  call void @​llvm.lifetime.start.p0i8(i64 4, i8* %0) #​4, !dbg !​19
  call void @​llvm.dbg.declare(metadata i32* %x, metadata !​15, metadata !​20), !dbg !​21
  %call = call i32 @​getint(), !dbg !​22
  store i32 %call, i32* %x, align 4, !dbg !​21, !tbaa !​23
  %1 = bitcast i32* %y to i8*, !dbg !​27
  call void @​llvm.lifetime.start.p0i8(i64 4, i8* %1) #​4, !dbg !​27
  call void @​llvm.dbg.declare(metadata i32* %y, metadata !​17, metadata !​20), !dbg !​28
  %call1 = call i32 @​getint(), !dbg !​29
  store i32 %call1, i32* %y, align 4, !dbg !​28, !tbaa !​23
  %2 = bitcast i32* %z to i8*, !dbg !​30
  call void @​llvm.lifetime.start.p0i8(i64 4, i8* %2) #​4, !dbg !​30
  call void @​llvm.dbg.declare(metadata i32* %z, metadata !​18, metadata !​20), !dbg !​31
  %call2 = call i32 @​getint(), !dbg !​32
  store i32 %call2, i32* %z, align 4, !dbg !​31, !tbaa !​23
  %3 = bitcast i32* %x to i8*, !dbg !​33
  call void @​alias(i8* %3), !dbg !​34
  %4 = bitcast i32* %y to i8*, !dbg !​35
  call void @​alias(i8* %4), !dbg !​36
  %5 = bitcast i32* %z to i8*, !dbg !​37
  call void @​alias(i8* %5), !dbg !​38
  call void @​crash(), !dbg !​39
  %6 = bitcast i32* %z to i8*, !dbg !​40
  call void @​llvm.lifetime.end.p0i8(i64 4, i8* %6) #​4, !dbg !​40
  %7 = bitcast i32* %y to i8*, !dbg !​40
  call void @​llvm.lifetime.end.p0i8(i64 4, i8* %7) #​4, !dbg !​40
  %8 = bitcast i32* %x to i8*, !dbg !​40
  call void @​llvm.lifetime.end.p0i8(i64 4, i8* %8) #​4, !dbg !​40
  ret void, !dbg !​40
}

IR after instcombine:

define void @​f() #​0 !dbg !​11 {
entry:
  %x = alloca i32, align 4
  %y = alloca i32, align 4
  %z = alloca i32, align 4
  %0 = bitcast i32* %x to i8*, !dbg !​19
  call void @​llvm.lifetime.start.p0i8(i64 4, i8* %0) #​4, !dbg !​19
  %call = call i32 @​getint() #​4, !dbg !​20
  call void @​llvm.dbg.value(metadata i32 %call, metadata !​15, metadata !​21), !dbg !​22
  store i32 %call, i32* %x, align 4, !dbg !​22, !tbaa !​23
  %1 = bitcast i32* %y to i8*, !dbg !​27
  call void @​llvm.lifetime.start.p0i8(i64 4, i8* %1) #​4, !dbg !​27
  %call1 = call i32 @​getint() #​4, !dbg !​28
  call void @​llvm.dbg.value(metadata i32 %call1, metadata !​17, metadata !​21), !dbg !​29
  store i32 %call1, i32* %y, align 4, !dbg !​29, !tbaa !​23
  %2 = bitcast i32* %z to i8*, !dbg !​30
  call void @​llvm.lifetime.start.p0i8(i64 4, i8* %2) #​4, !dbg !​30
  %call2 = call i32 @​getint() #​4, !dbg !​31
  call void @​llvm.dbg.value(metadata i32 %call2, metadata !​18, metadata !​21), !dbg !​32
  store i32 %call2, i32* %z, align 4, !dbg !​32, !tbaa !​23
  %3 = bitcast i32* %x to i8*, !dbg !​33
  call void @​alias(i8* %3) #​4, !dbg !​34
  %4 = bitcast i32* %y to i8*, !dbg !​35
  call void @​alias(i8* %4) #​4, !dbg !​36
  %5 = bitcast i32* %z to i8*, !dbg !​37
  call void @​alias(i8* %5) #​4, !dbg !​38
  call void @​crash() #​4, !dbg !​39
  %6 = bitcast i32* %z to i8*, !dbg !​40
  call void @​llvm.lifetime.end.p0i8(i64 4, i8* %6) #​4, !dbg !​40
  %7 = bitcast i32* %y to i8*, !dbg !​40
  call void @​llvm.lifetime.end.p0i8(i64 4, i8* %7) #​4, !dbg !​40
  %8 = bitcast i32* %x to i8*, !dbg !​40
  call void @​llvm.lifetime.end.p0i8(i64 4, i8* %8) #​4, !dbg !​40
  ret void, !dbg !​40
}

Note that we only have 3 dbg.value intrinsics. These ultimately turn into DEBUG_VALUE MIR instructions pointing at EAX:

        #DEBUG_VALUE: f:x <- %EAX
        .cv_loc 0 1 5 7                 # store.c:5:7
        movl    %eax, 52(%rsp)
        .cv_loc 0 1 6 11                # store.c:6:11
        callq   getint
.Ltmp1:
        #DEBUG_VALUE: f:y <- %EAX
        .cv_loc 0 1 6 7                 # store.c:6:7
        movl    %eax, 48(%rsp)
        .cv_loc 0 1 7 11                # store.c:7:11
        callq   getint
.Ltmp2:
        #DEBUG_VALUE: f:z <- %EAX
        .cv_loc 0 1 7 7                 # store.c:7:7
        movl    %eax, 44(%rsp)
        leaq    52(%rsp), %rcx
        .cv_loc 0 1 8 3                 # store.c:8:3
        callq   alias

They are all immediately clobbered, so that's no good.

Even if we drop dbg.declare, we should have something to indicate that the variable will live in memory. The dbg.value could use a DW_OP_deref DIExpression after the store, for example.
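As a sketch of that suggestion (hand-written, reusing the variable metadata from the IR above; the inline DIExpression spelling is just for illustration), the store to %x could be followed by a dbg.value that points x back at its stack slot, so later clobbers of EAX do not end the location range:

  %call = call i32 @getint(), !dbg !22
  store i32 %call, i32* %x, align 4, !dbg !21, !tbaa !23
  ; hypothetical: x now lives in memory at %x
  call void @llvm.dbg.value(metadata i32* %x, metadata !15, metadata !DIExpression(DW_OP_deref)), !dbg !21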

@rnk
Collaborator Author

rnk commented Aug 10, 2017

LowerDbgDeclare gives up on any use that isn't a direct use of the entire alloca. Aggregates are typically built up one field at a time, so we can often lose all debug info for them. Consider this example:

struct Foo { int x, y; };
int getint(void);
void escape(char *);
void f() {
  struct Foo o;
  o.x = getint();
  o.y = getint();
  escape((char *)&o);
}
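
For illustration, here is a hand-written sketch (not actual clang output; metadata numbers are made up) of the IR for f. Every store goes through a GEP into %o, so there is no direct store of the whole alloca for LowerDbgDeclare to convert:

%struct.Foo = type { i32, i32 }

define void @f() {
entry:
  %o = alloca %struct.Foo, align 4
  call void @llvm.dbg.declare(metadata %struct.Foo* %o, metadata !10, metadata !DIExpression()), !dbg !12
  %call = call i32 @getint()
  %x = getelementptr inbounds %struct.Foo, %struct.Foo* %o, i32 0, i32 0
  store i32 %call, i32* %x, align 4
  %call1 = call i32 @getint()
  %y = getelementptr inbounds %struct.Foo, %struct.Foo* %o, i32 0, i32 1
  store i32 %call1, i32* %y, align 4
  %0 = bitcast %struct.Foo* %o to i8*
  call void @escape(i8* %0)
  ret void
}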

I'm starting to feel more and more like LowerDbgDeclare is just a bad idea. We should keep dbg.declare around as long as the variable's alloca is still alive. Even if instcombine, GVN, or some other pass deletes or moves some stores to a local variable, I'd much rather tell the user the variable is in memory with a stale value than that we don't have any info at all.

@dwblaikie
Collaborator

I'd much rather tell the user the variable is in
memory with a stale value than that we don't have any info at all.

There's no way to communicate to the user that the value is potentially stale though - so that seems likely to /really/ confuse a user & lead them way down the garden path, potentially.

@adrian-prantl
Collaborator

We probably want both.

My favorite example is this:

void g(int*);
void f(int i) {
  g(&i);        // alloca is live until here
  if (i == 0) { // end of alloca live range
    ...         // i is now a constant, but without lowering to dbg.value we
                // can't capture that and would still point to the stack slot
                // that may even have been reused by now.
  }
}

@rnk
Collaborator Author

rnk commented Aug 10, 2017

I'd much rather tell the user the variable is in
memory with a stale value than that we don't have any info at all.

There's no way to communicate to the user that the value is potentially
stale though - so that seems likely to /really/ confuse a user & lead them
way down the garden path, potentially.

I don't think I agree with that perspective. In the tradeoff between giving the user more information that might be potentially misleading vs. no information at all, I'd prefer to give more information. Even if LLVM moves some stores around and the debugger prints uninitialized garbage for my local variables, I can step around until it gets initialized and then stare at the assembly to figure out what happened. That's a pretty normal workflow when debugging optimized code, IMO.

We probably want both.

My favorite example is this:

void g(int*);
void f(int i) {
g(&i); // alloca is live until here
if (i == 0) { // end of alloca live range
... // i is now a constant, but without lowering to dbg.value we can't
capture that and would still point to the stack slot that may even have been
reused by now.
}
}

I think we could fix this by modifying stack coloring and GVN. For the changes to stack coloring, consider this simpler example:

void g(int *restrict);
int getint(void);
void f(int i) {
  g(&i);
  int j = getint();
  g(&j);
}

If stack coloring reuses i's slot for j, maybe it could update the debug info to shrink the live range of i. I don't think we can do this today, because our frame index table uses the labels inserted for the scope that ultimately come from the DILocations stapled on MachineInstrs.

GVN could also help in the original example by inserting llvm.dbg.value calls recording that i == 0 when it does that predication. Again, we'd have to learn how to mix dbg.declare with dbg.value, but I think we'll need to do something like that in the long run.
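
A sketch of what that GVN change might insert, using Adrian's example above (hypothetical; !16 stands for i's DILocalVariable and the metadata numbers are made up):

  %0 = load i32, i32* %i.addr, align 4
  %cmp = icmp eq i32 %0, 0
  br i1 %cmp, label %if.then, label %if.end

if.then:
  ; hypothetical: GVN has proven i == 0 on this path, so record it for the debugger
  call void @llvm.dbg.value(metadata i32 0, metadata !16, metadata !DIExpression()), !dbg !17
  br label %if.end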

@adrian-prantl
Collaborator

I don't think I agree with that perspective. In the tradeoff between giving the user more information that might be potentially misleading vs. no information at all, I'd prefer to give more information.

I strongly disagree with this statement. If what the debugger reports cannot be trusted in some cases, then everything the debugger presents to the user becomes worthless, because the user won't be able to tell the difference between information that is correct and information that may be correct.

@dwblaikie
Collaborator

I don't think I agree with that perspective. In the tradeoff between giving the user more information that might be potentially misleading vs. no information at all, I'd prefer to give more information.

I strongly disagree with this statement. If what the debugger reports cannot
be trusted in some cases, then everything the debugger presents to the
user becomes worthless, because the user won't be able to tell the
difference between information that is correct and information that may be
correct.

Yeah - while there might be cases where the answer is unclear (due to optimizations, code gets shuffled around, there may be uncertainty around what the value of a variable is at this instruction) & I can understand debating whether there's a "good enough" answer in some of those cases.

But in cases where it's objectively incorrect - the value's moved into a register, the register is mutated to reflect the new value but the memory location still holds the old value - I don't think it's a good idea to consider that anything other than a bug to be fixed.

It's not so much the user getting "uninitialized garbage" they can easily identify, ignore, & move on from. But about them getting answers that are close enough to be plausible & then spending a lot more time trying to understand a problem that doesn't exist because the debugger told them it did.

@rnk
Collaborator Author

rnk commented Aug 14, 2017

I strongly disagree with this statement. If what the debugger reports cannot
be trusted in some cases, then everything the debugger presents to the
user becomes worthless, because the user won't be able to tell the
difference between information that is correct and information that may be
correct.

I think there are cases where we will have to accept stale variable values. I don't think we should treat it as an absolute goal to never give incorrect values. Otherwise we burden every LLVM pass that has the ability to delete memory writes with the responsibility of updating debug info. I think 90% of memory updates are eliminated by a few core passes: GVN and instcombine. If we can teach them to emit dbg.value when deleting stores to static allocas with debug info, we will have a reasonably good debug experience. There is a long tail of passes like memcpyopt that delete stores that I don't expect to preserve debug info. We can invent a way for them to say "I deleted this dead memcpy, so the value of the variable is unavailable at this point, and becomes available again here (dbg.value+undef?)", but I don't think we'll be able to maintain that for all passes.

It's also worth considering that we delete dead stores to heap objects, so the user can already observe impossible program states in the debugger. We can't just put the blinders on and say "sorry we optimized the program, nothing is certain". The user is already debugging an optimized program, they shouldn't trust anything it says, and our existing user base seems to already know that.

@dwblaikie
Collaborator

The user is already debugging an optimized program,
they shouldn't trust anything it says, and our existing user base seems to
already know that.

What sort of feedback from users is available on this?

& there's also a risk that the more unreliable these things get, the less users trust the answers they get, and the less they use it because it's untrustworthy.

I imagine there are fairly few cases where writes to heap values are optimized away in a way that's especially user visible. Function call boundaries happen often and are a pretty strong impediment to alias analysis for heap values, etc.

@llvmbot
Collaborator

llvmbot commented Aug 14, 2017

I don't think I agree with that perspective. In the tradeoff between giving the user more information that might be potentially misleading vs. no information at all, I'd prefer to give more information.

I strongly disagree with this statement. If what the debugger reports cannot
be trusted in some cases, then everything the debugger presents to the
user becomes worthless, because the user won't be able to tell the
difference between information that is correct and information that may be
correct.

I strongly disagree with this statement. I think it's an overly idealistic view of debugging optimized code. Nobody wants to debug optimized code, because it's a pain. And the reason it's a pain is because the original source code doesn't map well to the generated machine code. At the same time, the compiler has more information than the user does about the transformations that happened, and for the compiler to say "I can't be 100% sure that this information is 100% accurate, therefore I'm not even going to try to help you" seems like the wrong approach.

Generating debug info is not like generating binary code, where it is either provably correct or provably incorrect. The entire point of debug info is to help the user as much as possible. By generating nothing, we are not helping anyone, and we may even be making it impossible for a user to track down a bug. By generating something, with the caveat that it may be stale, we at least point the user in a direction. If it's wrong, then they're no worse off than when they started. But if it's right, then we saved them a ton of time.

Anyone debugging optimized code is going to be going back and forth between the assembly and source view anyway, and should be able to confirm that the debugger is reporting accurate information.

@rnk
Collaborator Author

rnk commented Aug 14, 2017

The user is already debugging an optimized program,
they shouldn't trust anything it says, and our existing user base seems to
already know that.

What sort of feedback from users is available on this?

& there's also a risk that the more unreliable these things get, the less
users trust the answers they get, the less the use it because it's
untrustworthy.

Every build system I've ever encountered has at least two configurations: debug and release, even if they don't go by those names. The ubiquity of those configurations reflects the doubts that users have about the quality of optimized debug info.

I imagine there are fairly few cases where writes to heap values are
optimized away in a way that's especially user visible. Function call
boundaries happen often and are a pretty strong impediment to alias analysis
for heap values, etc.

This logic also applies to escaped locals to some extent. Locals that cannot be promoted to SSA are modeled as "read/write" at every external function call. Alias analysis, of course, is a lot stronger, so things like GVN do fire in practice.

@dwblaikie
Collaborator

If it's wrong, then they're no worse off than when they started.

This is the bit where I think there's a lot of/strong disagreement.

If the debugger tells someone the wrong answer & the user doesn't know it's wrong - it may cost them a lot of time going down the wrong debugging path, trying to understand why that answer is happening in their program (when it isn't).

So, yes, I do think it's more costly to give the wrong answer than no answer. How much more costly (how many wrong answers would be worth it to get some right answers instead of no answers)? Not sure.

I think there is a valid tradeoff there, especially as an intermediate state, where we say "yeah, this technique gives the wrong answer sometimes but gets us a whole lot more right answers (rather than no answer) - /and/ we have a path forward to address the wrong answers it creates (& that they are bugs and should be treated as such)".

@dwblaikie
Collaborator

If it's wrong, then they're no worse off than when they started.

This is the bit where I think there's a lot of/strong disagreement.

If the debugger tells someone the wrong answer & the user doesn't know it's
wrong - it may cost them a lot of time going down the wrong debugging path,
trying to understand why that answer is happening in their program (when it
isn't).

So, yes, I do think it's more costly to give the wrong answer than no
answer. How much more costly (how many wrong answers would be worth it to
get some right answers instead of no answers)? Not sure.

I think there is a valid tradeoff there, especially as an intermediate
state, where we say "yeah, this technique gives the wrong answer sometimes
but gets us a whole lot more right answers (rather than no answer) - /and/
we have a path forward to address the wrong answers it creates (& that they
are bugs and should be treated as such)".

The user is already debugging an optimized program,
they shouldn't trust anything it says, and our existing user base seems to
already know that.

What sort of feedback from users is available on this?

& there's also a risk that the more unreliable these things get, the less
users trust the answers they get, the less the use it because it's
untrustworthy.

Every build system I've ever encountered has at least two configurations:
debug and release, even if they don't go by those names. The ubiquity of
those configurations reflects the doubts that users have about the quality
of optimized debug info.

I think it's reasonable to expect that debugging optimized programs will be harder - I don't think that necessarily means expecting the answers to be wrong. Code reordering (especially now that we're dropping debug locations when code moves between blocks) can be confusing, values may be optimized away, etc. Yeah, it's harder & interacting with code that behaves more like the source code is easier.

I don't think that necessarily implies users are fine with/expect the debugger to give answers that are wrong.

I imagine there are fairly few cases where writes to heap values are
optimized away in a way that's especially user visible. Function call
boundaries happen often and are a pretty strong impediment to alias analysis
for heap values, etc.

This logic also applies to escaped locals to some extent. Locals that cannot
be promoted to SSA are modeled as "read/write" at every external function
call. Alias analysis, of course, is a lot stronger, so things like GVN do
fire in practice.

Right - and there it's pretty noticeable as you're saying and would be great to improve.

I think there's substantial overlap in goals/desires here, feels like this is turning into a philosophical/principle stand when maybe it doesn't need to be? There's a technical tradeoff here - in both extremes ("everything's optimized away, we don't know where anything is" and "every variable has the erroneous value 3") it wouldn't be any good for a user. And somewhere there are practical tradeoffs around accuracy and usability for users - and a place from which improvements on both sides (removing erroneous values, adding info about values that were previously unknown) would be valuable.

I'm not sure exactly how to make that tradeoff judgement, but I think it's probably more helpful to look at it from that perspective.

@llvmbot
Collaborator

llvmbot commented Aug 14, 2017

The user is already debugging an optimized program,
they shouldn't trust anything it says, and our existing user base seems to
already know that.

What sort of feedback from users is available on this?

I'm a long-time user of debugging tools on Microsoft Visual C++ optimized binaries. I can say that we who do this know not to trust what the debugger presents, and often I find e.g. this==nullptr and other anomalies in debugging.
Note that even with perfect debugging information, the debugger will not always be able to unwind the stack correctly. (Think 3rd party modules with no symbols, JIT code with no unwind info, malware with bad intent, etc.).
So, no matter the fidelity of the symbols, it will always be necessary to exercise caution when debugging Windows crashes.

@hfinkel
Collaborator

hfinkel commented Aug 14, 2017

If it's wrong, then they're no worse off than when they started.

This is the bit where I think there's a lot of/strong disagreement.

If the debugger tells someone the wrong answer & the user doesn't know it's
wrong - it may cost them a lot of time going down the wrong debugging path,
trying to understand why that answer is happening in their program (when it
isn't).

So, yes, I do think it's more costly to give the wrong answer than no
answer. How much more costly (how many wrong answers would be worth it to
get some right answers instead of no answers)? Not sure.

I think there is a valid tradeoff there, especially as an intermediate
state, where we say "yeah, this technique gives the wrong answer sometimes
but gets us a whole lot more right answers (rather than no answer) - /and/
we have a path forward to address the wrong answers it creates (& that they
are bugs and should be treated as such)".

The user is already debugging an optimized program,
they shouldn't trust anything it says, and our existing user base seems to
already know that.

What sort of feedback from users is available on this?

& there's also a risk that the more unreliable these things get, the less
users trust the answers they get, the less the use it because it's
untrustworthy.

Every build system I've ever encountered has at least two configurations:
debug and release, even if they don't go by those names. The ubiquity of
those configurations reflects the doubts that users have about the quality
of optimized debug info.

I think it's reasonable to expect that debugging optimized programs will be
harder - I don't think that necessarily means expecting the answers to be
wrong. Code reordering (especially now that we're dropping debug locations
when code moves between blocks) can be confusing, values may be optimized
away, etc. Yeah, it's harder & interacting with code that behaves more like
the source code is easier.

I don't think that necessarily implies users are fine with/expect the
debugger to give answers that are wrong.

This is consistent with my experience as well. Users get really annoyed when the debugger provides wrong values (even more annoyed than when the debugger provides no values).

@jimingham
Collaborator

I really disagree with the statement:

If it's wrong, then they're no worse off than when they started. But if it's right, then we saved them a ton of time.

It would be too exhausting to use a debugger if you have to keep in the forefront of your mind that every value you look at might very well be wrong. So you put that aside and trust it, and then, when you're in the middle of chasing down some problem and some variable has a seemingly impossible value, you spend hours trying to figure out how it could possibly be that value till you remember that the debugger is a piece of crap and has been lying to you again. You end up much worse off than when you started because you've lost your train of thought on the wild goose chase the debugger led you on, and you get to start again knowing that your tools are untrustworthy.

This is not a theoretical opinion on my part.

One difference between gdb & lldb is that when gdb can't unwind a register, it just attributes it to the last known value in a younger frame. The end result is gdb would confidently declare a value for some variable (usually an argument in the "bt" list) which would be totally bogus. That caused many many wasted hours over the years, and even people who had been bitten before would fall into the trap of trusting these values because you just can't use the thing if you have to always distrust it.

I have been on the receiving end when people in this situation have been very forcefully conveying to me that they were much worse off for our having lied to them than if we had just told them we didn't know.

In lldb, if we can't unwind a register we report it as unknown. People fairly easily understand that not everything is recoverable in optimized code debugging. From many years supporting folks using the two approaches it is clear lldb's was the right choice.

Also, if lldb knows the places where the values are correct, we can help the user propagate those values to the places they need to stop. For instance, we can present to them the places they can stop where the value IS trustworthy. We can even do things like put hidden stops at known good locations and record info behind the user's back to present when they later stop. Or we can get lldb to do some assembly analysis to try to figure out where the value moved from its last good location, automating the work folks have to do by hand when the debug info isn't helpful.

OTOH, if all we know is "declared locations are sometimes right" you'd never know when to do this work, and anything you do will be colored by the uncertainty.

If we added a DW_AT_probable_location and only converted it to DW_AT_location when clang is confident of its analysis, that would be okay. Then you could tell people what to expect. But forcing people to live constantly under "the values of Damocles" seems an awfully bad way to construct a tools environment.

@jasonmolenda
Collaborator

To throw two cents in, from the debugger's perspective it's always better to say "I don't know" rather than "this is the value" and give the wrong value. The former is incredibly annoying to anyone working on optimized code -- I work with several groups that have to always build optimized and debugging that code is way harder than it should be today, I fully understand where the pain is coming from. But thinking about the failure modes makes it clear what the right answer is.

If I say "I cannot give you the value of this variable", the dev will either find the value in another stack frame, or they'll resort to reading assembly, find where it is actually still live, and get it. They will curse the debugger for making their lives so hard (I'd prefer they curse the compiler, but that's not what they do, alas.)

If I hand out a value that is wrong, the developer can waste DAYS of time working on the assumption that this code somehow got this impossible value. I've gotten those bug reports -- people who, literally, wasted days because the debugger gave out a value that was not correct. It's so much worse of a failure mode, we've always considered this unacceptable when we've encountered these situations in lldb. The more you work with users of the debugger and their bug reports, the more you realize the importance of getting this right.

The debugger must be conservative in these situations, as frustrating as that is. The debug info must be conservative in these situations. Or if the compiler wants to emit variable locations that are possibly correct, add a different tag in the DWARF for it and debuggers like lldb will be free to ignore it.

Any debug info that claims a variable is accessible at a location, when that location does not hold the variable's value, is buggy debug info.

@nico
Contributor

nico commented Aug 15, 2017

One thing where this is hurting us a lot in chrome/win land at the moment is minidumps. Minidumps aren't really a thing on non-Windows, which might be why expectations here are so different.

On Windows, when chrome (or what have you) crashes, you can download a minidump from the crash server, load that into your debugger, and then use that to inspect the state of variables. Users are aware that they are looking at optimized debug info. Someone told me "Misleading information is mitigated by the fact you can ask the debugger where it's getting its information (dv /v) and if you see stuff in a register, then you know to start distrusting the values."

With clang, when users ask for the values of things while debugging minidumps, they currently almost always get "no answer" instead of "likely answer is X". This is very different from how things used to work for them, and this makes people unhappy too.

Given the different expectations, would adding this behavior behind a (default-off) flag be a short-term option? Then we can improve optimized debug info asynchronously and see if it's possible to get it to be good enough without having to sometimes show invalid information.

@dwblaikie
Collaborator

One thing where this is hurting us a lot in chrome/win land at the moment is
minidumps. Minidumps aren't really a thing on non-Windows, which might be
why expectations here are so different.

On Windows, when chrome (or what have you) crashes, you can download a
minidump from the crash server, load that into your debugger, and then use
that to inspect the state of variables. Users are aware that they look at
optimized debug info. Someone told me "Misleading information is mitigated
by the fact you can ask the debugger where's it's getting its information
(dv /v) and if you see stuff in a register, then you know to start
distrusting the values."

Interestingly it seems like that's the opposite of the likely erroneous information that would happen in the direction discussed here.

Here the issue is that LLVM isn't describing the memory location backing a variable once it copies the value into a register. By adding the stack memory location, the likely mistake is that the reported value may be stale, because a dead store has been eliminated and the current value lives /only/ in that register (or nowhere at all).

With clang, when users ask for the values of things while debugging
minidumps, they currently almost always get "no answer" instead of "likely
answer is X". This is very different from how things used to work for them,
and this makes people unhappy too.

nod It's clearly a regression for MSVC users switching to Clang, no doubt. And it's clear LLVM can do better - yeah, when "f(&x)" is called, Clang should be describing the location of 'x' without any doubt.

Though I wonder how many false positives ("the value of X is Y" when it's really Z) MSVC has compared to what the proposed direction/fixes have in Clang. Might well be that the users also don't expect as many false positives as this would create in Clang.

Given the different expectations, would adding this behavior behind a
(default-off) flag be a short-term option? Then we can improve optimized
debug info asynchronously and see if it's possible to get it to be good
enough without having to sometimes show invalid information.

Maybe - though the divergence does make me a touch uncomfortable, it does amount to sort of an A/B test with different implementations and users.

I'm not sure exactly where the tradeoffs are of the proposed fixes & what the path to addressing any gaps would be.

Adrian: Where're you at, philosophically, with all this. Are you of the opinion that no false positive regression is acceptable? Or are some acceptable (with the acknowledgement that they are bugs that should be fixed and should be identified, tracked, have a likely path to resolution, etc) to reduce false negatives ("no location" when there's really an obvious, unambiguous, unavoidable location that should be reported in the DWARF)?

If you feel like no false positive regression is acceptable, then yeah - I guess forking the default here with a flag might be the path forward - then as the gaps in the analysis are filled (tracking when a value moves back into memory), eventually the flag can be removed/ignored as the only thing it would be enabling would be the false positives - reintroducing the "unknown values" when it's truly unknown.

@hfinkel
Collaborator

hfinkel commented Aug 15, 2017

One thing where this is hurting us a lot in chrome/win land at the moment is
minidumps. Minidumps aren't really a thing on non-Windows, which might be
why expectations here are so different.

On Windows, when chrome (or what have you) crashes, you can download a
minidump from the crash server, load that into your debugger, and then use
that to inspect the state of variables. Users are aware that they look at
optimized debug info. Someone told me "Misleading information is mitigated
by the fact you can ask the debugger where's it's getting its information
(dv /v) and if you see stuff in a register, then you know to start
distrusting the values."

Why? I thought that minidumps have registers and stack, but lack other memory contents.

With clang, when users ask for the values of things while debugging
minidumps, they currently almost always get "no answer" instead of "likely
answer is X". This is very different from how things used to work for them,
and this makes people unhappy too.

Heh. So we've trained users to expect broken behavior and now they want it back ;) [this is unfortunately common.]

Given the different expectations, would adding this behavior behind a
(default-off) flag be a short-term option? Then we can improve optimized
debug info asynchronously and see if it's possible to get it to be good
enough without having to sometimes show invalid information.

@jasonmolenda
Collaborator

Given the different expectations, would adding this behavior behind a
(default-off) flag be a short-term option? Then we can improve optimized
debug info asynchronously and see if it's possible to get it to be good
enough without having to sometimes show invalid information.

I am real uncomfortable with this being an option on the producer of debug info (clang). Now the person who builds a program is deciding how the debugger behaves -- making an assumption about what the debug developer is going to see. The person debugging a binary may have no idea how the program was compiled. The person who is compiling the program may not understand the meaning of the clang flag ("-fpossibly-wrong-but-maybe-really-helpful-debug-info? Yes please!") themselves.

If we want to emit possibly-correct location information in parallel to known-correct location information in the DWARF in a separate list, and add an option to the consumer (lldb) to use possibly-correct location information (with default-off), I'd be OK with that approach. Then the person enabling the use of this in lldb is responsible for their own results.

I don't think MSVC's behavior is a feature to be emulated, though. We've had many cases like this over the years, and in lldb we consider showing incorrect values to the user as our worst class of bugs.

@nico
Contributor

nico commented Aug 15, 2017

Given the different expectations, would adding this behavior behind a
(default-off) flag be a short-term option? Then we can improve optimized
debug info asynchronously and see if it's possible to get it to be good
enough without having to sometimes show invalid information.

I am real uncomfortable with this being an option on the producer of debug
info (clang). Now the person who builds a program is deciding how the
debugger behaves -- making an assumption about what the debug developer is
going to see. The person debugging a binary may have no idea how the
program was compiled. The person who is compiling the program may not
understand the meaning of the clang flag
("-fpossibly-wrong-but-maybe-really-helpful-debug-info? Yes please!")
themselves.

We could make it CodeView-only and not touch the DWARF bits. That should address lldb concerns.

I agree with dblaikie that having two codepaths here long-term seems undesirable, but coming up with a grand unified scheme is going to take time, and ideally we wouldn't have to block switching chrome/win to clang on this work.

@adrian-prantl
Collaborator

Adrian: Where're you at, philosophically, with all this. Are you of the opinion that no false positive regression is acceptable? Or are some acceptable (with the acknowledgement that they are bugs that should be fixed and should be identified, tracked, have a likely path to resolution, etc) to reduce false negatives ("no location" when there's really an obvious, unambiguous, unavoidable location that should be reported in the DWARF)?

As I said earlier in the thread I strongly believe that we should not knowingly introduce new false positives (unless there is a way to mark them in the debug info as not trustworthy and debuggers will actually surface that to the user). If the user can't trust some of the variable values, every value reported by the debugger becomes not trustworthy.

I also think that this discussion is misguided, because I am convinced that there is a lot of lower-hanging fruit to pick that does not involve watering down the debug info with not-to-be-trusted information. We have lots of documented PRs of optimizations dropping dbg.values that can be fixed without introducing any uncertainty and I think we should focus our attention there.

I am also talking with Reid to work out a plan to fix the issue described in this PR without introducing new false positives — this will need a little more design but it is worth taking the time to solve this correctly. We have a lot of the necessary components already in place.

@rnk
Collaborator Author

rnk commented Aug 16, 2017

Keno wrote an alternative proposal, essentially trying to improve on the inference that instcombine does today: https://gist.github.com/Keno/480b8057df1b7c63c321

I'm proposing that we remove instcombine's conversion of dbg.declare and push that responsibility onto the passes that delete stores.

@rnk
Collaborator Author

rnk commented Aug 17, 2017

Adrian and I sat down and worked out a long term plan to improve things in this area without knowingly introducing any inaccuracies in variable info. For a long time, we've discussed unifying dbg.declare and dbg.value. We can't mix dbg.declare and dbg.value, so to fix this bug, we should standardize everything on dbg.value, and then we don't need instcombine or any other pass to lossily convert dbg.declare to dbg.value. The steps are:

  • Add a flag to clang to always emit dbg.value+DW_OP_deref for local variables instead of dbg.declare. Maybe call it -gdbg-value? Document the new scheme in SourceLevelDebugging.rst.

  • Teach all passes that eliminate dead stores (primarily DSE & instcombine) to insert dbg.value calls when deleting stores to allocas with a dominating dbg.value+deref. This logic would be factored into a shared utility, like salvageDebugInfo. One subtlety is that we need to insert dbg.value+deref calls at the store that kills the store being eliminated, indicating that the value once again lives in memory.

  • Test the flag, flip the default when it's ready, remove the flag in the next release.

In the long run, this will make it so we start printing variable locations as assembly comments at -O0, which is a nice side benefit.
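
A rough sketch of what the proposed scheme might look like in IR (hypothetical; metadata numbers reused from the first example for x, and the exact spelling is subject to the design work above):

  %x = alloca i32, align 4
  ; emitted by the frontend instead of llvm.dbg.declare: x lives in its alloca
  call void @llvm.dbg.value(metadata i32* %x, metadata !15, metadata !DIExpression(DW_OP_deref)), !dbg !21
  %call = call i32 @getint(), !dbg !22
  store i32 %call, i32* %x, align 4, !dbg !21

  ; if a pass later deletes a store of %v to %x that is killed by another store, it would insert
  ;   call void @llvm.dbg.value(metadata i32 %v, metadata !15, metadata !DIExpression()), !dbg ...
  ; at the deleted store, and
  ;   call void @llvm.dbg.value(metadata i32* %x, metadata !15, metadata !DIExpression(DW_OP_deref)), !dbg ...
  ; at the killing store, to indicate the value once again lives in memory.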

@adrian-prantl
Collaborator

I think we also had on the list:

  • Add support for memory locations in LiveDebugValues

We also identified two nice-to-have features (these are things that currently don't work either):

  • insert a dbg.value(undef) at the point where a lifetime.end marker is generated (or directly in the stack coloring pass)

  • let clang generate allocas for indirectly passed function arguments (e.g., sret)

We also came to the conclusion that — given the input comes from a well-behaved frontend, and we shall document what exactly that means* — it should never be necessary to insert dbg.values for load or call instructions.

*) A well-behaved frontend produces an alloca for a variable and generates a store to the alloca each time the variable is modified. It further generates a dbg.value describing the variable in the alloca as the one true location for the variable. This contract is what allows us to pay attention to dead store elimination (cf. Reid's second point).

By describing allocas with dbg.value we have the opportunity to get a more precise start location for variable liveness, but we need to be careful not to regress the DWARF representation at -O0 to a location list, since this would prevent the debugger from letting the user assign new values to the variable (implicit vs. explicit location description). We could represent stack-allocated variables at -O0 with frame index location + a DW_AT_start_scope attribute to achieve this.

@hfinkel
Collaborator

hfinkel commented Aug 17, 2017

instead of dbg.declare. Maybe call it -gdbg-value? Document the new scheme in SourceLevelDebugging.rst.

So you want to give users the option to try it?

I'd rather not introduce temporary user-facing options. Could this just be a -Xclang (-cc1) option?

@rnk
Collaborator Author

rnk commented Aug 18, 2017

  • Add support for memory locations in LiveDebugValues

Yep, I realized I forgot to post that while discussing this in person with Zach.

We also identified two nice-to-have features (these are things that
currently don't work either):

  • insert a dbg.value(undef) at the point where a lifetime.end marker is
    generated (or directly in the stack coloring pass)

It's true, stack coloring needs to add DBG_VALUE MIR instructions on lifetime end. I haven't actually been able to craft a C test case that forces confusing stack colorings to occur, so I don't think it's very high priority. (I tried hard! :)
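
For reference, an IR-level sketch of that nice-to-have (the real fix would more likely insert DBG_VALUE in stack coloring, as noted above), reusing z's metadata from the first example:

  call void @llvm.lifetime.end.p0i8(i64 4, i8* %6), !dbg !40
  ; hypothetical: z's location is unknown from here on
  call void @llvm.dbg.value(metadata i32 undef, metadata !18, metadata !DIExpression()), !dbg !40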

  • let clang generate allocas for indirectly passed function arguments (e.g.,
    sret)

This works, but I filed llvm/llvm-bugzilla-archive#34227 as an alternative to that.

By describing allocas with dbg.value we have the opportunity to get a more
precise start location for variable liveness, but we need to be careful not
to regress the DWARF representation at -O0 to a location list, since this
would prevent the debugger from letting the user assign new values to the
variable (implicit vs. explicit location description). We could represent
stack-allocated variables at -O0 with frame index location + a
DW_AT_start_scope attribute to achieve this.

Indeed, preserving the -O0 experience is important.

So you want to give users the option to try it?

Yes. We don't have a lot of good tools for evaluating debug info quality beyond deploying changes to users and giving them flags to flip. =/ Adrian said he's trying to upstream his tool to calculate DW_AT_location coverage, but it's still an approximation of quality.

I'd rather not introduce temporary user-facing options. Could this just be a
-Xclang (-cc1) option?

I'd prefer it if users didn't have to know about -Xclang, but I don't care that much.

@adrian-prantl
Collaborator

I'd rather not introduce temporary user-facing options. Could this just be a
-Xclang (-cc1) option?

I'd prefer it if users didn't have to know about -Xclang, but I don't care that much.

Since the point of the option is to enable incremental development of the feature until it is ready to become the default, users ideally will never have to interact with the option, because they won't want to turn it on until it's ready for prime time. Driver options are expensive compared to cc1 options, because we typically have to support them indefinitely, so making this (temporary!) option a cc1 option seems right to me.

@pogo59
Collaborator

pogo59 commented Aug 21, 2017

Sorry for being late to the party, I've been away a couple of weeks.

There's always a tension between providing "truthful" info and
"potentially useful" info. IMO the way to build users' trust in
the value of debug info (especially optimized) is to provide truthful
info. Half-truths make the debugger into an unreliable tool.
(Personally I use debuggers only as a last resort.)

P.S. Using instructions to convey debug info is a poor design, period.
It causes all kinds of other problems because people writing passes
rarely think about the effect on debug info, and we've fixed a bunch
of cases where instructions/uses/etc were counted inappropriately.
I'd rather be thinking about a design that eliminates all dbg.*
instructions.

@rnk
Collaborator Author

rnk commented Aug 21, 2017

P.S. Using instructions to convey debug info is a poor design, period.
It causes all kinds of other problems because people writing passes
rarely think about the effect on debug info, and we've fixed a bunch
of cases where instructions/uses/etc were counted inappropriately.
I'd rather be thinking about a design that eliminates all dbg.*
instructions.

I agree, but fundamentally we need to represent some placeholder in the instruction stream. I think without radically changing the MI and IR basic block representations, that means we represent variable tracking info with elements in the linked list, regardless of how we name them. The best we can do is probably to just add filtering instruction iterators that skip dbg instructions.

The use-list issues seem like they have been fixed since 2014, when we added ValueAsMetadata / MetadataAsValue, right? I guess this might still be an issue in MIR.

@pogo59
Collaborator

pogo59 commented Aug 21, 2017

Filtering iterators helps. We still have cases of -g affecting codegen
but all the really egregious cases have been taken care of, at this point.

Eliminating dbg.* instructions is probably a topic for a BoF or something.
I'll just point out that when I first looked at dbg.declare, it seemed
that it added exactly one item of info to what alloca already had, and
that really didn't seem worth having a whole separate instruction.
Another kind of metadata on alloca should have worked just as well.

@chrisjbris

Some experimental improvements to LowerDbgDeclare brought this issue to our attention recently, initially while looking at issue llvm/llvm-bugzilla-archive#45667.

Messages #25 and #26 on this bug (34136) outline a decent plan to begin improving debug info for the escaped variables.

We intend to start addressing this issue. Is anyone currently working on it, and are there any objections to our implementing improvements now?

@zmodem
Collaborator

zmodem commented Jul 20, 2020

We intend to start addressing this issue, is anyone currently working on
this,
are there any objections to our now implementing improvements?

Sounds great. I'm not aware of anyone working on it from our side.

@OCHyams
Contributor

OCHyams commented Jan 11, 2021

I have just reported an issue which seems relevant to this thread: llvm/llvm-bugzilla-archive#48719 .

The bug report shows how we can run into incorrect debug info for an escaped variable when promoting or partially promoting it introduces PHI instructions, combined with SimplifyCFG squashing the PHI block and its preds together.

@OCHyams
Contributor

OCHyams commented Nov 27, 2021

mentioned in issue llvm/llvm-bugzilla-archive#47946

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021
@OCHyams
Contributor

OCHyams commented Aug 22, 2023

I think it might be time to close this ticket. Using assignment tracking, which is enabled by default for non-O0, non-LTO builds that are not tuned for LLDB (conditions in code), we emit a single location - the stack location - for all the variables in the original reproducer (godbolt).

With assignment tracking disabled we still get better coverage for the variables than was reported in the original summary (godbolt). Each variable has a location list: eax until eax is clobbered, then the stack location starting at the escaping call for that variable.

@OCHyams OCHyams closed this as completed Aug 22, 2023
@rnk
Collaborator Author

rnk commented Aug 22, 2023

What is the story for llvm::LowerDbgDeclare? I think Chrome still uses the flag I added as a workaround to turn that off:
https://source.chromium.org/chromium/chromium/src/+/main:build/config/compiler/BUILD.gn;l=582?q=instcombine-lower-dbg-declare&ss=chromium&start=11

For tracking purposes, is there some other issue Chromium should follow which tracks what we need to do to remove that flag and still get good debug info?

@OCHyams
Contributor

OCHyams commented Aug 22, 2023

What is the story for llvm::LowerDbgDeclare?

LowerDbgDeclare only affects dbg.declares, and assignment tracking replaces most dbg.declares with a new intrinsic. So if your builds meet the "requirements" (see previous comment) to have assignment tracking enabled then you should in theory get nearly identical binaries once you remove -instcombine-lower-dbg-declare=0. I say "nearly identical" because assignment tracking falls back to dbg.declares for some cases it can't handle yet.

If your builds can't/don't use assignment tracking then LowerDbgDeclare will still be an issue (although the resultant variable location coverage has improved somewhat since the ticket was filed).


Out of interest, if you're able to share, what debugger(s) do you support? And do your builds use debugger tuning flags?

I think Chrome still uses the flag I added as a workaround to turn that off

It might be worth checking if assignment tracking has solved this problem for you.

If you are able to easily check, what do the coverage stats from llvm-locstats.py look like (possibly worth looking at the full llvm-dwarfdump --statistics stats too) for a build with:

A) -instcombine-lower-dbg-declare=0
B) -instcombine-lower-dbg-declare=1
C) -instcombine-lower-dbg-declare=1 -Xclang -fexperimental-assignment-tracking=false

Build C is just to check that assignment tracking is indeed enabled for at least some of the compilations (a complex enough build setup could reasonably have it enabled for some parts and disabled automatically for others).

For tracking purposes, is there some other issue Chromium should follow which tracks what we need to do to remove that flag and still get good debug info?

Not that I know of - please feel free to re-open this if you're still using it to track that work. It might also be reasonable to create a new ticket, I don't mind either way personally.

@rnk
Collaborator Author

rnk commented Aug 22, 2023

I think most debugging happens on Windows, so we're using PDBs and CodeView. Reading the clang sources suggests that we are still using the default debugger tuning of gdb. We really ought to add an entry in the tuning enum to cover Windows debuggers.

In any case, it sounds like we aren't using the new mode yet, and the next step is to try it and evaluate it.
