Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

llvm-gsymutil is failing with "child range not contained in parent" error #46501

Open
petrhosek opened this issue Aug 13, 2020 · 17 comments
Open
Labels
bugzilla Issues migrated from bugzilla debuginfo

Comments

@petrhosek
Copy link
Member

Bugzilla Link 47157
Version trunk
OS All
CC @dwblaikie,@JDevlieghere,@walkerkd,@pogo59,@frobtech,@thanm

Extended Description

I haven't yet managed to create a minimal reproducer, so here is a full binary for now until I manage to reduce it: https://storage.googleapis.com/fuchsia-build/basic_fuzzer_test.lzma

To reproduce the issue, simply run:

$ llvm-gsymutil --convert=basic_fuzzer_test --out-file=basic_fuzzer_test.gsym
...
llvm-gsymutil: error: basic_fuzzer_test: child range not contained in parent

This binary is somewhat unusual in that it's a Go fuzzer (compiled with the Go compiler) that calls into C++ test harness (compiled with Clang), so the binary includes an entire Go runtime and the generated DWARF looks somewhat different from DWARF generated by Clang.

@dwblaikie
Copy link
Collaborator

running llvm-dwarfdump --verify on this file gives a more informative error related to the "child range not contained in parent" including dumping the relevant DIEs. The first example the verifier prints is:

0x00145adb: DW_TAG_inlined_subroutine
DW_AT_abstract_origin (0x000000000014199a "runtime.lockOSThread")
DW_AT_ranges (0x0001b790
[0x000000000007dc1f, 0x000000000007dc2e)
[0x000000000007dc2e, 0x000000000007dc34)
[0x000000000007dc43, 0x000000000007dc44))
DW_AT_call_file ("./runtime/cgocall.go")
DW_AT_call_line (190)

0x00145aea: DW_TAG_inlined_subroutine
DW_AT_abstract_origin (0x00000000001419b3 "runtime.dolockOSThread")
DW_AT_ranges (0x0001b7e0
[0x000000000007dc34, 0x000000000007dc43)
[0x000000000007dc47, 0x000000000007dc4e)
[0x000000000007dc4e, 0x000000000007dc52))
DW_AT_call_file ("./runtime/proc.go")
DW_AT_call_line (3675)

And that certainly looks like bogus DWARF to me. The first range in the runtime.doLockOSThread is exactly not contained in the outer, runtime.lockOSThread range list, I think.

@llvmbot
Copy link
Collaborator

llvmbot commented Aug 13, 2020

The main questions is should llvm-gsymutil be returning an error, or just a warning, when it encounters invalid DWARF? I am fine with either. There are many LTO binaries that have this issue, and it would be a shame to not generate a gsym if the DWARF has any errors.

@dwblaikie
Copy link
Collaborator

I don't know of any aspect of LLVM's LTO implementation that could lead to this DWARF being emitted, FWIW. The way scopes are represented makes it pretty impossible to have a nested scope that doesn't have a nested address range, to the best of my knowledge.

@llvmbot
Copy link
Collaborator

llvmbot commented Aug 13, 2020

Which LTO? Monolithic LTO of Thin LTO? Or Both?

@dwblaikie
Copy link
Collaborator

Either - it's baked into the IR representation to the best of understanding of it.

Any given IR instruction may have a debug location, that debug location has a chain of scopes - we only put them inside each other in the DIE tree when they're in that chain, and their address ranges are derived from that scope chain - I don't know of any way a child scope could include an address/range that is not included in its parent scope, due to the nature of this implementation strategy.

(there was a bug recently about instruction bundles (used on Hexagon and Sparc at least) - that results in a correct line table, but missing scopes in the DIE tree - but that wouldn't result in children with addresses not included in their parent so far as I can imagine)

Essentially this applies to any IR (single compilation, Thin or full LTO, JIT, weird hand-crafted IR, etc) - the nature of the way scoping is represented makes it, to the best of my knowledge, fairly impossible (always the possibility of bugs, to be sure - but I have a hard time imagining the shape of the bugs necessary to allow it in this situation) to have a child scope include addresses that are not in the parent scope.

@petrhosek
Copy link
Member Author

petrhosek commented Aug 13, 2020

This looks a like an issue in the Go compiler, so it's something we need to file a bug for with Go maintainers.

Whether llvm-gsymutil should just print a warning or fail, that's a great question. I'm not sure if there's any precedent among LLVM tools, but it'd be nice to make that behavior configurable, for example by having a -strict mode (other suggestions are welcome).

The way I discovered this issue (as well as #46494) is by trying to enable llvm-gsymutil in our build as a post-processing step. Currently, that fails the build because of these issues, and while I hope that we can address those, it may take some time, especially if we need to fix other compilers, and so I'd like to make these errors non-fatal in the meantime even if it means that we aren't going to produce GSYM files for some binaries. We could achieve the same by wrapping llvm-gsymutil in a script that discards the exit code, but having a flag would be nicer.

@llvmbot
Copy link
Collaborator

llvmbot commented Aug 14, 2020

This brings up an issue that we have had for a while: compilers and linkers generating DWARF with no checks on the validity of the produced DWARF. llvm-gsymutil is catching these kinds of issues when it parses the DWARF and bringing the issues to our attention. The functionality behind the "llvm-dwarfdump --verify" is available in:

bool DWARFContext::verify(raw_ostream &OS, DIDumpOptions DumpOpts);

So it is easy for tools that are based on llvm/clang sources to integrate this into debug builds of the drivers for the compiler and linker. We will probably need to make some modifications to the verification code to ensure it handles all files (unlinked .o files, linked executables and dynamic libraries) correctly, but it is a great way to verify the DWARF is correct before we pass it on to other tools.

So sounds like this bug should probably be switched over to the Go compiler to fix this DWARF issue?

@petrhosek
Copy link
Member Author

This was already filed as an issue against Go over a year ago but so far there hasn't been any response: golang/go#33188

@petrhosek
Copy link
Member Author

For reference, this is a workaround implemented by delve: go-delve/delve#1807

@thanm
Copy link
Contributor

thanm commented Aug 14, 2020

A colleague of mine is asking for clarification on whether this particular oddity is definitely illegal according to the DWARF spec. Is there a spot in the specification one could point to that disallows this particular insanity? Thanks.

@dwblaikie
Copy link
Collaborator

A colleague of mine is asking for clarification on whether this particular
oddity is definitely illegal according to the DWARF spec. Is there a spot
in the specification one could point to that disallows this particular
insanity? Thanks.

Don't think so, no. Believe it falls under the whole "DWARF is a permissive standard" and there are certainly some non-trivial cases where it could make sense. A C++ lambda or other local type with member functions defined inline (couldn't be defined anywhere else) could be represented as a truly nested type (so you'd have subprogram { structure_type { subprogram { } }) but it might not make sense to describe the local type's member functions as being within the range outer subprogram - they are distinct functions.

@dwblaikie
Copy link
Collaborator

(that case might not be warned on by llvm-dwarfdump --verify, though... maybe - I mean, I don't think GCC or LLVM produce that sort of DWARF anyway so any implementation there might be accidental)

@llvmbot
Copy link
Collaborator

llvmbot commented Aug 14, 2020

The other thing to think about is what clients that use this DWARF will expect. For example if you want to symbolicate an address to get the function name and scope correctly, clients will typically check the DW_TAG_compile_unit's ranges and see if the address falls within those ranges and if the address does, we search all of the DIEs. Most code I have seen will, when it encounters a DW_TAG_subprogram DIEs, just check that address range of that DIE and skip to the sibling of that DW_TAG_subprogram and continuing to search. Accelerator tables in DWARF also just point you to a compile unit that you must linearly search. LLVM code will fail to do the lookup on this DWARF because the code does exactly this:

DWARFContext::DIEsForAddress DWARFContext::getDIEsForAddress(uint64_t Address) {
  DIEsForAddress Result;

  DWARFCompileUnit *CU = getCompileUnitForAddress(Address);
  if (!CU)
    return Result;

  Result.CompileUnit = CU;
  Result.FunctionDIE = CU->getSubroutineForAddress(Address);

  std::vector<DWARFDie> Worklist;
  Worklist.push_back(Result.FunctionDIE);
  while (!Worklist.empty()) {
    DWARFDie DIE = Worklist.back();
    Worklist.pop_back();

    if (!DIE.isValid())
      continue;

    if (DIE.getTag() == DW_TAG_lexical_block &&
        DIE.addressRangeContainsAddress(Address)) {
      Result.BlockDIE = DIE;
      break;
    }

    for (auto Child : DIE)
      Worklist.push_back(Child);
  }

  return Result;
}

So all lookups will fail if your DWARF is as it currently emitted for any addresses in "runtime.dolockOSThread".

@pogo59
Copy link
Collaborator

pogo59 commented Aug 17, 2020

The other thing to think about is what clients that use this DWARF will
expect. For example if you want to symbolicate an address to get the
function name and scope correctly, clients will typically check the
DW_TAG_compile_unit's ranges and see if the address falls within those
ranges and if the address does, we search all of the DIEs. Most code I have
seen will, when it encounters a DW_TAG_subprogram DIEs, just check that
address range of that DIE and skip to the sibling of that DW_TAG_subprogram
and continuing to search.

Hmmm that's making an assumption about nested subprograms that isn't valid
for all languages. The skip-to-sibling works for C because it doesn't have
nested functions, nor did C++ until relatively recently.
But for something like Pascal nested procedures, I'd expect the DIE tree to
reflect the lexical parentage, while the generated-code range of the inner
procedure would not be contained within the generated-code range of its parent.
The local class and lambda cases in C++ should follow a similar pattern IMO.

The situation is different for inlined subprograms, which obviously should
be contained within the caller's code range. But that's different from the
lexical-parent case.

@dwblaikie
Copy link
Collaborator

The other thing to think about is what clients that use this DWARF will
expect. For example if you want to symbolicate an address to get the
function name and scope correctly, clients will typically check the
DW_TAG_compile_unit's ranges and see if the address falls within those
ranges and if the address does, we search all of the DIEs. Most code I have
seen will, when it encounters a DW_TAG_subprogram DIEs, just check that
address range of that DIE and skip to the sibling of that DW_TAG_subprogram
and continuing to search.

Hmmm that's making an assumption about nested subprograms that isn't valid
for all languages. The skip-to-sibling works for C because it doesn't have
nested functions, nor did C++ until relatively recently.
But for something like Pascal nested procedures, I'd expect the DIE tree to
reflect the lexical parentage, while the generated-code range of the inner
procedure would not be contained within the generated-code range of its
parent.
The local class and lambda cases in C++ should follow a similar pattern IMO.

Yeah, but for now they don't. GCC and LLVM always put their subprogram definitions outside any nesting like that (I think GCC goes further and puts subprogram definitions always as direct children of the compile_unit - won't even define a subprogram in a namespace - instead it'll declare it in the namespace and define it at the root/CU scope), so at least some consumers (including llvm-symbolizer, for instance) rely on this - it's faster, means you can skip whole subtrees when searching for "which DWARF describes this instruction address". It doesn't handle all possible DWARF - but given how permissive DWARF is, I'm not sure that's a practical goal, generally.

(I wonder what GCC's DWARF for the Pascal example you gave is - I'm guessing they do something like they do with C++ Examples and declare the nested function inside the outer function, but define the nested function at the CU/root scope of the DIE tree)

@petrhosek
Copy link
Member Author

I wonder if we could adopt the same solution delve did in go-delve/delve#1807 for llvm-gsymutil to make it more tolerant? While it'd be ideal to address this in Go, it might take a while until that happens, and even longer until all binaries are rebuilt using the new compiler.

@frobtech
Copy link
Contributor

DWARF is explicitly a loose specification and it's always wise for DWARF consumers to be flexible in their expectations of the input. Rather than demanding the input have a certain shape, always do the best you can discern the most useful interpretation of the input you get. When the input has an unexpected shape that makes it hard to be confident what interpretation is really most useful, then warn about the data but rarely should such a case be a hard error as invalid input data.

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugzilla Issues migrated from bugzilla debuginfo
Projects
None yet
Development

No branches or pull requests

6 participants