[llvm-dev] RFC: CodeView debug info emission in Clang/LLVM

Dave Bartolomeo via llvm-dev

Oct 29, 2015, 1:11:51 PM

to llvm...@lists.llvm.org, cfe...@lists.llvm.org

RFC: CodeView debug info emission in Clang/LLVM

Overview

On Windows, the de facto debug information format is CodeView, most commonly encountered in the form of a .pdb file. This is the format emitted by the Visual C++, C#, and VB.NET compilers, consumed by the Visual Studio debugger and the Windows debugger (WinDbg), and exposed for read-only access via the DIA SDK. The CodeView format has never been publicly documented, and Microsoft has never provided an API for emitting CodeView info for native code. Therefore, Clang and LLVM have only been able to emit the small subset of CodeView information that the community has been able to reverse engineer.

In order to improve the experience of using Clang and other LLVM-based compilers to target Windows, Microsoft has decided to contribute code to the LLVM project to read and write CodeView debug information, including changes to make Clang and LLVM emit CodeView debug information for C and C++ code. This RFC covers the first phase of this work: Emitting CodeView type information for C and C++. The next phase will be to emit CodeView symbol information for functions and their local variables; I’ll send out a separate RFC for that when I get to that phase.

I’ll start with some background on the CodeView format, and then move on to the proposed design.

Overview of the CodeView Debug Information Format

“CodeView” is the name we use to refer to the debug record format generated by the Visual C++ compiler and consumed by the Visual Studio debugger, the Windows debugger (WinDbg), and the DIA SDK. CodeView records are contained in either a .pdb file or in an object file. The CodeView records that describe the debug information for a PE image (i.e. a .dll or .exe) are always contained in a corresponding PDB file. The CodeView records that describe the debug information for a COFF object file (.obj) are contained within the .obj itself, although some of the debug information will be stored in a .pdb file if the .obj was compiled with the /Zi or /ZI option.

When code is compiled with cl.exe using the /Z7, /Zi, or /ZI option, cl.exe generates two well-known sections in the resulting .obj file: “.debug$T” and “.debug$S”. These are known as the “types” section and the “symbols” section, respectively. The types section contains CodeView records that describe all of the data types referenced by symbols in that .obj. The symbols section contains CodeView records that describe all of the symbols defined within the .obj, including functions, global and static data, and local variables. When link.exe is invoked with the /debug option, all of the debug information from the contributing .obj files is combined into a single .pdb file for the linked image.

The .debug$T Section

The types section of the .obj file contains a short header consisting solely of the version number of the CodeView types format (currently equal to 4), followed by a sequence of CodeView type records. Each type record starts with a 16-bit field holding the length of the record, followed by a 16-bit tag field that identifies the kind of type described by the record. The format of the remainder of the record depends on the tag. Common type record kinds include:

- Pointer

- Array

- Function

- Struct

- Class

- Union

- Enum
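
To make the record layout above concrete, here is a rough Python sketch of walking a types section. It assumes the version header is a 32-bit little-endian value and that each record's length field counts the bytes that follow the length field itself (tag included); neither detail is spelled out above, and the tag values in the example are purely illustrative.

```python
import struct

def parse_type_records(section: bytes):
    """Return a list of (tag, payload) pairs from a .debug$T-style byte string."""
    # Assumption: the header is a single 32-bit little-endian version number.
    (version,) = struct.unpack_from("<I", section, 0)
    if version != 4:
        raise ValueError(f"unexpected CodeView types version {version}")
    offset = 4
    records = []
    while offset < len(section):
        (length,) = struct.unpack_from("<H", section, offset)
        # Assumption: length excludes the 2-byte length field but includes the tag.
        if length < 2:
            raise ValueError("record too short to hold a tag")
        (tag,) = struct.unpack_from("<H", section, offset + 2)
        payload = section[offset + 4 : offset + 2 + length]
        records.append((tag, payload))
        offset += 2 + length
    return records

def make_record(tag: int, payload: bytes) -> bytes:
    """Build one record using the same assumed layout (helper for the example)."""
    return struct.pack("<HH", 2 + len(payload), tag) + payload

# Two records with illustrative tag values and opaque payloads.
example_section = (
    struct.pack("<I", 4)
    + make_record(0x1002, b"\x74\x00\x00\x00")
    + make_record(0x1503, b"")
)
parsed = parse_type_records(example_section)
```

The parser never interprets the payload; only the tag decides how the remainder of a record would be decoded.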

Duplicate type records are folded based on a binary comparison of their contents. Thus, there will be only a single instance of the type record for ‘const char*’ in a given types section, regardless of the number of uses of that type.
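
A minimal sketch of this folding, assuming records are handled as opaque byte strings and identified here simply by their position in the section (the class and method names are made up for illustration):

```python
class FoldingTypeTable:
    """Stores each distinct record once, keyed on its exact bytes."""

    def __init__(self):
        self.records = []      # records in emission order
        self._position = {}    # record bytes -> position in self.records

    def add(self, record: bytes) -> int:
        """Return the position of `record`, appending it only if it is new."""
        position = self._position.get(record)
        if position is None:
            position = len(self.records)
            self.records.append(record)
            self._position[record] = position
        return position

table = FoldingTypeTable()
first = table.add(b"pointer-to-const-char")   # placeholder bytes, not a real record
other = table.add(b"array-of-int")
again = table.add(b"pointer-to-const-char")   # folded into the first entry
```

Because the comparison is purely binary, two records that describe the same type but differ in any byte would not be folded.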

When one type record needs to refer to another type record (e.g. a Pointer record referring to the record that describes the referent type of the pointer), it uses a 32-bit “type index”, usually abbreviated “TI”. A TI with a value less than 0x1000 refers to a well-known type for which no type record actually exists. Examples include primitive types like ‘int’ or ‘wchar_t’, and simple pointers to these primitive types. A TI with a value of 0x1000 or greater refers to another type record in the types section, whose zero-based index is determined by subtracting 0x1000 from the value of the TI. It is an invariant of the types section that a given type record may only use a TI to refer to type records defined earlier in the types section. Thus, no cycles are possible. In order to support types with cyclic dependencies, user-defined types (class, struct, union, enum) can have two records for each type: one to describe the forward declaration, and one to describe the definition. Other records refer to the forward declaration of the type, and only the definition record contains the member list of the type. The debugger matches a forward declaration with its definition based on the qualified name of the type.
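
The TI arithmetic and the "earlier records only" invariant can be sketched as follows. Each record is reduced to the list of TIs it references; the function names are hypothetical, and 0x74 is used only as an arbitrary value below 0x1000.

```python
FIRST_RECORD_TI = 0x1000

def is_well_known(ti: int) -> bool:
    """TIs below 0x1000 name built-in types that have no record."""
    return ti < FIRST_RECORD_TI

def record_position(ti: int) -> int:
    """Zero-based position of the record that a TI refers to."""
    if is_well_known(ti):
        raise ValueError("well-known types have no record")
    return ti - FIRST_RECORD_TI

def refers_only_backwards(records) -> bool:
    """records[i] is the list of TIs referenced by the record whose TI is 0x1000 + i."""
    for position, referenced in enumerate(records):
        own_ti = FIRST_RECORD_TI + position
        for ti in referenced:
            if not is_well_known(ti) and ti >= own_ti:
                return False
    return True

# struct Node { int value; Node *next; } laid out without cycles:
node_records = [
    [],              # 0x1000: forward declaration of Node
    [0x1000],        # 0x1001: pointer to the forward declaration
    [0x74, 0x1001],  # 0x1002: member list (value, next)
    [0x1002],        # 0x1003: definition of Node, pointing at its member list
]
```

The pointer refers to the forward declaration rather than the definition, which is what keeps every reference pointing backwards.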

Type indices are also used within the .debug$S section to refer to types in the .debug$T section.

If a given .obj file was compiled with the /Zi or /ZI option, the type records for that .obj are stored in a separate .pdb file, rather than in the .obj file itself. The records in the PDB have exactly the same format as those in the .obj, so there is essentially no functional difference in the debug info itself.

When the linker generates the .pdb for an image, it creates a single types section in the .pdb consisting of the transitive closure of all of the type records referenced by any symbol in any of the contributing .objs, with any type indices suitably fixed up to refer to the correct record in the merged types section.
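
A sketch of that merge step under simplifying assumptions: each record is modelled as a (description, referenced TIs) pair instead of raw bytes, and each .obj is a (records, symbol TIs) pair. This is only an illustration of the closure-plus-fixup idea, not a description of what link.exe actually does.

```python
FIRST_RECORD_TI = 0x1000

def merge_types(objs):
    """Merge per-.obj type records into one list, remapping TIs.

    objs: list of (records, symbol_tis); a record is (description, [TIs]).
    Returns (merged_records, remapped symbol TIs for each .obj).
    """
    merged = []
    merged_ti = {}          # remapped record -> TI in the merged section
    all_symbol_tis = []
    for records, symbol_tis in objs:
        remap = {}          # TI in this .obj -> TI in the merged section

        def visit(ti):
            if ti < FIRST_RECORD_TI:
                return ti   # well-known types need no fixup
            if ti in remap:
                return remap[ti]
            description, referenced = records[ti - FIRST_RECORD_TI]
            # Referenced records are visited first, so they land earlier.
            fixed = (description, tuple(visit(r) for r in referenced))
            new_ti = merged_ti.get(fixed)
            if new_ti is None:
                new_ti = FIRST_RECORD_TI + len(merged)
                merged.append(fixed)
                merged_ti[fixed] = new_ti
            remap[ti] = new_ti
            return new_ti

        all_symbol_tis.append([visit(ti) for ti in symbol_tis])
    return merged, all_symbol_tis

obj_a = ([("int*", [0x74]), ("unused", []), ("int**", [0x1000])], [0x1002])
obj_b = ([("char*", [0x70]), ("int*", [0x74])], [0x1001, 0x1000])
merged, symbol_tis = merge_types([obj_a, obj_b])
```

Only records reachable from some symbol are copied, so the "unused" record in the first .obj disappears, and the two "int*" records collapse into one.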

The .debug$S Section

The symbols section of the .obj file contains several substreams to describe the symbols defined in that .obj. The most common substreams are:

- Line Numbers: Contains mappings from code address ranges to source file, line, and column.

- Source File Info: Contains the file names and file hashes of source files referenced in the Line Numbers stream.

- Symbols: Contains symbol records that describe functions and variables.
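
As an illustration of how a debugger might consume the Line Numbers mapping, here is a small lookup sketch. The in-memory table shape (sorted, non-overlapping half-open address ranges) is an assumption made for the example and says nothing about the on-disk encoding.

```python
import bisect

def lookup_line(line_table, address):
    """Return (file, line, column) covering `address`, or None.

    line_table: list of (start, end, file, line, column) tuples sorted by
    start, with non-overlapping half-open ranges [start, end).
    """
    starts = [entry[0] for entry in line_table]
    i = bisect.bisect_right(starts, address) - 1
    if i >= 0 and address < line_table[i][1]:
        return line_table[i][2:]
    return None

line_table = [
    (0x10, 0x18, "a.c", 3, 5),
    (0x18, 0x30, "a.c", 4, 1),
]
```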

The Symbols substream is a sequence of records that, like the type records, each begin with a 16-bit size and a 16-bit tag. Common symbol record kinds include:

- Global Data

- Function

- Block Scope

- Stack Frame

- Frame Pointer-Relative Variable

- Register-Relative Variable

- Enregistered Variable

Unlike type records, some symbol records can be nested. For example, Function records usually contain a Stack Frame record, local variable records, and Block Scope records. Block Scope records can in turn contain more local variable and Block Scope records.
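
A sketch of rebuilding that nesting from a flat record sequence. It assumes scopes are closed by an explicit end marker, which the description above does not mention; the "End" kind and the helper name are hypothetical.

```python
OPENS_SCOPE = {"Function", "Block Scope"}
END = "End"  # hypothetical scope terminator, assumed for this sketch

def nest_symbols(flat_kinds):
    """Turn a flat list of record kinds into a tree of {'kind', 'children'} dicts."""
    root = {"kind": "root", "children": []}
    stack = [root]
    for kind in flat_kinds:
        if kind == END:
            if len(stack) == 1:
                raise ValueError("end marker without an open scope")
            stack.pop()
            continue
        node = {"kind": kind, "children": []}
        stack[-1]["children"].append(node)
        if kind in OPENS_SCOPE:
            stack.append(node)
    if len(stack) != 1:
        raise ValueError("unterminated scope")
    return root["children"]

flat = [
    "Function", "Stack Frame", "Register-Relative Variable",
    "Block Scope", "Enregistered Variable", END,
    END,
    "Global Data",
]
tree = nest_symbols(flat)
```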

When a symbol record needs to refer to a data type, it uses a TI that refers to a record in the types section for the .obj.

When the linker generates the .pdb for an image, it creates a separate symbols section in the .pdb for each contributing .obj. The contents of the .obj’s symbols section are copied into the corresponding section in the .pdb, fixing up any TIs to refer to the types section of the .pdb, and fixing up any code or data addresses to refer to the correct location in the final linked image.

Proposed Design

How Debug Info is Generated

The CodeView type records for a compilation unit will be generated by the front-end for the source language (Clang, in the case of C and C++). The front-end has access to the full type system and AST of the language, which is necessary to generate accurate debug type info. The type records will be represented as metadata in the LLVM IR, similar to how DWARF debug info is represented. I’ll cover the actual representation in a bit more detail below.

The LLVM back-end will be responsible for emitting the CodeView type records from the IR into the output .obj file. Since the type records will already be in the correct format, this is essentially just a copy. No inspection of the type records is necessary within LLVM. The back-end will also be responsible for generating CodeView symbol records, line numbers, and source file info for any functions and data defined in the compilation unit. The back-end is the logical place to do this because only the back-end knows the code addresses, data addresses, and stack frame layouts.

Representation of CodeView in LLVM IR

DICompileUnit

+ existing fields

+ CodeViewTypes : DICodeViewTypes

DICodeViewTypes

+ TypeRecords : MDString[]

+ UDTSymbols : DICodeViewUDT[]

DICodeViewUDT

+ Name : MDString

+ TypeIndex : uint32_t

DIVariable

+ existing fields

+ TypeIndex : uint32_t

DISubprogram

+ existing fields

+ TypeIndex : uint32_t

The existing DICompileUnit node will have a new operand named CodeViewTypes, which points to the new DICodeViewTypes node that describes the CodeView type information for the compilation unit.

The DICodeViewTypes node contains two operands:

- TypeRecords, an array of MDStrings containing the actual CodeView type records for the compilation unit, sorted in ascending order of type index.

- UDTSymbols, an array of DICodeViewUDT nodes describing the user-defined types (class/struct/union/enum) for which CodeView symbol records will need to be emitted by the back-end.

The DICodeViewUDT node contains two operands:

- Name, an MDString with the name of the symbol as it should appear in the CodeView symbol record.

- TypeIndex, a uint32_t holding the CodeView type index of the type record for the user-defined type’s definition.

The DICodeViewUDT nodes are necessary because they are generally the only references to the definition of the user-defined type. Other uses of that type refer to the forward declaration record for the type, and without a reference to the definition of the type, the linker will discard the definition record when it merges the type information into the PDB.

To specify the CodeView type for a variable or function, the DIVariable and DISubprogram nodes will have an additional TypeIndex operand containing the type index of the type record for that variable or function’s type. This operand will be set to zero when CodeView debug info is not enabled.
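
To illustrate how these nodes fit together, here is a Python model of the proposed metadata. It is not LLVM's API: the dataclasses only mirror the field names above, the record contents are placeholder bytes, and the 32-bit version header is an assumption.

```python
import struct
from dataclasses import dataclass, field
from typing import List

FIRST_RECORD_TI = 0x1000

@dataclass
class DICodeViewUDT:
    name: str          # name as it should appear in the symbol record
    type_index: int    # TI of the UDT's definition record

@dataclass
class DICodeViewTypes:
    type_records: List[bytes] = field(default_factory=list)   # ascending TI order
    udt_symbols: List[DICodeViewUDT] = field(default_factory=list)

@dataclass
class DIVariable:
    name: str
    type_index: int = 0   # zero when CodeView debug info is not enabled

def emit_types_section(types: DICodeViewTypes) -> bytes:
    """Back-end view: the records are already encoded, so this is just a copy."""
    return struct.pack("<I", 4) + b"".join(types.type_records)

def definition_record(types: DICodeViewTypes, udt: DICodeViewUDT) -> bytes:
    """Find the record a UDT symbol keeps alive, using TI - 0x1000 as the index."""
    return types.type_records[udt.type_index - FIRST_RECORD_TI]

types = DICodeViewTypes(
    type_records=[b"fwd:Point", b"members:x,y", b"def:Point"],  # placeholder bytes
    udt_symbols=[DICodeViewUDT(name="Point", type_index=0x1002)],
)
section = emit_types_section(types)
```

Sorting TypeRecords by type index is what lets the back-end recover a record from a TI with a plain subtraction.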

The above representation essentially extends the existing DWARF-focused debug metadata to also include CodeView info. This was the least invasive way I found to add CodeView support, but it may not be the right architectural decision. It would also be possible to have the CodeView metadata entirely separate from the DWARF metadata. This would reduce the size of the IR when only one form of debug information was being emitted, which is presumably the common case. However, I expect it would complicate the scenario where both DWARF and CodeView are being emitted; for example, would having two dbg.declare intrinsics for a single local variable confuse existing consumers of LLVM IR? I’m hoping someone more familiar with the existing debug info architecture can provide some guidance here if there’s a better way of doing this.

New Library - LLVMCodeView

The design introduces a new library in LLVM, “LLVMCodeView”. This library will contain the code to read and write the CodeView debug info format. The library depends only on the LLVMSupport library, enabling non-LLVM clients to use the library without depending on large portions of LLVM. The LLVMCodeView library is not responsible for translating other forms of information (e.g. LLVM IR, Clang ASTs) to the CodeView format; that work happens in other components.

Changes to LLVMCore

The LLVMCore library will be extended with the definitions of the new debug metadata nodes and new fields on existing nodes, as described previously.

Generating CodeView Type Records in Clang

The clangCodeGen library will be extended with a new class, CodeViewTypeTable. This class is the CodeView equivalent of CGDebugInfo. It translates Clang types into the appropriate CodeView type record on demand, returning the type index of the new record. This is where most of the interesting work happens. Since all of the type records for a given image are merged together by the linker when creating the final .pdb, having the type records emitted by Clang match those emitted by cl.exe as closely as possible minimizes conflicts when object files built by the two compilers are linked together into the same image.
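
A toy model of the on-demand translation, to show how a self-referential struct comes out as a forward declaration followed by a definition. This is not the proposed class: the type encoding (strings and tuples), the record tuples, and the method names are invented for the sketch, and the two well-known TI values are what I believe the CodeView constants for 'int' and 'char' to be.

```python
FIRST_RECORD_TI = 0x1000
WELL_KNOWN = {"int": 0x74, "char": 0x70}  # assumed values, for illustration only

class ToyTypeTable:
    """Types are 'int' / 'char', ('ptr', T), or ('struct', name)."""

    def __init__(self, struct_members):
        self.struct_members = struct_members   # name -> [(member name, type)]
        self.records = []
        self._ti_of = {}                       # record -> TI, so repeats are reused

    def _emit(self, record):
        ti = self._ti_of.get(record)
        if ti is None:
            ti = FIRST_RECORD_TI + len(self.records)
            self.records.append(record)
            self._ti_of[record] = ti
        return ti

    def type_index(self, ty):
        """Translate a type on demand and return its TI."""
        if isinstance(ty, str):
            return WELL_KNOWN[ty]
        if ty[0] == "ptr":
            return self._emit(("pointer", self.type_index(ty[1])))
        if ty[0] == "struct":
            # Uses of a struct always refer to its forward declaration.
            return self._emit(("struct-fwd", ty[1]))
        raise ValueError(f"unsupported type {ty!r}")

    def define_struct(self, name):
        """Emit the member list and definition record; return the definition's TI."""
        members = tuple(
            (member, self.type_index(ty)) for member, ty in self.struct_members[name]
        )
        member_list = self._emit(("member-list", members))
        return self._emit(("struct-def", name, member_list))

table = ToyTypeTable(
    {"Node": [("value", "int"), ("next", ("ptr", ("struct", "Node")))]}
)
node_definition = table.define_struct("Node")
```

The definition's TI is the value that would go into a DICodeViewUDT node, since nothing else refers to it.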

Daniel Dilts via llvm-dev

Oct 29, 2015, 3:42:53 PM

to Dave Bartolomeo, llvm...@lists.llvm.org, cfe...@lists.llvm.org

I am really excited to see the work for generating CodeView done.

I have two questions:

1. Will the CodeView information be publicly documented?

2. Will LLD and LLDB be updated as necessary to support CodeView?

_______________________________________________
cfe-dev mailing list
cfe...@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Adrian Prantl via llvm-dev

Oct 29, 2015, 5:09:04 PM

to Dave Bartolomeo, llvm...@lists.llvm.org, cfe...@lists.llvm.org

> On Oct 29, 2015, at 10:11 AM, Dave Bartolomeo via llvm-dev <llvm...@lists.llvm.org> wrote:
>
> Proposed Design
> How Debug Info is Generated
> The CodeView type records for a compilation unit will be generated by the front-end for the source language (Clang, in the case of C and C++). The front-end has access to the full type system and AST of the language, which is necessary to generate accurate debug type info. The type records will be represented as metadata in the LLVM IR, similar to how DWARF debug info is represented. I’ll cover the actual representation in a bit more detail below.
> The LLVM back-end will be responsible for emitting the CodeView type records from the IR into the output .obj file. Since the type records will already be in the correct format, this is essentially just a copy. No inspection of the type records is necessary within LLVM. The back-end will also be responsible for generating CodeView symbol records, line numbers, and source file info for any functions and data defined in the compilation unit. The back-end is the logical place to do this because only the back-end knows the code addresses, data addresses, and stack frame layouts.

Thanks for proposing this.

How different are the type records from the type information we currently have in LLVM's DIType hierarchy? Would it be feasible to move the logic for generating type records from LLVM metadata into the backend? This way a frontend could be agnostic about the debug information format.

-- adrian
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Saleem Abdulrasool via llvm-dev

Oct 30, 2015, 1:02:35 AM

to Adrian Prantl, llvm...@lists.llvm.org, cfe...@lists.llvm.org

I think that this really is the path we want to follow. If the current metadata we emit is insufficient, we should augment it with additional information sufficient to generate the necessary data in the backend. The same annotations would then be able to generate one OR both debug info formats.


--

Saleem Abdulrasool
compnerd (at) compnerd (dot) org

罗勇刚(Yonggang Luo)

Oct 30, 2015, 1:33:08 PM

to llvm...@lists.llvm.org, cfe...@lists.llvm.org

That's great news. I have been looking forward to this for a long time.


--
Yours sincerely,
罗勇刚 (Yonggang Luo)

Dave Bartolomeo via llvm-dev

Oct 30, 2015, 4:25:53 PM

to Daniel Dilts, llvm...@lists.llvm.org, cfe...@lists.llvm.org

Yes, we will be publicly documenting the CodeView format. We’re in the process of making our internal CodeView documentation fit for public consumption.

As far as LLD/LLDB goes, we (Microsoft) don’t have any current plans to implement the CodeView support in those projects ourselves. However, we certainly want to make sure that the code and documentation we release to support CodeView within LLVM is sufficient for any other interested member of the community to implement that support.

-Dave

Dave Bartolomeo via llvm-dev

Oct 30, 2015, 8:12:57 PM

to Saleem Abdulrasool, Adrian Prantl, llvm...@lists.llvm.org, cfe...@lists.llvm.org

From: Saleem Abdulrasool [mailto:comp...@compnerd.org]
Sent: Thursday, October 29, 2015 10:02 PM
To: Adrian Prantl <apr...@apple.com>
Cc: Dave Bartolomeo <Dave.Ba...@microsoft.com>; llvm...@lists.llvm.org; cfe...@lists.llvm.org
Subject: Re: [llvm-dev] RFC: CodeView debug info emission in Clang/LLVM

On Thu, Oct 29, 2015 at 2:08 PM, Adrian Prantl via llvm-dev <llvm...@lists.llvm.org> wrote:

> On Oct 29, 2015, at 10:11 AM, Dave Bartolomeo via llvm-dev <llvm...@lists.llvm.org> wrote:
>
> Proposed Design
> How Debug Info is Generated
> The CodeView type records for a compilation unit will be generated by the front-end for the source language (Clang, in the case of C and C++). The front-end has access to the full type system and AST of the language, which is necessary to generate accurate debug type info. The type records will be represented as metadata in the LLVM IR, similar to how DWARF debug info is represented. I’ll cover the actual representation in a bit more detail below.
> The LLVM back-end will be responsible for emitting the CodeView type records from the IR into the output .obj file. Since the type records will already be in the correct format, this is essentially just a copy. No inspection of the type records is necessary within LLVM. The back-end will also be responsible for generating CodeView symbol records, line numbers, and source file info for any functions and data defined in the compilation unit. The back-end is the logical place to do this because only the back-end knows the code addresses, data addresses, and stack frame layouts.

Thanks for proposing this.

How different are the type records from the type information we currently have in LLVM's DIType hierarchy? Would it be feasible to move the logic for generating type records from LLVM metadata into the backend? This way a frontend could be agnostic about the debug information format.

I think that this really is the path we want to follow. If the current metadata we emit is insufficient, we should augment it with additional information sufficient to generate the necessary data in the backend. The same annotations would then be able to generate one OR both debug info formats.

[dB] I considered that approach, but I see a few reasons why I don’t think making the debug metadata format-agnostic would work out very well. To ensure that the backend can generate both debug formats by itself, we need to make the metadata contain enough information from the original AST for the format-specific code in the backend to generate the debug info. I believe that in practice, we’d wind up having to encode a significant portion of the AST (for decls of types and members, at least) into metadata, because debug type info, at least in CodeView, strives for pretty close fidelity with the declarations and types in the original source language.

The CodeView debug type info is used by the VS debugger to parse and evaluate C++ expressions while debugging. We currently have a bunch of limitations in our debugger’s expression evaluation due to information missing from the debug type info, and we’ll probably attempt to preserve even more of that information going forward. There’s not much information from the AST that we can ignore if we want to reach that goal.

Of course, we could just accept that we need the majority of the AST for type and function declarations in the debug metadata, and do that work in order to avoid having the frontend know about debug info formats, but that just means that now the backend code that generates the debug info has to know about all of the source language-specific constructs that it’s reading when creating the debug info. I think I’d rather have Clang have to understand the language-specific parts of multiple debug info formats than have LLVM understand language-specific metadata.

As an example, the CodeView definition of a user-defined type requires both the mangled name of the type and the non-mangled “display name” of the type. Both of these require a fair bit of information from the AST to generate. For the mangled name in particular, there’s already code in Clang that generates this. If we want the backend to do this instead, we have to stuff a bunch of AST info into metadata, and then figure out how to share the name mangling code between Clang (where it operates on actual ASTs) and LLVM (where it would operate on metadata). If, instead, we have Clang compute the mangled name and display name and pass those names in the metadata, we’re not being particularly format-agnostic in Clang, and if the current compilation is only generating DWARF, we didn’t really need to compute or store those potentially large strings for every type anyway.

Whether Clang is format-agnostic or not, there will have to be some component that converts from something format-agnostic (either ASTs or metadata) to DWARF, and some component that converts from ASTs or metadata to CodeView. You can put those two components in Clang and accept that Clang won’t be format-agnostic. Or, you can put those two components in LLVM, which leaves Clang as format-agnostic but requires that LLVM be more source language-aware. It also requires a third component to translate ASTs into metadata to pass to LLVM. Letting Clang worry about two different debug type info formats seems preferable to writing additional code and making an LLVM component understand more about the source language.

Is there another approach I haven’t thought of that would let us wind up with a cleaner solution? I’ve only been working with the Clang and LLVM debug info code for a few months, so my knowledge of the existing design is far from complete.

Note that for the rest of debug info (line numbers, source files, stack layouts, etc.), I don’t think the frontend should have to worry about the debug info format, and the current design for those pieces is just fine. It’s only the type info that I think is source language-specific enough to justify computing it in the frontend.


David Blaikie via llvm-dev

Oct 30, 2015, 11:07:07 PM

to Dave Bartolomeo, llvm-dev, Clang Dev

Brief answer, but can go into detail later:

If this is the right idea, let's do it for dwarf too & generalize the support to work for both. It's certainly something we've considered, to save all the complexity of representing essentially static data in an intermediate form.

That said, given some of the stuff we have for lto, for example (deallocating/merging types etc) I'm not sure that's obviously the right strategy.

Mangled names for types don't seem like a hugely difficult feature. We already support mangled names for function debug info in dwarf. We already have the mangled name of a type in the metadata, it could be used for codeview emission.

It might be worth talking more & considering what other language features codeview uses that we haven't already plumbed through for dwarf (& dwarf based debuggers use dwarf for expression evaluation too, fwiw)

Robinson, Paul via llvm-dev

Oct 31, 2015, 3:04:32 PM

to cfe...@lists.llvm.org, llvm-dev

The details of the mangling would be ABI dependent not debug-info-format dependent. Metadata already allows conveying a mangled name into LLVM, as David Blaikie mentioned, so that's not really an issue. The frontend knows how to construct the mangled name, the backend knows where the mangled name goes in the final debug info. It's a pretty reasonable separation of concerns.

I didn't see anything in this quickie overview of CodeView that wouldn't be expressible in DWARF, so there's nothing (yet) persuasive to suggest metadata should be format-aware. It would be worthwhile for somebody knowledgeable in one format to take a good detailed look at the other, just to make sure; please provide a link to the detailed CodeView description when it becomes available.

Regarding source-language awareness of the debug-info generator, that's really not a concern (and I say this as someone who once helped add DWARF emission of COBOL-specific entries to a compiler backend that was not entirely clear how to spell COBOL). You need an API that is able to specify the constructs used by the language, and the rest of it is just processing those record types the way they're supposed to be. The backend is not doing any language-semantic analysis of the info, it's just doing what it's told.

Abstractly, the exercise of generalizing LLVM metadata to be able to support more than one debug-info format feels like a good thing. Metadata used to be more closely tied to DWARF (e.g., used DWARF tag codes directly in the metadata nodes to identify things) but it has been evolving away from that to a class hierarchy that is not so explicitly DWARF-ish. Handling CodeView would encourage that direction, rather than being a more fundamental shift.

--paulr

Zachary Turner via llvm-dev

Oct 31, 2015, 6:07:30 PM

to Robinson, Paul, cfe...@lists.llvm.org, llvm-dev

Definitely having someone who knows both formats well would be an advantage. Dave B might be in the best position to do this, so hopefully he can provide a couple more examples of areas where he has trouble expressing CV information entirely in the backend.

Regardless of what everyone ends up deciding with regard to the front-end / back-end discussion, I want to suggest separating the work into separate pieces that can go in independently of each other.

For example, the proposed LLVMCodeView library, which simply reads and writes raw CV records, seems to be orthogonal to this discussion and could be submitted independently.

David Blaikie via llvm-dev

Oct 31, 2015, 6:20:22 PM

to Zachary Turner, llvm-dev, cfe...@lists.llvm.org

On Sat, Oct 31, 2015 at 3:07 PM, Zachary Turner via llvm-dev <llvm...@lists.llvm.org> wrote:

Definitely having someone who knows both formats well would be an advantage. Dave B might be in the best position to do this, so hopefully he can provide a couple more examples of areas where he has trouble expressing CV information entirely in the backend.

Regardless of what everyone ends up deciding with regard to the front-end / back-end discussion, I want to suggest separating the work into separate pieces that can go in independently of each other.

For example, the proposed LLVMCodeView library, which simply reads and writes raw CV records, seems to be orthogonal to this discussion and could be submitted independently.

I haven't looked at the patch in general, but that sounds quite plausible - unit tests or what-have-you that demonstrate the expected behavior regardless of where it ultimately ends up being used from (LLVM, Clang, or both)

Aboud, Amjad via llvm-dev

Nov 1, 2015, 7:10:36 AM

to David Blaikie, Zachary Turner, Dave Bartolomeo, llvm-dev, cfe...@lists.llvm.org

I also think that we should keep one representation of debug info in the LLVM IR.

There would be a need to extend some of the debug info entries to support CodeView, but I think that most of the information generated today by Clang for Dwarf can be used for generating CodeView.

I can think of two missing extensions that are needed for CodeView:

1. In Frontend: File Checksum, it is probably a calculation that Clang should do and send to the backend through DIFile.

2. In Backend: Extend “X86 Dwarf<->LLVM register mappings” to support “X86 CodeView<->LLVM register mappings”

I can think of more differences (gaps) between Dwarf and CodeView that need to be closed; however, closing them is doable with one uniform (generic) debug info metadata in the LLVM IR.
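
As a rough illustration of the first item, here is a sketch of a front-end computing a checksum and attaching it to a file node. MD5 is an assumption (no hash algorithm is named in this thread), and the DIFileModel class with its checksum field is hypothetical, not an existing DIFile API.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class DIFileModel:
    """Hypothetical stand-in for a DIFile node extended with a checksum."""
    filename: str
    checksum: bytes

def source_checksum(contents: bytes) -> bytes:
    """Hash the raw bytes of a source file (MD5 assumed for this sketch)."""
    return hashlib.md5(contents).digest()

def make_file_node(filename: str, contents: bytes) -> DIFileModel:
    """Front-end side: compute the checksum once and hand it to the back-end."""
    return DIFileModel(filename=filename, checksum=source_checksum(contents))

empty_file = make_file_node("empty.c", b"")
```

The back-end would then only need to copy the stored bytes into whichever format it is emitting.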

Regards,

Amjad


Joerg Sonnenberger via llvm-dev

Nov 1, 2015, 3:21:46 PM

to llvm...@lists.llvm.org, cfe...@lists.llvm.org

On Fri, Oct 30, 2015 at 08:25:15PM +0000, Dave Bartolomeo via llvm-dev wrote:
> Yes, we will be publicly documenting the CodeView format. We’re in the process of making our internal CodeView documentation fit for public consumption.
>
> As far as LLD/LLDB goes, we (Microsoft) don’t have any current plans to
> implement the CodeView support in those projects ourselves. However,
> we certainly want to make sure that the code and documentation we
> release to support CodeView within LLVM is sufficient for any other
> interested member of the community to implement that support.

It would be useful to have a good set of test cases next to the
documentation, especially if they cover all parts of the specification.
That makes it easier for others (e.g. lld, lldb) to get basic
interoperability testing in shape.

Joerg

Robinson, Paul via llvm-dev

Nov 2, 2015, 4:02:00 PM

to Aboud, Amjad, David Blaikie, Zachary Turner, Dave Bartolomeo, llvm...@lists.llvm.org, cfe...@lists.llvm.org

| 1. In Frontend: File Checksum, it is probably a calculation that Clang should do and send to the backend through DIFile.

If this is what I think it is, there's a DWARF 5 feature to make use of it as well. Searching on "file" and "checksum" in the microsoft-pdb link doesn't turn up any hits, though.

| 2. In Backend: Extend “X86 Dwarf<->LLVM register mappings” to support “X86 CodeView<->LLVM register mappings”

I suppose this would only be particularly useful for X86, but it would be target-dependent.

--paulr

Saleem Abdulrasool via llvm-dev

Nov 2, 2015, 10:52:06 PM

to Robinson, Paul, llvm...@lists.llvm.org, cfe...@lists.llvm.org

On Mon, Nov 2, 2015 at 1:01 PM, Robinson, Paul via cfe-dev <cfe...@lists.llvm.org> wrote:

| 1. In Frontend: File Checksum, it is probably a calculation that Clang should do and send to the backend through DIFile.

If this is what I think it is, there's a DWARF 5 feature to make use of it as well. Searching on "file" and "checksum" in the microsoft-pdb link doesn't turn up any hits, though.

| 2. In Backend: Extend “X86 Dwarf<->LLVM register mappings” to support “X86 CodeView<->LLVM register mappings”

I suppose this would only be particularly useful for X86, but it would be target-dependent.

The register mapping is architecture dependent in DWARF (e.g. register 0 is %eax on x86 but r0 on ARM). Using a more general way to address registers in the MD would allow sinking the mapping into the backend, which could then base it on the output format as well. This would actually help simplify understanding the metadata without having the DWARF spec handy. This is just one of the points in our current implementation where we are overly DWARF-centric (understandably so -- we haven't needed anything else and it made things easier to implement). As others pointed out, this exercise will point out places where we haven't generalized things and will help clean those up.
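
A sketch of what sinking the mapping into the backend could look like: the metadata names a register abstractly and the emitter picks the number for the output format. The table layout and function name are made up for illustration, and the numeric values are my understanding of the 32-bit x86 DWARF and CodeView numbering; they should be checked against the respective specs before being relied on.

```python
# Assumed register numbers for 32-bit x86, keyed by output format.
REGISTER_NUMBERS = {
    "dwarf":    {"EAX": 0,  "ECX": 1,  "EDX": 2,  "EBX": 3},
    "codeview": {"EAX": 17, "ECX": 18, "EDX": 19, "EBX": 20},
}

def encode_register(register: str, debug_format: str) -> int:
    """Map a format-neutral register name to the number a given format uses."""
    try:
        return REGISTER_NUMBERS[debug_format][register]
    except KeyError:
        raise ValueError(f"no {debug_format} number known for {register}") from None

# The same metadata ("this variable lives in EAX") encoded for each format.
eax_in_each_format = {fmt: encode_register("EAX", fmt) for fmt in REGISTER_NUMBERS}
```

A real implementation would need one such table per target as well as per format, which is the target dependence mentioned above.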

Reid Kleckner via llvm-dev

Nov 3, 2015, 10:51:48 AM

to Daniel Dilts, llvm...@lists.llvm.org, cfe...@lists.llvm.org

On Thu, Oct 29, 2015 at 12:42 PM, Daniel Dilts via cfe-dev <cfe...@lists.llvm.org> wrote:

2. Will LLD and LLDB be updated as necessary to support CodeView?

Rui is looking at making LLD link codeview from object files into PDBs.

Zachary Turner intends to add PDB reading support to LLDB. We already have a PDB implementation of DIContext in lib/DebugInfo that uses PDBs. The only client is currently llvm-symbolizer, but the idea was that LLDB could use it, and eventually we should shift it off DIA and over to something cross-platform.

Reid Kleckner via llvm-dev

Nov 3, 2015, 12:43:13 PM

to David Blaikie, llvm-dev, Duncan P. N. Exon Smith

On Fri, Oct 30, 2015 at 8:06 PM, David Blaikie via cfe-dev <cfe...@lists.llvm.org> wrote:

If this is the right idea, let's do it for DWARF too & generalize the support to work for both. It's certainly something we've considered, to save all the complexity of representing essentially static data in an intermediate form.

That said, given some of the stuff we have for lto, for example (deallocating/merging types etc) I'm not sure that's obviously the right strategy.

Similar to, but different from, LTO: LLD is also going to have to deduplicate type records when it reads in .debug$T sections to spit out a PDB. We probably want to make the LLVM CodeView library useful for handling that task.

One question for Dave B is, what algorithms does the VC++ linker use to deduplicate types? The general problem is graph isomorphism, but surely the linker takes some shortcuts to merge most duplicate debug info without spending too much time on it.
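
For illustration only, one common shortcut (not necessarily what the VC++ linker does) is to exploit ordering. Assuming every record refers only to records that appear earlier in the same stream, each record can be rewritten to use already-merged indices and then looked up by content, which avoids a general graph comparison. `TypeRecord` and `TypeMerger` below are hypothetical toy types, not part of any existing library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy type record: a kind, some payload bytes, and references to other
// records by index. Assumption: a record only references earlier records.
struct TypeRecord {
  uint16_t Kind;
  std::string Payload;
  std::vector<uint32_t> Refs;
};

class TypeMerger {
  using Key =
      std::pair<std::pair<uint16_t, std::string>, std::vector<uint32_t>>;
  std::map<Key, uint32_t> Seen;   // canonical record -> merged index
  std::vector<TypeRecord> Merged; // the deduplicated output stream

public:
  // Merges one input stream; returns the local index -> merged index map.
  std::vector<uint32_t> mergeStream(const std::vector<TypeRecord> &Stream) {
    std::vector<uint32_t> Remap;
    for (const TypeRecord &Rec : Stream) {
      TypeRecord Canon = Rec;
      for (uint32_t &Ref : Canon.Refs) {
        assert(Ref < Remap.size() && "forward references are not handled");
        Ref = Remap[Ref]; // rewrite to merged indices before comparing
      }
      Key K(std::make_pair(Canon.Kind, Canon.Payload), Canon.Refs);
      auto It = Seen.find(K);
      if (It == Seen.end()) {
        It = Seen.emplace(K, static_cast<uint32_t>(Merged.size())).first;
        Merged.push_back(std::move(Canon));
      }
      Remap.push_back(It->second);
    }
    return Remap;
  }

  std::size_t size() const { return Merged.size(); }
};
```

Two streams that describe the same pointer-to-int type at different local indices end up sharing one merged record, because the reference is remapped before the content lookup. Cyclic or forward references would need extra handling that this sketch leaves out.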

I should also state up front that I'm mostly interested in pursuing the /Z7 debug info design in upstream clang, and not /Zi, which incrementally writes PDBs from the compiler. It would be really hard to teach clang how to do IPC with mspdbsrv or coordinate concurrent updates to the PDB on-disk.

-----

Personally, I'm not sure that merging CV type info into our existing DI hierarchy is the right way to go. Looking over the CV types in http://reviews.llvm.org/D14209 suggests that they have a fair amount of extra information that seems irrelevant to DWARF types. If we were to try to use a uniform representation for DWARF and CV types, we might want to follow David's suggestion to make it more opaque, or we'll end up with a bunch of dead fields in the DI hierarchy.

I think Duncan considered and discarded the idea of putting most type information into an MDString earlier this year. The current representation is a lot easier to read and therefore test.

Zachary Turner via llvm-dev

Nov 3, 2015, 12:58:16 PM

to Reid Kleckner, Daniel Dilts, llvm...@lists.llvm.org, cfe...@lists.llvm.org

llvm-pdbdump uses it too. But yeah, having a cross-platform one is good for obvious reasons, and having a "reference" implementation (e.g. DIA) in addition to the non-DIA-based implementation is desirable because it gives us an easy way to verify correctness: by comparing the output of the DIA-based tool and the non-DIA-based tool.

Dave Bartolomeo via llvm-dev

Nov 3, 2015, 7:25:54 PM

to David Blaikie, Zachary Turner, llvm-dev

The LLVMCodeView library is definitely independent of the rest of the design questions.

As far as testing goes, what would be the conventional LLVM way of testing a library for file format manipulation? A test tool that converts some simple text form into a .obj containing CodeView sections, and comparing with a baseline .obj? Or would the test convert back from the .obj to some kind of text as well, and compare to a text baseline? Is there some other LLVM component that has similar testing requirements that I can use as an example for how to test LLVMCodeView? Note that I’d be adding a CodeView->text dump tool anyway, since that will be pretty much essential for anyone working with CodeView.

-Dave

Zachary Turner via llvm-dev

Nov 3, 2015, 7:37:46 PM

to Dave Bartolomeo, David Blaikie, llvm-dev, David Majnemer

On Tue, Nov 3, 2015 at 4:25 PM Dave Bartolomeo <Dave.Ba...@microsoft.com> wrote:

| The LLVMCodeView library is definitely independent of the rest of the design questions.

| As far as testing goes, what would be the conventional LLVM way of testing a library for file format manipulation? A test tool that converts some simple text form into a .obj containing CodeView sections, and comparing with a baseline .obj? Or would the test convert back from the .obj to some kind of text as well, and compare to a text baseline? Is there some other LLVM component that has similar testing requirements that I can use as an example for how to test LLVMCodeView? Note that I’d be adding a CodeView->text dump tool anyway, since that will be pretty much essential for anyone working with CodeView.

| -Dave

David Majnemer and David Blaikie (this is seriously like the attack of the Daves) probably have some thoughts, but currently there is code in llvm-readobj.exe to parse certain types of CodeView from object files (mostly line table information).

So one idea is to generate some object files that have CV records you know about up front, then pass the output of llvm-readobj (which would need to be updated to use LLVMCodeView instead of hand-rolled parsing) to FileCheck and verify that it matches some pattern.

Maybe the Daves have some other ideas as well.

David Blaikie via llvm-dev

Nov 3, 2015, 7:46:40 PM

to Zachary Turner, llvm-dev, David Majnemer

Yep, this would test the dumping behavior, and it's how we test llvm-dwarfdump. I assume you have similar checked-in-binary tests for llvm-pdbdump?

How you get output to dump is a bit fuzzy in this case (we don't have much test coverage like this particular situation) - one way is to create another textual format (json, etc), read it, generate CV from it, dump it, FileCheck it, but that's a bit heavyweight.

I'd be inclined to write unit tests if possible - use the CV APIs directly in a unit test, generate in-memory CV output to feed into the dumper in-process, if possible (or, if necessary/substantially more convenient, have the unit test actually write CV output, dump, check)

Hmm, yeah, not perfect, though - how do you check the dumped output? FileCheck is our usual tool for this & there's, again, probably no great in-process story for that...
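
As a rough illustration of the in-process option, a unit test could write records into a memory buffer, run the dumper on that buffer, and compare the resulting text with an expected string, with no FileCheck involved. Everything below is a hypothetical stand-in: `writeRecord`, `dumpRecords`, and the record layout (16-bit length, 16-bit kind, payload, little-endian) are toy assumptions for the sketch, not the LLVMCodeView API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical writer: appends one record as a 16-bit length (covering the
// kind and the payload), a 16-bit kind, then the payload, little-endian.
inline void writeRecord(std::vector<uint8_t> &Out, uint16_t Kind,
                        const std::vector<uint8_t> &Payload) {
  assert(Payload.size() + 2 <= 0xFFFF && "record too large for 16-bit length");
  const uint16_t Len = static_cast<uint16_t>(Payload.size() + 2);
  Out.push_back(static_cast<uint8_t>(Len & 0xFF));
  Out.push_back(static_cast<uint8_t>(Len >> 8));
  Out.push_back(static_cast<uint8_t>(Kind & 0xFF));
  Out.push_back(static_cast<uint8_t>(Kind >> 8));
  Out.insert(Out.end(), Payload.begin(), Payload.end());
}

// Hypothetical dumper: prints one line per record in the buffer.
inline std::string dumpRecords(const std::vector<uint8_t> &In) {
  std::ostringstream OS;
  std::size_t Pos = 0;
  while (Pos + 4 <= In.size()) {
    const unsigned Len = In[Pos] | (In[Pos + 1] << 8);
    const unsigned Kind = In[Pos + 2] | (In[Pos + 3] << 8);
    if (Len < 2)
      break; // malformed record: stop rather than guess
    OS << "kind=0x" << std::hex << Kind << " size=" << std::dec << (Len - 2)
       << "\n";
    Pos += 2 + Len;
  }
  return OS.str();
}
```

A test would then build a buffer with a couple of `writeRecord` calls and compare `dumpRecords(Buf)` against a literal string. That gives exact-match checking only, which is cruder than FileCheck patterns but needs no external tool.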

Open to further ideas...

- Nth Dave

Adrian Prantl via llvm-dev

Nov 4, 2015, 12:55:07 PM

to Zachary Turner, llvm...@lists.llvm.org, cfe...@lists.llvm.org

For reading DWARF we currently have two sort of redundant implementations in the overall LLVM project — one in lib/DebugInfo and another one in LLDB. Do you see an opportunity for sharing the PDB implementation between LLVM and LLDB?

Zachary Turner via llvm-dev

Nov 4, 2015, 1:04:39 PM

to Adrian Prantl, llvm...@lists.llvm.org, cfe...@lists.llvm.org

Yes, the PDB implementation will absolutely be shared.

I'm not responsible for the DWARF reading code in LLDB, but my understanding is that it is the way it is because they want to load a lot of debug info lazily, so their reader is optimized for that use case. Personally, I think someone with the will and the knowledge could drive a change to LLDB to use LLVM's DWARF reading code, making changes to LLVM's implementation along the way to make sure that the performance characteristics remain the same. I would love it if someone did that.

Aboud, Amjad via llvm-dev

Nov 5, 2015, 8:38:33 AM

to Robinson, Paul, David Blaikie, Zachary Turner, Dave Bartolomeo, llvm...@lists.llvm.org, cfe...@lists.llvm.org

I do not think that the “FileChecksums” mentioned in the CodeView patch (http://reviews.llvm.org/D14209) is the same thing as in DWARF 5.

In DWARF 5 the checksum is for the generated debug info sections, like .debug_info, .debug_macro, etc. Thus, it makes sense to do it in the codegen (debug emitter).

In CodeView, I believe the checksum is for the source file, so it makes more sense to calculate it in Clang.

Dave might be able to explain it for us.

Regards,

Amjad

Robinson, Paul via llvm-dev

Nov 5, 2015, 10:21:06 AM

to Aboud, Amjad, David Blaikie, Zachary Turner, Dave Bartolomeo, llvm...@lists.llvm.org, cfe...@lists.llvm.org

For type units (a DWARF 4 feature) there is a "signature" computed with MD5, which is of course computed by LLVM as it creates the units.

In DWARF 5 there is provision in the line table for providing an MD5 checksum of the source file (instead of the file size and modtime characteristics), which is exactly what you've described. Yes, this needs to be calculated in Clang and passed down to LLVM through the metadata.
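
A minimal sketch of that division of labor, under loud assumptions: `FileEntry` is a simplified, hypothetical model of a file entry in the metadata (not the real DIFile layout), and FNV-1a is used purely as a short stand-in for MD5 to keep the example small. The point is only that the frontend hashes the source bytes it already holds, and the backend copies the stored value into whichever format it emits.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Simplified, hypothetical model of a file entry carried in debug metadata.
struct FileEntry {
  std::string Name;
  uint64_t Checksum; // computed by the frontend, only copied by the backend
};

// FNV-1a over the file contents. This is a stand-in for MD5, chosen only
// because it fits in a few lines.
inline uint64_t checksumBytes(const std::string &Bytes) {
  uint64_t H = 14695981039346656037ULL; // FNV-1a 64-bit offset basis
  for (unsigned char C : Bytes) {
    H ^= C;
    H *= 1099511628211ULL; // FNV-1a 64-bit prime
  }
  return H;
}

// Frontend side: hash the source text that is already in memory.
inline FileEntry makeFileEntry(const std::string &Name,
                               const std::string &Source) {
  assert(!Name.empty() && "a file entry needs a name");
  return FileEntry{Name, checksumBytes(Source)};
}
```

Because the checksum travels with the file entry, a DWARF 5 line table emitter and a CodeView file checksum emitter could both read the same field without re-reading the source file.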

Thanks for verifying!

--paulr

Reid Kleckner via llvm-dev

Mar 2, 2016, 8:19:30 PM

to Dave Bartolomeo, llvm...@lists.llvm.org, cfe...@lists.llvm.org

Circling back around 4 months later...

I now believe that we should just let the frontend generate CV type info. It's really not worth the hassle to try to have a common representation. Enough C++ ABI-specific information leaks into the format that it's really better to avoid trying to create a union of DWARF and CV type info in LLVM DI metadata. We were able to reuse all the other non-type DI metadata, such as location info and scope info, to emit inline line tables and variable locations, so I think we did OK on reusing the existing infrastructure. Compromising at not reusing the type representation seems OK.

I haven't come up with any ideas better than the design that Dave Bartolomeo outlined below, so I think we should go ahead with that. One thing I considered was extending DITypeRef to be a union between MDString*, DIType*, and a type index, but I think that's too invasive. I also don't want to make a whole DIType heap allocation just to wrap a 32-bit type index, so I'm in favor of putting the indices into DISubprogram and DIVariable.

Any thoughts on this plan?

David Blaikie via llvm-dev

Mar 3, 2016, 1:26:29 PM

to Reid Kleckner, llvm...@lists.llvm.org, cfe...@lists.llvm.org

I think it'd be reasonable to at least figure out a good way to do type references consistently across the two schemes, but I'm OK with the idea of having a blob of opaque type information for different debug info formats, created by frontends. I don't mind whether the library for building that blob lives in LLVM or Clang for now. The DWARF one at least would probably live in LLVM, because type info and other DWARF are described by similar/the same constructs (DIEs, abbrevs, etc). That doesn't seem to be the case for PDB, so there might not be any code to share between LLVM's CodeView needs and the type info construction. Then it's just a matter of whether pushing that library down into LLVM for other frontends to use would be good, which it probably will be at some point, so if it goes into Clang I'd at least try to keep it pretty well separated.

Potentially that consistency could be created by going the other way - replace DITypeRef with an int, then have the retained types list be the int->type mapping. Skipping the mangled names. (& skip the retained types list for CV/PDB)

- Dave

Aboud, Amjad via llvm-dev

Mar 8, 2016, 7:39:15 AM

to David Blaikie, Reid Kleckner, llvm...@lists.llvm.org, cfe...@lists.llvm.org

Hi,

I said it before and I am saying it again: I do not think that this proposal is needed to support CodeView.

1. Why can't codegen use the current DIType metadata to represent the CodeView types?

2. Why can't “DW_TAG_typedef” be used to generate the “DICodeViewUDT” symbol?

3. Why do we need the TypeIndex? DISubprogram and DIVariable could simply point to the DIType metadata instead of holding an index into an array where these DITypes are stored.

4. Why are the “TypeRecords” of type MDString? Are they the source names of the types?

I believe that the current debug info metadata contains all the information needed to create the CodeView information in codegen.

Thus, I do not see a need to either modify Clang or even modify the LLVM IR.

Please, if you have a concrete case where you think we have lost information needed for codeview between Clang and Codegen, tell us about it and I will be happy to help you figure out how to retrieve this information from current DI metadata.

Thanks,

Amjad

Eric Christopher via llvm-dev

Mar 9, 2016, 12:35:08 AM

to Aboud, Amjad, David Blaikie, Reid Kleckner, llvm...@lists.llvm.org, cfe...@lists.llvm.org

In general, I agree here. I'm still unconvinced that this needs to happen this way.

-eric

Reid Kleckner via llvm-dev

Mar 10, 2016, 12:24:40 PM

to David Blaikie, llvm...@lists.llvm.org, cfe...@lists.llvm.org

On Thu, Mar 3, 2016 at 10:26 AM, David Blaikie <dbla...@gmail.com> wrote:

| I think it'd be reasonable to at least figure out a good way to do type references consistently across the two schemes, but I'm OK with the idea of having a blob of opaque type information for different debug info formats, created by frontends (& don't mind if the library for building that blob live in LLVM or Clang for now - the DWARF one at least would probably live in LLVM because type info and other DWARF are described by similar/the same constructs (DIEs, abbrevs, etc) - but it seems like that's not the case for PDB, so there might not be any code to share between LLVM's CodeView needs and the type info construction - then it's just a matter of whether pushing that library down into LLVM for other frontends to use would be good, which it probably will be at some point, so if it goes into Clang I'd at least try to keep it pretty well separated)

| Potentially that consistency could be created by going the other way - replace DITypeRef with an int, then have the retained types list be the int->type mapping. Skipping the mangled names. (& skip the retained types list for CV/PDB)

DITypeRef wraps a Metadata*, though, not an int. Given that there are zero users of DITypeRef in Transforms/ and Analysis/, I don't see why we should try to forcibly create sharing where there is none. The only consumers of type information are essentially the separate debug info backends.

David Blaikie via llvm-dev

Mar 10, 2016, 12:49:36 PM

to Reid Kleckner, llvm...@lists.llvm.org, cfe...@lists.llvm.org

I haven't looked in detail at the patch, but it sounded like the proposal was to add an int field next to every DITypeRef field? That seems verbose/intrusive to the schema compared to making the type reference machinery able to be one or the other. Or is the proposal to have DITypeRef fields be a union of int or DITypeRef, where the DITypeRef itself is a union of metadata reference or string? If we already have a union of metadata or string, it seems like the better thing to do would be to make it metadata, string, or int rather than having two different layers for referring to types.
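
To show the shape of the single-layer alternative, here is a small sketch of a reference type with three alternatives: a metadata node, a string identifier, or a 32-bit type index. `TypeNode` and `TypeRef` are hypothetical stand-ins written for this example; they are not the real DIType or DITypeRef classes and say nothing about how the patch under review is implemented.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>

struct TypeNode { std::string Name; }; // stand-in for a DIType node

// Hypothetical single reference type with three alternatives, rather than an
// extra integer field next to every existing type reference.
class TypeRef {
public:
  enum class Kind { Node, Identifier, Index };

  static TypeRef node(const TypeNode *N) {
    assert(N && "null type node");
    TypeRef R(Kind::Node);
    R.NodePtr = N;
    return R;
  }
  static TypeRef identifier(std::string Id) {
    TypeRef R(Kind::Identifier);
    R.Id = std::move(Id);
    return R;
  }
  static TypeRef typeIndex(uint32_t TI) {
    TypeRef R(Kind::Index);
    R.Index = TI;
    return R;
  }

  Kind kind() const { return K; }

  // Each debug info backend handles only the alternatives it understands.
  std::string describe() const {
    switch (K) {
    case Kind::Node: return "node:" + NodePtr->Name;
    case Kind::Identifier: return "id:" + Id;
    case Kind::Index: return "index:" + std::to_string(Index);
    }
    return "";
  }

private:
  explicit TypeRef(Kind K) : K(K) {}
  Kind K;
  const TypeNode *NodePtr = nullptr;
  std::string Id;
  uint32_t Index = 0;
};
```

A DWARF emitter would expect the node or identifier forms and a CodeView emitter the index form, but every field that refers to a type would keep one slot instead of two.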

Reid Kleckner via llvm-dev

Mar 10, 2016, 1:51:15 PM

to Aboud, Amjad, llvm...@lists.llvm.org, cfe...@lists.llvm.org

It is certainly *possible* to use the existing DIType hierarchy to generate CodeView, but I don't believe it is useful. We would have to make the DI metadata into the union of DWARF and CodeView, and it would be horrible. Here is an incomplete list of things that would be awkward:

- Member pointer inheritance models. Not all pointers to members are the same size.

- Describing locations of virtual bases in vbtables. I'm not sure how to get from DW_TAG_inheritance data to "offset of vbptr from vfptr of complete class".

- Describing 'this' adjustments performed in virtual method prologues.

- New virtuality types to indicate "introducing" virtual methods.

- New flags on everything, see CodeView.h for more info.
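
As a small illustration of the first bullet, the sketch below reports the size of a pointer to member function for classes with different inheritance shapes. No expected values are given because they depend on the ABI: under the Microsoft ABI the sizes are expected to differ with the inheritance model, while Itanium-ABI targets typically use one size for all of them. The class names and `memberPointerSizes` are made up for this example.

```cpp
#include <cassert>
#include <cstddef>

struct Single { void f() {} };
struct Base2 { void g() {} };
struct Multiple : Single, Base2 { void h() {} };
struct Virtual : virtual Single { void v() {} };
struct Unspecified; // incomplete: the inheritance model is not known yet

struct MemberPointerSizes {
  std::size_t ForSingle;
  std::size_t ForMultiple;
  std::size_t ForVirtual;
  std::size_t ForUnspecified;
};

// Collects sizeof for a pointer to member function of each class shape.
inline MemberPointerSizes memberPointerSizes() {
  MemberPointerSizes S = {sizeof(void (Single::*)()),
                          sizeof(void (Multiple::*)()),
                          sizeof(void (Virtual::*)()),
                          sizeof(void (Unspecified::*)())};
  // The representation for an unknown model has to cover the simplest case.
  assert(S.ForUnspecified >= S.ForSingle);
  return S;
}
```

A debug format that records the inheritance model per member pointer type needs this information from the frontend, which is the kind of detail the existing metadata has no natural place for.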

If you need more visibility into what's different, consider this C++ source:

struct A {
  virtual void f() {}
  int a;
};

struct B : virtual A {
  virtual void f() {}
  virtual void g() {}
  int b;
};

struct C : virtual A {
  virtual void f() {}
  virtual void h() {}
  int c;
};

struct D : B, C {
  virtual void f() {}
  virtual void g() {}
  virtual void h() {}
  int d;
};

D d;

auto mp = &D::f;

Compare the metadata that clang generates with the dump of the CodeView that MSVC generates, and decide for yourself if the representations are a good match:

$ clang -cc1 -std=c++11 -emit-llvm -debug-info-kind=limited t.cpp -triple x86_64-linux -o t.ll

LLVM IR: https://ghostbin.com/paste/dpqo8

$ cl -c t.cpp -Z7 && llvm-readobj -codeview t.obj

Dump of MSVC CodeView: https://ghostbin.com/paste/92ya3

Sure, yes, it is *possible* to write a converter from one to the other, but why is it necessary? What use case does it enable?

You might think it would allow non-Clang frontends to avoid having separate type info emitters, but in practice it won't, because these frontends will need to be augmented to pass down all kinds of CV-specific junk.

Eric Christopher via llvm-dev

Mar 16, 2016, 5:14:01 PM

to Reid Kleckner, Aboud, Amjad, llvm...@lists.llvm.org, cfe...@lists.llvm.org

Hi All,

Reid, Dave and I have chatted about this quite a bit and I think we have a way forward that gets us in a direction we'd like to go, offers some potential performance benefits for existing dwarf users, and maintains some compatibility while transitions are happening. We're currently writing up a proposal and will send it out for RFC shortly.