[LLVMdev] [Debug Info + LTO] Type Uniquing for C types?

49 views
Skip to first unread message

Manman Ren

unread,
Oct 11, 2013, 2:39:59 PM10/11/13
to llv...@cs.uiuc.edu, Eric Christopher, David Blaikie

Hi all,

With C++'s ODR, we are able to unique C++ types by using type identifiers to refer to types.
Type identifiers are generated by C++ mangler. What about languages without ODR? Should we unique C types as well?

One solution for C types is to generate a cross-CU unique identifier for C types. And before linking, we update all type identifiers in a source module with the corresponding hash of the C types, then linking can continue as usual.

This requires clang to generate a cross-CU unique identifier for C types (one simple scheme is using a identifier that is unique within the CU and concatenating the CU's file name). And it also requires hashing of C types at DebugInfo IR level. We can add an API such as updateTypeIdentifiers(Module *), linker can call it right before linking in a source module.

This is a preliminary design to start discussion.

Comments and feedback are welcome.

Thanks,
Manman


Eric Christopher

unread,
Oct 11, 2013, 2:48:18 PM10/11/13
to Manman Ren, llv...@cs.uiuc.edu
> With C++'s ODR, we are able to unique C++ types by using type identifiers to
> refer to types.
> Type identifiers are generated by C++ mangler. What about languages without
> ODR? Should we unique C types as well?
>

We can, but the identifier will need to be constructed on, likely, a
language dependent basis to ensure uniqueness.

> One solution for C types is to generate a cross-CU unique identifier for C
> types. And before linking, we update all type identifiers in a source module
> with the corresponding hash of the C types, then linking can continue as
> usual.
>

Yes.

> This requires clang to generate a cross-CU unique identifier for C types
> (one simple scheme is using a identifier that is unique within the CU and
> concatenating the CU's file name). And it also requires hashing of C types
> at DebugInfo IR level. We can add an API such as
> updateTypeIdentifiers(Module *), linker can call it right before linking in
> a source module.
>

I think the easiest design you'll get for uniquing C types that are
named the same thing (i.e. type defined in a .h file) is to use the
name of the struct combined with the file (and possibly line/column)
as an identifier. If you want to unify by structure then you'll need
to do something the equivalent to the type hashing that we're
implementing in the back end, but that'll be more difficult to
construct via the front end - it may be possible though.

-eric
_______________________________________________
LLVM Developers mailing list
LLV...@cs.uiuc.edu http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Eric Christopher

unread,
Oct 11, 2013, 2:54:27 PM10/11/13
to Manman Ren, Doug Gregor, llv...@cs.uiuc.edu
>
> I think the easiest design you'll get for uniquing C types that are
> named the same thing (i.e. type defined in a .h file) is to use the
> name of the struct combined with the file (and possibly line/column)
> as an identifier. If you want to unify by structure then you'll need
> to do something the equivalent to the type hashing that we're
> implementing in the back end, but that'll be more difficult to
> construct via the front end - it may be possible though.
>

To sum up in a slightly better way I think Doug has posted some rules
on how to merge C types for modules and we could use those to
construct a unique identifier for the type. If we do that I'd request
we prepend the type name in there some how as that'd be convenient. :)

Manman Ren

unread,
Oct 11, 2013, 3:01:30 PM10/11/13
to Eric Christopher, llv...@cs.uiuc.edu
On Fri, Oct 11, 2013 at 11:48 AM, Eric Christopher <echr...@gmail.com> wrote:
> With C++'s ODR, we are able to unique C++ types by using type identifiers to
> refer to types.
> Type identifiers are generated by C++ mangler. What about languages without
> ODR? Should we unique C types as well?
>

We can, but the identifier will need to be constructed on, likely, a
language dependent basis to ensure uniqueness.

> One solution for C types is to generate a cross-CU unique identifier for C
> types. And before linking, we update all type identifiers in a source module
> with the corresponding hash of the C types, then linking can continue as
> usual.
>

Yes.

> This requires clang to generate a cross-CU unique identifier for C types
> (one simple scheme is using a identifier that is unique within the CU and
> concatenating the CU's file name). And it also requires hashing of C types
> at DebugInfo IR level. We can add an API such as
> updateTypeIdentifiers(Module *), linker can call it right before linking in
> a source module.
>

I think the easiest design you'll get for uniquing C types that are
named the same thing (i.e. type defined in a .h file) is to use the
name of the struct combined with the file (and possibly line/column)
as an identifier.

Since we don't have ODR, we may have macros defined differently for a struct in a .h file,
thus having two versions of the struct from two different CU. It seems that we can't assume
structs with the same name and defined in the same file/line/column are the same.
 
If you want to unify by structure then you'll need
to do something the equivalent to the type hashing that we're
implementing in the back end, but that'll be more difficult to
construct via the front end - it may be possible though.

Hashing the types can happen either at the front end or at IR level. That is our first design choice :)

I think we should try not to hash the types for non-LTO builds at the front end or at IR level, since it does not give us
any benefit given that we are hashing them at the back end.

One advantage of hashing it at IR level is that we can just hash the MDNodes that affect the
type MDNode, at front end, the AST contains more information and should be harder to hash.

Thanks,
Manman
 

-eric

Eric Christopher

unread,
Oct 11, 2013, 3:07:03 PM10/11/13
to Manman Ren, Doug Gregor, llv...@cs.uiuc.edu
>
> Since we don't have ODR, we may have macros defined differently for a struct
> in a .h file,
> thus having two versions of the struct from two different CU. It seems that
> we can't assume
> structs with the same name and defined in the same file/line/column are the
> same.
>

Ah right sorry, I remember this. Also, macros are evil, just ask the
modules guys :)

> Hashing the types can happen either at the front end or at IR level. That is
> our first design choice :)
>

Sorta :)

> I think we should try not to hash the types for non-LTO builds at the front
> end or at IR level, since it does not give us
> any benefit given that we are hashing them at the back end.
>
> One advantage of hashing it at IR level is that we can just hash the MDNodes
> that affect the
> type MDNode, at front end, the AST contains more information and should be
> harder to hash.

It depends upon the goals. If the goal is to make debug information
post-link smaller then just using the type hashing machinery for
structs will be sufficient. However, if it's to save space during an
LTO link then we'll want to do it in the front end.

Doug: Have a link for how you do the C type merging for modules?

Manman Ren

unread,
Oct 11, 2013, 3:10:26 PM10/11/13
to Eric Christopher, llv...@cs.uiuc.edu
On Fri, Oct 11, 2013 at 11:54 AM, Eric Christopher <echr...@gmail.com> wrote:
>
> I think the easiest design you'll get for uniquing C types that are
> named the same thing (i.e. type defined in a .h file) is to use the
> name of the struct combined with the file (and possibly line/column)
> as an identifier. If you want to unify by structure then you'll need
> to do something the equivalent to the type hashing that we're
> implementing in the back end, but that'll be more difficult to
> construct via the front end - it may be possible though.
>

To sum up in a slightly better way I think Doug has posted some rules
on how to merge C types for modules and we could use those to
construct a unique identifier for the type. If we do that I'd request
we prepend the type name in there some how as that'd be convenient. :)

If we can get a unique identifier without the type hashing, that will be great. Where can I find the rules?

Manman
 

-eric

Manman Ren

unread,
Oct 11, 2013, 3:19:39 PM10/11/13
to Eric Christopher, llv...@cs.uiuc.edu
On Fri, Oct 11, 2013 at 12:07 PM, Eric Christopher <echr...@gmail.com> wrote:
>
> Since we don't have ODR, we may have macros defined differently for a struct
> in a .h file,
> thus having two versions of the struct from two different CU. It seems that
> we can't assume
> structs with the same name and defined in the same file/line/column are the
> same.
>

Ah right sorry, I remember this. Also, macros are evil, just ask the
modules guys :)

> Hashing the types can happen either at the front end or at IR level. That is
> our first design choice :)
>

Sorta :)

> I think we should try not to hash the types for non-LTO builds at the front
> end or at IR level, since it does not give us
> any benefit given that we are hashing them at the back end.
>
> One advantage of hashing it at IR level is that we can just hash the MDNodes
> that affect the
> type MDNode, at front end, the AST contains more information and should be
> harder to hash.

It depends upon the goals. If the goal is to make debug information
post-link smaller then just using the type hashing machinery for
structs will be sufficient.

By "the type hashing machinery for structs", are you referring to the type hashing at the back end?
 
However, if it's to save space during an
LTO link then we'll want to do it in the front end.

Yes, my purpose here is to save memory space in number of MDNodes (also # of DIEs) generated in a LTO build.
Type hashing at the DIE level can reduce the dwarf size.

Manman

Eric Christopher

unread,
Oct 11, 2013, 3:40:03 PM10/11/13
to Manman Ren, llv...@cs.uiuc.edu
>> It depends upon the goals. If the goal is to make debug information
>> post-link smaller then just using the type hashing machinery for
>> structs will be sufficient.
>
>
> By "the type hashing machinery for structs", are you referring to the type
> hashing at the back end?
>

I am, yes, since there's no other place we do currently.

>>
>> However, if it's to save space during an
>> LTO link then we'll want to do it in the front end.
>
>
> Yes, my purpose here is to save memory space in number of MDNodes (also # of
> DIEs) generated in a LTO build.
> Type hashing at the DIE level can reduce the dwarf size.
>

I agree with both of these statements.

I also agree with the desire to help LTO memory consumption so we'll
need something from the front end for this since we'd like to continue
to use the folding set to do the uniquing.

Manman Ren

unread,
Oct 14, 2013, 4:08:56 PM10/14/13
to Eric Christopher, llv...@cs.uiuc.edu
On Fri, Oct 11, 2013 at 12:40 PM, Eric Christopher <echr...@gmail.com> wrote:
>> It depends upon the goals. If the goal is to make debug information
>> post-link smaller then just using the type hashing machinery for
>> structs will be sufficient.
>
>
> By "the type hashing machinery for structs", are you referring to the type
> hashing at the back end?
>

I am, yes, since there's no other place we do currently.

>>
>> However, if it's to save space during an
>> LTO link then we'll want to do it in the front end.
>
>
> Yes, my purpose here is to save memory space in number of MDNodes (also # of
> DIEs) generated in a LTO build.
> Type hashing at the DIE level can reduce the dwarf size.
>

I agree with both of these statements.

I also agree with the desire to help LTO memory consumption so we'll
need something from the front end for this since we'd like to continue
to use the folding set to do the uniquing.

Hi Eric,

Assume that we need to do type hashing (i.e. assume Doug's rules for merging C types do not apply), would you
like it to be done on AST types or IR debug info metadata types?

One possibility is to hash the IR debug info types in DIBuilder::finalize.
When we are creating the MD nodes, we can provide a simple identifier that is unique within the DIBuider. In finailize(),
we update the identifier to include a hash value (prepend the type name), so types constructed by one DIBuilder will be unique
from types constructed by another DIBUilder unless they are structurally equivalent by having the same hash.

We can then use the folding set to do the uniquing across CUs.

Thanks,
Manman 
 

-eric

Douglas Gregor

unread,
Oct 14, 2013, 6:49:31 PM10/14/13
to Eric Christopher, llv...@cs.uiuc.edu

On Oct 11, 2013, at 12:07 PM, Eric Christopher <echr...@gmail.com> wrote:


Since we don't have ODR, we may have macros defined differently for a struct
in a .h file,
thus having two versions of the struct from two different CU. It seems that
we can't assume
structs with the same name and defined in the same file/line/column are the
same.


Ah right sorry, I remember this. Also, macros are evil, just ask the
modules guys :)

Hashing the types can happen either at the front end or at IR level. That is
our first design choice :)


Sorta :)

I think we should try not to hash the types for non-LTO builds at the front
end or at IR level, since it does not give us
any benefit given that we are hashing them at the back end.

One advantage of hashing it at IR level is that we can just hash the MDNodes
that affect the
type MDNode, at front end, the AST contains more information and should be
harder to hash.

It depends upon the goals. If the goal is to make debug information
post-link smaller then just using the type hashing machinery for
structs will be sufficient. However, if it's to save space during an
LTO link then we'll want to do it in the front end.

Doug: Have a link for how you do the C type merging for modules?

Modules foists the C++ one definition rule on C/Objective-C so that it can avoid performing type merging, so we can’t look there.

C doesn’t have a one definition rule per se. The cross-translation-unit compatibility rules are in 6.2.7 of the C standard, which boils down to structural equality:

  1. Moreover, two structure, union, or enumerated types declared in separate translation units are compatible if their tags and members satisfy the following requirements: If one is declared with a tag, the other shall be declared with the same tag. If both are complete types, then the following additional requirements apply: there shall be a one-to-one correspondence between their members such that each pair of corresponding members are declared with compatible types, and such that if one member of a corresponding pair is declared with a name, the other member is declared with the same name. For two structures, corresponding members shall be declared in the same order. For two structures or unions, corresponding bit-fields shall have the same widths. For two enumerations, corresponding members shall have the same values. 


- Doug

Eric Christopher

unread,
Oct 14, 2013, 6:52:34 PM10/14/13
to Douglas Gregor, llv...@cs.uiuc.edu
> Modules foists the C++ one definition rule on C/Objective-C so that it can
> avoid performing type merging, so we can’t look there.
>
> C doesn’t have a one definition rule per se. The cross-translation-unit
> compatibility rules are in 6.2.7 of the C standard, which boils down to
> structural equality:
>
> Moreover, two structure, union, or enumerated types declared in separate
> translation units are compatible if their tags and members satisfy the
> following requirements: If one is declared with a tag, the other shall be
> declared with the same tag. If both are complete types, then the following
> additional requirements apply: there shall be a one-to-one correspondence
> between their members such that each pair of corresponding members are
> declared with compatible types, and such that if one member of a
> corresponding pair is declared with a name, the other member is declared
> with the same name. For two structures, corresponding members shall be
> declared in the same order. For two structures or unions, corresponding
> bit-fields shall have the same widths. For two enumerations, corresponding
> members shall have the same values.
>

Bummer, I was hoping you had something clever. :)

Manman Ren

unread,
Oct 14, 2013, 7:14:40 PM10/14/13
to Eric Christopher, llv...@cs.uiuc.edu
On Mon, Oct 14, 2013 at 1:08 PM, Manman Ren <manma...@gmail.com> wrote:



On Fri, Oct 11, 2013 at 12:40 PM, Eric Christopher <echr...@gmail.com> wrote:
>> It depends upon the goals. If the goal is to make debug information
>> post-link smaller then just using the type hashing machinery for
>> structs will be sufficient.
>
>
> By "the type hashing machinery for structs", are you referring to the type
> hashing at the back end?
>

I am, yes, since there's no other place we do currently.

>>
>> However, if it's to save space during an
>> LTO link then we'll want to do it in the front end.
>
>
> Yes, my purpose here is to save memory space in number of MDNodes (also # of
> DIEs) generated in a LTO build.
> Type hashing at the DIE level can reduce the dwarf size.
>

I agree with both of these statements.

I also agree with the desire to help LTO memory consumption so we'll
need something from the front end for this since we'd like to continue
to use the folding set to do the uniquing.

Hi Eric,

Assume that we need to do type hashing (i.e. assume Doug's rules for merging C types do not apply),

Now the assumption is true, any opinion on where to do the hashing?

Thanks,
Manman

Eric Christopher

unread,
Oct 21, 2013, 5:27:14 PM10/21/13
to Manman Ren, llv...@cs.uiuc.edu
We should still do it in the front end for the types with a language
specific way. Nothing has greatly changed versus, say, C++ here - it's
just easier in C++ because of the language.

Manman Ren

unread,
Oct 21, 2013, 9:29:00 PM10/21/13
to Eric Christopher, llv...@cs.uiuc.edu
Hi Eric,

We don't have any hashing implementation in the front end, so I don't quite get what you mean
by "still do it in the front end" :)

For C++, we don't need to hash the types because of ODR. For other languages, is it better
to hash the MDNodes instead of the AST nodes because of the following?
1: we can handle all languages without ODR
2: we don't need to update each front-end that tries to take advantage of type uniquing
3: the AST contains more information and MDNodes contain all the necessary information for Dwarf

I would like to propose the following:
Step 1: When we are creating the MD nodes, we can provide a simple identifier that is unique within the DIBuider.
             In DIBuilder, implement generateTypeIdentifier for types that are globally visible (one possibility is the type name appended with a unique ID)
Step 2: In DIBuilder::finalize(), we call hashing algorithm to update the type identifiers generated in step 1 to be the type name appended with its hash
Step 1 is necessary because we want to make sure we are using the type identifiers when referring to the types.
Without step 1, the type reference will be via MDNode, and a MDNode field can't be updated to a type identifier later on.
Step 3: We can then use the folding set to do the uniquing across CUs, during linking.

Let me know your thoughts,

Thanks,
Manman


-eric

Reply all
Reply to author
Forward
0 new messages