Folks, we are working on optimizing binary size and improving debug info quality.
To reduce the size of the binary we use -ffunction-sections so that unused code can be garbage collected.
When the linker does garbage collection, a lot of abandoned debug info is left behind.
Besides inflated debug info size, we end up with overlapping address ranges and no way to tell valid ranges from garbage ones (D59553).
To resolve these two problems, we use an implementation extracted from dsymutil: https://reviews.llvm.org/D74169.
It adds a --gc-debuginfo command line option to the linker to remove obsolete debug info.
Currently, it has the following limitations: it does not support DWARF5, modules, -fdebug-types-section, type units, .debug_types, multiple .debug_info sections, split DWARF, or thin LTO.
The following are size/performance results for D74169:
A: --function-sections --gc-sections
B: --function-sections --gc-sections --gc-debuginfo
C: --function-sections --gc-sections --fdebug-types-section
D: --function-sections --gc-sections --gsplit-dwarf
E: --function-sections --gc-sections --gc-debuginfo --compress-debug-sections=zlib
LLVM code base:
-------------------------------------------------------
| Options | build time  | bin size    | lib size      |
-------------------------------------------------------
| A       | 54min(100%) | 19.0G(100%) | 15.0G(100.0%) |
-------------------------------------------------------
| B       | 65min(120%) |  9.7G( 51%) | 12.0G( 80.0%) |
-------------------------------------------------------
| C       | 53min( 98%) | 12.0G( 63%) | 15.0G(100.0%) |
-------------------------------------------------------
| D       | 52min( 96%) | 12.0G( 63%) |  8.2G( 55.0%) |
-------------------------------------------------------
| E       | 64min(118%) |  5.3G( 28%) | 12.0G( 80.0%) |
-------------------------------------------------------
Clang binary:
-------------------------------------------------------
| Options | size        | link time   | used memory   |
-------------------------------------------------------
| A       | 1.50G(100%) |  9sec(100%) |  9307MB(100%) |
-------------------------------------------------------
| B       | 0.76G( 50%) | 68sec(755%) | 15055MB(161%) |
-------------------------------------------------------
| C       | 0.82G( 54%) |  8sec( 89%) |  8402MB( 90%) |
-------------------------------------------------------
| D       | 0.96G( 64%) |  6sec( 67%) |  4273MB( 46%) |
-------------------------------------------------------
| E       | 0.43G( 29%) | 77sec(855%) | 15000MB(161%) |
-------------------------------------------------------
lldb loading time:
-----------------------------------------
| Options | time         | used memory  |
-----------------------------------------
| A       | 6.4sec(100%) | 1495MB(100%) |
-----------------------------------------
| B       | 4.0sec( 63%) |  826MB( 55%) |
-----------------------------------------
| C       | 3.7sec( 58%) |  877MB( 59%) |
-----------------------------------------
| D       | 4.3sec( 67%) | 1023MB( 69%) |
-----------------------------------------
| E       | 2.1sec( 33%) |  478MB( 32%) |
-----------------------------------------
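The percentages in the tables are ratios against configuration A. As a quick sanity check, the sketch below recomputes them for the Clang binary table. The numbers are copied from that table; the helper name `relative` is only for illustration. Small differences from the printed percentages are to be expected, presumably because the table was computed from unrounded measurements.

```python
# Values copied from the "Clang binary" table:
# config -> (size in GB, link time in seconds, used memory in MB).
CLANG = {
    "A": (1.50, 9, 9307),
    "B": (0.76, 68, 15055),
    "C": (0.82, 8, 8402),
    "D": (0.96, 6, 4273),
    "E": (0.43, 77, 15000),
}


def relative(config, baseline="A"):
    """Return each metric of `config` as a percentage of the baseline config."""
    return tuple(100.0 * v / b for v, b in zip(CLANG[config], CLANG[baseline]))


if __name__ == "__main__":
    for name in CLANG:
        size, link, mem = relative(name)
        print(f"{name}: size {size:5.1f}%  link time {link:6.1f}%  memory {mem:6.1f}%")
```

For B this gives roughly half the size at about 7.5 times the link time, which is the tradeoff under discussion.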
I would like to discuss the results and decide whether it is worth integrating D74169.
Improvements:
1. Reduces the size of debug info (by about 50%).
2. Resolves the overlapping address ranges problem (D59553).
3. The reduced debug info size allows tools to work faster and use less memory.
Drawbacks and unimplemented features:
1. Linking time is increased (755%).
The --gc-debuginfo option is off by default, so it would affect only those who need it and explicitly specify it.
I think the current DWARFLinker code could be optimized further to improve performance.
2. Support of type units.
That could be implemented later.
3. DWARF5.
The current DWARFEmitter/DWARFStreamer has its own implementation of DWARF generation, which does not support
DWARF5 (only the debug_names table). At the same time, there is already code in CodeGen/AsmPrinter/DwarfDebug.h
that implements most of DWARF5. It seems that DWARFEmitter/DWARFStreamer should be rewritten using
DwarfDebug/DwarfFile, though I am not sure whether it would be easy to reuse DwarfDebug/DwarfFile.
It would probably be necessary to factor out some intermediate layer of DwarfDebug/DwarfFile.
4. Split DWARF support.
This solution does not currently work with split DWARF, but it could be useful for split DWARF in two ways:
a) The generation of the skeleton file could be changed so that address ranges pointing to garbage
collected code are replaced with lowpc=0, highpc=0. That would solve the problem of overlapping address
ranges (D59553).
b) An approach similar to the dsymutil implementation could be used to generate monolithic debug info
from .dwo files. That suggestion is from https://reviews.llvm.org/D74169#1888386.
I.e., DWARFLinker could be taught to generate the same output as D74169 but with split DWARF as the source.
5. -fmodules-debuginfo
That problem was described in this review: https://reviews.llvm.org/D54747#1505462. Currently, DWARFLinker/dsymutil has the same problem. It could be solved using the fact that DWARFLinker analyzes debug info: it could recognize debug info generated for
a module and keep it (compile units containing debug info for modules do not have low_pc, high_pc).
6. -flto=thin
That problem was described in this review: https://reviews.llvm.org/D54747#1503720. It also exists in the current DWARFLinker/dsymutil implementation. I think that problem should be discussed more: it could probably be fixed by avoiding the generation of such incomplete
declarations during ThinLTO or, alternatively, DWARFLinker could recognize such a situation and copy the missing type declaration.
=======================================================================================
Debuginfo and linker folks, what do you think about the current results and future directions?
It introduces quite a significant link time increase (6x-8x), but it would affect only those who use the feature,
so users will be able to decide whether that link time increase is acceptable.
Resolving all of points 1-6 is a significant amount of work, but as a result the debug info would be more correct and compact.
Do you think it would be good to integrate it and start working on improvements?
Thank you, Alexey.
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>Hi Alexey,
Hi David, sorry for the delayed answer; it took some time to prepare. Please find the answers below...
>Broad question: Do you have any specific motivation/users/etc in implementing this (if you can speak about it)?
> - it might help motivate the work, understand what tradeoffs might be suitable for you/your users, etc.
There are two general requirements:
1) Remove (or clean) invalid debug info.
2) Optimize the DWARF size.
The specifics of our users:
- an embedded platform which uses 0 as the start of the .text section.
- a custom toolset which does not support all features yet (e.g. split DWARF).
- tolerance of the link-time increase.
- a need for a convenient way to share debug builds.
For the first point: we have the problem of "Overlapping address ranges starting from 0" (D59553). We use a custom solution, but a general solution like D74169 would be better here.
For the second point: split DWARF could be a good alternative for getting debug info of minimal size. Still, it has drawbacks (it is not currently supported by our tools, does not solve the "Overlapping address ranges" problem, and is not very convenient to share, even using .dwp).
Thus, in the long term, D74169 looks like a good solution for us: it resolves the "Overlapping address ranges" problem, produces a binary of minimal size, is supported by current tools, and makes it easy to share a debug build (a single binary of minimal size).
> In general, in the current state, I don't have strong feelings either way about this going in as-is with the intent to improve it to make it more viable - or some of that work being done out-of-tree until it's a more viable performance tradeoff. Mostly happy to leave that up to folks more involved with lld.
>
>A couple of minor points...
>> C: --function-sections --gc-sections --fdebug-types-section
> ^ not sure of the point of testing/showing comparisons with a situation that's currently unsupported
That situation is currently supported (--gc-debuginfo is not used in this measurement).
"--fdebug-types-section" is supported functionality.
The purpose of this data is to compare the results for "--fdebug-types-section" and "--gc-debuginfo".
>>2. Support of type units.
>> That could be implemented further.
>Enabling type units increases object size to make it easier to deduplicate at link time by a DWARF-unaware linker. With a DWARF aware linker it'd be generally desirable not to have to add that object size overhead to get the linking improvements.
But DWARFLinker should still work adequately with type units, since they are already implemented.
If someone uses --fdebug-types-section, then it should work adequately when used together with --gc-debuginfo (if --gc-debuginfo is accepted). Right?
Another thing is that the idea behind type units has the potential to help a DWARF-aware linker work faster. Currently, DWARFLinker analyzes context to understand whether types are the same or not.
Hi David, please find my comments inside:
>>>2. Support of type units.
>>>
>>>> That could be implemented further.
>> >> 6. -flto=thin
>> >> That problem was described in this review https://reviews.llvm.org/D54747#1503720. It also exists in
>> >> current DWARFLinker/dsymutil implementation. I think that problem should be discussed more: it could
>> >> probably be fixed by avoiding generation of such incomplete declaration during thinlto,
A pure rnglist rewriting - I think it'd be OK to have in upstream -
again, cost/benefit/etc would have to be weighed. I'm not sure it
would save enough space to be particularly valuable beyond the
correctness issue - and it doesn't completely solve the correctness
issue for zero-address usage or low-address usage (because you could
still have overlapping subprograms inside a CU - so if you were
symbolizing you could use the correct rnglist to filter, but then go
look inside the CU only to find two subprograms that had that address
& not know which one was the correct one and which one was the
discarded one).
rnglist rewriting might be easy enough to prototype - but depends what
you want to spend your time on, I know this whole issue has been a
huge investment of your time already - but maybe this recent
revitalization of the conversation around having an explicit value in
the linker might be sufficient to address everyone's needs... *fingers
crossed*)
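
To make the remaining ambiguity concrete, here is a minimal sketch of the lookup a symbolizer would do inside one CU when a discarded subprogram has been resolved to address 0 on a platform where 0 is also a valid code address. The subprogram names and addresses are hypothetical, and the table stands in for real DW_TAG_subprogram DIEs.

```python
# Hypothetical subprogram table for one CU: (name, low_pc, high_pc).
# "dead_helper" was garbage-collected, so its addresses were resolved to
# 0 + addend, on a platform where live code also starts at address 0.
SUBPROGRAMS = [
    ("reset_handler", 0x0, 0x40),   # live code at the start of .text
    ("dead_helper",   0x0, 0x24),   # discarded by --gc-sections
    ("main",          0x40, 0x90),  # live
]


def subprograms_at(address, subprograms=SUBPROGRAMS):
    """Return names of all subprograms whose [low_pc, high_pc) contains address."""
    return [name for name, low, high in subprograms if low <= address < high]


if __name__ == "__main__":
    print(subprograms_at(0x10))  # two candidates: the lookup is ambiguous
    print(subprograms_at(0x50))  # a single candidate
```

Even with a corrected CU-level range list, the first lookup still returns both the live and the discarded subprogram, and nothing in the DIEs says which one is real.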
DWARF was designed in an era when COMDAT and ICF were not a thing, or at least not common, certainly not when talking about function code. The overhead of a unit occurred only once per translation unit, so that expense was reasonably amortized.
Splitting functions into their own object-file sections and making them excludable is an evolution of compiler/linker technology that DWARF has not kept up with. The linker-friendly solutions (COMDAT DWARF) would put function-related .debug_* contributions into a section-group along with the function .text itself; this multiplies the total number of sections to deal with, regardless of the tactics used for the content of each per-function DWARF section. The fully DWARF-conformant solution would create one partial_unit per function, with the corresponding overhead of unit headers (especially painful in the .debug_line section). Alternatively we fragment DWARF into sections without headers and rely on the linker to make everything look right in the linked executable; this produces .o files that are not DWARF conformant (unless we can standardize this in DWARF v6) and would be a big hassle for consumers other than the linker.
Or we pay the cost of parsing, trimming, and rewriting all the DWARF in the linker.
--paulr
>DWARF was designed in an era when COMDAT and ICF were not a thing, or at least not common,
>certainly not when talking about function code. The overhead of a unit occurred only once per
>translation unit, so that expense was reasonably amortized.
>Splitting functions into their own object-file sections and making them excludable is an evolution of
>compiler/linker technology that DWARF has not kept up with. The linker-friendly solutions (COMDAT
>DWARF) would put function-related .debug_* contributions into a section-group along with the function
>.text itself; this multiplies the total number of sections to deal with, regardless of the tactics used for the
> content of each per-function DWARF section. The fully DWARF-conformant solution would create one
> partial_unit per function, with the corresponding overhead of unit headers (especially painful in the
> .debug_line section). Alternatively we fragment DWARF into sections without headers and rely on the
> linker to make everything look right in the linked executable; this produces .o files that are not DWARF
>conformant (unless we can standardize this in DWARF v6) and would be a big hassle for consumers
>other than the linker.
>Or we pay the cost of parsing, trimming, and rewriting all the DWARF in the linker.
Perhaps we could try to make DWARF easier to parse, trim, and rewrite, so that a full DWARF
parsing solution would not take too much time?
For example, the -fdebug-types-section solution uses COMDAT sections to split and deduplicate types.
That solution works quite fast. It has the already mentioned drawback of a big size
overhead (because of the sizes of section headers and type unit headers). But the fact that type units
can be identified just by hash-id (without parsing type names and type hierarchies)
allows the linker to reject duplicates quickly. Another thing is that the linker drops
duplicated COMDAT sections without any additional checks. After the duplicates are deleted,
the debug info is still consistent.
A DWARF-aware solution could be built on the same two principles:
1. compare types by hash-id.
2. drop duplicates without analyzing their contents.
If all types are put into a separate type table and have a hash-id, then it would be much easier to
deduplicate them. The idea is demonstrated here: https://reviews.llvm.org/P8164. (It still has open
questions: whether base types should be put into the type table, whether references into the type table
should be made by DW_AT_signature or just by offset, etc.) While handling that separate type table,
the DWARF-aware linker would check only the hash-id and put only one type description
with a given id into the final type table. It would also allow us to solve the -flto=thin problem
(http://lists.llvm.org/pipermail/llvm-dev/2020-May/141938.html; there is a dsymutil example there),
i.e., the case where a type definition is removed would not occur.
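
The two principles above can be sketched roughly as follows. This is only an illustration of the idea, not the P8164 format: the per-object type table is modelled as a dict from a hash-id to an opaque blob, and the function name is made up.

```python
def merge_type_tables(object_files):
    """Keep the first type description seen for each hash-id.

    `object_files` is a list of dicts, one per object file, mapping a
    hash-id (an int here) to an opaque blob of type description bytes.
    The blobs are never inspected; only the hash-ids are compared.
    """
    merged = {}
    for table in object_files:
        for hash_id, blob in table.items():
            merged.setdefault(hash_id, blob)  # later duplicates are dropped
    return merged


if __name__ == "__main__":
    a_o = {0x1111: b"struct Foo", 0x2222: b"struct Bar"}
    b_o = {0x2222: b"struct Bar", 0x3333: b"struct Baz"}
    final = merge_type_tables([a_o, b_o])
    print(sorted(hex(h) for h in final))
```

The cost per type is one hash lookup, independent of how large or deeply nested the type description is.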
Thank you, Alexey.
"object files don't contain DWARF, but they contain stuff that the
linker will turn into DWARF" wouldn't seem like the worst thing to me
- what sort of pre-linking parsing of DWARF use cases do you have in
mind, other than for our own compiler development uses?
(notwithstanding in-object Split DWARF (where the .dwo sections would
have to remain usable without linking) or the MachO style debug
info distribution model which is similar)
But even then, I'm not sure how viable it would be - as Fangrui
pointed out on another thread about this: ELF section overhead itself
is non-trivial ("sizeof(Elf64_Shdr) = 64.") & it would probably be
rather difficult to reconstruct header-less slice-and-dicable sections
in some cases. For type information (a reduced overhead version of
-fdebug-types-section) I could see it - but for functions, they need
to refer to addresses - preferably in the debug_addr section, and
that's accessed by index, so taking chunks out of it would break other
references to it, etc... adding the header would be expensive, and how
would the CU construct its DW_AT_ranges value if that has to be sliced
and diced? Again, some amount of linker magic might solve some of
these problems - but I think there's still a lot of overhead to making
a solution that's workable with a DWARF-agnostic linker (or even with
a DWARF aware one, but in an efficient amount of time/space where it's
not only usable for small programs, or for linking when you're
shipping a final production binary, etc)
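
For a rough feel of the section header cost mentioned above, here is a back-of-the-envelope sketch. The Elf64_Shdr field layout is the standard one; the function count and the number of debug sections per function are hypothetical inputs, and the estimate ignores section name strings, relocation sections, and group sections, so the real overhead would be higher.

```python
import struct

# Elf64_Shdr layout: sh_name, sh_type (u32); sh_flags, sh_addr, sh_offset,
# sh_size (u64); sh_link, sh_info (u32); sh_addralign, sh_entsize (u64).
ELF64_SHDR = struct.Struct("<IIQQQQIIQQ")


def section_header_overhead(num_functions, debug_sections_per_function):
    """Bytes of section headers added by per-function debug sections."""
    return num_functions * debug_sections_per_function * ELF64_SHDR.size


if __name__ == "__main__":
    print(ELF64_SHDR.size)
    # Hypothetical object file: 1000 functions, each contributing a
    # separate fragment to 3 debug sections.
    print(section_header_overhead(1000, 3))
```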
& as always, not sure how any of this would work for Split DWARF -
just a debug_addr section that has some addresses that point to
discardable functions... if we want those addresses themselves to be
discardable (so we don't have to use a tombstone value inserted by the
linker) then they'd need to be in separate debug_addr contributions
with headers, etc - the overhead just seems too high to me in all the
ways I can look at that.
I think there is scope for lower-overhead type deduplication,
especially now with type units being merged into the debug_info
section. Perhaps we could drop dwo_ids and use section references to
refer to types & rely on the linker to keep those referenced sections
alive - though section references are longer than CU-relative
references. (but we need the extra length - because if the linker
deduplicates a type definition - one CU may be referencing a type very
far away, so the shorter reference might be inadequate) I don't think
the indirection through the type hash is /super/ significant to the
cost - I think it's more in the duplication of many DIEs especially
for function definitions (since the type unit sig8 system only
provides a way to reference the type - not its member functions, their
parameters, etc - so all those DIEs get duplicated in any CU that
needs to provide a definition of a member function). We could
prototype cross-unit DIE references to lower the cost of that
duplication, though rumor has it that constructor based type homing
might provide enough value to obviate the need for type units (or at
least make the overhead not worthwhile - so revisiting the overhead to
reduce it might make it worthwhile again... ).
Probably wouldn't be super hard to use LLVM's existing cross-unit DIE
Referencing machinery (implemented for LTO) to refer directly to DIEs
in a type unit without using the signature system... - hmm, that'd
only work if your type unit DIEs were identical? /maybe/ ? Not sure
how that'd work if you wanted to refer into a type unit, but the type
unit got deduplicated. Might be able to rely on the linker to preserve
every unique copy of the type unit that's referenced if we phrase
things carefully - so if your compiler does produce exactly identical
type units they get deduplicated and sec_refs refer to the uniquely
preserved copy - but otherwise it preserves as many distinct copies as
needed. (I don't know enough about how that works to be sure - but I
know that this linkonce/inline function deduplication does seem to
cause the DWARF to refer to the singular function if that function is
identical, and if it isn't, then you get 0 - so there's /something/ in
the linker that can adjust for deduplicating identical duplicates... )
Yep, if they're sub-contribution regions, that wouldn't play well with
Split DWARF. (& full contribution isolation has the DWARF header
overhead, etc)
I'd still be concerned about the ELF header overhead even of this
sub-contribution scheme, but could be interesting to see how it plays
out in practice.
All that said, to avoid burying the lede here, I'll splice something
from the end up here:
> Although the point is not to avoid tombstone values, but to do a more efficient job of editing the final DWARF to omit gc'd functions; it's no problem at all to use a tombstone value in .debug_addr IMO.
But the tombstone values are Alexey's underlying issue (this ongoing
design discussion for over a year now) & /sort/ of mine too recently
(which, unfortunately, is what's reinvigorated this discussion -
would've been nice if I/we/someone had identified this sooner &
could've helped Alexey in a more timely manner): Alexey is dealing
with a platform where 0 is a valid address so the lld/gold strategy of
resolving relocations to dead code to "0+addend" creates ambiguous
DWARF. I'm dealing with a case of zero-length functions ("int f1() {
}" or "void f2() { __builtin_unreachable(); }") causing early
termination of DWARFv4 range lists.
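
The early termination case can be illustrated with a simplified reader for a DWARF v4 .debug_ranges list, where a (0, 0) pair marks the end of the list. This is a sketch only: base address selection entries are ignored, and the addresses are hypothetical.

```python
def read_v4_range_list(entries):
    """Read (begin, end) pairs of a DWARF v4 .debug_ranges list.

    The list ends at the first (0, 0) pair. Base address selection
    entries are not handled, to keep the sketch short.
    """
    ranges = []
    for begin, end in entries:
        if begin == 0 and end == 0:  # end-of-list marker
            break
        ranges.append((begin, end))
    return ranges


# Hypothetical CU range list. f1 is a zero-length function that was
# discarded, so both of its addresses were resolved to 0 by the linker.
CU_RANGES = [
    (0x1000, 0x1040),  # live function
    (0x0, 0x0),        # discarded zero-length f1: looks like end-of-list
    (0x2000, 0x2080),  # live function, never reached by the reader
    (0x0, 0x0),        # the real terminator
]

if __name__ == "__main__":
    print(read_v4_range_list(CU_RANGES))
```

The reader stops at the entry for the discarded function, so the live function at 0x2000 is silently missing from the CU's ranges.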
The reason for the DWARF-aware linker proposal was because the "let's
choose a better tombstone" discussion didn't go anywhere & people sort
of encouraged in this direction of "what if we didn't need a
tombstone/the linker fixed up the debug info instead". So if the DWARF
redundancy elimination doesn't address the issue of zero as a valid
address, it doesn't address Alexey's needs, unfortunately. :/
That said, I super appreciate the time you've put into writing this up
and it is valuable & I'd love to see some (even hand-crafted assembly)
prototypes, maybe do some back-of-the-envelope numbers to see whether
the ELF header overhead would be worth it, etc.
> > But even then, I'm not sure how viable it would be - as Fangrui
> > pointed out on another thread about this: ELF section overhead itself
> > is non-trivial ("sizeof(Elf64_Shdr) = 64.") & it would probably be
> > rather difficult to reconstruct header-less slice-and-dicable sections
> > in some cases. For type information (a reduced overhead version of
> > -fdebug-types-section) I could see it - but for functions, they need
> > to refer to addresses - preferably in the debug_addr section, and
> > that's accessed by index, so taking chunks out of it would break other
> > references to it, etc... adding the header would be expensive, and how
> > would the CU construct its DW_AT_ranges value if that has to be sliced
> > and diced? Again, some amount of linker magic might solve some of
> > these problems - but I think there's still a lot of overhead to making
> > a solution that's workable with a DWARF-agnostic linker (or even with
> > a DWARF aware one, but in an efficient amount of time/space where it's
> > not only usable for small programs, or for linking when you're
> > shipping a final production binary, etc)
>
> The idea we have blue-skied internally would work something like this
> (initially explicated in terms of the .debug_info section, then seeing
> how that tactic applies to other sections):
>
> There's a top fragment, containing the CU header and the CU DIE itself.
> Linker magic makes this first in the output file.
Quick curiosity: Is there existing linker magic for this? What does it
look like? I'd love to know so I can play around with hand crafted
prototypes/keep it in mind for such things.
(basically the ability for an object file to say "here's the start and
end of my contribution to this section, and some bits that /can/ go in
the middle, but you can drop them if you like")
> Types also go here; certainly base types, and other file-scope types
> can be included here or put into type units. (Type units aren't
> fragmented, they are their own thing same as always.)
Separately, it might be worth considering putting types in such a
thing - but, yes, the "How do you reference them when they might be in
your unit or someone else's unit", etc, would have to be figured out.
I guess using an external symbol might be the solution there - again,
with a better understanding of the ^ mentioned linker magic, I'd
probably play around with hand crafting some examples just to see how
this could work.
> There's a matching bottom fragment, which is just the terminating NULL
> for the CU DIE; linker magic makes this last in the output file.
Last of all the contributions from this object file, not last in the
whole output file, right? (please excuse the pedantry, just double
checking)
> Each function has its own fragment, which is in the same link-group
> (COMDAT or whatever) as the function's .text section; that way, if the
> function is discarded, so is the .debug_info fragment. Offhand I can't
> think of any cases (other than DW_AT_specification, addressed below) of
> references to a subprogram DIE from elsewhere,
The call_site DWARF would want to refer to a subprogram DIE, but that
could be handled by (first pass) having a declaration subprogram in
the initial fragment that the call_site could refer to using the usual
assembler-resolved CU-relative offset. Of course that'd mean a bunch
of (probably the bigger part) of the function's DWARF footprint
wouldn't be deduplicated, but would address this part of the address
tombstone issue (if not using debug_addr) & reduce some of the DWARF -
the addresses are pretty big (if you're not pooling them), etc.
> so it should be fine to
> discard the entire function fragment as needed. Linker magic puts all
> function fragments between the top and bottom fragments, in some
> indeterminate order. Each function fragment is the usual complete
> subtree, rooted in DW_TAG_subprogram.
Rooted at the top level (well, below the DW_TAG_compile_unit) DIE, as
you mention later - namespace, or whatever else.
> References to types are either
> to type units as normal, or to types in the top fragment. Note that
> these references do not require relocations; type units are by signature
> as always, and for types in the top fragment, the offsets into the top
> fragment are known at compile time.
>
> Inlined functions are described as part of the function they have been
> inlined into, being children of the function DIE. DW_AT_specification
> refers to the abstract declaration which is in its own fragment (or the
> top fragment, but that keeps the declaration from being elided if all
> references go away).
Yep, this overlaps with the call_site stuff I mentioned earlier - same
ideas. Either top fragment, or its own fragment. Keeping its own
fragment alive, and figuring out how to reference it (depending on
fragment layout/elision) would require some work, but I think it's
do-able. Might even be do-able so it can be deduplicated across CUs
(use a sec_offset form, use a linker-resolved relocation to it) - this
infrastructure would overlap with type deduplication without type
units too.
Though linker resolved relocations add more bytes...
> If functions are inside namespaces, each function fragment will need
> to have namespace DIEs around the function DIE. This adds overhead
> but it's pretty small.
>
> I hand-wave filling in the CU header's unit length. I'd expect a
> relocation with a reference to the bottom fragment should be able to
> compute the correct value.
*nod*
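
As a toy model of the .debug_info scheme described so far (top fragment, per-function fragments, bottom fragment, unit length patched at link time), something like the following could be used for experiments. Everything here is an assumption for illustration: the fragments are placeholder byte strings rather than real DWARF, the 32-bit DWARF format and little-endian encoding are assumed, and the function name is made up.

```python
import struct


def link_debug_info(top, function_fragments, bottom, live_functions):
    """Concatenate top + surviving function fragments + bottom.

    `top` starts with a 4-byte placeholder for the 32-bit unit_length,
    which counts the bytes that follow the length field itself.
    `function_fragments` is a list of (function name, bytes) pairs.
    """
    body = top[4:]
    for name, fragment in function_fragments:
        if name in live_functions:  # discarded functions lose their fragment
            body += fragment
    body += bottom
    return struct.pack("<I", len(body)) + body


if __name__ == "__main__":
    top = b"\x00\x00\x00\x00" + b"HDR+CUDIE"
    fragments = [("f", b"<f>"), ("g", b"<g>")]
    out = link_debug_info(top, fragments, b"\x00", live_functions={"f"})
    print(out)
```

In this model the unit length is simply recomputed after concatenation; in the scheme as described it would come from a relocation against the bottom fragment.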
> That's the story for .debug_info; what about other sections?
>
> Sections referenced by index from .debug_info can't be fragmented;
> this would be: .debug_abbrev, .debug_addr, .debug_str_offsets.
>
> .debug_str doesn't need to be fragmented, linkers DTRT already.
(linkers deduplicate debug_str - but can they be made to remove
unreferenced strings too? in that case we'd have an interesting
tradeoff of maybe using FORM_strp rather than strx - if we wanted the
linker to be able to drop strings from dropped function definitions,
etc)
> .debug_macro contents are not tied to functions and won't be fragmented.
>
> .debug_loclists and .debug_rnglists should be fragmentable the same
> way as .debug_info; they exist only as extensions of .debug_info, and
> the range list for the CU itself is merely a concatenated set of
> contributions from each constituent function, so that should Just Work
> (although it won't be optimal, adjacent ranges won't be coalesced).
At least the way we currently emit loclists and rnglists is by using
an index (the header of loclists and rnglists has an index to offset
mapping) - like strx, this would make it hard/impossible for a
DWARF-agnostic linker to see through to find out which indexes were
actually used. We could potentially not use the loclistx/rnglistx
forms/indexes from fragments - instead using sec_offsets that would
make them relocatable/removable/etc. (so long as all the index-based
referenced lists came in the debug_loclist/debug_rnglist header
fragment)
> I believe the same is true for .debug_loc and .debug_ranges, although
> I haven't checked.
Yep, those ones are easier - there's no contribution header, they can
only be referenced via sec_offset, so slicing and dicing them is
cheap.
But the tombstone problem still exists for the CU's debug_ranges -
though /maybe/ it could be carefully constructed from fragments...
that's going to be a /lot/ of sections in the end though.
> .debug_aranges is functionally equivalent to the CU rangelist.
Yup. (as we've touched on before, we don't use aranges at Google -
instead relying on CU's ranges which are just a little more expensive
to retrieve - but no need to duplicate the data in both places - if
consumers really find the aranges worthwhile to avoid parsing a few
attributes on the CU DIE, perhaps a future spec could let
debug_aranges reference a range list? so that aranges and the CU could
share the same data?)
> .debug_line can work the same way as .debug_info but is worth a word.
> The top fragment has the header, including the directory/file lists
> because those are referenced by index. DW_LNE_define_file can't be
> used. Each function has a fragment containing the sequence for that
> function, starting with set_address and ending with end_sequence.
> The bottom fragment is empty, existing only to allow the length to
> be computed.
Yep - can't remove dead file and directory names, unfortunately - and
the line table's pretty compact, so not sure it'd be a great savings
(especially compared to the ELF section overhead - at the object file
size at least (though probably a small win for linked executable
size)). Chances are those strings (now in debug_line_str) would be
used /somewhere/ in the program, so linker string deduplication would
get most of the wins - just dead offset entries in the line table
header.
> .debug_line_str is a string section and requires nothing special.
>
> .debug_names ... haven't looked at it but I suspect either it doesn't
> survive or it has to be generated post-link (or by the linker).
Generally you're going to want a DWARF-aware linker for debug_names,
same as gdb-index, etc.
> .debug_frame I *think* can be fragmented, but I haven't taken the
> time to look at it to make sure.
>
> Those are all the sections I see in DWARF v5 Appendix B.
>
> So that's the blue-sky vision of linker-magic COMDAT DWARF, which
> took me about an hour to write down just now. There is certainly
> a non-trivial overhead in terms of ELF sections; in the general
> case we would have 5 per-function fragments (for .debug_info,
> .debug_line, .debug_rnglists, .debug_loclists, .debug_aranges).
>
> Not small, but then other features in the works are using huge
> quantities of ELF sections too (section-per-basic-block).
That work's being scoped to be fairly selective about which basic
blocks it puts in unique sections - just those that are especially
performance sensitive, so the cost isn't as high as you might
otherwise imagine. Adding 5 new sections per function would be
probably a significantly larger growth than anything else I'm aware
of, but I haven't run the numbers by any means.
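As a rough illustration of the scale involved (not a measurement), the fixed section-header cost of per-function fragments can be estimated as below. The sketch assumes ELF64, counts only section headers (sized from the Elf64_Shdr field layout) and ignores group sections, symbols, relocations and section-name strings; the function count is a made-up input.

```python
import struct

# Elf64_Shdr fields: sh_name, sh_type (4 bytes each); sh_flags, sh_addr,
# sh_offset, sh_size (8 bytes each); sh_link, sh_info (4 bytes each);
# sh_addralign, sh_entsize (8 bytes each).
ELF64_SHDR_SIZE = struct.calcsize("<IIQQQQIIQQ")

def section_header_overhead(num_functions, sections_per_function=5):
    """Bytes of section headers added by per-function debug fragments."""
    return num_functions * sections_per_function * ELF64_SHDR_SIZE

# Hypothetical object file with 1000 functions.
print(ELF64_SHDR_SIZE, section_header_overhead(1000))
```

Whether that cost is acceptable depends on how large the per-function DWARF contributions are by comparison, which this sketch does not attempt to model.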
Thanks again for the write up!
- Dave
Your proposed option --dead-reloc-addend=.debug_info=0xffffffffffffffff
seems like a good idea. (I'd expect it to support signed -1 and -2 for
convenience & consistency in some other places (we sometimes use addends
as signed values)).
LLD only supports absolute relocation types (plus R_PPC64_DTPREL64 which
can go to .debug_addr, plus R_RISCV_{ADD,SUB}*).
The computed value is S + A.
We still consider the symbolic value S as zero, but override A with the
supplied option --dead-reloc-addend=.debug_info=-1
I particularly like that `addend` is part of the option name.
My only complaint is that the relocation record is not dead, but rather
its referenced symbol is dead. However, I can't think of a better
name...
Checked with Martin Storsjö, this option may be useful for other binary
formats supporting DWARF. (binutils does not like ELF-specific options
not called -z foobar).
I think it is fine to add this option to LLD if GNU ld is also happy
with the name. I'll check with them.
"There is a danger that one community won't accept an extension that
they haven't been involved in the design process for." :) (Courtesy of Peter)
The built-in rules of the linker are the following:
--dead-reloc-addend=.debug_loc=-2
--dead-reloc-addend=.debug_ranges=-2
--dead-reloc-addend=.debug_*=-1
They can be overridden.
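A minimal sketch of how these rules could be applied, assuming that rules are matched in order with shell-style globs and that the result is truncated to the relocation width. The function name and the matching order are illustrative assumptions, not part of the proposal or of any linker.

```python
from fnmatch import fnmatchcase

# Rules are checked in order, so the specific sections come before the glob.
DEAD_RELOC_ADDENDS = [
    (".debug_loc", -2),
    (".debug_ranges", -2),
    (".debug_*", -1),
]

def dead_reloc_value(section_name, width_bytes=8, rules=DEAD_RELOC_ADDENDS):
    """Value stored for an absolute relocation whose symbol was discarded.

    S is treated as 0 and A is replaced by the rule's addend, so S + A is
    just that addend truncated to the relocation width. Returns None when
    no rule matches the section.
    """
    for pattern, addend in rules:
        if fnmatchcase(section_name, pattern):
            return (0 + addend) & ((1 << (8 * width_bytes)) - 1)
    return None

print(hex(dead_reloc_value(".debug_info")))       # 0xffffffffffffffff
print(hex(dead_reloc_value(".debug_ranges", 4)))  # 0xfffffffe
```

An override would then amount to putting a user-supplied rule ahead of the built-in ones.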
Hey, I have read
https://groups.google.com/forum/#!msg/generic-abi/A-1rbP8hFCA/EDA7Sf3KBwAJ
"monolithic input section handling" from Ben:)
+1 for "-1 except .debug_loc/.debug_ranges use -2"
>>
>> That said, I super appreciate the time you've put into writing this up
>> and it is valuable & I'd love to see some (even hand-crafted assembly)
>> prototypes, maybe do some back-of-the-envelope numbers to see whether
>> the ELF header overhead would be worth it, etc.
>
>It would be nice to verify that the section-fragment idea would produce
>something that looked usable. Hand-written assembly... would require
>research into how to specify the right section attributes, but would
>likely be less effort than trying to make LLVM do something plausible.
>
>I'll see about creating an internal task for this.
According to Peter Smith, Arm Compiler 5 splits up DWARF v3 debugging
information and puts these sections into comdat groups:
"This approach did produce significantly more debug information than gcc
did. For small microcontroller projects this wasn't a problem. For
larger feature phone problems we had to put a lot of work into keeping
the linker's memory usage down as many of our customers at the time were
using 32-bit Windows machines with a default maximum virtual memory of 2Gb."
I'd also love to see some examples (even hand-crafted assembly).
We probably have to reuse the ".debug_info" string (in assembly this requires
unique linkage, which has been implemented in LLVM for a while but relatively
new in binutils (future 2.35)) which is already an entry in .strtab, otherwise
the string itself can cost quite a lot.
(Mostly https://sourceware.org/pipermail/binutils/2020-May/111361.html )
That said, I super appreciate the time you've put into writing this up
and it is valuable & I'd love to see some (even hand-crafted assembly)
prototypes, maybe do some back-of-the-envelope numbers to see whether
the ELF header overhead would be worth it, etc.
I second that: "it's probably best to at least initially frame the
discussion around non-configurable value for the sake of reducing the
scope/possible surface area of the feature/users/etc".
The need for a different concrete value would most probably
arise only if there is a tool that uses that other value.
Until there is a known use case, it would be better to use just:
--dead-reloc-addend
Thank you, Alexey.
>>FWIW, I think it's probably best to at least initially frame the
>>discussion around non-configurable value for the sake of reducing the
>>scope/possible surface area of the feature/users/etc. I'd probably
>>only encourage adding the user-configurable flag if/when someone has a
>>use case for it.
>I second that: "it's probably best to at least initially frame the
>discussion around non-configurable value for the sake of reducing the
>scope/possible surface area of the feature/users/etc".
>The necessity of using some different concrete value most probably
>would arise if there is a tool which uses this another value.
>Until there is a known use case, it would be better to use just:
>--dead-reloc-addend
--dead-reloc-addend=<value>
David>I think there is scope for lower-overhead type deduplication,
David>especially now with type units being merged into the debug_info
David>section. Perhaps we could drop dwo_ids and use section references to
David>refer to types & rely on the linker to keep those referenced sections
David>alive - though section references are longer than CU-relative
David>references. (but we need the extra length - because if the linker
David>deduplicates a type definition - one CU may be referencing a type very
David>far away, so the shorter reference might be inadequate) I don't think
David>the indirection through the type hash is /super/ significant to the
David>cost - I think it's more in the duplication of many DIEs especially
David>for function definitions (since the type unit sig8 system only
David>provides a way to reference the type - not its member functions, their
David>parameters, etc - so all those DIEs get duplicated in any CU that
David>needs to provide a definition of a member function). We could
David>prototype cross-unit DIE references to lower the cost of that
David>duplication, though rumor has it that constructor based type homing
David>might provide enough value to obviate the need for type units (or at
David>least make the overhead not worthwhile - so revisiting the overhead to
David>reduce it might make it worthwhile again... ).
David>Probably wouldn't be super hard to use LLVM's existing cross-unit DIE
David>Referencing machinery (implemented for LTO) to refer directly to DIEs
David>in a type unit without using the signature system... - hmm, that'd
David>only work if your type unit DIEs were identical? /maybe/ ? Not sure
David>how that'd work if you wanted to refer into a type unit, but the type
David>unit got deduplicated. Might be able to rely on the linker to preserve
David>every unique copy of the type unit that's referenced if we phrase
David>things carefully - so if your compiler does produce exactly identical
David>type units they get deduplicated and sec_refs refer to the uniquely
David>preserved copy - but otherwise it preserves as many distinct copies as
David>needed. (I don't know enough about how that works to be sure - but I
David>know that these linkonce/inline function deduplication does seem to
David>cause the DWARF to refer to the singular function if that function is
David>identical, and if it isn't, then you get 0 - so there's /something/ in
David>the linker that can adjust for deduplicating identical duplicates... )
Probably I was a bit unclear: the above idea is not for types
(placed in COMDAT sections) deduplicated by the linker.
This idea goes in another direction than fragmenting dwarf
using elf sections&tricks. It seems to me that the cost of fragmenting is too high.
It is not only the sizes of structures describing fragments but also the complexity
of tools that should be taught to work with fragmented DWARF.
(e.g. llvm-dwarfdump applied to an object file should be able to read fragmented DWARF,
but applied to a linked executable it should work with non-fragmented DWARF).
That idea is for the tool which works the same way as dsymutil ODR.
I will briefly describe the idea of making DWARF easier for dsymutil/DWARFLinker to process:
The idea is to have only one "type table" per object file(special section .debug_types_table).
This "type table" would contain all types.
There could be a special type of reference - type_offset - that offset points into the type table.
Basic types could always be placed into the start of "type table" thus, offsets to basic types
most often would be 1 byte. There also would be a special kind of reference - reference inside the type.
The type-unit sig8 system would not be used to reference types.
Type deduplication is assumed to be done not by the linker's COMDAT mechanism,
but by a tool like dsymutil. This tool would create the resulting .debug_types_table by copying in
types from the source .debug_types_table sections. Only one copy of each type would be placed into the
resulting table. All references pointing to a deleted copy would be corrected to point
to the single copy inside the "type table" (that is how dsymutil works currently).
sig8 hash-id would be used to compare types and to deduplicate them.
It would speed up the current dsymutil context analysis.
Types having the same hash-id could be deduplicated.
This would allow deduplicating more types than dsymutil currently does.
Incomplete type definitions having a similar set of members are not deduplicated by dsymutil currently.
In this case they would have the same hash-id.
This "type table" would take less space than current "type units" and current ODR solution.
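The merge step described above could be sketched roughly as follows. This is a toy model under stated assumptions: table entries are (hash_id, payload) pairs, references are entry indexes rather than byte offsets, and the function name is hypothetical; it is not how dsymutil is implemented.

```python
def merge_type_tables(tables):
    """Merge per-object type tables, keeping one copy per hash-id.

    `tables` holds one list per object file, each a list of
    (hash_id, payload) entries. Returns the merged table and, for each
    input table, a map from the old entry index to the merged index.
    """
    merged = []
    index_of_hash = {}
    remaps = []
    for table in tables:
        remap = {}
        for old_index, (hash_id, payload) in enumerate(table):
            if hash_id not in index_of_hash:
                # First time this type is seen: it becomes the single copy.
                index_of_hash[hash_id] = len(merged)
                merged.append((hash_id, payload))
            remap[old_index] = index_of_hash[hash_id]
        remaps.append(remap)
    return merged, remaps

a = [(0x11, "int"), (0x22, "struct foo")]
b = [(0x22, "struct foo"), (0x33, "struct bar")]
merged, remaps = merge_type_tables([a, b])
print(merged)
print(remaps)
```

The per-file remap is what a tool would then use to correct references that pointed at a dropped copy.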
The above is just an idea for helping a DWARF-aware linker (one based on the idea of removing obsolete debug info)
work faster (if that is of interest).
Alexey.
On 2020-06-04, Robinson, Paul via llvm-dev wrote:
>+ Ben Dunbobbin, whose name I take in vain below.
>He's my local expert on weird ELF features.
Hey, I have read
https://groups.google.com/forum/#!msg/generic-abi/A-1rbP8hFCA/EDA7Sf3KBwAJ
"monolithic input section handling" from Ben:)
Actually the thread Fangrui linked got me thinking & so I poked
around, according to <
https://docs.oracle.com/cd/E19683-01/816-1386/chapter6-94076/index.html>
it actually doesn't require as much magic as I thought it might:
"In the absence of the sh_link ordering information, sections from a
single input file combined within one section of the output file will
be contiguous and have the same relative ordering as they did in the
input file. The contributions from multiple input files will appear in
link-line order."
OK, so at the basic level, we could have a debug_info starting
section, some number of comdat debug_info sections, and the debug_info
tail section (the null terminator).
Keeping types alive/cross-referencing them might be trickier (since
their liveness isn't just tied to an existing real comdat), but
do-able.
Yeah, might try and work that up one of these days...
Turns out I decided to do it now. Some of this relates to
threaded-later-than-this-email replies I've read by the time I'm
writing this email.
Yep - for DIEs that don't need to be referenced (such as subprogram
DIEs - assuming they aren't the target of a call_site - and all global
variable DIEs (don't think there's any way to target them in LLVM's
current DWARF emission)), using comdats and relying on the linker's
guarantee (which is at least documented for some Unix linkers: "In the
absence of the sh_link ordering information, sections from a single
input file combined within one section of the output file will be
contiguous and have the same relative ordering as they did in the
input file. The contributions from multiple input files will appear in
link-line order." -
https://docs.oracle.com/cd/E19683-01/816-1386/chapter6-94076/index.html
"When not otherwise constrained, sections should be emitted in input
order." - https://web.stanford.edu/~ouster/cgi-bin/cs140-spring18/pintos/specs/sysv-abi-update.html/ch4.sheader.html
)
So that works a treat, exactly as Paul suggested.
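The layout being relied on can be modelled in a few lines. This is a simplified sketch, assuming each input file contributes an ordered list of .debug_info fragments, that the first copy of a comdat group wins, and that dropped groups are simply given as a set; the fragment payloads are placeholder bytes, not real DWARF.

```python
def link_debug_info(files, dead_groups=frozenset()):
    """Concatenate .debug_info fragments in input order.

    `files` is a list of input files in link-line order; each file is a
    list of (comdat_group_or_None, payload_bytes) fragments in the order
    they appear in that file.
    """
    seen_groups = set()
    out = bytearray()
    for fragments in files:
        for group, payload in fragments:
            if group is not None:
                if group in dead_groups or group in seen_groups:
                    continue  # garbage-collected, or a duplicate comdat copy
                seen_groups.add(group)
            out += payload
    return bytes(out)

# Header fragment, per-function comdat fragments, then the null terminator.
file1 = [(None, b"HDR1"), ("f1", b"<f1>"), ("inl", b"<inl>"), (None, b"\0")]
file2 = [(None, b"HDR2"), ("inl", b"<inl>"), ("f2", b"<f2>"), (None, b"\0")]
print(link_debug_info([file1, file2], dead_groups={"f2"}))
# b'HDR1<f1><inl>\x00HDR2\x00'
```

Each unit stays contiguous, with its header first and its terminator last, which is what lets the unit length be computed from the start and end of the surviving contribution.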
Also works to modify the debug_ranges/rnglists section keeping list
fragments in the same comdat group too - though the object size
overhead there is probably higher than worthwhile, given the size of
the contributions versus the size of ELF sections/groups/etc. If
someone really wants to trim their ranges/rnglists down for size - I
think a linker feature would be suitable there, because the section is
small/format is simpler (hmm, except addrx forms - those would require
parsing CU DIEs to get the addr_base to know which addresses were
referenced by the range list, etc... :/).
Doing this with types is a bit more difficult (yes, type units exist,
but I was wondering if we could avoid some of their overhead with
techniques like this) - I'm not sure how to make a hunk of .debug_info
that gets dropped if no symbol in it is referenced. (-gc-sections
didn't seem to activate on the hunk of .debug_info I tried... I guess
if that worked then all the debug_info would be dropped all the time -
so we'd need some way to specifically opt in). I also tried using an
external symbol so type hunks could be deduplicated by comdat, with
cross-CU references resolved by symbol - but that doesn't seem to have
worked out (the final value used for the symbolic reference is not the
desired value... I'm probably holding it wrong).
Yeah, I was just wondering whether it could be useful for that too.
> This idea goes in another direction than fragmenting dwarf
> using elf sections&tricks. It seems to me that the cost of fragmenting is too high.
I tend to agree - but I'm sort of leaning towards trying to use object
features as much as possible, then implementing just enough custom
handling in the linker to recoup overhead, etc. (eg: add some kind of
small header/brief description that makes it easy for the linker to
slice-and-dice - but hopefully a domain-specific such header can be a
bit more compact than the fully general ELF form)
> It is not only the sizes of structures describing fragments but also the complexity
> of tools that should be taught to work with fragmented DWARF.
> (f.e. llvm-dwarfdump applied to object file should be able to read fragmented DWARF,
> but applied to linked executable it should work with non-fragmented DWARF).
> That idea is for the tool which works the same way as dsymutil ODR.
>
> I will shortly describe the idea of making DWARF be easier processed by dsymutil/DWARFLinker:
>
> The idea is to have only one "type table" per object file(special section .debug_types_table).
> This "type table" would contain all types.
> There could be a special type of reference - type_offset - that offset points into the type table.
> Basic types could always be placed into the start of "type table" thus, offsets to basic types
> most often would be 1 byte. There also would be a special kind of reference - reference inside the type.
> Type units sig8 system - would not be used to reference types.
>
> Types deduplication is assumed to be done, not by linker mechanism for COMDAT,
> but by a tool like dsymutil. This tool would create resulting .debug_types_table by putting there
> types from source .debug_types_table-s. Only one copy of the type would be placed into the
> resulting table. All references pointing to the deleted copy would be corrected to point
> to the single copy inside "type table". (that is how dsymutil works currently)
^ that's the step that's probably a bit expensive for a general-use
tool - it implies parsing all the DWARF to find those references and
rewrite them, I think. For a high-performance solution that could be
run by the linker I think it'd be necessary to have a solution that
doesn't involve parsing all the DIEs.
One way to do that would be to have a CU-local type indirection table.
DIEs reference local type numbers (like local address/string numbers -
addrx/strx/rnglistx) and that table contains either sig8 (no linker
fixups required) or the local type offsets you describe - the linker
would then only need to read this type number indirection table and
rewrite them to the final type numbers.
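A toy sketch of that indirection, assuming each table entry is tagged either "sig8" or "offset" and that the linker already knows where each local type offset ends up. The entry encoding and the names are hypothetical.

```python
def rewrite_type_table(cu_table, final_offset_of):
    """Rewrite one CU's type indirection table; DIEs are never touched.

    `cu_table[i]` is the entry behind local type number i, either
    ("sig8", signature) or ("offset", local_offset). Only "offset"
    entries need a fixup; `final_offset_of` maps local to final offsets.
    """
    rewritten = []
    for kind, value in cu_table:
        if kind == "offset":
            value = final_offset_of[value]
        rewritten.append((kind, value))
    return rewritten

cu_table = [("sig8", 0x1234ABCD), ("offset", 0x10), ("offset", 0x40)]
final_offset_of = {0x10: 0x200, 0x40: 0x10}
print(rewrite_type_table(cu_table, final_offset_of))
```

The work is proportional to the number of distinct types a CU references rather than to the number of DIEs that reference them.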
The official ELF specification (acknowledged by multiple parties,
Linux, *BSD, HP-UX, Solaris, Haiku, etc) is
http://www.sco.com/developers/gabi/latest/contents.html We need to read
the Solaris Linker and Libraries Guide with a grain of salt.
The special section indexes SHN_BEFORE and SHN_AFTER are currently
Solaris specific and none of GNU ld, gold, LLD recognizes them.
(If we find needs, we can consider them)
The sh_link field of SHF_LINK_ORDER is currently used by !associated.
I need to read more what we can do with the field. https://reviews.llvm.org/D72904
Linkers do retain the input file order. AFAICT this is guaranteed by the
ELF specification. In practice many linkers do this. (LLD has an option
which changes this convention: --shuffle-sections=seed)
In general, --gc-sections retains a non-SHF_ALLOC section
if it is not associated to another SHF_ALLOC section (via SHF_LINK_ORDER,
SHT_RELA, or a section group).
If we want --gc-sections to be effective on fragmented .debug_*, we'll
need to carefully construct section references (via relocations).
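To make the retention rule concrete, here is a deliberately simplified mark phase. It is a sketch under assumptions, not LLD's actual algorithm: sections are plain dictionaries, group members live or die together, a SHF_LINK_ORDER section follows its target, and unassociated non-SHF_ALLOC sections are always kept without following their relocations. All section names are made up.

```python
def live_sections(sections, relocs, roots):
    """Return the set of section names kept by a toy --gc-sections pass.

    sections: name -> {"alloc": bool, "group": str or None,
                       "link_order": str or None}
    relocs:   name -> names of sections referenced by its relocations
    roots:    names kept unconditionally (e.g. the entry point's section)
    """
    live = set()
    work = list(roots)
    while work:
        name = work.pop()
        if name in live:
            continue
        live.add(name)
        work.extend(relocs.get(name, ()))
        group = sections[name]["group"]
        for other, info in sections.items():
            same_group = group is not None and info["group"] == group
            if same_group or info["link_order"] == name:
                work.append(other)
    # Unassociated non-SHF_ALLOC sections are retained regardless.
    for name, info in sections.items():
        unassociated = info["group"] is None and info["link_order"] is None
        if not info["alloc"] and unassociated:
            live.add(name)
    return live

def sec(alloc, group=None, link_order=None):
    return {"alloc": alloc, "group": group, "link_order": link_order}

sections = {
    ".text.main": sec(True),
    ".text.f1": sec(True, group="f1"),
    ".debug_info.head": sec(False),
    ".debug_info.f1": sec(False, group="f1"),
    ".debug_info.tail": sec(False),
}
print(sorted(live_sections(sections, {".text.main": [".text.f1"]}, [".text.main"])))
print(sorted(live_sections(sections, {}, [".text.main"])))
```

In the second call nothing references f1, so both the code and its grouped .debug_info fragment are dropped while the unassociated header and tail fragments stay.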
Lookup table style sections (.debug_addr .debug_str_offsets) will
definitely be difficult to merge. SHF_MERGE (constant merging) is an
optional ELF feature which most linkers implement, but it does not allow
section headers/footers (these DWARF v5 sections all have a header) or
varying entry sizes.
SHF_MERGE
The data in the section may be merged to eliminate duplication. Unless
the SHF_STRINGS flag is also set, the data elements in the section are
of a uniform size. The size of each element is specified in the section
header's sh_entsize field. If the SHF_STRINGS flag is also set, the data
elements consist of null-terminated character strings. The size of each
character is specified in the section header's sh_entsize field.
Since SHF_MERGE isn't usable, performing any constant merging requires
some DWARF awareness :/
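A small sketch of why a per-contribution header gets in the way, assuming a naive constant merger that treats every input section as a flat array of sh_entsize-byte entries. The merger and the 4-byte "header" are illustrative only.

```python
def merge_fixed_entries(sections, entsize):
    """Constant-merge sections that are plain arrays of entsize-byte items."""
    seen = {}  # dicts preserve insertion order, so first occurrence wins
    for data in sections:
        if len(data) % entsize != 0:
            raise ValueError("section is not a whole number of entries")
        for i in range(0, len(data), entsize):
            seen.setdefault(data[i:i + entsize], None)
    return b"".join(seen)

# Header-less sections merge as intended.
print(merge_fixed_entries([b"AAAABBBB", b"BBBBCCCC"], 4))   # b'AAAABBBBCCCC'

# With a header in front of each contribution, the header is just another
# entry to the merger: the second copy is folded into the first.
hdr = b"HDR0"
print(merge_fixed_entries([hdr + b"AAAA", hdr + b"BBBB"], 4))  # b'HDR0AAAABBBB'
```

In the second case the output no longer has one header per contribution, so anything indexing relative to its own header would be off.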
I think this indeed should be implemented and evaluated.
So that various approaches could be compared.
>> It is not only the sizes of structures describing fragments but also the complexity
>> of tools that should be taught to work with fragmented DWARF.
>> (f.e. llvm-dwarfdump applied to object file should be able to read fragmented DWARF,
>> but applied to linked executable it should work with non-fragmented DWARF).
>> That idea is for the tool which works the same way as dsymutil ODR.
>>
>> I will shortly describe the idea of making DWARF be easier processed by dsymutil/DWARFLinker:
>>
>> The idea is to have only one "type table" per object file(special section .debug_types_table).
>> This "type table" would contain all types.
>> There could be a special type of reference - type_offset - that offset points into the type table.
>> Basic types could always be placed into the start of "type table" thus, offsets to basic types
>> most often would be 1 byte. There also would be a special kind of reference - reference inside the type.
>> Type units sig8 system - would not be used to reference types.
>>
>> Types deduplication is assumed to be done, not by linker mechanism for COMDAT,
>> but by a tool like dsymutil. This tool would create resulting .debug_types_table by putting there
>> types from source .debug_types_table-s. Only one copy of the type would be placed into the
>> resulting table. All references pointing to the deleted copy would be corrected to point
>> to the single copy inside "type table". (that is how dsymutil works currently)
>^ that's the step that's probably a bit expensive for a general-use
>tool - it implies parsing all the DWARF to find those references and
>rewrite them, I think. For a high-performance solution that could be
>run by the linker I think it'd be necessary to have a solution that
>doesn't involve parsing all the DIEs.
Judging by how dsymutil currently behaves,
this particular step is not the most time-consuming one.
It could be done relatively fast.
Anyway, I think the dsymutil approach is still valuable, and it
would be useful to optimize it.
Do you think it would be useful to make dsymutil/DWARFLinker truly multi-threaded?
(To make dsymutil/DWARFLinker able to process each object file in a separate thread)
>One way to do that would be to have a CU-local type indirection table.
>DIEs reference local type numbers (like local address/string numbers -
>addrx/strx/rnglistx) and that table contains either sig8 (no linker
>fixups required) or the local type offsets you describe - the linker
>would then only need to read this type number indirection table and
>rewrite them to the final type numbers.
Yes, that could be additionally done if this process would be time-consuming.
David, thank you for all your comments and explanations. They are extremely helpful.
Thank you, Alexey.
Linker-based DWARF redundancy/dead-DWARF elimination isn't really a
feature I/Google (in the parts of it I'm involved with) would use. We
use Split DWARF internally & mostly have issues with object size, not
so much with linked executable size - so on multiple fronts, this work
probably wouldn't be deployed in the parts of Google I work with.
& I'm not sure how much Alexey needs this - the original proposal to
remove dead DWARF was as a way to address the 0-as-a-valid-address
issue due to lack of a good tombstone. Now that we're moving forward
on the -2/-1 tombstone thing - I'm not sure if any of us (in this
community/thread) have a deeply pressing need to remove redundant/dead
DWARF any more than we did last week/month/etc.
That said, I do find it a fun/interesting topic & am enjoying playing
around with linker/object features & seeing what can be done here,
what the tradeoffs might be, etc - I just don't want to be misleading
in my level of investment here. Not sure about other folks if anyone
ends up fully prototyping this and the object size/linked executable
size tradeoffs are worthwhile, etc. It'd be interesting to see.
Thanks for the link!
> The special section indexes SHN_BEFORE and SHN_AFTER are currently
> Solaris specific and none of GNU ld, gold, LLD recognizes them.
> (If we find needs, we can consider them)
>
> The sh_link field of SHF_LINK_ORDER is currently used by !associated.
> I need to read more what we can do with the field. https://reviews.llvm.org/D72904
>
> Linkers do retain the input file order. AFAICT this is guaranteed by the
> ELF specification. In practice many linkers do this. (LLD has an option
> which changes this convention: --shuffle-sections=seed)
Great!
Ah, well, that presents a workaround: an empty .text comdat group to
associate with the DWARF type fragment...
> If we want --gc-sections to be effectful on fragmented .debug_*, we'll
> need to carefully construct section references (via relocations).
What sort of relocations did you have in mind? I have managed to use
the above (empty .text comdat to associate with DWARF descriptions of
types - then using a relocation to a symbol in the .debug_info type
fragment from the .debug_info function fragment (using a function's
parameter type as the test case here)) - this does the right thing
(dropping the type if all the function fragments that refer to it are
dropped by -gc-sections, for instance).
Looks something roughly like:
.section .text,"axG",@progbits,_Z2f13foo,comdat,unique,1
...
.section .text,"axG",@progbits,_Z2f23foo,comdat,unique,1
...
.section .text,"axG",@progbits,_Z3foo,comdat,unique,1
# empty .text comdat to make the .debug_info type comdat droppable
.section .debug_info,"",@progbits,unique,1
# DWARF header/CU DIE/etc
.section .debug_info,"G",@progbits,_Z3foo,comdat,unique,1
# label for type DIE - we'd need a more advanced mangling scheme for
# this to ensure it doesn't overlap with C++, etc
_Z3foo:
# type DIEs
.section .debug_info,"G",@progbits,_Z2f13foo,comdat,unique,1
# DWARF fragment for 'f1'
...
.long _Z3foo # DW_AT_type
...
# similar fragment for f2
.section .debug_info,"",@progbits,unique,2
# DWARF footer
.byte 0 # End Of Children Mark
.Ldebug_info_end0:
But I have some trouble making that _Z3foo relocation do what I
want across multiple files. So if two files have code like the above -
the _Z3foo comdat is picked from one of them (if one of the ".long
_Z3foo" is retained from that file), and the ".long _Z3foo" references
are resolved correctly to refer to the offset in the .debug_info
_Z3foo comdat group. But the ".long _Z3foo" references from the other
input file are resolved to zero. Any chance of making that do what I
want?
(honestly, this is probably all too high object overhead for some
users - I mean, it doesn't apply to my/Google's use case at all, given
Split DWARF - but maybe other users would be happy with a
linker-agnostic small-linked-DWARF result even at the cost of
significant object size)
> Lookup table style sections (.debug_addr .debug_str_offsets) will
> definitely be difficult to merge. SHF_MERGE (constant merging) is an
> optional ELF feature which most linkers implement, but it does not allow
> section headers/footers (these DWARF v5 sections all have a header) or
> varying entry sizes.
Yeah, I think they're more or less a lost cause because of the
indexing in them. I don't have any great ideas for them. Generally you
want a DWARF-aware link of those sections to create a single more
efficient lookup table, so that can take into account dropped code,
etc, potentially.
- Dave
Fair enough - though I'd still imagine any solution that involves
parsing all the DIEs still wouldn't be fast enough (maybe an order of
magnitude faster than the current solution even - but that's still,
what, 6 or 7x slower than linking without the feature?) for most users
to consider it a good tradeoff.
> Anyway, I think the dsymutil approach is still valuable, and it
> would be useful to optimize it.
> Do you think it would be useful to make dsymutil/DWARFLinker truly multi-thread?
> (To make dsymutil/DWARFLinker able to process each object file in a separate thread)
Perhaps - that I'd probably leave up to the folks who are more
invested in dsymutil (Adrian Prantl et al). Maybe one day we'll get it
integrated into llvm-dwp and then I'll be interested in getting as
much performance out of it as lld - so multithreading and things would
be on the books.
> >One way to do that would be to have a CU-local type indirection table.
> >DIEs reference local type numbers (like local address/string numbers -
> >addrx/strx/rnglistx) and that table contains either sig8 (no linker
> >fixups required) or the local type offsets you describe - the linker
> >would then only need to read this type number indirection table and
> >rewrite them to the final type numbers.
>
> Yes, that could be additionally done if this process would be time-consuming.
>
> David, thank you for all your comments and explanations. They are extremely helpful.
Sure thing - really appreciate your patience with all this - it's... a
lot of moving parts.
- Dave
>to consider it a good trade-off.
It seems to me that even the current 6x-7x slowdown could be useful.
Users who already use dsymutil or llvm-dwp (assuming DWARFLinker
were taught to work with split DWARF) spend this time and,
in some scenarios, waste disk space on intermediate files.
Thus if they would use this LLD feature in its current state
- they would still receive benefits.
Speaking of performance results - LLD is a multi-threaded linker;
it handles sections in parallel. DWARFLinker generates DWARF using
AsmPrinter, which is a stream - so it can only produce the resulting DWARF
sequentially. It is not surprising that the parallel solution works faster.
Making DWARFLinker truly multi-threaded would probably allow us
to bring the slowdown into the 2x-4x range.
>> Anyway, I think the dsymutil approach is still valuable, and it
>> would be useful to optimize it.
>> Do you think it would be useful to make dsymutil/DWARFLinker truly multi-thread?
>> (To make dsymutil/DWARFLinker able to process each object file in a separate thread)
>Perhaps - that I'd probably leave up to the folks who are more
>invested in dsymutil (Adrian Prantl et al). Maybe one day we'll get it
>integrated into llvm-dwp and then I'll be interested in getting as
>much performance out of it as lld - so multithreading and things would
>be on the books.
I think improving dsymutil is a valuable thing.
Though there are several directions which might be considered
to make it more robust:
1. support of latest DWARF - DWARF5/DWARF64...
2. implement multi-threaded execution.
3. support of split DWARF.
4. implement dsymutil for non-Darwin platforms.
All of this is a massive piece of work.
Our original investment was to solve two problems:
1. Overlapped address ranges, which is currently close to being solved. Thank you for helping with that!
2. Size of debug info. That still becomes an issue, but we are unsure whether we are ready to
invest in solving all the above 1-4 problems and how much community interested in it.
Thank you, Alexey.
If it is 6x-7x slowdown (which may be optimized to 2x-4x), I wonder
whether it is a good trade-off keeping it as an in-linker pass, or
rather we should just use another utility compressing the output separately.
If the slowdown is such a pain, I might not consider --gc-debuginfo a
readily usable feature like --gdb-index or the future --debug-names (DWARF
v5 accelerator table - I have a plan to add it but I am always distracted
by other priorities at hand). Considering that this breaks GNU linkers,
I will add the following lines to the build system
  if LINKER_IS_LLD && ENABLE_GC_DEBUGINFO
    add -Wl,--gc-debuginfo
I don't think this is more complex than:
  if ENABLE_GC_DEBUGINFO
    set linker to a wrapper which optimizes the output like dwz
We probably should add another -f option for specifying the linker path,
like -fld-path= https://lists.llvm.org/pipermail/cfe-dev/2020-June/065710.html
FWIW, dwp (llvm-dwp hasn't really been optimized compared to binutils
dwp) is designed to be very quick - by not needing to do a lot of
parsing/fixups. Which, yes, means larger output files than would be
possible with more parsing/etc. It also doesn't take any input from
the linker (so it can run in parallel with the linker) - so it can't
remove dead subprograms. Given Google's the major (perhaps only
significant?) user of Split DWARF - I can say that the needs don't
necessarily overlap well with something that would take significantly
longer to run or use significantly more memory. Faster/cheaper/with
somewhat bigger output files is probably the right tradeoff for
Google's use case, at least.
I imagine Apple's use for dsymutil is somewhat similar - it's not used
in the iterative development cycle, only in final releases - well,
maybe their situation is more "neutral" - not a major pain point in
any case I'd guess.
> Thus if they would use this LLD feature in its current state
> - they would still receive benefits.
>
> Speaking of performance results - LLD is a multi-thread linker;
> it handles sections in parallel. DWARFLinker generates DWARF using
> AsmPrinter which is a stream - so it could make resulting DWARF only
> continuously. It is not surprising that the parallel solution works faster.
> Making DWARFLinker truly multi-threaded would probably allow us
> to make slowdown to be at 2x-4x range.
*nod* that's still a really expensive link - but I understand that's a
suitable tradeoff for your users
> >> Anyway, I think the dsymutil approach is still valuable, and it
> >> would be useful to optimize it.
> >> Do you think it would be useful to make dsymutil/DWARFLinker truly multi-thread?
> >> (To make dsymutil/DWARFLinker able to process each object file in a separate thread)
>
> >Perhaps - that I'd probably leave up to the folks who are more
> >invested in dsymutil (Adrian Prantl et al). Maybe one day we'll get it
> >integrated into llvm-dwp and then I'll be interested in getting as
> >much performance out of it as lld - so multithreading and things would
> >be on the books.
>
> I think improving dsymutil is a valuable thing.
> Though there are several directions which might be considered
> to make it more robust:
>
> 1. support of latest DWARF - DWARF5/DWARF64...
I expect/thought some of the Apple folks had already worked on DWARF5 support?
DWARF64 - that's been around for a while, and just hasn't been needed
by LLVM users thus far, it seems (until recently - where some
developers have started working on that)
> 2. implement multi-threaded execution.
> 3. support of split DWARF.
Maybe, though I'm still not sure it'd be the right tradeoff -
especially if it involved having to wait to run the .dwo merger (call
it DWARF-aware dwp, or dsymutil with dwp support) until after the
linker ran.
> 4. implement dsymutil for non-darwin platform.
That's probably, essentially (3), more-or-less. Split DWARF is
somewhat of a formalization of Apple's/MachO DWARF distribution model
(leave DWARF in files that aren't linked/use them from a debugger,
but also be able to merge them into some final file (dsym or dwp) for
archival purposes)
> All of this is a massive piece of work.
> Our original investment was to solve two problems:
>
> 1. Overlapped address ranges, which is currently close to being solved. Thank you for helping with that!
Yeah, again, sorry that's taken quite so long/somewhat circuitous route.
> 2. Size of debug info. That still becomes an issue, but we are unsure whether we are ready to
> invest in solving all the above 1-4 problems and how much community interested in it.
Fair, for sure - I don't think you'd need to sign up to solve all of
them (don't think they necessarily need solving). Potentially moving
the logic out into a separate tool as Fangrui's considering - a
post-link DWARF optimizer, rather than in-linker DWARF optimization.
I really don't want to give you the runaround like this - but multiple
times slower links is something that seems pretty problematic for most
users, to the point of weighing the maintainability of lld against the
convenience of having this functionality in-linker rather than in a
post-link optimizer.
(I know you've spoken a bit before about your users needs - but if
it's possible, could you explain (again :/) why they have such a
strong need for smaller DWARF? While DWARF size is an ongoing concern
for many users (Google certainly - hence the invention of Split DWARF,
use of type units and compressed DWARF, etc) - usually it's in rather
large programs, but it sounds like you're dealing with relatively
small ones (otherwise the increase in link time, I'd imagine, would be
prohibitive for your users?)? You mentioned that the usability cost of
Split DWARF for your users was too high (or high enough to justify
this alternative work of DWARF-aware linking)? That all seems a bit
surprising to me - though I understand the deployment issues of Split
DWARF do present some challenges to users in more heterogenous
environments than Google's... still, I'd have thought there was some
hope there)
gc-debuginfo could be done not from the linker but as a standalone
tool (like dsymutil, llvm-dwp), as you said. The reasons why it
was suggested to do it from the linker:
1. The linker already has liveness information built and object files loaded.
Thus, it would be the fastest implementation if called from the linker.
Otherwise, an address map would have to be created and written while
linking, and intermediate debug-info files would have to be generated.
Then the separate tool would read that map and load the object files
or intermediate debug-info files again. So processing time
would become even longer.
2. The linker already processes debug info: error reporting, --gdb-index,
upcoming --debug-names. From the design point of view, it would be
good to have a separate module - DWARFLinker - which implements
all that functionality, so that there would be no additional
linker-specific implementation of it. Instead, the existing
implementation would be called from the linker. I.e., depending on the
task, the linker would call either DWARFLinker.generate-gdb-index(),
DWARFLinker.generate-debug-names(), or DWARFLinker.gc-debuginfo().
The idea behind gc-debuginfo was not to slow down the linking process for everybody,
but to allow generating optimized debug info for those who need it.
That is the same idea as LTO. LTO slows down normal compilation significantly,
but it creates highly optimized code.
Thank you, Alexey.
I think the suggestion would be to link as normal, then process to
optimize (like dwz - I believe it does something like this too). It
wouldn't need a map from the linker - it could use the existing
tombstone values in the linked DWARF to determine what to drop. Yes,
the processing time would be longer, for sure. (I think Fangrui's
suggestion there is "if you're willing to take a (let's say optimized)
2x or more increase to link time, maybe (let's say doing the
intermediate step adds some fairly significant chunk of overhead) 3x
wouldn't make too much of a difference?")
> 2. Linker already processes debug info: error reporting,
It's important that it's only consulted in the error path - if the
link is failing anyway, taking a little longer to fail isn't likely to
significantly hurt the user experience (compared to the time it takes
a human to read the message, process it/think about what it means and
come up with a theory about how to address it). Compared to getting in
the way of the automated path to a working executable.
> --gdb-index,
> upcoming --debug-names.
Importantly, these can be produced without parsing a lot of DWARF if
the input contains gnu_pubnames, or debug_names (yeah, currently
gdb_index still is quite costly (but I think more like 10s of percent
in link time, not multiple 100s of percent) in memory and link time
even when it's only parsing gnu_pubnames).
> From the design point of view, it would be
> good to have a separate module - DWARFLinker - which implements
> all that functionality. So that there would not be additional separate
> specific linker implementation of them. Instead, already existed
> implementation would be called from the linker. i.e. Depending on the
> tasks, the linker would call either DWARFLinker.generate-gdb-index(),
> DWARFLinker.generate-debug-names(), DWARFLinker.gc-debuginfo().
>
> The idea behind gc-debuginfo was not to slowdown the linking process for everybody.
> But to allow generation optimized debug-info for those who need it.
> That is the same idea as LTO. LTO slowdowns usual compilation significantly,
> but it creates a highly optimized code.
Yep - I think the main difference there is going to be the size of the
user base compared to the complexity (certainly LTO adds complexity to
the linker - though without much alternative (well, alternative would
be build systems being aware of this - sort of like Fangrui's
suggestion, LTO could be a separate tool that merges LLVM bitcode
files, then creates real object files that go to the actual native
linker) to gain the desired performance).
Fangrui: What's your assessment of the complexity of adding this
functionality to lld? Are you concerned it'll be an ongoing
maintenance burden on other work in lld? If not, I'd be inclined
to/lean towards accepting this & having some room for Alexey to
improve the performance for his own users needs (& hopefully that'll
improve DWARFLinker functionality/performance as well), see if anyone
else wants this link time tradeoff.
What sort of changes to the spec are you thinking of here? (the ones I
know of would maybe be something like MachO's subsections via symbols
(to reduce section overhead by just telling the linker it can slice up
sections at public symbol boundaries without the need for section
headers to describe it) & maybe some similar things for DWARF (the
slicable debug_info I was showing/prototyping earlier - would benefit
from some way to communicate the slice boundaries to the linker that
didn't have the overhead of ELF sections) - or maybe just an overhaul
of the section and relocation formats in general to make them more
compact)
I see. FWIW, a comparison of split DWARF + llvm-dwp and DWARFLinker from lld:
1. split DWARF + llvm-dwp: linking time for clang 6 sec,
   generating time for .dwp 53 sec, clang=997M, clang.dwp=1.1G.
2. DWARFLinker from lld: linking time for clang 72 sec, clang=760M.
>> Thus if they would use this LLD feature in its current state
>> - they would still receive benefits.
>>
>> Speaking of performance results - LLD is a multi-thread linker;
>> it handles sections in parallel. DWARFLinker generates DWARF using
>> AsmPrinter which is a stream - so it could make resulting DWARF only
>> continuously. It is not surprising that the parallel solution works faster.
>> Making DWARFLinker truly multi-threaded would probably allow us
>> to make slowdown to be at 2x-4x range.
>
>*nod* that's still a really expensive link - but I understand that's a
>suitable tradeoff for your users
>
Btw, 2x or 7x is for pure linking time. The overall build slowdown
is not so significant: building the LLVM codebase shows only a 20% slowdown.
>> >> Anyway, I think the dsymutil approach is still valuable, and it
>> >> would be useful to optimize it.
>> >> Do you think it would be useful to make dsymutil/DWARFLinker truly multi-thread?
>> >> (To make dsymutil/DWARFLinker able to process each object file in a separate thread)
>>
>> >Perhaps - that I'd probably leave up to the folks who are more
>> >invested in dsymutil (Adrian Prantl et al). Maybe one day we'll get it
>> >integrated into llvm-dwp and then I'll be interested in getting as
>> >much performance out of it as lld - so multithreading and things would
>> >be on the books.
>>
>> I think improving dsymutil is a valuable thing.
>> Though there are several directions which might be considered
>> to make it more robust:
>>
>> 1. support of latest DWARF - DWARF5/DWARF64...
>
>I expect/thought some of the Apple folks had already worked on DWARF5 support?
>DWARF64 - that's been around for a while, and just hasn't been needed
>by LLVM users thus far, it seems (until recently - where some
>developers have started working on that)
The debug_names table is already implemented, but debug_rnglists,
debug_loclists, and type units are not implemented yet. The thing which
should probably be changed is that dsymutil should not have its own version
of the code generating DWARF tables. It should call the already existing
DWARF5/DWARF64 implementations. Then dsymutil would always
use the latest DWARF generators.
We have many large programs and keep Daily/Nightly debug builds,
which take a lot of disk space. Compilation time for these programs is long.
The scenario is "compile once" (not compile-debug-compile-debug).
So we think that a solution like dsymutil/DWARFLinker would not slow down
the overall build significantly (see the numbers above for the
LLVM codebase) and would allow us to reduce the disk space required to keep
all of these builds.
>You mentioned that the usability cost of
>Split DWARF for your users was too high (or high enough to justify
>this alternative work of DWARF-aware linking)? That all seems a bit
>surprising to me - though I understand the deployment issues of Split
>DWARF do present some challenges to users in more heterogenous
>environments than Google's... still, I'd have thought there was some
>hope there)
Our tools do not support split DWARF yet, though we plan to implement it.
Once we have split DWARF support, it would be
convenient to have an easy way to share built debug binaries. llvm-dwp is the
answer to this. DWARFLinker could probably be another answer.
FWIW, llvm-dwp is not very well optimized (which is to say: it is not
optimized), binutils dwp might be a better comparison (& even that
doesn't have the parallelism & some potential further memory savings
that lld has that we could take advantage of in a dwp-like tool)
What build mode was the clang binary built in? Optimized or unoptimized?
> 2. DWARFLinker from lld = linking time for clang 72 sec, clang=760M.
It does seem a tad strange that the clang binary would be smaller
non-split with DWARF linking than it was split. Though I could imagine
this might be possible in an optimized build (where debug_ranges
become quite relatively expensive in the .o file contribution with
Split DWARF)
Could you compare the section sizes between these two clang binaries, perhaps?
> >> Thus if they would use this LLD feature in its current state
> >> - they would still receive benefits.
> >>
> >> Speaking of performance results - LLD is a multi-thread linker;
> >> it handles sections in parallel. DWARFLinker generates DWARF using
> >> AsmPrinter which is a stream - so it could make resulting DWARF only
> >> continuously. It is not surprising that the parallel solution works faster.
> >> Making DWARFLinker truly multi-threaded would probably allow us
> >> to make slowdown to be at 2x-4x range.
> >
> >*nod* that's still a really expensive link - but I understand that's a
> >suitable tradeoff for your users
> >
>
> Btw, 2x or 7x is for pure linking time. Overall compilation slowdown
> is not so significant. Building LLVM codebase has only 20% slowdown.
Understood - that's still quite significant to most users, I'd imagine.
> >> >> Anyway, I think the dsymutil approach is still valuable, and it
> >> >> would be useful to optimize it.
> >> >> Do you think it would be useful to make dsymutil/DWARFLinker truly multi-thread?
> >> >> (To make dsymutil/DWARFLinker able to process each object file in a separate thread)
> >>
> >> >Perhaps - that I'd probably leave up to the folks who are more
> >> >invested in dsymutil (Adrian Prantl et al). Maybe one day we'll get it
> >> >integrated into llvm-dwp and then I'll be interested in getting as
> >> >much performance out of it as lld - so multithreading and things would
> >> >be on the books.
> >>
> >> I think improving dsymutil is a valuable thing.
> >> Though there are several directions which might be considered
> >> to make it more robust:
> >>
> >> 1. support of latest DWARF - DWARF5/DWARF64...
> >
> >I expect/thought some of the Apple folks had already worked on DWARF5 support?
> >DWARF64 - that's been around for a while, and just hasn't been needed
> >by LLVM users thus far, it seems (until recently - where some
> >developers have started working on that)
>
> There already implemented debug_names table, but debug_rnglists,
> debug_loclists, type units - are not implemented yet.
Superficially, type units wouldn't be on the list of features (like
DWARF64 - it's optional) I'd try to support in dsymutil - since their
size overhead is more justified for a DWARF-agnostic linker that's
using comdat groups. With a DWARF-aware linker I'd be specifically
hoping to avoid using type units to help
> The thing which
> should probably be changed is that dsymutil should not have its version
> of code generating DWARF tables. It should call already existed
> DWARF5/DWARF64 implementations. Then dsymutil would always
> use last DWARF generators.
Possibly - I don't know what the architectural tradeoffs for that look
like - I'd imagine DWARFLinker has sufficiently different
needs/tradeoffs than LLVM's DWARF generation code (rewriting existing
DIEs compared to building new ones from scratch, etc) that it might be
hard for them to share a lot of their implementation.
Ah, OK - for archival purposes. So the interactive developers wouldn't
necessarily be using this feature. Makes sense - similar to dsymutil
and dwp, mostly used for archival purposes & you can debug straight
from .o/.dwos for interactive/iterative development.
In that case, it seems more likely that a separate tool might suffice.
Also, out of curiosity - have you tried just compressing the output
(-gz (I think that does the right thing for the linker level
compression too, otherwise -Wl,-compress-debug-sections might do it))
or are you already doing that in addition?
> >You mentioned that the usability cost of
> >Split DWARF for your users was too high (or high enough to justify
> >this alternative work of DWARF-aware linking)? That all seems a bit
> >surprising to me - though I understand the deployment issues of Split
> >DWARF do present some challenges to users in more heterogenous
> >environments than Google's... still, I'd have thought there was some
> >hope there)
>
> Our tools does not support split dwarf yet. Though we plan to implement it.
> When we would have support of split dwarf then it would be
> convenient to have easy way to share built debug binaries. llvm-dwp is the
> answer to this. DWARFLinker could probably be another answer.
Ah, fair enough - thanks for the context!
Right, that is an unoptimized build with -ffunction-sections.
>> 2. DWARFLinker from lld = linking time for clang 72 sec, clang=760M.
>It does seem a tad strange that the clang binary would be smaller
>non-split with DWARF linking than it was split. Though I could imagine
>this might be possible in an optimized build (where debug_ranges
>become quite relatively expensive in the .o file contribution with
>Split DWARF)
>Could you compare the section sizes between these two clang binaries, perhaps?
.debug_ranges is three times bigger and .debug_line is twice as big.
>> >> Thus if they would use this LLD feature in its current state
>> >> - they would still receive benefits.
>> >>
>> >> Speaking of performance results - LLD is a multi-thread linker;
>> >> it handles sections in parallel. DWARFLinker generates DWARF using
>> >> AsmPrinter which is a stream - so it could make resulting DWARF only
>> >> continuously. It is not surprising that the parallel solution works faster.
>> >> Making DWARFLinker truly multi-threaded would probably allow us
>> >> to make slowdown to be at 2x-4x range.
>> >
>> >*nod* that's still a really expensive link - but I understand that's a
>> >suitable tradeoff for your users
>> >
>>
>> Btw, 2x or 7x is for pure linking time. Overall compilation slowdown
>> is not so significant. Building LLVM codebase has only 20% slowdown.
>
>Understood - that's still quite significant to most users, I'd imagine.
I see.
It is not easy, and would require some additions, but the benefit would be
that the whole format implementation is in one place. Thus a change in that place
would be reflected everywhere else. There are currently at least three implementations of
.debug_ranges and .debug_aranges...
Agreed: if the work on this continues, then it makes sense to
do it as a separate tool and make it fast enough. If there is interest
in it, it would probably be possible to return to the idea of calling it from the linker.
>Also, out of curiosity - have you tried just compressing the output
>(-gz (I think that does the right thing for the linker level
>compression too, otherwise -Wl,-compress-debug-sections might do it))
>or are you already doing that in addition?
sure. we use -Wl,-compress-debug-sections.
Thank you, Alexey.
>> >You mentioned that the usability cost of
>> >Split DWARF for your users was too high (or high enough to justify
>> >this alternative work of DWARF-aware linking)? That all seems a bit
>> >surprising to me - though I understand the deployment issues of Split
>> >DWARF do present some challenges to users in more heterogenous
>> >environments than Google's... still, I'd have thought there was some
>> >hope there)
>>
>> Our tools does not support split dwarf yet. Though we plan to implement it.
>> When we would have support of split dwarf then it would be
>> convenient to have easy way to share built debug binaries. llvm-dwp is the
>> answer to this. DWARFLinker could probably be another answer.
>Ah, fair enough - thanks for the context!
> > >> >One way to do that would be to have a CU-local type indirection table.
And this is without Split DWARF? Without linker DWARF compression? -
that seems quite a bit surprising, that the deduplication of DWARF
could fit into less space than the wasted/reclaimed space in ranges (&
line)?
Could you double check these numbers & provide a clearer summary?
Here's my attempt at numbers (all with function-sections+gc-sections)...
Split DWARF tests didn't seem meaningful - gc-debuginfo + split DWARF
seemed to drop all the debug info (except gdb_index) so wasn't
working/comparison wasn't meaningful for Apples to Apples, but
included it for comparing gc'd non-split to non-gc'd split (disabled
gnu-pubnames/gdb-index (-gsplit-dwarf -gno-gnu-pubnames) (which turns
on by default with Split DWARF because gdb needs it - but a bit of an
unfair comparison without turning on gnu-pubnames/gdb-index in other
build modes too, since it... /shouldn't/ be necessary) which might've
been a factor in the data you were looking at)
* -O0: (baseline, just using strip -g: 356 MB)
  * compressed: 25% smaller with gc-debuginfo (481 MB / 641 MB) (407 MB split/non-gc)
  * uncompressed: 30% smaller (820 MB / 1.2 GB) (566 MB split/non-gc)
* -O3: (baseline: 116 MB)
  * compressed: 16% smaller (361 MB / 462 MB) (283 MB split/non-gc)
  * uncompressed: 22% smaller (1022 MB / 1.2 GB) (156 MB split/non-gc)
On Fri, Jun 26, 2020 at 9:28 AM Alexey Lapshin
that was without split dwarf, without linker compression.
>
> Could you double check these numbers & provide a clearer summary?
sure, I would re-check it.
>
> Here's my attempt at numbers (all with function-sections+gc-sections)...
>
> Split DWARF tests didn't seem meaningful - gc-debuginfo + split DWARF
> seemed to drop all the debug info (except gdb_index) so wasn't
> working/comparison wasn't meaningful for Apples to Apples, but
> included it for comparing gc'd non-split to non-gc'd split (disabled
> gnu-pubnames/gdb-index (-gsplit-dwarf -gno-gnu-pubnames) (which turns
> on by default with Split DWARF because gdb needs it - but a bit of an
> unfair comparison without turning on gnu-pubnames/gdb-index in other
> build modes too, since it... /shouldn't/ be necessary) which might've
> been a factor in the data you were looking at)
That might be the case, i.e. clang=997M for split DWARF (from my previous
measurement) might include gnu-pubnames.
I will recheck it, and if that is the case then it is an unfair comparison.
My point was that "DWARFLinker from lld" takes less space than the singleton
split DWARF file + .dwp file.
For -O0 uncompressed:
- .dwp took 1.1G (if I built it correctly), singleton clang (from your
measurements) 566 MB,
overall 1.6G.
- "DWARFLinker from lld": 820 MB (from your measurements).
So "DWARFLinker from lld" looks two times better.
Anyway, thank you for pointing me to a possible mistake. I will recheck
it and update the results.
Alexey.
Oh, yeah, even if there are some measurement issues, linked executable
+ .dwp is going to be larger than a linked executable using non-split
DWARF (in v5), since v5 uses all the same representations as non-split
DWARF, and split DWARF adds the indirection overhead of a split file,
etc.
Even without DWARF linking, it's true that split DWARF has overhead
(dwp+executable will be larger than executable non-split).
But maybe we've ended up down a bit of a tangent in any case.
Trying to bring this back to "should this be committed to lld" seems
valuable, and I'm not sure what the right criteria are for that.
Ray's the best person to weigh in on that. My 2c is that I think it
probably is worthwhile, even just as an experiment, assuming it's not
too intrusive to lld.
We decided to give the idea of "removing obsolete debug info"
another try and are going to implement it as a separate utility
working on the built binary. Making it multi-threaded would
probably show better performance results, and then it could
probably be considered acceptable to use from the linker.
Alexey.
Hi Eric, please
I would publish the proposal shortly to discuss it.
In short: we decided to move in a slightly different direction than adding this functionality
into dsymutil. Though if there is a preference to implement it as part of dsymutil,
we are OK with doing it that way.
In its first version, this new utility is supposed to receive a built binary with debug info
as input (with the new marking for references to removed code sections -1/-2
-https://reviews.llvm.org/D84825) and create a new binary with obsolete
debug info removed according to the above marking. In later versions, it could be extended
with other debug info optimization tasks, e.g. generating new index tables, optimizing
debug info, etc.
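To illustrate the "remove according to the marking" step, here is a minimal
sketch. It is not the utility's implementation: the AddressRange struct and
the dropDeadRanges() helper are hypothetical names, and it assumes 64-bit
addresses where the -1/-2 markers appear as the two largest unsigned values
in the start address of a range.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical, simplified view of one address range read from debug info.
struct AddressRange {
  uint64_t LowPC;
  uint64_t HighPC;
};

// Assumption: references to removed code are marked with -1 or -2,
// i.e. the two largest values of a 64-bit address.
constexpr uint64_t TombstoneMinus1 = std::numeric_limits<uint64_t>::max();
constexpr uint64_t TombstoneMinus2 = TombstoneMinus1 - 1;

inline bool isTombstone(uint64_t Addr) {
  return Addr == TombstoneMinus1 || Addr == TombstoneMinus2;
}

// Keep only the ranges whose start address is not a tombstone.
std::vector<AddressRange> dropDeadRanges(const std::vector<AddressRange> &In) {
  std::vector<AddressRange> Live;
  for (const AddressRange &R : In) {
    if (isTombstone(R.LowPC))
      continue; // refers to garbage-collected code
    assert(R.LowPC <= R.HighPC && "malformed live range");
    Live.push_back(R);
  }
  return Live;
}
```

A real tool would apply the same check while walking DIEs, line tables and
range lists, and would have to handle 32-bit targets as well.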
We considered three options:
1. add the new functionality to dsymutil, so that dsymutil behaves differently
on non-Darwin platforms and supports another set of command-line options.
2. add the new functionality to llvm-objcopy. llvm-objcopy already supports various
binary object formats (MachO, ELF, COFF, wasm). It also has several options
to work with debug info.
3. create a new utility, llvm-dwarfutil, which would implement the above functionality
and reuse the DWARFLinker library (extracted from dsymutil) and a new
ObjectCopy library (extracted from llvm-objcopy).
So far our preference is number three. The reason is that a separate
utility working specifically with debug info looks like a good separation of concerns.
Adding another behavior to dsymutil does not look very good. Extending the already
rich interface of llvm-objcopy also does not look very good. Keeping in mind that the actual
implementation would be shared through libraries, a separate utility, working specifically
with debug info, looks like the right choice. That is our current idea.
I would publish the proposal shortly to discuss it.
Hi Jonas,
Thank you for the comments, please find my answers below...
Hi Alexey,
I should've looked at this earlier. I went through the thread again and I've
made some comments, mostly from the dsymutil point of view.
> Current DWARFEmitter/DWARFStreamer has an implementation for DWARF
> generation, which does not support DWARF5 (only debug_names table). At the
> same time, there already exists code in CodeGen/AsmPrinter/DwarfDebug.h,
> which implements most of DWARF5. It seems that DWARFEmitter/DWARFStreamer
> should be rewritten using DwarfDebug/DwarfFile. Though I am not sure
> whether it would be easy to re-use DwarfDebug/DwarfFile. It would probably
> be necessary to separate some intermediate level of DwarfDebug/DwarfFile.
These classes serve very different purposes. Last time I looked at them there
was very little overlap in functionality. In the compiler we're mostly
concerned with generating the DWARF, while in dsymutil we try to copy
everything we don't need to parse, and fix up what we have to. I don't want
to say it's not possible, but I think supporting DWARF5 in those classes is
going to be a lot less work than trying to reuse the CodeGen variants.
> Measurements show that it is spent ~10 sec in
> llvm::StringMapImpl::LookupBucketFor(). The problem is that the same
> strings, again and again, are added to the string pool. Two attributes
> having the same string value would be analyzed (hash calculated) and
> searched inside the string pool. Even if these strings are already in
> string table (DW_FORM_strp, DW_FORM_strx). The process could be optimized
> for string tables. So that if some string from the string table were
> accessed previously then, it would keep a reference into the string pool.
> This would eliminate a lot of string pool searches.
I'm not sure I fully understand the optimization, but I'd love to speed this
up, if only for dsymutil's sake. I'd love to talk about this in a separate
thread or offline.
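A rough sketch of the string-pool idea quoted above may help. It is only an
illustration under assumptions, not DWARFLinker code: OutputStringPool and
CachedStrpResolver are hypothetical names, and the input string table is
modelled as a plain buffer of NUL-terminated strings addressed by offset
(as with DW_FORM_strp). The point is that a string reached through the same
input offset is hashed once, and later references reuse the cached result.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical output pool: maps string contents to an output offset.
// Every getOffset() call hashes the whole string, which is the costly part.
class OutputStringPool {
public:
  uint64_t getOffset(const std::string &S) {
    ++Lookups;
    auto It = Offsets.find(S);
    if (It != Offsets.end())
      return It->second;
    uint64_t Off = NextOffset;
    Offsets.emplace(S, Off);
    NextOffset += S.size() + 1; // account for the terminating NUL
    return Off;
  }
  unsigned lookups() const { return Lookups; }

private:
  std::unordered_map<std::string, uint64_t> Offsets;
  uint64_t NextOffset = 0;
  unsigned Lookups = 0;
};

// Hypothetical per-input-file cache: input string table offset -> output
// offset. A cache hit needs only an integer lookup, no string hashing.
class CachedStrpResolver {
public:
  CachedStrpResolver(const std::string &Tab, OutputStringPool &P)
      : InputStrTab(Tab), Pool(P) {}

  uint64_t resolve(uint64_t InputOffset) {
    auto It = Cache.find(InputOffset);
    if (It != Cache.end())
      return It->second;
    assert(InputOffset < InputStrTab.size() && "offset outside string table");
    // Reads up to the next NUL byte of the input table.
    std::string S(InputStrTab.c_str() + InputOffset);
    uint64_t OutputOffset = Pool.getOffset(S);
    Cache.emplace(InputOffset, OutputOffset);
    return OutputOffset;
  }

private:
  const std::string &InputStrTab;
  OutputStringPool &Pool;
  std::unordered_map<uint64_t, uint64_t> Cache;
};
```

With one resolver per input file and a shared pool, identical strings from
different files still go through the pool once per file, so cross-file
deduplication is unchanged; only repeated references within a file are
short-circuited.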
> Currently, all object files are analyzed sequentially and cloned
> sequentially. Cloning is started in parallel with analyzing. That scheme
> could be changed: analyzing and cloning could be done in parallel for each
> object file. That requires refactoring of DWARFLinker and making string
> pools and DeclContextTree thread-safe.
I'm less familiar with the way that LLD uses the DWARFOptimizer but this is
not possible for dsymutil as it is trying to deduplicate DIEs from different
compile units.
> Extending the already rich interface of llvm-objcopy looks also not very
> good. Having in mind that actual implementation would be shared by
> libraries, the separate utility, working specifically with debug info,
> looks like the right choice. That is our current idea.
>
> My personal thought would be that extending dsymutil should be ok as the
> functionality goes well with everything else dsymutil does (other than not
> support ELF which the dsymutil maintainers are on board with last I
> checked). That said, I definitely think a write-up will be helpful. No
> matter what I support extracting all of the behavior into libraries and
> using that somewhere :)
Ha, so basically what I was trying to say above.
I look forward to seeing the proposal!
yep, would publish it soon.
Thank you, Alexey.
Probably some opportunities to share some code, even if not the whole
generator - might be best to refactor those opportunistically, rather
than a wholesale "change DWARFLinker to use (all) of
lib/CodeGen/AsmPrinter/Dwarf*". Sort of like the approach that's been
taken with lldb's use of libDebugInfoDWARF - picking particular
features that have high overlap and refactoring them to be reusable
between the two different use cases.
llvm/CodeGen/DIE.h:

class DIE* {
  void emitValue(const AsmPrinter *Asm, dwarf::Form Form) const;
  unsigned SizeOf(const AsmPrinter *AP, dwarf::Form Form) const;
};

Having access to all of AsmPrinter's public data members
complicates DWARF generation:

void DIEInlineString::emitValue(const AsmPrinter *AP, dwarf::Form Form) const {
  if (Form == dwarf::DW_FORM_string) {
    AP->OutStreamer->emitBytes(S);
    AP->emitInt8(0);
    return;
  }
}

It would be good to do something similar to https://reviews.llvm.org/D76293,
i.e. avoid the AsmPrinter dependence by using an abstract interface
(llvm/DebugInfo/DWARF/DWARFDebugSection.h):

class DIE* {
  void emitValue(DwarfDebugSection *Dwarf, dwarf::Form Form) const;
  unsigned SizeOf(const DwarfDebugSection *Dwarf, dwarf::Form Form) const;
};

Such a separation could, for example, allow
AsmPrinter::emitDwarfDIE(const DIE &Die)
to be implemented in some general place (libDebugInfoDWARF) and then be
reused by others (without the need to link/use AsmPrinter).

https://reviews.llvm.org/D76293 was thought to be more general.
But we could probably start with that smaller change:
avoiding the dependence of the DIE* classes on AsmPrinter.

Alexey.
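A minimal sketch of that decoupling, outside of LLVM, could look like the code below. All names here (SectionEmitter, BufferEmitter, InlineStringValue, emitChecked) are hypothetical and only illustrate the shape of the change; this is neither the D76293 interface nor the real DIE classes. The value class sees only a small emission interface, so a client that writes into a byte buffer does not need AsmPrinter at all.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical narrow interface: only what a DIE value needs for emission.
class SectionEmitter {
public:
  virtual ~SectionEmitter() = default;
  virtual void emitBytes(const std::string &Bytes) = 0;
  virtual void emitInt8(uint8_t Value) = 0;
};

// One possible implementation: append to an in-memory buffer. An
// AsmPrinter-backed implementation could forward to its streamer instead.
class BufferEmitter : public SectionEmitter {
public:
  void emitBytes(const std::string &Bytes) override {
    Buffer.insert(Buffer.end(), Bytes.begin(), Bytes.end());
  }
  void emitInt8(uint8_t Value) override { Buffer.push_back(Value); }

  std::vector<uint8_t> Buffer;
};

// Stand-in for DIEInlineString: depends on SectionEmitter only.
class InlineStringValue {
public:
  explicit InlineStringValue(std::string Str) : S(std::move(Str)) {}

  // Emits the string bytes followed by the terminating NUL.
  void emitValue(SectionEmitter &Out) const {
    Out.emitBytes(S);
    Out.emitInt8(0);
  }

  // Size of the emitted value in bytes, including the terminating NUL.
  std::size_t sizeOf() const { return S.size() + 1; }

private:
  std::string S;
};

// Emits V and checks that the number of bytes written matches sizeOf().
inline void emitChecked(const InlineStringValue &V, BufferEmitter &Out) {
  const std::size_t Before = Out.Buffer.size();
  V.emitValue(Out);
  assert(Out.Buffer.size() - Before == V.sizeOf());
}
```

Emitting InlineStringValue("abc") through a BufferEmitter appends the four bytes 'a', 'b', 'c', 0. The same value class could be driven by an AsmPrinter-backed emitter in CodeGen and by a plain buffer or stream in a DWARF-rewriting tool.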
>
>> Supporting a new standard would require rewriting/modifying all these places. In an ideal world,
>> having a single implementation for DWARF generation allows changing one place and getting
>> the benefits in the others. Probably, the CodeGen classes could be rewritten, and then it would be useful
>> to write them with two use cases in mind: generation from scratch and copying/updating
>> existing data. In the end, there would be a single implementation which could be reused in
>> many places. Though, it is indeed a lot of work.
As mentioned in https://reviews.llvm.org/D76293#1928139 - probably
best not to put this into libDebugInfoDWARF - that library is
currently for DWARF parsing & LLVM proper only really needs DWARF
emission, so bundling them together may confuse things a bit - in
terms of adding unnecessary dependencies, conflating/confusing the
goals/priorities of the different libraries, etc. (see similar
separations between reading and writing with things like libIR
containing IR asm writing, but libAsmParser containing the parsing
code).
> and then be reused by others
> (without necessity to link/use AsmPrinter).
>
> https://reviews.llvm.org/D76293 was thought to be more general.
> But we could probably start from that smaller change:
> avoid dependence of DIE* classes on AsmPrinter.
Sure - that'd be the general idea as we discussed in D76293 & in this
thread: start small, share some pieces & build up the necessary
abstractions (streaming APIs, etc) over the two different use cases,
etc.
yep.