RFC: Improving GN's target resolution data structures and computations


David Turner

Apr 19, 2022, 10:31:23 AM
to gn-dev
This is an RFC for a number of changes to GN's internal implementation, which are currently available on Gerrit as a stack of CLs, whose top is at https://gn-review.googlesource.com/c/gn/+/13627.

This RFC was written at the request of Brett Wilson; the changes have been decomposed into small CLs that should be relatively easy to understand. This document describes the overall vision for the changes and explains why they improve performance, reduce RAM usage, and make the code easier to understand overall.

These changes are written to improve GN without modifying its output or interface. They are thus non-breaking. What they do is:
  • Provide a few specialized data structures for certain data types that noticeably reduce GN's total RAM usage and/or improve data locality / reduce cache latencies.

  • Move some expensive recursive computations over the build graph out of the main thread, where they were a bottleneck, and into background worker threads that run in parallel.

  • Ensure that these expensive computations are only performed when needed, unlike the current implementation, which performs significant unnecessary work due to lack of context.

  • Reduce the number of partially-constructed fields in each Target class instance, which has been a source of bugs and confusion in the source tree so far.

Overview of the changes

This section describes the changes proposed (exact details to be found in the CLs on Gerrit), while further sections will provide additional justification for them:
  1. Move data computed at target resolution time out of the Target class, and into a dedicated TargetResolvedData class. This makes the code easier to understand by reducing the amount of state in the Target class that is not properly initialized until target resolution completes (this type of data has been a source of confusion and bugs multiple times, as explained later).

  2. Ensure that the expensive recursive computations performed by the TargetResolvedData methods are only performed when needed, by walking the graph top-to-bottom instead of bottom-to-top. Note that this does not change the result of the computations; it only avoids unnecessary work and allocations that were already quantified as wasting up to 10% of `gn gen` time. Memoization is used to ensure that duplicate walks over the same dependency tree are avoided.

  3. Move the computation of TargetResolvedData out of the main thread, and into the background worker threads that need it at `gn gen` time; this removes a noticeable performance bottleneck. This is safe because the graph of Target objects is always treated as a shared read-only input when this happens.

  4. Introduce specialized data types to reduce the overall heap usage of the build graph in memory (currently over 3 GiB), and to speed up computations over it (e.g. ImmutableVector<> and ResolvedTargetDeps).

  5. Introduce the NestedSet<T> template, used to speed up the repeated merges of data sets performed frequently during target resolution. Using a recursive DAG of small nodes allows these merge operations, which happen multiple times per node in the build graph, to be performed significantly faster and to use far less memory overall. Flattening the result into a list is only performed when needed, i.e. when generating a Ninja command or `gn desc` output. This speeds up GN and saves several hundred MiBs of RAM (see the sketch after this list).

    This was directly inspired by the NestedSet Java class implemented by the Bazel build system, which is critical for its performance.
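
For illustration, here is a minimal C++ sketch of the NestedSet idea; the names and API are hypothetical, and the actual CL may differ. The key property is that "merging" dependency sets only records references to them, while the expensive deduplicating flatten is deferred until a full list is really needed:

  #include <memory>
  #include <unordered_set>
  #include <vector>

  // Hypothetical sketch, not GN's actual implementation. T must be hashable.
  template <typename T>
  class NestedSet {
   public:
    // Construction is the "merge": it stores the direct items plus O(1)
    // references to the dependency sets, instead of copying their contents.
    NestedSet(std::vector<T> direct,
              std::vector<std::shared_ptr<const NestedSet>> transitive)
        : direct_(std::move(direct)), transitive_(std::move(transitive)) {}

    // Deduplicated flatten, performed only when the full list is needed
    // (e.g. when writing a Ninja command). Each distinct node is walked once.
    std::vector<T> Flatten() const {
      std::vector<T> result;
      std::unordered_set<const NestedSet*> visited;
      std::unordered_set<T> seen;
      FlattenInto(result, visited, seen);
      return result;
    }

   private:
    void FlattenInto(std::vector<T>& out,
                     std::unordered_set<const NestedSet*>& visited,
                     std::unordered_set<T>& seen) const {
      if (!visited.insert(this).second)
        return;  // This sub-DAG was already expanded.
      for (const auto& child : transitive_)
        child->FlattenInto(out, visited, seen);
      for (const T& item : direct_)
        if (seen.insert(item).second)
          out.push_back(item);
    }

    std::vector<T> direct_;
    std::vector<std::shared_ptr<const NestedSet>> transitive_;
  };
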
The end result of all these changes is a 35% speed-up of `gn gen` (removing 5s on a Fuchsia build) and a 25% reduction in overall peak RAM usage (up to 1 GiB for a Fuchsia build), without changing GN's output, as verified manually for the Chromium and Fuchsia builds on Linux.

On GN's memory usage

GN's main task is to:

1) Read a set of input BUILD.gn files.
2) Evaluate them to build a directed acyclic graph of build targets.
3) For the `gn gen` operation, use the graph to generate a set of Ninja commands.
4) For other commands, perform computations over the graph to answer various queries (e.g. find paths between targets).

Measurements performed on a moderate Fuchsia build configuration, which uses around 5700 input files and 203000 targets, show that:

- peak RAM usage for `gn check` is around 3.7 GiB
- peak RAM usage for `gn gen` is around 3.9 GiB

Moreover, the Fuchsia tree provides 6800+ input files, for a total of 2.4 MiB; assuming a generous 50x expansion when loaded in memory, this amounts to about 100 MiB of RAM used to load the input files in the above example.

This suggests a very crude approximation where the build graph takes around 3.4 GiB of RAM in memory. At this size [1], cache effects significantly affect overall performance, and the memory layout of the data structures GN uses at runtime matters.

A previous CL illustrates this: by replacing an std::map<> with two parallel vectors and a small hash map, `gn gen` time was reduced by 10%, and overall peak RAM usage was reduced by 242 MiB, without changing the parent class's interface or its behavior.
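
For reference, a rough sketch of that layout (the names here are hypothetical, and the CL's actual code may differ): keys and values live in two contiguous vectors, with a small hash map from key to index, instead of a tree of individually-allocated nodes:

  #include <cstddef>
  #include <unordered_map>
  #include <vector>

  // Hypothetical sketch of the "two parallel vectors + small hash map" idea.
  template <typename Key, typename Value>
  class FlatMap {
   public:
    void Insert(Key key, Value value) {
      auto [it, inserted] = index_.emplace(key, keys_.size());
      if (!inserted) {
        values_[it->second] = std::move(value);  // Overwrite existing entry.
        return;
      }
      keys_.push_back(std::move(key));
      values_.push_back(std::move(value));
    }

    // Returns nullptr when |key| is absent.
    const Value* Find(const Key& key) const {
      auto it = index_.find(key);
      return it == index_.end() ? nullptr : &values_[it->second];
    }

   private:
    std::vector<Key> keys_;                  // Parallel to |values_|.
    std::vector<Value> values_;              // Contiguous, cache-friendly.
    std::unordered_map<Key, size_t> index_;  // Key -> index in the vectors.
  };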

Generally speaking, the simplest ways to improve performance when doing computations over a very large data set are to:

  • Minimize overall RAM usage and heap allocations, to decrease the number of cache misses at runtime.

  • Choose data structures that minimize the amount of pointer-chasing during computations.

  • Segregate the data between "cold" and "hot" parts, to ensure better data locality for the hot parts, which are used the most during the computations (see the sketch below).
That's essentially what the proposed changes do.
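
As a purely illustrative example of the hot/cold split (the field names below are invented, not GN's): the fields chased on every graph walk stay together in a dense struct, while rarely-used data moves behind a single pointer:

  #include <cstdint>
  #include <memory>
  #include <string>
  #include <vector>

  // Invented example: "cold" data, only needed when generating output.
  struct TargetCold {
    std::string description;           // E.g. only read by `gn desc`.
    std::vector<std::string> metadata;
  };

  // "Hot" data, touched on every dependency walk; kept small and dense so
  // that more of it fits in cache during graph traversals.
  struct TargetHot {
    std::vector<TargetHot*> deps;      // Chased on every walk.
    uint32_t flags = 0;
    std::unique_ptr<TargetCold> cold;  // One pointer keeps cold data away.
  };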

How GN loads input files

The Loader class implements loading, caching and evaluation of input BUILD.gn files (and their imports). It uses a worker thread pool to perform most of its work in parallel background threads, but hides this behind an interface that provides both synchronous and asynchronous entry points.

The result of evaluating a BUILD.gn file, in a specific GN toolchain context, is a list of *partially-constructed* Item instances.

The Item class is used to represent any labelled item that can be present or referenced in the build graph. There are currently 4 item types: Target, Config, Toolchain and Pool (all derived classes of the base Item class). For simplicity, the rest of this document will focus on targets, because the proposed changes only affect them.

The content of these partially-constructed instances comes directly from the corresponding item definition in the GN language, either when stated explicitly in a BUILD.gn file, or by expanding a GN template call.

For example, a Target instance has lists of labels for the private, public, and data dependencies specified through `deps`, `public_deps` and `data_deps` in its GN language definition.

Note that some fields of these Item instances are not properly initialized yet: this happens later when the items are added to the graph.


How GN builds the Item graph

The Builder class is responsible for building the overall graph in memory. It runs exclusively in the main thread, which happens to be a noticeable performance bottleneck at runtime. The graph is constructed incrementally in the following way:
  1. The Builder calls the Loader class recursively to load `BUILD.gn` files and receive the corresponding partially-constructed items. These are added to the current graph, then their dependencies are used to load other build files that define them.

    Hence targets are added to the graph top-to-bottom, always in a partially-constructed state.

  2. During construction, once all dependencies of a given target have been loaded (or if it has no dependencies), the target is "resolved" by calling `Target::OnResolved()`, which performs a number of computations that result in the "final" state of the target.

    Target resolution is how GN propagates information from dependencies to dependents. For example, this is how public configs and public dependencies are pushed upstream in the graph. More on this later.

    Once a target has been resolved, its state is final, and all its dependents that were waiting for it can be resolved (unless they are waiting on other dependencies).

During `gn gen`, as a special case, as soon as a target is resolved, a task is posted to the worker thread pool to generate a Ninja fragment from it. Note that at this point the target's state, and that of all its dependencies, is final, so it can safely be used as a shared read-only input by multiple threads.
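
A minimal sketch of this resolution bookkeeping, with illustrative names rather than GN's actual ones: each target counts its unresolved dependencies, and resolving a target notifies its dependents, possibly unblocking them in turn:

  #include <vector>

  // Illustrative sketch, not GN's actual Builder code.
  struct Node {
    std::vector<Node*> dependents;  // Targets waiting on this one.
    int unresolved_deps = 0;        // Direct dependencies not yet resolved.
    bool resolved = false;

    // Called by a dependency when it becomes resolved.
    void OnDepResolved() {
      if (--unresolved_deps == 0)
        Resolve();
    }

    void Resolve() {
      resolved = true;
      // ... OnResolved()-style computations here; during `gn gen`, also
      // post a task to generate this target's Ninja fragment ...
      for (Node* dependent : dependents)
        dependent->OnDepResolved();
    }
  };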

On the design of the Target class

The Target class definition is quite large, because it has to support all the different target types supported by the GN language. There are no sub-classes specialized for certain target types, which results in a large number of members and methods.

Combined with the fact that some of these members are only valid after target resolution, the source code can be difficult to understand: there is no real distinction, through comments, runtime checks, or separation into different objects, between final and partial values. Similarly, it is not always obvious which methods return valid values before target resolution.

Unsurprisingly, this led to subtle bugs. For example, this CL fixed a bug where a method called during target resolution relied on the result of another method that only returns valid values after resolution completes for the same instance. This was harder to spot in code review because sub-objects of Target can themselves be partially constructed (in this case, the problem was hidden in the SwiftValues class).

Another example is this CL, which fixed unit tests that forgot to call `Target::OnResolved()` and thus computed and verified incorrect results from partial target state.

On computations performed at target resolution time

Various target types need to perform computations over the tree of their dependencies when generating Ninja fragments. For example, an `executable()` target must collect all the `libs` and `lib_dirs` values from its transitive dependencies to include them into the final linker command string.

In the example above, these are returned by methods like `Target::all_libs()` and `Target::all_lib_dirs()`, which return vectors of `LibFile` and `SourceDir` values respectively.

The implementation of these methods does not compute the result on demand; instead, everything is pre-computed at target resolution time, in reverse dependency order. The main benefit of this approach is that the values associated with a given target are only computed once.

For example, if several `executable()` targets depend on the same `static_library()` target (which itself has a deep tree of `source_set()` dependencies), the associated values are available immediately, avoiding duplicate dependency-tree walks.

There are however a number of drawbacks to this implementation:
  • All resolved values are computed in `Target::OnResolved()`, which can only run in the main thread; in practice this is a noticeable performance bottleneck.

  • Because `Target::OnResolved()` doesn't know anything about the target's dependents, it always computes all possible values, even when there is actually no need for them. This results in lots of unnecessary work and extra allocations.

  • Many of these values are simple vectors, sets, or maps that can become quite large in a deep build graph, because they contain the same items duplicated at multiple levels of the hierarchy. This results in lots of container merges that all happen in the main thread and eat up significant memory, since the results are stored within Target instances unconditionally.

Recently, two CLs were proposed to move the resolution data for binary targets to a specific class, and to delay its upstream propagation until it could be determined that it was actually needed. This resulted in a 10% reduction in `gn gen` time (in other words, `gn gen` spends 10% of its time computing unnecessary data), but was rejected due to the complexity it added to the target resolution algorithm (it allegedly rendered the code even more difficult to follow).

Another attempt tried to perform target resolution in background worker threads as well, but introduced complexity of its own (requiring mutexes and locking of certain data structures; the author suggested adding thread annotations to the GN code base).

The proposed changes solve the issue by moving the target resolution computations to a separate class that can safely be used outside of the main thread, keeping its code simple (and extremely similar to the code currently used by `Target::OnResolved()`) while avoiding locks and thread annotations.
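
A minimal sketch of this on-demand, memoized scheme, using hypothetical names (see the CLs for the actual TargetResolvedData): the first query for a target walks its dependencies recursively, and the result is cached so that each sub-tree is flattened at most once:

  #include <string>
  #include <unordered_map>
  #include <unordered_set>
  #include <vector>

  // Hypothetical target node: direct `libs` values plus dependencies.
  struct Target {
    std::vector<const Target*> deps;
    std::vector<std::string> libs;
  };

  // Computes transitive values on demand, outside of Target itself.
  class ResolvedData {
   public:
    // All transitive `libs` for |target|, computed on first use only.
    const std::vector<std::string>& AllLibs(const Target* target) {
      auto it = cache_.find(target);
      if (it != cache_.end())
        return it->second;  // Memoized: this sub-tree was already walked.

      std::vector<std::string> result = target->libs;
      std::unordered_set<std::string> seen(result.begin(), result.end());
      for (const Target* dep : target->deps)
        for (const std::string& lib : AllLibs(dep))
          if (seen.insert(lib).second)
            result.push_back(lib);
      return cache_.emplace(target, std::move(result)).first->second;
    }

   private:
    std::unordered_map<const Target*, std::vector<std::string>> cache_;
  };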

On mutable vs immutable containers

Most members in the Target class are implemented using simple mutable container types, like an `std::vector<>`, a `UniqueVector<>` (a vector with the added property of not allowing duplicates), or standard sets or maps of various types.

This is convenient while constructing Target instances, or the data they will store, but has a few drawbacks when it comes to memory:
  • Mutable containers tend to use more heap memory than necessary: even an `std::vector<>` will over-allocate memory due to its exponential capacity-growth policy, and a `UniqueVector<>` instance contains a vector as well as a companion hash map that becomes useless once the object has been finalized.

  • Containers that do not have this problem, like `std::map<>`, tend to rely on many small object allocations, which creates cache-unfriendly data structures that work poorly once the data set becomes very large (see the example listed previously), or have poor merging performance at runtime.

In the case of GN, computing resolved data means that these data sets are traversed very extensively during target resolution, and the issues above become noticeable under profiling.

These issues are well-known in software engineering, and a number of immutable container data types have been implemented to address this problem:
  • Because they are constructed once, then never modified, their data layout and heap allocation can be optimized to be efficient and minimal.

  • Because they are immutable, their state is obvious when reading the source code, and they can be used as shared input for computations performed in parallel threads safely.
The main drawback of these types is that they must be constructed from a mutable "builder" class, which generally provides a number of mutating operations to determine the final, immutable output, as in the following hypothetical example:

  ImmutableFoo foo = FooBuilder().AddItem(item1).AddDependency(dep1).Build();

In practice though, this is only an annoyance for certain unit tests, which typically get mutable references to `Target` members and append things to them directly.

The proposed changes implement a few immutable data structures to reduce RAM usage and speed up computations over them.
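
For illustration, a minimal sketch of an exact-size immutable vector; this is an assumed design, and the CL's ImmutableVector<> may differ. There is a single allocation of exactly the needed size, no spare capacity, and no mutation after construction, so instances can safely be shared read-only across threads:

  #include <algorithm>
  #include <cstddef>
  #include <memory>
  #include <vector>

  // Assumed design, for illustration only.
  template <typename T>
  class ImmutableVector {
   public:
    // Builds from a mutable vector, allocating exactly size() elements.
    explicit ImmutableVector(std::vector<T> items)
        : size_(items.size()), data_(std::make_unique<T[]>(items.size())) {
      std::move(items.begin(), items.end(), data_.get());
    }

    const T* begin() const { return data_.get(); }
    const T* end() const { return data_.get() + size_; }
    size_t size() const { return size_; }
    const T& operator[](size_t i) const { return data_[i]; }

   private:
    const size_t size_;
    std::unique_ptr<T[]> data_;  // Exactly |size_| elements, never resized.
  };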

[1] For reference, the L3 cache on a very powerful workstation is about 36 MiB,
while laptops with high-end Core-i7 processors have an L3 cache of 8 MiB.

Takuto Ikuta

Apr 20, 2022, 4:33:24 AM
to David Turner, gn-dev
Hi David,

Thank you for writing the proposal/explanation. I haven't taken a deep look at your series of CLs yet, but I want to understand your proposal and implementation.
I also feel it is better to have more eyes on the changes merged before this proposal, so do you mind reverting your merged CLs and relanding them to get reviews from me (and other reviewers)?

And I think it might be better to use Google Docs for this kind of somewhat large proposal. It makes it easier to see the current state and to comment than discussing it in a mail thread.

Thanks,
Takuto


David Turner

Apr 20, 2022, 10:44:22 AM
to Takuto Ikuta, gn-dev
On Wed, Apr 20, 2022 at 10:33 AM, Takuto Ikuta <tik...@chromium.org> wrote:

> Hi David,
>
> Thank you for writing the proposal/explanation. I haven't taken a deep look at your series of CLs yet, but I want to understand your proposal and implementation.
> I also feel it is better to have more eyes on the changes merged before this proposal, so do you mind reverting your merged CLs and relanding them to get reviews from me (and other reviewers)?
>
> And I think it might be better to use Google Docs for this kind of somewhat large proposal. It makes it easier to see the current state and to comment than discussing it in a mail thread.

I have converted the RFC into a Google Doc, available for comments at https://docs.google.com/document/d/1k8wzlWDPEFCdU6fDfK90x8uSIA65GA7_WlzFNX0hsFk/edit?usp=sharing
Unfortunately, my settings do not allow me to share this with users outside of Alphabet; should I reshare it using my personal account?

I have updated the document slightly, and added a section at the end with Gerrit links and descriptions for each individual CL, including quick profiling results for Fuchsia and Chromium for those that change GN's behavior (many CLs just introduce new support templates or classes with their unit tests, without actually using them in the code).
Note also that none of the CLs that have been submitted so far modify GN's behavior.

- Digit

Takuto Ikuta

Apr 21, 2022, 12:37:50 AM
to David Turner, gn-dev
On Wed, Apr 20, 2022 at 11:44 PM, 'David Turner' via gn-dev <gn-...@chromium.org> wrote:

> I have converted the RFC into a Google Doc, available for comments at https://docs.google.com/document/d/1k8wzlWDPEFCdU6fDfK90x8uSIA65GA7_WlzFNX0hsFk/edit?usp=sharing
> Unfortunately, my settings do not allow me to share this with users outside of Alphabet; should I reshare it using my personal account?


Thanks. But yes, I think it is better to use a public account for this, both for openness and since your proposal doesn't have any confidential information.

David Turner

Apr 26, 2022, 10:17:39 AM
to Takuto Ikuta, gn-dev
I have updated the RFC with a new section describing each patch in more detail, as well as profiling information for each of them.
It is now available publicly from my personal account at the following link:

https://docs.google.com/document/d/1sHlXBIdTZ47JZfGt3DldTj6zlwtDr_xsYlwTmeL1ZtE/edit?usp=sharing

Please take a look.

- Digit

Brett Wilson

Apr 26, 2022, 8:14:21 PM
to David Turner, Takuto Ikuta, gn-dev
Thanks, the write-up is great and I've read some of it already.

Brett

Takuto Ikuta

Apr 27, 2022, 1:03:17 AM
to Brett Wilson, David Turner, gn-dev
I took a look at your proposal/CLs too, and the overall direction looks good to me.
Thank you for writing informative documents.

David Turner

Dec 5, 2022, 5:49:16 AM
to Takuto Ikuta, gn-dev
Just to let you know that I recently had time to rework the CLs in my proposal, rebasing and simplifying the code, i.e.:

- Remove custom data types like ImmutableVector<> / TaggedPointer<> / NestedSet<>.
- Remove the thread-specific cache.
- But still move the computation of resolved target data out of the Target class, and perform it on demand.

The new stack of CLs is at https://gn-review.git.corp.google.com/c/gn/+/14884 now.

I have also performed some benchmarks to compare its performance with GN tip-of-tree and the previous implementation; it still performs well, though not as well as the original:

                                 main        new    original
Fuchsia mean time (seconds)      27.1       21.5        20.8
Fuchsia peak RAM (KiB)        5050176    4206752     3874420
Chromium mean time (seconds)     4.63       4.39        3.58
Chromium peak RAM (KiB)       1036720    1097404      699384

Regards,

- Digit

Владимир Архипов

Dec 5, 2022, 7:15:31 AM
to David Turner, Takuto Ikuta, gn-dev

On Mon, Dec 5, 2022 at 13:49, 'David Turner' via gn-dev <gn-...@chromium.org> wrote: