This is an RFC for a number of changes to GN's internal implementation, which are currently available on Gerrit as a stack of CLs, whose top is at
https://gn-review.googlesource.com/c/gn/+/13627.
This RFC was written at the request of Brett Wilson; the changes themselves have been decomposed into small CLs that should be relatively easy to understand. This document describes the overall vision for the changes and explains why they improve performance, reduce RAM usage, and make the code easier to understand overall.
These changes are written to improve GN without modifying its output or interface. They are thus non-breaking. What they do is:
- Provide a few specialized data structures for certain data types that noticeably reduce GN's total RAM usage and/or improve data locality / reduce cache latencies.
- Move some expensive recursive computations over the build graph out of the main thread, where they were a bottleneck, and into background worker threads that run in parallel.
- Ensure that these expensive computations are only performed when needed, unlike the current implementation that has to perform significant unnecessary work, due to lack of proper context.
- Reduce the number of partially-constructed fields in each Target class instance, which has been a source of bugs and confusion in the source tree so far.
Overview of the changes
This section describes the changes proposed (exact details to be found in the CLs on Gerrit), while further sections will provide additional justification for them:
- Move data computed at target resolution time out of the Target class, and into a dedicated TargetResolvedData class. This makes the code easier to understand by reducing the amount of state in the Target class that is not properly initialized until target resolution is completed (this type of data has been a source of confusion and bugs multiple times, as explained later).
- Ensure that the expensive recursive computations performed by the TargetResolvedData methods are only performed when needed, by walking the graph top-to-bottom, instead of bottom-to-top. Note that this does not change the result of the computations; it only avoids unnecessary work and allocations that were already quantified as wasting up to 10% of `gn gen` time. Memoization ensures that duplicate walks over the same dependency tree are avoided.
- Move the computation of TargetResolvedData out of the main thread, and into the background worker threads that need it at `gn gen` time; this removes a noticeable performance bottleneck. This is safe because the graph of Target objects is always treated as a shared read-only input at that point.
- Introduce specialized data types to reduce the overall heap usage of the build graph in memory (which currently exceeds 3 GiB), and speed up traversals / computations over them (e.g. ImmutableVector<>, and ResolvedTargetDeps).
- Introduce the NestedSet<T> template, used to speed up the repeated merges of data sets performed during target resolution. Using a recursive DAG of small nodes allows these merge operations, which happen multiple times per node in the build graph, to be performed significantly faster while using far less memory overall. Flattening the result into a list is only performed when needed, i.e. when generating a Ninja command or a `gn desc` output. This speeds up GN and saves several hundred MiBs of RAM.
This was directly inspired by the generic NestedSet Java class implemented by the Bazel build system, which is critical to its performance.
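To make the idea concrete, here is a minimal sketch of the concept (hypothetical code, not the actual GN implementation): each set is a small node holding its direct items plus shared pointers to the sets it was merged from, so a merge is just pointer bookkeeping, and the expensive flattening into a list is deferred until actually needed:

```cpp
#include <memory>
#include <set>
#include <vector>

// Sketch of a NestedSet<T>: a node in a DAG of small immutable sets.
// Merging N sets only allocates one node holding N shared pointers.
template <typename T>
class NestedSet {
 public:
  NestedSet(std::vector<T> direct,
            std::vector<std::shared_ptr<const NestedSet>> transitive)
      : direct_(std::move(direct)), transitive_(std::move(transitive)) {}

  // Flatten into a list, visiting each shared sub-set only once.
  std::vector<T> Flatten() const {
    std::vector<T> result;
    std::set<const NestedSet*> visited;
    FlattenInto(&result, &visited);
    return result;
  }

 private:
  void FlattenInto(std::vector<T>* out,
                   std::set<const NestedSet*>* visited) const {
    if (!visited->insert(this).second)
      return;  // Already reached through another path in the DAG.
    for (const auto& sub : transitive_)
      sub->FlattenInto(out, visited);
    out->insert(out->end(), direct_.begin(), direct_.end());
  }

  std::vector<T> direct_;                                    // own items
  std::vector<std::shared_ptr<const NestedSet>> transitive_; // merged sets
};
```

A sub-set shared by many dependents is stored once and visited once during flattening, which is where the memory and CPU savings come from.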
The end result of all these changes is speeding up `gn gen` by 35% (removing 5s from a Fuchsia build), and reducing overall peak RAM usage by 25% (up to 1 GiB of RAM for a Fuchsia build), without changing GN's output, as verified manually for the Chromium and Fuchsia builds on Linux.
On GN's memory usage
GN's main task is to:
1) Read a set of input BUILD.gn files.
2) Evaluate them to build a directed acyclic graph of build targets.
3) For the `gn gen` operation, use the graph to generate a set of Ninja commands.
4) For other commands, perform computations over the graph to answer various queries (e.g. find paths between targets).
Measurements performed on a moderate Fuchsia build configuration, which uses around 5700 input files and 203000 targets, show that:
- peak RAM usage for `gn check` is around 3.7 GiB
- peak RAM usage for `gn gen` is around 3.9 GiB
Moreover, the Fuchsia tree provides 6800+ input files, for a total of 2.4 MiB. Assuming a generous 50x expansion when loaded in memory, this amounts to about 100 MiB of RAM used to load the input files in the above example.
This suggests a very crude approximation in which the build graph takes around 3.4 GiB of RAM. At this size [1], cache effects affect overall performance significantly, and the memory layout of the data structures used by GN at runtime matters.
A previous CL illustrates this: by replacing an std::map<> with two parallel vectors and a small hash map, `gn gen` time was reduced by 10%, and overall peak RAM usage was reduced by 242 MiB, without changing the parent class' interface or its behavior.
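For illustration, the general shape of such a replacement might look like the following sketch (hypothetical class and method names, not the actual CL): keys and values live in two contiguous parallel vectors, giving good locality and no per-node allocations, while a small hash map of key-to-index supports fast lookups.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch of a flat map: two parallel vectors plus an index hash map,
// replacing an std::map<std::string, std::string> and its node-per-entry
// allocations.
class FlatStringMap {
 public:
  // Returns the value slot for |key|, inserting an empty value if needed.
  std::string& operator[](const std::string& key) {
    auto it = index_.find(key);
    if (it != index_.end())
      return values_[it->second];
    index_.emplace(key, keys_.size());
    keys_.push_back(key);
    values_.emplace_back();
    return values_.back();
  }

  // Returns the value for |key|, or nullptr if absent.
  const std::string* Find(const std::string& key) const {
    auto it = index_.find(key);
    return it == index_.end() ? nullptr : &values_[it->second];
  }

  size_t size() const { return keys_.size(); }

 private:
  std::vector<std::string> keys_;    // contiguous, insertion-ordered
  std::vector<std::string> values_;  // parallel to |keys_|
  std::unordered_map<std::string, size_t> index_;  // key -> vector index
};
```

Iteration over the vectors touches contiguous memory instead of chasing tree-node pointers, which is what pays off once the data set is large.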
Generally speaking, the simplest ways to improve performance when doing computations over a very large data set are to:
- Minimize overall RAM usage / heap allocations, to decrease the number of cache misses at runtime.
- Choose data structures that minimize the amount of pointer-chasing during computations.
- Segregate the data between "cold" and "hot" parts, to ensure better data locality of the hot parts which are supposed to be used the most during the computations.
That's essentially what the proposed changes do.
How GN loads input files
The Loader class implements loading, caching and evaluation of input BUILD.gn files (and their imports). It uses a worker thread pool to perform most of its work in parallel background threads, but hides this behind an interface that provides both synchronous and asynchronous entry points.
The result of evaluating a BUILD.gn file, in a specific GN toolchain context, is a list of *partially-constructed* Item instances.
The Item class is used to represent any labelled item that can be present or referenced in the build graph. There are currently 4 item types: Target, Config, Toolchain and Pool (all of them derived classes of the base Item class). For simplicity, the rest of this document will focus on targets, because the proposed changes only affect them.
The content of these partially-constructed instances comes directly from the corresponding item definition in the GN language, either stated explicitly in a BUILD.gn file, or produced by expanding a GN template call.
For example, a Target instance has lists of labels for the private, public, and data dependencies specified through `deps`, `public_deps` and `data_deps` in its GN language definition.
Note that some fields of these Item instances are not properly initialized yet: this happens later, when the items are added to the graph.
How GN builds the Item graph
The Builder class is responsible for building the overall graph in memory. It runs exclusively in the main thread, which happens to be a noticeable performance bottleneck at runtime. The graph is constructed incrementally in the following way:
- The Builder calls the Loader class recursively to load `BUILD.gn` files and receive the corresponding partially-constructed items. These are added to the current graph, then their dependencies are used to load other build files that define them.
Hence targets are added to the graph top-to-bottom, always in a partially-constructed state.
- During construction, once all dependencies of a given target have been loaded, or if there are no dependencies, the target's state is "resolved" by calling `Target::OnResolved()`, which performs a number of computations that result in the "final" state of the target.
Target resolution is how GN propagates information from dependencies to dependents. For example, this is how public configs and public dependencies are pushed up the graph. More on this later.
Once a target has been resolved, its state is final, and all its dependents that were waiting for it can be resolved (unless they are waiting on other dependencies).
During `gn gen`, as a special case, as soon as a target is resolved, a task is posted to the worker thread pool to generate a Ninja fragment from it. Note that since at this point the state of the target, and that of all its dependencies, is final, it can safely be used as a shared read-only input by multiple threads.
On the design of the Target class
The Target class definition is quite large, because it has to support all the different target types of the GN language; there are no sub-classes specialized for certain target types. This results in a large number of members and methods.
Combined with the fact that some of these members are only valid after target resolution, the source code can be difficult to understand: there is no real distinction, through comments, runtime checks, or separation into different objects, between final and partial values. Similarly, it is not always obvious which methods return valid values before target resolution, or when.
Unsurprisingly, this has led to subtle bugs. For example, this CL fixed a bug where a method called during target resolution relied on the result of another method that only returned valid values after resolution was completed for the same instance. This was harder to spot in code review because sub-objects of Target can themselves be partially-constructed (in this case, the problem was hidden in the SwiftValues class).
Another example is this CL, which fixed unit-tests that forgot to call Target::OnResolved() and thus computed and verified incorrect results from partial target state.
On computations performed at target resolution time
Various target types need to perform computations over the tree of their dependencies when generating Ninja fragments. For example, an `executable()` target must collect all the `libs` and `lib_dirs` values from its transitive dependencies to include them in the final linker command string.
In the example above, these are returned by methods like `Target::all_libs()` and `Target::all_lib_dirs()`, which return vectors of `LibFile` and `SourceDir` values respectively.
The implementation of these methods does not compute the result on demand; instead, everything is pre-computed at target resolution time, in reverse dependency order. The main benefit of this approach is that the values associated with a given target are only computed once.
For example, if several `executable()` targets depend on the same `static_library()` target (which itself has a deep tree of source_set() dependencies), the associated values are available immediately, avoiding duplicate walks over the dependency tree.
There are, however, a number of drawbacks to this approach:
- All resolved values are computed in `Target::OnResolved()` which can only run in the main thread, which is in practice a noticeable performance bottleneck.
- Because `Target::OnResolved()` doesn't know anything about the target's dependents, it always computes all possible values, even if there is actually no need for it. This results in lots of unnecessary work and extra allocations.
- Many of these values are simple vectors, sets or maps that can become quite large in a deep build graph, because they contain the same items duplicated at multiple levels of the hierarchy. This results in lots of container merges that all happen in the main thread, and eats up significant memory, since the result is stored within Target instances unconditionally.
Recently, two CLs were proposed to move the resolution data for binary targets to a specific class, and to delay its upstream propagation until it could be determined that it was actually needed. This resulted in a 10% reduction in `gn gen` time (in other words, `gn gen` spends 10% of its time computing unnecessary data), but was rejected due to the added complexity of the target resolution algorithm (it allegedly made the code even more difficult to follow).
Another attempt tried to perform target resolution in background worker threads as well, but introduced complexity of its own (requiring mutexes and locking of certain data structures, with the author suggesting adding thread annotations to the GN source base).
The proposed changes solve the issue by moving the target resolution computations to a separate class that can be used safely outside of the main thread, keeping its code simple (and extremely similar to the one currently used by `Target::OnResolved()`), while avoiding the use of locks and thread annotations.
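As an illustration of the general idea (hypothetical names and deliberately simplified types, not the actual TargetResolvedData implementation), an on-demand, memoized recursive computation can look like this; each worker thread would own its own instance, so no locking is needed over the shared read-only target graph:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Simplified stand-in for a resolved target: direct `libs` values plus
// pointers to already-resolved dependencies.
struct Target {
  std::vector<std::string> libs;
  std::vector<const Target*> deps;
};

// Sketch of on-demand resolution data: values are computed lazily per
// target and memoized, so a sub-tree shared by many dependents is only
// walked once per ResolvedData instance. Deduplication of repeated lib
// names is omitted here for brevity.
class ResolvedData {
 public:
  const std::vector<std::string>& AllLibs(const Target* target) {
    auto it = cache_.find(target);
    if (it != cache_.end())
      return it->second;  // Memoized result from an earlier walk.
    std::vector<std::string> result;
    for (const Target* dep : target->deps) {
      const auto& dep_libs = AllLibs(dep);  // Recursive, memoized.
      result.insert(result.end(), dep_libs.begin(), dep_libs.end());
    }
    result.insert(result.end(), target->libs.begin(), target->libs.end());
    // unordered_map guarantees stable references to values, so returning
    // a reference into the cache is safe across later insertions.
    return cache_.emplace(target, std::move(result)).first->second;
  }

 private:
  std::unordered_map<const Target*, std::vector<std::string>> cache_;
};
```

Because the Target graph is treated as read-only input, the worst that can happen when several threads each hold their own ResolvedData is some redundant computation, never a data race.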
On mutable vs immutable containers
Most members in the Target class are implemented using simple mutable container types, like an `std::vector<>`, a `UniqueVector<>` (a vector with the added property of not allowing duplicates), or standard sets or maps of various types.
This is convenient while constructing Target instances, or the data they will store, but has a few drawbacks when it comes to memory:
- Mutable containers tend to use more heap memory than necessary: even an `std::vector<>` will over-allocate memory due to its exponential capacity growth policy, and a `UniqueVector<>` instance contains a vector as well as a companion hash map, which becomes useless once the object has been finalized.
- Containers that do not have this problem, like `std::map<>`, tend to rely on a lot of small object allocations, which creates cache-unfriendly data structures that work poorly once the data set becomes very large (see example listed previously), or have poor merging performance at runtime.
In the case of GN, these data sets are traversed very extensively when computing resolved data during target resolution, and the issues above become noticeable under profiling.
These issues are well-known in software engineering, and a number of immutable container data types have been implemented to address this problem:
- Because they are constructed once, then never modified, their data layout and heap allocation can be optimized to be efficient and minimal.
- Because they are immutable, their state is obvious when reading the source code, and they can be used as shared input for computations performed in parallel threads safely.
The main drawback of these types is that they must be constructed through a mutable "builder" class, which generally provides a number of mutating operations to determine the final, immutable output, as in the following hypothetical example:

ImmutableFoo foo = FooBuilder().AddItem(item1).AddDependency(dep1).Build();

In practice though, this is only an annoyance for certain unit-tests, which typically get mutable references to `Target` members and append things to them directly.
The proposed changes implement a few immutable data structures to reduce RAM usage and speed up traversals over them.
[1] For reference, the L3 cache on a very powerful workstation is about 36 MiB, while laptops with high-end Core i7 processors have an L3 cache of 8 MiB.