Hi,
I work on Android, and recently I've been experimenting with using threads to parallelize parts of Ninja and reduce its no-op run-time overhead. Ninja typically adds about 12-20 seconds of overhead to an AOSP incremental build, which I've been able to reduce to about a second or two.
In my modified Ninja, I've parallelized these tasks:
- manifest loading
- .ninja_deps and .ninja_log loading
- edge command-hash computation, used to compute dirty nodes
- node stat calls, also used to compute dirty nodes
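To give a feel for the fan-out pattern these passes share, here's a minimal sketch of parallel command hashing. The names (HashCommand, HashAllCommands) and the FNV-1a hash are simplified stand-ins, not the actual Ninja code:

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <thread>
#include <vector>

// Stand-in for Ninja's command hash; the point here is the threading
// pattern, not the particular hash function.
inline uint64_t HashCommand(const std::string& command) {
  uint64_t h = 1469598103934665603ull;  // FNV-1a 64-bit
  for (char c : command) {
    h ^= static_cast<unsigned char>(c);
    h *= 1099511628211ull;
  }
  return h;
}

// Hash every edge's command string, giving each worker thread one
// contiguous slice of the edge list. Each worker writes to a disjoint
// range of |hashes|, so no locking is needed.
std::vector<uint64_t> HashAllCommands(const std::vector<std::string>& commands,
                                      size_t num_threads) {
  std::vector<uint64_t> hashes(commands.size());
  std::vector<std::thread> workers;
  size_t chunk = (commands.size() + num_threads - 1) / num_threads;
  for (size_t t = 0; t < num_threads; ++t) {
    size_t begin = t * chunk;
    size_t end = std::min(commands.size(), begin + chunk);
    if (begin >= end) break;
    workers.emplace_back([&, begin, end] {
      for (size_t i = begin; i < end; ++i)
        hashes[i] = HashCommand(commands[i]);
    });
  }
  for (auto& w : workers) w.join();
  return hashes;
}
```

The same shape (split a flat array into slices, one thread per slice, join) also fits the stat pass, since each node's stat result is independent.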
Details:
My modified Ninja leaves the existing file formats alone, and it still loads only one manifest or log file at a time. When it loads a file, it first splits it into chunks blindly, then searches for the start of a record or declaration in each chunk. For build logs, the delimiter is simply an LF character. Manifests and deps logs have more complicated delimiting. (Roughly, the parser searches a manifest for a declaration preceded by an unescaped newline, and it searches a v3 deps log for a 0x8000xxxx word that starts a deps record.)
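For the build-log case (LF-delimited records), the chunking step looks roughly like this. This is a simplified sketch, not the actual code; it just shows "split blindly, then advance each boundary to the next record start":

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Split a build-log buffer into roughly equal [begin, end) byte ranges,
// then shift each interior boundary forward to the first byte after an
// LF, so that every chunk starts at the beginning of a record. (Manifests
// and deps logs need the smarter delimiter searches described above; for
// the build log an LF is enough.)
std::vector<std::pair<size_t, size_t>> SplitLogChunks(const std::string& buf,
                                                      size_t num_chunks) {
  std::vector<size_t> starts;
  starts.push_back(0);
  size_t step = buf.size() / num_chunks + 1;
  for (size_t i = 1; i < num_chunks; ++i) {
    size_t pos = buf.find('\n', i * step);  // blind split point -> next LF
    if (pos == std::string::npos) break;
    if (pos + 1 > starts.back()) starts.push_back(pos + 1);
  }
  std::vector<std::pair<size_t, size_t>> chunks;
  for (size_t i = 0; i < starts.size(); ++i) {
    size_t end = (i + 1 < starts.size()) ? starts[i + 1] : buf.size();
    if (starts[i] < end) chunks.emplace_back(starts[i], end);
  }
  return chunks;
}
```

Each chunk can then be handed to a worker thread and parsed independently, since no record spans a chunk boundary.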
The parser loads manifests using several passes:
- A "DFS" pass. In depth-first order, the parser loads the manifest tree one file at a time. Each file is split into chunks, which are parsed in parallel. Once chunks are parsed, their bindings and rules are recorded in a Scope object that roughly corresponds to BindingEnv. Bindings can be evaluated lazily at this point (e.g. for "include ${foo}.ninja").
- An "edge setup" pass evaluates the paths on each edge, in parallel, and creates/finds Node objects. (This step uses a concurrent hash table to map from path to Node*.)
- A non-parallel pass deals with duplicate edges that output the same path. With dupbuild=warn, the duplicate edges can be filtered out.
- A parallelized pass initializes nodes' out-edges using edge inputs.
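The concurrent path->Node* table in the edge setup pass can be approximated with a mutex-sharded map. This is a toy stand-in (the real table is a proper concurrent hash table, and Node here is reduced to just a path); it only illustrates the interning idea -- every thread that evaluates the same path must get the same Node*:

```cpp
#include <array>
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

struct Node { std::string path; };

// Toy path -> Node* intern table, sharded by path hash with one mutex per
// shard so that threads evaluating different paths rarely contend.
class NodeTable {
 public:
  // Return the unique Node* for |path|, creating it on first use.
  Node* GetOrCreate(const std::string& path) {
    size_t shard = std::hash<std::string>()(path) % kShards;
    std::lock_guard<std::mutex> lock(mutexes_[shard]);
    auto it = maps_[shard].find(path);
    if (it != maps_[shard].end()) return it->second.get();
    auto node = std::make_unique<Node>(Node{path});
    Node* raw = node.get();
    maps_[shard].emplace(path, std::move(node));
    return raw;
  }

 private:
  static constexpr size_t kShards = 16;
  std::array<std::mutex, kShards> mutexes_;
  std::array<std::unordered_map<std::string, std::unique_ptr<Node>>, kShards>
      maps_;
};
```

Because the table hands back a stable Node* for each path, the later out-edge pass can compare and link nodes by pointer without re-canonicalizing paths.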
The parser records two kinds of locations on graph objects:
- A "ScopePosition" comprising a Scope* parent and an index. Each binding, rule, and edge has a ScopePosition. Ninja can evaluate paths or search for rules using an edge's ScopePosition and find the correct result even though it does the lookup after everything is parsed. When a variable name is rebound, a Scope object records all bindings for the name.
- A "DFS location" that tracks location within a depth-first walk of the manifest tree. DFS locations handle things that don't respect Ninja scopes -- e.g. pools and default targets.
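The ScopePosition idea can be sketched as follows. Names and layout here are hypothetical simplifications of what the patch does, not the actual code; the point is that a lookup from a recorded position sees only bindings declared before that position, so it gives the right answer even when it runs after the whole manifest is parsed:

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Scope;

// A position within a scope: which scope, and how far into its
// declarations we are.
struct ScopePosition {
  Scope* scope = nullptr;
  int index = 0;
};

// Each scope records every binding of a name together with its
// declaration index, plus the position in the enclosing scope at which
// this scope was introduced (e.g. by a subninja statement).
struct Scope {
  ScopePosition parent_pos;  // {nullptr, 0} for the root scope
  std::map<std::string, std::vector<std::pair<int, std::string>>> bindings;
};

// Resolve |name| as seen from |pos|: within a scope, the last binding
// declared strictly before the position wins; unresolved names fall back
// to the enclosing scope at the point where this scope began.
std::string LookUp(const std::string& name, ScopePosition pos) {
  while (pos.scope) {
    auto it = pos.scope->bindings.find(name);
    if (it != pos.scope->bindings.end()) {
      const std::string* result = nullptr;
      for (const auto& binding : it->second)
        if (binding.first < pos.index) result = &binding.second;
      if (result) return *result;
    }
    pos = pos.scope->parent_pos;
  }
  return "";
}
```

Recording all bindings per name (rather than just the latest) is what makes this work when a variable is rebound partway through a file.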
Benchmarks:
Some statistics for context: In a typical AOSP master checkout today, with a hikey960-userdebug lunch target, the Ninja manifest has about 930K nodes, 640K edges, and 250K rules. Most of the rules come from the kati Makefile->ninja generator, which creates a separate rule for each output. The main AOSP manifest tree for this configuration has four files:
combined-hikey960.ninja [184 bytes]
subninja build-hikey960.ninja [generated by kati, 412MB]
subninja build-hikey960-package.ninja [generated by kati, 37403 bytes]
subninja soong/build.ninja [generated by soong, 1078MB]
IIRC, the build log and deps log files for Android can grow to hundreds of megabytes and require several seconds to load in total. Android's build system invokes Ninja three times during a build, with different manifests but the same log files, so log parsing time is important.
AOSP currently uses Ninja 1.8.2 with a few Android-specific patches. For these measurements, I'm using an E5-2690 v4 28-core/56-thread CPU with 128GiB of RAM and an NVMe disk. Assuming warm disk caches, I see these improvements when running "ninja -f out/combined-hikey960.ninja <target>":
- Building "nothing": 12.0s -> 0.5s
- Building "droid": 13.8s -> 1.0s
("droid" builds a working system image.)
If I drop FS caches before each run (echo 3 > /proc/sys/vm/drop_caches), I see these numbers:
- Building "nothing": 12.0s -> 1.4s
- Building "droid": 20.2s -> 2.3s
I also benchmarked the effect on a Chromium build. Chromium's gn build has ~4900 Ninja files rather than a few giant files, so manifest splitting doesn't help much. There's still an improvement, though. For the baseline, I compiled Ninja 1.9.0 from GitHub.
- query chrome (ninja -t query chrome): 1.6s -> 0.3s
- build chrome (ninja chrome): 2.7s -> 0.9s
With dropped FS caches:
- query chrome: 3.2s -> 1.7s
- build chrome: 6.0s -> 2.9s
I haven't tried running it on Windows yet, and IIRC, I first need to implement memory-mapping of files on Windows. I think I've implemented the "slash-bits" handling correctly, which required determining the slash-bits of the earliest reference to a path.
Code:
I'm not sure how interested upstream Ninja is in this work -- it does make Ninja more complex. (Ignoring re2c output, my current changes add ~3300 lines.) I thought I should share it in any case.