Parallelized Ninja for faster Android builds


rpri...@google.com

Feb 27, 2019, 7:54:05 PM
to ninja-build
Hi,

I work on Android, and recently I've been experimenting with using threads to parallelize parts of Ninja and reduce its no-op run-time overhead. Ninja typically adds about 12-20 seconds of overhead to an AOSP incremental build, which I've been able to reduce to about a second or two.

In my modified Ninja, I've parallelized these tasks:
  • manifest loading
  • .ninja_deps and .ninja_log loading
  • edge command-hash computation, used to compute dirty nodes
  • node stat calls, also used to compute dirty nodes
Details:

My modified Ninja leaves the existing file formats alone, and it still loads only one manifest or log file at a time. When it loads a file, it first splits it into chunks blindly, then searches for the start of a record or declaration in each chunk. For build logs, the delimiter is simply an LF character. Manifests and deps logs have more complicated delimiting. (Roughly, the parser searches a manifest for a declaration preceded by an unescaped newline, and it searches a v3 deps log for a 0x8000xxxx word that starts a deps record.)
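
As an illustration, the build-log case might look roughly like this (a sketch of the idea, not the actual patch; the function name is mine):

  #include <cstring>
  #include <utility>
  #include <vector>

  // Split [data, data+size) into roughly equal chunks, then snap each chunk's
  // start forward to the byte after the next LF, so every chunk begins on a
  // build-log record boundary. Chunks can then be parsed on separate threads.
  std::vector<std::pair<size_t, size_t>> SplitLogIntoChunks(
      const char* data, size_t size, size_t num_chunks) {
    std::vector<size_t> starts(1, 0);  // First chunk starts at offset 0.
    for (size_t i = 1; i < num_chunks; ++i) {
      size_t pos = size * i / num_chunks;
      const char* nl =
          static_cast<const char*>(memchr(data + pos, '\n', size - pos));
      if (!nl) break;  // No later record start; the last chunk takes the tail.
      size_t start = static_cast<size_t>(nl - data) + 1;
      if (start > starts.back()) starts.push_back(start);
    }
    std::vector<std::pair<size_t, size_t>> chunks;
    for (size_t i = 0; i < starts.size(); ++i)
      chunks.push_back(std::make_pair(
          starts[i], i + 1 < starts.size() ? starts[i + 1] : size));
    return chunks;
  }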

The parser loads manifests using several passes:
  • A "DFS" pass. In depth-first order, the parser loads the manifest tree one file at a time. Each file is split into chunks, which are parsed in parallel. Once chunks are parsed, their bindings and rules are recorded in a Scope object that roughly corresponds to BindingEnv. Bindings can be evaluated lazily at this point (e.g. for "include ${foo}.ninja").
  • An "edge setup" pass evaluates the paths on each edge, in parallel, and creates/finds Node objects. (This step uses a concurrent hash table to map from path to Node*; see the sketch after this list.)
  • A non-parallel pass deals with duplicate edges outputting the same path. Some edges can be filtered out with dupbuild=warn.
  • A parallelized pass initializes nodes' out-edges using edge inputs.
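
The concurrent hash table mentioned in the edge-setup pass could be as simple as a sharded map with per-shard locks (an assumption on my part -- the actual patch may use something fancier):

  #include <memory>
  #include <mutex>
  #include <string>
  #include <unordered_map>

  struct Node { std::string path; };  // Stand-in for Ninja's graph node.

  // Sharded path -> Node* interning table: threads evaluating different
  // edges usually hash to different shards and so rarely contend.
  class ConcurrentNodeTable {
   public:
    Node* GetOrCreate(const std::string& path) {
      Shard& shard = shards_[std::hash<std::string>()(path) % kNumShards];
      std::lock_guard<std::mutex> lock(shard.mu);
      std::unique_ptr<Node>& slot = shard.map[path];
      if (!slot) {  // First thread to intern this path creates the node.
        slot.reset(new Node);
        slot->path = path;
      }
      return slot.get();
    }

   private:
    static const size_t kNumShards = 64;
    struct Shard {
      std::mutex mu;
      std::unordered_map<std::string, std::unique_ptr<Node>> map;
    };
    Shard shards_[kNumShards];
  };
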
The parser records two kinds of locations on graph objects:
  • A "ScopePosition" comprising a Scope* parent and an index. Each binding, rule, and edge has a ScopePosition. Ninja can evaluate paths or search for rules using an edge's ScopePosition and find the correct result even though it does the lookup after everything is parsed. When a variable name is rebound, a Scope object records all bindings for the name. (See the sketch after this list.)
  • A "DFS location" that tracks location within a depth-first walk of the manifest tree. DFS locations handle things that don't respect Ninja scopes -- e.g. pools and default targets.
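
To make the ScopePosition idea concrete, here's my reconstruction (not the patch's actual types): each scope keeps every binding of a name in declaration order, and a lookup from position N only sees bindings declared before N.

  #include <climits>
  #include <map>
  #include <string>
  #include <utility>
  #include <vector>

  struct Scope;

  // Each binding, rule, and edge carries one of these.
  struct ScopePosition {
    Scope* scope;
    int index;  // Declaration order within the scope.
  };

  struct Scope {
    Scope* parent = nullptr;
    // Every binding of each name, in declaration order, with its position.
    std::map<std::string, std::vector<std::pair<int, std::string>>> bindings;
  };

  // Resolve 'name' as seen from 'pos': the latest binding in pos.scope
  // declared before pos.index, else whatever an enclosing scope sees.
  const std::string* Lookup(const std::string& name, ScopePosition pos) {
    for (Scope* s = pos.scope; s; s = s->parent) {
      auto it = s->bindings.find(name);
      if (it != s->bindings.end()) {
        const auto& v = it->second;
        for (size_t i = v.size(); i-- > 0;)
          if (v[i].first < pos.index) return &v[i].second;
      }
      // Simplification: a real implementation would cap the parent search at
      // the subscope's own declaration position, not the parent's end.
      pos.index = INT_MAX;
    }
    return nullptr;
  }
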
Benchmarks:

Some statistics for context: In a typical AOSP master checkout today, with a hikey960-userdebug lunch target, the Ninja manifest has about 930K nodes, 640K edges, and 250K rules. Most of the rules come from the kati Makefile->ninja generator, which creates a separate rule for each output. The main AOSP manifest tree for this configuration has four files:

combined-hikey960.ninja [184 bytes]
subninja build-hikey960.ninja [generated by kati, 412MB]
subninja build-hikey960-package.ninja [generated by kati, 37403 bytes]
subninja soong/build.ninja [generated by soong, 1078MB]

IIRC, the build log and deps log files for Android can grow to hundreds of megabytes and require several seconds to load in total. Android's build system invokes Ninja three times during a build, with different manifests but the same log files, so log parsing time is important.

AOSP currently uses Ninja 1.8.2 with a few Android-specific patches. For these measurements, I'm using an E5-2690 v4 28-core/56-thread CPU with 128GiB of RAM and an NVMe disk. Assuming warm disk caches, I see these improvements when running "ninja -f out/combined-hikey960.ninja <target>":
  • Building "nothing": 12.0s -> 0.5s
  • Building "droid": 13.8s -> 1.0s
("droid" builds a working system image.)

If I drop FS caches before each run (echo 3 > /proc/sys/vm/drop_caches), I see these numbers:
  • Building "nothing": 12.0s -> 1.4s
  • Building "droid": 20.2s -> 2.3s
I also benchmarked the effect on a Chromium build. Chromium's gn build has ~4900 Ninja files rather than a few giant files, so manifest splitting doesn't help much. There's still an improvement, though. For the baseline, I compiled Ninja 1.9.0 from GitHub.
  • query chrome (ninja -t query chrome): 1.6s -> 0.3s
  • build chrome (ninja chrome): 2.7s -> 0.9s
With dropped FS caches:
  • query chrome: 3.2s -> 1.7s
  • build chrome: 6.0s -> 2.9s
I haven't tried running it on Windows yet, and IIRC, I first need to implement memory-mapping files on Windows. I think I've implemented the "slash-bits" stuff correctly, which required determining the slash-bits of the earliest reference to a path.

Code:

My code is currently based on the AOSP Ninja repository, which I think is just Ninja 1.8.2 with some minor patches. I could rebase it on upstream Ninja with some effort. It's currently hosted on AOSP Gerrit (https://android-review.googlesource.com/c/platform/external/ninja/+/910715). I can successfully build that repository without the Android build system (i.e. using configure.py + ninja). The code uses C++11 features extensively and C++14 features less extensively. I'm aware that Ninja is currently C++98-only. (https://github.com/ninja-build/ninja/issues/1403).

I'm not sure how interested upstream Ninja is in this work -- it does make Ninja more complex. (e.g. Ignoring re2c output, my current changes add ~3300 lines.) I thought I should share it in any case.

Jan Niklas Hasse

Feb 28, 2019, 4:38:55 AM
to ninja-build
Hi,

It sounds great! I would be very interested in the changes.

One thing to benchmark would be performance changes with builds that are already fast. Multi-threading introduces overhead which can hurt smaller use-cases.
 
The code uses C++11 features extensively and C++14 features less extensively. I'm aware that Ninja is currently C++98-only. (https://github.com/ninja-build/ninja/issues/1403).

We want ninja to run (or rather compile) on older systems. As C++11 is rather old itself, I don't think it's that far off (as compilers have started to enable it by default and we don't pass -std=c++98 on the cmd line, I think that all of our 1.9.0 binaries are currently built in C++11-or-higher-mode anyway). I wouldn't worry about it too much for the moment.
 
I'm not sure how interested upstream Ninja is in this work -- it does make Ninja more complex. (e.g. Ignoring re2c output, my current changes add ~3300 lines.) I thought I should share it in any case.

I only had a brief look, but I think some parts could be simplified. For example, I wouldn't add a "nothreads" option -- and why do you need the CHECK macro instead of assert()?

How much does the memory-file-mapping contribute to the speed-up? Could it even work without that?

Evan Martin

Feb 28, 2019, 9:34:44 PM
to rpri...@google.com, ninja-build
This is really awesome!  Some comments inline.

On Wed, Feb 27, 2019 at 4:54 PM rprichard via ninja-build <ninja...@googlegroups.com> wrote:
  • manifest loading
  • .ninja_deps and .ninja_log loading
  • edge command-hash computation, used to compute dirty nodes
  • node stat calls, also used to compute dirty nodes
Details:

My modified Ninja leaves the existing file formats alone, and it still loads only one manifest or log file at a time. When it loads a file, it first splits it into chunks blindly, then searches for the start of a record or declaration in each chunk. For build logs, the delimiter is simply an LF character. Manifests and deps logs have more complicated delimiting. (Roughly, the parser searches a manifest for a declaration preceded by an unescaped newline, and it searches a v3 deps log for a 0x8000xxxx word that starts a deps record.)

The parser loads manifests using several passes:
  • A "DFS" pass. In depth-first order, the parser loads the manifest tree one file at a time. Each file is split into chunks, which are parsed in parallel. Once chunks are parsed, their bindings and rules are recorded in a Scope object that roughly corresponds to BindingEnv. Bindings can be evaluated lazily at this point (e.g. for "include ${foo}.ninja").
  • An "edge setup" pass evaluates the paths on each edge, in parallel, and creates/finds Node objects. (This step uses a concurrent hash table to map from path to Node*.)
  • A non-parallel pass deals with duplicate edges outputting the same path. Some edges can be filtered out with dupbuild=warn.
  • A parallelized pass initializes nodes' out-edges using edge inputs.
The parser records two kinds of locations on graph objects:
  • A "ScopePosition" comprising a Scope* parent and an index. Each binding, rule, and edge has a ScopePosition. Ninja can evaluate paths or search for rules using an edge's ScopePosition and find the correct result even though it does the lookup after everything is parsed. When a variable name is rebound, a Scope object records all bindings for the name.
  • A "DFS location" that tracks location within a depth-first walk of the manifest tree. DFS locations handle things that don't respect Ninja scopes -- e.g. pools and default targets.
One thing to keep in mind is whether a change in architecture in some of these areas would make these tasks easier.
For example, the deps log could be redesigned to allow more efficient loading (or at least scanning).
Or we had talked about how dupbuild should be a fatal error always, in which case you might not need such special handling of it.  (The only reason it's a flag right now is that we accidentally didn't make it an error from the start...)
Or we could change the scoping rules such that things behave more consistently.

Benchmarks:

Some statistics for context: In a typical AOSP master checkout today, with a hikey960-userdebug lunch target, the Ninja manifest has about 930K nodes, 640K edges, and 250K rules. Most of the rules come from the kati Makefile->ninja generator, which creates a separate rule for each output. The main AOSP manifest tree for this configuration has four files:

As I recall from the last time I talked with someone on Android, I think it was generating really inefficient Ninja files.  I wonder whether spending more time up front being careful about the Ninja files you generate would save you more time here.  Again, it's maybe worth considering changes to Ninja's architecture to enable your use cases.  I think the Ninja basics (how variables and scopes etc. work) were more or less derived from how gyp (part of Chrome) worked.

On the other hand, 930k nodes is pretty big no matter how you do it!  I think Chrome was like <100k when I was working on it.
I know Android is much bigger than Chrome, but is it possible that kati generates a lot of stamp files?  I vaguely recall something about how its makefiles are structured...

combined-hikey960.ninja [184 bytes]
subninja build-hikey960.ninja [generated by kati, 412MB]
subninja build-hikey960-package.ninja [generated by kati, 37403 bytes]
subninja soong/build.ninja [generated by soong, 1078MB]

IIRC, the build log and deps log files for Android can grow to hundreds of megabytes and require several seconds to load in total. Android's build system invokes Ninja three times during a build, with different manifests but the same log files, so log parsing time is important.

So here you have something like ~1.5GB of ninja files.  That is a lot of text considering it should mostly just be file paths!
I think in the past someone (from Android?) proposed a binary format for these files that would be more efficient -- for example, reusing the idea from the deps log, perhaps your generator could create a flat file that just lists paths, such that you get a dense mapping of paths to integers, and then the rest of the system could just work with integers.
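
For illustration only (this format doesn't exist anywhere; the layout below is made up), such a file could assign IDs implicitly by order of appearance:

  #include <cstdint>

  // Hypothetical binary manifest layout, reusing the deps-log trick:
  //
  //   path table: path_0 '\0' path_1 '\0' ... path_N '\0'
  //               (a path's ID is simply its index in this table)
  //   edge table: fixed-width records that refer to paths only by ID
  struct EdgeRecord {
    uint32_t rule_id;
    uint32_t output_count;
    uint32_t input_count;
    // Followed by output_count + input_count uint32_t path IDs.
  };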

I'm not sure how interested upstream Ninja is in this work -- it does make Ninja more complex.

Disclaimer: my suggestions above about changing the architecture haven't been approved by upstream Ninja, they're just my own ideas.  ;) 

Ryan Prichard

Mar 1, 2019, 8:20:24 PM
to Jan Niklas Hasse, ninja-build
On Thu, Feb 28, 2019 at 1:38 AM Jan Niklas Hasse <jha...@gmail.com> wrote:
Hi,

it sounds great! I would be very interested in the changes.

One thing to benchmark would be performance changes with builds that are already fast. Multi-threading introduces overhead which can hurt smaller use-cases.

Is there a specific use case I should test? My code does a couple of things currently to reduce multi-threading overhead, which I needed for good Chromium performance:
 - If there's only one task to run, it avoids the thread pool.
 - It tries to avoid splitting a manifest into chunks smaller than some threshold (currently 1MiB, but that threshold isn't carefully chosen).

I could avoid distributing tiny amounts of work onto worker threads more generally. Maybe it'd also make sense to avoid creating worker threads until they're needed.
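
A minimal version of that dispatch heuristic might look like this (a sketch using bare std::thread where the real code has a pool):

  #include <functional>
  #include <thread>
  #include <vector>

  // Run a batch of independent tasks, paying for threads only when there is
  // more than one task; a single task runs inline on the calling thread.
  void RunTasks(const std::vector<std::function<void()>>& tasks) {
    if (tasks.size() <= 1) {
      for (const auto& task : tasks) task();
      return;
    }
    std::vector<std::thread> threads;
    threads.reserve(tasks.size());
    for (const auto& task : tasks) threads.emplace_back(task);
    for (auto& t : threads) t.join();
  }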

The code uses C++11 features extensively and C++14 features less extensively. I'm aware that Ninja is currently C++98-only. (https://github.com/ninja-build/ninja/issues/1403).

We want ninja to run (or rather compile) on older systems. As C++11 is rather old itself, I don't think it's that far off (as compilers have started to enable it by default and we don't pass -std=c++98 on the cmd line, I think that all of our 1.9.0 binaries are currently built in C++11-or-higher-mode anyway). I wouldn't worry about it too much for the moment.

I'd be OK with having just C++11. I think I overestimated how much C++14 I'm using.

I'm not sure how interested upstream Ninja is in this work -- it does make Ninja more complex. (e.g. Ignoring re2c output, my current changes add ~3300 lines.) I thought I should share it in any case.

I only had a brief look, but I think some parts could be simplified. For example, I wouldn't add a "nothreads" option -- and why do you need the CHECK macro instead of assert()?

I think I added CHECK() because assert() was compiled out in the Android Ninja build. I could remove CHECK (and turn assert() on in our build, at least for ninja_test). The Android build has a separate ASAN-ified ninja binary, so maybe that could have assertions enabled?

Incidentally, there's an assertion check in Android's Ninja repo that currently fails, but only on an actual Android manifest, not during ninja_tests:

  if (!deps_type.empty() && !config_.dry_run) {
    assert(edge->outputs_.size() == 1 && "should have been rejected by parser");

The Android Ninja manifest parser relaxed the "deps" handling to allow extra implicit outputs, but we haven't updated the assert yet.

I mostly added "-d threads" for defensive purposes. e.g. If there's a bug in the threading/sharding, it could be disabled. I'm not sure I actually want that flag, though.

How much does the memory-file-mapping contribute to the speed-up? Could it even work without that?

Yeah, it should work without using mmap.

The lexer needs a loaded manifest file to be followed by a NUL character, which I arrange by (re)mapping an extra page of NULs after the end of the file. I'm keeping all manifest files mmap'ed until the whole manifest tree is parsed, rather than unloading individual files once they're parsed. For the build and deps logs, I'm mapping the whole file at once rather than reading a 256KiB/512KiB chunk at a time.

mmap has the downside that different stages of parsing could see inconsistent content if something modifies a file while it's being loaded. All files are unmapped before Ninja does any work.
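
The NUL-page trick could look something like this on POSIX (my sketch, not the patch's loader): reserve an anonymous zero-filled region one page larger than the file, then map the file over the front of it, so the byte after the last file byte always reads as NUL.

  #include <fcntl.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>

  // Map a file for reading with a guaranteed NUL byte after its last byte:
  // reserve a zero-filled anonymous region one page larger than the file,
  // then map the file over the front of it with MAP_FIXED.
  const char* MapFileWithNulPage(const char* path, size_t* size_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
    size_t size = static_cast<size_t>(st.st_size);
    size_t page = static_cast<size_t>(sysconf(_SC_PAGESIZE));
    size_t reserve = (size + page - 1) / page * page + page;
    void* base =
        mmap(nullptr, reserve, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) { close(fd); return nullptr; }
    if (size > 0 && mmap(base, size, PROT_READ, MAP_PRIVATE | MAP_FIXED,
                         fd, 0) == MAP_FAILED) {
      munmap(base, reserve);
      close(fd);
      return nullptr;
    }
    close(fd);  // The mapping outlives the descriptor.
    *size_out = size;
    return static_cast<const char*>(base);  // munmap 'reserve' bytes later.
  }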

I benchmarked three approaches to reading files:
 - read()
 - pread() from many threads into a common buffer
 - mmap

I wrote a test program that loads a 1GB manifest file, sums each byte, and frees memory before exiting. I see total run-times like so (assuming a warm disk cache):
 - read(): 606ms
 - parallel pread(): 115ms
 - mmap(): 70ms

About 40ms of the mmap time comes from unmapping the file. Skipping the explicit munmap call doesn't help, though; presumably the cost just moves to process teardown.
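
The parallel pread() variant amounts to giving each thread its own contiguous slice of the file (a sketch, not the benchmark program itself; error handling elided):

  #include <unistd.h>

  #include <thread>
  #include <vector>

  // Fill 'buf' (already sized to the file length) with pread() calls issued
  // from several threads, each reading one contiguous slice of the file.
  void ParallelRead(int fd, char* buf, size_t size, size_t num_threads) {
    std::vector<std::thread> threads;
    for (size_t i = 0; i < num_threads; ++i) {
      size_t begin = size * i / num_threads;
      size_t end = size * (i + 1) / num_threads;
      threads.emplace_back([fd, buf, begin, end]() {
        size_t off = begin;
        while (off < end) {
          ssize_t n = pread(fd, buf + off, end - off, static_cast<off_t>(off));
          if (n <= 0) break;  // Error handling elided in this sketch.
          off += static_cast<size_t>(n);
        }
      });
    }
    for (auto& t : threads) t.join();
  }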

-Ryan

Matthew Woehlke

Mar 8, 2019, 11:57:07 AM
to ninja...@googlegroups.com
On 01/03/2019 20.20, 'Ryan Prichard' via ninja-build wrote:
> On Thu, Feb 28, 2019 at 1:38 AM Jan Niklas Hasse wrote:
>> We want ninja to run (or rather compile) on older systems. As C++11 is
>> rather old itself, I don't think it's that far off (as compilers have
>> started to enable it by default and we don't pass -std=c++98 on the cmd
>> line, I think that all of our 1.9.0 binaries are currently built in
>> C++11-or-higher-mode anyway). I wouldn't worry about it too much for the
>> moment.
>
> I'd be OK with having just C++11. I think I overestimated how much C++14
> I'm using.

If it's not overly onerous, I would suggest trying to target GCC 4.8,
which is the latest available on still-in-circulation Red Hat. This has
most of C++11, but not 100% (in particular, regex is missing).

--
Matthew

Nico Weber

Mar 8, 2019, 12:00:59 PM
to Matthew Woehlke, ninja-build
I'd recommend not using C++11 for a few more years. Build systems above all should be stable, reliable, and boring.

I'd also recommend not merging code that adds threads to ninja if it can be helped. 

Ben Boeckel

Mar 8, 2019, 1:38:44 PM
to Ryan Prichard, Jan Niklas Hasse, ninja-build
On Fri, Mar 01, 2019 at 17:20:11 -0800, 'Ryan Prichard' via ninja-build wrote:
> Incidentally, there's an assertion check in Android's Ninja repo that
> currently fails, but only on an actual Android manifest, not during
> ninja_tests:
>
>   if (!deps_type.empty() && !config_.dry_run) {
>     assert(edge->outputs_.size() == 1 && "should have been rejected by parser");
>
> The Android Ninja manifest parser relaxed the "deps" handling to allow
> extra implicit outputs, but we haven't updated the assert yet.

See this PR for fixing this once and for all :) :

https://github.com/ninja-build/ninja/pull/1534

Need to figure out how to get a test which runs a build and verifies
properties about it, but that's the only blocker.

--Ben

Jan Niklas Hasse

Mar 11, 2019, 6:43:59 PM
to ninja-build
On Friday, March 8, 2019 at 18:00:59 UTC+1, Nico Weber wrote:
I'd also recommend not merging code that adds threads to ninja if it can be helped.

Interesting, why not? Because of the overhead to start them?

Ryan Prichard

Mar 11, 2019, 7:41:53 PM
to Matthew Woehlke, ninja-build
On Fri, Mar 8, 2019 at 8:57 AM 'Matthew Woehlke' via ninja-build <ninja...@googlegroups.com> wrote:
If it's not overly onerous, I would suggest trying to target GCC 4.8,
which is the latest available on still-in-circulation Red Hat. This has
most of C++11, but not 100% (in particular, regex is missing).

GCC 4.8's C++11 support looks complete enough to me. FWIW, I see a RHEL/CentOS software package, "devtoolset", that backports new versions of GCC and libstdc++ to RHEL/CentOS 6 and 7.

Ninja's configure.py lists Solaris and AIX support, too. They look(?) like they have C++11 compilers.

-Ryan

Nico Weber

Mar 11, 2019, 9:07:42 PM
to Jan Niklas Hasse, ninja-build
Several reasons:

- Multithreaded code tends to be buggier and harder to reason about than single-threaded code.

- Multithreading and signals don't mix very well. You need to carefully write a bunch of subtle code to get it right. The posix_spawn patch from a while ago, for example, was a lot easier to reason about because there's just one thread.

Michael Jones

Mar 12, 2019, 7:24:41 AM
to Nico Weber, Jan Niklas Hasse, ninja-build
Interesting, why not? Because of the overhead to start them?

Several reasons:

- Multithreaded code tends to be buggier and harder to reason about than single-threaded code.

While it's mostly true that multithreaded code can be harder to reason about than single-threaded code, I don't really think it's fair to attribute "more buggy" to multithreaded code. That's an emergent property of the programmer and the conceptual difficulty of the code being written. I've easily removed as many bugs over my career from single-threaded code as from multithreaded code.

Newer versions of the C++ standard, which come with std::thread, drastically reduce the difference as well.
 

- Multithreading and signals don't mix very well. You need to carefully write a bunch of subtle code to get it right. The posix_spawn patch from a while for example was a lot easier to reason about because there's just one thread.

Wouldn't the parsing of the ninja file be distinct from the executing of the graph? Perhaps an assert that all of the threads have been joined prior to executing rules would satisfy this concern?
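
Something along these lines, say (a hypothetical sketch -- the pool API here is made up):

  #include <cassert>
  #include <thread>
  #include <vector>

  // Hypothetical stand-in for the parser's worker pool.
  class ThreadPool {
   public:
    void Shutdown() {
      for (auto& t : workers_) t.join();
      workers_.clear();
    }
    bool HasLiveWorkers() const { return !workers_.empty(); }

   private:
    std::vector<std::thread> workers_;
  };

  void RunBuild(ThreadPool* pool) {
    // Parsing is finished; tear down the pool so that command execution and
    // signal handling happen in a strictly single-threaded process.
    pool->Shutdown();
    assert(!pool->HasLiveWorkers() &&
           "worker threads must be joined before spawning build commands");
    // ... spawn and supervise build commands here ...
  }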

James Darpinian

Mar 12, 2019, 1:42:33 PM
to Nico Weber, Jan Niklas Hasse, ninja-build
I agree with concerns about adding threads. I'd like to put in a plug here for my prior pull request #1438, which is 1/10 the size but has many of the same benefits in terms of startup time improvement.


Ben Boeckel

Mar 12, 2019, 2:02:23 PM
to Ryan Prichard, Evan Martin, ninja-build
On Mon, Mar 11, 2019 at 15:54:51 -0700, 'Ryan Prichard' via ninja-build wrote:
> - Android's 1.8.2 Ninja loads the file in 2.4s.
> - My parallelized parser loads the file in 110ms.

This seems unimpressive (to me) on an absolute scale for such a large
change (the percent speedup is impressive though). How much time is
spent doing the `stat` for creating the internal build plan for Android?
I assume there are millions of those and that sounds embarrassingly
parallel to me. Much simpler code (probably) and an isolated part of
ninja's lifetime where the threads exist. Maybe that would be a better
place to do threading?

--Ben

Ryan Prichard

Mar 12, 2019, 6:06:04 PM
to ben.b...@kitware.com, Evan Martin, ninja-build
Just in case it wasn't clear -- the 2.4s only describes the time to load a 580MB ".ninja_deps" file, and the 2.4->0.11 optimization is a result of the changes in deps_log.cc. The .ninja_deps file is frequently smaller -- AOSP is somewhat smaller than Google's internal tree, and the logs grow as more of the manifest is built.

The overall improvement for a typical AOSP "droid" build was 13.8 -> 1.0 seconds. In that configuration, ninja's graph had 940K nodes[*], but I only needed to stat about 225K paths. There's also a "checkbuild" target that looks like it needs about half of all the nodes. Stat'ing these 225K paths seems to take about 650ms, which falls to about 100ms once it's parallelized. I'm benchmarking with an NVMe disk where everything's already in the kernel cache. The effect can be much larger -- e.g. on a slower drive with an internal checkout, with FS caches dropped, parallelization reduces stat times from 14s to 1.2s. (I'm assuming I/O parallelism helps?)

[*] This includes nodes from the manifest and nodes from .ninja_deps. It's grown from 930K -> 940K, probably because I built more of the graph.
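
For reference, the stat pass is the simplest piece of the whole change -- roughly this (a sketch with invented types, not the actual code):

  #include <sys/stat.h>

  #include <string>
  #include <thread>
  #include <vector>

  struct StatResult {
    std::string path;
    time_t mtime;  // 0 means the path doesn't exist.
  };

  // Stat every path, giving each thread one contiguous slice of the list.
  void ParallelStat(std::vector<StatResult>* results, size_t num_threads) {
    std::vector<std::thread> threads;
    for (size_t i = 0; i < num_threads; ++i) {
      size_t begin = results->size() * i / num_threads;
      size_t end = results->size() * (i + 1) / num_threads;
      threads.emplace_back([results, begin, end]() {
        for (size_t j = begin; j < end; ++j) {
          struct stat st;
          StatResult& r = (*results)[j];
          r.mtime = (stat(r.path.c_str(), &st) == 0) ? st.st_mtime : 0;
        }
      });
    }
    for (auto& t : threads) t.join();
  }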

-Ryan
