Caching build results


Dmitry

Oct 9, 2011, 7:45:03 PM
to ninja-build
Does anyone have suggestions for how to use a shared cache with ninja,
similar to scons's cache? The idea is that a large group compiles a
source tree which is largely the same, and compiled by all in a
uniform build cluster. So that the many object files, libraries, and
binaries which are identical can be merely hard links to a shared
cache. This speeds up builds and reduces the total size of the built
files tremendously. Scons supports this.

I know there is ccache for compiling C/C++ code with a cache. It
supports hard-linking, but I don't see how to make it interact with
ninja properly (since ninja cares about timestamp, and the timestamp
is shared when using hard links). Also ccache is only for object
files, and the many binaries/tools are equally important in our case.

I have some ideas, but wanted to see first if anyone already has
experience doing it (or failing to do it). Thanks!

Dmitry

Evan Martin

Oct 10, 2011, 5:01:30 PM
to ninja...@googlegroups.com
On Sun, Oct 9, 2011 at 4:45 PM, Dmitry <dsa...@gmail.com> wrote:
> Does anyone have suggestions for how to use a shared cache with ninja,
> similar to scons's cache? The idea is that a large group compiles a
> source tree which is largely the same, and compiled by all in a
> uniform build cluster. So that the many object files, libraries, and
> binaries which are identical can be merely hard links to a shared
> cache. This speeds up builds and reduces the total size of the built
> files tremendously. Scons supports this.

I'm not exactly sure how the Scons cache works. Is it by file contents?

If so, I think I could imagine implementing such a thing by creating a
wrapper script that takes a list of inputs, outputs, and a command
line. By hashing the inputs and command line together you could
compute a cache key and fetch the cached output; otherwise, you could
run the command and cache the output.
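A minimal sketch of such a wrapper in Python (the names and cache location here are hypothetical, and a real tool would need locking and atomic cache writes):

```python
import hashlib
import os
import shutil
import subprocess

CACHE_DIR = "/tmp/build-cache"  # hypothetical cache location

def cached_run(inputs, output, command):
    """Hash the input contents and the command line into a cache key;
    on a hit, fetch the output from the cache, otherwise run the
    command and store its output."""
    h = hashlib.md5()
    for path in sorted(inputs):
        with open(path, "rb") as f:
            h.update(f.read())
    h.update(command.encode())
    cached = os.path.join(CACHE_DIR, h.hexdigest())

    if os.path.exists(cached):
        shutil.copy(cached, output)  # hit: copying also refreshes the mtime
        return "hit"
    subprocess.run(command, shell=True, check=True)
    os.makedirs(CACHE_DIR, exist_ok=True)
    shutil.copy(output, cached)      # miss: populate the cache
    return "miss"
```

Because a fetch rewrites the output, its mtime moves forward, which is what lets ninja consider it up to date until an input changes again.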

In the case where you fetch the cached output, updating a local copy
from the cache will cause the mtimes to update, so ninja will think
the output is up to date until you touch an input again.

Below you mentioned hard links. As you described, I think you can't
get the timestamps right when you use hard links. If it can't rely on
timestamps, a build system would need to rely on some other
information to judge whether a file is up to date.

I think a shared cache involving hard links would also rely on a
network file system of some sort (not sure I fully understand your
system), which also sounds

In Scons's case, I believe it uses file contents, but that means it
needs to read the contents of files when deciding what to build.

> I have some ideas, but wanted to see first if anyone already has
> experience doing it (or failing to do it). Thanks!

I'd be curious to hear your ideas, too. You've surely thought about
it more than me.

Rachel Blum

Oct 10, 2011, 5:35:45 PM
to ninja...@googlegroups.com
> If so, I think I could imagine implementing such a thing by creating a
> wrapper script that takes a list of inputs, outputs, and a command
> line.
> In the case where you fetch the cached output, updating a local copy
> from the cache will cause the mtimes to update, so ninja will think
> the output is up to date until you touch an input again.

If you go to the hash computation step, it's almost worth it to have the entire DAG be driven by that.

> In Scons's case, I believe it uses file contents, but that means it
> needs to read the contents of files when deciding what to build.

You _could_ map e.g. a file's inode+mtime to a hash in some kind of external database. Not perfect, but it works surprisingly well. It works even better if you tie that content-hash database to e.g. a change-notification daemon that watches your FS. (If you're looking at an FS that allows metadata, you might also carry the hash around in the metadata. Not super-efficient, but at least hash & content stay in sync.)
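A rough sketch of that idea, with an in-memory dict standing in for the external database (a real version would persist it, e.g. in sqlite, and could be updated by the notification daemon):

```python
import hashlib
import os

# In-memory stand-in for the (path, inode, mtime) -> content-hash
# database described above.
_hash_db = {}

def content_hash(path):
    """Return the md5 of a file, re-reading it only when its inode or
    mtime has changed since the last lookup."""
    st = os.stat(path)
    key = (path, st.st_ino, st.st_mtime_ns)
    if key not in _hash_db:
        with open(path, "rb") as f:
            _hash_db[key] = hashlib.md5(f.read()).hexdigest()
    return _hash_db[key]
```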

As for hashes for source files - we are using git, which is itself built around the idea of content hashes, so that might be a good source to use :)


Rachel

Dmitry

Oct 10, 2011, 7:15:52 PM
to ninja-build
On Oct 10, 5:01 pm, Evan Martin wrote:
> I'm not exactly sure how the Scons cache works.  Is it by file contents?

Yes, but it's not the contents of the file that's being cached. E.g.
if you are compiling an object file, then you need to be able to
compute its hash value (under which the file is stored in the cache)
without actually building it. So the hash involves the hashes of all
the dependencies plus the command line. The hash of each dependency is
in turn computed the same way; and only the hash of the leaves is
based on the file contents.
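The recursive keying scheme described here could be sketched as follows (helper names are illustrative; a real tool would memoize and persist the per-file keys):

```python
import hashlib

def cache_key(target, deps, commands, leaf_hash):
    """Cache key of a target: md5 over its dependencies' cache keys
    plus its command line. Leaves (source files) are keyed by content.
    `deps` maps target -> inputs, `commands` maps target -> command
    line, `leaf_hash` returns the content hash of a source file."""
    if target not in deps:           # a leaf: hash the file contents
        return leaf_hash(target)
    h = hashlib.md5()
    for dep in deps[target]:
        h.update(cache_key(dep, deps, commands, leaf_hash).encode())
    h.update(commands[target].encode())
    return h.hexdigest()
```

Note that the key of a built file can be computed before building it, which is exactly what lets the cache be consulted first.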

> If so, I think I could imagine implementing such a thing by creating a
> wrapper script that takes a list of inputs, outputs, and a command
> line.

Basically, except that you only hash the leaves directly; for
intermediate results the hash has to be based on their dependencies,
as I described above. In particular, the hash value associated with
any built file has to be stored somewhere in the file system (or an
external DB, as Rachel suggested), since it's not just a function of
the file itself.

> I think a shared cache involving hard links would also rely on a
> network file system of some sort (not sure I fully understand your
> system), which also sounds

Well, strictly speaking it could still be useful on one machine if
many people build on it, but in our case, yes, it's NFS.

> In Scons's case, I believe it uses file contents, but that means it
> needs to read the contents of files when deciding what to build.

It does, and it's one of the (many) things that make it very slow. It
can be configured to only use timestamps, but I am not sure that the
caching with hard links would work as well. Scons *also* maintains a
database of state, so it could conceivably be storing extra timestamps
for its own use. There are a lot of things you could do if you didn't
care about speed :)

On Oct 10, 5:35 pm, Rachel Blum <gr...@chromium.org> wrote:
> You need the dependencies, too. (See http://google-engtools.blogspot.com/2011/09/build-in-cloud-distributing-build-steps.html for
> some ideas ;)

Thanks for the link! I am unclear on how dynamic dependency detection
(depfiles) is done in that system. Do you know? Are programmers
expected to add a line to a BUILD file every time they add a #include?

The depfiles present a particular annoyance with caching, because you
can't compute the hash from dependencies unless you know all the
dependencies, and you don't know them all until you build and generate
a depfile. (Scons solves it by directly scanning sources for
dependencies with a regex, on every build.) Since compiling is the
only use case I have where both depfiles and caching are important
to me, I plan to just go the ccache "preprocessor mode" route, and
run "g++ -E" to get the hash value of a file in order to check for
it in the cache.

I've been experimenting with caching ideas around ninja, and promise
to report back soon.

Dmitry

Rachel Blum

Oct 10, 2011, 7:41:25 PM
to ninja...@googlegroups.com
> Thanks for the link! I am unclear on how dynamic dependency detection
> (depfiles) is done in that system. Do you know? Are programmers
> expected to add a line to a BUILD file every time they add a #include?

I have no idea about the internals of that system - I merely used the blog post since it makes my point so much clearer than I could :)

If I were to write this (and I've written two build systems, so I'm not really in the mood to do it again :), I'd record dependencies as a separate artifact. That way, at least you don't regenerate deps unless any of the prerequisites have changed. (And you can't have new deps unless some of the inputs have changed. It's a bit circular, but it works :)

> I've been experimenting with caching ideas around ninja, and promise
> to report back soon.

Would love to hear what you come up with!

Rachel

Dmitry Sagalovskiy

Oct 11, 2011, 11:50:26 PM
to ninja...@googlegroups.com
Here's what I have for caching. I had two goals: use cache and use hardlinks. Hardlinks present an issue with timestamps, so they make the solution much more convoluted, but they are important in my case.

Basically, every target that's cached is complemented with an additional file containing its cache key (a 32-character md5 checksum of everything that goes into building it). For a target "foo", the cache key file is named "foo.ckey". This file is also used for the timestamp. A caching rule is hacked up to output "foo" and "foo.ckey" together, and anything that ought to depend on "foo" is made to depend on "foo.ckey" instead. This part definitely feels like a hack, but it's (almost) doable without help from ninja.

Here's an example of a non-trivial plain rule and build line, and a version transformed for caching:

rule link
  command = $CXX -o $out $LINKFLAGS $in $libs

build tools/test: link tools/test.o | $LIB_DIR/libutil.a
  libs = -lutil
----------------------
rule link
  command = cache_build $CACHE_DIR $out -- $in -- $
      $CXX -o $_rout $LINKFLAGS $_rin $libs

build tools/test.ckey tools/test: link $
    tools/test.o.ckey $LIB_DIR/libutil.a.ckey
  libs = -lutil
  _rout = tools/test
  _rin = tools/test.o

Note how the transformation replaces dependencies with .ckey files; and special variables are added to hold the "real" inputs that the original command line needs. I hacked up my build file generator to do this.

cache_build is a tool with this syntax: cache_build <cache_dir> <output> -- <inputs> -- <command>. The inputs are cache key files. It computes an md5 checksum of the combination of all the input cache keys, the command line, and the name of the output file; this checksum becomes the cache key for the output file. In the case above, the cache key is recorded into tools/test.ckey, and used to look up the file in $CACHE_DIR. If it exists, cache_build makes tools/test a hard link to the cached file (i.e. fetches from cache). If it doesn't exist, cache_build executes the given command line and makes the cached file a hard link to the just-built target (i.e. stores into cache).
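A condensed sketch of what a cache_build tool along these lines might do, in Python (error handling, depfile support, and locking omitted):

```python
import hashlib
import os
import subprocess

def cache_build(cache_dir, output, ckey_inputs, command):
    """Combine the input cache keys, the command line, and the output
    name into a cache key; hard-link from the cache on a hit, otherwise
    run the command and hard-link the result into the cache."""
    h = hashlib.md5()
    for ck in ckey_inputs:           # each input is a .ckey file
        with open(ck) as f:
            h.update(f.read().encode())
    h.update(command.encode())
    h.update(output.encode())
    key = h.hexdigest()

    with open(output + ".ckey", "w") as f:  # the proxy ninja depends on
        f.write(key)

    cached = os.path.join(cache_dir, key)
    if os.path.exists(cached):
        if os.path.exists(output):
            os.remove(output)
        os.link(cached, output)      # fetch: hard link out of the cache
        return "hit"
    subprocess.run(command, shell=True, check=True)
    os.makedirs(cache_dir, exist_ok=True)
    os.link(output, cached)          # store: hard link into the cache
    return "miss"
```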

If anything depends on tools/test, it has to be changed to depend instead on tools/test.ckey (since the timestamp of tools/test isn't a reliable indication of anything). tools/test.ckey is a proxy for tools/test.

With depfiles, the approach I took is to specify a command to run, whose output is md5-checksummed to get the cache key. This is given in the <inputs> section of the arguments to cache_build. So the compile rule looks like this:

rule compile
  command = cache_build $CACHE_DIR $out $_rout.d.ckey -- $
      $in "exec $CXX -E $CCFLAGS $_rin" -- $
      $CXX -o $_rout -c -MD -MF $_rout.d -MT $_rout $CCFLAGS $_rin
  depfile = $_rout.d

build tools/test.o.ckey tools/test.o: compile tools/test.cc.ckey
  _rout = tools/test.o
  _rin = tools/test.cc

Leaves of the tree get a special rule:

rule cache_key
  command = md5 -q $in > $out
build tools/test.cc.ckey: cache_key tools/test.cc

There are a few subtleties, which need help from ninja. For example, to avoid changing ALL rules (even those that don't use caching), you'd need the ability to depend on the actual file ("tools/test" in the example) rather than the cache key file. But the real file, being a hardlink, will never look up-to-date to ninja. I've hacked up the following syntax, akin to order-only dependencies:

build tools/test.ckey || tools/test: link ...

Ninja will treat both tools/test.ckey and tools/test as outputs, but it will ignore the timestamp of tools/test in deciding whether or not it is dirty.

The main problem I have is that the way I set it up doubles the number of files, the size of build.ninja, and more than quadruples the effort required to understand what's going on. It would also be nice to see informative descriptions, for example, "fetching tools/test from cache" or "linking tools/test", depending on whether it was found in cache. For all this, I'd like to consider adding explicit caching support to ninja.

I am looking for feedback on whether anyone else would care; what other use cases I haven't thought about; and of course ideas on how to do things better. All that said, what I have *is* essentially working!

Dmitry

Dmitry

Oct 12, 2011, 10:03:24 AM
to ninja-build
Today ninja looks at mtimes in Edge::RecomputeDirty to determine what
needs to be rebuilt. What do you think of changing it across the board
to use max(mtime, log_entry->end_time) instead, when the log_entry
from the last build is available?
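In rough Python terms, the proposal amounts to the following (a simplified sketch; the real check lives in ninja's C++ Edge::RecomputeDirty and considers more cases):

```python
def output_is_dirty(input_mtime, output_mtime, log_end_time=None):
    """An input dirties the output only if it is newer than both the
    output's mtime and the recorded end time of the build step that
    last produced it, so a hard-linked output with a shared old mtime
    can still count as up to date."""
    effective = output_mtime
    if log_end_time is not None:
        effective = max(output_mtime, log_end_time)
    return input_mtime > effective
```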