I have some questions about fabricate:
1. Can you give an impression of its speed? Ideally there would be
benchmarks against other systems, like the ones in these blog posts:
http://gamesfromwithin.com/?p=42
http://gamesfromwithin.com/?p=44
(The post includes the code he used to generate his test project.) If
you haven't done this yet, maybe I could take a stab at it.
2. It seems to me that it might be tricky to set up the build script so
that all of the dependencies are in the right order, and it doesn't look
like fabricate helps with this task. But it could...
Admittedly, it doesn't know about the dependencies until the first
(full) build is finished. But even at the end of the first build, it
would be possible for it to tell the user whether a file was read by one
process then subsequently changed by another. When that happens,
fabricate should (IMHO) emit at least a warning message if not a fatal
error.
Dependency loops could also be detected and reported as an even more
serious problem.
There could also be a special mode in which fabricate does a full build,
keeping track of dependency problems (possibly re-running the build
until everything is up-to-date), then at the end suggests how the build
could be reordered to enable a one-pass build.
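That special mode could be sketched as a fixed-point loop around the build. In the following Python sketch, run_build() is a hypothetical hook that runs the build script once and returns the number of commands actually executed (not anything fabricate provides today):

```python
def build_to_fixed_point(run_build, max_passes=10):
    """Re-run the build until a pass executes no commands.

    A correctly ordered build script needs exactly one pass that does
    real work; more than one means the command order does not match
    the dependency order, and no convergence suggests a loop.
    """
    work_passes = 0
    for _ in range(max_passes):
        if run_build() == 0:
            break           # nothing left to do: we reached a fixed point
        work_passes += 1
    else:
        raise RuntimeError('no fixed point reached; possible dependency loop')
    if work_passes > 1:
        print('warning: %d passes needed; build commands are mis-ordered'
              % work_passes)
    return work_passes
```

The same loop doubles as the "dependency problem" detector: any build script that needs more than one working pass is, by definition, mis-ordered.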
3. Finally, I am intrigued by the possibility of speeding up builds
using many-to-many compilation steps. For example, it is orders of
magnitude faster to compile *all* Java files in one invocation of javac
rather than invoking javac for each source file. Probably C/C++
compilations could also be sped up in the same way, though not as
dramatically.
Have you thought about whether and how such types of compilation could
be done from within fabricate? The main tricks are that (a) one would
prefer to recompile only the files that have changed, and (b) it would
be nice to avoid a huge recompilation when source files are added to and
deleted from the tree; at the same time it is important to notice when
new files appear and make sure they get compiled.
I suppose that to support the many-to-many case, the dependency
generation would have to be done using special code. But I haven't
really figured it out yet.
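To make the idea concrete, here is a rough sketch of such a many-to-many step. The state-file name, directory layout, and javac invocation are all assumptions for illustration, not anything fabricate provides; the key points are that only stale files are passed to a single javac call, and that the whole file list is part of the recorded state, so added files are noticed:

```python
import glob
import json
import os
import subprocess

STATE = '.javac_state'  # hypothetical per-step state file

def compile_java(src_dir='src', out_dir='classes', run=None):
    """Recompile only the changed .java files, in ONE javac call."""
    if run is None:
        run = lambda files: subprocess.check_call(
            ['javac', '-d', out_dir] + files)
    sources = sorted(glob.glob(os.path.join(src_dir, '*.java')))
    state = {}
    if os.path.exists(STATE):
        with open(STATE) as f:
            state = json.load(f)
    # A file is stale if it is new or its mtime changed; since the
    # whole file *list* is stored, added files are always noticed.
    stale = [s for s in sources if state.get(s) != os.path.getmtime(s)]
    if stale:
        run(stale)  # one javac invocation for all changed files
    # Deleted sources would also need their .class files removed;
    # omitted here for brevity.
    with open(STATE, 'w') as f:
        json.dump({s: os.path.getmtime(s) for s in sources}, f)
```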
Thanks for the interesting tool!
Michael
I'll see if I can get around to it.
> One known speed issue: fabricate's model makes it inherently difficult
> (impossible?) to run build commands in parallel. "make" is able to
> parallelize build commands, because of the explicitly specified
> dependency graph. However, fabricate doesn't really have a true
> dependency graph, so it's important that commands are run sequentially.
I would rather say that the build script has to respect the dependencies
and the best that fabricate can do is check after the build that there
were no problems (as discussed below). But if the build script knows
that certain commands can run in parallel, it would conceptually be
possible for *the build script itself* to issue the parallelizable build
commands in parallel (say, through an Executor service provided by
fabricate). And if fabricate can monitor simultaneously-running
commands separately (e.g., using the strace method), then it can verify
that there were no out-of-order build steps or even race conditions.
In fact, under fabricate's basic assumption that there is a 1:1 mapping
between inputs and outputs via commands, it could even attempt to
parallelize builds itself. For example, it could assume that the
dependencies from the last build are still valid. (Usually they will
be.) It could therefore determine which commands can be carried out in
what order, and which commands can be executed in parallel. So let it
optimistically try building in its self-determined order, monitoring the
commands for dependency problems and race conditions. If there are
none, then the build is necessarily OK. If there are any problems, then
delete the outputs of the questionable commands and fall back to doing a
serial build. (Occasionally some work would need to be done twice, but
usually the optimistic build would probably be provably correct.)
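The optimistic scheduler could look something like the following sketch, which runs commands in topological "waves" derived from the previous build's recorded dependencies. The deps format and the run_cmd hook are hypothetical, chosen only to illustrate the scheduling:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_build(deps, run_cmd, jobs=4):
    """Run commands in parallel 'waves' that respect the recorded deps.

    deps maps each command to the set of commands whose outputs it
    read during the previous build; run_cmd executes one command.
    """
    remaining = {cmd: set(pre) for cmd, pre in deps.items()}
    done = set()
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        while remaining:
            # Everything whose prerequisites have all completed.
            ready = [c for c, pre in remaining.items() if pre <= done]
            if not ready:
                raise RuntimeError('cycle in recorded dependency graph')
            list(pool.map(run_cmd, ready))  # one wave, run in parallel
            done.update(ready)
            for c in ready:
                del remaining[c]
    return done
```

On top of this, fabricate would still need to monitor each command and fall back to a serial build if the recorded dependencies turn out to be stale.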
But unfortunately, I can't see how fabricate could determine, using the
atime method, which files were touched by which command if multiple
commands were run in parallel. Thus parallel building is probably only
conceivable when using the strace method. This is a serious limitation
of fabricate, given that systems with multiple processors are already
the norm.
> 2. Regarding dependency warnings, a build where a file is read by one
> process, then modified by another, is actually a conceivable use-case.
> But unlikely, in any case. Maybe a scenario where you wanted (for some reason)
> to compile the previous build date into your binary. Like I said,
> unlikely.
I think this would be a perverse use case and the default should be to
treat it as an error. At best there should be a user option to say "I
know that the dependency order is screwed up but I don't care."
> Dependency loops -- I'm not sure if this would be generally
> detectable. Presumably in such a case, the command which depended on
> a future command would simply bomb out. Or, if it was a
> badly-thought-out command, it might continue and produce garbled
> output. Do you have any ideas on how we'd detect a dependency loop in
> advance?
The command would not necessarily bomb out, because an old version of
the file that it depends on could still be around from a previous build.
On the other hand, some commands might use the existence of files to
define the input to the command (think of "ar cr file.a *.o"). This
command would also not bomb out if one of the *.o files had erroneously
not yet been created, but the output would still be wrong. Consider the
build script
cc -c a.c
cc -c b.c
ar cr file.a *.o
cc -c c.c
The first time this command is run, fabricate would deduce that the "ar"
command (specifically, "ar cr file.a a.o b.o") depends on a.o and b.o.
Fabricate would have no way of figuring out that c.o is created later
and *would have* affected the "ar" command. The results of the build
would be broken.
If the build is run a second time immediately after the first, then the
first two commands would be skipped, then the "ar" command would read
"ar cr file.a a.o b.o c.o" because of the "c.o" file generated by the
first build. Fabricate would not recognize that command at all and
would therefore run the "ar" command. It would skip compiling "c.c" a
second time.
The *third* time the build is run, fabricate can figure out that there
is nothing to do.
The dependency graph is fundamentally only known after a build has
completed, and the next build could have a different dependency graph.
So no, there is no way *in general* to detect a dependency loop in
advance. But the dependency loop could be detected *after* the build.
Conceptually, fabricate would remember what commands were executed to
carry out the build. Then it would imagine running the build again. If
the dependencies were in fact OK, then fabricate should be able to
determine that the second build wouldn't have to do anything.
Otherwise, there was necessarily a problem that fabricate should report.
In fact, the check could be carried out more simply by analyzing the
dependency graph that has been determined. The dependency graph is
trivially constructible from the .deps file; the nodes are filenames and
the (directed) edges are implied by the input-output filename pairs
corresponding to commands that were run during the build.
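That post-build check could be as simple as the following sketch, where each entry of commands is an (inputs, outputs) pair in the order the commands actually ran. The data format is illustrative, not fabricate's .deps format:

```python
def check_build_order(commands):
    """Find inputs that were written by a *later* command.

    Each entry of `commands` is (inputs, outputs), in the order the
    commands actually ran.  A non-empty result means a second build
    would not be a no-op, i.e. the build script is mis-ordered.
    """
    producers = {}  # filename -> index of the command that wrote it
    for i, (_, outputs) in enumerate(commands):
        for name in outputs:
            producers[name] = i
    problems = []
    for i, (inputs, _) in enumerate(commands):
        for name in inputs:
            if producers.get(name, -1) > i:
                problems.append((name, i, producers[name]))
    return problems
```

For the "ar" example above, this would flag the problem only once c.o actually shows up among ar's recorded inputs (i.e. on the second build), which matches the discussion: the recorded graph of the first build is internally consistent but incomplete.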
That suggests another question. On Windows, "shell globbing" characters
are expanded not by the shell but by individual programs. So how could
fabricate figure out that "ar cr file.a *.o" in the second build is
different than the command executed during the first build?
The basic problem here is that the contents of a directory are another
build input that fabricate does not consider. It *could* treat this as
another input (for example when using the strace method), but doing so
would lead to too many false positives and unnecessary rebuilds. (E.g., in
our "*.o" example, the "ar" command would be re-executed if *any* file
were added to or removed from the current directory, whereas only changes
to the list of *.o files should cause the command to be re-run.)
Probably the way to get around this problem is to require build scripts
to do file globbing via a function provided by fabricate:
fabricate.glob('*.o')
The function returns the list of files matching the pattern '*.o', and
among the dependencies fabricate records that the command depends on the
list of files matching '*.o', together with a hash of the returned list.
When fabricate.glob('*.o') is invoked during the next build, the list is
generated again and compared to the old value. If the lists differ,
then the whole command has to be executed again.
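A minimal sketch of such a glob wrapper (the deps dict stands in for fabricate's dependency store; the names are made up for illustration):

```python
import glob as _glob
import hashlib

deps = {}  # pattern -> hash of the last matching file list (stand-in store)

def tracked_glob(pattern):
    """Glob, and report whether the *list* of matches changed.

    Only changes to the match list would trigger a rebuild, not
    unrelated files appearing in the same directory.
    """
    files = sorted(_glob.glob(pattern))
    digest = hashlib.md5('\0'.join(files).encode()).hexdigest()
    changed = deps.get(pattern) != digest
    deps[pattern] = digest
    return files, changed
```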
My tentative conclusions from all this discussion are:
1. fabricate is not 100% safe, because it misses some build dependencies
(such as the contents of directory listings). But it would probably be
possible to specify some ground rules which, if obeyed by build scripts,
would guarantee the safety of the build (for example, one might need to
require that filename globbing is done using a function provided by
fabricate).
2. fabricate is probably never going to be able to do safe parallel
builds using atimes, though it might be possible using strace.
3. It's still a very fascinating concept and could potentially be the
basis of a vastly simpler build system.
4. All of this is pointless if the repeated directory crawls and MD5
hash computations are too expensive -> therefore, benchmarking is needed!
> But if the build script knows that certain commands can run in parallel, it would conceptually be possible for *the build script itself* to issue the parallelizable build commands in parallel (say, through an Executor service provided by fabricate).
> 2. fabricate is probably never going to be able to do safe parallel
> builds using atimes, though it might be possible using strace.
This is one feature I would like to have in fabricate too. How about
extending run() to accept multiple commands (accepting a list of
commands)? Then one could configure the number of concurrent commands to
run with, for instance, a command-line option (like -j ;) and/or
hardcode it when calling run([cmd1, cmd2, cmd3], threads=2). I'm willing
to work on this if it's welcome.
It would also be nice to give the user writing the build script an option that tells fabricate to use some number of threads by default, based on the number of cores in the computer.
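A sketch of what that extension might look like. This is a hypothetical wrapper, not fabricate's actual run() signature; threads=None falls back to the core count, like make -j with no argument:

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_parallel(commands, threads=None):
    """Run independent commands, at most `threads` at a time.

    The commands passed in one call are assumed to have no
    dependencies on each other; that is the caller's promise.
    """
    threads = threads or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=threads) as pool:
        for cmd, rc in zip(commands, pool.map(subprocess.call, commands)):
            if rc != 0:
                raise RuntimeError('command failed: %r' % (cmd,))
```

Usage would be e.g. run_parallel([cmd1, cmd2, cmd3], threads=2), where each cmd is an argument list.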
> You'll have to think about what to do with each build target's output. The easiest would be to just let it all come out the parent's stdout like we currently do, but I wonder whether that will mix up the output into something confusing. I wonder what make does.
OK, I did some benchmarking of fabricate using the method described in
the two blog posts quoted above. Briefly, the test "project" consists
of 50 libraries, each library consists of 100 C++ header files and 100
C++ source files, each header+source defines one trivial class, and the
source files each include 15 other headers from the same library plus 5
headers from other, earlier-numbered libraries. The interlibrary
dependencies only go one way, making it easy to define a self-consistent
build order. The code that the blog author used to create the test data
is linked to from the second blog post.
For comparison, I repeated some of the blog author's original
measurements on my computer (a Linux notebook), plus a few new ones. I
only bothered with the "full" build (build everything from scratch) and
the global "do-nothing" build (run the build again when there is nothing
to do). The builds that I added are as follows:
A "make_nonrecursive" build that uses "gcc -MM" to generate include-file
dependencies for each file, and includes all of the dependency files and
the library-wide Makefiles into a single project-wide Makefile that
knows about all of the dependencies. See [1] for more information about
this style of Makefile.
I added two fabricate.py benchmarks, one using atimes_runner and one
using strace_runner. In both cases the build file does the obvious
loop-over-libraries, loop-over-sourcefiles and invokes g++ using
fabricate.run() once for each file. The tests use fabricate trunk r21.
Here are my results:
make (recursively invoked):
full: 194 s
do-nothing: 3.7 s
make_nonrecursive:
full: 319 s
do-nothing: 1.7 s
scons:
full: 312 s
do-nothing: 52 s
fabricate.py, using atimes_runner:
full: 4245 s
do-nothing: 17.7 s
fabricate.py, using strace_runner:
full: 251 s
do-nothing: 18.9 s
fabricate.py, using strace_runner and mtime_hasher:
full: 328 s
do-nothing: 352 s (i.e., rebuilt everything)
Summary and analysis:
make_nonrecursive is slower than recursive make for the full build, but
this is probably mostly because make_nonrecursive uses "gcc -MM" to
generate the dependencies rather than "makedepend" as used by the
recursive make. This is an implementation difference that could be
changed. For the do-nothing build, make_nonrecursive is faster.
fabricate with the atimes_runner is so excruciatingly slow at doing a
full build that it cannot seriously be considered for a project of this
size. In retrospect, it is clear that it has to be slow for large
projects, because before and after *every* command it has to monkey
around with the atimes of *every single file in the project*. Thus the
time for building a project with N files goes something like O(N^2).
A full build using fabricate and strace_runner is much better, almost as
fast as make.
For do-nothing builds, fabricate is considerably slower than make,
though not as slow as scons. A do-nothing build is the extreme case of
an incremental build, which is the bread-and-butter of daily
development. So I think it would make sense investing time to speed up
do-nothing builds (more discussion below).
For some reason, fabricate with strace_runner and mtime_hasher didn't
work for me. The do-nothing build in fact rebuilt everything. It is
even slower than the corresponding full build.
I noticed that the .deps file when using the md5_hasher is almost 17 MB
and 129k lines long. But a lot of the information is redundant, because
include files are typically included multiple times. In fact (modulo
whitespace), the file has only 26k unique lines, and many of those lines
also have common information. It should definitely be possible to
reduce the size of this file significantly by squeezing out some of the
redundancy.
It also appears that files are being hashed redundantly: for example, a
file is re-hashed even though fabricate has already computed its hash
during the current build and knows (via strace) that it hasn't been
written to since then. In
fact, during a full build md5_hasher() is called 149k times, of which
119k calls are redundant. Even worse, during a do-nothing build (when
nothing can possibly change), md5_hasher() is called 119k times, of
which 104k are redundant. One file is hashed 53 times! Reducing some
of this unnecessary hashing could potentially speed things up too,
without a loss of accuracy.
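One obvious fix would be to memoize the hasher on (size, mtime), so a file is hashed at most once per build unless it actually changed. A sketch, not fabricate's actual hasher interface:

```python
import hashlib
import os

_hash_cache = {}  # path -> (size, mtime, digest)

def md5_hasher(path):
    """Hash a file, reusing the cached digest while size+mtime match."""
    st = os.stat(path)
    cached = _hash_cache.get(path)
    if cached and cached[0] == st.st_size and cached[1] == st.st_mtime:
        return cached[2]
    with open(path, 'rb') as f:
        digest = hashlib.md5(f.read()).hexdigest()
    _hash_cache[path] = (st.st_size, st.st_mtime, digest)
    return digest
```

For files that strace already shows were not written during the build, even the stat() could be skipped, but the cache above is the simplest version.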
Michael
> OK, I did some benchmarking of fabricate using the method described in
> the two blog posts quoted above. Briefly, the test "project" consists [...]
> The code that the blog author used to create the test data is linked to from the second blog post.
> fabricate.py, using atimes_runner:
> full: 4245 s
> do-nothing: 17.7 s
> I noticed that the .deps file when using the md5_hasher is almost 17 Mb
> and 129k lines long. But a lot of the information is redundant, because
> include files are typically included multiple times.
> It also appears that files are being hashed redundantly, for example
> even though fabricate has computed the hash during the current build and
> knows (via strace) that the file hasn't been written to since then. In
> fact, during a full build md5_hasher() is called 149k times, of which
> 119k calls are redundant. Even worse, during a do-nothing build (when
> nothing can possibly change), md5_hasher() is called 119k times, of
> which 104k are redundant. One file is hashed 53 times! Reducing some
> of this unnecessary hashing could potentially speed things up too,
> without a loss of accuracy.
> Ouch! :-) Good points about it being O(n^2). I might do some
> (smaller!) tests myself to see how much that reduces if fabricate
> doesn't call utime() on every file before and after -- but only looks
> at the atimes. Still be slow, of course, but maybe not as slow.
I've already done some of this testing on Linux. On Linux, the speed
of utime() is equivalent to getting the atime, so since we currently do
an equal amount of get_atime() and utime() calls, you would halve the
delay time from that. But the os.walk takes about twice the time of
get_atime and utime put together. So the directory walk is the biggest
time hog. My results are only for Linux, not Windows.
Obviously, though, atimes are never going to be a good solution for big
projects. If somebody wants fabricate for big projects, it may be
better for them to work on a solution using the easy_hooks library, like you say.
Cheers,
Berwyn
Hi Michael,
> OK, I did some benchmarking of fabricate using the method described in
> the two blog posts quoted above. Briefly, the test "project" consists [...]
Wow, thanks Michael -- those are some really worthwhile results. It's good to know both that strace is quite quick (faster than SCons and up there with make), and also that atimes is very slow on big projects.
Hmmm, I wonder why strace_runner + mtime_hasher REbuilt both times. I've noticed that if you switch hashers after a build it will always do a full rebuild because the hash strings will be different -- but this is okay, because people won't usually switch between the two hashers a lot. But then if you build again it shouldn't rebuild again. Strange -- I might try it on a Linux machine here if I get the chance.
> The code that the blog author used to create the test data is linked to from the second blog post.
Thanks. Do you mind sending through the fabricate build script you used for this test data, just so I'm comparing against the same results? (Read: I'm being lazy. :-)
> I noticed that the .deps file when using the md5_hasher is almost 17 Mb
> and 129k lines long. But a lot of the information is redundant, because
> include files are typically included multiple times.
Heh, that's interesting. Again, we hadn't had big enough projects to notice that. However, probably the easiest way to solve that (given that memory's not really an issue here) is just to use gzip.open() -- i.e., store the file zipped. It's actually slightly faster on big .deps files this way, because zipping is faster than writing a big file to disk. Won't make a huge difference, because it's only done once at the start/end of the build, but it'll mean tiny .deps files. It'll mean the .deps file won't be human-readable anymore, but I'll include an override for when one is debugging fabricate issues.
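For what it's worth, the change could be as small as routing the load/save through gzip.open(); the JSON layout below is illustrative, not fabricate's actual .deps format:

```python
import gzip
import json

def save_deps(deps, path='.deps.gz'):
    # Write the dependency data zipped; trades human readability
    # for a much smaller file (and often faster I/O).
    with gzip.open(path, 'wt') as f:
        json.dump(deps, f)

def load_deps(path='.deps.gz'):
    with gzip.open(path, 'rt') as f:
        return json.load(f)
```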
> It also appears that files are being hashed redundantly, for example
> even though fabricate has computed the hash during the current build and
> knows (via strace) that the file hasn't been written to since then. In
> fact, during a full build md5_hasher() is called 149k times, of which
> 119k calls are redundant. Even worse, during a do-nothing build (when
> nothing can possibly change), md5_hasher() is called 119k times, of
> which 104k are redundant. One file is hashed 53 times! Reducing some
> of this unnecessary hashing could potentially speed things up too,
> without a loss of accuracy.
Ah yes, this is a very good point too. As mentioned before, we're not too worried about speed (except perhaps the slowness of atimes_runner), but still, if there's a fairly easy way to do this without breaking things, it's worth looking into.
I think the size of the file and the redundant work being done are
connected. The program needs an enduring concept of a file: it could keep
track of all changes to a file, cache the file's old hash, and store the
redundant entries as references (direct or indirect) to these file
objects. Normalizing the data this way would fix both problems almost
automatically.
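A sketch of what that normalization might look like (an illustrative data model, not fabricate's implementation): each file gets one shared record that is hashed lazily at most once, and commands store only references to those records:

```python
import hashlib

class FileRecord:
    """One enduring record per file, shared by every command."""
    def __init__(self, path):
        self.path = path
        self._digest = None

    @property
    def digest(self):
        # Hashed lazily, and at most once per record.
        if self._digest is None:
            with open(self.path, 'rb') as f:
                self._digest = hashlib.md5(f.read()).hexdigest()
        return self._digest

class DepStore:
    def __init__(self):
        self.files = {}     # path -> shared FileRecord
        self.commands = {}  # command line -> list of FileRecord refs

    def record(self, command, paths):
        # Commands reference the shared records instead of duplicating
        # path+hash pairs, so the stored data is automatically deduplicated.
        self.commands[command] = [
            self.files.setdefault(p, FileRecord(p)) for p in paths]
```

A header included by 50 source files then appears once in the store, and its hash is computed once, instead of being repeated in every command's dependency list.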