fabricate.py

Michael Haggerty

Jul 28, 2009, 7:05:22 PM
to fabrica...@googlegroups.com
I just looked at memoize.py (again) two days ago and would have liked to
try it except for the fact that we need a build system for Linux *and*
Windows. So I was very interested to see your announcement of fabricate
and excited to try it out.

I have some questions about fabricate:

1. Can you give an impression of its speed? Ideal would be benchmarks
against other systems like done in these blog posts:

http://gamesfromwithin.com/?p=42
http://gamesfromwithin.com/?p=44

(The post includes the code he used to generate his test project.) If
you haven't done this yet, maybe I could take a stab at it.

2. It seems to me that it might be tricky to set up the build script so
that all of the dependencies are in the right order, and it doesn't look
like fabricate helps with this task. But it could...

Admittedly, it doesn't know about the dependencies until the first
(full) build is finished. But even at the end of the first build, it
would be possible for it to tell the user whether a file was read by one
process then subsequently changed by another. When that happens,
fabricate should (IMHO) emit at least a warning message if not a fatal
error.

Dependency loops could also be detected and reported as an even more
serious problem.

There could also be a special mode in which fabricate does a full build,
keeping track of dependency problems (possibly re-running the build
until everything is up-to-date), then at the end suggest how the build
could be reordered to enable a one-pass build.

3. Finally, I am intrigued by the possibility of speeding up builds
using many-to-many compilation steps. For example, it is orders of
magnitude faster to compile *all* Java files in one invocation of javac
rather than invoking javac for each source file. Probably C/C++
compilations could also be sped up in the same way, though not as
dramatically.

Have you thought about whether and how such types of compilation could
be done from within fabricate? The main tricks are that (a) one would
prefer to recompile only the files that have changed, and (b) it would
be nice to avoid a huge recompilation when source files are added to and
deleted from the tree; at the same time it is important to notice when
new files appear and make sure they get compiled.

I suppose that to support the many-to-many case, the dependency
generation would have to be done using special code. But I haven't
really figured it out yet.

Thanks for the interesting tool!
Michael

Bryan Hoyt

Jul 28, 2009, 7:41:56 PM
to fabrica...@googlegroups.com
Hi Michael,

1. Like Ben said earlier, no, we don't have benchmarks of fabricate's speed yet. Such benchmarks would be useful, and if you do any benchmarking, we'd be keen for you to post your results here.

One known speed issue: fabricate's model makes it inherently difficult (impossible?) to run build commands in parallel. "make" is able to parallelize build commands, because of the explicitly specified dependency graph. However, fabricate doesn't really have a true dependency graph, so it's important that commands are run sequentially.

2. Regarding dependency warnings, a build where a file is read by one process, then modified by another, is actually a conceivable use case. But an unlikely one, I'd say. Maybe a scenario where you wanted (for some reason) to compile the previous build date into your binary. Like I said, unlikely.

A warning would suffice, in that case, I guess. Unless we think of a reason not to, we'll probably add that in an upcoming release.

Dependency loops -- I'm not sure if this would be generally detectable. Presumably in such a case, the command which depended on a future command would simply bomb out. Or, if it was a badly-thought-out command, it might continue and produce garbled output. Do you have any ideas on how we'd detect a dependency loop in advance?

3. Yeah, you're right. It would require special-case code.

If you want to write any code yourself, this mailing list is currently our low-tech DVCS. Just post diffs to this list, and if we like your mods, we'll stick them into the next release.

 - bry
--
Bryan Hoyt, Web Presentation & Software  --  Brush Technology
Ph: +64 3 942 7833     Mobile: +64 21 238 7955    Web: brush.co.nz

Michael Haggerty

Jul 29, 2009, 1:21:54 PM
to fabrica...@googlegroups.com
Bryan Hoyt wrote:
> 1. Like Ben said earlier, no, we don't have benchmarks of fabricate's
> speed yet. Such benchmarks would be useful, and if you do any
> benchmarking, we'd be keen for you to post your results here.

I'll see if I can get around to it.

> One known speed issue: fabricate's model makes it inherently difficult
> (impossible?) to run build commands in parallel. "make" is able to
> parallelize build commands, because of the explicitly specified
> dependency graph. However, fabricate doesn't really have a true
> dependency graph, so it's important that commands are run sequentially.

I would rather say that the build script has to respect the dependencies
and the best that fabricate can do is check after the build that there
were no problems (as discussed below). But if the build script knows
that certain commands can run in parallel, it would conceptually be
possible for *the build script itself* to issue the parallelizable build
commands in parallel (say, through an Executor service provided by
fabricate). And if fabricate can monitor simultaneously-running
commands separately (e.g., using the strace system), then it can verify
that there were no out-of-order build steps or even race conditions.

In fact, under fabricate's basic assumption that there is a 1:1 mapping
between inputs and outputs via commands, it could even attempt to
parallelize builds itself. For example, it could assume that the
dependencies from the last build are still valid. (Usually they will
be.) It could therefore determine which commands can be carried out in
what order, and which commands can be executed in parallel. So let it
optimistically try building in its self-determined order, monitoring the
commands for dependency problems and race conditions. If there are
none, then the build is necessarily OK. If there are any problems, then
delete the outputs of the questionable commands and fall back to doing a
serial build. (Occasionally some work would need to be done twice, but
usually the optimistic build would probably be provably correct.)
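To make the optimistic idea concrete, here is a minimal sketch (illustrative only, not fabricate's actual code) of deriving a parallel schedule from the previous build's recorded dependencies. It assumes `deps` maps each command to its `(inputs, outputs)` file lists, a simplified stand-in for the .deps contents, and groups commands into "waves" that could be run concurrently:

```python
# Sketch: derive a parallel schedule from the previous build's deps.
# `deps` maps command -> (inputs, outputs); a hypothetical, simplified
# stand-in for what fabricate records in its .deps file.

def schedule(deps):
    # Which command produces each file?
    produced_by = {}
    for cmd, (_, outputs) in deps.items():
        for out in outputs:
            produced_by[out] = cmd
    # Which commands must precede each command?
    needs = {cmd: {produced_by[f] for f in inputs if f in produced_by}
             for cmd, (inputs, _) in deps.items()}
    # Peel off "waves": commands whose prerequisites are all done
    # can run in parallel within one wave.
    waves, done = [], set()
    while len(done) < len(deps):
        wave = [c for c in deps if c not in done and needs[c] <= done]
        if not wave:
            raise RuntimeError('dependency cycle in recorded deps')
        waves.append(sorted(wave))
        done.update(wave)
    return waves
```

Running the optimistic build would then mean executing each wave in parallel while monitoring (via strace) that no command touches a file outside its recorded dependencies.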

But unfortunately, I can't see how fabricate could determine, using the
atime method, which files were touched by which command if multiple
commands were run in parallel. Thus parallel building is probably only
conceivable when using the strace method. This is a serious limitation
of fabricate, given that systems with multiple processors are already
the norm.

> 2. Regarding dependency warnings, a build where a file is read by one
> process, then modified by another, is actually a conceivable use-case.
> But unlikely, I case. Maybe a scenario where you wanted (for some reason)
> to compile the previous build date into your binary. Like I said,
> unlikely.

I think this would be a perverse use case and the default should be to
treat it as an error. At best there should be a user option to say "I
know that the dependency order is screwed up but I don't care."

> Dependency loops -- I'm not sure if this would be generally
> detectable. Presumably in such a case, the command which depended on
> a future command would simply bomb out. Or, if it was a
> badly-thought-out command, it might continue and produce garbled
> output. Do you have any ideas on how we'd detect a dependency loop in
> advance?

The command would not necessarily bomb out, because an old version of
the file that it depends on could still be around from a previous build.

On the other hand, some commands might use the existence of files to
define the input to the command (think of "ar cr file.a *.o"). This
command would also not bomb out if one of the *.o files had erroneously
not yet been created, but the output would still be wrong. Consider the
build script

cc -c a.c
cc -c b.c
ar cr file.a *.o
cc -c c.c

The first time this command is run, fabricate would deduce that the "ar"
command (specifically, "ar cr file.a a.o b.o") depends on a.o and b.o.
Fabricate would have no way of figuring out that c.o is created later
and *would have* affected the "ar" command. The results of the build
would be broken.

If the build is run a second time immediately after the first, then the
first two commands would be skipped, then the "ar" command would read
"ar cr file.a a.o b.o c.o" because of the "c.o" file generated by the
first build. Fabricate would not recognize that command at all and
would therefore run the "ar" command. It would skip compiling "c.c" a
second time.

The *third* time the build is run, fabricate can figure out that there
is nothing to do.

The dependency graph is fundamentally only known after a build has
completed, and the next build could have a different dependency graph.
So no, there is no way *in general* to detect a dependency loop in
advance. But the dependency loop could be detected *after* the build.
Conceptually, fabricate would remember what commands were executed to
carry out the build. Then it would imagine running the build again. If
the dependencies were in fact OK, then fabricate should be able to
determine that the second build wouldn't have to do anything.
Otherwise, there was necessarily a problem that fabricate should report.

In fact, the check could be carried out more simply by analyzing the
dependency graph that has been determined. The dependency graph is
trivially constructible from the .deps file; the nodes are filenames and
the (directed) edges are implied by the input-output filename pairs
corresponding to commands that were run during the build.
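As a sketch of that check (illustrative, not fabricate's actual code), assume `deps` maps each command to its `(inputs, outputs)` file lists, the file-level graph described above. A depth-first search over the input-to-output edges finds any loop:

```python
# Hypothetical sketch: detect a dependency loop from recorded build deps.
# `deps` maps command -> (inputs, outputs) -- a simplified stand-in for
# the information in fabricate's .deps file.

def find_cycle(deps):
    """Return one file-level dependency cycle as a path, or None."""
    # Directed edge input -> output for every command.
    edges = {}
    for inputs, outputs in deps.values():
        for src in inputs:
            edges.setdefault(src, set()).update(outputs)

    WHITE, GREY, BLACK = 0, 1, 2   # unvisited / in progress / finished
    colour = {}

    def dfs(node, path):
        colour[node] = GREY
        for nxt in edges.get(node, ()):
            if colour.get(nxt, WHITE) == GREY:
                return path + [nxt]          # back edge: cycle found
            if colour.get(nxt, WHITE) == WHITE:
                cycle = dfs(nxt, path + [nxt])
                if cycle:
                    return cycle
        colour[node] = BLACK
        return None

    for node in list(edges):
        if colour.get(node, WHITE) == WHITE:
            cycle = dfs(node, [node])
            if cycle:
                return cycle
    return None
```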

That suggests another question. On Windows, "shell globbing" characters
are expanded not by the shell but by individual programs. So how could
fabricate figure out that "ar cr file.a *.o" in the second build is
different than the command executed during the first build?

The basic problem here is that the contents of a directory is another
build input that fabricate does not consider. It *could* treat this as
another input (for example when using the strace method), but doing so
would lead to too many false positives and unnecessary rebuilds. (E.g., in
our "*.o" example, the "ar" command would be re-executed if *any* file
were added or removed from the current directory, whereas only changes
to the list of *.o files should cause the command to be re-run.)

Probably the way to get around this problem is to require build scripts
to do file globbing via a function provided by fabricate:

fabricate.glob('*.o')

The output is the list of files matching the pattern '*.o', and among
the dependencies is recorded "command depends on the list of files
matching *.o" and a hash of the return value. When
fabricate.glob('*.o') is invoked during the next build, the list is
generated and compared to the old value. If the lists are different,
then the whole command has to be executed again.
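A minimal sketch of such a helper (illustrative; `recorded_globs` is a hypothetical stand-in for wherever fabricate would persist this between builds):

```python
# Sketch of the proposed fabricate.glob() helper: record the pattern and
# a hash of the resulting file list as an extra dependency, so a changed
# directory listing forces the command to re-run.

import glob as _glob
import hashlib

recorded_globs = {}   # pattern -> hash of the last file list

def glob(pattern):
    files = sorted(_glob.glob(pattern))
    digest = hashlib.md5('\0'.join(files).encode()).hexdigest()
    changed = recorded_globs.get(pattern) != digest
    recorded_globs[pattern] = digest
    # A real implementation would mark the calling command dirty when
    # `changed` is true; here we just return both for illustration.
    return files, changed
```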


My tentative conclusions from all this discussion are:

1. fabricate is not 100% safe, because it misses some build dependencies
(such as the contents of directory listings). But it would probably be
possible to specify some ground rules which, if obeyed by build scripts,
would guarantee the safety of the build (for example, one might need to
require that filename globbing is done using a function provided by
fabricate).

2. fabricate is probably never going to be able to do safe parallel
builds using atimes, though it might be possible using strace.

3. It's still a very fascinating concept and could potentially be the
basis of a vastly simpler build system.

4. All of this is pointless if the repeated directory crawls and MD5
hash computations are too expensive -> therefore, benchmarking is needed!

Ben Hoyt

Jul 29, 2009, 6:36:12 PM
to fabrica...@googlegroups.com
Thanks, Michael, those are some good ideas.

> But if the build script knows that certain commands can run in parallel, it would conceptually be possible for *the build script itself* to issue the parallelizable build commands in parallel (say, through an Executor service provided by fabricate).

If we head in the parallelization direction, this is a good idea -- get the build script to specify it explicitly. Instead of "for command in commands: run(command)" you'd do "run_parallel(commands)". It's requiring more from the user, of course ...

> 2. fabricate is probably never going to be able to do safe parallel
> builds using atimes, though it might be possible using strace.

Yes, we spent some time looking for a good strace equivalent on Windows. Pankaj Garg's StraceNT is close, but it doesn't recurse into sub-processes, which is necessary especially for things like gcc, which sub-spawns "cc1.exe" and "ld.exe" etc. We'd still switch if something becomes available, or if we write one ourselves ... :-)

Cheers,
Ben.

Leandro Lucarella

Aug 3, 2009, 12:20:15 AM
to fabricate users
On Jul 29, 7:36 pm, Ben Hoyt <benh...@gmail.com> wrote:
> Thanks, Michael, those are some good ideas.
>
> But if the build script knows that certain commands can run in parallel, it
>
> > would conceptually be possible for *the build script itself* to issue the
> > parallelizable build commands in parallel (say, through an Executor service
> > provide by fabricate).
>
> If we head in the parallelization direction, this is a good idea -- get the
> build script to specify it explicitly. Insteads of "for command in commands:
> run(command)" you'd do "run_parallel(commands)". It's requiring more from
> the user, of course ...

This is one feature I would like to have in fabricate too. How about
extending run() to accept multiple commands (accepting a list of
commands)? Then one can configure the number of concurrent commands to
run with, for instance, a command line option (like -j ;) and/or
hardcode it when calling run([cmd1, cmd2, cmd3], threads=2). I'm willing
to work on this if it's welcome.

Opinions?

Ben Hoyt

Aug 3, 2009, 4:29:17 AM
to fabrica...@googlegroups.com
> This is one feature I would like to have in fabricate too. How about
> extending run() to accept multiple commands (accepting a list of
> commands)? Then one can configure the number of concurrent commands to
> run with, for instances, a command line option (like -j ;) and/or
> hardcoded when calling run([cmd1, cmd2, cmd3], threads=2). I'm willing
> to work on this if it's welcome.

This is a neat idea, and I think it might work (with one gotcha, see below). Because of how fabricate works (it's a Python script, and things need to happen in a certain order), unfortunately you can't generalise the parallel build thing like with make.

But I think you're alright if you do what you've shown and specify explicitly that "these commands can be run in parallel" by passing a list of commands to run(). So where we now have this in our build scripts:

    for source in sources:
        run('gcc -c %s.c' % source)

We could change it to:

    run('gcc -c %s.c' % source for source in sources)

Which is actually kinda tidy, and then you could go "build.py -j4" to build with up to say four threads, but only where a list of commands has been passed to run().

THE GOTCHA: This parallel thing would have to be disabled when using atimes_runner, for example on Windows, because with the atimes_runner you don't know which process accessed/modified the files. With strace_runner (Linux) you're fine, however.

However, it'd be good to have this implemented for the runners it works for (strace_runner), assuming it's a reasonably clean implementation that just works, so the user doesn't have to worry about it.
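A rough standard-library sketch of what run([...], threads=N) could look like internally (illustrative only; real fabricate would also have to merge each command's recorded dependencies afterwards):

```python
# Minimal sketch of the proposed run([...], threads=N) API: a worker
# pool drains a shared queue so at most `threads` commands execute at
# once. Error handling is deliberately simple.

import queue
import subprocess
import threading

def run_parallel(commands, threads=2):
    q = queue.Queue()
    for cmd in commands:
        q.put(cmd)
    failures = []

    def worker():
        while True:
            try:
                cmd = q.get_nowait()
            except queue.Empty:
                return
            if subprocess.call(cmd, shell=True) != 0:
                failures.append(cmd)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    if failures:
        raise RuntimeError('commands failed: %s' % failures)
```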

Berwyn and Bryan, what do you guys reckon?

-Ben

Berwyn Hoyt

Aug 3, 2009, 5:21:26 AM
to fabrica...@googlegroups.com
Hi,

In line with our "just works" philosophy, I'd be happier with this parallel build thing if specifying the number of threads was optional.  It should just default to unlimited threads like make -j does.

You'll have to think about what to do with each build target's output.  The easiest would be to just let it all come out the parent's stdout like we currently do, but I wonder whether that will mix up the output into something confusing.  I wonder what make does.  I guess I'd just let the outputs of the various threads mix up until somebody wants something more elaborate.

I'm not opposed to this mainly because it *is* possible to implement an strace-like system for windows by anyone who gets annoyed enough by not having parallel builds.  So this can still be a cross-platform solution.  That can be the project's long-term target.  Presumably strace will work on the mac since it is unix-based.  Ben, perhaps it is worth mentioning the easyhooks library in the issue on the subject for anyone who feels inclined to tackle that one.  Probably also worth documenting the current drawbacks of atimes there and linking the issue from fabricate's "how it works" page.

- Berwyn



--
Berwyn Hoyt, Electronic Solutions & Business  --  Brush Technology
Ph: +64 3 359 2101     Mobile: +64 21 045 7830
Web: brush.co.nz

Leandro Lucarella

Aug 3, 2009, 9:43:05 AM
to fabricate users
On Aug 3, 5:29 am, Ben Hoyt <benh...@gmail.com> wrote:
> > This is one feature I would like to have in fabricate too. How about
> > extending run() to accept multiple commands (accepting a list of
> > commands)? Then one can configure the number of concurrent commands to
> > run with, for instances, a command line option (like -j ;) and/or
> > hardcoded when calling run([cmd1, cmd2, cmd3], threads=2). I'm willing
> > to work on this if it's welcome.
>
> This is a neat idea, and I think it might work (with one gotcha, see below).
> Because of how fabricate works (it's a Python script, and things need to
> happen in a certain order), unfortunately you can't generalise the parallel
> build thing like with make.

Yes, of course there should be some aid from the programmer. All
commands specified to run in the same run() call should be guaranteed
by the user to be safe to run in parallel (just like the user has to
guarantee that the program called will produce the exact same outputs
for the exact same inputs =).

> But I think you're alright if you do what you've shown and specify
> explicitly that "these commands can be run in parallel" by passing a list of
> commands to run(). So where we now have this in our build scripts:
>
>     for source in sources:
>         run('gcc -c %s.c' % source)
>
> We could change it to:
>
>     run('gcc -c %s.c' % source for source in sources)

Right.

> Which is actually kinda tidy, and then you could go "build.py -j4" to build
> with up to say four threads, but only where a list of commands has been
> passed to to run().
>
> THE GOTCHA: This parallel thing would have to be disabled when using
> atimes_runner, for example on Windows, because with the atimes_runner you
> don't know which process accessed/modified the files. With strace_runner
> (Linux) you're fine, however.

Yes, those were exactly my thoughts; I forgot to mention this. I think
the runner should provide some way to tell whether it supports parallel
building. Then, when you use -j4 but the runner doesn't support
parallel building, you just ignore the -j4.

> However, I'd be good to have this implemented for runners that it worked for
> (strace_runner), assuming it was a reasonably clean implementation, and if
> it just worked, so the user didn't have to worry about it.

I'll probably give it a shot in the near future.

Leandro Lucarella

Aug 3, 2009, 9:50:21 AM
to fabricate users
On Aug 3, 6:21 am, Berwyn Hoyt <berh...@gmail.com> wrote:
> Hi,
> In line with our "just works" philosophy, I'd be happier with this parallel build thing if specifying the number of threads was optional.  It should just default to unlimited threads like make -j does.

I think the sane default is just -j1. It would be nice to give the
author of the build script some option to tell fabricate to pick a
default thread count based on the number of cores in the computer (if
they know that parallel building is safe), but that can come in the
future.

> You'll have to think about what to do with each build target's output.  The easiest would be to just let it all come out the parent's stdout like we currently do, but I wonder whether that will mix up the output into something confusing.  I wonder what make does.

I think it just does that.

> I guess I'd just let the outputs of the various threads mix up until somebody wants something more elaborate.

Agreed.

Ben Hoyt

Aug 3, 2009, 6:56:14 PM
to fabrica...@googlegroups.com
> It would be nice to provide some option to the user making the build script to tell Fabricate to use some number of threads by default based on the number of cores in the computer

Yeah, I was thinking that too -- assuming there's a reasonably easy way to find that out on Linux/Windows. Stop press: yes, looks like it's this on Linux:

    file('/proc/cpuinfo').read().count('processor')

And this on Windows (I'm sure you can use ctypes pretty easily to avoid win32api, too):

    win32api.GetSystemInfo()[5]

There might be a more platform-generic way to do it in Python, but I couldn't find it.
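For what it's worth, there is a portable option in the standard library: multiprocessing.cpu_count() has been available since Python 2.6 (and os.cpu_count() arrived later, in Python 3.4), which would avoid both platform-specific cases:

```python
# Cross-platform CPU count via the standard library's multiprocessing
# module (available since Python 2.6), with a conservative fallback.

import multiprocessing

def default_threads():
    try:
        return multiprocessing.cpu_count()
    except NotImplementedError:
        return 1   # platform can't report a count; run serially
```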

> You'll have to think about what to do with each build target's output.  The easiest would be to just let it all come out the parent's stdout like we currently do, but I wonder whether that will mix up the output into something confusing.  I wonder what make does.

Yeah, I think make just mixes up the output, but it's usually in order for simple gcc commands that only print one line of output.

What Visual Studio does is print something like this:

1>Compiling C files
3>program.c
2>util.c
3>foo.c
4>bar.c

Where the number before the > is a "process number". I guess you could sort the output at the end, but then you'd have to wait before you got anything on the screen.
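That prefixing scheme is easy to sketch; here `make_emitter` is a hypothetical per-line hook that a parallel run() could hand each worker:

```python
# Sketch of Visual Studio-style output tagging: each parallel worker
# prefixes its lines with "N>" so interleaved output stays attributable.

import sys

def make_emitter(worker_id, stream=sys.stdout):
    """Return a function that writes one tagged output line."""
    def emit(line):
        stream.write('%d>%s\n' % (worker_id, line.rstrip('\n')))
    return emit
```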


> I guess I'd just let the outputs of the various threads mix up until somebody wants something more elaborate.

Yeah, though Visual Studio's way might be a fairly easy-to-implement in-between method.

-Ben

Leandro Lucarella

Aug 3, 2009, 7:34:34 PM
to fabricate users
On Aug 3, 7:56 pm, Ben Hoyt <benh...@gmail.com> wrote:
> > It would be nice to provide some option to the user making the build script
> > to tell Fabricate to use some number of threads by default based on the
> > number of cores in the computer
>
> Yeah, I was thinking that too -- assuming there's a reasonably easy way to
> find that out on Linux/Windows. Stop press: yes, looks like it's this on
> Linux:
>
>     file('/proc/cpuinfo').read().count('processor')
>
> And this on Windows (I'm sure you can use ctypes pretty easily to avoid
> win32api, too):
>
>     win32api.GetSystemInfo()[5]
>
> There might be a more platform-generic way to do it in Python, but I
> couldn't find it.

I looked myself and couldn't find anything more elegant either, so I
think that could be a good start =)

> > You'll have to think about what to do with each build target's output.
> > The easiest would be to just let it all come out the parent's stdout like we
> > currently do, but I wonder whether that will mix up the output into
> > something confusing.  I wonder what make does.
>
> Yeah, I think make just mixes up the output, but it's usually in order for
> simple gcc commands that only print one line of output.
>
> What Visual Studio does is print something like this:
>
> 1>Compiling C files
> 3>program.c
> 2>util.c
> 3>foo.c
> 4>bar.c
>
> Where the number before the > is a "process number". I guess you could sort
> the output at the end, but then you'd have to wait before you got anything
> on the screen.
>
> > I guess I'd just let the outputs of the various threads mix up until
>
> somebody wants something more elaborate.
>
> Yeah, though Visual Studio's way might be a fairly easy-to-implement
> in-between method.

The VS way can be helpful, but I think it should at least be optional,
so as not to confuse editors that parse the compiler output to show
errors (like Vim).

Michael Haggerty

Aug 6, 2009, 4:27:58 AM
to fabrica...@googlegroups.com
Michael Haggerty wrote:
> I have some questions about fabricate:
>
> 1. Can you give an impression of its speed? Ideal would be benchmarks
> against other systems like done in these blog posts:
>
> http://gamesfromwithin.com/?p=42
> http://gamesfromwithin.com/?p=44
>
> (The post includes the code he used to generate his test project.) If
> you haven't done this yet, maybe I could take a stab at it.

OK, I did some benchmarking of fabricate using the method described in
the two blog posts quoted above. Briefly, the test "project" consists
of 50 libraries, each library consists of 100 C++ header files and 100
C++ source files, each header+source defines one trivial class, and the
source files each include 15 other headers from the same library plus 5
headers from other, earlier-numbered libraries. The interlibrary
dependencies only go one way, making it easy to define a self-consistent
build order. The code that the blog author used to create the test data
is linked to from the second blog post.

For comparison, I repeated some of the blog author's original
measurements on my computer (a Linux notebook), plus a few new ones. I
only bothered with the "full" build (build everything from scratch) and
the global "do-nothing" build (run the build again when there is nothing
to do). The builds that I added are as follows:

A "make_nonrecursive" build that uses "gcc -MM" to generate include-file
dependencies for each file, and includes all of the dependency files and
the library-wide Makefiles into a single project-wide Makefile that
knows about all of the dependencies. See [1] for more information about
this style of Makefile.

I added two fabricate.py benchmarks, one using atimes_runner and one
using strace_runner. In both cases the build file does the obvious
loop-over-libraries, loop-over-sourcefiles and invokes g++ using
fabricate.run() once for each file. The tests use fabricate trunk r21.

Here are my results:

make (recursively invoked):
full: 194 s
do-nothing: 3.7 s

make_nonrecursive:
full: 319 s
do-nothing: 1.7 s

scons:
full: 312 s
do-nothing: 52 s

fabricate.py, using atimes_runner:
full: 4245 s
do-nothing: 17.7 s

fabricate.py, using strace_runner:
full: 251 s
do-nothing: 18.9 s

fabricate.py, using strace_runner and mtime_hasher:
full: 328 s
do-nothing: 352 s (i.e., rebuilt everything)

Summary and analysis:

make_nonrecursive is slower than recursive make for the full build, but
this is probably mostly because make_nonrecursive uses "gcc -MM" to
generate the dependencies rather than "makedepend" as used by the
recursive make. This is an implementation difference that could be
changed. For the do-nothing build, make_nonrecursive is faster.

fabricate with the atimes_runner is so excruciatingly slow at doing a
full build that it cannot seriously be considered for a project of this
size. In retrospect, it is clear that it has to be slow for large
projects, because before and after *every* command it has to monkey
around with the atimes of *every single file in the project*. Thus the
time for building a project with N files goes something like O(N^2).

A full build using fabricate and strace_runner is much better, almost as
fast as make.

For do-nothing builds, fabricate is considerably slower than make,
though not as slow as scons. A do-nothing build is the extreme case of
an incremental build, which is the bread-and-butter of daily
development. So I think it would make sense investing time to speed up
do-nothing builds (more discussion below).

For some reason, fabricate with strace_runner and mtime_hasher didn't
work for me. The do-nothing build in fact rebuilt everything. It is
even slower than the corresponding full build.

I noticed that the .deps file when using the md5_hasher is almost 17 MB
and 129k lines long. But a lot of the information is redundant, because
include files are typically included multiple times. In fact (modulo
whitespace), the file has only 26k unique lines, and many of those lines
also have common information. It should definitely be possible to
reduce the size of this file significantly by squeezing out some of the
redundancy.

It also appears that files are being hashed redundantly, for example
even though fabricate has computed the hash during the current build and
knows (via strace) that the file hasn't been written to since then. In
fact, during a full build md5_hasher() is called 149k times, of which
119k calls are redundant. Even worse, during a do-nothing build (when
nothing can possibly change), md5_hasher() is called 119k times, of
which 104k are redundant. One file is hashed 53 times! Reducing some
of this unnecessary hashing could potentially speed things up too,
without a loss of accuracy.

Michael

[1] http://aegis.sourceforge.net/auug97.pdf

Ben Hoyt

Aug 6, 2009, 10:13:41 PM
to fabrica...@googlegroups.com
Hi Michael,

> OK, I did some benchmarking of fabricate using the method described in
> the two blog posts quoted above.  Briefly, the test "project" consists [...]

Wow, thanks Michael -- those are some really worthwhile results. It's good to know both that strace is quite quick (faster than SCons and up there with make), and also that atimes is very slow on big projects.

Hmmm, I wonder why strace_runner + mtime_hasher REbuilt both times. I've noticed that if you switch hashers after a build it will always do a full rebuild because the hash strings will be different -- but this is okay, because people won't usually switch between the two hashers a lot. But then if you build again it shouldn't rebuild again. Strange -- I might try it on a Linux machine here if I get the chance.

> The code that the blog author used to create the test data is linked to from the second blog post.

Thanks. Do you mind sending through the fabricate build script you used for this test data, just so I'm comparing against the same results? (Read: I'm being lazy. :-)

> fabricate.py, using atimes_runner:
>  full: 4245 s
>  do-nothing: 17.7 s

Ouch! :-) Good points about it being O(n^2). I might do some (smaller!) tests myself to see how much that reduces if fabricate doesn't call utime() on every file before and after -- but only looks at the atimes. Still be slow, of course, but maybe not as slow.

That said, we're aware that atimes is kind of a hackish solution. It's slow on big projects as you've pointed out. (Though this doesn't concern us hugely, because our in-house projects are usually quite tiny compared to this 5000-file test! 20 files is a "big" project for us.)

But atimes are also unsafe for our autoclean() function. Because if you happen to edit say main.c while your build is running, atimes_runner() will mark main.c as an output, and then delete it during an autoclean(). Doesn't usually happen, but not cool if it does. We've been thinking of ways around it for this reason. If you have any ideas for better ways to automatically find dependencies under Windows, let us know!

There is a version of strace for Windows that I've hacked to kind of work (http://www.intellectualheaven.com/default.asp?BH=projects&H=strace.htm), but it doesn't strace recursively into child processes. Which means it doesn't work for the likes of gcc, which spawns its own sub-processes.

There's also the EasyHook library (http://www.codeplex.com/easyhook) which allows you to monitor Win32 API calls, but again, it's hard to monitor child processes -- you'd have to manually hook the CreateProcess() and similar Win32 functions, and recurse into them somehow. We think it's possible, but it might be hard.

> I noticed that the .deps file when using the md5_hasher is almost 17 MB
> and 129k lines long.  But a lot of the information is redundant, because
> include files are typically included multiple times.

Heh, that's interesting. Again, we hadn't had big enough projects to notice that. However, probably the easiest way to solve that (given that memory's not really an issue here) is just to use gzip.open() -- i.e., store the file zipped. It's actually slightly faster on big .deps files this way, because zipping is faster than writing a big file to disk. Won't make a huge difference, because it's only done once at the start/end of the build, but it'll mean tiny .deps files. It'll mean the .deps file won't be human-readable anymore, but I'll include an override for when one is debugging fabricate issues.

> It also appears that files are being hashed redundantly, for example
> even though fabricate has computed the hash during the current build and
> knows (via strace) that the file hasn't been written to since then.  In
> fact, during a full build md5_hasher() is called 149k times, of which
> 119k calls are redundant.  Even worse, during a do-nothing build (when
> nothing can possibly change), md5_hasher() is called 119k times, of
> which 104k are redundant.  One file is hashed 53 times!  Reducing some
> of this unnecessary hashing could potentially speed things up too,
> without a loss of accuracy.

Ah yes, this is a very good point too. As mentioned before, we're not too worried about speed (except perhaps the slowness of atimes_runner), but still, if there's a fairly easy way to do this without breaking things, it's worth looking into.

I've added a few issues about some of these things, see:
    http://code.google.com/p/fabricate/issues/list

Thanks for your time!

Cheers,
Ben.

Berwyn Hoyt

unread,
Aug 7, 2009, 12:25:55 AM8/7/09
to fabrica...@googlegroups.com
Hi Ben,

> Ouch! :-) Good points about it being O(n^2). I might do some
> (smaller!) tests myself to see how much that reduces if fabricate
> doesn't call utime() on every file before and after -- but only looks
> at the atimes. It'd still be slow, of course, but maybe not as slow.

I've already done some of this testing on Linux. There, the speed of utime() is equivalent to getting the atime, so since we currently do an equal amount of get_atime() and utime() calls, you would halve the delay time from that change. But os.walk takes about twice the time of get_atime and utime put together, so the directory walk is the biggest time hog. My results are only for Linux, not Windows.

Obviously, though, atimes are never going to be a good solution for big projects. If somebody wants fabricate for big projects, it may be better for them to work on a solution using the EasyHook library, like you say.

Cheers,
Berwyn


Michael Haggerty

unread,
Aug 8, 2009, 7:23:23 AM8/8/09
to fabrica...@googlegroups.com
Ben Hoyt wrote:
> Hi Michael,

>> OK, I did some benchmarking of fabricate using the method described in
>> the two blog posts quoted above.  Briefly, the test "project" consists [...]

> Wow, thanks Michael -- those are some really worthwhile results. It's good to know both that strace is quite quick (faster than SCons and up there with make), and also that atimes is very slow on big projects.

> Hmmm, I wonder why strace_runner + mtime_hasher REbuilt both times. I've noticed that if you switch hashers after a build it will always do a full rebuild because the hash strings will be different -- but this is okay, because people won't usually switch between the two hashers a lot. But then if you build again it shouldn't rebuild again. Strange -- I might try it on a Linux machine here if I get the chance.

I never retained the project while switching hashers; I "rm -rf"ed the whole source tree including the deps file and recreated it from scratch.


>> The code that the blog author used to create the test data is linked to from the second blog post.

> Thanks. Do you mind sending through the fabricate build script you used for this test data, just so I'm comparing against the same results? (Read: I'm being lazy. :-)

I just created a public git project for it.  You can download it by typing

    git clone git://repo.or.cz/build-benchmarks.git

Please feel free to make suggestions or submit patches.


>> I noticed that the .deps file when using the md5_hasher is almost 17 MB
>> and 129k lines long.  But a lot of the information is redundant, because
>> include files are typically included multiple times.

> Heh, that's interesting. Again, we hadn't had big enough projects to notice that. However, probably the easiest way to solve that (given that memory's not really an issue here) is just to use gzip.open() -- i.e., store the file zipped. It's actually slightly faster on big .deps files this way, because zipping is faster than writing a big file to disk. Won't make a huge difference, because it's only done once at the start/end of the build, but it'll mean tiny .deps files. It'll mean the .deps file won't be human-readable anymore, but I'll include an override for when one is debugging fabricate issues.

>> It also appears that files are being hashed redundantly, for example
>> even though fabricate has computed the hash during the current build and
>> knows (via strace) that the file hasn't been written to since then.  In
>> fact, during a full build md5_hasher() is called 149k times, of which
>> 119k calls are redundant.  Even worse, during a do-nothing build (when
>> nothing can possibly change), md5_hasher() is called 119k times, of
>> which 104k are redundant.  One file is hashed 53 times!  Reducing some
>> of this unnecessary hashing could potentially speed things up too,
>> without a loss of accuracy.

> Ah yes, this is a very good point too. As mentioned before, we're not too worried about speed (except perhaps the slowness of atimes_runner), but still, if there's a fairly easy way to do this without breaking things, it's worth looking into.

I think the size of the file and the redundant work being done are connected.  The program needs an enduring concept of a file, so it can keep track of all changes to the file, cache the old hash for the file, and store redundancies as references (direct or indirect) to these file objects.  Normalizing the data this way will fix both problems almost automatically.
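To make the normalization idea concrete, a store along these lines would hash each file at most once and let commands share file records by index (class and method names here are hypothetical, not fabricate's):

```python
class DepsStore:
    """Sketch of a normalized dependency store: every file gets one
    record, its hash is cached on that record, and commands hold
    references (indices) into the shared file table."""

    def __init__(self):
        self.files = []       # index -> filename
        self.hashes = []      # index -> cached hash (parallel to files)
        self.index = {}       # filename -> index
        self.commands = {}    # command line -> list of file indices

    def file_id(self, name, hasher):
        # First sight of a file: assign an index and hash it once.
        if name not in self.index:
            self.index[name] = len(self.files)
            self.files.append(name)
            self.hashes.append(hasher(name))
        return self.index[name]

    def record(self, command, dep_names, hasher):
        # Store references, not copies -- shared headers stay deduplicated.
        self.commands[command] = [self.file_id(n, hasher) for n in dep_names]
```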

I'd like to try working on this change myself, but no promises that I'll have time for it in the near future.

Michael

Ben Hoyt

unread,
Aug 10, 2009, 5:54:26 PM8/10/09
to fabrica...@googlegroups.com
Hi Michael,


> I never retained the project while switching hashers; I "rm -rf"ed the
> whole source tree including the deps file and recreated it from scratch.

Strange -- I'll try to reproduce it here at some stage (the strace+mtime rebuild issue).


> I just created a public git project for it.

Nice -- thanks for that. Again, I'll try to do some stats of my own at some stage -- I'm particularly interested in whether atimes_runner is faster on Windows than Linux for medium-sized projects.

> I think the size of the file and the redundant work being done are connected.  The program needs an enduring concept of a file, so it can keep track of all changes to the file, cache the old hash for the file, and store redundancies as references (direct or indirect) to these file objects.  Normalizing the data this way will fix both problems almost automatically.

Fair point -- they do (or at least could) go together. Normalizing the data as you suggest is a great idea (as long as it doesn't get way too complicated :-).

I'd definitely be keen to see your results if you get a chance to work on this change. I'll have a think about it when I get around to it (that perennial time problem :-).

Cheers,
Ben.
