cl.exe batching [Re: Windows CMake ninja generator performance]


Evan Martin

Sep 19, 2012, 2:16:33 PM
to ninja...@googlegroups.com
(Forking the thread to keep all the batching stuff together.)

On Wed, Sep 19, 2012 at 9:30 AM, Scott Graham <sco...@chromium.org> wrote:
> One WAG is that it's because of batching. cl supports passing multiple .c
> files and specifying an output directory for objects (this reduces process
> startup overhead). It's messy to use for normal tools that want
> predetermined rules (not 1:1, can't have same basename on .c files in
> different dirs, etc.). But clean performance might suffer if the cmake ninja
> generator doesn't use that functionality.

I think we can maybe make cl.exe batching work, with the cooperation
of the generator. For example, imagine something like this:

build out\a.obj out\b.obj out\c.obj: cl src\a.cpp src\b.cpp src\c.cpp
  # only specify output dir, not output files, in the cl command
  command = cl $in /OutputDir:out

However, for incremental builds I suspect we need to only pass the
dirty inputs to cl. That starts getting more complicated -- there are
lots of ways it could be coerced into working but they're all pretty
ugly.

It should be possible to at least hack out something like the above to
see if it helps clean build time, at the cost of making incremental
slower. That would at least pin down whether this is the reason.

Evan Martin

Sep 19, 2012, 2:18:53 PM
to ninja...@googlegroups.com
On Wed, Sep 19, 2012 at 11:16 AM, Evan Martin <mar...@danga.com> wrote:
> build out\a.obj out\b.obj out\c.obj: cl src\a.cpp src\b.cpp src\c.cpp
> # only specify output dir, not output files, in the cl command
> command = cl $in /OutputDir:out
>
> However, for incremental builds I suspect we need to only pass the
> dirty inputs to cl. That starts getting more complicated -- there are
> lots of ways it could be coerced into working but they're all pretty
> ugly.

One thought that just struck me: if we do add syntax for "pool",
perhaps you could write the build edges separately (as you would with
gcc) but in the same cl.exe pool, and then have the pool-running logic
know that it can run multiple pending build edges at the same time.
That actually doesn't feel so awful to me.
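
Concretely, a sketch (both the pool syntax and the "fold several ready edges into one invocation" behavior are hypothetical here):

pool msvc_batch
  depth = 8

rule cl
  command = cl /nologo /showIncludes /c $in /Fo$out
  pool = msvc_batch

build out\a.obj: cl src\a.cpp
build out\b.obj: cl src\b.cpp
build out\c.obj: cl src\c.cpp

The generator's output would look almost exactly like it does today; the batching decision would live entirely inside ninja's pool-running logic.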

Scott Graham

Sep 19, 2012, 6:03:16 PM
to ninja...@googlegroups.com
On Wed, Sep 19, 2012 at 11:16 AM, Evan Martin <mar...@danga.com> wrote:
> (Forking the thread to keep all the batching stuff together.)
>
> On Wed, Sep 19, 2012 at 9:30 AM, Scott Graham <sco...@chromium.org> wrote:
>> One WAG is that it's because of batching. cl supports passing multiple .c
>> files and specifying an output directory for objects (this reduces process
>> startup overhead). It's messy to use for normal tools that want
>> predetermined rules (not 1:1, can't have same basename on .c files in
>> different dirs, etc.). But clean performance might suffer if the cmake ninja
>> generator doesn't use that functionality.
>
> I think we can maybe make cl.exe batching work, with the cooperation
> of the generator.  For example, imagine something like this:
>
> build out\a.obj out\b.obj out\c.obj: cl src\a.cpp src\b.cpp src\c.cpp
>   # only specify output dir, not output files, in the cl command
>   command = cl $in /OutputDir:out
>
> However, for incremental builds I suspect we need to only pass the
> dirty inputs to cl.

Yeah, that's definitely an ugly bit.

Some additional gotchas to keep in mind if someone wants to try:
- you can't name each output individually (they go to the cwd with the same basename). Duplicate-basename outputs will be silently overwritten, so the generator needs to make sure those files are compiled in separate batches, into different directories, or not batched at all (quick example after this list).
- /showIncludes doesn't work with /MP. I don't know if that matters, because ninja should be able to take care of keeping the process queue full anyway. (MSBuild uses an strace-alike to do deps.) I assume it also means /showIncludes parsing can keep working, since without /MP the output won't get mashed together.
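
For concreteness, the real spelling of the directory option is /Fo with a trailing backslash (the /OutputDir switch above is made up), which also shows the collision; paths are made up:

cl /nologo /c /Foout\ src\util.c lib\util.c

Both translation units land at out\util.obj, and the second compile silently clobbers the first.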

Peter Kümmel

Sep 20, 2012, 3:22:26 PM
to ninja...@googlegroups.com
On 20.09.2012 00:03, Scott Graham wrote:
> (MSBuild uses an strace-alike to do deps).

Interesting, do you think using such a solution would be faster?
Is there an API?

Peter

Scott Graham

Sep 20, 2012, 3:30:58 PM
to ninja...@googlegroups.com
I'm not sure if the motivation was speed or correctness.

Doing it manually is "quite difficult" and requires something like Detours (hops between x86, x64, and .NET processes are tricky to deal with).

The launcher stub is Tracker.exe, which is part of .NET (4.0+ I think): http://msdn.microsoft.com/en-us/library/microsoft.build.utilities.filetracker.aspx. It's not especially well documented, but you'll probably have Tracker.exe in your path on a recent Windows box if you want to play with it.
 



Peter Kümmel

Sep 20, 2012, 3:38:40 PM
to ninja...@googlegroups.com
On 20.09.2012 21:30, Scott Graham wrote:
Ah, so that's "Tracker". It was recently missing on my machine; I had to install the Windows SDK to get it.


Reid Kleckner

Sep 20, 2012, 4:41:35 PM
to ninja...@googlegroups.com
It is an interesting solution, but it is quite error prone. One of the reasons we're interested in moving away from MSBuild is that the file tracker can add inconsequential files to the list of dependencies. This basically breaks incremental builds, since everything is considered out of date.

Bill Hoffman

Sep 21, 2012, 10:38:12 AM
to ninja...@googlegroups.com
Would batching cl make sense for ninja? Basically it would mean that
ninja would have to go to a single-threaded build and let cl do the
parallel part of the build. I'm not sure batching is really the
problem, though. It is also likely that creating all of those tiny .d
files is taking lots of time.

I guess that would be easy to test. Just edit the ninja build files
cmake creates to not create .d files and see if the initial build is
much faster. I suspect it will be.

-Bill



--
Bill Hoffman
Kitware, Inc.
28 Corporate Drive
Clifton Park, NY 12065
bill.h...@kitware.com
http://www.kitware.com
518 881-4905 (Direct)
518 371-3971 x105
Fax (518) 371-4573

Scott Graham

Sep 21, 2012, 10:49:22 AM
to ninja...@googlegroups.com
On Fri, Sep 21, 2012 at 7:38 AM, Bill Hoffman <bill.h...@kitware.com> wrote:
> Would batching cl make sense for ninja?  Basically it would mean that ninja would have to go to a single threaded build and allow cl to do the parallel part of the build.

Batching is unrelated to parallelism. The list of .c files is compiled sequentially, so batching is only useful when you're compiling a large number of files. I believe the goal is just to avoid the time it takes cl to start (and I suppose it could try to be clever about caching between compilation units).

> I suspect the batching may or may not be the real problem.  It is also likely that creating all of those tiny .d files is taking lots of time.
>
> I guess that would be easy to test.  Just edit the ninja build files cmake creates to not create .d files and see if the initial build is much faster.  I suspect it will be.

Yup, could certainly be something else, or a combination of factors.
 

Evan Martin

Sep 21, 2012, 1:13:38 PM
to ninja...@googlegroups.com
On Fri, Sep 21, 2012 at 7:49 AM, Scott Graham <sco...@chromium.org> wrote:
>> Would batching cl make sense for ninja? Basically it would mean that
>> ninja would have to go to a single threaded build and allow cl to do the
>> parallel part of the build.
>
> Batching is unrelated to parallelism. The list of .c are compiled
> sequentially, so only useful when you're compiling a large number of files.
> I believe the goal is just to avoid the time it takes cl to start (and I
> suppose it could try to be clever about caching between compilation units).

I was hoping it'd cache the parsed .h files or something like that.
But I definitely agree it's the sort of thing worth hacking up in a
one-off (like hand-editing a build.ninja file) to measure before going
down the road of real implementation effort.
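
For the one-off, a hand-edited edge might look roughly like this, using /Fo's directory form and simply dropping the dep-file handling for the duration of the experiment (file names made up):

rule cl_batch
  command = cl /nologo /c /Foout\ $in

build out\a.obj out\b.obj out\c.obj: cl_batch $
    src\a.cpp src\b.cpp src\c.cpp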

Bill Hoffman

Sep 21, 2012, 3:14:28 PM
to ninja...@googlegroups.com
On 9/21/2012 10:49 AM, Scott Graham wrote:
>
> Batching is unrelated to parallelism. The list of .c are compiled
> sequentially, so only useful when you're compiling a large number of
> files. I believe the goal is just to avoid the time it takes cl to start
> (and I suppose it could try to be clever about caching between
> compilation units).
I don't agree. If you run cl foo.c foo2.c foo3.c /MP it will try to
build them in parallel. If you run it without /MP it will run them
sequentially. I don't see how ninja can be used to do a parallel build
in this mode of running cl??

-Bill

Scott Graham

Sep 21, 2012, 3:17:45 PM
to ninja...@googlegroups.com
Sorry, yes, I meant without /MP (because as mentioned in the other thread, /showIncludes doesn't work with /MP).

I'm not sure I understand what you're getting at. If you have 1000 .cc files to compile, then batching them in groups of 10 .cc files still gives you up to 100 cl invocations to run in parallel.
 


Bill Hoffman

Sep 21, 2012, 3:37:46 PM
to ninja...@googlegroups.com
On 9/21/2012 3:17 PM, Scott Graham wrote:
>
> Sorry, yes, I meant without /MP (because as mentioned in the other
> thread, /showIncludes doesn't work with /MP).
>
> I'm not sure I understand what you're getting at. If you have 1000 .cc
> files to compile, then batching them in groups of 10 .cc files still
> gives you up to 100 cl invocations to run in parallel.
OK, I get it now. I was not thinking of slicing them up like that; I
was thinking it would be 1000 or 1. I would certainly look into how
much this helps first, as it sounds pretty complicated to implement, and
picking the size of the groups will not be easy, among other things.

-Bill

Peter Kümmel

Sep 21, 2012, 7:18:48 PM
to ninja...@googlegroups.com
On 21.09.2012 16:38, Bill Hoffman wrote:
> Would batching cl make sense for ninja? Basically it would mean that
> ninja would have to go to a single threaded build and allow cl to do the
> parallel part of the build. I suspect the batching may or may not be
> the real problem. It is also likely that creating all of those tiny .d
> files is taking lots of time.
>
> I guess that would be easy to test. Just edit the ninja build files
> cmake creates to not create .d files and see if the initial build is
> much faster. I suspect it will be.
>

Here are the numbers for a cmake debug build.
After 10 runs of "ninja clean && ninja CMakeLib" I get:

without .d files: [322/322 6.04/sec] +- 0.02/sec
with .d files:    [322/322 6.11/sec] +- 0.02/sec

So the .d cost for this C++ lib is about 1-2%.
This shows dependencies aren't expensive here.
For C-only projects like ReactOS it would be worse.

Peter



Setup:
- NINJA_STATUS=[%s/%t %o/sec]
- attached ninja patch to get 2 digits

- with .d files: untouched cmake output, e.g.

rule C_COMPILER
  depfile = $DEP_FILE
  command = "C:/Program Files (x86)/CMake 2.8/bin/cmcldeps.exe" C $in "$DEP_FILE" $out "Note: including file: " "C:/Program Files (x86)/Microsoft Visual Studio 11.0/VC/bin/x86_amd64/cl.exe" C:\PROGRA~2\MICROS~1.0\VC\bin\X86_AM~1\cl.exe /nologo $FLAGS $DEFINES /Fo$out /Fd$TARGET_PDB -c $in


- without .d files: same rules, with everything before the last cl.exe path removed from the command variable of C_COMPILER/CXX_COMPILER in rules.ninja:

rule C_COMPILER
  depfile = $DEP_FILE
  command = C:\PROGRA~2\MICROS~1.0\VC\bin\X86_AM~1\cl.exe /nologo $FLAGS $DEFINES /Fo$out /Fd$TARGET_PDB -c $in
(attachment: digits.patch)

Jean LAULIAC

Jul 18, 2013, 4:16:09 PM
to ninja...@googlegroups.com
Hi there! (thread digging!!! tell me if it's better to open a new thread)  

So, we are using the (awesome) Ninja tool on a pretty large project, for a use case that is probably uncommon: compiling assets in a Node.js web project. Notably, we compile Stylus to CSS and Snockets/CoffeeScript to JavaScript, and soon we'll be using the build system to compile Jade to JS and SASS to CSS (phasing out Stylus).

It means our compilers (Stylus, CoffeeScript) are actually Node.js applications. The problem is... Node.js is painfully slow to start up, notably because the JavaScript is recompiled every single time. It's even worse when the scripts are written in CoffeeScript. Because of this, on a fresh clone of the repo, it is actually much faster to compile files manually by running a single, huge command line (e.g. "stylus foo.styl bar.styl baz.styl ...") than to have Ninja spawn a process for each file.

This can be considered a limitation of Node.js and V8, but it is also something that can be solved on the build system side by providing a generic solution to this kind of issue, that is, batching.

A few thoughts:
  • it could be rule-based: a rule would carry a special command line describing how to batch;
  • with -j4 and 100 files to compile, they could simply be split into 4 batches of 25 files each (see the sketch below);
  • dep files could be generated at the same time, maybe even into a single big .d file (entirely possible with the Makefile-compatible syntax, though maybe not with the /showIncludes output..).
So I add a vote in favor of batching capabilities.
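
For example, using stylus's --out directory flag (file names made up), one batch costs a single Node.js startup instead of 25:

stylus --out build/css src/a.styl src/b.styl src/c.styl ... src/y.styl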

What do you think?

Richard O'Grady

Jul 19, 2013, 1:45:00 PM
to Jean LAULIAC, ninja...@googlegroups.com
This would be nice for us as well.

I recently hacked in something like this to my ninja fork. We pass our
jobs to a third-party app that distributes compile work to other
machines. The way that works is by writing out a text file containing
the list of command lines to execute and then invoking this
distributed builder. So I intercept each Subprocess and instead of
executing it, just append it to a "BatchSubprocess" that writes all
the command lines to a file, then kick off that BatchSubprocess when
there are no more jobs ready. But the implementation is fairly
specific for our use case; it'd be neat to have a proper way to do
this.

// richard

Evan Martin

Jul 19, 2013, 2:21:04 PM
to Richard O'Grady, Jean LAULIAC, ninja-build
I would like to make this work.  I *think* we can do something pretty simple: write the per-edge rules separately as usual, but then we tweak either the rule or use a pool.  When Ninja decides it's time to kick off another command, if the available command is one that is batchable, it can scan through the other ready-to-run commands to see if any others can be batched with it and run them together.

Some of the open questions in my mind are:

- How are output directories handled?  E.g. if two available commands are "build dir1/foo.o: cc foo.cc" and "build some/other/dir/bar.o: cc bar.cc", how can Ninja know how to assemble the command line such that the batch runner can run them correctly?

- How many knobs to control batching does Ninja need to provide?  E.g. do we need a way to limit how many files are batched together?  Maybe the answer to the output directory batching question is that we can only batch together files that share an output directory -- does this mean we then need an $output_dir variable usable in the rule's command?  (E.g. "command = mycompiler --output-dir=$output_dir $in", where $in can be multiple files.)  A strawman sketch follows after these questions.

- How are failures handled?  If the batched compile exits with a failure status, does that mean all of the outputs passed to the command should be considered bad, or must we parse the output in some way to determine failure on a per-output basis?

- How are dependencies extracted?  It's easy enough to say "just write out multiple .d files", but we hit the same output directory problem, and the whole reason for batching is cl.exe, which doesn't write .d files.  In cl's /showIncludes case, it lists each file it's compiling followed by that file's list of includes, so perhaps that format should just be required by Ninja's parser.
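
To make the output-directory and knobs questions concrete, here is a purely hypothetical strawman (none of this syntax exists today) of an opt-in batchable rule built around the $output_dir idea:

rule cc_batch
  command = mycompiler --output-dir=$output_dir $in
  batchable = 1

build dir1/foo.o: cc_batch foo.cc
  output_dir = dir1
build dir1/baz.o: cc_batch baz.cc
  output_dir = dir1

When Ninja starts one of these edges it could scan the other ready edges of the same rule, fold in the $in of any whose $output_dir matches, and run them as a single command; refusing to batch across differing output directories would sidestep the first question entirely.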

I think collecting requirements (as people have been mentioning on this thread) is a good start, so if you would benefit from batching maybe you could describe your ideal answers to the above questions.  It seems like the sort of thing we ought to collect on the wiki...

Jean LAULIAC

Jul 20, 2013, 4:11:43 PM
to Evan Martin, Richard O'Grady, ninja-build
Hi all,

I started collecting ideas on a wiki page: https://github.com/martine/ninja/wiki/Batching-feature-roadmap
One of the main issues is staying compatible with the command-line formats of the various compilers while remaining generic enough.


— Jean

Evan Martin

Jul 20, 2013, 4:14:56 PM
to Jean LAULIAC, Richard O'Grady, ninja-build
This is great!  I didn't realize the wiki was world-editable.  I'll add a section for specific compilers.

Nico Weber

Jul 20, 2013, 5:23:42 PM
to Jean LAULIAC, Evan Martin, Richard O'Grady, ninja-build
On Sat, Jul 20, 2013 at 1:11 PM, Jean LAULIAC <laul...@gmail.com> wrote:
> Hi all,
>
> I started collecting ideas on a wiki page: https://github.com/martine/ninja/wiki/Batching-feature-roadmap

Cool! The numbers look pretty off, though. Starting a compiler process is usually considered to be pretty cheap (both gcc and clang fork a second process for the actual compilation, for example – pass -### to see the child processes the compiler driver starts). It's probably a bit more expensive on Windows, but not 1s expensive.

Also, it's true that the compiler needs to read .h files just once, but it can't precompile them except in special cases, since the first include in either of the two translation units could change the behavior of all #ifdefs in all following headers. Saving the redundant read of .h files might shave off fractions of a second (again, possibly a bit more on Windows).

I think better motivating examples are things like node.js, possibly javac, things that do network requests and would rather do fewer and larger requests, etc.

Nico

Neil Mitchell

Jul 20, 2013, 5:37:32 PM
to Nico Weber, Jean LAULIAC, Evan Martin, Richard O'Grady, ninja-build
> Cool! The numbers look pretty off, though. Starting a compiler process is
> usually considered to be pretty cheap (both gcc and clang fork a second
> process for the actual compilation for example – pas -### to see child
> processes the compiler driver starts). It's probably a bit more expensive on
> Windows, but not 1s expensive.

For Windows you are typically looking at 0.1-0.3s, but if you are
going through something like Cygwin/Mingw in the worst case you can
hit 1s.

> Also, it's true that the compiler needs to read .h files just once, but it
> can't precompile them except in special cases, since the first include in
> the two translation units could change the behavior of all #ifdefs in all
> following headers. Saving the redundant read of .h files might shave of
> fractions of a second (again, possibly a bit more on Windows).

The MSVC precompile mechanism is often used to precompile things like
system headers, so things like Windows.h are precompiled in. Most
well-written libraries guard against double inclusion anyway. You also
use precompilation very explicitly, saying which headers end up in the
precompilation file, and building that precompile file explicitly. It
also isn't semantics preserving - anything precompiled in is available
anywhere that uses the precompile file, even if that source file doesn't
include the precompiled headers itself. While it has many weaknesses,
when I first measured it the build times on my project dropped 40%, so
it is definitely worth it. However, none of this needs batched calls to
cl: you just make all your .obj files depend on the .pch (precompiled
header) file and have a rule to build the .pch - Ninja handles that
just fine already.
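
A minimal sketch of that arrangement, using MSVC's /Yc (create), /Yu (use) and /Fp (name the .pch); paths are illustrative:

rule pch
  command = cl /nologo /c /Ycstdafx.h /Fppch\stdafx.pch /Foobj\stdafx.obj stdafx.cpp

rule cxx
  command = cl /nologo /c /Yustdafx.h /Fppch\stdafx.pch /Fo$out $in

build pch\stdafx.pch obj\stdafx.obj: pch stdafx.cpp
build obj\a.obj: cxx src\a.cpp | pch\stdafx.pch
build obj\b.obj: cxx src\b.cpp | pch\stdafx.pch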

> I think better motivating examples are things like node.js, possibly javac,
> things that do network requests and would rather do fewer and larger
> requests, etc.

My motivating example would be anything operating over SSH. Setting up
the connection is very expensive; running multiple things on the
server is quick. If you have two rules producing A.txt and B.txt that
both just fetch something from the same remote server, batching them
might be far quicker.
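
For example (hypothetical host and paths):

# today: one ssh handshake per output
scp server:/srv/data/A.txt A.txt
scp server:/srv/data/B.txt B.txt

# batched: one handshake fetching both
ssh server tar cf - -C /srv/data A.txt B.txt | tar xf -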

Thanks, Neil