
Creating a shared dependency graph


e...@nunswithguns.net

Feb 22, 2007, 8:22:34 PM

Hello all,

I'm implementing a simple build system for C++ projects and I'd like
to be able to use the power of multi-core machines to speed up the
dependency analysis (#includes).

Currently I do the analysis for each source file in isolation from the
others. This approach lends itself to an easy threaded implementation
as there is very little mutable shared state involved.

However, I think it likely that if I share the dependency data between
threads that are updating it, I might be able to get better
performance. For example if file1.cpp and file2.cpp both #include
header.h, the current scheme will end up doing a particular chunk of
analysis multiple times (descending in to header.h and the headers it
#includes and so on).

I would like to have a list (or some other container) of structures
that associates a source file or header file with a list of files that
it *directly* #includes e.g.

// sketch
#include <string>
#include <vector>

struct dep_info
{
    dep_info(const std::string &filename) :
        filename_(filename),
        pending_(true)
    { }

    std::string filename_;               // source/header file
    std::vector<std::string> includes_;  // directly #included headers
    bool pending_;                        // indicates whether the includes_ vector needs filling
};

std::vector<dep_info> dependencies;

Initially the dependencies vector will have default dep_info objects
for all the C++ implementation files I'm interested in building. Then
a bunch of threads will be thrown at it.

A worker thread will pick up a pending dep_info from the vector and
scan it for #include directives. Any #included headers it finds that
aren't already represented by a dep_info with such a filename_ will be
added to the dependencies vector ready for processing in the near
future.

In fact I might be able to use an index to indicate the next dep_info
to be processed, rather than a pending_ flag in each one, but I
digress.

I don't have much experience with concurrency in general (I bet you're
all fed up with the recent tide of people like me already :) So I'm
wondering if this scheme is sensible, once appropriate locking is in
place? I'm hoping that the time spent scanning a source file will
dwarf the duration of the average lock but I have little intuition to
rely on. If it is indeed half-way sensible, I'd appreciate any
recommendations for threading idioms and constructs that might be used
in the implementation.

FWIW I'll be implementing this in Python. C++ was only for
illustration purposes.
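
In fact, here's roughly the Python I have in mind. It's only a sketch:
scan_for_includes() just regexes for #include lines (a real version
would resolve header names against the include directories and cope
with conditionals), the file names are the ones from the example above,
and the locking is deliberately naive.

# sketch -- shared dependency map built by a pool of worker threads
import re
import threading

include_re = re.compile(r'^\s*#\s*include\s*["<]([^">]+)[">]')

def scan_for_includes(filename):
    # return the headers *directly* #included by filename
    deps = []
    with open(filename) as f:
        for line in f:
            m = include_re.match(line)
            if m:
                deps.append(m.group(1))
    return deps

lock = threading.Lock()
pending = ['file1.cpp', 'file2.cpp']   # files still to be scanned
seen = set(pending)                    # everything ever queued
includes = {}                          # filename -> list of direct includes

def worker():
    while True:
        with lock:
            if not pending:
                return   # may quit while another thread is still scanning;
                         # that only costs parallelism, not correctness
            filename = pending.pop()
        deps = scan_for_includes(filename)   # the expensive bit, unlocked
        with lock:
            includes[filename] = deps
            for dep in deps:
                if dep not in seen:
                    seen.add(dep)
                    pending.append(dep)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

A condition variable would be a tidier way to handle the shutdown, but
hopefully the shape of the thing is clear.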

Kind regards,

Edd

Andrew Marlow

Feb 23, 2007, 4:08:46 AM

On Thu, 22 Feb 2007 17:22:34 -0800, edd wrote:

> Hello all,
>
> I'm implementing a simple build system for C++ projects and I'd like
> to be able to use the power of multi-core machines to speed up the
> dependency analysis (#includes).

I can't wait for this to be finished and available. It should prove to be
very popular (I hope). I have a couple of requests:

1) Please make the system support the usual -I and -D options that govern
which #include statements get seen by the preprocessor.
2) Please produce output for the graphviz (aka dot) program. This will
enable a dependency diagram to be drawn in a number of formats (ps, png
etc).

One way to get some parallelism is to use make to create the
dependency information. Many build systems are set up this way. They either
use special options on the compiler (which then locks the project to that
compiler) or they use the portable open source command makedepend (from
the X11 project) to generate the dependencies. The makedepend approach
would lend itself to handling the -D and -I options. Using GNU make you
can then invoke make with the -j option to fire off a number of parallel
jobs. The python program would then be reduced to processing the
dependency information generated by the make.
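
To give an idea of how little post-processing that would involve: a
sketch along the following lines would turn make-style dependency rules
(as emitted by makedepend, or by gcc -MM if you are on gcc) into a dot
file that graphviz can render. The parsing is rough and the names are
only illustrative.

# sketch -- "target.o: dep1 dep2 ..." rules in, graphviz dot out
import sys

def parse_rules(text):
    text = text.replace('\\\n', ' ')           # join continuation lines
    for line in text.splitlines():
        if ':' not in line or line.lstrip().startswith('#'):
            continue
        target, _, deps = line.partition(':')
        yield target.strip(), deps.split()

def write_dot(rules, out=sys.stdout):
    out.write('digraph deps {\n')
    for target, deps in rules:
        for dep in deps:
            out.write('    "%s" -> "%s";\n' % (target, dep))
    out.write('}\n')

if __name__ == '__main__':
    write_dot(parse_rules(sys.stdin.read()))

Pipe the result through "dot -Tpng -o deps.png" (or -Tps) to get the
picture. Note this draws the flattened view -- each object file linked
to everything it ends up depending on -- not the structure of the
#includes themselves.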

Regards,

Andrew Marlow
--
There is an emerald here the size of a plover's egg!
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

e...@nunswithguns.net

Feb 23, 2007, 4:12:37 PM

Hi Andrew,

Thanks for taking an interest. You've raised a number of issues that
I've already spent some time thinking about. I'm rather conscious of
the fact that I'll take things off topic if I respond fully to some of
your points, so I hope you'll excuse me if I appear to gloss over some
of what you've said. I'd be happy to discuss things in depth via
email. FYI, the current version of 'slam' is available from
http://www.mr-edd.co.uk. Reading the docs will answer some of your
queries.

On 23 Feb, 09:08, "Andrew Marlow" <use...@marlowa.plus.com> wrote:
> On Thu, 22 Feb 2007 17:22:34 -0800, edd wrote:
> > Hello all,
>
> > I'm implementing a simple build system for C++ projects and I'd like
> > to be able to use the power of multi-core machines to speed up the
> > dependency analysis (#includes).
>
> I can't wait for this to be finished and available. It should prove to be
> very popular (I hope). I have a couple of requests:
>
> 1) Please make the system support the usual -I and -D options that govern
> which #include statements get seen by the preprocessor.

This kind of facility has been implemented, but in a compiler-neutral
way.

> 2) Please produce output for the graphviz (aka dot) program. This will
> enable a dependency diagram to be drawn in a number of formats (ps, png
> etc).

That's a nice idea. I'll try to look into that at some point.

> One way to get the some parallelism is to use Make to create the
> dependency information.

Which make, though? :)
Currently slam runs on Windows and most Unix-like machines and has
support for a variety of compilers. I'm not comfortable relying on a
particular version/variant/vendor of make. There are numerous problems
with make anyway, which is why I started this project (email for
discussion!). I actually get each supported compiler to calculate the
dependencies for a given file, which is about as accurate as it can
get!
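
For the gcc family, for instance, it boils down to something like this
(an approximation for illustration only, not the actual slam code;
other compilers need different flags and different output parsing):

# rough approximation of asking gcc for a file's dependencies
import subprocess

def compiler_deps(source, include_dirs=(), defines=()):
    cmd = ['gcc', '-MM']
    cmd += ['-I' + d for d in include_dirs]
    cmd += ['-D' + d for d in defines]
    cmd.append(source)
    output = subprocess.check_output(cmd, universal_newlines=True)
    # output looks like "file1.o: file1.cpp header.h \
    #                     other.h"
    return output.replace('\\\n', ' ').partition(':')[2].split()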

> Many build systems are setup this way. They either
> use special options on the compiler (which then locks the project to that
> compiler) or they use the portable open source command makedepend (from
> the X11 project) to generate the dependencies. The makedepend approach
> would lend itself to handling the -D and -I options.

slam maps preprocessor definitions and include directories on to the
correct compiler CLI options whether that be -D & -I or something
else. I could use makedepend, but even that isn't as accurate as
getting the compiler to do it (email for details!)

But the bigger problem is that makedepend flattens the dependencies
automatically, which is what I'm doing right now anyway (and doesn't
lend itself to creating a complete dot file, either).

That is, I don't believe that I can use makedepend to find the headers
that are *directly* #included by a given source file. So there's
potentially a heck of a lot of dependency data being recalculated over
and over again by makedepend (and in the current implementation in
slam). See the example with file1.cpp, file2.cpp and header.h in the
original post.

I may not be communicating the problem particularly well. I really
want multiple threads to share each others findings so that I don't
have to scan headers multiple times. By running a given process (such
as makedepend) in isolation on each source file, I'm passing up a
potentially substantial optimisation opportunity.
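
Put another way: once every file's *direct* includes are known (and
each header has been scanned exactly once), flattening them for any
particular source file is just a cheap graph walk, something like the
sketch below (extending the file1.cpp/file2.cpp/header.h example from
my original post with a made-up inner.h):

# sketch -- flatten a map of direct includes into the full dependency
# set of one source file, without rescanning anything
def flatten(filename, direct_includes):
    seen = set()
    stack = [filename]
    while stack:
        current = stack.pop()
        for dep in direct_includes.get(current, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

# e.g. with {'file1.cpp': ['header.h'], 'file2.cpp': ['header.h'],
#            'header.h': ['inner.h'], 'inner.h': []}
# flatten('file2.cpp', ...) gives {'header.h', 'inner.h'} and header.h
# never has to be scanned a second time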

> Using GNU make you
> can then invoke make with the -j option to fire off a number of parallel
> jobs.

Again I really want to avoid make (one problem for example is that the
-j option produces faulty builds under some circumstances!). But the
actual build sequence is distributed among multiple threads by slam,
anyway. It's the dependency generation that's the problem.

Kind regards,

Edd

Steve Watt

Feb 26, 2007, 2:53:08 AM

In article <1172193754.4...@m58g2000cwm.googlegroups.com>,
<e...@nunswithguns.net> wrote:
[ ... ]

>However, I think it likely that if I share the dependency data between
>threads that are updating it, I might be able to get better
>performance. For example if file1.cpp and file2.cpp both #include
>header.h, the current scheme will end up doing a particular chunk of
>analysis multiple times (descending in to header.h and the headers it
>#includes and so on).

Yes, but that's probably unavoidable unless you keep state of the
entire namespace of what macros (#defines) are in place when you
do the #include. The header file may have different dependencies
depending on what macros are active (for example _REENTRANT is a
common one that might cause different headers to be included).

--
Steve Watt KD6GGD PP-ASEL-IA ICBM: 121W 56' 57.5" / 37N 20' 15.3"
Internet: steve @ Watt.COM Whois: SW32-ARIN
Free time? There's no such thing. It just comes in varying prices...

e...@nunswithguns.net

Feb 26, 2007, 7:16:54 AM

Hi Steve,

Thanks for your response.

On 26 Feb, 07:53, Steve Watt <steve.removet...@Watt.COM> wrote:


> In article <1172193754.419985.252...@m58g2000cwm.googlegroups.com>, <e...@nunswithguns.net> wrote:
>
> [ ... ]
>
> >However, I think it likely that if I share the dependency data between
> >threads that are updating it, I might be able to get better
> >performance. For example if file1.cpp and file2.cpp both #include
> >header.h, the current scheme will end up doing a particular chunk of
> >analysis multiple times (descending in to header.h and the headers it
> >#includes and so on).
>
> Yes, but that's probably unavoidable unless you keep state of the
> entire namespace of what macros (#defines) are in place when you
> do the #include. The header file may have different dependencies
> depending on what macros are active (for example _REENTRANT is a
> common one that might cause different headers to be included).

Yes, I'll need a full C pre-processor to be able to implement things
100% correctly. That's not out of the question if I used something like
Boost.Wave, for example.

However, it doesn't have to be 100% correct; as long as the header
deduction code is conservative things should be good enough. That is,
giving a file extra dependencies won't lead to an incorrect build in
the vast majority of cases (all cases?). I'm willing to invest time in
the experiment to test this hypothesis, anyway.
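
To illustrate what I mean by conservative (the header and macro names
here are made up, apart from _REENTRANT which you mentioned): a scanner
that ignores the conditionals reports both branches of an #ifdef as
dependencies, so it can only ever over-approximate.

# sketch -- a conditional-blind scan picks up both branches of the #ifdef
import re

include_re = re.compile(r'^\s*#\s*include\s*["<]([^">]+)[">]')

header = '''\
#ifdef _REENTRANT
#include <thread_support.h>
#else
#include <single_threaded.h>
#endif
'''

deps = [m.group(1) for m in map(include_re.match, header.splitlines()) if m]
print(deps)   # ['thread_support.h', 'single_threaded.h'], whatever _REENTRANT is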

Kind regards,

Edd

Markus Elfring

Apr 10, 2007, 11:42:59 AM

> Yes, I'll need a full C pre-processor to be able to implement things
> 100% correctly. Thats not out of the question if I used something like
> Boost.Wave, for example.

Is the analysis tool "http://synopsis.fresco.org/" appropriate for your use
case? Would you like to perform source code introspection?

Regards,
Markus
