Replacing ghc --make with Shake


Edward Z. Yang

Jul 27, 2015, 9:19:35 PM
to shake-build-system
Hello all,

As you may be well aware, GHC currently implements a mini build system
for the --make flag. Simon has never been keen about the implementation
since it's pretty complicated, and I've recently needed to add more
sophisticated logic to it, which has left me wondering if we couldn't
replace it with something Shake based.

Unfortunately, Shake has too many dependencies to be easily added as a
build dependency to GHC. So I wonder if there is not a mini Shake inside
which could be factored out and made usable? Something like was
described in the ICFP paper. For example:

- We'd like to avoid having an extra metadata cache; as a tradeoff,
we can store metadata in some of the files we generate, and
when running GHCi we can keep the metadata in memory

- We'd like to avoid dependencies on 'extra', 'hashable', 'js-*',
'unordered-containers' and 'utf8-string'

- We don't need reporting / only need very simple reporting

- We don't need fancy linting

- We don't need the libraries for shelling out to commands to
run them to build things

- We don't need resources

- We probably don't need oracles? (Not quite sure about this one.)

Basically, we have some code which:

1. Generates a dependency graph by recursively finding and parsing
files,

2. Topologically sorts it, and

3. Builds everything in order

and surely there should be a mini-Shake, or maybe some conceptual
idea which GHC can take advantage of here.

Cheers,
Edward

Neil Mitchell

Jul 28, 2015, 4:03:21 AM
to Edward Z. Yang, shake-build-system
Hi Edward,

> As you may be well aware, GHC currently implements a mini build system
> for the --make flag. Simon has never been keen about the implementation
> since it's pretty complicated, and I've recently needed to add more
> sophisticated logic to it, which has left me wondering if we couldn't
> replace it with something Shake based.

A reasonable idea. Also, Shake runs massively faster than ghc --make
in certain circumstances - think 40s vs 0.4s. This is why we use
ghc-make at work (https://github.com/ndmitchell/ghc-make).

> Unfortunately, Shake has too many dependencies to be easily added as a
> build dependency to GHC.

Are you sure? The GHC build system itself is being converted to Shake.
As a result, any machine that can build GHC will already have Shake
installed.

> So I wonder if there is not a mini Shake inside
> which could be factored out and made usable? Something like was
> described in the ICFP paper.

Note that the description in the ICFP paper is good for the core, but
it's a lot of work to implement.

> - We'd like to avoid having an extra metadata cache; as a tradeoff,
> we can store metadata in some of the files we generate, and
> when running GHCi we can keep the metadata in memory

Shake rather requires the metadata - it saves running CPP, parsing all
the files etc, so it makes things go faster too. Why not put the metadata
in a .cache file where you would put the .hi for the build?

> - We'd like to avoid dependencies on 'extra', 'hashable', 'js-*',
> 'unordered-containers' and 'utf8-string'

Taking a look at the Shake dependencies, you can certainly trim some
(js-flot/js-jquery) but some (hashable) are part of the interface. You
could write a new Shake, but it is a different Shake.

> - We don't need the libraries for shelling out to commands to
> run them to build things

That means you can probably drop the process requirement. But you're
now going quite far away from the Shake library itself.

>
> - We don't need resources
>
> - We probably don't need oracles? (Not quite sure about this one.)

Neither of these make any difference in terms of dependencies.

> Basically, we have some code which:
>
> 1. Generates a dependency graph by recursively finding and parsing
> files,
>
> 2. Topologically sorts it, and
>
> 3. Builds everything in order
>
> and surely there should be a mini-Shake, or maybe some conceptual
> idea which GHC can take advantage of here.

The conceptual idea behind Shake is avoiding a static dependency
graph. You are working in a simpler problem domain, and while you
certainly could use Shake, you sound like you want to avoid a lot of
the features of Shake. So why not write your own restricted version of
Shake? I would have thought something like:

ghcBuilder :: [Rule] -> IO ()

data Rule = Rule
    {filesOut :: [FilePath]
    ,filesIn  :: [FilePath]
    ,action   :: IO ()
    }

Now you've got a conceptually simple abstraction which GHC can use, so
shouldn't complicate the rest of GHC. Implementing that abstraction on
top of Shake is trivial, and implementing it yourself is not too hard
either, without going as far as implementing a full Shake clone.
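
For instance, a minimal sketch (not from this thread) of the Shake layering could look like the following; `ruleAction` renames the `action` field to avoid clashing with Shake's own `action` function, and it assumes every Rule declares at least one output:

import Control.Monad (forM_)
import Development.Shake

data Rule = Rule
    { filesOut   :: [FilePath]
    , filesIn    :: [FilePath]
    , ruleAction :: IO ()
    }

ghcBuilder :: [Rule] -> IO ()
ghcBuilder rules = shake shakeOptions $ do
    -- Ask for every declared output up front.
    want (concatMap filesOut rules)
    -- One Shake rule per Rule: outputs depend on inputs, then run the action.
    forM_ rules $ \r ->
        filesOut r &%> \_ -> do
            need (filesIn r)
            liftIO (ruleAction r)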

So, having read your entire email, I wonder if the problem is that you
don't have a clear and unambiguous boundary between a dependency-based
build system and the rest of GHC? If so, Shake solves that, but might
be overkill and might have too many undesirable features to replace
ghc --make. But you can take the approach of assuming there is a 3rd
party library, and enforcing the API separation, even though there
isn't.

Thanks, Neil

Ryan Gonzalez

Jul 28, 2015, 10:26:59 AM
to Neil Mitchell, Edward Z. Yang, shake-build-system


On July 28, 2015 3:03:20 AM CDT, Neil Mitchell <ndmit...@gmail.com> wrote:
>Hi Edward,
>
>> As you may be well aware, GHC currently implements a mini build system
>> for the --make flag. Simon has never been keen about the implementation
>> since it's pretty complicated, and I've recently needed to add more
>> sophisticated logic to it, which has left me wondering if we couldn't
>> replace it with something Shake based.
>
>A reasonable idea. Also, Shake runs massively faster than ghc --make
>in certain circumstances - think 40s vs 0.4s. This is why we use
>ghc-make at work (https://github.com/ndmitchell/ghc-make).
>

Dang, I never knew about that. I now need to try it!
--
Sent from my Nexus 5 with K-9 Mail. Please excuse my brevity.

Neil Mitchell

Jul 28, 2015, 11:55:45 AM
to Ryan Gonzalez, Edward Z. Yang, shake-build-system
> Dang, I never knew about that. I now need to try it!

The 100x speedup really only appears if you have a very aggressive virus
scanner on Windows. It might be a bit faster on other machines, but
unless you find yourself waiting for ghc --make to finish when there
is nothing to build, it's probably not going to have any real impact.
It does save spawning processes and CPP processing.

Thanks, Neil

Luke Hoersten

Jul 29, 2015, 4:43:27 PM
to Shake build system, rym...@gmail.com, ezy...@mit.edu, ndmit...@gmail.com
It should also be mentioned that the `js-*` packages both only depend on base. The dependency tree for Shake really is quite small. I would highly recommend moving to Shake for ghc --make.

Luke

Evan Laforge

Jul 29, 2015, 6:50:03 PM
to Luke Hoersten, Shake build system, rym...@gmail.com, Edward Yang, Neil Mitchell
If ghc --make used shake, would it be theoretically possible to expose
it to be reused as part of a larger build system? Long ago I switched
from ghc --make to shake, but I use it to build not just haskell, but
also c++, hsc2hs, generate tests and profiles, documentation, etc. I
have my own haskell and C dependency chaser, but it would be nice to
reuse ghc for that, because I have to handle CPP and import parsing
myself. Also ghc --make can infer the needed packages from the source
for each binary. I use a hardcoded list and all binaries get all of
them even if they don't need all of them. I guess the hardcoded list
is necessary as a place to put version constraints ala cabal, but it
would be nice to ask "if X is the main module, what packages are
needed?"

In the bigger picture, there's kind of a phase transition where you go
from the small project that can use --make or cabal to the large
project that needs to switch to shake, and it would be nice to make
that smoother than rewriting everything.

The other big thing is that shake is parallel. I guess just switching
to use it internally for the dependency graph wouldn't magically make
ghc thread-safe, but it would be one more step towards that.

Edward Z. Yang

Jul 30, 2015, 3:16:15 AM
to Neil Mitchell, shake-build-system
> A reasonable idea. Also, Shake runs massively faster than ghc --make
> in certain circumstances - think 40s vs 0.4s. This is why we use
> ghc-make at work (https://github.com/ndmitchell/ghc-make).

Well, there's an important distinction to be made here: ghc --make
parallelizes compilation over one process, whereas ghc-make parallelizes
compilation over many processes doing one-shot compilation. My attitude
is that if you really want to scale indefinitely over arbitrarily many
cores, you're going to need separate processes; but it's not clear
if it's desirable for ghc --make to start generating GHC subprocesses
to compile in parallel. (Maybe!)

In any case, there are two senses in which ghc --make could be shaked:

1. ghc --make uses Shake, but commands are done in the same process
using the GHC API.

2. ghc --make uses Shake to spawn multiple GHC subprocesses. With
some work, this might not use the GHC API at all, in which case
it would be a useful library for people looking to develop Shake
build systems.

As a GHC developer, I was curious about (1); however, I think (2) is
equally important. I also think people can develop (2) without
actually needing changes from GHC; it's just a question of hooking
up ghc-make with appropriate GHC API bits so that, e.g. the output of
ghc -M can be incrementally generated.

> Are you sure? The GHC build system itself is being converted to Shake.
> As a result, any machine that can build GHC will already have Shake
> installed.

Another distinction to make here: there's a difference between shipping
Shake as a boot library--which is compiled with the host GHC--and
shipping Shake as a distribution library--which is compiled with the
target GHC and not easily modifiable. If you /ship/ Shake with GHC,
anyone wanting to compile against any native GHC support for Shake
is pinned to the specific version of Shake which was shipped. And
I think people will very likely want to link against a "GHC Shake"
library, in the sense of (2).

> > - We'd like to avoid having an extra metadata cache; as a tradeoff,
> > we can store metadata in some of the files we generate, and
> > when running GHCi we can keep the metadata in memory
>
> Shake rather requires the metadata - it saves running CPP, parsing all
> the files etc, so it makes things go faster too. Why not put the metadata
> in a .cache file where you would put the .hi for the build?

Or perhaps we could put it in the .hi file itself?

> The conceptual idea behind Shake is avoiding a static dependency
> graph. You are working in a simpler problem domain, and while you
> certainly could use Shake, you sound like you want to avoid a lot of
> the features of Shake. So why not write your own restricted version of
> Shake? I would have thought something like:
>
> ghcBuilder :: [Rule] -> IO ()
>
> data Rule = Rule
> {filesOut :: [FilePath]
> ,filesIn :: [FilePath]
> ,action :: IO ()
> }
>
> Now you've got a conceptually simple abstraction which GHC can use, so
> shouldn't complicate the rest of GHC. Implementing that abstraction on
> top of Shake is trivial, and implementing it yourself is not too hard
> either, without going as far as implementing a full Shake clone.

This approach does sound attractive; however, I'm curious why earlier
you stated it was actually quite difficult to implement?

> So, having read your entire email, I wonder if the problem is you
> don't have a clear and unambiguous boundary between a dependency-based
> build system and the rest of GHC? If so, Shake solves that, but might
> be overkill and might have too many undesirable features to replace
> ghc --make. But you can take the approach of assuming there is a 3rd
> party library, and enforcing the API separation, even though there
> isn't.

It is true that GHC's build system is not well separated internally,
but it *is* separated: ghc --make implements some logic for avoiding
recompilation, and there is also interface checking which lets us know
when recompilation is unnecessary (even if timestamps have updated).

Edward

Edward Z. Yang

Jul 30, 2015, 3:18:47 AM
to Evan Laforge, Luke Hoersten, Shake build system, rymg19, Neil Mitchell
Excerpts from Evan Laforge's message of 2015-07-29 15:49:43 -0700:
> If ghc --make used shake, would it be theoretically possible to expose
> it to be reused as part of a larger build system?

As I mentioned in my email to Neil, what you want is a library for
building Haskell code with Shake using one-shot GHC which has feature
parity with ghc --make. You are probably only going to be able
to get this with some cooperation from GHC HQ, but I think it is a
goal we can all get behind, and *doesn't* necessarily need any
changes to GHC.

Edward

Neil Mitchell

Jul 30, 2015, 5:03:08 AM
to Edward Z. Yang, shake-build-system
> Well, there's an important distinction to be made here: ghc --make
> parallelizes compilation over one process, whereas ghc-make parallelizes
> compilation over many processes doing one-shot compilation.

That's not why ghc-make goes faster - if anything, that's why it goes
slower. Where ghc-make goes faster is checking if things should be
rebuilt, because it has static information about dependencies (which
header files are imported) and can do a very quick list of file/time
checks. In comparison, GHC parses each Haskell file, and for ones with
CPP it even spawns an external process. On Windows, especially if you
have a very aggressive virus scanner, that's incredibly slow.

Put another way, GHC computes the graph then checks. Shake checks and
then computes whatever subset of the graph has become dirty.

> My attitude
> is that if you really want to scale indefinitely over arbitrarily many
> cores, you're going to need separate processes; but it's not clear
> if it's desirable for ghc --make to start generating GHC subprocesses
> to compile in parallel. (Maybe!)

GHC does a LOT of caching between runs. Doing ghc --make is typically
about 3 times faster than multiple ghc -c calls because it only loads
the .hi files once. You want to stay in process for GHC (or perhaps
fix the ghc -c speed thing - maybe parsing .hi files is too expensive?
Hoogle just mmap's its database and peeks at it from there).

> Another distinction to make here: there's a difference between shipping
> Shake as a boot library--which is compiled with the host GHC--and
> shipping Shake as a distribution library--which is compiled with the
> target GHC and not easily modifiable. If you /ship/ Shake with GHC,
> anyone wanting to compile against any native GHC support for Shake
> is pinned to the specific version of Shake which was shipped. And
> I think people will very likely want to link against a "GHC Shake"
> library, in the sense of (2).

Shake definitely wants to be a boot library and not a distribution
library. If you have a good abstraction, you could expose that in the
GHC API, and then writing a Shake client outside is pretty easy.

>> Shake rather requires the metadata - it saves running CPP, parsing all
>> the files etc, so it makes things go faster too. Why not put the metadata
>> in a .cache file where you would put the .hi for the build?
>
> Or perhaps we could put it in the .hi file itself?

Yep, that works too.

> This approach does sound attractive, however, I'm curious why earlier
> you stated it was actually quite difficult to implement this?

I stated it would be difficult to reimplement Shake, or a mini-Shake.
Shake is monadic and polymorphic - both things that vastly complicate
the system. Monadic pretty much requires the separate metadata store,
and all the serialisation headaches that brings. Polymorphic makes
those serialisation headaches even worse. The system I sketched above
is applicative and monomorphic - a parallel implementation is ~20
lines.
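
To make that concrete, here's a rough sketch (mine, not the ~20-line version Neil has in mind) of such an applicative, monomorphic parallel builder over a static Rule list. It assumes the rule graph is acyclic, each output is produced by at most one rule, and plain source inputs already exist on disk; `ruleAction` stands in for the `action` field:

import Control.Concurrent.Async (mapConcurrently)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, readMVar)
import Control.Monad (filterM, forM_, void, when)
import qualified Data.Map.Strict as Map
import System.Directory (doesFileExist, getModificationTime)

data Rule = Rule
    { filesOut   :: [FilePath]
    , filesIn    :: [FilePath]
    , ruleAction :: IO ()
    }

ghcBuilder :: [Rule] -> IO ()
ghcBuilder rules = do
    -- One MVar per produced file, filled once the producing rule has finished.
    done <- Map.fromList <$>
        mapM (\o -> (,) o <$> newEmptyMVar) [o | r <- rules, o <- filesOut r]
    void $ mapConcurrently (run done) rules
  where
    run done r = do
        -- Wait for any inputs that are produced by another rule.
        forM_ (filesIn r) $ \i -> forM_ (Map.lookup i done) readMVar
        dirty <- needsRebuild r
        when dirty (ruleAction r)
        -- Signal our outputs, whether or not anything had to be rebuilt.
        forM_ (filesOut r) $ \o -> forM_ (Map.lookup o done) (`putMVar` ())

-- Rebuild if any output is missing or older than the newest input.
needsRebuild :: Rule -> IO Bool
needsRebuild r = do
    missing <- filterM (fmap not . doesFileExist) (filesOut r)
    if null (filesOut r) || not (null missing)
        then return True
        else do
            outTimes <- mapM getModificationTime (filesOut r)
            inTimes  <- mapM getModificationTime (filesIn r)
            return (not (null inTimes) && maximum inTimes > minimum outTimes)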

> It is true that GHC's build system is not well separated internally,
> but it *is* separated: ghc --make implements some logic for avoiding
> recompilation, and there is also interface checking which lets us know
> when recompilation is unnecessary (even if timestamps have updated).

Can you point at the "build system" inside GHC? I'd be keen to see the
abstract interface.

> If ghc --make used shake, would it be theoretically possible to expose
> it to be reused as part of a larger build system?

My inclination is that GHC's use (or not use) of Shake should be an
internal detail, so GHC shouldn't expose some Shake Rules in its API.
However, if GHC could expose more dependency information in a friendly
format you could imagine writing a ghc-shake project that took the
data/functions and then produced Shake rules. Or maybe GHC already
exposes the dependency information in a friendly format - I haven't
looked.

Thanks, Neil

Evan Laforge

Aug 3, 2015, 4:35:55 PM
to Edward Z. Yang, Luke Hoersten, Shake build system, rymg19, Neil Mitchell
On Thu, Jul 30, 2015 at 12:18 AM, Edward Z. Yang <ezy...@mit.edu> wrote:
> As I mentioned in my email to Neil, what you want is a library for
> building Haskell code with Shake using one-shot GHC which has feature
> parity with ghc --make. You are probably only going to be able
> to get this with some cooperation from GHC HQ, but I think it is a
> goal we can all get behind, and *doesn't* necessarily need any
> changes to GHC.

Well, I already have that, though built by hand. And it looks like
ghc-make already has that too.

What I was angling at is if GHC could export the dependency graph it
figures out and the packages it infers. Actually it looks like the
first is already done via -M... though for some reason it requires
-dep-suffix now, and it's a little bit awkward to parse it out of a
generated Makefile. But as far as I can tell, there's no way to get
the inferred package list out. It would also be a big speed boost to
be able to reuse ghc's internal caches of parsed .hi files; this would
get something similar to my ghc-server attempt from a long time ago.

In more generality, it would be nice if --make were more "open" so you
could hook into it. As it is, it's an all-or-nothing machine with a
single "go" button. That's convenient to use, but as soon as you
outgrow it, you can't use it at all anymore. I don't have a concrete
idea for what that would look like, but suppose ghc --make allowed me
to write my own shake rules or reconfigure some built-in ones that
know how to build haskell taking advantage of internal ghc features
like the package cache. Then I could run 'ghc --make-with
Shakefile.hs' and it would compile and load the rules in Shakefile.hs,
plugin-style, and I could naturally extend ghc to deal with my
arbitrarily complicated build setup. This would also generalize all
of those fiddly flags like -osuf and -outputdir that poke and prod ghc
into putting files where you want. GHC would become something like a
fancy shake library that has built-in support for compiling haskell
without having to invoke subprocesses. The benefits would be speed
(probably mostly due to package cache and recompilation avoidance...
shake goes purely based on file timestamp so it's not as smart as ghc)
and correctness (language aware dependency tracking, typed API vs.
cmdline flags). I guess this is pretty ambitious and complicated
though. Just a thought.

Maybe Neil is right that any usage of shake should be purely internal.
In that case, we'd basically continue with the status quo for --make,
just with a more principled internal implementation and faster startup
time. The part where you can extend with your own rules would go
unused. But, just as with the current situation, as soon as you need
to integrate C compilation, or generated files, or something else, you
need a more general build system, and at that point you likely wind up
dispensing with --make. In my case, haskell and C compilation is
intertwined due to hsc files, so it seemed simpler to go all-shake.

A less ambitious feature request would be for a -M-packages flag that
would also output the package deps. It seems like
-include-package-deps (which is misdocumented as requiring two
hyphens, btw) is just that, but doesn't seem to generate any
dependencies in practice. And maybe for those flags to have a
machine-readable stdout version, instead of wanting to modify a
Makefile.

Edward Z. Yang

Aug 4, 2015, 7:40:22 PM
to Neil Mitchell, shake-build-system
Excerpts from Neil Mitchell's message of 2015-07-30 02:03:08 -0700:
> > Well, there's an important distinction to be made here: ghc --make
> > parallelizes compilation over one process, whereas ghc-make parallelizes
> > compilation over many processes doing one-shot compilation.
>
> That's not why ghc-make goes faster - if anything, that's why it goes
> slower. Where ghc-make goes faster is checking if things should be
> rebuilt, because it has static information about dependencies (which
> header files are imported) and can do a very quick list of file/time
> checks. In comparison GHC parses each Haskell file and for ones with
> CPP even spawns an external process. On Windows, especially if you
> have a very aggressive virus scanner, that's incredibly slow.
>
> Put another way, GHC computes the graph then checks. Shake checks and
> then computes whatever subset of the graph has become dirty.

I.e., this bug: https://ghc.haskell.org/trac/ghc/ticket/1290
It's not been fixed... because people didn't want to add a cache.
I've made the interface file suggestion on the bug.

In fact, it would be very easy to add a cache over the GHC API:
just make sure the contents of hsc_mod_graph stick around between
compilations.

> > My attitude
> > is that if you really want to scale indefinitely over arbitrarily many
> > cores, you're going to need separate processes; but it's not clear
> > if it's desirable for ghc --make to start generating GHC subprocesses
> > to compile in parallel. (Maybe!)
>
> GHC does a LOT of caching between runs. Doing ghc --make is typically
> about 3 times faster than multiple ghc -c calls because it only loads
> the .hi files once. You want to stay in process for GHC (or perhaps
> fix the ghc -c speed thing - maybe parsing .hi files is too expensive?
> Hoogle just mmap's its database and peeks at it from there).

Yes, I understand that. But you win a lot from multicore.
I want to claim that this is why GHC does one-shot compilation,
but I'm pretty sure our parallel build system predates ghc -j
support, and another reason we don't use --make is so that
we can build multiple packages in parallel (even when they depend
on each other.)

Recently some people started looking at why ghc -j doesn't
scale as well: https://ghc.haskell.org/trac/ghc/ticket/9221
Maybe they will be able to fix it.

> I stated it would be difficult to reimplement Shake, or a mini-Shake.
> Shake is monadic and polymorphic - both things that vastly complicate
> the system. Monadic pretty much requires the separate metadata store,
> and all the serialisation headaches that brings. Polymorphic makes
> those serialisation headaches even worse. The system I sketched above
> is applicative and monomorphic - a parallel implementation is ~20
> lines.

(And hopefully integrates with full Shake!)

> > It is true that GHC's build system is not well separated internally,
> > but it *is* separated: ghc --make implements some logic for avoiding
> > recompilation, and there is also interface checking which lets us know
> > when recompilation is unnecessary (even if timetstamps have updated).
>
> Can you point at the "build system" inside GHC? I'd be keen to see the
> abstract interface.

GhcMake exports these two functions:

-- | Do dependency analysis on the 'hsc_targets' of the session,
-- returning a list of 'ModSummary's (representing things to build).
depanal :: GhcMonad m
        => [ModuleName]   -- ^ excluded modules
        -> Bool           -- ^ allow duplicate roots
        -> m ModuleGraph

-- | Try to compile/load 'hsc_targets' of the session.
load :: GhcMonad m => LoadHowMuch -> m SuccessFlag

These functions are a bit hard to understand because they depend
to no small degree on the session, 'HscEnv'; specifically:

hsc_targets :: [Target]
-- ^ The targets (or roots) of the current session

(where Target is something like a ModuleName plus some more wibbles.)
Shake would ostensibly replace these functions.
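
For context, here's a rough sketch (not from the thread) of how a GHC API client typically drives these two functions today; the module names and signatures are from the 7.10-era API and may differ between GHC versions, and `makeLike` is just an illustrative name:

import GHC
import GHC.Paths (libdir)   -- from the ghc-paths package

makeLike :: [String] -> IO SuccessFlag
makeLike files = runGhc (Just libdir) $ do
    dflags <- getSessionDynFlags
    _ <- setSessionDynFlags dflags
    -- hsc_targets is populated from the command-line roots.
    targets <- mapM (\f -> guessTarget f Nothing) files
    setTargets targets
    -- depanal builds the ModuleGraph; load walks it and does the builds
    -- (load would call depanal itself, but it is spelled out here).
    _graph <- depanal [] False
    load LoadAllTargets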

What would Shake call? The frontend that actually parses and
type-checks the module is:

genericHscFrontend :: ModSummary -> Hsc TcGblEnv

But this neither writes interface files to disk, generates code, nor
does recompilation avoidance. For some unfathomable reason, there are
two implementations of this functionality:

-- | Used by --make
compileOne :: HscEnv
           -> ModSummary        -- ^ summary for module being compiled
           -> Int               -- ^ module N ...
           -> Int               -- ^ ... of M
           -> Maybe ModIface    -- ^ old interface, if we have one
           -> Maybe Linkable    -- ^ old linkable, if we have one
           -> SourceModified
           -> IO HomeModInfo    -- ^ the complete HomeModInfo, if successful

-- | Used by -c
hscCompileOneShot :: HscEnv
                  -> ModSummary
                  -> SourceModified
                  -> IO HscStatus

Additionally, GHC has a "pipeline", which is responsible for handling
compilation of files that have multiple phases. For example, the
pipeline for compiling an hs file would go something like:

HsPp: Preprocess the file
Hsc: Compile the module to core (performing the recompilation check)
HscOut: Compile the core to assembly
As: Compile the assembly to object code

A mess? Yes definitely. It would be nice to know how to clean this up.

> > If ghc --make used shake, would it be theoretically possible to expose
> > it to be reused as part of a larger build system?
>
> My inclination is that GHC's use (or not use) of Shake should be an
> internal detail, so GHC shouldn't expose some Shake Rules in its API.
> However, if GHC could expose more dependency information in a friendly
> format you could imagine writing a ghc-shake project that took the
> data/functions and then produced Shake rules. Or maybe GHC already
> exposes the dependency information in a friendly format - I haven't
> looked.

I know you're aware of ghc -M; what dependency information beyond this
are you thinking of?

Edward

Edward Z. Yang

Aug 4, 2015, 8:46:27 PM
to Evan Laforge, Luke Hoersten, Shake build system, rymg19, "Neil Mitchell"
Excerpts from Evan Laforge's message of 2015-08-03 13:35:34 -0700:
> What I was angling at is if GHC could export the dependency graph it
> figures out and the packages it infers. Actually it looks like the
> first is already done via -M... though for some reason it requires
> -dep-suffix now, and it's a little bit awkward to parse it out of a
> generated Makefile. But as far as I can tell, there's no way to get
> the inferred package list out. It would also be a big speed boost to
> be able to reuse ghc's internal caches of parsed .hi files, this would
> get something similar to my ghc-server attempt from a long time ago.

What do you mean by the "inferred package list"? Surely you don't mean
the -package-id flags which you passed to GHC? Do you mean the set of
installed packages which GHC considers in scope? Do you actually just
mean that you want to create a giant module graph of multiple packages
for building? Or is it something more like you want to know given a
module name where its hi file lives? (You can, TODAY, get all of this
information from the GHC API.)

I think it would be most productive to talk about specific use-cases.
I think hsc2hs is a good example. I think how people imagined hsc2hs
would work is:

1. Your build system runs hsc2hs to generate some hs files
(Cabal knows how to do this)
2. You then run ghc --make which slurps in the hs files

So, in the normal case, there's no interleaving: you compile the hsc2hs
C programs and run them, and then you compile the Haskell code. I'm
curious to know what is going on in your setup that requires
interleaving. Does the C program need to link against some Haskell
libraries? (Surely not the libraries you are compiling?)

More generally, it seems wrong for GHC to be used as a build system.
You might *link* it as a library to your Haskell-based build system,
but it shouldn't actually be in the business of building your Haskell and
your Fortran and your CUDA and your...

> A less ambitious feature request would be for a -M-packages flag that
> would also output the package deps. It seems like
> -include-package-deps (which is misdocumented as requiring two
> hyphens, btw) is just that, but doesn't seem to generate any
> dependencies in practice.

At least in 7.10.1, -include-pkg-deps with -dep-suffix generates
external deps for me. For example, here was a simple program:

A.o-boot : A.hs-boot
A.o-boot : Dev/ghc-7.10.1/usr/lib/ghc-7.10.1/base_I5BErHzyOm07EBNpKBEeUv/Prelude.hi
B.o : B.hs
B.o : A.hi-boot
B.o : Dev/ghc-7.10.1/usr/lib/ghc-7.10.1/base_I5BErHzyOm07EBNpKBEeUv/Prelude.hi
A.o : A.hs
A.o : Dev/ghc-7.10.1/usr/lib/ghc-7.10.1/base_I5BErHzyOm07EBNpKBEeUv/Prelude.hi
A.o : B.hi

Maybe you want something else?

> And maybe for those flags to have a
> machine-readable stdout version, instead of wanting to modify a
> Makefile.

Sure, this is easy to do. Can you file a feature request? Propose a
flag for it and the desired semantics.

Edward

Evan Laforge

Aug 5, 2015, 5:11:56 PM
to Edward Z. Yang, Luke Hoersten, Shake build system, rymg19, Neil Mitchell
On Tue, Aug 4, 2015 at 5:46 PM, Edward Z. Yang <ezy...@mit.edu> wrote:
> What do you mean by the "inferred package list"? Surely you don't mean
> the -package-id flags which you passed to GHC? Do you mean the set of
> installed packages which GHC considers in scope? Do you actually just
> mean that you want to create a giant module graph of multiple packages
> for building? Or is it something more like you want to know given a
> module name where its hi file lives? (You can, TODAY, get all of this
> information from the GHC API.)

When you do --make, it will look at the imports, and figure out which
packages to link based on that. Of course, that means you don't have
an opportunity to select versions. I build a bunch of binaries from
the same source, so I just have a list of all the packages and
versions needed, and give that list to every binary. This works ok,
but is not so ideal since it's one giant global list for everyone, so
even a rarely built binary will contribute its heavy dependency to
every build. Or I could manually maintain a package list for every
binary separately, as cabal does, but that's annoying.

So if I could ask ghc about its inferred packages, I could then go
trim down the package list automatically.

This is also a lot like installing a cabal sandbox of consistent
versions, and then using --make to link in only the ones needed for a
particular binary.

> I think it would be most productive to talk about specific use-cases.
> I think hsc2hs is a good example. I think how people imagined hsc2hs
> would work is:
>
> 1. Your build system runs hsc2hs to generate some hs files
> (Cabal knows how to do this)
> 2. You then run ghc --make which slurps in the hs files
>
> So, in the normal case, there's no interleaving: you compile the hsc2hs
> C programs and run them, and then you compile the Haskell code. I'm
> curious to know what is going on in your setup that requires
> interleaving. Does the C program need to link against some Haskell
> libraries? (Surely not the libraries you are compiling?)

Well, I'm probably overstating the complexity of my situation. I just
mean that the actual compiling and preprocessing is interleaved just
due to them all being in the same dependency graph. E.g. a change in
a .h file will cause the .hsc that includes it to rebuild. Back when
I used make + ghc --make, there would be phase 1 which is compiling
C++ and hsc2hs in parallel, and then when that's all done, run a ghc
--make, which would then be single threaded, and then possibly a phase
3 which is running whatever tests I compiled, generating haddock,
running binaries to produce intermediate files, or whatever. Now that
I'm using shake, all those things happen in parallel and interleaved.

ghc --make also doesn't understand how to rebuild when things like
external configuration or library versions change. I think it can
basically figure out .hs dependencies and CPP #include dependencies
(presumably by virtue of checking for recompilation post-CPP), but that's
about it.

If I had C generated from haskell and then linked back into haskell
then --make would get increasingly awkward as I'd have to stack more
phases on, but fortunately I'm not that bad yet :) The ability to
avoid the multiple-phase problem due to generated files was one of
shake's original selling points.

> More generally, it seems wrong for GHC to be used as a build system.
> You might *link* it as a library to your Haskell-based build system,
> but it not actually be in the business of building your Haskell and
> your Fortran and your CUDA and your...

Yeah, it does seem weird to use GHC to compile, say, C. But come to
think of it... it already kind of does that, doesn't it? Just in a
very basic way. Since building means doing lots of non-haskell stuff,
ghc --make won't be able to integrate with the rest of the system.
That's fine since you can use traditional 'ghc -c'... but meanwhile
ghc --make has nice features that you lose when you do that.

So how to get the nice features out of ghc --make so a larger system
can use them? Maybe dump the cache in some file which is really fast
to save and load? Or even a memcache kind of thing? Or keep ghcs
running persistently like my ghc-server attempt? Or, as you say, link
ghc in as a library to the shakefile so the cache can be passed around
in memory. In fact, this is probably already possible with the GHC
API. In practice it would just be another variant on ghc-server.

> At least in 7.10.1, -include-pkg-deps with -dep-suffix generates
> external deps for me. For example, here was a simple program:
>
> A.o-boot : A.hs-boot
> A.o-boot : Dev/ghc-7.10.1/usr/lib/ghc-7.10.1/base_I5BErHzyOm07EBNpKBEeUv/Prelude.hi

Hmm, I tried this again and it works now, I was probably just doing it
wrong. That's almost what I want... I'd rather have a package name,
but it could be parsed out of the path.

>> And maybe for those flags to have a
>> machine-readable stdout version, instead of wanting to modify a
>> Makefile.
>
> Sure, this is easy to do. Can you file a feature request? Propose a
> flag for it and the desired semantics.

Sure, done at https://ghc.haskell.org/trac/ghc/ticket/10741

Maybe I should give this a try myself, I imagine I'd just have to see
what -M does, and make a simpler version of that.

Edward Z. Yang

Dec 11, 2015, 3:06:40 AM
to shake-build-system
Hello all,

I'm actually writing this. I worked out how to architecturally make
this work: ghc --shake will be a "frontend plugin", ala
https://ghc.haskell.org/trac/ghc/ticket/11194, at which point
it's OK to just depend on Shake directly (and I get all access to
GHC using the GHC API, AND I don't have to reimplement all of
the parsing nonsense). The starting plan will be all in-process
builds; after that we can think about spawning ghc subprocesses
(at which point we'll subsume ghc-make).

One problem I ran into was what to write as a rule for object
files. We can't assume that it's generated by a Haskell file;
in the Shake manual, the suggestion is that you make .hs.o files
and .c.o files. But I don't want to do this, because I want to
be compatible with ghc --make's behavior.

While I was writing this message, I realized that ghc-make had to solve
this problem. It looks like they use an oracle to figure out what the
input module is supposed to be? I'm planning on following the rule
structure of ghc-make, but if anyone has any extra tips about it,
I'm all ears.

Edward

Edward Z. Yang

Dec 12, 2015, 4:11:50 AM
to shake-build-system
A few oddities:

1. When I say ghc --make A.c, the rule for making A.o should be
to look for and compile A.c, rather than look for A.hs. The
way I've implemented this is to add extra, higher priority rules
for these builds. But somehow, I feel like I'm not supposed to
use Shake this way; are there best practices for dynamically
configured rules?

2. Suppose I am building B which imports A, in which case I need A.hi.
Further, suppose I am building with the flags -isrc1 -isrc2 (and
no -hidir/-odir flags).

According to GHC's semantics, the full path of the A.hi I should
depend on is either src1/A.hi or src2/A.hi, based on where
the source file lived.

That's all very fine and well, but what's the actual suggested
way I should go about coding this up? The best I can think of
is to do the probe when we're looking for the dependency, and
cache the result.

3. Same example, but build with ghc --make A.hs B.hs -i, which forces
there to be no import path. Now, the only way that GHC knows
that A.hi lives in the current directory is the fact that it
forces explicitly specified hs files to get added to the
"finder cache". Another situation where I seem to need to adjust
my rules.

Edward


Neil Mitchell

Dec 12, 2015, 5:16:45 AM
to Edward Z. Yang, shake-build-system
Hi Edward,

I've only got a little time while my son hides in a cupboard, so a few
best-quick-effort answers to help you on your way:

ghc-make has two features: it gives you parallelism and fast rebuilds.
Before GHC --make supported parallelism the first of these was useful,
now it's not. The second of these avoids running the preprocessor and
in certain circumstances can save minutes off the nothing-to-do build.
While the second purpose is useful, I have a vague plan to replace it
with a generic solution that is less ghc/cabal specific, and works for
things like make/scons just as well:
https://github.com/jacereda/fsatrace/issues/8

The way ghc-make works is it generates a Makefile with -M, but uses
weird hisuf/osuf options so it can unambiguously identify Haskell
sources vs C sources. The exact semantics of these flags seems to
change in every GHC version, so it's been a struggle to keep up.

What's the reason for wanting to use Shake to replace ghc --make?
Simplicity? Less code? Extra features (profiling, timestamp vs
hashes)? Performance?

I wouldn't bother trying to do out of process ghc --make using Shake,
since out of process goes much slower.

1) For dynamically configured rules, you'd generally do:

"*.o" %> \out -> do
b <- computeWhetherItShouldBeHaskellOrC
if b then ... else ...

That computation might be looking at an oracle, querying the existence
of the .c source etc.

Adding higher priority rules is generally a bad plan, and rarely works
out - if you have two rules for "*.o" at different priorities, then the
lower priority one is pointless.

The other approach is to have a complete file to action mapping,
stored in something like Map FilePath Action, then use ?> to look them
up (see the sketch after these answers).

2) Generally you would firstJustM doesFileExist possiblePaths, using
firstJustM from the extra package and doesFileExist in the Action
monad. Then you are also declaring dependencies on the existence of
files, so if the user adds A.hi earlier on the include path it
rebuilds properly.

3) I guess given that behaviour, you have to look at the finder cache
in some way while figuring out the location. Putting the finder cache
into an oracle is one option.
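
To illustrate (1) and (2) above, here's a rough sketch (mine, not from the thread) of what the pieces might look like inside a Shake build script; objectRules, mapRules, findModuleHi, compileHs and compileC are hypothetical names, and the two Rules values are alternatives rather than something you would install together:

import Control.Monad.Extra (firstJustM)          -- from the 'extra' package
import qualified Data.Map as Map
import Development.Shake
import Development.Shake.FilePath ((-<.>), (<.>), (</>))

-- (1a) A single "*.o" rule that decides dynamically whether the object
-- comes from a Haskell or a C source, by probing for the .c file.
objectRules :: Rules ()
objectRules =
    "*.o" %> \out -> do
        hasC <- doesFileExist (out -<.> "c")
        if hasC then compileC out else compileHs out

-- (1b) Or: a complete file-to-action mapping, looked up via ?>.
mapRules :: Map.Map FilePath (Action ()) -> Rules ()
mapRules ruleMap = (`Map.member` ruleMap) ?> \out -> ruleMap Map.! out

-- (2) Find which import directory provides the .hi for a module, declaring
-- existence dependencies along the way so the build reruns if an earlier
-- directory on the path gains the file.
findModuleHi :: [FilePath] -> String -> Action (Maybe FilePath)
findModuleHi importDirs modName = firstJustM probe importDirs
  where
    probe dir = do
        let hi = dir </> modName <.> "hi"
        exists <- doesFileExist hi
        return (if exists then Just hi else Nothing)

-- Placeholders for the real compilation actions.
compileHs, compileC :: FilePath -> Action ()
compileHs _ = return ()
compileC  _ = return ()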

Thanks, Neil

Edward Z. Yang

Dec 12, 2015, 11:45:19 PM
to Neil Mitchell, shake-build-system
Excerpts from Neil Mitchell's message of 2015-12-12 02:16:44 -0800:
> Hi Edward,
>
> I've only got a little time while my son hides in a cupboard, so a few
> best-quick-effort answers to help you on your way:
>
> ghc-make has two features, it gives you parallelism and fast-rebuilds.
> Before GHC --make supported parallelism the first of these was useful,
> now it's not. The second of these avoids running the preprocessor and
> in certain circumstances can save minutes off the nothing-to-do build.
> While the second purpose is useful, I have a vague plan to replace it
> with a generic solution that is less ghc/cabal specific, and works for
> things like make/scons just as well:
> https://github.com/jacereda/fsatrace/issues/8
>
> The way ghc-make works is it generates a Makefile with -M, but uses
> weird hisuf/osuf options so it can unambiguously identify Haskell
> sources vs C sources. The exact semantics of these flags seems to
> change in every GHC version, so it's been a struggle to keep up.
>
> What's the reason for wanting to use Shake to replace ghc --make?
> Simplicity? Less code? Extra features (profiling, timestamp vs
> hashes)? Performance?
>
> I wouldn't bother trying to do out of process ghc --make using Shake,
> since out of process goes much slower.

Performance and profiling, but also as a way to figure out how to make
GHC more build system friendly. Specifically:

1. If you run `ghc -M` it has to scan and parse every single file,
even if the timestamp didn't update. There isn't really a good
way for GHC to work around this because ghc --make is constrained
to not create any database (like Shake does). This results in
a noticeable delay for big projects, e.g. GHC's.

2. I want to put GHC's calls to the preprocessor through the build
system. Though I will admit file tracing is an interesting way to
solve this (see also buildsome).

3. I want the profiler. If I have a project which compiles slowly,
I want a tool that can tell me what is going slow, so I can try
to speed up compilation / arrange for more parallelism in the module
graph.

4. Maybe separate process parallelism is slower on one core, but
some of us have 12 cores and a very parallel module graph. GHC's
build benefits a lot from separate process parallel build. Maybe
I can even implement the parallel GHC build-process optimization.

5. I want to make an integrated GHC/Cabal build system in Shake,
so that I can parallelize across packages. This is not something
that you can do if you treat a Cabal script as a black box build
system.

And I want it to be a drop-in replacement, like ghc-make.

> 1) For dynamically configured rules, you'd generally do:
>
> "*.o" %> \out -> do
> b <- computeWhetherItShouldBeHaskellOrC
> if b then ... else ...
>
> That computation might be looking at an oracle, querying the existence
> of the .c source etc.
>
> Adding higher priority rules is generally a bad plan, and rarely works out
> - if you have two rules for "*.o" at different priorities, then the
> lower priority one is pointless.

That's not what I'm doing: I have a low priority "*.o" rule for Haskell
files, and then "foo.o", "bar.o", etc rules for every file which I know
has a C file. But maybe I'll just move this into the "*.o" rule.

> The other approach is to have a complete file to action mapping,
> stored in something like Map FilePath Action, then use ?> to look them
> up.

But surely the file to action mapping should not be allowed to
dynamically change between runs of the build system? I'm very unclear
about what you're allowed to change in a Shake build script before
you have to "bump" the build system version.

> 2) Generally you would firstJustM doesFileExist possiblePaths, using
> firstJustM from the extra package and doesFileExist in the Action
> monad. Then you are also declaring dependencies on the existence of
> files, so if the user adds A.hi earlier on the include path it
> rebuilds properly.

Ooh, nice, I had reimplemented firstJustM. Will use that.

> 3) I guess given that behaviour, you have to look at the finder cache
> in some way while figuring out the location. Putting the finder cache
> into an oracle is one option.

OK.

Thanks for the advice!
Edward

Edward Z. Yang

Dec 13, 2015, 6:50:19 PM
to Neil Mitchell, shake-build-system
OK, another quick question.

When I compile a module with GHC, I get both a .o and a .hi file, but
also an *in-memory* data structure (called HomeModInfo), which another
compilation may depend on.

It would be simple enough to make a new Rule whose input is a "module
name" and whose output is "a .o file, a .hi file, and the in-memory
data structure", except that I don't want to write a Binary instance
for this data structure. You see, the *hi* file is what caches this
structure, so I don't want to reimplement this with Shake. Essentially,
if no up-to-date hi file exists, I want Shake to build it and
return the correct data structure; however, if an up-to-date hi file
exists, I want to just run GHC's machinery for reading interface files
to get an up-to-date file.

This seems to suggest an overlapping rule for getting a HomeModInfo:

1. You can build it from an hs file; you'll also get a .o and .hi
file in the process.

2. You can build it from an hi file.

If .o/.hi need to be updated, you should prefer (1); otherwise do (2).

Is there actually a reasonable way to do this?

Edward


Neil Mitchell

Dec 13, 2015, 7:07:55 PM
to Edward Z. Yang, shake-build-system
> Performance and profiling, but also as a way to figure out how to make
> GHC more build system friendly.

These all seem good reasons.

> This results in a noticeable delay for big projects, e.g. GHC's.

For the specific case of the GHC build system, when it's converted to
Shake, there won't be an issue. But certainly the delay hits many
other projects.

> 4. Maybe separate process parallelism is slower on one core, but
> some of us have 12 cores and a very parallel module graph. GHC's
> build benefits a lot from separate process parallel build. Maybe
> I can even implement the parallel GHC build-process optimization.

On Windows, at least with my particularly configured 24 Core system
(it has a very aggressive anti-virus) out-of-process is slower than
everything. But interesting to hear it isn't a universal story.

> That's not what I'm doing: I have a low priority "*.o" rule for Haskell
> files, and then "foo.o", "bar.o", etc rules for every file which I know
> has a C file. But maybe I'll just move this into the "*.o" rule.

That works just fine. In general, any rule which matches a literal
pattern has higher priority than a wildcard one, so you might not even
need the priorities.

>
>> The other approach is to have a complete file to action mapping,
>> stored in something like Map FilePath Action, then use ?> to look them
>> up.
>
> But surely the file to action mapping should not be allowed to
> dynamically change between runs of the build system? I'm very unclear
> about what you're allowed to change in a Shake build script before
> you have to "bump" the build system version.

Good question, but no good answer - I've raised
https://github.com/ndmitchell/shake/issues/347 to document it. In
general it all works out, provided the actions don't change for a
single entry without also affecting its dependencies. I've never had a
problem, but I can't immediately say which preconditions I am
following...

> Ooh, nice, I had reimplemented firstJustM. Will use that.

firstJust is also very useful, particularly in strict Haskell :)

> This seems to suggest an overlapping rule for getting a HomeModInfo:
>
> 1. You can build it from an hs file; you'll also get a .o and .hi
> file in the process.
>
> 2. You can build it from an hi file.
>
> If .o/.hi need to be updated, you should prefer (1); otherwise do (2).

In all cases you want an up-to-date and existing .hi file first, so do
need ["foo.hi"]

After that, it seems that HomeModInfo is just a cache, so I'd have a
cache, not visible or integrated with Shake in any way. Compiling puts
it into the cache. If it's not in the cache and you ask for it, you read
the hi file.

The cache is really abstracting over the idea that in some cases you
can short-circuit the write-to-file/read-from-file pattern.
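
A minimal sketch (mine, not Neil's) of that cache, assuming hypothetical helpers compileModule and readIfaceToHomeModInfo that wrap the GHC API; the cache itself is just an IORef holding a Map, invisible to Shake:

import Data.IORef (IORef, modifyIORef', newIORef, readIORef)
import qualified Data.Map.Strict as Map

type ModName = String
data HomeModInfo = HomeModInfoStub   -- stands in for GHC's real HomeModInfo

type Cache = IORef (Map.Map ModName HomeModInfo)

newCache :: IO Cache
newCache = newIORef Map.empty

-- Called after the Shake rule has need'ed an up-to-date foo.hi: use the
-- in-memory copy if we compiled foo this run, otherwise read the .hi back in.
getHomeModInfo :: Cache -> ModName -> IO HomeModInfo
getHomeModInfo cache m = do
    cached <- Map.lookup m <$> readIORef cache
    case cached of
        Just hmi -> return hmi
        Nothing  -> do
            hmi <- readIfaceToHomeModInfo m
            modifyIORef' cache (Map.insert m hmi)
            return hmi

-- Called by the compile rule: compile, then populate the cache.
compileAndCache :: Cache -> ModName -> IO HomeModInfo
compileAndCache cache m = do
    hmi <- compileModule m
    modifyIORef' cache (Map.insert m hmi)
    return hmi

-- Hypothetical stand-ins for the GHC API calls (e.g. compileOne and the
-- interface loader); not defined here.
readIfaceToHomeModInfo :: ModName -> IO HomeModInfo
readIfaceToHomeModInfo = undefined
compileModule :: ModName -> IO HomeModInfo
compileModule = undefined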

Thanks, Neil