Generalizing "need" and "want" for computations instead of compilations


max.ri...@gmail.com

unread,
Apr 3, 2013, 4:00:22 AM4/3/13
to shake-bui...@googlegroups.com
Hi Neil and others,

I am working on an idea to use Shake as a system to manage and automate a computational project (as part of my PhD research). The current "state of the art" is to write a huge bash script that simulates a bunch of stuff and, when something breaks, to do a bunch of commenting and copy-pasting, then repeat. I am trying to move to a "descriptive" approach, where you describe how a particular thing is computed and what it depends on, and Shake figures out the rest.

For example:

I have to simulate 150 things based on model A0, which gives me update information so that I can create a new model A1, then repeat all 150 (or fewer, depending on the update perhaps) to get A2, etc.

Each simulation has its own small dependencies, some are files, but some are previous simulations.

I've been digging around in the source code and have been thinking about writing a new type of "need" called needCompute, which looks at a DB and checks whether a certain thing has been computed yet. In general, I want a "need" method that doesn't look for a file, but rather checks the status of a particular computation. I imagine storing the status in some sort of persistent file or database, but I'd love to reuse what is already in Shake.

So for example:
------------------------------------------------
want [NextModel]

NextModel %> \x -> do
    -- Source 1, Source 2, etc. are all simulations that need to be done
    needCompute $ map Source [1..10]
    -- update the model based on results from the Source computations
    updateModel

Source %> \src_id -> do
    -- run the simulation for src_id
    system' "./run_src" [show src_id]

--------------------------------------------------

In this example, the (%>) function "completes" the (Source x) computations in the database/file, and when they are all completed, the "NextModel" computation can proceed.

I'm still thinking through a lot of the details here, especially how the type abstractions can work in a general way for different types of computation: specifically, how you would specify a type of computation (Source Int, Gradient [Source], ModelUpdate Gradient, etc.) while taking advantage of the type system's polymorphism.
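[Editor's note: one hedged sketch of the kind of key type described above. All names here (Computation, keyPath) are hypothetical, not from Shake or the thread: the different computation types become constructors of a single data type, and each maps to the file that stores its result, so ordinary file-based need/want could track them.]

```haskell
import Data.List (intercalate)

-- A single key type covering the different kinds of computation.
-- Gradient is simplified to take source ids rather than Source values.
data Computation
  = Source Int      -- simulation for a given source id
  | Gradient [Int]  -- gradient built from a set of source results
  | Model Int       -- model after n update iterations
  deriving (Show, Eq)

-- Map each computation to the file that stores its result, so that
-- ordinary file rules can be written against these paths.
keyPath :: Computation -> FilePath
keyPath (Source n)    = "source-" ++ show n ++ ".result"
keyPath (Gradient ns) = "gradient-" ++ intercalate "_" (map show ns) ++ ".result"
keyPath (Model n)     = show n ++ ".model"

main :: IO ()
main = mapM_ (putStrLn . keyPath) [Source 1, Gradient [1, 2], Model 0]
```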

cheers,
Max

Neil Mitchell

unread,
Apr 3, 2013, 4:37:32 AM4/3/13
to max.ri...@gmail.com, shake-bui...@googlegroups.com
Hi Max,

Interesting project, and one that certainly seems like it should be
possible in Shake. There are two stack overflow questions that are
worth reading to get some ideas:

* How to deal with transitive commands, for example A0 -> A1 -> A2 ...
: http://stackoverflow.com/questions/14622169/how-to-write-fixed-point-build-rules-in-shake-e-g-latex

* Why files might be a better option than storing additional
information in the database:
http://stackoverflow.com/questions/14631978/how-to-define-custom-rule-in-shake-development-shake-core-is-hidden

As a first approach, I would try encoding your nodes as files. You
could imagine the scheme:

n.model - a model, where 0.model is the initial starting state
final.model - the final model after everything converges
n-m.sim - simulation m on model n

You have a custom rule for 0.model, or require it as a source. You
follow the transitive pattern to generate successive models and then a
final model. Each simulation writes its list of updates to its own
file (0-1.sim, 0-2.sim, 0-3.sim, ...), and those files are required to
produce 1.model. In essence, instead of having a database or custom
file storing all the data, you split each piece of data out into its
own file. Then you just reuse need/want, and everything works as
expected.
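[Editor's note: a rough sketch of how this scheme might look as Shake rules. The script names run_src and update_model, the simulation count, and the fixed iteration count in want are assumptions, not from the thread; a production version would detect convergence using the fixed-point pattern from the first Stack Overflow link rather than hard-coding 3.model.]

```haskell
import Development.Shake
import Development.Shake.FilePath

-- Assumption: 4 simulations per model (the thread mentions 150).
simsPerModel :: Int
simsPerModel = 4

main :: IO ()
main = shakeArgs shakeOptions $ do
  -- Ask for a fixed number of iterations; a convergence test would
  -- follow the fixed-point pattern from the linked question instead.
  want ["3.model"]

  -- n-m.sim: simulation m run against model n.
  "*-*.sim" %> \out -> do
    let (n, m) = break (== '-') (takeBaseName out)
    need [n ++ ".model"]
    cmd_ "./run_src" n (drop 1 m)  -- assumed to write `out`

  -- n.model: either the seed (0.model) or built from the
  -- simulations run against the previous model.
  "*.model" %> \out -> do
    let n = read (takeBaseName out) :: Int
    if n == 0
      then copyFile' "initial.model" out
      else do
        need [show (n - 1) ++ "-" ++ show m ++ ".sim" | m <- [1 .. simsPerModel]]
        cmd_ "./update_model" (show (n - 1)) out
```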

The disadvantages of this approach are that you have lots more files,
and that you may end up deserialising the same file more than once;
but these are purely performance issues, and can easily be remedied
later (after profiling) without any big restructuring of the actual
logic. I assume that one day someone will create need/want variants
that work against an SQL database, perhaps with an in-memory cache of
recently accessed data, but it has not been necessary yet.

Thanks, Neil

Neil Mitchell

unread,
Apr 3, 2013, 7:00:55 AM4/3/13
to max.ri...@gmail.com, shake-bui...@googlegroups.com
Another way of answering your question:

"Generalizing "need" and "want" for computations instead of compilations"

is that "need" and "want" are already generalised over
compilations/computations; the only things they talk about are _files_.
If you make your computations read and write files, then need/want are
already sufficient. If you want your computations to be persistent in
any way, and to allow partial rebuilds after changes, you're going to
need to save your computations somewhere, and files are a reasonable
first go.

Thanks, Neil

trigsci

unread,
Oct 7, 2014, 6:40:43 PM10/7/14
to shake-bui...@googlegroups.com
I have a similar issue, where multiple files are created based on a shared computation. If the computation is easily serialisable, I can create a temp file as below. But what approach should I use if it is not? How could one pass data from one rule into another? Could I create a phony rule and use an MVar, STM, or some other global memory to share the result?

import Development.Shake
import qualified Data.ByteString as B

main = shakeArgs shakeOptions $ do
    want ["needsresult.a", "needsresult.b"]

    "results.bin" *> \out -> do
      result <- complexCalc
      liftIO $ B.writeFile out result

    "needsresult.a" *> \out -> do
      need ["results.bin"]
      result <- liftIO $ B.readFile "results.bin"
      need $ requiredStuffA result
      doSomethingForA out result

    "needsresult.b" *> \out -> do
      need ["results.bin"]
      result <- liftIO $ B.readFile "results.bin"
      need $ requiredStuffB result
      doSomethingForB out result

Neil Mitchell

unread,
Oct 8, 2014, 5:46:23 PM10/8/14
to trigsci, shake-bui...@googlegroups.com
Using your own MVar/IORef to communicate between rules is generally a
poor idea - it's very difficult to get right without either deadlocks
or messed-up dependencies.

If the output isn't serialisable (or even if it is, but
deserialisation is expensive) you should use newCache (see
http://hackage.haskell.org/package/shake-0.13.4/docs/Development-Shake.html#v:newCache).
However, note that if the result can't be serialised, then whenever
requiredStuffA changes, complexCalc will be rerun - unless you can
serialise the results of complexCalc, I don't see any way around that.
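[Editor's note: a hedged sketch of the newCache approach applied to the earlier example. complexCalc and the rule bodies are stand-ins for the poster's undefined helpers, not real implementations.]

```haskell
import Development.Shake
import qualified Data.ByteString as B

-- Stand-in for the expensive shared computation from the question.
complexCalc :: Action B.ByteString
complexCalc = pure (B.pack [1, 2, 3])

main :: IO ()
main = shakeArgs shakeOptions $ do
  -- newCache memoises the read: results.bin is deserialised at most
  -- once per run, however many rules ask for it, and the `need`
  -- inside keeps the dependency tracked correctly.
  getResult <- newCache $ \file -> do
    need [file]
    liftIO (B.readFile file)

  want ["needsresult.a", "needsresult.b"]

  "results.bin" %> \out -> do
    result <- complexCalc
    liftIO (B.writeFile out result)

  "needsresult.a" %> \out -> do
    result <- getResult "results.bin"
    liftIO (B.writeFile out result)  -- stand-in for doSomethingForA

  "needsresult.b" %> \out -> do
    result <- getResult "results.bin"
    liftIO (B.writeFile out result)  -- stand-in for doSomethingForB
```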

Thanks, Neil