Making Shake git-aware

43 views
Skip to first unread message

Joachim Breitner

unread,
May 4, 2015, 10:22:46 AM5/4/15
to shake-bui...@googlegroups.com
Hi,

a shake-based application of mine reads files stored in a git
repository. For disk space efficiency reasons I do not want to have the
files checked-out, but rather access a bare repository. So I set out to
implement a shake extension providing these function:

doesGitFileExist :: RepoPath -> FilePath -> Action Bool
readGitFile :: RepoPath -> FilePath -> Action BS.ByteString

I want to enable shake to exploit git’s particulars, i.e. that the list
of files associated with a commit is determined by that commit’s SHA
hash, and the same for the contents of a file.

My idea was to create custom rules: One to resolve a reference (here
always "HEAD") to a commit hash, and one to resolve a filename to the
hash of that file’s blob:

So here is my code for that:

newtype GetGitReferenceQ = GetGitReferenceQ (RepoPath, RefName)
deriving (Typeable,Eq,Hashable,Binary,NFData,Show)

newtype GitSHA = GitSHA T.Text
deriving (Typeable,Eq,Hashable,Binary,NFData,Show)

newtype GetGitFileRefQ = GetGitFileRefQ (RepoPath, T.Text, FilePath)
deriving (Typeable,Eq,Hashable,Binary,NFData,Show)

instance Rule GetGitReferenceQ GitSHA where
storedValue _ (GetGitReferenceQ (repoPath, name)) =
Just . GitSHA <$> getGitReference' repoPath name

instance Rule GetGitFileRefQ (Maybe T.Text) where
storedValue _ (GetGitFileRefQ (repoPath, ref', filename)) =
Just <$> getGitFileRef' repoPath ref' filename

(The primed versions are IO functions, using gitlib)

Based on that, the publicly facing functions are implemented as

doesGitFileExist :: RepoPath -> FilePath -> Action Bool
doesGitFileExist repoPath fn = do
GitSHA ref' <- apply1 $ GetGitReferenceQ (repoPath, "HEAD")
res <- apply1 $ GetGitFileRefQ (repoPath, ref', fn)
return $ isJust (res :: Maybe T.Text)

readGitFile :: FilePath -> FilePath -> Action BS.ByteString
readGitFile repoPath fn = do
GitSHA ref' <- apply1 $ GetGitReferenceQ (repoPath, "HEAD")
res <- apply1 $ GetGitFileRefQ (repoPath, ref', fn)
case res of
Nothing -> fail "readGitFile: File does not exist"
Just ref' -> liftIO $ withRepository lgFactory repoPath $ do
ref <- parseOid ref'
catBlob (Tagged ref)

And finally, judging from how the other modules do this, I have to add
this bit, which needs to be included by program that uses my module:


defaultRuleGitLib :: Rules ()
defaultRuleGitLib = do
rule $ \(GetGitReferenceQ (repoPath, refName)) -> Just $ liftIO $
GitSHA <$> getGitReference' repoPath refName
rule $ \(GetGitFileRefQ (repoPath, ref', fn)) -> Just $ liftIO $
getGitFileRef' repoPath ref' fn

Full code at: https://github.com/nomeata/gipeda/blob/1bb22/src/Development/Shake/Gitlib.hs


Is that a correct use of the Rule machinery?

Why is there a duplication in defaultRuleGitLib and the storedValue method? Is also normal?


Thanks,
Joachim

--
Joachim “nomeata” Breitner
ma...@joachim-breitner.dehttp://www.joachim-breitner.de/
Jabber: nom...@joachim-breitner.de • GPG-Key: 0xF0FBF51F
Debian Developer: nom...@debian.org
signature.asc

John Wiegley

unread,
May 4, 2015, 12:42:29 PM5/4/15
to Joachim Breitner, shake-bui...@googlegroups.com
>>>>> Joachim Breitner <ma...@joachim-breitner.de> writes:

> I want to enable shake to exploit git’s particulars, i.e. that the list of
> files associated with a commit is determined by that commit’s SHA hash, and
> the same for the contents of a file.

> My idea was to create custom rules: One to resolve a reference (here always
> "HEAD") to a commit hash, and one to resolve a filename to the hash of that
> file’s blob:

Note that Gitlib would make this extremely easy to add.

John

Joachim Breitner

unread,
May 4, 2015, 1:22:03 PM5/4/15
to shake-bui...@googlegroups.com
Hi,

note that I did, I'm wondering if I did it right :-)

Have a look at the linked code, you could check that I'm using gitlib sanely :-)

Greetings, Joachim

Neil Mitchell

unread,
May 4, 2015, 5:03:32 PM5/4/15
to Joachim Breitner, shake-bui...@googlegroups.com
Looks thoroughly reasonable to me - exactly how I'd expect it to be
written. A very nice example of what I'd expect custom rules to be
used for, and how they can be written.

The duplication between storedValue and somewhere else is fairly
common - for example in the built-in file rules I define
defaultRuleFile in terms of storedValue (see
https://github.com/ndmitchell/shake/blob/master/src/Development/Shake/Rules/File.hs#L90).
As usual in Haskell, if there's too much duplication just drag it out
into a function, but here there doesn't seem to be much of an issue.

Thanks, Neil

John Wiegley

unread,
May 4, 2015, 10:34:21 PM5/4/15
to Joachim Breitner, shake-bui...@googlegroups.com
>>>>> Joachim Breitner <ma...@joachim-breitner.de> writes:

> note that I did, I'm wondering if I did it right :-)
>
> Have a look at the linked code, you could check that I'm using gitlib
> sanely :-)

Ah, sorry Joachim! I didn't read to the end to see that the filename was
Gitlib.hs. The code looks good to me. :)

John

Joachim Breitner

unread,
May 5, 2015, 3:40:06 AM5/5/15
to shake-bui...@googlegroups.com
Hi,

Am Montag, den 04.05.2015, 22:03 +0100 schrieb Neil Mitchell:
> Looks thoroughly reasonable to me - exactly how I'd expect it to be
> written. A very nice example of what I'd expect custom rules to be
> used for, and how they can be written.

thanks. It doesn’t quite work as I wanted, though. Here is a small
example program:

main = shakeArgs shakeOptions $ do
defaultRuleGitLib

"*.out" *> \out -> do
let base = takeBaseName out
bs <- readGitFile "repo" (base <.> "txt")
liftIO $ putStrLn $ "Writing " ++ base
writeFile' out (show (BS.length bs))

action $ forM_ ["a","b"] $ \base -> do
ex <- doesGitFileExist "repo" (base <.> "txt")
when ex $ need [base <.> "out"]

With the code I posted, i.e.

readGitFile :: FilePath -> FilePath -> Action BS.ByteString
readGitFile repoPath fn = do
GitSHA ref' <- apply1 $ GetGitReferenceQ (repoPath, "HEAD")
res <- apply1 $ GetGitFileRefQ (repoPath, ref', fn)
case res of
Nothing -> fail "readGitFile: File does not exist"
Just ref' -> liftIO $ withRepository lgFactory repoPath
$ do
ref <- parseOid ref'
catBlob (Tagged ref)

it would re-build "a.out" if there is a new commit, even if that commit
leaves "a.txt" unchanged. Which makes sense, as one of the rules asks
“what is the commit named "HEAD"?”.

I had to move the resolving of "HEAD" into GetGitFileRefQ and write

readGitFile :: FilePath -> FilePath -> Action BS.ByteString
readGitFile repoPath fn = do
res <- apply1 $ GetGitFileRefQ (repoPath, "HEAD", fn)
case res of
Nothing -> fail "readGitFile: File does not exist"
Just ref' -> liftIO $ withRepository lgFactory repoPath
$ do
ref <- parseOid ref'
catBlob (Tagged ref)

Now I get the desired behavior (a new commit that leaves "a.txt"
untouched does not trigger a re-build of "a.out"). But shake will still
query the hash of all files, even if HEAD has not changed at all!

Is there a way to make shake do nothing but check the value of HEAD in
this case? Is my use of "action" wrong here?

I tried to use the HEAD-resolving query in the rule for the
file-lookup-query:

defaultRuleGitLib :: Rules ()
defaultRuleGitLib = do
rule $ \(GetGitReferenceQ (repoPath, refName)) -> Just $ liftIO $
GitSHA <$> getGitReference' repoPath refName
rule $ \(GetGitFileRefQ (repoPath, refName, fn)) -> Just $ do
GitSHA ref' <- apply1 $ GetGitReferenceQ (repoPath, "HEAD")
liftIO $ getGitFileRef' repoPath ref' fn

But it would still run "getGitFileRef'" even with an unchanged
repository.


And a more conceptual question: When should `storedValue` return
`Nothing` and when `Just`?

Greetings,
signature.asc

Neil Mitchell

unread,
May 6, 2015, 12:48:49 PM5/6/15
to Joachim Breitner, shake-bui...@googlegroups.com
> And a more conceptual question: When should `storedValue` return
> `Nothing` and when `Just`?

In the Rule class (but not actually next to storedValue) it says: "A
way to query the current state of an artifact, with storedValue
returning the current state, or Nothing if there is no current state
(e.g. the file does not exist). " So approximately return Nothing if
the data just doesn't exist on disk.

> I tried to use the HEAD-resolving query in the rule for the
> file-lookup-query:
>
> defaultRuleGitLib :: Rules ()
> defaultRuleGitLib = do
> rule $ \(GetGitReferenceQ (repoPath, refName)) -> Just $ liftIO $
> GitSHA <$> getGitReference' repoPath refName
> rule $ \(GetGitFileRefQ (repoPath, refName, fn)) -> Just $ do
> GitSHA ref' <- apply1 $ GetGitReferenceQ (repoPath, "HEAD")
> liftIO $ getGitFileRef' repoPath ref' fn
>
> But it would still run "getGitFileRef'" even with an unchanged
> repository.

So I think that is what you should be doing, and it should (to a first
approximation) work the way you were hoping for. You roughly have the
dependency chain:

GetGitReferenceQ <- GetGitFileRefQ <- readGitFile

And since Shake should break the dependency chain whenever a value is
equal, if HEAD doesn't change, GetGitFileRefQ shouldn't be
recalculated.

The problem (I suspect) is that the storedValue instance for
GetGitFileRefQ is being invoked. Can you trace inside the storedValue
to confirm that's where the reference is being checked?

I think the solution to that is to follow the Rule advice "For rules
whose values are not stored externally, storedValue should return Just
with a sentinel value and equalValue should always return EqualCheap
for that sentinel.". I know the value is stored externally, but its
stored externally with perfect tracking via the SHA dependency, which
means it might as well be not stored.

While I think that probably solves your problem (please let me know),
I'm not happy with it - the storedValue and defaultRule seem to be
fighting each other. Maybe the concept of storedValue just isn't quite
working here and should be rethought somewhat.

Thanks, Neil

Joachim Breitner

unread,
May 6, 2015, 3:32:53 PM5/6/15
to shake-bui...@googlegroups.com
Hi,

Am Mittwoch, den 06.05.2015, 17:48 +0100 schrieb Neil Mitchell:
> > I tried to use the HEAD-resolving query in the rule for the
> > file-lookup-query:
> >
> > defaultRuleGitLib :: Rules ()
> > defaultRuleGitLib = do
> > rule $ \(GetGitReferenceQ (repoPath, refName)) -> Just $ liftIO $
> > GitSHA <$> getGitReference' repoPath refName
> > rule $ \(GetGitFileRefQ (repoPath, refName, fn)) -> Just $ do
> > GitSHA ref' <- apply1 $ GetGitReferenceQ (repoPath, "HEAD")
> > liftIO $ getGitFileRef' repoPath ref' fn
> >
> > But it would still run "getGitFileRef'" even with an unchanged
> > repository.
>
> So I think that is what you should be doing, and it should (to a first
> approximation) work the way you were hoping for. You roughly have the
> dependency chain:
>
> GetGitReferenceQ <- GetGitFileRefQ <- readGitFile
>
> And since Shake should break the dependency chain whenever a value is
> equal, if HEAD doesn't change, GetGitFileRefQ shouldn't be
> recalculated.
>
> The problem (I suspect) is that the storedValue instance for
> GetGitFileRefQ is being invoked. Can you trace inside the storedValue
> to confirm that's where the reference is being checked?

Yes, indeed. Here a run with no change:

% Compression complete
storedValue GetGitFileRefQ
HEAD lookup!
Tree lookup!
% valid True for (GetGitFileRefQ ("repo","HEAD","a.txt")) (Just "83e1501243b46efbe5407e191840da4ee76b76d4")
storedValue GetGitReferenceQ
HEAD lookup!
% valid True for (GetGitReferenceQ ("repo","HEAD")) (GitSHA "05ddb4a38826098d571f142f11fdd4c23f59ceca")
% Loaded -> Ready, GetGitReferenceQ ("repo","HEAD")
% Loaded -> Ready, GetGitFileRefQ ("repo","HEAD","a.txt")
% valid True for a.out (File {mod=0xA899856F,size=0x2,digest=VAL})
% Loaded -> Ready, a.out
storedValue GetGitFileRefQ
HEAD lookup!
Tree lookup!
% valid True for (GetGitFileRefQ ("repo","HEAD","b.txt")) Nothing
% Loaded -> Ready, GetGitFileRefQ ("repo","HEAD","b.txt")
Build completed in 0:01m

Here a run with a change to HEAD, but not to the actual a.txt:

% Found 6 real entries
storedValue GetGitFileRefQ
HEAD lookup!
Tree lookup!
% valid True for (GetGitFileRefQ ("repo","HEAD","a.txt")) (Just "83e1501243b46efbe5407e191840da4ee76b76d4")
storedValue GetGitReferenceQ
HEAD lookup!
% valid False for (GetGitReferenceQ ("repo","HEAD")) (GitSHA "05ddb4a38826098d571f142f11fdd4c23f59ceca")
% Loaded -> Waiting, GetGitReferenceQ ("repo","HEAD")
% Loaded -> Waiting, GetGitFileRefQ ("repo","HEAD","a.txt")
# GetGitReferenceQ ("repo","HEAD")
HEAD lookup!
% Waiting -> Ready, GetGitReferenceQ ("repo","HEAD")
% Waiting -> Waiting, GetGitFileRefQ ("repo","HEAD","a.txt")
% result (GetGitReferenceQ ("repo","HEAD")) = (GitSHA "d04351227b669d6efde1003cf07951a452e09fd7")
# GetGitFileRefQ ("repo","HEAD","a.txt")
Tree lookup!
% Waiting -> Ready, GetGitFileRefQ ("repo","HEAD","a.txt")
% result (GetGitFileRefQ ("repo","HEAD","a.txt")) = (Just "83e1501243b46efbe5407e191840da4ee76b76d4")
% valid True for a.out (File {mod=0xA899856F,size=0x2,digest=VAL})
% Loaded -> Ready, a.out
storedValue GetGitFileRefQ
HEAD lookup!
Tree lookup!
% valid True for (GetGitFileRefQ ("repo","HEAD","b.txt")) Nothing
% Loaded -> Waiting, GetGitFileRefQ ("repo","HEAD","b.txt")
# GetGitFileRefQ ("repo","HEAD","b.txt")
Tree lookup!
% Waiting -> Ready, GetGitFileRefQ ("repo","HEAD","b.txt")
% result (GetGitFileRefQ ("repo","HEAD","b.txt")) = Nothing
Build completed in 0:01m

and finally a run with a new HEAD with a changed a.txt:

% Found 6 real entries
storedValue GetGitFileRefQ
HEAD lookup!
Tree lookup!
% valid False for (GetGitFileRefQ ("repo","HEAD","a.txt")) (Just "83e1501243b46efbe5407e191840da4ee76b76d4")
% Loaded -> Waiting, GetGitFileRefQ ("repo","HEAD","a.txt")
# GetGitFileRefQ ("repo","HEAD","a.txt")
storedValue GetGitReferenceQ
HEAD lookup!
% valid False for (GetGitReferenceQ ("repo","HEAD")) (GitSHA "d04351227b669d6efde1003cf07951a452e09fd7")
% Loaded -> Waiting, GetGitReferenceQ ("repo","HEAD")
# GetGitReferenceQ ("repo","HEAD")
HEAD lookup!
% Waiting -> Ready, GetGitReferenceQ ("repo","HEAD")
% result (GetGitReferenceQ ("repo","HEAD")) = (GitSHA "60c235b92e3451d9c6e30f0e7805463a71977e56")
Tree lookup!
% Waiting -> Ready, GetGitFileRefQ ("repo","HEAD","a.txt")
% result (GetGitFileRefQ ("repo","HEAD","a.txt")) = (Just "35bfc60cf7524abf9e6a04b6e27098c32407a826")
% valid True for a.out (File {mod=0xA899856F,size=0x2,digest=VAL})
% Loaded -> Waiting, a.out
# a.out
Writing a
% Waiting -> Ready, a.out
% result a.out = (File {mod=0xAAF108F7,size=0x2,digest=VAL})
storedValue GetGitFileRefQ
HEAD lookup!
Tree lookup!
% valid True for (GetGitFileRefQ ("repo","HEAD","b.txt")) Nothing
% Loaded -> Waiting, GetGitFileRefQ ("repo","HEAD","b.txt")
# GetGitFileRefQ ("repo","HEAD","b.txt")
Tree lookup!
% Waiting -> Ready, GetGitFileRefQ ("repo","HEAD","b.txt")
% result (GetGitFileRefQ ("repo","HEAD","b.txt")) = Nothing
Build completed in 0:01m

Not sure what the conclusion is here.

> I think the solution to that is to follow the Rule advice "For rules
> whose values are not stored externally, storedValue should return Just
> with a sentinel value and equalValue should always return EqualCheap
> for that sentinel.". I know the value is stored externally, but its
> stored externally with perfect tracking via the SHA dependency, which
> means it might as well be not stored.

Ah, I might be on an older version (0.13.2, which is in Debian), where
it says
“For rules whose values are not stored externally, storedValue
should return Nothing.”
– hence my earlier question about when to exactly return Nothing.

Indeed, 0.15.1’s docs are more elaborate; sorry for not properly
RTFM’ing here.


What I find confusing: If use use a sentinel value (e.g. "data Answer =
Answer", right?) for when the value is not stored externally, then were
do I store the value at all? Or maybe I’m misunderstanding „external“
here.


I’m slightly confused by the example, which says
“below is a simplified rule for building files, where files are
identified by a FilePath and their state is identified by a hash
of their contents”
but then has ModTime instead of hashes. Intentional or a documentation
glitch?


I’ll experiment a bit more, and push for getting 0.15.1 into Debian.
It’s pending packaging of the js-* libraries, which someone else wanted
to look into this weekend.

Thanks!
Joachim

--
signature.asc

Neil Mitchell

unread,
May 6, 2015, 3:45:26 PM5/6/15
to Joachim Breitner, shake-bui...@googlegroups.com
> What I find confusing: If use use a sentinel value (e.g. "data Answer =
> Answer", right?) for when the value is not stored externally, then were
> do I store the value at all? Or maybe I’m misunderstanding „external“
> here.

By sentinel, it means some construction like:

newtype GetGitFileRefQ = GetGitFileRefQ (Maybe (RepoPath, T.Text, FilePath))

instance ...
storedValue _ _ = return $ Just $ GetGitFileRefQ Nothing
equalValue _ _ (GetGitFileRefQ a) (GetGitFileRefQ b)
| Just a <- a, Just b <- b = if a == b then EqualCheap else NotEqual
| otherwise = EqualCheap

And then everywhere you currently construct a GetGitFileRefQ, add in a Just.

So the documentation is written for values "not stored externally",
i.e. those that Shake computes on its own and that aren't also on
disk. Your file _is_ on disk, so the documentation says you should
return Just, and that is the general advice I would give, but in your
case its also significantly slower, and returning some sentinel (e.g.
GetGitFileRefQ Nothing) works just as well since there is no way to
change the on-disk value without also changing one of the
dependencies.

The above is deeply ugly, but there's currently no better way - I'll
be thinking what the better way looks like over the next few days.

> I’m slightly confused by the example, which says
> “below is a simplified rule for building files, where files are
> identified by a FilePath and their state is identified by a hash
> of their contents”
> but then has ModTime instead of hashes. Intentional or a documentation
> glitch?

That looks like a glitch.

> I’ll experiment a bit more, and push for getting 0.15.1 into Debian.
> It’s pending packaging of the js-* libraries, which someone else wanted
> to look into this weekend.

Great.

Thanks, Neil
Reply all
Reply to author
Forward
0 new messages