Up-to-dateness of files downloaded from url


Bruno Hernández

Jun 27, 2019, 1:49:19 PM
to Shake build system
Hi, I am quite new to Shake, and I find it interesting for the flexibility it gives, which I think Make lacks for more complex things like building containers.
One basic use case I haven't found so far in this group or in the examples is a target that is simply produced by downloading a file from a URL. This is very common in projects with big pre-built binaries.
I thought I'd ask here first so as not to reinvent the wheel. Also, how can I let Shake know that the file from the URL is up to date, so that it is not downloaded twice?

Neil Mitchell

Jul 1, 2019, 11:35:12 AM
to Bruno Hernández, Shake build system
Hi Bruno,

Great question!

Agreed, this is something that Shake should really have a
StackOverflow question for, so I suggest once we've iterated on the
mailing list we move it there so people can search for it.

Shake doesn't offer anything out of the box, but I suspect you can build
any behaviour you want. The big question with downloading a file is
under what circumstances do you want to redownload it? For some files,
e.g. http://hackage.haskell.org/package/shake-0.18.2/shake-0.18.2.tar.gz
the contents of the file should be consistent forevermore. If so, you
can do:

"shake-0.18.2.tar.gz" %> \out -> do
    cmd_ "wget" ["-O", out] "http://hackage.haskell.org/package/shake-0.18.2/shake-0.18.2.tar.gz"

The rule has no dependencies, so the only circumstances it will rerun
are when the output changes, e.g. when you do a clean.

If however the file changes regularly, and you want to download it
afresh every time, just shove an alwaysRerun in there. If you want to
run the rule periodically, e.g. daily, then your best bet is creating
an oracle which returns the current day and depend on the oracle, as
the rule will rerun when the oracle changes.
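
A minimal sketch of the daily-oracle idea, under some assumptions: the `Today` type, the `dayStamp` helper, the target name `nightly.tar.gz` and the example.org URL are all made up for illustration. The oracle is evaluated on every build, but the download rule only reruns when the answer (the calendar day) differs from the one recorded last time.

```haskell
{-# LANGUAGE TypeFamilies, GeneralizedNewtypeDeriving, DeriveDataTypeable #-}
import Control.Monad (void)
import Data.Time
import Development.Shake
import Development.Shake.Classes

-- Oracle question (hypothetical name): "what is today's date?"
newtype Today = Today () deriving (Show, Typeable, Eq, Hashable, Binary, NFData)
type instance RuleResult Today = String

-- Keep only the calendar day, e.g. "2019-07-01", so the answer changes once a day.
dayStamp :: UTCTime -> String
dayStamp = show . utctDay

main :: IO ()
main = shakeArgs shakeOptions $ do
    want ["nightly.tar.gz"]

    -- Evaluated on every build; dependents rerun only if the answer changed.
    void $ addOracle $ \(Today ()) -> liftIO (dayStamp <$> getCurrentTime)

    "nightly.tar.gz" %> \out -> do
        _ <- askOracle (Today ())  -- records the dependency on the current day
        -- Placeholder URL, not a real download location.
        cmd_ "wget" "--quiet" ["-O", out] ["https://example.org/nightly.tar.gz"]
```

Replacing the `askOracle` line with `alwaysRerun` would give the "download afresh every time" variant instead.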

If you want to get even more fancy, and have an HTTP HEAD check that
returns the last modified date, that's possible too. The simplest
approach there is to define an oracle with the HEAD information, and
then depend on it. It will cause you to do an HTTP HEAD every run, and
there are the issues around checking the HEAD then downloading (it
might change in the meantime) and whether you need to do both HEAD and
GET the first time around. I'm sure they are solvable, but it's
probably sufficiently complex that it would want to go into a library.
And after all that, your nothing-to-do build system is still going to
require talking to an external server.

Finally, I should slightly warn that if you are grabbing things off
the internet, usually best practice is to confirm their hash before
running them. That can only work with fixed downloads, and after
downloading you'd assert the hash was a particular value or fail.
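
A rough sketch of the "assert the hash or fail" step, assuming a `sha256sum` executable is on the PATH. The `expectedSha256` value is a deliberate placeholder (the real digest is not given in this thread), and `digestOf` and the `.part` suffix are illustrative choices. The file is fetched under a temporary name and only moved to the target once the digest matches, so a bad download never ends up at the target path.

```haskell
import Control.Monad (unless)
import Development.Shake

-- HYPOTHETICAL placeholder: substitute the digest published for the file.
expectedSha256 :: String
expectedSha256 = "<expected sha256 hex digest>"

-- `sha256sum FILE` prints "<digest>  <file>"; keep only the digest.
digestOf :: String -> String
digestOf = takeWhile (/= ' ')

main :: IO ()
main = shakeArgs shakeOptions $ do
    want ["shake-0.18.2.tar.gz"]

    "shake-0.18.2.tar.gz" %> \out -> do
        let tmp = out ++ ".part"
        cmd_ "wget" "--quiet" ["-O", tmp]
            ["http://hackage.haskell.org/package/shake-0.18.2/shake-0.18.2.tar.gz"]
        Stdout s <- cmd "sha256sum" [tmp]
        -- Fail the rule on a mismatch; the .part file is left behind for inspection.
        unless (digestOf s == expectedSha256) $
            liftIO $ ioError $ userError $ "SHA-256 mismatch for " ++ out
        cmd_ "mv" [tmp, out]
```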

Thanks, Neil

Bruno Hernández

Jul 2, 2019, 2:34:58 PM
to Shake build system
Hi,
First of all thanks a lot for your response.
Like I said, I'm new. Step by step. Divide and conquer, like the Romans would say.

- The ISO name: There is normally a URL for downloading ISOs or binaries that ends in /latest/. I would write the HTTP response from that URL to stdout, then match it against a regex for the binary name (something like perryLinux-[version].iso) so that I learn [version]. With this I can determine the actual name of the binary and pass it to the shake binary:

iso_name=$(wget --quiet --output-document=- ${pointer_to_latest} | grep -o "perryLinux-20[0-9][0-9]\.[0-9][0-9]\.[0-9][0-9].iso" | tail --lines=1)

./Shakefile "$iso_name"

I guess you meant that I can figure out the iso_name with Haskell libraries though? Like dark black Haskell magic?

- The alwaysRerun: I have found that I can use alwaysRerun for the smaller SHA sum file and then use that file as a dependency. I was concerned, though, when I saw that the ISO file was not re-downloaded after I deleted the SHA sum file.

     shaSums %> \out -> do
       alwaysRerun
       cmd_ "wget" "--quiet" shaUrl

     "//*.iso" %> \out -> do
       need [shaSums]
       cmd_ "wget" "--quiet" isoUrl

- What is an Oracle? Surely more Haskell black magic. Is there an example that I can look at?

Wearing my ignorance with pride, because it is Bliss,
Bruno

Neil Mitchell

Jul 3, 2019, 1:33:53 AM
to Bruno Hernández, Shake build system
Hi Bruno,

It sounds like roughly what you want is to always fetch the URL name,
and then download the file when it changes. There are various ways to
do that, ranging from shell commands and files, to using Haskell
libraries and oracles. Focusing more on the first approach, I'd
imagine something like:

"iso.url" %> \out -> do
    alwaysRerun
    -- ${pointer_to_latest} is expanded by the shell, so it must be set in the environment
    Stdout x <- cmd Shell "wget --quiet --output-document=- ${pointer_to_latest} | grep -o \"perryLinux-20[0-9][0-9]\\.[0-9][0-9]\\.[0-9][0-9].iso\" | tail --lines=1"
    writeFileChanged out x

"**/*.iso" %> \out -> do
    url <- readFile' (out -<.> "url")
    cmd_ "wget --quiet" url "-O" out

The first rule says always run your shell snippet, and put the output
in "iso.url"

The second rule says find the associated .url (and depend on it), then
download with wget.

I think your approach of depending on the SHA file should have worked
fine, and should have rebuilt each time. If you believe the SHA might
change for a given .iso without the URL changing, that's a very
sensible way to go. In particular, deleting the SHA file should cause the
.iso to rebuild, provided the .iso is given as a target (e.g. via
want/action, or something that depends on one of those).

Thanks, Neil

bruno

Jul 7, 2019, 4:06:31 PM
to Neil Mitchell, Shake build system

Hi,

Building on the examples, I have tried a number of things with little success:

import Development.Shake
import Development.Shake.Classes
import Development.Shake.Command
import Development.Shake.FilePath
import Development.Shake.Util

main :: IO ()

newtype UrlLastModified = UrlLastModified () deriving (Show,Typeable,Eq,Hashable,Binary,NFData)
type instance RuleResult UrlLastModified = String

Will not compile:

Shakefile.hs:11:1: error:

    • Illegal family instance for ‘RuleResult’
        (Use TypeFamilies to allow indexed type families)
    • In the type instance declaration for ‘RuleResult’

I can instead compile the following type:

newtype UrlLastModified = UrlLastModified () deriving (Show,Typeable,Eq)

Then go ahead and define the oracle:

     addOracle $ \(UrlLastModified _) -> do
       fmap fromStdout $ cmd Shell "curl --silent --head http://hackage.haskell.org/package/shake-0.18.2/shake-0.18.2.tar.gz | grep -i last.modified" :: Action String

Which will fail to compile with the following error:

Shakefile.hs:29:6: error:
    • Couldn't match type ‘RuleResult UrlLastModified’ with ‘[Char]’
        arising from a use of ‘addOracle’

Finally, I can define my new type as a String:

newtype UrlLastModified = UrlLastModified String deriving (Show,Typeable,Eq)

Then add an Oracle:

     addOracle $ \(UrlLastModified _) -> do
       Stdout lastModified <- cmd Shell "curl --silent --head http://hackage.haskell.org/package/shake-0.18.2/shake-0.18.2.tar.gz | grep -i last.modified"
       return lastModified

But then I fail when asking:

     "//*.tar.gz" %> \out -> do
       lastModified <- askOracle (UrlLastModified)
       cmd_ "wget" "--quiet" ["http://hackage.haskell.org/package/shake-0.18.2/" ++ out]

With the following error:

Shakefile.hs:38:24: error:
    • No instance for (Show (RuleResult (String -> UrlLastModified)))
        arising from a use of ‘askOracle’

Is it the case that my oracle always has to return an Action String?

bruno

Jul 10, 2019, 12:46:36 PM
to Neil Mitchell, Shake build system

OK I think I got this. The following works:

Adding the following to the .hs file header:

{-# LANGUAGE TypeFamilies, GeneralizedNewtypeDeriving, DeriveDataTypeable #-}

Then, the normal type declaration as shown in the examples:

newtype UrlLastModified = UrlLastModified () deriving (Show,Typeable,Eq,Hashable,Binary,NFData)
type instance RuleResult UrlLastModified = String

I added the oracle using this one-liner (Shell option is important):

     addOracle $ \(UrlLastModified _) -> fmap fromStdout $ cmd Shell "curl --silent --head http://hackage.haskell.org/package/shake-0.18.2/shake-0.18.2.tar.gz | grep -i last.modified" :: Action String

Here I would like to know how I can format this better across multiple lines, and ultimately how it can be parameterized. Why does this need to return an Action String?
And one day when I grow up, I might even use Haskell libraries for the HTTP request instead of dirty, dirty shell.

     "//*.tar.gz" %> \out -> do

       lastModified <- askOracle (UrlLastModified ())
       cmd_ "wget" "--quiet" [shakeUrl ++ out]
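
On the multi-line and parameterization questions, one possible shape, offered as a sketch rather than the canonical way: let the oracle question carry the URL (a `String` instead of `()`), and write the body as a `do` block. The value of `shakeUrl` is assumed from the earlier post, and `lastModifiedLines` is a hypothetical helper that replaces the `grep`, which also means the `Shell` option is no longer needed. As for the result type: the body runs a command, so it lives in `Action`, and `String` is simply what the `RuleResult` line declares as the answer type.

```haskell
{-# LANGUAGE TypeFamilies, GeneralizedNewtypeDeriving, DeriveDataTypeable #-}
import Control.Monad (void)
import Data.Char (toLower)
import Data.List (isPrefixOf)
import Development.Shake
import Development.Shake.Classes

-- The question now carries the URL, so one oracle serves any number of downloads.
newtype UrlLastModified = UrlLastModified String
    deriving (Show, Typeable, Eq, Hashable, Binary, NFData)
type instance RuleResult UrlLastModified = String

-- Keep only the Last-Modified header line(s) of `curl --head` output.
lastModifiedLines :: String -> String
lastModifiedLines = unlines . filter isLastModified . lines
  where isLastModified l = "last-modified:" `isPrefixOf` map toLower l

-- Assumed base URL, taken from the earlier example in this thread.
shakeUrl :: String
shakeUrl = "http://hackage.haskell.org/package/shake-0.18.2/"

main :: IO ()
main = shakeArgs shakeOptions $ do
    want ["shake-0.18.2.tar.gz"]

    void $ addOracle $ \(UrlLastModified url) -> do
        Stdout headers <- cmd "curl" "--silent" "--head" [url]
        return (lastModifiedLines headers)

    "//*.tar.gz" %> \out -> do
        -- Rerun the download only when the server reports a new Last-Modified.
        _ <- askOracle (UrlLastModified (shakeUrl ++ out))
        cmd_ "wget" "--quiet" ["-O", out] [shakeUrl ++ out]
```

If the server answers the HEAD request without a Last-Modified header, the oracle returns an empty string every time and the file is never re-downloaded, so this only helps for servers that send the header.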

I believe this covers the main issue of up-to-dateness of a file located at a URL.

The specific HTTP request and how to interpret it might change, but I think this is very important as here we are really doing something that cannot be done with Make and others, while keeping it functional *.

- The SHA sum check would also be good to add. We can have the downloaded shasum file as a dependency, but what is the best way to double-check it against the already downloaded file? Technically, by the time we figure out that our file does not have the expected checksum, the file has already been downloaded.

* (One can also create targets that are always run with Make. I have never liked that those targets exist, because from an automation point of view the files depending on them are technically never up-to-date).
