retry based on md5 checksum instead of time-stamp

Emanuel Schmid

Dec 7, 2017, 6:13:18 AM

to bpipe-discuss

Hello
I like bpipe a lot and am gradually moving several of my pipelines to it.
One thing that bugs me a bit, though, is that the "retry" option is based on time-stamps.
That obviously makes complete sense, but it is tricky in our situation.
The problem is that we have an automatic scrubbing system that removes obsolete data, which can only be circumvented by "touching" the files.
This obviously throws bpipe off completely, and all analyses are run again because the input data now have a new time-stamp.

Could you please elaborate a bit on how this is implemented? Do you think it would be feasible to replace it with, or add, an md5-based system?

Cheers

Simon

Dec 7, 2017, 11:44:55 PM

to bpipe-discuss

Hi Emanuel,

I don't think it would be too difficult to implement a checksum-based system.

Bpipe stores a lot of properties about the files, and I have often thought that a checksum would be a good thing to have as an option, not so much for dependency tracking but just for ensuring the integrity of files. You can see where the actual check for whether something is up to date happens here:

https://github.com/ssadedin/bpipe/blob/master/src/main/groovy/bpipe/Dependencies.groovy#L363

You'll notice that after it checks timestamps it eventually calls graph.propertiesFor(<output>). There it is retrieving an arbitrary properties object that is kept for every output of the pipeline. Via the same mechanism, checksums of the inputs used to create an output could be recorded, and those could later be retrieved and compared to the checksums of the current files to determine if the output is up to date.
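
To make the idea concrete, here is a minimal sketch in Python rather than Bpipe's Groovy. It is not Bpipe's code and does not use graph.propertiesFor: the JSON sidecar file and the names md5_of, record_inputs and is_up_to_date are all hypothetical, chosen only to illustrate recording input checksums per output and comparing them later.

```python
import hashlib
import json
import os


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_inputs(props_path, inputs):
    """Store the checksum of every input in a sidecar file (hypothetical format)."""
    props = {"input_md5": {path: md5_of(path) for path in inputs}}
    with open(props_path, "w") as handle:
        json.dump(props, handle)


def is_up_to_date(output, props_path, inputs):
    """True if the output exists and every input still matches its recorded checksum."""
    if not (os.path.exists(output) and os.path.exists(props_path)):
        return False
    with open(props_path) as handle:
        recorded = json.load(handle).get("input_md5", {})
    # Modification times are never consulted, so a "touched" input still matches.
    return all(recorded.get(path) == md5_of(path) for path in inputs)
```

Under these assumptions, record_inputs would be called once after an output is produced, and is_up_to_date on every later run. Each check rereads every input in full.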

The main problem with all this is that for large files, computing checksums would take some time, so you could end up finding that Bpipe takes a long time just to work out if an output needs to be created. Would other properties like the file size serve your purpose just as well?
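
As a rough illustration of the cheaper alternative, the sketch below compares recorded file sizes instead of checksums. It is written in Python, not Groovy, and is not part of Bpipe; size_signature and inputs_unchanged are hypothetical names. It only reads file metadata, so it stays fast for large files, at the cost of missing edits that leave the byte count unchanged.

```python
import os


def size_signature(paths):
    """Map each path to its current size in bytes."""
    return {path: os.path.getsize(path) for path in paths}


def inputs_unchanged(recorded_sizes, inputs):
    """True if every input exists and has the size recorded when the output was made.

    Only the file size is consulted, so touching a file does not invalidate
    the output. An edit that keeps the byte count unchanged goes undetected.
    """
    for path in inputs:
        if not os.path.exists(path):
            return False
        if recorded_sizes.get(path) != os.path.getsize(path):
            return False
    return True
```

The dictionary returned by size_signature would be stored alongside the output when it is created and passed back to inputs_unchanged on the next run.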

Cheers,

Simon

Emanuel Schmid

Dec 8, 2017, 3:26:18 AM

to bpipe-discuss

Thank you very much for this swift answer, Simon.
I will look into it a bit today, but I agree with you that checksums might slow bpipe down a lot with large files,
which I unfortunately have. I think the file size would indeed be a good measure in my case, and it should fix my current issue.

Cheers

Emanuel Schmid

Dec 8, 2017, 3:55:20 AM

to bpipe-discuss

Hi
I looked a bit further into your code and was wondering:
Would it work if I introduced a new function in your Utils.groovy, similar to the "findOlder" function, or even simply replaced findOlder with a size-comparison check? Or would that break other parts of the pipeline?
