Improved cache support. TL;DR

Paolo Di Tommaso

Aug 1, 2016, 9:29:35 AM

to nextflow

Hello,

The upcoming release 0.22.0 will bring some long-awaited improvements in the nextflow caching mechanism.

It will include two new commands: `log` and `clean`.

The `log` command replaces the `history` command, which will be deprecated.

In its simplest form, the `log` command prints the list of pipelines executed in the current folder. For example:

$ nextflow log
TIMESTAMP            RUN NAME         SESSION ID                            COMMAND
2016-08-01 11:44:51  grave_poincare   18cbe2d3-d1b7-4030-8df4-ae6c42abaa9c  nextflow run hello
2016-08-01 11:44:55  small_goldstine  18cbe2d3-d1b7-4030-8df4-ae6c42abaa9c  nextflow run hello -resume
2016-08-01 11:45:09  goofy_kilby      0a1f1589-bd0e-4cfc-b688-34a03810735e  nextflow run rnatoy -with-docker

Starting with this version, each pipeline execution is assigned a unique `name` that helps to identify multiple runs of a pipeline. This name is generated automatically by nextflow or provided by the user on the run command line.

By using the run name (or even the session id) it is possible to inspect the tasks that have been executed by that pipeline run. For example:

$ nextflow log goofy_kilby
/Users/../work/0b/be0d1c4b6fd6c778d509caa3565b64
/Users/../work/ec/3100e79e21c28a12ec2204304c1081
/Users/../work/7d/eb4d4471d04cec3c69523aab599fd4
/Users/../work/8f/d5a26b17b40374d37338ccfe967a30
/Users/../work/94/dfdfb63d5816c9c65889ae34511b32

By default, the `log` command prints the task execution paths. However, the `-f` command line option makes it possible to provide a custom list of fields to be printed. For example:

$ nextflow log goofy_kilby -f hash,name,exit,status
0b/be0d1c  buildIndex (ggal_1_48850000_49020000.Ggal71.500bpflank)  0  COMPLETED
ec/3100e7  mapping (ggal_gut)                                       0  COMPLETED
7d/eb4d44  mapping (ggal_liver)                                     0  COMPLETED
8f/d5a26b  makeTranscript (ggal_liver)                              0  COMPLETED
94/dfdfb6  makeTranscript (ggal_gut)                                0  COMPLETED

The fields accepted by the `-f` option are the ones included in the trace report, plus the following: script, stdout, stderr, env.

A user can further customise the printed log by using the `-t` option, which allows a template (string or file) to be specified. This makes it possible to create complex custom reports in any text-based format. For example, you could save the following markdown snippet to a file:

## $name

script:

$script

exit status: $exit
task status: $status
task folder: $folder

Then the following command will output a markdown file containing the script, exit status and folder of all executed tasks:

nextflow log goofy_kilby -t my-template.md > execution-report.md
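
To illustrate how such a template gets filled in for each task, here is a minimal Python sketch that mimics the `$name`-style substitution with `string.Template`. It is only an analogy, not how nextflow implements the `-t` option, and the task record and its values are invented for the example.

```python
from string import Template

# Same placeholders as the markdown template above.
TEMPLATE = Template(
    "## $name\n"
    "\n"
    "script:\n"
    "\n"
    "$script\n"
    "\n"
    "exit status: $exit\n"
    "task status: $status\n"
    "task folder: $folder\n"
)


def render(task):
    """Render one task record (a dict of log fields) with the template."""
    return TEMPLATE.substitute(task)


# Hypothetical task record; every value here is made up for illustration.
example_task = {
    "name": "sayHello (1)",
    "script": "echo 'Hello world!'",
    "exit": 0,
    "status": "COMPLETED",
    "folder": "/tmp/work/70/84f82a",
}

report = render(example_task)
print(report)
```

Rendering one such block per task and concatenating the results would give a report comparable to `execution-report.md`.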

The `-filter` option makes it possible to select which entries are included in the log report. Any valid groovy boolean expression on the log fields can be used to define the filter condition. For example:

nextflow log goofy_kilby -filter 'name =~ /foo.*/ && status == "FAILED"'
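
For readers less familiar with groovy, the filter expression above can be read roughly like the following Python sketch. It assumes that `=~` used as a boolean is true when the pattern is found anywhere in the string (hence `re.search`); the task records are hypothetical.

```python
import re


def matches_filter(task):
    """Python reading of: name =~ /foo.*/ && status == "FAILED"."""
    # Groovy's =~ in a boolean context succeeds when the pattern is found
    # anywhere in the string, which corresponds to re.search here.
    name_matches = re.search(r"foo.*", task["name"]) is not None
    return name_matches and task["status"] == "FAILED"


# Hypothetical log entries, invented for illustration.
tasks = [
    {"name": "foo (1)", "status": "FAILED"},
    {"name": "foo (2)", "status": "COMPLETED"},
    {"name": "bar (1)", "status": "FAILED"},
]

selected = [task["name"] for task in tasks if matches_filter(task)]
print(selected)
```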

The `clean` command allows you to delete cached work directories by specifying the run name or a session id. For example:

$ nextflow clean goofy_kilby

The above command will delete the task directories for the run named `goofy_kilby`. The special name `last` can be used to delete the tasks of the last pipeline run.

$ nextflow clean last

The options -before, -after, -cut can be used to specify a set of runs to delete. For example:

$ nextflow clean -before last

Finally, the trace report will also include an entry for each cached task that was included in a pipeline execution. For example:

task_id  hash       tag  name          attempt  status     exit
1        70/84f82a  -    sayHello (1)  1        CACHED     0
4        21/3b7aca  -    sayHello (4)  1        CACHED     0
3        b3/b67279  -    sayHello (3)  1        CACHED     0
2        5e/4479aa  -    sayHello (2)  1        COMPLETED  0

Though I'm not sure whether in this report the status field should be reported as 'CACHED', as in this example, or as 'COMPLETED' with a new column `cached` reporting true|false.

You can try these new features by defining the following environment variable:

export NXF_VER=0.22.0-SNAPSHOT

Comments are welcome.

Cheers,
Paolo

Mike Smoot

Aug 2, 2016, 2:55:21 PM

to Nextflow

Hi Paolo,

Very excited for these changes - both features look excellent. Having named runs in particular seems like a potentially very useful feature. One question I have is what is the scope of a named run? To explain what I'm asking, assume that you have run "grave_poincare" that gets halfway done with the pipeline and then "small_goldstine" that gets further. If I clean small_goldstine will it remove the work done in grave_poincare as well? Or will it only delete the tasks run in small_goldstine? I guess the inverse question is whether cleaning grave_poincare would remove anything that small_goldstine builds off of?

As for the log feature, I'm wondering if it would be possible to enable a verbose mode that logs the caching logic and hashes so that we can more easily debug what's triggering a particular process to re-run? I know this is possible now using "-trace nextflow.processor.TaskProcessor", but it takes a fair amount of digging through logs to figure out what's different. Even just pretty printing things so that we could diff the output between runs would be a big help.
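
As a partial workaround with what is described in this thread, one could compare the task hashes of two runs. The Python sketch below assumes the name and hash of each task have already been collected into dictionaries (for example from `nextflow log <run name> -f name,hash`); the function name and the data are hypothetical. It only shows which tasks got a new hash, not why the hash changed.

```python
def changed_tasks(previous, current):
    """Return the names of tasks whose hash differs between two runs.

    Both arguments map a task name to its hash.
    """
    return sorted(
        name
        for name, task_hash in current.items()
        if name in previous and previous[name] != task_hash
    )


# Hypothetical data for two runs of the same pipeline.
first_run = {
    "mapping (ggal_gut)": "ec/3100e7",
    "mapping (ggal_liver)": "7d/eb4d44",
}
second_run = {
    "mapping (ggal_gut)": "ec/3100e7",
    "mapping (ggal_liver)": "1a/2b3c4d",
}

rerun = changed_tasks(first_run, second_run)
print(rerun)
```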

For the trace report I think I'd prefer printing "CACHED" for any cached process rather than "COMPLETED". Another option might be "CACHED (goofy_kilby)" so that you know which run the task was cached in. Or maybe add another column for that? Not a big deal in any case.

thanks,

Mike

Paolo Di Tommaso

Aug 3, 2016, 5:55:55 AM

to nextflow

Hi Mike,

Thanks for your reply. You will find my answers below.

On Tue, Aug 2, 2016 at 8:55 PM, Mike Smoot <mike....@gmail.com> wrote:

> Hi Paolo,
>
> Very excited for these changes - both features look excellent. Having named runs in particular seems like a potentially very useful feature. One question I have is what is the scope of a named run? To explain what I'm asking, assume that you have run "grave_poincare" that gets halfway done with the pipeline and then "small_goldstine" that gets further. If I clean small_goldstine will it remove the work done in grave_poincare as well? Or will it only delete the tasks run in small_goldstine? I guess the inverse question is whether cleaning grave_poincare would remove anything that small_goldstine builds off of?

No. Overlapping executions will not be deleted by the clean command. Thus, referring to your example, if you clean `small_goldstine` it won't remove the tasks produced by `grave_poincare` that are also used by the `small_goldstine` run.

Behind the scenes there's a very simple reference counting mechanism: a task is deleted only when it's referenced by a single run. This makes me think that a `force` flag, to delete tasks including the overlapping ones, may be useful.

However using `clean <session-id>` will delete all of them in any case.
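
The reference counting rule can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the actual nextflow code: the `runs` mapping, the `clean_run` function and the task hashes are all hypothetical, and cleaning by session id is not modelled.

```python
from collections import Counter


def clean_run(runs, run_name):
    """Forget a run and return the task hashes that are safe to delete.

    `runs` maps a run name to the task hashes it references.
    """
    # Count how many runs reference each task.
    refs = Counter(task for tasks in runs.values() for task in set(tasks))
    # A task can be deleted only if the cleaned run is its sole reference.
    deletable = sorted(task for task in set(runs[run_name]) if refs[task] == 1)
    del runs[run_name]
    return deletable


# Hypothetical runs: small_goldstine resumed grave_poincare and reused its tasks.
runs = {
    "grave_poincare": ["aa/111111", "bb/222222"],
    "small_goldstine": ["aa/111111", "bb/222222", "cc/333333"],
}

first = clean_run(runs, "small_goldstine")
second = clean_run(runs, "grave_poincare")
print(first, second)
```

In this sketch, cleaning `small_goldstine` first only yields the task that it alone references; the shared tasks become deletable once `grave_poincare`, their last remaining reference, is cleaned as well.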

> As for the log feature, I'm wondering if it would be possible to enable a verbose mode that logs the caching logic and hashes so that we can more easily debug what's triggering a particular process to re-run? I know this is possible now using "-trace nextflow.processor.TaskProcessor", but it takes a fair amount of digging through logs to figure out what's different. Even just pretty printing things so that we could diff the output between runs would be a big help.

This makes sense. You may want to open a feature request for that.

> For the trace report I think I'd prefer printing "CACHED" for any cached process rather than "COMPLETED". Another option might be "CACHED (goofy_kilby)" so that you know which run the task was cached in. Or maybe add another column for that? Not a big deal in any case.

I tend to agree with you. An alternative to the status CACHED would be to add a column `cached = true|false` while keeping the status set to `COMPLETED`.

A small benefit of the latter approach is that it would (maybe) allow a simpler filtering query on task status, because I guess that one usually wants to know which tasks have executed successfully or failed. Whether they were cached or not is less relevant.
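
To make the difference concrete, here is a small Python sketch of the "which tasks succeeded?" query under both designs. The rows and function names are hypothetical and only loosely follow the trace example earlier in the thread.

```python
# Option A: cached tasks get their own status value.
rows_a = [
    {"name": "sayHello (1)", "status": "CACHED"},
    {"name": "sayHello (2)", "status": "COMPLETED"},
    {"name": "sayHello (3)", "status": "FAILED"},
]

# Option B: the status stays COMPLETED and a separate column records caching.
rows_b = [
    {"name": "sayHello (1)", "status": "COMPLETED", "cached": True},
    {"name": "sayHello (2)", "status": "COMPLETED", "cached": False},
    {"name": "sayHello (3)", "status": "FAILED", "cached": False},
]


def succeeded_a(rows):
    # Two status values have to be treated as success.
    return [r["name"] for r in rows if r["status"] in ("COMPLETED", "CACHED")]


def succeeded_b(rows):
    # A single comparison is enough; the cached column is simply ignored.
    return [r["name"] for r in rows if r["status"] == "COMPLETED"]


print(succeeded_a(rows_a))
print(succeeded_b(rows_b))
```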

Cheers,

Paolo

Pau Carrio

Sep 6, 2016, 10:46:33 AM

to Nextflow

Hi
Just checking what happened in August... and both features are GREAT!
Thanks Nextflow-developers!
Regards
Pau

Robert Syme

Sep 12, 2016, 9:12:06 PM

to Nextflow

Great work Paolo. The log/clean features make testing out new ideas and workflow options even easier. Thanks!

-r
