Improved cache support. TL;DR

Paolo Di Tommaso

Aug 1, 2016, 9:29:35 AM
to nextflow
Hello, 

The upcoming release 0.22.0 will bring some long-awaited improvements in the nextflow caching mechanism. 


It will include two new commands: `log` and `clean`. 

The `log` command replaces the `history` command, which will be deprecated. 

In its simplest form, the `log` command just prints the list of pipelines executed in the current folder. For example: 

$ nextflow log
TIMESTAMP            RUN NAME         SESSION ID                            COMMAND   
2016-08-01 11:44:51  grave_poincare   18cbe2d3-d1b7-4030-8df4-ae6c42abaa9c  nextflow run hello              
2016-08-01 11:44:55  small_goldstine  18cbe2d3-d1b7-4030-8df4-ae6c42abaa9c  nextflow run hello -resume      
2016-08-01 11:45:09  goofy_kilby      0a1f1589-bd0e-4cfc-b688-34a03810735e  nextflow run rnatoy -with-docker


Starting with this version, each pipeline execution is assigned a unique `name` that helps to identify multiple runs of a pipeline. This name is either generated automatically by nextflow or provided by the user on the run command line. 

By using the run name (or even the session id) it is possible to inspect the tasks that have been executed by that pipeline run. For example: 

$ nextflow log goofy_kilby
/Users/../work/0b/be0d1c4b6fd6c778d509caa3565b64
/Users/../work/ec/3100e79e21c28a12ec2204304c1081
/Users/../work/7d/eb4d4471d04cec3c69523aab599fd4
/Users/../work/8f/d5a26b17b40374d37338ccfe967a30
/Users/../work/94/dfdfb63d5816c9c65889ae34511b32

By default, the `log` command prints the task execution paths. However, the `-f` command line option makes it possible to provide a custom list of fields to be printed. For example: 

$ nextflow log goofy_kilby -f hash,name,exit,status 
0b/be0d1c  buildIndex (ggal_1_48850000_49020000.Ggal71.500bpflank)  0  COMPLETED
ec/3100e7  mapping (ggal_gut)                                       0  COMPLETED
7d/eb4d44  mapping (ggal_liver)                                     0  COMPLETED
8f/d5a26b  makeTranscript (ggal_liver)                              0  COMPLETED
94/dfdfb6  makeTranscript (ggal_gut)                                0  COMPLETED


The fields accepted by the `-f` option are the ones included in the trace report, plus the following: script, stdout, stderr, env. 

A user can further customise the printed log by using the `-t` option, which allows a template (string or file) to be specified. This makes it possible to create complex custom reports in any text-based format. For example, you could save the following markdown snippet to a file: 
    
## $name

script: 

    $script

exit status: $exit
task status: $status
task folder: $folder

Then the following command will output a markdown file containing the script, exit status and folder of all executed tasks: 

$ nextflow log goofy_kilby -t my-template.md > execution-report.md 
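
To make the per-task expansion concrete, here is a rough sketch in Python rather than nextflow itself. It assumes the template is simply rendered once for every task and the results are concatenated; Python's `string.Template` happens to use the same `$field` placeholder syntax, which is the only reason it is used here. The task records and folder paths are hypothetical, not real output.

```python
from string import Template

# Hypothetical task records; the keys mirror the fields used in the template above.
tasks = [
    {"name": "mapping (ggal_gut)", "script": "echo mapping", "exit": 0,
     "status": "COMPLETED", "folder": "/work/aa/000001"},
    {"name": "mapping (ggal_liver)", "script": "echo mapping", "exit": 0,
     "status": "COMPLETED", "folder": "/work/bb/000002"},
]

# Same layout as the markdown snippet above, one section per task.
TEMPLATE = Template(
    "## $name\n"
    "\n"
    "script:\n"
    "\n"
    "    $script\n"
    "\n"
    "exit status: $exit\n"
    "task status: $status\n"
    "task folder: $folder\n"
)


def render_report(records):
    """Expand the template once per task and join the sections."""
    return "\n".join(TEMPLATE.substitute(record) for record in records)


report = render_report(tasks)
print(report)
```

The real `-t` option is handled inside nextflow; this only illustrates the idea of one rendered section per executed task.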


The `-filter` option makes it possible to select which entries to include in the log report. Any valid Groovy boolean expression on the log fields can be used to define the filter condition. For example: 

$ nextflow log goofy_kilby -filter 'name =~ /foo.*/ && status == "FAILED"'
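
For readers less familiar with Groovy, the filter above can be read as a predicate applied to every log entry. The sketch below is a Python rendering of that one expression, assuming that `=~` behaves as a find-style (partial) regex match when used as a boolean, which is my understanding of Groovy. The task records are hypothetical.

```python
import re


def matches(task):
    """Python reading of: name =~ /foo.*/ && status == "FAILED"."""
    name_matches = re.search(r"foo.*", task["name"]) is not None
    return name_matches and task["status"] == "FAILED"


# Hypothetical log entries, reduced to the two fields the filter uses.
tasks = [
    {"name": "foo (1)", "status": "FAILED"},
    {"name": "foo (2)", "status": "COMPLETED"},
    {"name": "bar (1)", "status": "FAILED"},
]

selected = [task["name"] for task in tasks if matches(task)]
print(selected)  # ['foo (1)']
```

Only entries for which the whole expression is true end up in the report.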



The `clean` command allows you to delete cached work directories by specifying the run name or a session id. For example: 

$ nextflow clean goofy_kilby 

The above command will delete the task directories for the run named `goofy_kilby`. The special name `last` can be used to delete the tasks of the last pipeline run. 

$ nextflow clean last 

The options `-before`, `-after` and `-cut` can be used to specify a set of runs to delete. For example: 

$ nextflow clean -before last 


Finally, the trace report will also include an entry for each cached task that was part of a pipeline execution. For example: 

task_id  hash       tag  name          attempt  status     exit
1        70/84f82a  -    sayHello (1)  1        CACHED     0
4        21/3b7aca  -    sayHello (4)  1        CACHED     0
3        b3/b67279  -    sayHello (3)  1        CACHED     0
2        5e/4479aa  -    sayHello (2)  1        COMPLETED  0


Though I'm not sure whether the status field in this report should be 'CACHED', as in this example, or 'COMPLETED' with a new `cached` column reporting true|false. 


You can try these new features by defining the following environment variable: 

export NXF_VER=0.22.0-SNAPSHOT



Comments are welcome.  


 Cheers,
Paolo


Mike Smoot

Aug 2, 2016, 2:55:21 PM
to Nextflow
Hi Paolo,

Very excited for these changes - both features look excellent.  Having named runs in particular seems like a potentially very useful feature.  One question I have is what is the scope of a named run?  To explain what I'm asking, assume that you have run "grave_poincare" that gets halfway done with the pipeline and then "small_goldstine" that gets further.  If I clean small_goldstine will it remove the work done in grave_poincare as well?  Or will it only delete the tasks run in small_goldstine?  I guess the inverse question is whether cleaning grave_poincare would remove anything that small_goldstine builds off of?

As for the log feature, I'm wondering if it would be possible to enable a verbose mode that logs the caching logic and hashes so that we can more easily debug what's triggering a particular process to re-run?  I know this is possible now using "-trace nextflow.processor.TaskProcessor", but it takes a fair amount of digging through logs to figure out what's different.  Even just pretty printing things so that we could diff the output between runs would be a big help. 

For the trace report I think I'd prefer printing "CACHED" for any cached process rather than "COMPLETED".  Another option might be "CACHED (goofy_kilby)" so that you know which run the task was cached in.  Or maybe add another column for that?  Not a big deal in any case.


thanks,
Mike

Paolo Di Tommaso

Aug 3, 2016, 5:55:55 AM
to nextflow
Hi Mike, 

Thanks for your reply. You will find my reply below. 


On Tue, Aug 2, 2016 at 8:55 PM, Mike Smoot <mike....@gmail.com> wrote:
> Hi Paolo,
>
> Very excited for these changes - both features look excellent.  Having named runs in particular seems like a potentially very useful feature.  One question I have is what is the scope of a named run?  To explain what I'm asking, assume that you have run "grave_poincare" that gets halfway done with the pipeline and then "small_goldstine" that gets further.  If I clean small_goldstine will it remove the work done in grave_poincare as well?  Or will it only delete the tasks run in small_goldstine?  I guess the inverse question is whether cleaning grave_poincare would remove anything that small_goldstine builds off of?

No. Overlapping executions will not be deleted by the clean command. So, referring to your example, if you clean `small_goldstine` it won't remove the tasks produced by `grave_poincare` and also used by the `small_goldstine` run. 

Behind the scenes there's a very simple reference counting mechanism: a task is deleted only when it's referenced by a single run. This makes me think that a `force` flag to delete tasks, including the overlapping ones, may be useful. 

However, using `clean <session-id>` will delete all of them in any case.  
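
A minimal sketch of the reference counting idea described above, in Python. This is not nextflow's actual implementation: the data layout (a dict mapping each run name to the set of task hashes it references) and the function names `clean_run` and `clean_session` are assumptions made for illustration, and the task hashes are made up.

```python
from collections import Counter


def clean_run(runs, run_name):
    """Drop one run and return the tasks that may be deleted.

    runs maps a run name to the set of task hashes it references.
    A task is deletable only if no other run references it.
    """
    refs = Counter(task for tasks in runs.values() for task in tasks)
    deletable = {task for task in runs[run_name] if refs[task] == 1}
    del runs[run_name]
    return deletable


def clean_session(runs, sessions, session_id):
    """Drop every run of a session and return all of their tasks."""
    deleted = set()
    for name in [n for n, s in sessions.items() if s == session_id]:
        deleted |= runs.pop(name, set())
    return deleted


# small_goldstine resumed grave_poincare, so it also references a1 and b2.
runs = {
    "grave_poincare": {"a1", "b2"},
    "small_goldstine": {"a1", "b2", "c3", "d4"},
}
print(sorted(clean_run(runs, "small_goldstine")))  # ['c3', 'd4']
print(sorted(clean_run(runs, "grave_poincare")))   # ['a1', 'b2']
```

Cleaning `small_goldstine` first leaves the shared tasks in place. Once it is gone, `grave_poincare` is the only run referencing them, so cleaning it removes them too. Cleaning by session id skips the counting altogether.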
 

> As for the log feature, I'm wondering if it would be possible to enable a verbose mode that logs the caching logic and hashes so that we can more easily debug what's triggering a particular process to re-run?  I know this is possible now using "-trace nextflow.processor.TaskProcessor", but it takes a fair amount of digging through logs to figure out what's different.  Even just pretty printing things so that we could diff the output between runs would be a big help. 


This makes sense. You may want to open a feature request for that. 


 
> For the trace report I think I'd prefer printing "CACHED" for any cached process rather than "COMPLETED".  Another option might be "CACHED (goofy_kilby)" so that you know which run the task was cached in.  Or maybe add another column for that?  Not a big deal in any case.


I tend to agree with you. An alternative to the status CACHED would be to add a column `cached = true|false` while keeping the status set to `COMPLETED`. 

A small benefit of the latter approach is that it would (maybe) allow a simpler filtering query on task status, because I guess that usually one wants to know which tasks have been executed successfully or failed. Whether they were cached or not is less relevant. 
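
To illustrate the difference in filtering, here is a small Python sketch of the two candidate layouts. The rows reuse the trace example from the first message; the `cached` column and the helper function names are hypothetical, since neither layout had been settled at this point.

```python
# Layout A: cached tasks are reported with status CACHED (as in the example trace).
rows_status = [
    {"task_id": 1, "name": "sayHello (1)", "status": "CACHED", "exit": 0},
    {"task_id": 4, "name": "sayHello (4)", "status": "CACHED", "exit": 0},
    {"task_id": 3, "name": "sayHello (3)", "status": "CACHED", "exit": 0},
    {"task_id": 2, "name": "sayHello (2)", "status": "COMPLETED", "exit": 0},
]

# Layout B: status stays COMPLETED and a separate (hypothetical) column flags caching.
rows_flag = [
    {**row, "status": "COMPLETED", "cached": row["status"] == "CACHED"}
    for row in rows_status
]


def succeeded_with_status(rows):
    """Layout A: a successful task is either COMPLETED or CACHED."""
    return [r["name"] for r in rows if r["status"] in ("COMPLETED", "CACHED")]


def succeeded_with_flag(rows):
    """Layout B: a single status value covers every successful task."""
    return [r["name"] for r in rows if r["status"] == "COMPLETED"]


print(succeeded_with_status(rows_status))
print(succeeded_with_flag(rows_flag))
```

Both queries select the same four tasks. With the extra column, the success query needs only one status value, and the caching information remains available through `cached` when it matters.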


Cheers,
Paolo

Pau Carrio

Sep 6, 2016, 10:46:33 AM
to Nextflow
Hi
Just checking what happened in August... and both features are GREAT!
Thanks Nextflow-developers!
Regards
Pau

Robert Syme

Sep 12, 2016, 9:12:06 PM
to Nextflow
Great work Paolo. The log/clean features make testing out new ideas and workflow options even easier. Thanks!

-r