cleanup of work dir?


Tobias Sargeant

Nov 10, 2014, 1:45:58 AM
to next...@googlegroups.com
Hi,

Is there any way to ask nextflow to clean up its work directory? For example, on one pipeline I'm working on, a successful run gives:

nextflow run -resume pipeline.nf 
N E X T F L O W  ~  version 0.11.0
[warm up] executor > local
[64/e21211] Cached process > generateBarcodeFasta (1)
[a7/717250] Cached process > stripPhiX (1)
[6b/4aacdd] Cached process > generateBarcodeBT2 (1)
[91/6c128c] Cached process > countLines (1)
[c6/e8cd9e] Cached process > alignForCounts (1)
[69/cbdc4b] Cached process > samToCounts (1)

However, as I've been developing on it for a bit, the following directories exist:

work/0e/a5da60d82211a684a84f8bb945a491
[39 other directories...]
work/fb/eac7cfff32cccaa2e96c8a4088319a

It would be nice to be able to clean up anything that isn't currently needed in the cache.

Thanks,
Toby.

Paolo Di Tommaso

Nov 10, 2014, 5:48:30 AM
to nextflow
Hi Toby, 

Nextflow doesn't do that, mainly because in the most common usage scenario, i.e. a cluster with a shared NFS, deleting hundreds or thousands of files can be a very time consuming task that would add significant overhead to the pipeline execution.

Usually I manage this by using the -w command line option, which lets you specify the pipeline working directory, i.e. where the cached files are stored, and then deleting the files produced in the former directory.

In production it may make sense to use a shared scratch folder and have a cron job that deletes the produced files older than, let's say, one week.
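As a sketch of what that cron job could run (the scratch path and one-week retention are assumptions; task directories sit two levels below the work dir, as in the listing above), here is the find-based cleanup. The demo builds a throwaway work dir under mktemp so nothing real gets deleted:

```shell
# Demo setup in a temp dir. In a real cron job, WORK would point at the
# shared scratch work dir, e.g.:  0 3 * * * find /scratch/work ... (assumed path)
WORK=$(mktemp -d)/work
mkdir -p "$WORK/0e/a5da60d82211a684a84f8bb945a491" \
         "$WORK/c6/e8cd9e0000000000000000000000000"          # demo hash names
touch -d '10 days ago' "$WORK/0e/a5da60d82211a684a84f8bb945a491"  # simulate a stale task dir

# The actual cleanup line: task dirs live at work/<xx>/<hash>, i.e. depth 2;
# delete any directory at that depth not modified for more than 7 days.
find "$WORK" -mindepth 2 -maxdepth 2 -type d -mtime +7 -exec rm -rf {} +
```

Note the obvious caveat: this deletes by age only, so a long-lived cache entry you still rely on would also be removed once it crosses the retention window.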

Could that work for you ? 


Cheers,
Paolo



--
You received this message because you are subscribed to the Google Groups "Nextflow" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nextflow+u...@googlegroups.com.
Visit this group at http://groups.google.com/group/nextflow.
For more options, visit https://groups.google.com/d/optout.



Tobias Sargeant

Nov 10, 2014, 6:53:50 PM
to next...@googlegroups.com
Hi Paolo,

I may be Doing It Wrong, but this doesn't really cover what I was thinking.

It's less of a production thing, and more of a pipeline development thing. Often I will want to do some alignment steps early on, and then iteratively experiment with later steps. Keeping early steps cached for subsequent development is therefore important, but I end up with a lot of unnecessary result directories for later stages that I'd like to clean up periodically.

What I'd ideally like is for nextflow to be able to say (given a workdir, and assuming -resume and set parameters) which directories are currently 'active' (in the sense that they would be used by cached steps of the pipeline execution) and which are no longer necessary. That way I could quickly clean up those partial results without getting rid of cached steps, which potentially hold the results of long-running computations.
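A rough manual approximation of this idea, sketched below with synthetic data: grep the `[xx/yyyyyy]` task lines out of the latest run's console output and flag any `work/<xx>/<hash>` directory they don't mention. The big caveat is that this only knows about task directories referenced by that one run, so anything cached under other parameter settings would wrongly look stale:

```shell
# Demo with synthetic data; in practice run.log would be the saved console
# output of the latest `nextflow run ... -resume`. Directory names below are
# made up, apart from the hashes taken from the runs quoted above.
demo=$(mktemp -d); cd "$demo"
mkdir -p work/64/e21211aaaaaaaaaaaaaaaaaaaaaaaa \
         work/0e/a5da60d82211a684a84f8bb945a491
cat > run.log <<'EOF'
[64/e21211] Cached process > generateBarcodeFasta (1)
EOF

# Extract the hash prefixes the run actually used, e.g. "64/e21211"
grep -o '\[[0-9a-f]\{2\}/[0-9a-f]\{6\}\]' run.log | tr -d '[]' > active.txt

# Any work/<xx>/<hash> whose prefix is absent from active.txt is a
# candidate for deletion (we only list it here, we don't delete)
stale=$(for d in work/*/*; do
  prefix=$(printf '%s' "$d" | cut -c6-14)   # the xx/yyyyyy part of the path
  grep -qx "$prefix" active.txt || printf '%s\n' "$d"
done)
echo "$stale"
```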

Paolo Di Tommaso

Nov 11, 2014, 12:32:53 PM
to nextflow
Hi Toby, 

I see your point and I think it makes sense. 

I opened a feature request on GitHub proposing a possible implementation. Let's continue the discussion there:



Anyone interested is invited to join.


Cheers,
Paolo

Tobias Sargeant

Nov 11, 2014, 6:36:58 PM
to next...@googlegroups.com
Hi,

This looks like exactly what I would like to be able to do. I haven't done any Groovy programming before, but I will take a look and see if I can implement it.

I didn't know about the nextflow history command, and that raises some questions in my mind.

Is the pipeline script associated with the run UUID? My guess is no, given that -resume uses the same UUID but the current version of the script.

Maybe it would be good to store the current version of the script associated with the UUID, and allow -resume UUID to restart using that version?

I wonder though: if the full execution context is stored at each invocation, is there really a difference between resuming and not resuming? Assuming none of the input state changes for a task, a cached result should surely be usable in all circumstances? It's an old-fashioned benchmark to compare against, but make(1) doesn't need to make this distinction. In fact nextflow would have an advantage over make(1) here, because it can use input parameters as well as the set of commands to be executed to decide whether to use a cached output.

Toby.

Paolo Di Tommaso

Nov 12, 2014, 5:17:04 AM
to nextflow
Hi, 

On Wed, Nov 12, 2014 at 12:36 AM, Tobias Sargeant <tobias....@gmail.com> wrote:

This looks like exactly what I would like to be able to do. I haven't done any Groovy programming before, but I will take a look and see if I can implement it.

That's very welcome. If you have some experience with Java, you should find it extremely easy to learn Groovy (personally, I started with this six-page cheat sheet: https://db.tt/aV7PoKNw).

 

I didn't know about the nextflow history command, and that raises some questions in my mind.

Is the pipeline script associated with the run UUID? My guess is no, given that -resume uses the same UUID but the current version of the script.

Maybe it would be good to store the current version of the script associated with the UUID, and allow -resume UUID to restart using that version?

Whenever you launch a pipeline execution, a new random UUID is associated with it. This number is used as a seed for all subsequent key generation in the caching mechanism.

Basically, what happens when you specify the -resume option is simply that the UUID of the last execution is re-used. Actually, you can also specify the UUID of a previously executed pipeline as an optional argument to the -resume command line option. For example:

 nextflow run <pipeline name> -resume <UUID>   

This is very handy because it allows you to modify your script(s) and continue from the last successfully executed/cached step. It also works across different script files, because caching is guaranteed at the level of a single process, not of the overall script.
 
This is why I tend to disagree with the idea of storing the current version of the script associated with the UUID. The beauty of resume is that you can modify the script and continue the execution.

To keep track of script changes/revisions, Nextflow provides smooth integration with Git. You can run a specific branch or tagged version by using the -r command line option. Have a look at "Run a specific revision" at this link


 

I wonder though: if the full execution context is stored at each invocation, is there really a difference between resuming and not resuming? Assuming none of the input state changes for a task, a cached result should surely be usable in all circumstances? It's an old-fashioned benchmark to compare against, but make(1) doesn't need to make this distinction. In fact nextflow would have an advantage over make(1) here, because it can use input parameters as well as the set of commands to be executed to decide whether to use a cached output.


I'm not sure I fully get your point here. Anyway, although having resume always "active" makes more sense during pipeline development, I think that when you run your pipelines in production you need to keep these instances separated. For this reason I chose to require the resume option to be provided explicitly.



Cheers,
Paolo


Maria Chatzou

Nov 12, 2014, 5:22:15 AM
to next...@googlegroups.com
Hi Tobias,

About the resuming option, you are right: the framework can handle resuming fairly easily and internally, so yes, we could do without -resume.
The reason we have it, though, is that we want to keep two conceptual things separate: re-running and resuming. We want to give the user full control over this, so they can decide every time whether to do a clean run or continue from the last step where the pipeline stopped.

It might sound silly to ask why anyone would want to re-run a pipeline when no changes have been made, but we have had cases of people who wanted to re-run a pipeline all over again just to make sure that results are reproducible. Without a clear separation between re-running and resuming, they wouldn't be able to do that.

Cheers,
Maria
