How to wait for a file to be created before moving on to the next process?

Dave Deandre

Jun 7, 2018, 5:21:02 PM
to Nextflow
Hi,

I am trying to read a file with Channel.fromPath(<pathFile here>) that is produced by the first process and later used by a downstream process. However, when the pipeline runs, the file does not exist yet, because the process runs asynchronously with respect to the variable assignment. The code below is just a reduced replication of the actual pipeline, in which the output of the shell process cannot be captured by Nextflow. I tried Channel.watchPath(<path File here>) instead of fromPath, and it worked, but the pipeline never ends (as mentioned in the docs).

So my questions are as follows:
1. Is there any way to stop watchPath after the desired processes have finished?
2. Is there any other way to emit the file from a specific directory after a process has finished?
3. Are there any other alternatives?

NF script:


#!/usr/bin/env nextflow

/*
 * Step 1: make a file called var1.txt
 */
process makeFiles {
    publishDir "<filePath where var1.txt will be published at the end of makeFiles process>"

    output:
    file "var1.txt"

    shell:
    """
    touch var1.txt
    echo 100 > var1.txt
    """
}

/*
 * Import var1.txt into a channel
 */
varChannel = Channel.fromPath("<filePath where var1.txt will be published at the end of makeFiles process>")

varChannel.subscribe { println it.text }


I understand that in this example the file could be sent to a channel directly in the output declaration: file "var1.txt" into varChannel.

However, that is not the case in the actual problem I am working on: the file cannot be captured directly by Nextflow, so I have to wait for the files to be created in the file path.


Thank you very much!

Paolo Di Tommaso

Jun 8, 2018, 5:00:14 AM
to nextflow
Process communication is done via (file) channels. You don't need to use intermediate folders. publishDir is meant for saving the final output of a workflow task, not for internal process communication.

I would suggest having a look at the Basic concepts section of the documentation.

  
Hope it helps. 

p

--
You received this message because you are subscribed to the Google Groups "Nextflow" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nextflow+unsubscribe@googlegroups.com.
Visit this group at https://groups.google.com/group/nextflow.
For more options, visit https://groups.google.com/d/optout.

Felix Kokocinski

Jun 8, 2018, 8:36:49 AM
to Nextflow
Hi Paolo,

Related to this, I am wondering what your suggestion would be if the processes are driven by their own config files, which define where to find the input files and where to store the outputs.
If I don't know before the process starts which files (or file names) the first process will write, it would be easier to rely on them being transferred to the outdir so that the next process can pick them up from there. In my case they are actually stored in sub-directories, which makes piping from output to input more difficult.

Would you add a process or processing step that stages the files into the input (sub-)directories expected by downstream processes?

 Many thanks, Felix





Paolo Di Tommaso

Jun 8, 2018, 9:30:13 AM
to nextflow
> what your suggestion might be if the processes are driven by their own config files

I'm not sure I understand this. The main difference between NF and many other tools is that tasks are, *by design*, not aware of the storage paths of their input/output data.

This does not limit task composition and brings many benefits: it isolates tasks, avoids file name collisions when running parallel tasks, simplifies debugging, and makes the overall execution resumable and more resilient.

A task that depends on another does not need to know the names of the files produced by the upstream one.


p

Felix Kokocinski

Jun 8, 2018, 9:46:54 AM
to Nextflow
Sorry if this wasn't clear. Let me try again. 
I'm executing two programs (or steps of a program) in two Nextflow processes. Both programs require a config file and no other input, as the input data locations are defined within the config. Program1 writes some files into "$outdir/$sample/processed" that program2 expects to find there. A quick NF code example is below.
So how can I make sure that the data written by program1 is there for program2?

Thanks, Felix


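// config_ch_1 and config_ch_2 are assumed to be two copies of the config channel,
// e.g. created with Channel.fromPath(...).into { config_ch_1; config_ch_2 }
process part_1 {
  publishDir outdir

  input:
  file config from config_ch_1

  """
  bash program1 $config
  """
}

process part_2 {
  publishDir outdir

  input:
  file config from config_ch_2

  """
  bash program2 $config
  """
}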


Paolo Di Tommaso

Jun 8, 2018, 9:50:26 AM
to nextflow
> Program1 writes some files into "$outdir/$sample/processed" that program2 expects to find there

This is an anti-pattern: the program *must* write its output to the task work dir or a sub-directory of it.


p

Felix Kokocinski

Jun 8, 2018, 9:58:48 AM
to Nextflow
OK, but is there an easy way to provide all files and their current (sub)directory structure from the output of the task-1 work dir to the input task area of task-2, please?
Would that be something like
  output:
  file(*)

Thanks, Felix



Paolo Di Tommaso

Jun 8, 2018, 10:03:05 AM
to nextflow
Of course.


process foo {
  output: file('*') into all_files_ch

  """
  your_command
  """
}


process bar {
  input: file('*') from all_files_ch

  """
  another_command
  """
}


Dave Deandre

Jun 8, 2018, 10:11:43 AM
to Nextflow
Thank you very much,

Much appreciated.

Dave


Paolo Di Tommaso

Jun 8, 2018, 10:13:34 AM
to nextflow
You are welcome !


p

Felix Kokocinski

Jun 8, 2018, 10:35:31 AM
to Nextflow
Hehe, that is too easy! ;-)

I'm currently getting an error for this however:

  Missing output file(s) `*` expected by process `analysis_1`


Thanks, Felix

Paolo Di Tommaso

Jun 8, 2018, 10:49:08 AM
to nextflow
This sounds like the task isn't producing any new file/dir. 

Check the content of the task work dir.
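
As a rough sketch of that check from the shell: the snippet below lists the top-level, non-hidden entries of a directory, which is approximately what a `*` output pattern can pick up (to my knowledge hidden files such as `.command.sh` are not matched by default, but treat that as an assumption). The `list_candidates` helper and the scratch directory are hypothetical stand-ins; in a real run you would point it at the work dir of the failing task.

```shell
#!/usr/bin/env bash
set -euo pipefail

# List the entries a `*` pattern could match: top-level and non-hidden.
list_candidates() {
  find "$1" -mindepth 1 -maxdepth 1 ! -name '.*' | sort
}

# Demo: a scratch directory standing in for a task work dir (hypothetical contents).
task_dir="$(mktemp -d)"
touch "$task_dir/.command.sh" "$task_dir/.command.log"
echo "before: $(list_candidates "$task_dir" | wc -l) candidate(s)"

# Once the task writes a regular file into the directory, it shows up.
touch "$task_dir/test.file"
echo "after: $(list_candidates "$task_dir" | wc -l) candidate(s)"
```

If the list is empty after the command has run, the program is writing its results somewhere else.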


p


Felix Kokocinski

Jun 8, 2018, 11:44:31 AM
to next...@googlegroups.com
A file explicitly created with
touch test.file
is visible in the work folders, but the files produced by the called program are only written to the outDir, because the output path is also specified in the config file that is used. I suppose the output files are written there directly...
Have you come across this scenario before?
Would I have to dynamically rewrite the config file for every step so that it points to a tmp output area?

Many thanks, Felix



Paolo Di Tommaso

Jun 8, 2018, 11:50:47 AM
to nextflow
> Would I have to dynamically re-write the config file for every step pointing to a tmp output area?

If this `outDir` is outside the task work dir, yes. 
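
A minimal sketch of such a rewrite, as it might appear at the top of a task script. It assumes the config is a plain key=value file with an `outdir=` line, which may not match the real tool's format; the `rewrite_outdir` helper, the file names and the `/shared/results` path are all hypothetical, and the scratch directory only stands in for a task work dir.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Copy a config file, replacing its outdir line so the tool writes
# under the given directory. All other lines are copied unchanged.
# Assumption: the config uses plain `key=value` lines.
rewrite_outdir() {
  local src="$1" dest="$2" new_outdir="$3"
  sed "s|^outdir=.*|outdir=${new_outdir}|" "$src" > "$dest"
}

# Demo: a scratch directory standing in for the task work dir.
workdir="$(mktemp -d)"
cd "$workdir"
printf 'sample=S1\noutdir=/shared/results\n' > original.conf

# Point the output at a sub-directory of the current (work) directory.
rewrite_outdir original.conf task.conf "$PWD/out"
mkdir -p out
cat task.conf
```

The program would then be run with `task.conf` instead of the original config, and the `out` sub-directory declared as the process output so that it travels to the next process through a channel.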


p
