Iterate over nth pairs of files within a process


rpl...@genoscope.cns.fr

Feb 3, 2017, 6:59:40 AM
to Nextflow
Hello,

I would like to create a kind of job array because each job is very short. I've got a list of pairs of files and I would like to process them as arrays of pairs.
Let's say I have a channel like:
 
pairOfFiles = Channel.from(
    [file('./A-1.txt'), file('./B-1.txt')],
    [file('./A-2.txt'), file('./B-2.txt')],
    [file('./A-3.txt'), file('./B-3.txt')],
    [file('./A-4.txt'), file('./B-4.txt')],
    [file('./A-5.txt'), file('./B-5.txt')]
)

Then I want to process them in arrays of size 2.

pairOfFiles
.buffer(size: 2, remainder: true)
.set { ArrayFiles }


The channel will look like:

[[A-1.txt, B-1.txt], [A-2.txt, B-2.txt]]
[[A-3.txt, B-3.txt], [A-4.txt, B-4.txt]]
[[A-5.txt, B-5.txt]]
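The same grouping can also be written with the `collate` operator, which is equivalent here (a sketch):

```nextflow
pairOfFiles
    .collate(2, true)       // groups of 2; 'true' keeps the trailing remainder
    .set { ArrayFiles }
```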



I've got a program that takes input like:

command file_type_A file_type_B


So I would like to create a process that takes as input a channel like "ArrayFiles" and processes an array of pairs of files, like in https://www.nextflow.io/docs/latest/faq.html#how-do-i-iterate-over-nth-files-them-within-a-process, but with a list of pairs of files instead of a list of files.

I could create a file like:

A-1.txt,B-1.txt
A-2.txt,B-2.txt

and launch the command for each line, but is there a better way to implement it? Using the channel operators?

Thank you

Paolo Di Tommaso

Feb 5, 2017, 2:02:20 PM
to nextflow
Hi, 

If I've understood your problem correctly, you don't need to structure the pairs into arrays. You can provide the result of `buffer` as a process input. For example:


Channel.fromPath('file*')
    .buffer(size: 2, remainder: true)
    .set { pairOfFiles }
    
    
process foo {
   input: file x from pairOfFiles
   
   """
   echo $x
   """
 }    


Hope it helps 

Cheers,
Paolo


--
You received this message because you are subscribed to the Google Groups "Nextflow" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nextflow+unsubscribe@googlegroups.com.
Visit this group at https://groups.google.com/group/nextflow.
For more options, visit https://groups.google.com/d/optout.

rpl...@genoscope.cns.fr

Feb 6, 2017, 4:42:12 AM
to Nextflow
Hello,

Yes, I don't have to, but since the jobs are very short, if I submit one job per pair of files, it spends more time waiting in the queue than running.
To avoid that, I would like to submit one job for n pairs of files and, if possible, without writing this list (of pairs of files) to a file.

Cheers,

Rémi 

Paolo Di Tommaso

Feb 6, 2017, 5:15:39 AM
to nextflow
Is it possible to determine these pairs using a pattern on the file names? Otherwise, how are you planning to define them without having to write a list?


Cheers, Paolo



rpl...@genoscope.cns.fr

Feb 6, 2017, 5:44:47 AM
to Nextflow
Yes, I've already grouped them using the "phase" operator and I get this channel:

( ['./A-1.txt', './B-1.txt'], ['./A-2.txt', './B-2.txt'], ['./A-3.txt', './B-3.txt'], ['./A-4.txt', './B-4.txt'], ['./A-5.txt', './B-5.txt'] )

The pairs of files are grouped in the same tuple.
From that, I've grouped them again with buffer and got a channel that contains:

( [['./A-1.txt', './B-1.txt'], ['./A-2.txt', './B-2.txt']] , [['./A-3.txt', './B-3.txt'], ['./A-4.txt', './B-4.txt']] , [['./A-5.txt', './B-5.txt']] )


And now, I would like to launch one job/process per array, like:

Submitted process > process foo ([['./A-1.txt', './B-1.txt'], ['./A-2.txt', './B-2.txt']])
Submitted process > process foo ([['./A-3.txt', './B-3.txt'], ['./A-4.txt', './B-4.txt']])
Submitted process > process foo ([['./A-5.txt', './B-5.txt']])


The process foo contains one command, and I want to loop through the array within the process, like:

command A-1.txt B-1.txt
command A-2.txt B-2.txt

It will submit one job on the cluster, and this job will run the same command 2 times, but with different arguments.

Thanks,

Rémi

Paolo Di Tommaso

Feb 6, 2017, 9:30:35 AM
to nextflow
If the process is receiving the files A-1.txt B-1.txt and A-2.txt B-2.txt, then it's just a matter of iterating over them using whatever scripting is appropriate for your job. Bash? Perl?
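For instance, a Bash loop inside the process script could iterate over the two staged lists positionally (a minimal sketch; the channel names, the pairing-by-position, and `command` as a stand-in for the real program are all assumptions):

```nextflow
process runPairs {
  input:
  file a_files from ch_a   // e.g. [A-1.txt, A-2.txt]
  file b_files from ch_b   // e.g. [B-1.txt, B-2.txt]

  script:
  """
  a_list=( ${a_files} )
  b_list=( ${b_files} )
  for i in \${!a_list[@]}; do
      command \${a_list[\$i]} \${b_list[\$i]}
  done
  """
}
```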

Unfortunately I'm not getting exactly what problem is stopping you. If you can provide a running code example, I can try to help you improve it.


Cheers,
Paolo
 


rpl...@genoscope.cns.fr

Feb 15, 2017, 5:09:00 AM
to Nextflow
Sorry for the late reply, I wasn't available :)

So what you are suggesting is to transform a Groovy data structure into a bash/perl/... data structure (I've attached a working example).

I thought there might be another solution, like looping through the Groovy data structure with a Groovy method and executing the command at each iteration, using:
  • the Groovy syntax:
    "command foo".execute()


  • or even the "three double-quote" Nextflow syntax, like:
    process foo {
       input:
       val groovy_array from BAR

       script:
       groovy_array.each { it ->
          """
          command $it
          """
       }
    }

Thank you,

Rémi
test-nextflow.nf

Paolo Di Tommaso

Feb 15, 2017, 6:05:14 AM
to nextflow
Almost there. I was suggesting more the approach implemented in the script you attached.

Also, the `groovy_array` loop in the script is an option, though it's not the right syntax.

Even if it could work this way, it's still a sub-optimal solution, because the declared input contract doesn't match the real input values, i.e. you are declaring a value input while you are actually passing files; therefore it won't work if you want to deploy your script with containers.


I would go for something like the following: 



Channel.from(1..5).buffer(size: 3, remainder: true).set { ch_ids }
Channel.fromPath('A-*.txt').buffer(size: 3, remainder: true).set { ch_a }
Channel.fromPath('B-*.txt').buffer(size: 3, remainder: true).set { ch_b }

process launchArrayJobs {
  echo true
  
  input:
  val ids from ch_ids
  file file_a from ch_a
  file file_b from ch_b
  
  script:
  assert ids.size() == file_a.size()
  assert ids.size() == file_b.size()
  def cmd = ''
  for( int i=0; i<ids.size(); i++ ) {
    cmd += "echo command ${ids[i]} ${file_a[i]} ${file_b[i]}\n"
  } 
  cmd
}



Note the use of a `for` loop to concatenate the command string, which is finally returned as the last statement.
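The same command string could also be built with Groovy's `transpose` and `collect` (an equivalent sketch):

```groovy
// pair up the i-th id, A file and B file, then render one command per triple
def cmd = [ids, file_a, file_b].transpose()
    .collect { id, a, b -> "echo command ${id} ${a} ${b}" }
    .join('\n')
```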


Cheers,
Paolo 




rpl...@genoscope.cns.fr

Feb 15, 2017, 10:23:29 AM
to Nextflow
Thanks!

Will the 3 channels automatically be in the correct order? In my case, the two file channels (ch_a and ch_b) come from other processes.

Paolo Di Tommaso

unread,
Feb 15, 2017, 10:55:39 AM
to nextflow
Channels are logically FIFO; however, if ch_a and ch_b are produced by two different processes, it's not guaranteed that they have the same order.

As a possible solution, you can sort them (if a comparison logic is applicable), or, if those files share a common identifier, you can match them with the phase operator and then separate them.
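A minimal sketch of the phase approach (the key extraction assumes file names like A-1.txt / B-1.txt; adjust it to your naming scheme):

```nextflow
ch_a
    .phase(ch_b) { f -> f.name.tokenize('-.')[1] }  // match on the numeric suffix
    .buffer(size: 3, remainder: true)
    .set { ch_pairs }   // groups of up to 3 synchronized [a, b] pairs
```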

 
Cheers,
Paolo





rpl...@genoscope.cns.fr

Feb 16, 2017, 7:45:25 AM
to Nextflow
Thank you very much for your help and your time !

Rémi