how to create a process for each input file


Manuele Simi

May 28, 2015, 5:14:55 PM
to next...@googlegroups.com
Hello,

I'm totally new to nextflow.io, so this is my first hands-on experience with the tool.

I have two processes: the first one creates a set of files, and I would like an instance of the second process to be executed in parallel for each of those files. I looked at the documentation and tried several options, but none worked as I expected.

The two processes are defined as follows:


process submit {
       
        output:
        file 'slicingPlan.tsv' into plan
        file 'index_*' into indexes    

        """
        java -jar /Users/mas2182/Lab/TestWorkflows/goby/goby.jar -m suggest-position-slices  -n 200 -o slicingPlan.tsv '${params.alignment1}' '${params.alignment2}' > /dev/null
        split -l 1 slicingPlan.tsv index_
        """
}


process align {
        input:
        each index from indexes

        "echo the line is: ${index}"   
}


However, the second one is always executed only once, with the whole set of files passed as a parameter, and I don't understand why. Any suggestion is more than welcome.

With lower priority, I would also like to avoid using files as input for the second process, and instead use a Channel of values created from the content of the file produced by the first one. Again, I'm failing to get to that point, because all the examples I looked at seem to create Channels from sets of values defined outside the processes.

Thanks!
manuele

Paolo Di Tommaso

May 28, 2015, 6:21:43 PM
to nextflow
Hi, 

By default, a process emits all the files matching a wildcard as a single item. The downstream process therefore receives a list of files as input, and since the "submit" process is executed just once, the "align" process is also executed exactly once.

To execute an "align" task for each file, the "indexes" channel needs to emit each file as a separate item. You can do this by adding the option "mode flatten" to the output definition. You can find an example here:
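
This is not the linked example, but a minimal sketch of how the "mode flatten" option could look when applied to the script from the question. It uses the process syntax shown in this thread, which newer Nextflow releases may no longer accept. The `seq` command is only a stand-in for the real goby call, assumed here to keep the sketch self-contained.

```groovy
process submit {

        output:
        file 'slicingPlan.tsv' into plan
        // 'mode flatten' emits each file matching the wildcard as its own item
        file 'index_*' into indexes mode flatten

        """
        # stand-in for the goby command: writes a three-line plan (assumption)
        seq 1 3 > slicingPlan.tsv
        split -l 1 slicingPlan.tsv index_
        """
}

process align {
        input:
        // declared as 'file' (not 'each'), so one task runs per emitted file
        file index from indexes

        "echo the line is: ${index}"
}
```

With three lines in the plan, "split" writes three index files, so "align" should be launched three times, once per file.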



Another way consists of using the flatMap operator. For an example, see the link below:

https://gist.github.com/pditommaso/8ce246ecc4dfa8eab477
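
Independently of the gist, a minimal sketch of the operator-based variant might look like this. It assumes the "submit" process from the question is left unchanged, so "indexes" emits one list containing all the index files, and it relies on "flatMap" called without arguments, as in the snippet further down this message.

```groovy
process align {
        input:
        // flatMap() splits the single list emission into one item per file
        file index from indexes.flatMap()

        "echo the line is: ${index}"
}
```

The difference from "mode flatten" is only where the splitting happens: in the output declaration of the upstream process, or on the channel just before it is consumed.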


Also note, at line 16, that the input should be declared as "file", not "each". The latter is used to iterate over a "static" list of items and is not meant to be used with a channel.


Regarding your second question: if you want the second process to handle values instead of files, you need to transform the content of that channel from files to values.

You can do that using the map operator, for example: 

valIndexes = indexes.flatMap().map { file -> extractValue(file.text) }

Here "flatMap" transforms the list of files into individual files (as shown before), and then "map" transforms each file into a value. "extractValue" is just a helper function that you can use to wrap your conversion logic. Finally, the new channel "valIndexes" can be used as the input for the second process (in place of "indexes").

In Nextflow, operators are the Swiss Army knife used to transform and adapt channels and to connect processes.
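
Put together, a minimal sketch of the value-based variant could look like the following. The body of "extractValue" is purely hypothetical: it assumes the value of interest is the single line stored in each index file, with surrounding whitespace removed. Replace it with whatever parsing the slicing plan actually needs.

```groovy
// Hypothetical helper: returns the file content without the trailing newline
def extractValue(String text) {
        text.trim()
}

// One list of files -> individual files -> one value per file
valIndexes = indexes
        .flatMap()
        .map { file -> extractValue(file.text) }

process align {
        input:
        // 'val' receives plain values instead of staged files
        val index from valIndexes

        "echo the line is: ${index}"
}
```

Since the input is declared with "val", no file is staged in the task directory; the string is simply interpolated into the command.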


Let me know if this helps, or if you have any further doubts.


Cheers,
Paolo


Manuele Simi

May 29, 2015, 11:06:33 AM
to next...@googlegroups.com
Hi Paolo,

Thanks. That was a super-useful suggestion, and it worked. Thanks also for the further explanation.

I had noticed the flatten mode in the sample script before, but it does not seem to be documented, not even in the Channels chapter of the documentation. Or at least I couldn't find it.

Thanks again,
manuele