How to group items in channel by base name


Fa Vio

Dec 16, 2016, 11:27:23 AM
to Nextflow
Hello,
I've been using Nextflow a little bit recently, but I cannot get this to work properly. Here is the situation:

I have a number of files, each of which I split into 4 smaller parts; each part is processed independently, and the outputs should then be merged into a new, definitive big file.
Suppose the initial files are "BigFile1.bf, BigFile2.bf, etc."; I can't find a way to group the processed parts by base name and merge them accordingly.
Maybe this example clarifies:

Channel.fromPath(params.bigfiles).into { big_files }

process split
{
    input:
    file big_file from big_files

    output:
    file '*.part_[0-9].bf' into split_files mode flatten

    """
    split_files.sh $big_file
    """
}

process workOnFiles
{
    input:
    file(partial) from split_files

    output:
    file "${partial.baseName}.EDIT.bf" into edited_files

    script:
    """
    edit_files.sh ${partial} > ${partial.baseName}.EDIT.bf
    """
}

process merge {
    input:
    file edited_files

    output:
    file '*.FINAL.bf' into final_file 

    script:
    prefix = edited_files.toString() - ~/(\.part_[0-9])?(\.EDIT)?(\.bf)?$/
    """
    merge_file.sh ${edited_files} -o ${prefix}.FINAL.bf
    """
}

How do I make sure that the 'merge' process merges together only the BigFile1.part_*.EDIT.bf files, and not the parts of other files?
I've been trying the 'groupBy' operator, with something like

edited_files
    .groupBy { String str -> str - ~/(\.part_[0-9])?(\.EDIT)?(\.bf)?$/ }

but it didn't work.

Other operators can emit tuples containing a fixed number of values, but I need to make sure that the values in each tuple come from the same source file.


Any advice would be of great help,
Fabio





Paolo Di Tommaso

Dec 16, 2016, 11:45:29 AM
to nextflow
Hi Fabio, 

I would do the following: given `edited_files`, map each file to a pair <common prefix, file>, then use groupTuple to group together all pairs that share the same prefix.

Thus replace the code after the process `workOnFiles` as shown below: 

edited_files
  .map { file -> tuple(get_prefix(file.name), file) }
  .groupTuple()
  .set { grouped_files }


process merge {
    input:
    set prefix, file(edited_files) from grouped_files

    output:
    file '*.FINAL.bf' into final_file 

    script:
    """
    merge_file.sh ${edited_files} -o ${prefix}.FINAL.bf
    """
}

Note: replace `get_prefix` with your own code that extracts the common prefix from a file name. Moreover, if you always expect the same number of parts, specify the `size` attribute when using the grouping operator (see the docs), e.g. groupTuple(size: 4), so that each group is emitted as soon as it is complete.
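For illustration, here is a minimal sketch of what such a helper could look like. It assumes the file names follow the pattern from the question (e.g. BigFile1.part_0.EDIT.bf); `get_prefix` is only the placeholder name used in this thread, and the sample names are made up. The second half mimics the map + groupTuple idea on an ordinary Groovy list, just to show which files end up together. It is not how the operator itself is implemented.

```groovy
// Hypothetical helper (the thread only names it): strip the trailing
// ".part_N", ".EDIT" and ".bf" suffixes from a file name.
def get_prefix(String name) {
    name.replaceFirst(/(\.part_[0-9])?(\.EDIT)?(\.bf)?$/, '')
}

// Made-up file names, in the arbitrary order a channel might emit them.
def names = [
    'BigFile1.part_0.EDIT.bf',
    'BigFile2.part_0.EDIT.bf',
    'BigFile1.part_1.EDIT.bf',
    'BigFile2.part_1.EDIT.bf',
]

// Step 1: pair every name with its prefix (the role of `map` + `tuple`).
def pairs = names.collect { [get_prefix(it), it] }

// Step 2: collect the second element of all pairs sharing the same first
// element (the role of `groupTuple`).
def grouped = pairs
    .groupBy { it[0] }
    .collectEntries { prefix, group -> [(prefix): group.collect { it[1] }] }

grouped.each { prefix, files -> println "${prefix} -> ${files}" }
```

In the pipeline itself only the function definition is needed; the grouping is done by the `map` and `groupTuple` calls shown above.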


Hope it helps 



Cheers,
Paolo




--
You received this message because you are subscribed to the Google Groups "Nextflow" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nextflow+unsubscribe@googlegroups.com.
Visit this group at https://groups.google.com/group/nextflow.
For more options, visit https://groups.google.com/d/optout.

Fa Vio

Dec 16, 2016, 1:35:24 PM
to Nextflow
That's awesome, exactly what I needed.

Thanks a lot for the quick reply.


Best,
Fabio

Paolo Di Tommaso

Dec 16, 2016, 1:36:46 PM
to nextflow




biowang

Feb 9, 2017, 2:49:16 AM
to Nextflow
Could you point me to the documentation for the functions or operators `tuple` and `get_prefix`?

On Saturday, December 17, 2016 at 12:45:29 AM UTC+8, Paolo Di Tommaso wrote:

Paolo Di Tommaso

Feb 9, 2017, 5:56:04 AM
to nextflow
Hi, 

Nextflow operators are documented in the Nextflow documentation. The function `tuple` is little more than syntactic sugar for creating an immutable list of items. The function `get_prefix` mentioned in this thread is an invented name for a user-provided function.

Don't forget that Nextflow is a superset of the Groovy/Java programming language, so you can define in your script any function (or even class) that you may need in your code.

You can refer to the Groovy programming language documentation to learn more about the function syntax.
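As a rough illustration of the "or even class" part, the sketch below defines a small helper class in plain Groovy. The class name `PartName`, its fields and the file-name pattern are all assumptions made up for this example (they follow the naming used in the original question); nothing here is provided by Nextflow.

```groovy
// Hypothetical helper class: splits a name such as
// "BigFile1.part_2.EDIT.bf" into its prefix and part number.
class PartName {
    String prefix
    int part

    static PartName parse(String name) {
        // Group 1 is the base name, group 2 the single-digit part index.
        def m = name =~ /^(.+)\.part_([0-9])(\.EDIT)?\.bf$/
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected file name: ${name}")
        }
        new PartName(prefix: m.group(1), part: m.group(2).toInteger())
    }
}

def parsed = PartName.parse('BigFile1.part_2.EDIT.bf')
println "${parsed.prefix} / part ${parsed.part}"
```

A plain function is usually enough for extracting a prefix; a class like this only pays off when several pieces of the file name are needed at once.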


Cheers,
Paolo

