Variable number of inputs causes groupby to nest tuples resulting in a channel that cannot be used

jaha...@gmail.com

Apr 24, 2018, 10:42:38 PM

to Nextflow

Hello,

I am pretty new to Nextflow, but have been enjoying it a lot so far. I am having a problem when using the groupTuple operator to condense outputs from a channel. Basically, I am downloading fastq files that can be either paired or unpaired. I then want to take these files (which are SRR run files) and combine them so that they correspond to their SRX, since some SRX have multiple SRR associated with them.

The problem is that when I have paired-end reads, the tuple that is generated nests, and Nextflow cannot use the nested tuples as a channel. When I look in the work directory, instead of links to my desired fastq files, there is just a file containing the inner tuple. Here is what my problem looks like:

Unpaired:

When I println the channel, the tuples for single-end reads (which work as channel inputs later on) look like the following two examples. The first is an SRX with two SRRs, and the second is an SRX with only one.

[SRX1734976, [~/PATH_TO_DIR/work/c3/6beac3c84990915e1e78309ba24815/SRR3463850_1.fastq, ~/PATH_TO_DIR/work/67/5bfbbb51ae18f0aa6685b91c98aa20/SRR3463851_1.fastq]]

[SRX1734541, [~/PATH_TO_DIR/work/e6/f6fcf3930e97fa409b472e5b37edca/SRR3460000_1.fastq]]

And when I look at a work directory after the following step, Nextflow has correctly recognized the grouped output and linked in the proper files:

lrwxrwxrwx 1 john.hadish 91 Apr 24 19:15 SRR3463850_1.fastq -> ~/PATH_TO_DIR/work/e4/6faaf89302e28a60883e11b2a9de8e/SRR3463850_1.fastq
lrwxrwxrwx 1 john.hadish 91 Apr 24 19:15 SRR3463851_1.fastq -> ~/PATH_TO_DIR/work/33/cde3c590185c4fee2ad29ab117e480/SRR3463851_1.fastq
-rw-r--r-- 2 john.hadish 590348 Apr 24 19:15 SRX1734976_1.fastq
-rw-r--r-- 2 john.hadish 0 Apr 24 19:15 SRX1734976_2.fastq

In this case groupTuple produces an item that Nextflow can use as a channel input.

Paired:

When I have paired data, however, I run into problems. The paired-end reads channel made by groupTuple nests the files one level deeper, since there are two files associated with each SRR:

[SRX2589027, [[~/PATH_TO_DIR/work/d3/2a66a4fb4579bcbdd7b8b62859623f/SRR5285723_1.fastq, ~/PATH_TO_DIR/work/d3/2a66a4fb4579bcbdd7b8b62859623f/SRR5285723_2.fastq], [~/PATH_TO_DIR/work/d3/f409e6ff8928863c6a903094182b8c/SRR5285722_1.fastq, ~/PATH_TO_DIR/work/d3/f409e6ff8928863c6a903094182b8c/SRR5285722_2.fastq]]]

[SRX2081981, [[~/PATH_TO_DIR/work/e2/663dd08280d902001d0952230303d8/SRR4113368_1.fastq, ~/PATH_TO_DIR/work/e2/663dd08280d902001d0952230303d8/SRR4113368_2.fastq]]]

And when I look at the work directory after the following step, I do not have the files that groupTuple collected, but rather "input" files:

total 8
lrwxrwxrwx 1 john.hadish 84 Apr 24 19:15 input.1 -> ~/PATH_TO_DIR/work/tmp/cc/541f9775a405e24c48d6db6b81bc14/input.1
lrwxrwxrwx 1 john.hadish 84 Apr 24 19:15 input.2 -> ~/PATH_TO_DIR/work/tmp/7f/10052b31520caa96da8c8962121714/input.2
-rw-r--r-- 2 john.hadish 0 Apr 24 19:15 SRX2589027_1.fastq
-rw-r--r-- 2 john.hadish 0 Apr 24 19:15 SRX2589027_2.fastq

The input.1 and input.2 files just contain the inner tuples seen above, written out as text:

[/scidas/arabidopsis/trial/sra2gev/work/a7/6b059184fdd8a5fdf9436720afa746/SRR5285723_1.fastq, /scidas/arabidopsis/trial/sra2gev/work/a7/6b059184fdd8a5fdf9436720afa746/SRR5285723_2.fastq]

Code:

The section of my code where I use groupTuple looks like this:

// This section downloads the files that are either paired or unpaired and passes them to the raw_fastq channel
// The srx value was determined in the step prior to this based on a srr from a list

process fastq_dump {
    module 'sratoolkit'
    publishDir "$srx", mode: 'link'
    time '24h'
    tag { sra }

    input:
    set val(srx), val(sra) from srx_value

    output:
    set val(srx), file("${sra}_?.fastq") into raw_fastq // The output can be paired or unpaired

    """
    fastq-dump --split-files $sra
    """
}

// Here is where I use groupTuple, which produces a usable channel for unpaired data, but nested tuples for paired data

raw_fastq
    .groupTuple()
    .set { grouped_fastq }

I have tried using "flatten" and "group" and have looked around on the Google group for help, but none of the other topics matched this problem. Any help you could give me would be greatly appreciated.

Paolo Di Tommaso

Apr 25, 2018, 8:44:50 AM

to nextflow

> The problem that I am having is that when I have paired end reads, the tuple that is generated nests, and nextflow can not use nested tuples as a channel. When I look in the work directory, instead of linking my desired fastq files, it just has a file with the inside tuple in it. Here is what my problem looks like

Nextflow can handle nested tuples if the input is declared properly. I suspect you have not declared the reads as a `file`. Look at this snippet for an example.

Hope it helps.

p


jaha...@gmail.com

Apr 25, 2018, 7:48:03 PM

to Nextflow

Hi Paolo, thank you for the quick reply.

I do not think that is my problem, since I have been declaring them as files in the next step. I boiled the problem down and concluded that two steps are nesting my files, and after the second nesting they become unusable. The first nesting happens when they are sent to "raw_fastq", and the second when they are grouped with "groupTuple".

I think my problem would be solved if I could suppress the first nesting. I now see that if an output is declared using a wildcard in the file name (as in my script), the files are grouped into a tuple, but if there is no wildcard they are not grouped. I wrote the following minimal code to show this:

process yes_nests {

    output:
    set val("SRX0006"), file("*.txt") into nested_channel

    """
    echo foo > results.txt
    echo bar > other_results.txt
    """
}

process no_nests {

    output:
    set val("SRX0006"), file("results.txt"), file("other_results.txt") into non_nested_channel

    """
    echo foo > results.txt
    echo bar > other_results.txt
    """
}

non_nested_channel.println()
nested_channel.println()

These two processes look almost identical, but their output differs. The process titled "yes_nests" (the one that uses a wildcard to output files) prints this, with an extra pair of brackets around the files:

[SRX0006, [/home/john.hadish/Minimal_Working_Example/work/3a/c728dbd76474b840b2853b2ad95654/other_results.txt, /home/john.hadish/Minimal_Working_Example/work/3a/c728dbd76474b840b2853b2ad95654/results.txt]]

And the one that does not use a wildcard (titled "no_nests") prints this:

[SRX0006, /home/john.hadish/Minimal_Working_Example/work/86/f227df662643a071caac5631908c88/results.txt, /home/john.hadish/Minimal_Working_Example/work/86/f227df662643a071caac5631908c88/other_results.txt]

As you can see, the first one groups the files, while the second one just outputs all to a channel, with no additional grouping. Since my next step also groups, this becomes a problem.
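The effect of the second grouping can be reproduced with plain channels, without running any process. This is only a sketch: the file names are plain strings standing in for the staged paths, and `Channel.from` stands in for the real process outputs.

```groovy
// Shape produced by a wildcard output for paired reads: [srx, [read1, read2]]
Channel
    .from(
        ['SRX2589027', ['SRR5285723_1.fastq', 'SRR5285723_2.fastq']],
        ['SRX2589027', ['SRR5285722_1.fastq', 'SRR5285722_2.fastq']]
    )
    .groupTuple()
    .println()   // the second element is a list of lists

// Shape produced when each file is emitted as its own item: [srx, read]
Channel
    .from(
        ['SRX2589027', 'SRR5285723_1.fastq'],
        ['SRX2589027', 'SRR5285723_2.fastq'],
        ['SRX2589027', 'SRR5285722_1.fastq'],
        ['SRX2589027', 'SRR5285722_2.fastq']
    )
    .groupTuple()
    .println()   // the second element is one flat list
```

Only the second shape gives the downstream process a flat list of files per SRX.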

My question now is: is there a way to suppress this grouping when using a wildcard in the output? It seems that there should be a way to make both of the above processes output the same thing, since they are almost identical.

I would like to keep the wildcard in my code so that I do not have to worry about how many files are associated with each run, but this grouping is preventing me from using it.

I understand that either of the above outputs would work on its own, but the next step also groups (groupTuple), so I run into problems after that step.

Thanks in advance.

Steve

Apr 25, 2018, 8:42:26 PM

to Nextflow

"I have been declaring that they are files in the next step"

They have to be declared as `file` both in the Channel and in the Process input.

If I am understanding correctly, your greater issue is that you need to group a variable number of files together and pass them through your channel. I had the same issue, and wrote some scripts to handle it. There is a demo of the method I use here:

https://github.com/stevekm/nextflow-demos/tree/master/parse-samplesheet

I use a .tsv formatted samplesheet, where one of the fields is a comma-separated list of input fastq files. I then parse the samplesheet in my input Channel, and split the fastq field, before passing it through to the Processes, where I pick up the files that were passed.
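For readers who want the general idea without opening the repository, here is a rough sketch of that approach. It is not the code from the demo: the file name `samplesheet.tsv`, the column names `sample` and `fastq`, and the process `count_lines` are all made up for illustration.

```groovy
// Hypothetical samplesheet.tsv (tab-separated, with a header line):
//   sample    fastq
//   SRX0001   /data/SRR0001_1.fastq,/data/SRR0001_2.fastq
//   SRX0002   /data/SRR0002_1.fastq

Channel
    .fromPath('samplesheet.tsv')
    .splitCsv(sep: '\t', header: true)
    .map { row ->
        // Split the comma-separated field into a list of file objects
        def reads = row.fastq.tokenize(',').collect { file(it) }
        tuple(row.sample, reads)
    }
    .set { samples }

process count_lines {
    tag { sample }

    input:
    // One sample ID plus however many fastq files that sample has
    set val(sample), file(reads) from samples

    output:
    stdout into line_counts

    """
    cat $reads | wc -l
    """
}

line_counts.println()
```

Because the list of files is built in the channel, the process receives one flat list per sample regardless of how many files the sample has.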

Also, to generate a samplesheet in this format, I use this script:

https://github.com/NYU-Molecular-Pathology/NGS580-nf/blob/e890b24ce3c9a0dcbf0392f4e6fb5d7f79195794/generate-samplesheets.py

Hope this is helpful!

jaha...@gmail.com

Apr 25, 2018, 8:50:24 PM

to Nextflow

Thanks for your response Steve,

I think I just fixed my problem. It turns out there is already a fantastic tip in the Nextflow documentation that was exactly what I was looking for:

Tip

By default all the files matching the specified glob pattern are emitted by the channel as a sole (list) item. It is also possible to emit each file as a sole item by adding the mode flatten attribute in the output file declaration.

Turns out all I needed was to add "mode flatten" after the output channel in the fastq_dump process, so that it looks like this:

output:
set val(srx), file("${sra}_?.fastq") into raw_fastq mode flatten

I knew there should be an easy fix; I just needed to read the documentation more thoroughly. I did learn a lot about Nextflow trying to solve this though, so it was a good experience!
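For anyone who needs to keep the wildcard output as a list, a possible alternative (not tried in this thread) is to leave the output declaration alone and flatten after grouping, using `map` with Groovy's `flatten()`. A minimal sketch, with plain strings in place of real file paths and `Channel.from` in place of the fastq_dump output:

```groovy
Channel
    .from(
        ['SRX2589027', ['SRR5285723_1.fastq', 'SRR5285723_2.fastq']],
        ['SRX2589027', ['SRR5285722_1.fastq', 'SRR5285722_2.fastq']],
        ['SRX2081981', ['SRR4113368_1.fastq', 'SRR4113368_2.fastq']]
    )
    .groupTuple()
    // Collapse [[a, b], [c, d]] into [a, b, c, d] for each SRX
    .map { srx, reads -> tuple(srx, reads.flatten()) }
    .println()
```

In the pipeline from the first post, this would amount to adding the `.map` line between `.groupTuple()` and `.set { grouped_fastq }`.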

Thanks for your help,

John
