Read files in multiple directories


pmca

Apr 11, 2017, 9:42:48 AM
to Nextflow
Hi,

I've just started learning Nextflow, and have a question that I hope someone here can help me with.

I have a file that I can feed as a list to NF listing all the directories I want to iterate over. If I read this file into a channel I can then print those same directories:
Channel.fromPath(params.list)
    .splitText()
    .map { file(it.trim()) }
    .set { dir_list }

dir_list.println()

The above code prints:
/Users/pmca/NF/datasets/ds1
/Users/pmca/NF/datasets/ds2

What I would like to do now is read pairs of files from these directories and work on those pairs in parallel. I think, but am not sure, that the best approach would be to read the files into a channel as well and then work with that. The code I tried is below, but it doesn't work.

params.reads = "*{1,2}.fastq.gz"

Channel.fromFilePairs("${read_list}/*", size: -1) { file -> params.reads }
    .set { reads_list }

Does anyone have an idea how to implement this?
Thanks,
Pedro

pmca

Apr 11, 2017, 10:24:22 AM
to Nextflow
Just to be more clear, the code above seems to work if I define only one directory in a variable. For example:

dir_list = '/Users/pmca/NF/datasets/ds1'

Channel.fromFilePairs("${dir_list}/*", size: -1) { file -> params.reads }
    .set { reads_list }

reads_list.println()

prints as expected:
[*{1,2}.fastq.gz, [/Users/pmca/NF/datasets/ds1/s4_1.fq.gz, /Users/pmca/NF/datasets/ds1/s4_2.fq.gz]]

But how can I make this work in a channel with multiple directories?
Thanks.

Paolo Di Tommaso

Apr 11, 2017, 6:11:04 PM
to nextflow
The problem is that `fromFilePairs` is designed to work with a path defining a glob pattern, not with a list of directories.

What you can do is to convert your list of directories to a comma separated string. For example, having the following dirs:
  • foo
  • bar
  • qux

You can create the following glob pattern:

Channel.fromFilePairs( "{foo,bar,qux}/*_{1,2}.fastq" )

To convert a file containing a list of directories, you can use the following snippet:

def paths = file('directories.txt').readLines().findAll { it.size()>0 }.join(',')
Channel.fromPath(  "{$paths}/*_{1,2}.fastq" ).set { new_channel }


Note: `paths` must contain at least two directories.


Hope it helps. 


Cheers,
Paolo



--
You received this message because you are subscribed to the Google Groups "Nextflow" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nextflow+unsubscribe@googlegroups.com.
Visit this group at https://groups.google.com/group/nextflow.
For more options, visit https://groups.google.com/d/optout.

pmca

Apr 11, 2017, 6:59:08 PM
to Nextflow
Hi Paolo,

thanks for the help. I'm still learning the tricks and treats of Nextflow/Groovy and I hope time will be my friend.

Your explanation makes sense, and having a string of directories seems to be a good way of iterating through them. I'm trying to implement this process so that the pipeline can run through multiple samples (in different directories) in parallel, which would be better than having a different pipeline for each sample.

The first part of the code you showed is working fine but I'm not getting any output from the second part:
def paths = file(params.list).readLines().findAll { it.size()>0 }.join(',')
// println paths
/* prints /Users/pmca/NF/datasets/ds1,/Users/pmca/NF/datasets/ds2 as expected */

However,
Channel.fromPath(  "{$paths}/*_{1,2}.fq.gz" ).set { new_channel }
new_channel.println()
/* prints nothing */

I also tried with the full list written out, but nothing was printed either:
Channel.fromPath( "/Users/pmca/NF/datasets/ds1/*_{1,2}.fq.gz,/Users/pmca/NF/datasets/ds2/*_{1,2}.fq.gz" ).set { new_channel }

The reads are as follows in "/Users/pmca/NF/datasets/ds1/": s4_1.fq.gz  s4_2.fq.gz
and in "/Users/pmca/NF/datasets/ds2/": s5_1.fq.gz  s5_2.fq.gz

Thanks,
Pedro

Paolo Di Tommaso

Apr 11, 2017, 7:10:40 PM
to nextflow
Hi, 

The first syntax should work. Make sure that the pattern you specified matches the file extension. Also try with a single file path to troubleshoot the problem. Finally, bear in mind that if you want file pairs you will need to use `fromFilePairs` instead of `fromPath`.
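
One way to act on the troubleshooting advice above is to check what a glob matches before building a channel from it. The following is only a minimal sketch, assuming that Nextflow's `file()` returns a list of matching paths when its argument contains glob characters. The path is the ds1 directory mentioned earlier in this thread, and the names `pattern` and `matches` are illustrative.

```groovy
// Troubleshooting sketch: list what a glob pattern matches.
// The directory and extension are the ones mentioned in this thread.
def pattern = '/Users/pmca/NF/datasets/ds1/*_{1,2}.fq.gz'

// Assumption: with a glob pattern, file() returns the matching paths as a
// list, so an empty list points to a problem in the pattern itself
// (for example .fastq versus .fq.gz).
def matches = file(pattern)
println "pattern ${pattern} matched ${matches.size()} file(s)"
matches.each { println it }
```

If the list is empty for a single directory, the pattern is the thing to fix before trying several directories at once.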


Cheers,
Paolo

pmca

Apr 11, 2017, 7:22:52 PM
to Nextflow
It seems to be working with a single path, but not with a comma delimited string of paths:

myExtension = "*_{1,2}.fq.gz"
Channel.fromFilePairs( "/Users/pmca/NF/datasets/ds1/${myExtension},/Users/pmca/NF/datasets/ds2/${myExtension}" ).set { new_channel }
new_channel
.println()

Channel.fromFilePairs( "/Users/pmca/NF/datasets/ds1/${myExtension}" ).set { other_channel }
other_channel
.println()

outputs:
[warm up] executor > local
[s4, [/Users/pmca/NF/datasets/ds1/s4_1.fq.gz, /Users/pmca/NF/datasets/ds1/s4_2.fq.gz]]

This line comes from other_channel (removing the other_channel.println() call yields no output at all).

Thanks,
P

Paolo Di Tommaso

Apr 12, 2017, 11:14:33 AM
to nextflow
You are right, that was my fault. 

The paths must share a common root directory. So what you can do currently is to specify multiple paths as shown below:

Channel.fromFilePairs( "/Users/pmca/NF/datasets/{ds1,ds2,etc}/*_{1,2}.fastq.gz" )

or 

Channel.fromFilePairs( "/Users/pmca/NF/datasets/*/*_{1,2}.fastq.gz" )
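
For readers who still want to drive this from the directory list file, the alternation inside the braces could be derived from it. This is a sketch under stated assumptions, not a confirmed recipe: it assumes every listed directory sits directly under one common parent (here the `/Users/pmca/NF/datasets` folder from this thread), it uses the `.fq.gz` extension of the files shown earlier, and the names `root` and `names` are made up for illustration.

```groovy
// Sketch: build the {ds1,ds2,...} part of the glob from the directory list.
// Assumption: every listed directory sits directly under `root`.
def root = '/Users/pmca/NF/datasets'

def names = file(params.list)
    .readLines()
    .collect { it.trim() }
    .findAll { it }                     // skip empty lines
    .collect { new File(it).name }      // /Users/pmca/NF/datasets/ds1 -> ds1
    .join(',')

// With the two directories from this thread the pattern becomes
// /Users/pmca/NF/datasets/{ds1,ds2}/*_{1,2}.fq.gz
Channel
    .fromFilePairs("${root}/{${names}}/*_{1,2}.fq.gz")
    .set { reads_list }
```

As noted earlier in the thread, the list should contain at least two directories for the brace alternation to make sense.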


Another approach is to have a file listing not just the directories, but the paths of each read pair. For example:

id1, /Users/pmca/NF/datasets/ds1/s1_1.fq.gz, /Users/pmca/NF/datasets/ds1/s1_2.fq.gz
id2, /Users/pmca/NF/datasets/ds1/s2_1.fq.gz, /Users/pmca/NF/datasets/ds1/s2_2.fq.gz
:
idn, /Users/pmca/NF/datasets/ds1/sn_1.fq.gz, /Users/pmca/NF/datasets/ds1/sn_2.fq.gz

Then use a snippet like the following to parse it:


Channel.fromPath( 'listing.txt' )
             .splitCsv()
             .map { id, read1, read2 -> tuple( id, file(read1), file(read2) ) }
             .set { new_channel }



That said, it could be useful to add the ability to process multiple paths/directories, as you are suggesting.



Hope it helps.

Cheers,
Paolo
 




pmca

Apr 12, 2017, 5:39:37 PM
to Nextflow
Hi Paolo,

thanks for your patience. I also think this feature would be great to have in Nextflow, as it would allow data organized in folders of different origins to be incorporated in the same workflow. As a workaround, one thing that would work with the ideas you suggested previously is to put symlinks to the different directories in a single directory that NF could read.

As for the code snippet you gave, I'm sorry but I can't get it to work yet. I would say the problem is in the map function, but I can't get around it.

Channel.fromPath( 'list.reads.txt' )
    .splitCsv()
    // .subscribe { row ->
    //     println "${row[0]} - ${row[1]} - ${row[2]}" }
    .map { id, read1, read2 -> tuple( id, file(read1), file(read2) ) }
    .set { channel3 }
channel3.println()

gives the following error:
ERROR ~ No signature of method: _nf_script_8d5b6144$_run_closure1.call() is applicable for argument types: ([Ljava.lang.String;) values: [[s1,  /Users/pmca/NF/datasets/ds1/s4_1.fq.gz, ...]]
Possible solutions: any(), any(), any(groovy.lang.Closure), each(groovy.lang.Closure), any(groovy.lang.Closure), each(groovy.lang.Closure)

However, if I comment out the .map and .set functions and uncomment the .subscribe and println, it seems to print the correct paths:
s1 - /Users/pmca/NF/datasets/ds1/s4_1.fq.gz - /Users/pmca/NF/datasets/ds1/s4_2.fq.gz
s2 - /Users/pmca/NF/datasets/ds2/s5_1.fq.gz - /Users/pmca/NF/datasets/ds2/s5_2.fq.gz


Thanks,
Pedro

pmca

Apr 12, 2017, 6:20:43 PM
to Nextflow
I think I managed to get past the error message by using just one "receiver" in .map. This seems to work, but I had to lose the file() conversion. I'm not sure how important that is at this stage of reading from the file. I guess I could apply file() when the channel is sent to a process!?

Channel.fromPath( 'list.reads.txt' )
 .splitCsv()
 .map { it -> tuple(it[0], it[1], it[2]) }
 // .map { id, read1, read2 -> tuple( id, file(read1), file(read2) ) }
 .into { channel3 }
channel3.println()


prints:
[s1,  /Users/pmca/NF/datasets/ds1/s4_1.fq.gz,  /Users/pmca/NF/datasets/ds1/s4_2.fq.gz]
[s2, /Users/pmca/NF/datasets/ds2/s5_1.fq.gz, /Users/pmca/NF/datasets/ds2/s5_2.fq.gz]

Paolo Di Tommaso

Apr 12, 2017, 6:20:55 PM
to nextflow
Almost there, use 

tuple( it[0], file(it[1]), file(it[2]) )

instead of 

tuple(it[0], it[1], it[2])


That's needed to convert the string paths to file objects. 


p


pmca

Apr 12, 2017, 6:29:36 PM
to Nextflow
Great, thanks. Is this the expected behaviour?

[s1, /Users/pmca/NF/ /Users/pmca/NF/datasets/ds1/s4_1.fq.gz, /Users/pmca/NF/ /Users/pmca/NF/datasets/ds1/s4_2.fq.gz]
[s2, /Users/pmca/NF/datasets/ds2/s5_1.fq.gz, /Users/pmca/NF/datasets/ds2/s5_2.fq.gz]

In s1, it[1] and it[2] are each broken into two strings/paths, whereas in s2 they are not.
The difference in the list.reads.txt file is that in s1 the fields are separated by "comma space", but in s2 by just "comma".
> cat list.reads.txt
s1, /Users/pmca/NF/datasets/ds1/s4_1.fq.gz, /Users/pmca/NF/datasets/ds1/s4_2.fq.gz
s2,/Users/pmca/NF/datasets/ds2/s5_1.fq.gz,/Users/pmca/NF/datasets/ds2/s5_2.fq.gz

/Users/pmca/NF/ is my current working directory.

Thanks,
P.

Paolo Di Tommaso

Apr 12, 2017, 6:32:55 PM
to nextflow
A comma-separated values (CSV) file is comma separated, not comma-blank separated .. :)


p
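
The output in the previous message suggests that a field starting with a blank is no longer treated as an absolute path and is resolved against the working directory instead. If the listing cannot be guaranteed to be blank-free, the fields could be trimmed before they are converted to file objects. The following is a hedged sketch built from the operators already used in this thread; the channel name `read_pairs` is hypothetical, and `list.reads.txt` is assumed to hold one `id,read1,read2` row per line.

```groovy
Channel
    .fromPath('list.reads.txt')
    .splitCsv()
    // Trim every field so that "s1, /path" and "s1,/path" behave the same.
    .map { row -> row.collect { it.trim() } }
    // Index access is used here because the three-parameter closure
    // raised an error on the rows emitted by splitCsv earlier in the thread.
    .map { row -> tuple(row[0], file(row[1]), file(row[2])) }
    .set { read_pairs }
```

Keeping the trimming in its own `map` step leaves the conversion to file objects identical to the version suggested earlier in the thread.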


pmca

Apr 12, 2017, 6:39:34 PM
to Nextflow
:) indeed. Was just testing the strictness of the file/function.

Thanks a lot for all the help.