Processing a list of paths from a text file

Barry Moore

Dec 28, 2016, 2:23:44 AM
to Nextflow
Hi all,

I'm working through a simple Nextflow script to calculate the md5sum on a list of files, and I've run into a behavior that I don't understand, but also can't seem to fix.  I create a few test files with random content and calculate their md5sums on the command line.

seq 1 5 | parallel 'echo $RANDOM > {}.txt'
ls *.txt > files.list
md5sum *.txt > files.md5sum

The goal is to iterate through the list of text files and calculate md5sum on each of them in parallel and then collect that md5sum output into a single file.  My Nextflow script is shown below after a few questions:
  1. Nextflow creates work/tmp/*/input.1 files for each of the 5 *.txt files in my list.  Inside each of these input.1 files is the filename of one of my original *.txt files rather than the random number the original file contains.  What I want Nextflow to do is stage each of the files listed in files.list, not stage its filename to a new file.
  2. When Nextflow does run md5sum on the files it stages, it's using the staged name in its output.  Ultimately I want Nextflow to produce md5sum output with the original filename, but I wasn't able to get Nextflow to stage the files by their original names.  I found a reference in this forum to staging files by their original names using as input "file '*' channel", but was unable to make that work.

Finally, for both of these issues, I realize that I may be doing things the hard way since I'm new to both Nextflow and Groovy, so if there is a more idiomatic way of processing a bunch of files from a list that makes my problems moot, I'd love to hear that more general advice.


Thanks for any help you may have to offer,

Barry

#!/usr/bin/env nextflow

// Usage: nextflow run validate_new_files.nf --list files.txt

file_list = Channel.from(file(params.list).readLines().each{file(it)})
//Also tried: file_list = Channel.from(file(params.list).readLines())

process parallel_md5sum {

  input:
  file('data_file') from file_list

  output:
  file('data_file.md5sum') into md5sum_files

  """
  md5sum data_file > data_file.md5sum
  """

}

process collect_md5sum {

  input:
  file('data_file??.md5sum') from md5sum_files.toList()

  output:
  file('primary_data.md5sum')

  """
  cat data_file??.md5sum > primary_data.md5sum
  """

}


Paolo Di Tommaso

Dec 28, 2016, 3:28:39 AM
to nextflow
Hi Barry, 


Have a try at this:

The main problem in your script is that `each` does not return the values computed by its closure (it returns the original list of strings), so it cannot work that way.

Then, to stage input files under their original names, use a variable reference instead of a file name.

Finally, NF creates that file because the `parallel_md5sum` process receives a string value as input, not a file object, so it creates a file for you with the string value as its content.
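
The first point can be checked in plain Groovy, without Nextflow. The snippet below is only a sketch: the file names are made up, and `java.io.File` stands in for Nextflow's `file()` helper, which is not available outside a Nextflow script. It shows that `each` hands back the list it was called on, while `collect` returns the converted items.

```groovy
// Hypothetical file names, for illustration only.
def names = ['1.txt', '2.txt', '3.txt']

// each() runs the closure for its side effects and returns the original list,
// so the File objects created inside the closure are discarded.
def viaEach = names.each { new File(it) }

// collect() returns a new list holding the closure's results.
def viaCollect = names.collect { new File(it) }

println(viaEach.every { it instanceof String })   // prints true
println(viaCollect.every { it instanceof File })  // prints true
```

This is why the `each` version and the "Also tried" version of the channel definition in the question behave the same way: both pass plain strings to the process.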


Hope it helps.

Cheers,
Paolo



Barry Moore

Dec 28, 2016, 9:43:22 AM
to Nextflow
Thanks Paolo, that worked great.  Just FYI, I had to add a bit of code to the map closure to remove the newline from the filenames:

Channel.fromPath(params.list)
       .splitText()
       .map { file(it.replaceFirst(/\n/,'')) }
       .set { file_list }

Barry

Paolo Di Tommaso

Dec 28, 2016, 11:48:52 AM
to nextflow
That makes sense. You can also use `.trim()` for the same purpose.
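
A small plain-Groovy comparison of the two approaches, as a sketch with made-up input lines. For a line that ends in a single newline the results are identical; `trim()` additionally removes leading and trailing whitespace such as spaces or a carriage return, which `replaceFirst(/\n/, '')` leaves in place.

```groovy
// Hypothetical lines, shaped like entries of a file list that end in a newline.
def line = '1.txt\n'
def viaReplace = line.replaceFirst(/\n/, '')
def viaTrim = line.trim()
println(viaReplace == viaTrim)  // prints true

// A line with padding and a Windows-style line ending.
def padded = '  2.txt \r\n'
def paddedReplace = padded.replaceFirst(/\n/, '')  // only the newline is removed
def paddedTrim = padded.trim()                     // all surrounding whitespace is removed
println(paddedTrim)             // prints 2.txt
```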


Cheers,
Paolo

Barry Moore

Dec 28, 2016, 5:07:20 PM
to Nextflow
Ah, I knew there had to be a clean method for that - thanks!

B



Aznable Char

Mar 11, 2020, 1:26:43 AM
to Nextflow
Hi, new user here. I hit a similar problem, and it drives me nuts when I start to think about it. I can't find a definition of the "variable reference" you mentioned earlier in the context of file handling. Can you explain what you meant by "variable reference" here? I have been using, for example, `Channel.fromPath("*.txt")` to create input for other processes without any problem. But why would something like `Channel.fromList(["list.txt"])` get staged as "input.*" with the filename as its content when passed to a process with `input: file(finput)`? The language seems very vague about the expected behavior of the `file()` method: here, one would think that the `file(finput)` call in the process's input block would create the "variable reference", as it does for the input `Channel.fromList(["list.txt"]).map{ file(it) }`. There seems to be some inconsistency here.
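
For readers with the same question, here is a minimal sketch of the distinction, assuming the DSL1 syntax used earlier in this thread (newer Nextflow releases use a different process syntax). The process names `by_fixed_name` and `by_variable` and the channel names are hypothetical. In an input block, `file` acts as a qualifier that declares how an incoming item is staged; it is not a call to the `file()` function and does not turn a string into a path. As explained earlier in the thread, a string item is written into a new file as its content, so the conversion has to happen in the channel, for example with `map { file(it) }`.

```groovy
// Sketch only: DSL1 syntax, hypothetical process and channel names.

// Channel.from('list.txt') alone would emit a plain string; a `file` input
// receiving it would stage a new file whose content is the text 'list.txt'.
// Mapping through file() emits a file object instead, so the real file is staged.
Channel.from('list.txt')
       .map { file(it) }
       .into { files_a; files_b }

process by_fixed_name {
    input:
    // Quoted name: the file is staged under the literal name 'data_file'.
    file 'data_file' from files_a

    """
    md5sum data_file
    """
}

process by_variable {
    input:
    // Unquoted name: data_file is a variable bound to the incoming file,
    // which is staged under its original name.
    file data_file from files_b

    """
    md5sum ${data_file}
    """
}
```

The second process is the "variable reference" form: the script refers to the input through `${data_file}`, so md5sum reports the original file name.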


Anand Venkatraman

Mar 11, 2020, 1:02:34 PM
to Nextflow
Hi Aznable 

Have you looked at https://www.nextflow.io/docs/latest/faq.html, especially questions 1 and 2, to see if it addresses what you want?