Processing a list of paths from a text file

Barry Moore

Dec 28, 2016, 2:23:44 AM

to Nextflow

Hi all,

I'm working through a simple Nextflow script to calculate the md5sum on a list of files, and I've run into a behavior that I don't understand, but also can't seem to fix. I create a few test files with random content and calculate their md5sums on the command line.

seq 1 5 | parallel 'echo $RANDOM > {}.txt'
ls *.txt > files.list
md5sum *.txt > files.md5sum

The goal is to iterate through the list of text files and calculate md5sum on each of them in parallel and then collect that md5sum output into a single file. My Nextflow script is shown below after a few questions:

Nextflow creates work/tmp/*/input.1 files for each of the 5 *.txt files in my list. Inside each of these input.1 files is the filename of one of my original *.txt files rather than the random number the original file contains. What I want Nextflow to do is stage each of the files listed in files.list, not stage its filename to a new file.
When Nextflow does run md5sum on the files it stages, it uses the staged name in its output. Ultimately I want Nextflow to produce md5sum output with the original filename, but I wasn't able to get Nextflow to stage the files by their original names. I found a reference in this forum to staging files by their original names using "file '*' from channel" as the input, but was unable to make that work.

Finally, for both of these issues, I realize that I may be doing things the hard way since I'm new to both Nextflow and Groovy, so if there is a more idiomatic way of processing a bunch of files from a list that makes my problems moot, I'd love to hear that more general advice.

Thanks for any help you may have to offer,

Barry

#!/usr/bin/env nextflow

// Usage: nextflow run validate_new_files.nf --list files.txt

file_list = Channel.from(file(params.list).readLines().each{file(it)})
//Also tried: file_list = Channel.from(file(params.list).readLines())

process parallel_md5sum {

  input:
  file('data_file') from file_list
  
  output:
  file('data_file.md5sum') into md5sum_files
  
  """
  md5sum data_file > data_file.md5sum
  """
}

process collect_md5sum {

  input:
  file('data_file??.md5sum') from md5sum_files.toList()
  
  output:
  file('primary_data.md5sum')
  
  """
  cat data_file??.md5sum > primary_data.md5sum
  """
}

Paolo Di Tommaso

Dec 28, 2016, 3:28:39 AM

to nextflow

Hi Barry,

Give this a try:

https://gist.github.com/pditommaso/6d78e51a4486e96a0a934df6a30cf857

The main problem in your script is that `each` does not return the values produced by the closure (it returns the original list of strings), so it cannot work that way.

Then, to stage input files with their original name, use a variable reference instead of a file name.

Finally, NF is creating that file because the `parallel_md5sum` process is getting a string value as input, not a file object, so it creates a file for you with the string value as its content.
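
To make the first point concrete, here is a minimal sketch in plain Groovy (no Nextflow required). The file names are made up, and `new File(...)` is only a stand-in for Nextflow's `file()` helper, so treat it as an illustration rather than part of the original script. It contrasts `each` with `collect`, which does return the closure's results:

```groovy
// Hypothetical file names, used only for illustration.
def names = ['1.txt', '2.txt', '3.txt']

// each() runs the closure for its side effects and returns the original list,
// so the File objects created inside the closure are discarded.
def viaEach = names.each { new File(it) }
assert viaEach.is(names)
assert viaEach.every { it instanceof String }

// collect() builds and returns a new list from the closure's results.
def viaCollect = names.collect { new File(it) }
assert viaCollect.every { it instanceof File }
assert viaCollect*.name == names
```

This is why `Channel.from(...)` in the original script still emitted plain strings rather than file objects.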

Hope it helps.

Cheers,

Paolo

Barry Moore

Dec 28, 2016, 9:43:22 AM

to Nextflow

Thanks Paolo, that worked great. Just FYI, I had to add a bit of code to the map closure to remove the newline from the filenames:

Channel.fromPath(params.list)
        .splitText()
        .map { file(it.replaceFirst(/\n/,'')) }
        .set { file_list }

Barry

Paolo Di Tommaso

Dec 28, 2016, 11:48:52 AM

to nextflow

That makes sense. You can also use `.trim()` for the same purpose.
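
Putting the suggestions in this thread together, the whole script might look roughly like the sketch below. It is not copied from the linked gist. It assumes the legacy (DSL1) `from`/`into` syntax used in the original post, keeps the original process names, and introduces `md5sums` as an illustrative input name. The unquoted `data_file` is the variable reference mentioned earlier, which lets Nextflow stage each file under its original name.

```groovy
#!/usr/bin/env nextflow

// Sketch only: legacy DSL1 syntax, matching the original post.
// Usage: nextflow run validate_new_files.nf --list files.list

Channel.fromPath(params.list)
       .splitText()
       .map { file(it.trim()) }   // trim() drops the trailing newline
       .set { file_list }

process parallel_md5sum {

  input:
  file data_file from file_list   // variable reference, not a quoted file name

  output:
  file "${data_file}.md5sum" into md5sum_files

  """
  md5sum ${data_file} > ${data_file}.md5sum
  """
}

process collect_md5sum {

  input:
  file md5sums from md5sum_files.toList()   // 'md5sums' is an illustrative name

  output:
  file 'primary_data.md5sum'

  """
  cat ${md5sums} > primary_data.md5sum
  """
}
```

Recent Nextflow releases use DSL2, where processes are wired together in a `workflow` block instead of with `from`/`into`, so this form applies only to versions that still accept DSL1.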

Cheers,
Paolo

Barry Moore

Dec 28, 2016, 5:07:20 PM

to Nextflow

Ah, I knew there had to be a clean method for that - thanks!

B

Aznable Char

Mar 11, 2020, 1:26:43 AM

to Nextflow

Hi, new user here. I hit a similar problem, which drives me nuts when I start to think about it. I can't find a definition of the "variable reference" you mentioned earlier in terms of file operations. Can you explain what you meant by "variable reference" in this context? I have been using `Channel.fromPath("*.txt")`, for example, to create input for other processes without any problem. But why would something like `Channel.fromList(["list.txt"])` get staged as "input.*" with the filename as its content when passed to a process with `input: file(finput)`? The language seems very vague about the expected behavior of the `file()` method. Here, one would have thought that the `file(finput)` call in the process's input block would create the "variable reference", as it does for the input `Channel.fromList(["list.txt"]).map{ file(it) }`. There seems to be some inconsistency here.

Anand Venkatraman

Mar 11, 2020, 1:02:34 PM

to Nextflow

Hi Aznable,

Have you looked at https://www.nextflow.io/docs/latest/faq.html (especially questions 1 and 2) to see if it addresses what you want?
