Filter channel by number of lines in file


Steve

Apr 26, 2018, 3:50:02 PM
to Nextflow
I want to filter out entries in my Channel where one of the files does not have enough lines.

I tried to write the filter command like this:

samples_lofreq_vcf.concat(sample_vcf_hc2)
    .combine(annovar_db_dir)
    .filter { caller, sampleID, sample_vcf, sample_tsv, annovar_db_dir ->
        println "[sample_tsv]: ${sample_tsv}"
        def sample_tsv_file = new File(sample_tsv)
        println "[sample_tsv_file]: ${sample_tsv_file}"
        long count = sample_tsv_file.lines().count()
        println "[count]: ${count}"
        count > 1
    }
    .set { samples_vcfs_tsvs_filtered }



But I get the following error:

[sample_tsv]: /ifs/data/molecpathlab/development/NGS580-nf/work/46/2d92543e1820f95f2b28386ae8c2ab/SeraCare-1to1-Positive.LoFreq.reformat.tsv
ERROR ~ Could not find matching constructor for: java.io.File(sun.nio.fs.UnixPath)



I am guessing I do not have the right syntax or something here?

In this case I am just checking that the file has more than one line. I will also need to do more conditional filtering, such as counting the number of entries in a .vcf file:

grep -v '^#' filtered.vcf | wc -l


Paolo Di Tommaso

Apr 26, 2018, 3:55:05 PM
to nextflow
It looks like `sample_tsv` is already a path, so `sample_tsv.readLines().size()` is enough.

I wouldn't use this on a dataset containing big files, since `readLines()` loads every line into memory.


p

--
You received this message because you are subscribed to the Google Groups "Nextflow" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nextflow+unsubscribe@googlegroups.com.
Visit this group at https://groups.google.com/group/nextflow.
For more options, visit https://groups.google.com/d/optout.

Steve

Apr 26, 2018, 4:00:55 PM
to Nextflow
I was trying to implement this Java answer: https://stackoverflow.com/a/26448726/5359531
but couldn't get it to work either, and I'm not sure what the Groovy equivalent is:

If you are using Java 8 you can use streams:

long count = Files.lines(Paths.get(filename)).count();


Alternatively, maybe there is a way to read and count lines only until some threshold is reached? For example, if I only need >= 2 lines, read just the first two lines, then break and return `true`?
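One way to stop early is to cap the stream before counting it. A minimal sketch in plain Java, assuming the file is text that decodes as UTF-8 (the default for `Files.lines`); the names `LineThreshold` and `hasAtLeast` are made up for illustration and do not come from Nextflow:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical helper; not part of Nextflow or the JDK.
class LineThreshold {
    // True if the file has at least minLines lines.
    // limit() stops the stream from consuming lines beyond minLines,
    // so a large file is not read to the end.
    static boolean hasAtLeast(Path file, long minLines) throws IOException {
        // try-with-resources closes the underlying file handle.
        try (Stream<String> lines = Files.lines(file)) {
            return lines.limit(minLines).count() >= minLines;
        }
    }
}
```

In a `filter` closure the check would then look something like `LineThreshold.hasAtLeast(sample_tsv, 2)`, assuming the class is made available to the script; the method body could also be inlined into the closure.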

Paolo Di Tommaso

Apr 26, 2018, 4:10:51 PM
to nextflow
Groovy is a superset of Java, just as NF is a superset of Groovy.

Therefore you can use any regular Java syntax (with a few exceptions) in a NF script. The following snippet should work:


Files.lines(sample_tsv).count()

But you will need to import the Files class at the top of the script:

import java.nio.file.Files; 
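The same call can probably cover the `.vcf` case from the first post as well. A rough Java sketch of the `grep -v '^#' | wc -l` count, closing the stream explicitly because `Files.lines` keeps the file open until the stream is closed; `VcfRecordCounter` and `countRecords` are hypothetical names, and the file is assumed to be uncompressed UTF-8 text:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical helper; the names are illustrative only.
class VcfRecordCounter {
    // Counts the lines that do not start with '#',
    // i.e. everything except the VCF header lines.
    static long countRecords(Path vcf) throws IOException {
        try (Stream<String> lines = Files.lines(vcf)) {
            return lines.filter(line -> !line.startsWith("#")).count();
        }
    }
}
```

Like the shell pipeline, this also counts blank lines, so an extra `!line.isEmpty()` check may be wanted if the files can contain them.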



p


Lavi Bharath

Sep 24, 2018, 2:58:47 AM
to Nextflow
Hi, how can I count the number of lines in a gzipped file? Thanks.

Steve

Oct 8, 2018, 6:52:30 PM
to Nextflow
Lavi, unless you want to figure out how to make an inline shell call from your Groovy channel operator, I would probably just have a process that runs `zcat | wc -l > num_lines.txt`, then output the 'num_lines.txt' file and read and parse it in a channel filter or choice operator.
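If an inline count is preferred over a separate process, the JDK can decompress the file itself. A minimal sketch using `java.util.zip.GZIPInputStream`, assuming the content is UTF-8 text; `GzipLineCounter` and `countLines` are hypothetical names, not an existing Nextflow API:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

// Hypothetical helper; the names are illustrative only.
class GzipLineCounter {
    // Decompresses the file on the fly and counts its lines
    // without holding the content in memory.
    static long countLines(Path gz) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(gz)), StandardCharsets.UTF_8))) {
            long count = 0;
            while (reader.readLine() != null) {
                count++;
            }
            return count;
        }
    }
}
```

Unlike `wc -l`, this counts a final line that has no trailing newline, so the two can differ by one on such files.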

Other methods I figured out for checking the number of lines are summarized here:

Lavi Bharath

Oct 24, 2018, 4:22:16 AM
to Nextflow
Yes, thanks Steve. 
I just implemented the first option.