Option to process multiple datasets (single-end reads OR paired-end reads) now available!


Rachid

Jul 9, 2016, 12:21:59 AM
to CLARK Users
Dear users,

As you know, CLARK can process multiple datasets in a single run. Until now, this was possible only for single-end reads.
As many of you requested, multiple datasets of paired-end reads can now be processed at once in a scalable fashion.
The feature is available now, along with other bug fixes and code improvements, so please feel free to download the package online and try it out!
Or download it directly from the command line: wget http://clark.cs.ucr.edu/Download/CLARKV1.2.3.tar.gz

This feature is used through the option "-P" (in the same way as for single-end reads), and details have been added to explain it:
see the paragraph entitled "Processing multiple samples/datasets" in the Overview page, or the README file, for examples (or see below).

Cheers,
Rachid

== (Extract from the overview page)

       Processing multiple samples/datasets
The program can process multiple samples/datasets once the database is loaded. In other words, unlike other classifiers, you do not need to run the program N times if there are N samples/datasets to process. CLARK can load the database with your settings once and then classify as many datasets as needed before exiting.
For example, if you want to annotate six datasets (sample1.fa, sample2.fa, ..., sample6.fa), then you can store the paths of these files (their locations on your disk) in one file called "samples.txt", such that:
$ cat samples.txt
sample1.fa
sample2.fa
sample3.fa
sample4.fa
sample5.fa
sample6.fa

and then simply run:
$ ./classify_metagenome.sh -O samples.txt -R samples.txt 

Once the computations are done, the program will have created six results files (CSV format), one per sample: sample1.fa.csv, sample2.fa.csv, ... and sample6.fa.csv.
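
If the input files all sit in one directory, the list file can be generated rather than typed by hand. The following is a minimal sketch and not part of CLARK: the directory name "data" and the dummy FASTA files are assumptions, created here only so that the snippet runs on its own.

```shell
# Hypothetical layout: six FASTA files in a scratch "data" directory.
workdir=$(mktemp -d)
mkdir -p "$workdir/data"
for i in 1 2 3 4 5 6; do
  printf '>read%s\nACGT\n' "$i" > "$workdir/data/sample$i.fa"
done

# Write one file path per line, as in samples.txt above.
: > "$workdir/samples.txt"
for f in "$workdir"/data/*.fa; do
  printf '%s\n' "$f" >> "$workdir/samples.txt"
done

# Check that every listed file exists and is non-empty before a long run.
missing=0
while IFS= read -r path; do
  if [ ! -s "$path" ]; then
    echo "missing or empty: $path" >&2
    missing=$((missing + 1))
  fi
done < "$workdir/samples.txt"

n_samples=$(wc -l < "$workdir/samples.txt" | tr -d ' ')
echo "$n_samples samples listed, $missing missing"
```

The generated list would then be passed with -O (and, if desired, -R) exactly as in the command shown above.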

If you want the results files to have specific names (say "result1.csv", "result2.csv", ..., "result6.csv"), then you can store these names in a file "results.txt", such that: 
$ cat results.txt 
result1 
result2 
... 
result6

and then run: 
$ ./classify_metagenome.sh -O samples.txt -R results.txt 
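
Before launching a long run, it can help to check that the two list files agree. The sketch below is not part of CLARK. It assumes that line i of "results.txt" names the output for line i of "samples.txt", as the example above suggests, and it uses three dummy entries in a scratch directory.

```shell
# Dummy list files in a scratch directory (assumption: three samples).
workdir=$(mktemp -d)
printf 'sample%s.fa\n' 1 2 3 > "$workdir/samples.txt"
printf 'result%s\n' 1 2 3 > "$workdir/results.txt"

# The two lists should have the same number of lines.
n_in=$(wc -l < "$workdir/samples.txt" | tr -d ' ')
n_out=$(wc -l < "$workdir/results.txt" | tr -d ' ')
if [ "$n_in" -ne "$n_out" ]; then
  echo "line count mismatch: $n_in samples, $n_out result names" >&2
fi

# A result name used twice would make two samples share one output name.
dupes=$(sort "$workdir/results.txt" | uniq -d | wc -l | tr -d ' ')
if [ "$dupes" -ne 0 ]; then
  echo "duplicate result names found" >&2
fi

# Show the assumed sample-to-result pairing side by side.
paste "$workdir/samples.txt" "$workdir/results.txt"
```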

This scalable way of annotating multiple datasets works for single-end reads or paired-end reads. In the case of paired-end reads, you must provide two files (one listing the paths of the right-read files, the other the paths of the left-read files). For example, if you have three datasets (sample1, sample2, and sample3) of paired-end reads, then you can create "samples.R.txt" and "samples.L.txt" such that: 
$ cat samples.R.txt 
sample1.R1 
sample2.R1 
sample3.R1 
$ cat samples.L.txt 
sample1.R2 
sample2.R2 
sample3.R2 
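
When the mate files follow a naming convention, the two lists can be built together so that they stay in the same order. This is a minimal sketch and not part of CLARK: it assumes the ".R1"/".R2" suffixes used in the listing above, and it creates empty placeholder files in a scratch directory only so that it runs on its own.

```shell
# Placeholder mate files (assumption: names end in .R1 and .R2).
workdir=$(mktemp -d)
for s in sample1 sample2 sample3; do
  : > "$workdir/$s.R1"
  : > "$workdir/$s.R2"
done

: > "$workdir/samples.R.txt"
: > "$workdir/samples.L.txt"
unpaired=0
for r1 in "$workdir"/*.R1; do
  # Derive the mate's name by swapping the suffix.
  r2="${r1%.R1}.R2"
  if [ -e "$r2" ]; then
    # Append to both lists in the same iteration so line i pairs with line i.
    printf '%s\n' "$r1" >> "$workdir/samples.R.txt"
    printf '%s\n' "$r2" >> "$workdir/samples.L.txt"
  else
    echo "no mate found for $r1" >&2
    unpaired=$((unpaired + 1))
  fi
done

n_r=$(wc -l < "$workdir/samples.R.txt" | tr -d ' ')
n_l=$(wc -l < "$workdir/samples.L.txt" | tr -d ' ')
echo "$n_r right-read files, $n_l left-read files, $unpaired unpaired"
```

Skipping files without a mate keeps the two lists the same length, which is what the paired listing above implies.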

You can run CLARK on these datasets of paired-end reads (with option "-P"):
$ ./classify_metagenome.sh -P samples.R.txt samples.L.txt -R results.txt 

where the file "results.txt" is: 
$ cat results.txt 
result1 
result2 
result3 

With this "results.txt", the results files are named "result1.csv", "result2.csv" and "result3.csv". 

Or you can just run:
$ ./classify_metagenome.sh -P samples.R.txt samples.L.txt -R samples.R.txt 

In that case, the results will be stored in files entitled "sample1.R1.csv", "sample2.R1.csv" and "sample3.R1.csv" (consistent with the input dataset of the same prefix). 

You can change the parameters (e.g., the k-mer length, the mode of execution, the variant, the number of parallel threads, ...) or specify options for your data (e.g., compressed files, ...). 

To see the full list of options/parameters available, run:
$ ./classify_metagenome.sh 
