Confused about how to merge different FASTQ files for pooled-sample analysis, and how to create a db file for usearch_global

Kirill Tsyganov

Sep 14, 2016, 12:20:11 AM
to VSEARCH Forum
Hi there, 

I'm very glad that an open-source version of usearch has become available to the community, so thank you for that. 

These are the commands I use to do a 16S rRNA-style analysis on my data. I can't find exactly where I got this "pipeline" from, but it was somewhere on the usearch site.

usearch8.1.1861_i86linux32 -fastq_mergepairs ~/projects/blah/blah/raw-data/fastqFiles/*_R1_*.fastq -fastqout merged.fq -relabel @
usearch8.1.1861_i86linux32 -fastq_filter merged.fq -fastq_maxee 1.0 -relabel Filt -fastaout filtered.fa
usearch8.1.1861_i86linux32 -derep_fulllength filtered.fa -relabel Uniq -sizeout -fastaout uniques.fa
usearch8.1.1861_i86linux32 -cluster_otus uniques.fa -minsize 2 -otus otus.fa -relabel Otu
usearch8.1.1861_i86linux32 -usearch_global merged.fq -db otus2.fa -strand both -id 0.90 -otutabout otutab2.txt -biomout otutab2.json

I understand from this post https://groups.google.com/forum/#!searchin/vsearch-forum/fastq_mergepairs%7Csort:relevance/vsearch-forum/SYUfrxMjWb4/CnHeVrYfBwAJ that I can't quite reproduce usearch -fastq_mergepairs. So two questions here:

1. How do I merge individual FASTQ files? I have to pool all of my sequences together into one file, e.g. merged.fq, and then cluster the pooled sequences to identify putative species (OTUs). 
I suppose I could simply append all FASTQs into one file with cat, for example, and then start working from that. Is this how it should be done? 
2. usearch's -fastq_mergepairs had a neat feature: the -relabel @ option labelled each read in merged.fq with its file name plus a unique number, e.g. myFile.1, myFile.2, ..., myFile.n, where n is the number of reads in myFile. This later allowed the usearch_global command to create a table of counts of the newly clustered OTUs across the overall population. How would you go about creating such a table of counts without having a single db? Or, to ask more directly: how do you create a db file from several FASTQ files?
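In case it helps, the per-file numbering described above (myFile.1, myFile.2, ...) can also be reproduced with plain awk; this is a sketch assuming standard 4-line FASTQ records, and relabel_fastq is just an illustrative name, not a real tool:

```shell
# Hypothetical helper: relabel FASTQ reads as <sample>.1, <sample>.2, ...,
# mimicking usearch's "-relabel @" naming. Assumes 4-line FASTQ records.
relabel_fastq() {
  awk -v s="$1" 'NR % 4 == 1 { printf "@%s.%d\n", s, ++n; next } { print }'
}

# Usage: relabel_fastq myFile < myFile_merged.fq > myFile_relabeled.fq
```
Running this once per sample before pooling the files keeps each read's origin recoverable from its label.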

Thanks, 

Kirill   

Kirill Tsyganov

Sep 15, 2016, 2:10:26 AM
to VSEARCH Forum
Hi guys, 

This is an alternative approach using vsearch to reproduce what I've done with usearch:

for i in trimed/*R1*.gz; do
  echo -e "\n$i" >> log.txt
  vsearch --fastq_mergepairs $i \
          --reverse ${i/_R1/_R2} \
          --fastqout $(basename ${i%%_S*})_merged.fastq \
          --eeout \
          --fastq_maxdiffs 2 \
          --fastq_maxns 0 \
          --fastq_minlen 100 \
          --fastq_maxmergelen 160 \
          --fastq_minovlen 20 \
          --fastq_maxee 1.0 \
          --threads 32 2>> log.txt
done

for i in merged/*.gz; do
  vsearch --derep_fulllength $i \
          --output $(basename $i .fastq.gz)_dereped.fa \
          --fasta_width 0 \
          --sizeout \
          --relabel $(basename $i merged.fastq.gz) \
          --threads 32 2>> log.txt
done

for i in dereped/*.fa; do cat $i >> unique.fa; done

Now that I've done the pre-processing and have all my unique sequences in one file, it is time to cluster them. And here is where vsearch doesn't seem to perform...

vsearch --cluster_fast reads.fa \
        --id 0.90 \
        --alnout otus_aln.txt \
        --threads 32 \
        --rowlen 161 \
        --centroids otus_centroids.txt \
        --profile otus_profile.txt \
        --msaout otus_msaout.txt \
        --uc otus_uclust.txt \
        --fasta_width 0

This command runs much, much slower than the equivalent one in usearch, and it produces ~6000 OTUs compared to 5 with usearch:

vsearch v2.0.5_linux_x86_64, 125.9GB RAM, 32 cores

Reading file reads.fa 100%  
145220481 nt in 1137359 seqs, min 100, max 160, avg 128
Masking 100%  
Sorting by length 100%
Counting unique k-mers 100%  
Clustering 100%  
Sorting clusters 100%
Writing clusters 100%  
Clusters: 12430 Size min 1, max 159458, avg 91.5
Singletons: 6845, 0.6% of seqs, 55.1% of clusters
Multiple alignments 100%  

--------------------------------------

usearch8.1.1861_i86linux32 -cluster_otus reads.fa -minsize 2 -otu_radius_pct 10 -otus otus_10.fa -relabel Otu

usearch v8.1.1861_i86linux32, 4.0Gb RAM (132Gb total), 32 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.

Licensed to: check...@monash.edu


WARNING: OTU radius > 3% not recommended

00:01  46Mb  100.0% 5 OTUs, 4 chimeras (0.1%)

Does anyone know what's the reason for such difference? 

Thanks, 

Kirill

Torbjørn Rognes

Sep 19, 2016, 4:28:11 AM
to VSEARCH Forum
Hi

In vsearch you will need to run fastq_mergepairs with the following command for each pair of input files:

vsearch -fastq_mergepairs ~/projects/blah/blah/raw-data/fastqFiles/file_R1.fastq -reverse ~/projects/blah/blah/raw-data/fastqFiles/file_R2.fastq -fastqout merged.fq

Currently, vsearch does not support the "@" feature in the relabel option. We will consider implementing it in a future version.
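In the meantime, one workaround is to merge each R1/R2 pair in a loop and relabel reads with a per-sample prefix, which achieves much the same thing as the "@" feature. This is a dry-run sketch, not an official recipe: merge_pairs and the file-name pattern are assumptions, and whether --relabel is accepted by fastq_mergepairs in every vsearch version is also an assumption (if not, relabel in a separate pass):

```shell
# Dry-run sketch: merge each R1/R2 pair separately, relabeling reads with a
# per-sample prefix so their origin survives pooling. The leading "echo"
# prints the commands instead of running them; drop it to execute vsearch.
merge_pairs() {
  dir=$1
  for r1 in "$dir"/*_R1*.fastq; do
    [ -e "$r1" ] || continue            # no matches: skip the literal glob
    r2=${r1/_R1/_R2}                    # matching reverse-read file
    sample=$(basename "${r1%%_R1*}")    # sample name from the file name
    echo vsearch --fastq_mergepairs "$r1" --reverse "$r2" \
         --fastqout "${sample}_merged.fq" --relabel "${sample}."
  done
}
```
The merged per-sample files can then be concatenated into one pooled file with cat.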

Also, vsearch does not support the cluster_otus command.

The options for writing OTU tables (-otutabout and -biomout) are not supported either, but we plan to include them in a future version.
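Until then, a stopgap (a sketch, not an official feature; uc_to_counts is a hypothetical name) is to have the mapping step write per-read assignments with --uc, which vsearch does support, and tally the table from that file. This assumes reads were relabelled "sample.N", so the sample is everything before the first dot:

```shell
# Tally sample-by-OTU counts from a .uc mapping file. "H" records are hits;
# field 9 is the query label, field 10 the target (OTU) label.
uc_to_counts() {
  awk -F'\t' '$1 == "H" {
    sub(/\..*$/, "", $9)            # strip read number, keep sample name
    count[$9 "\t" $10]++
  }
  END { for (k in count) print k "\t" count[k] }' "$1"
}

# Usage: uc_to_counts otus_mapping.uc > otutab.txt
```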

In general, the commands and options for usearch version 7 are implemented in vsearch. Commands, options and syntax from usearch version 8 are only partially included.

- Torbjørn

Torbjørn Rognes

Sep 19, 2016, 4:32:48 AM
to VSEARCH Forum
Hi

The cluster_otus command is not implemented in vsearch. It is very different from the cluster_fast command, so the two cannot really be compared; they are not equivalent.

For clustering, vsearch is still generally somewhat slower than usearch, but it depends a lot on the length of the sequences, how different they are, the number of clusters, etc.

Please see our preprint for some comparisons:


- Torbjørn

Kirill Tsyganov

Oct 11, 2016, 10:28:28 PM
to VSEARCH Forum
Hi Torbjorn, 

Sorry about the super late reply; for some reason I didn't get an email notification.

This is great info, thanks heaps!