Confused about how to merge different FASTQ files for pooled-sample analysis, and how to create a db file for usearch_global

Kirill Tsyganov

Sep 14, 2016, 12:20:11 AM
to VSEARCH Forum
Hi there, 

I'm very glad that an open-source version of usearch has become available to the community, so thank you for that. 

These are the commands I use to do a 16S rRNA-style analysis on my data. I can't find exactly where I got this "pipeline" from, but it was somewhere on the usearch site.

usearch8.1.1861_i86linux32 -fastq_mergepairs ~/projects/blah/blah/raw-data/fastqFiles/*_R1_*.fastq -fastqout merged.fq -relabel @
usearch8.1.1861_i86linux32 -fastq_filter merged.fq -fastq_maxee 1.0 -relabel Filt -fastaout filtered.fa
usearch8.1.1861_i86linux32 -derep_fulllength filtered.fa -relabel Uniq -sizeout -fastaout uniques.fa
usearch8.1.1861_i86linux32 -cluster_otus uniques.fa -minsize 2 -otus otus.fa -relabel Otu
usearch8.1.1861_i86linux32 -usearch_global merged.fq -db otus2.fa -strand both -id 0.90 -otutabout otutab2.txt -biomout otutab2.json

I understand from this post https://groups.google.com/forum/#!searchin/vsearch-forum/fastq_mergepairs%7Csort:relevance/vsearch-forum/SYUfrxMjWb4/CnHeVrYfBwAJ that I can't quite reproduce usearch -fastq_mergepairs. So two questions here:

1. How do I merge individual FASTQ files? I have to pool all of my sequences together into one file, e.g. merged.fq, and then cluster the pooled sequences to identify putative species (OTUs). 
I suppose I could simply append all FASTQs into one file with cat, for example, and then start working from that. Is this how it should be done? 
2. usearch's -fastq_mergepairs had a neat feature: the -relabel @ option labelled each read in merged.fq with its file name plus a unique number, e.g. myFile.1, myFile.2, ..., myFile.n, where n is the number of reads in myFile. This later allowed the usearch_global command to create a table of counts of the newly clustered OTUs across the overall population. How would you go about creating such a table of counts without having a single db? Or, to ask more directly: how do you create a db file from several FASTQ files?
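In case it helps, the per-file numbering described above (myFile.1, myFile.2, ...) can also be reproduced with plain awk; this is a sketch assuming standard 4-line FASTQ records, and relabel_fastq is just an illustrative name, not a real tool:

```shell
# Hypothetical helper: relabel FASTQ reads as <sample>.1, <sample>.2, ...,
# mimicking usearch's "-relabel @" naming. Assumes 4-line FASTQ records.
relabel_fastq() {
  awk -v s="$1" 'NR % 4 == 1 { printf "@%s.%d\n", s, ++n; next } { print }'
}

# Usage: relabel_fastq myFile < myFile_merged.fq > myFile_relabeled.fq
```
Running this once per sample before pooling the files keeps each read's origin recoverable from its label.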

Thanks, 

Kirill   

Kirill Tsyganov

Sep 15, 2016, 2:10:26 AM
to VSEARCH Forum
Hi guys, 

This is an alternative approach using vsearch to reproduce what I've done with usearch:

for i in trimed/*R1*.gz; do
  echo -e "\n$i" >> log.txt
  vsearch --fastq_mergepairs $i \
          --reverse ${i/_R1/_R2} \
          --fastqout $(basename ${i%%_S*})_merged.fastq \
          --eeout \
          --fastq_maxdiffs 2 \
          --fastq_maxns 0 \
          --fastq_minlen 100 \
          --fastq_maxmergelen 160 \
          --fastq_minovlen 20 \
          --fastq_maxee 1.0 \
          --threads 32 2>> log.txt
done

for i in merged/*.gz; do
  vsearch --derep_fulllength $i \
          --output $(basename $i .fastq.gz)_dereped.fa \
          --fasta_width 0 \
          --sizeout \
          --relabel $(basename $i merged.fastq.gz) \
          --threads 32 2>> log.txt
done

for i in dereped/*.fa; do cat $i >> unique.fa; done

Now that I've done the pre-processing and have all my unique sequences in one file, it is time to cluster them. And here is where vsearch doesn't seem to perform...

vsearch --cluster_fast reads.fa \
        --id 0.90 \
        --alnout otus_aln.txt \
        --threads 32 \
        --rowlen 161 \
        --centroids otus_centroids.txt \
        --profile otus_profile.txt \
        --msaout otus_msaout.txt \
        --uc otus_uclust.txt \
        --fasta_width 0

This command runs much, much slower than the equivalent one in usearch, and it produces ~6000 OTUs compared to 5 with usearch:

vsearch v2.0.5_linux_x86_64, 125.9GB RAM, 32 cores

Reading file reads.fa 100%  
145220481 nt in 1137359 seqs, min 100, max 160, avg 128
Masking 100%  
Sorting by length 100%
Counting unique k-mers 100%  
Clustering 100%  
Sorting clusters 100%
Writing clusters 100%  
Clusters: 12430 Size min 1, max 159458, avg 91.5
Singletons: 6845, 0.6% of seqs, 55.1% of clusters
Multiple alignments 100%  

--------------------------------------

usearch8.1.1861_i86linux32 -cluster_otus reads.fa -minsize 2 -otu_radius_pct 10 -otus otus_10.fa -relabel Otu

usearch v8.1.1861_i86linux32, 4.0Gb RAM (132Gb total), 32 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.

Licensed to: check...@monash.edu


WARNING: OTU radius > 3% not recommended

00:01  46Mb  100.0% 5 OTUs, 4 chimeras (0.1%)

Does anyone know what's the reason for such difference? 

Thanks, 

Kirill

Torbjørn Rognes

Sep 19, 2016, 4:28:11 AM
to VSEARCH Forum
Hi

In vsearch you will need to run fastq_mergepairs with the following command for each pair of input files:

vsearch -fastq_mergepairs ~/projects/blah/blah/raw-data/fastqFiles/file_R1.fastq -reverse ~/projects/blah/blah/raw-data/fastqFiles/file_R2.fastq -fastqout merged.fq

Currently, vsearch does not support the "@" feature in the relabel option. We will consider implementing it in a future version.
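In the meantime, one workaround is to merge each R1/R2 pair in a loop and relabel reads with a per-sample prefix, which achieves much the same thing as the "@" feature. This is a dry-run sketch, not an official recipe: merge_pairs and the file-name pattern are assumptions, and whether --relabel is accepted by fastq_mergepairs in every vsearch version is also an assumption (if not, relabel in a separate pass):

```shell
# Dry-run sketch: merge each R1/R2 pair separately, relabeling reads with a
# per-sample prefix so their origin survives pooling. The leading "echo"
# prints the commands instead of running them; drop it to execute vsearch.
merge_pairs() {
  dir=$1
  for r1 in "$dir"/*_R1*.fastq; do
    [ -e "$r1" ] || continue            # no matches: skip the literal glob
    r2=${r1/_R1/_R2}                    # matching reverse-read file
    sample=$(basename "${r1%%_R1*}")    # sample name from the file name
    echo vsearch --fastq_mergepairs "$r1" --reverse "$r2" \
         --fastqout "${sample}_merged.fq" --relabel "${sample}."
  done
}
```
The merged per-sample files can then be concatenated into one pooled file with cat.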

Also, vsearch does not support the cluster_otus command.

The options for writing OTU tables (-otutabout and -biomout) are not supported either, but we plan to include them in a future version.
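Until then, a stopgap (a sketch, not an official feature; uc_to_counts is a hypothetical name) is to have the mapping step write per-read assignments with --uc, which vsearch does support, and tally the table from that file. This assumes reads were relabelled "sample.N", so the sample is everything before the first dot:

```shell
# Tally sample-by-OTU counts from a .uc mapping file. "H" records are hits;
# field 9 is the query label, field 10 the target (OTU) label.
uc_to_counts() {
  awk -F'\t' '$1 == "H" {
    sub(/\..*$/, "", $9)            # strip read number, keep sample name
    count[$9 "\t" $10]++
  }
  END { for (k in count) print k "\t" count[k] }' "$1"
}

# Usage: uc_to_counts otus_mapping.uc > otutab.txt
```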

In general, the commands and options for usearch version 7 are implemented in vsearch. Commands, options and syntax from usearch version 8 are only partially included.

- Torbjørn

Torbjørn Rognes

Sep 19, 2016, 4:32:48 AM
to VSEARCH Forum
Hi

The cluster_otus command is not implemented in vsearch. It is very different from the cluster_fast command, so the two cannot really be compared; they are not equivalent.

For clustering, vsearch is still generally somewhat slower than usearch, but it depends a lot on the length of the sequences, how different they are, the number of clusters, etc.

Please see our preprint for some comparisons:


- Torbjørn

Kirill Tsyganov

Oct 11, 2016, 10:28:28 PM
to VSEARCH Forum
Hi Torbjorn, 

Sorry about the super late reply; for some reason I didn't get an email notification.

This is great info, thanks heaps!