vsearch equivalent to pick_open_reference_otus

259 views
Skip to first unread message

Constantino Schillebeeckx

unread,
Aug 24, 2016, 1:05:32 PM8/24/16
to Qiime 1 Forum
Hi gang,

I recently stumbled upon vsearch, and wanted to give it a whirl in our current 16S pipeline.  (I know I should probably be asking in their forum but it doesn't seem highly frequented).  Currently we are picking OTUs with pick_open_reference_otus.py (with uclust) using Green Genes as a reference database.  

For reference:



I see vsearch isn't integrated into QIIME, so I started writing a bash script that would call the proper sequence of vsearch commands; so far I've got:

# dereplicate reads
vsearch
--derep_fulllength seqs.fna --output derep.fna --minuniquesize 2 --fasta_width 0 --sizeout --relabel_keep;

# must sort because searching done greedily
# see http://drive5.com/usearch/manual/uparseotu_algo.html
vsearch
--fasta_width 0 --sortbysize derep.fna -output derep_sorted.fna --relabel_keep --notrunclabels;

# search against Green Genes to generate closed ref OTUs
# reemember to set SET MAX_REJECTS, etc
vsearch
--fasta_width 0 --usearch_global derep_sorted.fna --threads 0 --dbmask none --qmask none --rowlen 0 --top_hits_only --notmatched closed_ref_fail.fna --db GG_97.fasta --id 0.97 --matched closed_ref.fna --uc closed_ref.uc --relabel_keep --notrunclabels;

# sort samples failed closed ref reads
vsearch
--fasta_width 0 --sortbysize closed_ref_fail.fna -output closed_ref_fail_sorted.fna --relabel_keep --notrunclabels;

# randomly subsample 10% of failed closed ref reads
# this will already be sorted by abundance since the input is sorted
vsearch
--fasta_width 0 --fastx_subsample closed_ref_fail_sorted.fna --fastaout closed_ref_fail_subsample_sorted.fna --sample_pct 10 --relabel_keep --notrunclabels;

# cluster failed closed ref subsample reads -> ref DB for new ref OTU
vsearch
--fasta_width 0 --cluster_size closed_ref_fail_subsample_sorted.fna --clusterout_id --consout new_ref_db.fna --id 0.97 --qmask none --relabel_keep --notrunclabels;

# search against new ref DB
# hits are considered New.ReferenceOTU
# failures are considered New.CleanUp
vsearch
--fasta_width 0 --usearch_global closed_ref_fail_sorted.fna --threads 0 --dbmask none --qmask none --rowlen 0 --top_hits_only --notmatched new_ref_fail.fna --db new_ref_db.fna --id 0.97 --matched new_ref.fna --uc new_ref.uc --relabel_keep --notrunclabels;

# denovo cluster of new ref failures
vsearch
--fasta_width 0 --cluster_size new_ref_fail.fna --clusterout_id --consout new_ref_cleanup.fna --id 0.97 --qmask none --uc new_ref_cleanup.uc --relabel_keep --notrunclabels;


When it comes to generating OTU tables, it looks like one can use the biom from-uc script as detailed by gregcaporaso.

However, from the usearch documentation we read:
Sequence labels must have sample identifiers (input set) and OTU identifiers (database) as explained later in this page. This means that you cannot use the input file to cluster_otus for this step because several samples often have the same unique sequence, so the dereplicated (unique) sequence labels either do not have a sample identifier, or have a misleading sample identifier because the same sequence may be found in other samples. The way to deal with this is usually to go back to the "raw" reads after merging or truncating to a fixed length. See sample identifiers for ways to add sample identifiers to the read labels.

Furthermore, later in the page we read:
If a size annotation is found in an read label, the abundance will be added to the total for its OTU. 

Which leads me to my questions regarding the generation of a final OTU table:
  • Can I generate an OTU table using the files closed_ref.uc, new_ref.uc and new_ref_cleanup.uc using the following reference sequences (respectively) GG_97.fasta, new_ref_db.fna, new_ref_cleanup.fna (The usearch documentation would suggests not, although in combination with the size annotation data, I don't understand why the cluster + size data doesn't fully define the OTU table)
  • Does the biom from-uc script take into account the size annotation?
  • It seems like the answer to question #1 seems to be "NO" as I'm running into ValueError issues (see here), what am I doing wrong?
  • Does the above sequence if scripts replicate pick_open_referece_otus?

Thanks for the insight everyone!

Jose Antonio Navas Molina

unread,
Aug 25, 2016, 12:55:55 AM8/25/16
to Qiime 1 Forum, Torbjørn Rognes, Frédéric Mahé
Hi Constantino,

I've cc'ed the original developers of vsearch, their may be able to provide way more information that I can.

Cheers,

Colin Brislawn

unread,
Aug 25, 2016, 2:40:20 PM8/25/16
to Qiime 1 Forum
Hello Constantino,

I also use vsearch for OTU clustering. My final step it to map all my reads from demultiplexing (seqs.fna) to my full list of OTU centroids using --usearch_global. This is based on this step of the uparse pipeline:

The output of this is a .uc file, which I can convert into a .biom table using biom-format.

Let me know if that helps,
Colin


Constantino Schillebeeckx

unread,
Aug 25, 2016, 5:49:07 PM8/25/16
to Qiime 1 Forum
Thanks Colin, I think that's the last piece I was missing.

As a last step (after concatting the OTU centroids from the three OTU picking steps) I'm doing (as you recommended):
cat step1_otus/closed_ref.fna step2_otus/new_ref.fna step3_otus/new_ref_cleanup.fna >> final.fna;
vsearch
--fasta_width 0 --usearch_global derep_sorted.fna --threads 0 --dbmask none --qmask none --rowlen 0 --top_hits_only --db final.fna --id 0.97 --uc final.uc --relabel_keep --notrunclabels

Then, to generate the OTU table I do:
biom from-uc -i final.uc -o final.biom --rep-set-fp final.fna;

However I'm getting the error: 
ValueError: Not all sequence identifiers in the input BIOM file are present in description fields in the representative sequence fasta file.

Which I take to mean there is a read in final.uc which could not be found in final.fna - I'm not sure why that could be though.

For reference, I'm putting together a gist so that this can be used by others.

Related: is it actually necessary to do the final usearch_global if we already have size_annotation data within our dereplicated files?  It seems to me that we are doing un-need work and could get OTU counts directly from the size_annotion (of course, it's entirely possible that I'm missing something...)

Colin Brislawn

unread,
Aug 25, 2016, 6:11:28 PM8/25/16
to Qiime 1 Forum
What happens if you omit the '--rep-set-fp final', and just run 
biom from-uc -final.uc -final.biom

If you have already renamed your OTU centroids, I'm not sure if you need the rep_set file at all. 

Colin

Constantino Schillebeeckx

unread,
Aug 26, 2016, 3:13:45 PM8/26/16
to Qiime 1 Forum
Yep that was it; thanks a million Colin.

Wanted to write it again because I think it could be useful to others.  I've put together the finished pipeline into the following gist.

Colin Brislawn

unread,
Aug 26, 2016, 5:46:06 PM8/26/16
to Qiime 1 Forum
Very cool, Constantino. I think vsearch will be part of Qiime 2, but I'm not totally sure on the status of that. I'm glad there are some other folks out there using vsearch in a production environment.

Colin

Reply all
Reply to author
Forward
0 new messages