VSEARCH: chimera and OTU picking


Sanjeev Sariya

Apr 7, 2016, 12:15:36 PM
to VSEARCH Forum
Hi There,

Thanks for this tool. Is it ready for use on real data, or is it still undergoing testing?

I have data at hand and a conference knocking on my door. The 32-bit version of USEARCH refuses to handle my data, with this error:

File size too big for 32-bit version (17.4Gb)

Input file size 17G

Reads: 32084664


I need to get two things done: chimera identification and OTU picking.

I'd love to hear your suggestions.

Any help would be greatly appreciated.

Frédéric Mahé

Apr 7, 2016, 12:34:21 PM
to VSEARCH Forum
Dear Sanjeev,

I participate in the development of vsearch, and indeed we are still testing some of its less-used features. Nonetheless, I use vsearch on a daily basis and for all my projects. I trust the software, as it gives robust and replicable results.

Best,

Sanjeev Sariya

Apr 7, 2016, 12:44:59 PM
to VSEARCH Forum
Hi Frédéric,

Thanks for the lightning-fast reply. I'm looking at URL. To get my task done, I'll do:

- Dereplicate
vsearch --derep_full ERR348713.fasta --output ERR348713.derep.fasta --sizeout

- Chimera detection
vsearch --uchime_denovo ERR348713.derep.fasta --chimeras chimera.out.fasta --nonchimeras non.chimera.out.fasta

Please correct me if I'm wrong here.
Are there any other steps that need to be done?

Thanks,
Sanjeev

Frédéric Mahé

Apr 7, 2016, 12:58:41 PM
to VSEARCH Forum
Well, it depends if your input is a fastq or fasta file.

You can convert from fastq to fasta with:

vsearch --fastq_filter fastqfile --fastaout fastafile

Then, if you don't need to trim adaptors or primers, you can dereplicate with your command. Instead of launching chimera detection on the full dataset, I suggest clustering it first:

vsearch --cluster_size fastafile --centroids centroids.fasta --id 0.97

Then, you can use your chimera detection command on the file centroids.fasta. Again, it depends a lot on the initial state of your data. I invite you to read vsearch's manual and to try the different options, so you can make an informed decision.
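To make the dereplication step concrete, here is a minimal Python sketch of the idea behind it: identical sequences are collapsed into one record that carries its abundance. It is an illustration only, not vsearch's implementation. The function name dereplicate, the min_unique_size parameter, the case-insensitive comparison and the exact label format are assumptions made for this example.

```python
from collections import Counter


def dereplicate(records, min_unique_size=1):
    """Collapse identical sequences and annotate each with its abundance.

    records: iterable of (label, sequence) pairs.
    Returns a list of ("label;size=N", sequence), most abundant first.
    """
    counts = Counter()
    first_label = {}
    for label, seq in records:
        seq = seq.upper()  # assumption: compare sequences case-insensitively
        counts[seq] += 1
        first_label.setdefault(seq, label)  # keep the first label seen
    kept = [(seq, n) for seq, n in counts.items() if n >= min_unique_size]
    # Stable sort: ties keep their first-seen order.
    kept.sort(key=lambda item: -item[1])
    return [(f"{first_label[seq]};size={n}", seq) for seq, n in kept]


reads = [
    ("s1_1", "ACGT"),
    ("s1_2", "acgt"),
    ("s2_1", "ACGT"),
    ("s2_2", "TTGA"),
]
print(dereplicate(reads))
print(dereplicate(reads, min_unique_size=2))  # singletons dropped
```

Note that the collapsed record keeps only the label of the first read it saw, so the other reads' labels are no longer visible in the dereplicated file.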

Best,

Sanjeev Sariya

Apr 7, 2016, 4:42:51 PM
to VSEARCH Forum
Hi,

Thank you for your reply.
I come from working with QIIME and its wrappers, so this might take some back-and-forth. Apologies for the trouble.
I have Illumina V4 region data with a read length of 300, from 414 samples. I'd like to use vsearch after demultiplexing the data with QIIME 1.9.1, that is, on a FASTA file. :)

vsearch --version

vsearch v1.10.2_linux_x86_64, 63.0GB RAM, 32 cores


I had in mind:
1- dereplicate
2- chimera check
3- filter chimeras from the original FASTA using QIIME
4- use vsearch for clustering (OTU picking)
5- pick representative seqs
6- run RDP

What you're suggesting is:
1- dereplicate
2- run clustering (OTU picking)
3- then run the chimera check
4- remove chimeras from the original data set
5- pick representative seqs
6- run RDP

1- Am I understanding you correctly?
2- Also, I have read posts mentioning removal of singletons, sorting, and other technical steps. Could you tell me at which step those should be done with vsearch?
I've never used usearch on its own.

Thanks,
Sanjeev

Sanjeev Sariya

Apr 7, 2016, 5:12:57 PM
to VSEARCH Forum
Not to flood your inbox:


What I'm doing is:


1- dereplicate and remove singletons


vsearch --derep_full lib_tagclean.fasta --output derep.fasta --log=log --sizeout --minuniquesize 2  


2- run chimera check

vsearch --uchime_denovo derep.fasta --chimeras chimera.out.fasta --nonchimeras non.chimera.out.fasta


---------


3- 

I don't know whether I should be doing this after dereplication:

vsearch --sortbysize derep.fasta --output derep_sorted.fa


4-

- Then run the chimera check on the dereplicated and sorted sequences above?

vsearch --uchime_denovo derep_sorted.fa --chimeras sorted.chimera.out.fasta --nonchimeras sorted.non.chimera.out.fasta


I do not know if I should be doing Step 3 after step 1, and then run chimera check.


---


Thanks very much for all your support and guidance!


Frédéric Mahé

Apr 19, 2016, 5:59:28 AM
to VSEARCH Forum
Yes, you are right, I am suggesting:


1- dereplicate
2- run clustering (OTU picking)
3- then run the chimera check on OTU representatives (output only non-chimeras)
4- run RDP on non-chimeric OTU representatives

I don't use vsearch for clustering, I use swarm. If you are interested, my pipeline, with code examples, is described here.

Regarding singletons, yes, it is customary to remove them. You can do that with vsearch on your final fasta file of non-chimeric OTU representatives. Personally, I filter out small OTUs that I see in only one sample when I build the OTU table.
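The filtering rule mentioned above (drop small OTUs seen in only one sample) could be sketched in Python as follows. This is not taken from the pipeline linked above. The function name filter_otus, the table layout (OTU -> sample -> count) and the small_size threshold of 2 are assumptions chosen for illustration.

```python
def filter_otus(table, small_size=2):
    """Drop OTUs that are both small and confined to a single sample.

    table: dict mapping OTU id -> dict mapping sample id -> read count.
    small_size: hypothetical threshold; an OTU whose total count is at or
    below it counts as "small".
    """
    kept = {}
    for otu, counts in table.items():
        total = sum(counts.values())
        n_samples = sum(1 for count in counts.values() if count > 0)
        if total <= small_size and n_samples <= 1:
            continue  # small and seen in one sample only: discard
        kept[otu] = counts
    return kept


table = {
    "OTU_1": {"sampleA": 40, "sampleB": 12},
    "OTU_2": {"sampleA": 2, "sampleB": 0},   # small, one sample
    "OTU_3": {"sampleA": 1, "sampleB": 1},   # small, but two samples
    "OTU_4": {"sampleB": 25},                # one sample, but large
}
print(sorted(filter_otus(table)))
```

Only OTU_2 is removed here, because it is the only OTU that is both small and restricted to one sample.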

Frédéric Mahé

Apr 19, 2016, 6:04:18 AM
to VSEARCH Forum
On Thursday, April 7, 2016 at 11:12:57 PM UTC+2, Sanjeev Sariya wrote:
Not to flood your inbox:


What I'm doing is:


3- 

I don't know whether I should be doing this after dereplication:

vsearch --sortbysize derep.fasta --output derep_sorted.fa



You don't need to sort dereplicated sequences. Normally, vsearch outputs sorted results.
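If you want to convince yourself that a dereplicated file is already ordered by abundance, a small check like the one below may help. It is a sketch that assumes labels carry a ";size=N" annotation (as written with --sizeout) and treats a label without one as abundance 1. The helper names are made up for this example.

```python
import re

# Matches the abundance annotation in a label such as "seq1;size=12;".
SIZE_RE = re.compile(r";size=(\d+)")


def abundance(label):
    """Return the abundance stored in a label, or 1 if none is present."""
    match = SIZE_RE.search(label)
    return int(match.group(1)) if match else 1


def is_sorted_by_size(labels):
    """True if abundances never increase from one label to the next."""
    sizes = [abundance(label) for label in labels]
    return all(a >= b for a, b in zip(sizes, sizes[1:]))


print(is_sorted_by_size(["a;size=5;", "b;size=5;", "c;size=1;"]))
print(is_sorted_by_size(["a;size=1;", "b;size=3;"]))
```

To run it on a real file, pass the header lines (those starting with ">") of the FASTA file as the labels.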

 


4-

- Then run the chimera check on the dereplicated and sorted sequences above?

vsearch --uchime_denovo derep_sorted.fa --chimeras sorted.chimera.out.fasta --nonchimeras sorted.non.chimera.out.fasta


I do not know if I should be doing Step 3 after step 1, and then run chimera check.



You should search for chimeras in the fasta file containing the OTU representative sequences (after clustering). Working only on representatives speeds up the process.

Best,

André Soares

May 28, 2016, 7:38:13 AM
to VSEARCH Forum
Hello there,

Would chimera.out.fasta be in usual .fasta format?
As in:
>xxx and so on
ATCCAGAG

I'm thinking of applying this to QIIME, so I would need to extract the strings after ">" into a \n-delimited .txt file...

Thanks,
André

Sanjeev Sariya

Jun 1, 2016, 12:38:00 PM
to VSEARCH Forum
Hi André,
Yes, the output is a regular FASTA file.
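For the extraction André describes, a few lines of Python are enough. The sketch below keeps only the first whitespace-delimited token after ">", on the assumption that this is the sequence ID the downstream tool expects. The function name fasta_ids and the example data are made up for illustration.

```python
import io


def fasta_ids(lines):
    """Yield the sequence ID (first token after '>') of each FASTA record."""
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                yield header.split()[0]


# In-memory stand-in for a file such as chimera.out.fasta.
example = io.StringIO(
    ">readA_1 some description\n"
    "ATCCAGAG\n"
    ">readB_7\n"
    "ATCCAGAGTT\n"
)
ids = list(fasta_ids(example))
print("\n".join(ids))
```

With a real file, open it and pass the file handle to fasta_ids, then write "\n".join(ids) to a .txt file. If the labels carry a ";size=N" annotation, you may need to strip it before matching the IDs against the original reads.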

Best,
Sanjeev

Jessica Hardwicke

Jul 21, 2016, 2:05:16 PM
to VSEARCH Forum
Hi Sanjeev,

Did you use QIIME to build an OTU table after the pipeline you mentioned above? I'm wondering how to run QIIME's make_otu_table.py on the chimera-removed data, as this command expects the .txt output of pick_otus.py as input...

Sanjeev Sariya

Jul 21, 2016, 4:57:32 PM
to VSEARCH Forum
Hi Jessica,

I do not continue with QIIME once I have the demultiplexed FASTA.
I classify the sequences with RDP, using our in-house curated training set, and parse the results with a Python script to get the desired tabulated output.

QIIME demultiplexed FASTAs --> VSEARCH chimera -->filter chimera --> RDP
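One possible way to do the "filter chimera" step without QIIME is sketched below. It is not necessarily what was done in this pipeline. It assumes chimeras were detected on dereplicated sequences, so every original read whose sequence exactly matches a chimeric sequence is dropped. The function names and example data are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_chimeric_reads(original_lines, chimera_lines):
    """Keep only reads whose sequence is not among the chimeric ones."""
    chimeric = {seq.upper() for _, seq in read_fasta(chimera_lines)}
    return [
        (header, seq)
        for header, seq in read_fasta(original_lines)
        if seq.upper() not in chimeric
    ]


original = [">s1_1", "ACGT", ">s1_2", "GGCC", ">s2_1", "ggcc", ">s2_2", "TTAA"]
chimeras = [">s1_2;size=2;", "GGCC"]
print(drop_chimeric_reads(original, chimeras))
```

Matching on the sequence rather than on the label removes every copy of a chimeric sequence, including copies that came from other samples.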

--
Sanjeev

Sanjeev Sariya

Jul 22, 2016, 9:45:19 AM
to VSEARCH Forum
Hi there,

I follow these steps:

1)


vsearch --derep_full small_test.fasta --output small_derep.fasta --log=log --sizeout --minuniquesize 2


2)


vsearch -cluster_fast small_derep.fasta -id 0.97 --sizein --sizeout --relabel OTU_  --centroids otus.fna


3)


vsearch --uchime_denovo otus.fna --nonchimeras otus_checked.fna --sizein --xsize --chimeras chimeras.fasta  


4)


vsearch -usearch_global small_test.fasta -db otus_checked.fna -strand plus -id 0.97 -uc otu_table_mapping.uc


5)


python drive5/mod_uc2otutab.py otu_table_mapping.uc > tabfile.tsv


I downloaded this script (mod_uc2otutab.py) from Robert Edgar's website.
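To show what happens at this step, here is a minimal Python sketch that turns .uc hit records into per-sample counts. It is not the mod_uc2otutab.py script itself. It assumes the usual 10 tab-separated .uc columns (record type first, query label ninth, target label tenth) and QIIME-style read labels of the form SampleID_N. The function name is hypothetical.

```python
from collections import defaultdict


def uc_to_otu_table(uc_lines):
    """Count hits per OTU and per sample from .uc formatted lines."""
    table = defaultdict(lambda: defaultdict(int))
    for line in uc_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10 or fields[0] != "H":
            continue  # keep hit records only
        query, target = fields[8], fields[9]
        # Assumption: read labels look like "SampleID_123".
        sample = query.split()[0].rsplit("_", 1)[0]
        otu = target.split(";")[0]  # ignore any ";size=N;" annotation
        table[otu][sample] += 1
    return {otu: dict(samples) for otu, samples in table.items()}


uc = [
    "H\t0\t253\t99.2\t+\t0\t0\t253M\tsampleA_1\tOTU_1",
    "H\t0\t253\t98.0\t+\t0\t0\t253M\tsampleA_2\tOTU_1",
    "H\t1\t253\t97.5\t+\t0\t0\t253M\tsampleB_1\tOTU_2",
    "N\t*\t*\t*\t.\t*\t*\t*\tsampleB_2\t*",
]
print(uc_to_otu_table(uc))
```

Reads without a hit (the N record above) do not contribute to any OTU.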


6)


sed -i -E 's/;size=[0-9]+;//g' tabfile.tsv
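For anyone who prefers to do this clean-up in Python rather than sed, the same substitution can be written with the re module. This sketch uses the same pattern as the sed command above. The function name strip_size is made up for the example.

```python
import re

# Same pattern as the sed command: ";size=" followed by digits and ";".
SIZE_ANNOTATION = re.compile(r";size=[0-9]+;")


def strip_size(text):
    """Remove every ';size=N;' annotation from a string."""
    return SIZE_ANNOTATION.sub("", text)


row = "OTU_1;size=42;\t3\t0\t17"
print(strip_size(row))
```

Like the sed command, this only removes annotations that end with a semicolon, so a label ending in ";size=42" without the trailing ";" is left unchanged.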


I discussed this at length with a QIIME developer. Respective thread's link.


Hope this would be helpful.

Cheers!

Bahtiyar Yilmaz

Nov 9, 2016, 10:16:24 AM
to VSEARCH Forum
Hey Sanjeev,

Can I ask you one thing? When I run a command line:

vsearch --uchime_denovo otus.fna --nonchimeras otus_checked.fna --sizein --xsize --chimeras chimeras.fasta  

I get two output files, "otus_checked.fna" and "chimeras.fasta". Can I use "otus_checked.fna" for pick_otus.py? If so, why do you use the filter_fasta.py command in QIIME?

If I need to use filter_fasta.py, can you write down the exact input files that you pass to the command?

Thanks a lot.

Best wishes,
Bahti

Sanjeev Sariya

Nov 9, 2016, 11:56:06 AM
to VSEARCH Forum
Hello Bahti,

I assume you ran the dereplication and clustering steps before running vsearch --uchime_denovo.
In that case, the "otus_checked.fna" file you get already contains the picked OTUs: these are the non-chimeric OTU representative sequences, and you can run a classifier on them.

Sorry, I do not know what the "filter_fasta.py" script is for.

I'm sharing the link to the lengthy discussion I had with the QIIME developers about what each step means. Link for it.
Hope this helps. 

Cheers!
Sanjeev

Bahtiyar Yilmaz

Nov 9, 2016, 12:39:39 PM
to VSEARCH Forum
Thanks a lot Sanjeev! I have seen that! 

Kindest regards,
Bahti

Toke BA

Dec 6, 2016, 5:42:02 AM
to VSEARCH Forum
Hi Sanjeev,

Do you do chimera checking on each sample separately, or on a combined pool of sequences from all samples?

Best

Sanjeev Sariya

Dec 6, 2016, 8:20:29 AM
to VSEARCH Forum
Hi Toke,  

I run chimera check on the combined samples.  
I think you're facing the same problem I face: how to repeat chimera checking on the same samples for repeated analyses. I don't know how to get past this bottleneck step.

Thanks,
Sanjeev

felipealbor...@gmail.com

Jul 12, 2017, 2:30:52 PM
to VSEARCH Forum
Hi Sanjeev, I was wondering what your step 4 does. If you already have your OTUs without chimeras, why not make a table from that file?

cheers
felipe

Sanjeev Sariya

Jul 12, 2017, 2:39:09 PM
to VSEARCH Forum
Hi, 

I'm not 100% certain, but my understanding is that the chimera-free OTUs need to be mapped back to the initial data set so that we know which samples those representative sequences belong to.

Thanks,
Sanjeev

felipealbor...@gmail.com

Jul 12, 2017, 2:46:40 PM
to VSEARCH Forum
I am following this pipeline and I get an OTU table with the abundance of each OTU in each sample. Please correct me if I'm wrong:

1) Dereplicate the sequences. --sizeout records the abundance of each unique sequence

vsearch --derep_fulllength rawsequences.fasta --output sequences.derep.fasta --sizeout --strand plus --minuniquesize 2



2) Remove chimeras. --sizein carries the sequence abundances over from the previous file and --sizeout writes them to the new file (without chimeras)

vsearch --uchime_denovo sequences.derep.fasta --sizein --nonchimeras seqs.nochimeras.fasta --sizeout  


3) Cluster OTUs. --sizein and --sizeout carry the abundances through, and --otutabout writes an OTU table with abundances

vsearch -cluster_fast seqs.nochimeras.fasta --sizein --id 0.99 --sizeout --sizeorder --relabel OTU_ --centroids OTU.sequences.fasta --otutabout otutable.txt


I might be wrong, but I don't see why we need to map the no-chimera OTU list to the original file if with this pipeline I get all I need (I think) from step 3

cheers
Felipe

Colin Brislawn

Jul 12, 2017, 7:22:47 PM
to VSEARCH Forum
Hello Felipe,

This is a great question: 
I might be wrong, but I don't see why we need to map the no-chimera OTU list to the original file if with this pipeline I get all I need (I think) from step 3

The person who developed this pipeline has a great answer:
Sequence labels must have sample identifiers (input set) and OTU identifiers (database) as explained later in this page. This means that you cannot use the input file to cluster_otus for this step because several samples often have the same unique sequence, so the dereplicated (unique) sequence labels either do not have a sample identifier, or have a misleading sample identifier because the same sequence may be found in other samples. The way to deal with this is usually to go back to the "raw" reads after merging or truncating to a fixed length.
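A toy example may make the point in the quoted explanation easier to see. The sketch below assumes read labels of the form SampleID_N and a dereplication step that keeps the first label it encounters. Both are assumptions for illustration, as are the helper names.

```python
from collections import Counter


def sample_of(label):
    """Return the sample part of a label such as 'sampleA_1'."""
    return label.rsplit("_", 1)[0]


def dereplicated_labels(reads):
    """Labels that survive dereplication: the first one seen per sequence."""
    first = {}
    for label, seq in reads:
        first.setdefault(seq, label)
    return list(first.values())


def per_sample_counts(reads):
    """True number of reads per sample, taken from the raw reads."""
    return Counter(sample_of(label) for label, _ in reads)


# The same sequence occurs once in sampleA and twice in sampleB.
reads = [("sampleA_1", "ACGT"), ("sampleB_1", "ACGT"), ("sampleB_2", "ACGT")]
print(dereplicated_labels(reads))
print(dict(per_sample_counts(reads)))
```

After dereplication the only label left points to sampleA, although two of the three reads came from sampleB. Mapping the raw reads back to the OTUs recovers the per-sample counts.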


You can read more about this step on this page:

I hope that helps,
Colin

felipealbor...@gmail.com

Jul 13, 2017, 2:56:03 PM
to VSEARCH Forum
Thanks! that makes sense

xiol...@gmail.com

Jan 3, 2018, 6:43:20 PM
to VSEARCH Forum
Hi Sanjeev,

I was wondering how you went about assigning taxonomy to the OTUs?

Thanks,

xio

Colin Brislawn

Jan 6, 2018, 6:19:54 PM
to VSEARCH Forum
Hello Xio,

Happy New Year!

vsearch does not include taxonomy assignment (but it might!). You could use another pipeline to assign taxonomy, like one of these:
- QIIME 2 includes lots of plugins for taxonomy assignment, including one that uses vsearch: https://docs.qiime2.org/2017.12/plugins/available/feature-classifier/classify-consensus-vsearch/
- FROGS was developed by one of the vsearch developers, includes taxonomy assignment, and can be run online. Here is how it works: https://f1000research.com/slides/5-1832