VSEARCH: chimera and otu picking - steps


Sanjeev Sariya

unread,
Apr 8, 2016, 9:30:20 AM4/8/16
to Qiime 1 Forum
Hi QIIME Devs,

QIIME version 1.9.1, usearch 6.1 (32-bit)
I know VSEARCH is not supported by QIIME.

Recently, I've been handed a huge data set: after forward and reverse primer removal, the demultiplexed data is 17 GB (Illumina HiSeq 300 PE, V3-V4 region). I have the 32-bit usearch, which won't allow me to run the chimera check on data this size, and that has put me in a bad situation.

I've never used usearch separately from QIIME, and I'm therefore having trouble running VSEARCH and proceeding further. Pardon me for creating a duplicate thread on QIIME: https://groups.google.com/forum/#!topic/vsearch-forum/Hh-AaYVTQpg

I don't know which steps need to be run separately in VSEARCH, as all of them (dereplication, sorting, etc.) had been taken care of by the QIIME scripts. I'd be thankful if you could help me with the steps, and correct me if I'm wrong. The individual steps are split out below.

My goal is to find chimeras (de novo), filter them out, pick OTUs (de novo), get representative sequences, and run the RDP classifier separately. I'm not using Greengenes or SILVA.

What I've done:

1) Dereplicate data (get rid of singletons):

vsearch --derep_full lib_clean.fasta --output derep.fasta --log=log --sizeout --minuniquesize 2 


2) Get chimeras


vsearch --uchime_denovo derep.fasta --chimeras chimera.out.fasta --nonchimeras non.chimera.out.fasta


The chimera job has now been running for ~10+ hours.


3) I'll filter them out using the filter_fasta.py script.


4) How do I proceed with pick_otus.py using vsearch (.uc, centroid steps)?


-- 

Please help, and guide.


Best,

Sanjeev



 


Colin Brislawn

unread,
Apr 8, 2016, 12:48:03 PM4/8/16
to Qiime 1 Forum
Hello Sanjeev,

I also use vsearch just like you describe here. I find it works really well!

Some thoughts: 
  • Vsearch scales really well with large files! Isn't that great!
  • uchime_denovo is not currently multi-threaded, so it takes a while. One valid option is to pick OTUs first and then chimera check afterwards; this also works well and is much faster (a rough sketch of that order follows below).
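For orientation, here is a minimal sketch of that "cluster first, chimera check after" order in vsearch. The file names are assumptions, and the exact flags are worked out further down this thread:

vsearch --derep_fulllength seqs.fasta --output derep.fasta --sizeout --minuniquesize 2    # dereplicate and drop singletons
vsearch --cluster_fast derep.fasta --id 0.97 --sizein --sizeout --centroids otus.fasta    # cluster at 97% identity
vsearch --uchime_denovo otus.fasta --nonchimeras otus_nonchimeric.fasta --sizein --xsize  # de novo chimera check on the centroids only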
3) I'll filter them out using the filter_fasta.py script.
Or you could just use the output file, non.chimera.out.fasta, as the input for OTU picking. 

How do I proceed with pick_otus.py using vsearch (.uc, centroid steps)?
Are you planning to chimera check before or after OTU picking? Let me know and I can help you through these next steps.

Colin
 

Sanjeev Sariya

unread,
Apr 8, 2016, 1:36:00 PM4/8/16
to Qiime 1 Forum
Hi Colin, 

Glad to see your reply. 

| Or you could just use the output file, non.chimera.out.fasta, as the input for OTU picking. 

Yes, thanks for this. I was thinking in one direction only. :-/

| Are you planning to chimera check before or after OTU picking? Let me know and I can help you through these next steps.

I think I'll go with your suggestion: pick OTUs, and then do the chimera check. I'm unsure what the workflow would look like:

- dereplicate using VSEARCH
- cluster them at 97% using VSEARCH
- remove chimera using VSEARCH
- pick representative sequences using QIIME

Does this look sane?
I explicitly mentioned VSEARCH and QIIME so there isn't any confusion in tools, terms, and technicalities. :)

Looking forward to hearing from you.
Thanks again for your supportive reply. :)

--Sanjeev

Colin Brislawn

unread,
Apr 8, 2016, 3:14:01 PM4/8/16
to Qiime 1 Forum
Hi Sanjeev,

- dereplicate using VSEARCH
- cluster them at 97% using VSEARCH
- remove chimera using VSEARCH
Perfect!
When clustering, you can use multiple threads, which is really important when running at scale. 

- pick representative sequences using QIIME
Nope, this one is different. When picking OTUs outside of QIIME, instead of this step you map your original reads back to your chimera-checked OTUs. If you have used UPARSE, this should be familiar:
The vsearch command should be something like: 
vsearch -usearch_global reads.fa -db otus.fa -strand plus -id 0.97 -uc map.uc

Then you convert that map.uc file into a .biom table. 
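As a rough sketch of that conversion (assuming the uc2otutab.py script from the drive5 python scripts — the thread later uses a modified copy, mod_uc2otutab.py — and the biom-format command-line tool; file names here are just illustrative):

python uc2otutab.py map.uc > otu_table.txt                                           # uc mapping -> classic tab-separated OTU table
biom convert -i otu_table.txt -o otu_table.biom --table-type="OTU table" --to-json   # classic table -> .biom for the QIIME scripts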


Good luck!
As you walk through this process, let me know if you have any questions,
Colin 

Sanjeev Sariya

unread,
Apr 8, 2016, 5:51:55 PM4/8/16
to Qiime 1 Forum
Hi Colin,

Thanks for your reply, again. 
It's embarrassing that I've not worked with .biom before; I'd need your patience here.
I've always worked within the QIIME process: remove chimeras, cluster, pick representative seqs, and classify them using our in-house training set.

What I've done on small data is:

1- Dereplicate

vsearch --derep_full small_test.fasta --output small_derep.fasta --log=log --sizeout --minuniquesize 2


getting rid of singletons.

2- Cluster them

vsearch -cluster_fast small_derep.fasta -id 0.97 -uc results.uc -minsize 2  -consout  otus.fa


getting rid of singletons.

3- Some step:

vsearch -usearch_global small_derep.fasta -db otus.fa -strand plus -id 0.97 -uc no.uc


To be honest, I don't really know what I'm doing. I have the following doubts:


A) I'm getting rid of singletons in step1 and step2, is it redundant or is it advised?


B) In step 2, I hope -consout otus.fa is the correct and required output file. Please correct me if I'm wrong.


C) In step 2 I have -uc results.uc, and in step 3 I again have -uc no.uc.

Which one is correct, and what is each one for? (uc = uclust-format output)


D) What is the step3 for?


E) Is Step3 necessary? 


F) How do I use the .biom file further, and how do I generate representative sequences to get back to classification?


My apologies for throwing out so many naive queries, but I've been struggling to get my head around the many files and steps.

Appreciate your time, and replies. :)


Best,

Sanjeev



Colin Brislawn

unread,
Apr 9, 2016, 9:42:40 PM4/9/16
to Qiime 1 Forum
Great questions Sanjeev! I'll answer in line. 


On Friday, April 8, 2016 at 2:51:55 PM UTC-7, Sanjeev Sariya wrote:
Hi Colin,

Thanks for your reply, again. 
It's embarrassing that I've not worked with .biom before; I'd need your patience here.
I've always worked within the QIIME process: remove chimeras, cluster, pick representative seqs, and classify them using our in-house training set.

What I've done on small data is:

1- Dereplicate

vsearch --derep_full small_test.fasta --output small_derep.fasta --log=log --sizeout --minuniquesize 2


getting rid of singletons.
Good stuff!  

2- Cluster them

vsearch -cluster_fast small_derep.fasta -id 0.97 -uc results.uc -minsize 2  -consout  otus.fa


getting rid of singletons.
Really good, and very close. At this step, it's important to include the --sizein and --sizeout flags; this makes sure that the size annotations from derep_fulllength are carried through. Because you have already removed singletons in the previous step, you can drop the -minsize 2 flag here.
Also, use --centroids (instead of --consout) and --relabel OTU_, so that your reads will actually be labeled OTU_1, OTU_2, OTU_3, etc., in the output file.
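Putting those suggestions together, the clustering command would look something like this (just a sketch with your file names; Sanjeev spells out the same command a few posts below):

vsearch --cluster_fast small_derep.fasta --id 0.97 --sizein --sizeout --relabel OTU_ --centroids otus.fna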


3- Some step:
Add a step! Here is where I check for chimeras. The command will be something like
vsearch --uchime_denovo otus.fna --nonchimeras otus_checked.fna --sizein --xsize
The --xsize option strips the size information from otus_checked.fna so that it does not mess up tree building (FastTree2 does not like size annotations).
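(If you later need that tree for phylogenetic metrics, the usual QIIME 1 route from the chimera-checked representatives is roughly the sketch below; the derived file names are assumptions based on QIIME's default output naming.)

align_seqs.py -i otus_checked.fna -o aligned/                                       # PyNAST alignment
filter_alignment.py -i aligned/otus_checked_aligned.fasta -o filtered/              # drop gap-only / highly variable positions
make_phylogeny.py -i filtered/otus_checked_aligned_pfiltered.fasta -o rep_set.tre   # FastTree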


vsearch -usearch_global small_derep.fasta -db otus.fa -strand plus -id 0.97 -uc no.uc


Also close. In this step, use your original reads (from before dereplication) for mapping: --usearch_global small_test.fasta.
This step is placing each read from a sample into its best OTU. I might call the output --uc otu_table_mapping.uc 


To be honest, I don't really know what I'm doing. I have the following doubts:


I'm here to help! :-)


A) I'm getting rid of singletons in step1 and step2, is it redundant or is it advised?

 I would only remove singletons in step 1, then make sure that step 2 uses --sizein so those size annotations are kept. 


B) In step 2, I hope -consout  otus.fa is correct, and required file. Please correct me if I'm wrong.

Maybe it is for usearch. For vsearch, I would use the --centroids flag because it keeps the most abundant read as the centroid. 


C) In step 2 I have -uc results.uc, and in step 3 I again have -uc no.uc.

Which one is correct, and what is each one for? (uc = uclust-format output)

You can call it whatever you want. I really like showing that it is going to become the OTU table.  


D) What is the step3 for?


E) Is Step3 necessary?

Yes. This step takes every single read from every single sample and puts it into the best OTU. The output (the uc mapping file) will be converted into your .biom file. (This replaces the qiime scripts pick_rep_set.py and make_otu_table.py)


F) How do I use .biom file further, and how am I to generate representative sequences to get back to classification?

The output of --cluster_fast is your rep_set.fna file. After you chimera check it, you use it for taxonomy stuff. 
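Since you mentioned classifying with RDP against an in-house training set, a minimal QIIME 1 sketch would be something like the following (the training-set file names are hypothetical placeholders for your own reference FASTA and ID-to-taxonomy map):

assign_taxonomy.py -i otus_checked.fna -m rdp -r inhouse_ref_seqs.fasta -t inhouse_id_to_taxonomy.txt -o rdp_assigned_taxonomy/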


My apologies for throwing so many naive queries. But I've been failing to get my head around with multiple files, and steps.

Appreciate your time, and replies. :)


You are good! Asking these questions gives you a real understanding of this process. And I love answering methods questions like this! 

Best,

Sanjeev



Sanjeev Sariya

unread,
Apr 11, 2016, 11:35:01 AM4/11/16
to Qiime 1 Forum
Hi Colin,

Thank you for the detailed replies and for walking me through the steps. :)
Just to be certain I'm understanding you completely, I'd like to tally the steps.

1) De-replicate:

vsearch --derep_full small_test.fasta --output small_derep.fasta --log=log --sizeout --minuniquesize 2

2) Cluster them at 97% and relabel; the centroid file is the one to be used further in the chimera check:

vsearch -cluster_fast small_derep.fasta -id 0.97 -uc results.uc --sizein --sizeout --relabel OTU_ --centroids otus.fna

3) Perform chimera check (de novo in my case):

vsearch --uchime_denovo otus.fna --nonchimeras otus_checked.fna --sizein --xsize --chimeras chimeras.fasta

4) Get the .uc file, which is sort of an OTU table:

vsearch -usearch_global small_test.fasta -db otus.fna -strand plus -id 0.97 -uc otu_table_mapping.uc

5) I used the scripts from this link, with the function fixed as mentioned on the thread:

python drive5/mod_uc2otutab.py otu_table_mapping.uc > tabfile.tsv

6) Things went well so far. I didn't like ;size=N; concatenated with my OTUs, so one more step:

sed -i -E 's/;size=[0-9]+;//g' tabfile.tsv


Questions:


A- I did the clustering after sorting by length, using the --cluster_fast flag.

How does one decide which way to go: clustering after sorting by length, or after sorting by abundance?

Or is it up to each person?


B- I don't use the results.uc generated in step 2.

- What is its use?

If it has no use, then I'd get rid of that flag and argument.


C- Just to get through the entire process with small data, I took the first 2,000 sequences from my big demultiplexed file.


-- 2000 input sequences

-- After de-replicating I'm left with 148: 7.4% of the original

-- otus.fna - after clustering, that is, my representative sequences: 75 

-- After running the chimera check, my non-chimeric sequences were still 75, i.e. no chimeras. That is understandable; it might be because I grabbed so few sequences.

-- I'm using the non-chimeric ones for RDP classification and further analyses.


In other words, I have 3.75% of the original 2,000 reads (as representatives) to work with in RDP.

Do these numbers look legit to you, or is something horrible going on? (Again, I took too small a sample.)


D- In step:


vsearch -usearch_global small_test.fasta -db otus.fna -strand plus -id 0.97 -uc otu_table_mapping.uc


Shouldn't I be using otus_checked.fna instead of otus.fna?

The OTU table should be made from the reads which passed the chimera check and other filters.


--

Thanks again for your extensive support, and encouragement throughout this process.


Best,

Sanjeev

Colin Brislawn

unread,
Apr 11, 2016, 1:20:55 PM4/11/16
to Qiime 1 Forum
Good morning Sanjeev,

I'll reply in line. 


On Monday, April 11, 2016 at 8:35:01 AM UTC-7, Sanjeev Sariya wrote:
Hi Colin,

Thank you for detailed replies, and walking down the steps. :)
Just to be certain I'm understanding you completely, I'd like to tally steps.

1) De-replicate:

vsearch --derep_full small_test.fasta --output small_derep.fasta --log=log --sizeout --minuniquesize 2

2) Cluster them at 97% and relabel; the centroid file is the one to be used further in the chimera check:

vsearch -cluster_fast small_derep.fasta -id 0.97 -uc results.uc --sizein --sizeout --relabel OTU_ --centroids otus.fna

Here, you do not need -uc results.uc, because that file is not used.
 

3) Perform chimera check (de novo in my case):

vsearch --uchime_denovo otus.fna --nonchimeras otus_checked.fna --sizein --xsize --chimeras chimeras.fasta

4) Get the .uc file, which is sort of an OTU table:

vsearch -usearch_global small_test.fasta -db otus.fna -strand plus -id 0.97 -uc otu_table_mapping.uc

Close! Replace otus.fna with otus_checked.fna. After all, you are only interested in the OTUs that are not chimeric.

5) I used the scripts from this link, with the function fixed as mentioned on the thread:

python drive5/mod_uc2otutab.py otu_table_mapping.uc > tabfile.tsv

Perfect.

6) Things went well so far. I didn't like ;size=N; concatenated with my OTUs, so one more step:

sed -i -E 's/;size=[0-9]+;//g' tabfile.tsv

That works. The --xsize option will also remove these size annotations.

Questions:


A- I did clustering after sorting by lengths using --cluster_fast flag.

Yes. That's the clustering step.  

How does one decide which way to go: clustering after sorting by length, or after sorting by abundance?

Or is it up to each person?

As always, the choice is yours. I really like sorting by abundance (not by length), and other people like sorting by abundance too: http://www.drive5.com/usearch/manual/uclust_algo.html
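If you want to try abundance-based sorting in vsearch, one variant of your step 2 is --cluster_size, which sorts the dereplicated reads by their size annotations before clustering (a sketch, using the same file names as yours):

vsearch --cluster_size small_derep.fasta --id 0.97 --sizein --sizeout --relabel OTU_ --centroids otus.fna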


B- I don't use results.uc generated in step 2. 

- What is its use? 

If no use, then I'd get rid of that flag, and argument.

Perfect! I would remove it as well.  


C- Just to get down with the entire process with small data I took 2000 initial sequences from my big demultiplexed file. 


-- 2000 input sequences

-- After de-replicating I'm left with 148: 7.4% of the original

Makes sense.  

-- otus.fna - after clustering, that is, my representative sequences: 75 

Good.  

-- After running the chimera check, my non-chimeric sequences were still 75, i.e. no chimeras. That is understandable; it might be because I grabbed so few sequences.

I agree. When you have more OTUs, you will probably find more chimeras. 

-- I'm using the non-chimeric ones for RDP classification and further analyses.


In other words, I've 3.75% (rep) reads of the original 2000 to work on RDP. 

Do these numbers look legit to you, or there's something horrible going on? (again, I took a too small)


D- In step:


vsearch -usearch_global small_test.fasta -db otus.fna -strand plus -id 0.97 -uc otu_table_mapping.uc


Shouldn't I be using otus_checked.fna instead of otus.fna?

Yes!  

OTU table should be made from the reads which passed all chimera, and other filters.


--

Thanks again for your extensive support, and encouragement throughout this process.


Best,

Sanjeev


Great questions! Let me know how this process works for you!
Colin
 

Sanjeev Sariya

unread,
Apr 11, 2016, 1:28:42 PM4/11/16
to Qiime 1 Forum
Very good morning Colin. 
Wow! Amazing and extremely helpful replies. :)

I shall definitely let you know should I run into any more roadblocks.
I shall compare results using both clustering approaches, abundance and length, and check my results to decide which one to go for.

Thanks a lot for your prompt responses, patience, and guiding me through this process.

Cheers,
Sanjeev

Neus

unread,
Oct 6, 2016, 9:24:41 AM10/6/16
to Qiime 1 Forum
Dear Sanjeev and Colin,

I've been reading this thread and using the same pipeline as Sanjeev for processing my dataset.

I have added another step to filter my OTU table using the --min_count_fraction option in QIIME, to retain the most abundant OTUs. I have filtered the .biom file resulting from the pipeline, but now I am not able to pick the representative sequences of the filtered OTU table using the pick_rep_set.py script.

Maybe the problem is the OTU map. I am using the .uc file resulting from the pipeline and translated to .txt. Is that correct?

Do you have any other idea to get the representative sequences from the filtered file? I wonder if it is possible to do the filtering during the vsearch pipeline.

Thank you very much in advance for your help,

Cheers,

Neus

Sanjeev Sariya

unread,
Oct 6, 2016, 11:14:52 AM10/6/16
to Qiime 1 Forum
Hi Neus,

If I understand you correctly:
The .txt file from the .uc file is different from what QIIME creates (from the rep-set script). The OTU table (.txt) derived from the .uc gives you the count for each sample, whereas the OTU map from QIIME prints the sample name as many times as the respective OTU is present in it.

The .uc -> .txt file I generate uses scripts from Robert Edgar.

Comparison of the OTU tables:

.uc -> .txt:

OTUID  Sample1   (header)
OTU1   2

QIIME's:

OTUID   (header)
OTU1 Sample1 Sample1

Maybe Colin can throw some more light. I don't use the mentioned flag in my processing.
I break away from QIIME after demultiplexing my reads and proceed separately, as Colin and I discussed in this thread. :)
I hope this helps.

Cheers,
Sanjeev

Colin Brislawn

unread,
Oct 6, 2016, 12:55:40 PM10/6/16
to Qiime 1 Forum
Hello Neus,

I'm still trying to understand your approach, so I'm not sure how much I can help.

I have filtered the .biom file resulting from the pipeline, but now, I am not able to pick the representative sequences of the filtered OTU table by using the pick_rep_set.py script.
What script are you using to filter the .biom table? 
How are you running 'pick_rep_set.py'? At no point in my pipeline do I use this script, because I create my .biom table based on the rep_set.fna from OTU picking, not the other way around.

Maybe the problem is the OTU map. I am using the .uc file resulting from the pipeline and translated to .txt. Is that correct?
I usually go
.uc --> .txt --> .biom
Here, the .uc file is created by searching all my reads against my rep_set.fna (so again, the rep_set.fna is the start of this process, not the end). 

So after you have removed OTUs from your .biom table, is there a reason you want to _also_ remove them from your rep_set.fna file? Once I get a .biom table, I don't really worry about what is in my rep_set.fna anymore. 
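That said, if you do want the representative sequences trimmed to match the filtered table, one possible route is to convert the filtered table back to TSV, pull out the surviving OTU IDs, and subset the FASTA with QIIME's filter_fasta.py. This is only a sketch, assuming biom-format 2.x, hypothetical file names, and that the size annotations have already been stripped so the OTU IDs match the FASTA headers:

biom convert -i otu_table_filtered.biom -o otu_table_filtered.tsv --to-tsv   # filtered table back to classic TSV
tail -n +3 otu_table_filtered.tsv | cut -f 1 > kept_otu_ids.txt              # skip the two header lines, keep the OTU ID column
filter_fasta.py -f otus_checked.fna -s kept_otu_ids.txt -o rep_set_filtered.fna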

Thanks for working with me here,
Colin

Neus

unread,
Oct 10, 2016, 7:22:06 AM10/10/16
to Qiime 1 Forum
Dear Sanjeev and Colin,

First of all, I'm sorry for the delay on my response.

Sanjeev, as you pointed out, I think the problem is the otu map I'm using for picking the representative sequences from the OTU table.

Colin, the reason for my approach is that I'd like to work with the representative-sequences file, but the filtered one: without singletons and also without low-abundance OTUs. So,

What script are you using to filter the .biom table?
I have used filter_otus_from_otu_table.py -i (.biom table from .uc) -o (.biom table filtered) --min_count_fraction 0.00005

How are you running 'pick_rep_set.py'? At no point in my pipeline do I use this script, because I create my .biom table based on the rep_set.fna from OTU picking, not the other way around.

I have used pick_rep_set.py -i ( my .uc converted to .txt) -f (the nonchimeras.fasta file from the -uchime_denovo script) -o

As I mentioned before, I think the problem is this file:
my .uc converted to .txt, which is not the OTU map that Qiime needs.

My question was whether there is a way to get the representative sequences from the filtered .biom table, as I do not know how to discard the low-abundance OTUs during the vsearch pipeline.

I hope I have explained that clearly.

I really appreciate your help!

Sincerely,

Neus

Colin Brislawn

unread,
Oct 11, 2016, 12:38:08 PM10/11/16
to Qiime 1 Forum
Hello Neus,

"is a way to get the representative sequences from the .biom table filtered, as I do not know how to discard the low abundance OTUs during the vsearch pipeline."

I'm not really sure how to do this... I've never needed this for my analysis.

Colin

Neus

unread,
Oct 13, 2016, 9:28:48 AM10/13/16
to Qiime 1 Forum
No worries Colin, I'll find another way ;)

Thank you very much for your help and time!!

Neus

Colin Brislawn

unread,
Oct 13, 2016, 11:07:29 AM10/13/16
to Qiime 1 Forum
Hello Neus,

OK. Let us know what works for you. Maybe this script will help:

Colin

Neus

unread,
Oct 14, 2016, 8:56:55 AM10/14/16
to Qiime 1 Forum
That's what I was looking for!

Thank you very much, Colin!

Have a great weekend,

Neus

Rachael Joakim

unread,
Nov 8, 2016, 7:29:34 PM11/8/16
to Qiime 1 Forum
Hello!

This forum has been very useful as a first-timer. I have a dataset of 50 swabs that was sequenced on an Illumina HiSeq. As it is my first run-through, I have been working through this tutorial, recommended to me by the Gilbert Lab: https://www.gitbook.com/book/twbattaglia/introduction-to-qiime/details.

Once I reached the chimera-removal step, however, I began to run into issues. With the default method, ChimeraSlayer, I worked through issues with the reference database (downloading the newest Greengenes database as a result) and then hit a general bug in the QIIME script (confirmed by an outside computer programmer). I tried re-downloading ChimeraSlayer and adding it to my PATH:

identify_chimeric_seqs.py -i pick_open_otus/pynast_aligned_seqs/rep_set_aligned.fasta -a pick_open_otus/new_refseqs.fna -m ChimeraSlayer

Traceback (most recent call last):
  File "/macqiime/anaconda/bin/identify_chimeric_seqs.py", line 354, in <module>
    main()
  File "/macqiime/anaconda/bin/identify_chimeric_seqs.py", line 328, in main
    keep_intermediates=keep_intermediates)
  File "/macqiime/anaconda/lib/python2.7/site-packages/qiime/identify_chimeric_seqs.py", line 159, in chimeraSlayer_identify_chimeras
    keep_intermediates=keep_intermediates):
  File "/macqiime/anaconda/lib/python2.7/site-packages/qiime/identify_chimeric_seqs.py", line 143, in __call__
    keep_intermediates=keep_intermediates)
  File "/macqiime/anaconda/lib/python2.7/site-packages/qiime/identify_chimeric_seqs.py", line 637, in get_chimeras_from_Nast_aligned
    app_results = app()
  File "/macqiime/anaconda/lib/python2.7/site-packages/burrito/util.py", line 295, in __call__
    result_paths = self._get_result_paths(data)
  File "/macqiime/anaconda/lib/python2.7/site-packages/qiime/identify_chimeric_seqs.py", line 419, in _get_result_paths
    raise ApplicationError("Calling ChimeraSlayer failed.")
burrito.util.ApplicationError: Calling ChimeraSlayer failed.

I tried usearch but ran into the common memory problem. I successfully ran vsearch outside of QIIME, but had issues downstream stemming (I believe) from the formatting of the output file:

vsearch --uchime_ref ../refdbs/greengenes/gg_13_5.fasta -db OTUs.fna --nonchimeras nonchimeras.fasta

I used nonchimeras.fasta as my input file for the next QIIME step, OTU clustering. I want to cluster at 99%, as I believe this is a more accurate representation of microbial lineages (another story, of course), so I used the de novo strategy:

pick_de_novo_otus.py -p ../../../../parameters/99_de_novo_params.txt -i nonchimeras.fasta -o 99_de_novo_nonchimeras

When I summarize the biom OTU table, I realize my sample names have disappeared, with only OTUs listed. I know this because I ran the entire default QIIME pipeline without the chimera-removal step to compare outputs (using open-reference OTU picking), and when I summarized that biom table it had each sample listed with the number of OTUs found in each. I was then able to filter the lower-read samples out of the OTU table for further analysis. When I tried the same command on the vsearch/de novo output, I received a warning that all samples had been filtered out, resulting in an empty table.

Is there a way to remove chimeras externally, in a format that can be re-inserted into the QIIME pipeline? Am I missing something else? If there is another forum or tutorial of relevance I will be happy to dive into that, as well.

Thank you!

Rachael Joakim


Lowry76

unread,
Nov 23, 2016, 5:28:56 PM11/23/16
to qiime...@googlegroups.com
Hi Colin et al,

First of all, thanks for the great help. I'm new to amplicon sequence analysis and am trying to use QIIME/vsearch and other tools for the analysis of functional genes.
I have a few questions regarding the workflow.

As explained to Sanjeev, you recommend:
1. de-replicate your sequences.fas
2. cluster at 97% (output = clustered.fas)
3. remove chimeras from clustered.fas (output = checked_clustered.fas)
4. map the complete sequences.fas, i.e. the non-dereplicated, non-chimera-checked fasta, against checked_clustered.fas

Questions:

1. Why do I have to re-map in step 4?
    If I understand correctly, step 2 creates a rep set that is chimera checked in step 3, and abundance is added in step 4.
    But isn't the abundance already included via the --sizein flag in the clustering? Why do I have to re-map to the rep set if it's already done?

2. Do I have to use the same identity cut-off for steps 2 and 4?
    For example, in my case of using functional genes, the percent identity for OTUs would be significantly lower.
    But if I use 97% in step 2 and something like 90% in step 4, then step 2 would basically act as noise reduction.
    Also, how does vsearch behave if I try to map reads with lower identity to the 97% rep set, e.g. in the case that a read is 95% similar to more than one OTU?


Would be great to get your thoughts on this.
Thanks a lot,
Lowry

Colin Brislawn

unread,
Nov 23, 2016, 10:18:16 PM11/23/16
to Qiime 1 Forum
Hello Lowry,

Interesting questions. I should start by mentioning that what I'm doing here is based on this pipeline, except that I'm using vsearch (instead of usearch). 

1) Why re-map in step 4?
During step 1, each read in your input seqs.fna file is labeled with its sampleID. After this step, the reads from different samples are represented by a single read with a size annotation. "But isn't the abundance already included with the --sizein flag in the clustering?" Yes, each read is annotated with its abundance in the full data set, but we still need to know the abundance of that read in each individual sample. To get reads per sample, we need to go back to before dereplication, when reads from different samples were added together. (QIIME tracks which reads are in which samples using a different method.)
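To illustrate with hypothetical headers (QIIME-style demultiplexed labels): before dereplication the reads look like

>SampleA_0
>SampleB_17
>SampleA_42

while after dereplication those three identical reads collapse into a single record such as

>uniq_3;size=3;

so the total abundance is kept, but the per-sample origin is gone — which is why the mapping step is needed to rebuild per-sample counts.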

2) Do clustering and remapping have to use the same threshold?
Very astute! These two steps can use different thresholds.
"But if I use 97% in step 2 and something like 90% in step 4 then step 2 would basically act as noise reduction." (emphasis mine)
I'm not sure how strong this effect would be...
When using the same threshold, the goal of remapping is just to get the count of each OTU in each sample. When relaxing the threshold while remapping, more reads would be able to match something, so you would keep more reads, but most would simply map to the same OTU as they would have at 97%.

Also, how does vsearch behave if I try to map reads with lower identity to the 97% rep set, e.g., in case that a read is 95% similar to more than one OTU?
It will map the read to the single OTU it's most similar to. (vsearch can also be used for searching, in which case some or all OTUs above 95% would be reported. Taxonomy annotation in QIIME uses a method like this, in which the top three matches over 90% are reported.)

Great questions, Lowry. Keep in touch,
Colin


Lowry76

unread,
Nov 24, 2016, 1:05:17 AM11/24/16
to Qiime 1 Forum
Hi Colin,

Thanks a lot for the answer.
Q1: I totally overlooked the per-sample point.
Q2: When I tried to explain the problem again for this reply, I realized what the problem was (my thinking), so that one is answered too :)

Thanks a lot,
Lowry

Colin Brislawn

unread,
Nov 25, 2016, 2:47:49 AM11/25/16
to Qiime 1 Forum
Hey Lowry,

When I tried to explain the problem again for this answer
That has happened to me so many times! I'm glad this discussion let us process the problem and discover the answer. 

Keep in touch,
Colin
 

Sanjeev Sariya

unread,
Jul 12, 2017, 2:54:13 PM7/12/17
to Qiime 1 Forum
Hi Colin,

How are you doing? I need some help with doubts that came up on a thread on the VSEARCH forum about the pipeline we had discussed earlier in this thread.


1- dereplicate

vsearch --derep_full full_tagclean.fasta  --output full_derep.fasta --log=vsearch_log --sizeout --minuniquesize 2 2>derep_log.txt


2- Cluster 

vsearch -cluster_fast full_derep.fasta -id 0.97 --sizein --sizeout --relabel OTU_  --centroids otus.fna 2> cluster_log.txt


3- Chimera check

vsearch --uchime_denovo otus.fna --nonchimeras otus_checked.fna --sizein --chimeras chimeras.fasta --borderline border_line.fasta --xsize 2> chimera_log.txt


4- What does this step aim for?

vsearch -usearch_global full_tagclean.fasta -db otus_checked.fna -strand plus -id 0.97 -uc otu_table_mapping.uc --xsize 2> usearch_global_log.txt


5- Get OTU table:


python drive5/mod_uc2otutab.py otu_table_mapping.uc > tabfile.tsv



Why am I doing the 4th step again?

Looking forward to hearing from you.
--Sanjeev

Colin Brislawn

unread,
Jul 12, 2017, 5:46:46 PM7/12/17
to Qiime 1 Forum
Hello Sanjeev,

Step 4 is sometimes called 'mapping reads to OTUs'. See here: 

Colin

Sanjeev Sariya

unread,
Jul 12, 2017, 7:13:44 PM7/12/17
to Qiime 1 Forum
Hello Colin,

Thanks for your reply; I understood the same. Appreciate it.

Do you think you could reply on that thread on the VSEARCH forum and clear the clouds? I wouldn't be able to explain why his (Felipe's, last reply in the thread) pipeline would fail. :)

Best,
Sanjeev

Francesco Paroni Sterbini

unread,
Jul 14, 2017, 4:19:35 AM7/14/17
to Qiime 1 Forum
Hi Colin, hi Sanjeev,

I've successfully used the pipeline you suggested, combining vsearch and QIIME.

Everything went perfectly until I tried beta_diversity_through_plots.py in QIIME.

I had this error message:

ngs@paroni-Z9PE-D8-WS:/media/sdb1/Francesco/Ratti_illumina/join2$ beta_diversity_through_plots.py -i taxonomy_otu_table.biom  -o bdiv_even146/ -t rep_tre.tre -m mapping_file -e 146

Traceback (most recent call last):
  File "/home/ngs/miniconda3/envs/qiime1/bin/beta_diversity_through_plots.py", line 4, in <module>
    __import__('pkg_resources').run_script('qiime==1.9.1', 'beta_diversity_through_plots.py')
  File "/home/ngs/miniconda3/envs/qiime1/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/__init__.py", line 744, in run_script
   
  File "/home/ngs/miniconda3/envs/qiime1/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/__init__.py", line 1499, in run_script
   
  File "/home/ngs/miniconda3/envs/qiime1/lib/python2.7/site-packages/qiime-1.9.1-py2.7.egg-info/scripts/beta_diversity_through_plots.py", line 153, in <module>
    main()
  File "/home/ngs/miniconda3/envs/qiime1/lib/python2.7/site-packages/qiime-1.9.1-py2.7.egg-info/scripts/beta_diversity_through_plots.py", line 150, in main
    status_update_callback=status_update_callback)
  File "/home/ngs/miniconda3/envs/qiime1/lib/python2.7/site-packages/qiime/workflow/downstream.py", line 183, in run_beta_diversity_through_plots
    close_logger_on_success=close_logger_on_success)
  File "/home/ngs/miniconda3/envs/qiime1/lib/python2.7/site-packages/qiime/workflow/util.py", line 122, in call_commands_serially
    raise WorkflowError(msg)
qiime.workflow.util.WorkflowError:

*** ERROR RAISED DURING STEP: Beta Diversity (weighted_unifrac)
Command run was:
 beta_diversity.py -i bdiv_even146//taxonomy_otu_table_even146.biom -o bdiv_even146/ --metrics weighted_unifrac  -t rep_tre.tre
Command returned exit status: 1
Stdout:

Stderr

Traceback (most recent call last):
  File "/home/ngs/miniconda3/envs/qiime1/bin/beta_diversity.py", line 4, in <module>
    __import__('pkg_resources').run_script('qiime==1.9.1', 'beta_diversity.py')
  File "/home/ngs/miniconda3/envs/qiime1/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/__init__.py", line 744, in run_script
   
  File "/home/ngs/miniconda3/envs/qiime1/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/__init__.py", line 1499, in run_script
   
  File "/home/ngs/miniconda3/envs/qiime1/lib/python2.7/site-packages/qiime-1.9.1-py2.7.egg-info/scripts/beta_diversity.py", line 152, in <module>
    main()
  File "/home/ngs/miniconda3/envs/qiime1/lib/python2.7/site-packages/qiime-1.9.1-py2.7.egg-info/scripts/beta_diversity.py", line 145, in main
    opts.output_dir, opts.rows, full_tree=opts.full_tree)
  File "/home/ngs/miniconda3/envs/qiime1/lib/python2.7/site-packages/qiime/beta_diversity.py", line 180, in single_file_beta
    make_subtree=(not full_tree))
  File "/home/ngs/miniconda3/envs/qiime1/lib/python2.7/site-packages/qiime/beta_metrics.py", line 44, in result
    is_symmetric=is_symmetric, modes=["distance_matrix"], **kwargs)
  File "/home/ngs/miniconda3/envs/qiime1/lib/python2.7/site-packages/cogent/maths/unifrac/fast_unifrac.py", line 466, in fast_unifrac
    envs, count_array, unique_envs, env_to_index, node_to_index, env_names, branch_lengths, nodes, t = _fast_unifrac_setup(t, envs, make_subtree)
  File "/home/ngs/miniconda3/envs/qiime1/lib/python2.7/site-packages/cogent/maths/unifrac/fast_unifrac.py", line 194, in _fast_unifrac_setup
    raise ValueError, "No valid samples/environments found. Check whether tree tips match otus/taxa present in samples/environments"
ValueError: No valid samples/environments found. Check whether tree tips match otus/taxa present in samples/environments

I tried to validate the otus.fna file; I just tried with 3 samples:

validate_demultiplexed_fasta.py -i otus.fna  -m mapping_file -o validate_demultiplexed -t rep_tre.tre -s

This is the log:

# fasta file otus.fna validation report
Percent duplicate labels: 0.000
Percent QIIME-incompatible fasta labels: 0.000
Percent of labels that fail to map to SampleIDs: 0.000
Percent of sequences with invalid characters: 0.000
Percent of sequences with barcodes detected: 0.028
Percent of sequences with barcodes detected at the beginning of the sequence: 0.000
Percent of sequences with primers detected: 1.000
Fasta label subset in tree tips report
The following labels were not in tree tips:
63914F60056
63915F60057
63916F60058

So I guess that something went wrong labelling my samples...

Could you help me in some way?

Thank you very much

Francesco

P.S. Since this is a one-year-old post, has something changed in the meantime?
I've also followed this thread.

Sanjeev Sariya

unread,
Jul 14, 2017, 7:27:54 AM7/14/17
to Qiime 1 Forum
Hi Francesco,

Sorry, I've not worked with beta-diversity plots after finishing this pipeline. I'd suggest you create a new thread with an appropriate title to gain attention from other members. :)
Good luck.

Francesco Paroni Sterbini

unread,
Jul 14, 2017, 7:48:20 AM7/14/17
to Qiime 1 Forum
Hi Sanjeev,
thanks for the answer!

I wrote here because I guess that the problem is not in the beta_diversity command but somewhere in naming OTUs...

I'm not sure I've assigned sampleIDs in the correct way and format, and I'm trying to figure out whether --relabel OTU_ during clustering is correct for my data...

The problem is that the labels aren't among the tree tips in rep_tre.tre...

Thanks again

Francesco

Yoshiki Vázquez Baeza

unread,
Jul 17, 2017, 1:02:09 PM7/17/17
to Qiime 1 Forum
Dear Francesco,

Please open a new question in the forum and we can follow-up there.

The problem is that the OTU IDs in your tree do not match the IDs in your table; without knowing more about your processing steps I can't quite tell you what may have gone wrong.

Thanks!

Yoshiki.