blast or bowtie2? which one should I use?

Niki

Apr 24, 2012, 12:43:10 AM

to metaphl...@googlegroups.com

Thanks for this nice piece of software. I have a quick question here. If I prefer accuracy over time-efficiency, should I use blast or bowtie2? Thanks!

Nicola Segata

Apr 24, 2012, 9:01:20 AM

to metaphl...@googlegroups.com

Hi Niki,

thanks a lot for your question.

Short answer: if computational performance is the bottleneck of your analysis, you should use Bowtie2. If you want to maximize accuracy, it doesn't really matter, because blastn and Bowtie2 provide very similar predictions.

Here is a longer answer for the case where computational performance is not the main factor in your analysis. Based on our ten synthetic datasets, Bowtie2 and blastn perform almost equally in terms of accuracy. Sometimes blastn is slightly better than Bowtie2, and other times the opposite is true. Bowtie2 with '--bt2_ps very-sensitive-local' or '--bt2_ps sensitive-local' seems to be on average a little more accurate than blastn, but it produces somewhat more false positives. On the other hand, Bowtie2 with '--bt2_ps very-sensitive' (the default) or '--bt2_ps sensitive' is a bit less accurate than blastn but has fewer false positives.

The false positive / false negative trade-off can also be tuned with the '--stat_q' option, which works for both blastn and Bowtie2. We suggest setting it higher than 0.1 (but lower than 0.33) if you want to avoid false positives as much as possible at the price of some false negatives.

So far we have profiled 1,000 real metagenomes using blastn (because the Bowtie2 option had not been added yet) and just a few metagenomes with Bowtie2. Until a more comprehensive analysis is performed, it may be a bit safer to use blastn instead of Bowtie2 if accuracy is much more important than computational efficiency.

I added a FAQ section with this question/answer (hopefully other Q/A will be added soon):

https://bitbucket.org/nsegata/metaphlan/wiki/FAQ

Let me know if you have any comment or question!

thanks

Nicola

Niki

Apr 24, 2012, 11:32:12 AM

to metaphl...@googlegroups.com

Thank you very much Nicola, I appreciate your help. Your answer makes great sense.

In my test run comparing bt2 and blastn with default settings on one dataset, the bt2 result (82 species) contains more OTUs than the blastn result (60 species), though the 22 extra species are at very low abundance.

Per your answer, I think I will stick with blastn for now and bump up the --stat_q value. I plan to test 0.1 to 0.3 (step size 0.05) and see how much difference that makes. Let me know if there is a better way to choose the --stat_q value.
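
A minimal Python sketch of the sweep described above. It only builds the command lines and does not run anything. The '--stat_q' flag comes from this thread; the script name `metaphlan.py`, the input file name `sample.blastn.txt`, and the absence of any other required arguments are assumptions made for illustration, so check them against your own installation.

```python
def stat_q_sweep(start=0.10, stop=0.30, step=0.05):
    """Return the --stat_q values from start to stop, both inclusive."""
    n_steps = int(round((stop - start) / step))
    # Round each value to hide float noise such as 0.30000000000000004.
    return [round(start + i * step, 2) for i in range(n_steps + 1)]


def build_commands(input_path, values):
    """Build one command (as an argument list) per --stat_q value.

    ASSUMPTION: 'metaphlan.py' and the positional input path are
    placeholders; only the --stat_q flag is taken from the thread.
    """
    return [["metaphlan.py", input_path, "--stat_q", str(q)] for q in values]


if __name__ == "__main__":
    # Hypothetical input file name, used only to show the shape of the sweep.
    for cmd in build_commands("sample.blastn.txt", stat_q_sweep()):
        print(" ".join(cmd))
```

Keeping the commands as argument lists makes it straightforward to pass them to `subprocess.run` later, once the real script name and arguments are filled in.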

Thanks again!

Nicola Segata

Apr 24, 2012, 11:50:52 AM

to metaphl...@googlegroups.com

Hi Niki,

To set the --stat_q value, you can run MetaPhlAn with '-t clade_profiles' instead of the default '-t rel_ab'. This will generate a "coverage profile" for each marker of each clade. If several clades of interest (i.e. those appearing in the standard results) have markers with more than 10% zeros, you should use a --stat_q value higher than 0.1 (which represents 10%). However, if a clade has many zeros and only a few non-zero values, it's more likely that the non-zero values are false positives, and there is no need to increase --stat_q.

Because of taxonomic levels with only one descendant and other properties of the markers, this procedure is not rigorous for all taxonomic clades, but eyeballing some clades should be enough to suggest whether --stat_q needs to be increased.
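
The rule of thumb above can be sketched in Python as a rough aid to the eyeballing, not a rigorous test. The sketch works on per-marker coverage values already loaded into a dictionary; parsing the '-t clade_profiles' output is left out because its file format is not described in this thread. The 10% cutoff comes from the post. The "mostly zeros" cutoff of 50% and the reading of "several clades" as at least two are assumptions, and all function and parameter names are hypothetical.

```python
def zero_fraction(coverages):
    """Fraction of a clade's markers that have zero coverage."""
    if not coverages:
        raise ValueError("clade has no markers")
    return sum(1 for c in coverages if c == 0) / len(coverages)


def assess_stat_q(profiles, zero_cutoff=0.10, mostly_zero_cutoff=0.50,
                  min_clades=2):
    """Apply the rule of thumb to {clade name: [marker coverage, ...]}.

    ASSUMPTIONS: mostly_zero_cutoff and min_clades are illustrative
    choices; only the 10% zero_cutoff is taken from the thread.
    """
    supporting = []       # clades with more than 10% zero markers
    likely_false = []     # clades that are mostly zeros
    for clade, coverages in profiles.items():
        frac = zero_fraction(coverages)
        if frac > mostly_zero_cutoff:
            # Few non-zero markers: more likely a false positive clade,
            # so it does not argue for a higher --stat_q.
            likely_false.append(clade)
        elif frac > zero_cutoff:
            supporting.append(clade)
    return {
        "increase_stat_q": len(supporting) >= min_clades,
        "supporting_clades": supporting,
        "likely_false_positives": likely_false,
    }


if __name__ == "__main__":
    # Made-up coverage values, for illustration only.
    example = {
        "clade_A": [3, 0, 2, 4, 5, 1, 2, 3, 0, 1],
        "clade_B": [1, 0, 0, 2, 3, 1, 1, 1, 1, 1],
        "clade_C": [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
        "clade_D": [1, 2, 3, 4],
    }
    print(assess_stat_q(example))
```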