What is the cutoff of Blast for best hits?

Ning

unread,

Mar 21, 2013, 2:56:50 PM3/21/13

to humann...@googlegroups.com

Hi,

I am running HUMAnN on my data. I set up the default for blast which generated a large amount of blast output. I read HUMAnN paper which mentioned to choose Evalue < 1 and up to 20 most significant hits for further processing. As I understand, here the evalue<1 and up to 20 most significant hits were from blast results, and used for HUMAnN input.

My questions are:
1. Is there any suggested cutoff for blast output? Shall I filter the bad hits before I run HUMAnN, or can I reset up the cutoff in HUMAnN? If I can set up the cutoff for input when I run HUMAnN, how can I do this? I read the readme file, and paper, I couldn't find any information about it.
2. It is about taxonomy. Shall I make the filter before I run HUMAnN or can I set up the filter of taxonomy in HUMAnN?

Thanks a lot.

Best
Ning

Katrine Whiteson

unread,

Mar 22, 2013, 12:35:48 AM3/22/13

to humann...@googlegroups.com

Hi Ning,

My understanding is that you don't need to apply a BLAST cutoff, because the humann pipeline will prioritize the best hits. When they compared the putput using a more stringent blast cutoff vs having no blast cutoff at all, this did not make a difference. I am pasting an email exchange I had with Nicola Segata that may be relevant.

About the taxonomy, if you are interested in the functions of a particular microbe, I think you may need to filter that before running humann, or find another way, as my understanding is that all hits to a particular pathway or module are clustered together regardless of taxonomy.

One strategy for examining the phylogeny of genes for a particular function is to take all of the hits from your dataset to a particular function and then look at their taxonomy by BLASTing against a local nt database and then retrieving the function with the fastacmd command.

:)
Katrine

On Sat, Feb 2, 2013 at 3:49 AM, Katrine Whiteson

Dear Nicola,

I am a post-doc in Forest Rohwer's lab using metagenomic sequences and other
methods to characterize microbial communities in CF patients.

I have a couple questions about using HUMANN, which I am really enjoying
using:

1.) Do you recommend using a Blast cutoff? I notice that in the SOP there is
no mention of an e-value cutoff, and in this paper (below), you use 1. Is
the importance of very poor hits significantly reduced in the weighting
method described in the Plos Comp Bio paper?

Yes, the importance of very poor hits is extremely low. In general, we
saw that using a cut-off of 1e-3 or 1e-5 is not going to change the
results. In particular, we use a cut-off of 1e-5 when we use usearch
instead of blastx because usearch stops the mapping when good hits are
found, and it is thus important that "good hits" are defined with a
more stringent cut-off.

2.) You also mention that you use the top 20 blast hits, and in the Plos
Comp Bio paper mention that genes will be duplicated and counted more than
once if they fit in more than one ko. Do you think the relative abundance
reported in the 04b module file

I'm missing the last part of the question here, but I think I get the
sense. The pathway and module relative abundance is not just the
average of the abundance and it is thus robust to genes that have been
accounted for more than once. This is especially true because only a
small portion of the genes in a module/pathway can possibly be present
in other modules/pathways.

3.) Are samples with very different different sequence coverage accounted
for? (Can I compare samples with 20,000 reads to samples with 100,000
reads?) What is the minimum number of reads per sample that you would
recommend?

HUMAnN does not directly account for coverage. It's hard to say
whether a minimum number of reads is required, but I think the most
important aspect to consider is the consistency of the sequencing
depth within the same study. A 5x coverage range (from 20,000 to
100,000) is not that huge but it is probably not negligible either. I
would suggest to compare the samples in your study even when their
coverage is quite different; however, you should verify "post-hoc"
that your findings are not correlating with the depth of the samples.
This is the approach I'm using for all metagenomic studies involving a
relative abundance of some sort (bugs, functions, pathways).

If you find that your results are biased because of the depth you can
of course down-sample the metagenomes to the size of the metagenome
with the smallest coverage. This is far from optimal from the costs
viewpoint but sometimes it may be necessary.

Let me also add that the pathway/module coverage is somehow more
robust to the variable sequencing depth than pathway/module abundance.

I would like to be sure I know how to use my results here. We find very
enriched amino acid transport, and I am concerned that poor blast hit score
could be biasing our results.

Keep in mind that any "absolute" results is going to be biased toward
what the functional database (in this case KEGG) contains and how the
information in the database is categorized. And also towards the
gene/families/pathways contained in model organisms that are clearly
more well characterized. This means that enrichments should always be
relative between 2 or more groups (classes) rather than an absolute
quantification. In other words, amino acid transportes may be over
represented in KEGG...

I hope this helps
many thanks
Nicola

Ning

unread,

Mar 22, 2013, 10:23:21 AM3/22/13

to humann...@googlegroups.com

Thanks Katrine. It is quite helpful. I will try it see how it goes.

Best
Ning

Zhuofei

unread,

Jul 16, 2013, 9:44:33 AM7/16/13

to humann...@googlegroups.com

Hello Ning,

I had the same question about this. I'm wondering whether I should add a e_value cutoff of Blast or not. Do you have any suggestions?

Best regards,
Zhuofei

Ning

unread,

Jul 16, 2013, 12:10:32 PM7/16/13

to humann...@googlegroups.com

Hi Zhoufei,

Normally, I set the blast e_value cutoff 1E-4 or 1E-6 to reduce the size of blast output. Because the default e_value is 1, it is too high. With the default value, blast output will keep a lot of low identity alignment, which are not desired. Other than this, if you use HUMAnN, as Katrine mentioned in her post, 'the humann pipeline will prioritize the best hits', which HUMAnN has taken care of it. You don't have to worry about it too much. If you will do other analysis, it depends.

Hope this is helpful.

Ning

Yu, Jovian

unread,

Jul 16, 2013, 12:12:04 PM7/16/13

to humann...@googlegroups.com

Hi all,

Echoing what Ning has said here - we've used 1e6 as a typical cutoff for HUMAnN!

-Jovian

Zhuofei Xu

unread,

Jul 17, 2013, 1:37:22 AM7/17/13

to humann...@googlegroups.com

Hi Ning,

Thanks very much for your suggestion! Very helpful!

Also thank Jovian. I will use 1e6 as cutoff.

Best regards,

Zhuofei

--
Zhuofei Xu
Postdoc
Section of Microbiology
Department of Biology
University of Copenhagen

E-mail: zhuof...@gmail.com

Reply all

Reply to author

Forward