Hi Ning,
My understanding is that you don't need to apply a BLAST cutoff, because the humann pipeline will prioritize the best hits. When they compared the putput using a more stringent blast cutoff vs having no blast cutoff at all, this did not make a difference. I am pasting an email exchange I had with Nicola Segata that may be relevant.
About the taxonomy, if you are interested in the functions of a particular microbe, I think you may need to filter that before running humann, or find another way, as my understanding is that all hits to a particular pathway or module are clustered together regardless of taxonomy.
One strategy for examining the phylogeny of genes for a particular function is to take all of the hits from your dataset to a particular function and then look at their taxonomy by BLASTing against a local nt database and then retrieving the function with the fastacmd command.
:)
Katrine
On Sat, Feb 2, 2013 at 3:49 AM, Katrine Whiteson
Dear Nicola,
I am a post-doc in Forest Rohwer's lab using metagenomic sequences and other
methods to characterize microbial communities in CF patients.
I have a couple questions about using HUMANN, which I am really enjoying
using:
1.) Do you recommend using a Blast cutoff? I notice that in the SOP there is
no mention of an e-value cutoff, and in this paper (below), you use 1. Is
the importance of very poor hits significantly reduced in the weighting
method described in the Plos Comp Bio paper?
Yes, the importance of very poor hits is extremely low. In general, we
saw that using a cut-off of 1e-3 or 1e-5 is not going to change the
results. In particular, we use a cut-off of 1e-5 when we use usearch
instead of blastx because usearch stops the mapping when good hits are
found, and it is thus important that "good hits" are defined with a
more stringent cut-off.
2.) You also mention that you use the top 20 blast hits, and in the Plos
Comp Bio paper mention that genes will be duplicated and counted more than
once if they fit in more than one ko. Do you think the relative abundance
reported in the 04b module file
I'm missing the last part of the question here, but I think I get the
sense. The pathway and module relative abundance is not just the
average of the abundance and it is thus robust to genes that have been
accounted for more than once. This is especially true because only a
small portion of the genes in a module/pathway can possibly be present
in other modules/pathways.
3.) Are samples with very different different sequence coverage accounted
for? (Can I compare samples with 20,000 reads to samples with 100,000
reads?) What is the minimum number of reads per sample that you would
recommend?
HUMAnN does not directly account for coverage. It's hard to say
whether a minimum number of reads is required, but I think the most
important aspect to consider is the consistency of the sequencing
depth within the same study. A 5x coverage range (from 20,000 to
100,000) is not that huge but it is probably not negligible either. I
would suggest to compare the samples in your study even when their
coverage is quite different; however, you should verify "post-hoc"
that your findings are not correlating with the depth of the samples.
This is the approach I'm using for all metagenomic studies involving a
relative abundance of some sort (bugs, functions, pathways).
If you find that your results are biased because of the depth you can
of course down-sample the metagenomes to the size of the metagenome
with the smallest coverage. This is far from optimal from the costs
viewpoint but sometimes it may be necessary.
Let me also add that the pathway/module coverage is somehow more
robust to the variable sequencing depth than pathway/module abundance.
I would like to be sure I know how to use my results here. We find very
enriched amino acid transport, and I am concerned that poor blast hit score
could be biasing our results.
Keep in mind that any "absolute" results is going to be biased toward
what the functional database (in this case KEGG) contains and how the
information in the database is categorized. And also towards the
gene/families/pathways contained in model organisms that are clearly
more well characterized. This means that enrichments should always be
relative between 2 or more groups (classes) rather than an absolute
quantification. In other words, amino acid transportes may be over
represented in KEGG...
I hope this helps
many thanks
Nicola