Pathabundance/pathcoverage question


O. AlZahal

Nov 11, 2015, 2:55:08 PM
to HUMAnN Users
Hi folks, I have questions regarding pathcoverage. 
* I wonder if pathway coverage can be used as a tool to filter pathway abundance items before conducting the statistical analysis on pathabundance. Does it make sense to do statistical analysis on pathways that are not covered in the first place (i.e. not above a given coverage threshold, e.g. 0.7)?
* Even if a pathway (as a total) is covered, does this coverage need to be within a given bug, so that I can infer a functional change due to this bug?
* Another question: I used the legacy database in my analysis and got 260 (unstratified) pathways in my pathabundance file, but only 230 in my pathcoverage file. I am not sure where that difference comes from; I am missing coverage data for 30 pathways.

Many thanks

OA



Eric Franzosa

Nov 12, 2015, 12:46:41 PM
to humann...@googlegroups.com
* It's common practice to perform feature selection in meta'omics using some measure of prevalence (% of samples in which the feature "appeared"). You could define "appeared" as "abundance > cutoff" (which is fairly common), but your idea of defining "appeared" as "coverage > cutoff" would also work. Since downstream testing ends up happening on the abundance values, it's often more convenient to also do filtering based on abundance.

* I would not say that you _have_ to assign functional changes to a bug, it's more of a "bonus" in the cases where you can. :-) There will be times when HUMAnN2 detects a pathway purely during the translated search (equivalent to how HUMAnN1 does _all_ pathway quantification), and in that case taxonomic assignment is "unclassified." I can also imagine cases where a pathway was consistently different between conditions, but no single bug showed a consistent difference (even though the pathway abundance was always 100% attributed to known species).

* I looked into the abundance/coverage discordance - this arises because we don't currently output pathways with 0 coverage to the coverage file. The pathways in the coverage file should end up being a subset of those in the abundance file, and the difference is the set of pathways that had 0 computed coverage. These tend to be very low abundance pathways. We'll look into making this more clear, but in the meantime you can always substitute 0s for the missing values.
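
A minimal Python sketch of two of the ideas above: substituting 0 for pathways that are missing from the coverage file, and filtering features by prevalence. The function names, the dictionary layout, and the toy values are all assumptions made for illustration; they are not part of HUMAnN2 and are not taken from a real run.

```python
def fill_missing_coverage(abundance, coverage):
    """Return a coverage value for every pathway in `abundance`.

    Pathways absent from `coverage` get 0.0, following the advice that
    missing entries correspond to 0 computed coverage.
    """
    return {pathway: coverage.get(pathway, 0.0) for pathway in abundance}


def prevalence_filter(table, cutoff, min_prevalence):
    """Keep features that "appeared" in enough samples.

    `table` maps a feature name to a list of per-sample values.
    A feature "appeared" in a sample if its value is > `cutoff`.
    It is kept if the fraction of such samples is >= `min_prevalence`.
    """
    kept = {}
    for feature, values in table.items():
        appeared = sum(1 for v in values if v > cutoff)
        if values and appeared / len(values) >= min_prevalence:
            kept[feature] = values
    return kept


# Hypothetical single-sample values (pathway names are made up).
abundance = {"PWY-A": 12.5, "PWY-B": 0.3, "PWY-C": 4.1}
coverage = {"PWY-A": 0.9, "PWY-C": 0.6}  # PWY-B is absent from the coverage file
full_coverage = fill_missing_coverage(abundance, coverage)

# Hypothetical abundances across four samples.
samples = {
    "PWY-A": [12.5, 10.0, 0.0, 8.2],
    "PWY-B": [0.3, 0.0, 0.0, 0.0],
    "PWY-C": [4.1, 3.9, 5.0, 0.0],
}
kept = prevalence_filter(samples, cutoff=1.0, min_prevalence=0.5)
```

The same `prevalence_filter` can be applied to a table of coverage values instead of abundances if "appeared" is defined as "coverage > cutoff"; the cutoffs used here are placeholders, not recommendations.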

Thanks,
Eric


Ousama AlZahal

Nov 12, 2015, 9:50:11 PM
to humann...@googlegroups.com
Many thanks Eric. I just have a few follow-up questions.

* As you indicated, pathways with low coverage (0 in my example) “tend to be very low abundance pathways”. However, how should a low-coverage, high-abundance pathway be interpreted? I am also faced with a scenario where some pathways are extremely significant and abundant (top 5, 0.04) but have a low coverage of 0.1.
* So far I strongly feel that a coverage filter should be applied. I am using 0.6; would that be reasonable? This leaves me with about 40 pathways out of 260.
* What is the minimum number of gene families that I should be able to find in the gene families file for each pathway present in the pathabundance file? The reason I am asking is that I picked a random pathway (it turned out to be very low abundance with 0 coverage), but I was not able to locate any of its Kxxxxx identifiers in the gene families file.

many thanks
OA

Eric Franzosa

Nov 16, 2015, 1:53:27 PM
to humann...@googlegroups.com
Let me provide some background on the pathway coverage measurement:

Pathway coverage is intended as a measure of our confidence that a pathway is truly present in a community. This is based on the idea that some reactions recruit a small number of reads purely by chance. It's possible for all reactions in a pathway to be in this group, thus giving the pathway a non-zero abundance, although the pathway itself might not really be encoded by organisms in the community. In developing HUMAnN1, we discovered that if a reaction was in the top 50% of reactions by abundance (i.e. abundance above the median abundance), it was likely to be truly present. The pathway coverage measurement is based on comparing the pathway's individual reaction abundances to the median reaction abundance: if the pathway's reaction abundances are consistently above the median, then the pathway coverage will approach 1; if there is a "weak link" in the pathway, then its coverage will drop toward 0, indicating that its abundance should be interpreted with caution.
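
To make the "weak link" comparison concrete, here is a small Python sketch. It is not HUMAnN's coverage formula and does not compute a coverage score; it only finds the least abundant reaction in a pathway and checks it against the median reaction abundance. The reaction names, abundances, and the `weakest_link` helper are hypothetical.

```python
from statistics import median


def weakest_link(pathway_reactions, reaction_abundance):
    """Return (reaction, abundance) for the least abundant reaction in a pathway.

    Reactions with no recorded abundance are treated as 0.0.
    """
    weakest = min(pathway_reactions, key=lambda r: reaction_abundance.get(r, 0.0))
    return weakest, reaction_abundance.get(weakest, 0.0)


# Hypothetical reaction abundances for one sample.
reaction_abundance = {
    "RXN-1": 40.0,
    "RXN-2": 25.0,
    "RXN-3": 2.0,
    "RXN-4": 10.0,
    "RXN-5": 0.5,
}
community_median = median(reaction_abundance.values())

# A hypothetical pathway made of three of those reactions.
pathway = ["RXN-1", "RXN-2", "RXN-3"]
reaction, value = weakest_link(pathway, reaction_abundance)

# If the weakest reaction falls below the median, the pathway's abundance
# deserves more caution, which is the intuition behind a low coverage value.
above_median = value > community_median
```

In this toy case two of the three reactions are well above the median, but the pathway still has one reaction below it, which is the kind of situation where coverage would be expected to drop.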

On to your specific questions:

* I _am_ surprised that a super-abundant pathway would have such low coverage. Notably, the coverage measure described above was based on KEGG's pathway and reaction definitions. We are in the process of confirming that these approaches are still well-suited for MetaCyc; it's possible that we'll need to fine-tune them to the new database, in which case this sort of feedback is super helpful!

* A coverage of 0.5 indicates that the "weakest link" reaction in a pathway was as abundant as the median-abundance reaction. I'd be fairly comfortable calling such a pathway "present" in a single sample. When you have multiple samples, just seeing that a pathway is consistently detected in a few samples is also a great way to boost your confidence that it's a real signal.

* Do you have KEGG-style KOs in your gene families output, or are they all UniRef identifiers? In the default mode HUMAnN2 does not output KOs to the gene families file, so this would not be a surprising finding.
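
One quick way to answer the last question is to tally the identifier styles in the gene families file. The sketch below assumes that KOs look like `K` followed by five digits, that UniRef identifiers start with `UniRef50_`, `UniRef90_`, or `UniRef100_`, and that a row label may carry a `|taxon` stratification suffix or a `: name` annotation. The example rows are invented, and `classify_id` is a hypothetical helper, not a HUMAnN2 function.

```python
import re
from collections import Counter

KO_PATTERN = re.compile(r"^K\d{5}$")
UNIREF_PATTERN = re.compile(r"^UniRef(50|90|100)_")


def classify_id(feature):
    """Classify a gene-family row label as 'KO', 'UniRef', or 'other'."""
    # Drop an optional "|taxon" suffix, then an optional ": name" annotation.
    identifier = feature.split("|", 1)[0].split(":", 1)[0].strip()
    if KO_PATTERN.match(identifier):
        return "KO"
    if UNIREF_PATTERN.match(identifier):
        return "UniRef"
    return "other"


# Invented row labels for illustration only.
rows = [
    "UniRef50_A6L0N6",
    "UniRef50_A6L0N6|g__Bacteroides.s__Bacteroides_vulgatus",
    "K00001",
    "UNMAPPED",
]
counts = Counter(classify_id(row) for row in rows)
```

If the tally shows only UniRef identifiers, searching the file for a pathway's Kxxxxx identifiers will naturally find nothing.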

Thanks,
Eric


Ousama AlZahal

Nov 17, 2015, 10:08:48 AM
to humann...@googlegroups.com
Thank you Eric for your detailed answer. It is clear now. Based on this, I feel comfortable running LEfSe on pathways with abundance above 0.5. Regarding the abundant/low-coverage issue: there are only a few pathways like that. I will send you a spreadsheet (by email) with abundances and coverages side by side. The calculation may or may not make sense to you in the end.

many thanks 

OA