KEGG alternative

Aaron

Jan 29, 2013, 3:23:03 PM

to picrus...@googlegroups.com

Hi All,

I'd like to use PICRUSt to look at gene/pathway/module differences between treatment groups, but unfortunately, KEGG is closed to me for the present. Are there any plans to utilize additional pathway databases like metacyc or unipathway in picrust? If not, how difficult would it be for me to generate the necessary reference files for another database? Out of curiosity, is there a particular reason that you chose to use KEGG instead of metacyc?

Thanks for making picrust available pre-publication and thanks for all your help.

Best,

Aaron

Jesse Zaneveld

Jan 29, 2013, 4:25:26 PM

to picrus...@googlegroups.com

Hi Aaron,

We are actually working on including tools for annotating the KEGG functional categories based on the raw identifiers. The release containing this functionality should probably be out in the next week or so (we'll send an e-mail to the users list when it goes out).

Thanks for the input about alternative annotation databases. We're interested in input on alternative pathway databases to utilize in the future. We chose KEGG initially due to the readily available mappings between all IMG genomes (which we used to validate our methods) and KEGG gene family copy numbers, and because we had worked with the resource in the past.

You can certainly use alternative annotation schemes; however, it will take more work on your end. You won't be able to use the precalculated KEGG file, and will instead need to generate predictions for the copy number of each gene family in each organism. You can do this using predict_traits.py (make sure to use the -l option to limit predictions to those you need based on your OTU table). To do so, however, you will need a table of sequenced genomes (easiest if you use the IMG id for each genome) x gene families, with each entry giving the copy number of that gene family in that organism. If the genome names map exactly to the reference greengenes tree, then that's all you need. If not, then you'll need to map your genomes into a reference tree.

So your workflow would be:

1) Compile gene copy number x genome annotations from metacyc (or your favorite database) to generate a trait table

2) Check that all organism ids in your trait table exactly match organism ids in the augmented tree

3) If not, add your organisms into the tree (could be difficult/computationally expensive, but we can pass along Daniel McDonald's methodology for doing this with IMG genomes).

4) Run predict_traits.py to generate a genome prediction table describing genes x organisms (including unsequenced organisms)

5) Run picrust predict_metagenomes.py as usual, replacing our precalculated table with your custom one.
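
Steps 1 and 2 above can be sketched in Python. This is only a minimal illustration, not PICRUSt code: the function names, the placeholder genome and gene family ids, and the tab-separated layout are all assumptions, so check the PICRUSt documentation for the exact trait table format that predict_traits.py expects.

```python
from collections import Counter, defaultdict


def build_trait_table(annotations):
    """Count gene copies per (genome, gene family).

    `annotations` is an iterable of (genome_id, gene_family_id) pairs,
    one pair per annotated gene (hypothetical input layout).
    """
    counts = defaultdict(Counter)
    families = set()
    for genome_id, family_id in annotations:
        counts[genome_id][family_id] += 1
        families.add(family_id)
    families = sorted(families)
    # One row per genome, one column per gene family; absent families get 0.
    rows = {g: [c[f] for f in families] for g, c in counts.items()}
    return families, rows


def missing_from_tree(rows, tree_tip_ids):
    """Return genome ids in the trait table that are not tips in the tree."""
    return sorted(set(rows) - set(tree_tip_ids))


def format_table(families, rows):
    """Render the table as tab-separated text with a header line."""
    lines = ["\t".join(["genome_id"] + families)]
    for genome_id in sorted(rows):
        cells = [str(n) for n in rows[genome_id]]
        lines.append("\t".join([genome_id] + cells))
    return "\n".join(lines)


# Placeholder data: real input would come from MetaCyc or another database.
annotations = [
    ("genomeA", "FAM1"), ("genomeA", "FAM1"), ("genomeA", "FAM2"),
    ("genomeB", "FAM2"), ("genomeC", "FAM3"),
]
families, rows = build_trait_table(annotations)
print(format_table(families, rows))

# Genomes listed here would need to be added to the tree (step 3).
print(missing_from_tree(rows, ["genomeA", "genomeB"]))
```

Any genome id reported by the second check would have to be placed into the reference tree before running predict_traits.py.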

So it would involve a fair amount of work, but could certainly be done. An alternative approach would be to look for annotation links between KO ids and entries in your favorite database. These sometimes take some digging to find, but they do exist for most annotation types. Then you could just generate KO groups, but annotate them to another function type (I know KO to COG and some KO to GO annotations are available for example).
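
The KO-to-other-annotation route could look roughly like the sketch below. It assumes you have already parsed a KO mapping file into a dictionary; the function name and the ids (KO_1, TERM_A, and so on) are placeholders, not real KEGG or GO identifiers.

```python
from collections import defaultdict


def regroup_counts(ko_counts, ko_to_terms):
    """Sum KO abundances into the terms each KO maps to.

    A KO mapped to several terms contributes its full count to each of
    them, so regrouped totals can exceed the KO total. KOs without a
    mapping are returned separately rather than silently dropped.
    """
    term_counts = defaultdict(float)
    unmapped = {}
    for ko, count in ko_counts.items():
        terms = ko_to_terms.get(ko)
        if not terms:
            unmapped[ko] = count
            continue
        for term in terms:
            term_counts[term] += count
    return dict(term_counts), unmapped


# Placeholder abundances for one sample and a placeholder KO -> term mapping.
ko_counts = {"KO_1": 10.0, "KO_2": 5.0, "KO_3": 2.0}
ko_to_terms = {"KO_1": ["TERM_A"], "KO_2": ["TERM_A", "TERM_B"]}

term_counts, unmapped = regroup_counts(ko_counts, ko_to_terms)
print(term_counts)
print(unmapped)
```

Counting a multi-mapped KO once per term is just one possible choice; splitting its count evenly across terms is another, and which is appropriate depends on the downstream analysis.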

In any case, I hope that helps. Do let us know which way you decide to go, and any additional information we can fill in to help you out with the analysis.

All the best,
Jesse

Daniel McDonald

Jan 30, 2013, 12:16:18 PM

to picrus...@googlegroups.com

> 3) If not, add your organisms into the tree (could be
> difficult/computationally expensive, but we can pass along Daniel McDonald's
> methodology for doing this with IMG genomes).

Reasonably straightforward: align and build a tree. For 16S we used
SSU-Align and FastTree. I can dig up the parameters used if there is
interest.

Best,
Daniel

Aaron

Feb 14, 2013, 1:02:17 PM

to picrus...@googlegroups.com

Hi Dan and Jesse,

Thanks for the reply. I've been looking into KEGG alternatives, but so far haven't found anything that would be a perfect substitute (in this vein I'm open to any suggestions on databases).

So far I’ve considered MetaCyc, COG, UniPathway, WikiPathways and GO. MetaCyc is a database of experimentally verified pathways, but the genes and pathways are organism-specific. I could try combining the species-specific pathways (i.e. merging genes between species if they are reciprocal best BLAST hits), but that could present problems as I try to merge multiple species. COG (Clusters of Orthologous Groups) provides organism-independent gene classifiers, but the classifiers are only organized into broad functionalities. WikiPathways seems to be an attempt to replicate the data in KEGG and supplement it with additional data, but it seems less informative than KEGG. UniPathway provides levels of pathway organization similar to KEGG, but many (most?) of the UniProt identifiers are not linked to any UniPathway metabolic pathways. Finally, the GO Process database is similar to KEGG, so I could potentially create a ‘GOprocess x CopyNumber’ table for the GO-annotated genomes in NCBI, build an augmented tree, and then look for differences in GO processes between treatments. However, the great majority (~95%) of GO annotations derive from automatic transfer of InterProScan results, so I don’t know how reliable that sort of table would be. Despite this, GO Process seems like the best substitute for KEGG pathways.

I think the easiest path might be to substitute the KO identifiers for GO Molecular Function identifiers via http://www.genome.jp/files/ko2go.xl, (thanks for that suggestion Jesse). However, I don’t know how informative GO enrichment of Molecular Function will be for comparing two bacterial communities (a previous experience using GO left me with the impression that you could probably find a significant hypergeometric enrichment for any random list of genes even after Bonferroni correction!). Have you tried looking at GO enrichment with 16S data in the past?
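
For reference, the kind of test I have in mind is sketched below: a one-sided hypergeometric test per term with a Bonferroni correction over the number of terms tested. This is a rough illustration using only the Python standard library; the function names and the gene and term ids are made up, and it is not the implementation of any particular enrichment tool.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n genes from N, of which K carry the term."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / total


def enrichment(term_to_genes, study, background):
    """Return {term: (p_value, bonferroni_p_value)} for each term."""
    background = set(background)
    study = set(study) & background
    n_tests = len(term_to_genes)
    results = {}
    for term, genes in term_to_genes.items():
        annotated = set(genes) & background
        k = len(annotated & study)
        p = hypergeom_sf(k, len(background), len(annotated), len(study))
        # Bonferroni: multiply by the number of terms tested, cap at 1.
        results[term] = (p, min(1.0, p * n_tests))
    return results


# Made-up example: 10 background genes, 2 terms, a 1-gene study list.
background = [f"g{i}" for i in range(1, 11)]
term_to_genes = {
    "TERM_A": ["g1", "g2", "g3", "g4", "g5"],
    "TERM_B": ["g6", "g7", "g8", "g9", "g10"],
}
results = enrichment(term_to_genes, ["g1"], background)
print(results)
```

The result depends heavily on which genes are used as the background set, which may be part of why spurious enrichments are easy to find.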

Alternatively, I could build a trait table for the IMG genomes (I’m guessing this is the address: ftp://ftp.jgi-psf.org/pub/IMG/img_core_v400/) using pfam domains instead of KO IDs. Then I could look for differences in functional annotations between treatments, but this might be misleading since different proteins can share the same pfam annotation.

Do you have any thoughts or suggestions about these options?

Thanks again for the help.

Best,

Aaron

Curtis Huttenhower

Feb 14, 2013, 1:10:34 PM

to picrus...@googlegroups.com

Hi Aaron - we've been developing MetaCyc pathways using UniRef orthologous groups as a KEGG alternative for use in HUMAnN. MetaCyc has been the best pathway alternative due to the mechanistic info, which is very important (e.g. KEGG modules are much more precise for these analyses than KEGG pathways). We're using UniRef protein families as a very standardized catalog for now while we redevelop a better, more microbiome-targeted alternative, and I believe other groups are working on similar new "OG" sequence catalogs.

For now, to execute the kind of analyses you describe, we've been using the last public version of KEGG with HUMAnN and LEfSe to find differential modules or pathways. If you contact me directly, I'm happy to point you at the data from KEGG necessary to run HUMAnN with PICRUSt, and that output can be provided near-directly into LEfSe using the online or offline tools. We don't have a complete pipeline ready yet to use UniRef/MetaCyc instead, but I can let you know when it gets closer (and if you're experimenting, that's the approach I'd recommend trying out in the meantime :)

Thanks a bunch -
Curtis

Silas Kieser

Mar 20, 2017, 6:59:19 AM

to picrust-users

Dear all,

Are there any updates on using PICRUSt with UniRef/MetaCyc? I would be interested.

Kind regards

Silas
