For a given genome, how to download a table with gene IDs, KOs and the pathways that each gene belongs to


Lucas

Apr 6, 2016, 10:37:24 PM
to IMG User Forum
Hi there,


This question may be very easy to answer, but I just could not find a way to do it. For a given genome, I want to download the list of all genes, with the KO ID and the pathways that each gene is assigned to.
I am looking for something like:

Gene ID, locus_tag, KO, KEGG category, KEGG pathway

Currently, I can see the KEGG categories and the list of genes for each specific category. For instance:

Alanine, aspartate and glutamate metabolism

    GeneID Adenylosuccinate lyase (EC 4.3.2.2)
    GeneID Adenylosuccinate synthetase

But I want a comprehensive list for all the genes of a given genome. Can anybody help me here?

Cheers,
Lucas

Matthew

Apr 12, 2016, 5:26:57 PM
to IMG User Forum
Hi Lucas,

I'm just a user, but I don't think they support this currently.
You can go to the genome's page and click on the hyperlinked number for "protein coding genes"; that will let you easily add all genes in the genome to your gene cart.
From there, if you go to the bottom of the page, you can add columns of extra annotation data to the table. For instance, the KO tag is under the third column (Function Field), "KEGG Orthology ID, Name and Definition".

Notice that if you go to a particular gene's page, there is a lot more annotation data available (e.g. KEGG pathways), which cannot be exported in the gene table.

One workaround you could try is to download the "genbank-like" file for the genome.
You'll have to massage the data to suit your needs, so some skill with something like SQL, R, Pandas (what I use), or Excel+VBA will be necessary.
These files seem to have a line for things such as KEGG module, COG category, and MetaCyc.
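
As a rough sketch of the "massaging" step, the snippet below pulls the qualifiers of each CDS feature out of GenBank-style text using only the Python standard library. It assumes the standard GenBank feature-table layout (feature key indented five spaces, qualifiers written as `/key="value"` on one line). The sample text, the `ABC_` locus tags, and the idea that the KO sits in a `/note` qualifier are invented for illustration; check the real export for the actual field names.

```python
import re

# Feature lines start the feature key after exactly five spaces.
FEATURE = re.compile(r"^ {5}(\S+)\s+\S")
# Single-line qualifier such as /locus_tag="ABC_0001" or /codon_start=1.
QUALIFIER = re.compile(r'^\s+/(\w+)="?(.*?)"?\s*$')


def parse_cds_qualifiers(text):
    """Return one dict of qualifiers per CDS feature.

    Only single-line qualifiers are handled; a repeated key is joined with ';'.
    """
    records = []
    current = None
    for line in text.splitlines():
        feature = FEATURE.match(line)
        if feature:
            # Start a new record for CDS features, ignore every other feature.
            current = {} if feature.group(1) == "CDS" else None
            if current is not None:
                records.append(current)
            continue
        qualifier = QUALIFIER.match(line)
        if current is not None and qualifier:
            key, value = qualifier.groups()
            current[key] = f"{current[key]};{value}" if key in current else value
    return records


# Hypothetical sample, not copied from a real IMG export.
SAMPLE = """\
     gene            1..1371
                     /locus_tag="ABC_0001"
     CDS             1..1371
                     /locus_tag="ABC_0001"
                     /product="Adenylosuccinate lyase"
                     /note="KO:K01756"
     CDS             1400..2700
                     /locus_tag="ABC_0002"
                     /product="Adenylosuccinate synthetase"
"""

genes = parse_cds_qualifiers(SAMPLE)
for gene in genes:
    print(gene["locus_tag"], gene.get("note", ""))
```

Qualifiers that wrap over several lines would need extra handling; a full GenBank parser is the safer choice for anything beyond a quick look.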

Another thing you could do is look at KEGG to see if you can download a table of all their KO tags mapped to KEGG categories & KEGG pathways, and then do a database join between the two tables.
(This would add the appropriate KEGG category for each KO tag.)
You'll probably have to clean up the KEGG ID tags (KO; aren't these really K0, though?) in order to get them to match. I use regex-formatted searching within Sublime Text.
I see there is a table of definitions for the KEGG module list under "Find Functions" >> KEGG >> KEGG Module List. You could export that, and then maybe first pivot each row by module type. (Do you know how to do a data frame pivot, or work with "pivot tables"? I use Pandas, but you could also use Excel, R, SQL, etc.) I don't know what it means when things are in parentheses vs. have - or + in front of them.
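
The clean-up and join could look roughly like the sketch below, written with the standard library so it runs anywhere (the same join is a one-liner with a Pandas merge). KEGG KO identifiers are the letter K followed by five digits, so a regex can strip prefixes such as `KO:` or `ko:` before matching. The gene IDs, locus tags, and the small KO-to-pathway table are illustrative values, not data exported from IMG or KEGG.

```python
import re

# A KO identifier is the letter K followed by five digits, e.g. K01756.
KO_PATTERN = re.compile(r"K\d{5}")


def normalize_ko(raw):
    """Return the bare KO identifier found in raw (e.g. 'K01756'), or None."""
    match = KO_PATTERN.search(raw or "")
    return match.group(0) if match else None


def join_genes_to_pathways(genes, ko_pathways):
    """Inner-join gene rows to (KO, category, pathway) rows on the cleaned KO.

    A gene appears once per pathway its KO maps to; genes without a
    matching KO are dropped.
    """
    by_ko = {}
    for raw_ko, category, pathway in ko_pathways:
        ko = normalize_ko(raw_ko)
        if ko is not None:
            by_ko.setdefault(ko, []).append((category, pathway))
    rows = []
    for gene_id, locus_tag, raw_ko in genes:
        ko = normalize_ko(raw_ko)
        for category, pathway in by_ko.get(ko, []):
            rows.append((gene_id, locus_tag, ko, category, pathway))
    return rows


# Illustrative inputs: note the differently formatted KO tags.
genes = [
    ("637000001", "ABC_0001", "KO:K01756"),
    ("637000002", "ABC_0002", "ko:K01939"),
    ("637000003", "ABC_0003", ""),  # no KO assigned
]
ko_pathways = [
    ("K01756", "Nucleotide metabolism", "Purine metabolism"),
    ("K01756", "Amino acid metabolism", "Alanine, aspartate and glutamate metabolism"),
    ("K01939", "Nucleotide metabolism", "Purine metabolism"),
]

rows = join_genes_to_pathways(genes, ko_pathways)
for row in rows:
    print("\t".join(row))
```

The result is in the "Gene ID, locus_tag, KO, KEGG category, KEGG pathway" shape asked for at the top of the thread, with one row per gene and pathway pair.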

If you could somehow get each KEGG pathway into the function cart, you could profile all of them against your favorite genome.

Good Luck!
Matt Melnicki.
(Postdoc, Kerfeld Lab, LBNL & MSU)

Matthew

Apr 12, 2016, 5:40:09 PM
to IMG User Forum
Oh, quick follow-up, Lucas:

You can easily get an exportable table of all KO tags and the module they belong to via:
FIND FUNCTIONS >> KEGG >> PATHWAYS VIA KO TERMS >> click on the first hyperlink, "metabolic pathways" (currently it says 2632). It's under "02 Global and overview maps".
I'm not positive that it will be comprehensive, but it seems so.
Actually it probably omits everything else at the "01" level of the hierarchy (e.g. genetic information processing, environmental information processing...)

But if you just care about metabolic genes, you can get a matchup table between Module & KEGG ID ("KO" tag), which you could then append to a new column of your exported gene cart table (after you add the KEGG ID field).
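
Appending that module column could be sketched as below, assuming both exports are tab-delimited text. The column names (`gene_oid`, `locus_tag`, `ko`, `module`) and the rows are hypothetical stand-ins for whatever the real exports contain. Since one KO can belong to several modules, the modules are collapsed into a single semicolon-separated cell, and genes without a module keep an empty cell.

```python
import csv
import io
from collections import defaultdict


def add_module_column(gene_table, module_table, sep=";"):
    """Left-join a module column onto a tab-delimited gene table.

    Every gene row is kept; genes whose KO has no module get an empty cell.
    """
    modules_by_ko = defaultdict(list)
    for row in csv.DictReader(io.StringIO(module_table), delimiter="\t"):
        if row["module"] not in modules_by_ko[row["ko"]]:
            modules_by_ko[row["ko"]].append(row["module"])

    reader = csv.DictReader(io.StringIO(gene_table), delimiter="\t")
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=reader.fieldnames + ["module"],
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        # .get() avoids creating empty entries for KOs without a module.
        row["module"] = sep.join(modules_by_ko.get(row["ko"], []))
        writer.writerow(row)
    return out.getvalue()


# Hypothetical exports; column names and values are for illustration only.
GENES = (
    "gene_oid\tlocus_tag\tko\n"
    "637000001\tABC_0001\tK01756\n"
    "637000002\tABC_0002\tK01939\n"
    "637000003\tABC_0003\t\n"
)
MODULES = (
    "module\tko\n"
    "M00048\tK01756\n"
    "M00049\tK01756\n"
    "M00049\tK01939\n"
)

result = add_module_column(GENES, MODULES)
print(result)
```

With real files, the two `io.StringIO` wrappers would be replaced by opened file handles.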

You could probably do something more comprehensive by browsing around the KEGG BRITE database link...

So, still no easy way to do what you're asking!!!
Good luck, again.

Lucas

Apr 12, 2016, 9:00:48 PM
to IMG User Forum
Hi Matt,

Thanks a lot for your reply =). Very helpful and detailed. Yes, I see what you mean, and I can certainly do that. I just think it is very counterintuitive to have to make all this effort to get information that is already there. For instance, I am now battling with a similar problem. I want the Transporter Classification Database IDs and families for the CDSs of the genome. Something like:

Gene ID, locus_tag, TC_ID, name, family.

Following your suggestions for KEGG, I could just get the list of all TC_IDs, names and families from Find Functions > Transporter Classification > TC List.
But I still cannot find the mapping from locus_tag or Gene ID to the TC_ID. Could you please point me to the genbank-like file that you mentioned?

I could export the KO entries for each locus tag from the Gene Cart analysis:
  1. Select Protein Coding Genes > Select all > Add selected to Gene Cart
  2. Select Analysis Cart > Scroll down and select all functional fields (KEGG ID is one of them) > Display genes again > Display all > Export.
I will suggest that the IMG people include such features, because the data is there and could easily be displayed.

Cheers,
Lucas

Matthew

Apr 12, 2016, 9:34:49 PM
to IMG User Forum
Hi Lucas,

I'm running into some problems that are reminiscent of yours.  There are some desired features which seem fairly easy for IMG to implement, but whenever I inquire about them I am given some excuse.

A subtext to your predicament is how many genomes you want to do this for. If it's just one genome, and you can handle massaging the data with your own computational skills, then the problem comes down to downloading the "genbank-like" data file for that genome. There should be a few ways to do that from the genome's IMG page. At the bottom you can click on the button for "Export Genome Data" (generate genbank file). But I'm pretty sure you get the same (or similar) data if you click the button above it for "Export Gene Information". The download data link at the top is not ideal, because it takes you to the JGI download portal, which I find to be really messy: there is no consistency in the file types available within each genome's folder, and the folders are not organized in a definitive manner (although they try to provide multiple ways to get what you want). Furthermore, sometimes they include the "raw reads" (seriously, who wants that, unless it's metagenomic?!!) in these download folders, and that makes the file size prohibitively large.

If it's just one genome you want, no problem. Delete what you don't need.

You might want to check out those download folders, though. I do recall seeing a file that seems like the source of all the annotations, maybe the output from running their own prediction algorithms, but I can't recall if they have the one you want.

Personally, I really just want an easy way to download the AA sequences for all protein-coding genes (across several genomes), and then I could just re-run any analysis I want locally. Without that, matching up IMG gene_oids (e.g. with UniProt IDs) is impossible.

I love IMG, but it does not seem to be very friendly for customizing your own analyses. What I'm told is that it is designed for people with metagenome/metatranscriptome data to upload and match up against all the sequenced genomes. It's a pity, because their databases are incredibly valuable (and their auto-annotations are the least error-prone I've seen)...

Hope that helps,
Matt.

Lucas

Apr 13, 2016, 8:00:17 PM
to IMG User Forum
Hi Matt,

I agree with you. IMG is really good, but I also do not get why they haven't implemented minor changes that would greatly improve the user experience. Thanks for the tips on downloading the annotation files and for your valuable help. Gold!
Cheers,
Lucas