UniRef90 to KO mapping

jjo...@gmail.com

Feb 27, 2017, 9:31:31 PM
to HUMAnN Users
Hi,

Could someone point me to the files originally used to generate the mappings between KEGG Orthology groups (KOs) and UniRef90 IDs?

The HUMAnN2 documentation states that “in most cases, mappings are directly inferred from the annotation of the corresponding UniRef centroid sequence in UniProt”. However, the idmapping files I can find on the UniProt Knowledgebase do not contain KO information, so it is not clear to me how the file map_ko_uniref90.txt.gz is created.

I ask because I think I have identified an error in the KO-UniRef90 mappings that is causing artifactual results in my HUMAnN2 output.


-----


In more detail: in my mWGS data I find a very strong negative correlation between the abundance of an individual KO (generated from HUMAnN2 0.7.1 output) and the abundance of a single microbial taxon. This relationship is biologically plausible and very interesting to me. However, when I download the full NCBI genome annotation for the relevant taxon, I find that the genome contains a gene that belongs to the KO in question. Querying GenBank, the common names, KO numbers, and EC numbers all confirm that this gene belongs to that KO.

When I search for the relevant gene sequence in ChocoPhlAn, I am able to find it. I then use the FASTA header to find the associated UniRef90 ID for this gene sequence, and I can confirm that there is no mapping between that UniRef90 ID and the KO within ‘map_ko_uniref90.txt.gz’.
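For anyone wanting to repeat this check, here is a minimal sketch of the lookup. It assumes the mapping file is tab-delimited with the KO ID in the first column and its member UniRef90 IDs in the remaining columns; that layout is an assumption here, so inspect a few lines of the file first. The function names are hypothetical.

```python
import gzip


def kos_for_uniref(mapping_lines, uniref_id):
    """Return the KO IDs whose mapping row lists uniref_id.

    Assumed row layout (not confirmed in this thread):
    KO_ID <tab> UniRef90_a <tab> UniRef90_b ...
    """
    hits = []
    for line in mapping_lines:
        fields = line.rstrip("\n").split("\t")
        # fields[0] is the KO; everything after it is a member ID.
        if uniref_id in fields[1:]:
            hits.append(fields[0])
    return hits


def kos_from_file(path, uniref_id):
    """Run the same lookup against a gzip-compressed mapping file."""
    with gzip.open(path, "rt") as handle:
        return kos_for_uniref(handle, uniref_id)
```

An empty list from `kos_from_file("map_ko_uniref90.txt.gz", some_id)` would correspond to the situation described above: the UniRef90 ID is not mapped to any KO.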

Finally, if I account for counts assigned to the missing UniRef90 ID (taken from the *genefamilies* output), then the negative correlation between my taxon and the KO completely disappears. My conclusion is that a failure to map all UniRef90 IDs to their respective KOs is therefore responsible for my initial observation.

As this is a highly abundant and potentially important taxon, I'm very concerned by this result. Any advice would be greatly appreciated!

Thanks

Jethro

Eric Franzosa

Feb 28, 2017, 12:19:44 PM
to humann...@googlegroups.com
Hi Jethro,

Consider a UniRef90 centroid such as:


Which is centered on representative protein:


You can see the raw information for this protein by appending a ".txt" to that URL:


Fields starting with "DR" are database cross references. "DR KO; K03746; -." is the KO cross reference for this protein. We assemble (e.g.) KO mappings by parsing these raw data files en masse.
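As a rough illustration of that parsing step (a sketch only, not the actual pipeline code), the KO cross references can be pulled out of one entry's raw text with a regular expression. The pattern assumes the line starts with "DR", then whitespace, then "KO;", followed by a K number of five digits; `extract_kos` is a hypothetical name.

```python
import re

# Assumed line shape, based on the example above: "DR KO; K03746; -."
# \s+ tolerates whatever spacing follows the "DR" line code.
KO_LINE = re.compile(r"^DR\s+KO;\s*(K\d{5});")


def extract_kos(flat_text):
    """Return the KO IDs listed in the DR lines of one raw entry."""
    kos = []
    for line in flat_text.splitlines():
        match = KO_LINE.match(line)
        if match:
            kos.append(match.group(1))
    return kos


print(extract_kos("DR   KO; K03746; -."))  # ['K03746']
```

An entry with no "DR KO" line yields an empty list, which is how a centroid ends up without a KO mapping.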

=====

It's possible that the UniRef90 cluster to which your protein is assigned has not been annotated to your KO of interest, even though it's compatible with your protein (an example of an annotation coverage problem). Can you send me the relevant KO ID and the ChocoPhlAn header for the protein of interest?

Thanks,
Eric


jjo...@gmail.com

Feb 28, 2017, 3:53:55 PM
to HUMAnN Users
Hi Eric,

Many thanks - your explanation is correct.

Looking at the raw txt file I can confirm that the KO mapping is missing from the information provided within the UniProt sequence entry.


Jethro
