please help with conversion of mouse gene symbols to HUGO

Alexey Surnov

Mar 13, 2018, 2:24:59 PM
to gsea-help

Hello all! Can anyone please help me convert mouse gene symbols to HUGO symbols? I have a list of 9171 mouse genes obtained from RNA-Seq that I want to analyze with GSEA. To do that, I first have to convert their official mouse gene symbols to HUGO notation. The GSEA documentation (http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/RNA-Seq_Data_and_Ensembl_CHIP_files) recommends first converting all genes to their Ensembl IDs using BioMart (for example, Trp53 -> ENSMUSG00000059552), and then converting the Ensembl IDs to HUGO symbols using the CHIP file available from the GSEA website (ENSEMBL_mouse_gene.chip, ftp://ftp.broadinstitute.org/pub/gsea/annotations/) (for example, ENSMUSG00000059552 -> TP53).

However, not all genes could be converted that way:

1. Only 8335 of the 9171 genes mapped to a unique HUGO symbol.

2. 373 genes are present in the Ensembl dictionary of mouse gene symbols provided by BioMart (http://useast.ensembl.org/biomart/martview/1b29a6e5676193a70708c8674600ceb5), but are absent from the dictionary of HUGO symbols provided by GSEA (ENSEMBL_mouse_gene.chip).

3. 462 gene symbols were absent from the BioMart dictionary, so I could not even find an Ensembl ID for them.

4. Finally, one gene (Snora16a) maps to two HUGO symbols (SNORA16A and SNORA16B).

So, in the end, I need to know how to convert the remaining 373 + 462 + 1 = 836 gene symbols to HUGO notation so that I can analyze them properly with GSEA. That is, for each mouse gene symbol I need the corresponding HUGO symbol used in the GSEA gene sets (and if a mouse gene symbol corresponds to several synonymous HUGO symbols, I need to know all of them).
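For anyone who wants to reproduce this bookkeeping, here is a minimal Python sketch of the two-step lookup (mouse symbol -> Ensembl ID -> HUGO symbol) that sorts each symbol into the four outcomes listed above. It is only an illustration: the function name `classify_symbols`, the dictionary layout, and the placeholder Ensembl IDs are my own assumptions, and the toy tables stand in for a real BioMart export and the real ENSEMBL_mouse_gene.chip file.

```python
def classify_symbols(mouse_symbols, symbol_to_ensembl, ensembl_to_hugo):
    """Sort mouse symbols by how they map through the two lookup tables.

    symbol_to_ensembl: dict, mouse symbol -> set of Ensembl gene IDs
                       (assumed to be built from a BioMart export)
    ensembl_to_hugo:   dict, Ensembl gene ID -> set of human symbols
                       (assumed to be built from the CHIP file)
    """
    report = {"unique": {}, "ambiguous": {}, "no_hugo": [], "no_ensembl": []}
    for symbol in mouse_symbols:
        ensembl_ids = symbol_to_ensembl.get(symbol, set())
        if not ensembl_ids:
            # Case 3: the symbol is not in the BioMart table at all.
            report["no_ensembl"].append(symbol)
            continue
        hugo = set()
        for ensembl_id in ensembl_ids:
            hugo |= ensembl_to_hugo.get(ensembl_id, set())
        if not hugo:
            # Case 2: an Ensembl ID exists, but the CHIP table has no entry.
            report["no_hugo"].append(symbol)
        elif len(hugo) == 1:
            # Case 1: exactly one human symbol.
            report["unique"][symbol] = next(iter(hugo))
        else:
            # Case 4: several human symbols; keep all of them.
            report["ambiguous"][symbol] = sorted(hugo)
    return report


# Toy tables. Only the Trp53 ID comes from the post; the other IDs are
# made-up placeholders, not real Ensembl identifiers.
symbol_to_ensembl = {
    "Trp53": {"ENSMUSG00000059552"},
    "Snora16a": {"PLACEHOLDER_ID_1"},
    "Ace": {"PLACEHOLDER_ID_2"},
}
ensembl_to_hugo = {
    "ENSMUSG00000059552": {"TP53"},
    "PLACEHOLDER_ID_1": {"SNORA16A", "SNORA16B"},
}

report = classify_symbols(
    ["Trp53", "Snora16a", "Ace", "GeneNotInBioMart"],
    symbol_to_ensembl,
    ensembl_to_hugo,
)
for category, members in report.items():
    print(category, members)
```

Keeping the ambiguous and unmapped symbols in separate buckets, rather than silently dropping them, makes it easy to see how many genes would be lost before running GSEA.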

I would appreciate any help from a GSEA administrator or any other experienced person.

Thanks

Alex Surnov

Saint Louis University

Arthur Liberzon

Mar 15, 2018, 3:46:35 PM
to gsea-help
When gene identifiers come from a species other than human, our CHIP files convert them to the orthologous human genes. Our CHIP files contain no mapping for a non-human identifier that has no human ortholog or that has many human orthologs. For more, please read our FAQ 2.10. Please consult your local bioinformatician for more help if you plan to implement an alternative approach.

Alexey Surnov

Mar 16, 2018, 8:12:17 PM
to gsea-help
Thank you Arthur for your answer. It was helpful. However, I would really like to clarify one more point.

Here is my question: how is it that some genes that are known to come from a mouse transcriptome are present in GSEA's gene sets but absent from the CHIP file?

For example, the gene ACE is present in the gene set http://software.broadinstitute.org/gsea/msigdb/cards/GSE20500_CTRL_VS_RETINOIC_ACID_TREATED_CD4_TCELL_DN, which was certainly generated from mouse cells. The human gene ACE and the mouse gene Ace are orthologs, yet they are absent from GSEA's mouse-to-HUGO CHIP file (ENSEMBL_mouse_gene.chip). This bothers me a lot, because I worry that some of the 373 genes mentioned in my previous message are in fact present in gene sets in c7.all.v6.1.symbols.gmt but absent from ENSEMBL_mouse_gene.chip, just like Ace in the example above. If that is the case, GSEA will not recognize those genes when it runs the enrichment analysis.
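To check how widespread this is, one could list every gene symbol that appears in a GMT file but never appears as a target symbol in the CHIP file. The sketch below is only a rough illustration under stated assumptions: it assumes the GMT file is tab-separated with the set name and a description in the first two columns, and that the CHIP file is tab-separated with a header row containing a "Gene Symbol" column. The function names and the toy data are hypothetical.

```python
import csv
import io


def read_gmt_genes(lines):
    """Collect all gene symbols from GMT lines (name, description, genes...)."""
    genes = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        genes.update(gene for gene in fields[2:] if gene)
    return genes


def read_chip_symbols(lines, symbol_column="Gene Symbol"):
    """Collect the human symbols that the CHIP table can map to.

    The column name is an assumption about the CHIP header row.
    """
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[symbol_column] for row in reader if row.get(symbol_column)}


def genes_missing_from_chip(gmt_lines, chip_lines):
    """Return gene-set symbols that no CHIP row maps to, sorted by name."""
    return sorted(read_gmt_genes(gmt_lines) - read_chip_symbols(chip_lines))


# Toy stand-ins for c7.all.v6.1.symbols.gmt and ENSEMBL_mouse_gene.chip.
# The set names and file contents here are invented for illustration.
gmt_text = (
    "TOY_SET_A\tna\tACE\tTP53\n"
    "TOY_SET_B\tna\tTP53\tSNORA16A\n"
)
chip_text = (
    "Probe Set ID\tGene Symbol\tGene Title\n"
    "ENSMUSG00000059552\tTP53\ttumor protein p53\n"
)

missing = genes_missing_from_chip(io.StringIO(gmt_text), io.StringIO(chip_text))
print(missing)
```

With real files, the two `io.StringIO` objects would be replaced by open file handles; the resulting list shows which gene-set members can never be matched through the CHIP file.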
Once again, thank you for your attention.
Alex

Arthur Liberzon

Mar 19, 2018, 2:50:09 PM
to gsea-help
Gene sets in MSigDB are converted to human NCBI Entrez Gene IDs by an internal pipeline that is very comprehensive and robust. CHIP files, by contrast, are generated by a variety of other means and in general reformat information from third-party annotation sources. For example, the Ensembl CHIP files use Ensembl v91 through biomaRt (R Bioconductor), with 'mmusculus_gene_ensembl' as the source of mouse genes (GRCm38.p5). This build does not contain any human ortholog for the mouse gene "Ace". We plan to unify the various CHIP file making procedures and bring them in line with the more robust and comprehensive pipeline that we have in place for MSigDB gene sets.

Alexey Surnov

Mar 20, 2018, 11:46:04 AM
to gsea-help
OK, it is all clear now.
Thank you again very much for your help.
Alex
