Help on Ensembl ID to Gene Symbol conversion


Dong

Sep 12, 2021, 5:39:24 PM
to gsea-help
Hi, 

I am trying to run a GSEA analysis using microarray data that I downloaded from the GEO database. I notice that not all of the gene symbols in the GPL document are approved symbols. My guess is that the annotation has not been updated, though I could be wrong. Is there a good tool for converting Ensembl IDs to gene symbols? I tried several tools but don't think I succeeded. The Ensembl IDs may be OK for the GSEA analysis, but I would still like to get the symbols as well.

Thanks.
Dong

Anthony Castanza

Sep 12, 2021, 7:54:06 PM
to gsea-help
What is the GPL ID of the microarray you're using? MSigDB offers a number of CHIP files that can be used with various microarray platforms, and for other platforms we offer the Symbol remapping chips that can take various historical gene symbols and harmonize them to the versions used in MSigDB.

-Anthony

Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego


Dong

Sep 14, 2021, 8:09:56 AM
to gsea-help
Hi Anthony, 

The GPL ID is GPL13667. The platform is [HG-U219] Affymetrix Human Genome U219 Array. Where can I find the remapping data? In addition to the GSEA analysis, I also want to do some other analyses that need gene symbols or Ensembl IDs.

Thanks.
Dong

Anthony Castanza

Sep 14, 2021, 4:16:36 PM
to gsea-help
Hi Dong,



We don't build a CHIP for the Affymetrix Human Genome U219 Array by default, because this array is not deposited in BioMart and so has no standardized mapping data. However, I was able to construct one as follows. I downloaded the platform annotation table from the GEO record and extracted the ID, Entrez.Gene, Ensembl, and Gene.Symbol columns. I then split any multiple mappings in the Entrez, Ensembl, and Symbol columns and remapped each Entrez ID, Ensembl ID, and symbol through its respective MSigDB chip (to pick up any genes that may have been annotated in one namespace but not the others). Finally, I recombined the outputs of these mappings into a single CHIP.

For the recombination, the probe-to-Ensembl mappings were accepted as "canonical". Mappings for probes missing from that set were then added (where available) from the Entrez.Gene mappings first, and any probes still missing were filled in from the symbol-based mappings. This avoids conflicting probe-to-gene mappings wherever possible and prioritizes the data source most consistent with our standard procedure. I still wouldn't consider this procedure up to our normal standards for inclusion in MSigDB's chip catalogue, but it's probably the best that can be done with this kind of old platform data without going back to the raw probe sequences.
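The priority order described above can be sketched roughly as follows. This is an illustrative Python sketch, not the R code that was actually used. The `" /// "` separator for multiple mappings, the toy lookup tables standing in for the MSigDB chips, and all function names are assumptions made for the example. Keeping every hit for a multi-mapped probe is also an assumption, since the thread does not say how such probes were resolved.

```python
# Hypothetical sketch: toy dictionaries stand in for the MSigDB chip files.

def split_ids(cell):
    """Split a multi-mapping cell (assumed '///'-separated) into clean IDs."""
    return [part.strip() for part in (cell or "").split("///") if part.strip()]

def remap(cell, lookup):
    """Map each ID in a cell through one namespace; keep unique hits in order."""
    hits = []
    for identifier in split_ids(cell):
        symbol = lookup.get(identifier)  # placeholders such as '---' simply miss
        if symbol and symbol not in hits:
            hits.append(symbol)
    return hits

def build_chip(rows, ensembl_chip, entrez_chip, symbol_chip):
    """Take Ensembl hits first, then Entrez, then Symbol; skip unmapped probes."""
    priority = (
        ("Ensembl", ensembl_chip),
        ("Entrez.Gene", entrez_chip),
        ("Gene.Symbol", symbol_chip),
    )
    chip = {}
    for row in rows:
        for column, lookup in priority:
            hits = remap(row.get(column, ""), lookup)
            if hits:
                chip[row["ID"]] = hits
                break  # a higher-priority namespace already answered
    return chip

# Toy annotation rows with made-up probe and gene names.
rows = [
    {"ID": "probe_1", "Ensembl": "ENSG_A", "Entrez.Gene": "101", "Gene.Symbol": "OLDA"},
    {"ID": "probe_2", "Ensembl": "---", "Entrez.Gene": "102", "Gene.Symbol": "GENEB"},
    {"ID": "probe_3", "Ensembl": "", "Entrez.Gene": "", "Gene.Symbol": "OLDC /// GENED"},
    {"ID": "probe_4", "Ensembl": "", "Entrez.Gene": "", "Gene.Symbol": ""},
]
ensembl_chip = {"ENSG_A": "GENEA"}
entrez_chip = {"101": "GENEA", "102": "GENEB"}
symbol_chip = {"OLDA": "GENEA", "GENEB": "GENEB", "OLDC": "GENEC", "GENED": "GENED"}

chip = build_chip(rows, ensembl_chip, entrez_chip, symbol_chip)
```

In this toy run, `probe_1` is resolved from the Ensembl column, `probe_2` falls back to Entrez, `probe_3` falls back to the symbol column, and `probe_4` has no mapping and is left out.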

Let me know if you have any questions.

-Anthony

 


Affymetrix_Human_Genome_U219_MSigDB.v.7.4_custom.chip

Dong

Sep 14, 2021, 5:04:13 PM
to gsea-help
Hi Anthony,

Wow, that's a lot of effort! It definitely seems complicated to me. How did you do the remapping? Did you use the GSEA tool? I notice that not all probes in the custom CHIP file have been converted. Why does that happen? Also, when I run the GSEA analysis, do I need to choose Collapse/Remap to gene symbols and then select Human_Gene_Symbol_With_Remapping_MSigDB.v7.4.chip as the Chip platform? Or can I choose No-collapse, since the gene symbols are human and are already included in the matrix?

Thanks.
Dong

Anthony Castanza

Sep 14, 2021, 5:12:32 PM
to gsea...@googlegroups.com

I performed these remappings with functions in R, using data from our chip files for NCBI/Entrez Gene IDs, Ensembl Gene IDs, and Gene Symbols. Not all probes have valid gene mappings, so those probes are discarded during CHIP assembly. This is common with old platforms, which were frequently constructed using the UniGene database as a reference. UniGene was based on old EST technologies that were quite noisy and imprecise.
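As a rough illustration of why some probes are absent from the final file, the sketch below drops probes without a usable symbol while writing a tab-delimited CHIP-style table. It is written in Python rather than R, and the function name, the toy probes, and the `NA` title placeholder are hypothetical. The three-column header is an assumption about the CHIP layout and should be checked against the GSEA data formats documentation.

```python
import io

def write_chip(mappings, handle):
    """Write probe/symbol pairs as a CHIP-style table; return the dropped probes."""
    # Assumed header layout for a CHIP file.
    handle.write("Probe Set ID\tGene Symbol\tGene Title\n")
    dropped = []
    for probe, symbol in mappings:
        if not symbol or symbol == "---":
            dropped.append(probe)  # no valid gene mapping: discard the probe
            continue
        handle.write(f"{probe}\t{symbol}\tNA\n")
    return dropped

# Toy input with made-up probe and gene names.
buffer = io.StringIO()
dropped = write_chip(
    [("probe_1", "GENEA"), ("probe_2", ""), ("probe_3", "---"), ("probe_4", "GENEB")],
    buffer,
)
chip_lines = buffer.getvalue().splitlines()
```

Returning the dropped probes makes it easy to count how much of the platform could not be mapped.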

 

If your dataset is already in Human Gene Symbols, you *can* choose "No collapse". However, unless your dataset was analyzed using the specific version of Ensembl/Gencode that we used when constructing that version of MSigDB, we recommend using the "Human_Gene_Symbol_with_Remapping" chip as the chip platform. That ensures any obsolete symbols are converted to their current equivalents; otherwise, genes whose symbols have changed will be lost from the analysis.
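The effect of skipping the remapping step can be shown with a small toy example. This is a hypothetical Python sketch, not GSEA's own implementation: the gene names and the `remap_table` dictionary are invented, and the real remapping chip covers far more symbols.

```python
def matched_genes(dataset_symbols, gene_set, remap_table=None):
    """Return the gene set members found in the dataset, optionally after remapping."""
    remap_table = remap_table or {}
    # Replace obsolete symbols with their current equivalents where known.
    current = {remap_table.get(symbol, symbol) for symbol in dataset_symbols}
    return sorted(current & set(gene_set))

# Toy data: OLDB is an obsolete symbol for GENEB (made-up names).
dataset = ["GENEA", "OLDB", "GENEC"]
gene_set = {"GENEA", "GENEB"}
remap_table = {"OLDB": "GENEB"}

without_remap = matched_genes(dataset, gene_set)
with_remap = matched_genes(dataset, gene_set, remap_table)
```

Without the remapping table only `GENEA` matches the gene set, so the gene recorded under its old symbol is silently lost. With the table, both genes are found.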

 

-Anthony

 


 

Adrian Buensuceso

Feb 17, 2022, 1:12:41 AM
to gsea-help
Thank you for creating this custom CHIP file and making it available! I was analyzing data from a U219 array and this is exactly what I needed!

-Adrian
