RNA-Seq Data and Ensembl CHIP files

David Eby

Aug 11, 2017, 12:15:32 AM
to gsea-help
To facilitate GSEA analysis of RNA-Seq data, we are now providing four new CHIP files to convert human and mouse Ensembl IDs, which are commonly used for gene expression derived from RNA-Seq data, to HUGO gene symbols as used in MSigDB.  More details are here.

These files are available through the CHIP annotation file selector in the GSEA Desktop client, as well as from the Downloads page of our website.

igor

Oct 25, 2017, 5:04:04 PM
to gsea-help
I am happy to hear GSEA is being modified to be more compatible with RNA-seq data. Those CHIP files are useful if you are starting with Ensembl IDs. What is the suggested protocol if you are starting with gene symbols? I used to select GENE_SYMBOL.chip, but that was removed.

Thank you.

David Eby

Oct 26, 2017, 8:42:46 AM
to gsea-help
Hi Igor,

The GENE_SYMBOL.chip file is meant only for internal use by GSEA in certain pre-processing and reporting steps.  It's not meant for users to select directly for collapsing the dataset.  In the past this caused confusion for a number of users, which is why we removed it.

If you are starting with gene symbols then you can proceed directly without the need to collapse the dataset at all, provided you have HUGO symbols.  The MSigDB GMTs accessible from within the GSEA Desktop use HUGO symbols and so will properly map to your dataset.  Alternatively, there are equivalent GMTs that instead use (Human) Entrez IDs available from our website's Downloads page; these can be supplied to GSEA through the Load Data screen.

In either case, just turn off the Collapse Dataset step and proceed with your data as it is.

If your gene symbols are in another format then we would recommend converting them to HUGO.  If your symbols are from another species then you will need to map them to human genes (remembering to account for orthologs / homologs), as MSigDB is oriented to human biology.  We can't offer any specific recommendations for either kind of conversion at this time.

The key point is that both your input dataset and the Gene Set database (GMT / GMX) must have symbols in the same namespace.  GSEA will do a straightforward mapping between these files, so the symbols must match exactly.  At present these must be ALL CAPS, due to some unfortunate internal requirements we hope to clear in the future.
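A minimal sketch of how one might check this before running GSEA, assuming the usual tab-separated GMT layout (set name, description, then one gene per column). The function names and the toy data are hypothetical and are not part of GSEA; the sketch only upper-cases symbols and reports which ones have no exact match.

```python
import io


def read_gmt_symbols(handle):
    """Collect every gene symbol found in a GMT file-like object."""
    symbols = set()
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        # fields[0] is the set name and fields[1] the description;
        # everything after that is treated as a gene symbol.
        symbols.update(f.strip().upper() for f in fields[2:] if f.strip())
    return symbols


def check_symbols(dataset_symbols, gmt_symbols):
    """Upper-case the dataset symbols and split them into matched / unmatched."""
    normalized = [s.strip().upper() for s in dataset_symbols]
    matched = [s for s in normalized if s in gmt_symbols]
    unmatched = [s for s in normalized if s not in gmt_symbols]
    return matched, unmatched


# Toy GMT content and toy dataset symbols (illustrative only).
gmt_text = "SET_A\tna\tTP53\tMDM2\nSET_B\tna\tBRCA1\tTP53\n"
gmt_symbols = read_gmt_symbols(io.StringIO(gmt_text))
matched, unmatched = check_symbols(["Tp53", "brca1", "NOT_A_GENE"], gmt_symbols)
print("matched:", matched)
print("unmatched:", unmatched)
```

Upper-casing only fixes case differences. It does not resolve aliases or map symbols from another species to human genes.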

See the section on Preparing Data Files in our User Guide for more details, in particular the portion on Consistent Feature Identifiers Across Data Files.

Regards,
David

igor

Oct 28, 2017, 7:13:47 PM
to gsea-help
Hi David.

That did not work for me. If I turn off Collapse Dataset, I get an error. If I add GENE_SYMBOL.chip, then it works.

I think the problem is that not all gene symbols match exactly. It's possible to fix the gene symbols, but it's a lot easier to just use GENE_SYMBOL.chip.

I personally use the CLI version of GSEA, but I help many other people use the GUI version. I don't know what confusion GENE_SYMBOL.chip caused before, but the current situation is definitely causing confusion. In addition to all the previous steps, users now have to know where to get GENE_SYMBOL.chip, download it, and then load it in GSEA. Thus, there are now extra steps to return to the old default. What's worse is they are not documented anywhere. Please consider adding GENE_SYMBOL.chip back to the list.

Thank you.

Arthur Liberzon

Oct 31, 2017, 11:56:35 AM
to gsea-help
Hi Igor,

We removed GENE_SYMBOL.chip because in its current state it is wrong to use for dealing with gene symbol aliases, even though the CHIP 'worked' in the sense that the program did not issue any warnings or errors. Instead, we urge using a dedicated CHIP file to collapse platform-specific identifiers to official human gene symbols. With RNA-Seq data, they typically are ENSEMBL gene or transcript identifiers, so it would make most sense to use them. If GSEA refuses to accept your data and you have trouble identifying the issue, then I encourage you to share the detailed error message and attach your input file and we will try to help resolve it.
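To make "collapsing" concrete, here is a rough sketch of the idea, not GSEA's actual implementation. It assumes that when several platform identifiers map to the same symbol, the per-sample maximum is kept; that rule is an assumption chosen for illustration. The identifiers, symbols, and function name are placeholders.

```python
from collections import defaultdict


def collapse_to_symbols(expression, id_to_symbol):
    """Collapse rows keyed by platform ID into rows keyed by gene symbol.

    When several IDs map to one symbol, keep the per-sample maximum
    (an assumed rule for this sketch). IDs with no mapping are dropped
    and returned separately so they can be inspected.
    """
    grouped = defaultdict(list)
    unmapped = []
    for feature_id, values in expression.items():
        symbol = id_to_symbol.get(feature_id)
        if symbol is None:
            unmapped.append(feature_id)
        else:
            grouped[symbol].append(values)
    collapsed = {
        symbol: [max(column) for column in zip(*rows)]
        for symbol, rows in grouped.items()
    }
    return collapsed, unmapped


# Placeholder identifiers and symbols; two samples per row.
expression = {
    "ID_A1": [1.0, 5.0],
    "ID_A2": [3.0, 2.0],
    "ID_B": [4.0, 4.0],
    "ID_X": [9.0, 9.0],
}
id_to_symbol = {"ID_A1": "GENEA", "ID_A2": "GENEA", "ID_B": "GENEB"}

collapsed, unmapped = collapse_to_symbols(expression, id_to_symbol)
print(collapsed)
print(unmapped)
```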

igor

Oct 31, 2017, 2:02:20 PM
to gsea-help
Hi Arthur.

Thanks for getting back to me. I'd like to clarify some of the points you raise.

> the CHIP 'worked' in the sense that the program did not issue any warnings or errors
The chip worked because the results make sense. The genes that are going in have the correct values in the output. A few are filtered out, but that fraction is negligible.

> Instead, we urge using a dedicated CHIP file to collapse platform-specific identifiers to official human gene symbols.
I am not using platform-specific identifiers.

> With RNA-Seq data, they typically are ENSEMBL gene or transcript identifiers, so it would make most sense to use them.
It really depends on your environment. I deal with many people and the genes are usually not Ensembl IDs. Most biologists cannot handle Ensembl IDs.

> If GSEA refuses to accept your data and you have trouble identifying the issue, then I encourage you to share the detailed error message and attach your input file and we will try to help resolve it.
I know what the error is. All gene symbols must match exactly. I can solve it with GENE_SYMBOL.chip which will simply filter out the "unknown" gene symbols. Please correct me if I am wrong.

Is the only concern gene aliases? In that case, don't the underlying gene sets have the same issue since they are also stored as gene symbols?

Thanks.

Arthur Liberzon

Oct 31, 2017, 3:08:35 PM
to gsea-help
The gene symbols CHIP file is now 11 years old, is out of sync with the gene sets in MSigDB, and in its current form is impossible to update. For these reasons we decided to remove it from GSEA v3.0 onward. Of course, the CHIP file is available for download from the archived data on our Downloads page. If you decide to go ahead and use it, that's fine. We just want to make sure that using this file for analysis is your decision and that we are not endorsing its use for GSEA.
 
I agree with you that for reports it's best to have gene symbols. I personally prefer carrying out analyses in the space of more robust identifiers and resorting to gene symbols only at the last steps, for reporting and sharing results with others. It does not really matter what kind of gene identifiers you prefer to work with as long as you have a CHIP file that reliably converts them to the corresponding official human gene symbols. In my experience, RNA-seq data often ends up having ENSEMBL IDs as gene identifiers. Therefore, we have CHIP files to run GSEA on data with ENSEMBL IDs as gene identifiers. With these CHIP files, GSEA properly collapses the expression dataset to the space of official human gene symbols and should work fine for all downstream reports that use human gene symbols. Of course, if you don't like these identifiers, the alternative would be NCBI's Entrez Gene IDs. In this case, however, you will need to build your own CHIP file to handle the conversions.
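For anyone building their own CHIP file, a minimal sketch follows. It assumes a CHIP file is a tab-delimited text file whose header row is "Probe Set ID", "Gene Symbol", "Gene Title"; please confirm that layout against the User Guide before relying on it. The function name is hypothetical, and the two example rows are a tiny hand-written mapping rather than a real annotation source.

```python
import csv
import io


def write_chip(rows, handle):
    """Write (identifier, symbol, title) rows as a tab-delimited CHIP-style table.

    The three-column header is an assumption about the CHIP layout.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["Probe Set ID", "Gene Symbol", "Gene Title"])
    for identifier, symbol, title in rows:
        writer.writerow([identifier, symbol, title])


# Two well-known Entrez Gene IDs, used here only as sample rows.
rows = [
    ("7157", "TP53", "tumor protein p53"),
    ("672", "BRCA1", "BRCA1 DNA repair associated"),
]

buffer = io.StringIO()
write_chip(rows, buffer)
chip_lines = buffer.getvalue().splitlines()
print("\n".join(chip_lines))
```

In practice the rows would come from a current annotation download rather than being typed by hand, and the output would be written to a file ending in .chip instead of an in-memory buffer.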

Hope this is helpful,
Arthur

igor

Nov 1, 2017, 12:59:38 PM
to gsea-help
I wrongly assumed that the gene symbols CHIP was in sync with all the MSigDB genes. I can see why the 2006 list would be a problem. Maybe a good compromise would be to update it? That may be a moot point if GSEA v3.0 works with gene symbols without specifying a CHIP file. In that case, is it possible for it to silently filter out unknown genes instead of throwing an error?
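Until something like that exists, the filtering can be done as a pre-processing step outside GSEA. The sketch below is one possible approach, not GSEA behavior: it drops rows whose symbol is not in a known set and refuses to continue if too many rows would be lost. The function name, the 5% default threshold, and the toy data are all assumptions.

```python
def filter_known_symbols(rows, known_symbols, max_dropped_fraction=0.05):
    """Keep rows whose upper-cased symbol is in known_symbols.

    Raises ValueError if more than max_dropped_fraction of the rows
    would be dropped, so a large mismatch is not hidden silently.
    """
    kept, dropped = [], []
    for symbol, values in rows:
        key = symbol.strip().upper()
        if key in known_symbols:
            kept.append((key, values))
        else:
            dropped.append(symbol)
    fraction = len(dropped) / len(rows) if rows else 0.0
    if fraction > max_dropped_fraction:
        raise ValueError(
            f"{len(dropped)} of {len(rows)} symbols are unknown "
            f"({fraction:.0%}); check the identifier namespace"
        )
    return kept, dropped


# Toy inputs: a small set of known symbols and four expression rows.
known = {"TP53", "BRCA1", "MDM2"}
rows = [
    ("TP53", [1.0, 2.0]),
    ("brca1", [0.5, 0.7]),
    ("MDM2", [3.0, 1.0]),
    ("LOC_UNKNOWN", [0.0, 0.0]),
]

kept, dropped = filter_known_symbols(rows, known, max_dropped_fraction=0.5)
print([symbol for symbol, _ in kept])
print(dropped)
```

The threshold is there because a small number of unmatched symbols is expected, while a large number usually means the dataset and the gene sets are in different namespaces.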

Thanks.

Arthur Liberzon

Nov 1, 2017, 2:25:32 PM
to gsea...@googlegroups.com
Well, as I mentioned earlier, it is not possible to update the gene symbols CHIP file. We are exploring alternatives, so I really appreciate your suggestions in this regard. Thanks!

--Arthur

Arthur Liberzon, Ph.D.
Molecular Signatures Database (MSigDB) curator and
CMAP Bioinformatics Scientist I
Cancer Program
The Broad Institute of MIT and Harvard
415 Main Street
Cambridge MA 02142
Phone: (617) 714 7582
E-mail: libe...@broadinstitute.org


Michał Krassowski

Nov 26, 2017, 6:11:54 AM
to gsea-help
Dear Arthur,

I understand your concerns about keeping the 11-year-old file, though I just wanted to point out my case. I was playing around trying to reproduce GSEA results on TP53 WT/MT as published in 2005 in PNAS.

I basically tried to follow the arguments from: my_analysis.Gsea.1130958999391.rpt (linked in GSEA Report for Dataset p53_full_useme_maxed_cs.gct) and was quite confused to find that there is no such chip file as "Gene_Symbol" available.

Adding to the confusion was the user guide (http://software.broadinstitute.org/gsea/doc/GSEAUserGuideTEXT.htm#_Selecting_DNA_Chip) which states that: 

The web site includes chip annotation files for commonly used DNA chips (human, mouse, and other organisms), as well as two specially defined chip files:
         Gene_symbols lists all of the gene symbols known to GSEA. It is assembled primarily from NCBI Entrez databases.
         Seq_accessions lists all sequence accessions known to GSEA. It is assembled primarily from GenBank identifiers and the gene symbols and common aliases defined in the GENE_SYMBOL.chip file.

and shows a screenshot with those in place.

Please consider updating the user guide, or restoring "GENE_SYMBOL" with a short "(outdated/obsolete/do not use)" description.

With kind regards,
Michał Krassowski

Arthur Liberzon

Nov 27, 2017, 11:06:45 AM
to gsea-help

> I understand your concerns about keeping the 11-year-old file, though I just wanted to point out my case. I was playing around trying to reproduce GSEA results on TP53 WT/MT as published in 2005 in PNAS.

We keep archived data on the Downloads page on our web site precisely for the purpose of reproducing these results. All necessary files are there.

 
> I basically tried to follow the arguments from: my_analysis.Gsea.1130958999391.rpt (linked in GSEA Report for Dataset p53_full_useme_maxed_cs.gct) and was quite confused to find that there is no such chip file as "Gene_Symbol" available.

As I wrote in my previous posting, if you insist on using the old Gene Symbols CHIP file, you can always dig it out of the archived resources on the Downloads page and load it along with the rest of your input files. We just don't endorse its use any more and don't maintain it.
Use this file at your own risk.

 
> Adding to the confusion was the user guide (http://software.broadinstitute.org/gsea/doc/GSEAUserGuideTEXT.htm#_Selecting_DNA_Chip) which states that:
>
> The web site includes chip annotation files for commonly used DNA chips (human, mouse, and other organisms), as well as two specially defined chip files:
>          Gene_symbols lists all of the gene symbols known to GSEA. It is assembled primarily from NCBI Entrez databases.
>          Seq_accessions lists all sequence accessions known to GSEA. It is assembled primarily from GenBank identifiers and the gene symbols and common aliases defined in the GENE_SYMBOL.chip file.
>
> and shows a screenshot with those in place.
>
> Please consider updating the user guide, or restoring "GENE_SYMBOL" with a short "(outdated/obsolete/do not use)" description.

Good point - we will consider that, indeed.
 