GSEA Collapse/Remap with RNA-seq data


zcu...@gmail.com

May 15, 2020, 5:18:08 PM
to gsea-help
I'm having some trouble understanding what the collapse option does with RNA-seq data and whether I should be using it.

I'm using GSEA 4.0.3, macOS Catalina 10.15.4, and Java 8 to analyze RNA-seq data. I was able to follow the instructions here with the hallmarks collection and Mouse_ENSEMBL_Gene_ID_MSigDB.vX.chip; it ran, and I thought everything had worked perfectly.

But then I tried to run GSEA Preranked with the -log10 of p-values from sleuth and got error 1020, multiple rows mapped to RDH16 (just an example), which I thought was impossible because I had dropped all duplicates using pandas when I created the rank file.

 {rpt_label=my_analysis, rnd_seed=timestamp, set_min=15, chip=ftp.broadinstitute.org://pub/gsea/annotations_versioned/Mouse_ENSEMBL_Gene_ID_to_Human_Orthologs_MSigDB.v7.1.chip, zip_report=false, create_svgs=false, scoring_scheme=weighted, rnk=/Users/student/Documents/Rotations/Ntranos/analysis/sox9/test.rnk, norm=meandiv, out=/Users/student/gsea_home/output/may13, mode=Max_probe, include_only_symbols=true, set_max=500, gmx=ftp.broadinstitute.org://pub/gsea/gene_sets/h.all.v7.1.symbols.gmt, make_sets=true, plot_top_x=20, gui=false, nperm=1000, collapse=Remap_Only}
158691770 [ERROR ] - Tool exec error
xtools.api.param.BadParamException: Multiple rows mapped to the symbol ''RDH16'.  This is not allowed in Remap_only mode.
at edu.mit.broad.genome.alg.DatasetGenerators.collapse(DatasetGenerators.java:255) ~[gsea-minimal-4.0.3.jar:?]
at xtools.gsea.GseaPreranked.getRankedList(GseaPreranked.java:190) ~[gsea-minimal-4.0.3.jar:?]
at xtools.gsea.GseaPreranked.execute(GseaPreranked.java:92) ~[gsea-minimal-4.0.3.jar:?]
at edu.mit.broad.xbench.tui.TaskManager$ToolRunnable.run(TaskManager.java:435) [gsea-minimal-4.0.3.jar:?]
at java.lang.Thread.run(Unknown Source) [?:?]
158691778 [INFO  ] - Renaming rpt dir on error to: /Users/student/gsea_home/output/may13/error_my_analysis.GseaPreranked.1589576615484


It was here that I realized that when I initially ran GSEA I had used the default option to collapse my dataset, whereas the default for GSEA Preranked is Remap_Only. So I investigated what collapsing was doing to my data by running the Collapse Dataset tool and looking at the Symbol_to_probe_set_mapping_details.xls file it generated, and found the following:

Screen Shot 2020-05-15 at 2.11.13 PM.png



When I read the documentation, the collapse tool seems like it was meant for collapsing multiple probes to one gene, not collapsing multiple genes to one gene. But then I looked at the genes present in the Hallmark collection and only RDH16 is present (for example, Rdh1, Rdh6, and Rdh16f2 are not present), so it seems like the chip file is meant to match the genes present in the gene sets?

What are the best practices for using or not using the collapse tool with RNA-seq data, and what is the best way to get GSEA Preranked to run when I have multiple rows mapped to the same gene (e.g. RDH16)?

Thanks!

Anthony Castanza

May 15, 2020, 5:40:58 PM
to gsea...@googlegroups.com
Hi,

Each Ensembl Gene ID represents a discrete gene model. However, due to things like alternative assemblies, or just differences in how different consortia assemble their genomes, it's possible for multiple gene models to map to a single Gene Symbol.

This is further complicated by the necessity to perform orthology conversion when using mouse data with MSigDB gene sets, which are all human genes. In this case, it's additionally possible for multiple mouse genes to have the same human gene as their best-matching orthologue.

For this reason we offer several options for "collapsing" datasets. The simplest is remap_only, which simply checks each symbol to ensure that it is the current version of that gene's symbol as annotated by HGNC. This mode only works if you have only 1:1 mappings. Because that requirement fundamentally conflicts with the operations needed to do orthology conversion, this option is only valid for human datasets, not mouse.
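To see ahead of time which rows would break the 1:1 requirement, one can group the input identifiers by the symbol they map to. This is only a sketch, not GSEA's own code: `find_multimapped` is a hypothetical helper, `id_to_symbol` stands in for the ID and symbol columns of whichever chip file is used, and the identifiers below are placeholders.

```python
from collections import defaultdict

def find_multimapped(rank_ids, id_to_symbol):
    """Return {symbol: [ids]} for every symbol reached by more than one input ID."""
    by_symbol = defaultdict(list)
    for gene_id in rank_ids:
        symbol = id_to_symbol.get(gene_id)
        if symbol is not None:  # IDs absent from the mapping are skipped
            by_symbol[symbol].append(gene_id)
    return {s: ids for s, ids in by_symbol.items() if len(ids) > 1}

# Placeholder identifiers and mapping, for illustration only.
example_map = {"mouse_id_1": "RDH16", "mouse_id_2": "RDH16", "mouse_id_3": "SOX9"}
multimapped = find_multimapped(["mouse_id_1", "mouse_id_2", "mouse_id_3"], example_map)
print(multimapped)
```

Any symbol that shows up in the result has several source rows, which is the situation a remap-only run rejects even when the input identifiers themselves are unique.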

As for collapse, in the simpler case of human datasets, we offer the option (in advanced parameters) to select either "max", which keeps the gene model with the highest expression among the multiple options, or "sum", which sums all the gene models that correspond to the Gene Symbol.

This latter option, while superior for human datasets, is not a valid option for mouse datasets, again because of issues with multimapping of orthologues. In this case, you'll want to select "collapse" and leave the collapse method advanced parameter at the default (max) selection.
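The difference between the "max" and "sum" methods can be illustrated on a single column of values. This is a rough sketch under stated assumptions, not the GSEA implementation: `collapse` is a hypothetical function, and the identifiers, symbols, and values are made up.

```python
from collections import defaultdict

def collapse(values_by_id, id_to_symbol, mode="max"):
    """Collapse per-ID values to per-symbol values using 'max' or 'sum'."""
    grouped = defaultdict(list)
    for gene_id, value in values_by_id.items():
        symbol = id_to_symbol.get(gene_id)
        if symbol is not None:  # unmapped IDs are dropped
            grouped[symbol].append(value)
    reducer = {"max": max, "sum": sum}[mode]
    return {symbol: reducer(vals) for symbol, vals in grouped.items()}

# Made-up values: two gene models map to GENE_A, one to GENE_B.
values = {"id1": 5.0, "id2": 2.0, "id3": 1.5}
mapping = {"id1": "GENE_A", "id2": "GENE_A", "id3": "GENE_B"}
print(collapse(values, mapping, "max"))  # GENE_A keeps the larger value
print(collapse(values, mapping, "sum"))  # GENE_A adds both values
```

With "sum", each extra source row mapped to a symbol adds to that symbol's value, whereas "max" keeps only one of them.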

Does this make sense?

-Anthony

Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego

--
You received this message because you are subscribed to the Google Groups "gsea-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gsea-help+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gsea-help/3b24febc-d472-4fae-bb20-811afcd0c5d5%40googlegroups.com.

zcu...@gmail.com

May 15, 2020, 6:01:53 PM
to gsea-help
Thank you so much! I didn't know that each Ensembl gene ID represents a different gene model, that several of them can map to the same gene symbol, and that this is compounded by the orthology conversion from mouse to human. So just to verify, I should be able to run GSEA and GSEA Preranked using collapse with the default max option?

Anthony Castanza

May 15, 2020, 6:03:51 PM
to gsea...@googlegroups.com
Provided you supply the correct mouse orthology chip for your gene identifiers, yes.


-Anthony



benhob...@gmail.com

May 21, 2020, 3:04:05 PM
to gsea-help
Hi Anthony-

I have a quick question related to this thread. First, I wanted to say that after recently returning to GSEA after ~1 yr away, I was delighted to see the recent change to the gene orthology mapping procedure for non-human gene sets. I used the Mouse_Gene_Symbol_Remapping_to_Human_Orthologs_MSigDB.v7.1.chip provided on the website to collapse to gene symbols using the default 'Max_probe' procedure as you suggested. This is such a nice feature to have integrated into the GSEA workflow; I used to have to deal with the mapping myself... Great work from the GSEA team!

However, I am wondering, where can I find the precise mapping that was performed during this procedure? I can't seem to find it in the GSEA output; maybe I am not looking in the right place. To be clear, I can find a collapsed .rnk file in the /edb directory that has the human mappings used for the GSEA, but I cannot find the file with the final mouse -> human mappings that I would need to precisely deconvolve the human mappings back into mouse gene symbols. This is critical for further analysis of the significant gene sets coming out of the GSEA, for which I can get the exact human mappings used in my analysis from the gene_sets.gmt file found in the /edb directory.

Thanks very much
Best regards,
Ben Hobson
Columbia University

Anthony Castanza

May 21, 2020, 3:15:50 PM
to gsea...@googlegroups.com

Hi Ben,

 

Thanks for the feedback! We’re glad you found this useful! We’re also working on some additional resources to continue to improve our support for mouse datasets that should be available in the near future, so stay tuned.

 

At the moment, in order to retrieve the precise mappings, you’d need to download a copy of the chip file, but this contains all the mappings and not just the one that was specifically used for the analysis. We’re working on outputting the mapping file in the next GSEA release, but at the moment it does pretty much the same thing and just shows all the mappings for a given gene (albeit concatenated into a single row). I’ll make a note that we could improve the utility of this new file by including an additional column to specify which of the possible mappings was specifically used in “MAX” mode.

 

-Anthony


 

 

 


benhob...@gmail.com

May 21, 2020, 4:34:39 PM
to gsea-help
Hi Anthony,

I'm afraid I don't quite understand the last part of your response. I did indeed download and take a look at the CHIP file that I used for mapping, but as you stated, this contains all possible mappings. So of the five separate rows where mouse gene symbols all mapped to GSTM2, I cannot tell which rank/data was actually used in the GSEA (human GSTM2 symbol is there...). Are you saying that I currently cannot tell which mouse symbol's data was ultimately used in the GSEA for these cases, until you update it to output the mapping file? Ideally this would be an explicit two column file that contains the final set of mouse gene symbols and the corresponding human gene symbols they were mapped to for GSEA...

Thanks
Ben

Anthony Castanza

May 25, 2020, 12:17:36 PM
to gsea-help
Hi Ben,

After refreshing my memory as to how exactly the collapse function works: in its current iteration (which was designed for microarrays), GSEA can use the information from all the possible annotated orthologs to construct the "Gene" used for analysis. The Max_probe function in particular picks the max value from each ortholog for each sample independently, which, unfortunately, can result in a different original gene being picked for each sample's map to that ortholog.
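The per-sample behaviour described above can be sketched as follows. This is an illustration of the idea only, not the GSEA source: `max_per_sample` is a hypothetical function, and the gene labels and expression values are invented.

```python
def max_per_sample(rows):
    """rows: {source_gene: [value per sample]}, all mapping to one symbol.

    Returns the collapsed values and, for each sample, the source gene
    that supplied the maximum (ties go to the first gene listed).
    """
    genes = list(rows)
    n_samples = len(rows[genes[0]])
    collapsed, chosen = [], []
    for j in range(n_samples):
        best = max(genes, key=lambda g: rows[g][j])
        chosen.append(best)
        collapsed.append(rows[best][j])
    return collapsed, chosen

# Invented values: two mouse genes mapping to the same human symbol, three samples.
rows = {"mouse_gene_1": [10.0, 2.0, 7.0], "mouse_gene_2": [4.0, 9.0, 7.5]}
collapsed, chosen = max_per_sample(rows)
print(collapsed)  # values making up the collapsed "gene"
print(chosen)     # source gene used for each sample
```

In this toy input the collapsed row takes its first value from one source gene and the remaining two from the other, which is the inconsistency across samples described above.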

We recognize that this isn't ideal, and we're working on other ways to handle this so that a consistent gene is picked for all samples. However, the methods we've proposed internally are all subject to one form of bias or another (for example, picking the original gene with the maximum average across all samples would bias the dataset against downregulated genes). Most of the metrics we've considered that don't have that issue would require integrating the phenotype information into the collapse function, which creates an issue where analyses on the basis of different phenotypes could create different collapsed datasets. We're open to any suggestions for improved collapse modes!

For these and other reasons, our ortholog mapping is restricted to what we've annotated (in collaboration with MGI) as the "best" ortholog in as many cases as possible, resulting in as many 1:1 mappings as possible. We think that this builds an acceptable dataset for the purpose of modeling human biology in mouse, as the genes where we retain multimappings generally have very low homology across all genes. We're working on additional resources for native analysis of mouse data, but these aren't quite ready to be released yet.

In the meantime, we also offer an option in advanced parameters to collapse by constructing a "median" gene from the possible orthologs. This option might be preferable for the sake of consistency for data that requires ortholog conversion.
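A "median" gene can be pictured as taking, for each sample, the median across all source genes that map to the same symbol. The sketch below assumes that reading; `median_per_sample` is a hypothetical function and the values are invented.

```python
from statistics import median

def median_per_sample(rows):
    """rows: {source_gene: [value per sample]}; return the per-sample median."""
    # zip(*...) turns the per-gene rows into per-sample columns.
    return [median(column) for column in zip(*rows.values())]

# Invented values: three source genes mapping to one symbol, three samples.
rows_for_median = {
    "mouse_gene_1": [10.0, 2.0, 7.0],
    "mouse_gene_2": [4.0, 9.0, 7.5],
    "mouse_gene_3": [6.0, 3.0, 1.0],
}
median_gene = median_per_sample(rows_for_median)
print(median_gene)
```

Unlike the per-sample maximum, every sample is summarized by the same rule over the same set of source genes, so no single extreme row dominates.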

-Anthony


benhob...@gmail.com

May 26, 2020, 2:52:15 PM
to gsea-help
Hi Anthony,

What you have done makes sense. Perhaps I should have specified that I am using GSEA pre-ranked. In my case, the issue of picking different orthologs for different samples is not an issue. Since expression values are not provided (only rank values), I am not sure how max_probe would function in this regard.

"For these and other reasons, our ortholog mapping is restricted to what we've annotated (in collaboration with MGI) as the "best" ortholog in as many cases as possible, resulting in as many 1:1 mappings as possible."
-- This sounds great, I would just like to be able to access this mapping, specifically the ultimate 1:1 mappings that were used for the .rnk file I provide. I end up with gene set hits, but cannot recover the original mouse gene symbols present in the sets for downstream analysis. It seems like a dictionary or data structure containing this information must be present at some point within the mapping process before GSEA, and it would be nice to have this so I can reconvert. Maybe this is harder to output than it seems to me...
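Until such a file is available, a two-column mouse-to-human table could be approximated outside GSEA from the rank list and the chip file. The sketch below rests on an assumption that is not confirmed in this thread: that for a single-column ranked list, max mode keeps the highest-scoring row among those mapping to a symbol. `selected_sources` is a hypothetical helper and the gene labels and scores are placeholders.

```python
def selected_sources(ranks, id_to_symbol):
    """Return {human_symbol: (mouse_gene, score)} keeping the top-scoring row.

    ASSUMPTION: 'max' on a ranked list keeps the row with the largest score.
    On ties, the first row encountered is kept.
    """
    best = {}
    for mouse_gene, score in ranks.items():
        symbol = id_to_symbol.get(mouse_gene)
        if symbol is None:  # gene not present in the mapping
            continue
        if symbol not in best or score > best[symbol][1]:
            best[symbol] = (mouse_gene, score)
    return best

# Placeholder labels and scores, for illustration only.
ranks = {"mouse_gene_a": 1.2, "mouse_gene_b": 3.4, "mouse_gene_c": 2.0}
to_human = {"mouse_gene_a": "HUMAN_X", "mouse_gene_b": "HUMAN_X", "mouse_gene_c": "HUMAN_Y"}
for human, (mouse, score) in selected_sources(ranks, to_human).items():
    print(human, mouse, score, sep="\t")
```

Comparing the scores in the result against the collapsed .rnk file in the /edb directory would be one way to check whether the assumption holds for a given run.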

Best,
Ben

Charles Abrams

Jun 29, 2022, 4:26:11 PM
to gsea-help
I have noted that my 19,000 mouse Ensembl gene IDs generated with RNA-seq are being collapsed into just 561 genes. This seems excessive and is making it hard to identify any significant enrichment. Am I doing something wrong?

Anthony Castanza

Jun 29, 2022, 4:33:41 PM
to gsea...@googlegroups.com

Hello,

 

I would agree that a collapse from 19,000 to just 561 genes is excessive and unlikely to be correct. I would suspect that something has gone wrong here causing the genes to not be recognized correctly. Did you perhaps accidentally select the Mouse_Gene_Symbol chip file?

If your dataset is in Mouse Ensembl IDs, you’ll need to select the Mouse_ENSEMBL_Gene_ID_Human_Orthologs_MSigDB.v7.5.1.chip from the GSEA chip platform drop-down. If this latter file is the file you selected, we might be able to figure out what is going on by taking a closer look at your input file.
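A quick way to tell whether the wrong chip file was selected is to count how many dataset identifiers appear in the chip file at all. This is a minimal sketch: `chip_coverage` is a hypothetical helper, and both ID lists are placeholders standing in for the dataset's ID column and the chip file's ID column.

```python
def chip_coverage(dataset_ids, chip_ids):
    """Return (matched, total, fraction) of dataset IDs found in the chip IDs."""
    chip_set = set(chip_ids)
    matched = sum(1 for gene_id in dataset_ids if gene_id in chip_set)
    total = len(dataset_ids)
    fraction = matched / total if total else 0.0
    return matched, total, fraction

# Placeholder identifiers, for illustration only.
dataset = ["ENSMUSG00000000001", "ENSMUSG00000000002", "ENSMUSG00000000003", "Sox9"]
chip = ["ENSMUSG00000000001", "ENSMUSG00000000003", "ENSMUSG00000000099"]
matched, total, fraction = chip_coverage(dataset, chip)
print(f"{matched} of {total} dataset IDs found in the chip ({fraction:.0%})")
```

A very low fraction would suggest an identifier-type mismatch between the dataset and the chip rather than a genuine biological collapse.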

 

If you’re willing to share the data file you’re experiencing this issue with you can send it confidentially to gsea...@broadinstitute.org

 

Additionally, in the future we’d ask that you please create a new post for issues that are unconnected to the original poster.

 

Thanks,

 

-Anthony

 


 


David Eby

Jun 29, 2022, 6:22:16 PM
to gsea...@googlegroups.com
Hi Charles,

Another possibility is that some or most of the Ensembl gene IDs still carry version suffixes (for ENSMUSG00000012345.6, the trailing '.6'). These need to be stripped in order to match during the collapse.
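Stripping the trailing suffix can be done with a small regular expression before the file is loaded into GSEA. `strip_version` below is a hypothetical helper name; the example ID is the one from the message above.

```python
import re

# Matches a final dot followed by digits, e.g. ".6" at the end of an ID.
_VERSION_SUFFIX = re.compile(r"\.\d+$")

def strip_version(ensembl_id):
    """Remove a trailing '.<digits>' suffix from an Ensembl ID, if present."""
    return _VERSION_SUFFIX.sub("", ensembl_id)

print(strip_version("ENSMUSG00000012345.6"))
print(strip_version("ENSMUSG00000012345"))
```

IDs without a suffix pass through unchanged, so the function can be applied to the whole ID column.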

Yet another possibility is that the majority are transcript IDs instead of gene IDs.
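Mouse Ensembl gene IDs start with ENSMUSG and transcript IDs with ENSMUST, so tallying the prefixes is a simple check. `id_types` is a hypothetical helper and the IDs below are placeholders.

```python
from collections import Counter

def id_types(ids):
    """Count how many IDs look like mouse Ensembl genes, transcripts, or other."""
    def kind(identifier):
        if identifier.startswith("ENSMUSG"):
            return "gene"
        if identifier.startswith("ENSMUST"):
            return "transcript"
        return "other"
    return Counter(kind(identifier) for identifier in ids)

# Placeholder identifiers, for illustration only.
counts = id_types(["ENSMUSG00000000001", "ENSMUST00000000001", "ENSMUST00000000002", "Sox9"])
print(dict(counts))
```

If most rows fall under "transcript", the quantification would need to be summarized to gene level before using a gene ID chip file.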

