Questions for Running GSEA using Proteomics Data

Sarah Philippi

Apr 12, 2022, 5:37:43 PM

to gsea-help

Hi there!

I have human proteomics data that I'd like to run in GSEA. I have the raw expression data for each gene/protein in each patient's sample, but I'm not sure if this is the appropriate data to use in GSEA.

For RNA-seq analyses I have read that you must use DESeq so that the values are normalized for between-sample comparisons.

The company that ran the patients' proteomics samples (SomaLogic) did apply a normalization of some kind, but I don't know if it is comparable to the DESeq normalization of RNA-seq data. In other words, is my data actually normalized and okay to run in GSEA?

Additionally, I am having trouble deciding what gene set to use as my background for the proteomics data. As it stands, I only have data for ~1,300 proteins, so I'm not sure if there is a gene set out there that I can use or if I need to make my own?

I know not everyone does proteomics work so I appreciate any guidance you may have. Thank you!

Sarah Philippi

Apr 12, 2022, 6:09:27 PM

to gsea-help

I just wanted to provide a small clarification/update: the values that SomaLogic produces for these proteomic measurements are Relative Fluorescence Units (RFUs).

Anthony Castanza

Apr 12, 2022, 7:32:58 PM

to gsea...@googlegroups.com

Hi Sarah,

Data doesn't need to be normalized with DESeq to be appropriate for GSEA. The key is that it needs to be normalized in such a way that between-sample comparisons can be performed. This is the case for metrics like median-of-ratios normalized counts for RNA-seq, but not for something like TPM, which is normalized for within-sample comparisons.
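For readers who want to see what a between-sample normalization does, here is a minimal Python sketch of the median-of-ratios idea. It is only an illustration, not DESeq's actual implementation; the function name and the toy count matrix are made up for this example.

```python
import numpy as np

def median_of_ratios_size_factors(counts):
    """counts: genes x samples array of raw counts. Returns one size factor per sample."""
    counts = np.asarray(counts, dtype=float)
    # Keep only genes with nonzero counts in every sample so the logs are finite.
    expressed = np.all(counts > 0, axis=1)
    log_counts = np.log(counts[expressed])
    # The per-gene geometric mean across samples acts as a pseudo-reference sample.
    log_geo_means = log_counts.mean(axis=1, keepdims=True)
    # Size factor = median, over genes, of the ratio (sample / pseudo-reference).
    return np.exp(np.median(log_counts - log_geo_means, axis=0))

# Toy data (hypothetical): sample 2 was sequenced about twice as deeply as sample 1.
raw = np.array([
    [10, 20],
    [100, 200],
    [50, 100],
    [0, 4],
])
size_factors = median_of_ratios_size_factors(raw)
normalized = raw / size_factors  # values are now comparable between samples
print(size_factors)
print(normalized)
```

After dividing by the size factors, the first three genes have equal values in both samples, which is the property that matters for comparing samples against each other.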

Based on the information in this paper about the SOMAscan assay (https://www.nature.com/articles/s41598-017-14755-5), assuming similar methods were followed for normalization, the intensities should be comparable. That said, it would be worthwhile to reach out to whoever did the data normalization and confirm that it is appropriate for differential expression analysis as-is.

The larger issue is the number of proteins assayed. GSEA is designed to run on expression data for all genes; datasets with only 1,000-2,000 genes will frequently not have enough information available to accurately assess gene sets. As such, we don't really recommend running GSEA on this sort of data.

GSEA doesn't use separate sets as a background. While there are some gene sets, such as housekeeping genes, that can be used as a reference for a set that should not be differentially expressed, GSEA generally works through an empirical null distribution model: either samples or genes are randomly permuted, and random enrichment scores for sets that shouldn't be enriched are calculated to determine how likely an observed enrichment is. The background used for this calculation is the input gene list. If a gene is not in the dataset you provide, it is excluded from all gene sets and calculations. This causes problems when so many genes are unavailable for sets that the set is no longer a meaningful representation of the annotation.
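The empirical null idea can be sketched in a few lines of Python. This is a simplified illustration, not the GSEA software's code: it uses a weighted running-sum enrichment score, draws random gene sets of the same size from the input gene list as the null, and computes a simplified two-sided p-value. All function names, gene names, and scores below are hypothetical.

```python
import numpy as np

def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Running-sum enrichment score over genes already sorted by score (descending)."""
    scores = np.asarray(scores, dtype=float)
    in_set = np.array([g in gene_set for g in ranked_genes])
    n_hits = int(in_set.sum())
    if n_hits == 0 or n_hits == len(ranked_genes):
        raise ValueError("gene set must overlap the dataset without covering all of it")
    # Step up at genes in the set (weighted by |score|), step down at all other genes.
    hit_weights = np.where(in_set, np.abs(scores) ** weight, 0.0)
    step_up = hit_weights / hit_weights.sum()
    step_down = np.where(in_set, 0.0, 1.0 / (len(ranked_genes) - n_hits))
    running = np.cumsum(step_up - step_down)
    # The score is the largest deviation of the running sum from zero.
    return running[np.argmax(np.abs(running))]

def gene_permutation_p_value(ranked_genes, scores, gene_set, n_perm=200, seed=0):
    """Simplified empirical p-value: random same-size sets drawn from the input list."""
    rng = np.random.default_rng(seed)
    observed = enrichment_score(ranked_genes, scores, gene_set)
    set_size = sum(g in gene_set for g in ranked_genes)
    null = np.empty(n_perm)
    for i in range(n_perm):
        idx = rng.choice(len(ranked_genes), size=set_size, replace=False)
        random_set = {ranked_genes[j] for j in idx}
        null[i] = enrichment_score(ranked_genes, scores, random_set)
    return (np.sum(np.abs(null) >= abs(observed)) + 1) / (n_perm + 1)

# Hypothetical ranked list of 20 genes; the set of interest sits at the very top.
genes = [f"gene{i}" for i in range(20)]
scores = np.linspace(2.0, -2.0, 20)
top_set = {"gene0", "gene1", "gene2"}
es = enrichment_score(genes, scores, top_set)
p = gene_permutation_p_value(genes, scores, top_set, seed=1)
print(es, p)
```

Because the null is built from random draws, changing `seed` can change the p-value slightly. Note also that the random sets are drawn only from the genes present in the input, which is why the input gene list acts as the background.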

Unfortunately I can't really say if this assay is appropriate because I don't know enough about how genes were selected for inclusion. If it was designed in such a way as to try to get an unbiased sampling of the genome, it may be possible to get some meaningful enrichment results, but you would need to pay close attention to how many genes were in each set originally and how many remain in the set after background filtering. This information is made available on the GSEA enrichment report page if you want to give it a try anyway.
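The same before/after comparison can be approximated outside GSEA. The Python sketch below counts how many members of each gene set survive filtering to the genes in a dataset. The set names, set contents, and dataset gene list are invented for illustration, and the function is not part of GSEA.

```python
def gene_set_coverage(gene_sets, dataset_genes):
    """Return {set name: (original size, size after filtering, fraction retained)}."""
    dataset = set(dataset_genes)
    report = {}
    for name, genes in gene_sets.items():
        original = set(genes)
        kept = original & dataset  # genes absent from the dataset are dropped
        fraction = len(kept) / len(original) if original else 0.0
        report[name] = (len(original), len(kept), fraction)
    return report

# Hypothetical gene sets and a hypothetical small assay panel.
gene_sets = {
    "HYPOTHETICAL_SET_A": ["TP53", "MDM2", "CDKN1A", "BAX"],
    "HYPOTHETICAL_SET_B": ["ALB", "APOA1"],
}
dataset_genes = ["TP53", "BAX", "ALB", "APOA1", "IL6"]

coverage = gene_set_coverage(gene_sets, dataset_genes)
for name, (before, after, fraction) in coverage.items():
    print(f"{name}: {after}/{before} genes retained ({fraction:.0%})")
```

Sets that keep only a small fraction of their members are the ones whose results deserve the most caution.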

-Anthony

Anthony S. Castanza, PhD

Curator, Molecular Signatures Database

Mesirov Lab, Department of Medicine

University of California, San Diego

To view this discussion on the web visit https://groups.google.com/d/msgid/gsea-help/1e6569e2-d930-480f-8b0a-5107be0ef659n%40googlegroups.com.

Sarah Philippi

Apr 13, 2022, 10:21:56 AM

to gsea-help

Hi Anthony,

Thank you for the detailed information. It sounds like I should probably look into analyses other than GSEA with this type of dataset. I can also look more into how the SomaScan assay makes these decisions, but to my knowledge it just has x aptamers available, and according to their website that allows for detection of 7,000 proteins. My assumption is that measurements exist only for proteins where good aptamers have been developed, so the assay isn't a complete measurement of all proteins within the proteome but provides as much coverage as possible.

That being said, I do have a bulk mouse hippocampal RNA-seq dataset with DESeq information and would also plan to run that in GSEA. I'm still not sure I understand how to pick the best reference/chip platform. I've looked online at some of the User Guide's descriptions and I'm still not positive. Do you have a simpler explanation for how to make this decision?

Best,

Sarah

Anthony Castanza

Apr 13, 2022, 6:49:09 PM

to gsea-help

Hi Sarah,

The chip should be chosen based on the identifiers that are in the data file you're intending to use. If they look something like "ENSMUSG0000012345", then the chip would be "Mouse_Ensembl_Gene_ID_[....].chip". If the data is in mouse gene symbols already, then the Mouse_Gene_Symbol_Remapping_[....].chip file would probably be correct. If you send me a sample of the IDs from your input file, I can tell you specifically which one you should use.
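A quick way to check which kind of identifier a data file contains is to test a sample of IDs against a few patterns. The Python sketch below is a rough heuristic, not anything GSEA provides: the labels and function names are made up, the Ensembl patterns are deliberately lenient, and anything unrecognized is simply reported as a possible gene symbol. It reports an identifier type only; picking the matching chip file is still up to you.

```python
import re
from collections import Counter

# Heuristic patterns (assumptions for illustration, checked in order).
PATTERNS = [
    ("mouse Ensembl gene ID", re.compile(r"^ENSMUSG\d+(\.\d+)?$")),
    ("human Ensembl gene ID", re.compile(r"^ENSG\d+(\.\d+)?$")),
    ("UniProt accession", re.compile(
        r"^([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2})$")),
]

def classify_identifier(identifier):
    """Return a label for the first pattern the identifier matches."""
    for label, pattern in PATTERNS:
        if pattern.match(identifier):
            return label
    return "other (possibly a gene symbol)"

def dominant_identifier_type(identifiers):
    """Return the most common identifier type in a sample of IDs."""
    counts = Counter(classify_identifier(i.strip()) for i in identifiers)
    return counts.most_common(1)[0][0]

sample_ids = ["ENSMUSG00000012345", "ENSMUSG00000000001", "Trp53"]
print(dominant_identifier_type(sample_ids))
```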

-Anthony


Sarah Philippi

Apr 14, 2022, 11:25:57 AM

to gsea-help

Hi Anthony,

It sounds like I am picking the chip correctly, I believe!

1) For my human proteomics data, I've been picking the Human_Uniprot_IDs_MSigDB.v7.5.1.chip because I have UniProt IDs

2) For mouse RNA-seq I've selected the Mouse_ENSEMBL_Gene_ID_Human_Orthologs_MSigDB.v7.5.1.chip since I have Ensembl IDs

I have what is probably a very simple question in addition to this: how does the "gene sets" file differ from the chip file? I've tried running GSEA and I don't remember specifying a gene set outside of what is included in my expression data file, but the user guide makes it sound like there is an additional list of genes to incorporate. Additionally, online it says this gene set is automatically identified as HGNC symbols, and my understanding is that I'd have to convert the gene set since the identifiers in my two examples above are not HGNC symbols?

Sorry for the very simple, technical questions, I just want to be sure my parameters are correct before making conclusions!

Best,

Sarah

Anthony Castanza

Apr 14, 2022, 1:14:52 PM

to gsea-help

Hi Sarah,

GSEA takes a simple list of genes (e.g. a molecular pathway) and tests within a ranked dataset if that pathway is overrepresented at the top or bottom of the list. The "gene sets database" file contains those gene sets/molecular pathways that you want to test against your data using GSEA, e.g. Reactome pathways, GO Terms, MSigDB Hallmarks, etc. MSigDB maintains collections of these resources in the format to use with GSEA. The files we provide have the genes represented in HGNC Gene Symbols and are version matched to the provided CHIP files. You can't run GSEA without specifying the sets you want to test your dataset against, but you don't necessarily have to use the sets we provide. For example, instead of a canonical database you could format the top DEGs (differentially expressed genes) from an independent experiment in the gene set database format (GMT/GMX) and test that against your data, but in that case you would need to make sure the gene symbols match those being used in your data.
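As a rough illustration of the custom gene set case, the Python sketch below formats gene lists as GMT text, where each line holds a set name, a description, and then the member genes, all separated by tabs. The helper names, set name, description, and gene list are hypothetical examples rather than anything from MSigDB.

```python
def to_gmt_text(gene_sets):
    """gene_sets: {set name: (description, [genes])} -> GMT-formatted text."""
    lines = []
    for name, (description, genes) in gene_sets.items():
        # One gene set per line: name, description, then genes, tab-separated.
        lines.append("\t".join([name, description, *genes]))
    return "\n".join(lines) + "\n"

def parse_gmt_text(text):
    """Inverse of to_gmt_text, useful for checking a file before loading it."""
    parsed = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, description, *genes = line.split("\t")
        parsed[name] = (description, genes)
    return parsed

# Hypothetical set built from the top DEGs of an independent experiment.
custom_sets = {
    "MY_EXPERIMENT_TOP_UP": ("top upregulated genes (example)", ["TP53", "MDM2", "BAX"]),
}
gmt_text = to_gmt_text(custom_sets)
print(gmt_text, end="")
```

The text can be saved to a file with a `.gmt` extension. The gene identifiers in it need to match the identifiers GSEA ends up using for your dataset.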

Does that make sense?

-Anthony


Sarah Philippi

Apr 14, 2022, 5:17:59 PM

to gsea-help

That explanation does make sense, but what I'm a little confused on is where this gene set is incorporated into the analysis? Do you happen to have a screenshot of where you are loading this gene set? The only instance of loading an external file I can think of is when I load the expression and phenotype files from my data and when I select the chip array to be used as we discussed earlier.

Additionally, this means a person would run GSEA, on their same dataset, n number of times because you might have several different interests related to the gene set (like looking at x, y, z, pathways from KEGG, for instance). Is that correct?

Anthony Castanza

Apr 14, 2022, 5:30:45 PM

to gsea-help

The gene sets are selected from the "gene sets database" dialogue in the Run GSEA window. This is the box immediately under the "expression dataset" box on the Run page. If you wanted to load your own custom gene sets instead of the ones we provide, they can be loaded through the same function you used to load your dataset and cls file; they would then appear as options under the gene sets database dialogue. You don't need to do this if you want to use our sets. Our sets do not have to be loaded separately and can simply be selected from the options presented if you click on the [...] button next to that box.

You would not need to run GSEA n number of times. Multiple gene sets are tested simultaneously, and GSEA returns both set-independent p-values and global FDRs.

You would generally want to run different collections of sets separately (i.e. you would want to run KEGG and Reactome separately so the KEGG sets don't influence the FDRs of the Reactome sets and vice versa), but you would not want to run each of the Reactome sets individually.

Let me know if you have additional questions

-Anthony


Sarah Philippi

Apr 15, 2022, 10:04:59 AM

to gsea-help

I see, I've been using the hallmarks as a default while I've been doing this troubleshooting and didn't fully realize that was where my gene set variability could be because I've been using the same gene set each time. Sorry about that, but thank you for clearing that up for me!

I do have a follow-up question regarding the "collections" you're describing. When I go on MSigDB, I see that I can look at the KEGG curated gene set, but when I click on this I'm also directed to ~186 individual gene sets within KEGG. This is why I was under the impression I would need to run GSEA n number of times in order to test the gene sets from KEGG I'm interested in, but what you're saying makes it sound like I should technically be using a collection of gene sets, rather than an individual gene set from KEGG, is that correct?

I've attached an image of the gene set selection from GSEA that I would pick for KEGG and then the individual gene sets within KEGG available according to MSigDB.

Anthony Castanza

Apr 15, 2022, 2:37:52 PM

to gsea-help

Hallmarks, like KEGG, is a collection - it contains 50 signatures for various important cellular processes. When you select the KEGG (or Hallmarks, or any other) collection in that GSEA window you are performing enrichment analysis of all the sets that live in that collection. You do not need to run sets individually, and running sets individually will give invalid FDRs as there aren't other sets to compute a false discovery rate against.

I'm not sure what you mean by "gene set variability". If you mean the slight run-to-run variability in the results of GSEA, that is because of the way GSEA uses random permutations to generate the null distribution.

-Anthony


Sarah Philippi

Apr 15, 2022, 3:11:32 PM

to gsea-help

Sorry, I apologize if I've misled you with my comment on "gene set variability". I meant variability in the sense that I can run different gene sets, like KEGG or Hallmarks, at that specified location.

Thank you for clearing up my confusion on selecting the gene sets. It makes sense that you'd only want to run collections of gene sets, rather than the gene sets individually within that collection.

And to reiterate, do I need to adjust for anything with the gene set to avoid issues with the HGNC symbols and my data? If my data is uniprot IDs, for instance, and I select the CHIP for uniprot IDs, and the curated KEGG symbols gene set available on GSEA, I do not need to do any adjustments because it would be version matched to the CHIP automatically by GSEA?

Anthony Castanza

Apr 15, 2022, 3:20:54 PM

to gsea-help

All our chip files are versioned so that they work seamlessly with the same version of MSigDB. So if you're using v7.5.1 of the Uniprot ID chip, it will map your dataset to the exact same HGNC symbols as used in the v7.5.1 MSigDB KEGG collection. You shouldn't need to do anything else.

-Anthony


Sarah Philippi

Apr 16, 2022, 7:35:51 PM

to gsea-help

Thank you, Anthony! That was my understanding, but it's been wonderful being able to double-check some of these things with you.

My last question: I just wanted to confirm, based on the user guide, that expression data should not be manipulated prior to GSEA (i.e. if using DESeq output from RNA-seq it should be just that, not ranked in some way based on fold change, etc.)?

However, I do see the option in the user guide to use a "preranked" list. Can you describe in what scenarios this would be suggested, and how exactly it is beneficial? Is it something where I should consider doing both the non-ranked and the pre-ranked analyses?

Anthony Castanza

Apr 18, 2022, 12:50:10 PM

to gsea...@googlegroups.com

For RNA-seq the data should generally be normalized for between-sample comparisons, e.g. with the median-of-ratios method from DESeq2 (can be retrieved by exporting the counts(dds, normalized=TRUE) table), the TMM method, or some other appropriate normalization. FPKM/RPKM/TPM are not appropriate methods of normalization for standard GSEA.

GSEA Preranked is provided as an option for cases where someone has computed statistics externally, for example on a complicated experiment that needed particular confounding variable correction, and they want to run GSEA on the results. It is also useful if someone just has a Log2FC list from a small experiment that we can't support (GSEA's default statistics require a minimum of 3 samples per group, so if you wanted to do a 1v1 or 2v2 comparison you'd need to rank it yourself and then provide that ranking to GSEA Preranked). It's mainly a mode we make available to provide greater flexibility to our users.
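For the small-experiment case, a ranked list can be prepared with a few lines of Python. The sketch below sorts genes by an externally computed score and formats them as two tab-separated columns (gene, score), which is the layout of a ranked list (RNK) file. The function name, gene names, and Log2FC values are invented for illustration.

```python
def to_rnk_text(gene_scores):
    """gene_scores: {gene: score} -> tab-separated 'gene<TAB>score' lines, highest first."""
    ranked = sorted(gene_scores.items(), key=lambda item: item[1], reverse=True)
    return "".join(f"{gene}\t{score}\n" for gene, score in ranked)

# Hypothetical Log2FC values from a 1v1 comparison.
log2_fold_changes = {"GENE_A": -1.5, "GENE_B": 2.0, "GENE_C": 0.25}
rnk_text = to_rnk_text(log2_fold_changes)
print(rnk_text, end="")
```

The ranking metric is a choice made by whoever prepares the list; Log2FC is used here only because it is the example given above.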


Sarah Philippi

Apr 20, 2022, 9:54:10 AM

to gsea-help

Hi Anthony,

Thank you for the information, it's very appreciated!
