GSEA for SomaScan assay data

Alice Piller

Apr 16, 2024, 3:18:35 AM

to gsea-help

Hi,

I am carrying out a GSEA on SomaLogic proteomic data comprising 5012 proteins, looking for differential enrichment in two phenotypes, one with 17 samples and one with 55. I am using the UniProt MSigDB chip array.

With the per-group sample size greater than 7 and the objective being to discern differentially enriched pathways between the phenotypes, the 'phenotype' permutation type seemed suitable. However, the analysis returned very high FDR values, none below the suggested threshold of 0.25. The gene_set permutation type returned more significant results, with a few hits having an FDR below the recommended threshold of 0.05.

I have fiddled with some of the other parameters with some reduction in FDR, for example, changing the Enrichment Statistic to weighted_p2. I thought that setting the Randomization Mode to equalize_and_balance would be more appropriate since the sample sizes are quite different but this increased the FDR values substantially. Are there any parameters that you suggest changing from the default for my specific case?

What I suspect may be affecting the FDR values is that I am casting the net too broadly over all gene sets. Is this suspicion valid and if so, how significant is this effect roughly?

I'd greatly appreciate any other input you have for this analysis. Otherwise, great software with comprehensive outputs.

Thank you in advance for your help in this matter.

Anthony Castanza

Apr 16, 2024, 2:13:18 PM

to gsea-help

Hi Alice,

If you are running all of MSigDB together in a single run then yes, this is likely to significantly and adversely affect the FDR calculation. I wouldn't recommend running more than one collection at a time.
I also wouldn't typically recommend setting weighted_p2; this causes each hit (a gene in the set) to contribute its rank metric raised to the power of 2 to the score. It is, in my opinion, over-weighted.
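
To make the effect of the enrichment statistic setting concrete, here is a minimal Python sketch of the weighted running-sum statistic from the published GSEA method, in which each hit contributes |rank metric|^p (p = 0 for classic, 1 for weighted, 2 for weighted_p2). The toy ranked list, gene names, and function name are invented for illustration, and this is not the GSEA source code.

```python
def enrichment_score(ranked, gene_set, p=1.0):
    """Return the signed maximum deviation of the weighted running sum.

    ranked:   list of (gene, metric) pairs sorted by metric, descending.
    gene_set: set of gene names.
    p:        exponent applied to |metric| for hits
              (0 = classic, 1 = weighted, 2 = weighted_p2).
    """
    hit_weights = [abs(m) ** p for g, m in ranked if g in gene_set]
    total_hit = sum(hit_weights)
    n_miss = sum(1 for g, _ in ranked if g not in gene_set)
    if total_hit == 0 or n_miss == 0:
        raise ValueError("need at least one weighted hit and one miss")

    running, best = 0.0, 0.0
    for gene, metric in ranked:
        if gene in gene_set:
            running += abs(metric) ** p / total_hit  # step up on a hit
        else:
            running -= 1.0 / n_miss                  # step down on a miss
        if abs(running) > abs(best):
            best = running
    return best


# Toy ranked list (hypothetical genes and metric values).
toy_ranked = [("A", 3.0), ("B", 2.0), ("C", 1.5), ("D", 1.0), ("E", 0.5),
              ("F", -0.5), ("G", -1.0), ("H", -1.5), ("I", -2.0), ("J", -3.0)]
toy_set = {"A", "C", "H"}

for p in (0, 1, 2):
    print(f"p={p}: ES={enrichment_score(toy_ranked, toy_set, p):.3f}")
```

On this toy list the set's top-ranked member accounts for a growing share of the running sum as p increases, which is the over-weighting concern with weighted_p2.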

In this case I probably would set the equalize_and_balance parameter because of the imbalance, as you noted, with phenotype permutation mode. With as many samples as you have, you might try increasing the number of permutations to 10,000 with these modes. If you are still having significance issues when running individual collections with these settings, please let me know and we can investigate other potential solutions.
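
For anyone scripting this rather than using the desktop application, the suggested settings could be assembled roughly as follows. This is a sketch under assumptions: the flag names (-permute, -rnd_type, -scoring_scheme, -nperm and so on) are recalled from the GSEA 4.x command-line tool and should be checked against its help output, and all file names are placeholders.

```python
import math
import shlex


def build_gsea_args(res, cls, gmx, chip, out_dir, nperm=10000):
    """Assemble an argument list for the GSEA CLI (flag names are assumptions)."""
    return [
        "gsea-cli.sh", "GSEA",
        "-res", res,                          # expression (protein abundance) matrix
        "-cls", cls,                          # phenotype labels
        "-gmx", gmx,                          # ONE gene set collection per run
        "-chip", chip,                        # identifier-to-symbol annotation
        "-permute", "phenotype",
        "-rnd_type", "equalize_and_balance",
        "-scoring_scheme", "weighted",        # default weighting, not weighted_p2
        "-nperm", str(nperm),
        "-out", out_dir,
    ]


# Placeholder file names, not taken from the thread.
args = build_gsea_args("somascan.gct", "phenotypes.cls", "hallmark.gmt",
                       "uniprot.chip", "gsea_out")
print(shlex.join(args))

# Distinct ways to assign 17 of the 72 samples to the smaller group.
n_label_assignments = math.comb(72, 17)
print(n_label_assignments > 10000)
```

The last lines only check that a 17 versus 55 design allows far more distinct label assignments than 10,000, so the larger permutation count is not limited by the sample sizes.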

-Anthony

Anthony S. Castanza, PhD

Curator, Molecular Signatures Database

Mesirov Lab, Department of Medicine

University of California, San Diego

Alice Piller

Apr 18, 2024, 3:49:16 AM

to gsea-help

Hi Anthony,

My mistake. I am not using all gene sets; I am using the hallmark collection, which I understand is suggested for initial analysis and to guide further research. This is already quite a small collection, so maybe it wouldn't have inflated the FDR as much as I thought. My approach is to investigate the more significant pathways and use them to guide the curation of a collection. I've seen this done in a few papers, but I'm not sure how it's done. Is the idea to extract gene sets related to a particular biological phenomenon, such as thrombosis or B cell proliferation, from a series of relevant collections, and then to use this collection in the analysis? I've seen quite a few posts on here about using your own set of gene sets. Is that a common or suggested approach to gain finer resolution of enriched pathways following an analysis with the hallmark collection? If so, do you have any suggestions regarding this approach?

Interestingly, I used equalize_and_balance on my uneven phenotype comparison (n = 17 and 55) and the FDR values increased. I then used equalize_and_balance on another, more even phenotype comparison (n = 37 and 35) and the FDR values were lower. Both comparisons use the same data from the same cohort.

Thank you for your suggestions. They were very helpful. For my uneven phenotype comparison (the one in my first post), I am still not getting significant results, with the lowest FDR being around 0.6 on equalize_and_balance and 0.36 on no_balance. For my more even phenotype comparison, I'm getting a few significant results, with the lowest FDR being around 0.2 on equalize_and_balance and 0.225 on no_balance.

I have another (possibly naive) question on the interpretation of enrichment plots like the one below. It is my understanding that this means the interferon gamma response pathway is significantly upregulated in the right-hand 'HighNeut' group. Is that to say it is significantly downregulated in the left-hand 'LowNeut' group?

Thank you for your help. I really appreciate it.

Anthony Castanza

Apr 18, 2024, 5:51:21 PM

to gsea...@googlegroups.com

Hi Alice,

Running against the hallmarks collection is pretty much a best-case scenario with regard to inter-set redundancy (or lack thereof), as this collection was designed to minimize it. So high FDRs, particularly when permuting, are likely the result of some hidden feature in your dataset. Have you done any kind of hierarchical clustering on your samples to see if there is a sample subset that looks more like the other phenotype?
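
One quick way to try the clustering check suggested here, as a sketch: average-linkage hierarchical clustering of samples on 1 minus Pearson correlation, written in plain Python so it runs without extra packages. The sample names and values are toy data, with one sample deliberately labelled "low" but given a "high"-like profile; real data would normally go through a library such as SciPy.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (non-constant)."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den


def cluster_samples(profiles, n_clusters=2):
    """Agglomerative clustering with average linkage on correlation distance.

    profiles: dict mapping sample name -> list of protein measurements.
    Returns a list of clusters, each a sorted list of sample names.
    """
    names = list(profiles)
    dist = {(a, b): 1.0 - pearson(profiles[a], profiles[b])
            for a in names for b in names if a != b}

    def linkage(c1, c2):
        # Average pairwise distance between members of two clusters.
        return mean(dist[a, b] for a in c1 for b in c2)

    clusters = [[n] for n in names]
    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [sorted(c) for c in clusters]


# Toy data: "low_3" carries the low label but resembles the high samples.
toy_profiles = {
    "high_1": [5.0, 4.8, 1.0, 1.2],
    "high_2": [5.2, 5.1, 0.9, 1.1],
    "low_1": [1.0, 1.1, 5.0, 4.9],
    "low_2": [0.9, 1.3, 5.2, 5.0],
    "low_3": [4.9, 5.0, 1.1, 1.0],
}
toy_clusters = cluster_samples(toy_profiles)
print(sorted(toy_clusters))
```

A sample that lands in the cluster dominated by the other phenotype label is the kind of hidden structure that can weaken phenotype permutation results.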

Would you be willing to share a screenshot of the heatmap that accompanies one of your top hit gene sets?

As for creating your own collection from identified sets of interest, this can definitely give you a finer-grained look at the perturbed processes in your samples, but in this case the FDRs can be even more unreliable due to the redundancy issue. One way to identify candidate sets is to take your top enriched sets, click the link to navigate to their set page on the MSigDB site, then use the "investigate gene sets" tool to find other sets in other collections that overlap with the set of interest.
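
The overlap search can also be approximated locally once the relevant collections are downloaded as GMT files (tab-separated: set name, description, then member genes). The sketch below ranks sets by the number of genes shared with a set of interest; it is not the implementation behind the MSigDB web tool, and the set names and memberships are toy values.

```python
def parse_gmt(text):
    """Parse GMT text into {set_name: set_of_genes}."""
    gene_sets = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = {g for g in genes if g}
    return gene_sets


def rank_by_overlap(query, gene_sets, min_shared=1):
    """Return (name, n_shared, jaccard) tuples, largest overlap first."""
    rows = []
    for name, genes in gene_sets.items():
        shared = query & genes
        if len(shared) >= min_shared:
            jaccard = len(shared) / len(query | genes)
            rows.append((name, len(shared), round(jaccard, 3)))
    return sorted(rows, key=lambda r: (-r[1], r[0]))


# Toy GMT content (hypothetical set names and memberships).
toy_gmt = (
    "TOY_SET_A\tna\tSTAT1\tIRF1\tCXCL10\tGBP1\n"
    "TOY_SET_B\tna\tSTAT1\tIRF1\tIL6\n"
    "TOY_SET_C\tna\tTP53\tMDM2\n"
)
set_of_interest = {"STAT1", "IRF1", "CXCL10", "PSMB9"}

overlaps = rank_by_overlap(set_of_interest, parse_gmt(toy_gmt))
for row in overlaps:
    print(row)
```

Sets that share most of their genes with the set of interest are also the ones that add redundancy to a custom collection, so the same table helps decide which candidates to leave out.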

With regard to the question about the enrichment score in that plot, the score is negative because its sign reflects the direction of the comparison, but the interpretation of the score as being "positively enriched in the right-hand 'HighNeut' group" is correct.
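
A small numeric illustration of the sign convention, assuming the default Signal2Noise ranking metric and a comparison set up as LowNeut versus HighNeut (LowNeut as the first class). The values are invented, and GSEA's adjustment for very small standard deviations is left out.

```python
from statistics import mean, stdev


def signal_to_noise(class_a, class_b):
    """(mean_A - mean_B) / (sd_A + sd_B); positive means higher in class A."""
    return (mean(class_a) - mean(class_b)) / (stdev(class_a) + stdev(class_b))


# Hypothetical abundances of one protein from the gene set.
low_neut = [1.0, 1.2, 0.8]
high_neut = [2.0, 2.3, 1.7]

s2n_low_vs_high = signal_to_noise(low_neut, high_neut)
s2n_high_vs_low = signal_to_noise(high_neut, low_neut)
print(s2n_low_vs_high, s2n_high_vs_low)
```

A protein that is higher in HighNeut gets a negative metric in this orientation and sits near the bottom of the ranked list, so a set full of such proteins gets a negative enrichment score. Reversing the comparison flips the sign but not the biology: the statement is always relative, higher in one group compared with the other.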

Let me know if you have other questions!

-Anthony

Alice Piller

Apr 19, 2024, 5:12:34 AM

to gsea-help

Hi Anthony,

Thank you for your feedback. I haven't carried out any formal hierarchical clustering. I've mainly been looking at the overall heatmap while considering other phenotypes and seeing if they may be contributing to the pattern. I've attached the heatmap for the top gene set (lowest by FDR, although not the lowest by other metrics). I'll admit, the pattern isn't all that clear. I've highlighted another phenotype in green and purple, which I suspect may be contributing to the pattern. Let me know your thoughts.

Thanks for letting me know about the "investigate gene sets" feature. I'll explore that as well.
