GSEA for proteomics data?


Manisha Pal

Nov 16, 2021, 10:32:10 AM
to gsea-help
Hello, 

I'm working on a proteomics dataset and wondering whether we can use GSEA to gain insight into significantly enriched pathways.
What preprocessing needs to be done before loading the input list? Can we input a list that has been missing-value imputed, normalized, and log-transformed?

Best regards
Manisha Pal

Anthony Castanza

Nov 16, 2021, 4:36:08 PM
to gsea-help

Hi Manisha,

If you have a ranked list of features, you should be able to use it with GSEA Preranked, provided it's formatted as a .rnk file (https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#RNK:_Ranked_list_file_format_.28.2A.rnk.29). We generally recommend that these RNK files use a metric like Log2FC for all expressed genes (for RNA-seq data), but the data you've described should work okay.
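A minimal sketch of producing such a ranked list, assuming the RNK layout is one tab-separated identifier and score per line. The function name `write_rnk`, the example identifiers and scores, and the rule of keeping the largest absolute score when an identifier appears more than once are all illustrative assumptions, not part of GSEA.

```python
import csv
import io


def write_rnk(scores, handle):
    """Write (identifier, score) pairs as tab-separated lines, highest score first.

    Assumption: when an identifier occurs more than once (for example, several
    protein groups mapping to one gene symbol), keep the entry with the largest
    absolute score. Other collapsing rules are equally defensible.
    """
    best = {}
    for name, score in scores:
        if name not in best or abs(score) > abs(best[name]):
            best[name] = score

    # Sort by score, descending, so the file is already in rank order.
    ordered = sorted(best.items(), key=lambda item: item[1], reverse=True)

    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for name, score in ordered:
        writer.writerow([name, f"{score:.6g}"])
    return ordered


# Hypothetical Log2FC values, for illustration only.
example = [("TP53", 1.8), ("EGFR", -2.3), ("MYC", 0.4), ("TP53", -0.2)]
buffer = io.StringIO()
write_rnk(example, buffer)
rnk_text = buffer.getvalue()
```

To write to disk instead, pass an open file handle (for example `open("ranked.rnk", "w", newline="")`) in place of the `StringIO` buffer.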

Assuming this isn't a differential proteomics study where you have both positive and negative scores (e.g. a Log2FC), you might also try Z-score transformed data. That might give GSEA a better shot at calculating enrichment at both the top and bottom of the ranked list (i.e. above and below the mean expression).
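A rough sketch of the Z-score idea, assuming one log-transformed abundance value per protein and standardizing across proteins so that scores fall on both sides of zero. The function name `zscore_across_features` and the abundance values are hypothetical; whether to standardize across proteins or across samples depends on the experiment.

```python
from statistics import mean, stdev


def zscore_across_features(abundance):
    """Return {identifier: z-score}, centred on the mean of all features."""
    values = list(abundance.values())
    mu = mean(values)
    sd = stdev(values)  # sample standard deviation; needs at least two values
    if sd == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return {name: (value - mu) / sd for name, value in abundance.items()}


# Hypothetical log2 abundances, for illustration only.
abundance = {"P1": 24.1, "P2": 21.5, "P3": 19.8, "P4": 22.6}
z = zscore_across_features(abundance)
```

After the transform, features above the mean get positive scores and features below it get negative scores, so the ranked list has a meaningful top and bottom.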

-Anthony

Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego

Manisha Pal

Nov 16, 2021, 10:14:48 PM
to gsea-help
Hello Anthony,
Thank you for the great explanation. However, the data I'm working with is from a differential proteomics study. In that case, will preranked GSEA still work well?

Thank you 
Manisha Pal

Anthony Castanza

Nov 16, 2021, 10:47:02 PM
to gsea-help
In that case you have Log2FC data, yes? If so, it should work just fine.
If you only have pValues, then you'd want to use -log10(pValue) so that things are scaled appropriately.

-Anthony


Manisha Pal

Nov 16, 2021, 11:03:36 PM
to gsea-help
We have both Log2FC and pValue. Should we rank using just one of them, or use sign(Log2FC) * -log10(pValue)?

Anthony Castanza

Nov 17, 2021, 2:44:54 PM
to gsea...@googlegroups.com

Log2FC is generally the standard metric. However, some users have had good results with the significance weighting offered by the combined metric. We haven't collected detailed comparative performance metrics, though, so we can't offer an official recommendation there, sorry.
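For reference, the combined metric discussed in this thread, sign(Log2FC) * -log10(pValue), can be sketched as below. The function name `signed_significance` and the `floor` parameter are illustrative assumptions: the floor only guards against a p-value of exactly zero, which would otherwise give an infinite score.

```python
import math


def signed_significance(log2fc, pvalue, floor=1e-300):
    """Return sign(log2fc) * -log10(pvalue).

    Assumption: p-values below `floor` (including 0) are clamped to `floor`
    so the result stays finite.
    """
    clamped = max(pvalue, floor)
    sign = (log2fc > 0) - (log2fc < 0)  # 1, 0, or -1
    return sign * -math.log10(clamped)


# Hypothetical values, for illustration only.
up = signed_significance(1.5, 0.001)     # up-regulated and significant
down = signed_significance(-2.0, 0.01)   # down-regulated and significant
capped = signed_significance(1.0, 0.0)   # p-value of zero is clamped
```

With this metric the direction comes only from the fold change and the magnitude only from the p-value, so a small but very consistent change can outrank a large but noisy one.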

 

-Anthony

Manisha Pal

Dec 12, 2021, 5:10:48 AM
to gsea-help
Hello, 

I have a question.
When I ran preranked GSEA for phenotype A vs phenotype B in set 1, some pathways (X) were significantly enriched in phenotype B. When I ran preranked GSEA with the same gene sets for set 2 (phenotype A vs phenotype B), some pathways (Y) were significantly enriched in phenotype A.
But when I took phenotype B and ran GSEA for set 1 vs set 2 (again with the same gene sets), this time pathway X was significantly enriched in phenotype A.
What could be the possible reasons for such results?

Anthony Castanza

Dec 12, 2021, 5:39:16 AM
to gsea...@googlegroups.com

Hi,

 

I'm not entirely sure I understand the procedure you're describing here.

 

With GSEA you have a ranked list of genes - such as the log2 fold change calculated from differential expression of samples of phenotype A vs Phenotype B.

You use this ranked list to calculate enrichment scores for sets of genes that represent biological functions/pathways.

Those scores are reported as either positive (functions enriched in phenotype A) or negative (functions enriched in phenotype B).
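To illustrate where the sign comes from, here is a simplified sketch of a weighted running-sum enrichment score over a ranked list. It is not the GSEA implementation: it omits permutation testing, normalization, and significance estimates, and the names `enrichment_score`, `weight`, and the example identifiers are assumptions made for this example.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """Return the signed maximum deviation of a weighted running sum.

    `ranked` is a list of (identifier, score) pairs sorted by score, descending.
    Members of `gene_set` step the sum up in proportion to |score| ** weight;
    non-members step it down by a constant amount.
    """
    hit_total = sum(abs(score) ** weight for name, score in ranked if name in gene_set)
    miss_count = sum(1 for name, _ in ranked if name not in gene_set)
    if hit_total == 0 or miss_count == 0:
        raise ValueError("need at least one weighted hit and one miss")

    running = 0.0
    extreme = 0.0
    for name, score in ranked:
        if name in gene_set:
            running += abs(score) ** weight / hit_total
        else:
            running -= 1.0 / miss_count
        # Keep the deviation furthest from zero, preserving its sign.
        if abs(running) > abs(extreme):
            extreme = running
    return extreme


# Hypothetical ranked list (A vs B Log2FC), for illustration only.
ranked = [("P1", 3.0), ("P2", 2.0), ("P3", 0.5), ("P4", -0.5), ("P5", -2.0), ("P6", -3.0)]
es_top = enrichment_score(ranked, {"P1", "P2"})     # set sits at the top of the list
es_bottom = enrichment_score(ranked, {"P5", "P6"})  # set sits at the bottom of the list
```

A set concentrated at the top of an A vs B ranking gives a positive score, and a set concentrated at the bottom gives a negative one, which is why the direction of the comparison used to build the ranking determines which phenotype a pathway is reported as enriched in.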

 

It sounds like you're describing the following:

You have two independent groups (such as two groups of technical replicates), each containing paired sets of "phenotype A" and "phenotype B" samples. You've ranked genes for A vs B for each of the technical replicates, resulting in two different ranked lists.

Then for each ranked list (one for each paired set of technical replicates) you've performed preranked GSEA.

Now, you're seeing pathways which are scored as enriched in one phenotype in one set of technical replicates, but enriched in the opposite phenotype in the other set of technical replicates.

 

Is that correct?

 

This kind of switching would be quite unusual. If that is the case, I might suggest trying a hierarchical clustering method on a pooled dataset to assess the fidelity of your technical replicates (i.e. are the expected replicates all clustering together). There might be some kind of batch effect that hasn't been accounted for. Also, have you looked at the enrichment plot? The typical expected shape is described in the GSEA user guide (a smooth-ish mountain shape skewed strongly to one side of the distribution). An atypical shape in one of the comparisons might be informative. Also, how does the leading edge gene membership of the set change between the replicates? This might give you an indication of what genes are driving this discrepancy.
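One way to compare leading edge membership between two runs is sketched below, assuming the leading edge identifiers for the same gene set have already been copied out of each run's report. The function name `leading_edge_overlap` and the identifiers are hypothetical.

```python
def leading_edge_overlap(first, second):
    """Summarize how two leading edge lists for the same gene set differ."""
    a, b = set(first), set(second)
    union = a | b
    return {
        "shared": sorted(a & b),
        "only_first": sorted(a - b),
        "only_second": sorted(b - a),
        # Jaccard index: 1.0 means identical membership, 0.0 means no overlap.
        "jaccard": len(a & b) / len(union) if union else 0.0,
    }


# Hypothetical leading edge members from two runs, for illustration only.
set1_edge = ["P1", "P2", "P3", "P4"]
set2_edge = ["P3", "P4", "P5"]
summary = leading_edge_overlap(set1_edge, set2_edge)
```

A low overlap suggests that different members of the set are driving the score in each comparison, which is a useful starting point when a pathway flips direction between runs.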

 

It's difficult to speculate on answers to these kinds of questions without more information about the experiment and what the results actually look like. If you're able to go into more detail about the data but wouldn't want to share it on an open forum, you can reach out to us privately at gsea...@broadinstitute.org

Unnati Agarwal

Jul 8, 2022, 5:44:14 PM
to gsea-help
Hi Anthony,

Could you help me find some papers that have used the sign(Log2FC) * -log10(pValue) formula and got good results?
I'm doing preranked GSEA and wanted to have a look at some publications relating to this.

Thanks in advance.
Unnati Agarwal