edgeR::voom weighted log2cpm - Most appropriate use of GSEA


enot

Mar 3, 2024, 2:02:20 AM3/3/24
to gsea-help
Hi, 

I have spent a bunch of time doing my best to become familiar with the GSEA method and desktop application, and I also have a colleague who has previously used the clusterProfiler implementation of GSEA (which I realise is unsupported by the Broad Institute).

I am really happy to have finally become aware of this Google help group; it's really fantastic.
I've gathered that it's recommended to feed the normalised counts output from DESeq2 (with sizeFactor correction) into the desktop tool and use default settings, incl. 'collapse' to gene symbol, 'weighted' ES, 'Signal2Noise' ranking metric, 'Max_probe' collapsing mode, and 'gene set' permutations in my case because I have <7 replicates per condition. I also use Ensembl gene IDs (ENSG....) and select the /Human_Ensembl_Gene_ID_MSigDB.v2023.2 chip platform.
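For reference, the sizeFactor correction behind DESeq2's counts(dds, normalized = TRUE) is the median-of-ratios scheme. A minimal sketch in Python (illustrative only, not DESeq2's actual code; the toy counts matrix is made up):

```python
import numpy as np

def size_factors(counts):
    """Median-of-ratios size factors (the scheme behind DESeq2's sizeFactors).

    counts: genes x samples array of raw counts.
    """
    log_counts = np.log(counts.astype(float))
    log_geo_mean = log_counts.mean(axis=1)        # per-gene geometric mean (log scale)
    finite = np.isfinite(log_geo_mean)            # drop genes with any zero count
    log_ratios = log_counts[finite] - log_geo_mean[finite, None]
    return np.exp(np.median(log_ratios, axis=0))  # per-sample median of ratios

# Toy data: sample 2 has exactly twice the sequencing depth of sample 1.
counts = np.array([[100, 200],
                   [ 50, 100],
                   [ 10,  20]])
sf = size_factors(counts)
normalized = counts / sf  # analogue of counts(dds, normalized = TRUE)
```

Dividing by the size factors makes the two columns comparable again, which is the property the GSEA guidance relies on.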

I am working with an RNAseq dataset collected from brain organoids (2 cell lines at 3 different time points, 4 replicates per condition). The 'rawest' form of the data I currently have my hands on is weighted log2cpm (produced using edgeR::voom). I realise log-transformed data is not ideal and natural-scale data should ideally be used. I also have DE output (i.e. logFC and FDR) that I could use to calculate signed significance and go the preranked route, but I do not wish to flatten my data this way; I would like the magnitude of a gene's relationship with a phenotype to be included in the ranking formula.

I could ask to have my RNAseq re-analysed using DESeq2, but in the meantime (or if that's not possible), what would be the most appropriate, mathematically sound way for me to deal with weighted log2cpm data output from voom in the GSEA desktop tool? Any guidance would be hugely appreciated.

There are several different comparisons I am interested in when it comes to my data set:
mut_120 v cont_120
mut_150 v cont_150
mut_180 v cont_180
mut_180 v mut_150
mut_150 v mut_120
mut_180 v mut_120
cont_180 v cont_150
cont_150 v cont_120
cont_180 v cont_120

Is it appropriate to run them all separately through the software, or is there some way to run them in batch/simultaneously to account for multiple comparisons?

I really apologise for my long-winded entry here.

Many thanks, 
Eleni

enot

Mar 3, 2024, 5:59:35 PM3/3/24
to gsea-help
Just to make a correction: I will not use 'Max_probe' collapsing mode, I will use 'sum of probes', because I've read in other threads that for RNAseq data using Ensembl IDs that's the best option.
If I can get my hands on DESeq2 sizeFactor-corrected "counts( , normalized = TRUE)" data, I will use GSEA as shown in the screenshot on the left below; in the meantime I will try preranked mode using signed significance (calculated from my DE output), shown on the right. Does this seem reasonable? Is there a third option I can use with my weighted log2cpm voom data that doesn't require preranking?
[Attachments: Screenshot 2024-03-04 095432.png, Screenshot 2024-03-04 095456_2.png]

Anthony Castanza

Mar 4, 2024, 1:10:27 PM3/4/24
to gsea...@googlegroups.com
Hi Eleni,

The settings screenshot you've shown for standard GSEA looks pretty reasonable to me. For Preranked mode, though, you would probably want to stick to abs_max_of_probes, as it doesn't usually make sense to sum the kind of metrics used there.

If the weighted log2cpm is what edgeR is using internally for its differential expression, it is probably a suitable metric for GSEA; however, I can't vouch for it specifically. The recommendation to use DESeq2's normalized counts is just based on the limited resources we have available to do testing, and it was a simple option that we know performs well.

Unfortunately GSEA doesn't really have tools to account for multiple different comparisons. The only way is to run them separately. It probably would be advisable to perform some additional multiple comparison testing corrections to the statistics, but I don't believe that this is commonly done, nor is there a straightforward way to do so. Typically it seems that people just report the various individual statistics.
The EnrichmentMap tool can help visualize (though not rigorously or statistically compare) multiple GSEA runs; however, as far as I understand it, it only supports a limited (~4) number of simultaneous comparisons.

Finally, with regard to using Preranked, I'm not sure what you mean by "flatten" your data. GSEA still uses the value of the ranking metric in preranked mode as long as you use the weighted enrichment statistic. It is more limited than the standard method as it doesn't account for the standard deviation of the samples in computing the metric, being limited to only the Log2FC or whatever other single value you provide, but the fundamentals of the algorithm are the same (at least, the same as the gene_set permutation mode).

Let me know if you have additional questions or there was something I missed from your original messages.


-Anthony

Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego

--
You received this message because you are subscribed to the Google Groups "gsea-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gsea-help+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gsea-help/7c553731-4317-44ae-8bf1-b41bcd2ef029n%40googlegroups.com.

enot

Mar 5, 2024, 12:12:42 AM3/5/24
to gsea-help
Hi Anthony, 

A big thanks for your response.

I was unclear, but by 'flatten' I meant that preranking by signed significance (sign(logFC) * -log10(FDR)) does not factor in the magnitude of the change in expression, and also results in duplicate ranking scores (ties) that are arbitrarily resolved.
Since my first post I have gotten my hands on normalised counts that have not been logged (i.e. natural-scale output from DESeq2, not log-scale voom output).

I have run GSEA using preranked mode with signed significance, and also with the default mode using DESeq2 normalised counts and the Signal2Noise ranking metric.

Comparing the two methods: very similar gene sets appear at the top of my up and down lists. However, many more are significant at FDR q-value < 0.05 in preranked mode than in default mode (only one gene set is significant in the up list in default mode) [see images below].
Obviously these two methods are different and I would expect differences in the GSEA output.
But should I be concerned that I only have one significant up term using the default method? It's not entirely surprising, because a big reason I turned to GSEA in the first place is that standard overrepresentation analysis (filtered by FDR<0.05 and FC>1.5) did not yield any significant terms.
As a side note, it also concerns me that mostly ribosomal terms appear in the up lists, which makes me wary that there's something I'm not accounting for properly.

Does anything here raise flags for you?
Kindest Regards, 
Eleni

Here are snapshots from preranked GSEA input and output:
[Attachments: Expression Data_preranked.png, pos gene sets_preranked.png, neg gene sets_preranked.png]


Here are snapshots from standard GSEA input and output:
[Attachments: Expression Data.png, pos gene sets.png, neg gene sets.png]

enot

Mar 5, 2024, 12:22:52 AM3/5/24
to gsea-help
Apologies, just to add: these are example top hits from the down and up gene sets running GSEA the standard way. You can see that even the only significant up pathway plot looks a bit funky.
Thank you again,
Eleni
[Attachments: uplot.png, down plot.png]

Anthony Castanza

Mar 5, 2024, 3:59:48 PM3/5/24
to gsea...@googlegroups.com
Hi Eleni,

I personally don't care for the sign(logFC) * -log10(FDR) method of computing a preranked list. The loss of the biologically relevant magnitude information from taking just the sign of the log2FC is too substantial. I've seen, and personally lean towards, taking the actual log2FC itself and incorporating it into a ranking metric like log2FC * -log10(pValue); using the p-value rather than the FDR also increases the dynamic range of the metric.
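A tiny illustration (with made-up logFC/p/FDR values) of why the two metrics behave differently - under the signed-significance metric a 4-fold change and a tiny 7% change can tie, while the magnitude-aware metric separates them:

```python
import numpy as np

# Made-up stats for four genes: a 4-fold change and a small 7% change
# that happen to share the same FDR.
log2fc = np.array([2.0, 0.1, 1.5, -2.0])
pval   = np.array([1e-4, 1e-4, 1e-3, 1e-2])
fdr    = np.array([1e-3, 1e-3, 5e-3, 3e-2])

# Signed significance: magnitude is discarded, so genes 1 and 2 tie.
signed_sig = np.sign(log2fc) * -np.log10(fdr)

# Magnitude-aware metric discussed above: the tie is broken.
rank_metric = log2fc * -np.log10(pval)
```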

Enrichment of lots of ribosomal sets can definitely indicate a data quality problem. Typically some sort of ribosomal depletion is performed on these kinds of sequencing data and it's possible that some of the samples in one of your phenotypes did not deplete as well as others. Have you looked at the heatmaps to see if there are samples that are clear outliers? Did you perchance do any kind of hierarchical clustering or PCA on these samples to see if there are outliers?
That said, if the core enrichment genes are genes that are contributing to the biogenesis of ribosomes but aren't actually the ribosomal RNAs themselves then it's definitely possible that this could be due to something like an upregulation of the process of protein synthesis itself in your treatment.
Unfortunately it's difficult to speculate about underlying issues without knowing a lot more about the actual experimental interventions and being able to access the full dataset metrics.

-Anthony

Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego

enot

Mar 5, 2024, 9:42:16 PM3/5/24
to gsea-help
Hi Anthony, 

Thank you again for being such a help.
Agreed. I now have the normalised counts output from DESeq2, so I was able to use GSEA the standard recommended way (as far as I've understood it, and as shown in an earlier screenshot).
--> My expression file contains natural-scale normalised counts, and I rank by Signal2Noise using a weighted enrichment statistic.
Q: Does GSEA do its own log transform in the background? Is that why it's not suitable to use logged expression data? I had really odd results when I tried inputting log-cpm voom expression data.

Would you still recommend the normalised counts output from DESeq2 with the Signal2Noise ranking over using differential expression data to prerank with log2FC * -log10(pValue)?
Is it unusual to have only a single gene set significant at q-value FDR < 0.05 (in one direction), with a gene list containing >16000 genes, using /msigdb/human/gene_sets/c5.go.bp.v2023.2.Hs.symbols?

As for the ribosomal issue:
I have a PCA plot (below). The comparison I'm currently using for all my initial GSEA sanity checks is COVE 150 vs PCON 150, and although the replicates don't cluster super tightly, there is separation. Potentially one of my PCON 120 samples is an outlier, but it wouldn't factor into my 150 dataset.
Many of the core enrichment genes appear to be ribosomal proteins themselves (heatmap and enriched-genes list below), so perhaps I need to look into whether ribosomal depletion was performed on my sequencing data and, if so, whether it requires adjustment.

An unrelated question: I've struggled to fully grasp why the collapsing mode should be altered depending on the format of the expression data. I wonder if there's a resource you could point me towards that might clear this up for me.

Kindest Regards, 
Eleni

[Attachments: PCA.png, ribosomal enriched.png, ribosomal_heatmap.png]

enot

Mar 6, 2024, 11:46:48 PM3/6/24
to gsea-help
Just a short add on: It's likely my library prep was poly-A enriched rather than ribosome depleted.

Anthony Castanza

Mar 7, 2024, 1:57:17 PM3/7/24
to gsea...@googlegroups.com
Hi Eleni,

No, GSEA does not perform a log transformation internally. Applying a log transformation fundamentally changes the nature of the math being done in the computation of the signal to noise ratio, which could be the cause of the odd results you saw when using that data.
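To see why, the Signal2Noise metric is the difference of class means over the sum of class standard deviations; log-transforming first changes both the means and the SDs, so the metric comes out differently. A quick sketch (illustrative Python with simulated counts; the desktop tool additionally applies minimum-SD floors, omitted here):

```python
import numpy as np

def signal2noise(a, b):
    """Difference of class means over sum of class standard deviations.
    (The GSEA desktop tool also applies minimum-SD floors, omitted here.)"""
    return (a.mean() - b.mean()) / (a.std(ddof=1) + b.std(ddof=1))

rng = np.random.default_rng(0)
a = rng.normal(1000, 100, size=4)  # natural-scale normalized counts, condition A
b = rng.normal(500, 50, size=4)    # condition B

s2n_natural = signal2noise(a, b)                    # metric on the intended scale
s2n_logged  = signal2noise(np.log2(a), np.log2(b))  # log first: different result
```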

We always recommend using the standard mode of GSEA when you have access to the underlying counts matrix.

It is atypical to see that kind of asymmetrical enrichment result. It isn't impossible, it definitely does happen, but it is odd.
Enrichment of ribosomal biogenesis isn't necessarily an invalid result if everything else looks fine with the dataset.

The collapsing mode should be altered depending on the nature of the values being provided. For example, for RNA-seq data it (generally) makes sense to sum them when collapsing because each count is a discrete "real" object, and if two annotated sequences map to the same gene, the actual underlying fragments that existed in the sample are both directly contributing to that gene. However, with intensity values from a microarray for example, it doesn't really make sense to sum them because they aren't a discrete measure, so we recommend the "max probe" approach which was the standard for that sort of data analysis.
When operating on ranked lists, however, neither of these really makes sense. If you have two annotated sequences and one increased by 2-fold and one increased by 1.5-fold, it doesn't really make sense to say that the shared gene they mapped to increased by 3.5-fold, so "sum" doesn't really work; and in a scenario where one went up by +1.5 (log2FC) but another went down by 2 (i.e. -2 log2FC), the "max" option would always keep the positive value, which isn't valid either. A weighted average would probably be the best way to collapse multiple sequences to a single gene, but we don't have any way to determine the weights from the data we have access to in the ranked list, so we settle for taking the "most extreme" change (abs_max_of_probes). This avoids the issues with both summing and the positive bias of the max-probe approach.
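The three collapse behaviours described above can be sketched in a few lines (illustrative Python, not the desktop tool's code; the probe values are made up):

```python
def collapse(values, mode):
    """Collapse several probe/transcript values mapping to one gene."""
    if mode == "sum_of_probes":        # counts: discrete objects, add them up
        return sum(values)
    if mode == "max_probe":            # array intensities: keep the largest signal
        return max(values)
    if mode == "abs_max_of_probes":    # ranked metrics: most extreme value, sign kept
        return max(values, key=abs)
    raise ValueError(mode)

# Two sequences for one gene: +1.5 log2FC and -2.0 log2FC.
probes = [1.5, -2.0]
collapse(probes, "sum_of_probes")      # -0.5 -- meaningless for fold changes
collapse(probes, "max_probe")          #  1.5 -- positively biased
collapse(probes, "abs_max_of_probes")  # -2.0 -- most extreme change, sign preserved
```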
Sorry we don't offer a better explanation of this in the user guide, but hopefully this makes sense.

Let me know if you have more questions!

-Anthony

Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego

enot

Mar 14, 2024, 12:45:03 AM3/14/24
to gsea-help
Hi Anthony, 

Thank you for breaking down the rationale behind collapsing mode selection - I follow now.

Something I'm noticing that worries me is that some of my random ES plots (for significantly enriched gene sets) are either moderately or extremely skewed, even though my ranked list always seems to have a fairly even number of positive and negative values (generally the zero point is at rank ~8000 out of ~16000 genes) no matter which comparison I make (I have data from 6 different conditions that I compare in different combinations).
I have included some examples below of random ES plots (and their respective ES plots), ranging from very balanced to very skewed. Should I be concerned, and if so, what might be the reason for this? Please let me know if you require any further information.

I have one more thing I'm a little unsure about for now. I'm quite sure it's straightforward, but I just haven't been following it because I've confused myself with the volume of GSEA runs I've attempted. Once GSEA desktop finishes a 'run' it automatically spits out a folder with the specified Analysis Name in the chosen directory, containing all the data files. In some instances, but not all, I also have what look like duplicate folders that have the same file name but end with ".collapsed.dataset"; however, there is very little inside. Are these the results of failed runs? I'm just unsure how I generated them. A related question: when I select my expression dataset in the desktop GSEA version, I now have both the original files and duplicate files ending with "collapsed_to_symbols", as shown below:
When I rerun the same dataset through GSEA, for example to run it against a different gene set, am I at risk of somehow making an error and incorporating a collapsed gene set from a previous run?
[Attachment: Screenshot 2024-03-14 153445.png]


[Attachments: 3.png, 7.png, 2.png, 6.png, 1.png, 5.png, 4.png, 9.png]

I really really apologise for how lengthy this is. 
Many thanks, 
Eleni

Anthony Castanza

Mar 14, 2024, 2:30:49 PM3/14/24
to gsea...@googlegroups.com
Hi Eleni,

The extra collapsed folders that are showing up as separate collapsed_to_symbols lines were likely generated by the separate "Collapse Dataset" tool that's available from lower down in the left sidebar. Did you use that tool at all? If they were run with the same chip file and the same collapse settings then they should be the same as a new run with the original data and the same collapse settings.

Generally a little bit of skew in the random ES distribution isn't anything to be concerned about. GSEA only needs the distribution of the same sign as the true enrichment score, so the distribution for GOBP_NEUROTRANSMITTER_TRANSPORT, where it basically failed to generate permutations that resulted in negative signs, is fine, since those aren't being used anyway. If it had looked the opposite way, then it would be something to be concerned about.
GO_RIBOSOMAL_SMALL_SUBUNIT_BIOGENESIS is more skewed than I would generally hope, but it's still several hundred permutations, so I wouldn't be concerned there either. It's mostly just a result of sampling. If you're running gene_set permutation mode you can increase the number of permutations to, say, 10000 from the default 1000, but I don't see anything in particular here to necessitate this change.

Let me know if you have more questions, or I missed addressing something you'd asked.

-Anthony

Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego

enot

Mar 17, 2024, 11:33:18 PM3/17/24
to gsea-help
Hi Anthony, 

That all makes a lot of sense. Really thank you again for clearing things up for me.
I don't believe I've used the stand-alone "Collapse Dataset" tool, so that's a bit of a mystery, but it shouldn't be causing any problems.
Unfortunately, though, I have since come across the situation you described that should cause concern.
I have 6 conditions (2 cell lines at 3 different time points) for which I made 9 different comparisons (comparing between cell lines at each timepoint, and comparing within a cell line between each time point) that I run separately through GSEA. 
I have used 1000 perms, and this is the result I get for one of my within-cell-line comparisons: only about 50 of the 1000 perms resulted in a negative ES. In this case, I am inclined to be very suspicious of the result below.

[Attachments: Picture12.png, Picture13.png, Screenshot 2024-03-15 163344.png]
I then increased number of perms to 10000 and still see a massive skew:

[Attachments: enplot_GOBP_CYTOPLASMIC_TRANSLATION_2651.png, gset_rnd_es_dist_2653.png, Picture14.png]
This is also the case for other significant gene sets from the same comparison.
Any ideas why I might be seeing this? And if I'm unable to resolve it, is there any way for me to have confidence in this output?

One altogether different question I have (and I'm so sorry to not be done):
- Before being on this forum I was initially advised to use log2cpm as the expression format for GSEA input, rather than counts, because 'the non-log-transformed data was not distributed correctly for a t-test (or, by extension, Signal2Noise I suppose)'. I wondered what you make of that? A follow-up to that: have you heard of people removing unwanted variance (RUV) from their large-scale RNA-seq dataset before passing it through GSEA, and in your opinion would this be a reasonable thing to do?

Many thanks, 
Eleni

Anthony Castanza

Mar 19, 2024, 12:34:59 PM3/19/24
to gsea...@googlegroups.com
Hi Eleni,

The only thing you can really do here is to run an increased number of permutations if you have a sufficient number of samples to allow doing so, and to do so with a newly generated random seed to hope that the new permutation matrix doesn't fall into the same unfortunate spot.

That said, what this means is that in 1000 random permutations GSEA not only wasn't able to construct permutations for scoring this set that scored larger than the true score, it was barely able to find permutations that scored on the same side of the distribution. So while this does negatively impact the precision of our ability to calculate the significance (e.g. the 0 pVal and FDR likely have a greater true range of variance than if you had 500 permutations with the same sign and still had a 0), it doesn't really impact the conclusion that this is a strongly negatively enriched set.
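A sketch of that one-sided logic (illustrative Python, not GSEA's actual implementation, which also normalizes the ES before computing the FDR): the empirical p-value only uses null scores with the same sign as the true ES, so a 50-of-1000 skew means the negative-side estimate rests on just 50 values.

```python
import numpy as np

def permutation_pvalue(true_es, null_es):
    """One-sided empirical p-value: compare the true ES only against
    null ES values of the same sign."""
    null_es = np.asarray(null_es)
    same_sign = null_es[null_es < 0] if true_es < 0 else null_es[null_es >= 0]
    if same_sign.size == 0:
        return float("nan")  # no same-sign permutations at all
    return np.sum(np.abs(same_sign) >= abs(true_es)) / same_sign.size

# Mimic the skew described: only ~50 of 1000 permutations landed negative.
rng = np.random.default_rng(1)
null = np.concatenate([rng.uniform(-0.3, 0.0, 50), rng.uniform(0.0, 0.6, 950)])

p_neg = permutation_pvalue(-0.8, null)  # scored against only 50 nulls
```

With 0 of 50 same-sign nulls beating the true score, the reported p is 0, but its precision is limited by those 50 permutations, which is exactly the variance caveat above.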

The person who gave you that advice is correct that non-log transformed data is not appropriate for running a t-test directly on a per-gene basis for the data, however, that is not what GSEA is doing. GSEA is only using the ranking metric to compute a gene score (not a significance) then using that score (and the permuted scores) to build a distribution.

As for an RUV approach, I could see this being reasonable to do in some cases, yes, but I'm not knowledgeable about the specifics of any given approach to give concrete advice, sorry.

-Anthony

Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego

enot

Jul 8, 2024, 3:53:44 AM7/8/24
to gsea-help
Hi Anthony, 

Thank you for all your help to date. I had to put my GSEA aside for a while but I am returning to it now.
I am rerunning my RNAseq dataset through GSEA because I am changing my collapse mode from 'sum of probes' to 'abs max of probes', as you advised above. The full list of parameters I use is below. I provide ENSG IDs and select the Human_Ensembl_Gene_ID_MSigDB.v2023.2 platform (I found this to work better than using gene symbols).
The transcriptome I'm working with contains 17639 genes. The ranked lists returned by the GSEA analysis contain 16430 genes, so there is a 1209-gene discrepancy. I presume this is fairly expected depending on which CHIP platform is used, because ENSG IDs and gene names can be outdated, many are noncoding genes, etc., so mapping the provided gene list to the CHIP platform database gene IDs could result in some discrepancy. I wondered if there is a quality-control checklist you might advise me to follow to ensure potentially important genes are not omitted from the analysis? In my case, 971/16430 ranked genes are not in my transcriptome (at least not by the same ID), and 2167/17639 transcriptome genes are not in the ranked list. At a quick glance some of these genes look relevant to my project. I am also happy to just continue and not worry too much about this, but I wanted to check if this is something I should be considering or trying to rectify.
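The dropout audit I describe amounts to a set comparison between the input transcriptome IDs and the genes that survive mapping. A sketch (illustrative Python with two real ENSG IDs and two made-up placeholder IDs standing in for the actual files):

```python
# Made-up example ID sets standing in for the real transcriptome and ranked list.
transcriptome = {"ENSG00000141510",   # TP53
                 "ENSG00000012048",   # BRCA1
                 "ENSG00000999999"}   # hypothetical ID that fails to map
ranked = {"ENSG00000141510",
          "ENSG00000012048",
          "ENSG00000111111"}          # hypothetical ID absent from the transcriptome

dropped   = transcriptome - ranked    # never reached the ranked list: spot-check these
unmatched = ranked - transcriptome    # mapped IDs not in the input transcriptome

print(sorted(dropped))
print(sorted(unmatched))
```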

Any advice would be greatly appreciated,
Eleni

producer_class xtools.gsea.Gsea
producer_timestamp 1720420390389
param zip_report false
param num 100
param scoring_scheme weighted
param norm meandiv
param out C:\Users\enotaras\OneDrive - Children's Medical Research Institute\KR\SNAP25 Project\RNAseq\DeSeq2\240708_Results
param mode Abs_max_of_probes
param include_only_symbols true
param set_max 500
param gmx ftp.broadinstitute.org://pub/gsea/msigdb/human/gene_sets/c5.go.bp.v2023.2.Hs.symbols.gmt
param plot_top_x 30
param nperm 1000
param order descending
param rnd_seed timestamp
param rpt_label 240708_COVE_120_PCON120
param set_min 15
param res C:\Users\enotaras\OneDrive - Children's Medical Research Institute\KR\SNAP25 Project\RNAseq\DeSeq2\240708_Input_Files\COVE_120_PCON_120_counts.txt
param chip ftp.broadinstitute.org://pub/gsea/msigdb/human/annotations/Human_Ensembl_Gene_ID_MSigDB.v2023.2.Hs.chip
param create_svgs false
param cls C:\Users\enotaras\OneDrive - Children's Medical Research Institute\KR\SNAP25 Project\RNAseq\DeSeq2\240708_Input_Files\COVE_120_PCON_120_Phenotype data.cls#COVE120_versus_PCON120
param sort real
param create_gcts false
param help false
param save_rnd_lists false
param median false
param metric Signal2Noise
param make_sets true
param rnd_type no_balance
param gui false
param permute gene_set
param collapse Collapse
file C:\Users\enotaras\OneDrive - Children's Medical Research Institute\KR\SNAP25 Project\RNAseq\DeSeq2\240708_Results\240708_COVE_120_PCON120.Gsea.1720420390389\index.html

Anthony Castanza

Jul 8, 2024, 2:53:02 PM7/8/24
to gsea...@googlegroups.com
Hi Eleni,

This seems to me to be a reasonable number of dropouts.
Looking at the parameters in your message, I'd note that I only recommend Abs_max_of_probes for the preranked GSEA mode; the standard mode, where you use a counts table, particularly with RNA-seq data, should definitely still use Sum_of_probes. When picking a collapse mode it's important to think about what the values you're working with mean, whether they're a discrete entity like a count of a sequence or the result of a mathematical operation like a ranking value, and adjust your parameters accordingly.

Using Ensembl gene IDs as your input identifiers for RNA-seq data is definitely the preferred ID type here; this is what we use internally to build the database in the first place, so the mapping results will be the most robust it is possible to obtain.

We don't really offer a quality control checklist or anything, but as long as you're using the correct chip file for your platform GSEA is generally pretty robust to missing a few genes from a set of interest. A handful of potentially relevant genes not being included in the mapping won't have all that much impact on the calculation of a set's enrichment assuming that the set is of a reasonable size.

Let me know if you have any more questions!

-Anthony

Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego