GSEA for Single-Cell RNA Seq cluster

2,429 views
Skip to first unread message

J

unread,
Jun 25, 2021, 10:26:57 PM6/25/21
to gsea-help
Hello,

I am looking to compare the gene set enrichment of two phenotypes within a single-cell RNA seq cluster obtained by using the Seurat package in R, and I would like to know what would be the correct way to do this.  At first, I thought I may input the expressions file (normalized) of the single cells into the desktop GSEA program with each column in the expressions file representing a cell -- a sample.  Thus, I would have thousands of samples, so I doubt that this would be the most effective method.  

I was also thinking that I may combine the expressions of all the cells of the two phenotypes into two samples -- one for each phenotype.  Then, I would have an expression file of two columns for the two samples, and I input this shorter expression file into the desktop GSEA program.  However, I am unsure if this would work considering that Signal2Noise ranking metric requires a minimum of three samples for each phenotype.  

Do any of these methods make sense?  If not, please suggest other methods, including those that may not even use the desktop GSEA program.

Thank you

Anthony Castanza

unread,
Jun 25, 2021, 10:57:54 PM6/25/21
to gsea...@googlegroups.com
Hello,

Single Cell experiments are kind of a tricky area and we haven't developed much in the way of guidance here yet since the use cases are so varied.
Your best bet might be to use Seurat's differential expression calculations, and then use that differential expression result as input for GSEA Preranked. You might have to adjust any results thresholding to get a list of all genes though (I'm not entirely sure what the current default behavior in Seurat is).

I wouldn't use each cell as a "sample". But let's say that hypothetically you isolated cells from multiple culture plates or experimental animals. In that case you could probably try something like the idea you suggested, but where you take the cells from each plate/animal separately and average their expression to create a pseudo-bulk sample for each plate for each phenotype. This would then allow you to use signal2noise if you had an appropriate number of original samples. (Does that make sense?) But I don't know enough about your experiment to say if this approach would work.


--
You received this message because you are subscribed to the Google Groups "gsea-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gsea-help+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gsea-help/f01c94f5-26e4-42cb-91e9-bcb20de9285bn%40googlegroups.com.


--
-Anthony

Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego

J

unread,
Jun 26, 2021, 7:16:51 PM6/26/21
to gsea-help
Thank you for your reply!  The situation that you describe in your response's second paragraph is not mine, but I think using GSEAPreranked with the differentially expressed genes would be a reasonable route.  I plan to use the GO biological processes gene sets.  

Another question: I am unsure about how appropriate the permutation type of GSEAPreranked (gene set) would be in comparing the gene set enrichment between my two phenotypes.  According to Subramanian et al. (2005), the permutation of phenotype labels would yield a more reasonable significance level of enrichment score than would the permutation of genes.  Does this suggest that GSEAPreranked is not optimal because it permutes exclusively the gene sets?

Thank you

Anthony Castanza

unread,
Jun 26, 2021, 8:19:37 PM6/26/21
to gsea...@googlegroups.com
Hello,

Yes, we generally consider the phenotype permutation mode, which is only available in full GSEA and not GSEA Preranked, to be a superior null distribution model. However, GSEA is not designed to calculate differential expression for single cell data, or to permute thousands of samples. GSEA seems to work reasonably okay with pseudo-bulk samples, which is why I suggested that alternative if you had an appropriate underlying experimental design but that doesn't appear to be the case.

Among other issues, the sparsity of single cell data, and the number of samples are non-trivial issues that the differential expression tools in seurat and similar software have been developed to handle. We can't vouch for the performance of the phenotype permutation test, or the signal2noise ranking metric when working on single cell data, therefore ranking genes with a tool designed to properly calculate differential expression for this kind of data, and then using that ranked list is the recommended approach.

You're of course welcome to try the other mode and let us know how it behaved on your dataset, but it's unfortunately not a scenario we can offer support or advice on.

J

unread,
Jun 27, 2021, 9:34:15 PM6/27/21
to gsea-help
Sounds good.  Would it be OK if the ranked list of genes featured only most of the genes in stead of all?  For some reason the parameters in Seurat for finding the differentially expressed genes analyze and return only about 19000 genes out of 22000 in my case.  As far as I know, I think the parameters I set for finding the DEGs would be the ones that give me the most genes, but I will keep testing more.  

Also: how would I interpret the GSEAPreranked results?  In the html file containing the results, I see the phenotypes of na_pos and na_neg.  I understand that since I ran GSEAPreranked, I did not specify any phenotypes.  So what do the phenotypes of na_pos and na_neg mean?

Thank you

acas...@cloud.ucsd.edu

unread,
Jun 27, 2021, 9:41:23 PM6/27/21
to gsea...@googlegroups.com

Hello,

 

Yeah, this kind of filtering is normal, typically genes that had zero and/or some extremely low number of UMIs across all samples were removed, it shouldn’t affect the outcome. This same kind of filtering is typically performed with bulk RNA-seq data as well. We just don’t want the list to be restricted to something like only “significant” features.

 

As for interpreting the results, the na_pos will reflect the upregulated/positive side of your list, and the na_neg will reflect the downregulated/negative side of your ranked list. Those names are just placeholders since we don’t know what the phenotypes are that the up and down sides of the enrichment actually represent.

 

Let me know if you have any more questions,

 

-Anthony

 

Anthony S. Castanza, PhD

Curator, Molecular Signatures Database

Mesirov Lab, Department of Medicine

University of California, San Diego

 

J

unread,
Jun 28, 2021, 10:19:03 AM6/28/21
to gsea-help
Ok.  If I understand correctly, for GSEAPreranked results derived from a conserved-marker ranked gene list between two phenotypes, na_pos would be the the up-regulated gene sets shared by the two groups of cells, and na_neg would be the down regulated gene sets shared among the two groups of cells.

But how may I interpret na_pos and na_neg for the results of GSEAPreranked derived from a differentially expressed gene ranked list?  Would the interpretation for na_pos be something like "positively differentially expressed" (most differentially expressed between the two groups?) and na_neg be "negatively differentially expressed" (more conserved between the two cell groups?).  If so, in na_pos, how may I find out which of the two cell groups has the up-regulated gene sets and which has the down regulated gene sets?

Thank you

J

unread,
Jun 28, 2021, 11:12:28 AM6/28/21
to gsea-help
Also: I read that GSEAPreranked sorts my ranked list in descending numerical order regardless of my ranking metric.  So if I use significance p-value, would I need to negate every p-value so that GSEAPreranked would put the most significant values on top?  

Thank you  

acas...@cloud.ucsd.edu

unread,
Jun 28, 2021, 12:22:48 PM6/28/21
to gsea...@googlegroups.com

Hello,

 

In my previous message I wes describing the case for a differentially expressed gene ranked list. But this is, in fact, a general case. na_pos will always be the gene sets enriched in the positive side of the ranked list, and na_neg will always be the gene sets enriched in the negative side of the ranked list regardless of the underlying nature of the experiment that led to the ranking. Conservation between your groups doesn’t factor into it unless it is a factor in how you ranked your genes. If your ranking is differential expression between two groups, then positive enrichment will be from upregulated genes and negative enrichment will be from downregulated genes.

 

With regard to using pValue;  generally for pValues, you probably want to use the -log10 of the pvalue, this would allow the more significant pvalues to be assigned more weight than less significant ones (since GSEA uses the ranking metric to weight each step of the enrichment score calculation). Some people also have good results using the -log10(pvalue)*sign of the fold change. This later option allows you to get both positive and negative scores while still using the pValue to assign the magnitude of the weighting factor.

J

unread,
Jun 29, 2021, 11:48:32 AM6/29/21
to gsea-help
So if I use the -log10(value) * sign of fold change, then I would be able to input a list of all ranked genes into GSEAPreranked, right?  The transformed pvalues of each gene would tell me in which direction the gene is regulated between the two groups.  

However, if I do not multiply by the sign of fold change, then I would not be able to tell in which direction the regulation of a certain gene would be.  In this case, I would have to run GSEAPreranked twice with two lists of ranked genes: one for upregulated genes and another for down regulated, right?  And this latter option would be suboptimal because each of the two ranked gene lists would not contain all genes since I would need to set a logFC threshold while calculating the gene ranks, and GSEA should look at all genes when calculating enrichment instead of only the most upregulated genes or the most down regulated.  

Please let me know if I understand this correctly.

Thank you

Anthony Castanza

unread,
Jun 29, 2021, 11:56:15 AM6/29/21
to gsea...@googlegroups.com
Your understanding is correct with the caveat that running the up and down lists separately is even worse and not comparable to running the combined list as there may be genes in a given gene set that are distributed on both sides of the list and GSEA will use this information in the enrichment calculation.


-Anthony

Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego

J

unread,
Jun 29, 2021, 10:13:28 PM6/29/21
to gsea-help
Thank you for your help so far.

Transforming the adjusted p-values yields a lot of repeat values at the top and the bottom of the list because there are a lot of repeated adjusted p-values.  The problem with repetition in the transformed plain p-value is not so bad, but there is still about 3000 genes with the same score at the top of the list because next-to-zero p-values are rounded to zero in Excel while executing the transformation formula.  A square root transformation gives me a similar problem.  If my transformations yield too many repeated values, then would running GSEAPreranked twice with the lists for upregulated and down regulated genes be an ok option despite not being ideal?  Is there another ranking metric you could recommend that prevents repeats?

Thanks again

Anthony Castanza

unread,
Jun 30, 2021, 1:11:01 PM6/30/21
to gsea...@googlegroups.com

Hello,

 

You would not want to use the adjusted p-value because of the very issue you observed. I would recommend calculating the transformation internally in R (since Seurat is R based) and then only writing out the transformed list and not using excel at all.

 

Running the up and down lists separately is almost never a good option for GSEA for the previously mentioned reason. Additionally, splitting the ranked list would not address the repeated values issue as the ones present on one side of the list don’t affect the ones present on the other side of the list.

 

Some repeats are simply to be expected, genes that changed identically (within the limits of our ability to detect them) will have identical values, this is just something that happens. But it might be possible to mitigate this somewhat by using the value of the log2(fc) including the sign, instead of just the sign alone in the transformation formula (as well as not using Excel to perform the calculation due to the rounding behavior).

 

-Anthony

 

Anthony S. Castanza, PhD

Curator, Molecular Signatures Database

Mesirov Lab, Department of Medicine

University of California, San Diego

Reply all
Reply to author
Forward
0 new messages