Information about Parametric usage

Adithi GR

Mar 5, 2026, 4:56:43 AM

to gsea-help

Dear GSEA team,

Thank you for helping with my questions. Is there any documentation on the basic parameter fields, and where can I find it? I want to optimise my results, but I can't work out which option to use when.

Specifically: which enrichment statistic to use, which metric to use for ranking genes, the gene list sorting mode, the gene list ordering mode, and so on.

If there is a link with this information, it would be helpful.

Thank you

Anthony Castanza

Mar 6, 2026, 11:45:45 AM

to gsea-help

The GSEA User Guide is available here: https://docs.gsea-msigdb.org/#GSEA/GSEA_User_Guide/
The parameter people most commonly adjust is the permutation type, which should be set to "gene_set" for datasets with few samples (fewer than 7 per phenotype). Beyond that, the defaults are appropriate for most purposes.
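
As a rough illustration of that rule of thumb, here is a minimal R sketch. The helper name `suggest_permutation_type` and its `min_per_class` argument are hypothetical and not part of GSEA; the function only counts samples per phenotype and applies the "fewer than 7" threshold mentioned above.

```r
# Hypothetical helper (not part of GSEA): suggest a permutation type from
# the number of samples in each phenotype class.
suggest_permutation_type <- function(phenotypes, min_per_class = 7) {
  class_sizes <- table(phenotypes)
  # Any class below the threshold means phenotype permutation is not advisable.
  if (any(class_sizes < min_per_class)) "gene_set" else "phenotype"
}

# Example: 4 samples in one class and 5 in the other.
suggest_permutation_type(c(rep("stable", 4), rep("unstable", 5)))  # "gene_set"
```

The returned string matches the value you would pick for the permutation type field; everything else can stay at its default.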

-Anthony

Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine

University of California, San Diego

Adithi GR

Mar 9, 2026, 1:33:53 AM

to gsea-help

I'm currently using the TPM data directly. I think I should perhaps normalise it. Is there a specific normalisation method that you would recommend?

Please let me know.

Thanks,

Adithi

Anthony Castanza

Mar 11, 2026, 7:25:13 PM

to gsea-help

Hi Adithi,

For standard GSEA (i.e. not single-sample GSEA) we generally recommend normalized counts, such as those output by DESeq2's "median-of-ratios" method. This is generally more appropriate for between-sample comparisons than TPM, which is best for comparing the relative expression of genes within a sample.
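
To show what "median-of-ratios" does, here is a small base-R sketch under simplifying assumptions. It is illustrative only and is not DESeq2's code: the function name `median_of_ratios` and the toy matrix are made up, and in practice you would use `counts(dds, normalized = TRUE)` from DESeq2.

```r
# Base-R sketch of median-of-ratios normalization (illustrative only).
median_of_ratios <- function(counts) {
  log_counts <- log(counts)
  # Per-gene log geometric mean; genes with any zero count give -Inf and are skipped.
  log_geo_means <- rowMeans(log_counts)
  usable <- is.finite(log_geo_means)
  # Size factor per sample: median ratio of its counts to the gene-wise geometric means.
  size_factors <- apply(log_counts[usable, , drop = FALSE], 2, function(sample_log) {
    exp(median(sample_log - log_geo_means[usable]))
  })
  list(size_factors = size_factors,
       normalized = sweep(counts, 2, size_factors, "/"))
}

# Toy data: sample "b" was sequenced twice as deeply as sample "a".
toy <- cbind(a = c(10, 20, 30, 0), b = c(20, 40, 60, 5))
rownames(toy) <- paste0("gene", 1:4)
res <- median_of_ratios(toy)
res$size_factors   # the factor for "b" is twice the factor for "a"
res$normalized     # genes 1 to 3 are now equal across the two samples
```

The point of the toy example is that a pure sequencing-depth difference disappears after dividing by the size factors, which is what makes the values comparable between samples.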

-Anthony


Adithi GR

Mar 25, 2026, 6:07:31 PM

to gsea-help

Thank you, Anthony. As you suggested, and following the information on the website, I tried using DESeq2 normalised counts for GSEA. However, no matter what I do, no gene sets pass the FDR < 25% threshold when I use phenotype permutation.

I have removed the gene rows with low expression. The problem occurs with one condition, called stable (12 samples, replicates of clones), and not with the other condition, called unstable (10 samples, replicates of clones).

For more context: I have labelled my clones as stable or unstable based on a parameter, and I'm then running GSEA on them. I have RNA-seq Salmon quantification files for these clones, which are in replicates.

Please help, or suggest something that could improve my results.

This is the code I'm currently using to normalise the data.

#helper to install any missing CRAN packages (defined here but not called below)

install_if_missing <- function(packages) {
  missing_pkgs <- setdiff(packages, rownames(installed.packages()))
  if (length(missing_pkgs) > 0) {
    install.packages(missing_pkgs)
  }
}

#libraries

library(tximport)
library(dplyr)
library(ggplot2)
library(DESeq2)
library(readxl)
library(readr)

#locate the salmon quantification files ("\\.sf$" matches the literal .sf extension)
files <- list.files(path = "path", pattern = "\\.sf$", full.names = TRUE, recursive = TRUE)

#sample names are the file names without the extension
#(if every file is called quant.sf, basename(dirname(files)) gives the folder names instead)
sample_names <- basename(files) %>% sub("\\.sf$", "", .)

#transcript-to-gene mapping (column 1 = transcript ID, column 2 = gene ID)
input_path <- "path"
tx2gene <- as.data.frame(read_excel(input_path))
head(tx2gene)
txi <- tximport(
  files,
  type = "salmon",
  tx2gene = tx2gene
)

#creating metadata and condition data (one entry per file, in the same order as files)
meta <- data.frame(condition = c("unstable", "unstable", "unstable", "unstable", "stable", "stable", "stable", "stable",
                                 "unstable", "unstable", "unstable", "unstable", "unstable", "unstable", "unstable", "unstable",
                                 "unstable", "unstable", "unstable", "unstable", "stable", "stable", "unstable", "unstable", "unstable", "unstable"
))
colnames(txi$counts) <- sample_names
rownames(meta) <- colnames(txi$counts)

meta

#creating normalised counts using deseq2

dds <- DESeqDataSetFromTximport(txi, colData = meta, design = ~ condition)

#perform DESeq2 analysis (this estimates the normalisation factors and fits the model)
dds <- DESeq(dds)

#Get the normalised counts
normalized_counts <- counts(dds, normalized = TRUE)
colnames(normalized_counts) <- sample_names

# Now view the first rows
head(normalized_counts)

write.csv(normalized_counts, file = "normalized_counts.csv", row.names = TRUE)
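
The script above writes a CSV, whereas GSEA, as far as I know, reads expression data from tab-delimited formats such as GCT and takes phenotype labels from a CLS file. A minimal base-R sketch of writing both is below. The helper names `write_gct` and `write_cls`, the toy matrix and the output file names are hypothetical; with the objects from the script you would pass `normalized_counts` and `meta$condition` instead.

```r
# Sketch: write a genes x samples numeric matrix as a GCT v1.2 file.
write_gct <- function(mat, path) {
  lines <- c(
    "#1.2",                                                     # format version
    paste(nrow(mat), ncol(mat), sep = "\t"),                    # genes, samples
    paste(c("NAME", "Description", colnames(mat)), collapse = "\t"),
    paste(rownames(mat), "na", apply(mat, 1, paste, collapse = "\t"), sep = "\t")
  )
  writeLines(lines, path)
}

# Sketch: write a categorical CLS file for the same samples, in the same order.
write_cls <- function(condition, path) {
  condition <- as.character(condition)
  classes <- unique(condition)  # class names in order of first appearance
  lines <- c(
    paste(length(condition), length(classes), 1),
    paste("#", paste(classes, collapse = " ")),
    paste(condition, collapse = " ")
  )
  writeLines(lines, path)
}

# Toy example written to a temporary directory.
toy <- matrix(c(12.5, 0, 30, 7, 8, 9), nrow = 2, byrow = TRUE,
              dimnames = list(c("GeneA", "GeneB"), c("s1", "s2", "s3")))
gct_path <- file.path(tempdir(), "toy.gct")
cls_path <- file.path(tempdir(), "toy.cls")
write_gct(toy, gct_path)
write_cls(c("unstable", "stable", "stable"), cls_path)
```

The sample order in the CLS file has to match the column order of the GCT file, which is why both are derived from the same objects.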

Anthony Castanza

Mar 25, 2026, 6:17:16 PM

to gsea-help

Have you done any other kind of analysis, like a PCA, to get an idea of how strong the signal is in your dataset? Is there clear separation of your two phenotype groups? What about the actual DESeq2 results themselves: were there many significant genes?
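
For the PCA check, a minimal base-R sketch could look like the following. It assumes a genes x samples matrix of normalized counts; the function name `sample_pca`, the `n_top` argument and the toy data are made up for illustration. DESeq2 also offers `plotPCA()` on transformed data, which may be more convenient.

```r
# Sketch: PCA of samples from a genes x samples matrix of normalized counts.
sample_pca <- function(normalized_counts, n_top = 500) {
  log_expr <- log2(normalized_counts + 1)
  # Keep the most variable genes so that flat genes do not dilute the signal.
  gene_var <- apply(log_expr, 1, var)
  top <- order(gene_var, decreasing = TRUE)[seq_len(min(n_top, nrow(log_expr)))]
  pca <- prcomp(t(log_expr[top, , drop = FALSE]), center = TRUE, scale. = FALSE)
  pct_var <- 100 * pca$sdev^2 / sum(pca$sdev^2)
  list(scores = as.data.frame(pca$x[, 1:2]), pct_var = pct_var[1:2])
}

# Toy data: two groups of three samples with a strong, deterministic group effect.
group <- rep(c("stable", "unstable"), each = 3)
toy <- sapply(seq_along(group), function(j) {
  base <- if (group[j] == "stable") rep(c(200, 20), each = 10) else rep(c(20, 200), each = 10)
  base + (j %% 3) * (1:20) / 4   # small per-sample perturbation
})
dimnames(toy) <- list(paste0("gene", 1:20), paste0(group, "_", 1:6))

pca <- sample_pca(toy)
pca$scores    # PC1 and PC2 coordinates, one row per sample
pca$pct_var   # percent of variance explained by PC1 and PC2
# To plot: plot(pca$scores$PC1, pca$scores$PC2, col = factor(group), pch = 19)
```

If the two phenotype groups overlap on the first components in your real data, that would be consistent with a weak signal.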

What MSigDB collections are you using?

If you have a dataset with low power and are running a lot of gene sets, particularly combining multiple collections, you can run into situations like this.
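
To make the multiple-testing point concrete, here is a toy R illustration. Note the substitution: GSEA derives its FDR q-values from permuted enrichment score distributions, not from Benjamini-Hochberg. BH (`p.adjust`) is used here only because it is simple to compute, and the helper `best_set_fdr` and the evenly spaced null p-values are assumptions of this sketch.

```r
# Illustration only: this is Benjamini-Hochberg, not GSEA's own FDR procedure.
best_set_fdr <- function(p_best, n_sets) {
  # Assumption: every other gene set carries no signal (evenly spaced p-values).
  p_null <- seq_len(n_sets - 1) / (n_sets - 1)
  p.adjust(c(p_best, p_null), method = "BH")[1]
}

# The same nominal p-value tested alongside 50 versus 5000 gene sets.
fdr_small <- best_set_fdr(0.002, n_sets = 50)
fdr_large <- best_set_fdr(0.002, n_sets = 5000)
c(small = fdr_small, large = fdr_large)
```

In this toy setup the gene set stays under 0.25 when 50 sets are tested but not when 5000 are, which is the kind of effect that combining many collections can produce.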

-Anthony



Adithi GR

Mar 31, 2026, 11:03:48 AM

to gsea-help

I have not done a PCA or any other analysis. Should I run one to see how well the two conditions separate?

DESeq2 gave around 50 to 60 significant genes with log2FC > 1 and padj < 0.05. I didn't use the log2 fold changes as the input for GSEA; I used the normalised counts from the differential analysis.

I was using a GMT file that was given to me, built from KEGG for Chinese hamster. I have also tried the mouse collections from MSigDB and saw the same issue.
I see gene sets that are upregulated, and some have a nominal p-value < 1, but none of them pass the FDR < 25% threshold.

Anthony Castanza

Apr 7, 2026, 6:28:37 PM

to gsea-help

Hi Adithi,

My apologies for the delay in getting back to you. Yes, I would generally recommend running PCA on your datasets; it gives a pretty good idea of how strongly the samples segregate. Seeing only 50-60 genes at a log2FC threshold of 1 suggests, to me at least, that the experiment might be underpowered. I can't really speak to the KEGG file that you received, but gene dropouts from the gene namespace conversion might further exacerbate this issue when running against the MSigDB mouse collections. How are you mapping the gene identifiers between your species and the mouse gene set data?
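
One way to check for identifier dropouts is to count how many genes of each gene set are actually present in the dataset. The base-R sketch below assumes the usual GMT layout (set name, description, then genes, tab-separated); the helper names `read_gmt_lines` and `gene_set_coverage` and the toy gene symbols are hypothetical.

```r
# Sketch: parse GMT lines (name <TAB> description <TAB> gene1 <TAB> gene2 ...).
read_gmt_lines <- function(lines) {
  fields <- strsplit(lines, "\t", fixed = TRUE)
  sets <- lapply(fields, function(f) {
    genes <- f[-(1:2)]
    unique(genes[nzchar(genes)])
  })
  names(sets) <- vapply(fields, `[`, character(1), 1)
  sets
}

# Count, per gene set, how many member genes appear in the dataset.
gene_set_coverage <- function(gene_sets, dataset_genes) {
  coverage <- data.frame(
    gene_set = names(gene_sets),
    set_size = lengths(gene_sets),
    n_found = vapply(gene_sets, function(g) sum(g %in% dataset_genes), integer(1)),
    row.names = NULL
  )
  coverage$fraction_found <- coverage$n_found / coverage$set_size
  coverage
}

# Toy example; for a real file use readLines("your_file.gmt") and the
# row names of the expression matrix.
gmt <- c("SET_A\tna\tActb\tGapdh\tTp53", "SET_B\tna\tMyc\tXyz1")
coverage <- gene_set_coverage(read_gmt_lines(gmt), c("Actb", "Gapdh", "Myc"))
coverage
```

Sets whose `n_found` falls below GSEA's minimum gene set size (15 by default) are excluded from the analysis, so a low `fraction_found` across many sets would point to a mapping problem.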

Do you have an example you can share of one of the heatmaps and null distribution plots, from one of the gene sets that scored highly but was still not significant?

If there is a power issue, you might need to use gene set permutation, either directly or by supplying ranked data from DESeq2 (such as the test statistic column) to the Preranked mode. This test is generally less statistically rigorous, but it can be useful for datasets without strong sample segregation.
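
A minimal sketch of preparing such a ranked list in R is below. It assumes a results table with gene identifiers as row names and a `stat` column, as `as.data.frame(results(dds))` would give for a Wald test; the helper name `make_rnk`, the toy values and the output file name are made up.

```r
# Sketch: build a ranked list for GSEA Preranked from a DESeq2-style results table.
make_rnk <- function(res_df, stat_col = "stat") {
  rnk <- data.frame(gene = rownames(res_df), score = res_df[[stat_col]],
                    stringsAsFactors = FALSE)
  rnk <- rnk[!is.na(rnk$score), ]                     # drop genes without a statistic
  rnk <- rnk[order(rnk$score, decreasing = TRUE), ]   # most positive first
  rownames(rnk) <- NULL
  rnk
}

# Toy stand-in for as.data.frame(results(dds)).
toy_res <- data.frame(stat = c(2.5, NA, -3.1, 0.4),
                      row.names = c("GeneA", "GeneB", "GeneC", "GeneD"))
rnk <- make_rnk(toy_res)

# Two tab-separated columns, no header: gene identifier and score.
rnk_path <- file.path(tempdir(), "stable_vs_unstable.rnk")
write.table(rnk, rnk_path, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
```

The gene identifiers in the file need to match the namespace of the gene sets, so any species mapping should be applied before writing it.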

-Anthony

