Although genomewide RNA expression analysis has become a routine tool in biomedical research, extracting biological insight from such information remains a major challenge. Here, we describe a powerful analytical method called Gene Set Enrichment Analysis (GSEA) for interpreting gene expression data. The method derives its power by focusing on gene sets, that is, groups of genes that share common biological function, chromosomal location, or regulation. We demonstrate how GSEA yields insights into several cancer-related data sets, including leukemia and lung cancer. Notably, where single-gene analysis finds little similarity between two independent studies of patient survival in lung cancer, GSEA reveals many biological pathways in common. The GSEA method is embodied in a freely available software package, together with an initial database of 1,325 biologically defined gene sets.
Please register to download the GSEA software and the MSigDB gene sets, and to use our web tools. After registering, you can log in at any time using your email address. Registration is free. Its only purpose is to help us track usage for reports to our funding agencies.
To cite your use of the Molecular Signatures Database (MSigDB), a joint project of UC San Diego and Broad Institute, please reference Subramanian, Tamayo, et al. (2005, PNAS) and one or more of the following as appropriate: Liberzon, et al. (2011, Bioinformatics), Liberzon, et al. (2015, Cell Systems), and also the source for the gene set as listed on the gene set page. If you use Mouse MSigDB, please also cite Castanza, et al. (2023, Nature Methods).
The 34550 gene sets in the Human Molecular Signatures Database (MSigDB) are divided into 9 major collections, and several subcollections. See the table below for a brief description of each, and the Human MSigDB Collections: Details and Acknowledgments page for more detailed descriptions. See also the latest MSigDB Release Notes.
Click on the "browse gene sets" links in the table below to view the gene sets in a collection. Or download the gene sets in a collection by clicking on the links below the "Download Files" headings. For a description of the GMT file format see the Data Formats guide in the Documentation section. The gene sets can be downloaded as NCBI (Entrez) Gene Identifiers or HUGO (HGNC) Gene Symbols. There are also JSON bundles containing the Human gene sets using HUGO (HGNC) Gene Symbols along with some useful metadata. A SQLite database containing all the Human MSigDB gene sets is available as well.
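As a rough illustration of working with a downloaded collection, here is a minimal Python sketch of a GMT reader. It assumes the usual GMT layout (one gene set per line, tab-separated: set name, description, then the member genes). The function name `parse_gmt` and the example set names are hypothetical.

```python
from typing import Dict, List


def parse_gmt(text: str) -> Dict[str, List[str]]:
    """Parse GMT-formatted text into {gene_set_name: [gene, ...]}."""
    gene_sets: Dict[str, List[str]] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # need a name, a description, and at least one gene
        name, _description, *genes = fields
        gene_sets[name] = [g for g in genes if g]  # drop empty trailing fields
    return gene_sets


# Hypothetical two-set example; real files come from the MSigDB download links.
example = "SET_A\tsome description\tTP53\tBRCA1\tEGFR\nSET_B\tanother\tMYC\tKRAS\n"
sets = parse_gmt(example)
print(sets["SET_A"])
```

To read a file from disk, pass `open(path).read()` to `parse_gmt`.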
Use the navigation bar on the left to display documentation on GSEA software, MSigDB database or GSEA/MSigDB web site. If you have comments or questions not answered by the FAQ or the User Guide, contact us at groups.google.com/group/gsea-help.
If you are new to GSEA, see the Tutorial for a brief overview of the software. If you have a question, see the FAQ or the User Guide. The User Guide describes how to prepare data files, load data files, run the gene set enrichment analysis, and interpret the results. It also includes instructions for running GSEA from the command line and a Quick Reference section, which describes each window of the GSEA desktop application.
The GSEA method was originally developed for analysis of microarray data. In order to best adapt this method for RNA-sequencing data sets the GSEA team has developed a collection of guidelines and suggestions which describe how to properly handle these data.
The current release of the Molecular Signatures Database is divided into two parts: the MSigDB Human Collections and the MSigDB Mouse Collections. Release notes for the current version of the Human collections are available here: (MSigDB v2023.1.Hs) and the release notes for the current version of the Mouse collections are available here: (MSigDB v2023.1.Mm). For information about MSigDB and the gene sets, see the MSigDB web site.
Hi, I'm trying to start a project in R where I input cancer patient data to find DEGs and ultimately search for possible pharmaceutical targets. My question is whether I can input the same data into both GSEA and DEG analysis so that each confirms the other's conclusions. Right now, I'm only using DEG analysis (voom+limma in R) to filter/select significant genes.
I know that these two analyses are completely different- GSEA takes in a priori gene sets and gives information relevant to significant gene SETS for each phenotype. DEG will look into individual GENES (not gene sets) and gives us a list of differentially expressed genes for each phenotype.
However, I was wondering if these can work together in harmony so that we can first use GSEA to filter significant gene sets and then use DEG to test individual genes significantly enriched in those gene sets of GSEA. I thought this would help because just performing DEG inherently lacks biological significance. But while GSEA has biological significance, it doesn't have the ability to detect at the level of individual genes. So why not make them work together to complement each other's strengths/weaknesses?
For example, I would run GSEA for two different cancer types (phenotype) A and B, and find gene set X is overexpressed. Then I would look into which group of individual genes are contributing the most to the enrichment score for gene set X. Then I would run a DEG analysis of those individual genes. If I find some genes that are significantly overexpressed for specific types of cancers, that actually itself can be a probable target.
So, you can first find your DEG list, with gene name/symbol/ID, pvalue and log FC. Then you use this list to run a GSEA. Also, it's better to run the GSEA on ALL your genes, not only over-expressed/under-expressed.
What GSEA does is rank your genes based on a value that you provide. Usually this value is the logFC of the genes, but I have also seen calculations like pvalue * logFC, so that you also take the significance of the gene into consideration, even though GSEA itself doesn't care about it! ;)
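As a rough illustration of the ranking step, here is a minimal Python sketch. The table layout (keys `gene`, `logfc`, `pvalue`), the function name `rank_genes`, and the example values are all made up for illustration. Note that a literal pvalue * logFC product would push the least significant genes to the extremes, so the sketch uses a commonly seen variant instead, sign(logFC) * -log10(pvalue). That substitution is an assumption, not the formula quoted above.

```python
import math


def rank_genes(rows, metric="logfc"):
    """Return [(gene, score), ...] sorted from most positive to most negative."""
    scored = []
    for row in rows:
        if metric == "logfc":
            score = row["logfc"]
        elif metric == "signed_logp":
            sign = 1.0 if row["logfc"] >= 0 else -1.0
            # Clamp the p-value to avoid log10(0) for extremely small values.
            score = sign * -math.log10(max(row["pvalue"], 1e-300))
        else:
            raise ValueError(f"unknown metric: {metric}")
        scored.append((row["gene"], score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical output of a differential expression run (ALL genes, no cutoff).
table = [
    {"gene": "A", "logfc": 2.0, "pvalue": 0.04},
    {"gene": "B", "logfc": 1.0, "pvalue": 1e-6},
    {"gene": "C", "logfc": -3.0, "pvalue": 0.001},
]
print([g for g, _ in rank_genes(table, "logfc")])
print([g for g, _ in rank_genes(table, "signed_logp")])
```

The two metrics can order genes differently: a modest but highly significant change moves up the list under the signed significance score.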
Now, imagine that you have a DEG list with logFC. You load it into the GSEA program ( -msigdb.org/gsea/index.jsp , from the Broad Institute) or into an online tool like Enrichr. GSEA basically takes the ranked genes (from + to -, according to logFC) and compares them against the gene sets specified.
Then, and this is VERY important, the result is not that a specific pathway is up- or down-regulated, but that the pathway is affected in some way by the condition you're studying. In fact, you will have enriched genes that are both up- and down-regulated. The result of GSEA is a broad picture of what's going on in your cell line / model.
You should clarify this part. The result is actually an enrichment score with a specific direction (up or down). Not all genes are in the same direction, but there should be an enrichment at one end of the spectrum (see also: "leading-edge genes").
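To make the point about direction concrete, here is a minimal Python sketch of a running-sum enrichment score. It is a simplified version of the weighted running-sum statistic described in the GSEA paper (Subramanian et al., 2005), with no permutation testing and no normalization, so it is not a substitute for the GSEA software. The function name `enrichment_score` and the toy ranked list are hypothetical.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """ranked: [(gene, score), ...] sorted from high to low score.

    Returns (es, leading_edge): the signed maximum deviation of the running
    sum from zero, and the set members on the side of the peak.
    """
    members = set(gene_set)
    hits = [gene in members for gene, _ in ranked]
    n_miss = len(ranked) - sum(hits)
    if not any(hits) or n_miss == 0:
        raise ValueError("need at least one gene inside and one outside the set")
    # Hits step up in proportion to |score|^weight; misses step down evenly.
    hit_w = [abs(s) ** weight if h else 0.0 for (_, s), h in zip(ranked, hits)]
    total = sum(hit_w)
    if total == 0:
        raise ValueError("all set members have a zero score")
    running, es, peak = 0.0, 0.0, -1
    for i, is_hit in enumerate(hits):
        running += hit_w[i] / total if is_hit else -1.0 / n_miss
        if abs(running) > abs(es):
            es, peak = running, i
    if es >= 0:  # enriched at the top: members at or before the peak
        edge = [g for (g, _), h in zip(ranked[: peak + 1], hits[: peak + 1]) if h]
    else:  # enriched at the bottom: members after the peak
        edge = [g for (g, _), h in zip(ranked[peak + 1 :], hits[peak + 1 :]) if h]
    return es, edge


ranked = [("A", 3.0), ("B", 2.0), ("C", 1.0), ("D", -1.0), ("E", -2.0), ("F", -3.0)]
print(enrichment_score(ranked, {"A", "B"}))  # positive score: top of the list
print(enrichment_score(ranked, {"E", "F"}))  # negative score: bottom of the list
```

The sign of the score says which end of the ranked list the set concentrates at, and the leading-edge list names the individual genes driving it.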
Hey, thanks for your kind input :) I do understand your method and why you would suggest one like that. However, don't you think both would work? GSEA first then DEG/DEG first then GSEA? But the reason why I thought GSEA first then DEG would work better is because if you do DEG first for tens of thousands of individual genes, then it is simply too inefficient- GSEA would help reduce dimensionality for the subsequent DEG test. Though, it would be interesting to see the differences of results of DEG first then GSEA vs GSEA first then DEG...
Also, when you say the result of a GSEA is not that a specific pathway is up- or down-regulated, you're basically saying that the reason why we do GSEA is just to see which pathway (gene sets) is affected by the phenotype, right? So basically up-regulation and down-regulation isn't significant in GSEA...
Hi! I don't understand how you would do a GSEA without having a list of Differentially Expressed Genes... What would you use as input for the analysis? GSEA starts with a list of DEGs, so in any case you need to produce it before running GSEA. So, you do the DEG analysis and find genes up-regulated and down-regulated in tumor VS normal. Now you have a list of DEGs that you can analyse in different ways.
One way, for example, is to decide on cutoffs for pvalue and logFC to define what is really DE between the two conditions; imagine pvalue < 0.05 and logFC > 1.5. This gives you genes that you can further analyze with an enrichment analysis such as GO or KEGG pathways, on only the upregulated genes for example (or only the downregulated ones). In this way you can find pathways that move in the same direction as your genes (+ or -).
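A minimal Python sketch of this cutoff-then-enrichment approach follows. It assumes a one-sided hypergeometric test for the enrichment step, which is a common choice for GO/KEGG over-representation tools, though any particular tool may do something different. The function names, the row layout, and the example numbers are hypothetical.

```python
from math import comb


def select_degs(rows, p_cut=0.05, lfc_cut=1.5, direction="up"):
    """Return gene names passing the p-value and logFC cutoffs in one direction."""
    if direction == "up":
        return [r["gene"] for r in rows if r["pvalue"] < p_cut and r["logfc"] > lfc_cut]
    if direction == "down":
        return [r["gene"] for r in rows if r["pvalue"] < p_cut and r["logfc"] < -lfc_cut]
    raise ValueError("direction must be 'up' or 'down'")


def hypergeom_pvalue(n_universe, n_set, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn without replacement."""
    upper = min(n_set, n_selected)
    favourable = sum(
        comb(n_set, k) * comb(n_universe - n_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return favourable / comb(n_universe, n_selected)


rows = [
    {"gene": "A", "logfc": 2.1, "pvalue": 0.001},
    {"gene": "B", "logfc": 0.4, "pvalue": 0.0001},  # significant but small change
    {"gene": "C", "logfc": -2.5, "pvalue": 0.01},
    {"gene": "D", "logfc": 3.0, "pvalue": 0.20},    # large change, not significant
]
up = select_degs(rows, direction="up")
pathway = {"A", "D"}  # hypothetical pathway membership
overlap = len(set(up) & pathway)
print(up, hypergeom_pvalue(len(rows), len(pathway), len(up), overlap))
```

Unlike a ranked-list method, the result here depends on the chosen cutoffs, since genes just below a threshold contribute nothing.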
With GSEA you use all the genes, as I said before, and you obtain a list of pathways/biological processes in which your genes are involved, based on the ranking provided. In these lists you can find both up- and downregulated genes, because, as you know, a pathway is composed of many components. So, GSEA is a general picture of what's going on.
For your aim, both methods can be good. You can find gene X, a very important target to block, at the top of your DEG analysis; but you can also find that "epithelial to mesenchymal transition" is enriched in your GSEA analysis (based on the same DEG list), so you can pick one of the many genes involved as a target to block the entire pathway, instead of only a few genes. ;)
You're right, sorry, I wrote something that can create confusion: with DEG list I mean the list that you obtain after you analyze your two conditions in the RNAseq/microarray, so technically they are just the genes that come out from analysis with pval and logFC. :)
How about pathway analysis with the R package "ROntoTools"? Would that be a better idea, since it would provide actual biological significance and is more accurate than gene set analysis (DEG, GSEA)?
Hi! I know this post is from a while ago. I saw your opinion that "it's better to run the GSEA on ALL your genes, not only over-expressed/under-expressed." Why do you think so? I usually do GSEA on the differentially expressed genes. If my DEGs have a trend, will that bias the GSEA results?