20-Oct-2023: MSigDB 2023.2 released. Introducing new subcollections from KEGG_MEDICUS and the Curated Cancer Cell Atlas (3CA), plus new user-submitted sets for C2 and M2 CGP. Updated with gene data from Ensembl 110. Human and Mouse collections for Reactome, GO, and WikiPathways have been updated. See the release notes for details.
20-Jul-2023: We have introduced a new interactive Compendia Expression Profiles tool using Next-Generation Clustered Heat Maps (NG-CHM) from the Department of Bioinformatics and Computational Biology at the MD Anderson Cancer Center. Unlike our static image tool, these heatmaps allow interactive exploration of the expression profile of MSigDB gene sets or user-defined gene lists (on our Investigate page).
24-Mar-2023: With the release of MSigDB 2023.1 we are introducing a SQLite database for the gene sets in both the Human (2023.1.Hs) and the Mouse (2023.1.Mm) resources. This new format brings the MSigDB contents and metadata with all of the searchability and manipulative power of a full relational database. See our documentation for more details on the contents and usage.
3-Mar-2023: MSigDB 2023.1 released. Updated to gene data from Ensembl 109. Human and Mouse collections for Reactome, GO, and WikiPathways have been updated, as well as HPO. Lung atlas sets from He et al. have been added to C8 and uterine cell types from Zhang et al. to M8, plus new user-submitted sets for C2 and M2 CGP. See the release notes for details.
7-Sep-2022: Announcing the first release of Mouse MSigDB (v2022.1.Mm) with 16,000 gene sets that can be used directly for GSEA analysis of mouse datasets without the need for orthology conversion. A new release of Human MSigDB (v2022.1.Hs) includes updates to Reactome, GO, HPO, and WikiPathways. See the 2022.1.Hs release notes and 2022.1.Mm release notes for details.
Please register to download the GSEA software, access our web tools, and view the MSigDB gene sets. After registering, you can log in at any time using your email address. Registration is free. Its only purpose is to help us track usage for reports to our funding agencies.
Gene set enrichment analysis (GSEA) (also called functional enrichment analysis or pathway enrichment analysis) is a method to identify classes of genes or proteins that are over-represented in a large set of genes or proteins, and may have an association with different phenotypes (e.g. different organism growth patterns or diseases). The method uses statistical approaches to identify significantly enriched or depleted groups of genes. Transcriptomics technologies and proteomics results often identify thousands of genes, which are used for the analysis.[1]
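One common way to test whether a gene set is over-represented in a gene list is a one-sided hypergeometric test. The sketch below is a minimal illustration under that assumption, not the procedure of any specific tool named in this article; the function name and the example numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(universe_size, set_size, list_size, overlap):
    """Return P(X >= overlap) for a hypergeometric draw.

    universe_size: number of genes that could have been selected
    set_size:      number of universe genes belonging to the gene set
    list_size:     number of genes in the list of interest
    overlap:       number of list genes that belong to the gene set
    """
    max_overlap = min(set_size, list_size)
    total = comb(universe_size, list_size)
    # Sum the probability of every overlap at least as large as the one observed.
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, list_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 genes measured, a 100-gene set,
# a 200-gene list of interest, 10 of which fall in the set.
p_value = hypergeom_enrichment_pvalue(20000, 100, 200, 10)
print(f"p = {p_value:.3g}")
```

A small p-value suggests that the overlap is larger than expected by chance. In practice the p-values from many gene sets would also need a multiple-testing correction, which is omitted here.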
After the completion of the Human Genome Project, the problem of how to interpret and analyze it remained. In order to seek out genes associated with diseases, DNA microarrays were used to measure the amount of gene expression in different cells. Microarrays covering thousands of different genes were run, and the results for two different cell categories, e.g. normal cells versus cancerous cells, were compared. However, this method of comparison is not sensitive enough to detect the subtle differences between the expression of individual genes, because diseases typically involve entire groups of genes.[2] Multiple genes are linked to a single biological pathway, and so it is the additive change in expression within gene sets that leads to the difference in phenotypic expression. Gene Set Enrichment Analysis was developed[2] to focus on the changes of expression in groups of a priori defined gene sets. By doing so, this method resolves the problem of the undetectable, small changes in the expression of single genes.[3]
Gene set enrichment analysis uses a priori gene sets that have been grouped together by their involvement in the same biological pathway, or by proximal location on a chromosome.[1] A database of these predefined sets can be found at the Molecular Signatures Database (MSigDB).[4][5] In GSEA, DNA microarray or, more recently, RNA-Seq experiments are still performed and compared between two cell categories, but instead of focusing on individual genes in a long list, the focus is put on a gene set.[1] Researchers analyze whether the majority of genes in the set fall at the extremes of this list: the top and bottom of the list correspond to the largest differences in expression between the two cell types. If the gene set falls at either the top (over-expressed) or bottom (under-expressed), it is thought to be related to the phenotypic differences.
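The idea of checking whether a gene set clusters at the top or bottom of a ranked list can be sketched as a running-sum statistic. The code below is a simplified, unweighted Kolmogorov-Smirnov-like score written for illustration; the published GSEA method additionally weights each hit by the gene's ranking metric and assesses significance by permutation, both of which are omitted here. The function name and toy gene labels are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Return the running-sum deviation from zero with the largest magnitude.

    ranked_genes: genes ordered from most over-expressed to most under-expressed
    gene_set:     iterable of genes belonging to the a priori set
    """
    members = set(gene_set)
    n_hits = sum(1 for gene in ranked_genes if gene in members)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")

    hit_step = 1.0 / n_hits    # step up when a set member is encountered
    miss_step = 1.0 / n_miss   # step down otherwise
    running = 0.0
    best = 0.0
    for gene in ranked_genes:
        running += hit_step if gene in members else -miss_step
        if abs(running) > abs(best):
            best = running
    return best


ranked = ["A", "B", "C", "D", "E", "F"]          # toy ranking, top = over-expressed
top_score = enrichment_score(ranked, {"A", "B"})     # set sits at the top
bottom_score = enrichment_score(ranked, {"E", "F"})  # set sits at the bottom
print(top_score, bottom_score)
```

A strongly positive score indicates a set concentrated at the top of the list, and a strongly negative score a set concentrated at the bottom. A set scattered evenly through the list yields a score near zero.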
One other limitation to Gene Set Enrichment Analysis is that the results are very dependent on the algorithm that clusters the genes, and the number of clusters being tested.[7] Spectral Gene Set Enrichment (SGSE) is a proposed, unsupervised test. The method's founders claim that it is a better way to find associations between MSigDB gene sets and microarray data. The general steps include:
GSEA uses complicated statistics, so it requires a computer program to run the calculations. GSEA has become standard practice, and there are many websites and downloadable programs that will provide the data sets and run the analysis.
NASQAR (Nucleic Acid SeQuence Analysis Resource) is an open source, web-based platform for high-throughput sequencing data analysis and visualization.[8][9] Users can perform GSEA using the popular R-based clusterProfiler package [10] in a simple, user-friendly web app. NASQAR currently supports GO Term and KEGG Pathway enrichment with all organisms supported by an Org.Db database.[11]
WebGestalt[14] is a web-based gene set analysis toolkit. It supports three well-established and complementary methods for enrichment analysis: Over-Representation Analysis (ORA), Gene Set Enrichment Analysis (GSEA), and Network Topology-based Analysis (NTA). Analysis can be performed against 12 organisms and 321,251 functional categories using 354 gene identifiers from various databases and technology platforms.
Enrichr[15][16][17] is a gene set enrichment analysis tool for mammalian gene sets. It contains background libraries for transcription regulation, pathways and protein interactions, ontologies including GO and the human and mouse phenotype ontologies, signatures from cells treated with drugs, gene sets associated with human diseases, and expression of genes in different cells and tissues. Enrichr was developed by the Ma'ayan Laboratory at the Icahn School of Medicine at Mount Sinai.[18] The background libraries are from over 200 resources and contain over 450,000 annotated gene sets. The tool can be accessed through an API and provides different ways to visualize the results.
GeneSCF is a real-time based functional enrichment tool with support for multiple organisms[19] and is designed to overcome the problems associated with using outdated resources and databases.[20] Its stated advantages include: real-time analysis, so users do not have to wait for an enrichment tool to be updated; easy integration into NGS pipelines for computational biologists; support for multiple organisms; enrichment analysis of multiple gene lists against multiple source databases in a single run; and retrieval or download of complete GO terms, pathways, or functions with their associated genes as a simple table in a plain text file.[21][22]
DAVID is the Database for Annotation, Visualization and Integrated Discovery, a bioinformatics tool that pools together information from most major bioinformatics sources, with the aim of analyzing large gene lists in a high-throughput manner.[23] DAVID goes beyond standard GSEA with additional functions such as switching between gene and protein identifiers on a genome-wide scale.[23] However, the annotations used by DAVID were not updated between October 2016 and December 2021,[24] which can have a considerable impact on the practical interpretation of results.[25] The most recent update was performed in 2021.[24]
Metascape is a biologist-oriented gene-list analysis portal.[26] Metascape integrates pathway enrichment analysis, protein complex analysis, and multi-list meta-analysis into one seamless workflow accessible through a significantly simplified user interface. Metascape maintains analysis accuracy by updating its 40 underlying knowledgebases monthly. Metascape presents results using easy-to-interpret graphics, spreadsheets, and publication quality presentations, and is freely available.[27]
The Gene Ontology (GO) consortium has also developed its own online GO term enrichment tool,[28] allowing species-specific enrichment analysis versus the complete database, coarser-grained GO slims, or custom references.[29]
In 2010, Gill Bejerano from Stanford University released the Genomic Regions Enrichment of Annotations Tool (GREAT), software that takes advantage of regulatory domains to better associate gene ontology terms with genes.[30] Its primary purpose is to identify pathways and processes that are significantly associated with factor regulating activity. This method maps genes with regulatory regions through a hypergeometric test over genes, inferring proximal gene regulatory domains. It does this by using the total fraction of the genome associated with a given ontology term as the expected fraction of input regions associated with the term by chance. Enrichment is calculated over all regulatory regions, and several experiments were performed to validate GREAT, one of which was a set of enrichment analyses on 8 ChIP-seq datasets.[31]
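Using the genome fraction annotated with a term as the chance probability that an input region hits that term can be read as a binomial model over input regions. The sketch below illustrates only that calculation, under that assumption; it is not GREAT's implementation, and the function name and example numbers are hypothetical.

```python
from math import exp, lgamma, log


def binomial_region_pvalue(n_regions, n_hits, genome_fraction):
    """Return P(X >= n_hits) for X ~ Binomial(n_regions, genome_fraction).

    n_regions:       number of input genomic regions
    n_hits:          number of input regions falling in the term's regulatory domains
    genome_fraction: fraction of the genome covered by those domains (0 < f < 1)
    """
    if not 0.0 < genome_fraction < 1.0:
        raise ValueError("genome_fraction must lie strictly between 0 and 1")

    def log_pmf(k):
        # Log of the binomial probability mass, computed in log space
        # so that large region counts do not overflow.
        log_choose = lgamma(n_regions + 1) - lgamma(k + 1) - lgamma(n_regions - k + 1)
        return (log_choose
                + k * log(genome_fraction)
                + (n_regions - k) * log(1.0 - genome_fraction))

    return sum(exp(log_pmf(k)) for k in range(n_hits, n_regions + 1))


# Hypothetical numbers: 1,000 input regions, 60 of them in domains
# that together cover 3% of the genome (30 hits expected by chance).
p_value = binomial_region_pvalue(1000, 60, 0.03)
print(f"p = {p_value:.3g}")
```

The expected number of hits by chance is `n_regions * genome_fraction`, so a term is flagged when the observed count is well above that value.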
ToppGene is a one-stop portal for gene list enrichment analysis and candidate gene prioritization based on functional annotations and protein interaction networks.[36] It is developed and maintained by the Division of Biomedical Informatics at Cincinnati Children's Hospital Medical Center.
Quantitative Set Analysis for Gene Expression (QuSAGE) is a computational method for gene set enrichment analysis.[37] QuSAGE improves power by accounting for inter-gene correlations and quantifies gene set activity with a complete probability density function (PDF). From this PDF, P values and confidence intervals can be easily extracted. Preserving the PDF also allows for post-hoc analysis (e.g., pair-wise comparisons of gene set activity) while maintaining statistical traceability. Turner et al. extended the applicability of QuSAGE to longitudinal studies by adding functionality for general linear mixed models.[38] QuSAGE was used by the NIH/NIAID Human Immunology Project Consortium to identify baseline transcriptional signatures that were associated with human influenza vaccination responses.[39] QuSAGE is available as an R/Bioconductor package, and is maintained by the Kleinstein Lab at Yale School of Medicine.