MSigDB Hallmark 2020

Lior Gonzales

Aug 4, 2024, 4:55:21 PM

to dipenilan

The Molecular Signatures Database (MSigDB) is one of the most widely used and comprehensive databases of gene sets for performing gene set enrichment analysis. Since its creation, MSigDB has grown beyond its roots in metabolic disease and cancer to include >10,000 gene sets. These better represent a wider range of biological processes and diseases, but the utility of the database is reduced by increased redundancy across, and heterogeneity within, gene sets. To address this challenge, here we use a combination of automated approaches and expert curation to develop a collection of "hallmark" gene sets as part of MSigDB. Each hallmark in this collection consists of a "refined" gene set, derived from multiple "founder" sets, that conveys a specific biological state or process and displays coherent expression. The hallmarks effectively summarize most of the relevant information of the original founder sets and, by reducing both variation and redundancy, provide more refined and concise inputs for gene set enrichment analysis.

I am able to use the msigdbr library to import the gene collections from MSigDB into R, but I am unsure which function to use to compute the overlaps between the genes in my gene set and the gene sets in MSigDB and to obtain the FDR-adjusted p-values. Are there any tutorials online for this method, or any example code?
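
To make the overlap question concrete, here is a minimal sketch of what such a computation usually involves: a one-sided hypergeometric test per gene set, followed by Benjamini-Hochberg adjustment. It is written in plain Python to show the arithmetic only, and it is not the code behind the MSigDB web tool or any R package. The function names, the toy gene sets, and the `universe_size` argument are all assumptions made for illustration; the sketch also assumes every query gene and every set member belongs to the background universe.

```python
from math import comb


def hypergeom_sf(k, M, n, N):
    """P(X >= k): chance of seeing at least k set members when N genes are
    drawn from a universe of M genes that contains n set members."""
    upper = min(n, N)
    total = comb(M, N)
    hits = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return hits / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def overlap_enrichment(query, gene_sets, universe_size):
    """Test a query gene list against each gene set (hypothetical helper).

    gene_sets maps a set name to an iterable of gene identifiers.
    Returns one record per gene set, sorted by raw p-value.
    """
    query = set(query)
    names = list(gene_sets)
    overlaps, pvalues = [], []
    for name in names:
        members = set(gene_sets[name])
        k = len(query & members)
        overlaps.append(k)
        pvalues.append(hypergeom_sf(k, universe_size, len(members), len(query)))
    fdr = benjamini_hochberg(pvalues)
    records = [
        {"gene_set": name, "overlap": k, "p_value": p, "fdr": q}
        for name, k, p, q in zip(names, overlaps, pvalues, fdr)
    ]
    return sorted(records, key=lambda r: r["p_value"])
```

For example, `overlap_enrichment(["G1", "G2", "G3", "G4"], {"SET_A": ["G1", "G2", "G5"], "SET_B": ["G6", "G7"]}, universe_size=10)` returns one record per set with its overlap count, raw p-value, and adjusted value. The choice of background universe changes the p-values considerably, so it should be stated alongside any results.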

Yes. Until very recently, the recommendation from GSEA developers was to use the pre-ranked GSEA for RNA-seq data, so that has been the default one in my mind since most transcriptomic data is RNA-seq.

Pathway analysis is a common task in genomics research and there are many available R-based software tools. Depending on the tool, it may be necessary to import the pathways, translate genes to the appropriate species, convert between symbols and IDs, and format the resulting object.

This package was generated with MSigDB v7.5.1 (released January 2022). The MSigDB version is used as the base of the msigdbr package version. You can check the installed version with packageVersion("msigdbr").

Yes. You can then import the GMT files (with getGmt() from the GSEABase package, for example). The GMTs only include the human genes, even for gene sets generated from mouse experiments. If you are working with non-human data, you then have to convert the MSigDB genes to your organism or your genes to human.
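
For readers who want to see what a GMT file contains before loading it with a dedicated reader, the format is simple: each line is one gene set, with tab-separated fields for the set name, a description (often a URL), and then the member genes. The following is a minimal sketch of a parser under that assumption; `parse_gmt` is a hypothetical helper, not part of GSEABase or msigdbr, and the set and gene names in the usage note are placeholders.

```python
def parse_gmt(lines):
    """Parse GMT-formatted lines into {set_name: [gene, ...]}.

    Each line is expected to be: name <TAB> description <TAB> gene1 <TAB> gene2 ...
    Lines with fewer than three fields are skipped.
    """
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue
        name, _description, *genes = fields
        # Drop empty fields left by trailing tabs.
        gene_sets[name] = [gene for gene in genes if gene]
    return gene_sets
```

Calling `parse_gmt(["SET_A\tplaceholder description\tGENE1\tGENE2"])` gives a dictionary with one entry, `SET_A`, holding the two genes. With a file on disk, the open file handle can be passed directly, since iterating over it yields lines.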

There are a few other resources that provide some of this functionality and served as an inspiration for this package. Ge Lab Gene Set Files has GMT files for many species. WEHI provides MSigDB gene sets in R format for human and mouse. MSigDF is based on the WEHI resource, but is converted to a more tidyverse-friendly data frame. These are updated at varying frequencies and may not use the latest version of MSigDB.

Gene homologs are provided by the HUGO Gene Nomenclature Committee at the European Bioinformatics Institute, which integrates the orthology assertions predicted for human genes by eggNOG, Ensembl Compara, HGNC, HomoloGene, Inparanoid, NCBI Gene Orthology, OMA, OrthoDB, OrthoMCL, Panther, PhylomeDB, TreeFam and ZFIN. For each human equivalent within each species, only the ortholog supported by the largest number of databases is used.
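
The selection rule described above (keep the ortholog with the most supporting databases) can be illustrated with a short sketch. This is not the msigdbr implementation; the input layout, the function name `pick_best_orthologs`, and the tie-breaking behaviour (first candidate seen wins) are assumptions for illustration, and the sketch handles a single target species at a time.

```python
def pick_best_orthologs(assertions):
    """Keep, for each human gene, the ortholog backed by the most databases.

    assertions: iterable of (human_gene, ortholog, supporting_databases)
    tuples for ONE target species. Ties keep the first candidate seen
    (an assumption; the source does not say how ties are resolved).
    """
    best = {}
    for human_gene, ortholog, databases in assertions:
        support = len(set(databases))
        if human_gene not in best or support > best[human_gene][1]:
            best[human_gene] = (ortholog, support)
    return {human_gene: ortholog for human_gene, (ortholog, _) in best.items()}
```

Given two candidate orthologs for the same human gene, one supported by three databases and one by a single database, the function keeps the one supported by three.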

I am currently working with TPM expression data obtained from RNA-seq analysis, and my dataset includes a diverse range of biotypes such as miRNA, lncRNA, pseudogenes, etc., resulting in a total of around 60,000 genes. As I intend to perform enrichment analysis (ssGSEA) using the hallmark gene list from MSigDB, I am faced with a crucial decision regarding whether to filter the data based on biotype='protein coding'.

Given the diverse nature of the genes in my dataset, I am uncertain about the potential impact of including non-protein coding biotypes on the enrichment analysis. Filtering by biotype='protein coding' seems like a logical step to focus on protein-coding genes relevant to the hallmark pathways, but I would like to seek the community's advice and experiences on this matter.

I appreciate any insights, experiences, or recommendations the community can provide to help me make an informed decision on whether to filter my RNA-seq data by biotype='protein coding' for hallmark pathway enrichment analysis.

Hi, a follow-up to your question: have you tried using the fgsea library for the ssGSEA analysis, or did you use GSVA directly? I am using the fgseaMultiLevel function and I imported the gene sets from MSigDB. My question is: how do you preprocess your data before applying the ssGSEA function to it? Do you perform normalization or log transformation, or do you just input the raw counts matrix after filtering?

Hi, the fgsea package does not implement the ssGSEA algorithm, but the GSVA package does: specifically, the original version by Barbie et al. (2009), described in the subsection "Signature Projection Method" of the Online Methods. I'd say ssGSEA probably works best with normalized logCPM or logTPM units of expression, but we did not develop ssGSEA, so you may get a more authoritative answer on the official support site for ssGSEA, which I believe is this Google Group.
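
As an illustration of what "logTPM" typically means in practice, here is a minimal sketch of a log2 transform with a pseudocount. The pseudocount of 1 and the dictionary-based matrix layout are assumptions chosen for the example, not a recommendation from the ssGSEA or GSVA authors.

```python
from math import log2


def log_transform(tpm_by_gene, pseudocount=1.0):
    """Return log2(TPM + pseudocount) for every gene and sample.

    tpm_by_gene maps a gene identifier to a list of per-sample TPM values.
    The pseudocount avoids log2(0) for unexpressed genes.
    """
    return {
        gene: [log2(value + pseudocount) for value in values]
        for gene, values in tpm_by_gene.items()
    }
```

With the default pseudocount, a TPM of 0 maps to 0 and a TPM of 3 maps to 2, so unexpressed genes stay at zero while large values are compressed.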

While you do not name it explicitly in your question, from your tags and the line of code you pasted, I assume you are interested in using the implementation of the ssGSEA method available in the GSVA Bioconductor package. In this package, the function gsva() filters out rows in your expression_matrix object for which there are no corresponding identifiers in the gene_sets object. Therefore, unless I'm misunderstanding your question, there is not much to worry about regarding non-protein-coding genes from the perspective of using the software. If there is an expression profile in your RNA-seq data set for a non-protein-coding gene that also forms part of a gene set in the MSigDB Hallmark collection, then gsva() will use it, and I do not see a reason to exclude it. By the way, in the line of code you wrote, there is no need to specify the kcdf and mx.diff parameters when method="ssgsea", since kcdf and mx.diff only apply when method="gsva".

This has been a common misunderstanding throughout the years and in the last release of GSVA we have deprecated this interface, in favor of an object-oriented one that does not allow the user to do this anymore. For instance, in the new interface (available in GSVA 1.50, Bioc release 3.18), to use the ssGSEA method with the inputs you show in your question, you should write the following:

H: Hallmarks is a new collection of 50 sets. These gene sets represent specific well defined biological states or processes and display coherent expression. The hallmark gene sets were generated by a computational methodology based on identifying gene set overlaps and extracting coherent representatives of them. Details of the procedure will become available after the manuscript describing it is accepted for publication. The hallmark gene sets reduce noise and redundancy and provide a better biological space for GSEA and other gene set-based analyses of genomic data.
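
To give a feel for what "gene set overlaps" means here, the sketch below computes pairwise Jaccard indices between gene sets, a common way to quantify redundancy. It is only an illustration of overlap measurement: the hallmark procedure itself is not described in this text, and nothing here should be read as the authors' method. The function names and the choice of the Jaccard index are assumptions.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index: shared genes divided by genes present in either set."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlaps(gene_sets):
    """Return {(name_x, name_y): jaccard} for every unordered pair of sets."""
    return {
        (x, y): jaccard(gene_sets[x], gene_sets[y])
        for x, y in combinations(sorted(gene_sets), 2)
    }
```

Pairs with a value near 1 are largely redundant, while pairs near 0 share almost no genes.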

Hallmark gene set pages provide links to the corresponding founder sets for more in-depth exploration. In addition, hallmark gene set pages include links to microarray data that served for refining and validation of the hallmark signatures.

The CP (Canonical Pathways) sub-collection has 10 new gene sets from the Matrisome Project. The "matrisome" refers to the ensemble of genes encoding extracellular matrix (ECM) and ECM-associated proteins (as defined by Naba and collaborators). The Matrisome Project is a collaborative effort between the laboratory of Richard Hynes at MIT, researchers at the Barbara K. Ostrom (1978) Bioinformatics & Computing Facility at the Koch Institute at MIT and the Broad Institute, pursuing extensive in silico and experimental characterization of ECM components.

Files from previous versions of MSigDB (v4.0, v3.1, v3.0, v2.5, v2.1 and v1.0) are archived and available on the Downloads page. You can view them through the MSigDB Browser tool in the GSEA desktop application.

Discover the correlated pathways/gene sets of a single pathway/gene set or discover correlation relationships among multiple pathways/gene sets. Draw a heatmap or create a network of your query and extract members of each pathway/gene set found in the available collections (MSigDB H hallmark, MSigDB C2 Canonical pathways, MSigDB C5 GO BP and Pathprint).

Genesets are simply a named list of character vectors which can be directly passed to hyper(). Alternatively, one can pass a gsets object, which can retain the name and version of the genesets one uses. This versioning will be included when exporting results or generating reports, which will ensure your results are reproducible.

Please pay attention to the versioning: hypeR defaults to the msigdbr version installed on your machine, which is updated frequently to follow the Broad's curated releases of the genesets. Check to make sure you are using the genesets you expect.

If msigdb genesets are not sufficient, we have also provided another set of functions for downloading and loading other publicly available genesets. This is facilitated by interfacing with the publicly available libraries hosted by enrichr.

A single-column data frame of labels where the rownames are unique identifiers. Leaf node labels should have an associated geneset, while internal nodes do not have to. The only genesets tested will be those in the list of genesets.

To do so, we use gene-set enrichment analysis, a group of methods designed to identify enriched functions represented by collections of genes known as gene-sets. This workflow will demonstrate functional analysis of transcriptomic data using the molecular signatures database (through the msigdb R/Bioconductor package) and a gene-set enrichment method, singscore. It will also demonstrate how higher-order biological themes can be identified in data using the vissE package. It will begin by loading gene expression data and gene-sets from the ExperimentHub using the emtdata and msigdb R/Bioconductor packages. Molecular phenotypes representing the functional characteristics of samples will be identified using the single-sample gene-set enrichment method, singscore. Finally, higher-order functional themes will be identified using vissE.