The Molecular Signatures Database (MSigDB) is a nice resource containing collections of gene sets designed for use with Gene Set Enrichment Analysis (GSEA) and its variants. It was co-developed with GSEA by the Broad Institute and is still maintained by them; you can read more in the classic paper, "Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles". It is particularly useful for interpreting high-throughput transcriptomic data. For example, instead of going through a list of differentially expressed genes one by one, you can use MSigDB and GSEA to test whether biological processes and pathways are enriched.
There are human and mouse collections; the human collections are grouped into:
- H - hallmark gene sets are coherently expressed signatures derived by aggregating many MSigDB gene sets to represent well-defined biological states or processes.
- C1 - positional gene sets corresponding to human chromosome cytogenetic bands.
- C2 - curated gene sets from online pathway databases, publications in PubMed, and knowledge of domain experts.
- C3 - regulatory target gene sets based on gene target predictions for microRNA seed sequences and predicted transcription factor binding sites.
- C4 - computational gene sets defined by mining large collections of cancer-oriented expression data.
- C5 - ontology gene sets consist of genes annotated by the same ontology term.
- C6 - oncogenic signature gene sets defined directly from microarray gene expression data from cancer gene perturbations.
- C7 - immunologic signature gene sets represent cell states and perturbations within the immune system.
- C8 - cell type signature gene sets curated from cluster markers identified in single-cell sequencing studies of human tissue.
The mouse collections are grouped into:
- MH - mouse-ortholog hallmark gene sets are versions of gene sets in the MSigDB Hallmarks collection mapped to their mouse orthologs.
- M1 - positional gene sets corresponding to mouse chromosome cytogenetic bands.
- M2 - curated gene sets from online pathway databases, publications in PubMed, and knowledge of domain experts.
- M3 - regulatory target gene sets based on gene target predictions for microRNA seed sequences and predicted transcription factor binding sites.
- M5 - ontology gene sets consist of genes annotated by the same ontology term.
- M8 - cell type signature gene sets curated from cluster markers identified in single-cell sequencing studies of mouse tissue.
Since the collections serve different purposes, choose the one that aligns with your research focus.
The Bioconductor msigdb package provides an interface for interacting with MSigDB via R and can be used to obtain the different gene sets; from the vignette:
The molecular signatures database (MSigDB) is one of the largest collections of molecular signatures or gene expression signatures. A variety of gene expression signatures are hosted on this database including experimentally derived signatures and signatures representing pathways and ontologies from other curated databases. This rich collection of gene expression signatures (>25,000) can facilitate a wide variety of signature-based analyses, the most popular being gene set enrichment analyses. These signatures can be used to perform enrichment analysis in a DE experiment using tools such as {GSEA}, {fry} (from {limma}) and {camera} (from {limma}). Alternatively, they can be used to perform single-sample gene-set analysis of individual transcriptomic profiles using approaches such as {singscore}, {ssGSEA} and {GSVA}.
This package provides the gene sets in the MSigDB in the form of GeneSet objects. This data structure is specifically designed to store information about gene sets, including their member genes and metadata. Other packages, such as {msigdbr} and {EGSEAdata} provide these gene sets too, however, they do so by storing them as lists or tibbles. These structures are not specific to gene sets therefore do not allow storage of important metadata associated with each gene set, for example, their short and long descriptions. Additionally, the lack of structure allows creation of invalid gene sets. Accessory functions implemented in the {GSEABase} package provide a neat interface to interact with GeneSet objects.
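To make the contrast with plain lists concrete, here is a minimal sketch that builds a toy GeneSet by hand with {GSEABase}. The set name, genes, and descriptions are made up for illustration and do not come from MSigDB.

```r
suppressPackageStartupMessages(library(GSEABase))

# A hypothetical gene set; the name, genes and descriptions are illustrative only.
toy_gs <- GeneSet(
  c("TP53", "MYC", "EGFR"),
  setName = "TOY_SIGNATURE",
  geneIdType = SymbolIdentifier(),
  shortDescription = "A made-up three-gene signature.",
  longDescription = "Longer free-text description stored alongside the genes."
)

# The same genes as a plain list entry: only the name and the members survive.
toy_list <- list(TOY_SIGNATURE = c("TP53", "MYC", "EGFR"))

setName(toy_gs)          # name of the set
geneIds(toy_gs)          # member genes
description(toy_gs)      # short description
longDescription(toy_gs)  # long description
```

The plain list has nowhere to keep the descriptions or the identifier type, which is the point the vignette is making.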
Installation
Install {msigdb}. (Dependencies are listed in the Imports section in the DESCRIPTION file.)
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
if (!require("msigdb", quietly = TRUE))
  BiocManager::install("msigdb")
Load package.
library(msigdb)
packageVersion("msigdb")
[1] ‘1.14.0’
Downloading the MSigDB database
In order to download the MSigDB database, we need to load {ExperimentHub} and {GSEABase}.
suppressPackageStartupMessages(library(ExperimentHub))
suppressPackageStartupMessages(library(GSEABase))
Query an ExperimentHub object.
eh <- ExperimentHub(ask = FALSE)
AnnotationHub::query(x = eh, pattern = 'msigdb')
ExperimentHub with 49 records
# snapshotDate(): 2024-10-24
# $dataprovider: Broad Institute, Emory University, EBI
# $species: Homo sapiens, Mus musculus
# $rdataclass: GSEABase::GeneSetCollection, list, data.frame
# additional mcols(): taxonomyid, genome, description, coordinate_1_based, maintainer,
# rdatadateadded, preparerclass, tags, rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["EH5421"]]'
title
EH5421 | msigdb.v7.2.hs.SYM
EH5422 | msigdb.v7.2.hs.EZID
EH5423 | msigdb.v7.2.mm.SYM
EH5424 | msigdb.v7.2.mm.EZID
EH6727 | MSigDB C8 MANNO MIDBRAIN
... ...
EH8296 | msigdb.v7.5.1.hs.SYM
EH8297 | msigdb.v7.5.1.mm.EZID
EH8298 | msigdb.v7.5.1.mm.idf
EH8299 | msigdb.v7.5.1.mm.SYM
EH8300 | imex_hsmm_0722
Specify a more specific pattern to look for only human collections.
AnnotationHub::query(x = eh, pattern = 'msigdb.*hs.SYM')
ExperimentHub with 7 records
# snapshotDate(): 2024-10-24
# $dataprovider: Broad Institute
# $species: Homo sapiens
# $rdataclass: GSEABase::GeneSetCollection
# additional mcols(): taxonomyid, genome, description, coordinate_1_based, maintainer,
# rdatadateadded, preparerclass, tags, rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["EH5421"]]'
title
EH5421 | msigdb.v7.2.hs.SYM
EH6772 | msigdb.v7.3.hs.SYM
EH6778 | msigdb.v7.4.hs.SYM
EH7359 | msigdb.v7.5.hs.SYM
EH8284 | msigdb.v2022.1.hs.SYM
EH8290 | msigdb.v2023.1.hs.SYM
EH8296 | msigdb.v7.5.1.hs.SYM
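The records are listed in order of their ExperimentHub IDs, which is not the same as MSigDB release order: the last record here is v7.5.1, whereas v2023.1 is the most recent MSigDB release in the list. So despite its name, msigdb_hs_latest below holds the last-listed record (v7.5.1), not the newest release.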
msigdb_hs_latest <- AnnotationHub::query(x = eh, pattern = 'msigdb.*hs.SYM') |>
  tail(1)
names(msigdb_hs_latest)
msigdb_hs_latest
ExperimentHub with 1 record
# snapshotDate(): 2024-10-24
# names(): EH8296
# package(): msigdb
# $dataprovider: Broad Institute
# $species: Homo sapiens
# $rdataclass: GSEABase::GeneSetCollection
# $rdatadateadded: 2023-07-03
# $title: msigdb.v7.5.1.hs.SYM
# $description: Gene expression signatures (Homo sapiens) from the Molecular Signatures Database...
# $taxonomyid: 9606
# $genome: NA
# $sourcetype: XML
# $sourceurl: https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.5.1/msigdb_v7.5.1.xml
# $sourcesize: NA
# $tags: c("Homo_sapiens_Data", "Mus_musculus_Data")
# retrieve record with 'object[["EH8296"]]'
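Since the record order follows the ExperimentHub IDs rather than the MSigDB version, one option is to parse the version out of each title and compare versions explicitly. The sketch below uses only base R and the seven titles printed above; with a live hub you would take the titles from the query result instead. This is one possible approach, not something the {msigdb} package prescribes.

```r
# Titles copied from the query output above.
titles <- c(
  "msigdb.v7.2.hs.SYM", "msigdb.v7.3.hs.SYM", "msigdb.v7.4.hs.SYM",
  "msigdb.v7.5.hs.SYM", "msigdb.v2022.1.hs.SYM", "msigdb.v2023.1.hs.SYM",
  "msigdb.v7.5.1.hs.SYM"
)

# Extract the version string between "msigdb.v" and ".hs.SYM".
versions <- sub("^msigdb\\.v(.*)\\.hs\\.SYM$", "\\1", titles)

# numeric_version() compares component by component, so 2023.1 sorts above 7.5.1.
newest_first <- order(numeric_version(versions), decreasing = TRUE)
newest_title <- titles[newest_first[1]]
newest_title
```

Whether you want the newest release or a specific older one (for example, to reproduce a published analysis) is a separate decision; the point is only that position in the list should not make it for you.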
Data can be downloaded using the unique ID.
eh[[names(msigdb_hs_latest)]]
GeneSetCollection
names: chr1p11, chr1p12, ..., GOMF_STARCH_BINDING (45226 total)
unique identifiers: RPL22P6, NBPF8, ..., POM121L15P (41072 total)
types in collection:
geneIdType: SymbolIdentifier (1 total)
collectionType: BroadCollection (1 total)
Data can also be downloaded using msigdb::getMsigdb().
msigdb_ver <- sub(pattern = "msigdb\\.v(.*)\\.hs\\.SYM", replacement = "\\1", x = msigdb_hs_latest$title)
msigdb_hs_sym <- msigdb::getMsigdb(org = "hs", id = "SYM", version = msigdb_ver)
msigdb_hs_ezid <- msigdb::getMsigdb(org = "hs", id = "EZID", version = msigdb_ver)
msigdb_hs_sym
GeneSetCollection
names: chr1p11, chr1p12, ..., GOMF_STARCH_BINDING (45226 total)
unique identifiers: RPL22P6, NBPF8, ..., POM121L15P (41072 total)
types in collection:
geneIdType: SymbolIdentifier (1 total)
collectionType: BroadCollection (1 total)
msigdb_hs_ezid
GeneSetCollection
names: chr1p11, chr1p12, ..., GOMF_STARCH_BINDING (45226 total)
unique identifiers: 100132047, 728841, ..., 100101629 (40905 total)
types in collection:
geneIdType: EntrezIdentifier (1 total)
collectionType: BroadCollection (1 total)
Accessing the GeneSet and GeneSetCollection objects
A GeneSetCollection object is effectively a list, so list-processing functions such as lapply() and sapply() work on it.
str(msigdb_hs_sym, max.level = 2)
Formal class 'GeneSetCollection' [package "GSEABase"] with 1 slot
..@ .Data:List of 45226
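As a small illustration of list-style processing, the sketch below builds a toy GeneSetCollection (the set names and genes are invented) and applies lapply() and sapply() to it, just as one would on msigdb_hs_sym.

```r
suppressPackageStartupMessages(library(GSEABase))

# Two hypothetical gene sets; names and members are for illustration only.
gs_a <- GeneSet(c("TP53", "MYC", "EGFR"), setName = "TOY_SET_A",
                geneIdType = SymbolIdentifier())
gs_b <- GeneSet(c("MYC", "KRAS"), setName = "TOY_SET_B",
                geneIdType = SymbolIdentifier())
toy_gsc <- GeneSetCollection(list(gs_a, gs_b))

# Number of genes per set, computed with ordinary list functions.
set_sizes <- sapply(lapply(toy_gsc, geneIds), length)
names(set_sizes) <- names(toy_gsc)
set_sizes

# Names of the sets that contain a gene of interest.
has_myc <- sapply(toy_gsc, function(gs) "MYC" %in% geneIds(gs))
names(toy_gsc)[has_myc]
```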
Each signature is stored in a GeneSet object and can be processed using functions from the {GSEABase} package.
gs <- msigdb_hs_sym[[1984]]
gs
setName: BAUS_TFF2_TARGETS_DN
geneIds: BEX4, NAT8, ..., THRSP (total: 12)
geneIdType: Symbol
collectionType: Broad
bcCategory: c2 (Curated)
bcSubCategory: CGP
details: use 'details(object)'
Get gene IDs.
geneIds(gs)
[1] "BEX4" "NAT8" "RBP4" "ART3" "DDX6" "PRDX2" "HEBP1" "CTSC" "PCP4" "OR2H2" "BAG2" "THRSP"
Details of a gene set.
details(gs)
setName: BAUS_TFF2_TARGETS_DN
geneIds: BEX4, NAT8, ..., THRSP (total: 12)
geneIdType: Symbol
collectionType: Broad
bcCategory: c2 (Curated)
bcSubCategory: CGP
setIdentifier: LVY1HGGWMJ7:35020:Fri May 26 12:20:46 2023:95704
description: Genes down-regulated in pyloric atrium with knockout of TFF2 [GeneID=7032].
(longDescription available)
organism: Mus musculus
pubMedIds: 16121031
urls: https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.5.1/msigdb_v7.5.1.xml
contributor: Arthur Liberzon
setVersion: 7.5.1
creationDate:
Tabulate the number of gene sets in each collection.
table(sapply(lapply(msigdb_hs_sym, collectionType), bcCategory))
c1 c2 c3 c4 c5 c6 c7 c8 h
299 6180 3726 858 28005 189 5219 700 50
Create a logical vector to subset the hallmark gene sets; these gene sets are good for general use since they cover a wide range of biological processes and pathways.
wanted <- sapply(lapply(msigdb_hs_sym, collectionType), bcCategory) == "h"
table(wanted)
wanted
FALSE TRUE
45176 50
Hallmark gene sets.
hallmark_gs <- msigdb_hs_sym[wanted]
hallmark_gs
GeneSetCollection
names: HALLMARK_TNFA_SIGNALING_VIA_NFKB, HALLMARK_HYPOXIA, ..., HALLMARK_PANCREAS_BETA_CELLS (50 total)
unique identifiers: JUNB, CXCL2, ..., SRP14 (4383 total)
types in collection:
geneIdType: SymbolIdentifier (1 total)
collectionType: BroadCollection (1 total)
Genes in the HALLMARK_TNFA_SIGNALING_VIA_NFKB gene set.
geneIds(hallmark_gs[[1]])
[1] "JUNB" "CXCL2" "ATF3" "NFKBIA" "TNFAIP3" "PTGS2" "CXCL1" "IER3"
[9] "CD83" "CCL20" "CXCL3" "MAFF" "NFKB2" "TNFAIP2" "HBEGF" "KLF6"
[17] "BIRC3" "PLAUR" "ZFP36" "ICAM1" "JUN" "EGR3" "IL1B" "BCL2A1"
[25] "PPP1R15A" "ZC3H12A" "SOD2" "NR4A2" "IL1A" "RELB" "TRAF1" "BTG2"
[33] "DUSP1" "MAP3K8" "ETS2" "F3" "SDC4" "EGR1" "IL6" "TNF"
[41] "KDM6B" "NFKB1" "LIF" "PTX3" "FOSL1" "NR4A1" "JAG1" "CCL4"
[49] "GCH1" "CCL2" "RCAN1" "DUSP2" "EHD1" "IER2" "REL" "CFLAR"
[57] "RIPK2" "NFKBIE" "NR4A3" "PHLDA1" "IER5" "TNFSF9" "GEM" "GADD45A"
[65] "CXCL10" "PLK2" "BHLHE40" "EGR2" "SOCS3" "SLC2A6" "PTGER4" "DUSP5"
[73] "SERPINB2" "NFIL3" "SERPINE1" "TRIB1" "TIPARP" "RELA" "BIRC2" "CXCL6"
[81] "LITAF" "TNFAIP6" "CD44" "INHBA" "PLAU" "MYC" "TNFRSF9" "SGK1"
[89] "TNIP1" "NAMPT" "FOSL2" "PNRC1" "ID2" "CD69" "IL7R" "EFNA1"
[97] "PHLDA2" "PFKFB3" "CCL5" "YRDC" "IFNGR2" "SQSTM1" "BTG3" "GADD45B"
[105] "KYNU" "G0S2" "BTG1" "MCL1" "VEGFA" "MAP2K3" "CDKN1A" "CCN1"
[113] "TANK" "IFIT2" "IL18" "TUBB2A" "IRF1" "FOS" "OLR1" "RHOB"
[121] "AREG" "NINJ1" "ZBTB10" "PLPP3" "KLF4" "CXCL11" "SAT1" "CSF1"
[129] "GPR183" "PMEPA1" "PTPRE" "TLR2" "ACKR3" "KLF10" "MARCKS" "LAMB3"
[137] "CEBPB" "TRIP10" "F2RL1" "KLF9" "LDLR" "TGIF1" "RNF19B" "DRAM1"
[145] "B4GALT1" "DNAJB4" "CSF2" "PDE4B" "SNN" "PLEK" "STAT5A" "DENND5A"
[153] "CCND1" "DDX58" "SPHK1" "CD80" "TNFAIP8" "CCNL1" "FUT4" "CCRL2"
[161] "SPSB1" "TSC22D1" "B4GALT5" "SIK1" "CLCF1" "NFE2L2" "FOSB" "PER1"
[169] "NFAT5" "ATP2B1" "IL12B" "IL6ST" "SLC16A6" "ABCA1" "HES1" "BCL6"
[177] "IRS2" "SLC2A3" "CEBPD" "IL23A" "SMAD3" "TAP1" "MSC" "IFIH1"
[185] "IL15RA" "TNIP2" "BCL3" "PANX1" "FJX1" "EDN1" "EIF1" "BMP2"
[193] "DUSP4" "PDLIM5" "ICOSLG" "GFPT2" "KLF2" "TNC" "SERPINB8" "MXD1"
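Once you have the members of a gene set, the simplest enrichment question is whether your genes of interest overlap the set more than expected by chance. The sketch below is a basic over-representation test using the hypergeometric distribution in base R. It is not the GSEA algorithm itself, and the gene universe, gene set, and differentially expressed genes are all made up.

```r
# Hypothetical background of 1,000 measured genes.
universe <- paste0("GENE", 1:1000)

# Hypothetical gene set (50 genes) and differentially expressed genes (40 genes),
# constructed so that exactly 10 genes are shared.
gene_set <- universe[1:50]
de_genes <- universe[c(1:10, 501:530)]

overlap <- length(intersect(de_genes, gene_set))

# P(X >= overlap) when drawing length(de_genes) genes from the universe,
# of which length(gene_set) count as "successes".
p_value <- phyper(
  q = overlap - 1,
  m = length(gene_set),
  n = length(universe) - length(gene_set),
  k = length(de_genes),
  lower.tail = FALSE
)

overlap
p_value
```

With real data you would replace gene_set with something like geneIds(hallmark_gs[[1]]) and restrict it to genes present in your universe. For anything beyond a quick check, the dedicated tools named in the vignette excerpt are the better choice.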
Names of the hallmark gene sets.
names(hallmark_gs)
[1] "HALLMARK_TNFA_SIGNALING_VIA_NFKB" "HALLMARK_HYPOXIA"
[3] "HALLMARK_CHOLESTEROL_HOMEOSTASIS" "HALLMARK_MITOTIC_SPINDLE"
[5] "HALLMARK_WNT_BETA_CATENIN_SIGNALING" "HALLMARK_TGF_BETA_SIGNALING"
[7] "HALLMARK_IL6_JAK_STAT3_SIGNALING" "HALLMARK_DNA_REPAIR"
[9] "HALLMARK_G2M_CHECKPOINT" "HALLMARK_APOPTOSIS"
[11] "HALLMARK_NOTCH_SIGNALING" "HALLMARK_ADIPOGENESIS"
[13] "HALLMARK_ESTROGEN_RESPONSE_EARLY" "HALLMARK_ESTROGEN_RESPONSE_LATE"
[15] "HALLMARK_ANDROGEN_RESPONSE" "HALLMARK_MYOGENESIS"
[17] "HALLMARK_PROTEIN_SECRETION" "HALLMARK_INTERFERON_ALPHA_RESPONSE"
[19] "HALLMARK_INTERFERON_GAMMA_RESPONSE" "HALLMARK_APICAL_JUNCTION"
[21] "HALLMARK_APICAL_SURFACE" "HALLMARK_HEDGEHOG_SIGNALING"
[23] "HALLMARK_COMPLEMENT" "HALLMARK_UNFOLDED_PROTEIN_RESPONSE"
[25] "HALLMARK_PI3K_AKT_MTOR_SIGNALING" "HALLMARK_MTORC1_SIGNALING"
[27] "HALLMARK_E2F_TARGETS" "HALLMARK_MYC_TARGETS_V1"
[29] "HALLMARK_MYC_TARGETS_V2" "HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION"
[31] "HALLMARK_INFLAMMATORY_RESPONSE" "HALLMARK_XENOBIOTIC_METABOLISM"
[33] "HALLMARK_FATTY_ACID_METABOLISM" "HALLMARK_OXIDATIVE_PHOSPHORYLATION"
[35] "HALLMARK_GLYCOLYSIS" "HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY"
[37] "HALLMARK_P53_PATHWAY" "HALLMARK_UV_RESPONSE_UP"
[39] "HALLMARK_UV_RESPONSE_DN" "HALLMARK_ANGIOGENESIS"
[41] "HALLMARK_HEME_METABOLISM" "HALLMARK_COAGULATION"
[43] "HALLMARK_IL2_STAT5_SIGNALING" "HALLMARK_BILE_ACID_METABOLISM"
[45] "HALLMARK_PEROXISOME" "HALLMARK_ALLOGRAFT_REJECTION"
[47] "HALLMARK_SPERMATOGENESIS" "HALLMARK_KRAS_SIGNALING_UP"
[49] "HALLMARK_KRAS_SIGNALING_DN" "HALLMARK_PANCREAS_BETA_CELLS"
If you are interested in another collection, use the same approach as above with the appropriate collection code: c1, c2, c3, c4, c5, c6, c7 or c8. That's it!
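The subsetting step can be wrapped in a small helper so that the collection code becomes a parameter. The helper name subset_by_category() is hypothetical, and the toy collection below (invented set names, arbitrary genes) only stands in for msigdb_hs_sym so that the example runs without downloading anything.

```r
suppressPackageStartupMessages(library(GSEABase))

# Hypothetical helper: keep gene sets whose Broad category is in `category`.
subset_by_category <- function(gsc, category) {
  categories <- sapply(lapply(gsc, collectionType), bcCategory)
  gsc[categories %in% category]
}

# Toy stand-in for msigdb_hs_sym with one hallmark-style and one curated-style set.
toy_gsc <- GeneSetCollection(list(
  GeneSet(c("JUNB", "CXCL2"), setName = "TOY_HALLMARK_SET",
          geneIdType = SymbolIdentifier(),
          collectionType = BroadCollection(category = "h")),
  GeneSet(c("BEX4", "NAT8"), setName = "TOY_CURATED_SET",
          geneIdType = SymbolIdentifier(),
          collectionType = BroadCollection(category = "c2", subCategory = "CGP"))
))

curated <- subset_by_category(toy_gsc, "c2")
names(curated)
```

On the real data the call would be subset_by_category(msigdb_hs_sym, "c2"). The {msigdb} vignette also describes a subsetCollection() function for this purpose; check its documentation for the exact arguments before relying on it.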

This work is licensed under a Creative Commons
Attribution 4.0 International License.