Gene Set Variation Analysis

The Gene Set Variation Analysis (GSVA) is another popular analysis method for bulk RNA-seq data. GSVA differs from Gene Set Enrichment Analysis (GSEA) in that it can estimate gene set enrichment within a single sample. GSEA typically uses results from a differential expression analysis, which requires multiple samples, to determine whether there is an enrichment of gene sets associated with an experimental condition. With GSVA you can estimate enrichment without comparing across samples!

In this post, I will follow the vignette of the {GSVA} Bioconductor package, which describes GSVA as follows:

Gene set variation analysis (GSVA) is a particular type of gene set enrichment method that works on single samples and enables pathway-centric analyses of molecular data by performing a conceptually simple but powerful change in the functional unit of analysis, from genes to gene sets. The GSVA package provides the implementation of four single-sample gene set enrichment methods, concretely zscore, plage, ssGSEA and its own called GSVA. While this methodology was initially developed for gene expression data, it can be applied to other types of molecular profiling data.

Gene set variation analysis (GSVA) provides an estimate of pathway activity by transforming an input gene-by-sample expression data matrix into a corresponding gene-set-by-sample expression data matrix.

This resulting expression data matrix can be then used with classical analytical methods such as differential expression, classification, survival analysis, clustering or correlation analysis in a pathway-centric manner. One can also perform sample-wise comparisons between pathways and other molecular data types such as microRNA expression or binding data, copy-number variation (CNV) data or single nucleotide polymorphisms (SNPs).

Installation

Install {GSVA} using {BiocManager}. (Dependencies are listed in the Imports section in the DESCRIPTION file.)

if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

if (!require("GSVA", quietly = TRUE))
  BiocManager::install("GSVA")

if (!require("GSVAdata", quietly = TRUE))
  BiocManager::install("GSVAdata")

Load the package using library().

library(GSVA)
packageVersion("GSVA")
[1] ‘2.0.3’

Quick start

Generate example expression matrix with 10,000 genes (g1 to g10000) across 30 samples (s1 to s30) as an example.

p <- 10000
n <- 30
set.seed(1984)
X <- matrix(
  rnorm(p*n),
  nrow=p,
  dimnames=list(paste0("g", 1:p), paste0("s", 1:n))
)
X[1:5, 1:5]
           s1         s2         s3          s4           s5
g1  0.4092032  1.4676435  0.3515056  1.53512312 -1.279009469
g2 -0.3230250 -1.8501416 -0.9198650  1.40036448  0.086613315
g3  0.6358523  1.6084120  1.6380322  0.23799146  0.216628121
g4 -1.8461288 -0.2928844  0.4651573 -0.09766558 -0.009887299
g5  0.9536474 -0.4816006  0.1807824  1.03141311  0.206414282

Generate 100 gene sets (gs1 to gs100) that are contain from 10 to up to 100 genes sampled from 1:p.

set.seed(1984)
gs <- as.list(sample(10:100, size=100, replace=TRUE))

gs <- lapply(gs, function(n, p){
  paste0("g", sample(1:p, size=n, replace=FALSE))
}, p)
names(gs) <- paste0("gs", 1:length(gs))

sapply(gs, length)
  gs1   gs2   gs3   gs4   gs5   gs6   gs7   gs8   gs9  gs10  gs11  gs12  gs13  gs14  gs15  gs16 
   49    29    67    90    94    87    41    26    86    77    97    90    45    47    54    83 
 gs17  gs18  gs19  gs20  gs21  gs22  gs23  gs24  gs25  gs26  gs27  gs28  gs29  gs30  gs31  gs32 
   11    75    95    99    94    89    93    50    49    87    36    61    84    99    58    30 
 gs33  gs34  gs35  gs36  gs37  gs38  gs39  gs40  gs41  gs42  gs43  gs44  gs45  gs46  gs47  gs48 
   63    29    35    29    69    41    46    38    17    48    72    15    81   100    93    37 
 gs49  gs50  gs51  gs52  gs53  gs54  gs55  gs56  gs57  gs58  gs59  gs60  gs61  gs62  gs63  gs64 
   99    89    43    36    84    83    40    72    90    86    37    23    69    96    20    93 
 gs65  gs66  gs67  gs68  gs69  gs70  gs71  gs72  gs73  gs74  gs75  gs76  gs77  gs78  gs79  gs80 
   36    21    46    76    71    57    48    25    73    26    46    29    53    69    69    42 
 gs81  gs82  gs83  gs84  gs85  gs86  gs87  gs88  gs89  gs90  gs91  gs92  gs93  gs94  gs95  gs96 
   76    30    16    49    35    12    83    99    88    66    10    51    82    73    97    59 
 gs97  gs98  gs99 gs100 
   59    42    10    64 

Calculate GSVA enrichment scores using the gsva() function, which does all the work and requires the following two input arguments:

  1. A normalised gene expression dataset, which can be provided in one of the following containers:
    • A matrix of expression values with genes corresponding to rows and samples corresponding to columns.
    • An ExpressionSet object; see package Biobase.
    • A SummarizedExperiment object, see package SummarizedExperiment.
  2. A collection of gene sets; which can be provided in one of the following containers:
    • A list object where each element corresponds to a gene set defined by a vector of gene identifiers, and the element names correspond to the names of the gene sets.
    • A GeneSetCollection object; see package GSEABase.

The first argument to the gsva() function is the gene expression data matrix and the second the collection of gene sets. The gsva() function can take the input expression data and gene sets using different specialized containers that facilitate the access and manipulation of molecular and phenotype data, as well as their associated metadata. Another advanced features include the use of on-disk and parallel backends to enable, respectively, using GSVA on large molecular data sets and speed up computing time.

The gsva() function will apply the following filters before the actual calculations take place:

  1. Discard genes in the input expression data matrix with constant expression.
  2. Discard genes in the input gene sets that do not map to a gene in the input gene expression data matrix.
  3. Discard gene sets that, after applying the previous filters, do not meet a minimum and maximum size, which by default is 1 for the minimum size and Inf for the maximum size.

When method="gsva" is used (the default), the following parameters can be tuned:

  • kcdf: The first step of the GSVA algorithm brings gene expression profiles to a common scale by calculating an expression statistic through a non-parametric estimation of the CDF across samples. Such a non-parametric estimation employs a kernel function and the kcdf parameter allows the user to specify three possible values for that function:
  1. "Gaussian", the default value, which is suitable for continuous expression data, such as microarray fluorescent units in logarithmic scale and RNA-seq log-CPMs, log-RPKMs or log-TPMs units of expression;
  2. "Poisson", which is suitable for integer counts, such as those derived from RNA-seq alignments;
  3. "none", which will enforce a direct estimation of the CDF without a kernel function.
  • mx.diff: The last step of the GSVA algorithm calculates the gene set enrichment score from two Kolmogorov-Smirnov random walk statistics. This parameter is a logical flag that allows the user to specify two possible ways to do such calculation:
  1. TRUE, the default value, where the enrichment score is calculated as the magnitude difference between the largest positive and negative random walk deviations;
  2. FALSE, where the enrichment score is calculated as the maximum distance of the random walk from zero.
  • abs.ranking: Logical flag used only when mx.diff=TRUE. By default, abs.ranking=FALSE and it implies that a modified Kuiper statistic is used to calculate enrichment scores, taking the magnitude difference between the largest positive and negative random walk deviations. When abs.ranking=TRUE the original Kuiper statistic is used, by which the largest positive and negative random walk deviations are added together. In this case, gene sets with genes enriched on either extreme (high or low) will be regarded as highly activated.

  • tau: Exponent defining the weight of the tail in the random walk. By default tau=1. When method="ssgsea", this parameter is also used and its default value becomes then tau=0.25 to match the methodology described in (Barbie et al. 2009).

In general, the default values for the previous parameters are suitable for most analysis settings, which usually consist of some kind of normalised continuous expression values.

es_gsva <- gsva(gsvaParam(X, gs), verbose=FALSE)
dim(es_gsva)
[1] 100  30

Median enrichment scores across the samples.

apply(es_gsva, 2, median)
          s1           s2           s3           s4           s5           s6           s7 
 0.009061514 -0.008458424 -0.005713628 -0.021937531  0.006417182  0.019422486  0.010140618 
          s8           s9          s10          s11          s12          s13          s14 
 0.005812097  0.006495584  0.008887644  0.024577619 -0.031697634  0.001642502  0.008919786 
         s15          s16          s17          s18          s19          s20          s21 
-0.022622470  0.027695420 -0.015799537 -0.011108686 -0.013956442 -0.015493300 -0.004809844 
         s22          s23          s24          s25          s26          s27          s28 
-0.014081494  0.026845336  0.023895676  0.006358240  0.010642450 -0.012690144 -0.005999451 
         s29          s30 
-0.005058572 -0.012403422

Below we carry out single sample GSEA (ssGSEA), which is a non-parametric method that calculates a gene set enrichment score per sample as the normalised difference in empirical cumulative distribution functions (CDFs) of gene expression ranks inside and outside the gene set. By default, the implementation in the {GSVA} package normalises pathway scores, dividing them by the range of calculated values.

es_ssgsea <- gsva(ssgseaParam(X, gs), verbose=FALSE)
apply(es_ssgsea, 2, median)
       s1        s2        s3        s4        s5        s6        s7        s8        s9       s10 
0.1056332 0.1105148 0.1149790 0.1045303 0.1192256 0.1274702 0.1192217 0.1287768 0.1316429 0.1257247 
      s11       s12       s13       s14       s15       s16       s17       s18       s19       s20 
0.1293241 0.1083643 0.1222038 0.1012195 0.1018693 0.1174057 0.1103184 0.1112410 0.1164373 0.1126160 
      s21       s22       s23       s24       s25       s26       s27       s28       s29       s30 
0.1212393 0.1067666 0.1167117 0.1418775 0.1221909 0.1235449 0.1112199 0.1012381 0.1254514 0.1210392

Enrichment scores on microarray and RNA-seq data

Gene expression data of lymphoblastoid cell lines (LCL) from HapMap individuals that have been profiled using microarray and RNA-seq.

library(Biobase)
library(GSVAdata)
data(c2BroadSets)
data(commonPickrellHuang)

stopifnot(
  identical(
    featureNames(huangArrayRMAnoBatchCommon_eset),
    featureNames(pickrellCountsArgonneCQNcommon_eset)
  )
)

stopifnot(
  identical(
    sampleNames(huangArrayRMAnoBatchCommon_eset),
    sampleNames(pickrellCountsArgonneCQNcommon_eset)
  )
)

pickrellCountsArgonneCQNcommon_eset
ExpressionSet (storageMode: lockedEnvironment)
assayData: 11482 features, 36 samples 
  element names: exprs 
protocolData
  rowNames: NA19099 NA18523 ... NA19171 (36 total)
  varLabels: exprs dates
  varMetadata: labelDescription channel
phenoData
  rowNames: NA19099 NA18523 ... NA19171 (36 total)
  varLabels: CoriellCellLineID Population ... FamilyRelationship (5 total)
  varMetadata: channel labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation: org.Hs.eg.db 

For the current analysis we use the subset of canonical pathways from the C2 collection of MSigDB Gene Sets. These correspond to the following pathways from KEGG, REACTOME and BIOCARTA.

canonicalC2BroadSets <- c2BroadSets[
  c(
    grep("^KEGG", names(c2BroadSets)),
    grep("^REACTOME", names(c2BroadSets)),
    grep("^BIOCARTA", names(c2BroadSets))
  )
]

canonicalC2BroadSets
GeneSetCollection
  names: KEGG_GLYCOLYSIS_GLUCONEOGENESIS, KEGG_CITRATE_CYCLE_TCA_CYCLE, ..., BIOCARTA_ACTINY_PATHWAY (833 total)
  unique identifiers: 55902, 2645, ..., 8544 (6744 total)
  types in collection:
    geneIdType: EntrezIdentifier (1 total)
    collectionType: BroadCollection (1 total)

We calculate the GSVA enrichment scores for these gene sets using first the normalised microarray data and then the normalised RNA-seq integer count data. Note that the only requirement to do the latter is to set the argument kcdf="Poisson", which is "Gaussian" by default. However, if the RNA-seq normalised expression levels is continuous, such as log-CPMs, log-RPKMs or log-TPMs, use "Gaussian".

Run GSVA for the microarray data.

huangPar <- gsvaParam(
  exprData = huangArrayRMAnoBatchCommon_eset,
  geneSets = canonicalC2BroadSets,
  minSize=5,
  maxSize=500
)

esmicro <- gsva(huangPar, verbose=FALSE)

exprs(esmicro)[1:6, 1:6]
                                                  NA19099    NA18523     NA19144
KEGG_GLYCOLYSIS_GLUCONEOGENESIS               -0.14230367 -0.3154226 -0.39125278
KEGG_CITRATE_CYCLE_TCA_CYCLE                  -0.20050240  0.2516592  0.01290791
KEGG_PENTOSE_PHOSPHATE_PATHWAY                 0.13532736 -0.3454335  0.01171014
KEGG_PENTOSE_AND_GLUCURONATE_INTERCONVERSIONS -0.04674354  0.2631958 -0.24777062
KEGG_FRUCTOSE_AND_MANNOSE_METABOLISM          -0.05032619 -0.2581911 -0.27263716
KEGG_GALACTOSE_METABOLISM                     -0.50995845 -0.3290795 -0.33917558
                                                  NA19137    NA18861     NA19116
KEGG_GLYCOLYSIS_GLUCONEOGENESIS                0.21314789  0.3982731 -0.02979645
KEGG_CITRATE_CYCLE_TCA_CYCLE                  -0.23516857  0.1673898 -0.18868132
KEGG_PENTOSE_PHOSPHATE_PATHWAY                 0.10461510  0.5338337  0.08497909
KEGG_PENTOSE_AND_GLUCURONATE_INTERCONVERSIONS -0.57095581 -0.1515618 -0.12045011
KEGG_FRUCTOSE_AND_MANNOSE_METABOLISM           0.02525198  0.4651490 -0.07264436
KEGG_GALACTOSE_METABOLISM                      0.17973234  0.5893476 -0.05162470

Run GSVA on the RNA-seq data using kcdf="Poisson".

pickrellPar <- gsvaParam(
  exprData = pickrellCountsArgonneCQNcommon_eset,
  geneSets = canonicalC2BroadSets,
  minSize=5,
  maxSize=500,
  kcdf="Poisson"
)

system.time(
  esrnaseq <- gsva(pickrellPar, verbose=FALSE)
)

exprs(esrnaseq)[1:6, 1:6]
                                                 NA19099     NA18523    NA19144
KEGG_GLYCOLYSIS_GLUCONEOGENESIS               0.22812891 -0.26335522 -0.3747005
KEGG_CITRATE_CYCLE_TCA_CYCLE                  0.19353985 -0.20067180 -0.3193682
KEGG_PENTOSE_PHOSPHATE_PATHWAY                0.39550501 -0.35132141 -0.1411042
KEGG_PENTOSE_AND_GLUCURONATE_INTERCONVERSIONS 0.30352162 -0.23381641 -0.2716007
KEGG_FRUCTOSE_AND_MANNOSE_METABOLISM          0.11033580 -0.06220681 -0.0458456
KEGG_GALACTOSE_METABOLISM                     0.05968213 -0.42388064 -0.0400420
                                                 NA19137     NA18861    NA19116
KEGG_GLYCOLYSIS_GLUCONEOGENESIS                0.2907323  0.17079093 -0.4576523
KEGG_CITRATE_CYCLE_TCA_CYCLE                   0.2573002  0.09459388 -0.1443797
KEGG_PENTOSE_PHOSPHATE_PATHWAY                 0.3582431  0.16435766 -0.2664579
KEGG_PENTOSE_AND_GLUCURONATE_INTERCONVERSIONS -0.2462472 -0.06071677  0.0453623
KEGG_FRUCTOSE_AND_MANNOSE_METABOLISM           0.4425963 -0.43860752 -0.2635444
KEGG_GALACTOSE_METABOLISM                      0.3494772 -0.12227027 -0.4633993

The enrichment scores for each sample can be used to "cluster" the samples; here we show a correlation plot showing the correlations of enrichment scores between the RNA-seq samples ordered using hierarchical clustering.

library(corrplot)

corrplot(cor(exprs(esrnaseq), exprs(esrnaseq)), type="lower", diag = FALSE, order = "hclust")

The data has been "transformed" to a gene set by sample matrix in contrast to a gene by sample matrix. Therefore, other types of analysis/visualisation methods that typically use a gene by samples matrix can be used with the gene set by sample matrix, such as a heatmap and/or PCA. If used with cancer data, GSVA could be potentially used to identify different cancer subtypes that have specific enrichments of gene sets; GSVA is quite commonly used with TCGA data for this purpose.


This post was sponsored by Logos Biosystems:

Logos Biosystems offers a suite of advanced imaging solutions designed to reveal the intricacies of cellular and tissue biology. Their automated cell counting, digital cell imaging, and tissue clearing 3D imaging technologies provide researchers with the tools they need to visualize the unseen, driving innovation in areas like drug discovery and agricultural research.




Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.