CAGE analysis using the R Bioconductor package CAGEr

This post is outdated; please refer to the official documentation.

Cap Analysis Gene Expression (CAGE) is a molecular technique, developed at RIKEN, which captures all transcription starting sites (TSSs) of an RNA population. The C in CAGE refers to the altered nucleotide at the 5' site of precursor messenger RNA, termed the cap, which CAGE targets and pulls down. The vignette of the CAGEr package has a very nice introduction to CAGE. I'd just like to add that several other CAGE protocol exists, such as HeliScopeCAGE and nanoCAGE. While these protocols all capture TSSs, the biochemical steps are different, especially nanoCAGE, which does not use CAP trapping but template switching. If you're interested in template switching with respect to transcriptome studies, have a look at the introduction of this paper, which I wrote.

In this post I will go through the workflow of the CAGEr package. If you perform CAGE analysis, I highly recommend using this package. It provides the methods/analysis steps that are commonly used in CAGE analyses and eliminates the use of in house scripts/methods. For the first part I will use publicly available FANTOM3 and FANTOM4 data that is conveniently packaged in Bioconductor as part of CAGEr. The second part shows an example session using ENCODE CAGE data, which is conveniently provided as BAM files.

#installing if necessary
source("http://bioconductor.org/biocLite.R")
biocLite("CAGEr")
biocLite("FANTOM3and4CAGE")
#the following two files are ~800 megs each
biocLite("BSgenome.Hsapiens.UCSC.hg18")
biocLite("BSgenome.Hsapiens.UCSC.hg19")

#load libraries
library(CAGEr)
library(FANTOM3and4CAGE)
library(BSgenome.Hsapiens.UCSC.hg18)

#load list of available samples for human
data(FANTOMhumanSamples)
head(FANTOMhumanSamples)
                dataset           group          sample
1 FANTOMtissueCAGEhuman        cerebrum        cerebrum
2 FANTOMtissueCAGEhuman    renal_artery    renal_artery
3 FANTOMtissueCAGEhuman          ureter          ureter
4 FANTOMtissueCAGEhuman urinary_bladder urinary_bladder
5 FANTOMtissueCAGEhuman          kidney      malignancy
6 FANTOMtissueCAGEhuman          kidney          kidney

table(FANTOMhumanSamples$group)

THP-1_monocytic_induction                   adipose             adrenal_gland
                       18                         4                         3
                    blood               bone_marrow                     brain
                       30                         1                         1
                   breast                     cecum                cerebellum
                        1                         2                         1
                 cerebrum                     colon                    embryo
                        1                         3                         1
               epididymis              frontal_lobe                     heart
                        1                         1                         1
                   kidney           large_intestine                     liver
                        2                         2                         3
                     lung             mammary_gland                    muscle
                        4                         1                         1
           occipital_lobe                  pancreas             parietal_lobe
                        1                         1                         1
           prostate_gland                    rectum              renal_artery
                        1                         2                         1
                     skin           small_intestine                    spleen
                        2                         1                         1
                   testis                    thymus                 undefined
                        1                         1                         3
                   ureter           urinary_bladder
                        1                         1
table(FANTOMhumanSamples$dataset)

FANTOMtimecourseCAGEhuman     FANTOMtissueCAGEhuman
                       18                        82
data(FANTOMtissueCAGEhuman)
typeof(FANTOMtissueCAGEhuman)
[1] "list"
names(FANTOMtissueCAGEhuman)
 [1] "cerebrum"        "renal_artery"    "ureter"          "urinary_bladder"
 [5] "kidney"          "small_intestine" "rectum"          "cecum"
 [9] "liver"           "large_intestine" "prostate_gland"  "mammary_gland"
[13] "epididymis"      "skin"            "adipose"         "pancreas"
[17] "thymus"          "undefined"       "blood"           "lung"
[21] "adrenal_gland"   "colon"           "brain"           "cerebellum"
[25] "testis"          "embryo"          "bone_marrow"     "heart"
[29] "muscle"          "frontal_lobe"    "occipital_lobe"  "parietal_lobe"
[33] "spleen"          "breast"

#only one heart sample
names(FANTOMtissueCAGEhuman$heart)
[1] "chr"    "pos"    "strand" "heart"
#three liver samples
names(FANTOMtissueCAGEhuman$liver)
[1] "chr"            "pos"            "strand"         "liver"
[5] "HB-6065_Hep_G2" "malignancy"

cage_data <- importPublicData(dataset = FANTOMtissueCAGEhuman, group = c("blood", "cerebellum"), sample = c("control", "cerebellum"))

cage_data

S4 Object of class CAGEset

=======================================
Input data information
=======================================
Reference genome (organism): BSgenome.Hsapiens.UCSC.hg18
Input file type: FANTOM
Input file names: FANTOM__blood__control, FANTOM__cerebellum__cerebellum
Sample labels: blood__control, cerebellum__cerebellum
=======================================
CTSS information
=======================================
CTSS chromosome: chr1, chr1, chr1, ...
CTSS position: 554523, 559786, 795484, ...
CTSS strand: -, +, +, ...
Tag count:
	-> blood__control: 1, 1, 1, ...
 	-> cerebellum__cerebellum: 0, 0, 0, ...
Normalized tpm:
=======================================
Tag cluster (TC) information
=======================================
CTSS clustering method: 
Number of TCs per sample:
=======================================
Consensus cluster information
=======================================
Number of consensus clusters:
Consensus cluster chromosome:
Consensus cluster start:
Consensus cluster end:
Consensus cluster strand:
Normalized tpm:
=======================================
Expression profiling
=======================================
Expression clustering method: 
Expression clusters for consensus clusters:
=======================================
Promoter shifting
=======================================
GroupX:
GroupY:
Shifting scores:
KS p-values (FDR adjusted):

#get CAGE defined transcriptional starting sites (CTSS)
cage_ctss <- CTSStagCount(cage_data)
#174,248 CAGE defined TSSs
dim(cage_ctss)
[1] 174248      5
head(cage_ctss)
   chr    pos strand blood__control cerebellum__cerebellum
1 chr1 554523      -              1                      0
2 chr1 559786      +              1                      0
3 chr1 795484      +              1                      0
4 chr1 801145      -              0                      1
5 chr1 835219      +              6                      0
6 chr1 837557      -              1                      0
#cerebellum CTSSs with 1 or more supporting tags
table(cage_ctss$cerebellum__cerebellum>0)

 FALSE   TRUE 
122034  52214
#CTSSs with 1 or more supporting tags in both libraries
table(cage_ctss$cerebellum__cerebellum>0 & cage_ctss$blood__control>0)

 FALSE   TRUE 
165361   8887

#library sizes
#the library sizes are quite different as one dataset is from FANTOM3 and the other from FANTOM4
librarySizes(cage_data)
[1] 820134  93618

#plot reverse cumulative
plotReverseCumulatives(cage_data, onePlot = TRUE)

CTSS_reverse_cumulatives_raw_all_samplesThe two datasets roughly follow a power law distribution. The alpha and T values are used in the power law normalisation step.

#power law normalisation as per Balwierz et al.
#using values from the reverse cumulative plot
normalizeTagCount(cage_data, method = "powerLaw", fitInRange = c(10, 1000), alpha = 1.15, T = 1*10^5)

cage_ctss_normalised <- CTSSnormalizedTpm(cage_data)
dim(cage_ctss_normalised)
[1] 174248      5
head(cage_ctss_normalised)
   chr    pos strand blood__control cerebellum__cerebellum
1 chr1 554523      -      0.1572808               0.000000
2 chr1 559786      +      0.1572808               0.000000
3 chr1 795484      +      0.1572808               0.000000
4 chr1 801145      -      0.0000000               1.604291
5 chr1 835219      +      0.8568774               0.000000
6 chr1 837557      -      0.1572808               0.000000

#CTSS clustering but keep CTSSs that are supported by at least 2 tags in one library
clusterCTSS(object = cage_data, threshold = 1, thresholdIsTpm = TRUE, nrPassThreshold = 1, method = "distclu", maxDist = 20, removeSingletons = TRUE, keepSingletonsAbove = 2)

blood_tag_clusters <- tagClusters(cage_data, sample="blood__control")
#most of the CTSSs were filtered out due to normalisation using the power law and using relatively few tags
dim(blood_tag_clusters)
[1] 756   9
head(blood_tag_clusters)
  cluster  chr    start      end strand nr_ctss dominant_ctss       tpm
1       1 chr1   945362   945363      +       1        945363  4.024005
2       2 chr1  2149331  2149332      +       1       2149332  3.533413
3       3 chr1 12149628 12149635      +       2      12149635  3.074830
4       4 chr1 15883368 15883375      +       2      15883375 12.192203
5       5 chr1 15883396 15883440      +       7      15883402 26.717980
6       6 chr1 20832553 20832570      +       4      20832558  4.693610
  tpm.dominant_ctss
1          4.024005
2          3.533413
3          2.059047
4         11.065021
5          8.348363
6          1.770097

#normalise by tpm
normalizeTagCount(cage_data, method = 'simpleTpm')
cage_ctss_normalised <- CTSSnormalizedTpm(cage_data)
head(cage_ctss_normalised)
   chr    pos strand blood__control cerebellum__cerebellum
1 chr1 554523      -       1.219313                0.00000
2 chr1 559786      +       1.219313                0.00000
3 chr1 795484      +       1.219313                0.00000
4 chr1 801145      -       0.000000               10.68171
5 chr1 835219      +       7.315878                0.00000
6 chr1 837557      -       1.219313                0.00000

clusterCTSS(object = cage_data, threshold = 1, thresholdIsTpm = TRUE, nrPassThreshold = 1, method = "distclu", maxDist = 20, removeSingletons = TRUE, keepSingletonsAbove = 2)
blood_tag_clusters <- tagClusters(cage_data, sample="blood__control")
dim(blood_tag_clusters)
[1] 3919    9
head(blood_tag_clusters)
  cluster  chr   start     end strand nr_ctss dominant_ctss       tpm
1       1 chr1  867568  867569      +       1        867569  2.295394
2       2 chr1  885811  885840      +       3        885827  1.585254
3       3 chr1  945356  945363      +       2        945363 22.971769
4       4 chr1 1157485 1157513      +       4       1157486  5.889388
5       5 chr1 2149330 2149342      +       5       2149332 32.733198
6       6 chr1 3531438 3531477      +       6       3531469  9.697042
  tpm.dominant_ctss
1          2.295394
2          1.124938
3         22.527034
4          2.929547
5         19.599104
6          3.306190

brain_tag_clusters <- tagClusters(cage_data, sample="cerebellum__cerebellum")
dim(brain_tag_clusters)
[1] 9397    9
head(brain_tag_clusters)
  cluster  chr   start     end strand nr_ctss dominant_ctss       tpm
1       1 chr1  885811  885840      +       3        885827  4.812872
2       2 chr1  885875  885876      +       1        885876  6.880490
3       3 chr1  945334  945363      +       5        945363 22.283645
4       4 chr1  947517  947518      +       1        947518  3.322395
5       5 chr1 1360932 1360961      +       6       1360947 23.660102
6       6 chr1 1541123 1541139      +       2       1541139  8.484781
  tpm.dominant_ctss
1          1.604291
2          6.880490
3         12.384523
4          3.322395
5          6.880490
6          6.880490

#promoter shape of tag clusters
cumulativeCTSSdistribution(cage_data, clusters = "tagClusters")
quantilePositions(cage_data, clusters = "tagClusters", qLow = 0.1, qUp = 0.9)
plotInterquantileWidth(cage_data, clusters = "tagClusters", tpmThreshold = 3, qLow = 0.1, qUp = 0.9)

promoter_shape_cerebellum_vs_blood-01Slightly more sharp promoters, tissue specific promoters, in the cerebellum.

#creating consensus promoters across samples
aggregateTagClusters(cage_data, tpmThreshold = 2, qLow = NULL , qUp = NULL, maxDist = 100)
consensus_tc <- consensusClusters(cage_data)
dim(consensus_tc)
[1] 9553    6
head(consensus_tc)
  consensus.cluster  chr   start     end strand       tpm
1                 1 chr1  867568  867569      +  2.295394
2                 2 chr1  885811  885876      + 11.693362
3                 3 chr1  945334  945363      + 45.255414
4                 4 chr1  947517  947518      +  3.322395
5                 5 chr1 1157485 1157513      +  5.889388
6                 6 chr1 1360932 1360961      + 23.660102

#expression profiling using self organising maps with 8 clusters
getExpressionProfiles(cage_data, what = "consensusClusters", tpmThreshold = 2, nrPassThreshold = 1, method = "som", xDim = 3, yDim = 2)
plotExpressionProfiles(cage_data, what = "consensusClusters")

consensusClusters_expression_profiles5 different expression profiles.

class2_1 <- extractExpressionClass(cage_data, what = "consensusClusters", which = "2_1")
#cerebellum specific clusters
head(class2_1)
  consensus.cluster  chr   start     end strand       tpm blood__control
2                 2 chr1  885811  885876      + 11.693362              0
4                 4 chr1  947517  947518      +  3.322395              0
6                 6 chr1 1360932 1360961      + 23.660102              0
7                 7 chr1 1541123 1541139      +  8.484781              0
8                 8 chr1 1940548 1940686      + 91.435014              0
9                 9 chr1 2021234 2021235      +  3.322395              0
  cerebellum__cerebellum expression_class
2              11.693362              2_1
4               3.322395              2_1
6              23.660102              2_1
7               8.484781              2_1
8              91.435014              2_1
9               3.322395              2_1
dim(class2_1)
[1] 6644    9
#shifting promoters
cumulativeCTSSdistribution(cage_data, clusters = "consensusClusters")
scoreShift(cage_data, groupX = "blood__control", groupY = "cerebellum__cerebellum", testKS = TRUE, useTpmKS = FALSE)
shifting.promoters <- getShiftingPromoters(cage_data, tpmThreshold = 5, scoreThreshold = 0.6, fdrThreshold = 0.01)
head(shifting.promoters)
  consensus.cluster   chr     start       end strand shifting.score groupX.pos
1              1103 chr10  73280999  73281093      -      0.6998989   73281014
2              1126 chr10  81955258  81955275      -      0.7953620   81955273
3              1132 chr10  88844515  88844655      -      0.6614246   88844626
4              1343 chr11  63362952  63363194      +      0.6515045   63362988
5              1468 chr11 120827982 120828433      +      0.7961409  120827994
6              1524 chr11   1741677   1741785      -      0.6623520    1741683
  groupY.pos groupX.tpm groupY.tpm    pvalue.KS       fdr.KS
1   73281042  354.92565   40.83468 5.069056e-12 6.853364e-10
2   81955270   27.58374   10.08907 7.091682e-04 8.725338e-03
3   88844580   48.46444   15.01576 7.727921e-04 9.412747e-03
4   63363158   16.06317   21.31103 4.525447e-05 1.001381e-03
5  120828239  105.56836  157.76106 0.000000e+00 0.000000e+00
6    1741733  213.65201  119.62965 0.000000e+00 0.000000e+00
#28 cases of promoter shifting
dim(shifting.promoters)
[1] 28 12

Using CAGEr with BAM files

Here I just demonstrate how ENCODE CAGE BAM files can be loaded into R using CAGEr and subsequently analysed. Firstly download some ENCODE CAGE data:

wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage/wgEncodeRikenCageHelas3CellPapAlnRep1.bam
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage/wgEncodeRikenCageHelas3CellPapAlnRep2.bam
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage/wgEncodeRikenCageHelas3CytosolPapAlnRep1.bam
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage/wgEncodeRikenCageHelas3CytosolPapAlnRep2.bam
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage/wgEncodeRikenCageHelas3NucleusPapAlnRep1.bam
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage/wgEncodeRikenCageHelas3NucleusPapAlnRep2.bam

Load these BAM files into R:

#CAGE mapped to hg19
library("CAGEr")
library("BSgenome.Hsapiens.UCSC.hg19")
bam_file <- list.files(".", full.names=T, pattern="*.bam")
bam_file
[1] "./wgEncodeRikenCageHelas3CellPapAlnRep1.bam"
[2] "./wgEncodeRikenCageHelas3CellPapAlnRep2.bam"
[3] "./wgEncodeRikenCageHelas3CytosolPapAlnRep1.bam"
[4] "./wgEncodeRikenCageHelas3CytosolPapAlnRep2.bam"
[5] "./wgEncodeRikenCageHelas3NucleusPapAlnRep1.bam"
[6] "./wgEncodeRikenCageHelas3NucleusPapAlnRep2.bam"
cage_bam <- new("CAGEset", genomeName = "BSgenome.Hsapiens.UCSC.hg19", inputFiles = bam_file, inputFilesType = 'bam', sampleLabels = c('cell_1','cell_2','cytosol_1','cytosol_2','nucleus_1','nucleus_2'))
cage_bam

S4 Object of class CAGEset

=======================================
Input data information
=======================================
Reference genome (organism): BSgenome.Hsapiens.UCSC.hg19
Input file type: bam
Input file names: ./wgEncodeRikenCageHelas3CellPapAlnRep1.bam, ./wgEncodeRikenCageHelas3CellPapAlnRep2.bam, ./wgEncodeRikenCageHelas3CytosolPapAlnRep1.bam, ./wgEncodeRikenCageHelas3CytosolPapAlnRep2.bam, ./wgEncodeRikenCageHelas3NucleusPapAlnRep1.bam, ./wgEncodeRikenCageHelas3NucleusPapAlnRep2.bam
Sample labels: cell_1, cell_2, cytosol_1, cytosol_2, nucleus_1, nucleus_2
=======================================
CTSS information
=======================================
CTSS chromosome:
CTSS position:
CTSS strand:
Tag count:
Normalized tpm:
=======================================
Tag cluster (TC) information
=======================================
CTSS clustering method:
Number of TCs per sample:
=======================================
Consensus cluster information
=======================================
Number of consensus clusters:
Consensus cluster chromosome:
Consensus cluster start:
Consensus cluster end:
Consensus cluster strand:
Normalized tpm:
=======================================
Expression profiling
=======================================
Expression clustering method:
Expression clusters for consensus clusters:
=======================================
Promoter shifting
=======================================
GroupX:
GroupY:
Shifting scores:
KS p-values (FDR adjusted):

#now run getCTSS(), which performs some data filtering
#the default of the parameter sequencingQualityThreshold is 10
#the default of the parameter mappingQualityThreshold is 20
#multi mapping tags will be removed using mappingQualityThreshold = 20,
#since they are mapped with lower confidence
#CAGE produces a non-templated G, which is removed by default in the filtering step
#I want a less stringent mapping quality threshold
#also this step takes a while, so go grab a coffee or something
getCTSS(cage_bam, mappingQualityThreshold=10)

#Extracts the tag count for all detected TSSs in all CAGE datasets from a CAGEset object
cage_bam_ctss <- CTSStagCount(cage_bam)
head(cage_bam_ctss)
   chr   pos strand cell_1 cell_2 cytosol_1 cytosol_2 nucleus_1 nucleus_2
1 chr1 15928      -      0      0         0         0         0         1
2 chr1 15929      -      0      1         0         0         0         3
3 chr1 15931      -      0      0         0         0         1         2
4 chr1 15932      -      0      0         0         0         1         0
5 chr1 16120      -      0      0         0         0         0         1
6 chr1 16447      +      1      0         0         0         0         0

librarySizes(cage_bam)
   cell_1    cell_2 cytosol_1 cytosol_2 nucleus_1 nucleus_2
 20768404  17170199  15903886  18974027  12215062  16634562

plotReverseCumulatives(cage_bam, onePlot = TRUE)

hela_CTSS_reverse_cumulatives_raw_all_samples

#normalise using simple tpm
normalizeTagCount(cage_bam, method = "simpleTpm")

#install multicore package
install.packages("multicore")
library(multicore)

#CTSS clustering
clusterCTSS(object = cage_bam, threshold = 1, thresholdIsTpm = TRUE, nrPassThreshold = 1, method = "distclu", maxDist = 20, removeSingletons = TRUE, keepSingletonsAbove = 5, useMulticore = T, nrCores = 6)

#promoter shape
cumulativeCTSSdistribution(cage_bam, clusters = "tagClusters", useMulticore=T, ncCores=6)
quantilePositions(cage_bam, clusters = "tagClusters", qLow = 0.1, qUp = 0.9, useMulticore=T, ncCores=6)
plotInterquantileWidth(cage_bam, clusters = "tagClusters", tpmThreshold = 3, qLow = 0.1, qUp = 0.9)

hela_promoter_widthBy eye, the distribution of promoter widths are quite similar.

aggregateTagClusters(cage_bam, tpmThreshold = 5, qLow = 0.1, qUp = 0.9, maxDist = 100)
cage_bam_consensus_cluster <- consensusClusters(cage_bam)
head(cage_bam_consensus_cluster)
  consensus.cluster  chr  start    end strand       tpm
1                 1 chr1 566751 566768      +  23.08532
2                 2 chr1 566934 566935      +  35.65526
3                 3 chr1 877489 877551      + 164.39766
4                 4 chr1 877694 877706      +  12.93554
5                 5 chr1 878800 878811      +  20.05782
6                 6 chr1 879544 879552      +  37.91798

getExpressionProfiles(cage_bam, what = "consensusClusters", tpmThreshold = 10, nrPassThreshold = 1, method = "som", xDim = 4, yDim = 2)
plotExpressionProfiles(cage_bam, what = "consensusClusters")

hela_consensusClusters_expression_profiles-01The cell and cytosol libraries have a much more similar expression pattern than the nucleus.

class3_1 <- extractExpressionClass(cage_bam, what = "consensusClusters", which = "3_1")
head(class3_1)
   consensus.cluster  chr    start      end strand      tpm   cell_1  cell_2
5                  5 chr1   878800   878811      + 20.05782 0.000000  0.0000
6                  6 chr1   879544   879552      + 37.91798 0.000000  0.0000
14                14 chr1  1407134  1407162      + 22.91589 9.870763  0.0000
20                20 chr1  6266141  6266194      + 28.16439 0.000000  0.0000
51                51 chr1 12290077 12290151      + 88.13640 7.029909 13.3953
64                64 chr1 16971245 16971284      + 10.94108 0.000000  0.0000
   cytosol_1 cytosol_2 nucleus_1 nucleus_2 expression_class
5    0.00000   0.00000  11.46126  8.596559              3_1
6    0.00000   7.85284  10.88820 19.176940              3_1
14   0.00000   0.00000   0.00000 13.045129              3_1
20   0.00000   0.00000  10.06954 18.094856              3_1
51   8.80288   0.00000  32.33713 26.571184              3_1
64   0.00000   0.00000   0.00000 10.941076              3_1

cumulativeCTSSdistribution(cage_bam, clusters = "consensusClusters", useMulticore = T, nrCores = 8)
scoreShift(cage_bam, groupX = c('cell_1','cell_2'), groupY = c('nucleus_1','nucleus_2'), useMulticore = T, nrCores = 8)

cage_bam_shifting_promoter <- getShiftingPromoters(cage_bam, tpmThreshold = 5, scoreThreshold = 0.6)
dim(cage_bam_shifting_promoter)
[1]  9 12
cage_bam_shifting_promoter
  consensus.cluster   chr     start       end strand shifting.score groupX.pos
1              1277  chr1 173834364 173834440      -      0.7354745  173834365
2              1741 chr10  32557674  32557679      -      0.6651646   32557675
3              1896 chr10 120810706 120810810      -      0.8260399  120810810
4              2086 chr11  62432782  62432789      +      0.6538574   62432784
5              2421 chr11  10828364  10828414      -      0.8521818   10828402
6              2575 chr11  62623334  62623360      -      0.7124639   62623360
7              3630 chr13  52034635  52034641      +      0.9126441   52034641
8              6197 chr18    657854    657865      -      0.9591837     657864
9             10713  chr6 154801343 154801345      +      0.6234377  154801344
  groupY.pos groupX.tpm groupY.tpm    pvalue.KS      fdr.KS
1  173834438   8.137361  11.133795 0.0054630258 0.068119111
2   32557679  16.245710  22.594163 0.0001548836 0.003333143
3  120810707  16.893608  10.806331 0.0001633640 0.003497887
4   62432789  14.644714  13.108258 0.0039198589 0.052506294
5   10828365   7.463260   6.221827 0.0117281810 0.125559655
6   62623351 182.683028 628.354358 0.0000000000 0.000000000
7   52034639  90.599848  32.429580 0.0000000000 0.000000000
8     657855   6.697651   5.891348 0.0062558015 0.076101781
9  154801345  10.308558   5.321299 0.0228524497 0.209703378

hela_promoter_shiftIGV snapshot of consensus cluster 1277, which had a shifting score of 0.7354745. The four tracks have the same scale (0-200) and shows a slight shift in the promoter usage in the nucleus tracks (bottom 2).

Conclusions

If you perform CAGE analysis, you will no doubt perform many of the analysis steps available through the CAGEr package, such as obtaining CTSSs, aggregating CTSSs into tag clusters and analysing promoter widths, i.e. sharp or board promoter analysis. By implementing these steps in Bioconductor, it allows CAGE analyses to be standardised. I've skipped over the track exporting features of the package, which allow users to create UCSC Genome Browser compliant tracks and is a very nice feature.

Lastly, to use CAGEr to prepare a data.frame for further downstream analyses, such as differential expression using edgeR, use normalizeTagCount() and specify method="none" to retain the raw counts.

#following on from above
for_edger <- cage_bam
#must normalise before CTSS clustering
normalizeTagCount(for_edger, method = "none")
clusterCTSS(object = for_edger, threshold = 1, thresholdIsTpm = FALSE, nrPassThreshold = 1, method = "distclu", maxDist = 20, removeSingletons = FALSE, useMulticore = T, nrCores = 8)
aggregateTagClusters(for_edger, tpmThreshold = 0, qLow = NULL , qUp = NULL, maxDist = 100)
head(for_edger@consensusClustersTpmMatrix)
                 sample
consensus.cluster cell_1 cell_2 cytosol_1 cytosol_2 nucleus_1 nucleus_2
                1      1      0         0         0         0         0
                2      0      0         0         0         0         1
                3      0      0         0         1         0         0
                4      0      0         0         0         2         0
                5      0      1         0         1         0         0
                6      0      0         1         0         0         0

for_edger_count <- for_edger@consensusClustersTpmMatrix
dim(for_edger_count)
[1] 1785762       6

for_edger_consensus_cluster <- consensusClusters(for_edger)
head(for_edger_consensus_cluster)
  consensus.cluster  chr  start    end strand tpm
1                 1 chr1  16446  16447      +   1
2                 2 chr1  54691  54692      +   1
3                 3 chr1  61839  61840      +   1
4                 4 chr1 234426 234428      +   2
5                 5 chr1 534298 534307      +   2
6                 6 chr1 540631 540632      +   1
dim(for_edger_consensus_cluster)
[1] 1785762       6

for_edger_consensus_cluster$id <- paste(for_edger_consensus_cluster$chr, for_edger_consensus_cluster$start, for_edger_consensus_cluster$end, for_edger_consensus_cluster$strand, sep="_")

d <- for_edger_count
rownames(d) <- for_edger_consensus_cluster$id
head(d)
                      sample
consensus.cluster      cell_1 cell_2 cytosol_1 cytosol_2 nucleus_1 nucleus_2
  chr1_16446_16447_+        1      0         0         0         0         0
  chr1_54691_54692_+        0      0         0         0         0         1
  chr1_61839_61840_+        0      0         0         1         0         0
  chr1_234426_234428_+      0      0         0         0         2         0
  chr1_534298_534307_+      0      1         0         1         0         0
  chr1_540631_540632_+      0      0         1         0         0         0

library(edgeR)
group <- c(rep("C",4),rep("N",2))
d <- DGEList(counts = d, group=group)
d <- calcNormFactors(d)
d <- estimateCommonDisp(d, verbose=T)
Disp = 0.24589 , BCV = 0.4959
d <- estimateTagwiseDisp(d)
de.tgw <- exactTest(d)
summary(decideTestsDGE(de.tgw, p.value=0.01))
   [,1]
-1    2589
0  1781401
1     1772
topTags(de.tgw)
Comparison of groups:  N-C
                               logFC    logCPM       PValue          FDR
chr1_94482781_94482785_+   11.810768  3.184550 7.874442e-37 1.406188e-30
chr9_19378703_19380410_-   -2.650257 11.242988 6.810442e-31 6.080914e-25
chr7_158387793_158387956_+  7.918394  5.187417 6.835810e-29 4.069043e-23
chr12_93477477_93477492_-   6.395872  3.584738 5.491582e-28 2.392241e-22
chr19_2269434_2271513_+    -3.167503  9.322378 6.698097e-28 2.392241e-22
chr11_57959329_57959330_+   8.835326  2.699936 2.607977e-27 7.762043e-22
chr5_140026819_140027418_- -3.048639  9.172754 2.127501e-24 5.427443e-19
chr5_172377711_172379664_+ -2.911335  4.187378 4.344533e-24 8.960908e-19
chr1_8937960_8939433_-     -2.161808 12.759042 4.516177e-24 8.960908e-19
chr3_53898686_53899598_+    3.717081  2.148959 6.049100e-24 1.080225e-18

d$counts['chr1_94482781_94482785_+',]
   cell_1    cell_2 cytosol_1 cytosol_2 nucleus_1 nucleus_2
        0         0         0         0       603       515

sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] BSgenome.Hsapiens.UCSC.hg19_1.3.19 CAGEr_1.2.6
[3] BSgenome.Mmusculus.UCSC.mm9_1.3.19 BSgenome_1.28.0
[5] Biostrings_2.28.0                  GenomicRanges_1.12.4
[7] IRanges_1.18.2                     BiocGenerics_0.6.0

loaded via a namespace (and not attached):
 [1] RCurl_1.95-4.1     Rsamtools_1.12.3   VGAM_0.9-1         XML_3.98-1.1
 [5] beanplot_1.1       bitops_1.0-5       data.table_1.8.8   rtracklayer_1.20.4
 [9] som_0.3-5          stats4_3.0.1       tools_3.0.1        zlibbioc_1.6.0

See also

My post on using the FANTOM3 dataset
CAGEr vignette
CAGEr manual
FANTOM3and4CAGE vignette

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
34 comments Add yours
  1. Hi there

    I am having a problem using "CAGEr". I followed your script above, but I got an error message saying "CAGEset is not a defined class".

    It would be appreciated if you could help me out. Details are below:

    > library(CAGEr)
    > library("BSgenome.Hsapiens.UCSC.hg19")
    > bam_file bam_file
    [1] "./frontal_CAGE.q10.sorted.bam" "./lcl_2339_CAGE.q10.sorted.bam"
    > cage_bamError in getClass(Class, where = topenv(parent.frame())) :
    “CAGEset” is not a defined class

      1. Hi Dave

        Thank you very much for your prompt reply. Yes, you are right. Some commands were missing in the previous message. My apology.

        We have bam files of CAGE data from RIKEN. So, I would like to use CAGEr for analyses. Any comments and tips would be greatly appreciated.

        Here are the completed commands and an error message below:
        ------------------------------------------------------------
        library("CAGEr")
        library("BSgenome.Hsapiens.UCSC.hg19")
        bam_file <- list.files(".", full.names=T, pattern="*.bam")
        bam_file
        [1] "./frontal_CAGE.q10.sorted.bam" "./lcl_2339_CAGE.q10.sorted.bam"
        cage_bam <- new("CAGEset", genomeName = "BSgenome.Hsapiens.UCSC.hg19", inputFiles = bam_file, inputFilesType = 'bam', sampleLabels = c('Frontal', 'LCL'))
        ------------------------------------------------------------
        Error in getClass(Class, where = topenv(parent.frame())) :
        “CAGEset” is not a defined class

        1. Hi Shingo,

          I can't replicate your error. Are you using the latest R and CAGEr versions? I'm using R version 3.0.1 (2013-05-16) and CAGEr_1.2.5. You can find this information by typing sessionInfo() in R.

          Cheers,

          Dave

  2. Hi Dave

    Yes, I am using exactly the same version of R and CAGEr as yours.
    Do you think there is something wrong with my BAM files?
    They have been sorted using samtools sort.

    Kind regards

    Shingo

    1. Hi Shingo,

      I don't think it's your BAM files. I recently had a problem with CAGEr (a different problem) and wrote to the author of the package, but she hasn't replied.

      Regarding your error, it seems to be Bioconductor related. Try emailing the mailing list (bioconductor (at) r-project.org) with your code snippet and the output of sessionInfo(). I'm sure they can assist you better than me.

      Good luck!

      Dave

    1. Hi Shingo,

      I just got a reply from the developer and she told me a new version of CAGEr is coming out in the next couple of days.

      In addition to this try updating all Bioconductor packages to see if it fixes your problem:


      source("http://bioconductor.org/biocLite.R")
      biocLite()

      Cheers,

      Dave

  3. Hi Dave

    Great news. Thanks for the advice. I will update Bioconductor package. Also, keep an eye on the new version of CAGEr.

    Have a nice weekend!

    Shingo

    1. Hi Shingo,

      I've just updated this post with the latest version of CAGEr and some publicly available CAGE data (from ENCODE).

      Try to follow the example with one or two BAM files (if you don't have a computer with lots of memory) and see if you get any errors.

      Good luck!

      Cheers,

      Dave

  4. Hi Dave,
    I am handling with CAGE data. And After mapping with STAR, I got some BAM files. All the bam files are >2G, Then I put into CAGEr, there always report errors.
    Error in .Call(.NAME, ..., PACKAGE = PACKAGE) :
    negative length vectors are not allowed
    I think maybe bamfiles are too large. So I extract the chr1 as a test file.
    samtools view -b **.bam chr1 > test_chr1.bam
    Error in setnames(CTSS, c("chr", "pos", "strand", sample.labels[i])) :
    Can't assign 4 names to a 1 column data.table

    1. Hi Jing,

      I guess the usual troubleshooting steps apply here: Are you using the latest version of R, Bioconductor, and the CAGEr package? Does the CAGEr package work with another BAM file? Try this one http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage/wgEncodeRikenCageHelas3CellPapAlnRep1.bam from ENCODE. If the ENCODE CAGE file works, try take a subsample of your original BAM file:

      samtools view -s 0.01 -b your_bam_file.bam > test.bam

      and see if it's something inherent with your BAM file.

      Hope that helps,

      Dave

      1. Hello,

        I was wondering if this error was ever solved? I get the same error and I am unfamiliar with bamfiles. I can not tell if my bamfiles are not suitable for this analysis, nor solve how to fix them.

        Thank you

        Poppy

            1. All mapped by HISAT with quality values. The problem occurs after I merge fastq or bam files from 2 lanes, so I think the size is the problem, especially since small merged bam files give no error. So right now I am downsampling and try to circumvent the error.

  5. Hi, Dave. Thanks for the great tutorial.
    In the EdgeR analysis, you mentioned to "use normalizeTagCount() and specify method="none" to retain the raw counts."

    I just wondering why you not use normalized data?
    I have no experience in Cage data, but for microarray, normalized data is used for T-test/linear modeling.

    Thanks!

    1. Hi Xiaoxi,

      thanks for the comment. I wrote:

      to use CAGEr to prepare a data.frame for further downstream analyses, such as differential expression using edgeR, use normalizeTagCount() and specify method="none" to retain the raw counts.

      So the normalizeTagCount(for_edger, method = "none") is for the CAGEr package not edgeR. The calcNormFactors() function in edgeR performs the normalisation. The package edgeR requires raw counts.

      Cheers,

      Dave

  6. Hello,
    Thank you for this tutorial. I have only one question, is it possible to scale all the Y axis when using "plotInterquantileWidth" function? For an easier comparison between the different conditions, it would be great if you could for example set these Y axis to .20 for all plots. Is this possible?

    1. I'm not sure whether this can be done via the plotInterquantileWidth() function. My approach would be to obtain the widths from the CAGEset object and create histograms using hist() function with the ylim=c(0,0.2) parameter.

  7. Hi Dave,

    I ran into a similar issue as Jing Li.
    I got up to the getCTSS(myCAGEset) step alright.

    But when I tried clusterCTSS, this is what I got:
    > clusterCTSS(object = myCAGEset, threshold = 1, thresholdIsTpm = TRUE, nrPassThreshold = 1, method = "distclu", maxDist = 20, removeSingletons = TRUE, keepSingletonsAbove = 5)

    Filtering CTSSs below threshold...
    Clustering...
    -> cage1
    Error in `[.data.frame`(data, , c("chr", "pos", "strand", s)) :
    undefined columns selected

    I tried using the demo bam you linked to but I got the same result. I am using a recent version of R (3.2.0) and CageR. Do you know what might be the issue?

  8. Hi Dave,
    I am using CAGEr package, taking input of bam files.
    upto normalization step it is running fine. I used the following commend for normalization:
    > normalizeTagCount(cage_bam, method = "simpleTpm")
    > cage_ctss_normalised head(cage_ctss_normalised)

    And following output also getting.
    chr pos strand Cell_Gm12878_1 Cell_Gm12878_2 Cell_K562_1 Cell_K562_2
    1 chr1 15929 - 0.0000000 0.00000000 0.00000000 0.05430723
    2 chr1 17496 - 0.0000000 0.05120271 0.19671273 0.00000000
    3 chr1 17497 - 0.0000000 0.00000000 0.06557091 0.05430723
    4 chr1 17500 - 0.0000000 0.00000000 0.06557091 0.00000000
    5 chr1 17504 - 0.1179744 0.00000000 0.00000000 0.00000000
    6 chr1 55983 - 0.0000000 0.00000000 0.00000000 0.05430723

    Then I am doing clustering using the following command :

    > clusterCTSS(object = cage_bam, threshold = 1, thresholdIsTpm = TRUE, nrPassThreshold = 1, method = "distclu", maxDist = 20, removeSingletons = TRUE, keepSingletonsAbove = 2, useMulticore = T, nrCores = 6)

    To check the clustering step run successfully or not I am using the following command:

    tc head(tc)
    NULL
    In return I am getting NULL output.
    And in the following step i am getting problem.
    Please Help.

    1. Did you save your tag clusters into another object like my example?

      clusterCTSS(object = cage_data, threshold = 1, thresholdIsTpm = TRUE, nrPassThreshold = 1, method = "distclu", maxDist = 20, removeSingletons = TRUE, keepSingletonsAbove = 2)
       
      blood_tag_clusters <- tagClusters(cage_data, sample="blood__control")
      
  9. Hi Dave,

    I followed your script above, but I got an error message saying :-

    > cage_data <- importPublicData(dataset = FANTOMtissueCAGEhuman, group = c("blood", "cerebellum"), sample = c("control", "cerebellum"))
    Error in (function (classes, fdef, mtable) :
    unable to find an inherited method for function 'importPublicData' for signature '"missing", "list", "character", "character"'

    It would be appreciated if you could help me out

    1. Hi Ajay,

      the author of the package has made changes to CAGEr and to the importPublicData() function; my post is quite old now. Here's how to import datasets using the new syntax:

      cage_data <- importPublicData(source="FANTOM3and4", dataset = "FANTOMtissueCAGEhuman", group = c("blood", "cerebellum"), sample = c("control", "cerebellum"))
      
  10. Hi Dave,

    I was running CAGEr using data in .bam format , download from ENCODE code given above ran successfully for Gm12878 Cell line , but it is showing following error while working with K562 Cell line :

    >getCTSS(cage_bam_K562, mappingQualityThreshold=10)

    Reading in file: ./wgEncodeRikenCageK562CytosolPapAlnRep2.bam...
    -> Filtering out low quality reads...
    -> Removing the first base of the reads if 'G' and not aligned to the genome...
    -> Estimating the frequency of adding a 'G' nucleotide and correcting the systematic bias...
    -> Making CTSSs and counting number of tags...

    Reading in file: ./wgEncodeRikenCageK562NucleolusTotalAln.bam...
    Error in value[[3L]](cond) :
    failed to open BamFile: SAM/BAM header missing or empty
    file: './wgEncodeRikenCageK562NucleolusTotalAln.bam'
    In addition: Warning messages:
    1: In doTryCatch(return(expr), name, parentenv, handler) :
    [bam_header_read] EOF marker is absent. The input is probably truncated.
    2: In doTryCatch(return(expr), name, parentenv, handler) :
    [bam_header_read] EOF marker is absent. The input is probably truncated.
    3: In doTryCatch(return(expr), name, parentenv, handler) :
    [bam_header_read] EOF marker is absent. The input is probably truncated.
    4: In doTryCatch(return(expr), name, parentenv, handler) :
    [bam_header_read] EOF marker is absent. The input is probably truncated.
    5: In doTryCatch(return(expr), name, parentenv, handler) :
    [bam_header_read] invalid BAM binary header (this is not a BAM file).

      1. Thanks Dave,

        I found there are lot of files missing in your aforementioned link . Is it possible to use this script for all those cage file for example like IMR90 cell line or other which are not present in aforementioned list ?

        1. Hi Ajay,

          it seems the author of the package has made a lot of changes to CAGEr since this post was published. I would recommend going through the vignette instead of using my code.

          Cheers,

          Dave

  11. Hello Dave,

    There is a function clusterCTSS() which uses an argument "threshold". In the documentation the description of this argument is written as:

    "Only CTSSs with signal >= threshold in >= nrPassThreshold experiments will be used for clustering and will contribute towards total signal of the cluster".

    I opened the corresponding sam file and here is an entry on that file

    "HWI-EAS420_0007:1:76:17187:1705#0|GAT 16 chr16 27899107 3 7M7D20M * 0 0 TTTTTTTTTTCTTTTTTTTTCTCCTCC ]dd]_bbebWfccfefe\faf`f\ffe NM:i:7 MD:Z:7^CTTTTCT20 XP:Z:L888~~~~~~~~~~~~~~~~~~~~~~~"

    I opened this cage tag in UCSC genome browser and here is the details:

    Read name: HWI-EAS420_0007:1:76:17187:1705#0|GAT
    Position: chr16:27899107-27899140
    Band: 16p12.1
    Genomic Size: 34
    Alignment Quality: 3
    CIGAR string: 7M7D20M (7 (mis)Match, 7 Deletion, 20 (mis)Match)
    Tags: NM:7 MD:7^CTTTTCT20 XP:L888~~~~~~~~~~~~~~~~~~~~~~~
    Flags: 0x10:

    Can you please tell me which is the "signal" value in the sam file that the "threshhold" argument is using?

    I think it is the 5th coloumn of the sam file (in this example the value is 3), i.e. the Alignment Quality (3 in this example) but I m not sure.

    Thank you.

Leave a Reply

Your email address will not be published. Required fields are marked *