Getting started with HISAT, StringTie, and Ballgown

A popular toolset used for analysing RNA-seq data is the tuxedo suite, which consists of TopHat and Cufflinks. The suite provided a start to finish pipeline that allowed users to map reads, assemble transcripts, and perform differential expression analyses. A newer "tuxedo suite" has been developed and is made up of three tools: HISAT, StringTie, and Ballgown. A Nature Protocols article provides a summary of the new suite as well as a tutorial; this post was written while I was going through the tutorial.

I worked through the tutorial on a MacBook Pro, which means that I downloaded binaries for OS X. If you're using some flavour of Linux, download the Linux binaries instead. The data for the tutorial is available at ftp://ftp.ccb.jhu.edu/pub/RNAseq_protocol; you can perform a recursive download using wget to download all the files on the FTP server. You can use your data but you'll have to index the relevant reference file and prepare your own sample text file. For this post, I used the same data as the tutorial.

# recursive download
wget -c -r ftp://ftp.ccb.jhu.edu/pub/RNAseq_protocol

# move the data tarball to directory root
mv ftp.ccb.jhu.edu/pub/RNAseq_protocol/chrX_data.tar.gz .

# extract
tar xzf chrX_data.tar.gz

# check out the directory structure
tree --charset=ascii chrX_data
chrX_data
|-- genes
|   `-- chrX.gtf
|-- genome
|   `-- chrX.fa
|-- geuvadis_phenodata.csv
|-- indexes
|   |-- chrX_tran.1.ht2
|   |-- chrX_tran.2.ht2
|   |-- chrX_tran.3.ht2
|   |-- chrX_tran.4.ht2
|   |-- chrX_tran.5.ht2
|   |-- chrX_tran.6.ht2
|   |-- chrX_tran.7.ht2
|   `-- chrX_tran.8.ht2
|-- mergelist.txt
`-- samples
    |-- ERR188044_chrX_1.fastq.gz
    |-- ERR188044_chrX_2.fastq.gz
    |-- ERR188104_chrX_1.fastq.gz
    |-- ERR188104_chrX_2.fastq.gz
    |-- ERR188234_chrX_1.fastq.gz
    |-- ERR188234_chrX_2.fastq.gz
    |-- ERR188245_chrX_1.fastq.gz
    |-- ERR188245_chrX_2.fastq.gz
    |-- ERR188257_chrX_1.fastq.gz
    |-- ERR188257_chrX_2.fastq.gz
    |-- ERR188273_chrX_1.fastq.gz
    |-- ERR188273_chrX_2.fastq.gz
    |-- ERR188337_chrX_1.fastq.gz
    |-- ERR188337_chrX_2.fastq.gz
    |-- ERR188383_chrX_1.fastq.gz
    |-- ERR188383_chrX_2.fastq.gz
    |-- ERR188401_chrX_1.fastq.gz
    |-- ERR188401_chrX_2.fastq.gz
    |-- ERR188428_chrX_1.fastq.gz
    |-- ERR188428_chrX_2.fastq.gz
    |-- ERR188454_chrX_1.fastq.gz
    |-- ERR188454_chrX_2.fastq.gz
    |-- ERR204916_chrX_1.fastq.gz
    `-- ERR204916_chrX_2.fastq.gz

4 directories, 36 files

A description of the data set is provided by geuvadis_phenodata.csv. Normally, you will have to prepare this file yourself; it will be used later in the Ballgown step.

cat chrX_data/geuvadis_phenodata.csv 
"ids","sex","population"
"ERR188044","male","YRI"
"ERR188104","male","YRI"
"ERR188234","female","YRI"
"ERR188245","female","GBR"
"ERR188257","male","GBR"
"ERR188273","female","YRI"
"ERR188337","female","GBR"
"ERR188383","male","GBR"
"ERR188401","male","GBR"
"ERR188428","female","GBR"
"ERR188454","male","YRI"
"ERR204916","female","YRI"

Now let's download the programs; have a look at the HISAT2 page to find the appropriate binary to download. I like to download programs in a src directory and link them to a bin directory, which is in my PATH.

# for OS X
cd ~/src
wget -c ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/downloads/hisat2-2.1.0-OSX_x86_64.zip
unzip hisat2-2.1.0-OSX_x86_64.zip

# provide link to binaries in my bin directory
cd ~/bin/
ln -s ~/src/hisat2-2.1.0/hisat2* .
# some files were already linked
ln -s ~/src/hisat2-2.1.0/*.py .
ln: ./hisat2_extract_exons.py: File exists
ln: ./hisat2_extract_snps_haplotypes_UCSC.py: File exists
ln: ./hisat2_extract_snps_haplotypes_VCF.py: File exists
ln: ./hisat2_extract_splice_sites.py: File exists
ln: ./hisat2_simulate_reads.py: File exists

Again, take a look at the StringTie page to find the appropriate binary to download.

# for OS X
cd ~/src
wget -c http://ccb.jhu.edu/software/stringtie/dl/stringtie-1.3.3b.OSX_x86_64.tar.gz
tar xzf stringtie-1.3.3b.OSX_x86_64.tar.gz

# provide link to binary in my bin directory
cd ~/bin/
ln -s ~/src/stringtie-1.3.3b.OSX_x86_64/stringtie

The gffcompare tool needs to be compiled.

cd ~/src/
git clone https://github.com/gpertea/gclib
git clone https://github.com/gpertea/gffcompare
cd gffcompare
make release

# link again
cd ~/bin/
ln -s ~/src/gffcompare/gffcompare

Download SAMtools from http://www.htslib.org/download/ and compile.

# unzip and compile
tar xjf samtools-1.6.tar.bz2 
cd samtools-1.6
./configure
make

# link samtools
cd ~/bin
ln -s ~/src/samtools-1.6/samtools

Ballgown is a Bioconductor package, so we need to install that using R. While we are at it, we will install various dependencies too.

install.packages("devtools")
install.packages("dplyr")

source("https://www.bioconductor.org/biocLite.R")
biocLite(c("alyssafrazee/RSkittleBrewer", "ballgown", "genefilter"))

Now that we have downloaded and prepared all the required programs, we can start the analysis!

Mapping

Mapping is performed using HISAT2 and usually the first step, prior to mapping, is to create an index of the reference genome. The indices are provided in the data folder but let's create them again.

mkdir my_index
cd my_index

# use the Python scripts to extract splice-site and exon information from a gene annotation file
extract_splice_sites.py ../chrX_data/genes/chrX.gtf > chrX.ss
extract_exons.py ../chrX_data/genes/chrX.gtf > chrX.exon

head -3 chrX.ss
chrX    276393  281481  +
chrX    281683  284166  +
chrX    284313  288732  +

head -3 chrX.exon 
chrX    276323  276393  +
chrX    281393  281683  +
chrX    284166  284313  +

# now to build the index
# the --ss and --exon options can be omitted if annotation data is not available
time hisat2-build -p 8 --ss chrX.ss --exon chrX.exon ../chrX_data/genome/chrX.fa chrX_tran
# screen output not shown to save space
Total time for call to driver() for forward index: 00:03:34

real    3m33.870s
user    10m10.778s
sys     1m9.074s

ls -1
chrX.exon
chrX.fa
chrX.ss
chrX_tran.1.ht2
chrX_tran.2.ht2
chrX_tran.3.ht2
chrX_tran.4.ht2
chrX_tran.5.ht2
chrX_tran.6.ht2
chrX_tran.7.ht2
chrX_tran.8.ht2

Despite creating our own indices, we'll use the ones provided by the tutorial for reproducibility's sake. From geuvadis_phenodata.csv we saw that there are 12 samples; each sample has two FASTQ files since this is paired-end data. Let's start the mapping.

# create directory to store mapping results
mkdir map

# map each sample using 8 threads
hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188044_chrX_1.fastq.gz -2 chrX_data/samples/ERR188044_chrX_2.fastq.gz -S map/ERR188044_chrX.sam
hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188104_chrX_1.fastq.gz -2 chrX_data/samples/ERR188104_chrX_2.fastq.gz -S map/ERR188104_chrX.sam
hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188234_chrX_1.fastq.gz -2 chrX_data/samples/ERR188234_chrX_2.fastq.gz -S map/ERR188234_chrX.sam
hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188245_chrX_1.fastq.gz -2 chrX_data/samples/ERR188245_chrX_2.fastq.gz -S map/ERR188245_chrX.sam
hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188257_chrX_1.fastq.gz -2 chrX_data/samples/ERR188257_chrX_2.fastq.gz -S map/ERR188257_chrX.sam
hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188273_chrX_1.fastq.gz -2 chrX_data/samples/ERR188273_chrX_2.fastq.gz -S map/ERR188273_chrX.sam
hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188337_chrX_1.fastq.gz -2 chrX_data/samples/ERR188337_chrX_2.fastq.gz -S map/ERR188337_chrX.sam
hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188383_chrX_1.fastq.gz -2 chrX_data/samples/ERR188383_chrX_2.fastq.gz -S map/ERR188383_chrX.sam
hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188401_chrX_1.fastq.gz -2 chrX_data/samples/ERR188401_chrX_2.fastq.gz -S map/ERR188401_chrX.sam
hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188428_chrX_1.fastq.gz -2 chrX_data/samples/ERR188428_chrX_2.fastq.gz -S map/ERR188428_chrX.sam
hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188454_chrX_1.fastq.gz -2 chrX_data/samples/ERR188454_chrX_2.fastq.gz -S map/ERR188454_chrX.sam
hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR204916_chrX_1.fastq.gz -2 chrX_data/samples/ERR204916_chrX_2.fastq.gz -S map/ERR204916_chrX.sam

# mapping took around two and a half minutes
# real    2m36.509s
# user    15m17.815s
# sys     3m29.939s

You should always only store sorted BAM (or CRAM) files and delete the SAM files after conversion.

# sort mapping results using SAMtools on 8 threads
samtools sort -@ 8 -o map/ERR188044_chrX.bam map/ERR188044_chrX.sam
samtools sort -@ 8 -o map/ERR188104_chrX.bam map/ERR188104_chrX.sam
samtools sort -@ 8 -o map/ERR188234_chrX.bam map/ERR188234_chrX.sam
samtools sort -@ 8 -o map/ERR188245_chrX.bam map/ERR188245_chrX.sam
samtools sort -@ 8 -o map/ERR188257_chrX.bam map/ERR188257_chrX.sam
samtools sort -@ 8 -o map/ERR188273_chrX.bam map/ERR188273_chrX.sam
samtools sort -@ 8 -o map/ERR188337_chrX.bam map/ERR188337_chrX.sam
samtools sort -@ 8 -o map/ERR188383_chrX.bam map/ERR188383_chrX.sam
samtools sort -@ 8 -o map/ERR188401_chrX.bam map/ERR188401_chrX.sam
samtools sort -@ 8 -o map/ERR188428_chrX.bam map/ERR188428_chrX.sam
samtools sort -@ 8 -o map/ERR188454_chrX.bam map/ERR188454_chrX.sam
samtools sort -@ 8 -o map/ERR204916_chrX.bam map/ERR204916_chrX.sam

# remove SAM files
rm map/*.sam

# sorting and converting took just over a minute
real    1m14.533s
user    5m44.637s
sys     0m9.590s

Assembly

Now we need to assemble the mapped reads into transcripts. StringTie can assemble transcripts with or without annotation; as noted in the protocol, annotation can be helpful when the number of reads for a transcript is too low for an accurate assembly.

# store assembly results in a new directory
mkdir assembly

# create assembly per sample using 8 threads
stringtie map/ERR188044_chrX.bam -l ERR188044 -p 8 -G chrX_data/genes/chrX.gtf -o assembly/ERR188044_chrX.gtf
stringtie map/ERR188104_chrX.bam -l ERR188104 -p 8 -G chrX_data/genes/chrX.gtf -o assembly/ERR188104_chrX.gtf
stringtie map/ERR188234_chrX.bam -l ERR188234 -p 8 -G chrX_data/genes/chrX.gtf -o assembly/ERR188234_chrX.gtf
stringtie map/ERR188245_chrX.bam -l ERR188245 -p 8 -G chrX_data/genes/chrX.gtf -o assembly/ERR188245_chrX.gtf
stringtie map/ERR188257_chrX.bam -l ERR188257 -p 8 -G chrX_data/genes/chrX.gtf -o assembly/ERR188257_chrX.gtf
stringtie map/ERR188273_chrX.bam -l ERR188273 -p 8 -G chrX_data/genes/chrX.gtf -o assembly/ERR188273_chrX.gtf
stringtie map/ERR188337_chrX.bam -l ERR188337 -p 8 -G chrX_data/genes/chrX.gtf -o assembly/ERR188337_chrX.gtf
stringtie map/ERR188383_chrX.bam -l ERR188383 -p 8 -G chrX_data/genes/chrX.gtf -o assembly/ERR188383_chrX.gtf
stringtie map/ERR188401_chrX.bam -l ERR188401 -p 8 -G chrX_data/genes/chrX.gtf -o assembly/ERR188401_chrX.gtf
stringtie map/ERR188428_chrX.bam -l ERR188428 -p 8 -G chrX_data/genes/chrX.gtf -o assembly/ERR188428_chrX.gtf
stringtie map/ERR188454_chrX.bam -l ERR188454 -p 8 -G chrX_data/genes/chrX.gtf -o assembly/ERR188454_chrX.gtf
stringtie map/ERR204916_chrX.bam -l ERR204916 -p 8 -G chrX_data/genes/chrX.gtf -o assembly/ERR204916_chrX.gtf

# assembly and quantification took a minute and a half
# real    1m30.893s
# user    1m58.455s
# sys     0m9.860s

# before merging we need to modify mergelist.txt
# this is because I created a new directory to store the results
# the modified mergelist.txt should look like this
cat chrX_data/mergelist.txt
assembly/ERR188044_chrX.gtf
assembly/ERR188104_chrX.gtf
assembly/ERR188234_chrX.gtf
assembly/ERR188245_chrX.gtf
assembly/ERR188257_chrX.gtf
assembly/ERR188273_chrX.gtf
assembly/ERR188337_chrX.gtf
assembly/ERR188383_chrX.gtf
assembly/ERR188401_chrX.gtf
assembly/ERR188428_chrX.gtf
assembly/ERR188454_chrX.gtf
assembly/ERR204916_chrX.gtf

# merge all transcripts from the different samples
stringtie --merge -p 8 -G chrX_data/genes/chrX.gtf -o stringtie_merged.gtf chrX_data/mergelist.txt

# check out the transcripts
cat stringtie_merged.gtf | head
# stringtie --merge -p 8 -G chrX_data/genes/chrX.gtf -o stringtie_merged.gtf chrX_data/mergelist.txt
# StringTie version 1.3.3b
chrX    StringTie       transcript      322514  323718  1000    .       .       gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; 
chrX    StringTie       exon    322514  323718  1000    .       .       gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; exon_number "1"; 
chrX    StringTie       transcript      319145  321319  1000    +       .       gene_id "MSTRG.2"; transcript_id "NR_027232"; gene_name "LINC00685"; ref_gene_id "NR_027232"; 
chrX    StringTie       exon    319145  321319  1000    +       .       gene_id "MSTRG.2"; transcript_id "NR_027232"; exon_number "1"; gene_name "LINC00685"; ref_gene_id "NR_027232"; 
chrX    StringTie       transcript      319145  321319  1000    +       .       gene_id "MSTRG.2"; transcript_id "NR_027231"; gene_name "LINC00685"; ref_gene_id "NR_027231"; 
chrX    StringTie       exon    319145  319551  1000    +       .       gene_id "MSTRG.2"; transcript_id "NR_027231"; exon_number "1"; gene_name "LINC00685"; ref_gene_id "NR_027231"; 
chrX    StringTie       exon    320208  321319  1000    +       .       gene_id "MSTRG.2"; transcript_id "NR_027231"; exon_number "2"; gene_name "LINC00685"; ref_gene_id "NR_027231"; 
chrX    StringTie       transcript      304750  318701  1000    -       .       gene_id "MSTRG.3"; transcript_id "MSTRG.3.1";

# how many transcripts?
cat stringtie_merged.gtf | grep -v "^#" | awk '$3=="transcript" {print}' | wc -l
    3491

Let's compare the StringTie transcripts to known transcripts using gffcompare.

# compare the assembled transcripts to known transcripts
gffcompare -r chrX_data/genes/chrX.gtf -G -o merged stringtie_merged.gtf

cat merged.stats
# gffcompare v0.10.1 | Command line was:
#gffcompare -r chrX_data/genes/chrX.gtf -G -o merged stringtie_merged.gtf
#

#= Summary for dataset: stringtie_merged.gtf 
#     Query mRNAs :    3281 in    1521 loci  (2651 multi-exon transcripts)
#            (535 multi-transcript loci, ~2.2 transcripts per locus)
# Reference mRNAs :    2102 in    1086 loci  (1856 multi-exon)
# Super-loci w/ reference transcripts:      998
#-----------------| Sensitivity | Precision  |
        Base level:   100.0     |    77.6    |
        Exon level:   100.0     |    85.4    |
      Intron level:    99.8     |    91.0    |
Intron chain level:    99.6     |    69.7    |
  Transcript level:    99.6     |    63.8    |
       Locus level:   100.0     |    70.9    |

     Matching intron chains:    1848
       Matching transcripts:    2094
              Matching loci:    1086

          Missed exons:       0/8804    (  0.0%)
           Novel exons:     971/10608   (  9.2%)
        Missed introns:      14/7946    (  0.2%)
         Novel introns:     219/8714    (  2.5%)
           Missed loci:       0/1086    (  0.0%)
            Novel loci:     421/1521    ( 27.7%)

 Total union super-loci across all input datasets: 1521 
3281 out of 3281 consensus transcripts written in merged.annotated.gtf (0 discarded as redundant)

The high sensitivity means that almost all of the StringTie transcripts match the known transcripts, i.e. low false negative. The precision is much lower indicating that many of the StringTie transcripts are not in the list of known transcripts, which are either false positives or truly de novo transcripts. The novel exons, introns, and loci indicate how many of the sites were not found in the list of known transcripts.

All known transcripts were assembled by StringTie, including a few novel ones.

Now that we have our assembled transcripts, we can estimate their abundances.

stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188044/ERR188044_chrX.gtf map/ERR188044_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188104/ERR188104_chrX.gtf map/ERR188104_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188234/ERR188234_chrX.gtf map/ERR188234_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188245/ERR188245_chrX.gtf map/ERR188245_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188257/ERR188257_chrX.gtf map/ERR188257_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188273/ERR188273_chrX.gtf map/ERR188273_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188337/ERR188337_chrX.gtf map/ERR188337_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188383/ERR188383_chrX.gtf map/ERR188383_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188401/ERR188401_chrX.gtf map/ERR188401_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188428/ERR188428_chrX.gtf map/ERR188428_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188454/ERR188454_chrX.gtf map/ERR188454_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR204916/ERR204916_chrX.gtf map/ERR204916_chrX.bam

# estimation took just over a minute and a half
# real    1m39.661s
# user    2m0.179s
# sys     0m9.223s

# check out the files
ls -1 ballgown/ERR188044
ERR188044_chrX.gtf
e2t.ctab
e_data.ctab
i2t.ctab
i_data.ctab
t_data.ctab

Differential expression

To perform the expression analyses, we need to use R and Ballgown; I recommend using RStudio. To get started load the required libraries and the data.

library(ballgown)
library(RSkittleBrewer)
library(genefilter)
library(dplyr)
library(devtools)

# change this to the directory that contains all the StringTie results
setwd("~/muse/tuxedo")

# load the sample information
pheno_data <- read.csv("chrX_data/geuvadis_phenodata.csv")

# create a ballgown object
bg_chrX <- ballgown(dataDir = "ballgown",
                    samplePattern = "ERR",
                    pData = pheno_data)

class(bg_chrX)
[1] "ballgown"
attr(,"package")
[1] "ballgown"

bg_chrX
ballgown instance with 3491 transcripts and 12 samples

What methods are available for ballgown objects?

methods(class="ballgown")
 [1] dirs            eexpr           expr            expr<-          geneIDs         geneNames       gexpr          
 [8] iexpr           indexes         indexes<-       mergedDate      pData           pData<-         sampleNames    
[15] seqnames        show            structure       subset          texpr           transcriptIDs   transcriptNames
see '?methods' for accessing help and source code

# we can get the gene, transcript, exon, and intron expression levels using
# gexpr(), texpr(), eexpr(), and iexpr()
head(gexpr(bg_chrX), 2)
         FPKM.ERR188044 FPKM.ERR188104 FPKM.ERR188234 FPKM.ERR188245 FPKM.ERR188257 FPKM.ERR188273 FPKM.ERR188337
MSTRG.1        7.169349       10.42652       13.83639       1.050201       5.677819      10.756237       7.054854
MSTRG.10      21.428192       13.13144       14.11443      18.454338      10.182308       3.887958      23.385319
         FPKM.ERR188383 FPKM.ERR188401 FPKM.ERR188428 FPKM.ERR188454 FPKM.ERR204916
MSTRG.1        4.732841      11.424809       5.733899       6.688090       5.061143
MSTRG.10      11.815677       8.196958       9.578302       9.961549      10.997639

head(texpr(bg_chrX), 2)
  FPKM.ERR188044 FPKM.ERR188104 FPKM.ERR188234 FPKM.ERR188245 FPKM.ERR188257 FPKM.ERR188273 FPKM.ERR188337 FPKM.ERR188383
1        23.9694       18.49576       39.70492       14.06822       25.51846       23.84778       30.89737       20.71429
2         0.0000        0.00000       27.79636       13.96464       44.97094        0.00000        0.00000        0.00000
  FPKM.ERR188401 FPKM.ERR188428 FPKM.ERR188454 FPKM.ERR204916
1       28.03131       24.97612        28.2617       20.24706
2       25.81932        0.00000         0.0000        0.00000

Next we filter out transcripts with low variance.

# note that this subset function is not the base R function but a ballgown one
# to see the order in which R looks for functions in packages use search()
# search()
#  [1] ".GlobalEnv"             "package:bindrcpp"       "package:devtools"       "package:dplyr"         
#  [5] "package:genefilter"     "package:RSkittleBrewer" "package:ballgown"       "tools:rstudio"         
#  [9] "package:stats"          "package:graphics"       "package:grDevices"      "package:utils"         
# [13] "package:datasets"       "package:methods"        "Autoloads"              "package:base"
#
# the rowVars is from the genefilter package and calculates the row variance
bg_chrX_filt <- subset(bg_chrX, "rowVars(texpr(bg_chrX)) >1", genomesubset=TRUE)

# 1,264 transcripts were filtered out
bg_chrX_filt
ballgown instance with 2227 transcripts and 12 samples

Perform the differential expression analysis stattest() function; confounders are specified using the adjustvars parameter, which has to match the column name in pheno_data. We are testing for transcripts and genes that are differentially expressed between male and females, hence sex is our covariate of interest. In addition to testing transcripts and genes, we can also test differential expression at exons and introns; just change the feature parameter accordingly.

head(pData(bg_chrX_filt), 3)
        ids    sex population
1 ERR188044   male        YRI
2 ERR188104   male        YRI
3 ERR188234 female        YRI

# test on transcripts
results_transcripts <- stattest(bg_chrX_filt,
                                feature="transcript",
                                covariate="sex",
                                adjustvars = c("population"),
                                getFC=TRUE, meas="FPKM")

# results are in a data frame
class(results_transcripts)
[1] "data.frame"

dim(results_transcripts)
[1] 2227    5

head(results_transcripts)
     feature id        fc      pval      qval
1 transcript  1 0.9386481 0.7208669 0.9454480
2 transcript  2 1.2073309 0.8670656 0.9756579
3 transcript  3 1.0058534 0.9964598 0.9997816
4 transcript  4 0.3847566 0.5214029 0.9290666
5 transcript  5 0.6089373 0.3247825 0.9278154
6 transcript  6 0.6449469 0.3062408 0.9253708

table(results_transcripts$qval < 0.05)

FALSE  TRUE 
 2215    12

# test on genes
results_genes <- stattest(bg_chrX_filt,
                          feature="gene",
                          covariate="sex",
                          adjustvars = c("population"),
                          getFC=TRUE, meas="FPKM")

class(results_genes)
[1] "data.frame"

dim(results_genes)
[1] 1013    5

table(results_genes$qval<0.05)

FALSE  TRUE 
 1002    11

The results_transcripts data frame doesn't contain any identifiers; we will create a new data frame with this information.

# the order is the same so we can simply combine the information
results_transcripts <- data.frame(geneNames = geneNames(bg_chrX_filt),
                                  geneIDs = geneIDs(bg_chrX_filt),
                                  results_transcripts)

# now we have the identifiers
head(results_transcripts)
  geneNames geneIDs    feature id        fc      pval      qval
1         . MSTRG.4 transcript  1 0.9386481 0.7208669 0.9454480
2    PLCXD1 MSTRG.4 transcript  2 1.2073309 0.8670656 0.9756579
3         . MSTRG.4 transcript  3 1.0058534 0.9964598 0.9997816
4         . MSTRG.4 transcript  4 0.3847566 0.5214029 0.9290666
5         . MSTRG.5 transcript  5 0.6089373 0.3247825 0.9278154
6    PLCXD1 MSTRG.4 transcript  6 0.6449469 0.3062408 0.9253708

# which transcripts are detected as differentially expressed at qval < 0.05?
results_transcripts %>% filter(qval < 0.05)
   geneNames   geneIDs    feature   id          fc         pval         qval
1     PNPLA4  MSTRG.64 transcript  186 0.592477057 2.119474e-04 4.290970e-02
2          . MSTRG.140 transcript  421 3.141219608 6.096529e-05 1.508552e-02
3      KDM6A MSTRG.255 transcript  734 0.054166544 1.208983e-04 2.692404e-02
4      RPS4X MSTRG.511 transcript 1605 0.598737678 2.560509e-04 4.751878e-02
5       TSIX MSTRG.522 transcript 1648 0.078029979 1.743580e-06 7.765906e-04
6          . MSTRG.523 transcript 1649 0.016057740 3.872369e-10 2.874589e-07
7       XIST MSTRG.523 transcript 1650 0.002997908 1.849406e-10 2.059314e-07
8          . MSTRG.523 transcript 1651 0.030714646 1.360867e-10 2.059314e-07
9          . MSTRG.523 transcript 1652 0.028289665 6.782559e-08 3.776190e-05
10         . MSTRG.605 transcript 1843 7.378759461 1.285917e-05 4.772897e-03
11         . MSTRG.612 transcript 1847 9.154881892 4.889775e-05 1.361191e-02
12         . MSTRG.766 transcript 2333 0.272425415 1.909634e-05 6.075365e-03

Let's create a MA plot.

library(ggplot2)
library(cowplot)

results_transcripts$mean <- rowMeans(texpr(bg_chrX_filt))

ggplot(results_transcripts, aes(log2(mean), log2(fc), colour = qval<0.05)) +
  scale_color_manual(values=c("#999999", "#FF0000")) +
  geom_point() +
  geom_hline(yintercept=0)

Summary

The new tuxedo package is very fast; I realise that the tutorial only used a small subset of reads that were already determined to map to chromosome X. Despite this, the mapping and assembly took mere minutes. A recent benchmark of RNA-seq aligners did demonstrate that HISAT or HISAT2 was the fastest splice-aware mapper out of 14 algorithms. However, HISAT or HISAT2 had a low recall percentage when mapping reads with high complexity, i.e. more polymorphic sites and higher error rates, on the default settings; mapping accuracy was vastly improved after tuning the parameters.

I plan to set up a Snakemake pipeline for running the new tuxedo suite and will compare it with other pipelines, such as this STAR and Cufflinks/RSEM pipeline.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
21 comments Add yours
  1. Hi Dave- I was wondering if you could comment on an observation we made when we ran this pipeline as described here.

    We did an experiment in mouse, knockout vs WT. For alignment we used hisat2, default parameters. Followed by stringtie, and ballgown. We got a large number of significantly D.E. “transcripts”, but, when we conducted a gene level analysis, we got barely any D.E. genes. The D.E. transcripts list mostly has the same gene showing D.E. of different splice forms in each condition. Since we are dealing with the same tissue, we really don’t expect such a huge splicing effect. I wonder if many of the splice variants could be mapping artifacts, because, in some cases, I look at the aligned reads in a browser and it shows no difference between the two samples in terms of # of reads mapped.

    1. Hi Nandita,

      I recall that a former colleague had a similar problem to what you are describing, which is the discrepancy in DE between genes and transcripts. Regarding your example, I guess the obvious thing to do (which you may have already done) is to create an expression table of the gene and another of the transcripts belonging to the same gene. Perhaps in the knockout, it has switched to another splice variant, therefore there is DE on the transcript level. However, when you collapse expression onto a gene level they are expressed similarly. I’m not so sure about what you meant about mapping artifacts though. If there was a systematic artifact, it should affect both samples equally and you shouldn’t have a discrepancy only in one sample.

      Cheers,

      Dave

  2. Thanks for responding, Dave.
    “Perhaps in the knockout, it has switched to another splice variant, therefore there is DE on the transcript level.” This would be really cool if true, and the when we use the “plot transcripts” function in ballgown (Fig 5. page 1664 in the Nature Protocols 2016 paper) to look at one of these cases, it indeed implies different transcripts are expressed.

    However, when using the ucsc genome browser to view bigwig files generated from the aligned bams, we cannot see any unique splice junction being covered in one condition versus the other. So we are not sure why these reads have been assigned to different splice isoforms. Additionally, like I said, we really don’t expect so many events where isoforms are switched in the system we are examining.

    “If there was a systematic artifact, it should affect both samples equally and you shouldn’t have a discrepancy only in one sample.”

    Agreed. I am unable to explain it either. ? Short of eyeballing every such event in a genome browser, or asking the lab to validate via qPCR, I’m not able to assign confidence in the differential transcripts results, even though the fc, p-val and q-val look very good.

  3. Hi Dan,
    Very nice blog. I have one quick questions. Is there a way one can logFC in addition to FC in ballgown output?
    Thanks,
    Upendra

  4. Nice blog. I want to ask how Ballgown compares with DESeq2? And which is the best tool to plot heat maps, GO and Pathway Analysis, PCA Analysis? Thank you!

    1. I haven’t made the comparison yet. For heatmaps, pathway analysis, and PCA I like pheatmap, fgsea, and FactoMineR, respectively.

  5. Hi, very nice blog entry.
    Thanks for the comprehensive explanations. At the end, did you compare the pipline of the new tuxedo suite with others such as the STAR/Cufflinks? I couldn’t find any other entry. I wonder if you have any comment on this.

    Thank you!

  6. how to interpret fold change (fc) in ballgown results, a fake example calculation is appreciated.
    regards,

  7. Hi Dave,
    thank you for your detailed blog post. Very helpful!
    Which files did you load in the IGV to visualise the known as well as the novel transcripts?
    Cheers
    David

  8. Why it is not providing the transcript Ids on running head(texpr(bg_chrX), 2). While it is providing the gene ids by running head(gexpr(bg_chrX), 2). Can you please help.

  9. Hi Dave. Thanks for the comprehensive tutorial . I think this is my first new year gift for 2020.
    I would also like to know if there is a way to extract the estimated abundance for all samples to a spreadsheet. Instead of doing differential expression analysis I would like to use an alternative gene prioritization method and this requires that all the estimates are in a single spreadsheet. Thanks once again.

  10. Hi Dave,
    An excellent tutorial. I am following this up for my RNA-seq data analysis. However, I am getting an error message:
    ‘Error in scan(file= file, what = what, sep = sep, quote = quote, dec = dec, :
    line 686 did not have 12 elements’,
    when I run:
    bg_chrX <- ballgown(dataDir = "ballgown", samplePattern = "ERR", pData = pheno_data)

    Can you please suggest where it's going wrong? Appreciate your help.
    Thanks,
    Asad

  11. Hi Dave,
    An excellent tutorial. However in my case there are covariate column similar to this example.
    ids time population
    ids time population
    4 A2780_KO1_12hr_Cis 12hr KO
    3 A2780_KO1_24hr_Cis 24hr KO
    1 A2780_KO1_6hr_Cis 6hr KO
    6 A2780_KO2_12hr_Cis 12hr KO
    5 A2780_KO2_24hr_Cis 24hr KO
    2 A2780_KO2_6hr_Cis 6hr KO
    10 A2780_WT1_12hr_Cis 12hr WT
    9 A2780_WT1_24hr_Cis 24hr WT
    7 A2780_WT1_6hr_Cis 6hr WT
    12 A2780_WT2_12hr_Cis 12hr WT
    11 A2780_WT2_24hr_Cis 24hr WT
    8 A2780_WT2_6hr_Cis 6hr WT

    Whenever I am running stattest function I am getting a warning
    results_transcripts <- stattest(bg_chrX_filt,
    feature="transcript",
    covariate = "time",
    adjustvars = c("population"),
    getFC=TRUE, meas="FPKM")

    Warning message:
    In stattest(bg_chrX_filt, feature = "transcript", covariate = "time", :
    fold changes only available for 2-group comparisons

    This warning make sense that calculating fold changes between two group is possible. However, in your example you are able to calculate fc.

    Please guide me in this regard

  12. I recently ran the codes. I noticed that the number of transcripts varied a lot compare to the output here. Can someone assist or enlighten me on reason why I am getting different output?

    I was trying to find out the number of assembled genes, novel genes, transcripts matching annotation, and novel transcripts as described in the publication.
    When counting the transcripts and novel transcript and genes, which class codes do you count or not count?

    Thank you for your help

    Author Manuscript Author Manuscript Author Manuscript Author Manuscript
    Pertea et al. Page 34
    Table 1
    Class codes used to describe how assembled transcripts compare to reference annotation.
    Class code Description
    = Predicted transcript has exactly the same introns as the reference transcript.
    c Predicted transcript is contained within the reference transcript.
    j Predicted transcript is a potential novel isoform that shares at least one splice junction with a reference transcript.
    e Single exon predicted transcript overlaps a reference exon plus at least 10 bp of a reference intron, indicating a possible pre-
    mRNA fragment.
    i Predicted transcript falls entirely within a reference intron.
    o Exon of predicted transcript overlaps a reference transcript.
    p Predicted transcript lies within 2 Kb of a reference transcript (possible polymerase run-on fragment).
    r Predicted transcript has more than 50% of its bases overlapping a soft-masked (repetitive) reference sequence.
    u Predicted transcript is intergenic in comparison to known reference transcripts.
    x Exon of predicted transcript overlaps reference but lies on the opposite strand.
    s Intron of predicted transcript overlaps a reference intron on the opposite st

    1. This post is a bit outdated, so you probably used updated versions of the tools. For example the HISAT2 version used for this post was 2.1.0 and the latest version is 2.2.1. I will update this post at some point.

      As for checking novel transcripts, you can try to use gffcompare. I do not know of any tool that can calculate the statistics you posted. Did they mention how they tallied the class codes in the paper?

  13. Hi Dave. This statement is a bit confusing to me
    The high sensitivity means that almost all of the StringTie transcripts match the known transcripts, i.e. low false negative. The precision is much lower indicating that many of the StringTie transcripts are not in the list of known transcripts, which are either false positives or truly de novo transcripts. The novel exons, introns, and loci indicate how many of the sites were not found in the list of known transcripts

    If lots of the stringtie transcripts match the known transcripts, then how come they are novel? I am assuming the known transcripts are the ones in the gtf file

    1. Hi Vincent,

      not all the StringTie transcripts have exact matches to known transcripts because they have novel exons, i.e. exons that don’t exist in the GTF file, thus they are considered putatively novel transcripts.

      Cheers,
      Dave

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.