A popular toolset used for analysing RNA-seq data is the tuxedo suite, which consists of TopHat and Cufflinks. The suite provided a start to finish pipeline that allowed users to map reads, assemble transcripts, and perform differential expression analyses. A newer "tuxedo suite" has been developed and is made up of three tools: HISAT, StringTie, and Ballgown. A Nature Protocols article provides a summary of the new suite as well as a tutorial; this post was written while I was going through the tutorial.
I worked through the tutorial on a MacBook Pro, which means that I downloaded binaries for OS X. If you're using some flavour of Linux, download the Linux binaries instead. The data for the tutorial is available at ftp://ftp.ccb.jhu.edu/pub/RNAseq_protocol; you can perform a recursive download using wget to download all the files on the FTP server. You can use your data but you'll have to index the relevant reference file and prepare your own sample text file. For this post, I used the same data as the tutorial.
# recursive download wget -c -r ftp://ftp.ccb.jhu.edu/pub/RNAseq_protocol # move the data tarball to directory root mv ftp.ccb.jhu.edu/pub/RNAseq_protocol/chrX_data.tar.gz . # extract tar xzf chrX_data.tar.gz # check out the directory structure tree --charset=ascii chrX_data chrX_data |-- genes | `-- chrX.gtf |-- genome | `-- chrX.fa |-- geuvadis_phenodata.csv |-- indexes | |-- chrX_tran.1.ht2 | |-- chrX_tran.2.ht2 | |-- chrX_tran.3.ht2 | |-- chrX_tran.4.ht2 | |-- chrX_tran.5.ht2 | |-- chrX_tran.6.ht2 | |-- chrX_tran.7.ht2 | `-- chrX_tran.8.ht2 |-- mergelist.txt `-- samples |-- ERR188044_chrX_1.fastq.gz |-- ERR188044_chrX_2.fastq.gz |-- ERR188104_chrX_1.fastq.gz |-- ERR188104_chrX_2.fastq.gz |-- ERR188234_chrX_1.fastq.gz |-- ERR188234_chrX_2.fastq.gz |-- ERR188245_chrX_1.fastq.gz |-- ERR188245_chrX_2.fastq.gz |-- ERR188257_chrX_1.fastq.gz |-- ERR188257_chrX_2.fastq.gz |-- ERR188273_chrX_1.fastq.gz |-- ERR188273_chrX_2.fastq.gz |-- ERR188337_chrX_1.fastq.gz |-- ERR188337_chrX_2.fastq.gz |-- ERR188383_chrX_1.fastq.gz |-- ERR188383_chrX_2.fastq.gz |-- ERR188401_chrX_1.fastq.gz |-- ERR188401_chrX_2.fastq.gz |-- ERR188428_chrX_1.fastq.gz |-- ERR188428_chrX_2.fastq.gz |-- ERR188454_chrX_1.fastq.gz |-- ERR188454_chrX_2.fastq.gz |-- ERR204916_chrX_1.fastq.gz `-- ERR204916_chrX_2.fastq.gz 4 directories, 36 files
A description of the data set is provided by geuvadis_phenodata.csv. Normally, you will have to prepare this file yourself; it will be used later in the Ballgown step.
cat chrX_data/geuvadis_phenodata.csv "ids","sex","population" "ERR188044","male","YRI" "ERR188104","male","YRI" "ERR188234","female","YRI" "ERR188245","female","GBR" "ERR188257","male","GBR" "ERR188273","female","YRI" "ERR188337","female","GBR" "ERR188383","male","GBR" "ERR188401","male","GBR" "ERR188428","female","GBR" "ERR188454","male","YRI" "ERR204916","female","YRI"
Now let's download the programs; have a look at the HISAT2 page to find the appropriate binary to download. I like to download programs in a src directory and link them to a bin directory, which is in my PATH.
# for OS X cd ~/src wget -c ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/downloads/hisat2-2.1.0-OSX_x86_64.zip unzip hisat2-2.1.0-OSX_x86_64.zip # provide link to binaries in my bin directory cd ~/bin/ ln -s ~/src/hisat2-2.1.0/hisat2* . # some files were already linked ln -s ~/src/hisat2-2.1.0/*.py . ln: ./hisat2_extract_exons.py: File exists ln: ./hisat2_extract_snps_haplotypes_UCSC.py: File exists ln: ./hisat2_extract_snps_haplotypes_VCF.py: File exists ln: ./hisat2_extract_splice_sites.py: File exists ln: ./hisat2_simulate_reads.py: File exists
Again, take a look at the StringTie page to find the appropriate binary to download.
# for OS X cd ~/src wget -c http://ccb.jhu.edu/software/stringtie/dl/stringtie-1.3.3b.OSX_x86_64.tar.gz tar xzf stringtie-1.3.3b.OSX_x86_64.tar.gz # provide link to binary in my bin directory cd ~/bin/ ln -s ~/src/stringtie-1.3.3b.OSX_x86_64/stringtie
The gffcompare tool needs to be compiled.
cd ~/src/ git clone https://github.com/gpertea/gclib git clone https://github.com/gpertea/gffcompare cd gffcompare make release # link again cd ~/bin/ ln -s ~/src/gffcompare/gffcompare
Download SAMtools from http://www.htslib.org/download/ and compile.
# unzip and compile tar xjf samtools-1.6.tar.bz2 cd samtools-1.6 ./configure make # link samtools cd ~/bin ln -s ~/src/samtools-1.6/samtools
Ballgown is a Bioconductor package, so we need to install that using R. While we are at it, we will install various dependencies too.
install.packages("devtools") install.packages("dplyr") source("https://www.bioconductor.org/biocLite.R") biocLite(c("alyssafrazee/RSkittleBrewer", "ballgown", "genefilter"))
Now that we have downloaded and prepared all the required programs, we can start the analysis!
Mapping
Mapping is performed using HISAT2 and usually the first step, prior to mapping, is to create an index of the reference genome. The indices are provided in the data folder but let's create them again.
mkdir my_index cd my_index # use the Python scripts to extract splice-site and exon information from a gene annotation file extract_splice_sites.py ../chrX_data/genes/chrX.gtf > chrX.ss extract_exons.py ../chrX_data/genes/chrX.gtf > chrX.exon head -3 chrX.ss chrX 276393 281481 + chrX 281683 284166 + chrX 284313 288732 + head -3 chrX.exon chrX 276323 276393 + chrX 281393 281683 + chrX 284166 284313 + # now to build the index # the --ss and --exon options can be omitted if annotation data is not available time hisat2-build -p 8 --ss chrX.ss --exon chrX.exon ../chrX_data/genome/chrX.fa chrX_tran # screen output not shown to save space Total time for call to driver() for forward index: 00:03:34 real 3m33.870s user 10m10.778s sys 1m9.074s ls -1 chrX.exon chrX.fa chrX.ss chrX_tran.1.ht2 chrX_tran.2.ht2 chrX_tran.3.ht2 chrX_tran.4.ht2 chrX_tran.5.ht2 chrX_tran.6.ht2 chrX_tran.7.ht2 chrX_tran.8.ht2
Despite creating our own indices, we'll use the ones provided by the tutorial for reproducibility's sake. From geuvadis_phenodata.csv we saw that there are 12 samples; each sample has two FASTQ files since this is paired-end data. Let's start the mapping.
# create directory to store mapping results mkdir map # map each sample using 8 threads hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188044_chrX_1.fastq.gz -2 chrX_data/samples/ERR188044_chrX_2.fastq.gz -S map/ERR188044_chrX.sam hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188104_chrX_1.fastq.gz -2 chrX_data/samples/ERR188104_chrX_2.fastq.gz -S map/ERR188104_chrX.sam hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188234_chrX_1.fastq.gz -2 chrX_data/samples/ERR188234_chrX_2.fastq.gz -S map/ERR188234_chrX.sam hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188245_chrX_1.fastq.gz -2 chrX_data/samples/ERR188245_chrX_2.fastq.gz -S map/ERR188245_chrX.sam hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188257_chrX_1.fastq.gz -2 chrX_data/samples/ERR188257_chrX_2.fastq.gz -S map/ERR188257_chrX.sam hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188273_chrX_1.fastq.gz -2 chrX_data/samples/ERR188273_chrX_2.fastq.gz -S map/ERR188273_chrX.sam hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188337_chrX_1.fastq.gz -2 chrX_data/samples/ERR188337_chrX_2.fastq.gz -S map/ERR188337_chrX.sam hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188383_chrX_1.fastq.gz -2 chrX_data/samples/ERR188383_chrX_2.fastq.gz -S map/ERR188383_chrX.sam hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188401_chrX_1.fastq.gz -2 chrX_data/samples/ERR188401_chrX_2.fastq.gz -S map/ERR188401_chrX.sam hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188428_chrX_1.fastq.gz -2 chrX_data/samples/ERR188428_chrX_2.fastq.gz -S map/ERR188428_chrX.sam hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188454_chrX_1.fastq.gz -2 chrX_data/samples/ERR188454_chrX_2.fastq.gz -S map/ERR188454_chrX.sam hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR204916_chrX_1.fastq.gz -2 chrX_data/samples/ERR204916_chrX_2.fastq.gz -S map/ERR204916_chrX.sam # mapping took around two and a half minutes # real 2m36.509s # user 15m17.815s # sys 3m29.939s
You should always only store sorted BAM (or CRAM) files and delete the SAM files after conversion.
# sort mapping results using SAMtools on 8 threads samtools sort -@ 8 -o map/ERR188044_chrX.bam map/ERR188044_chrX.sam samtools sort -@ 8 -o map/ERR188104_chrX.bam map/ERR188104_chrX.sam samtools sort -@ 8 -o map/ERR188234_chrX.bam map/ERR188234_chrX.sam samtools sort -@ 8 -o map/ERR188245_chrX.bam map/ERR188245_chrX.sam samtools sort -@ 8 -o map/ERR188257_chrX.bam map/ERR188257_chrX.sam samtools sort -@ 8 -o map/ERR188273_chrX.bam map/ERR188273_chrX.sam samtools sort -@ 8 -o map/ERR188337_chrX.bam map/ERR188337_chrX.sam samtools sort -@ 8 -o map/ERR188383_chrX.bam map/ERR188383_chrX.sam samtools sort -@ 8 -o map/ERR188401_chrX.bam map/ERR188401_chrX.sam samtools sort -@ 8 -o map/ERR188428_chrX.bam map/ERR188428_chrX.sam samtools sort -@ 8 -o map/ERR188454_chrX.bam map/ERR188454_chrX.sam samtools sort -@ 8 -o map/ERR204916_chrX.bam map/ERR204916_chrX.sam # remove SAM files rm map/*.sam # sorting and converting took just over a minute real 1m14.533s user 5m44.637s sys 0m9.590s
Assembly
Now we need to assemble the mapped reads into transcripts. StringTie can assemble transcripts with or without annotation; as noted in the protocol, annotation can be helpful when the number of reads for a transcript is too low for an accurate assembly.
# store assembly results in a new directory mkdir assembly # create assembly per sample using 8 threads stringtie map/ERR188044_chrX.bam -l ERR188044 -p 8 -G chrX_data/genes/chrX.gtf -o assembly/ERR188044_chrX.gtf stringtie map/ERR188104_chrX.bam -l ERR188104 -p 8 -G chrX_data/genes/chrX.gtf -o assembly/ERR188104_chrX.gtf stringtie map/ERR188234_chrX.bam -l ERR188234 -p 8 -G chrX_data/genes/chrX.gtf -o assembly/ERR188234_chrX.gtf stringtie map/ERR188245_chrX.bam -l ERR188245 -p 8 -G chrX_data/genes/chrX.gtf -o assembly/ERR188245_chrX.gtf stringtie map/ERR188257_chrX.bam -l ERR188257 -p 8 -G chrX_data/genes/chrX.gtf -o assembly/ERR188257_chrX.gtf stringtie map/ERR188273_chrX.bam -l ERR188273 -p 8 -G chrX_data/genes/chrX.gtf -o assembly/ERR188273_chrX.gtf stringtie map/ERR188337_chrX.bam -l ERR188337 -p 8 -G chrX_data/genes/chrX.gtf -o assembly/ERR188337_chrX.gtf stringtie map/ERR188383_chrX.bam -l ERR188383 -p 8 -G chrX_data/genes/chrX.gtf -o assembly/ERR188383_chrX.gtf stringtie map/ERR188401_chrX.bam -l ERR188401 -p 8 -G chrX_data/genes/chrX.gtf -o assembly/ERR188401_chrX.gtf stringtie map/ERR188428_chrX.bam -l ERR188428 -p 8 -G chrX_data/genes/chrX.gtf -o assembly/ERR188428_chrX.gtf stringtie map/ERR188454_chrX.bam -l ERR188454 -p 8 -G chrX_data/genes/chrX.gtf -o assembly/ERR188454_chrX.gtf stringtie map/ERR204916_chrX.bam -l ERR204916 -p 8 -G chrX_data/genes/chrX.gtf -o assembly/ERR204916_chrX.gtf # assembly and quantification took a minute and a half # real 1m30.893s # user 1m58.455s # sys 0m9.860s # before merging we need to modify mergelist.txt # this is because I created a new directory to store the results # the modified mergelist.txt should look like this cat chrX_data/mergelist.txt assembly/ERR188044_chrX.gtf assembly/ERR188104_chrX.gtf assembly/ERR188234_chrX.gtf assembly/ERR188245_chrX.gtf assembly/ERR188257_chrX.gtf assembly/ERR188273_chrX.gtf assembly/ERR188337_chrX.gtf assembly/ERR188383_chrX.gtf assembly/ERR188401_chrX.gtf assembly/ERR188428_chrX.gtf assembly/ERR188454_chrX.gtf assembly/ERR204916_chrX.gtf # merge all transcripts from the different samples stringtie --merge -p 8 -G chrX_data/genes/chrX.gtf -o stringtie_merged.gtf chrX_data/mergelist.txt # check out the transcripts cat stringtie_merged.gtf | head # stringtie --merge -p 8 -G chrX_data/genes/chrX.gtf -o stringtie_merged.gtf chrX_data/mergelist.txt # StringTie version 1.3.3b chrX StringTie transcript 322514 323718 1000 . . gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; chrX StringTie exon 322514 323718 1000 . . gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; exon_number "1"; chrX StringTie transcript 319145 321319 1000 + . gene_id "MSTRG.2"; transcript_id "NR_027232"; gene_name "LINC00685"; ref_gene_id "NR_027232"; chrX StringTie exon 319145 321319 1000 + . gene_id "MSTRG.2"; transcript_id "NR_027232"; exon_number "1"; gene_name "LINC00685"; ref_gene_id "NR_027232"; chrX StringTie transcript 319145 321319 1000 + . gene_id "MSTRG.2"; transcript_id "NR_027231"; gene_name "LINC00685"; ref_gene_id "NR_027231"; chrX StringTie exon 319145 319551 1000 + . gene_id "MSTRG.2"; transcript_id "NR_027231"; exon_number "1"; gene_name "LINC00685"; ref_gene_id "NR_027231"; chrX StringTie exon 320208 321319 1000 + . gene_id "MSTRG.2"; transcript_id "NR_027231"; exon_number "2"; gene_name "LINC00685"; ref_gene_id "NR_027231"; chrX StringTie transcript 304750 318701 1000 - . gene_id "MSTRG.3"; transcript_id "MSTRG.3.1"; # how many transcripts? cat stringtie_merged.gtf | grep -v "^#" | awk '$3=="transcript" {print}' | wc -l 3491
Let's compare the StringTie transcripts to known transcripts using gffcompare.
# compare the assembled transcripts to known transcripts gffcompare -r chrX_data/genes/chrX.gtf -G -o merged stringtie_merged.gtf cat merged.stats # gffcompare v0.10.1 | Command line was: #gffcompare -r chrX_data/genes/chrX.gtf -G -o merged stringtie_merged.gtf # #= Summary for dataset: stringtie_merged.gtf # Query mRNAs : 3281 in 1521 loci (2651 multi-exon transcripts) # (535 multi-transcript loci, ~2.2 transcripts per locus) # Reference mRNAs : 2102 in 1086 loci (1856 multi-exon) # Super-loci w/ reference transcripts: 998 #-----------------| Sensitivity | Precision | Base level: 100.0 | 77.6 | Exon level: 100.0 | 85.4 | Intron level: 99.8 | 91.0 | Intron chain level: 99.6 | 69.7 | Transcript level: 99.6 | 63.8 | Locus level: 100.0 | 70.9 | Matching intron chains: 1848 Matching transcripts: 2094 Matching loci: 1086 Missed exons: 0/8804 ( 0.0%) Novel exons: 971/10608 ( 9.2%) Missed introns: 14/7946 ( 0.2%) Novel introns: 219/8714 ( 2.5%) Missed loci: 0/1086 ( 0.0%) Novel loci: 421/1521 ( 27.7%) Total union super-loci across all input datasets: 1521 3281 out of 3281 consensus transcripts written in merged.annotated.gtf (0 discarded as redundant)
The high sensitivity means that almost all of the StringTie transcripts match the known transcripts, i.e. low false negative. The precision is much lower indicating that many of the StringTie transcripts are not in the list of known transcripts, which are either false positives or truly de novo transcripts. The novel exons, introns, and loci indicate how many of the sites were not found in the list of known transcripts.
All known transcripts were assembled by StringTie, including a few novel ones.
Now that we have our assembled transcripts, we can estimate their abundances.
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188044/ERR188044_chrX.gtf map/ERR188044_chrX.bam stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188104/ERR188104_chrX.gtf map/ERR188104_chrX.bam stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188234/ERR188234_chrX.gtf map/ERR188234_chrX.bam stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188245/ERR188245_chrX.gtf map/ERR188245_chrX.bam stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188257/ERR188257_chrX.gtf map/ERR188257_chrX.bam stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188273/ERR188273_chrX.gtf map/ERR188273_chrX.bam stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188337/ERR188337_chrX.gtf map/ERR188337_chrX.bam stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188383/ERR188383_chrX.gtf map/ERR188383_chrX.bam stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188401/ERR188401_chrX.gtf map/ERR188401_chrX.bam stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188428/ERR188428_chrX.gtf map/ERR188428_chrX.bam stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188454/ERR188454_chrX.gtf map/ERR188454_chrX.bam stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR204916/ERR204916_chrX.gtf map/ERR204916_chrX.bam # estimation took just over a minute and a half # real 1m39.661s # user 2m0.179s # sys 0m9.223s # check out the files ls -1 ballgown/ERR188044 ERR188044_chrX.gtf e2t.ctab e_data.ctab i2t.ctab i_data.ctab t_data.ctab
Differential expression
To perform the expression analyses, we need to use R and Ballgown; I recommend using RStudio. To get started load the required libraries and the data.
library(ballgown) library(RSkittleBrewer) library(genefilter) library(dplyr) library(devtools) # change this to the directory that contains all the StringTie results setwd("~/muse/tuxedo") # load the sample information pheno_data <- read.csv("chrX_data/geuvadis_phenodata.csv") # create a ballgown object bg_chrX <- ballgown(dataDir = "ballgown", samplePattern = "ERR", pData = pheno_data) class(bg_chrX) [1] "ballgown" attr(,"package") [1] "ballgown" bg_chrX ballgown instance with 3491 transcripts and 12 samples
What methods are available for ballgown objects?
methods(class="ballgown") [1] dirs eexpr expr expr<- geneIDs geneNames gexpr [8] iexpr indexes indexes<- mergedDate pData pData<- sampleNames [15] seqnames show structure subset texpr transcriptIDs transcriptNames see '?methods' for accessing help and source code # we can get the gene, transcript, exon, and intron expression levels using # gexpr(), texpr(), eexpr(), and iexpr() head(gexpr(bg_chrX), 2) FPKM.ERR188044 FPKM.ERR188104 FPKM.ERR188234 FPKM.ERR188245 FPKM.ERR188257 FPKM.ERR188273 FPKM.ERR188337 MSTRG.1 7.169349 10.42652 13.83639 1.050201 5.677819 10.756237 7.054854 MSTRG.10 21.428192 13.13144 14.11443 18.454338 10.182308 3.887958 23.385319 FPKM.ERR188383 FPKM.ERR188401 FPKM.ERR188428 FPKM.ERR188454 FPKM.ERR204916 MSTRG.1 4.732841 11.424809 5.733899 6.688090 5.061143 MSTRG.10 11.815677 8.196958 9.578302 9.961549 10.997639 head(texpr(bg_chrX), 2) FPKM.ERR188044 FPKM.ERR188104 FPKM.ERR188234 FPKM.ERR188245 FPKM.ERR188257 FPKM.ERR188273 FPKM.ERR188337 FPKM.ERR188383 1 23.9694 18.49576 39.70492 14.06822 25.51846 23.84778 30.89737 20.71429 2 0.0000 0.00000 27.79636 13.96464 44.97094 0.00000 0.00000 0.00000 FPKM.ERR188401 FPKM.ERR188428 FPKM.ERR188454 FPKM.ERR204916 1 28.03131 24.97612 28.2617 20.24706 2 25.81932 0.00000 0.0000 0.00000
Next we filter out transcripts with low variance.
# note that this subset function is not the base R function but a ballgown one # to see the order in which R looks for functions in packages use search() # search() # [1] ".GlobalEnv" "package:bindrcpp" "package:devtools" "package:dplyr" # [5] "package:genefilter" "package:RSkittleBrewer" "package:ballgown" "tools:rstudio" # [9] "package:stats" "package:graphics" "package:grDevices" "package:utils" # [13] "package:datasets" "package:methods" "Autoloads" "package:base" # # the rowVars is from the genefilter package and calculates the row variance bg_chrX_filt <- subset(bg_chrX, "rowVars(texpr(bg_chrX)) >1", genomesubset=TRUE) # 1,264 transcripts were filtered out bg_chrX_filt ballgown instance with 2227 transcripts and 12 samples
Perform the differential expression analysis stattest() function; confounders are specified using the adjustvars parameter, which has to match the column name in pheno_data. We are testing for transcripts and genes that are differentially expressed between male and females, hence sex is our covariate of interest. In addition to testing transcripts and genes, we can also test differential expression at exons and introns; just change the feature parameter accordingly.
head(pData(bg_chrX_filt), 3) ids sex population 1 ERR188044 male YRI 2 ERR188104 male YRI 3 ERR188234 female YRI # test on transcripts results_transcripts <- stattest(bg_chrX_filt, feature="transcript", covariate="sex", adjustvars = c("population"), getFC=TRUE, meas="FPKM") # results are in a data frame class(results_transcripts) [1] "data.frame" dim(results_transcripts) [1] 2227 5 head(results_transcripts) feature id fc pval qval 1 transcript 1 0.9386481 0.7208669 0.9454480 2 transcript 2 1.2073309 0.8670656 0.9756579 3 transcript 3 1.0058534 0.9964598 0.9997816 4 transcript 4 0.3847566 0.5214029 0.9290666 5 transcript 5 0.6089373 0.3247825 0.9278154 6 transcript 6 0.6449469 0.3062408 0.9253708 table(results_transcripts$qval < 0.05) FALSE TRUE 2215 12 # test on genes results_genes <- stattest(bg_chrX_filt, feature="gene", covariate="sex", adjustvars = c("population"), getFC=TRUE, meas="FPKM") class(results_genes) [1] "data.frame" dim(results_genes) [1] 1013 5 table(results_genes$qval<0.05) FALSE TRUE 1002 11
The results_transcripts data frame doesn't contain any identifiers; we will create a new data frame with this information.
# the order is the same so we can simply combine the information results_transcripts <- data.frame(geneNames = geneNames(bg_chrX_filt), geneIDs = geneIDs(bg_chrX_filt), results_transcripts) # now we have the identifiers head(results_transcripts) geneNames geneIDs feature id fc pval qval 1 . MSTRG.4 transcript 1 0.9386481 0.7208669 0.9454480 2 PLCXD1 MSTRG.4 transcript 2 1.2073309 0.8670656 0.9756579 3 . MSTRG.4 transcript 3 1.0058534 0.9964598 0.9997816 4 . MSTRG.4 transcript 4 0.3847566 0.5214029 0.9290666 5 . MSTRG.5 transcript 5 0.6089373 0.3247825 0.9278154 6 PLCXD1 MSTRG.4 transcript 6 0.6449469 0.3062408 0.9253708 # which transcripts are detected as differentially expressed at qval < 0.05? results_transcripts %>% filter(qval < 0.05) geneNames geneIDs feature id fc pval qval 1 PNPLA4 MSTRG.64 transcript 186 0.592477057 2.119474e-04 4.290970e-02 2 . MSTRG.140 transcript 421 3.141219608 6.096529e-05 1.508552e-02 3 KDM6A MSTRG.255 transcript 734 0.054166544 1.208983e-04 2.692404e-02 4 RPS4X MSTRG.511 transcript 1605 0.598737678 2.560509e-04 4.751878e-02 5 TSIX MSTRG.522 transcript 1648 0.078029979 1.743580e-06 7.765906e-04 6 . MSTRG.523 transcript 1649 0.016057740 3.872369e-10 2.874589e-07 7 XIST MSTRG.523 transcript 1650 0.002997908 1.849406e-10 2.059314e-07 8 . MSTRG.523 transcript 1651 0.030714646 1.360867e-10 2.059314e-07 9 . MSTRG.523 transcript 1652 0.028289665 6.782559e-08 3.776190e-05 10 . MSTRG.605 transcript 1843 7.378759461 1.285917e-05 4.772897e-03 11 . MSTRG.612 transcript 1847 9.154881892 4.889775e-05 1.361191e-02 12 . MSTRG.766 transcript 2333 0.272425415 1.909634e-05 6.075365e-03
Let's create a MA plot.
library(ggplot2) library(cowplot) results_transcripts$mean <- rowMeans(texpr(bg_chrX_filt)) ggplot(results_transcripts, aes(log2(mean), log2(fc), colour = qval<0.05)) + scale_color_manual(values=c("#999999", "#FF0000")) + geom_point() + geom_hline(yintercept=0)
Summary
The new tuxedo package is very fast; I realise that the tutorial only used a small subset of reads that were already determined to map to chromosome X. Despite this, the mapping and assembly took mere minutes. A recent benchmark of RNA-seq aligners did demonstrate that HISAT or HISAT2 was the fastest splice-aware mapper out of 14 algorithms. However, HISAT or HISAT2 had a low recall percentage when mapping reads with high complexity, i.e. more polymorphic sites and higher error rates, on the default settings; mapping accuracy was vastly improved after tuning the parameters.
I plan to set up a Snakemake pipeline for running the new tuxedo suite and will compare it with other pipelines, such as this STAR and Cufflinks/RSEM pipeline.
This work is licensed under a Creative Commons
Attribution 4.0 International License.
Hi Dave- I was wondering if you could comment on an observation we made when we ran this pipeline as described here.
We did an experiment in mouse, knockout vs WT. For alignment we used hisat2, default parameters. Followed by stringtie, and ballgown. We got a large number of significantly D.E. “transcripts”, but, when we conducted a gene level analysis, we got barely any D.E. genes. The D.E. transcripts list mostly has the same gene showing D.E. of different splice forms in each condition. Since we are dealing with the same tissue, we really don’t expect such a huge splicing effect. I wonder if many of the splice variants could be mapping artifacts, because, in some cases, I look at the aligned reads in a browser and it shows no difference between the two samples in terms of # of reads mapped.
Hi Nandita,
I recall that a former colleague had a similar problem to what you are describing, which is the discrepancy in DE between genes and transcripts. Regarding your example, I guess the obvious thing to do (which you may have already done) is to create an expression table of the gene and another of the transcripts belonging to the same gene. Perhaps in the knockout, it has switched to another splice variant, therefore there is DE on the transcript level. However, when you collapse expression onto a gene level they are expressed similarly. I’m not so sure about what you meant about mapping artifacts though. If there was a systematic artifact, it should affect both samples equally and you shouldn’t have a discrepancy only in one sample.
Cheers,
Dave
Thanks for responding, Dave.
“Perhaps in the knockout, it has switched to another splice variant, therefore there is DE on the transcript level.” This would be really cool if true, and the when we use the “plot transcripts” function in ballgown (Fig 5. page 1664 in the Nature Protocols 2016 paper) to look at one of these cases, it indeed implies different transcripts are expressed.
However, when using the ucsc genome browser to view bigwig files generated from the aligned bams, we cannot see any unique splice junction being covered in one condition versus the other. So we are not sure why these reads have been assigned to different splice isoforms. Additionally, like I said, we really don’t expect so many events where isoforms are switched in the system we are examining.
“If there was a systematic artifact, it should affect both samples equally and you shouldn’t have a discrepancy only in one sample.”
Agreed. I am unable to explain it either. ? Short of eyeballing every such event in a genome browser, or asking the lab to validate via qPCR, I’m not able to assign confidence in the differential transcripts results, even though the fc, p-val and q-val look very good.
Hi Dan,
Very nice blog. I have one quick questions. Is there a way one can logFC in addition to FC in ballgown output?
Thanks,
Upendra
Nice blog. I want to ask how Ballgown compares with DESeq2? And which is the best tool to plot heat maps, GO and Pathway Analysis, PCA Analysis? Thank you!
I haven’t made the comparison yet. For heatmaps, pathway analysis, and PCA I like pheatmap, fgsea, and FactoMineR, respectively.
Hi, very nice blog entry.
Thanks for the comprehensive explanations. At the end, did you compare the pipline of the new tuxedo suite with others such as the STAR/Cufflinks? I couldn’t find any other entry. I wonder if you have any comment on this.
Thank you!
I haven’t done the comparison yet. It’s on my TODO list.
how to interpret fold change (fc) in ballgown results, a fake example calculation is appreciated.
regards,
Thank you for your post. I would like to know if you have the possibility to get, and send to my email, the paper which you have mentioned at the end of your post: https://link.springer.com/protocol/10.1007%2F978-1-4939-4035-6_14
Thank you once again.
Best Regards, José
Hi Dave,
thank you for your detailed blog post. Very helpful!
Which files did you load in the IGV to visualise the known as well as the novel transcripts?
Cheers
David
Why it is not providing the transcript Ids on running head(texpr(bg_chrX), 2). While it is providing the gene ids by running head(gexpr(bg_chrX), 2). Can you please help.
Hi Dave. Thanks for the comprehensive tutorial . I think this is my first new year gift for 2020.
I would also like to know if there is a way to extract the estimated abundance for all samples to a spreadsheet. Instead of doing differential expression analysis I would like to use an alternative gene prioritization method and this requires that all the estimates are in a single spreadsheet. Thanks once again.
Hi Dave,
An excellent tutorial. I am following this up for my RNA-seq data analysis. However, I am getting an error message:
‘Error in scan(file= file, what = what, sep = sep, quote = quote, dec = dec, :
line 686 did not have 12 elements’,
when I run:
bg_chrX <- ballgown(dataDir = "ballgown", samplePattern = "ERR", pData = pheno_data)
Can you please suggest where it's going wrong? Appreciate your help.
Thanks,
Asad
Hi Dave,
An excellent tutorial. However in my case there are covariate column similar to this example.
ids time population
ids time population
4 A2780_KO1_12hr_Cis 12hr KO
3 A2780_KO1_24hr_Cis 24hr KO
1 A2780_KO1_6hr_Cis 6hr KO
6 A2780_KO2_12hr_Cis 12hr KO
5 A2780_KO2_24hr_Cis 24hr KO
2 A2780_KO2_6hr_Cis 6hr KO
10 A2780_WT1_12hr_Cis 12hr WT
9 A2780_WT1_24hr_Cis 24hr WT
7 A2780_WT1_6hr_Cis 6hr WT
12 A2780_WT2_12hr_Cis 12hr WT
11 A2780_WT2_24hr_Cis 24hr WT
8 A2780_WT2_6hr_Cis 6hr WT
Whenever I am running stattest function I am getting a warning
results_transcripts <- stattest(bg_chrX_filt,
feature="transcript",
covariate = "time",
adjustvars = c("population"),
getFC=TRUE, meas="FPKM")
Warning message:
In stattest(bg_chrX_filt, feature = "transcript", covariate = "time", :
fold changes only available for 2-group comparisons
This warning make sense that calculating fold changes between two group is possible. However, in your example you are able to calculate fc.
Please guide me in this regard
I recently ran the codes. I noticed that the number of transcripts varied a lot compare to the output here. Can someone assist or enlighten me on reason why I am getting different output?
I was trying to find out the number of assembled genes, novel genes, transcripts matching annotation, and novel transcripts as described in the publication.
When counting the transcripts and novel transcript and genes, which class codes do you count or not count?
Thank you for your help
Author Manuscript Author Manuscript Author Manuscript Author Manuscript
Pertea et al. Page 34
Table 1
Class codes used to describe how assembled transcripts compare to reference annotation.
Class code Description
= Predicted transcript has exactly the same introns as the reference transcript.
c Predicted transcript is contained within the reference transcript.
j Predicted transcript is a potential novel isoform that shares at least one splice junction with a reference transcript.
e Single exon predicted transcript overlaps a reference exon plus at least 10 bp of a reference intron, indicating a possible pre-
mRNA fragment.
i Predicted transcript falls entirely within a reference intron.
o Exon of predicted transcript overlaps a reference transcript.
p Predicted transcript lies within 2 Kb of a reference transcript (possible polymerase run-on fragment).
r Predicted transcript has more than 50% of its bases overlapping a soft-masked (repetitive) reference sequence.
u Predicted transcript is intergenic in comparison to known reference transcripts.
x Exon of predicted transcript overlaps reference but lies on the opposite strand.
s Intron of predicted transcript overlaps a reference intron on the opposite st
This post is a bit outdated, so you probably used updated versions of the tools. For example the HISAT2 version used for this post was 2.1.0 and the latest version is 2.2.1. I will update this post at some point.
As for checking novel transcripts, you can try to use gffcompare. I do not know of any tool that can calculate the statistics you posted. Did they mention how they tallied the class codes in the paper?
Hi Dave. This statement is a bit confusing to me
The high sensitivity means that almost all of the StringTie transcripts match the known transcripts, i.e. low false negative. The precision is much lower indicating that many of the StringTie transcripts are not in the list of known transcripts, which are either false positives or truly de novo transcripts. The novel exons, introns, and loci indicate how many of the sites were not found in the list of known transcripts
If lots of the stringtie transcripts match the known transcripts, then how come they are novel? I am assuming the known transcripts are the ones in the gtf file
Hi Vincent,
not all the StringTie transcripts have exact matches to known transcripts because they have novel exons, i.e. exons that don’t exist in the GTF file, thus they are considered putatively novel transcripts.
Cheers,
Dave
Hi Dave
the link to get data seems to be not working…..
Hi Naina,
I found a copy on my old computer and uploaded it to my server. You can download it at:
https://davetang.org/file/chrX_data.tar.gz
The MD5 checksum is b4763ce3de87b381a8a782a2d4f414ef
Cheers,
Dave