From Dave's wiki

One of the first RNA-seq papers[1].

Best practices

Experimental protocols

This protocol is optimised for 0.1-1 microgram of total RNA; lower amounts can result in inefficient ligation and low yield. Use of lower quality RNA, including FFPE samples, may require further optimisation to determine the input amount.

Quality control


Mitochondrial gene expression

High expression of mitochondrial genes in a sample can indicate:

  1. Poor sample quality, leading to a high fraction of apoptotic or lysing cells.
  2. Biology of the particular sample, for example tumor biopsies, which may have increased mitochondrial gene expression due to metabolic activity and/or necrosis.

In mammals, apoptotic cells express mitochondrial genes and export these transcripts to the cytoplasm. For example, when apoptotic cells are spiked into an otherwise healthy cell suspension, an increased number of mitochondrial genes is detected.
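As a rough illustration of this check, the sketch below computes the percentage of a sample's counts that come from mitochondrial genes. It assumes gene symbols in which mitochondrial genes carry an "MT-" prefix (the human convention; the comparison is case-insensitive so that "mt-" also matches). The function name and the toy counts are hypothetical.

```python
def percent_mitochondrial(counts, prefix="MT-"):
    """Return the percentage of total counts from genes whose name starts with prefix."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Sum counts of genes whose symbol starts with the mitochondrial prefix
    mito = sum(c for gene, c in counts.items() if gene.upper().startswith(prefix))
    return 100.0 * mito / total


# Toy counts for one sample (made-up numbers)
sample = {"MT-CO1": 300, "MT-ND1": 200, "ACTB": 1000, "GAPDH": 500}
print(f"{percent_mitochondrial(sample):.1f}% mitochondrial")
```

What counts as a "high" percentage depends on the tissue and protocol, so any cut-off applied to this value is a judgment call rather than a fixed rule.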

Typical pipeline

The analysis pipeline can be conceptually divided into four main tasks[3]:

  1. alignment of the reads to the genome;
  2. assembly of the alignments into full-length transcripts;
  3. quantification of the expression levels of each gene and transcript; and
  4. calculation of the differences in expression for all genes among the different experimental conditions.

The HISAT, StringTie, and Ballgown pipeline[3]

  • HISAT aligns RNA-seq reads to a genome and discovers transcript splice sites, while running far faster than TopHat2 and requiring much less computer memory than other methods.
  • StringTie assembles the alignments into full and partial transcripts, creating multiple isoforms as necessary and estimating the expression levels of all genes and transcripts.
  • Ballgown takes the transcripts and expression levels from StringTie and applies rigorous statistical methods to determine which transcripts are differentially expressed between two or more experiments.
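The first two steps can be sketched as per-sample command lines. The snippet below only builds and prints the commands; it does not run them. The sample name, index name, annotation file and output layout are hypothetical, and the options shown (`hisat2 --dta`, `samtools sort -o`, `stringtie -G -e -B`) should be checked against the help output of the installed versions. It is also a simplification: the published protocol[3] includes a transcript assembly and merge step before the final quantification run, and Ballgown itself is an R package that is not shown here.

```python
import shlex


def pipeline_commands(sample, index, gtf, threads=4):
    """Build (without running) per-sample alignment, sorting and quantification commands."""
    sam, bam = f"{sample}.sam", f"{sample}.bam"
    return [
        # 1. Spliced alignment; --dta tailors the output for transcript assemblers
        ["hisat2", "-p", str(threads), "--dta", "-x", index,
         "-1", f"{sample}_1.fastq.gz", "-2", f"{sample}_2.fastq.gz", "-S", sam],
        # 2. Sort the alignments by coordinate for StringTie
        ["samtools", "sort", "-@", str(threads), "-o", bam, sam],
        # 3. Quantify against the annotation; -e limits estimation to annotated
        #    transcripts and -B writes the tables that Ballgown reads
        ["stringtie", bam, "-p", str(threads), "-G", gtf, "-e", "-B",
         "-o", f"ballgown/{sample}/{sample}.gtf"],
    ]


# Hypothetical sample, index and annotation names
for cmd in pipeline_commands("sample1", "grch38_index", "genes.gtf"):
    print(shlex.join(cmd))
```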


For an aligner to be viable for RNA-seq it must[4]:

  1. align reads across splice junctions;
  2. handle paired-end reads;
  3. handle strand-specific data; and
  4. run efficiently.

Differential expression

Differential expression analysis involves the following steps:

  1. data visualisation and inspection
  2. statistical tests for differential expression
  3. multiple test correction
  4. downstream inspection and summarisation of results.
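To make the multiple test correction step concrete, here is a minimal sketch of the Benjamini-Hochberg procedure, which controls the false discovery rate. The text above does not name a specific correction method, so this choice is an assumption, and the p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.20]  # made-up p-values for four genes
print(benjamini_hochberg(pvals))
```

A gene is then called differentially expressed when its adjusted value falls below the chosen false discovery rate threshold.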


Raw counts need to be normalised prior to comparison among different samples because these counts are affected by factors such as number of sequenced reads or transcript length.

  • RPKM/FPKM is a within-sample normalisation method that will remove library-size and transcript-length effects

Correcting for transcript-length effects is not necessary when comparing changes in gene expression across samples, but it is necessary when comparing genes within the same sample.
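As a small worked example of within-sample normalisation, the sketch below computes FPKM as the fragment count divided by the transcript length in kilobases and by the library size in millions. The gene names, counts and lengths are made up, and the library size is taken to be the sum of the counts shown.

```python
def fpkm(counts, lengths):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total = sum(counts.values())  # library size: total mapped fragments
    # count / (length / 1e3) / (total / 1e6) == count * 1e9 / (length * total)
    return {g: counts[g] * 1e9 / (lengths[g] * total) for g in counts}


counts = {"geneA": 100, "geneB": 100, "geneC": 800}       # made-up fragment counts
lengths = {"geneA": 1000, "geneB": 2000, "geneC": 4000}   # transcript lengths in bp
print(fpkm(counts, lengths))
```

Here geneA and geneB have the same raw count, but geneB is twice as long, so its FPKM is half that of geneA.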

For TPM, the denominator is calculated by summing the length-normalised fragment counts over all transcripts in the sample. Because this denominator differs between experiments, TPM is sample dependent, which is why you cannot directly compare TPM between samples. TPM is probably the most stable unit across experiments, but you still shouldn’t compare it across experiments.
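The sample dependence of TPM can be seen in a short sketch. It assumes the usual definition, in which each count is first divided by its transcript length and the resulting rates are rescaled to sum to one million. The two toy samples are made up and differ only in the count of geneC.

```python
def tpm(counts, lengths):
    """Transcripts per million from fragment counts and transcript lengths (bp)."""
    # Length-normalised rate for each transcript
    rates = {g: counts[g] / lengths[g] for g in counts}
    denom = sum(rates.values())  # sample-specific denominator
    return {g: rate / denom * 1e6 for g, rate in rates.items()}


lengths = {"geneA": 1000, "geneB": 2000, "geneC": 4000}   # transcript lengths in bp
sample1 = {"geneA": 100, "geneB": 100, "geneC": 800}      # made-up counts
sample2 = {"geneA": 100, "geneB": 100, "geneC": 1600}     # only geneC differs

print(round(tpm(sample1, lengths)["geneA"]))
print(round(tpm(sample2, lengths)["geneA"]))
```

The count for geneA is identical in both samples, yet its TPM is lower in the second sample because the denominator grew with geneC.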

Between-sample normalisation (BSN), which is required when comparing changes across experiments, addresses two issues:

  1. Variable sequencing depth
  2. Finding a "control set" of expression features, which should have similar expression patterns across experiments to serve as a baseline
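One way to address both issues is the median-of-ratios approach to size factors, sketched below. The text above does not name a specific method, so this choice is an assumption. The idea is that most genes do not change between samples, so the median ratio to a per-gene reference reflects sequencing depth, and the unchanged genes act as an implicit control set. The sample data are made up.

```python
import math
from statistics import median


def size_factors(samples):
    """Median-of-ratios size factors for a list of {gene: count} dicts."""
    # Use only genes with non-zero counts in every sample
    genes = [g for g in samples[0] if all(s[g] > 0 for s in samples)]
    # Log geometric mean of each gene across samples: the pseudo-reference
    log_ref = {g: sum(math.log(s[g]) for s in samples) / len(samples) for g in genes}
    # Each sample's factor is its median ratio to the pseudo-reference
    return [math.exp(median(math.log(s[g]) - log_ref[g] for g in genes))
            for s in samples]


a = {"g1": 10, "g2": 20, "g3": 30, "g4": 5}
b = {"g1": 20, "g2": 40, "g3": 60, "g4": 50}  # twice the depth of a; g4 also changes
factors = size_factors([a, b])
normalised = [{g: s[g] / f for g in s} for s, f in zip([a, b], factors)]
print(factors)
```

After dividing by the size factors, g1 to g3 have matching values in both samples, while g4 still differs, which is the change a differential expression test should pick up.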



  1. The Transcriptional Landscape of the Yeast Genome Defined by RNA Sequencing http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2951732/
  2. A survey of best practices for RNA-seq data analysis https://pubmed.ncbi.nlm.nih.gov/26813401/
  3. Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown http://www.nature.com/nprot/journal/v11/n9/full/nprot.2016.095.html
  4. Simulation-based comprehensive benchmarking of RNA-seq aligners https://www.ncbi.nlm.nih.gov/pubmed/27941783