RNA-Seq
One of the first RNA-seq papers[1].
Best practices
- ENCODE guidelines and best practices for RNA-seq (hosted copy) - https://davetang.org/file/encode_best_practices_rnaseq_v2.pdf
- https://www.fda.gov/science-research/bioinformatics-tools/microarraysequencing-quality-control-maqcseqc#MAQC_IV
- Assessing Sequence Data Quality - http://bioinfo-core.org/index.php/9th_Discussion-28_October_2010
- A survey of best practices for RNA-seq data analysis[2]
Experimental protocols
- TruSeq Stranded mRNA protocol - https://support.illumina.com/content/dam/illumina-support/documents/documentation/chemistry_documentation/samplepreps_truseq/truseq-stranded-mrna-workflow/truseq-stranded-mrna-workflow-reference-1000000040498-00.pdf
- Poly-A selection using poly-T oligo attached magnetic beads
- mRNA fragmentation using divalent cations at elevated temperature
- First strand cDNA synthesis using RT and random primers; add actinomycin D to prevent spurious DNA-dependent synthesis
- Second strand cDNA synthesis with dUTP to quench second strand during amplification
- Adenylate 3' ends
- Ligate adapters for hybridisation to flow cell
This protocol is optimised for 0.1-1 microgram of total RNA; lower amounts can result in inefficient ligation and low yield. Use of lower quality RNA, including FFPE samples, may require further optimisation to determine the input amount.
Quality control
- Quality of RNA-seq Tool-Set - https://hartleys.github.io/QoRTs/index.html
Duplication
Mitochondrial gene expression
High expression of mitochondrial genes can indicate:
- Poor sample quality, leading to a high fraction of apoptotic or lysing cells.
- Biology of the particular sample, for example tumor biopsies, which may have increased mitochondrial gene expression due to metabolic activity and/or necrosis.
In mammals, apoptotic cells express mitochondrial genes and export these transcripts to the cytoplasm. For example, when apoptotic cells are spiked into an otherwise healthy cell suspension, an increased number of mitochondrial genes is detected.
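As a rough quality-control check, the fraction of counts assigned to mitochondrial genes can be computed per sample. The sketch below assumes a gene-level count table keyed by gene symbol and assumes the human naming convention in which mitochondrial gene symbols start with "MT-". The function name and the example counts are hypothetical.

```python
def mito_fraction(counts, mito_prefix="MT-"):
    """Return the fraction of counts assigned to mitochondrial genes.

    counts: dict mapping gene symbol -> read count for one sample.
    mito_prefix: assumed naming convention (human symbols start with "MT-").
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    prefix = mito_prefix.upper()
    mito = sum(n for gene, n in counts.items() if gene.upper().startswith(prefix))
    return mito / total


# Hypothetical counts for one sample
sample = {"MT-CO1": 300, "MT-ND1": 200, "ACTB": 1000, "GAPDH": 500}
print(f"Mitochondrial fraction: {mito_fraction(sample):.2%}")
```

What counts as "too high" depends on the tissue and protocol, so any cut-off should be chosen by comparing samples within the same experiment.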
Typical pipeline
The analysis pipeline can be conceptually divided into four main tasks[3]:
- alignment of the reads to the genome
- assembly of the alignments into full-length transcripts
- quantification of the expression levels of each gene and transcript
- calculation of the differences in expression for all genes among the different experimental conditions.
The HISAT, StringTie, and Ballgown pipeline[3]
- HISAT aligns RNA-seq reads to a genome and discovers transcript splice sites, while running far faster than TopHat2 and requiring much less computer memory than other methods.
- StringTie assembles the alignments into full and partial transcripts, creating multiple isoforms as necessary and estimating the expression levels of all genes and transcripts.
- Ballgown takes the transcripts and expression levels from StringTie and applies rigorous statistical methods to determine which transcripts are differentially expressed between two or more experiments.
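The command-line part of this pipeline can be sketched as a set of small helpers that build the commands for one paired-end sample. This is only an outline: the file names are hypothetical, nothing is executed, and the flags follow the commands published in the protocol[3], so they should be checked against the installed versions of HISAT2, SAMtools and StringTie. The final Ballgown step runs in R and is not shown.

```python
import shlex


def hisat2_cmd(index, fq1, fq2, sam, threads=8):
    # --dta reports alignments tailored for transcript assemblers such as StringTie
    return ["hisat2", "-p", str(threads), "--dta", "-x", index,
            "-1", fq1, "-2", fq2, "-S", sam]


def sort_cmd(sam, bam, threads=8):
    # Convert the SAM output to a coordinate-sorted BAM file
    return ["samtools", "sort", "-@", str(threads), "-o", bam, sam]


def stringtie_assemble_cmd(bam, annotation_gtf, out_gtf, label, threads=8):
    # Assemble transcripts for one sample, guided by a reference annotation
    return ["stringtie", "-p", str(threads), "-G", annotation_gtf,
            "-o", out_gtf, "-l", label, bam]


def stringtie_merge_cmd(annotation_gtf, gtf_list_file, merged_gtf, threads=8):
    # Merge the per-sample assemblies listed (one path per line) in gtf_list_file
    return ["stringtie", "--merge", "-p", str(threads), "-G", annotation_gtf,
            "-o", merged_gtf, gtf_list_file]


def stringtie_quant_cmd(bam, merged_gtf, out_gtf, threads=8):
    # -e limits estimation to the merged transcripts; -B writes Ballgown input tables
    return ["stringtie", "-e", "-B", "-p", str(threads), "-G", merged_gtf,
            "-o", out_gtf, bam]


# Hypothetical file names for a single sample
steps = [
    hisat2_cmd("index/genome_tran", "s1_1.fastq.gz", "s1_2.fastq.gz", "s1.sam"),
    sort_cmd("s1.sam", "s1.bam"),
    stringtie_assemble_cmd("s1.bam", "genes.gtf", "s1.gtf", "s1"),
    stringtie_merge_cmd("genes.gtf", "mergelist.txt", "merged.gtf"),
    stringtie_quant_cmd("s1.bam", "merged.gtf", "ballgown/s1/s1.gtf"),
]
for cmd in steps:
    print(shlex.join(cmd))
```

Building the commands as lists keeps the arguments unambiguous and makes it easy to pass them to `subprocess.run` once the paths are real.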
Mapping
For an aligner to be viable for RNA-seq it must[4]:
- align reads across splice junctions
- handle paired-end reads
- handle strand-specific data
- run efficiently
Differential expression
Differential expression analysis typically involves:
- data visualisation and inspection
- statistical tests for differential expression
- multiple test correction
- downstream inspection and summarisation of results.
- https://www.bioconductor.org/packages/release/bioc/html/EBSeq.html
- https://github.com/deweylab/RSEM/blob/master/README.md#de
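The multiple test correction step listed above is often done with the Benjamini-Hochberg procedure, which controls the false discovery rate. This choice is an assumption for illustration, since the list does not name a method. A minimal pure-Python sketch, with made-up p-values:

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four genes
pvals = [0.01, 0.04, 0.03, 0.005]
for p, q in zip(pvals, benjamini_hochberg(pvals)):
    print(f"p = {p:.3f}  adjusted = {q:.3f}")
```

Genes are then called differentially expressed when the adjusted value falls below the chosen false discovery rate, for example 0.05.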
Normalisation
Raw counts need to be normalised prior to comparison among different samples because these counts are affected by factors such as number of sequenced reads or transcript length.
- RPKM/FPKM is a within-sample normalisation method that will remove library-size and transcript-length effects
Correcting for transcript-length effects is not necessary when comparing changes in gene expression across samples, but it is necessary for comparing genes within the sample.
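A minimal sketch of the FPKM calculation, assuming raw fragment counts per gene and transcript lengths in bases (the plain annotated length is used here rather than an effective length). The function name and example values are hypothetical.

```python
def fpkm(counts, lengths):
    """Fragments per kilobase of transcript per million mapped fragments.

    counts: dict gene -> fragment count for one sample.
    lengths: dict gene -> transcript length in bases.
    """
    total = sum(counts.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million)
    return {gene: counts[gene] * 1e9 / (lengths[gene] * total) for gene in counts}


# Two genes with equal counts but different lengths
counts = {"geneA": 500, "geneB": 500}
lengths = {"geneA": 1000, "geneB": 2000}
print(fpkm(counts, lengths))
```

With equal counts, the gene that is twice as long receives half the FPKM, which is the transcript-length correction needed for within-sample comparisons.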
- What the FPKM? https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/
For TPM, the denominator is the sum of the length-normalised counts over all transcripts in the sample. Because this denominator differs between experiments, TPM is also sample dependent, which is why you cannot directly compare TPM between samples. That said, TPM is probably the most stable unit across experiments, though you still shouldn’t compare it across experiments.
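The sample dependence described above can be illustrated with a small sketch. It assumes the usual TPM definition, in which each count is first divided by the transcript length and the resulting rates are rescaled to sum to one million within the sample. Gene names and counts are made up.

```python
def tpm(counts, lengths):
    """Transcripts per million for one sample.

    counts: dict gene -> fragment count.
    lengths: dict gene -> transcript length in bases.
    """
    rates = {gene: counts[gene] / lengths[gene] for gene in counts}
    denom = sum(rates.values())
    if denom == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: rates[gene] / denom * 1e6 for gene in counts}


lengths = {"geneA": 1000, "geneB": 1000}
# geneA has the same raw count in both samples; only geneB changes
sample1 = tpm({"geneA": 100, "geneB": 100}, lengths)
sample2 = tpm({"geneA": 100, "geneB": 300}, lengths)
print(sample1["geneA"], sample2["geneA"])
```

The TPM of geneA drops in the second sample even though its count is unchanged, because the denominator grew with geneB.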
- Between sample normalisation -> https://haroldpimentel.wordpress.com/2014/12/08/in-rna-seq-2-2-between-sample-normalization/
Between-sample normalisation (BSN), which is required when comparing changes across experiments, addresses two issues:
1. Variable sequencing depth
2. Finding a "control set" of expression features, which should have similar expression patterns across experiments to serve as a baseline
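One common way to address both issues is the median-of-ratios approach used by DESeq, which implicitly treats the bulk of genes as the control set. It is shown here as one possible method, not as the only one. The sketch is pure Python, the function name is hypothetical, and the counts are made up.

```python
import math
from statistics import median


def size_factors(count_matrix):
    """Median-of-ratios size factors.

    count_matrix: dict sample -> list of counts, same gene order in every sample.
    Genes with a zero count in any sample are excluded from the reference.
    """
    samples = list(count_matrix)
    n_genes = len(count_matrix[samples[0]])
    # Log of the geometric mean across samples for each usable gene
    log_geo_means = {}
    for g in range(n_genes):
        values = [count_matrix[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means[g] = sum(math.log(v) for v in values) / len(values)
    if not log_geo_means:
        raise ValueError("every gene has a zero count in at least one sample")
    factors = {}
    for s in samples:
        log_ratios = [math.log(count_matrix[s][g]) - lgm
                      for g, lgm in log_geo_means.items()]
        factors[s] = math.exp(median(log_ratios))
    return factors


# Sample B was sequenced at twice the depth of sample A
counts = {"A": [10, 20, 30, 0], "B": [20, 40, 60, 5]}
factors = size_factors(counts)
normalised = {s: [c / factors[s] for c in counts[s]] for s in counts}
print(factors)
print(normalised)
```

Dividing each sample's counts by its size factor puts the samples on a common scale, so that genes with unchanged expression end up with similar normalised counts.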
- Effective length and count -> https://groups.google.com/forum/#!topic/rsem-users/IaZmviqghJc
Tutorials
- https://github.com/griffithlab/rnaseq_tutorial
- Nice figures explaining how STAR works - https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/03_alignment.html
References
- [1] The Transcriptional Landscape of the Yeast Genome Defined by RNA Sequencing http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2951732/
- [2] A survey of best practices for RNA-seq data analysis https://pubmed.ncbi.nlm.nih.gov/26813401/
- [3] Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown http://www.nature.com/nprot/journal/v11/n9/full/nprot.2016.095.html
- [4] Simulation-based comprehensive benchmarking of RNA-seq aligners https://www.ncbi.nlm.nih.gov/pubmed/27941783