Getting started with TopHat

Updated links for the binaries on 2 March 2015.

TopHat is a tool that can find splice junctions without a reference annotation. It first maps RNA-Seq reads to the genome (using Bowtie or Bowtie2) to identify potential exons, since many RNA-Seq reads align contiguously to the genome. Using this initial mapping, TopHat builds a database of possible splice junctions and then maps the reads against these junctions to confirm them. In the past, splice junction databases were built from known annotations, such as RefSeq; TopHat allows a user to find potentially novel splice variants.

I will use RNA-Seq data from the Genome Research paper by Marioni et al., published in 2008, to test TopHat. I found it funny that the submission title for their dataset was “RNASeq: the death Knell of expression arrays?”; I guess they decided to go with something much less morbid when they finally published their paper. I downloaded their sequence data from DDBJ under accession number SRA000299.

There are two sample descriptions for this submission, which are identical:

SRS000561: We extracted total RNA from liver and kidney samples of a single human male, puri_ed [sic] the poly-A mRNA and sheared it prior to cDNA synthesis
SRS000562: We extracted total RNA from liver and kidney samples of a single human male, puri_ed [sic] the poly-A mRNA and sheared it prior to cDNA synthesis

There are four experiments for this submission, totalling 6 runs:

SRX000571 -> SRR002321 and SRR002323 -> 080226_CMLIVER_0007_3pM
SRX000604 -> SRR002322 -> 080226_CMLIVER_0007_1.5pM
SRX000605 -> SRR002320 and SRR002325 -> 080226_CMKIDNEY_0007_3pM
SRX000606 -> SRR002324 -> 080226_CMKIDNEY_0007_1.5pM

Now to download the 6 sequence runs:

#!/bin/bash
wget ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA000/SRA000299/SRX000571/SRR002321.fastq.bz2
wget ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA000/SRA000299/SRX000571/SRR002323.fastq.bz2
wget ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA000/SRA000299/SRX000604/SRR002322.fastq.bz2
wget ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA000/SRA000299/SRX000605/SRR002320.fastq.bz2
wget ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA000/SRA000299/SRX000605/SRR002325.fastq.bz2
wget ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA000/SRA000299/SRX000606/SRR002324.fastq.bz2
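The downloaded files are bzip2-compressed, so decompress them before use:

bunzip2 SRR00232*.fastq.bz2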

Number of reads in each file

SRR002320.fastq 39,266,713
SRR002321.fastq 54,856,271
SRR002322.fastq 18,437,696
SRR002323.fastq 14,761,931
SRR002324.fastq 17,292,434
SRR002325.fastq 27,137,793

Total 171,752,838
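These counts can be reproduced by dividing each file's line count by four, since each FASTQ record spans four lines; a minimal sketch:

#!/bin/bash
# count reads per FASTQ file (4 lines per record)
for file in SRR00232?.fastq; do
   echo -n "$file "
   echo $(( $(wc -l < $file) / 4 ))
done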

According to the paper there were two full sequencing runs (i.e. 7 sample lanes each; lane 5 was a PhiX control):

lane    run1        conc    run2        conc
Lane1   kidney      3pM     liver       1.5pM
Lane2   liver       3pM     kidney      3pM
Lane3   kidney      3pM     liver       3pM
Lane4   liver       3pM     kidney      1.5pM
Lane5   CNTL-PhiX   3pM     CNTL-PhiX   1.5pM
Lane6   liver       3pM     kidney      3pM
Lane7   kidney      3pM     liver       1.5pM
Lane8   liver       3pM     kidney      1.5pM

Read number statistics as described in the paper (note the discrepancy with the above table for the second run! All of this information comes from the supplementary material downloaded from their website; the table above shows the same information as presented in their paper):

Run1
L1 – kidney 13,017,169
L2 – liver 14,003,322
L3 – kidney 13,401,343
L4 – liver 14,230,879
L6 – liver 13,525,355
L7 – kidney 12,848,201
L8 – liver 13,096,715
Total 94,122,984

Run2
L1 – kidney 9,096,595
L2 – liver 13,687,929
L3 – kidney 14,761,931
L4 – liver 8,843,158
L6 – liver 13,449,864
L7 – kidney 9,341,101
L8 – liver 8,449,276
Total 77,629,854

The total of the two runs is 171,752,838, which equals the total number of reads from the fastq files. The sequence data from the 14 lanes have been condensed into the 6 fastq files. But we can separate them out by using the information from the fastq definition line. Here’s what each part of the definition line represents:

@SRR002321.1 080226_CMLIVERKIDNEY_0007:2:1:115:885

“SRR002321.1” is the short read archive name, where the .1 is the read number
“080226_CMLIVERKIDNEY_0007” should be the Machine name, which has been arbitrarily changed
“2” is the Channel/lane number
“1” is the Tile number
“115” is the X position
“885” is the Y position

So we can separate out the reads based on the Channel/lane number. One complication is telling the two runs apart. Given that the two runs were done on different dates (3rd March 2008 and 10th March 2008, according to the supplementary information), these may correspond to 080226_CMLIVERKIDNEY_0007 and 080317_CM-KID-LIV-2-REPEAT_0003 respectively. I scanned all 171,752,838 reads and found that every read was assigned to either 080226_CMLIVERKIDNEY_0007 or 080317_CM-KID-LIV-2-REPEAT_0003; by separating out the reads by these identifiers and by lane, I get the same read counts as shown above (a one-liner reproducing these tallies follows the counts):

080226_CMLIVERKIDNEY_0007_1 13,017,169
080226_CMLIVERKIDNEY_0007_2 14,003,322
080226_CMLIVERKIDNEY_0007_3 13,401,343
080226_CMLIVERKIDNEY_0007_4 14,230,879
080226_CMLIVERKIDNEY_0007_6 13,525,355
080226_CMLIVERKIDNEY_0007_7 12,848,201
080226_CMLIVERKIDNEY_0007_8 13,096,715
080317_CM-KID-LIV-2-REPEAT_0003_1 9,096,595
080317_CM-KID-LIV-2-REPEAT_0003_2 13,687,929
080317_CM-KID-LIV-2-REPEAT_0003_3 14,761,931
080317_CM-KID-LIV-2-REPEAT_0003_4 8,843,158
080317_CM-KID-LIV-2-REPEAT_0003_6 13,449,864
080317_CM-KID-LIV-2-REPEAT_0003_7 9,341,101
080317_CM-KID-LIV-2-REPEAT_0003_8 8,449,276
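These tallies can be reproduced directly from the fastq definition lines with a one-liner (a sketch, assuming the six decompressed fastq files are in the current directory):

# print machine_lane for every definition line, then count occurrences
awk 'NR % 4 == 1 { split($2, f, ":"); print f[1] "_" f[2] }' SRR00232?.fastq | sort | uniq -c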

Script used to separate out the fastq files:


#!/usr/bin/perl

use strict;
use warnings;

my @infiles = qw/SRR002320.fastq SRR002321.fastq SRR002322.fastq SRR002323.fastq SRR002324.fastq SRR002325.fastq/;

# one output filehandle per machine_lane identifier
my %fh_for = ();

foreach my $infile (@infiles){
   open(my $in, '<', $infile) || die "Could not open $infile: $!\n";
   while(my $header = <$in>){
      # each FASTQ record spans four lines, for example:
      # @SRR002320.1 080226_CMLIVERKIDNEY_0007:1:1:112:735
      # GTGGTGGGGTTGGTATTTGGTTTCTCGTTTTAATTA
      # +
      # IIIIIIII"IIIII)I$I1%HII"I#./(#/'$#*#
      chomp($header);
      chomp(my $read      = <$in>);
      chomp(my $plus_line = <$in>);
      chomp(my $quality   = <$in>);

      if ($header =~ /^@(SRR\d+\.\d+)\s([a-zA-Z0-9_-]+):(\d+):(\d+):(\d+):(\d+)$/){
         my ($srr, $name, $lane, $tile, $x, $y) = ($1, $2, $3, $4, $5, $6);
         my $id = $name . '_' . $lane;
         # open a new output file the first time we see this machine_lane
         if (!exists $fh_for{$id}){
            open($fh_for{$id}, '>', "$id.fastq") || die "Could not write to $id.fastq: $!\n";
         }
         print {$fh_for{$id}} "$header\n$read\n$plus_line\n$quality\n";
      } else {
         die "Error parsing line $.: $header\n";
      }
   }
   close($in);
}

foreach my $id (keys %fh_for){
   close($fh_for{$id});
}

exit(0);

Setting up TopHat

Download the binaries for bowtie2:

wget http://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.2.4/bowtie2-2.2.4-linux-x86_64.zip
unzip bowtie2-2.2.4-linux-x86_64.zip

Download the binaries for tophat2:

wget http://ccb.jhu.edu/software/tophat/downloads/tophat-2.0.13.Linux_x86_64.tar.gz
tar -xzf tophat-2.0.13.Linux_x86_64.tar.gz
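TopHat calls the bowtie2 binaries at run time, so both tool directories need to be on your PATH; a sketch, assuming both archives were unpacked in the current directory (adjust the directory names to whatever the archives actually extracted):

export PATH=$PWD/bowtie2-2.2.4:$PWD/tophat-2.0.13.Linux_x86_64:$PATH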

Download the test data:

wget http://ccb.jhu.edu/software/tophat/downloads/test_data.tar.gz
tar -xzf test_data.tar.gz

Run a test job with the test data:

tophat -r 20 test_ref reads_1.fq reads_2.fq
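By default TopHat writes its results to a directory called tophat_out; if the test run completes, you should find the alignments and junction calls there:

ls tophat_out
# expect accepted_hits.bam along with junctions.bed and related output files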

The -r parameter

-r/--mate-inner-dist

This is the expected (mean) inner distance between mate pairs. For example, for paired-end runs with fragments selected at 300bp, where each end is 50bp, you should set -r to 200. There is no default, and this parameter is required for paired-end runs.
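In other words, -r is the fragment length minus twice the read length; a quick sanity check of the example above:

# inner mate distance = fragment length - 2 x read length
fragment=300
read_len=50
echo $(( fragment - 2 * read_len ))   # prints 200, the value to pass to -r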

Interpreting the test results

The reference sequence, where the stretches of A's serve as introns:

cat test_ref.fa
>test_chromosome
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
ACTACTATCTGACTAGACTGGAGGCGCTTGCGACTGAGCTAGGACGTGCC
ACTACGGGGATGACGACTAGGACTACGGACGGACTTAGAGCGTCAGATGC
AGCGACTGGACTATTTAGGACGATCGGACTGAGGAGGGCAGTAGGACGCT
ACGTATTTGGCGCGCGGCGCTACGGCTGAGCGTCGAGCTTGCGATACGCC
GTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG
ACTATTACTTTATTATCTTACTCGGACGTAGACGGATCGGCAACGGGACT
GTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG
TTTTCTACTTGAGACTGGGATCGAGGCGGACTTTTTAGGACGGGACTTGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

There are 100 paired end reads in reads_1.fq and reads_2.fq.

head -4 reads_1.fq
@test_mRNA_150_290_0/1
TCCTAAAAAGTCCGCCTCGGTCTCAGTCTCAAGTAGAAAAAGTCCCGTTGGCGATCCGTCTACGTCCGAGTAAGA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

head -4 reads_2.fq
@test_mRNA_150_290_0/2
TACGTATTTGTCGCGCGGCCCTACGGCTGAGCGTCGAGCTTGCGATCCGCCACTATTACTTTATTATCTTACTCG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

If TopHat ran properly, you can load the accepted_hits.bam file into IGV and view the aligned reads spanning the junctions.

Analysing the Marioni et al. data

First we need to index the reference genome (or you can download a pre-built index from ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/hg19.ebwt.zip).

bowtie2-build /path/to/hg19 hg19

On one core (Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz) of my Ubuntu box, it took roughly 100 minutes to index the hg19 genome, so if your download speed is slow it may be quicker to build the index yourself.
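bowtie2-build writes six index files with the prefix you supply; a quick way to confirm the indexing finished:

ls hg19*.bt2
# expect hg19.1.bt2 hg19.2.bt2 hg19.3.bt2 hg19.4.bt2 hg19.rev.1.bt2 hg19.rev.2.bt2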

Start TopHat on one of the runs (the coverage-search algorithm was turned off because it took too long and almost used up my 8 gigs of memory):

tophat --num-threads 8 --no-coverage-search ~/genome/hg19 SRR002321.fastq

Log

[2012-08-19 16:32:31] Beginning TopHat run (v2.0.4)
-----------------------------------------------
[2012-08-19 16:32:31] Checking for Bowtie
Bowtie version: 2.0.0.7
[2012-08-19 16:32:31] Checking for Samtools
Samtools version: 0.1.18.0
[2012-08-19 16:32:31] Checking for Bowtie index files
[2012-08-19 16:32:31] Checking for reference FASTA file
[2012-08-19 16:32:31] Generating SAM header for /home/davetang/genome/hg19
format: fastq
quality scale: phred33 (default)
[2012-08-19 16:33:32] Preparing reads
left reads: min. length=36, max. length=36, 53482003 kept reads (1374268 discarded)
Warning: you have only one segment per read.
If the read length is greater than or equal to 45bp,
we strongly recommend that you decrease --segment-length to about half the read length because TopHat will work better with multiple segments
[2012-08-19 16:39:02] Mapping left_kept_reads to genome hg19 with Bowtie2
[2012-08-19 17:09:59] Searching for junctions via segment mapping
Warning: junction database is empty!
[2012-08-19 17:11:23] Reporting output tracks
-----------------------------------------------
[2012-08-19 17:46:45] Run complete: 01:14:14 elapsed

Viewing alcohol dehydrogenase 1B on IGV, reads mapped to the positive strand are shown in red and reads mapped to the negative strand in blue.

Running TopHat with coverage search on each individual lane (reads for each lane were separated from the fastq files using the script above):

for file in *.fastq; do
   base=$(basename $file .fastq)
   nice tophat --num-threads 30 --output-dir $base hg19_male $file
done

Using BWA

As a comparison, I also aligned the data using BWA following the same strategy used in their paper: allowing two mismatches and discarding multi-mapping reads (although they used ELAND, the Illumina-supplied mapping program). They could uniquely align ~40% of their reads to a genomic location. Here are my mapping statistics using the same approach but with BWA (a sketch of the commands follows the table):

run1 read unique_mapped percent
Lane1 13,017,169 4,555,926 35.00
Lane2 14,003,322 4,760,137 33.99
Lane3 13,401,343 4,747,605 35.43
Lane4 14,230,879 4,808,619 33.79
Lane6 13,525,355 4,609,224 34.08
Lane7 12,848,201 4,346,034 33.83
Lane8 13,096,715 4,433,962 33.86
run2 read unique_mapped percent
Lane1 9,096,595 4,034,396 44.35
Lane2 13,687,929 5,003,548 36.55
Lane3 14,761,931 5,111,827 34.63
Lane4 8,843,158 4,252,449 48.09
Lane6 13,449,864 5,175,652 38.48
Lane7 9,341,101 4,379,373 46.88
Lane8 8,449,276 4,137,438 48.97
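For reference, here is a minimal sketch of that strategy with the classic bwa aln/samse pipeline; hg19.fa and lane1.fastq are placeholder names, -n 2 allows up to two differences (approximating the two-mismatch rule), and in BWA output X0:i:1 marks reads with exactly one best hit:

# map one lane, then count reads reported at a single genomic location
bwa aln -n 2 hg19.fa lane1.fastq > lane1.sai
bwa samse hg19.fa lane1.sai lane1.fastq > lane1.sam
grep -c 'X0:i:1\b' lane1.sam   # \b is the GNU grep word boundary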

The mean of all the mapping percentages is 38.42%, which is roughly 40%. Next I used a simple approach to annotate these reads against the Ensembl database: I downloaded the Ensembl transcripts as a BED file from the UCSC Table Browser and used intersectBed. However, only around 40% of the uniquely mapped reads overlap exons (a sketch of the intersection step follows the table):

run1 unique_mapped ensembl_exon percent
Lane1 4,555,926 1,932,692 42.42
Lane2 4,760,137 1,864,202 39.16
Lane3 4,747,605 2,004,449 42.22
Lane4 4,808,619 1,881,362 39.12
Lane6 4,609,224 1,793,735 38.92
Lane7 4,346,034 1,829,295 42.09
Lane8 4,433,962 1,728,134 38.97
run2 unique_mapped ensembl_exon percent
Lane1 4,034,396 1,636,857 40.57
Lane2 5,003,548 2,145,708 42.88
Lane3 5,111,827 2,016,176 39.44
Lane4 4,252,449 1,856,823 43.66
Lane6 5,175,652 2,211,203 42.72
Lane7 4,379,373 1,770,117 40.42
Lane8 4,137,438 1,806,110 43.65
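A sketch of that intersection step; the file names are placeholders, intersectBed is from BEDTools, and -u reports each read at most once if it overlaps any exon:

# count uniquely mapped reads that overlap at least one Ensembl exon
intersectBed -u -abam lane1_unique.bam -b ensembl_exons.bed | samtools view -c -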

More information

TopHat manual
RNA-Seq protocol used for these libraries




This work is licensed under a Creative Commons Attribution 4.0 International License.
Comments
  1. Hi Dave,

    I have been running TopHat for quite a while now. I normally use the protocol described in this paper: http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html
    I mostly focus my analysis on uniquely mapping reads. Once I run TopHat with the default settings, as given on the website and in the Nature paper, I extract uniquely mapping reads using a regex for “NH:i:1”, which gives the number of reported alignments that contain the query in the current record. I have also tried other ways, like matching on the SAM header and only extracting reads whose header appears only once, although these two methods tend to give different numbers. I am aware that TopHat also has a max-multihits option which can be set to 1, but I have not tried it yet, as I am not sure whether it would give me only those reads that map once, or all reads (including multi-mapping reads) aligned only once. I hope you understand what I am trying to say; I just want to know the best way to get uniquely mapping reads.

    1. Hi Saad,

      I haven’t been able to play around with TopHat as intended (hence this post has yet to be updated). I did something similar to you when I was working with BWA, which was to look for the X0:i:1 field. However, when I double-checked this by extracting reads whose header appears only once, just like you, it gave the same results as looking for the X0:i:1 field. Since you get different results, I can see why you’re concerned. I did look at Bowtie (not Bowtie2) when mapping multimapping reads (http://davetang.org/muse/2011/11/26/bowtie-and-multimapping-reads/) but I don’t see the X0:i:1 field, so either my version of Bowtie is outdated or you used Bowtie2.

      I guess what I would do is find the parameters for Bowtie or Bowtie2 that return only uniquely mapping reads, rather than reporting multimapping reads just once (which is what BWA does; however, the X0:i: field indicates how many times the read multimapped). You can make a small test file and test different parameters.

      Lastly if you’re interested in assigning weights to multimapping reads, have a look at this post: http://davetang.org/muse/2012/05/25/how-to-deal-with-multi-mapping-tags/

      Sorry I couldn’t be of more help. When I have time, I will update this post.

      Cheers,

      Dave

        1. Hi Saad,

          I’m currently revisiting this old post and updating it; sorry for the delay in getting back.

          Using the TopHat results on the test data linked in this post, I found examples where a read appears twice in the BAM file yet has the NH:i:1 tag. From the SAM manual, the NH tag is the “Number of reported alignments that contains the query in the current record.” Like you, I would have assumed that, at least for this example, it should be NH:i:2.

          However, when I looked at the BAM flags for this entry (test_mRNA_51_194_47), one was 163 and the other 83, meaning that they are the two mates of a pair, i.e. different reads. So the approach of grepping for NH:i:1 would not work for paired-end data. Could it be that you are working with paired-end data?

          Hope that helps,

          Dave

  2. Hi Dave,

    Nice and well written article!

    Just a question: I have noticed that in the TopHat run, some reads were discarded. Isn’t that going to affect the aligned reads or the BAM file?

    1. Hi Aso,

      Thanks! When you say discarded, do you mean not aligned? From the manual, reads are discarded if there are too many mismatches, gaps, or edits, or if reads are multimapped above a certain threshold.

      Cheers,

      Dave

      1. Thank you so much for a quick reply!

        So can we include those reads by changing the threshold values, or should we continue as is? Because once they are included, the resulting BAM file will be different from the previous one. (I think)

        1. Hi Aso,

          No problem! Have a play with the threshold values (and different settings) and see how they affect your dataset. I’ve only used TopHat sparingly, so please give it a go.

          Cheers,

          Dave

  3. Hi Dave

    I have never used paired-end reads with TopHat; I just started using them. Can you please explain how to choose the right value for -r/--mate-inner-dist?

    Thanks
    Saad

    1. Hi Saad,

      That depends on your protocol; the mate inner distance refers to the gap between the paired reads, i.e. the fragment size minus twice the read length. If your protocol shears DNA at ~500bp, -r would be 500 minus twice your read length.

      Have a look at this guide from Illumina about paired-end sequencing.

      Cheers,

      Dave

  4. Thanks Dave,

    Actually, I am using SRA data and I have a lot of SRA files to work on that are paired-end. I was just wondering if you know of any way to get the mate distance information from the SRA programmatically.

    Thanks
    Saad

  5. Hi Dave,

    I have two questions for you.

    One: is there a way we can get the fragment size information programmatically, using SRAdb etc.?

    Second: I have not used human data in the past; I have only worked on plant data, where we generally use only uniquely mapping reads for transcript assembly and differential expression. What is the general convention for human data: when using Cufflinks/Cuffdiff, do we use only uniquely mapping reads or all mapped reads?

    1. Hi Saad,

      In the Short Read Archive, there should be a paper associated with the dataset; I would have a look at the papers, especially the methods, for the data you plan to use. If there is no paper associated, try mapping the mates individually and checking the distances between the first and second mates.

      As for reads mapping to repetitive regions, it depends on your interest. If you are only interested in protein-coding genes, you won’t lose much by discarding non-unique reads, though you may lose some reads to pseudogenes. If you’re interested in transposon expression, discarding non-unique reads will obviously remove a lot of signal.

      Hope that helps,

      Dave

  6. Sorry to ask again: in the SRA the fragment size is given as a range:

    fragment size range: 200-250

    So in that case, what should my -r option in tophat be? Do I need to subtract the read length from the fragment size?

  7. As part of my analysis, I have human RNA-Seq data (Illumina, single-end, 36 bp reads) from normal and cancer patients, which I need to align to the human reference genome. When I run tophat2 and try to align using bowtie2, I get an error saying that there is not enough memory. The computer I am using has 32 GB RAM and 2 TB of disk with 1 TB allocated to Ubuntu. Is this configuration not enough to run my analysis on human reads? Please suggest the system requirements or any changes I need to make.

    I am pasting my command and error note in case that may give you better idea about my problem.

    root@linux-Precision-T3600:~# tophat -g 1 -G '/home/linux/Desktop/NGSsampledata/Homo_sapiens_Ensembl_GRCh37/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf' -o '/home/linux/Desktop/NGSsampledata/tophat_output_1' '/home/linux/Desktop/NGSsampledata/Homo_sapiens_Ensembl_GRCh37/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome' '/home/linux/Desktop/NGSsampledata/Normal/normal.fastq'

    [2014-01-18 18:25:14] Beginning TopHat run (v2.0.10)
    -----------------------------------------------
    [2014-01-18 18:25:14] Checking for Bowtie
    Bowtie version: 2.1.0.0
    [2014-01-18 18:25:14] Checking for Samtools
    Samtools version: 0.1.19.0
    [2014-01-18 18:25:14] Checking for Bowtie index files (genome)..
    [2014-01-18 18:25:14] Checking for reference FASTA file
    [2014-01-18 18:25:14] Generating SAM header for /home/linux/Desktop/NGSsampledata/Homo_sapiens_Ensembl_GRCh37/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome
    [2014-01-18 18:25:16] Reading known junctions from GTF file
    [2014-01-18 18:25:32] Preparing reads
    left reads: min. length=35, max. length=35, 15669232 kept reads (5155 discarded)
    [2014-01-18 18:28:12] Building transcriptome data files /home/linux/Desktop/NGSsampledata/tophat_output_1/tmp/genes
    [2014-01-18 18:28:45] Building Bowtie index from genes.fa
    [2014-01-18 18:56:21] Mapping left_kept_reads to transcriptome genes with Bowtie2
    [2014-01-18 19:29:21] Resuming TopHat pipeline with unmapped reads
    Warning: you have only one segment per read.
    If the read length is greater than or equal to 45bp,
    we strongly recommend that you decrease --segment-length to about half the read length because TopHat will work better with multiple segments
    [2014-01-18 19:29:21] Mapping left_kept_reads.m2g_um to genome genome with Bowtie2
    [FAILED]
    Error running bowtie:
    Out of memory allocating the ebwt[] array for the Bowtie index. Please try
    again on a computer with more memory.
    Error: Encountered internal Bowtie 2 exception (#1)
    Command: /usr/local/bin/bowtie2-align -k 1 -D 15 -R 2 -N 0 -L 20 -i S,1,1.25 --gbar 4 --mp 6,2 --np 1 --rdg 5,3 --rfg 5,3 --score-min C,-14,0 -p 1 --sam-no-hd -x /home/linux/Desktop/NGSsampledata/Homo_sapiens_Ensembl_GRCh37/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome -
