Finding junctions with TopHat

For setting up TopHat see my previous post.

Here, I wanted to test whether TopHat can find junctions with single-end 27 bp reads.

The reference sequence I used was the test_ref.fa provided by the TopHat authors (see my previous post for the link), where the A's mark the intron regions:

>test_chromosome
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
ACTACTATCTGACTAGACTGGAGGCGCTTGCGACTGAGCTAGGACGTGCC
ACTACGGGGATGACGACTAGGACTACGGACGGACTTAGAGCGTCAGATGC
AGCGACTGGACTATTTAGGACGATCGGACTGAGGAGGGCAGTAGGACGCT
ACGTATTTGGCGCGCGGCGCTACGGCTGAGCGTCGAGCTTGCGATACGCC
GTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG
ACTATTACTTTATTATCTTACTCGGACGTAGACGGATCGGCAACGGGACT
GTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG
TTTTCTACTTGAGACTGGGATCGAGGCGGACTTTTTAGGACGGGACTTGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

To index the reference run:

bowtie2-build test_ref.fa test_ref

I made up a test read (also based on the test fastq file provided by the TopHat authors) and saved it in fastq format as blah.fq:

@test_mRNA_151_297_1d/1
TGCGATACGCCACTATTACTTTATTAT
+
IIIIIIIIIIIIIIIIIIIIIIIIIII

If you run TopHat with the default settings, no alignment will be returned:

tophat test_ref blah.fq

We need to modify the segment length parameter:

--segment-length Each read is cut up into segments, each at least this long. These segments are mapped independently. The default is 25.

Interestingly, I get results when I specify a segment length of 12, but not at any higher value (I would have assumed that I needed a segment length of 11):

tophat --segment-length 12 test_ref blah.fq
samtools view tophat_out/accepted_hits.bam
#test_mRNA_151_297_1d    0       test_chromosome 240     50      11M100N16M      *       0       0       TGCGATACGCCACTATTACTTTATTAT     IIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-6  XM:i:0  XO:i:0  XG:i:0  MD:Z:27 NM:i:0  XS:A:+  NH:i:1

From the CIGAR string (11M100N16M), you can see how the read was split: 11 nucleotides on one exon and 16 nucleotides on the other.
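As a sanity check, the reported alignment can be reconstructed by hand: starting at position 240 (1-based) and walking the CIGAR 11M100N16M along the reference should reproduce the read. A minimal Python sketch (spliced_sequence is a throwaway helper written for this post; the reference is hard-coded from test_ref.fa above):

```python
import re

# test_chromosome from test_ref.fa, concatenated without newlines
ref = (
    "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
    "ACTACTATCTGACTAGACTGGAGGCGCTTGCGACTGAGCTAGGACGTGCC"
    "ACTACGGGGATGACGACTAGGACTACGGACGGACTTAGAGCGTCAGATGC"
    "AGCGACTGGACTATTTAGGACGATCGGACTGAGGAGGGCAGTAGGACGCT"
    "ACGTATTTGGCGCGCGGCGCTACGGCTGAGCGTCGAGCTTGCGATACGCC"
    "GTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
    "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG"
    "ACTATTACTTTATTATCTTACTCGGACGTAGACGGATCGGCAACGGGACT"
    "GTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
    "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG"
    "TTTTCTACTTGAGACTGGGATCGAGGCGGACTTTTTAGGACGGGACTTGC"
    "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
    "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
)

def spliced_sequence(ref, pos, cigar):
    """Walk a CIGAR string along the reference from a 1-based POS,
    keeping M (match) bases and skipping N (intron) bases."""
    i = pos - 1          # convert to 0-based
    pieces = []
    for length, op in re.findall(r"(\d+)([MN])", cigar):
        length = int(length)
        if op == "M":
            pieces.append(ref[i:i + length])
        i += length      # both M and N consume reference bases
    return "".join(pieces)

print(spliced_sequence(ref, 240, "11M100N16M"))
# → TGCGATACGCCACTATTACTTTATTAT
```

The output matches the test read exactly: 11 nt from the end of one exon, a 100 bp intron of A's skipped, then 16 nt from the start of the next exon.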

And lastly, if I set the --max-segment-intron parameter to anything less than 101 (the intron here is 100 bp), no results are returned:

--max-segment-intron The maximum intron length that may be found during split-segment search. The default is 500000.

tophat --segment-length 12 --max-segment-intron 101 test_ref blah.fq #returns result
tophat --segment-length 12 --max-segment-intron 100 test_ref blah.fq #no result

The remaining questions are how trustworthy junctions detected from 27 bp reads are when only a few reads support them, and how much extra computational time a shorter segment length adds.

Lastly, I have tested different --segment-length parameters (data not shown). It is not the length of the segment per se that governs whether a match can be found, but whether the segment length matches how the read is split across the junction. For example, in a test where I split a 32bp read into 19/22, mapping results could be found with a segment length of 18 but not 10.

Annotating RNA-Seq data

After mapping reads from an RNA-Seq experiment, the next task is usually to identify the transcripts the reads came from (i.e. annotating the RNA-Seq data), and there are many ways of doing so. Here I describe a rather crude method: I download the sequence coordinates of hg19 RefSeqs as a BED12 file from the UCSC Genome Browser Table Browser and extrapolate the 5' UTR, exons, introns, 3' UTR and intergenic regions from this file, then output them as a bed file for use with intersectBed. A better (and much less error-prone) method would be to download each individual region of a RefSeq transcript directly from the Table Browser.

The process would have been quite straightforward; however, since transcripts are located on both the positive and negative strands and are spliced (even the 5' and 3' UTRs), the script I wrote to create the annotation bed file is quite long and may contain unwanted features, i.e. bugs. To use the script, first download the refGene BED12 file from the UCSC Genome Browser Table Browser, then sort it by chromosome and then by start coordinate, i.e. cat refgene.bed12 | sort -k1,1 -k2,2n > refgene_sorted.bed12. NOTE: this script will only work for RefSeqs mapped to hg19, since I have hardcoded the chromosome ends within the script.
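For reference, exon coordinates in a BED12 record are derived from chromStart plus the blockStarts/blockSizes fields, in 0-based half-open coordinates. A minimal sketch (bed12_exons is a hypothetical helper, using the NR_046018 refGene line as an example):

```python
def bed12_exons(fields):
    """Return (start, end) exon pairs (0-based, half-open) from a BED12 record."""
    chrom_start = int(fields[1])
    sizes = [int(s) for s in fields[10].rstrip(",").split(",")]
    starts = [int(s) for s in fields[11].rstrip(",").split(",")]
    return [(chrom_start + st, chrom_start + st + sz)
            for st, sz in zip(starts, sizes)]

line = "chr1\t11873\t14409\tNR_046018\t0\t+\t14409\t14409\t0\t3\t354,109,1189,\t0,739,1347,"
print(bed12_exons(line.split("\t")))
# → [(11873, 12227), (12612, 12721), (13220, 14409)]
```

Note that the Perl script takes these same blocks but adds 1 to each start, so its output is in 1-based inclusive coordinates rather than raw BED.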

Without further ado:


#!/usr/bin/perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.bed>\n";
my $infile = shift or die $usage;

my %chr_end = ();
#DATA used for calculating final intergenic region
while(<DATA>){
   chomp;
   my ($chr,$end) = split(/\s+/);
   $chr_end{$chr} = $end;
}
close(DATA);

#used for calculating intergenic regions
my $end_of_transcript = '-1';
my $previous_chr = 'chr1';

open(IN,'<',$infile) || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
=head1 sample bed file

chr1    11873   14409   NR_046018       0       +       14409   14409   0       3       354,109,1189,   0,739,1347,
chr1    14361   29370   NR_024540       0       -       29370   29370   0       11      468,69,152,159,198,136,137,147,99,154,50,       0,608,1434,2245,2496,2871,3244,3553,3906,10376,14959,
chr1    34610   36081   NR_026818       0       -       36081   36081   0       3       564,205,361,    0,666,1110,

=cut
   my($chr,$start,$end,$id,$score,$strand,$thick_start,$thick_end,$rgb,$block_count,$block_size,$block_start) = split(/\t/);
   #just in case
   $chr= lc $chr;

   #work only with the assembled chromosomes
   next unless exists $chr_end{$chr};

   my @block_size = split(/,/,$block_size);
   my @block_start = split(/,/,$block_start);

   #declare UTR variables
   my $five_prime_utr_start = '0';
   my $five_prime_utr_end = '0';
   my $three_prime_utr_start = '0';
   my $three_prime_utr_end = '0';
   #utr information is present when the thick_start is not equal to the start and end
   if ($end != $thick_start && $thick_start != $start ){
      #chr1    861120  879961  NM_152486       0       +       861321  879533
      my $utr_start = $start;
      my $utr_end = $thick_start + 1;
      if ($strand eq '+'){
         $five_prime_utr_start = $utr_start;
         $five_prime_utr_end = $utr_end;
      } else {
         $three_prime_utr_start =$utr_end;
         $three_prime_utr_end =$utr_start;
      }
   }
   if ($end != $thick_end){
      my $utr_start = $thick_end;
      my $utr_end = $end + 1;
      if ($strand eq '+'){
         $three_prime_utr_start = $utr_start;
         $three_prime_utr_end = $utr_end;
      } else {
         $five_prime_utr_start =$utr_end;
         $five_prime_utr_end =$utr_start;
      }
   }

   #declare arrays for introns and exons
   my @exon_start = ();
   my @exon_end = ();
   my @intron_start = ();
   my @intron_end = ();

   #go through each block from right to left
   for (my $block=$block_count-1; $block>=0; --$block){
      #exons
      #0 based coordinates
      my $exon_start = $start + $block_start[$block] + 1;
      my $exon_end = $start + $block_start[$block] + $block_size[$block];
      #print "Exon: $block\t$exon_start\t$exon_end\n";

      if ($strand eq '+'){
         if ($three_prime_utr_start != 0){
            if ($exon_start > $three_prime_utr_start){
               #print "3' UTR is spliced:\t$exon_start\t$exon_end\n";
               print join("\t",$chr,$exon_start,$exon_end,"3_UTR_$id",'0',$strand),"\n";
            } else {
               #print "3' UTR:\t$three_prime_utr_start\t$exon_end\n";
               print join("\t",$chr,$three_prime_utr_start,$exon_end,"3_UTR_$id",'0',$strand),"\n";
               $three_prime_utr_start = '0';
            }
         }
      } else {
         if ($five_prime_utr_end != 0){
            if ($exon_start > $five_prime_utr_end){
               #print "5' UTR is spliced:\t$exon_start\t$exon_end\n";
               print join("\t",$chr,$exon_start,$exon_end,"5_UTR_$id",'0',$strand),"\n";
            } else {
               #print "5' UTR:\t$five_prime_utr_end\t$exon_end\n";
               print join("\t",$chr,$five_prime_utr_end,$exon_end,"5_UTR_$id",'0',$strand),"\n";
               $five_prime_utr_end = '0';
            }
         }
      }
   }

   my $previous_exon_end = '';
   #go through each block from left to right
   for (1 .. $block_count){
      #array indexing starts with 0
      my $block = $_ - 1;

      #exons
      #0 based coordinates
      my $exon_start = $start + $block_start[$block] + 1;
      my $exon_end = $start + $block_start[$block] + $block_size[$block];
      #print "$id\t$exon_start\t$exon_end\n";
      print join("\t",$chr,$exon_start,$exon_end,"EXON_$id",'0',$strand),"\n";
      $exon_start[$block] = $exon_start;
      $exon_end[$block] = $exon_end;
      #print "Exon $block:\t$exon_start\t$exon_end\n";

      if ($strand eq '+'){
         if ($five_prime_utr_end != 0){
            if ($exon_end < $five_prime_utr_end){
               #print "5' UTR is spliced:\t$exon_start\t$exon_end\n";
               print join("\t",$chr,$exon_start,$exon_end,"5_UTR_$id",'0',$strand),"\n";
            } else {
               #print "5' UTR:\t$exon_start\t$five_prime_utr_end\n";
               print join("\t",$chr,$exon_start,$five_prime_utr_end,"5_UTR_$id",'0',$strand),"\n";
               $five_prime_utr_end = '0';
            }
         }
      } else {
         if ($three_prime_utr_start != 0){
            if ($exon_end < $three_prime_utr_start){
               #print "3' UTR is spliced:\t$exon_start\t$exon_end\n";
               print join("\t",$chr,$exon_start,$exon_end,"3_UTR_$id",'0',$strand),"\n";
            } else {
               #print "3' UTR:\t$exon_start\t$three_prime_utr_start\n";
               print join("\t",$chr,$exon_start,$three_prime_utr_start,"3_UTR_$id",'0',$strand),"\n";
               $three_prime_utr_start = '0';
            }
         }
      }

      #introns
      if ($block == 0){
         $previous_exon_end = $exon_end;
      } else {
         my $intron_start = $previous_exon_end + 1;
         my $intron_end = $exon_start - 1;
         #print "Intron $block:\t$intron_start\t$intron_end\n";
         print join("\t",$chr,$intron_start,$intron_end,"INTRON_$id",'0',$strand),"\n";
         $intron_start[$block] = $intron_start;
         $intron_end[$block] = $intron_end;
         $previous_exon_end = $exon_end;
      }
   }

   #first and last exons
   #uncomment the print statements below if you want to output the first and last exons
   if ($block_count > 1){
      if ($strand eq '+'){
         my $first_exon_start = $exon_start[0];
         my $first_exon_end = $exon_end[0];
         my $last_exon_start = $exon_start[-1];
         my $last_exon_end = $exon_end[-1];
         #print "First exon:\t$first_exon_start\t$first_exon_end\n";
         #print "Last exon:\t$last_exon_start\t$last_exon_end\n";
      } elsif ($strand eq '-'){
         my $first_exon_start = $exon_start[-1];
         my $first_exon_end = $exon_end[-1];
         my $last_exon_start = $exon_start[0];
         my $last_exon_end = $exon_end[0];
         #print "First exon:\t$first_exon_start\t$first_exon_end\n";
         #print "Last exon:\t$last_exon_start\t$last_exon_end\n";
      } else {
         die "Incorrect strand information: $strand\n";
      }
   }

   #intergenic
   #the file is sorted, so this is the last transcript on the chromosome
   if ($chr ne $previous_chr){
      my $intergenic_start = $end_of_transcript + 1;
      my $intergenic_end = $chr_end{$previous_chr} - 1;
      $end_of_transcript = '-1';
      #print "$intergenic_start\t$intergenic_end\n";
      print join("\t",$chr,$intergenic_start,$intergenic_end,'INTERGENIC','0','0'),"\n";
      $previous_chr = $chr;
   }

   my $intergenic_start = $end_of_transcript + 1;
   my $intergenic_end = $start - 1;
   #print "$intergenic_start\t$intergenic_end\n";
   #this is to avoid overlapping transcript
   if ($intergenic_start >= $intergenic_end){
      #overlapping transcript
   } else {
      print join("\t",$chr,$intergenic_start,$intergenic_end,'INTERGENIC','0','0'),"\n";
   }
   $end_of_transcript = $end;

   #print "---\n\n";

}

close(IN);
exit(0);

#hg19 chromosome ends used for calculating last intergenic region
__DATA__
chr1    249250621
chr2    243199373
chr3    198022430
chr4    191154276
chr5    180915260
chr6    171115067
chr7    159138663
chrX    155270560
chr8    146364022
chr9    141213431
chr10   135534747
chr11   135006516
chr12   133851895
chr13   115169878
chr14   107349540
chr15   102531392
chr16   90354753
chr17   81195210
chr18   78077248
chr20   63025520
chrY    59373566
chr19   59128983
chr22   51304566
chr21   48129895

You should end up with a bed file like this:

chr1 1264277 1266724 INTERGENIC 0 0
chr1 1284445 1284492 5_UTR_NM_004421 0 -
chr1 1270658 1271895 EXON_NM_004421 0 -
chr1 1270658 1271522 3_UTR_NM_004421 0 -
chr1 1273357 1273563 EXON_NM_004421 0 -
chr1 1271896 1273356 INTRON_NM_004421 0 -
chr1 1273649 1273816 EXON_NM_004421 0 -
chr1 1273564 1273648 INTRON_NM_004421 0 -
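The intron rows can be cross-checked against their flanking exon rows: the script derives each intron as previous_exon_end + 1 through next_exon_start - 1. A quick check using the NM_004421 rows above:

```python
# coordinates taken from the NM_004421 rows in the sample output
exon1 = (1270658, 1271895)
intron = (1271896, 1273356)
exon2 = (1273357, 1273563)

# intron_start = previous_exon_end + 1; intron_end = next_exon_start - 1
assert intron[0] == exon1[1] + 1
assert intron[1] == exon2[0] - 1
print("intron is flush with its flanking exons")
```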

Getting started with TopHat

Updated links for the binaries on 2015 March 2nd

TopHat is a tool that can find splice junctions without a reference annotation. By first mapping RNA-Seq reads to the genome (using Bowtie or Bowtie2), TopHat identifies potential exons, since many RNA-Seq reads align contiguously to the genome. Using this initial mapping, TopHat builds a database of possible splice junctions and then maps the reads against these junctions to confirm them. In the past, junction libraries were built from known references, such as RefSeq; TopHat allows a user to find potentially novel splice variants.

I will use RNA-Seq data from the Genome Research paper by Marioni et al. published in 2008 to test TopHat. I found it funny that the submission title for their dataset was "RNASeq: the death Knell of expression arrays?"; I guess they decided to go with something much less morbid when they finally published their paper. Their sequence data was downloaded from DDBJ under the accession number SRA000299.

There are two sample descriptions for this submission, which are identical:

SRS000561: We extracted total RNA from liver and kidney samples of a single human male, puri_ed [sic] the poly-A mRNA and sheared it prior to cDNA synthesis
SRS000562: We extracted total RNA from liver and kidney samples of a single human male, puri_ed [sic] the poly-A mRNA and sheared it prior to cDNA synthesis

There are four experiments for this submission, totalling 6 runs:

SRX000571 -> SRR002321 and SRR002323 -> 080226_CMLIVER_0007_3pM
SRX000604 -> SRR002322 -> 080226_CMLIVER_0007_1.5pM
SRX000605 -> SRR002320 and SRR002325 -> 080226_CMKIDNEY_0007_3pM
SRX000606 -> SRR002324 -> 080226_CMKIDNEY_0007_1.5pM

Now to download the 6 sequence runs:

#!/bin/bash
wget ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA000/SRA000299/SRX000571/SRR002321.fastq.bz2
wget ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA000/SRA000299/SRX000571/SRR002323.fastq.bz2
wget ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA000/SRA000299/SRX000604/SRR002322.fastq.bz2
wget ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA000/SRA000299/SRX000605/SRR002320.fastq.bz2
wget ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA000/SRA000299/SRX000605/SRR002325.fastq.bz2
wget ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA000/SRA000299/SRX000606/SRR002324.fastq.bz2

Number of reads in each file

SRR002320.fastq 39,266,713
SRR002321.fastq 54,856,271
SRR002322.fastq 18,437,696
SRR002323.fastq 14,761,931
SRR002324.fastq 17,292,434
SRR002325.fastq 27,137,793

Total 171,752,838
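As a quick sanity check (a throwaway snippet, not part of any pipeline), the per-file counts do sum to the reported total:

```python
# read counts for each downloaded fastq file, as listed above
fastq_reads = {
    "SRR002320": 39_266_713,
    "SRR002321": 54_856_271,
    "SRR002322": 18_437_696,
    "SRR002323": 14_761_931,
    "SRR002324": 17_292_434,
    "SRR002325": 27_137_793,
}
total = sum(fastq_reads.values())
print(f"{total:,}")
# → 171,752,838
```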

From the paper, there were two full sequencing runs (i.e. 7 lanes each; lane 5 was a PhiX control).

       run 1            run 2
Lane1  kidney 3pM       liver 1.5pM
Lane2  liver 3pM        kidney 3pM
Lane3  kidney 3pM       liver 3pM
Lane4  liver 3pM        kidney 1.5pM
Lane5  CNTL-PhiX 3pM    CNTL-PhiX 1.5pM
Lane6  liver 3pM        kidney 3pM
Lane7  kidney 3pM       liver 1.5pM
Lane8  liver 3pM        kidney 1.5pM

Read number statistics as described in the paper (note the discrepancy with the table above for the second run! All of this information is taken from the supplementary material downloaded from their website; the table above shows the same information as presented in the paper):

Run1
L1 - kidney 13,017,169
L2 - liver 14,003,322
L3 - kidney 13,401,343
L4 - liver 14,230,879
L6 - liver 13,525,355
L7 - kidney 12,848,201
L8 - liver 13,096,715
Total 94,122,984

Run2
L1 - kidney 9,096,595
L2 - liver 13,687,929
L3 - kidney 14,761,931
L4 - liver 8,843,158
L6 - liver 13,449,864
L7 - kidney 9,341,101
L8 - liver 8,449,276
Total 77,629,854

The total of the two runs is 171,752,838, which matches the total number of reads in the fastq files. The sequence data from the 14 lanes has been condensed into the 6 fastq files, but we can separate the lanes using the information in the fastq definition line. Here's what each part of the definition line represents:

@SRR002321.1 080226_CMLIVERKIDNEY_0007:2:1:115:885

"SRR002321.1" is the short read archive name, where the .1 is the read number
"080226_CMLIVERKIDNEY_0007" should be the Machine name, which has been arbitrarily changed
"2" is the Channel/lane number
"1" is the Tile number
"115" is the X position
"885" is the Y position

So we can separate out the reads based on the channel/lane number. One possible complication is that we might not be able to distinguish the two runs. However, the two runs were performed on different dates (3rd March 2008 and 10th March 2008 according to the supplementary information), which presumably correspond to the machine names 080226_CMLIVERKIDNEY_0007 and 080317_CM-KID-LIV-2-REPEAT_0003 respectively. I scanned all 171,752,838 reads and found that each read was assigned to either 080226_CMLIVERKIDNEY_0007 or 080317_CM-KID-LIV-2-REPEAT_0003; separating the reads by these identifiers and by lane gives the same read counts as shown above:

080226_CMLIVERKIDNEY_0007_1 13,017,169
080226_CMLIVERKIDNEY_0007_2 14,003,322
080226_CMLIVERKIDNEY_0007_3 13,401,343
080226_CMLIVERKIDNEY_0007_4 14,230,879
080226_CMLIVERKIDNEY_0007_6 13,525,355
080226_CMLIVERKIDNEY_0007_7 12,848,201
080226_CMLIVERKIDNEY_0007_8 13,096,715
080317_CM-KID-LIV-2-REPEAT_0003_1 9,096,595
080317_CM-KID-LIV-2-REPEAT_0003_2 13,687,929
080317_CM-KID-LIV-2-REPEAT_0003_3 14,761,931
080317_CM-KID-LIV-2-REPEAT_0003_4 8,843,158
080317_CM-KID-LIV-2-REPEAT_0003_6 13,449,864
080317_CM-KID-LIV-2-REPEAT_0003_7 9,341,101
080317_CM-KID-LIV-2-REPEAT_0003_8 8,449,276

Script used to separate out the fastq files:


#!/usr/bin/perl

#use strict; # disabled because lane IDs are used as string filehandles below
use warnings;

my @infile = qw/SRR002320.fastq SRR002321.fastq SRR002322.fastq SRR002323.fastq SRR002324.fastq SRR002325.fastq/;

my %marker = ();

foreach my $infile (@infile){
   open(IN,'<',$infile) || die "Could not open $infile: $!\n";
   while(<IN>){
      chomp;
      #@SRR002320.1 080226_CMLIVERKIDNEY_0007:1:1:112:735
      #GTGGTGGGGTTGGTATTTGGTTTCTCGTTTTAATTA
      #+
      #IIIIIIII"IIIII)I$I1%HII"I#./(#/'$#*#
      my $header = $_;
      my $read = <IN>;
      chomp($read);
      my $plus_line = <IN>;
      chomp($plus_line);
      my $quality = <IN>;
      chomp($quality);

      if ($header =~ /^@(SRR\d+\.\d+)\s([a-zA-Z0-9-_]+):(\d+):(\d+):(\d+):(\d+)$/){
         my $srr = $1;
         my $name = $2;
         my $lane = $3;
         my $tile = $4;
         my $x = $5;
         my $y = $6;
         my $id = $name . '_' . $lane;
         if (exists $marker{$id}){
            print $id "$header\n$read\n$plus_line\n$quality\n";
         } else {
            $marker{$id} = '1';
            open($id,'>',"$id.fastq") || die "Could not write to $id: $!\n";
            print $id "$header\n$read\n$plus_line\n$quality\n";
         }

      } else {
         die "Error parsing line $.: $header\n";
      }

   }
   close(IN);
}

foreach my $id (keys %marker){
   close($id);
}

exit(0);

Setting up TopHat

Download the binaries for bowtie2:

wget http://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.2.4/bowtie2-2.2.4-linux-x86_64.zip
unzip bowtie2-2.2.4-linux-x86_64.zip

Download the binaries for tophat2:

wget http://ccb.jhu.edu/software/tophat/downloads/tophat-2.0.13.Linux_x86_64.tar.gz
tar -xzf tophat-2.0.13.Linux_x86_64.tar.gz

Download the test data:

wget http://ccb.jhu.edu/software/tophat/downloads/test_data.tar.gz
tar -xzf test_data.tar.gz

Run a test job with the test_data

tophat -r 20 test_ref reads_1.fq reads_2.fq

The -r parameter

-r/--mate-inner-dist

This is the expected (mean) inner distance between mate pairs. For example, for paired-end runs with fragments selected at 300bp, where each end is 50bp, you should set -r to 200. There is no default, and this parameter is required for paired-end runs.
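The arithmetic is simply the fragment size minus both read lengths (mate_inner_dist is a made-up helper to illustrate):

```python
def mate_inner_dist(fragment_size, read_length):
    """Expected inner distance between paired-end mates:
    the fragment minus the two sequenced ends."""
    return fragment_size - 2 * read_length

print(mate_inner_dist(300, 50))
# → 200
```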

Interpreting the test results

The reference sequence, where the strings of A's serve as introns:

cat test_ref.fa
>test_chromosome
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
ACTACTATCTGACTAGACTGGAGGCGCTTGCGACTGAGCTAGGACGTGCC
ACTACGGGGATGACGACTAGGACTACGGACGGACTTAGAGCGTCAGATGC
AGCGACTGGACTATTTAGGACGATCGGACTGAGGAGGGCAGTAGGACGCT
ACGTATTTGGCGCGCGGCGCTACGGCTGAGCGTCGAGCTTGCGATACGCC
GTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG
ACTATTACTTTATTATCTTACTCGGACGTAGACGGATCGGCAACGGGACT
GTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG
TTTTCTACTTGAGACTGGGATCGAGGCGGACTTTTTAGGACGGGACTTGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

There are 100 paired end reads in reads_1.fq and reads_2.fq.

head -4 reads_1.fq
@test_mRNA_150_290_0/1
TCCTAAAAAGTCCGCCTCGGTCTCAGTCTCAAGTAGAAAAAGTCCCGTTGGCGATCCGTCTACGTCCGAGTAAGA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

head -4 reads_2.fq
@test_mRNA_150_290_0/2
TACGTATTTGTCGCGCGGCCCTACGGCTGAGCGTCGAGCTTGCGATCCGCCACTATTACTTTATTATCTTACTCG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

If TopHat ran properly, viewing the accepted_hits.bam file in IGV should show reads spanning the junctions.

Analysing the Marioni et al. data

First we need to index the reference genome (alternatively, a pre-built index can be downloaded from ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/hg19.ebwt.zip):

bowtie2-build /path/to/hg19 hg19

On one core (Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz) of my Ubuntu box, indexing the hg19 genome took roughly 100 minutes. If your download speed is slow, it may be quicker to index the reference yourself.

Start TopHat on one of the runs (the coverage-search algorithm was turned off because it took too long and almost used up my 8 gigs of memory):

tophat --num-threads 8 --no-coverage-search ~/genome/hg19 SRR002321.fastq

Log

[2012-08-19 16:32:31] Beginning TopHat run (v2.0.4)
-----------------------------------------------
[2012-08-19 16:32:31] Checking for Bowtie
Bowtie version: 2.0.0.7
[2012-08-19 16:32:31] Checking for Samtools
Samtools version: 0.1.18.0
[2012-08-19 16:32:31] Checking for Bowtie index files
[2012-08-19 16:32:31] Checking for reference FASTA file
[2012-08-19 16:32:31] Generating SAM header for /home/davetang/genome/hg19
format: fastq
quality scale: phred33 (default)
[2012-08-19 16:33:32] Preparing reads
left reads: min. length=36, max. length=36, 53482003 kept reads (1374268 discarded)
Warning: you have only one segment per read.
If the read length is greater than or equal to 45bp,
we strongly recommend that you decrease --segment-length to about half the read length because TopHat will work better with multiple segments
[2012-08-19 16:39:02] Mapping left_kept_reads to genome hg19 with Bowtie2
[2012-08-19 17:09:59] Searching for junctions via segment mapping
Warning: junction database is empty!
[2012-08-19 17:11:23] Reporting output tracks
-----------------------------------------------
[2012-08-19 17:46:45] Run complete: 01:14:14 elapsed

Alcohol dehydrogenase 1b on IGV, where red indicates reads mapped to the positive strand and blue to the negative strand:

Running TopHat with coverage search on each individual lane (reads for each lane were separated from the fastq files using the script above):

for file in *.fastq; do base=$(basename "$file" .fastq); nice tophat --num-threads 30 --output-dir $base hg19_male $file; done

Using BWA

As a comparison, I also aligned the data using BWA, following the same strategy used in the paper: allowing two mismatches and discarding multimappers (although they used ELAND, the Illumina-supplied mapping program). They could align ~40% of their reads uniquely to a genomic location. Here are my mapping statistics using the same approach but with BWA:

run1 read unique_mapped percent
Lane1 13,017,169 4,555,926 35.00
Lane2 14,003,322 4,760,137 33.99
Lane3 13,401,343 4,747,605 35.43
Lane4 14,230,879 4,808,619 33.79
Lane6 13,525,355 4,609,224 34.08
Lane7 12,848,201 4,346,034 33.83
Lane8 13,096,715 4,433,962 33.86
run2 read unique_mapped percent
Lane1 9,096,595 4,034,396 44.35
Lane2 13,687,929 5,003,548 36.55
Lane3 14,761,931 5,111,827 34.63
Lane4 8,843,158 4,252,449 48.09
Lane6 13,449,864 5,175,652 38.48
Lane7 9,341,101 4,379,373 46.88
Lane8 8,449,276 4,137,438 48.97

The mean of all the mapping percentages is 38.42, which is roughly 40%. Next I used a simple approach to annotate these reads against the Ensembl database: I downloaded the Ensembl transcripts as a bed file from the UCSC Table Browser and used intersectBed. The fraction of uniquely mapped reads overlapping Ensembl exons is also around 40%:

run1 unique_mapped ensembl_exon percent
Lane1 4,555,926 1,932,692 42.42
Lane2 4,760,137 1,864,202 39.16
Lane3 4,747,605 2,004,449 42.22
Lane4 4,808,619 1,881,362 39.12
Lane6 4,609,224 1,793,735 38.92
Lane7 4,346,034 1,829,295 42.09
Lane8 4,433,962 1,728,134 38.97
run2 unique_mapped ensembl_exon percent
Lane1 4,034,396 1,636,857 40.57
Lane2 5,003,548 2,145,708 42.88
Lane3 5,111,827 2,016,176 39.44
Lane4 4,252,449 1,856,823 43.66
Lane6 5,175,652 2,211,203 42.72
Lane7 4,379,373 1,770,117 40.42
Lane8 4,137,438 1,806,110 43.65
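The quoted mean of 38.42 can be recomputed from the percent column of the BWA mapping table above:

```python
# unique-mapping percentages per lane, runs 1 and 2, from the BWA table
percent = [35.00, 33.99, 35.43, 33.79, 34.08, 33.83, 33.86,   # run 1
           44.35, 36.55, 34.63, 48.09, 38.48, 46.88, 48.97]   # run 2
mean = sum(percent) / len(percent)
print(round(mean, 2))
# → 38.42
```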

More information

TopHat manual
RNA-Seq protocol used for these libraries