Silly mnemonics

Back in first year genetics (i.e. genetics 101), our tutor told us of a way to remember pyrimidines and purines. She said pyrimidines reminded her of the pyramids, and therefore of Cleopatra and Tutankhamun, and therefore of Cytosine and Thymine. We laughed, but to this day that's how I remember the pyrimidines.

Another thing I keep forgetting is histone modifications. So in the spirit of my tutor, I came up with ACL (since I play a lot of basketball and a torn ACL is a very common basketball injury), which stands for ACetylation of Lysine residues of histone proteins. A torn ACL severely weakens your ability to play basketball, so I associated it with the weakened association of DNA with histones.

Quite silly but at least now I will remember it.

Related:

http://davetang.org/muse/2012/02/06/h3k27ac/

How to deal with multi mapping reads

Eukaryotic genomes are repetitive in nature, i.e. the sequence content is not unique. When mapping high throughput sequencing reads back to the genome, whether for de novo assembly or for RNA sequencing, a subset of reads will map to more than one location. Some people refer to these reads as multi-reads, short for multi mapping reads. One way of dealing with multi mapping reads is to discard them; however, the results may be biased and, even worse, something interesting may be thrown away.

One strategy for assigning multi mapping reads from an RNA-seq experiment is to examine the mapping locations of all reads (single and multi mapping) and use the information from the single mapping reads to probabilistically assign each multi mapping read to a location. To illustrate with a simple example, imagine that 100 reads map uniquely to chr1:100-126 and another 100 reads map to both chr1:102-128 AND chrX:102-128. The idea is that the 100 multi mapping reads "probably" arose from the genomic entity at chr1:100.

Below is a demonstration of MuMRescueLite, which is written in Python.

#Usage
python MuMRescueLite.py <input file> <output file> <window>

The columns of the input file are ID, locations, chromosome, strand, start, end and count. Below is the input file for the example discussed above:

#ID	locations	chromosome	strand	start	end	count
read_x	1	chr1	+	100	126	100
read_y	2	chr1	+	102	128	100
read_y	2	chrX	+	102	128	100

Now running the script:

python MuMRescueLite.py test.tsv out.tsv 4
cat out.tsv
tagID	mapPositions	chromosome	strand	start	stop	count	weight
read_x	1	chr1	+	100	126	100	1.0
read_y	2	chr1	+	102	128	100	1.0
read_y	2	chrX	+	102	128	100	0.0

For the window size, I used 4; here's a detailed description of the window parameter from the documentation:

window: the number of bases around each MuM location to seek single mapped tags at each multi mapping location; MuMRescueLite will search a length of window/2 upstream and downstream of a given MuM location.

Notice the weight column in the results, which the documentation describes as:

weight as probability for this sequence of this mapped position; 1.0 for the single mapped sequences, from 0.0 to 1.0 for the multi mapped tags

So read_y (the multi mapping read) has been given a weight of 1.0 at the location chr1:102-128.

Here's one more example:

#ID	locations	chromosome	strand	start	end	count
read_x	1	chr1	+	100	126	100
read_y	2	chr1	+	102	128	100
read_y	2	chrX	+	102	128	100
read_z	1	chrX	+	100	126	50
python MuMRescueLite.py test.tsv out.tsv 4
tagID	mapPositions	chromosome	strand	start	stop	count	weight
read_x	1	chr1	+	100	126	100	1.0
read_y	2	chr1	+	102	128	100	0.666666666667
read_y	2	chrX	+	102	128	100	0.333333333333
read_z	1	chrX	+	100	126	50	1.0

read_y is split based on the number of single mapping tags from read_x and read_z. There are more reads at chr1:100-126, hence more weight is given to this locus; the weights are in the proportion 100:50, i.e. 2:1.
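To make the rescue step concrete, below is a minimal Python sketch of the weighting logic; this is my own toy code, not the actual MuMRescueLite implementation, and the function names and simplified data structures are assumptions. For each locus of a multi mapping read, it sums the single mapping read counts within window/2 of that locus and divides by the total across all of the read's loci.

#toy sketch of MuMRescueLite-style weighting; not the actual implementation
def sim_support(sims, chromosome, strand, start, window):
    #sum the counts of single mapping reads within window/2 of a locus
    half = window / 2
    return sum(count for (s, count) in sims.get((chromosome, strand), [])
               if abs(s - start) <= half)

def rescue_weights(sims, loci, window):
    #return one weight per locus, proportional to nearby SiM support
    support = [sim_support(sims, c, st, s, window) for (c, st, s) in loci]
    total = float(sum(support))
    if total == 0:
        return [0.0] * len(support)  # no SiM nearby: the MuM is not rescued
    return [x / total for x in support]

#the second example above: 100 SiMs at chr1:100 and 50 SiMs at chrX:100
sims = {('chr1', '+'): [(100, 100)], ('chrX', '+'): [(100, 50)]}
loci = [('chr1', '+', 102), ('chrX', '+', 102)]
print(rescue_weights(sims, loci, 4))  # [0.666..., 0.333...]

Running this reproduces the weights in the output above: chr1:102-128 receives 100/150 = 0.67 and chrX:102-128 receives 50/150 = 0.33.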

Conclusion

The fundamental goal of MuMRescueLite is to calculate the probability that from a set of possible loci, a given locus is the true source of a MuM. This is achieved by counting the SiMs that occur in a nominal window around each locus occupied by a MuM and dividing this count by the total number of SiMs proximal to all loci associated with that MuM. In this way, MuMs are assigned proportionately to each of their loci and are therefore ‘rescued’. MuMs that do not coincide with at least one SiM are not rescued.

Using blat

My multipurpose sequence aligner of choice for many years has been blat. This is a short post on the basics of blat. To use blat, download the 64-bit Linux version of blat (or a version that matches your operating system) here. When aligning sequences to the genome, make sure you use the 64-bit version or else you will receive an out of memory error (needHugeMem: Out of huge memory - request size).

To get started, below is a slide I made a couple of years ago:

First, blat splits the reference sequence into "tiles". How the reference is split depends on two parameters, -tileSize and -stepSize: -tileSize sets the size of each tile and -stepSize sets how far apart consecutive tiles start. The default for both for DNA sequences is 11, which means the tiles do not overlap.

For blat to report an alignment, your query sequence must match at least two tiles (set via -minMatch) with no mismatches (you can allow up to one mismatch per tile using -oneOff). So if you're trying to align a 21 bp sequence to a reference using the default settings, blat will never report an alignment: a 21 bp query cannot span two complete non-overlapping 11-mers, which requires at least 22 bp.

To illustrate, imagine this reference sequence (44bp):

>database
AAAAAAAAAAACCCCCCCCCCCGGGGGGGGGGGTTTTTTTTTTT

and this query sequence (12 bp):

>test
GGGGGGGGGGGT

The only way an alignment will be reported is if -stepSize is set to 1, so that the database tiles overlap enough for the 12 bp query to match two of them, and -minScore is set to 12 or less.

#returns no hit
blat -minScore=0 -stepSize=2 database.fa test.fa output.psl
#returns 2 hits
blat -minScore=0 -stepSize=1 database.fa test.fa output.psl
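
To see why -stepSize makes the difference here, below is a toy Python model of blat's seeding stage; this is my own sketch, not blat's actual code. It enumerates the database tiles and counts how many of the query's 11-mers hit a tile exactly; with the default -minMatch of 2, an alignment is only triggered when at least two tiles match.

#toy model of blat's tile seeding; not the actual implementation
def tile_hits(database, query, tile_size=11, step_size=11):
    #tiles start every step_size bases along the database
    tiles = set(database[i:i + tile_size]
                for i in range(0, len(database) - tile_size + 1, step_size))
    #the query is scanned at every offset
    kmers = [query[i:i + tile_size] for i in range(len(query) - tile_size + 1)]
    return sum(1 for kmer in kmers if kmer in tiles)

database = 'A' * 11 + 'C' * 11 + 'G' * 11 + 'T' * 11
query = 'G' * 11 + 'T'

print(tile_hits(database, query, step_size=2))  # 1 tile hit; below minMatch
print(tile_hits(database, query, step_size=1))  # 2 tile hits; alignment triggered

With -stepSize=2 only the tile of eleven G's matches the query, but with -stepSize=1 the tile shifted one base towards the T's also matches the query's second 11-mer, satisfying -minMatch=2.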

Here's an old post showing how the blat parameters affect the output.

Running blat in parallel

Before running blat in parallel, check how many processors are on your machine (32 are available in this example):

cat /proc/cpuinfo | grep "^processor" | wc
     32      96     470

To use blat in parallel, download GNU Parallel and compile it. Download the hg19 chromosomes. Let's align a bunch of long non-coding RNAs (lncRNAs) to hg19: download the human lncRNA dataset from NONCODE and unzip it. There should be 33,829 sequences in the human.fa file.

unzip lncRNA_human.zip
cd human
cat human.fa | grep "^>" | wc
  33829  692926 4407633

Now extract the hg19 tarball. I'm only interested in the assembled chromosomes, so I will delete the rest.

tar -xzf chromFa.tar.gz
#ls to check whether I'm deleting the right files
ls *random.fa
chr11_gl000202_random.fa  chr17_gl000206_random.fa  chr1_gl000191_random.fa   chr4_gl000194_random.fa  chr9_gl000198_random.fa
chr17_gl000203_random.fa  chr18_gl000207_random.fa  chr1_gl000192_random.fa   chr7_gl000195_random.fa  chr9_gl000199_random.fa
chr17_gl000204_random.fa  chr19_gl000208_random.fa  chr21_gl000210_random.fa  chr8_gl000196_random.fa  chr9_gl000200_random.fa
chr17_gl000205_random.fa  chr19_gl000209_random.fa  chr4_gl000193_random.fa   chr8_gl000197_random.fa  chr9_gl000201_random.fa
rm *random.fa
#ls to check again
ls chrUn*.fa
chrUn_gl000211.fa  chrUn_gl000216.fa  chrUn_gl000221.fa  chrUn_gl000226.fa  chrUn_gl000231.fa  chrUn_gl000236.fa  chrUn_gl000241.fa  chrUn_gl000246.fa
chrUn_gl000212.fa  chrUn_gl000217.fa  chrUn_gl000222.fa  chrUn_gl000227.fa  chrUn_gl000232.fa  chrUn_gl000237.fa  chrUn_gl000242.fa  chrUn_gl000247.fa
chrUn_gl000213.fa  chrUn_gl000218.fa  chrUn_gl000223.fa  chrUn_gl000228.fa  chrUn_gl000233.fa  chrUn_gl000238.fa  chrUn_gl000243.fa  chrUn_gl000248.fa
chrUn_gl000214.fa  chrUn_gl000219.fa  chrUn_gl000224.fa  chrUn_gl000229.fa  chrUn_gl000234.fa  chrUn_gl000239.fa  chrUn_gl000244.fa  chrUn_gl000249.fa
chrUn_gl000215.fa  chrUn_gl000220.fa  chrUn_gl000225.fa  chrUn_gl000230.fa  chrUn_gl000235.fa  chrUn_gl000240.fa  chrUn_gl000245.fa
rm chrUn*.fa
ls *hap*.fa
chr17_ctg5_hap1.fa  chr6_apd_hap1.fa  chr6_dbb_hap3.fa   chr6_mcf_hap5.fa  chr6_ssto_hap7.fa
chr4_ctg9_hap1.fa   chr6_cox_hap2.fa  chr6_mann_hap4.fa  chr6_qbl_hap6.fa
rm *hap*.fa
#22 autosomes + chrX, chrY and chrM
ls *.fa | wc
     25      25     213

Create a test fasta file containing the sequence for n37. I checked the human.bed file for n37 (NONCODE also provides the mapping locations as a bed file) and saw that it maps to chr11.

cat human/human.fa | grep -v "^#" | head -2 > human/test.fa
cat human/human.bed | awk '$4 == "n37" {print}'
chr11   2161746 2169894 n37     992     +       2161746 2169894 0       7       431,181,100,293,38,682,331,     0,5728,5911,6801,7095,7134,7817,
#run a test blat job using default parameters
blat chr11.fa human/test.fa test.psl
cat test.psl
#results are the same
psLayout version 3

match   mis-    rep.    N's     Q gap   Q gap   T gap   T gap   strand  Q               Q       Q       Q       T               T       T       T       block   blockSizes   qStarts  tStarts
        match   match           count   bases   count   bases           name            size    start   end     name            size    start   end     count
---------------------------------------------------------------------------------------------------------------------------------------------------------------
2046    10      0       0       0       0       6       6092    +       n37     2056    0       2056    chr11   135006516       2161746 2169894 7       431,181,100,293,38,682,331,  0,431,612,712,1005,1043,1725,   2161746,2167474,2167657,2168547,2168841,2168880,2169563,

Now to do this in parallel and convert the results from psl to bed. I wrote a script called psl_to_bed_best_score.pl a while ago, which we can use to parse the psl files.

parallel blat chr{}.fa human/test.fa test_{}.psl ::: {1..22} X Y M
ls *.psl
test_1.psl   test_11.psl  test_13.psl  test_15.psl  test_17.psl  test_19.psl  test_20.psl  test_22.psl  test_4.psl  test_6.psl  test_8.psl  test_M.psl  test_Y.psl
test_10.psl  test_12.psl  test_14.psl  test_16.psl  test_18.psl  test_2.psl   test_21.psl  test_3.psl   test_5.psl  test_7.psl  test_9.psl  test_X.psl
cat *.psl > all.psl
psl_to_bed_best_score.pl all.psl all.bed
cat all.bed
chr11   2161746 2169894 n37     2030    +       2161746 2169894 255,0,0 7       431,181,100,293,38,682,331,     0,5728,5911,6801,7095,7134,7817
#almost identical to the bed file provided by NONCODE, apart from the score and itemRgb columns
cat human/human.bed | awk '$4 == "n37" {print}'
chr11   2161746 2169894 n37     992     +       2161746 2169894 0       7       431,181,100,293,38,682,331,     0,5728,5911,6801,7095,7134,7817,
#try it on all lncRNA but first remove the comments from the human.fa file
cat human/human.fa | grep -v "^#" > tmp
mv tmp human/human.fa
time parallel blat chr{}.fa human/human.fa test_{}.psl ::: {1..22} X Y M

#well that took too long!
real    16996m51.851s
user    301319m54.423s
sys     50m14.070s

See also this useful post for more examples of how to use GNU parallel.

blat parameters

For convenience.

blat - Standalone BLAT v. 34 fast sequence search command line tool
usage:
   blat database query [-ooc=11.ooc] output.psl
where:
   database and query are each either a .fa , .nib or .2bit file,
   or a list these files one file name per line.
   -ooc=11.ooc tells the program to load over-occurring 11-mers from
               and external file.  This will increase the speed
               by a factor of 40 in many cases, but is not required
   output.psl is where to put the output.
   Subranges of nib and .2bit files may specified using the syntax:
      /path/file.nib:seqid:start-end
   or
      /path/file.2bit:seqid:start-end
   or
      /path/file.nib:start-end
   With the second form, a sequence id of file:start-end will be used.
options:
   -t=type     Database type.  Type is one of:
                 dna - DNA sequence
                 prot - protein sequence
                 dnax - DNA sequence translated in six frames to protein
               The default is dna
   -q=type     Query type.  Type is one of:
                 dna - DNA sequence
                 rna - RNA sequence
                 prot - protein sequence
                 dnax - DNA sequence translated in six frames to protein
                 rnax - DNA sequence translated in three frames to protein
               The default is dna
   -prot       Synonymous with -t=prot -q=prot
   -ooc=N.ooc  Use overused tile file N.ooc.  N should correspond to
               the tileSize
   -tileSize=N sets the size of match that triggers an alignment.
               Usually between 8 and 12
               Default is 11 for DNA and 5 for protein.
   -stepSize=N spacing between tiles. Default is tileSize.
   -oneOff=N   If set to 1 this allows one mismatch in tile and still
               triggers an alignments.  Default is 0.
   -minMatch=N sets the number of tile matches.  Usually set from 2 to 4
               Default is 2 for nucleotide, 1 for protein.
   -minScore=N sets minimum score.  This is the matches minus the
               mismatches minus some sort of gap penalty.  Default is 30
   -minIdentity=N Sets minimum sequence identity (in percent).  Default is
               90 for nucleotide searches, 25 for protein or translated
               protein searches.
   -maxGap=N   sets the size of maximum gap between tiles in a clump.  Usually
               set from 0 to 3.  Default is 2. Only relevent for minMatch > 1.
   -noHead     suppress .psl header (so it's just a tab-separated file)
   -makeOoc=N.ooc Make overused tile file. Target needs to be complete genome.
   -repMatch=N sets the number of repetitions of a tile allowed before
               it is marked as overused.  Typically this is 256 for tileSize
               12, 1024 for tile size 11, 4096 for tile size 10.
               Default is 1024.  Typically only comes into play with makeOoc.
               Also affected by stepSize. When stepSize is halved repMatch is
               doubled to compensate.
   -mask=type  Mask out repeats.  Alignments won't be started in masked region
               but may extend through it in nucleotide searches.  Masked areas
               are ignored entirely in protein or translated searches. Types are
                 lower - mask out lower cased sequence
                 upper - mask out upper cased sequence
                 out   - mask according to database.out RepeatMasker .out file
                 file.out - mask database according to RepeatMasker file.out
   -qMask=type Mask out repeats in query sequence.  Similar to -mask above but
               for query rather than target sequence.
   -repeats=type Type is same as mask types above.  Repeat bases will not be
               masked in any way, but matches in repeat areas will be reported
               separately from matches in other areas in the psl output.
   -minRepDivergence=NN - minimum percent divergence of repeats to allow
               them to be unmasked.  Default is 15.  Only relevant for
               masking using RepeatMasker .out files.
   -dots=N     Output dot every N sequences to show program's progress
   -trimT      Trim leading poly-T
   -noTrimA    Don't trim trailing poly-A
   -trimHardA  Remove poly-A tail from qSize as well as alignments in
               psl output
   -fastMap    Run for fast DNA/DNA remapping - not allowing introns,
               requiring high %ID
   -out=type   Controls output file format.  Type is one of:
                   psl - Default.  Tab separated format, no sequence
                   pslx - Tab separated format with sequence
                   axt - blastz-associated axt format
                   maf - multiz-associated maf format
                   sim4 - similar to sim4 format
                   wublast - similar to wublast format
                   blast - similar to NCBI blast format
                   blast8- NCBI blast tabular format
                   blast9 - NCBI blast tabular format with comments
   -fine       For high quality mRNAs look harder for small initial and
               terminal exons.  Not recommended for ESTs
   -maxIntron=N  Sets maximum intron size. Default is 750000
   -extendThroughN - Allows extension of alignment through large blocks of N's

Getting started with TopHat

Updated links for the binaries on 2015 March 2nd

TopHat is a tool that can find splice junctions without a reference annotation. By first mapping RNA-Seq reads to the genome (using Bowtie/Bowtie2), TopHat identifies potential exons, since many RNA-Seq reads will align contiguously to the genome. Using this initial mapping information, TopHat builds a database of possible splice junctions, and then maps the reads against these junctions to confirm them. In the past, people built splice junctions based on known references, such as RefSeq; TopHat allows a user to find potentially novel splice variants.

I will use RNA-Seq data from the Genome Research paper by Marioni et al., published in 2008, to test TopHat. I found it funny that the submission title for their dataset was "RNASeq: the death Knell of expression arrays?"; I guess they decided to go with something much less morbid when they finally published their paper. Their sequence data was downloaded from DDBJ under the accession number SRA000299.

There are two sample descriptions for this submission, which are identical:

SRS000561: We extracted total RNA from liver and kidney samples of a single human male, puri_ed [sic] the poly-A mRNA and sheared it prior to cDNA synthesis
SRS000562: We extracted total RNA from liver and kidney samples of a single human male, puri_ed [sic] the poly-A mRNA and sheared it prior to cDNA synthesis

There are four experiments for this submission, totalling 6 runs:

SRX000571 -> SRR002321 and SRR002323 -> 080226_CMLIVER_0007_3pM
SRX000604 -> SRR002322 -> 080226_CMLIVER_0007_1.5pM
SRX000605 -> SRR002320 and SRR002325 -> 080226_CMKIDNEY_0007_3pM
SRX000606 -> SRR002324 -> 080226_CMKIDNEY_0007_1.5pM

Now to download the 6 sequence runs:

#!/bin/bash
wget ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA000/SRA000299/SRX000571/SRR002321.fastq.bz2
wget ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA000/SRA000299/SRX000571/SRR002323.fastq.bz2
wget ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA000/SRA000299/SRX000604/SRR002322.fastq.bz2
wget ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA000/SRA000299/SRX000605/SRR002320.fastq.bz2
wget ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA000/SRA000299/SRX000605/SRR002325.fastq.bz2
wget ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA000/SRA000299/SRX000606/SRR002324.fastq.bz2

Number of reads in each file

SRR002320.fastq 39,266,713
SRR002321.fastq 54,856,271
SRR002322.fastq 18,437,696
SRR002323.fastq 14,761,931
SRR002324.fastq 17,292,434
SRR002325.fastq 27,137,793

Total 171,752,838

From the paper, there were two full sequencing runs, each with seven sample lanes (lane 5 of each run was reserved for the PhiX control):

lane    run1        conc    run2        conc
Lane1   kidney      3pM     liver       1.5pM
Lane2   liver       3pM     kidney      3pM
Lane3   kidney      3pM     liver       3pM
Lane4   liver       3pM     kidney      1.5pM
Lane5   CNTL-PhiX   3pM     CNTL-PhiX   1.5pM
Lane6   liver       3pM     kidney      3pM
Lane7   kidney      3pM     liver       1.5pM
Lane8   liver       3pM     kidney      1.5pM

Read number statistics as described in the paper (note the discrepancy with the above table for the second run: the liver and kidney labels are swapped). All of this information is from the supplementary material downloaded from their website; the table above shows the same information as presented in their paper:

Run1
L1 - kidney 13,017,169
L2 - liver 14,003,322
L3 - kidney 13,401,343
L4 - liver 14,230,879
L6 - liver 13,525,355
L7 - kidney 12,848,201
L8 - liver 13,096,715
Total 94,122,984

Run2
L1 - kidney 9,096,595
L2 - liver 13,687,929
L3 - kidney 14,761,931
L4 - liver 8,843,158
L6 - liver 13,449,864
L7 - kidney 9,341,101
L8 - liver 8,449,276
Total 77,629,854

The total of the two runs is 171,752,838, which equals the total number of reads in the fastq files. The sequence data from the 14 lanes has been condensed into the 6 fastq files, but we can separate the lanes using the information in the fastq definition line. Here's what each part of the definition line represents:

@SRR002321.1 080226_CMLIVERKIDNEY_0007:2:1:115:885

"SRR002321.1" is the short read archive name, where the .1 is the read number
"080226_CMLIVERKIDNEY_0007" should be the Machine name, which has been arbitrarily changed
"2" is the Channel/lane number
"1" is the Tile number
"115" is the X position
"885" is the Y position

So we can separate out the reads based on the channel/lane number. However, there is a complication: we also need to separate the two different runs. Given that the two runs were done on different dates (according to the supplementary information), namely 3rd March 2008 and 10th March 2008, these presumably correspond to 080226_CMLIVERKIDNEY_0007 and 080317_CM-KID-LIV-2-REPEAT_0003 respectively. I scanned all 171,752,838 reads and found that every read was assigned to either 080226_CMLIVERKIDNEY_0007 or 080317_CM-KID-LIV-2-REPEAT_0003, and by separating out the reads by these identifiers and by lane I get the same read counts as shown above:

080226_CMLIVERKIDNEY_0007_1 13,017,169
080226_CMLIVERKIDNEY_0007_2 14,003,322
080226_CMLIVERKIDNEY_0007_3 13,401,343
080226_CMLIVERKIDNEY_0007_4 14,230,879
080226_CMLIVERKIDNEY_0007_6 13,525,355
080226_CMLIVERKIDNEY_0007_7 12,848,201
080226_CMLIVERKIDNEY_0007_8 13,096,715
080317_CM-KID-LIV-2-REPEAT_0003_1 9,096,595
080317_CM-KID-LIV-2-REPEAT_0003_2 13,687,929
080317_CM-KID-LIV-2-REPEAT_0003_3 14,761,931
080317_CM-KID-LIV-2-REPEAT_0003_4 8,843,158
080317_CM-KID-LIV-2-REPEAT_0003_6 13,449,864
080317_CM-KID-LIV-2-REPEAT_0003_7 9,341,101
080317_CM-KID-LIV-2-REPEAT_0003_8 8,449,276
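
The counts above can be reproduced with a few lines of code; here's a minimal Python sketch of the tallying step (my own sketch, equivalent in spirit to the Perl script below, which does the actual splitting):

#tally reads per machine name and lane from the fastq definition lines
from collections import defaultdict
import re

counts = defaultdict(int)
infiles = ['SRR002320.fastq', 'SRR002321.fastq', 'SRR002322.fastq',
           'SRR002323.fastq', 'SRR002324.fastq', 'SRR002325.fastq']

for fastq in infiles:
    with open(fastq) as f:
        for i, line in enumerate(f):
            if i % 4 != 0:  # only look at the @definition line of each entry
                continue
            #e.g. @SRR002321.1 080226_CMLIVERKIDNEY_0007:2:1:115:885
            m = re.match(r'^@\S+\s+(\S+):(\d+):\d+:\d+:\d+$', line.rstrip())
            if m:
                counts[m.group(1) + '_' + m.group(2)] += 1

for key in sorted(counts):
    print(key, counts[key])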

Script used to separate out the fastq files:


#!/usr/bin/perl

use strict;
use warnings;

my @infiles = qw/SRR002320.fastq SRR002321.fastq SRR002322.fastq SRR002323.fastq SRR002324.fastq SRR002325.fastq/;

#output filehandles, keyed by machine name + lane
my %fh_of = ();

foreach my $infile (@infiles){
   open(my $in, '<', $infile) || die "Could not open $infile: $!\n";
   while(my $header = <$in>){
      #each fastq entry spans four lines, for example:
      #@SRR002320.1 080226_CMLIVERKIDNEY_0007:1:1:112:735
      #GTGGTGGGGTTGGTATTTGGTTTCTCGTTTTAATTA
      #+
      #IIIIIIII"IIIII)I$I1%HII"I#./(#/'$#*#
      chomp $header;
      chomp(my $read = <$in>);
      chomp(my $plus_line = <$in>);
      chomp(my $quality = <$in>);

      if ($header =~ /^@(SRR\d+\.\d+)\s([a-zA-Z0-9_-]+):(\d+):(\d+):(\d+):(\d+)$/){
         my ($srr, $name, $lane, $tile, $x, $y) = ($1, $2, $3, $4, $5, $6);
         my $id = $name . '_' . $lane;
         #open an output file the first time we see this machine/lane combination
         if (!exists $fh_of{$id}){
            open($fh_of{$id}, '>', "$id.fastq") || die "Could not write to $id.fastq: $!\n";
         }
         print {$fh_of{$id}} "$header\n$read\n$plus_line\n$quality\n";
      } else {
         die "Error parsing line $.: $header\n";
      }
   }
   close($in);
}

close($_) for values %fh_of;

exit(0);

Setting up TopHat

Download the binaries for bowtie2:

wget http://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.2.4/bowtie2-2.2.4-linux-x86_64.zip
unzip bowtie2-2.2.4-linux-x86_64.zip

Download the binaries for tophat2:

wget http://ccb.jhu.edu/software/tophat/downloads/tophat-2.0.13.Linux_x86_64.tar.gz
tar -xzf tophat-2.0.13.Linux_x86_64.tar.gz

Download the test data:

wget http://ccb.jhu.edu/software/tophat/downloads/test_data.tar.gz
tar -xzf test_data.tar.gz

Run a test job with the test_data

tophat -r 20 test_ref reads_1.fq reads_2.fq

The -r parameter

-r/--mate-inner-dist

This is the expected (mean) inner distance between mate pairs. For example, for paired end runs with fragments selected at 300 bp, where each end is 50 bp, you should set -r to 200. There is no default, and this parameter is required for paired end runs.
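
In other words, -r is simply the fragment length minus the lengths of both reads; a trivial sketch to check the example above:

#-r is the fragment length minus the lengths of both reads
def mate_inner_dist(fragment_length, read_length):
    return fragment_length - 2 * read_length

print(mate_inner_dist(300, 50))  # 200, as in the example above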

Interpreting the test results

The reference sequence, where the strings of A's serve as introns:

cat test_ref.fa
>test_chromosome
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
ACTACTATCTGACTAGACTGGAGGCGCTTGCGACTGAGCTAGGACGTGCC
ACTACGGGGATGACGACTAGGACTACGGACGGACTTAGAGCGTCAGATGC
AGCGACTGGACTATTTAGGACGATCGGACTGAGGAGGGCAGTAGGACGCT
ACGTATTTGGCGCGCGGCGCTACGGCTGAGCGTCGAGCTTGCGATACGCC
GTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG
ACTATTACTTTATTATCTTACTCGGACGTAGACGGATCGGCAACGGGACT
GTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG
TTTTCTACTTGAGACTGGGATCGAGGCGGACTTTTTAGGACGGGACTTGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

There are 100 paired end reads in reads_1.fq and reads_2.fq.

head -4 reads_1.fq
@test_mRNA_150_290_0/1
TCCTAAAAAGTCCGCCTCGGTCTCAGTCTCAAGTAGAAAAAGTCCCGTTGGCGATCCGTCTACGTCCGAGTAAGA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

head -4 reads_2.fq
@test_mRNA_150_290_0/2
TACGTATTTGTCGCGCGGCCCTACGGCTGAGCGTCGAGCTTGCGATCCGCCACTATTACTTTATTATCTTACTCG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

If TopHat ran properly, your output would look something like this using IGV and viewing the accepted_hits.bam file:

Analysing the Marioni et al. data

First we need to index the reference genome, or you can download a pre-built index from ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/hg19.ebwt.zip.

bowtie2-build /path/to/hg19 hg19

On one core (Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz) of my Ubuntu box, it took roughly 100 minutes to index the hg19 genome. If your download speed is slow, it may be quicker to index the reference yourself.

Start TopHat on one of the runs (the coverage-search algorithm was turned off because it took too long and almost used up my 8 GB of memory):

tophat --num-threads 8 --no-coverage-search ~/genome/hg19 SRR002321.fastq

Log

[2012-08-19 16:32:31] Beginning TopHat run (v2.0.4)
-----------------------------------------------
[2012-08-19 16:32:31] Checking for Bowtie
Bowtie version: 2.0.0.7
[2012-08-19 16:32:31] Checking for Samtools
Samtools version: 0.1.18.0
[2012-08-19 16:32:31] Checking for Bowtie index files
[2012-08-19 16:32:31] Checking for reference FASTA file
[2012-08-19 16:32:31] Generating SAM header for /home/davetang/genome/hg19
format: fastq
quality scale: phred33 (default)
[2012-08-19 16:33:32] Preparing reads
left reads: min. length=36, max. length=36, 53482003 kept reads (1374268 discarded)
Warning: you have only one segment per read.
If the read length is greater than or equal to 45bp,
we strongly recommend that you decrease --segment-length to about half the read length because TopHat will work better with multiple segments
[2012-08-19 16:39:02] Mapping left_kept_reads to genome hg19 with Bowtie2
[2012-08-19 17:09:59] Searching for junctions via segment mapping
Warning: junction database is empty!
[2012-08-19 17:11:23] Reporting output tracks
-----------------------------------------------
[2012-08-19 17:46:45] Run complete: 01:14:14 elapsed

Alcohol dehydrogenase 1B on IGV, where reads mapped to the positive strand are shown in red and reads on the negative strand in blue:

Running TopHat with coverage search on each individual lane (reads for each lane were separated from the fastq files using the script above):

for file in `ls *.fastq`; do base=`basename $file .fastq`; nice tophat --num-threads 30 --output-dir $base hg19_male $file; done

Using BWA

As a comparison, I also aligned the data using BWA, following the same strategy used in their paper: allowing two mismatches and discarding multi mapping reads (although they used ELAND, the Illumina-supplied mapping program). They could align ~40% of their reads uniquely to a genomic location. Here are my mapping statistics using the same approach but with BWA:

run1 read unique_mapped percent
Lane1 13,017,169 4,555,926 35.00
Lane2 14,003,322 4,760,137 33.99
Lane3 13,401,343 4,747,605 35.43
Lane4 14,230,879 4,808,619 33.79
Lane6 13,525,355 4,609,224 34.08
Lane7 12,848,201 4,346,034 33.83
Lane8 13,096,715 4,433,962 33.86
run2 read unique_mapped percent
Lane1 9,096,595 4,034,396 44.35
Lane2 13,687,929 5,003,548 36.55
Lane3 14,761,931 5,111,827 34.63
Lane4 8,843,158 4,252,449 48.09
Lane6 13,449,864 5,175,652 38.48
Lane7 9,341,101 4,379,373 46.88
Lane8 8,449,276 4,137,438 48.97

The mean of all the mapping percentages is 38.42%, which is roughly 40%. Next, I used a simple approach for annotating these reads with the Ensembl database: I downloaded the Ensembl transcripts as a bed file from the UCSC Table Browser and used intersectBed. However, the number of uniquely mapped reads that overlap exons is only around 40%:

run1 unique_mapped ensembl_exon percent
Lane1 4,555,926 1,932,692 42.42
Lane2 4,760,137 1,864,202 39.16
Lane3 4,747,605 2,004,449 42.22
Lane4 4,808,619 1,881,362 39.12
Lane6 4,609,224 1,793,735 38.92
Lane7 4,346,034 1,829,295 42.09
Lane8 4,433,962 1,728,134 38.97
run2 unique_mapped ensembl_exon percent
Lane1 4,034,396 1,636,857 40.57
Lane2 5,003,548 2,145,708 42.88
Lane3 5,111,827 2,016,176 39.44
Lane4 4,252,449 1,856,823 43.66
Lane6 5,175,652 2,211,203 42.72
Lane7 4,379,373 1,770,117 40.42
Lane8 4,137,438 1,806,110 43.65
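
For the record, the percent column in these tables is simply the exon-overlapping reads divided by the uniquely mapped reads; for example, for lane 1 of run 1:

#percent of uniquely mapped reads overlapping Ensembl exons (run1, lane 1)
unique_mapped = 4555926
ensembl_exon = 1932692
print(round(100.0 * ensembl_exon / unique_mapped, 2))  # 42.42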

More information

TopHat manual
RNA-Seq protocol used for these libraries

Installing Circos

A short post about installing Circos on Ubuntu, other Linux distributions and on Windows.

Note: on Ubuntu, the env program is located at /usr/bin/env, but the gddiag and circos programs use #!/bin/env, so running gddiag gives a "bad interpreter" error. Change the first line to #!/usr/bin/perl for both the gddiag and circos programs.

Installing Circos on Ubuntu

cat /etc/lsb-release 
#DISTRIB_ID=Ubuntu
#DISTRIB_RELEASE=12.04
#DISTRIB_CODENAME=precise
#DISTRIB_DESCRIPTION="Ubuntu 12.04 LTS"
wget http://circos.ca/distribution/circos-0.64.tgz
sudo cpan App::cpanminus
#change directory into where you unzipped circos and then the bin directory
cd circos-0.60/bin
test.modules > toget
cat toget | grep -v "^ok"
#fail Config::General is not usable (it or a sub-module is missing)
#fail Font::TTF::Font is not usable (it or a sub-module is missing)
#fail GD is not usable (it or a sub-module is missing)
#fail GD::Image is not usable (it or a sub-module is missing)
#fail GD::Polyline is not usable (it or a sub-module is missing)
#fail Math::Bezier is not usable (it or a sub-module is missing)
#fail Math::VecStat is not usable (it or a sub-module is missing)
#fail Readonly is not usable (it or a sub-module is missing)
#fail Regexp::Common is not usable (it or a sub-module is missing)
#fail Set::IntSpan is not usable (it or a sub-module is missing)
#fail Text::Format is not usable (it or a sub-module is missing)

sudo cpanm Config::General
sudo cpanm Font::TTF::Font
sudo cpanm GD
sudo cpanm GD::Image
sudo cpanm GD::Polyline
sudo cpanm Math::VecStat
sudo cpanm Readonly
sudo cpanm Regexp::Common
sudo cpanm Text::Format
sudo cpanm Math::Bezier
sudo cpanm Set::IntSpan

test.modules  | grep ^fail
fail GD is not usable (it or a sub-module is missing)
fail GD::Image is not usable (it or a sub-module is missing)
fail GD::Polyline is not usable (it or a sub-module is missing)

sudo apt-get -y install libgd2-xpm-dev build-essential

sudo cpanm GD
sudo cpanm GD::Image
sudo cpanm GD::Polyline

#all modules should be installed now
test.modules | grep "^fail"

#test GD
gddiag
#see if gddiag.png looks the same as the image at http://www.circos.ca/tutorials/lessons/configuration/png_output/images

#But I was still getting an error about error/configuration.missing.txt
circos 

#debuggroup conf 0.09s welcome to circos v0.60 4 May 2012
#debuggroup conf 0.09s guessing configuration file
#
#  *** CIRCOS ERROR ***
#
#  CONFIGURATION FILE ERROR

#  ...error text from [error/configuration.missing.txt] could not be read...

#  If you are having trouble debugging this error, use this tutorial to learn how
#  to use the debugging facility

#      http://www.circos.ca/tutorials/lessons/configuration/debugging

#  If you're still stumped, get support in the Circos Google Group

#      http://groups.google.com/group/circos-data-visualization

#  Stack trace:
# at /home/tan118/src/circos-0.60/bin/../lib/Circos/Error.pm line 325
#	Circos::Error::fatal_error('configuration', 'missing') called at /home/tan118/src/circos-0.60/bin/../lib/#Circos.pm line 152
#	Circos::run('Circos') called at bin/circos line 232

#I wrote to the author of Circos and turns out that
#the configuration.missing.txt file is missing in the tarball
#download the file and put it inside the error directory
wget http://davetang.org/file/configuration.missing.txt
mv configuration.missing.txt error

#In retrospect, the missing file above wouldn't have been a problem
#i.e. you could still run Circos if you gave it the right parameters
#However I wasn't sure what the error message:
#...error text from [error/configuration.missing.txt] could not be read...
#meant

Installing Circos on other Linux distributions

The easiest way to install Perl modules without administrator rights is to use ActivePerl; download it from http://www.activestate.com/activeperl/downloads. Once downloaded, just run the shell script, install.sh, and follow the instructions.

To install Perl modules using ActivePerl, use ppm. Here are the missing modules that I needed to install:

ppm install Config::General
ppm install Font::TTF::Font
ppm install Math::Bezier
ppm install Math::Round
ppm install Math::VecStat
ppm install Params::Validate
ppm install Regexp::Common
ppm install Set::IntSpan
ppm install Text::Format

Now if I run test.modules, all the modules are available.

To test that everything is working correctly, use the "run" script in the example folder, which should output a png and an svg file that look like this:

And on Windows

  1. Install ActivePerl
  2. Install cygwin

Now open up a cygwin terminal and extract the tarball:

tar -xzf circos-0.64.tgz
cd bin
test.modules
ok   Carp
ok   Clone
fail Config::General is not usable (it or a sub-module is missing)
ok   Cwd
ok   Data::Dumper
ok   Digest::MD5
ok   File::Basename
ok   File::Spec::Functions
ok   File::Temp
ok   FindBin
fail Font::TTF::Font is not usable (it or a sub-module is missing)
ok   GD
ok   GD::Image
ok   Getopt::Long
ok   IO::File
ok   List::MoreUtils
ok   List::Util
fail Math::Round is not usable (it or a sub-module is missing)
fail Math::VecStat is not usable (it or a sub-module is missing)
ok   Memoize
fail Params::Validate is not usable (it or a sub-module is missing)
ok   Pod::Usage
ok   POSIX
ok   Readonly
fail Regexp::Common is not usable (it or a sub-module is missing)
ok   Storable
ok   Sys::Hostname
ok   Text::Balanced
fail Text::Format is not usable (it or a sub-module is missing)
ok   Time::HiRes

The easiest way to install the missing packages is to open up the Perl Package Manager; it's not the fastest way, but it's the least complicated (because cygwin has its own ppm, which conflicts with the ActivePerl ppm). After searching for and installing the missing packages:

test.modules
ok   Carp
ok   Clone
ok   Config::General
ok   Cwd
ok   Data::Dumper
ok   Digest::MD5
ok   File::Basename
ok   File::Spec::Functions
ok   File::Temp
ok   FindBin
ok   Font::TTF::Font
ok   GD
ok   GD::Image
ok   Getopt::Long
ok   IO::File
ok   List::MoreUtils
ok   List::Util
ok   Math::Round
ok   Math::VecStat
ok   Memoize
ok   Params::Validate
ok   Pod::Usage
ok   POSIX
ok   Readonly
ok   Regexp::Common
ok   Storable
ok   Sys::Hostname
ok   Text::Balanced
ok   Text::Format
ok   Time::HiRes
#now cd into the example folder and run the script called "run"
#the run.out file will tell you that you're missing several other packages
#install those as well as per above using the Package Manager
#for example this is the last missing package
cat run.out

*** REQUIRED MODULE IS MISSING ***

You are missing the Perl module Set::IntSpan. Use CPAN to install it as described in this tutorial

http://www.circos.ca/documentation/tutorials/configuration/perl_and_modules
#now if you run "run" you should be able to generate the png file as above
run

More information

See http://circos.ca/software/download/circos/.

Equivalents in R, Python and Perl

Last update 2015 September 9th

I used Perl heavily for several years until I started my PhD back in 2010 (I still use it for many tasks, but much more sparingly). Perl was widely used back in the early days, when the human genome had yet to be sequenced, and this famous article explained some of the reasons. However, the general trend is that Perl is on the decline while Python is on the rise. There are many reasons for this trend; at the least, the biologists/bioinformaticians I have spoken to who program in Python all claim that, syntactically, Perl is ugly compared to Python. Perl was what I learned first when entering bioinformatics, so I got used to looking at Perl code. However, given this trend, I will need to learn some Python to be able to read the growing number of scripts/applications written in Python. Surely it's not necessary, but I have a penchant for looking into how things are done.
