Finding junctions with TopHat

For setting up TopHat see my previous post.

Here, I wanted to test whether TopHat can find junctions with single end 27bp reads.

The reference sequence I used was the test_ref.fa provided by the TopHat authors (see my previous post for the link), where the A's mark the intron regions:


To index the reference run:

bowtie2-build test_ref.fa test_ref

I made up a test read (also based on the test fastq file provided by the TopHat authors) in fastq format:


If you run TopHat with the default settings, no alignment will be returned:

tophat test_ref blah.fq

We need to modify the segment length parameter:

--segment-length Each read is cut up into segments, each at least this long. These segments are mapped independently. The default is 25.

Interestingly I get results when I specify a segment-length of 12, but not at a higher value (I would have assumed that I needed a segment length of 11):

tophat --segment-length 12 test_ref blah.fq
samtools view tophat_out/accepted_hits.bam
#test_mRNA_151_297_1d    0       test_chromosome 240     50      11M100N16M      *       0       0       TGCGATACGCCACTATTACTTTATTAT     IIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-6  XM:i:0  XO:i:0  XG:i:0  MD:Z:27 NM:i:0  XS:A:+  NH:i:1

From the CIGAR line, you can see how I split up the read; 11 nucleotides on one exon and 16 nucleotides on the other exon.

And lastly if I modify the --max-segment-intron parameter to anything less than 101, no results are returned

--max-segment-intron The maximum intron length that may be found during split-segment search. The default is 500000.

tophat --segment-length 12 --max-segment-intron 101 test_ref blah.fq #returns result
tophat --segment-length 12 --max-segment-intron 100 test_ref blah.fq #no result

Now the only questions are how trustworthy these junctions, detected from 27bp reads, are if you only have a few reads supporting the junction and how much more computational time it adds when we shorten the segment length.

Lastly, I have tested different --segment-length parameters (data not shown), and depending on how the read is split up, it is not the length of the segment that governs whether a match can be found, but whether the segment length matches how the read is split e.g. in a test where I split a 32bp read into 19/22, mapping results could be found with a segment length of 18 but not 10.

Annotating RNA-Seq data

After mapping your reads from an RNA-Seq experiment, usually the next task is identify the transcripts that the reads came from (i.e. annotating RNA-Seq data) and there are many ways of doing so. Here I just describe a rather crude method whereby I download sequence coordinates of hg19 RefSeqs as a BED12 file from the UCSC Genome Browser Table Browser and extrapolate the 5' UTR, exons, introns, 3' UTR and intergenic regions according to this file. I then output this as a bed file for use with intersectBed. A better (and much less error prone) method would be just to download each individual region of a RefSeq transcript from the Table Browser.

The process would have been quite straight forward, however since transcripts are located on the positive and negative strands and are spliced (even 5' and 3' UTRs), the script I wrote to create the annotation bed file is quite long and may contain unwanted features i.e. bugs. To use the script first download the refGene BED12 file from the UCSC Genome Browser Table Browser. Then sort the file by chromosome and then by the starting coordinate i.e. cat refgene.bed12 | sort -k1,1 -k2,2n > refgene_sorted.bed12. NOTE: this script will only work for the RefSeqs mapped to hg19, since I have hardcoded the chromosome ends within the script.

Without further ado:


use strict;
use warnings;

my $usage = "Usage: $0 <infile.bed>\n";
my $infile = shift or die $usage;

my %chr_end = ();
#DATA used for calculating final intergenic region
   my ($chr,$end) = split(/\s+/);
   $chr_end{$chr} = $end;

#used for calculating intergenic regions
my $end_of_transcript = '-1';
my $previous_chr = 'chr1';

open(IN,'<',$infile) || die "Could not open $infile: $!\n";
=head1 sample bed file

chr1    11873   14409   NR_046018       0       +       14409   14409   0       3       354,109,1189,   0,739,1347,
chr1    14361   29370   NR_024540       0       -       29370   29370   0       11      468,69,152,159,198,136,137,147,99,154,50,       0,608,1434,2245,2496,2871,3244,3553,3906,10376,14959,
chr1    34610   36081   NR_026818       0       -       36081   36081   0       3       564,205,361,    0,666,1110,

   my($chr,$start,$end,$id,$score,$strand,$thick_start,$thick_end,$rgb,$block_count,$block_size,$block_start) = split(/\t/);
   #just in case
   $chr= lc $chr;

   #work only with the assembled chromosomes
   next unless exists $chr_end{$chr};

   my @block_size = split(/,/,$block_size);
   my @block_start = split(/,/,$block_start);

   #declare UTR variables
   my $five_prime_utr_start = '0';
   my $five_prime_utr_end = '0';
   my $three_prime_utr_start = '0';
   my $three_prime_utr_end = '0';
   #utr information is present when the thick_start is not equal to the start and end
   if ($end != $thick_start && $thick_start != $start ){
      #chr1    861120  879961  NM_152486       0       +       861321  879533
      my $utr_start = $start;
      my $utr_end = $thick_start + 1;
      if ($strand eq '+'){
         $five_prime_utr_start = $utr_start;
         $five_prime_utr_end = $utr_end;
      } else {
         $three_prime_utr_start =$utr_end;
         $three_prime_utr_end =$utr_start;
   if ($end != $thick_end){
      my $utr_start = $thick_end;
      my $utr_end = $end + 1;
      if ($strand eq '+'){
         $three_prime_utr_start = $utr_start;
         $three_prime_utr_end = $utr_end;
      } else {
         $five_prime_utr_start =$utr_end;
         $five_prime_utr_end =$utr_start;

   #declare arrays for introns and exons
   my @exon_start = ();
   my @exon_end = ();
   my @intron_start = ();
   my @intron_end = ();

   #go through each block from right to left
   for (my $block=$block_count-1; $block>=0; --$block){
      #0 based coordinates
      my $exon_start = $start + $block_start[$block] + 1;
      my $exon_end = $start + $block_start[$block] + $block_size[$block];
      #print "Exon: $block\t$exon_start\t$exon_end\n";

      if ($strand eq '+'){
         if ($three_prime_utr_start != 0){
            if ($exon_start > $three_prime_utr_start){
               #print "3' UTR is spliced:\t$exon_start\t$exon_end\n";
               print join("\t",$chr,$exon_start,$exon_end,"3_UTR_$id",'0',$strand),"\n";
            } else {
               #print "3' UTR:\t$three_prime_utr_start\t$exon_end\n";
               print join("\t",$chr,$three_prime_utr_start,$exon_end,"3_UTR_$id",'0',$strand),"\n";
               $three_prime_utr_start = '0';
      } else {
         if ($five_prime_utr_end != 0){
            if ($exon_start > $five_prime_utr_end){
               #print "5' UTR is spliced:\t$exon_start\t$exon_end\n";
               print join("\t",$chr,$exon_start,$exon_end,"5_UTR_$id",'0',$strand),"\n";
            } else {
               #print "5' UTR:\t$five_prime_utr_end\t$exon_end\n";
               print join("\t",$chr,$five_prime_utr_end,$exon_end,"5_UTR_$id",'0',$strand),"\n";
               $five_prime_utr_end = '0';

   my $previous_exon_end = '';
   #go through each block from left to right
   for (1 .. $block_count){
      #array indexing starts with 0
      my $block = $_ - 1;

      #0 based coordinates
      my $exon_start = $start + $block_start[$block] + 1;
      my $exon_end = $start + $block_start[$block] + $block_size[$block];
      #print "$id\t$exon_start\t$exon_end\n";
      print join("\t",$chr,$exon_start,$exon_end,"EXON_$id",'0',$strand),"\n";
      $exon_start[$block] = $exon_start;
      $exon_end[$block] = $exon_end;
      #print "Exon $block:\t$exon_start\t$exon_end\n";

      if ($strand eq '+'){
         if ($five_prime_utr_end != 0){
            if ($exon_end < $five_prime_utr_end){
               #print "5' UTR is spliced:\t$exon_start\t$exon_end\n";
               print join("\t",$chr,$exon_start,$exon_end,"5_UTR_$id",'0',$strand),"\n";
            } else {
               #print "5' UTR:\t$exon_start\t$five_prime_utr_end\n";
               print join("\t",$chr,$exon_start,$five_prime_utr_end,"5_UTR_$id",'0',$strand),"\n";
               $five_prime_utr_end = '0';
      } else {
         if ($three_prime_utr_start != 0){
            if ($exon_end < $three_prime_utr_start){
               #print "3' UTR is spliced:\t$exon_start\t$exon_end\n";
               print join("\t",$chr,$exon_start,$exon_end,"3_UTR_$id",'0',$strand),"\n";
            } else {
               #print "3' UTR:\t$exon_start\t$three_prime_utr_start\n";
               print join("\t",$chr,$exon_start,$three_prime_utr_start,"3_UTR_$id",'0',$strand),"\n";
               $three_prime_utr_start = '0';

      if ($block == 0){
         $previous_exon_end = $exon_end;
      } else {
         my $intron_start = $previous_exon_end + 1;
         my $intron_end = $exon_start - 1;
         #print "Intron $block:\t$intron_start\t$intron_end\n";
         print join("\t",$chr,$intron_start,$intron_end,"INTRON_$id",'0',$strand),"\n";
         $intron_start[$block] = $intron_start;
         $intron_end[$block] = $intron_end;
         $previous_exon_end = $exon_end;

   #first and last exons
   #uncomment out if you want to store first and last exons
   if ($block_count > 1){
      if ($strand eq '+'){
         my $first_exon_start = $exon_start[0];
         my $first_exon_end = $exon_end[0];
         my $last_exon_start = $exon_start[-1];
         my $last_exon_end = $exon_end[-1];
         #print "First exon:\t$first_exon_start\t$first_exon_end\n";
         #print "Last exon:\t$last_exon_start\t$last_exon_end\n";
      } elsif ($strand eq '-'){
         my $first_exon_start = $exon_start[-1];
         my $first_exon_end = $exon_end[-1];
         my $last_exon_start = $exon_start[0];
         my $last_exon_end = $exon_end[0];
         #print "First exon:\t$first_exon_start\t$first_exon_end\n";
         #print "Last exon:\t$last_exon_start\t$last_exon_end\n";
      } else {
         die "Incorrect strand information: $strand\n";

   #the file is sorted, so this is the last transcript on the chromosome
   if ($chr ne $previous_chr){
      my $intergenic_start = $end_of_transcript + 1;
      my $intergenic_end = $chr_end{$previous_chr} - 1;
      $end_of_transcript = '-1';
      #print "$intergenic_start\t$intergenic_end\n";
      print join("\t",$chr,$intergenic_start,$intergenic_end,'INTERGENIC','0','0'),"\n";
      $previous_chr = $chr;

   my $intergenic_start = $end_of_transcript + 1;
   my $intergenic_end = $start - 1;
   #print "$intergenic_start\t$intergenic_end\n";
   #this is to avoid overlapping transcript
   if ($intergenic_start >= $intergenic_end){
      #overlapping transcript
   } else {
      print join("\t",$chr,$intergenic_start,$intergenic_end,'INTERGENIC','0','0'),"\n";
   $end_of_transcript = $end;

   #print "---\n\n";



#hg19 chromosome ends used for calculating last intergenic region
chr1    249250621
chr2    243199373
chr3    198022430
chr4    191154276
chr5    180915260
chr6    171115067
chr7    159138663
chrX    155270560
chr8    146364022
chr9    141213431
chr10   135534747
chr11   135006516
chr12   133851895
chr13   115169878
chr14   107349540
chr15   102531392
chr16   90354753
chr17   81195210
chr18   78077248
chr20   63025520
chrY    59373566
chr19   59128983
chr22   51304566
chr21   48129895

You should end up with a bed file as such:

chr1 1264277 1266724 INTERGENIC 0 0
chr1 1284445 1284492 5_UTR_NM_004421 0 -
chr1 1270658 1271895 EXON_NM_004421 0 -
chr1 1270658 1271522 3_UTR_NM_004421 0 -
chr1 1273357 1273563 EXON_NM_004421 0 -
chr1 1271896 1273356 INTRON_NM_004421 0 -
chr1 1273649 1273816 EXON_NM_004421 0 -
chr1 1273564 1273648 INTRON_NM_004421 0 -

Encyclopedia of DNA elements (ENCODE)

ENCODE is an abbreviation of "Encyclopedia of DNA elements", a project that aims to discover and define the functional elements encoded in the human genome via multiple technologies. Used in this context, the term "functional elements" is used to denote a discrete region of the genome that encodes a defined product (e.g., protein) or a reproducible biochemical signature, such as transcription or a specific chromatin structure. It is now widely appreciated that such signatures, either alone or in combination, mark genomic sequences with important functions, including exons, sites of RNA processing, and transcriptional regulatory elements such as promoters, enhancers, silencers, and insulators.

The data included from the ENCODE project include:

  1. Identification and quantification of RNA species in whole cells and in sub-cellular compartments
  2. Mapping of protein-coding regions
  3. Delineation of chromatin and DNA accessibility and structure with nucleases and chemical probes
  4. Mapping of histone modifications and transcription factor binding sites (TFBSs) by chromatin immunoprecipitation (ChIP)
  5. and measurement of DNA methylation

To facilitate comparison and integration of data, ENCODE data production efforts have prioritised selected sets of cell types. The highest priority set (designated "Tier 1) includes two widely studied immortalised cell lines - K562 erythroleukemia cells; an EBV-immortalised B-lymphoblastoid line and the H1 human embryonic stem cell line. A secondary priority set (Tier 2) includes HeLa-S3 cervical carcinoma cells, HepG2 hepatoblastoma cells, and primary (non-transformed) human umbilical vein endothelial cells (HUVEC), which have limited proliferation potential in culture. A third set (Tier 3) currently comprises more than 100 cell types that are being analysed in selected assays.

A major goal of ENCODE is to manually annotate all protein-coding genes, pseudogenes, and non-coding transcribed loci in the human genome and to catalog the products of transcription including splice isoforms. This annotation process involves consolidation of all evidence of transcripts (cDNA, EST sequences) and proteins from public databases, followed by building gene structures based on supporting experimental data.

Ultimately, each gene or transcript model is assigned one of three confidence levels:

  1. Level 1 includes genes validated by RT-PCR and sequencing, plus consensus psuedogenes.
  2. Level 2 includes manually annotated coding and long non-coding loci that have transcriptional evidence in EMBL/GenBank.
  3. Level 3 includes Ensembl gene predictions in regions not yet manually annotated or for which there is new transcriptional evidence.

The result of ENCODE gene annotation (termed "GENCODE") is a comprehensive catalog of transcripts and gene models.

Another goal of ENCODE is to produce a comprehensive genome-wide catalog of transcribed loci that characterises the size, polyadenylation status, and subcellular compartmentalisation of all transcripts. Both polyA+ and polyA- RNAs are being analysed and not only total whole cell RNAs but also those concentrated in the nucleus and cytosol. Long (>200nt) and short RNAs (<200nt) are being sequenced from each subcellular compartment, providing catalogs of potential miRNAs, snoRNA, promoter-associated short RNAs (PASRs) and other short cellular RNAs. Cis-regulatory regions include diverse functional elements (e.g., promoters, enhancers, silencers, and insulators) that collectively modulate the magnitude, timing, and cell-specificity of gene expression. The ENCODE Project is using multiple approaches to identify cis-regulatory regions, including localising their characteristic chromatin signatures and identifying sites of occupancy of sequence-specific transcription factors. Human cis-regulatory regions characteristically exhibit nuclease hypersensitivity and may show increased solubility after chromatin fixation and fragmentation. Additionally, specific patterns of post-translational histone modifications have been connected with distinct classes of regions such as promoters and enhancers. Chromatin accessibility and histone modifications thus provide independent and complementary annotations of human regulatory DNA. DNaseI hypersensitive sites (DHSs) are being mapped by two techniques: (i) capture of free DNA ends at in vivo DNaseI cleavage sites with biotinylated adapters, followed by digestion with a TypeIIS restiction enzyme to generate ~20 bp DNaseI cleavage site tags and (ii) direct sequencing of DNaseI cleavage sites at the ends of small (<300 bp) DNA fragments released by limiting treatment with DNaseI. For more information see the ENCODE user guide.

Downloading ENCODE data

Release ENCODE data can be download at

For example if you are interested in the H3K4Me1 CHiP-Seq data, bigWig files are provided at You may also be interested in the H3K4Me3 data, since this provides some perspective on the H3K4Me1 data.

To convert the bigWig file to a human readable (i.e. non-binary), download the bigWigToBedGraph executable.

Then convert the bigWig file to bedGraph by running:

bigWigToBedGraph wgEncodeBroadHistoneK562H3k4me1StdSig.bigWig wgEncodeBroadHistoneK562H3k4me1StdSig.bedgraph

The output:

chr1 10100 10125 0.6
chr1 10125 10225 1
chr1 10225 10250 1.36
chr1 10250 10275 2
chr1 10275 10300 3.64

Which can then be transformed into a BED file, for use with intersectBed.

The 1000 Genome Project

The 1000 Genome Project started as an endeavour to help capture, as much as possible, human genetic variation. The results of the pilot phase, are published in Nature. To sequence a person's genome, many copies of the DNA are broken into short pieces and each piece is sequenced and mapped, and stored in alignment files. Here's some information I gathered from the 1000 Genome Project page regarding the alignment files.

Data generated from the 1000 Genome Project is available at their ftp site. All alignment data is in the BAM format and the alignments are found under data/XXXXXXXX/alignment where the XXXXXXXX identifier is the sample name. The spreadsheet available at provides information on the samples.

The BAM filenames themselves contain a lot of information, e.g: NA12878.chrom1.LS454.ssaha.CEU.high_coverage.20091216.bam, where each part separated by the dot (i.e. period) are "Sample name", "Chromosome", "Sequencing platform", "Mapping algorithm", "Population", "Analysis group" and the date in the format YYYYMMDD. Sequence reads were aligned to the GRCh37 (hg19) build of the human reference. For more information on the alignment files, please refer to

At the moment I'm only interested in the Illumina data, which were all aligned by BWA. The BWA parameters used for alignment:

bwa aln -q 15 -f $sai_file $reference_fasta $fastq_file

The -q parameter is for read trimming and basically reads are trimmed at the position where the threshold of "badness" has been reached. For more information see these SEQanswers threads and the links within:

As an example, say I'm interested in the NA20502 sample, who is a female from Tuscany, Italy. I can download the alignment file for chromosome 20 here. For possibly faster downloads have a look at the Aspera connect software (instructions are available at the 1000 Genome Project data page).

Using IGV, here's a snapshot of a region on chromosome 20, where by eye there seems to be a couple of SNPs (and some sequencing errors).

Next thing on the agenda is to look at SNPs and INDELs using SAMtools and BCFtools (and to find my own storage for all the alignment files).

Sequence conservation in vertebrates

The UCSC Genome Browser provides multiple alignments of 46 vertebrate species and conveniently provides them for download. The multiple alignments show regions of sequence conservation among vertebrates. For more information see The multiple alignments are stored as Multiple Alignment Files and there are Perl and Python packages that parse them. The MAF format is quite straight forward and here I convert the MAF files into a BED file for use with intersectBed from the BEDTools suite.

Continue reading