Finding junctions with TopHat

For setting up TopHat see my previous post.

Here, I wanted to test whether TopHat can find junctions with single end 27bp reads.

The reference sequence I used was the test_ref.fa provided by the TopHat authors (see my previous post for the link), where the A’s mark the intron regions:


To index the reference run:

bowtie2-build test_ref.fa test_ref

I made up a test read (also based on the test fastq file provided by the TopHat authors) in fastq format:


If you run TopHat with the default settings, no alignment will be returned:

tophat test_ref blah.fq

We need to modify the segment length parameter:

–segment-length Each read is cut up into segments, each at least this long. These segments are mapped independently. The default is 25.

Interestingly I get results when I specify a segment-length of 12, but not at a higher value (I would have assumed that I needed a segment length of 11):

tophat --segment-length 12 test_ref blah.fq
samtools view tophat_out/accepted_hits.bam
#test_mRNA_151_297_1d    0       test_chromosome 240     50      11M100N16M      *       0       0       TGCGATACGCCACTATTACTTTATTAT     IIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-6  XM:i:0  XO:i:0  XG:i:0  MD:Z:27 NM:i:0  XS:A:+  NH:i:1

From the CIGAR line, you can see how I split up the read; 11 nucleotides on one exon and 16 nucleotides on the other exon.

And lastly if I modify the –max-segment-intron parameter to anything less than 101, no results are returned

–max-segment-intron The maximum intron length that may be found during split-segment search. The default is 500000.

tophat --segment-length 12 --max-segment-intron 101 test_ref blah.fq #returns result
tophat --segment-length 12 --max-segment-intron 100 test_ref blah.fq #no result

Now the only questions are how trustworthy these junctions, detected from 27bp reads, are if you only have a few reads supporting the junction and how much more computational time it adds when we shorten the segment length.

Lastly, I have tested different –segment-length parameters (data not shown), and depending on how the read is split up, it is not the length of the segment that governs whether a match can be found, but whether the segment length matches how the read is split e.g. in a test where I split a 32bp read into 19/22, mapping results could be found with a segment length of 18 but not 10.

Print Friendly, PDF & Email

Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
2 comments Add yours
  1. Another flag to consider the use of is -a/–min-anchor-length

    27 base pairs ? Sounds like Illumina CAGE-seq data. I would hope it wouldn’t map across a splice site if it really was at a TSS.

    1. Hi Dario,

      Thanks for the tip! You’re right, I was looking at CAGE data but not exactly at splice sites. I was looking at TSSes that were potentially spliced.



Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.