For setting up TopHat see my previous post.
Here, I wanted to test whether TopHat can find junctions with single end 27bp reads.
The reference sequence I used was the test_ref.fa provided by the TopHat authors (see my previous post for the link), where the A's mark the intron regions:
>test_chromosome
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
ACTACTATCTGACTAGACTGGAGGCGCTTGCGACTGAGCTAGGACGTGCC
ACTACGGGGATGACGACTAGGACTACGGACGGACTTAGAGCGTCAGATGC
AGCGACTGGACTATTTAGGACGATCGGACTGAGGAGGGCAGTAGGACGCT
ACGTATTTGGCGCGCGGCGCTACGGCTGAGCGTCGAGCTTGCGATACGCC
GTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG
ACTATTACTTTATTATCTTACTCGGACGTAGACGGATCGGCAACGGGACT
GTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG
TTTTCTACTTGAGACTGGGATCGAGGCGGACTTTTTAGGACGGGACTTGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
To index the reference run:
bowtie2-build test_ref.fa test_ref
I made up a test read (also based on the test fastq file provided by the TopHat authors) in fastq format:
@test_mRNA_151_297_1d/1
TGCGATACGCCACTATTACTTTATTAT
+
IIIIIIIIIIIIIIIIIIIIIIIIIII
If you run TopHat with the default settings, no alignment will be returned:
tophat test_ref blah.fq
We need to modify the segment length parameter:
--segment-length Each read is cut up into segments, each at least this long. These segments are mapped independently. The default is 25.
Interestingly I get results when I specify a segment-length of 12, but not at a higher value (I would have assumed that I needed a segment length of 11):
tophat --segment-length 12 test_ref blah.fq samtools view tophat_out/accepted_hits.bam #test_mRNA_151_297_1d 0 test_chromosome 240 50 11M100N16M * 0 0 TGCGATACGCCACTATTACTTTATTAT IIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-6 XM:i:0 XO:i:0 XG:i:0 MD:Z:27 NM:i:0 XS:A:+ NH:i:1
From the CIGAR line, you can see how I split up the read; 11 nucleotides on one exon and 16 nucleotides on the other exon.
And lastly if I modify the --max-segment-intron parameter to anything less than 101, no results are returned
--max-segment-intron The maximum intron length that may be found during split-segment search. The default is 500000.
tophat --segment-length 12 --max-segment-intron 101 test_ref blah.fq #returns result tophat --segment-length 12 --max-segment-intron 100 test_ref blah.fq #no result
Now the only questions are how trustworthy these junctions, detected from 27bp reads, are if you only have a few reads supporting the junction and how much more computational time it adds when we shorten the segment length.
Lastly, I have tested different --segment-length parameters (data not shown), and depending on how the read is split up, it is not the length of the segment that governs whether a match can be found, but whether the segment length matches how the read is split e.g. in a test where I split a 32bp read into 19/22, mapping results could be found with a segment length of 18 but not 10.

This work is licensed under a Creative Commons
Attribution 4.0 International License.

Another flag to consider the use of is -a/–min-anchor-length
27 base pairs ? Sounds like Illumina CAGE-seq data. I would hope it wouldn’t map across a splice site if it really was at a TSS.
Hi Dario,
Thanks for the tip! You’re right, I was looking at CAGE data but not exactly at splice sites. I was looking at TSSes that were potentially spliced.
Cheers,
Dave