Updated 10th September 2013 to include LAST
I previously looked at mapping repeats with respect to sequencing errors in high throughput sequencing and as one would expect, the accuracy of the mapping decreased when sequencing errors were introduced. I then looked at aligning to unique regions of the genome to get an idea of how different short read alignment perform with reads that should map uniquely. Here I combine the two ideas, to see how different short read alignment programs perform when mapping repeats.
When I wrote my first mapping repeats post, I was made aware of this review article on "Repetitive DNA and next-generation sequencing: computational challenges and solutions" via Twitter (thank you CB). It was also his suggestion that it would be interesting to compare different short read alignment programs with respect to mapping repeats, hence this post.
Before I get started, I'd like to recapitulate some of the points made in the review article, which discussed the computational problems associated with mapping repeats especially with respect to high-throughput sequencing data.
- Repetitive sequences such as short tandem repeats and interspersed repeats make up roughly 50% of the human genome
- High throughput sequencers produce short reads; these reads are shorter than the repeats, thus mapping them becomes ambiguous
- It is typically reported that 70-80% of short reads can be mapped uniquely to the genome despite have a genome that is 50% repetitive; this is because repeats in the human genome have accumulated enough mutations to be distinguishable
- When dealing with a read that maps to multiple places, there are usually 3 choices
- Discard them
- Take the best alignment; if there are multiple best hits take one randomly
- Report all alignments or report up to a certain number
In addition one can make use of uniquely mapped reads; for example, you have a read that maps to two places. However at those two places, one place has a bunch of uniquely mapped reads and the other doesn't. It seems more likely that the read arose from the loci that had a bunch of uniquely mapped reads. It should be pointed out that this approach is only possible when all multi mapping locations of a read is stored.
OK now that we've got some background, let's compare some short read aligners. I'll use the same ones that were discussed in the review article, namely:
Bowtie
BFAST
Burrows-Wheeler Aligner (BWA)
mrFAST
SOAPAligner
and also
LAST.
A while ago, I have actually looked at how BWA and Bowtie deal with multi mapping reads. I have heard of BFAST and SOAPAligner but never mrFAST, which is a cool name and reminds me of MrBayes. Firstly download and install the alignment programs:
#bowtie wget http://sourceforge.net/projects/bowtie-bio/files/bowtie/1.0.0/bowtie-1.0.0-linux-x86_64.zip unzip bowtie-1.0.0-linux-x86_64.zip #bfast wget http://sourceforge.net/projects/bfast/files/bfast/0.7.0/bfast-0.7.0a.tar.gz tar -xzf bfast-0.7.0a.tar.gz cd bfast-0.7.0a ./configure --prefix=/home/davetang/bin make make check make install #bwa wget http://sourceforge.net/projects/bio-bwa/files/bwa-0.7.5a.tar.bz2 tar -xjf bwa-0.7.5a.tar.bz2 cd bwa-0.7.5a make #mrfast wget http://sourceforge.net/projects/mrfast/files/mrfast/mrfast-2.6.0.1.tar.gz tar xzf mrfast-2.6.0.1.tar.gz cd mrfast-2.6.0.1 make #soap2 wget http://soap.genomics.org.cn/down/soap2.21release.tar.gz tar -xzf soap2.21release.tar.gz cd soap2.21release #last wget http://last.cbrc.jp/last-356.zip unzip last-356.zip cd last-356 make
Indexing the genome (these steps take some time, so go grab a coffee or something):
bowtie-build hg19.fa hg19 bwa index hg19.fa mrfast --index hg19.fa 2bwt-builder hg19.fa lastdb -m1111110 hg19 hg19.fa
For BFAST, I indexed the genome using the recommended masks; for the specific steps see my BFAST wiki page.
I will use the same reads that I generated from my mapping repeats post. Now, let's align these reads using BWA:
bwa aln -t 8 -n 2 hg19.fa read.bwa.read1.fastq > read.bwa.read1.sai bwa samse hg19.fa read.bwa.read1.sai read.bwa.read1.fastq > read.bwa.read1.sam samtools view -S -b read.bwa.read1.sam > read.bwa.read1.bam samtools sort read.bwa.read1.bam read.bwa.read1_sorted rm read.bwa.read1.sai read.bwa.read1.sam read.bwa.read1.bam
I wrote a short script in my aligning to unique regions post that checks how many reads were correctly and incorrectly mapped.
assess.pl read.bwa.read1_sorted.bam Total: 4724573 Correct: 3183343 Incorrect: 1541230
Let's try Bowtie:
bowtie -p 8 --best -S hg19 read.bwa.read1.fastq > read.bwa.read1.sam samtools view -bS read.bwa.read1.sam > read.bwa.read1.bam samtools sort read.bwa.read1.bam read.bwa.read1_sorted rm read.bwa.read1.sam read.bwa.read1.bam assess.pl read.bwa.read1_sorted.bam Total: 4724573 Correct: 3174961 Incorrect: 1549612
Let's try SOAPaligner:
soap -a read.bwa.read1.fastq -D ~/genome/soap_index/hg19.fa.index -v 2 -p 8 -o read.bwa.read1.fastq.out wget http://soap.genomics.org.cn/down/soap2sam.tar.gz tar -xzf soap2sam.tar.gz chmod 755 soap2sam.pl soap2sam.pl read.bwa.read1.fastq.out > read.bwa.read1.fastq.sam samtools view -T hg19.fa -b -S read.bwa.read1.fastq.sam > read.bwa.read1.fastq.bam samtools sort read.bwa.read1.fastq.bam read.bwa.read1.fastq_sorted rm read.bwa.read1.fastq.bam read.bwa.read1.fastq.sam read.bwa.read1.fastq.out assess.pl read.bwa.read1.fastq_sorted.bam Total: 4633751 Correct: 3179312 Incorrect: 1454439
Let's try BFAST
bfast match -n 16 -f hg19.fa -r read.bwa.read1.fastq > read_match.bmf bfast localalign -n 16 -f hg19.fa -m read_match.bmf > read_align.baf bfast postprocess -n 16 -f hg19.fa -i read_align.baf > read.sam samtools view -bS read.sam > read.bam samtools sort read.bam read_sorted rm read.bam read.sam read_match.bmf read_align.baf assess.pl read_sorted.bam Total: 4724573 Correct: 2871241 Incorrect: 1853332
Let's try mrFAST:
#this was extremely slow #mrfast --search ~/genome/mrfast_index/hg19.fa --best -e 2 --seq read.bwa.read1.fastq -o read.bwa.read1.fastq.sam #so I split the fastq file into 16 parts #I can confirm that the results of splitting and not splitting #are exactly the same split -l 1250000 read.bwa.read1.fastq read_bwa_read_fastq_split_ ls -1 read.bwa.read1.fastq read_bwa_read_fastq_split_aa read_bwa_read_fastq_split_ab read_bwa_read_fastq_split_ac read_bwa_read_fastq_split_ad read_bwa_read_fastq_split_ae read_bwa_read_fastq_split_af read_bwa_read_fastq_split_ag read_bwa_read_fastq_split_ah read_bwa_read_fastq_split_ai read_bwa_read_fastq_split_aj read_bwa_read_fastq_split_ak read_bwa_read_fastq_split_al read_bwa_read_fastq_split_am read_bwa_read_fastq_split_an read_bwa_read_fastq_split_ao read_bwa_read_fastq_split_ap #and used GNU parallel (http://www.gnu.org/software/parallel/) parallel mrfast --search ~/genome/mrfast_index/hg19.fa --best -e 2 --seq read_bwa_read_fastq_split_a{} -o read_bwa_read_fastq_split_a{}.sam ::: {a..p} cat *.sam > blah samtools view -T ~/genome/hg19.fa -b -S blah > read.bwa.read1.fastq.bam samtools sort read.bwa.read1.fastq.bam read.bwa.read1.fastq_sorted assess.pl read.bwa.read1.fastq_sorted.bam Total: 3890753 Correct: 2076293 Incorrect: 1814460
And lastly, let's try LAST:
#using the typical usage from http://last.cbrc.jp/doc/last-map-probs.html lastal -Q1 -e120 hg19 read.bwa.read1.fastq | last-map-probs.py -s150 > hit.maf #OR using GNU parallel parallel --pipe -L4 "lastal -Q1 -e120 hg19 - | last-map-probs.py -s150" < read.bwa.read1.fastq > hit_parallel.maf cat hit.maf | grep "^q" | wc 2199870 6599610 243347994 #this convert script doesn't add the necessary SAM headers maf-convert.py sam hit.maf > hit.sam #get header from the bowtie bam file samtools view -H bowtie/read.bwa.read1_sorted.bam > header #convert to bam cat header hit.sam | samtools view -bS - > hit.bam samtools sort hit.bam hit_sorted rm hit.sam hit.bam #LAST does local alignment by default so alignments may be clipped #adjusted assess.pl script by adding #if ($cigar =~ /^(\d+)H/){ $original += $1; } assess_last.pl hit_sorted.bam Total: 2199866 Correct: 2189556 Incorrect: 10310
Summary
Program | Total | Correct | Incorrect | Percent correct |
BWA | 4724573 | 3183343 | 1541230 | 67.4 |
Bowtie | 4724573 | 3174961 | 1549612 | 67.2 |
SOAPaligner | 4633751 | 3179312 | 1454439 | 68.6 |
BFAST | 4724573 | 2871241 | 1853332 | 60.8 |
mrFAST | 3890753 | 2076293 | 1814460 | 53.4 |
LAST | 2199866 | 2189556 | 10310 | 99.5 |
Let's plot this in R:
data <- data.frame(BWA=c(3183343,1541230), Bowtie=c(3174961,1549612), SOAPaligner=c(3179312,1454439), BFAST=c(2871241,1853332), mrFAST=c(2076293,1814460), LAST=c(2189556,10310)) rownames(data) <- c("Correct","Incorrect") data BWA Bowtie SOAPaligner BFAST mrFAST LAST Correct 3183343 3174961 3179312 2871241 2076293 2189556 Incorrect 1541230 1549612 1454439 1853332 1814460 10310 par(mar=c(5.1, 4.1, 4.1, 3), xpd=TRUE) barplot(as.matrix(data), col=c(3,2), ylim=c(0,5e6)) legend(6.2, 5e06,rownames(data),fill=c(3,2))
LAST doesn't always align repeats, but when it does, it gets almost all of it correct!
Conclusions
This post gives a very rough idea of how well short read aligners perform when tasked with aligning reads from repetitive parts of the genome. I've tried to use the default settings of each respective program, unless it seemed unreasonable for my task (such as mrFAST reporting every single alignment by default, since this will use up huge amounts of space for my repetitive reads or SOAPaligner allowing up to 5 mismatches). Of particular mention is LAST, which was extremely accurate, though it aligned 40% of the reads.
In the end, the best way for dealing with multimapping reads is just to have longer reads coupled with paired end sequencing.
I have uploaded the gzipped FASTQ file (171M) that I have used for this post on my server.
This work is licensed under a Creative Commons
Attribution 4.0 International License.
Thanks a lot for the post! It’s really useful for me!
I wonder does last work with paired-end reads?
I’m glad it’s been useful. As far as I remember LAST works with paired-end reads.
Do you happen to have the speed for LAST in this little benchmark?
Yeah it was really silly of me to leave out the system time; I’ve been asked by several people. I will update this post one day, with the running times. If you do plan on running LAST, definitely use it with GNU parallel.
Hi,
A very interesting work!
I was wondering if you tried to check the results with different parameters.
Hi Shruti,
I’ve played around with different parameters for BWA and Bowtie but I haven’t compared the results.
Cheers,
Dave
hi , i installed bowtie using the link “wget http://sourceforge.net/projects/bowtie-bio/files/bowtie/1.0.0/bowtie-1.0.0-linux-x86_64.zip
unzip bowtie-1.0.0-linux-x86_64.zip”
given here but when i used “bowtie-build myfile”
it said Command ‘bowtie-build’ not found
could you help me out here?
Hi there,
the bowtie-build executable is in the folder bowtie-1.0.0 after you run unzip. You would need to run bowtie-1.0.0/bowtie-build after the unzip command. Have a read of this article https://www.howtogeek.com/658904/how-to-add-a-directory-to-your-path-in-linux/ to learn more about how executable/commands/tools are found in Linux.
In addition, this post is quite old. I would recommend using Bowtie 2 or HISAT2 instead of Bowtie.
Cheers,
Dave