Mapping repeats 2

Updated 10th September 2013 to include LAST

I previously looked at mapping repeats with respect to sequencing errors in high throughput sequencing and, as one would expect, the accuracy of the mapping decreased when sequencing errors were introduced. I then looked at aligning to unique regions of the genome to get an idea of how different short read alignment programs perform with reads that should map uniquely. Here I combine the two ideas to see how different short read alignment programs perform when mapping repeats.

When I wrote my first mapping repeats post, I was made aware of this review article on “Repetitive DNA and next-generation sequencing: computational challenges and solutions” via Twitter (thank you CB). It was also his suggestion that it would be interesting to compare different short read alignment programs with respect to mapping repeats, hence this post.

Before I get started, I’d like to recapitulate some of the points made in the review article, which discussed the computational problems associated with mapping repeats especially with respect to high-throughput sequencing data.

  1. Repetitive sequences such as short tandem repeats and interspersed repeats make up roughly 50% of the human genome
  2. High throughput sequencers produce short reads; these reads are shorter than the repeats, thus mapping them becomes ambiguous
  3. It is typically reported that 70-80% of short reads can be mapped uniquely to the genome, despite the genome being roughly 50% repetitive; this is because repeats in the human genome have accumulated enough mutations to be distinguishable
  4. When dealing with a read that maps to multiple places, there are usually 3 choices
    1. Discard them
    2. Take the best alignment; if there are multiple best hits take one randomly
    3. Report all alignments or report up to a certain number
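Choice 1 is often implemented by filtering on the SAM MAPQ field, since aligners such as BWA assign multi-mapping reads a mapping quality of 0. A minimal sketch with toy SAM records (the read names, positions and the MAPQ cutoff of 10 are purely illustrative):

```shell
# Two toy SAM records: r1 maps uniquely (MAPQ 37), r2 is a multimapper (MAPQ 0)
printf 'r1\t0\tchr1\t100\t37\t36M\t*\t0\t0\t*\t*\nr2\t0\tchr1\t200\t0\t36M\t*\t0\t0\t*\t*\n' > toy.sam
# Discard multi-mapping reads by requiring MAPQ (column 5) >= 10; only r1 survives
awk '$5 >= 10' toy.sam
```

With real data the same filter is a one-liner: samtools view -b -q 10 aln.bam > unique.bam.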

In addition, one can make use of uniquely mapped reads. For example, say a read maps to two places, but only one of those places has a bunch of uniquely mapped reads; it seems more likely that the read arose from the locus with the unique-read support. It should be pointed out that this approach is only possible when all multi-mapping locations of a read are stored.
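The rescue idea above can be illustrated with a toy example (the file names, positions and the 200 bp bin size are all made up for this sketch): score each candidate locus of a multi-mapping read by the number of uniquely mapped reads falling in its bin, then keep the best-supported locus.

```shell
# Uniquely mapped read positions: three pile up near chr1:100, none near chr5:900
printf 'chr1 90\nchr1 120\nchr1 150\nchr5 5000\n' > unique.txt
# The two candidate loci of one multi-mapping read
printf 'chr1 100\nchr5 900\n' > candidates.txt
# Count unique-read support per 200 bp bin, then pick the best-supported locus
awk 'NR==FNR { support[$1" "int($2/200)]++; next }
     { print $1, $2, support[$1" "int($2/200)] + 0 }' unique.txt candidates.txt \
  | sort -k3,3nr | head -1
# prints: chr1 100 3
```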

OK now that we’ve got some background, let’s compare some short read aligners. I’ll use the same ones that were discussed in the review article, namely:

Bowtie
BFAST
Burrows-Wheeler Aligner (BWA)
mrFAST
SOAPAligner

and also

LAST.

A while ago I looked at how BWA and Bowtie deal with multi-mapping reads. I had heard of BFAST and SOAPAligner but never of mrFAST, which is a cool name that reminds me of MrBayes. Firstly, download and install the alignment programs:

#bowtie
wget http://sourceforge.net/projects/bowtie-bio/files/bowtie/1.0.0/bowtie-1.0.0-linux-x86_64.zip
unzip bowtie-1.0.0-linux-x86_64.zip
#bfast
wget http://sourceforge.net/projects/bfast/files/bfast/0.7.0/bfast-0.7.0a.tar.gz
tar -xzf bfast-0.7.0a.tar.gz
cd bfast-0.7.0a
./configure --prefix=/home/davetang/bin
make
make check
make install
#bwa
wget http://sourceforge.net/projects/bio-bwa/files/bwa-0.7.5a.tar.bz2
tar -xjf bwa-0.7.5a.tar.bz2
cd bwa-0.7.5a
make
#mrfast
wget http://sourceforge.net/projects/mrfast/files/mrfast/mrfast-2.6.0.1.tar.gz
tar xzf mrfast-2.6.0.1.tar.gz
cd mrfast-2.6.0.1
make
#soap2
wget http://soap.genomics.org.cn/down/soap2.21release.tar.gz
tar -xzf soap2.21release.tar.gz
cd soap2.21release
#last
wget http://last.cbrc.jp/last-356.zip
unzip last-356.zip
cd last-356
make

Indexing the genome (these steps take some time, so go grab a coffee or something):

bowtie-build hg19.fa hg19
bwa index hg19.fa
mrfast --index hg19.fa
2bwt-builder hg19.fa
lastdb -m1111110 hg19 hg19.fa

For BFAST, I indexed the genome using the recommended masks; for the specific steps see my BFAST wiki page.

I will use the same reads that I generated from my mapping repeats post. Now, let’s align these reads using BWA:

bwa aln -t 8 -n 2 hg19.fa read.bwa.read1.fastq > read.bwa.read1.sai
bwa samse hg19.fa read.bwa.read1.sai read.bwa.read1.fastq > read.bwa.read1.sam
samtools view -S -b read.bwa.read1.sam > read.bwa.read1.bam
samtools sort read.bwa.read1.bam read.bwa.read1_sorted
rm read.bwa.read1.sai read.bwa.read1.sam read.bwa.read1.bam

I wrote a short script in my aligning to unique regions post that checks how many reads were correctly and incorrectly mapped.

assess.pl read.bwa.read1_sorted.bam
Total: 4724573
Correct: 3183343
Incorrect: 1541230
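The gist of the assessment can be sketched with awk, assuming (purely for illustration; this need not be the real format assess.pl expects) that each read name encodes its true origin as chrom_pos, which is compared against the mapped chromosome and position in SAM columns 3 and 4:

```shell
# Toy SAM-like lines (name, flag, chrom, pos): r1 mapped back to its encoded
# origin (chr1:100), r2 did not (chr2:500 vs mapped chr2:700)
printf 'chr1_100\t0\tchr1\t100\nchr2_500\t0\tchr2\t700\n' \
  | awk '{ split($1, a, "_");
           if (a[1] == $3 && a[2] == $4) correct++; else wrong++ }
         END { printf "Correct: %d\nIncorrect: %d\n", correct, wrong }'
# prints: Correct: 1
#         Incorrect: 1
```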

Let’s try Bowtie:

bowtie -p 8 --best -S hg19 read.bwa.read1.fastq > read.bwa.read1.sam
samtools view -bS read.bwa.read1.sam > read.bwa.read1.bam
samtools sort read.bwa.read1.bam read.bwa.read1_sorted
rm read.bwa.read1.sam read.bwa.read1.bam
assess.pl read.bwa.read1_sorted.bam
Total: 4724573
Correct: 3174961
Incorrect: 1549612

Let’s try SOAPaligner:

soap -a read.bwa.read1.fastq -D ~/genome/soap_index/hg19.fa.index -v 2 -p 8 -o read.bwa.read1.fastq.out
wget http://soap.genomics.org.cn/down/soap2sam.tar.gz
tar -xzf soap2sam.tar.gz
chmod 755 soap2sam.pl
soap2sam.pl read.bwa.read1.fastq.out > read.bwa.read1.fastq.sam
samtools view -T hg19.fa -b -S read.bwa.read1.fastq.sam > read.bwa.read1.fastq.bam
samtools sort read.bwa.read1.fastq.bam read.bwa.read1.fastq_sorted
rm read.bwa.read1.fastq.bam read.bwa.read1.fastq.sam read.bwa.read1.fastq.out
assess.pl read.bwa.read1.fastq_sorted.bam
Total: 4633751
Correct: 3179312
Incorrect: 1454439

Let’s try BFAST

bfast match -n 16 -f hg19.fa -r read.bwa.read1.fastq > read_match.bmf
bfast localalign -n 16 -f hg19.fa -m read_match.bmf > read_align.baf
bfast postprocess -n 16 -f hg19.fa -i read_align.baf > read.sam
samtools view -bS read.sam > read.bam
samtools sort read.bam read_sorted
rm read.bam read.sam read_match.bmf read_align.baf
assess.pl read_sorted.bam
Total: 4724573
Correct: 2871241
Incorrect: 1853332

Let’s try mrFAST:

#this was extremely slow
#mrfast --search ~/genome/mrfast_index/hg19.fa --best -e 2 --seq read.bwa.read1.fastq -o read.bwa.read1.fastq.sam
#so I split the fastq file into 16 parts
#I can confirm that the results of splitting and not splitting
#are exactly the same
split -l 1250000 read.bwa.read1.fastq read_bwa_read_fastq_split_
ls -1
read.bwa.read1.fastq
read_bwa_read_fastq_split_aa
read_bwa_read_fastq_split_ab
read_bwa_read_fastq_split_ac
read_bwa_read_fastq_split_ad
read_bwa_read_fastq_split_ae
read_bwa_read_fastq_split_af
read_bwa_read_fastq_split_ag
read_bwa_read_fastq_split_ah
read_bwa_read_fastq_split_ai
read_bwa_read_fastq_split_aj
read_bwa_read_fastq_split_ak
read_bwa_read_fastq_split_al
read_bwa_read_fastq_split_am
read_bwa_read_fastq_split_an
read_bwa_read_fastq_split_ao
read_bwa_read_fastq_split_ap
#and used GNU parallel (http://www.gnu.org/software/parallel/)
parallel mrfast --search ~/genome/mrfast_index/hg19.fa --best -e 2 --seq read_bwa_read_fastq_split_a{} -o read_bwa_read_fastq_split_a{}.sam ::: {a..p}
cat *.sam > blah
samtools view -T ~/genome/hg19.fa -b -S blah > read.bwa.read1.fastq.bam
samtools sort read.bwa.read1.fastq.bam read.bwa.read1.fastq_sorted
assess.pl read.bwa.read1.fastq_sorted.bam
Total: 3890753
Correct: 2076293
Incorrect: 1814460

And lastly, let’s try LAST:

#using the typical usage from http://last.cbrc.jp/doc/last-map-probs.html
lastal -Q1 -e120 hg19 read.bwa.read1.fastq | last-map-probs.py -s150 > hit.maf
#OR using GNU parallel
parallel --pipe -L4 "lastal -Q1 -e120 hg19 - | last-map-probs.py -s150" < read.bwa.read1.fastq > hit_parallel.maf
cat hit.maf | grep "^q" | wc
2199870 6599610 243347994
#this convert script doesn't add the necessary SAM headers
maf-convert.py sam hit.maf > hit.sam
#get header from the bowtie bam file
samtools view -H bowtie/read.bwa.read1_sorted.bam > header
#convert to bam
cat header hit.sam | samtools view -bS - > hit.bam
samtools sort hit.bam hit_sorted
rm hit.sam hit.bam
#LAST does local alignment by default so alignments may be clipped
#adjusted assess.pl script by adding
#if ($cigar =~ /^(\d+)H/){ $original += $1; }
assess_last.pl hit_sorted.bam
Total: 2199866
Correct: 2189556
Incorrect: 10310

Summary

Program Total Correct Incorrect Percent correct
BWA 4724573 3183343 1541230 67.4
Bowtie 4724573 3174961 1549612 67.2
SOAPaligner 4633751 3179312 1454439 68.6
BFAST 4724573 2871241 1853332 60.8
mrFAST 3890753 2076293 1814460 53.4
LAST 2199866 2189556 10310 99.5
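The "Percent correct" column is just Correct / Total * 100, and can be recomputed from the summary table with awk:

```shell
# Recompute percent correct (column 3 / column 2 * 100) for each aligner
printf '%s\n' 'BWA 4724573 3183343' 'Bowtie 4724573 3174961' \
              'SOAPaligner 4633751 3179312' 'BFAST 4724573 2871241' \
              'mrFAST 3890753 2076293' 'LAST 2199866 2189556' \
  | awk '{ printf "%s %.1f\n", $1, $3 / $2 * 100 }'
```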

Let’s plot this in R:

data <- data.frame(BWA=c(3183343,1541230), Bowtie=c(3174961,1549612), SOAPaligner=c(3179312,1454439), BFAST=c(2871241,1853332), mrFAST=c(2076293,1814460), LAST=c(2189556,10310))
rownames(data) <- c("Correct","Incorrect")
data
              BWA  Bowtie SOAPaligner   BFAST  mrFAST    LAST
Correct   3183343 3174961     3179312 2871241 2076293 2189556
Incorrect 1541230 1549612     1454439 1853332 1814460   10310
par(mar=c(5.1, 4.1, 4.1, 3), xpd=TRUE)
barplot(as.matrix(data), col=c(3,2), ylim=c(0,5e6))
legend(6.2, 5e06,rownames(data),fill=c(3,2))

[Figure: stacked barplot of correct versus incorrect alignments for each program]

LAST doesn't always align repeats, but when it does, it gets almost all of them correct!

Conclusions

This post gives a rough idea of how well short read aligners perform when tasked with aligning reads from repetitive parts of the genome. I tried to use the default settings of each program unless they seemed unreasonable for this task: mrFAST reports every single alignment by default, which would use a huge amount of space for these repetitive reads, and SOAPaligner allows up to 5 mismatches. Of particular note is LAST, which was extremely accurate, though it only aligned around 47% of the reads.

In the end, the best way to deal with multi-mapping reads is simply to have longer reads coupled with paired-end sequencing.

I have uploaded the gzipped FASTQ file (171M) that I have used for this post on my server.




This work is licensed under a Creative Commons Attribution 4.0 International License.
Comments

Yeah it was really silly of me to leave out the system time; several people have asked about it. I will update this post one day with the running times. If you do plan on running LAST, definitely use it with GNU parallel.
