Aligning to unique regions

Post updated on 10 September 2013 after receiving input from the author of LAST.

I’ve been interested in aligning reads to the repetitive portion of the human genome; in this post I’ll look into how well different short read alignment programs perform when aligning to unique regions of the genome. First, to find unique regions in the genome, I will use the CRG mappability track “wgEncodeCrgMapabilityAlign36mer.bigWig”, which is available from the UCSC Genome Browser. Have a look at my post on the ENCODE mappability tracks for more background information.

#the extremely useful stats program is available
#https://github.com/arq5x/filo
cat wgEncodeCrgMapabilityAlign36mer.bedGraph | awk '$4 == 1 {print $3-$2}' | stats
Total lines:            15365566
Sum of lines:           2176355877
Ari. Mean:              141.638510224745
Geo. Mean:              15.0511330575994
Median:                 12
Mode:                   1 (N=1862327)
Anti-Mode:              6344 (N=1)
Minimum:                1
Maximum:                30513
Variance:               266606.806747189
StdDev:                 516.339817123558

From the above we see that:

  1. The longest stretch of unique nucleotides is 30513 + 35 bp
  2. The average length (arithmetic mean) of all unique regions is 142 + 35 bp
  3. There were 15,365,566 unique regions of variable length
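Since each bedGraph interval marks the start positions of uniquely mapping 36-mers, a run of N start positions actually spans N + 35 genomic bases. As a rough sketch (assuming the same four-column bedGraph layout used in the awk command above), the total number of bases covered by unique 36-mers can be estimated like this; note it slightly overcounts, since the 35 bp tails of nearby runs can overlap non-unique regions:

```shell
# each interval of N uniquely-mapping 36-mer start positions
# spans N + 35 bases of the genome
awk '$4 == 1 {sum += $3 - $2 + 35} END {print sum}' wgEncodeCrgMapabilityAlign36mer.bedGraph
```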

I’m going to take unique regions that are longer than a certain number of bp and generate reads from these regions.

cat wgEncodeCrgMapabilityAlign36mer.bedGraph | awk '$4 == 1 && $3-$2>500 {OFS="\t"; print $1, $2, $3}' | wc
1027631 3082893 24586059

cat wgEncodeCrgMapabilityAlign36mer.bedGraph | awk '$4 == 1 && $3-$2>1000 {OFS="\t"; print $1, $2, $3}' | wc
 604130 1812390 14459696

cat wgEncodeCrgMapabilityAlign36mer.bedGraph | awk '$4 == 1 && $3-$2>1000 {OFS="\t"; print $1, $2, $3}' > wgEncodeCrgMapabilityAlign36mer_unique_1kb.bed

fastaFromBed -fi ~/genome/hg19.fa -bed wgEncodeCrgMapabilityAlign36mer_unique_1kb.bed -fo wgEncodeCrgMapabilityAlign36mer_unique_1kb.fa

#generate 1 million 36 bp reads from these unique regions
#dwgsim also simulates error rates and mutations by default
dwgsim -1 36 -2 0 -N 1000000 wgEncodeCrgMapabilityAlign36mer_unique_1kb.fa read

#how many random reads were generated?
cat read.bwa.read1.fastq | grep "@rand" | wc
  50859   50859 1729206

First let’s align using BWA

bwa aln -t 8 -n 2 hg19.fa read.bwa.read1.fastq > read.bwa.read1.sai
bwa samse hg19.fa read.bwa.read1.sai read.bwa.read1.fastq > read.bwa.read1.sam
samtools view -S -b read.bwa.read1.sam > read.bwa.read1.bam
samtools sort read.bwa.read1.bam read.bwa.read1_sorted

samtools flagstat read.bwa.read1_sorted.bam
1000000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
912880 + 0 mapped (91.29%:nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (nan%:nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (nan%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

#how many reads from unique regions were not aligned?
samtools view -f 4 read.bwa.read1_sorted.bam | grep -v "^rand" | wc
  36261  398871 5164969
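As a sanity check, the flagstat and random-read counts are consistent: of the 1,000,000 − 912,880 = 87,120 unmapped reads, removing the 36,261 that came from unique regions leaves exactly the 50,859 random reads that dwgsim generated, which suggests none of the random reads were mapped:

```shell
# unmapped reads (flagstat) minus unmapped unique-region reads
# should leave only the dwgsim random reads (50,859)
echo $((1000000 - 912880 - 36261))
```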

Let’s try BFAST (for the indexing process see my BFAST wiki):

bfast match -f hg19.fa -r read.bwa.read1.fastq > read_match.bmf
#not sure about the reason for not processing all the reads
#from the BFAST output
#Reading read.bwa.read1.fastq into a temp file.
#Will process 999049 reads.
bfast localalign -f hg19.fa -m read_match.bmf > read_align.baf
bfast postprocess -f ~/genome/bfast_index/hg19.fa -i read_align.baf > read.sam
samtools view -bS read.sam > read.bam
samtools sort read.bam read_sorted
rm read.bam read.sam

samtools flagstat read_sorted.bam
999909 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
992879 + 0 mapped (99.30%:nan%)
1780 + 0 paired in sequencing
890 + 0 read1
890 + 0 read2
0 + 0 properly paired (0.00%:nan%)
1350 + 0 with itself and mate mapped
195 + 0 singletons (10.96%:nan%)
1296 + 0 with mate mapped to a different chr
517 + 0 with mate mapped to a different chr (mapQ>=5)

samtools view -f 4 bfast/read_sorted.bam | grep -v "^rand" | wc
     90    1530   18118

Now let’s try Bowtie

bowtie -p 8 --best -S hg19 read.bwa.read1.fastq > read.bwa.read1.sam
samtools view -bS read.bwa.read1.sam > read.bwa.read1.bam
samtools sort read.bwa.read1.bam read.bwa.read1_sorted
rm read.bwa.read1.sam read.bwa.read1.bam
samtools flagstat read.bwa.read1_sorted.bam
1000000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
924705 + 0 mapped (92.47%:nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (nan%:nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (nan%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

samtools view -f 4 read.bwa.read1_sorted.bam | grep -v "^rand" | wc
  24436  293232 3700400

And lastly, let’s try LAST

#indexing the genome
lastdb -m1111110 hg19 hg19.fa
#using the typical usage from http://last.cbrc.jp/doc/last-map-probs.html
lastal -Q1 -e120 hg19 read.bfast.fastq | last-map-probs.py -s150 > hit2.maf
#or using GNU parallel
parallel --pipe -L4 "lastal -Q1 -e120 hg19 - | last-map-probs.py -s150" < read.bfast.fastq > hit_parallel.maf
#most of the reads were mapped
cat hit2.maf | grep "^q" | wc
 939861 2819583 108380531
#this convert script doesn't add the necessary SAM headers
maf-convert.py sam hit2.maf > hit2.sam
#get header from the bowtie bam file
samtools view -H bowtie/read.bwa.read1_sorted.bam > header
#convert to bam
cat header hit2.sam | samtools view -bS - > hit2.bam
samtools sort hit2.bam hit2_sorted
rm hit2.sam hit2.bam

Summaries

I have this script that tallies the number of correctly and incorrectly mapped reads (note that an unmapped read is also counted as incorrectly mapped):

#!/usr/bin/env perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.bam>\n";
my $infile = shift or die $usage;

my $correct = 0;
my $incorrect = 0;
my $total = 0;

open(IN,'-|',"samtools view $infile") || die "Could not open $infile: $!\n";
while(<IN>){
        chomp;
        my ($qname,$flag,$rname,$pos,$mapq,$cigar,$mrnm,$mpos,$isize,$seq,$qual,$tag,$vtype,$value) = split;
        #print "$qname\n";
        #chrX:155260021-155260463_2_1_1_0_0_0_0:0:0_0:0:0_0
        next if /^rand/;
        ++$total;
        if ($qname =~ /^(\w+):(\d+)-(\d+)_(\d+)/){
                my $chr = $1;
                my $start = $2;
                my $end = $3;
                my $add = $4;
                #print "$chr $start $end $add\n";
                my $original = $start + $add;
                if ($original == $pos && $chr eq $rname){
                        ++$correct;
                } else {
                        ++$incorrect;
                        print "$qname\n";
                }
        } else {
                die "Could not parse $qname\n";
        }
}
close(IN);

warn "Total: $total\nCorrect: $correct\nIncorrect: $incorrect\n";

exit(0);

Now to tally:

#bwa
assess.pl bwa/read.bwa.read1_sorted.bam > bwa_incorrect.txt
Total: 949141
Correct: 912751
Incorrect: 36390
#bowtie
assess.pl bowtie/read.bwa.read1_sorted.bam > bowtie_incorrect.txt
Total: 949141
Correct: 924527
Incorrect: 24614
#bfast
assess.pl bfast/read_sorted.bam > bfast_incorrect.txt
Total: 949141
Correct: 942538
Incorrect: 6603
#for LAST, the author made me aware of this:
#http://bioinformatics.oxfordjournals.org/content/28/18/i325.long/reply
#(see the comments below as well)
#so in assess_last.pl I included a line that works out where the local alignment starts
#if ($cigar =~ /^(\d+)H/){ $original += $1; }
assess_last.pl hit2_sorted.bam > last_incorrect.txt
Total: 939857
Correct: 939478
Incorrect: 379
#the /1 suffix is stripped by bwa and last
#strip it from the other two incorrect files
cat bfast_incorrect.txt | sed 's/\/1$//' > blah
mv -f blah bfast_incorrect.txt
cat bowtie_incorrect.txt | sed 's/\/1$//' > blah
mv -f blah bowtie_incorrect.txt
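For easier comparison, the tallies above can be turned into percentages of the 949,141 non-random reads; a quick sketch (the counts are just the numbers reported above):

```shell
# percent of the 949,141 non-random reads mapped correctly, per aligner
for tally in bwa:912751 bowtie:924527 bfast:942538 last:939478; do
  echo "$tally" | awk -F':' '{ printf "%s\t%.2f%%\n", $1, $2 / 949141 * 100 }'
done
```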

Were the reads that were mapped incorrectly consistent among the different short read aligners?

library(gplots)
bwa <- read.table("bwa_incorrect.txt",header=F,stringsAsFactors=F)
bowtie <- read.table("bowtie_incorrect.txt",header=F,stringsAsFactors=F)
bfast <- read.table("bfast_incorrect.txt",header=F,stringsAsFactors=F)
last <- read.table("last_incorrect.txt", header=F, stringsAsFactors=F)
VennList <- list(bwa=bwa$V1, bowtie=bowtie$V1, bfast=bfast$V1, last=last$V1)
venn(VennList)

[Venn diagram of incorrectly mapped reads for BFAST, Bowtie, BWA and LAST]

As Bowtie and BWA both use the Burrows-Wheeler transform to index the genome, they share a large portion of incorrectly mapped reads. LAST performed best in terms of accuracy: of the 939,857 reads it aligned, only 379 were mapped incorrectly.
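The pairwise overlaps can also be spot-checked on the command line without R. Assuming each *_incorrect.txt file contains one read name per line, something like:

```shell
# number of reads mis-mapped by both BWA and Bowtie
# (comm requires sorted input; -12 keeps only lines common to both files)
sort bwa_incorrect.txt > bwa_incorrect.sorted
sort bowtie_incorrect.txt > bowtie_incorrect.sorted
comm -12 bwa_incorrect.sorted bowtie_incorrect.sorted | wc -l
```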

Conclusions

I generated 1,000,000 reads from unique regions in the genome, as determined by the CRG mappability track, using the DWGSIM program (written by the same person who developed BFAST). DWGSIM also generates random reads and simulates errors in the reads by default; for this run, there were 50,859 random reads, leaving us with 949,141 reads. These 949,141 reads were aligned to the genome by the four alignment programs, with BFAST having the fewest incorrectly mapped reads (6,603). I don’t report the exact running times, though I should have (my somewhat poor justification is that I didn’t care how much time was taken, as long as the correct alignments were returned).

These tools would no doubt work better if I tweaked their individual settings; here I’ve just used the default settings for each tool as outlined in their respective manuals. All of the short read aligners could align the majority of unique reads back to the genome. I had a brief look at the reads that were either incorrectly mapped or unmapped, and they were the extreme cases where three or more simulated sequencing errors had been introduced into the read (the simulation program introduces errors probabilistically), so this makes sense.
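This can be checked directly from the read names, since dwgsim encodes its simulation parameters there. In the example name from the tally script above (chrX:155260021-155260463_2_1_1_0_0_0_0:0:0_0:0:0_0), I believe the first colon-separated triplet holds the number of sequencing errors, SNPs and indels for read 1, but check the dwgsim documentation to be sure. Under that assumption:

```shell
# count incorrectly mapped reads with >= 3 simulated sequencing errors,
# ASSUMING the 8th underscore-delimited field of the dwgsim read name
# is errors:SNPs:indels for read 1
awk -F'_' '{ split($8, e, ":"); if (e[1] + 0 >= 3) n++ } END { print n + 0 }' bwa_incorrect.txt
```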

I will be looking into how well these short read alignment programs align to repetitive regions soon (update I’ve done this).

And finally I have included the set of reads I analysed here.




This work is licensed under a Creative Commons Attribution 4.0 International License.
12 comments
  1. Good post, as usual! For your next post (and maybe as well as for this one), I wonder how LAST (http://last.cbrc.jp/) would perform. It is supposed to be quite fast, and yet efficiently handle repetitive regions, but I don’t have experience with it. (Also, I know I may be asking too much… but next time, could you add “time” in front of the executable when launching it so that one knows how much time it took?)

    1. Hello,

      Thanks for the comment. Yes actually I heard the same thing regarding LAST and dealing with repetitive regions (the developer did recommend it to us since he collaborates with our centre). I’ll give it a go.

      Yeah sure I can include the running times in future posts, no problem.

      Cheers,

      Dave

        1. I’ve included LAST in the analysis and the results were suboptimal. This is the first time I’ve used LAST, and as I stated in the post, I simply followed the example in the tutorial.

          I’ve linked the fastq files I generated for this post and perhaps you could download them and also try to align them using LAST. I wonder if I did something wrong.

          1. I downloaded the fastq file, aligned it as you did and got exactly the same results. The default "-e 40" returns 818,438 matches.
            Then, I followed the tutorial where it indicates how to choose the score threshold:
            $ lastdb -x hg19_tmp allhg19_norandom.fasta
            $ awk 'BEGIN{RS="@"} {if(NF==0)next; split($0,a,"\n"); split(a[1],b,"_"); printf ">%s\n%s\n",b[1],a[2]}' read.bwa.read1.fastq \
            > read.bwa.read1.fa
            $ lastdb -x reads_tmp read.bwa.read1.fa
            $ lastex hg19_tmp.prj reads_tmp.prj
            I hence chose "-e 37" and got 202,470,806 matches.
            As advised in the README of LAST, it is recommended to contact the authors with any questions. This may be a good occasion?
            Also, it would be nice if the program could read gzipped fasta files and compile with OpenMP to map reads more quickly (neither should be difficult to do).

          2. Hi,

            very nice posts! A couple of comments on LAST.

            1. It would be better to use the “typical usage” described here:
            http://last.cbrc.jp/doc/last-map-probs.html

            2. The main reason you’re seeing incorrect mappings:
            Your script checks the coordinate of the read’s first base, but LAST does local alignment (by default), which might start at the 2nd base. See here:
            http://bioinformatics.oxfordjournals.org/content/28/18/i325.long/reply

            Replies to tf: You can read gzipped files via “zcat” (http://last.cbrc.jp/doc/lastal.txt). We use GNU parallel instead of OpenMP.

            1. Hi Martin,

              Thank you for the compliment and the comments. I redid the post using the typical usage that you suggested and indeed a lot more reads could be aligned. I also recalculated the incorrectly mapped reads by taking into account what you mentioned regarding the alignment start and the results are much better.

              I always appreciate it when the developer of the software contacts me and offers suggestions. So thanks again!

              Cheers,

              Dave

            2. Thanks Martin for the tips about zcat and parallel. Following Dave’s command line, I guess the final result would look like this:
              zcat read.bfast.fastq.gz | parallel --pipe -L4 "lastal -Q1 -e120 hg19 - | last-map-probs.py -s150" | gzip > hits.maf.gz
              (I tested it and it works nicely.)
              And thanks Dave for updating your post! (It would also be very nice to have LAST in your comparison of aligners for mapping repeats 😉

              1. You’re welcome. I’ve updated this post and the “comparison of alignment programs when mapping repeats” post. LAST is very accurate.

  2. Hi Dave,

    wow that was quick! But I think the assessment still isn’t perfect, because the alignment might start at not only the read’s 2nd base, but also the 3rd, 4th, etc. How about adding this to assess.pl:

    if ($cigar =~ /^(\d+)H/){ $original += $1; }

    By the way, I only saw your post by luck, so feel free to contact me if you’d like suggestions!

  3. Hi Martin,

    Ah I see. Looking at the clipping to see where the alignment starts is definitely the best approach; thank you for the code. I’ll update the numbers soon.

    And thank you for the offer. I actually could use some suggestions for the work I’m doing as part of FANTOM5, if you have time of course!

    Cheers,

    Dave
