How mappable is a specific repeat?

If you've ever wondered how mappable a specific repeat is, here's a quick post on creating a plot showing the mappability of a repetitive element along its consensus sequence. Specifically, I took the consensus sequence of a repeat, created sub-sequences using a sliding window that moves along the sequence in 1 bp steps, and mapped these sub-sequences to hg19.

I will use bwa for the mapping, so firstly download and compile bwa:

wget http://sourceforge.net/projects/bio-bwa/files/latest/bwa-0.7.7.tar.bz2
tar -xjf bwa-0.7.7.tar.bz2
cd bwa-0.7.7
make

#index hg19
bwa index hg19.fa
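
With bwa compiled and the genome indexed, the sliding window step could look something like the sketch below; the consensus file name (consensus.fa), the 36 bp window size and the use of bwa aln are my assumptions rather than the exact settings behind the plot.

#slide a window along the consensus in 1 bp steps
#consensus.fa (a single-sequence fasta file) and the 36 bp window are assumptions
perl -nle 'next if /^>/; $seq .= $_; END { for my $i (0 .. length($seq) - 36){ print ">window_", $i + 1, "\n", substr($seq, $i, 36) } }' consensus.fa > windows.fa

#map the windows back to hg19
bwa aln hg19.fa windows.fa > windows.sai
bwa samse hg19.fa windows.sai windows.fa > windows.sam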


Mapping repeats 2

Updated 10th September 2013 to include LAST

I previously looked at mapping repeats with respect to sequencing errors in high throughput sequencing and, as one would expect, the accuracy of the mapping decreased when sequencing errors were introduced. I then looked at aligning to unique regions of the genome to get an idea of how different short read alignment programs perform with reads that should map uniquely. Here I combine the two ideas, to see how different short read alignment programs perform when mapping repeats.

When I wrote my first mapping repeats post, I was made aware of this review article on "Repetitive DNA and next-generation sequencing: computational challenges and solutions" via Twitter (thank you CB). It was also his suggestion that it would be interesting to compare different short read alignment programs with respect to mapping repeats, hence this post.


Aligning to unique regions

Post updated on the 10th September 2013 after receiving input from the author of LAST

I've been interested in aligning reads to the repetitive portion of the human genome; in this post I'll look into how well different short read alignment programs perform when aligning to unique regions of the genome. Firstly, to find unique regions in the genome, I will use the CRG mappability track "wgEncodeCrgMapabilityAlign36mer.bigWig", which is available from the UCSC Genome Browser. Have a look at my post on the ENCODE mappability tracks for more background information.

#the extremely useful stats program is available at
#https://github.com/arq5x/filo
#the bigWig track was first converted to bedGraph with bigWigToBedGraph
cat wgEncodeCrgMapabilityAlign36mer.bedGraph | awk '$4 == 1 {print $3-$2}' | stats
Total lines:            15365566
Sum of lines:           2176355877
Ari. Mean:              141.638510224745
Geo. Mean:              15.0511330575994
Median:                 12
Mode:                   1 (N=1862327)
Anti-Mode:              6344 (N=1)
Minimum:                1
Maximum:                30513
Variance:               266606.806747189
StdDev:                 516.339817123558

From the above we see that:

  1. The longest stretch of unique nucleotides is 30513 + 35 bp
  2. The average length (arithmetic mean) of all unique regions is 142 + 35 bp
  3. There were 15,365,566 unique regions of variable length

I'm going to take unique regions that are longer than a certain number of bp and generate reads from these regions.
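
As a rough sketch of that filtering step, something like the following would do; the 200 bp cut-off and the file names are arbitrary choices for illustration.

#keep regions with a mappability score of 1 that are longer than 200 bp
awk '$4 == 1 && $3 - $2 > 200 {print $1"\t"$2"\t"$3}' wgEncodeCrgMapabilityAlign36mer.bedGraph > unique_regions.bed

#extract the corresponding sequence for generating reads
fastaFromBed -fi hg19.fa -bed unique_regions.bed -fo unique_regions.fa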


ENCODE mappability and repeats

The ENCODE mappability tracks can be visualised on the UCSC Genome Browser and they provide a sense of how mappable a region of the genome is in terms of short reads or k-mers. On a side note, it seems some people use "mapability" and some use "mappability"; I was taught the CVC rule, so I'm going to stick with "mappability".

There are two sets of tracks, alignability and uniqueness, but what's the difference between the two? From the UCSC Genome Browser:

Alignability - These tracks provide a measure of how often the sequence found at the particular location will align within the whole genome. Unlike measures of uniqueness, alignability will tolerate up to 2 mismatches. These tracks are in the form of signals ranging from 0 to 1 and have several configuration options.

Uniqueness - These tracks are a direct measure of sequence uniqueness throughout the reference genome. These tracks are in the form of signals ranging from 0 to 1 and have several configuration options.

Let's take a look at two examples, where I try to use k-mers of similar size:

#download the files
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeDukeMapabilityUniqueness35bp.bigWig
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeCrgMapabilityAlign36mer.bigWig
#bigWigToBedGraph can be downloaded at http://hgdownload.cse.ucsc.edu/admin/exe/
#convert into flat format
bigWigToBedGraph wgEncodeDukeMapabilityUniqueness35bp.bigWig wgEncodeDukeMapabilityUniqueness35bp.bedGraph
bigWigToBedGraph wgEncodeCrgMapabilityAlign36mer.bigWig wgEncodeCrgMapabilityAlign36mer.bedGraph
head wgEncodeDukeMapabilityUniqueness35bp.bedGraph
chr1    0       10145   0
chr1    10145   10160   1
chr1    10160   10170   0.5
chr1    10170   10175   0.333333
chr1    10175   10177   0.5
chr1    10177   10215   0
chr1    10215   10224   1
chr1    10224   10229   0.5
chr1    10229   10248   1
chr1    10248   10274   0
head wgEncodeCrgMapabilityAlign36mer.bedGraph
chr1    10000   10078   0.0013624
chr1    10078   10081   0.0238095
chr1    10081   10088   0.0185185
chr1    10088   10089   0.0147059
chr1    10089   10096   0.0185185
chr1    10096   10097   0.0238095
chr1    10097   10099   0.0185185
chr1    10099   10102   0.00172712
chr1    10102   10120   0.0013624
chr1    10120   10121   0.00172712

The bedGraph format simply associates a score (fourth column) with a particular region, as defined in the first three columns. At first glance you can already see the difference in the precision of the scores. The scores were calculated in the same way, i.e. 1 divided by the number of places a 35/36mer maps to in the genome; however, the uniqueness track only keeps scores for k-mers that map to at most 4 places (a score of 0.25 or higher). So according to the CRG alignability track, the 36mer at chr1:10000-10035 maps to roughly 734 places (since 1/734 =~ 0.0013624).
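
Since the score is simply the reciprocal of the number of matching locations, the conversion is easy to check; here's a quick one-liner run on the first line of the CRG track:

#back-calculate the number of genomic locations from the score, i.e. 1 / score
head -1 wgEncodeCrgMapabilityAlign36mer.bedGraph | awk '$4 > 0 {printf "%s\t%s\t%s\t%.0f\n", $1, $2, $3, 1/$4}'
chr1    10000   10078   734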


Mapping repeats

Most eukaryotic genomes are interspersed with repetitive elements and some of these elements have transcriptional activity, hence they appear when we sequence the RNA population. From the look of things, some of these elements seem to be important. One strategy for analysing these repeats is to map them to the genome, to see where they came from and which repeat class they arose from. This post looks into mapping repeats and how sequencing accuracy can affect the mapping accuracy.

I will use the human genome as an example; according to RepeatMasker and Repbase, there are 5,298,130 annotated repetitive elements in the human genome. How much of the genome is that? First, download the RepeatMasker annotations for hg19 from the UCSC Table Browser. I've downloaded the results as a bed file and named it hg19_rmsk.bed.

#the extremely useful stats program is available at
#https://github.com/arq5x/filo
cat ~/ucsc/hg19_rmsk.bed | perl -nle '@a=split; $s=$a[2]-$a[1]; print $s' | stats
Total lines:            5298130
Sum of lines:           1467396988
Ari. Mean:              276.965077867097
Geo. Mean:              168.518495379209
Median:                 188
Mode:                   21 (N=44789)
Anti-Mode:              3849 (N=1)
Minimum:                6
Maximum:                160602
Variance:               216904.549201035
StdDev:                 465.730124858845

In the above, I piped the entire bed file to Perl, which subtracted the start coordinate from the end coordinate to get the length of each repeat, and passed these lengths to the stats program, which calculated simple statistics. The total number of lines corresponds to the number of repetitive elements, which make up 1,467,396,988 bp of the hg19 genome; that's around half of the genome. Now to convert this bed file into a fasta file and randomly sample 5 million reads from the repeats.
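
As a rough sketch of those two steps (the fasta conversion and the sampling), something like the following would work; hg19.fa, the output names and the use of seqtk for the sampling are my assumptions, and the read simulation in the full post may differ.

#convert the RepeatMasker bed file into a fasta file of repeat sequences
fastaFromBed -fi hg19.fa -bed ~/ucsc/hg19_rmsk.bed -fo hg19_rmsk.fa

#one way to randomly sample 5 million sequences; seqtk (https://github.com/lh3/seqtk)
#is an assumption on my part and -s just sets the random seed
seqtk sample -s 31 hg19_rmsk.fa 5000000 > hg19_rmsk_sample.fa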


Mapping long reads with Bowtie

Just a simple test to see if Bowtie can map long reads. Why? Well, Bowtie is fast, so I want to see if I can also use it as a general purpose aligner. In a previous post I was characterising the mappability of the genome. From this I selected a portion of the genome that is unique, chr1:19472012-19482647 (10,635 bp). Then I made a bed file spanning different lengths of this portion:

chr1 19472013 19472043
chr1 19472013 19472113
chr1 19472013 19472213
chr1 19472013 19472513
chr1 19472013 19473013
chr1 19472013 19474013
chr1 19472013 19477013
chr1 19472013 19482013

Then using fastaFromBed:

fastaFromBed -fi /analysisdata/genomes/hg19_male.fa -bed unique.bed -fo unique.fa

So there are 8 sequences, with lengths 30, 100, 200, 500, 1,000, 2,000, 5,000 and 10,000 bp.

First let's try to map these using blat.

blat -t=dna -q=dna -minIdentity=100 hg19.2bit unique.fa unique.psl
Loaded 3137161264 letters in 93 sequences
Searched 18830 bases in 8 sequences
cat unique.psl
psLayout version 3

match   mis-    rep.    N's     Q gap   Q gap   T gap   T gap strand    Q               Q       Q       Q       T               T       T       T       block   blockSizes      qStarts  tStarts
        match   match           count   bases   count   bases           name            size    start   end     name            size    start   end     count
---------------------------------------------------------------------------------------------------------------------------------------------------------------
100     0       0       0       0       0       0       0     + chr1:19472013-19472113  100     0       100     chr1    249250621       19472013        19472113        1       100,    0,  19472013,
200     0       0       0       0       0       0       0     + chr1:19472013-19472213  200     0       200     chr1    249250621       19472013        19472213        1       200,    0,  19472013,
500     0       0       0       0       0       0       0     + chr1:19472013-19472513  500     0       500     chr1    249250621       19472013        19472513        1       500,    0,  19472013,
1000    0       0       0       0       0       0       0     + chr1:19472013-19473013  1000    0       1000    chr1    249250621       19472013        19473013        1       1000,   0,  19472013,
2000    0       0       0       0       0       0       0     + chr1:19472013-19474013  2000    0       2000    chr1    249250621       19472013        19474013        1       2000,   0,  19472013,
5000    0       0       0       0       0       0       0     + chr1:19472013-19477013  5000    0       5000    chr1    249250621       19472013        19477013        1       5000,   0,  19472013,
10000   0       0       0       0       0       0       0     + chr1:19472013-19482013  10000   0       10000   chr1    249250621       19472013        19482013        1       10000,  0,  19472013,

Now let's see if Bowtie can map these "long reads":

bowtie2-build hg19.fa hg19
bowtie2 -x hg19 -f -U unique.fa -S unique.sam
8 reads; of these:
  8 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    7 (87.50%) aligned exactly 1 time
    1 (12.50%) aligned >1 times
100.00% overall alignment rate
cat unique.sam | grep -v "^@" | cut -f1-4
chr1:19472013-19472043  0       chr3    54652417
chr1:19472013-19472113  0       chr1    19472014
chr1:19472013-19472213  0       chr1    19472014
chr1:19472013-19472513  0       chr1    19472014
chr1:19472013-19473013  0       chr1    19472014
chr1:19472013-19474013  0       chr1    19472014
chr1:19472013-19477013  0       chr1    19472014
chr1:19472013-19482013  0       chr1    19472014

The 30mer was not mapped by blat, but was mapped by Bowtie, albeit to another place (chr3). It looks like it may be possible to map long reads, up to 10,000 bp in length, with Bowtie, at least with these trivial examples.

Things to keep in mind when using Bowtie2

Taken from the Bowtie2 manual.

For an alignment to be considered "valid" (i.e. "good enough") by Bowtie 2, it must have an alignment score no less than the minimum score threshold. The threshold is configurable and is expressed as a function of the read length. In end-to-end alignment mode (a global alignment), the default minimum score threshold is -0.6 + -0.6 * L, where L is the read length. For example, for a 30 bp read the minimum score threshold would be -0.6 + (-0.6 * 30) = -18.6.

A mismatched base at a high-quality position in the read receives a penalty of -6 by default. A length-2 read gap receives a penalty of -11 by default (-5 for the gap open, -3 for the first extension, -3 for the second extension). Thus for a 30 bp read, an alignment is still considered valid with 3 mismatches (-18), or with a length-1 gap plus one mismatch (-8 + -6 = -14). I'm not sure what the threshold is for a high-quality position; however, when aligning without base-call quality scores, every mismatch receives a penalty of -6.
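
To make the arithmetic concrete, here's the threshold calculation for a 30 bp read, restating the numbers above:

#minimum score threshold for a 30 bp read with the default end-to-end settings
perl -le 'my $L = 30; print -0.6 + -0.6 * $L'
-18.6
#three mismatches score 3 * -6 = -18, which is >= -18.6, so the alignment is still valid
#a length-1 gap (-5 open + -3 extend) plus one mismatch scores -8 + -6 = -14, also valid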

The alignment scores are displayed in the SAM output as the AS field.

AS:i:
Alignment score. Can be negative. Can be greater than 0 in --local mode (but not in --end-to-end mode). Only present if SAM record is for an aligned read.

XS:i:
Alignment score for second-best alignment. Can be negative. Can be greater than 0 in --local mode (but not in --end-to-end mode). Only present if the SAM record is for an aligned read and more than one alignment was found for the read.

Other ways to increase sensitivity are to alter the seed length (-L), the interval between extracted seeds (-i), and the number of mismatches permitted per seed (-N). See also -D, which puts an upper limit on the number of dynamic programming problems (i.e. seed extensions) that can "fail" in a row before Bowtie 2 stops searching, and -R, which sets the maximum number of times Bowtie 2 will "re-seed" when attempting to align a read with repetitive seeds.
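
For reference, here's what those options look like on the command line; these particular values simply mirror the --very-sensitive preset rather than being tuned for anything, and the index and input names are carried over from the example further up.

#a more sensitive end-to-end run using the options above
bowtie2 --end-to-end -D 20 -R 3 -N 0 -L 20 -i S,1,0.50 -x hg19 -f -U unique.fa -S unique_sensitive.sam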

Genome mapability

I know of the genome mapability and uniqueness tracks provided by the UCSC Genome Browser but I was just interested in doing this myself for the hg19 genome.

As a test, I investigated chr22, where the base composition is broken down as:

Length of chr22 = 51,304,566
A: 9,094,775
C: 8,375,984
G: 8,369,235
T: 9,054,551
Other: 16,410,021
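
These counts can be reproduced with a quick (if slow) shell pipeline; chr22.fa, containing just chr22, is an assumed file name, and the Ns will show up under their own letter.

#count each base on chr22, treating lower and upper case as the same
grep -v "^>" chr22.fa | tr -d '\n' | tr 'acgt' 'ACGT' | fold -w1 | sort | uniq -c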

I wrote a script to generate 35mer reads by sliding 1 bp along the chromosome:


#!/usr/bin/perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.fa>\n";
my $infile = shift or die $usage;

my $read_length = '35';
my $current_accession = '';
my %seq = ();

#read the fasta file into memory, keyed by accession
open(IN,'<',$infile) || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   next if /^$/;
   if (/^>(.*)$/){
      $current_accession = $1;
      warn "Storing $current_accession\n";
   } else {
      $seq{$current_accession} .= $_;
   }
}
close(IN);

warn "Finished storing genome\n";

#slide a window of $read_length bp along each sequence, 1 bp at a time
foreach my $chr (sort {$a cmp $b} keys %seq){
   my $length = length($seq{$chr});
   my $end = $length - $read_length;
   for (my $i = 0; $i <= $end; ++$i){
      my $read = substr($seq{$chr},$i,$read_length);
      $read = uc($read);
      #skip reads containing ambiguous bases
      next if $read =~ /N/;
      #sanity check: the read should consist solely of A, C, G and T
      die "Unexpected base in read: $read\n" unless $read =~ /^[ACGT]+$/;
      my $chr_start = $i + 1;
      my $chr_end = $i + $read_length;
      my $id = $chr . '_' . $chr_start . '_' . $chr_end;
      print ">$id\n$read\n";
   }
}

exit(0);

Any read that contained an "N" was removed. I then mapped these 35mers back to the hg19 genome using BWA; a sketch of a typical command line is shown below, followed by the mapping statistics.
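
For 35 bp reads a typical bwa aln/samse run looks like this; chr22_35mer.fa (the output of the script above) and the thread count are placeholders, not necessarily the exact settings I used.

#map the 35mers with bwa aln and generate SAM output with bwa samse
bwa aln -t 4 hg19.fa chr22_35mer.fa > chr22_35mer.sai
bwa samse hg19.fa chr22_35mer.sai chr22_35mer.fa > chr22_35mer.sam

#count the mapped reads
samtools view -S -c -F 4 chr22_35mer.sam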

Mapped: 34,894,095
Unmapped: 0
Edit distances
0 34,894,095
Multimap profile
Mapped uniquely 28,615,801
Mapped to 2 locations 1,350,653
Mapped to 3 locations 851,814
Mapped to 4 locations 388,360
Mapped to 5 locations 239,843
Mapped to 6 locations 194,604
Mapped to 7 locations 184,291
Mapped to 8 locations 149,467
Mapped to 9 locations 92,002
Mapped to 10 locations 66,698
Mapped to more than 10 locations 2,760,562

28,615,801 / 34,894,095 (~82%) could be mapped uniquely.

Using the GENCODE exon list and intersectBed, I characterised the uniquely mapping and multi-mapping reads.
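
Commands along these lines would produce the counts below; this is just a sketch and the BED file names for the exons and the two read sets are hypothetical.

#count reads that overlap GENCODE exons; -u reports each read at most once
intersectBed -u -a chr22_multi_reads.bed -b gencode_exons.bed | wc -l
intersectBed -u -a chr22_unique_reads.bed -b gencode_exons.bed | wc -l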

350,930 / 6,278,294 (5.59%) of the multi mapping reads intersected GENCODE exons.

2,441,765 / 28,615,801 (8.53%) of the uniquely mapped reads intersected GENCODE exons; only a slightly higher percentage than the multi-mapping reads.

It seems that most of chr22 consists of unique stretches of nucleotides, and these unique regions are not particularly enriched in exonic sequence compared with the multi-mapping reads. Of the multi-mapping reads that intersected exons, it would be interesting to see how many map to genes and pseudogenes, especially since 1,350,653 reads mapped to exactly two places in the genome.

And lastly by using genomeCoverageBed and by separating out the uniquely mapping reads, we can create a track that shows the unique parts of the genome.

genomeCoverageBed -ibam chr22_read_sorted_unique.bam -g ~/davetang/chrom_size/hg19_chrom_info.txt -d > result

cat result | awk '$3 != 0 {print}' | head
chr22   16050002        1
chr22   16050003        2
chr22   16050004        3
chr22   16050005        4
chr22   16050006        5
chr22   16050007        6
chr22   16050008        7
chr22   16050009        8
chr22   16050010        9
chr22   16050011        10

The third column, which shows the number of reads covering each position, is meaningless due to the way I slid across the chromosome. One can simply change these values into 1s and 0s, where 1 is unique and 0 is not unique.

cat result | perl -nle '@a=split; if ($a[2] != 0){ print "$a[0]\t$a[1]\t1" } else {print}'

Lastly, a note about the Duke uniqueness tracks provided by the UCSC Genome Browser (downloaded with the wget command shown earlier), since they're almost the same as what I showed above. For the 35mer track, here's part of the track:

chr1 68849 68859 0.333333
chr1 68859 68861 0.5
chr1 68861 68896 1
chr1 68896 69099 0.333333

I checked all the scores (fourth column) and they are all either 0, 0.25, 0.333333, 0.5 or 1.
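
One way to check is to tabulate the fourth column of the bedGraph file created earlier:

#tabulate all the values in the fourth column
cut -f4 wgEncodeDukeMapabilityUniqueness35bp.bedGraph | sort | uniq -c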

0 = mapped to more than 4 places
0.25 = mapped to 4 places
0.33 = mapped to 3 places
0.5 = mapped to 2 places
1 = mapped to 1 place

The scores are assigned to the first base of the sequence. Using the example above:

chr1 68859 68861 0.5

That should mean that chr1:68859-68894 and chr1:68860-68895 and chr1:68861-68896 all map to two places in the genome. To test this I extracted the sequence at chr1:68859-68894, which is

GAATTATAAGGCATAGCCAGGGAAGTAGTGCGAGAT

And blat returns three hits, with two best hits:

   ACTIONS      QUERY           SCORE START  END QSIZE IDENTITY CHRO STRAND  START    END      SPAN
---------------------------------------------------------------------------------------------------
browser details YourSeq           36     1    36    36 100.0%    19   +     110447    110482     36
browser details YourSeq           36     1    36    36 100.0%     1   +      68859     68894     36
browser details YourSeq           35     1    35    36 100.0%    15   -  102463460 102463494     35

And another test for chr1:68860-68895, which is the sequence AATTATAAGGCATAGCCAGGGAAGTAGTGCGAGATA; here is the blat result:

   ACTIONS      QUERY           SCORE START  END QSIZE IDENTITY CHRO STRAND  START    END      SPAN
---------------------------------------------------------------------------------------------------
browser details YourSeq           36     1    36    36 100.0%    19   +     110448    110483     36
browser details YourSeq           36     1    36    36 100.0%     1   +      68860     68895     36
browser details YourSeq           34     1    36    36  97.3%    15   -  102463458 102463493     36

Here's a slide I made for the Duke uniqueness track using 20 bp windows: