Genome mapability

I know of the genome mapability and uniqueness tracks provided by the UCSC Genome Browser but I was just interested in doing this myself for the hg19 genome.

As a test, I investigated chr22, where the base composition is broken down as:

Length of chr22 = 51,304,566
A: 9,094,775
C: 8,375,984
G: 8,369,235
T: 9,054,551
Other: 16,410,021

I wrote a script to generate 35mer reads by sliding 1 bp along the chromosome:


#!/usr/bin/perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.fa>\n";
my $infile = shift or die $usage;

my $read_length = '35';
my $current_accession = '';
my %seq = ();

open(IN,'<',$infile) || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   next if /^$/;
   if (/^>(.*)$/){
      $current_accession = $1;
      warn "Storing $current_accession\n";
   } else {
      $seq{$current_accession} .= $_;
   }
}
close(IN);

warn "Finished storing genome\n";

foreach my $chr (sort {$a cmp $b} keys %seq){
   my $length = length($seq{$chr});
   my $end = $length - $read_length;
   for (my $i = 0; $i <= $end; ++$i){
      my $read = substr($seq{$chr},$i,$read_length);
      $read = uc($read);
      next if $read =~ /N/;
      die "$read\n" unless $read =~ /[ATGC]+/;
      my $chr_start = $i + 1;
      my $chr_end = $i + $read_length;
      my $id = $chr . '_' . $chr_start . '_' . $chr_end;
      print ">$id\n$read\n";
   }
}

exit(0);

Any read that contained a "N" was removed. I then mapped these 35mers back to the hg19 genome using BWA, and here are the statistics:

Mapped: 34,894,095
Unmapped: 0
Edit distances
0 34,894,095
Multimap profile
Mapped uniquely 28,615,801
Mapped to 2 locations 1,350,653
Mapped to 3 locations 851,814
Mapped to 4 locations 388,360
Mapped to 5 locations 239,843
Mapped to 6 locations 194,604
Mapped to 7 locations 184,291
Mapped to 8 locations 149,467
Mapped to 9 locations 92,002
Mapped to 10 locations 66,698
Mapped to more than 10 location 2,760,562

28,615,801 / 34,894,095 (~82%) could be mapped uniquely.

Using the GENCODE exon list and intersectBed I characterised the list of uniquely and multi mapping reads.

350,930 / 6,278,294 (5.59%) of the multi mapping reads intersected GENCODE exons.

2,441,765 / 28,615,801 (8.53%) of the uniquely mapped reads intersected GENCODE exons; only a slighter higher percentage than multimappers.

It seems that most of chr22 contains unique stretches of nucleotides. These unique regions are not particularly enriched in exonic sequences in contrast to multimappers. Of the multi mapping reads that intersected exons, it would be interesting to see how many of these map to genes and pseudogenes, especially when we have 1,350,653 reads that mapped to two places in the genome.

And lastly by using genomeCoverageBed and by separating out the uniquely mapping reads, we can create a track that shows the unique parts of the genome.

genomeCoverageBed -ibam chr22_read_sorted_unique.bam -g ~/davetang/chrom_size/hg19_chrom_info.txt -d > result

cat result | awk '$3 != 0 {print}' | head
#chr22   16050002        1
#chr22   16050003        2
#chr22   16050004        3
#chr22   16050005        4
#chr22   16050006        5
#chr22   16050007        6
#chr22   16050008        7
#chr22   16050009        8
#chr22   16050010        9
#chr22   16050011        10

The third column, which shows the number of reads covering this position, is meaningless due to the way I slid across the chromosome. One can just change the values this into 1s and 0s, where 1 is unique and 0 is not unique.

cat result | perl -nle '@a=split; if ($a[2] != 0){ print "$a[0]\t$a[1]\t1" } else {print}'

Lastly a note about the Duke uniqueness tracks provided by the UCSC Genome Browser, which you can download here, since it's almost the same as what I showed above. For the 35mer track, here's part of the track:

chr1 68849 68859 0.333333
chr1 68859 68861 0.5
chr1 68861 68896 1
chr1 68896 69099 0.333333

I checked all the scores (fourth column) and they are either 0, 0.25, 0.333333, 0.5 and 1.

0 = mapped to more than 4 places
0.25 = mapped to 4 places
0.33 = mapped to 3 places
0.5 = mapped to 2 places
1 = mapped to 1 place

The scores are assigned to the first base of the sequence. Using the example above:

chr1 68859 68861 0.5

That should mean that chr1:68859-68894 and chr1:68860-68895 and chr1:68861-68896 all map to two places in the genome. To test this I extracted the sequence at chr1:68859-68894, which is

GAATTATAAGGCATAGCCAGGGAAGTAGTGCGAGAT

And blat returns three hits, with two best hits:

   ACTIONS      QUERY           SCORE START  END QSIZE IDENTITY CHRO STRAND  START    END      SPAN
---------------------------------------------------------------------------------------------------
browser details YourSeq           36     1    36    36 100.0%    19   +     110447    110482     36
browser details YourSeq           36     1    36    36 100.0%     1   +      68859     68894     36
browser details YourSeq           35     1    35    36 100.0%    15   -  102463460 102463494     35

And another test for chr1:68860-68895, which is this sequence AATTATAAGGCATAGCCAGGGAAGTAGTGCGAGATA and the blat result

   ACTIONS      QUERY           SCORE START  END QSIZE IDENTITY CHRO STRAND  START    END      SPAN
---------------------------------------------------------------------------------------------------
browser details YourSeq           36     1    36    36 100.0%    19   +     110448    110483     36
browser details YourSeq           36     1    36    36 100.0%     1   +      68860     68895     36
browser details YourSeq           34     1    36    36  97.3%    15   -  102463458 102463493     36

Here's a slide I made for the Duke uniqueness track using 20 bp windows:

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
2 comments Add yours
    1. Hello,

      I was interested in genome mapability purely as a visual aid on a genome browser, so that I could access whether a mapped read could potentially be a multi mapper.

      Cheers,

      Dave

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.