I know of the genome mapability and uniqueness tracks provided by the UCSC Genome Browser but I was just interested in doing this myself for the hg19 genome.
As a test, I investigated chr22, where the base composition is broken down as:
Length of chr22 = 51,304,566
A: 9,094,775
C: 8,375,984
G: 8,369,235
T: 9,054,551
Other: 16,410,021
I wrote a script to generate 35mer reads by sliding 1 bp along the chromosome:
#!/usr/bin/perl use strict; use warnings; my $usage = "Usage: $0 <infile.fa>\n"; my $infile = shift or die $usage; my $read_length = '35'; my $current_accession = ''; my %seq = (); open(IN,'<',$infile) || die "Could not open $infile: $!\n"; while(<IN>){ chomp; next if /^$/; if (/^>(.*)$/){ $current_accession = $1; warn "Storing $current_accession\n"; } else { $seq{$current_accession} .= $_; } } close(IN); warn "Finished storing genome\n"; foreach my $chr (sort {$a cmp $b} keys %seq){ my $length = length($seq{$chr}); my $end = $length - $read_length; for (my $i = 0; $i <= $end; ++$i){ my $read = substr($seq{$chr},$i,$read_length); $read = uc($read); next if $read =~ /N/; die "$read\n" unless $read =~ /[ATGC]+/; my $chr_start = $i + 1; my $chr_end = $i + $read_length; my $id = $chr . '_' . $chr_start . '_' . $chr_end; print ">$id\n$read\n"; } } exit(0);
Any read that contained a "N" was removed. I then mapped these 35mers back to the hg19 genome using BWA, and here are the statistics:
Mapped: 34,894,095
Unmapped: 0
Edit distances
0 34,894,095
Multimap profile
Mapped uniquely 28,615,801
Mapped to 2 locations 1,350,653
Mapped to 3 locations 851,814
Mapped to 4 locations 388,360
Mapped to 5 locations 239,843
Mapped to 6 locations 194,604
Mapped to 7 locations 184,291
Mapped to 8 locations 149,467
Mapped to 9 locations 92,002
Mapped to 10 locations 66,698
Mapped to more than 10 location 2,760,562
28,615,801 / 34,894,095 (~82%) could be mapped uniquely.
Using the GENCODE exon list and intersectBed I characterised the list of uniquely and multi mapping reads.
350,930 / 6,278,294 (5.59%) of the multi mapping reads intersected GENCODE exons.
2,441,765 / 28,615,801 (8.53%) of the uniquely mapped reads intersected GENCODE exons; only a slighter higher percentage than multimappers.
It seems that most of chr22 contains unique stretches of nucleotides. These unique regions are not particularly enriched in exonic sequences in contrast to multimappers. Of the multi mapping reads that intersected exons, it would be interesting to see how many of these map to genes and pseudogenes, especially when we have 1,350,653 reads that mapped to two places in the genome.
And lastly by using genomeCoverageBed and by separating out the uniquely mapping reads, we can create a track that shows the unique parts of the genome.
genomeCoverageBed -ibam chr22_read_sorted_unique.bam -g ~/davetang/chrom_size/hg19_chrom_info.txt -d > result cat result | awk '$3 != 0 {print}' | head #chr22 16050002 1 #chr22 16050003 2 #chr22 16050004 3 #chr22 16050005 4 #chr22 16050006 5 #chr22 16050007 6 #chr22 16050008 7 #chr22 16050009 8 #chr22 16050010 9 #chr22 16050011 10
The third column, which shows the number of reads covering this position, is meaningless due to the way I slid across the chromosome. One can just change the values this into 1s and 0s, where 1 is unique and 0 is not unique.
cat result | perl -nle '@a=split; if ($a[2] != 0){ print "$a[0]\t$a[1]\t1" } else {print}'
Lastly a note about the Duke uniqueness tracks provided by the UCSC Genome Browser, which you can download here, since it's almost the same as what I showed above. For the 35mer track, here's part of the track:
chr1 68849 68859 0.333333
chr1 68859 68861 0.5
chr1 68861 68896 1
chr1 68896 69099 0.333333
I checked all the scores (fourth column) and they are either 0, 0.25, 0.333333, 0.5 and 1.
0 = mapped to more than 4 places
0.25 = mapped to 4 places
0.33 = mapped to 3 places
0.5 = mapped to 2 places
1 = mapped to 1 place
The scores are assigned to the first base of the sequence. Using the example above:
chr1 68859 68861 0.5
That should mean that chr1:68859-68894 and chr1:68860-68895 and chr1:68861-68896 all map to two places in the genome. To test this I extracted the sequence at chr1:68859-68894, which is
GAATTATAAGGCATAGCCAGGGAAGTAGTGCGAGAT
And blat returns three hits, with two best hits:
ACTIONS QUERY SCORE START END QSIZE IDENTITY CHRO STRAND START END SPAN --------------------------------------------------------------------------------------------------- browser details YourSeq 36 1 36 36 100.0% 19 + 110447 110482 36 browser details YourSeq 36 1 36 36 100.0% 1 + 68859 68894 36 browser details YourSeq 35 1 35 36 100.0% 15 - 102463460 102463494 35
And another test for chr1:68860-68895, which is this sequence AATTATAAGGCATAGCCAGGGAAGTAGTGCGAGATA and the blat result
ACTIONS QUERY SCORE START END QSIZE IDENTITY CHRO STRAND START END SPAN --------------------------------------------------------------------------------------------------- browser details YourSeq 36 1 36 36 100.0% 19 + 110448 110483 36 browser details YourSeq 36 1 36 36 100.0% 1 + 68860 68895 36 browser details YourSeq 34 1 36 36 97.3% 15 - 102463458 102463493 36
Here's a slide I made for the Duke uniqueness track using 20 bp windows:
This work is licensed under a Creative Commons
Attribution 4.0 International License.
good job . But i interested in? how can i use mappability when using mapping tools ,such as BWA
Hello,
I was interested in genome mapability purely as a visual aid on a genome browser, so that I could access whether a mapped read could potentially be a multi mapper.
Cheers,
Dave