The ENCODE mappability tracks can be visualised on the UCSC Genome Browser and they provide a sense of how mappable a region of the genome is in terms of short reads or k-mers. On a side note, it seems some people use "mapability" and some use "mappability"; I was taught the CVC rule, so I'm going to stick with "mappability".
There are two sets of tracks, alignability and uniqueness, but what's the difference between the two? From the UCSC Genome Browser:
Alignability - These tracks provide a measure of how often the sequence found at the particular location will align within the whole genome. Unlike measures of uniqueness, alignability will tolerate up to 2 mismatches. These tracks are in the form of signals ranging from 0 to 1 and have several configuration options.
Uniqueness - These tracks are a direct measure of sequence uniqueness throughout the reference genome. These tracks are in the form of signals ranging from 0 to 1 and have several configuration options.
Let's take a look at two examples, where I try to use k-mers of similar size:
#download the files wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeDukeMapabilityUniqueness35bp.bigWig wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeCrgMapabilityAlign36mer.bigWig #bigWigToBedGraph can be downloaded at http://hgdownload.cse.ucsc.edu/admin/exe/ #convert into flat format bigWigToBedGraph wgEncodeDukeMapabilityUniqueness35bp.bigWig wgEncodeDukeMapabilityUniqueness35bp.bedGraph bigWigToBedGraph wgEncodeCrgMapabilityAlign36mer.bigWig wgEncodeCrgMapabilityAlign36mer.bedGraph head wgEncodeDukeMapabilityUniqueness35bp.bedGraph chr1 0 10145 0 chr1 10145 10160 1 chr1 10160 10170 0.5 chr1 10170 10175 0.333333 chr1 10175 10177 0.5 chr1 10177 10215 0 chr1 10215 10224 1 chr1 10224 10229 0.5 chr1 10229 10248 1 chr1 10248 10274 0 head wgEncodeCrgMapabilityAlign36mer.bedGraph chr1 10000 10078 0.0013624 chr1 10078 10081 0.0238095 chr1 10081 10088 0.0185185 chr1 10088 10089 0.0147059 chr1 10089 10096 0.0185185 chr1 10096 10097 0.0238095 chr1 10097 10099 0.0185185 chr1 10099 10102 0.00172712 chr1 10102 10120 0.0013624 chr1 10120 10121 0.00172712
The bedGraph format simply associates a score (fourth column) to a particular region, as defined in the first three columns. At first glance you can already see the difference in the precision of the scores. The scores were calculated the same way, 1/the number of places a 35/36mer maps to the genome; however the uniqueness track will only keep scores for reads that map up to 4 places (0.25). So according to mappability track, the region chr1:10000-10035 will map to 734 places (since 1/734 =~ .0013624).
Another thing to keep in mind is that the alignability track tolerates up to 2 mismatches, so as you can imagine, a read that aligns to two places in the alignability track, may be unique in the uniqueness track.
#this region is unique head -7 wgEncodeDukeMapabilityUniqueness35bp.bedGraph | tail -1 chr1 10215 10224 1 #however with 2 mismatches it's not so unique cat wgEncodeCrgMapabilityAlign36mer.bedGraph| head -35 | tail -4 chr1 10209 10216 0.0147059 chr1 10216 10217 0.0114943 chr1 10217 10218 0.0147059 chr1 10218 10221 0.0625
Just a word on the file format using this example from the align36mer track:
chr1 10216 10217 0.0114943
This means that the region chr1:10216-10251 (10216+35 due to the inclusiveness of the numbering) maps to 87 places with a threshold of two mismatches. If you blat the sequence TAACCCTAACCCTAACCCCTAACCCTAACCCTAAAC, the highest scoring match is that same region.
I drew this to help explain the tracks.
How unique are repetitive elements?
I previously looked at mapping repeats with respect to sequencing errors but now with the uniqueness tracks I can get an idea of the number of unique sites within repetitive elements.
cat wgEncodeDukeMapabilityUniqueness35bp.bedGraph | awk '$4 == 1 {OFS="\t"; print $1, $2, $3}' > wgEncodeDukeMapabilityUniqueness35bp.bed
#all the unique regions
head wgEncodeDukeMapabilityUniqueness35bp.bed
chr1 10145 10160
chr1 10215 10224
chr1 10229 10248
chr1 10311 10321
chr1 10434 10469
chr1 10484 10533
chr1 10588 10623
chr1 10782 10815
chr1 13044 13072
chr1 13292 13327
cat wgEncodeDukeMapabilityUniqueness35bp.bed | wc
12444047 37332141 296751457
#I downloaded the RepeatMasker track using the Table Browser tool at the UCSC Genome Browser
cat ~/ucsc/hg19_rmsk.bed | wc
5298130 31788780 193492882
#the extremely useful stats program is available
#https://github.com/arq5x/filo
cat ~/ucsc/hg19_rmsk.bed | perl -nle '@a=split; $s=$a[2]-$a[1]; print $s' | stats
Total lines: 5298130
Sum of lines: 1467396988
Ari. Mean: 276.965077867097
Geo. Mean: 168.518495379209
Median: 188
Mode: 21 (N=44789)
Anti-Mode: 3849 (N=1)
Minimum: 6
Maximum: 160602
Variance: 216904.549201035
StdDev: 465.730124858845
So all the repeats add up to 1,467,396,988 bp of sequence in the human genome (hg19). Before intersecting the repeats with the uniqueness track, I need to trim 34 bp from the repeat coordinates or else the coordinates will span a region outside of the repeat. I will also need to remove repeats that are less than 34 bp; the code below is minus 34 because start coordinates are in zero based coordinates.
cat ~/ucsc/hg19_rmsk.bed | perl -nle '@a = split; $a[2]-=34; next unless $a[2]>$a[1]; print join("\t",@a);' > hg19_rmsk_trimmed.bed
#lost a few repeats
cat hg19_rmsk_trimmed.bed | wc
4885299 29311794 178280373
intersectBed -a hg19_rmsk_trimmed.bed -b wgEncodeDukeMapabilityUniqueness35bp.bed -wo > hg19_rmsk_trimmed_wgEncodeDukeMapabilityUniqueness35bp.bed
cat hg19_rmsk_trimmed_wgEncodeDukeMapabilityUniqueness35bp.bed | head
chr1 16777160 16777436 AluSp 2147 + chr1 16777229 16777239 10
chr1 16777160 16777436 AluSp 2147 + chr1 16777277 16777316 39
chr1 16777160 16777436 AluSp 2147 + chr1 16777323 16777338 15
chr1 16777160 16777436 AluSp 2147 + chr1 16777339 16777351 12
chr1 16777160 16777436 AluSp 2147 + chr1 16777371 16777404 33
chr1 16777160 16777436 AluSp 2147 + chr1 16777409 16779647 27
chr1 33553606 33554612 L2b 626 + chr1 33553301 33554928 1006
chr1 50330063 50332119 L1PA10 12545 + chr1 50329862 50331296 1233
chr1 50330063 50332119 L1PA10 12545 + chr1 50331299 50331309 10
chr1 50330063 50332119 L1PA10 12545 + chr1 50331313 50331368 55
cat hg19_rmsk_trimmed_wgEncodeDukeMapabilityUniqueness35bp.bed | wc
14359713 143597130 917916687
#how many repeats harbour unique regions?
#most of them!
cat hg19_rmsk_trimmed_wgEncodeDukeMapabilityUniqueness35bp.bed | cut -f1-4 | sort -u | wc
4638952 18555808 139080136
#the tenth column in this file shows the number of bp overlap
cat hg19_rmsk_trimmed_wgEncodeDukeMapabilityUniqueness35bp.bed | cut -f10 | stats
Total lines: 14359713
Sum of lines: 956605598
Ari. Mean: 66.6173201372479
Geo. Mean: 21.2565940240155
Median: 25
Mode: 1 (N=869273)
Anti-Mode: 2082 (N=1)
Minimum: 1
Maximum: 7599
Variance: 17800.1933217871
StdDev: 133.417365143324
So what does the 956,605,598 number represent? That would be the number of starting sites at repeats that are unique. Imagine if I was looking a specific repeat that was 100 bp long and the entire sequence was unique. Using the 35mer uniqueness track, I would get 66 unique starting sites. To make sense of the 956,605,598 number we need to find out how many starting sites there are.
#minus the end from the start in the bed file
#minus another one because of the zero based coordinate system from the UCSC Genome Browser
#the zero based system makes the feature 1 bp longer
cat hg19_rmsk_trimmed.bed | awk '{print $3-$2}' | stats
Total lines: 4885299
Sum of lines: 1290624407
Ari. Mean: 264.185346076054
Geo. Mean: 139.701977637983
Median: 177
Mode: 265 (N=33186)
Anti-Mode: 3815 (N=1)
Minimum: 1
Maximum: 160568
Variance: 229453.013870233
StdDev: 479.0125404102
bc -l<<<956605598/1290624407
.74119596128170850561
I thought the number would be lower but it seems that ~74% of positions within repeats are unique. What about coding exon sequences (that a longer than 34 bp)?
zcat ~/ucsc/hg19_refgene_coding_exon.bed.gz | perl -nle '@a = split; $a[2]-=34; next unless $a[2]>$a[1]; print join("\t",@a);' > hg19_refgene_coding_exon_trimmed.bed
intersectBed -a hg19_refgene_coding_exon_trimmed.bed -b wgEncodeDukeMapabilityUniqueness35bp.bed -wo > hg19_refgene_coding_exon_trimmed_wgEncodeDukeMapabilityUniqueness35bp.bed
#how many starting sites
cat hg19_refgene_coding_exon_trimmed.bed | awk '{print $3-$2}' | stats
Total lines: 357554
Sum of lines: 48822602
Ari. Mean: 136.54609373689
Geo. Mean: 85.5854944294962
Median: 91
Mode: 62 (N=3681)
Anti-Mode: 672 (N=1)
Minimum: 1
Maximum: 21659
Variance: 71223.7271652088
StdDev: 266.877738234587
#how many unique
cat hg19_refgene_coding_exon_trimmed_wgEncodeDukeMapabilityUniqueness35bp.bed | cut -f10 | stats
Total lines: 370576
Sum of lines: 43821518
Ari. Mean: 118.252444842623
Geo. Mean: 71.4540055289801
Median: 80
Mode: 35 (N=9285)
Anti-Mode: 857 (N=1)
Minimum: 1
Maximum: 17072
Variance: 48421.6841883575
StdDev: 220.04927672764
bc -l <<<43821518/48822602
.89756621328785385096
~90% of the the RefSeq coding exons positions are unique; I feel that this number should be higher for no apparent reason other than my gut feeling. Let me double check this:
cat wgEncodeDukeMapabilityUniqueness35bp.bedGraph | awk '$4 < 1 {OFS="\t"; print $1, $2, $3}' > wgEncodeDukeMapabilityUniqueness35bp_not_unique.bed
intersectBed -a hg19_refgene_coding_exon_trimmed.bed -b wgEncodeDukeMapabilityUniqueness35bp_not_unique.bed -wo > hg19_refgene_coding_exon_trimmed_wgEncodeDukeMapabilityUniqueness35bp_not_unique.bed
cat hg19_refgene_coding_exon_trimmed_wgEncodeDukeMapabilityUniqueness35bp_not_unique.bed | head
head hg19_refgene_coding_exon_trimmed_wgEncodeDukeMapabilityUniqueness35bp_not_unique.bed
chr1 25124232 25124308 NM_013943_cds_1_0_chr1_25124233_f 0 + chr1 25124232 25124246 14
chr1 25140584 25140676 NM_013943_cds_2_0_chr1_25140585_f 0 + chr1 25140660 25140665 5
chr1 25140584 25140676 NM_013943_cds_2_0_chr1_25140585_f 0 + chr1 25140668 25140678 8
chr1 25153500 25153573 NM_013943_cds_3_0_chr1_25153501_f 0 + chr1 25153500 25153512 12
chr1 25153500 25153573 NM_013943_cds_3_0_chr1_25153501_f 0 + chr1 25153512 25153540 28
chr1 25153500 25153573 NM_013943_cds_3_0_chr1_25153501_f 0 + chr1 25153540 25153541 1
chr1 25153500 25153573 NM_013943_cds_3_0_chr1_25153501_f 0 + chr1 25153541 25153573 32
chr1 25166350 25166498 NM_013943_cds_4_0_chr1_25166351_f 0 + chr1 25166347 25166389 39
chr1 25166350 25166498 NM_013943_cds_4_0_chr1_25166351_f 0 + chr1 25166389 25166392 3
chr1 25166350 25166498 NM_013943_cds_4_0_chr1_25166351_f 0 + chr1 25166392 25166401 9
cat hg19_refgene_coding_exon_trimmed_wgEncodeDukeMapabilityUniqueness35bp_not_unique.bed | cut -f10 | stats
Total lines: 120750
Sum of lines: 2870360
Ari. Mean: 23.7710973084886
Geo. Mean: 9.14534781538738
Median: 9
Mode: 1 (N=14096)
Anti-Mode: 197 (N=1)
Minimum: 1
Maximum: 1994
Variance: 4132.41012943783
StdDev: 64.2838247884943
bc -l <<<2870360/48822602
.05879162278159611402
~6% of coding exons are in non-unique regions. Why doesn't this add to 100% (i.e. 6 + 90 != 100)?
intersectBed -a hg19_refgene_coding_exon_trimmed.bed -b wgEncodeDukeMapabilityUniqueness35bp.bedGraph -v > blah
#unassembled chromosomes
head blah
chr6_apd_hap1 3146158 3146246 NM_001178045_cds_0_0_chr6_apd_hap1_3146159_r 0 -
chr6_apd_hap1 3147183 3147234 NM_001178045_cds_1_0_chr6_apd_hap1_3147184_r 0 -
chr6_apd_hap1 3147347 3147408 NM_001178045_cds_2_0_chr6_apd_hap1_3147348_r 0 -
chr6_apd_hap1 3147541 3147578 NM_001178045_cds_3_0_chr6_apd_hap1_3147542_r 0 -
chr6_apd_hap1 3147846 3147886 NM_001178045_cds_4_0_chr6_apd_hap1_3147847_r 0 -
chr6_apd_hap1 3148029 3148099 NM_001178045_cds_5_0_chr6_apd_hap1_3148030_r 0 -
chr6_apd_hap1 3148221 3148282 NM_001178045_cds_6_0_chr6_apd_hap1_3148222_r 0 -
chr6_apd_hap1 3148404 3148624 NM_001178045_cds_7_0_chr6_apd_hap1_3148405_r 0 -
chr6_apd_hap1 3151686 3151755 NM_001178045_cds_8_0_chr6_apd_hap1_3151687_r 0 -
chr6_apd_hap1 3152076 3152135 NM_001178045_cds_9_0_chr6_apd_hap1_3152077_r 0 -
cat blah | cut -f1 | sort -u
#output below modified to save space
chr17_ctg5_hap1 chr19_gl000209_random chr1_gl000191_random chr4_ctg9_hap1 chr6_apd_hap1
chr6_cox_hap2 chr6_dbb_hap3 chr6_mann_hap4 chr6_mcf_hap5 chr6_qbl_hap6 chr6_ssto_hap7
chr7_gl000195_random chrUn_gl000213 chrUn_gl000222 chrUn_gl000223 chrUn_gl000228
#I know I should stop naming files as blah
#it's a bad habit I picked up from Rob
cat blah | awk '{print $3-$2}' | stats
Total lines: 15809
Sum of lines: 2130724
Ari. Mean: 134.779176418496
Geo. Mean: 83.0615911368902
Median: 84
Mode: 20 (N=655)
Anti-Mode: 124 (N=1)
Minimum: 1
Maximum: 4565
Variance: 47948.4880830461
StdDev: 218.971432116261
#and here's the culprit!
bc -l<<<2130724/48822602
.04364216393055003500
#recalculate the statistics
#refSeq coding exons unique
bc -l<<<43821518/"(48822602-2130724)"
.93852549687549513429
#refSeq coding exon non-unique
bc -l <<<2870360/"(48822602-2130724)"
.06147450312450486570
Conclusions
The ENCODE mappability tracks are a great visual resource when viewed on a Genome Browser as well as being a nice resource for calculating the mappability of regions of interest. I demonstrated that ~74% of the regions within repetitive elements are unique, which I thought was quite high, but this is probably due to the higher mutation rate in these sequences compared to those under selective pressure. RefSeq coding exons on the other hand have ~94% of their regions unique. If I used the alignability track instead:
cat wgEncodeCrgMapabilityAlign36mer.bedGraph | awk '$4 == 1 {OFS="\t"; print $1, $2, $3}' > wgEncodeCrgMapabilityAlign36mer.bed
cat wgEncodeCrgMapabilityAlign36mer.bed | wc
15365566 46096698 366807922
cat ~/ucsc/hg19_rmsk.bed | perl -nle '@a = split; $a[2]-=35; next unless $a[2]>$a[1]; print join("\t",@a);' > hg19_rmsk_trimmed_36.bed
cat hg19_rmsk_trimmed_36.bed | awk '{print $3-$2}' | stats
Total lines: 4860460
Sum of lines: 1285739108
Ari. Mean: 264.530334165902
Geo. Mean: 140.426791054987
Median: 178
Mode: 264 (N=33186)
Anti-Mode: 3814 (N=1)
Minimum: 1
Maximum: 160567
Variance: 230269.825370541
StdDev: 479.864382269138
intersectBed -a hg19_rmsk_trimmed_36.bed -b wgEncodeCrgMapabilityAlign36mer.bed -wo > hg19_rmsk_trimmed_36_wgEncodeCrgMapabilityAlign36mer.bed
cat hg19_rmsk_trimmed_36_wgEncodeCrgMapabilityAlign36mer.bed | cut -f10 | stats
Total lines: 14890348
Sum of lines: 702976025
Ari. Mean: 47.2101810515107
Geo. Mean: 13.7237631735887
Median: 14
Mode: 1 (N=1585619)
Anti-Mode: 1573 (N=1)
Minimum: 1
Maximum: 5235
Variance: 9311.44025498465
StdDev: 96.495804338762
bc -l<<<702976025/1285739108
.54674857490606873568
As you would expect, when taking into account mismatches, repetitive elements are much less unique; though ~55% of the sites are deemed as unique when using 36-mers.
See also

This work is licensed under a Creative Commons
Attribution 4.0 International License.
At the beginning of the post: “according to mappability track, the region chr1:10000-10035 will map to 734 places (since 1/733 =~ .001364)”
I have two simple questions:
1) Shouldn’t it 1/734 as 1/734 ~= 0.0013624 (same as file) whereas 1/733 ~= 0.0013642 (different)?
2) As it’s a 36mer, why does the bedGraph line indicate chr1:10000-10078 instead of chr1:10000-10035? Does that mean in fact that the region chr1:10000-10078 contains a set of 43 overlapping 36mers (kmer #1: 10000-10035, kmer #2: 10001-10036, …, kmer #43: 10042-10078)? And that it is the whole set of these 43 36mers that is unique?
Also, related to question 2: the 2nd line of the “alignability” file corresponds to the region chr1:10078-10081 with score 0.0238095. If I understood well, it means that this region of 4 nucleotides is present only 42 times in the whole human genome (+- 2 mismatches). This looks quite low to me given that it is only 4-bp long (even lower considering the possible mismatches). And what does it correspond to as the file was generated with 36mer?
(I had a look at the paper but couldn’t find an answer to this question.)
1) Nice pick up; it should be 1/734. Good to have another set of eyes going through my posts 🙂
2) Exactly.
No, you got it right in point 2. chr1:10078-10081 refers to the 36mers at chr1:10078-10113, chr1:10079-10114, chr1:10080-10115 and chr1:10081-10116.
I have a follow up question:
so given that the alignability scores are the following:
chr 1 10078 10081 0.0238095
chr1 10081 10088 0.0185185
does the kmer chr1:10081-10116 has score of 0.02 or 0.018?
ok, got it, any region in the bedGraph file contains all the start of 36mers, so I need to correct my point 2: there are not 43 36mers but 79, the 1st starting at 10000, the 2nd starting at 10001, …, and the 79th starting at 10078
thanks!
Thank you for your nice explanation. For some genome, we couldn’t find the mappability track from the ENCODE web page. For example, I want to get the mappability track of Rice(MHRS1), but couldn’t find this and want to construct by my self. Therefore, Would you please tell us how we construct the mappability track by ourselves.
This may be useful https://wiki.bits.vib.be/index.php/Create_a_mappability_track.