The ENCODE mappability tracks can be visualised on the UCSC Genome Browser and they provide a sense of how mappable a region of the genome is in terms of short reads or k-mers. On a side note, it seems some people use "mapability" and some use "mappability"; I was taught the CVC rule, so I'm going to stick with "mappability".
There are two sets of tracks, alignability and uniqueness, but what's the difference between the two? From the UCSC Genome Browser:
Alignability - These tracks provide a measure of how often the sequence found at the particular location will align within the whole genome. Unlike measures of uniqueness, alignability will tolerate up to 2 mismatches. These tracks are in the form of signals ranging from 0 to 1 and have several configuration options.
Uniqueness - These tracks are a direct measure of sequence uniqueness throughout the reference genome. These tracks are in the form of signals ranging from 0 to 1 and have several configuration options.
Let's take a look at two examples, where I try to use k-mers of similar size:
#download the files wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeDukeMapabilityUniqueness35bp.bigWig wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeCrgMapabilityAlign36mer.bigWig #bigWigToBedGraph can be downloaded at http://hgdownload.cse.ucsc.edu/admin/exe/ #convert into flat format bigWigToBedGraph wgEncodeDukeMapabilityUniqueness35bp.bigWig wgEncodeDukeMapabilityUniqueness35bp.bedGraph bigWigToBedGraph wgEncodeCrgMapabilityAlign36mer.bigWig wgEncodeCrgMapabilityAlign36mer.bedGraph head wgEncodeDukeMapabilityUniqueness35bp.bedGraph chr1 0 10145 0 chr1 10145 10160 1 chr1 10160 10170 0.5 chr1 10170 10175 0.333333 chr1 10175 10177 0.5 chr1 10177 10215 0 chr1 10215 10224 1 chr1 10224 10229 0.5 chr1 10229 10248 1 chr1 10248 10274 0 head wgEncodeCrgMapabilityAlign36mer.bedGraph chr1 10000 10078 0.0013624 chr1 10078 10081 0.0238095 chr1 10081 10088 0.0185185 chr1 10088 10089 0.0147059 chr1 10089 10096 0.0185185 chr1 10096 10097 0.0238095 chr1 10097 10099 0.0185185 chr1 10099 10102 0.00172712 chr1 10102 10120 0.0013624 chr1 10120 10121 0.00172712
The bedGraph format simply associates a score (fourth column) to a particular region, as defined in the first three columns. At first glance you can already see the difference in the precision of the scores. The scores were calculated the same way, 1/the number of places a 35/36mer maps to the genome; however the uniqueness track will only keep scores for reads that map up to 4 places (0.25). So according to mappability track, the region chr1:10000-10035 will map to 734 places (since 1/734 =~ .0013624).
Another thing to keep in mind is that the alignability track tolerates up to 2 mismatches, so as you can imagine, a read that aligns to two places in the alignability track, may be unique in the uniqueness track.
#this region is unique head -7 wgEncodeDukeMapabilityUniqueness35bp.bedGraph | tail -1 chr1 10215 10224 1 #however with 2 mismatches it's not so unique cat wgEncodeCrgMapabilityAlign36mer.bedGraph| head -35 | tail -4 chr1 10209 10216 0.0147059 chr1 10216 10217 0.0114943 chr1 10217 10218 0.0147059 chr1 10218 10221 0.0625
Just a word on the file format using this example from the align36mer track:
chr1 10216 10217 0.0114943
This means that the region chr1:10216-10251 (10216+35 due to the inclusiveness of the numbering) maps to 87 places with a threshold of two mismatches. If you blat the sequence TAACCCTAACCCTAACCCCTAACCCTAACCCTAAAC, the highest scoring match is that same region.
I drew this to help explain the tracks.
How unique are repetitive elements?
I previously looked at mapping repeats with respect to sequencing errors but now with the uniqueness tracks I can get an idea of the number of unique sites within repetitive elements.
cat wgEncodeDukeMapabilityUniqueness35bp.bedGraph | awk '$4 == 1 {OFS="\t"; print $1, $2, $3}' > wgEncodeDukeMapabilityUniqueness35bp.bed #all the unique regions head wgEncodeDukeMapabilityUniqueness35bp.bed chr1 10145 10160 chr1 10215 10224 chr1 10229 10248 chr1 10311 10321 chr1 10434 10469 chr1 10484 10533 chr1 10588 10623 chr1 10782 10815 chr1 13044 13072 chr1 13292 13327 cat wgEncodeDukeMapabilityUniqueness35bp.bed | wc 12444047 37332141 296751457 #I downloaded the RepeatMasker track using the Table Browser tool at the UCSC Genome Browser cat ~/ucsc/hg19_rmsk.bed | wc 5298130 31788780 193492882 #the extremely useful stats program is available #https://github.com/arq5x/filo cat ~/ucsc/hg19_rmsk.bed | perl -nle '@a=split; $s=$a[2]-$a[1]; print $s' | stats Total lines: 5298130 Sum of lines: 1467396988 Ari. Mean: 276.965077867097 Geo. Mean: 168.518495379209 Median: 188 Mode: 21 (N=44789) Anti-Mode: 3849 (N=1) Minimum: 6 Maximum: 160602 Variance: 216904.549201035 StdDev: 465.730124858845
So all the repeats add up to 1,467,396,988 bp of sequence in the human genome (hg19). Before intersecting the repeats with the uniqueness track, I need to trim 34 bp from the repeat coordinates or else the coordinates will span a region outside of the repeat. I will also need to remove repeats that are less than 34 bp; the code below is minus 34 because start coordinates are in zero based coordinates.
cat ~/ucsc/hg19_rmsk.bed | perl -nle '@a = split; $a[2]-=34; next unless $a[2]>$a[1]; print join("\t",@a);' > hg19_rmsk_trimmed.bed #lost a few repeats cat hg19_rmsk_trimmed.bed | wc 4885299 29311794 178280373 intersectBed -a hg19_rmsk_trimmed.bed -b wgEncodeDukeMapabilityUniqueness35bp.bed -wo > hg19_rmsk_trimmed_wgEncodeDukeMapabilityUniqueness35bp.bed cat hg19_rmsk_trimmed_wgEncodeDukeMapabilityUniqueness35bp.bed | head chr1 16777160 16777436 AluSp 2147 + chr1 16777229 16777239 10 chr1 16777160 16777436 AluSp 2147 + chr1 16777277 16777316 39 chr1 16777160 16777436 AluSp 2147 + chr1 16777323 16777338 15 chr1 16777160 16777436 AluSp 2147 + chr1 16777339 16777351 12 chr1 16777160 16777436 AluSp 2147 + chr1 16777371 16777404 33 chr1 16777160 16777436 AluSp 2147 + chr1 16777409 16779647 27 chr1 33553606 33554612 L2b 626 + chr1 33553301 33554928 1006 chr1 50330063 50332119 L1PA10 12545 + chr1 50329862 50331296 1233 chr1 50330063 50332119 L1PA10 12545 + chr1 50331299 50331309 10 chr1 50330063 50332119 L1PA10 12545 + chr1 50331313 50331368 55 cat hg19_rmsk_trimmed_wgEncodeDukeMapabilityUniqueness35bp.bed | wc 14359713 143597130 917916687 #how many repeats harbour unique regions? #most of them! cat hg19_rmsk_trimmed_wgEncodeDukeMapabilityUniqueness35bp.bed | cut -f1-4 | sort -u | wc 4638952 18555808 139080136 #the tenth column in this file shows the number of bp overlap cat hg19_rmsk_trimmed_wgEncodeDukeMapabilityUniqueness35bp.bed | cut -f10 | stats Total lines: 14359713 Sum of lines: 956605598 Ari. Mean: 66.6173201372479 Geo. Mean: 21.2565940240155 Median: 25 Mode: 1 (N=869273) Anti-Mode: 2082 (N=1) Minimum: 1 Maximum: 7599 Variance: 17800.1933217871 StdDev: 133.417365143324
So what does the 956,605,598 number represent? That would be the number of starting sites at repeats that are unique. Imagine if I was looking a specific repeat that was 100 bp long and the entire sequence was unique. Using the 35mer uniqueness track, I would get 66 unique starting sites. To make sense of the 956,605,598 number we need to find out how many starting sites there are.
#minus the end from the start in the bed file #minus another one because of the zero based coordinate system from the UCSC Genome Browser #the zero based system makes the feature 1 bp longer cat hg19_rmsk_trimmed.bed | awk '{print $3-$2}' | stats Total lines: 4885299 Sum of lines: 1290624407 Ari. Mean: 264.185346076054 Geo. Mean: 139.701977637983 Median: 177 Mode: 265 (N=33186) Anti-Mode: 3815 (N=1) Minimum: 1 Maximum: 160568 Variance: 229453.013870233 StdDev: 479.0125404102 bc -l<<<956605598/1290624407 .74119596128170850561
I thought the number would be lower but it seems that ~74% of positions within repeats are unique. What about coding exon sequences (that a longer than 34 bp)?
zcat ~/ucsc/hg19_refgene_coding_exon.bed.gz | perl -nle '@a = split; $a[2]-=34; next unless $a[2]>$a[1]; print join("\t",@a);' > hg19_refgene_coding_exon_trimmed.bed intersectBed -a hg19_refgene_coding_exon_trimmed.bed -b wgEncodeDukeMapabilityUniqueness35bp.bed -wo > hg19_refgene_coding_exon_trimmed_wgEncodeDukeMapabilityUniqueness35bp.bed #how many starting sites cat hg19_refgene_coding_exon_trimmed.bed | awk '{print $3-$2}' | stats Total lines: 357554 Sum of lines: 48822602 Ari. Mean: 136.54609373689 Geo. Mean: 85.5854944294962 Median: 91 Mode: 62 (N=3681) Anti-Mode: 672 (N=1) Minimum: 1 Maximum: 21659 Variance: 71223.7271652088 StdDev: 266.877738234587 #how many unique cat hg19_refgene_coding_exon_trimmed_wgEncodeDukeMapabilityUniqueness35bp.bed | cut -f10 | stats Total lines: 370576 Sum of lines: 43821518 Ari. Mean: 118.252444842623 Geo. Mean: 71.4540055289801 Median: 80 Mode: 35 (N=9285) Anti-Mode: 857 (N=1) Minimum: 1 Maximum: 17072 Variance: 48421.6841883575 StdDev: 220.04927672764 bc -l <<<43821518/48822602 .89756621328785385096
~90% of the the RefSeq coding exons positions are unique; I feel that this number should be higher for no apparent reason other than my gut feeling. Let me double check this:
cat wgEncodeDukeMapabilityUniqueness35bp.bedGraph | awk '$4 < 1 {OFS="\t"; print $1, $2, $3}' > wgEncodeDukeMapabilityUniqueness35bp_not_unique.bed intersectBed -a hg19_refgene_coding_exon_trimmed.bed -b wgEncodeDukeMapabilityUniqueness35bp_not_unique.bed -wo > hg19_refgene_coding_exon_trimmed_wgEncodeDukeMapabilityUniqueness35bp_not_unique.bed cat hg19_refgene_coding_exon_trimmed_wgEncodeDukeMapabilityUniqueness35bp_not_unique.bed | head head hg19_refgene_coding_exon_trimmed_wgEncodeDukeMapabilityUniqueness35bp_not_unique.bed chr1 25124232 25124308 NM_013943_cds_1_0_chr1_25124233_f 0 + chr1 25124232 25124246 14 chr1 25140584 25140676 NM_013943_cds_2_0_chr1_25140585_f 0 + chr1 25140660 25140665 5 chr1 25140584 25140676 NM_013943_cds_2_0_chr1_25140585_f 0 + chr1 25140668 25140678 8 chr1 25153500 25153573 NM_013943_cds_3_0_chr1_25153501_f 0 + chr1 25153500 25153512 12 chr1 25153500 25153573 NM_013943_cds_3_0_chr1_25153501_f 0 + chr1 25153512 25153540 28 chr1 25153500 25153573 NM_013943_cds_3_0_chr1_25153501_f 0 + chr1 25153540 25153541 1 chr1 25153500 25153573 NM_013943_cds_3_0_chr1_25153501_f 0 + chr1 25153541 25153573 32 chr1 25166350 25166498 NM_013943_cds_4_0_chr1_25166351_f 0 + chr1 25166347 25166389 39 chr1 25166350 25166498 NM_013943_cds_4_0_chr1_25166351_f 0 + chr1 25166389 25166392 3 chr1 25166350 25166498 NM_013943_cds_4_0_chr1_25166351_f 0 + chr1 25166392 25166401 9 cat hg19_refgene_coding_exon_trimmed_wgEncodeDukeMapabilityUniqueness35bp_not_unique.bed | cut -f10 | stats Total lines: 120750 Sum of lines: 2870360 Ari. Mean: 23.7710973084886 Geo. Mean: 9.14534781538738 Median: 9 Mode: 1 (N=14096) Anti-Mode: 197 (N=1) Minimum: 1 Maximum: 1994 Variance: 4132.41012943783 StdDev: 64.2838247884943 bc -l <<<2870360/48822602 .05879162278159611402
~6% of coding exons are in non-unique regions. Why doesn't this add to 100% (i.e. 6 + 90 != 100)?
intersectBed -a hg19_refgene_coding_exon_trimmed.bed -b wgEncodeDukeMapabilityUniqueness35bp.bedGraph -v > blah #unassembled chromosomes head blah chr6_apd_hap1 3146158 3146246 NM_001178045_cds_0_0_chr6_apd_hap1_3146159_r 0 - chr6_apd_hap1 3147183 3147234 NM_001178045_cds_1_0_chr6_apd_hap1_3147184_r 0 - chr6_apd_hap1 3147347 3147408 NM_001178045_cds_2_0_chr6_apd_hap1_3147348_r 0 - chr6_apd_hap1 3147541 3147578 NM_001178045_cds_3_0_chr6_apd_hap1_3147542_r 0 - chr6_apd_hap1 3147846 3147886 NM_001178045_cds_4_0_chr6_apd_hap1_3147847_r 0 - chr6_apd_hap1 3148029 3148099 NM_001178045_cds_5_0_chr6_apd_hap1_3148030_r 0 - chr6_apd_hap1 3148221 3148282 NM_001178045_cds_6_0_chr6_apd_hap1_3148222_r 0 - chr6_apd_hap1 3148404 3148624 NM_001178045_cds_7_0_chr6_apd_hap1_3148405_r 0 - chr6_apd_hap1 3151686 3151755 NM_001178045_cds_8_0_chr6_apd_hap1_3151687_r 0 - chr6_apd_hap1 3152076 3152135 NM_001178045_cds_9_0_chr6_apd_hap1_3152077_r 0 - cat blah | cut -f1 | sort -u #output below modified to save space chr17_ctg5_hap1 chr19_gl000209_random chr1_gl000191_random chr4_ctg9_hap1 chr6_apd_hap1 chr6_cox_hap2 chr6_dbb_hap3 chr6_mann_hap4 chr6_mcf_hap5 chr6_qbl_hap6 chr6_ssto_hap7 chr7_gl000195_random chrUn_gl000213 chrUn_gl000222 chrUn_gl000223 chrUn_gl000228 #I know I should stop naming files as blah #it's a bad habit I picked up from Rob cat blah | awk '{print $3-$2}' | stats Total lines: 15809 Sum of lines: 2130724 Ari. Mean: 134.779176418496 Geo. Mean: 83.0615911368902 Median: 84 Mode: 20 (N=655) Anti-Mode: 124 (N=1) Minimum: 1 Maximum: 4565 Variance: 47948.4880830461 StdDev: 218.971432116261 #and here's the culprit! bc -l<<<2130724/48822602 .04364216393055003500 #recalculate the statistics #refSeq coding exons unique bc -l<<<43821518/"(48822602-2130724)" .93852549687549513429 #refSeq coding exon non-unique bc -l <<<2870360/"(48822602-2130724)" .06147450312450486570
Conclusions
The ENCODE mappability tracks are a great visual resource when viewed on a Genome Browser as well as being a nice resource for calculating the mappability of regions of interest. I demonstrated that ~74% of the regions within repetitive elements are unique, which I thought was quite high, but this is probably due to the higher mutation rate in these sequences compared to those under selective pressure. RefSeq coding exons on the other hand have ~94% of their regions unique. If I used the alignability track instead:
cat wgEncodeCrgMapabilityAlign36mer.bedGraph | awk '$4 == 1 {OFS="\t"; print $1, $2, $3}' > wgEncodeCrgMapabilityAlign36mer.bed cat wgEncodeCrgMapabilityAlign36mer.bed | wc 15365566 46096698 366807922 cat ~/ucsc/hg19_rmsk.bed | perl -nle '@a = split; $a[2]-=35; next unless $a[2]>$a[1]; print join("\t",@a);' > hg19_rmsk_trimmed_36.bed cat hg19_rmsk_trimmed_36.bed | awk '{print $3-$2}' | stats Total lines: 4860460 Sum of lines: 1285739108 Ari. Mean: 264.530334165902 Geo. Mean: 140.426791054987 Median: 178 Mode: 264 (N=33186) Anti-Mode: 3814 (N=1) Minimum: 1 Maximum: 160567 Variance: 230269.825370541 StdDev: 479.864382269138 intersectBed -a hg19_rmsk_trimmed_36.bed -b wgEncodeCrgMapabilityAlign36mer.bed -wo > hg19_rmsk_trimmed_36_wgEncodeCrgMapabilityAlign36mer.bed cat hg19_rmsk_trimmed_36_wgEncodeCrgMapabilityAlign36mer.bed | cut -f10 | stats Total lines: 14890348 Sum of lines: 702976025 Ari. Mean: 47.2101810515107 Geo. Mean: 13.7237631735887 Median: 14 Mode: 1 (N=1585619) Anti-Mode: 1573 (N=1) Minimum: 1 Maximum: 5235 Variance: 9311.44025498465 StdDev: 96.495804338762 bc -l<<<702976025/1285739108 .54674857490606873568
As you would expect, when taking into account mismatches, repetitive elements are much less unique; though ~55% of the sites are deemed as unique when using 36-mers.
See also

This work is licensed under a Creative Commons
Attribution 4.0 International License.
At the beginning of the post: “according to mappability track, the region chr1:10000-10035 will map to 734 places (since 1/733 =~ .001364)”
I have two simple questions:
1) Shouldn’t it 1/734 as 1/734 ~= 0.0013624 (same as file) whereas 1/733 ~= 0.0013642 (different)?
2) As it’s a 36mer, why does the bedGraph line indicate chr1:10000-10078 instead of chr1:10000-10035? Does that mean in fact that the region chr1:10000-10078 contains a set of 43 overlapping 36mers (kmer #1: 10000-10035, kmer #2: 10001-10036, …, kmer #43: 10042-10078)? And that it is the whole set of these 43 36mers that is unique?
Also, related to question 2: the 2nd line of the “alignability” file corresponds to the region chr1:10078-10081 with score 0.0238095. If I understood well, it means that this region of 4 nucleotides is present only 42 times in the whole human genome (+- 2 mismatches). This looks quite low to me given that it is only 4-bp long (even lower considering the possible mismatches). And what does it correspond to as the file was generated with 36mer?
(I had a look at the paper but couldn’t find an answer to this question.)
1) Nice pick up; it should be 1/734. Good to have another set of eyes going through my posts 🙂
2) Exactly.
No, you got it right in point 2. chr1:10078-10081 refers to the 36mers at chr1:10078-10113, chr1:10079-10114, chr1:10080-10115 and chr1:10081-10116.
I have a follow up question:
so given that the alignability scores are the following:
chr 1 10078 10081 0.0238095
chr1 10081 10088 0.0185185
does the kmer chr1:10081-10116 has score of 0.02 or 0.018?
ok, got it, any region in the bedGraph file contains all the start of 36mers, so I need to correct my point 2: there are not 43 36mers but 79, the 1st starting at 10000, the 2nd starting at 10001, …, and the 79th starting at 10078
thanks!
Thank you for your nice explanation. For some genome, we couldn’t find the mappability track from the ENCODE web page. For example, I want to get the mappability track of Rice(MHRS1), but couldn’t find this and want to construct by my self. Therefore, Would you please tell us how we construct the mappability track by ourselves.
This may be useful https://wiki.bits.vib.be/index.php/Create_a_mappability_track.