ENCODE mappability and repeats

The ENCODE mappability tracks can be visualised on the UCSC Genome Browser and they provide a sense of how mappable a region of the genome is in terms of short reads or k-mers. On a side note, it seems some people use "mapability" and some use "mappability"; I was taught the CVC rule, so I'm going to stick with "mappability".

There are two sets of tracks, alignability and uniqueness, but what's the difference between the two? From the UCSC Genome Browser:

Alignability - These tracks provide a measure of how often the sequence found at the particular location will align within the whole genome. Unlike measures of uniqueness, alignability will tolerate up to 2 mismatches. These tracks are in the form of signals ranging from 0 to 1 and have several configuration options.

Uniqueness - These tracks are a direct measure of sequence uniqueness throughout the reference genome. These tracks are in the form of signals ranging from 0 to 1 and have several configuration options.

Let's take a look at two examples, where I try to use k-mers of similar size:

#download the files
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeDukeMapabilityUniqueness35bp.bigWig
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeCrgMapabilityAlign36mer.bigWig
#bigWigToBedGraph can be downloaded at http://hgdownload.cse.ucsc.edu/admin/exe/
#convert into flat format
bigWigToBedGraph wgEncodeDukeMapabilityUniqueness35bp.bigWig wgEncodeDukeMapabilityUniqueness35bp.bedGraph
bigWigToBedGraph wgEncodeCrgMapabilityAlign36mer.bigWig wgEncodeCrgMapabilityAlign36mer.bedGraph
head wgEncodeDukeMapabilityUniqueness35bp.bedGraph
chr1    0       10145   0
chr1    10145   10160   1
chr1    10160   10170   0.5
chr1    10170   10175   0.333333
chr1    10175   10177   0.5
chr1    10177   10215   0
chr1    10215   10224   1
chr1    10224   10229   0.5
chr1    10229   10248   1
chr1    10248   10274   0
head wgEncodeCrgMapabilityAlign36mer.bedGraph
chr1    10000   10078   0.0013624
chr1    10078   10081   0.0238095
chr1    10081   10088   0.0185185
chr1    10088   10089   0.0147059
chr1    10089   10096   0.0185185
chr1    10096   10097   0.0238095
chr1    10097   10099   0.0185185
chr1    10099   10102   0.00172712
chr1    10102   10120   0.0013624
chr1    10120   10121   0.00172712

The bedGraph format simply associates a score (fourth column) to a particular region, as defined in the first three columns. At first glance you can already see the difference in the precision of the scores. The scores were calculated the same way, 1/the number of places a 35/36mer maps to the genome; however the uniqueness track will only keep scores for reads that map up to 4 places (0.25). So according to mappability track, the region chr1:10000-10035 will map to 734 places (since 1/734 =~ .0013624).

Another thing to keep in mind is that the alignability track tolerates up to 2 mismatches, so as you can imagine, a read that aligns to two places in the alignability track, may be unique in the uniqueness track.

#this region is unique
head -7 wgEncodeDukeMapabilityUniqueness35bp.bedGraph | tail -1
chr1    10215   10224   1
#however with 2 mismatches it's not so unique
cat wgEncodeCrgMapabilityAlign36mer.bedGraph| head -35 | tail -4
chr1    10209   10216   0.0147059
chr1    10216   10217   0.0114943
chr1    10217   10218   0.0147059
chr1    10218   10221   0.0625

Just a word on the file format using this example from the align36mer track:

chr1    10216   10217   0.0114943

This means that the region chr1:10216-10251 (10216+35 due to the inclusiveness of the numbering) maps to 87 places with a threshold of two mismatches. If you blat the sequence TAACCCTAACCCTAACCCCTAACCCTAACCCTAAAC, the highest scoring match is that same region.

mappability-01I drew this to help explain the tracks.

How unique are repetitive elements?

I previously looked at mapping repeats with respect to sequencing errors but now with the uniqueness tracks I can get an idea of the number of unique sites within repetitive elements.

cat wgEncodeDukeMapabilityUniqueness35bp.bedGraph | awk '$4 == 1 {OFS="\t"; print $1, $2, $3}' > wgEncodeDukeMapabilityUniqueness35bp.bed
#all the unique regions
head wgEncodeDukeMapabilityUniqueness35bp.bed
chr1    10145   10160
chr1    10215   10224
chr1    10229   10248
chr1    10311   10321
chr1    10434   10469
chr1    10484   10533
chr1    10588   10623
chr1    10782   10815
chr1    13044   13072
chr1    13292   13327
cat wgEncodeDukeMapabilityUniqueness35bp.bed | wc
12444047 37332141 296751457
#I downloaded the RepeatMasker track using the Table Browser tool at the UCSC Genome Browser
cat ~/ucsc/hg19_rmsk.bed | wc
5298130 31788780 193492882
#the extremely useful stats program is available
#https://github.com/arq5x/filo
cat ~/ucsc/hg19_rmsk.bed | perl -nle '@a=split; $s=$a[2]-$a[1]; print $s' | stats
Total lines:            5298130
Sum of lines:           1467396988
Ari. Mean:              276.965077867097
Geo. Mean:              168.518495379209
Median:                 188
Mode:                   21 (N=44789)
Anti-Mode:              3849 (N=1)
Minimum:                6
Maximum:                160602
Variance:               216904.549201035
StdDev:                 465.730124858845

So all the repeats add up to 1,467,396,988 bp of sequence in the human genome (hg19). Before intersecting the repeats with the uniqueness track, I need to trim 34 bp from the repeat coordinates or else the coordinates will span a region outside of the repeat. I will also need to remove repeats that are less than 34 bp; the code below is minus 34 because start coordinates are in zero based coordinates.

cat ~/ucsc/hg19_rmsk.bed | perl -nle '@a = split; $a[2]-=34; next unless $a[2]>$a[1]; print join("\t",@a);' > hg19_rmsk_trimmed.bed
#lost a few repeats
cat hg19_rmsk_trimmed.bed | wc
4885299 29311794 178280373
intersectBed -a hg19_rmsk_trimmed.bed -b wgEncodeDukeMapabilityUniqueness35bp.bed -wo > hg19_rmsk_trimmed_wgEncodeDukeMapabilityUniqueness35bp.bed

cat hg19_rmsk_trimmed_wgEncodeDukeMapabilityUniqueness35bp.bed | head
chr1    16777160        16777436        AluSp   2147    +       chr1    16777229        16777239        10
chr1    16777160        16777436        AluSp   2147    +       chr1    16777277        16777316        39
chr1    16777160        16777436        AluSp   2147    +       chr1    16777323        16777338        15
chr1    16777160        16777436        AluSp   2147    +       chr1    16777339        16777351        12
chr1    16777160        16777436        AluSp   2147    +       chr1    16777371        16777404        33
chr1    16777160        16777436        AluSp   2147    +       chr1    16777409        16779647        27
chr1    33553606        33554612        L2b     626     +       chr1    33553301        33554928        1006
chr1    50330063        50332119        L1PA10  12545   +       chr1    50329862        50331296        1233
chr1    50330063        50332119        L1PA10  12545   +       chr1    50331299        50331309        10
chr1    50330063        50332119        L1PA10  12545   +       chr1    50331313        50331368        55
cat hg19_rmsk_trimmed_wgEncodeDukeMapabilityUniqueness35bp.bed | wc
14359713 143597130 917916687

#how many repeats harbour unique regions?
#most of them!
cat hg19_rmsk_trimmed_wgEncodeDukeMapabilityUniqueness35bp.bed | cut -f1-4 | sort -u | wc
4638952 18555808 139080136

#the tenth column in this file shows the number of bp overlap
cat hg19_rmsk_trimmed_wgEncodeDukeMapabilityUniqueness35bp.bed | cut -f10 | stats
Total lines:            14359713
Sum of lines:           956605598
Ari. Mean:              66.6173201372479
Geo. Mean:              21.2565940240155
Median:                 25
Mode:                   1 (N=869273)
Anti-Mode:              2082 (N=1)
Minimum:                1
Maximum:                7599
Variance:               17800.1933217871
StdDev:                 133.417365143324

So what does the 956,605,598 number represent? That would be the number of starting sites at repeats that are unique. Imagine if I was looking a specific repeat that was 100 bp long and the entire sequence was unique. Using the 35mer uniqueness track, I would get 66 unique starting sites. To make sense of the 956,605,598 number we need to find out how many starting sites there are.

#minus the end from the start in the bed file
#minus another one because of the zero based coordinate system from the UCSC Genome Browser
#the zero based system makes the feature 1 bp longer
cat hg19_rmsk_trimmed.bed | awk '{print $3-$2}' | stats
Total lines:            4885299
Sum of lines:           1290624407
Ari. Mean:              264.185346076054
Geo. Mean:              139.701977637983
Median:                 177
Mode:                   265 (N=33186)
Anti-Mode:              3815 (N=1)
Minimum:                1
Maximum:                160568
Variance:               229453.013870233
StdDev:                 479.0125404102
bc -l<<<956605598/1290624407
.74119596128170850561

I thought the number would be lower but it seems that ~74% of positions within repeats are unique. What about coding exon sequences (that a longer than 34 bp)?

zcat ~/ucsc/hg19_refgene_coding_exon.bed.gz | perl -nle '@a = split; $a[2]-=34; next unless $a[2]>$a[1]; print join("\t",@a);' > hg19_refgene_coding_exon_trimmed.bed
intersectBed -a hg19_refgene_coding_exon_trimmed.bed -b wgEncodeDukeMapabilityUniqueness35bp.bed -wo > hg19_refgene_coding_exon_trimmed_wgEncodeDukeMapabilityUniqueness35bp.bed
#how many starting sites
cat hg19_refgene_coding_exon_trimmed.bed | awk '{print $3-$2}' | stats
Total lines:            357554
Sum of lines:           48822602
Ari. Mean:              136.54609373689
Geo. Mean:              85.5854944294962
Median:                 91
Mode:                   62 (N=3681)
Anti-Mode:              672 (N=1)
Minimum:                1
Maximum:                21659
Variance:               71223.7271652088
StdDev:                 266.877738234587
#how many unique
cat hg19_refgene_coding_exon_trimmed_wgEncodeDukeMapabilityUniqueness35bp.bed | cut -f10 | stats
Total lines:            370576
Sum of lines:           43821518
Ari. Mean:              118.252444842623
Geo. Mean:              71.4540055289801
Median:                 80
Mode:                   35 (N=9285)
Anti-Mode:              857 (N=1)
Minimum:                1
Maximum:                17072
Variance:               48421.6841883575
StdDev:                 220.04927672764
bc -l <<<43821518/48822602
.89756621328785385096

~90% of the the RefSeq coding exons positions are unique; I feel that this number should be higher for no apparent reason other than my gut feeling. Let me double check this:

cat wgEncodeDukeMapabilityUniqueness35bp.bedGraph | awk '$4 < 1 {OFS="\t"; print $1, $2, $3}' > wgEncodeDukeMapabilityUniqueness35bp_not_unique.bed
intersectBed -a hg19_refgene_coding_exon_trimmed.bed -b wgEncodeDukeMapabilityUniqueness35bp_not_unique.bed -wo > hg19_refgene_coding_exon_trimmed_wgEncodeDukeMapabilityUniqueness35bp_not_unique.bed
cat hg19_refgene_coding_exon_trimmed_wgEncodeDukeMapabilityUniqueness35bp_not_unique.bed | head
head hg19_refgene_coding_exon_trimmed_wgEncodeDukeMapabilityUniqueness35bp_not_unique.bed
chr1    25124232        25124308        NM_013943_cds_1_0_chr1_25124233_f       0       +       chr1    25124232        25124246        14
chr1    25140584        25140676        NM_013943_cds_2_0_chr1_25140585_f       0       +       chr1    25140660        25140665        5
chr1    25140584        25140676        NM_013943_cds_2_0_chr1_25140585_f       0       +       chr1    25140668        25140678        8
chr1    25153500        25153573        NM_013943_cds_3_0_chr1_25153501_f       0       +       chr1    25153500        25153512        12
chr1    25153500        25153573        NM_013943_cds_3_0_chr1_25153501_f       0       +       chr1    25153512        25153540        28
chr1    25153500        25153573        NM_013943_cds_3_0_chr1_25153501_f       0       +       chr1    25153540        25153541        1
chr1    25153500        25153573        NM_013943_cds_3_0_chr1_25153501_f       0       +       chr1    25153541        25153573        32
chr1    25166350        25166498        NM_013943_cds_4_0_chr1_25166351_f       0       +       chr1    25166347        25166389        39
chr1    25166350        25166498        NM_013943_cds_4_0_chr1_25166351_f       0       +       chr1    25166389        25166392        3
chr1    25166350        25166498        NM_013943_cds_4_0_chr1_25166351_f       0       +       chr1    25166392        25166401        9
cat hg19_refgene_coding_exon_trimmed_wgEncodeDukeMapabilityUniqueness35bp_not_unique.bed | cut -f10 | stats
Total lines:            120750
Sum of lines:           2870360
Ari. Mean:              23.7710973084886
Geo. Mean:              9.14534781538738
Median:                 9
Mode:                   1 (N=14096)
Anti-Mode:              197 (N=1)
Minimum:                1
Maximum:                1994
Variance:               4132.41012943783
StdDev:                 64.2838247884943
bc -l <<<2870360/48822602
.05879162278159611402

~6% of coding exons are in non-unique regions. Why doesn't this add to 100% (i.e. 6 + 90 != 100)?

intersectBed -a hg19_refgene_coding_exon_trimmed.bed -b wgEncodeDukeMapabilityUniqueness35bp.bedGraph -v > blah
#unassembled chromosomes
head blah
chr6_apd_hap1   3146158 3146246 NM_001178045_cds_0_0_chr6_apd_hap1_3146159_r    0       -
chr6_apd_hap1   3147183 3147234 NM_001178045_cds_1_0_chr6_apd_hap1_3147184_r    0       -
chr6_apd_hap1   3147347 3147408 NM_001178045_cds_2_0_chr6_apd_hap1_3147348_r    0       -
chr6_apd_hap1   3147541 3147578 NM_001178045_cds_3_0_chr6_apd_hap1_3147542_r    0       -
chr6_apd_hap1   3147846 3147886 NM_001178045_cds_4_0_chr6_apd_hap1_3147847_r    0       -
chr6_apd_hap1   3148029 3148099 NM_001178045_cds_5_0_chr6_apd_hap1_3148030_r    0       -
chr6_apd_hap1   3148221 3148282 NM_001178045_cds_6_0_chr6_apd_hap1_3148222_r    0       -
chr6_apd_hap1   3148404 3148624 NM_001178045_cds_7_0_chr6_apd_hap1_3148405_r    0       -
chr6_apd_hap1   3151686 3151755 NM_001178045_cds_8_0_chr6_apd_hap1_3151687_r    0       -
chr6_apd_hap1   3152076 3152135 NM_001178045_cds_9_0_chr6_apd_hap1_3152077_r    0       -
cat blah  | cut -f1 | sort -u
#output below modified to save space
chr17_ctg5_hap1 chr19_gl000209_random chr1_gl000191_random chr4_ctg9_hap1 chr6_apd_hap1
chr6_cox_hap2 chr6_dbb_hap3 chr6_mann_hap4 chr6_mcf_hap5 chr6_qbl_hap6 chr6_ssto_hap7
chr7_gl000195_random chrUn_gl000213 chrUn_gl000222 chrUn_gl000223 chrUn_gl000228
#I know I should stop naming files as blah
#it's a bad habit I picked up from Rob
cat blah | awk '{print $3-$2}' | stats
Total lines:            15809
Sum of lines:           2130724
Ari. Mean:              134.779176418496
Geo. Mean:              83.0615911368902
Median:                 84
Mode:                   20 (N=655)
Anti-Mode:              124 (N=1)
Minimum:                1
Maximum:                4565
Variance:               47948.4880830461
StdDev:                 218.971432116261

#and here's the culprit!
bc -l<<<2130724/48822602
.04364216393055003500

#recalculate the statistics
#refSeq coding exons unique
bc -l<<<43821518/"(48822602-2130724)"
.93852549687549513429
#refSeq coding exon non-unique
bc -l <<<2870360/"(48822602-2130724)"
.06147450312450486570

Conclusions

The ENCODE mappability tracks are a great visual resource when viewed on a Genome Browser as well as being a nice resource for calculating the mappability of regions of interest. I demonstrated that ~74% of the regions within repetitive elements are unique, which I thought was quite high, but this is probably due to the higher mutation rate in these sequences compared to those under selective pressure. RefSeq coding exons on the other hand have ~94% of their regions unique. If I used the alignability track instead:

cat wgEncodeCrgMapabilityAlign36mer.bedGraph | awk '$4 == 1 {OFS="\t"; print $1, $2, $3}' > wgEncodeCrgMapabilityAlign36mer.bed
cat wgEncodeCrgMapabilityAlign36mer.bed | wc
15365566 46096698 366807922
cat ~/ucsc/hg19_rmsk.bed | perl -nle '@a = split; $a[2]-=35; next unless $a[2]>$a[1]; print join("\t",@a);' > hg19_rmsk_trimmed_36.bed
cat hg19_rmsk_trimmed_36.bed | awk '{print $3-$2}' | stats
Total lines:            4860460
Sum of lines:           1285739108
Ari. Mean:              264.530334165902
Geo. Mean:              140.426791054987
Median:                 178
Mode:                   264 (N=33186)
Anti-Mode:              3814 (N=1)
Minimum:                1
Maximum:                160567
Variance:               230269.825370541
StdDev:                 479.864382269138
intersectBed -a hg19_rmsk_trimmed_36.bed -b wgEncodeCrgMapabilityAlign36mer.bed -wo > hg19_rmsk_trimmed_36_wgEncodeCrgMapabilityAlign36mer.bed
cat hg19_rmsk_trimmed_36_wgEncodeCrgMapabilityAlign36mer.bed | cut -f10 | stats
Total lines:            14890348
Sum of lines:           702976025
Ari. Mean:              47.2101810515107
Geo. Mean:              13.7237631735887
Median:                 14
Mode:                   1 (N=1585619)
Anti-Mode:              1573 (N=1)
Minimum:                1
Maximum:                5235
Variance:               9311.44025498465
StdDev:                 96.495804338762
bc -l<<<702976025/1285739108
.54674857490606873568

As you would expect, when taking into account mismatches, repetitive elements are much less unique; though ~55% of the sites are deemed as unique when using 36-mers.

See also

The paper corresponding to the alignability track.




Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
7 comments Add yours
  1. At the beginning of the post: “according to mappability track, the region chr1:10000-10035 will map to 734 places (since 1/733 =~ .001364)”
    I have two simple questions:
    1) Shouldn’t it 1/734 as 1/734 ~= 0.0013624 (same as file) whereas 1/733 ~= 0.0013642 (different)?
    2) As it’s a 36mer, why does the bedGraph line indicate chr1:10000-10078 instead of chr1:10000-10035? Does that mean in fact that the region chr1:10000-10078 contains a set of 43 overlapping 36mers (kmer #1: 10000-10035, kmer #2: 10001-10036, …, kmer #43: 10042-10078)? And that it is the whole set of these 43 36mers that is unique?

    Also, related to question 2: the 2nd line of the “alignability” file corresponds to the region chr1:10078-10081 with score 0.0238095. If I understood well, it means that this region of 4 nucleotides is present only 42 times in the whole human genome (+- 2 mismatches). This looks quite low to me given that it is only 4-bp long (even lower considering the possible mismatches). And what does it correspond to as the file was generated with 36mer?
    (I had a look at the paper but couldn’t find an answer to this question.)

    1. 1) Nice pick up; it should be 1/734. Good to have another set of eyes going through my posts 🙂
      2) Exactly.

      No, you got it right in point 2. chr1:10078-10081 refers to the 36mers at chr1:10078-10113, chr1:10079-10114, chr1:10080-10115 and chr1:10081-10116.

      1. I have a follow up question:
        so given that the alignability scores are the following:
        chr 1 10078 10081 0.0238095
        chr1 10081 10088 0.0185185
        does the kmer chr1:10081-10116 has score of 0.02 or 0.018?

  2. ok, got it, any region in the bedGraph file contains all the start of 36mers, so I need to correct my point 2: there are not 43 36mers but 79, the 1st starting at 10000, the 2nd starting at 10001, …, and the 79th starting at 10078
    thanks!

  3. Thank you for your nice explanation. For some genome, we couldn’t find the mappability track from the ENCODE web page. For example, I want to get the mappability track of Rice(MHRS1), but couldn’t find this and want to construct by my self. Therefore, Would you please tell us how we construct the mappability track by ourselves.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.