Transposons can "jump" around in genomes, and sometimes they land in genomic sites already occupied by other repetitive elements; for the purpose of this post, I refer to these cases as "composite repetitive elements". While almost all DNA transposons and the majority of retrotransposons have lost the ability to move around in the human genome, past transposition events are captured within the genome sequence. This post is about finding composite repetitive elements in the human genome based on RepeatMasker annotations.
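As a minimal illustration of the idea (not the full pipeline), overlapping RepeatMasker annotations can be flagged with a sort/awk one-liner; the sample bed file and repeat names below are made up:

```shell
# hypothetical RepeatMasker-style annotations: chrom, start, end, repeat name
printf 'chr1\t100\t500\tAluY\nchr1\t300\t400\tL1PA2\nchr1\t900\t1200\tMIR\n' > rmsk_sample.bed

# sort by position, then report each annotation that starts before the previous
# annotation on the same chromosome ends, i.e. a candidate composite repetitive
# element (this simple version only catches overlaps between adjacent lines)
sort -k1,1 -k2,2n rmsk_sample.bed |
  awk -F'\t' 'NR > 1 && $1 == pc && $2 < pe {print pn " overlaps " $4}
              {pc = $1; pe = $3; pn = $4}' > composite.txt

cat composite.txt   # AluY overlaps L1PA2
```

In practice, running bedtools intersect on the real hg19_rmsk.bed (with the repeat class carried along) would be the more robust way to find these overlaps.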
Updated 2015 February 8th to include some scatter plots of genome size versus repeat content.
I was writing about the makeup of genomes today and was looking up statistics on repetitive elements in vertebrate genomes. While I could find individual papers with repetitive element statistics for a particular genome, I was unable to find a summary across a list of vertebrate genomes (though, to be honest, I didn't look very hard). So I thought I'd make my own and share it on my blog and via figshare. I will use the RepeatMasker annotations provided via the UCSC Genome Browser.
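The per-genome summary boils down to dividing total repeat bp by genome size; a hedged sketch with a made-up file name (the hg19 repeat total comes from the RepeatMasker statistics discussed below, and the genome size is approximate):

```shell
# hypothetical summary table: assembly, total repeat bp, genome bp
# (the hg19 genome size here is approximate)
printf 'hg19\t1467396988\t3100000000\n' > repeat_summary.tsv

# repeat content as a percentage of genome size
awk -F'\t' '{printf "%s\t%.1f%%\n", $1, 100 * $2 / $3}' repeat_summary.tsv
# hg19    47.3%
```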
If you've ever wondered how mappable a specific repeat is, here's a quick post on creating a plot showing the mappability of a repetitive element along its consensus sequence. Specifically, I took the consensus sequence of a repeat and created sub-sequences using a sliding window moved along the sequence at 1 bp intervals, then mapped these sub-sequences to hg19.
I will use BWA for the mapping, so first download and compile BWA:
wget http://sourceforge.net/projects/bio-bwa/files/latest/bwa-0.7.7.tar.bz2
tar -xjf bwa-0.7.7.tar.bz2
cd bwa-0.7.7
make

#index hg19
bwa index hg19.fa
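The sliding-window step itself can be sketched in the shell; the toy 16 bp sequence and window size below are my own (the post slides a window along the real consensus sequence at 1 bp steps):

```shell
cons=ACGTACGTACGTACGT   # hypothetical consensus sequence (16 bp)
k=8                     # hypothetical window size

# emit every k-mer at 1 bp intervals as a fasta record named by its offset
for i in $(seq 1 $(( ${#cons} - k + 1 ))); do
  printf '>%d\n%s\n' "$i" "$(echo "$cons" | cut -c "$i-$((i + k - 1))")"
done > windows.fa

grep -c '^>' windows.fa   # 9 windows: 16 - 8 + 1
```

The resulting windows.fa can then be mapped against the indexed hg19 with BWA.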
Updated 10th September 2013 to include LAST
I previously looked at mapping repeats with respect to sequencing errors in high throughput sequencing and, as one would expect, the accuracy of the mapping decreased when sequencing errors were introduced. I then looked at aligning to unique regions of the genome to get an idea of how different short read alignment programs perform with reads that should map uniquely. Here I combine the two ideas, to see how different short read alignment programs perform when mapping repeats.
When I wrote my first mapping repeats post, I was made aware of this review article on "Repetitive DNA and next-generation sequencing: computational challenges and solutions" via Twitter (thank you CB). It was also his suggestion that it would be interesting to compare different short read alignment programs with respect to mapping repeats, hence this post.
The ENCODE mappability tracks can be visualised on the UCSC Genome Browser and they provide a sense of how mappable a region of the genome is in terms of short reads or k-mers. On a side note, it seems some people use "mapability" and some use "mappability"; I was taught the CVC rule, so I'm going to stick with "mappability".
There are two sets of tracks, alignability and uniqueness, but what's the difference between the two? From the UCSC Genome Browser:
Alignability - These tracks provide a measure of how often the sequence found at the particular location will align within the whole genome. Unlike measures of uniqueness, alignability will tolerate up to 2 mismatches. These tracks are in the form of signals ranging from 0 to 1 and have several configuration options.
Uniqueness - These tracks are a direct measure of sequence uniqueness throughout the reference genome. These tracks are in the form of signals ranging from 0 to 1 and have several configuration options.
Let's take a look at two examples, where I try to use k-mers of similar size:
#download the files
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeDukeMapabilityUniqueness35bp.bigWig
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeCrgMapabilityAlign36mer.bigWig

#bigWigToBedGraph can be downloaded at http://hgdownload.cse.ucsc.edu/admin/exe/
#convert into flat format
bigWigToBedGraph wgEncodeDukeMapabilityUniqueness35bp.bigWig wgEncodeDukeMapabilityUniqueness35bp.bedGraph
bigWigToBedGraph wgEncodeCrgMapabilityAlign36mer.bigWig wgEncodeCrgMapabilityAlign36mer.bedGraph

head wgEncodeDukeMapabilityUniqueness35bp.bedGraph
chr1 0 10145 0
chr1 10145 10160 1
chr1 10160 10170 0.5
chr1 10170 10175 0.333333
chr1 10175 10177 0.5
chr1 10177 10215 0
chr1 10215 10224 1
chr1 10224 10229 0.5
chr1 10229 10248 1
chr1 10248 10274 0

head wgEncodeCrgMapabilityAlign36mer.bedGraph
chr1 10000 10078 0.0013624
chr1 10078 10081 0.0238095
chr1 10081 10088 0.0185185
chr1 10088 10089 0.0147059
chr1 10089 10096 0.0185185
chr1 10096 10097 0.0238095
chr1 10097 10099 0.0185185
chr1 10099 10102 0.00172712
chr1 10102 10120 0.0013624
chr1 10120 10121 0.00172712
The bedGraph format simply associates a score (fourth column) with a region defined by the first three columns. At first glance you can already see the difference in the precision of the scores. The scores are calculated the same way, 1 divided by the number of places the 35/36-mer maps in the genome; however, the uniqueness track only keeps scores for reads that map to at most 4 places (i.e. scores of 0.25 or higher). So according to the alignability track, the first 36-mer, chr1:10000-10035, maps to roughly 734 places (since 1/734 ≈ 0.0013624).
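To make the arithmetic concrete, here's a small awk sketch that inverts the alignability score to recover the number of genomic locations, using the first bedGraph line from above:

```shell
# first line of the CRG 36-mer alignability bedGraph from above
printf 'chr1\t10000\t10078\t0.0013624\n' > first_line.bedGraph

# number of places the 36-mers starting in this interval map to = 1/score
awk -F'\t' '{printf "%s:%d-%d\t%.0f places\n", $1, $2, $3, 1 / $4}' first_line.bedGraph
# chr1:10000-10078    734 places
```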
Most eukaryotic genomes are interspersed with repetitive elements, and some of these elements have transcriptional activity, hence they appear when we sequence the RNA population. From the trend of things, some of these elements seem to be important. One strategy for analysing these repeats is to map them back to the genome, to see where they came from and which repeat class they arose from. This post looks into mapping repeats and how sequencing accuracy can affect the mapping accuracy.
I will use the human genome as an example; according to RepeatMasker and Repbase, there are 5,298,130 annotated repetitive elements in the human genome. How much of the genome is that? First download the RepeatMasker results performed on hg19 from the UCSC Table Browser tool. I've downloaded the results as a bed file and named it hg19_rmsk.bed.
#the extremely useful stats program is available at
#https://github.com/arq5x/filo
cat ~/ucsc/hg19_rmsk.bed | perl -nle '@a=split; $s=$a[2]-$a[1]; print $s' | stats

Total lines: 5298130
Sum of lines: 1467396988
Ari. Mean: 276.965077867097
Geo. Mean: 168.518495379209
Median: 188
Mode: 21 (N=44789)
Anti-Mode: 3849 (N=1)
Minimum: 6
Maximum: 160602
Variance: 216904.549201035
StdDev: 465.730124858845
In the above, I piped the bed file into Perl, which subtracted the start coordinate from the end coordinate for each repeat, and piped the lengths into the stats program to calculate simple summary statistics. The total number of lines corresponds to the number of repetitive elements, which together make up 1,467,396,988 bp of the hg19 genome, around half of the genome. Now to convert this bed file into a fasta file and randomly sample 5 million reads from the repeats.
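A hedged sketch of the sampling step on a toy fasta file (the record names and sequences are made up): linearise the records, draw a random subset with shuf, and restore the fasta layout. It assumes one line per sequence.

```shell
# hypothetical fasta of repeat sequences (one line per sequence)
printf '>r1\nACGT\n>r2\nGGCC\n>r3\nTTAA\n>r4\nCCGG\n>r5\nATAT\n' > repeats.fa

# pair each header with its sequence, sample 3 records at random, unpair
paste - - < repeats.fa | shuf -n 3 | tr '\t' '\n' > sampled.fa

grep -c '^>' sampled.fa   # 3
```

In practice the fasta would come from running bedtools getfasta on hg19_rmsk.bed, and a dedicated tool such as seqtk sample scales better to 5 million reads.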
Updated on the 31st May 2013 and updated again on the 25th March 2015 in light of Chris's comment.
RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences. Results of RepeatMasker performed on the human and mouse genomes are provided via the UCSC Table Browser tool. In this post I will summarise the results of the RepeatMasker program to gain an overview of the repetitive landscape of the human and mouse genomes.
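The summary itself is essentially a group-by over the annotations; a minimal sketch assuming a bed-like file whose fourth column holds the repeat class (the sample intervals below are made up):

```shell
# hypothetical RepeatMasker output: chrom, start, end, repeat class
printf 'chr1\t100\t400\tSINE\nchr1\t500\t900\tLINE\nchr2\t0\t300\tSINE\n' > rmsk_class.bed

# total bases covered by each repeat class
awk -F'\t' '{bp[$4] += $3 - $2} END {for (c in bp) print c "\t" bp[c]}' rmsk_class.bed | sort
# LINE    400
# SINE    600
```

The same idea, applied to the full hg19 and mm10 RepeatMasker tables, gives the per-class landscape summarised in this post.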