Finding composite repetitive elements

Transposons have the ability to "jump" around in genomes and sometimes transposons jump into genomic sites occupied by other repetitive elements; these cases are what I refer to as "composite repetitive elements" for the purpose of this post. While almost all DNA transposons and the majority of retrotransposons have lost the ability to move around in the human genome, transposition events that have occurred in the past are captured within the genome sequence. This post is about finding composite repetitive elements in the human genome based on RepeatMasker annotations.

Continue reading

Repetitive elements in vertebrate genomes

Updated 2015 February 8th to include some scatter plots of genome size versus repeat content.

I was writing about the make up of genomes today and was looking up statistics on repetitive elements in vertebrate genomes. While I could find individual papers with repetitive element statistics for a particular genome, I was unable to find a summary for a list of vertebrate genomes (but to be honest I didn't look very hard). So I thought I'll make my own and share it on my blog and via figshare. I will use the RepeatMasker annotations provided via the UCSC genome browser.

Continue reading

How mappable is a specific repeat?

If you've ever wondered how mappable a specific repeat is, here's a quick post on creating a plot showing the mappability of a repetitive element along its consensus sequence. Specifically, the consensus sequence of a repeat was taken and sub-sequences were created by a sliding window approach (i.e. moving along the sequence) at 1 bp intervals and these sub-sequences were mapped to hg19.

I will use bwa for the mapping, so firstly download and compile bwa:

tar -xjf bwa-0.7.7.tar.bz2
cd bwa-0.7.7

#index hg19
bwa index hg19.fa

Continue reading

Mapping repeats 2

Updated 10th September 2013 to include LAST

I previously looked at mapping repeats with respect to sequencing errors in high throughput sequencing and as one would expect, the accuracy of the mapping decreased when sequencing errors were introduced. I then looked at aligning to unique regions of the genome to get an idea of how different short read alignment perform with reads that should map uniquely. Here I combine the two ideas, to see how different short read alignment programs perform when mapping repeats.

When I wrote my first mapping repeats post, I was made aware of this review article on "Repetitive DNA and next-generation sequencing: computational challenges and solutions" via Twitter (thank you CB). It was also his suggestion that it would be interesting to compare different short read alignment programs with respect to mapping repeats, hence this post.

Continue reading

ENCODE mappability and repeats

The ENCODE mappability tracks can be visualised on the UCSC Genome Browser and they provide a sense of how mappable a region of the genome is in terms of short reads or k-mers. On a side note, it seems some people use "mapability" and some use "mappability"; I was taught the CVC rule, so I'm going to stick with "mappability".

There are two sets of tracks, alignability and uniqueness, but what's the difference between the two? From the UCSC Genome Browser:

Alignability - These tracks provide a measure of how often the sequence found at the particular location will align within the whole genome. Unlike measures of uniqueness, alignability will tolerate up to 2 mismatches. These tracks are in the form of signals ranging from 0 to 1 and have several configuration options.

Uniqueness - These tracks are a direct measure of sequence uniqueness throughout the reference genome. These tracks are in the form of signals ranging from 0 to 1 and have several configuration options.

Let's take a look at two examples, where I try to use k-mers of similar size:

#download the files
#bigWigToBedGraph can be downloaded at
#convert into flat format
bigWigToBedGraph wgEncodeDukeMapabilityUniqueness35bp.bigWig wgEncodeDukeMapabilityUniqueness35bp.bedGraph
bigWigToBedGraph wgEncodeCrgMapabilityAlign36mer.bigWig wgEncodeCrgMapabilityAlign36mer.bedGraph
head wgEncodeDukeMapabilityUniqueness35bp.bedGraph
chr1    0       10145   0
chr1    10145   10160   1
chr1    10160   10170   0.5
chr1    10170   10175   0.333333
chr1    10175   10177   0.5
chr1    10177   10215   0
chr1    10215   10224   1
chr1    10224   10229   0.5
chr1    10229   10248   1
chr1    10248   10274   0
head wgEncodeCrgMapabilityAlign36mer.bedGraph
chr1    10000   10078   0.0013624
chr1    10078   10081   0.0238095
chr1    10081   10088   0.0185185
chr1    10088   10089   0.0147059
chr1    10089   10096   0.0185185
chr1    10096   10097   0.0238095
chr1    10097   10099   0.0185185
chr1    10099   10102   0.00172712
chr1    10102   10120   0.0013624
chr1    10120   10121   0.00172712

The bedGraph format simply associates a score (fourth column) to a particular region, as defined in the first three columns. At first glance you can already see the difference in the precision of the scores. The scores were calculated the same way, 1/the number of places a 35/36mer maps to the genome; however the uniqueness track will only keep scores for reads that map up to 4 places (0.25). So according to mappability track, the region chr1:10000-10035 will map to 734 places (since 1/734 =~ .0013624).

Continue reading

Mapping repeats

Most eukaryotic genomes are interspersed with repetitive elements and some of these elements have transcriptional activity, hence they appear when we sequence the RNA population. From the trend of things, some of these elements seem to be important. One strategy for analysing these repeats is to map them to the genome, to see where they came from and from what repeat class they arose from. This post looks into mapping repeats and how sequencing accuracy can affect the mapping accuracy.

I will use the human genome as an example; according to RepeatMasker and Repbase, there are roughly 5,298,130 repetitive elements in the human genome. How much of the genome is that? First download the RepeatMasker results performed on hg19 from the UCSC Table Browser tool. I've downloaded the results as a bed file and named it hg19_rmsk.bed.

#the extremely useful stats program is available
cat ~/ucsc/hg19_rmsk.bed | perl -nle '@a=split; $s=$a[2]-$a[1]; print $s' | stats
Total lines:            5298130
Sum of lines:           1467396988
Ari. Mean:              276.965077867097
Geo. Mean:              168.518495379209
Median:                 188
Mode:                   21 (N=44789)
Anti-Mode:              3849 (N=1)
Minimum:                6
Maximum:                160602
Variance:               216904.549201035
StdDev:                 465.730124858845

In the above, I concatenated the entire bed file and redirected it to Perl, where it subtracted the end coordinate from the start, and outputted it into the stats program, where simple statistics were calculated. The total lines corresponded to the number of repetitive elements, which make up 1,467,396,988 bp of the hg19 genome. That's around half of the hg19 genome. Now to convert this bed file into a fasta file and randomly sample 5 million reads from the repeats.

Continue reading

The repetitive landscape of the human and mouse genome

Updated on the 31st May 2013 and updated again on the 25th March 2015 in light of Chris's comment.

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences. Results of RepeatMasker performed on the human and mouse genomes are provided via the UCSC Table Browser tool. In the post I will summarise the results of the RepeatMasker program to gain an overview of the repetitive landscape of the human and mouse genome.

Continue reading