Updated on the 31st May 2013 and updated again on the 25th March 2015 in light of Chris's comment.
RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences. Results of RepeatMasker performed on the human and mouse genomes are provided via the UCSC Table Browser tool. In the post I will summarise the results of the RepeatMasker program to gain an overview of the repetitive landscape of the human and mouse genome.
Below I provide a screenshot of the settings used to download the RepeatMasker results performed on hg19:
Use this settings to download the RepeatMasker results on hg19. Change the genome to mm9 for the mouse RepeatMasker results.
If you view the contents of the file, the output is not summarised:
zcat hg19_rmsk.tsv.gz | head #bin swScore milliDiv milliDel milliIns genoName genoStart genoEnd genoLeft strand repName repClass repFamily repStart repEnd repLeft id 1 2147 87 3 17 chr1 16777160 16777470 -232473151 + AluSp SINE Alu 1 313 0 2 1 2626 28 0 0 chr1 25165800 25166089 -224084532 - AluY SINE Alu -22 289 1 4 1 626 312 185 59 chr1 33553606 33554646 -215695975 + L2b LINE L2 2195 3356 -19 6 1 12545 132 15 48 chr1 50330063 50332153 -198918468 + L1PA10 LINE L1 15 2042 -4311 9 1 8050 8 1 9 chr1 58720067 58720973 -190529648 - L1PA2 LINE L1 -3 6152 5254 1 2 10586 185 68 72 chr1 75496180 75498100 -173752521 + L1MB7 LINE L1 4580 6518 -164 1 2 980 341 64 51 chr1 83886030 83886750 -165363871 - ERVL-E-int LTR ERVL -422 5245 4517 1 2 1422 234 189 34 chr1 100662895 100663391 -148587230 - L2a LINE L2 -857 2562 1990 1 2 532 244 6 66 chr1 117440426 117440514 -131810107 + L1ME1 LINE L1 2392 2470 -3930 1
I wrote a Perl script for summarising the RepeatMasker output files:
#!/usr/bin/perl
my $usage = "Usage: $0 <infile.repeatmasker>\n";
my $infile = shift or die $usage;
my %repName = ();
my %repClass = ();
my %repFamily = ();
open(IN,'-|',"gunzip -c $infile") || die "Could not open $infile: $!\n";
while(<IN>){
chomp;
next if (/^#/);
#bin swScore milliDiv milliDel milliIns genoName genoStart genoEnd genoLeft strand repName repClass repFamily repStart repEnd repLeft id
#1 2147 87 3 17 chr1 16777160 16777470 -232473151 + AluSp SINE Alu 1 313 0 2
my ($bin, $swScore, $milliDiv, $milliDel, $milliIns, $genoName, $genoStart, $genoEnd, $genoLeft, $strand, $repName, $repClass, $repFamily, $repStart, $repEnd, $repLeft, $id) = split(/\s+/);
#print "$bin\n";
if (exists $repName{$repName}){
$repName{$repName}++;
} else {
$repName{$repName} = 1;
}
if (exists $repClass{$repClass}){
$repClass{$repClass}++;
} else {
$repClass{$repClass} = 1;
}
if (exists $repFamily{$repFamily}){
$repFamily{$repFamily}++;
} else {
$repFamily{$repFamily} = 1;
}
}
close(IN);
#Change this into the corresponding hash for tallies of the repeat families and repeat classes
foreach my $key (sort {$repName{$b} <=> $repName{$a}} keys %repName){
print "$key\t$repName{$key}\n";
}
exit(0);
However, as I've been learning R, I will use that instead for providing summaries:
#the comment='$' is just to circumvent the
#first line of the file being ignored
hg19_repeat <- read.table(gzfile("hg19_rmsk.tsv.gz"),
header=T,
comment='$',
stringsAsFactors=F)
#check dimensions
dim(hg19_repeat)
#[1] 5298130 17
#check the first 6 lines
head(hg19_repeat)
# X.bin swScore milliDiv milliDel milliIns genoName genoStart genoEnd
#1 1 2147 87 3 17 chr1 16777160 16777470
#2 1 2626 28 0 0 chr1 25165800 25166089
#3 1 626 312 185 59 chr1 33553606 33554646
#4 1 12545 132 15 48 chr1 50330063 50332153
#5 1 8050 8 1 9 chr1 58720067 58720973
#6 2 10586 185 68 72 chr1 75496180 75498100
# genoLeft strand repName repClass repFamily repStart repEnd repLeft id
#1 -232473151 + AluSp SINE Alu 1 313 0 2
#2 -224084532 - AluY SINE Alu -22 289 1 4
#3 -215695975 + L2b LINE L2 2195 3356 -19 6
#4 -198918468 + L1PA10 LINE L1 15 2042 -4311 9
#5 -190529648 - L1PA2 LINE L1 -3 6152 5254 1
#6 -173752521 + L1MB7 LINE L1 4580 6518 -164 1
#I'm not sure what the id column is
#but I'm going to remove it and the bin column
hg19_repeat <- hg19_repeat[,-c(1,17)]
#check dimensions again
dim(hg19_repeat)
#[1] 5298130 15
#how many repeat families are there?
table(hg19_repeat$repFamily)
Alu CR1 DNA DNA? Deu
1194734 61303 2744 1881 1265
Dong-R4 ERV ERV1 ERVK ERVL
556 579 175937 10868 160346
ERVL-MaLR ERVL? Gypsy Gypsy? Helitron
347105 1854 10892 7869 1800
Helitron? L1 L1? L2 LTR
436 951780 84 466438 2206
LTR? Low_complexity MIR Merlin MuDR
122 371543 595094 57 1992
Other Penelope? PiggyBac PiggyBac? RNA
3733 51 2120 241 729
RTE RTE-BovB SINE SINE? Satellite
17874 655 962 425 6775
Simple_repeat TcMar TcMar-Mariner TcMar-Tc2 TcMar-Tigger
417913 1950 16348 8156 104026
TcMar? Unknown Unknown? acro centr
3424 7036 97 61 2325
hAT hAT-Blackjack hAT-Charlie hAT-Tip100 hAT?
12573 19755 254646 30669 3050
rRNA scRNA snRNA srpRNA tRNA
1769 1340 4386 1481 3670
telo
405
#store in another object
repeat_family <- table(hg19_repeat$repFamily)
#original margin settings
par()$mar
#[1] 5.1 4.1 4.1 2.1
par(mar=c(8.1, 6.1 ,4.1 ,2.1))
#postscript
postscript("barplot_repeat_family.eps")
barplot(repeat_family[order(repeat_family, decreasing=T)], las=2, ylim=c(0,1.2e6), cex.names=0.7)
dev.off()
barplot(repeat_family[order(repeat_family, decreasing=T)], las=2, ylim=c(0,1.4e6))
The most repetitive element in the hg19 genome are Alus, followed by L1s, MIRs and L2.
Summary of repeat classes:
#summary of repeat classes
repeat_class <- table(hg19_repeat$repClass)
repeat_class
DNA DNA? LINE LINE? Low_complexity LTR LTR? Other
461751 1881 1498690 51 371543 717656 122 3733
RC RNA rRNA Satellite scRNA Simple_repeat SINE SINE?
2236 729 1769 9566 1340 417913 1793723 425
snRNA srpRNA tRNA Unknown Unknown?
4386 1481 2002 7036 97
barplot(repeat_class[order(repeat_class, decreasing=T)], las=2, ylim=c(0,2e6))
The most numerous class of repeats in hg19 are SINEs, followed by LINEs and LTRs.
How many LINEs, SINEs and LTRs are there?
length(unique(hg19_repeat[hg19_repeat$repClass=="LINE",]$repName)) [1] 146 line <- table(hg19_repeat[hg19_repeat$repClass=="LINE",]$repName) head(line[order(line, decreasing=T)],n=8) L2a L2c L2b L1M5 L2 L3 L1ME4a L1ME1 171064 140653 97936 64374 56785 45015 43230 31734 length(unique(hg19_repeat[hg19_repeat$repClass=="SINE",]$repName)) [1] 49 sine <- table(hg19_repeat[hg19_repeat$repClass=="SINE",]$repName) head(sine[order(sine, decreasing=T)],n=8) MIRb MIR AluJb AluSx AluY AluSx1 MIRc AluSz 225155 175606 144945 144437 120617 110831 103492 98343 length(unique(hg19_repeat[hg19_repeat$repClass=="LTR",]$repName)) [1] 504 head(ltr[order(ltr, decreasing=T)],n=8) THE1B MLT1D MLT1A0 MLT1C MSTA MLT1K MLT1B MLT1J 22433 20741 20643 19824 19782 18173 18004 15270 #make some pie charts par(mfrow=c(1,2)) pie(line[order(line, decreasing=T)]) pie(sine[order(sine, decreasing=T)])
Not very pretty but at least we can see the abundance ranking of the repeats.
Length of LINEs, SINEs and LTRs in hg19
#calculate and store the lengths
hg19_repeat$length <- hg19_repeat$genoEnd - hg19_repeat$genoStart
#length of repeat class LINEs
summary(hg19_repeat[hg19_repeat$repClass=='LINE',]$length)
Min. 1st Qu. Median Mean 3rd Qu. Max.
11 115 221 426 448 8505
#length of repeat class SINEs
summary(hg19_repeat[hg19_repeat$repClass=='SINE',]$length)
Min. 1st Qu. Median Mean 3rd Qu. Max.
11.0 140.0 256.0 221.5 300.0 614.0
#length of repeat class LTRs
summary(hg19_repeat[hg19_repeat$repClass=='LTR',]$length)
Min. 1st Qu. Median Mean 3rd Qu. Max.
11.0 154.0 334.0 373.4 438.0 18040.0
line_length <- hg19_repeat[hg19_repeat$repClass=='LINE',]$length
sine_length <- hg19_repeat[hg19_repeat$repClass=='SINE',]$length
ltr_length <- hg19_repeat[hg19_repeat$repClass=='LTR',]$length
par(mfrow=c(1,3))
boxplot(line_length, main="LINE")
boxplot(sine_length, main="SINE")
boxplot(ltr_length, main="LTR")
SINEs are much more consistent in size than LINEs or LTRs.
How many full-length LINEs? Autonomous L1s are approximately 6,000 bp in size.
length(hg19_repeat[hg19_repeat$repClass=="LINE" & hg19_repeat$length>5999,]$repName)
[1] 7354
table(hg19_repeat[hg19_repeat$repClass=="LINE" & hg19_repeat$length>5999,]$repName)
L1HS L1M1 L1M2 L1M2a L1M2c L1MA1 L1MA2 L1MA3 L1MA4 L1MA4A L1MA5 L1MA5A
307 26 2 1 1 48 67 66 16 20 13 2
L1MA6 L1MA7 L1MA8 L1MA9 L1MB3 L1MC1 L1MC3 L1ME1 L1ME2 L1P2 L1P3 L1PA10
1 1 2 6 1 5 1 3 1 2 8 67
L1PA11 L1PA12 L1PA13 L1PA14 L1PA15 L1PA15-16 L1PA16 L1PA17 L1PA2 L1PA3 L1PA4 L1PA5
48 6 54 21 46 4 39 11 977 1367 1228 865
L1PA6 L1PA7 L1PA8 L1PA8A L1PB1 L1PB2 L1PB3 L1PB4 L1PBa L1PBa1 L1PBb L1PREC2
736 786 167 71 169 22 18 10 5 2 2 33
Total coverage
#total coverage in hg19
zcat hg19_rmsk.tsv.gz | grep -v "^#" | awk '{total+=$8-$7} END {print total}'
1467396988
#roughly 50% of the genome
bc -l<<<1467396988/3000000000*100
48.91323293333333333300
#total coverage in ce10
zcat ce10_rmsk.tsv.gz | grep -v "^#" | awk '{total+=$8-$7} END {print total}'
13827574
#roughly 14%
bc -l<<<13827574/100000000*100
13.82757400000000000000
25th March 2015: Summary based on my RepeatMasker run
I ran RepeatMasker on the hg19 genome using this command for all assembled chromosomes (chr1:22 and chr[MXY]):
#RepeatMasker version open-4.0.3 #-xsmall Returns repetitive regions in lowercase (rest capitals) rather than masked #-s Slow search; 0-5% more sensitive, 2-3 times slower than default #8 processors #RepBase Update 20130422 RepeatMasker -e crossmatch -species human -s -xsmall -pa 8 chr1.fa #An asterisk (*) in the final column indicates that there is a higher-scoring match whose domain partly (<80%) includes the domain of this match. cat *.out | grep -v "position in query" | grep -v "class/family" | grep -v "^$" | sed 's/^\s\+//' | sed 's/\s\+\*$//' | gzip > hg19_rmsk_cm_20130422.out.gz #slightly more than the UCSC Genome Browser numbers gunzip -c hg19_rmsk_cm_20130422.out.gz | wc -l 5313559 #concatenate the tbl files cat *.tbl | gzip > hg19_rmsk_cm_20130422.tbl.gz
A concatenation of the *.tbl files can be downloaded here and the concatenation of the *.out files can be downloaded here.
Summaries using R
data <- read.table("hg19_rmsk_cm_20130422.out.gz",
header=F,
stringsAsFactors = F)
dim(data)
[1] 5313559 15
#the repeat class and family
#are joined together
head(data)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15
1 1706 22.2 8.9 3.6 chr10 60964 61180 (135473567) C L1M4 LINE/L1 (2130) 4016 3779 1
2 1006 22.4 0.0 0.0 chr10 61181 61345 (135473402) C FRAM SINE/Alu (11) 165 1 2
3 1706 22.2 8.9 3.6 chr10 61346 61712 (135473035) C L1M4 LINE/L1 (2368) 3778 3398 1
4 21526 12.8 1.9 5.0 chr10 63293 67260 (135467487) + L1PA17 LINE/L1 1184 5132 (1750) 3
5 1366 12.9 0.0 0.0 chr10 67261 67438 (135467309) + AluSq2 SINE/Alu 136 313 (0) 4
6 21526 12.8 1.9 5.0 chr10 67439 69192 (135465555) + L1PA17 LINE/L1 5133 6880 (2) 3
#the following code splits the repeat
#class and family
my_subset_func <- function(x, which){
if (length(x)==2){
return(x[[which]])
} else {
return(x[[1]])
}
}
my_split <- strsplit(data[,11], split = "/")
df <- data[,1:10]
df$V11 <- sapply(my_split, my_subset_func, which=1)
df$V12 <- sapply(my_split, my_subset_func, which=2)
df <- cbind(df, data[,12:15])
dim(df)
[1] 5313559 16
names(df) <- c('score',
'div',
'del',
'ins',
'query',
'q_start',
'q_end',
'q_left',
'strand',
'repeat',
'class',
'family',
's_start',
's_end',
's_left',
'id')
head(df)
score div del ins query q_start q_end q_left strand repeat class family s_start s_end
1 1706 22.2 8.9 3.6 chr10 60964 61180 (135473567) C L1M4 LINE L1 (2130) 4016
2 1006 22.4 0.0 0.0 chr10 61181 61345 (135473402) C FRAM SINE Alu (11) 165
3 1706 22.2 8.9 3.6 chr10 61346 61712 (135473035) C L1M4 LINE L1 (2368) 3778
4 21526 12.8 1.9 5.0 chr10 63293 67260 (135467487) + L1PA17 LINE L1 1184 5132
5 1366 12.9 0.0 0.0 chr10 67261 67438 (135467309) + AluSq2 SINE Alu 136 313
6 21526 12.8 1.9 5.0 chr10 67439 69192 (135465555) + L1PA17 LINE L1 5133 6880
s_left id
1 3779 1
2 1 2
3 3398 1
4 (1750) 3
5 (0) 4
6 (2) 3
Let's calculate some numbers!
#what are the repeat classes?
names(table(df$class))
[1] "DNA" "DNA?" "LINE" "Low_complexity" "LTR"
[6] "LTR?" "RC" "RC?" "Retroposon" "RNA"
[11] "rRNA" "Satellite" "scRNA" "Simple_repeat" "SINE"
[16] "SINE?" "snRNA" "srpRNA" "tRNA" "Unknown"
#what are the repeat families?
names(table(df$family))
[1] "5S-Deu-L2" "acro" "Alu" "centr" "CR1"
[6] "DNA" "DNA?" "Dong-R4" "ERV1" "ERV1?"
[11] "ERVK" "ERVL" "ERVL-MaLR" "ERVL?" "Gypsy"
[16] "Gypsy?" "hAT" "hAT-Ac" "hAT-Blackjack" "hAT-Charlie"
[21] "hAT-Tag1" "hAT-Tip100" "hAT-Tip100?" "hAT?" "Helitron"
[26] "Helitron?" "L1" "L2" "Low_complexity" "LTR"
[31] "LTR?" "Merlin" "MIR" "MULE-MuDR" "Penelope"
[36] "PIF-Harbinger" "PiggyBac" "PiggyBac?" "RNA" "rRNA"
[41] "RTE-BovB" "RTE-X" "Satellite" "scRNA" "Simple_repeat"
[46] "SINE?" "snRNA" "srpRNA" "SVA" "TcMar"
[51] "TcMar-Mariner" "TcMar-Pogo" "TcMar-Tc2" "TcMar-Tigger" "TcMar?"
[56] "telo" "tRNA" "tRNA-Deu" "tRNA-RTE" "Unknown"
#how many LINEs?
nrow(unique(df[df$class=='LINE',c('query','class','id')]))
[1] 995183
#how many SINEs?
nrow(unique(df[df$class=='SINE',c('query','class','id')]))
[1] 1705701
#how many LTRs?
nrow(unique(df[df$class=='LTR',c('query','class','id')]))
[1] 511697

This work is licensed under a Creative Commons
Attribution 4.0 International License.
The number of repeats per class (and family) are not correct because RepeatMasker uses separate entries for multiple repeats. The ID line that you weren’t sure of, that identifies a single repeat. Therefore some lines will have the same ID b/c they are the same repeat. This is why you have 1.4 million LINES when the true number is 500,000.
Thank you Chris for the comment; I will fix up the post accordingly. I just found out that there’s a new visualisation option for the RepeatMasker track (http://genome.ucsc.edu/goldenPath/newsarch.html#032015), which must make use of the ID line for joining repeat fragments.
Hi again Chris,
I looked into this a bit more and it got a bit interesting. The RepeatMasker results downloaded from the UCSC Genome Browser for hg19 (which this post is based on) has only nine ID values in the five million or so lines.
gunzip -c hg19_rmsk.tsv.gz | awk '{print $17}' | sort -u 1 2 3 4 5 6 7 8 9 idIf I check out the database description of the ID column via the Table Browser tool, this is what it says:
“First digit of id field in RepeatMasker .out file. Best ignored.”
That must be why I wasn’t sure what the ID column was (according to my comment).
However, when I go to the .out files generated by RepeatMasker, you are of course correct. I’m not sure why the results stored on the UCSC Genome Browser database is as it is.
Cheers,
Dave
Hi Dave,
Interesting finding. I had never used the track from UCSC so I wasn’t aware of what happened with their ID line. Seems like an issue. I usually repeatmask the genomes myself for various reasons. I found your page while looking for a nice table of the repeat mask (tbl) file for the entire human genome classes, families, and subfamilies all In one table so I could make some graphics. Surprisingly hard to find!
I really like the analysis you’ve done here, especially using R. If you could get the corrected numbers I think you (or I) could really extend repeat characterization to a nice consistent way. For instance I’m trying to compare some next ten seq data repeat abundance back to reference genomes. Doing this in R would be simple, elegant, and scalable. Likewise with more species genomes coming online, it would be a nice start foe comparisons and visualizations.
I’ll try to put something together when I get back from a conference. Keep looking at it!
I recalculated the LINE count using a RepeatMasker run I performed on hg19 and by counting unique IDs; the number of LINEs is 995,183. It has decreased from ~1.5 million but it is roughly twice as much as your 500,000 number. The 995,183 number is the same if I tally the number of LINEs from all the *.tbl files. I have included a link in the post to all the concatenated *.out files generated by RepeatMasker.
Hi Dave,
This is very interesting. You’ve done some really useful work here. I believe your new count based on hg19 is actually correct. My 500,000 figure was based on what I had read in various papers which all seem to trace back to the initial sequencing of the human genome. In fact if you look on wikipedia article on “Retrotransposon” you will see the quote “The human genome, for example, contains about 500,000 LINEs, which is roughly 17% of the genome.[14]”. However the wikipedia article on “Long Interspersed nuclear element” in humans, you will see the figure of total LINES as 850,000 total, with 500,000 L1’s.
So I am strongly of the opinion that your latest count of 995,000 is actually the most accurate. (This was exactly the sort of info I was looking for btw.) I looks like the confusion came from 1) The early unfinished sequence of the human genome being updated over the last 14 years 2) a game of telephone, everyone citing old info and 3) Confusing the number of L1’s as the number of total LINEs since L1 is the most abundant and gets the most attention.
So! I think you’ve got it now. I am downloading your .out files and .tbl files. I realize this post was several years old but with the updates you’ve made it will be useful to me and many others I suspect. I’ll post again if I find anything else. Please link to any publications you may have that include transposon information so that I can cite you in my next publication.
http://en.wikipedia.org/wiki/Long_interspersed_nuclear_element#In_human
Hi Dave and Chris, I like your analysis too – Thanks Dave. I’d like to ask two things. 1. You updated this post again today (2015-03-25), so all the numbers in this post are came from the most up-to-date UCSC rmsk table, right? 2. Have you thought about the milliDiv, milliDel, and milliIns? I’m just starting to study about the Repeats and found that it seems like there is some levels of confidence for the aligned repeats across genome. So I am just wondering whether is it good to filtering out some low level of aligned repeats in my analysis. Hope to hear both of your thought. Cheers, Namshik
Hi Namshik,
thanks for the comment. Here are my replies:
1. The latest update refers to numbers calculated on my own RepeatMasker run on hg19 (see the last part of the post). I provided a link to download those files.
2. For now I haven’t thought about the alignment quality. What was good enough for a “match” by RepeatMasker was good enough for me.
Cheers,
Dave
Dear Dave,
First of all, thank you very much for all your scripts and ‘musings’!
I find I am regularly returning to your pages and find them very help and insightful.
As I was shamelessly using your R script, I detected a small omission. When calculating how many LINEs, SINEs and LTRs are present, there is actually no object made for the LTRs:
ltr <- table(hg19_repeat[hg19_repeat$repClass=="LTR",]$repName)
is missing.
Keep up the 'musings'!
with kind regards,
Aldo
Hi Aldo,
thanks for the comment and for including the missing line of code!
Cheers,
Dave