Updated 2015 February 8th to include some scatter plots of genome size versus repeat content.
I was writing about the makeup of genomes today and was looking up statistics on repetitive elements in vertebrate genomes. While I could find individual papers with repetitive element statistics for a particular genome, I was unable to find a summary across a list of vertebrate genomes (but to be honest I didn't look very hard). So I thought I'd make my own and share it on my blog and via figshare. I will use the RepeatMasker annotations provided via the UCSC genome browser.
The UCSC genome browser provides this list of vertebrate genomes:
| Human (hg38) | Hedgehog (eriEur2) | Platypus (ornAna1) |
| Alpaca (vicPac2) | Horse (equCab2) | Rabbit (oryCun2) |
| American alligator (allMis1) | Kangaroo rat (dipOrd1) | Rat (rn5) |
| Armadillo (dasNov3) | Lamprey (petMar2) | Rhesus (rheMac3) |
| Atlantic cod (gadMor1) | Lizard (anoCar2) | Rock hyrax (proCap1) |
| Baboon (papAnu2) | Manatee (triMan1) | Sheep (oviAri3) |
| Budgerigar (melUnd1) | Marmoset (calJac3) | Shrew (sorAra1) |
| Bushbaby (otoGar3) | Medaka (oryLat2) | Sloth (choHof1) |
| Cat (felCat5) | Medium ground finch (geoFor1) | Squirrel (speTri2) |
| Chicken (galGal4) | Megabat (pteVam1) | Squirrel monkey (saiBol1) |
| Chimpanzee (panTro4) | Microbat (myoLuc2) | Stickleback (gasAcu1) |
| Chinese hamster (criGri1) | Minke whale (balAcu1) | Tarsier (tarSyr1) |
| Coelacanth (latCha1) | Mouse (mm10) | Tasmanian devil (sarHar1) |
| Cow (bosTau7) | Mouse lemur (micMur1) | Tenrec (echTel2) |
| Dog (canFam3) | Naked mole-rat (hetGla2) | Tetraodon (tetNig2) |
| Dolphin (turTru2) | Nile tilapia (oreNil2) | Tree shrew (tupBel1) |
| Elephant (loxAfr3) | Opossum (monDom5) | Turkey (melGal1) |
| Elephant shark (calMil1) | Orangutan (ponAbe2) | Wallaby (macEug2) |
| Ferret (musFur1) | Painted turtle (chrPic1) | White rhinoceros (cerSim1) |
| Fugu (fr3) | Panda (ailMel1) | X. tropicalis (xenTro3) |
| Gibbon (nomLeu3) | Pig (susScr3) | Zebra finch (taeGut2) |
| Gorilla (gorGor3) | Pika (ochPri2) | Zebrafish (danRer7) |
| Guinea pig (cavPor3) |
If you're like me and wondering what a pika is, I found out today that it's a cute little mammal. Anyway, back to the topic: the RepeatMasker results for these genomes are accessible via the UCSC MySQL database. Below is a bash script that queries the database for the size of each vertebrate genome and the number of bases covered by repetitive elements in it, and saves the output of each query to its own file. I put two sleep commands in there just in case I'm making excessive queries.
#!/bin/bash
#loop through all available vertebrate genomes
for gen in hg38 eriEur2 ornAna1 vicPac2 equCab2 oryCun2 allMis1 dipOrd1 rn5 dasNov3 petMar2 rheMac3 gadMor1 anoCar2 proCap1 papAnu2 triMan1 oviAri3 melUnd1 calJac3 sorAra1 otoGar3 oryLat2 choHof1 felCat5 geoFor1 speTri2 galGal4 pteVam1 saiBol1 panTro4 myoLuc2 gasAcu1 criGri1 balAcu1 tarSyr1 latCha1 mm10 sarHar1 bosTau7 micMur1 echTel2 canFam3 hetGla2 tetNig2 turTru2 oreNil2 tupBel1 loxAfr3 monDom5 melGal1 calMil1 ponAbe2 macEug2 musFur1 chrPic1 cerSim1 fr3 ailMel1 xenTro3 nomLeu3 susScr3 taeGut2 gorGor3 ochPri2 danRer7 cavPor3;
do
   echo $gen
   #genome size
   mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select sum(size) from $gen.chromInfo;" > $gen.size
   sleep 5
   #bases covered by repetitive elements
   mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select sum(genoEnd-genoStart) from $gen.rmsk;" > $gen.rmsk
   sleep 5
done
If I run the above script, I get these error messages:
ERROR 1146 (42S02) at line 1: Table 'equCab2.rmsk' doesn't exist
ERROR 1146 (42S02) at line 1: Table 'gasAcu1.rmsk' doesn't exist
ERROR 1146 (42S02) at line 1: Table 'tetNig2.rmsk' doesn't exist
ERROR 1146 (42S02) at line 1: Table 'ponAbe2.rmsk' doesn't exist
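Since a failed query writes nothing to standard output, the corresponding rmsk file ends up empty, which gives another way to spot the affected genomes after the fact. Here is a minimal shell sketch, assuming the file naming used by the script above; the function name list_failed_queries is made up for illustration and the demo files are throwaway.

```shell
#!/bin/bash
# List genomes whose query output lacks a numeric result on line 2.
# Assumes files named <genome>.rmsk in the given directory, each holding
# the mysql header line followed by a single number.
list_failed_queries() {
    local dir=$1 f
    for f in "$dir"/*.rmsk; do
        [ -e "$f" ] || continue
        # A successful query leaves a number on the second line.
        if ! sed -n '2p' "$f" | grep -Eq '^[0-9]+$'; then
            basename "$f" .rmsk
        fi
    done
}

# Demo on throwaway files (the hg38 value is taken from the table in this post).
demo_dir=$(mktemp -d)
printf 'sum(genoEnd-genoStart)\n1588381100\n' > "$demo_dir/hg38.rmsk"
: > "$demo_dir/equCab2.rmsk"   # empty file, as left behind by a failed query
list_failed_queries "$demo_dir"
rm -r "$demo_dir"
```

The check also flags a file whose second line is NULL, which is what mysql prints when a sum has no rows to add up.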
It turns out that there is no RepeatMasker information for tetNig2, but for the other three genomes the RepeatMasker information is split across multiple per-chromosome tables. I was going to leave these three genomes out, but I thought someone might be interested in them, so in the end I included them by doing some extra work. Below is a bash script that lists all the tables, in the three respective genomes, that contain the RepeatMasker results.
#!/bin/bash
#no RepeatMasker information for this genome
#so delete these files
rm tetNig2.rmsk tetNig2.size
for gen in equCab2 gasAcu1 ponAbe2
do
   echo $gen
   mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "show tables in $gen like '%rmsk%';" > $gen.list
   sleep 5
done
Since the RepeatMasker results for equCab2, gasAcu1, and ponAbe2 are stored across multiple tables, we have to combine the per-table queries into a single query. The Perl script below does this with UNION ALL (a plain UNION drops duplicate rows, so two tables with identical sums would be counted only once), adds up the per-table sums, and creates one rmsk file for each of the three genomes.
#!/usr/bin/env perl
use strict;
use warnings;
my @base = qw/equCab2 gasAcu1 ponAbe2/;
foreach my $base (@base){
   open(IN,'<',"$base.list") || die "Could not open $base.list: $!\n";
   my $query = '';
   my $first = 1;
   while(<IN>){
      chomp;
      #skip the header line of the mysql output
      next if /^Table/;
      my $chr = $_;
      if ($first == 1){
         $query = "select sum(genoEnd-genoStart) from $base.$chr ";
         $first = 0;
      } else {
         $query .= "union all select sum(genoEnd-genoStart) from $base.$chr ";
      }
   }
   close(IN);
   #replace the trailing space with a semicolon
   $query =~ s/\s$/;/;
   print "$query\n";
   my $command = "mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e \"$query\" > $base.temp";
   system($command) == 0 || die "mysql query failed for $base\n";
   open(IN,'<',"$base.temp") || die "Could not open $base.temp:$!\n";
   #add up the per-table sums
   my $sum = 0;
   while(<IN>){
      chomp;
      next unless /^\d+/;
      $sum += $_;
   }
   close(IN);
   #write the total in the same format as the other rmsk files
   open(OUT,'>',"$base.rmsk") || die "Could not open $base.rmsk for writing: $!\n";
   print OUT "sum(genoEnd-genoStart)\n$sum\n";
   close(OUT);
   unlink("$base.temp");
   unlink("$base.list");
}
exit(0);
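For anyone who would rather stay in bash, the query-building and summing steps can be sketched roughly as follows. This is only an illustration under a few assumptions: the function names build_union_query and sum_numeric_lines are invented here, the table names in the demo (chr1_rmsk, chr2_rmsk) are placeholders, and the sketch uses UNION ALL so that identical per-table sums are each kept.

```shell
#!/bin/bash
# Build one "union all" query from a list of table names read from stdin.
# The first argument is the database name; the mysql header line (which
# starts with "Table") and blank lines are skipped.
build_union_query() {
    local db=$1
    awk -v db="$db" '
        /^Table/ || NF == 0 { next }
        {
            printf "%sselect sum(genoEnd-genoStart) from %s.%s", sep, db, $1
            sep = " union all "
        }
        END { if (sep != "") print ";" }
    '
}

# Add up the purely numeric lines of a query result read from stdin.
sum_numeric_lines() {
    awk '/^[0-9]+$/ { total += $1 } END { printf "%.0f\n", total + 0 }'
}

# Demo with placeholder table names; nothing is sent to the database.
printf 'Tables_in_equCab2 (%%rmsk%%)\nchr1_rmsk\nchr2_rmsk\n' | build_union_query equCab2
```

The resulting query string could then be passed to mysql with -e, and the output piped through sum_numeric_lines to get the total for the genome.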
Now that I have all the genome size and RepeatMasker coverage files, I can make my plot in R.
#list of genomes, taken from the genome size file names
my_list <- sub("\\.size$", "", list.files(pattern="\\.size$"))
#array to store the repeat coverage of each genome
my_rep_cov <- array()
my_index <- 1
#loop through the files and store the repeat coverage
for (my_file in list.files(pattern="\\.rmsk$")){
   my_data <- read.table(file=my_file,
                         skip=1,
                         header=F)
   my_rep_cov[my_index] <- my_data$V1
   my_index <- my_index + 1
}
#sanity check; 28 is the human genome
my_list[28]
my_rep_cov[28]
#array to store the genome sizes
my_gen_size <- array()
my_index <- 1
#loop through the files and store the genome size
for (my_file in list.files(pattern="\\.size$")){
   my_data <- read.table(file=my_file,
                         skip=1,
                         header=F)
   my_gen_size[my_index] <- my_data$V1
   my_index <- my_index + 1
}
#sanity check
my_gen_size[28]
my_percent <- round(my_rep_cov*100/my_gen_size, 2)
#sanity check
my_percent[28]
length(my_percent)
length(my_gen_size)
length(my_rep_cov)
#order the percentages
my_ordered_percent <- my_percent[order(my_percent)]
my_ordered_list <- my_list[order(my_percent)]
#create lookup to rename the genome abbreviations
my_lookup <- data.frame(short=c('hg38','eriEur2','ornAna1','vicPac2','equCab2','oryCun2','allMis1','dipOrd1','rn5','dasNov3','petMar2','rheMac3','gadMor1','anoCar2','proCap1','papAnu2','triMan1','oviAri3','melUnd1','calJac3','sorAra1','otoGar3','oryLat2','choHof1','felCat5','geoFor1','speTri2','galGal4','pteVam1','saiBol1','panTro4','myoLuc2','gasAcu1','criGri1','balAcu1','tarSyr1','latCha1','mm10','sarHar1','bosTau7','micMur1','echTel2','canFam3','hetGla2','tetNig2','turTru2','oreNil2','tupBel1','loxAfr3','monDom5','melGal1','calMil1','ponAbe2','macEug2','musFur1','chrPic1','cerSim1','fr3','ailMel1','xenTro3','nomLeu3','susScr3','taeGut2','gorGor3','ochPri2','danRer7','cavPor3'),
long=c('Human','Hedgehog','Platypus','Alpaca','Horse','Rabbit','American alligator','Kangaroo rat','Rat','Armadillo','Lamprey','Rhesus','Atlantic cod','Lizard','Rock hyrax','Baboon','Manatee','Sheep','Budgerigar','Marmoset','Shrew','Bushbaby','Medaka','Sloth','Cat','Medium ground finch','Squirrel','Chicken','Megabat','Squirrel monkey','Chimpanzee','Microbat','Stickleback','Chinese hamster','Minke whale','Tarsier','Coelacanth','Mouse','Tasmanian devil','Cow','Mouse lemur','Tenrec','Dog','Naked mole-rat','Tetraodon','Dolphin','Nile tilapia','Tree shrew','Elephant','Opossum','Turkey','Elephant shark','Orangutan','Wallaby','Ferret','Painted turtle','White rhinoceros','Fugu','Panda','X. tropicalis','Gibbon','Pig','Zebra finch','Gorilla','Pika','Zebrafish','Guinea pig'))
#match the index of the ordered list
#to the lookup index
my_long_index <- match(my_ordered_list, my_lookup$short)
#colour human genome in red
my_colour <- rep(1, length(my_percent))
my_colour[grep('hg38', my_ordered_list)] <- 2
#find out default margins
par()$mar
#[1] 5.1 4.1 4.1 2.1
#readjust margin to fit names
par(mar=c(10.1, 4.1, 4.1, 2.1))
#make the plot
barplot(my_ordered_percent,
names.arg=my_lookup$long[my_long_index],
las=2,
ylim=c(0,60),
col=my_colour)
At 49.49%, a high percentage of the human genome is made up of repetitive elements compared to other vertebrate genomes, but it is not the highest; that distinction belongs to the opossum (54.48%). The zebrafish has the second highest percentage of repetitive element coverage (54.17%). Link to figshare.
How about the plot of genome sizes ordered by lowest to highest percentage of repetitive elements in genomes?
#plot genome sizes ordered by percent repeat
barplot(my_gen_size[order(my_percent)],
names.arg=my_lookup$long[my_long_index],
las=2,
ylim=c(0,4e9),
col=my_colour)
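Several of the larger genomes, such as the pika (about 3.4 Gb, 6.37% repeats) and the tree shrew (about 3.7 Gb, 11.94% repeats), have a lower percentage of repetitive elements compared to other vertebrate genomes. The zebrafish genome (about 1.4 Gb) is much smaller than most of the mammalian genomes but has a very high percentage of repeats. Link to figshare.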
The results are only as good as the RepeatMasker annotations. Lastly, here's an updated figure with both plots together:
It's also on figshare.
As a table
my_table <- data.frame(percent=my_ordered_percent,
repeat_cov=my_rep_cov[order(my_percent)],
genome_size=my_gen_size[order(my_percent)],
organism=my_lookup$long[my_long_index])
percent repeat_cov genome_size organism
1 2.67 23221380 869000216 Medaka
2 2.71 70249245 2589745704 Painted turtle
3 3.26 30211713 927696114 Nile tilapia
4 3.59 16614804 463354448 Stickleback
5 5.65 60219852 1065292181 Medium ground finch
6 5.81 61724058 1061817101 Turkey
7 6.37 219383838 3445784354 Pika
8 7.07 202211884 2860591921 Coelacanth
9 7.20 80477988 1117373619 Budgerigar
10 7.33 28713997 391484715 Fugu
11 7.61 136870406 1799143587 Lizard
12 7.96 98040964 1232135591 Zebra finch
13 8.05 66379637 824327835 Atlantic cod
14 10.45 92520928 885550958 Lamprey
15 10.70 111973172 1046932099 Chicken
16 11.94 437010960 3660774957 Tree shrew
17 14.38 422354596 2936119008 Shrew
18 20.54 443277796 2158502098 Kangaroo rat
19 22.72 343489898 1511735326 Western clawed frog
20 22.74 678854084 2985258999 Rock hyrax
21 23.89 693487778 2902270736 Mouse lemur
22 25.07 738735062 2947024286 Tenrec
23 26.53 626144855 2360146428 Chinese hamster
24 27.02 735759102 2723219641 Guinea pig
25 27.15 264527669 974498586 Elephant shark
26 28.68 750840017 2618204639 Naked mole-rat
27 29.32 585214446 1996076410 Megabat
28 30.02 738218566 2458927620 Sloth
29 32.46 1007415247 3103808406 Manatee
30 32.61 663489175 2034575300 Microbat
31 33.98 738128838 2172177994 Alpaca
32 34.13 845852778 2478393770 Squirrel
33 36.46 1159539376 3179905132 Tarsier
34 36.90 889642964 2410758013 Ferret
35 37.08 934359053 2519724550 Bushbaby
36 37.33 919930340 2464367180 White rhinoceros
37 37.76 821021279 2174259888 American alligator
38 38.17 877655715 2299509015 Panda
39 38.30 1114376842 2909698938 Rat
40 39.31 955914303 2431687698 Minke whale
41 39.33 1209588292 3075184024 Wallaby
42 39.67 1259498115 3174693010 Tasmanian devil
43 39.79 1117628138 2808525991 Pig
44 39.93 1084521899 2715720925 Hedgehog
45 40.01 1020984747 2551996573 Dolphin
46 40.11 996450835 2484532062 Horse
47 41.68 1141006794 2737490501 Rabbit
48 41.73 1024603840 2455541136 Cat
49 42.69 1029296436 2410976875 Dog
50 43.51 1292241392 2969988180 Rhesus
51 44.01 1201953154 2730871774 Mouse
52 44.34 1161244502 2619054388 Sheep
53 44.42 1158792969 2608572064 Squirrel monkey
54 44.44 887441764 1996811212 Platypus
55 45.10 1637875527 3631522711 Armadillo
56 45.10 1492574315 3309577922 Chimpanzee
57 45.11 1315072022 2914958544 Marmoset
58 45.73 1576270319 3446771396 Orangutan
59 45.74 1385639740 3029553646 Gorilla
60 46.65 1491170703 3196760833 Elephant
61 46.83 1395986914 2981119579 Cow
62 47.66 1411740930 2962077449 Gibbon
63 49.49 1588381100 3209286105 Human
64 51.47 1517588085 2948380710 Baboon
65 54.17 765103854 1412464843 Zebrafish
66 54.48 1964494051 3605631728 Opossum
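Any row of the table can be re-derived by hand, since the percent column is just repeat_cov times 100 divided by genome_size, rounded to two decimals. Here is a quick shell sketch of that arithmetic; the function name percent_repeat is made up for illustration, and the two calls use the human and medaka rows from the table above.

```shell
#!/bin/bash
# Percentage of a genome covered by repeats, rounded to two decimals.
# Arguments: repeat-covered bases, then genome size in bases.
percent_repeat() {
    local repeat_bp=$1 genome_bp=$2
    awk -v r="$repeat_bp" -v g="$genome_bp" 'BEGIN { printf "%.2f\n", r * 100 / g }'
}

percent_repeat 1588381100 3209286105   # human row of the table
percent_repeat 23221380 869000216      # medaka row of the table
```

The two calls should reproduce the 49.49 and 2.67 shown in the percent column.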
Some plots of genome size versus the repeat content
#plot genome size to repeat coverage
plot(my_table$genome_size,
my_table$repeat_cov,
xlab='Genome size (bp)',
ylab='Repetitive coverage (bp)',
main='Genome size versus repeat coverage (bp)',
pch=19)
my_pearson <- paste("Pearson correlation: ", round(cor(my_table$genome_size, my_table$repeat_cov),3))
text(x = 5e8, y = 1.9e9, cex = 1.2, adj=0, my_pearson)
my_spearman <- paste("Spearman correlation: ", round(cor(my_table$genome_size, my_table$repeat_cov,method="spearman"),3))
text(x =5e8, y = 1.75e9, cex = 1.2, adj=0, my_spearman)
#plot genome size to repeat percent
plot(my_table$genome_size,
my_table$percent,
xlab='Genome size (bp)',
ylab='Repetitive coverage (%)',
main='Genome size versus repeat coverage (%)',
pch=19)
my_pearson <- paste("Pearson correlation: ", round(cor(my_table$genome_size, my_table$percent),3))
text(x = 5e8, y = 50, cex = 1.2, adj=0, my_pearson)
my_spearman <- paste("Spearman correlation: ", round(cor(my_table$genome_size, my_table$percent,method="spearman"),3))
text(x =5e8, y = 46, cex = 1.2, adj=0, my_spearman)
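For reference, the Pearson coefficient that cor() returns by default is the covariance of the two variables divided by the product of their standard deviations. Below is a rough shell sketch of that formula for two columns of numbers; the function name pearson is invented here, and the textbook sum-of-products form it uses is not numerically robust, so treat it as an illustration rather than a replacement for cor(). The demo feeds in only three rows from the table (genome size and repeat coverage for medaka, human, and opossum), so its result is not the correlation of the full data set.

```shell
#!/bin/bash
# Pearson correlation of two whitespace-separated columns read from stdin.
# Prints NA when there are fewer than two rows or a column has no variance.
pearson() {
    awk '
        NF >= 2 {
            n++
            sx += $1; sy += $2
            sxx += $1 * $1; syy += $2 * $2; sxy += $1 * $2
        }
        END {
            if (n < 2) { print "NA"; exit }
            num = n * sxy - sx * sy
            den = sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
            if (den == 0) { print "NA"; exit }
            printf "%.3f\n", num / den
        }
    '
}

# Three rows from the table: genome_size, repeat_cov
printf '%s\n' \
    '869000216 23221380' \
    '3209286105 1588381100' \
    '3605631728 1964494051' | pearson
```

The Spearman coefficient is the same calculation applied to the ranks of the values instead of the values themselves.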
Coloured by phylogenetic distance
Download the phylogenetic tree from the UCSC Genome Browser.
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phastCons100way/hg19.100way.phastCons.mod
cat hg19.100way.phastCons.mod | grep ^TREE | perl -ne 'while(/[(,]+(.*?):/){ $s = $1; s///; print "\"$s\", "} END{print"\n"};'
"hg19", "panTro4", "gorGor3", "ponAbe2", "nomLeu3", "rheMac3", "macFas5", "papHam1", "chlSab1", "calJac3", "saiBol1", "otoGar3", "tupChi1", "speTri2", "jacJac1", "micOch1", "criGri1", "mesAur1", "mm10", "rn5", "hetGla2", "cavPor3", "chiLan1", "octDeg1", "oryCun2", "ochPri3", "susScr3", "vicPac2", "camFer1", "turTru2", "orcOrc1", "panHod1", "bosTau7", "oviAri3", "capHir1", "equCab2", "cerSim1", "felCat5", "canFam3", "musFur1", "ailMel1", "odoRosDiv1", "lepWed1", "pteAle1", "pteVam1", "myoDav1", "myoLuc2", "eptFus1", "eriEur2", "sorAra2", "conCri1", "loxAfr3", "eleEdw1", "triMan1", "chrAsi1", "echTel2", "oryAfe1", "dasNov3", "monDom5", "sarHar1", "macEug2", "ornAna1", "falChe1", "falPer1", "ficAlb2", "zonAlb1", "geoFor1", "taeGut2", "pseHum1", "melUnd1", "amaVit1", "araMac1", "colLiv1", "anaPla1", "galGal4", "melGal1", "allMis1", "cheMyd1", "chrPic1", "pelSin1", "apaSpi1", "anoCar2", "xenTro7", "latCha1", "tetNig2", "fr3", "takFla1", "oreNil2", "neoBri1", "hapBur1", "mayZeb1", "punNye1", "oryLat2", "xipMac1", "gasAcu1", "gadMor1", "danRer7", "astMex1", "lepOcu1", "petMar2",
Work in progress...
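One thing to sort out before colouring by phylogeny is that not every assembly analysed above appears in the 100-way tree under the same name; for example, the tree lists hg19 and ochPri3 whereas the analysis used hg38 and ochPri2. A small shell sketch for finding such mismatches, assuming two plain-text files with one assembly name per line (the function name missing_from_tree and the demo files are hypothetical):

```shell
#!/bin/bash
# Print assemblies from the first file that are absent from the second.
# Both files hold one assembly name per line; matching is exact and
# whole-line (-F fixed strings, -x whole line, -v invert the match).
missing_from_tree() {
    local analysed=$1 tree=$2
    grep -Fxv -f "$tree" "$analysed"
}

# Demo with a handful of names taken from the lists in this post.
analysed=$(mktemp)
tree=$(mktemp)
printf '%s\n' hg38 ochPri2 mm10 danRer7 > "$analysed"
printf '%s\n' hg19 ochPri3 mm10 danRer7 > "$tree"
missing_from_tree "$analysed" "$tree"
rm -f "$analysed" "$tree"
```

The assemblies it reports would need to be mapped to their counterparts in the tree by hand, or left uncoloured.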
Word of caution
See the Twitter thread:
@davetang31 @DanGraur (2) multiple issues can arise as a result of masking one species with a library from another species [3/3]
— Cedric Feschotte (@CedricFeschotte) February 7, 2015

This work is licensed under a Creative Commons Attribution 4.0 International License.


It would be interesting to plot the repetition percentages with each mammal’s respective rate of dying from cancer. Though I doubt you would be able to find good data sources and would probably have to just perform the study yourself.