Updated 2015 February 8th to include some scatter plots of genome size versus repeat content.
I was writing about the makeup of genomes today and was looking up statistics on repetitive elements in vertebrate genomes. While I could find individual papers with repetitive element statistics for a particular genome, I was unable to find a summary across a list of vertebrate genomes (but to be honest I didn't look very hard). So I decided to make my own and share it on my blog and via figshare, using the RepeatMasker annotations provided via the UCSC Genome Browser.
The UCSC genome browser provides this list of vertebrate genomes:
Human (hg38) | Hedgehog (eriEur2) | Platypus (ornAna1) |
Alpaca (vicPac2) | Horse (equCab2) | Rabbit (oryCun2) |
American alligator (allMis1) | Kangaroo rat (dipOrd1) | Rat (rn5) |
Armadillo (dasNov3) | Lamprey (petMar2) | Rhesus (rheMac3) |
Atlantic cod (gadMor1) | Lizard (anoCar2) | Rock hyrax (proCap1) |
Baboon (papAnu2) | Manatee (triMan1) | Sheep (oviAri3) |
Budgerigar (melUnd1) | Marmoset (calJac3) | Shrew (sorAra1) |
Bushbaby (otoGar3) | Medaka (oryLat2) | Sloth (choHof1) |
Cat (felCat5) | Medium ground finch (geoFor1) | Squirrel (speTri2) |
Chicken (galGal4) | Megabat (pteVam1) | Squirrel monkey (saiBol1) |
Chimpanzee (panTro4) | Microbat (myoLuc2) | Stickleback (gasAcu1) |
Chinese hamster (criGri1) | Minke whale (balAcu1) | Tarsier (tarSyr1) |
Coelacanth (latCha1) | Mouse (mm10) | Tasmanian devil (sarHar1) |
Cow (bosTau7) | Mouse lemur (micMur1) | Tenrec (echTel2) |
Dog (canFam3) | Naked mole-rat (hetGla2) | Tetraodon (tetNig2) |
Dolphin (turTru2) | Nile tilapia (oreNil2) | Tree shrew (tupBel1) |
Elephant (loxAfr3) | Opossum (monDom5) | Turkey (melGal1) |
Elephant shark (calMil1) | Orangutan (ponAbe2) | Wallaby (macEug2) |
Ferret (musFur1) | Painted turtle (chrPic1) | White rhinoceros (cerSim1) |
Fugu (fr3) | Panda (ailMel1) | X. tropicalis (xenTro3) |
Gibbon (nomLeu3) | Pig (susScr3) | Zebra finch (taeGut2) |
Gorilla (gorGor3) | Pika (ochPri2) | Zebrafish (danRer7) |
Guinea pig (cavPor3) |
If you're like me and wondering what a pika is, I found out today that it's this cute little mammal. Anyway back to the topic, the RepeatMasker results for these genomes are accessible via their MySQL database, so below is a bash script that queries the MySQL database and obtains the vertebrate genome sizes and the coverage of repetitive elements in the respective genomes and saves the output. I put two sleep commands there just in case I'm making excessive queries.
#!/bin/bash

#loop through all available vertebrate genomes
for gen in hg38 eriEur2 ornAna1 vicPac2 equCab2 oryCun2 allMis1 dipOrd1 rn5 dasNov3 \
           petMar2 rheMac3 gadMor1 anoCar2 proCap1 papAnu2 triMan1 oviAri3 melUnd1 calJac3 \
           sorAra1 otoGar3 oryLat2 choHof1 felCat5 geoFor1 speTri2 galGal4 pteVam1 saiBol1 \
           panTro4 myoLuc2 gasAcu1 criGri1 balAcu1 tarSyr1 latCha1 mm10 sarHar1 bosTau7 \
           micMur1 echTel2 canFam3 hetGla2 tetNig2 turTru2 oreNil2 tupBel1 loxAfr3 monDom5 \
           melGal1 calMil1 ponAbe2 macEug2 musFur1 chrPic1 cerSim1 fr3 ailMel1 xenTro3 \
           nomLeu3 susScr3 taeGut2 gorGor3 ochPri2 danRer7 cavPor3
do
   echo $gen
   mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select sum(size) from $gen.chromInfo;" > $gen.size
   sleep 5
   mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select sum(genoEnd-genoStart) from $gen.rmsk;" > $gen.rmsk
   sleep 5
done
If I run the above script, I get these error messages:
ERROR 1146 (42S02) at line 1: Table 'equCab2.rmsk' doesn't exist
ERROR 1146 (42S02) at line 1: Table 'gasAcu1.rmsk' doesn't exist
ERROR 1146 (42S02) at line 1: Table 'tetNig2.rmsk' doesn't exist
ERROR 1146 (42S02) at line 1: Table 'ponAbe2.rmsk' doesn't exist
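These failures are easy to miss in a long run: the shell creates each output file before mysql starts, so a failed query leaves behind an empty .rmsk file rather than no file at all. Here is a minimal sketch for spotting them, assuming all the output files sit in one directory (the function name list_failed is made up for illustration):

```shell
#!/bin/bash
# List genomes whose RepeatMasker query produced no output.
# An empty .rmsk file marks a query that failed.
list_failed() {
    local dir=${1:-.}
    local f
    for f in "$dir"/*.rmsk; do
        [ -e "$f" ] || continue      # the glob matched nothing
        if [ ! -s "$f" ]; then       # -s is true only for non-empty files
            basename "$f" .rmsk
        fi
    done
}

list_failed .
```

Run from the directory holding the output, it prints one genome name per line.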
It turns out that there is no RepeatMasker information for tetNig2, but for the other three genomes the RepeatMasker information is stored in multiple tables, split by chromosome. I was going to leave these three genomes out but I thought someone might be interested in them, so in the end I included them by doing some extra work. Below is a bash script that lists all the tables that contain RepeatMasker results in each of the three genomes.
#!/bin/bash

#no RepeatMasker information for this genome
#so delete these files
rm tetNig2.rmsk tetNig2.size

for gen in equCab2 gasAcu1 ponAbe2
do
   echo $gen
   mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "show tables in $gen like '%rmsk%';" > $gen.list
   sleep 5
done
Since the RepeatMasker results for equCab2, gasAcu1, and ponAbe2 are stored across multiple tables, we have to combine one query per table. The Perl script below joins them with UNION ALL rather than plain UNION, because plain UNION discards duplicate rows and would silently drop a chromosome whose total happened to equal another's. It then adds up the per-table sums and creates 3 rmsk files, one for each of the 3 genomes.
#!/usr/bin/env perl

use strict;
use warnings;

my @base = qw/equCab2 gasAcu1 ponAbe2/;

foreach my $base (@base){
   open(IN,'<',"$base.list") || die "Could not open $base.list: $!\n";
   my $query = '';
   my $first = 1;
   while(<IN>){
      chomp;
      #skip the column header printed by mysql
      next if /^Table/;
      my $chr = $_;
      if ($first == 1){
         $query = "select sum(genoEnd-genoStart) from $base.$chr ";
         $first = 0;
      } else {
         #union all keeps every row, including identical sums
         $query .= "union all select sum(genoEnd-genoStart) from $base.$chr ";
      }
   }
   close(IN);
   $query =~ s/\s$/;/;
   print "$query\n";

   my $command = "mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e \"$query\" > $base.temp";
   system($command);

   #add up the per-table sums
   open(IN,'<',"$base.temp") || die "Could not open $base.temp:$!\n";
   my $sum = 0;
   while(<IN>){
      chomp;
      next unless /^\d+/;
      $sum += $_;
   }
   close(IN);

   open(OUT,'>',"$base.rmsk") || die "Could not open $base.rmsk for writing: $!\n";
   print OUT "sum(genoEnd-genoStart)\n$sum\n";
   close(OUT);

   unlink("$base.temp");
   unlink("$base.list");
}

exit(0);
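The query-building step can also be tried on its own, without contacting the server. The sketch below is a shell equivalent of that step only, not the script I ran. It assumes the table list is in the format written by "show tables" (a header line starting with Tables_in, then one table name per line), and the function name build_query and the chr1_rmsk and chr2_rmsk table names in the demo are hypothetical:

```shell
#!/bin/bash
# Build one query that sums repeat coverage over per-chromosome tables.
# Table names are read from stdin, one per line.
build_query() {
    local db=$1
    awk -v db="$db" '
        /^Tables_in/ { next }    # column header printed by mysql
        NF == 0      { next }    # blank lines
        {
            # union all keeps every row, even if two tables give the same sum
            printf "%sselect sum(genoEnd-genoStart) from %s.%s", sep, db, $1
            sep = " union all "
        }
        END { if (sep != "") print ";" }
    '
}

# Demo with two made-up table names
printf 'Tables_in_equCab2 (%%rmsk%%)\nchr1_rmsk\nchr2_rmsk\n' | build_query equCab2
```

With a real list the call would be build_query equCab2 < equCab2.list, and the printed query can be passed to mysql with -e.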
Now that I have all the genome size and RepeatMasker coverage files, I can make my plot in R.
#list of genome names, taken from the genome size files
my_list <- sub("\\.size$", '', list.files(pattern="\\.size$"))

#array to store the repeat coverage
my_rep_cov <- array()
my_index <- 1

#loop through the files and store the repeat coverage
for (my_file in list.files(pattern="\\.rmsk$")){
   my_data <- read.table(file=my_file, skip=1, header=FALSE)
   my_rep_cov[my_index] <- my_data$V1
   my_index <- my_index + 1
}

#sanity check; 28 is the human genome
my_list[28]
my_rep_cov[28]

#array to store the genome sizes
my_gen_size <- array()
my_index <- 1

#loop through the files and store the genome size
for (my_file in list.files(pattern="\\.size$")){
   my_data <- read.table(file=my_file, skip=1, header=FALSE)
   my_gen_size[my_index] <- my_data$V1
   my_index <- my_index + 1
}

#sanity check
my_gen_size[28]

#percentage of each genome covered by repeats
my_percent <- round(my_rep_cov*100/my_gen_size, 2)

#sanity check
my_percent[28]
length(my_percent)
length(my_gen_size)
length(my_rep_cov)

#order the percentages
my_ordered_percent <- my_percent[order(my_percent)]
my_ordered_list <- my_list[order(my_percent)]

#create lookup to rename the genome abbreviations
my_lookup <- data.frame(
   short=c('hg38','eriEur2','ornAna1','vicPac2','equCab2','oryCun2','allMis1','dipOrd1','rn5','dasNov3',
           'petMar2','rheMac3','gadMor1','anoCar2','proCap1','papAnu2','triMan1','oviAri3','melUnd1','calJac3',
           'sorAra1','otoGar3','oryLat2','choHof1','felCat5','geoFor1','speTri2','galGal4','pteVam1','saiBol1',
           'panTro4','myoLuc2','gasAcu1','criGri1','balAcu1','tarSyr1','latCha1','mm10','sarHar1','bosTau7',
           'micMur1','echTel2','canFam3','hetGla2','tetNig2','turTru2','oreNil2','tupBel1','loxAfr3','monDom5',
           'melGal1','calMil1','ponAbe2','macEug2','musFur1','chrPic1','cerSim1','fr3','ailMel1','xenTro3',
           'nomLeu3','susScr3','taeGut2','gorGor3','ochPri2','danRer7','cavPor3'),
   long=c('Human','Hedgehog','Platypus','Alpaca','Horse','Rabbit','American alligator','Kangaroo rat','Rat','Armadillo',
          'Lamprey','Rhesus','Atlantic cod','Lizard','Rock hyrax','Baboon','Manatee','Sheep','Budgerigar','Marmoset',
          'Shrew','Bushbaby','Medaka','Sloth','Cat','Medium ground finch','Squirrel','Chicken','Megabat','Squirrel monkey',
          'Chimpanzee','Microbat','Stickleback','Chinese hamster','Minke whale','Tarsier','Coelacanth','Mouse','Tasmanian devil','Cow',
          'Mouse lemur','Tenrec','Dog','Naked mole-rat','Tetraodon','Dolphin','Nile tilapia','Tree shrew','Elephant','Opossum',
          'Turkey','Elephant shark','Orangutan','Wallaby','Ferret','Painted turtle','White rhinoceros','Fugu','Panda','X. tropicalis',
          'Gibbon','Pig','Zebra finch','Gorilla','Pika','Zebrafish','Guinea pig'))

#match the index of the ordered list
#to the lookup index
my_long_index <- match(my_ordered_list, my_lookup$short)

#colour human genome in red
my_colour <- rep(1, length(my_percent))
my_colour[grep('hg38', my_ordered_list)] <- 2

#find out default margins
par()$mar
#[1] 5.1 4.1 4.1 2.1

#readjust margin to fit names
par(mar=c(10.1, 4.1, 4.1, 2.1))

#make the plot
barplot(my_ordered_percent, names.arg=my_lookup$long[my_long_index], las=2, ylim=c(0,60), col=my_colour)
About half of the human genome (49.49%) is made up of repetitive elements, which is high compared to most other vertebrate genomes, but humans are not the highest; that distinction belongs to the opossum (54.48%). The zebrafish has the second highest percentage of repetitive element coverage (54.17%), followed by the baboon (51.47%). Link to figshare.
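The percentage is simply the repeat coverage divided by the genome size. To check a single genome without starting R, something like the following shell sketch should do. It assumes the two-line layout (header, then number) that the mysql queries above write, and the function name repeat_percent is my own invention:

```shell
#!/bin/bash
# Percentage of a genome covered by repeats, rounded to two decimals.
# Each input file holds a header line followed by a single number.
repeat_percent() {
    local size_file=$1 rmsk_file=$2
    LC_ALL=C awk '
        FNR == 2 { v[++n] = $1 }    # second line of each file is the value
        END { if (n == 2 && v[1] > 0) printf "%.2f\n", 100 * v[2] / v[1] }
    ' "$size_file" "$rmsk_file"
}

# Report every genome in the current directory that has both files
for size_file in *.size; do
    [ -e "$size_file" ] || continue
    gen=${size_file%.size}
    [ -s "$gen.rmsk" ] || continue
    printf '%s\t%s\n' "$gen" "$(repeat_percent "$size_file" "$gen.rmsk")"
done
```

For the human values (1588381100 repeat bases out of 3209286105) this gives 49.49, matching the R result.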
How about the plot of genome sizes ordered by lowest to highest percentage of repetitive elements in genomes?
#plot genome sizes ordered by percent repeat
barplot(my_gen_size[order(my_percent)], names.arg=my_lookup$long[my_long_index], las=2, ylim=c(0,4e9), col=my_colour)
Several of the larger genomes, such as the pika (3.45 Gb, 6.37% repeats) and the tree shrew (3.66 Gb, 11.94% repeats), have a lower percentage of repetitive elements compared to other vertebrate genomes. The zebrafish genome (1.41 Gb) is much smaller than the mammalian genomes but has a very high percentage of repeats. Link to figshare.
The results are only as good as the RepeatMasker annotations. Lastly, here's an updated figure with both plots together:
It's also on figshare.
As a table
my_table <- data.frame(percent=my_ordered_percent,
                       repeat_cov=my_rep_cov[order(my_percent)],
                       genome_size=my_gen_size[order(my_percent)],
                       organism=my_lookup$long[my_long_index])

   percent repeat_cov genome_size organism
 1    2.67   23221380   869000216 Medaka
 2    2.71   70249245  2589745704 Painted turtle
 3    3.26   30211713   927696114 Nile tilapia
 4    3.59   16614804   463354448 Stickleback
 5    5.65   60219852  1065292181 Medium ground finch
 6    5.81   61724058  1061817101 Turkey
 7    6.37  219383838  3445784354 Pika
 8    7.07  202211884  2860591921 Coelacanth
 9    7.20   80477988  1117373619 Budgerigar
10    7.33   28713997   391484715 Fugu
11    7.61  136870406  1799143587 Lizard
12    7.96   98040964  1232135591 Zebra finch
13    8.05   66379637   824327835 Atlantic cod
14   10.45   92520928   885550958 Lamprey
15   10.70  111973172  1046932099 Chicken
16   11.94  437010960  3660774957 Tree shrew
17   14.38  422354596  2936119008 Shrew
18   20.54  443277796  2158502098 Kangaroo rat
19   22.72  343489898  1511735326 Western clawed frog
20   22.74  678854084  2985258999 Rock hyrax
21   23.89  693487778  2902270736 Mouse lemur
22   25.07  738735062  2947024286 Tenrec
23   26.53  626144855  2360146428 Chinese hamster
24   27.02  735759102  2723219641 Guinea pig
25   27.15  264527669   974498586 Elephant shark
26   28.68  750840017  2618204639 Naked mole-rat
27   29.32  585214446  1996076410 Megabat
28   30.02  738218566  2458927620 Sloth
29   32.46 1007415247  3103808406 Manatee
30   32.61  663489175  2034575300 Microbat
31   33.98  738128838  2172177994 Alpaca
32   34.13  845852778  2478393770 Squirrel
33   36.46 1159539376  3179905132 Tarsier
34   36.90  889642964  2410758013 Ferret
35   37.08  934359053  2519724550 Bushbaby
36   37.33  919930340  2464367180 White rhinoceros
37   37.76  821021279  2174259888 American alligator
38   38.17  877655715  2299509015 Panda
39   38.30 1114376842  2909698938 Rat
40   39.31  955914303  2431687698 Minke whale
41   39.33 1209588292  3075184024 Wallaby
42   39.67 1259498115  3174693010 Tasmanian devil
43   39.79 1117628138  2808525991 Pig
44   39.93 1084521899  2715720925 Hedgehog
45   40.01 1020984747  2551996573 Dolphin
46   40.11  996450835  2484532062 Horse
47   41.68 1141006794  2737490501 Rabbit
48   41.73 1024603840  2455541136 Cat
49   42.69 1029296436  2410976875 Dog
50   43.51 1292241392  2969988180 Rhesus
51   44.01 1201953154  2730871774 Mouse
52   44.34 1161244502  2619054388 Sheep
53   44.42 1158792969  2608572064 Squirrel monkey
54   44.44  887441764  1996811212 Platypus
55   45.10 1637875527  3631522711 Armadillo
56   45.10 1492574315  3309577922 Chimpanzee
57   45.11 1315072022  2914958544 Marmoset
58   45.73 1576270319  3446771396 Orangutan
59   45.74 1385639740  3029553646 Gorilla
60   46.65 1491170703  3196760833 Elephant
61   46.83 1395986914  2981119579 Cow
62   47.66 1411740930  2962077449 Gibbon
63   49.49 1588381100  3209286105 Human
64   51.47 1517588085  2948380710 Baboon
65   54.17  765103854  1412464843 Zebrafish
66   54.48 1964494051  3605631728 Opossum
Some plots of genome size versus the repeat content
#plot genome size to repeat coverage
plot(my_table$genome_size, my_table$repeat_cov,
     xlab='Genome size (bp)',
     ylab='Repetitive coverage (bp)',
     main='Genome size versus repeat coverage (bp)',
     pch=19)
my_pearson <- paste("Pearson correlation: ", round(cor(my_table$genome_size, my_table$repeat_cov),3))
text(x = 5e8, y = 1.9e9, cex = 1.2, adj=0, my_pearson)
my_spearman <- paste("Spearman correlation: ", round(cor(my_table$genome_size, my_table$repeat_cov,method="spearman"),3))
text(x = 5e8, y = 1.75e9, cex = 1.2, adj=0, my_spearman)
#plot genome size to repeat percent
plot(my_table$genome_size, my_table$percent,
     xlab='Genome size (bp)',
     ylab='Repetitive coverage (%)',
     main='Genome size versus repeat coverage (%)',
     pch=19)
my_pearson <- paste("Pearson correlation: ", round(cor(my_table$genome_size, my_table$percent),3))
text(x = 5e8, y = 50, cex = 1.2, adj=0, my_pearson)
my_spearman <- paste("Spearman correlation: ", round(cor(my_table$genome_size, my_table$percent,method="spearman"),3))
text(x = 5e8, y = 46, cex = 1.2, adj=0, my_spearman)
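The Pearson coefficient printed on the plots can be cross-checked outside R. Below is a small sketch that computes it from the centred sums of products and squares. It assumes a whitespace-separated file with no header, genome size in the first column and repeat coverage (or percent) in the second; both that file layout and the function name pearson are assumptions for illustration:

```shell
#!/bin/bash
# Pearson correlation between column 1 and column 2 of a text file.
# Values are stored first, then centred on their means before summing,
# which avoids subtracting two very large numbers for genome-sized values.
pearson() {
    LC_ALL=C awk '
        NF >= 2 { x[++n] = $1; y[n] = $2; sx += $1; sy += $2 }
        END {
            if (n < 2) exit 1
            mx = sx / n; my = sy / n
            for (i = 1; i <= n; i++) {
                dx = x[i] - mx; dy = y[i] - my
                sxy += dx * dy; sxx += dx * dx; syy += dy * dy
            }
            if (sxx == 0 || syy == 0) exit 1    # a constant column has no correlation
            printf "%.3f\n", sxy / sqrt(sxx * syy)
        }
    ' "$1"
}
```

Given the table exported without its header and with only those two columns, pearson my_table.txt (a hypothetical file name) should agree with the value from cor() in R up to rounding.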
Coloured by phylogenetic distance
Download the phylogenetic tree from the UCSC Genome Browser.
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phastCons100way/hg19.100way.phastCons.mod

cat hg19.100way.phastCons.mod | grep ^TREE | perl -ne 'while(/[(,]+(.*?):/){ $s = $1; s///; print "\"$s\", "} END{print"\n"};'

"hg19", "panTro4", "gorGor3", "ponAbe2", "nomLeu3", "rheMac3", "macFas5", "papHam1", "chlSab1", "calJac3", "saiBol1", "otoGar3", "tupChi1", "speTri2", "jacJac1", "micOch1", "criGri1", "mesAur1", "mm10", "rn5", "hetGla2", "cavPor3", "chiLan1", "octDeg1", "oryCun2", "ochPri3", "susScr3", "vicPac2", "camFer1", "turTru2", "orcOrc1", "panHod1", "bosTau7", "oviAri3", "capHir1", "equCab2", "cerSim1", "felCat5", "canFam3", "musFur1", "ailMel1", "odoRosDiv1", "lepWed1", "pteAle1", "pteVam1", "myoDav1", "myoLuc2", "eptFus1", "eriEur2", "sorAra2", "conCri1", "loxAfr3", "eleEdw1", "triMan1", "chrAsi1", "echTel2", "oryAfe1", "dasNov3", "monDom5", "sarHar1", "macEug2", "ornAna1", "falChe1", "falPer1", "ficAlb2", "zonAlb1", "geoFor1", "taeGut2", "pseHum1", "melUnd1", "amaVit1", "araMac1", "colLiv1", "anaPla1", "galGal4", "melGal1", "allMis1", "cheMyd1", "chrPic1", "pelSin1", "apaSpi1", "anoCar2", "xenTro7", "latCha1", "tetNig2", "fr3", "takFla1", "oreNil2", "neoBri1", "hapBur1", "mayZeb1", "punNye1", "oryLat2", "xipMac1", "gasAcu1", "gadMor1", "danRer7", "astMex1", "lepOcu1", "petMar2",
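The same extraction can be written with grep alone, which may be easier to read than the Perl one-liner. This is only a sketch: it assumes leaf names contain just letters, digits, and underscores, and that each name directly follows "(" or "," with no spaces. The function name tree_leaves and the toy tree are made up for illustration:

```shell
#!/bin/bash
# Print the leaf (assembly) names of a Newick tree, one per line.
# A leaf name directly follows "(" or "," and ends at ":".
# Internal branch lengths follow ")" and are therefore skipped.
tree_leaves() {
    grep -oE '[(,][A-Za-z0-9_]+:' | tr -d '(,:'
}

# Toy tree, not taken from the real .mod file
echo 'TREE: ((hg19:0.1,panTro4:0.2):0.05,mm10:0.4);' | tree_leaves
```

On the real file the call would be grep ^TREE hg19.100way.phastCons.mod | tree_leaves. Note that several assemblies in the tree are different versions from the ones used above (for example ochPri3 versus ochPri2), so the names will not all match directly.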
Work in progress...
Word of caution
See the Twitter thread:
@davetang31 @DanGraur (2) multiple issues can arise as a result of masking one species with a library from another species [3/3]
— Cedric Feschotte (@CedricFeschotte) February 7, 2015

This work is licensed under a Creative Commons
Attribution 4.0 International License.
It would be interesting to plot the repetition percentages with each mammal’s respective rate of dying from cancer. Though I doubt you would be able to find good data sources and would probably have to just perform the study yourself.