Repetitive elements in vertebrate genomes

Updated 2015 February 8th to include some scatter plots of genome size versus repeat content.

I was writing about the makeup of genomes today and was looking up statistics on repetitive elements in vertebrate genomes. While I could find individual papers with repetitive element statistics for a particular genome, I was unable to find a summary across a list of vertebrate genomes (but to be honest I didn't look very hard). So I thought I'd make my own and share it on my blog and via figshare. I will use the RepeatMasker annotations provided via the UCSC genome browser.

The UCSC genome browser provides this list of vertebrate genomes:

Human (hg38) Hedgehog (eriEur2) Platypus (ornAna1)
Alpaca (vicPac2) Horse (equCab2) Rabbit (oryCun2)
American alligator (allMis1) Kangaroo rat (dipOrd1) Rat (rn5)
Armadillo (dasNov3) Lamprey (petMar2) Rhesus (rheMac3)
Atlantic cod (gadMor1) Lizard (anoCar2) Rock hyrax (proCap1)
Baboon (papAnu2) Manatee (triMan1) Sheep (oviAri3)
Budgerigar (melUnd1) Marmoset (calJac3) Shrew (sorAra1)
Bushbaby (otoGar3) Medaka (oryLat2) Sloth (choHof1)
Cat (felCat5) Medium ground finch (geoFor1) Squirrel (speTri2)
Chicken (galGal4) Megabat (pteVam1) Squirrel monkey (saiBol1)
Chimpanzee (panTro4) Microbat (myoLuc2) Stickleback (gasAcu1)
Chinese hamster (criGri1) Minke whale (balAcu1) Tarsier (tarSyr1)
Coelacanth (latCha1) Mouse (mm10) Tasmanian devil (sarHar1)
Cow (bosTau7) Mouse lemur (micMur1) Tenrec (echTel2)
Dog (canFam3) Naked mole-rat (hetGla2) Tetraodon (tetNig2)
Dolphin (turTru2) Nile tilapia (oreNil2) Tree shrew (tupBel1)
Elephant (loxAfr3) Opossum (monDom5) Turkey (melGal1)
Elephant shark (calMil1) Orangutan (ponAbe2) Wallaby (macEug2)
Ferret (musFur1) Painted turtle (chrPic1) White rhinoceros (cerSim1)
Fugu (fr3) Panda (ailMel1) X. tropicalis (xenTro3)
Gibbon (nomLeu3) Pig (susScr3) Zebra finch (taeGut2)
Gorilla (gorGor3) Pika (ochPri2) Zebrafish (danRer7)
Guinea pig (cavPor3)

If you're like me and wondering what a pika is, I found out today that it's this cute little mammal. Anyway, back to the topic: the RepeatMasker results for these genomes are accessible via the UCSC MySQL database. Below is a bash script that queries the database for each vertebrate genome, obtains the genome size and the number of bases covered by repetitive elements, and saves the output to one file per query. I put two sleep commands in there just in case I'm making excessive queries.

#!/bin/bash

#loop through all available vertebrate genomes
for gen in hg38 eriEur2 ornAna1 vicPac2 equCab2 oryCun2 allMis1 dipOrd1 rn5 dasNov3 petMar2 rheMac3 gadMor1 anoCar2 proCap1 papAnu2 triMan1 oviAri3 melUnd1 calJac3 sorAra1 otoGar3 oryLat2 choHof1 felCat5 geoFor1 speTri2 galGal4 pteVam1 saiBol1 panTro4 myoLuc2 gasAcu1 criGri1 balAcu1 tarSyr1 latCha1 mm10 sarHar1 bosTau7 micMur1 echTel2 canFam3 hetGla2 tetNig2 turTru2 oreNil2 tupBel1 loxAfr3 monDom5 melGal1 calMil1 ponAbe2 macEug2 musFur1 chrPic1 cerSim1 fr3 ailMel1 xenTro3 nomLeu3 susScr3 taeGut2 gorGor3 ochPri2 danRer7 cavPor3;
   do echo $gen
   mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select sum(size) from $gen.chromInfo;" > $gen.size
   sleep 5
   mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select sum(genoEnd-genoStart) from $gen.rmsk;" > $gen.rmsk
   sleep 5
done

If I run the above script, I get these error messages:

ERROR 1146 (42S02) at line 1: Table 'equCab2.rmsk' doesn't exist
ERROR 1146 (42S02) at line 1: Table 'gasAcu1.rmsk' doesn't exist
ERROR 1146 (42S02) at line 1: Table 'tetNig2.rmsk' doesn't exist
ERROR 1146 (42S02) at line 1: Table 'ponAbe2.rmsk' doesn't exist
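
Because the loop redirects each query's output to a file, a failed query still leaves an (empty) file behind. The snippet below is a minimal sketch of how the failures could be spotted afterwards; `find_failed` is a hypothetical helper name, and it assumes every successful result file holds a header line followed by a single integer.

```bash
#!/bin/bash

# Print the names of result files that lack a numeric value on line 2.
# Assumption: a successful query writes a header line and then one integer,
# while a failed query leaves the file empty.
find_failed() {
   local f
   for f in "$@"; do
      # sed -n '2p' prints only the second line (nothing if it is missing)
      if ! sed -n '2p' "$f" | grep -Eq '^[0-9]+$'; then
         echo "$f"
      fi
   done
}

# Example usage (commented out so the sketch has no side effects):
# find_failed *.size *.rmsk
```

Run in the output directory, `find_failed *.size *.rmsk` should list the rmsk files that correspond to the error messages.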

It turns out that there is no RepeatMasker information for tetNig2. For the other three genomes, the RepeatMasker information is split across multiple tables, one per chromosome. I was going to leave these three genomes out, but I thought someone might be interested in them, so in the end I included them by doing some extra work. Below is a bash script that lists all the tables containing RepeatMasker results in each of the three genomes.

#!/bin/bash

#no RepeatMasker information for this genome
#so delete these files
rm tetNig2.rmsk tetNig2.size

for gen in equCab2 gasAcu1 ponAbe2
   do echo $gen
   mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "show tables in $gen like '%rmsk%';" > $gen.list
   sleep 5
done

Since the RepeatMasker results for equCab2, gasAcu1, and ponAbe2 are stored across multiple tables, we have to use UNION to query across them. The Perl script below builds and runs this query for each genome, sums the per-table results, and creates three rmsk files, one per genome.

#!/usr/bin/env perl

use strict;
use warnings;

my @base = qw/equCab2 gasAcu1 ponAbe2/;

foreach my $base (@base){
   open(IN,'<',"$base.list") || die "Could not open $base.list: $!\n";
   my $query = '';
   my $first = 1;
   while(<IN>){
      chomp;
      next if /^Table/;
      my $chr = $_;
      if ($first == 1){
         $query = "select sum(genoEnd-genoStart) from $base.$chr ";
         $first = 0;
      } else {
         # union all keeps duplicate rows; a plain union would drop tables with identical sums
         $query .= "union all select sum(genoEnd-genoStart) from $base.$chr ";
      }
   }
   $query =~ s/\s$/;/;
   print "$query\n";

   my $command = "mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e \"$query\" > $base.temp";
   system($command);

   open(IN,'<',"$base.temp") || die "Could not open $base.temp:$!\n";
   my $sum = 0;
   while(<IN>){
      chomp;
      next unless /^\d+/;
      $sum += $_;
   }
   close(IN);

   open(OUT,'>',"$base.rmsk") || die "Could not open $base.rmsk for writing: $!\n";
   print OUT "sum(genoEnd-genoStart)\n$sum\n";
   close(OUT);

   unlink("$base.temp");
   unlink("$base.list");
}

exit(0);
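
One thing to keep in mind when combining per-table sums this way: in MySQL a plain UNION removes duplicate rows, whereas UNION ALL keeps them, so two tables that happen to have identical sums would only be counted once under a plain UNION. The sketch below mimics the difference with made-up numbers (they are not real per-chromosome totals), using `sort -u` to stand in for the de-duplication.

```bash
#!/bin/bash

# Made-up per-table sums; the value 500 appears twice on purpose.
sums=$(printf '%s\n' 1200 500 300 500)

# UNION ALL keeps every row, so all four values are summed.
sum_all=$(echo "$sums" | awk '{ s += $1 } END { print s }')

# A plain UNION behaves like DISTINCT, so the repeated 500 is counted once.
sum_distinct=$(echo "$sums" | sort -u | awk '{ s += $1 } END { print s }')

echo "UNION ALL total: $sum_all"
echo "UNION total:     $sum_distinct"
```

With real data, identical sums across two chromosomes are unlikely but not impossible, which is why the ALL variant is the safer choice when the rows are going to be added up.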

Now that I have all the genome size and RepeatMasker coverage files, I can make my plot in R.

#list of genome size files
my_list <- sub("\\.size$", "", list.files(pattern="\\.size$"))

#array to store the repeat coverage (bp) of each genome
my_rep_cov <- array()
my_index <- 1

#loop through the rmsk files and store the repeat coverage
for (my_file in list.files(pattern="\\.rmsk$")){
  my_data <- read.table(file=my_file,
                        skip=1,
                        header=F)
  my_rep_cov[my_index] <- my_data$V1
  my_index <- my_index + 1
}

#sanity check; 28 is the human genome
my_list[28]
my_rep_cov[28]

#array to store the genome sizes (bp)
my_gen_size <- array()
my_index <- 1

#loop through the size files and store the genome size
for (my_file in list.files(pattern="\\.size$")){
  my_data <- read.table(file=my_file,
                        skip=1,
                        header=F)
  my_gen_size[my_index] <- my_data$V1
  my_index <- my_index + 1
}

#sanity check
my_gen_size[28]

my_percent <- round(my_rep_cov*100/my_gen_size, 2)

#sanity check
my_percent[28]
length(my_percent)
length(my_gen_size)
length(my_rep_cov)

#order the percentages
my_ordered_percent <- my_percent[order(my_percent)]
my_ordered_list <- my_list[order(my_percent)]

#create lookup to rename the genome abbreviations
my_lookup <- data.frame(short=c('hg38','eriEur2','ornAna1','vicPac2','equCab2','oryCun2','allMis1','dipOrd1','rn5','dasNov3','petMar2','rheMac3','gadMor1','anoCar2','proCap1','papAnu2','triMan1','oviAri3','melUnd1','calJac3','sorAra1','otoGar3','oryLat2','choHof1','felCat5','geoFor1','speTri2','galGal4','pteVam1','saiBol1','panTro4','myoLuc2','gasAcu1','criGri1','balAcu1','tarSyr1','latCha1','mm10','sarHar1','bosTau7','micMur1','echTel2','canFam3','hetGla2','tetNig2','turTru2','oreNil2','tupBel1','loxAfr3','monDom5','melGal1','calMil1','ponAbe2','macEug2','musFur1','chrPic1','cerSim1','fr3','ailMel1','xenTro3','nomLeu3','susScr3','taeGut2','gorGor3','ochPri2','danRer7','cavPor3'),
                        long=c('Human','Hedgehog','Platypus','Alpaca','Horse','Rabbit','American alligator','Kangaroo rat','Rat','Armadillo','Lamprey','Rhesus','Atlantic cod','Lizard','Rock hyrax','Baboon','Manatee','Sheep','Budgerigar','Marmoset','Shrew','Bushbaby','Medaka','Sloth','Cat','Medium ground finch','Squirrel','Chicken','Megabat','Squirrel monkey','Chimpanzee','Microbat','Stickleback','Chinese hamster','Minke whale','Tarsier','Coelacanth','Mouse','Tasmanian devil','Cow','Mouse lemur','Tenrec','Dog','Naked mole-rat','Tetraodon','Dolphin','Nile tilapia','Tree shrew','Elephant','Opossum','Turkey','Elephant shark','Orangutan','Wallaby','Ferret','Painted turtle','White rhinoceros','Fugu','Panda','X. tropicalis','Gibbon','Pig','Zebra finch','Gorilla','Pika','Zebrafish','Guinea pig'))

#match the index of the ordered list
#to the lookup index
my_long_index <- match(my_ordered_list, my_lookup$short)

#colour human genome in red
my_colour <- rep(1, length(my_percent))
my_colour[grep('hg38', my_ordered_list)] <- 2

#find out default margins
par()$mar
#[1] 5.1 4.1 4.1 2.1

#readjust margin to fit names
par(mar=c(10.1, 4.1, 4.1, 2.1))

#make the plot
barplot(my_ordered_percent,
        names.arg=my_lookup$long[my_long_index],
        las=2,
        ylim=c(0,60),
        col=my_colour)

[Figure: repeat_coverage_vertebrate_genome]

A high percentage of the human genome is made up of repetitive elements compared to other vertebrate genomes, but the human genome is not the highest; that distinction belongs to the opossum. The zebrafish has the second highest percentage of repetitive element coverage. Link to figshare.

How about the plot of genome sizes ordered by lowest to highest percentage of repetitive elements in genomes?

#plot genome sizes ordered by percent repeat
barplot(my_gen_size[order(my_percent)],
        names.arg=my_lookup$long[my_long_index],
        las=2,
        ylim=c(0,4e9),
        col=my_colour)

[Figure: genome_size]

Several of the larger genomes have a lower percentage of repetitive elements than other vertebrate genomes. The zebrafish genome (about 1.4 Gb) is much smaller than most of the mammalian genomes but has a very high percentage of repeats. Link to figshare.

The results are only as good as the RepeatMasker annotations. Lastly, here's an updated figure with both plots together:

[Figure: percent_and_genome_size]

It's also on figshare.

As a table

my_table <- data.frame(percent=my_ordered_percent,
                       repeat_cov=my_rep_cov[order(my_percent)],
                       genome_size=my_gen_size[order(my_percent)],
                       organism=my_lookup$long[my_long_index])

   percent repeat_cov genome_size            organism
1     2.67   23221380   869000216              Medaka
2     2.71   70249245  2589745704      Painted turtle
3     3.26   30211713   927696114        Nile tilapia
4     3.59   16614804   463354448         Stickleback
5     5.65   60219852  1065292181 Medium ground finch
6     5.81   61724058  1061817101              Turkey
7     6.37  219383838  3445784354                Pika
8     7.07  202211884  2860591921          Coelacanth
9     7.20   80477988  1117373619          Budgerigar
10    7.33   28713997   391484715                Fugu
11    7.61  136870406  1799143587              Lizard
12    7.96   98040964  1232135591         Zebra finch
13    8.05   66379637   824327835        Atlantic cod
14   10.45   92520928   885550958             Lamprey
15   10.70  111973172  1046932099             Chicken
16   11.94  437010960  3660774957          Tree shrew
17   14.38  422354596  2936119008               Shrew
18   20.54  443277796  2158502098        Kangaroo rat
19   22.72  343489898  1511735326 Western clawed frog
20   22.74  678854084  2985258999          Rock hyrax
21   23.89  693487778  2902270736         Mouse lemur
22   25.07  738735062  2947024286              Tenrec
23   26.53  626144855  2360146428     Chinese hamster
24   27.02  735759102  2723219641          Guinea pig
25   27.15  264527669   974498586      Elephant shark
26   28.68  750840017  2618204639      Naked mole-rat
27   29.32  585214446  1996076410             Megabat
28   30.02  738218566  2458927620               Sloth
29   32.46 1007415247  3103808406             Manatee
30   32.61  663489175  2034575300            Microbat
31   33.98  738128838  2172177994              Alpaca
32   34.13  845852778  2478393770            Squirrel
33   36.46 1159539376  3179905132             Tarsier
34   36.90  889642964  2410758013              Ferret
35   37.08  934359053  2519724550            Bushbaby
36   37.33  919930340  2464367180    White rhinoceros
37   37.76  821021279  2174259888  American alligator
38   38.17  877655715  2299509015               Panda
39   38.30 1114376842  2909698938                 Rat
40   39.31  955914303  2431687698         Minke whale
41   39.33 1209588292  3075184024             Wallaby
42   39.67 1259498115  3174693010     Tasmanian devil
43   39.79 1117628138  2808525991                 Pig
44   39.93 1084521899  2715720925            Hedgehog
45   40.01 1020984747  2551996573             Dolphin
46   40.11  996450835  2484532062               Horse
47   41.68 1141006794  2737490501              Rabbit
48   41.73 1024603840  2455541136                 Cat
49   42.69 1029296436  2410976875                 Dog
50   43.51 1292241392  2969988180              Rhesus
51   44.01 1201953154  2730871774               Mouse
52   44.34 1161244502  2619054388               Sheep
53   44.42 1158792969  2608572064     Squirrel monkey
54   44.44  887441764  1996811212            Platypus
55   45.10 1637875527  3631522711           Armadillo
56   45.10 1492574315  3309577922          Chimpanzee
57   45.11 1315072022  2914958544            Marmoset
58   45.73 1576270319  3446771396           Orangutan
59   45.74 1385639740  3029553646             Gorilla
60   46.65 1491170703  3196760833            Elephant
61   46.83 1395986914  2981119579                 Cow
62   47.66 1411740930  2962077449              Gibbon
63   49.49 1588381100  3209286105               Human
64   51.47 1517588085  2948380710              Baboon
65   54.17  765103854  1412464843           Zebrafish
66   54.48 1964494051  3605631728             Opossum
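
As a quick check on the table, the percent column can be recomputed from the other two columns. The snippet below is a small sketch that does this with awk for two rows; `percent_repeat` is a hypothetical helper name, and the numbers are copied from the Human and Zebrafish rows above.

```bash
#!/bin/bash

# percent_repeat REPEAT_BP GENOME_BP
# Prints repeat coverage as a percentage with two decimal places.
percent_repeat() {
   awk -v rep="$1" -v size="$2" 'BEGIN { printf "%.2f\n", rep * 100 / size }'
}

# Values copied from the Human and Zebrafish rows of the table
percent_repeat 1588381100 3209286105
percent_repeat 765103854 1412464843
```

The two printed values should match the 49.49 and 54.17 reported in the percent column.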

Some plots of genome size versus the repeat content

#plot genome size to repeat coverage
plot(my_table$genome_size,
     my_table$repeat_cov,
     xlab='Genome size (bp)',
     ylab='Repetitive coverage (bp)',
     main='Genome size versus repeat coverage (bp)',
     pch=19)
my_pearson <- paste("Pearson correlation: ", round(cor(my_table$genome_size, my_table$repeat_cov),3))
text(x = 5e8, y = 1.9e9, cex = 1.2, adj=0, my_pearson)
my_spearman <- paste("Spearman correlation: ", round(cor(my_table$genome_size, my_table$repeat_cov,method="spearman"),3))
text(x =5e8, y = 1.75e9, cex = 1.2, adj=0, my_spearman)

[Figure: genome_size_vs_repeat_bp]

#plot genome size to repeat percent
plot(my_table$genome_size,
     my_table$percent,
     xlab='Genome size (bp)',
     ylab='Repetitive coverage (%)',
     main='Genome size versus repeat coverage (%)',
     pch=19)
my_pearson <- paste("Pearson correlation: ", round(cor(my_table$genome_size, my_table$percent),3))
text(x = 5e8, y = 50, cex = 1.2, adj=0, my_pearson)
my_spearman <- paste("Spearman correlation: ", round(cor(my_table$genome_size, my_table$percent,method="spearman"),3))
text(x =5e8, y = 46, cex = 1.2, adj=0, my_spearman)

[Figure: genome_size_vs_repeat_percent]
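
For anyone who wants to see what the Pearson coefficient from `cor()` is doing, here is a minimal awk sketch of the textbook formula. The function name `pearson` and the toy input are my own illustration; it uses the naive sums-of-products form, which is fine for a few dozen genomes but is not numerically robust in general, and it does not cover the Spearman (rank-based) variant.

```bash
#!/bin/bash

# pearson: read two whitespace-separated numeric columns (x, y) from stdin
# and print the Pearson correlation coefficient to three decimal places.
pearson() {
   awk '
      { n++; sx += $1; sy += $2; sxx += $1 * $1; syy += $2 * $2; sxy += $1 * $2 }
      END {
         num = n * sxy - sx * sy
         den = sqrt(n * sxx - sx * sx) * sqrt(n * syy - sy * sy)
         printf "%.3f\n", num / den
      }'
}

# Toy data that is perfectly linear (y = 2x), so the coefficient is 1
printf '1 2\n2 4\n3 6\n' | pearson
```

To apply it to the real data, the genome_size and repeat_cov columns of the table could be piped in as two columns.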

Coloured by phylogenetic distance

Download the phylogenetic tree from the UCSC Genome Browser.

wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phastCons100way/hg19.100way.phastCons.mod
cat hg19.100way.phastCons.mod | grep ^TREE | perl -ne 'while(/[(,]+(.*?):/){ $s = $1; s///; print "\"$s\", "} END{print"\n"};'
"hg19", "panTro4", "gorGor3", "ponAbe2", "nomLeu3", "rheMac3", "macFas5", "papHam1", "chlSab1", "calJac3", "saiBol1", "otoGar3", "tupChi1", "speTri2", "jacJac1", "micOch1", "criGri1", "mesAur1", "mm10", "rn5", "hetGla2", "cavPor3", "chiLan1", "octDeg1", "oryCun2", "ochPri3", "susScr3", "vicPac2", "camFer1", "turTru2", "orcOrc1", "panHod1", "bosTau7", "oviAri3", "capHir1", "equCab2", "cerSim1", "felCat5", "canFam3", "musFur1", "ailMel1", "odoRosDiv1", "lepWed1", "pteAle1", "pteVam1", "myoDav1", "myoLuc2", "eptFus1", "eriEur2", "sorAra2", "conCri1", "loxAfr3", "eleEdw1", "triMan1", "chrAsi1", "echTel2", "oryAfe1", "dasNov3", "monDom5", "sarHar1", "macEug2", "ornAna1", "falChe1", "falPer1", "ficAlb2", "zonAlb1", "geoFor1", "taeGut2", "pseHum1", "melUnd1", "amaVit1", "araMac1", "colLiv1", "anaPla1", "galGal4", "melGal1", "allMis1", "cheMyd1", "chrPic1", "pelSin1", "apaSpi1", "anoCar2", "xenTro7", "latCha1", "tetNig2", "fr3", "takFla1", "oreNil2", "neoBri1", "hapBur1", "mayZeb1", "punNye1", "oryLat2", "xipMac1", "gasAcu1", "gadMor1", "danRer7", "astMex1", "lepOcu1", "petMar2",

Work in progress...
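
The assembly names can also be pulled out of the tree with grep and tr instead of the Perl one-liner. This is only a sketch under the assumption that each leaf name starts with a letter, contains only letters and digits, and sits directly between a `(` or `,` and a `:`; the three-species tree in the example is made up and is not the UCSC tree.

```bash
#!/bin/bash

# leaf_names: read a Newick tree from stdin and print one leaf name per line.
# Assumes names look like UCSC assembly names (hg19, panTro4, ...) and are
# immediately preceded by "(" or "," and followed by ":branch_length".
leaf_names() {
   grep -oE '[(,][A-Za-z][A-Za-z0-9]*:' | tr -d '(,:'
}

# Made-up three-species tree for illustration
echo 'TREE: ((hg19:0.006,panTro4:0.007):0.08,mm10:0.35);' | leaf_names
```

Because the pattern requires a preceding `(` or `,`, the leading `TREE:` label is skipped, so the output of `grep ^TREE` on the downloaded file could be piped straight into `leaf_names`.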

Word of caution

See the Twitter thread:




Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.
One comment
  1. It would be interesting to plot the repetition percentages with each mammal’s respective rate of dying from cancer. Though I doubt you would be able to find good data sources and would probably have to just perform the study yourself.
