The repetitive landscape of the human and mouse genome

Updated on the 31st May 2013 and updated again on the 25th March 2015 in light of Chris's comment.

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences. Results of RepeatMasker performed on the human and mouse genomes are provided via the UCSC Table Browser tool. In the post I will summarise the results of the RepeatMasker program to gain an overview of the repetitive landscape of the human and mouse genome.

Below I provide a screenshot of the settings used to download the RepeatMasker results performed on hg19:

ucsc_table_browser_hg19_rmskUse this settings to download the RepeatMasker results on hg19. Change the genome to mm9 for the mouse RepeatMasker results.

If you view the contents of the file, the output is not summarised:

zcat hg19_rmsk.tsv.gz | head
#bin    swScore milliDiv        milliDel        milliIns        genoName        genoStart       genoEnd genoLeft        strand  repName repClass        repFamily       repStart        repEnd       repLeft id
1       2147    87      3       17      chr1    16777160        16777470        -232473151      +       AluSp   SINE    Alu     1       313     0       2
1       2626    28      0       0       chr1    25165800        25166089        -224084532      -       AluY    SINE    Alu     -22     289     1       4
1       626     312     185     59      chr1    33553606        33554646        -215695975      +       L2b     LINE    L2      2195    3356    -19     6
1       12545   132     15      48      chr1    50330063        50332153        -198918468      +       L1PA10  LINE    L1      15      2042    -4311   9
1       8050    8       1       9       chr1    58720067        58720973        -190529648      -       L1PA2   LINE    L1      -3      6152    5254    1
2       10586   185     68      72      chr1    75496180        75498100        -173752521      +       L1MB7   LINE    L1      4580    6518    -164    1
2       980     341     64      51      chr1    83886030        83886750        -165363871      -       ERVL-E-int      LTR     ERVL    -422    5245    4517    1
2       1422    234     189     34      chr1    100662895       100663391       -148587230      -       L2a     LINE    L2      -857    2562    1990    1
2       532     244     6       66      chr1    117440426       117440514       -131810107      +       L1ME1   LINE    L1      2392    2470    -3930   1

I wrote a Perl script for summarising the RepeatMasker output files:

#!/usr/bin/perl

my $usage = "Usage: $0 <infile.repeatmasker>\n";
my $infile = shift or die $usage;

my %repName = ();
my %repClass = ();
my %repFamily = ();

open(IN,'-|',"gunzip -c $infile") || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   next if (/^#/);
   #bin    swScore milliDiv        milliDel        milliIns        genoName        genoStart       genoEnd genoLeft        strand  repName repClass        repFamily       repStart   repEnd  repLeft id
   #1       2147    87      3       17      chr1    16777160        16777470        -232473151      +       AluSp   SINE    Alu     1       313     0       2
   my ($bin, $swScore, $milliDiv, $milliDel, $milliIns, $genoName, $genoStart, $genoEnd, $genoLeft, $strand, $repName, $repClass, $repFamily, $repStart, $repEnd, $repLeft, $id) = split(/\s+/);
   #print "$bin\n";

   if (exists $repName{$repName}){
      $repName{$repName}++;
   } else {
      $repName{$repName} = 1;
   }

   if (exists $repClass{$repClass}){
      $repClass{$repClass}++;
   } else {
      $repClass{$repClass} = 1;
   }

   if (exists $repFamily{$repFamily}){
      $repFamily{$repFamily}++;
   } else {
      $repFamily{$repFamily} = 1;
   }

}
close(IN);

#Change this into the corresponding hash for tallies of the repeat families and repeat classes
foreach my $key (sort {$repName{$b} <=> $repName{$a}} keys %repName){
   print "$key\t$repName{$key}\n";
}

exit(0);

However, as I've been learning R, I will use that instead for providing summaries:

#the comment='$' is just to circumvent the
#first line of the file being ignored
hg19_repeat <- read.table(gzfile("hg19_rmsk.tsv.gz"),
                          header=T,
                          comment='$',
                          stringsAsFactors=F)

#check dimensions
dim(hg19_repeat)
#[1] 5298130      17

#check the first 6 lines
head(hg19_repeat)
#  X.bin swScore milliDiv milliDel milliIns genoName genoStart  genoEnd
#1     1    2147       87        3       17     chr1  16777160 16777470
#2     1    2626       28        0        0     chr1  25165800 25166089
#3     1     626      312      185       59     chr1  33553606 33554646
#4     1   12545      132       15       48     chr1  50330063 50332153
#5     1    8050        8        1        9     chr1  58720067 58720973
#6     2   10586      185       68       72     chr1  75496180 75498100
#    genoLeft strand repName repClass repFamily repStart repEnd repLeft id
#1 -232473151      +   AluSp     SINE       Alu        1    313       0  2
#2 -224084532      -    AluY     SINE       Alu      -22    289       1  4
#3 -215695975      +     L2b     LINE        L2     2195   3356     -19  6
#4 -198918468      +  L1PA10     LINE        L1       15   2042   -4311  9
#5 -190529648      -   L1PA2     LINE        L1       -3   6152    5254  1
#6 -173752521      +   L1MB7     LINE        L1     4580   6518    -164  1

#I'm not sure what the id column is
#but I'm going to remove it and the bin column
hg19_repeat <- hg19_repeat[,-c(1,17)]

#check dimensions again
dim(hg19_repeat)
#[1] 5298130      15

#how many repeat families are there?
table(hg19_repeat$repFamily)

           Alu            CR1            DNA           DNA?            Deu
       1194734          61303           2744           1881           1265
       Dong-R4            ERV           ERV1           ERVK           ERVL
           556            579         175937          10868         160346
     ERVL-MaLR          ERVL?          Gypsy         Gypsy?       Helitron
        347105           1854          10892           7869           1800
     Helitron?             L1            L1?             L2            LTR
           436         951780             84         466438           2206
          LTR? Low_complexity            MIR         Merlin           MuDR
           122         371543         595094             57           1992
         Other      Penelope?       PiggyBac      PiggyBac?            RNA
          3733             51           2120            241            729
           RTE       RTE-BovB           SINE          SINE?      Satellite
         17874            655            962            425           6775
 Simple_repeat          TcMar  TcMar-Mariner      TcMar-Tc2   TcMar-Tigger
        417913           1950          16348           8156         104026
        TcMar?        Unknown       Unknown?           acro          centr
          3424           7036             97             61           2325
           hAT  hAT-Blackjack    hAT-Charlie     hAT-Tip100           hAT?
         12573          19755         254646          30669           3050
          rRNA          scRNA          snRNA         srpRNA           tRNA
          1769           1340           4386           1481           3670
          telo
           405

#store in another object
repeat_family <- table(hg19_repeat$repFamily)

#original margin settings
par()$mar
#[1] 5.1 4.1 4.1 2.1
par(mar=c(8.1, 6.1 ,4.1 ,2.1))

#postscript
postscript("barplot_repeat_family.eps")
barplot(repeat_family[order(repeat_family, decreasing=T)], las=2, ylim=c(0,1.2e6), cex.names=0.7)
dev.off()

barplot(repeat_family[order(repeat_family, decreasing=T)], las=2, ylim=c(0,1.4e6))

barplot_repeat_familyThe most repetitive element in the hg19 genome are Alus, followed by L1s, MIRs and L2.

Summary of repeat classes:

#summary of repeat classes
repeat_class <- table(hg19_repeat$repClass)
repeat_class

           DNA           DNA?           LINE          LINE? Low_complexity            LTR           LTR?          Other 
        461751           1881        1498690             51         371543         717656            122           3733 
            RC            RNA           rRNA      Satellite          scRNA  Simple_repeat           SINE          SINE? 
          2236            729           1769           9566           1340         417913        1793723            425 
         snRNA         srpRNA           tRNA        Unknown       Unknown? 
          4386           1481           2002           7036             97
barplot(repeat_class[order(repeat_class, decreasing=T)], las=2, ylim=c(0,2e6))

barplot_repeat_classThe most numerous class of repeats in hg19 are SINEs, followed by LINEs and LTRs.

How many LINEs, SINEs and LTRs are there?

length(unique(hg19_repeat[hg19_repeat$repClass=="LINE",]$repName))
[1] 146
line <- table(hg19_repeat[hg19_repeat$repClass=="LINE",]$repName)
head(line[order(line, decreasing=T)],n=8)

   L2a    L2c    L2b   L1M5     L2     L3 L1ME4a  L1ME1 
171064 140653  97936  64374  56785  45015  43230  31734

length(unique(hg19_repeat[hg19_repeat$repClass=="SINE",]$repName))
[1] 49
sine <- table(hg19_repeat[hg19_repeat$repClass=="SINE",]$repName)
head(sine[order(sine, decreasing=T)],n=8)

  MIRb    MIR  AluJb  AluSx   AluY AluSx1   MIRc  AluSz 
225155 175606 144945 144437 120617 110831 103492  98343

length(unique(hg19_repeat[hg19_repeat$repClass=="LTR",]$repName))
[1] 504
head(ltr[order(ltr, decreasing=T)],n=8)

 THE1B  MLT1D MLT1A0  MLT1C   MSTA  MLT1K  MLT1B  MLT1J 
 22433  20741  20643  19824  19782  18173  18004  15270

#make some pie charts
par(mfrow=c(1,2))
pie(line[order(line, decreasing=T)])
pie(sine[order(sine, decreasing=T)])

pie_line_sineNot very pretty but at least we can see the abundance ranking of the repeats.

Length of LINEs, SINEs and LTRs in hg19

#calculate and store the lengths
hg19_repeat$length <- hg19_repeat$genoEnd - hg19_repeat$genoStart
#length of repeat class LINEs
summary(hg19_repeat[hg19_repeat$repClass=='LINE',]$length)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     11     115     221     426     448    8505

#length of repeat class SINEs
summary(hg19_repeat[hg19_repeat$repClass=='SINE',]$length)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   11.0   140.0   256.0   221.5   300.0   614.0

#length of repeat class LTRs
summary(hg19_repeat[hg19_repeat$repClass=='LTR',]$length)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   11.0   154.0   334.0   373.4   438.0 18040.0

line_length <- hg19_repeat[hg19_repeat$repClass=='LINE',]$length
sine_length <- hg19_repeat[hg19_repeat$repClass=='SINE',]$length
ltr_length <- hg19_repeat[hg19_repeat$repClass=='LTR',]$length

par(mfrow=c(1,3))
boxplot(line_length, main="LINE")
boxplot(sine_length, main="SINE")
boxplot(ltr_length, main="LTR")

SINEs are much more consistent in size than LINEs or LTRs.

How many full-length LINEs? Autonomous L1s are approximately 6,000 bp in size.

length(hg19_repeat[hg19_repeat$repClass=="LINE" & hg19_repeat$length>5999,]$repName)
[1] 7354
table(hg19_repeat[hg19_repeat$repClass=="LINE" & hg19_repeat$length>5999,]$repName)

     L1HS      L1M1      L1M2     L1M2a     L1M2c     L1MA1     L1MA2     L1MA3     L1MA4    L1MA4A     L1MA5    L1MA5A 
      307        26         2         1         1        48        67        66        16        20        13         2 
    L1MA6     L1MA7     L1MA8     L1MA9     L1MB3     L1MC1     L1MC3     L1ME1     L1ME2      L1P2      L1P3    L1PA10 
        1         1         2         6         1         5         1         3         1         2         8        67 
   L1PA11    L1PA12    L1PA13    L1PA14    L1PA15 L1PA15-16    L1PA16    L1PA17     L1PA2     L1PA3     L1PA4     L1PA5 
       48         6        54        21        46         4        39        11       977      1367      1228       865 
    L1PA6     L1PA7     L1PA8    L1PA8A     L1PB1     L1PB2     L1PB3     L1PB4     L1PBa    L1PBa1     L1PBb   L1PREC2 
      736       786       167        71       169        22        18        10         5         2         2        33

Total coverage

#total coverage in hg19
zcat hg19_rmsk.tsv.gz | grep -v "^#" | awk '{total+=$8-$7} END {print total}'
1467396988

#roughly 50% of the genome
bc -l<<<1467396988/3000000000*100
48.91323293333333333300

#total coverage in ce10
zcat ce10_rmsk.tsv.gz | grep -v "^#" | awk '{total+=$8-$7} END {print total}'
13827574

#roughly 14%
bc -l<<<13827574/100000000*100
13.82757400000000000000

25th March 2015: Summary based on my RepeatMasker run

I ran RepeatMasker on the hg19 genome using this command for all assembled chromosomes (chr1:22 and chr[MXY]):

#RepeatMasker version open-4.0.3
#-xsmall Returns repetitive regions in lowercase (rest capitals) rather than masked
#-s Slow search; 0-5% more sensitive, 2-3 times slower than default
#8 processors
#RepBase Update 20130422
RepeatMasker -e crossmatch -species human -s -xsmall -pa 8 chr1.fa

#An asterisk (*) in the final column indicates that there is a higher-scoring match whose domain partly (<80%) includes the domain of this match.
cat *.out |
grep -v "position in query" |
grep -v "class/family" |
grep -v "^$" |
sed 's/^\s\+//' |
sed 's/\s\+\*$//' |
gzip > hg19_rmsk_cm_20130422.out.gz

#slightly more than the UCSC Genome Browser numbers
gunzip -c hg19_rmsk_cm_20130422.out.gz | wc -l
5313559

#concatenate the tbl files
cat *.tbl | gzip > hg19_rmsk_cm_20130422.tbl.gz

A concatenation of the *.tbl files can be downloaded here and the concatenation of the *.out files can be downloaded here.

Summaries using R

data <- read.table("hg19_rmsk_cm_20130422.out.gz",
                   header=F,
                   stringsAsFactors = F)

dim(data)
[1] 5313559      15

#the repeat class and family
#are joined together
head(data)
     V1   V2  V3  V4    V5    V6    V7          V8 V9    V10      V11    V12  V13    V14 V15
1  1706 22.2 8.9 3.6 chr10 60964 61180 (135473567)  C   L1M4  LINE/L1 (2130) 4016   3779   1
2  1006 22.4 0.0 0.0 chr10 61181 61345 (135473402)  C   FRAM SINE/Alu   (11)  165      1   2
3  1706 22.2 8.9 3.6 chr10 61346 61712 (135473035)  C   L1M4  LINE/L1 (2368) 3778   3398   1
4 21526 12.8 1.9 5.0 chr10 63293 67260 (135467487)  + L1PA17  LINE/L1   1184 5132 (1750)   3
5  1366 12.9 0.0 0.0 chr10 67261 67438 (135467309)  + AluSq2 SINE/Alu    136  313    (0)   4
6 21526 12.8 1.9 5.0 chr10 67439 69192 (135465555)  + L1PA17  LINE/L1   5133 6880    (2)   3

#the following code splits the repeat
#class and family
my_subset_func <- function(x, which){
  if (length(x)==2){
    return(x[[which]])
  } else {
    return(x[[1]])
  }
}

my_split <- strsplit(data[,11], split = "/")

df <- data[,1:10]
df$V11 <- sapply(my_split, my_subset_func, which=1)
df$V12 <- sapply(my_split, my_subset_func, which=2)
df <- cbind(df, data[,12:15])

dim(df)
[1] 5313559      16

names(df) <- c('score',
               'div',
               'del',
               'ins',
               'query',
               'q_start',
               'q_end',
               'q_left',
               'strand',
               'repeat',
               'class',
               'family',
               's_start',
               's_end',
               's_left',
               'id')

head(df)
  score  div del ins query q_start q_end      q_left strand repeat class family s_start s_end
1  1706 22.2 8.9 3.6 chr10   60964 61180 (135473567)      C   L1M4  LINE     L1  (2130)  4016
2  1006 22.4 0.0 0.0 chr10   61181 61345 (135473402)      C   FRAM  SINE    Alu    (11)   165
3  1706 22.2 8.9 3.6 chr10   61346 61712 (135473035)      C   L1M4  LINE     L1  (2368)  3778
4 21526 12.8 1.9 5.0 chr10   63293 67260 (135467487)      + L1PA17  LINE     L1    1184  5132
5  1366 12.9 0.0 0.0 chr10   67261 67438 (135467309)      + AluSq2  SINE    Alu     136   313
6 21526 12.8 1.9 5.0 chr10   67439 69192 (135465555)      + L1PA17  LINE     L1    5133  6880
  s_left id
1   3779  1
2      1  2
3   3398  1
4 (1750)  3
5    (0)  4
6    (2)  3

Let's calculate some numbers!

#what are the repeat classes?
names(table(df$class))
 [1] "DNA"            "DNA?"           "LINE"           "Low_complexity" "LTR"           
 [6] "LTR?"           "RC"             "RC?"            "Retroposon"     "RNA"           
[11] "rRNA"           "Satellite"      "scRNA"          "Simple_repeat"  "SINE"          
[16] "SINE?"          "snRNA"          "srpRNA"         "tRNA"           "Unknown"

#what are the repeat families?
names(table(df$family))
 [1] "5S-Deu-L2"      "acro"           "Alu"            "centr"          "CR1"           
 [6] "DNA"            "DNA?"           "Dong-R4"        "ERV1"           "ERV1?"         
[11] "ERVK"           "ERVL"           "ERVL-MaLR"      "ERVL?"          "Gypsy"         
[16] "Gypsy?"         "hAT"            "hAT-Ac"         "hAT-Blackjack"  "hAT-Charlie"   
[21] "hAT-Tag1"       "hAT-Tip100"     "hAT-Tip100?"    "hAT?"           "Helitron"      
[26] "Helitron?"      "L1"             "L2"             "Low_complexity" "LTR"           
[31] "LTR?"           "Merlin"         "MIR"            "MULE-MuDR"      "Penelope"      
[36] "PIF-Harbinger"  "PiggyBac"       "PiggyBac?"      "RNA"            "rRNA"          
[41] "RTE-BovB"       "RTE-X"          "Satellite"      "scRNA"          "Simple_repeat" 
[46] "SINE?"          "snRNA"          "srpRNA"         "SVA"            "TcMar"         
[51] "TcMar-Mariner"  "TcMar-Pogo"     "TcMar-Tc2"      "TcMar-Tigger"   "TcMar?"        
[56] "telo"           "tRNA"           "tRNA-Deu"       "tRNA-RTE"       "Unknown"

#how many LINEs?
nrow(unique(df[df$class=='LINE',c('query','class','id')]))
[1] 995183

#how many SINEs?
nrow(unique(df[df$class=='SINE',c('query','class','id')]))
[1] 1705701

#how many LTRs?
nrow(unique(df[df$class=='LTR',c('query','class','id')]))
[1] 511697
Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
11 comments Add yours
  1. The number of repeats per class (and family) are not correct because RepeatMasker uses separate entries for multiple repeats. The ID line that you weren’t sure of, that identifies a single repeat. Therefore some lines will have the same ID b/c they are the same repeat. This is why you have 1.4 million LINES when the true number is 500,000.

    1. Hi again Chris,

      I looked into this a bit more and it got a bit interesting. The RepeatMasker results downloaded from the UCSC Genome Browser for hg19 (which this post is based on) has only nine ID values in the five million or so lines.

      gunzip -c hg19_rmsk.tsv.gz | awk '{print $17}' | sort -u
      1
      2
      3
      4
      5
      6
      7
      8
      9
      id
      

      If I check out the database description of the ID column via the Table Browser tool, this is what it says:

      “First digit of id field in RepeatMasker .out file. Best ignored.”

      That must be why I wasn’t sure what the ID column was (according to my comment).

      However, when I go to the .out files generated by RepeatMasker, you are of course correct. I’m not sure why the results stored on the UCSC Genome Browser database is as it is.

      Cheers,

      Dave

      1. Hi Dave,

        Interesting finding. I had never used the track from UCSC so I wasn’t aware of what happened with their ID line. Seems like an issue. I usually repeatmask the genomes myself for various reasons. I found your page while looking for a nice table of the repeat mask (tbl) file for the entire human genome classes, families, and subfamilies all In one table so I could make some graphics. Surprisingly hard to find!

        I really like the analysis you’ve done here, especially using R. If you could get the corrected numbers I think you (or I) could really extend repeat characterization to a nice consistent way. For instance I’m trying to compare some next ten seq data repeat abundance back to reference genomes. Doing this in R would be simple, elegant, and scalable. Likewise with more species genomes coming online, it would be a nice start foe comparisons and visualizations.

        I’ll try to put something together when I get back from a conference. Keep looking at it!

    2. I recalculated the LINE count using a RepeatMasker run I performed on hg19 and by counting unique IDs; the number of LINEs is 995,183. It has decreased from ~1.5 million but it is roughly twice as much as your 500,000 number. The 995,183 number is the same if I tally the number of LINEs from all the *.tbl files. I have included a link in the post to all the concatenated *.out files generated by RepeatMasker.

      1. Hi Dave,

        This is very interesting. You’ve done some really useful work here. I believe your new count based on hg19 is actually correct. My 500,000 figure was based on what I had read in various papers which all seem to trace back to the initial sequencing of the human genome. In fact if you look on wikipedia article on “Retrotransposon” you will see the quote “The human genome, for example, contains about 500,000 LINEs, which is roughly 17% of the genome.[14]”. However the wikipedia article on “Long Interspersed nuclear element” in humans, you will see the figure of total LINES as 850,000 total, with 500,000 L1’s.

        So I am strongly of the opinion that your latest count of 995,000 is actually the most accurate. (This was exactly the sort of info I was looking for btw.) I looks like the confusion came from 1) The early unfinished sequence of the human genome being updated over the last 14 years 2) a game of telephone, everyone citing old info and 3) Confusing the number of L1’s as the number of total LINEs since L1 is the most abundant and gets the most attention.

        So! I think you’ve got it now. I am downloading your .out files and .tbl files. I realize this post was several years old but with the updates you’ve made it will be useful to me and many others I suspect. I’ll post again if I find anything else. Please link to any publications you may have that include transposon information so that I can cite you in my next publication.

        http://en.wikipedia.org/wiki/Long_interspersed_nuclear_element#In_human

  2. Hi Dave and Chris, I like your analysis too – Thanks Dave. I’d like to ask two things. 1. You updated this post again today (2015-03-25), so all the numbers in this post are came from the most up-to-date UCSC rmsk table, right? 2. Have you thought about the milliDiv, milliDel, and milliIns? I’m just starting to study about the Repeats and found that it seems like there is some levels of confidence for the aligned repeats across genome. So I am just wondering whether is it good to filtering out some low level of aligned repeats in my analysis. Hope to hear both of your thought. Cheers, Namshik

    1. Hi Namshik,

      thanks for the comment. Here are my replies:

      1. The latest update refers to numbers calculated on my own RepeatMasker run on hg19 (see the last part of the post). I provided a link to download those files.

      2. For now I haven’t thought about the alignment quality. What was good enough for a “match” by RepeatMasker was good enough for me.

      Cheers,

      Dave

  3. Dear Dave,

    First of all, thank you very much for all your scripts and ‘musings’!
    I find I am regularly returning to your pages and find them very help and insightful.

    As I was shamelessly using your R script, I detected a small omission. When calculating how many LINEs, SINEs and LTRs are present, there is actually no object made for the LTRs:
    ltr <- table(hg19_repeat[hg19_repeat$repClass=="LTR",]$repName)
    is missing.

    Keep up the 'musings'!

    with kind regards,
    Aldo

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.