Finding composite repetitive elements

Transposons have the ability to "jump" around in genomes and sometimes transposons jump into genomic sites occupied by other repetitive elements; these cases are what I refer to as "composite repetitive elements" for the purpose of this post. While almost all DNA transposons and the majority of retrotransposons have lost the ability to move around in the human genome, transposition events that have occurred in the past are captured within the genome sequence. This post is about finding composite repetitive elements in the human genome based on RepeatMasker annotations.

Firstly let's download the RepeatMasker annotation of the hg38 genome using the Table Browser tool over at the UCSC Genome Browser site:

hg38_rmsk_table_browser

Next, let's inspect the downloaded file:

#I'm not sure if the Table Browser
#will always generate the same file
#but here's the checksum nonetheless
md5 hg38_rmsk.bed.gz 
MD5 (hg38_rmsk.bed.gz) = aff9058746acd03dcd0922c688e636b2

#how many entries?
gunzip -c hg38_rmsk.bed.gz | wc -l
 5520017

#what does the file look like?
gunzip -c hg38_rmsk.bed.gz | head -5
chr1	67108753	67109046	L1P5	1892	+
chr1	8388315	8388618	AluY	2582	-
chr1	25165803	25166380	L1MB5	4085	+
chr1	33554185	33554483	AluSc	2285	-
chr1	41942894	41943205	AluY	2451	-

#let's sort the file
gunzip -c hg38_rmsk.bed.gz | sort -k1,1 -k2,2n -k6,6 | gzip > hg38_rmsk_sorted.bed.gz

#how does the sorted file look?
gunzip -c hg38_rmsk_sorted.bed.gz | head -5
chr1	10000	10468	(TAACCC)n	463	+
chr1	10468	11447	TAR1	3612	-
chr1	11504	11675	L1MC5a	484	-
chr1	11677	11780	MER5B	239	-
chr1	15264	15355	MIR3	318	-

#remove the unsorted file
rm hg38_rmsk.bed.gz

The simplest way of finding composite repetitive elements would be to use BEDTools. Specifically, I will merge repetitive elements that are contiguous, i.e. right next to each other, and then perform an intersection of the merged file with all the repetitive elements. By examining the results of this intersection file, I can easily see which of the entries in the merged file, are a composite. (Note that this doesn't necessarily capture transposons that have jumped into other transposons.)

#how many entries in this sorted file?
gunzip -c hg38_rmsk_sorted.bed.gz | wc -l
 5520017

#what does the sorted file look like again
gunzip -c hg38_rmsk_sorted.bed.gz | head -5
chr1	10000	10468	(TAACCC)n	463	+
chr1	10468	11447	TAR1	3612	-
chr1	11504	11675	L1MC5a	484	-
chr1	11677	11780	MER5B	239	-
chr1	15264	15355	MIR3	318	-

#how many entries are left after merging?
bedtools merge -i hg38_rmsk_sorted.bed.gz | wc -l
 4128893

#save the results of the merge
bedtools merge -i hg38_rmsk_sorted.bed.gz | gzip > hg38_rmsk_merged.bed.gz

#how does the merged file look like?
gunzip -c hg38_rmsk_merged.bed.gz | head -5
chr1	10000	11447
chr1	11504	11675
chr1	11677	11780
chr1	15264	15355
chr1	15797	15849

#perform a genome intersection of the merged file
#to the original entries
#use -wa to output the original coordinates of -a
#use -wb to output the original coordinates of -b
intersectBed -a hg38_rmsk_merged.bed.gz -b hg38_rmsk_sorted.bed.gz -wa -wb | head
chr1	10000	11447	chr1	10000	10468	(TAACCC)n	463	+
chr1	10000	11447	chr1	10468	11447	TAR1	3612	-
chr1	11504	11675	chr1	11504	11675	L1MC5a	484	-
chr1	11677	11780	chr1	11677	11780	MER5B	239	-
chr1	15264	15355	chr1	15264	15355	MIR3	318	-
chr1	15797	15849	chr1	15797	15849	(TGCTCC)n	18	+
chr1	16712	16744	chr1	16712	16744	(TGG)n	18	+
chr1	18906	19048	chr1	18906	19048	L2a	239	+
chr1	19971	20405	chr1	19971	20405	L3	994	+
chr1	20530	20679	chr1	20530	20679	Plat_L3	270	+

#save the results from the intersection
intersectBed -a hg38_rmsk_merged.bed.gz -b hg38_rmsk_sorted.bed.gz -wa -wb | gzip > hg38_rmsk_merged_intersect.tsv.gz

If you take a look above, we can see that the coordinates chr1:10000-11447 are a composite of two repetitive elements, namely (TAACCC)n and TAR1. Since the coordinates of the merged entry occur twice, we know that two separate repeats were merged together. Based on this, we can now count the number of composite repetitive elements.

#take only the first three columns
#of the intersection results
gunzip -c hg38_rmsk_merged_intersect.tsv.gz | cut -f1-3 | head
chr1	10000	11447
chr1	10000	11447
chr1	11504	11675
chr1	11677	11780
chr1	15264	15355
chr1	15797	15849
chr1	16712	16744
chr1	18906	19048
chr1	19971	20405
chr1	20530	20679

#find the unique rows and also
#provide the number of times the unique row occurs
#the results are already sorted, so I don't need to run sort
gunzip -c hg38_rmsk_merged_intersect.tsv.gz | cut -f1-3 | uniq -c | head
   2 chr1	10000	11447
   1 chr1	11504	11675
   1 chr1	11677	11780
   1 chr1	15264	15355
   1 chr1	15797	15849
   1 chr1	16712	16744
   1 chr1	18906	19048
   1 chr1	19971	20405
   1 chr1	20530	20679
   1 chr1	21948	22075

#the unique rows that occur more than once
#are the composite elements
gunzip -c hg38_rmsk_merged_intersect.tsv.gz | cut -f1-3 | uniq -c | awk '$1>1 {print}' | head
   2 chr1	10000	11447
   3 chr1	26582	27137
   2 chr1	30854	31131
   3 chr1	31292	31754
   2 chr1	35216	35499
   4 chr1	38255	40294
   3 chr1	43242	45753
   3 chr1	61603	62229
   2 chr1	72089	72163
   2 chr1	72184	72897

#how many composite repetitive elements?
gunzip -c hg38_rmsk_merged_intersect.tsv.gz | cut -f1-3 | uniq -c | awk '$1>1 {print}' | wc -l
  656891

#save the results
gunzip -c hg38_rmsk_merged_intersect.tsv.gz | cut -f1-3 | uniq -c | awk '$1>1 {OFS="\t"; print $2,$3,$4}' | gzip > hg38_rmsk_composite.bed.gz

The 656,891 number refers to the number of times two or more repetitive elements were merged into one because they were spatially right next to each other. This is the point where I find it easier to summarise the results of the intersection between merged entries with the individual repeats using a script.

#!/usr/bin/env perl

use strict;
use warnings;

my $infile = 'hg38_rmsk_merged_intersect.tsv.gz';

my %result = ();
my @order = ();

open(IN,'-|',"gunzip -c $infile") || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   my @l = split(/\t/);
   my $id = $l[0] . ':' . $l[1] . '-' . $l[2];
   if (exists $result{$id}){
      my $i = scalar(@{$result{$id}});
      $result{$id}[$i] = $l[6];
   } else {
      $result{$id}[0] = $l[6];
      push(@order, $id);
   }
}
close(IN);

foreach my $id (@order){
   my $l = "$id\t";
   next if scalar(@{$result{$id}}) == 1;
   foreach my $rep (@{$result{$id}}){
      $l .= "$rep\t";
   }
   $l =~ s/\t$//;
   print "$l\n";
}

exit(0);

Now to run the script and to examine the results:

composite.pl | gzip > hg38_rmsk_composite_combination.txt.gz

gunzip -c hg38_rmsk_composite_combination.txt.gz | head
chr1:10000-11447        (TAACCC)n       TAR1
chr1:26582-27137        L2c     AluSp   L2c
chr1:30854-31131        (TC)n   MLT1A
chr1:31292-31754        MIRc    AluJo   MIRc
chr1:35216-35499        MIRb    AluJr
chr1:38255-40294        MLT1E1A-int     MLT1E1A AluSx   MLT1E1A
chr1:43242-45753        L1MA8   (AAAT)n L1MA8
chr1:61603-62229        L1ME4c  AluSc   L1ME4c
chr1:72089-72163        (TA)n   (ATAC)n
chr1:72184-72897        L1PA7   L1PA7

At the genomic loci, chr1:26582-27137, we clearly see a SINE in between a LINE, which can be parsimoniously explained by a single insertion event of a SINE into a LINE. Here's a genome browser screenshot:

composite_repetitive_element_example

However, some of the "composite elements" are simple repeats that are directly next to another repetitive element, for example, at chr1:10000-11447. These are technically composite repetitive elements but not transposons within transposons. The composite.pl script can be modified to identify these cases.




Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
2 comments Add yours
  1. How could I do that?
    “The composite.pl script can be modified to identify these cases.”
    I don’t know perl, could you share your ideas?
    Thanks

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.