Transposons have the ability to "jump" around in genomes and sometimes transposons jump into genomic sites occupied by other repetitive elements; these cases are what I refer to as "composite repetitive elements" for the purpose of this post. While almost all DNA transposons and the majority of retrotransposons have lost the ability to move around in the human genome, transposition events that have occurred in the past are captured within the genome sequence. This post is about finding composite repetitive elements in the human genome based on RepeatMasker annotations.
Firstly let's download the RepeatMasker annotation of the hg38 genome using the Table Browser tool over at the UCSC Genome Browser site:
Next, let's inspect the downloaded file:
#I'm not sure if the Table Browser #will always generate the same file #but here's the checksum nonetheless md5 hg38_rmsk.bed.gz MD5 (hg38_rmsk.bed.gz) = aff9058746acd03dcd0922c688e636b2 #how many entries? gunzip -c hg38_rmsk.bed.gz | wc -l 5520017 #what does the file look like? gunzip -c hg38_rmsk.bed.gz | head -5 chr1 67108753 67109046 L1P5 1892 + chr1 8388315 8388618 AluY 2582 - chr1 25165803 25166380 L1MB5 4085 + chr1 33554185 33554483 AluSc 2285 - chr1 41942894 41943205 AluY 2451 - #let's sort the file gunzip -c hg38_rmsk.bed.gz | sort -k1,1 -k2,2n -k6,6 | gzip > hg38_rmsk_sorted.bed.gz #how does the sorted file look? gunzip -c hg38_rmsk_sorted.bed.gz | head -5 chr1 10000 10468 (TAACCC)n 463 + chr1 10468 11447 TAR1 3612 - chr1 11504 11675 L1MC5a 484 - chr1 11677 11780 MER5B 239 - chr1 15264 15355 MIR3 318 - #remove the unsorted file rm hg38_rmsk.bed.gz
The simplest way of finding composite repetitive elements would be to use BEDTools. Specifically, I will merge repetitive elements that are contiguous, i.e. right next to each other, and then perform an intersection of the merged file with all the repetitive elements. By examining the results of this intersection file, I can easily see which of the entries in the merged file, are a composite. (Note that this doesn't necessarily capture transposons that have jumped into other transposons.)
#how many entries in this sorted file? gunzip -c hg38_rmsk_sorted.bed.gz | wc -l 5520017 #what does the sorted file look like again gunzip -c hg38_rmsk_sorted.bed.gz | head -5 chr1 10000 10468 (TAACCC)n 463 + chr1 10468 11447 TAR1 3612 - chr1 11504 11675 L1MC5a 484 - chr1 11677 11780 MER5B 239 - chr1 15264 15355 MIR3 318 - #how many entries are left after merging? bedtools merge -i hg38_rmsk_sorted.bed.gz | wc -l 4128893 #save the results of the merge bedtools merge -i hg38_rmsk_sorted.bed.gz | gzip > hg38_rmsk_merged.bed.gz #how does the merged file look like? gunzip -c hg38_rmsk_merged.bed.gz | head -5 chr1 10000 11447 chr1 11504 11675 chr1 11677 11780 chr1 15264 15355 chr1 15797 15849 #perform a genome intersection of the merged file #to the original entries #use -wa to output the original coordinates of -a #use -wb to output the original coordinates of -b intersectBed -a hg38_rmsk_merged.bed.gz -b hg38_rmsk_sorted.bed.gz -wa -wb | head chr1 10000 11447 chr1 10000 10468 (TAACCC)n 463 + chr1 10000 11447 chr1 10468 11447 TAR1 3612 - chr1 11504 11675 chr1 11504 11675 L1MC5a 484 - chr1 11677 11780 chr1 11677 11780 MER5B 239 - chr1 15264 15355 chr1 15264 15355 MIR3 318 - chr1 15797 15849 chr1 15797 15849 (TGCTCC)n 18 + chr1 16712 16744 chr1 16712 16744 (TGG)n 18 + chr1 18906 19048 chr1 18906 19048 L2a 239 + chr1 19971 20405 chr1 19971 20405 L3 994 + chr1 20530 20679 chr1 20530 20679 Plat_L3 270 + #save the results from the intersection intersectBed -a hg38_rmsk_merged.bed.gz -b hg38_rmsk_sorted.bed.gz -wa -wb | gzip > hg38_rmsk_merged_intersect.tsv.gz
If you take a look above, we can see that the coordinates chr1:10000-11447 are a composite of two repetitive elements, namely (TAACCC)n and TAR1. Since the coordinates of the merged entry occur twice, we know that two separate repeats were merged together. Based on this, we can now count the number of composite repetitive elements.
#take only the first three columns #of the intersection results gunzip -c hg38_rmsk_merged_intersect.tsv.gz | cut -f1-3 | head chr1 10000 11447 chr1 10000 11447 chr1 11504 11675 chr1 11677 11780 chr1 15264 15355 chr1 15797 15849 chr1 16712 16744 chr1 18906 19048 chr1 19971 20405 chr1 20530 20679 #find the unique rows and also #provide the number of times the unique row occurs #the results are already sorted, so I don't need to run sort gunzip -c hg38_rmsk_merged_intersect.tsv.gz | cut -f1-3 | uniq -c | head 2 chr1 10000 11447 1 chr1 11504 11675 1 chr1 11677 11780 1 chr1 15264 15355 1 chr1 15797 15849 1 chr1 16712 16744 1 chr1 18906 19048 1 chr1 19971 20405 1 chr1 20530 20679 1 chr1 21948 22075 #the unique rows that occur more than once #are the composite elements gunzip -c hg38_rmsk_merged_intersect.tsv.gz | cut -f1-3 | uniq -c | awk '$1>1 {print}' | head 2 chr1 10000 11447 3 chr1 26582 27137 2 chr1 30854 31131 3 chr1 31292 31754 2 chr1 35216 35499 4 chr1 38255 40294 3 chr1 43242 45753 3 chr1 61603 62229 2 chr1 72089 72163 2 chr1 72184 72897 #how many composite repetitive elements? gunzip -c hg38_rmsk_merged_intersect.tsv.gz | cut -f1-3 | uniq -c | awk '$1>1 {print}' | wc -l 656891 #save the results gunzip -c hg38_rmsk_merged_intersect.tsv.gz | cut -f1-3 | uniq -c | awk '$1>1 {OFS="\t"; print $2,$3,$4}' | gzip > hg38_rmsk_composite.bed.gz
The 656,891 number refers to the number of times two or more repetitive elements were merged into one because they were spatially right next to each other. This is the point where I find it easier to summarise the results of the intersection between merged entries with the individual repeats using a script.
#!/usr/bin/env perl use strict; use warnings; my $infile = 'hg38_rmsk_merged_intersect.tsv.gz'; my %result = (); my @order = (); open(IN,'-|',"gunzip -c $infile") || die "Could not open $infile: $!\n"; while(<IN>){ chomp; my @l = split(/\t/); my $id = $l[0] . ':' . $l[1] . '-' . $l[2]; if (exists $result{$id}){ my $i = scalar(@{$result{$id}}); $result{$id}[$i] = $l[6]; } else { $result{$id}[0] = $l[6]; push(@order, $id); } } close(IN); foreach my $id (@order){ my $l = "$id\t"; next if scalar(@{$result{$id}}) == 1; foreach my $rep (@{$result{$id}}){ $l .= "$rep\t"; } $l =~ s/\t$//; print "$l\n"; } exit(0);
Now to run the script and to examine the results:
composite.pl | gzip > hg38_rmsk_composite_combination.txt.gz gunzip -c hg38_rmsk_composite_combination.txt.gz | head chr1:10000-11447 (TAACCC)n TAR1 chr1:26582-27137 L2c AluSp L2c chr1:30854-31131 (TC)n MLT1A chr1:31292-31754 MIRc AluJo MIRc chr1:35216-35499 MIRb AluJr chr1:38255-40294 MLT1E1A-int MLT1E1A AluSx MLT1E1A chr1:43242-45753 L1MA8 (AAAT)n L1MA8 chr1:61603-62229 L1ME4c AluSc L1ME4c chr1:72089-72163 (TA)n (ATAC)n chr1:72184-72897 L1PA7 L1PA7
At the genomic loci, chr1:26582-27137, we clearly see a SINE in between a LINE, which can be parsimoniously explained by a single insertion event of a SINE into a LINE. Here's a genome browser screenshot:
However, some of the "composite elements" are simple repeats that are directly next to another repetitive element, for example, at chr1:10000-11447. These are technically composite repetitive elements but not transposons within transposons. The composite.pl script can be modified to identify these cases.

This work is licensed under a Creative Commons
Attribution 4.0 International License.
How could I do that?
“The composite.pl script can be modified to identify these cases.”
I don’t know perl, could you share your ideas?
Thanks
Hi Paulo,
what would you like to identify?
Cheers,
Dave