If you've ever wondered how mappable a specific repeat is, here's a quick post on creating a plot showing the mappability of a repetitive element along its consensus sequence. Specifically, the consensus sequence of a repeat was taken and sub-sequences were created by a sliding window approach (i.e. moving along the sequence) at 1 bp intervals and these sub-sequences were mapped to hg19.
I will use bwa for the mapping, so firstly download and compile bwa:
wget http://sourceforge.net/projects/bio-bwa/files/latest/bwa-0.7.7.tar.bz2 tar -xjf bwa-0.7.7.tar.bz2 cd bwa-0.7.7 make #index hg19 bwa index hg19.fa
Now download RepBase from http://www.girinst.org/server/RepBase/index.php; you need to register but registration is free. For this example, I will use window sizes of 30 bp (i.e. read length of 30) and the repetitive element LTR1F2:
#extract the contents tar -xzf RepBase19.03.fasta.tar.gz #set some parameters export my_repeat=LTR1F2 export my_read_length=30 #extract LTR1F2 from the human repeats #the Perl code simply starts printing lines #when it matches LTR1F2 #it exits when it comes across the next fasta entry cat RepBase19.03.fasta/humrep.ref | perl -nle 'BEGIN {$s=0}; if (/^>$ENV{my_repeat}\b/){ $s = 1; print; next }; if ($s==1){ if ($_ !~ /^>/){ print } else { exit } }' > $my_repeat.fa #check out the fasta cat $my_repeat.fa >LTR1F2 ERV1 Primates tgatacggacaggaggcagggaaatactgggtagaagagggcggggtccccggcgagggccccaccctca agcctggacccgcggccctaaatgagaacanncatccctgttttcccgcccgaatgttgccttttccaaa accaccctggcccgccacgccccccatcctgtacccataaaaaccccaaactccactggcagaggagcag agcggcgcggcagagaaggagagaagagaagaagcgtctgaacgtcgagaggagttcggctggggacggt cggagaggagttcggccggggacggccgaactccaggggaagattatcttcccactccatcccctttcca gctccccatcccgctgagagccacctccatcactcaataaaacctccgcattcaccatccttcaagtccg tgtgacctgattcttcctggacgccggacaaggacccgggtaccaagagggcagggtgtaaaaggctgtc accctgactctccactgagctggttaacacttagccgtccgcggacggcaactgctaaaagagcattaat tgtaacacacccctagacgctgccgtggggccggagcccaaaagcgctcgccccggccccggcacccgct cgcctgcgtgctccccctcccgcaaggggtttgagcgcggcggccgagtaagcgagccacacccctgtcg caagtcccgcgaaggggtcaagggaactctcccgtctca #slide across the consensus sequence at 1 bp intervals cat $my_repeat.fa | grep -v "^>" | perl -nle '$s.=$_; END {for ($i=0; $i<=length($s)-$ENV{my_read_length}; ++$i){ print ">$i\n", substr($s,$i,$ENV{my_read_length})} }' > ${my_repeat}_split.mfa #check out the split file head ${my_repeat}_split.mfa >0 tgatacggacaggaggcagggaaatactgg >1 gatacggacaggaggcagggaaatactggg >2 atacggacaggaggcagggaaatactgggt >3 tacggacaggaggcagggaaatactgggta >4 acggacaggaggcagggaaatactgggtag tail ${my_repeat}_split.mfa >705 cccgcgaaggggtcaagggaactctcccgt >706 ccgcgaaggggtcaagggaactctcccgtc >707 cgcgaaggggtcaagggaactctcccgtct >708 gcgaaggggtcaagggaactctcccgtctc >709 cgaaggggtcaagggaactctcccgtctca #location of the genome genome=/home/`whoami`/genome/bwa_index/hg19.fa #base name base=${my_repeat}_split bwa aln -t 8 -n 2 $genome $base.mfa > $base.sai bwa samse $genome $base.sai $base.mfa > $base.sam samtools view -S -b $base.sam > $base.bam samtools sort $base.bam ${base}_sorted samtools index ${base}_sorted.bam rm $base.sai $base.sam $base.bam #create a table where the first column is #the read and the second column is #the number of places it maps to samtools view ${base}_sorted.bam | sort -k1n | perl -nle 'if (/^(\d+).*X0:i:(\d+)\s+/){ print "$1\t$2" } elsif (/^(\d+)/){ print "$1\t0" } else { die "Could not parse line $." }' > ${my_repeat}_mappability.tsv
I wrote an R script called plot_repeat.R, which takes in the mappability table and plots the mappability of each position of the consensus repeat:
#!/bin/env Rscript args <- commandArgs(TRUE) #check for 1 arguments args_length <- length(args) if (args_length != 1){ stop("Please input filename") } my_file <- args[1] my_title <- gsub('_mappability.tsv', '', my_file) #read data data <- read.table(my_file, header=F, stringsAsFactors=F) dim(data) plot(data, type='l', main=my_title, xlab='Position along repeat', ylab='Number of times mapped in genome') abline(h=1) table(data$V2==1)
Run the R script:
plot_repeat.R ${my_repeat}_mappability.tsv #use imagemagick to convert the pdf to png convert Rplots.pdf -alpha off $my_repeat.png rm Rplots.pdf
The 5' end of the LTR1F2 repeat is the most ambiguous.
Mappability with longer reads
Let's put all the code together and rerun the analysis with a larger window size, i.e., longer reads. Just adjust the parameters:
#define parameters export my_repeat=LTR1F2 export my_read_length=35 genome=/home/`whoami`/genome/bwa_index/hg19.fa base=${my_repeat}_${my_read_length} #extract repeat cat RepBase19.03.fasta/humrep.ref | perl -nle 'BEGIN {$s=0}; if (/^>$ENV{my_repeat}\b/){ $s = 1; print; next }; if ($s==1){ if ($_ !~ /^>/){ print } else { exit } }' > $base.fa #split repeat cat $base.fa | grep -v "^>" | perl -nle '$s.=$_; END {for ($i=0; $i<=length($s)-$ENV{my_read_length}; ++$i){ print ">$i\n", substr($s,$i,$ENV{my_read_length})} }' > ${base}.mfa #align with bwa bwa aln -t 8 -n 2 $genome $base.mfa > $base.sai bwa samse $genome $base.sai $base.mfa > $base.sam samtools view -S -b $base.sam > $base.bam samtools sort $base.bam ${base}_sorted samtools index ${base}_sorted.bam rm $base.sai $base.sam $base.bam #create mappability table samtools view ${base}_sorted.bam | sort -k1n | perl -nle 'if (/^(\d+).*X0:i:(\d+)\s+/){ print "$1\t$2" } elsif (/^(\d+)/){ print "$1\t0" } else { die "Could not parse line $." }' > ${base}_mappability.tsv #plot plot_repeat.R ${base}_mappability.tsv #use imagemagick to convert the pdf to png convert Rplots.pdf -alpha off $base.png rm Rplots.pdf
Even with 35bp reads, most of the sequence along the repeat is quite ambiguous.
Conclusions
For this post I've taken the consensus of one particular repeat, and slid across the sequence at 1 bp intervals using 30 and 35 window sizes, and collected the sub-sequences. These sub-sequences were mapped to the genome using the bwa mapping program (allowing up to 2 mismatches), which provides a flag that lets you know how many times a read multimapped to the reference genome. A plot showing the number of times a window multimapped in the genome was produced using R.
By simply adjusting the repeat and read length parameters, you can quickly produce a plot that allows you to see how mappable a specific repeat is at a particular read length.

This work is licensed under a Creative Commons
Attribution 4.0 International License.