How mappable is a specific repeat?

If you've ever wondered how mappable a specific repeat is, here's a quick post on creating a plot showing the mappability of a repetitive element along its consensus sequence. Specifically, the consensus sequence of a repeat was taken and sub-sequences were created by a sliding window approach (i.e. moving along the sequence) at 1 bp intervals and these sub-sequences were mapped to hg19.

I will use bwa for the mapping, so firstly download and compile bwa:

wget http://sourceforge.net/projects/bio-bwa/files/latest/bwa-0.7.7.tar.bz2
tar -xjf bwa-0.7.7.tar.bz2
cd bwa-0.7.7
make

#index hg19
bwa index hg19.fa

Now download RepBase from http://www.girinst.org/server/RepBase/index.php; you need to register but registration is free. For this example, I will use window sizes of 30 bp (i.e. read length of 30) and the repetitive element LTR1F2:

#extract the contents
tar -xzf RepBase19.03.fasta.tar.gz

#set some parameters
export my_repeat=LTR1F2
export my_read_length=30

#extract LTR1F2 from the human repeats
#the Perl code simply starts printing lines
#when it matches LTR1F2
#it exits when it comes across the next fasta entry
cat RepBase19.03.fasta/humrep.ref | perl -nle 'BEGIN {$s=0}; if (/^>$ENV{my_repeat}\b/){ $s = 1; print; next }; if ($s==1){ if ($_ !~ /^>/){ print } else { exit } }' > $my_repeat.fa

#check out the fasta
cat $my_repeat.fa
>LTR1F2 ERV1    Primates
tgatacggacaggaggcagggaaatactgggtagaagagggcggggtccccggcgagggccccaccctca
agcctggacccgcggccctaaatgagaacanncatccctgttttcccgcccgaatgttgccttttccaaa
accaccctggcccgccacgccccccatcctgtacccataaaaaccccaaactccactggcagaggagcag
agcggcgcggcagagaaggagagaagagaagaagcgtctgaacgtcgagaggagttcggctggggacggt
cggagaggagttcggccggggacggccgaactccaggggaagattatcttcccactccatcccctttcca
gctccccatcccgctgagagccacctccatcactcaataaaacctccgcattcaccatccttcaagtccg
tgtgacctgattcttcctggacgccggacaaggacccgggtaccaagagggcagggtgtaaaaggctgtc
accctgactctccactgagctggttaacacttagccgtccgcggacggcaactgctaaaagagcattaat
tgtaacacacccctagacgctgccgtggggccggagcccaaaagcgctcgccccggccccggcacccgct
cgcctgcgtgctccccctcccgcaaggggtttgagcgcggcggccgagtaagcgagccacacccctgtcg
caagtcccgcgaaggggtcaagggaactctcccgtctca

#slide across the consensus sequence at 1 bp intervals
cat $my_repeat.fa | grep -v "^>" | perl -nle '$s.=$_; END {for ($i=0; $i<=length($s)-$ENV{my_read_length}; ++$i){ print ">$i\n", substr($s,$i,$ENV{my_read_length})} }' > ${my_repeat}_split.mfa

#check out the split file
head ${my_repeat}_split.mfa
>0
tgatacggacaggaggcagggaaatactgg
>1
gatacggacaggaggcagggaaatactggg
>2
atacggacaggaggcagggaaatactgggt
>3
tacggacaggaggcagggaaatactgggta
>4
acggacaggaggcagggaaatactgggtag

tail ${my_repeat}_split.mfa
>705
cccgcgaaggggtcaagggaactctcccgt
>706
ccgcgaaggggtcaagggaactctcccgtc
>707
cgcgaaggggtcaagggaactctcccgtct
>708
gcgaaggggtcaagggaactctcccgtctc
>709
cgaaggggtcaagggaactctcccgtctca

#location of the genome
genome=/home/`whoami`/genome/bwa_index/hg19.fa
#base name
base=${my_repeat}_split

bwa aln -t 8 -n 2 $genome $base.mfa > $base.sai
bwa samse $genome $base.sai $base.mfa > $base.sam
samtools view -S -b $base.sam > $base.bam
samtools sort $base.bam ${base}_sorted
samtools index ${base}_sorted.bam
rm $base.sai $base.sam $base.bam

#create a table where the first column is
#the read and the second column is
#the number of places it maps to
samtools view ${base}_sorted.bam | sort -k1n | perl -nle 'if (/^(\d+).*X0:i:(\d+)\s+/){ print "$1\t$2" } elsif (/^(\d+)/){ print "$1\t0" } else { die "Could not parse line $." }' > ${my_repeat}_mappability.tsv

I wrote an R script called plot_repeat.R, which takes in the mappability table and plots the mappability of each position of the consensus repeat:

#!/bin/env Rscript

args <- commandArgs(TRUE)
#check for 1 arguments
   args_length <- length(args)
   if (args_length != 1){
      stop("Please input filename")
   }
my_file <- args[1]

my_title <- gsub('_mappability.tsv', '', my_file)

#read data
data <- read.table(my_file, header=F, stringsAsFactors=F)
dim(data)
plot(data, type='l', main=my_title, xlab='Position along repeat', ylab='Number of times mapped in genome')
abline(h=1)
table(data$V2==1)

Run the R script:

plot_repeat.R ${my_repeat}_mappability.tsv
#use imagemagick to convert the pdf to png
convert Rplots.pdf -alpha off $my_repeat.png
rm Rplots.pdf

LTR1F2_30The 5' end of the LTR1F2 repeat is the most ambiguous.

Mappability with longer reads

Let's put all the code together and rerun the analysis with a larger window size, i.e., longer reads. Just adjust the parameters:

#define parameters
export my_repeat=LTR1F2
export my_read_length=35
genome=/home/`whoami`/genome/bwa_index/hg19.fa
base=${my_repeat}_${my_read_length}

#extract repeat
cat RepBase19.03.fasta/humrep.ref | perl -nle 'BEGIN {$s=0}; if (/^>$ENV{my_repeat}\b/){ $s = 1; print; next }; if ($s==1){ if ($_ !~ /^>/){ print } else { exit } }' > $base.fa

#split repeat
cat $base.fa | grep -v "^>" | perl -nle '$s.=$_; END {for ($i=0; $i<=length($s)-$ENV{my_read_length}; ++$i){ print ">$i\n", substr($s,$i,$ENV{my_read_length})} }' > ${base}.mfa

#align with bwa
bwa aln -t 8 -n 2 $genome $base.mfa > $base.sai
bwa samse $genome $base.sai $base.mfa > $base.sam
samtools view -S -b $base.sam > $base.bam
samtools sort $base.bam ${base}_sorted
samtools index ${base}_sorted.bam
rm $base.sai $base.sam $base.bam

#create mappability table
samtools view ${base}_sorted.bam | sort -k1n | perl -nle 'if (/^(\d+).*X0:i:(\d+)\s+/){ print "$1\t$2" } elsif (/^(\d+)/){ print "$1\t0" } else { die "Could not parse line $." }' > ${base}_mappability.tsv

#plot
plot_repeat.R ${base}_mappability.tsv
#use imagemagick to convert the pdf to png
convert Rplots.pdf -alpha off $base.png
rm Rplots.pdf

LTR1F2_35Even with 35bp reads, most of the sequence along the repeat is quite ambiguous.

Conclusions

For this post I've taken the consensus of one particular repeat, and slid across the sequence at 1 bp intervals using 30 and 35 window sizes, and collected the sub-sequences. These sub-sequences were mapped to the genome using the bwa mapping program (allowing up to 2 mismatches), which provides a flag that lets you know how many times a read multimapped to the reference genome. A plot showing the number of times a window multimapped in the genome was produced using R.

By simply adjusting the repeat and read length parameters, you can quickly produce a plot that allows you to see how mappable a specific repeat is at a particular read length.




Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.