Most eukaryotic genomes are interspersed with repetitive elements, and some of these elements are transcribed, so they appear when we sequence the RNA population. Some of these elements appear to be functionally important. One strategy for analysing these repeats is to map them back to the genome, to see where they came from and what repeat class they arose from. This post looks into mapping repeats and how sequencing accuracy affects mapping accuracy.
I will use the human genome as an example; according to RepeatMasker and Repbase, there are roughly 5,298,130 repetitive elements in the human genome. How much of the genome is that? First, download the RepeatMasker results for hg19 from the UCSC Table Browser. I've downloaded the results as a BED file and named it hg19_rmsk.bed.
#the extremely useful stats program is available from
#https://github.com/arq5x/filo
cat ~/ucsc/hg19_rmsk.bed | perl -nle '@a=split; $s=$a[2]-$a[1]; print $s' | stats
Total lines:    5298130
Sum of lines:   1467396988
Ari. Mean:      276.965077867097
Geo. Mean:      168.518495379209
Median:         188
Mode:           21 (N=44789)
Anti-Mode:      3849 (N=1)
Minimum:        6
Maximum:        160602
Variance:       216904.549201035
StdDev:         465.730124858845
In the above, I piped the entire BED file to Perl, which subtracted each start coordinate from the end coordinate and printed the length; the lengths then went to the stats program, which calculated simple statistics. The number of lines corresponds to the number of repetitive elements, which together make up 1,467,396,988 bp, i.e. around half of the hg19 genome. Now to convert this BED file into a fasta file and randomly sample 5 million reads from the repeats.
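If the stats program isn't installed, the same length calculation can be done with awk alone; a minimal sketch on a made-up three-line BED fragment (the coordinates here are just for illustration):

```shell
# per-interval lengths (end - start) plus a simple summary, all in awk
printf 'chr1\t100\t300\nchr1\t500\t521\nchr2\t0\t188\n' |
awk 'BEGIN{OFS="\t"}
     {len = $3 - $2; sum += len; n++; print $1, len}
     END{printf "n=%d sum=%d mean=%.1f\n", n, sum, sum/n}'
```

Remember that BED intervals are 0-based, half-open, so end minus start is the interval length with no off-by-one correction.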
#fastaFromBed is part of BEDTools, available at
#https://code.google.com/p/bedtools/
fastaFromBed -fi hg19.fa -bed hg19_rmsk.bed -fo hg19_rmsk.fa

#use DWGSIM to simulate reads from the repeats
#DWGSIM is available from https://github.com/nh13/DWGSIM
dwgsim -1 30 -2 0 -N 5000000 ~/ucsc/hg19_rmsk.fa read

#remove the empty paired fastq file
rm read.bwa.read2.fastq

cat read.bwa.read1.fastq | head
@chr1:16777160-16777470_100_1_1_0_0_0_1:0:0_0:0:0_0/1
TGCATTTTTAGTAGAGACAGGGTTTGTCCA
+
5403226.35134243.2245540432422
@chr1:25165800-25166089_13_1_1_0_0_0_0:0:0_0:0:0_0/1
CTCCAGCCTGGGCGAAAGAGCGAGACTCTG
+
42/221030122150232/13404300323
@chr1:33553606-33554646_985_1_1_0_0_0_1:0:0_0:0:0_0/1
GGCTCTGTGATAGGTGTTAATCAAACAAGT
We now have a fastq file with sequences generated from repetitive elements in the hg19 genome. Let's map them back to the genome using BWA (available at http://sourceforge.net/projects/bio-bwa/files/).
bwa aln -t 30 -n 2 hg19.fa read.bwa.read1.fastq > read.sai
bwa samse hg19.fa read.sai read.bwa.read1.fastq > read.sam
samtools view -S -b read.sam | samtools sort - read.bwa.read1.fastq_sorted
rm read.sam read.sai
samtools flagstat read.bwa.read1.fastq_sorted.bam
5000000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
4638310 + 0 mapped (92.77%:nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (nan%:nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (nan%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
#DWGSIM also generates random reads
samtools view read.bwa.read1.fastq_sorted.bam | grep "^rand" | wc
275427 3029801 30098930
#how many of these were unmapped?
samtools view read.bwa.read1.fastq_sorted.bam | grep "^rand" | awk '$2 == 4 {print}' | wc
275414 3029554 30096514
So 7.23% of the reads could not be mapped (including almost all of the random reads). BWA provides some useful statistics in the SAM tags; I wrote a short post a while ago on the BWA X0 tag, which shows the number of locations a read mapped to. Another useful tag is the NM tag, which gives the edit distance of the alignment. If we tally the numbers in the X0 and NM tags, we can get some idea of the mapping statistics:
Mapped:   4638310
Unmapped: 361690

Edit distances
0    2787494
1    1441126
2    409641
3    46
4    3

Multimap profile
Mapped uniquely                  3048212
Mapped to 2 locations            228776
Mapped to 3 locations            107775
Mapped to 4 locations            95376
Mapped to 5 locations            69025
Mapped to 6 locations            136835
Mapped to 7 locations            123003
Mapped to 8 locations            37637
Mapped to 9 locations            26309
Mapped to 10 locations           21551
Mapped to more than 10 locations 743811
So 3,048,212 of the 5,000,000 reads could be mapped uniquely. Since we know where we generated the reads from, we can assess the accuracy of the results.
#as you can see, the first read was mapped somewhere else
#the second read was luckily mapped to the same place
samtools view read.bwa.read1.fastq_sorted.bam | head -2
chrX:155260021-155260463_2_1_1_0_0_0_0:0:0_0:0:0_0 0 chr1 10109 0 30M * 0 0 AACCCTAACCCTAACCCTAACCCTAACCCT 2225213/1120231/52532255421341 XT:A:R NM:i:0 X0:i:712 XM:i:0 XO:i:0 XG:i:0 MD:Z:30
chr1:10000-10468_233_1_1_0_0_0_0:0:0_0:0:0_0 16 chr1 10233 0 30M * 0 0 CCTAACCCTAACCCTAAACCCTAAACCCTA 4532222/0252221242223622311344 XT:A:R NM:i:0 X0:i:2 X1:i:2 XM:i:0 XO:i:0 XG:i:0 MD:Z:30 XA:Z:chr8,-170547,30M,0;chr20,+62918708,30M,1;chr20,+62918131,30M,1;
Write a simple Perl script to assess the results:
#!/usr/bin/env perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.bam>\n";
my $infile = shift or die $usage;

my $correct = 0;
my $incorrect = 0;
my $total = 0;

open(IN,'-|',"samtools view $infile") || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   my ($qname,$flag,$rname,$pos,$mapq,$cigar,$mrnm,$mpos,$isize,$seq,$qual,$tag,$vtype,$value) = split;
   #read names look like chrX:155260021-155260463_2_1_1_0_0_0_0:0:0_0:0:0_0
   #ignore the random reads
   next if /^rand/;
   ++$total;
   if ($qname =~ /^(\w+):(\d+)-(\d+)_(\d+)/){
      my $chr = $1;
      my $start = $2;
      my $end = $3;
      my $add = $4;
      #expected mapping position: fragment start plus offset within the fragment
      my $original = $start + $add;
      if ($original == $pos && $chr eq $rname){
         ++$correct;
      } else {
         ++$incorrect;
      }
   } else {
      die "Could not parse $qname\n";
   }
}
close(IN);

print "Total: $total\nCorrect: $correct\nIncorrect: $incorrect\n";

exit(0);
And what are the results?
assess.pl read.bwa.read1.fastq_sorted.bam
Total: 4724573
Correct: 3183338
Incorrect: 1541235
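As a sanity check, the Total reported by the script matches the 5,000,000 simulated reads minus the 275,427 random reads we counted earlier, and Correct plus Incorrect adds up to the same number:

```shell
# both should print the script's Total of 4724573
echo $((5000000 - 275427))   # simulated reads minus random reads -> 4724573
echo $((3183338 + 1541235))  # correct + incorrect              -> 4724573
```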
Are particular repeat classes easier to map than others? We will need to annotate the reads using the hg19_rmsk.bed file which contains the types of repeats.
#how many repeat classes are there?
cat ~/ucsc/hg19_rmsk.bed | cut -f4 | sort -u | wc
   1395    1395   11305
We will need to add a section to the Perl script that reads in the hg19_rmsk.bed file and annotates each read:
#!/usr/bin/env perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.bam>\n";
my $infile = shift or die $usage;

#store the repeat class of each element, keyed on chromosome and start
my %bed = ();
my $bed = '/home/davetang/ucsc/hg19_rmsk.bed';
open(IN,'<',$bed) || die "Could not open $bed: $!\n";
while(<IN>){
   chomp;
   my ($chr, $start, $end, $id, $score, $strand) = split;
   $bed{$chr}{$start} = $id;
}
close(IN);

my %result = ();
open(IN,'-|',"samtools view $infile") || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   my ($qname,$flag,$rname,$pos,$mapq,$cigar,$mrnm,$mpos,$isize,$seq,$qual,$tag,$vtype,$value) = split;
   #read names look like chrX:155260021-155260463_2_1_1_0_0_0_0:0:0_0:0:0_0
   #ignore the random reads
   next if /^rand/;
   if ($qname =~ /^(\w+):(\d+)-(\d+)_(\d+)/){
      my $chr = $1;
      my $start = $2;
      my $end = $3;
      my $add = $4;
      my $class = '';
      if (exists $bed{$chr}{$start}){
         $class = $bed{$chr}{$start};
      } else {
         die "Missing annotation in $bed\n";
      }
      my $original = $start + $add;
      if ($original == $pos && $chr eq $rname){
         $result{$class}{'correct'}++;
      } else {
         $result{$class}{'incorrect'}++;
      }
   } else {
      die "Could not parse $qname\n";
   }
}
close(IN);

print "class\tcorrect\tincorrect\n";
foreach my $class (keys %result){
   my $correct   = exists $result{$class}{'correct'}   ? $result{$class}{'correct'}   : 0;
   my $incorrect = exists $result{$class}{'incorrect'} ? $result{$class}{'incorrect'} : 0;
   print join("\t", $class, $correct, $incorrect), "\n";
}

exit(0);
And to run the script:
assess_class.pl read.bwa.read1.fastq_sorted.bam > result

head result
class   correct incorrect
LTR64   355     17
MER4E   807     343
LTR77   171     22
HERVFH19-int    334     48
(TATG)n 243     104
(CTGTG)n        10      5
MSTB    5659    1363
L1M3b   1258    106
BSR/Beta        3058    1052
I've been spending a lot of time learning R, so let's load this into R.
#load into R
data <- read.table("result", header=T)
head(data)
class correct incorrect
1 LTR64 355 17
2 MER4E 807 343
3 LTR77 171 22
4 HERVFH19-int 334 48
5 (TATG)n 243 104
6 (CTGTG)n 10 5
#dimensions of data
dim(data)
[1] 1245 3
#function to add two numbers
add <- function(x,y){
return(x+y)
}
#add new column to data
data$total <- mapply(add, data$correct, data$incorrect)
head(data[order(data$total,decreasing=T),])
class correct incorrect total
464 LTR60 39581 204507 244088
903 L2a 119580 4829 124409
428 MIRb 109551 4558 114109
1066 AluSx 31539 82327 113866
796 AluY 15326 93666 108992
936 AluJb 59165 49603 108768
#function to calculate percentage
percent <- function(c,t){
cp <- (c*100)/t
return(cp)
}
#correct percent
data$cp <- mapply(percent,data$correct,data$total)
head(data)
class correct incorrect total cp
1 LTR64 355 17 372 95.43011
2 MER4E 807 343 1150 70.17391
3 LTR77 171 22 193 88.60104
4 HERVFH19-int 334 48 382 87.43455
5 (TATG)n 243 104 347 70.02882
6 (CTGTG)n 10 5 15 66.66667
#summary of repeat class numbers
summary(data$total)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 82 492 3795 1710 244100
#create a subset of classes with at least 3,000 reads
data_3000 <- data[data$total>2999,]
head(data_3000)
class correct incorrect total cp
7 MSTB 5659 1363 7022 80.58958
9 BSR/Beta 3058 1052 4110 74.40389
13 MLT1G1 3777 151 3928 96.15580
14 AluSq2 12220 37994 50214 24.33584
18 L1MC2 9358 1008 10366 90.27590
33 LTR16C 5473 224 5697 96.06811
dim(data_3000)
[1] 215 5
#highest percentage of correctly mapped
head(data_3000[order(data_3000$cp,decreasing=T),],n=20)
class correct incorrect total cp
363 Charlie7 3102 95 3197 97.02846
336 ERV3-16A3_I-int 11423 378 11801 96.79688
741 LTR78 4336 145 4481 96.76412
233 L4 9742 336 10078 96.66601
379 MLT1K 12327 444 12771 96.52337
368 L2c 75940 2768 78708 96.48320
276 MLT1J 11433 418 11851 96.47287
700 MIR3 20984 800 21784 96.32758
696 MER103C 4396 169 4565 96.29792
133 MLT1J2 4284 165 4449 96.29130
246 LTR16A 5504 212 5716 96.29111
311 L1ME5 3085 121 3206 96.22583
247 L2b 46734 1835 48569 96.22187
720 L1M6 6244 246 6490 96.20955
823 ERVL-E-int 12240 485 12725 96.18861
13 MLT1G1 3777 151 3928 96.15580
1125 MIRc 34325 1380 35705 96.13500
903 L2a 119580 4829 124409 96.11845
61 LTR33 7728 316 8044 96.07161
771 L1ME4a 28569 1169 29738 96.06900
#lowest percentage of correctly mapped
head(data_3000[order(data_3000$cp),],n=20)
class correct incorrect total cp
325 AluYa5 71 3510 3581 1.982686
265 L1HS 425 10548 10973 3.873143
346 L1PA2 1820 29777 31597 5.760041
871 L1PA3 6169 56300 62469 9.875298
399 SVA_D 826 5133 5959 13.861386
796 AluY 15326 93666 108992 14.061583
213 L1PA4 9319 51149 60468 15.411457
464 LTR60 39581 204507 244088 16.215873
401 L1P1 1566 7166 8732 17.934036
764 L1PA5 10148 41760 51908 19.549973
615 AluSp 9570 35306 44876 21.325430
1097 AluSc8 4291 15663 19954 21.504460
895 AluSg 8236 29853 38089 21.623041
942 AluSc 6653 24050 30703 21.668892
1006 AluSg4 1520 5187 6707 22.662890
752 AluSg7 1621 5515 7136 22.715807
1069 AluSc5 1441 4570 6011 23.972717
14 AluSq2 12220 37994 50214 24.335843
312 AluSq 4864 14998 19862 24.488974
411 AluSx1 26971 75453 102424 26.332695
Now let's redo this analysis with a 5% sequencing error rate, following the same steps as above.
#run dwgsim with a 5% sequencing error rate
dwgsim -e 0.05 -1 30 -2 0 -N 5000000 ~/ucsc/hg19_rmsk.fa read_5_error

#mapping as above

#some stats
samtools flagstat read_5_error.bwa.read1.fastq_sorted.bam
5000000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
3999575 + 0 mapped (79.99%:nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (nan%:nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (nan%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Mapped:   3999575
Unmapped: 1000425

Edit distances
0    1290995
1    1595650
2    1112669
3    252
4    9

Multimap profile
Mapped uniquely                  2549636
Mapped to 2 locations            236716
Mapped to 3 locations            118016
Mapped to 4 locations            98032
Mapped to 5 locations            71463
Mapped to 6 locations            123306
Mapped to 7 locations            111016
Mapped to 8 locations            39716
Mapped to 9 locations            29950
Mapped to 10 locations           23655
Mapped to more than 10 locations 598069

assess.pl read_5_error.bwa.read1.fastq_sorted.bam
Total: 4724226
Correct: 2562411
Incorrect: 2161815
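Comparing the two assess.pl runs, the overall correct-mapping rate drops from roughly 67% to 54% with a 5% error rate; a quick calculation from the numbers above:

```shell
# overall correct-mapping rate from the two assess.pl runs
awk 'BEGIN {
    printf "no error: %.1f%%\n", 100 * 3183338 / 4724573
    printf "5%% error: %.1f%%\n", 100 * 2562411 / 4724226
}'
```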
Load results into R as per above:
#same as before
data_error <- read.table("result_error",header=T)
data_error$total <- mapply(add,data_error$correct,data_error$incorrect)
data_error$cp <- mapply(percent,data_error$correct,data_error$total)
data_error_3000 <- data_error[data_error$total>2999,]
dim(data_error_3000)
[1] 215 5
head(data_error_3000[order(data_error_3000$cp),],n=20)
class correct incorrect total cp
325 AluYa5 46 3524 3570 1.288515
265 L1HS 299 10645 10944 2.732091
346 L1PA2 1258 30301 31559 3.986185
873 L1PA3 4517 57866 62383 7.240755
798 AluY 10458 98484 108942 9.599603
399 SVA_D 633 5310 5943 10.651186
213 L1PA4 6841 53709 60550 11.298101
464 LTR60 32339 211658 243997 13.253851
401 L1P1 1222 7481 8703 14.041135
766 L1PA5 7610 44296 51906 14.661118
615 AluSp 6630 38380 45010 14.730060
1099 AluSc8 3074 16917 19991 15.376920
897 AluSg 5894 32277 38171 15.441042
944 AluSc 4830 25916 30746 15.709361
754 AluSg7 1136 5980 7116 15.964025
1071 AluSc5 977 5002 5979 16.340525
1008 AluSg4 1124 5551 6675 16.838951
14 AluSq2 8783 41523 50306 17.459150
312 AluSq 3474 16303 19777 17.565859
436 AluSx4 929 4107 5036 18.447180
head(data_error_3000[order(data_error_3000$cp,decreasing=T),],n=20)
class correct incorrect total cp
1053 MLT1H 7195 1777 8972 80.19394
363 Charlie7 2576 638 3214 80.14935
246 LTR16A 4573 1141 5714 80.03150
233 L4 8049 2013 10062 79.99404
336 ERV3-16A3_I-int 9423 2369 11792 79.91011
697 MER103C 3638 918 4556 79.85075
247 L2b 38701 9880 48581 79.66283
1127 MIRc 28385 7248 35633 79.65930
1046 L1ME3F 4159 1062 5221 79.65907
368 L2c 62652 16020 78672 79.63697
133 MLT1J2 3559 911 4470 79.61969
905 L2a 98960 25338 124298 79.61512
379 MLT1K 10185 2608 12793 79.61385
743 LTR78 3574 919 4493 79.54596
93 L1MEg 17772 4570 22342 79.54525
182 L1ME3A 15148 3913 19061 79.47117
515 MLT1I 5224 1357 6581 79.38003
773 L1ME4a 23571 6138 29709 79.33959
402 L1ME3 8457 2204 10661 79.32652
276 MLT1J 9420 2458 11878 79.30628
#plot some results
plot(data_3000[order(data_3000$cp),]$cp, type="n",xlab="Repeat classes", ylab="Percent mapped correctly")
lines(data_3000[order(data_3000$cp),]$cp)
lines(data_error_3000[order(data_error_3000$cp),]$cp,col=2)
As would be expected, an increase in sequencing errors results in less accurate mapping.
Conclusions
I've looked at the problem of mapping repeats in a rather simplistic manner; I generated random reads (with no sequencing error and with a 5% sequencing error) from repetitive areas in the hg19 genome and mapped them back to the genome. As you would expect, we have less accurate mapping when there is sequencing error.
There are several things I could further investigate such as
- Mapping parameters and using different mapping programs
- Investigating specific repeat families, since not all repeat families are expressed, nor expressed at the same level
- Looking at the fraction of unmapped reads, and how many reads were mapped back to the same repeat class
But I just wanted a rough picture of the results when mapping repeats to a genome.

This work is licensed under a Creative Commons
Attribution 4.0 International License.
Much of R assumes you're doing something with a vector, so there's no need to use mapply here, e.g.
data$total <- data$correct+data$incorrect
and
data$cp <- 100*data$correct/data$total
will work as they are.
If you /do/ need to use a function such as '+' or '-' (which lack obvious names) with an apply function, then you can quote it:
mapply("+", 1:10,1:10)
Hi Darren,
That is much simpler. Thank you for the tips!
Cheers,
Dave
Hi Dave
thank for sharing.
I have a question that is driving me crazy: how is it possible that BWA reports reads mapping in repetitive elements as uniquely mapped? For example, you say “So 3,048,212 of the 5,000,000 reads could be mapped uniquely.” but in general it seems very confusing because BWA should report it as Repeat and not as Unique.
Let me give an example. Given this sequence: CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACACTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCGCCCTAACCCTAACCG
and mapping the sequence to MM9 reference genome (mouse mm9), BWA returns a unique position in chromosome X, such as this SAM details:
M00571:39:000000000-A4EBU:1:1101:28475:12707 89 chrX 166527221 37 112M = 166527221 0 CGGTTAGGGTTAGGGCGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGTGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAG -./C:<.</HG<-?C-C0BHHC00EHGFCHHHF?0GG<EEFGFG0/0FF2FCHGHFEHHHCHFAFGHHHHHHHHHHFFHHHCFHHHHHGHHHHHHHHHGHHHHHHHHHHHHH RG:Z:LTR XT:A:U NM:i:3 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:3 XO:i:0 XG:i:0 MD:Z:0G14T42G53
But using BLAT (http://genome.ucsc.edu/cgi-bin/hgBlat), the same sequence does not map in chromosome X, instead it returns these results:
YourSeq 89 1 102 112 90.3% 14 + 91310977 91311069 93
YourSeq 69 1 88 112 85.8% 2 + 41751727 41751796 70
YourSeq 64 31 111 112 91.8% 6 – 6194937 6195016 80
YourSeq 54 1 54 112 100.0% 14 – 86598799 86598852 54
YourSeq 45 61 111 112 87.5% 9 + 27020751 27020798 48
YourSeq 34 70 107 112 85.8% 13 + 111534251 111534285 35
So… I am really confused and scared… because I thought that BWA was reliable, but it seems not, because it reports a repeat as a unique mapping hit.
What do you think?
Thanks,
Andrea
Hi Andrea,
Firstly, repetitive elements can be uniquely mapped if they have accumulated enough mutations to make them unique in the genome.
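To illustrate with a toy example (made-up sequences, not real genomic data): below, the "genome" contains two copies of the same element, but the second copy carries one substitution. A read away from that site occurs twice, while a read spanning the diagnostic base occurs only once, i.e. it is uniquely mappable even though it comes from a repeat:

```shell
# two near-identical "repeat copies"; the second carries one substitution
genome="AAGGCTAGCTAGGCTAACCTAAGGCTAGCTATGCTAACCT"
count() { echo "$1" | grep -o "$2" | wc -l; }
count "$genome" "AAGGCTAGCTA"   # away from the mutation: found in both copies
count "$genome" "GCTAGCTATGC"   # spans the diagnostic base: found once
```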
Regarding BWA, I believe BWA aln isn't tailored to work with such long reads. Have a look at BWA-MEM, which was implemented for much longer reads. You can also try Bowtie2. I don't have an index of the mm9 genome, or else I would have tested it for you.
Also, looking at your BLAT results, the alignment seems to be quite poor, i.e. a lot of edits, since the best hit has only 90.3% sequence identity. That may also be problematic.
Hope that helps,
Dave
Hi Dave
I am stuck with a problem, which looks similar. Hope you can share an idea
I want to blast/align a fasta file with a few thousand sequences against raw Illumina PE reads in fastq format. Basically, finding out indels (MITEs) in the new genome.
From my understanding, if I map to the reference, I will lose this information, as only the segments which map will show up.
Do you think I should still map to the reference and then get the reads which do not map from the bam file?
Another option is to convert fastq to fasta and then blast.
Any suggestion will be helpful as this is my first Bioinformatics project of this kind.
Hi Uark,
sorry for the late reply. Depending on the alignment program you use, the unmapped reads should still be recorded, so you’ll know which reads do not map to your reference. I wouldn’t recommend BLAST; try BWA-MEM, which will record unmapped reads in the BAM file.
Cheers,
Dave