Mapping repeats

Most eukaryotic genomes are interspersed with repetitive elements, and some of these elements are transcriptionally active, so they show up when we sequence the RNA population. Some of these elements also seem to be functionally important. One strategy for analysing these repeats is to map them back to the genome, to see where they came from and which repeat class they arose from. This post looks into mapping repeats and how sequencing errors affect mapping accuracy.

I will use the human genome as an example; according to RepeatMasker and Repbase, there are 5,298,130 annotated repetitive elements in the human genome. How much of the genome do they cover? First, download the RepeatMasker annotations for hg19 from the UCSC Table Browser; I’ve downloaded them as a BED file and named it hg19_rmsk.bed.
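If you prefer the command line to the Table Browser, the rmsk table dump can also be fetched from the UCSC downloads server and converted to BED with a short Perl script. This is only a sketch: it assumes the dump lives at http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/rmsk.txt.gz and that the columns follow the standard rmsk schema (bin, swScore, milliDiv, milliDel, milliIns, genoName, genoStart, genoEnd, genoLeft, strand, repName, ...), so check both against your setup.

#!/usr/bin/env perl
#rmsk_to_bed.pl: convert the UCSC rmsk table dump into a six column BED file
#usage: rmsk_to_bed.pl rmsk.txt.gz > hg19_rmsk.bed
#NOTE: the column indices assume the standard rmsk schema; check them against your file
use strict;
use warnings;

my $usage = "Usage: $0 <rmsk.txt.gz>\n";
my $infile = shift or die $usage;

open(IN,'-|',"gunzip -c $infile") || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   my @f = split(/\t/);
   #genoName, genoStart, genoEnd, repName, swScore, strand
   print join("\t", $f[5], $f[6], $f[7], $f[10], $f[1], $f[9]), "\n";
}
close(IN);

exit(0);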

#the extremely useful stats program is available
#https://github.com/arq5x/filo
cat ~/ucsc/hg19_rmsk.bed | perl -nle '@a=split; $s=$a[2]-$a[1]; print $s' | stats
Total lines:            5298130
Sum of lines:           1467396988
Ari. Mean:              276.965077867097
Geo. Mean:              168.518495379209
Median:                 188
Mode:                   21 (N=44789)
Anti-Mode:              3849 (N=1)
Minimum:                6
Maximum:                160602
Variance:               216904.549201035
StdDev:                 465.730124858845

In the above, I piped the BED file into Perl, which subtracted the start coordinate from the end to get the length of each repeat, and then piped the lengths into the stats program, which calculated some simple statistics. The total line count corresponds to the number of repetitive elements, which together make up 1,467,396,988 bp, roughly half of the hg19 genome. Now to convert this BED file into a fasta file and randomly sample 5 million reads from the repeats.

#fastaFromBed is part of bedtools available at
#https://code.google.com/p/bedtools/
fastaFromBed -fi hg19.fa -bed hg19_rmsk.bed -fo hg19_rmsk.fa
#use DWGSIM to simulate reads from repeats
#DWGSIM is available from https://github.com/nh13/DWGSIM
dwgsim -1 30 -2 0 -N 5000000 ~/ucsc/hg19_rmsk.fa read
#remove the empty paired fastq file
rm read.bwa.read2.fastq
cat read.bwa.read1.fastq | head
@chr1:16777160-16777470_100_1_1_0_0_0_1:0:0_0:0:0_0/1
TGCATTTTTAGTAGAGACAGGGTTTGTCCA
+
5403226.35134243.2245540432422
@chr1:25165800-25166089_13_1_1_0_0_0_0:0:0_0:0:0_0/1
CTCCAGCCTGGGCGAAAGAGCGAGACTCTG
+
42/221030122150232/13404300323
@chr1:33553606-33554646_985_1_1_0_0_0_1:0:0_0:0:0_0/1
GGCTCTGTGATAGGTGTTAATCAAACAAGT

We now have a fastq file with sequences generated from repetitive elements in the hg19 genome. Let’s map them back to the genome using BWA (available at http://sourceforge.net/projects/bio-bwa/files/); the commands below assume that a BWA index of hg19.fa has already been built with bwa index.

bwa aln -t 30 -n 2 hg19.fa read.bwa.read1.fastq > read.sai
bwa samse hg19.fa read.sai read.bwa.read1.fastq > read.sam
samtools view -S -b read.sam | samtools sort - read.bwa.read1.fastq_sorted
rm read.sam read.sai
samtools flagstat read.bwa.read1.fastq_sorted.bam
5000000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
4638310 + 0 mapped (92.77%:nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (nan%:nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (nan%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
#DWGSIM also generates random reads
samtools view read.bwa.read1.fastq_sorted.bam | grep "^rand" | wc
 275427 3029801 30098930
#how many of these were unmapped?
samtools view read.bwa.read1.fastq_sorted.bam | grep "^rand" | awk '$2 == 4 {print}' | wc
 275414 3029554 30096514

So 7.23% of the reads could not be mapped (including most of the random reads). BWA provides some useful statistics in the SAM tags; I wrote a short post a while ago on the BWA X0 tag, which shows the number of locations a read mapped to. Another useful tag is the NM tag, which gives the edit distance of the alignment. If we tally the numbers in the X0 and NM tags, we can get some idea of the mapping statistics:
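The exact tallying script isn’t shown in this post, but a minimal sketch along these lines (parsing the NM and X0 tags in Perl) would produce the summary below:

#!/usr/bin/env perl
#a minimal sketch for tallying the NM (edit distance) and X0 (number of hits) tags
use strict;
use warnings;

my $usage = "Usage: $0 <infile.bam>\n";
my $infile = shift or die $usage;

my $mapped = 0;
my $unmapped = 0;
my %nm = ();
my %x0 = ();

open(IN,'-|',"samtools view $infile") || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   my @f = split(/\t/);
   #bit 0x4 of the flag marks an unmapped read
   if ($f[1] & 4){
      ++$unmapped;
      next;
   }
   ++$mapped;
   #the optional tags start at column twelve
   foreach my $tag (@f[11..$#f]){
      if ($tag =~ /^NM:i:(\d+)$/){
         $nm{$1}++;
      } elsif ($tag =~ /^X0:i:(\d+)$/){
         #lump everything above ten hits into one bin
         my $hits = $1 > 10 ? 11 : $1;
         $x0{$hits}++;
      }
   }
}
close(IN);

print "Mapped: $mapped\nUnmapped: $unmapped\n";
print "Edit distances\n";
print "$_\t$nm{$_}\n" foreach sort {$a <=> $b} keys %nm;
print "Multimap profile\n";
foreach my $hits (sort {$a <=> $b} keys %x0){
   my $label = $hits > 10 ? 'Mapped to more than 10 locations' :
               $hits == 1 ? 'Mapped uniquely' : "Mapped to $hits locations";
   print "$label\t$x0{$hits}\n";
}

exit(0);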

Mapped: 4638310
Unmapped: 361690
Edit distances
0       2787494
1       1441126
2       409641
3       46
4       3
Multimap profile
Mapped uniquely 3048212
Mapped to 2 locations   228776
Mapped to 3 locations   107775
Mapped to 4 locations   95376
Mapped to 5 locations   69025
Mapped to 6 locations   136835
Mapped to 7 locations   123003
Mapped to 8 locations   37637
Mapped to 9 locations   26309
Mapped to 10 locations  21551
Mapped to more than 10 locations 743811

So 3,048,212 of the 5,000,000 reads could be mapped uniquely. Since we know where we generated the reads from, we can assess the accuracy of the results.

#as you can see, the first read was mapped to a different location from where it was generated
#the second read just happened to be mapped back to its original position
samtools view read.bwa.read1.fastq_sorted.bam | head -2
chrX:155260021-155260463_2_1_1_0_0_0_0:0:0_0:0:0_0      0       chr1    10109   0       30M     *       0       0       AACCCTAACCCTAACCCTAACCCTAACCCT  2225213/1120231/52532255421341       XT:A:R  NM:i:0  X0:i:712        XM:i:0  XO:i:0  XG:i:0  MD:Z:30
chr1:10000-10468_233_1_1_0_0_0_0:0:0_0:0:0_0    16      chr1    10233   0       30M     *       0       0       CCTAACCCTAACCCTAAACCCTAAACCCTA  4532222/0252221242223622311344       XT:A:R  NM:i:0  X0:i:2  X1:i:2  XM:i:0  XO:i:0  XG:i:0  MD:Z:30 XA:Z:chr8,-170547,30M,0;chr20,+62918708,30M,1;chr20,+62918131,30M,1;

Write a simple Perl script to assess the results:

#!/usr/bin/env perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.bam>\n";
my $infile = shift or die $usage;

my $correct = 0;
my $incorrect = 0;
my $total = 0;

open(IN,'-|',"samtools view $infile") || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   my ($qname,$flag,$rname,$pos,$mapq,$cigar,$mrnm,$mpos,$isize,$seq,$qual,$tag,$vtype,$value) = split;
   #print "$qname\n";
   #chrX:155260021-155260463_2_1_1_0_0_0_0:0:0_0:0:0_0
   #ignore random reads
   next if /^rand/;
   ++$total;
   if ($qname =~ /^(\w+):(\d+)-(\d+)_(\d+)/){
      my $chr = $1;
      my $start = $2;
      my $end = $3;
      my $add = $4;
      #print "$chr $start $end $add\n";
      my $original = $start + $add;
      if ($original == $pos && $chr eq $rname){
         ++$correct;
      } else {
         ++$incorrect;
      }
   } else {
      die "Could not parse $qname\n";
   }
}
close(IN);

print "Total: $total\nCorrect: $correct\nIncorrect: $incorrect\n";

exit(0);

And what are the results?

assess.pl read.bwa.read1.fastq_sorted.bam
Total: 4724573
Correct: 3183338
Incorrect: 1541235

Are particular repeat classes easier to map than others? We will need to annotate the reads using the hg19_rmsk.bed file, which contains the repeat class of each element.

#how many repeat classes are there
cat ~/ucsc/hg19_rmsk.bed | cut -f4 | sort -u | wc
   1395    1395   11305

We will need to add a section to the Perl script that reads in the hg19_rmsk.bed file and annotates each read:

#!/usr/bin/env perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.bam>\n";
my $infile = shift or die $usage;

my %bed = ();
my $bed = '/home/davetang/ucsc/hg19_rmsk.bed';
open(IN,'<',$bed) || die "Could not open $bed: $!\n";
while(<IN>){
   chomp;
   my ($chr, $start, $end, $id, $score, $strand) = split;
   $bed{$chr}{$start} = $id;
}
close(IN);

my %result = ();
open(IN,'-|',"samtools view $infile") || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   my ($qname,$flag,$rname,$pos,$mapq,$cigar,$mrnm,$mpos,$isize,$seq,$qual,$tag,$vtype,$value) = split;
   #print "$qname\n";
   #chrX:155260021-155260463_2_1_1_0_0_0_0:0:0_0:0:0_0
   next if /^rand/;
   if ($qname =~ /^(\w+):(\d+)-(\d+)_(\d+)/){
      my $chr = $1;
      my $start = $2;
      my $end = $3;
      my $add = $4;
      my $class = '';
      if (exists $bed{$chr}{$start}){
         $class = $bed{$chr}{$start};
      } else {
         die "Missing annotation in $bed\n";
      }
      my $original = $start + $add;
      if ($original == $pos && $chr eq $rname){
         if (exists $result{$class}{'correct'}){
            $result{$class}{'correct'}++;
         } else {
            $result{$class}{'correct'} = 1;
         }
      } else {
         if (exists $result{$class}{'incorrect'}){
            $result{$class}{'incorrect'}++;
         } else {
            $result{$class}{'incorrect'} = 1;
         }
      }
   } else {
      die "Could not parse $qname\n";
   }
}
close(IN);

print "class\tcorrect\tincorrect\n";
foreach my $class (keys %result){
   my $correct = 0;
   my $incorrect = 0;
   if (exists $result{$class}{'correct'}){
      $correct = $result{$class}{'correct'};
   }
   if (exists $result{$class}{'incorrect'}){
      $incorrect = $result{$class}{'incorrect'};
   }
   print join ("\t",$class,$correct,$incorrect),"\n";
}

exit(0);

And to run the script:

assess_class.pl read.bwa.read1.fastq_sorted.bam > result
head result
class   correct incorrect
LTR64   355     17
MER4E   807     343
LTR77   171     22
HERVFH19-int    334     48
(TATG)n 243     104
(CTGTG)n        10      5
MSTB    5659    1363
L1M3b   1258    106
BSR/Beta        3058    1052

I’ve been spending a lot of time learning R, so let’s load the results into R:

#load into R
data <- read.table("result", header=T)
head(data)
         class correct incorrect
1        LTR64     355        17
2        MER4E     807       343
3        LTR77     171        22
4 HERVFH19-int     334        48
5      (TATG)n     243       104
6     (CTGTG)n      10         5
#dimensions of data
dim(data)
[1] 1245    3
#function to add two numbers
add <- function(x,y){
   return(x+y)
}
#add new column to data
data$total <- mapply(add, data$correct, data$incorrect)
head(data[order(data$total,decreasing=T),])
     class correct incorrect  total
464  LTR60   39581    204507 244088
903    L2a  119580      4829 124409
428   MIRb  109551      4558 114109
1066 AluSx   31539     82327 113866
796   AluY   15326     93666 108992
936  AluJb   59165     49603 108768

#function to calculate percentage
percent <- function(c,t){
   cp <- (c*100)/t
   return(cp)
}

#correct percent
data$cp <- mapply(percent,data$correct,data$total)
head(data)
         class correct incorrect total       cp
1        LTR64     355        17   372 95.43011
2        MER4E     807       343  1150 70.17391
3        LTR77     171        22   193 88.60104
4 HERVFH19-int     334        48   382 87.43455
5      (TATG)n     243       104   347 70.02882
6     (CTGTG)n      10         5    15 66.66667

#summary of repeat class numbers
summary(data$total)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      1      82     492    3795    1710  244100

#create a subset of repeat classes with at least 3,000 reads
data_3000 <- data[data$total>2999,]
head(data_3000)
      class correct incorrect total       cp
7      MSTB    5659      1363  7022 80.58958
9  BSR/Beta    3058      1052  4110 74.40389
13   MLT1G1    3777       151  3928 96.15580
14   AluSq2   12220     37994 50214 24.33584
18    L1MC2    9358      1008 10366 90.27590
33   LTR16C    5473       224  5697 96.06811
dim(data_3000)
[1] 215   5

#highest percentage of correctly mapped
head(data_3000[order(data_3000$cp,decreasing=T),],n=20)
               class correct incorrect  total       cp
363         Charlie7    3102        95   3197 97.02846
336  ERV3-16A3_I-int   11423       378  11801 96.79688
741            LTR78    4336       145   4481 96.76412
233               L4    9742       336  10078 96.66601
379            MLT1K   12327       444  12771 96.52337
368              L2c   75940      2768  78708 96.48320
276            MLT1J   11433       418  11851 96.47287
700             MIR3   20984       800  21784 96.32758
696          MER103C    4396       169   4565 96.29792
133           MLT1J2    4284       165   4449 96.29130
246           LTR16A    5504       212   5716 96.29111
311            L1ME5    3085       121   3206 96.22583
247              L2b   46734      1835  48569 96.22187
720             L1M6    6244       246   6490 96.20955
823       ERVL-E-int   12240       485  12725 96.18861
13            MLT1G1    3777       151   3928 96.15580
1125            MIRc   34325      1380  35705 96.13500
903              L2a  119580      4829 124409 96.11845
61             LTR33    7728       316   8044 96.07161
771           L1ME4a   28569      1169  29738 96.06900

#lowest percentage of correctly mapped
head(data_3000[order(data_3000$cp),],n=20)
      class correct incorrect  total        cp
325  AluYa5      71      3510   3581  1.982686
265    L1HS     425     10548  10973  3.873143
346   L1PA2    1820     29777  31597  5.760041
871   L1PA3    6169     56300  62469  9.875298
399   SVA_D     826      5133   5959 13.861386
796    AluY   15326     93666 108992 14.061583
213   L1PA4    9319     51149  60468 15.411457
464   LTR60   39581    204507 244088 16.215873
401    L1P1    1566      7166   8732 17.934036
764   L1PA5   10148     41760  51908 19.549973
615   AluSp    9570     35306  44876 21.325430
1097 AluSc8    4291     15663  19954 21.504460
895   AluSg    8236     29853  38089 21.623041
942   AluSc    6653     24050  30703 21.668892
1006 AluSg4    1520      5187   6707 22.662890
752  AluSg7    1621      5515   7136 22.715807
1069 AluSc5    1441      4570   6011 23.972717
14   AluSq2   12220     37994  50214 24.335843
312   AluSq    4864     14998  19862 24.488974
411  AluSx1   26971     75453 102424 26.332695

Now let’s redo this analysis with a 5% sequencing error rate, following the same steps as above.

#run dwgsim with a 5% sequencing error rate
dwgsim -e 0.05 -1 30 -2 0 -N 5000000 ~/ucsc/hg19_rmsk.fa read_5_error
#mapping as above
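#e.g., mirroring the earlier commands:
#bwa aln -t 30 -n 2 hg19.fa read_5_error.bwa.read1.fastq > read_5_error.sai
#bwa samse hg19.fa read_5_error.sai read_5_error.bwa.read1.fastq > read_5_error.sam
#samtools view -S -b read_5_error.sam | samtools sort - read_5_error.bwa.read1.fastq_sorted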
#some stats
samtools flagstat read_5_error.bwa.read1.fastq_sorted.bam
5000000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
3999575 + 0 mapped (79.99%:nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (nan%:nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (nan%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
Mapped: 3999575
Unmapped: 1000425
Edit distances
0       1290995
1       1595650
2       1112669
3       252
4       9
Multimap profile
Mapped uniquely 2549636
Mapped to 2 locations   236716
Mapped to 3 locations   118016
Mapped to 4 locations   98032
Mapped to 5 locations   71463
Mapped to 6 locations   123306
Mapped to 7 locations   111016
Mapped to 8 locations   39716
Mapped to 9 locations   29950
Mapped to 10 locations  23655
Mapped to more than 10 locations 598069
assess.pl read_5_error.bwa.read1.fastq_sorted.bam
Total: 4724226
Correct: 2562411
Incorrect: 2161815

Run assess_class.pl on the error BAM, save the output as result_error, and load the results into R as before:

#same as before
data_error <- read.table("result_error",header=T)
data_error$total <- mapply(add,data_error$correct,data_error$incorrect)
data_error$cp <- mapply(percent,data_error$correct,data_error$total)
data_error_3000 <- data_error[data_error$total>2999,]
dim(data_error_3000)
[1] 215   5
head(data_error_3000[order(data_error_3000$cp),],n=20)
      class correct incorrect  total        cp
325  AluYa5      46      3524   3570  1.288515
265    L1HS     299     10645  10944  2.732091
346   L1PA2    1258     30301  31559  3.986185
873   L1PA3    4517     57866  62383  7.240755
798    AluY   10458     98484 108942  9.599603
399   SVA_D     633      5310   5943 10.651186
213   L1PA4    6841     53709  60550 11.298101
464   LTR60   32339    211658 243997 13.253851
401    L1P1    1222      7481   8703 14.041135
766   L1PA5    7610     44296  51906 14.661118
615   AluSp    6630     38380  45010 14.730060
1099 AluSc8    3074     16917  19991 15.376920
897   AluSg    5894     32277  38171 15.441042
944   AluSc    4830     25916  30746 15.709361
754  AluSg7    1136      5980   7116 15.964025
1071 AluSc5     977      5002   5979 16.340525
1008 AluSg4    1124      5551   6675 16.838951
14   AluSq2    8783     41523  50306 17.459150
312   AluSq    3474     16303  19777 17.565859
436  AluSx4     929      4107   5036 18.447180
head(data_error_3000[order(data_error_3000$cp,decreasing=T),],n=20)
               class correct incorrect  total       cp
1053           MLT1H    7195      1777   8972 80.19394
363         Charlie7    2576       638   3214 80.14935
246           LTR16A    4573      1141   5714 80.03150
233               L4    8049      2013  10062 79.99404
336  ERV3-16A3_I-int    9423      2369  11792 79.91011
697          MER103C    3638       918   4556 79.85075
247              L2b   38701      9880  48581 79.66283
1127            MIRc   28385      7248  35633 79.65930
1046          L1ME3F    4159      1062   5221 79.65907
368              L2c   62652     16020  78672 79.63697
133           MLT1J2    3559       911   4470 79.61969
905              L2a   98960     25338 124298 79.61512
379            MLT1K   10185      2608  12793 79.61385
743            LTR78    3574       919   4493 79.54596
93             L1MEg   17772      4570  22342 79.54525
182           L1ME3A   15148      3913  19061 79.47117
515            MLT1I    5224      1357   6581 79.38003
773           L1ME4a   23571      6138  29709 79.33959
402            L1ME3    8457      2204  10661 79.32652
276            MLT1J    9420      2458  11878 79.30628

#plot some results
plot(data_3000[order(data_3000$cp),]$cp, type="n",xlab="Repeat classes", ylab="Percent mapped correctly")
lines(data_3000[order(data_3000$cp),]$cp)
lines(data_error_3000[order(data_error_3000$cp),]$cp,col=2)

[Figure: percentage of reads mapped correctly per repeat class, without (black) and with (red) a 5% sequencing error rate]

As would be expected, an increase in sequencing errors results in less accurate mapping.

Conclusions

I’ve looked at the problem of mapping repeats in a rather simplistic manner: I generated reads (with no sequencing error and with a 5% sequencing error rate) from repetitive regions of the hg19 genome and mapped them back to the genome. As you would expect, mapping is less accurate when sequencing errors are present.

There are several things I could investigate further, such as:

  1. Mapping parameters and using different mapping programs
  2. Investigating specific repeat families, since not all repeat families are expressed, nor expressed at the same rate
  3. Looking at the fraction of unmapped reads and how many reads were mapped back to the same repeat class they came from (a rough starting point is sketched below)
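For the third point, a rough starting sketch (reusing the BED lookup and read-name parsing from assess_class.pl above, including the same hard-coded path to hg19_rmsk.bed, which would need adjusting) could tally the unmapped reads by the repeat class they were simulated from:

#!/usr/bin/env perl
#unmapped_by_class.pl: tally the unmapped simulated reads by the repeat class they came from
use strict;
use warnings;

my $usage = "Usage: $0 <infile.bam>\n";
my $infile = shift or die $usage;

#load the repeat annotation keyed by chromosome and start, as in assess_class.pl
my %bed = ();
my $bed = '/home/davetang/ucsc/hg19_rmsk.bed';
open(IN,'<',$bed) || die "Could not open $bed: $!\n";
while(<IN>){
   chomp;
   my ($chr, $start, $end, $id, $score, $strand) = split;
   $bed{$chr}{$start} = $id;
}
close(IN);

my %unmapped = ();
open(IN,'-|',"samtools view $infile") || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   my ($qname, $flag) = split;
   #skip the random reads and any read that was mapped
   next if /^rand/;
   next unless $flag & 4;
   if ($qname =~ /^(\w+):(\d+)-(\d+)_(\d+)/){
      my $class = exists $bed{$1}{$2} ? $bed{$1}{$2} : 'unknown';
      $unmapped{$class}++;
   }
}
close(IN);

print "class\tunmapped\n";
foreach my $class (sort { $unmapped{$b} <=> $unmapped{$a} } keys %unmapped){
   print join("\t", $class, $unmapped{$class}), "\n";
}

exit(0);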

But I just wanted a rough picture of the results when mapping repeats to a genome.

Print Friendly, PDF & Email



This work is licensed under a Creative Commons Attribution 4.0 International License.
8 comments
  1. Much of R assumes you’re doing something with a vector, so there’s no need to use the mapply calls, e.g.

    data$total <- data$correct+data$incorrect
    and
    data$cp <- 100*data$correct/data$total
    will work as they are.

    If you /do/ need to use a function such as '+' or "-" (which lack obvious names) with an apply function then you can quote it

    mapply("+", 1:10,1:10)

  2. Hi Dave
    thanks for sharing.
    I have a question that is driving me crazy: how is it possible that BWA reports reads mapping in repetitive elements as uniquely mapped? For example, you say “So 3,048,212 of the 5,000,000 reads could be mapped uniquely.” but in general it seems very confusing because BWA should report it as Repeat and not as Unique.
    Let me give an example. Given this sequence: CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACACTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCGCCCTAACCCTAACCG
    and mapping the sequence to MM9 reference genome (mouse mm9), BWA returns a unique position in chromosome X, such as this SAM details:
    M00571:39:000000000-A4EBU:1:1101:28475:12707 89 chrX 166527221 37 112M = 166527221 0 CGGTTAGGGTTAGGGCGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGTGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAG -./C:<.</HG<-?C-C0BHHC00EHGFCHHHF?0GG<EEFGFG0/0FF2FCHGHFEHHHCHFAFGHHHHHHHHHHFFHHHCFHHHHHGHHHHHHHHHGHHHHHHHHHHHHH RG:Z:LTR XT:A:U NM:i:3 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:3 XO:i:0 XG:i:0 MD:Z:0G14T42G53

    But using BLAT (http://genome.ucsc.edu/cgi-bin/hgBlat), the same sequence does not map in chromosome X, instead it returns these results:
    YourSeq 89 1 102 112 90.3% 14 + 91310977 91311069 93
    YourSeq 69 1 88 112 85.8% 2 + 41751727 41751796 70
    YourSeq 64 31 111 112 91.8% 6 - 6194937 6195016 80
    YourSeq 54 1 54 112 100.0% 14 - 86598799 86598852 54
    YourSeq 45 61 111 112 87.5% 9 + 27020751 27020798 48
    YourSeq 34 70 107 112 85.8% 13 + 111534251 111534285 35

    So… I am really confused and scared… because I thought that BWA was reliable, but it seems not, because it reports a repeat as a unique mapping hit.
    What do you think?
    Thanks,
    Andrea

    1. Hi Andrea,

      Firstly, repetitive elements can be mapped uniquely if they have accumulated enough mutations to make them unique in the genome.

      Regarding BWA, I believe that BWA isn’t tailored to work with such long reads. Have a look at BWA-MEM, which was implemented for much longer reads. You can also try Bowtie2. I don’t have an index of the mm9 genome or else I could have tested it for you.

      Also looking at your Blat results, the alignment seems to be quite poor, i.e. a lot of edits, since there is only 90.3% sequence identity. That may also be problematic.

      Hope that helps,

      Dave

  3. Hi Dave

    I am stuck with a problem, which looks similar. Hope you can share an idea

    I want to blast/align a fasta file with a few thousand sequences against raw Illumina PE reads in fastq format, basically to find indels (MITEs) in the new genome.
    From my understanding, if I map to the reference, I will lose this information as only the segments which map will show up.
    Do you think I should still map to the reference and then get the reads that do not map from the BAM file?
    Another option is to convert the fastq to fasta and then blast.

    Any suggestion will be helpful as this is my first Bioinformatics project of this kind.

    1. Hi Uark,

      sorry for the late reply. Depending on the alignment program you use, the unmapped reads should still be recorded, so you’ll know which reads do not map to your reference. I wouldn’t recommend BLAST; try BWA-MEM, which will record unmapped reads in the BAM file.

      Cheers,

      Dave
