I’ve wanted to write this post for a while, but I never had to work with paired-end libraries, so the impetus wasn’t quite there. Finally I’ve decided to take a look at some paired-end libraries at work and as usual, I will test some simple examples before I touch the real data. For those not familiar with paired-end reads, check out this post; it has very nice and simple illustrations, along with explanations on the terminology used in paired-end sequencing. Now let’s get started!
For the first test, I took some sequence from the human genome (hg19) and created two 100 bp reads from this region. The inner mate distance between the two reads is 200 bp, creating an insert size of 400 bp. The reads were then mapped back to the reference using BWA aln and sampe.
cat ref.fa >chr2:172936693-172938111 TTATGGCTTTAAAAATATTGTAACAGTGTCTGAATCAAACTTTAGTAAAATCTCTTTGGGTTATATCTGAGAAGCTTTTATTGAAGACTTTGAACAAAATTGTGTTTTTGACAGTTTTAAATTATAGGCTAACTAGCCTGGGAAAAAAGGATAGTGTCTCTCTGTTCTTTCATAGGAAATGTTGAATCAGACCCCTACTGGGAAAAGAAATTTAATGCATATCTCACTATCTTACTGTCCATGAATATAATAGAAATGAATTCAAAATGCAGTTTTATTTTTGCAAATGGGATGAGTCGATAGATGCACCTCATATTTTTGAACACCTAGGGTTCAACAAATTTACTGGTGGTGCTCTTGCATTTTAACAAAATTTATTCTTCAGTAGAAGGGGGCAGAGAACACTAGATTCTTATTCAAGCATTCTATCGAGCTCTGCATTCATGGCTGTGTCTAAAGGGCATGTCAGCCTTTGATTCTCTCTGAGAGGTAATTATCCTTTTCCTGTCACGGAACAACAAATGATAGCTAACTACAGAGGCACATTTGCAGTAGTCACATTCATCAACTGCAGAAAAAAAAATTCAATTTAATTGTGCAACACAGCTGCACATGGGCTTTTGAGCATTTCTGTTGTTCTCCCTGTCTCGCTATTCCTCCCTCCAGATCTATTTTTTAAACTTTTTTTCTGGTTATTTTTTCCCCTTTTTGTCTCTTCTTCCATTTTTACTCTCTGTACTTTCTTGTTAAAGTAATTTTCCTTTGTGGCTCTCATTCTTTTTCCCCCATTGAAGGCTATGAATGTAGAAAATTATCACAATTACTCATATAATTGAGCCTCTTTGTAGCAAGTGCAACTCCAGTAGCCTTTCTCCATCATGAAAATGGTTTCATTATAGGGTTTTTCATATTCTCTGACACCATCTACACAGAGGAACAGGCGTGCAGATGAGATGTGCTAGGAACAGGCTAGATCAGTAAGGTCACAGTAGGAATAATTAGCTCTGCTATGGAAAGAGCATCTAGGCCTTTTACTGCTACATAAATGTACTGTCCATGGCTTTTAGTCACAAAAAAAACTTACTAACAAATGGAGCTCCCGCCTACTACTTTGAAAAAAAGATTTGTATCAACACTACAATTTTCCATCATTAAGACTAATAACACAGAGCCTAGTATACATCAAGGGGAATAAAAAGAAAAATCTCACATTCAAGTGGCGGCTGGGTGCTGACCTTTGTTCCCTTTTTTTGTGTACGACTTAACTCTTTACAAAAAAGAGCCACACGCCACACCAACATGCAGGTGAACTCCAGCTAGTACTAGCAAAGCATAGCATTCAGTTGGAAAATTTGATAAATCTCCATGCAGGATAATGCATTTCATTACATATTCACTACATTAATTCTAGCTACATT cat r1.fa >my_read CTAACTAGCCTGGGAAAAAAGGATAGTGTCTCTCTGTTCTTTCATAGGAAATGTTGAATCAGACCCCTACTGGGAAAAGAAATTTAATGCATATCTCACT cat r2.fa >my_read TCGAGCTCTGCATTCATGGCTGTGTCTAAAGGGCATGTCAGCCTTTGATTCTCTCTGAGAGGTAATTATCCTTTTCCTGTCACGGAACAACAAATGATAG #reverse complement the second pair in one line cat r2.fa | grep -v "^>" | tr ACGT TGCA | rev | sponge r2.fa && echo ">my_read" | cat - r2.fa | sponge r2.fa cat r2.fa >my_read CTATCATTTGTTGTTCCGTGACAGGAAAAGGATAATTACCTCTCAGAGAGAATCAAAGGCTGACATGCCCTTTAGACACAGCCATGAATGCAGAGCTCGA #index reference bwa index ref.fa #align the reads bwa aln ref.fa r1.fa > r1.sai bwa aln ref.fa r2.fa > r2.sai #generate the paired-end alignment bwa sampe -a 400 ref.fa r1.sai r2.sai r1.fa r2.fa > my_read.sam #check the result cat my_read.sam @SQ SN:chr2:172936693-172938111 LN:1418 @PG ID:bwa PN:bwa VN:0.7.10-r789 CL:bwa sampe -a 400 ref.fa r1.sai r2.sai r1.fa r2.fa my_read 99 chr2:172936693-172938111 129 60 100M = 429 400 CTAACTAGCCTGGGAAAAAAGGATAGTGTCTCTCTGTTCTTTCATAGGAAATGTTGAATCAGACCCCTACTGGGAAAAGAAATTTAATGCATATCTCACT * XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:100 my_read 147 chr2:172936693-172938111 429 60 100M = 129 -400 TCGAGCTCTGCATTCATGGCTGTGTCTAAAGGGCATGTCAGCCTTTGATTCTCTCTGAGAGGTAATTATCCTTTTCCTGTCACGGAACAACAAATGATAG * XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:100
From the SAM format specification, the fields are:
Column | Field Type | Regexp/Range | Brief description |
1 | QNAME | String [!-?A-~]{1,255} | Query template NAME |
2 | FLAG | Int [0,216-1] | bitwise FLAG |
3 | RNAME | String \*|[!-()+-<>-~][!-~]* | Reference sequence NAME |
4 | POS | Int [0,231-1] | 1-based leftmost mapping POSition |
5 | MAPQ | Int [0,28-1] | MAPping Quality |
6 | CIGAR | String \*|([0-9]+[MIDNSHPX=])+ | CIGAR string |
7 | RNEXT | String \*|=|[!-()+-<>-~][!-~]* | Ref. name of the mate/next read |
8 | PNEXT | Int [0,231-1] | Position of the mate/next read |
9 | TLEN | Int [-231+1,231-1] | observed Template LENgth |
10 | SEQ | String \*|[A-Za-z=.]+ | segment SEQuence |
11 | QUAL | String [!-~]+ | ASCII of Phred-scaled base QUALity+33 |
The two bitwise flags for the reads are 99 and 147, which indicate:
99 – read paired, read mapped in proper pair, mate reverse strand, and first in pair
147 – read paired, read mapped in proper pair, read reverse strand, and second in pair
The RNEXT field is “=” because the mate-pairs mapped to the same reference. The PNEXT field shows the position of the mate pair, so for r1 this shows the position of r2 and vice versa. The TLEN is +/- 400, which is equal to the insert size of the fragment created in this example.
Let’s create another example, where the the reads map to two different places, i.e., not properly paired.
#new reference sequence with two loci cat ref2.fa >chr2:172936693-172938111 TTATGGCTTTAAAAATATTGTAACAGTGTCTGAATCAAACTTTAGTAAAATCTCTTTGGGTTATATCTGAGAAGCTTTTATTGAAGACTTTGAACAAAATTGTGTTTTTGACAGTTTTAAATTATAGGCTAACTAGCCTGGGAAAAAAGGATAGTGTCTCTCTGTTCTTTCATAGGAAATGTTGAATCAGACCCCTACTGGGAAAAGAAATTTAATGCATATCTCACTATCTTACTGTCCATGAATATAATAGAAATGAATTCAAAATGCAGTTTTATTTTTGCAAATGGGATGAGTCGATAGATGCACCTCATATTTTTGAACACCTAGGGTTCAACAAATTTACTGGTGGTGCTCTTGCATTTTAACAAAATTTATTCTTCAGTAGAAGGGGGCAGAGAACACTAGATTCTTATTCAAGCATTCTATCGAGCTCTGCATTCATGGCTGTGTCTAAAGGGCATGTCAGCCTTTGATTCTCTCTGAGAGGTAATTATCCTTTTCCTGTCACGGAACAACAAATGATAGCTAACTACAGAGGCACATTTGCAGTAGTCACATTCATCAACTGCAGAAAAAAAAATTCAATTTAATTGTGCAACACAGCTGCACATGGGCTTTTGAGCATTTCTGTTGTTCTCCCTGTCTCGCTATTCCTCCCTCCAGATCTATTTTTTAAACTTTTTTTCTGGTTATTTTTTCCCCTTTTTGTCTCTTCTTCCATTTTTACTCTCTGTACTTTCTTGTTAAAGTAATTTTCCTTTGTGGCTCTCATTCTTTTTCCCCCATTGAAGGCTATGAATGTAGAAAATTATCACAATTACTCATATAATTGAGCCTCTTTGTAGCAAGTGCAACTCCAGTAGCCTTTCTCCATCATGAAAATGGTTTCATTATAGGGTTTTTCATATTCTCTGACACCATCTACACAGAGGAACAGGCGTGCAGATGAGATGTGCTAGGAACAGGCTAGATCAGTAAGGTCACAGTAGGAATAATTAGCTCTGCTATGGAAAGAGCATCTAGGCCTTTTACTGCTACATAAATGTACTGTCCATGGCTTTTAGTCACAAAAAAAACTTACTAACAAATGGAGCTCCCGCCTACTACTTTGAAAAAAAGATTTGTATCAACACTACAATTTTCCATCATTAAGACTAATAACACAGAGCCTAGTATACATCAAGGGGAATAAAAAGAAAAATCTCACATTCAAGTGGCGGCTGGGTGCTGACCTTTGTTCCCTTTTTTTGTGTACGACTTAACTCTTTACAAAAAAGAGCCACACGCCACACCAACATGCAGGTGAACTCCAGCTAGTACTAGCAAAGCATAGCATTCAGTTGGAAAATTTGATAAATCTCCATGCAGGATAATGCATTTCATTACATATTCACTACATTAATTCTAGCTACATT >chr10:131642920-131644198 GGAACAGCACTTAATCCAGAAAGTTCCCGGATGGGTGGCTCTGACAATGGGCATTTACCTCTGAAATTGGTGCCGACGCTTGATGTTGCATGCATCTAGTTTGCATACAAATAAAGAATTAGACTTGTCATGAAGCAGGTATAATCAATGAAAGGCCCAGCTGTCTTAATCAGGTTTCGTTAGCCTGAGCAGCTCTCTGTGAGCAAGATAAAGTGTTTTTCAGAGGTATTAATAATTACTCATAATCTCTCCTATCATATCGCAAGACTTTTTACATGCCTTTCAATTTTAAGCAACGATTAAAGCAATTAAGATTGCTGCATTTACAAGCTATTTTTCATTCCCAGAAAATATCCTACACCTACTTGTCCATATGGTACTTTGCAATTGGAACATAAGGATAGTTTAGGAAAATTACGTATGCTCTTATCCTAATCTGCCTATTACTGGATAGTGTCTTAACTAATGAACCCGACTTTCTGAGCTGCTCTTAAGTGGTGGTCTTGTCATATCAGAGAGACACAAAACGGGTAACCCAATCAGTCAAGGATGCTTTGAACAAGTTTGTTTTGTTACAATATGCCGAGCTTATGCATACTGCCTCATGCTGTGAATACATTAGCACCGAATGGTGTTTGAAAAGCATTTCGATGGAAATTTCTTAGCCACAACCATAAGTGTAATTGGCTGCCTTTCAATAGGACATAACAGACTTCAAGTTAAATGGAAACTATTAAAGCTCAAATGATTTCTGCCGTTGTTTCATGACAAATGTACAGTTTTATTATTATGAGTATTCATTCTTGTACAGTGGTCGACAGCGATTTGAATATACATGATTATTCACATGCCCGATGATCTAGAATGGTATATATCATCACAGTTAAAAGAGGTAACCAAGGTGACGTTGCTTCAGGTGCTAATGTCATTATAAAACTGTAAAGCAATTATTCTGTTAGTTCTTGCCTGGGGTTATGAATACATTGAAGAGTGCAGAGCCATTAGATGTAGCTACTTCAAAAAGCGGGGAAAAGAAGATGAATTTTAATGCATGGGTAATTGCTCAACATGACCGCATTCCTAATAAATTTTTGTATATTAGAAAAAAAGAACAGAATATATTTGTGTATGATGCATGAAACAGAAAAAAAACTCAGATGAAATGCATGCAATTAATTTTATGGTAAAATCATTTTATGACTATGCACCTTATAATAAATATAATTTGTTAGAGCACAATTCCATGCATAAAATGTAATTTGTTTTTTCTGCTTTCAT #index bwa index ref2.fa #new read that is from the second locus cat r3.fa >my_read CGCAAGACTTTTTACATGCCTTTCAATTTTAAGCAACGATTAAAGCAATTAAGATTGCTGCATTTACAAGCTATTTTTCATTCCCAGAAAATATCCTACA #align the reads bwa aln ref2.fa r1.fa > r1.sai bwa aln ref2.fa r3.fa > r3.sai #generate the paired-end alignment bwa sampe -a 400 ref2.fa r1.sai r3.sai r1.fa r3.fa > my_read2.sam cat my_read2.sam @SQ SN:chr2:172936693-172938111 LN:1418 @SQ SN:chr10:131642920-131644198 LN:1278 @PG ID:bwa PN:bwa VN:0.7.10-r789 CL:bwa sampe -a 400 ref2.fa r1.sai r3.sai r1.fa r3.fa my_read 65 chr2:172936693-172938111 129 37 100M chr10:131642920-131644198 261 0 CTAACTAGCCTGGGAAAAAAGGATAGTGTCTCTCTGTTCTTTCATAGGAAATGTTGAATCAGACCCCTACTGGGAAAAGAAATTTAATGCATATCTCACT * XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:100 my_read 129 chr10:131642920-131644198 261 37 100M chr2:172936693-172938111 129 0 CGCAAGACTTTTTACATGCCTTTCAATTTTAAGCAACGATTAAAGCAATTAAGATTGCTGCATTTACAAGCTATTTTTCATTCCCAGAAAATATCCTACA * XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:100
The two bitwise flags for the reads are 65 and 129, which indicate:
65 – read paired and first in pair
129 – read paired and second in pair
They are no longer properly paired. Also notice that the RNEXT field now contains the reference sequence of the respective mate and the TLEN is 0.
Gaining some quick statistics
We can get some simple statistics on the mate-pairs using SAMTools flagstat. Firstly we need to convert the SAM files into BAM, and then we can run flagstat:
#convert into BAM samtools view -bS my_read.sam > my_read.bam samtools view -bS my_read2.sam > my_read2.bam #run flagstat samtools flagstat my_read.bam 2 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 duplicates 2 + 0 mapped (100.00%:nan%) 2 + 0 paired in sequencing 1 + 0 read1 1 + 0 read2 2 + 0 properly paired (100.00%:nan%) 2 + 0 with itself and mate mapped 0 + 0 singletons (0.00%:nan%) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5) samtools flagstat my_read2.bam 2 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 duplicates 2 + 0 mapped (100.00%:nan%) 2 + 0 paired in sequencing 1 + 0 read1 1 + 0 read2 0 + 0 properly paired (0.00%:nan%) 2 + 0 with itself and mate mapped 0 + 0 singletons (0.00%:nan%) 2 + 0 with mate mapped to a different chr 2 + 0 with mate mapped to a different chr (mapQ>=5)
Using flagstat we can quickly see how many paired-end reads were properly paired, i.e., concordant reads, or not properly paired, i.e., discordant reads, with respect to the reference sequence. Discordant reads may indicate structural variants.
Rescuing ambiguously mapped reads
Imagine a single read that maps equally well to two positions in a reference; there’s a 50% chance that it originated from either of the positions. However if we have paired-end reads, the other pair may be mapping uniquely and we can leverage this information to figure out where the ambiguous read originated from. Let’s create an example:
cat ref3.fa >chr2:172936693-172938111 TTATGGCTTTAAAAATATTGTAACAGTGTCTGAATCAAACTTTAGTAAAATCTCTTTGGGTTATATCTGAGAAGCTTTTATTGAAGACTTTGAACAAAATTGTGTTTTTGACAGTTTTAAATTATAGGCTAACTAGCCTGGGAAAAAAGGATAGTGTCTCTCTGTTCTTTCATAGGAAATGTTGAATCAGACCCCTACTGGGAAAAGAAATTTAATGCATATCTCACTATCTTACTGTCCATGAATATAATAGAAATGAATTCAAAATGCAGTTTTATTTTTGCAAATGGGATGAGTCGATAGATGCACCTCATATTTTTGAACACCTAGGGTTCAACAAATTTACTGGTGGTGCTCTTGCATTTTAACAAAATTTATTCTTCAGTAGAAGGGGGCAGAGAACACTAGATTCTTATTCAAGCATTCTATCGAGCTCTGCATTCATGGCTGTGTCTAAAGGGCATGTCAGCCTTTGATTCTCTCTGAGAGGTAATTATCCTTTTCCTGTCACGGAACAACAAATGATAGCTAACTACAGAGGCACATTTGCAGTAGTCACATTCATCAACTGCAGAAAAAAAAATTCAATTTAATTGTGCAACACAGCTGCACATGGGCTTTTGAGCATTTCTGTTGTTCTCCCTGTCTCGCTATTCCTCCCTCCAGATCTATTTTTTAAACTTTTTTTCTGGTTATTTTTTCCCCTTTTTGTCTCTTCTTCCATTTTTACTCTCTGTACTTTCTTGTTAAAGTAATTTTCCTTTGTGGCTCTCATTCTTTTTCCCCCATTGAAGGCTATGAATGTAGAAAATTATCACAATTACTCATATAATTGAGCCTCTTTGTAGCAAGTGCAACTCCAGTAGCCTTTCTCCATCATGAAAATGGTTTCATTATAGGGTTTTTCATATTCTCTGACACCATCTACACAGAGGAACAGGCGTGCAGATGAGATGTGCTAGGAACAGGCTAGATCAGTAAGGTCACAGTAGGAATAATTAGCTCTGCTATGGAAAGAGCATCTAGGCCTTTTACTGCTACATAAATGTACTGTCCATGGCTTTTAGTCACAAAAAAAACTTACTAACAAATGGAGCTCCCGCCTACTACTTTGAAAAAAAGATTTGTATCAACACTACAATTTTCCATCATTAAGACTAATAACACAGAGCCTAGTATACATCAAGGGGAATAAAAAGAAAAATCTCACATTCAAGTGGCGGCTGGGTGCTGACCTTTGTTCCCTTTTTTTGTGTACGACTTAACTCTTTACAAAAAAGAGCCACACGCCACACCAACATGCAGGTGAACTCCAGCTAGTACTAGCAAAGCATAGCATTCAGTTGGAAAATTTGATAAATCTCCATGCAGGATAATGCATTTCATTACATATTCACTACATTAATTCTAGCTACATT >chr2:172936993-172938111 TAGATGCACCTCATATTTTTGAACACCTAGGGTTCAACAAATTTACTGGTGGTGCTCTTGCATTTTAACAAAATTTATTCTTCAGTAGAAGGGGGCAGAGAACACTAGATTCTTATTCAAGCATTCTATCGAGCTCTGCATTCATGGCTGTGTCTAAAGGGCATGTCAGCCTTTGATTCTCTCTGAGAGGTAATTATCCTTTTCCTGTCACGGAACAACAAATGATAGCTAACTACAGAGGCACATTTGCAGTAGTCACATTCATCAACTGCAGAAAAAAAAATTCAATTTAATTGTGCAACACAGCTGCACATGGGCTTTTGAGCATTTCTGTTGTTCTCCCTGTCTCGCTATTCCTCCCTCCAGATCTATTTTTTAAACTTTTTTTCTGGTTATTTTTTCCCCTTTTTGTCTCTTCTTCCATTTTTACTCTCTGTACTTTCTTGTTAAAGTAATTTTCCTTTGTGGCTCTCATTCTTTTTCCCCCATTGAAGGCTATGAATGTAGAAAATTATCACAATTACTCATATAATTGAGCCTCTTTGTAGCAAGTGCAACTCCAGTAGCCTTTCTCCATCATGAAAATGGTTTCATTATAGGGTTTTTCATATTCTCTGACACCATCTACACAGAGGAACAGGCGTGCAGATGAGATGTGCTAGGAACAGGCTAGATCAGTAAGGTCACAGTAGGAATAATTAGCTCTGCTATGGAAAGAGCATCTAGGCCTTTTACTGCTACATAAATGTACTGTCCATGGCTTTTAGTCACAAAAAAAACTTACTAACAAATGGAGCTCCCGCCTACTACTTTGAAAAAAAGATTTGTATCAACACTACAATTTTCCATCATTAAGACTAATAACACAGAGCCTAGTATACATCAAGGGGAATAAAAAGAAAAATCTCACATTCAAGTGGCGGCTGGGTGCTGACCTTTGTTCCCTTTTTTTGTGTACGACTTAACTCTTTACAAAAAAGAGCCACACGCCACACCAACATGCAGGTGAACTCCAGCTAGTACTAGCAAAGCATAGCATTCAGTTGGAAAATTTGATAAATCTCCATGCAGGATAATGCATTTCATTACATATTCACTACATTAATTCTAGCTACATT bwa index ref3.fa
The second sequence is a shorter duplicate of the first sequence; it’s a subset of the first sequence starting 200 bp downstream of the first. The idea is to have one of the paired-end reads mapping uniquely (to the first 200 bp of the first sequence) and the other read mapping to the duplicated region. Here are the reads:
cat ../read/r4.fa >my_read AAAAATATTGTAACAGTGTCTGAATCAAACTTTAGTAAAATCTCTTTGGGTTATATCTGAGAAGCTTTTATTGAAGACTTTGAACAAAATTGTGTTTTTG cat read/r5.fa >my_read TGCACAATTAAATTGAATTTTTTTTTCTGCAGTTGATGAATGTGACTACTGCAAATGTGCCTCTGTAGTTAGCTATCATTTGTTGTTCCGTGACAGGAAA
Let’s align them as per before:
#align the reads bwa aln ref/ref3.fa read/r4.fa > read/r4.sai bwa aln ref/ref3.fa read/r5.fa > read/r5.sai #generate the paired-end alignment bwa sampe -a 600 ref/ref3.fa read/r4.sai read/r5.sai read/r4.fa read/r5.fa > aln/my_read3.sam #check the alignment cat aln/my_read3.sam @SQ SN:chr2:172936693-172938111 LN:1418 @SQ SN:chr2:172936993-172938111 LN:1118 @PG ID:bwa PN:bwa VN:0.7.10-r789 CL:bwa sampe -a 600 ref/ref3.fa read/r4.sai read/r5.sai read/r4.fa read/r5.fa my_read 99 chr2:172936693-172938111 11 37 100M = 501 590 AAAAATATTGTAACAGTGTCTGAATCAAACTTTAGTAAAATCTCTTTGGGTTATATCTGAGAAGCTTTTATTGAAGACTTTGAACAAAATTGTGTTTTTG * XT:A:U NM:i:0 SM:i:37 AM:i:X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:100 my_read 147 chr2:172936693-172938111 501 36 100M = 11 -590 TTTCCTGTCACGGAACAACAAATGATAGCTAACTACAGAGGCACATTTGCAGTAGTCACATTCATCAACTGCAGAAAAAAAAATTCAATTTAATTGTGCA * XT:A:R NM:i:0 SM:i:0 AM:i:X0:i:2 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:100 XA:Z:chr2:172936993-172938111,-201,100M,0;
The two bitwise flags are 99 and 147, indicating a proper pair (see above). The XT:A:R flag indicates that the second read is repetitive. The AM:i:X0 tag indicates the number of best hits the read mapped to (without taking into account the paired-end information); the AM indicates template independent. Note that the mapping quality of the second read is not 0 and has been mapped to the correct position; let’s see if it will be 0 if we generate an alignment by itself:
#mapping quality becomes 0 bwa samse ref/ref3.fa read/r5.sai read/r5.fa > aln/r5_se.sam cat aln/r5_se.sam @SQ SN:chr2:172936693-172938111 LN:1418 @SQ SN:chr2:172936993-172938111 LN:1118 @PG ID:bwa PN:bwa VN:0.7.10-r789 CL:bwa samse ref/ref3.fa read/r5.sai read/r5.fa my_read 16 chr2:172936693-172938111 501 0 100M * 0 0 TTTCCTGTCACGGAACAACAAATGATAGCTAACTACAGAGGCACATTTGCAGTAGTCACATTCATCAACTGCAGAAAAAAAAATTCAATTTAATTGTGCA * XT:A:R NM:i:0 X0:i:2 X1:i:XM:i:0 XO:i:0 XG:i:0 MD:Z:100 XA:Z:chr2:172936993-172938111,-201,100M,0;
I created this example in the hope of emulating the XT:A:M tag, but I can’t find much information on this tag. This post on Biostars, suggests that the XT:A:M tag means that one of the pairs is uniquely mapped and the other isn’t. However this is the exact question I’m asking right now.
Visualising paired-end reads
IGV can be used to visualise BAM files; we just need to index them first using SAMTools:
#index the BAM files samtools index my_read.bam samtools index my_read2.bam
Screenshot from my brand new MacBook Air showing the two examples. The first panel shows the properly paired reads and the second panel shows only one of the pairs.
Conclusions
This was the first time I’ve looked into paired-end reads and I simply wanted to create simple examples showing reads that are properly and not properly paired. When working with single-end reads, the BAM flags were usually just 0 or 16 but as you saw there are many more combinations with paired-end reads. Naturally, I also never paid attention to the RNEXT, PNEXT, and TLEN fields. I also showed how paired-end reads can be used to rescue ambiguously mapped reads.
I hope this was useful to those that have not encountered paired-end libraries; remember to check out the blog post I linked right at the start if you haven’t already. All the files for this post are available on GitHub.

This work is licensed under a Creative Commons
Attribution 4.0 International License.
It is amazing. I understand it easily.
Thanks a lot for this wonderful explanation!!
Hello, thanks for the wonderful explanation. I was wondering though, what would you do if after aligning a full Whole Exome Sequencing data from a mouse (paired end reads) you had a good mapping quality (> 97%) but about 40-60% properly paired reads