Getting started with paired-end reads

I’ve wanted to write this post for a while, but I never had to work with paired-end libraries, so the impetus wasn’t quite there. Finally I’ve decided to take a look at some paired-end libraries at work and as usual, I will test some simple examples before I touch the real data. For those not familiar with paired-end reads, check out this post; it has very nice and simple illustrations, along with explanations on the terminology used in paired-end sequencing. Now let’s get started!

For the first test, I took some sequence from the human genome (hg19) and created two 100 bp reads from this region. The inner mate distance between the two reads is 200 bp, creating an insert size of 400 bp. The reads were then mapped back to the reference using BWA aln and sampe.

cat ref.fa
>chr2:172936693-172938111
TTATGGCTTTAAAAATATTGTAACAGTGTCTGAATCAAACTTTAGTAAAATCTCTTTGGGTTATATCTGAGAAGCTTTTATTGAAGACTTTGAACAAAATTGTGTTTTTGACAGTTTTAAATTATAGGCTAACTAGCCTGGGAAAAAAGGATAGTGTCTCTCTGTTCTTTCATAGGAAATGTTGAATCAGACCCCTACTGGGAAAAGAAATTTAATGCATATCTCACTATCTTACTGTCCATGAATATAATAGAAATGAATTCAAAATGCAGTTTTATTTTTGCAAATGGGATGAGTCGATAGATGCACCTCATATTTTTGAACACCTAGGGTTCAACAAATTTACTGGTGGTGCTCTTGCATTTTAACAAAATTTATTCTTCAGTAGAAGGGGGCAGAGAACACTAGATTCTTATTCAAGCATTCTATCGAGCTCTGCATTCATGGCTGTGTCTAAAGGGCATGTCAGCCTTTGATTCTCTCTGAGAGGTAATTATCCTTTTCCTGTCACGGAACAACAAATGATAGCTAACTACAGAGGCACATTTGCAGTAGTCACATTCATCAACTGCAGAAAAAAAAATTCAATTTAATTGTGCAACACAGCTGCACATGGGCTTTTGAGCATTTCTGTTGTTCTCCCTGTCTCGCTATTCCTCCCTCCAGATCTATTTTTTAAACTTTTTTTCTGGTTATTTTTTCCCCTTTTTGTCTCTTCTTCCATTTTTACTCTCTGTACTTTCTTGTTAAAGTAATTTTCCTTTGTGGCTCTCATTCTTTTTCCCCCATTGAAGGCTATGAATGTAGAAAATTATCACAATTACTCATATAATTGAGCCTCTTTGTAGCAAGTGCAACTCCAGTAGCCTTTCTCCATCATGAAAATGGTTTCATTATAGGGTTTTTCATATTCTCTGACACCATCTACACAGAGGAACAGGCGTGCAGATGAGATGTGCTAGGAACAGGCTAGATCAGTAAGGTCACAGTAGGAATAATTAGCTCTGCTATGGAAAGAGCATCTAGGCCTTTTACTGCTACATAAATGTACTGTCCATGGCTTTTAGTCACAAAAAAAACTTACTAACAAATGGAGCTCCCGCCTACTACTTTGAAAAAAAGATTTGTATCAACACTACAATTTTCCATCATTAAGACTAATAACACAGAGCCTAGTATACATCAAGGGGAATAAAAAGAAAAATCTCACATTCAAGTGGCGGCTGGGTGCTGACCTTTGTTCCCTTTTTTTGTGTACGACTTAACTCTTTACAAAAAAGAGCCACACGCCACACCAACATGCAGGTGAACTCCAGCTAGTACTAGCAAAGCATAGCATTCAGTTGGAAAATTTGATAAATCTCCATGCAGGATAATGCATTTCATTACATATTCACTACATTAATTCTAGCTACATT

cat r1.fa
>my_read
CTAACTAGCCTGGGAAAAAAGGATAGTGTCTCTCTGTTCTTTCATAGGAAATGTTGAATCAGACCCCTACTGGGAAAAGAAATTTAATGCATATCTCACT

cat r2.fa
>my_read
TCGAGCTCTGCATTCATGGCTGTGTCTAAAGGGCATGTCAGCCTTTGATTCTCTCTGAGAGGTAATTATCCTTTTCCTGTCACGGAACAACAAATGATAG

#reverse complement the second pair in one line
cat r2.fa | grep -v "^>" | tr ACGT TGCA | rev | sponge r2.fa && echo ">my_read" | cat - r2.fa | sponge r2.fa

cat r2.fa
>my_read
CTATCATTTGTTGTTCCGTGACAGGAAAAGGATAATTACCTCTCAGAGAGAATCAAAGGCTGACATGCCCTTTAGACACAGCCATGAATGCAGAGCTCGA

#index reference
bwa index ref.fa

#align the reads
bwa aln ref.fa r1.fa > r1.sai
bwa aln ref.fa r2.fa > r2.sai

#generate the paired-end alignment
bwa sampe -a 400 ref.fa r1.sai r2.sai r1.fa r2.fa > my_read.sam

#check the result
cat my_read.sam 
@SQ	SN:chr2:172936693-172938111	LN:1418
@PG	ID:bwa	PN:bwa	VN:0.7.10-r789	CL:bwa sampe -a 400 ref.fa r1.sai r2.sai r1.fa r2.fa
my_read	99	chr2:172936693-172938111	129	60	100M	=	429	400	CTAACTAGCCTGGGAAAAAAGGATAGTGTCTCTCTGTTCTTTCATAGGAAATGTTGAATCAGACCCCTACTGGGAAAAGAAATTTAATGCATATCTCACT	*	XT:A:U	NM:i:0	SM:i:37	AM:i:37	X0:i:1	X1:i:0	XM:i:0	XO:i:0	XG:i:0	MD:Z:100
my_read	147	chr2:172936693-172938111	429	60	100M	=	129	-400	TCGAGCTCTGCATTCATGGCTGTGTCTAAAGGGCATGTCAGCCTTTGATTCTCTCTGAGAGGTAATTATCCTTTTCCTGTCACGGAACAACAAATGATAG	*	XT:A:U	NM:i:0	SM:i:37	AM:i:37	X0:i:1	X1:i:0	XM:i:0	XO:i:0	XG:i:0	MD:Z:100

From the SAM format specification, the fields are:

Column Field Type Regexp/Range Brief description
1 QNAME String [!-?A-~]{1,255} Query template NAME
2 FLAG Int [0,216-1] bitwise FLAG
3 RNAME String \*|[!-()+-<>-~][!-~]* Reference sequence NAME
4 POS Int [0,231-1] 1-based leftmost mapping POSition
5 MAPQ Int [0,28-1] MAPping Quality
6 CIGAR String \*|([0-9]+[MIDNSHPX=])+ CIGAR string
7 RNEXT String \*|=|[!-()+-<>-~][!-~]* Ref. name of the mate/next read
8 PNEXT Int [0,231-1] Position of the mate/next read
9 TLEN Int [-231+1,231-1] observed Template LENgth
10 SEQ String \*|[A-Za-z=.]+ segment SEQuence
11 QUAL String [!-~]+ ASCII of Phred-scaled base QUALity+33

The two bitwise flags for the reads are 99 and 147, which indicate:

99 – read paired, read mapped in proper pair, mate reverse strand, and first in pair
147 – read paired, read mapped in proper pair, read reverse strand, and second in pair

The RNEXT field is “=” because the mate-pairs mapped to the same reference. The PNEXT field shows the position of the mate pair, so for r1 this shows the position of r2 and vice versa. The TLEN is +/- 400, which is equal to the insert size of the fragment created in this example.

Let’s create another example, where the the reads map to two different places, i.e., not properly paired.

#new reference sequence with two loci
cat ref2.fa 
>chr2:172936693-172938111
TTATGGCTTTAAAAATATTGTAACAGTGTCTGAATCAAACTTTAGTAAAATCTCTTTGGGTTATATCTGAGAAGCTTTTATTGAAGACTTTGAACAAAATTGTGTTTTTGACAGTTTTAAATTATAGGCTAACTAGCCTGGGAAAAAAGGATAGTGTCTCTCTGTTCTTTCATAGGAAATGTTGAATCAGACCCCTACTGGGAAAAGAAATTTAATGCATATCTCACTATCTTACTGTCCATGAATATAATAGAAATGAATTCAAAATGCAGTTTTATTTTTGCAAATGGGATGAGTCGATAGATGCACCTCATATTTTTGAACACCTAGGGTTCAACAAATTTACTGGTGGTGCTCTTGCATTTTAACAAAATTTATTCTTCAGTAGAAGGGGGCAGAGAACACTAGATTCTTATTCAAGCATTCTATCGAGCTCTGCATTCATGGCTGTGTCTAAAGGGCATGTCAGCCTTTGATTCTCTCTGAGAGGTAATTATCCTTTTCCTGTCACGGAACAACAAATGATAGCTAACTACAGAGGCACATTTGCAGTAGTCACATTCATCAACTGCAGAAAAAAAAATTCAATTTAATTGTGCAACACAGCTGCACATGGGCTTTTGAGCATTTCTGTTGTTCTCCCTGTCTCGCTATTCCTCCCTCCAGATCTATTTTTTAAACTTTTTTTCTGGTTATTTTTTCCCCTTTTTGTCTCTTCTTCCATTTTTACTCTCTGTACTTTCTTGTTAAAGTAATTTTCCTTTGTGGCTCTCATTCTTTTTCCCCCATTGAAGGCTATGAATGTAGAAAATTATCACAATTACTCATATAATTGAGCCTCTTTGTAGCAAGTGCAACTCCAGTAGCCTTTCTCCATCATGAAAATGGTTTCATTATAGGGTTTTTCATATTCTCTGACACCATCTACACAGAGGAACAGGCGTGCAGATGAGATGTGCTAGGAACAGGCTAGATCAGTAAGGTCACAGTAGGAATAATTAGCTCTGCTATGGAAAGAGCATCTAGGCCTTTTACTGCTACATAAATGTACTGTCCATGGCTTTTAGTCACAAAAAAAACTTACTAACAAATGGAGCTCCCGCCTACTACTTTGAAAAAAAGATTTGTATCAACACTACAATTTTCCATCATTAAGACTAATAACACAGAGCCTAGTATACATCAAGGGGAATAAAAAGAAAAATCTCACATTCAAGTGGCGGCTGGGTGCTGACCTTTGTTCCCTTTTTTTGTGTACGACTTAACTCTTTACAAAAAAGAGCCACACGCCACACCAACATGCAGGTGAACTCCAGCTAGTACTAGCAAAGCATAGCATTCAGTTGGAAAATTTGATAAATCTCCATGCAGGATAATGCATTTCATTACATATTCACTACATTAATTCTAGCTACATT
>chr10:131642920-131644198
GGAACAGCACTTAATCCAGAAAGTTCCCGGATGGGTGGCTCTGACAATGGGCATTTACCTCTGAAATTGGTGCCGACGCTTGATGTTGCATGCATCTAGTTTGCATACAAATAAAGAATTAGACTTGTCATGAAGCAGGTATAATCAATGAAAGGCCCAGCTGTCTTAATCAGGTTTCGTTAGCCTGAGCAGCTCTCTGTGAGCAAGATAAAGTGTTTTTCAGAGGTATTAATAATTACTCATAATCTCTCCTATCATATCGCAAGACTTTTTACATGCCTTTCAATTTTAAGCAACGATTAAAGCAATTAAGATTGCTGCATTTACAAGCTATTTTTCATTCCCAGAAAATATCCTACACCTACTTGTCCATATGGTACTTTGCAATTGGAACATAAGGATAGTTTAGGAAAATTACGTATGCTCTTATCCTAATCTGCCTATTACTGGATAGTGTCTTAACTAATGAACCCGACTTTCTGAGCTGCTCTTAAGTGGTGGTCTTGTCATATCAGAGAGACACAAAACGGGTAACCCAATCAGTCAAGGATGCTTTGAACAAGTTTGTTTTGTTACAATATGCCGAGCTTATGCATACTGCCTCATGCTGTGAATACATTAGCACCGAATGGTGTTTGAAAAGCATTTCGATGGAAATTTCTTAGCCACAACCATAAGTGTAATTGGCTGCCTTTCAATAGGACATAACAGACTTCAAGTTAAATGGAAACTATTAAAGCTCAAATGATTTCTGCCGTTGTTTCATGACAAATGTACAGTTTTATTATTATGAGTATTCATTCTTGTACAGTGGTCGACAGCGATTTGAATATACATGATTATTCACATGCCCGATGATCTAGAATGGTATATATCATCACAGTTAAAAGAGGTAACCAAGGTGACGTTGCTTCAGGTGCTAATGTCATTATAAAACTGTAAAGCAATTATTCTGTTAGTTCTTGCCTGGGGTTATGAATACATTGAAGAGTGCAGAGCCATTAGATGTAGCTACTTCAAAAAGCGGGGAAAAGAAGATGAATTTTAATGCATGGGTAATTGCTCAACATGACCGCATTCCTAATAAATTTTTGTATATTAGAAAAAAAGAACAGAATATATTTGTGTATGATGCATGAAACAGAAAAAAAACTCAGATGAAATGCATGCAATTAATTTTATGGTAAAATCATTTTATGACTATGCACCTTATAATAAATATAATTTGTTAGAGCACAATTCCATGCATAAAATGTAATTTGTTTTTTCTGCTTTCAT

#index
bwa index ref2.fa

#new read that is from the second locus
cat r3.fa 
>my_read
CGCAAGACTTTTTACATGCCTTTCAATTTTAAGCAACGATTAAAGCAATTAAGATTGCTGCATTTACAAGCTATTTTTCATTCCCAGAAAATATCCTACA

#align the reads
bwa aln ref2.fa r1.fa > r1.sai
bwa aln ref2.fa r3.fa > r3.sai

#generate the paired-end alignment
bwa sampe -a 400 ref2.fa r1.sai r3.sai r1.fa r3.fa > my_read2.sam

cat my_read2.sam 
@SQ	SN:chr2:172936693-172938111	LN:1418
@SQ	SN:chr10:131642920-131644198	LN:1278
@PG	ID:bwa	PN:bwa	VN:0.7.10-r789	CL:bwa sampe -a 400 ref2.fa r1.sai r3.sai r1.fa r3.fa
my_read	65	chr2:172936693-172938111	129	37	100M	chr10:131642920-131644198	261	0	CTAACTAGCCTGGGAAAAAAGGATAGTGTCTCTCTGTTCTTTCATAGGAAATGTTGAATCAGACCCCTACTGGGAAAAGAAATTTAATGCATATCTCACT	*	XT:A:U	NM:i:0	SM:i:37	AM:i:37	X0:i:1	X1:i:0	XM:i:0	XO:i:0	XG:i:0	MD:Z:100
my_read	129	chr10:131642920-131644198	261	37	100M	chr2:172936693-172938111	129	0	CGCAAGACTTTTTACATGCCTTTCAATTTTAAGCAACGATTAAAGCAATTAAGATTGCTGCATTTACAAGCTATTTTTCATTCCCAGAAAATATCCTACA	*	XT:A:U	NM:i:0	SM:i:37	AM:i:37	X0:i:1	X1:i:0	XM:i:0	XO:i:0	XG:i:0	MD:Z:100

The two bitwise flags for the reads are 65 and 129, which indicate:

65 – read paired and first in pair
129 – read paired and second in pair

They are no longer properly paired. Also notice that the RNEXT field now contains the reference sequence of the respective mate and the TLEN is 0.

Gaining some quick statistics

We can get some simple statistics on the mate-pairs using SAMTools flagstat. Firstly we need to convert the SAM files into BAM, and then we can run flagstat:

#convert into BAM
samtools view -bS my_read.sam > my_read.bam
samtools view -bS my_read2.sam > my_read2.bam

#run flagstat
samtools flagstat my_read.bam 
2 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
2 + 0 mapped (100.00%:nan%)
2 + 0 paired in sequencing
1 + 0 read1
1 + 0 read2
2 + 0 properly paired (100.00%:nan%)
2 + 0 with itself and mate mapped
0 + 0 singletons (0.00%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

samtools flagstat my_read2.bam 
2 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
2 + 0 mapped (100.00%:nan%)
2 + 0 paired in sequencing
1 + 0 read1
1 + 0 read2
0 + 0 properly paired (0.00%:nan%)
2 + 0 with itself and mate mapped
0 + 0 singletons (0.00%:nan%)
2 + 0 with mate mapped to a different chr
2 + 0 with mate mapped to a different chr (mapQ>=5)

Using flagstat we can quickly see how many paired-end reads were properly paired, i.e., concordant reads, or not properly paired, i.e., discordant reads, with respect to the reference sequence. Discordant reads may indicate structural variants.

Rescuing ambiguously mapped reads

Imagine a single read that maps equally well to two positions in a reference; there’s a 50% chance that it originated from either of the positions. However if we have paired-end reads, the other pair may be mapping uniquely and we can leverage this information to figure out where the ambiguous read originated from. Let’s create an example:

cat ref3.fa 
>chr2:172936693-172938111
TTATGGCTTTAAAAATATTGTAACAGTGTCTGAATCAAACTTTAGTAAAATCTCTTTGGGTTATATCTGAGAAGCTTTTATTGAAGACTTTGAACAAAATTGTGTTTTTGACAGTTTTAAATTATAGGCTAACTAGCCTGGGAAAAAAGGATAGTGTCTCTCTGTTCTTTCATAGGAAATGTTGAATCAGACCCCTACTGGGAAAAGAAATTTAATGCATATCTCACTATCTTACTGTCCATGAATATAATAGAAATGAATTCAAAATGCAGTTTTATTTTTGCAAATGGGATGAGTCGATAGATGCACCTCATATTTTTGAACACCTAGGGTTCAACAAATTTACTGGTGGTGCTCTTGCATTTTAACAAAATTTATTCTTCAGTAGAAGGGGGCAGAGAACACTAGATTCTTATTCAAGCATTCTATCGAGCTCTGCATTCATGGCTGTGTCTAAAGGGCATGTCAGCCTTTGATTCTCTCTGAGAGGTAATTATCCTTTTCCTGTCACGGAACAACAAATGATAGCTAACTACAGAGGCACATTTGCAGTAGTCACATTCATCAACTGCAGAAAAAAAAATTCAATTTAATTGTGCAACACAGCTGCACATGGGCTTTTGAGCATTTCTGTTGTTCTCCCTGTCTCGCTATTCCTCCCTCCAGATCTATTTTTTAAACTTTTTTTCTGGTTATTTTTTCCCCTTTTTGTCTCTTCTTCCATTTTTACTCTCTGTACTTTCTTGTTAAAGTAATTTTCCTTTGTGGCTCTCATTCTTTTTCCCCCATTGAAGGCTATGAATGTAGAAAATTATCACAATTACTCATATAATTGAGCCTCTTTGTAGCAAGTGCAACTCCAGTAGCCTTTCTCCATCATGAAAATGGTTTCATTATAGGGTTTTTCATATTCTCTGACACCATCTACACAGAGGAACAGGCGTGCAGATGAGATGTGCTAGGAACAGGCTAGATCAGTAAGGTCACAGTAGGAATAATTAGCTCTGCTATGGAAAGAGCATCTAGGCCTTTTACTGCTACATAAATGTACTGTCCATGGCTTTTAGTCACAAAAAAAACTTACTAACAAATGGAGCTCCCGCCTACTACTTTGAAAAAAAGATTTGTATCAACACTACAATTTTCCATCATTAAGACTAATAACACAGAGCCTAGTATACATCAAGGGGAATAAAAAGAAAAATCTCACATTCAAGTGGCGGCTGGGTGCTGACCTTTGTTCCCTTTTTTTGTGTACGACTTAACTCTTTACAAAAAAGAGCCACACGCCACACCAACATGCAGGTGAACTCCAGCTAGTACTAGCAAAGCATAGCATTCAGTTGGAAAATTTGATAAATCTCCATGCAGGATAATGCATTTCATTACATATTCACTACATTAATTCTAGCTACATT
>chr2:172936993-172938111
TAGATGCACCTCATATTTTTGAACACCTAGGGTTCAACAAATTTACTGGTGGTGCTCTTGCATTTTAACAAAATTTATTCTTCAGTAGAAGGGGGCAGAGAACACTAGATTCTTATTCAAGCATTCTATCGAGCTCTGCATTCATGGCTGTGTCTAAAGGGCATGTCAGCCTTTGATTCTCTCTGAGAGGTAATTATCCTTTTCCTGTCACGGAACAACAAATGATAGCTAACTACAGAGGCACATTTGCAGTAGTCACATTCATCAACTGCAGAAAAAAAAATTCAATTTAATTGTGCAACACAGCTGCACATGGGCTTTTGAGCATTTCTGTTGTTCTCCCTGTCTCGCTATTCCTCCCTCCAGATCTATTTTTTAAACTTTTTTTCTGGTTATTTTTTCCCCTTTTTGTCTCTTCTTCCATTTTTACTCTCTGTACTTTCTTGTTAAAGTAATTTTCCTTTGTGGCTCTCATTCTTTTTCCCCCATTGAAGGCTATGAATGTAGAAAATTATCACAATTACTCATATAATTGAGCCTCTTTGTAGCAAGTGCAACTCCAGTAGCCTTTCTCCATCATGAAAATGGTTTCATTATAGGGTTTTTCATATTCTCTGACACCATCTACACAGAGGAACAGGCGTGCAGATGAGATGTGCTAGGAACAGGCTAGATCAGTAAGGTCACAGTAGGAATAATTAGCTCTGCTATGGAAAGAGCATCTAGGCCTTTTACTGCTACATAAATGTACTGTCCATGGCTTTTAGTCACAAAAAAAACTTACTAACAAATGGAGCTCCCGCCTACTACTTTGAAAAAAAGATTTGTATCAACACTACAATTTTCCATCATTAAGACTAATAACACAGAGCCTAGTATACATCAAGGGGAATAAAAAGAAAAATCTCACATTCAAGTGGCGGCTGGGTGCTGACCTTTGTTCCCTTTTTTTGTGTACGACTTAACTCTTTACAAAAAAGAGCCACACGCCACACCAACATGCAGGTGAACTCCAGCTAGTACTAGCAAAGCATAGCATTCAGTTGGAAAATTTGATAAATCTCCATGCAGGATAATGCATTTCATTACATATTCACTACATTAATTCTAGCTACATT

bwa index ref3.fa

The second sequence is a shorter duplicate of the first sequence; it’s a subset of the first sequence starting 200 bp downstream of the first. The idea is to have one of the paired-end reads mapping uniquely (to the first 200 bp of the first sequence) and the other read mapping to the duplicated region. Here are the reads:

cat ../read/r4.fa 
>my_read
AAAAATATTGTAACAGTGTCTGAATCAAACTTTAGTAAAATCTCTTTGGGTTATATCTGAGAAGCTTTTATTGAAGACTTTGAACAAAATTGTGTTTTTG
cat read/r5.fa 
>my_read
TGCACAATTAAATTGAATTTTTTTTTCTGCAGTTGATGAATGTGACTACTGCAAATGTGCCTCTGTAGTTAGCTATCATTTGTTGTTCCGTGACAGGAAA

Let’s align them as per before:

#align the reads
bwa aln ref/ref3.fa read/r4.fa > read/r4.sai
bwa aln ref/ref3.fa read/r5.fa > read/r5.sai

#generate the paired-end alignment
bwa sampe -a 600 ref/ref3.fa read/r4.sai read/r5.sai read/r4.fa read/r5.fa > aln/my_read3.sam

#check the alignment
cat aln/my_read3.sam 
@SQ	SN:chr2:172936693-172938111	LN:1418
@SQ	SN:chr2:172936993-172938111	LN:1118
@PG	ID:bwa	PN:bwa	VN:0.7.10-r789	CL:bwa sampe -a 600 ref/ref3.fa read/r4.sai read/r5.sai read/r4.fa read/r5.fa
my_read	99	chr2:172936693-172938111	11	37	100M	=	501	590	AAAAATATTGTAACAGTGTCTGAATCAAACTTTAGTAAAATCTCTTTGGGTTATATCTGAGAAGCTTTTATTGAAGACTTTGAACAAAATTGTGTTTTTG	*	XT:A:U	NM:i:0	SM:i:37	AM:i:X0:i:1	X1:i:0	XM:i:0	XO:i:0	XG:i:0	MD:Z:100
my_read	147	chr2:172936693-172938111	501	36	100M	=	11	-590	TTTCCTGTCACGGAACAACAAATGATAGCTAACTACAGAGGCACATTTGCAGTAGTCACATTCATCAACTGCAGAAAAAAAAATTCAATTTAATTGTGCA	*	XT:A:R	NM:i:0	SM:i:0	AM:i:X0:i:2	X1:i:0	XM:i:0	XO:i:0	XG:i:0	MD:Z:100	XA:Z:chr2:172936993-172938111,-201,100M,0;

The two bitwise flags are 99 and 147, indicating a proper pair (see above). The XT:A:R flag indicates that the second read is repetitive. The AM:i:X0 tag indicates the number of best hits the read mapped to (without taking into account the paired-end information); the AM indicates template independent. Note that the mapping quality of the second read is not 0 and has been mapped to the correct position; let’s see if it will be 0 if we generate an alignment by itself:

#mapping quality becomes 0
bwa samse ref/ref3.fa read/r5.sai read/r5.fa > aln/r5_se.sam
cat aln/r5_se.sam 
@SQ	SN:chr2:172936693-172938111	LN:1418
@SQ	SN:chr2:172936993-172938111	LN:1118
@PG	ID:bwa	PN:bwa	VN:0.7.10-r789	CL:bwa samse ref/ref3.fa read/r5.sai read/r5.fa
my_read	16	chr2:172936693-172938111	501	0	100M	*	0	0	TTTCCTGTCACGGAACAACAAATGATAGCTAACTACAGAGGCACATTTGCAGTAGTCACATTCATCAACTGCAGAAAAAAAAATTCAATTTAATTGTGCA	*	XT:A:R	NM:i:0	X0:i:2	X1:i:XM:i:0	XO:i:0	XG:i:0	MD:Z:100	XA:Z:chr2:172936993-172938111,-201,100M,0;

I created this example in the hope of emulating the XT:A:M tag, but I can’t find much information on this tag. This post on Biostars, suggests that the XT:A:M tag means that one of the pairs is uniquely mapped and the other isn’t. However this is the exact question I’m asking right now.

Visualising paired-end reads

IGV can be used to visualise BAM files; we just need to index them first using SAMTools:

#index the BAM files
samtools index my_read.bam
samtools index my_read2.bam

Screen Shot 2014-07-24 at 10.54.21 PMScreenshot from my brand new MacBook Air showing the two examples. The first panel shows the properly paired reads and the second panel shows only one of the pairs.

Conclusions

This was the first time I’ve looked into paired-end reads and I simply wanted to create simple examples showing reads that are properly and not properly paired. When working with single-end reads, the BAM flags were usually just 0 or 16 but as you saw there are many more combinations with paired-end reads. Naturally, I also never paid attention to the RNEXT, PNEXT, and TLEN fields. I also showed how paired-end reads can be used to rescue ambiguously mapped reads.

I hope this was useful to those that have not encountered paired-end libraries; remember to check out the blog post I linked right at the start if you haven’t already. All the files for this post are available on GitHub.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
4 comments Add yours
  1. Hello, thanks for the wonderful explanation. I was wondering though, what would you do if after aligning a full Whole Exome Sequencing data from a mouse (paired end reads) you had a good mapping quality (> 97%) but about 40-60% properly paired reads

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.