SAMtools mpileup

The SAMtools mpileup utility provides a summary of the coverage of mapped reads on a reference sequence at a single base pair resolution. In addition, the output from mpileup can be piped to BCFtools to call genomic variants. I'm currently working with some Sanger sequenced PCR products, which I would like to call variants on. There are various tools for variant detection on Sanger sequences but I wanted to take this opportunity to check out SAMtools mpileup and BCFtools. In this post, I illustrate the BWA-MEM, SAMtools mpileup, and BCFtools pipeline with a bunch of randomly generated sequences.

Firstly follow these instructions to install SAMtools and HTSlib. Next install BWA from GitHub:

# create a new directory to work in
mkdir mpileup; cd mpileup

# clone the BWA repository
git clone https://github.com/lh3/bwa.git
cd bwa; make

# check out the version
bwa 2>&1 | head -5

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.17-r1188
Contact: Heng Li <lh3@sanger.ac.uk>

# move back into the root working directory
cd ..

I've written three Perl scripts that will be used in this post: one will be used to generate an original random reference sequence, one to mutate the reference sequence, and one to create perfect paired-end reads from a reference sequence. The idea is to generate reads from the mutated reference and then align these reads back to the original reference sequence. The mutation script only generates single nucleotide variants (SNVs). Download these scripts from my GitHub Gist:

# collect the scripts inside the script directory
mkdir script
cd script
git clone https://gist.github.com/3e9c8ae57eded71c84de.git ref
mv ref/generate_random_seq.pl .
git clone https://gist.github.com/a5aa7bedf923aa49b0c6.git mut
mv mut/mutate_fasta.pl .
git clone https://gist.github.com/fce12a99e514938b5725.git pair
mv pair/random_paired_end.pl .
rm -rf ref mut pair
# I forgot to make some of the scripts executable on the Gist
chmod 755 *.pl

# move back into the root working directory
cd ..

To get started, generate a reference sequence and then create an index:

# check the usage of the script
# by running it without any parameters
script/generate_random_seq.pl 
Usage: script/generate_random_seq.pl <bp> <seed>

# generate ref.fa
script/generate_random_seq.pl 10000 31 > ref.fa

# index ref.fa
bwa/bwa index ref.fa
[bwa_index] Pack FASTA... 0.00 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.00 seconds elapse.
[bwa_index] Update BWT... 0.00 sec
[bwa_index] Pack forward-only FASTA... 0.00 sec
[bwa_index] Construct SA from BWT and Occ... 0.00 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa/bwa index ref.fa
[main] Real time: 0.011 sec; CPU: 0.005 sec

Next, generate a mutated reference sequence:

# check the usage of the script
# by running it without any parameters
script/mutate_fasta.pl 
Usage: script/mutate_fasta.pl <infile.fa> <mutation percent> <seed>

# generate a mutated reference
script/mutate_fasta.pl ref.fa 0.01 31 > ref_mutated.fa

Lastly, generate some paired-end reads from the mutated reference and we're done with my Perl scripts.

# check the usage of the script
# by running it without any parameters
script/random_paired_end.pl 
Usage: script/random_paired_end.pl <infile.fa> <read length> <number of pairs> <inner mate distance> <seed>

# generate 1000 paired-end reads of length 100
# where the inner mate distance is 300 bp
script/random_paired_end.pl ref_mutated.fa 100 1000 300 31

Before we can call any variants we need to align these reads back to the original reference. We'll use BWA-MEM to perform the alignment:

# pipe the alignment to SAMtools
bwa/bwa mem ref.fa l100_n1000_d300_31_1.fq l100_n1000_d300_31_2.fq |
samtools sort -o l100_n1000_d300_31_1.bam -
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 2000 sequences (200000 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 1000, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (399, 399, 399)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (399, 399)
[M::mem_pestat] mean and std.dev: (399.00, 0.00)
[M::mem_pestat] low and high boundaries for proper pairs: (399, 399)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 2000 reads in 0.068 CPU sec, 0.070 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa/bwa mem ref.fa l100_n1000_d300_31_1.fq l100_n1000_d300_31_2.fq
[main] Real time: 0.074 sec; CPU: 0.073 sec

# index the BAM file
samtools index l100_n1000_d300_31_1.bam

# obtain some simple statistics
samtools flagstat l100_n1000_d300_31_1.bam
2000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
2000 + 0 mapped (100.00% : N/A)
2000 + 0 paired in sequencing
1000 + 0 read1
1000 + 0 read2
2000 + 0 properly paired (100.00% : N/A)
2000 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Now we're ready to use SAMtools mpileup!

SAMtools mpileup

If you run SAMtools mpileup without any parameters, the usage and parameters will be displayed. It is useful to use the -f parameter, which specifies the reference file, so that the actual nucleotide at a genomic location is printed out. The -s parameter is useful for outputting the mapping qualities of reads.

# the sed command was used to print out
# lines 500 to 515
# there's only one bam file and I'm too lazy to type it
samtools mpileup -f ref.fa -s l100_n1000_d300_31_1.bam |
sed -n '500,515p'
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
10000	515	C	25	..,...........,..,..,.,.,	JJJJJJJJJJJJJJJJJJJJJJJJJ	]]]]]]]]]]]]]]]]]]]]]]]]]
10000	516	G	25	..,...........,..,..,.,.,	JJJJJJJJJJJJJJJJJJJJJJJJJ	]]]]]]]]]]]]]]]]]]]]]]]]]
10000	517	T	25	..,...........,..,..,.,.,	JJJJJJJJJJJJJJJJJJJJJJJJJ	]]]]]]]]]]]]]]]]]]]]]]]]]
10000	518	T	25	..,...........,..,..,.,.,	JJJJJJJJJJJJJJJJJJJJJJJJJ	]]]]]]]]]]]]]]]]]]]]]]]]]
10000	519	C	25	..,...........,..,..,.,.,	JJJJJJJJJJJJJJJJJJJJJJJJJ	]]]]]]]]]]]]]]]]]]]]]]]]]
10000	520	G	26	.$.,...........,..,..,.,.,^],	>JJJJJJJJJJJJJJJJJJJJJJJJ@	]]]]]]]]]]]]]]]]]]]]]]]]]]
10000	521	G	26	.,...........,..,..,.,.,,^].	DJJJJJJJJJJJJJJJJJJJJJJJA7	]]]]]]]]]]]]]]]]]]]]]]]]]]
10000	522	T	25	,...........,..,..,.,.,,.	JJJJJJJJJJJJJJJJJJJJJJJA7	]]]]]]]]]]]]]]]]]]]]]]]]]
10000	523	A	25	tTTTTTTTTTTTtTTtTTtTtTttT	JJJJJJJJJJJJJJJJJJJJJJJA7	]]]]]]]]]]]]]]]]]]]]]]]]]
10000	524	C	27	,...........,..,..,.,.,,...	JJJJJJJJJJJJJJJJJJJJJJJJG88	]]]]]]]]]]]]]]]]]]]]]]]]]]]
10000	525	A	27	,...........,..,..,.,.,,...	JJJJJJJJJJJJJJJJJJJJJJJJJHH	]]]]]]]]]]]]]]]]]]]]]]]]]]]
10000	526	A	27	,...........,..,..,.,.,,...	JJJJJJJJJJJJJJJJJJJJJJJJJJJ	]]]]]]]]]]]]]]]]]]]]]]]]]]]
10000	527	A	27	,...........,..,..,.,.,,...	JJJJJJJJJJJJJJJJJJJJJJJJJJJ	]]]]]]]]]]]]]]]]]]]]]]]]]]]
10000	528	C	27	,...........,..,..,.,.,,...	JJJJJJJJJJJJJJJJJJJJJJJJJJJ	]]]]]]]]]]]]]]]]]]]]]]]]]]]
10000	529	T	28	,$...........,..,..,.,.,,...^],	>JJJJJJJJJJJJJJJJJJJJJJJJJJB	]]]]]]]]]]]]]]]]]]]]]]]]]]]]
10000	530	T	27	...........,..,..,.,.,,...,	JJJJJJJJJJJJJJJJJJJJJJJJJJE	]]]]]]]]]]]]]]]]]]]]]]]]]]]

Each line represents a single genomic position and has seven columns:

  1. Sequence name
  2. 1-based coordinate
  3. Reference base
  4. Number of reads covering this position
  5. Read bases
  6. Base qualities
  7. Alignment mapping qualities

Looking at the first line, position 515 on our reference sequence was a C and was covered by 25 reads. The Read base column contains information on whether a read base matched, mismatched, was inserted or deleted with respect to the reference. It also contains information about whether the read base was on the positive or negative strand with respect to the reference, and whether a read base was at the start or the end of a read. A period stands for a match to the reference base on the positive strand, a comma for a match on the negative strand. We generated paired-end reads, so we have reads on both positive and negative strand. The "^]" signifies a base that was at the start of a read and the "$" signifies a base that was at the end of a read. The mutation script only generated SNVs, so there are no examples of indels.

At position 523 there is a SNV: A -> T. The lowercase t is to indicate that the t was on a read that mapped on the negative strand.

Calling SNPs/Indels using BCFtools

We can pipe the output of SAMtools mpileup directly to BCFtools but we need the -g parameter, which produces output in BCF format.

git clone git://github.com/samtools/htslib.git
git clone git://github.com/samtools/bcftools.git
cd bcftools
make
cd ..

samtools mpileup -f ref.fa -g l100_n1000_d300_31_1.bam | bcftools/bcftools call -m - > l100_n1000_d300_31_1.vcf
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000

The bcftools call utility calls the SNPs and Indels and the -m parameter is the recommended calling model. Let's check out the VCF file:

cat l100_n1000_d300_31_1.vcf |
grep -v "^#" |
sed -n '506,510p'
10000	521	.	G	.	285	.	DP=26;MQSB=1;MQ0F=0;AN=2;DP4=19,7,0,0;MQ=60	GT	0/0
10000	522	.	T	.	285	.	DP=28;MQSB=1;MQ0F=0;AN=2;DP4=18,7,0,0;MQ=60	GT	0/0
10000	523	.	A	T	225	.	DP=28;VDB=0.16571;SGB=-0.692914;MQSB=1;MQ0F=0;AC=2;AN=2;DP4=0,0,18,7;MQ=60	GT:PL	1/1:255,75,0
10000	524	.	C	.	285	.	DP=27;MQSB=1;MQ0F=0;AN=2;DP4=20,7,0,0;MQ=60	GT	0/0
10000	525	.	A	.	285	.	DP=27;MQSB=1;MQ0F=0;AN=2;DP4=20,7,0,0;MQ=60	GT	0/0

All the information about the VCF annotations are in the metadata at the start of the VCF file. The PL format was a bit unclear though but I found the answer on BioStar:

For a biallelic site, the PL has three numbers, The first one is the probability that the site is homozgyous reference, the second is the probability that the sample is heterzygous, the third that it is homozygous for the alternate letter. The higher the number, the less likely it is that your sample is that genotype. So if your PL is 483,0,532 the software is quite sure that your sample is not homozygous reference or homozygous alternate, it's heterozygous.

The PL at position 523 is 255,75,0, which suggests homozygous alternative, i.e. 1/1.

Did we find all the variants that was introduced by my script? The mutate_fasta.pl script produces a file, ref_mutated.txt, that contains all the mutations that were introduced.

# 1-based coordinates with reference base
# and mutated base
cat ref_mutated.txt | head
17      T       G
91      G       A
115     C       T
422     G       C
502     A       C
523     A       T
570     C       A
588     A       G
600     T       G
682     G       T

# how many mutations?
cat ref_mutated.txt | wc -l
     100

# check out the VCF entries with a genotype likelihood
cat l100_n1000_d300_31_1.vcf |
grep -v "^#" |
grep PL |
cut -f1-6 |
head
10000	91	.	G	A	179
10000	115	.	C	T	182
10000	422	.	G	C	225
10000	502	.	A	C	225
10000	523	.	A	T	225
10000	570	.	C	A	225
10000	588	.	A	G	225
10000	600	.	T	G	225
10000	682	.	G	T	225
10000	825	.	C	A	225


cat l100_n1000_d300_31_1.vcf | grep -v "^#" | grep PL | wc -l
      97

Summary

For this post, I generated reads with very high base-calling qualities, which should be the case with Sanger sequencing, and reads that were perfectly paired. The reference sequence was random and with read lengths of 100 bp, there shouldn't be any mapping problems. I did miss two SNVs and I suspect that they should be the ones at the end:

tail -5 ref_mutated.txt 
9711    A       C
9752    A       G
9882    A       C
9917    T       C
9962    G       T

cat l100_n1000_d300_31_1.vcf | grep -v "^#" | grep PL | tail -5 | cut -f1-6
10000	9547	.	G	A	225
10000	9694	.	C	T	168
10000	9711	.	A	C	171
10000	9752	.	A	G	178
10000	9882	.	A	C	42.4147

And those SNPs were probably missed because there was no read coverage at those positions (9917 and 9962):

samtools mpileup -f ref.fa l100_n1000_d300_31_1.bam| tail -5
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
10000	9890	T	1	,	J
10000	9891	C	1	,	J
10000	9892	C	1	,	J
10000	9893	G	1	,	E
10000	9894	G	1	,$	B

Indeed.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
4 comments Add yours
  1. Very helpful. But I think there is a minor mistake in "samtools sort - l100_n1000_d300_31_1" in the part of "pipe the alignment to SAMtools". There are errors when I executed this step.

  2. Hallo Dave,

    good training example.
    but, the command is not working; samtools mpileup -f ref.fa -g *.bam | bcftools call -m - > l100_n1000_d300_31_1.vcf
    can you figour it out?

    Ju

    1. Hi Ju,

      the new version of SAMtools changed the sort syntax; my post was based on an older version of SAMtools. I've updated the post and I could run every step using SAMtools version 1.6.

      Cheers,

      Dave

Leave a Reply

Your email address will not be published. Required fields are marked *