SAMtools mpileup

The SAMtools mpileup utility provides a summary of the coverage of mapped reads on a reference sequence at a single base pair resolution. In addition, the output from mpileup can be piped to BCFtools to call genomic variants. I'm currently working with some Sanger sequenced PCR products, which I would like to call variants on. There are various tools for variant detection on Sanger sequences but I wanted to take this opportunity to check out SAMtools mpileup and BCFtools. In this post, I illustrate the BWA-MEM, SAMtools mpileup, and BCFtools pipeline with a bunch of randomly generated sequences.

Firstly follow these instructions to install SAMtools and HTSlib. Next install BWA from GitHub:

# create a new directory to work in
mkdir mpileup; cd mpileup

# clone the BWA repository
git clone https://github.com/lh3/bwa.git
cd bwa; make

# check out the version
bwa 2>&1 | head -5

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.12-r1044
Contact: Heng Li <lh3@sanger.ac.uk>

# move back into the root working directory
cd ..

I've written three Perl scripts that will be used in this post: one will be used to generate an original random reference sequence, one to mutate the reference sequence, and one to create perfect paired-end reads from a reference sequence. The idea is to generate reads from the mutated reference and then align these reads back to the original reference sequence. The mutation script only generates single nucleotide variants (SNVs). Download these scripts from my GitHub Gist:

# collect the scripts inside the script directory
mkdir script
cd script
git clone https://gist.github.com/3e9c8ae57eded71c84de.git ref
mv ref/generate_random_seq.pl .
git clone https://gist.github.com/a5aa7bedf923aa49b0c6.git mut
mv mut/mutate_fasta.pl .
git clone https://gist.github.com/fce12a99e514938b5725.git pair
mv pair/random_paired_end.pl .
rm -rf ref mut pair
# I forgot to make some of the scripts executable on the Gist
chmod 755 *.pl

# move back into the root working directory
cd ..

To get started, generate a reference sequence and then create an index:

# check the usage of the script
# by running it without any parameters
script/generate_random_seq.pl 
Usage: script/generate_random_seq.pl <bp> <seed>

# generate ref.fa
script/generate_random_seq.pl 10000 31 > ref.fa

# index ref.fa
bwa/bwa index ref.fa

Next, generate a mutated reference sequence:

# check the usage of the script
# by running it without any parameters
script/mutate_fasta.pl 
Usage: script/mutate_fasta.pl <infile.fa> <mutation percent> <seed>

# generate a mutated reference
script/mutate_fasta.pl ref.fa 0.01 31 > ref_mutated.fa

Lastly, generate some paired-end reads from the mutated reference and we're done with my Perl scripts.

# check the usage of the script
# by running it without any parameters
script/random_paired_end.pl 
Usage: script/random_paired_end.pl <infile.fa> <read length> <number of pairs> <inner mate distance> <seed>

# generate 1000 paired-end reads of length 100
# where the inner mate distance is 300 bp
script/random_paired_end.pl ref_mutated.fa 100 1000 300 31

Before we can call any variants we need to align these reads back to the original reference. We'll use BWA-MEM to perform the alignment:

# pipe the alignment to SAMtools
bwa/bwa mem ref.fa l100_n1000_d300_31_1.fq l100_n1000_d300_31_2.fq | 
samtools view -bSu - |
samtools sort - l100_n1000_d300_31_1

# index the BAM file
samtools index *.bam

# obtain some simple statistics
samtools flagstat *.bam
2000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
2000 + 0 mapped (100.00%:nan%)
2000 + 0 paired in sequencing
1000 + 0 read1
1000 + 0 read2
2000 + 0 properly paired (100.00%:nan%)
2000 + 0 with itself and mate mapped
0 + 0 singletons (0.00%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Now we're ready to use SAMtools mpileup!

SAMtools mpileup

If you run SAMtools mpileup without any parameters, the usage and parameters will be displayed. It is useful to use the -f parameter, which specifies the reference file, so that the actual nucleotide at a genomic location is printed out. The -s parameter is useful for outputting the mapping qualities of reads.

# the sed command was used to print out
# lines 500 to 515
# there's only one bam file and I'm too lazy to type it
samtools mpileup -f ref.fa -s *.bam |
sed -n '500,515p'
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
10000   515     C       25      ..,...........,..,..,.,.,       JJJJJJJJJJJJJJJJJJJJJJJJJ       ]]]]]]]]]]]]]]]]]]]]]]]]]
10000   516     G       25      ..,...........,..,..,.,.,       JJJJJJJJJJJJJJJJJJJJJJJJJ       ]]]]]]]]]]]]]]]]]]]]]]]]]
10000   517     T       25      ..,...........,..,..,.,.,       JJJJJJJJJJJJJJJJJJJJJJJJJ       ]]]]]]]]]]]]]]]]]]]]]]]]]
10000   518     T       25      ..,...........,..,..,.,.,       JJJJJJJJJJJJJJJJJJJJJJJJJ       ]]]]]]]]]]]]]]]]]]]]]]]]]
10000   519     C       25      ..,...........,..,..,.,.,       JJJJJJJJJJJJJJJJJJJJJJJJJ       ]]]]]]]]]]]]]]]]]]]]]]]]]
10000   520     G       26      .$.,...........,..,..,.,.,^],   >JJJJJJJJJJJJJJJJJJJJJJJJ@      ]]]]]]]]]]]]]]]]]]]]]]]]]]
10000   521     G       26      .,...........,..,..,.,.,,^].    DJJJJJJJJJJJJJJJJJJJJJJJA7      ]]]]]]]]]]]]]]]]]]]]]]]]]]
10000   522     T       25      ,...........,..,..,.,.,,.       JJJJJJJJJJJJJJJJJJJJJJJA7       ]]]]]]]]]]]]]]]]]]]]]]]]]
10000   523     A       25      tTTTTTTTTTTTtTTtTTtTtTttT       JJJJJJJJJJJJJJJJJJJJJJJA7       ]]]]]]]]]]]]]]]]]]]]]]]]]
10000   524     C       27      ,...........,..,..,.,.,,...     JJJJJJJJJJJJJJJJJJJJJJJJG88     ]]]]]]]]]]]]]]]]]]]]]]]]]]]
10000   525     A       27      ,...........,..,..,.,.,,...     JJJJJJJJJJJJJJJJJJJJJJJJJHH     ]]]]]]]]]]]]]]]]]]]]]]]]]]]
10000   526     A       27      ,...........,..,..,.,.,,...     JJJJJJJJJJJJJJJJJJJJJJJJJJJ     ]]]]]]]]]]]]]]]]]]]]]]]]]]]
10000   527     A       27      ,...........,..,..,.,.,,...     JJJJJJJJJJJJJJJJJJJJJJJJJJJ     ]]]]]]]]]]]]]]]]]]]]]]]]]]]
10000   528     C       27      ,...........,..,..,.,.,,...     JJJJJJJJJJJJJJJJJJJJJJJJJJJ     ]]]]]]]]]]]]]]]]]]]]]]]]]]]
10000   529     T       28      ,$...........,..,..,.,.,,...^], >JJJJJJJJJJJJJJJJJJJJJJJJJJB    ]]]]]]]]]]]]]]]]]]]]]]]]]]]]
10000   530     T       27      ...........,..,..,.,.,,...,     JJJJJJJJJJJJJJJJJJJJJJJJJJE     ]]]]]]]]]]]]]]]]]]]]]]]]]]]

Each line represents a single genomic position and has seven columns:

  1. Sequence name
  2. 1-based coordinate
  3. Reference base
  4. Number of reads covering this position
  5. Read bases
  6. Base qualities
  7. Alignment mapping qualities

Looking at the first line, position 515 on our reference sequence was a C and was covered by 25 reads. The Read base column contains information on whether a read base matched, mismatched, was inserted or deleted with respect to the reference. It also contains information about whether the read base was on the positive or negative strand with respect to the reference, and whether a read base was at the start or the end of a read. A period stands for a match to the reference base on the positive strand, a comma for a match on the negative strand. We generated paired-end reads, so we have reads on both positive and negative strand. The "^]" signifies a base that was at the start of a read and the "$" signifies a base that was at the end of a read. The mutation script only generated SNVs, so there are no examples of indels.

At position 523 there is a SNV: A -> T. The lowercase t is to indicate that the t was on a read that mapped on the negative strand.

Calling SNPs/Indels using BCFtools

We can pipe the output of SAMtools mpileup directly to BCFtools but we need the -g parameter, which produces output in BCF format.

# there's only one bam file and I'm too lazy to type it
samtools mpileup -f ref.fa -g *.bam | bcftools call -m - > l100_n1000_d300_31_1.vcf

The bcftools call utility calls the SNPs and Indels and the -m parameter is the recommended calling model. Let's check out the VCF file:

cat l100_n1000_d300_31_1.vcf |
grep -v "^#" |
sed -n '506,510p'
10000   521     .       G       .       999     .       DP=26;MQSB=1;MQ0F=0;AN=2;DP4=19,7,0,0;MQ=60     GT      0/0
10000   522     .       T       .       999     .       DP=28;MQSB=1;MQ0F=0;AN=2;DP4=18,7,0,0;MQ=60     GT      0/0
10000   523     .       A       T       228     .       DP=28;VDB=0.16571;SGB=-0.692914;MQSB=1;MQ0F=0;AC=2;AN=2;DP4=0,0,18,7;MQ=60      GT:PL   1/1:255,75,0
10000   524     .       C       .       999     .       DP=27;MQSB=1;MQ0F=0;AN=2;DP4=20,7,0,0;MQ=60     GT      0/0
10000   525     .       A       .       999     .       DP=27;MQSB=1;MQ0F=0;AN=2;DP4=20,7,0,0;MQ=60     GT      0/0

All the information about the VCF annotations are in the metadata at the start of the VCF file. The PL format was a bit unclear though but I found the answer on BioStar:

For a biallelic site, the PL has three numbers, The first one is the probability that the site is homozgyous reference, the second is the probability that the sample is heterzygous, the third that it is homozygous for the alternate letter. The higher the number, the less likely it is that your sample is that genotype. So if your PL is 483,0,532 the software is quite sure that your sample is not homozygous reference or homozygous alternate, it's heterozygous.

The PL at position 523 is 255,75,0, which suggests homozygous alternative, i.e. 1/1.

Did we find all the variants that was introduced by my script? The mutate_fasta.pl script produces a file, ref_mutated.txt, that contains all the mutations that were introduced.

# 1-based coordinates with reference base
# and mutated base
cat ref_mutated.txt | head
17      T       G
91      G       A
115     C       T
422     G       C
502     A       C
523     A       T
570     C       A
588     A       G
600     T       G
682     G       T

# how many mutations?
cat ref_mutated.txt | wc -l
     100

# check out the VCF entries with a genotype likelihood
cat l100_n1000_d300_31_1.vcf |
grep -v "^#" |
grep PL |
cut -f1-6 |
head
10000   17      .       T       G       13.3811
10000   91      .       G       A       182
10000   115     .       C       T       187
10000   422     .       G       C       228
10000   502     .       A       C       228
10000   523     .       A       T       228
10000   570     .       C       A       228
10000   588     .       A       G       228
10000   600     .       T       G       228
10000   682     .       G       T       228

cat l100_n1000_d300_31_1.vcf | grep -v "^#" | grep PL | wc -l
      98

Summary

For this post, I generated reads with very high base-calling qualities, which should be the case with Sanger sequencing, and reads that were perfectly paired. The reference sequence was random and with read lengths of 100 bp, there shouldn't be any mapping problems. I did miss two SNVs and I suspect that they should be the ones at the end:

tail -5 ref_mutated.txt 
9711    A       C
9752    A       G
9882    A       C
9917    T       C
9962    G       T

cat l100_n1000_d300_31_1.vcf | grep -v "^#" | grep PL | tail -5 | cut -f1-6
10000   9547    .       G       A       228
10000   9694    .       C       T       171
10000   9711    .       A       C       174
10000   9752    .       A       G       181
10000   9882    .       A       C       44.335

And those SNPs were probably missed because there was no read coverage at those positions (9917 and 9962):

samtools mpileup -f ref.fa *.bam| tail -5
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
10000   9890    T       1       ,       J
10000   9891    C       1       ,       J
10000   9892    C       1       ,       J
10000   9893    G       1       ,       E
10000   9894    G       1       ,$      B

Indeed.

Print Friendly



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
2 comments Add yours
  1. Very helpful. But I think there is a minor mistake in "samtools sort - l100_n1000_d300_31_1" in the part of "pipe the alignment to SAMtools". There are errors when I executed this step.

Leave a Reply

Your email address will not be published. Required fields are marked *

Time limit is exhausted. Please reload CAPTCHA.