SAMtools mpileup

The SAMtools mpileup utility provides a summary of the coverage of mapped reads on a reference sequence at a single base pair resolution. In addition, the output from mpileup can be piped to BCFtools to call genomic variants. I'm currently working with some Sanger sequenced PCR products, which I would like to call variants on. There are various tools for variant detection on Sanger sequences but I wanted to take this opportunity to check out SAMtools mpileup and BCFtools. In this post, I illustrate the BWA-MEM, SAMtools mpileup, and BCFtools pipeline with a bunch of randomly generated sequences.

Firstly follow these instructions to install SAMtools and HTSlib. Next install BWA from GitHub:

# create a new directory to work in
mkdir mpileup; cd mpileup

# clone the BWA repository
git clone https://github.com/lh3/bwa.git
cd bwa; make

# check out the version
bwa 2>&1 | head -5

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.17-r1188
Contact: Heng Li <lh3@sanger.ac.uk>

# move back into the root working directory
cd ..

I've written three Perl scripts that will be used in this post: one will be used to generate an original random reference sequence, one to mutate the reference sequence, and one to create perfect paired-end reads from a reference sequence. The idea is to generate reads from the mutated reference and then align these reads back to the original reference sequence. The mutation script only generates single nucleotide variants (SNVs). Download these scripts from my GitHub Gist:

# collect the scripts inside the script directory
mkdir script
cd script
git clone https://gist.github.com/3e9c8ae57eded71c84de.git ref
mv ref/generate_random_seq.pl .
git clone https://gist.github.com/a5aa7bedf923aa49b0c6.git mut
mv mut/mutate_fasta.pl .
git clone https://gist.github.com/fce12a99e514938b5725.git pair
mv pair/random_paired_end.pl .
rm -rf ref mut pair
# I forgot to make some of the scripts executable on the Gist
chmod 755 *.pl

# move back into the root working directory
cd ..

To get started, generate a reference sequence and then create an index:

# check the usage of the script
# by running it without any parameters
script/generate_random_seq.pl 
Usage: script/generate_random_seq.pl <bp> <seed>

# generate ref.fa
script/generate_random_seq.pl 10000 31 > ref.fa

# index ref.fa
bwa/bwa index ref.fa
[bwa_index] Pack FASTA... 0.00 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.00 seconds elapse.
[bwa_index] Update BWT... 0.00 sec
[bwa_index] Pack forward-only FASTA... 0.00 sec
[bwa_index] Construct SA from BWT and Occ... 0.00 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa/bwa index ref.fa
[main] Real time: 0.011 sec; CPU: 0.005 sec

Next, generate a mutated reference sequence:

# check the usage of the script
# by running it without any parameters
script/mutate_fasta.pl 
Usage: script/mutate_fasta.pl <infile.fa> <mutation percent> <seed>

# generate a mutated reference
script/mutate_fasta.pl ref.fa 0.01 31 > ref_mutated.fa

Lastly, generate some paired-end reads from the mutated reference and we're done with my Perl scripts.

# check the usage of the script
# by running it without any parameters
script/random_paired_end.pl 
Usage: script/random_paired_end.pl <infile.fa> <read length> <number of pairs> <inner mate distance> <seed>

# generate 1000 paired-end reads of length 100
# where the inner mate distance is 300 bp
script/random_paired_end.pl ref_mutated.fa 100 1000 300 31

Before we can call any variants we need to align these reads back to the original reference. We'll use BWA-MEM to perform the alignment:

# pipe the alignment to SAMtools
bwa/bwa mem ref.fa l100_n1000_d300_31_1.fq l100_n1000_d300_31_2.fq |
samtools sort -o l100_n1000_d300_31_1.bam -
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 2000 sequences (200000 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 1000, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (399, 399, 399)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (399, 399)
[M::mem_pestat] mean and std.dev: (399.00, 0.00)
[M::mem_pestat] low and high boundaries for proper pairs: (399, 399)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 2000 reads in 0.068 CPU sec, 0.070 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa/bwa mem ref.fa l100_n1000_d300_31_1.fq l100_n1000_d300_31_2.fq
[main] Real time: 0.074 sec; CPU: 0.073 sec

# index the BAM file
samtools index l100_n1000_d300_31_1.bam

# obtain some simple statistics
samtools flagstat l100_n1000_d300_31_1.bam
2000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
2000 + 0 mapped (100.00% : N/A)
2000 + 0 paired in sequencing
1000 + 0 read1
1000 + 0 read2
2000 + 0 properly paired (100.00% : N/A)
2000 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Now we're ready to use SAMtools mpileup!

SAMtools mpileup

If you run SAMtools mpileup without any parameters, the usage and parameters will be displayed. It is useful to use the -f parameter, which specifies the reference file, so that the actual nucleotide at a genomic location is printed out. The -s parameter is useful for outputting the mapping qualities of reads.

# the sed command was used to print out
# lines 500 to 515
# there's only one bam file and I'm too lazy to type it
samtools mpileup -f ref.fa -s l100_n1000_d300_31_1.bam |
sed -n '500,515p'
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
10000	515	C	25	..,...........,..,..,.,.,	JJJJJJJJJJJJJJJJJJJJJJJJJ	]]]]]]]]]]]]]]]]]]]]]]]]]
10000	516	G	25	..,...........,..,..,.,.,	JJJJJJJJJJJJJJJJJJJJJJJJJ	]]]]]]]]]]]]]]]]]]]]]]]]]
10000	517	T	25	..,...........,..,..,.,.,	JJJJJJJJJJJJJJJJJJJJJJJJJ	]]]]]]]]]]]]]]]]]]]]]]]]]
10000	518	T	25	..,...........,..,..,.,.,	JJJJJJJJJJJJJJJJJJJJJJJJJ	]]]]]]]]]]]]]]]]]]]]]]]]]
10000	519	C	25	..,...........,..,..,.,.,	JJJJJJJJJJJJJJJJJJJJJJJJJ	]]]]]]]]]]]]]]]]]]]]]]]]]
10000	520	G	26	.$.,...........,..,..,.,.,^],	>JJJJJJJJJJJJJJJJJJJJJJJJ@	]]]]]]]]]]]]]]]]]]]]]]]]]]
10000	521	G	26	.,...........,..,..,.,.,,^].	DJJJJJJJJJJJJJJJJJJJJJJJA7	]]]]]]]]]]]]]]]]]]]]]]]]]]
10000	522	T	25	,...........,..,..,.,.,,.	JJJJJJJJJJJJJJJJJJJJJJJA7	]]]]]]]]]]]]]]]]]]]]]]]]]
10000	523	A	25	tTTTTTTTTTTTtTTtTTtTtTttT	JJJJJJJJJJJJJJJJJJJJJJJA7	]]]]]]]]]]]]]]]]]]]]]]]]]
10000	524	C	27	,...........,..,..,.,.,,...	JJJJJJJJJJJJJJJJJJJJJJJJG88	]]]]]]]]]]]]]]]]]]]]]]]]]]]
10000	525	A	27	,...........,..,..,.,.,,...	JJJJJJJJJJJJJJJJJJJJJJJJJHH	]]]]]]]]]]]]]]]]]]]]]]]]]]]
10000	526	A	27	,...........,..,..,.,.,,...	JJJJJJJJJJJJJJJJJJJJJJJJJJJ	]]]]]]]]]]]]]]]]]]]]]]]]]]]
10000	527	A	27	,...........,..,..,.,.,,...	JJJJJJJJJJJJJJJJJJJJJJJJJJJ	]]]]]]]]]]]]]]]]]]]]]]]]]]]
10000	528	C	27	,...........,..,..,.,.,,...	JJJJJJJJJJJJJJJJJJJJJJJJJJJ	]]]]]]]]]]]]]]]]]]]]]]]]]]]
10000	529	T	28	,$...........,..,..,.,.,,...^],	>JJJJJJJJJJJJJJJJJJJJJJJJJJB	]]]]]]]]]]]]]]]]]]]]]]]]]]]]
10000	530	T	27	...........,..,..,.,.,,...,	JJJJJJJJJJJJJJJJJJJJJJJJJJE	]]]]]]]]]]]]]]]]]]]]]]]]]]]

Each line represents a single genomic position and has seven columns:

  1. Sequence name
  2. 1-based coordinate
  3. Reference base
  4. Number of reads covering this position
  5. Read bases
  6. Base qualities
  7. Alignment mapping qualities

Looking at the first line, position 515 on our reference sequence was a C and was covered by 25 reads. The Read base column contains information on whether a read base matched, mismatched, was inserted or deleted with respect to the reference. It also contains information about whether the read base was on the positive or negative strand with respect to the reference, and whether a read base was at the start or the end of a read. A period stands for a match to the reference base on the positive strand, a comma for a match on the negative strand. We generated paired-end reads, so we have reads on both positive and negative strand. The "^]" signifies a base that was at the start of a read and the "$" signifies a base that was at the end of a read. The mutation script only generated SNVs, so there are no examples of indels.

At position 523 there is a SNV: A -> T. The lowercase t is to indicate that the t was on a read that mapped on the negative strand. More information on the read bases can be found on the Wikipedia article.

Calling SNPs/Indels using BCFtools

We can pipe the output of SAMtools mpileup directly to BCFtools but we need the -g parameter, which produces output in BCF format.

git clone git://github.com/samtools/htslib.git
git clone git://github.com/samtools/bcftools.git
cd bcftools
make
cd ..

samtools mpileup -f ref.fa -g l100_n1000_d300_31_1.bam | bcftools/bcftools call -m - > l100_n1000_d300_31_1.vcf
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000

The bcftools call utility calls the SNPs and Indels and the -m parameter is the recommended calling model. Let's check out the VCF file:

cat l100_n1000_d300_31_1.vcf |
grep -v "^#" |
sed -n '506,510p'
10000	521	.	G	.	285	.	DP=26;MQSB=1;MQ0F=0;AN=2;DP4=19,7,0,0;MQ=60	GT	0/0
10000	522	.	T	.	285	.	DP=28;MQSB=1;MQ0F=0;AN=2;DP4=18,7,0,0;MQ=60	GT	0/0
10000	523	.	A	T	225	.	DP=28;VDB=0.16571;SGB=-0.692914;MQSB=1;MQ0F=0;AC=2;AN=2;DP4=0,0,18,7;MQ=60	GT:PL	1/1:255,75,0
10000	524	.	C	.	285	.	DP=27;MQSB=1;MQ0F=0;AN=2;DP4=20,7,0,0;MQ=60	GT	0/0
10000	525	.	A	.	285	.	DP=27;MQSB=1;MQ0F=0;AN=2;DP4=20,7,0,0;MQ=60	GT	0/0

All the information about the VCF annotations are in the metadata at the start of the VCF file. The PL format was a bit unclear though but I found the answer on BioStar:

For a biallelic site, the PL has three numbers, The first one is the probability that the site is homozgyous reference, the second is the probability that the sample is heterzygous, the third that it is homozygous for the alternate letter. The higher the number, the less likely it is that your sample is that genotype. So if your PL is 483,0,532 the software is quite sure that your sample is not homozygous reference or homozygous alternate, it's heterozygous.

The PL at position 523 is 255,75,0, which suggests homozygous alternative, i.e. 1/1.

Did we find all the variants that was introduced by my script? The mutate_fasta.pl script produces a file, ref_mutated.txt, that contains all the mutations that were introduced.

# 1-based coordinates with reference base
# and mutated base
cat ref_mutated.txt | head
17      T       G
91      G       A
115     C       T
422     G       C
502     A       C
523     A       T
570     C       A
588     A       G
600     T       G
682     G       T

# how many mutations?
cat ref_mutated.txt | wc -l
     100

# check out the VCF entries with a genotype likelihood
cat l100_n1000_d300_31_1.vcf |
grep -v "^#" |
grep PL |
cut -f1-6 |
head
10000	91	.	G	A	179
10000	115	.	C	T	182
10000	422	.	G	C	225
10000	502	.	A	C	225
10000	523	.	A	T	225
10000	570	.	C	A	225
10000	588	.	A	G	225
10000	600	.	T	G	225
10000	682	.	G	T	225
10000	825	.	C	A	225


cat l100_n1000_d300_31_1.vcf | grep -v "^#" | grep PL | wc -l
      97

Summary

For this post, I generated reads with very high base-calling qualities, which should be the case with Sanger sequencing, and reads that were perfectly paired. The reference sequence was random and with read lengths of 100 bp, there shouldn't be any mapping problems. I did miss two SNVs and I suspect that they should be the ones at the end:

tail -5 ref_mutated.txt 
9711    A       C
9752    A       G
9882    A       C
9917    T       C
9962    G       T

cat l100_n1000_d300_31_1.vcf | grep -v "^#" | grep PL | tail -5 | cut -f1-6
10000	9547	.	G	A	225
10000	9694	.	C	T	168
10000	9711	.	A	C	171
10000	9752	.	A	G	178
10000	9882	.	A	C	42.4147

And those SNPs were probably missed because there was no read coverage at those positions (9917 and 9962):

samtools mpileup -f ref.fa l100_n1000_d300_31_1.bam| tail -5
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
10000	9890	T	1	,	J
10000	9891	C	1	,	J
10000	9892	C	1	,	J
10000	9893	G	1	,	E
10000	9894	G	1	,$	B

Indeed.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
15 comments Add yours
  1. Very helpful. But I think there is a minor mistake in “samtools sort – l100_n1000_d300_31_1” in the part of “pipe the alignment to SAMtools”. There are errors when I executed this step.

    1. Thanks for letting me know. It is probably due to a newer version of SAMtools, which changed the syntax.

  2. Hallo Dave,

    good training example.
    but, the command is not working; samtools mpileup -f ref.fa -g *.bam | bcftools call -m – > l100_n1000_d300_31_1.vcf
    can you figour it out?

    Ju

    1. Hi Ju,

      the new version of SAMtools changed the sort syntax; my post was based on an older version of SAMtools. I’ve updated the post and I could run every step using SAMtools version 1.6.

      Cheers,

      Dave

  3. Here are some corrections for scripts on Ubuntu

    # index ref.fa
    bwa index ref.fa

    # pipe the alignment to SAMtools
    bwa mem ref.fa l100_n1000_d300_31_1.fq l100_n1000_d300_31_2.fq |
    samtools view -bSu – |
    samtools sort – l100_n1000_d300_31_1

    # index the BAM file
    samtools index *.bam

    # obtain some simple statistics
    samtools flagstat *.bam

    ###Calling SNPs/Indels using BCFtools
    samtools mpileup -uD -f ref.fa -g *.bam | bcftools view -m – > l100_n1000_d300_31_1.raw.bcf
    bcftools view l100_n1000_d300_31_1.raw.bcf > l100_n1000_d300_31_1.vcf

    cat l100_n1000_d300_31_1.vcf |
    grep -v “^#” |
    sed -n ‘506,510p’

    instead,
    grep -v ^## l100_n1000_d300_31_1.vcf
    grep -v ^## l100_n1000_d300_31_1.vcf | less -S

    # check out the VCF entries with a genotype likelihood
    cat l100_n1000_d300_31_1.vcf |
    grep -v “^#” |
    grep PL |
    cut -f1-6 |
    head

    instead,
    grep -v ^## l100_n1000_d300_31_1.vcf |
    grep PL |
    cut -f1-6 |
    head

  4. Dear Dave,
    I have a question and hope you can help me. I used BWA aligned one whole genome sequencing sample. The reference I used is GRCh38. I downloaded all chromosomes and made the final GRCh38_r77.all.fa as the reference. The alignment with BWA and samtools was doing well and I have sorted and indexed the bam files. Now I need to index GRCh38_r77.all.fa
    The command I used is:
    samtools faidx GRCh38_r77.all.fa

    It was not working with the error: Segmentation fault (core dumped).

    Can you please give me some advice?
    Thank you so much in advance.

    1. Hi Brad,

      it’s quite hard to troubleshoot your problem without more information. There is a Curtin University Hacky Hour every week on Wednesdays 3-4pm (http://computation.curtin.edu.au/events/hacky-hour/) and I’m sure the people there can help you out. Or if you’re around at UWA, you can join the UWA Hacky Hour on Fridays 10-11am; sometimes I’m there but Philipp is usually there.

      Best,

      Dave

  5. What does mpileup stand for? memory pileup? mapped pile up? Trying to figure out the etymology of the word.

  6. thanks for sharing, it’s helpful and clear.

    to obtain coverage of each bases, I had used to command: one is samtools mpileup sorted.bam and other is samtools depth sorted.bam, but I got totally different value from these two command. I don’t know why they are different and feel confused about it.

    Are you familiar with these two command?

  7. Hi Dave,

    Thank you for the super useful blogs. I have a question regarding downstream analysis after seeing the pipe commands you wrote. I would like to compress vcf files and index them. The command I wrote was ‘bcftools view filename.vcf -Oz -o filename.vcf.gz | bcftools index’ but the second command didn’t run. Sorry if the question is not very relevant to the blog. I’m very new to bioinformatics and would be grateful if you can advise. Thank you!

    1. Hi Sophie,

      thanks for the question and it’s very relevant to my blog! Your bcftools view command is saving the output to filename.vcf.gz so there is nothing being piped to the index command. You can either run them as two separate commands or if you really want to run them as one command, you can do something like this:

      bcftools view ex.vcf -Oz | tee ex.vcf.gz | bcftools index - -f -o ex.vcf.gz.csi
      

      I did get get a warning (index: the input is probably truncated, use -f to index anyway: -), which is why I also used the -f parameter in the indexing step. So it is probably better if you just ran the two commands separately, since I’m always wary of warnings.

      Lastly, I didn’t even know there’s a bcftools index command; I have been using tabix. I don’t know what the difference is but they seem to generate different binary files. To index using tabix:

      tabix -p vcf ex.vcf.gz
      

      Hope that helps,
      Dave

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.