The SAMtools mpileup utility provides a summary of the coverage of mapped reads on a reference sequence at a single base pair resolution. In addition, the output from mpileup can be piped to BCFtools to call genomic variants. I'm currently working with some Sanger sequenced PCR products, which I would like to call variants on. There are various tools for variant detection on Sanger sequences but I wanted to take this opportunity to check out SAMtools mpileup and BCFtools. In this post, I illustrate the BWA-MEM, SAMtools mpileup, and BCFtools pipeline with a bunch of randomly generated sequences.
Firstly follow these instructions to install SAMtools and HTSlib. Next install BWA from GitHub:
# create a new directory to work in mkdir mpileup; cd mpileup # clone the BWA repository git clone https://github.com/lh3/bwa.git cd bwa; make # check out the version bwa 2>&1 | head -5 Program: bwa (alignment via Burrows-Wheeler transformation) Version: 0.7.17-r1188 Contact: Heng Li <firstname.lastname@example.org> # move back into the root working directory cd ..
I've written three Perl scripts that will be used in this post: one will be used to generate an original random reference sequence, one to mutate the reference sequence, and one to create perfect paired-end reads from a reference sequence. The idea is to generate reads from the mutated reference and then align these reads back to the original reference sequence. The mutation script only generates single nucleotide variants (SNVs). Download these scripts from my GitHub Gist:
# collect the scripts inside the script directory mkdir script cd script git clone https://gist.github.com/3e9c8ae57eded71c84de.git ref mv ref/generate_random_seq.pl . git clone https://gist.github.com/a5aa7bedf923aa49b0c6.git mut mv mut/mutate_fasta.pl . git clone https://gist.github.com/fce12a99e514938b5725.git pair mv pair/random_paired_end.pl . rm -rf ref mut pair # I forgot to make some of the scripts executable on the Gist chmod 755 *.pl # move back into the root working directory cd ..
To get started, generate a reference sequence and then create an index:
# check the usage of the script # by running it without any parameters script/generate_random_seq.pl Usage: script/generate_random_seq.pl <bp> <seed> # generate ref.fa script/generate_random_seq.pl 10000 31 > ref.fa # index ref.fa bwa/bwa index ref.fa [bwa_index] Pack FASTA... 0.00 sec [bwa_index] Construct BWT for the packed sequence... [bwa_index] 0.00 seconds elapse. [bwa_index] Update BWT... 0.00 sec [bwa_index] Pack forward-only FASTA... 0.00 sec [bwa_index] Construct SA from BWT and Occ... 0.00 sec [main] Version: 0.7.17-r1188 [main] CMD: bwa/bwa index ref.fa [main] Real time: 0.011 sec; CPU: 0.005 sec
Next, generate a mutated reference sequence:
# check the usage of the script # by running it without any parameters script/mutate_fasta.pl Usage: script/mutate_fasta.pl <infile.fa> <mutation percent> <seed> # generate a mutated reference script/mutate_fasta.pl ref.fa 0.01 31 > ref_mutated.fa
Lastly, generate some paired-end reads from the mutated reference and we're done with my Perl scripts.
# check the usage of the script # by running it without any parameters script/random_paired_end.pl Usage: script/random_paired_end.pl <infile.fa> <read length> <number of pairs> <inner mate distance> <seed> # generate 1000 paired-end reads of length 100 # where the inner mate distance is 300 bp script/random_paired_end.pl ref_mutated.fa 100 1000 300 31
Before we can call any variants we need to align these reads back to the original reference. We'll use BWA-MEM to perform the alignment:
# pipe the alignment to SAMtools bwa/bwa mem ref.fa l100_n1000_d300_31_1.fq l100_n1000_d300_31_2.fq | samtools sort -o l100_n1000_d300_31_1.bam - [M::bwa_idx_load_from_disk] read 0 ALT contigs [M::process] read 2000 sequences (200000 bp)... [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 1000, 0, 0) [M::mem_pestat] skip orientation FF as there are not enough pairs [M::mem_pestat] analyzing insert size distribution for orientation FR... [M::mem_pestat] (25, 50, 75) percentile: (399, 399, 399) [M::mem_pestat] low and high boundaries for computing mean and std.dev: (399, 399) [M::mem_pestat] mean and std.dev: (399.00, 0.00) [M::mem_pestat] low and high boundaries for proper pairs: (399, 399) [M::mem_pestat] skip orientation RF as there are not enough pairs [M::mem_pestat] skip orientation RR as there are not enough pairs [M::mem_process_seqs] Processed 2000 reads in 0.068 CPU sec, 0.070 real sec [main] Version: 0.7.17-r1188 [main] CMD: bwa/bwa mem ref.fa l100_n1000_d300_31_1.fq l100_n1000_d300_31_2.fq [main] Real time: 0.074 sec; CPU: 0.073 sec # index the BAM file samtools index l100_n1000_d300_31_1.bam # obtain some simple statistics samtools flagstat l100_n1000_d300_31_1.bam 2000 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 2000 + 0 mapped (100.00% : N/A) 2000 + 0 paired in sequencing 1000 + 0 read1 1000 + 0 read2 2000 + 0 properly paired (100.00% : N/A) 2000 + 0 with itself and mate mapped 0 + 0 singletons (0.00% : N/A) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5)
Now we're ready to use SAMtools mpileup!
If you run SAMtools mpileup without any parameters, the usage and parameters will be displayed. It is useful to use the -f parameter, which specifies the reference file, so that the actual nucleotide at a genomic location is printed out. The -s parameter is useful for outputting the mapping qualities of reads.
# the sed command was used to print out # lines 500 to 515 # there's only one bam file and I'm too lazy to type it samtools mpileup -f ref.fa -s l100_n1000_d300_31_1.bam | sed -n '500,515p' [mpileup] 1 samples in 1 input files <mpileup> Set max per-file depth to 8000 10000 515 C 25 ..,...........,..,..,.,., JJJJJJJJJJJJJJJJJJJJJJJJJ ]]]]]]]]]]]]]]]]]]]]]]]]] 10000 516 G 25 ..,...........,..,..,.,., JJJJJJJJJJJJJJJJJJJJJJJJJ ]]]]]]]]]]]]]]]]]]]]]]]]] 10000 517 T 25 ..,...........,..,..,.,., JJJJJJJJJJJJJJJJJJJJJJJJJ ]]]]]]]]]]]]]]]]]]]]]]]]] 10000 518 T 25 ..,...........,..,..,.,., JJJJJJJJJJJJJJJJJJJJJJJJJ ]]]]]]]]]]]]]]]]]]]]]]]]] 10000 519 C 25 ..,...........,..,..,.,., JJJJJJJJJJJJJJJJJJJJJJJJJ ]]]]]]]]]]]]]]]]]]]]]]]]] 10000 520 G 26 .$.,...........,..,..,.,.,^], >JJJJJJJJJJJJJJJJJJJJJJJJ@ ]]]]]]]]]]]]]]]]]]]]]]]]]] 10000 521 G 26 .,...........,..,..,.,.,,^]. DJJJJJJJJJJJJJJJJJJJJJJJA7 ]]]]]]]]]]]]]]]]]]]]]]]]]] 10000 522 T 25 ,...........,..,..,.,.,,. JJJJJJJJJJJJJJJJJJJJJJJA7 ]]]]]]]]]]]]]]]]]]]]]]]]] 10000 523 A 25 tTTTTTTTTTTTtTTtTTtTtTttT JJJJJJJJJJJJJJJJJJJJJJJA7 ]]]]]]]]]]]]]]]]]]]]]]]]] 10000 524 C 27 ,...........,..,..,.,.,,... JJJJJJJJJJJJJJJJJJJJJJJJG88 ]]]]]]]]]]]]]]]]]]]]]]]]]]] 10000 525 A 27 ,...........,..,..,.,.,,... JJJJJJJJJJJJJJJJJJJJJJJJJHH ]]]]]]]]]]]]]]]]]]]]]]]]]]] 10000 526 A 27 ,...........,..,..,.,.,,... JJJJJJJJJJJJJJJJJJJJJJJJJJJ ]]]]]]]]]]]]]]]]]]]]]]]]]]] 10000 527 A 27 ,...........,..,..,.,.,,... JJJJJJJJJJJJJJJJJJJJJJJJJJJ ]]]]]]]]]]]]]]]]]]]]]]]]]]] 10000 528 C 27 ,...........,..,..,.,.,,... JJJJJJJJJJJJJJJJJJJJJJJJJJJ ]]]]]]]]]]]]]]]]]]]]]]]]]]] 10000 529 T 28 ,$...........,..,..,.,.,,...^], >JJJJJJJJJJJJJJJJJJJJJJJJJJB ]]]]]]]]]]]]]]]]]]]]]]]]]]]] 10000 530 T 27 ...........,..,..,.,.,,..., JJJJJJJJJJJJJJJJJJJJJJJJJJE ]]]]]]]]]]]]]]]]]]]]]]]]]]]
Each line represents a single genomic position and has seven columns:
- Sequence name
- 1-based coordinate
- Reference base
- Number of reads covering this position
- Read bases
- Base qualities
- Alignment mapping qualities
Looking at the first line, position 515 on our reference sequence was a C and was covered by 25 reads. The Read base column contains information on whether a read base matched, mismatched, was inserted or deleted with respect to the reference. It also contains information about whether the read base was on the positive or negative strand with respect to the reference, and whether a read base was at the start or the end of a read. A period stands for a match to the reference base on the positive strand, a comma for a match on the negative strand. We generated paired-end reads, so we have reads on both positive and negative strand. The "^]" signifies a base that was at the start of a read and the "$" signifies a base that was at the end of a read. The mutation script only generated SNVs, so there are no examples of indels.
At position 523 there is a SNV: A -> T. The lowercase t is to indicate that the t was on a read that mapped on the negative strand. More information on the read bases can be found on the Wikipedia article.
Calling SNPs/Indels using BCFtools
We can pipe the output of SAMtools mpileup directly to BCFtools but we need the -g parameter, which produces output in BCF format.
git clone git://github.com/samtools/htslib.git git clone git://github.com/samtools/bcftools.git cd bcftools make cd .. samtools mpileup -f ref.fa -g l100_n1000_d300_31_1.bam | bcftools/bcftools call -m - > l100_n1000_d300_31_1.vcf Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid [mpileup] 1 samples in 1 input files <mpileup> Set max per-file depth to 8000
The bcftools call utility calls the SNPs and Indels and the -m parameter is the recommended calling model. Let's check out the VCF file:
cat l100_n1000_d300_31_1.vcf | grep -v "^#" | sed -n '506,510p' 10000 521 . G . 285 . DP=26;MQSB=1;MQ0F=0;AN=2;DP4=19,7,0,0;MQ=60 GT 0/0 10000 522 . T . 285 . DP=28;MQSB=1;MQ0F=0;AN=2;DP4=18,7,0,0;MQ=60 GT 0/0 10000 523 . A T 225 . DP=28;VDB=0.16571;SGB=-0.692914;MQSB=1;MQ0F=0;AC=2;AN=2;DP4=0,0,18,7;MQ=60 GT:PL 1/1:255,75,0 10000 524 . C . 285 . DP=27;MQSB=1;MQ0F=0;AN=2;DP4=20,7,0,0;MQ=60 GT 0/0 10000 525 . A . 285 . DP=27;MQSB=1;MQ0F=0;AN=2;DP4=20,7,0,0;MQ=60 GT 0/0
All the information about the VCF annotations are in the metadata at the start of the VCF file. The PL format was a bit unclear though but I found the answer on BioStar:
For a biallelic site, the PL has three numbers, The first one is the probability that the site is homozgyous reference, the second is the probability that the sample is heterzygous, the third that it is homozygous for the alternate letter. The higher the number, the less likely it is that your sample is that genotype. So if your PL is 483,0,532 the software is quite sure that your sample is not homozygous reference or homozygous alternate, it's heterozygous.
The PL at position 523 is 255,75,0, which suggests homozygous alternative, i.e. 1/1.
Did we find all the variants that was introduced by my script? The mutate_fasta.pl script produces a file, ref_mutated.txt, that contains all the mutations that were introduced.
# 1-based coordinates with reference base # and mutated base cat ref_mutated.txt | head 17 T G 91 G A 115 C T 422 G C 502 A C 523 A T 570 C A 588 A G 600 T G 682 G T # how many mutations? cat ref_mutated.txt | wc -l 100 # check out the VCF entries with a genotype likelihood cat l100_n1000_d300_31_1.vcf | grep -v "^#" | grep PL | cut -f1-6 | head 10000 91 . G A 179 10000 115 . C T 182 10000 422 . G C 225 10000 502 . A C 225 10000 523 . A T 225 10000 570 . C A 225 10000 588 . A G 225 10000 600 . T G 225 10000 682 . G T 225 10000 825 . C A 225 cat l100_n1000_d300_31_1.vcf | grep -v "^#" | grep PL | wc -l 97
For this post, I generated reads with very high base-calling qualities, which should be the case with Sanger sequencing, and reads that were perfectly paired. The reference sequence was random and with read lengths of 100 bp, there shouldn't be any mapping problems. I did miss two SNVs and I suspect that they should be the ones at the end:
tail -5 ref_mutated.txt 9711 A C 9752 A G 9882 A C 9917 T C 9962 G T cat l100_n1000_d300_31_1.vcf | grep -v "^#" | grep PL | tail -5 | cut -f1-6 10000 9547 . G A 225 10000 9694 . C T 168 10000 9711 . A C 171 10000 9752 . A G 178 10000 9882 . A C 42.4147
And those SNPs were probably missed because there was no read coverage at those positions (9917 and 9962):
samtools mpileup -f ref.fa l100_n1000_d300_31_1.bam| tail -5 [mpileup] 1 samples in 1 input files <mpileup> Set max per-file depth to 8000 10000 9890 T 1 , J 10000 9891 C 1 , J 10000 9892 C 1 , J 10000 9893 G 1 , E 10000 9894 G 1 ,$ B
This work is licensed under a Creative Commons
Attribution 4.0 International License.
Very helpful. But I think there is a minor mistake in “samtools sort – l100_n1000_d300_31_1” in the part of “pipe the alignment to SAMtools”. There are errors when I executed this step.
Thanks for letting me know. It is probably due to a newer version of SAMtools, which changed the syntax.
good training example.
but, the command is not working; samtools mpileup -f ref.fa -g *.bam | bcftools call -m – > l100_n1000_d300_31_1.vcf
can you figour it out?
the new version of SAMtools changed the sort syntax; my post was based on an older version of SAMtools. I’ve updated the post and I could run every step using SAMtools version 1.6.
Here are some corrections for scripts on Ubuntu
# index ref.fa
bwa index ref.fa
# pipe the alignment to SAMtools
bwa mem ref.fa l100_n1000_d300_31_1.fq l100_n1000_d300_31_2.fq |
samtools view -bSu – |
samtools sort – l100_n1000_d300_31_1
# index the BAM file
samtools index *.bam
# obtain some simple statistics
samtools flagstat *.bam
###Calling SNPs/Indels using BCFtools
samtools mpileup -uD -f ref.fa -g *.bam | bcftools view -m – > l100_n1000_d300_31_1.raw.bcf
bcftools view l100_n1000_d300_31_1.raw.bcf > l100_n1000_d300_31_1.vcf
cat l100_n1000_d300_31_1.vcf |
grep -v “^#” |
sed -n ‘506,510p’
grep -v ^## l100_n1000_d300_31_1.vcf
grep -v ^## l100_n1000_d300_31_1.vcf | less -S
# check out the VCF entries with a genotype likelihood
cat l100_n1000_d300_31_1.vcf |
grep -v “^#” |
grep PL |
cut -f1-6 |
grep -v ^## l100_n1000_d300_31_1.vcf |
grep PL |
cut -f1-6 |
Thank you for your suggestions.
I have a question and hope you can help me. I used BWA aligned one whole genome sequencing sample. The reference I used is GRCh38. I downloaded all chromosomes and made the final GRCh38_r77.all.fa as the reference. The alignment with BWA and samtools was doing well and I have sorted and indexed the bam files. Now I need to index GRCh38_r77.all.fa
The command I used is:
samtools faidx GRCh38_r77.all.fa
It was not working with the error: Segmentation fault (core dumped).
Can you please give me some advice?
Thank you so much in advance.
it’s quite hard to troubleshoot your problem without more information. There is a Curtin University Hacky Hour every week on Wednesdays 3-4pm (http://computation.curtin.edu.au/events/hacky-hour/) and I’m sure the people there can help you out. Or if you’re around at UWA, you can join the UWA Hacky Hour on Fridays 10-11am; sometimes I’m there but Philipp is usually there.
What does mpileup stand for? memory pileup? mapped pile up? Trying to figure out the etymology of the word.
Perhaps multiple pileup.
thanks for sharing, it’s helpful and clear.
to obtain coverage of each bases, I had used to command: one is samtools mpileup sorted.bam and other is samtools depth sorted.bam, but I got totally different value from these two command. I don’t know why they are different and feel confused about it.
Are you familiar with these two command?
samtools depth and samtools mpileup generate different outputs but the coverage calculations are the same. I created a simple example at https://github.com/davetang/learning_bam_file#coverage
Thank you for the super useful blogs. I have a question regarding downstream analysis after seeing the pipe commands you wrote. I would like to compress vcf files and index them. The command I wrote was ‘bcftools view filename.vcf -Oz -o filename.vcf.gz | bcftools index’ but the second command didn’t run. Sorry if the question is not very relevant to the blog. I’m very new to bioinformatics and would be grateful if you can advise. Thank you!
thanks for the question and it’s very relevant to my blog! Your bcftools view command is saving the output to filename.vcf.gz so there is nothing being piped to the index command. You can either run them as two separate commands or if you really want to run them as one command, you can do something like this:
I did get get a warning (index: the input is probably truncated, use -f to index anyway: -), which is why I also used the -f parameter in the indexing step. So it is probably better if you just ran the two commands separately, since I’m always wary of warnings.
Lastly, I didn’t even know there’s a bcftools index command; I have been using tabix. I don’t know what the difference is but they seem to generate different binary files. To index using tabix:
Hope that helps,
Thank you for the prompt reply. Much appreciated. It clarifies. Will try tabix too!