Updated hyperlinks on the 2015 January 26th; please comment if you find any more dead links.
Picard is a suite of Java-based command-line utilities that manipulate SAM/BAM files. Currently, I'm analysing some paired-end libraries and I wanted to calculate the average insert size based on the alignments; that's how I found Picard. While reading the documentation I realised that Picard has a lot more to offer and hence I'm writing this post to look into specific tools that I thought would be useful.
To get started, download Picard and an alignment (whole genome shotgun sequencing) from the 1000 genomes project that we can use to test Picard:
#download the BAM file curl -o HG00313.chrom11.ILLUMINA.bwa.FIN.low_coverage.20120522.bam ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00313/alignment/HG00313.chrom11.ILLUMINA.bwa.FIN.low_coverage.20120522.bam #download the BAM index file curl -o HG00313.chrom11.ILLUMINA.bwa.FIN.low_coverage.20120522.bam.bai ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00313/alignment/HG00313.chrom11.ILLUMINA.bwa.FIN.low_coverage.20120522.bam.bai
There's a large suite of command line tools and most of them follow this usage format:
java jvm-args -jar PicardCommand.jar OPTION1=value1 OPTION2=value2...
I'll take a look at several of the tools below.
BamIndexStats
The tool BamIndexStats returns the number of aligned and unaligned reads per reference feature, which are chromosomes in this example.
java -jar ~/bin/picard-tools-1.117/BamIndexStats.jar INPUT=HG00313.chrom11.ILLUMINA.bwa.FIN.low_coverage.20120522.bam > bam_index_stats.tsv head -15 bam_index_stats.tsv 1 length= 249250621 Aligned= 0 Unaligned= 0 2 length= 243199373 Aligned= 0 Unaligned= 0 3 length= 198022430 Aligned= 0 Unaligned= 0 4 length= 191154276 Aligned= 0 Unaligned= 0 5 length= 180915260 Aligned= 0 Unaligned= 0 6 length= 171115067 Aligned= 0 Unaligned= 0 7 length= 159138663 Aligned= 0 Unaligned= 0 8 length= 146364022 Aligned= 0 Unaligned= 0 9 length= 141213431 Aligned= 0 Unaligned= 0 10 length= 135534747 Aligned= 0 Unaligned= 0 11 length= 135006516 Aligned= 5222600 Unaligned= 43589 12 length= 133851895 Aligned= 0 Unaligned= 0 13 length= 115169878 Aligned= 0 Unaligned= 0 14 length= 107349540 Aligned= 0 Unaligned= 0 15 length= 102531392 Aligned= 0 Unaligned= 0
Since our example file only contains alignments for chromosome 11, there are only aligned and unaligned statistics for chromosome 11.
CollectAlignmentSummaryMetrics
This tool reads a SAM/BAM file and outputs a summary containing alignment metrics.
java -jar ~/bin/picard-tools-1.117/CollectAlignmentSummaryMetrics.jar INPUT=HG00313.chrom11.ILLUMINA.bwa.FIN.low_coverage.20120522.bam OUTPUT=aln_sum_metric.tsv #check the output without the comments cat aln_sum_metric.tsv | grep -v "^#" CATEGORY TOTAL_READS PF_READS PCT_PF_READS PF_NOISE_READS PF_READS_ALIGNED PCT_PF_READS_ALIGNED PF_ALIGNED_BASES PF_HQ_ALIGNED_READS PF_HQ_ALIGNED_BASES PF_HQ_ALIGNED_Q20_BASES PF_HQ_MEDIAN_MISMATCHES PF_MISMATCH_RATE PF_HQ_ERROR_RATE PF_INDEL_RATE MEAN_READ_LENGTH READS_ALIGNED_IN_PAIRS PCT_READS_ALIGNED_IN_PAIRS BAD_CYCLES STRAND_BALANCE PCT_CHIMERAS PCT_ADAPTER SAMPLE LIBRARY READ_GROUP FIRST_OF_PAIR 2596107 2596107 1 1 0 0 0 0 0 0 0 0 0 0 89.201237 0 0 0 0 0 0.000002 SECOND_OF_PAIR 2596026 2596026 1 0 0 0 0 0 0 0 0 0 0 0 89.199531 0 0 0 0 0 0.000005 PAIR 5192133 5192133 1 1 0 0 0 0 0 0 0 0 0 0 89.200384 0 0 0 0 0 0.000004 UNPAIRED 74056 74056 1 0 0 0 0 0 0 0 0 0 0 0 81.166023 0 0 0 0 0 0
Let's transpose the output to make it more readable using this Perl script:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#!/usr/bin/env perl | |
use strict; | |
use warnings; | |
my $data = []; | |
my $t_data = []; | |
while(<>){ | |
chomp; | |
#skip comments | |
next if /^#/; | |
#skip lines without anything | |
next if /^$/; | |
#split lines on tabs | |
my @s = split(/\t/); | |
#store each line, which has been split on tabs | |
#in the array reference as an array reference | |
push(@{$data}, \@s); | |
} | |
#loop through array reference | |
for my $row (@{$data}){ | |
#go through each array reference | |
#each array element is each row of the data | |
for my $col (0 .. $#{$row}){ | |
#each row of $t_data is an array reference | |
#that is being populated with the $data columns | |
push(@{$t_data->[$col]}, $row->[$col]); | |
} | |
} | |
for my $row (@$t_data){ | |
my $line_to_print = ''; | |
for my $col (@{$row}){ | |
$line_to_print .= "$col\t"; | |
} | |
$line_to_print =~ s/\t$//; | |
print "$line_to_print\n"; | |
} | |
exit(0); |
#transpose the output using the Perl script transpose.pl aln_sum_metric.tsv CATEGORY FIRST_OF_PAIR SECOND_OF_PAIR PAIR UNPAIRED TOTAL_READS 2596107 2596026 5192133 74056 PF_READS 2596107 2596026 5192133 74056 PCT_PF_READS 1 1 1 1 PF_NOISE_READS 1 0 1 0 PF_READS_ALIGNED 0 0 0 0 PCT_PF_READS_ALIGNED 0 0 0 0 PF_ALIGNED_BASES 0 0 0 0 PF_HQ_ALIGNED_READS 0 0 0 0 PF_HQ_ALIGNED_BASES 0 0 0 0 PF_HQ_ALIGNED_Q20_BASES 0 0 0 0 PF_HQ_MEDIAN_MISMATCHES 0 0 0 0 PF_MISMATCH_RATE 0 0 0 0 PF_HQ_ERROR_RATE 0 0 0 0 PF_INDEL_RATE 0 0 0 0 MEAN_READ_LENGTH 89.201237 89.199531 89.200384 81.166023 READS_ALIGNED_IN_PAIRS 0 0 0 0 PCT_READS_ALIGNED_IN_PAIRS 0 0 0 0 BAD_CYCLES 0 0 0 0 STRAND_BALANCE 0 0 0 0 PCT_CHIMERAS 0 0 0 0 PCT_ADAPTER 0.000002 0.000005 0.000004 0 SAMPLE LIBRARY READ_GROUP
If we checkout the full definitions for the abbreviations, we see that PF is defined as passing Illumina's filter and PCT_PF_READS displays the percentage of reads that are PF (PF_READS / TOTAL_READS). See the link for the other abbreviations.
CollectInsertSizeMetrics
This tools calculates and shows the distribution of the insert sizes, which was what I wanted to tally in some paired-end libraries I'm looking at.
java -jar ~/bin/picard-tools-1.117/CollectInsertSizeMetrics.jar INPUT=HG00313.chrom11.ILLUMINA.bwa.FIN.low_coverage.20120522.bam OUTPUT=insert_size_metric.tsv H=insert_size_metric.pdf #convert to png for this blog convert insert_size_metric.pdf -background white -flatten insert_size_metric.png #the output contains values #for the histogram, which I'm not interested in #since I get a plot cat insert_size_metric.tsv | grep -v "^#" | head -3 > insert_size_metric_no_hist.tsv #transpose transpose.pl insert_size_metric_no_hist.tsv MEDIAN_INSERT_SIZE 473 MEDIAN_ABSOLUTE_DEVIATION 8 MIN_INSERT_SIZE 12 MAX_INSERT_SIZE 130846602 MEAN_INSERT_SIZE 472.187538 STANDARD_DEVIATION 17.990341 READ_PAIRS 2476685 PAIR_ORIENTATION FR WIDTH_OF_10_PERCENT 3 WIDTH_OF_20_PERCENT 7 WIDTH_OF_30_PERCENT 9 WIDTH_OF_40_PERCENT 13 WIDTH_OF_50_PERCENT 17 WIDTH_OF_60_PERCENT 19 WIDTH_OF_70_PERCENT 25 WIDTH_OF_80_PERCENT 29 WIDTH_OF_90_PERCENT 39 WIDTH_OF_99_PERCENT 73 SAMPLE LIBRARY READ_GROUP
Along with the summary statistics about the insert size, the tool also outputs a histogram:
As we saw from the summary statistics, most insert sizes are around 470 bp.
CollectRnaSeqMetrics
The CollectRnaSeqMetrics looks very useful; from the documentation:
Program to collect metrics about the alignment of RNA to various functional classes of loci in the genome: coding, intronic, UTR, intergenic, ribosomal. Also determines strand-specificity for strand-specific libraries.
I will definitely look into this tool in a future post.
CollectWgsMetrics
This tool computes a number of metrics that are useful for evaluating coverage and performance of whole genome sequencing experiments. It requires the reference sequence fasta that the reads were aligned to. Conveniently we can find out which reference was used for the alignment using samtools view:
samtools view -H HG00313.chrom11.ILLUMINA.bwa.FIN.low_coverage.20120522.bam | grep "^@SQ" | head -1 @SQ SN:1 LN:249250621 M5:1b22b98cdeb4a9304cb5d48026a85128 UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz AS:NCBI37 SP:Human #download it curl -o hs37d5.fa.gz ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz curl -o hs37d5.fa.gz.fai ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz.fai #CollectWgsMetrics can't process gz #gunzip file gunzip hs37d5.fa.gz java -jar ~/bin/picard-tools-1.117/CollectWgsMetrics.jar INPUT=HG00313.chrom11.ILLUMINA.bwa.FIN.low_coverage.20120522.bam OUTPUT=wgs_metric.tsv REFERENCE_SEQUENCE=hs37d5.fa #remove histogram info from the output cat wgs_metric.tsv | grep -v "^#" | head -3 > wgs_metric_no_hist.tsv #transpose and examine output transpose.pl wgs_metric_no_hist.tsv GENOME_TERRITORY 2900434419 MEAN_COVERAGE 0.132371 SD_COVERAGE 0.732859 MEDIAN_COVERAGE 0 MAD_COVERAGE 0 PCT_EXC_MAPQ 0.030199 PCT_EXC_DUPE 0.019441 PCT_EXC_UNPAIRED 0.015514 PCT_EXC_BASEQ 0.038496 PCT_EXC_OVERLAP 0.000179 PCT_EXC_CAPPED 0 PCT_EXC_TOTAL 0.10383 PCT_5X 0.008875 PCT_10X 0.000119 PCT_20X 0 PCT_30X 0 PCT_40X 0 PCT_50X 0 PCT_60X 0 PCT_70X 0 PCT_80X 0 PCT_90X 0 PCT_100X 0
Here are the WgsMetrics definitions:
GENOME_TERRITORY: The number of non-N bases in the genome reference over which coverage will be evaluated.
MEAN_COVERAGE: The mean coverage in bases of the genome territory, after all filters are applied.
SD_COVERAGE: The standard deviation of coverage of the genome after all filters are applied.
MEDIAN_COVERAGE: The median coverage in bases of the genome territory, after all filters are applied.
MAD_COVERAGE: The median absolute deviation of coverage of the genome after all filters are applied.
PCT_EXC_MAPQ: The fraction of aligned bases that were filtered out because they were in reads with low mapping quality (default is < 20).
PCT_EXC_DUPE: The fraction of aligned bases that were filtered out because they were in reads marked as duplicates.
PCT_EXC_UNPAIRED: The fraction of aligned bases that were filtered out because they were in reads without a mapped mate pair.
PCT_EXC_BASEQ: The fraction of aligned bases that were filtered out because they were of low base quality (default is < 20).
PCT_EXC_OVERLAP: The fraction of aligned bases that were filtered out because they were the second observation from an insert with overlapping reads.
PCT_EXC_CAPPED: The fraction of aligned bases that were filtered out because they would have raised coverage above the capped value (default cap = 250x).
PCT_EXC_TOTAL: The total fraction of aligned bases excluded due to all filters.
PCT_5X: The fraction of bases that attained at least 5X sequence coverage in post-filtering bases.
PCT_10X: The fraction of bases that attained at least 10X sequence coverage in post-filtering bases.
The coverage is low because the BAM file only contains alignments for chromosome 11. I tried to run WgsMetrics using only chromosome 11 as the reference sequence but ran into errors.
QualityScoreDistribution
java -jar ~/bin/picard-tools-1.117/QualityScoreDistribution.jar INPUT=HG00313.chrom11.ILLUMINA.bwa.FIN.low_coverage.20120522.bam OUTPUT=quality_score_dist.tsv CHART_OUTPUT=quality_score_dist.pdf #convert to png convert quality_score_dist.pdf -background white -flatten quality_score_dist.png
Sequencing quality scores of all reads.
To see the sequencing quality scores for only aligned reads, set ALIGNED_READS_ONLY to true.
Notable mentions
I didn't go through these tools, but they look very useful:
- DownsampleSam - randomly subset a SAM/BAM file
- EstimateLibraryComplexity - estimates library complexity from read pairs
- ReplaceSamHeader - self-explanatory
Conclusions
Picard contains a suite of extremely useful command-line tools for SAM/BAM files that I wish I had known earlier! But as I always say, better late than never. The files generated for this post are available at GitHub. Finally below is a full list of the Picard command-line tools:
- AddCommentsToBam
- AddOrReplaceReadGroups
- BamToBfq
- BamIndexStats
- BuildBamIndex
- CalculateHsMetrics
- CleanSam
- CollectAlignmentSummaryMetrics
- CollectGcBiasMetrics
- CollectInsertSizeMetrics
- CollectMultipleMetrics
- CollectTargetedPcrMetrics
- CollectRnaSeqMetrics
- CollectWgsMetrics
- CompareSAMs
- CreateSequenceDictionary
- DownsampleSam
- ExtractIlluminaBarcodes
- EstimateLibraryComplexity
- FastqToSam
- FilterSamReads
- FixMateInformation
- GatherBamFiles
- IlluminaBasecallsToFastq
- IlluminaBasecallsToSam
- CheckIlluminaDirectory
- IntervalListTools
- MakeSitesOnlyVcf
- MarkDuplicates
- MeanQualityByCycle
- MergeBamAlignment
- MergeSamFiles
- MergeVcfs
- NormalizeFasta
- ExtractSequences
- QualityScoreDistribution
- ReorderSam
- ReplaceSamHeader
- RevertSam
- RevertOriginalBaseQualitiesAndAddMateCigar
- SamFormatConverter
- SamToFastq
- SortSam
- VcfFormatConverter
- MarkIlluminaAdapters
- SplitVcfs
- ValidateSamFile
- ViewSam

This work is licensed under a Creative Commons
Attribution 4.0 International License.
Why is CollectAlignmentSummaryMetrics summary report showing everything as 0? I am getting the same kind of summary. Is it a bug or am I doing something wrong?
Have a look here. Those statistics were not available from the file.