Getting started with Picard

Updated hyperlinks on the 2015 January 26th; please comment if you find any more dead links.

Picard is a suite of Java-based command-line utilities that manipulate SAM/BAM files. Currently, I'm analysing some paired-end libraries and I wanted to calculate the average insert size based on the alignments; that's how I found Picard. While reading the documentation I realised that Picard has a lot more to offer and hence I'm writing this post to look into specific tools that I thought would be useful.

To get started, download Picard and an alignment (whole genome shotgun sequencing) from the 1000 genomes project that we can use to test Picard:

#download the BAM file
curl -o HG00313.chrom11.ILLUMINA.bwa.FIN.low_coverage.20120522.bam ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00313/alignment/HG00313.chrom11.ILLUMINA.bwa.FIN.low_coverage.20120522.bam

#download the BAM index file
curl -o HG00313.chrom11.ILLUMINA.bwa.FIN.low_coverage.20120522.bam.bai ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00313/alignment/HG00313.chrom11.ILLUMINA.bwa.FIN.low_coverage.20120522.bam.bai

There's a large suite of command line tools and most of them follow this usage format:

java jvm-args -jar PicardCommand.jar OPTION1=value1 OPTION2=value2...

I'll take a look at several of the tools below.

BamIndexStats

The tool BamIndexStats returns the number of aligned and unaligned reads per reference feature, which are chromosomes in this example.

java -jar ~/bin/picard-tools-1.117/BamIndexStats.jar INPUT=HG00313.chrom11.ILLUMINA.bwa.FIN.low_coverage.20120522.bam > bam_index_stats.tsv

head -15 bam_index_stats.tsv 
1 length=       249250621       Aligned= 0      Unaligned= 0
2 length=       243199373       Aligned= 0      Unaligned= 0
3 length=       198022430       Aligned= 0      Unaligned= 0
4 length=       191154276       Aligned= 0      Unaligned= 0
5 length=       180915260       Aligned= 0      Unaligned= 0
6 length=       171115067       Aligned= 0      Unaligned= 0
7 length=       159138663       Aligned= 0      Unaligned= 0
8 length=       146364022       Aligned= 0      Unaligned= 0
9 length=       141213431       Aligned= 0      Unaligned= 0
10 length=      135534747       Aligned= 0      Unaligned= 0
11 length=      135006516       Aligned= 5222600        Unaligned= 43589
12 length=      133851895       Aligned= 0      Unaligned= 0
13 length=      115169878       Aligned= 0      Unaligned= 0
14 length=      107349540       Aligned= 0      Unaligned= 0
15 length=      102531392       Aligned= 0      Unaligned= 0

Since our example file only contains alignments for chromosome 11, there are only aligned and unaligned statistics for chromosome 11.

CollectAlignmentSummaryMetrics

This tool reads a SAM/BAM file and outputs a summary containing alignment metrics.

java -jar ~/bin/picard-tools-1.117/CollectAlignmentSummaryMetrics.jar INPUT=HG00313.chrom11.ILLUMINA.bwa.FIN.low_coverage.20120522.bam OUTPUT=aln_sum_metric.tsv

#check the output without the comments
cat aln_sum_metric.tsv | grep -v "^#"

CATEGORY        TOTAL_READS     PF_READS        PCT_PF_READS    PF_NOISE_READS  PF_READS_ALIGNED        PCT_PF_READS_ALIGNED    PF_ALIGNED_BASES        PF_HQ_ALIGNED_READS     PF_HQ_ALIGNED_BASES     PF_HQ_ALIGNED_Q20_BASES     PF_HQ_MEDIAN_MISMATCHES PF_MISMATCH_RATE        PF_HQ_ERROR_RATE        PF_INDEL_RATE   MEAN_READ_LENGTH        READS_ALIGNED_IN_PAIRS  PCT_READS_ALIGNED_IN_PAIRS      BAD_CYCLES STRAND_BALANCE   PCT_CHIMERAS    PCT_ADAPTER     SAMPLE  LIBRARY READ_GROUP
FIRST_OF_PAIR   2596107 2596107 1       1       0       0       0       0       0       0       0       0       0       0       89.201237       0       0       0       0       0       0.000002
SECOND_OF_PAIR  2596026 2596026 1       0       0       0       0       0       0       0       0       0       0       0       89.199531       0       0       0       0       0       0.000005
PAIR    5192133 5192133 1       1       0       0       0       0       0       0       0       0       0       0       89.200384       0       0       0       0       0       0.000004
UNPAIRED        74056   74056   1       0       0       0       0       0       0       0       0       0       0       0       81.166023       0       0       0       0       0       0

Let's transpose the output to make it more readable using this Perl script:


#!/usr/bin/env perl
use strict;
use warnings;
my $data = [];
my $t_data = [];
while(<>){
chomp;
#skip comments
next if /^#/;
#skip lines without anything
next if /^$/;
#split lines on tabs
my @s = split(/\t/);
#store each line, which has been split on tabs
#in the array reference as an array reference
push(@{$data}, \@s);
}
#loop through array reference
for my $row (@{$data}){
#go through each array reference
#each array element is each row of the data
for my $col (0 .. $#{$row}){
#each row of $t_data is an array reference
#that is being populated with the $data columns
push(@{$t_data->[$col]}, $row->[$col]);
}
}
for my $row (@$t_data){
my $line_to_print = '';
for my $col (@{$row}){
$line_to_print .= "$col\t";
}
$line_to_print =~ s/\t$//;
print "$line_to_print\n";
}
exit(0);

view raw

transpose.pl

hosted with ❤ by GitHub

#transpose the output using the Perl script
transpose.pl aln_sum_metric.tsv 
CATEGORY        FIRST_OF_PAIR   SECOND_OF_PAIR  PAIR    UNPAIRED
TOTAL_READS     2596107 2596026 5192133 74056
PF_READS        2596107 2596026 5192133 74056
PCT_PF_READS    1       1       1       1
PF_NOISE_READS  1       0       1       0
PF_READS_ALIGNED        0       0       0       0
PCT_PF_READS_ALIGNED    0       0       0       0
PF_ALIGNED_BASES        0       0       0       0
PF_HQ_ALIGNED_READS     0       0       0       0
PF_HQ_ALIGNED_BASES     0       0       0       0
PF_HQ_ALIGNED_Q20_BASES 0       0       0       0
PF_HQ_MEDIAN_MISMATCHES 0       0       0       0
PF_MISMATCH_RATE        0       0       0       0
PF_HQ_ERROR_RATE        0       0       0       0
PF_INDEL_RATE   0       0       0       0
MEAN_READ_LENGTH        89.201237       89.199531       89.200384       81.166023
READS_ALIGNED_IN_PAIRS  0       0       0       0
PCT_READS_ALIGNED_IN_PAIRS      0       0       0       0
BAD_CYCLES      0       0       0       0
STRAND_BALANCE  0       0       0       0
PCT_CHIMERAS    0       0       0       0
PCT_ADAPTER     0.000002        0.000005        0.000004        0
SAMPLE
LIBRARY
READ_GROUP

If we checkout the full definitions for the abbreviations, we see that PF is defined as passing Illumina's filter and PCT_PF_READS displays the percentage of reads that are PF (PF_READS / TOTAL_READS). See the link for the other abbreviations.

CollectInsertSizeMetrics

This tools calculates and shows the distribution of the insert sizes, which was what I wanted to tally in some paired-end libraries I'm looking at.

java -jar ~/bin/picard-tools-1.117/CollectInsertSizeMetrics.jar INPUT=HG00313.chrom11.ILLUMINA.bwa.FIN.low_coverage.20120522.bam OUTPUT=insert_size_metric.tsv H=insert_size_metric.pdf

#convert to png for this blog
convert insert_size_metric.pdf -background white -flatten insert_size_metric.png

#the output contains values
#for the histogram, which I'm not interested in
#since I get a plot
cat insert_size_metric.tsv | grep -v "^#" | head -3 > insert_size_metric_no_hist.tsv

#transpose
transpose.pl insert_size_metric_no_hist.tsv 
MEDIAN_INSERT_SIZE      473
MEDIAN_ABSOLUTE_DEVIATION       8
MIN_INSERT_SIZE 12
MAX_INSERT_SIZE 130846602
MEAN_INSERT_SIZE        472.187538
STANDARD_DEVIATION      17.990341
READ_PAIRS      2476685
PAIR_ORIENTATION        FR
WIDTH_OF_10_PERCENT     3
WIDTH_OF_20_PERCENT     7
WIDTH_OF_30_PERCENT     9
WIDTH_OF_40_PERCENT     13
WIDTH_OF_50_PERCENT     17
WIDTH_OF_60_PERCENT     19
WIDTH_OF_70_PERCENT     25
WIDTH_OF_80_PERCENT     29
WIDTH_OF_90_PERCENT     39
WIDTH_OF_99_PERCENT     73
SAMPLE
LIBRARY
READ_GROUP

Along with the summary statistics about the insert size, the tool also outputs a histogram:

insert_size_metricAs we saw from the summary statistics, most insert sizes are around 470 bp.

CollectRnaSeqMetrics

The CollectRnaSeqMetrics looks very useful; from the documentation:

Program to collect metrics about the alignment of RNA to various functional classes of loci in the genome: coding, intronic, UTR, intergenic, ribosomal. Also determines strand-specificity for strand-specific libraries.

I will definitely look into this tool in a future post.

CollectWgsMetrics

This tool computes a number of metrics that are useful for evaluating coverage and performance of whole genome sequencing experiments. It requires the reference sequence fasta that the reads were aligned to. Conveniently we can find out which reference was used for the alignment using samtools view:

samtools view -H HG00313.chrom11.ILLUMINA.bwa.FIN.low_coverage.20120522.bam | grep "^@SQ" | head -1
@SQ     SN:1    LN:249250621    M5:1b22b98cdeb4a9304cb5d48026a85128     UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz AS:NCBI37       SP:Human

#download it
curl -o hs37d5.fa.gz ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
curl -o hs37d5.fa.gz.fai ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz.fai

#CollectWgsMetrics can't process gz
#gunzip file
gunzip hs37d5.fa.gz

java -jar ~/bin/picard-tools-1.117/CollectWgsMetrics.jar INPUT=HG00313.chrom11.ILLUMINA.bwa.FIN.low_coverage.20120522.bam OUTPUT=wgs_metric.tsv REFERENCE_SEQUENCE=hs37d5.fa

#remove histogram info from the output
cat wgs_metric.tsv  | grep -v "^#" | head -3 > wgs_metric_no_hist.tsv

#transpose and examine output
transpose.pl wgs_metric_no_hist.tsv 
GENOME_TERRITORY	2900434419	
MEAN_COVERAGE	0.132371	
SD_COVERAGE	0.732859	
MEDIAN_COVERAGE	0	
MAD_COVERAGE	0	
PCT_EXC_MAPQ	0.030199	
PCT_EXC_DUPE	0.019441	
PCT_EXC_UNPAIRED	0.015514	
PCT_EXC_BASEQ	0.038496	
PCT_EXC_OVERLAP	0.000179	
PCT_EXC_CAPPED	0	
PCT_EXC_TOTAL	0.10383	
PCT_5X	0.008875	
PCT_10X	0.000119	
PCT_20X	0	
PCT_30X	0	
PCT_40X	0	
PCT_50X	0	
PCT_60X	0	
PCT_70X	0	
PCT_80X	0	
PCT_90X	0	
PCT_100X	0

Here are the WgsMetrics definitions:

GENOME_TERRITORY: The number of non-N bases in the genome reference over which coverage will be evaluated.
MEAN_COVERAGE: The mean coverage in bases of the genome territory, after all filters are applied.
SD_COVERAGE: The standard deviation of coverage of the genome after all filters are applied.
MEDIAN_COVERAGE: The median coverage in bases of the genome territory, after all filters are applied.
MAD_COVERAGE: The median absolute deviation of coverage of the genome after all filters are applied.
PCT_EXC_MAPQ: The fraction of aligned bases that were filtered out because they were in reads with low mapping quality (default is < 20). PCT_EXC_DUPE: The fraction of aligned bases that were filtered out because they were in reads marked as duplicates. PCT_EXC_UNPAIRED: The fraction of aligned bases that were filtered out because they were in reads without a mapped mate pair. PCT_EXC_BASEQ: The fraction of aligned bases that were filtered out because they were of low base quality (default is < 20). PCT_EXC_OVERLAP: The fraction of aligned bases that were filtered out because they were the second observation from an insert with overlapping reads. PCT_EXC_CAPPED: The fraction of aligned bases that were filtered out because they would have raised coverage above the capped value (default cap = 250x). PCT_EXC_TOTAL: The total fraction of aligned bases excluded due to all filters. PCT_5X: The fraction of bases that attained at least 5X sequence coverage in post-filtering bases. PCT_10X: The fraction of bases that attained at least 10X sequence coverage in post-filtering bases. The coverage is low because the BAM file only contains alignments for chromosome 11. I tried to run WgsMetrics using only chromosome 11 as the reference sequence but ran into errors.

QualityScoreDistribution

java -jar ~/bin/picard-tools-1.117/QualityScoreDistribution.jar INPUT=HG00313.chrom11.ILLUMINA.bwa.FIN.low_coverage.20120522.bam OUTPUT=quality_score_dist.tsv CHART_OUTPUT=quality_score_dist.pdf

#convert to png
convert quality_score_dist.pdf -background white -flatten quality_score_dist.png

quality_score_distSequencing quality scores of all reads.

To see the sequencing quality scores for only aligned reads, set ALIGNED_READS_ONLY to true.

Notable mentions

I didn't go through these tools, but they look very useful:

Conclusions

Picard contains a suite of extremely useful command-line tools for SAM/BAM files that I wish I had known earlier! But as I always say, better late than never. The files generated for this post are available at GitHub. Finally below is a full list of the Picard command-line tools:




Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
2 comments Add yours
  1. Why is CollectAlignmentSummaryMetrics summary report showing everything as 0? I am getting the same kind of summary. Is it a bug or am I doing something wrong?

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.