BAM to CRAM

I was catching up on some blog reading and just read Ewan's latest post on CRAM going mainline. CRAM files are alignment files like BAM files but provides better compression. The toolkit for manipulating CRAM files is called CRAMTools and is a set of Java tools and APIs for efficient compression of sequence read data. This is necessary because of the constant increase in the throughput of sequencers. I went through the CRAM toolkit, format specification, the CRAM tutorial on the EBI page and decided to test it out.

Firstly I built CRAMTools from its source using Apache Ant:

#download apache-ant
wget http://ftp.jaist.ac.jp/pub/apache/ant/binaries/apache-ant-1.9.4-bin.tar.gz
tar -xzf apache-ant-1.9.4-bin.tar.gz

#download and build CRAMTools
git clone git://github.com/enasequence/cramtools.git
cd cramtools
apache-ant-1.9.4/bin/ant -f build/build.xml runnable
cd ..

#test it out
java -jar cramtools/cramtools-2.1.jar 
Version 2.1-b237

Usage: cramtools [options] [command] [command options]
  Options:    -h, --help  Print help and quit (default: false)
  Commands:
    bam         CRAM to BAM conversion. 
    cram        BAM to CRAM converter. 
    index       BAM/CRAM indexer. 
    merge       Tool to merge CRAM or BAM files. 
    fastq       CRAM to FastQ dump conversion. 
    fixheader   A tool to fix CRAM header without re-writing the whole file.
    getref      Download reference sequences.
    qstat       Quality score statistics.

OK looks like it's working. I wrote a Perl script called generate_random_seq.pl to generate a reference sequence:

#!/usr/bin/env perl

use strict;
use warnings;

my $usage = "Usage: $0 <bp>\n";
my $num = shift or die $usage;

my $random_seq = random_seq($num);
print ">$num\n$random_seq\n";

exit(0);

sub random_seq {
   my ($num) = @_;
   my @nuc = qw/ A C G T /;
   my $seq = '';
   for (1 .. $num){
      my $rand_ind = int(rand(scalar(@nuc)));
      $seq .= $nuc[$rand_ind];
   }
   return($seq);
}

Then I generated a 1,000,000 bp reference sequence:

generate_random_seq.pl 1000000 > 1000000.fa

Then I generated some reads from the reference sequence using another Perl script called subset_random_seq.pl:

cat subset_random_seq.pl 
#!/usr/bin/env perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.fa> <number> <length>\n";
my $fasta = shift or die $usage;
my $num = shift or die $usage;
my $len = shift or die $usage;

my $seq = '';
open(IN,'<',$fasta) || die "Could not open $fasta: $!\n";
while(<IN>){
   chomp;
   next if /^>/;
   $seq .= $_;
}
close(IN);

my $seq_len = length($seq);
my $limit = $seq_len - $len + 1;

if ($len > $seq_len){
   die "Your read length is longer than the input sequence\n";
}

my $fake_qual = 'B' x $len;

for (1 .. $num){
   my $random_start = int(rand($limit));
   my $substring = substr($seq,$random_start,$len);
   print "\@$_\n$substring\n+\n$fake_qual\n";
}

exit(0);

__END__

Lastly, I generated a random fastq file with 100,000 50 bp reads taken from the reference and used BWA to align the reads back to the reference:

subset_random_seq.pl 1000000.fa 100000 50 > 100000_50_bp.fq
bwa index 1000000.fa
bwa aln 1000000.fa 100000_50_bp.fq > 100000_50_bp.sai
bwa samse -f 100000_50_bp.sam 1000000.fa 100000_50_bp.sai 100000_50_bp.fq
samtools view -bS 100000_50_bp.sam | samtools sort - 100000_50_bp_sorted

ls -sh 100000_50_bp_sorted.bam
2.2M 100000_50_bp_sorted.bam

The sorted BAM file is ~2.2M. Let's convert it from BAM to CRAM using CRAMTools (I prefer the less readable but easier to type options, e.g. -I instead of --input-bam-file). While we're at it, download and compile the latest SAMtools:

#installing SAMtools
wget http://sourceforge.net/projects/samtools/files/samtools/1.1/samtools-1.1.tar.bz2
tar -xjf samtools-1.1.tar.bz2
cd samtools-1.1
make install
cd ..

#index the reference
#this is necessary for CRAMTools
samtools faidx 1000000.fa

#convert from BAM to CRAM
#I used these options to keep the sequence qualities
#and all the tag information
time java -jar cramtools/cramtools-2.1.jar cram --lossless-quality-score --capture-all-tags -I 100000_50_bp_sorted.bam -R 1000000.fa -O 100000_50_bp_sorted.cram

real    0m2.968s
user    0m5.945s
sys     0m0.242s

ls -sh 100000_50_bp_sorted.cram
496K 100000_50_bp_sorted.cram

#convert back CRAM to BAM
time java -jar cramtools/cramtools-2.1.jar bam --calculate-nm-tag -I 100000_50_bp_sorted.cram -R 1000000.fa -O 100000_50_bp_sorted_2.bam

real    0m1.949s
user    0m2.653s
sys     0m0.151s

#going from BAM to CRAM to BAM increased the file size
ls -sh 100000_50_bp_sorted_2.bam 
2.3M 100000_50_bp_sorted_2.bam

When going back from CRAM to BAM the file increased by .1M; I'm not sure why but they look the same, apart from rearranged tags and the missing MD tag:

#original
samtools view 100000_50_bp_sorted.bam | head -5
56666   0       1000000 5       37      50M     *       0       0       AGCATCCTTGAGCGAGCTTCGGAGTTCACCTTAGTCAGGAATCGCTATTT      BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB      XT:A:U  NM:i:0  X0:i:1      X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:50
23930   0       1000000 28      37      50M     *       0       0       GTTCACCTTAGTCAGGAATCGCTATTTGATGGCGTATCAGGAGTTGTCAA      BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB      XT:A:U  NM:i:0  X0:i:1      X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:50
57485   0       1000000 39      37      50M     *       0       0       TCAGGAATCGCTATTTGATGGCGTATCAGGAGTTGTCAATCGTTGGTAAG      BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB      XT:A:U  NM:i:0  X0:i:1      X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:50
23549   0       1000000 42      37      50M     *       0       0       GGAATCGCTATTTGATGGCGTATCAGGAGTTGTCAATCGTTGGTAAGTTG      BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB      XT:A:U  NM:i:0  X0:i:1      X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:50
54140   0       1000000 66      37      50M     *       0       0       AGGAGTTGTCAATCGTTGGTAAGTTGTACTCGTCAAGTCGTTCGGTTTCT      BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB      XT:A:U  NM:i:0  X0:i:1      X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:50

#CRAM to BAM version
56666   0       1000000 5       37      50M     *       0       0       AGCATCCTTGAGCGAGCTTCGGAGTTCACCTTAGTCAGGAATCGCTATTT      BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB      X0:i:1  X1:i:0  XG:i:0      NM:i:0  XM:i:0  XO:i:0  XT:A:U
23930   0       1000000 28      37      50M     *       0       0       GTTCACCTTAGTCAGGAATCGCTATTTGATGGCGTATCAGGAGTTGTCAA      BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB      X0:i:1  X1:i:0  XG:i:0      NM:i:0  XM:i:0  XO:i:0  XT:A:U
57485   0       1000000 39      37      50M     *       0       0       TCAGGAATCGCTATTTGATGGCGTATCAGGAGTTGTCAATCGTTGGTAAG      BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB      X0:i:1  X1:i:0  XG:i:0      NM:i:0  XM:i:0  XO:i:0  XT:A:U
23549   0       1000000 42      37      50M     *       0       0       GGAATCGCTATTTGATGGCGTATCAGGAGTTGTCAATCGTTGGTAAGTTG      BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB      X0:i:1  X1:i:0  XG:i:0      NM:i:0  XM:i:0  XO:i:0  XT:A:U
54140   0       1000000 66      37      50M     *       0       0       AGGAGTTGTCAATCGTTGGTAAGTTGTACTCGTCAAGTCGTTCGGTTTCT      BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB      X0:i:1  X1:i:0  XG:i:0      NM:i:0  XM:i:0  XO:i:0  XT:A:U

However, I couldn't use SAMtools to view this CRAM file:

samtools view 100000_50_bp_sorted.cram
Failed to populate reference for id 0
Slice ends beyond reference end.
Unable to fetch reference #0 5..98765
Failure to decode slice
[E::hts_close] Failed to decode sequence.
samtools: error closing "100000_50_bp_sorted.cram": -1

I found out later (after getting CRAMTools to work) that I can use SAMtools to convert BAM to CRAM:

#convert BAM to CRAM using SAMtools
time samtools view -T 1000000.fa -C -o 100000_50_bp_sorted_2.cram 100000_50_bp_sorted.bam

real    0m0.334s
user    0m0.280s
sys     0m0.012s

#slightly bigger than the CRAM file using CRAMTools
#but still ~4 times smaller
ls -sh 100000_50_bp_sorted_2.cram 
512K 100000_50_bp_sorted_2.cram

samtools view 100000_50_bp_sorted_2.cram | head -5
56666   0       1000000 5       37      50M     *       0       0       AGCATCCTTGAGCGAGCTTCGGAGTTCACCTTAGTCAGGAATCGCTATTT      BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB      XT:A:U  X0:i:1  X1:i:0      XM:i:0  XO:i:0  XG:i:0  MD:Z:50 NM:i:0
23930   0       1000000 28      37      50M     *       0       0       GTTCACCTTAGTCAGGAATCGCTATTTGATGGCGTATCAGGAGTTGTCAA      BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB      XT:A:U  X0:i:1  X1:i:0      XM:i:0  XO:i:0  XG:i:0  MD:Z:50 NM:i:0
57485   0       1000000 39      37      50M     *       0       0       TCAGGAATCGCTATTTGATGGCGTATCAGGAGTTGTCAATCGTTGGTAAG      BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB      XT:A:U  X0:i:1  X1:i:0      XM:i:0  XO:i:0  XG:i:0  MD:Z:50 NM:i:0
23549   0       1000000 42      37      50M     *       0       0       GGAATCGCTATTTGATGGCGTATCAGGAGTTGTCAATCGTTGGTAAGTTG      BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB      XT:A:U  X0:i:1  X1:i:0      XM:i:0  XO:i:0  XG:i:0  MD:Z:50 NM:i:0
54140   0       1000000 66      37      50M     *       0       0       AGGAGTTGTCAATCGTTGGTAAGTTGTACTCGTCAAGTCGTTCGGTTTCT      BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB      XT:A:U  X0:i:1  X1:i:0      XM:i:0  XO:i:0  XG:i:0  MD:Z:50 NM:i:0

More reads

With 100,000 reads the CRAM file was about a quarter of the size of the BAM file. Let's see if this is the same with more reads:

subset_random_seq.pl 1000000.fa 1000000 50 > 1000000_50_bp.fq
bwa aln 1000000.fa 1000000_50_bp.fq > 1000000_50_bp.sai
bwa samse -f 1000000_50_bp.sam 1000000.fa 1000000_50_bp.sai 1000000_50_bp.fq
samtools view -bS 1000000_50_bp.sam | samtools sort - 1000000_50_bp_sorted
time samtools view -T 1000000.fa -C -o 1000000_50_bp_sorted.cram 1000000_50_bp_sorted.bam

real    0m2.995s
user    0m2.666s
sys     0m0.110s

ls -sh 1000000_50_bp_sorted.bam
14M 1000000_50_bp_sorted.bam

ls -sh 1000000_50_bp_sorted.cram
3.9M 1000000_50_bp_sorted.cram

#still around 1/4
bc -l<<<3.9/14
.27857142857142857142

The CRAM file is roughly a quarter of the size of the BAM file. Also I noticed that it's a bit slower for SAMtools to process CRAM files:

time samtools view 1000000_50_bp_sorted.bam | head -1000000 > /dev/null

real    0m1.114s
user    0m1.341s
sys     0m0.121s

time samtools view 1000000_50_bp_sorted.cram | head -1000000 > /dev/null

real    0m2.883s
user    0m2.063s
sys     0m0.228s

But think about all the space you're saving.

Conclusions

Upgrade your SAMtools if you want to work with CRAM files.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
One comment Add yours

Leave a Reply

Your email address will not be published. Required fields are marked *