BAM to CRAM

Updated 2022 January 20th.

As pointed out by Colin, converting a BAM file to CRAM is simply one command:

samtools view -T genome/chrX.fa -C -o eg/ERR188273_chrX.cram eg/ERR188273_chrX.bam

Of note is that the reference file used to produce the BAM file is required and is used as an argument for the -T option.

As for why we should convert from BAM to CRAM, the reason is that CRAM files use less space. To illustrate I will align the following reads with BWA and store them as BAM and CRAM files. If you want to experiment yourself, you can download the (exome) FASTQ files as per below:

wget -c ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00284/sequence_read/ERR031940_1.filt.fastq.gz
wget -c ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00284/sequence_read/ERR031940_2.filt.fastq.gz

Align with BWA MEM.

bwa mem \
   -t 8 \
   /ref/bundle/hg38/Homo_sapiens_assembly38.fasta \
   ERR031940_1.filt.fastq.gz \
   ERR031940_2.filt.fastq.gz |
   samtools sort -@ 8 -O BAM |\
   tee ERR031940.bam |\
   samtools index - ERR031940.bam.bai

Convert to CRAM.

time samtools view -@ 8 -T /ref/bundle/hg38/Homo_sapiens_assembly38.fasta -C -o ERR031940.cram ERR031940.bam

real    1m29.007s
user    8m31.539s
sys     0m20.911s

Size difference = 8.5G versus 4.6G, which is pretty impressive.

ls -lh ERR031940.bam
-rw-rw-r-- 1 dtang dtang 8.5G Jan 20 14:04 ERR031940.bam

ls -lh ERR031940.cram
-rw-rw-r-- 1 dtang dtang 4.6G Jan 20 14:46 ERR031940.cram

However, the downside is that processing a CRAM file is slower.

time samtools view -@ 8 ERR031940.bam | wc -l
111887797

real    0m36.458s
user    3m55.326s
sys     0m49.213s

time samtools view -@ 8 -T /ref/bundle/hg38/Homo_sapiens_assembly38.fasta ERR031940.cram | wc -l
111887797

real    0m53.974s
user    4m23.496s
sys     0m51.295s

Some tools do not work with CRAM files, but you can always convert back to BAM (and use CRAM for archiving purposes). Note that ERR031940.bam (original BAM file) and ERR031940_cram.bam (BAM to CRAM back to BAM file) are different because the BAM tags are in a different order.

time samtools view -@ 8 -T /ref/bundle/hg38/Homo_sapiens_assembly38.fasta -b ERR031940.cram -o ERR031940_cram.bam

real    3m13.871s
user    22m27.343s
sys     0m34.830s

You can read the rest of the original post (from 2014) below but also note that CRAMTools is no longer supported and Samtools is recommended.

I was catching up on some blog reading and just read Ewan's latest post on CRAM going mainline. CRAM files are alignment files like BAM files but provides better compression. The toolkit for manipulating CRAM files is called CRAMTools and is a set of Java tools and APIs for efficient compression of sequence read data. This is necessary because of the constant increase in the throughput of sequencers. I went through the CRAM toolkit, format specification, the CRAM tutorial on the EBI page and decided to test it out.

Firstly I built CRAMTools from its source using Apache Ant:

#download apache-ant
wget http://ftp.jaist.ac.jp/pub/apache/ant/binaries/apache-ant-1.9.4-bin.tar.gz
tar -xzf apache-ant-1.9.4-bin.tar.gz

#download and build CRAMTools
git clone git://github.com/enasequence/cramtools.git
cd cramtools
apache-ant-1.9.4/bin/ant -f build/build.xml runnable
cd ..

#test it out
java -jar cramtools/cramtools-2.1.jar 
Version 2.1-b237

Usage: cramtools [options] [command] [command options]
  Options:    -h, --help  Print help and quit (default: false)
  Commands:
    bam         CRAM to BAM conversion. 
    cram        BAM to CRAM converter. 
    index       BAM/CRAM indexer. 
    merge       Tool to merge CRAM or BAM files. 
    fastq       CRAM to FastQ dump conversion. 
    fixheader   A tool to fix CRAM header without re-writing the whole file.
    getref      Download reference sequences.
    qstat       Quality score statistics.

OK looks like it's working. I wrote a Perl script called generate_random_seq.pl to generate a reference sequence:

#!/usr/bin/env perl

use strict;
use warnings;

my $usage = "Usage: $0 <bp>\n";
my $num = shift or die $usage;

my $random_seq = random_seq($num);
print ">$num\n$random_seq\n";

exit(0);

sub random_seq {
   my ($num) = @_;
   my @nuc = qw/ A C G T /;
   my $seq = '';
   for (1 .. $num){
      my $rand_ind = int(rand(scalar(@nuc)));
      $seq .= $nuc[$rand_ind];
   }
   return($seq);
}

Then I generated a 1,000,000 bp reference sequence:

generate_random_seq.pl 1000000 > 1000000.fa

Then I generated some reads from the reference sequence using another Perl script called subset_random_seq.pl:

cat subset_random_seq.pl 
#!/usr/bin/env perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.fa> <number> <length>\n";
my $fasta = shift or die $usage;
my $num = shift or die $usage;
my $len = shift or die $usage;

my $seq = '';
open(IN,'<',$fasta) || die "Could not open $fasta: $!\n";
while(<IN>){
   chomp;
   next if /^>/;
   $seq .= $_;
}
close(IN);

my $seq_len = length($seq);
my $limit = $seq_len - $len + 1;

if ($len > $seq_len){
   die "Your read length is longer than the input sequence\n";
}

my $fake_qual = 'B' x $len;

for (1 .. $num){
   my $random_start = int(rand($limit));
   my $substring = substr($seq,$random_start,$len);
   print "\@$_\n$substring\n+\n$fake_qual\n";
}

exit(0);

__END__

Lastly, I generated a random fastq file with 100,000 50 bp reads taken from the reference and used BWA to align the reads back to the reference:

subset_random_seq.pl 1000000.fa 100000 50 > 100000_50_bp.fq
bwa index 1000000.fa
bwa aln 1000000.fa 100000_50_bp.fq > 100000_50_bp.sai
bwa samse -f 100000_50_bp.sam 1000000.fa 100000_50_bp.sai 100000_50_bp.fq
samtools view -bS 100000_50_bp.sam | samtools sort - 100000_50_bp_sorted

ls -sh 100000_50_bp_sorted.bam
2.2M 100000_50_bp_sorted.bam

The sorted BAM file is ~2.2M. Let's convert it from BAM to CRAM using CRAMTools (I prefer the less readable but easier to type options, e.g. -I instead of --input-bam-file). While we're at it, download and compile the latest SAMtools:

#installing SAMtools
wget http://sourceforge.net/projects/samtools/files/samtools/1.1/samtools-1.1.tar.bz2
tar -xjf samtools-1.1.tar.bz2
cd samtools-1.1
make install
cd ..

#index the reference
#this is necessary for CRAMTools
samtools faidx 1000000.fa

#convert from BAM to CRAM
#I used these options to keep the sequence qualities
#and all the tag information
time java -jar cramtools/cramtools-2.1.jar cram --lossless-quality-score --capture-all-tags -I 100000_50_bp_sorted.bam -R 1000000.fa -O 100000_50_bp_sorted.cram

real    0m2.968s
user    0m5.945s
sys     0m0.242s

ls -sh 100000_50_bp_sorted.cram
496K 100000_50_bp_sorted.cram

#convert back CRAM to BAM
time java -jar cramtools/cramtools-2.1.jar bam --calculate-nm-tag -I 100000_50_bp_sorted.cram -R 1000000.fa -O 100000_50_bp_sorted_2.bam

real    0m1.949s
user    0m2.653s
sys     0m0.151s

#going from BAM to CRAM to BAM increased the file size
ls -sh 100000_50_bp_sorted_2.bam 
2.3M 100000_50_bp_sorted_2.bam

When going back from CRAM to BAM the file increased by .1M; I'm not sure why but they look the same, apart from rearranged tags and the missing MD tag:

#original
samtools view 100000_50_bp_sorted.bam | head -5
56666   0       1000000 5       37      50M     *       0       0       AGCATCCTTGAGCGAGCTTCGGAGTTCACCTTAGTCAGGAATCGCTATTT      BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB      XT:A:U  NM:i:0  X0:i:1      X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:50
23930   0       1000000 28      37      50M     *       0       0       GTTCACCTTAGTCAGGAATCGCTATTTGATGGCGTATCAGGAGTTGTCAA      BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB      XT:A:U  NM:i:0  X0:i:1      X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:50
57485   0       1000000 39      37      50M     *       0       0       TCAGGAATCGCTATTTGATGGCGTATCAGGAGTTGTCAATCGTTGGTAAG      BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB      XT:A:U  NM:i:0  X0:i:1      X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:50
23549   0       1000000 42      37      50M     *       0       0       GGAATCGCTATTTGATGGCGTATCAGGAGTTGTCAATCGTTGGTAAGTTG      BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB      XT:A:U  NM:i:0  X0:i:1      X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:50
54140   0       1000000 66      37      50M     *       0       0       AGGAGTTGTCAATCGTTGGTAAGTTGTACTCGTCAAGTCGTTCGGTTTCT      BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB      XT:A:U  NM:i:0  X0:i:1      X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:50

#CRAM to BAM version
56666   0       1000000 5       37      50M     *       0       0       AGCATCCTTGAGCGAGCTTCGGAGTTCACCTTAGTCAGGAATCGCTATTT      BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB      X0:i:1  X1:i:0  XG:i:0      NM:i:0  XM:i:0  XO:i:0  XT:A:U
23930   0       1000000 28      37      50M     *       0       0       GTTCACCTTAGTCAGGAATCGCTATTTGATGGCGTATCAGGAGTTGTCAA      BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB      X0:i:1  X1:i:0  XG:i:0      NM:i:0  XM:i:0  XO:i:0  XT:A:U
57485   0       1000000 39      37      50M     *       0       0       TCAGGAATCGCTATTTGATGGCGTATCAGGAGTTGTCAATCGTTGGTAAG      BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB      X0:i:1  X1:i:0  XG:i:0      NM:i:0  XM:i:0  XO:i:0  XT:A:U
23549   0       1000000 42      37      50M     *       0       0       GGAATCGCTATTTGATGGCGTATCAGGAGTTGTCAATCGTTGGTAAGTTG      BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB      X0:i:1  X1:i:0  XG:i:0      NM:i:0  XM:i:0  XO:i:0  XT:A:U
54140   0       1000000 66      37      50M     *       0       0       AGGAGTTGTCAATCGTTGGTAAGTTGTACTCGTCAAGTCGTTCGGTTTCT      BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB      X0:i:1  X1:i:0  XG:i:0      NM:i:0  XM:i:0  XO:i:0  XT:A:U

However, I couldn't use SAMtools to view this CRAM file:

samtools view 100000_50_bp_sorted.cram
Failed to populate reference for id 0
Slice ends beyond reference end.
Unable to fetch reference #0 5..98765
Failure to decode slice
[E::hts_close] Failed to decode sequence.
samtools: error closing "100000_50_bp_sorted.cram": -1

I found out later (after getting CRAMTools to work) that I can use SAMtools to convert BAM to CRAM:

#convert BAM to CRAM using SAMtools
time samtools view -T 1000000.fa -C -o 100000_50_bp_sorted_2.cram 100000_50_bp_sorted.bam

real    0m0.334s
user    0m0.280s
sys     0m0.012s

#slightly bigger than the CRAM file using CRAMTools
#but still ~4 times smaller
ls -sh 100000_50_bp_sorted_2.cram 
512K 100000_50_bp_sorted_2.cram

samtools view 100000_50_bp_sorted_2.cram | head -5
56666   0       1000000 5       37      50M     *       0       0       AGCATCCTTGAGCGAGCTTCGGAGTTCACCTTAGTCAGGAATCGCTATTT      BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB      XT:A:U  X0:i:1  X1:i:0      XM:i:0  XO:i:0  XG:i:0  MD:Z:50 NM:i:0
23930   0       1000000 28      37      50M     *       0       0       GTTCACCTTAGTCAGGAATCGCTATTTGATGGCGTATCAGGAGTTGTCAA      BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB      XT:A:U  X0:i:1  X1:i:0      XM:i:0  XO:i:0  XG:i:0  MD:Z:50 NM:i:0
57485   0       1000000 39      37      50M     *       0       0       TCAGGAATCGCTATTTGATGGCGTATCAGGAGTTGTCAATCGTTGGTAAG      BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB      XT:A:U  X0:i:1  X1:i:0      XM:i:0  XO:i:0  XG:i:0  MD:Z:50 NM:i:0
23549   0       1000000 42      37      50M     *       0       0       GGAATCGCTATTTGATGGCGTATCAGGAGTTGTCAATCGTTGGTAAGTTG      BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB      XT:A:U  X0:i:1  X1:i:0      XM:i:0  XO:i:0  XG:i:0  MD:Z:50 NM:i:0
54140   0       1000000 66      37      50M     *       0       0       AGGAGTTGTCAATCGTTGGTAAGTTGTACTCGTCAAGTCGTTCGGTTTCT      BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB      XT:A:U  X0:i:1  X1:i:0      XM:i:0  XO:i:0  XG:i:0  MD:Z:50 NM:i:0

More reads

With 100,000 reads the CRAM file was about a quarter of the size of the BAM file. Let's see if this is the same with more reads:

subset_random_seq.pl 1000000.fa 1000000 50 > 1000000_50_bp.fq
bwa aln 1000000.fa 1000000_50_bp.fq > 1000000_50_bp.sai
bwa samse -f 1000000_50_bp.sam 1000000.fa 1000000_50_bp.sai 1000000_50_bp.fq
samtools view -bS 1000000_50_bp.sam | samtools sort - 1000000_50_bp_sorted
time samtools view -T 1000000.fa -C -o 1000000_50_bp_sorted.cram 1000000_50_bp_sorted.bam

real    0m2.995s
user    0m2.666s
sys     0m0.110s

ls -sh 1000000_50_bp_sorted.bam
14M 1000000_50_bp_sorted.bam

ls -sh 1000000_50_bp_sorted.cram
3.9M 1000000_50_bp_sorted.cram

#still around 1/4
bc -l<<<3.9/14
.27857142857142857142

The CRAM file is roughly a quarter of the size of the BAM file. Also I noticed that it's a bit slower for SAMtools to process CRAM files:

time samtools view 1000000_50_bp_sorted.bam | head -1000000 > /dev/null

real    0m1.114s
user    0m1.341s
sys     0m0.121s

time samtools view 1000000_50_bp_sorted.cram | head -1000000 > /dev/null

real    0m2.883s
user    0m2.063s
sys     0m0.228s

But think about all the space you're saving.

Conclusions

Upgrade your SAMtools if you want to work with CRAM files.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
3 comments Add yours
  1. could be worth adding a short TLDR, since someone could think they have to download cramtools etc. when it is basically as simple as

    samtools view -T reference.fa file.bam -o file.cram

    (this is a highly rated post on google if searching for bam to cram!:)

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.