Last updated: 2024/12/04
TL;DR As pointed out by Colin, converting a BAM file to CRAM is simply one command:
samtools view -T genome/chrX.fa -C -o eg/ERR188273_chrX.cram eg/ERR188273_chrX.bam
Of note is that the reference file used to produce the BAM file is required and is used as an argument for the -T
option.
As for why we should convert from BAM to CRAM, the reason is that CRAM files use less space. To illustrate I will align the following reads with BWA MEM and store them as BAM and CRAM files. If you want to experiment yourself, you can download the (exome) FASTQ files and run the following steps.
wget -c ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00284/sequence_read/ERR031940_1.filt.fastq.gz
wget -c ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00284/sequence_read/ERR031940_2.filt.fastq.gz
Download the human reference genome using gsutil
.
mamba install -c conda-forge google-cloud-sdk
gsutil ls gs://genomics-public-data/references/hg38/v0/ | grep Homo_sapiens_assembly38.fasta > files_to_download.txt
mkdir -p hg38
cat files_to_download.txt | gsutil -m cp -I hg38
Install BWA MEM, if you haven't already and put it in your PATH.
git clone https://github.com/lh3/bwa.git
cd bwa && make
Align with BWA MEM.
bwa mem \
-t 8 \
hg38/Homo_sapiens_assembly38.fasta \
ERR031940_1.filt.fastq.gz \
ERR031940_2.filt.fastq.gz |
samtools sort -@ 8 -O BAM |\
tee ERR031940.bam |\
samtools index - ERR031940.bam.bai
Convert to CRAM.
/usr/bin/time -p samtools view -@ 8 -T hg38/Homo_sapiens_assembly38.fasta -C -o ERR031940.cram ERR031940.bam
real 41.92
user 272.21
sys 19.95
Size difference = 8.5G versus 4.6G, which is pretty impressive.
ls -lh ERR031940.bam
-rw-r--r-- 1 dtang dtang 8.5G Dec 4 18:35 ERR031940.bam
ls -lh ERR031940.cram
-rw-r--r-- 1 dtang dtang 4.6G Dec 4 18:36 ERR031940.cram
However, the downside is that processing a CRAM file is slower.
/usr/bin/time -p samtools view -@ 8 ERR031940.bam | wc -l
real 26.00
user 87.86
sys 31.39
111887797
/usr/bin/time -p samtools view -@ 8 -T hg38/Homo_sapiens_assembly38.fasta ERR031940.cram | wc -l
real 29.26
user 120.63
sys 17.29
111887797
Some tools do not work with CRAM files, but you can always convert back to BAM (and use CRAM for archiving purposes). Note that ERR031940.bam (original BAM file) and ERR031940_cram.bam (BAM to CRAM back to BAM file) are different because the BAM tags are in a different order.
/usr/bin/time -p samtools view -@ 8 -T hg38/Homo_sapiens_assembly38.fasta -b ERR031940.cram -o ERR031940_cram.bam
real 81.89
user 595.98
sys 26.84
Slightly different sizes.
ls -al ERR031940*.bam
-rw-r--r-- 1 dtang dtang 9112566438 Dec 4 18:35 ERR031940.bam
-rw-r--r-- 1 dtang dtang 9123199965 Dec 4 18:50 ERR031940_cram.bam
Different ordering of tags.
samtools view ERR031940.bam | head -3 > a
samtools view ERR031940_cram.bam| head -3 > b
diff a b
3c3
< ERR031940.10903848 97 chr1 10026 0 90M chr10 17160288 0 AACTCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAAACCTA GGGHGF@GGGFFFEGGDGGGFEDGEGGEGGG;GFFBEEEE+DEEEFFFBFFHBDBFF?<>DD5D@>??A##################### NM:i:2 MD:Z:3C81C4 MC:Z:90M MQ:i:60 AS:i:81 XS:i:81
---
> ERR031940.10903848 97 chr1 10026 0 90M chr10 17160288 0 AACTCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAAACCTA GGGHGF@GGGFFFEGGDGGGFEDGEGGEGGG;GFFBEEEE+DEEEFFFBFFHBDBFF?<>DD5D@>??A##################### MC:Z:90M MQ:i:60 AS:i:81 XS:i:81 MD:Z:3C81C4 NM:i:2
My laptop CPU.
cat /proc/cpuinfo| head
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 154
model name : 12th Gen Intel(R) Core(TM) i7-12650H
stepping : 3
microcode : 0xffffffff
cpu MHz : 2688.000
cache size : 24576 KB
physical id : 0
Versions of tools.
bwa
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.18-r1243-dirty
Contact: Heng Li <hli@ds.dfci.harvard.edu>
samtools version
samtools 1.19.2
Using htslib 1.19.1
Copyright (C) 2024 Genome Research Ltd.
Samtools compilation details:
Features: build=configure curses=yes
CC: gcc
CPPFLAGS:
CFLAGS: -Wall -g -O2
LDFLAGS:
HTSDIR: htslib-1.19.1
LIBS:
CURSES_LIB: -lncursesw
HTSlib compilation details:
Features: build=configure libcurl=no S3=no GCS=no libdeflate=yes lzma=yes bzip2=yes plugins=no htscodecs=1.6.0
CC: gcc
CPPFLAGS:
CFLAGS: -Wall -g -O2 -fvisibility=hidden
LDFLAGS: -fvisibility=hidden
HTSlib URL scheme handlers present:
built-in: preload, data, file
crypt4gh-needed: crypt4gh
mem: mem
You can read the rest of the original post (from 2014) below but also note that CRAMTools is no longer supported and Samtools is recommended.
Original post from 2014
I was catching up on some blog reading and just read Ewan's latest post on CRAM going mainline. CRAM files are alignment files like BAM files but provides better compression. The toolkit for manipulating CRAM files is called CRAMTools and is a set of Java tools and APIs for efficient compression of sequence read data. This is necessary because of the constant increase in the throughput of sequencers. I went through the CRAM toolkit, format specification, the CRAM tutorial on the EBI page and decided to test it out.
Firstly I built CRAMTools from its source using Apache Ant:
#download apache-ant
wget http://ftp.jaist.ac.jp/pub/apache/ant/binaries/apache-ant-1.9.4-bin.tar.gz
tar -xzf apache-ant-1.9.4-bin.tar.gz
#download and build CRAMTools
git clone git://github.com/enasequence/cramtools.git
cd cramtools
apache-ant-1.9.4/bin/ant -f build/build.xml runnable
cd ..
#test it out
java -jar cramtools/cramtools-2.1.jar
Version 2.1-b237
Usage: cramtools [options] [command] [command options]
Options: -h, --help Print help and quit (default: false)
Commands:
bam CRAM to BAM conversion.
cram BAM to CRAM converter.
index BAM/CRAM indexer.
merge Tool to merge CRAM or BAM files.
fastq CRAM to FastQ dump conversion.
fixheader A tool to fix CRAM header without re-writing the whole file.
getref Download reference sequences.
qstat Quality score statistics.
OK looks like it's working. I wrote a Perl script called generate_random_seq.pl
to generate a reference sequence:
#!/usr/bin/env perl
use strict;
use warnings;
my $usage = "Usage: $0 <bp>\n";
my $num = shift or die $usage;
my $random_seq = random_seq($num);
print ">$num\n$random_seq\n";
exit(0);
sub random_seq {
my ($num) = @_;
my @nuc = qw/ A C G T /;
my $seq = '';
for (1 .. $num){
my $rand_ind = int(rand(scalar(@nuc)));
$seq .= $nuc[$rand_ind];
}
return($seq);
}
Then I generated a 1,000,000 bp reference sequence:
generate_random_seq.pl 1000000 > 1000000.fa
Then I generated some reads from the reference sequence using another Perl script called subset_random_seq.pl
:
#!/usr/bin/env perl
use strict;
use warnings;
my $usage = "Usage: $0 <infile.fa> <number> <length>\n";
my $fasta = shift or die $usage;
my $num = shift or die $usage;
my $len = shift or die $usage;
my $seq = '';
open(IN,'<',$fasta) || die "Could not open $fasta: $!\n";
while(<IN>){
chomp;
next if /^>/;
$seq .= $_;
}
close(IN);
my $seq_len = length($seq);
my $limit = $seq_len - $len + 1;
if ($len > $seq_len){
die "Your read length is longer than the input sequence\n";
}
my $fake_qual = 'B' x $len;
for (1 .. $num){
my $random_start = int(rand($limit));
my $substring = substr($seq,$random_start,$len);
print "\@$_\n$substring\n+\n$fake_qual\n";
}
exit(0);
__END__
Lastly, I generated a random fastq file with 100,000 50 bp reads taken from the reference and used BWA to align the reads back to the reference:
subset_random_seq.pl 1000000.fa 100000 50 > 100000_50_bp.fq
bwa index 1000000.fa
bwa aln 1000000.fa 100000_50_bp.fq > 100000_50_bp.sai
bwa samse -f 100000_50_bp.sam 1000000.fa 100000_50_bp.sai 100000_50_bp.fq
samtools view -bS 100000_50_bp.sam | samtools sort - 100000_50_bp_sorted
ls -sh 100000_50_bp_sorted.bam
2.2M 100000_50_bp_sorted.bam
The sorted BAM file is ~2.2M. Let's convert it from BAM to CRAM using CRAMTools (I prefer the less readable but easier to type options, e.g. -I instead of --input-bam-file). While we're at it, download and compile the latest SAMtools:
#installing SAMtools
wget http://sourceforge.net/projects/samtools/files/samtools/1.1/samtools-1.1.tar.bz2
tar -xjf samtools-1.1.tar.bz2
cd samtools-1.1
make install
cd ..
#index the reference
#this is necessary for CRAMTools
samtools faidx 1000000.fa
#convert from BAM to CRAM
#I used these options to keep the sequence qualities
#and all the tag information
time java -jar cramtools/cramtools-2.1.jar cram --lossless-quality-score --capture-all-tags -I 100000_50_bp_sorted.bam -R 1000000.fa -O 100000_50_bp_sorted.cram
real 0m2.968s
user 0m5.945s
sys 0m0.242s
ls -sh 100000_50_bp_sorted.cram
496K 100000_50_bp_sorted.cram
#convert back CRAM to BAM
time java -jar cramtools/cramtools-2.1.jar bam --calculate-nm-tag -I 100000_50_bp_sorted.cram -R 1000000.fa -O 100000_50_bp_sorted_2.bam
real 0m1.949s
user 0m2.653s
sys 0m0.151s
#going from BAM to CRAM to BAM increased the file size
ls -sh 100000_50_bp_sorted_2.bam
2.3M 100000_50_bp_sorted_2.bam
When going back from CRAM to BAM the file increased by .1M; I'm not sure why but they look the same, apart from rearranged tags and the missing MD tag:
#original
samtools view 100000_50_bp_sorted.bam | head -5
56666 0 1000000 5 37 50M * 0 0 AGCATCCTTGAGCGAGCTTCGGAGTTCACCTTAGTCAGGAATCGCTATTT BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50
23930 0 1000000 28 37 50M * 0 0 GTTCACCTTAGTCAGGAATCGCTATTTGATGGCGTATCAGGAGTTGTCAA BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50
57485 0 1000000 39 37 50M * 0 0 TCAGGAATCGCTATTTGATGGCGTATCAGGAGTTGTCAATCGTTGGTAAG BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50
23549 0 1000000 42 37 50M * 0 0 GGAATCGCTATTTGATGGCGTATCAGGAGTTGTCAATCGTTGGTAAGTTG BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50
54140 0 1000000 66 37 50M * 0 0 AGGAGTTGTCAATCGTTGGTAAGTTGTACTCGTCAAGTCGTTCGGTTTCT BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50
#CRAM to BAM version
56666 0 1000000 5 37 50M * 0 0 AGCATCCTTGAGCGAGCTTCGGAGTTCACCTTAGTCAGGAATCGCTATTT BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB X0:i:1 X1:i:0 XG:i:0 NM:i:0 XM:i:0 XO:i:0 XT:A:U
23930 0 1000000 28 37 50M * 0 0 GTTCACCTTAGTCAGGAATCGCTATTTGATGGCGTATCAGGAGTTGTCAA BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB X0:i:1 X1:i:0 XG:i:0 NM:i:0 XM:i:0 XO:i:0 XT:A:U
57485 0 1000000 39 37 50M * 0 0 TCAGGAATCGCTATTTGATGGCGTATCAGGAGTTGTCAATCGTTGGTAAG BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB X0:i:1 X1:i:0 XG:i:0 NM:i:0 XM:i:0 XO:i:0 XT:A:U
23549 0 1000000 42 37 50M * 0 0 GGAATCGCTATTTGATGGCGTATCAGGAGTTGTCAATCGTTGGTAAGTTG BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB X0:i:1 X1:i:0 XG:i:0 NM:i:0 XM:i:0 XO:i:0 XT:A:U
54140 0 1000000 66 37 50M * 0 0 AGGAGTTGTCAATCGTTGGTAAGTTGTACTCGTCAAGTCGTTCGGTTTCT BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB X0:i:1 X1:i:0 XG:i:0 NM:i:0 XM:i:0 XO:i:0 XT:A:U
However, I couldn't use SAMtools to view this CRAM file:
samtools view 100000_50_bp_sorted.cram
Failed to populate reference for id 0
Slice ends beyond reference end.
Unable to fetch reference #0 5..98765
Failure to decode slice
[E::hts_close] Failed to decode sequence.
samtools: error closing "100000_50_bp_sorted.cram": -1
I found out later (after getting CRAMTools to work) that I can use SAMtools to convert BAM to CRAM:
#convert BAM to CRAM using SAMtools
time samtools view -T 1000000.fa -C -o 100000_50_bp_sorted_2.cram 100000_50_bp_sorted.bam
real 0m0.334s
user 0m0.280s
sys 0m0.012s
#slightly bigger than the CRAM file using CRAMTools
#but still ~4 times smaller
ls -sh 100000_50_bp_sorted_2.cram
512K 100000_50_bp_sorted_2.cram
samtools view 100000_50_bp_sorted_2.cram | head -5
56666 0 1000000 5 37 50M * 0 0 AGCATCCTTGAGCGAGCTTCGGAGTTCACCTTAGTCAGGAATCGCTATTT BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB XT:A:U X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50 NM:i:0
23930 0 1000000 28 37 50M * 0 0 GTTCACCTTAGTCAGGAATCGCTATTTGATGGCGTATCAGGAGTTGTCAA BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB XT:A:U X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50 NM:i:0
57485 0 1000000 39 37 50M * 0 0 TCAGGAATCGCTATTTGATGGCGTATCAGGAGTTGTCAATCGTTGGTAAG BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB XT:A:U X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50 NM:i:0
23549 0 1000000 42 37 50M * 0 0 GGAATCGCTATTTGATGGCGTATCAGGAGTTGTCAATCGTTGGTAAGTTG BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB XT:A:U X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50 NM:i:0
54140 0 1000000 66 37 50M * 0 0 AGGAGTTGTCAATCGTTGGTAAGTTGTACTCGTCAAGTCGTTCGGTTTCT BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB XT:A:U X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50 NM:i:0
More reads
With 100,000 reads the CRAM file was about a quarter of the size of the BAM file. Let's see if this is the same with more reads:
subset_random_seq.pl 1000000.fa 1000000 50 > 1000000_50_bp.fq
bwa aln 1000000.fa 1000000_50_bp.fq > 1000000_50_bp.sai
bwa samse -f 1000000_50_bp.sam 1000000.fa 1000000_50_bp.sai 1000000_50_bp.fq
samtools view -bS 1000000_50_bp.sam | samtools sort - 1000000_50_bp_sorted
time samtools view -T 1000000.fa -C -o 1000000_50_bp_sorted.cram 1000000_50_bp_sorted.bam
real 0m2.995s
user 0m2.666s
sys 0m0.110s
ls -sh 1000000_50_bp_sorted.bam
14M 1000000_50_bp_sorted.bam
ls -sh 1000000_50_bp_sorted.cram
3.9M 1000000_50_bp_sorted.cram
#still around 1/4
bc -l<<<3.9/14
.27857142857142857142
The CRAM file is roughly a quarter of the size of the BAM file. Also I noticed that it's a bit slower for SAMtools to process CRAM files:
time samtools view 1000000_50_bp_sorted.bam | head -1000000 > /dev/null
real 0m1.114s
user 0m1.341s
sys 0m0.121s
time samtools view 1000000_50_bp_sorted.cram | head -1000000 > /dev/null
real 0m2.883s
user 0m2.063s
sys 0m0.228s
But think about all the space you're saving.
Conclusions
Upgrade your SAMtools if you want to work with CRAM files.

This work is licensed under a Creative Commons
Attribution 4.0 International License.
could be worth adding a short TLDR, since someone could think they have to download cramtools etc. when it is basically as simple as
samtools view -T reference.fa file.bam -o file.cram
(this is a highly rated post on google if searching for bam to cram!:)
Thanks for the suggestion Colin! I’ll update the post accordingly, soon.
Hi Dave, the blog is very interesting. i wanted to ask you if it is normal that the bam converted back from the cram is smaller than the original bam.
Thanks
Hi Matteo,
I updated my post to show the actual sizes of the original BAM and the BAM converted from CRAM and the sizes are almost the same (n = 1).
Take a look inside the two BAM files to see what is different. Perhaps the CRAM file had truncated the IDs (I have a vague memory of this happening), so when you converted back to BAM from CRAM, the size is smaller because there is less data.
Cheers,
Dave