Updated 2022 January 20th.
As pointed out by Colin, converting a BAM file to CRAM is simply one command:
samtools view -T genome/chrX.fa -C -o eg/ERR188273_chrX.cram eg/ERR188273_chrX.bam
Of note is that the reference file used to produce the BAM file is required and is used as an argument for the -T option.
As for why we should convert from BAM to CRAM, the reason is that CRAM files use less space. To illustrate I will align the following reads with BWA and store them as BAM and CRAM files. If you want to experiment yourself, you can download the (exome) FASTQ files as per below:
wget -c ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00284/sequence_read/ERR031940_1.filt.fastq.gz wget -c ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00284/sequence_read/ERR031940_2.filt.fastq.gz
Align with BWA MEM.
bwa mem \ -t 8 \ /ref/bundle/hg38/Homo_sapiens_assembly38.fasta \ ERR031940_1.filt.fastq.gz \ ERR031940_2.filt.fastq.gz | samtools sort -@ 8 -O BAM |\ tee ERR031940.bam |\ samtools index - ERR031940.bam.bai
Convert to CRAM.
time samtools view -@ 8 -T /ref/bundle/hg38/Homo_sapiens_assembly38.fasta -C -o ERR031940.cram ERR031940.bam real 1m29.007s user 8m31.539s sys 0m20.911s
Size difference = 8.5G versus 4.6G, which is pretty impressive.
ls -lh ERR031940.bam -rw-rw-r-- 1 dtang dtang 8.5G Jan 20 14:04 ERR031940.bam ls -lh ERR031940.cram -rw-rw-r-- 1 dtang dtang 4.6G Jan 20 14:46 ERR031940.cram
However, the downside is that processing a CRAM file is slower.
time samtools view -@ 8 ERR031940.bam | wc -l 111887797 real 0m36.458s user 3m55.326s sys 0m49.213s time samtools view -@ 8 -T /ref/bundle/hg38/Homo_sapiens_assembly38.fasta ERR031940.cram | wc -l 111887797 real 0m53.974s user 4m23.496s sys 0m51.295s
Some tools do not work with CRAM files, but you can always convert back to BAM (and use CRAM for archiving purposes). Note that ERR031940.bam (original BAM file) and ERR031940_cram.bam (BAM to CRAM back to BAM file) are different because the BAM tags are in a different order.
time samtools view -@ 8 -T /ref/bundle/hg38/Homo_sapiens_assembly38.fasta -b ERR031940.cram -o ERR031940_cram.bam real 3m13.871s user 22m27.343s sys 0m34.830s
You can read the rest of the original post (from 2014) below but also note that CRAMTools is no longer supported and Samtools is recommended.
I was catching up on some blog reading and just read Ewan's latest post on CRAM going mainline. CRAM files are alignment files like BAM files but provides better compression. The toolkit for manipulating CRAM files is called CRAMTools and is a set of Java tools and APIs for efficient compression of sequence read data. This is necessary because of the constant increase in the throughput of sequencers. I went through the CRAM toolkit, format specification, the CRAM tutorial on the EBI page and decided to test it out.
Firstly I built CRAMTools from its source using Apache Ant:
#download apache-ant wget http://ftp.jaist.ac.jp/pub/apache/ant/binaries/apache-ant-1.9.4-bin.tar.gz tar -xzf apache-ant-1.9.4-bin.tar.gz #download and build CRAMTools git clone git://github.com/enasequence/cramtools.git cd cramtools apache-ant-1.9.4/bin/ant -f build/build.xml runnable cd .. #test it out java -jar cramtools/cramtools-2.1.jar Version 2.1-b237 Usage: cramtools [options] [command] [command options] Options: -h, --help Print help and quit (default: false) Commands: bam CRAM to BAM conversion. cram BAM to CRAM converter. index BAM/CRAM indexer. merge Tool to merge CRAM or BAM files. fastq CRAM to FastQ dump conversion. fixheader A tool to fix CRAM header without re-writing the whole file. getref Download reference sequences. qstat Quality score statistics.
OK looks like it's working. I wrote a Perl script called generate_random_seq.pl to generate a reference sequence:
#!/usr/bin/env perl use strict; use warnings; my $usage = "Usage: $0 <bp>\n"; my $num = shift or die $usage; my $random_seq = random_seq($num); print ">$num\n$random_seq\n"; exit(0); sub random_seq { my ($num) = @_; my @nuc = qw/ A C G T /; my $seq = ''; for (1 .. $num){ my $rand_ind = int(rand(scalar(@nuc))); $seq .= $nuc[$rand_ind]; } return($seq); }
Then I generated a 1,000,000 bp reference sequence:
generate_random_seq.pl 1000000 > 1000000.fa
Then I generated some reads from the reference sequence using another Perl script called subset_random_seq.pl:
cat subset_random_seq.pl #!/usr/bin/env perl use strict; use warnings; my $usage = "Usage: $0 <infile.fa> <number> <length>\n"; my $fasta = shift or die $usage; my $num = shift or die $usage; my $len = shift or die $usage; my $seq = ''; open(IN,'<',$fasta) || die "Could not open $fasta: $!\n"; while(<IN>){ chomp; next if /^>/; $seq .= $_; } close(IN); my $seq_len = length($seq); my $limit = $seq_len - $len + 1; if ($len > $seq_len){ die "Your read length is longer than the input sequence\n"; } my $fake_qual = 'B' x $len; for (1 .. $num){ my $random_start = int(rand($limit)); my $substring = substr($seq,$random_start,$len); print "\@$_\n$substring\n+\n$fake_qual\n"; } exit(0); __END__
Lastly, I generated a random fastq file with 100,000 50 bp reads taken from the reference and used BWA to align the reads back to the reference:
subset_random_seq.pl 1000000.fa 100000 50 > 100000_50_bp.fq bwa index 1000000.fa bwa aln 1000000.fa 100000_50_bp.fq > 100000_50_bp.sai bwa samse -f 100000_50_bp.sam 1000000.fa 100000_50_bp.sai 100000_50_bp.fq samtools view -bS 100000_50_bp.sam | samtools sort - 100000_50_bp_sorted ls -sh 100000_50_bp_sorted.bam 2.2M 100000_50_bp_sorted.bam
The sorted BAM file is ~2.2M. Let's convert it from BAM to CRAM using CRAMTools (I prefer the less readable but easier to type options, e.g. -I instead of --input-bam-file). While we're at it, download and compile the latest SAMtools:
#installing SAMtools wget http://sourceforge.net/projects/samtools/files/samtools/1.1/samtools-1.1.tar.bz2 tar -xjf samtools-1.1.tar.bz2 cd samtools-1.1 make install cd .. #index the reference #this is necessary for CRAMTools samtools faidx 1000000.fa #convert from BAM to CRAM #I used these options to keep the sequence qualities #and all the tag information time java -jar cramtools/cramtools-2.1.jar cram --lossless-quality-score --capture-all-tags -I 100000_50_bp_sorted.bam -R 1000000.fa -O 100000_50_bp_sorted.cram real 0m2.968s user 0m5.945s sys 0m0.242s ls -sh 100000_50_bp_sorted.cram 496K 100000_50_bp_sorted.cram #convert back CRAM to BAM time java -jar cramtools/cramtools-2.1.jar bam --calculate-nm-tag -I 100000_50_bp_sorted.cram -R 1000000.fa -O 100000_50_bp_sorted_2.bam real 0m1.949s user 0m2.653s sys 0m0.151s #going from BAM to CRAM to BAM increased the file size ls -sh 100000_50_bp_sorted_2.bam 2.3M 100000_50_bp_sorted_2.bam
When going back from CRAM to BAM the file increased by .1M; I'm not sure why but they look the same, apart from rearranged tags and the missing MD tag:
#original samtools view 100000_50_bp_sorted.bam | head -5 56666 0 1000000 5 37 50M * 0 0 AGCATCCTTGAGCGAGCTTCGGAGTTCACCTTAGTCAGGAATCGCTATTT BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50 23930 0 1000000 28 37 50M * 0 0 GTTCACCTTAGTCAGGAATCGCTATTTGATGGCGTATCAGGAGTTGTCAA BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50 57485 0 1000000 39 37 50M * 0 0 TCAGGAATCGCTATTTGATGGCGTATCAGGAGTTGTCAATCGTTGGTAAG BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50 23549 0 1000000 42 37 50M * 0 0 GGAATCGCTATTTGATGGCGTATCAGGAGTTGTCAATCGTTGGTAAGTTG BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50 54140 0 1000000 66 37 50M * 0 0 AGGAGTTGTCAATCGTTGGTAAGTTGTACTCGTCAAGTCGTTCGGTTTCT BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50 #CRAM to BAM version 56666 0 1000000 5 37 50M * 0 0 AGCATCCTTGAGCGAGCTTCGGAGTTCACCTTAGTCAGGAATCGCTATTT BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB X0:i:1 X1:i:0 XG:i:0 NM:i:0 XM:i:0 XO:i:0 XT:A:U 23930 0 1000000 28 37 50M * 0 0 GTTCACCTTAGTCAGGAATCGCTATTTGATGGCGTATCAGGAGTTGTCAA BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB X0:i:1 X1:i:0 XG:i:0 NM:i:0 XM:i:0 XO:i:0 XT:A:U 57485 0 1000000 39 37 50M * 0 0 TCAGGAATCGCTATTTGATGGCGTATCAGGAGTTGTCAATCGTTGGTAAG BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB X0:i:1 X1:i:0 XG:i:0 NM:i:0 XM:i:0 XO:i:0 XT:A:U 23549 0 1000000 42 37 50M * 0 0 GGAATCGCTATTTGATGGCGTATCAGGAGTTGTCAATCGTTGGTAAGTTG BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB X0:i:1 X1:i:0 XG:i:0 NM:i:0 XM:i:0 XO:i:0 XT:A:U 54140 0 1000000 66 37 50M * 0 0 AGGAGTTGTCAATCGTTGGTAAGTTGTACTCGTCAAGTCGTTCGGTTTCT BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB X0:i:1 X1:i:0 XG:i:0 NM:i:0 XM:i:0 XO:i:0 XT:A:U
However, I couldn't use SAMtools to view this CRAM file:
samtools view 100000_50_bp_sorted.cram Failed to populate reference for id 0 Slice ends beyond reference end. Unable to fetch reference #0 5..98765 Failure to decode slice [E::hts_close] Failed to decode sequence. samtools: error closing "100000_50_bp_sorted.cram": -1
I found out later (after getting CRAMTools to work) that I can use SAMtools to convert BAM to CRAM:
#convert BAM to CRAM using SAMtools time samtools view -T 1000000.fa -C -o 100000_50_bp_sorted_2.cram 100000_50_bp_sorted.bam real 0m0.334s user 0m0.280s sys 0m0.012s #slightly bigger than the CRAM file using CRAMTools #but still ~4 times smaller ls -sh 100000_50_bp_sorted_2.cram 512K 100000_50_bp_sorted_2.cram samtools view 100000_50_bp_sorted_2.cram | head -5 56666 0 1000000 5 37 50M * 0 0 AGCATCCTTGAGCGAGCTTCGGAGTTCACCTTAGTCAGGAATCGCTATTT BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB XT:A:U X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50 NM:i:0 23930 0 1000000 28 37 50M * 0 0 GTTCACCTTAGTCAGGAATCGCTATTTGATGGCGTATCAGGAGTTGTCAA BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB XT:A:U X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50 NM:i:0 57485 0 1000000 39 37 50M * 0 0 TCAGGAATCGCTATTTGATGGCGTATCAGGAGTTGTCAATCGTTGGTAAG BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB XT:A:U X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50 NM:i:0 23549 0 1000000 42 37 50M * 0 0 GGAATCGCTATTTGATGGCGTATCAGGAGTTGTCAATCGTTGGTAAGTTG BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB XT:A:U X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50 NM:i:0 54140 0 1000000 66 37 50M * 0 0 AGGAGTTGTCAATCGTTGGTAAGTTGTACTCGTCAAGTCGTTCGGTTTCT BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB XT:A:U X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50 NM:i:0
More reads
With 100,000 reads the CRAM file was about a quarter of the size of the BAM file. Let's see if this is the same with more reads:
subset_random_seq.pl 1000000.fa 1000000 50 > 1000000_50_bp.fq bwa aln 1000000.fa 1000000_50_bp.fq > 1000000_50_bp.sai bwa samse -f 1000000_50_bp.sam 1000000.fa 1000000_50_bp.sai 1000000_50_bp.fq samtools view -bS 1000000_50_bp.sam | samtools sort - 1000000_50_bp_sorted time samtools view -T 1000000.fa -C -o 1000000_50_bp_sorted.cram 1000000_50_bp_sorted.bam real 0m2.995s user 0m2.666s sys 0m0.110s ls -sh 1000000_50_bp_sorted.bam 14M 1000000_50_bp_sorted.bam ls -sh 1000000_50_bp_sorted.cram 3.9M 1000000_50_bp_sorted.cram #still around 1/4 bc -l<<<3.9/14 .27857142857142857142
The CRAM file is roughly a quarter of the size of the BAM file. Also I noticed that it's a bit slower for SAMtools to process CRAM files:
time samtools view 1000000_50_bp_sorted.bam | head -1000000 > /dev/null real 0m1.114s user 0m1.341s sys 0m0.121s time samtools view 1000000_50_bp_sorted.cram | head -1000000 > /dev/null real 0m2.883s user 0m2.063s sys 0m0.228s
But think about all the space you're saving.
Conclusions
Upgrade your SAMtools if you want to work with CRAM files.
This work is licensed under a Creative Commons
Attribution 4.0 International License.
could be worth adding a short TLDR, since someone could think they have to download cramtools etc. when it is basically as simple as
samtools view -T reference.fa file.bam -o file.cram
(this is a highly rated post on google if searching for bam to cram!:)
Thanks for the suggestion Colin! I’ll update the post accordingly, soon.