Storing FASTQ as unaligned CRAM

I was updating my BAM to CRAM post spurred on by a recent comment and then I wondered whether I could store my FASTQ files as unaligned CRAM files to save space. I thought it wouldn't be possible because the reads are unaligned and therefore we can't make use of a reference to save space but I proceeded anyway because I was curious. The files used for the first example can be downloaded as per below; the two gzipped files are around 8.8G in combined size.

wget -c ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00284/sequence_read/ERR031940_1.filt.fastq.gz
wget -c ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00284/sequence_read/ERR031940_2.filt.fastq.gz

du -h *.fastq.gz
4.4G    ERR031940_1.filt.fastq.gz
4.4G    ERR031940_2.filt.fastq.gz

First I used Picard to convert the FASTQ files to unaligned BAM.

java -jar picard.jar FastqToSam \
  --FASTQ ERR031940_1.filt.fastq.gz \
  --FASTQ2 ERR031940_2.filt.fastq.gz \
  --OUTPUT ERR031940_unaligned.bam \
  --SAMPLE_NAME ERR031940

du -h ERR031940_unaligned.bam
7.8G    ERR031940_unaligned.bam

I already saved some space by converting to unaligned BAM. (This is probably because the read IDs are trimmed, as you will see later.)

Now let's convert to CRAM. (I do not need the -T option [Reference sequence FASTA FILE] because the reads are not aligned to any reference.)

samtools view -C -o ERR031940_unaligned.cram ERR031940_unaligned.bam

du -h ERR031940_unaligned.cram
5.8G    ERR031940_unaligned.cram

It seems I have gone from 8.8G to 5.8G, a reduction of around 34%! But can I regenerate the original FASTQ files from the CRAM file?

samtools fastq -@ 8 -1 test_1.fq.gz -2 test_2.fq.gz ERR031940_unaligned.cram

zcat ERR031940_1.filt.fastq.gz | head -4
@ERR031940.1 A81BHGABXX:3:1101:1438:2135#NCCAATAT/1
ATTGATTGTAATTTTTATTTTGTTTGCTTTGTTTACTTCTCTAGGCTTTCTTTCTGTGTCATTTCAGACAATCATAGCACATCTAGATAT
+
DF8FBFF@FC@8EEEB@B=BEE;EEEEE7;B>=B8,;>3;D?BD?>B5DC########################################

zcat test_1.fq.gz | head -4
@ERR031940.1
ATTGATTGTAATTTTTATTTTGTTTGCTTTGTTTACTTCTCTAGGCTTTCTTTCTGTGTCATTTCAGACAATCATAGCACATCTAGATAT
+
DF8FBFF@FC@8EEEB@B=BEE;EEEEE7;B>=B8,;>3;D?BD?>B5DC########################################

zcat ERR031940_1.filt.fastq.gz | wc -l
223137348
zcat ERR031940_2.filt.fastq.gz | wc -l
223137348

zcat test_1.fq.gz | wc -l
223137348
zcat test_2.fq.gz | wc -l
223137348

The original FASTQ file and FASTQ file generated from the CRAM file have the same sequence and base qualities for the first read but the Illumina sequence identifiers have been removed from the CRAM FASTQ file. In addition, the FASTQ files have the same amount of lines (but the reads in the CRAM FASTQ files are lexically sorted, see below).

zcat test_1.fq.gz | head -8
@ERR031940.1
ATTGATTGTAATTTTTATTTTGTTTGCTTTGTTTACTTCTCTAGGCTTTCTTTCTGTGTCATTTCAGACAATCATAGCACATCTAGATAT
+
DF8FBFF@FC@8EEEB@B=BEE;EEEEE7;B>=B8,;>3;D?BD?>B5DC########################################
@ERR031940.10
TCCCTCCTTTACCGCCTCAAGTTCAAGGAGTTTGTCCAGAGTGTCCCCACCAAACGATTAAACACCGAGCAGATCAGGGGGCACCTCCGG
+
GDGGGGGEBBGGG?GGFFEGGFF:FDC=CC4A?#########################################################

zcat test_2.fq.gz | head -8
@ERR031940.1
TAGTAATTCGTTCTGGAAGCCGACTTTTCTCAATGGATAGAAACAATTCCAACAAATCTCCATGCACANNNNCNNNNNNNNNNNNNNNTN
+
EEECCEEDB=?..@;;??@<EE;AFB9G<EB69?A>AA####################################################
@ERR031940.10
CTGCCCCCCGACGTCCCACACTTGGAAGGTGATCCCACGAGATCCCACGAGGGGAACCCGGATCTTCTCTGAGTTGAAGTACTTGTCGGG
+
HH8EHHFGHEFBF@FDDBD??@:<==B@##############################################################

Therefore it does look like I can regenerate the FASTQ files with the same sequence and base qualities but the Illumina sequence identifiers are lost and the reads are rearranged lexically (which doesn't matter as long as the reads are correctly paired between the first and second FASTQ file). Currently, I do not work with the sequence identifiers (I have in the past when I needed to get the barcode/index of each FASTQ read), so I don't mind that they are removed when I am saving a lot of space.

To check whether this can be reproduced with another pair of FASTQ files, I repeated the steps for some RNA-seq reads. The combined size of the FASTQ files are around 6.1G and the unaligned CRAM file is 3.8G thus saving 37.7% of space!

du -h ERR164634*
3.1G    ERR164634_1.fastq.gz
3.0G    ERR164634_2.fastq.gz

zcat ERR164634_1.fastq.gz | wc -l
132460264
zcat ERR164634_2.fastq.gz | wc -l
132460264

java -jar picard.jar FastqToSam \
  --FASTQ ERR164634_1.fastq.gz \
  --FASTQ2 ERR164634_2.fastq.gz \
  --OUTPUT ERR164634_unaligned.bam \
  --SAMPLE_NAME ERR164634

samtools view -C -o ERR164634_unaligned.cram ERR164634_unaligned.bam

du -h ERR164634_unaligned.cram
3.8G    ERR164634_unaligned.cram

samtools fastq -@ 8 -1 test2_1.fq.gz -2 test2_2.fq.gz ERR164634_unaligned.cram

zcat test2_1.fq.gz | wc -l
132460264
zcat test2_2.fq.gz | wc -l
132460264

This work is licensed under a Creative Commons
Attribution 4.0 International License.

Leave a Reply Cancel reply