I asked a question on Twitter about sorting a really huge file (more specifically sorting a huge BED file). To put really huge into context, the file I’m processing has 3,947,386,561 lines of genomic coordinates. I want the file to be sorted by the chromosome (lexical order), then by the start coordinate (numeric order) and then by the strand (by + and -), i.e. sort file.bed -k1,1 -k2,2n -k6,6.
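As a quick reminder of what those keys mean: -k1,1 sorts on column 1 only, -k2,2n sorts column 2 numerically and -k6,6 breaks any remaining ties on the strand column. Here's a minimal sketch on a made-up three-line BED fragment (the coordinates and read names are invented; LC_ALL=C just makes the collation order predictable):

printf "chr2\t50\t70\tr1\t1\t+\nchr1\t200\t220\tr2\t1\t-\nchr1\t200\t220\tr3\t1\t+\n" > toy.bed
LC_ALL=C sort -k1,1 -k2,2n -k6,6 toy.bed
chr1 200 220 r3 1 +
chr1 200 220 r2 1 -
chr2 50 70 r1 1 +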
To get started with a smaller file, download a CAGE dataset from ENCODE:
#download a CAGE bam file from ENCODE
wget -O test.bam http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage/wgEncodeRikenCageK562CytosolPapAlnRep1.bam

md5sum test.bam
895bcf6430707991d108281ddf1e8977  test.bam

samtools flagstat test.bam
23166381 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
23166381 + 0 mapped (100.00%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

#convert to bed using BEDTools available at https://code.google.com/p/bedtools/
bamToBed -i test.bam > test.bed

cat test.bed | wc -l
23166381

head -3 test.bed
chr1 10571 10598 HWUSI-EAS733_0011:1:62:3915:15245#0|AGA 1 -
chr1 10571 10598 HWUSI-EAS747_0013:3:40:11187:7108#0|AGA 1 -
chr1 10571 10598 HWUSI-EAS747_0013:3:60:14237:7545#0|AGA 1 -
By the looks of it, the file already appears to be sorted. Going off on a tangent, I wondered how much longer it takes to sort an unsorted file compared with one that is already sorted. Here's a Perl script that randomises an input BED file:
#!/usr/bin/env perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.bed>\n";
my $infile = shift or die $usage;

#store each line of the BED file, keyed by its line number
my %bed = ();

open(IN,'<',$infile) || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   $bed{$.} = $_;
}
close(IN);

#printing in hash-key order effectively shuffles the lines,
#since Perl makes no guarantee about the order of hash keys
foreach my $line (keys %bed){
   print "$bed{$line}\n";
}

exit(0);
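(As an aside, GNU coreutils also ships shuf, which would shuffle the file in one line; I'm sticking with the Perl script above since that's what I used for the timings below.)

#one-liner alternative: shuf is part of GNU coreutils
shuf test.bed > test_random.bed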
Sorting a sorted file versus sorting an unsorted file
Now let’s create a randomised version of the BED file and compare the sorting times:
random_bed.pl test.bed > test_random.bed

head -3 test_random.bed
chrX 100307075 100307102 HWUSI-EAS733_0011:1:7:19671:8637#0|AGA 50 -
chrM 9076 9103 HWUSI-EAS733_0011:1:22:11466:10158#0|AGA 3 -
chr12 7079819 7079846 HWUSI-EAS733_0011:1:68:11985:5898#0|AGA 50 -

#sort random file
time sort -k1,1 -k2,2n -k6,6 test_random.bed > test_random_sorted.bed

real    48m6.007s
user    48m0.547s
sys     0m2.556s

md5sum test_random_sorted.bed
e3b08a1f59bbdd527f7f68230d4ed484  test_random_sorted.bed

#sort original bed file
time sort -k1,1 -k2,2n -k6,6 test.bed > test_sorted.bed

real    38m34.189s
user    38m26.197s
sys     0m2.187s

md5sum test_sorted.bed
e3b08a1f59bbdd527f7f68230d4ed484  test_sorted.bed
It took roughly 10 more minutes to sort the randomised BED file. Now, sort comes with a --parallel option; let's try using 12 cores:
#get the latest version of sort here http://www.gnu.org/software/coreutils/
sort --version
sort (GNU coreutils) 8.21
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and Paul Eggert.

time sort --parallel=12 -k1,1 -k2,2n -k6,6 test_random.bed > test_random_parallel_sorted.bed

real    1m23.888s
user    5m49.971s
sys     0m2.472s

#comparing the md5sum
md5sum test_random_parallel_sorted.bed
e3b08a1f59bbdd527f7f68230d4ed484  test_random_parallel_sorted.bed

md5sum test_random_sorted.bed
e3b08a1f59bbdd527f7f68230d4ed484  test_random_sorted.bed

#one more time
time sort --parallel=12 -k1,1 -k2,2n -k6,6 test_random.bed > again.bed

real    1m27.156s
user    5m49.710s
sys     0m2.686s

md5sum again.bed
e3b08a1f59bbdd527f7f68230d4ed484  again.bed
I'm impressed (perhaps easily so); it took less than two minutes to sort 23,166,381 lines! I know that whatever approach I take, it's not going to beat that, but let's see how much I can improve on the 38-minute mark.
Let’s split the BED file into 25 separate chromosomal files using GNU parallel and grep, and then sort through each chromosome file individually.
#GNU parallel using 12 cores
#by reading the entire file 25 times and grepping out the chromosomes
time parallel -j12 'cat test_random.bed | grep "chr{}\b" > chr{}.bed' ::: {1..22} X Y M

real    0m16.274s
user    0m40.058s
sys     0m44.237s

#check
ls chr*.bed | wc
     25      25     238

#double check
cat chr*.bed | wc -l
23166381

#GNU parallel sort using 12 cores
time parallel -j12 'sort -k1,1 -k2,2n -k6,6 chr{}.bed > chr{}_sorted.bed' ::: {1..22} X Y M

real    6m13.700s
user    47m34.764s
sys     0m3.676s

#join it back
time for i in 1 {10..19} 2 {20..22} {3..9} M X; do
   cat chr${i}_sorted.bed >> all_sorted.bed
done

real    0m7.856s
user    0m0.012s
sys     0m1.501s

md5sum all_sorted.bed
e3b08a1f59bbdd527f7f68230d4ed484  all_sorted.bed

md5sum ../test_random_sorted.bed
e3b08a1f59bbdd527f7f68230d4ed484  ../test_random_sorted.bed
So, all in all, it took around 7 minutes to split the file, sort each chromosome and join them back together. While that's a nice improvement on the 38-minute mark, running sort with the --parallel parameter is clearly the easiest, and also the fastest, way to sort this file.
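(A possible refinement I haven't timed: sort -m merges inputs that are already sorted, so the per-chromosome files could be combined without having to spell out the lexical chromosome order by hand.)

#-m only merges already-sorted files, it does not re-sort them
sort -m -k1,1 -k2,2n -k6,6 chr*_sorted.bed > all_sorted_merge.bed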
Splitting the huge file
In the above example I read the entire original file 25 times, which is more than necessary, as Owen pointed out. He suggested using chunks with parallel (the --pipe and --block options), which partition the input into fixed-size chunks and save time because the file no longer has to be read over and over again.
I’ll use the really big file with 3,947,386,561 lines for this example:
time parallel --joblog split.log -j25 'cat it_s_a_big_file.bed | grep "chr{}\b" > chr{}.bed' ::: {1..22} X Y M

real    60m3.672s
user    124m40.977s
sys     182m8.711s

#~1 hour to grep chromosome 1
tail -1 split.log
1 : 1384848887.189 3603.259 0 0 0 0 cat it_s_a_big_file.bed | grep "chr1\b" > chr1.bed
That took roughly 1 hour; more specifically, it took 1 hour to grep through the file for the biggest chromosome, chr1. Now let's try using chunks to see if we can do better; here's an illustration of what --block does:
#it's a big file!
du -h it_s_a_big_file.bed
186G    it_s_a_big_file.bed

#to illustrate the use of block size
perl -le 'for (1 .. 1_000_000){ print $_ }' > 1000000.txt

du -h 1000000.txt
7.8M    1000000.txt

#4 blocks of ~2M, since the file size is ~8M
cat 1000000.txt | parallel --pipe --block 2M wc -l
301586
285714
285715
126985

#how many chunks of the big file, if I use 500M?
#In case you were wondering, yes I can do this in my head but I'm showing it for illustrative purposes
bc -l<<<186/0.5
372.00000000000000000000
So Owen wrote a Perl script and emailed it to me, which I slightly modified; it reads the chunks streamed from parallel and writes each line out to a per-chromosome file:
#!/usr/bin/env perl

use strict;
use warnings;

#hash for filehandles
my %fh = ();

#read from stream
while (<>){
   my ($chr, @rest) = split(/\t/);
   if (exists $fh{$chr}){
      print {$fh{$chr}} "$_";
   } else {
      #open output filehandle for each chromosome
      open(my $fh, ">>", "$chr.bed") || die("Could not open $chr.bed for output: $!\n");
      #store file handle in hash
      $fh{$chr} = $fh;
      print {$fh{$chr}} "$_";
   }
}

#close all filehandles
foreach my $fh (keys %fh){
   close $fh{$fh};
}

exit(0);
Now we can use parallel with blocks and pipe it into Perl:
time cat it_s_a_big_file.bed | parallel --joblog perl_split.log --pipe --block 500M split_chr.pl

real    16m22.555s
user    156m43.135s
sys     27m27.253s

ls chr*.bed | wc -l
25

#check numbers
#pay attention to the sum of lines
ls -1 chr*.bed > my_files
time parallel -a my_files wc -l | cut -f1 -d' ' | stats
Total lines:    25
Sum of lines:   3947386561
Ari. Mean:      157895462.44
Geo. Mean:      135115347.228594
Median:         167157944
Mode:           21314313 (N=1)
Anti-Mode:      21314313 (N=1)
Minimum:        21314313
Maximum:        318405487
Variance:       5.11484782570063e+15
StdDev:         71518164.3060043

real    2m46.105s
user    2m1.353s
sys     14m51.288s
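The stats program used above is a small command-line summary script rather than a standard Unix tool; if you don't have it, the only number that really matters here (the sum of the line counts) can be had with plain awk:

#sum the per-file line counts; should print 3947386561
parallel -a my_files wc -l | awk '{sum += $1} END {print sum}'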
On the surface it looked great: much faster processing, and the line count matched the original file. However, when I checked, the data was corrupted:
tail -5 chr1.bed
chr1 249239133 249239134 a_F512_0 1 +
chr1 249239134 249239135 a_F512_0 1 +
chr1 249239138 249239139 a_F512_0 1 +
a_F512_0 1 -
chr1 249240237 249240238 a_F512_0 1 +
Notice that the second-to-last line is incomplete. In fact, there are a lot of empty lines in the file (not shown). I searched the web, and the general consensus is that it's a bad idea to have several processes writing to the same file simultaneously.
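One untested way around this would be to give each parallel job its own set of output files and merge them afterwards: parallel's {#} placeholder holds the job number, and split_chr.pl would need a small tweak to take it as a suffix (e.g. my $suffix = shift; and opening "$chr.$suffix.bed" instead of "$chr.bed"). A sketch, under those assumptions:

#sketch only: each job writes chr*.<job number>.bed, so no two jobs ever share a file
cat it_s_a_big_file.bed | parallel --pipe --block 500M 'split_chr.pl {#}'

#merge the per-job pieces for each chromosome afterwards
for i in {1..22} X Y M; do
   cat chr${i}.*.bed > chr${i}.bed
done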
Parallel sorting versus chromosomal sorting
From the smaller BED file tested in the first part of this post, we learned that sort --parallel was extremely fast. So I used the --parallel option:
#for info on bgzip see http://samtools.sourceforge.net/tabix.shtml
cat it_s_a_big_file.bed | sort --parallel=12 -k1,1 -k2,2n -k6,6 | bgzip > it_s_a_big_file.bed.gz
I didn't run it with time (yes, silly me!), but from the file timestamps I worked out that it took roughly 7 hours, including the bgzip step (I created the run.sh script at 14:09 and it_s_a_big_file.bed.gz was ready at 21:17).
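(Worth noting for a file this size: sort spills to temporary files on disk, so giving it a larger in-memory buffer with -S and pointing -T at a fast scratch disk with plenty of free space might speed things up. The values below are guesses I haven't benchmarked, and /scratch is just a placeholder path.)

#hypothetical tuning: bigger sort buffer and temporary files on a scratch disk
cat it_s_a_big_file.bed | sort --parallel=12 -S 20G -T /scratch -k1,1 -k2,2n -k6,6 | bgzip > it_s_a_big_file.bed.gz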
Let's compare this to sorting each chromosome individually: how long does it take to sort all the chromosomes (effectively, how long it takes to sort the biggest one, chromosome 1)?
#using the split chromosome files prepared with parallel and grep
time parallel --joblog sort.log --verbose 'sort -k1,1 -k2,2n -k6,6 chr{}.bed > chr{}_sorted.bed' ::: {1..22} X Y M

real    350m28.062s
user    4187m36.244s
sys     192m30.739s

#and join the files back up
time for i in 1 {10..19} 2 {20..22} {3..9} M X; do
   cat chr${i}_sorted.bed >> all_sorted.bed
done

real    13m27.396s
user    0m2.563s
sys     7m35.211s

#bgzip
time cat all_sorted.bed | bgzip > all_sorted.bed.gz

real    88m24.757s
user    84m26.179s
sys     8m15.971s
So if we split the file by chromosome using the grep approach (60m3.672s), sort each chromosome (350m28.062s), join them back up (13m27.396s) and bgzip the result (88m24.757s), it takes ~512 minutes, or roughly 8.5 hours.
Conclusions
I’ve learned that:
- It’s faster to sort a sorted file than an unsorted file
- Do not have multiple processes appending to the same file simultaneously
- The easiest, hassle-free solution for sorting a huge file is to use the --parallel parameter
However, the best solution here would be to split up the huge file by chromosome (using parallel and grep [unless I find a better way]) and then sort each chromosome BED file with --parallel, as sketched below.
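Something like this, although I haven't run it end to end at the full 3,947,386,561-line scale; the loop lists the chromosomes in lexical order so that simply concatenating the sorted pieces gives the same order as one global sort:

#split by chromosome, as before
parallel -j12 'cat it_s_a_big_file.bed | grep "chr{}\b" > chr{}.bed' ::: {1..22} X Y M

#sort each chromosome file with sort's own --parallel option and compress the lot
for i in 1 {10..19} 2 {20..22} {3..9} M X Y; do
   sort --parallel=12 -k1,1 -k2,2n -k6,6 chr${i}.bed
done | bgzip > all_sorted.bed.gz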
Shout-outs to @robin_andersson, @OwenRackham and @tjkwon75 (his three words of wisdom) for their help via Twitter.

This work is licensed under a Creative Commons Attribution 4.0 International License.
Nice. Sometimes you may also want to sort the chromosomes alphanumerically: http://www.biostars.org/p/64687/#81308
Thanks! That’s a neat trick! I’ll remember it 🙂
echo -e "chrY\nchr10\nchr2\nchr1\nchrM" | sort -k1,1V
Hi Dave,
When you use bamToBed, is the start 0-based or 1-based ?
“bamToBed -i test.bam > test.bed”
head -3 test.bed
chr1 10571 10598 HWUSI-EAS733_0011:1:62:3915:15245#0|AGA 1 -
chr1 10571 10598 HWUSI-EAS747_0013:3:40:11187:7108#0|AGA 1 -
chr1 10571 10598 HWUSI-EAS747_0013:3:60:14237:7545#0|AGA 1 -
So, is 10571 0-based or 1-based?
Thanks in advance.
Hi Chirag,
I just checked by making a fake BAM file that starts from 0:
samtools view a.bam
VHE-246613351024-23-7-2-1280 0 chr1 0 15 30M * 0 0 TGCAGCACAGTGCAGCGCGAGCCAGCTGGA * NM:i:4 MD:Z:14T1T4T4A3 XP:Z:~~~~~~~~~~~~~~~~~~~~~~~~~~A@@~
bamToBed -i a.bam
chr1 -1 29 VHE-246613351024-23-7-2-1280 15 +
So it looks like bamToBed expects 1-based coordinates and converts them to 0-based.
Congratulations on your paper by the way! I still have to read it 🙂
Cheers,
Dave
Thanks Dave !!
So, BAM files are always 1-based, as mentioned in the IGV documentation:
http://www.broadinstitute.org/igv/BAM
One-based index: Start and end positions are identified using a one-based index. The end position is included. For example, setting start-end to 1-2 describes two bases, the first and second in the sequence.
While BEDTools uses a 0-based start and a 1-based end:
1.3.4. BED starts are zero-based and BED ends are one-based.
bedtools users are sometimes confused by the way the start and end of BED features are represented. Specifically, bedtools uses the UCSC Genome Browser’s internal database convention of making the start position 0-based and the end position 1-based: (http://genome.ucsc.edu/FAQ/FAQtracks#tracks1) In other words, bedtools interprets the “start” column as being 1 basepair higher than what is represented in the file. For example, the following BED feature represents a single base on chromosome 1; namely, the 1st base:
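In other words, a feature covering only the first base of chromosome 1 is written with a start of 0 and an end of 1:

chr1 0 1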
So, as you suggested, the start is 0-based.
Thanks!! You should read the paper, I guess it is a pretty decent paper 🙂
cheers
Thanks for looking it up and sharing! I shall have a read 🙂
Hi Dave,
Did you try the sort-bed function in bedops? It is far superior to sortBed in bedtools.
Hi Kalyan,
I didn’t even know the existence of bedops! I’ll give it a go sometime; thanks for letting me know about it.
Cheers,
Dave