Sorting a huge BED file

I asked a question on Twitter about sorting a really huge file, more specifically a huge BED file. To put “really huge” into context, the file I’m processing has 3,947,386,561 lines of genomic coordinates. I want the file sorted by chromosome (lexical order), then by start coordinate (numeric order) and then by strand (+ and -), i.e. sort -k1,1 -k2,2n -k6,6 file.bed.
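
In case sort’s key syntax isn’t familiar, here is the same command with each key spelled out (GNU sort):

#-k1,1  first field only (chromosome), lexical order
#-k2,2n second field only (start coordinate), numeric order
#-k6,6  sixth field only (strand)
sort -k1,1 -k2,2n -k6,6 file.bed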

To get started with a smaller file, download a CAGE dataset from ENCODE:

#download a CAGE bam file from ENCODE
wget -O test.bam http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage/wgEncodeRikenCageK562CytosolPapAlnRep1.bam
md5sum test.bam
895bcf6430707991d108281ddf1e8977  test.bam
samtools flagstat test.bam
23166381 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
23166381 + 0 mapped (100.00%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

#convert to bed using BEDTools available at https://code.google.com/p/bedtools/
bamToBed -i test.bam > test.bed
cat test.bed | wc -l
23166381
head -3 test.bed
chr1    10571   10598   HWUSI-EAS733_0011:1:62:3915:15245#0|AGA 1       -
chr1    10571   10598   HWUSI-EAS747_0013:3:40:11187:7108#0|AGA 1       -
chr1    10571   10598   HWUSI-EAS747_0013:3:60:14237:7545#0|AGA 1       -

By the looks of it, the file is already sorted. Going off on a tangent, I wonder how much longer it takes to sort an unsorted file than an already sorted one. To test this, I need a Perl script that randomises an input BED file.
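
A minimal sketch of such a script (called random_bed.pl below; it assumes the whole file fits comfortably in memory and simply shuffles the lines with List::Util):

#!/usr/bin/env perl
#random_bed.pl (sketch): print the lines of a BED file in random order
#assumes the whole file fits comfortably in memory
use strict;
use warnings;
use List::Util qw(shuffle);

my $infile = shift or die "Usage: $0 <file.bed>\n";
open(my $fh, '<', $infile) or die "Could not open $infile: $!\n";
my @lines = <$fh>;
close($fh);

print for shuffle(@lines);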

Sorting a sorted file versus sorting an unsorted file

Now let’s create a randomised version of the BED file and compare the sorting times:

random_bed.pl test.bed > test_random.bed
head -3 test_random.bed
chrX    100307075       100307102       HWUSI-EAS733_0011:1:7:19671:8637#0|AGA  50      -
chrM    9076    9103    HWUSI-EAS733_0011:1:22:11466:10158#0|AGA        3       -
chr12   7079819 7079846 HWUSI-EAS733_0011:1:68:11985:5898#0|AGA 50      -

#sort random file
time sort -k1,1 -k2,2n -k6,6 test_random.bed > test_random_sorted.bed

real    48m6.007s
user    48m0.547s
sys     0m2.556s

md5sum test_random_sorted.bed
e3b08a1f59bbdd527f7f68230d4ed484  test_random_sorted.bed

#sort original bed file
time sort -k1,1 -k2,2n -k6,6 test.bed > test_sorted.bed

real    38m34.189s
user    38m26.197s
sys     0m2.187s

md5sum test_sorted.bed
e3b08a1f59bbdd527f7f68230d4ed484  test_sorted.bed

It took ~10 more minutes to sort the randomised BED file. Now sort comes with a --parallel option; let’s try using 12 cores:

#get the latest version of sort here http://www.gnu.org/software/coreutils/
sort --version
sort (GNU coreutils) 8.21
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and Paul Eggert.

time sort --parallel=12 -k1,1 -k2,2n -k6,6 test_random.bed > test_random_parallel_sorted.bed

real    1m23.888s
user    5m49.971s
sys     0m2.472s

#comparing the md5sum
md5sum test_random_parallel_sorted.bed
e3b08a1f59bbdd527f7f68230d4ed484  test_random_parallel_sorted.bed
md5sum test_random_sorted.bed
e3b08a1f59bbdd527f7f68230d4ed484  test_random_sorted.bed

#one more time
time sort --parallel=12 -k1,1 -k2,2n -k6,6 test_random.bed > again.bed

real    1m27.156s
user    5m49.710s
sys     0m2.686s

md5sum again.bed
e3b08a1f59bbdd527f7f68230d4ed484  again.bed

I’m impressed (perhaps easily so); it took less than 2 minutes to sort through 23,166,381 lines! I know that whatever approach I take, it’s not going to beat that. But let’s see how much I can improve on the 38-minute mark.

Let’s split the BED file into 25 separate chromosomal files using GNU parallel and grep, and then sort through each chromosome file individually.

#GNU parallel using 12 cores
#by reading the entire file 25 times and grepping out the chromosomes
time parallel -j12 'cat test_random.bed | grep "chr{}\b" > chr{}.bed' ::: {1..22} X Y M

real    0m16.274s
user    0m40.058s
sys     0m44.237s

#check
ls chr*.bed | wc
     25      25     238

#double check
cat chr*.bed | wc -l
23166381

#GNU parallel sort using 12 cores
time parallel -j12 'sort -k1,1 -k2,2n -k6,6 chr{}.bed > chr{}_sorted.bed' ::: {1..22} X Y M

real    6m13.700s
user    47m34.764s
sys     0m3.676s

#join it back
time for i in 1 {10..19} 2 {20..22} {3..9} M X;
   do cat chr${i}_sorted.bed >> all_sorted.bed
done

real    0m7.856s
user    0m0.012s
sys     0m1.501s

md5sum all_sorted.bed
e3b08a1f59bbdd527f7f68230d4ed484  all_sorted.bed

md5sum ../test_random_sorted.bed
e3b08a1f59bbdd527f7f68230d4ed484  ../test_random_sorted.bed

So all in all, it took around 7 minutes to split the file, sort each individual file and then join them back up. While that’s a nice improvement on the 38-minute mark, running sort with the --parallel parameter is clearly both the easiest and the fastest way to sort this file.

Splitting the huge file

In the above example, I read the original file in its entirety 25 times, which is more than necessary, as Owen pointed out. He suggested using chunks with parallel, which partitions the file into chunks of (roughly) a given size and thus saves time by not reading the entire file over and over again.

I’ll use the really big file with 3,947,386,561 lines for this example:

time parallel --joblog split.log -j25 'cat it_s_a_big_file.bed | grep "chr{}\b" > chr{}.bed' ::: {1..22} X Y M

real    60m3.672s
user    124m40.977s
sys     182m8.711s

#~1 hour to grep chromosome 1
tail -1 split.log
1       :       1384848887.189  3603.259        0       0       0       0       cat it_s_a_big_file.bed | grep "chr1\b" > chr1.bed

That took roughly 1 hour; more specifically, it took 1 hour to grep through the file for the biggest chromosome, chr1. Now let’s try using chunks to see if we can do better; here I illustrate what --block does:

#it's a big file!
du -h it_s_a_big_file.bed
186G    it_s_a_big_file.bed

#to illustrate the use of block size
perl -le 'for (1 .. 1_000_000){ print $_ }' > 1000000.txt
du -h 1000000.txt
7.8M    1000000.txt

#4 blocks of ~2M, since the file size is ~8M
cat 1000000.txt | parallel --pipe --block 2M wc -l
301586
285714
285715
126985

#how many chunks of the big file, if I use 500M?
#In case you were wondering, yes I can do this in my head but I'm showing it for illustrative purposes
bc -l<<<186/0.5
372.00000000000000000000

So Owen wrote a Perl script and emailed it to me (I modified it slightly) that reads in, through streaming, the chunks from parallel:
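
A minimal sketch of what such a script (split_chr.pl, used below) might look like, under the assumption that it reads BED lines from STDIN and appends each one to a per-chromosome file, caching the file handles:

#!/usr/bin/env perl
#split_chr.pl (sketch): read BED lines from STDIN and append each line
#to its per-chromosome file, e.g. chr1.bed, chrX.bed
#note that under parallel, several jobs append to the same files at once
use strict;
use warnings;

my %fh;
while (my $line = <STDIN>) {
    my ($chr) = split /\t/, $line;
    next unless defined $chr and length $chr;
    unless (exists $fh{$chr}) {
        open($fh{$chr}, '>>', "$chr.bed") or die "Could not open $chr.bed: $!\n";
    }
    print { $fh{$chr} } $line;
}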

Now we can use parallel with blocks and pipe it into Perl:

time cat it_s_a_big_file.bed | parallel --joblog perl_split.log --pipe --block 500M split_chr.pl

real    16m22.555s
user    156m43.135s
sys     27m27.253s

ls chr*.bed | wc -l
25

#check numbers
#pay attention to the sum of lines
ls -1 chr*.bed > my_files
time parallel -a my_files wc -l | cut -f1 -d' ' | stats
Total lines:            25
Sum of lines:           3947386561
Ari. Mean:              157895462.44
Geo. Mean:              135115347.228594
Median:                 167157944
Mode:                   21314313 (N=1)
Anti-Mode:              21314313 (N=1)
Minimum:                21314313
Maximum:                318405487
Variance:               5.11484782570063e+15
StdDev:                 71518164.3060043

real    2m46.105s
user    2m1.353s
sys     14m51.288s

On the surface it looked great: much faster processing, and the line count matched the original file. However, when I checked, the data was corrupted:

tail -5 chr1.bed
chr1    249239133       249239134       a_F512_0        1       +
chr1    249239134       249239135       a_F512_0        1       +
chr1    249239138       249239139       a_F512_0        1       +
a_F512_0     1       -
chr1    249240237       249240238       a_F512_0        1       +

Notice that the second-to-last line is incomplete. In fact, there are a lot of empty lines in the file (not shown). I searched the web and, in general, it seems like a bad idea to have several processes write to the same file simultaneously.
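
I didn’t try it, but one way around this would be to have each chunk write to output files tagged with parallel’s job number ({#}), so that no two jobs ever share an output file, and then merge the pieces per chromosome afterwards. A sketch:

#each 500M chunk is handled by its own awk process, which writes lines to
#files named like chr1.<job>.bed, so no two jobs append to the same file
cat it_s_a_big_file.bed | parallel --pipe --block 500M 'awk -v job={#} "{ print > (\$1 \".\" job \".bed\") }"'

#merge the per-job pieces for each chromosome
for c in {1..22} X Y M; do cat chr${c}.*.bed > chr${c}.bed; done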

Parallel sorting versus chromosomal sorting

From the smaller BED file we tested in the first part of this post, we learned that sort --parallel was extremely fast. So I used the --parallel option:

#for info on bgzip see http://samtools.sourceforge.net/tabix.shtml
cat it_s_a_big_file.bed | sort --parallel=12 -k1,1 -k2,2n -k6,6 | bgzip > it_s_a_big_file.bed.gz

I didn’t run it with time (yes, silly me!), but from the timestamps of the files I worked out that it took roughly 7 hours, including the bgzip step (I created the run.sh script at 14:09 and had it_s_a_big_file.bed.gz at 21:17).

Let’s compare this to sorting each individual chromosome: how long does it take to sort all of the chromosome files (or rather, since the jobs run in parallel, how long does the biggest chromosome, chr1, take)?

#using the split chromosome files prepared with parallel and grep
time parallel --joblog sort.log --verbose 'sort -k1,1 -k2,2n -k6,6 chr{}.bed > chr{}_sorted.bed' ::: {1..22} X Y M

real    350m28.062s
user    4187m36.244s
sys     192m30.739s

#and join the files back up
time for i in 1 {10..19} 2 {20..22} {3..9} M X;
   do cat chr${i}_sorted.bed >> all_sorted.bed
done

real    13m27.396s
user    0m2.563s
sys     7m35.211s

#bgzip
time cat all_sorted.bed | bgzip > all_sorted.bed.gz

real    88m24.757s
user    84m26.179s
sys     8m15.971s

So if we split the chromosomes using the grep approach (60m3.672s), sort each chromosome (350m28.062s), join them back up (13m27.396s) and bgzip (88m24.757s), it takes roughly 60 + 350 + 13 + 88 ≈ 512 minutes, or about 8.5 hours.

Conclusions

I’ve learned that:

  1. It’s faster to sort a sorted file than an unsorted file
  2. Do not let several processes append to the same file simultaneously
  3. The easiest, hassle-free solution to sorting a huge file is to use the --parallel parameter

However, the best solution here would be to split up the huge file by chromosome (using parallel and grep [unless I find a better way]) and then sort each chromosome BED file with --parallel.
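
A sketch of what that combined approach could look like (not something I ran as a whole; chrY is included at the end of the loops for completeness):

#split the huge file into per-chromosome files, 12 grep jobs at a time
parallel -j12 'cat it_s_a_big_file.bed | grep "chr{}\b" > chr{}.bed' ::: {1..22} X Y M

#sort each chromosome file in turn, letting each sort use 12 cores
for i in 1 {10..19} 2 {20..22} {3..9} M X Y;
   do sort --parallel=12 -k1,1 -k2,2n -k6,6 chr${i}.bed > chr${i}_sorted.bed
done

#join the sorted files back up in lexical chromosome order
for i in 1 {10..19} 2 {20..22} {3..9} M X Y;
   do cat chr${i}_sorted.bed >> all_sorted.bed
done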

Shout-outs to @robin_andersson, @OwenRackham and @tjkwon75 (his three words of wisdom) for their help via Twitter.

This work is licensed under a Creative Commons Attribution 4.0 International License.
Comments
  1. Hi Dave,
    When you use bamToBed, is the start 0-based or 1-based?
    “bamToBed -i test.bam > test.bed”

    head -3 test.bed
    chr1 10571 10598 HWUSI-EAS733_0011:1:62:3915:15245#0|AGA 1 -
    chr1 10571 10598 HWUSI-EAS747_0013:3:40:11187:7108#0|AGA 1 -
    chr1 10571 10598 HWUSI-EAS747_0013:3:60:14237:7545#0|AGA 1 -

    So, is 10571 0-based or 1-based?
    Thanks in advance.

    1. Hi Chirag,

      I just checked by making a fake BAM file that starts from 0:


      samtools view a.bam
      VHE-246613351024-23-7-2-1280 0 chr1 0 15 30M * 0 0 TGCAGCACAGTGCAGCGCGAGCCAGCTGGA * NM:i:4 MD:Z:14T1T4T4A3 XP:Z:~~~~~~~~~~~~~~~~~~~~~~~~~~A@@~
      bamToBed -i a.bam
      chr1 -1 29 VHE-246613351024-23-7-2-1280 15 +

      So it looks like bamToBed expects 1-based coordinates and converts them to 0-based.

      Congratulations on your paper by the way! I still have to read it 🙂

      Cheers,

      Dave

  2. Thanks Dave !!

    So, bam files are always 1-based as mentioned in IGV
    http://www.broadinstitute.org/igv/BAM
    One-based index: Start and end positions are identified using a one-based index. The end position is included. For example, setting start-end to 1-2 describes two bases, the first and second in the sequence.

    While BEDTools uses a 0-based start and a 1-based end:
    1.3.4. BED starts are zero-based and BED ends are one-based.

    bedtools users are sometimes confused by the way the start and end of BED features are represented. Specifically, bedtools uses the UCSC Genome Browser’s internal database convention of making the start position 0-based and the end position 1-based: (http://genome.ucsc.edu/FAQ/FAQtracks#tracks1) In other words, bedtools interprets the “start” column as being 1 basepair higher than what is represented in the file. For example, the following BED feature represents a single base on chromosome 1; namely, the 1st base:

    So, as you suggested, the start is 0-based.

    Thanks !! You should read the paper, I guess it is a pretty decent paper 🙂

    cheers

    1. Hi Kalyan,

      I didn’t even know the existence of bedops! I’ll give it a go sometime; thanks for letting me know about it.

      Cheers,

      Dave
