Updated 2013 November 20th
Tabix as described in the abstract of the paper:
Tabix is the first generic tool that indexes position sorted files in TAB-delimited formats such as GFF, BED, PSL, SAM and SQL export, and quickly retrieves features overlapping specified regions. Tabix features include few seek function calls per query, data compression with gzip compatibility and direct FTP/HTTP access. Tabix is implemented as a free command-line tool as well as a library in C, Java, Perl and Python. It is particularly useful for manually examining local genomic features on the command line and enables genome viewers to support huge data files and remote custom tracks over networks.
When I came across tabix a couple of years ago, I looked at the paper and also the man page and while I felt it could be extremely useful, I didn't really appreciate the role it served. Now that I've seen what a colleague has achieved with it, I can see the light. Here I demonstrate with two simple examples some of the advantages of using tabix, namely speed and compression.
Testing tabix with SAM files
For the BAM file example, let's use one from The 1000 Genomes Project.
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA18553/alignment/NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.bam
Now convert to SAM and index:
samtools view -h NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.bam > NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.sam bgzip NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.sam tabix -p sam NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.sam.gz
Of note is the file sizes:
1717942076 NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.bam
1624632294 NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.sam.gz
bgzip compression of a SAM file is smaller than compressing to BAM; I can still use samtools with the bgzipped file though not as quick as processing a BAM file.
Now let's try some examples with tabix and samtools (samtools can also return reads within coordinates see this page) for returning reads mapped within genomic coordinates:
tabix NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.sam.gz 11:60000-70000 SRR032291.13719341 163 11 60046 0 67M9S = 60358 346 TAAATACACTGAAGCTGCCAAAACAATCTATCGTTTTGCCTACGTACTTATCAACTTCCTCATAGCAACATGGGAG ?DEE?@BFDIHDGGHKHIHJHHF6H?<2HE?E>7:=3D7/CB=;.D?G@=)16.>@=+18+@(5@9########## X0:i:10 X1:i:0 XC:i:67 MD:Z:67 RG:Z:SRR032291 AM:i:0 NM:i:0 SM:i:0 MQ:i:0 XT:A:R BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ [rest of results snipped] time tabix NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.sam.gz 11:60000-5000000 | wc 550668 11647956 203417916 real 0m5.192s user 0m6.918s sys 0m0.154s #testing samtools using test.bed, which contains the coordinates 11 60000 5000000 time samtools view -F 4 -L test.bed NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.bam | wc 537327 11475465 200350980 real 0m30.318s user 0m30.119s sys 0m1.800s #the proper way to get a slice of a BAM file #see comments for more information samtools index NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.bam time samtools view NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.bam 11:60000-5000000 | wc 550668 11647956 203417916 real 0m5.192s user 0m8.998s sys 0m0.139s
Thanks to the wisdom of an anonymous commenter, I learned that
By the way, samtools and tabix are by the same author, and BAM files are also block-gzipped. Extracting entries using the coordinate index (BAI or TBI) is equally fast in both technologies.
The reason why you are observing the time discrepancy is because the "-L targets.bed" option in samtools does not consult the index. Instead, when that option is given, samtools still goes through all mappings in the input file and checks for each one whether it overlaps any BED target.
The equivalent call for samtools is "samtools view file.bam 11:60000-5000000", which uses the BAM index to slice out only that region.
Testing tabix with bed files
For the FANTOM5 project we have well over thousands of CAGE libraries. My aforementioned colleague prepared a bed file containing all the CAGE tags in all human samples. Despite having more than half a billion lines in this bed file, making a genomic inquiry took only seconds.
time tabix hg19.ctss.bed.gz chr11:60000-5000000 | wc 9577172 57463032 336697101 real 0m6.276s user 0m7.997s sys 0m3.092s
Using tabix and BEDTools
Here I have a bunch of BAM files and I create one big tabix file, so I can quickly check the number of reads for a region of interest. You'll need to install BEDTools for converting the BAM files into BED:
#convert to bed
#if you have a lot of bam files
#you can run this using GNU parallel
#parallel 'bamToBed -i {} > {.}.bed' ::: *.bam
#otherwise use this:
for file in `ls *.bam`;
do echo $file
base=`basename $file .bam`
bamToBed -i $file > $base.bed
done
Now to tally all the BED files into one file and index:
for file in `ls *.bed`;
do echo $file;
base=`basename $file .bed`;
cat $file | groupBy -g 1,2,6 -c 2 -o count | awk -v id=$base 'OFS="\t" {print $1, $2, $2+1, id, $4, $3}' >> all.bed
done
cat all.bed | sort -k1,1 -k2,2n -k6,6 | bgzip > all.bed.gz
tabix -p bed all.bed.gz
#now perform a query
tabix all.bed.gz chr7:5566000-5570500 | sort -k4 | groupBy -g 4 -c 5 -ops sum | head
a 9672
b 3530
c 21401
d 8113
e 8428
f 6173
g 7349
h 15761
i 12229
j 4693
#just to double check the numbers
#I've made a bed file containing the coordinates above
intersectBed -abam a.bam -b ~/tmp/test.bed -bed -u | wc
9672 116064 1093327
Storing normalised values instead:
#convert to bed as per before
for file in `ls *.bam`;
do echo $file
base=`basename $file .bam`
bamToBed -i $file > $base.bed
done
for file in `ls *.bed`;
do echo $file;
total=`cat $file | wc -l`
base=`basename $file .bed`;
cat $file | groupBy -g 1,2,6 -c 2 -o count | awk -v id=$base -v total=$total 'OFS="\t" {print $1, $2, $2+1, id, $4*1000000/total, $3}' >> all_tpm.bed
done
cat all_tpm.bed | sort -k1,1 -k2,2n -k6,6 | bgzip > all_tmp.bed.gz
tabix -p bed all_tmp.bed.gz
#now perform a query
tabix all_tmp.bed.gz chr7:5566000-5570500 | sort -k4 | groupBy -g 4 -c 5 -ops sum | head
a 1143.33436399999982314
b 1364.21559799999999996
c 2421.59271699999999328
d 1957.13968499999964479
e 932.962562000000161788
f 575.469340999999985797
g 847.564216999999985092
h 1919.78977400000030684
i 1105.43800160000023425
j 1072.46930399999996553
#double check
bc -l<<<9672*1000000/8459465
1143.33471442934038972913
Further reading
For the keen reader, have a look at the paper for tabix.

This work is licensed under a Creative Commons
Attribution 4.0 International License.
By the way, samtools and tabix are by the same author, and BAM files are also block-gzipped. Extracting entries using the coordinate index (BAI or TBI) is equally fast in both technologies.
The reason why you are observing the time discrepancy is because the “-L targets.bed” option in samtools does not consult the index. Instead, when that option is given, samtools still goes through all mappings in the input file and checks for each one whether it overlaps any BED target.
The equivalent call for samtools is “samtools view file.bam 11:60000-5000000”, which uses the BAM index to slice out only that region.
Thanks for the clarification and the tip! I had always wondered about the [region1 […]] usage part in the usage of samtools view.