Using tabix

Updated 2013 November 20th

Tabix as described in the abstract of the paper:

Tabix is the first generic tool that indexes position sorted files in TAB-delimited formats such as GFF, BED, PSL, SAM and SQL export, and quickly retrieves features overlapping specified regions. Tabix features include few seek function calls per query, data compression with gzip compatibility and direct FTP/HTTP access. Tabix is implemented as a free command-line tool as well as a library in C, Java, Perl and Python. It is particularly useful for manually examining local genomic features on the command line and enables genome viewers to support huge data files and remote custom tracks over networks.

When I came across tabix a couple of years ago, I looked at the paper and also the man page and while I felt it could be extremely useful, I didn’t really appreciate the role it served. Now that I’ve seen what a colleague has achieved with it, I can see the light. Here I demonstrate with two simple examples some of the advantages of using tabix, namely speed and compression.

Testing tabix with SAM files

For the BAM file example, let’s use one from The 1000 Genomes Project.

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA18553/alignment/NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.bam

Now convert to SAM and index:

samtools view -h NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.bam > NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.sam
bgzip NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.sam
tabix -p sam NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.sam.gz

Of note is the file sizes:

1717942076 NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.bam
1624632294 NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.sam.gz

bgzip compression of a SAM file is smaller than compressing to BAM; I can still use samtools with the bgzipped file though not as quick as processing a BAM file.

Now let’s try some examples with tabix and samtools (samtools can also return reads within coordinates see this page) for returning reads mapped within genomic coordinates:

tabix NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.sam.gz 11:60000-70000
SRR032291.13719341      163     11      60046   0       67M9S   =       60358   346     TAAATACACTGAAGCTGCCAAAACAATCTATCGTTTTGCCTACGTACTTATCAACTTCCTCATAGCAACATGGGAG    ?DEE?@BFDIHDGGHKHIHJHHF6H?<2HE?E>7:=3D7/CB=;.D?G@=)16.>@=+18+@(5@9########## X0:i:10 X1:i:0  XC:i:67 MD:Z:67 RG:Z:SRR032291  AM:i:0  NM:i:0  SM:i:0  MQ:i:0  XT:A:R  BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[rest of results snipped]
time tabix NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.sam.gz 11:60000-5000000 | wc
 550668 11647956 203417916

real    0m5.192s
user    0m6.918s
sys     0m0.154s

#testing samtools using test.bed, which contains the coordinates 11 60000 5000000
time samtools view -F 4 -L test.bed NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.bam | wc
 537327 11475465 200350980

real    0m30.318s
user    0m30.119s
sys     0m1.800s

#the proper way to get a slice of a BAM file
#see comments for more information
samtools index NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.bam
time samtools view NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.bam 11:60000-5000000 | wc
 550668 11647956 203417916

real    0m5.192s
user    0m8.998s
sys     0m0.139s

Thanks to the wisdom of an anonymous commenter, I learned that

By the way, samtools and tabix are by the same author, and BAM files are also block-gzipped. Extracting entries using the coordinate index (BAI or TBI) is equally fast in both technologies.

The reason why you are observing the time discrepancy is because the “-L targets.bed” option in samtools does not consult the index. Instead, when that option is given, samtools still goes through all mappings in the input file and checks for each one whether it overlaps any BED target.

The equivalent call for samtools is “samtools view file.bam 11:60000-5000000”, which uses the BAM index to slice out only that region.

Testing tabix with bed files

For the FANTOM5 project we have well over thousands of CAGE libraries. My aforementioned colleague prepared a bed file containing all the CAGE tags in all human samples. Despite having more than half a billion lines in this bed file, making a genomic inquiry took only seconds.

time tabix hg19.ctss.bed.gz chr11:60000-5000000 | wc
9577172 57463032 336697101

real    0m6.276s
user    0m7.997s
sys     0m3.092s

Using tabix and BEDTools

Here I have a bunch of BAM files and I create one big tabix file, so I can quickly check the number of reads for a region of interest. You’ll need to install BEDTools for converting the BAM files into BED:

#convert to bed
#if you have a lot of bam files
#you can run this using GNU parallel
#parallel 'bamToBed -i {} > {.}.bed' ::: *.bam
#otherwise use this:
for file in `ls *.bam`;
   do echo $file
   base=`basename $file .bam`
   bamToBed -i $file > $base.bed
done

Now to tally all the BED files into one file and index:

for file in `ls *.bed`;
  do echo $file;
  base=`basename $file .bed`;
  cat $file | groupBy -g 1,2,6 -c 2 -o count | awk -v id=$base 'OFS="\t" {print $1, $2, $2+1, id, $4, $3}' >> all.bed
done

cat all.bed | sort -k1,1 -k2,2n -k6,6 | bgzip > all.bed.gz
tabix -p bed all.bed.gz

#now perform a query
tabix all.bed.gz chr7:5566000-5570500 | sort -k4 | groupBy -g 4 -c 5 -ops sum | head
a	9672
b	3530
c	21401
d	8113
e	8428
f	6173
g	7349
h	15761
i	12229
j	4693

#just to double check the numbers
#I've made a bed file containing the coordinates above
intersectBed -abam a.bam -b ~/tmp/test.bed -bed -u | wc
   9672  116064 1093327

Storing normalised values instead:

#convert to bed as per before
for file in `ls *.bam`;
   do echo $file
   base=`basename $file .bam`
   bamToBed -i $file > $base.bed
done

for file in `ls *.bed`;
  do echo $file;
  total=`cat $file | wc -l`
  base=`basename $file .bed`;
  cat $file | groupBy -g 1,2,6 -c 2 -o count | awk -v id=$base -v total=$total 'OFS="\t" {print $1, $2, $2+1, id, $4*1000000/total, $3}' >> all_tpm.bed
done

cat all_tpm.bed | sort -k1,1 -k2,2n -k6,6 | bgzip > all_tmp.bed.gz
tabix -p bed all_tmp.bed.gz

#now perform a query
tabix all_tmp.bed.gz chr7:5566000-5570500 | sort -k4 | groupBy -g 4 -c 5 -ops sum | head
a	1143.33436399999982314
b	1364.21559799999999996
c	2421.59271699999999328
d	1957.13968499999964479
e	932.962562000000161788
f	575.469340999999985797
g	847.564216999999985092
h	1919.78977400000030684
i	1105.43800160000023425
j	1072.46930399999996553

#double check
bc -l<<<9672*1000000/8459465
1143.33471442934038972913

Further reading

For the keen reader, have a look at the paper for tabix.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
2 comments Add yours
  1. By the way, samtools and tabix are by the same author, and BAM files are also block-gzipped. Extracting entries using the coordinate index (BAI or TBI) is equally fast in both technologies.

    The reason why you are observing the time discrepancy is because the “-L targets.bed” option in samtools does not consult the index. Instead, when that option is given, samtools still goes through all mappings in the input file and checks for each one whether it overlaps any BED target.

    The equivalent call for samtools is “samtools view file.bam 11:60000-5000000”, which uses the BAM index to slice out only that region.

    1. Thanks for the clarification and the tip! I had always wondered about the [region1 […]] usage part in the usage of samtools view.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.