Using blat

My multipurpose sequence aligner tool of choice for many years has been blat. This is a short post on the basics of blat. To use blat, download the 64bit Linux version of blat (or a version that matches your operating system) here. When aligning sequences to the genome, make sure you use the 64 bit version or else you will receive an out of memory error (needHugeMen: Out of huge memory – request size).

To get started below is a slide I made couple of years ago:

First blat splits the reference sequence up into “tiles”. The manner in which it is split depends on two parameters, -tileSize and -stepSize, where -tileSize is the size of the tile and -stepSize specifies when to start the next tile. The default setting of both for DNA sequences is 11, which also means the tiles do not overlap.

For blat to report an alignment, your query sequence must match at least two tiles (set via -minMatch) with no mismatches (you can allow up to one mismatch in the tile by using -oneOff). So if you’re trying to align a 21 bp sequence to a reference using the default setting, blat will never report an alignment.

To illustrate, imagine this reference sequence (44bp):

>database
AAAAAAAAAAACCCCCCCCCCCGGGGGGGGGGGTTTTTTTTTTT

and this query sequence (12bp)

>test
GGGGGGGGGGGT:

The only way an alignment will be reported is if the tileSize is set to 1 and the minScore set to less than 12.

#returns no hit
blat -minScore=0 -stepSize=2 database.fa test.fa output.psl
#returns 2 hits
blat -minScore=0 -stepSize=1 database.fa test.fa output.psl

Here’s an old post showing how the blat parameters affect the output.

Running blat in parallel

Before running blat in parallel, check how many processors are on your machine (32 are available in this example):

cat /proc/cpuinfo | grep "^processor" | wc
     32      96     470

To use blat in parallel, download GNU parallel and compile it. Download the hg19 chromosomes. Let’s align a bunch of long non coding RNAs (lncRNA) to hg19. Download the human lncRNA dataset from NONCODE and unzip it. There should be 33,829 sequences in the human.fa file.

unzip lncRNA_human.zip
cd human
cat human.fa | grep "^>" | wc
  33829  692926 4407633

Now untar and gunzip the hg19 file. I’m only interested in assembled chromosomes, so I will delete the rest.

tar -xzf chromFa.tar.gz
#ls to check whether I'm deleting the right files
ls *random.fa
chr11_gl000202_random.fa  chr17_gl000206_random.fa  chr1_gl000191_random.fa   chr4_gl000194_random.fa  chr9_gl000198_random.fa
chr17_gl000203_random.fa  chr18_gl000207_random.fa  chr1_gl000192_random.fa   chr7_gl000195_random.fa  chr9_gl000199_random.fa
chr17_gl000204_random.fa  chr19_gl000208_random.fa  chr21_gl000210_random.fa  chr8_gl000196_random.fa  chr9_gl000200_random.fa
chr17_gl000205_random.fa  chr19_gl000209_random.fa  chr4_gl000193_random.fa   chr8_gl000197_random.fa  chr9_gl000201_random.fa
rm *random.fa
ls chrUn*.fa
#ls to check again
chrUn_gl000211.fa  chrUn_gl000216.fa  chrUn_gl000221.fa  chrUn_gl000226.fa  chrUn_gl000231.fa  chrUn_gl000236.fa  chrUn_gl000241.fa  chrUn_gl000246.fa
chrUn_gl000212.fa  chrUn_gl000217.fa  chrUn_gl000222.fa  chrUn_gl000227.fa  chrUn_gl000232.fa  chrUn_gl000237.fa  chrUn_gl000242.fa  chrUn_gl000247.fa
chrUn_gl000213.fa  chrUn_gl000218.fa  chrUn_gl000223.fa  chrUn_gl000228.fa  chrUn_gl000233.fa  chrUn_gl000238.fa  chrUn_gl000243.fa  chrUn_gl000248.fa
chrUn_gl000214.fa  chrUn_gl000219.fa  chrUn_gl000224.fa  chrUn_gl000229.fa  chrUn_gl000234.fa  chrUn_gl000239.fa  chrUn_gl000244.fa  chrUn_gl000249.fa
chrUn_gl000215.fa  chrUn_gl000220.fa  chrUn_gl000225.fa  chrUn_gl000230.fa  chrUn_gl000235.fa  chrUn_gl000240.fa  chrUn_gl000245.fa
rm chrUn*.fa
ls *hap*.fa
chr17_ctg5_hap1.fa  chr6_apd_hap1.fa  chr6_dbb_hap3.fa   chr6_mcf_hap5.fa  chr6_ssto_hap7.fa
chr4_ctg9_hap1.fa   chr6_cox_hap2.fa  chr6_mann_hap4.fa  chr6_qbl_hap6.fa
rm *hap*.fa
#22 somatic chromosomes + chrY, chrM and chrX
ls *.fa | wc
     25      25     213

Create a test fasta file containing the sequence for n37. I checked the human.bed file for n37 (since NONCODE also provides you with the mapping location as a bed file) to see that it maps to chr11.

cat human/human.fa | grep -v "^#" | head -2 > human/test.fa
cat human/human.bed | awk '$4 == "n37" {print}'
chr11   2161746 2169894 n37     992     +       2161746 2169894 0       7       431,181,100,293,38,682,331,     0,5728,5911,6801,7095,7134,7817,
#run a test blat job using default parameters
blat chr11.fa human/test.fa test.psl
cat test.psl
#results are the same
psLayout version 3

match   mis-    rep.    N's     Q gap   Q gap   T gap   T gap   strand  Q               Q       Q       Q       T               T       T       T       block   blockSizes   qStarts  tStarts
        match   match           count   bases   count   bases           name            size    start   end     name            size    start   end     count
---------------------------------------------------------------------------------------------------------------------------------------------------------------
2046    10      0       0       0       0       6       6092    +       n37     2056    0       2056    chr11   135006516       2161746 2169894 7       431,181,100,293,38,682,331,  0,431,612,712,1005,1043,1725,   2161746,2167474,2167657,2168547,2168841,2168880,2169563,

Now to do this in parallel and convert results from psl to bed. I wrote a script called psl_to_bed_best_score.pl, a while ago and we can use it to parse the psl files.

parallel blat chr{}.fa human/test.fa test_{}.psl ::: {1..22} X Y M
ls *.psl
test_1.psl   test_11.psl  test_13.psl  test_15.psl  test_17.psl  test_19.psl  test_20.psl  test_22.psl  test_4.psl  test_6.psl  test_8.psl  test_M.psl  test_Y.psl
test_10.psl  test_12.psl  test_14.psl  test_16.psl  test_18.psl  test_2.psl   test_21.psl  test_3.psl   test_5.psl  test_7.psl  test_9.psl  test_X.psl
cat *.psl > all.psl
psl_to_bed_best_score.pl all.psl all.bed
cat all.bed
chr11   2161746 2169894 n37     2030    +       2161746 2169894 255,0,0 7       431,181,100,293,38,682,331,     0,5728,5911,6801,7095,7134,7817
#almost identical to the bed file provided from NONCODE apart from the score column
cat human/human.bed | awk '$4 == "n37" {print}'
chr11   2161746 2169894 n37     992     +       2161746 2169894 0       7       431,181,100,293,38,682,331,     0,5728,5911,6801,7095,7134,7817,
#try it on all lncRNA but first remove the comments from the human.fa file
cat human/human.fa | grep -v "^#" > tmp
mv tmp human/human.fa
time parallel blat chr{}.fa human/human.fa test_{}.psl ::: {1..22} X Y M

#well that took too long!
real    16996m51.851s
user    301319m54.423s
sys     50m14.070s

See also this useful post for more examples of how to use GNU parallel.

blat parameters

For convenience.

blat - Standalone BLAT v. 34 fast sequence search command line tool
usage:
   blat database query [-ooc=11.ooc] output.psl
where:
   database and query are each either a .fa , .nib or .2bit file,
   or a list these files one file name per line.
   -ooc=11.ooc tells the program to load over-occurring 11-mers from
               and external file.  This will increase the speed
               by a factor of 40 in many cases, but is not required
   output.psl is where to put the output.
   Subranges of nib and .2bit files may specified using the syntax:
      /path/file.nib:seqid:start-end
   or
      /path/file.2bit:seqid:start-end
   or
      /path/file.nib:start-end
   With the second form, a sequence id of file:start-end will be used.
options:
   -t=type     Database type.  Type is one of:
                 dna - DNA sequence
                 prot - protein sequence
                 dnax - DNA sequence translated in six frames to protein
               The default is dna
   -q=type     Query type.  Type is one of:
                 dna - DNA sequence
                 rna - RNA sequence
                 prot - protein sequence
                 dnax - DNA sequence translated in six frames to protein
                 rnax - DNA sequence translated in three frames to protein
               The default is dna
   -prot       Synonymous with -t=prot -q=prot
   -ooc=N.ooc  Use overused tile file N.ooc.  N should correspond to
               the tileSize
   -tileSize=N sets the size of match that triggers an alignment.
               Usually between 8 and 12
               Default is 11 for DNA and 5 for protein.
   -stepSize=N spacing between tiles. Default is tileSize.
   -oneOff=N   If set to 1 this allows one mismatch in tile and still
               triggers an alignments.  Default is 0.
   -minMatch=N sets the number of tile matches.  Usually set from 2 to 4
               Default is 2 for nucleotide, 1 for protein.
   -minScore=N sets minimum score.  This is the matches minus the
               mismatches minus some sort of gap penalty.  Default is 30
   -minIdentity=N Sets minimum sequence identity (in percent).  Default is
               90 for nucleotide searches, 25 for protein or translated
               protein searches.
   -maxGap=N   sets the size of maximum gap between tiles in a clump.  Usually
               set from 0 to 3.  Default is 2. Only relevent for minMatch > 1.
   -noHead     suppress .psl header (so it's just a tab-separated file)
   -makeOoc=N.ooc Make overused tile file. Target needs to be complete genome.
   -repMatch=N sets the number of repetitions of a tile allowed before
               it is marked as overused.  Typically this is 256 for tileSize
               12, 1024 for tile size 11, 4096 for tile size 10.
               Default is 1024.  Typically only comes into play with makeOoc.
               Also affected by stepSize. When stepSize is halved repMatch is
               doubled to compensate.
   -mask=type  Mask out repeats.  Alignments won't be started in masked region
               but may extend through it in nucleotide searches.  Masked areas
               are ignored entirely in protein or translated searches. Types are
                 lower - mask out lower cased sequence
                 upper - mask out upper cased sequence
                 out   - mask according to database.out RepeatMasker .out file
                 file.out - mask database according to RepeatMasker file.out
   -qMask=type Mask out repeats in query sequence.  Similar to -mask above but
               for query rather than target sequence.
   -repeats=type Type is same as mask types above.  Repeat bases will not be
               masked in any way, but matches in repeat areas will be reported
               separately from matches in other areas in the psl output.
   -minRepDivergence=NN - minimum percent divergence of repeats to allow
               them to be unmasked.  Default is 15.  Only relevant for
               masking using RepeatMasker .out files.
   -dots=N     Output dot every N sequences to show program's progress
   -trimT      Trim leading poly-T
   -noTrimA    Don't trim trailing poly-A
   -trimHardA  Remove poly-A tail from qSize as well as alignments in
               psl output
   -fastMap    Run for fast DNA/DNA remapping - not allowing introns,
               requiring high %ID
   -out=type   Controls output file format.  Type is one of:
                   psl - Default.  Tab separated format, no sequence
                   pslx - Tab separated format with sequence
                   axt - blastz-associated axt format
                   maf - multiz-associated maf format
                   sim4 - similar to sim4 format
                   wublast - similar to wublast format
                   blast - similar to NCBI blast format
                   blast8- NCBI blast tabular format
                   blast9 - NCBI blast tabular format with comments
   -fine       For high quality mRNAs look harder for small initial and
               terminal exons.  Not recommended for ESTs
   -maxIntron=N  Sets maximum intron size. Default is 750000
   -extendThroughN - Allows extension of alignment through large blocks of N's
Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
23 comments Add yours
    1. Hi Andy,

      Nope I haven’t mainly because Bowtie and SOAP were developed for use on high throughput datasets and Blat is a general purpose alignment tool.

      I still use Blat just because it’s integrated with the UCSC Genome Browser.

      Cheers,

      Dave

  1. For newbs like me, which of the outputs of the time() command is the actual time it took to complete the parallelized command?

    I’m hoping it was 50 minutes…

      1. Oh, well… OK. I’m 8 hours into a blat run against the mouse genome, ~5k queries… on 8 cores. Good to know it’s not actually practical to complete the analysis on this machine. Question: are there any community web-resources that support clustered computation of blat, or should I be looking at trying to get access to local cluster for this kind of thing?

        Second, if you were going to distribute blat jobs across a cluster, what do you think would be the smartest way? Would you send the full set of queries against a single chromosome per job?

        Thanks for your guidance/response.

        1. I’m not aware of any resources that supports clustered computations of blat. Actually I think most people come across this post because they searched for “parallel blat” on Google.

          The easiest way is just running a job per chromosome and thus the computation time is just the time it takes to map to chromosome 1 (for human) or 2 (for mouse). You could further split the jobs by splitting your set of queries. There should be tools out there to split a fasta file.

          Perhaps there are ways of splitting the chromosome too, but that might get a bit tricky.

          Hope that helps,

          Dave

  2. Hi Dave,
    I have been following an annotation pipeline for a draft genome (no reference), since they guide using BLAT of RNAseq data (around 72 gigabytes fasta files in my case) against the draft genome (~900Mbp in size). The command is as below:

    ${blat}/blat -noHead -stepSize=5 -minIdentity=93 ${genome}/genome.fa ${RNAseq}/list ${outdir}/blat_align.psl

    Could you please give me some suggestion regarding estimated runtime because this run have been ongoing for almost 20 days and I haven’t set up any parallelization for BLAT.
    Thank you very much in advance!

    Phuong.

    1. Hi Phuong,

      it’s hard to estimate but I can say that BLAT wasn’t designed for RNA-Seq data. Even with parallelisation it could take a very long time. I would suggest using BWA or another short read alignment program to align your RNA-Seq reads.

      Cheers,

      Dave

  3. Hi Davo,
    thanks for your posts, they’re very helpful.
    Recently I start to work with blat and have a question, maybe you have more experience with it. For dna sequences around 3000 nucleotides long, how parameters make more sense to set instead of the defaults. I try the suggestions in the blat website for a sensitive search and also with the parameters that are given for the similar webresults. But nonetheless I have only results with very large insertions(not usable). Would be very grateful for a hint.
    Thanks, Lidia

    1. Hi Lidia,

      the default parameters for BLAT should work quite well with long sequences. If you are not expecting long gaps, try adjusting the -maxIntron parameter.

      Cheers,

      Dave

  4. From the description and documentation I always though that blat builds index while it runs and keeps that in the memory, but when dealing with Apollo installation I’ve run into the requirement of “specifying the path to blat database” and I’m a bit at loss of that does this actually mean. Referring to the same .fasta sequence that was specified as a reference genome sequence seems a bit redundant, but what kind of files are usually specified as a database?

    1. I guess some people use 2bit files instead of fasta files, so perhaps they used “database” to cover them both? My understanding is the same as yours; blat builds the index and keeps it in memory.

  5. hi,
    i am trying to use blat to retrive results into .pls format from fasta file, my query is very short around 21 nucleotide and reference database is 87327254 and i get null results after i run command

  6. hi DAVO,
    My sequence data is from the same chromosome and each of sequence is 25bp .I start work the blat with the hg19 genomic data.I set the parameters is” -stepSize=1 -minScore=20 -minIdentity=20″. However, there is an error”Segmentation fault (core dumped)”. Could you give me some advise to set the suitable parameters?

    1. Hi Ivony,

      it could be various reasons. Firstly, are you able to use blat with the same parameters but mapping to a smaller reference, for example just map to chr22 of hg19? Just take some random sequence from chr22 that is 25 bp long.

      Cheers,

      Dave

  7. Hey Dave-
    Was wondering if you might have any idea how to solve the issue I’m having: blat will only return matches which contain the letters of nucleotides, despite both the query and database files consisting of amino acid sequences, and with the -prot parameter specified. My thought is that I must have gone wrong somewhere in the setup, but I have no idea where.

    Earlier I set up gfserver according to the faq on how to replicate web results
    gfServer start blatMachine portX -stepSize=5 -log=untrans.log [mydatabase].2bit
    gfServer start blatMachine portY -trans -mask -log=trans.log [mydatabase].2bit

    And am also running blat search accordingly:
    blat -stepSize=5 -repMatch=2253 -minScore=0 -minIdentity=0 database.2bit query.fa output.psl -prot

    That’s about all I’ve done. I’m a little unclear on the function of gfServer, but the way that I set those servers up shouldn’t be affecting my search in blat, right? Am I missing something obvious, like some step you have to do to enable protein protein searches?

    Sorry if this is the wrong forum for this question.

    Thanks so much,
    Lily

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.