TIL that you can customise BLAST output

I am going to start a new series of posts based on new things (new to me) I learned about recently and they will be included in the TIL category. I learned about TIL from the bird app a while ago and it is based on the popular subreddit todayilearned. I used to post TIL stuff on the bird app but since I no longer use it, I'll use my blog to share some things that are hopefully new and useful to others as well.

BLAST has been around for a long time and was (and still is) a staple of bioinformatics. I first used it a long time ago as well and at that time (if I recall correctly) BLAST only had several output options that you would specify with -outfmt. The most useful was the tabular output (6 or 7) that could be easily parsed with a Perl script. Thus it became ingrained in me that that was the way to go.

Recently I needed the actual alignment produced by BLAST, which is provided by output format 0 and the part that I'm interested in looks like this.

Query  1    FTASSFLKSFY  11
            FTA SFLKSFY
Sbjct  708  FTAPSFLKSFY  718

However, output format 6 or 7 produces output that doesn't contain the alignment.

FTASSFLKSFY ENSP00000395986.2|ENST00000431480.6|ENSG00000136111.14|OTTHUMG00000017088.5|-|TBC1D4-204|TBC1D4|1290    90.909  11  1   0   1   11  708 718 3.97e-04    34.6

While I can (and I did) write a script to parse output format 0, it doesn't look nice. That is when I thought (and I have been doing this much more recently) is there a better way to do this? After consulting the blastp help page, I found out that you can customise BLAST output. The relevant section from blastp -help is shown below:

Options 6, 7 and 10 can be additionally configured to produce a custom format specified by space delimited format specifiers, or by a token specified by the delim keyword.

The format specifier that I needed was:

qseq means Aligned part of query sequence
sseq means Aligned part of subject sequence

Therefore by running blastp with the following outfmt:

-outfmt "7 sacc qacc slen qlen sstart send qstart qend sseq qseq length pident nident mismatch gapopen gaps evalue bitscore"

I was able to generate the following output:

ENSP00000395986.2|ENST00000431480.6|ENSG00000136111.14|OTTHUMG00000017088.5|-|TBC1D4-204|TBC1D4|1290    FTASSFLKSFY 1290    11  708 718 1   11  FTAPSFLKSFY FTASSFLKSFY 11  90.909  10  1   0   0   3.97e-04    34.6

which allowed me to reconstruct the alignment with a cleaner script than the one I wrote to parse output format 0.

Summary

I mentioned that recently I have started to question whether something (anything) can be done better (without putting in an Herculean effort). This has been a good habit to pick up because it forces me to keep updating my knowledge instead of leaving it stagnant. I have illustrated this point in this blog post:

  1. I had an ugly script that parsed an unfriendly format because the information I needed was only available in this format.
  2. My preconception was that the BLAST output formats were fixed based on my outdated knowledge base.
  3. Reading the documentation allowed to find out that you can customise BLAST output.
  4. The customised BLAST output contained the information I needed and is much nicer to parse, resulting in a much less error-prone and easier to read script.

Lastly, this post is the 300th post on this blog. I hope to keep it going for as long as possible. Please do let me know if you have found any of the posts useful. Take care, Dave.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.