Perl and bioinformatics
Monday, September 7th, 2009
Recently I posted a question to the Perl beginner mailing list about Perl and bioinformatics. I asked why Perl is better than other scripting languages when it comes to working with text. As a final note in my email, I remarked on how useful text parsing is to biology, since a lot of biological data is stored in text files.
I received an abundance of answers (some of which I am still slowly digesting), giving several reasons for Perl’s superiority in text processing. In no particular order, here are some of the answers to why Perl is good/great at text parsing:
1. Regexes are first class citizens in Perl; in other languages you must use a library that feels out of place.
2. Perl tends to innovate in regexes, adding new features (the most popular non-Perl regex library is PCRE, which stands for Perl Compatible Regular Expressions).
3. Perl is weakly typed, which reduces the amount of code you must write. In strongly typed languages you spend a lot of time casting variables into the desired type.
4. Many of the operators, like the readline operator (<>), are set up to do what you want with a minimum of effort. For example, a bare readline operator gives you the default UNIX filter style: read from stdin if no files are specified on the command line, otherwise open each file passed on the command line and read from each of them in turn.
5. The much maligned set of special variables also makes this easy (for instance, $ARGV is set to the file currently being read by the construct in the last point).
6. Perl’s regexes are still better and faster than the compatible libraries. There are many things Perl can do that they can’t even come close to (my favourite is the /e modifier on s///!).
7. CPAN is another major reason for Perl’s strength in text processing. There are hundreds if not thousands of modules that parse, munge and generate text in all sorts of forms. Just consider how many templaters there are alone.
8. Perl’s scalar values are designed from the start to be powerful text storage items. They have no fixed limit on size, and you can add, cut, shrink, extract, etc. directly with operators instead of long-winded function calls. The Perl ops and functions are designed to work well together, generating concise (a better and more accurate word than terse) code. Perl’s guts have been optimized over many years and are very fast when doing text munging.
So from what I have gathered, Perl’s regexes give the user the power to build complex text parsers, and since Perl’s scalar values have no fixed size limit, large chunks of text can be stored and operated upon. It also seems that Perl’s special variables and operators make it easier to work with text and pattern matching.
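To make points 4, 5 and 6 a bit more concrete, here is a minimal sketch written in Python rather than Perl, so treat it as an analogy and not as something the mailing-list answers showed. `fileinput.input()` is roughly Python's counterpart to a bare `<>`, the stream's `filename()` method plays the role of `$ARGV`, and passing a function to `re.sub` is the closest equivalent of the /e modifier on s///. The FASTA-style header format and the function names are made up for illustration.

```python
import fileinput
import re

# Matches a FASTA-style header such as ">seq1 length=12" (hypothetical format).
HEADER = re.compile(r"^>(\S+)\s+length=(\d+)")


def kb_label(match):
    """Computed replacement: turn 'length=1500' into 'length=1.5kb'."""
    return "length={:.1f}kb".format(int(match.group(1)) / 1000)


def rewrite_lengths(line):
    # Like Perl's s/length=(\d+)/.../e: the replacement is computed per match.
    return re.sub(r"length=(\d+)", kb_label, line)


def filter_headers(files=None):
    """Yield (source, name, length) for each header line.

    With files=None this reads the files named on the command line, or
    stdin when there are none, much like Perl's while (<>) loop.
    """
    with fileinput.input(files=files) as stream:
        for line in stream:
            m = HEADER.match(line)
            if m:
                # stream.filename() is the analogue of Perl's $ARGV.
                yield stream.filename(), m.group(1), int(m.group(2))
```

Calling `filter_headers()` with no arguments inside a script gives the UNIX filter behaviour described in point 4; passing a list of paths reads just those files.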
After some further discussion with a Perl guru, I understood the concept of type casting and how Perl’s automatic conversions mean writing less code to cast variables. They also mentioned strongly typed languages, where mixing a floating point value and an integer in an expression such as 10.0 + 5 can require an explicit cast. (C++ was the example given, although C++ actually converts the integer implicitly in that case; stricter languages such as OCaml do reject it.) This was also touched on briefly in “How Perl Saved The Human Genome Project”, along with how weak typing can be a problem if a variable changes from a number to a character. This often happens because a lot of data is handled manually and human errors do occur.
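The number-turns-into-text problem is easy to picture. In Perl, a string like "1O.5" (with the letter O typed instead of a zero) used in arithmetic quietly becomes 1, and you only hear about it if warnings are enabled. The sketch below shows the stricter alternative, in Python, where the conversion is explicit and a bad value raises an error that can be reported. The column values and the function name are invented for illustration.

```python
def parse_lengths(fields):
    """Convert text fields to floats, collecting the ones that are not numbers."""
    good, bad = [], []
    for position, field in enumerate(fields):
        try:
            good.append(float(field))
        except ValueError:
            # Keep the position so the offending cell can be found and fixed.
            bad.append((position, field))
    return good, bad


# The third value has a letter O where a zero was intended.
values, errors = parse_lengths(["10.0", "5", "1O.5"])
print(values)
print(errors)
```

The trade-off is the one from the discussion: more code for the conversion, but a hand-edited typo is caught instead of silently changing the result.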
Just today I corresponded with a fellow bioinformatician who has used Perl for almost 10 years. They wanted to point out that biological data isn’t just huge genomic files (where Perl is extremely useful), and that for other kinds of data other scripting languages are just as competent as Perl, a point also made by another Perl guru. Apparently Ruby is extremely popular in Japan, where it was created. There is even a BioRuby! But the point is really that for many bioinformatic tasks, such as parsing XML files, Perl holds no advantage over other scripting languages. Perl is just popular in bioinformatics because it was there first.
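As a small illustration of the XML point, here is a sketch using Python's standard library `xml.etree.ElementTree`. The record layout is made up (real biological XML formats have their own schemas), and the function name is hypothetical; the idea is only that this kind of structured parsing is handled by a library, whatever the scripting language.

```python
import xml.etree.ElementTree as ET

# A made-up record for illustration only.
RECORD = """
<sequences>
  <sequence id="seq1" organism="E. coli">ACGTACGT</sequence>
  <sequence id="seq2" organism="H. sapiens">GGCC</sequence>
</sequences>
"""


def sequence_lengths(xml_text):
    """Map each sequence id to the length of its sequence text."""
    root = ET.fromstring(xml_text)
    return {
        seq.get("id"): len(seq.text.strip())
        for seq in root.iter("sequence")
    }


print(sequence_lengths(RECORD))
```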
It has been a very insightful thread and I got a lot out of it. It helped me understand the history of Perl in biology and bioinformatics, and explained why Perl is so well suited to parsing large text files. However, I have heard that Perl’s syntax is hard to read and ugly, and I guess that when you’re developing software, readability plays a large part. Besides, Perl is known for being quick and dirty and is often called the Swiss army chainsaw. Hopefully when I get my Python bioinformatics book, I can get a better perspective.