I learned Perl as my first language because it was the language of choice in the first lab I joined. Over the years I've heard many criticisms of it, such as that Perl code looks ugly and that its motto, "There's more than one way to do it", allows too much flexibility. I particularly like this description of Perl:
Perl would be Voodoo - An incomprehensible series of arcane incantations that involve the blood of goats and permanently corrupt your soul. Often used when your boss requires you to do an urgent task at 21:00 on friday night.
I never really thought Perl was ugly, since it was the only language I knew and beauty is always relative. What made me want to learn a low-level programming language was speed, but I never got around to it. After much delay I have decided to learn C, and I've even created a GitHub repository for getting started with C. The task I have always wanted to do was to write a simple C program and compare its speed with a Perl script that does the same job. So I wrote a C program (flank_bed.c) that takes in a BED6 file, subtracts 200 bp from the start coordinate, adds 200 bp to the end coordinate, and prints everything out. I wrote the C program just by googling stuff, so it's probably sub-optimal at best:
/* getline() and ssize_t are POSIX.1-2008; request them before any include */
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
int main( int argc, char *argv[] )
{
    if(argc != 2){
        fprintf(stderr, "Please input one file name\n");
        exit(1);
    }
    FILE *ifp;
    char *mode = "r";
    /* BED6 fields; the buffer sizes match the width limits given to sscanf below */
    char chr [30];
    int start;
    int end;
    char id [30];
    int score;
    char strand [2];
    char *my_file_name = argv[1];
    ifp = fopen(my_file_name, mode);
    if (ifp == NULL) {
        fprintf(stderr, "Could not open input file %s!\n", my_file_name);
        exit(1);
    }
    /* http://stackoverflow.com/questions/3501338/c-read-file-line-by-line */
    char * line = NULL;
    size_t len = 0;
    ssize_t read;
    while ((read = getline(&line, &len, ifp)) != -1) {
        /* printf("Retrieved line of length %zd :\n", read); */
        /* skip lines that do not contain all six fields */
        if (sscanf (line, "%29s %d %d %29s %d %1s", chr, &start, &end, id, &score, strand) != 6)
            continue;
        printf ("%s\t%d\t%d\t%s\t%d\t%s\n", chr, start-200, end+200, id, score, strand);
    }
    free(line); /* getline() allocates the line buffer */
    fclose(ifp);
    return 0;
}
The equivalent Perl script:
#!/usr/bin/env perl
use strict;
use warnings;
my $usage = "Usage: $0 <infile.bed>\n";
my $infile = shift or die $usage;
# lexical filehandle instead of the bareword IN
open(my $in, '<', $infile) or die "Could not open $infile: $!\n";
while(<$in>){
    chomp;
    my @line = split(/\t/);
    # extend the start and end coordinates by 200 bp
    $line[1] -= 200;
    $line[2] += 200;
    print join("\t", @line), "\n";
}
close($in);
exit(0);
To test the speeds, I created a random BED file with 10 million entries using BEDTools.
#compile C program
gcc flank_bed.c -o flank_bed
#generate file with hg19 chromosome coordinates
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from hg19.chromInfo" > hg19.genome
#create random bed file
#with 10,000,000 lines
randomBed -g hg19.genome -n 10000000 > big.bed
#run the C program
flank_bed big.bed > big_c.bed
#run the Perl script
flank_bed.pl big.bed > big_perl.bed
#check for differences
diff big_c.bed big_perl.bed
Now let's compare the times:
#use the full path to time: with the shell's built-in time,
#-f is treated as the command and not as an option
#-f %e = Elapsed real (wall clock) time used by the process, in seconds.
for i in {1..10}; do /usr/bin/time -f %e flank_bed big.bed > /dev/null; done 2>> c_time.txt
for i in {1..10}; do /usr/bin/time -f %e flank_bed.pl big.bed > /dev/null; done 2>> perl_time.txt
What are the results? Let's plot them using R:
c <- read.table("c_time.txt", header=F, stringsAsFactors=F)
perl <- read.table("perl_time.txt", header=F, stringsAsFactors=F)
df <- data.frame(lang=factor(c(rep('Perl',10),rep('C',10))), time=c(perl$V1, c$V1))
library(reshape2)
df <- melt(df)
library(ggplot2)
qplot(lang, value, data=df, colour=lang, geom=c("boxplot"),
fill=lang, main="Wall time difference",
xlab="Language", ylab="Seconds")
aggregate(df$value, list(lang=df$lang), mean)
# lang x
#1 C 7.701
#2 Perl 20.743
Probably not the best choice of plot but you get the point.
For the simple task of reading in a BED file, performing some simple arithmetic on the genomic coordinates, and outputting the results, C was about 2.7 times faster on average over ten runs (7.7 seconds versus 20.7 seconds), even with the scrappy C program I came up with.
Conclusions
A colleague recently showed me a chart of a programming language index, where the index is an indicator of the popularity of programming languages (see the full list).
He wanted to make the point that C has been around for a long time and remains one of the most popular programming languages. In fact, the full list shows that C has been ranked either 1st or 2nd as far back as 1989.
So here I am finally getting started with C.
See also
Online tutorial for learning C: http://www.learn-c.org/en/Welcome

This work is licensed under a Creative Commons
Attribution 4.0 International License.

This is probably my favorite post you have made on your site. It’s insightful and I agree with many of the points.
I’m glad you liked it; I need to find some time to get back into learning C.
Very nice post.
You are right that the C code can be optimised, but the Perl split command can also go faster. Try this instead and be surprised by how much speed you gain:
# for BED-6 format
my($chrom, $chrom_start, $chrom_end, $name, $score, $strand) = split("\t", $_, 6);
I think it has something to do with the allocation of the array in Perl and the split being based on a regular expression.
Keep posting cool stuff, cheers!
Cool! I'll test it out and include it in an update to this post. Thanks!