I learned Perl as my first language because it was the language of choice in the first lab I joined. Over the years I've heard many criticisms of it, such as that Perl code looks ugly and that its motto, "There's more than one way to do it", allows too much flexibility. I particularly like this description of Perl:
Perl would be Voodoo - An incomprehensible series of arcane incantations that involve the blood of goats and permanently corrupt your soul. Often used when your boss requires you to do an urgent task at 21:00 on friday night.
I never really thought Perl was ugly, since it was the only language I knew and beauty is always relative. Speed was what made me want to learn a lower-level programming language, but I never got around to it. After much delay I have decided to learn C, and I've even created a GitHub repository for getting started with C. The task I have always wanted to do was to write a simple C program and compare its speed with a Perl script that does the same job. So I wrote a C program (flank_bed.c) that takes in a BED6 file, subtracts 200 bp from the start coordinate, adds 200 bp to the end coordinate, and prints everything out. I wrote the C program just by googling stuff, so it's probably sub-optimal at best:
#include <stdio.h>
#include <stdlib.h>

int main( int argc, char *argv[] ) {

   if(argc != 2){
      fprintf(stderr, "Please input one file name\n");
      exit(1);
   }

   FILE *ifp;
   char *mode = "r";

   char chr [30];
   int start;
   int end;
   char id [30];
   int score;
   char strand [2];

   char *my_file_name = argv[1];
   ifp = fopen(my_file_name, mode);

   if (ifp == NULL) {
      fprintf(stderr, "Could not open input file %s!\n", my_file_name);
      exit(1);
   }

   /* http://stackoverflow.com/questions/3501338/c-read-file-line-by-line */
   char * line = NULL;
   size_t len = 0;
   ssize_t read;

   while ((read = getline(&line, &len, ifp)) != -1) {
      /* printf("Retrieved line of length %zd :\n", read); */
      /* field widths stop sscanf from overflowing chr, id and strand */
      if (sscanf (line, "%29s %d %d %29s %d %1s", chr, &start, &end, id, &score, strand) != 6) {
         continue; /* skip lines that are not BED6 */
      }
      printf ("%s\t%d\t%d\t%s\t%d\t%s\n", chr, start-200, end+200, id, score, strand);
   }

   free(line); /* getline allocates the buffer; release it when done */
   fclose(ifp);
   return 0;
}
The equivalent Perl script:
#!/bin/env perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.bed>\n";
my $infile = shift or die $usage;

open(IN,'<',$infile) || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   my @line = split(/\t/);
   $line[1] -= 200;
   $line[2] += 200;
   print join("\t", @line), "\n";
}
close(IN);

exit(0);
To test the speeds, I created a random BED file with 10 million entries using BEDTools.
#compile C program
gcc flank_bed.c -o flank_bed

#generate file with hg19 chromosome coordinates
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from hg19.chromInfo" > hg19.genome

#create random bed file
#with 10,000,000 lines
randomBed -g hg19.genome -n 10000000 > big.bed

#run the C program
flank_bed big.bed > big_c.bed

#run the Perl script
flank_bed.pl big.bed > big_perl.bed

#check for differences
diff big_c.bed big_perl.bed
Now let's compare the times:
#if I used just time without the full path
#-f is treated as the command and not the parameter
#-f %e = Elapsed real (wall clock) time used by the process, in seconds.
for i in {1..10}; do /usr/bin/time -f %e flank_bed big.bed > /dev/null; done 2>> c_time.txt
for i in {1..10}; do /usr/bin/time -f %e flank_bed.pl big.bed > /dev/null; done 2>> perl_time.txt
What are the results? Let's plot them using R:
c <- read.table("c_time.txt", header=F, stringsAsFactors=F)
perl <- read.table("perl_time.txt", header=F, stringsAsFactors=F)

df <- data.frame(lang=factor(c(rep('Perl',10),rep('C',10))),
                 time=c(perl$V1, c$V1))

library(reshape2)
df <- melt(df)

library(ggplot2)
qplot(lang, value, data=df, colour=lang, geom=c("boxplot"), fill=lang, main="Wall time difference", xlab="Language", ylab="Seconds")

aggregate(df$value, list(lang=df$lang), mean)
#  lang      x
#1    C  7.701
#2 Perl 20.743
Probably not the best choice of plot but you get the point.
For the simple task of reading in a BED file, performing some simple arithmetic to the genomic coordinates, and outputting the results, C was 2.7 times faster (even with the scrappy C program I came up with).
Conclusions
A colleague recently showed me this:
[Figure: chart of a programming language popularity index]
where the index is an indicator of the popularity of programming languages (see the full list).
He wanted to make the point that C has been around for a long time and remains one of the most popular programming languages. In fact, if you follow the full list link above, you can see that C has ranked either 1st or 2nd dating back to 1989.
So here I am finally getting started with C.
See also
Online tutorial for learning C: http://www.learn-c.org/en/Welcome

This work is licensed under a Creative Commons
Attribution 4.0 International License.
This is probably my favorite post you have made on your site. It’s insightful and I agree with many of the points.
I’m glad you liked it; I need to find some time to get back into learning C.
Very nice post.
You are right that the C code can be optimised, but the Perl split command can also go faster. Try this instead and be surprised by how much speed you gain:
# for BED-6 format
my($chrom, $chrom_start, $chrom_end, $name, $score, $strand) = split("\t", $_, 6);
I think it has something to do with the allocation of the array in Perl and with split being based on a regular expression.
Keep posting cool stuff, cheers!
Cool! I'll test it out and include it in an update to this post. Thanks!