Getting started with C

I learned Perl as my first language because it was the language of choice in the first lab I joined. Over the years I've heard many criticisms of it: that Perl code looks ugly, and that its motto, "There's more than one way to do it", allows too much flexibility. I particularly like this description of Perl:

Perl would be Voodoo - An incomprehensible series of arcane incantations that involve the blood of goats and permanently corrupt your soul. Often used when your boss requires you to do an urgent task at 21:00 on friday night.

I never really thought Perl was ugly, since it was the only language I knew and beauty is always relative. What made me want to learn a lower-level programming language was speed, but I never got around to it. After much delay I have finally decided to learn C, and I've even created a GitHub repository for getting started with C. The task I have always wanted to do was to write a simple C program and compare its speed with a Perl script that does the same job. So I wrote a C program (flank_bed.c) that reads a BED6 file, subtracts 200 bp from each start coordinate, adds 200 bp to each end coordinate, and prints the result. I wrote the C program just by googling stuff, so it's probably sub-optimal at best:

#include <stdio.h>
#include <stdlib.h>

int main( int argc, char *argv[] )  
{

  if(argc != 2){
    fprintf(stderr, "Please input one file name\n");
    exit(1);
  }

  FILE *ifp;
  char *mode = "r";
  char chr [30];
  int start;
  int end;
  char id [30];
  int score;
  char strand [2];

  char *my_file_name = argv[1];

  ifp = fopen(my_file_name, mode);

  if (ifp == NULL) {
    fprintf(stderr, "Could not open input file %s!\n", my_file_name);
    exit(1);
  }

  /* http://stackoverflow.com/questions/3501338/c-read-file-line-by-line */
  char * line = NULL;
  size_t len = 0;
  ssize_t read;

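  while ((read = getline(&line, &len, ifp)) != -1) {
    /* printf("Retrieved line of length %zd :\n", read); */
    /* field widths stop long names from overflowing chr, id, and strand;
       lines that do not parse as BED6 are skipped */
    if (sscanf (line, "%29s %d %d %29s %d %1s", chr, &start, &end, id, &score, strand) != 6) {
      continue;
    }
    printf ("%s\t%d\t%d\t%s\t%d\t%s\n", chr, start-200, end+200, id, score, strand);
  }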

  free(line);  /* getline() allocated this buffer */
  fclose(ifp);

  return 0;
}
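
One thing the program above does not guard against is a start coordinate smaller than 200: subtracting the flank then gives a negative start, which is not a valid BED coordinate. Here is a minimal sketch of one way to handle that, assuming that clamping at 0 is the behaviour you want. The names flank_interval and flank_bed6_line are hypothetical helpers for illustration and are not part of flank_bed.c, and the end coordinate is not checked against the chromosome size.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not part of flank_bed.c): widen an interval by
   `flank` bp on each side, clamping the start at 0. */
void flank_interval(int *start, int *end, int flank)
{
  assert(start != NULL && end != NULL);
  *start = (*start > flank) ? *start - flank : 0;
  *end += flank;
}

/* Hypothetical helper: parse one BED6 line, flank it, and write the
   result into out. Returns 1 on success and 0 if the line does not
   parse as six fields. */
int flank_bed6_line(const char *line, int flank, char *out, size_t out_size)
{
  char chr[30], id[30], strand[2];
  int start, end, score;

  /* field widths match the buffer sizes (one byte is kept for '\0') */
  if (sscanf(line, "%29s %d %d %29s %d %1s",
             chr, &start, &end, id, &score, strand) != 6) {
    return 0;
  }

  flank_interval(&start, &end, flank);
  snprintf(out, out_size, "%s\t%d\t%d\t%s\t%d\t%s\n",
           chr, start, end, id, score, strand);
  return 1;
}
```

For example, passing the line "chr1 100 500 id1 0 +" (tab-separated) with a flank of 200 gives a start of 0 and an end of 700. The Perl script would need the same clamp for the two outputs to stay identical.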

The equivalent Perl script:

#!/usr/bin/env perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.bed>\n";
my $infile = shift or die $usage;

open(IN,'<',$infile) || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   my @line = split(/\t/);
   $line[1] -= 200;
   $line[2] += 200;
   print join("\t", @line), "\n";
}
close(IN);

exit(0);

To test the speeds, I created a random BED file with 10 million entries using BEDTools.

#compile C program
gcc flank_bed.c -o flank_bed

#generate file with hg19 chromosome coordinates
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from hg19.chromInfo"  > hg19.genome

#create random bed file
#with 10,000,000 lines
randomBed -g hg19.genome -n 10000000 > big.bed

#run the C program
flank_bed big.bed > big_c.bed

#run the Perl script
flank_bed.pl big.bed > big_perl.bed

#check for differences
diff big_c.bed big_perl.bed

Now let's compare the times:

#use the full path to time: with the shell's built-in time,
#-f is treated as the command to run and not as an option
#-f %e = Elapsed real (wall clock) time used by the process, in seconds.
for i in {1..10}; do /usr/bin/time -f %e flank_bed big.bed > /dev/null; done 2>> c_time.txt
for i in {1..10}; do /usr/bin/time -f %e flank_bed.pl big.bed > /dev/null; done 2>> perl_time.txt
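
/usr/bin/time measures the whole process, including start-up. If you wanted to time just one part of a C program from the inside, a minimal sketch could use the POSIX clock_gettime function with CLOCK_MONOTONIC. The helper names elapsed_seconds and time_call are hypothetical, and this is not how the timings in this post were collected.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: difference between two timespec values in seconds. */
double elapsed_seconds(struct timespec start, struct timespec stop)
{
  return (double)(stop.tv_sec - start.tv_sec)
       + (double)(stop.tv_nsec - start.tv_nsec) / 1e9;
}

/* Hypothetical helper: run fn(arg) once and return the wall-clock
   seconds it took, using a clock that is not affected by changes to
   the system time. */
double time_call(void (*fn)(void *), void *arg)
{
  struct timespec start, stop;

  assert(fn != NULL);
  clock_gettime(CLOCK_MONOTONIC, &start);
  fn(arg);
  clock_gettime(CLOCK_MONOTONIC, &stop);
  return elapsed_seconds(start, stop);
}
```

Wrapping the read loop of a program in a function and passing it to time_call would give a number comparable to the %e value, minus process start-up and exit.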

What are the results? Let's plot them using R:

c <- read.table("c_time.txt", header=F, stringsAsFactors=F)
perl <- read.table("perl_time.txt", header=F, stringsAsFactors=F)
df <- data.frame(lang=factor(c(rep('Perl',10),rep('C',10))), time=c(perl$V1, c$V1))

library(reshape2)
df <- melt(df)

library(ggplot2)
qplot(lang, value, data=df, colour=lang, geom=c("boxplot"), 
      fill=lang, main="Wall time difference",
      xlab="Language", ylab="Seconds")

aggregate(df$value, list(lang=df$lang), mean)
#  lang      x
#1    C  7.701
#2 Perl 20.743

[Figure: wall_time_difference]

Probably not the best choice of plot but you get the point.

For the simple task of reading in a BED file, performing some simple arithmetic on the genomic coordinates, and outputting the results, C was on average 2.7 times faster (7.7 seconds versus 20.7 seconds), even with the scrappy C program I came up with.
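
The 2.7 figure is just the ratio of the two mean wall times reported by aggregate. As a small sketch of that arithmetic in C (mean_seconds and speedup are hypothetical helper names, not part of flank_bed.c):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: arithmetic mean of n timings in seconds
   (returns 0.0 for an empty array). */
double mean_seconds(const double *times, size_t n)
{
  double sum = 0.0;
  size_t i;

  assert(times != NULL || n == 0);
  if (n == 0) {
    return 0.0;
  }
  for (i = 0; i < n; i++) {
    sum += times[i];
  }
  return sum / (double)n;
}

/* Hypothetical helper: how many times faster the fast run is. */
double speedup(double slow_seconds, double fast_seconds)
{
  assert(fast_seconds > 0.0);
  return slow_seconds / fast_seconds;
}
```

With the means above, speedup(20.743, 7.701) comes to roughly 2.69, which rounds to 2.7.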

Conclusions

A colleague recently showed me this:

[Image: screenshot from 2014-04-29 00:28:53]

where the index is an indicator of the popularity of programming languages (see the full list).

He wanted to make the point that C has been around for a long time and remains one of the most popular programming languages. Actually, if you click on the full list link above, you can see that C has ranked either 1st or 2nd as far back as 1989.

So here I am finally getting started with C.

See also

Online tutorial for learning C: http://www.learn-c.org/en/Welcome




This work is licensed under a Creative Commons Attribution 4.0 International License.

Comments
  1. This is probably my favorite post you have made on your site. It’s insightful and I agree with many of the points.

  2. Very nice post.
    You are right that the C code can be optimised, but the Perl split command can also go faster. Try this instead and be surprised how much speed you gain:

    # for BED-6 format
    my($chrom, $chrom_start, $chrom_end, $name, $score, $strand) = split("\t", $_, 6);

    I think it has something to do with the allocation of the array in Perl and with splitting on a regular expression.
    Keep posting cool stuff, cheers!
