Calculating the h-index

Updated 2014 September 19th to include a method that does not require sorting.

The h-index is an index that is calculated by measuring the number of articles, say n, that has at least n citations. If you published 10 articles, and each of them had four citations, your h-index would be four, since there are four articles with at least four citations. I'm not interested in what this number represents or measures but instead I'm interested in how one would calculate the h-index given an array of numbers.

Continue reading

A transpose tool

Updated 2014 September 19th to compare different transpose tools

I wrote a simple transpose tool, using Perl, for taking in tabular data and outputting a transposed version of the data. The primary motivation for writing this was because when viewing files with a lot of columns on the command-line, it becomes hard to match the columns to the column header. Most times, I just want to see a couple of typical values for each column, so that I can figure out how I could parse it. Here's the script I wrote:

Continue reading

Getting started with C

I learned Perl as my first language as it was the language of choice in the first lab I joined. Over the years I've heard many criticisms, such as Perl code looks ugly and its motto "There's more than one way to do it" allows too much flexibility. I particularly like this description of Perl:

Perl would be Voodoo - An incomprehensible series of arcane incantations that involve the blood of goats and permanently corrupt your soul. Often used when your boss requires you to do an urgent task at 21:00 on friday night.

Continue reading

Saving disk space with Perl

Disk space is cheaper these days but here's one way of using less disk space by working directly with gzipped files. Here's a very straight forward example of Perl code that opens a gzipped file and outputs a gzipped file.


#!/usr/bin/perl

use strict;
use warnings;

my $infile = 'test.txt.gz';
#the three argument open is the preferred way
open(IN,'-|',"gunzip -c $infile") || die "Could not open $infile: $!\n";

my $outfile = 'test.out.gz';
open(OUT,'|-',"gzip >$outfile") || die "Could not gzip $outfile: $!\n";

while(<IN>){
   chomp;
   print OUT "$_\n";
}
close(IN);
close(OUT);

exit(0);

__END__

And here's some other code that just counts the number of lines in a file, when gzipped and when it is not gzipped.

#!/usr/bin/perl

use strict;
use warnings;

my $infile = 'big_whoop';
open(IN,'<',$infile) || die "Could not open $infile: $!\n";

#my $infile = 'big_whoop.gz';
#open(IN,'-|',"gunzip -c $infile") || die "Could not open $infile: $!\n";

my $line_count = '0';
while(<IN>){
   chomp;
   ++$line_count;
}
close(IN);

print "$line_count\n";

exit(0);

__END__

Using time, the difference between working with a gzipped and not gzipped file when counting ~7.3 million lines:

Gzipped result:

7320248

real 0m3.725s
user 0m5.313s
sys 0m0.844s

Not gzipped:

7320248

real 0m2.481s
user 0m2.151s
sys 0m0.328s

Equivalents in R, Python and Perl

Last update 2015 September 9th

I've been using Perl heavily for several years until I started my PhD back in 2010 (I still use it for many tasks but much more sparingly). Perl was widely used back in the early days when the human genome was yet to be sequenced and this famous article explained some of the reasons. However, the general trend is that Perl is on the decline, while Python on the rise. There are many reasons for this trend and at least for the biologists/bioinformaticians who program in Python that I have spoken to, they all claim that syntactically, Perl is ugly when compared to Python. Perl was what I learned first when entering bioinformatics, so I got used to looking at Perl code. However, given this trend, I will need to learn some Python to be able to read the growing number of scripts/applications written in Python. Surely it's not necessary, but I have a penchant for looking into how things are done.

Continue reading

Passing arguments from the command line in Perl

I used to do this for specifying the usage:

#!/usr/bin/perl

use strict;
use warnings;

my $usage = "$0: <infile.fa> <blah> <blah>\n";
my $a = shift or die $usage;
my $b = shift or die $usage;

#etc.

However this became a problem when I needed to pass the number "0" as an argument. So I thought I'll improve the code via the Perl module Getopt::Std.


#!/usr/bin/perl

use strict;
use warnings;

use Getopt::Std;

my %opts = ();
getopts('f:b:g:e:h', \%opts);

if ($opts{'h'} || !keys %opts){
   usage();
}

print "Your options are:\n";
foreach my $opts (keys %opts){
   print "$opts\t$opts{$opts}\n";
}

exit(0);

sub usage {
print STDERR <<EOF;
Usage: $0 -f file -b 10 -g temp.file -e 20

Where:   -f test.txt            input file
         -b kdjaksd             b for bananas
         -g kdjf                more description
         -e 3                   blah blah blah
         -h                     this helpful usage message

EOF
exit;
}

__END__

Depending on how your script works, you can set up conditional checks (e.g. unless exists $opt{'f'}) to see if essential arguments have been set or not.

Using bins when comparing genomic features

Comparing two files containing genomic features is a common task e.g. finding out whether the coordinates of your tags intersect with genes. Of course you could use intersectBed (as part of the BEDTools suite) for this purpose but here's how to do it anyway using Perl. NOTE: I hard code the length of my tags in this script as 27 bp.

The key to binning is this line of code:

my $which_bin = int($start / $bin); #where $bin = 100,000

Any genomic feature in the bed file that is less than 100,000 will be in bin 0. Anything in 100,000 to 199,999 is in bin 1 and so on. These bins are then used as hash keys.

This assumes a non-redundant bed file but can be changed by removing the last CHECK line of code.


#!/usr/bin/perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.bam> <infile.bed>\n";
my $infile = shift or die $usage;
my $bed = shift or die $usage;

my %bed = ();
my $counter = '0';
my $bin = 100_000;

warn "Processing $bed\n";

open(IN,'<',$bed) || die "Could not open $bed: $!\n";
while(<IN>){
   chomp;
   ++$counter;
   #chr1    134938   134939   name  0       -
   my ($chr,$start,$end,$name,$score,$strand) = split(/\t/);
   my $which_bin = int($start / $bin);
   $bed{$chr}{$strand}{$which_bin}{$counter}->{'START'} = $start;
   $bed{$chr}{$strand}{$which_bin}{$counter}->{'END'} = $end;
}
close(IN);

warn "Finished storing $bed\n";

my $bam_entry = '0';
open(IN,'-|',"samtools view $infile") || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   my ($qname,$flag,$rname,$pos,$mapq,$cigar,$mrnm,$mpos,$isize,$seq,$qual,$tag,$vtype,$value) = split;
   next if $flag == '4';
   my $strand = '+';
   my $start = $pos;
   if ($flag == '16'){
      $strand = '-';
      $start = $pos + '27';
   }
   ++$bam_entry;
   my $which_bin = int($start/$bin);
   CHECK: foreach my $entry (sort {$a <=> $b} keys %{$bed{$rname}{$strand}{$which_bin}}){
      my $s_start = $bed{$rname}{$strand}{$which_bin}{$entry}->{'START'};
      my $s_end = $bed{$rname}{$strand}{$which_bin}{$entry}->{'END'};
      #print "$entry\t$rname\t$s_start\t$s_end\n";
      if ($start >= $s_start && $start <= $s_end){
         print "$_\n";
         last CHECK;
      }
   }
   warn "Processed $bam_entry bam entries\n";
}
close(IN);

exit(0);