Calculating the h-index

Updated 19 September 2014 to include a method that does not require sorting.

The h-index is the largest number n such that n of your articles have at least n citations each. If you published 10 articles and each of them had four citations, your h-index would be four, since four articles each have at least four citations. I'm not interested in what this number represents or measures; rather, I'm interested in how one would calculate the h-index given an array of citation counts.

I wrote this post to illustrate how I approached the problem of coming up with a way to calculate the h-index. It is definitely not because I came up with something clever, but because I wanted to guide you through the frightening world of a self-taught programmer (if I can call myself a programmer, whatever that is). I started writing the code in Perl because that's what I'm most familiar with. This is what I came up with initially:

sub h_index_old {
   my ($exp) = @_;
   my $h_index = 1;
   my @sort_exp = sort {$a <=> $b} @$exp;
   OUTER: for (my $h = 2; $h<=scalar(@sort_exp); ++$h){
      my $check = 0;
      INNER: for (my $i = 0; $i<scalar(@sort_exp); ++$i){
         my $e = $sort_exp[$i];
         next if $e == 0;
         if ($e >= $h){
            ++$check;
         }
         if ($check >= $h){
            $h_index = $h;
            last INNER;
         }
         if ($i == scalar(@sort_exp) - 1 && $check < $h){
            last OUTER;
         }
      }
   }
   return($h_index);
}

That frightening code should qualify as spaghetti code. When I wrote it, my thought process was that I needed two loops: one for checking the potential h-indices, and another for going through each citation number and checking it against the current candidate. To make things a little more efficient, I sorted the array of numbers (at the cost of running a sort, which also takes time) so that I could exit the citation loop as soon as the current candidate was reached (the last INNER bit). I added another checkpoint to exit the h-index loop (last OUTER) once all the citation numbers had been checked and the h-index could not be increased further. The two loops make the code inefficient, and setting up several conditionals with labels is confusing. Still, the code worked and, inefficient as it was, it gave me the correct answer (an assumption), and sometimes getting the right answer is more important than efficiency. If I were using this for an analysis, I would probably have moved on to focus on the results and forgotten about the code altogether.

But because I'm trying to improve my skills, I thought about how I could solve this problem without using two loops. I tried all kinds of strange things and played around with the code for an hour or two (I can't remember exactly, as I was focused), until I came up with this:

sub h_index {
   my ($exp) = @_;
   #sort array
   my @sort_exp = sort {$a <=> $b} @$exp;
   my $length = scalar(@sort_exp);
   my $h_index = 0;
   for (my $i = $length - 1; $i >= 0; $i--){
      my $e = $sort_exp[$i];
      if ($e > $h_index){
         $h_index++;
      } else {
         return($h_index);
      }
   }
   #reached only if every article out-cites its rank
   return($length);
}

I figured that if I went through the sorted citations in reverse, i.e. from largest to smallest, I could check the largest number first and increment the h-index, then exit the loop once the current number is less than or equal to the running h-index. This approach eliminates the need for two loops, and the single loop exits as soon as the largest remaining number can no longer increase the h-index. There is surely an even faster way, and perhaps one day I'll come up with it.
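As a quick sanity check, here is a self-contained copy of the subroutine with a couple of made-up examples (the citation counts below are arbitrary, not from the benchmark file); note the trailing return, which covers the case where every article out-cites its rank:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

#walk the sorted citations from largest to smallest,
#incrementing the h-index while the citation count exceeds it
sub h_index {
   my ($exp) = @_;
   my @sort_exp = sort {$a <=> $b} @$exp;
   my $h_index = 0;
   for (my $i = $#sort_exp; $i >= 0; $i--){
      if ($sort_exp[$i] > $h_index){
         $h_index++;
      } else {
         return($h_index);
      }
   }
   #reached only if every article out-cites its rank
   return(scalar(@sort_exp));
}

print h_index([ (4) x 10 ]), "\n";       #10 articles, 4 citations each: prints 4
print h_index([ 10, 8, 5, 1, 0 ]), "\n"; #three articles with >= 3 citations: prints 3
```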

Conclusions

Until I had to work with millions to billions (and sometimes trillions) of lines of text, I didn't care much about efficiency as long as I got the right answer in a reasonable amount of time (one or two days of computing is reasonable). It's gotten to the point where I'm learning C to do things faster than I can with Perl. I think the bottom line is that until something becomes necessary, old habits, such as writing unoptimised code, die hard. And it takes a special interest and extra effort to become better at something, such as programming.

No sorting

I got a comment from Robert suggesting that I use hash tables instead of sorting. I decided to use arrays rather than hashes, but without sorting:

sub h_index_no_sort {
   my ($cit) = @_;
   my @h = ();
   for (my $i=0; $i < scalar(@$cit); ++$i){
      my $c = $cit->[$i];
      if ($c > scalar(@$cit)){
         $h[scalar(@$cit)]++;
      } else {
         $h[$c]++;
      }
   }
   my $sum = 0;
   for (my $i=scalar(@$cit); $i>=0; $i--){
      if (exists $h[$i]){
         $sum += $h[$i];
      }
      if ($sum >= $i){
         return($i);
      }
   }
}

In the above code, an array (@h) is used to keep a tally of the citations. The citation count is capped at the size of the input array (scalar(@$cit)), since having 10 articles with 100 citations each still means the h-index is 10. The citation count serves as the array index, and the value is the number of times that count occurs. The second loop then goes through the potential h-indices from largest to smallest, accumulating a running tally ($sum) of articles with at least that many citations. As soon as the tally is greater than or equal to the current candidate, that candidate is the h-index and is returned.
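To make the tallying concrete, here is a self-contained copy with a small made-up example: five articles cited 6, 4, 4, 1 and 0 times, where three articles have at least three citations, so the h-index is 3.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

#tally citation counts into an array, capping each count at the
#number of articles, then scan the tallies from high to low
sub h_index_no_sort {
   my ($cit) = @_;
   my $n = scalar(@$cit);
   my @h = ();
   foreach my $c (@$cit){
      #a count above $n is binned at $n, since n articles
      #can never yield an h-index above n
      $h[$c > $n ? $n : $c]++;
   }
   my $sum = 0;
   for (my $i = $n; $i >= 0; $i--){
      $sum += $h[$i] if defined $h[$i];
      return($i) if $sum >= $i;
   }
}

print h_index_no_sort([6, 4, 4, 1, 0]), "\n"; #prints 3
```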

Comparison of the methods

I put together all the subroutines in this script:

#!/usr/bin/env perl

use strict;
use warnings;

my $usage = "Usage: $0 <old|new|no_sort> <infile.tsv>\n";
my $method = shift or die $usage;
my $infile = shift or die $usage;

if ($method !~ /^old$|^new$|^no_sort$/){
   die "Please use either old, new or no_sort as a method\n";
}

open(IN,'<',$infile) || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   my @s = split(/\t/);
   my $h_index = '';
   if ($method eq 'old'){
      $h_index = h_index_old(\@s);
   } elsif ($method eq 'new'){
      $h_index = h_index_new(\@s);
   } elsif ($method eq 'no_sort'){
      $h_index = h_index_no_sort(\@s);
   }
   print "$.\t$h_index\n";
}

sub h_index_no_sort {
   my ($cit) = @_;
   my @h = ();
   for (my $i=0; $i < scalar(@$cit); ++$i){
      my $c = $cit->[$i];
      if ($c > scalar(@$cit)){
         $h[scalar(@$cit)]++;
      } else {
         $h[$c]++;
      }
   }
   my $sum = 0;
   for (my $i=scalar(@$cit); $i>=0; $i--){
      if (exists $h[$i]){
         $sum += $h[$i];
      }
      if ($sum >= $i){
         return($i);
      }
   }
}

sub h_index_old {
   my ($exp) = @_;
   my $h_index = 1;
   my @sort_exp = sort {$a <=> $b} @$exp;
   OUTER: for (my $h = 2; $h<=scalar(@sort_exp); ++$h){
      my $check = 0;
      INNER: for (my $i = 0; $i<scalar(@sort_exp); ++$i){
         my $e = $sort_exp[$i];
         next if $e == 0;
         if ($e >= $h){
            ++$check;
         }
         if ($check >= $h){
            $h_index = $h;
            last INNER;
         }
         if ($i == scalar(@sort_exp) - 1 && $check < $h){
            last OUTER;
         }
      }
   }
   return($h_index);
}

sub h_index_new {
   my ($exp) = @_;
   #sort array
   my @sort_exp = sort {$a <=> $b} @$exp;
   my $length = scalar(@sort_exp);
   my $h_index = 0;
   for (my $i = $length - 1; $i >= 0; $i--){
      my $e = $sort_exp[$i];
      if ($e > $h_index){
         $h_index++;
      } else {
         return($h_index);
      }
   }
   return($length);
}

Let's generate a tabular file with some random citation numbers in R:

data <- matrix(data=round(runif(n=10000000, max=1000, min=0)), ncol=10)
write.table(x=data, file="test.tsv", col.names=F, row.names=F, sep="\t")

Now let's test the speeds of this file with 1,000,000 lines of citations on my MacBook Air:

#check out the file
head -5 test.tsv 
559     137     85      452     506     646     503     198     216     944
250     13      515     710     675     613     131     169     647     746
742     133     550     526     558     611     965     843     960     943
972     876     765     16      24      278     656     832     976     934
108     220     685     789     79      843     827     197     334     665

#test the no_sort method
time h_index.pl no_sort test.tsv > no_sort.txt

real    0m6.984s
user    0m6.937s
sys     0m0.029s

#test the new sort method
time h_index.pl new test.tsv > new_sort.txt

real    0m9.200s
user    0m9.143s
sys     0m0.038s

#test the old method
time h_index.pl old test.tsv > old_sort.txt

real    0m29.908s
user    0m29.789s
sys     0m0.074s

for file in `ls *.txt`; do md5 $file; done
MD5 (new_sort.txt) = d249ab46047936c2de5c2ceb09371e1a
MD5 (no_sort.txt) = d249ab46047936c2de5c2ceb09371e1a
MD5 (old_sort.txt) = d249ab46047936c2de5c2ceb09371e1a



This work is licensed under a Creative Commons Attribution 4.0 International License.
Comments
  1. Hi,
    thanks for the posts. This one has a billion different solutions. A couple of thoughts as you move forward in your programming.
    1. You should look up hash tables in perl. This would make it so you didn't have to do the sort, you simply store the occurrences of the items and then go through the items and output those that occurred more than x-times.
    2. If this is in a tabular text file, you might consider looking at awk. A simple one-liner might be something like this: awk '{count[$2]++} END{for(j in count) if(count[j]>10){print j" "count[j]}}' file.csv This is pretty much how it reads: iterate a counter on column 2 of the input file, storing the counter in an array indexed on the text in field 2 (a hash table); after it's done, scroll through the array and output the items with >10 counts.

    Great blog,
    thanks,
    Bob

    1. Hi Robert,

      thank you for the pointers! I'll definitely implement your suggestions sometime, compare the speeds, and update the post.

      Cheers,

      Dave

    2. Hi Robert,

      I wrote another method that doesn't require sorting and as expected it is faster. For the new implementation, I figured that using an array for storing the occurrences was better than using a hash.

      Cheers,

      Dave

  2. Awesome. You might try converting it to a hash for experimentation. The advantage of a hash is that it does not require the index to be a number, so the code will generalise to more uses and still be fast. While you are at it, you might also give the awk code a try. I am not sure how your data is organised exactly; is the first number in a line supposed to be the citation number and the rest are articles citing that one? If so, try awk '{count[$1]=(NF-1)} END{for(j in count) if(count[j]>10){print j" "count[j]}}' file.tsv If instead each number is a citation count for an article, try awk '{for (i=1; i<=NF; i++){count[$i]++}} END{for(j in count) if(count[j]>10){print j" "count[j]}}' file.tsv I would need to try the second one to make sure it interprets the $i to be the actual citation number in the appropriate field.

    Happy programming.

  3. I've recently proposed a novel index for evaluation of individual researchers that does not depend on the number of publications, accounts for different co-author contributions and age of publications, and scales from 0.0 to 9.9 (http://f1000research.com/articles/4-884). Moreover, it can be calculated with the help of freely available software. Please, share your thoughts on it. Would you use it along with the h-index, or maybe even instead of it, for evaluating your peers, potential collaborators or job applicants? If you've tried it on the people you know, do you find the results fair?

    1. Hi Aleksey,

      thanks for the paper. I'm not a fan of the h-index (or the citation index), so I'll definitely check out what you propose.

      Cheers,

      Dave
