Updated 2014 September 19th to include a method that does not require sorting.

The *h*-index is an index that is calculated by measuring the number of articles, say , that has at least citations. If you published 10 articles, and each of them had four citations, your *h*-index would be four, since there are four articles with at least four citations. I'm not interested in what this number represents or measures but instead I'm interested in how one would calculate the *h*-index given an array of numbers.

Firstly, I wrote this post to illustrate how I approached the problem of coming up with a way to calculate the *h*-index. It is definitely *not* because I came up with something clever but I wanted to guide you through the frightening world of a self-taught programmer (if I can call myself a programmer, whatever that is). I started writing the code in Perl because that's what I'm most familiar with. This is what I came up with initially:

sub h_index_old { my ($exp) = @_; my $h_index = 1; my @sort_exp = sort {$a <=> $b} @$exp; OUTER: for (my $h = 2; $h<=scalar(@sort_exp); ++$h){ my $check = 0; INNER: for (my $i = 0; $i<scalar(@sort_exp); ++$i){ my $e = $sort_exp[$i]; next if $e == 0; if ($e >= $h){ ++$check; } if ($check >= $h){ $h_index = $h; last INNER; } if ($i == scalar(@sort_exp) - 1 && $check < $h){ last OUTER; } } } return($h_index); }

That frightening code should qualify as spaghetti code. When I wrote the above code, my thought process was that I needed two loops: one for checking the potential *h*-indices and another loop for going through each citation number and checking it against the current *h*-index. To make things a little more efficient, I sorted the array of numbers (at the cost of running a sort, which also takes time) so that I could exit the citation loop when the current *h*-index has been reached (the **last INNER** bit). I wrote another checkpoint to exit the *h*-index loop (**last OUTER**), when all the citation numbers have been checked against the *h*-index, and can not be increased. The two loops make the code inefficient and setting up several conditionals with the LABELS is confusing. The code worked and even though it was inefficient, it gave me the correct answer (an assumption) and **sometimes getting the right answer is more important than efficiency**. And if I were using this for an analysis, I would probably move on to focus on the results, and forgetting about the code altogether.

But because I'm trying to improve my skills, I thought about how I could solve this problem without using two loops. So I tried all kinds of strange things and play around with code for one or two hours (I can't remember as I was focused), until I came up with this:

sub h_index { my ($exp) = @_; #sort array my @sort_exp = sort {$a <=> $b} @$exp; my $length = scalar(@sort_exp); my $h_index = 0; for (my $i = $length - 1; $i >= 0; $i--){ my $e = $sort_exp[$i]; if ($e > $h_index){ $h_index++; } else { return($h_index); } } }

I thought if I go through the sorted citations in a reverse manner, i.e. from largest to smallest, I can check the largest number first and increment the *h*-index, and exit the loop once the current largest number is equal to or less than the current *h*-index. This approach eliminates the need for two loops and the loop is exited when the largest number doesn't increase the *h*-index. There is definitely an even faster way, and perhaps one day I can come up with it.

### Conclusions

Until I had to work with millions to billions (and sometimes trillions) of lines of text, I didn't care much about efficiency as long as I got the right answer in a reasonable amount of time (one or two days of computing is reasonable). It's gotten to a point where I'm learning C to do things faster than using Perl. I think the bottom line is that until something becomes necessary, old habits, such as writing code that's not optimised, die hard. And it takes a special interest and more effort to becoming better at something, such as programming.

### No sorting

I got a comment from Robert to use hash tables instead of sorting. I decided to use arrays instead of hashes but without sorting:

sub h_index_no_sort { my ($cit) = @_; my @h = (); for (my $i=0; $i < scalar(@$cit); ++$i){ my $c = $cit->[$i]; if ($c > scalar(@$cit)){ $h[scalar(@$cit)]++; } else { $h[$c]++; } } my $sum = 0; for (my $i=scalar(@$cit); $i>=0; $i--){ if (exists $h[$i]){ $sum += $h[$i]; } if ($sum >= $i){ return($i); } } }

In the above code, an array (**@h**) is used to keep a tally of the citations. However, the citation count is levelled by the size of the array (**scalar(@$cit)**), since having 10 articles with 100 citations each, still means the *h*-index is 10. The citations is the array index and the value is the number of times that citation occurs. The next block of code, goes through potential *h*-indices from largest to smallest, and creates a tally (**$sum**). Once the tally is greater than or equal to the current iterating *h*-index, this is the *h*-index and is returned.

### Comparison of the methods

I put together all the subroutines in this script:

#!/usr/bin/env perl use strict; use warnings; my $usage = "Usage: $0 <old|new|no_sort> <infile.tsv>\n"; my $method = shift or die $usage; my $infile = shift or die $usage; if ($method !~ /^old$|^new$|^no_sort$/){ die "Please use either old, new or no_sort as a method\n"; } open(IN,'<',$infile) || die "Could not open $infile: $!\n"; while(<IN>){ chomp; my @s = split(/\t/); my $h_index = ''; if ($method eq 'old'){ $h_index = h_index_old(\@s); } elsif ($method eq 'new'){ $h_index = h_index_new(\@s); } elsif ($method eq 'no_sort'){ $h_index = h_index_no_sort(\@s); } print "$.\t$h_index\n"; } sub h_index_no_sort { my ($cit) = @_; my @h = (); for (my $i=0; $i < scalar(@$cit); ++$i){ my $c = $cit->[$i]; if ($c > scalar(@$cit)){ $h[scalar(@$cit)]++; } else { $h[$c]++; } } my $sum = 0; for (my $i=scalar(@$cit); $i>=0; $i--){ if (exists $h[$i]){ $sum += $h[$i]; } if ($sum >= $i){ return($i); } } } sub h_index_old { my ($exp) = @_; my $h_index = 1; my @sort_exp = sort {$a <=> $b} @$exp; OUTER: for (my $h = 2; $h<=scalar(@sort_exp); ++$h){ my $check = 0; INNER: for (my $i = 0; $i<scalar(@sort_exp); ++$i){ my $e = $sort_exp[$i]; next if $e == 0; if ($e >= $h){ ++$check; } if ($check >= $h){ $h_index = $h; last INNER; } if ($i == scalar(@sort_exp) - 1 && $check < $h){ last OUTER; } } } return($h_index); } sub h_index_new { my ($exp) = @_; #sort array my @sort_exp = sort {$a <=> $b} @$exp; my $length = scalar(@sort_exp); my $h_index = 0; for (my $i = $length - 1; $i >= 0; $i--){ my $e = $sort_exp[$i]; if ($e > $h_index){ $h_index++; } else { return($h_index); } } return($length); }

Let's generate a tabular file with some random citation numbers in R:

data <- matrix(data=round(runif(n=10000000, max=1000, min=0)), ncol=10) write.table(x=data, file="test.tsv", col.names=F, row.names=F, sep="\t")

Now let's test the speeds of this file with 1,000,000 lines of citations on my MacBook Air:

#check out the file head -5 test.tsv 559 137 85 452 506 646 503 198 216 944 250 13 515 710 675 613 131 169 647 746 742 133 550 526 558 611 965 843 960 943 972 876 765 16 24 278 656 832 976 934 108 220 685 789 79 843 827 197 334 665 #test the no_sort method time h_index.pl no_sort test.tsv > no_sort.txt real 0m6.984s user 0m6.937s sys 0m0.029s #test the new sort method time h_index.pl new test.tsv > new_sort.txt real 0m9.200s user 0m9.143s sys 0m0.038s #test the old method time h_index.pl old test.tsv > old_sort.txt real 0m29.908s user 0m29.789s sys 0m0.074s for file in `ls *.txt`; do md5 $file; done MD5 (new_sort.txt) = d249ab46047936c2de5c2ceb09371e1a MD5 (no_sort.txt) = d249ab46047936c2de5c2ceb09371e1a MD5 (old_sort.txt) = d249ab46047936c2de5c2ceb09371e1a

This work is licensed under a Creative Commons

Attribution 4.0 International License.

Hi,

thanks for the posts. This one has a billion different solutions. A couple of thoughts as you move forward in your programming.

1. You should look up hash tables in perl. This would make it so you didn't have to do the sort, you simply store the occurrences of the items and then go through the items and output those that occurred more than x-times.

2. If this is in a tabular text file, you might consider looking at awk. A simple 1-liner might be something like this awk '{count[$2]++}end{for(j in count) if(count[j]>10){print j" "count[j] }}' file.csv This is pretty much how it reads, iterate a counter on column 2 of the input file .csv storing the counter as a array indexed on the text in field 2 (hash table), after done, scroll through the array and output the items >10 counts.

Great blog,

thanks,

Bob

Hi Robert,

thank you for the pointers! I'll definitely implement your suggestions sometime, compare the speeds, and update the post.

Cheers,

Dave

Hi Robert,

I wrote another method that doesn't require sorting and as expected it is faster. For the new implementation, I figured that using an array for storing the occurrences was better than using a hash.

Cheers,

Dave

Awesome. You might try converting it to a hash for experimentation. The advantage of a hash is it does not require the index to be a number so the code will generalize to more uses and still be fast. While you are at it, you might also give the awk code a try. I am not sure how your data is organized exactly, is the first number in a line supposed to be the citation number and the rest are articles citing that one? If so, try awk '{count[$1]=(NF-1)}end{for(j in count) if(count[j]>10){print j" "count[j] }}' file.tsv . If instead, each number is a citation in an article, try awk '{for (i=1; i10){print j" "count[j] }}' file.tsv I would need to try the second one to make sure it interprets the $i to be the actual citation number in the appropriate field.

Happy programming.

I've recently proposed a novel index for evaluation of individual researchers that does not depend on the number of publications, accounts for different co-author contributions and age of publications, and scales from 0.0 to 9.9 (http://f1000research.com/articles/4-884). Moreover, it can be calculated with the help of freely available software. Please, share your thoughts on it. Would you use it along with the h-index, or maybe even instead of it, for evaluating your peers, potential collaborators or job applicants? If you've tried it on the people you know, do you find the results fair?

Hi Aleksey,

thanks for the paper. I'm not a fan of the h-index (or the citation index), so I'll definitely check out what you propose.

Cheers,

Dave

So, have you managed to try it? What do you think?