Perl

From Dave's wiki
Jump to navigation Jump to search

Perl has quite a bad reputation in bioinformatics and has become the butt of the joke. I guess it's because a lot of people without a computing background (e.g. people with a biology background such as myself) started using it and wrote sub-optimal (and probably erroneous) code that left a bad taste with people who were more computationally abled. However, learning a new language from scratch requires a large investment of time and Perl is very useful for my work, so I still use it.

Introduction

For a general introduction to Perl checkout perlintro

perldoc perlintro

and for what to read checkout perltoc

perldoc perltoc

Also check out https://learnxinyminutes.com/docs/perl/

Sorting

Perl uses a sort-definition subroutine or sort subroutine for short to specify what order you want to sort by. Perl already knows how to sort a list of items; it merely doesn’t know which order you want so the sort-definition subroutine simply tells it the order. The sort subroutine is almost defined like an ordinary subroutine. This routine will be called repeatedly, each time checking on a pair of elements from the list you’re sorting. When the sort subroutine starts running, $a and $b are two elements from the original list.

The subroutine returns a coded value describing how the elements compare. If $a should appear before $b in the final list, the sort subroutine returns -1 to say so. If $b should appear before $a, it returns 1. If the order of $a and $b doesn't matter, the subroutine returns 0; this can occur with a numeric sort and the two numbers are equal.

You could write a numeric sort subroutine as follows:

sub by_number {
# a sort subroutine, expect $a and $b
   if ($a < $b) { -1 } elsif ($a > $b) { 1 } else { 0 }
}

To use the sort subroutine, just put its name between sort and the list.

my @result = sort by_number @some_numbers;

Perl has a convenient shortcut for the (three-way comparison) used in by_number called the spaceship operator (<=>). This operator compares two numbers and returns -1, 0, or 1 as needed to sort them numerically. The by_number sort subroutine could be written as:

sub by_number { $a <=> $b }

The three-way comparison for strings is cmp.

sub by_code_point { $a cmp $b }
my @strings = sort by_code_point @any_strings;

The sort above is actually the default sort; the following code does the same thing:

my @strings = sort @any_strings;

cmp can be used to build a case insensitive sort:

sub case_insensitive { "\F$a" cmp "\F$b" }

The string from $a (case-folded) is compared against the string from $b (case-folded), giving a case-insensitive sort order.

Typically the sort subroutine is provided "inline":

my @numbers = sort { $a <=> $b } @some_numbers;

To carry out a numeric sort in descending order, swap $a and $b around:

my @descending = sort { $b <=> $a } @some_numbers;

The comparison operators (<=> and cmp) are nearsighted meaning that they can not see which operand is $a and $b but only which value is on the left and which is on the right.

Sorting by multiple keys

my @winners = sort by_score_and_name keys %score;
sub by_score_and_name {
   $score{$b} <=> $score{$a} # by descending numeric score
      or
   $a cmp $b # code point order by name
}

If the spaceship sees two different scores, that is the comparison that is used since it returns either a -1 or 1, both true values and the low precedence short-circuit or will be skipped (the short-circuit or returns the last expression evaluated). But if the scores are identical, it returns 0, a false value, and thus the cmp operator is used and returns an appropriate ordering value considering the keys as strings.

It is possible to specify more than two levels of sorting. The example below sorts according to the amount of each patron's outstanding fines, then by the number of items they currently have checked out, their name (in order by family name, then by personal name, both from hashes), and finally by the patron's ID number, in case everything else is the same:

@patron_IDs = sort {
   &fines($b) <=> &fines($a) or
   $items{$b} <=> $items{$a} or
   $family_name{$a} cmp $family_name{$b} or
   $personal_name{$a} cmp $personal_name{$b} or
   $a <=> $b
} @patron_IDs;

Sorting a hash of hashes by two values https://perlmaven.com/how-to-sort-a-hash-of-hashes-by-value.

Hashes

Sort keys of hashes using the values.

foreach my $key (sort { $hash{$a} <=> $hash{$b} } keys %planets) {
    print join("\t", $key, $hash{$key}), "\n";
}

Perl references

Read perlreftut

perldoc perlreftut

Map

Think of the R functions map and apply; map in Perl does the same thing of applying a function to a list of items.

From the documentation: Evaluates the BLOCK or EXPR for each element of LIST (locally setting $_ to each element) and returns the list value composed of the results of each such evaluation. In scalar context, returns the total number of elements so generated. Evaluates BLOCK or EXPR in list context, so each element of LIST may produce zero, one, or more elements in the returned value.

my @numbers = 97..122;
my @chars = map(chr, @numbers);
print "@chars\n";

Create a new list.

my @data = (4.75, 1.5, 2, 1234, 6.9456, 12345678.9, 29.95);
my @formatted_data = map { big_money($_) } @data;

Map hash

# https://stackoverflow.com/questions/8412530/join-keys-and-values-in-perl
my %hash = (4 => "model1", 2 => "model2");
my $str = join(", ", map { "$_ X $hash{$_}" } keys %hash);
print $str;

Scripts

Script to parse a csv file, which sorts a row and stores the sorted index

#!/bin/env perl
use strict;
use warnings;
use Text::CSV;
my $infile = 'all_correlation.csv';
my $csv = Text::CSV->new();
my @sample_name = ();
open(IN,'<',$infile) || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   if ($csv->parse($_)) {
      my @column = $csv->fields();
      if ($. == 1){
         shift(@column);
         @sample_name = @column;
      } else {
         my $sample = shift(@column);
         next unless $sample =~ /osteosarcoma/i;
         my @list_order = sort { $column[$b] <=> $column[$a] } 0 .. $#column;
         #print "$list_order[0] -> $column[$list_order[0]]\n";
         #print "$sample -> $sample_name[$list_order[0]]\n";
         for my $i ( 0 .. $#list_order ) {
            print "$sample_name[$list_order[$i]]\t$column[$list_order[$i]]\n";
         }
      }
      print "---\n";
   } else {
      my $err = $csv->error_input;
      die "Failed to parse line: $err";
   }
}
close(IN);
exit(0);

Others

Use http://perldoc.perl.org/Data/Dumper.html

#!/bin/env perl

use strict;
use warnings;
use Data::Dumper;

my $first_name = 'Dave';
my $last_name = 'Tang';
my @name = ();
push(@name, [$first_name, $last_name]);

print "@name\n\n";

print "My name is ", $name[0][0], ' ', $name[0][1], "\n\n";

print Dumper(@name);

Useful packages

Links to useful Perl packages.

Parallel::ForkManager

Devel::NYTProf

Powerful fast feature-rich Perl source code profiler for benchmarking Perl code.

http://search.cpan.org/~timb/Devel-NYTProf-5.06/lib/Devel/NYTProf.pm

Getopt::Std

Module for getting parameters from the command line

use warnings;
use strict;
use Getopt::Std;

#my options
my %opts = ();
getopts('f:s:g:e:h', \%opts);

#if missing parameters show usage
if ($opts{'h'} || !exists $opts{'f'} || !exists $opts{'s'} || !exists $opts{'g'} || !exists $opts{'e'}){
   usage();
}

#store parameters
my $infile = $opts{'f'};
my $spacer = $opts{'s'};
my $edit_threshold = $opts{'e'};
my $genome = $opts{'g'};

#usage as a subroutine
sub usage {
print STDERR <<EOF;
Usage: $0 -f file -s spacer -g genome.fa -e edit distance

Where:   -f infile.bam          input sam or bam file
         -s TATA                nucleotide spacer sequence
         -g hg19.fa             fasta file containing genome sequence
         -e 1                   the number of edits at which to remove reads
         -h                     this helpful usage message

EOF
exit();
}

Tips

Best way to skip headers: http://stackoverflow.com/questions/14393295/best-way-to-skip-a-header-when-reading-in-from-a-text-file-in-perl

The do block

The following:

my $bowler;
if( ...some condition... ) {
   $bowler = 'Mary Ann';
} elsif( ... some condition ... ) {
   $bowler = 'Ginger';
} else {
   $bowler = 'The Professor';
}

can be rewritten using the do block:

my $bowler = do {
   if( ... some condition ... ) { 'Mary Ann' }
   elsif( ... some condition ... ) { 'Ginger' }
   else { 'The Professor' }
};

List::Util

Search whether a list contains an item that matches some condition and you want to stop once you find the first matching element. Use the first subroutine from List::Util:

use List::Util qw(first);
my $first_match = first { /\bPebbles\b/i } @characters;

Totalling.

use List::Util qw(sum);
my $total = sum( 1..1000 );

Max.

use List::Util qw(max);
my $max = max( 3, 5, 10, 4, 6 );

Shuffle.

use List::Util qw(shuffle);
my @shuffled = shuffle(1..1000);

List::MoreUtils

List::MoreUtils has more subroutines than List::Util but needs to be installed via CPAN.

If you need to combine two or more lists, you can use mesh to create the large list that interweaves all of the elements, even if the small arrays are not the same length:

use List::MoreUtils qw(mesh);
my @abc = 'a' .. 'z';
my @numbers = 1 .. 20;
my @dinosaurs = qw( dino );
my @large_array = mesh @abc, @numbers, @dinosaurs;

But mesh will populate @large_array with undef when it runs out of elements from one of its input arrays.

Use grep

Here you select only the lines mentioning fred from a file:

my @matching_lines = grep { /\bfred\b/i } <$fh>;

Catch errors

Use eval to catch errors; if an error occurs, it returns undef and the error message is stored in the $@ special variable ($@ will be empty if there was no error). You can use the defined-or operator (//) to set a default value, such as NaN.

use v5.10;
my $barney = eval { $fred / $dino } // 'NaN';
print "I couldn't divide by \$dino: $@" if $@;

Subroutines

When passing variables to subroutines, pass it as a reference, because passing a variable into a subroutine involves making a copy of the data into the subroutine's variables. By passing as a reference, you avoid making a copy and will use less memory.

One caveat is that if the subroutine modifies the the referenced data, the effect will in place even when the subroutine has exited.

Installing modules

In the past, I've used Perl on servers in which I don't have the admin access, and had to use different ways of installing modules.

By far the easiest way is to use ActivePerl; download it at http://www.activestate.com/activeperl/downloads. Once downloaded just run the shell script, install.sh, and follow the instructions.

To install Perl modules using ActiveState, use ppm.

ppm install Config::General

Use https://metacpan.org/pod/App::cpanminus

curl -L https://cpanmin.us | perl - --sudo App::cpanminus
cpanm Pod::PseudoPod::HTML