Passing arguments from the command line in Perl

I used to do this for specifying the usage:


use strict;
use warnings;

my $usage = "$0: <infile.fa> <blah> <blah>\n";
my $a = shift or die $usage;
my $b = shift or die $usage;


However this became a problem when I needed to pass the number "0" as an argument. So I thought I'll improve the code via the Perl module Getopt::Std.


use strict;
use warnings;

use Getopt::Std;

my %opts = ();
getopts('f:b:g:e:h', \%opts);

if ($opts{'h'} || !keys %opts){

print "Your options are:\n";
foreach my $opts (keys %opts){
   print "$opts\t$opts{$opts}\n";


sub usage {
print STDERR <<EOF;
Usage: $0 -f file -b 10 -g temp.file -e 20

Where:   -f test.txt            input file
         -b kdjaksd             b for bananas
         -g kdjf                more description
         -e 3                   blah blah blah
         -h                     this helpful usage message



Depending on how your script works, you can set up conditional checks (e.g. unless exists $opt{'f'}) to see if essential arguments have been set or not.

The Poisson distribution

A Poisson distribution is the probability distribution that results from a Poisson experiment. A probability distribution assigns a probability to possible outcomes of a random experiment. A Poisson experiment has the following properties:

  1. The outcomes of the experiment can be classified as either successes or failures.
  2. The average number of successes that occurs in a specified region is known.
  3. The probability that a success will occur is proportional to the size of the region.
  4. The probability that a success will occur in an extremely small region is virtually zero.

Continue reading

Manual linear regression analysis using R

Updated 2014 June 24th

The aim of linear regression is to find the equation of the straight line that fits the data points the best; the best line is one that minimises the sum of squared residuals of the linear regression model. The equation of a straight line is:

y = mx + b

where m is the slope or gradient and b is the y-intercept.

Continue reading

GC and AT content of 5' UTRs, 3' UTRs and coding exons of RefSeq gene models

Firstly download bed tracks of the 5' UTR, 3' UTR and coding exons from the UCSC Table Browser. The RefSeq gene models are in the table called RefGene.

After you've saved the 3 bed files (e.g. mm9_refgene_090212_5_utr.bed, mm9_refgene_090212_3_utr.bed and mm9_refgene_090212_coding_exon.bed) use the fastaFromBed program from the BEDTools suite and convert the bed file into a fasta file:

fastaFromBed -fi mm9.fa -bed mm9_refgene_090212_5_utr.bed -fo mm9_refgene_090212_5_utr.fa
fastaFromBed -fi mm9.fa -bed mm9_refgene_090212_3_utr.bed -fo mm9_refgene_090212_3_utr.fa
fastaFromBed -fi mm9.fa -bed mm9_refgene_090212_coding_exon.bed -fo mm9_refgene_090212_coding_exon.fa

Then use this Perl script to calculate the nucleotide frequencies:


use strict;
use warnings;

my $usage = "Usage: $0 <fasta.fa>\n";
my $infile = shift or die $usage;

my $seq = '';

open (IN, $infile) or die "Can't open $infile\n";
while (<IN>){
   next if (/^$/);
   if (/^>(.*)$/){
   else {
      $seq .= $_;

my ($a,$c,$g,$t,$other) = parseSeq($seq);

my $total = $a + $c + $g + $t;
my $gc = sprintf("%.2f",($c + $g) * 100 / $total);
my $at = sprintf("%.2f",($a + $t) * 100 / $total);

print "Length = ", length($seq), "\n";
print "A: $a\n";
print "C: $c\n";
print "G: $g\n";
print "T: $t\n";
print "Other: $other\n";

print "GC: $gc\n";
print "AT: $at\n";

sub parseSeq {
   my ($seq) = @_;
   my $a = '0';
   my $c = '0';
   my $g = '0';
   my $t = '0';
   my $other = '0';
   for (my $i = 0; $i < length($seq); ++$i){
      my $base = substr($seq,$i,1);
      if ($base =~ /a/i){
      } elsif ($base =~ /c/i){
      } elsif ($base =~ /g/i){
      } elsif ($base =~ /t/i){
      } else {
   return ($a,$c,$g,$t,$other);

Running the Perl script on all the fasta files:

for file in `ls *.fa`; do echo $file; $file; done

Length = 32985433
A: 9238398
C: 7238453
G: 7238331
T: 9270249
Other: 2
GC: 43.89
AT: 56.11
Length = 7543736
A: 1654667
C: 2115587
G: 2124322
T: 1649160
Other: 0
GC: 56.20
AT: 43.80
Length = 44137007
A: 10628969
C: 11450211
G: 11459637
T: 10598190
Other: 0
GC: 51.91
AT: 48.09

In summary, the 5' UTR has the highest GC content (GC: 56.20 and AT: 43.80), the coding exons have no particular bias (GC: 51.91 and AT: 48.09) and the 3' UTR has the lowest GC content (GC: 43.89 and AT: 56.11) of the three categories.

DropBox offering up to 5GB worth of space

From the official DropBox forum:

During this beta period, we are also offering additional free space to test automatic uploading of photos and videos. For every 500MB of photos and videos automatically uploaded, you'll receive another 500MB space bonus, up to 5GB total. More information here.

Here's some R code I used to generate some random plots that I wanted to automatically upload to my DropBox folder as a test:

for (i in 1:10000){
   filename <- paste(i,'.jpg',sep="")
   #roughly 149 kb per file

To use the automatic upload function save the images onto a pen/thumb drive and the DropBox automatic upload function should be available when you plug your pen/thumb drive into your computer or on Windows right click on your drive and click autoplay.


Mainly sourced from Wikipedia but arranged as per my train of thought.

Histones are highly alkaline proteins found in eukaryotic cell nuclei that package and order the DNA into structural units called nucleosomes. They are the chief protein components of chromatin, acting as spools around which DNA winds, and play a role in gene regulation. Histone H3 is one of the core histone proteins (the others are H2A, H2B and H4) involved in the structure of chromatin in eukaryotic cells. H3 is involved with the structure of the nucleosomes of the 'beads on a string' structure and H3 is the most extensively modified of the five histones.

Nucleosomes are the basic unit of DNA packaging in eukaryotes, consisting of a segment of DNA wound around a histone protein core. This structure is often compared to thread wrapped around a spool.

Chromatin is the combination of DNA and proteins that make up the contents of the nucleus of a cell. The primary functions of chromatin are; to package DNA into a smaller volume to fit in the cell, to strengthen the DNA to allow mitosis and meiosis and prevent DNA damage, and to control gene expression and DNA replication.

The common nomenclature of histone modifications is:

The name of the histone (e.g. H3)
The single-letter amino acid abbreviation (e.g., K for Lysine) and the amino acid position in the protein
The type of modification (Me: methyl, P: phosphate, Ac: acetyl, Ub: ubiquitin)

So H3K27Ac denotes the acetylation of the 27th residue (a lysine) from the start (i.e. the N-terminal) of the H3 protein.

Histone acetyltransferases (HAT) are enzymes that acetylate conserved lysine amino acids on histone proteins by transferring an acetyl group from acetyl CoA. In general, histone acetylation is linked to transcriptional activation and associated with euchromatin. Histone modification levels and gene expression are well correlated; the levels of a single modification (H3K27ac) can be used to faithfully model gene expression (Karlic et al., 2010 PNAS).

Chemical modifications (e.g. methylation and acylation) to the histone proteins present in chromatin influence gene expression by changing how accessible the chromatin is to transcription. A specific modification of a specific histone protein is called a histone mark. This track shows the levels of enrichment of the H3K27Ac histone mark across the genome as determined by a ChIP-seq assay. The H3K27Ac histone mark is the acetylation of lysine 27 of the H3 histone protein, and it is thought to enhance transcription possibly by blocking the spread of the repressive histone mark H3K27Me3 (from the ENCODE track on the UCSC Genome Browser).

ChIP-sequencing, also known as ChIP-seq, is used to analyze protein interactions with DNA. ChIP-seq combines chromatin immunoprecipitation (ChIP) with massively parallel DNA sequencing to identify the binding sites of DNA-associated proteins. It can be used to precisely map global binding sites for any protein of interest. Previously, ChIP-on-chip was the most common technique utilized to study these protein–DNA relations.

The ChIP process enriches specific crosslinked DNA-protein complexes using an antibody against a protein of interest. It can be used to precisely map global binding sites for any protein of interest.

See also:

ChIP-seq: welcome to the new frontier

PCA and rggobi

#generate row names
rname <- c(1:10000)
#generate column names
cname <- c(rep("control",20),rep("test",20))
#initialise array with 10,000 rows and 40 columns
data <- array(0,dim=c(10000,40),dimnames=list(rname,cname))
for (i in 1:nrow(data)){
   x_mean <- sample(1:10000,1)
   x_sd <- sample(1:1000,1)
   y_mean <- sample(1:10000,1)
   y_sd <- sample(1:1000,1)
   x <- rnorm(20,mean=x_mean,sd=x_sd)
   y <- rnorm(20,mean=y_mean,sd=y_sd)
   for (j in 1:20){
      data[i,j] <- x[j]
   for (k in 1:20){
      k_1 <- 20 + k
      data[i,k_1] <- y[k]
data.pca <- prcomp(data)
x <- data.pca$rotation[,"PC1"]
y <- data.pca$rotation[,"PC2"]
xy <- data.frame(x,y)
g <- ggobi(xy)

I labelled only two samples since the text would overlap. Samples 1 to 20 are located near the 16 and samples 21 to 40 near the 37, as expected.