Passing arguments from the command line in Perl

I used to do this for specifying the usage:

#!/usr/bin/perl

use strict;
use warnings;

my $usage = "$0: <infile.fa> <blah> <blah>\n";
my $a = shift or die $usage;
my $b = shift or die $usage;

#etc.

However this became a problem when I needed to pass the number “0″ as an argument. So I thought I’ll improve the code via the Perl module Getopt::Std.


#!/usr/bin/perl

use strict;
use warnings;

use Getopt::Std;

my %opts = ();
getopts('f:b:g:e:h', \%opts);

if ($opts{'h'} || !keys %opts){
   usage();
}

print "Your options are:\n";
foreach my $opts (keys %opts){
   print "$opts\t$opts{$opts}\n";
}

exit(0);

sub usage {
print STDERR <<EOF;
Usage: $0 -f file -b 10 -g temp.file -e 20

Where:   -f test.txt            input file
         -b kdjaksd             b for bananas
         -g kdjf                more description
         -e 3                   blah blah blah
         -h                     this helpful usage message

EOF
exit;
}

__END__

Depending on how your script works, you can set up conditional checks (e.g. unless exists $opt{‘f’}) to see if essential arguments have been set or not.

Posted in programming | Tagged | Leave a comment

The Poisson distribution

The Poisson distribution


op<-par(mfrow=c(1,2))
hist(rpois(10000,10))
hist(rnorm(10000,mean=10))


p <- rpois(10000,10)
n <- rnorm(10000,mean=10)
#note that the variance and mean are close to lambda
var(p)
#[1] 9.785996
mean(p)
#[1] 9.9233
#normal distribution
var(n)
#[1] 0.9777767
mean(n)
#[1] 10.00674

#the standard deviation of a Poisson distribution should be sqrt(lambda)
sd(p)
#[1] 3.128258
sqrt(10)
#[1] 3.162278

#how many data points are one, two and three sd away
p_sd <- sd(p)
table(p < mean(p) - p_sd | p > mean(p) + p_sd)

#FALSE  TRUE
# 7355  2645
table(p < mean(p) - 2*p_sd | p > mean(p) + 2*p_sd)

#FALSE  TRUE
# 9622   378
table(p < mean(p) - 3*p_sd | p > mean(p) + 3*p_sd)

#FALSE  TRUE
# 9962    38

#compare this with the Normal distribution, which follows the The 68-95-99.7% Rule
> n_sd <- sd(n)
> table(n <mean(n) - n_sd | n > mean(n) +n_sd)

#FALSE  TRUE
# 6849  3151
> table(n <mean(n) - 2*n_sd | n > mean(n) +2*n_sd)

#FALSE  TRUE
# 9543   457
> table(n <mean(n) - 3*n_sd | n > mean(n) +3*n_sd)

#FALSE  TRUE
# 9976    24

#kurtosis
install.packages("moments")
library(moments)
#Intuitively, the kurtosis is a measure of the peakedness of the data distribution.
#Negative kurtosis would indicates a flat distribution
#Positive kurtosis would indicates a peaked distribution
#should be near the inverse of lambda
kurtosis(p) - 3
#[1] 0.1208631
kurtosis(n) - 3
#[1] -0.06176341

#to illustrate kurtosis
flat <- runif(n=100,min=80,max=100)
flat
#  [1] 83.37960 94.57424 96.51311 92.08249 94.61121 92.76638 94.39542 88.27215
#  [9] 83.45642 85.86747 83.47850 83.64775 93.25635 89.62777 83.81349 98.97906
# [17] 87.77092 82.99616 86.15581 90.00653 98.94037 80.90453 91.16177 87.32763
# [25] 94.39605 86.94838 85.33424 83.13871 90.90948 84.66674 96.88543 85.93574
# [33] 96.30937 93.95475 91.75024 85.80579 92.71691 93.81835 91.85252 98.39650
# [41] 80.83516 96.42987 91.40050 87.62511 89.01398 93.64511 93.04411 97.10447
# [49] 95.36469 84.68530 96.63656 95.54381 90.22114 96.57856 98.74975 87.92109
# [57] 95.69652 91.92067 82.19441 89.12797 95.79111 86.24502 93.31333 81.85892
# [65] 91.05611 84.59561 83.72790 91.37922 91.72667 89.93868 84.91768 94.51121
# [73] 97.35298 93.62817 82.55570 98.28182 95.29541 83.29980 80.77201 89.98626
# [81] 89.20221 94.86378 96.45546 86.71370 91.02340 82.61356 90.93799 98.27153
# [89] 80.44516 92.74910 89.62782 91.43687 91.65159 97.68580 95.21252 80.46632
# [97] 84.03176 96.59315 95.60750 87.62108

kurtosis(flat) - 3
#[1] -1.125288
Posted in R, Statistics | Tagged , | Leave a comment

Manual linear regression analysis using R

I read a nice example in the “Statistics For Dummies” book on linear regression. The example data was the number of cricket (the insect) chirps vs. temperature. Here I redo the analysis using R. Firstly the five summaries you require for calculating the best fitting line are:

1. The mean of x
2. The mean of y
3. The sd of x
4. The sd of y
5. The r between x and y


x <- c(18,20,21,23,27,30,34,39)
y <- c(57,60,64,65,68,71,74,77)

> mean(x)
#[1] 26.5
> mean(y)
#[1] 67
> sd(x)
#[1] 7.387248
> sd(y)
#[1] 6.845228
> cor(x,y)
#[1] 0.9774791

#formula for the slope m
#m = r(sd(y) / sd(x))
> cor(x,y)*(sd(y)/sd(x))
#[1] 0.9057592

#formula for intercept
#b = mean(y) - (m * mean(x))
m <- cor(x,y)*(sd(y)/sd(x))
mean(y) - (m * mean(x))
#[1] 42.99738

#now the using the lm() function in R
#the response variable is listed first
#y is the response variable

lm(y~x)

#Call:
#lm(formula = y ~ x)
#
#Coefficients:
#(Intercept)            x
#    42.9974       0.9058

#the above results are the same
#in the form y = mx + b
#y = 0.9058x + 42.9974

#and finally the plot
plot(x,y)
res <- lm(y~x)
abline(res)

By performing the step by step calculations, I found I have a better interpretation of the results from a linear regression analysis, which was the point.

Posted in R, Statistics | Tagged , | Leave a comment

GC and AT content of 5′ UTRs, 3′ UTRs and coding exons of RefSeq gene models

Firstly download bed tracks of the 5′ UTR, 3′ UTR and coding exons from the UCSC Table Browser. The RefSeq gene models are in the table called RefGene.

After you’ve saved the 3 bed files (e.g. mm9_refgene_090212_5_utr.bed, mm9_refgene_090212_3_utr.bed and mm9_refgene_090212_coding_exon.bed) use the fastaFromBed program from the BEDTools suite and convert the bed file into a fasta file:

fastaFromBed -fi mm9.fa -bed mm9_refgene_090212_5_utr.bed -fo mm9_refgene_090212_5_utr.fa
fastaFromBed -fi mm9.fa -bed mm9_refgene_090212_3_utr.bed -fo mm9_refgene_090212_3_utr.fa
fastaFromBed -fi mm9.fa -bed mm9_refgene_090212_coding_exon.bed -fo mm9_refgene_090212_coding_exon.fa

Then use this Perl script to calculate the nucleotide frequencies:


#!/usr/bin/perl

use strict;
use warnings;

my $usage = "Usage: $0 <fasta.fa>\n";
my $infile = shift or die $usage;

my $seq = '';

open (IN, $infile) or die "Can't open $infile\n";
while (<IN>){
   chomp;
   next if (/^$/);
   if (/^>(.*)$/){
      next;
   }
   else {
      $seq .= $_;
   }
}
close(IN);

my ($a,$c,$g,$t,$other) = parseSeq($seq);

my $total = $a + $c + $g + $t;
my $gc = sprintf("%.2f",($c + $g) * 100 / $total);
my $at = sprintf("%.2f",($a + $t) * 100 / $total);

print "Length = ", length($seq), "\n";
print "A: $a\n";
print "C: $c\n";
print "G: $g\n";
print "T: $t\n";
print "Other: $other\n";

print "GC: $gc\n";
print "AT: $at\n";

sub parseSeq {
   my ($seq) = @_;
   my $a = '0';
   my $c = '0';
   my $g = '0';
   my $t = '0';
   my $other = '0';
   for (my $i = 0; $i < length($seq); ++$i){
      my $base = substr($seq,$i,1);
      if ($base =~ /a/i){
         ++$a;
      } elsif ($base =~ /c/i){
         ++$c;
      } elsif ($base =~ /g/i){
         ++$g;
      } elsif ($base =~ /t/i){
         ++$t;
      } else {
         ++$other;
      }
   }
   return ($a,$c,$g,$t,$other);
}

Running the Perl script on all the fasta files:

for file in `ls *.fa`; do echo $file; tally_nuc.pl $file; done

mm9_refgene_090212_3_utr.fa
Length = 32985433
A: 9238398
C: 7238453
G: 7238331
T: 9270249
Other: 2
GC: 43.89
AT: 56.11
mm9_refgene_090212_5_utr.fa
Length = 7543736
A: 1654667
C: 2115587
G: 2124322
T: 1649160
Other: 0
GC: 56.20
AT: 43.80
mm9_refgene_090212_coding_exon.fa
Length = 44137007
A: 10628969
C: 11450211
G: 11459637
T: 10598190
Other: 0
GC: 51.91
AT: 48.09

In summary, the 5′ UTR has the highest GC content (GC: 56.20 and AT: 43.80), the coding exons have no particular bias (GC: 51.91 and AT: 48.09) and the 3′ UTR has the lowest GC content (GC: 43.89 and AT: 56.11) of the three categories.

Posted in biology | Tagged , | Leave a comment

DropBox offering up to 5GB worth of space

From the official DropBox forum:

During this beta period, we are also offering additional free space to test automatic uploading of photos and videos. For every 500MB of photos and videos automatically uploaded, you’ll receive another 500MB space bonus, up to 5GB total. More information here.

Here’s some R code I used to generate some random plots that I wanted to automatically upload to my DropBox folder as a test:

for (i in 1:10000){
   filename <- paste(i,'.jpg',sep="")
   #roughly 149 kb per file
   jpeg(file=filename,width=2000,height=2000,type="cairo")
   plot(rnorm(1000))
   dev.off()
}

To use the automatic upload function save the images onto a pen/thumb drive and the DropBox automatic upload function should be available when you plug your pen/thumb drive into your computer or on Windows right click on your drive and click autoplay.

Posted in /etc, fun | Tagged | Leave a comment

H3K27Ac

Mainly sourced from Wikipedia but arranged as per my train of thought.

Histones are highly alkaline proteins found in eukaryotic cell nuclei that package and order the DNA into structural units called nucleosomes. They are the chief protein components of chromatin, acting as spools around which DNA winds, and play a role in gene regulation. Histone H3 is one of the core histone proteins (the others are H2A, H2B and H4) involved in the structure of chromatin in eukaryotic cells. H3 is involved with the structure of the nucleosomes of the ‘beads on a string’ structure and H3 is the most extensively modified of the five histones.

Nucleosomes are the basic unit of DNA packaging in eukaryotes, consisting of a segment of DNA wound around a histone protein core. This structure is often compared to thread wrapped around a spool.

Chromatin is the combination of DNA and proteins that make up the contents of the nucleus of a cell. The primary functions of chromatin are; to package DNA into a smaller volume to fit in the cell, to strengthen the DNA to allow mitosis and meiosis and prevent DNA damage, and to control gene expression and DNA replication.

The common nomenclature of histone modifications is:

The name of the histone (e.g. H3)
The single-letter amino acid abbreviation (e.g., K for Lysine) and the amino acid position in the protein
The type of modification (Me: methyl, P: phosphate, Ac: acetyl, Ub: ubiquitin)

So H3K27Ac denotes the acetylation of the 27th residue (a lysine) from the start (i.e. the N-terminal) of the H3 protein.

Histone acetyltransferases (HAT) are enzymes that acetylate conserved lysine amino acids on histone proteins by transferring an acetyl group from acetyl CoA. In general, histone acetylation is linked to transcriptional activation and associated with euchromatin. Histone modification levels and gene expression are well correlated; the levels of a single modification (H3K27ac) can be used to faithfully model gene expression (Karlic et al., 2010 PNAS).

Chemical modifications (e.g. methylation and acylation) to the histone proteins present in chromatin influence gene expression by changing how accessible the chromatin is to transcription. A specific modification of a specific histone protein is called a histone mark. This track shows the levels of enrichment of the H3K27Ac histone mark across the genome as determined by a ChIP-seq assay. The H3K27Ac histone mark is the acetylation of lysine 27 of the H3 histone protein, and it is thought to enhance transcription possibly by blocking the spread of the repressive histone mark H3K27Me3 (from the ENCODE track on the UCSC Genome Browser).

ChIP-sequencing, also known as ChIP-seq, is used to analyze protein interactions with DNA. ChIP-seq combines chromatin immunoprecipitation (ChIP) with massively parallel DNA sequencing to identify the binding sites of DNA-associated proteins. It can be used to precisely map global binding sites for any protein of interest. Previously, ChIP-on-chip was the most common technique utilized to study these protein–DNA relations.

The ChIP process enriches specific crosslinked DNA-protein complexes using an antibody against a protein of interest. It can be used to precisely map global binding sites for any protein of interest.

See also:

ChIP-seq: welcome to the new frontier

Posted in biology | Tagged | Leave a comment

PCA and rggobi

#generate row names
rname <- c(1:10000)
#generate column names
cname <- c(rep("control",20),rep("test",20))
#initialise array with 10,000 rows and 40 columns
data <- array(0,dim=c(10000,40),dimnames=list(rname,cname))
for (i in 1:nrow(data)){
   x_mean <- sample(1:10000,1)
   x_sd <- sample(1:1000,1)
   y_mean <- sample(1:10000,1)
   y_sd <- sample(1:1000,1)
   x <- rnorm(20,mean=x_mean,sd=x_sd)
   y <- rnorm(20,mean=y_mean,sd=y_sd)
   for (j in 1:20){
      data[i,j] <- x[j]
   }
   for (k in 1:20){
      k_1 <- 20 + k
      data[i,k_1] <- y[k]
   }
}
data.pca <- prcomp(data)
x <- data.pca$rotation[,"PC1"]
y <- data.pca$rotation[,"PC2"]
xy <- data.frame(x,y)
library(rggobi)
g <- ggobi(xy)


I labelled only two samples since the text would overlap. Samples 1 to 20 are located near the 16 and samples 21 to 40 near the 37, as expected.

Posted in programming, R | Tagged , , | Leave a comment

Step by step principal components analysis using R

I’ve always wondered about the calculations behind a principal components analysis (PCA). An extremely useful tutorial explains the key concepts and runs through the analysis. Here I use R to calculate each step of a PCA in hopes of better understanding the analysis.

x <- c(2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2, 1, 1.5, 1.1)
y <- c(2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9)
mean(x)
#[1] 1.81
mean(y)
#[1] 1.91
x_less_mean <- x - mean(x)
x_less_mean
# [1]  0.69 -1.31  0.39  0.09  1.29  0.49  0.19 -0.81 -0.31 -0.71
y_less_mean <- y - mean(y)
y_less_mean
# [1]  0.49 -1.21  0.99  0.29  1.09  0.79 -0.31 -0.81 -0.31 -1.01
cov(x,y)
#[1] 0.6154444
cov(x,x)
#[1] 0.6165556
cov(y,y)
#[1] 0.7165556
covar_matrix <- matrix(c(cov(x,x),cov(x,y),cov(y,x),cov(y,y)),nrow=2,ncol=2,byrow=TRUE,dimnames=list(c("x","y"),c("x","y")))
covar_matrix
#          x         y
#x 0.6165556 0.6154444
#y 0.6154444 0.7165556
eigen(covar_matrix)
#$values
#[1] 1.2840277 0.0490834
#
#$vectors
#          [,1]       [,2]
#[1,] 0.6778734 -0.7351787
#[2,] 0.7351787  0.6778734
e <- eigen(covar_matrix)
#to get the first principal components
#multiple the first eigenvector to mean transformed x, y values
for (i in 1:length(x_less_mean)){
   pc1 <- (x_less_mean[i] * e$vectors[1,1]) + (y_less_mean[i] * e$vectors[2,1])
   print(pc1)
}
#[1] 0.8279702
#[1] -1.77758
#[1] 0.9921975
#[1] 0.2742104
#[1] 1.675801
#[1] 0.9129491
#[1] -0.09910944
#[1] -1.144572
#[1] -0.4380461
#[1] -1.223821
for (i in 1:length(x_less_mean)){
   pc2 <- (x_less_mean[i] * e$vectors[1,2]) + (y_less_mean[i] * e$vectors[2,2])
   print(pc2)
}
#[1] -0.1751153
#[1] 0.1428572
#[1] 0.384375
#[1] 0.1304172
#[1] -0.2094985
#[1] 0.1752824
#[1] -0.3498247
#[1] 0.04641726
#[1] 0.01776463
#[1] -0.1626753
z <- array(0, dim=c(10,2))
for (i in 1:length(x_less_mean)){
   pc1 <- (x_less_mean[i] * e$vectors[1,1]) + (y_less_mean[i] * e$vectors[2,1])
   pc2 <- (x_less_mean[i] * e$vectors[1,2]) + (y_less_mean[i] * e$vectors[2,2])
   z[i,1] <- pc1
   z[i,2] <- pc2
}
z
#             [,1]        [,2]
# [1,]  0.82797019 -0.17511531
# [2,] -1.77758033  0.14285723
# [3,]  0.99219749  0.38437499
# [4,]  0.27421042  0.13041721
# [5,]  1.67580142 -0.20949846
# [6,]  0.91294910  0.17528244
# [7,] -0.09910944 -0.34982470
# [8,] -1.14457216  0.04641726
# [9,] -0.43804614  0.01776463
#[10,] -1.22382056 -0.16267529
#now do this using the inbuilt prcomp() function
data <- data.frame(x,y)
data.pca <- prcomp(data)
data.pca
#Standard deviations:
#[1] 1.1331495 0.2215477
#
#Rotation:
#         PC1        PC2
#x -0.6778734  0.7351787
#y -0.7351787 -0.6778734
data.pca$x
#              PC1         PC2
# [1,] -0.82797019  0.17511531
# [2,]  1.77758033 -0.14285723
# [3,] -0.99219749 -0.38437499
# [4,] -0.27421042 -0.13041721
# [5,] -1.67580142  0.20949846
# [6,] -0.91294910 -0.17528244
# [7,]  0.09910944  0.34982470
# [8,]  1.14457216 -0.04641726
# [9,]  0.43804614 -0.01776463
#[10,]  1.22382056  0.16267529
op<-par(mfrow=c(1,2))
plot(z)
plot(data.pca$x)

I’m still trying to figure out why the signs are inverted between calculating the principal components manually vs. using prcomp(). I guess the relationships don’t change but was curious as to why this was the case. However in the tutorial listed at the start of this post, first principal components are identical however the second principal components are inverted.

Posted in R, Statistics | Tagged , , | Leave a comment

Creating a correlation matrix with R

I calculate the correlations (Spearman’s rank) row by row using R, creating a correlation matrix.

#create random matrix with numbers ranging from 1 to 100
data <- matrix(runif(100,1,100),nrow=10,ncol=10,byrow=T)
#loop through all possible pairwise rows
for (i in 1:(nrow(data)-1)){
   for (j in (i+1):nrow(data)){
      print(c(i, j))
   }
}
#calculating the Spearman correlation between row 1 and row 2
cor(as.vector(as.matrix(data[1,])),as.vector(as.matrix(data[2,])),method="spearman")
#initialise a matrix for storing the correlations
final_matrix <- matrix(rep("0",nrow(data)*nrow(data)),ncol=nrow(data),nrow=nrow(data))
#calculate all correlations and store in final_matrix
for (i in 1:(nrow(data)-1)){
   final_matrix[i,i] <- 1
   for (j in (i+1):nrow(data)){
      final_matrix[i,j] <- cor(as.vector(as.matrix(data[i,])),as.vector(as.matrix(data[j,])),method="spearman")
      final_matrix[j,i] <- final_matrix[i,j]
   }
}
#save the matrix
library(MASS)
write.matrix(final_matrix,file="some.file")
quit()

I have a comma delimited file with 10006 lines containing expression values for 40 samples. The first column is the identifier and the last two columns are the pre-calculated means and variances. For this many rows, there would be (10005 -1)(10005 – 1) + (10005 – 1) / 2, which is 50,045,010 calculations. Let’s see how long this takes on one core of a Intel(R) Xeon(R) CPU X7560 @ 2.27GHz.

table <- read.table("level2_rename_mean_5.csv",sep=",",header=T,row.names=1)
dim(table)
[1] 10005    42
data <- as.matrix(table[,1:40])
final_matrix <- matrix(rep("0",nrow(data)*nrow(data)),ncol=nrow(data),nrow=nrow(data))
dim(final_matrix)
#[1] 10005 10005
system.time(
   for (i in 1:(nrow(data)-1)){
      final_matrix[i,i] <- 1
      for (j in (i+1):nrow(data)){
         final_matrix[i,j] <- cor(as.vector(as.matrix(data[i,])),as.vector(as.matrix(data[j,])),method="spearman")
         final_matrix[j,i] <- final_matrix[i,j]
      }
   }
)
#      user     system    elapsed
#100781.438     24.017 120825.286
#~33 hours
library(MASS)
write.matrix(final_matrix,file="level2_rename_mean_5.matrix")
#cat level2_rename_mean_5.matrix | wc
#  10005 100100025 2202200550

To calculate how many comparisons with respect to the number of rows

#write a function to calculate the number of comparisons
number_of_comparison <- function(n) {
   (((n-1)^2) + (n-1)) / 2
}
for (i in 1:10){ print(c(i,number_of_comparison(i))) }
#[1] 1 0
#[1] 2 1
#[1] 3 3
#[1] 4 6
#[1]  5 10
#[1]  6 15
#[1]  7 21
#[1]  8 28
#[1]  9 36
#[1] 10 45
x <- array(0, dim=c(10000,1))
for (i in 1:10000){ x[i,1]<- number_of_comparison(i) }
plot(x)
tail(x)
#             [,1]
#[9995,]  49945015
#[9996,]  49955010
#[9997,]  49965006
#[9998,]  49975003
#[9999,]  49985001
#[10000,] 49995000

So now what can you do with the level2_rename_mean_5.matrix file? My original intention was to create a network based on the correlations. Here’s one way of achieving this:

#!/usr/bin/perl

use strict;
use warnings;

my $threshold = '0.9';

my $infile = 'level2_rename_mean_5.matrix';
my $csv = 'level2_rename_mean_5.csv';

my @name = ();
open(IN,'<',$csv) || die "Could not open $csv: $!\n";
while(<IN>){
   chomp;
   next if (/raw/);
   my ($id,@rest) = split(/,/);
   $id =~ s/"//g;
   push(@name,$id);
}
close(IN);

my %sif = ();
my $counter = '0';
open(IN,'<',$infile) || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   my $current = $. - 1;
   my $counter += $.;
   my @cor = split();
   for (my $i=$counter; $i<scalar(@cor); ++$i){
      if ($cor[$i] >= $threshold){
         #print join("\t",$name[$current],$name[$i],$current,$i,$cor[$i]),"\n";
         print join("\t",$name[$current],'xx',$name[$i]),"\n";
      }
   }
}
close(IN);

exit(0);

The Perl script outputs a sif file which can then be loaded into Cytoscape.

And just for reference’s sake, here’s how to calculate the correlations of each column:

final_matrix <- matrix(rep("1",ncol(data)*ncol(data)),ncol=ncol(data),nrow=ncol(data))
#calculate all correlations and store in final_matrix
for (i in 1:(ncol(data)-1)){
   for (j in (i+1):ncol(data)){
      final_matrix[i,j] <- cor(as.vector(as.matrix(data[,i])),as.vector(as.matrix(data[,j])),method="spearman")
      final_matrix[j,i] <- final_matrix[i,j]
   }
}
#save the matrix
library(MASS)
write.matrix(final_matrix,file="some_file.tsv")
quit()
Posted in programming, R | Tagged , , | Leave a comment

Using R to obtain basic statistics on your dataset

Most of the data I work with are represented as tables i.e. in rows and columns. R makes it easy to store and process such data. Here are just some R functions that calculate some basic, but nevertheless useful, statistics.

#load your data
data <- read.table("level2.tsv",header=TRUE,row.names=1)
#dimensions of your dataset
dim(data)
#[1] 2265057      40
dimensions <- dim(data)
#calculate the mean of each row and add an extra column called mean
data$mean <- apply(data,1,function(row) mean(as.vector(row[1:dimensions[2]])))
#calculate the variance of each row and add an extra column called variance
data$variance <- apply(data,1,function(row) var(as.vector(row[1:dimensions[2]])))
#check to see if there are two extra columns
dim(data)
#[1] 2265057      42
#check the quantiles for the mean and variance
quantile(data$mean)
#         0%         25%         50%         75%        100%
#      0.025       0.025       0.025       0.050 2535068.625
quantile(data$variance)
#          0%          25%          50%          75%         100%
#2.500000e-02 2.500000e-02 2.500000e-02 4.871795e-02 9.692558e+11
#check to see how many rows have a mean greater than 10
table(data$mean > 10)
#
#  FALSE    TRUE
#2258419    6638
#calculate correlation between variance and mean
> cor(data$variance,data$mean,method="pearson")
#[1] 0.99374
> cor(data$variance,data$mean,method="spearman")
#[1] 0.9918228
#store all the rows that have a mean > 5
use <- (data$mean > 5)
data_subset <- data[use,]
dim(data_subset)
#[1] 10005    42
write.csv(data_subset,"level2_rename_mean_5.csv")
#specific some conditions on each row
#I want row names where the minimum value is less than 10 AND the variance is greater than 2000
#initialise an array to store results
result <- array(0,dim=c(1,1))
#start array index
counter <- 1
#loop through the data
for (i in 1:nrow(data_subset)){
   #specify the two conditions
   if(min(data_subset[i,]) < 10 & data_subset[i,]$variance > 2000){
      #store result in the result array
      result[[counter]] <- row.names(data_subset[i,]);
      #increase index by 1
      counter <- counter + 1
   }
}
write.csv(data_subset[result,],"condition.txt")
Posted in R, Statistics | Tagged , | Leave a comment