# Creating a correlation matrix with R

Updated 2014 January 6th

This post on creating a correlation matrix with R was originally published on 31 January 2012 and has become one of my most viewed posts. I've learned a bit more since then, so I have updated and improved it.

### Motivation

Let $A$ be an $m \times n$ matrix, where $a_{ij}$ is the element of $A$ in the $i^{th}$ row and $j^{th}$ column.

If the matrix $A$ contained transcriptomic data, $a_{ij}$ is the expression level of the $i^{th}$ transcript in the $j^{th}$ assay. The elements of the $i^{th}$ row of $A$ form the transcriptional response of the $i^{th}$ transcript. The elements of the $j^{th}$ column of $A$ form the expression profile of the $j^{th}$ assay.

Transcripts with similar transcriptional responses may be co-expressed and may belong to the same biological pathway. A simple way of looking at co-expression is through correlation, i.e. correlating all pairs of transcriptional responses, resulting in a correlation matrix.

### Getting started

Let's get started with a simple example, with a $10 \times 10$ matrix.

#create random matrix with numbers ranging from 1 to 100
set.seed(12345)
A <- matrix(runif(100,1,100),nrow=10,ncol=10,byrow=T)
image(A)

Heatmap of $A$

Now let's calculate the correlation of each row with every other row, using Spearman's rank correlation coefficient, and store the results in a correlation matrix.

#our "expression" matrix
A
[,1]      [,2]      [,3]      [,4]      [,5]     [,6]      [,7]      [,8]      [,9]     [,10]
[1,] 72.369486 87.701546 76.337251 88.726332 46.191615 17.47081 33.184443 51.413209 73.042820 98.983957
[2,]  4.419008 16.084976 73.832810  1.112522 39.729130 46.78697 39.426254 40.846029 18.717395 95.214217
[3,] 45.919079 33.348488 96.576117 71.040706 64.809721 39.59302 70.155820 54.861729 23.420251 48.971218
[4,] 79.507710  1.592775 19.583532 68.501529 37.640308 36.80093 87.010695 90.511312 62.125032 14.269132
[5,] 78.437135 43.490683 92.800124 77.551079 26.708443 32.80124  6.959321  5.302189  6.450328 62.928737
[6,] 96.482559 82.902984 32.187795 22.089520 73.517116 50.42486 73.247425  8.953268 44.117518 24.421465
[7,] 79.365212 26.609747 98.612399 75.930501 97.998046 22.67584 94.922011 15.796337 60.435340 94.696644
[8,] 69.146982 51.047839 37.920688 34.244699  5.776884 62.27581 96.183282 65.841090 51.518907 15.859723
[9,] 87.174340 51.929726  1.856143  2.900282 15.306678 31.19814 82.740029 50.732120 80.553690  7.003358
[10,] 92.867559 81.009677  8.802521 60.491900 71.763208 51.87072 72.291534 75.244681 10.468412 40.384761

#row 1
A[1,]
[1] 72.36949 87.70155 76.33725 88.72633 46.19162 17.47081 33.18444 51.41321 73.04282 98.98396
#row 2
A[2,]
[1]  4.419008 16.084976 73.832810  1.112522 39.729130 46.786971 39.426254 40.846029 18.717395 95.214217

#calculating the Spearman's correlation between row 1 and row 2
cor(A[1,], A[2,], method="spearman")
[1] -0.1030303

#calculating all row by row correlations, remember the transpose t()
correlation_matrix <- cor(t(A), method="spearman")
correlation_matrix
[,1]        [,2]        [,3]       [,4]        [,5]       [,6]       [,7]       [,8]
[1,]  1.00000000 -0.10303030  0.05454545 -0.4545455  0.50303030 -0.3212121  0.1151515 -0.5393939
[2,] -0.10303030  1.00000000  0.15151515 -0.3454545 -0.05454545 -0.3575758  0.2000000 -0.2242424
[3,]  0.05454545  0.15151515  1.00000000  0.2363636  0.33333333 -0.3696970  0.6121212 -0.2121212
[4,] -0.45454545 -0.34545455  0.23636364  1.0000000 -0.43030303 -0.1757576 -0.1636364  0.6121212
[5,]  0.50303030 -0.05454545  0.33333333 -0.4303030  1.00000000  0.1515152  0.4545455 -0.2969697
[6,] -0.32121212 -0.35757576 -0.36969697 -0.1757576  0.15151515  1.0000000  0.2000000  0.2242424
[7,]  0.11515152  0.20000000  0.61212121 -0.1636364  0.45454545  0.2000000  1.0000000 -0.3575758
[8,] -0.53939394 -0.22424242 -0.21212121  0.6121212 -0.29696970  0.2242424 -0.3575758  1.0000000
[9,] -0.41818182 -0.45454545 -0.56363636  0.4424242 -0.40606061  0.5878788 -0.2969697  0.7696970
[10,] -0.22424242 -0.53939394 -0.17575758  0.3818182 -0.12727273  0.4909091 -0.3090909  0.4545455
[,9]      [,10]
[1,] -0.4181818 -0.2242424
[2,] -0.4545455 -0.5393939
[3,] -0.5636364 -0.1757576
[4,]  0.4424242  0.3818182
[5,] -0.4060606 -0.1272727
[6,]  0.5878788  0.4909091
[7,] -0.2969697 -0.3090909
[8,]  0.7696970  0.4545455
[9,]  1.0000000  0.6242424
[10,]  0.6242424  1.0000000

#output the matrix in a file
#using rounded values
library(MASS)
write.matrix(round(correlation_matrix, digits=3), file="cor_matrix.txt")

If we look at the file cor_matrix.txt:

cat cor_matrix.txt
1.000 -0.103  0.055 -0.455  0.503 -0.321  0.115 -0.539 -0.418 -0.224
-0.103  1.000  0.152 -0.345 -0.055 -0.358  0.200 -0.224 -0.455 -0.539
0.055  0.152  1.000  0.236  0.333 -0.370  0.612 -0.212 -0.564 -0.176
-0.455 -0.345  0.236  1.000 -0.430 -0.176 -0.164  0.612  0.442  0.382
0.503 -0.055  0.333 -0.430  1.000  0.152  0.455 -0.297 -0.406 -0.127
-0.321 -0.358 -0.370 -0.176  0.152  1.000  0.200  0.224  0.588  0.491
0.115  0.200  0.612 -0.164  0.455  0.200  1.000 -0.358 -0.297 -0.309
-0.539 -0.224 -0.212  0.612 -0.297  0.224 -0.358  1.000  0.770  0.455
-0.418 -0.455 -0.564  0.442 -0.406  0.588 -0.297  0.770  1.000  0.624
-0.224 -0.539 -0.176  0.382 -0.127  0.491 -0.309  0.455  0.624  1.000

### What about bigger matrices?

The computational time grows quadratically with the number of rows, since we calculate a correlation for every pair of rows. For a matrix with 10,005 rows and 40 columns, there would be:

choose(10005,2)
[1] 50045010

50,045,010 comparisons. It took ~33 hours to do all the calculations on one core of an Intel(R) Xeon(R) CPU X7560 @ 2.27GHz.

Let's create a plot of the number of rows versus the number of pairwise comparisons:

#for a dataset with 10 rows
choose(10,2)
[1] 45

#for a dataset with 195,433 rows
choose(195433,2)
[1] 19096931028

#store comparisons in an array
row_number <- 1000000
x <- array(0, dim=c(row_number,1))
system.time(
   for (i in 1:row_number){
      x[i,1] <- choose(i,2)
   }
)
#   user  system elapsed
#   3.39    0.00    3.39
tail(x)
[,1]
[999995,] 499994500015
[999996,] 499995500010
[999997,] 499996500006
[999998,] 499997500003
[999999,] 499998500001
[1e+06,]  499999500000

plot(x,type="l")
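As an aside, choose() is vectorised, so the loop above isn't strictly necessary; passing a vector of row counts returns all the comparison counts at once and runs much faster (a minimal sketch of the same calculation):

```r
#vectorised version of the loop above
row_number <- 1000000
x <- choose(1:row_number, 2)
#same values as before
tail(x)
#[1] 499994500015 499995500010 499996500006 499997500003 499998500001 499999500000
plot(x, type="l")
```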

As I mentioned above, it took roughly 33 hours to compute the correlations of a 10,005 by 40 matrix (50,045,010 comparisons) on one core.

We can speed this up by using more cores.

### Parallel calculation of correlations

This fantastic guide on Parallel Computing in R shows how we can parallelise the calculations, and I will follow its example, which uses the test dataset included in the package Biobase:

#install if necessary
source("http://bioconductor.org/biocLite.R")
biocLite("Biobase")

data(geneData, package = "Biobase")
dim(geneData)
#[1] 500  26

#how many comparisons?
choose(500,2)
[1] 124750

The comparisons go like this:

row1 by row2
row1 by row3
...

We can use the combn() function to generate the comparisons as a list:

#for 3 rows
combn(3, 2, simplify=F)
#[[1]]
#[1] 1 2

#[[2]]
#[1] 1 3

#[[3]]
#[1] 2 3

#for all rows of geneData
pair <- combn(1:nrow(geneData), 2, simplify = F)

#now we need to create a function to calculate the correlations,
#taking an element of the object pair as input
geneCor <- function(x, gene = geneData) {
   cor(gene[x[1], ], gene[x[2], ])
}

#to find the correlation of row 1 and 5
geneCor(c(1,5))
#[1] 0.6572192

#all correlations, stored in out
system.time(out <- lapply(pair, geneCor))
#   user  system elapsed
#  9.153   0.012   9.181

#the first three results
head(out, n=3)
[[1]]
[1] 0.1536158

[[2]]
[1] 0.7066034

[[3]]
[1] -0.2125437

#we can parallelise this by using
#the package multicore
#install if necessary
install.packages("multicore")

library(multicore)

#use the mclapply() function
system.time(out <- mclapply(pair, geneCor))
#   user  system elapsed
# 11.083   0.561   1.206

corm <- cbind(do.call(rbind, pair), unlist(out))
write.table(corm, file = "correlation.tsv", quote = F, sep = "\t")

Using mclapply took ~1 sec compared to ~9.
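As a side note, since R 2.14 the functionality of multicore has been absorbed into the base package parallel, so on newer installations the same parallel call can be made without installing anything (on Linux/Mac; setting mc.cores to the number of detected cores is my choice here, not something the guide specifies):

```r
#parallel ships with base R and provides mclapply()
library(parallel)
system.time(out <- mclapply(pair, geneCor, mc.cores = detectCores()))
```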

### Creating a co-expression network

In the R code above, I saved the correlation matrix using the function write.matrix(). Below is a simple Perl script that parses the output of write.matrix() and creates a sif file, which can then be loaded into Cytoscape. Please note that this script does not capture negative correlations, only positive correlations that are equal to or higher than the threshold given on the command line.

#!/usr/bin/perl

use strict;
use warnings;

my $usage  = "Usage: $0 <infile.txt> <correlation>\n";
my $infile = shift or die $usage;
my $cor    = shift or die $usage;

my @name = ();

open(IN, '<', $infile) || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   #the first line contains the names
   if ($. == 1){
      @name = split(/\s+/);
      next;
   }
   #index of the current row (zero-based)
   my $current = $. - 2;
   #only examine the upper triangle of the matrix
   my $start = $. - 1;
   my @cor = split();
   for (my $i = $start; $i < scalar(@cor); ++$i){
      if ($cor[$i] >= $cor){
         #print join("\t", $name[$current], $name[$i], $current, $i, $cor[$i]), "\n";
         print join("\t", $name[$current], 'xx', $name[$i]), "\n";
      }
   }
}
close(IN);

exit(0);

Basically the script examines the correlation matrix and prints out associations if the correlation between two rows is equal to or greater than the given threshold. Here's how it looks when loaded into Cytoscape: while most transcripts have a very similar expression pattern, several transcripts exhibit unique transcriptional responses.

### Correlation network within R

An example of creating a correlation network was also available from the guide on Parallel Computing in R. Following on from above:

data(geneData, package = "Biobase")
pair <- combn(1:nrow(geneData), 2, simplify = F)
geneCor <- function(x, gene = geneData) {
   cor(gene[x[1], ], gene[x[2], ])
}
out <- lapply(pair, geneCor)
head(out, n=3)
#[[1]]
#[1] 0.1536158
#
#[[2]]
#[1] 0.7066034
#
#[[3]]
#[1] -0.2125437

#reshape the result
corm <- cbind(do.call(rbind, pair), unlist(out))
head(corm, n=3)
     [,1] [,2]       [,3]
[1,]    1    2  0.1536158
[2,]    1    3  0.7066034
[3,]    1    4 -0.2125437

#remove low cor pairs
corm <- corm[abs(corm[,3]) >= 0.86, ]
dim(corm)
#[1] 112   3

#install packages if necessary
install.packages("network")
install.packages("sna")
library(network)
library(sna)

#the network
net <- network(corm, directed = F)
net
# Network attributes:
#  vertices = 498
#  directed = FALSE
#  hyper = FALSE
#  loops = FALSE
#  multiple = FALSE
#  bipartite = FALSE
#  total edges= 112
#    missing edges= 0
#    non-missing edges= 112
#
# Vertex attribute names:
#    vertex.names
#
#No edge attributes

#component analysis
cd <- component.dist(net)

#delete genes not connected with others
delete.vertices(net, which(cd$csize[cd$membership] == 1))
plot(net)

### Using a real dataset

I will use the pnas_expression.txt file, which is a processed dataset from this study.
data <- read.table("pnas_expression.txt", header=T, row.names=1)
dim(data)
[1] 37435     8
head(data)
                lane1 lane2 lane3 lane4 lane5 lane6 lane8  len
ENSG00000215696     0     0     0     0     0     0     0  330
ENSG00000215700     0     0     0     0     0     0     0 2370
ENSG00000215699     0     0     0     0     0     0     0 1842
ENSG00000215784     0     0     0     0     0     0     0 2393
ENSG00000212914     0     0     0     0     0     0     0  384
ENSG00000212042     0     0     0     0     0     0     0   92
#get rid of the len column
data$len <- NULL
#check how many rows are all zeros
table(rowSums(data)==0)

FALSE  TRUE
21877 15558
#create a smaller subset as an example
data_subset <- data[rowSums(data)>500,]
dim(data_subset)
[1] 4010    7

#save row names
my_row <- rownames(data_subset)

#convert into matrix
data_subset_matrix <- as.matrix(data_subset)

#correlation of row 1 and row 2
cor(data_subset_matrix[1,],data_subset_matrix[2,],method="spearman")
[1] 0.5357143

#correlation matrix
correlation_matrix <- cor(t(data_subset_matrix), method="spearman")
correlation_matrix[1,2]
[1] 0.5357143
dim(correlation_matrix)
[1] 4010 4010

#output results
library(MASS)
write.matrix(correlation_matrix, file="pnas_expression_correlation.tsv")

The Perl script above can be used to convert the pnas_expression_correlation.tsv file into a sif file, which can then be loaded into Cytoscape.

### Conclusions

Creating a correlation matrix with R is quite easy and, as I have shown, the results can be visualised using Cytoscape. When applied to transcriptomic datasets, this may be useful in identifying co-expressed transcripts. I've shown an example of this using a real dataset; note, however, that the example has relatively few assays or samples, which may limit the usefulness of this approach since each transcriptional response contains only a few data points. However, imagine a dataset with a large number of assays, such as transcriptional profiling of a large panel of tissues. Using this approach, we may be able to unveil tissue-specific transcripts, since they will have very distinctive transcriptional responses.

Finally, one can use the multicore package to speed up a large number of calculations.

# Using R to obtain basic statistics on your dataset

Updated: 2014 June 20th

Most of the data I work with are represented as tables i.e. with rows and columns. R makes it easy to store (as data frames) and process such data to produce some basic statistics. Here are just some R functions that calculate some basic, but nevertheless useful, statistics. I will use the iris dataset that comes with R.
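As a taster, here are a few of the functions I mean, applied to iris (nothing here beyond base R):

```r
data(iris)
#dimensions: 150 rows (observations) by 5 columns (variables)
dim(iris)
#[1] 150   5
#minimum, quartiles, median, mean and maximum of each column
summary(iris)
#mean and standard deviation of a single column
mean(iris$Sepal.Length)
sd(iris$Sepal.Length)
#counts of each factor level
table(iris$Species)
#    setosa versicolor  virginica
#        50         50         50
```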

# edgeR vs. SAMSeq

A while ago I received a comment about comparing parametric methods against nonparametric methods for calling differential expression in count data. Here I compare SAMSeq (Jun Li and Robert Tibshirani (2011) Finding consistent patterns: a nonparametric approach for identifying differential expression in RNA-Seq data. Statistical Methods in Medical Research, in press.) with edgeR. For more information on SAMSeq, refer to the paper. I intend to look at cuffdiff sometime later.

library(edgeR)
#raw.data holds the pnas_expression.txt counts used in the post above
raw.data <- read.table("pnas_expression.txt", header=T, row.names=1)
#keep the seven lanes of counts and drop the len column
d <- raw.data[,1:7]
rownames(d) <- rownames(raw.data)
#                lane1 lane2 lane3 lane4 lane5 lane6 lane8
#ENSG00000124208   478   619   628   744   483   716   240
#ENSG00000182463    27    20    27    26    48    55    24
#ENSG00000124201   180   218   293   275   373   301    88
#ENSG00000124207    76    80    85    97    80    81    37
#ENSG00000125835   132   200   200   228   280   204    52
#ENSG00000125834    42    60    72    86   131    99    30
group <- c(rep("Control",4),rep("DHT",3))
d <- DGEList(counts = d, group=group)
#Calculating library sizes from column totals.
d <- calcNormFactors(d)
d <- estimateCommonDisp(d)
de.com <- exactTest(d)
#Comparison of groups:  DHT - Control
summary(decideTestsDGE(de.com,p.value=0.05))
#   [,1]
#-1  2368
#0  11715
#1   2411
d <- estimateTagwiseDisp(d, prop.used=0.5, grid.length=500)
de.tgw <- exactTest(d)
summary(decideTestsDGE(de.tgw, p.value=0.05))
#   [,1]
#-1  2119
#0  12048
#1   2327
#total 4446
de <- summary(decideTestsDGE(de.tgw,p.value=0.05))
total <- de[1] + de[3]
sink("edger.output")
topTags(de.tgw,n=total)
sink()
sessionInfo()
#R version 2.14.1 (2011-12-22)
#Platform: x86_64-unknown-linux-gnu (64-bit)
#
#locale:
#[1] C
#
#attached base packages:
#[1] stats     graphics  grDevices utils     datasets  methods   base
#
#other attached packages:
#[1] edgeR_2.4.1
#
#loaded via a namespace (and not attached):
#[1] limma_3.10.0

For SAMSeq:

Data type?                                  Two class unpaired
Array or Seq data?                          seq
Arrays centered?                            FALSE
Delta                                       2.21025
Minimum fold change                         0
Test statistic                              standard
Are data on log scale?                      FALSE
Number of permutations                      1000
Number of neighbors for KNN                 10
Seed for random number generator            1234567
Estimate of pi0 (proportion of null genes)  0.55
Exchangibility factor s0                    0
s0 percentile                               0
Quantiles of estimated sequencing depths (0%, 25%, 50%, 75%, 100%)  0.474, 0.822, 1.116, 1.228, 1.311
False discovery rate (%)                    3.228881
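For reference, the settings above correspond roughly to a SAMseq() call along these lines. This is a sketch, not the exact code I ran: the argument names come from the samr package documentation, and x and y here are my own stand-ins for the count matrix and group labels.

```r
library(samr)
#x: matrix of counts (genes in rows), y: group labels (1 = Control, 2 = DHT)
x <- as.matrix(raw.data[,1:7])
y <- c(rep(1, 4), rep(2, 3))
samfit <- SAMseq(x, y, resp.type = "Two class unpaired",
                 nperms = 1000, random.seed = 1234567, fdr.output = 0.05)
```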

Of the 4,408 significantly differentially expressed genes detected by SAMSeq using the settings above, 3,908 overlap with the 4,446 detected by edgeR using tagwise dispersions.

See also: edgeR vs. DESeq using this same dataset.

and

http://www.stanford.edu/~junli07/research.html#SAM

# The DGEList object in R

I've updated this post (2013 June 29th) to use the latest version of R, Bioconductor and edgeR. I also demonstrate how results of edgeR can be saved and outputted into one useful table.

The DGEList object holds the dataset to be analysed by edgeR and the subsequent calculations performed on the dataset. Specifically it contains:

- counts: numeric matrix containing the read counts
- lib.size: numeric vector containing the total count to normalise against for each sample (optional)
- norm.factors: numeric vector containing normalisation factors (optional, defaults to all 1)
- group: vector giving the experimental group/condition for each sample/library
- genes: data frame containing annotation information for the tags/transcripts/genes for which we have count data (optional)
- remove.zeros: whether to remove rows that have 0 total count; default is FALSE so as to retain all information in the dataset

After calling the function estimateCommonDisp, the DGEList object contains several new elements. Straight from the manual:

The output of estimateCommonDisp is a DGEList object with several new elements. The element common.dispersion, as the name suggests, provides the estimate of the common dispersion, and pseudo.alt gives the pseudocounts calculated under the alternative hypothesis. The element genes contains the information about gene/tag identifiers. The element conc gives the estimates of the overall concentration of each tag across all of the original samples (conc$conc.common) and the estimate of the concentration of each tag within each group (conc$conc.group). The element common.lib.size gives the library size to which the original libraries have been adjusted in the pseudocounts.

Here I use edgeR for a differential gene expression analysis and show how I access the DGEList and the results of the exact test and save it all into an expanded data frame:

#install DESeq if necessary
source("http://bioconductor.org/biocLite.R")
biocLite("DESeq")
library("DESeq")
example_file <- system.file("extra/TagSeqExample.tab", package="DESeq")
#read the example counts into a data frame
data <- read.delim(example_file, header=T, row.names=1)
dim(data)
[1] 18760     6
head(data)
           T1a T1b  T2  T3  N1  N2
Gene_00001   0   0   2   0   0   1
Gene_00002  20   8  12   5  19  26
Gene_00003   3   0   2   0   0   0
Gene_00004  75  84 241 149 271 257
Gene_00005  10  16   4   0   4  10
Gene_00006 129 126 451 223 243 149
summary(rowSums(data))
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0      31     228    1521     883  234500
#let's get rid of some lowly expressed genes
data_subset <- data[rowSums(data)>10,]
dim(data_subset)
[1] 15716     6
#classic edgeR for differential expression
library(edgeR)
group <- c(rep("Tumour",4),rep("Control",2))
#create the DGEList object
d <- DGEList(counts = data_subset, group=group)
d <- calcNormFactors(d)
d <- estimateCommonDisp(d,verbose=T)
Disp = 0.57717 , BCV = 0.7597
d <- estimateTagwiseDisp(d)
de.tgw <- exactTest(d)
summary(decideTestsDGE(de.tgw, p.value=0.01))
[,1]
-1    93
0  15587
1     36
topTags(de.tgw)
Comparison of groups:  Tumour-Control
logFC   logCPM       PValue          FDR
Gene_12457 -11.954146 5.205006 2.725899e-17 4.284023e-13
Gene_14803 -11.181823 4.410806 5.260892e-16 4.134009e-12
Gene_00665  -8.093491 4.579092 8.091297e-15 4.238761e-11
Gene_03846  -6.845510 5.869045 6.247258e-14 2.454548e-10
Gene_01354  -9.611609 6.619495 1.831463e-12 5.756654e-09
Gene_10880 -10.210789 6.010410 2.499662e-12 6.547448e-09
Gene_09044  -8.378516 4.605649 4.742492e-12 1.020125e-08
Gene_18549  -8.151739 3.708076 5.192798e-12 1.020125e-08
Gene_02555 -10.534767 3.859586 8.331151e-12 1.454804e-08
Gene_08310  -5.250630 6.059341 2.135419e-11 3.356024e-08
#now say if you wanted to know the tagwise dispersions for each gene
names(d)
[1] "counts"             "samples"            "common.dispersion"  "pseudo.counts"      "AveLogCPM"          "pseudo.lib.size"
[7] "prior.n"            "tagwise.dispersion"
data_subset$twd <- d$tagwise.dispersion
head(data_subset)
           T1a T1b  T2  T3  N1  N2       twd
Gene_00002  20   8  12   5  19  26 0.7054258
Gene_00004  75  84 241 149 271 257 0.3119328
Gene_00005  10  16   4   0   4  10 1.3231080
Gene_00006 129 126 451 223 243 149 0.3076355
Gene_00007  13   4  21  19  31   4 0.6682897
Gene_00009 202 122 256  43 287 357 0.4313053
#plot the tagwise dispersions
#histogram at the end of the post
hist(data_subset$twd, breaks=20, xlim=c(0,3))

#now if you wanted to save the fold changes and p values per gene
names(de.tgw)
[1] "table"      "comparison" "genes"
head(de.tgw$table)
logFC   logCPM    PValue
Gene_00002 -0.71266757 1.950463 0.4919477
Gene_00004 -0.90990666 5.280466 0.1732668
Gene_00005  0.59732915 1.126382 0.8207064
Gene_00006  0.30343120 5.532274 0.7148083
Gene_00007 -0.03886553 1.863605 0.9267819
Gene_00009 -0.84392775 5.626952 0.2764131
data_subset <- cbind(data_subset, de.tgw$table)
head(data_subset)
           T1a T1b  T2  T3  N1  N2       twd       logFC   logCPM    PValue
Gene_00002  20   8  12   5  19  26 0.7054258 -0.71266757 1.950463 0.4919477
Gene_00004  75  84 241 149 271 257 0.3119328 -0.90990666 5.280466 0.1732668
Gene_00005  10  16   4   0   4  10 1.3231080  0.59732915 1.126382 0.8207064
Gene_00006 129 126 451 223 243 149 0.3076355  0.30343120 5.532274 0.7148083
Gene_00007  13   4  21  19  31   4 0.6682897 -0.03886553 1.863605 0.9267819
Gene_00009 202 122 256  43 287 357 0.4313053 -0.84392775 5.626952 0.2764131

#I want the fdr
data_subset$PValue_fdr <- p.adjust(method="fdr", p=data_subset$PValue)
head(data_subset)
           T1a T1b  T2  T3  N1  N2       twd       logFC   logCPM    PValue PValue_fdr
Gene_00002  20   8  12   5  19  26 0.7054258 -0.71266757 1.950463 0.4919477  1.0000000
Gene_00004  75  84 241 149 271 257 0.3119328 -0.90990666 5.280466 0.1732668  0.8566228
Gene_00005  10  16   4   0   4  10 1.3231080  0.59732915 1.126382 0.8207064  1.0000000
Gene_00006 129 126 451 223 243 149 0.3076355  0.30343120 5.532274 0.7148083  1.0000000
Gene_00007  13   4  21  19  31   4 0.6682897 -0.03886553 1.863605 0.9267819  1.0000000
Gene_00009 202 122 256  43 287 357 0.4313053 -0.84392775 5.626952 0.2764131  0.9593879

#sanity check with the decideTestsDGE() call
table(data_subset$PValue_fdr < 0.01)

FALSE  TRUE
15587   129
#to write out this useful data frame
write.table(data_subset, file="blah.tsv", quote=F)
sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252    LC_MONETARY=English_Australia.1252
[4] LC_NUMERIC=C                       LC_TIME=English_Australia.1252

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] edgeR_3.2.3        limma_3.16.5       DESeq_1.12.0       lattice_0.20-15    locfit_1.5-9.1     Biobase_2.20.0     BiocGenerics_0.6.0

loaded via a namespace (and not attached):
[1] annotate_1.38.0      AnnotationDbi_1.22.5 DBI_0.2-7            genefilter_1.42.0    geneplotter_1.38.0   grid_3.0.1
[7] IRanges_1.18.1       RColorBrewer_1.0-5   RSQLite_0.11.4       splines_3.0.1        stats4_3.0.1         survival_2.37-4
[13] tools_3.0.1          XML_3.96-1.1         xtable_1.7-1

# Processing rows of a data frame in R

Once you've read a tab-delimited file into a data.frame, here's one way of operating on the rows:

#print out row number 1
data[1,]
#       A  B  C  D  E  F
#row_1  2  4  3  9  9 13
#calculate the variance of row 1 in the data.frame
var(as.vector(as.matrix(data[1,])))
#[1] 18.66667
#Just to test the results
#make some test variable corresponding to the values in row 1
test <- c(2,4,3,9,9,13)
#calculate the variance
var(test)
#[1] 18.66667

I'm still wondering why I need two conversion steps (e.g. var(as.vector(as.matrix(data_subset[1,])))), since var(as.vector(data_subset[1,])) doesn't work. In time, when I learn more about data.frames and R in general, I hope to address this; if some expert comes across this, may you kindly explain it to me. Thanks!
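For what it's worth, one possible explanation (treat this as my reading of the documentation rather than gospel): a one-row subset of a data frame is still a data frame, i.e. a list of columns, and as.vector() applied to a list returns a list, which var() cannot use; as.matrix() first coerces everything to a common numeric type. unlist() does the flattening in one step:

```r
#a one-row subset of a data frame is itself a data frame (a list of columns)
df <- data.frame(A=2, B=4, C=3, D=9, E=9, F=13)
class(df[1,])
#[1] "data.frame"
#unlist() flattens the row into a named numeric vector in a single step
var(unlist(df[1,]))
#[1] 18.66667
```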

Calculating the variance for each row and storing the variance as an additional column

for (i in 1:nrow(data_subset)){
   print(var(as.vector(as.matrix(data_subset[i,]))))
}
#to add the variance as an additional column in the data frame
data_subset$variance <- apply(data_subset, 1, function(row) var(as.vector(row[1:6])))
#and to delete the variance column
#data_subset <- subset(data_subset, select=-c(variance))

# Creating data subsets in R

Say you have a tab-delimited file called tally.tsv with n rows, and you only want to work with a subset of those rows based on the sum of each row.

Here's how to do it within R:

#your tsv file has a header row and the identifiers for each row are in the first column
data <- read.table("tally.tsv", header=T, row.names=1)
#sum up each row
rs <- rowSums(data)
#if you are only interested in rows where the sum is greater than 50
use <- (rs > 50)
#check to see how many satisfy your condition
table(use)
#use
#FALSE  TRUE
#12331 22622
#store the rows you are interested in another variable
data_subset <- data[use,]
#check to see if the number of rows is equal to 22622
nrow(data_subset)
#[1] 22622
#to use two conditions
#use <- (rs > 4000 & rs < 1000000)
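The commented-out line above combines two conditions with &; here is a minimal sketch of the same idea on made-up numbers (the values are arbitrary):

```r
#toy data: four rows with known sums (20, 5000, 900000, 2000001)
data <- data.frame(a=c(10, 2000, 500000, 2000000),
                   b=c(10, 3000, 400000, 1))
rs <- rowSums(data)
#keep rows whose sum lies strictly between 4000 and 1000000
use <- (rs > 4000 & rs < 1000000)
data_subset <- data[use,]
nrow(data_subset)
#[1] 2
```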