Doing simple stuff in R

Updated 2014 June 21st

The very first time I tried to use R was back in 2005 when I wanted to perform a Wilcoxon test. I was migrating from microbiology into bioinformatics and I still remember trying to follow this guide: http://cran.r-project.org/doc/manuals/r-release/R-intro.html. All I managed to do was waste a lot of paper by printing out the entire manual and typing out in verbatim Appendix A, which was a sample session, without understanding much. Since starting my PhD in 2010, I’ve had to use several of the R Bioconductor packages for performing some analyses. But it wasn’t until recently when I enrolled in an online data analysis course that I learned a bit more about the language. This post is about doing simple stuff in R given my recent increase in exposure to the language.

Getting started

First, to load your dataset (which is presumably in a table) into R use the read.table() function. The data.frame is the most important data structure for handling tabular statistical data in R.

df <- read.table("wine.csv", sep=",", header=T)

The above code stored the data into an object called df. The R language is an object-oriented programming language, so every object is an instance of a class.

class(df)
[1] "data.frame"

You can also read a file from a URL:

#if the URL the using HTTPS
#you need to use the RCurl package

install.packages('RCurl')
library(RCurl)

#download collection of trusted root certification authorities
if( !file.exists("cacert.pem") ){
  download.file(url="http://curl.haxx.se/ca/cacert.pem",
                destfile = "cacert.pem")
}

#I've hosted the file on my Dropbox public folder
test <- getURL("https://dl.dropboxusercontent.com/u/15251811/wine.csv",
               cainfo = "cacert.pem")

df <- read.table(textConnection(test), header=T, sep=",")

I’ve created the df object into R, which contains the data from the wine.csv file. The file has headers and each column is separated by commas, I used header=T and sep=”,” in the read.table() function. The df object is a data frame, which is a collection of multiple vectors, which may be different classes, but of the same length.

If we use the str() function, we can see classes of each column and some sample data:

str(df)
'data.frame':	178 obs. of  14 variables:
 $ class               : Factor w/ 3 levels "class_1","class_2",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ alcohol             : num  14.2 13.2 13.2 14.4 13.2 ...
 $ malic_acid          : num  1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
 $ ash                 : num  2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
 $ alcalinity_of_ash   : num  15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
 $ magnesium           : int  127 100 101 113 118 112 96 121 97 98 ...
 $ total_phenols       : num  2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
 $ flavanoids          : num  3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
 $ nonflavanoid_phenols: num  0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
 $ proanthocyanins     : num  2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
 $ colour_intensity    : num  5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
 $ hue                 : num  1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
 $ od280_od315         : num  3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
 $ proline             : int  1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...

If we have a comma delimited file that was prepared using Microsoft Excel, there may be double quotation marks surrounding each value. We can use the parameter quote=”\”” inside the read.table() function:

df <- read.table("wine.csv",
                 sep=",",
                 header=T,
                 quote="\"")

We can use the head() function to take a look at the data frame:

#have a look at the first 4 lines
head(df,4)
    class alcohol malic_acid  ash alcalinity_of_ash magnesium total_phenols flavanoids nonflavanoid_phenols
1 class_1   14.23       1.71 2.43              15.6       127          2.80       3.06                 0.28
2 class_1   13.20       1.78 2.14              11.2       100          2.65       2.76                 0.26
3 class_1   13.16       2.36 2.67              18.6       101          2.80       3.24                 0.30
4 class_1   14.37       1.95 2.50              16.8       113          3.85       3.49                 0.24
  proanthocyanins colour_intensity  hue od280_od315 proline
1            2.29             5.64 1.04        3.92    1065
2            1.28             4.38 1.05        3.40    1050
3            2.81             5.68 1.03        3.17    1185
4            2.18             7.80 0.86        3.45    1480

Use the dim() function to find out the dimension of the data and the colnames() function to find out the column names:

dim(df)
[1] 178  14
colnames(df)
 [1] "class"                "alcohol"              "malic_acid"           "ash"                 
 [5] "alcalinity_of_ash"    "magnesium"            "total_phenols"        "flavanoids"          
 [9] "nonflavanoid_phenols" "proanthocyanins"      "colour_intensity"     "hue"                 
[13] "od280_od315"          "proline"

#to access a particular column,
#we can type the column name or the column number
head(df$malic_acid)
#[1] 1.71 1.78 2.36 1.95 2.59 1.76

head(df[,3])
#[1] 1.71 1.78 2.36 1.95 2.59 1.76

Each column or row is a vector, which is the elementary structure for data handling in R. It is a set of simple elements, all being objects of the same class.

Nominal or categorical measurements are represented by factor variables in R, such as the category of the wine classes:

class(df$class)
#[1] "factor"

We can perform different operations on the columns:

#the mean of all malic acid measurements
mean(df$malic_acid)
[1] 2.336348

#calculate all column means
#we can't calculate the mean of column 1
#we can remove column one by using -1
colMeans(df[,-1])
             alcohol           malic_acid                  ash    alcalinity_of_ash            magnesium 
          13.0006180            2.3363483            2.3665169           19.4949438           99.7415730 
       total_phenols           flavanoids nonflavanoid_phenols      proanthocyanins     colour_intensity 
           2.2951124            2.0292697            0.3618539            1.5908989            5.0580899 
                 hue          od280_od315              proline 
           0.9574494            2.6116854          746.8932584

#calculate all column sums
colSums(df[,-1])
             alcohol           malic_acid                  ash    alcalinity_of_ash            magnesium 
            2314.110              415.870              421.240             3470.100            17754.000 
       total_phenols           flavanoids nonflavanoid_phenols      proanthocyanins     colour_intensity 
             408.530              361.210               64.410              283.180              900.340 
                 hue          od280_od315              proline 
             170.426              464.880           132947.000

Subsetting the data means selecting particular rows or columns. Below I show how we can select all class 1 measurements:

#finding out how many classes
#by using table()
table(df$class)

class_1 class_2 class_3 
     59      71      48

#selecting all class 1 measurements
df_class_1 <- df[df$class =="class_1",]

dim(df_class_1)
[1] 59 14

We can also subset by the row and column numbers:

#first 10 rows and first 5 columns
df[1:10,1:5]
#     class alcohol malic_acid  ash alcalinity_of_ash
#1  class_1   14.23       1.71 2.43              15.6
#2  class_1   13.20       1.78 2.14              11.2
#3  class_1   13.16       2.36 2.67              18.6
#4  class_1   14.37       1.95 2.50              16.8
#5  class_1   13.24       2.59 2.87              21.0
#6  class_1   14.20       1.76 2.45              15.2
#7  class_1   14.39       1.87 2.45              14.6
#8  class_1   14.06       2.15 2.61              17.6
#9  class_1   14.83       1.64 2.17              14.0
#10 class_1   13.86       1.35 2.27              16.0

In contrast to indexing with positive integers, negative indexing removes rows or columns specified by the negative number/s:

#get the dimensions
dim(df)
#[1] 178  14
#subset as per above but using negative indexing
df[-(11:178),-(6:14)]
#     class alcohol malic_acid  ash alcalinity_of_ash
#1  class_1   14.23       1.71 2.43              15.6
#2  class_1   13.20       1.78 2.14              11.2
#3  class_1   13.16       2.36 2.67              18.6
#4  class_1   14.37       1.95 2.50              16.8
#5  class_1   13.24       2.59 2.87              21.0
#6  class_1   14.20       1.76 2.45              15.2
#7  class_1   14.39       1.87 2.45              14.6
#8  class_1   14.06       2.15 2.61              17.6
#9  class_1   14.83       1.64 2.17              14.0
#10 class_1   13.86       1.35 2.27              16.0

For other cool ways on selecting data from data frames, I highly recommend this short guide. If you’ve had some database experience and would like to use SQL statements, see my older post on using SQL on R data frames.

Data summaries

R provides several functions for calculating data summaries.

#quantiles
quantile(df$alcohol)
     0%     25%     50%     75%    100% 
11.0300 12.3625 13.0500 13.6775 14.8300

#full summary of all columns
summary(df)
     class       alcohol        malic_acid         ash        alcalinity_of_ash   magnesium     
 class_1:59   Min.   :11.03   Min.   :0.740   Min.   :1.360   Min.   :10.60     Min.   : 70.00  
 class_2:71   1st Qu.:12.36   1st Qu.:1.603   1st Qu.:2.210   1st Qu.:17.20     1st Qu.: 88.00  
 class_3:48   Median :13.05   Median :1.865   Median :2.360   Median :19.50     Median : 98.00  
              Mean   :13.00   Mean   :2.336   Mean   :2.367   Mean   :19.49     Mean   : 99.74  
              3rd Qu.:13.68   3rd Qu.:3.083   3rd Qu.:2.558   3rd Qu.:21.50     3rd Qu.:107.00  
              Max.   :14.83   Max.   :5.800   Max.   :3.230   Max.   :30.00     Max.   :162.00  
 total_phenols     flavanoids    nonflavanoid_phenols proanthocyanins colour_intensity      hue        
 Min.   :0.980   Min.   :0.340   Min.   :0.1300       Min.   :0.410   Min.   : 1.280   Min.   :0.4800  
 1st Qu.:1.742   1st Qu.:1.205   1st Qu.:0.2700       1st Qu.:1.250   1st Qu.: 3.220   1st Qu.:0.7825  
 Median :2.355   Median :2.135   Median :0.3400       Median :1.555   Median : 4.690   Median :0.9650  
 Mean   :2.295   Mean   :2.029   Mean   :0.3619       Mean   :1.591   Mean   : 5.058   Mean   :0.9574  
 3rd Qu.:2.800   3rd Qu.:2.875   3rd Qu.:0.4375       3rd Qu.:1.950   3rd Qu.: 6.200   3rd Qu.:1.1200  
 Max.   :3.880   Max.   :5.080   Max.   :0.6600       Max.   :3.580   Max.   :13.000   Max.   :1.7100  
  od280_od315       proline      
 Min.   :1.270   Min.   : 278.0  
 1st Qu.:1.938   1st Qu.: 500.5  
 Median :2.780   Median : 673.5  
 Mean   :2.612   Mean   : 746.9  
 3rd Qu.:3.170   3rd Qu.: 985.0  
 Max.   :4.000   Max.   :1680.0

Finding out data classes and the number of unique values

sapply(df,class)
               class              alcohol           malic_acid                  ash    alcalinity_of_ash 
            "factor"            "numeric"            "numeric"            "numeric"            "numeric" 
           magnesium        total_phenols           flavanoids nonflavanoid_phenols      proanthocyanins 
           "integer"            "numeric"            "numeric"            "numeric"            "numeric" 
    colour_intensity                  hue          od280_od315              proline 
           "numeric"            "numeric"            "numeric"            "integer"
#how many unique values for proline measurements
length(unique(df$proline))
[1] 121
#number of proline measurements greater than 1000
table(df$proline > 1000)

FALSE  TRUE 
  135    43

Calculating breaks or bins using cut():

#forms 10 bins/breaks and provides the counts
cut(iris[,1], 10)

Data munging

Change a vector of characters into lower case

test <- c('oNe','Two','thREE')
test
[1] "oNe"   "Two"   "thREE"
tolower(test)
[1] "one"   "two"   "three"
toupper(test)
[1] "ONE"   "TWO"   "THREE"

Making substitutions

test2 <- 'one.two.three'
test2
[1] "one.two.three"
gsub("\\.",'_',test2)
[1] "one_two_three"

Splitting a string of characters into a vector and joining all elements of a data frame

#splitting
a <- 'acgactagcatcgatcagcatc'
strsplit(a, split="")
[[1]]
 [1] "a" "c" "g" "a" "c" "t" "a" "g" "c" "a" "t" "c" "g" "a" "t" "c" "a" "g" "c" "a" "t" "c"
#joining
a <- "acgactagcatcgatcagcatc"
b <- "acgtacgtacgatcgtacgatcg"
c <- "agctacgtacgatcgatcagctagc"
abc <- data.frame(a,b,c)
combined <- apply(abc, 1, paste, collapse="")
combined
[1] "acgactagcatcgatcagcatcacgtacgtacgatcgtacgatcgagctacgtacgatcgatcagctagc"

Sorting and ordering

head(df$proline)
[1] 1065 1050 1185 1480  735 1450
head(sort(df$proline))
[1] 278 290 312 315 325 342
tail(sort(df$proline))
[1] 1450 1480 1510 1515 1547 1680
head(sort(df$proline,decreasing=T))
[1] 1680 1547 1515 1510 1480 1450
#order returns the vector number of the lowest value
head(order(df$proline))
[1]  81  94 109 106 112 129
df$proline[81]
[1] 278
tail(order(df$proline))
[1]  6  4 11 32 15 19
df$proline[19]
[1] 1680
#reordering a data frame
df2 <- df[order(df$proline),]
head(df2,4)
      class alcohol malic_acid  ash alcalinity_of_ash magnesium total_phenols flavanoids
81  class_2   12.00       0.92 2.00                19        86          2.42       2.26
94  class_2   12.29       2.83 2.22                18        88          2.45       2.25
109 class_2   12.22       1.29 1.94                19        92          2.36       2.04
106 class_2   12.42       2.55 2.27                22        90          1.68       1.84
    nonflavanoid_phenols proanthocyanins colour_intensity  hue od280_od315 proline
81                  0.30            1.43             2.50 1.38        3.12     278
94                  0.25            1.99             2.15 1.15        3.30     290
109                 0.39            2.08             2.70 0.86        3.02     312
106                 0.66            1.42             2.70 0.86        3.30     315

Finding the corresponding index of a matrix where the value satisfies a specific condition (learned from this nice post):

#make up some matrix
set.seed(12345)
A <- matrix(runif(100,1,100),nrow=10,ncol=10,byrow=T)

#values in the matrix greater than 98
A > 98
       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
 [1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
 [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [7,] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [8,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[10,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

#obtaining that values that satisfy the condition
A[A>98]
[1] 98.61240 98.98396

#obtaining the indexes that satisfy the condition
which(A>98, arr.in=T)
     row col
[1,]   7   3
[2,]   1  10

#specifying two conditions
which(A<98 & A>97, arr.in=T)
     row col
[1,]   7   5

The Data Analysis for Genomics course taught me the match() function:

#make an example data frame
rats <- data.frame(id = paste0("rat",1:10),  
                   sex = factor(rep(c("female","male"),each=5)),
                   weight = c(2,4,1,11,18,12,7,12,19,20),
                   length = c(100,105,115,130,95,150,165,180,190,175))
rats
      id    sex weight length
1   rat1 female      2    100
2   rat2 female      4    105
3   rat3 female      1    115
4   rat4 female     11    130
5   rat5 female     18     95
6   rat6   male     12    150
7   rat7   male      7    165
8   rat8   male     12    180
9   rat9   male     19    190
10 rat10   male     20    175

#create another table
ratsTable <- data.frame(id = paste0("rat",c(6,9,7,3,5,1,10,4,8,2)),
                        secretID = 1:10)

ratsTable
      id secretID
1   rat6        1
2   rat9        2
3   rat7        3
4   rat3        4
5   rat5        5
6   rat1        6
7  rat10        7
8   rat4        8
9   rat8        9
10  rat2       10

#use match to match ids from two tables
match(ratsTable$id, rats$id)
 [1]  6  9  7  3  5  1 10  4  8  2

#use match to re-order the rats data frame
#according to the ratsTable order of rat ids
rats[match(ratsTable$id, rats$id),] 
      id    sex weight length
6   rat6   male     12    150
9   rat9   male     19    190
7   rat7   male      7    165
3   rat3 female      1    115
5   rat5 female     18     95
1   rat1 female      2    100
10 rat10   male     20    175
4   rat4 female     11    130
8   rat8   male     12    180
2   rat2 female      4    105

#since the two tables have the same order
#we can add the secretID column
#using cbind()
cbind(rats[match(ratsTable$id, rats$id),], ratsTable)
      id    sex weight length    id secretID
6   rat6   male     12    150  rat6        1
9   rat9   male     19    190  rat9        2
7   rat7   male      7    165  rat7        3
3   rat3 female      1    115  rat3        4
5   rat5 female     18     95  rat5        5
1   rat1 female      2    100  rat1        6
10 rat10   male     20    175 rat10        7
4   rat4 female     11    130  rat4        8
8   rat8   male     12    180  rat8        9
2   rat2 female      4    105  rat2       10

#merge does all this for us
merge(x=rats, y=ratsTable, by='id')
      id    sex weight length secretID
1   rat1 female      2    100        6
2  rat10   male     20    175        7
3   rat2 female      4    105       10
4   rat3 female      1    115        4
5   rat4 female     11    130        8
6   rat5 female     18     95        5
7   rat6   male     12    150        1
8   rat7   male      7    165        3
9   rat8   male     12    180        9
10  rat9   male     19    190        2

#use aggregate to find
#weights of different sexes
aggregate(rats$weight, list(sex=rats$sex), mean)
     sex    x
1 female  7.2
2   male 14.0

The above code is available at GitHub.

How to subset based on a vector:

set.seed(31)
df <- data.frame(id=paste(rep('boy', 10), 1:10, sep=''), rand= sample(1:100,10))
df
      id rand
1   boy1   53
2   boy2   95
3   boy3   42
4   boy4   98
5   boy5   91
6   boy6   43
7   boy7   79
8   boy8   17
9   boy9   86
10 boy10   23

#use the %in% notation
df$id %in% c('boy1', 'boy9')
 [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE

df[df$id %in% c('boy1', 'boy9'),]
    id rand
1 boy1   53
9 boy9   86

Model formulae

In R, regression relationships are specified by so-called model formulae which, in a simple bivariate case, the dependent variable is on the left and the independent variable on the right.

install.packages("UsingR")
library(UsingR)
data(galton)
#height of the child and the parent
head(galton)
#  child parent
#1  61.7   70.5
#2  61.7   68.5
#3  61.7   65.5
#4  61.7   64.5
#5  61.7   64.0
#6  62.2   67.5

#the tilde separates left and right
mf <- child ~ parent
class(mf)
#[1] "formula"

lm(mf, galton)

#Call:
#lm(formula = mf, data = galton)
#
#Coefficients:
#(Intercept)       parent  
#    23.9415       0.6463

Simple plots

Scatterplot of different pairs

pairs(~malic_acid+proline+hue, data=df)

malic_acid_proline_hue

Proline measurements per different classes

plot(df$proline,col=df$class)

proline_by_colour

#using ggplot2
#install if necessary
install.packages("ggplot2")
library(ggplot2)
qplot(c(1:length(df$proline)),proline,data=df, colour=class)

proline_by_colour_ggplot2

plot(density(df$proline,bw=75))

proline_density_bw_75

For more different types of plots see this tutorial

Conclusions

Well I hope you got something useful out of that. If not, you can try this really awesome tutorial for learning R I found last year; I enjoyed it a lot. You can also search other R related posts on this site and see my collection of short snippets of R code.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
Posted in RTagged
One comment Add yours

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.