Updated 2014 June 21st
The very first time I tried to use R was back in 2005 when I wanted to perform a Wilcoxon test. I was migrating from microbiology into bioinformatics and I still remember trying to follow this guide: http://cran.r-project.org/doc/manuals/r-release/R-intro.html. All I managed to do was waste a lot of paper by printing out the entire manual and typing out in verbatim Appendix A, which was a sample session, without understanding much. Since starting my PhD in 2010, I’ve had to use several of the R Bioconductor packages for performing some analyses. But it wasn’t until recently when I enrolled in an online data analysis course that I learned a bit more about the language. This post is about doing simple stuff in R given my recent increase in exposure to the language.
Getting started
First, to load your dataset (which is presumably in a table) into R use the read.table() function. The data.frame is the most important data structure for handling tabular statistical data in R.
df <- read.table("wine.csv", sep=",", header=T)
The above code stored the data into an object called df. The R language is an object-oriented programming language, so every object is an instance of a class.
class(df) [1] "data.frame"
You can also read a file from a URL:
#if the URL the using HTTPS #you need to use the RCurl package install.packages('RCurl') library(RCurl) #download collection of trusted root certification authorities if( !file.exists("cacert.pem") ){ download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile = "cacert.pem") } #I've hosted the file on my Dropbox public folder test <- getURL("https://dl.dropboxusercontent.com/u/15251811/wine.csv", cainfo = "cacert.pem") df <- read.table(textConnection(test), header=T, sep=",")
I’ve created the df object into R, which contains the data from the wine.csv file. The file has headers and each column is separated by commas, I used header=T and sep=”,” in the read.table() function. The df object is a data frame, which is a collection of multiple vectors, which may be different classes, but of the same length.
If we use the str() function, we can see classes of each column and some sample data:
str(df) 'data.frame': 178 obs. of 14 variables: $ class : Factor w/ 3 levels "class_1","class_2",..: 1 1 1 1 1 1 1 1 1 1 ... $ alcohol : num 14.2 13.2 13.2 14.4 13.2 ... $ malic_acid : num 1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ... $ ash : num 2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ... $ alcalinity_of_ash : num 15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ... $ magnesium : int 127 100 101 113 118 112 96 121 97 98 ... $ total_phenols : num 2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ... $ flavanoids : num 3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ... $ nonflavanoid_phenols: num 0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ... $ proanthocyanins : num 2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ... $ colour_intensity : num 5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ... $ hue : num 1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ... $ od280_od315 : num 3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ... $ proline : int 1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...
If we have a comma delimited file that was prepared using Microsoft Excel, there may be double quotation marks surrounding each value. We can use the parameter quote=”\”” inside the read.table() function:
df <- read.table("wine.csv", sep=",", header=T, quote="\"")
We can use the head() function to take a look at the data frame:
#have a look at the first 4 lines head(df,4) class alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols flavanoids nonflavanoid_phenols 1 class_1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2 class_1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 3 class_1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 4 class_1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 proanthocyanins colour_intensity hue od280_od315 proline 1 2.29 5.64 1.04 3.92 1065 2 1.28 4.38 1.05 3.40 1050 3 2.81 5.68 1.03 3.17 1185 4 2.18 7.80 0.86 3.45 1480
Use the dim() function to find out the dimension of the data and the colnames() function to find out the column names:
dim(df) [1] 178 14 colnames(df) [1] "class" "alcohol" "malic_acid" "ash" [5] "alcalinity_of_ash" "magnesium" "total_phenols" "flavanoids" [9] "nonflavanoid_phenols" "proanthocyanins" "colour_intensity" "hue" [13] "od280_od315" "proline" #to access a particular column, #we can type the column name or the column number head(df$malic_acid) #[1] 1.71 1.78 2.36 1.95 2.59 1.76 head(df[,3]) #[1] 1.71 1.78 2.36 1.95 2.59 1.76
Each column or row is a vector, which is the elementary structure for data handling in R. It is a set of simple elements, all being objects of the same class.
Nominal or categorical measurements are represented by factor variables in R, such as the category of the wine classes:
class(df$class) #[1] "factor"
We can perform different operations on the columns:
#the mean of all malic acid measurements mean(df$malic_acid) [1] 2.336348 #calculate all column means #we can't calculate the mean of column 1 #we can remove column one by using -1 colMeans(df[,-1]) alcohol malic_acid ash alcalinity_of_ash magnesium 13.0006180 2.3363483 2.3665169 19.4949438 99.7415730 total_phenols flavanoids nonflavanoid_phenols proanthocyanins colour_intensity 2.2951124 2.0292697 0.3618539 1.5908989 5.0580899 hue od280_od315 proline 0.9574494 2.6116854 746.8932584 #calculate all column sums colSums(df[,-1]) alcohol malic_acid ash alcalinity_of_ash magnesium 2314.110 415.870 421.240 3470.100 17754.000 total_phenols flavanoids nonflavanoid_phenols proanthocyanins colour_intensity 408.530 361.210 64.410 283.180 900.340 hue od280_od315 proline 170.426 464.880 132947.000
Subsetting the data means selecting particular rows or columns. Below I show how we can select all class 1 measurements:
#finding out how many classes #by using table() table(df$class) class_1 class_2 class_3 59 71 48 #selecting all class 1 measurements df_class_1 <- df[df$class =="class_1",] dim(df_class_1) [1] 59 14
We can also subset by the row and column numbers:
#first 10 rows and first 5 columns df[1:10,1:5] # class alcohol malic_acid ash alcalinity_of_ash #1 class_1 14.23 1.71 2.43 15.6 #2 class_1 13.20 1.78 2.14 11.2 #3 class_1 13.16 2.36 2.67 18.6 #4 class_1 14.37 1.95 2.50 16.8 #5 class_1 13.24 2.59 2.87 21.0 #6 class_1 14.20 1.76 2.45 15.2 #7 class_1 14.39 1.87 2.45 14.6 #8 class_1 14.06 2.15 2.61 17.6 #9 class_1 14.83 1.64 2.17 14.0 #10 class_1 13.86 1.35 2.27 16.0
In contrast to indexing with positive integers, negative indexing removes rows or columns specified by the negative number/s:
#get the dimensions dim(df) #[1] 178 14 #subset as per above but using negative indexing df[-(11:178),-(6:14)] # class alcohol malic_acid ash alcalinity_of_ash #1 class_1 14.23 1.71 2.43 15.6 #2 class_1 13.20 1.78 2.14 11.2 #3 class_1 13.16 2.36 2.67 18.6 #4 class_1 14.37 1.95 2.50 16.8 #5 class_1 13.24 2.59 2.87 21.0 #6 class_1 14.20 1.76 2.45 15.2 #7 class_1 14.39 1.87 2.45 14.6 #8 class_1 14.06 2.15 2.61 17.6 #9 class_1 14.83 1.64 2.17 14.0 #10 class_1 13.86 1.35 2.27 16.0
For other cool ways on selecting data from data frames, I highly recommend this short guide. If you’ve had some database experience and would like to use SQL statements, see my older post on using SQL on R data frames.
Data summaries
R provides several functions for calculating data summaries.
#quantiles quantile(df$alcohol) 0% 25% 50% 75% 100% 11.0300 12.3625 13.0500 13.6775 14.8300 #full summary of all columns summary(df) class alcohol malic_acid ash alcalinity_of_ash magnesium class_1:59 Min. :11.03 Min. :0.740 Min. :1.360 Min. :10.60 Min. : 70.00 class_2:71 1st Qu.:12.36 1st Qu.:1.603 1st Qu.:2.210 1st Qu.:17.20 1st Qu.: 88.00 class_3:48 Median :13.05 Median :1.865 Median :2.360 Median :19.50 Median : 98.00 Mean :13.00 Mean :2.336 Mean :2.367 Mean :19.49 Mean : 99.74 3rd Qu.:13.68 3rd Qu.:3.083 3rd Qu.:2.558 3rd Qu.:21.50 3rd Qu.:107.00 Max. :14.83 Max. :5.800 Max. :3.230 Max. :30.00 Max. :162.00 total_phenols flavanoids nonflavanoid_phenols proanthocyanins colour_intensity hue Min. :0.980 Min. :0.340 Min. :0.1300 Min. :0.410 Min. : 1.280 Min. :0.4800 1st Qu.:1.742 1st Qu.:1.205 1st Qu.:0.2700 1st Qu.:1.250 1st Qu.: 3.220 1st Qu.:0.7825 Median :2.355 Median :2.135 Median :0.3400 Median :1.555 Median : 4.690 Median :0.9650 Mean :2.295 Mean :2.029 Mean :0.3619 Mean :1.591 Mean : 5.058 Mean :0.9574 3rd Qu.:2.800 3rd Qu.:2.875 3rd Qu.:0.4375 3rd Qu.:1.950 3rd Qu.: 6.200 3rd Qu.:1.1200 Max. :3.880 Max. :5.080 Max. :0.6600 Max. :3.580 Max. :13.000 Max. :1.7100 od280_od315 proline Min. :1.270 Min. : 278.0 1st Qu.:1.938 1st Qu.: 500.5 Median :2.780 Median : 673.5 Mean :2.612 Mean : 746.9 3rd Qu.:3.170 3rd Qu.: 985.0 Max. :4.000 Max. :1680.0
Finding out data classes and the number of unique values
sapply(df,class) class alcohol malic_acid ash alcalinity_of_ash "factor" "numeric" "numeric" "numeric" "numeric" magnesium total_phenols flavanoids nonflavanoid_phenols proanthocyanins "integer" "numeric" "numeric" "numeric" "numeric" colour_intensity hue od280_od315 proline "numeric" "numeric" "numeric" "integer" #how many unique values for proline measurements length(unique(df$proline)) [1] 121 #number of proline measurements greater than 1000 table(df$proline > 1000) FALSE TRUE 135 43
Calculating breaks or bins using cut():
#forms 10 bins/breaks and provides the counts cut(iris[,1], 10)
Data munging
Change a vector of characters into lower case
test <- c('oNe','Two','thREE') test [1] "oNe" "Two" "thREE" tolower(test) [1] "one" "two" "three" toupper(test) [1] "ONE" "TWO" "THREE"
Making substitutions
test2 <- 'one.two.three' test2 [1] "one.two.three" gsub("\\.",'_',test2) [1] "one_two_three"
Splitting a string of characters into a vector and joining all elements of a data frame
#splitting a <- 'acgactagcatcgatcagcatc' strsplit(a, split="") [[1]] [1] "a" "c" "g" "a" "c" "t" "a" "g" "c" "a" "t" "c" "g" "a" "t" "c" "a" "g" "c" "a" "t" "c" #joining a <- "acgactagcatcgatcagcatc" b <- "acgtacgtacgatcgtacgatcg" c <- "agctacgtacgatcgatcagctagc" abc <- data.frame(a,b,c) combined <- apply(abc, 1, paste, collapse="") combined [1] "acgactagcatcgatcagcatcacgtacgtacgatcgtacgatcgagctacgtacgatcgatcagctagc"
Sorting and ordering
head(df$proline) [1] 1065 1050 1185 1480 735 1450 head(sort(df$proline)) [1] 278 290 312 315 325 342 tail(sort(df$proline)) [1] 1450 1480 1510 1515 1547 1680 head(sort(df$proline,decreasing=T)) [1] 1680 1547 1515 1510 1480 1450 #order returns the vector number of the lowest value head(order(df$proline)) [1] 81 94 109 106 112 129 df$proline[81] [1] 278 tail(order(df$proline)) [1] 6 4 11 32 15 19 df$proline[19] [1] 1680 #reordering a data frame df2 <- df[order(df$proline),] head(df2,4) class alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols flavanoids 81 class_2 12.00 0.92 2.00 19 86 2.42 2.26 94 class_2 12.29 2.83 2.22 18 88 2.45 2.25 109 class_2 12.22 1.29 1.94 19 92 2.36 2.04 106 class_2 12.42 2.55 2.27 22 90 1.68 1.84 nonflavanoid_phenols proanthocyanins colour_intensity hue od280_od315 proline 81 0.30 1.43 2.50 1.38 3.12 278 94 0.25 1.99 2.15 1.15 3.30 290 109 0.39 2.08 2.70 0.86 3.02 312 106 0.66 1.42 2.70 0.86 3.30 315
Finding the corresponding index of a matrix where the value satisfies a specific condition (learned from this nice post):
#make up some matrix set.seed(12345) A <- matrix(runif(100,1,100),nrow=10,ncol=10,byrow=T) #values in the matrix greater than 98 A > 98 [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [7,] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [8,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [10,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE #obtaining that values that satisfy the condition A[A>98] [1] 98.61240 98.98396 #obtaining the indexes that satisfy the condition which(A>98, arr.in=T) row col [1,] 7 3 [2,] 1 10 #specifying two conditions which(A<98 & A>97, arr.in=T) row col [1,] 7 5
The Data Analysis for Genomics course taught me the match() function:
#make an example data frame rats <- data.frame(id = paste0("rat",1:10), sex = factor(rep(c("female","male"),each=5)), weight = c(2,4,1,11,18,12,7,12,19,20), length = c(100,105,115,130,95,150,165,180,190,175)) rats id sex weight length 1 rat1 female 2 100 2 rat2 female 4 105 3 rat3 female 1 115 4 rat4 female 11 130 5 rat5 female 18 95 6 rat6 male 12 150 7 rat7 male 7 165 8 rat8 male 12 180 9 rat9 male 19 190 10 rat10 male 20 175 #create another table ratsTable <- data.frame(id = paste0("rat",c(6,9,7,3,5,1,10,4,8,2)), secretID = 1:10) ratsTable id secretID 1 rat6 1 2 rat9 2 3 rat7 3 4 rat3 4 5 rat5 5 6 rat1 6 7 rat10 7 8 rat4 8 9 rat8 9 10 rat2 10 #use match to match ids from two tables match(ratsTable$id, rats$id) [1] 6 9 7 3 5 1 10 4 8 2 #use match to re-order the rats data frame #according to the ratsTable order of rat ids rats[match(ratsTable$id, rats$id),] id sex weight length 6 rat6 male 12 150 9 rat9 male 19 190 7 rat7 male 7 165 3 rat3 female 1 115 5 rat5 female 18 95 1 rat1 female 2 100 10 rat10 male 20 175 4 rat4 female 11 130 8 rat8 male 12 180 2 rat2 female 4 105 #since the two tables have the same order #we can add the secretID column #using cbind() cbind(rats[match(ratsTable$id, rats$id),], ratsTable) id sex weight length id secretID 6 rat6 male 12 150 rat6 1 9 rat9 male 19 190 rat9 2 7 rat7 male 7 165 rat7 3 3 rat3 female 1 115 rat3 4 5 rat5 female 18 95 rat5 5 1 rat1 female 2 100 rat1 6 10 rat10 male 20 175 rat10 7 4 rat4 female 11 130 rat4 8 8 rat8 male 12 180 rat8 9 2 rat2 female 4 105 rat2 10 #merge does all this for us merge(x=rats, y=ratsTable, by='id') id sex weight length secretID 1 rat1 female 2 100 6 2 rat10 male 20 175 7 3 rat2 female 4 105 10 4 rat3 female 1 115 4 5 rat4 female 11 130 8 6 rat5 female 18 95 5 7 rat6 male 12 150 1 8 rat7 male 7 165 3 9 rat8 male 12 180 9 10 rat9 male 19 190 2 #use aggregate to find #weights of different sexes aggregate(rats$weight, list(sex=rats$sex), mean) sex x 1 female 7.2 2 male 14.0
The above code is available at GitHub.
How to subset based on a vector:
set.seed(31) df <- data.frame(id=paste(rep('boy', 10), 1:10, sep=''), rand= sample(1:100,10)) df id rand 1 boy1 53 2 boy2 95 3 boy3 42 4 boy4 98 5 boy5 91 6 boy6 43 7 boy7 79 8 boy8 17 9 boy9 86 10 boy10 23 #use the %in% notation df$id %in% c('boy1', 'boy9') [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE df[df$id %in% c('boy1', 'boy9'),] id rand 1 boy1 53 9 boy9 86
Model formulae
In R, regression relationships are specified by so-called model formulae which, in a simple bivariate case, the dependent variable is on the left and the independent variable on the right.
install.packages("UsingR") library(UsingR) data(galton) #height of the child and the parent head(galton) # child parent #1 61.7 70.5 #2 61.7 68.5 #3 61.7 65.5 #4 61.7 64.5 #5 61.7 64.0 #6 62.2 67.5 #the tilde separates left and right mf <- child ~ parent class(mf) #[1] "formula" lm(mf, galton) #Call: #lm(formula = mf, data = galton) # #Coefficients: #(Intercept) parent # 23.9415 0.6463
Simple plots
Scatterplot of different pairs
pairs(~malic_acid+proline+hue, data=df)
Proline measurements per different classes
plot(df$proline,col=df$class)
#using ggplot2 #install if necessary install.packages("ggplot2") library(ggplot2) qplot(c(1:length(df$proline)),proline,data=df, colour=class)
plot(density(df$proline,bw=75))
For more different types of plots see this tutorial
Conclusions
Well I hope you got something useful out of that. If not, you can try this really awesome tutorial for learning R I found last year; I enjoyed it a lot. You can also search other R related posts on this site and see my collection of short snippets of R code.

This work is licensed under a Creative Commons
Attribution 4.0 International License.