R

From Dave's wiki

I've been using R for the past couple of years, but there are some tasks that I constantly forget how to do. Here's a collection of them (some I have since learned to remember but have left intact).

Installing packages

Sometimes you need to include additional library paths; use `Sys.setenv` to add an additional location.

new_path <- paste0(Sys.getenv("LD_LIBRARY_PATH"), ":", "/usr/include")
Sys.setenv("LD_LIBRARY_PATH" = new_path)
Sys.getenv("LD_LIBRARY_PATH")

Anaconda

Install R with conda (https://conda.io/docs/index.html)

wget -c -N https://repo.continuum.io/archive/Anaconda3-5.1.0-Linux-x86_64.sh
bash Anaconda3-5.1.0-Linux-x86_64.sh
conda install gxx_linux-64
conda install r-essentials
conda install gfortran_linux-64

I still can't install the igraph package...

Vectors!

R is a vector-based language. What's a vector? The R manual defines a vector as "a single entity consisting of a collection of things." You can think of a vector as a row or column of numbers or text; vectors are the simplest data structure in R. The beauty of vectorised operations, as opposed to loops, is that a vectorised function works not just on a single value but on a whole vector of values at the same time: R applies the function to every element in a single operation.
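As a minimal sketch, an element-wise operation on two vectors replaces an explicit loop:

```r
x <- 1:5
y <- c(10, 20, 30, 40, 50)

# one vectorised operation instead of a loop over each element
x + y
#[1] 11 22 33 44 55

# comparisons are vectorised too
x > 2
#[1] FALSE FALSE  TRUE  TRUE  TRUE
```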

lapply() applies a given function to each element in a list, so there will be several function calls. do.call() applies a given function to the list as a whole, so there is only one function call.

?do.call
#do.call constructs and executes a function call from a name or a function and a list of arguments to be passed to it.
#see http://stackoverflow.com/questions/10801750/whats-the-difference-between-lapply-and-do-call-in-r
x <- list(1:3,4:6,7:9)
x
[[1]]
[1] 1 2 3

[[2]]
[1] 4 5 6

[[3]]
[1] 7 8 9
do.call(what = rbind, args = x)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
tmp <- list(1:10, 11:20, 21:30)
tmp
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10

[[2]]
 [1] 11 12 13 14 15 16 17 18 19 20

[[3]]
 [1] 21 22 23 24 25 26 27 28 29 30
lapply(tmp, sum)
[[1]]
[1] 55

[[2]]
[1] 155

[[3]]
[1] 255
sapply(tmp, sum)
[1]  55 155 255

Objects

S4 slots

slotNames(perf)
[1] "x.name"       "y.name"       "alpha.name"   "x.values"     "y.values"     "alpha.values"

# use @ to access each slot
perf@alpha.name
[1] "Cutoff"

General operations

Precision

noquote(unlist(format(.Machine)))

Deleting a column in a data frame by assigning NULL

data$bye <- NULL

Use page() to go through data

page(iris, 'print')

Specify your own directory for installing packages

# set .libPaths
.libPaths('/home/me/rlib')

Lookup table

two <- c("AA", "AS")
lut <- c("AA" = "American", 
         "AS" = "Alaska", 
         "B6" = "JetBlue")
two <- lut[two]
two
        AA         AS 
"American"   "Alaska"

General summary of a data frame

head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
install.packages('psych')
library(psych)
describe(iris)
             vars   n mean   sd median trimmed  mad min max range  skew kurtosis   se
Sepal.Length    1 150 5.84 0.83   5.80    5.81 1.04 4.3 7.9   3.6  0.31    -0.61 0.07
Sepal.Width     2 150 3.06 0.44   3.00    3.04 0.44 2.0 4.4   2.4  0.31     0.14 0.04
Petal.Length    3 150 3.76 1.77   4.35    3.76 1.85 1.0 6.9   5.9 -0.27    -1.42 0.14
Petal.Width     4 150 1.20 0.76   1.30    1.18 1.04 0.1 2.5   2.4 -0.10    -1.36 0.06
Species*        5 150 2.00 0.82   2.00    2.00 1.48 1.0 3.0   2.0  0.00    -1.52 0.07

Compare two objects using identical()

identical(object1, object2)

Limit scientific notation

options(scipen=10000)
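A quick demonstration of the effect; scipen sets a penalty against scientific notation when printing, so a large value forces fixed notation:

```r
options(scipen = 0)      # the default
print(1e-10)
#[1] 1e-10
options(scipen = 10000)  # penalise scientific notation
print(1e-10)
#[1] 0.0000000001
options(scipen = 0)      # restore the default
```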

Time an operation

start <- Sys.time()
for(i in 1:100000000){
}
end <- Sys.time()
print(end-start)
Time difference of 4.354291 secs

# or use system.time()
system.time(for (i in 1:100000000){})
   user  system elapsed 
  3.318   0.115   3.446 

Move packages from older R versions to newer installations

#on OS X
#different R versions are stored in
#/Library/Frameworks/R.framework/Versions
#use rsync to copy directories that aren't in the existing library directory
#make sure you are in /Library/Frameworks/R.framework
pwd
/Library/Frameworks/R.framework
#then run
rsync --ignore-existing -r -v Versions/3.1/Resources/library/ Versions/3.2/Resources/library

Subsetting using a list

my_chr <- c(1:22,'X','Y')
lincrna_assembled <- subset(lincrna, lincrna$chromosome_name %in% my_chr)

Subsetting using with

df <- data.frame(a=1:10, b=11:20, c=21:30)
with(df, a > 5)
 [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
with(df, a > 5 & b > 19)
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

Populating a data frame from a loop

#http://stackoverflow.com/questions/13442461/populating-a-data-frame-in-r-in-a-loop
mylist <- list() #create an empty list
for (i in 1:5) {
  vec <- numeric(5) #preallocate a numeric vector
  for (j in 1:5) { #fill the vector
    vec[j] <- i^j 
  }
  mylist[[i]] <- vec #put all vectors in the list
}
df <- do.call("rbind",mylist) #combine all vectors into a matrix

Search Path for R Objects

searchpaths()

Check for missing values using complete.cases()

#no missing values in the iris dataset
table(complete.cases(iris))

#TRUE 
#150

When using apply, how do I use more than one parameter?

#http://stackoverflow.com/questions/6827299/r-apply-function-with-multiple-parameters
my_list <- list(a=1, b=2, c=3)
multiply <- function(var1, var2){
   var1*var2
}
var2 <- 2
sapply(my_list, multiply, var2=var2)
#a b c
#2 4 6

Compactly display the structure of an arbitrary R object

str(iris)
#'data.frame':	150 obs. of  5 variables:
#$ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#$ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#$ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#$ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

The which() function returns the indices that are TRUE

which(10:20 < 15)
[1] 1 2 3 4 5

The which() function also works with matrices.

my_mat <- matrix(1:9, nrow=3)
my_mat
#     [,1] [,2] [,3]
#[1,]    1    4    7
#[2,]    2    5    8
#[3,]    3    6    9

which(my_mat < 5)
[1] 1 2 3 4

which(my_mat < 5 , arr.ind = TRUE )
#     row col
#[1,]   1   1
#[2,]   2   1
#[3,]   3   1
#[4,]   1   2

How do I turn an object's name into a string?

obj_to_string <- function(x){
   name <- deparse(substitute(x))
   print(name)
}
my_obj <- 'blah'
obj_to_string(my_obj)
#[1] "my_obj"

Use match() to match items in two vectors

y <- c('three', 'two', 'one', 'onee')
x <- c('one', 'two', 'three')
#match returns the index in x that match y
match(y, x)
#[1]  3  2  1 NA

How do I subset/order a data frame according to a vector?

#use match
df <- data.frame(a = 1:10, b = letters[1:10])
df
    a b
1   1 a
2   2 b
3   3 c
4   4 d
5   5 e
6   6 f
7   7 g
8   8 h
9   9 i
10 10 j
x <- c(2,10,5,6)
df[match(x, df[,1]),]
    a b
2   2 b
10 10 j
5   5 e
6   6 f

How do I turn a string into an object that contains something I specify?

#use assign
assign('q0', 1:10)
q0
#[1]  1  2  3  4  5  6  7  8  9 10

Use a string to refer to an object

x <- 42
eval(parse(text = "x"))
[1] 42

How do I concatenate strings?

#use paste
my_first_name <- 'Dave'
my_last_name  <- 'Tang'
my_full_name <- paste(my_first_name, my_last_name, sep=" ")
my_full_name
#[1] "Dave Tang"

How do I output an object as a tab-delimited file?

write.table(data, file="blah.tsv", quote=F, sep="\t")

Creating a random number between 5.0 and 7.5

#set seed to make example reproducible
set.seed(31)
runif(1, 5.0, 7.5)
#[1] 6.305395

Taking a subset of a dataset based on some condition

subset(iris, Sepal.Length>7.6)
#    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#118          7.7         3.8          6.7         2.2 virginica
#119          7.7         2.6          6.9         2.3 virginica
#123          7.7         2.8          6.7         2.0 virginica
#132          7.9         3.8          6.4         2.0 virginica
#136          7.7         3.0          6.1         2.3 virginica

#OR
iris[iris$Sepal.Length>7.6,]
#    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#118          7.7         3.8          6.7         2.2 virginica
#119          7.7         2.6          6.9         2.3 virginica
#123          7.7         2.8          6.7         2.0 virginica
#132          7.9         3.8          6.4         2.0 virginica
#136          7.7         3.0          6.1         2.3 virginica

Reading in stuff

By default R sorts factor levels alphabetically, and sometimes I require a specific order because ggplot2 uses the level order when plotting. This page explains how to do so: http://rwiki.sciviews.org/doku.php?id=tips:data-factors:factors
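A minimal sketch of specifying the order yourself: pass it to the levels argument of factor().

```r
sizes <- c('small', 'large', 'medium', 'small')

# default: levels are sorted alphabetically
levels(factor(sizes))
#[1] "large"  "medium" "small"

# specify the order explicitly
sizes <- factor(sizes, levels = c('small', 'medium', 'large'))
levels(sizes)
#[1] "small"  "medium" "large"
```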

Set up an R script that takes in command line arguments

#!/bin/env Rscript
#store arguments as object
args <- commandArgs(TRUE)

#store the number of arguments
args_length <- length(args)

#check to see if 3 arguments were inputted
if (args_length != 3){
   stop("Requires three arguments: chr, start and end")
}

arg1 <- args[1]
arg2 <- args[2]
arg3 <- args[3]

Use R to download and load a file

download.file("https://dl.dropbox.com/u/7710864/courseraPublic/samsungData.rda",
              destfile="./samsungData.rda",
              method="curl")
#then load
load("samsungData.rda")

Read file and store as a vector

#the type of "what" gives the type of data to be read
#supported types are logical, integer, numeric, complex, character, raw and list
x <- scan("file.txt", what='raw', sep="\n")

Reading in a gzipped file (newer versions of R recognise gzipped files automatically)

data <- read.table(gzfile("file.tsv.gz"), header=T)
#this also works in newer version of R
data <- read.table('file.tsv.gz', header=T)

Reading from a remote url

filepath<-"http://dl.dropbox.com/u/1648032/ggplot2_tutorial_dataset.txt"
my_data <- read.table(file=url(filepath), header=T, sep="\t")
head(my_data)
#         Tribe Hab    BM var1
#1 Aepycerotini   L 56.25 36.5
#2 Aepycerotini   L 56.25 40.9
#3 Aepycerotini   L 56.25 37.0
#4 Aepycerotini   L 56.25 36.2
#5 Aepycerotini   L 56.25 36.6
#6 Aepycerotini   L 56.25 37.7

Reading in BAM files

#install Rsamtools
#(newer Bioconductor releases use BiocManager::install("Rsamtools") instead of biocLite)
source("http://bioconductor.org/biocLite.R")
biocLite("Rsamtools")

library(Rsamtools)
#the BAM file needs to be indexed first
indexBam("file.bam")
bam <- scanBam("file.bam")

Looking for stuff

Finding out column names

colnames(iris)
#[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

Finding out row names (I used head to limit the output)

head(rownames(iris))
#[1] "1" "2" "3" "4" "5" "6"

Using grep to search colnames

grep("length", colnames(iris), ignore.case=T)
#[1] 1 3
#to get the values
grep("length", colnames(iris), ignore.case=T, value=T)
#[1] "Sepal.Length" "Petal.Length"

Show all the datasets that are available

data()

Show all installed packages (this can be slow)

installed.packages()

# install a package if it is missing
if (!'randomForest' %in% installed.packages()[,1]){
   install.packages('randomForest')
}

Find a specific package

find.package('edgeR')

Find all functions of a package

ls("package:stringr")
 [1] "fixed"           "ignore.case"     "invert_match"    "perl"            "str_c"          
 [6] "str_count"       "str_detect"      "str_dup"         "str_extract"     "str_extract_all"
[11] "str_join"        "str_length"      "str_locate"      "str_locate_all"  "str_match"      
[16] "str_match_all"   "str_pad"         "str_replace"     "str_replace_all" "str_split"      
[21] "str_split_fixed" "str_sub"         "str_sub<-"       "str_trim"        "str_wrap"       
[26] "word"

Manipulating data

It is often important to normalise your data before analysis; use the scale() function.

x <- matrix(1:10, ncol = 2)
# scale by column means
scale(x, scale = FALSE)
     [,1] [,2]
[1,]   -2   -2
[2,]   -1   -1
[3,]    0    0
[4,]    1    1
[5,]    2    2
attr(,"scaled:center")
[1] 3 8

# if scale = TRUE, values are further normalised with the standard score
scale(x)
           [,1]       [,2]
[1,] -1.2649111 -1.2649111
[2,] -0.6324555 -0.6324555
[3,]  0.0000000  0.0000000
[4,]  0.6324555  0.6324555
[5,]  1.2649111  1.2649111
attr(,"scaled:center")
[1] 3 8
attr(,"scaled:scale")
[1] 1.581139 1.581139

# function for standard normalisation
z_scale <- function(x){
  (x - mean(x)) / sd(x)
}

# manually perform standard normalisation to check results
# results are the same as above
apply(scale(x, scale = FALSE), 2, z_scale)
           [,1]       [,2]
[1,] -1.2649111 -1.2649111
[2,] -0.6324555 -0.6324555
[3,]  0.0000000  0.0000000
[4,]  0.6324555  0.6324555
[5,]  1.2649111  1.2649111

Creating a cross table (use mosaicplot to visualise)

table(ChickWeight$Time, ChickWeight$Diet)
   
     1  2  3  4
 0  20 10 10 10
 2  20 10 10 10
 4  19 10 10 10
 6  19 10 10 10
 8  19 10 10 10
 10 19 10 10 10
 12 19 10 10 10
 14 18 10 10 10
 16 17 10 10 10
 18 17 10 10 10
 20 17 10 10  9
 21 16 10 10  9

mosaicplot(table(ChickWeight$Time, ChickWeight$Diet))

Concatenating two data frames

#requires same colnames
iris2 <- rbind(iris, iris)
dim(iris2)
#[1] 300   5
dim(iris)
#[1] 150   5

Split a string into a list

#split string into array
a <- 'acgactagcatcgatcagcatc'
blah <- strsplit(a, split='')
class(blah)
#[1] "list"
blah
[[1]]
 [1] "a" "c" "g" "a" "c" "t" "a" "g" "c" "a" "t" "c" "g" "a" "t" "c" "a" "g" "c"
[20] "a" "t" "c"

#use stringr
install.packages('stringr')
library(stringr)
str_split(a, "")
[[1]]
[1] ""  "a" "c" "g" "a" "c" "t" "a" "g" "c" "a" "t" "c" "g" "a" "t" "c" "a" "g" "c" "a" "t" "c"

Join all elements in a data frame

head(apply(iris, 1, paste, collapse=''))
#[1] "5.13.51.40.2setosa" "4.93.01.40.2setosa" "4.73.21.30.2setosa" "4.63.11.50.2setosa" "5.03.61.40.2setosa"
#[6] "5.43.91.70.4setosa"

Change the column names of a data frame

names(iris) <- c('one', 'two', 'three', 'four', 'five')
head(iris)
#  one two three four   five
#1 5.1 3.5   1.4  0.2 setosa
#2 4.9 3.0   1.4  0.2 setosa
#3 4.7 3.2   1.3  0.2 setosa
#4 4.6 3.1   1.5  0.2 setosa
#5 5.0 3.6   1.4  0.2 setosa
#6 5.4 3.9   1.7  0.4 setosa
#reset the iris dataset
data(iris)

Multiply two columns together

head(iris[,1]*iris[,2])
#[1] 17.85 14.70 15.04 14.26 18.00 21.06

Adding a new column to a data frame

data(iris)
iris$mean <- apply(iris[,-5], 1, mean)
head(iris)
#  Sepal.Length Sepal.Width Petal.Length Petal.Width Species  mean
#1          5.1         3.5          1.4         0.2  setosa 2.550
#2          4.9         3.0          1.4         0.2  setosa 2.375
#3          4.7         3.2          1.3         0.2  setosa 2.350
#4          4.6         3.1          1.5         0.2  setosa 2.350
#5          5.0         3.6          1.4         0.2  setosa 2.550
#6          5.4         3.9          1.7         0.4  setosa 2.850

Sorting a dataframe by column

head(iris[order(iris$Sepal.Length),])
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#14          4.3         3.0          1.1         0.1  setosa
#9           4.4         2.9          1.4         0.2  setosa
#39          4.4         3.0          1.3         0.2  setosa
#43          4.4         3.2          1.3         0.2  setosa
#42          4.5         2.3          1.3         0.3  setosa
#4           4.6         3.1          1.5         0.2  setosa

Dropping levels

levels(iris$Species)
#[1] "setosa"     "versicolor" "virginica"
setosa_only <- subset(iris, Species == 'setosa')
#subsetting keeps all the original levels; droplevels() removes the unused ones
setosa_only$Species <- droplevels(setosa_only$Species)
levels(setosa_only$Species)
#[1] "setosa"

Statistical

Function for geometric mean.

geo_mean <- function(x){
  prod(x)^(1/length(x))
}
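Note that prod() can overflow (or underflow to zero) on long vectors; a log-space version of the same formula, my addition here, avoids this:

```r
# exp(mean(log(x))) is algebraically identical to prod(x)^(1/length(x))
# but keeps intermediate values in a numerically safe range
geo_mean_log <- function(x){
  exp(mean(log(x)))
}

geo_mean_log(c(2, 8))
#[1] 4
```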

https://stats.stackexchange.com/questions/205403/fitting-negative-binomial-distribution-to-large-count-data

data <- read.csv("data.txt", header=FALSE)
library(fitdistrplus)
# fit a negative binomial distribution; mu and size are estimated by the fit
fit <- fitdist(data$V1, "nbinom")

Shapiro-Wilk Normality Test

# null hypothesis is that x
# is a normal distribution
shapiro.test(rnorm(100, mean = 5, sd = 3))

	Shapiro-Wilk normality test

data:  rnorm(100, mean = 5, sd = 3)
W = 0.99053, p-value = 0.7074

shapiro.test(runif(100, min = 2, max = 4))

	Shapiro-Wilk normality test

data:  runif(100, min = 2, max = 4)
W = 0.94602, p-value = 0.0004581

shapiro.test(rpois(100, 5))

	Shapiro-Wilk normality test

data:  rpois(100, 5)
W = 0.96591, p-value = 0.01084

Kolmogorov-Smirnov Tests

x <- rnorm(50)
y <- runif(30)
# null hypothesis is x and y come from the same distribution
# Do x and y come from the same distribution?
ks.test(x, y)

	Two-sample Kolmogorov-Smirnov test

data:  x and y
D = 0.56, p-value = 6.303e-06
alternative hypothesis: two-sided

z <- rnorm(10000)
x <- sample(z, 100)
y <- sample(z, 100)
ks.test(x, y)

	Two-sample Kolmogorov-Smirnov test

data:  x and y
D = 0.09, p-value = 0.8127
alternative hypothesis: two-sided

ANOVA

Graphics

Creating a grayscale gradient image

#the image function displays a colour image
#the gray function specifies the gray levels
image(as.matrix(seq(1,100,0.1)), col=gray((0:200)/200), yaxt='n', xaxt='n')
#image with legend
#use image.plot from the fields package
#http://www.image.ucar.edu/GSP/Software/Fields/Help/image.plot.html
install.packages("fields")
library(fields)
image.plot(correlations)
#function for error bars from http://monkeysuncle.stanford.edu/?p=485
error.bar <- function(x, y, upper, lower=upper, length=0.1,...){
   if(length(x) != length(y) | length(y) !=length(lower) | length(lower) != length(upper))
   stop("vectors must be same length")
   arrows(x,y+upper, x, y-lower, angle=90, code=3, length=length, ...)
}
#tweaking the x-axis of a histogram
hist(d,xlim=c(-10,10))
#adding a red vertical line to a plot at x=31
abline(v=31, col="red")
#export as postscript
postscript("somefile.eps")
plot(x,y)
dev.off()
#exporting as SVG, which can be opened using free software such as InkScape
install.packages("RSvgDevice")
library(RSvgDevice)
devSVG(file='file.svg')
dev.off()
#labelling plots
#the text function can place text according to the coordinates you give it
text(x=10,y=20,"some_label")

ggplot2

Various people believe that learning ggplot2 via the qplot() function is not the best way to learn it. I mention this here because all the examples below use qplot().

Drawing lines with ggplot2 http://wiki.stdout.org/rcookbook/Graphs/Lines%20(ggplot2)/

Code from http://ggplot2.org/book/
#install if necessary
install.packages("ggplot2")
#qplot stands for quick plot
set.seed(1410)
dsmall <- diamonds[sample(nrow(diamonds), 100), ]
qplot(carat, price, data = diamonds)
qplot(log(carat), log(price), data=diamonds)
#adding a legend is automatically done
qplot(carat, price, data = dsmall, colour= cut)
#adding transparency
qplot(carat, price, data = diamonds, alpha = I(1/10))
#adding a trend line
qplot(carat, price, data = diamonds, geom = c("point", "smooth"))
#controlling the smoothness
qplot(carat, price, data = dsmall, geom = c("point", "smooth"), span = 0.2)
#fitting with a generalised additive model (gam)
library(mgcv)
qplot(carat, price, data = dsmall, geom = c("point", "smooth"), method = "gam", formula = y ~ s(x))
qplot(carat, price, data = dsmall, geom = c("point", "smooth"), method = "gam", formula = y ~ s(x, bs = "cs"))
#fitting using a linear model
qplot(carat, price, data = dsmall, geom = c("point", "smooth"), method="lm")
#fitting using splines
library(splines)
qplot(carat, price, data = dsmall, geom = c("point", "smooth"), method="lm", formula = y ~ ns(x,5))
#using jittering and transparency
qplot(color, price / carat, data = diamonds, geom = "jitter", alpha = I(1/5))
#boxplot
qplot(color, price / carat, data = diamonds, geom = "boxplot")
#histograms and density plots
qplot(carat, data = diamonds, geom = "histogram")
qplot(carat, data = diamonds, geom = "density")
#histogram and adjusting the binwidth, xlim specifies range of x-axis from 0 to 3
qplot(carat, data = diamonds, geom = "histogram", binwidth = 0.01, xlim = c(0,3))

Line charts: http://www.cookbook-r.com/Graphs/Bar_and_line_graphs_(ggplot2)/#line-graphs

ggplot(data=df, aes(x=time, y=total_bill, group=1)) + geom_line()

gridExtra -> http://www.imachordata.com/extra-extra-get-your-gridextra/

Equivalents using base graphics, qplot(), and ggplot()

Making a scatter plot

plot(x, y)
qplot(x, y)
qplot(x, y, data=df)
ggplot(df, aes(x=x, y=y)) + geom_point()

Making a line graph

plot(x, y, type='l')
qplot(x, y, geom="line")
qplot(x, y, data=df, geom="line")
ggplot(df, aes(x=x, y=y)) + geom_line()
#line and points together
qplot(x, y, data=df, geom=c("line", "point"))
ggplot(df, aes(x=x, y=y)) + geom_line() + geom_point()

Making a bar graph

barplot(1:10, names.arg = letters[1:10])
qplot(factor(1:10), letters[1:10], geom="bar", stat="identity")
qplot(x, y, data=df, geom="bar", stat="identity")
ggplot(df, aes(x=x, y=y)) + geom_bar(stat="identity")

Making a histogram

hist(x, breaks=10)
qplot(x, data=df, binwidth=4)
ggplot(df, aes(x=x)) + geom_histogram(binwidth=4)

Making a density plot

plot(density(x))
qplot(x, geom='density')
ggplot(df, aes(x=x)) + geom_density()

Animations

http://playingwithr.blogspot.jp/2011/06/animated-plots-with-r.html

My post on creating animated plots in R: http://davetang.org/muse/2015/02/12/animated-plots-using-r/

Analysis

t-SNE using the Rtsne package; plots not shown.

library(Rtsne)
set.seed(31)

pca_data <- prcomp(log1p(iris[,-5]))
plot(pca_data$x, col = iris[,5], pch = 19)
tsne_data <- Rtsne(pca_data$x, pca = FALSE, check_duplicates = FALSE, perplexity = 30)
plot(tsne_data$Y, col = iris[,5], pch = 19)

Calculating the coefficient of determination (r-squared)

#http://www.r-tutor.com/elementary-statistics/simple-linear-regression/coefficient-determination
#using the faithful dataset
#Waiting time between eruptions and the duration of the eruption
head(faithful)
#  eruptions waiting
#1     3.600      79
#2     1.800      54
#3     3.333      74
#4     2.283      62
#5     4.533      85
#6     2.883      55
eruption.lm <- lm(eruptions ~ waiting, data=faithful)
names(eruption.lm)
# [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values"
# [6] "assign"        "qr"            "df.residual"   "xlevels"       "call"         
#[11] "terms"         "model"
summary(eruption.lm)$r.squared
#[1] 0.8114608
#compare with Pearson correlation
cor(faithful$eruptions, faithful$waiting)
#[1] 0.9008112

Logistic regression from http://www.ats.ucla.edu/stat/r/dae/logit.htm

library(aod)
library(ggplot2)
library(Rcpp)

mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
## two-way contingency table of categorical outcome and predictors
## we want to make sure there are not 0 cells
xtabs(~ admit + rank, data = mydata)
     rank
admit  1  2  3  4
    0 28 97 93 55
    1 33 54 28 12

mydata$rank <- factor(mydata$rank)
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
summary(mylogit)

Call:
glm(formula = admit ~ gre + gpa + rank, family = "binomial", 
    data = mydata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6268  -0.8662  -0.6388   1.1490   2.0790  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.989979   1.139951  -3.500 0.000465 ***
gre          0.002264   0.001094   2.070 0.038465 *  
gpa          0.804038   0.331819   2.423 0.015388 *  
rank2       -0.675443   0.316490  -2.134 0.032829 *  
rank3       -1.340204   0.345306  -3.881 0.000104 ***
rank4       -1.551464   0.417832  -3.713 0.000205 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 499.98  on 399  degrees of freedom
Residual deviance: 458.52  on 394  degrees of freedom
AIC: 470.52

Number of Fisher Scoring iterations: 4
#summarise data
#http://wiki.stdout.org/rcookbook/Manipulating%20data/Summarizing%20data/
#calculate the lower 95% CI for a Poisson distribution
poisson_lower_tail <- function(n) {
   qchisq(0.025, 2*n)/2
}
#calculate the upper 95% CI for a Poisson distribution
poisson_upper_tail <- function(n) {
   qchisq(0.975, 2*(n+1))/2
}
#function for calculating number of pairwise comparisons
number_of_comparison <- function(n) {
  (((n-1)^2) + (n-1)) / 2
}
#function for calculating z-score
z_score <- function(x){
   (x-mean(x))/sd(x)
}
#function for calculating Euclidean distance
euclidean_distance <- function(x1,x2){
  sqrt(sum((x1 - x2) ^ 2))
}
#hierarchical clustering of columns
hc <- hclust(as.dist(1-cor(data, method="spearman")), method="average")
#hierarchical clustering of rows
hr <- hclust(as.dist(1-cor(t(data), method="spearman")), method="average")
#labeling hclust
fit <- hclust(d)
plot(fit, labels=data$name)
#plot principal component analysis with labels
data <- read.table("test.tsv",header=T,row.names=1,comment="#")
data.pca <- prcomp(data)
plot(data.pca$rotation, type="n")
text(data.pca$rotation, row.names(data.pca$rotation))

edgeR

#plotSmear
detags <- rownames(d)[as.logical(de)]
plotSmear(et, de.tags=detags)
abline(h = c(-2, 2), col = "blue")

randomForest

#see also http://mkseo.pe.kr/stats/?p=220
#from http://www.r-bloggers.com/random-forest-variable-importance/
#install if necessary
install.packages("randomForest")
library(randomForest)
install.packages("party")
# simulate the data
x1=rnorm(1000)
x2=rnorm(1000,x1,1)
y=2*x1+rnorm(1000,0,.5)
df=data.frame(y,x1,x2,x3=rnorm(1000),x4=rnorm(1000),x5=rnorm(1000))
#run the randomForest implementation
library(randomForest)
rf1 <- randomForest(y~., data=df, mtry=2, ntree=50, importance=TRUE)
importance(rf1,type=1)
#run the party implementation
library(party)
cf1 <- cforest(y~.,data=df,control=cforest_unbiased(mtry=2,ntree=50))
varimp(cf1)
varimp(cf1,conditional=TRUE)

Bioinformatics

Biostrings

http://www.bioconductor.org/help/course-materials/2011/BioC2011/LabStuff/BiostringsBSgenomeOverview.pdf

Learning R

http://tryr.codeschool.com/

swirl

| You can exit swirl and return to the R prompt (>) at any time by pressing the Esc key. If
| you are already at the prompt, type bye() to exit and save your progress. When you exit
| properly, you'll see a short message letting you know you've done so.

| When you are at the R prompt (>):
| -- Typing skip() allows you to skip the current question.
| -- Typing play() lets you experiment with R on your own; swirl will ignore what you do...
| -- UNTIL you type nxt() which will regain swirl's attention.
| -- Typing bye() causes swirl to exit. Your progress will be saved.
| -- Typing main() returns you to swirl's main menu.
| -- Typing info() displays these options again.

Courses

Courses at https://github.com/swirldev/swirl_courses

To install the Open Intro course (a beginner level course that takes 1-2 hours to complete):

library(swirl)
install_from_swirl('Open Intro')
swirl()

The basics of regression modeling in R

install_from_swirl('Regression Models')

Some notes:

| Sir Francis studied the relationship between heights of parents and their children. His
| work showed that parents who were taller than average had children who were also tall but
| closer to the average height. Similarly, parents who were shorter than average had
| children who were also shorter than average but less so than mom and dad. That is, they
| were closer to the average height. From one generation to the next the heights moved
| closer to the average or regressed toward the mean.

Snippets from Learning R

Chapter 1

#the rep() function
rep("Testing", times = 3)
#bring up examples of usage for a function
example(min)
#list files in the current directory
list.files()
#run a R script
source("bottle1.R")

Chapter 2

#the seq() function
seq(5,9,0.5)

Chapter 3

A matrix is just a fancy term for a 2-dimensional array.

#reshape a vector into a matrix
plank <- 1:8
dim(plank) <- c(2,4)
#contour map using contour()
contour(elevation)
#3D perspective plot using persp()
persp(elevation)
#set highest value to 0.2
persp(elevation, expand=0.2)
#image() for creating heatmap
image(volcano)

Chapter 4

#drawing a horizontal line for the mean
abline(h=mean(limbs))

Chapter 5

#categorising values using factor()
chests <- c('gold', 'silver', 'gems', 'gold', 'gems')
types <- factor(chests)
#get factor levels using levels()
levels(types)
#plot characters for each type of factor
plot(weights, prices, pch=as.integer(types))

Chapter 6

#accessing a column of a data.frame
treasure$prices
#merging data.frame, which merges on common columns
merge(x=targets, y=infantry)

Chapter 7

#getting help on a package
help(package = 'ggplot2')
#making a qplot from ggplot2
qplot(weights, prices, color=types)

Datasets

A data.frame of the Galton (1888) height and cubit data set. Francis Galton introduced the "co-relation" in 1888 with a paper discussing how to measure the relationship between two variables. His primary example was the relationship between height and forearm length. The data table (‘cubits’) is taken from Galton (1888). Unfortunately, there seem to be some errors in the original data table in that the marginal totals do not match the table.

install.packages("psych")
library(psych)
head(heights)
  height cubit
1     71 17.75
2     71 18.25
3     71 18.25
4     71 18.25
5     71 18.75
6     71 18.75
install.packages("UsingR")
library(UsingR)
data(package="UsingR")
#Pearson's data set on heights of fathers and their sons
head(father.son)
#   fheight  sheight
#1 65.04851 59.77827
#2 63.25094 63.21404
#3 64.95532 63.34242
#4 65.75250 62.79238
#5 61.13723 64.28113
#6 63.02254 64.24221

Notes

The three core features of the R language are object orientation, vectorisation, and its functional programming style. R is designed to have all data in-memory.

R's functions work on copies of the arguments passed to them and cannot directly modify the caller's objects, i.e., arguments behave as if they were immutable. As such, the += and ++ operators do not exist in R.
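A small sketch of this copy-on-modify behaviour:

```r
increment <- function(v){
  v <- v + 1   # modifies the function's local copy only
  v
}

x <- c(1, 2, 3)
increment(x)
#[1] 2 3 4
x   # the caller's object is unchanged
#[1] 1 2 3

# the idiomatic equivalent of x += 1 is reassignment
x <- x + 1
x
#[1] 2 3 4
```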

R's special values

R has four special values: NA, NULL, Inf/-Inf, and NaN.

  • NA is R's built in value for representing missing data
  • NULL represents not having a value; unlike NA, NULL is its own object and can't be used in a vector
  • Inf/-Inf represent numbers that are too big for R
  • NaN stands for "not a number", which occurs when some computations don't return numbers
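A few expressions that produce each special value:

```r
NA > 1          # comparisons involving missing data stay missing
#[1] NA
c(1, NULL, 2)   # NULL is dropped when combined into a vector
#[1] 1 2
1/0             # too big for R
#[1] Inf
0/0             # not a number
#[1] NaN
is.na(NaN)      # note that NaN also counts as missing
#[1] TRUE
```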

Help

Get help on a class

?"numeric-class"

Help on functions in a package

library(help="edgeR")

Find out what methods are available for a given class of objects

methods(class="lm")

If you don't know what functions are available in a package, type

package_name::<tab>

Search for functions

help.search("cross tabulate")
#shortcut
??"cross tabulate"
#search functions with "table"
apropos('table')

Cookbook

Find peaks in a sequence of numbers using the diff() function, which returns the differences between consecutive numbers.

find_peak <- function(x){
  which(
    diff(
      sign(
        diff(x)
      )
    ) < 0
  ) + 1
}
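For example, applied to a small vector (note that flat plateaus and endpoints are edge cases this simple version doesn't handle):

```r
find_peak <- function(x){
  which(diff(sign(diff(x))) < 0) + 1
}

# values 3 and 5 are larger than both their neighbours
find_peak(c(1, 3, 2, 5, 4))
#[1] 2 4
```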

Links

Awesome cookbook http://wiki.stdout.org/rcookbook/

Etc.

Package for playing a sound when a job is complete: https://cran.r-project.org/web/packages/beepr/index.html