R
I've been using R for the past couple of years, but there are some tasks that I constantly forget how to do. Here's a collection of some of them (a few I have since learned to remember, but I have left them intact).
Installing packages
Sometimes you need to include additional library paths; use `Sys.setenv` to add an additional location.
new_path <- paste0(Sys.getenv("LD_LIBRARY_PATH"), ":", "/usr/include")
Sys.setenv("LD_LIBRARY_PATH" = new_path)
Sys.getenv("LD_LIBRARY_PATH")
Anaconda
Install R with Conda: https://conda.io/docs/index.html
wget -c -N https://repo.continuum.io/archive/Anaconda3-5.1.0-Linux-x86_64.sh
bash Anaconda3-5.1.0-Linux-x86_64.sh
conda install gxx_linux-64
conda install r-essentials
conda install gfortran_linux-64
I still can't install the igraph package...
Vectors!
R is a vector-based language. What's a vector? The R manual defines a vector as "a single entity consisting of a collection of things." You can think of a vector as a row or column of numbers or text; vectors are the simplest type of data structure in R. The beauty of vector-based operations, as opposed to loops, is that a vectorised function works not just on a single value but on a whole vector of values at the same time. R lets you apply a function to the whole vector in a single operation, i.e., you can perform calculations on many values at once.
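As a minimal illustration of the above (a toy example of my own), here is squaring the elements of a vector with an explicit loop versus one vectorised operation:

```r
x <- 1:5

# loop version: operate on one element at a time
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# vectorised version: the whole vector in one operation
squares_vec <- x^2

identical(squares_loop, squares_vec)
#[1] TRUE
```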
lapply() applies a given function to each element in a list, so there will be several function calls. do.call() applies a given function to the list as a whole, so there is only one function call.
?do.call
#do.call constructs and executes a function call from a name or a function
#and a list of arguments to be passed to it
#see http://stackoverflow.com/questions/10801750/whats-the-difference-between-lapply-and-do-call-in-r
x <- list(1:3, 4:6, 7:9)
x
[[1]]
[1] 1 2 3

[[2]]
[1] 4 5 6

[[3]]
[1] 7 8 9

do.call(what = rbind, args = x)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
tmp <- list(1:10, 11:20, 21:30)
tmp
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10

[[2]]
 [1] 11 12 13 14 15 16 17 18 19 20

[[3]]
 [1] 21 22 23 24 25 26 27 28 29 30

lapply(tmp, sum)
[[1]]
[1] 55

[[2]]
[1] 155

[[3]]
[1] 255

sapply(tmp, sum)
[1]  55 155 255
Objects
S4 slots
slotNames(perf)
[1] "x.name"       "y.name"       "alpha.name"   "x.values"     "y.values"     "alpha.values"
# use @ to access each slot
perf@alpha.name
[1] "Cutoff"
Turn data into code that generates the data using dput(). Remember that dataframes are lists of columns!
dput(women)
structure(list(height = c(58, 59, 60, 61, 62, 63, 64, 65, 66,
67, 68, 69, 70, 71, 72), weight = c(115, 117, 120, 123, 126,
129, 132, 135, 139, 142, 146, 150, 154, 159, 164)), class = "data.frame", row.names = c(NA,
-15L))
General operations
Precision
noquote(unlist(format(.Machine)))
Deleting a column in a data frame by assigning NULL
data$bye <- NULL
Use page() to go through data
page(iris, 'print')
Specify your own directory for installing packages
# set .libPaths
.libPaths('/home/me/rlib')
Lookup table
two <- c("AA", "AS")
lut <- c("AA" = "American", "AS" = "Alaska", "B6" = "JetBlue")
two <- lut[two]
two
        AA         AS
"American"   "Alaska"
General summary of a data frame
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

install.packages('psych')
library(psych)
describe(iris)
             vars   n mean   sd median trimmed  mad min max range  skew kurtosis   se
Sepal.Length    1 150 5.84 0.83   5.80    5.81 1.04 4.3 7.9   3.6  0.31    -0.61 0.07
Sepal.Width     2 150 3.06 0.44   3.00    3.04 0.44 2.0 4.4   2.4  0.31     0.14 0.04
Petal.Length    3 150 3.76 1.77   4.35    3.76 1.85 1.0 6.9   5.9 -0.27    -1.42 0.14
Petal.Width     4 150 1.20 0.76   1.30    1.18 1.04 0.1 2.5   2.4 -0.10    -1.36 0.06
Species*        5 150 2.00 0.82   2.00    2.00 1.48 1.0 3.0   2.0  0.00    -1.52 0.07
Compare two objects using identical()
identical(object1, object2)
Limit scientific notation
options(scipen=10000)
Time an operation
start <- Sys.time()
for (i in 1:100000000){ }
end <- Sys.time()
print(end - start)
Time difference of 4.354291 secs

# or use system.time()
system.time(for (i in 1:100000000){})
   user  system elapsed
  3.318   0.115   3.446
Move packages from older R versions to newer installations
#on OS X
#different R versions are stored in /Library/Frameworks/R.framework/Versions
#use rsync to copy directories that aren't in the existing library directory
#make sure you are in /Library/Frameworks/R.framework
pwd
/Library/Frameworks/R.framework
#then run
rsync --ignore-existing -r -v Versions/3.1/Resources/library/ Versions/3.2/Resources/library
Subsetting using a list
my_chr <- c(1:22, 'X', 'Y')
lincrna_assembled <- subset(lincrna, lincrna$chromosome_name %in% my_chr)
Subsetting using with
df <- data.frame(a=1:10, b=11:20, c=21:30)
with(df, a > 5)
 [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
with(df, a > 5 & b > 19)
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
Populating a data frame from a loop
#http://stackoverflow.com/questions/13442461/populating-a-data-frame-in-r-in-a-loop
mylist <- list() #create an empty list
for (i in 1:5) {
  vec <- numeric(5) #preallocate a numeric vector
  for (j in 1:5) {
    #fill the vector
    vec[j] <- i^j
  }
  mylist[[i]] <- vec #put all vectors in the list
}
df <- do.call("rbind", mylist) #combine all vectors into a matrix
Search Path for R Objects
searchpaths()
Check for missing values using complete.cases()
#no missing values in the iris dataset
table(complete.cases(iris))
#TRUE
# 150
When using apply, how do I use more than one parameter?
#http://stackoverflow.com/questions/6827299/r-apply-function-with-multiple-parameters
my_list <- list(a=1, b=2, c=3)
multiply <- function(var1, var2){
  var1 * var2
}
var2 <- 2
sapply(my_list, multiply, var2=var2)
#a b c
#2 4 6
Compactly display the structure of an arbitrary R object
str(iris)
#'data.frame':   150 obs. of  5 variables:
# $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
The which() function returns the indices that are TRUE
which(10:20 < 15)
[1] 1 2 3 4 5
The which() function also works with matrices.
my_mat <- matrix(1:9, nrow=3)
my_mat
#     [,1] [,2] [,3]
#[1,]    1    4    7
#[2,]    2    5    8
#[3,]    3    6    9
which(my_mat < 5)
[1] 1 2 3 4
which(my_mat < 5, arr.ind = TRUE)
#     row col
#[1,]   1   1
#[2,]   2   1
#[3,]   3   1
#[4,]   1   2
How do I turn an object's name into a string?
obj_to_string <- function(x){
  name <- deparse(substitute(x))
  print(name)
}
my_obj <- 'blah'
obj_to_string(my_obj)
#[1] "my_obj"
Use match() to match items in two vectors
y <- c('three', 'two', 'one', 'onee')
x <- c('one', 'two', 'three')
#match returns, for each element of y, its index in x
match(y, x)
#[1] 3 2 1 NA
How do I subset/order a data frame according to a vector?
#use match
df <- data.frame(a = 1:10, b = letters[1:10])
df
    a b
1   1 a
2   2 b
3   3 c
4   4 d
5   5 e
6   6 f
7   7 g
8   8 h
9   9 i
10 10 j
x <- c(2, 10, 5, 6)
df[match(x, df[,1]),]
    a b
2   2 b
10 10 j
5   5 e
6   6 f
How do I turn a string into an object that contains something I specify?
#use assign
assign('q0', 1:10)
q0
#[1] 1 2 3 4 5 6 7 8 9 10
Use string to refer to object
x <- 42
eval(parse(text = "x"))
[1] 42
How do I concatenate strings?
#use paste
my_first_name <- 'Dave'
my_last_name <- 'Tang'
my_full_name <- paste(my_first_name, my_last_name, sep=" ")
my_full_name
#[1] "Dave Tang"
How do I output an object as tab delimited file?
write.table(data, file="blah.tsv", quote=F, sep="\t")
Creating a random number between 5.0 and 7.5
#set seed to make the example reproducible
set.seed(31)
runif(1, 5.0, 7.5)
#[1] 6.305395
Taking a subset of a dataset based on some condition
subset(iris, Sepal.Length > 7.6)
#    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#118          7.7         3.8          6.7         2.2 virginica
#119          7.7         2.6          6.9         2.3 virginica
#123          7.7         2.8          6.7         2.0 virginica
#132          7.9         3.8          6.4         2.0 virginica
#136          7.7         3.0          6.1         2.3 virginica

#OR, using logical indexing, which returns the same rows as above
iris[iris$Sepal.Length > 7.6,]
Reading in stuff
By default R sorts factor levels alphabetically, and sometimes I need a specific order because ggplot2 plots in level order. This page explains how to set the order: http://rwiki.sciviews.org/doku.php?id=tips:data-factors:factors
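A minimal sketch of reordering factor levels (toy data of my own): pass the desired order to the levels argument of factor().

```r
x <- c("low", "high", "medium", "low")

# default: levels are sorted alphabetically
levels(factor(x))
#[1] "high"   "low"    "medium"

# specify the order explicitly; ggplot2 will use this order
x <- factor(x, levels = c("low", "medium", "high"))
levels(x)
#[1] "low"    "medium" "high"
```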
Set up an R script that takes in command line arguments
#!/bin/env Rscript
#store arguments as object
args <- commandArgs(TRUE)
#store the number of arguments
args_length <- length(args)
#check to see if 3 arguments were inputted
if (args_length != 3){
  stop("Requires three arguments: chr, start and end")
}
arg1 <- args[1]
arg2 <- args[2]
arg3 <- args[3]
Use R to download and load a file
download.file("https://dl.dropbox.com/u/7710864/courseraPublic/samsungData.rda", destfile="./samsungData.rda", method="curl")
#then load
load("samsungData.rda")
Read file and store as a vector
#the type of "what" gives the type of data to be read
#supported types are logical, integer, numeric, complex, character, raw and list
x <- scan("file.txt", what='raw', sep="\n")
Reading in a gzipped file (newer versions of R recognise gzipped files automatically)
data <- read.table(gzfile("file.tsv.gz"), header=T)
#this also works in newer versions of R
data <- read.table('file.tsv.gz', header=T)
Reading from a remote url
filepath <- "http://dl.dropbox.com/u/1648032/ggplot2_tutorial_dataset.txt"
my_data <- read.table(file=url(filepath), header=T, sep="\t")
head(my_data)
#        Tribe Hab    BM var1
#1 Aepycerotini   L 56.25 36.5
#2 Aepycerotini   L 56.25 40.9
#3 Aepycerotini   L 56.25 37.0
#4 Aepycerotini   L 56.25 36.2
#5 Aepycerotini   L 56.25 36.6
#6 Aepycerotini   L 56.25 37.7
Reading in BAM files
#install Rsamtools
source("http://bioconductor.org/biocLite.R")
biocLite("Rsamtools")
library(Rsamtools)
#index the BAM file if necessary
indexBam("file.bam")
bam <- scanBam("file.bam")
Read list of CSV files into one data frame (from https://tidyr.tidyverse.org/articles/tidy-data.html#one-type-in-multiple-tables)
library(purrr)
paths <- dir("data", pattern = "\\.csv$", full.names = TRUE)
names(paths) <- basename(paths)
map_dfr(paths, read.csv, stringsAsFactors = FALSE, .id = "filename")
Looking for stuff
Finding out column names
colnames(iris)
#[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
Finding out row names (I used head to limit the output)
head(rownames(iris))
#[1] "1" "2" "3" "4" "5" "6"
Using grep to search colnames
grep("length", colnames(iris), ignore.case=T)
#[1] 1 3
#to get the values
grep("length", colnames(iris), ignore.case=T, value=T)
#[1] "Sepal.Length" "Petal.Length"
Show all the datasets that are available
data()
Show all installed packages (this can be slow)
installed.packages()
# install a package if it is missing
if (!'randomForest' %in% installed.packages()[,1]){
  install.packages('randomForest')
}
Find a specific package
find.package('edgeR')
Find all functions of a package
ls("package:stringr")
 [1] "fixed"           "ignore.case"     "invert_match"    "perl"            "str_c"
 [6] "str_count"       "str_detect"      "str_dup"         "str_extract"     "str_extract_all"
[11] "str_join"        "str_length"      "str_locate"      "str_locate_all"  "str_match"
[16] "str_match_all"   "str_pad"         "str_replace"     "str_replace_all" "str_split"
[21] "str_split_fixed" "str_sub"         "str_sub<-"       "str_trim"        "str_wrap"
[26] "word"
Manipulating data
It is often important to normalise your data before analysis; the scale() function does this.
x <- matrix(1:10, ncol = 2)

# scale by column means
scale(x, scale = FALSE)
     [,1] [,2]
[1,]   -2   -2
[2,]   -1   -1
[3,]    0    0
[4,]    1    1
[5,]    2    2
attr(,"scaled:center")
[1] 3 8

# if scale = TRUE, values are further normalised with the standard score
scale(x)
           [,1]       [,2]
[1,] -1.2649111 -1.2649111
[2,] -0.6324555 -0.6324555
[3,]  0.0000000  0.0000000
[4,]  0.6324555  0.6324555
[5,]  1.2649111  1.2649111
attr(,"scaled:center")
[1] 3 8
attr(,"scaled:scale")
[1] 1.581139 1.581139

# function for standard normalisation
z_scale <- function(x){
  (x - mean(x)) / sd(x)
}

# manually perform standard normalisation to check results
# results are the same as above
apply(scale(x, scale = FALSE), 2, z_scale)
           [,1]       [,2]
[1,] -1.2649111 -1.2649111
[2,] -0.6324555 -0.6324555
[3,]  0.0000000  0.0000000
[4,]  0.6324555  0.6324555
[5,]  1.2649111  1.2649111
Creating a cross table (use mosaicplot to visualise)
table(ChickWeight$Time, ChickWeight$Diet)

      1  2  3  4
  0  20 10 10 10
  2  20 10 10 10
  4  19 10 10 10
  6  19 10 10 10
  8  19 10 10 10
  10 19 10 10 10
  12 19 10 10 10
  14 18 10 10 10
  16 17 10 10 10
  18 17 10 10 10
  20 17 10 10  9
  21 16 10 10  9

mosaicplot(table(ChickWeight$Time, ChickWeight$Diet))
Concatenating two data frames
#requires the same colnames
iris2 <- rbind(iris, iris)
dim(iris2)
#[1] 300 5
dim(iris)
#[1] 150 5
Split a string into a list
#split string into individual characters
a <- 'acgactagcatcgatcagcatc'
blah <- strsplit(a, split="")
class(blah)
#[1] "list"
blah
[[1]]
 [1] "a" "c" "g" "a" "c" "t" "a" "g" "c" "a" "t" "c" "g" "a" "t" "c" "a" "g" "c"
[20] "a" "t" "c"

#use stringr
install.packages('stringr')
library(stringr)
str_split(a, "")
[[1]]
 [1] ""  "a" "c" "g" "a" "c" "t" "a" "g" "c" "a" "t" "c" "g" "a" "t" "c" "a" "g" "c" "a" "t" "c"
Join all elements in a data frame
head(apply(iris, 1, paste, collapse=""))
#[1] "5.13.51.40.2setosa" "4.93.01.40.2setosa" "4.73.21.30.2setosa" "4.63.11.50.2setosa" "5.03.61.40.2setosa"
#[6] "5.43.91.70.4setosa"
Change the column names of a data frame
names(iris) <- c('one', 'two', 'three', 'four', 'five')
head(iris)
#  one two three four   five
#1 5.1 3.5   1.4  0.2 setosa
#2 4.9 3.0   1.4  0.2 setosa
#3 4.7 3.2   1.3  0.2 setosa
#4 4.6 3.1   1.5  0.2 setosa
#5 5.0 3.6   1.4  0.2 setosa
#6 5.4 3.9   1.7  0.4 setosa
#reset the iris dataset
data(iris)
Multiply two columns together
head(iris[,1] * iris[,2])
#[1] 17.85 14.70 15.04 14.26 18.00 21.06
Adding a new column to a data frame
data(iris)
iris$mean <- apply(iris[,-5], 1, mean)
head(iris)
#  Sepal.Length Sepal.Width Petal.Length Petal.Width Species  mean
#1          5.1         3.5          1.4         0.2  setosa 2.550
#2          4.9         3.0          1.4         0.2  setosa 2.375
#3          4.7         3.2          1.3         0.2  setosa 2.350
#4          4.6         3.1          1.5         0.2  setosa 2.350
#5          5.0         3.6          1.4         0.2  setosa 2.550
#6          5.4         3.9          1.7         0.4  setosa 2.850
Sorting a dataframe by column
head(iris[order(iris$Sepal.Length),])
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#14          4.3         3.0          1.1         0.1  setosa
#9           4.4         2.9          1.4         0.2  setosa
#39          4.4         3.0          1.3         0.2  setosa
#43          4.4         3.2          1.3         0.2  setosa
#42          4.5         2.3          1.3         0.3  setosa
#4           4.6         3.1          1.5         0.2  setosa
Dropping levels
levels(iris$Species)
#[1] "setosa"     "versicolor" "virginica"
#droplevels() drops unused levels, e.g. after subsetting
iris_subset <- subset(iris, Species != 'virginica')
iris_subset$Species <- droplevels(iris_subset$Species)
levels(iris_subset$Species)
#[1] "setosa"     "versicolor"
Statistical
Function for geometric mean.
geo_mean <- function(x){
  prod(x)^(1/length(x))
}
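prod() can overflow (or underflow) for long vectors, so an equivalent formulation on the log scale is often safer; this alternative is my addition, not part of the original note:

```r
geo_mean_log <- function(x){
  # exp(mean(log(x))) is algebraically the same as prod(x)^(1/length(x))
  exp(mean(log(x)))
}

geo_mean_log(c(2, 8))
#[1] 4
```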
#fit a negative binomial distribution to a vector of counts
data <- read.csv("data.txt", header=FALSE)
library(fitdistrplus)
#get the fitted densities; mu and size are stored in fit
fit <- fitdist(data[,1], "nbinom")
Shapiro-Wilk Normality Test
# null hypothesis is that x is a normal distribution
shapiro.test(rnorm(100, mean = 5, sd = 3))

	Shapiro-Wilk normality test

data:  rnorm(100, mean = 5, sd = 3)
W = 0.99053, p-value = 0.7074

shapiro.test(runif(100, min = 2, max = 4))

	Shapiro-Wilk normality test

data:  runif(100, min = 2, max = 4)
W = 0.94602, p-value = 0.0004581

shapiro.test(rpois(100, 5))

	Shapiro-Wilk normality test

data:  rpois(100, 5)
W = 0.96591, p-value = 0.01084
Kolmogorov-Smirnov Tests
x <- rnorm(50)
y <- runif(30)
# null hypothesis is that x and y come from the same distribution
ks.test(x, y)

	Two-sample Kolmogorov-Smirnov test

data:  x and y
D = 0.56, p-value = 6.303e-06
alternative hypothesis: two-sided

z <- rnorm(10000)
x <- sample(z, 100)
y <- sample(z, 100)
ks.test(x, y)

	Two-sample Kolmogorov-Smirnov test

data:  x and y
D = 0.09, p-value = 0.8127
alternative hypothesis: two-sided
Tidyverse
Perform a function rowwise. https://stackoverflow.com/questions/53599684/dplyr-mutate-specific-columns-by-evaluating-lookup-cell-value
library(dplyr)
library(purrr)

my_df <- structure(
  list(
    chr = c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1"),
    pos = c(13273L, 17365L, 17407L, 17614L, 183358L, 183598L),
    ref = c("G", "C", "G", "G", "G", "C"),
    alt = c("C", "G", "A", "A", "C", "G"),
    A = c(0L, 0L, 4L, 6L, 0L, 0L),
    C = c(18L, 18L, 0L, 0L, 6L, 40L),
    G = c(0L, 1L, 12L, 10L, 12L, 6L),
    T = c(0L, 0L, 0L, 0L, 0L, 0L),
    all = c(18L, 19L, 16L, 16L, 18L, 46L)
  ),
  row.names = c(NA, -6L),
  class = c("tbl_df", "tbl", "data.frame")
)

my_df %>%
  mutate(ref_count = pmap_chr(
    .l = .,
    .f = function(...){
      row <- c(...)
      cols <- row["ref"]
      row[cols]
    }
  ))
pmap() can be used to work on a data frame row by row: each row is captured as a named vector inside the function via c(...), and the value in the "ref" column then serves as the name used to index into that vector. Essentially, each row of the data frame is passed as the arguments to the function.
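The same row-wise lookup can also be sketched in base R with matrix indexing; this is my own alternative on hypothetical toy data, not the original my_df:

```r
df <- data.frame(
  ref = c("A", "C", "G"),
  A   = c(5L, 0L, 1L),
  C   = c(0L, 7L, 2L),
  G   = c(3L, 1L, 9L)
)

# index the count matrix with (row, column) pairs built from the ref column
counts <- as.matrix(df[, c("A", "C", "G")])
df$ref_count <- counts[cbind(seq_len(nrow(df)), match(df$ref, colnames(counts)))]
df$ref_count
#[1] 5 7 9
```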
Graphics
Creating a grayscale gradient image
#image() displays a colour image
#gray() gives a gray level specification
image(as.matrix(seq(1,100,0.1)), col=gray((0:200)/200), yaxt='n', xaxt='n')
#image with legend: use image.plot from the fields package
#http://www.image.ucar.edu/GSP/Software/Fields/Help/image.plot.html
install.packages("fields")
library(fields)
image.plot(correlations)
#function for error bars from http://monkeysuncle.stanford.edu/?p=485
error.bar <- function(x, y, upper, lower=upper, length=0.1, ...){
  if(length(x) != length(y) | length(y) != length(lower) | length(lower) != length(upper))
    stop("vectors must be same length")
  arrows(x, y+upper, x, y-lower, angle=90, code=3, length=length, ...)
}
#tweaking the x-axis of a histogram
hist(d, xlim=c(-10,10))
#adding a red vertical line to a plot at x=31
abline(v=31, col="red")
#export as postscript
postscript("somefile.eps")
plot(x, y)
dev.off()
#exporting as SVG, which can be opened using free software such as Inkscape
install.packages("RSvgDevice")
library(RSvgDevice)
devSVG(file='file.svg')
dev.off()
#labelling plots
#the text() function places text at the coordinates you give it
text(x=10, y=20, "some_label")
ggplot2
Various people believe that learning ggplot2 via the qplot() function is not the best approach. I mention this because the examples below all use qplot().
Drawing lines with ggplot2 http://wiki.stdout.org/rcookbook/Graphs/Lines%20(ggplot2)/
Code from http://ggplot2.org/book/
#install if necessary
install.packages("ggplot2")
library(ggplot2)

#qplot stands for quick plot
set.seed(1410)
dsmall <- diamonds[sample(nrow(diamonds), 100), ]
qplot(carat, price, data = diamonds)
qplot(log(carat), log(price), data = diamonds)

#a legend is added automatically
qplot(carat, price, data = dsmall, colour = cut)

#adding transparency
qplot(carat, price, data = diamonds, alpha = I(1/10))

#adding a trend line
qplot(carat, price, data = diamonds, geom = c("point", "smooth"))

#controlling the smoothness
qplot(carat, price, data = dsmall, geom = c("point", "smooth"), span = 0.2)

#fitting with a generalised additive model (gam)
library(mgcv)
qplot(carat, price, data = dsmall, geom = c("point", "smooth"), method = "gam", formula = y ~ s(x))
qplot(carat, price, data = dsmall, geom = c("point", "smooth"), method = "gam", formula = y ~ s(x, bs = "cs"))

#fitting using a linear model
qplot(carat, price, data = dsmall, geom = c("point", "smooth"), method = "lm")

#fitting using splines
library(splines)
qplot(carat, price, data = dsmall, geom = c("point", "smooth"), method = "lm", formula = y ~ ns(x, 5))

#using jittering and transparency
qplot(color, price / carat, data = diamonds, geom = "jitter", alpha = I(1/5))

#boxplot
qplot(color, price / carat, data = diamonds, geom = "boxplot")

#histograms and density plots
qplot(carat, data = diamonds, geom = "histogram")
qplot(carat, data = diamonds, geom = "density")

#histogram with an adjusted binwidth; xlim specifies an x-axis range of 0 to 3
qplot(carat, data = diamonds, geom = "histogram", binwidth = 0.01, xlim = c(0, 3))
Line charts: http://www.cookbook-r.com/Graphs/Bar_and_line_graphs_(ggplot2)/#line-graphs
ggplot(data=df, aes(x=time, y=total_bill, group=1)) + geom_line()
gridExtra -> http://www.imachordata.com/extra-extra-get-your-gridextra/
Log scale
https://ggplot2.tidyverse.org/reference/annotation_logticks.html
Equivalents using base graphics, qplot(), and ggplot()
Making a scatter plot
plot(x, y)
qplot(x, y)
qplot(x, y, data=df)
ggplot(df, aes(x=x, y=y)) + geom_point()
Making a line graph
plot(x, y, type='l')
qplot(x, y, geom="line")
qplot(x, y, data=df, geom="line")
ggplot(df, aes(x=x, y=y)) + geom_line()

#line and points together
qplot(x, y, data=df, geom=c("line", "point"))
ggplot(df, aes(x=x, y=y)) + geom_line() + geom_point()
Making a bar graph
barplot(1:10, names.arg = letters[1:10])
qplot(factor(1:10), letters[1:10], geom="bar", stat="identity")
qplot(x, y, data=df, geom="bar", stat="identity")
ggplot(df, aes(x=x, y=y)) + geom_bar(stat="identity")
Making a histogram
hist(x, breaks=10)
qplot(x, data=df, binwidth=4)
ggplot(df, aes(x=x)) + geom_histogram(binwidth=4)
Making a density plot
plot(density(x))
qplot(x, geom='density')
ggplot(df, aes(x=x)) + geom_density()
Animations
http://playingwithr.blogspot.jp/2011/06/animated-plots-with-r.html
My post on creating animated plots in R: http://davetang.org/muse/2015/02/12/animated-plots-using-r/
Analysis
t-SNE using the Rtsne package; plots not shown.
library(Rtsne)
set.seed(31)
pca_data <- prcomp(log1p(iris[,-5]))
plot(pca_data$x, col = iris[,5], pch = 19)
tsne_data <- Rtsne(pca_data$x, pca = FALSE, check_duplicates = FALSE, perplexity = 30)
plot(tsne_data$Y, col = iris[,5], pch = 19)
Calculating the coefficient of determination (r-squared)
#http://www.r-tutor.com/elementary-statistics/simple-linear-regression/coefficient-determination
#using the faithful dataset:
#waiting time between eruptions and the duration of the eruption
head(faithful)
#  eruptions waiting
#1     3.600      79
#2     1.800      54
#3     3.333      74
#4     2.283      62
#5     4.533      85
#6     2.883      55
eruption.lm <- lm(eruptions ~ waiting, data=faithful)
names(eruption.lm)
# [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values"
# [6] "assign"        "qr"            "df.residual"   "xlevels"       "call"
#[11] "terms"         "model"
summary(eruption.lm)$r.squared
#[1] 0.8114608
#compare with the Pearson correlation; its square equals the r-squared
cor(faithful$eruptions, faithful$waiting)
#[1] 0.9008112
Logistic regression from http://www.ats.ucla.edu/stat/r/dae/logit.htm
library(aod)
library(ggplot2)
library(Rcpp)
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")

## two-way contingency table of categorical outcome and predictors
## we want to make sure there are no 0 cells
xtabs(~ admit + rank, data = mydata)
     rank
admit  1  2  3  4
    0 28 97 93 55
    1 33 54 28 12

mydata$rank <- factor(mydata$rank)
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
summary(mylogit)

Call:
glm(formula = admit ~ gre + gpa + rank, family = "binomial", data = mydata)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.6268 -0.8662 -0.6388  1.1490  2.0790

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.989979   1.139951  -3.500 0.000465 ***
gre          0.002264   0.001094   2.070 0.038465 *
gpa          0.804038   0.331819   2.423 0.015388 *
rank2       -0.675443   0.316490  -2.134 0.032829 *
rank3       -1.340204   0.345306  -3.881 0.000104 ***
rank4       -1.551464   0.417832  -3.713 0.000205 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 499.98 on 399 degrees of freedom
Residual deviance: 458.52 on 394 degrees of freedom
AIC: 470.52

Number of Fisher Scoring iterations: 4
#summarise data
#http://wiki.stdout.org/rcookbook/Manipulating%20data/Summarizing%20data/
#calculate the lower 95% CI for a Poisson distribution
poisson_lower_tail <- function(n) {
  qchisq(0.025, 2*n)/2
}

#calculate the upper 95% CI for a Poisson distribution
poisson_upper_tail <- function(n) {
  qchisq(0.975, 2*(n+1))/2
}
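A quick usage sketch for the two Poisson CI helpers (my own example, for an observed count of 10):

```r
poisson_lower_tail <- function(n) { qchisq(0.025, 2*n)/2 }
poisson_upper_tail <- function(n) { qchisq(0.975, 2*(n+1))/2 }

# exact 95% CI for an observed Poisson count of 10
c(poisson_lower_tail(10), poisson_upper_tail(10))
#approximately 4.80 and 18.39
```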
#function for calculating the number of pairwise comparisons
number_of_comparison <- function(n) {
  (((n-1)^2) + (n-1)) / 2
}

#function for calculating z-scores
z_score <- function(x){
  (x - mean(x)) / sd(x)
}

#function for calculating Euclidean distance
euclidean_distance <- function(x1, x2){
  sqrt(sum((x1 - x2)^2))
}

#hierarchical clustering of columns
hc <- hclust(as.dist(1 - cor(data, method="spearman")), method="average")

#hierarchical clustering of rows
hr <- hclust(as.dist(1 - cor(t(data), method="spearman")), method="average")

#labelling hclust
fit <- hclust(d)
plot(fit, labels=data$name)

#plot principal component analysis with labels
data <- read.table("test.tsv", header=T, row.names=1, comment="#")
data.pca <- prcomp(data)
plot(data.pca$rotation, type="n")
text(data.pca$rotation, row.names(data.pca$rotation))
edgeR
#plotSmear
detags <- rownames(d)[as.logical(de)]
plotSmear(et, de.tags=detags)
abline(h = c(-2, 2), col = "blue")
randomForest
#see also http://mkseo.pe.kr/stats/?p=220
#from http://www.r-bloggers.com/random-forest-variable-importance/
#install if necessary
install.packages("randomForest")
install.packages("party")

#simulate the data
x1 <- rnorm(1000)
x2 <- rnorm(1000, x1, 1)
y <- 2*x1 + rnorm(1000, 0, .5)
df <- data.frame(y, x1, x2, x3=rnorm(1000), x4=rnorm(1000), x5=rnorm(1000))

#run the randomForest implementation
library(randomForest)
rf1 <- randomForest(y~., data=df, mtry=2, ntree=50, importance=TRUE)
importance(rf1, type=1)

#run the party implementation
library(party)
cf1 <- cforest(y~., data=df, control=cforest_unbiased(mtry=2, ntree=50))
varimp(cf1)
varimp(cf1, conditional=TRUE)
Bioinformatics
Biostrings
Learning R
swirl
| You can exit swirl and return to the R prompt (>) at any time by pressing the Esc key.
| If you are already at the prompt, type bye() to exit and save your progress. When you
| exit properly, you'll see a short message letting you know you've done so.
| When you are at the R prompt (>):
| -- Typing skip() allows you to skip the current question.
| -- Typing play() lets you experiment with R on your own; swirl will ignore what you do...
| -- UNTIL you type nxt() which will regain swirl's attention.
| -- Typing bye() causes swirl to exit. Your progress will be saved.
| -- Typing main() returns you to swirl's main menu.
| -- Typing info() displays these options again.
Courses
Courses at https://github.com/swirldev/swirl_courses
To install the Open Intro course (a beginner level course that takes 1-2 hours to complete):
library(swirl)
install_from_swirl('Open Intro')
swirl()
The basics of regression modeling in R
install_from_swirl('Regression Models')
Some notes:
| Sir Francis studied the relationship between heights of parents and their children. His
| work showed that parents who were taller than average had children who were also tall but
| closer to the average height. Similarly, parents who were shorter than average had
| children who were also shorter than average but less so than mom and dad. That is, they
| were closer to the average height. From one generation to the next the heights moved
| closer to the average or regressed toward the mean.
Snippets from Learning R
Chapter 1
#the rep() function
rep("Testing", times = 3)

#bring up examples of usage for a function
example(min)

#list files in the current directory
list.files()

#run an R script
source("bottle1.R")
Chapter 2
#the seq() function
seq(5, 9, 0.5)
Chapter 3
A matrix is just a fancy term for a 2-dimensional array.
#reshape a vector into a matrix
plank <- 1:8
dim(plank) <- c(2,4)

#contour map using contour()
contour(elevation)

#3D perspective plot using persp()
persp(elevation)
#shrink the vertical scale with expand=0.2
persp(elevation, expand=0.2)

#image() for creating a heatmap
image(volcano)
Chapter 4
#drawing a horizontal line for the mean
abline(h=mean(limbs))
Chapter 5
#categorising values using factor()
chests <- c('gold', 'silver', 'gems', 'gold', 'gems')
types <- factor(chests)

#get factor levels using levels()
levels(types)

#plot characters for each type of factor
plot(weights, prices, pch=as.integer(types))
Chapter 6
#accessing a column of a data.frame
treasure$prices

#merging data frames on common columns
merge(x=targets, y=infantry)
Chapter 7
#getting help on a package
help(package = 'ggplot2')

#making a qplot from ggplot2
qplot(weights, prices, color=types)
Datasets
A data.frame of the Galton (1888) height and cubit data set. Francis Galton introduced the "co-relation" in 1888 with a paper discussing how to measure the relationship between two variables. His primary example was the relationship between height and forearm length. The data table (‘cubits’) is taken from Galton (1888). Unfortunately, there seem to be some errors in the original data table in that the marginal totals do not match the table.
install.packages("psych")
library(psych)
head(heights)
  height cubit
1     71 17.75
2     71 18.25
3     71 18.25
4     71 18.25
5     71 18.75
6     71 18.75
install.packages("UsingR")
library(UsingR)
data(package="UsingR")
#Pearson's data set on heights of fathers and their sons
head(father.son)
#   fheight  sheight
#1 65.04851 59.77827
#2 63.25094 63.21404
#3 64.95532 63.34242
#4 65.75250 62.79238
#5 61.13723 64.28113
#6 63.02254 64.24221
Notes
The three core features of the R language are object orientation, vectorisation, and its functional programming style. R is designed to have all data in memory.

R's functions work with copies of the arguments passed to them and cannot directly modify the caller's objects; in effect, objects behave as immutable (copy-on-modify). This is also why the += and ++ operators do not exist in R.
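A small sketch of this copy-on-modify behaviour (my own example): a function changes only its local copy of the argument, never the caller's object.

```r
add_one <- function(v){
  v <- v + 1  # modifies the local copy only
  v
}

x <- c(1, 2, 3)
y <- add_one(x)
x  # unchanged
#[1] 1 2 3
y
#[1] 2 3 4
```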
R's special values
R has four special values: NA, NULL, Inf/-Inf, and NaN.
- NA is R's built in value for representing missing data
- NULL represents not having a value; unlike NA, NULL is its own object and can't be used in a vector
- Inf/-Inf represent numbers that are too big for R
- NaN stands for "not a number", which occurs when some computations don't return numbers
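A few quick checks of these special values (illustrative examples of my own):

```r
# NA marks missing data
is.na(c(1, NA, 3))
#[1] FALSE  TRUE FALSE

# NULL disappears when placed in a vector
length(c(1, NULL, 3))
#[1] 2

# numbers too big for R, and computations that don't return numbers
1/0   # Inf
-1/0  # -Inf
0/0   # NaN
is.nan(0/0)
#[1] TRUE
```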
Help
Get help on a class
?"numeric-class"
Help on functions in a package
library(help="edgeR")
Find out what methods are available for a given class of objects
methods(class="lm")
If you don't know what functions are available in a package, type
package_name::<tab>
Search for functions
help.search("cross tabulate")
#shortcut
??"cross tabulate"
#search functions with "table"
apropos('table')
Cookbook
Find peaks in a sequence of numbers using the diff() function, which subtracts each number from the number that follows it.
find_peak <- function(x){
  which( diff( sign( diff(x) ) ) < 0 ) + 1
}
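A usage sketch for find_peak() on a toy input of my own:

```r
find_peak <- function(x){
  which( diff( sign( diff(x) ) ) < 0 ) + 1
}

x <- c(1, 3, 2, 5, 4, 6, 1)
find_peak(x)  # positions of the local maxima 3, 5 and 6
#[1] 2 4 6
```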
Deep learning
Using Ubuntu 22.04.
First install Miniconda https://docs.conda.io/en/latest/miniconda.html
Run the following to see if you need to install Nvidia drivers.
nvidia-smi
If necessary you can install Nvidia drivers using https://www.nvidia.com/Download/index.aspx or using Software & Updates and selecting the Additional Drivers tab.
https://www.tensorflow.org/install/pip?hl=en#step-by-step_instructions
conda install -c conda-forge cudatoolkit=11.2 cudnn=8.1.0
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/
python3 -m pip install tensorflow

# Verify install:
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
2022-06-08 16:02:09.924483: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-08 16:02:09.992160: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-08 16:02:09.992676: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
For RStudio
sudo apt install python3.10-venv
In RStudio
install.packages("tensorflow")
library(tensorflow)
install_tensorflow(version = "gpu")
# use the Conda installation above (and probably don't need to run install_tensorflow() anymore)
Sys.setenv(LD_LIBRARY_PATH=paste("/home/tan118/miniconda3/lib", Sys.getenv("LD_LIBRARY_PATH"), sep=":"))

# verify
tf$constant("Hellow Tensorflow")
tf.Tensor(b'Hellow Tensorflow', shape=(), dtype=string)

# install Keras
install.packages("keras")
library(keras)
install_keras(tensorflow = "gpu")
Test
library(tensorflow)
Sys.setenv(LD_LIBRARY_PATH=paste("/home/tan118/miniconda3/lib", Sys.getenv("LD_LIBRARY_PATH"), sep=":"))
library(keras)
backend()
Links
Awesome cookbook http://wiki.stdout.org/rcookbook/
Etc.
Package for playing a sound when a job is complete: https://cran.r-project.org/web/packages/beepr/index.html