# Manual linear regression analysis using R

Updated 2014 June 24th

The aim of linear regression is to find the equation of the straight line that fits the data points the best; the best line is one that minimises the sum of squared residuals of the linear regression model. The equation of a straight line is:

where $m$ is the slope or gradient and $b$ is the y-intercept.

In R we can write a function that derives y based on x:

my_line <- function(x, m, b){ y <- m*x + b; return(y) }
#the line y = 2x + 2
sapply(1:10, my_line, m=2, b=2)
# [1]  4  6  8 10 12 14 16 18 20 22
x <- -10:10
library(ggplot2)
qplot(x, sapply(x, my_line, m=2, b=2))


Given a set of points, we can work out the $m$ that fits the data the best using this formula:

Let's implement this in R:

slope <- function(x, y){
mean_x <- mean(x)
mean_y <- mean(y)
nom <- sum((x - mean_x)*(y-mean_y))
denom <- sum((x - mean_x)^2)
m <- nom / denom
return(m)
}

#then the intercept is
intercept <- function(x, y, m){
b <- mean(y) - (m * mean(x))
return(b)
}


I will use the Old Faithful Geyser Data to illustrate this function:

head(faithful)
eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55


where there are 272 observations on 2 variables, which are the eruption time in minutes and the waiting time to the next eruption in minutes. Let's say we want to predict how long an eruption will last, given the waiting time. In this case the waiting time is the independent variable, and the eruption is the dependent variable, since the eruption time is dependent on the waiting time (I guess the inverse is also true but for this example, I'll make waiting time the independent variable). Let's work out the slope and the intercept:

eruption <- faithful$eruptions waiting <- faithful$waiting

my_slope <- slope(waiting, eruption)
my_intercept <- intercept(waiting, eruption, my_slope)

my_slope
#[1] 0.07562795
my_intercept
#[1] -1.874016

#let's plot this
#the x-axis by convention is for the independent variable
plot(waiting, eruption, pch=19)

abline(my_intercept, my_slope)


The longer the wait, the longer the eruption time.

In R the intercept and slope can be worked out by using the lm() function. The arguments for lm() are the dependent variables followed by ~ and then followed by the independent variables ():

wait_lm <- lm(eruption ~ waiting)
wait_lm
#
#Call:
#lm(formula = eruption ~ waiting)
#
#Coefficients:
#(Intercept)      waiting
#   -1.87402      0.07563

names(wait_lm)
# [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values" "assign"
# [7] "qr"            "df.residual"   "xlevels"       "call"          "terms"         "model"


The residual data is the difference between the observed data of the dependent variable and the fitted values. We can plot our observed values against the fitted values to see how well the regression model fits:

plot(waiting, wait_lm$residuals, pch=19) abline(h = 0)  Lastly how can we use this regression model to predict the duration of an eruption given a waiting time: predict(wait_lm, data.frame(waiting=50)) # 1 #1.907381  For a waiting time of 50 minutes, the predicted eruption time would be 1.91 minutes. ### The example in Statistics for Dummies I read a nice example in the "Statistics For Dummies" book on linear regression. The example data was the number of cricket (the insect) chirps vs. temperature. Here I redo the analysis using R. Firstly the five summaries you require for calculating the best fitting line are: 1. The mean of x 2. The mean of y 3. The sd of x 4. The sd of y 5. The r between x and y chirp <- c(18,20,21,23,27,30,34,39) temp <- c(57,60,64,65,68,71,74,77) mean(chirp) #[1] 26.5 mean(temp) #[1] 67 sd(chirp) #[1] 7.387248 sd(temp) #[1] 6.845228 cor(chirp,temp) #[1] 0.9774791 #formula for the slope m, response variable first #m = r(sd(x) / sd(y)) cor(chirp,temp)*(sd(chirp)/sd(temp)) #[1] 1.054878 #formula for intercept #b = mean(y) - (m * mean(x)) m <- cor(chirp,temp)*(sd(chirp)/sd(temp)) mean(chirp) - (m * mean(temp)) #[1] -44.17683 #now the using the lm() function in R #From ?lm #Models for lm are specified symbolically. #A typical model has the form response ~ terms, where response is the response vector #and terms is a series of terms which specifies a linear predictor for response #chirps are the response from temperature lm(chirp~temp) #Call: #lm(formula = x ~ y) #Coefficients: #(Intercept) y # -44.177 1.055 #in the form y = mx + b #y = 1.055x - 44.177 #we can write a function to obtain y from x model <- function(x){ 1.055 * x - 44.177 } #how many chirps at 68 Fahrenheit using our linear model? model(68) [1] 27.563 model(77) [1] 37.058 #we can also use the predict() function reg <- lm(chirp~temp) predict(reg, data.frame(temp=68)) # 1 #27.55488 predict(reg, data.frame(temp=77)) # 1 #37.04878 #to get the coefficient of determination (i.e. R-square) summary(reg)$r.squared
#[1] 0.9554655

#full summary
summary(reg)

summary(reg)
#
#Call:
#lm(formula = chirp ~ temp)
#
#Residuals:
#    Min      1Q  Median      3Q     Max
#-2.3354 -0.8872 -0.2195  1.1509  2.0488
#
#Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
#(Intercept) -44.17683    6.25773   -7.06 0.000404 ***
#temp          1.05488    0.09298   11.35 2.81e-05 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 1.684 on 6 degrees of freedom
#Multiple R-squared:  0.9555,	Adjusted R-squared:  0.948
#F-statistic: 128.7 on 1 and 6 DF,  p-value: 2.808e-05

#and finally the plot
#usually the response variable is on the vertical axis i.e. y-axis,
#which is the number of chirps, which is x
plot(temp, chirp, pch=19)
abline(reg)


Or a fancier scatterplot

library(ggplot2)
#response variable on the y-axis
qplot(temp, chirp, geom = c("point", "smooth"), method="lm")