Updated 2014 June 24th

The aim of linear regression is to find the equation of the straight line that fits the data points the best; the best line is one that minimises the sum of squared residuals of the linear regression model. The equation of a straight line is:

where is the slope or gradient and is the y-intercept.

In R we can write a function that derives y based on x:

my_line <- function(x, m, b){ y <- m*x + b; return(y) } #the line y = 2x + 2 sapply(1:10, my_line, m=2, b=2) # [1] 4 6 8 10 12 14 16 18 20 22 x <- -10:10 library(ggplot2) qplot(x, sapply(x, my_line, m=2, b=2))

Given a set of points, we can work out the that fits the data the best using this formula:

Let's implement this in R:

slope <- function(x, y){ mean_x <- mean(x) mean_y <- mean(y) nom <- sum((x - mean_x)*(y-mean_y)) denom <- sum((x - mean_x)^2) m <- nom / denom return(m) } #then the intercept is intercept <- function(x, y, m){ b <- mean(y) - (m * mean(x)) return(b) }

I will use the Old Faithful Geyser Data to illustrate this function:

head(faithful) eruptions waiting 1 3.600 79 2 1.800 54 3 3.333 74 4 2.283 62 5 4.533 85 6 2.883 55

where there are 272 observations on 2 variables, which are the eruption time in minutes and the waiting time to the next eruption in minutes. Let's say we want to predict how long an eruption will last, given the waiting time. In this case the waiting time is the independent variable, and the eruption is the dependent variable, since the eruption time is dependent on the waiting time (I guess the inverse is also true but for this example, I'll make waiting time the independent variable). Let's work out the slope and the intercept:

eruption <- faithful$eruptions waiting <- faithful$waiting my_slope <- slope(waiting, eruption) my_intercept <- intercept(waiting, eruption, my_slope) my_slope #[1] 0.07562795 my_intercept #[1] -1.874016 #let's plot this #the x-axis by convention is for the independent variable plot(waiting, eruption, pch=19) #add the regression line abline(my_intercept, my_slope)

*The longer the wait, the longer the eruption time*.

In R the intercept and slope can be worked out by using the lm() function. The arguments for lm() are the dependent variables followed by ~ and then followed by the independent variables ():

wait_lm <- lm(eruption ~ waiting) wait_lm # #Call: #lm(formula = eruption ~ waiting) # #Coefficients: #(Intercept) waiting # -1.87402 0.07563 names(wait_lm) # [1] "coefficients" "residuals" "effects" "rank" "fitted.values" "assign" # [7] "qr" "df.residual" "xlevels" "call" "terms" "model"

The residual data is the difference between the observed data of the dependent variable and the fitted values. We can plot our observed values against the fitted values to see how well the regression model fits:

plot(waiting, wait_lm$residuals, pch=19) abline(h = 0)

Lastly how can we use this regression model to predict the duration of an eruption given a waiting time:

predict(wait_lm, data.frame(waiting=50)) # 1 #1.907381

For a waiting time of 50 minutes, the predicted eruption time would be 1.91 minutes.

### The example in Statistics for Dummies

I read a nice example in the "Statistics For Dummies" book on linear regression. The example data was the number of cricket (the insect) chirps vs. temperature. Here I redo the analysis using R. Firstly the five summaries you require for calculating the best fitting line are:

1. The mean of x

2. The mean of y

3. The sd of x

4. The sd of y

5. The r between x and y

chirp <- c(18,20,21,23,27,30,34,39) temp <- c(57,60,64,65,68,71,74,77) mean(chirp) #[1] 26.5 mean(temp) #[1] 67 sd(chirp) #[1] 7.387248 sd(temp) #[1] 6.845228 cor(chirp,temp) #[1] 0.9774791 #formula for the slope m, response variable first #m = r(sd(x) / sd(y)) cor(chirp,temp)*(sd(chirp)/sd(temp)) #[1] 1.054878 #formula for intercept #b = mean(y) - (m * mean(x)) m <- cor(chirp,temp)*(sd(chirp)/sd(temp)) mean(chirp) - (m * mean(temp)) #[1] -44.17683 #now the using the lm() function in R #From ?lm #Models for lm are specified symbolically. #A typical model has the form response ~ terms, where response is the response vector #and terms is a series of terms which specifies a linear predictor for response #chirps are the response from temperature lm(chirp~temp) #Call: #lm(formula = x ~ y) #Coefficients: #(Intercept) y # -44.177 1.055 #in the form y = mx + b #y = 1.055x - 44.177 #we can write a function to obtain y from x model <- function(x){ 1.055 * x - 44.177 } #how many chirps at 68 Fahrenheit using our linear model? model(68) [1] 27.563 model(77) [1] 37.058 #we can also use the predict() function reg <- lm(chirp~temp) predict(reg, data.frame(temp=68)) # 1 #27.55488 predict(reg, data.frame(temp=77)) # 1 #37.04878 #to get the coefficient of determination (i.e. R-square) summary(reg)$r.squared #[1] 0.9554655 #full summary summary(reg) summary(reg) # #Call: #lm(formula = chirp ~ temp) # #Residuals: # Min 1Q Median 3Q Max #-2.3354 -0.8872 -0.2195 1.1509 2.0488 # #Coefficients: # Estimate Std. Error t value Pr(>|t|) #(Intercept) -44.17683 6.25773 -7.06 0.000404 *** #temp 1.05488 0.09298 11.35 2.81e-05 *** #--- #Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 # #Residual standard error: 1.684 on 6 degrees of freedom #Multiple R-squared: 0.9555, Adjusted R-squared: 0.948 #F-statistic: 128.7 on 1 and 6 DF, p-value: 2.808e-05 #and finally the plot #usually the response variable is on the vertical axis i.e. y-axis, #which is the number of chirps, which is x plot(temp, chirp, pch=19) abline(reg)

Or a fancier scatterplot

library(ggplot2) #response variable on the y-axis qplot(temp, chirp, geom = c("point", "smooth"), method="lm")

This work is licensed under a Creative Commons

Attribution 4.0 International License.