Updated 2017 September 5th

The aim of linear regression is to find the equation of the straight line that best fits the data points; the best line is the one that minimises the sum of squared residuals of the linear regression model. The equation of a straight line is:

$$y = mx + b$$

where $m$ is the slope (or gradient) and $b$ is the y-intercept.
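To make "minimises the sum of squared residuals" concrete, here's a small sketch (using made-up points and arbitrary candidate lines, not data from this post) that computes the residual sum of squares for a few lines; the least squares line is the one with the smallest value:

```r
# toy data, made up for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 4.3, 5.9, 8.2, 9.8)

# residual sum of squares for the line y = m*x + b
rss <- function(m, b, x, y) {
  sum((y - (m * x + b))^2)
}

# a couple of arbitrary candidate lines
rss(1.5, 1, x, y)
rss(2.5, -1, x, y)

# the least squares line should have a smaller RSS than both
toy_fit <- lm(y ~ x)
rss(coef(toy_fit)["x"], coef(toy_fit)["(Intercept)"], x, y)
```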

Here's a quick example using the "women" dataset that comes with R.

```r
# install the packages below if you haven't already
install.packages('ggplot2')
install.packages('cowplot')

# load libraries
library(ggplot2)
library(cowplot)

# check out the data frame
head(women)
#   height weight
# 1     58    115
# 2     59    117
# 3     60    120
# 4     61    123
# 5     62    126
# 6     63    129

# plot height vs. weight and fit a linear model
ggplot(women, aes(height, weight)) +
  geom_point() +
  geom_smooth(method = "lm")
```

*There seems to be a relationship between height and weight*.

We can work out $m$ for the above line using this formula:

$$m = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

and we can work out $b$ with:

$$b = \bar{y} - m\bar{x}$$
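Note that if we divide both the numerator and the denominator of the slope formula by $n - 1$, the slope is just the covariance of $x$ and $y$ divided by the variance of $x$:

$$m = \frac{\mathrm{cov}(x, y)}{\mathrm{var}(x)}$$

which is what the slope2() function below uses.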

We can write these functions in R as:

```r
slope <- function(x, y){
  mean_x <- mean(x)
  mean_y <- mean(y)
  nom <- sum((x - mean_x) * (y - mean_y))
  denom <- sum((x - mean_x)^2)
  m <- nom / denom
  return(m)
}

# the slope formula is just
# covariance(x, y) / variance(x)
slope2 <- function(x, y){
  return(cov(x, y) / var(x))
}

intercept <- function(x, y, m){
  b <- mean(y) - (m * mean(x))
  return(b)
}
```

Before we work out $m$ and $b$, a word on dependent and independent variables. When modelling, we want to work out the dependent variable from an independent variable (or variables). For this post, we would like to work out weight from height, therefore our independent variable is height and our dependent variable is weight. (It could be the other way around too.) When plotting, the independent variable usually goes on the x-axis and the dependent variable on the y-axis (as I have done above). Now to use our slope() and intercept() functions.

```r
my_slope <- slope(women$height, women$weight)
my_intercept <- intercept(women$height, women$weight, my_slope)

my_slope
# [1] 3.45
my_intercept
# [1] -87.51667

# let's plot this using base R functions
plot(women$height, women$weight, pch = 19)
# add the regression line
abline(my_intercept, my_slope)
```

*This plot looks similar to the ggplot2 version without the confidence intervals*.
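If we want the confidence band in base R too, one way (a rough sketch; it jumps ahead to the lm() function, which is introduced just below) is to use predict() with interval = "confidence" and draw the lower and upper bounds as dashed lines:

```r
fit <- lm(weight ~ height, women)

# predict over a grid of heights with a 95% confidence interval
new_heights <- data.frame(
  height = seq(min(women$height), max(women$height), length.out = 50)
)
ci <- predict(fit, new_heights, interval = "confidence")

plot(women$height, women$weight, pch = 19)
abline(fit)
# dashed lines for the lower and upper confidence bounds
lines(new_heights$height, ci[, "lwr"], lty = 2)
lines(new_heights$height, ci[, "upr"], lty = 2)
```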

The lm() function in R can work out the intercept and slope for us (and other things too). The arguments for lm() are a formula and the data; the formula starts with the dependent variable, followed by a tilde (~), followed by the independent variable:

```r
fit <- lm(weight ~ height, women)
fit
#
# Call:
# lm(formula = weight ~ height, data = women)
#
# Coefficients:
# (Intercept)       height
#      -87.52         3.45

# other values stored in the lm object
names(fit)
#  [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values" "assign"        "qr"
#  [8] "df.residual"   "xlevels"       "call"          "terms"         "model"
```
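As a quick sanity check, the coefficients stored in the lm object match the slope and intercept we calculated with our own functions:

```r
# coefficients from lm()
coef(fit)
# (Intercept)      height
#   -87.51667     3.45000

# our manual calculations from earlier
c(my_intercept, my_slope)
# [1] -87.51667   3.45000
```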

The residuals are the differences between the observed values of the dependent variable and the fitted values. We can plot the residuals against the observed values to see how well the regression model fits.

```r
plot(women$weight, fit$residuals, pch = 19)
abline(h = 0, lty = 2)
```

*The fourth point from the left has a residual that is close to zero; this is because the fitted weight (122.9333 pounds) is almost the same as the real weight (123 pounds) for someone who is 61 inches tall*.
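As a quick check (using the fit object from above), the residuals are just the observed weights minus the fitted values, and the fourth fitted value is indeed the 122.9333 pounds mentioned above:

```r
# residuals are observed minus fitted values (should be TRUE)
all.equal(unname(women$weight - fit$fitted.values), unname(fit$residuals))

# the fitted weight for the fourth observation (61 inches tall)
fit$fitted.values[4]
#        4
# 122.9333
```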

We can use the predict() function to predict someone's weight based on their height.

```r
# predict someone's weight when they are 61 inches tall
predict(fit, data.frame(height = 61))
#        1
# 122.9333

# predict someone's weight when they are 80 inches tall
predict(fit, data.frame(height = 80))
#        1
# 188.4833
```
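One caveat with the second prediction: 80 inches is outside the range of heights in the women dataset, so it is an extrapolation and should be treated with more caution.

```r
# heights in the dataset only span 58 to 72 inches
range(women$height)
# [1] 58 72
```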

### The example in Statistics for Dummies

I read a nice example on linear regression in the "Statistics For Dummies" book, and here I'll perform the same analysis using R. The example data is the number of cricket (the insect) chirps vs. temperature. Firstly, the five summaries required for calculating the best fitting line are:

- The mean of x
- The mean of y
- The sd of x
- The sd of y
- The correlation (r) between x and y

Note the slight change in terminology in the comments below, where I refer to the dependent variable as the response variable. The independent variable is also known as the explanatory variable.
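From these five summaries, the slope and intercept of the best fitting line can be worked out as (with $y$ as the response, i.e. chirps, and $x$ as the explanatory variable, i.e. temperature):

$$m = r \times \frac{s_y}{s_x} \qquad b = \bar{y} - m\bar{x}$$

This is the third equation on the Wikipedia page linked in the code comments below.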

```r
# create vectors of the data
chirp <- c(18, 20, 21, 23, 27, 30, 34, 39)
temp <- c(57, 60, 64, 65, 68, 71, 74, 77)

# work out the five required summaries
mean(chirp)
# [1] 26.5
mean(temp)
# [1] 67
sd(chirp)
# [1] 7.387248
sd(temp)
# [1] 6.845228
cor(chirp, temp)
# [1] 0.9774791

# this is the third equation in https://en.wikipedia.org/wiki/Simple_linear_regression#Fitting_the_regression_line
# (the slope() and slope2() functions I created above are the first two)

# formula for the slope m, with chirp as the response variable
# m = r * (sd(y) / sd(x))
m <- cor(chirp, temp) * (sd(chirp) / sd(temp))
m
# [1] 1.054878

# formula for the intercept, as above
# b = mean(y) - (m * mean(x))
mean(chirp) - (m * mean(temp))
# [1] -44.17683

# now using the lm() function in R;
# models for lm() are specified symbolically:
# a typical model has the form response ~ terms, where response is the response vector
# and terms is a series of terms which specifies a linear predictor for the response.
# Here chirp is the response and temp is the explanatory variable
lm(chirp ~ temp)
#
# Call:
# lm(formula = chirp ~ temp)
#
# Coefficients:
# (Intercept)         temp
#     -44.177        1.055

# in the form y = mx + b:
# y = 1.055x - 44.177
# we can write a function to obtain y from x
model <- function(x){
  1.055 * x - 44.177
}

# how many chirps at 68 Fahrenheit using our linear model?
model(68)
# [1] 27.563
model(77)
# [1] 37.058

# we can also use the predict() function
reg <- lm(chirp ~ temp)
predict(reg, data.frame(temp = 68))
#        1
# 27.55488
predict(reg, data.frame(temp = 77))
#        1
# 37.04878

# to get the coefficient of determination (i.e. R-squared)
summary(reg)$r.squared
# [1] 0.9554655

# full summary
summary(reg)
#
# Call:
# lm(formula = chirp ~ temp)
#
# Residuals:
#     Min      1Q  Median      3Q     Max
# -2.3354 -0.8872 -0.2195  1.1509  2.0488
#
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)
# (Intercept) -44.17683    6.25773   -7.06 0.000404 ***
# temp          1.05488    0.09298   11.35 2.81e-05 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 1.684 on 6 degrees of freedom
# Multiple R-squared:  0.9555, Adjusted R-squared:  0.948
# F-statistic: 128.7 on 1 and 6 DF,  p-value: 2.808e-05

# usually the response variable is on the y-axis
# and the explanatory variable is on the x-axis
plot(temp, chirp, pch = 19)
abline(reg)
```

Using ggplot2.

```r
library(ggplot2)
# response variable on the y-axis
qplot(temp, chirp, geom = c("point", "smooth"), method = "lm")
```

This work is licensed under a Creative Commons Attribution 4.0 International License.