Manual linear regression analysis using R

Updated 5 September 2017

The aim of linear regression is to find the equation of the straight line that best fits the data points; the best line is the one that minimises the sum of squared residuals of the linear regression model. The equation of a straight line is:

y = mx + b

where m is the slope or gradient and b is the y-intercept.
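
To make the sum of squared residuals concrete, here is a minimal sketch; the data points and the candidate line below are made up purely for illustration:

# made-up data points
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)

# an arbitrary candidate line y = 1x + 1
m <- 1
b <- 1

# residuals: vertical distances between observed and predicted y
res <- y - (m * x + b)

# the quantity that linear regression minimises
sum(res^2)
# [1] 3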

Here's a quick example using the "women" dataset that comes with R.

# install the packages below if you haven't already
install.packages('ggplot2')
install.packages('cowplot')

# load libraries
library(ggplot2)
library(cowplot)

# check out the data frame
head(women)
  height weight
1     58    115
2     59    117
3     60    120
4     61    123
5     62    126
6     63    129

# plot height vs. weight
# and fit a linear model
ggplot(women, aes(height, weight)) +
  geom_point() +
  geom_smooth(method = "lm")

There seems to be a strong positive linear relationship between height and weight.

We can work out m for the above line using this formula:

m = \frac{\sum_{i=1}^{n}(x_i - \overline{x})(y_i - \overline{y})}{\sum_{i=1}^{n}(x_i - \overline{x})^2}

and we can work out b with:

b = \overline{y} - m\overline{x}

We can write this function in R as:

slope <- function(x, y){
  mean_x <- mean(x)
  mean_y <- mean(y)
  # numerator: sum of the products of the deviations from the means
  num <- sum((x - mean_x) * (y - mean_y))
  # denominator: sum of the squared deviations of x from its mean
  denom <- sum((x - mean_x)^2)
  return(num / denom)
}

# the slope formula is just
# covariance(x, y) / variance(x)
slope2 <- function(x, y){
  return(cov(x, y)/var(x))
}

intercept <- function(x, y, m){
  b <- mean(y) - (m * mean(x))
  return(b)
}
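
As a quick sanity check, both implementations should return the same slope; here I use the women data frame loaded above:

# the two formulations should agree
all.equal(slope(women$height, women$weight),
          slope2(women$height, women$weight))
# [1] TRUE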

Before we work out m and b, a word on dependent and independent variables. When modelling, we want to work out the dependent variable from an independent variable (or variables). In this post, we want to work out weight from height, so our independent variable is height and our dependent variable is weight (we could just as easily model height from weight, depending on the question being asked). When plotting, the independent variable usually goes on the x-axis and the dependent variable on the y-axis (as I have done above). Now to use our slope() and intercept() functions.

my_slope <- slope(women$height, women$weight)
my_intercept <- intercept(women$height, women$weight, my_slope)

my_slope
# [1] 3.45
my_intercept
# [1] -87.51667

# let's plot this using base R functions
plot(women$height, women$weight, pch=19)

# add the regression line
abline(my_intercept, my_slope)

This plot looks similar to the ggplot2 version, just without the confidence interval band.

The lm() function in R can work out the intercept and slope for us (and other things too). The arguments for lm() are a formula and the data; the formula starts with the dependent variable, followed by a tilde (~), followed by the independent variable:

fit <- lm(weight ~ height, women)
fit
# 
# Call:
# lm(formula = weight ~ height, data = women)
#
# Coefficients:
# (Intercept)       height  
#      -87.52         3.45
#

# other values stored in the lm object
names(fit)
#  [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values" "assign"        "qr"           
#  [8] "df.residual"   "xlevels"       "call"          "terms"         "model"

The residuals are the differences between the observed values of the dependent variable and the fitted values.
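
We can check this relationship directly using the values stored in the fit object:

# residuals should equal the observed values minus the fitted values
all.equal(fit$residuals, women$weight - fit$fitted.values)
# [1] TRUE

We can also plot the residuals against the observed values to see how well the regression model fits.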

plot(women$weight, fit$residuals, pch=19)
abline(h = 0, lty = 2)

The fourth point from the left has a residual close to zero; this is because the fitted weight (122.9333 pounds) is almost the same as the observed weight (123 pounds) for someone who is 61 inches tall.
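
We can reproduce this fitted value by hand with the slope and intercept we calculated earlier:

# fitted weight for a height of 61 inches
my_intercept + my_slope * 61
# [1] 122.9333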

We can use the predict() function to predict someone's weight based on their height.

# predict someone's weight when they are 61 inches tall
predict(fit, data.frame(height = 61))
#        1 
# 122.9333

# predict someone's weight when they are 80 inches tall;
# note that 80 inches is outside the observed range of heights
# (58 to 72 inches), so this prediction is an extrapolation
predict(fit, data.frame(height = 80))
#        1 
# 188.4833

The example in Statistics for Dummies

I read a nice example on linear regression in the "Statistics For Dummies" book, and here I'll perform the same analysis using R. The example data are the number of cricket (the insect) chirps vs. temperature. First, the five summaries required for calculating the best-fitting line are:

  1. The mean of x
  2. The mean of y
  3. The standard deviation (sd) of x
  4. The standard deviation (sd) of y
  5. The correlation (r) between x and y

Note the slight change in terminology in the comments below, where I refer to the dependent variable as the response variable. The independent variable is also known as the explanatory variable.

# create vectors of the data
chirp <- c(18,20,21,23,27,30,34,39)
temp <- c(57,60,64,65,68,71,74,77)

# work out the five required summaries
mean(chirp)
# [1] 26.5
mean(temp)
# [1] 67
sd(chirp)
# [1] 7.387248
sd(temp)
# [1] 6.845228
cor(chirp,temp)
# [1] 0.9774791

# this is the third equation in https://en.wikipedia.org/wiki/Simple_linear_regression#Fitting_the_regression_line
# the slope() and slope2() functions I created above are the first two
# formula for the slope m, with the response variable's sd in the numerator
# m = r * (sd(chirp) / sd(temp)), i.e. r * (sd(response) / sd(explanatory))
cor(chirp,temp)*(sd(chirp)/sd(temp))
# [1] 1.054878
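
This should agree with the slope2() function defined earlier, since cov(x, y) / var(x) simplifies to the same quantity:

# sanity check using the cov/var formulation
slope2(temp, chirp)
# [1] 1.054878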

# formula for the intercept, as above
# b = mean(chirp) - (m * mean(temp))
m <- cor(chirp,temp) * (sd(chirp)/sd(temp))
m
# [1] 1.054878

mean(chirp) - (m * mean(temp))
# [1] -44.17683

# now using the lm() function in R
# Models for lm are specified symbolically. 
# A typical model has the form response ~ terms, where response is the response vector 
# and terms is a series of terms which specifies a linear predictor for response

# chirps are the response, modelled from temperature
lm(chirp~temp)

# Call:
# lm(formula = chirp ~ temp)
# 
# Coefficients:
# (Intercept)         temp  
#     -44.177        1.055

# in the form y = mx + b
# y = 1.055x - 44.177
# we can write a function to obtain y from x
model <- function(x){
   1.055 * x - 44.177
}

# how many chirps at 68 Fahrenheit using our linear model?
model(68)
# [1] 27.563

model(77)
# [1] 37.058

# we can also use the predict() function
reg <- lm(chirp~temp)
predict(reg, data.frame(temp=68))
#        1 
# 27.55488

predict(reg, data.frame(temp=77))
#        1 
# 37.04878
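
The small differences between model() and predict() come from rounding the coefficients to three decimal places. A version using the exact coefficients stored in the lm object (model_exact is just a name I've made up for this sketch) reproduces the predict() output:

# same model but with the unrounded coefficients
model_exact <- function(x){
  unname(coef(reg)[1] + coef(reg)[2] * x)
}

model_exact(68)
# [1] 27.55488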

# to get the coefficient of determination (i.e. R-squared)
summary(reg)$r.squared 
# [1] 0.9554655
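
For simple linear regression, the coefficient of determination is simply the square of the correlation coefficient we calculated earlier:

# r squared should match summary(reg)$r.squared
cor(chirp, temp)^2
# [1] 0.9554655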

# full summary
summary(reg)
#
# Call:
# lm(formula = chirp ~ temp)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -2.3354 -0.8872 -0.2195  1.1509  2.0488 
# 
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept) -44.17683    6.25773   -7.06 0.000404 ***
# temp          1.05488    0.09298   11.35 2.81e-05 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 1.684 on 6 degrees of freedom
# Multiple R-squared:  0.9555,	Adjusted R-squared:  0.948 
# F-statistic: 128.7 on 1 and 6 DF,  p-value: 2.808e-05

# usually the response variable is on the y-axis
# and the explanatory variable is on the x-axis
plot(temp, chirp, pch=19)
abline(reg)


The same plot using ggplot2:

library(ggplot2)
# response variable on the y-axis
ggplot(data.frame(temp, chirp), aes(temp, chirp)) +
  geom_point() +
  geom_smooth(method = "lm")

