Manual linear regression analysis using R

Updated 2017 September 5th

The aim of linear regression is to find the equation of the straight line that fits the data points the best; the best line is one that minimises the sum of squared residuals of the linear regression model. The equation of a straight line is:

y = mx + b

where m is the slope or gradient and b is the y-intercept.

Here's a quick example using the "women" dataset, which contains the average heights and weights of American women, that is included with R.

# install the packages below if you haven't already

# load libraries

# check out the data
  height weight
1     58    115
2     59    117
3     60    120
4     61    123
5     62    126
6     63    129

# plot height vs. weight
# and fit a linear model
ggplot(women, aes(height, weight)) +
  geom_point() +
  geom_smooth(method = "lm")

There seems to be a relationship between height and weight.

We can work out m for the above line using this formula:

m = \frac{\sum_{i=1}^{n}(x_i - \overline{x})(y_i - \overline{y})}{\sum_{i=1}^{n}(x_i - \overline{x})^2}

and we can work out b with:

b = \overline{y} - m\overline{x}

We can write this function in R as:

slope <- function(x, y){
  mean_x <- mean(x)
  mean_y <- mean(y)
  nom <- sum((x - mean_x)*(y-mean_y))
  denom <- sum((x - mean_x)^2)
  m <- nom / denom

# the slope formula is just
# covariance(x, y) / variance(x)
slope2 <- function(x, y){
  return(cov(x, y)/var(x))

intercept <- function(x, y, m){
  b <- mean(y) - (m * mean(x))

Before we work out m and b , a word on dependent and independent variables. When modelling, we want to estimate the dependent variable from an independent variable (or variables). In our example, we would like to estimate weight when given the height, therefore our independent variable is height and dependent variable is weight. It can be the other way around too: given someone's height, we want to estimate their weight. When plotting, it is by convention that the independent variable is on the x-axis, while the dependent variable is on the y-axis (as I have done above). Now to use our slope() and intercept() functions.

my_slope <- slope(women$height, women$weight)
my_intercept <- intercept(women$height, women$weight, my_slope)

# [1] 3.45
# [1] -87.51667

# let's plot this using base R functions
plot(women$height, women$weight, pch=19)

# add the regression line
abline(my_intercept, my_slope)

This plot looks similar to the ggplot2 version without the confidence intervals.

The lm() function in R can work out the intercept and slope for us (and other things too). The arguments for lm() are a formula and the data; the formula starts with the dependent variable followed by tilde (~) and followed by the independent variable or variables.

fit <- lm(weight ~ height, women)
# Call:
# lm(formula = weight ~ height, data = women)
# Coefficients:
# (Intercept)       height  
#      -87.52         3.45

# other values stored in the lm object
#  [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values" "assign"        "qr"           
#  [8] "df.residual"   "xlevels"       "call"          "terms"         "model"

The residuals data is the difference between the observed data of the dependent variable and the fitted values. We can plot our observed values against the fitted values to see how well the regression model fits.

plot(women$weight, fit$residuals, pch=19)
abline(h = 0, lty = 2)

The fourth point from the left has a residual that is close to zero; this is because the fitted weight (122.9333 pounds) is almost the same as the real weight (123 pounds) for someone who is 61 inches tall.

We can now use this model and the predict() function to predict someone's weight based on their height.

# predict weight given height of 61 inches
predict(fit, data.frame(height = 61))

# predict weight given height of 80 inches
predict(fit, data.frame(height = 80))

The example in Statistics for Dummies

I read a nice example in the "Statistics For Dummies" book on linear regression and here I'll perform the analysis using R. The example data was the number of cricket (the insect) chirps vs. temperature. Firstly, the five summaries required for calculating the best fitting line are:

  1. The mean of x
  2. The mean of y
  3. The sd of x
  4. The sd of y
  5. The r between x and y

Note the slight change in terminology in the comments below, where I refer the dependent variable as the response variable. The independent variable is also known as the explanatory variable.

# create vectors of the data
chirp <- c(18,20,21,23,27,30,34,39)
temp <- c(57,60,64,65,68,71,74,77)

# work out the five required summaries
# [1] 26.5
# [1] 67
# [1] 7.387248
# [1] 6.845228
# [1] 0.9774791

# this is the third equation in
# the slope() and slope2() functions I created above are the first two
# formula for the slope m, response variable first
# m = r(sd(x) / sd(y))
#[1] 1.054878

# formula for intercept as above
# b = mean(y) - (m * mean(x))
m <- cor(chirp,temp) * (sd(chirp)/sd(temp))
# [1] 1.054878

mean(chirp) - (m * mean(temp))
# [1] -44.17683

# now the using the lm() function in R
# Models for lm are specified symbolically. 
# A typical model has the form response ~ terms, where response is the response vector 
# and terms is a series of terms which specifies a linear predictor for response

#chirps are the response from temperature

# Call:
# lm(formula = x ~ y)
# Coefficients:
# (Intercept)            y  
#     -44.177        1.055

# in the form y = mx + b
# y = 1.055x - 44.177
# we can write a function to obtain y from x
model <- function(x){
   1.055 * x - 44.177

# how many chirps at 68 Fahrenheit using our linear model?
[1] 27.563

[1] 37.058

# we can also use the predict() function
reg <- lm(chirp~temp)
predict(reg, data.frame(temp=68))
#        1 
# 27.55488

predict(reg, data.frame(temp=77))
#        1 
# 37.04878

# to get the coefficient of determination (i.e. R-square)
# [1] 0.9554655

# full summary
# Call:
# lm(formula = chirp ~ temp)
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -2.3354 -0.8872 -0.2195  1.1509  2.0488 
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept) -44.17683    6.25773   -7.06 0.000404 ***
# temp          1.05488    0.09298   11.35 2.81e-05 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Residual standard error: 1.684 on 6 degrees of freedom
# Multiple R-squared:  0.9555,	Adjusted R-squared:  0.948 
# F-statistic: 128.7 on 1 and 6 DF,  p-value: 2.808e-05

# usually the response variable is on the y-axis
# and the explanatory variable is on the x-axis
plot(temp, chirp, pch=19)


Using ggplot2.

# response variable on the y-axis
qplot(temp, chirp, geom = c("point", "smooth"), method="lm")


Print Friendly, PDF & Email

Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
2 comments Add yours
  1. Your blog is amazing, thank you very much, it’s quite easy to read during stress. I used your blog to come to the realization that given that I was looking for the angle of my line with the x-axis but the regression equation was too far out there, I had to standardize my data to be able to get that angle. I was wondering if maybe you would like to add something link that to this entry. It’s quite useful for analyzing deforestation trends for example, but I don’t see it often because only psycologists seem to use standardization

    1. Hi Pablo,

      I’m glad you found the post useful. I didn’t quite follow your comment about adding a link. I’m happy to add a link if it’s relevant and useful.


Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.