On curve fitting using R

For linear relationships we can perform a simple linear regression. For other relationships we can try fitting a curve. From Wikipedia:

Curve fitting is the process of constructing a curve, or mathematical function, that has the best fit to a series of data points, possibly subject to constraints.

I will use the dataset from this question on Stack Overflow.

Using the example dataset

x <- c(32,64,96,118,126,144,152.5,158)
y <- c(99.5,104.8,108.5,100,86,64,35.3,15)
#we will make y the response variable and x the predictor
#the response variable is usually on the y-axis
plot(x,y,pch=19)

curve_exampleLooks like we can fit a nice curve there.

#fit first degree polynomial equation:
fit  <- lm(y~x)
#second degree
fit2 <- lm(y~poly(x,2,raw=TRUE))
#third degree
fit3 <- lm(y~poly(x,3,raw=TRUE))
#fourth degree
fit4 <- lm(y~poly(x,4,raw=TRUE))
#generate range of 50 numbers starting from 30 and ending at 160
xx <- seq(30,160, length=50)
plot(x,y,pch=19,ylim=c(0,150))
lines(xx, predict(fit, data.frame(x=xx)), col="red")
lines(xx, predict(fit2, data.frame(x=xx)), col="green")
lines(xx, predict(fit3, data.frame(x=xx)), col="blue")
lines(xx, predict(fit4, data.frame(x=xx)), col="purple")

curve_fitting_polynomialWe can see how well each curve fits.

We can get more elaborate statistics with the fits using the summary() function:

summary(fit)

Call:
lm(formula = y ~ x)

Residuals:
   Min     1Q Median     3Q    Max 
-33.78 -18.69   3.40  19.27  27.35 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) 143.0438    24.8489   5.757   0.0012 **
x            -0.5966     0.2090  -2.854   0.0290 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 24.7 on 6 degrees of freedom
Multiple R-squared:  0.5759,	Adjusted R-squared:  0.5052 
F-statistic: 8.148 on 1 and 6 DF,  p-value: 0.02901

summary(fit2)

Call:
lm(formula = y ~ poly(x, 2, raw = TRUE))

Residuals:
      1       2       3       4       5       6       7       8 
  6.264 -11.596  -3.493   7.022   3.167  10.290  -1.625 -10.029 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)   
(Intercept)             42.514916  19.451549   2.186  0.08053 . 
poly(x, 2, raw = TRUE)1  2.015705   0.447096   4.508  0.00635 **
poly(x, 2, raw = TRUE)2 -0.013458   0.002266  -5.940  0.00193 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.533 on 5 degrees of freedom
Multiple R-squared:  0.9474,	Adjusted R-squared:  0.9263 
F-statistic:    45 on 2 and 5 DF,  p-value: 0.0006356

summary(fit3)

Call:
lm(formula = y ~ poly(x, 3, raw = TRUE))

Residuals:
       1        2        3        4        5        6        7        8 
-0.38748  1.21576 -0.67439  0.87812 -4.47543  6.00831  0.05934 -2.62422 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)   
(Intercept)              1.269e+02  1.925e+01   6.593  0.00274 **
poly(x, 3, raw = TRUE)1 -1.626e+00  7.737e-01  -2.102  0.10340   
poly(x, 3, raw = TRUE)2  2.910e-02  8.816e-03   3.301  0.02990 * 
poly(x, 3, raw = TRUE)3 -1.468e-04  3.022e-05  -4.857  0.00830 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.058 on 4 degrees of freedom
Multiple R-squared:  0.9924,	Adjusted R-squared:  0.9866 
F-statistic: 173.4 on 3 and 4 DF,  p-value: 0.0001089

summary(fit4)

Call:
lm(formula = y ~ poly(x, 4, raw = TRUE))

Residuals:
      1       2       3       4       5       6       7       8 
 0.1242 -0.6912  1.6355  1.4491 -5.1240  4.0360 -0.4692 -0.9604 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
(Intercept)              7.474e+01  5.473e+01   1.366    0.265
poly(x, 4, raw = TRUE)1  1.426e+00  3.095e+00   0.461    0.676
poly(x, 4, raw = TRUE)2 -2.854e-02  5.729e-02  -0.498    0.653
poly(x, 4, raw = TRUE)3  2.878e-04  4.278e-04   0.673    0.549
poly(x, 4, raw = TRUE)4 -1.134e-06  1.113e-06  -1.018    0.384

Residual standard error: 4.04 on 3 degrees of freedom
Multiple R-squared:  0.9943,	Adjusted R-squared:  0.9868 
F-statistic: 131.5 on 4 and 3 DF,  p-value: 0.001064

Summary of the multiple R-squared using different degrees of polynomials:

1st: 0.5759
2nd: 0.9474
3rd: 0.9924
4th: 0.9943

We can compare different models with an Analysis of Variance:

anova(fit,fit2)
Analysis of Variance Table

Model 1: y ~ x
Model 2: y ~ poly(x, 2, raw = TRUE)
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
1      6 3660.8                                
2      5  454.3  1    3206.4 35.286 0.001931 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The Pr(>F) value is the probability of rejecting the null hypothesis, where one model does not fit better than the other model. We have a very significant p-value, so we can reject the null hypothesis, i.e. fit2 provides a better fit than fit

#significant
anova(fit2,fit3)
Analysis of Variance Table

Model 1: y ~ poly(x, 2, raw = TRUE)
Model 2: y ~ poly(x, 3, raw = TRUE)
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
1      5 454.34                                
2      4  65.87  1    388.47 23.589 0.008298 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#not significant
anova(fit3,fit4)
Analysis of Variance Table

Model 1: y ~ poly(x, 3, raw = TRUE)
Model 2: y ~ poly(x, 4, raw = TRUE)
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1      4 65.873                           
2      3 48.955  1    16.918 1.0368 0.3835

We can also create a function that reflects the polynomial equation:

#y=ax + b
coef(fit)
(Intercept)           x 
143.0437906  -0.5965753
#y=ax^2 + bx + c
coef(fit2)
            (Intercept) poly(x, 2, raw = TRUE)1 poly(x, 2, raw = TRUE)2 
            42.51491582              2.01570523             -0.01345808
#y=ax^3 + bx^2 + cx + d
coef(fit3)
            (Intercept) poly(x, 3, raw = TRUE)1 poly(x, 3, raw = TRUE)2 poly(x, 3, raw = TRUE)3 
           1.269381e+02           -1.626321e+00            2.910221e-02           -1.467589e-04
#y=ax^4 + bx^3 + cx^2 + dx + e
coef(fit4)
            (Intercept) poly(x, 4, raw = TRUE)1 poly(x, 4, raw = TRUE)2 poly(x, 4, raw = TRUE)3 
           7.473766e+01            1.425813e+00           -2.854370e-02            2.877714e-04 
poly(x, 4, raw = TRUE)4 
          -1.133744e-06
#create function for the third order polynomial
third_order <- function(newdist, model) {
    coefs <- coef(model)
    #y = d + cx + bx^2 + ax^3
    res <- coefs[1] + (coefs[2] * newdist) + (coefs[3] * newdist^2) + (coefs[4] * newdist^3)
    return(res)
}
yy <- third_order(xx,fit3)
plot(xx,yy,type="l")
lines(x,y,col="red")

third_order_polynomialThe values extrapolated from the third order polynomial has a very good fit to the original values, which we already knew from the R-squared values.

Conclusions

For non-linear curve fitting we can use lm() and poly() functions of R, which also provides useful statistics to how well the polynomial functions fits the dataset. We can also assess how well different models are against each other using an analysis of variance test. From the model it is possible to define a function that reflects the polynomial function, which could be used for extrapolating response variables.

xx<-seq(-1000,1000,length=500)
yy<-third_order(xx,fit3)
plot(xx,yy,type="l")

third_order_polynomial_1000

See also

Technical note: Curve fitting with the R Environment for Statistical Computing at http://www.css.cornell.edu/faculty/dgr2/teach/R/R_CurveFit.pdf

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
15 comments Add yours
  1. Thanks for the syntax, very useful. I have real date (a small series) I used your syntax, I can get the dots but not the curves. May I ask you why?
    These are the data of my experiment and your syntax.
    Another question please, what is xx? the x axis?

    a <- c(0:9)
    y <- c(92,2,3,3,1,5,1,0,0,1)
    plot(a,y,pch=19)
    #fit first degree polynomial equation:
    fit <- lm(y~a)
    #second degree
    fit2 <- lm(y~poly(a,2,raw=TRUE))
    #third degree
    fit3 <- lm(y~poly(a,3,raw=TRUE))
    #fourth degree
    fit4 <- lm(y~poly(a,4,raw=TRUE))
    lines(xx, predict(fit, data.frame(a=xx)), col="red")
    lines(xx, predict(fit2, data.frame(a=xx)), col="green")
    lines(xx, predict(fit3, data.frame(a=xx)), col="green")

    Best regards

  2. Thanks for the great post.

    Re: "The Pr(>F) value is the probability of rejecting the null hypothesis, where one model does not fit better than the other model."

    Sorry for the noob question, but, isn't Pr(>F) better described as the probability that the null hypothesis is true? Since a low value is better? i.e. p-value = .001931 = .001931 probability that neither model is better?

    1. Hi Aaron,

      thanks for the comment and the question; yes it could have been worded better. My statistics is not as good as I want it to be but here's my understanding. The F value or F statistic is a ratio that indicates how different two groups are (in this case the two models). If two groups are similar, the F value is around 1, as we see below when when we test anova(fit2,fit3). The p-value is the probability of obtaining that F value just by chance; larger F values gives smaller p-values. So in this case, the p-value (0.001931) suggests that the two models are different (our null hypothesis is that they are the same), we can assume that one fits better than the other.

      Hope that clarifies.

      Cheers,

      Dave

  3. Hi,

    Thanks for this useful info. Do you have any clue for applying the same process for data with other shapes, I mean other than polynomial? For example exponential, sigmoidal, etc.?

    Thanks

  4. This was very helpful. I am having trouble taking this one step further. How do I get the equation for the best fit line or model? For example what if I want to solve for x when y=50?

  5. Hello,

    firstly thank you for the very clear post!

    I have just a question regarding the polynomial curve that you have draw on the graph. Accordingly to the summary(anova) of fit2 (if I am interpreting it correct) The equation should be Y = 45.5 + 2.01X -0.01X^2

    Why then in the graph is it different? I have not gone in details but it is clear that the intercept is somewhere around 90..

    I keep having this problem when I work with polynomial regression., while with linear summary(anova) gives correct value of intercept and slope.

    Thank you for any clarification on this!

    Cheers,

    Valentino

  6. Hi, taking things one step further... Once you have your say 3rd degree polynomial plotted, how can you calculate the residual (distance between the point and your line)? I can do this easily with a linear line but don't know how to calculate it with a polynomial

    1. Using the example in the post:

      fit3$residuals
                1           2           3           4           5           6           7           8 
      -0.38748235  1.21576119 -0.67439258  0.87811753 -4.47543205  6.00830724  0.05934317 -2.62422214
      
  7. splendid summary, one of the best "first-to-read" blog for those interested in curve fitting with R. Well done. I have just finished working on this post, looking forward to reading the other articles.

Leave a Reply

Your email address will not be published. Required fields are marked *