# On curve fitting using R

For linear relationships we can perform a simple linear regression. For other relationships we can try fitting a curve. From Wikipedia:

Curve fitting is the process of constructing a curve, or mathematical function, that has the best fit to a series of data points, possibly subject to constraints.

I will use the dataset from this question on Stack Overflow.

### Using the example dataset

```x <- c(32,64,96,118,126,144,152.5,158)
y <- c(99.5,104.8,108.5,100,86,64,35.3,15)
#we will make y the response variable and x the predictor
#the response variable is usually on the y-axis
plot(x,y,pch=19)
``` Looks like we can fit a nice curve there.

```#fit first degree polynomial equation:
fit  <- lm(y~x)
#second degree
fit2 <- lm(y~poly(x,2,raw=TRUE))
#third degree
fit3 <- lm(y~poly(x,3,raw=TRUE))
#fourth degree
fit4 <- lm(y~poly(x,4,raw=TRUE))
#generate range of 50 numbers starting from 30 and ending at 160
xx <- seq(30,160, length=50)
plot(x,y,pch=19,ylim=c(0,150))
lines(xx, predict(fit, data.frame(x=xx)), col="red")
lines(xx, predict(fit2, data.frame(x=xx)), col="green")
lines(xx, predict(fit3, data.frame(x=xx)), col="blue")
lines(xx, predict(fit4, data.frame(x=xx)), col="purple")
``` We can see how well each curve fits.

We can get more elaborate statistics with the fits using the summary() function:

```summary(fit)

Call:
lm(formula = y ~ x)

Residuals:
Min     1Q Median     3Q    Max
-33.78 -18.69   3.40  19.27  27.35

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 143.0438    24.8489   5.757   0.0012 **
x            -0.5966     0.2090  -2.854   0.0290 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 24.7 on 6 degrees of freedom
Multiple R-squared:  0.5759,	Adjusted R-squared:  0.5052
F-statistic: 8.148 on 1 and 6 DF,  p-value: 0.02901

summary(fit2)

Call:
lm(formula = y ~ poly(x, 2, raw = TRUE))

Residuals:
1       2       3       4       5       6       7       8
6.264 -11.596  -3.493   7.022   3.167  10.290  -1.625 -10.029

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)             42.514916  19.451549   2.186  0.08053 .
poly(x, 2, raw = TRUE)1  2.015705   0.447096   4.508  0.00635 **
poly(x, 2, raw = TRUE)2 -0.013458   0.002266  -5.940  0.00193 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.533 on 5 degrees of freedom
Multiple R-squared:  0.9474,	Adjusted R-squared:  0.9263
F-statistic:    45 on 2 and 5 DF,  p-value: 0.0006356

summary(fit3)

Call:
lm(formula = y ~ poly(x, 3, raw = TRUE))

Residuals:
1        2        3        4        5        6        7        8
-0.38748  1.21576 -0.67439  0.87812 -4.47543  6.00831  0.05934 -2.62422

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)              1.269e+02  1.925e+01   6.593  0.00274 **
poly(x, 3, raw = TRUE)1 -1.626e+00  7.737e-01  -2.102  0.10340
poly(x, 3, raw = TRUE)2  2.910e-02  8.816e-03   3.301  0.02990 *
poly(x, 3, raw = TRUE)3 -1.468e-04  3.022e-05  -4.857  0.00830 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.058 on 4 degrees of freedom
Multiple R-squared:  0.9924,	Adjusted R-squared:  0.9866
F-statistic: 173.4 on 3 and 4 DF,  p-value: 0.0001089

summary(fit4)

Call:
lm(formula = y ~ poly(x, 4, raw = TRUE))

Residuals:
1       2       3       4       5       6       7       8
0.1242 -0.6912  1.6355  1.4491 -5.1240  4.0360 -0.4692 -0.9604

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)              7.474e+01  5.473e+01   1.366    0.265
poly(x, 4, raw = TRUE)1  1.426e+00  3.095e+00   0.461    0.676
poly(x, 4, raw = TRUE)2 -2.854e-02  5.729e-02  -0.498    0.653
poly(x, 4, raw = TRUE)3  2.878e-04  4.278e-04   0.673    0.549
poly(x, 4, raw = TRUE)4 -1.134e-06  1.113e-06  -1.018    0.384

Residual standard error: 4.04 on 3 degrees of freedom
Multiple R-squared:  0.9943,	Adjusted R-squared:  0.9868
F-statistic: 131.5 on 4 and 3 DF,  p-value: 0.001064
```

Summary of the multiple R-squared using different degrees of polynomials:

1st: 0.5759
2nd: 0.9474
3rd: 0.9924
4th: 0.9943

We can compare different models with an Analysis of Variance:

```anova(fit,fit2)
Analysis of Variance Table

Model 1: y ~ x
Model 2: y ~ poly(x, 2, raw = TRUE)
Res.Df    RSS Df Sum of Sq      F   Pr(>F)
1      6 3660.8
2      5  454.3  1    3206.4 35.286 0.001931 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

The Pr(>F) value is the probability of rejecting the null hypothesis, where one model does not fit better than the other model. We have a very significant p-value, so we can reject the null hypothesis, i.e. fit2 provides a better fit than fit

```#significant
anova(fit2,fit3)
Analysis of Variance Table

Model 1: y ~ poly(x, 2, raw = TRUE)
Model 2: y ~ poly(x, 3, raw = TRUE)
Res.Df    RSS Df Sum of Sq      F   Pr(>F)
1      5 454.34
2      4  65.87  1    388.47 23.589 0.008298 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#not significant
anova(fit3,fit4)
Analysis of Variance Table

Model 1: y ~ poly(x, 3, raw = TRUE)
Model 2: y ~ poly(x, 4, raw = TRUE)
Res.Df    RSS Df Sum of Sq      F Pr(>F)
1      4 65.873
2      3 48.955  1    16.918 1.0368 0.3835
```

We can also create a function that reflects the polynomial equation:

```#y=ax + b
coef(fit)
(Intercept)           x
143.0437906  -0.5965753
#y=ax^2 + bx + c
coef(fit2)
(Intercept) poly(x, 2, raw = TRUE)1 poly(x, 2, raw = TRUE)2
42.51491582              2.01570523             -0.01345808
#y=ax^3 + bx^2 + cx + d
coef(fit3)
(Intercept) poly(x, 3, raw = TRUE)1 poly(x, 3, raw = TRUE)2 poly(x, 3, raw = TRUE)3
1.269381e+02           -1.626321e+00            2.910221e-02           -1.467589e-04
#y=ax^4 + bx^3 + cx^2 + dx + e
coef(fit4)
(Intercept) poly(x, 4, raw = TRUE)1 poly(x, 4, raw = TRUE)2 poly(x, 4, raw = TRUE)3
7.473766e+01            1.425813e+00           -2.854370e-02            2.877714e-04
poly(x, 4, raw = TRUE)4
-1.133744e-06
#create function for the third order polynomial
third_order <- function(newdist, model) {
coefs <- coef(model)
#y = d + cx + bx^2 + ax^3
res <- coefs + (coefs * newdist) + (coefs * newdist^2) + (coefs * newdist^3)
return(res)
}
yy <- third_order(xx,fit3)
plot(xx,yy,type="l")
lines(x,y,col="red")
``` The values extrapolated from the third order polynomial has a very good fit to the original values, which we already knew from the R-squared values.

### Conclusions

For non-linear curve fitting we can use lm() and poly() functions of R, which also provides useful statistics to how well the polynomial functions fits the dataset. We can also assess how well different models are against each other using an analysis of variance test. From the model it is possible to define a function that reflects the polynomial function, which could be used for extrapolating response variables.

```xx<-seq(-1000,1000,length=500)
yy<-third_order(xx,fit3)
plot(xx,yy,type="l")
``` Technical note: Curve fitting with the R Environment for Statistical Computing at http://www.css.cornell.edu/faculty/dgr2/teach/R/R_CurveFit.pdf .
1. ecologist08 says:

Nice post! Have you done any work with fitting functions to data with replicates?

1. Davo says:

Thanks! No sorry; that’s beyond the scope of my knowledge. Since I couldn’t be of more help, here’s a relevant PhD comic 🙂

2. Shelley Edwards says:

This has been really useful, thank you!!

3. Mohamed says:

Thanks for the syntax, very useful. I have real date (a small series) I used your syntax, I can get the dots but not the curves. May I ask you why?
These are the data of my experiment and your syntax.
Another question please, what is xx? the x axis?

a <- c(0:9)
y <- c(92,2,3,3,1,5,1,0,0,1)
plot(a,y,pch=19)
#fit first degree polynomial equation:
fit <- lm(y~a)
#second degree
fit2 <- lm(y~poly(a,2,raw=TRUE))
#third degree
fit3 <- lm(y~poly(a,3,raw=TRUE))
#fourth degree
fit4 <- lm(y~poly(a,4,raw=TRUE))
lines(xx, predict(fit, data.frame(a=xx)), col="red")
lines(xx, predict(fit2, data.frame(a=xx)), col="green")
lines(xx, predict(fit3, data.frame(a=xx)), col="green")

Best regards

1. Davo says:

You left out line 11 of the code.

4. Aaron McLendon says:

Thanks for the great post.

Re: “The Pr(>F) value is the probability of rejecting the null hypothesis, where one model does not fit better than the other model.”

Sorry for the noob question, but, isn’t Pr(>F) better described as the probability that the null hypothesis is true? Since a low value is better? i.e. p-value = .001931 = .001931 probability that neither model is better?

1. Davo says:

Hi Aaron,

thanks for the comment and the question; yes it could have been worded better. My statistics is not as good as I want it to be but here’s my understanding. The F value or F statistic is a ratio that indicates how different two groups are (in this case the two models). If two groups are similar, the F value is around 1, as we see below when when we test `anova(fit2,fit3)`. The p-value is the probability of obtaining that F value just by chance; larger F values gives smaller p-values. So in this case, the p-value (0.001931) suggests that the two models are different (our null hypothesis is that they are the same), we can assume that one fits better than the other.

Hope that clarifies.

Cheers,

Dave

5. Azad Henareh says:

Hi,

Thanks for this useful info. Do you have any clue for applying the same process for data with other shapes, I mean other than polynomial? For example exponential, sigmoidal, etc.?

Thanks

6. JK says:

This was very helpful. I am having trouble taking this one step further. How do I get the equation for the best fit line or model? For example what if I want to solve for x when y=50?

7. valentino says:

Hello,

firstly thank you for the very clear post!

I have just a question regarding the polynomial curve that you have draw on the graph. Accordingly to the summary(anova) of fit2 (if I am interpreting it correct) The equation should be Y = 45.5 + 2.01X -0.01X^2

Why then in the graph is it different? I have not gone in details but it is clear that the intercept is somewhere around 90..

I keep having this problem when I work with polynomial regression., while with linear summary(anova) gives correct value of intercept and slope.

Thank you for any clarification on this!

Cheers,

Valentino

1. Davo says:

The x-axis doesn’t start at 0. If you extend it, the y-intercept should be around 45.

8. Amanda says:

Hi, taking things one step further… Once you have your say 3rd degree polynomial plotted, how can you calculate the residual (distance between the point and your line)? I can do this easily with a linear line but don’t know how to calculate it with a polynomial

1. Davo says:

Using the example in the post:

```fit3\$residuals
1           2           3           4           5           6           7           8
-0.38748235  1.21576119 -0.67439258  0.87811753 -4.47543205  6.00830724  0.05934317 -2.62422214
```
9. Aykut Saribiyik says:

splendid summary, one of the best “first-to-read” blog for those interested in curve fitting with R. Well done. I have just finished working on this post, looking forward to reading the other articles.

10. oksana says:

this is an excellent post, just what i was looking for. Thank you for sharing!

11. vongrads says:

Nice post. How can I approximate the X value from given Y value?

12. Rittik says:

That helped and eased my curve fitting so much….thanks a bunch

13. Alterra Sanchez says:

Very nice post! Super easy to understand and follow. However, I’m having some user error and cannot figure it out. I keep getting this error:
“Error in xy.coords(x, y) : ‘x’ and ‘y’ lengths differ
‘newdata’ had 50 rows but variables found have 9 rows ”

Your y data doesn’t have 50 rows either, so I cannot figure out for the life of me what I’m doing wrong. Here is my code:
> Sarea str(Sarea)
‘data.frame’: 9 obs. of 2 variables:
\$ class : int 45 75 150 212 250 300 355 425 500
\$ hdpeSA: num 0.811 0.493 0.299 0.202 0.177 …
> plot(Sarea)
> x y #second degree
> fit2 #third degree
> fit3 #fourth degree
> fit4 #generate range of 50 numbers starting from 30 and ending at 160
> xx plot(x,y,pch=19,ylim=c(0,1))
> lines(xx, predict(fit2, data.frame(x=xx)), col=”red”)
Error in xy.coords(x, y) : ‘x’ and ‘y’ lengths differ
‘newdata’ had 50 rows but variables found have 9 rows
> lines(xx, predict(fit3, data.frame(x=xx)), col=”blue”)
Error in xy.coords(x, y) : ‘x’ and ‘y’ lengths differ
‘newdata’ had 50 rows but variables found have 9 rows
> lines(xx, predict(fit4, data.frame(x=xx)), col=”green”)
Error in xy.coords(x, y) : ‘x’ and ‘y’ lengths differ
14. Dennis Wollersheim says:
15. JK says: