For linear relationships we can perform a simple linear regression. For other relationships we can try fitting a curve. From Wikipedia:

Curve fitting is the process of constructing a curve, or mathematical function, that has the best fit to a series of data points, possibly subject to constraints.

I will use the dataset from this question on Stack Overflow.

### Using the example dataset

x <- c(32,64,96,118,126,144,152.5,158) y <- c(99.5,104.8,108.5,100,86,64,35.3,15) #we will make y the response variable and x the predictor #the response variable is usually on the y-axis plot(x,y,pch=19)

*Looks like we can fit a nice curve there*.

#fit first degree polynomial equation: fit <- lm(y~x) #second degree fit2 <- lm(y~poly(x,2,raw=TRUE)) #third degree fit3 <- lm(y~poly(x,3,raw=TRUE)) #fourth degree fit4 <- lm(y~poly(x,4,raw=TRUE)) #generate range of 50 numbers starting from 30 and ending at 160 xx <- seq(30,160, length=50) plot(x,y,pch=19,ylim=c(0,150)) lines(xx, predict(fit, data.frame(x=xx)), col="red") lines(xx, predict(fit2, data.frame(x=xx)), col="green") lines(xx, predict(fit3, data.frame(x=xx)), col="blue") lines(xx, predict(fit4, data.frame(x=xx)), col="purple")

*We can see how well each curve fits*.

We can get more elaborate statistics with the fits using the summary() function:

summary(fit) Call: lm(formula = y ~ x) Residuals: Min 1Q Median 3Q Max -33.78 -18.69 3.40 19.27 27.35 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 143.0438 24.8489 5.757 0.0012 ** x -0.5966 0.2090 -2.854 0.0290 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 24.7 on 6 degrees of freedom Multiple R-squared: 0.5759, Adjusted R-squared: 0.5052 F-statistic: 8.148 on 1 and 6 DF, p-value: 0.02901 summary(fit2) Call: lm(formula = y ~ poly(x, 2, raw = TRUE)) Residuals: 1 2 3 4 5 6 7 8 6.264 -11.596 -3.493 7.022 3.167 10.290 -1.625 -10.029 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 42.514916 19.451549 2.186 0.08053 . poly(x, 2, raw = TRUE)1 2.015705 0.447096 4.508 0.00635 ** poly(x, 2, raw = TRUE)2 -0.013458 0.002266 -5.940 0.00193 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 9.533 on 5 degrees of freedom Multiple R-squared: 0.9474, Adjusted R-squared: 0.9263 F-statistic: 45 on 2 and 5 DF, p-value: 0.0006356 summary(fit3) Call: lm(formula = y ~ poly(x, 3, raw = TRUE)) Residuals: 1 2 3 4 5 6 7 8 -0.38748 1.21576 -0.67439 0.87812 -4.47543 6.00831 0.05934 -2.62422 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.269e+02 1.925e+01 6.593 0.00274 ** poly(x, 3, raw = TRUE)1 -1.626e+00 7.737e-01 -2.102 0.10340 poly(x, 3, raw = TRUE)2 2.910e-02 8.816e-03 3.301 0.02990 * poly(x, 3, raw = TRUE)3 -1.468e-04 3.022e-05 -4.857 0.00830 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 4.058 on 4 degrees of freedom Multiple R-squared: 0.9924, Adjusted R-squared: 0.9866 F-statistic: 173.4 on 3 and 4 DF, p-value: 0.0001089 summary(fit4) Call: lm(formula = y ~ poly(x, 4, raw = TRUE)) Residuals: 1 2 3 4 5 6 7 8 0.1242 -0.6912 1.6355 1.4491 -5.1240 4.0360 -0.4692 -0.9604 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 7.474e+01 5.473e+01 1.366 0.265 poly(x, 4, raw = TRUE)1 1.426e+00 3.095e+00 0.461 0.676 poly(x, 4, raw = TRUE)2 -2.854e-02 5.729e-02 -0.498 0.653 poly(x, 4, raw = TRUE)3 2.878e-04 4.278e-04 0.673 0.549 poly(x, 4, raw = TRUE)4 -1.134e-06 1.113e-06 -1.018 0.384 Residual standard error: 4.04 on 3 degrees of freedom Multiple R-squared: 0.9943, Adjusted R-squared: 0.9868 F-statistic: 131.5 on 4 and 3 DF, p-value: 0.001064

Summary of the multiple R-squared using different degrees of polynomials:

1st: 0.5759

2nd: 0.9474

3rd: 0.9924

4th: 0.9943

We can compare different models with an Analysis of Variance:

anova(fit,fit2) Analysis of Variance Table Model 1: y ~ x Model 2: y ~ poly(x, 2, raw = TRUE) Res.Df RSS Df Sum of Sq F Pr(>F) 1 6 3660.8 2 5 454.3 1 3206.4 35.286 0.001931 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The Pr(>F) value is the probability of rejecting the null hypothesis, where one model does not fit better than the other model. We have a very significant p-value, so we can reject the null hypothesis, i.e. fit2 provides a better fit than fit

#significant anova(fit2,fit3) Analysis of Variance Table Model 1: y ~ poly(x, 2, raw = TRUE) Model 2: y ~ poly(x, 3, raw = TRUE) Res.Df RSS Df Sum of Sq F Pr(>F) 1 5 454.34 2 4 65.87 1 388.47 23.589 0.008298 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 #not significant anova(fit3,fit4) Analysis of Variance Table Model 1: y ~ poly(x, 3, raw = TRUE) Model 2: y ~ poly(x, 4, raw = TRUE) Res.Df RSS Df Sum of Sq F Pr(>F) 1 4 65.873 2 3 48.955 1 16.918 1.0368 0.3835

We can also create a function that reflects the polynomial equation:

#y=ax + b coef(fit) (Intercept) x 143.0437906 -0.5965753 #y=ax^2 + bx + c coef(fit2) (Intercept) poly(x, 2, raw = TRUE)1 poly(x, 2, raw = TRUE)2 42.51491582 2.01570523 -0.01345808 #y=ax^3 + bx^2 + cx + d coef(fit3) (Intercept) poly(x, 3, raw = TRUE)1 poly(x, 3, raw = TRUE)2 poly(x, 3, raw = TRUE)3 1.269381e+02 -1.626321e+00 2.910221e-02 -1.467589e-04 #y=ax^4 + bx^3 + cx^2 + dx + e coef(fit4) (Intercept) poly(x, 4, raw = TRUE)1 poly(x, 4, raw = TRUE)2 poly(x, 4, raw = TRUE)3 7.473766e+01 1.425813e+00 -2.854370e-02 2.877714e-04 poly(x, 4, raw = TRUE)4 -1.133744e-06 #create function for the third order polynomial third_order <- function(newdist, model) { coefs <- coef(model) #y = d + cx + bx^2 + ax^3 res <- coefs[1] + (coefs[2] * newdist) + (coefs[3] * newdist^2) + (coefs[4] * newdist^3) return(res) } yy <- third_order(xx,fit3) plot(xx,yy,type="l") lines(x,y,col="red")

*The values extrapolated from the third order polynomial has a very good fit to the original values, which we already knew from the R-squared values*.

### Conclusions

For non-linear curve fitting we can use lm() and poly() functions of R, which also provides useful statistics to how well the polynomial functions fits the dataset. We can also assess how well different models are against each other using an analysis of variance test. From the model it is possible to define a function that reflects the polynomial function, which could be used for extrapolating response variables.

xx<-seq(-1000,1000,length=500) yy<-third_order(xx,fit3) plot(xx,yy,type="l")

### See also

Technical note: Curve fitting with the R Environment for Statistical Computing at http://www.css.cornell.edu/faculty/dgr2/teach/R/R_CurveFit.pdf

This work is licensed under a Creative Commons

Attribution 4.0 International License.

Nice post! Have you done any work with fitting functions to data with replicates?

Thanks! No sorry; that’s beyond the scope of my knowledge. Since I couldn’t be of more help, here’s a relevant PhD comic 🙂

This has been really useful, thank you!!

Thanks for the syntax, very useful. I have real date (a small series) I used your syntax, I can get the dots but not the curves. May I ask you why?

These are the data of my experiment and your syntax.

Another question please, what is xx? the x axis?

a <- c(0:9)

y <- c(92,2,3,3,1,5,1,0,0,1)

plot(a,y,pch=19)

#fit first degree polynomial equation:

fit <- lm(y~a)

#second degree

fit2 <- lm(y~poly(a,2,raw=TRUE))

#third degree

fit3 <- lm(y~poly(a,3,raw=TRUE))

#fourth degree

fit4 <- lm(y~poly(a,4,raw=TRUE))

lines(xx, predict(fit, data.frame(a=xx)), col="red")

lines(xx, predict(fit2, data.frame(a=xx)), col="green")

lines(xx, predict(fit3, data.frame(a=xx)), col="green")

Best regards

You left out line 11 of the code.

Thanks for the great post.

Re: “The Pr(>F) value is the probability of rejecting the null hypothesis, where one model does not fit better than the other model.”

Sorry for the noob question, but, isn’t Pr(>F) better described as the probability that the null hypothesis is true? Since a low value is better? i.e. p-value = .001931 = .001931 probability that neither model is better?

Hi Aaron,

thanks for the comment and the question; yes it could have been worded better. My statistics is not as good as I want it to be but here’s my understanding. The F value or F statistic is a ratio that indicates how different two groups are (in this case the two models). If two groups are similar, the F value is around 1, as we see below when when we test

`anova(fit2,fit3)`

. The p-value is the probability of obtaining that F value just by chance; larger F values gives smaller p-values. So in this case, the p-value (0.001931) suggests that the two models are different (our null hypothesis is that they are the same), we can assume that one fits better than the other.Hope that clarifies.

Cheers,

Dave

Hi,

Thanks for this useful info. Do you have any clue for applying the same process for data with other shapes, I mean other than polynomial? For example exponential, sigmoidal, etc.?

Thanks

This was very helpful. I am having trouble taking this one step further. How do I get the equation for the best fit line or model? For example what if I want to solve for x when y=50?

Hello,

firstly thank you for the very clear post!

I have just a question regarding the polynomial curve that you have draw on the graph. Accordingly to the summary(anova) of fit2 (if I am interpreting it correct) The equation should be Y = 45.5 + 2.01X -0.01X^2

Why then in the graph is it different? I have not gone in details but it is clear that the intercept is somewhere around 90..

I keep having this problem when I work with polynomial regression., while with linear summary(anova) gives correct value of intercept and slope.

Thank you for any clarification on this!

Cheers,

Valentino

The x-axis doesn’t start at 0. If you extend it, the y-intercept should be around 45.

Hi, taking things one step further… Once you have your say 3rd degree polynomial plotted, how can you calculate the residual (distance between the point and your line)? I can do this easily with a linear line but don’t know how to calculate it with a polynomial

Using the example in the post:

splendid summary, one of the best “first-to-read” blog for those interested in curve fitting with R. Well done. I have just finished working on this post, looking forward to reading the other articles.

this is an excellent post, just what i was looking for. Thank you for sharing!

Nice post. How can I approximate the X value from given Y value?

That helped and eased my curve fitting so much….thanks a bunch

Very nice post! Super easy to understand and follow. However, I’m having some user error and cannot figure it out. I keep getting this error:

“Error in xy.coords(x, y) : ‘x’ and ‘y’ lengths differ

In addition: Warning message:

‘newdata’ had 50 rows but variables found have 9 rows ”

Your y data doesn’t have 50 rows either, so I cannot figure out for the life of me what I’m doing wrong. Here is my code:

> Sarea str(Sarea)

‘data.frame’: 9 obs. of 2 variables:

$ class : int 45 75 150 212 250 300 355 425 500

$ hdpeSA: num 0.811 0.493 0.299 0.202 0.177 …

> plot(Sarea)

> x y #second degree

> fit2 #third degree

> fit3 #fourth degree

> fit4 #generate range of 50 numbers starting from 30 and ending at 160

> xx plot(x,y,pch=19,ylim=c(0,1))

> lines(xx, predict(fit2, data.frame(x=xx)), col=”red”)

Error in xy.coords(x, y) : ‘x’ and ‘y’ lengths differ

In addition: Warning message:

‘newdata’ had 50 rows but variables found have 9 rows

> lines(xx, predict(fit3, data.frame(x=xx)), col=”blue”)

Error in xy.coords(x, y) : ‘x’ and ‘y’ lengths differ

In addition: Warning message:

‘newdata’ had 50 rows but variables found have 9 rows

> lines(xx, predict(fit4, data.frame(x=xx)), col=”green”)

Error in xy.coords(x, y) : ‘x’ and ‘y’ lengths differ

In addition: Warning message:

‘newdata’ had 50 rows but variables found have 9 rows