Curve fitting on batches in the tidyverse: R, dplyr, and broom
Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
I recently needed to fit curves on several sets of similar data, measured from different sensors. I found how to achieve this with dplyr, without needing to define outside functions or use forloops. This approach integrates perfectly with my usual dplyr and ggplot2 workflows, which means it adapts to new data or new experimental conditions with no changes. Here are the readymade recipes for any one else who may run into a similar problem.
First we’ll generate some artificial data so that you can follow along at home. Here’s an exponentially decaying value measured by two sensors at different locations:
# if needed
install.packages(c("tidyverse", "broom"))
library(tidyverse)
library(broom)
t = 1:100
y1 = 22 + (53  22) * exp(0.02 * t) %>% jitter(10)
y2 = 24 + (60  22) * exp(0.01 * t) %>% jitter(10)
df < data.frame(t = t, y = y1, sensor = 'sensor1') %>%
rbind(. , data.frame(t = t, y = y2, sensor = 'sensor2'))
Our data frame looks like this:
df
t  y  sensor 

1  52.25302  sensor1 
2  51.71440  sensor1 
3  51.13971  sensor1 
…  …  … 
98  38.13328  sensor2 
99  38.17017  sensor2 
100  37.72184  sensor2 
And our data looks like this:
qplot(t, y, data = df, colour = sensor)
I created the data in long format, as it works best with dplyr pipelines. You can read more about long format vs wide format data here.
TL;DR
I want to plot the fitted curve over my data points
augmented < df %>%
group_by(sensor) %>%
do(fit = nls(y ~ a * t + b, data = .)) %>%
augment(fit)
qplot(t, y, data = augmented, geom = 'point', colour = sensor) +
geom_line(aes(y=.fitted))
I want to plot the residuals
qplot(t, .resid, data = augmented, colour = sensor)
qplot(.fitted, .resid, data = augmented, colour = sensor)
I want a table of my fit parameters for each condition:
df %>%
group_by(sensor) %>%
do(fit = nls(y ~ a * t + b, data = .)) %>%
tidy(fit) %>%
select(sensor, term, estimate) %>%
spread(term, estimate)
sensor  a  b  

1  sensor1  0.2495347  47.87376 
2  sensor2  0.2350898  59.77793 
I want a less horrendous fit. How can I do better?
Try with an actual exponential. You’ll probably want a selfstarting function to avoid the “singular gradient” error — read more in my post on the subject.
Explanation: intro to curve fitting in R
The goal of our fitting example was to find an estimate $\hat{y}(t) = a t + b$ that approximates our measured data $y$. In R, we can directly write that we want to approximate $y$ as a function $a \cdot t + b$, using the very intuitive builtin formula syntax:
y ~ a * t + b
In R’s documentation, $a$ and $b$ are the coefficients, $t$ is the independent variable and $y$ is the dependent variable — $t$ can go about its day without every knowing $y$ exists, but $y$ can’t progress without knowing the latest trend in $t$.
We fit these models using various fitting functions: lm
or glm
for linear models or nls
for nonlinear least squares for example. I’m using nls
here because I can specify the names of my coefficients, which I find clearer to explain:
# Select the data from our first sensor:
sensor1 < df %>% filter(sensor == "sensor1")
# Fit our model:
fit < nls(y ~ a * t + b, data = sensor1)
# nls works best if we specify an initial guess for the coefficients,
# the 'start' point for the optimisation:
fit < nls(y ~ a * t + b, data = sensor1, start = list(a = 1, b = 10))
The fitting functions all return a fit object:
> summary(fit)
Formula: y ~ a * t + b
Parameters:
Estimate Std. Error t value Pr(>t)
a 0.249535 0.006377 39.13 <2e16 ***
b 47.873759 0.370947 129.06 <2e16 ***

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.841 on 98 degrees of freedom
Number of iterations to convergence: 1
Achieved convergence tolerance: 1.571e09
You can use one of these base R functions to extract useful data from the fit object:
coef(fit)
returns the values of the coefficients $a$ and $b$.predict(fit)
returns $\hat{y}(t_i)$, i.e. it applies the fitted model to each of the original data points $t_i$. It can also be applied on new values of $t$.resid(fit)
returns the residuals $y_i – \hat{y}(t_i)$ at each point of the original data.
Try them out in the console:
> coef(fit)
a b
0.2495347 47.8737585
> predict(fit)
[1] 47.62422 47.37469 47.12515 46.87562 46.62609 46.37655 ...
> predict(fit, newdata = data.frame(t = 101:200))
[1] 22.67076 22.42122 22.17169 21.92215 21.67262 21.42309 ...
> resid(fit)
[1] 4.628798 4.339712 4.014558 3.576258 3.528138 3.001447 ...
Now these functions all return vectors, which work best with R’s native plotting functions. To integrate with dplyr and ggplot, we’d rather have data frames. This is where the broom package comes in.
Introducing broom
Broom is a separate R package that feeds on fit results and produces useful data frames. Install it directly within the R console if you haven’t already:
install.packages('broom')
library(broom)
It provides three useful functions, which are easier to demonstrate than to explain. All of them return data frames:
 tidy extracts the fit coefficients and related information. The term column contains the name of your coefficient, and estimate contains its fitted value:
tidy(fit)
term  estimate  std.error  statistic  p.value 

a  0.2495347  0.006377179  39.12932  1.267667e61 
b  47.8737585  0.370946838  129.05827  3.123848e111 
 augment is equivalent to
predict
andresid
combined. The returned data frame contains columnst
andy
with your original data, as well as.fitted
, the fitted curve, and.resid
, the residuals:
augment(fit)
t  y  .fitted  .resid 

1  52.25302  47.62422  4.62879752 
2  51.71440  47.37469  4.33971163 
3  51.13971  47.12515  4.01455804 
4  50.45188  46.87562  3.57625801 
…  …  …  … 
You can plot the ‘augmented’ fit with qplot:
qplot(t, y, data = augment(fit)) + geom_line(aes(y = .fitted))
qplot(t, .resid, data = augment(fit))
 glance shows fit statistics:
glance(fit)
sigma  isConv  finTol  logLik  AIC  BIC  deviance  df.residual 

1.840841  TRUE  1.571375e09  201.906  409.8119  417.6274  332.0921  98 
Using broom with dplyr
Since broom’s functions return data frames, they integrate naturally with magrittr pipelines (%>%
) and dplyr. Let’s illustrate by grouping the data by sensor id, then producing a fit object for each group. The do
function calls nls
once for each group, then stores the return value in a column we named fit
:
fitted < df %>%
group_by(sensor) %>%
do(fit = nls(y ~ a * t + b, data = ., start = list(a = 1, b = 1)))
fitted
sensor  fit  

1  sensor1 

2  sensor2 

We now have one nls object per group, which can be fed into either tidy
or augment
. These functions act on each group separately:
fitted %>%
tidy(fit)
sensor  term  estimate  std.error  statistic  p.value 

sensor1  a  0.2495347  0.006377179  39.12932  1.267667e61 
sensor1  b  47.8737585  0.370946839  129.05827  3.123849e111 
sensor2  a  0.2350898  0.003088217  76.12478  5.400611e89 
sensor2  b  59.7779291  0.179634964  332.77446  1.931798e151 
From here, we can use select
and spread
to generate a more succinct table, as in the TL;DR example, or keep the table asis to inspect the quality of the fit.
The result of augment
is a long data frame that can be used immediately with qplot, also shown in the TL;DR section above:
fitted %>%
augment(fit)
sensor  t  y  .fitted  .resid 

sensor1  1  52.41684  47.58220  4.834645067 
sensor1  2  51.64323  47.33347  4.309757950 
sensor1  3  51.09595  47.08475  4.011196179 
…  …  …  …  … 
sensor2  98  38.06029  36.72428  1.336007642 
sensor2  99  37.86891  36.48926  1.379642201 
sensor2  100  37.89936  36.25425  1.645117019 
That’s it for broom and dplyr. The magic of this approach is that you can add a sensor to your input data then rerun all your code; your new curves will just appear on the plots. Similarly, you can change the formula of your fit, and your new coefficients will appear as a new column in the coefficients table. No need to change your code.
References
 R documentation: https://stat.ethz.ch/Rmanual/Rdevel/library/stats/html/nls.html
 Data transformation cheat sheet: https://github.com/rstudio/cheatsheets/raw/master/datatransformation.pdf
 Stack overflow: https://stackoverflow.com/questions/24356683/applygroupedmodelbackontodata
 Stack overflow: https://stackoverflow.com/questions/22713325/fittingseveralregressionmodelswithdplyr
Rbloggers.com offers daily email updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/datascience job.
Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.