Breaking Predict for lm() with dollar.sign

As is often the case with \(R\), there are many ways to do things that are equivalent or nearly equivalent. It is the nearly equivalent part that is frustrating; one of the first encounters with this can come with attempts to predict a regression. The ultimate source of troubles is scoping and environments; the use of the $ syntax sometimes has unintended side effects.

lm() Syntax is Important

I will refer to an example from a recent homework on regression. The data cover total costs per period and Fertilizer [sugar and wheat] production in a factory in Tennessee. I will embed the data in the first code chunk.

Fertilizer <- structure(list(TotalCost = c(656173, 619058, 668676, 622302, 
632286, 672717, 602916, 643074, 552836, 613314, 700822, 552576, 
579878, 668761, 735812, 665795, 670082, 550066, 593947, 613121, 
609388, 735423, 536458, 638920, 755232, 613023, 543013, 620659, 
538868, 588713, 601981, 558313, 640537, 637230, 567228, 532145, 
672828, 649695, 663317, 586153, 611113, 612919, 654498, 651475, 
507135, 625520, 617906, 573349, 657173, 629154), Wheat = c(6061, 
3478, 6192, 7245, 4641, 7469, 5778, 6282, 1833, 1200, 7863, 1764, 
1200, 6132, 9107, 7281, 4616, 1330, 5262, 3506, 6064, 8893, 2542, 
5212, 9081, 6062, 1633, 5222, 1200, 2495, 1431, 1200, 6272, 5729, 
1406, 2122, 6344, 7511, 6459, 1563, 4418, 1200, 4351, 6388, 1721, 
6339, 1200, 1520, 6625, 5042), Sugar = c(2875, 5636, 3535, 1817, 
4820, 1400, 2397, 2806, 6791, 8854, 1747, 6686, 8393, 3183, 1400, 
1746, 5393, 7396, 3402, 6023, 2446, 1400, 5877, 4790, 1400, 2654, 
7619, 3967, 7762, 6592, 8217, 7783, 2534, 3250, 7433, 5828, 3124, 
1400, 2478, 8046, 4056, 8648, 5191, 2818, 6292, 2169, 8473, 7378, 
2472, 4344), Total.Production = c(8936, 9114, 9727, 9062, 9461, 
8869, 8175, 9088, 8624, 10054, 9610, 8450, 9593, 9315, 10507, 
9027, 10009, 8726, 8664, 9529, 8510, 10293, 8419, 10002, 10481, 
8716, 9252, 9189, 8962, 9087, 9648, 8983, 8806, 8979, 8839, 7950, 
9468, 8911, 8937, 9609, 8474, 9848, 9542, 9206, 8013, 8508, 9673, 
8898, 9097, 9386)), row.names = c(NA, 50L), class = "data.frame")
summary(Fertilizer)
##    TotalCost          Wheat          Sugar      Total.Production
##  Min.   :507135   Min.   :1200   Min.   :1400   Min.   : 7950   
##  1st Qu.:586793   1st Qu.:1732   1st Qu.:2492   1st Qu.: 8814   
##  Median :619858   Median :5127   Median :4200   Median : 9088   
##  Mean   :620872   Mean   :4510   Mean   :4655   Mean   : 9165   
##  3rd Qu.:655754   3rd Qu.:6325   3rd Qu.:6765   3rd Qu.: 9580   
##  Max.   :755232   Max.   :9107   Max.   :8854   Max.   :10507

A Linear Regression: lm()

There are a few ways to deploy lm to estimate linear models. We will require the formula syntax to make full use of R’s capabilities. I will work with the threeT ways that deploy $ syntax, the formula syntax, and new variables. In all three cases, I want TotalCost as a function of Total.Production.

Model.Form <- lm(TotalCost~Total.Production, data=Fertilizer)

First, the above formula syntax is really the only correct way to do it to assure full functionality to R’s lm(). We could also refer to each variable in their location$name form as I do next.

Dollar.Form <- lm(Fertilizer$TotalCost~Fertilizer$Total.Production)

Finally, I could create new variables and put them together in a regression.

Costs <- Fertilizer$TotalCost
Total.Production <- Fertilizer$Total.Production
Nothing.Form <- lm(Costs~Total.Production)

Now the model summaries, they are identical.

summary(Model.Form)
## 
## Call:
## lm(formula = TotalCost ~ Total.Production, data = Fertilizer)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -83035 -34050   1934  38114  69332 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      78579.72   92318.38   0.851    0.399    
## Total.Production    59.17      10.05   5.886 3.74e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41750 on 48 degrees of freedom
## Multiple R-squared:  0.4192, Adjusted R-squared:  0.4071 
## F-statistic: 34.65 on 1 and 48 DF,  p-value: 3.74e-07
summary(Dollar.Form)
## 
## Call:
## lm(formula = Fertilizer$TotalCost ~ Fertilizer$Total.Production)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -83035 -34050   1934  38114  69332 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 78579.72   92318.38   0.851    0.399    
## Fertilizer$Total.Production    59.17      10.05   5.886 3.74e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41750 on 48 degrees of freedom
## Multiple R-squared:  0.4192, Adjusted R-squared:  0.4071 
## F-statistic: 34.65 on 1 and 48 DF,  p-value: 3.74e-07
summary(Nothing.Form)
## 
## Call:
## lm(formula = Costs ~ Total.Production)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -83035 -34050   1934  38114  69332 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      78579.72   92318.38   0.851    0.399    
## Total.Production    59.17      10.05   5.886 3.74e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41750 on 48 degrees of freedom
## Multiple R-squared:  0.4192, Adjusted R-squared:  0.4071 
## F-statistic: 34.65 on 1 and 48 DF,  p-value: 3.74e-07

So are the confidence intervals for the slope(s) and intercept and anova will generally work as expected.

confint(Model.Form)
##                          2.5 %       97.5 %
## (Intercept)      -107038.80975 264198.25859
## Total.Production      38.96039     79.38555
confint(Dollar.Form)
##                                     2.5 %       97.5 %
## (Intercept)                 -107038.80975 264198.25859
## Fertilizer$Total.Production      38.96039     79.38555
confint(Nothing.Form)
##                          2.5 %       97.5 %
## (Intercept)      -107038.80975 264198.25859
## Total.Production      38.96039     79.38555

All three behave exactly as desired.

Predicting require Formulae

Using formula everything behaves as expected. Let me predict the confidence interval of average total costs. I do this in two steps. First, I create a new data.frame that contains only one variable: Total.Production. It only takes the value of 9000. That is the average costs given units produced that I wish to predict.

Hypothetical.Data <- data.frame(Total.Production=9000)
predict(Model.Form, newdata = Hypothetical.Data, interval="confidence")
##        fit      lwr      upr
## 1 611136.4 598809.3 623463.5

That works. With 95% confidence, producing 9000 units should lead to average total costs between $598809 and $623463. Unfortunately, this is the only one that will work. Now let me try to do the same thing with the other two identical lm() examples.

predict(Dollar.Form, newdata = Hypothetical.Data, interval="confidence")
##         fit      lwr      upr
## 1  607349.4 594612.3 620086.5
## 2  617882.1 605968.2 629796.1
## 3  654155.2 637718.7 670591.6
## 4  614805.1 602755.6 626854.7
## 5  638415.2 625118.2 651712.2
## 6  603384.8 590096.5 616673.0
## 7  562318.7 539060.8 585576.6
## 8  616343.6 604373.2 628314.1
## 9  588887.4 572754.8 605020.0
## 10 673504.7 651961.0 695048.5
## 11 647231.9 632333.1 662130.8
## 12 578591.3 559896.9 597285.6
## 13 646226.0 631532.2 660919.7
## 14 629775.9 617522.3 642029.5
## 15 700310.1 670692.4 729927.8
## 16 612734.1 600542.9 624925.3
## 17 670841.9 650051.2 691632.7
## 18 594923.0 580108.8 609737.3
## 19 591254.3 575657.9 606850.7
## 20 642438.9 628468.5 656409.3
## 21 582141.7 564367.5 599915.8
## 22 687647.1 661933.8 713360.3
## 23 576756.9 557574.4 595939.5
## 24 670427.7 649753.0 691102.4
## 25 698771.6 669634.6 727908.5
## 26 594331.3 579395.2 609267.4
## 27 626048.0 614047.0 638049.1
## 28 622320.1 610439.7 634200.5
## 29 608887.9 596331.8 621443.9
## 30 616284.5 604311.4 628257.5
## 31 649480.5 634105.3 664855.7
## 32 610130.5 597706.3 622554.7
## 33 599656.9 585749.6 613564.1
## 34 609893.8 597445.5 622342.1
## 35 601609.6 588037.9 615181.2
## 36 549004.8 521737.0 576272.6
## 37 638829.4 625468.0 652190.7
## 38 605870.0 592941.1 618799.0
## 39 607408.5 594678.7 620138.3
## 40 647172.8 632286.1 662059.4
## 41 580011.4 561689.3 598333.6
## 42 661315.1 643101.1 679529.1
## 43 643208.2 629097.4 657318.9
## 44 623326.1 611426.4 635225.7
## 45 552732.7 526605.5 578860.0
## 46 582023.3 564219.1 599827.6
## 47 650959.8 635258.6 666661.1
## 48 605100.8 592065.5 618136.1
## 49 616876.2 604927.9 628824.5
## 50 633977.2 621291.0 646663.4

That gives me back the confidence interval of the average total costs for each actual value of production in the data [those are stored in the lm() object as fitted.values – the points on the line/plane. That is an effect of the syntax; the newdata bit cannot really replace Fertilizer$Total.Production with the single value 9000 without rewriting the data itself. Nor is there a variable called Fertilizer$Total.Production inside Hypothetical.Data; there cannot be because that violates the conventions of naming and the syntactical use of $.

What about my other strategy? Peel off production and costs and then use lm().

New variables?

predict(Nothing.Form, newdata = Hypothetical.Data, interval="confidence")
##        fit      lwr      upr
## 1 611136.4 598809.3 623463.5

That has the same problem as before. The underlying problem is that prediction operates at the level of a set of data and those are data.frame in R. Only this syntax will make it possible to deploy the predict function of linear models because the predict function basically just substitutes the name of the data.frame into the calculation and it has to match precisely, in both structure and name.

Effects Plots

Last time, we looked at some basic regression plots. In this case,

ggplot(Fertilizer, aes(x=Total.Production, y=TotalCost)) + geom_point() + geom_smooth(method="lm")

It is important to note that this is an x-y scatterplot. What happens if we try that for the two input example? First, let me show a simple plot that should capture what is going on.

ggplot2::ggplot(Fertilizer, aes(x=Sugar, y=Wheat, color=TotalCost, size=TotalCost)) + geom_point() + scale_size_continuous(guide=FALSE) + labs(title="Total Cost for Wheat/Sugar Pairs")

Now, estimate the regression.

Mod.SW <- lm(TotalCost ~ Sugar+Wheat, data=Fertilizer)

The plots?

ggplot2::ggplot(Fertilizer, aes(x=Wheat, y=TotalCost)) + geom_point() + geom_smooth(method="lm")

ggplot2::ggplot(Fertilizer, aes(x=Sugar, y=TotalCost)) + geom_point() + geom_smooth(method="lm")

Uh oh. How to fix that? We need to show it holding wheat units constant. That’s a partial plot, or in R an effects plot.

library(effects)
plot(allEffects(Mod.SW, partial.residuals=TRUE))

Notice the difference in the y-axes, it obscures the relevant difference in the slopes. Fix the limits across plots?

library(effects)
plot(allEffects(Mod.SW, partial.residuals=TRUE), ylim=c(300000,1000000))

Other Plots

par(mfrow=c(2,2))
plot(Mod.SW)

Avatar
Robert W. Walker
Associate Professor of Quantitative Methods

My research interests include causal inference, statistical computation and data visualization.

Next
Previous