Breaking Predict for lm() with dollar.sign
As is often the case with \(R\), there are many ways to do things that are equivalent or nearly equivalent. It is the nearly equivalent part that is frustrating; one of the first encounters with this can come with attempts to predict a regression. The ultimate source of troubles is scoping and environments; the use of the $
syntax sometimes has unintended side effects.
lm() Syntax is Important
I will refer to an example from a recent homework on regression. The data cover total costs per period and Fertilizer
[sugar and wheat] production in a factory in Tennessee. I will embed the data in the first code chunk.
Fertilizer <- structure(list(TotalCost = c(656173, 619058, 668676, 622302,
632286, 672717, 602916, 643074, 552836, 613314, 700822, 552576,
579878, 668761, 735812, 665795, 670082, 550066, 593947, 613121,
609388, 735423, 536458, 638920, 755232, 613023, 543013, 620659,
538868, 588713, 601981, 558313, 640537, 637230, 567228, 532145,
672828, 649695, 663317, 586153, 611113, 612919, 654498, 651475,
507135, 625520, 617906, 573349, 657173, 629154), Wheat = c(6061,
3478, 6192, 7245, 4641, 7469, 5778, 6282, 1833, 1200, 7863, 1764,
1200, 6132, 9107, 7281, 4616, 1330, 5262, 3506, 6064, 8893, 2542,
5212, 9081, 6062, 1633, 5222, 1200, 2495, 1431, 1200, 6272, 5729,
1406, 2122, 6344, 7511, 6459, 1563, 4418, 1200, 4351, 6388, 1721,
6339, 1200, 1520, 6625, 5042), Sugar = c(2875, 5636, 3535, 1817,
4820, 1400, 2397, 2806, 6791, 8854, 1747, 6686, 8393, 3183, 1400,
1746, 5393, 7396, 3402, 6023, 2446, 1400, 5877, 4790, 1400, 2654,
7619, 3967, 7762, 6592, 8217, 7783, 2534, 3250, 7433, 5828, 3124,
1400, 2478, 8046, 4056, 8648, 5191, 2818, 6292, 2169, 8473, 7378,
2472, 4344), Total.Production = c(8936, 9114, 9727, 9062, 9461,
8869, 8175, 9088, 8624, 10054, 9610, 8450, 9593, 9315, 10507,
9027, 10009, 8726, 8664, 9529, 8510, 10293, 8419, 10002, 10481,
8716, 9252, 9189, 8962, 9087, 9648, 8983, 8806, 8979, 8839, 7950,
9468, 8911, 8937, 9609, 8474, 9848, 9542, 9206, 8013, 8508, 9673,
8898, 9097, 9386)), row.names = c(NA, 50L), class = "data.frame")
summary(Fertilizer)
## TotalCost Wheat Sugar Total.Production
## Min. :507135 Min. :1200 Min. :1400 Min. : 7950
## 1st Qu.:586793 1st Qu.:1732 1st Qu.:2492 1st Qu.: 8814
## Median :619858 Median :5127 Median :4200 Median : 9088
## Mean :620872 Mean :4510 Mean :4655 Mean : 9165
## 3rd Qu.:655754 3rd Qu.:6325 3rd Qu.:6765 3rd Qu.: 9580
## Max. :755232 Max. :9107 Max. :8854 Max. :10507
A Linear Regression: lm()
There are a few ways to deploy lm to estimate linear models. We will require the formula syntax to make full use of R’s capabilities. I will work with the threeT ways that deploy $
syntax, the formula syntax, and new variables. In all three cases, I want TotalCost as a function of Total.Production.
Model.Form <- lm(TotalCost~Total.Production, data=Fertilizer)
First, the above formula
syntax is really the only correct way to do it to assure full functionality to R’s lm(). We could also refer to each variable in their location$name form as I do next.
Dollar.Form <- lm(Fertilizer$TotalCost~Fertilizer$Total.Production)
Finally, I could create new variables and put them together in a regression.
Costs <- Fertilizer$TotalCost
Total.Production <- Fertilizer$Total.Production
Nothing.Form <- lm(Costs~Total.Production)
Now the model summaries, they are identical.
summary(Model.Form)
##
## Call:
## lm(formula = TotalCost ~ Total.Production, data = Fertilizer)
##
## Residuals:
## Min 1Q Median 3Q Max
## -83035 -34050 1934 38114 69332
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 78579.72 92318.38 0.851 0.399
## Total.Production 59.17 10.05 5.886 3.74e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41750 on 48 degrees of freedom
## Multiple R-squared: 0.4192, Adjusted R-squared: 0.4071
## F-statistic: 34.65 on 1 and 48 DF, p-value: 3.74e-07
summary(Dollar.Form)
##
## Call:
## lm(formula = Fertilizer$TotalCost ~ Fertilizer$Total.Production)
##
## Residuals:
## Min 1Q Median 3Q Max
## -83035 -34050 1934 38114 69332
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 78579.72 92318.38 0.851 0.399
## Fertilizer$Total.Production 59.17 10.05 5.886 3.74e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41750 on 48 degrees of freedom
## Multiple R-squared: 0.4192, Adjusted R-squared: 0.4071
## F-statistic: 34.65 on 1 and 48 DF, p-value: 3.74e-07
summary(Nothing.Form)
##
## Call:
## lm(formula = Costs ~ Total.Production)
##
## Residuals:
## Min 1Q Median 3Q Max
## -83035 -34050 1934 38114 69332
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 78579.72 92318.38 0.851 0.399
## Total.Production 59.17 10.05 5.886 3.74e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41750 on 48 degrees of freedom
## Multiple R-squared: 0.4192, Adjusted R-squared: 0.4071
## F-statistic: 34.65 on 1 and 48 DF, p-value: 3.74e-07
So are the confidence intervals for the slope(s) and intercept and anova will generally work as expected.
confint(Model.Form)
## 2.5 % 97.5 %
## (Intercept) -107038.80975 264198.25859
## Total.Production 38.96039 79.38555
confint(Dollar.Form)
## 2.5 % 97.5 %
## (Intercept) -107038.80975 264198.25859
## Fertilizer$Total.Production 38.96039 79.38555
confint(Nothing.Form)
## 2.5 % 97.5 %
## (Intercept) -107038.80975 264198.25859
## Total.Production 38.96039 79.38555
All three behave exactly as desired.
Predicting require Formulae
Using formula
everything behaves as expected. Let me predict the confidence interval of average total costs. I do this in two steps. First, I create a new data.frame that contains only one variable: Total.Production. It only takes the value of 9000. That is the average costs given units produced that I wish to predict.
Hypothetical.Data <- data.frame(Total.Production=9000)
predict(Model.Form, newdata = Hypothetical.Data, interval="confidence")
## fit lwr upr
## 1 611136.4 598809.3 623463.5
That works. With 95% confidence, producing 9000 units should lead to average total costs between $598809 and $623463. Unfortunately, this is the only one that will work. Now let me try to do the same thing with the other two identical lm()
examples.
predict(Dollar.Form, newdata = Hypothetical.Data, interval="confidence")
## fit lwr upr
## 1 607349.4 594612.3 620086.5
## 2 617882.1 605968.2 629796.1
## 3 654155.2 637718.7 670591.6
## 4 614805.1 602755.6 626854.7
## 5 638415.2 625118.2 651712.2
## 6 603384.8 590096.5 616673.0
## 7 562318.7 539060.8 585576.6
## 8 616343.6 604373.2 628314.1
## 9 588887.4 572754.8 605020.0
## 10 673504.7 651961.0 695048.5
## 11 647231.9 632333.1 662130.8
## 12 578591.3 559896.9 597285.6
## 13 646226.0 631532.2 660919.7
## 14 629775.9 617522.3 642029.5
## 15 700310.1 670692.4 729927.8
## 16 612734.1 600542.9 624925.3
## 17 670841.9 650051.2 691632.7
## 18 594923.0 580108.8 609737.3
## 19 591254.3 575657.9 606850.7
## 20 642438.9 628468.5 656409.3
## 21 582141.7 564367.5 599915.8
## 22 687647.1 661933.8 713360.3
## 23 576756.9 557574.4 595939.5
## 24 670427.7 649753.0 691102.4
## 25 698771.6 669634.6 727908.5
## 26 594331.3 579395.2 609267.4
## 27 626048.0 614047.0 638049.1
## 28 622320.1 610439.7 634200.5
## 29 608887.9 596331.8 621443.9
## 30 616284.5 604311.4 628257.5
## 31 649480.5 634105.3 664855.7
## 32 610130.5 597706.3 622554.7
## 33 599656.9 585749.6 613564.1
## 34 609893.8 597445.5 622342.1
## 35 601609.6 588037.9 615181.2
## 36 549004.8 521737.0 576272.6
## 37 638829.4 625468.0 652190.7
## 38 605870.0 592941.1 618799.0
## 39 607408.5 594678.7 620138.3
## 40 647172.8 632286.1 662059.4
## 41 580011.4 561689.3 598333.6
## 42 661315.1 643101.1 679529.1
## 43 643208.2 629097.4 657318.9
## 44 623326.1 611426.4 635225.7
## 45 552732.7 526605.5 578860.0
## 46 582023.3 564219.1 599827.6
## 47 650959.8 635258.6 666661.1
## 48 605100.8 592065.5 618136.1
## 49 616876.2 604927.9 628824.5
## 50 633977.2 621291.0 646663.4
That gives me back the confidence interval of the average total costs for each actual value of production in the data [those are stored in the lm() object as fitted.values
– the points on the line/plane. That is an effect of the syntax; the newdata
bit cannot really replace Fertilizer$Total.Production
with the single value 9000 without rewriting the data itself. Nor is there a variable called Fertilizer$Total.Production
inside Hypothetical.Data
; there cannot be because that violates the conventions of naming and the syntactical use of $
.
What about my other strategy? Peel off production and costs and then use lm()
.
New variables?
predict(Nothing.Form, newdata = Hypothetical.Data, interval="confidence")
## fit lwr upr
## 1 611136.4 598809.3 623463.5
That has the same problem as before. The underlying problem is that prediction operates at the level of a set of data and those are data.frame in R. Only this syntax will make it possible to deploy the predict function of linear models because the predict function basically just substitutes the name of the data.frame into the calculation and it has to match precisely, in both structure and name.
Effects Plots
Last time, we looked at some basic regression plots. In this case,
ggplot(Fertilizer, aes(x=Total.Production, y=TotalCost)) + geom_point() + geom_smooth(method="lm")
It is important to note that this is an x-y scatterplot. What happens if we try that for the two input example? First, let me show a simple plot that should capture what is going on.
ggplot2::ggplot(Fertilizer, aes(x=Sugar, y=Wheat, color=TotalCost, size=TotalCost)) + geom_point() + scale_size_continuous(guide=FALSE) + labs(title="Total Cost for Wheat/Sugar Pairs")
Now, estimate the regression.
Mod.SW <- lm(TotalCost ~ Sugar+Wheat, data=Fertilizer)
The plots?
ggplot2::ggplot(Fertilizer, aes(x=Wheat, y=TotalCost)) + geom_point() + geom_smooth(method="lm")
ggplot2::ggplot(Fertilizer, aes(x=Sugar, y=TotalCost)) + geom_point() + geom_smooth(method="lm")
Uh oh. How to fix that? We need to show it holding wheat units constant. That’s a partial plot, or in R an effects
plot.
library(effects)
plot(allEffects(Mod.SW, partial.residuals=TRUE))
Notice the difference in the y-axes, it obscures the relevant difference in the slopes. Fix the limits across plots?
library(effects)
plot(allEffects(Mod.SW, partial.residuals=TRUE), ylim=c(300000,1000000))
Other Plots
par(mfrow=c(2,2))
plot(Mod.SW)