Code
theme_set(theme_minimal())
theme_set(theme_minimal())
Variable/Model selection and ANOVA on Whiteside data
We address the following question: was the external temperature distributed in the same way during the two heating seasons? When we raise this question, we silently make modeling assumptions. Spell them out.
To oversimplify, we assume that the weekly outdoor temperature is a random variable, that it is independently and identically distributed over the course of each season. And we ask ourselves whether the two distributions are identical.
To ask a simpler question, we may even assume that the distributions are Gaussian with equal variance and we may test for equality of means.
What kind of hypothesis are we testing in the next two chunks? Interpret the results.
lm_temp <- lm(Temp ~ Insul, whiteside)
lm_temp |>
tidy()
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 5.35 0.537 9.96 7.80e-14
2 InsulAfter -0.887 0.734 -1.21 2.32e- 1
lm_temp |>
glance()
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.0263 0.00830 2.74 1.46 0.232 1 -135. 276. 282.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
We are testing equality of means in the normal model. Indeed we use two equivalent devices to test the null hypothesis Insul==0
. The \(t\)-test (first table) and the Fisher test (second table) coincide (in this setting). Note that the \(p\)-values are identical event though the statistics differ.
The Fisher procedure tests the null hypothesis that all coefficients but the intercept are zero.
If we are ready to accept to err with probability less that \(5\%\) under the null hypothesis, we cannot reject the null hypothesis.
Display parallel boxplots, overlayed cumulative distribution functions and a quantile-quantile plot (QQ-plot) to compare the temperature distributions during the two heating seasons. Comment
whiteside |>
ggplot() +
aes(x=Insul, y=Temp) +
geom_boxplot(varwidth = T,outlier.shape = 3, notch = T) +
ylab(temp_label) +
labs(
title="Whiteside data: Outdoor temperature during two heating seasons"
)
The notches do overlap. There is not enough evidence to rule out that the medians are the same.
What’s behind notch construction?
The comparison of ECDFs allows us to remove the normality assumption.
whiteside |>
ggplot() +
aes(linetype=Insul, x=Temp) +
stat_ecdf() +
xlab(temp_label) +
ylab("Empirical Cumulative Distribution Function") +
ggtitle("Whiteside data: temperatures before and after insulation")
The two ECDFs seem to depart from each other. However, this does not tell us whether this discrepancy is significant. We can complement this plot with a non-parametric test: the two-sample Kolmogorov-Smirnov test.
Warning in ks.test(Temp[Insul == "Before"], Temp[Insul == "After"]): cannot
compute exact p-value with ties
statistic | p.value | method | alternative |
---|---|---|---|
0.3487179 | 0.0675804 | Two-sample Kolmogorov-Smirnov test | two-sided |
Again the \(p\)-value is above \(5\%\).
Besides, the KS test, Wilcoxon test (which is a rank-test) offers another possibility of comparing the two empirical distributions.
wilcox.test(formula= Temp ~ Insul, data=whiteside) |>
tidy() |>
knitr::kable()
Warning in wilcox.test.default(x = c(-0.8, -0.7, 0.4, 2.5, 2.9, 3.2, 3.6, :
cannot compute exact p-value with ties
statistic | p.value | method | alternative |
---|---|---|---|
471 | 0.1858225 | Wilcoxon rank sum test with continuity correction | two.sided |
# DT::datatable()
As we have to infer the dependence on Temperature, the questions turn tricky.
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 4.750000 | 0.1936798 | 24.525013 | 0.00e+00 |
InsulAfter | -1.266667 | 0.2646170 | -4.786792 | 1.36e-05 |
In contrast with our investigation of Temp ~ Insul
, this computation suggests that the mena Gas consumption before and after insulation differ significantly.
parallel_boxplot(whiteside, Insul, Gas) +
labs(title="Gas consumption distribution before and after Insulation")
The notches do not overlap. This suggests that the median of the two distributions are different (but are we ready to assume that Gas consumptions are i.i.d?)
Draw a qqplot
to compare Gas consumptions before and after insulation.
my_qq(whiteside, Insul, "Before", Gas) +
xlab("Gas before") +
ylab("Gas after") +
labs(title = "QQ plot")
Compare ECDFs of Gas consumption before and after insulation.
whiteside |>
ggplot() +
aes(linetype=Insul, x=Gas) +
stat_ecdf() +
xlab(gas_label) +
ylab("Empirical Cumulative Distribution Function") +
ggtitle("Whiteside data: gas consumption before and after insulation")
This consists in assessing whether the Intercept is modified after Insulation while the slope is left unchanged. Which models should be used to assess this hypothesis?
Models Gas ~ Temp
and Gas ~ Insul + Temp
can do the job.
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 6.551329 | 0.1180864 | 55.47911 | 0 |
InsulAfter | -1.565205 | 0.0970534 | -16.12726 | 0 |
Temp | -0.336697 | 0.0177631 | -18.95484 | 0 |
The \(t\)-test on coefficient Insul
is suggestive.
From model lm0
, we can draw to parallel regression lines in the Temp,Gas
plane.
coeffs0 <- as.list(coefficients(lm0))
whiteside |>
ggplot() +
aes(x=Temp, y=Gas) +
geom_point(aes(shape=Insul)) +
geom_abline(slope=coeffs0$Temp, intercept = coeffs0$`(Intercept)`, linetype="dotted") +
geom_abline(slope=coeffs0$Temp, intercept = coeffs0$`(Intercept)`+ coeffs0$InsulAfter, linetype="dashed") +
labs(
title= "Whiteside data two regression lines derived from `Gas ~ Insul + Temp`"
)
The fit looks better.
Draw the disgnostic plots for this model
source('./UTILS/my_diag_plots.R')
#draw_diag_plots(lm0)
patchwork::plot_layout(ncol=2, nrow=2)
$ncol
[1] 2
$nrow
[1] 2
$byrow
NULL
$widths
NULL
$heights
NULL
$guides
NULL
$tag_level
NULL
$design
NULL
attr(,"class")
[1] "plot_layout"
(
make_p_diag_1(lm0) +
make_p_diag_2(lm0) +
make_p_diag_3(lm0) +
make_p_diag_5(lm0)
) + patchwork::plot_annotation(caption=deparse(formula(lm0)))
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
It would be nice to supercharge our diagnostic plots by mapping Insul
on the shape
aesthetic.
Find the formula and build the model.
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 6.8538277 | 0.1359640 | 50.409146 | 0.0000000 |
InsulAfter | -2.1299780 | 0.1800917 | -11.827185 | 0.0000000 |
Temp | -0.3932388 | 0.0224870 | -17.487358 | 0.0000000 |
InsulAfter:Temp | 0.1153039 | 0.0321121 | 3.590665 | 0.0007307 |
From model lm1
, we can draw again two regression lines in the Temp,Gas
plane.
coeffs1 <- as.list(coefficients(lm1))
whiteside |>
ggplot() +
aes(x=Temp, y=Gas) +
geom_point(aes(shape=Insul)) +
geom_abline(slope=coeffs1$Temp, intercept = coeffs1$`(Intercept)`, linetype="dotted") +
geom_abline(slope=coeffs1$Temp + coeffs1$`InsulAfter:Temp`,
intercept = coeffs1$`(Intercept)`+ coeffs1$InsulAfter,
linetype="dashed") +
labs(
title=deparse(formula(lm1))
)
We can redo this plot in pure ggplot
whiteside |>
ggplot() +
aes(x=Temp, y=Gas, group=Insul) +
geom_point(aes(shape=Insul)) +
geom_smooth(formula = 'y ~ x',
data = filter(whiteside, Insul=="Before"),
method="lm", se=F,
linetype="dotted") +
geom_smooth(formula = 'y ~ x',
data = filter(whiteside, Insul=="After"),
method="lm", se=F,
linetype="dashed")
Investigate formulae Gas ~ poly(Temp, 2, raw=T)*Insul
, Gas ~ poly(Temp, 2)*Insul
, Gas ~ (Temp +I(Temp*2))*Insul
, Gas ~ (Temp +I(Temp*2))| Insul
There are several ways to write concisely this model or an equivalent one.
Using orthogonal polynomials
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 6.7592152 | 0.1507868 | 44.8263124 | 0.0000000 |
poly(Temp, 2, raw = T)1 | -0.3176587 | 0.0629652 | -5.0449913 | 0.0000064 |
poly(Temp, 2, raw = T)2 | -0.0084726 | 0.0066247 | -1.2789296 | 0.2068259 |
InsulAfter | -2.2628413 | 0.2203425 | -10.2696530 | 0.0000000 |
poly(Temp, 2, raw = T)1:InsulAfter | 0.1797571 | 0.0964473 | 1.8637855 | 0.0682291 |
poly(Temp, 2, raw = T)2:InsulAfter | -0.0065069 | 0.0099673 | -0.6528248 | 0.5168598 |
Using raw polynomials
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 4.9463633 | 0.0625303 | 79.1034827 | 0.0000000 |
poly(Temp, 2, raw = F)1 | -8.0647633 | 0.4445861 | -18.1399343 | 0.0000000 |
poly(Temp, 2, raw = F)2 | -0.5389000 | 0.4213680 | -1.2789296 | 0.2068259 |
InsulAfter | -1.5894796 | 0.0853040 | -18.6331130 | 0.0000000 |
poly(Temp, 2, raw = F)1:InsulAfter | 2.4464525 | 0.6342874 | 3.8570093 | 0.0003292 |
poly(Temp, 2, raw = F)2:InsulAfter | -0.4138719 | 0.6339708 | -0.6528248 | 0.5168598 |
Compare the model matrices to understand what is going on.
whiteside |>
ggplot() +
aes(x=Temp, y=Gas, group=Insul) +
geom_point(aes(shape=Insul)) +
geom_smooth(formula = 'y ~ x + I(x^2)',
data = filter(whiteside, Insul=="Before"),
method="lm", se=F,
linetype="dotted") +
geom_smooth(formula = 'y ~ x + I(x^2)',
data = filter(whiteside, Insul=="After"),
method="lm", se=F,
linetype="dashed") +
labs(
title=deparse(formula(lm2_a))
)
A third way of sepcifying an equivalent model is:
lm3 |>
tidy() |>
knitr::kable()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 6.7592152 | 0.1507868 | 44.8263124 | 0.0000000 |
Temp | -0.3176587 | 0.0629652 | -5.0449913 | 0.0000064 |
I(Temp^2) | -0.0084726 | 0.0066247 | -1.2789296 | 0.2068259 |
InsulAfter | -2.2628413 | 0.2203425 | -10.2696530 | 0.0000000 |
Temp:InsulAfter | 0.1797571 | 0.0964473 | 1.8637855 | 0.0682291 |
I(Temp^2):InsulAfter | -0.0065069 | 0.0099673 | -0.6528248 | 0.5168598 |
Play it with degree 10 polynomials
Make a named list with the models constructed so far
stepAIC()
to perform stepwise explorationStart: AIC=-124.75
Gas ~ (Temp + I(Temp^2)) * Insul
Df Sum of Sq RSS AIC
- I(Temp^2):Insul 1 0.04152 4.9132 -126.27
<none> 4.8717 -124.75
- Temp:Insul 1 0.33845 5.2101 -122.99
Step: AIC=-126.27
Gas ~ Temp + I(Temp^2) + Insul + Temp:Insul
Df Sum of Sq RSS AIC
<none> 4.9132 -126.27
- I(Temp^2) 1 0.51205 5.4252 -122.72
- Temp:Insul 1 1.45401 6.3672 -113.75
Use fonction anova()
to compare models constructed with formulae
formula(lm0)
Gas ~ Insul + Temp
respons_df <- . %>%
mutate(across(where(is.double), \(x) signif(x, digits=3))) %>%
DT::datatable(extensions = "Responsive")
eqf <- . %>%
enframe() %>%
mutate(Fn=rank(value, ties.method = "max")/n()) %>%
distinct(value, Fn) %>%
arrange(value) %$%
stepfun(x=Fn, y=c(value, max(value)), f=1, right=T)
#
#
my_qq <- function(df, fac, lev, quanti) {
ecdf_lev <- df |>
filter({{fac}}==lev) |>
pull({{quanti}}) |>
ecdf()
eqf_other <- df |>
filter({{fac}}!=lev) |>
pull({{quanti}}) |>
eqf()
df |>
filter({{fac}}==lev) |>
ggplot() +
aes(x={{quanti}},
y=eqf_other(ecdf_lev({{quanti}}))) +
geom_point(fill="white",
color="black",
alpha=.5 ) +
geom_abline(intercept = 0,
slope=1,
linetype="dotted") +
coord_fixed() +
theme_minimal()
}
#
#
parallel_boxplot <- function(df, fac, quant){
df |>
ggplot() +
aes(x={{fac}}, y={{quant}}) +
geom_boxplot(varwidth = T, notch = T) +
theme_minimal()
}