Published

February 15, 2024

Code
theme_set(theme_minimal())

Variable/Model selection and ANOVA on Whiteside data

Challenge(s)

Comparing weekly average temperatures over two seasons

We address the following question: was the external temperature distributed in the same way during the two heating seasons? When we raise this question, we silently make modeling assumptions. Spell them out.

To oversimplify, we assume that the weekly outdoor temperature is a random variable, independently and identically distributed over the course of each season, and we ask whether the two distributions are identical.

To ask a simpler question, we may even assume that the distributions are Gaussian with equal variance and we may test for equality of means.
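Under these Gaussian, equal-variance assumptions, the comparison can be made directly with `t.test()`; this is a sketch of the same test that the chunks below carry out through `lm()` (the `whiteside` data ships with MASS):

```r
library(MASS)  # provides the whiteside data set

# Two-sample t-test assuming equal variances: same null hypothesis as the
# t-test on the InsulAfter coefficient of lm(Temp ~ Insul, whiteside)
tt <- t.test(Temp ~ Insul, data = whiteside, var.equal = TRUE)
tt
```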

What kind of hypothesis are we testing in the next two chunks? Interpret the results.

Code
lm_temp <- lm(Temp ~ Insul, whiteside)

lm_temp |> 
  tidy()
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    5.35      0.537      9.96 7.80e-14
2 InsulAfter    -0.887     0.734     -1.21 2.32e- 1
Code
lm_temp |> 
  glance()
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
1    0.0263       0.00830  2.74      1.46   0.232     1  -135.  276.  282.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

We are testing equality of means in the normal model. Indeed we use two equivalent devices to test the null hypothesis that the coefficient of InsulAfter is zero: the \(t\)-test (first table) and the Fisher test (second table) coincide in this setting. Note that the \(p\)-values are identical even though the statistics differ.

The Fisher procedure tests the null hypothesis that all coefficients but the intercept are zero.

If we are willing to accept an error probability of at most \(5\%\) under the null hypothesis, we cannot reject the null hypothesis.
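In a model with a single coefficient besides the intercept, the two devices are deterministically linked: the overall Fisher statistic is the square of the \(t\) statistic. A quick check:

```r
library(MASS)  # provides the whiteside data set

lm_temp <- lm(Temp ~ Insul, whiteside)
t_stat <- summary(lm_temp)$coefficients["InsulAfter", "t value"]
f_stat <- summary(lm_temp)$fstatistic[["value"]]

# With one non-intercept coefficient, F = t^2
all.equal(f_stat, t_stat^2)
```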

Display parallel boxplots, overlaid cumulative distribution functions, and a quantile-quantile plot (QQ-plot) to compare the temperature distributions during the two heating seasons. Comment.

Code
whiteside |> 
  ggplot() +
  aes(x=Insul, y=Temp) +
  geom_boxplot(varwidth = TRUE, outlier.shape = 3, notch = TRUE) +
  ylab(temp_label) +
  labs(
    title="Whiteside data: Outdoor temperature during two heating seasons"
  )

The notches do overlap. There is not enough evidence to rule out that the medians are the same.

What’s behind notch construction?
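Per the `geom_boxplot()` documentation (which follows `boxplot.stats()`), each notch extends to median ± 1.58·IQR/√n, a rough 95% confidence interval for the median. A sketch of the computation by hand:

```r
library(MASS)   # provides the whiteside data set
library(dplyr)

# Notch bounds per season: median plus/minus 1.58 * IQR / sqrt(n)
notches <- whiteside |>
  group_by(Insul) |>
  summarise(
    lo = median(Temp) - 1.58 * IQR(Temp) / sqrt(n()),
    hi = median(Temp) + 1.58 * IQR(Temp) / sqrt(n())
  )
notches
```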

The comparison of ECDFs allows us to remove the normality assumption.

Code
whiteside |> 
  ggplot() +
  aes(linetype=Insul, x=Temp) +
  stat_ecdf() +
  xlab(temp_label) +
  ylab("Empirical Cumulative Distribution Function") +
  ggtitle("Whiteside data: temperatures before and after insulation")

The two ECDFs seem to depart from each other. However, this does not tell us whether this discrepancy is significant. We can complement this plot with a non-parametric test: the two-sample Kolmogorov-Smirnov test.

Code
with(whiteside,
  ks.test(Temp[Insul=="Before"], Temp[Insul=="After"])) |> 
  tidy() |> 
  knitr::kable()
Warning in ks.test(Temp[Insul == "Before"], Temp[Insul == "After"]): cannot
compute exact p-value with ties
statistic p.value method alternative
0.3487179 0.0675804 Two-sample Kolmogorov-Smirnov test two-sided

Again the \(p\)-value is above \(5\%\).

Perform a Wilcoxon test to assess the change in Temperature between the two seasons

Besides the KS test, the Wilcoxon test (a rank test) offers another way of comparing the two empirical distributions.

Code
wilcox.test(formula= Temp ~ Insul, data=whiteside) |> 
  tidy() |> 
  knitr::kable()
Warning in wilcox.test.default(x = c(-0.8, -0.7, 0.4, 2.5, 2.9, 3.2, 3.6, :
cannot compute exact p-value with ties
statistic p.value method alternative
471 0.1858225 Wilcoxon rank sum test with continuity correction two.sided

Does Insulation matter?

  • Does average Gas consumption change with Insulation?
  • Does Gas consumption dependence on Temperature change with Insulation?

As we have to infer the dependence on Temperature, the questions become tricky.

Compare Gas consumption before and after (leaving Temperature aside)

Code
lm0_0 <- lm(Gas ~ Insul, whiteside)
lm0_0 |> 
  tidy() |> 
  knitr::kable()
term estimate std.error statistic p.value
(Intercept) 4.750000 0.1936798 24.525013 0.00e+00
InsulAfter -1.266667 0.2646170 -4.786792 1.36e-05

In contrast with our investigation of Temp ~ Insul, this computation suggests that the mean Gas consumption before and after insulation differs significantly.

Code
parallel_boxplot(whiteside, Insul, Gas) + 
  labs(title="Gas consumption distribution before and after Insulation")

The notches do not overlap. This suggests that the medians of the two distributions differ (but are we ready to assume that Gas consumptions are i.i.d.?).

Draw a qqplot to compare Gas consumptions before and after insulation.

Code
my_qq(whiteside, Insul, "Before", Gas) +
   xlab("Gas before") +
   ylab("Gas after") +
   labs(title = "QQ plot")

Compare ECDFs of Gas consumption before and after insulation.

Code
whiteside |> 
  ggplot() +
  aes(linetype=Insul, x=Gas) +
  stat_ecdf() +
  xlab(gas_label) +
  ylab("Empirical Cumulative Distribution Function") +
  ggtitle("Whiteside data: gas consumption before and after insulation")

Do Insulation and Temperature additively matter?

This amounts to assessing whether the intercept changes with Insulation while the slope is left unchanged. Which models should be used to assess this hypothesis?

Models Gas ~ Temp and Gas ~ Insul + Temp can do the job.
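A sketch of the corresponding F-test between the two nested models (this is what `anova()` reports when handed both fits):

```r
library(MASS)  # provides the whiteside data set

lm_temp_only <- lm(Gas ~ Temp, whiteside)
lm_additive  <- lm(Gas ~ Insul + Temp, whiteside)

# F-test of H0: the Insul coefficient is zero, once Temp is accounted for
cmp <- anova(lm_temp_only, lm_additive)
cmp
```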

Code
lm0 <- lm(Gas ~ Insul + Temp, whiteside)

lm0 |> 
  tidy() |> 
  knitr::kable()
term estimate std.error statistic p.value
(Intercept) 6.551329 0.1180864 55.47911 0
InsulAfter -1.565205 0.0970534 -16.12726 0
Temp -0.336697 0.0177631 -18.95484 0

The \(t\)-test on coefficient Insul is suggestive.

From model lm0, we can draw two parallel regression lines in the (Temp, Gas) plane.

Code
coeffs0 <- as.list(coefficients(lm0))

whiteside |> 
  ggplot()  +
  aes(x=Temp, y=Gas) +
  geom_point(aes(shape=Insul)) +
  geom_abline(slope=coeffs0$Temp, intercept = coeffs0$`(Intercept)`, linetype="dotted") +
  geom_abline(slope=coeffs0$Temp, intercept = coeffs0$`(Intercept)`+ coeffs0$InsulAfter, linetype="dashed") +
  labs(
    title= "Whiteside data two regression lines derived from `Gas ~ Insul + Temp`"
  )

The fit looks better.

Draw the diagnostic plots for this model

Code
source('./UTILS/my_diag_plots.R')
Code
(
 make_p_diag_1(lm0) +
  make_p_diag_2(lm0) +
  make_p_diag_3(lm0) +
  make_p_diag_5(lm0)
) + patchwork::plot_annotation(caption=deparse(formula(lm0)))

It would be nice to supercharge our diagnostic plots by mapping Insul on the shape aesthetic.
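One way to do it, sketched with `broom::augment()`, which returns residuals and fitted values alongside the original columns, Insul included:

```r
library(MASS)
library(ggplot2)
library(broom)

lm0 <- lm(Gas ~ Insul + Temp, whiteside)

# Residuals vs fitted, with Insul mapped to the shape aesthetic
p_diag <- augment(lm0) |>
  ggplot() +
  aes(x = .fitted, y = .resid, shape = Insul) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dotted") +
  labs(x = "Fitted values", y = "Residuals",
       title = "Residuals vs fitted, Insul mapped to shape")
p_diag
```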

Do Insulation and Temperature matter and interact?

Find the formula and build the model.

Code
lm1 <- lm(Gas ~ Insul * Temp, whiteside)

lm1 |> 
  tidy() |> 
  knitr::kable()
term estimate std.error statistic p.value
(Intercept) 6.8538277 0.1359640 50.409146 0.0000000
InsulAfter -2.1299780 0.1800917 -11.827185 0.0000000
Temp -0.3932388 0.0224870 -17.487358 0.0000000
InsulAfter:Temp 0.1153039 0.0321121 3.590665 0.0007307

From model lm1, we can draw again two regression lines in the Temp,Gas plane.

Code
coeffs1 <- as.list(coefficients(lm1))

whiteside |> 
  ggplot()  +
  aes(x=Temp, y=Gas) +
  geom_point(aes(shape=Insul)) +
  geom_abline(slope=coeffs1$Temp, intercept = coeffs1$`(Intercept)`, linetype="dotted") +
  geom_abline(slope=coeffs1$Temp + coeffs1$`InsulAfter:Temp`, 
              intercept = coeffs1$`(Intercept)`+ coeffs1$InsulAfter, 
              linetype="dashed") +
  labs(
      title=deparse(formula(lm1))
      )

We can redo this plot in pure ggplot:

Code
whiteside |> 
  ggplot()  +
  aes(x=Temp, y=Gas, group=Insul) +
  geom_point(aes(shape=Insul)) +
  geom_smooth(formula = 'y ~ x', 
              data = filter(whiteside, Insul=="Before"), 
              method="lm", se=F, 
              linetype="dotted") + 
  geom_smooth(formula = 'y ~ x', 
              data = filter(whiteside, Insul=="After"), 
              method="lm", se=F, 
              linetype="dashed") 

Do Insulation and powers of temperature interact?

Investigate formulae Gas ~ poly(Temp, 2, raw=T)*Insul, Gas ~ poly(Temp, 2)*Insul, Gas ~ (Temp + I(Temp^2))*Insul, Gas ~ (Temp + I(Temp^2)) | Insul

There are several ways to write concisely this model or an equivalent one.

Using raw polynomials

Code
lm2 <- lm(Gas ~ poly(Temp, 2, raw=T)*Insul, whiteside)

lm2 |> 
  tidy() |> 
  knitr::kable()
term estimate std.error statistic p.value
(Intercept) 6.7592152 0.1507868 44.8263124 0.0000000
poly(Temp, 2, raw = T)1 -0.3176587 0.0629652 -5.0449913 0.0000064
poly(Temp, 2, raw = T)2 -0.0084726 0.0066247 -1.2789296 0.2068259
InsulAfter -2.2628413 0.2203425 -10.2696530 0.0000000
poly(Temp, 2, raw = T)1:InsulAfter 0.1797571 0.0964473 1.8637855 0.0682291
poly(Temp, 2, raw = T)2:InsulAfter -0.0065069 0.0099673 -0.6528248 0.5168598

Using orthogonal polynomials

Code
lm2_a <- lm(Gas ~ poly(Temp, 2, raw=F)*Insul, whiteside)

lm2_a |> 
  tidy() |> 
  knitr::kable()
term estimate std.error statistic p.value
(Intercept) 4.9463633 0.0625303 79.1034827 0.0000000
poly(Temp, 2, raw = F)1 -8.0647633 0.4445861 -18.1399343 0.0000000
poly(Temp, 2, raw = F)2 -0.5389000 0.4213680 -1.2789296 0.2068259
InsulAfter -1.5894796 0.0853040 -18.6331130 0.0000000
poly(Temp, 2, raw = F)1:InsulAfter 2.4464525 0.6342874 3.8570093 0.0003292
poly(Temp, 2, raw = F)2:InsulAfter -0.4138719 0.6339708 -0.6528248 0.5168598

Compare the model matrices to understand what is going on.
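A sketch of that comparison: the raw and orthogonal parameterisations have different columns but span the same space, so the fitted values coincide.

```r
library(MASS)  # provides the whiteside data set

lm_raw  <- lm(Gas ~ poly(Temp, 2, raw = TRUE) * Insul, whiteside)
lm_orth <- lm(Gas ~ poly(Temp, 2) * Insul, whiteside)

# The columns of the two model matrices differ...
head(model.matrix(lm_raw), 2)
head(model.matrix(lm_orth), 2)

# ...but they span the same column space: identical fitted values
all.equal(fitted(lm_raw), fitted(lm_orth))
```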

Code
whiteside |> 
  ggplot()  +
  aes(x=Temp, y=Gas, group=Insul) +
  geom_point(aes(shape=Insul)) +
  geom_smooth(formula = 'y ~ x + I(x^2)', 
              data = filter(whiteside, Insul=="Before"), 
              method="lm", se=F, 
              linetype="dotted") + 
  geom_smooth(formula = 'y ~ x + I(x^2)', 
              data = filter(whiteside, Insul=="After"), 
              method="lm", se=F, 
              linetype="dashed") +
  labs(
    title=deparse(formula(lm2_a))
  )

A third way of specifying an equivalent model is:

Code
lm3 <- lm(Gas ~ (Temp + I(Temp^2))*Insul, data = whiteside)
Code
lm3 |> 
  tidy() |> 
  knitr::kable()
term estimate std.error statistic p.value
(Intercept) 6.7592152 0.1507868 44.8263124 0.0000000
Temp -0.3176587 0.0629652 -5.0449913 0.0000064
I(Temp^2) -0.0084726 0.0066247 -1.2789296 0.2068259
InsulAfter -2.2628413 0.2203425 -10.2696530 0.0000000
Temp:InsulAfter 0.1797571 0.0964473 1.8637855 0.0682291
I(Temp^2):InsulAfter -0.0065069 0.0099673 -0.6528248 0.5168598

Higher degree polynomials

Play the same game with degree-10 polynomials

Code
lm10 <- lm(Gas ~ poly(Temp, 10)*Insul, whiteside)
lm10 |> 
  broom::tidy() |> 
  respons_df()

Drying model exploration

Collecting the models a posteriori

Make a named list with the models constructed so far

Code
lm_names <- ls()[str_starts(ls(), "lm")]

lms <- lapply(as.list(lm_names), get)
names(lms) <- lm_names
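With such a named list, model-level comparisons are one `sapply()` away; a minimal, self-contained sketch with a few refitted models (the names mirror those used above):

```r
library(MASS)  # provides the whiteside data set

lms_demo <- list(
  lm0_0 = lm(Gas ~ Insul, whiteside),
  lm0   = lm(Gas ~ Insul + Temp, whiteside),
  lm1   = lm(Gas ~ Insul * Temp, whiteside)
)

# Rank the candidate models by AIC (smaller is better)
sort(sapply(lms_demo, AIC))
```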

Use stepAIC() to perform stepwise exploration

Code
whi.stp <- stepAIC(lm3, scope = list(lower = formula(lm0_0), upper = formula(lm3)), trace = 3)
Start:  AIC=-124.75
Gas ~ (Temp + I(Temp^2)) * Insul

                  Df Sum of Sq    RSS     AIC
- I(Temp^2):Insul  1   0.04152 4.9132 -126.27
<none>                         4.8717 -124.75
- Temp:Insul       1   0.33845 5.2101 -122.99

Step:  AIC=-126.27
Gas ~ Temp + I(Temp^2) + Insul + Temp:Insul

             Df Sum of Sq    RSS     AIC
<none>                    4.9132 -126.27
- I(Temp^2)   1   0.51205 5.4252 -122.72
- Temp:Insul  1   1.45401 6.3672 -113.75
Code
anova(whi.stp) |> 
  broom::tidy() |> 
  respons_df()

ANOVA table(s)

Use function anova() to compare models constructed with formulae

Code
formula(lm0)
Gas ~ Insul + Temp
Code
anova(lm0, lm3) |> 
    broom::tidy() |> 
    respons_df()
Code
anova(lm0, lm2) |> 
  broom::tidy() |> 
  respons_df()

Wikipedia on Analysis of Variance

Appendix

respons_df <-  . %>% 
  mutate(across(where(is.double), \(x) signif(x, digits=3))) %>% 
  DT::datatable(extensions = "Responsive")
eqf <- . %>% 
  enframe() %>% 
  mutate(Fn=rank(value, ties.method = "max")/n()) %>% 
  distinct(value, Fn) %>%  
  arrange(value) %$%
  stepfun(x=Fn, y=c(value, max(value)), f=1, right=T)
#
#
my_qq <- function(df, fac, lev, quanti) {

  ecdf_lev <- df |>
    filter({{fac}} == lev) |>
    pull({{quanti}}) |>
    ecdf()

  eqf_other <- df |>
    filter({{fac}} != lev) |>
    pull({{quanti}}) |>
    eqf()

  df |>
    filter({{fac}} == lev) |>
    ggplot() +
    aes(x = {{quanti}},
        y = eqf_other(ecdf_lev({{quanti}}))) +
    geom_point(fill = "white",
               color = "black",
               alpha = .5) +
    geom_abline(intercept = 0,
                slope = 1,
                linetype = "dotted") +
    coord_fixed() +
    theme_minimal()
}

#
#
parallel_boxplot <- function(df, fac, quant){
  
  df |> 
    ggplot() +
    aes(x={{fac}}, y={{quant}}) +
    geom_boxplot(varwidth = TRUE, notch = TRUE) +
    theme_minimal()
  
}