whiteside <- MASS::whiteside # no need to load the whole packagecur_dataset <-str_to_title(as.character(substitute(whiteside)))# ?whiteside
Mr Derek Whiteside of the UK Building Research Station recorded the weekly gas consumption and average external temperature at his own house in south-east England for two heating seasons, one of 26 weeks before, and one of 30 weeks after cavity-wall insulation was installed. The object of the exercise was to assess the effect of the insulation on gas consumption.
C <- whiteside %>%select(where(is.numeric)) %>%cov()# Covariance between Gas and Tempmu_n <- whiteside %>%select(where(is.numeric)) %>%colMeans()# mu_n # Mean vector
Hand-made calculatoin of simple linear regression estimates for Gas versus Temp
Solution
Code
b <- C[1,2] / C[1,1] # slope a <- whiteside %$%# exposing pipe from magrittr (mean(Gas) - b *mean(Temp)) # intercept# with(whiteside,# mean(Gas) - b * mean(Temp))
Overlay the scatterplot with the regression line.
Solution
Code
p +geom_abline(slope=b, intercept = a) +ggtitle(glue("{cur_dataset} data"), subtitle ="Least square regression line")
Using lm()
lm stands for Linear Models. Function lm has a number of arguments, including:
formula
data
Solution
Code
lm0 <-lm(Gas ~ Temp, data = whiteside)
The result is an object of class lm. The generic function summary() has a method for class lm
Code
lm0 %>%summary()
Call:
lm(formula = Gas ~ Temp, data = whiteside)
Residuals:
Min 1Q Median 3Q Max
-1.6324 -0.7119 -0.2047 0.8187 1.5327
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.4862 0.2357 23.275 < 2e-16 ***
Temp -0.2902 0.0422 -6.876 6.55e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8606 on 54 degrees of freedom
Multiple R-squared: 0.4668, Adjusted R-squared: 0.457
F-statistic: 47.28 on 1 and 54 DF, p-value: 6.545e-09
The summary is made of four parts
The call. Very useful if we handle many different models (corresponding to different formulae, or different datasets)
A numerical summary of residuals
A commented display of the estimated coefficients
Estimate of noise scale (under Gaussian assumptions)
Squared linear correlation coefficient between response variable \(Y\) (Gas) and predictions \(\widehat{Y}\)
A test statistic (Fisher’s statistic) for assessing null hypothesis that slope is null, and corresponding \(p\)-value (under Gaussian assumptions).
Including a rough summary in a report is not always a good idea. It is easy to extract a tabular version of the summary using functions tidy() and glance() from package broom.
For html output DT::datatable() allows us to polish the final output
(diag_1 + diag_2 + diag_3 +guide_area()) +plot_layout(guides="collect") +plot_annotation(title=glue("{cur_dataset} dataset"),subtitle =glue("Regression diagnostic {deparse(lm0$call$formula)}"), caption ='The fact that the sign of residuals depend on Insul shows that our modeling is too naive.\n The qqplot suggests that the residuals are not collected from Gaussian homoschedastic noise.' )
Taking into account Insulation
Design a formula that allows us to take into account the possible impact of Insulation. Insulation may impact the relation between weekly Gas consumption and average external Temperature in two ways. Insulation may modify the Intercept, it may also modify the slope, that is the sensitivity of Gas consumption with respect to average external Temperature.
Have a look at formula documentation (?formula).
Solution
Code
lm1 <-lm(Gas ~ Temp * Insul, data = whiteside)
Check the design using function model.matrix(). How can you relate this augmented design and the one-hot encoding of variable Insul?
Column .sigma contains the leave-one-out estimates of \(\sigma\), that is whiteside_aug1$.sigma[i] is the estimate of \(\sigma\) you obtain by leaving out the i row of the dataframe.
There is no need to recompute everything for each sample element.