name: inter-slide class: left, middle, inverse {{ content }} --- name: layout-general layout: true class: left, middle <style> .remark-slide-number { position: inherit; } .remark-slide-number .progress-bar-container { position: absolute; bottom: 0; height: 4px; display: block; left: 0; right: 0; } .remark-slide-number .progress-bar { height: 100%; background-color: red; } </style>
--- template: inter-slide # Multilinear Regression I #### [Master I MIDS & MFA]() #### [Analyse Exploratoire de Données](http://stephane-v-boucheron.fr/courses/eda/) #### [Stéphane Boucheron](http://stephane-v-boucheron.fr) --- template: inter-slide ##
### [Revisiting simple linear regression](#rev-slr) ### [Simple linear regression per group](#slr-group) ### [Formulae and model building](#formulae) ### [Higher order modeling](#polynomials) ### [ANOVA](#anova) --- template: inter-slide name: rev-slr ## Revisiting simple linear regression --- ### House Insulation: Whiteside's Data
.fl.w-third.pa2[ ```r whiteside <- MASS::whiteside whiteside %>% dplyr::sample_n(5) %>% knitr::kable() ``` <table> <thead> <tr> <th style="text-align:left;"> Insul </th> <th style="text-align:right;"> Temp </th> <th style="text-align:right;"> Gas </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> After </td> <td style="text-align:right;"> 0.8 </td> <td style="text-align:right;"> 4.6 </td> </tr> <tr> <td style="text-align:left;"> After </td> <td style="text-align:right;"> 7.5 </td> <td style="text-align:right;"> 2.6 </td> </tr> <tr> <td style="text-align:left;"> After </td> <td style="text-align:right;"> 7.1 </td> <td style="text-align:right;"> 3.0 </td> </tr> <tr> <td style="text-align:left;"> Before </td> <td style="text-align:right;"> 6.0 </td> <td style="text-align:right;"> 4.4 </td> </tr> <tr> <td style="text-align:left;"> After </td> <td style="text-align:right;"> 4.9 </td> <td style="text-align:right;"> 4.0 </td> </tr> </tbody> </table> ] .fl.w-two-thirds.pa2[ Mr Derek Whiteside of the UK Building Research Station recorded the weekly gas consumption and average external temperature at his own house in south-east England for two heating seasons, one of 26 weeks before, and one of 30 weeks after cavity-wall insulation was installed. The object of the exercise was to assess the effect of the insulation on gas consumption. `Temp`: Purportedly the average outside temperature in degrees Celsius. `Gas`: The weekly gas consumption in 1000s of cubic feet. Venables, W. N. and Ripley, B. D. (2002) _Modern Applied Statistics with S._ Fourth edition. Springer. ] --- ### Multiple goals - simple linear regression of `Gas` (weekly consumption) with respect to `Temp` (average external temperature) leaves us with mixed feelings
- there is a need to incorporate the influence of `Insul` into our model

- there are few a priori reasons to believe that the relationship between gas consumption and external temperature should be linear

- there is a need for methods to compare different models and, ideally, to select one in a principled way (a preview sketch follows this list)
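As a preview, here is a minimal sketch (not part of the original slides) of how these concerns translate into `lm()` calls on the `whiteside` data; variants of these models, and the `anova()` comparison, appear later in the lecture:

```r
# Candidate models of increasing richness (sketch)
m_slr   <- lm(Gas ~ Temp, data = whiteside)                               # simple linear regression
m_group <- lm(Gas ~ Temp * Insul, data = whiteside)                       # interaction with insulation
m_quad  <- lm(Gas ~ poly(Temp, 2, raw = TRUE) * Insul, data = whiteside)  # quadratic trend per group

anova(m_slr, m_group, m_quad)   # Fisher tests for the nested comparisons
```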
---
We perform again simple linear regression on the `whiteside` data

We look for `\(\beta_0, \beta_1\)` that minimize the quadratic error

`$$\sum_{i=1}^n \left( \mathrm{Gas}_i - \beta_0 -\beta_1 \mathrm{Temp}_i\right)^2$$`

In the parlance of mathematical statistics, we proceed as if we were fitting a _Gaussian Linear Model_ with _homoscedastic_ noise:

`$$\begin{bmatrix} \mathrm{Gas}_1 \\ \vdots \\ \mathrm{Gas}_{56} \end{bmatrix} = \begin{bmatrix} 1 & \mathrm{Temp}_1 \\ \vdots & \vdots \\ 1 & \mathrm{Temp}_{56} \end{bmatrix} \times \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \sigma \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_{56} \end{bmatrix}$$`

where `\(\epsilon_i \sim_{i.i.d.} \mathcal{N}(0,1)\)`

We aim at estimating `\(\beta_0, \beta_1\)` and `\(\sigma>0\)`

---
### Invoking `lm()`

Evaluating

```r
lm0 <- lm(Gas ~ Temp, data=whiteside)
```

returns an object `lm0` of class `lm` (for linear models)

An object of class `lm` is a _list_ of named objects

Among those objects, `coefficients`, which can be retrieved either as

- `lm0$coefficients` or
- `coefficients(lm0)`

contains what we call the _Ordinary Least Squares_ estimator of the linear regression of `Gas` versus `Temp`

--

The _Intercept_ `\(\widehat{\beta}_0\)` and the _slope_ `\(\widehat{\beta}_1\)` define the simple regression line

---
### Simple linear regression of `Gas ~ Temp`

.fl.w-third.pa2[
The sign of residuals is determined by `Insul` This is not what we would expect if the connection between `Gas` consumption and average outside temperature `Temp` were independent of insulation. This modeling deserves to be challenged ] .fl.w-two-thirds.pa2[ <img src="cm-4-EDA_files/figure-html/whitesidesimple-1.png" width="504" /> ] --- count: false ### Whiteside step by step .panel1-my_whiteside-auto[ ```r *whiteside ``` ] .panel2-my_whiteside-auto[ ``` ## Insul Temp Gas ## 1 Before -0.8 7.2 ## 2 Before -0.7 6.9 ## 3 Before 0.4 6.4 ## 4 Before 2.5 6.0 ## 5 Before 2.9 5.8 ## 6 Before 3.2 5.8 ## 7 Before 3.6 5.6 ## 8 Before 3.9 4.7 ## 9 Before 4.2 5.8 ## 10 Before 4.3 5.2 ## 11 Before 5.4 4.9 ## 12 Before 6.0 4.9 ## 13 Before 6.0 4.3 ## 14 Before 6.0 4.4 ## 15 Before 6.2 4.5 ## 16 Before 6.3 4.6 ## 17 Before 6.9 3.7 ## 18 Before 7.0 3.9 ## 19 Before 7.4 4.2 ## 20 Before 7.5 4.0 ## 21 Before 7.5 3.9 ## 22 Before 7.6 3.5 ## 23 Before 8.0 4.0 ## 24 Before 8.5 3.6 ## 25 Before 9.1 3.1 ## 26 Before 10.2 2.6 ## 27 After -0.7 4.8 ## 28 After 0.8 4.6 ## 29 After 1.0 4.7 ## 30 After 1.4 4.0 ## 31 After 1.5 4.2 ## 32 After 1.6 4.2 ## 33 After 2.3 4.1 ## 34 After 2.5 4.0 ## 35 After 2.5 3.5 ## 36 After 3.1 3.2 ## 37 After 3.9 3.9 ## 38 After 4.0 3.5 ## 39 After 4.0 3.7 ## 40 After 4.2 3.5 ## 41 After 4.3 3.5 ## 42 After 4.6 3.7 ## 43 After 4.7 3.5 ## 44 After 4.9 3.4 ## 45 After 4.9 3.7 ## 46 After 4.9 4.0 ## 47 After 5.0 3.6 ## 48 After 5.3 3.7 ## 49 After 6.2 2.8 ## 50 After 7.1 3.0 ## 51 After 7.2 2.8 ## 52 After 7.5 2.6 ## 53 After 8.0 2.7 ## 54 After 8.7 2.8 ## 55 After 8.8 1.3 ## 56 After 9.7 1.5 ``` ] --- count: false ### Whiteside step by step .panel1-my_whiteside-auto[ ```r whiteside %>% * ggplot() ``` ] .panel2-my_whiteside-auto[ <img src="cm-4-EDA_files/figure-html/my_whiteside_auto_02_output-1.png" width="504" /> ] --- count: false ### Whiteside step by step .panel1-my_whiteside-auto[ ```r whiteside %>% ggplot() + * aes(x = Temp) ``` ] .panel2-my_whiteside-auto[ <img src="cm-4-EDA_files/figure-html/my_whiteside_auto_03_output-1.png" width="504" /> ] --- count: false ### Whiteside step by step .panel1-my_whiteside-auto[ ```r whiteside %>% ggplot() + aes(x = Temp) + * aes(y = Gas) ``` ] .panel2-my_whiteside-auto[ <img src="cm-4-EDA_files/figure-html/my_whiteside_auto_04_output-1.png" width="504" /> ] --- count: false ### Whiteside step by step .panel1-my_whiteside-auto[ ```r whiteside %>% ggplot() + aes(x = Temp) + aes(y = Gas) + * geom_smooth( * method="lm", * formula = y ~ x, * se=FALSE * ) ``` ] .panel2-my_whiteside-auto[ <img src="cm-4-EDA_files/figure-html/my_whiteside_auto_05_output-1.png" width="504" /> ] --- count: false ### Whiteside step by step .panel1-my_whiteside-auto[ ```r whiteside %>% ggplot() + aes(x = Temp) + aes(y = Gas) + geom_smooth( method="lm", formula = y ~ x, se=FALSE ) + * geom_point( * alpha = .8, * size = 2, * aes(color = Insul, * shape = Insul), * ) ``` ] .panel2-my_whiteside-auto[ <img src="cm-4-EDA_files/figure-html/my_whiteside_auto_06_output-1.png" width="504" /> ] --- count: false ### Whiteside step by step .panel1-my_whiteside-auto[ ```r whiteside %>% ggplot() + aes(x = Temp) + aes(y = Gas) + geom_smooth( method="lm", formula = y ~ x, se=FALSE ) + geom_point( alpha = .8, size = 2, aes(color = Insul, shape = Insul), ) + * ggtitle("Whiteside data: Simple Linear Regression") ``` ] .panel2-my_whiteside-auto[ <img src="cm-4-EDA_files/figure-html/my_whiteside_auto_07_output-1.png" width="504" /> ] --- count: false ### Whiteside step by step 
.panel1-my_whiteside-auto[ ```r whiteside %>% ggplot() + aes(x = Temp) + aes(y = Gas) + geom_smooth( method="lm", formula = y ~ x, se=FALSE ) + geom_point( alpha = .8, size = 2, aes(color = Insul, shape = Insul), ) + ggtitle("Whiteside data: Simple Linear Regression") + * labs(subtitle="Residuals should not be so predictable") ``` ] .panel2-my_whiteside-auto[ <img src="cm-4-EDA_files/figure-html/my_whiteside_auto_08_output-1.png" width="504" /> ] <style> .panel1-my_whiteside-auto { color: black; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-my_whiteside-auto { color: black; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-my_whiteside-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- template: inter-slide name: slr-group ## Simple linear regression per group --- ### Simple linear regression per group SLR per group is equivalent to fitting a _multiple linear regression model_, (at least from a statistical viewpoint) It is equivalent to fitting the next model `$$\begin{bmatrix}\mathrm{Gas}_1 \\ \vdots \\ \mathrm{Gas}_{56} \end{bmatrix} = \begin{bmatrix} 1 & 0 & \mathrm{Temp}_1 & 0\\ \vdots & \vdots & \vdots & \vdots \\ 1 & 0 & \mathrm{Temp}_{26} & 0\\ 0 & 1 & 0 & \mathrm{Temp}_{27} \\ \vdots & \vdots & \vdots & \vdots \\ 0 & 1 & 0 & \mathrm{Temp}_{56} \end{bmatrix} \times \begin{bmatrix} \beta_{0,b} \\ \beta_{0,a} \\ \beta_{1, b} \\ \beta_{1, a} \end{bmatrix} + \sigma_b \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_{26}\\ 0 \\ \vdots \\ 0 \end{bmatrix} + \sigma_a \begin{bmatrix} 0 \\ \vdots \\ 0 \\ \epsilon_{27}\\ \vdots \\ \epsilon_{56} \end{bmatrix}$$` --- ### Simple linear regression per group .fl.w-third.pa2[ Like multiple linear regression with mildly _heteroschedastic_ noise Noise standard deviation is assumed to be constant within each group The residuals are now centered within each group Visually, more appealing than simple linear regression ] .fl.w-two-thirds.pa2[ ![](cm-4-EDA_files/figure-html/whitesidesimplebygroup-1.png) ] --- ### Subsetting with `lm()` To perform linear regression of the subset of rows defined by `Insul == After`, the next expression works: ```r whiteside %>% lm(formula = Gas ~ Temp, data= ., subset= Insul=='After') ```
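For comparison, here is a hedged sketch (not from the original slides) that fits one simple regression per insulation status in a single pipeline; it assumes `purrr` is available, as it is elsewhere in this deck:

```r
# Sketch: one simple linear regression per level of Insul
whiteside %>%
  split(.$Insul) %>%                             # list of two data frames: Before / After
  purrr::map(~ lm(Gas ~ Temp, data = .x)) %>%    # fit Gas ~ Temp within each group
  purrr::map(coefficients)                       # per-group intercepts and slopes
```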
In the subsetting chunk above, we use the pronoun `.` to connect the `data` argument of `lm()` with the left-hand side of the pipe `%>%`

This is necessary since `lm` belongs to base `R` and does not comply with `tidyverse` conventions

---
### One-hot encoding of `Insul`

We could also use the _design_ defined by formula `Gas ~ Temp * Insul`.

This amounts to using the `treatment` _contrast_ for encoding the two-valued factor `Insul` with `Before` as _control_ and `After` as _treatment_.

In modern machine learning language, this is called _one-hot encoding_ of the qualitative variable `Insul`

In this modeling attempt, we envision interactions between `Insul` and the `Intercept`, and between `Insul` and `Temp`

---
### One-hot encoding of `Insul` (continued)

We investigate whether insulation influences the amount of gas used to make the house comfortable at temperature `\(0\)`, and whether it influences the sensitivity of weekly gas consumption to external temperature

This is at least physically plausible

`$$\begin{bmatrix} \mathrm{Gas}_1 \\ \vdots \\ \mathrm{Gas}_{56}\end{bmatrix} = \begin{bmatrix} 1 & 0 & \mathrm{Temp}_1 & 0\\ \vdots & \vdots & \vdots & \vdots \\ 1 & 0 & \mathrm{Temp}_{26} & 0\\ 1 & 1 & \mathrm{Temp}_{27} & \mathrm{Temp}_{27} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 1 & \mathrm{Temp}_{56} & \mathrm{Temp}_{56}\end{bmatrix} \times \begin{bmatrix} \beta_{0,b} \\ \beta_{0,a} \\ \beta_{1, b} \\ \beta_{1, a} \end{bmatrix} + \sigma \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_{56} \end{bmatrix}$$`

```r
whiteside %>%
* lm(formula = Gas ~ Temp * Insul, data = .)
```

---
### Linear regression after one-hot encoding of `Insul`

.fl.w-third.pa2[

TODO: `ggPredict`

]

.fl.w-two-thirds.pa2[

![](cm-4-EDA_files/figure-html/whitesideonehot-1.png)

]

---
template: inter-slide
name: formulae

## About formulae

---

Objects of class `formula` are used in many corners of the `R` environment.

In multiple linear regression, formulae are used to define a design starting from the columns of a dataframe

.bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[

The models fit by, e.g., the `lm` and `glm` functions are specified in a compact symbolic form

The `~` operator is basic in the formation of such models

An expression of the form `y ~ model` is interpreted as a specification that the response `y` is modelled by a linear predictor specified symbolically by `model`

Such a model consists of a series of _terms_ separated by `+` operators

The terms themselves consist of variable and factor names separated by `:` operators

Such a term is interpreted as the interaction of all the variables and factors appearing in the term

.tr[
documentation] ] --- .bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[ In addition to `+` and `:`, a number of other operators are useful in model formulae. The `*` operator denotes factor crossing: `a*b` interpreted as `a+b+a:b`. The `^` operator indicates crossing to the specified degree. For example `(a+b+c)^2` is identical to `(a+b+c)*(a+b+c)` which in turn expands to a formula containing the main effects for `a`, `b` and `c` together with their second-order interactions. .tr[
documentation] ] --- .bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[ The `%in%` operator indicates that the terms on its left are nested within those on the right. For example `a + b %in% a` expands to the formula `a + a:b`. The `-` operator removes the specified terms, so that `(a+b+c)^2 - a:b` is identical to `a + b + c + b:c + a:c`. It can also be used to remove the intercept term: when fitting a linear model `y ~ x - 1` specifies a line through the origin. A model with no intercept can be also specified as `y ~ x + 0` or `y ~ 0 + x`. .tr[
documentation] ] --- ### Formulae with arithmetic expressions .bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[ The formula `log(y) ~ a + log(x)` is legal. When such arithmetic expressions involve operators which are also used symbolically in model formulae, there can be confusion between arithmetic and symbolic operator use. To avoid this confusion, the function `I()` can be used to bracket those portions of a model formula where the operators are used in their arithmetic sense. For example, in the formula `y ~ a + I(b+c)`, the term `b+c` is to be interpreted as the sum of `b` and `c`. Variable names can be quoted by backticks `like this` in formulae, although there is no guarantee that all code using formulae will accept such non-syntactic names .tr[From [
documentation](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html)]

]

---

.fl.w-third.pa2[

Formulae are so useful when modeling that they have been given a dedicated class and even a package: `Formula` (Extended Model Formulas)

]

.fl.w-two-thirds.pa2[

```r
knitr::include_url("https://rdrr.io/cran/Formula/")
```

]

---

By taking into account interactions between `Temp` and `Insul`, we have improved the appeal of our model.

Now we face two questions:

1. can we go further and improve data fitting?
1. can we formally assess the apparent improvement delivered by a richer model?

One question concerns the possibility of describing richer models

The second question is about model assessment/validation/selection

The first question is about programming, the second takes us into inferential statistics territory

---
template: inter-slide
name: polynomials

## Higher degree polynomials

---
### Building an enriched design

.fl.w-two-thirds.pa2[

Rather than searching for a _linear connection_ between `Gas` and `Temp`, we might as well look for a _polynomial connection_

In the next chunk, we add powers of `Temp` to dataframe `whiteside`

```r
poly(whiteside$Temp, degree=10, raw=TRUE) %>%
  as_tibble() -> whiteside_matrix

names(whiteside_matrix) <- paste("Temp", 1:10, sep="^")

whiteside_10 <- cbind(whiteside, whiteside_matrix)
```

]

.fl.w-third.pa2[

We fit a degree `\(10\)` polynomial to the data

```r
lm_10 <- whiteside_10 %>%
  lm(formula = Gas ~ . - Temp - Insul)
```

We are looking for a predictor of `Gas` based on powers of `Temp`

]

---
### A paradox

.fl.w-third.pa2[

Attempting to fit a degree `\(10\)` polynomial to the `whiteside` data leads to:

- Goodness of fit is improved

- Plausibility, and thus explanatory power, is degraded: it is hard to envision a non-monotonic relationship between gas consumption and average external temperature

]

.fl.w-two-thirds.pa2[

<img src="cm-4-EDA_files/figure-html/whitesideten-label-1.png" width="504" />

]

---
### High order modeling with interaction

.fl.w-third.pa2[

We can play that game further by fitting a high-degree polynomial of `Temperature` with interaction with `Insulation`

We fit a degree `\(10\)` polynomial with interaction with `Insul` to the `whiteside` data: a model with `\(22\)` coefficients (degrees of freedom) for `\(56\)` observations

Goodness of fit is visually improved but physical plausibility and explanatory power remain bad.

]

.fl.w-two-thirds.pa2[

<img src="cm-4-EDA_files/figure-html/whiteside10insul-1.png" width="504" />

]

---
### The best of both worlds

.fl.w-third.pa2[

Fit a degree `\(2\)` polynomial with interaction w.r.t. `Insul`

When fitting a degree `\(2\)` polynomial, we face new questions:

- should we prefer quadratic modelling to linear modelling?

- should we retain interaction with `Insul` for both subgroups?
Visual inspection does not provide us with clear-cut guidelines

]

.fl.w-two-thirds.pa2[

<img src="cm-4-EDA_files/figure-html/whiteside2Insul-1.png" width="504" />

]

---
### GLM interpretation

The figure corresponds to least-squares fitting of the following model

`$$\begin{bmatrix}\mathrm{Gas}_1 \\ \vdots \\ \mathrm{Gas}_{56} \end{bmatrix} = \begin{bmatrix} 1 & 0 & \mathrm{Temp}_1 & 0 & \mathrm{Temp}^2_1 & 0\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\ 1 & 0 & \mathrm{Temp}_{26} & 0 & \mathrm{Temp}^2_{26} & 0\\ 1 & 1 & \mathrm{Temp}_{27} & \mathrm{Temp}_{27} & \mathrm{Temp}^2_{27} & \mathrm{Temp}^2_{27} \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & 1 & \mathrm{Temp}_{56} & \mathrm{Temp}_{56} & \mathrm{Temp}^2_{56} & \mathrm{Temp}^2_{56} \end{bmatrix} \times \begin{bmatrix} \beta_{0,b} \\ \beta_{0,a} \\ \beta_{1, b} \\ \beta_{1, a} \\ \beta_{2, b} \\ \beta_{2, a}\end{bmatrix} + \sigma \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_{56}\end{bmatrix}$$`

We still assume _homoscedastic_ noise

---
exclude: true

```r
require("parsnip")
require(Formula)
require(flipbookr)
```

```
## Loading required package: flipbookr
```

```r
formule <- Formula(Gas ~ Temp * Insul)

lm_mod <- linear_reg() %>%
  set_engine("lm")

lm_fit <- lm_mod %>%
  fit(formule, data = whiteside)
```

```r
# model.matrix(lm_2) %>% View()
```

---
exclude:true

```r
pacman::p_load(ggfortify)
```

```r
autoplot(lm_2, which = 6)
```

<img src="cm-4-EDA_files/figure-html/unnamed-chunk-11-1.png" width="504" />

---
template: inter-slide
name: anova

## ANOVA

---

[ANOVA](https://en.wikipedia.org/wiki/Analysis_of_variance) stands for ANalysis Of VAriance.

It provides us with a method for choosing and assessing a model, at least if we adhere to the _Gaussian Linear Modeling_ credo.

ANOVA is just the tip of a large iceberg.

It has been extended to more general models and has inspired _model selection methods_ in many areas of statistics and machine learning.

---
template: inter-slide

## Variable selection in Gaussian Linear Models

---

In the sequel of this section, we use simulated data to introduce and motivate the ANOVA method.

We generate

- a design matrix `\(Z \in \mathcal{M}_{n,p}\)` with `\(n=1000\)` and `\(p=100\)`,
- a sparse random vector `\(\beta \in \mathbb{R}^p\)`
- a vector of observations `\(Y = Z \times \beta + \sigma \epsilon\)` where `\(\sigma=1\)` and `\(\epsilon\)` is a standard Gaussian vector of dimension `\(n\)`.
--- count: false ### Preparing ANOVA ```r *set.seed(1515) # for reproducibility ``` --- count: false ### Preparing ANOVA ```r set.seed(1515) # for reproducibility *p <- 100 # dimension of sample points ``` --- count: false ### Preparing ANOVA ```r set.seed(1515) # for reproducibility p <- 100 # dimension of sample points *n <- 1000 # sample size ``` --- count: false ### Preparing ANOVA ```r set.seed(1515) # for reproducibility p <- 100 # dimension of sample points n <- 1000 # sample size *sigma <- 1 # noise standard deviation ``` --- count: false ### Preparing ANOVA ```r set.seed(1515) # for reproducibility p <- 100 # dimension of sample points n <- 1000 # sample size sigma <- 1 # noise standard deviation *betas <- matrix(rbinom(p, 1, 0.1) * rnorm(p, mean=1), * nrow=p, ncol=1) # true coefficient vector ``` --- count: false ### Preparing ANOVA ```r set.seed(1515) # for reproducibility p <- 100 # dimension of sample points n <- 1000 # sample size sigma <- 1 # noise standard deviation betas <- matrix(rbinom(p, 1, 0.1) * rnorm(p, mean=1), nrow=p, ncol=1) # true coefficient vector *important_coeffs <- which(abs(betas) >.01) # sparse support ``` --- count: false ### Preparing ANOVA ```r set.seed(1515) # for reproducibility p <- 100 # dimension of sample points n <- 1000 # sample size sigma <- 1 # noise standard deviation betas <- matrix(rbinom(p, 1, 0.1) * rnorm(p, mean=1), nrow=p, ncol=1) # true coefficient vector important_coeffs <- which(abs(betas) >.01) # sparse support *Z <- matrix(rnorm(n*p, 5), ncol = p, nrow=n) # design ``` --- count: false ### Preparing ANOVA ```r set.seed(1515) # for reproducibility p <- 100 # dimension of sample points n <- 1000 # sample size sigma <- 1 # noise standard deviation betas <- matrix(rbinom(p, 1, 0.1) * rnorm(p, mean=1), nrow=p, ncol=1) # true coefficient vector important_coeffs <- which(abs(betas) >.01) # sparse support Z <- matrix(rnorm(n*p, 5), ncol = p, nrow=n) # design *Y <- Z %*% betas + matrix(rnorm(n, 0, sd=sigma), nrow=n) ``` --- count: false ### Preparing ANOVA ```r set.seed(1515) # for reproducibility p <- 100 # dimension of sample points n <- 1000 # sample size sigma <- 1 # noise standard deviation betas <- matrix(rbinom(p, 1, 0.1) * rnorm(p, mean=1), nrow=p, ncol=1) # true coefficient vector important_coeffs <- which(abs(betas) >.01) # sparse support Z <- matrix(rnorm(n*p, 5), ncol = p, nrow=n) # design Y <- Z %*% betas + matrix(rnorm(n, 0, sd=sigma), nrow=n) *df <- as.data.frame(cbind(Z, Y), ) ``` --- count: false ### Preparing ANOVA ```r set.seed(1515) # for reproducibility p <- 100 # dimension of sample points n <- 1000 # sample size sigma <- 1 # noise standard deviation betas <- matrix(rbinom(p, 1, 0.1) * rnorm(p, mean=1), nrow=p, ncol=1) # true coefficient vector important_coeffs <- which(abs(betas) >.01) # sparse support Z <- matrix(rnorm(n*p, 5), ncol = p, nrow=n) # design Y <- Z %*% betas + matrix(rnorm(n, 0, sd=sigma), nrow=n) df <- as.data.frame(cbind(Z, Y), ) *names(df) <- c(paste("x", 1:p, sep="_"), "y") ``` <style> .panel1-my_simul_anova-auto { color: black; width: 99%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-my_simul_anova-auto { color: black; width: NA%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-my_simul_anova-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- ### True model versus full model Dataframe `df` fits Gaussian linear modeling assumptions We may feed it to `lm()` This leads to what we call the _full model_ We 
may also take advantage of the fact that we know which columns of `\(Z\)` matter when we want to predict `\(Y\)`.

Hence, we can compare the linear model obtained by brute force `lm(y ~ . -1, df)` ( `\(-1\)` because we need no intercept) with the linear model obtained by using the columns of `df` corresponding to the non-null coefficients of `betas`

We call the latter the _true model_

---
### Building formulae

```r
formula_full <- formula("y ~ . - 1")

formula_true <- formula(str_c("y",
                              paste(str_c("x", important_coeffs, sep = "_"),
                                    collapse =" + "),
                              sep=" ~ "))

formulae <- c(formula_full, formula_true)

lm_df <- map(formulae, lm, data=df)

names(lm_df) <- c("full", "true")
```

---
### The true model

```r
formula_true
```

```
## y ~ x_2 + x_9 + x_22 + x_27 + x_41 + x_49 + x_52 + x_53 + x_59 + 
##     x_60 + x_77 + x_88 + x_93
```

---
### Comparing reconstruction errors

We can compare the true and full models using the quadratic error criterion.

The residuals can be gathered from both models using function `residuals()`.

The Residual Sums of Squares (RSS) are respectively 885.4 for the full model and 957.5 for the true model.

At first glance, the full model achieves a better fit to the data than the true model

---
### Comparing estimation errors
This comparison is not fair The full model is obtained by optimizing the quadratic error over a set which is much larger than the set used for the true model As we are handling simulated data, we can compare the coefficient estimates in the two models and see whether one comes closer to the coefficients used to generate the data The squared `\(\ell_2\)` distances to `\(\beta\)` are substantially different: `$$\| \beta - \widehat{\beta}^{\text{true}}\|^2 = 1.31 \times 10^{-2}$$` while `$$\| \beta - \widehat{\beta}^{\text{full}}\|^2 = 9.75 \times 10^{-2}$$` --- ### Overfitting The fact that the full model outperforms the small model with respect to the Residual Sum of Squares criterion is righteously called _overfitting_. --- ### Diagnosing the full model .panelset[ .panel[.panel-name[Code] ```r xx <- residuals(lm_df[['full']]) yy <- fitted.values(lm_df[['full']]) qplot(y=yy, x=xx, alpha=I(.25)) + xlab("fitted values") + ylab("residuals") + ggtitle("Gaussian linear model", subtitle = "n=1000, p=100, full model") ``` ] .panel[.panel-name[Residuals versus Fitted plot] ![](cm-4-EDA_files/figure-html/qplot-residuals-label-1.png) ] ] --- ### ```r head((anova(lm_df[['full']])) %>% broom::tidy(.n), 10) ``` ``` ## # A tibble: 10 × 6 ## term df sumsq meansq statistic p.value ## <chr> <int> <dbl> <dbl> <dbl> <dbl> ## 1 x_1 1 2852285. 2852285. 2899391. 0 ## 2 x_2 1 85346. 85346. 86755. 0 ## 3 x_3 1 10823. 10823. 11002. 0 ## 4 x_4 1 5974. 5974. 6072. 0 ## 5 x_5 1 3901. 3901. 3965. 0 ## 6 x_6 1 2053. 2053. 2086. 1.23e-236 ## 7 x_7 1 1990. 1990. 2023. 1.96e-232 ## 8 x_8 1 1153. 1153. 1172. 4.21e-165 ## 9 x_9 1 4715. 4715. 4793. 0 ## 10 x_10 1 1032. 1032. 1049. 3.69e-153 ``` --- count: false ### Residual distribution (full model) .panel1-kde_residuals-auto[ ```r *tib ``` ] .panel2-kde_residuals-auto[ ``` ## # A tibble: 1,000 × 1 ## residuals ## <dbl> ## 1 -0.319 ## 2 0.427 ## 3 -0.495 ## 4 0.0399 ## 5 0.302 ## 6 -0.920 ## 7 -0.0672 ## 8 -1.64 ## 9 -0.475 ## 10 -0.954 ## # … with 990 more rows ``` ] --- count: false ### Residual distribution (full model) .panel1-kde_residuals-auto[ ```r tib %>% * ggplot(aes(x=residuals)) ``` ] .panel2-kde_residuals-auto[ <img src="cm-4-EDA_files/figure-html/kde_residuals_auto_02_output-1.png" width="504" /> ] --- count: false ### Residual distribution (full model) .panel1-kde_residuals-auto[ ```r tib %>% ggplot(aes(x=residuals)) + * geom_histogram(aes(y=..density..), * alpha=.3) ``` ] .panel2-kde_residuals-auto[ ``` ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. ``` <img src="cm-4-EDA_files/figure-html/kde_residuals_auto_03_output-1.png" width="504" /> ] --- count: false ### Residual distribution (full model) .panel1-kde_residuals-auto[ ```r tib %>% ggplot(aes(x=residuals)) + geom_histogram(aes(y=..density..), alpha=.3) + * stat_density(kernel = "epanechnikov", * bw = .15, * alpha=.0, * col=1) ``` ] .panel2-kde_residuals-auto[ ``` ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. ``` <img src="cm-4-EDA_files/figure-html/kde_residuals_auto_04_output-1.png" width="504" /> ] --- count: false ### Residual distribution (full model) .panel1-kde_residuals-auto[ ```r tib %>% ggplot(aes(x=residuals)) + geom_histogram(aes(y=..density..), alpha=.3) + stat_density(kernel = "epanechnikov", bw = .15, alpha=.0, col=1) + * stat_function(fun = dnorm, * args=list(mean=m, * sd=s), * linetype=2) ``` ] .panel2-kde_residuals-auto[ ``` ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. 
```

<img src="cm-4-EDA_files/figure-html/kde_residuals_auto_05_output-1.png" width="504" />

]

---
count: false

### Residual distribution (full model)

.panel1-kde_residuals-auto[

```r
tib %>%
  ggplot(aes(x=residuals)) +
  geom_histogram(aes(y=..density..),
                 alpha=.3) +
  stat_density(kernel = "epanechnikov",
               bw = .15,
               alpha=.0,
               col=1) +
  stat_function(fun = dnorm,
                args=list(mean=m,
                          sd=s),
                linetype=2) +
* ggtitle(tit,
*         "Dashed curve: Gaussian fit")
```

]

.panel2-kde_residuals-auto[

```
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

<img src="cm-4-EDA_files/figure-html/kde_residuals_auto_06_output-1.png" width="504" />

]

<style>
.panel1-kde_residuals-auto {
  color: black;
  width: 38.6060606060606%;
  hight: 32%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
.panel2-kde_residuals-auto {
  color: black;
  width: 59.3939393939394%;
  hight: 32%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
.panel3-kde_residuals-auto {
  color: black;
  width: NA%;
  hight: 33%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
</style>

---
### Assessing residuals

```r
shapiro.test(residuals(lm_df[['full']]))
```

```
## 
## 	Shapiro-Wilk normality test
## 
## data:  residuals(lm_df[["full"]])
## W = 0.99752, p-value = 0.134
```

```r
# (coefficients(lm.1)[1])^2 + norm(betas - coefficients(lm.1)[-1])^2
# norm(betas - lm.0$coefficients)^2
```

---
template: inter-slide

## Using Fisher `\(F\)` statistic

---
### Computing the Fisher `\(F\)` statistic for comparing model 0 and model 1

As we are dealing with simulated data, we can behave as if a benevolent _oracle_ had whispered that all but `13` coefficients are null, and even told us the indices of the non-null coefficients.

We fit this exact and minimal model to the data.

Under the Gaussian linear model assumption, the empirical distribution of the residuals is the empirical distribution of the projection of a standard Gaussian vector onto the linear subspace orthogonal to the subspace generated by the design columns.

As such, the distribution of the residuals is not the empirical distribution of an i.i.d. Gaussian sample.

Nevertheless, if the linear subspace is not too correlated with the canonical basis, the empirical distribution of the residuals is not too far from a Gaussian distribution.

---
exclude: true

<div class="figure">
<img src="cm-4-EDA_files/figure-html/residdistrib-1.png" alt="(ref:residdistrib)" width="504" />
<p class="caption">(ref:residdistrib)</p>
</div>

---
### Comparing true model and full model

We are facing a _binary hypotheses problem_.

- Null hypothesis: all coefficients but `x_2 , x_9 , x_22 , x_27 , x_41 , x_49 , x_52 , x_53 , x_59, x_60 , x_77 , x_88 , x_93` are null

- Alternative hypothesis: the vector of coefficients does not satisfy the `\(87\)` linear equations (one per null coefficient) defining the null hypothesis

Note that both the null and alternative hypotheses are composite.

---

The null hypothesis asserts that the true vector of coefficients belongs to a linear subspace of dimension `13` of `\(\mathbb{R}^{100}\)`.

If we stick to Gaussian linear modelling, we may test this null hypothesis by performing a _Fisher test_.

Under the null hypothesis,

`$$\frac{\|\widehat{Y} - \widehat{Y}^\circ\|^2/(p-p^\circ)}{\|{Y} - \widehat{Y}\|^2/(n-p)} \sim F_{p-p^\circ, n-p}$$`

where `\(F_{p-p^\circ, n-p}\)` stands for the Fisher distribution with `\(p-p^\circ\)` and `\(n-p\)` degrees of freedom.

Function `anova` does this.
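The same statistic can also be computed by hand from the two fitted models; a minimal sketch (assuming `lm_df` from the previous slides is in scope), to be checked against the `anova()` output:

```r
# Fisher statistic for comparing the nested "true" and "full" models by hand
rss_true <- sum(residuals(lm_df[['true']])^2)   # ||Y - Yhat_circ||^2
rss_full <- sum(residuals(lm_df[['full']])^2)   # ||Y - Yhat||^2
df_true  <- df.residual(lm_df[['true']])        # n - p_circ
df_full  <- df.residual(lm_df[['full']])        # n - p

f_stat <- ((rss_true - rss_full) / (df_true - df_full)) / (rss_full / df_full)
pf(f_stat, df1 = df_true - df_full, df2 = df_full, lower.tail = FALSE)  # p-value
```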
--- exclude: true ```r anova(lm_df[['true']], lm_df[['full']]) %>% broom::tidy() %>% dplyr::mutate_if(is.numeric, round, 2) ``` ``` ## # A tibble: 2 × 6 ## res.df rss df sumsq statistic p.value ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 986 958. NA NA NA NA ## 2 900 885. 86 72.1 0.85 0.82 ``` On the second row, column `statistic` contains the Fisher statistic. The `p.value` tells us the probability that a random variable distributed according to Fisher distribution with `\(87\)` and `\(900\)` degrees of freedom ($F_{87,900}$) exceeds the statistic's value `\(0.89\)`. The fact that `p.value` (`pf(.89, 87, 900, lower.tail = FALSE)`) is `\(0.75\)` does not prompt us to reject the null hypothesis. --- ### Handling `\(p\)`-values The rule of thumb when using the `p.value` reported by a test function is the following: > If, _under the null hypothesis_, you accept to be wrong with probability `\(\alpha\)`, > then > you should reject when the `p.value` is less than `\(\alpha\)` In words: if you are looking for a _level_ (probability of an error of the first kind) equal to `\(\alpha\)`, you should reject the null hypothesis when the `\(p\)`-value is less than `\(\alpha\)` We can decompose the computations performed by ANOVA --- exclude: true ```r setdiff(important_coeffs, # 2 9 22 27 41 49 52 53 59 60 77 88 93 which( (df %>% select(starts_with("x")) %>% purrr::map_dbl(.f = ~ cor(., df$y)) %>% abs() ) > .05) ) ``` ```r lm.0 %>% summary() (svd(t(Z) %*% (Z))$d)^(-1) # Inverse of covariance matrix of coefficients in lm.0 ``` --- exclude: true class: middle, center, inverse ## Iterative methods for OLS --- exclude: true class: middle, center, inverse ## ... --- exclude: true template: inter-slide ## Ridge regression --- exclude: true ### Regularized least squares criterion For each `\(\lambda>0\)`, the penalized least square costs is defined by `$$\Big\Vert Y - Z \beta \Big\Vert^2 + λ \big\|\beta\big\|^2$$` Gradient with respect to `\(\beta\)` `$$2\times\big(λ β + Z^T Z β - Z^T Y\big)$$` Unique optimum `$$\big(\lambda \operatorname{Id} + Z^TZ\big)^{-1} Z^T Y$$` --- exclude: true ### Solution --- exclude: true template: inter-slide ## Split/Apply/Combine and (Ridge) regression --- exclude: true --- exclude: true template: inter-slide ## Feature engineering --- exclude: true template: inter-slide ## --- exclude: true template: inter-slide ## --- class: middle, center, inverse background-image: url('./img/pexels-cottonbro-3171837.jpg') background-size: cover # The End