name: inter-slide class: left, middle, inverse {{ content }} --- name: layout-general layout: true class: left, middle <style> .remark-slide-number { position: inherit; } .remark-slide-number .progress-bar-container { position: absolute; bottom: 0; height: 4px; display: block; left: 0; right: 0; } .remark-slide-number .progress-bar { height: 100%; background-color: red; } </style>
--- template: inter-slide # Multilinear Regression I #### [Master I MIDS & MFA]() #### [Analyse Exploratoire de Données](http://stephane-v-boucheron.fr/courses/eda/) #### [Stéphane Boucheron](http://stephane-v-boucheron.fr) --- template: inter-slide ##
### [Revisiting simple linear regression](#rev-slr) ### [Simple linear regression per group](#slr-group) ### [Formulae and model building](#formulae) ### [Higher order modeling](#polynomials) ### [ANOVA](#anova) --- template: inter-slide name: rev-slr ## Revisiting simple linear regression --- ### House Insulation: Whiteside's Data
.fl.w-third.pa2[ ```r whiteside <- MASS::whiteside whiteside %>% dplyr::sample_n(5) %>% knitr::kable() ``` <table> <thead> <tr> <th style="text-align:left;"> Insul </th> <th style="text-align:right;"> Temp </th> <th style="text-align:right;"> Gas </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> After </td> <td style="text-align:right;"> 0.8 </td> <td style="text-align:right;"> 4.6 </td> </tr> <tr> <td style="text-align:left;"> After </td> <td style="text-align:right;"> 7.5 </td> <td style="text-align:right;"> 2.6 </td> </tr> <tr> <td style="text-align:left;"> After </td> <td style="text-align:right;"> 7.1 </td> <td style="text-align:right;"> 3.0 </td> </tr> <tr> <td style="text-align:left;"> Before </td> <td style="text-align:right;"> 6.0 </td> <td style="text-align:right;"> 4.4 </td> </tr> <tr> <td style="text-align:left;"> After </td> <td style="text-align:right;"> 4.9 </td> <td style="text-align:right;"> 4.0 </td> </tr> </tbody> </table> ] .fl.w-two-thirds.pa2[ Mr Derek Whiteside of the UK Building Research Station recorded the weekly gas consumption and average external temperature at his own house in south-east England for two heating seasons, one of 26 weeks before, and one of 30 weeks after cavity-wall insulation was installed. The object of the exercise was to assess the effect of the insulation on gas consumption. `Temp`: Purportedly the average outside temperature in degrees Celsius. `Gas`: The weekly gas consumption in 1000s of cubic feet. Venables, W. N. and Ripley, B. D. (2002) _Modern Applied Statistics with S._ Fourth edition. Springer. ] --- ### Multiple goals - simple linear regression of `Gas` (weekly consumption) with respect to `Temp` (average external temperature) leaves us with mixed feelings
- there is a need to incorporate the influence of `Insul` into our model

- there are few a priori reasons to believe that the relationship between gas consumption and external temperature should be linear

- there is a need for methods to compare different models and, ideally, to select one in a principled way (a preview sketch follows this list)
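As a preview, here is a minimal sketch (not part of the original slides) of how these concerns translate into `lm()` calls on the `whiteside` data; variants of these models, and the `anova()` comparison, appear later in the lecture:

```r
# Candidate models of increasing richness (sketch)
m_slr   <- lm(Gas ~ Temp, data = whiteside)                               # simple linear regression
m_group <- lm(Gas ~ Temp * Insul, data = whiteside)                       # interaction with insulation
m_quad  <- lm(Gas ~ poly(Temp, 2, raw = TRUE) * Insul, data = whiteside)  # quadratic trend per group

anova(m_slr, m_group, m_quad)   # Fisher tests for the nested comparisons
```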
---
We perform again simple linear regression on the `whiteside` data

We look for `\(\beta_0, \beta_1\)` that minimize the quadratic error

`$$\sum_{i=1}^n \left( \mathrm{Gas}_i - \beta_0 -\beta_1 \mathrm{Temp}_i\right)^2$$`

In the parlance of mathematical statistics, we proceed as if we were fitting a _Gaussian Linear Model_ with _homoscedastic_ noise:

`$$\begin{bmatrix} \mathrm{Gas}_1 \\ \vdots \\ \mathrm{Gas}_{56} \end{bmatrix} = \begin{bmatrix} 1 & \mathrm{Temp}_1 \\ \vdots & \vdots \\ 1 & \mathrm{Temp}_{56} \end{bmatrix} \times \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \sigma \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_{56} \end{bmatrix}$$`

where `\(\epsilon_i \sim_{i.i.d.} \mathcal{N}(0,1)\)`

We aim at estimating `\(\beta_0, \beta_1\)` and `\(\sigma>0\)`

---
### Invoking `lm()`

Evaluating

```r
lm0 <- lm(Gas ~ Temp, data=whiteside)
```

returns an object `lm0` of class `lm` (for linear models)

An object of class `lm` is a _list_ of named objects

Among those objects, `coefficients`, which can be retrieved either as

- `lm0$coefficients` or
- `coefficients(lm0)`

contains what we call the _Ordinary Least Squares_ estimator of the linear regression of `Gas` versus `Temp`

--

The _Intercept_ `\(\widehat{\beta}_0\)` and the _slope_ `\(\widehat{\beta}_1\)` define the simple regression line

---
### Simple linear regression of `Gas ~ Temp`

.fl.w-third.pa2[
The sign of residuals is determined by `Insul` This is not what we would expect if the connection between `Gas` consumption and average outside temperature `Temp` were independent of insulation. This modeling deserves to be challenged ] .fl.w-two-thirds.pa2[ <img src="cm-4-EDA_files/figure-html/whitesidesimple-1.png" width="504" /> ] --- count: false ### Whiteside step by step .panel1-my_whiteside-auto[ ```r *whiteside ``` ] .panel2-my_whiteside-auto[ ``` ## Insul Temp Gas ## 1 Before -0.8 7.2 ## 2 Before -0.7 6.9 ## 3 Before 0.4 6.4 ## 4 Before 2.5 6.0 ## 5 Before 2.9 5.8 ## 6 Before 3.2 5.8 ## 7 Before 3.6 5.6 ## 8 Before 3.9 4.7 ## 9 Before 4.2 5.8 ## 10 Before 4.3 5.2 ## 11 Before 5.4 4.9 ## 12 Before 6.0 4.9 ## 13 Before 6.0 4.3 ## 14 Before 6.0 4.4 ## 15 Before 6.2 4.5 ## 16 Before 6.3 4.6 ## 17 Before 6.9 3.7 ## 18 Before 7.0 3.9 ## 19 Before 7.4 4.2 ## 20 Before 7.5 4.0 ## 21 Before 7.5 3.9 ## 22 Before 7.6 3.5 ## 23 Before 8.0 4.0 ## 24 Before 8.5 3.6 ## 25 Before 9.1 3.1 ## 26 Before 10.2 2.6 ## 27 After -0.7 4.8 ## 28 After 0.8 4.6 ## 29 After 1.0 4.7 ## 30 After 1.4 4.0 ## 31 After 1.5 4.2 ## 32 After 1.6 4.2 ## 33 After 2.3 4.1 ## 34 After 2.5 4.0 ## 35 After 2.5 3.5 ## 36 After 3.1 3.2 ## 37 After 3.9 3.9 ## 38 After 4.0 3.5 ## 39 After 4.0 3.7 ## 40 After 4.2 3.5 ## 41 After 4.3 3.5 ## 42 After 4.6 3.7 ## 43 After 4.7 3.5 ## 44 After 4.9 3.4 ## 45 After 4.9 3.7 ## 46 After 4.9 4.0 ## 47 After 5.0 3.6 ## 48 After 5.3 3.7 ## 49 After 6.2 2.8 ## 50 After 7.1 3.0 ## 51 After 7.2 2.8 ## 52 After 7.5 2.6 ## 53 After 8.0 2.7 ## 54 After 8.7 2.8 ## 55 After 8.8 1.3 ## 56 After 9.7 1.5 ``` ] --- count: false ### Whiteside step by step .panel1-my_whiteside-auto[ ```r whiteside %>% * ggplot() ``` ] .panel2-my_whiteside-auto[ <img src="cm-4-EDA_files/figure-html/my_whiteside_auto_02_output-1.png" width="504" /> ] --- count: false ### Whiteside step by step .panel1-my_whiteside-auto[ ```r whiteside %>% ggplot() + * aes(x = Temp) ``` ] .panel2-my_whiteside-auto[ <img src="cm-4-EDA_files/figure-html/my_whiteside_auto_03_output-1.png" width="504" /> ] --- count: false ### Whiteside step by step .panel1-my_whiteside-auto[ ```r whiteside %>% ggplot() + aes(x = Temp) + * aes(y = Gas) ``` ] .panel2-my_whiteside-auto[ <img src="cm-4-EDA_files/figure-html/my_whiteside_auto_04_output-1.png" width="504" /> ] --- count: false ### Whiteside step by step .panel1-my_whiteside-auto[ ```r whiteside %>% ggplot() + aes(x = Temp) + aes(y = Gas) + * geom_smooth( * method="lm", * formula = y ~ x, * se=FALSE * ) ``` ] .panel2-my_whiteside-auto[ <img src="cm-4-EDA_files/figure-html/my_whiteside_auto_05_output-1.png" width="504" /> ] --- count: false ### Whiteside step by step .panel1-my_whiteside-auto[ ```r whiteside %>% ggplot() + aes(x = Temp) + aes(y = Gas) + geom_smooth( method="lm", formula = y ~ x, se=FALSE ) + * geom_point( * alpha = .8, * size = 2, * aes(color = Insul, * shape = Insul), * ) ``` ] .panel2-my_whiteside-auto[ <img src="cm-4-EDA_files/figure-html/my_whiteside_auto_06_output-1.png" width="504" /> ] --- count: false ### Whiteside step by step .panel1-my_whiteside-auto[ ```r whiteside %>% ggplot() + aes(x = Temp) + aes(y = Gas) + geom_smooth( method="lm", formula = y ~ x, se=FALSE ) + geom_point( alpha = .8, size = 2, aes(color = Insul, shape = Insul), ) + * ggtitle("Whiteside data: Simple Linear Regression") ``` ] .panel2-my_whiteside-auto[ <img src="cm-4-EDA_files/figure-html/my_whiteside_auto_07_output-1.png" width="504" /> ] --- count: false ### Whiteside step by step 
.panel1-my_whiteside-auto[ ```r whiteside %>% ggplot() + aes(x = Temp) + aes(y = Gas) + geom_smooth( method="lm", formula = y ~ x, se=FALSE ) + geom_point( alpha = .8, size = 2, aes(color = Insul, shape = Insul), ) + ggtitle("Whiteside data: Simple Linear Regression") + * labs(subtitle="Residuals should not be so predictable") ``` ] .panel2-my_whiteside-auto[ <img src="cm-4-EDA_files/figure-html/my_whiteside_auto_08_output-1.png" width="504" /> ] <style> .panel1-my_whiteside-auto { color: black; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-my_whiteside-auto { color: black; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-my_whiteside-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- template: inter-slide name: slr-group ## Simple linear regression per group --- ### Simple linear regression per group SLR per group is equivalent to fitting a _multiple linear regression model_, (at least from a statistical viewpoint) It is equivalent to fitting the next model `$$\begin{bmatrix}\mathrm{Gas}_1 \\ \vdots \\ \mathrm{Gas}_{56} \end{bmatrix} = \begin{bmatrix} 1 & 0 & \mathrm{Temp}_1 & 0\\ \vdots & \vdots & \vdots & \vdots \\ 1 & 0 & \mathrm{Temp}_{26} & 0\\ 0 & 1 & 0 & \mathrm{Temp}_{27} \\ \vdots & \vdots & \vdots & \vdots \\ 0 & 1 & 0 & \mathrm{Temp}_{56} \end{bmatrix} \times \begin{bmatrix} \beta_{0,b} \\ \beta_{0,a} \\ \beta_{1, b} \\ \beta_{1, a} \end{bmatrix} + \sigma_b \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_{26}\\ 0 \\ \vdots \\ 0 \end{bmatrix} + \sigma_a \begin{bmatrix} 0 \\ \vdots \\ 0 \\ \epsilon_{27}\\ \vdots \\ \epsilon_{56} \end{bmatrix}$$` --- ### Simple linear regression per group .fl.w-third.pa2[ Like multiple linear regression with mildly _heteroschedastic_ noise Noise standard deviation is assumed to be constant within each group The residuals are now centered within each group Visually, more appealing than simple linear regression ] .fl.w-two-thirds.pa2[ ![](cm-4-EDA_files/figure-html/whitesidesimplebygroup-1.png) ] --- ### Subsetting with `lm()` To perform linear regression of the subset of rows defined by `Insul == After`, the next expression works: ```r whiteside %>% lm(formula = Gas ~ Temp, data= ., subset= Insul=='After') ```
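For comparison, here is a hedged sketch (not from the original slides) that fits one simple regression per insulation status in a single pipeline; it assumes `purrr` is available, as it is elsewhere in this deck:

```r
# Sketch: one simple linear regression per level of Insul
whiteside %>%
  split(.$Insul) %>%                             # list of two data frames: Before / After
  purrr::map(~ lm(Gas ~ Temp, data = .x)) %>%    # fit Gas ~ Temp within each group
  purrr::map(coefficients)                       # per-group intercepts and slopes
```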
In the subsetting chunk above, we use the pronoun `.` to connect the `data` argument of `lm()` with the left-hand side of the pipe `%>%`

This is necessary since `lm` belongs to base `R` and does not comply with `tidyverse` conventions

---
### One-hot encoding of `Insul`

We could also use the _design_ defined by formula `Gas ~ Temp * Insul`.

This amounts to using the `treatment` _contrast_ for encoding the two-valued factor `Insul` with `Before` as _control_ and `After` as _treatment_.

In modern machine learning language, this is called _one-hot encoding_ of the qualitative variable `Insul`

In this modeling attempt, we envision interactions between `Insul` and the `Intercept`, and between `Insul` and `Temp`

---
### One-hot encoding of `Insul` (continued)

We investigate whether insulation influences the amount of gas used to make the house comfortable at temperature `\(0\)`, and whether it influences the sensitivity of weekly gas consumption to external temperature

This is at least physically plausible

`$$\begin{bmatrix} \mathrm{Gas}_1 \\ \vdots \\ \mathrm{Gas}_{56}\end{bmatrix} = \begin{bmatrix} 1 & 0 & \mathrm{Temp}_1 & 0\\ \vdots & \vdots & \vdots & \vdots \\ 1 & 0 & \mathrm{Temp}_{26} & 0\\ 1 & 1 & \mathrm{Temp}_{27} & \mathrm{Temp}_{27} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 1 & \mathrm{Temp}_{56} & \mathrm{Temp}_{56}\end{bmatrix} \times \begin{bmatrix} \beta_{0,b} \\ \beta_{0,a} \\ \beta_{1, b} \\ \beta_{1, a} \end{bmatrix} + \sigma \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_{56} \end{bmatrix}$$`

```r
whiteside %>%
* lm(formula = Gas ~ Temp * Insul, data = .)
```

---
### Linear regression after one-hot encoding of `Insul`

.fl.w-third.pa2[

TODO: `ggPredict`

]

.fl.w-two-thirds.pa2[

![](cm-4-EDA_files/figure-html/whitesideonehot-1.png)

]

---
template: inter-slide
name: formulae

## About formulae

---

Objects of class `formula` are used in many corners of the `R` environment.

In multiple linear regression, formulae are used to define a design starting from the columns of a dataframe

.bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[

The models fit by, e.g., the `lm` and `glm` functions are specified in a compact symbolic form

The `~` operator is basic in the formation of such models

An expression of the form `y ~ model` is interpreted as a specification that the response `y` is modelled by a linear predictor specified symbolically by `model`

Such a model consists of a series of _terms_ separated by `+` operators

The terms themselves consist of variable and factor names separated by `:` operators

Such a term is interpreted as the interaction of all the variables and factors appearing in the term

.tr[
documentation] ] --- .bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[ In addition to `+` and `:`, a number of other operators are useful in model formulae. The `*` operator denotes factor crossing: `a*b` interpreted as `a+b+a:b`. The `^` operator indicates crossing to the specified degree. For example `(a+b+c)^2` is identical to `(a+b+c)*(a+b+c)` which in turn expands to a formula containing the main effects for `a`, `b` and `c` together with their second-order interactions. .tr[
documentation] ] --- .bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[ The `%in%` operator indicates that the terms on its left are nested within those on the right. For example `a + b %in% a` expands to the formula `a + a:b`. The `-` operator removes the specified terms, so that `(a+b+c)^2 - a:b` is identical to `a + b + c + b:c + a:c`. It can also be used to remove the intercept term: when fitting a linear model `y ~ x - 1` specifies a line through the origin. A model with no intercept can be also specified as `y ~ x + 0` or `y ~ 0 + x`. .tr[
documentation] ] --- ### Formulae with arithmetic expressions .bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[ The formula `log(y) ~ a + log(x)` is legal. When such arithmetic expressions involve operators which are also used symbolically in model formulae, there can be confusion between arithmetic and symbolic operator use. To avoid this confusion, the function `I()` can be used to bracket those portions of a model formula where the operators are used in their arithmetic sense. For example, in the formula `y ~ a + I(b+c)`, the term `b+c` is to be interpreted as the sum of `b` and `c`. Variable names can be quoted by backticks `like this` in formulae, although there is no guarantee that all code using formulae will accept such non-syntactic names .tr[From [
documentation](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html)]

]

---

.fl.w-third.pa2[

Formulae are so useful when modeling that they have been given a dedicated class and even a package: `Formula` (Extended Model Formulas)

]

.fl.w-two-thirds.pa2[

```r
knitr::include_url("https://rdrr.io/cran/Formula/")
```

]

---

By taking into account interactions between `Temp` and `Insul`, we have improved the appeal of our model.

Now we face two questions:

1. can we go further and improve data fitting?
1. can we formally assess the apparent improvement delivered by a richer model?

One question concerns the possibility of describing richer models

The second question is about model assessment/validation/selection

The first question is about programming, the second takes us into inferential statistics territory

---
template: inter-slide
name: polynomials

## Higher degree polynomials

---
### Building an enriched design

.fl.w-two-thirds.pa2[

Rather than searching for a _linear connection_ between `Gas` and `Temp`, we might as well look for a _polynomial connection_

In the next chunk, we add powers of `Temp` to dataframe `whiteside`

```r
poly(whiteside$Temp, degree=10, raw=TRUE) %>%
  as_tibble() -> whiteside_matrix

names(whiteside_matrix) <- paste("Temp", 1:10, sep="^")

whiteside_10 <- cbind(whiteside, whiteside_matrix)
```

]

.fl.w-third.pa2[

We fit a degree `\(10\)` polynomial to the data

```r
lm_10 <- whiteside_10 %>%
  lm(formula = Gas ~ . - Temp - Insul)
```

We are looking for a predictor of `Gas` based on powers of `Temp`

]

---
### A paradox

.fl.w-third.pa2[

Attempting to fit a degree `\(10\)` polynomial to the `whiteside` data leads to:

- Goodness of fit is improved

- Plausibility, and thus explanatory power, is degraded: it is hard to envision a non-monotonic relationship between gas consumption and average external temperature

]

.fl.w-two-thirds.pa2[

<img src="cm-4-EDA_files/figure-html/whitesideten-label-1.png" width="504" />

]

---
### High order modeling with interaction

.fl.w-third.pa2[

We can play that game further by fitting a high-degree polynomial of `Temperature` with interaction with `Insulation`

We fit a degree `\(10\)` polynomial with interaction with `Insul` to the `whiteside` data: a model with `\(22\)` coefficients (degrees of freedom) for `\(56\)` observations

Goodness of fit is visually improved but physical plausibility and explanatory power remain bad.

]

.fl.w-two-thirds.pa2[

<img src="cm-4-EDA_files/figure-html/whiteside10insul-1.png" width="504" />

]

---
### The best of both worlds

.fl.w-third.pa2[

Fit a degree `\(2\)` polynomial with interaction w.r.t. `Insul`

When fitting a degree `\(2\)` polynomial, we face new questions:

- should we prefer quadratic modelling to linear modelling?

- should we retain interaction with `Insul` for both subgroups?
Visual inspection does not provide us with clear-cut guidelines

]

.fl.w-two-thirds.pa2[

<img src="cm-4-EDA_files/figure-html/whiteside2Insul-1.png" width="504" />

]

---
### GLM interpretation

The figure corresponds to least-squares fitting of the following model

`$$\begin{bmatrix}\mathrm{Gas}_1 \\ \vdots \\ \mathrm{Gas}_{56} \end{bmatrix} = \begin{bmatrix} 1 & 0 & \mathrm{Temp}_1 & 0 & \mathrm{Temp}^2_1 & 0\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\ 1 & 0 & \mathrm{Temp}_{26} & 0 & \mathrm{Temp}^2_{26} & 0\\ 1 & 1 & \mathrm{Temp}_{27} & \mathrm{Temp}_{27} & \mathrm{Temp}^2_{27} & \mathrm{Temp}^2_{27} \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & 1 & \mathrm{Temp}_{56} & \mathrm{Temp}_{56} & \mathrm{Temp}^2_{56} & \mathrm{Temp}^2_{56} \end{bmatrix} \times \begin{bmatrix} \beta_{0,b} \\ \beta_{0,a} \\ \beta_{1, b} \\ \beta_{1, a} \\ \beta_{2, b} \\ \beta_{2, a}\end{bmatrix} + \sigma \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_{56}\end{bmatrix}$$`

We still assume _homoscedastic_ noise

---
exclude: true

```r
require("parsnip")
require(Formula)
require(flipbookr)
```

```
## Loading required package: flipbookr
```

```r
formule <- Formula(Gas ~ Temp * Insul)

lm_mod <- linear_reg() %>%
  set_engine("lm")

lm_fit <- lm_mod %>%
  fit(formule, data = whiteside)
```

```r
# model.matrix(lm_2) %>% View()
```

---
exclude:true

```r
pacman::p_load(ggfortify)
```

```r
autoplot(lm_2, which = 6)
```

<img src="cm-4-EDA_files/figure-html/unnamed-chunk-11-1.png" width="504" />

---
template: inter-slide
name: anova

## ANOVA

---

[ANOVA](https://en.wikipedia.org/wiki/Analysis_of_variance) stands for ANalysis Of VAriance.

It provides us with a method for choosing and assessing a model, at least if we adhere to the _Gaussian Linear Modeling_ credo.

ANOVA is just the tip of a large iceberg.

It has been extended to more general models and has inspired _model selection methods_ in many areas of statistics and machine learning.

---
template: inter-slide

## Variable selection in Gaussian Linear Models

---

In the sequel of this section, we use simulated data to introduce and motivate the ANOVA method.

We generate

- a design matrix `\(Z \in \mathcal{M}_{n,p}\)` with `\(n=1000\)` and `\(p=100\)`,
- a sparse random vector `\(\beta \in \mathbb{R}^p\)`
- a vector of observations `\(Y = Z \times \beta + \sigma \epsilon\)` where `\(\sigma=1\)` and `\(\epsilon\)` is a standard Gaussian vector of dimension `\(n\)`.
--- count: false ### Preparing ANOVA ```r *set.seed(1515) # for reproducibility ``` --- count: false ### Preparing ANOVA ```r set.seed(1515) # for reproducibility *p <- 100 # dimension of sample points ``` --- count: false ### Preparing ANOVA ```r set.seed(1515) # for reproducibility p <- 100 # dimension of sample points *n <- 1000 # sample size ``` --- count: false ### Preparing ANOVA ```r set.seed(1515) # for reproducibility p <- 100 # dimension of sample points n <- 1000 # sample size *sigma <- 1 # noise standard deviation ``` --- count: false ### Preparing ANOVA ```r set.seed(1515) # for reproducibility p <- 100 # dimension of sample points n <- 1000 # sample size sigma <- 1 # noise standard deviation *betas <- matrix(rbinom(p, 1, 0.1) * rnorm(p, mean=1), * nrow=p, ncol=1) # true coefficient vector ``` --- count: false ### Preparing ANOVA ```r set.seed(1515) # for reproducibility p <- 100 # dimension of sample points n <- 1000 # sample size sigma <- 1 # noise standard deviation betas <- matrix(rbinom(p, 1, 0.1) * rnorm(p, mean=1), nrow=p, ncol=1) # true coefficient vector *important_coeffs <- which(abs(betas) >.01) # sparse support ``` --- count: false ### Preparing ANOVA ```r set.seed(1515) # for reproducibility p <- 100 # dimension of sample points n <- 1000 # sample size sigma <- 1 # noise standard deviation betas <- matrix(rbinom(p, 1, 0.1) * rnorm(p, mean=1), nrow=p, ncol=1) # true coefficient vector important_coeffs <- which(abs(betas) >.01) # sparse support *Z <- matrix(rnorm(n*p, 5), ncol = p, nrow=n) # design ``` --- count: false ### Preparing ANOVA ```r set.seed(1515) # for reproducibility p <- 100 # dimension of sample points n <- 1000 # sample size sigma <- 1 # noise standard deviation betas <- matrix(rbinom(p, 1, 0.1) * rnorm(p, mean=1), nrow=p, ncol=1) # true coefficient vector important_coeffs <- which(abs(betas) >.01) # sparse support Z <- matrix(rnorm(n*p, 5), ncol = p, nrow=n) # design *Y <- Z %*% betas + matrix(rnorm(n, 0, sd=sigma), nrow=n) ``` --- count: false ### Preparing ANOVA ```r set.seed(1515) # for reproducibility p <- 100 # dimension of sample points n <- 1000 # sample size sigma <- 1 # noise standard deviation betas <- matrix(rbinom(p, 1, 0.1) * rnorm(p, mean=1), nrow=p, ncol=1) # true coefficient vector important_coeffs <- which(abs(betas) >.01) # sparse support Z <- matrix(rnorm(n*p, 5), ncol = p, nrow=n) # design Y <- Z %*% betas + matrix(rnorm(n, 0, sd=sigma), nrow=n) *df <- as.data.frame(cbind(Z, Y), ) ``` --- count: false ### Preparing ANOVA ```r set.seed(1515) # for reproducibility p <- 100 # dimension of sample points n <- 1000 # sample size sigma <- 1 # noise standard deviation betas <- matrix(rbinom(p, 1, 0.1) * rnorm(p, mean=1), nrow=p, ncol=1) # true coefficient vector important_coeffs <- which(abs(betas) >.01) # sparse support Z <- matrix(rnorm(n*p, 5), ncol = p, nrow=n) # design Y <- Z %*% betas + matrix(rnorm(n, 0, sd=sigma), nrow=n) df <- as.data.frame(cbind(Z, Y), ) *names(df) <- c(paste("x", 1:p, sep="_"), "y") ``` <style> .panel1-my_simul_anova-auto { color: black; width: 99%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-my_simul_anova-auto { color: black; width: NA%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-my_simul_anova-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- ### True model versus full model Dataframe `df` fits Gaussian linear modeling assumptions We may feed it to `lm()` This leads to what we call the _full model_ We 
may also take advantage of the fact that we know which columns of `\(Z\)` matter when we want to predict `\(Y\)`.

Hence, we can compare the linear model obtained by brute force `lm(y ~ . -1, df)` ( `\(-1\)` because we need no intercept) with the linear model obtained by using the columns of `df` corresponding to the non-null coefficients of `betas`

We call the latter the _true model_

---
### Building formulae

```r
formula_full <- formula("y ~ . - 1")

formula_true <- formula(str_c("y",
                              paste(str_c("x", important_coeffs, sep = "_"),
                                    collapse =" + "),
                              sep=" ~ "))

formulae <- c(formula_full, formula_true)

lm_df <- map(formulae, lm, data=df)

names(lm_df) <- c("full", "true")
```

---
### The true model

```r
formula_true
```

```
## y ~ x_2 + x_9 + x_22 + x_27 + x_41 + x_49 + x_52 + x_53 + x_59 + 
##     x_60 + x_77 + x_88 + x_93
```

---
### Comparing reconstruction errors

We can compare the true and full models using the quadratic error criterion.

The residuals can be gathered from both models using function `residuals()`.

The Residual Sums of Squares (RSS) are respectively 885.4 for the full model and 957.5 for the true model.

At first glance, the full model achieves a better fit to the data than the true model

---
### Comparing estimation errors
This comparison is not fair The full model is obtained by optimizing the quadratic error over a set which is much larger than the set used for the true model As we are handling simulated data, we can compare the coefficient estimates in the two models and see whether one comes closer to the coefficients used to generate the data The squared `\(\ell_2\)` distances to `\(\beta\)` are substantially different: `$$\| \beta - \widehat{\beta}^{\text{true}}\|^2 = 1.31 \times 10^{-2}$$` while `$$\| \beta - \widehat{\beta}^{\text{full}}\|^2 = 9.75 \times 10^{-2}$$` --- ### Overfitting The fact that the full model outperforms the small model with respect to the Residual Sum of Squares criterion is righteously called _overfitting_. --- ### Diagnosing the full model .panelset[ .panel[.panel-name[Code] ```r xx <- residuals(lm_df[['full']]) yy <- fitted.values(lm_df[['full']]) qplot(y=yy, x=xx, alpha=I(.25)) + xlab("fitted values") + ylab("residuals") + ggtitle("Gaussian linear model", subtitle = "n=1000, p=100, full model") ``` ] .panel[.panel-name[Residuals versus Fitted plot] ![](cm-4-EDA_files/figure-html/qplot-residuals-label-1.png) ] ] --- ### ```r head((anova(lm_df[['full']])) %>% broom::tidy(.n), 10) ``` ``` ## # A tibble: 10 × 6 ## term df sumsq meansq statistic p.value ## <chr> <int> <dbl> <dbl> <dbl> <dbl> ## 1 x_1 1 2852285. 2852285. 2899391. 0 ## 2 x_2 1 85346. 85346. 86755. 0 ## 3 x_3 1 10823. 10823. 11002. 0 ## 4 x_4 1 5974. 5974. 6072. 0 ## 5 x_5 1 3901. 3901. 3965. 0 ## 6 x_6 1 2053. 2053. 2086. 1.23e-236 ## 7 x_7 1 1990. 1990. 2023. 1.96e-232 ## 8 x_8 1 1153. 1153. 1172. 4.21e-165 ## 9 x_9 1 4715. 4715. 4793. 0 ## 10 x_10 1 1032. 1032. 1049. 3.69e-153 ``` --- count: false ### Residual distribution (full model) .panel1-kde_residuals-auto[ ```r *tib ``` ] .panel2-kde_residuals-auto[ ``` ## # A tibble: 1,000 × 1 ## residuals ## <dbl> ## 1 -0.319 ## 2 0.427 ## 3 -0.495 ## 4 0.0399 ## 5 0.302 ## 6 -0.920 ## 7 -0.0672 ## 8 -1.64 ## 9 -0.475 ## 10 -0.954 ## # … with 990 more rows ``` ] --- count: false ### Residual distribution (full model) .panel1-kde_residuals-auto[ ```r tib %>% * ggplot(aes(x=residuals)) ``` ] .panel2-kde_residuals-auto[ <img src="cm-4-EDA_files/figure-html/kde_residuals_auto_02_output-1.png" width="504" /> ] --- count: false ### Residual distribution (full model) .panel1-kde_residuals-auto[ ```r tib %>% ggplot(aes(x=residuals)) + * geom_histogram(aes(y=..density..), * alpha=.3) ``` ] .panel2-kde_residuals-auto[ ``` ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. ``` <img src="cm-4-EDA_files/figure-html/kde_residuals_auto_03_output-1.png" width="504" /> ] --- count: false ### Residual distribution (full model) .panel1-kde_residuals-auto[ ```r tib %>% ggplot(aes(x=residuals)) + geom_histogram(aes(y=..density..), alpha=.3) + * stat_density(kernel = "epanechnikov", * bw = .15, * alpha=.0, * col=1) ``` ] .panel2-kde_residuals-auto[ ``` ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. ``` <img src="cm-4-EDA_files/figure-html/kde_residuals_auto_04_output-1.png" width="504" /> ] --- count: false ### Residual distribution (full model) .panel1-kde_residuals-auto[ ```r tib %>% ggplot(aes(x=residuals)) + geom_histogram(aes(y=..density..), alpha=.3) + stat_density(kernel = "epanechnikov", bw = .15, alpha=.0, col=1) + * stat_function(fun = dnorm, * args=list(mean=m, * sd=s), * linetype=2) ``` ] .panel2-kde_residuals-auto[ ``` ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. 
```

<img src="cm-4-EDA_files/figure-html/kde_residuals_auto_05_output-1.png" width="504" />

]

---
count: false

### Residual distribution (full model)

.panel1-kde_residuals-auto[

```r
tib %>%
  ggplot(aes(x=residuals)) +
  geom_histogram(aes(y=..density..),
                 alpha=.3) +
  stat_density(kernel = "epanechnikov",
               bw = .15,
               alpha=.0,
               col=1) +
  stat_function(fun = dnorm,
                args=list(mean=m,
                          sd=s),
                linetype=2) +
* ggtitle(tit,
*         "Dashed curve: Gaussian fit")
```

]

.panel2-kde_residuals-auto[

```
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

<img src="cm-4-EDA_files/figure-html/kde_residuals_auto_06_output-1.png" width="504" />

]

<style>
.panel1-kde_residuals-auto {
  color: black;
  width: 38.6060606060606%;
  hight: 32%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
.panel2-kde_residuals-auto {
  color: black;
  width: 59.3939393939394%;
  hight: 32%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
.panel3-kde_residuals-auto {
  color: black;
  width: NA%;
  hight: 33%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
</style>

---
### Assessing residuals

```r
shapiro.test(residuals(lm_df[['full']]))
```

```
## 
## 	Shapiro-Wilk normality test
## 
## data:  residuals(lm_df[["full"]])
## W = 0.99752, p-value = 0.134
```

```r
# (coefficients(lm.1)[1])^2 + norm(betas - coefficients(lm.1)[-1])^2
# norm(betas - lm.0$coefficients)^2
```

---
template: inter-slide

## Using Fisher `\(F\)` statistic

---
### Computing the Fisher `\(F\)` statistic for comparing model 0 and model 1

As we are dealing with simulated data, we can behave as if a benevolent _oracle_ had whispered that all but `13` coefficients are null, and even told us the indices of the non-null coefficients.

We fit this exact and minimal model to the data.

Under the Gaussian linear model assumption, the empirical distribution of the residuals is the empirical distribution of the projection of a standard Gaussian vector onto the linear subspace orthogonal to the subspace generated by the design columns.

As such, the distribution of the residuals is not the empirical distribution of an i.i.d. Gaussian sample.

Nevertheless, if the linear subspace is not too correlated with the canonical basis, the empirical distribution of the residuals is not too far from a Gaussian distribution.

---
exclude: true

<div class="figure">
<img src="cm-4-EDA_files/figure-html/residdistrib-1.png" alt="(ref:residdistrib)" width="504" />
<p class="caption">(ref:residdistrib)</p>
</div>

---
### Comparing true model and full model

We are facing a _binary hypotheses problem_.

- Null hypothesis: all coefficients but `x_2 , x_9 , x_22 , x_27 , x_41 , x_49 , x_52 , x_53 , x_59, x_60 , x_77 , x_88 , x_93` are null

- Alternative hypothesis: the vector of coefficients does not satisfy the `\(87\)` linear equations (one per null coefficient) defining the null hypothesis

Note that both the null and alternative hypotheses are composite.

---

The null hypothesis asserts that the true vector of coefficients belongs to a linear subspace of dimension `13` of `\(\mathbb{R}^{100}\)`.

If we stick to Gaussian linear modelling, we may test this null hypothesis by performing a _Fisher test_.

Under the null hypothesis,

`$$\frac{\|\widehat{Y} - \widehat{Y}^\circ\|^2/(p-p^\circ)}{\|{Y} - \widehat{Y}\|^2/(n-p)} \sim F_{p-p^\circ, n-p}$$`

where `\(F_{p-p^\circ, n-p}\)` stands for the Fisher distribution with `\(p-p^\circ\)` and `\(n-p\)` degrees of freedom.

Function `anova` does this.
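The same statistic can also be computed by hand from the two fitted models; a minimal sketch (assuming `lm_df` from the previous slides is in scope), to be checked against the `anova()` output:

```r
# Fisher statistic for comparing the nested "true" and "full" models by hand
rss_true <- sum(residuals(lm_df[['true']])^2)   # ||Y - Yhat_circ||^2
rss_full <- sum(residuals(lm_df[['full']])^2)   # ||Y - Yhat||^2
df_true  <- df.residual(lm_df[['true']])        # n - p_circ
df_full  <- df.residual(lm_df[['full']])        # n - p

f_stat <- ((rss_true - rss_full) / (df_true - df_full)) / (rss_full / df_full)
pf(f_stat, df1 = df_true - df_full, df2 = df_full, lower.tail = FALSE)  # p-value
```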
--- exclude: true ```r anova(lm_df[['true']], lm_df[['full']]) %>% broom::tidy() %>% dplyr::mutate_if(is.numeric, round, 2) ``` ``` ## # A tibble: 2 × 6 ## res.df rss df sumsq statistic p.value ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 986 958. NA NA NA NA ## 2 900 885. 86 72.1 0.85 0.82 ``` On the second row, column `statistic` contains the Fisher statistic. The `p.value` tells us the probability that a random variable distributed according to Fisher distribution with `\(87\)` and `\(900\)` degrees of freedom ($F_{87,900}$) exceeds the statistic's value `\(0.89\)`. The fact that `p.value` (`pf(.89, 87, 900, lower.tail = FALSE)`) is `\(0.75\)` does not prompt us to reject the null hypothesis. --- ### Handling `\(p\)`-values The rule of thumb when using the `p.value` reported by a test function is the following: > If, _under the null hypothesis_, you accept to be wrong with probability `\(\alpha\)`, > then > you should reject when the `p.value` is less than `\(\alpha\)` In words: if you are looking for a _level_ (probability of an error of the first kind) equal to `\(\alpha\)`, you should reject the null hypothesis when the `\(p\)`-value is less than `\(\alpha\)` We can decompose the computations performed by ANOVA --- exclude: true ```r setdiff(important_coeffs, # 2 9 22 27 41 49 52 53 59 60 77 88 93 which( (df %>% select(starts_with("x")) %>% purrr::map_dbl(.f = ~ cor(., df$y)) %>% abs() ) > .05) ) ``` ```r lm.0 %>% summary() (svd(t(Z) %*% (Z))$d)^(-1) # Inverse of covariance matrix of coefficients in lm.0 ``` --- exclude: true class: middle, center, inverse ## Iterative methods for OLS --- exclude: true class: middle, center, inverse ## ... --- exclude: true template: inter-slide ## Ridge regression --- exclude: true ### Regularized least squares criterion For each `\(\lambda>0\)`, the penalized least square costs is defined by `$$\Big\Vert Y - Z \beta \Big\Vert^2 + λ \big\|\beta\big\|^2$$` Gradient with respect to `\(\beta\)` `$$2\times\big(λ β + Z^T Z β - Z^T Y\big)$$` Unique optimum `$$\big(\lambda \operatorname{Id} + Z^TZ\big)^{-1} Z^T Y$$` --- exclude: true ### Solution --- exclude: true template: inter-slide ## Split/Apply/Combine and (Ridge) regression --- exclude: true --- exclude: true template: inter-slide ## Feature engineering --- exclude: true template: inter-slide ## --- exclude: true template: inter-slide ## --- class: middle, center, inverse background-image: url('./img/pexels-cottonbro-3171837.jpg') background-size: cover # The End