name: inter-slide
class: left, middle, inverse

{{ content }}

---
name: layout-general
layout: true
class: left, middle

<style>
.remark-slide-number {
  position: inherit;
}

.remark-slide-number .progress-bar-container {
  position: absolute;
  bottom: 0;
  height: 4px;
  display: block;
  left: 0;
  right: 0;
}

.remark-slide-number .progress-bar {
  height: 100%;
  background-color: red;
}
</style>
 
---
template: inter-slide

# Multilinear Regression II: Variable selection

#### [Master I MIDS & MFA]()
#### [Analyse Exploratoire de Données](http://stephane-v-boucheron.fr/courses/eda/)
#### [Stéphane Boucheron](http://stephane-v-boucheron.fr)

---
class: inverse, middle

##
### Motivations

### Nested models

### Information criteria

### Cross-validation

---
template: inter-slide
name: anova

## ANOVA

---

[ANOVA](https://en.wikipedia.org/wiki/Analysis_of_variance) stands for ANalysis Of VAriance.

It provides us with a method for choosing and assessing a model, at least if we adhere to the _Gaussian Linear Modeling_ credo.

ANOVA is just the tip of a large iceberg. It has been extended to more general models and has inspired _model selection methods_ in many areas of statistics and machine learning.

---
template: inter-slide

## Variable selection in Gaussian Linear Models

---

In the rest of this section, we use simulated data to introduce and motivate the ANOVA method.

We generate

- a design matrix `\(Z \in \mathcal{M}_{n,p}\)` with `\(n=1000\)` and `\(p=100\)`,
- a sparse random vector `\(\beta \in \mathbb{R}^p\)`,
- a vector of observations `\(Y = Z \times \beta + \sigma \epsilon\)` where `\(\sigma=1\)` and `\(\epsilon\)` is a standard Gaussian vector of dimension `\(n\)`.

---
count: false

### Preparing ANOVA

```r
*set.seed(1515) # for reproducibility
```

---
count: false

### Preparing ANOVA

```r
set.seed(1515) # for reproducibility
*p <- 100 # dimension of sample points
```

---
count: false

### Preparing ANOVA

```r
set.seed(1515) # for reproducibility
p <- 100 # dimension of sample points
*n <- 1000 # sample size
```

---
count: false

### Preparing ANOVA

```r
set.seed(1515) # for reproducibility
p <- 100 # dimension of sample points
n <- 1000 # sample size
*sigma <- 1 # noise standard deviation
```

---
count: false

### Preparing ANOVA

```r
set.seed(1515) # for reproducibility
p <- 100 # dimension of sample points
n <- 1000 # sample size
sigma <- 1 # noise standard deviation
*betas <- matrix(rbinom(p, 1, 0.1) * rnorm(p, mean=1),
*                nrow=p, ncol=1) # true coefficient vector
```

---
count: false

### Preparing ANOVA

```r
set.seed(1515) # for reproducibility
p <- 100 # dimension of sample points
n <- 1000 # sample size
sigma <- 1 # noise standard deviation
betas <- matrix(rbinom(p, 1, 0.1) * rnorm(p, mean=1),
                nrow=p, ncol=1) # true coefficient vector
*important_coeffs <- which(abs(betas) > .01) # sparse support
```

---
count: false

### Preparing ANOVA

```r
set.seed(1515) # for reproducibility
p <- 100 # dimension of sample points
n <- 1000 # sample size
sigma <- 1 # noise standard deviation
betas <- matrix(rbinom(p, 1, 0.1) * rnorm(p, mean=1),
                nrow=p, ncol=1) # true coefficient vector
important_coeffs <- which(abs(betas) > .01) # sparse support
*Z <- matrix(rnorm(n*p, 5), ncol = p, nrow=n) # design
```

---
count: false

### Preparing ANOVA

```r
set.seed(1515) # for reproducibility
p <- 100 # dimension of sample points
n <- 1000 # sample size
sigma <- 1 # noise standard deviation
betas <- matrix(rbinom(p, 1, 0.1) * rnorm(p, mean=1),
                nrow=p, ncol=1) # true coefficient vector
important_coeffs <- which(abs(betas) > .01) # sparse support
Z <- matrix(rnorm(n*p, 5), ncol = p, nrow=n) # design
*Y <- Z %*% betas + matrix(rnorm(n, 0, sd=sigma), nrow=n)
```

---
count: false

### Preparing ANOVA

```r
set.seed(1515) # for reproducibility
p <- 100 # dimension of sample points
n <- 1000 # sample size
sigma <- 1 # noise standard deviation
betas <- matrix(rbinom(p, 1, 0.1) * rnorm(p, mean=1),
                nrow=p, ncol=1) # true coefficient vector
important_coeffs <- which(abs(betas) > .01) # sparse support
Z <- matrix(rnorm(n*p, 5), ncol = p, nrow=n) # design
Y <- Z %*% betas + matrix(rnorm(n, 0, sd=sigma), nrow=n)
*df <- as.data.frame(cbind(Z, Y))
```

---
count: false

### Preparing ANOVA

```r
set.seed(1515) # for reproducibility
p <- 100 # dimension of sample points
n <- 1000 # sample size
sigma <- 1 # noise standard deviation
betas <- matrix(rbinom(p, 1, 0.1) * rnorm(p, mean=1),
                nrow=p, ncol=1) # true coefficient vector
important_coeffs <- which(abs(betas) > .01) # sparse support
Z <- matrix(rnorm(n*p, 5), ncol = p, nrow=n) # design
Y <- Z %*% betas + matrix(rnorm(n, 0, sd=sigma), nrow=n)
df <- as.data.frame(cbind(Z, Y))
*names(df) <- c(paste("x", 1:p, sep="_"), "y")
```

<style>
.panel1-my_simul_anova-auto {
  color: black;
  width: 99%;
  hight: 32%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
.panel2-my_simul_anova-auto {
  color: black;
  width: NA%;
  hight: 32%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
.panel3-my_simul_anova-auto {
  color: black;
  width: NA%;
  hight: 33%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
</style>

---

### True model versus full model

Dataframe `df` fits Gaussian linear modeling assumptions

We may feed it to `lm()`

This leads to what we call the _full model_

We may also take advantage of the fact that we know which columns of `\(Z\)` matter when we want to predict `\(Y\)`.

Hence, we can compare the linear model obtained by brute force, `lm(y ~ . -1, df)` ( `\(-1\)` because we need no intercept), with the linear model obtained by using only the columns of `df` corresponding to non-null coefficients of `betas`

We call the latter the _true model_

---

### Building formulae

```r
formula_full <- formula("y ~ . - 1")

formula_true <- formula(
  str_c("y",
        paste(str_c("x", important_coeffs, sep = "_"), collapse = " + "),
        sep = " ~ "))

formulae <- c(formula_full, formula_true)

lm_df <- map(formulae, lm, data = df)
names(lm_df) <- c("full", "true")
```

---

### The true model

```r
formula_true
```

```
## y ~ x_2 + x_9 + x_22 + x_27 + x_41 + x_49 + x_52 + x_53 + x_59 +
##     x_60 + x_77 + x_88 + x_93
```

---

### Comparing reconstruction errors

We can compare the true and full models using the quadratic error criterion.

The residuals can be gathered from both models using function `residuals()`.

The Residual Sums of Squares (RSS) are respectively 885.4 for the full model and 957.5 for the true model.

At first glance, the full model achieves a better fit to the data than the true model
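---

### Comparing reconstruction errors: a sanity check

The RSS values quoted on the previous slide can be recomputed directly from the fitted objects. A minimal sketch, assuming `purrr` is attached and that `lm_df` is the named list of fitted models built above:

```r
# Residual Sum of Squares of each model, computed from the residuals
map_dbl(lm_df, ~ sum(residuals(.x)^2))

# equivalently, deviance() returns the RSS of a fitted Gaussian linear model
map_dbl(lm_df, deviance)
```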
---

### Comparing estimation errors

This comparison is not fair

The full model is obtained by optimizing the quadratic error over a set which is much larger than the set used for the true model

As we are handling simulated data, we can compare the coefficient estimates in the two models and see whether one comes closer to the coefficients used to generate the data

The squared `\(\ell_2\)` distances to `\(\beta\)` are substantially different:

`$$\| \beta - \widehat{\beta}^{\text{true}}\|^2 = 1.31 \times 10^{-2}$$`

while

`$$\| \beta - \widehat{\beta}^{\text{full}}\|^2 = 9.75 \times 10^{-2}$$`

---

### Overfitting

The fact that the full model outperforms the (smaller) true model with respect to the Residual Sum of Squares criterion is rightly called _overfitting_.

---

### Diagnosing the full model

.panelset[

.panel[.panel-name[Code]

```r
xx <- residuals(lm_df[['full']])
yy <- fitted.values(lm_df[['full']])

qplot(x=yy, y=xx, alpha=I(.25)) +
  xlab("fitted values") +
  ylab("residuals") +
  ggtitle("Gaussian linear model",
          subtitle = "n=1000, p=100, full model")
```
]

.panel[.panel-name[Residuals versus Fitted plot]

![](cm-5-EDA-III_files/figure-html/qplot-residuals-label-1.png)
]
]

---

### ANOVA table of the full model

```r
head(anova(lm_df[['full']]) %>% broom::tidy(), 10)
```

```
## # A tibble: 10 × 6
##    term     df    sumsq   meansq statistic   p.value
##    <chr> <int>    <dbl>    <dbl>     <dbl>     <dbl>
##  1 x_1       1 2852285. 2852285.  2899391. 0
##  2 x_2       1   85346.   85346.    86755. 0
##  3 x_3       1   10823.   10823.    11002. 0
##  4 x_4       1    5974.    5974.     6072. 0
##  5 x_5       1    3901.    3901.     3965. 0
##  6 x_6       1    2053.    2053.     2086. 1.23e-236
##  7 x_7       1    1990.    1990.     2023. 1.96e-232
##  8 x_8       1    1153.    1153.     1172. 4.21e-165
##  9 x_9       1    4715.    4715.     4793. 0
## 10 x_10      1    1032.    1032.     1049. 3.69e-153
```

---
count: false

### Residual distribution (full model)

.panel1-kde_residuals-auto[

```r
*tib
```
]

.panel2-kde_residuals-auto[

```
## # A tibble: 1,000 × 1
##    residuals
##        <dbl>
##  1   -0.319
##  2    0.427
##  3   -0.495
##  4    0.0399
##  5    0.302
##  6   -0.920
##  7   -0.0672
##  8   -1.64
##  9   -0.475
## 10   -0.954
## # … with 990 more rows
```
]

---
count: false

### Residual distribution (full model)

.panel1-kde_residuals-auto[

```r
tib %>%
* ggplot(aes(x=residuals))
```
]

.panel2-kde_residuals-auto[

<img src="cm-5-EDA-III_files/figure-html/kde_residuals_auto_02_output-1.png" width="504" />
]

---
count: false

### Residual distribution (full model)

.panel1-kde_residuals-auto[

```r
tib %>%
  ggplot(aes(x=residuals)) +
* geom_histogram(aes(y=..density..),
*                alpha=.3)
```
]

.panel2-kde_residuals-auto[

```
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

<img src="cm-5-EDA-III_files/figure-html/kde_residuals_auto_03_output-1.png" width="504" />
]

---
count: false

### Residual distribution (full model)

.panel1-kde_residuals-auto[

```r
tib %>%
  ggplot(aes(x=residuals)) +
  geom_histogram(aes(y=..density..),
                 alpha=.3) +
* stat_density(kernel = "epanechnikov",
*              bw = .15,
*              alpha=.0,
*              col=1)
```
]

.panel2-kde_residuals-auto[

```
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

<img src="cm-5-EDA-III_files/figure-html/kde_residuals_auto_04_output-1.png" width="504" />
]

---
count: false

### Residual distribution (full model)

.panel1-kde_residuals-auto[

```r
tib %>%
  ggplot(aes(x=residuals)) +
  geom_histogram(aes(y=..density..),
                 alpha=.3) +
  stat_density(kernel = "epanechnikov",
               bw = .15,
               alpha=.0,
               col=1) +
* stat_function(fun = dnorm,
*               args=list(mean=m,
*                         sd=s),
*               linetype=2)
```
]

.panel2-kde_residuals-auto[

```
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

<img src="cm-5-EDA-III_files/figure-html/kde_residuals_auto_05_output-1.png" width="504" />
]

---
count: false

### Residual distribution (full model)

.panel1-kde_residuals-auto[

```r
tib %>%
  ggplot(aes(x=residuals)) +
  geom_histogram(aes(y=..density..),
                 alpha=.3) +
  stat_density(kernel = "epanechnikov",
               bw = .15,
               alpha=.0,
               col=1) +
  stat_function(fun = dnorm,
                args=list(mean=m,
                          sd=s),
                linetype=2) +
* ggtitle(tit,
*         "Dashed curve: Gaussian fit")
```
]

.panel2-kde_residuals-auto[

```
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

<img src="cm-5-EDA-III_files/figure-html/kde_residuals_auto_06_output-1.png" width="504" />
]

<style>
.panel1-kde_residuals-auto {
  color: black;
  width: 38.6060606060606%;
  hight: 32%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
.panel2-kde_residuals-auto {
  color: black;
  width: 59.3939393939394%;
  hight: 32%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
.panel3-kde_residuals-auto {
  color: black;
  width: NA%;
  hight: 33%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
</style>

---

### Assessing residuals

```r
shapiro.test(residuals(lm_df[['full']]))
```

```
##
##  Shapiro-Wilk normality test
##
## data:  residuals(lm_df[["full"]])
## W = 0.99752, p-value = 0.134
```

```r
# (coefficients(lm.1)[1])^2 + norm(betas - coefficients(lm.1)[-1])^2
# norm(betas - lm.0$coefficients)^2
```

---
template: inter-slide

## Using Fisher `\(F\)` statistic

---

### Computing the Fisher `\(F\)` statistic for comparing the true and the full model

As we are dealing with simulated data, we can behave as if a benevolent _oracle_ had whispered that all but `13` coefficients are null, and even told us the indices of the non-null coefficients.

We fit this exact and minimal model to the data.

Under the Gaussian linear model assumption, the empirical distribution of the residuals is the empirical distribution of the projection of a standard Gaussian vector onto the linear subspace orthogonal to the subspace generated by the design columns.

As such, the distribution of the residuals is not the distribution of an i.i.d. Gaussian sample. Nevertheless, if the linear subspace is not too correlated with the canonical basis, the empirical distribution of the residuals is not too far from a Gaussian distribution.

---
exclude: true

<div class="figure">
<img src="cm-5-EDA-III_files/figure-html/residdistrib-1.png" alt="(ref:residdistrib)" width="504" />
<p class="caption">(ref:residdistrib)</p>
</div>

---

### Comparing true model and full model

We are facing a _binary hypotheses problem_.

- Null hypothesis: all coefficients but those of `x_2, x_9, x_22, x_27, x_41, x_49, x_52, x_53, x_59, x_60, x_77, x_88, x_93` are null

- Alternative hypothesis: the vector of coefficients does not satisfy the linear constraints defining the null hypothesis

Note that both the null and alternative hypotheses are composite.

---

The null hypothesis asserts that the true vector of coefficients belongs to a linear subspace of dimension `13` of `\(\mathbb{R}^{100}\)`.

If we stick to Gaussian linear modelling, we may test this null hypothesis by performing a _Fisher test_.

Under the null hypothesis,

`$$\frac{\|\widehat{Y} - \widehat{Y}^\circ\|^2/(p-p^\circ)}{\|{Y} - \widehat{Y}\|^2/(n-p)} \sim F_{p-p^\circ, n-p}$$`

where `\(F_{p-p^\circ, n-p}\)` stands for the Fisher distribution with `\(p-p^\circ\)` and `\(n-p\)` degrees of freedom.

Function `anova()` does this.
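---

### Decomposing the Fisher test (sketch)

The comparison performed by `anova()` can be reproduced by hand from the two fitted models. A minimal sketch, assuming the fitted models `lm_df[['true']]` and `lm_df[['full']]` built on the previous slides:

```r
rss_true <- deviance(lm_df[['true']])    # ||Y - hat(Y)^o||^2, restricted (true) model
rss_full <- deviance(lm_df[['full']])    # ||Y - hat(Y)||^2, full model

df_true <- df.residual(lm_df[['true']])  # n - p^o
df_full <- df.residual(lm_df[['full']])  # n - p

# Fisher statistic: decrease of the RSS per extra parameter,
# normalized by the mean squared residual of the full model
f_stat <- ((rss_true - rss_full) / (df_true - df_full)) / (rss_full / df_full)

# probability that an F(p - p^o, n - p) random variable exceeds f_stat
pf(f_stat, df1 = df_true - df_full, df2 = df_full, lower.tail = FALSE)
```

For nested models, `rss_true - rss_full` coincides with `\(\|\widehat{Y} - \widehat{Y}^\circ\|^2\)`, so this is the statistic displayed above.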
---
exclude: true

```r
anova(lm_df[['true']], lm_df[['full']]) %>%
  broom::tidy() %>%
  dplyr::mutate_if(is.numeric, round, 2)
```

```
## # A tibble: 2 × 6
##   res.df   rss    df sumsq statistic p.value
##    <dbl> <dbl> <dbl> <dbl>     <dbl>   <dbl>
## 1    986  958.    NA    NA     NA      NA
## 2    900  885.    86  72.1      0.85    0.82
```

On the second row, column `statistic` contains the Fisher statistic.

The `p.value` tells us the probability that a random variable distributed according to the Fisher distribution with `\(86\)` and `\(900\)` degrees of freedom ( `\(F_{86, 900}\)` ) exceeds the statistic's value `\(0.85\)`.

The fact that `p.value` (`pf(.85, 86, 900, lower.tail = FALSE)`) is `\(0.82\)` does not prompt us to reject the null hypothesis.

---

### Handling `\(p\)`-values

The rule of thumb when using the `p.value` reported by a test function is the following:

> If, _under the null hypothesis_, you accept to be wrong with probability `\(\alpha\)`,
> then
> you should reject when the `p.value` is less than `\(\alpha\)`

In words: if you are looking for a _level_ (probability of an error of the first kind) equal to `\(\alpha\)`, you should reject the null hypothesis when the `\(p\)`-value is less than `\(\alpha\)`

We can decompose the computations performed by ANOVA

---
exclude: true

```r
setdiff(important_coeffs,  # 2 9 22 27 41 49 52 53 59 60 77 88 93
        which(
          (df %>%
             select(starts_with("x")) %>%
             purrr::map_dbl(.f = ~ cor(., df$y)) %>%
             abs()
          ) > .05)
        )
```

```r
lm.0 %>% summary()

(svd(t(Z) %*% (Z))$d)^(-1)  # Inverse of covariance matrix of coefficients in lm.0
```

---
exclude: true
class: middle, center, inverse

## Iterative methods for OLS

---
exclude: true
class: middle, center, inverse

## ...

---
exclude: true
template: inter-slide

## Feature engineering

---
exclude: true
template: inter-slide

##

---
class: middle, center, inverse
background-image: url('./img/pexels-cottonbro-3171837.jpg')
background-size: cover

# The End