name: inter-slide
class: left, middle, inverse

{{ content }}

---
name: layout-general
layout: true
class: left, middle

<style>
.remark-slide-number {
  position: inherit;
}

.remark-slide-number .progress-bar-container {
  position: absolute;
  bottom: 0;
  height: 4px;
  display: block;
  left: 0;
  right: 0;
}

.remark-slide-number .progress-bar {
  height: 100%;
  background-color: red;
}
</style>
 
---
template: inter-slide

# Multilinear Regression II: Variable selection

#### [Master I MIDS & MFA]()
#### [Analyse Exploratoire de Données](http://stephane-v-boucheron.fr/courses/eda/)
#### [Stéphane Boucheron](http://stephane-v-boucheron.fr)

---
class: inverse, middle

##
### Motivations

### Nested models

### Information criteria

### Cross-validation

---
template: inter-slide
name: anova

## ANOVA

---

[ANOVA](https://en.wikipedia.org/wiki/Analysis_of_variance) stands for ANalysis Of VAriance.

It provides us with a method for choosing and assessing a model, at least if we adhere to the _Gaussian Linear Modeling_ credo.

ANOVA is just the tip of a large iceberg. It has been extended to more general models and has inspired _model selection methods_ in many areas of statistics and machine learning.

---
template: inter-slide

## Variable selection in Gaussian Linear Models

---

In the rest of this section, we use simulated data to introduce and motivate the ANOVA method.

We generate

- a design matrix `\(Z \in \mathcal{M}_{n,p}\)` with `\(n=1000\)` and `\(p=100\)`,
- a sparse random vector `\(\beta \in \mathbb{R}^p\)`,
- a vector of observations `\(Y = Z \times \beta + \sigma \epsilon\)` where `\(\sigma=1\)` and `\(\epsilon\)` is a standard Gaussian vector of dimension `\(n\)`.

---
count: false

### Preparing ANOVA

```r
*set.seed(1515) # for reproducibility
```

---
count: false

### Preparing ANOVA

```r
set.seed(1515) # for reproducibility
*p <- 100 # dimension of sample points
```

---
count: false

### Preparing ANOVA

```r
set.seed(1515) # for reproducibility
p <- 100 # dimension of sample points
*n <- 1000 # sample size
```

---
count: false

### Preparing ANOVA

```r
set.seed(1515) # for reproducibility
p <- 100 # dimension of sample points
n <- 1000 # sample size
*sigma <- 1 # noise standard deviation
```

---
count: false

### Preparing ANOVA

```r
set.seed(1515) # for reproducibility
p <- 100 # dimension of sample points
n <- 1000 # sample size
sigma <- 1 # noise standard deviation
*betas <- matrix(rbinom(p, 1, 0.1) * rnorm(p, mean=1),
*                nrow=p, ncol=1) # true coefficient vector
```

---
count: false

### Preparing ANOVA

```r
set.seed(1515) # for reproducibility
p <- 100 # dimension of sample points
n <- 1000 # sample size
sigma <- 1 # noise standard deviation
betas <- matrix(rbinom(p, 1, 0.1) * rnorm(p, mean=1),
                nrow=p, ncol=1) # true coefficient vector
*important_coeffs <- which(abs(betas) > .01) # sparse support
```

---
count: false

### Preparing ANOVA

```r
set.seed(1515) # for reproducibility
p <- 100 # dimension of sample points
n <- 1000 # sample size
sigma <- 1 # noise standard deviation
betas <- matrix(rbinom(p, 1, 0.1) * rnorm(p, mean=1),
                nrow=p, ncol=1) # true coefficient vector
important_coeffs <- which(abs(betas) > .01) # sparse support
*Z <- matrix(rnorm(n*p, 5), ncol = p, nrow=n) # design
```

---
count: false

### Preparing ANOVA

```r
set.seed(1515) # for reproducibility
p <- 100 # dimension of sample points
n <- 1000 # sample size
sigma <- 1 # noise standard deviation
betas <- matrix(rbinom(p, 1, 0.1) * rnorm(p, mean=1),
                nrow=p, ncol=1) # true coefficient vector
important_coeffs <- which(abs(betas) > .01) # sparse support
Z <- matrix(rnorm(n*p, 5), ncol = p, nrow=n) # design
*Y <- Z %*% betas + matrix(rnorm(n, 0, sd=sigma), nrow=n)
```

---
count: false

### Preparing ANOVA

```r
set.seed(1515) # for reproducibility
p <- 100 # dimension of sample points
n <- 1000 # sample size
sigma <- 1 # noise standard deviation
betas <- matrix(rbinom(p, 1, 0.1) * rnorm(p, mean=1),
                nrow=p, ncol=1) # true coefficient vector
important_coeffs <- which(abs(betas) > .01) # sparse support
Z <- matrix(rnorm(n*p, 5), ncol = p, nrow=n) # design
Y <- Z %*% betas + matrix(rnorm(n, 0, sd=sigma), nrow=n)
*df <- as.data.frame(cbind(Z, Y))
```

---
count: false

### Preparing ANOVA

```r
set.seed(1515) # for reproducibility
p <- 100 # dimension of sample points
n <- 1000 # sample size
sigma <- 1 # noise standard deviation
betas <- matrix(rbinom(p, 1, 0.1) * rnorm(p, mean=1),
                nrow=p, ncol=1) # true coefficient vector
important_coeffs <- which(abs(betas) > .01) # sparse support
Z <- matrix(rnorm(n*p, 5), ncol = p, nrow=n) # design
Y <- Z %*% betas + matrix(rnorm(n, 0, sd=sigma), nrow=n)
df <- as.data.frame(cbind(Z, Y))
*names(df) <- c(paste("x", 1:p, sep="_"), "y")
```

<style>
.panel1-my_simul_anova-auto {
  color: black;
  width: 99%;
  hight: 32%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
.panel2-my_simul_anova-auto {
  color: black;
  width: NA%;
  hight: 32%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
.panel3-my_simul_anova-auto {
  color: black;
  width: NA%;
  hight: 33%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
</style>

---

### True model versus full model

Dataframe `df` fits Gaussian linear modeling assumptions

We may feed it to `lm()`

This leads to what we call the _full model_

We may also take advantage of the fact that we know which columns of `\(Z\)` matter when we want to predict `\(Y\)`.

Hence, we can compare the linear model obtained by brute force, `lm(y ~ . -1, df)` ( `\(-1\)` because we need no intercept), with the linear model obtained by using only the columns of `df` corresponding to non-null coefficients of `betas`

We call the latter the _true model_

---

### Building formulae

```r
formula_full <- formula("y ~ . - 1")

formula_true <- formula(
  str_c("y",
        paste(str_c("x", important_coeffs, sep = "_"), collapse = " + "),
        sep = " ~ "))

formulae <- c(formula_full, formula_true)

lm_df <- map(formulae, lm, data = df)
names(lm_df) <- c("full", "true")
```

---

### The true model

```r
formula_true
```

```
## y ~ x_2 + x_9 + x_22 + x_27 + x_41 + x_49 + x_52 + x_53 + x_59 +
##     x_60 + x_77 + x_88 + x_93
```

---

### Comparing reconstruction errors

We can compare the true and full models using the quadratic error criterion.

The residuals can be gathered from both models using function `residuals()`.

The Residual Sums of Squares (RSS) are respectively 885.4 for the full model and 957.5 for the true model.

At first glance, the full model achieves a better fit to the data than the true model
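---

### Comparing reconstruction errors: a sanity check

The RSS values quoted on the previous slide can be recomputed directly from the fitted objects. A minimal sketch, assuming `purrr` is attached and that `lm_df` is the named list of fitted models built above:

```r
# Residual Sum of Squares of each model, computed from the residuals
map_dbl(lm_df, ~ sum(residuals(.x)^2))

# equivalently, deviance() returns the RSS of a fitted Gaussian linear model
map_dbl(lm_df, deviance)
```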
---

### Comparing estimation errors

This comparison is not fair

The full model is obtained by optimizing the quadratic error over a set which is much larger than the set used for the true model

As we are handling simulated data, we can compare the coefficient estimates in the two models and see whether one comes closer to the coefficients used to generate the data

The squared `\(\ell_2\)` distances to `\(\beta\)` are substantially different:

`$$\| \beta - \widehat{\beta}^{\text{true}}\|^2 = 1.31 \times 10^{-2}$$`

while

`$$\| \beta - \widehat{\beta}^{\text{full}}\|^2 = 9.75 \times 10^{-2}$$`

---

### Overfitting

The fact that the full model outperforms the (smaller) true model with respect to the Residual Sum of Squares criterion is rightly called _overfitting_.

---

### Diagnosing the full model

.panelset[

.panel[.panel-name[Code]

```r
xx <- residuals(lm_df[['full']])
yy <- fitted.values(lm_df[['full']])

qplot(x=yy, y=xx, alpha=I(.25)) +
  xlab("fitted values") +
  ylab("residuals") +
  ggtitle("Gaussian linear model",
          subtitle = "n=1000, p=100, full model")
```
]

.panel[.panel-name[Residuals versus Fitted plot]

![](cm-5-EDA-III_files/figure-html/qplot-residuals-label-1.png)
]
]

---

### ANOVA table of the full model

```r
head(anova(lm_df[['full']]) %>% broom::tidy(), 10)
```

```
## # A tibble: 10 × 6
##    term     df    sumsq   meansq statistic   p.value
##    <chr> <int>    <dbl>    <dbl>     <dbl>     <dbl>
##  1 x_1       1 2852285. 2852285.  2899391. 0
##  2 x_2       1   85346.   85346.    86755. 0
##  3 x_3       1   10823.   10823.    11002. 0
##  4 x_4       1    5974.    5974.     6072. 0
##  5 x_5       1    3901.    3901.     3965. 0
##  6 x_6       1    2053.    2053.     2086. 1.23e-236
##  7 x_7       1    1990.    1990.     2023. 1.96e-232
##  8 x_8       1    1153.    1153.     1172. 4.21e-165
##  9 x_9       1    4715.    4715.     4793. 0
## 10 x_10      1    1032.    1032.     1049. 3.69e-153
```

---
count: false

### Residual distribution (full model)

.panel1-kde_residuals-auto[

```r
*tib
```
]

.panel2-kde_residuals-auto[

```
## # A tibble: 1,000 × 1
##    residuals
##        <dbl>
##  1   -0.319
##  2    0.427
##  3   -0.495
##  4    0.0399
##  5    0.302
##  6   -0.920
##  7   -0.0672
##  8   -1.64
##  9   -0.475
## 10   -0.954
## # … with 990 more rows
```
]

---
count: false

### Residual distribution (full model)

.panel1-kde_residuals-auto[

```r
tib %>%
* ggplot(aes(x=residuals))
```
]

.panel2-kde_residuals-auto[

<img src="cm-5-EDA-III_files/figure-html/kde_residuals_auto_02_output-1.png" width="504" />
]

---
count: false

### Residual distribution (full model)

.panel1-kde_residuals-auto[

```r
tib %>%
  ggplot(aes(x=residuals)) +
* geom_histogram(aes(y=..density..),
*                alpha=.3)
```
]

.panel2-kde_residuals-auto[

```
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

<img src="cm-5-EDA-III_files/figure-html/kde_residuals_auto_03_output-1.png" width="504" />
]

---
count: false

### Residual distribution (full model)

.panel1-kde_residuals-auto[

```r
tib %>%
  ggplot(aes(x=residuals)) +
  geom_histogram(aes(y=..density..),
                 alpha=.3) +
* stat_density(kernel = "epanechnikov",
*              bw = .15,
*              alpha=.0,
*              col=1)
```
]

.panel2-kde_residuals-auto[

```
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

<img src="cm-5-EDA-III_files/figure-html/kde_residuals_auto_04_output-1.png" width="504" />
]

---
count: false

### Residual distribution (full model)

.panel1-kde_residuals-auto[

```r
tib %>%
  ggplot(aes(x=residuals)) +
  geom_histogram(aes(y=..density..),
                 alpha=.3) +
  stat_density(kernel = "epanechnikov",
               bw = .15,
               alpha=.0,
               col=1) +
* stat_function(fun = dnorm,
*               args=list(mean=m,
*                         sd=s),
*               linetype=2)
```
]

.panel2-kde_residuals-auto[

```
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

<img src="cm-5-EDA-III_files/figure-html/kde_residuals_auto_05_output-1.png" width="504" />
]

---
count: false

### Residual distribution (full model)

.panel1-kde_residuals-auto[

```r
tib %>%
  ggplot(aes(x=residuals)) +
  geom_histogram(aes(y=..density..),
                 alpha=.3) +
  stat_density(kernel = "epanechnikov",
               bw = .15,
               alpha=.0,
               col=1) +
  stat_function(fun = dnorm,
                args=list(mean=m,
                          sd=s),
                linetype=2) +
* ggtitle(tit,
*         "Dashed curve: Gaussian fit")
```
]

.panel2-kde_residuals-auto[

```
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

<img src="cm-5-EDA-III_files/figure-html/kde_residuals_auto_06_output-1.png" width="504" />
]

<style>
.panel1-kde_residuals-auto {
  color: black;
  width: 38.6060606060606%;
  hight: 32%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
.panel2-kde_residuals-auto {
  color: black;
  width: 59.3939393939394%;
  hight: 32%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
.panel3-kde_residuals-auto {
  color: black;
  width: NA%;
  hight: 33%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
</style>

---

### Assessing residuals

```r
shapiro.test(residuals(lm_df[['full']]))
```

```
##
##  Shapiro-Wilk normality test
##
## data:  residuals(lm_df[["full"]])
## W = 0.99752, p-value = 0.134
```

```r
# (coefficients(lm.1)[1])^2 + norm(betas - coefficients(lm.1)[-1])^2
# norm(betas - lm.0$coefficients)^2
```

---
template: inter-slide

## Using Fisher `\(F\)` statistic

---

### Computing the Fisher `\(F\)` statistic for comparing the true and the full model

As we are dealing with simulated data, we can behave as if a benevolent _oracle_ had whispered that all but `13` coefficients are null, and even told us the indices of the non-null coefficients.

We fit this exact and minimal model to the data.

Under the Gaussian linear model assumption, the empirical distribution of the residuals is the empirical distribution of the projection of a standard Gaussian vector onto the linear subspace orthogonal to the subspace generated by the design columns.

As such, the distribution of the residuals is not the distribution of an i.i.d. Gaussian sample. Nevertheless, if the linear subspace is not too correlated with the canonical basis, the empirical distribution of the residuals is not too far from a Gaussian distribution.

---
exclude: true

<div class="figure">
<img src="cm-5-EDA-III_files/figure-html/residdistrib-1.png" alt="(ref:residdistrib)" width="504" />
<p class="caption">(ref:residdistrib)</p>
</div>

---

### Comparing true model and full model

We are facing a _binary hypotheses problem_.

- Null hypothesis: all coefficients but those of `x_2, x_9, x_22, x_27, x_41, x_49, x_52, x_53, x_59, x_60, x_77, x_88, x_93` are null

- Alternative hypothesis: the vector of coefficients does not satisfy the linear constraints defining the null hypothesis

Note that both the null and alternative hypotheses are composite.

---

The null hypothesis asserts that the true vector of coefficients belongs to a linear subspace of dimension `13` of `\(\mathbb{R}^{100}\)`.

If we stick to Gaussian linear modelling, we may test this null hypothesis by performing a _Fisher test_.

Under the null hypothesis,

`$$\frac{\|\widehat{Y} - \widehat{Y}^\circ\|^2/(p-p^\circ)}{\|{Y} - \widehat{Y}\|^2/(n-p)} \sim F_{p-p^\circ, n-p}$$`

where `\(F_{p-p^\circ, n-p}\)` stands for the Fisher distribution with `\(p-p^\circ\)` and `\(n-p\)` degrees of freedom.

Function `anova()` does this.
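---

### Decomposing the Fisher test (sketch)

The comparison performed by `anova()` can be reproduced by hand from the two fitted models. A minimal sketch, assuming the fitted models `lm_df[['true']]` and `lm_df[['full']]` built on the previous slides:

```r
rss_true <- deviance(lm_df[['true']])    # ||Y - hat(Y)^o||^2, restricted (true) model
rss_full <- deviance(lm_df[['full']])    # ||Y - hat(Y)||^2, full model

df_true <- df.residual(lm_df[['true']])  # n - p^o
df_full <- df.residual(lm_df[['full']])  # n - p

# Fisher statistic: decrease of the RSS per extra parameter,
# normalized by the mean squared residual of the full model
f_stat <- ((rss_true - rss_full) / (df_true - df_full)) / (rss_full / df_full)

# probability that an F(p - p^o, n - p) random variable exceeds f_stat
pf(f_stat, df1 = df_true - df_full, df2 = df_full, lower.tail = FALSE)
```

For nested models, `rss_true - rss_full` coincides with `\(\|\widehat{Y} - \widehat{Y}^\circ\|^2\)`, so this is the statistic displayed above.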
---
exclude: true

```r
anova(lm_df[['true']], lm_df[['full']]) %>%
  broom::tidy() %>%
  dplyr::mutate_if(is.numeric, round, 2)
```

```
## # A tibble: 2 × 6
##   res.df   rss    df sumsq statistic p.value
##    <dbl> <dbl> <dbl> <dbl>     <dbl>   <dbl>
## 1    986  958.    NA    NA     NA      NA
## 2    900  885.    86  72.1      0.85    0.82
```

On the second row, column `statistic` contains the Fisher statistic.

The `p.value` tells us the probability that a random variable distributed according to the Fisher distribution with `\(86\)` and `\(900\)` degrees of freedom ( `\(F_{86, 900}\)` ) exceeds the statistic's value `\(0.85\)`.

The fact that `p.value` (`pf(.85, 86, 900, lower.tail = FALSE)`) is `\(0.82\)` does not prompt us to reject the null hypothesis.

---

### Handling `\(p\)`-values

The rule of thumb when using the `p.value` reported by a test function is the following:

> If, _under the null hypothesis_, you accept to be wrong with probability `\(\alpha\)`,
> then
> you should reject when the `p.value` is less than `\(\alpha\)`

In words: if you are looking for a _level_ (probability of an error of the first kind) equal to `\(\alpha\)`, you should reject the null hypothesis when the `\(p\)`-value is less than `\(\alpha\)`

We can decompose the computations performed by ANOVA

---
exclude: true

```r
setdiff(important_coeffs,  # 2 9 22 27 41 49 52 53 59 60 77 88 93
        which(
          (df %>%
             select(starts_with("x")) %>%
             purrr::map_dbl(.f = ~ cor(., df$y)) %>%
             abs()
          ) > .05)
        )
```

```r
lm.0 %>% summary()

(svd(t(Z) %*% (Z))$d)^(-1)  # Inverse of covariance matrix of coefficients in lm.0
```

---
exclude: true
class: middle, center, inverse

## Iterative methods for OLS

---
exclude: true
class: middle, center, inverse

## ...

---
exclude: true
template: inter-slide

## Feature engineering

---
exclude: true
template: inter-slide

##

---
class: middle, center, inverse
background-image: url('./img/pexels-cottonbro-3171837.jpg')
background-size: cover

# The End