name: inter-slide
class: left, middle, inverse

{{ content }}

---
name: layout-general
layout: true
class: left, middle

<style>
.remark-slide-number {
  position: inherit;
}

.remark-slide-number .progress-bar-container {
  position: absolute;
  bottom: 0;
  height: 4px;
  display: block;
  left: 0;
  right: 0;
}

.remark-slide-number .progress-bar {
  height: 100%;
  background-color: red;
}
</style>
---
template: inter-slide

# Multilinear Regression II: Diagnostics

#### [Master I ISIFAR]()
#### [Analyse de Données](http://stephane-v-boucheron.fr/courses/isidata/)
#### [Stéphane Boucheron](http://stephane-v-boucheron.fr)

---
class: inverse, middle

##
### Motivations
### Residuals
### Spotting outliers

---
template: inter-slide

## Motivations

---

### Revisiting (again) the Anscombe quartet


```r
datasets::anscombe %>%
  pivot_longer(everything(),
               names_to = c(".value", "group"),
               names_pattern = "(.)(.)") %>%
  rename(X=x, Y=y) %>%
  arrange(group) -> anscombe

lms <- list()

for (s in as.character(1:4)) {
  lms[[s]] <- lm(Y ~ X, anscombe, subset = group==s)
}
```

---

### Visualization: scatterplots and lineplots

.panelset[

.panel[.panel-name[Code]


```r
p <- anscombe %>%
  ggplot(aes(x=X, y=Y)) +
  geom_point() +
* geom_smooth(method="lm", se=FALSE) +
* facet_wrap(~ group) +
  ggtitle("Anscombe quartet: linear regression Y ~ X")
```

]

.panel[.panel-name[Lines and points]

.pull-left[

```
## `geom_smooth()` using formula 'y ~ x'
```

<img src="cm-5-EDA-II_files/figure-html/unnamed-chunk-3-1.png" width="504" />

]

.pull-right[

Among the four datasets, only the two in the left column are reasonably handled by simple linear regression

The bottom-left dataset highlights the impact of *outliers* on Least Squares minimization

]

]

]

???

---

Base R
has a generic `plot()` function. For objects of class `lm`, `plot()` outputs diagnostic plots that can help figure out whether linear modeling is appropriate in the current context

---

### Graphical diagnostics on the first Anscombe dataset


```r
plot(lms[["1"]])
```

<img src="cm-5-EDA-II_files/figure-html/unnamed-chunk-4-1.png" width="50%" /><img src="cm-5-EDA-II_files/figure-html/unnamed-chunk-4-2.png" width="50%" /><img src="cm-5-EDA-II_files/figure-html/unnamed-chunk-4-3.png" width="50%" /><img src="cm-5-EDA-II_files/figure-html/unnamed-chunk-4-4.png" width="50%" />

---

### Graphical diagnostics on the third Anscombe dataset


```r
plot(lms[["3"]])
```

<img src="cm-5-EDA-II_files/figure-html/unnamed-chunk-5-1.png" width="50%" /><img src="cm-5-EDA-II_files/figure-html/unnamed-chunk-5-2.png" width="50%" /><img src="cm-5-EDA-II_files/figure-html/unnamed-chunk-5-3.png" width="50%" /><img src="cm-5-EDA-II_files/figure-html/unnamed-chunk-5-4.png" width="50%" />

---

### Graphical diagnostics on the second Anscombe dataset


```r
plot(lms[["2"]])
```

<img src="cm-5-EDA-II_files/figure-html/unnamed-chunk-6-1.png" width="50%" /><img src="cm-5-EDA-II_files/figure-html/unnamed-chunk-6-2.png" width="50%" /><img src="cm-5-EDA-II_files/figure-html/unnamed-chunk-6-3.png" width="50%" /><img src="cm-5-EDA-II_files/figure-html/unnamed-chunk-6-4.png" width="50%" />

---

All plots are concerned with residuals, and most of them with fitted values

To understand and possibly use the diagnostic plots, we need to know the meaning of a few terms

- Normal QQ-plot
- Standardized residuals
- Cook's distance
- Leverages

???

---

All diagnostic plots can be built using the output of `broom::augment()`


```r
broom::augment(lms[["1"]])
```

```
## # A tibble: 11 × 8
##        Y     X .fitted  .resid   .hat .sigma   .cooksd .std.resid
##    <dbl> <dbl>   <dbl>   <dbl>  <dbl>  <dbl>     <dbl>      <dbl>
##  1  8.04    10    8.00  0.0390 0.100    1.31 0.0000614     0.0332
##  2  6.95     8    7.00 -0.0508 0.1      1.31 0.000104     -0.0433
##  3  7.58    13    9.50 -1.92   0.236    1.06 0.489        -1.78
##  4  8.81     9    7.50  1.31   0.0909   1.22 0.0616        1.11
##  5  8.33    11    8.50 -0.171  0.127    1.31 0.00160      -0.148
##  6  9.96    14   10.0  -0.0414 0.318    1.31 0.000383     -0.0405
##  7  7.24     6    6.00  1.24   0.173    1.22 0.127         1.10
##  8  4.26     4    5.00 -0.740  0.318    1.27 0.123        -0.725
##  9 10.8     12    9.00  1.84   0.173    1.10 0.279         1.63
## 10  4.82     7    6.50 -1.68   0.127    1.15 0.154        -1.45
## 11  5.68     5    5.50  0.179  0.236    1.31 0.00427       0.166
```

---
template: inter-slide
name: residuals-predictions

## Residuals and predictions

---

### Spotting a trend

.fl.w-50.pa2[

Residuals and predictions are supposed to be orthogonal

Orthogonality is weaker than independence

The first diagnostic plot attempts to visualize a possible trend


```r
p <- broom::augment(lms[["2"]]) %>%
  ggplot() +
  aes(x=.fitted) +
  aes(y=.resid) +
  geom_point() +
  ggtitle("First diagnostic plot")
```

]

.fl.w-50.pa2[

<img src="cm-5-EDA-II_files/figure-html/unnamed-chunk-9-1.png" width="504" />

]
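---

### Spotting a trend (continued)

A minimal sketch of one way to make a possible trend more visible: overlay a smoother on the residuals versus fitted values scatterplot. It reuses the fit `lms[["2"]]` defined earlier; the loess smoother is only one option among others.


```r
# Residuals versus fitted values for the second Anscombe dataset,
# with a loess smoother overlaid to emphasize a possible trend
broom::augment(lms[["2"]]) %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_smooth(method = "loess", se = FALSE, formula = y ~ x) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals") +
  ggtitle("Anscombe II: residuals versus fitted values")
```

Under the Gaussian linear model, the smoother should stay close to the horizontal dashed line; a marked curvature, as in the second Anscombe dataset, signals that the linear specification is questionable.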
---
template: inter-slide
name: qqplot

## Normal QQ-plot

---

Quantile functions are defined at the end of the [lesson on Univariate Statistics](http://stephane-v-boucheron.fr/slides/cm-2-ISIDATA.html?panelset1=code2#72)

The (empirical) quantile function is the (generalized) inverse of the (empirical) cumulative distribution function

In diagnostic plots, the second plot is a normal QQ-plot: it compares the empirical quantile function defined by the residuals (the quantile function of the empirical distribution of the standardized residuals) with the quantile function of a Gaussian distribution with the same expectation (mean) and standard deviation

---

.bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[

The _quantile function_ of a probability distribution over `\(\mathbb{R}\)` is the generalized, left-continuous inverse function of the cumulative distribution function.

For `\(p \in (0,1)\)`

`$$F_n^{\leftarrow}(p) = \inf \Big\{ z : F_n(z) \geq p\Big\}$$`

`$$F_n^{\leftarrow}(p) = X_{k+1:n} \qquad \text{if } \qquad\frac{k}{n} < p \leq \frac{k+1}{n}$$`

]

---

.fl.w-third.pa2[

In order to plot the quantile function, it is almost enough to exchange the roles of the axes in the ECDF plot.


```r
p <- broom::augment(lms[["1"]]) %>%
  ggplot(mapping=aes(x=.std.resid)) +
  geom_step(stat="ecdf", direction='vh') +
  coord_flip() +
  xlab("Quantiles") +
  ylab("Probabilities") +
  ggtitle("Anscombe I: residuals quantile function")
```

Why is it safe to set the `direction` argument in the call to `geom_step()`?

]

.fl.w-two-thirds.pa2[

<img src="cm-5-EDA-II_files/figure-html/unnamed-chunk-10-1.png" width="504" />

]

---

### Use `geom_qq()`


```r
broom::augment(lms[["1"]]) %>%
  ggplot(mapping=aes(sample=.std.resid)) +
  geom_qq() +
  stat_qq_line() +
  xlab("Normal Quantiles") +
  ylab("Standardized residuals") +
  ggtitle("Anscombe I: residuals quantile function")
```

<img src="cm-5-EDA-II_files/figure-html/unnamed-chunk-11-1.png" width="504" />
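---

### A hand-made normal QQ-plot

A minimal sketch of the kind of computation a normal QQ-plot rests on (close to what `geom_qq()` does), reusing the fit `lms[["1"]]` and `broom::augment()` from the previous slides: sort the standardized residuals and plot them against standard Gaussian quantiles evaluated at the plotting positions `ppoints(n)`.


```r
# Hand-made normal QQ-plot for the standardized residuals of lms[["1"]]
broom::augment(lms[["1"]]) %>%
  arrange(.std.resid) %>%
  mutate(theoretical = qnorm(ppoints(n()))) %>%  # standard Gaussian quantiles
  ggplot(aes(x = theoretical, y = .std.resid)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "dotted") +
  xlab("Normal Quantiles") +
  ylab("Standardized residuals") +
  ggtitle("Anscombe I: hand-made normal QQ-plot")
```

If the standardized residuals behave like a standard Gaussian sample, the points should lie close to the first diagonal.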
---

### Comparing sample quantiles with distributional quantiles

When it comes to comparing the empirical distribution defined by a sample with another distribution (the so-called theoretical distribution), _quantile-quantile plots_ (`qqplot`) are favored

- let `\(F\)` denote the theoretical CDF and `\(G_m\)` the empirical CDF
- the quantile-quantile function is defined by

`$$G_m^\leftarrow \circ F(z) = Y_{k+1:m}\qquad \text{if} \qquad \frac{k}{m}< F(z) \leq \frac{k+1}{m}$$`

---

### Standardizing residuals

If we believe in the Gaussian Linear Model theory `\(Y = Z \times \theta + \sigma \epsilon\)` where `\((\epsilon_i)_{i\leq n} \sim_{\text{i.i.d.}} \mathcal{N}(0,1)\)`,

- the residuals `\(\widehat{\epsilon}\)` are distributed like `\(\sigma (\text{Id} - H) \times \epsilon\)`
- `\(\widehat{\epsilon}_i \sim \mathcal{N}(0, \sigma^2 (1 - H_{i,i}))\)`
- Standardized residuals are `\(\widehat{\epsilon}_i/(\sigma\sqrt{1- H_{i,i}})\)` (in practice, `\(\sigma\)` is replaced by the estimate `\(\widehat{\sigma}\)`)

If the Gaussian Linear Model assumption holds, the distribution of the standardized residuals should not depart too much from a standard Gaussian distribution

---

The third diagnostic plot displays the square root of the absolute standardized residuals against the fitted values

Like the first diagnostic plot, it aims at detecting trends which should not be observed under Gaussian linear model theory
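---

### Scale-location plot with ggplot

A minimal sketch reproducing this third diagnostic plot from the `broom::augment()` output, under the same setting as the previous slides (here with the fit `lms[["1"]]`); the `.std.resid` column contains the standardized residuals (as returned by `rstandard()`).


```r
# Scale-location plot: sqrt(|standardized residuals|) against fitted values
broom::augment(lms[["1"]]) %>%
  ggplot(aes(x = .fitted, y = sqrt(abs(.std.resid)))) +
  geom_point() +
  xlab("Fitted values") +
  ylab("Square root of |standardized residuals|") +
  ggtitle("Anscombe I: scale-location plot")
```

A trend in this plot suggests that the noise variance changes with the fitted values (heteroscedasticity).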
---
template: inter-slide

## Leverages

---

### Leverages

The Hat matrix satisfies

`$$\sum_{i=1}^n H_{i,i} = \text{rank}(Z)$$`

`$$0 \leq H_{i,i} \leq 1 \qquad \forall i \leq n$$`

`$$\widehat{Y}_i = H_{i,i}Y_i + \sum_{j\neq i} H_{i, j} Y_j \qquad \forall i \leq n$$`

`\(H_{i,i}\)` is called the _leverage_ of the `\(i^{\text{th}}\)` case (row)
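---

### Leverages in R

A minimal sketch of how leverages can be inspected in R, assuming the fits `lms` built earlier: `hatvalues()` returns the diagonal of the Hat matrix, and it coincides with the `.hat` column of `broom::augment()`.


```r
h <- hatvalues(lms[["3"]])   # leverages H_{i,i} for the third Anscombe fit

# The leverages sum to the rank of the design matrix
# (here 2: intercept and slope)
sum(h)

# Same values as the .hat column produced by broom::augment()
all.equal(unname(h), broom::augment(lms[["3"]])$.hat)
```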
---

### Spectral properties of Projection matrices

- If the design is orthogonal (with unit-normed columns), `\(H = Z \times Z^T\)`
- The average leverage is `\(\text{rank}(Z)/n\)`
- What do large or small leverages mean?
- What is a _typical_ leverage?
---

### Leverages

Assume that the columns of `\(\mathbf{Z}\)` are centered (have zero mean)

Assuming that the columns of `\(\mathbf{Z}\)` are centered means that `\(1\)` is in the kernel (nullspace) of the Hat matrix

Denote by `\(Z_i\)` the `\(i^{th}\)` column of `\(\mathbf{Z}^T\)` (the `\(i^{\text{th}}\)` observation)

`$$H_{i,i} = Z_i^T \times (\mathbf{Z}^T\mathbf{Z})^{-1} \times Z_i$$`

The level curves `\(r^2 = z^T \times (\mathbf{Z}^T\mathbf{Z})^{-1} \times z\)` are centered ellipses with axes defined by the eigenvectors of `\(\mathbf{Z}^T\mathbf{Z}\)`

> In some problems, high-leverage points with values close to `\(1\)` can occur, and
> identifying these cases is very useful in understanding a regression problem

Weisberg p. 170

---
exclude: true
class: middle, center, inverse

## Leave One Out (LOO)/Jackknife

---
exclude: true

### Notation

.fl.w-50.pa2[

- `\(Y^{(i)}\)`: `\(Y\)` deprived of its `\(i^{\text{th}}\)` coordinate
- `\(\mathbf{Z}^{(i)}\)`: the design matrix deprived of its `\(i^{\text{th}}\)` row
- `\(\epsilon^{(i)}\)`: `\(ϵ\)` deprived of its `\(i^{\text{th}}\)` coordinate

]

--
exclude: true

.fl.w-50.pa2[

- `\(\widehat{\theta}^{(i)}\)`: minimizer of `\(\big\|Y^{(i)} -\mathbf{Z}^{(i)}ϑ \big\|^2_2\)`

`$$\widehat{\theta}^{(i)} = \left({\mathbf{Z}^{(i)}}^T\mathbf{Z}^{(i)}\right)^{-1}{\mathbf{Z}^{(i)}}^T Y^{(i)}$$`

- `\(\widehat{Y}^{(i)} = \mathbf{Z} \widehat{\theta}^{(i)}\)`

]

???

These objects are obtained by removing the `\(i^{\text{th}}\)` observation from the data.

Comparing `\(\widehat{\theta}\)` and `\(\widehat{\theta}^{(i)}\)` tells us about the _influence_ of the `\(i^{\text{th}}\)` sample point

---
exclude: true

### A weighted `\(\ell_2\)` distance from `\(\widehat{\theta}\)`

`$$D(ϑ, \widehat{\theta})= \frac{1}{p \widehat{\sigma}^2} (ϑ -\widehat{\theta})^T × \big(\mathbf{Z}^T\mathbf{Z}\big)× (ϑ -\widehat{\theta})$$`

---
exclude: true

### Cook's distance

.bg-light-gray.b--dark-gray.ba.bw1.br3.shadow-5.ph4.mt5[

`$$D_i = D(\widehat{\theta}^{(i)}, \widehat{\theta})= \frac{1}{p \widehat{\sigma}^2} (\widehat{\theta}^{(i)} -\widehat{\theta})^T × \big(\mathbf{Z}^T\mathbf{Z}\big) × (\widehat{\theta}^{(i)} -\widehat{\theta})$$`

]

This is a weighted `\(\ell_2\)` distance between `\(\widehat{\theta}\)` and `\(\widehat{\theta}^{(i)}\)`, the OLS solution obtained by removing the `\(i^{\text{th}}\)` observation.

???

Cook (1977)

Cook's distance is a statistic

---
exclude: true

### Using `\(D_i\)`

Deleting observations with large values of `\(D_i\)` results in substantial changes in the analysis

Observations with the largest few `\(D_i\)` are of special interest

???

---
exclude: true

If `\(D_i\)` were exactly equal to the `\(\alpha\)` quantile of the `\(F_{p, n-p}\)` distribution, then deleting the `\(i^{\text{th}}\)` observation would move the estimate `\(\widehat{\theta}\)` to the edge of the `\(1-\alpha\)` confidence region based on the complete dataset

---

### Cook's distances

`$$D_i = \frac{1}{p} \frac{H_{i,i}}{(1 - H_{i,i})^2} \frac{\left(Y_i - \widehat{Y}_i\right)^2}{\widehat{\sigma}^2}$$`

---


```r
lms[["3"]] %>%
  broom::augment() %>%
  ggplot() +
  aes(x=.hat) +
  aes(y=.std.resid) +
  geom_point() +
  xlab("Leverage") +
  ylab("Standardized residual") +
  ggtitle("Anscombe third dataset")
```

<img src="cm-5-EDA-II_files/figure-html/unnamed-chunk-12-1.png" width="504" />
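---

### Cook's distances in R

A minimal sketch, assuming the fits `lms` from the earlier slides: `cooks.distance()` returns the `\(D_i\)`, and they coincide with the `.cooksd` column of `broom::augment()`. The closed-form expression above can be checked directly from the leverages and residuals.


```r
fit <- lms[["3"]]
h <- hatvalues(fit)                      # leverages H_{i,i}
e <- residuals(fit)                      # raw residuals Y_i - hat(Y)_i
p <- length(coefficients(fit))           # number of estimated coefficients
sigma2_hat <- sum(e^2) / df.residual(fit)

# Closed-form Cook's distances
d_manual <- (1 / p) * h / (1 - h)^2 * e^2 / sigma2_hat

all.equal(unname(d_manual), unname(cooks.distance(fit)))
all.equal(unname(d_manual), broom::augment(fit)$.cooksd)
```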
---
exclude: true

If all leverages are equal to `\(p/n\)`

`$$\sum_{i=1}^n D_i = \frac{n}{n-p}$$`

---
exclude: true

### LOOCV

???

- Opening towards resampling methods
- Jackknife
- Bootstrap
- Cross-Validation

---
exclude: true

### Leave-One-Out predictions

- `\(\widehat{Y}^{(i)}_i = X_i^T \widehat{\theta}^{(i)}\)` is a linear function of `\(ϵ^{(i)}\)`
- `\(Y_i\)` is a linear function of `\(\epsilon_i\)`
- `\(Y_i\)` and `\(\widehat{Y}^{(i)}_i\)` are stochastically independent
- We do not need to assume that the noise is Gaussian to ensure this independence

---
exclude: true

- If the noise is centered

`$$\mathbb{E}\left[Y_i - \widehat{Y}^{(i)}_i\right] = 0$$`

`$$\mathbb{E}\left[\left(Y_i - \widehat{Y}^{(i)}_i\right)^2\right] = \frac{\sigma^2}{1- H_{i,i}}$$`

- If the noise is Gaussian

`$$Y_i - \widehat{Y}^{(i)}_i \sim \mathcal{N}\left(0, \sigma^2 \left(1+ X_i^T \left({\mathbf{Z}^{(i)}}^T \mathbf{Z}^{(i)}\right)^{-1}X_i\right)\right)$$`

with

`$$X_i^T \left({\mathbf{Z}^{(i)}}^T \mathbf{Z}^{(i)}\right)^{-1}X_i = \frac{H_{i,i}}{1 - H_{i,i}}$$`

---
exclude: true

`$$\mathbb{E}\left[\sum_{i=1}^n \left(Y_i - \widehat{Y}^{(i)}_i\right)^2\right] = \sigma^2 \sum_{i=1}^n \frac{1}{1- H_{i,i}} \geq \sigma^2 (n+p)$$`

???

Computing the LOOCV estimate

`$$\sum_{i=1}^n \left(Y_i - \widehat{Y}^{(i)}_i\right)^2$$`

looks demanding

---
exclude: true

### Leave-One-Out Cross-Validation formula

`$$\sum_{i=1}^n \left(Y_i - \widehat{Y}^{(i)}_i\right)^2 = \sum_{i=1}^n \frac{(Y_i - \widehat{Y}_i)^2}{(1-H_{i,i})^2}$$`

???

One of the most beautiful formulae in statistics

---
exclude: true

### Sherman-Morrison formula (redux)

Let `\(\mathbf{A}\)` be an invertible square matrix and `\(u, v\)` be two vectors

- `\(\mathbf{A}+ u\,v^T\)` is invertible iff `\(1 + v^T \mathbf{A}^{-1} u \neq 0\)`
- if `\(1 + v^T \mathbf{A}^{-1} u \neq 0\)`

`$$\left(\mathbf{A} + u\,v^T\right)^{-1} = \mathbf{A}^{-1} - \frac{1}{1 + v^T \mathbf{A}^{-1} u} \mathbf{A}^{-1} uv^T \mathbf{A}^{-1}$$`

???

The inverse of a rank-one perturbation is a rank-one perturbation of the inverse

---
exclude: true

### Proof of Sherman-Morrison formula

If `\(\mathbf{A} + u\,v^T\)` is not invertible, there exists a non-zero vector `\(w\)` such that

`$$\mathbf{A} w + u\, v^T\, w=0$$`

Note that this implies `\(v^T\, w\neq 0\)` (otherwise `\(\mathbf{A}w=0\)` with `\(w\neq 0\)`, contradicting the invertibility of `\(\mathbf{A}\)`)

Multiplying by `\(v^T \mathbf{A}^{-1}\)` also yields `\(v^T w + v^T \mathbf{A}^{-1} u\, v^T w=0\)`

And thus `\(1 + v^T \mathbf{A}^{-1} u = 0\)`

---
exclude: true

### Proof of Sherman-Morrison formula (continued)

Conversely, if `\(1 + v^T \mathbf{A}^{-1} u = 0\)`

`$$\left(\mathbf{A} + u v^T\right)× \left(\mathbf{A}^{-1} u\right) = u \left(1 + v^T \mathbf{A}^{-1} u\right)=0$$`

This implies that `\(\mathbf{A} + u v^T\)` is not invertible

---
exclude: true

### Proof of Sherman-Morrison formula (continued)

To check the formula, first check the easier formula

`$$\left(\mathbf{Id} + xy^T \right)^{-1} = \mathbf{Id} - \frac{1}{1+ y^Tx} xy^T$$`

for `\(1 + y^Tx\neq 0\)`

Then use this formula with `\(x=\mathbf{A}^{-1}u\)` and `\(y=v\)`
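---
exclude: true

### Sherman-Morrison formula: a numerical check

A small numerical sanity check of the Sherman-Morrison formula, written as a minimal sketch with arbitrary simulated matrices: compare the direct inverse of `\(\mathbf{A} + uv^T\)` with the rank-one update of `\(\mathbf{A}^{-1}\)`.


```r
set.seed(42)                                       # for reproducibility
A <- crossprod(matrix(rnorm(9), 3, 3)) + diag(3)   # an invertible 3 x 3 matrix
u <- rnorm(3)
v <- rnorm(3)

A_inv <- solve(A)
lhs <- solve(A + u %*% t(v))          # direct inversion of the perturbed matrix
rhs <- A_inv -
  (A_inv %*% u %*% t(v) %*% A_inv) / drop(1 + t(v) %*% A_inv %*% u)

all.equal(lhs, rhs)
```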
---
exclude: true

### Proof of LOOCV formula

Let `\(\mathbf{Z}\)` denote the design matrix

Let `\(X_i\)` denote the `\(i^{\text{th}}\)` column of `\(\mathbf{Z}^T\)`

Let `\(\mathbf{Z}^{(i)}\)` denote the design matrix deprived of its `\(i^{\text{th}}\)` row

`$$\mathbf{Z}^T \times \mathbf{Z} = \left({\mathbf{Z}^{(i)}}^T \mathbf{Z}^{(i)}\right) + X_i X_i^T$$`
`\({\mathbf{Z}^{(i)}}^T \mathbf{Z}^{(i)}\)` is a rank-one perturbation of `\(\mathbf{Z}^T \times \mathbf{Z}\)`

---
exclude: true

### Proof of LOOCV formula (continued)

Applying the Sherman-Morrison formula with `\(\mathbf{A} = \mathbf{Z}^T\mathbf{Z}\)`, `\(u = X_i\)` and `\(v = -X_i\)`

`$$\left({\mathbf{Z}^{(i)}}^T \mathbf{Z}^{(i)}\right)^{-1} = \left(\mathbf{Z}^T\mathbf{Z}\right)^{-1} + \frac{\left(\mathbf{Z}^T\mathbf{Z}\right)^{-1} X_i X_i^T \left(\mathbf{Z}^T\mathbf{Z}\right)^{-1}}{1 - H_{i,i}}$$`

since `\(X_i^T \left(\mathbf{Z}^T\mathbf{Z}\right)^{-1} X_i = H_{i,i}\)`
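---
exclude: true

### LOOCV formula: a numerical check

A minimal sketch verifying the Leave-One-Out Cross-Validation formula on one of the Anscombe fits, assuming `anscombe` and `lms` as built earlier: refit the model once per left-out observation and compare the brute-force sum of squared LOO prediction errors with the closed-form expression based on residuals and leverages.


```r
fit <- lms[["1"]]
data_1 <- dplyr::filter(anscombe, group == "1")
n <- nrow(data_1)

# Brute force: refit without observation i, predict Y_i, keep the error
loo_errors <- vapply(seq_len(n), function(i) {
  fit_i <- lm(Y ~ X, data = data_1[-i, ])
  unname(data_1$Y[i] - predict(fit_i, newdata = data_1[i, ]))
}, numeric(1))

# Closed form: residuals divided by (1 - leverage)
closed_form <- residuals(fit) / (1 - hatvalues(fit))

all.equal(sum(loo_errors^2), sum(closed_form^2))
```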