name: layout-general layout: true class: left, middle <style> .remark-slide-number { position: inherit; } .remark-slide-number .progress-bar-container { position: absolute; bottom: 0; height: 4px; display: block; left: 0; right: 0; } .remark-slide-number .progress-bar { height: 100%; background-color: red; } </style>
--- name: inter-slide class: left, middle, inverse {{content}} --- template: inter-slide # Exploratory Data Analysis VI: PCA ### 2022-03-16 #### [Master I MIDS & MFA]() #### [Analyse Exploratoire de Données](http://stephane-v-boucheron.fr/courses/eda/) #### [Stéphane Boucheron](http://stephane-v-boucheron.fr) --- exclude: true template: inter-slide ### [Appetizer](#appetizer)
### [Approximating a sample matrix](#approx) ### [Singular Value Decomposition](#svd) ### [Eckart-Young Theorems](#eckart-young) ### [Computing SVD](#computing-svd) ### [PCA and SVD](#pca-svd) ### [Visualizing PCA](#vis-pca) --- template: inter-slide name: toc ###
### [Approximating a sample matrix](#approx) ### [Singular value decomposition](#svd) ### [Eckart-Young Theorem](#eckart-young) ### [Two perspectives on PCA (and SVD)](#pca-svd) ### [Computing SVD](#computing-svd) ### [PCA up and running](#pca-up-running) ### [Visualizing PCA](#vis-pca) ### [PCA visualization in perspective](#pca-dataviz) --- template: inter-slide name: approx ## Approximating a sample matrix --- ### A dimension reduction technique Principal Component Analysis (PCA) is an ancient dimension reduction technique from statistics that may be motivated from different perspectives: - maximizing projected variance - minimizing reconstruction error --- name: big-picture #### The big picture .bg-light-gray.b--light-gray.ba.bw1.br3.shadow-5.ph4.mt5[ The central idea of principal component analysis (PCA) is to reduce the dimensionality of a data set which consists of a large number of interrelated variables, while retaining as much as possible of the variation present in the data set. This is achieved by transforming to a new set of variables, the principal components (PCs), which are uncorrelated, and which are ordered so that the first few retain most of the variation present in all of the original variables. .tr[Jolliffe 1986] ] --- ### PCA: under the hood PCA can be performed using - _spectral decomposition_ of a data covariance (or correlation) matrix or - _singular value decomposition_ of a data matrix, possibly after a normalization step of the initial data. The main device is a _matrix factorization method_ known as Singular Value Decomposition (SVD) --- template: inter-slide name: svd ## Singular value decomposition --- The next theorem tells us that any real-valued matrix can be _factorized_ into the product of two orthogonal matrices and a diagonal matrix This _factorization_ is at the core of many results and methods in computational matrix analysis. From a computational viewpoint, Principal Component Analysis (PCA) is another name for Singular Value Decomposition (SVD) --- ### Theorem: Singular Value Decomposition .bg-light-gray.b--light-gray.ba.bw1.br3.shadow-5.ph4.mt5[ Let `\(X\)` be an `\(n \times p\)` real matrix There exist `\(U, D, V\)` such that `$$X = U \times D \times V^T \qquad\text{with} \begin{cases} U \times U^T & = \operatorname{Id}_n \\ V \times V^T & = \operatorname{Id}_p \\ D & \text{non-negative diagonal.}\end{cases}$$` ] -- - The columns of `\(U\)` are called the _left singular vectors_. - The columns of `\(V\)` are called the _right singular vectors_. - The diagonal coefficients of `\(D\)` are called the _singular values_. --- ###
`$$X = U \times D \times V^T \qquad\text{with} \begin{cases} U \times U^T & = \operatorname{Id}_n \\ V \times V^T & = \operatorname{Id}_p \\ D & \text{non-negative diagonal.}\end{cases}$$` ![](img/SVDimage.svg)<!-- --> --- ###
The SVD decomposition is not unique. `\(M := \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}\)` is orthogonal but it is not a rotation `$$\begin{array}{rl}\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} & = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \times\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \times \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\\ & = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \times \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \times \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \\ & = \ldots\end{array}$$` --- exclude: true --- ### Proposition .bg-light-gray.b--light-gray.ba.bw1.br3.shadow-5.ph4.mt5[ The number of positive (non-zero) singular values equals the rank of `\(X\)` ] --- ### Definition : Thin SVD Let `\(X\)` be a `\(n \times p\)` real matrix with rank `\(r\)`. Let `\(U, D, V\)` be a singular value decomposition of `\(X\)` such that the diagonal entries of `\(D\)` are non-increasing. Let `\(U_r\)` (resp. `\(V_r\)`) be the `\(n \times r\)` (resp. `\(p \times r\)`) matrix formed by the first `\(r\)` columns of `\(U\)` (resp. `\(V\)`) Then `$$X = U_r \times D_r \times V_r^T$$` is a _thin Singular Value Decomposition_ of `\(X\)`. --- exclude: true --- ### Example .f6[ In
R, the SVD of matrix `\(A\)` is obtained by `svd_ <- svd(A)` with `U <- svd_$u`, `V <- svd_$v` and `D <- diag(svd_$d)` `$$\begin{array}{rl}& \underbrace{\begin{bmatrix} 3.33 & 0.77 & 2.33 \\ 0.98 & 1.94 & 1.71 \\ -0.74 & -0.60 & 1.06 \\ 1.65 & 0.17 & 2.90 \\ \end{bmatrix}}_{A} \\ &= \underbrace{\begin{bmatrix} -0.71 & 0.28 & -0.48 \\ -0.42 & 0.27 & 0.86 \\ -0.02 & -0.75 & 0.16 \\ -0.56 & -0.54 & -0.04 \\ \end{bmatrix}}_{U} \times \underbrace{\begin{bmatrix} 5.66 & 0.00 & 0.00 \\ 0.00 & 1.87 & 0.00 \\ 0.00 & 0.00 & 1.58 \\ \end{bmatrix}}_{D} \times \underbrace{\begin{bmatrix} -0.65 & -0.25 & -0.71 \\ 0.48 & 0.59 & -0.65 \\ -0.59 & 0.76 & 0.27 \\ \end{bmatrix}}_{V^T}\end{array}$$` ] --- During the proof of the SVD Theorem, we rely on the next matrix norm. ### Definition: operator norm `$$\|\mathbf{M} \|_{\text{op}} := \sup_{u \in \mathbb{R}^n, \|u\|\leq 1}\sup_{v \in \mathbb{R}^p, \|v\|\leq 1} u^T \mathbf{M} v = \sup_{v \in \mathbb{R}^p, \|v\|\leq 1} \| \mathbf{M} v \|$$` --- ### Proof of full SVD existence **Induction on `\(\max(n,p)\)`** If `\(\max(n,p)=1\)`, there is nothing to do. The matrix is (almost) its own SVD. Assume the SVD exists for any `\(n\times p\)` matrix with `\(\max(n,p)\leq m\)` **Convention.** Let `\(\mathbf{Z}\)` be an `\(n\times p\)` matrix with `\(\max(n,p)=m+1\)` and rank `\(r\leq \min(n,p)\)`. If `\(\|\mathbf{Z}\|_{\text{op}}=0\)`, `\(\operatorname{rank}(\mathbf{Z})=0, \mathbf{Z}=\mathbf{0}\)` and there is nothing to prove If `\(\|\mathbf{Z}\|_{\text{op}}>0\)`, by compactness of the unit ball `\(B^p_2(1)\)` in `\(\mathbb{R}^p\)`, there exist unit vectors `\(\widehat{u} \in \mathbb{R}^n, \widehat{v} \in \mathbb{R}^p\)`, `$$\|\mathbf{Z}\|_{\text{op}} = \widehat{u}^T \mathbf{Z} \widehat{v} = \sup_{u \in B_2^n(1), v\in B^p_2(1) } u^T \mathbf{Z} v$$` --- ### Proof of full SVD existence (continued) - Let `\(\sigma_1 = \|\mathbf{Z}\|_{\text{op}}\)` - Let `\(\mathbf{A}\)`, `\(\mathbf{B}\)` be matrices such that `$$\begin{bmatrix} \widehat{u} & \vdots & \mathbf{A} \end{bmatrix} \qquad \text{and}\qquad \begin{bmatrix} \widehat{v} & \vdots & \mathbf{B} \end{bmatrix}$$` are *orthogonal* matrices with dimensions `\(n\times n\)` and `\(p\times p\)` --- ### Proof of full SVD existence (continued) Decomposing matrices into blocks: `$$\begin{bmatrix}\widehat{u}^T\\ \mathbf{A}^T\end{bmatrix} \times \mathbf{Z} \times \begin{bmatrix} \widehat{v} & \mathbf{B}\end{bmatrix} = \begin{bmatrix}\sigma_1 & w^T \\ \mathbf{0} & \mathbf{A}^T \mathbf{Z} \mathbf{B}\end{bmatrix} =: \mathbf{Y}$$` `\(\mathbf{A}^T \mathbf{Z} \widehat{v} = \mathbf{0}\)` since - `\(\mathbf{Z} \widehat{v}\)` is collinear with `\(\widehat{u}\)` and - the columns of `\(\mathbf{A}\)` are orthogonal to `\(\widehat{u}\)` --- ### Proof of full SVD existence (continued) Now, we have to check that `\(w=0\)` Because multiplying by orthogonal matrices does not change the operator norm, `\(\|\mathbf{Y}\|_{\text{op}}=\|\mathbf{Z}\|_{\text{op}} = \sigma_1\)` `$$\|\mathbf{Y}\|_{\text{op}} \geq \frac{\left\|\mathbf{Y} \begin{bmatrix}\sigma_1 \\ w \end{bmatrix}\right\|}{\left\|\begin{bmatrix}\sigma_1 \\ w \end{bmatrix}\right\|} \geq \frac{\sigma_1^2 +w^Tw}{\sqrt{\sigma_1^2 +w^Tw}} = \sqrt{\sigma_1^2 +w^Tw}$$` In order to have `\(\sigma_1 \geq \|\mathbf{Y}\|_{\text{op}}\)`, we need to have `\(w=\mathbf{0}\)` Matrix `\(\mathbf{A}^T \mathbf{Z} \mathbf{B}\)` (with rank `\(r-1\)`) satisfies the induction hypothesis ( `\(\max(n-1,p-1)\leq m\)` ) --- ### Proof of full SVD existence (continued) `\(\rightarrow\)` There exist orthogonal matrices `\(\mathbf{U}'\)` and
`\(\mathbf{V}'\)` such that `\({\mathbf{U}'}^T \mathbf{A}^T \mathbf{Z} \mathbf{B} \mathbf{V}'\)` is equal to a diagonal non-negative matrix `\(\mathbf{D}'\)` with `\(r-1\)` non null-coefficients This entails `$$\begin{bmatrix} 1 & \mathbf{0} \\ \mathbf{0} & \mathbf{U'}^T\end{bmatrix} \times \begin{bmatrix} \widehat{u}^T\\\mathbf{A}^T\end{bmatrix} \times \mathbf{Z} \times \begin{bmatrix} \widehat{v} & \mathbf{B}\end{bmatrix} \times \begin{bmatrix}1 & \mathbf{0} \\ \mathbf{0} & \mathbf{V}'\end{bmatrix} = \begin{bmatrix}\sigma_1 & \mathbf{0} \\ \mathbf{0} & \mathbf{D}'\end{bmatrix}$$` Matrices `$$\begin{bmatrix} 1 & \mathbf{0} \\ \mathbf{0} & \mathbf{U'}^T \end{bmatrix} \times \begin{bmatrix} \widehat{u}^T\\ \mathbf{A}^T \end{bmatrix}\qquad \text{and} \qquad \begin{bmatrix} \widehat{v} & \mathbf{B}\end{bmatrix} \times \begin{bmatrix} 1 & \mathbf{0} \\ \mathbf{0} & \mathbf{V}'\end{bmatrix}$$` are orthogonal with dimensions `\(n\times n\)` and `\(p\times p\)`.
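---
### A numerical sanity check

A minimal check of the factorization (a sketch on a made-up matrix; base R's `svd()` returns the thin SVD defined a few slides back):

```r
set.seed(42)
X <- matrix(rnorm(5 * 3), nrow = 5, ncol = 3)   # an arbitrary 5 x 3 matrix

svd_X <- svd(X)                                 # thin SVD: u is 5 x 3, v is 3 x 3
U <- svd_X$u
D <- diag(svd_X$d)
V <- svd_X$v

max(abs(U %*% D %*% t(V) - X))                  # reconstruction error, numerically zero
max(abs(crossprod(U) - diag(3)))                # U^T U = Id
max(abs(crossprod(V) - diag(3)))                # V^T V = Id
all(svd_X$d >= 0)                               # singular values are non-negative
all(diff(svd_X$d) <= 0)                         # ... and sorted in non-increasing order
```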
--- ### Take-home message
The SVD theorem is a factorization theorem: .tr[it allows us to break a matrix into simpler objects] There is something special about SVD beyond factorization. SVD delivers a sequence of _best approximations with given rank_ The low-rank approximations are simultaneously optimal with respect to two norms (operator and Hilbert-Schmidt) This is the content of the _Eckart-Young Theorem_ --- template: inter-slide name: eckart-young ## Eckart-Young Theorem(s) --- ### Landscape: Two relevant norms on `\(\mathcal{M}_{n,p}\)`, the set of `\(n\times p\)` matrices The first relevant norm is the _operator norm_ defined above The second norm is the _Hilbert-Schmidt-Frobenius norm_ --- ###
Definition: Hilbert-Schmidt inner product and norm .bg-light-gray.b--light-gray.ba.bw1.br3.shadow-5.ph4.mt5[ `$$\langle \mathbf{A}, \mathbf{B} \rangle_{\text{HS}} := \operatorname{Trace}(\mathbf{A} \mathbf{B}^T) \qquad \| \mathbf{A} \|_{\text{HS}}^2 := \langle \mathbf{A}, \mathbf{A} \rangle_{\text{HS}} = \sum_{i=1}^n\sum_{j=1}^p \mathbf{A}_{i,j}^2$$` - `\(\langle \cdot, \cdot \rangle_{\text{HS}}\)` is an inner product (symmetric, bilinear, positive) - `\(\|\cdot \|_{\text{HS}}\)` is a norm (satisfies triangle inequality, positive, scaling property) ] --- ### Theorem (Optimality of Truncated SVD) .bg-light-gray.b--light-gray.ba.bw1.br3.shadow-5.ph4.mt5[ The _truncated SVD_ leads to the best rank- `\(k\)` approximation with respect to the Hilbert-Schmidt norm and the operator norm ] --- ### A basis for `\(\mathbb{R}^{n \times p}\)` ### Proposition .bg-light-gray.b--light-gray.ba.bw1.br3.shadow-5.ph4.mt5[ Let `\(u_1, \ldots, u_n\)` and `\(v_1,\ldots,v_p\)` form two orthonormal bases of `\(\mathbb{R}^n\)` and `\(\mathbb{R}^p\)`, then `\(\left(u_i \times v_j^T\right)_{i\leq n, j\leq p}\)` forms an orthonormal basis of `\(\mathbb{R}^{n\times p}\)` ] --- ### Proof `$$\begin{array}{rcl}\langle u_i v_j^T, u_k v_\ell^T\rangle_{HS} & = & \operatorname{Trace}\left( u_i v_j^T v_\ell u_k^T\right) \\ &= & \langle v_j, v_\ell\rangle \operatorname{Trace}\left( u_i u_k^T\right) \\ &= & \langle v_j, v_\ell\rangle \times \langle u_i, u_k\rangle \end{array}$$`
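---
### The proof, numerically

The computation above is easy to check numerically (a sketch; the arbitrary orthogonal matrices are obtained from a QR decomposition of random matrices):

```r
set.seed(1)
n <- 4; p <- 3
U <- qr.Q(qr(matrix(rnorm(n * n), n, n)))       # an arbitrary orthogonal n x n matrix
V <- qr.Q(qr(matrix(rnorm(p * p), p, p)))       # an arbitrary orthogonal p x p matrix

hs <- function(A, B) sum(diag(A %*% t(B)))      # Trace(A B^T), equal to sum(A * B)

hs(U[, 1] %*% t(V[, 1]), U[, 1] %*% t(V[, 1]))  # squared HS norm of u_1 v_1^T : 1
hs(U[, 1] %*% t(V[, 1]), U[, 2] %*% t(V[, 1]))  # distinct left vectors  : 0
hs(U[, 1] %*% t(V[, 1]), U[, 1] %*% t(V[, 2]))  # distinct right vectors : 0
```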
--- ### Corollary .bg-light-gray.b--light-gray.ba.bw1.br3.shadow-5.ph4.mt5[ The outer products `\(u_i v_j^T\)` of the left and right singular vectors in the full SVD of an `\(n\times p\)` matrix form an orthonormal basis of `\(\mathbb{R}^{n\times p}\)` endowed with the Hilbert-Schmidt inner product ] --- exclude: true ### Proof **Best approximation in Hilbert-Schmidt norm (proof)** Assume `\(n\geq p\)` (w.l.o.g.). From the thin SVD for `\(\mathbf{Z}\)`, we have `$$\mathbf{Z} = \sum_{i=1}^p \sigma_i u_i v_i^T\qquad \text{where } \sigma_1\geq \ldots \geq \sigma_r > \sigma_{r+1} = \ldots = \sigma_p= 0$$` --- exclude: true ### Proof (continued) Let `$$\mathbf{W} = \sum_{i\leq n,j\leq p} \mu_{i,j} u_i v_j^T$$` `$$\left\| \mathbf{Z} - \mathbf{W} \right\|_{HS}^2 = \sum_{i=1}^p (\sigma_i - \mu_{i,i})^2+ \sum_{i=1}^n\sum_{j=1}^p \mathbb{I}_{i\neq j} \mu_{i,j}^2$$` - Any minimizer should satisfy `\(\mu_{i,j}=0\)` for `\(i\neq j\)` - `\(\sum_{i\leq p} \mu_{i,i} u_iv_i^T\)` has rank `\(k\)` iff exactly `\(k\)` coefficients `\(\mu_{i,i}\)` are non-zero - The squared distance is minimized by picking `\(\mu_{i,i}=\sigma_i\)` for `\(i\leq k\)` and `\(\mu_{i,i}=0\)` for `\(k+1 \leq i \leq p\)` --- ### Proof (Best approximation in operator norm) The SVD of `\(\mathbf{Z}-\mathbf{Z}_k\)` is `$$\mathbf{U}\times \operatorname{diag}\left(\begin{array}{c} 0 \\ \vdots\\ 0 \\ \sigma_{k+1}\\ \vdots \\ \sigma_p \end{array} \right)\times \mathbf{V}^T$$` The operator norm of the difference is equal to `\(\sigma_{k+1}\)` Now, let `\(\mathbf{W}\)` be of rank `\(k\)`. `\(\operatorname{ker}(\mathbf{W})\)` has dimension `\(p-k\)`, so it intersects the linear span of the first `\(k+1\)` columns of `\(\mathbf{V}\)`. --- ### Proof (continued) Let `\(y\)` be a unit vector in this intersection. By definition `\(\mathbf{W}y=0\)`. `$$\| \mathbf{Z}y \|^2 = \sum_{i=1}^{k+1} \sigma_i^2 \langle v_i, y \rangle^2 \geq \sigma^2_{k+1}$$` which proves that `\(\| \mathbf{Z} - \mathbf{W}\|_{\text{op}} \geq \sigma_{k+1}\)`
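---
### Eckart-Young, numerically

A numerical illustration of the two optimality statements (a sketch on an arbitrary matrix; `k` is the target rank):

```r
set.seed(2022)
Z <- matrix(rnorm(20 * 6), nrow = 20)        # an arbitrary 20 x 6 matrix
k <- 2

svd_Z <- svd(Z)
# truncated SVD: keep the k leading singular triples
Z_k <- svd_Z$u[, 1:k] %*% diag(svd_Z$d[1:k]) %*% t(svd_Z$v[, 1:k])
qr(Z_k)$rank                                 # the approximation has rank k

norm(Z - Z_k, type = "F")                    # Hilbert-Schmidt (Frobenius) error ...
sqrt(sum(svd_Z$d[-(1:k)]^2))                 # ... equals the root of the discarded sigma_i^2

norm(Z - Z_k, type = "2")                    # operator norm error ...
svd_Z$d[k + 1]                               # ... equals sigma_{k+1}
```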
--- template: inter-slide name: pca-svd ## Two perspectives on PCA (and SVD) --- .fl.w-50.pa2[ ### Faithful projection Let `\(X\)` be an `\(n \times p\)` matrix. A reasonable task consists in finding a `\(k \lll p\)`-dimensional subspace `\(E\)` of `\(\mathbb{R}^p\)` such that `$$X \times \Pi_E$$` is as close as possible to `\(X\)`, where `\(\Pi_E\)` denotes the orthogonal projection onto `\(E\)`. By as close as possible, we mean that the sum of squared Euclidean distances between the rows of `\(X\)` and the rows of `\(X \times \Pi_E\)` is as small as possible Picking `\(E\)` as the subspace generated by the first `\(k\)` right-singular vectors of `\(X\)` solves the problem ] .fl.w-50.pa2[ ### Maximizing the projected variance Let `\(X\)` be an `\(n \times p\)` matrix. A reasonable task consists in finding a `\(k \lll p\)`-dimensional subspace `\(E\)` of `\(\mathbb{R}^p\)` such that `$$X \times \Pi_E$$` retains as much variance as possible. This means maximizing `$$\operatorname{var}\left(X \times \Pi_E\right)$$` ] --- template: inter-slide name: computing-svd ## Computing SVD --- ### Proposition .bg-light-gray.b--light-gray.ba.bw1.br3.shadow-5.ph4.mt5[ Computing the SVD is not more costly than computing the spectral decomposition of a symmetric matrix ] --- exclude: true ### Proof If we feed a symmetric matrix `\(A\)` to an SVD algorithm, we obtain `\(U, D, V\)` such that `\(A = U \times D \times V^T\)`. Then, we also have `$$A = \frac{1}{2} \left((U+V) \times D \times (U+V)^T \right)$$` which defines a spectral decomposition of `\(A\)`
--- ### Proof If we have an eigendecomposition algorithm (say `eigen`), we can feed it with - the symmetric matrix `\(A \times A^T\)` to get the _left singular vectors_ as _eigenvectors_ of `\(A \times A^T\)` and - the symmetric matrix `\(A^T \times A\)` to get the _right singular vectors_ as _eigenvectors_ of `\(A^T \times A\)` Both spectral decompositions lead to the squared singular values If we count the number of field operations, both problems are linearly equivalent (matrix multiplication is less expensive than spectral or singular value decomposition).
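---
### The recipe in R

A sketch of the recipe above on a made-up matrix (in actual computations `svd()` remains preferable for numerical accuracy):

```r
set.seed(1)
A <- matrix(rnorm(6 * 4), nrow = 6)           # an arbitrary 6 x 4 matrix

eig <- eigen(crossprod(A), symmetric = TRUE)  # spectral decomposition of A^T A

sqrt(pmax(eig$values, 0))                     # square roots of the eigenvalues ...
svd(A)$d                                      # ... match the singular values

# eigenvectors of A^T A match the right singular vectors, up to sign
max(abs(abs(eig$vectors) - abs(svd(A)$v)))
```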
--- template: inter-slide name: pca-up-running ## PCA up and running --- We will walk through Principal Component Analysis using historical datasets - `USArrests`: crime data across states (individuals) in the USA - `iris`: morphological data of 150 flowers (individuals) from the genus _Iris_ - `decathlon`: scores of athletes at the ten events making up a decathlon - Life tables for a collection of Western European countries and the USA
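A loading sketch for the first three datasets (assuming `decathlon` comes from the `FactoMineR` package; the life tables are gathered separately and not shown here):

```r
data("USArrests", package = "datasets")     # crime data for the 50 US states
data("iris", package = "datasets")          # 150 flowers, 4 measurements + Species
data("decathlon", package = "FactoMineR")   # athletes' scores at the 10 events (+ Rank, Points, ...)

str(iris)
head(decathlon)
```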
Some datasets have non-numerical columns. Such columns are not used when computing the SVD that underpins PCA But non-numerical columns may be incorporated into PCA visualization ??? When PCA is used as a feature engineering device to prepare the dataset for regression, supervised classification, clustering or any other statistical/machine learning treatment, visualizing the interplay between non-numerical columns and the numerical columns constructed by PCA is crucial --- ### Iris flower dataset A showcase for Principal Component Analysis .fl.w-60.pa2.f6[ > The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by Ronald Fisher in 1936. Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. Two of the three species were collected in the Gaspé Peninsula "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus" > The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters .tr[[Wikipedia](https://en.wikipedia.org/wiki/Iris_flower_data_set)]
] .fl.w-40.pa2[ <img src="./img/iris-fleur-histoire.jpeg" width="187" /> ] --- ### `USArrests` .fl.w-60.f6[ <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Murder </th> <th style="text-align:right;"> Assault </th> <th style="text-align:right;"> UrbanPop </th> <th style="text-align:right;"> Rape </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> New Jersey </td> <td style="text-align:right;"> 7.4 </td> <td style="text-align:right;"> 159 </td> <td style="text-align:right;"> 89 </td> <td style="text-align:right;"> 18.8 </td> </tr> <tr> <td style="text-align:left;"> Mississippi </td> <td style="text-align:right;"> 16.1 </td> <td style="text-align:right;"> 259 </td> <td style="text-align:right;"> 44 </td> <td style="text-align:right;"> 17.1 </td> </tr> <tr> <td style="text-align:left;"> Kansas </td> <td style="text-align:right;"> 6.0 </td> <td style="text-align:right;"> 115 </td> <td style="text-align:right;"> 66 </td> <td style="text-align:right;"> 18.0 </td> </tr> <tr> <td style="text-align:left;"> Texas </td> <td style="text-align:right;"> 12.7 </td> <td style="text-align:right;"> 201 </td> <td style="text-align:right;"> 80 </td> <td style="text-align:right;"> 25.5 </td> </tr> <tr> <td style="text-align:left;"> Hawaii </td> <td style="text-align:right;"> 5.3 </td> <td style="text-align:right;"> 46 </td> <td style="text-align:right;"> 83 </td> <td style="text-align:right;"> 20.2 </td> </tr> <tr> <td style="text-align:left;"> Wyoming </td> <td style="text-align:right;"> 6.8 </td> <td style="text-align:right;"> 161 </td> <td style="text-align:right;"> 60 </td> <td style="text-align:right;"> 15.6 </td> </tr> <tr> <td style="text-align:left;"> Colorado </td> <td style="text-align:right;"> 7.9 </td> <td style="text-align:right;"> 204 </td> <td style="text-align:right;"> 78 </td> <td style="text-align:right;"> 38.7 </td> </tr> <tr> <td style="text-align:left;"> Arkansas </td> <td style="text-align:right;"> 8.8 </td> <td style="text-align:right;"> 190 </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 19.5 </td> </tr> <tr> <td style="text-align:left;"> Minnesota </td> <td style="text-align:right;"> 2.7 </td> <td style="text-align:right;"> 72 </td> <td style="text-align:right;"> 66 </td> <td style="text-align:right;"> 14.9 </td> </tr> <tr> <td style="text-align:left;"> Ohio </td> <td style="text-align:right;"> 7.3 </td> <td style="text-align:right;"> 120 </td> <td style="text-align:right;"> 75 </td> <td style="text-align:right;"> 21.4 </td> </tr> </tbody> </table> 4 numerical columns, questionable scalings ] .fl.w-40[ <img src="cm-6-EDA_files/figure-html/unnamed-chunk-8-1.png" width="504" /> ] --- ### Decathlon .fl.w-60.f6[ <table> <thead> <tr> <th style="text-align:left;"> rowname </th> <th style="text-align:right;"> 100m </th> <th style="text-align:right;"> Long.jump </th> <th style="text-align:right;"> High.jump </th> <th style="text-align:right;"> Discus </th> <th style="text-align:right;"> Pole.vault </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> MARTINEAU </td> <td style="text-align:right;"> 11.64 </td> <td style="text-align:right;"> 6.81 </td> <td style="text-align:right;"> 1.95 </td> <td style="text-align:right;"> 47.60 </td> <td style="text-align:right;"> 4.92 </td> </tr> <tr> <td style="text-align:left;"> McMULLEN </td> <td style="text-align:right;"> 10.83 </td> <td style="text-align:right;"> 7.31 </td> <td style="text-align:right;"> 2.13 </td> <td style="text-align:right;"> 44.41 </td> <td 
style="text-align:right;"> 4.42 </td> </tr> <tr> <td style="text-align:left;"> Smith </td> <td style="text-align:right;"> 10.85 </td> <td style="text-align:right;"> 6.81 </td> <td style="text-align:right;"> 1.91 </td> <td style="text-align:right;"> 49.02 </td> <td style="text-align:right;"> 4.20 </td> </tr> <tr> <td style="text-align:left;"> Sebrle </td> <td style="text-align:right;"> 10.85 </td> <td style="text-align:right;"> 7.84 </td> <td style="text-align:right;"> 2.12 </td> <td style="text-align:right;"> 48.72 </td> <td style="text-align:right;"> 5.00 </td> </tr> <tr> <td style="text-align:left;"> Casarsa </td> <td style="text-align:right;"> 11.36 </td> <td style="text-align:right;"> 6.68 </td> <td style="text-align:right;"> 1.94 </td> <td style="text-align:right;"> 48.66 </td> <td style="text-align:right;"> 4.40 </td> </tr> <tr> <td style="text-align:left;"> Qi </td> <td style="text-align:right;"> 11.06 </td> <td style="text-align:right;"> 7.34 </td> <td style="text-align:right;"> 1.97 </td> <td style="text-align:right;"> 45.13 </td> <td style="text-align:right;"> 4.50 </td> </tr> <tr> <td style="text-align:left;"> Macey </td> <td style="text-align:right;"> 10.89 </td> <td style="text-align:right;"> 7.47 </td> <td style="text-align:right;"> 2.15 </td> <td style="text-align:right;"> 48.34 </td> <td style="text-align:right;"> 4.40 </td> </tr> <tr> <td style="text-align:left;"> Zsivoczky </td> <td style="text-align:right;"> 10.91 </td> <td style="text-align:right;"> 7.14 </td> <td style="text-align:right;"> 2.12 </td> <td style="text-align:right;"> 45.62 </td> <td style="text-align:right;"> 4.70 </td> </tr> <tr> <td style="text-align:left;"> KARPOV </td> <td style="text-align:right;"> 11.02 </td> <td style="text-align:right;"> 7.30 </td> <td style="text-align:right;"> 2.04 </td> <td style="text-align:right;"> 48.95 </td> <td style="text-align:right;"> 4.92 </td> </tr> <tr> <td style="text-align:left;"> ZSIVOCZKY </td> <td style="text-align:right;"> 11.13 </td> <td style="text-align:right;"> 7.30 </td> <td style="text-align:right;"> 2.01 </td> <td style="text-align:right;"> 45.67 </td> <td style="text-align:right;"> 4.42 </td> </tr> </tbody> </table> 10 numerical columns, questionable scalings ] .fl.w-40[ <img src="cm-6-EDA_files/figure-html/unnamed-chunk-10-1.png" width="504" /> ] --- ### Using `prcomp` from base
> The calculation is done by a singular value decomposition of the (centered and possibly scaled) data matrix, not by using `eigen` on the covariance matrix. > This is generally the preferred method for numerical accuracy. > The `print` method for these objects prints the results in a nice format and the plot method produces a _scree plot_ At first glance, performing Principal Component Analysis amounts to applying SVD to the `\(n \times p\)` data matrix. --- ### Raw PCA on `iris` dataset .fl.w-50.pa2[ ```r pca_iris <- iris %>% select(- Species) %>% * prcomp(x = ., * center = FALSE, * scale. = FALSE) class(pca_iris) ; is.list(pca_iris) ``` ``` ## [1] "prcomp" ``` ``` ## [1] TRUE ``` ```r names(pca_iris) ``` ``` ## [1] "sdev" "rotation" "center" "scale" "x" ``` ] .fl.w-50.pa2[ The result has 5 components: - `sdev`: singular values, up to a `\(1/\sqrt{n-1}\)` factor - `rotation`: orthogonal matrix `\(V\)` in SVD - `center`: `FALSE` or the centering row vector - `scale`: `FALSE` or the scaling row vector - `x`: matrix `\(U \times D\)` from SVD ] --- ### `prcomp` and SVD .fl.w-50.pa2[ The result has 5 components: - `sdev`: singular values, up to a `\(1/\sqrt{n-1}\)` factor - `rotation`: orthogonal matrix `\(V\)` in SVD - `center`: `FALSE` or the centering row vector - `scale`: `FALSE` or the scaling row vector - `x`: matrix `\(U \times D\)` from SVD ] .fl.w-50.pa2[ ![](./img/SVDimage.svg)<!-- --> ] --- ### From data to PCA and back Let - `\(Z\)` be the `\(n \times p\)` matrix fed to `prcomp` - `\(X\)` be the `\(n \times p\)` matrix forming component `x` of the result - `\(V\)` be the `\(p \times p\)` matrix forming component `rotation` of the result - `\(\mu\)` be the centering vector, `\(\mathbf{0}\)` if `center=FALSE` - `\(\sigma\)` be the scaling vector, `\(\mathbf{1}\)` if `scale.=FALSE` then `$$(Z - \mathbf{1} \times \mu^T) \times \text{diag}(\sigma)^{-1} = X \times V^T$$` or `$$Z = X \times V^T \times \text{diag}(\sigma) + \mathbf{1} \times \mu^T$$` ??? PCA can be pictured as follows - optional centering and scaling - performing SVD - gathering the byproducts of SVD so as to please statisticians - possibly truncating the result so as to obtain a truncated SVD --- ### Glossary The columns of component `x` are called the _principal components_ of the dataset
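A quick check of the reconstruction formula above on the iris columns (a sketch; here centering and scaling are both switched on, unlike in the raw PCA of the earlier slide):

```r
Z <- as.matrix(iris[, 1:4])
pca <- prcomp(Z, center = TRUE, scale. = TRUE)

X <- pca$x                # principal components, i.e. U x D
V <- pca$rotation         # right singular vectors
mu <- pca$center          # column means
sigma <- pca$scale        # column standard deviations

Z_back <- X %*% t(V) %*% diag(sigma) + tcrossprod(rep(1, nrow(Z)), mu)
max(abs(Z - Z_back))      # numerically zero: the original data are recovered
```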
Note that the precise connection between the principal components and the dataset depends on centering and scaling --- ### Conventions | PCA vocabulary | SVD vocabulary | |:---------------|:----------------| | Principal axis | Left singular vector | | Principal component | Scaled left singular vector | | Factorial axis | Right singular vector | | Component of inertia | Squared singular value | --- ### Modus operandi 1. Read the data. 2. Choose the active individuals and variables (cherry-pick among the numeric variables) 3. Choose whether the variables need to be standardised 4. Choose the number of dimensions you want to keep (the _rank_) 5. Analyse the results 6. Automatically describe the dimensions of variability 7. Go back to the raw data. --- ### Impact of standardization and centering When performing PCA, once the _active variables_ have been chosen, we wonder whether we should centre and/or standardize the columns --
PCA investigates the spread of the data, not their location
To standardize or not to standardize? There is no universal answer
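A glimpse of why the answer is data-dependent (a sketch on `USArrests`): the raw variances live on very different scales, and the share of variance captured by the first principal component changes drastically once columns are standardized:

```r
sapply(USArrests, var)                       # Assault dominates the raw variances

# proportion of variance carried by PC1, without and with scaling
summary(prcomp(USArrests, scale. = FALSE))$importance[2, 1]
summary(prcomp(USArrests, scale. = TRUE))$importance[2, 1]
```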
Centering is performed by default when using `prcomp` --- ### Centering/Scaling: iris dataset On the iris dataset, centering and standardizing the columns slightly spreads out the eigenvalues .fl.w-50.pa2.f6[ ```r iris <- datasets::iris iris_num <- dplyr::select(iris, -Species) list_pca_iris <- map2(.x=rep(c(FALSE, TRUE), 2), .y=rep(c(FALSE, TRUE), c(2,2)), ~ prcomp(iris_num, center= .x, scale.=.y) ) names(list_pca_iris) <- stringr::str_c("pca_iris", c("", "c", "s", "c_s"), sep="_") ``` ] .fl.w-50.pa2[ <img src="cm-6-EDA_files/figure-html/unnamed-chunk-13-1.png" width="504" /> ] --- ### More on the centering/scaling dilemma .fl.w-50.pa2[ For `USArrests`, scaling is mandatory <img src="cm-6-EDA_files/figure-html/unnamed-chunk-15-1.png" width="504" /> ] .fl.w-50.pa2[ For `decathlon`, centering is mandatory, scaling is recommended <img src="cm-6-EDA_files/figure-html/unnamed-chunk-16-1.png" width="504" /> ] ??? In order to investigate the column means, there is no need to use PCA. Centering is almost always relevant. The goal of PCA is to understand the fluctuations of the data and, when possible, to show that most fluctuations can be found in a few privileged directions Scaling is often convenient: it does not alter correlations. Moreover, high variance columns artificially capture an excessive amount of inertia, without telling us much about the correlations between variables --- template: inter-slide name: vis-pca ## Visualizing PCA --- Recall [Jolliffe's big picture](#big-picture) - Visualize the correlation plot - Visualize the share of inertia per principal component (screeplot) - Visualize the individuals in the factorial planes generated by the leading principal axes - Visualize the (scaled) original variables in the basis defined by the factorial axes (correlation circle) - Visualize simultaneously individuals and variables (biplot) ??? --- ### Inspecting the correlation matrix .panelset[ .panel[.panel-name[Using `corrr::...`] ```r iris %>% dplyr::select(where(is.numeric)) %>% * corrr::correlate() -> tb ``` ``` ## ## Correlation method: 'pearson' ## Missing treated using: 'pairwise.complete.obs' ``` ```r tb %>% * corrr::shave() %>% * corrr::fashion() %>% knitr::kable(format="markdown") ``` |term |Sepal.Length |Sepal.Width |Petal.Length |Petal.Width | |:------------|:------------|:-----------|:------------|:-----------| |Sepal.Length | | | | | |Sepal.Width |-.12 | | | | |Petal.Length |.87 |-.43 | | | |Petal.Width |.82 |-.37 |.96 | | ] .panel[.panel-name[About `corrr::...`] [Package `corrr`](https://cran.r-project.org/web/packages/corrr/vignettes/using-corrr.html) - `correlate` explores pairwise correlations just as `cor` - It outputs a dataframe (a `tibble`) - The `corrr` API is designed with data pipelines in mind - `corrr` offers functions for manipulating correlation matrices ] .panel[.panel-name[Other plots] ```r tb %>% corrr::rearrange(absolute = FALSE) %>% corrr::shave(upper = FALSE) %>% * corrr::rplot() ``` ``` ## Don't know how to automatically pick scale for object of type noquote. Defaulting to continuous.
``` <img src="cm-6-EDA_files/figure-html/unnamed-chunk-18-1.png" width="504" /> ] ] --- ### Correlation plot for `decathlon` .fl.w-60[ <img src="cm-6-EDA_files/figure-html/DecathlonR_Cor-1.png" width="504" /> ] .fl.w-40[ If our goal is to partitition the set of columns into tightly connected groups, this correlation plot already tells us a story: - short running races `100m`, ..., `400m` along `long jump` form a cluster, - `shot.put`, `discus`, and to a lesser extent `high jump` form another cluster - `pole.vault`, `1500m` and `javeline` look pretty isolated ] --- ### PCA Visualization : first goals - Look at the data in PC coordinates (Factorial plane) - Look at the rotation matrix (Factorial axes) - Look at the variance explained by each PC (Screeplot) --- ### Inspect eigenvalues .f6[ ```r iris_pca <- list_pca_iris[[4]] iris_pca %>% broom::tidy(matrix="pcs") %>% knitr::kable(format="markdown", digits=2) ``` | PC| std.dev| percent| cumulative| |--:|-------:|-------:|----------:| | 1| 1.71| 0.73| 0.73| | 2| 0.96| 0.23| 0.96| | 3| 0.38| 0.04| 0.99| | 4| 0.14| 0.01| 1.00| ```r iris_plus_pca <- broom::augment(iris_pca, iris) ``` ] --- ### Scatterplots .panelset[ .panel[.panel-name[Code] ```r share_variance <- broom::tidy(iris_pca, "pcs")[["percent"]] iris_plus_pca %>% ggplot() + aes(x=.fittedPC1, y=.fittedPC2) + aes(colour=Species) + geom_point() + ggtitle("Iris data projected on first two principal axes") + xlab(paste("PC1 (", share_variance[1], "%)", sep="")) + ylab(paste("PC2 (", share_variance[2], "%)", sep="")) -> p iris_plus_pca %>% ggplot() + aes(y=Sepal.Width, x=Petal.Length) + aes(colour=Species) + geom_point() + ggtitle("Iris data projected on Petal length and Sepal width") -> q ``` ] .panel[.panel-name[Plot] ```r p ; q ``` <img src="cm-6-EDA_files/figure-html/unnamed-chunk-21-1.png" width="50%" /><img src="cm-6-EDA_files/figure-html/unnamed-chunk-21-2.png" width="50%" /> ] ] --- ### Biplot : overplotting the original variables in the factorial plane .panelset[ .panel[.panel-name[With labels] .fl.w-third.pa2[ .f6[ ```r total_inertia <- sum(share_variance) p + aes(label=.rownames) + geom_text_repel(verbose = FALSE) + coord_fixed() ``` ] See [`plot_pca`](https://cran.r-project.org/web/packages/ggfortify/vignettes/plot_pca.html) This is a scatterplot on the plane spanned by the two leading principal components ] .fl.w-two-thirds.pa2[ ![](cm-6-EDA_files/figure-html/autopca-iris-plot-1.png) ] ] .panel[.panel-name[Biplots, add `loadings`] .fl.w-third.pa2[ .f6[ ```r iris %>% select(where(is.numeric)) %>% prcomp(.scale=TRUE) %>% * autoplot( data = iris, colour = 'Species', * label = TRUE, * loadings = TRUE, loadings.colour = 'blue', loadings.label = TRUE ) + ggtitle("Iris dataset biplot") ``` ``` ## Warning: In prcomp.default(., .scale = TRUE) : ## extra argument '.scale' will be disregarded ``` ] Scatterplot of the factorial plane is overplotted with _loadings_ corresponding to native variables ] .fl.w-two-thirds.pa2[ ![](cm-6-EDA_files/figure-html/autopca-iris-biplot-1.png) ] ] .panel[.panel-name[With `plotly`] ```r require(plotly) ( autoplot(prcomp(iris[,-5]), data = iris, colour = 'Species', label = TRUE, loadings = TRUE, loadings.label = TRUE) + ggtitle("Iris dataset ") ) %>% ggplotly() ``` ] ] --- ### Official plots for PCA on iris dataset ```r plot(list_pca_iris[[4]]) biplot(list_pca_iris[[4]]) ``` <img src="cm-6-EDA_files/figure-html/unnamed-chunk-23-1.png" width="50%" /><img src="cm-6-EDA_files/figure-html/unnamed-chunk-23-2.png" width="50%" /> --- ### Plotting PCA on decathlon 
data .fl.w-50[ ```r decathlon_pca <- list_pca_decathlon[[4]] inertias <- decathlon_pca$sdev^2 share_variance <- round((100*(inertias)/ sum(inertias)), 2) p %+% broom::augment(decathlon_pca, decathlon) + aes(colour=Points) + ggtitle("Decathlon data") + xlab(paste("PC1 (", share_variance[1], "%)", sep="")) + ylab(paste("PC2 (", share_variance[2], "%)", sep="")) + aes(label=.rownames) + geom_text_repel(verbose = FALSE) + coord_fixed() ``` ] .fl.w-50[ ![](cm-6-EDA_files/figure-html/decathlon_pcs_plots-1.png) ] --- ### Decathlon correlation circle .fl.w-50[ ```r decathlon_pca$rotation %>% as_tibble(rownames="Name") %>% ggplot() + aes(x=PC1, y=PC2, label=Name) + geom_segment(xend=0, yend=0, arrow = grid::arrow(ends = "first")) + geom_text_repel() + coord_fixed() + xlim(-.7, .7) + ylim(-.7, .7) + stat_function(fun = ~ sqrt(.7^2 - .^2), alpha=.5, color="grey") + stat_function(fun = ~ - sqrt(.7^2 - .^2), alpha=.5, color="grey") + geom_segment(x = 0, y = -1, xend = 0, yend = 1, linetype = "dashed", color = "grey") + geom_segment(x = -1, y = 0, xend = 1, yend = 0, linetype = "dashed", color = "grey") + ggtitle("Decathlon correlation circle") ``` ] .fl.w-50[ ![](cm-6-EDA_files/figure-html/decathlon_cor-1.png) ] --- exclude: true ### Using `broom::augment` .panelset[ .panel[.panel-name[Code] ```r pca_fit <- prcomp(iris[,-5]) pca_fit %>% broom::augment(iris) %>% #<< add original dataset back in ggplot(aes(x= .fittedPC1, y=.fittedPC2, color = Species)) + geom_point(size = 1.5) + ggtitle("Iris PCA with broom::augment") -> p ``` ] .panel[.panel-name[Plot] <img src="cm-6-EDA_files/figure-html/pca_broom_augment_plot-1.png" width="504" style="display: block; margin: auto;" /> ] ] --- exclude: true ### Screeplot .panelset[ .panel[.panel-name[Code] ```r zpca <- list_pca_iris[[4]] qplot(x=1:4, y=zpca$sdev^2, geom="col") + xlab("Principal components") + ylab("Variance") + ggtitle("Screeplot of standardized/centered Iris dataset") -> p ``` ] .panel[.panel-name[Plot] <img src="cm-6-EDA_files/figure-html/ns-scree-plot-1.png" width="504" /> ] ] --- exclude: false ### Variations on screeplot .panelset[ .panel[.panel-name[Code] .f6[ ```r zetitle <- "PCA on Iris: Projected variance on the first PCs" p <-list_pca_iris[[4]] %>% broom::tidy() %>% ggplot() + aes(x= as_factor(PC), y= value) + labs(xlab="PC", ylab="Coord") + ggtitle(zetitle) ```
We center and standardize columns, and then perform a change of basis ] ] .panel[.panel-name[Plots] ```r p + geom_boxplot() ; p + geom_violin() ``` <img src="cm-6-EDA_files/figure-html/unnamed-chunk-25-1.png" width="50%" /><img src="cm-6-EDA_files/figure-html/unnamed-chunk-25-2.png" width="50%" /> ] ] --- ### Two packages Several packages handle the field of factorial analysis .fl.w-50.pa2[ #### `ADE4` + 2002-... + Duality diagrams: `dudi.pca` + Ecology oriented + [ADE4 Homepage](http://pbil.univ-lyon1.fr) ] .fl.w-50.pa2[ #### `FactoMineR` + 2003-... + `Factoshiny` + `FactoExtra` + `FactoInvestigate` + A nice book + A MOOC ] --- ### FactoMineR homepage <iframe src="http://factominer.free.fr" height="400" width="800" title="FactoMineR"></iframe> --- ### FactoMineR on PCA <iframe src="http://factominer.free.fr/factomethods/principal-components-analysis.html" height="400" width="800" title="FactoMineR"></iframe> --- ### FactoMineR : scatterplot and correlation circle ```r iris %>% dplyr::select(where(is.numeric)) %>% FactoMineR::PCA() %>% plot(graph.type="ggplot", title="Iris data with FactoMineR") ``` <img src="cm-6-EDA_files/figure-html/facto-plot-1.png" width="50%" /><img src="cm-6-EDA_files/figure-html/facto-plot-2.png" width="50%" /><img src="cm-6-EDA_files/figure-html/facto-plot-3.png" width="50%" /> --- ### Revisiting Decathlon data .panelset[ .panel[.panel-name[Code] Ad hoc transformation of running scores: average speed matters ```r decathlonTime <- mutate(decathlon, `100m` = 100/`100m`, `400m` = 400/`400m`, `110m.hurdle` = 110/`110m.hurdle`, `1500m` = 1500/`1500m`) # rownames(decathlonTime) <- rownames(decathlon) PCAdecathlontime <- PCA(decathlonTime[,1:10], graph = FALSE) ``` ] .panel[.panel-name[Plot] ```r factoextra::fviz_pca_ind(PCAdecathlontime, repel = TRUE) ``` ``` ## Warning: ggrepel: 2 unlabeled data points (too many overlaps). Consider ## increasing max.overlaps ``` ```r factoextra::fviz_pca_var(PCAdecathlontime, repel = TRUE) ``` <img src="cm-6-EDA_files/figure-html/unnamed-chunk-27-1.png" width="50%" /><img src="cm-6-EDA_files/figure-html/unnamed-chunk-27-2.png" width="50%" /> ] ] ??? When inspecting `decathlon` some columns are positively correlated with column `Points`, others are not. The negatively correlated columns contain running races times. In order to handle all scores in an homogenous way, running races scores are converted into average speeds. The larger, the better. The correlation circle becomes more transparent: columns can be clustered into three subsets: - running races and long jump, p - puts, javeline and high jump - 1500m and pole vault which are poorly correlated with first two principal components We can do it with `tidyverse` ```r decathlonTime[, 1:10] %>% prcomp(scale.=TRUE) -> pc_decathlonTime share_variance <- round(broom::tidy(pc_decathlonTime, matrix="pcs")[["percent"]], 2) pc_decathlonTime %>% broom::augment(data=decathlonTime) %>% ggplot() + aes(x=.fittedPC1, y=.fittedPC2) + aes(color=Points, label=.rownames) + geom_point() + geom_text_repel() + xlab(paste("Dim1 (", share_variance[1] ,"%)", sep="")) + ylab(paste("Dim2 (", share_variance[2] ,"%)", sep="")) ``` ``` ## Warning: ggrepel: 2 unlabeled data points (too many overlaps). 
Consider ## increasing max.overlaps ``` ```r radius <- sqrt(broom::tidy(pc_decathlonTime, matrix="pcs")[["cumulative"]][2]) pc_decathlonTime[["rotation"]] %>% as_tibble(rownames="Name") %>% ggplot() + aes(x=PC1, y=PC2, label=Name) + geom_segment(xend=0, yend=0, arrow = grid::arrow(ends = "first") ) + geom_text_repel() + coord_fixed() + xlim(-radius, radius) + ylim(-radius, radius) + stat_function(fun = ~ sqrt(radius^2 - .^2), alpha=.5, color="grey") + stat_function(fun = ~ - sqrt(radius^2 - .^2), alpha=.5, color="grey") + geom_segment(x = 0, y = -1, xend = 0, yend = 1, linetype = "dashed", color = "grey") + geom_segment(x = -1, y = 0, xend = 1, yend = 0, linetype = "dashed", color = "grey") + ggtitle("Decathlon correlation circle (bis)") + xlab(paste("Dim1 (", share_variance[1] ,"%)", sep="")) + ylab(paste("Dim2 (", share_variance[2] ,"%)", sep="")) ``` <img src="cm-6-EDA_files/figure-html/unnamed-chunk-28-1.png" width="50%" /><img src="cm-6-EDA_files/figure-html/unnamed-chunk-28-2.png" width="50%" /> --- template: inter-slide name: pca-dataviz ## PCA visualization in perspective --- ### PCA and the tidy approach The output of `prcomp` (or of any contender) is complex object represented by a list with five named elements: `x`, `sdev`, `rotation`, `center` `scale` In
R, the list is assigned a dedicated `S3` class: `prcomp` The `prcomp` class is not `ggplot2` friendly -- In the `tidyverse`, visualization abides by the _grammar of graphics_: plots are built on dataframes (usually a single dataframe) The different plots that serve to understand the output of PCA (screeplot, correlation circle, biplot, ...) have to be built on dataframes that are themselves derived from the output of PCA and possibly also from the original dataframe -- Package `broom`
offers off-the-shelf components for building PCA graphical pipelines (and many other things) .f6[ > `broom` is an attempt to bridge the gap from untidy outputs of predictions and estimations to the tidy data we want to work with. It centers around three `S3` methods, each of which take common objects produced by
statistical functions (`lm`, `t.test`, `nls`, `prcomp`, etc) and convert them into a `tibble` ] ??? > `broom` is particularly designed to work with Hadley’s `dplyr` package > (see the broom+dplyr vignette for more) --- ###
Three (generic) functions `tidy`: constructs a tibble that summarizes the model’s statistical findings. This includes coefficients and p-values for each term in a regression, per-cluster information in clustering applications, or per-test information for multtest functions `augment`: add columns to the original data that was modeled. This includes predictions, residuals, and cluster assignments `glance`: construct a concise one-row summary of the model. There is no such method for class `prcomp` ??? --- ###
`tidy` for class `prcomp` .fl.w-50[ ```r pc <- prcomp(USArrests, scale = TRUE) # information about samples (U x D matrix) # Default behaviour broom::tidy(pc, matrix="samples") %>% head(5) %>% knitr::kable() ``` <table> <thead> <tr> <th style="text-align:left;"> row </th> <th style="text-align:right;"> PC </th> <th style="text-align:right;"> value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Alabama </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> -0.9756604 </td> </tr> <tr> <td style="text-align:left;"> Alabama </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 1.1220012 </td> </tr> <tr> <td style="text-align:left;"> Alabama </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> -0.4398037 </td> </tr> <tr> <td style="text-align:left;"> Alabama </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 0.1546966 </td> </tr> <tr> <td style="text-align:left;"> Alaska </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> -1.9305379 </td> </tr> </tbody> </table> ] .fl.w-50[ The output of `tidy.prcomp` is always a `tibble` Depending on argument `matrix`, the output gathers information on different components of the SVD factorization `tidy(pc, "samples")` provide a long format tibble version of `\(U \times D\)` The tibble has three columns - `row`: rownames in the dataframe that served as input to `prcomp` - `PC`: index of principal components - `value`: score of individual indexed by `row` on principal components `PC`, `\((U \times D)[\text{row}, \text{PC}]\)` The `x` component of `pc` is a _wide format_ version of the output of `tidy(pc, "samples")` ] ??? [Documentation](https://www.rdocumentation.org/packages/broom/versions/0.7.4/topics/tidy.prcomp) ```r as_tibble(pc$x) %>% add_column(row=rownames(pc$x)) %>% pivot_longer(starts_with("PC"), names_to="PC", names_prefix="PC") %>% head(5) ``` ``` ## # A tibble: 5 × 3 ## row PC value ## <chr> <chr> <dbl> ## 1 Alabama 1 -0.976 ## 2 Alabama 2 1.12 ## 3 Alabama 3 -0.440 ## 4 Alabama 4 0.155 ## 5 Alaska 1 -1.93 ``` ---
`tidy` for class `prcomp` .fl.w-50[ ```r pc <- prcomp(USArrests, scale = TRUE) # information about rotation (V matrix) broom::tidy(pc, "rotation") %>% head(5) ``` ``` ## # A tibble: 5 × 3 ## column PC value ## <chr> <dbl> <dbl> ## 1 Murder 1 -0.536 ## 2 Murder 2 0.418 ## 3 Murder 3 -0.341 ## 4 Murder 4 0.649 ## 5 Assault 1 -0.583 ``` ] .fl.w-50[ With `matrix="rotation"`, the result is again a long format tibble version of the orthogonal matrix `\(V\)` from the SVD factorization This tibble is convenient when plotting the correlation circles associated with different sets of components ] ??? ```r as_tibble(pc$rotation) %>% add_column(column=rownames(pc$rotation)) %>% pivot_longer(starts_with("PC"), names_to="PC", names_prefix="PC") %>% head() ``` ``` ## # A tibble: 6 × 3 ## column PC value ## <chr> <chr> <dbl> ## 1 Murder 1 -0.536 ## 2 Murder 2 0.418 ## 3 Murder 3 -0.341 ## 4 Murder 4 0.649 ## 5 Assault 1 -0.583 ## 6 Assault 2 0.188 ``` --- ###
`tidy` for class `prcomp` .fl.w-50[ ```r # information about singular values broom::tidy(pc, "pcs") %>% knitr::kable() ``` <table> <thead> <tr> <th style="text-align:right;"> PC </th> <th style="text-align:right;"> std.dev </th> <th style="text-align:right;"> percent </th> <th style="text-align:right;"> cumulative </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1.5748783 </td> <td style="text-align:right;"> 0.62006 </td> <td style="text-align:right;"> 0.62006 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 0.9948694 </td> <td style="text-align:right;"> 0.24744 </td> <td style="text-align:right;"> 0.86750 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 0.5971291 </td> <td style="text-align:right;"> 0.08914 </td> <td style="text-align:right;"> 0.95664 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 0.4164494 </td> <td style="text-align:right;"> 0.04336 </td> <td style="text-align:right;"> 1.00000 </td> </tr> </tbody> </table> ```r broom::tidy(pc, "pcs") %>% ggplot() + aes(x=as.integer(PC), y=percent) + geom_col() + xlab("PC") +ylab("Percentage of inertia") + ggtitle("USArrests: screeplot") ``` ] .fl.w-50[ With `matrix="pcs"` or `matrix="eigen"`, we obtain a tibble that is convenient for generating a `screeplot` It contains the information returned by `summary(pc)` Indeed, column `percent` is obtained by squaring column `std.dev` and normalizing the column ![](cm-6-EDA_files/figure-html/tidyscreeplot-1.png) ] ??? See `sloop::s3_get_method(tidy.prcomp)` or ```r as_tibble(pc$sdev) %>% rename(std.dev=value) %>% rownames_to_column("PC") %>% mutate(PC=as.integer(PC), percent=(std.dev^2/sum(std.dev^2)), cumulative=cumsum(percent)) # or t(summary(pc)$importance) %>% data.frame() %>% rownames_to_column() ``` --- ### Package `broom`
.center[ <iframe src="https://cran.r-project.org/web/packages/broom/vignettes/broom.html" width="100%" height="400px" data-external="1"></iframe> ] [Vignette](https://cran.r-project.org/web/packages/broom/vignettes/broom.html) ??? `broom`'s ambitions go far beyond tidying the output of `prcomp(...)` The vignette describes the goals of the package and the way they fit into the `tidyverse` --- ###
`broom::tidy()`: a generic `S3` function Dispatching to ad hoc methods .fl.w-50[ `broom::tidy()` is an `S3` generic function. When called, it invokes a dispatcher that calls a method tailored to the class of the object passed as argument ``` > broom::tidy function (x, ...) { UseMethod("tidy") } ``` > When a function calling `UseMethod("fun")` is applied to an object with class attribute `c("first", "second")`, the system searches for a function called `fun.first` and, if it finds it, applies it to the object... ] .fl.w-50[ The hard work is performed by the `tidy.prcomp()` method registered for class `prcomp` ```r class(pc) ``` ``` ## [1] "prcomp" ``` ```r # sloop::s3_get_method(tidy.prcomp) ``` See [S3 section in Advanced R Programming](https://adv-r.hadley.nz/s3.html) ] --- exclude: true ###
`broom::augment()` .fl.w-60[ An excerpt of the body of `augment.prcomp` .f6[ ``` function (x, data = NULL, newdata, ...) { ret <- if (!missing(newdata)) { # ... } else { pred <- as.data.frame(predict(x)) names(pred) <- paste0(".fitted", names(pred)) if (!missing(data) && !is.null(data)) { cbind(.rownames = rownames(as.data.frame(data)), data, pred) } else { # ... } as_tibble(ret) } ``` ] ] .fl.w-40[ Function `augment` is also an `S3` generic function. The method relies on another generic function `predict` For `prcomp` `predict` returns the `x` component of the list The output combines the columns of the original dataframe and (`data`) with their names and the predictions ] --- ###
Mixing PCA and geographical information .fl.w-50[ ```r require("maps") pc %>% broom::tidy(matrix = "samples") %>% mutate(region = tolower(row)) %>% inner_join(map_data("state"), by = "region") %>% ggplot() + aes(long, lat) + aes(group = group, fill = value) + geom_polygon() + facet_wrap(~PC) + scale_fill_viridis_c() + ggtitle("Principal components of arrest data") ``` ] .fl.w-50[ ![](cm-6-EDA_files/figure-html/pc_arrests-1.png) ] --- ###
`augment` for `prcomp` .fl.w-50[ ```r au <- augment(pc, data = USArrests) au %>% head(3) ``` ``` ## # A tibble: 3 × 9 ## .rownames Murder Assault UrbanPop Rape .fittedPC1 .fittedPC2 .fittedPC3 ## <chr> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> ## 1 Alabama 13.2 236 58 21.2 -0.976 1.12 -0.440 ## 2 Alaska 10 263 48 44.5 -1.93 1.06 2.02 ## 3 Arizona 8.1 294 80 31 -1.75 -0.738 0.0542 ## # … with 1 more variable: .fittedPC4 <dbl> ``` ```r ggplot(au) + aes(.fittedPC1, .fittedPC2) + geom_point() + geom_text(aes(label = .rownames), vjust = 1, hjust = 1) ``` ] .fl.w-50[ ] ??? --- exclude: true ## References @HorJoh90 and @Bha97 are classics on Matrix Analysis. Their content goes far beyond Singular Value Decomposition. @elden2019matrix introduces matrix analysis from a data mining/machine learning perspective. This book shows that many factorizations are relevant to machine learning. Few of these factorizations are as well understood as SVD. @MR1417720 is a classic on matrix computations. --- class: middle, center, inverse background-image: url('./img/pexels-cottonbro-3171837.jpg') background-size: cover # The End