name: inter-slide
class: left, middle, inverse

{{ content }}

---
name: layout-general
layout: true
class: left, middle

<style>
.remark-slide-number {
  position: inherit;
}

.remark-slide-number .progress-bar-container {
  position: absolute;
  bottom: 0;
  height: 4px;
  display: block;
  left: 0;
  right: 0;
}

.remark-slide-number .progress-bar {
  height: 100%;
  background-color: red;
}
</style>
---
class: middle, left, inverse

# Exploratory Data Analysis : Canonical Correlation Analysis

### 2021-12-10

#### [Master I MIDS & MFA]()

#### [Analyse Exploratoire de Données](http://stephane-v-boucheron.fr/courses/eda/)

#### [Stéphane Boucheron](http://stephane-v-boucheron.fr)

---
exclude: true
class: middle, left, inverse

# Exploratory Data Analysis XII Canonical Correlation Analysis (CCA)

### 2021-12-10

#### [EDA Master I MIDS et MFA](http://stephane-v-boucheron.fr/courses/eda)

#### [Stéphane Boucheron](http://stephane-v-boucheron.fr)

---
class: inter-slide
exclude: true

##
### [Motivation](#bigpic)

### [CCA](#cca)

---

[Canonical Correlation Analysis](https://en.wikipedia.org/wiki/Canonical_correlation) goes back to Hotelling (1936)

Consider a setting where we have two views/perspectives on the same data

For example, suppose we record meteorological data from a range of locations

Each location defines a sample point

For each location, we have temperature data on one side, and wind speed, wind direction, atmospheric pressure on the other side

How can we describe relationships between the two perspectives?

This is the question tackled by CCA

---

### Definition (CCA)

Let `\(Z\)` be a real matrix with `\(n\)` rows and `\(J_1 + J_2\)` columns:

`$$Z = \left[ \quad {\underbrace{\Huge Z_1 }_{J_1 \text{ col. }}}\quad {\Large\vdots}\quad { \underbrace{\Huge Z_2 }_{J_2 \text{ col. }} } \quad\right]$$`

Canonical Correlation Analysis (CCA) consists of finding vectors `\(a \in \mathbb{R}^{J_1}\)` and `\(b \in \mathbb{R}^{J_2}\)` that maximize the correlation between `\(Z_1 a\)` and `\(Z_2 b\)`

---

Let `\(S_{1,1}, S_{2,2}, S_{1,2}\)` denote the covariance matrices defined by `\(Z_1, Z_2\)`

`$$S_{1,1} = \frac{1}{n} \left( Z_1^T \times Z_1 - \frac{1}{n} Z_1^T \times 1\times 1^T \times Z_1\right)$$`

`$$S_{1,2} = \frac{1}{n} \left( Z_1^T \times Z_2 - \frac{1}{n} Z_1^T \times 1\times 1^T \times Z_2\right)$$`

`$$S_{2,2} = \frac{1}{n} \left( Z_2^T \times Z_2 - \frac{1}{n} Z_2^T \times 1\times 1^T \times Z_2\right)$$`

We look for `\(a\)` and `\(b\)` that maximize

`$$\frac{a^T S_{1,2} b}{\big((a^TS_{1,1}a)(b^TS_{2,2}b)\big)^{1/2}}$$`

A numerical sketch in R follows the proof below

---

### Proposition

The vectors `\(a \in \mathbb{R}^{J_1}\)` and `\(b \in \mathbb{R}^{J_2}\)` that maximize `\(\frac{a^T S_{1,2} b}{\big((a^TS_{1,1}a)(b^TS_{2,2}b)\big)^{1/2}}\)` are the first left and right extended singular vectors of

`$$S_{1,2}$$`

with respect to matrices `\(S_{1,1}\)` and `\(S_{2,2}\)`

---

The extended singular value decomposition of `\(S_{1,2}\)` with respect to matrices `\(S_{1,1}\)` and `\(S_{2,2}\)` is a triple `\(U \in \mathcal{M}_{J_1, k}\)`, `\(D \in \mathcal{M}_{k,k}\)`, `\(V \in \mathcal{M}_{J_2, k}\)` such that

`$$S_{1,2} = U \times D \times V^T$$`

- `\(D\)` is non-negative, diagonal, with non-increasing diagonal entries
- `\(U^T \times S_{1,1} \times U = \text{Id}_{k}\)`
- `\(V^T \times S_{2,2} \times V = \text{Id}_{k}\)`

---

### Proof

We first assume `\(S_{1,1}\)` and `\(S_{2,2}\)` to be Positive Definite.
- `\(S_{1,1}\)` and `\(S_{2,2}\)` have invertible square roots `\(S_{1,1}^{1/2}\)` and `\(S_{2,2}^{1/2}\)`, with inverses denoted by `\(S_{1,1}^{-1/2}\)` and `\(S_{2,2}^{-1/2}\)`
- For `\(a \in \mathbb{R}^{J_1}\)` and `\(b \in \mathbb{R}^{J_2}\)`, let `\(u, v\)` be defined as `\(u = S_{1,1}^{1/2}a\)` and `\(v = S_{2,2}^{1/2}b\)`
- `$$\frac{a^T S_{1,2} b}{\sqrt{a^TS_{1,1}a}\sqrt{b^T S_{2,2}b}}= \frac{u^T S_{1,1}^{-1/2} S_{1,2} S_{2,2}^{-1/2}v}{\|u\|\|v\|}$$`

---

### Proof (continued)

The unit vectors `\(u,v\)` that maximize the right-hand side are the leading left and right singular vectors of

`$$S_{1,1}^{-1/2} \times S_{1,2} \times S_{2,2}^{-1/2}$$`

The maximizers of the correlation are then recovered as `\(a = S_{1,1}^{-1/2} u\)` and `\(b = S_{2,2}^{-1/2} v\)`

---

### Proof (continued)

Let us handle the case where either `\(S_{1,1}\)` or `\(S_{2,2}\)`, or both, are not Positive Definite

- `\(S_{1,1}\)` and `\(S_{2,2}\)` still have square roots, and the square roots have symmetric Positive Semi-Definite pseudo-inverses (Moore-Penrose pseudo-inverses derived from the spectral decomposition) denoted by `\(S_{1,1}^{-1/2}\)` and `\(S_{2,2}^{-1/2}\)` that satisfy:

`$$S_{1,1}^{1/2} \times S_{1,1}^{-1/2} \times S_{1,1}^{1/2} = S^{1/2}_{1,1} \qquad S_{1,1}^{-1/2} \times S_{1,1}^{1/2} \times S_{1,1}^{-1/2} = S_{1,1}^{-1/2}$$`

- The unit vectors `\(u,v\)` that maximize `\(\frac{u^T S_{1,1}^{-1/2} S_{1,2} S_{2,2}^{-1/2}v}{\|u\|\|v\|}\)` are again the leading left and right singular vectors of

`$$S_{1,1}^{-1/2} \times S_{1,2} \times S_{2,2}^{-1/2}$$`
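
---

### A numerical sketch in R

A minimal check of the construction above, on synthetic data; the data and the helper `inv_sqrt()` are ours, not part of the lecture material

```r
set.seed(42)
n <- 500
X1 <- matrix(rnorm(n * 3), n, 3)             # first view
X2 <- cbind(X1[, 1] + rnorm(n), rnorm(n))    # second view, correlated with the first

S11 <- cov(X1) * (n - 1) / n                 # 1/n normalization, as in the slides
S22 <- cov(X2) * (n - 1) / n
S12 <- cov(X1, X2) * (n - 1) / n

inv_sqrt <- function(S) {                    # S^{-1/2} via spectral decomposition
  e <- eigen(S, symmetric = TRUE)
  e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)
}

sv <- svd(inv_sqrt(S11) %*% S12 %*% inv_sqrt(S22))
a <- inv_sqrt(S11) %*% sv$u[, 1]             # first canonical direction, view 1
b <- inv_sqrt(S22) %*% sv$v[, 1]             # first canonical direction, view 2

sv$d[1]                      # leading singular value
cor(X1 %*% a, X2 %*% b)      # equals the canonical correlation
cancor(X1, X2)$cor[1]        # base R cross-check
```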
---

The leading left and right singular vectors of `\(S_{1,2}\)` with respect to the metrics defined by `\(S_{1,1}\)` and `\(S_{2,2}\)` define the first stage of _canonical correlation analysis_

The full canonical correlation analysis of `\(Z = [ Z_1 \; \vdots\; Z_2]\)` is made of the whole sequence of extended left and right singular vectors corresponding to positive singular values

The `\(j^{\text{th}}\)` step delivers the `\(j^{\text{th}}\)` pair of canonical variables `\(Z_1 a_j\)` and `\(Z_2 b_j\)`: their correlation is the `\(j^{\text{th}}\)` extended singular value, and they are uncorrelated with the canonical variables from previous steps

---

Canonical Correlation Analysis builds on Singular Value Decomposition just as

- Principal Component Analysis,
- Correspondence Analysis,
- Multiple Linear Regression (at least implicitly)
- ...

???

We point out the tight connection between Canonical Correlation Analysis and methods we have already encountered
We can recover Correspondence Analysis from the result of a Canonical Correlation Analysis

---

We shall work on a qualitative data frame: `credit`

```r
readr::read_csv2("./DATA/credit.csv") %>%
  dplyr::mutate_all(factor) -> credit
```
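
Before collapsing anything, a quick look at the factor levels helps to spot rare ones (a check we add here; `fct_count()` is from `forcats`)

```r
dplyr::glimpse(credit)                            # column types at a glance
forcats::fct_count(credit$Marche, sort = TRUE)    # rare levels of Marche
forcats::fct_count(credit$Logement, sort = TRUE)  # rare levels of Logement
```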
To smooth CCA, MCA, and CA, rare levels of some factors need to be collapsed

We use functions from `forcats` to tidy the data

We focus on CCA and CA for variables `Marche` and `Logement`

---

### Handling rare levels

```r
*credit[['Marche']] <- fct_collapse(
  credit[['Marche']],
  "Mobilier / Ameublement" = "Mobilier / Ameublement",
  "Renovation" = "Renovation",
* "Moto" = c("Moto", "Scooter", "Side-car"),
  "Voiture" = "Voiture")

*credit$Enfants <- fct_lump(credit$Enfants, n = 3, other_level = "Enf>2")

*credit$Logement <- fct_collapse(
  credit$Logement,
  "Accedant a la propriete" = "Accedant a la propriete",
  "Locataire" = "Locataire",
* "Loge par ..." = c("Loge par l'employeur", "Loge par la famille"),
  "Proprietaire" = "Proprietaire")
```

---

### Description of the `credit` dataset

Columns `Marche` and `Logement`

### Bivariate indicator matrix

- A 2-way contingency table `\(T\)` with `\(J_1\)` rows and `\(J_2\)` columns. `\(T[a,b]\)` denotes the number of co-occurrences of modalities `\(a \in \{1, \ldots, J_1\}\)` and `\(b \in \{1, \ldots, J_2\}\)`.
- The 2-way contingency table is _usually_ collected from a data frame `DT` with two qualitative columns and `\(n\)` rows.
- We can also proceed by _pivoting_ the bivariate table, making it a dataframe `\(Z\)` with `\(n\)` rows and `\(J_1 + J_2\)` columns. For `\(j_1 \leq J_1\)`, `\(Z[i, j_1] = 1\)` if the modality of the first variable for observation/row `\(i\)` is `\(j_1\)`, and `\(0\)` otherwise; columns `\(J_1 + 1, \ldots, J_1 + J_2\)` encode the second variable in the same way.
- Table `\(Z\)` is called the _complete disjunctive table_ derived from `DT`

`$$Z = \bigg[ \underbrace{Z_1 }_{J_1 \text{ col. }} {\Large\vdots} \underbrace{Z_2 }_{J_2 \text{ col. }} \bigg]$$`

`$$T = Z_1^T \times Z_2$$`

---

### Building disjunctive table

Packages dedicated to Correspondence Analysis export functions that return disjunctive tables, e.g. `tab.disjonctif()` in `FactoMineR`

The construction of disjunctive tables can (also) be performed using verbs from `dplyr` and `tidyr`

```r
dplyr::select(credit, Marche, Logement) %>%
  tibble::rowid_to_column("id") %>%
* tidyr::pivot_wider(id_cols = -Marche,
*                    names_from = Marche,
*                    values_from = Marche) %>%
  tidyr::pivot_wider(id_cols = -Logement,
                     names_from = Logement,
                     values_from = Logement) %>%
  dplyr::select(-id) %>%
  dplyr::mutate_all(~ !is.na(.)) %>%
  dplyr::mutate_all(as.integer) -> Z
```

---
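### Checking `\(T = Z_1^T \times Z_2\)`

A quick sanity check (our addition): the crossproduct of the two indicator blocks of `Z` recovers the contingency table. This assumes the first `\(J_1\)` columns of `Z` encode `Marche`, which is how the pivots above lay them out

```r
J1 <- nlevels(credit$Marche)
Z1 <- as.matrix(Z[, 1:J1])        # indicator block of Marche
Z2 <- as.matrix(Z[, -(1:J1)])     # indicator block of Logement

T_hat <- crossprod(Z1, Z2)        # Z1^T x Z2
T_tab <- table(credit$Marche, credit$Logement)

# align rows/columns by name before comparing
all(T_hat[rownames(T_tab), colnames(T_tab)] == T_tab)
```

---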
As the disjunctive table contains as much information as the contingency table, Correspondence Analysis can be performed on the disjunctive table (indicator matrix)

`\(P = \frac{1}{n} Z_1^T \times Z_2\)`

`\(S_{1,1} = \frac{1}{n} Z_1^T \times Z_1 - \frac{1}{n^2} Z_1^T\times 1 \times 1^T \times Z_1\)`

`\(S_{1,2} = \frac{1}{n} Z_1^T \times Z_2 - \frac{1}{n^2} Z_1^T\times 1 \times 1^T \times Z_2\)`

`\(D_r = \frac{1}{n} Z_1^T \times Z_1\)`

`\(D_c = \frac{1}{n} Z_2^T \times Z_2\)`

---

### CA

Extended SVD (with respect to `\(D_r\)` and `\(D_c\)`) of

`\(D_r^{-1} \times P \times D_c^{-1} - 1 \times 1^T\)`
### CCA

SVD of

`\(S_{1,1}^{-1/2} \times S_{1,2} \times S_{2,2}^{-1/2}\)`
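
---

### CA vs CCA on the indicator blocks: a numerical check

The canonical correlations of the two indicator blocks coincide with the singular values of Correspondence Analysis. A sketch (our addition, reusing `Z1` and `Z2` from the previous check):

```r
n <- nrow(Z)
P <- crossprod(Z1, Z2) / n        # joint relative frequencies
rmass <- colSums(Z1) / n          # diagonal of D_r
cmass <- colSums(Z2) / n          # diagonal of D_c

# CA: singular values of D_r^{-1/2} (P - r c^T) D_c^{-1/2}
R <- diag(1 / sqrt(rmass)) %*% (P - rmass %*% t(cmass)) %*% diag(1 / sqrt(cmass))
ca_sv <- svd(R)$d

# CCA on the indicator blocks; one column per block is dropped because
# rows of each block sum to 1 and cancor() needs full-rank inputs
cca_cor <- cancor(Z1[, -1], Z2[, -1])$cor

round(cbind(CA = ca_sv[seq_along(cca_cor)], CCA = cca_cor), 4)
```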
---
template: inter-slide
name: mlr

## Multiple Linear Regression as Canonical Correlation Analysis

---
We can recover Multiple Linear Regression from the result of a Canonical Correlation Analysis

--

- In Multiple Linear Regression, we are given a response vector `\(Y \in \mathbb{R}^n\)` and a design matrix `\(Z \in \mathcal{M}_{n,p}\)`
- We are looking for `\(\beta \in \mathbb{R}^p\)` that minimizes `\(\Vert Y - Z \beta\Vert^2\)`
- The optimum is achieved at `\(\color{red}{\widehat{\beta} = (Z^T\times Z)^{-1}\times Z^T \times Y}\)`<sup>*</sup>

--

- For CCA, the optimum correlation is the cosine of the angle between `\(Y\)` and its projection `\(\widehat{Y}\)` on the linear space spanned by the columns of `\(Z\)`,

`$$\widehat{Y} = Z \widehat{\beta}$$`

- We may choose `\(\color{red}{a=1}\)` and `\(\color{red}{b=\widehat{\beta}}\)` (or any vectors in these two directions)

- A short numerical check follows on the next slide

[*] In case `\(Z^T \times Z\)` is not invertible, `\((Z^T\times Z)^{-1}\)` denotes the Moore-Penrose pseudo-inverse
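
---

### Checking the MLR connection in R

A minimal sketch (our addition, on synthetic data): with the response as one view and the design as the other, the leading canonical correlation equals the cosine of the angle between the centered response and its least-squares fit

```r
set.seed(1)
n <- 200
Zd <- matrix(rnorm(n * 3), n, 3)          # design, 3 covariates
Y  <- Zd %*% c(1, -2, 0.5) + rnorm(n)     # response

Zc <- scale(Zd, scale = FALSE)            # center columns, as cancor() does
Yc <- scale(Y,  scale = FALSE)

beta_hat <- solve(crossprod(Zc), crossprod(Zc, Yc))  # (Z^T Z)^{-1} Z^T Y
Y_hat    <- Zc %*% beta_hat                          # projection of Y

cor(Yc, Y_hat)          # cosine of the angle
cancor(Yc, Zc)$cor      # leading canonical correlation: same value
```

---
class: middle, center, inverse
background-image: url('./img/pexels-cottonbro-3171837.jpg')
background-size: cover

# The End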