EDA XI Multiple Correspondence Analysis

---
name: layout-general
layout: true
class: left, middle

<style>
.remark-slide-number .progress-bar-container {
  position: absolute;
  bottom: 0;
  height: 4px;
  display: block;
  left: 0;
  right: 0;
}

.remark-slide-number .progress-bar {
  height: 100%;
  background-color: red;
}
</style>

<div>
<style type="text/css">.xaringan-extra-logo {
width: 110px;
height: 128px;
z-index: 0;
background-image: url(./img/Universite_Paris_logo_horizontal.jpg);
background-size: contain;
background-repeat: no-repeat;
position: absolute;
top:1em;right:1em;
}
</style>
<script>(function () {
  let tries = 0
  function addLogo () {
    if (typeof slideshow === 'undefined') {
      tries += 1
      if (tries < 10) {
        setTimeout(addLogo, 100)
      }
    } else {
      document.querySelectorAll('.remark-slide-content:not(.hide_logo)')
        .forEach(function (slide) {
          const logo = document.createElement('a')
          logo.classList = 'xaringan-extra-logo'
          logo.href = 'http://master.math.univ-paris-diderot.fr/annee/m1-mi/'
          slide.appendChild(logo)
        })
    }
  }
  document.addEventListener('DOMContentLoaded', addLogo)
})()</script>
</div>

---

# Exploratory Data Analysis: Multiple Correspondence Analysis

### 2021-12-10

#### [Master I MIDS & MFA]()

#### [Analyse Exploratoire de Données](http://stephane-v-boucheron.fr/courses/eda/)

#### [Stéphane Boucheron](http://stephane-v-boucheron.fr)

---
exclude: true
class: middle, left, inverse

# Exploratory Data Analysis XI: Multiple Correspondence Analysis

### 2021-12-10

#### [EDA Master I MIDS et MFA](http://stephane-v-boucheron.fr/courses/eda)

#### [Stéphane Boucheron](http://stephane-v-boucheron.fr)

---
class: middle, inverse

## <svg aria-hidden="true" role="img" viewBox="0 0 576 512" style="height:1em;width:1.12em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:white;overflow:visible;position:relative;"><path d="M0 117.66v346.32c0 11.32 11.43 19.06 21.94 14.86L160 416V32L20.12 87.95A32.006 32.006 0 0 0 0 117.66zM192 416l192 64V96L192 32v384zM554.06 33.16L416 96v384l139.88-55.95A31.996 31.996 0 0 0 576 394.34V48.02c0-11.32-11.43-19.06-21.94-14.86z"/></svg>

### [Motivation](#bigpic)

### [Variants on mosaicplot](#variant-mosaic)

### [Indicator matrix](#indic-matrix)

### [MCA as CA on indicator matrix](#mca-in-words)

### [Illustrations](#)

???

### [CCA](#cca)

---
name: bigpic
template: inter-slide

## Motivation: analyzing more than 2 categorical variables

---

### Different perspectives

When handling two random variables, questions revolve around one topic: are they independent?

From a numerical viewpoint, the chi-square divergence quantifies possible departure from independence

A mosaicplot helps spot associations between categories. Coloring tiles according to
Pearson residuals makes the picture even easier to read

Correspondence Analysis (CA) and the associated plots (screeplot, biplot, correlation circle plot) provide
a geometric toolkit that complements the chi-square divergence and the different flavors of mosaicplots:
the total inertia of CA is the chi-square divergence computed from the contingency table
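
A minimal base R sketch of the last claim, under stated assumptions: the two-way table (`Sex` by `Survived`, collapsed from the base R `Titanic` table) is chosen for illustration only, and "chi-square divergence" is read here as the Pearson chi-square statistic divided by the sample size. Variable names are hypothetical

```r
# Two-way contingency table: Sex x Survived, collapsed from base R `Titanic`
tab <- unclass(margin.table(Titanic, c(2, 4)))
n <- sum(tab)

P <- tab / n                 # correspondence matrix (relative frequencies)
row_mass <- rowSums(P)       # row margins
col_mass <- colSums(P)       # column margins

# Standardized residuals: (p_ij - r_i c_j) / sqrt(r_i c_j)
S <- (P - row_mass %o% col_mass) / sqrt(row_mass %o% col_mass)

# Total inertia of CA: sum of squared singular values of S
inertia <- sum(svd(S)$d^2)

# Pearson chi-square statistic (no continuity correction), divided by n
chi2_over_n <- unname(chisq.test(tab, correct = FALSE)$statistic) / n

all.equal(inertia, chi2_over_n)
```

The same comparison should hold for any two-way table with non-empty margins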

---

### Different perspectives (continued)

When handling more than two variables, several settings are possible

- Some variables may be called _explanatory_, while one variable may be considered a _response_ variable. Is the response variable
_dependent_ on the explanatory variables? If so, we may wonder whether, conditionally on some explanatory variables, the response variable is independent of the other explanatory variables

- All variables share a similar status (explanatory or response): we may explore the relations between the variables and between categories, and try to reduce dimension in some way

In this session, we will address the different settings

---

### MCA (from the documentation)

> The aim of multiple correspondence analysis (MCA) is to summarise and visualise a data table where individuals are described by _qualitative_ variables with similar status

> MCA is used to study the similarities between individuals from the point of view of all the variables and identify individuals' profiles

> MCA is also used to assess relationships between variables and study the associations between categories

> As with PCA and CA, the individuals or groups of individuals (rows) can be connected with categories of  variables (columns)

.fr.f6[From _R for statistics_, Cornillon et al. Chapman & Hall. Pub.]

???

> `mca` is a Multiple Correspondence Analysis (MCA) package for <svg aria-hidden="true" role="img" viewBox="0 0 448 512" style="height:1em;width:0.88em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M439.8 200.5c-7.7-30.9-22.3-54.2-53.4-54.2h-40.1v47.4c0 36.8-31.2 67.8-66.8 67.8H172.7c-29.2 0-53.4 25-53.4 54.3v101.8c0 29 25.2 46 53.4 54.3 33.8 9.9 66.3 11.7 106.8 0 26.9-7.8 53.4-23.5 53.4-54.3v-40.7H226.2v-13.6h160.2c31.1 0 42.6-21.7 53.4-54.2 11.2-33.5 10.7-65.7 0-108.6zM286.2 404c11.1 0 20.1 9.1 20.1 20.3 0 11.3-9 20.4-20.1 20.4-11 0-20.1-9.2-20.1-20.4.1-11.3 9.1-20.3 20.1-20.3zM167.8 248.1h106.8c29.7 0 53.4-24.5 53.4-54.3V91.9c0-29-24.4-50.7-53.4-55.6-35.8-5.9-74.7-5.6-106.8.1-45.2 8-53.4 24.7-53.4 55.6v40.7h106.9v13.6h-147c-31.1 0-58.3 18.7-66.8 54.2-9.8 40.7-10.2 66.1 0 108.6 7.6 31.6 25.7 54.2 56.8 54.2H101v-48.8c0-35.3 30.5-66.4 66.8-66.4zm-6.7-142.6c-11.1 0-20.1-9.1-20.1-20.3.1-11.3 9-20.4 20.1-20.4 11 0 20.1 9.2 20.1 20.4s-9 20.3-20.1 20.3z"/></svg>, intended to be used with `pandas`. MCA is a feature extraction method; essentially PCA for categorical variables.

.fr.f6[[mca homepage](https://github.com/esafak/mca)]

Take-home message: MCA is a matrix factorization based method for exploring samples of categorical variables

Questions:

- Transforming samples of categorical variables into matrices
- SVD Factorization
- Relating the factors to the original data

---

### Useful packages

```r
require(tidyverse)
*require(FactoMineR)
*require(factoextra)
*require(FactoInvestigate)
```

---
exclude: true

### Questionnaires

[Questionnaire on Wikipedia](https://en.wikipedia.org/wiki/Questionnaire)

???

Questionnaires:

- Definition
- Usage
- Example
- Interpretation

---
exclude: true

### Example: Questionnaire based on fictitious case reports

[Case report on Wikipedia](https://en.wikipedia.org/wiki/Case_report)

???

- Definition of  case report
- Usage of fictitious case report
-

---

### Handling more than 2 qualitative variables

Example <svg aria-hidden="true" role="img" viewBox="0 0 640 512" style="height:1em;width:1.25em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M496.616 372.639l70.012-70.012c16.899-16.9 9.942-45.771-12.836-53.092L512 236.102V96c0-17.673-14.327-32-32-32h-64V24c0-13.255-10.745-24-24-24H248c-13.255 0-24 10.745-24 24v40h-64c-17.673 0-32 14.327-32 32v140.102l-41.792 13.433c-22.753 7.313-29.754 36.173-12.836 53.092l70.012 70.012C125.828 416.287 85.587 448 24 448c-13.255 0-24 10.745-24 24v16c0 13.255 10.745 24 24 24 61.023 0 107.499-20.61 143.258-59.396C181.677 487.432 216.021 512 256 512h128c39.979 0 74.323-24.568 88.742-59.396C508.495 491.384 554.968 512 616 512c13.255 0 24-10.745 24-24v-16c0-13.255-10.745-24-24-24-60.817 0-101.542-31.001-119.384-75.361zM192 128h256v87.531l-118.208-37.995a31.995 31.995 0 0 0-19.584 0L192 215.531V128z"/></svg> Titanic data set  (called `tit` in the sequel)

`$$\begin{array}{ll}\text{Demographic/Explanatory} & \leftrightarrow \begin{cases}\text{Embarked} \\
\text{Sex} \\ \text{Passenger class} \\
\text{Age (Child/Adult)} \end{cases} \\ \phantom{\text{Demographic/Explanatory} } & \phantom{\text{Demographic/Explanatory}} \\ \text{Attitudinal/Response} &  \leftrightarrow \text{Survived}\end{array}$$`

???

When handling a collection of qualitative variables, we may face several kinds of situations: we may investigate

- response/attitudinal variables with respect to  explanatory/demographic variables
- collections of  response/attitudinal variables
- collections of explanatory/demographic variables

MCA is geared towards investigating collections of variables of similar status

```
## Warning: The following named parsers don't match the column names: Survived
```

```
## # A tibble: 6 × 4
##   Sex    Age   Embarked Pclass
##   <fct>  <fct> <fct>    <ord> 
## 1 male   Adult S        3     
## 2 female Adult C        1     
## 3 female Adult S        3     
## 4 female Adult S        1     
## 5 male   Adult S        3     
## 6 male   NA.A  Q        3
```

---

### <svg aria-hidden="true" role="img" viewBox="0 0 640 512" style="height:1em;width:1.25em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M496.616 372.639l70.012-70.012c16.899-16.9 9.942-45.771-12.836-53.092L512 236.102V96c0-17.673-14.327-32-32-32h-64V24c0-13.255-10.745-24-24-24H248c-13.255 0-24 10.745-24 24v40h-64c-17.673 0-32 14.327-32 32v140.102l-41.792 13.433c-22.753 7.313-29.754 36.173-12.836 53.092l70.012 70.012C125.828 416.287 85.587 448 24 448c-13.255 0-24 10.745-24 24v16c0 13.255 10.745 24 24 24 61.023 0 107.499-20.61 143.258-59.396C181.677 487.432 216.021 512 256 512h128c39.979 0 74.323-24.568 88.742-59.396C508.495 491.384 554.968 512 616 512c13.255 0 24-10.745 24-24v-16c0-13.255-10.745-24-24-24-60.817 0-101.542-31.001-119.384-75.361zM192 128h256v87.531l-118.208-37.995a31.995 31.995 0 0 0-19.584 0L192 215.531V128z"/></svg> Mosaicplots for `n`-ways contingency tables

.fl.w-50.pa2[
<img src="cm-11-EDA_files/figure-html/unnamed-chunk-7-1.png" width="504" />
]

.fl.w-50.pa2[

The interplay between the response variable `Survived` and the four explanatory variables is hard to spot
using a naive mosaicplot

The global arrangement of tiles has a huge impact on the interpretability of multi-dimensional mosaicplots

Variants of mosaicplots like _double decker_ plots make the task easier

]

???

```r
library(ggmosaic)  # provides geom_mosaic() and product()

tit %>%
  dplyr::select(Pclass, Embarked) %>%
  drop_na() %>%
  ggplot() +
  geom_mosaic(aes(x = product(Embarked, Pclass), fill = Embarked)) +
  labs(x = "Passenger class", y = "Embarked") +
  scale_fill_viridis_d() +
  ggtitle("Titanic mosaic with tidyverse flavor")
```

In the Titanic table, the four variables do not have the same status:

- Class, Age, Sex may be considered as "demographic", or "explanatory"
- Survived is a "response" variable

When using variants of mosaicplot to investigate the Titanic dataset

---
name: variant-mosaic
template: inter-slide

## Variations on Mosaicplot

---

### Mosaic plots versus Association plots

> In order to explain multi-dimensional categorical data, statisticians typically look for (conditional) independence structures

> Whether the task is purely exploratory or model-based, techniques such as _mosaic_ and _association_ plots offer good support for visualization .fr.f6[Strucplot vignette]

Before returning to the Titanic dataset, let us revisit the `UCBAdmissions` dataset

Recall that
the `UCBAdmissions` dataset was assembled in the 1970s to assess whether the admission process in the different departments of UC Berkeley suffered from a gender bias

Such a possibility was suggested by looking at the global admission rate for female and male candidates
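
The global rates, and the per-department rates they hide, can be read off the base R `UCBAdmissions` table. A minimal sketch; the names `by_gender`, `global_rates` and `dept_rates` are illustrative

```r
# Global admission rates by gender (departments collapsed)
by_gender <- margin.table(UCBAdmissions, c(1, 2))        # Admit x Gender
global_rates <- prop.table(by_gender, margin = 2)["Admitted", ]

# Admission rates by gender within each department (Gender x Dept)
dept_rates <- prop.table(UCBAdmissions, margin = c(2, 3))["Admitted", , ]

round(global_rates, 2)
round(dept_rates, 2)
```

Globally the male admission rate is the higher one, while in several departments (department A, for instance) the female rate is higher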

???

> Both _mosaic_ and _association_ plots visualize aspects of (possibly higher-dimensional) contingency tables, with several extensions

---
exclude: true

### Extensions

- double-decker plots

- spine plots

- spinograms

- conditional association plots

---

### <svg aria-hidden="true" role="img" viewBox="0 0 512 512" style="height:1em;width:1em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M496 128v16a8 8 0 0 1-8 8h-24v12c0 6.627-5.373 12-12 12H60c-6.627 0-12-5.373-12-12v-12H24a8 8 0 0 1-8-8v-16a8 8 0 0 1 4.941-7.392l232-88a7.996 7.996 0 0 1 6.118 0l232 88A8 8 0 0 1 496 128zm-24 304H40c-13.255 0-24 10.745-24 24v16a8 8 0 0 0 8 8h464a8 8 0 0 0 8-8v-16c0-13.255-10.745-24-24-24zM96 192v192H60c-6.627 0-12 5.373-12 12v20h416v-20c0-6.627-5.373-12-12-12h-36V192h-64v192h-64V192h-64v192h-64V192H96z"/></svg> Order matters

```r
aperm(UCBAdmissions, c(3, 2, 1)) %>% mosaicplot(shade=TRUE)

aperm(UCBAdmissions, c(3, 2, 1)) %>% vcd::mosaic(shade=TRUE)
```

The diagram on the right is easier to interpret: it provides a picture
of admission rates per department and then gender

]
---

.fl.w-60.pa2[

```r
aperm(UCBAdmissions, c(3, 2, 1)) %>%
  vcd::doubledecker()
```

<img src="cm-11-EDA_files/figure-html/unnamed-chunk-10-1.png" width="504" />
.f6[
Double decker plots are special kinds of mosaicplots

- Per-department admission rates exhibit no clear bias against
women
- Men tend to submit more applications to less selective departments
]
]

.fl.w-40.pa2.f6[

[Simpson's paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox)

> A trend appears in several different groups of data but disappears or reverses when these groups are combined

> This result is often encountered in social-science and medical-science statistics and is particularly problematic when frequency data is unduly given causal interpretations.

> The paradox is also referred to as Simpson's reversal, Yule–Simpson effect, amalgamation paradox, or reversal paradox

]

???

A double decker plot is a collection of stacked column/bar plots: For each
department, we have a column plot where `Gender` is mapped to `x` and `Admit`
to `y`. Columns are stacked. Width is proportional to number of applicants
with given Gender for the department. Height is proportional to fraction
of successful/failing applicants for given  Department and Gender

---

```r
pacman::p_load(xtable)
vcd::structable(~ Class + Age + Sex, aperm(Titanic, c(1,3,2, 4))) %>%
  xtable::xtableFtable()
```

.fl.w-50.pa2.f6[

| Class  | Sex    | Child | Adult |
|:------:|:-------|------:|------:|
| 1st    | Male   |     5 |   175 |
|        | Female |     1 |   144 |
| 2nd    | Male   |    11 |   168 |
|        | Female |    13 |    93 |
| 3rd    | Male   |    48 |   462 |
|        | Female |    31 |   165 |
| Crew   | Male   |     0 |   862 |
|        | Female |     0 |    23 |
]

.fl.w-50.pa2.f6[

Flattening helps organize the information

Double decker plots do this graphically

]
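
Base R offers `ftable()` for the same kind of flattening. A minimal sketch on the base R `Titanic` table, collapsed over `Survived`; `by_cas` and `flat` are illustrative names

```r
# Collapse over Survived: keep Class (1), Sex (2), Age (3)
by_cas <- margin.table(Titanic, c(1, 2, 3))

# Flatten: Class and Sex index the rows, Age indexes the columns
flat <- ftable(by_cas, row.vars = c("Class", "Sex"), col.vars = "Age")
flat
```
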
---

```r
vcd::doubledecker(Titanic)

tit %>%
  filter(Survived %in% c('Survived', 'Deceased')) %>%
  filter(Age %in% c('Adult', 'Child')) %>%
  mutate(Age=fct_drop(Age)) %>%
  ggplot() +
  geom_mosaic(aes(x=product(Survived, Age, Sex, Pclass), fill=Survived), divider=ddecker()) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5)) +
  scale_fill_viridis_d()
```

???

`Titanic` is a base <svg aria-hidden="true" role="img" viewBox="0 0 581 512" style="height:1em;width:1.13em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"/></svg> contingency table where `Pclass, Sex, Age, Survived` were cross-tabulated

---
exclude: true

```r
# knitr::include_url("https://rdrr.io/cran/vcd/")
```

---
exclude: true

###  Strucplot framework

- low-level grapcon functions
  - created by generating functions (grapcon generators)
  - `group_...`, `struc_...`, `labelling_...`, `legend_...`, `spacing_...`
- a suitable combination of the low-level grapcon
functions is passed as “hyperparameters” to strucplot()
- convenience functions such as
mosaic(), sieve(), assoc(), and doubledecker() which interface strucplot()

???

> The strucplot framework is highly modularized: Figure 5 shows the hierarchical relationship between the various components. On the lowest level, there are several groups of
workhorse and parameter functions that directly or indirectly influence the final appearance of the plot (see Table 2 for an overview). These are examples of grapcon functions.
They are created by generating functions (grapcon generators), allowing flexible parameterization and extensibility (Figure 5 only shows the generators). The generator names follow the
naming convention group_... (), where group reflects the group the generators belong to
(strucplot core, labeling, legend, shading, or spacing). The workhorse functions (created by
struc_... (), labeling_... (), and legend_... ()) directly produce graphical output (i.e.,
“add ink to the canvas”), whereas the parameter functions (created by spacing_... () and
shading_... ()) compute graphical parameters used by the others. The grapcon functions
returned by struc_... () implement the core functionality, creating the tiles and their content. On the second level of the framework, a suitable combination of the low-level grapcon
functions (or, alternatively, corresponding generating functions) is passed as “hyperparameters” to strucplot(). This central function sets up the graphical layout using grid viewports
(see Figure 6), and coordinates the specified core, labeling, shading, and spacing functions
to produce the plot. On the third level, we provide several convenience functions such as
mosaic(), sieve(), assoc(), and doubledecker() which interface strucplot() through
sensible parameter defaults and support for model formulae. Finally, on the fourth level,
there are “related” vcd functions (such as cotabplot() and the pairs() methods for table
objects) arranging collections of plots of the strucplot framework into more complex displays
(e.g., by means of panel functions)

---
name: indic-matrix
template: inter-slide

## <svg aria-hidden="true" role="img" viewBox="0 0 512 512" style="height:1em;width:1em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M256.12 245.96c-13.25 0-24 10.74-24 24 1.14 72.25-8.14 141.9-27.7 211.55-2.73 9.72 2.15 30.49 23.12 30.49 10.48 0 20.11-6.92 23.09-17.52 13.53-47.91 31.04-125.41 29.48-224.52.01-13.25-10.73-24-23.99-24zm-.86-81.73C194 164.16 151.25 211.3 152.1 265.32c.75 47.94-3.75 95.91-13.37 142.55-2.69 12.98 5.67 25.69 18.64 28.36 13.05 2.67 25.67-5.66 28.36-18.64 10.34-50.09 15.17-101.58 14.37-153.02-.41-25.95 19.92-52.49 54.45-52.34 31.31.47 57.15 25.34 57.62 55.47.77 48.05-2.81 96.33-10.61 143.55-2.17 13.06 6.69 25.42 19.76 27.58 19.97 3.33 26.81-15.1 27.58-19.77 8.28-50.03 12.06-101.21 11.27-152.11-.88-55.8-47.94-101.88-104.91-102.72zm-110.69-19.78c-10.3-8.34-25.37-6.8-33.76 3.48-25.62 31.5-39.39 71.28-38.75 112 .59 37.58-2.47 75.27-9.11 112.05-2.34 13.05 6.31 25.53 19.36 27.89 20.11 3.5 27.07-14.81 27.89-19.36 7.19-39.84 10.5-80.66 9.86-121.33-.47-29.88 9.2-57.88 28-80.97 8.35-10.28 6.79-25.39-3.49-33.76zm109.47-62.33c-15.41-.41-30.87 1.44-45.78 4.97-12.89 3.06-20.87 15.98-17.83 28.89 3.06 12.89 16 20.83 28.89 17.83 11.05-2.61 22.47-3.77 34-3.69 75.43 1.13 137.73 61.5 138.88 134.58.59 37.88-1.28 76.11-5.58 113.63-1.5 13.17 7.95 25.08 21.11 26.58 16.72 1.95 25.51-11.88 26.58-21.11a929.06 929.06 0 0 0 5.89-119.85c-1.56-98.75-85.07-180.33-186.16-181.83zm252.07 121.45c-2.86-12.92-15.51-21.2-28.61-18.27-12.94 2.86-21.12 15.66-18.26 28.61 4.71 21.41 4.91 37.41 4.7 61.6-.11 13.27 10.55 24.09 23.8 24.2h.2c13.17 0 23.89-10.61 24-23.8.18-22.18.4-44.11-5.83-72.34zm-40.12-90.72C417.29 43.46 337.6 1.29 252.81.02 183.02-.82 118.47 24.91 70.46 72.94 24.09 119.37-.9 181.04.14 246.65l-.12 21.47c-.39 13.25 10.03 24.31 23.28 24.69.23.02.48.02.72.02 12.92 0 23.59-10.3 23.97-23.3l.16-23.64c-.83-52.5 19.16-101.86 56.28-139 38.76-38.8 91.34-59.67 147.68-58.86 69.45 1.03 134.73 35.56 174.62 92.39 7.61 10.86 22.56 13.45 33.42 5.86 10.84-7.62 13.46-22.59 5.84-33.43z"/></svg> Indicator and Burt matrices

---

### <svg aria-hidden="true" role="img" viewBox="0 0 512 512" style="height:1em;width:1em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M416 48c0-8.84-7.16-16-16-16h-64c-8.84 0-16 7.16-16 16v48h96V48zM63.91 159.99C61.4 253.84 3.46 274.22 0 404v44c0 17.67 14.33 32 32 32h96c17.67 0 32-14.33 32-32V288h32V128H95.84c-17.63 0-31.45 14.37-31.93 31.99zm384.18 0c-.48-17.62-14.3-31.99-31.93-31.99H320v160h32v160c0 17.67 14.33 32 32 32h96c17.67 0 32-14.33 32-32v-44c-3.46-129.78-61.4-150.16-63.91-244.01zM176 32h-64c-8.84 0-16 7.16-16 16v48h96V48c0-8.84-7.16-16-16-16zm48 256h64V128h-64v160z"/></svg> Two perspectives

MCA can be viewed from two perspectives:

- Analyzing the dataset after performing _one-hot_ encoding of categorical variables: investigating the so-called _indicator_ matrix

- Analyzing all pairwise two-way contingency tables derived from the dataset: investigating the so-called _Burt_ matrix

???

The two perspectives define different pipelines

---

### A glimpse at the indicator matrix

Function `tab.disjonctif()` from `FactoMineR` builds an _indicator_ matrix starting
from  a dataframe with categorical columns

```r
Z <- tit %>%
  drop_na() %>%
  select(Sex, Age, Embarked, Pclass) %>%
* tab.disjonctif()

Z %>% head() %>% knitr::kable()
```

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> female </th>
   <th style="text-align:right;"> male </th>
   <th style="text-align:right;"> Child </th>
   <th style="text-align:right;"> Adult </th>
   <th style="text-align:right;"> NA.A </th>
   <th style="text-align:right;"> S </th>
   <th style="text-align:right;"> C </th>
   <th style="text-align:right;"> Q </th>
   <th style="text-align:right;"> 1 </th>
   <th style="text-align:right;"> 2 </th>
   <th style="text-align:right;"> 3 </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
</tbody>
</table>

]

???

`tab.disjonctif` is shorthand for _tableau disjonctif complet_, the French name
for the indicator matrix

---

### Construction of indicator matrix

- A categorical variable `$V_j$` (factor) with `$q$` levels is mapped to `$q$` `$\{0,1\}$`-valued variables `$V_{j,r}$` for `$r \leq q$`

- If levels are indexed by `$\{1, \ldots, q\}$` and the value of the categorical variable `$V_j$` on row `$i$` is `$k \in \{1, \ldots, q\}$`, the binary variables `$V_{j,r}$` on that row take the values
`$$k \mapsto \underbrace{0,\ldots, 0}_{k-1}, 1, \underbrace{0, \ldots, 0}_{q-k}$$`

- In Machine Learning parlance, building the indicator matrix amounts to performing _one-hot encoding_ on each categorical variable

- The indicator matrix has as many rows as the data matrix

- The number of columns of the indicator matrix is the sum of the number of levels of the categorical variables/columns of the data matrix

- The indicator matrix is a numerical matrix. It is suitable for factorial methods <svg aria-hidden="true" role="img" viewBox="0 0 496 512" style="height:1em;width:0.97em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M248 8C111 8 0 119 0 256s111 248 248 248 248-111 248-248S385 8 248 8zm80 168c17.7 0 32 14.3 32 32s-14.3 32-32 32-32-14.3-32-32 14.3-32 32-32zm-160 0c17.7 0 32 14.3 32 32s-14.3 32-32 32-32-14.3-32-32 14.3-32 32-32zm194.8 170.2C334.3 380.4 292.5 400 248 400s-86.3-19.6-114.8-53.8c-13.6-16.3 11-36.7 24.6-20.5 22.4 26.9 55.2 42.2 90.2 42.2s67.8-15.4 90.2-42.2c13.4-16.2 38.1 4.2 24.6 20.5z"/></svg>

???
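
The construction of the indicator matrix can be sketched in base R without `tab.disjonctif()`. This is an illustration on a hypothetical toy data frame (`toy`), with a hypothetical helper `one_hot()`; it is not how `FactoMineR` implements it

```r
# Hypothetical toy sample with two factors (not the Titanic data)
toy <- data.frame(
  Sex    = factor(c("female", "male", "male")),
  Pclass = factor(c("1", "3", "2"))
)

# One-hot encode a single factor: one {0,1}-valued column per level
one_hot <- function(f) {
  m <- outer(as.integer(f), seq_along(levels(f)), FUN = "==") * 1L
  colnames(m) <- levels(f)
  m
}

# Bind the blocks side by side: as many rows as the data,
# as many columns as the total number of levels
Z_toy <- do.call(cbind, lapply(toy, one_hot))
Z_toy
```

Each row of `Z_toy` sums to the number of variables, since each variable contributes exactly one `1` per row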

---

### The Burt matrix

.fl.w-40.pa2.f6[

- Multiple Correspondence Analysis may be based on the Burt matrix

- The Burt matrix is a symmetric integer-valued matrix made of blocks consisting
of all pairwise two-way contingency tables

- Each block is the contingency table defined by two categorical variables from the data matrix

- Diagonal blocks are diagonal sub-matrices: their diagonal entries are the category counts

]

.fl.w-60.pa2.f6[

```r
B <- t(Z) %*% as.matrix(Z)

B[1:6, 1:6] %>% knitr::kable()
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:right;"> female </th>
   <th style="text-align:right;"> male </th>
   <th style="text-align:right;"> Child </th>
   <th style="text-align:right;"> Adult </th>
   <th style="text-align:right;"> NA.A </th>
   <th style="text-align:right;"> S </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> female </td>
   <td style="text-align:right;"> 95 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 14 </td>
   <td style="text-align:right;"> 74 </td>
   <td style="text-align:right;"> 7 </td>
   <td style="text-align:right;"> 56 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> male </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 107 </td>
   <td style="text-align:right;"> 9 </td>
   <td style="text-align:right;"> 86 </td>
   <td style="text-align:right;"> 12 </td>
   <td style="text-align:right;"> 73 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Child </td>
   <td style="text-align:right;"> 14 </td>
   <td style="text-align:right;"> 9 </td>
   <td style="text-align:right;"> 23 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 17 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Adult </td>
   <td style="text-align:right;"> 74 </td>
   <td style="text-align:right;"> 86 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 160 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 99 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> NA.A </td>
   <td style="text-align:right;"> 7 </td>
   <td style="text-align:right;"> 12 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 19 </td>
   <td style="text-align:right;"> 13 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> S </td>
   <td style="text-align:right;"> 56 </td>
   <td style="text-align:right;"> 73 </td>
   <td style="text-align:right;"> 17 </td>
   <td style="text-align:right;"> 99 </td>
   <td style="text-align:right;"> 13 </td>
   <td style="text-align:right;"> 129 </td>
  </tr>
</tbody>
</table>

A glimpse at stacked two-way contingency tables: rows 3 to 5 and columns
1 and 2 contain the contingency table defined by variables `Age` and `Sex`

]

???

All pairwise contingency tables

`$$B =  Z^T \times Z$$`
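
A minimal base R check of this identity on a hypothetical toy sample: the off-diagonal block of `$Z^T \times Z$` coincides with the two-way contingency table. The names `toy`, `Z_toy` and `B_toy` are illustrative, and `model.matrix()` is used here as a stand-in for `tab.disjonctif()`

```r
# Hypothetical toy sample with two categorical variables
toy <- data.frame(
  Sex = factor(c("female", "male", "male", "female")),
  Age = factor(c("Adult", "Adult", "Child", "Adult"))
)

# Indicator matrix: one block of dummy columns per variable
Z_toy <- cbind(model.matrix(~ Sex - 1, toy), model.matrix(~ Age - 1, toy))

# Burt matrix: all pairwise two-way contingency tables, stacked
B_toy <- crossprod(Z_toy)    # same as t(Z_toy) %*% Z_toy

# Off-diagonal block (Age rows, Sex columns) versus the contingency table
B_toy[c("AgeAdult", "AgeChild"), c("Sexfemale", "Sexmale")]
table(toy$Age, toy$Sex)
```
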

---
template: inter-slide
name: mca-in-words

## MCA in words

---

.fl.w-40.pa2[

![](./img/greenacre.jpg)

]

.fr.w-60.pa2[

]

---

### MCA on <svg aria-hidden="true" role="img" viewBox="0 0 640 512" style="height:1em;width:1.25em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M496.616 372.639l70.012-70.012c16.899-16.9 9.942-45.771-12.836-53.092L512 236.102V96c0-17.673-14.327-32-32-32h-64V24c0-13.255-10.745-24-24-24H248c-13.255 0-24 10.745-24 24v40h-64c-17.673 0-32 14.327-32 32v140.102l-41.792 13.433c-22.753 7.313-29.754 36.173-12.836 53.092l70.012 70.012C125.828 416.287 85.587 448 24 448c-13.255 0-24 10.745-24 24v16c0 13.255 10.745 24 24 24 61.023 0 107.499-20.61 143.258-59.396C181.677 487.432 216.021 512 256 512h128c39.979 0 74.323-24.568 88.742-59.396C508.495 491.384 554.968 512 616 512c13.255 0 24-10.745 24-24v-16c0-13.255-10.745-24-24-24-60.817 0-101.542-31.001-119.384-75.361zM192 128h256v87.531l-118.208-37.995a31.995 31.995 0 0 0-19.584 0L192 215.531V128z"/></svg> Titanic

.fl.w-40.pa2[

```r
res.mca <- tit %>% drop_na() %>%
  select(Sex,
         Age,
         Embarked,
         Pclass) %>%
* MCA(graph = FALSE)
```

We perform MCA on the data matrix made of four qualitative columns with similar status: `Sex`, `Age`, `Embarked`, `Pclass`

]

.fl.w-60.pa2[

`res.mca` is a list with class attributes `MCA` and `list`. It contains many related components

The core component is `res.mca$svd`. The other components are either byproducts of the computation of `res.mca$svd`
or derived from `res.mca$svd` so as to facilitate
numerical or graphical reporting

]

???

---

### Output of  `print(res.mca)`

|   | Name   |             Description                                          |
|:-:|:------------------------|:----------------------------------------------------|
|1  |   "$eig"                |"eigenvalues"                                        |
|2  |   "$var"                |"results for the variables"                          |
|3  |   "$var$coord"          |"coord. of the categories"                           |
|4  |   "$var$cos2"           |"cos2 for the categories"                           |
|5  |   "$var$contrib"        |"contributions of the categories"                    |
|6  |   "$var$v.test"         |"v-test for the categories"                          |
|7  |   "$ind"                |"results for the individuals"                        |
|8  |   "$ind$coord"          |"coord. for the individuals"                         |
|9  |   "$ind$cos2"          |"cos2 for the individuals"                          |
|10 |   "$ind$contrib"        |"contributions of the individuals"                   |
|11 |  "$quali.sup"          |"results for the supplementary categorical variables"|
|12 |  "$quali.sup$coord"    |"coord. for the supplementary categories"            |
|13 |  "$quali.sup$cos2"     |"cos2 for the supplementary categories"              |
|14 |  "$quali.sup$v.test"   |"v-test for the supplementary categories"            |
|15 |  "$call"               |"intermediate results"                               |
|16 |  "$call$marge.col"     |"weights of columns"                                 |
|17 |  "$call$marge.li"      |"weights of rows"                                    |


---

### Comment on output of  `print(res.mca)`

- `eig` is computed from the singular values in `res.mca$svd`

- `var` contains material for plotting information about categories and variables on factorial planes

- `ind` contains material for plotting information about individuals on factorial planes

???

---

### <svg aria-hidden="true" role="img" viewBox="0 0 576 512" style="height:1em;width:1.12em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M208 0c-29.9 0-54.7 20.5-61.8 48.2-.8 0-1.4-.2-2.2-.2-35.3 0-64 28.7-64 64 0 4.8.6 9.5 1.7 14C52.5 138 32 166.6 32 200c0 12.6 3.2 24.3 8.3 34.9C16.3 248.7 0 274.3 0 304c0 33.3 20.4 61.9 49.4 73.9-.9 4.6-1.4 9.3-1.4 14.1 0 39.8 32.2 72 72 72 4.1 0 8.1-.5 12-1.2 9.6 28.5 36.2 49.2 68 49.2 39.8 0 72-32.2 72-72V64c0-35.3-28.7-64-64-64zm368 304c0-29.7-16.3-55.3-40.3-69.1 5.2-10.6 8.3-22.3 8.3-34.9 0-33.4-20.5-62-49.7-74 1-4.5 1.7-9.2 1.7-14 0-35.3-28.7-64-64-64-.8 0-1.5.2-2.2.2C422.7 20.5 397.9 0 368 0c-35.3 0-64 28.6-64 64v376c0 39.8 32.2 72 72 72 31.8 0 58.4-20.7 68-49.2 3.9.7 7.9 1.2 12 1.2 39.8 0 72-32.2 72-72 0-4.8-.5-9.5-1.4-14.1 29-12 49.4-40.6 49.4-73.9z"/></svg>

Understanding MCA, the way it works and the ways it can be used, amounts to understanding the steps
that lead to the computation of `res.mca$svd`

The assertion that, by default, MCA consists of performing Correspondence Analysis (CA) on the indicator matrix
deserves some explanation

In order to make the argument self-contained, we first sketch what CA is and how it relates to (extensions of) SVD

Then, we explain what it means to perform CA on the indicator matrix

Finally, we relate `res.mca$svd` to the extended SVD of the residual matrix of the indicator matrix

---
template: inter-slide
name: brush-up-CA

## Brush up your CA

---

### CA executive summary

- Start from a 2-way contingency table `$X$` with `$\sum_{i,j} X_{i,j}=N$`
- Normalize `$P = \frac{1}{N}X$` (_correspondence matrix_)
- Let `$r$` (resp. `$c$`) be the vector of row (resp. column) sums of `$P$`
- Let `$D_r=\text{diag}(r)$` denote the diagonal matrix with row sums of `$P$` as coefficients
- Let `$D_c=\text{diag}(c)$` denote the diagonal matrix with column sums of `$P$` as coefficients

+ The _row profiles matrix_ is `$D_r^{-1} \times P$`
+ The _standardized residuals matrix_ is  `$S = D_r^{-1/2} \times \left(P - r c^T\right) \times D_c^{-1/2}$`

CA consists in computing the SVD of the standardized residuals matrix `$S =  U  \times D \times V^T$`

From the SVD, we get
- `$D_r^{-1/2} \times U$` standardized coordinates of rows
- `$D_c^{-1/2} \times V$` standardized coordinates of columns
- `$D_r^{-1/2} \times U \times D$` principal coordinates of rows
- `$D_c^{-1/2} \times V \times D$` principal coordinates of columns
- Squared singular values: the principal inertia
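
The steps above can be sketched in base R on a small contingency table. The counts are made up for illustration, and the variable names (`cc`, `row_std`, ...) are ours, not taken from any package

```r
# Toy 3 x 4 contingency table (made-up counts, for illustration only)
X <- matrix(c(12,  5,  3,  8,
               6, 14,  4,  2,
               3,  7, 15,  9), nrow = 3, byrow = TRUE)
N  <- sum(X)
P  <- X / N          # correspondence matrix
r  <- rowSums(P)     # row sums of P
cc <- colSums(P)     # column sums of P (named cc to avoid masking base::c)

# Standardized residuals S = Dr^{-1/2} (P - r c^T) Dc^{-1/2}
S  <- diag(1 / sqrt(r)) %*% (P - r %o% cc) %*% diag(1 / sqrt(cc))
sv <- svd(S)

row_std   <- diag(1 / sqrt(r))  %*% sv$u   # standardized coordinates of rows
col_std   <- diag(1 / sqrt(cc)) %*% sv$v   # standardized coordinates of columns
row_princ <- row_std %*% diag(sv$d)        # principal coordinates of rows
col_princ <- col_std %*% diag(sv$d)        # principal coordinates of columns
inertia   <- sv$d^2                        # principal inertias

# Pearson chi-square statistic of the table, for comparison with the inertia
E    <- N * (r %o% cc)
chi2 <- sum((X - E)^2 / E)
```

The total inertia `sum(inertia)` should coincide with `chi2 / N`, the Pearson chi-square statistic divided by `$N$`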

???

When calling `svd(.)`, the argument should be
`$$D_r^{1/2}\times \left(D_r^{-1} \times P \times D_c^{-1}- \mathbf{I}\times \mathbf{I}^T  \right)\times D_c^{1/2}$$`

---

### CA and extended SVD

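Since
`$$D_r^{-1} \times P \times D_c^{-1} - \mathbf{I}\mathbf{I}^T = D_r^{-1/2} \times S \times D_c^{-1/2} = (D_r^{-1/2} \times U)\times D \times (D_c^{-1/2}\times V)^T$$`
(where `$\mathbf{I}$` denotes an all-ones column vector of the appropriate length) and since the columns of `$D_r^{-1/2} \times U$` (resp. `$D_c^{-1/2} \times V$`) are orthonormal with respect to `$D_r$` (resp. `$D_c$`),

`$(D_r^{-1/2} \times U)\times D \times (D_c^{-1/2}\times V)^T$` is the _extended SVD_ of
`$$D_r^{-1} \times P \times D_c^{-1} - \mathbf{I}\mathbf{I}^T$$`
with respect to `$D_r$` and `$D_c$`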
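
The identity and the two weighted orthogonality properties can be checked numerically. This is a minimal sketch on a made-up table; the names `A` and `B` for the two factors are ours

```r
# Toy 3 x 4 contingency table (made-up counts, for illustration only)
X <- matrix(c(12,  5,  3,  8,
               6, 14,  4,  2,
               3,  7, 15,  9), nrow = 3, byrow = TRUE)
P  <- X / sum(X)
r  <- rowSums(P)
cc <- colSums(P)
Dr <- diag(r)
Dc <- diag(cc)

S  <- diag(1 / sqrt(r)) %*% (P - r %o% cc) %*% diag(1 / sqrt(cc))
sv <- svd(S)
A  <- diag(1 / sqrt(r))  %*% sv$u   # Dr^{-1/2} U
B  <- diag(1 / sqrt(cc)) %*% sv$v   # Dc^{-1/2} V

# Left-hand side: Dr^{-1} P Dc^{-1} minus the all-ones matrix
lhs <- solve(Dr) %*% P %*% solve(Dc) - matrix(1, nrow = 3, ncol = 4)
rhs <- A %*% diag(sv$d) %*% t(B)
gap_factorization <- max(abs(lhs - rhs))

# Columns of A (resp. B) are orthonormal with respect to Dr (resp. Dc)
gap_A <- max(abs(t(A) %*% Dr %*% A - diag(3)))
gap_B <- max(abs(t(B) %*% Dc %*% B - diag(3)))
```

All three gaps should be of the order of machine precision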

---

### CA and reconstruction formulae
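
Rearranging the extended SVD gives back the correspondence matrix: coefficient `$i,j$` of `$P$` equals `$r_i c_j$` times one plus the sum over axes of singular value times the two standardized coordinates. The sketch below (made-up counts, helper `reconstruct()` is ours) checks this and shows what truncating the sum to the first `$k$` axes does

```r
# Toy 3 x 4 contingency table (made-up counts, for illustration only)
X <- matrix(c(12,  5,  3,  8,
               6, 14,  4,  2,
               3,  7, 15,  9), nrow = 3, byrow = TRUE)
P  <- X / sum(X)
r  <- rowSums(P)
cc <- colSums(P)

S  <- diag(1 / sqrt(r)) %*% (P - r %o% cc) %*% diag(1 / sqrt(cc))
sv <- svd(S)
A  <- diag(1 / sqrt(r))  %*% sv$u   # standardized coordinates of rows
B  <- diag(1 / sqrt(cc)) %*% sv$v   # standardized coordinates of columns

# Reconstruction using the first k axes:
# P_ij is approximated by r_i c_j (1 + sum_{l <= k} d_l A_il B_jl)
reconstruct <- function(k) {
  idx <- seq_len(k)
  low_rank <- A[, idx, drop = FALSE] %*%
    diag(sv$d[idx], nrow = k) %*%
    t(B[, idx, drop = FALSE])
  (r %o% cc) * (1 + low_rank)
}

# Largest absolute reconstruction error for k = 1, 2, 3
err <- sapply(1:3, function(k) max(abs(P - reconstruct(k))))
```

For a 3 x 4 table, `$S$` has at most two non-zero singular values, so the reconstruction is already exact (up to rounding) with `k = 2`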

---
template: inter-slide

## Performing CA on indicator matrix

---

### MCA: CA on indicator matrix

Let `$X$` be the data matrix with `$n$` rows (individuals) and `$p$` categorical columns (variables)

For `$j \in \{1, \ldots, p\}$`, let `$J_j$` denote the number of levels (categories) of variable `$j$`

Let `$q = \sum_{j\leq p} J_j$` be the sum of the number of levels throughout the variables

For `$j\leq p$` and `$k \leq J_j$`, let `$\langle j, k\rangle = \sum_{j'<j} J_{j'}+k$` denote the index of the column associated with level `$k$` of variable `$j$`

Let `$Z$` be the indicator matrix with `$n$` rows and `$q$` columns: `$Z_{i, \langle j, k\rangle} = 1$` if individual `$i$` exhibits level `$k$` of variable `$j$`, and `$0$` otherwise

Let `$N = n \times p = \sum_{i\leq n} \sum_{l \leq q} Z_{i,l}$` and `$P = \frac{1}{N} Z$` (the _correspondence matrix_ for MCA)

The column wise sum of the correspondence matrix `$P$` for the `$k$`th level of the `$j$`th variable of `$X$` ( `$j \leq p$` ) is `$N_{\langle j,k\rangle}/N = f_{\langle j,k\rangle}/p$` where `$N_{\langle j,k\rangle}$` is the number of individuals exhibiting level `$k$` of the `$j$`th variable and `$f_{\langle j,k\rangle} = N_{\langle j,k\rangle}/n$` stands for the relative frequency of that level

`$$D_r = \frac{1}{n}\text{Id}_n\qquad D_c =\text{diag}\left(\frac{f_{\langle j,k\rangle}}{p}\right)_{j \leq p, k\leq J_j}$$`
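
A minimal base R sketch of these definitions on a made-up data frame. The helper `indicator_block()` is ours; in the slides, `FactoMineR::tab.disjonctif()` plays this role

```r
# Toy data frame with p = 3 categorical variables (made-up values)
X <- data.frame(
  sex   = factor(c("f", "m", "m", "f", "m", "f")),
  class = factor(c("1st", "3rd", "2nd", "3rd", "3rd", "1st")),
  port  = factor(c("S", "C", "S", "Q", "S", "C"))
)
n <- nrow(X)
p <- ncol(X)

# One block of 0/1 columns per variable, one column per level
indicator_block <- function(v) {
  Z_j <- outer(as.integer(v), seq_len(nlevels(v)), FUN = "==") * 1
  colnames(Z_j) <- levels(v)
  Z_j
}
Z <- do.call(cbind, lapply(X, indicator_block))   # indicator matrix
q <- ncol(Z)                                      # total number of levels

N  <- sum(Z)
P  <- Z / N           # correspondence matrix
r  <- rowSums(P)      # diagonal of Dr
cc <- colSums(P)      # diagonal of Dc
f  <- colMeans(Z)     # relative frequencies of the levels
```

Each row of `Z` contains exactly `p` ones, which is why `N` equals `n * p`, every row sum of `P` equals `1/n` and the column sums of `P` are `f / p`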

---

### MCA: CA on indicator matrix (continued)

Let `$r= D_r \times \mathbb{I}_n = \frac{1}{n} \mathbb{I}_n$`  and `$c = D_c \times \mathbb{I}_q$`

In MCA, we compute the SVD `$U \times D \times V^T$` of the standardized residuals matrix:

`$$S = D_r^{-1/2}\times \left(P - r\times c^T\right) \times D_c^{-1/2} = \sqrt{n}\left(P - r\times c^T\right) \times D_c^{-1/2}$$`

Coefficient `$i, \langle j, k\rangle$`  of `$S$` is
`$$\frac{Z_{i, \langle j, k\rangle}- f_{\langle j,k\rangle}}{\sqrt{n\, p\, f_{\langle j,k\rangle}}}$$`
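
As a sanity check, the sketch below compares the matrix expression of `$S$` with the entrywise expression `$(Z_{i,l} - f_l)/\sqrt{n\, p\, f_l}$` on a made-up data frame (all names are ours)

```r
# Toy data frame with p = 3 categorical variables (made-up values)
X <- data.frame(
  sex   = factor(c("f", "m", "m", "f", "m", "f")),
  class = factor(c("1st", "3rd", "2nd", "3rd", "3rd", "1st")),
  port  = factor(c("S", "C", "S", "Q", "S", "C"))
)
n <- nrow(X)
p <- ncol(X)

# Indicator matrix: one 0/1 column per level of each variable
indicator_block <- function(v) {
  outer(as.integer(v), seq_len(nlevels(v)), FUN = "==") * 1
}
Z <- do.call(cbind, lapply(X, indicator_block))
q <- ncol(Z)

P  <- Z / sum(Z)
r  <- rowSums(P)
cc <- colSums(P)
f  <- colMeans(Z)     # relative frequencies of the levels

# Matrix form: sqrt(n) (P - r c^T) Dc^{-1/2}
S_matrix <- sqrt(n) * (P - r %o% cc) %*% diag(1 / sqrt(cc))

# Entrywise form: (Z - f) / sqrt(n p f), column by column
S_closed <- sweep(sweep(Z, 2, f, "-"), 2, sqrt(n * p * f), "/")

gap           <- max(abs(S_matrix - S_closed))
total_inertia <- sum(S_matrix^2)
```

Summing the squared coefficients shows that the total inertia only depends on the numbers of levels and variables: it equals `q / p - 1`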

---

```r
# Assumes FactoMineR (tab.disjonctif), dplyr, tidyr and assertthat are attached
tol <- 1e-10
X <- select(tit, Pclass, Sex, Embarked, Age) %>% drop_na()

p <- ncol(X)            # number of active variables
Z <- tab.disjonctif(X)  # indicator matrix

n <- nrow(Z)
N <- sum(Z)
assert_that(N == n * p)
P <- Z/N                # correspondence matrix

r <- rowSums(P)         # row sums, all equal to 1/n
Dr <- diag(r)
assert_that(all(abs(r - 1/n) <= tol))  # compare floats up to a tolerance
c <- colSums(P)         # column sums
Dc <- diag(c)
S <- (P - r %o% c)      # residuals

assert_that(abs(sum(S)) <= 2 * tol)
SS <- diag(sqrt(r^(-1))) %*% S %*% diag(sqrt(c^(-1)))  # standardized residuals

svd.SS <- svd(SS)       # bare MCA
```

---

.f6[
We may now compare `svd.SS`,  the SVD of the standardized residuals `SS`
with member `res.mca$svd` of `res.mca <- MCA(X)`

The singular values coincide <svg aria-hidden="true" role="img" viewBox="0 0 640 512" style="height:1em;width:1.25em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M639.4 433.6c-8.4-20.4-31.8-30.1-52.2-21.6l-22.1 9.2-38.7-101.9c47.9-35 64.8-100.3 34.5-152.8L474.3 16c-8-13.9-25.1-19.7-40-13.6L320 49.8 205.7 2.4c-14.9-6.2-32-.3-40 13.6L79.1 166.5C48.9 219 65.7 284.3 113.6 319.2L74.9 421.1l-22.1-9.2c-20.4-8.5-43.7 1.2-52.2 21.6-1.7 4.1.2 8.8 4.3 10.5l162.3 67.4c4.1 1.7 8.7-.2 10.4-4.3 8.4-20.4-1.2-43.8-21.6-52.3l-22.1-9.2L173.3 342c4.4.5 8.8 1.3 13.1 1.3 51.7 0 99.4-33.1 113.4-85.3l20.2-75.4 20.2 75.4c14 52.2 61.7 85.3 113.4 85.3 4.3 0 8.7-.8 13.1-1.3L506 445.6l-22.1 9.2c-20.4 8.5-30.1 31.9-21.6 52.3 1.7 4.1 6.4 6 10.4 4.3L635.1 444c4-1.7 6-6.3 4.3-10.4zM275.9 162.1l-112.1-46.5 36.5-63.4 94.5 39.2-18.9 70.7zm88.2 0l-18.9-70.7 94.5-39.2 36.5 63.4-112.1 46.5z"/></svg>

```r
assert_that(all(abs(res.mca$svd$vs - svd.SS$d)  <= 10 * tol))
```

The columns of matrix `res.mca$svd$U` are orthonormal with respect to `$D_r$`

Matrix `res.mca$svd$U` equals `$D_r^{-1/2} \times U$` (up to sign changes and numerical errors)

The columns of matrix `res.mca$svd$V` are orthonormal with respect to `$D_c$`

Matrix `res.mca$svd$V` equals `$D_c^{-1/2} \times V$` (up to sign changes and numerical errors)

]
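
Comparing matrices "up to sign changes" needs some care, since each singular vector is only defined up to its sign. A minimal sketch, assuming distinct singular values; the helper `equal_up_to_sign()` is hypothetical (it belongs neither to base R nor to `FactoMineR`)

```r
# Hypothetical helper: are A and B equal after flipping the sign of some columns of B?
equal_up_to_sign <- function(A, B, tol = 1e-8) {
  stopifnot(all(dim(A) == dim(B)))
  # For each column, choose the sign that aligns B with A
  signs <- sign(colSums(A * B))
  signs[signs == 0] <- 1
  max(abs(A - sweep(B, 2, signs, "*"))) <= tol
}

set.seed(1)
M  <- matrix(rnorm(20), nrow = 5)
sv <- svd(M)

# Same left singular vectors, with the signs of columns 2 and 4 flipped
flipped <- sv$u %*% diag(c(1, -1, 1, -1))

same_up_to_sign <- equal_up_to_sign(sv$u, flipped)
same_exactly    <- isTRUE(all.equal(sv$u, flipped))
```

With objects from the slides, one could call the helper on `res.mca$svd$U` and `diag(1/sqrt(r)) %*% svd.SS$u`, restricted to the same columns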

---
template: inter-slide

## Zooming on other components of `res.mca`

---

### Component `res.mca$eig`

.fl.w-50.pa2.f6[

```r
eigv <- res.mca$svd$vs^2

tibble(eigenvalue=eigv,
       `% variance`=eigv/sum(eigv),
       `cum. % variance`= cumsum(eigv)/sum(eigv)) %>%
  rownames_to_column(var="Dimension") %>%
  head() %>%
  knitr::kable(digits=2)
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Dimension </th>
   <th style="text-align:right;"> eigenvalue </th>
   <th style="text-align:right;"> % variance </th>
   <th style="text-align:right;"> cum. % variance </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 1 </td>
   <td style="text-align:right;"> 0.39 </td>
   <td style="text-align:right;"> 0.22 </td>
   <td style="text-align:right;"> 0.22 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 2 </td>
   <td style="text-align:right;"> 0.34 </td>
   <td style="text-align:right;"> 0.19 </td>
   <td style="text-align:right;"> 0.42 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 3 </td>
   <td style="text-align:right;"> 0.28 </td>
   <td style="text-align:right;"> 0.16 </td>
   <td style="text-align:right;"> 0.58 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 4 </td>
   <td style="text-align:right;"> 0.24 </td>
   <td style="text-align:right;"> 0.14 </td>
   <td style="text-align:right;"> 0.71 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 5 </td>
   <td style="text-align:right;"> 0.19 </td>
   <td style="text-align:right;"> 0.11 </td>
   <td style="text-align:right;"> 0.82 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 6 </td>
   <td style="text-align:right;"> 0.16 </td>
   <td style="text-align:right;"> 0.09 </td>
   <td style="text-align:right;"> 0.92 </td>
  </tr>
</tbody>
</table>
]

.fr.w-50.pa2[

The table on the left contains the first rows of `res.mca$eig`

Thanks to the Eckart-Young Theorem  for extended SVD, component `$eig`
tells us how well the  matrix
`$$D_r^{-1} \times(P -r \times c^T) \times D_c^{-1}$$`
can be approximated by low
rank matrices according to the Hilbert-Schmidt norm with respect to `$D_r$` and `$D_c$`

]
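
The approximation statement can be illustrated numerically. In this sketch (made-up contingency table, names are ours), the squared error of the rank `k` truncation of the SVD is compared with the sum of the discarded eigenvalues, and the weighted norm of `$D_r^{-1} \times(P -r \times c^T) \times D_c^{-1}$` is compared with the Frobenius norm of the standardized residuals

```r
# Toy 3 x 4 contingency table (made-up counts, for illustration only)
X <- matrix(c(12,  5,  3,  8,
               6, 14,  4,  2,
               3,  7, 15,  9), nrow = 3, byrow = TRUE)
P  <- X / sum(X)
r  <- rowSums(P)
cc <- colSums(P)

S  <- diag(1 / sqrt(r)) %*% (P - r %o% cc) %*% diag(1 / sqrt(cc))
sv <- svd(S)

# Rank k truncation of the SVD of S
k   <- 1
S_k <- sv$u[, 1:k, drop = FALSE] %*%
  diag(sv$d[1:k], nrow = k) %*%
  t(sv$v[, 1:k, drop = FALSE])

err2      <- sum((S - S_k)^2)       # squared Frobenius norm of the error
discarded <- sum(sv$d[-(1:k)]^2)    # sum of the discarded eigenvalues

# Weighted (Dr, Dc) squared norm of M = Dr^{-1} (P - r c^T) Dc^{-1}
M          <- diag(1 / r) %*% (P - r %o% cc) %*% diag(1 / cc)
weighted2  <- sum((r %o% cc) * M^2)
frobenius2 <- sum(S^2)

# Share of inertia captured by the first axes
share <- cumsum(sv$d^2) / sum(sv$d^2)
```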
---

### Component `res.mca$var`

.fl.w-50.pa2.f6[

```r
coord <-  res.mca$var$coord
contrib <- res.mca$var$contrib
cos2 <- res.mca$var$cos2

assert_that(norm(res.mca$svd$V %*% diag(res.mca$svd$vs[1:8])
                 - coord, "F") <= tol)

tmp <- t(t(coord^2) %*% Dc) %*% diag(1/res.mca$svd$vs[1:8]^2)
assert_that(norm(100 * tmp - contrib, "F") <= tol)

assert_that(norm(cos2 - coord^2/rowSums(coord^2), "F") <= tol)
```
]

.fl.w-50.pa2[

Among the components of `res.mca$var`, the matrices `coord`, `contrib` and `cos2` have the same dimensions: `$q$` rows (sum of level numbers)  and `$q-p$` columns

`$coord` is made of the principal coordinates of columns `$D_c^{-1/2} \times V \times D$`. Each row corresponds to one column of the indicator matrix

`$contrib` ...

`$cos2`  is derived from `$coord`. For each row `$\langle j, k\rangle$` (column of the indicator matrix) and each
dimension `$i$`, the  `$\langle j, k\rangle, i$` coefficient of `$cos2` is the squared cosine
of the angle between the `$\langle j, k\rangle$`th row of `$coord`  and the `$i$`th principal axis

]
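
The three matrices can be rebuilt from a bare SVD, mirroring the assertions on the left. This is a sketch on a made-up data frame, not `FactoMineR`'s implementation; axes with a numerically null singular value are dropped (there are at most `$q-p$` non-null ones)

```r
# Toy data frame with p = 3 categorical variables (made-up values)
X <- data.frame(
  sex   = factor(c("f", "m", "m", "f", "m", "f")),
  class = factor(c("1st", "3rd", "2nd", "3rd", "3rd", "1st")),
  port  = factor(c("S", "C", "S", "Q", "S", "C"))
)
n <- nrow(X)
p <- ncol(X)
indicator_block <- function(v) {
  outer(as.integer(v), seq_len(nlevels(v)), FUN = "==") * 1
}
Z  <- do.call(cbind, lapply(X, indicator_block))   # indicator matrix
q  <- ncol(Z)
P  <- Z / sum(Z)
r  <- rowSums(P)
cc <- colSums(P)

S  <- sqrt(n) * (P - r %o% cc) %*% diag(1 / sqrt(cc))
sv <- svd(S)

keep <- which(sv$d > 1e-8)            # axes with non-null singular value
d    <- sv$d[keep]
V    <- sv$v[, keep, drop = FALSE]

# Principal coordinates of categories: Dc^{-1/2} V D
coord   <- diag(1 / sqrt(cc)) %*% V %*% diag(d, nrow = length(d))
# Contribution (in %) of each category to each axis: 100 * c_l * coord^2 / eigenvalue
contrib <- 100 * sweep(cc * coord^2, 2, d^2, "/")
# Squared cosines: share of each axis in the squared norm of a row of coord
cos2    <- coord^2 / rowSums(coord^2)
```

On each axis the contributions add up to 100, and for each category the squared cosines add up to 1 over the retained axes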

???

---

### Component `res.mca$ind`

.fl.w-50.pa2.f6[

```r
coord <-  res.mca$ind$coord
contrib <- res.mca$ind$contrib
cos2 <- res.mca$ind$cos2

assert_that(norm(res.mca$svd$U %*% diag(res.mca$svd$vs[1:8])
                 - coord, "F") <= tol)

tmp <- 100 * t(t(coord^2) %*% Dr) %*% diag(1/res.mca$svd$vs[1:8]^2)
assert_that(norm(tmp - contrib, "F") <= tol)

assert_that(norm(cos2 - coord^2/rowSums(coord^2), "F") <= tol)
```
]

.fl.w-50.pa2[

`res.mca$ind` is a list of 3 matrices with the same dimensions: `$n$` rows (number of individuals)  and `$q-p$` columns

`$coord` is made of the principal coordinates of rows: `$D_r^{-1/2} \times U \times D$`. Each row corresponds to one row of the indicator matrix

`$contrib` ...

`$cos2`  is derived from `$coord`. For each row `$j$` (row of the indicator matrix) and each
dimension `$i$`, the  `$j, i$` coefficient of `$cos2` is the squared cosine
of the angle between the `$j$`th row of `$coord`  and the `$i$`th principal axis

]

---

### More about MCA  (from `FactoMineR`)

`FactoMineR::MCA` does much more than computing an  SVD on a standardized residuals matrix

In real life, MCA is performed on a subset of the categorical columns of some data frame (the so-called
_active_ columns). The result
may help to understand the interplay between the active variables and the other variables

> MCA performs Multiple Correspondence Analysis (MCA) with supplementary individuals, supplementary quantitative variables and supplementary categorical variables.

> Performs also Specific Multiple Correspondence Analysis with supplementary categories and supplementary categorical variables.

> Missing values are treated as an additional level, categories which are rare can be ventilated ...

.fr.f6[From FactoMineR documentation]

???

- supplementary categories and

- supplementary categorical variables

- ventilated:

---

### Result of `MCA`

Beyond `$var`, `$ind`, `$eig`, `res.mca` contains further elements, including:

- `ind.sup`
a list of matrices containing all the results for the supplementary individuals (coordinates, square cosine)

- `quanti.sup`
a matrix containing the coordinates of the supplementary quantitative variables (the correlation between a variable and an axis is equal to the variable coordinate on the axis)

- `quali.sup`
a list of matrices with all the results for the supplementary categorical variables (coordinates of each categories of each variables, square cosine and v.test which is a criterion with a Normal distribution, square correlation ratio)

- `call`
a list with some statistics

---

### <svg aria-hidden="true" role="img" viewBox="0 0 512 512" style="height:1em;width:1em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M416 48c0-8.84-7.16-16-16-16h-64c-8.84 0-16 7.16-16 16v48h96V48zM63.91 159.99C61.4 253.84 3.46 274.22 0 404v44c0 17.67 14.33 32 32 32h96c17.67 0 32-14.33 32-32V288h32V128H95.84c-17.63 0-31.45 14.37-31.93 31.99zm384.18 0c-.48-17.62-14.3-31.99-31.93-31.99H320v160h32v160c0 17.67 14.33 32 32 32h96c17.67 0 32-14.33 32-32v-44c-3.46-129.78-61.4-150.16-63.91-244.01zM176 32h-64c-8.84 0-16 7.16-16 16v48h96V48c0-8.84-7.16-16-16-16zm48 256h64V128h-64v160z"/></svg>

In a [tidy universe](https://www.tidyverse.org), there would exist  `tidy`, `augment` and `glance` methods for class `MCA` just as there are such methods for class `prcomp` (used to perform PCA)

Many components of `res.mca` could be computed by methods like `tidy`, `augment` and `glance` and
`MCA` would just return the extended SVD, matrices `$D_r$`, `$D_c$` and some information from the call, like
the names of the active variables, the levels of each active variable, a vector recording the active individuals
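
As a rough illustration of what a `tidy`-like summary could return, the hypothetical helper `tidy_eig()` below (it belongs neither to `broom` nor to `FactoMineR`) turns a vector of singular values into a data frame. The input is made of the square roots of the rounded eigenvalues displayed earlier, so the percentages only approximate those of `res.mca$eig`

```r
# Hypothetical helper: one row per dimension, in the spirit of broom::tidy()
tidy_eig <- function(singular_values) {
  eig <- singular_values^2
  data.frame(
    dimension  = seq_along(eig),
    eigenvalue = eig,
    percent    = 100 * eig / sum(eig),
    cumulative = 100 * cumsum(eig) / sum(eig)
  )
}

# Rounded eigenvalues copied from the table of eigenvalues shown earlier
rounded_eig <- c(0.39, 0.34, 0.28, 0.24, 0.19, 0.16, 0.15)
eig_df <- tidy_eig(sqrt(rounded_eig))
```

Such a data frame can be passed directly to `knitr::kable()` or to `ggplot2`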

---
template: inter-slide
name: mca-viz

## Visualization

---

### Inspecting Titanic using `factoextra`: screeplot

.fl.w-40.pa2.f6[

```r
knitr::kable(get_eigenvalue(res.mca), digits=2)
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:right;"> eigenvalue </th>
   <th style="text-align:right;"> variance.percent </th>
   <th style="text-align:right;"> cumulative.variance.percent </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Dim.1 </td>
   <td style="text-align:right;"> 0.39 </td>
   <td style="text-align:right;"> 22.35 </td>
   <td style="text-align:right;"> 22.35 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Dim.2 </td>
   <td style="text-align:right;"> 0.34 </td>
   <td style="text-align:right;"> 19.25 </td>
   <td style="text-align:right;"> 41.60 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Dim.3 </td>
   <td style="text-align:right;"> 0.28 </td>
   <td style="text-align:right;"> 15.94 </td>
   <td style="text-align:right;"> 57.54 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Dim.4 </td>
   <td style="text-align:right;"> 0.24 </td>
   <td style="text-align:right;"> 13.89 </td>
   <td style="text-align:right;"> 71.43 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Dim.5 </td>
   <td style="text-align:right;"> 0.19 </td>
   <td style="text-align:right;"> 11.05 </td>
   <td style="text-align:right;"> 82.48 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Dim.6 </td>
   <td style="text-align:right;"> 0.16 </td>
   <td style="text-align:right;"> 9.13 </td>
   <td style="text-align:right;"> 91.61 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Dim.7 </td>
   <td style="text-align:right;"> 0.15 </td>
   <td style="text-align:right;"> 8.39 </td>
   <td style="text-align:right;"> 100.00 </td>
  </tr>
</tbody>
</table>

]

.fl.w-60.pa2.f6[

```r
fviz_screeplot(res.mca, addlabels=TRUE)
```

<img src="cm-11-EDA_files/figure-html/unnamed-chunk-27-1.png" width="504" />
]

???

compare with `broom::tidy()` etc  for objects of class  `pca`

---

.fl.w-40.pa2[

```r
fviz_mca_var(res.mca,
             choice = "var") +
  coord_fixed() +
  ggtitle("MCA Titanic, variables")
```
]

.fl.w-60.pa2[

![](cm-11-EDA_files/figure-html/tita_var-1.png)

]

---

### Hand-made plots

The `plot` method for class `MCA` and the functions in `factoextra` provide off-the-shelf
constructions for classical MCA plots

There is nothing special about MCA plots

- the screeplot is a column plot
- the other plots are (sometimes decorated) scatter plots after some coordinate change

In the sequel, we build plots for categories, individuals and biplots using
the constructs from `ggplot2`
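
For instance, a hand-made screeplot is just a column plot of the percentages of inertia. The sketch below assumes `ggplot2` is installed and hard-codes the percentages reported by `get_eigenvalue(res.mca)`; the names `scree_df` and `scree_plot` are ours

```r
library(ggplot2)

# Percentages of inertia copied from the output of get_eigenvalue(res.mca)
scree_df <- data.frame(
  dimension = factor(1:7),
  percent   = c(22.35, 19.25, 15.94, 13.89, 11.05, 9.13, 8.39)
)

scree_plot <- ggplot(scree_df) +
  aes(x = dimension, y = percent) +
  geom_col() +                                         # one column per dimension
  geom_text(aes(label = sprintf("%.1f%%", percent)),   # label above each column
            vjust = -0.5) +
  labs(x = "Dimension", y = "% of inertia") +
  ggtitle("Titanic: hand-made screeplot")
```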

---

.fl.w-40.pa2.f6[

```r
res.mca.2 <- tit %>%
  select(Sex, Age, Embarked, Pclass, Survived) %>%
  MCA(quali.sup = c(5), graph = FALSE)

df_ind <- res.mca.2$ind$coord %>%
  as.data.frame() %>%
  bind_cols(Survived =tit$Survived) %>%
  drop_na()

df_ind %>%
  ggplot() +
  aes(x=`Dim 1`,
      y=`Dim 2`,
      colour=Survived) +
  geom_jitter(alpha=.2,
              width=.1,
              height = .1) +
  scale_color_viridis_d() +
  coord_fixed() +
  ggtitle("Titanic: individuals in principal coordinates")
```

]

.fl.w-60.pa2[

![](cm-11-EDA_files/figure-html/tita_ind-1.png)

]

---

.fl.w-40.pa2.f6[

```r
df_cat <- res.mca.2$var$coord %>%
  as.data.frame() %>%
  rownames_to_column(var="Category")

p <- df_cat %>%
  ggplot() +
  aes(x=`Dim 1`,
      y=`Dim 2`) +
  geom_point() +
  geom_text_repel(aes(label=Category)) +
  coord_fixed()

p +  ggtitle("Titanic: categories in principal coordinates ")
```

]

.fl.w-60.pa2[

![](cm-11-EDA_files/figure-html/tita_cat-1.png)

]

---

.fl.w-40.pa2.f6[

```r
p +
  geom_jitter(data=df_ind,
              aes(x=`Dim 1`,
                  y=`Dim 2`,
                  colour=Survived),
              alpha=.2,
              width=.1,
              height = .1
              ) +
  scale_color_viridis_d() +
  ggtitle("Titanic: biplot")
```
]

.fl.w-60.pa2[

![](cm-11-EDA_files/figure-html/tita_biplot-1.png)

]

---
template: inter-slide

## References

---

### Packages

- <svg aria-hidden="true" role="img" viewBox="0 0 581 512" style="height:1em;width:1.13em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"/></svg> [FactoMineR]()

- <svg aria-hidden="true" role="img" viewBox="0 0 448 512" style="height:1em;width:0.88em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M439.8 200.5c-7.7-30.9-22.3-54.2-53.4-54.2h-40.1v47.4c0 36.8-31.2 67.8-66.8 67.8H172.7c-29.2 0-53.4 25-53.4 54.3v101.8c0 29 25.2 46 53.4 54.3 33.8 9.9 66.3 11.7 106.8 0 26.9-7.8 53.4-23.5 53.4-54.3v-40.7H226.2v-13.6h160.2c31.1 0 42.6-21.7 53.4-54.2 11.2-33.5 10.7-65.7 0-108.6zM286.2 404c11.1 0 20.1 9.1 20.1 20.3 0 11.3-9 20.4-20.1 20.4-11 0-20.1-9.2-20.1-20.4.1-11.3 9.1-20.3 20.1-20.3zM167.8 248.1h106.8c29.7 0 53.4-24.5 53.4-54.3V91.9c0-29-24.4-50.7-53.4-55.6-35.8-5.9-74.7-5.6-106.8.1-45.2 8-53.4 24.7-53.4 55.6v40.7h106.9v13.6h-147c-31.1 0-58.3 18.7-66.8 54.2-9.8 40.7-10.2 66.1 0 108.6 7.6 31.6 25.7 54.2 56.8 54.2H101v-48.8c0-35.3 30.5-66.4 66.8-66.4zm-6.7-142.6c-11.1 0-20.1-9.1-20.1-20.3.1-11.3 9-20.4 20.1-20.4 11 0 20.1 9.2 20.1 20.4s-9 20.3-20.1 20.3z"/></svg> [prince](https://github.com/MaxHalford/prince)

---

background-image: url('./img/pexels-cottonbro-3171837.jpg')
background-size: cover

# The End