name: inter-slide class: left, middle, inverse {{ content }} --- name: layout-general layout: true class: left, middle <style> .remark-slide-number { position: inherit; } .remark-slide-number .progress-bar-container { position: absolute; bottom: 0; height: 4px; display: block; left: 0; right: 0; } .remark-slide-number .progress-bar { height: 100%; background-color: red; } </style>
--- class: middle, left, inverse # Exploratory Data Analysis : Multiple Correspondence Analysis ### 2021-12-10 #### [Master I MIDS & MFA]() #### [Analyse Exploratoire de Données](http://stephane-v-boucheron.fr/courses/eda/) #### [Stéphane Boucheron](http://stephane-v-boucheron.fr) --- exclude: true class: middle, left, inverse # Exploratory Data Analysis XI: Multiple Correspondance Analysis ### 2021-12-10 #### [EDA Master I MIDS et MFA](http://stephane-v-boucheron.fr/courses/eda) #### [Stéphane Boucheron](http://stephane-v-boucheron.fr) --- class: middle, inverse ##
### [Motivation](#bigpic) ### [Variants on mosaicplot](#variant-mosaic) ### [Indicator matrix](#indic-matrix) ### [MCA as CA on indicator matrix](#mca-in-words) ### [Illustrations](#) ??? ### [CCA](#cca) --- name: bigpic template: inter-slide ## Motivation: analyzing more than 2 categorical variables --- ### Different perspectives When handling two random variables, questions revolve around one topic: are they independent? From a numerical viewpoint, the chi-square divergence quantifies possible departure from independence A mosaicplot helps spotting associations between categories. Coloring tiles using Pearson residuals makes the pictures even more convenient Correspondance Analysis (CA) and the associated plots (screeplot, biplot, correlation circle plot) provide a geometric toolkit that complements the chi-square divergence and the different flavors of mosaicplots: the total inertia of CA is the chi-square divergence computed from the contingency table --- ### Different perspectives (continued) When handling more than two variables, several settings are possible - Some variables may be called _explanatory_, one variable may be considered as a _response_ variable. Is the response variable _dependent_ on the explanatory variables? If yes, we may wonder whether, conditionally on some explanatory variabless, the response variable is independent on the other explanatory variables - All variables share a similar status (explanatory or response), we may explore the relations between the variables, between categories and try to reduce dimension in some way In this session, we will address the different settings --- ### MCA (from documentations) > The aim of multiple correspondence analysis (MCA) is to summarise and visualise a data table where individuals are described by _qualitative_ variables with similar status > MCA is used to study the similarities between individuals from the point of view of all the variables and identify individuals' profiles > MCA is also used to assess relationships between variables and study the associations between categories > As with PCA and CA, the individuals or groups of individuals (rows) can be connected with categories of variables (columns) .fr.f6[From _R for statistics_, Cornillon et al. Chapman & Hall. Pub.] ??? > `mca` is a Multiple Correspondence Analysis (MCA) package for
, intended to be used with `pandas`. MCA is a feature extraction method; essentially PCA for categorical variables. .fr.f6[[mca homepage](https://github.com/esafak/mca)] Take-home message: MCA is a matrix factorization based method for exploring samples of categorical variables Questions: - Transforming samples of categorical variables into matrices - SVD Factorization - Relating the factors to the original data --- ### Useful packages ```r require(tidyverse) *require(FactoMineR) *require(factoextra) *require(FactoInvestigate) ``` --- exclude: true ### Questionnaires [](https://en.wikipedia.org/wiki/Questionnaire) ??? Questionnaires: - Definition - Usage - Example - Interpretation --- exclude: true ### Example: Questionnaire based on fictitious case reports [Case report on Wikipedia](https://en.wikipedia.org/wiki/Case_report) ??? - Definition of case report - Usage of fictitious case report - --- ### Handling more than 2 qualitative variables Example
Titanic data set (called `tit` in the sequel) `$$\begin{array}{ll}\text{Demographic/Explanatory} & \leftrightarrow \begin{cases}\text{Embarked} \\ \text{Sex} \\ \text{Passenger class} \\ \text{Age (Child/Adult)} \end{cases} \\ \phantom{\text{Demographic/Explanatory} } & \phantom{\text{Demographic/Explanatory}} \\ \text{Attitudinal/Response} & \leftrightarrow \text{Survived}\end{array}$$` ??? When handling a collection of qualitative variables, we may face several kinds of situations: we may investigate - response/attitudinal variables with respect to explanatory/demographic variables - collections of response/attitudinal variables - collections of explanatory/demographic variables MCA is geared towards investigating collections of variables of similar status ``` ## Warning: The following named parsers don't match the column names: Survived ``` ``` ## # A tibble: 6 × 4 ## Sex Age Embarked Pclass ## <fct> <fct> <fct> <ord> ## 1 male Adult S 3 ## 2 female Adult C 1 ## 3 female Adult S 3 ## 4 female Adult S 1 ## 5 male Adult S 3 ## 6 male NA.A Q 3 ``` --- ###
Mosaicplots for `n`-ways contingency tables .fl.w-50.pa2[ <img src="cm-11-EDA_files/figure-html/unnamed-chunk-7-1.png" width="504" /> ] .fl.w-50.pa2[ The interplay between the response variable `Survived` and the four explanatory variables using a naive mosaicplot is hard to spot The global arrangement of tiles has a huge impact on the interpretability of multi-dimensional mosaicplots Variants of mosaicplots like _double decker_ plots make the task easier ] ??? ```r tit %>% dplyr::select(Pclass, Embarked) %>% drop_na() %>% ggplot() + geom_mosaic(aes(x = product(Embarked, Pclass), fill=Embarked)) + labs(x= "Passenger class", y="Embarked") + scale_fill_viridis_d() + ggtitle("Titanic mosaic with tidyverse flavor") ``` In the Titanic table the four variables do have the same status: - Class, Age, Sex may be considered as "demographic", or "explanatory" - Survived is a "response" variable When using variants of mosaicplot to investigate the Titanic dataset --- name: variant-mosaic template: inter-slide ## Variations on Mosaicplot --- ### Mosaic plots versus Association plots > In order to explain multi-dimensional categorical data, statisticians typically look for (conditional) independence structures > Whether the task is purely exploratory or model-based, techniques such as _mosaic_ and _association_ plots offer good support for visualization .fr.f6[Structplot vignette] Before turning back to Titanic dataset, let us revisit the `UCBAdmissions` dataset Remember that the `UCBAdmissions` dataset was elaborated in the 1970's to assess whether the admission process at the different departments of UC Berkeley suffered from a gender bias Such a possibility was suggested by looking at the global admission rate for female and male candidates ??? > Both _mosaic_ and _association_ plots visualize aspects of (possibly higher-dimensional) contingency tables, with several extensions --- exclude: true ### Extensions - double-decker plots - spine plots - spinograms - conditional association plots --- ###
Order matters ```r aperm(UCBAdmissions, c(3, 2, 1)) %>% mosaicplot(shade=TRUE) aperm(UCBAdmissions, c(3, 2, 1)) %>% vcd::mosaic(shade=TRUE) ``` <img src="cm-11-EDA_files/figure-html/unnamed-chunk-9-1.png" width="50%" /><img src="cm-11-EDA_files/figure-html/unnamed-chunk-9-2.png" width="50%" /> .f6[ Both diagrams are mosaicplots The diagram on the right is easier to interpret: it provides a picture of admission rates per department and then gender ] --- ###
Double decker plot for `UCBAdmissions` .fl.w-60.pa2[ ```r aperm(UCBAdmissions, c(3, 2, 1)) %>% vcd::doubledecker() ``` <img src="cm-11-EDA_files/figure-html/unnamed-chunk-10-1.png" width="504" /> .f6[ Double decker plots are special kinds of mosaicplots - Per department admissions exhibit no clear bias against women - Men tend to fill more applications to less selective departments ] ] .fl.w-40.pa2.f6[ [Simpson's paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox) > A trend appears in several different groups of data but disappears or reverses when these groups are combined > This result is often encountered in social-science and medical-science statistics and is particularly problematic when frequency data is unduly given causal interpretations. > The paradox is also referred to as Simpson's reversal, Yule–Simpson effect, amalgamation paradox, or reversal paradox .fr[Wikipedia] ] ??? A double decker plot is a collection of stacked column/bar plots: For each department, we have a column plot where `Gender` is mapped to `x` and `Admit` to `y`. Columns are stacked. Width is proportional to number of applicants with given Gender for the department. Height is proportional to fraction of successful/failing applicants for given Department and Gender --- ###
Flattening the Titanic table ```r pacman::p_load(xtable) vcd::structable(~ Class + Age + Sex, aperm(Titanic, c(1,3,2, 4))) %>% xtable::xtableFtable() ``` .fl.w-50.pa2.f6[ | Class | Sex | Child| Adult| |:------:|:-------|------:|------:| |1st | Male | 5 | 175 | | | Female | 1 | 144 | | 2nd | Male | 11 | 168 | | | Female | 13 | 93 | | 3rd | Male | 48 | 462 | | | Female | 31 | 165 | | Crew | Male | 0 | 862 | | | Female | 0 | 23 | ] .fl.w-50.pa2.f6[ Flattenig allows to organize information Double decker plots allow to do this graphically ] --- ###
Double-decker plots for the Titanic data ```r vcd::doubledecker(Titanic) tit %>% filter(Survived %in% c('Survived', 'Deceased')) %>% filter(Age %in% c('Adult', 'Child')) %>% mutate(Age=fct_drop(Age)) %>% ggplot() + geom_mosaic(aes(x=product(Survived, Age, Sex, Pclass), fill=Survived), divider=ddecker()) + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5)) + scale_fill_viridis_d() ``` <img src="cm-11-EDA_files/figure-html/unnamed-chunk-12-1.png" width="50%" /><img src="cm-11-EDA_files/figure-html/unnamed-chunk-12-2.png" width="50%" /> ??? `Titanic` is a base
contingency table where `Pclass, Sex, Age, Survived` were cross-tabulated --- exclude: true ```r # knitr::include_url("https://rdrr.io/cran/vcd/") ``` --- exclude: true ### Strucplot framework - low-level grapcon functions - created by generating functions (grapcon generators) - `group_...`, `struc_...`, `labelling_...`, `legend_...`, `spacing_...` - a suitable combination of the low-level grapcon functions is passed as “hyperparameters” to strucplot() - convenience functions such as mosaic(), sieve(), assoc(), and doubledecker() which interface strucplot() ??? > The strucplot framework is highly modularized: Figure 5 shows the hierarchical relationship between the various components. On the lowest level, there are several groups of workhorse and parameter functions that directly or indirectly influence the final appearance of the plot (see Table 2 for an overview). These are examples of grapcon functions. They are created by generating functions (grapcon generators), allowing flexible parameterization and extensibility (Figure 5 only shows the generators). The generator names follow the naming convention group_... (), where group reflects the group the generators belong to (strucplot core, labeling, legend, shading, or spacing). The workhorse functions (created by struc_... (), labeling_... (), and legend_... ()) directly produce graphical output (i.e., “add ink to the canvas”), whereas the parameter functions (created by spacing_... () and shading_... ()) compute graphical parameters used by the others. The grapcon functions returned by struc_... () implement the core functionality, creating the tiles and their content. On the second level of the framework, a suitable combination of the low-level grapcon functions (or, alternatively, corresponding generating functions) is passed as “hyperparameters” to strucplot(). This central function sets up the graphical layout using grid viewports (see Figure 6), and coordinates the specified core, labeling, shading, and spacing functions to produce the plot. On the third level, we provide several convenience functions such as mosaic(), sieve(), assoc(), and doubledecker() which interface strucplot() through sensible parameter defaults and support for model formulae. Finally, on the fourth level, there are “related” vcd functions (such as cotabplot() and the pairs() methods for table objects) arranging collections of plots of the strucplot framework into more complex displays (e.g., by means of panel functions) --- name: indic-matrix template: inter-slide ##
Indicator and Burt matrices --- ###
Two perspectives MCA can be viewed along two perspectives: - Analyzing the dataset after performing _one-hot_ encoding of categorical variables: investigating the so-called _indicator_ matrix - Analyzing all pairwise two-way contingency tables derived from the dataset: : investigating the so-called _Burt_ matrix ??? The two perspective define different pipelines --- ### A glimpse at the indicator matrix .f6[ Function `tab.disjonctif()` from `FactoMineR` builds an _indicator_ matrix starting from a dataframe with categorical columns ```r Z <- tit %>% drop_na() %>% select(Sex, Age, Embarked, Pclass) %>% * tab.disjonctif() Z %>% head() %>% knitr::kable() ``` <table> <thead> <tr> <th style="text-align:right;"> female </th> <th style="text-align:right;"> male </th> <th style="text-align:right;"> Child </th> <th style="text-align:right;"> Adult </th> <th style="text-align:right;"> NA.A </th> <th style="text-align:right;"> S </th> <th style="text-align:right;"> C </th> <th style="text-align:right;"> Q </th> <th style="text-align:right;"> 1 </th> <th style="text-align:right;"> 2 </th> <th style="text-align:right;"> 3 </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> </tr> </tbody> </table> ] ??? `tab.disjonctif` is a shorthand for _tableau disjonctif complet_, the French name of indicator matrix --- ### Construction of indicator matrix - A categorical variable `\(V_j\)` (factor) with `\(q\)` levels is mapped to `\(q\)` `\(\{0,1\}\)` -valued variables `\(V_{j,r}\)` for `\(r \leq q\)` - If levels are indexed by `\(\{1, \ldots, q\}\)`, if the value of the categorical variable `\(V_j\)` from row `\(i\)` is `\(k \in \{1, \ldots, q\}\)`, the binary variables `\(V_{j,r}\)` on that row take values `$$k \mapsto \underbrace{0,\ldots, 0}_{k-1}, 1, \underbrace{0, \ldots, 0}_{q-k}$$` - In Machine Learning parlance building the indicator matrix consists of performing _one-hot encoding_ for each categorical variable - The indicator matrix has as many rows as the data matrix - The number of columns of the indicator matrix is the sum of the number of levels of the categorical variables/columns of the data matrix - The indicator matrix is a numerical matrix. It is suitable for factorial methods
??? --- ### The Burt matrix .fl.w-40.pa2.f6[ - Multiple Correspondance Analysis may be based on the Burt matrix - The Burt matrix is a symmetric integer-valued matrix made of blocks consisting of all pairwise 2-ways contingency tables - Each block is the contingency table defined by two categorical variables from the data matrix - Diagonal blocks are diagonal sub-matrices ] .fl.w-60.pa2.f6[ ```r B <- t(Z) %*% as.matrix(Z) B[1:6, 1:6] %>% knitr::kable() ``` <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> female </th> <th style="text-align:right;"> male </th> <th style="text-align:right;"> Child </th> <th style="text-align:right;"> Adult </th> <th style="text-align:right;"> NA.A </th> <th style="text-align:right;"> S </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> female </td> <td style="text-align:right;"> 95 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> 74 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 56 </td> </tr> <tr> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 107 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 86 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 73 </td> </tr> <tr> <td style="text-align:left;"> Child </td> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 23 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 17 </td> </tr> <tr> <td style="text-align:left;"> Adult </td> <td style="text-align:right;"> 74 </td> <td style="text-align:right;"> 86 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 160 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 99 </td> </tr> <tr> <td style="text-align:left;"> NA.A </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 19 </td> <td style="text-align:right;"> 13 </td> </tr> <tr> <td style="text-align:left;"> S </td> <td style="text-align:right;"> 56 </td> <td style="text-align:right;"> 73 </td> <td style="text-align:right;"> 17 </td> <td style="text-align:right;"> 99 </td> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 129 </td> </tr> </tbody> </table> A glimpse at stacked 2-ways contingency tables: rows 3,..., 5 and columns 1,2 contain the contingency table defined by variables `Age` and `Sex` ] ??? All pairwise contingency tables `$$B = Z^T \times Z$$` --- template: inter-slide name: mca-in-words ## MCA in words --- ###
Acknowledgments .fl.w-40.pa2[
Expositions of MCA is taken from ![](./img/greenacre.jpg) ] .fr.w-60.pa2[
Our favorite CA and MCA package is <iframe src="http://factominer.free.fr/factomethods/multiple-correspondence-analysis.html" width="750px" height="400px"></iframe> ] --- ### MCA on
Titanic
MCA consists of performing CA (correspondance analysis) on the _indicator matrix_ .fl.w-40.pa2[ ```r res.mca <- tit %>% drop_na() %>% select(Sex, Age, Embarked, Pclass) %>% * MCA(graph = FALSE) ``` We perform MCA on the data matrix made from four similar qualitative columns: `Sex`, `Age`, `Embarked`, `Pclass` ] .fl.w-60.pa2[ `res.mca` is a list with class attributes `MCA` and `list`. It contains many related components
Component `svd` of `res.mca$svd` can righteously be considered as the most important part of the result Other components are either byproducts of the computation of `res.mca$svd` or derived from `res.mca$svd` so as to facilitate reporting either numerical or graphical ] ??? --- ### Output of `print(res.mca)` .f7[ | | Name | Description | |:-:|:------------------------|:----------------------------------------------------| |1 | "$eig" |"eigenvalues" | |2 | "$var" |"results for the variables" | |3 | "$var$coord" |"coord. of the categories" | |4 | "$var$cos2" |"cos2 for the categories" | |5 | "$var$contrib" |"contributions of the categories" | |6 | "$var$v.test" |"v-test for the categories" | |7 | "$ind" |"results for the individuals" | |8 | "$ind$coord" |"coord. for the individuals" | |9 | "$ind$cos2" |"cos2 for the individuals" | |10 | "$ind$contrib" |"contributions of the individuals" | |11 | "$quali.sup" |"results for the supplementary categorical variables"| |12 | "$quali.sup$coord" |"coord. for the supplementary categories" | |13 | "$quali.sup$cos2" |"cos2 for the supplementary categories" | |14 | "$quali.sup$v.test" |"v-test for the supplementary categories" | |15 | "$call" |"intermediate results" | |16 | "$call$marge.col" |"weights of columns" | |17 | "$call$marge.li" |"weights of rows" | ] --- ### Comment on output of `print(res.mca)`
`res.mca$svd` is not part of the output - `eig` is computed from the singular values in `res.mca$svd` - `var` contains material for plotting information about categories and variables on factorial planes - `ind` conatins material for plotting information about individuals on on factorial planes ??? --- ###
Understanding MCA, the way it works, the ways it can be used amounts to understand the steps that lead to the computation of `res.mca$svd` Asserting that, by default, MCA consists of performing Correspondance Analysis (CA) on the indicator matrix deserves some explanation In order to make the argument self-contained, we first skech what CA is and how it relates to (extensions of) SVD Then, we explain what it means to perform CA on the indicator matrix Finally we relate `res.mca$svd` with the extended SVD of the residual matrix of the indicator matrix --- template: inter-slide name: brush-up-CA ## Brush up your CA --- ### CA executive summary - Start from a 2-way contingency table `\(X\)` with `\(\sum_{i,j} X_{i,j}=N\)` - Normalize `\(P = \frac{1}{N}X\)` (_correspondance matrix_) - Let `\(r\)` (resp. `\(c\)`) be the row (resp. column) wise sums vector - Let `\(D_r=\text{diag}(r)\)` denote the diagonal matrix with row sums of `\(P\)` as coefficients - Let `\(D_c=\text{diag}(c)\)` denote the diagonal matrix with column sums of `\(P\)` as coefficients + The _row profiles matrix_ is `\(D_r^{-1} \times P\)` + The _standardized residuals matrix_ is `\(S = D_r^{-1/2} \times \left(P - r c^T\right) \times D_c^{-1/2}\)` CA consists in computing the SVD of the standardized residuals matrix `\(S = U \times D \times V^T\)` From the SVD, we get - `\(D_r^{-1/2} \times U\)` standardized coordinates of rows - `\(D_c^{-1/2} \times V\)` standardized coordinates of columns - `\(D_r^{-1/2} \times U \times D\)` principal coordinates of rows - `\(D_c^{-1/2} \times V \times D\)` principal coordinates of columns - Squared singular values: the principal inertia ??? When calling `svd(.)`, the argument should be `$$D_r^{1/2}\times \left(D_r^{-1} \times P \times D_c^{-1}- \mathbf{I}\times \mathbf{I}^T \right)\times D_c^{1/2}$$` --- ### CA and extended SVD As `$$D_r^{-1} \times P \times D_c^{-1} - \mathbf{I}\mathbf{I}^T = (D_r^{-1/2} \times U)\times D \times (D_c^{-1/2}\times V)^T$$` `\((D_r^{-1/2} \times U)\times D \times (D_c^{-1/2}\times V)^T\)` is the _extended SVD_ of `$$D_r^{-1} \times P \times D_c^{-1} - \mathbf{I}\mathbf{I}^T$$` with respect to `\(D_r\)` and `\(D_c\)` --- ### CA and reconstructions formulae --- template: inter-slide ## Performing CA on indicator matrix --- ### MCA: CA on indicator matrix Let `\(X\)` be the data matrix with `\(n\)` rows (individuals) and `\(p\)` categorical columns (variables) For `\(j \in \{1, \ldots, p\}\)`, let `\(J_j\)` denote the number of levels(categories) of variable `\(j\)` Let `\(q = \sum_{j\leq p} J_j\)` be the sum of the number of levels throughout the variables Let `\(Z\)` be the incidence matrix with `\(n\)` rows and `\(q\)` columns For `\(j\leq p\)` and `\(k \leq J_j\)`, let `\(\langle j, k\rangle = \sum_{j'<j} J_{j'}+k\)` Let `\(N = n \times p = \sum_{i\leq n} \sum_{j \leq p} X_{i,j}\)` and `\(P = \frac{1}{N} Z\)` (the _correspondence matrix_ for MCA)
The row wise sums of correspondence matrix `\(P\)` are all equal to `\(1/n=p/N\)` The column wise sum of the correspondence matrix `\(P\)` for the `\(k\)`th level of the `\(j\)`th variable of `\(X\)` ( `\(j \leq p\)` ) is `\(N_{\langle j,k\rangle}/N = f_{\langle j,k\rangle}/p\)` where `\(f_{\langle j,k\rangle}\)` stands for the relative frequency of level `\(k\)` of the `\(j\)`th variable `$$D_r = \frac{1}{n}\text{Id}_n\qquad D_c =\text{diag}\left(\frac{f_{\langle j,k\rangle}}{p}\right)_{j \leq p, k\leq J_j}$$` --- ### MCA: CA on incidence matrix (continued) Let `\(r= D_r \times \mathbb{I}_n = \frac{1}{n} \mathbb{I}_p\)` and `\(c = D_c \times \mathbb{I}_q\)` In MCA, we compute the SVD `\(U \times D \times V^T\)` of the standardized residuals matrix: `$$S = D_r^{-1/2}\times \left(P - r\times c^T\right) \times D_c^{-1/2} = \sqrt{n}\left(P - r\times c^T\right) \times D_c^{-1/2}$$` Coefficient `\(i, \langle j, k\rangle\)` of `\(S\)` is `$$\frac{\mathbb{I}_{i, \langle j, k\rangle}- f_{\langle j,k\rangle}}{\sqrt{n f_{\langle j,k\rangle}/p}}$$`
--- ###
Hand-made MCA on Titanic data .f6[ ```r tol <- 1e-10 X <- select(tit, Pclass, Sex, Embarked, Age) %>% drop_na() p <- ncol(X) Z <- tab.disjonctif(X) #<< indicator matrix n <- nrow(Z) N <- sum(Z) assert_that(N == n * p) P <- Z/N #<< correspondence matrix r <- rowSums(P) Dr <- diag(r) assert_that(all(r == 1/n)) c <- colSums(P) Dc <- diag(c) S <- (P - r %o% c) #<< residuals assert_that(abs(sum(S))<= 2 * tol) SS <- diag(sqrt(r^(-1))) %*% S %*% diag(sqrt(c^(-1))) #<< standardized residuals svd.SS <- svd(SS) #<< bare MCA ``` ] --- ###
Hand-made MCA on Titanic data (continued) .f6[ We may now compare `svd.SS`, the SVD of the standardized residuals `SS` with member `res.mca$svd` of `res.mca <- MCA(X)` The singular values coincide
```r assert_that(all(abs(res.mca$svd$vs - svd.SS$d) <= 10 * tol)) ``` Matrix `res.mca$svd$U` is orthogonal with respect to `\(D_r\)`: Matrix `res.mca$svd$U` equals `\(D_r^{-1/2} \times U\)` (up to sign changes and numerical errors) Matrix `res.mca$svd$V` is orthogonal with respect to `\(D_c\)`: Matrix `res.mca$svd$V` equals `\(D_c^{-1/2} \times V\)` (up to sign changes and numerical errors) ]
`MCA(X)$svd` contains the _extended SVD_ of `\(D_r^{-1} \times P \times D_c^{-1} - \mathbf{I}\mathbf{I}^T\)` --- template: inter-slide ## Zooming on other components of `res.mca` --- ### Component `res.mca$eig` .fl.w-50.pa2.f6[ ```r eigv <- res.mca$svd$vs^2 tibble(eigenvalue=eigv, `% variance`=eigv/sum(eigv), `cum. % variance`= cumsum(eigv)/sum(eigv)) %>% rownames_to_column(var="Dimension") %>% head() %>% knitr::kable(digits=2) ``` <table> <thead> <tr> <th style="text-align:left;"> Dimension </th> <th style="text-align:right;"> eigenvalue </th> <th style="text-align:right;"> % variance </th> <th style="text-align:right;"> cum. % variance </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 0.39 </td> <td style="text-align:right;"> 0.22 </td> <td style="text-align:right;"> 0.22 </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:right;"> 0.34 </td> <td style="text-align:right;"> 0.19 </td> <td style="text-align:right;"> 0.42 </td> </tr> <tr> <td style="text-align:left;"> 3 </td> <td style="text-align:right;"> 0.28 </td> <td style="text-align:right;"> 0.16 </td> <td style="text-align:right;"> 0.58 </td> </tr> <tr> <td style="text-align:left;"> 4 </td> <td style="text-align:right;"> 0.24 </td> <td style="text-align:right;"> 0.14 </td> <td style="text-align:right;"> 0.71 </td> </tr> <tr> <td style="text-align:left;"> 5 </td> <td style="text-align:right;"> 0.19 </td> <td style="text-align:right;"> 0.11 </td> <td style="text-align:right;"> 0.82 </td> </tr> <tr> <td style="text-align:left;"> 6 </td> <td style="text-align:right;"> 0.16 </td> <td style="text-align:right;"> 0.09 </td> <td style="text-align:right;"> 0.92 </td> </tr> </tbody> </table> ] .fr.w-50.pa2[ The table on the left contains the first rows of `res.mca$eig` Thanks to the Eckhart-Young Theorem for extended SVD, component `$eig` tells us how well the matrix `$$D_r^{-1} \times(P -r \times c^T) \times D_c^{-1}$$` can be approximated by low rank matrices according to the Hilbert-Schmidt norm with respect to `\(D_r\)` and `\(D_c\)` ] --- ### Component `res.mca$var` .fl.w-50.pa2.f6[ ```r coord <- res.mca$var$coord contrib <- res.mca$var$contrib cos2 <- res.mca$var$cos2 assert_that(norm(res.mca$svd$V %*% diag(res.mca$svd$vs[1:8]) - coord, "F") <= tol) tmp <- t(t(coord^2) %*% Dc) %*% diag(1/res.mca$svd$vs[1:8]^2) assert_that(norm(100 * tmp - contrib, "F") <= tol) assert_that(norm(cos2 - coord^2/rowSums(coord^2), "F") <= tol) ``` ] .fl.w-50.pa2[ `res.mca$var` is a list of 3 matrices with the same dimensions `\(q\)` (sum of level numbers) and `\(q-p\)` `$coord` is made of the principal coordinates of columns `\(D_c^{-1/2} \times V \times D\)`. Each row corresponds to one column of the indicator matrix `$contrib` ... `$cos2` is derived from `$coord`. For each row (column of the indicator matrix) `\(\langle j, k\rangle\)`, for each dimension `\(i\)`, the `\(\langle j, k\rangle, i\)` coefficient of `$cos2` is the squared cosine of the angle between the row of `\(\langle j, k\rangle\)` row of `$coord` and the `\(i\)`th right singular vector ] ??? --- ### Component `res.mca$ind` .fl.w-50.pa2.f6[ ```r coord <- res.mca$ind$coord contrib <- res.mca$ind$contrib cos2 <- res.mca$ind$cos2 assert_that(norm(res.mca$svd$U %*% diag(res.mca$svd$vs[1:8]) - coord, "F") <= tol) tmp <- 100 * t(t(coord^2) %*% Dr) %*% diag(1/res.mca$svd$vs[1:8]^2) assert_that(norm(tmp - contrib, "F") <= tol) assert_that(norm(cos2 - coord^2/rowSums(coord^2), "F") <= tol) ``` ] .fl.w-50.pa2[ `res.mca$ind` is a list of 3 matrices with the same dimensions `\(n\)` (sum of level numbers) and `\(q-p\)` `$coord` is made of the principal coordinates of columns: `\(D_r^{-1/2} \times U \times D\)`. Each row corresponds to one row of the indicator matrix `$contrib` ... `$cos2` is derived from `$coord`. For each row (row of the indicator matrix) `\(j\)`, for each dimension `\(i\)`, the `\(j, i\)` coefficient of `$cos2` is the squared cosine of the angle between the `\(j\)`th row of `$coord` and the `\(i\)`th left singular vector ] --- ### More about MCA (from `FactoMineR`) `FactoMineR::MCA` does much more than computing an SVD on a standardized residuals matrix In real life MCA is performed on a subset of categorical columns of some data frame (the so-called _active_ columns). The result may help understanding the interplay between the active variables and the other variables > MCA performs Multiple Correspondence Analysis (MCA) with supplementary individuals, supplementary quantitative variables and supplementary categorical variables. > Performs also Specific Multiple Correspondence Analysis with supplementary categories and supplementary categorical variables. > Missing values are treated as an additional level, categories which are rare can be ventilated ... .fr.f6[From FactomineR documentation] ??? - supplementary categories and - supplementary categorical variables - ventilated: --- ### Result of `MCA` Beyond `$var`, `$ind`, `$eig`, `res.mca` contains further elements, including: - `ind.sup` a list of matrices containing all the results for the supplementary individuals (coordinates, square cosine) - `quanti.sup` a matrix containing the coordinates of the supplementary quantitative variables (the correlation between a variable and an axis is equal to the variable coordinate on the axis) - `quali.sup` a list of matrices with all the results for the supplementary categorical variables (coordinates of each categories of each variables, square cosine and v.test which is a criterion with a Normal distribution, square correlation ratio) - `call` a list with some statistics --- ###
In a [tidy universe](http::tidyverse.org), there would exist `tidy`, `augment` and `glance` methods for class `MCA` just as there are such methods for class `prcomp` (used to perform PCA) Many components of `res.mca` could be computed by methods like `tidy`, `augment` and `glance` and `MCA` would just return the extended SVD, matrices `\(D_r\)`, `\(D_c\)` and some information from the call, like the names of the active variables, the levels of each active variables, a vector recording the active individuals --- template: inter-slide name: mca-viz ## Visualization --- ### Inspecting Titanic using `factoextra`: screeplot .fl.w-40.pa2.f6[ ```r knitr::kable(get_eigenvalue(res.mca), digits=2) ``` <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> eigenvalue </th> <th style="text-align:right;"> variance.percent </th> <th style="text-align:right;"> cumulative.variance.percent </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Dim.1 </td> <td style="text-align:right;"> 0.39 </td> <td style="text-align:right;"> 22.35 </td> <td style="text-align:right;"> 22.35 </td> </tr> <tr> <td style="text-align:left;"> Dim.2 </td> <td style="text-align:right;"> 0.34 </td> <td style="text-align:right;"> 19.25 </td> <td style="text-align:right;"> 41.60 </td> </tr> <tr> <td style="text-align:left;"> Dim.3 </td> <td style="text-align:right;"> 0.28 </td> <td style="text-align:right;"> 15.94 </td> <td style="text-align:right;"> 57.54 </td> </tr> <tr> <td style="text-align:left;"> Dim.4 </td> <td style="text-align:right;"> 0.24 </td> <td style="text-align:right;"> 13.89 </td> <td style="text-align:right;"> 71.43 </td> </tr> <tr> <td style="text-align:left;"> Dim.5 </td> <td style="text-align:right;"> 0.19 </td> <td style="text-align:right;"> 11.05 </td> <td style="text-align:right;"> 82.48 </td> </tr> <tr> <td style="text-align:left;"> Dim.6 </td> <td style="text-align:right;"> 0.16 </td> <td style="text-align:right;"> 9.13 </td> <td style="text-align:right;"> 91.61 </td> </tr> <tr> <td style="text-align:left;"> Dim.7 </td> <td style="text-align:right;"> 0.15 </td> <td style="text-align:right;"> 8.39 </td> <td style="text-align:right;"> 100.00 </td> </tr> </tbody> </table> ] .fl.w-60.pa2.f6[ ```r fviz_screeplot(res.mca, addlabels=TRUE) ``` <img src="cm-11-EDA_files/figure-html/unnamed-chunk-27-1.png" width="504" /> ] ??? compare with `broom::tidy()` etc for objects of class `pca` --- ###
Plotting colums from data matrix .fl.w-40.pa2[ ```r fviz_mca_var(res.mca, choice = "var", ) + coord_fixed() + ggtitle("MCA Titanic, variables") ``` ] .fl.w-60.pa2[ ![](cm-11-EDA_files/figure-html/tita_var-1.png) ] --- ### Hand-made plots The `plot` method for class `MCA` and the functions in `factoextra` provide off-the-shelf constructions for classical MCA plots There is nothing special about MCA plots - the screeplot is a column plot - the other plots are (sometimes decorated) scatter plots after some coordinate change In the sequel, we build plots for categories, individuals and biplots using the constructs from `ggplot2` --- ###
Plotting individuals on first factorial plane .fl.w-40.pa2.f6[ ```r res.mca.2 <- tit %>% select(Sex, Age, Embarked, Pclass, Survived) %>% MCA(quali.sup = c(5), graph = FALSE) df_ind <- res.mca.2$ind$coord %>% as.data.frame() %>% bind_cols(Survived =tit$Survived) %>% drop_na() df_ind %>% ggplot() + aes(x=`Dim 1`, y=`Dim 2`, colour=Survived) + geom_jitter(alpha=.2, width=.1, height = .1) + scale_color_viridis_d() + coord_fixed() + ggtitle("Titanic: indivudals in principal coordinates ") ``` ] .fl.w-60.pa2[ ![](cm-11-EDA_files/figure-html/tita_ind-1.png) ] --- ###
Plotting categories on first factorial plane .fl.w-40.pa2.f6[ ```r df_cat <- res.mca.2$var$coord %>% as.data.frame() %>% rownames_to_column(var="Category") p <- df_cat %>% ggplot() + aes(x=`Dim 1`, y=`Dim 2`) + geom_point() + geom_text_repel(aes(label=Category)) + coord_fixed() p + ggtitle("Titanic: categories in principal coordinates ") ``` ] .fl.w-60.pa2[ ![](cm-11-EDA_files/figure-html/tita_cat-1.png) ] --- ###
MCA biplot .fl.w-40.pa2.f6[ ```r p + geom_jitter(data=df_ind, aes(x=`Dim 1`, y=`Dim 2`, colour=Survived), alpha=.2, width=.1, height = .1 ) + scale_color_viridis_d() + ggtitle("Titanic: biplot") ``` ] .fl.w-60.pa2[ ![](cm-11-EDA_files/figure-html/tita_biplot-1.png) ] --- template: inter-slide ## References --- ### Packages -
[FactoMineR]() -
[ade4]() -
[ca]() -
[prince](https://github.com/MaxHalford/prince) --- class: middle, center, inverse background-image: url('./img/pexels-cottonbro-3171837.jpg') background-size: cover # The End