name: layout-general layout: true class: left, middle <style> .remark-slide-number { position: inherit; } .remark-slide-number .progress-bar-container { position: absolute; bottom: 0; height: 4px; display: block; left: 0; right: 0; } .remark-slide-number .progress-bar { height: 100%; background-color: red; } </style>
--- name: inter-slide class: left, middle, inverse {{content}} --- class: middle, left, inverse # Exploratory Data Analysis : EDA Correspondence Analysis ### 2022-03-23 #### [Master I MIDS & MFA]() #### [Analyse Exploratoire de Données](http://stephane-v-boucheron.fr/courses/eda/) #### [Stéphane Boucheron](http://stephane-v-boucheron.fr) --- exclude: true template: inter-slide # Exploratory Data Analysis VII CA ### 2022-03-23 #### [EDA Master I MIDS et MFA](/courses/eda/index.html) #### [Stéphane Boucheron](http://stephane-v-boucheron.fr) --- template: inter-slide ##
### [Exploring two-way contingency tables](#motivca) ### [Chi-square divergence](#chi-deux-section) ### [Extending SVD](#revSVD) ### [Correspondence Analysis as extended SVD](#basicca) ### [Illustrations](#factoshowcase) --- name: motivca template: inter-slide ## Motivations --- In this lesson, we further explore the connection between two qualitative variables, that is, we attempt to go beyond Lesson [Bivariate statistics](#biv). Our starting point is a qualitative bivariate sample. Consider the celebrated `UCBAdmissions` dataset. According to the `R` documentation, this dataset is made of > Aggregate data on applicants to graduate school at Berkeley for the six largest departments in 1973 classified by admission and sex. This is a compilation of 4526 application files. For each application file, three variables have been reported: the `department`, the `gender` of the applicant, and whether the applicant has been `admitted`. The dataset is a trivariate sample. In the sequel, we make it a bivariate sample by focusing on `Gender` and `Admit`. ??? The dataset was gathered because the admission process at UCB graduate schools was suspected to be biased against women. --- ### Plotting two-way contingency tables As in Lesson [Bivariate statistics](#biv), we start by collecting the 2-way _contingency table_ and displaying a _mosaicplot_. .panelset[ .panel[.panel-name[`mosaicplot`] .fl.w-50.pa2[ ```r *mosaicplot(~ Gender + Admit, * UCBAdmissions) ``` ] .fl.w-50.pa2[ ![](cm-7-EDA_files/figure-html/mosaic1-1.png) ] ] .panel[.panel-name[`mosaic` (from `vcd`)] .fl.w-50.pa2[ ```r apply(UCBAdmissions, MARGIN = c(1,2), sum) %>% as.table() %>% * vcd::mosaic(formula=Admit~Gender, * data=.) ``` ] .fl.w-50.pa2[ ![](cm-7-EDA_files/figure-html/mosaic2-1.png) ] ] .panel[.panel-name[`ggmosaic`] .fl.w-50.pa2[ .f6[ If, rather than the contingency table, we have the original data (a dataframe with two categorical columns), we can visualize the contingency table using `geom_mosaic` from `ggmosaic`. Conforming to the `ggplot2` API, `geom_mosaic` works on a native dataframe and computes _statistics_ from it; these statistics are then used to draw the picture. ```r titanic_train %>% dplyr::select(Pclass, Embarked) %>% drop_na() %>% ggplot() + geom_mosaic(aes(x = product(Embarked, Pclass), fill=Embarked)) + labs(x= "Passenger class", y="Embarked") + scale_fill_viridis_d() + ggtitle("Titanic mosaic with tidyverse flavor") ``` ] ] .fl.w-50.pa2[ ![](cm-7-EDA_files/figure-html/mosaic3-1.png) ] ] ] ??? This mosaicplot suggests that Admission and Gender are associated. [ggmosaic](https://cran.r-project.org/web/packages/ggmosaic/vignettes/ggmosaic.html) --- template: inter-slide name: chi-deux-section ## Chi-square divergence --- name: chisqDist In order to assess the association between two qualitative variables, we use an _information divergence_ between the joint distribution and the product of the two marginal distributions. In correspondence analysis, the relevant information divergence is the `\(\chi^2\)` (chi-square) divergence. ### Definition Chi-square divergence Let `\(P\)` and `\(Q\)` be two probability distributions over the same measurable space.
The `\(\chi^2\)` (chi-square) divergence from `\(P\)` to `\(Q\)` is defined as - `\(\infty\)` if `\(P\)` is not absolutely continuous with respect to `\(Q\)` and - `\(\chi^2(P \Vert Q) = \mathbb{E}_Q\Big[ \left(\frac{\mathrm{d}P}{\mathrm{d}Q} - 1\right)^2\Big] \qquad \text{ otherwise}\)` --- ### `\(f\)`-divergences The `\(\chi^2\)`-divergence belongs to the family of _information divergences_ or `\(f\)`-divergences, which fit the following expression: if `\(P\)` is absolutely continuous with respect to `\(Q\)`, `$$I_f(P, Q) = \mathbb{E}_Q\Big[ f\left(\frac{\mathrm{d}P}{\mathrm{d}Q}\right)\Big]$$` with `\(f\)` convex and `\(f(1)=0\)`. - for `\(f(x)= x \log (x)\)`, the information divergence is the _relative entropy_ or _Kullback-Leibler divergence_ `\(D(P \Vert Q)\)` - for `\(f(x)= |x -1|\)`, the information divergence is the _total variation distance_ `\(\mathrm{d}_{\text{TV}}(P,Q)\)` - for `\(f(x)= (x-1)^2\)`, the `\(f\)`-divergence is the `\(\chi^2\)` _divergence_ `\(\chi^2(P \Vert Q)\)` --- name: chisqassociation ### Definition Chi-Square association index Let `\(P\)` be a probability distribution over `\(\mathcal{X} \times \mathcal{Y}\)` (a bivariate distribution) with marginal distributions `\(P_X\)` and `\(P_Y\)`. The `\(\chi^2\)`-association index is defined as `$$\chi^2(P \Vert P_X \otimes P_Y)$$` --- ### Application A 2-way contingency table like `ucb2` (the aggregated UCB admissions table) defines a joint distribution `\(P\)` on couples from `\(\mathcal{X} \times \mathcal{Y}\)` as well as two marginal distributions: `$$P\{(a,b)\} = \frac{N_{a,b}}{n} \qquad P_X\{a\} = \frac{N_{a, .}}{n} \qquad P_Y\{b\} = \frac{N_{.,b}}{n}$$` In the sequel, `\(w_a = P_X\{a\} = \frac{N_{a, .}}{n}\)` for every `\(a\in \mathcal{X}\)`. The `\(\chi^2\)` divergence defined by the two-way contingency table is `$$\begin{array}{rl} \chi^2(P \Vert P_X \otimes P_Y) & = \sum_{a \in \mathcal{X}, b \in \mathcal{Y}} \frac{N_{a,\cdot} N_{\cdot, b}}{n^2} \left( \frac{n N_{a,b}}{N_{a,\cdot} N_{\cdot, b}} - 1\right)^2 \\ & = \sum_{a \in \mathcal{X}} \frac{N_{a, .}}{n} \sum_{b \in \mathcal{Y}} \frac{\left( \frac{N_{a,b}}{N_{a,\cdot}} - \frac{N_{\cdot, b}}{n} \right)^2}{N_{.,b}/n} \\ & = \sum_{a \in \mathcal{X}} w_a \times \chi^2\left(P_Y (\cdot | X=a) \Vert P_Y\right) \end{array}$$` ??? The last expression shows that the `\(\chi^2\)` association index is a (convex) combination of the `\(\chi^2\)` divergences between the conditional distributions of `\(Y\)` given the values of `\(X\)` and the marginal distribution of `\(Y\)`. --- ### Pearson association index Given a two-way contingency table, the Pearson association index is defined as `$$n \times \chi^2(P \Vert P_X \otimes P_Y) = \sum_{a \in \mathcal{X},b \in \mathcal{Y}} \frac{\left({N_{a,b}} - \frac{N_{a, .}}{n}N_{., b} \right)^2}{N_{a,.} N_{.,b}/n}$$` --- Note the factor `\(n\)`. It is motivated by testing theory. Indeed, if we collect a bivariate sample of length `\(m\)` according to `\(P_X \otimes P_Y\)` (that is, if the two variables are stochastically independent) and compute the contingency table defined by this random sample, then the Pearson association index is a random variable. If `\(m\)` is large enough, the distribution of the Pearson association index is close (in the topology of weak convergence) to a `\(\chi^2\)`-distribution with `\((|\mathcal{X}|-1)(|\mathcal{Y}|-1)\)` degrees of freedom.
The fact that the distribution of the Pearson association index is close to a probability distribution that does not depend on the details of `\(P_X\)` and `\(P_Y\)` is important, both theoretically and practically. We may compare the computed Pearson association index to the quantiles of the `\(\chi^2\)` distribution with the adequate number of degrees of freedom. --- ### Illustration UCB admission data For the UCB admission data, this leads to ```r M <- apply(UCBAdmissions, MARGIN = c(1,2), sum) rowsums <- M %*% matrix(1, nrow=2, ncol=1) colsums <- matrix(1, nrow=1, ncol=2) %*% M n <- sum(M) *sum((M - rowsums %*% colsums/n)^2/(rowsums %*% colsums/n)) ``` ``` [1] 92.20528 ``` ```r chisq.test(as.table(apply(UCBAdmissions, MARGIN = c(1,2), sum)), correct = FALSE) ``` ``` Pearson's Chi-squared test data: as.table(apply(UCBAdmissions, MARGIN = c(1, 2), sum)) X-squared = 92.205, df = 1, p-value < 2.2e-16 ``` --- ### Tentative interpretation This result suggests that admission and gender are associated. Further interpretation requires more visualization tools and modeling assumptions. We shall first extend the matrix analytic notions that underpin Principal Component Analysis, that is, SVD and Hilbert-Schmidt-Frobenius norms. In [Recipes](#basicca), we construct _Correspondence Analysis_ as a pair of dual _extended SVDs_, known as row-profile and column-profile analyses. In [Illustrations](#factoshowcase), we illustrate the recipes on examples using
packages such as `FactoMineR` --- name: revSVD template: inter-slide ## Revisiting and extending SVD --- Just like PCA, correspondence analysis (CA) builds on Singular Value Decomposition
CA consists of computing the SVD of a _transformation_ of the two-way contingency table. This transformation is best explained by extending the SVD definition. --- ### Digression For `\(\mathbb{R}^p\)`, the Euclidean norm of `\(x\)` is defined by `\(\| x \|^2 = \sum_{i=1}^p x_i^2 = \langle x, x\rangle\)` If `\(A \in \mathcal{M}_{p,p}\)` is a symmetric positive definite matrix with spectral decomposition `\(A = O \times \Lambda \times O^T\)` (the columns of `\(O\)` are unit eigenvectors of `\(A\)`, `\(O \times O^T=\text{Id}_p\)`, and the diagonal coefficients of `\(\Lambda\)` are the eigenvalues of `\(A\)`), we may define another norm on `\(\mathbb{R}^p\)`: `$$\| x \|_A^2 = \langle x, A x\rangle = (O^Tx)^T \Lambda (O^T x)$$` or `$$\| x \|_A^2 = \langle A^{1/2}x, A^{1/2} x\rangle$$` This is the Mahalanobis norm induced by `\(A\)`. ??? --- When dealing with matrices, the Hilbert-Schmidt norm plays the role of the Euclidean norm. We may define variations of the Hilbert-Schmidt norm in a way that parallels the definition of Mahalanobis norms. --- `\(\mathcal{M}(n,p)\)` denotes the set of real matrices with `\(n\)` rows and `\(p\)` columns. We will use weighted versions of the [Hilbert-Schmidt-Frobenius inner-product and norm](#eckartyoung). --- name: weighted-HS-norm ### Proposition Weighted Hilbert-Schmidt-Frobenius inner-product .bg-light-gray.b--light-gray.ba.bw1.br3.shadow-5.ph4.mt5[ Let `\(\Omega \in \mathcal{M}_{p, p}, \Phi \in \mathcal{M}_{q,q}\)` be symmetric and positive definite. Matrices `\(\Omega, \Phi\)` define an _inner product_ on `\(\mathcal{M}_{p,q}\)`: let `\(X, Y \in \mathcal{M}_{p,q}\)` `$$\begin{array}{rl} \langle X, Y\rangle_{\Omega, \Phi} & = \operatorname{Trace}\left(\Omega \times X \times \Phi \times Y^T\right) \\ & = \bigg\langle \Omega^{1/2} \times X \times \Phi^{1/2}, \Omega^{1/2} \times Y \times \Phi^{1/2} \bigg\rangle_{\text{HS}}\end{array}$$` The weighted Hilbert-Schmidt-Frobenius norm of `\(X\)` is defined by `$$\Vert X \Vert_{\Omega, \Phi}^2 = \left\Vert \Omega^{1/2} \times X \times \Phi^{1/2}\right\Vert_{\text{HS}}^2$$` ] ??? --- ### Proof Linearity with respect to `\(X\)` and `\(Y\)` follows from distributivity of matrix multiplication with respect to matrix addition. Symmetry is a consequence of invariance of the trace under similarity. Multiplying on the left by `\(\Omega^{-1/2}\)` and on the right by `\(\Omega^{1/2}\)`, `$$\begin{array}{rl} \langle X, Y\rangle_{\Omega, \Phi} & = \operatorname{Trace}\left(\Omega \times X \times \Phi \times Y^T\right) \\ & = \operatorname{Trace}\left(\Omega^{1/2} \times X \times \Phi \times Y^T \times \Omega^{1/2} \right) \\ & = \operatorname{Trace}\left(\big(\Omega^{1/2} \times X \times \Phi^{1/2}\big) \times\big(\Phi^{1/2} \times Y^T \times \Omega^{1/2}\big) \right) \\ & = \operatorname{Trace}\left(\big(\Omega^{1/2} \times X \times \Phi^{1/2}\big) \times \big(\Omega^{1/2} \times Y \times \Phi^{1/2} \big)^T \right)\\ & = \left\langle \Omega^{1/2} \times X \times \Phi^{1/2}, \Omega^{1/2} \times Y \times \Phi^{1/2} \right\rangle_{\text{HS}} \\ & = \left\langle \Omega^{1/2} \times Y \times \Phi^{1/2}, \Omega^{1/2} \times X \times \Phi^{1/2} \right\rangle_{\text{HS}} \\ & = \langle Y, X\rangle_{\Omega, \Phi}\end{array}$$` ???
Remember that the trace is _invariant under similarity_: `$$\operatorname{Tr}(Z) = \operatorname{Tr}(P \times Z \times P^{-1})$$` for all `\(Z\)` and invertible `\(P\)` --- ### Proof (continued) Positivity is a by-product of the argument above: `$$\begin{array}{rl}\langle X, X\rangle_{\Omega, \Phi} & = \langle \Omega^{1/2} \times X \times \Phi^{1/2}, \Omega^{1/2} \times X \times \Phi^{1/2} \rangle_{\text{HS}} \\ & \geq 0\end{array}$$` with equality if and only if `\(X=0\)`
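A quick numerical sanity check of the two expressions for `\(\langle X, Y\rangle_{\Omega, \Phi}\)` (an illustrative sketch: the dimensions and matrices are arbitrary, and `sqrtm` is a hypothetical helper computing the symmetric square root through the spectral decomposition):

```r
set.seed(42)                          # arbitrary reproducible example
p <- 3 ; q <- 4
Omega <- crossprod(matrix(rnorm(p * p), p)) + diag(p)   # symmetric PD
Phi   <- crossprod(matrix(rnorm(q * q), q)) + diag(q)   # symmetric PD
X <- matrix(rnorm(p * q), p, q) ; Y <- matrix(rnorm(p * q), p, q)

sqrtm <- function(S) {                # symmetric square root of a PD matrix
  e <- eigen(S, symmetric = TRUE)
  e$vectors %*% diag(sqrt(e$values), nrow(S)) %*% t(e$vectors)
}

lhs <- sum(diag(Omega %*% X %*% Phi %*% t(Y)))           # trace formula
rhs <- sum((sqrtm(Omega) %*% X %*% sqrtm(Phi)) *
           (sqrtm(Omega) %*% Y %*% sqrtm(Phi)))          # HS inner product
all.equal(lhs, rhs)                   # TRUE, up to floating point error
```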
??? We shall extend the notion of Singular Value Decomposition so as to adapt to weighted Hilbert-Schmidt-Frobenius norms. --- name: extendedsvd ### Theorem Extended SVD factorization .bg-light-gray.b--light-gray.ba.bw1.br3.shadow-5.ph4.mt5[ Let `\(\Omega \in \mathcal{M}(p,p)\)`, `\(\Phi \in \mathcal{M}(q,q)\)` be symmetric and positive definite. Let `\(X \in \mathcal{M}(p,q)\)`. There exist `\(N, D, M\)` such that `$$X = N \times D \times M^T$$` with `\(N \in \mathcal{M}(p,p)\)`, `\(M \in \mathcal{M}(q,q)\)` and `\(D \in \mathcal{M}(p, q)\)` satisfying: - `\(N^T \times \Omega \times N = I_p\)` - `\(M^T \times \Phi \times M = I_q\)` - `\(D_{k,k} \geq D_{k+1, k+1} \geq 0\)` for `\(1\leq k< \min (p,q)\)` and `\(D_{j,k}=0\)` for `\(j\neq k\)`. If `\(\Omega = I_p\)` and `\(\Phi = I_q\)`, we recover the Singular Value Decomposition ] --- ### Proof Because they are positive definite, by the spectral decomposition theorem, `\(\Omega\)` and `\(\Phi\)` have invertible square roots: there exist symmetric invertible matrices `\(\Omega^{1/2}\)` and `\(\Phi^{1/2}\)` such that `$$\Omega = (\Omega^{1/2})^2\qquad\text{and}\qquad\Phi = (\Phi^{1/2})^2$$` Let `\(U \times D \times V^T\)` be the SVD factorization of `\(\Omega^{1/2} \times X \times \Phi^{1/2}\)`, then `$$X = \underbrace{\Omega^{-1/2} \times U}_{:=N} \times D \times \underbrace{V^T \times \Phi^{-1/2}}_{:=M^T}$$` ??? The existence of `\(\Omega^{1/2}\)` and `\(\Phi^{1/2}\)` is guaranteed by the _spectral decomposition theorem_. The proof provides a straightforward computational recipe --- ### Proof (continued) Indeed: `$$\begin{array}{rl}N^T \times \Omega \times N & = (U^T \times \Omega^{-1/2}) \times \Omega \times (\Omega^{-1/2} \times U) \\ &= U^T \times (\Omega^{-1/2} \times \Omega \times \Omega^{-1/2} ) \times U \\ & = \operatorname{Id}_p\end{array}$$` and similarly for `\(M\)` and `\(\Phi\)`.
--- ###
Computing the extended SVD Given `\(X\)`, `\(\Omega\)` and `\(\Phi\)`, computing the extended SVD of `\(X\)` with respect to `\(\Omega, \Phi\)` is not harder than computing an ordinary SVD. `\(\Omega^{1/2}\)` and `\(\Phi^{1/2}\)` can be obtained from the SVDs of `\(\Omega, \Phi\)`. Indeed, from the proof of the existence of the extended SVD, the extended SVD of `\(X\)` can be obtained in a three-step approach (see the `R` sketch on the next slide): - Compute `\(\Omega^{1/2}\)` and `\(\Phi^{1/2}\)` from the SVDs of `\(\Omega, \Phi\)` - Compute the ordinary SVD `\(U \times D \times V^T\)` of `\(\Omega^{1/2} \times X \times \Phi^{1/2}\)` - Return `\(\Omega^{-1/2} \times U\)`, `\(D\)`, `\(\Phi^{-1/2}\times V\)` ???
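---

### Computing the extended SVD in `R`

A minimal sketch of the three-step recipe above. The function name `ext_svd` and its inline `sqrtm` helper are our own (hypothetical) choices, not a library API:

```r
# Extended SVD of X with respect to symmetric positive definite Omega, Phi
ext_svd <- function(X, Omega, Phi) {
  sqrtm <- function(S) {                 # symmetric square root via eigen
    e <- eigen(S, symmetric = TRUE)
    e$vectors %*% diag(sqrt(e$values), nrow(S)) %*% t(e$vectors)
  }
  Om_h <- sqrtm(Omega) ; Ph_h <- sqrtm(Phi)
  s <- svd(Om_h %*% X %*% Ph_h)          # ordinary SVD of transformed matrix
  list(N = solve(Om_h) %*% s$u,          # Omega-orthonormal: N' Omega N = I
       d = s$d,                          # non-increasing singular values
       M = solve(Ph_h) %*% s$v)          # Phi-orthonormal:   M' Phi M = I
}
```

As a sanity check, with `res <- ext_svd(X, Omega, Phi)`, `t(res$N) %*% Omega %*% res$N` should be (numerically) the identity and `res$N %*% diag(res$d, length(res$d)) %*% t(res$M)` should recover `X` up to rounding.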
--- name: truncSVD ### Definition: Truncated extended SVD factorization Let `\(k \leq \min(p,q)\)`. The `\(k\)`-truncated extended SVD factorization of `\(X \in \mathcal{M}(p,q)\)` with respect to `\(\Omega, \Phi\)` is derived from the extended SVD factorization `$$X = N \times D \times M^T$$` by taking the first `\(k\)` columns of `\(N\)` ( `\(N_{[k]}\)` ) and `\(M\)` ( `\(M_{[k]}\)` ) as well as the first `\(k\)` rows and columns of `\(D\)` ( `\(D_{[k]}\)` ). The matrix `$$N_{[k]} \times D_{[k]} \times M_{[k]}^T$$` has rank not larger than `\(k\)`. It has rank `\(k\)` if `\(D_{k,k}>0\)`. ??? The truncated extended SVD factorization is defined in a way that parallels the truncated SVD --- ### Performance of `\(k\)`-truncated SVD The `\(k\)`-truncated SVD of `\(X\)` is known to be the best rank- `\(k\)` approximation of `\(X\)` in the least-squares sense (that is, according to the Hilbert-Schmidt norm). The `\(k\)`-truncated extended SVD factorization also enjoys optimal approximation properties under rank constraints. Whereas the optimality property of the truncated SVD is assessed with respect to the Hilbert-Schmidt-Frobenius norm, the optimality property of the truncated extended SVD is assessed with respect to a weighted version of the Hilbert-Schmidt-Frobenius norm --- name: weightedHSorthobasis ### Proposition .bg-light-gray.b--light-gray.ba.bw1.br3.shadow-5.ph4.mt5[ Let `\(\Omega \in \mathcal{M}_{p, p}, \Phi \in \mathcal{M}_{q,q}\)` be symmetric and positive definite. Let `\(X \in \mathcal{M}_{p,q}\)` with extended SVD factorization `\(X = N \times D \times M^T\)` with respect to `\(\Omega\)` and `\(\Phi\)`. Let `\((n_i)_{i \leq p}\)` and `\((m_{j})_{j\leq q}\)` denote the columns of `\(N\)` and `\(M\)`; then `\(\left(n_i \times m_j^T \right)_{i\leq p, j \leq q}\)` is an orthonormal basis of `\(\mathcal{M}_{p,q}\)` endowed with inner product `\(\langle X, Y\rangle_{\Omega, \Phi}\)` ] ??? The straightforward proof parallels the proof of the statement for the ordinary SVD. --- ### Proof `$$\begin{array}{rl}\langle n_i \times m_j^T , n_k \times m_\ell^T \rangle_{\Omega, \Phi} & = \operatorname{Tr}\left(\Omega \times n_i \times \underbrace{m_j^T \times \Phi \times m_\ell}_{=\langle m_j , m_{\ell}\rangle_{\Phi}} \times n_k^T \right)\\ &= \langle m_j , m_{\ell}\rangle_{\Phi} \operatorname{Tr}\left(\Omega \times n_i \times n_k^T \right)\\ & = \langle m_j , m_{\ell}\rangle_{\Phi} \times \langle n_i , n_{k}\rangle_{\Omega}\end{array}$$` The fact that `\(N\)` (resp. `\(M\)`) is `\(\Omega\)` (resp. `\(\Phi\)`) orthogonal allows us to conclude
??? We just take advantage of the observation `$$m_j^T \times \Phi \times m_\ell = \operatorname{Tr}\left(\Phi \times m_\ell \times m_j^T \right)$$` --- name: extendedEckartYoung ### Theorem Eckart-Young Theorem for extended SVD .bg-light-gray.b--light-gray.ba.bw1.br3.shadow-5.ph4.mt5[ Let `\(\Omega \in \mathcal{M}_{p, p}, \Phi \in \mathcal{M}_{q,q}\)` be symmetric and positive definite. Let `\(X \in \mathcal{M}_{p,q}\)` with extended SVD factorization `\(X = N \times D \times M^T\)`. Let `\(X_{[k]}\)` denote the `\(k\)`-truncated extended SVD of `\(X\)` with respect to `\(\Omega\)` and `\(\Phi\)`. Then `\(X_{[k]}\)` minimizes `$$\Vert X-Y\Vert^2_{\Omega, \Phi} = \langle X-Y, X-Y\rangle_{\Omega, \Phi} = \operatorname{Trace}\left(\Omega (X-Y) \Phi (X-Y)^T\right)$$` among all matrices `\(Y \in \mathcal{M}_{p,q}\)` with rank at most `\(k\)`. ] --- ### Proof The argument parallels the proof of Eckart-Young's Theorem for SVD.
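---

### Extended Eckart-Young, numerically

A small illustration of the theorem (a sketch, reusing the hypothetical `ext_svd` helper sketched earlier; the data is arbitrary): the `\(k\)`-truncated extended SVD beats a competing rank- `\(k\)` matrix in weighted Hilbert-Schmidt-Frobenius norm, and its approximation error is the sum of the discarded squared singular values.

```r
set.seed(1)                              # arbitrary reproducible example
p <- 5 ; q <- 4 ; k <- 2
Omega <- crossprod(matrix(rnorm(p * p), p)) + diag(p)
Phi   <- crossprod(matrix(rnorm(q * q), q)) + diag(q)
X     <- matrix(rnorm(p * q), p, q)

res <- ext_svd(X, Omega, Phi)            # helper sketched earlier
X_k <- res$N[, 1:k] %*% diag(res$d[1:k]) %*% t(res$M[, 1:k])

wHS2 <- function(Z) sum(diag(Omega %*% Z %*% Phi %*% t(Z)))      # squared weighted norm
all.equal(wHS2(X - X_k), sum(res$d[-(1:k)]^2))  # discarded singular values
Y <- matrix(rnorm(p * k), p, k) %*% matrix(rnorm(k * q), k, q)   # a random rank-k matrix
wHS2(X - X_k) <= wHS2(X - Y)             # TRUE: X_k is optimal
```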
--- name: basicca template: inter-slide ## Correspondence analysis : recipes --- ### FactoMineR homepage <iframe src="http://factominer.free.fr" height="400" width="800" title="FactoMineR"></iframe> --- ### FactoMineR on Correspondence Analysis <iframe src="http://factominer.free.fr/factomethods/correspondence-analysis.html" height="400" width="800" title="FactoMineR"></iframe> ---
Correspondence Analysis (CA) consists of - normalizing and recentering the 2-way contingency table - computing the _extended_ SVD with carefully chosen matrices `\(\Omega\)` and `\(\Phi\)`.
Reporting CA consists of - Row Profiles analysis - Column Profiles analysis - Singular value analysis --- ### Two important diagonal matrices Consider a 2-way contingency table `\(X \in \mathcal{M}_{p, q}\)`. Let `\(n\)` denote the sum of the coefficients of `\(X\)`: `\(n = \sum_{a \leq p, b \leq q} X_{a, b}\)`. Let `\(P\)` be the _normalization_ of `\(X\)`: `$$P = \frac{1}{n} X$$` Let - `\(D_r \in \mathcal{M}_{p,p}\)` be the diagonal matrix of row sums of `\(P\)` - `\(D_c \in \mathcal{M}_{q,q}\)` be the diagonal matrix of column sums of `\(P\)` The diagonal elements of `\(D_r\)` define the row weights. The diagonal coefficients of `\(D_c\)` define the so-called _centroid_ : the marginal distribution of `\(Y\)` ??? The marginal distribution of `\(Y\)` is the weighted average of the conditional distributions of `\(Y\)` given `\(X\)`. --- name: rowprofiles ### Row profiles The rows of matrix `$$R = D_r^{-1} \times P \qquad R_{a,b} = \frac{N_{a,b}}{N_{a,.}}$$` are called _row profiles_. Each _row profile_ defines a conditional probability of `\(Y\)` given `\(X\)`. In Correspondence Analysis, row profile analysis consists of factorizing the centered row profile matrix `\(R\)`. The recipe consists of choosing the pair of symmetric positive definite matrices that defines the extended SVD --- ### Example Consider the 2-way contingency table derived from the `Embarked` and `Pclass` columns of the `train` subset of the Titanic dataset | | 1| 2| 3| |:--|---:|---:|---:| |Queenstown | 2| 3| 72| |Cherbourg | 85| 17| 66| |Southampton | 127| 164| 353| Once normalized by the table sum `n=889`, we obtain matrix `\(P\)` (up to rounding errors): `$$\begin{bmatrix} 0.002 & 0.003 & 0.081 \\ 0.096 & 0.019 & 0.074 \\ 0.143 & 0.184 & 0.397 \end{bmatrix}$$` Matrices `\(D_r\)` and `\(D_c\)` are `$$D_r \approx \begin{bmatrix} 0.086 & 0 & 0 \\ 0 & 0.189 & 0 \\ 0 & 0 & 0.724 \end{bmatrix} \qquad D_c \approx \begin{bmatrix} 0.241 & 0 & 0 \\ 0 & 0.206 & 0 \\ 0 & 0 & 0.552 \end{bmatrix}$$` ??? ```r readr::read_csv("DATA/TITANIC/train.csv", col_types = cols(Embarked = col_factor(levels = c("Q", "C", "S")), Pclass = col_factor(levels = c("1", "2", "3")), Sex = col_factor(levels = c("male", "female")), Survived = col_factor(levels = c("0", "1")))) %>% dplyr::select(Embarked, Pclass) %>% table() -> X ``` ```r P <- X/sum(X) ; Dr <- diag(rowSums(P)) ; Dc <- diag(colSums(P)) R <- solve(Dr) %*% P foo <- matrix(1, nrow = 3, ncol=3) %*% P ``` --- The row profiles matrix `\(R\)` is `\(D_r^{-1} \times P\)` `$$R \approx \begin{bmatrix} 0.02 & 0.03 & 0.94 \\ 0.51 & 0.10 & 0.39 \\ 0.20 & 0.25 & 0.55 \end{bmatrix}$$` The matrix which is fed to the _extended SVD_ is obtained by centering `\(R\)` around the weighted average of the rows: `$$R - 1 \times 1^T \times P \approx \begin{bmatrix} -0.221 & -0.176 & 0.388 \\ 0.269 & -0.106 & -0.162 \\ -0.041 & 0.044 & -0.002 \end{bmatrix}$$` --- Recall that the row vector `\(1^T \times P\)` defines the marginal distribution of `\(Y\)` (the column-profile of matrix `\(P\)`). Note that all rows of `\(R - 1 \times 1^T \times P\)` sum up to `\(0\)`: they are orthogonal to `\((1, 1, 1)^T\)`.
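This can be checked numerically; a sketch, reusing the objects `P` and `R` computed in the chunks above:

```r
# each row of the centered row-profile matrix sums to zero (up to rounding)
rowSums(R - matrix(1, 3, 1) %*% colSums(P))
```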
For reference, the raw counts: | | 1| 2| 3| |:--|---:|---:|---:| |Q | 2| 3| 72| |C | 85| 17| 66| |S | 127| 164| 353| --- ### A Mahalanobis metric for row profiles The space of row profiles (a subset of `\(\mathbb{R}^q\)`) is endowed with the weighted `\(\ell_2\)` metric (a Mahalanobis metric) induced by `\(D_c^{-1}\)`: `$$(x -y)^T \times D_c^{-1} \times (x-y)$$` --- ### Applying the extended SVD The extended SVD factorization is actually applied to `$$D_r^{-1} \times P \times D_c^{-1} - 1 \times 1^T$$` where `\(1 \times 1^T\)` denotes the matrix from `\(\mathcal{M}_{p, q}\)` with all entries equal to `\(1\)`. Matrices `\(\Omega\)` and `\(\Phi\)` are chosen as `\(D_r\)` and `\(D_c\)`. The entry of `\(D_r^{-1} \times P \times D_c^{-1} - 1 \times 1^T\)` at indices `\(a,b\)` is `$$\frac{n N_{a,b}}{N_{a, .}N_{.,b}} - 1$$` ??? Entries of `\(D_r^{-1} \times P \times D_c^{-1} - 1 \times 1^T\)` show up in the definition of the chi-square divergence between the joint empirical distribution and the product of the marginal empirical distributions --- Recall the definition of the Pearson association index: `$$\begin{array}{rl} n \chi^2(P \Vert P_X \otimes P_Y) & = n \sum_{a \in \mathcal{X}, b \in \mathcal{Y}} \frac{N_{a,\cdot} N_{\cdot, b}}{n^2} \left( \frac{n N_{a,b}}{N_{a,\cdot} N_{\cdot, b}} - 1\right)^2 \\ & = n \sum_{a \in \mathcal{X}} \frac{N_{a, .}}{n} \sum_{b \in \mathcal{Y}} \frac{\left( \frac{N_{a,b}}{N_{a,\cdot}} - \frac{N_{\cdot, b}}{n} \right)^2}{N_{.,b}/n} \\ & = n \sum_{a \in \mathcal{X}} w_a \times \chi^2\left(P_Y (\cdot | X=a) \Vert P_Y\right) \\ & = \sum_{a \in \mathcal{X},b \in \mathcal{Y}} \frac{\left({N_{a,b}} - \frac{N_{a, .}}{n}N_{., b} \right)^2}{N_{a,.} N_{.,b}/n} \end{array}$$` where `\(P\)` is the joint distribution defined by the contingency table, `\(P_X, P_Y\)` the two marginal distributions. In terms of weighted norms (note the absence of the factor `\(n\)`: the squared norm equals the `\(\chi^2\)` divergence, not the Pearson index), `$$\begin{array}{rl} \chi^2(P \Vert P_X \otimes P_Y) & = \big\Vert D_r^{1/2} \big(D_r^{-1} \times P \times D_c^{-1} - 1 \times 1^T \big)D_c^{1/2}\big\Vert^2_{\text{HS}} \\ & = \big\Vert \big(D_r^{-1} \times P \times D_c^{-1} - 1 \times 1^T \big)\big\Vert^2_{D_r, D_c}\end{array}$$` --- As `$$D_r^{1/2} \times \big(D_r^{-1} \times P \times D_c^{-1} - 1 \times 1^T \big) \times D_c^{1/2} = D_r^{1/2} \times \big(R - 1 \times 1^T \times P \big) \times D_c^{-1/2}$$` we can also portray correspondence analysis as > computing the extended SVD of the centered row profiles matrix `$$\big(R - 1\times 1^T \times P \big)$$` > with respect to the symmetric positive definite matrices `\(\Omega = D_r, \Phi = D_c^{-1}\)` --- ### Interpretation of singular values The sum of the squared singular values equals the `\(\chi^2\)` association index `\(\chi^2(P \Vert P_X \otimes P_Y)\)` (the Pearson index divided by `\(n\)`). It is also called the _total inertia_ of the centered row profiles matrix. The number `\(k\)` of positive singular values equals the rank of the centered row profiles matrix. The latter is not larger than `\(q-1\)`. The sum of the `\(k-2\)` smallest squared positive singular values equals the squared weighted Hilbert-Schmidt-Frobenius distance of the centered row profiles matrix to the set of rank-2 matrices (extended Eckart-Young Theorem). ??? --- ### Interpretation of singular values (continued) Let `\(N \in \mathcal{M}_{p, k}\)` (resp. `\(M \in \mathcal{M}_{q, k}\)`) denote the extended left (resp.
right) singular vectors computed from the extended SVD of `\(R - 1 \times 1^T \times P\)` with respect to `\(D_r\)` and `\(D^{-1}_c\)`. We have `$$N^T \times D_r \times N = I_k \qquad\text{and}\qquad M^T \times D^{-1}_c \times M = I_k$$` Let `\(D_{s} \in \mathcal{M}_{k,k}\)` be the diagonal matrix with non-increasing singular values as diagonal coefficients. ??? --- ### Interpretation of left singular vectors The truncated versions of matrix `\(N \times D_s\)` define the optimal subspaces onto which to project the row profiles. Turn back to the Titanic example. The 2 positive squared singular values are `\(0.103, 0.036\)` (if we sum these two values and multiply the result by the number `\(889\)` of rows of the Titanic sub-dataset, we recover the Pearson association statistic, `\(123.57\)`). `$$F = N \times D_s =\begin{bmatrix} -0.573 & -0.515 \\ 0.604 & -0.165 \\ -0.089 & 0.105\end{bmatrix}$$` ??? The graphical display of the row profiles consists of displaying the points defined by the rows of `\(F\)`. The percentage of inertia associated with each axis is usually added to the plot. If possible, `rownames` are used to identify each point and facilitate interpretation. --- ###
Titanic illustrated : row profiles .panelset[ .panel[.panel-name[Code] ```r P <- (1/sum(X)) * X Dr <- diag(rowSums(P)) ; Dc <- diag(colSums(P)) R <- diag(rowSums(P)^(-1)) %*% P - matrix(1, 3, 1) %*% colSums(P) svd_R_ext <- svd(diag(rowSums(P)^(1/2)) %*% R %*% diag(colSums(P)^(-1/2))) N <- diag(rowSums(P)^(-1/2)) %*% svd_R_ext$u M <- diag(colSums(P)^(1/2)) %*% svd_R_ext$v F <- N %*% diag(svd_R_ext$d) res_ca <- data.frame(F[, 1:2]) names(res_ca) <- c("lambda1", "lambda2") res_ca$rownames_ <- c("Queenstown", "Cherbourg", "Southampton") ``` ] .panel[.panel-name[Plot preparation] ```r p <- res_ca %>% ggplot() + aes(x=lambda1, y=lambda2, label=rownames_) + geom_point() + geom_text(vjust = 0, nudge_y = 0.05) + xlim(-1, 1) + ylim(-1, 1) + coord_fixed() + geom_hline(yintercept = 0, linetype=2) + geom_vline(xintercept = 0, linetype=2) + xlab(latex2exp::TeX("$\\lambda_1 = 74.1\\%$")) + ylab(latex2exp::TeX("$\\lambda_2 = 25.9\\%$")) + ggtitle("Titanic CA row profiles") p ``` ] .panel[.panel-name[Plot] .centered[ ![](cm-7-EDA_files/figure-html/CArowtitanic-1.png) ] ] .panel[.panel-name[Discussion] #### Row profile correspondence analysis of `Embarked` and `Pclass` for a subset of the Titanic dataset. Not too surprisingly, the class distribution conditioned on `Embarked=Southampton` is closest to the centroid, that is, to the marginal distribution of Passenger class ] ] --- name: colprofiles ### Column-profiles Whereas in PCA, rows and columns play different roles (observations and variables), in CA, rows and columns are both associated with modalities of categorical variables. Column analysis consists of performing row analysis on the transposed matrix `\(P^T\)`. The matrix of centered column profiles is `$$D_c^{-1} \times P^T - \mathbf{1} \times \mathbf{1}^T \times P^T$$` Recall that for row profile analysis, the extended SVD factorization is applied to `\(D_r^{-1} \times P \times D_c^{-1} - 1 \times 1^T\)`, with matrices `\(\Omega\)` and `\(\Phi\)` chosen as `\(D_r\)` and `\(D_c\)`. This leads to `$$D_r^{-1} \times P \times D_c^{-1} - 1 \times 1^T = \widetilde{N} \times D_s \times \widetilde{M}^T$$` with `$$R - 1 \times 1^T \times P = \widetilde{N} \times D_s \times \widetilde{M}^T \times D_c$$` --- ### Column-profiles (continued) For column profile analysis, the extended SVD factorization is applied to `\(D_c^{-1} \times P^T \times D_r^{-1} - 1 \times 1^T\)`, with matrices `\(\Omega\)` and `\(\Phi\)` chosen as `\(D_c\)` and `\(D_r\)`. `$$D_c^{-1} \times P^T \times D_r^{-1} - 1 \times 1^T = \widetilde{M} \times D_s \times \widetilde{N}^T$$` This leads to `$$C - 1 \times 1^T \times P^T = \widetilde{M} \times D_s \times \widetilde{N}^T \times D_r$$` Exchanging rows and columns does not change the decomposition of inertia.
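This can be checked numerically (a sketch, reusing the contingency table `P` from the Titanic chunks above): the matrix fed to column-profile analysis is the transpose of the matrix fed to row-profile analysis, so both analyses share the same singular values.

```r
Z_row <- diag(sqrt(rowSums(P))) %*%
  (diag(1 / rowSums(P)) %*% P %*% diag(1 / colSums(P)) - matrix(1, 3, 3)) %*%
  diag(sqrt(colSums(P)))
all.equal(svd(Z_row)$d, svd(t(Z_row))$d)   # same decomposition of inertia
```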
Investigating the left singular vectors `\(\widetilde{M}\)` of column profile analysis amounts to investigating the right singular vectors `\(D_c \times \widetilde{M}\)` of row profile analysis. ??? Column profile correspondence analysis of `Embarked` and `Pclass` for a subset of the Titanic dataset. --- ###
Titanic illustrated : column profiles .panelset[ .panel[.panel-name[Code] ```r C <- diag(colSums(P)^(-1)) %*% t(P) - matrix(1, 3, 1) %*% rowSums(P) svd_C_ext <- svd(diag(colSums(P)^(1/2)) %*% C %*% diag(rowSums(P)^(-1/2))) Ndual <- diag(colSums(P)^(-1/2)) %*% svd_C_ext$u Mdual <- diag(rowSums(P)^(1/2)) %*% svd_C_ext$v G <- Ndual %*% diag(svd_C_ext$d) res_ca_dual <- data.frame(-G[, 1:2]) names(res_ca_dual) <- c("lambda1", "lambda2") res_ca_dual$rownames_ <- c("1", "2", "3") ``` ] .panel[.panel-name[Plot preparation] ```r p %+% res_ca_dual + ggtitle("Titanic CA column profiles") ``` ] .panel[.panel-name[Plot] .center[ ![](cm-7-EDA_files/figure-html/CArowtitanicDual-1.png) ] ] .panel[.panel-name[Discussion] - Passenger Class 3 makes up the bulk of the passengers (it accounts for more than half of them). The conditional embarkation distribution for that class is close to the centroid. Should it be a surprise? - Passengers with Class 2 tickets almost exclusively boarded at Southampton - Passengers with Class 1 tickets almost exclusively boarded at Southampton and Cherbourg ] ] --- name: assocmodalities ### Association between modalities Correspondence analysis is used to compare row profiles and column profiles. It is also, and mainly, used to investigate associations between modalities of the two variables. Function `CA` from `FactoMineR` delivers, among other products, a _biplot for correspondence analysis_. ```r res_ca <- X %>% * FactoMineR::CA(ncp = 2, graph = FALSE) ``` | | Dim.1 | Dim.2 | |:----------------|------------:|------:| |Variance | 0.10 | 0.036 | | `% of var.` | 74.12 | 25.87| | Cumulative `% of var.`| 74.12 |100.0 | --- ### Dissecting the output of `FactoMineR::CA` This output, denoted `res_ca`, is a list with five elements | Element name | Description | |:-------------|:------------------------------------------------| |`eig` | a redundant matrix for inertia decomposition | |`call` | a list of 9 elements| |`row` | a list for row profile analysis | |`col` | a list for column profile analysis| |`svd` | extended SVD components | --- ### Component `eig` of `res_ca` is made of three columns - Column 1 `eigenvalue` lists the eigenvalues of `\(\ldots\)` - Column 2 is a normalized version of Column 1 - Column 3 is obtained by applying `cumsum` to column 2 ??? Component `eig` paves the way for building a `screeplot` for CA --- ### Component `call` is a list - `X` is the contingency table that was fed to `CA` - `marge.col` is the marginal distribution for columns - `marge.row` is the marginal distribution for rows - `row.w`: a constant vector `1` here - `excl`: irrelevant here - `Xtot`: irrelevant here - `N`: the sum of the coefficients of `X`
Here `X == Xtot`, `excl` is `NULL` and `row.w` is identically `1`. ??? --- ### Component `row` of `res_ca` If the extended SVD of `\(X = D_r^{-1} \times P \times D_c^{-1} - \mathbf{1} \times \mathbf{1}^T\)` is `\(U \times D \times V^T\)`, then `res_ca$row$coord` is `\(U \times D = X \times \Phi \times V\)` ??? - Compare with PCA output --- exclude: true ### Component `col` --- ### Component `svd` - What matrix is this component the SVD of? - It is the (extended) SVD of `$$D_r^{-1} \times P \times D_c^{-1} - \mathbf{1} \times \mathbf{1}^T$$` The `\(\Omega\)` matrix is `diag(res_ca$call$marge.row)`. The `\(\Phi\)` matrix is `diag(res_ca$call$marge.col)`. --- exclude: true ```r names(res_ca) ``` ``` [1] "eig" "call" "row" "col" "svd" ``` ```r names(res_ca$call) ``` ``` [1] "X" "marge.col" "marge.row" "ncp" "row.w" "excl" [7] "call" "Xtot" "N" ``` ```r res_ca$row ``` ``` $coord Dim 1 Dim 2 Q -0.57253082 -0.5151977 C 0.60436922 -0.1645507 S -0.08920676 0.1045260 $contrib Dim 1 Dim 2 Q 27.515826 63.82276 C 66.897215 14.20515 S 5.586959 21.97210 $cos2 Dim 1 Dim 2 Q 0.5525630 0.4474370 C 0.9309859 0.0690141 S 0.4214174 0.5785826 $inertia [1] 0.05138127 0.07414282 0.01367941 ``` ```r res_ca$call$marge.row ``` ``` Q C S 0.08661417 0.18897638 0.72440945 ``` --- exclude: true ### Designing a `tidy` method for outputs of `CA`?
the input of `CA` is not a dataframe, it is a _contingency table_ --- ### Inertia analysis .fl.w-40.pa2[ The inertia of the first dimensions shows if there are strong relationships between variables and suggests the number of dimensions that should be studied. The first two dimensions express `\(100\%\)` of the total dataset inertia. That means that `\(100\%\)` of the row (or column) cloud's total variability is explained by the plane (_this is trivial!_) ] .fl.w-60.pa2[ .center[ <img src="cm-7-EDA_files/figure-html/decompInertia-1.png" width="432" /> ] ] --- |Rows | `\(\text{Iner.} \times 1000\)`| Dim.1 | contribution | `\(\cos^2\)` | Dim.2 | contribution | `\(\cos^2\)` | |:-|:--------:|----:|----:|----:|----:|----:|----:| |Q | 51.381 | -0.573 | 27.516 | 0.553 | -0.515 | 63.823 | 0.447 | |C | 74.143 | 0.604 |66.897 | 0.931 | -0.165 |14.205 | 0.069 | |S | 13.679 | -0.089 | 5.587 | 0.421 | 0.105 |21.972 | 0.579 | According to row analysis, Southampton contributes little to total inertia. Indeed, Southampton mostly contributes to the centroid. --- |Rows | `\(\text{Iner.} \times 1000\)`| Dim.1 | contribution | `\(\cos^2\)` | Dim.2 | contribution | `\(\cos^2\)` | |:-|:--------:|----:|----:|----:|----:|----:|----:| |1 | 77.518 | 0.566 | 74.698 | 0.994 | -0.043 | 1.230 | 0.006 | |2 | 29.988 | -0.103 | 2.118 | 0.073 | 0.367 | 77.185 | 0.927 | |3 | 31.697 | -0.208 | 23.184 | 0.755 | -0.119 | 21.585 | 0.245 | Classes contribute more evenly to inertia. --- ### Biplots for CA .fl.w-40[ Passengers who boarded at Cherbourg were disproportionately first class passengers. Passengers who boarded at Queenstown were disproportionately third class passengers. Note that this biplot does not display the weight of the different categories ] .fl.w-60.pa2[ <img src="cm-7-EDA_files/figure-html/fctmCATita-1.png" width="432" /> ] --- exclude: true ### Questionnaire ```r require(FactoMineR) data(children) res.ca <- CA(children, row.sup = 15:18, col.sup = 6:8) ``` <img src="cm-7-EDA_files/figure-html/unnamed-chunk-24-1.png" width="432" /> --- template: inter-slide name: factoshowcase ## Another FactoMineR showcase --- ###
Women at work (1974) .panelset[ .panel[.panel-name[The 1974 survey] .fl.w-50[ In a 1974 survey, 1724 women answered several questions about women's work, among which: - What do you think the perfect family is? - Both husband and wife work - Husband works more than wife - Only husband works - Which activity is the best for a mother when children go to school? - Stay at home - Part-time work - Full-time work ] .fl.w-50[ ![](./img/arts_menagers.jpeg) ] ] .panel[.panel-name[Get the data] ```r facto_url <- "http://factominer.free.fr" url <- paste(facto_url, "factomethods", "datasets", "women_work.txt", sep = "/") women_work <- read.table(url, header=TRUE, row.names=1, sep="\t") tab <- women_work[, 1:3] head(tab) ``` ``` stay.at.home part.time.work full.time.work both.man.and.woman.work 13 142 106 man.morks.more 30 408 117 only.man.works 241 573 94 ``` ] .panel[.panel-name[Perform CA] ```r library(FactoMineR) res_ca <- CA(tab) ``` ```r res_ca$eig %>% knitr::kable() ``` <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> eigenvalue </th> <th style="text-align:right;"> percentage of variance </th> <th style="text-align:right;"> cumulative percentage of variance </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> dim 1 </td> <td style="text-align:right;"> 0.1168400 </td> <td style="text-align:right;"> 86.29218 </td> <td style="text-align:right;"> 86.29218 </td> </tr> <tr> <td style="text-align:left;"> dim 2 </td> <td style="text-align:right;"> 0.0185604 </td> <td style="text-align:right;"> 13.70782 </td> <td style="text-align:right;"> 100.00000 </td> </tr> </tbody> </table> ```r res_ca$row$coord %>% knitr::kable() ``` <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Dim 1 </th> <th style="text-align:right;"> Dim 2 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> both.man.and.woman.work </td> <td style="text-align:right;"> -0.5586051 </td> <td style="text-align:right;"> 0.2333870 </td> </tr> <tr> <td style="text-align:left;"> man.morks.more </td> <td style="text-align:right;"> -0.2437595 </td> <td style="text-align:right;"> -0.1722066 </td> </tr> <tr> <td style="text-align:left;"> only.man.works </td> <td style="text-align:right;"> 0.3095622 </td> <td style="text-align:right;"> 0.0381726 </td> </tr> </tbody> </table> ] .panel[.panel-name[Plot CA] ![](cm-7-EDA_files/figure-html/ca_women-1.png) ] ] --- ###
Women at work (1974), supplementary columns .panelset[ .panel[.panel-name[The 1974 survey] Another question was asked of the 1724 women - What do you think of the following sentence: _women who do not work feel cut off from the world_? + Totally agree
+ Quite agree
+ Quite disagree
+ Totally disagree
] .panel[.panel-name[The data] ```r pattern_ <- "housewives.cut.from.world." women_work <- women_work %>% rename_with(.fn= ~ str_replace(., pattern_, ""), starts_with(pattern_)) head(women_work[, 4:ncol(women_work)]) ``` ``` totally.agree quite.agree quite.disagree both.man.and.woman.work 107 75 40 man.morks.more 192 175 100 only.man.works 140 215 254 totally.disagree both.man.and.woman.work 39 man.morks.more 88 only.man.works 299 ``` ] .panel[.panel-name[Perform CA] ```r res_ca <- CA(women_work, col.sup=4:ncol(women_work)) ``` ```r res_ca$row$coord %>% knitr::kable() ``` <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Dim 1 </th> <th style="text-align:right;"> Dim 2 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> both.man.and.woman.work </td> <td style="text-align:right;"> -0.5586051 </td> <td style="text-align:right;"> 0.2333870 </td> </tr> <tr> <td style="text-align:left;"> man.morks.more </td> <td style="text-align:right;"> -0.2437595 </td> <td style="text-align:right;"> -0.1722066 </td> </tr> <tr> <td style="text-align:left;"> only.man.works </td> <td style="text-align:right;"> 0.3095622 </td> <td style="text-align:right;"> 0.0381726 </td> </tr> </tbody> </table> ] .panel[.panel-name[Plot supplemented CA] ![](cm-7-EDA_files/figure-html/ca_women_sup-1.png) ] .panel[.panel-name[Discussion] > "Totally agree" and "Quite agree" for "Women who do not work feel cut off from the world" are close to categories in favour of women's work. > "Quite disagree" and "Totally disagree" are close to categories opposed to women's work. ] ] --- exclude: true ## Bibliographic remarks {#bibca} @MR767260 is a lively and thorough presentation of correspondence analysis and related techniques. @holmes is a brief and spirited introduction to the duality diagrams framework that underlies package `ade4`. @husson2018r explains how to use `FactoMineR` to perform Correspondence Analysis. --- class: middle, center, inverse background-image: url('./img/pexels-cottonbro-3171837.jpg') background-size: cover # The End