- M1 MIDS/MFA
- Université Paris Cité
- Année 2023-2024
-
Course Homepage
- Moodle
Code
require(FactoMineR)
require(FactoInvestigate)
require(ggrepel)
require(ggfortify)
require(broom)
opts <- options() # save old options
options(ggplot2.discrete.colour="viridis")
options(ggplot2.discrete.fill="viridis")
options(ggplot2.continuous.fill="viridis")
options(ggplot2.continuous.colour="viridis")Computing SVD
PCA and SVD
Visualizing PCA
Let \(X\) be a \(n \times p\) real matrix with rank \(r\).
Let \(U, D, V\) be a singular value decomposition of \(X\) such that the diagonal entries of \(D\) are non-increasing.
Let \(U_r\) (resp. \(V_r\)) be the \(n \times r\) (resp. \(p \times r\)) matrix formed by the first \(r\) columns of \(U\) (resp. \(V\))
Then
\[X = U_r \times D_r \times V_r^T\]
is a thin Singular Value Decomposition of \(X\).
Code
[,1] [,2] [,3]
[1,] 0.7887244 2.142099666 0.4212978
[2,] 1.3395730 -0.002127517 0.6538190
[3,] -0.7027860 2.834266236 0.2102367
[4,] 1.0683588 1.659178043 1.8625600
Code
U [,1] [,2] [,3]
[1,] -0.5401492 -0.02689543 -0.6790662
[2,] -0.1136021 -0.58964536 -0.3946417
[3,] -0.6091248 0.61862144 0.1055700
[4,] -0.5694737 -0.51855810 0.6099033
Code
diag(D) [,1] [,2] [,3]
[1,] 4.177496 0.0000 0.0000000
[2,] 0.000000 2.3364 0.0000000
[3,] 0.000000 0.0000 0.7965151
Code
t(V) [,1] [,2] [,3]
[1,] -0.1815741 -0.9163604 -0.3568114
[2,] -0.7703521 0.3580720 -0.5275813
[3,] -0.6112188 -0.1790753 0.7709368
In , the SVD of matrix \(A\) is obtained by svd_ <- svd(A) with U <- svd_$u, V <- svd_$V and D <- diag(svd_$d)
Two perspectives on PCA (and SVD)
Let \(X\) be a \(n \times p\) matrix, a reasonable task consists in finding a \(k \lll p\)-dimensional subspace \(E\) of \(\mathbb{R}^p\) such that
\[X \times \Pi_E\]
is as close as possible from \(X\). By as close as possible, we mean that the sum of squared Eudlidean distances between the rows of \(X\) and the rows of \(X \times \Pi_E\) is as small as possible
Picking \(E\) as the subspace generated by the first \(k\) right-singular vectors of \(X\) solves the problem
Let \(X\) be a \(n \times p\) matrix, a reasonable task consists in finding a \(k \lll p\)-dimensional subspace \(E\) of \(\mathbb{R}^p\) such that
\[X \times \Pi_E\]
retains as much variance as possible.
This means maximizing
\[\operatorname{var}\left(X \times \Pi_E\right)\]
PCA up and running
We will walk through Principal Component Analysis using historical datasets
USArrests: crime data across states (individuals) in USAiris: morphological data of 150 flowers (individuals) from genus irisdecathlon: scores of athletes at the ten events making a decathlonLife tables for a collection of Western European countries and the USA
Some datasets have non-numerical columns.
Such columns are not used when computing the SVD that underpins PCA
But non-numerical columns may be incorporated into PCA visualization
When PCA is used as a feature engineering device to prepare the dataset for regression, supervised classification, clustering or any other statistical/machine learning treatment, visualizing the interplay between non-numerical columns and the numerical columns constructed by PCA is crucial
Iris flower dataset
A showcase for Principal Component Analysis
The Iris flower data set or Fisher’s Iris data set is a multivariate data set introduced by Ronald Fisher in 1936. Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. Two of the three species were collected in the Gaspé Peninsula “all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus”
The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters
USArrests
| Murder | Assault | UrbanPop | Rape | |
|---|---|---|---|---|
| Wisconsin | 2.6 | 53 | 66 | 10.8 |
| New York | 11.1 | 254 | 86 | 26.1 |
| Arkansas | 8.8 | 190 | 50 | 19.5 |
| Rhode Island | 3.4 | 174 | 87 | 8.3 |
| Utah | 3.2 | 120 | 80 | 22.9 |
| Maryland | 11.3 | 300 | 67 | 27.8 |
| South Dakota | 3.8 | 86 | 45 | 12.8 |
| Texas | 12.7 | 201 | 80 | 25.5 |
| Mississippi | 16.1 | 259 | 44 | 17.1 |
| Tennessee | 13.2 | 188 | 59 | 26.9 |
Code
USArrests %>%
rownames_to_column() %>%
ggplot() +
aes(x = Assault, y = Murder) +
aes(colour = UrbanPop, label=rowname) +
geom_point(size = 5) +
geom_text_repel(box.padding = unit(0.75, "lines"))Warning: ggrepel: 22 unlabeled data points (too many overlaps). Consider
increasing max.overlaps
Decathlon
Code
| rowname | 100m | Long.jump | High.jump | Discus | Pole.vault |
|---|---|---|---|---|---|
| Drews | 10.87 | 7.38 | 1.88 | 40.11 | 5.00 |
| HERNU | 11.37 | 7.56 | 1.86 | 44.99 | 4.82 |
| McMULLEN | 10.83 | 7.31 | 2.13 | 44.41 | 4.42 |
| CLAY | 10.76 | 7.40 | 1.86 | 50.72 | 4.92 |
| Lorenzo | 11.10 | 7.03 | 1.85 | 40.22 | 4.50 |
| SEBRLE | 11.04 | 7.58 | 2.07 | 43.75 | 5.02 |
| Gomez | 11.08 | 7.26 | 1.85 | 40.95 | 4.40 |
| Smirnov | 10.89 | 7.07 | 1.94 | 42.47 | 4.70 |
| ZSIVOCZKY | 11.13 | 7.30 | 2.01 | 45.67 | 4.42 |
| WARNERS | 11.11 | 7.60 | 1.98 | 41.10 | 4.92 |
Code
decathlon %>%
rownames_to_column() %>%
ggplot() +
aes(x = `100m`, y = Long.jump) +
aes(colour = Points, label=rowname) +
geom_point(size = 5) +
geom_smooth(se = FALSE) +
geom_smooth(method="lm", se=FALSE) +
geom_text_repel(box.padding = unit(0.75, "lines"))`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: The following aesthetics were dropped during statistical transformation:
colour, label
ℹ This can happen when ggplot fails to infer the correct grouping structure in
the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
variable into a factor?
`geom_smooth()` using formula = 'y ~ x'
Warning: The following aesthetics were dropped during statistical transformation:
colour, label
ℹ This can happen when ggplot fails to infer the correct grouping structure in
the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
variable into a factor?
Warning: ggrepel: 9 unlabeled data points (too many overlaps). Consider
increasing max.overlaps