Published

March 12, 2024

Code
require(FactoMineR)
require(FactoInvestigate)
require(ggrepel)
require(ggfortify)
require(broom)

opts <- options()  # save old options

options(ggplot2.discrete.colour="viridis")
options(ggplot2.discrete.fill="viridis")
options(ggplot2.continuous.fill="viridis")
options(ggplot2.continuous.colour="viridis")

Computing SVD

PCA and SVD

Visualizing PCA

Definition : Thin SVD

Let \(X\) be a \(n \times p\) real matrix with rank \(r\).

Let \(U, D, V\) be a singular value decomposition of \(X\) such that the diagonal entries of \(D\) are non-increasing.

Let \(U_r\) (resp. \(V_r\)) be the \(n \times r\) (resp. \(p \times r\)) matrix formed by the first \(r\) columns of \(U\) (resp. \(V\))

Then

\[X = U_r \times D_r \times V_r^T\]

is a thin Singular Value Decomposition of \(X\).

Code
A <- matrix(rnorm(12, 1), nrow = 4)

foo <- svd(A)

U <- foo$u

D <- foo$d

V <- foo$v

# xtable::xtableMatharray(A)
# xtable::xtableMatharray(U)
# xtable::xtableMatharray(diag(D))
# xtable::xtableMatharray(t(V))
A
           [,1]         [,2]      [,3]
[1,]  0.7887244  2.142099666 0.4212978
[2,]  1.3395730 -0.002127517 0.6538190
[3,] -0.7027860  2.834266236 0.2102367
[4,]  1.0683588  1.659178043 1.8625600
Code
U
           [,1]        [,2]       [,3]
[1,] -0.5401492 -0.02689543 -0.6790662
[2,] -0.1136021 -0.58964536 -0.3946417
[3,] -0.6091248  0.61862144  0.1055700
[4,] -0.5694737 -0.51855810  0.6099033
Code
diag(D)
         [,1]   [,2]      [,3]
[1,] 4.177496 0.0000 0.0000000
[2,] 0.000000 2.3364 0.0000000
[3,] 0.000000 0.0000 0.7965151
Code
t(V)
           [,1]       [,2]       [,3]
[1,] -0.1815741 -0.9163604 -0.3568114
[2,] -0.7703521  0.3580720 -0.5275813
[3,] -0.6112188 -0.1790753  0.7709368
Example

In , the SVD of matrix \(A\) is obtained by svd_ <- svd(A) with U <- svd_$u, V <- svd_$V and D <- diag(svd_$d)

Two perspectives on PCA (and SVD)

Faithful projection

Let \(X\) be a \(n \times p\) matrix, a reasonable task consists in finding a \(k \lll p\)-dimensional subspace \(E\) of \(\mathbb{R}^p\) such that

\[X \times \Pi_E\]

is as close as possible from \(X\). By as close as possible, we mean that the sum of squared Eudlidean distances between the rows of \(X\) and the rows of \(X \times \Pi_E\) is as small as possible

Picking \(E\) as the subspace generated by the first \(k\) right-singular vectors of \(X\) solves the problem

Maximizing the projected variance

Let \(X\) be a \(n \times p\) matrix, a reasonable task consists in finding a \(k \lll p\)-dimensional subspace \(E\) of \(\mathbb{R}^p\) such that

\[X \times \Pi_E\]

retains as much variance as possible.

This means maximizing

\[\operatorname{var}\left(X \times \Pi_E\right)\]

PCA up and running

We will walk through Principal Component Analysis using historical datasets

  • USArrests: crime data across states (individuals) in USA

  • iris: morphological data of 150 flowers (individuals) from genus iris

  • decathlon: scores of athletes at the ten events making a decathlon

  • Life tables for a collection of Western European countries and the USA

Some datasets have non-numerical columns.

Such columns are not used when computing the SVD that underpins PCA

But non-numerical columns may be incorporated into PCA visualization

When PCA is used as a feature engineering device to prepare the dataset for regression, supervised classification, clustering or any other statistical/machine learning treatment, visualizing the interplay between non-numerical columns and the numerical columns constructed by PCA is crucial

Iris flower dataset

A showcase for Principal Component Analysis

The Iris flower data set or Fisher’s Iris data set is a multivariate data set introduced by Ronald Fisher in 1936. Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. Two of the three species were collected in the Gaspé Peninsula “all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus”

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters

Wikipedia

USArrests

Code
data(USArrests)

USArrests %>%
  sample_n(10) %>%
  knitr::kable()
Murder Assault UrbanPop Rape
Wisconsin 2.6 53 66 10.8
New York 11.1 254 86 26.1
Arkansas 8.8 190 50 19.5
Rhode Island 3.4 174 87 8.3
Utah 3.2 120 80 22.9
Maryland 11.3 300 67 27.8
South Dakota 3.8 86 45 12.8
Texas 12.7 201 80 25.5
Mississippi 16.1 259 44 17.1
Tennessee 13.2 188 59 26.9
Code
USArrests %>%
  rownames_to_column() %>%
  ggplot() +
  aes(x = Assault, y = Murder) +
  aes(colour = UrbanPop, label=rowname) +
  geom_point(size = 5) +
  geom_text_repel(box.padding = unit(0.75, "lines"))
Warning: ggrepel: 22 unlabeled data points (too many overlaps). Consider
increasing max.overlaps

Decathlon

Code
data(decathlon)

decathlon %>%
  rownames_to_column() %>%
  dplyr::select(-c(`400m`, `110m.hurdle`, `Shot.put`,`Javeline`, `1500m`, `Rank`, `Competition`, `Points`)) %>%
  sample_n(10) %>%
  knitr::kable()
rowname 100m Long.jump High.jump Discus Pole.vault
Drews 10.87 7.38 1.88 40.11 5.00
HERNU 11.37 7.56 1.86 44.99 4.82
McMULLEN 10.83 7.31 2.13 44.41 4.42
CLAY 10.76 7.40 1.86 50.72 4.92
Lorenzo 11.10 7.03 1.85 40.22 4.50
SEBRLE 11.04 7.58 2.07 43.75 5.02
Gomez 11.08 7.26 1.85 40.95 4.40
Smirnov 10.89 7.07 1.94 42.47 4.70
ZSIVOCZKY 11.13 7.30 2.01 45.67 4.42
WARNERS 11.11 7.60 1.98 41.10 4.92
Code
decathlon %>%
  rownames_to_column() %>%
  ggplot() +
  aes(x = `100m`, y = Long.jump) +
  aes(colour = Points, label=rowname) +
  geom_point(size = 5) +
  geom_smooth(se = FALSE) +
  geom_smooth(method="lm", se=FALSE) +
  geom_text_repel(box.padding = unit(0.75, "lines"))
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: The following aesthetics were dropped during statistical transformation:
colour, label
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?
`geom_smooth()` using formula = 'y ~ x'
Warning: The following aesthetics were dropped during statistical transformation:
colour, label
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?
Warning: ggrepel: 9 unlabeled data points (too many overlaps). Consider
increasing max.overlaps