Code
require(tidyverse)
require(patchwork)
require(httr)
require(glue)
require(broom)
<- theme_set(theme_minimal()) old_theme
require(tidyverse)
require(patchwork)
require(httr)
require(glue)
require(broom)
<- theme_set(theme_minimal()) old_theme
Dataset swiss
from datasets::swiss
connect fertility and social, economic data within 47 French-speaking districts in Switzerland.
Fertility
: fertility indexAgriculture
: jobs in agricultural sectorExamination
: literacy index (military examination)Education
: proportion of people with successful secondary educationCatholic
: proportion of CatholicsInfant.Mortality
: mortality quotient at age 0Fertility index (Fertility
) is considered as the response variable
The social and economic variables are covariates (explanatory variables).
See European Fertility Project for more on this dataset.
PCA (Principal Component Analysis) is concerned with covariates.
data("swiss")
%>%
swiss glimpse(50)
Rows: 47
Columns: 6
$ Fertility <dbl> 80.2, 83.1, 92.5, 85.8,…
$ Agriculture <dbl> 17.0, 45.1, 39.7, 36.5,…
$ Examination <int> 15, 6, 5, 12, 17, 9, 16…
$ Education <int> 12, 9, 5, 7, 15, 7, 7, …
$ Catholic <dbl> 9.96, 84.84, 93.40, 33.…
$ Infant.Mortality <dbl> 22.2, 22.2, 20.2, 20.3,…
Have a look at the documentation of the dataset
It is enough to call summary()
on each column of swiss
. This can be done in a functional programming style using package purrr
. The collections of summaries can be rearranged so as to build a dataframe that is fit for reporting.
<- map_dfr(swiss, summary, .id = "var") %>%
tt mutate(across(where(is.numeric), ~ round(.x, digits=1)))
%>%
tt ::datatable() DT
Function skim
from skimr
delivers all univariate summaries in proper form.
<- swiss %>%
foo select(-Fertility) %>%
skim()
<- foo %>%
foobar filter(skim_type=="numeric") %>%
rename(variable=skim_variable) %>%
mutate(across(where(is.numeric), ~ round(.x, digits=1)))
%>%
foobar ::datatable(extensions = c('Buttons', 'ColReorder', 'FixedColumns', 'Responsive'),
DToptions = list( dom = 'Bfrtip',
buttons = c('csv', 'pdf', 'print'),
colReorder = TRUE,
dom = 't',
scrollX = TRUE,
fixedColumns = list(leftColumns = 3, rightColumns = 1))
)
We have to pick some graphical summary of the data. Boxplots and violine plots could be used if we look for concision.
We use histograms to get more details about each column.
Not that covariates have different meanings: Agriculture
, Catholic
, Examination
, and Education
are percentages with values between \(0\) and \(100\).
We have no details about the standardized fertility index Fertility
Infant.Mortality
is also a rate:
Infant mortality is the death of an infant before his or her first birthday. The infant mortality rate is the number of infant deaths for every 1,000 live births. In addition to giving us key information about maternal and infant health, the infant mortality rate is an important marker of the overall health of a society.
see Center for Desease Control
We reuse the function we have already developped during previous sessions.
make_biotifoul(swiss, .f = is.numeric)
Histograms reveal that our covariates have very different distributions.
Religious affiliation (Catholic
) tells us that there two types of districts, which is reminiscent of the old principle Cujus regio, ejus religio , see Old Swiss Confederacy.
Agriculture
shows that in most districts, agriculture was still a very important activity.
Education
reveals that in all but a few districts, most children did not receive secondary education. Examination
shows that some districts lag behind the bulk of districts. Even less exhibit a superior performance.
The two demographic variables Fertility
and Infant.Mortality
look roughly unimodal with a few extreme districts.
Compute, display and comment the sample correlation matrix.
Display jointplots for each pair of variables.
Package corrr
, functions correlate
and rplot
provide a conveniemt tool.
Note that rplot()
creates a graphical object of class ggplot
. We can endow it with more layers.
::correlate(swiss) %>%
corrr::rplot() %>% +
corrrggtitle("Correlation plot for Swiss Fertility data")
Correlation computed with
• Method: 'pearson'
• Missing treated using: 'pairwise.complete.obs'
The high positive linear correlation between Education
and Examination
is moderately surprising. The negative correlation between the proportion of people involved in Agriculture
and Education
and Examination
is also not too surprising. Secondary schooling required pupils from rural areas to move to cities.
A more intriguing observation concerns the pairs Catholic
and Examination
(negative correlation) and Catholic
and Education
(little correlation).
The response variable Fertility
looks negatively correlated with Examination
an Education
. These correlations are worth being further explored. In Demography, the decline of Fertility is often associated with the the rise of women education. Note that Examination
is about males, and that Education
does not give details about the way women complete primary education.
Pairwise analysis did not provide us with a clear and simple picture of the French-speaking districts.
Play with centering and scaling
We first call prcomp()
with the default arguments for centering and scaling, that is, we center columns and do not attempt to standardize columns.
<- swiss %>%
pco select(-Fertility) %>%
prcomp()
The result
Hand-made centering of the dataframe
<- select(swiss, -Fertility)
X
<- nrow(X)
n
<- (X - matrix(1, nrow = n, ncol=1) %*% rep(1/n,n) %*% as.matrix(X))
Y
<- as.matrix(Y) Y
tibble(var=names(X), mX=colMeans(X), mY=colMeans(Y)) %>%
mutate(across(where(is.numeric), ~ round(.x, digits=2))) %>%
::datatable() DT
Function scale(X, scale=F)
from base R
does the job.
<- svd(Y)
svd_Y
%$%
svd_Y as.matrix(Y) - u %*% diag(d) %*% t(v)) %>%
(norm(type = "F") # <1> checking the factorization
[1] 2.054251e-13
norm( diag(1, ncol(Y)) -
%$% (t(v) %*% v)), 'F') # <2> checking that colomns of `v` frm an orthonormal family. (svd_Y
[1] 1.261261e-15
Note that we used the exposing pipe %$%
from magrittr
to unpack svd_Y
which is a list with class svd
and members named u
, d
and v
.
We could have used with(,)
from base R
.
The matrix \(1/n Y^T \times Y\) is the covariance matrix of the covariates. The spectral decomposition of the symmetric Semi Definite Positive (SDP) matrix \(1/n Y^T \times Y\) is related with the SVD factorization of \(Y\).
The spectral decomposition of \(Y^T \times Y\) can be obtained using eigen
.
t(eigen(t(Y) %*% Y )$vectors) %*% svd_Y$v ) %>%
(round(digits=2)
[,1] [,2] [,3] [,4] [,5]
[1,] 1 0 0 0 0
[2,] 0 -1 0 0 0
[3,] 0 0 1 0 0
[4,] 0 0 0 1 0
[5,] 0 0 0 0 1
Here, the eigenvectors of \(Y^T \times Y\) coincide with the right singular vectors of \(Y\) corresponding to non-zero singular values. Up to sign changes, it is always true when the non-zero singular values are pairwise distinct.
Now we check that prcomp
is indeed a wrapper for svd
.
- pco$x %*% t(pco$rotation )) %>%
(Y round(digits = 2) %>%
head()
Agriculture Examination Education Catholic Infant.Mortality
Courtelary 0 0 0 0 0
Delemont 0 0 0 0 0
Franches-Mnt 0 0 0 0 0
Moutier 0 0 0 0 0
Neuveville 0 0 0 0 0
Porrentruy 0 0 0 0 0
$v %*% t(pco$rotation )) %>%
(svd_Yround(2)
Agriculture Examination Education Catholic Infant.Mortality
[1,] 1 0 0 0 0
[2,] 0 1 0 0 0
[3,] 0 0 1 0 0
[4,] 0 0 0 1 0
[5,] 0 0 0 0 1
t(pco$x) %*% pco$x) %>%
(round(2)
PC1 PC2 PC3 PC4 PC5
PC1 86484.49 0.00 0.00 0.00 0.00
PC2 0.00 21127.44 0.00 0.00 0.00
PC3 0.00 0.00 2706.14 0.00 0.00
PC4 0.00 0.00 0.00 639.22 0.00
PC5 0.00 0.00 0.00 0.00 348.01
Objects of class pca
can be handled by generic functions like plot
.
plot(pco)
The displayed plot is the so-called screeplot, it provides information about the approximation of the decomposedmatrix by its truncated SVDs.
<- . %>%
p_screeplot tidy(matrix="pcs") %>% {
ggplot(.) +
aes(x=PC, y=percent, label=format(1.-cumulative,2)) +
geom_text(angle=45, vjust=-1, hjust=-0.1) +
geom_col(fill=NA, colour="black")
}
1- percent
tell the reader about the relative Frobenious error achieved by keeping the first components of the SVD expansion.
%>%
pco p_screeplot() +
labs(title="Screeplot for swiss fertility data",
caption="Keeping the first two components is enough to achieve relative Froebenius relative error 3.3%")
Project the dataset on the first two principal components (perform dimension reduction) and build a scatterplot. Colour the points according to the value of original covariates.
<- pco %>%
p augment(swiss) %>%
ggplot() +
aes(x=.fittedPC1, y=.fittedPC2, label=.rownames) +
geom_point() +
coord_fixed() +
::geom_text_repel()
ggrepel
+
(p aes(color=Infant.Mortality)) +
+
(p aes(color=Education)) +
+
(p aes(color=Examination)) +
+
(p aes(color=Catholic)) +
+
(p aes(color=Agriculture)) +
+
(p aes(color=Fertility)) +
plot_layout(ncol = 2) +
plot_annotation(title="Swiss data on first two PCs" ,
subtitle = "centered, unscaled")
We can extract factor \(V\) from the SVD factorization using generic function tidy
from package broom
%>%
pco tidy(matrix="v") %>%
glimpse()
Rows: 25
Columns: 3
$ column <chr> "Agriculture", "Agriculture", "Agriculture", "Agriculture", "Ag…
$ PC <dbl> 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, …
$ value <dbl> 0.28151505, -0.88377692, -0.36961938, -0.02652821, -0.04863543,…
The result is a tibble in long form. It is worth pivoting the dataframe
%>%
pco tidy(matrix="v") %>%
pivot_wider(id_cols =column,
names_from = PC,
values_from = value)
# A tibble: 5 × 6
column `1` `2` `3` `4` `5`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Agriculture 0.282 -0.884 -0.370 -0.0265 -0.0486
2 Examination -0.121 0.174 -0.450 -0.867 0.0332
3 Education -0.0584 0.311 -0.807 0.485 -0.117
4 Catholic 0.950 0.303 0.00166 -0.0715 0.0223
5 Infant.Mortality 0.0105 0.0193 0.0985 -0.0867 -0.991
Think of the rows of swiss
as vectors. Then matrix v
In wide form, we readily access to the decomposition of the or
<- . %>%
prep_co_circle tidy(matrix="v") %>%
pivot_wider(id_cols =column,
names_from = PC,
values_from = value)
<- (
co_circle_ppl %>%
pco prep_co_circle() %>%
filter(F)
%>%
) ggplot() +
aes(x=`1`, y=`2`, label=column) +
geom_segment(aes(xend=0, yend=0), arrow = grid::arrow(ends = "first")) +
::geom_text_repel() +
ggrepelcoord_fixed() +
xlim(c(-1.1, 1.1)) + ylim(c(-1.1, 1.1)) +
annotate("path",
x=0+1*cos(seq(0,2*pi,length.out=100)),
y=0+1*sin(seq(0,2*pi,length.out=100)), linetype="dashed")
%+% (
co_circle_ppl %>%
pco prep_co_circle()
+
) ggtitle("Swiss, correlation circle",
subtitle = "centered, unscaled")
# pco %$% {
# ifelse(!is.null(center), "centered", "not centered") ;
# ifelse(!is.null(scale), "scaled", "not scaled")
# }
<- select(swiss, -Fertility) %>%
pco2 prcomp(scale. = T)
%+% (
co_circle_ppl %>%
pco2 prep_co_circle()
+
) ggtitle("Swiss, correlation circle",
subtitle = "centered, scaled")
scale(., center=T, scale-F)
)\[X\]
<- as.matrix(select(swiss, -Fertility)) |>
X scale(center = T, scale=F)
# check centering, spot the difference in variances
|>
X as_tibble() |>
summarise(across(everything(), c(var, mean)))
# A tibble: 1 × 10
Agriculture_1 Agriculture_2 Examination_1 Examination_2 Education_1
<dbl> <dbl> <dbl> <dbl> <dbl>
1 516. 2.64e-15 63.6 -1.51e-16 92.5
# ℹ 5 more variables: Education_2 <dbl>, Catholic_1 <dbl>, Catholic_2 <dbl>,
# Infant.Mortality_1 <dbl>, Infant.Mortality_2 <dbl>
# should be 0
norm(X %*% pco$rotation - pco$x)
[1] 0
# check the left singular vectors
$x %*% diag((pco$sdev)^(-1)) |>
pcoas_tibble() |>
summarise(across(everything(), c(mean,var)))
Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
`.name_repair` is omitted as of tibble 2.0.0.
ℹ Using compatibility `.name_repair`.
# A tibble: 1 × 10
V1_1 V1_2 V2_1 V2_2 V3_1 V3_2 V4_1 V4_2 V5_1 V5_2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 7.44e-17 1 -1.05e-16 1 -8.24e-17 1.00 -6.84e-17 1 5.56e-16 1
#
$rotation %*% (diag((pco$sdev)^(-2)) %*% t(pco$x) %*% X) pco
Agriculture Examination Education Catholic
Agriculture 4.600000e+01 6.994405e-15 9.325873e-15 -2.192690e-14
Examination 1.346007e-13 4.600000e+01 3.042011e-14 3.273354e-13
Education 1.090239e-13 2.825518e-14 4.600000e+01 -3.185507e-13
Catholic 1.054712e-15 -2.102485e-15 -4.982126e-15 4.600000e+01
Infant.Mortality 1.172396e-13 -2.442491e-14 -7.194245e-14 -1.971756e-13
Infant.Mortality
Agriculture -5.329071e-15
Examination 4.440892e-16
Education -1.598721e-14
Catholic 4.440892e-16
Infant.Mortality 4.600000e+01
|>
pco tidy(matrix="v") |>
pivot_wider(id_cols =column,
names_from = PC,
values_from = value) |>
rowwise() |>
summarise(column, l2=sum((c_across(where(is.numeric)))^2))
# A tibble: 5 × 2
column l2
<chr> <dbl>
1 Agriculture 1.00
2 Examination 1.00
3 Education 1
4 Catholic 1
5 Infant.Mortality 1.00
Checking Orthogonality of \(V\)
# checking that pco$rotation is an orthogonal matrix
t(pco$rotation) %*% pco$rotation
PC1 PC2 PC3 PC4 PC5
PC1 1.000000e+00 -4.341417e-16 -7.220786e-17 2.710505e-18 3.469447e-18
PC2 -4.341417e-16 1.000000e+00 3.649425e-16 -8.001412e-17 6.938894e-17
PC3 -7.220786e-17 3.649425e-16 1.000000e+00 3.642919e-17 -1.387779e-17
PC4 2.710505e-18 -8.001412e-17 3.642919e-17 1.000000e+00 2.498002e-16
PC5 3.469447e-18 6.938894e-17 -1.387779e-17 2.498002e-16 1.000000e+00
$rotation %*% t(pco$rotation) pco
Agriculture Examination Education Catholic
Agriculture 1.000000e+00 6.223320e-17 2.177078e-16 3.248270e-16
Examination 6.223320e-17 1.000000e+00 -5.316927e-16 1.517883e-17
Education 2.177078e-16 -5.316927e-16 1.000000e+00 -2.059984e-16
Catholic 3.248270e-16 1.517883e-17 -2.059984e-16 1.000000e+00
Infant.Mortality 6.245005e-17 2.983724e-16 -1.249001e-16 -1.734723e-17
Infant.Mortality
Agriculture 6.245005e-17
Examination 2.983724e-16
Education -1.249001e-16
Catholic -1.734723e-17
Infant.Mortality 1.000000e+00
Pay attention to the correlation circles.
<- swiss %>%
pco_c select(-Fertility) %>%
prcomp()
<- swiss %>%
pco_cs select(-Fertility) %>%
prcomp(scale.=T, center=T)
%+% (pco_c %>%
(co_circle_ppl prep_co_circle()) +
ggtitle("Swiss, correlation circle",
subtitle = "centered, unscaled"))
(%+% (pco_cs %>%
co_circle_ppl prep_co_circle()) +
ggtitle("Swiss, correlation circle",
subtitle = "centered, scaled")
)
Explain the contrast between the two correlation circles.
In the sequel we focus on standardized PCA.
<- p %+% (pco_cs %>%
q augment(swiss)) +
ggtitle("Swiss data on first two PCs", subtitle = "centered, scaled")
+
(q aes(color=Infant.Mortality)) +
+
(q aes(color=Education)) +
+
(q aes(color=Examination)) +
+
(q aes(color=Catholic)) +
+
(q aes(color=Agriculture)) +
+
(q aes(color=Fertility)) +
plot_layout(ncol = 2)
Warning: ggrepel: 45 unlabeled data points (too many overlaps). Consider increasing max.overlaps
ggrepel: 45 unlabeled data points (too many overlaps). Consider increasing max.overlaps
ggrepel: 45 unlabeled data points (too many overlaps). Consider increasing max.overlaps
ggrepel: 45 unlabeled data points (too many overlaps). Consider increasing max.overlaps
ggrepel: 45 unlabeled data points (too many overlaps). Consider increasing max.overlaps
ggrepel: 45 unlabeled data points (too many overlaps). Consider increasing max.overlaps
How many axes should we keep?
%+% (pco_cs%>%
p_screeplot tidy(matrix="pcs"))
plot(pco_cs)
Elbow rule: keep the first three PCs
This comes from the correlation circle. We rely on function prep_co_circle
and on the graphical pipeline co_circle_ppl
.
(%+%
co_circle_ppl prep_co_circle(pco_cs) +
ggtitle("Swiss, correlation circle",
subtitle = "centered, scaled")
)
|>
swiss select(-Fertility) |>
::correlate() |>
corrr::shave() |>
corrr::rplot(print_cor = T) +
corrrtheme_minimal()
Correlation computed with
• Method: 'pearson'
• Missing treated using: 'pairwise.complete.obs'
Fertility
variablePlot again the correlation circle using the same principal axes as before, but add the Fertility
variable. How does Fertility
relate with covariates? with principal axes?
<- pco_cs %$% # exposition pipe
U 1/sqrt(nrow(x)-1) *x %*% diag((sdev)^(-1)))
(
<- with(pco_cs,
Uprime 1/sqrt(nrow(x)-1) *x %*% diag((sdev)^(-1)))
t(U) %*% U
[,1] [,2] [,3] [,4] [,5]
[1,] 1.000000e+00 -1.717376e-16 1.110223e-16 -3.008119e-16 6.210310e-16
[2,] -1.717376e-16 1.000000e+00 2.498002e-16 -1.970266e-16 3.816392e-17
[3,] 1.110223e-16 2.498002e-16 1.000000e+00 4.523508e-15 5.828671e-16
[4,] -3.008119e-16 -1.970266e-16 4.523508e-15 1.000000e+00 -6.432029e-16
[5,] 6.210310e-16 3.816392e-17 5.828671e-16 -6.432029e-16 1.000000e+00
t(Uprime) %*% Uprime
[,1] [,2] [,3] [,4] [,5]
[1,] 1.000000e+00 -1.717376e-16 1.110223e-16 -3.008119e-16 6.210310e-16
[2,] -1.717376e-16 1.000000e+00 2.498002e-16 -1.970266e-16 3.816392e-17
[3,] 1.110223e-16 2.498002e-16 1.000000e+00 4.523508e-15 5.828671e-16
[4,] -3.008119e-16 -1.970266e-16 4.523508e-15 1.000000e+00 -6.432029e-16
[5,] 6.210310e-16 3.816392e-17 5.828671e-16 -6.432029e-16 1.000000e+00
norm(U,type = "F"))^2 (
[1] 5
<- swiss %>%
pco select(-Fertility) %>%
prcomp(scale.=T)
<- pco %>%
df_cocirc tidy(matrix="v") %>%
pivot_wider(id_cols =column,
names_from = PC,
values_from = value)
augment(pco, data=swiss) %>%
ggplot() +
geom_point(aes(x=.fittedPC1,
y=.fittedPC2,
color=Fertility, label=.rownames)) +
coord_fixed() +
::geom_text_repel(aes(x=.fittedPC1,
ggrepely=.fittedPC2,
color=Infant.Mortality,
label=.rownames)) +
geom_segment(data=df_cocirc,
mapping=aes(x= 4* `1`,
y= 4 * `2`,
linetype=factor(column),
label=column,
xend=0,
yend=0),
arrow = grid::arrow(ends = "first",
unit(.1, "inches")
+
)) scale_color_viridis_c() +
xlim(c(-5,5)) +
ylim(c(-5,5)) #+
Warning in geom_point(aes(x = .fittedPC1, y = .fittedPC2, color = Fertility, :
Ignoring unknown aesthetics: label
Warning in geom_segment(data = df_cocirc, mapping = aes(x = 4 * `1`, y = 4 * :
Ignoring unknown aesthetics: label
Warning: ggrepel: 19 unlabeled data points (too many overlaps). Consider
increasing max.overlaps
# ggrepel::geom_text_repel(data=df_cocirc,
# aes(x= 4* `1`,
# y= 4 * `2`,
# label=column),
# color="red")
autoplot(pco,
data=swiss,
color="Agriculture",
loadings = TRUE,
loadings.colour = 'blue',
loadings.label = TRUE)
https://scholar.google.com/citations?user=xbCKOYMAAAAJ&hl=fr&oi=ao