Published

March 20, 2024

Code
stopifnot(
  require(tidyverse),
  require(patchwork),
  require(httr),
  require(glue),
  require(broom),
  require(DT),
  require(GGally),
  require(ggforce),
  require(ggfortify),
  require(testthat),
  require(viridisLite)
)

tidymodels::tidymodels_prefer(quiet = TRUE)
Code
old_theme <- theme_set(
  theme_minimal(base_size = 9,
                base_family = "Helvetica")
)

Correspondence Analysis

The mortality dataset

The goal is to investigate a possible link between age group and cause of death. We work with the mortality dataset from package FactoMineR.

Code
stopifnot(
  require(FactoMineR),
  require(factoextra),
  require(FactoInvestigate)
)
## Loading required package: FactoMineR
## Loading required package: factoextra
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
## Loading required package: FactoInvestigate

data("mortality", package = "FactoMineR")
Code
#help(mortality)

A data frame with 62 rows (the different causes of death) and 18 columns. Each column corresponds to an age interval (15-24, 25-34, 35-44, 45-54, 55-64, 65-74, 75-84, 85-94, 95 and more) in a given year. The first 9 columns correspond to data from 1979 and the last 9 columns to data from 2006. Each cell gives the number of deaths for a cause of death in an age interval (in a given year).
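
A quick check of this description (the column-name format, e.g. "15-24 (79)", is what the select(ends_with(...)) calls used below rely on):

Code
dim(mortality)                # 62 causes x 18 age-group columns
head(colnames(mortality), 2)  # column names end with "(79)" or "(06)"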

Source
Centre d’épidémiologie sur les causes médicales de décès

See also EuroStat:

Question

Read the documentation of the mortality dataset. Is this a sample? an aggregated dataset?

If you consider mortality as an aggregated dataset, can you figure out the organization of the sample mortality was built from?

The mortality dataset is an aggregated dataset. It has been built from two samples. Each sample was built from the collection of death certificates for one calendar year in France (years 1979 and 2006). From each death certificate, two categorical pieces of information were extracted: the age group of the deceased and a cause of death. Each sample was then grouped by age group and cause of death, and counts were computed. This defines a two-way contingency table in long form. The contingency table in wide form is obtained by pivoting: pick column names from column age group and values from counts. Column cause of death provides the row names.

The final form of the dataset is obtained by concatenating the two contingency tables along the second axis.
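
As a minimal sketch of that pivot (with a hypothetical long-form tibble counts_long; the cause labels below are made up):

Code
# Hypothetical long-form counts: one row per (cause, age group) pair
counts_long <- tibble::tribble(
  ~cause,    ~age_group, ~n,
  "cause A", "15-24",    10,
  "cause A", "25-34",    12,
  "cause B", "15-24",    30,
  "cause B", "25-34",    40
)

counts_long |>
  tidyr::pivot_wider(names_from = age_group, values_from = n) |>
  tibble::column_to_rownames("cause")   # causes of death as row names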

Code
mortality |> 
    select(ends_with('(06)')) |> 
    DT::datatable(extensions = list("Responsive"))

Elementary statistics and table wrangling

Before proceeding to Correspondence Analysis (CA), let us draw some elementary plots. As ggplot is adapted to data in long format, we start by partially pivoting mortality, so as to obtain a tibble with columns cause and year, while keeping all columns named after age groups. We use rowwise() and sum(c_across()) to compute the total number of deaths per year and cause in column total; this mimics rowSums() inside a pipeline. Column grand_total is computed using a window function after grouping by cause.

Code
mortality_long <- mortality |> 
  as_tibble() |> 
  mutate(cause = rownames(mortality)) |> 
  mutate(cause=as_factor(cause)) |> 
  relocate(cause) |> 
  pivot_longer(
    cols=-cause,
    cols_vary="slowest",
    names_to=c(".value", "year"),
    names_pattern="([\\w\\- ]*) \\(([0-9]{2})\\)"
  )  |> 
  mutate(year=ifelse(year=='06', 2006, 1979)) |> 
  rowwise() |> 
  mutate(total=sum(c_across(-c(cause, year)))) |> 
  group_by(cause) |> 
  mutate(grand_total = sum(total)) |> 
  ungroup()
Code
th <- theme_get()
(
mortality_long |> 
  ggplot() +
  scale_fill_discrete() +
  aes(x=fct_reorder(cause, grand_total), 
      y=total, 
      fill=as_factor(year), 
      label=cause) +
  geom_col(position=position_dodge()) +
  theme(
    legend.position="none",
    axis.text.x=element_blank(), #remove x axis labels
    axis.ticks.x=element_blank(), #remove x axis ticks
  ) +
  labs(
    title = "Causes of death, France, 1979, 2006",
    subtitle= "Raw counts"
  ) +
  xlab(label=NULL)
) |> plotly::ggplotly()
Code
oth <- theme_set(th)
Code
mortality_long |> 
  group_by(year) |> 
  summarise(total_deaths = sum(total))
# A tibble: 2 × 2
   year total_deaths
  <dbl>        <int>
1  1979       529974
2  2006       510921
Question

Compute the marginal counts for each year (1979, 2006). Compare.

Counts have already been computed above.

Code
mortality_long |> 
  select(cause, year, total) |> 
  pivot_wider(id_cols=c(cause), names_from = year, values_from = total) |> 
  DT::datatable()

Correspondence Analysis

CA executive summary
  • Start from a 2-way contingency table \(X\) with \(\sum_{i,j} X_{i,j}=N\)

  • Normalize \(P = \frac{1}{N}X\) (the correspondence matrix)

  • Let \(r\) (resp. \(c\)) be the vector of row (resp. column) sums of \(P\)

  • Let \(D_r=\text{diag}(r)\) denote the diagonal matrix with row sums of \(P\) as coefficients

  • Let \(D_c=\text{diag}(c)\) denote the diagonal matrix with column sums of \(P\) as coefficients

  • The row profiles matrix is \(D_r^{-1} \times P\)

  • The standardized residuals matrix is \(S = D_r^{-1/2} \times \left(P - r c^T\right) \times D_c^{-1/2}\)

CA consists in computing the SVD of the standardized residuals matrix \(S = U \times D \times V^T\)

From the SVD, we get:

  • \(D_r^{-1/2} \times U\): standardized coordinates of rows

  • \(D_c^{-1/2} \times V\): standardized coordinates of columns

  • \(D_r^{-1/2} \times U \times D\): principal coordinates of rows

  • \(D_c^{-1/2} \times V \times D\): principal coordinates of columns

  • squared singular values: the principal inertias

When calling svd(.), the argument should be \[D_r^{1/2}\times \left(D_r^{-1} \times P \times D_c^{-1}- \mathbf{1}\times \mathbf{1}^T \right)\times D_c^{1/2}\]

CA and extended SVD

As \[D_r^{-1} \times P \times D_c^{-1} - \mathbf{1}\mathbf{1}^T = (D_r^{-1/2} \times U)\times D \times (D_c^{-1/2}\times V)^T,\]

\((D_r^{-1/2} \times U)\times D \times (D_c^{-1/2}\times V)^T\) is the extended SVD of \[D_r^{-1} \times P \times D_c^{-1} - \mathbf{1}\mathbf{1}^T\] with respect to \(D_r\) and \(D_c\).

Question

Perform CA on the two contingency tables.

You may use FactoMineR::CA(). It is also interesting to compute the correspondence analysis in your own way, by preparing the matrix that is handed to svd() and returning a named list containing all relevant information (a sketch follows the code below).


Code
lst_ca <- list()

for (y in c('79', '06')) {
  lst_ca[[y]] <- mortality |> 
    select(ends_with(glue('({y})'))) |> 
    FactoMineR::CA(ncp=8, graph = F)
}
Code
lst <- map(c('79', '06'), 
             \(x) select(mortality, ends_with(glue('({x})'))) |>
             FactoMineR::CA(ncp=8, graph = F)
           )
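
As for the home-made route suggested above, here is a minimal sketch following the executive summary: build the standardized residuals matrix \(S\), hand it to svd(), and return a named list. The function name my_ca and the names of the returned components are our own choices, not FactoMineR's.

Code
my_ca <- function(X) {
  X <- as.matrix(X)
  N <- sum(X)
  P <- X / N        # correspondence matrix
  r <- rowSums(P)   # row margins
  c <- colSums(P)   # column margins

  # Standardized residuals: S = D_r^{-1/2} (P - r c^T) D_c^{-1/2}
  S <- diag(r^(-1/2)) %*% (P - outer(r, c)) %*% diag(c^(-1/2))
  udv <- svd(S)

  list(
    inertia   = udv$d^2,                                   # principal inertias
    row_coord = diag(r^(-1/2)) %*% udv$u %*% diag(udv$d),  # principal coordinates of rows
    col_coord = diag(c^(-1/2)) %*% udv$v %*% diag(udv$d),  # principal coordinates of columns
    r = r,
    c = c
  )
}

home_ca <- my_ca(select(mortality, ends_with('(79)')))

# The leading squared singular value should match FactoMineR's first eigenvalue
c(home_ca$inertia[1], lst_ca[['79']]$eig[1, 1])
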
Question

If you did use FactoMineR::CA(), explain the organization of the result.

The result of FactoMineR::CA(...) is a named and nested list with five elements:

eig
a matrix/array containing enough information to build a screeplot.
call
a list of 9, containing the call to CA(), an object of type language, telling (in principle) the user how CA() was called. However, this is a quoted expression: here we need to guess the value of y in the calling environment to understand what's going on.
Code
lst_ca[[1]]$call$call
FactoMineR::CA(X = select(mortality, ends_with(glue("({y})"))), 
    ncp = 8, graph = F)

Element call also contains the table margin distributions marge.col and marge.row. The truncation rank ncp (the number of components) can be set before computing the SVD (default value is 5). Element \(X\) stores the contingency table that was effectively used for computing the Correspondence Analysis.

row
Information gathered from SVD to facilitate row profiles analysis.
col
a list structured in the same way as element row, used for column profiles analysis.
svd
a list of 3, just as the result of svd(), containing the singular values and the left and right singular vectors of the standardized residuals matrix \(S\).

In principle, all relevant information can be gathered from components svd, call$marge.row, and call$marge.col.

Screeplots

Question

Draw screeplots. Why are they useful? Comment briefly.

Code
ca_79 <- lst_ca[[1]]

ca_79$eig |> 
  as_tibble() |> 
  mutate(across(where(is.numeric), ~ round(.x, digits=2))) |> 
  DT::datatable(extensions = c("Responsive"))

Screeplot

Code
ca_79$eig |> 
  as_tibble() |> 
  rownames_to_column(var="PC") |> 
  rename(percent=eigenvalue, cumulative=`cumulative percentage of variance`) |> 
  ggplot() + 
  aes(x=PC, y=percent, label=round(cumulative,2)) +
  geom_text(angle=45, vjust=-1, hjust=-0.1) + 
  geom_col(fill=NA, colour="black") +
  ylab("Squared singular values") +
  ylim(c(0, .4)) +
  labs(
    title="Screeplot for CA",
    subtitle = "Mortality 1979: Age Group versus Causes of Death"
  )

Row profiles analysis

Question

Perform row profiles analysis.

What are the classical plots? How can you build them from the output of FactoMineR::CA()?

Build the table of row contributions (the so-called \(\cos^2\)).

Row profiles analysis

Attribute row of objects of class CA (exported from FactoMineR) is the starting point of any row profiles analysis.

Code
ca_79_row <- ca_79$row

Attribute row is a named list made of \(4\) components.

coord
a matrix with named rows and columns. The number of rows of coord matches the number of rows of the contingency table (here, the number of possible causes of death). The number of columns matches the rank of the truncated SVD that underlies Correspondence Analysis. Here it is \(8\) (the value of ncp we passed), which is also the rank of the standardized contingency table.

The row principal coordinates are the coordinates of each row profile with respect to the principal axes.

The columns of coord are pairwise orthogonal in the inner product space defined by diag(call$marge.row) (which embodies the marginal probabilities of the causes of death).

Code
x <- ca_79$row$coord
r <- ca_79$call$marge.row

A <- round(t(x) %*% diag(r) %*% x, 2)

is_diagonal <- function (A, tol=1e-2){
  norm(diag(diag(A))-A, type='F') <= tol
}

# We expect A to be diagonal
is_diagonal(A)
[1] TRUE

We can recover row$coord from the left singular vectors and the singular values:

Code
with(ca_79,   
  norm(row$coord - with(svd, U %*% diag(vs[1:ca_79$call$ncp])), 'F')
)
[1] 0
Code
prep_rows <- ca_79_row$coord |> 
  as_tibble() |> 
  mutate(name= rownames(ca_79_row$coord)) |> 
  relocate(name) |> 
  mutate(prop=r, inertia=ca_79_row$inertia) 



prep_rows |> 
  mutate(across(where(is.numeric), \(x) round(x,2))) |> 
  DT::datatable()
inertia
a numerical vector with length matching the number of rows of coord, contrib and cos2.

Inertia is the way CA measures variation between row profiles. Total inertia is the \(\chi^2\) statistic divided by sample size.

Row inertia can be obtained by multiplying the row marginal probability by the squared Euclidean norm of the row in the principal coordinate matrix.

Code
with (ca_79_row,
  sum(abs(r* (rowSums(coord^2)) - inertia))
)
[1] 3.004223e-16
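
The identity "total inertia = \(\chi^2 / N\)" can also be checked directly (wrapping chisq.test() in suppressWarnings(), since small expected counts trigger a warning, as seen below for the 2006 table):

Code
# Total inertia should match the Pearson chi-squared statistic divided by N
chi2_79 <- suppressWarnings(chisq.test(ca_79$call$X))$statistic
c(chi2_79 / sum(ca_79$call$X), total_inertia = sum(ca_79_row$inertia))
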
cos2
Coefficients of matrix cos2 are the share of row inertia coming from the corresponding cell in coord.
Code
with (ca_79_row,
  norm((diag(r/inertia) %*% coord^2) - cos2, type='F')
)
[1] 5.542062e-16
contrib
a matrix with the same shape as coord: the contribution (in percentage) of each row to the inertia of each principal axis.

Not too surprisingly, coord, contrib, and cos2 share the same row names and column names.
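
Assuming contrib follows the usual CA definition (the percentage contribution of each row to the inertia of each axis, i.e. \(100 \, r_i \, \text{coord}_{ij}^2 / \lambda_j\), which is what FactoMineR documents), it can be checked in the same spirit as cos2:

Code
with(ca_79_row, {
  eigs <- ca_79$eig[1:ncol(coord), 1]   # eigenvalues (principal inertias)
  # contrib[i, j] should equal 100 * r_i * coord[i, j]^2 / eigs[j]
  norm(100 * diag(r) %*% coord^2 %*% diag(1/eigs) - contrib, type = 'F')
})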

Code
sum(ca_79$call$X)
[1] 529974
Code
sum((rowSums(ca_79$call$X)/sum(ca_79$call$X) - r)^2)
[1] 6.311339e-35

The row profiles are the rows of matrix R below.

Code
P <- as.matrix(with(ca_79$call, Xtot/N))  # correspondence matrix
coord <- ca_79_row$coord
inertia <- ca_79_row$inertia

r <- ca_79$call$marge.row  # row margins
c <- colSums(P)            # column margins

n <- nrow(P)
p <- ncol(P)

R <- diag(r^(-1)) %*% P    # row profiles

Q <- R - matrix(1, nrow = n, ncol = n) %*% P  # centered row profiles: each row of the product is c^T
Code
M <- diag(r^(-1)) %*% P %*% diag(c^(-1)) - matrix(1, nrow=n, ncol=p)

# Squared Frobenius norm of the standardized residuals: the total inertia
norm(diag(r^(1/2)) %*% M %*% diag(c^(1/2)), type = "F")^2
[1] 0.4740773

We can now display scatterplots from component coord. This is called a Row Plot.

Code
p_scat <-  ( 
  prep_rows |> 
    ggplot() +
    aes(x=`Dim 1`, y=`Dim 2`, label=name) +
    geom_point() +
    coord_fixed() 
  ) 

p_scat |> plotly::ggplotly()

With little effort, it is possible to scale the points so as to tell the reader the relative numerical importance of each cause of death. Coloring/filling the points using inertia also helps: high inertia rows match light-colored points.

Code
ppp <- prep_rows |> 
    ggplot() +
    aes(x=`Dim 1`, 
        y=`Dim 2`, 
        label=name, 
        size=prop, 
        fill=log10(inertia),
        color=log10(inertia)) +
    geom_point(alpha=0.75) +
    scale_size_area() +
    coord_fixed() +
    scale_fill_viridis_c(aesthetics=c("fill", "color"), direction = 1) +
    ggtitle(
      "Mortality France 1979: Row plot"
    )
Code
ppp |> plotly::ggplotly()
Code
# (ca_79$row)$contrib

Column profiles analysis

Code
names(ca_79_row)
[1] "coord"   "contrib" "cos2"    "inertia"
Code
age_group_names <-  str_match(rownames(ca_79$col$coord), '([\\w \\-]*) \\(79\\)')[,2]

prep_cols <- ca_79$col$coord |> 
  as_tibble() |> 
  mutate(name= age_group_names) |> 
  relocate(name) |> 
  mutate(prop=c, inertia=ca_79$col$inertia)
Code
(
  prep_cols |> 
    ggplot() +
    aes(x=`Dim 1`, 
        y=`Dim 2`, 
        label=name, 
        size=prop, 
        fill=log10(inertia),
        color=log10(inertia)) +
    geom_point(alpha=0.75) +
    scale_size_area() +
    coord_fixed() +
    scale_fill_viridis_c(aesthetics=c("fill", "color"),direction = 1) +
    ggtitle(
      "Mortality France 1979: Col plot"
    )) |> plotly::ggplotly()

Symmetric plots

Question

Build the symmetric plots (biplots) for the correspondence analysis of the mortality data.

Off the shelf

Code
plot.CA(ca_79)

Homemade

Code
(
prep_rows |> 
    ggplot() +
    aes(x=`Dim 1`, 
        y=`Dim 2`, 
        label=name, 
        size=prop, 
        fill=log10(inertia),
        color=log10(inertia)) +
    geom_point(alpha=0.75) +
    scale_size_area() +
    coord_fixed() +
    scale_fill_viridis_c(aesthetics=c("fill", "color"),direction = 1) +
    geom_point(data = prep_cols,
      aes(x=`Dim 1`, 
        y=`Dim 2`, 
        label=name, 
        size=prop, 
        fill=log10(inertia),
        color=log10(inertia)
      ),
      shape="square",
      alpha=.5,      
    )
) |> plotly::ggplotly()
Warning in geom_point(data = prep_cols, aes(x = `Dim 1`, y = `Dim 2`, label =
name, : Ignoring unknown aesthetics: label

It would be convenient to use distinct color scales for rows and columns.

Code
(
prep_rows |> 
    ggplot() +
    scale_size_area() +
    coord_fixed() +
    aes(x=`Dim 1`, 
        y=`Dim 2`, 
        text=name, 
        size=prop, 
        fill=log10(inertia)) +
    geom_point(alpha=0.75) +
    scale_fill_viridis_c(option="D") +
    geom_point(data = prep_cols,
      aes(x=`Dim 1`, 
        y=`Dim 2`, 
        text=name, 
        size=prop, 
        color=log10(inertia)
      ),
      shape="square",
      alpha=.5,      
    ) +
    scale_color_viridis_c(option="F") +
    theme_minimal(
    )  
)  |> plotly::ggplotly()
Warning in geom_point(data = prep_cols, aes(x = `Dim 1`, y = `Dim 2`, text =
name, : Ignoring unknown aesthetics: text
Warning in L$marker$color[idx] <- aes2plotly(data, params, "fill")[idx]: number
of items to replace is not a multiple of replacement length

Mosaicplots

Question

Mosaic plots provide an alternative way of exploring contingency tables. They are particularly convenient for 2-way contingency tables.

Draw mosaic plots for the two contingency tables living inside the mortality dataset.

Code
mortality |> 
  select(ends_with('(06)')) |> 
  chisq.test() |> 
  glance()
Warning in chisq.test(select(mortality, ends_with("(06)"))): Chi-squared
approximation may be incorrect
# A tibble: 1 × 4
  statistic p.value parameter method                    
      <dbl>   <dbl>     <int> <chr>                     
1   229784.       0       488 Pearson's Chi-squared test
Code
mortality |> 
  select(ends_with('(06)')) |> 
  as.matrix() |> 
  as.table() |> 
  mosaicplot(color = T)

Question

Are you able to deliver an interpretation of this Correspondence Analysis?

Hierarchical clustering of row profiles

Question

Build the standardized matrix for row profiles analysis. Compute the pairwise distance matrix using the \(\chi^2\) distance. Should you work with centered row profiles?
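
A minimal sketch, reusing R, r, and c computed above: the \(\chi^2\) distance between two row profiles is the Euclidean distance after rescaling column \(j\) of the row-profile matrix by \(1/\sqrt{c_j}\). Centering the profiles (subtracting \(c^T\) from every row) leaves pairwise distances unchanged, so it is not required here.

Code
# Chi-squared distances between row profiles: Euclidean distance on the
# row-profile matrix with column j rescaled by 1 / sqrt(c_j)
R_scaled <- sweep(R, 2, sqrt(c), FUN = "/")
rownames(R_scaled) <- rownames(mortality)

d_chi2 <- dist(R_scaled)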

Question

Perform hierarchical clustering of row profiles with method/linkage "single". Check the definition of the method. Did you know the underlying algorithm? If yes, in which context did you get acquainted with this algorithm?
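
A sketch building on d_chi2 above. Single linkage merges, at each step, the two clusters whose closest members are nearest; the underlying computation is essentially that of a minimum spanning tree, familiar from Kruskal's or Prim's algorithm.

Code
hc_single <- hclust(d_chi2, method = "single")
plot(hc_single, cex = 0.5,
     main = "Single-linkage clustering of row profiles (1979)")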

Question

Choose the number of classes (provide justification).
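
One common justification looks at the merge heights: a large jump between consecutive heights suggests cutting just below it. A sketch (the choice k = 2 below is a hypothetical placeholder, to be justified from the dendrogram):

Code
# A large gap between consecutive merge heights suggests a natural cut
tail(sort(hc_single$height), 10)

k <- 2                        # hypothetical choice, to be justified
classes <- cutree(hc_single, k = k)
table(classes)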

Question

Can you xxplain the size of the different classes in the partition?

Atypical row profiles

Question

Row profiles that do not belong to the majority class are called atypical.

  1. Compute the share of inertia of atypical row profiles (a sketch follows this list).

  2. Draw a symmetric plot (biplot) outlining the atypical row profiles.
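
A sketch for item 1, reusing the classes vector from the clustering sketch above and the row inertias stored by FactoMineR:

Code
# Majority class = the largest class in the partition
majority <- names(which.max(table(classes)))
atypical <- classes != as.integer(majority)

# Share of total inertia carried by atypical row profiles
sum(ca_79_row$inertia[atypical]) / sum(ca_79_row$inertia)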

Investigating independence/association

Question
  1. Calculate the theoretical (expected) counts table for deces under independence (see the sketch after this list). Is it possible to carry out a chi-squared test?

  2. Perform a hierarchical classification of the row profiles into two classes.

  3. Merge the rows of deces corresponding to the same class (you can use the tapply function), and perform a chi-squared test. What is the conclusion?

  4. Why is this grouping into two classes more advantageous than an arbitrary grouping into two classes when it comes to establishing the dependence between these two variables?
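
For item 1, a minimal sketch of the expected (theoretical) counts under independence, computed from the margins; chisq.test() builds the same table internally, and the warning it emitted above reflects how small some of these expected counts are:

Code
X_79 <- mortality |>
  select(ends_with('(79)')) |>
  as.matrix()

# Expected counts under independence: (row total) x (column total) / N
expected <- outer(rowSums(X_79), colSums(X_79)) / sum(X_79)

# Small expected counts undermine the chi-squared approximation
min(expected)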

About the “average profile”

Question
  1. Represent the individuals from the majority class. Do they all seem to correspond to an average profile?

  2. Try to explain this phenomenon, considering how hierarchical classification with single linkage proceeds.

Caveat

The mortality dataset should be taken with a grain of salt. Assigning a single cause to every death is not a trivial task. It is even questionable: if somebody dies from some infection because she could not be cured with an available drug due to another preexisting pathology, who is the culprit?