Published

March 20, 2024

Code
stopifnot(
  require(tidyverse),
  require(patchwork),
  require(httr),
  require(glue),
  require(broom),
  require(DT),
  require(GGally),
  require(ggforce),
  require(ggfortify),
  require(testthat),
  require(viridisLite)
)

tidymodels::tidymodels_prefer(quiet = TRUE)
Code
old_theme <- theme_set(
  theme_minimal(base_size = 9,
                base_family = "Helvetica")
)

Correspondence Analysis

The mortality dataset

The goal is to investigate a possible link between age group and cause of death. We work with the mortality dataset from package FactoMineR.

Code
stopifnot(
  require(FactoMineR),
  require(factoextra),
  require(FactoInvestigate)
)
## Loading required package: FactoMineR
## Loading required package: factoextra
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
## Loading required package: FactoInvestigate

data("mortality", package = "FactoMineR")
Code
#help(mortality)

A data frame with 62 rows (the different causes of death) and 18 columns. Each column corresponds to an age interval (15-24, 25-34, 35-44, 45-54, 55-64, 65-74, 75-84, 85-94, 95 and more) in a given year. The first 9 columns correspond to data from 1979 and the last 9 columns to data from 2006. Each cell gives the number of deaths for a cause of death in an age interval (in a given year).
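
A quick check of this description (the column-name format, e.g. "15-24 (79)", is what the select(ends_with(...)) calls used below rely on):

Code
dim(mortality)                # 62 causes x 18 age-group columns
head(colnames(mortality), 2)  # column names end with "(79)" or "(06)"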

Source
Centre d’épidémiologie sur les causes médicales de décès

See also EuroStat:

Question

Read the documentation of the mortality dataset. Is this a sample? an aggregated dataset?

If you consider mortality as an aggregated dataset, can you figure out the organization of the sample mortality was built from?

The mortality dataset is an aggregated dataset. It has been built from two samples. Each sample was built from the collection of death certificates for one calendar year in France (years 1979 and 2006). From each death certificate, two categorical pieces of information were extracted: the age group of the deceased and a cause of death. Each sample was then grouped by age group and cause of death, and counts were computed. This defines a two-way contingency table in long form. The contingency table in wide form is obtained by pivoting: pick column names from column age group and values from counts. Column cause of death provides the row names.

The final form of the dataset is obtained by concatenating the two contingency tables along the second axis.
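
As a minimal sketch of that pivot (with a hypothetical long-form tibble counts_long; the cause labels below are made up):

Code
# Hypothetical long-form counts: one row per (cause, age group) pair
counts_long <- tibble::tribble(
  ~cause,    ~age_group, ~n,
  "cause A", "15-24",    10,
  "cause A", "25-34",    12,
  "cause B", "15-24",    30,
  "cause B", "25-34",    40
)

counts_long |>
  tidyr::pivot_wider(names_from = age_group, values_from = n) |>
  tibble::column_to_rownames("cause")   # causes of death as row names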

Code
mortality |> 
    select(ends_with('(06)')) |> 
    DT::datatable(extensions = list("Responsive"))

Elementary statistics and table wrangling

Before proceeding to Correspondence Analysis (CA), let us draw some elementary plots. As ggplot is adapted to data in long format, we start by partially pivoting mortality, so as to obtain a tibble with columns cause and year, while keeping all columns named after age groups. We use rowwise() and sum(c_across()) to compute the total number of deaths per year and cause in column total; this mimics rowSums() inside a pipeline. Column grand_total is computed using a window function after grouping by cause.

Code
mortality_long <- mortality |> 
  as_tibble() |> 
  mutate(cause = rownames(mortality)) |> 
  mutate(cause=as_factor(cause)) |> 
  relocate(cause) |> 
  pivot_longer(
    cols=-cause,
    cols_vary="slowest",
    names_to=c(".value", "year"),
    names_pattern="([\\w\\- ]*) \\(([0-9]{2})\\)"
  )  |> 
  mutate(year=ifelse(year=='06', 2006, 1979)) |> 
  rowwise() |> 
  mutate(total=sum(c_across(-c(cause, year)))) |> 
  group_by(cause) |> 
  mutate(grand_total = sum(total)) |> 
  ungroup()
Code
th <- theme_get()
(
mortality_long |> 
  ggplot() +
  scale_fill_discrete() +
  aes(x=fct_reorder(cause, grand_total), 
      y=total, 
      fill=as_factor(year), 
      label=cause) +
  geom_col(position=position_dodge()) +
  theme(
    legend.position="none",
    axis.text.x=element_blank(), #remove x axis labels
    axis.ticks.x=element_blank(), #remove x axis ticks
  ) +
  labs(
    title = "Causes of death, France, 1979, 2006",
    subtitle= "Raw counts"
  ) +
  xlab(label=NULL)
) |> plotly::ggplotly()
Code
oth <- theme_set(th)
Code
mortality_long |> 
  group_by(year) |> 
  summarise(total_deaths = sum(total))
# A tibble: 2 × 2
   year total_deaths
  <dbl>        <int>
1  1979       529974
2  2006       510921
Question

Compute the marginal counts for each year (1979, 2006). Compare.

Counts have already been computed above.

Code
mortality_long |> 
  select(cause, year, total) |> 
  pivot_wider(id_cols=c(cause), names_from = year, values_from = total) |> 
  DT::datatable()

Correspondence Analysis

CA executive summary
  • Start from a 2-way contingency table \(X\) with \(\sum_{i,j} X_{i,j}=N\)

  • Normalize \(P = \frac{1}{N}X\) (the correspondence matrix)

  • Let \(r\) (resp. \(c\)) be the vector of row (resp. column) sums of \(P\)

  • Let \(D_r=\text{diag}(r)\) denote the diagonal matrix with row sums of \(P\) as coefficients

  • Let \(D_c=\text{diag}(c)\) denote the diagonal matrix with column sums of \(P\) as coefficients

  • The row profiles matrix is \(D_r^{-1} \times P\)

  • The standardized residuals matrix is \(S = D_r^{-1/2} \times \left(P - r c^T\right) \times D_c^{-1/2}\)

CA consists in computing the SVD of the standardized residuals matrix \(S = U \times D \times V^T\)

From the SVD, we get:

  • \(D_r^{-1/2} \times U\): standardized coordinates of rows

  • \(D_c^{-1/2} \times V\): standardized coordinates of columns

  • \(D_r^{-1/2} \times U \times D\): principal coordinates of rows

  • \(D_c^{-1/2} \times V \times D\): principal coordinates of columns

  • squared singular values: the principal inertias

When calling svd(.), the argument should be \[D_r^{1/2}\times \left(D_r^{-1} \times P \times D_c^{-1}- \mathbf{1}\times \mathbf{1}^T \right)\times D_c^{1/2}\]

CA and extended SVD

As \[D_r^{-1} \times P \times D_c^{-1} - \mathbf{1}\mathbf{1}^T = (D_r^{-1/2} \times U)\times D \times (D_c^{-1/2}\times V)^T,\]

\((D_r^{-1/2} \times U)\times D \times (D_c^{-1/2}\times V)^T\) is the extended SVD of \[D_r^{-1} \times P \times D_c^{-1} - \mathbf{1}\mathbf{1}^T\] with respect to \(D_r\) and \(D_c\).

Question

Perform CA on the two contingency tables.

You may use FactoMineR::CA(). It is also interesting to compute the correspondence analysis in your own way, by preparing the matrix that is handed to svd() and returning a named list containing all relevant information (a sketch follows the code below).


Code
lst_ca <- list()

for (y in c('79', '06')) {
  lst_ca[[y]] <- mortality |> 
    select(ends_with(glue('({y})'))) |> 
    FactoMineR::CA(ncp=8, graph = F)
}
Code
lst <- map(c('79', '06'), 
             \(x) select(mortality, ends_with(glue('({x})'))) |>
             FactoMineR::CA(ncp=8, graph = F)
           )
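
As for the home-made route suggested above, here is a minimal sketch following the executive summary: build the standardized residuals matrix \(S\), hand it to svd(), and return a named list. The function name my_ca and the names of the returned components are our own choices, not FactoMineR's.

Code
my_ca <- function(X) {
  X <- as.matrix(X)
  N <- sum(X)
  P <- X / N        # correspondence matrix
  r <- rowSums(P)   # row margins
  c <- colSums(P)   # column margins

  # Standardized residuals: S = D_r^{-1/2} (P - r c^T) D_c^{-1/2}
  S <- diag(r^(-1/2)) %*% (P - outer(r, c)) %*% diag(c^(-1/2))
  udv <- svd(S)

  list(
    inertia   = udv$d^2,                                   # principal inertias
    row_coord = diag(r^(-1/2)) %*% udv$u %*% diag(udv$d),  # principal coordinates of rows
    col_coord = diag(c^(-1/2)) %*% udv$v %*% diag(udv$d),  # principal coordinates of columns
    r = r,
    c = c
  )
}

home_ca <- my_ca(select(mortality, ends_with('(79)')))

# The leading squared singular value should match FactoMineR's first eigenvalue
c(home_ca$inertia[1], lst_ca[['79']]$eig[1, 1])
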
Question

If you did use FactoMineR::CA(), explain the organization of the result.

The result of FactoMineR::CA(...) is a named and nested list with five elements:

eig
a matrix/array containing enough information to build a screeplot.
call
a list of 9, containing the call to CA(), an object of type language, telling (in principle) the user how CA() was called. However, this is a quoted expression: here we need to guess the value of y in the calling environment to understand what's going on.
Code
lst_ca[[1]]$call$call
FactoMineR::CA(X = select(mortality, ends_with(glue("({y})"))), 
    ncp = 8, graph = F)

Element call also contains the table margin distributions marge.col and marge.row. The truncation rank ncp (the number of components) can be set before computing the SVD (default value is 5). Element \(X\) stores the contingency table that was effectively used for computing the Correspondence Analysis.

row
Information gathered from SVD to facilitate row profiles analysis.
col
a list structured in the same way as element row, used for column profiles analysis.
svd
a list of 3, just as the result of svd(), containing the singular values and the left and right singular vectors of the standardized residuals matrix \(S\).

In principle, all relevant information can be gathered from components svd, call$marge.row, and call$marge.col.

Screeplots

Question

Draw screeplots. Why are they useful? Comment briefly.

Code
ca_79 <- lst_ca[[1]]

ca_79$eig |> 
  as_tibble() |> 
  mutate(across(where(is.numeric), ~ round(.x, digits=2))) |> 
  DT::datatable(extensions = c("Responsive"))

Screeplot

Code
ca_79$eig |> 
  as_tibble() |> 
  rownames_to_column(var="PC") |> 
  rename(percent=eigenvalue, cumulative=`cumulative percentage of variance`) |> 
  ggplot() + 
  aes(x=PC, y=percent, label=round(cumulative,2)) +
  geom_text(angle=45, vjust=-1, hjust=-0.1) + 
  geom_col(fill=NA, colour="black") +
  ylab("Squared singular values") +
  ylim(c(0, .4)) +
  labs(
    title="Screeplot for CA",
    subtitle = "Mortality 1979: Age Group versus Causes of Death"
  )

Row profiles analysis

Question

Perform row profiles analysis.

What are the classical plots? How can you build them from the output of FactoMineR::CA()?

Build the table of row contributions (the so-called \(\cos^2\)).

Row profiles analysis

Attribute row of objects of class CA (exported from FactoMineR) is the starting point of any row profiles analysis.

Code
ca_79_row <- ca_79$row

Attribute row is a named list made of \(4\) components.

coord
a matrix with named rows and columns. The number of rows of coord matches the number of rows of the contingency table (here, the number of possible causes of death). The number of columns matches the rank of the truncated SVD that underlies Correspondence Analysis. Here it is \(8\) (the value of ncp we passed), which is also the rank of the standardized contingency table.

The row principal coordinates are the coordinates of each row profile with respect to the principal axes.

The columns of coord are pairwise orthogonal in the inner product space defined by diag(call$marge.row) (which embodies the marginal probabilities of the causes of death).

Code
x <- ca_79$row$coord
r <- ca_79$call$marge.row

A <- round(t(x) %*% diag(r) %*% x, 2)

is_diagonal <- function (A, tol=1e-2){
  norm(diag(diag(A))-A, type='F') <= tol
}

# We expect A to be diagonal
is_diagonal(A)
[1] TRUE

We can recover row$coord from the left singular vectors and the singular values:

Code
with(ca_79,   
  norm(row$coord - with(svd, U %*% diag(vs[1:ca_79$call$ncp])), 'F')
)
[1] 0
Code
prep_rows <- ca_79_row$coord |> 
  as_tibble() |> 
  mutate(name= rownames(ca_79_row$coord)) |> 
  relocate(name) |> 
  mutate(prop=r, inertia=ca_79_row$inertia) 



prep_rows |> 
  mutate(across(where(is.numeric), \(x) round(x,2))) |> 
  DT::datatable()
inertia
a numerical vector with length matching the number of rows of coord, contrib and cos2.

Inertia is the way CA measures variation between row profiles. Total inertia is the \(\chi^2\) statistic divided by sample size.

Row inertia can be obtained by multiplying the row marginal probability by the squared Euclidean norm of the row in the principal coordinate matrix.

Code
with (ca_79_row,
  sum(abs(r* (rowSums(coord^2)) - inertia))
)
[1] 3.004223e-16
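
The identity "total inertia = \(\chi^2 / N\)" can also be checked directly (wrapping chisq.test() in suppressWarnings(), since small expected counts trigger a warning, as seen below for the 2006 table):

Code
# Total inertia should match the Pearson chi-squared statistic divided by N
chi2_79 <- suppressWarnings(chisq.test(ca_79$call$X))$statistic
c(chi2_79 / sum(ca_79$call$X), total_inertia = sum(ca_79_row$inertia))
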
cos2
Coefficients of matrix cos2 are the share of row inertia coming from the corresponding cell in coord.
Code
with (ca_79_row,
  norm((diag(r/inertia) %*% coord^2) - cos2, type='F')
)
[1] 5.542062e-16
contrib
a matrix with the same shape as coord: the contribution (in percentage) of each row to the inertia of each principal axis.

Not too surprisingly, coord, contrib, and cos2 share the same row names and column names.
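
Assuming contrib follows the usual CA definition (the percentage contribution of each row to the inertia of each axis, i.e. \(100 \, r_i \, \text{coord}_{ij}^2 / \lambda_j\), which is what FactoMineR documents), it can be checked in the same spirit as cos2:

Code
with(ca_79_row, {
  eigs <- ca_79$eig[1:ncol(coord), 1]   # eigenvalues (principal inertias)
  # contrib[i, j] should equal 100 * r_i * coord[i, j]^2 / eigs[j]
  norm(100 * diag(r) %*% coord^2 %*% diag(1/eigs) - contrib, type = 'F')
})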

Code
sum(ca_79$call$X)
[1] 529974
Code
sum((rowSums(ca_79$call$X)/sum(ca_79$call$X) - r)^2)
[1] 6.311339e-35

The row profiles are the rows of matrix R below.

Code
P <- as.matrix(with(ca_79$call, Xtot/N))  # correspondence matrix
coord <- ca_79_row$coord
inertia <- ca_79_row$inertia

r <- ca_79$call$marge.row  # row margins
c <- colSums(P)            # column margins

n <- nrow(P)
p <- ncol(P)

R <- diag(r^(-1)) %*% P    # row profiles

Q <- R - matrix(1, nrow = n, ncol = n) %*% P  # centered row profiles: each row of the product is c^T
Code
M <- diag(r^(-1)) %*% P %*% diag(c^(-1)) - matrix(1, nrow=n, ncol=p)

# Squared Frobenius norm of the standardized residuals: the total inertia
norm(diag(r^(1/2)) %*% M %*% diag(c^(1/2)), type = "F")^2
[1] 0.4740773

We can now display scatterplots from component coord. This is called a Row Plot.

Code
p_scat <-  ( 
  prep_rows |> 
    ggplot() +
    aes(x=`Dim 1`, y=`Dim 2`, label=name) +
    geom_point() +
    coord_fixed() 
  ) 

p_scat |> plotly::ggplotly()

With little effort, it is possible to scale the points so as to tell the reader the relative numerical importance of each cause of death. Coloring/filling the points using inertia also helps: high inertia rows match light-colored points.

Code
ppp <- prep_rows |> 
    ggplot() +
    aes(x=`Dim 1`, 
        y=`Dim 2`, 
        label=name, 
        size=prop, 
        fill=log10(inertia),
        color=log10(inertia)) +
    geom_point(alpha=0.75) +
    scale_size_area() +
    coord_fixed() +
    scale_fill_viridis_c(aesthetics=c("fill", "color"), direction = 1) +
    ggtitle(
      "Mortality France 1979: Row plot"
    )
Code
ppp |> plotly::ggplotly()
Code
# (ca_79$row)$contrib

Column profiles analysis

Code
names(ca_79_row)
[1] "coord"   "contrib" "cos2"    "inertia"
Code
age_group_names <-  str_match(rownames(ca_79$col$coord), '([\\w \\-]*) \\(79\\)')[,2]

prep_cols <- ca_79$col$coord |> 
  as_tibble() |> 
  mutate(name= age_group_names) |> 
  relocate(name) |> 
  mutate(prop=c, inertia=ca_79$col$inertia)
Code
(
  prep_cols |> 
    ggplot() +
    aes(x=`Dim 1`, 
        y=`Dim 2`, 
        label=name, 
        size=prop, 
        fill=log10(inertia),
        color=log10(inertia)) +
    geom_point(alpha=0.75) +
    scale_size_area() +
    coord_fixed() +
    scale_fill_viridis_c(aesthetics=c("fill", "color"),direction = 1) +
    ggtitle(
      "Mortality France 1979: Col plot"
    )) |> plotly::ggplotly()

Symmetric plots

Question

Build the symmetric plots (biplots) for the correspondence analysis of the mortality data.

Off the shelf

Code
plot.CA(ca_79)

Homemade

Code
(
prep_rows |> 
    ggplot() +
    aes(x=`Dim 1`, 
        y=`Dim 2`, 
        label=name, 
        size=prop, 
        fill=log10(inertia),
        color=log10(inertia)) +
    geom_point(alpha=0.75) +
    scale_size_area() +
    coord_fixed() +
    scale_fill_viridis_c(aesthetics=c("fill", "color"),direction = 1) +
    geom_point(data = prep_cols,
      aes(x=`Dim 1`, 
        y=`Dim 2`, 
        label=name, 
        size=prop, 
        fill=log10(inertia),
        color=log10(inertia)
      ),
      shape="square",
      alpha=.5,      
    )
) |> plotly::ggplotly()
Warning in geom_point(data = prep_cols, aes(x = `Dim 1`, y = `Dim 2`, label =
name, : Ignoring unknown aesthetics: label

It would be convenient to use distinct color scales for rows and columns.

Code
(
prep_rows |> 
    ggplot() +
    scale_size_area() +
    coord_fixed() +
    aes(x=`Dim 1`, 
        y=`Dim 2`, 
        text=name, 
        size=prop, 
        fill=log10(inertia)) +
    geom_point(alpha=0.75) +
    scale_fill_viridis_c(option="D") +
    geom_point(data = prep_cols,
      aes(x=`Dim 1`, 
        y=`Dim 2`, 
        text=name, 
        size=prop, 
        color=log10(inertia)
      ),
      shape="square",
      alpha=.5,      
    ) +
    scale_color_viridis_c(option="F") +
    theme_minimal(
    )  
)  |> plotly::ggplotly()
Warning in geom_point(data = prep_cols, aes(x = `Dim 1`, y = `Dim 2`, text =
name, : Ignoring unknown aesthetics: text
Warning in L$marker$color[idx] <- aes2plotly(data, params, "fill")[idx]: number
of items to replace is not a multiple of replacement length

Mosaicplots

Question

Mosaic plots provide an alternative way of exploring contingency tables. They are particularly convenient for 2-way contingency tables.

Draw mosaic plots for the two contingency tables living inside the mortality dataset.

Code
mortality |> 
  select(ends_with('(06)')) |> 
  chisq.test() |> 
  glance()
Warning in chisq.test(select(mortality, ends_with("(06)"))): Chi-squared
approximation may be incorrect
# A tibble: 1 × 4
  statistic p.value parameter method                    
      <dbl>   <dbl>     <int> <chr>                     
1   229784.       0       488 Pearson's Chi-squared test
Code
mortality |> 
  select(ends_with('(06)')) |> 
  as.matrix() |> 
  as.table() |> 
  mosaicplot(color = T)

Question

Are you able to deliver an interpretation of this Correspondence Analysis?

Hierarchical clustering of row profiles

Question

Build the standardized matrix for row profiles analysis. Compute the pairwise distance matrix using the \(\chi^2\) distance. Should you work with centered row profiles?
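
A minimal sketch, reusing R, r, and c computed above: the \(\chi^2\) distance between two row profiles is the Euclidean distance after rescaling column \(j\) of the row-profile matrix by \(1/\sqrt{c_j}\). Centering the profiles (subtracting \(c^T\) from every row) leaves pairwise distances unchanged, so it is not required here.

Code
# Chi-squared distances between row profiles: Euclidean distance on the
# row-profile matrix with column j rescaled by 1 / sqrt(c_j)
R_scaled <- sweep(R, 2, sqrt(c), FUN = "/")
rownames(R_scaled) <- rownames(mortality)

d_chi2 <- dist(R_scaled)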

Question

Perform hierarchical clustering of row profiles with method/linkage "single". Check the definition of the method. Did you know the underlying algorithm? If yes, in which context did you get acquainted with this algorithm?
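
A sketch building on d_chi2 above. Single linkage merges, at each step, the two clusters whose closest members are nearest; the underlying computation is essentially that of a minimum spanning tree, familiar from Kruskal's or Prim's algorithm.

Code
hc_single <- hclust(d_chi2, method = "single")
plot(hc_single, cex = 0.5,
     main = "Single-linkage clustering of row profiles (1979)")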

Question

Choose the number of classes (provide justification).
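
One common justification looks at the merge heights: a large jump between consecutive heights suggests cutting just below it. A sketch (the choice k = 2 below is a hypothetical placeholder, to be justified from the dendrogram):

Code
# A large gap between consecutive merge heights suggests a natural cut
tail(sort(hc_single$height), 10)

k <- 2                        # hypothetical choice, to be justified
classes <- cutree(hc_single, k = k)
table(classes)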

Question

Can you xxplain the size of the different classes in the partition?

Atypical row profiles

Question

Row profiles that do not belong to the majority class are called atypical.

  1. Compute the share of inertia of atypical row profiles (a sketch follows this list).

  2. Draw a symmetric plot (biplot) outlining the atypical row profiles.
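
A sketch for item 1, reusing the classes vector from the clustering sketch above and the row inertias stored by FactoMineR:

Code
# Majority class = the largest class in the partition
majority <- names(which.max(table(classes)))
atypical <- classes != as.integer(majority)

# Share of total inertia carried by atypical row profiles
sum(ca_79_row$inertia[atypical]) / sum(ca_79_row$inertia)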

Investigating independence/association

Question
  1. Calculate the theoretical (expected) counts table for deces under independence (see the sketch after this list). Is it possible to carry out a chi-squared test?

  2. Perform a hierarchical classification of the row profiles into two classes.

  3. Merge the rows of deces corresponding to the same class (you can use the tapply function), and perform a chi-squared test. What is the conclusion?

  4. Why is this grouping into two classes more advantageous than an arbitrary grouping into two classes when it comes to establishing the dependence between these two variables?
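
For item 1, a minimal sketch of the expected (theoretical) counts under independence, computed from the margins; chisq.test() builds the same table internally, and the warning it emitted above reflects how small some of these expected counts are:

Code
X_79 <- mortality |>
  select(ends_with('(79)')) |>
  as.matrix()

# Expected counts under independence: (row total) x (column total) / N
expected <- outer(rowSums(X_79), colSums(X_79)) / sum(X_79)

# Small expected counts undermine the chi-squared approximation
min(expected)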

About the “average profile”

Question
  1. Represent the individuals from the majority class. Do they all seem to correspond to an average profile?

  2. Try to explain this phenomenon, considering how hierarchical classification with single linkage proceeds.

Caveat

The mortality dataset should be taken with a grain of salt. Assigning a single cause to every death is not a trivial task. It is even questionable: if somebody dies from some infection because she could not be cured with an available drug due to another preexisting pathology, who is the culprit?