Published

March 20, 2024

Code
stopifnot(
  require(tidyverse),
  require(patchwork),
  require(httr),
  require(glue),
  require(broom),
  require(DT),
  require(GGally),
  require(ggforce),
  require(ggfortify),
  require(testthat),
  require(viridisLite)
)

tidymodels::tidymodels_prefer(quiet = TRUE)
Code
old_theme <- theme_set(
  theme_minimal(base_size=9, 
                base_family = "Helvetica")              
  )

Correspondence Analysis

The mortality dataset

The goal is to investigate a possible link between age group and cause of death. We work with dataset mortality from package FactoMineR

Code
stopifnot(
  require(FactoMineR),
  require(factoextra),
  require(FactoInvestigate)
)
## Loading required package: FactoMineR
## Loading required package: factoextra
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
## Loading required package: FactoInvestigate

data("mortality", package = "FactoMineR")
Code
#help(mortality)

A data frame with 62 rows (the different causes of death) and 18 columns. Each column corresponds to an age interval (15-24, 25-34, 35-44, 45-54, 55-64, 65-74, 75-84, 85-94, 95 and more) in a year. The 9 first columns correspond to data in 1979 and the 9 last columns to data in 2006. In each cell, the counts of deaths for a cause of death in an age interval (in a year) is given.

Source
Centre d’épidemiologie sur les causes de décès médicales

See also EuroStat:

Question

Read the documentation of the mortality dataset. Is this a sample? an aggregated dataset?

If you consider mortality as an agregated dataset, can you figure out the organization of the sample mortality was built from?

The mortality dataset is aggregated dataset. It has been built from two samples. Each sample was built from the collection of death certificates from one calendar year in France (years 1999 and 2006). From each death certificate, two categorical pieces of information were extracted: age group of the deceased and a cause of death. Each sample was then grouped by age group and cause of death and counts were computed. This defines a two-ways contingency table in long form. The contingency table in wide form is obtained by pivoting: pick column names from column age group and values from counts. Column cause of depth provide row names.

The final form of the dataset is obtained by concatenating the two contingency tables along the second axis.

Code
mortality |> 
    select(ends_with('(06)')) |> 
    DT::datatable(extensions = list("Responsive"))

Elementary statistics and table wrangling

Before proceeding to Correspondence Analysis (CA), let us draw some elementary plots. As ggplot is adapted to data in long format, we start by partially pivoting mortality, so as to obtain a tibble with columns cause, year, while keeping all columns named after age groups. We use rowwise() and sum(c_cross()) so as to compute the total number of deaths per year and cause in column total. This allows to mimic rowSums() inside a pipeline. Column grand_total is computed using a window function over grouping by cause.

Code
mortality_long <- mortality |> 
  as_tibble() |> 
  mutate(cause = rownames(mortality)) |> 
  mutate(cause=as_factor(cause)) |> 
  relocate(cause) |> 
  pivot_longer(
    cols=-cause,
    cols_vary="slowest",
    names_to=c(".value", "year"),
    names_pattern="([\\w\\- ]*) \\(([0-9]{2})\\)"
  )  |> 
  mutate(year=ifelse(year=='06', 2006, 1979)) |> 
  rowwise() |> 
  mutate(total=sum(c_across(-c(cause, year)))) |> 
  group_by(cause) |> 
  mutate(grand_total = sum(total)) |> 
  ungroup()
Code
th <- theme_get()
(
mortality_long |> 
  ggplot() +
  scale_fill_discrete() +
  aes(x=fct_reorder(cause, grand_total), 
      y=total, 
      fill=as_factor(year), 
      label=cause) +
  geom_col(position=position_dodge()) +
  theme(
    legend.position="none",
    axis.text.x=element_blank(), #remove x axis labels
    axis.ticks.x=element_blank(), #remove x axis ticks
  ) +
  labs(
    title = "Causes of death, France, 1979, 2006",
    subtitle= "Raw counts"
  ) +
  xlab(label=NULL)
) |>plotly::ggplotly()  
Code
oth <- theme_set(th)
Code
mortality_long |> 
  group_by(year) |> 
  summarise(total_deaths = sum(total))
# A tibble: 2 × 2
   year total_deaths
  <dbl>        <int>
1  1979       529974
2  2006       510921
Question

Compute the marginal counts for each year (1979, 2006). Compare.

Counts have already been computed above.

Code
mortality_long |> 
  select(cause, year, total) |> 
  pivot_wider(id_cols=c(cause), names_from = year, values_from = total) |> 
  DT::datatable()

Correspondance Analysis

CA executive summary
  • Start from a 2-way contingency table \(X\) with \(\sum_{i,j} X_{i,j}=N\)

  • Normalize \(P = \frac{1}{N}X\) (correspondance matrix)

  • Let \(r\) (resp. \(c\)) be the row (resp. column) wise sums vector

  • Let \(D_r=\text{diag}(r)\) denote the diagonal matrix with row sums of \(P\) as coefficients

  • Let \(D_c=\text{diag}(c)\) denote the diagonal matrix with column sums of \(P\) as coefficients

  • The row profiles matrix is \(D_r^{-1} \times P\)

  • The standardized residuals matrix is \(S = D_r^{-1/2} \times \left(P - r c^T\right) \times D_c^{-1/2}\)

CA consists in computing the SVD of the standardized residuals matrix \(S = U \times D \times V^T\)

From the SVD, we get - \(D_r^{-1/2} \times U\) standardized coordinates of rows - \(D_c^{-1/2} \times V\) standardized coordinates of columns - \(D_r^{-1/2} \times U \times D\) principal coordinates of rows - \(D_c^{-1/2} \times V \times D\) principal coordinates of columns - Squared singular values: the principal inertia

When calling svd(.), the argument should be \[D_r^{1/2}\times \left(D_r^{-1} \times P \times D_c^{-1}- \mathbf{I}\times \mathbf{I}^T \right)\times D_c^{1/2}\]

CA and extended SVD

As \[D_r^{-1} \times P \times D_c^{-1} - \mathbf{I}\mathbf{I}^T = (D_r^{-1/2} \times U)\times D \times (D_c^{-1/2}\times V)^T\]

\((D_r^{-1/2} \times U)\times D \times (D_c^{-1/2}\times V)^T\) is the extended SVD of \[D_r^{-1} \times P \times D_c^{-1} - \mathbf{I}\mathbf{I}^T\] with respect to \(D_r\) and \(D_c\)

Question

Perform CA on the two contingency tables.

You may use FactoMineR::CA(). It is interesting to compute the correspondence analysis in your own way, by preparing the matrix that is handled to svd() and returning a named list containing all relevant information.

Do the Jedi and Sith build their own light sabers? Jedi do. It’s a key part of the religion to have a kyber crystal choose you, to build the saber through the power of the force creating a blade unique and in tune with them

Code
lst_ca <- list()

for (y in c('79', '06')) {
  lst_ca[[y]] <- mortality |> 
    select(ends_with(glue('({y})'))) |> 
    FactoMineR::CA(ncp=8, graph = F)
}
Code
lst <- map(c('79', '06'), 
             \(x) select(mortality, ends_with(glue('({x})'))) |>
             FactoMineR::CA(ncp=8, graph = F)
           )
Question

If you did use FactoMineR::CA(), explain the organization of the result.

The result of FactoMineR::CA(...) is a named and nested list with five elements:

eig
a matrix/array containing enough information to build a screeplot.
call
a list of 9, containing the call to CA(), an object of type language, telling (in principle) the user how CA() was called. However, this