lab-tests-miashs-2

Published

March 28, 2024

Code

params = list(
  truc= "Science des Données",
  year= 2023 ,
  curriculum= "L3 MIASHS",
  university= "Université Paris Cité",
  homepage= "https://stephane-v-boucheron.fr/courses/scidon",
  moodle= "https://moodle.u-paris.fr/course/view.php?id=13227",
  path_data = './DATA',
  country_code= '...',
  country= '...',
  datafile= '...'
  )

attach(params)

Code

require(patchwork)
require(glue)
require(here)
require(tidyverse)
require(vcd)
require(vcdExtra)
require(ggmosaic)
require(skimr)
require(plotly)
require(DT)
require(GGally)
require(ggforce)
require(ggfortify)

tidymodels::tidymodels_prefer(quiet = TRUE)

old_theme <-theme_set(theme_minimal(base_size=9, base_family = "Helvetica"))

L3 MIASHS
Université Paris Cité
Année 2023-2024
Course Homepage
Moodle

Confidence Intervals

We start with Confidence Intervals in a simple Gaussian setting. We have \(X_1, \ldots, X_n \sim_{i.i.d.} \mathcal{N}(\mu, \sigma^2)\) where \(\mu\) and \(\sigma\) are unknown (to be estimated and/or tested).

The maximum likelihood estimator for \((\mu, \sigma^2)\) is \((\overline{X}_n, \widehat{\sigma}^2)\) where

\[\overline{X}_n =\sum_{i=1}^n \frac{1}{n} X_i\quad\text{and}\quad \widehat{\sigma}^2=\frac{1}{n}\sum_{i=1}^n (X_i - \overline{X}_n)^2\] By Student’s Theorem \(\overline{X}_n\) and \(\widehat{\sigma}^2\) are stochastically independent \(\overline{X}_n \sim \mathcal{N}(\mu, \widehat{\sigma}^2/n)\) and \(n \widehat{\sigma}^2/\sigma^2 \sim \chi^2_{n-1}\).

\[\sqrt{n} \frac{\overline{X}_n - \mu}{\widehat{\sigma}} \sim t_{n-1}\]

where \(t_{n-1}\) denotes the Student’s \(t\) distribution with \(n-1\) degrees of freedom.

We have the following confidence interval for \(\mu\) at confidence level \(1-\alpha\):

\[\left[\overline{X}_n - \frac{\widehat{\sigma} t_{n-1,\alpha/2}}{\sqrt{n}}, \overline{X}_n + \frac{\widehat{\sigma} t_{n-1,\alpha/2}}{\sqrt{n}}\right]\]

Question

Simulate \(N=1000\) Gaussian samples of size \(n=100\).

Compute the empirical coverage of confidence intervals for \(\alpha=5\%\) and \(\alpha=10\%\).

Plot a histogram for replicates of \(\frac{\overline{X}_n - \mu}{\widehat{\sigma}\sqrt{n}}\). Overlay the density of \(t_{n-1}\).

Code

N <- 1000 ; n <- 30

X <- rnorm(N * n) |> 
  matrix(nrow=n) |> 
  as.data.frame() |> 
  dplyr::summarise(across(everything(), .fns=c(mu_hat=mean, sigma_hat=sd), .names = '{.col}_{.fn}'))

X |> 
  select(1:4) |> 
  head()

  V1_mu_hat V1_sigma_hat V2_mu_hat V2_sigma_hat
1 0.1773152    0.8886621 0.1017232     1.123775

Code

X <- X |> 
  pivot_longer(
    cols = everything(),
    names_pattern = 'V([0-9]*)_([a-z_]*)',
    names_to = c("Id", ".value")
  )

X |> 
  glimpse()

Rows: 1,000
Columns: 3
$ Id        <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12…
$ mu_hat    <dbl> 0.17731516, 0.10172316, 0.15621909, -0.43895985, -0.18648079…
$ sigma_hat <dbl> 0.8886621, 1.1237751, 1.0265977, 1.1214491, 0.9998393, 1.114…

Code

X <- X |> 
  mutate(stud= sqrt(n)*mu_hat/(sigma_hat)) 

X |> 
  glimpse()

Rows: 1,000
Columns: 4
$ Id        <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12…
$ mu_hat    <dbl> 0.17731516, 0.10172316, 0.15621909, -0.43895985, -0.18648079…
$ sigma_hat <dbl> 0.8886621, 1.1237751, 1.0265977, 1.1214491, 0.9998393, 1.114…
$ stud      <dbl> 1.0928733, 0.4957938, 0.8334786, -2.1439066, -1.0215615, 0.4…

Code

p <- X |> 
  ggplot() +
  aes(x=stud) +
  geom_histogram(aes(y=after_stat(density)), 
                 bins = 30,
                 fill="white",
                 color="black") +
  stat_function(fun=dt, args=c(df=n-1), linetype="dashed") +
  stat_function(fun=dnorm, linetype="dotted", color="blue") 
  
    
p + (p + scale_y_log10()) +
  plot_annotation(
    title = "Histogram for Studentized discrepancy between true mean and estimate", 
    subtitle = glue::glue("{N} replicates of Gaussian samples of size {n}"),
    caption = glue::glue("Dashed line is Student t density with {n-1} degrees of freedom\nDotted line is standard Gaussian density")
    )

Warning: Transformation introduced infinite values in continuous y-axis

Warning: Removed 2 rows containing missing values (`geom_bar()`).

The next function takes as arguments two vectors mu_hat and sig_hat and returns a dataframe where each row defines the bounds of a confidence interval whose width is computed using the optional arguments alpha (1-alpha is the targeted confidence level) and n (n is the common size of the samples used to compute the estimates mu_hat and sig_hat).

Code

IC <- function(mu_hat, sig_hat, alpha=.05, n=100){
  t_score <- qt(1-alpha/2, df=n-1) 
  half_width <- t_score * sig_hat / sqrt(n)
  bounds <- tibble(low=mu_hat-half_width, up=mu_hat+half_width)
  bounds
}

Code

Y <- X |> 
  mutate(ic_5 = IC(mu_hat, sigma_hat, alpha=.05, n = n),
         ic_1 = IC(mu_hat, sigma_hat, alpha=.01, n = n)) |> 
  unnest(c(ic_1, ic_5), names_sep = "_") 

Y |> 
  select(starts_with('ic_')) |> 
  head()

# A tibble: 6 × 4
  ic_5_low ic_5_up ic_1_low ic_1_up
     <dbl>   <dbl>    <dbl>   <dbl>
1   -0.155  0.509    -0.270   0.625
2   -0.318  0.521    -0.464   0.667
3   -0.227  0.540    -0.360   0.673
4   -0.858 -0.0202   -1.00    0.125
5   -0.560  0.187    -0.690   0.317
6   -0.319  0.513    -0.464   0.658

We can compare the achieved coverage and the theoretical coverage of our confidence intervals.

Code

Y |>
  mutate(ok_5 = (ic_5_low < 0) & (0 <ic_5_up),
         ok_1 = (ic_1_low < 0) & (0 < ic_1_up)) |>
  dplyr::summarise(across(starts_with("ok_"),
                          .fn= sum,
                          .names = c("coverage_{.col}")))

# A tibble: 1 × 2
  coverage_ok_5 coverage_ok_1
          <int>         <int>
1           949           991

Testing independence

In data gathered from the 2000 General Social Survey (GSS), one cross classiﬁes gender and political party identiﬁcation. Respondents indicated whether they identiﬁed more strongly with the Democratic or Republican party or as Independents. This is summarized in the next contingency table (taken from Agresti Introduction to Categorical Data Analysis).

Code

# GSS <- vcdExtra::GSS

T <- tribble(~ Democrat, ~ Independent, ~ Republican,
             762, 327, 468,
             484, 239, 477)
rownames(T) <- c('Females', 'Males')

Warning: Setting row names on a tibble is deprecated.

Code

T <- as.matrix(T)
T <- as.table(T)
names(dimnames(T)) <- c("Gender", "Party identification")

Code

prop.table(T)

         Party identification
Gender      Democrat Independent Republican
  Females 0.27638738  0.11860718 0.16974973
  Males   0.17555314  0.08668843 0.17301415

Code

margin.table(T, 1)

Gender
Females   Males 
   1557    1200

Code

margin.table(T, 2)

Party identification
   Democrat Independent  Republican 
       1246         566         945

Question

Draw mosaicplot for the cross classification table
Compute the Pearson chi-square statistic for testing independence
Comment

Code

chisq.test(T) |> 
  broom::tidy() |> 
  knitr::kable()

statistic	p.value	parameter	method
30.07015	3e-07	2	Pearson’s Chi-squared test

Code

vcd::mosaic(T, shade=T)

Warning in if (shade) {: the condition has length > 1 and only the first
element will be used

Visualizing multiway categorical data

Consider the celebrated UCBAdmissions dataset

According to R documentation, this dataset is made of

Aggregate data on applicants to graduate school at Berkeley for the six largest departments in 1973 classified by admission and sex.

This is a compilation of 4526 application files.

For each application, three variables have been reported: the department , the gender of the applicant, and whether the applicant has been admitted.

The dataset is a trivariate sample, which is summarized by a 3-way contingency table.

Code

data("UCBAdmissions")

Question

Turn the 3-way contingency table into a dataframe/tibble with columns Gender, Dept, Admit, n, where the first columns are categorical, and the last column counts the number of co-occurrences of the values in the first three columns amongst the UCB applicants.

We start from data summarized in table form and obtain data summarized in frequency form.

Code

UCBA_long <- UCBAdmissions |> 
  as_tibble() |> 
  select(Gender, Dept, Admit, n) |> 
  arrange(Gender, Dept, Admit)

UCBA_long |> 
  knitr::kable()

Gender	Dept	Admit	n
Female	A	Admitted	89
Female	A	Rejected	19
Female	B	Admitted	17
Female	B	Rejected	8
Female	C	Admitted	202
Female	C	Rejected	391
Female	D	Admitted	131
Female	D	Rejected	244
Female	E	Admitted	94
Female	E	Rejected	299
Female	F	Admitted	24
Female	F	Rejected	317
Male	A	Admitted	512
Male	A	Rejected	313
Male	B	Admitted	353
Male	B	Rejected	207
Male	C	Admitted	120
Male	C	Rejected	205
Male	D	Admitted	138
Male	D	Rejected	279
Male	E	Admitted	53
Male	E	Rejected	138
Male	F	Admitted	22
Male	F	Rejected	351

See vcd tutorial

Question

Make it a bivariate sample by focusing on Gender and Admit: compute the margin table

Draw the corresponding mosaicplot and compute the chi-square independence statistic.

Comment.

Code

UCB_2 <- UCBAdmissions |>  
  margin.table(margin=c('Gender', 'Admit')) 

UCB_2

        Admit
Gender   Admitted Rejected
  Male       1198     1493
  Female      557     1278

Code

UCB_2 |> 
  vcd::mosaic(shade=TRUE)

Code

UCB_2 |> 
  chisq.test() |> 
  broom::tidy() |> 
  knitr::kable()

statistic	p.value	parameter	method
91.6096	0	1	Pearson’s Chi-squared test with Yates’ continuity correction

Question

Visualize the three-way contingency table using double-decker plots from vcd

Code

aperm(UCBAdmissions, c(3, 2, 1)) |> 
  vcd::mosaic(shade=T)

Code

aperm(UCBAdmissions, c(3, 2, 1)) |> 
  vcd::doubledecker()

Question

Viewing the UCBAdmissions dataset, which variable would you call a response variable? Which variable would you call covariates?

Test independence between Gender and Dept.

Code

UCBAdmissions |>  
  margin.table(margin=c('Gender', 'Dept')) |> 
  chisq.test() |> 
  broom::tidy() |> 
  knitr::kable()

statistic	p.value	parameter	method
1068.372	0	5	Pearson’s Chi-squared test

Code

UCBAdmissions |>  
  margin.table(margin=c('Gender', 'Dept')) |>
  vcd::mosaic(shade=T)

Dept and Gender are associated at every conceivable significance level.

Question

For each department of application (Dept), extract the partial two-way table for Gender and Admit. Test each two-way table for independence. How many departments pass the test at significance level \(1\%\), \(5\%\)?

Note that the two-way cross-sectional slices of the three-way table are called partial tables.

Code

dimnames(UCBAdmissions)$Dept |> 
  map(\(x) UCBAdmissions[,,x]) |> 
  map(chisq.test) |> 
  map(broom::tidy) |> 
  bind_rows() |> 
  mutate(method="Pearson's Chi-squared test",
         Dept=dimnames(UCBAdmissions)$Dept) |>
  relocate(Dept) |> 
  knitr::kable(digits = 2)

Dept	statistic	p.value	parameter	method
A	16.37	0.00	1	Pearson’s Chi-squared test
B	0.09	0.77	1	Pearson’s Chi-squared test
C	0.63	0.43	1	Pearson’s Chi-squared test
D	0.22	0.64	1	Pearson’s Chi-squared test
E	0.81	0.37	1	Pearson’s Chi-squared test
F	0.22	0.64	1	Pearson’s Chi-squared test

All departments but A pass the test at \(5\%\) significance level, C and E fail the test at \(1\%\).

In Department

A, female applicants are much more successful than male applicants.
C, E, female applicants are slightly less successful than male applicants

This table summarizing the per Department chi-square tests nicely complements the double decker plot above.

What we observed has a name.

Simpson’s paradox

The result that a marginal association can have different direction from the conditional associations is called Simpson’s paradox. This result applies to quantitative as well as categorical variables.

Further investigation of datasets like UCBAdmissions suggest designing a test for the following null hypothesis.

In many examples with two categorical predictors \(X\) and \(Z\), and a binary response \(Y\), \(X\) identiﬁes two groups (here Males and Females) to compare and \(Z\) is a control variable (Department of application).

For example, in a clinical trial, \(X\) might refer to two treatments, \(Y\) to the outcome of the treatment, and \(Z\) might refer to several centers that recruited patients for the study.

We want to test whether \(X\) and \(Y\) are independent conditionally on \(Z\) (which is something different than independence).

This is the task faced by the Cochran–Mantel–Haenszel Test for \(2 \times 2 \times K\) Contingency Tables (in the UCBAdmissions dataset, \(K\) is the number of departements, and the conditional contingency tables are \(2\times 2\)).