Published

March 28, 2024

Code
params = list(
  truc= "Science des Données",
  year= 2023 ,
  curriculum= "L3 MIASHS",
  university= "Université Paris Cité",
  homepage= "https://stephane-v-boucheron.fr/courses/scidon",
  moodle= "https://moodle.u-paris.fr/course/view.php?id=13227",
  path_data = './DATA',
  country_code= '...',
  country= '...',
  datafile= '...'
  )

attach(params)

Confidence Intervals

We start with Confidence Intervals in a simple Gaussian setting. We have \(X_1, \ldots, X_n \sim_{i.i.d.} \mathcal{N}(\mu, \sigma^2)\) where \(\mu\) and \(\sigma\) are unknown (to be estimated and/or tested).

The maximum likelihood estimator for \((\mu, \sigma^2)\) is \((\overline{X}_n, \widehat{\sigma}^2)\) where

\[\overline{X}_n =\sum_{i=1}^n \frac{1}{n} X_i\quad\text{and}\quad \widehat{\sigma}^2=\frac{1}{n}\sum_{i=1}^n (X_i - \overline{X}_n)^2\] By Student’s Theorem \(\overline{X}_n\) and \(\widehat{\sigma}^2\) are stochastically independent \(\overline{X}_n \sim \mathcal{N}(\mu, \widehat{\sigma}^2/n)\) and \(n \widehat{\sigma}^2/\sigma^2 \sim \chi^2_{n-1}\).

\[\sqrt{n} \frac{\overline{X}_n - \mu}{\widehat{\sigma}} \sim t_{n-1}\]

where \(t_{n-1}\) denotes the Student’s \(t\) distribution with \(n-1\) degrees of freedom.

We have the following confidence interval for \(\mu\) at confidence level \(1-\alpha\):

\[\left[\overline{X}_n - \frac{\widehat{\sigma} t_{n-1,\alpha/2}}{\sqrt{n}}, \overline{X}_n + \frac{\widehat{\sigma} t_{n-1,\alpha/2}}{\sqrt{n}}\right]\]

Question

Simulate \(N=1000\) Gaussian samples of size \(n=100\).

Compute the empirical coverage of confidence intervals for \(\alpha=5\%\) and \(\alpha=10\%\).

Plot a histogram for replicates of \(\frac{\overline{X}_n - \mu}{\widehat{\sigma}\sqrt{n}}\). Overlay the density of \(t_{n-1}\).

Code
N <- 1000 ; n <- 30

X <- rnorm(N * n) |> 
  matrix(nrow=n) |> 
  as.data.frame() |> 
  dplyr::summarise(across(everything(), .fns=c(mu_hat=mean, sigma_hat=sd), .names = '{.col}_{.fn}'))

X |> 
  select(1:4) |> 
  head()
  V1_mu_hat V1_sigma_hat V2_mu_hat V2_sigma_hat
1 0.1773152    0.8886621 0.1017232     1.123775
Code
X <- X |> 
  pivot_longer(
    cols = everything(),
    names_pattern = 'V([0-9]*)_([a-z_]*)',
    names_to = c("Id", ".value")
  )

X |> 
  glimpse()
Rows: 1,000
Columns: 3
$ Id        <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12…
$ mu_hat    <dbl> 0.17731516, 0.10172316, 0.15621909, -0.43895985, -0.18648079…
$ sigma_hat <dbl> 0.8886621, 1.1237751, 1.0265977, 1.1214491, 0.9998393, 1.114…
Code
X <- X |> 
  mutate(stud= sqrt(n)*mu_hat/(sigma_hat)) 

X |> 
  glimpse()
Rows: 1,000
Columns: 4
$ Id        <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12…
$ mu_hat    <dbl> 0.17731516, 0.10172316, 0.15621909, -0.43895985, -0.18648079…
$ sigma_hat <dbl> 0.8886621, 1.1237751, 1.0265977, 1.1214491, 0.9998393, 1.114…
$ stud      <dbl> 1.0928733, 0.4957938, 0.8334786, -2.1439066, -1.0215615, 0.4…
Code
p <- X |> 
  ggplot() +
  aes(x=stud) +
  geom_histogram(aes(y=after_stat(density)), 
                 bins = 30,
                 fill="white",
                 color="black") +
  stat_function(fun=dt, args=c(df=n-1), linetype="dashed") +
  stat_function(fun=dnorm, linetype="dotted", color="blue") 
  
    
p + (p + scale_y_log10()) +
  plot_annotation(
    title = "Histogram for Studentized discrepancy between true mean and estimate", 
    subtitle = glue::glue("{N} replicates of Gaussian samples of size {n}"),
    caption = glue::glue("Dashed line is Student t density with {n-1} degrees of freedom\nDotted line is standard Gaussian density")
    )
Warning: Transformation introduced infinite values in continuous y-axis
Warning: Removed 2 rows containing missing values (`geom_bar()`).

The next function takes as arguments two vectors mu_hat and sig_hat and returns a dataframe where each row defines the bounds of a confidence interval whose width is computed using the optional arguments alpha (1-alpha is the targeted confidence level) and n (n is the common size of the samples used to compute the estimates mu_hat and sig_hat).

Code
IC <- function(mu_hat, sig_hat, alpha=.05, n=100){
  t_score <- qt(1-alpha/2, df=n-1) 
  half_width <- t_score * sig_hat / sqrt(n)
  bounds <- tibble(low=mu_hat-half_width, up=mu_hat+half_width)
  bounds
}
Code
Y <- X |> 
  mutate(ic_5 = IC(mu_hat, sigma_hat, alpha=.05, n = n),
         ic_1 = IC(mu_hat, sigma_hat, alpha=.01, n = n)) |> 
  unnest(c(ic_1, ic_5), names_sep = "_") 

Y |> 
  select(starts_with('ic_')) |> 
  head()
# A tibble: 6 × 4
  ic_5_low ic_5_up ic_1_low ic_1_up
     <dbl>   <dbl>    <dbl>   <dbl>
1   -0.155  0.509    -0.270   0.625
2   -0.318  0.521    -0.464   0.667
3   -0.227  0.540    -0.360   0.673
4   -0.858 -0.0202   -1.00    0.125
5   -0.560  0.187    -0.690   0.317
6   -0.319  0.513    -0.464   0.658

We can compare the achieved coverage and the theoretical coverage of our confidence intervals.

Code
Y |>
  mutate(ok_5 = (ic_5_low < 0) & (0 <ic_5_up),
         ok_1 = (ic_1_low < 0) & (0 < ic_1_up)) |>
  dplyr::summarise(across(starts_with("ok_"),
                          .fn= sum,
                          .names = c("coverage_{.col}")))
# A tibble: 1 × 2
  coverage_ok_5 coverage_ok_1
          <int>         <int>
1           949           991

Testing independence

In data gathered from the 2000 General Social Survey (GSS), one cross classifies gender and political party identification. Respondents indicated whether they identified more strongly with the Democratic or Republican party or as Independents. This is summarized in the next contingency table (taken from Agresti Introduction to Categorical Data Analysis).

Code
# GSS <- vcdExtra::GSS

T <- tribble(~ Democrat, ~ Independent, ~ Republican,
             762, 327, 468,
             484, 239, 477)
rownames(T) <- c('Females', 'Males')
Warning: Setting row names on a tibble is deprecated.
Code
T <- as.matrix(T)
T <- as.table(T)
names(dimnames(T)) <- c("Gender", "Party identification") 
Code
         Party identification
Gender      Democrat Independent Republican
  Females 0.27638738  0.11860718 0.16974973
  Males   0.17555314  0.08668843 0.17301415
Code
Gender
Females   Males 
   1557    1200 
Code
Party identification
   Democrat Independent  Republican 
       1246         566         945 
Question
  • Draw mosaicplot for the cross classification table
  • Compute the Pearson chi-square statistic for testing independence
  • Comment
Code
chisq.test(T) |> 
  broom::tidy() |> 
  knitr::kable()
statistic p.value parameter method
30.07015 3e-07 2 Pearson’s Chi-squared test
Code
vcd::mosaic(T, shade=T)
Warning in if (shade) {: the condition has length > 1 and only the first
element will be used

Visualizing multiway categorical data

Consider the celebrated UCBAdmissions dataset

According to R documentation, this dataset is made of

Aggregate data on applicants to graduate school at Berkeley for the six largest departments in 1973 classified by admission and sex.

This is a compilation of 4526 application files.

For each application, three variables have been reported: the department , the gender of the applicant, and whether the applicant has been admitted.

The dataset is a trivariate sample, which is summarized by a 3-way contingency table.

Code
data("UCBAdmissions")
Question

Turn the 3-way contingency table into a dataframe/tibble with columns Gender, Dept, Admit, n, where the first columns are categorical, and the last column counts the number of co-occurrences of the values in the first three columns amongst the UCB applicants.

We start from data summarized in table form and obtain data summarized in frequency form.

Code
UCBA_long <- UCBAdmissions |> 
  as_tibble() |> 
  select(Gender, Dept, Admit, n) |> 
  arrange(Gender, Dept, Admit)

UCBA_long |> 
  knitr::kable()
Gender Dept Admit n
Female A Admitted 89
Female A Rejected 19
Female B Admitted 17
Female B Rejected 8
Female C Admitted 202
Female C Rejected 391
Female D Admitted 131
Female D Rejected 244
Female E Admitted 94
Female E Rejected 299
Female F Admitted 24
Female F Rejected 317
Male A Admitted 512
Male A Rejected 313
Male B Admitted 353
Male B Rejected 207
Male C Admitted 120
Male C Rejected 205
Male D Admitted 138
Male D Rejected 279
Male E Admitted 53
Male E Rejected 138
Male F Admitted 22
Male F Rejected 351

See vcd tutorial

Question

Make it a bivariate sample by focusing on Gender and Admit: compute the margin table

Draw the corresponding mosaicplot and compute the chi-square independence statistic.

Comment.

Code
UCB_2 <- UCBAdmissions |>  
  margin.table(margin=c('Gender', 'Admit')) 

UCB_2
        Admit
Gender   Admitted Rejected
  Male       1198     1493
  Female      557     1278
Code
UCB_2 |> 
  vcd::mosaic(shade=TRUE)

Code
UCB_2 |> 
  chisq.test() |> 
  broom::tidy() |> 
  knitr::kable()
statistic p.value parameter method
91.6096 0 1 Pearson’s Chi-squared test with Yates’ continuity correction
Question

Visualize the three-way contingency table using double-decker plots from vcd

Code
aperm(UCBAdmissions, c(3, 2, 1)) |> 
  vcd::mosaic(shade=T)

Code
aperm(UCBAdmissions, c(3, 2, 1)) |> 
  vcd::doubledecker()

Question
Question

Viewing the UCBAdmissions dataset, which variable would you call a response variable? Which variable would you call covariates?

Test independence between Gender and Dept.

Code
UCBAdmissions |>  
  margin.table(margin=c('Gender', 'Dept')) |> 
  chisq.test() |> 
  broom::tidy() |> 
  knitr::kable()
statistic p.value parameter method
1068.372 0 5 Pearson’s Chi-squared test
Code
UCBAdmissions |>  
  margin.table(margin=c('Gender', 'Dept')) |>
  vcd::mosaic(shade=T)

Dept and Gender are associated at every conceivable significance level.

Question

For each department of application (Dept), extract the partial two-way table for Gender and Admit. Test each two-way table for independence. How many departments pass the test at significance level \(1\%\), \(5\%\)?

Note that the two-way cross-sectional slices of the three-way table are called partial tables.

Code
dimnames(UCBAdmissions)$Dept |> 
  map(\(x) UCBAdmissions[,,x]) |> 
  map(chisq.test) |> 
  map(broom::tidy) |> 
  bind_rows() |> 
  mutate(method="Pearson's Chi-squared test",
         Dept=dimnames(UCBAdmissions)$Dept) |>
  relocate(Dept) |> 
  knitr::kable(digits = 2)
Dept statistic p.value parameter method
A 16.37 0.00 1 Pearson’s Chi-squared test
B 0.09 0.77 1 Pearson’s Chi-squared test
C 0.63 0.43 1 Pearson’s Chi-squared test
D 0.22 0.64 1 Pearson’s Chi-squared test
E 0.81 0.37 1 Pearson’s Chi-squared test
F 0.22 0.64 1 Pearson’s Chi-squared test

All departments but A pass the test at \(5\%\) significance level, C and E fail the test at \(1\%\).

In Department

  • A, female applicants are much more successful than male applicants.
  • C, E, female applicants are slightly less successful than male applicants

This table summarizing the per Department chi-square tests nicely complements the double decker plot above.

What we observed has a name.

Simpson’s paradox

The result that a marginal association can have different direction from the conditional associations is called Simpson’s paradox. This result applies to quantitative as well as categorical variables.

Further investigation of datasets like UCBAdmissions suggest designing a test for the following null hypothesis.

In many examples with two categorical predictors \(X\) and \(Z\), and a binary response \(Y\), \(X\) identifies two groups (here Males and Females) to compare and \(Z\) is a control variable (Department of application).

For example, in a clinical trial, \(X\) might refer to two treatments, \(Y\) to the outcome of the treatment, and \(Z\) might refer to several centers that recruited patients for the study.

We want to test whether \(X\) and \(Y\) are independent conditionally on \(Z\) (which is something different than independence).

This is the task faced by the Cochran–Mantel–Haenszel Test for \(2 \times 2 \times K\) Contingency Tables (in the UCBAdmissions dataset, \(K\) is the number of departements, and the conditional contingency tables are \(2\times 2\)).