We start with confidence intervals in a simple Gaussian setting. We have \(X_1, \ldots, X_n \sim_{i.i.d.} \mathcal{N}(\mu, \sigma^2)\), where \(\mu\) and \(\sigma\) are unknown (to be estimated and/or tested).
The maximum likelihood estimator for \((\mu, \sigma^2)\) is \((\overline{X}_n, \widehat{\sigma}^2)\) where
\[\overline{X}_n =\frac{1}{n}\sum_{i=1}^n X_i\quad\text{and}\quad \widehat{\sigma}^2=\frac{1}{n}\sum_{i=1}^n (X_i - \overline{X}_n)^2\] By Student’s Theorem, \(\overline{X}_n\) and \(\widehat{\sigma}^2\) are stochastically independent, \(\overline{X}_n \sim \mathcal{N}(\mu, \sigma^2/n)\), and \(n \widehat{\sigma}^2/\sigma^2 \sim \chi^2_{n-1}\). Consequently, the Studentized discrepancy \(\sqrt{n-1}\,(\overline{X}_n - \mu)/\widehat{\sigma}\) follows a Student \(t\) distribution with \(n-1\) degrees of freedom.
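Student's theorem can be checked by simulation. The following is a minimal sketch, not the original construction: the values of `N`, `n`, `mu`, and `sigma` are arbitrary illustrative choices, and the names `X` and `stud` are picked only to match the plotting code. It uses `sd()`, which returns the unbiased estimate \(s\) (divisor \(n-1\)), so the Studentized statistic is written \(\sqrt{n}\,(\overline{X}_n-\mu)/s\), which equals \(\sqrt{n-1}\,(\overline{X}_n-\mu)/\widehat{\sigma}\).

```r
set.seed(42)                      # for reproducibility (arbitrary seed)

N     <- 1000                     # number of replicates (illustrative)
n     <- 10                       # common sample size (illustrative)
mu    <- 1                        # true mean (illustrative)
sigma <- 2                        # true standard deviation (illustrative)

# One Gaussian sample per row
samples <- matrix(rnorm(N * n, mean = mu, sd = sigma), nrow = N)

mu_hat  <- rowMeans(samples)      # sample means
sig_hat <- apply(samples, 1, sd)  # unbiased sample standard deviations

# Studentized discrepancy between estimate and true mean
X <- data.frame(
  mu_hat  = mu_hat,
  sig_hat = sig_hat,
  stud    = sqrt(n) * (mu_hat - mu) / sig_hat
)
```

Under these assumptions, the column `stud` should be distributed as a Student \(t\) variable with `n - 1` degrees of freedom.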
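
    library(ggplot2)
    library(patchwork)  # combines plots with `+` and provides plot_annotation()

    # Histogram of the Studentized statistics with two reference densities
    p <- X |>
      ggplot() +
      aes(x = stud) +
      geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "white", color = "black") +
      stat_function(fun = dt, args = list(df = n - 1), linetype = "dashed") +
      stat_function(fun = dnorm, linetype = "dotted", color = "blue")

    # Left panel: linear y-axis; right panel: logarithmic y-axis
    p + (p + scale_y_log10()) +
      plot_annotation(
        title = "Histogram for Studentized discrepancy between true mean and estimate",
        subtitle = glue::glue("{N} replicates of Gaussian samples of size {n}"),
        caption = glue::glue("Dashed line is Student t density with {n-1} degrees of freedom\nDotted line is standard Gaussian density")
      )
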
The next function takes as arguments two vectors mu_hat and sig_hat and returns a dataframe where each row defines the bounds of a confidence interval whose width is computed using the optional arguments alpha (1-alpha is the targeted confidence level) and n (n is the common size of the samples used to compute the estimates mu_hat and sig_hat).
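The function itself is not reproduced here. A minimal sketch of what such a function could look like follows. The name `student_ci`, the output column names, and the default values of `alpha` and `n` are assumptions made for illustration. The sketch also assumes that `sig_hat` holds unbiased sample standard deviations (as returned by `sd()`); if `sig_hat` were the maximum likelihood estimate \(\widehat{\sigma}\), then `sqrt(n)` should be replaced by `sqrt(n - 1)`.

```r
# Hypothetical helper: Student confidence intervals for a Gaussian mean.
# Assumes sig_hat is the unbiased sample standard deviation (divisor n - 1).
student_ci <- function(mu_hat, sig_hat, alpha = 0.05, n = 100) {
  # Half-width based on the (1 - alpha/2) quantile of Student t with n - 1 df
  half_width <- qt(1 - alpha / 2, df = n - 1) * sig_hat / sqrt(n)
  data.frame(
    lower = mu_hat - half_width,
    upper = mu_hat + half_width
  )
}

# Usage on two made-up pairs of estimates computed from samples of size 10
ci <- student_ci(mu_hat = c(0.1, -0.3), sig_hat = c(1.0, 1.2), alpha = 0.05, n = 10)
ci
```

Each row of the result is one interval, centered at the corresponding entry of `mu_hat`.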
In data gathered from the 2000 General Social Survey (GSS), respondents are cross-classified by gender and political party identification. Respondents indicated whether they identified more strongly with the Democratic or Republican party or as Independents. The counts are summarized in the next contingency table (taken from Agresti, Introduction to Categorical Data Analysis).
Turn the 3-way contingency table into a dataframe/tibble with columns Gender, Dept, Admit, n, where the first three columns are categorical and the last column counts the number of co-occurrences of the values in the first three columns among the UCB applicants.
We start from data summarized in table form and obtain data summarized in frequency form.
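One possible way to carry out this conversion with base R is sketched below. It relies on the built-in `UCBAdmissions` table from the `datasets` package. The name `ucb_df` is an arbitrary choice, and a tidyverse solution would work equally well.

```r
# UCBAdmissions is a 3-way table with dimensions Admit x Gender x Dept.
# as.data.frame() on a table yields one row per cell (frequency form);
# responseName sets the name of the count column.
ucb_df <- as.data.frame(UCBAdmissions, responseName = "n")

# Reorder the columns as Gender, Dept, Admit, n
ucb_df <- ucb_df[, c("Gender", "Dept", "Admit", "n")]

head(ucb_df)
```

The three categorical columns are factors, and the counts in `n` add up to the total number of applicants.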
Dept and Gender are associated at every conceivable significance level.
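This claim can be checked with a chi-square test of independence on the two-way margin for Gender and Dept. The sketch below uses base R only. The object names `gender_dept` and `test_gd` are illustrative.

```r
# Collapse the 3-way table over Admit: dimensions 2 (Gender) and 3 (Dept) remain
gender_dept <- margin.table(UCBAdmissions, margin = c(2, 3))
gender_dept

# Pearson chi-square test of independence between Gender and Dept
test_gd <- chisq.test(gender_dept)
test_gd$statistic
test_gd$p.value
```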
Question
For each department of application (Dept), extract the partial two-way table for Gender and Admit. Test each two-way table for independence. How many departments pass the test at significance levels \(1\%\) and \(5\%\)?
Note that the two-way cross-sectional slices of the three-way table are called partial tables.
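All departments but A pass the test at the \(5\%\) significance level, and therefore also at the \(1\%\) level, since a table that is not rejected at \(5\%\) cannot be rejected at \(1\%\). Department A fails the test at both levels.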
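The per-department tests can be sketched as follows, using base R only. Note that `chisq.test()` applies Yates' continuity correction to \(2\times 2\) tables by default; whether the original analysis kept this default is an assumption. The names `p_values` and `summary_tab` are illustrative.

```r
# Apply a chi-square test of independence to each partial table (one per Dept).
# The third dimension of UCBAdmissions indexes departments.
p_values <- apply(UCBAdmissions, 3, function(tab) chisq.test(tab)$p.value)

# A department "passes" at a given level when independence is not rejected
summary_tab <- data.frame(
  Dept      = names(p_values),
  p_value   = unname(p_values),
  pass_5pct = unname(p_values > 0.05),
  pass_1pct = unname(p_values > 0.01)
)
summary_tab
```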
In Department A, female applicants are much more successful than male applicants. In Departments C and E, female applicants are slightly less successful than male applicants.
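The admission rates behind these comparisons can be computed directly from the table. A short sketch, with `admit_rates` as an illustrative name:

```r
# Proportions within each Gender x Dept cell, i.e. conditional on (Gender, Dept)
cond_props <- prop.table(UCBAdmissions, margin = c(2, 3))

# Keep the "Admitted" slice: rows are Gender, columns are Dept
admit_rates <- cond_props["Admitted", , ]
round(admit_rates, 3)
```

Comparing the two rows column by column shows, for each department, which gender has the higher admission rate.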
This table summarizing the per-department chi-square tests nicely complements the double decker plot above.
What we observed has a name.
Simpson’s paradox
The result that a marginal association can have a different direction from the conditional associations is called Simpson’s paradox. This result applies to quantitative as well as categorical variables.
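For the UCBAdmissions data, the reversal can be made concrete by comparing the marginal odds ratio with the conditional (per-department) odds ratios. The sketch below is one way to do so. The helper `odds_ratio` is hypothetical and simply computes the cross-product ratio of a \(2\times 2\) table. Given the table layout (rows Admitted/Rejected, columns Male/Female), it is the odds of admission for male applicants divided by the odds for female applicants.

```r
# Hypothetical helper: cross-product ratio of a 2 x 2 table
odds_ratio <- function(tab) (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])

# Marginal association: collapse over Dept, keep Admit x Gender
marginal_or <- odds_ratio(margin.table(UCBAdmissions, margin = c(1, 2)))

# Conditional associations: one odds ratio per department
conditional_or <- apply(UCBAdmissions, 3, odds_ratio)

marginal_or
round(conditional_or, 2)
```

An odds ratio above 1 favors male applicants, and a value below 1 favors female applicants.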
Further investigation of datasets like UCBAdmissions suggests designing a test for the following null hypothesis.
In many examples with two categorical predictors \(X\) and \(Z\), and a binary response \(Y\), \(X\) identifies two groups (here Males and Females) to compare and \(Z\) is a control variable (Department of application).
For example, in a clinical trial, \(X\) might refer to two treatments, \(Y\) to the outcome of the treatment, and \(Z\) might refer to several centers that recruited patients for the study.
We want to test whether \(X\) and \(Y\) are independent conditionally on \(Z\) (which is not the same as marginal independence of \(X\) and \(Y\)).
This is the purpose of the Cochran–Mantel–Haenszel Test for \(2 \times 2 \times K\) Contingency Tables (in the UCBAdmissions dataset, \(K\) is the number of departments, and the conditional contingency tables are \(2\times 2\)).
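In R, this test is available as `mantelhaen.test()` in the `stats` package. It treats the third dimension of a 3-way table as the stratifying variable, which for `UCBAdmissions` is Dept. A minimal sketch with default settings (including the default continuity correction) follows. The name `cmh` is illustrative.

```r
# Cochran-Mantel-Haenszel test of conditional independence between
# Admit and Gender, stratified by Dept (the third dimension of the table)
cmh <- mantelhaen.test(UCBAdmissions)

cmh$statistic   # Mantel-Haenszel chi-square statistic
cmh$p.value     # p-value for the null of conditional independence
cmh$estimate    # estimate of the common odds ratio across departments
```

A large p-value means that the data do not provide evidence against conditional independence of Admit and Gender given Dept.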