Published

April 24, 2024

Quantitative bivariate samples and Simple linear regression

Numerical summaries

The numerical summary of a numerical bivariate sample consists of an empirical mean

\[\begin{pmatrix}\overline{X}_n \\ \overline{Y}_n \end{pmatrix} = \frac{1}{n} \sum_{i=1}^n \begin{pmatrix} x_i \\ y_i \end{pmatrix}\]

and an empirical covariance matrix

\[\begin{pmatrix}\operatorname{var}_n(X) & \operatorname{cov}_n(X, Y) \\ \operatorname{cov}_n(X, Y) & \operatorname{var}_n(Y)\end{pmatrix}\]

with

\[\operatorname{var}_n(X, Y) = \frac{1}{n}\sum_{k=1}^n \Big(x_i-\overline{X}_n\Big)^2\]

and

\[\operatorname{cov}_n(X, Y) = \frac{1}{n}\sum_{k=1}^n \Big(x_i-\overline{X}_n\Big)\times \Big(y_i-\overline{Y}_n\Big)\]

Covariance matrices have properties

The empirical covariance matrix is the covariance matrix of the joint empirical distribution.

As a covariance matrix, the empirical covariance matrix is symmetric, semi-definite positive (SDP)

Semi-definiteness
  • A square \(n \times n\) matrix \(A\) is semi-definite positive (SDP) iff

\[\forall u \in \mathbb{R}^n, \qquad u^T \times A u = \langle u, Au \rangle \geq 0\]

  • A square \(n \times n\) matrix \(A\) is definite positive (DP) iff

\[\forall u \in \mathbb{R}^n \setminus \{0\}, \qquad u^T \times A u = \langle u, Au \rangle > 0\]

Linear correlation coefficient

The linear correlation coefficient is defined from the covariance matrix as

\[\rho = \frac{\operatorname{cov}_n(X, Y)}{\sqrt{\operatorname{var}_n(X) \operatorname{var}_n(Y)}}\]

By the Cauchy-Schwarz inequality, we always have

\[-1 \leq \rho \leq 1\]

Translating and/or rescaling the columns does not modify the linear correlation coefficient!

Functions cov and cor from base perform the computations

Visualizing quantitative bivariate samples

Suppose now, we want to visualize a quantitative bivariate sample of length \(n\).

This bivariate sample (a dataframe) may be handled as a real matrix with \(n\) rows and \(2\) columns

Geometric concepts come into play

Exploring column space

We may attempt to visualize the two columns, that is the two \(n\)-dimensional vectors or the rows, that is \(n\) points on the real plane.

If we try to visualize the two columns, we simplify the problem by projecting on the plane generated by the two columns

Then what matters is the angle between the two vectors.

Its cosine is precisely the linear correlation coefficient defined above

Exploring row space

If we try visualize the rows, the most basic visualization of a quantitative bivariate sample is the scatterplot.

In the grammar of graphics parlance, it consists in mapping the two variables on the two axes, and mapping rows to points using geom_point and stat_identity

A Gaussian cloud

We build an artificial bivariate sample, by first building a covariance matrix COV (it is randomly generated). Then we build a bivariate normal sample s of length n and turn it into a dataframe u. The dataframe is then fed to ggplot.

Code
set.seed(1515) # for the sake of reproducibility

n <- 100
V <- matrix(rnorm(4, 1, 1), nrow = 2)
COV <- V %*% t(V)         # a random covariance matrix, COV is symmetric and SDP

s <- t(V %*% matrix(rnorm(2 * 10 * n), ncol=10*n))
u <- as_tibble(list(X=s[,1], Y=s[, 2]))                      # a bivariate normal sample

emp_mean <- as_tibble(t(colMeans(u)))

Numerical summary

  • Mean vector (Empirical mean)
Code
t(colMeans(u)) %>%
  knitr::kable(digits = 3, col.names = c("$\\overline{X_n}$", "$\\overline{Y_n}$"))
\(\overline{X_n}\) \(\overline{Y_n}\)
0.004 -0.004
  • Covariance matrix (Empirical covariance matrix)
Code
cov(u) %>% as.data.frame() %>% knitr::kable(digits = 3)
X Y
X 4.370 -0.706
Y -0.706 1.212

Code

Code
p_scatter_gaussian <- u %>%
  ggplot() +
  aes(x = X, y = Y) +
  geom_point(alpha = .25, size = 1) +
  geom_point(data = emp_mean, color = 2, size = 5) +
  stat_ellipse(type = "norm", level = .9) +
  stat_ellipse(type = "norm", level = .5) +
  stat_ellipse(type = "norm", level = .95) +
  annotate(geom="text", x=emp_mean$X+1.5, y= emp_mean$Y+1, label="Empirical mean")+
  geom_segment(aes(x=emp_mean$X, y=emp_mean$Y, xend=emp_mean$X+1.5, yend=emp_mean$Y+1)) +
  coord_fixed() +
  ggtitle(stringr::str_c("Gaussian cloud, cor = ",
    round(cor(u$X, u$Y), 2),
    sep = ""
  ))

p_scatter_gaussian

Qualitative and quantitative variables

Conditional summaries

For each modality \(i \in \mathcal{X}\), we define:

  • Conditional Mean of \(X\) given \(\{ X = i \}\)

\[\overline{Y}_{n\mid i} = \frac{1}{n_i} \sum_{k\leq n} \mathbb{I}_{x_k =i} \times y_k\]

  • Conditional Variance \(Y\) given \(\{ X= i\}\)

\[\sigma^2_{Y\mid i} = \frac{1}{n_i} \sum_{k \leq n} \mathbb{I}_{x_k =i} \times \bigg( y_k - \overline{Y}_{n \mid i}\bigg)^2\]

Huygens-Pythagoras formula

\[\sigma^2_{Y} = \underbrace{\sum_{i\in \mathcal{X}} \frac{n_i}{n} \sigma^2_{Y \mid i}}_{\text{mean of conditional variances}} + \underbrace{\sum_{i\in \mathcal{X}} \frac{n_i}{n} \big(\overline{Y}_{n \mid i} - \overline{Y}_{n}\big)^2}_{\text{variance of conditional means}}\]

Check this

Robust bivariate summaries

It is also possible and fruitful to compute

  • conditional quantiles (median, quartiles) and
  • conditional interquartile ranges (IQR)

Conditional mean, variance, median, IQR ()

Code
tit <-  readr::read_csv("DATA/titanic/train.csv")
Rows: 891 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Name, Sex, Ticket, Cabin, Embarked
dbl (7): PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
tit <-  tit |>
    mutate(across(c(Survived, Pclass, Name, Sex, Embarked), as.factor)) 

tit |>  
  glimpse()
Rows: 891
Columns: 12
$ PassengerId <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
$ Survived    <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
$ Pclass      <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
$ Name        <fct> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
$ Sex         <fct> male, female, female, female, male, male, male, male, fema…
$ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
$ SibSp       <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
$ Parch       <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
$ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
$ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
$ Cabin       <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G6", "C…
$ Embarked    <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S, C…
Code
tit %>%
  dplyr::select(Survived, Fare) %>%
  dplyr::group_by(Survived) %>%
  dplyr::summarise(cmean=mean(Fare, na.rm=TRUE), #<<
                   csd=sd(Fare,na.rm = TRUE),
                   cmedian=median(Fare, na.rm = TRUE),
                   cIQR=IQR(Fare,na.rm = TRUE))
# A tibble: 2 × 5
  Survived cmean   csd cmedian  cIQR
  <fct>    <dbl> <dbl>   <dbl> <dbl>
1 0         22.1  31.4    10.5  18.1
2 1         48.4  66.6    26    44.5

Visualization of mixed bivariate samples

Visualization of qualitative/quantitative bivariate samples

consists in displaying visual summaries of conditional distribution of \(Y\) given \(X=i, i \in \mathcal{X}\)

Boxplots and violinplots are relevant here

Mixed bivariate samples from Titanic (violine plots)

Code
filtered_tit <- tit %>%
  dplyr::select(Pclass, Survived, Fare) %>%
  dplyr::filter(Fare > 0 )

v <- filtered_tit %>%
  ggplot() +
  aes(y=Fare) +
  scale_y_log10()

# vv <- v + geom_violin()

filtered_tit |> 
  glimpse()
Rows: 876
Columns: 3
$ Pclass   <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3, 2…
$ Survived <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0…
$ Fare     <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, 21…
Code
p <- v +
  aes(x=Pclass) + 
  geom_violin() +
  ggtitle("Titanic: Fare versus Passenger Class")

q <- v +
  aes(x=Survived) +
  geom_violin() +
  ggtitle("Titanic: Fare versus Survival")

p + q

Mixed bivariate samples from Titanic (boxplots)

Code
(
v + aes(x=Pclass) +
  geom_boxplot() +
  ggtitle("Titanic: Fare versus Passenger Class")
) + (
v +
  aes(x=Survived) +
  geom_boxplot() +
  ggtitle("Titanic: Fare versus Survival")
)