name: inter-slide class: left, middle, inverse {{ content }} --- name: layout-general layout: true class: left, middle <style> .remark-slide-number { position: inherit; } .remark-slide-number .progress-bar-container { position: absolute; bottom: 0; height: 4px; display: block; left: 0; right: 0; } .remark-slide-number .progress-bar { height: 100%; background-color: red; } </style>
--- class: middle, left, inverse # Exploratory Data Analysis : Bivariate statistics ### 2021-12-10 #### [Master I MIDS & MFA]() #### [Analyse Exploratoire de Données](http://stephane-v-boucheron.fr/courses/eda/) #### [Stéphane Boucheron](http://stephane-v-boucheron.fr) --- exclude:true template: inter-slide # Bivariate Statistics ### 2021-12-10 #### [Master I MIDS & MFA]() #### [Analyse Exploratoire de Données](http://stephane-v-boucheron.fr/courses/eda/) #### [Stéphane Boucheron](http://stephane-v-boucheron.fr) --- class: inverse, middle ##
### Bivariate samples ### Contingency tables ### From barplots to mosaic plots ### Pearson's statistic ### Linear regression --- class: center, middle, inverse ## Bivariate samples --- ### Definition A bivariate sample is a sequence of couples from `\(\mathcal{X} \times \mathcal{Y}\)` Computationally, a bivariate sample is a two-dimensional array (not a matrix) with `\(n\)` rows and `\(2\)` columns The two columns may be of the same type (`numeric`, `integer`, `character`, `factor`, `date`) or not -- Assume that `\(\mathcal{X} \times \mathcal{Y}\)` may be endowed with a `\(\sigma\)`-algebra and a probability distribution `\(P\)` If we collect an i.i.d. sample from `\(P\)`, we obtain a _bivariate_ sample --- ### Setting I For a health survey, we can repeatedly poll a well-defined human population by picking elements of the population (individuals) uniformly at random For each individual, we measure at wake-up - *blood pressure* - *heart rate* (pulses per minute) We obtain a bivariate sample Both variables are quantitative (and non-negative) --- ### Setting II
Consider the collection of passengers on board RMS Titanic in April 1912 For each passenger, we record class (`Pclass`) and Fate (`Survived/Deceased`). This is again a bivariate sample Both variables are _qualitative_/categorical Finally, consider again the `Titanic` dataset, but record class (`Pclass`) and fare (`Fare`). One variable is _qualitative_, the other _quantitative._ --- ### Convention When we deal with a generic bivariate sample, we denote by `\(X\)` the first coordinate, and `\(Y\)` the second coordinate If `\((x,y) \in \mathcal{X} \times \mathcal{Y}\)` then `$$X(x, y) =x \qquad \text{and} \qquad Y(x, y)=y$$` In statistical parlance, `\(X\)` and `\(Y\)` are called _variables_ The ranges of the two coordinates may be finite or not The ranges may be different or not --- ### Two-dimensional arrays: `dataframe` .panelset[ .panel[.panel-name[List of vectors] An `\(n \times p\)` array with columns of different types is not a `matrix` In
R, Python, and database systems, such arrays are called _tables_ (databases) or _dataframes_ (R, Python)

_dataframes_ are _column-oriented_

In R
, a _dataframe_ is a `list` of `vectors` All vectors in the `list` have the same length, they may be of different types (`class`) ] .panel[.panel-name[Dataframes, Tibble, ...] We should think of a _bivariate_ sample as dataframe `df` with two _columns_ `X` and `Y` and as many _rows_ as there are _individuals_ in the sample If we project on either coordinate, we obtain a _univariate_ sample (`df$X`, `df[["X"]]` or `df$Y`) ```r df <- tibble::tibble(X=letters[seq(1,6,2)], Y=rnorm(3)) df # a bivariate sample of length 3 ``` ``` ## # A tibble: 3 × 2 ## X Y ## <chr> <dbl> ## 1 a 0.412 ## 2 c -0.431 ## 3 e 0.977 ``` ] ] --- class: center, middle, inverse ## Summarizing bivariate samples --- ### Roadmap Just as we did for univariate samples, we review the different statistics used to summarize bivariate samples We start with qualitative bivariate samples, then proceed to quantitative bivariate samples and to mixed qualitative/quantitative samples Summary statistics are enhanced by visualization, that is standard graphic displays that depend on the kind of bivariate sample under consideration --- ### Goals The main goal of bivariate sample exploration is the assessment of possible _association_ between the two variables _Association_ is a loose term _Association_ can be assigned technical definitions if we consider purely quantitative or purely qualitative bivariate samples - linear correlation and regression - chi-square statistics --- class: center, middle, inverse ## Qualitative bivariate samples --- ### Two-ways contingency tables Qualitative univariate samples are summarized using _one-way contingency tables_, that is by counting the number of occurrences of each modality Qualitative bivariate samples are summarized using _two-way contingency tables_: the counts of co-occurrences for each couples of modalities When `\(X\)` and `\(Y\)` are qualitative variables with respectively `\(p\)` and `\(q\)` modalities, let `$$\begin{array}{rl} n_{i,j} & = \sum_{k=1}^n \mathbb{I}_{X_k=i, Y_k=j} \\ n_{i,\cdot} & = \sum_{k=1}^n \mathbb{I}_{X_k=i} \\ n_{\cdot,j} & = \sum_{k=1}^n \mathbb{I}_{Y_k=j} \\ n & = \sum_{i \leq p} n_{i,\cdot} = \sum_{j\leq q} n_{\cdot,j} \end{array}$$` Counts like `\(n_{i, \cdot}\)` or `\(n_{\cdot, j}\)` are called _marginal counts_ --- ### Class struggle
.panelset[ .panel[.panel-name[Two columns] From the Titanic dataset (`Kaggle`), we extract columns `Pclass` and `Survived` - `Pclass` has three modalities: `(1, 2, 3)` - `Survived` has two modalities: (`Deceased`, `Survived`) ] .panel[.panel-name[Preparing] ```r tit_col_types = cols( PassengerId = col_integer(), * Survived = col_factor(levels=c("0", "1"), * include_na = TRUE), * Pclass = col_factor(levels=c("1", "2", "3"), * ordered = TRUE, * include_na = TRUE), Sex = col_factor(levels = c("female", "male")), Age = col_double(), SibSp = col_integer(), Parch = col_integer(), Embarked = col_factor(levels = c("S", "C", "Q"), include_na = TRUE) ) ``` ] .panel[.panel-name[Retyping] ```r train <- read_csv("DATA/titanic/train.csv", col_types=tit_col_types) test <- read_csv("DATA/titanic/test.csv", col_types=tit_col_types) ``` ``` ## Warning: The following named parsers don't match the column names: Survived ``` ```r *test <- mutate(test, * Survived=NA) tit <- union(train, test) *tit$Survived <- forcats::fct_recode(tit$Survived, * "Deceased"="0", * "Survived"="1") %>% * forcats::fct_relevel(c("Survived", "Deceased")) ``` ] .panel[.panel-name[Two-ways table] ```r tit %>% dplyr::select(Pclass, Survived) %>% # Projection on two columns * table() ``` ``` ## Survived ## Pclass Survived Deceased ## 1 136 80 ## 2 87 97 ## 3 119 372 ``` ] .panel[.panel-name[Nicer output] ```r tit %>% dplyr::select(Pclass, Survived) %>% table() %>% broom::tidy() %>% # make it a dataframe tidyr::pivot_wider(names_from=Survived, values_from=n) %>% # with usual look and feel knitr::kable(format="markdown") ``` ``` ## Warning: 'tidy.table' is deprecated. ## See help("Deprecated") ``` |Pclass | Survived| Deceased| |:------|--------:|--------:| |1 | 136| 80| |2 | 87| 97| |3 | 119| 372| ] ] --- ### Supercharged contingency tables Package `summarytools` provides richer contingency tables. ```r pacman::p_load(summarytools) ctable(x=tit$Pclass, y=tit$Survived, style="rmarkdown" , headings = FALSE) ``` | | __Survival__ | Survived | Deceased | NA | Total | |:-------:|---------:|------------:|------------:|------------:|--------------:| | __Pclass__ | | | | | | | 1 | | 136 (42.1%) | 80 (24.8%) | 107 (33.1%) | 323 (100.0%) | | 2 | | 87 (31.4%) | 97 (35.0%) | 93 (33.6%) | 277 (100.0%) | | 3 | | 119 (16.8%) | 372 (52.5%) | 218 (30.7%) | 709 (100.0%) | | Total | | 342 (26.1%) | 549 (41.9%) | 418 (31.9%) | 1309 (100.0%) | --- class: center, middle, inverse background-image: url(img/Piet_Mondriaan_1921_-_Composition_en_rouge_jaune_bleu_et_noir.jpg) background-size: cover ## Mosaicplots --- ### Mosaicplots as tweaked barplots A handy way of portraying a contingency table, and especially a two-way contingency table consists in building a **mosaic plot**. Function `mosaicplot` belongs to base `R`, it takes as input a contingency table and outputs a plot --- ### A mosaic on the Titanic
.panelset[ .panel[.panel-name[Code] ```r tit %>% dplyr::select(Pclass, Survived) %>% * table() %>% * mosaicplot() ``` For each Passenger Class `Pclass`, we draw a _stacked bar plot_ The _width_ of each bar is proportional to the Class frequency (count)
Within each bar (Class), the _height_ of each (Survival Status) component is proportional to the _relative frequency_ of this Survival Status within that Passenger Class ] .panel[.panel-name[Plot] ![](cm-3-EDA_files/figure-html/titanic-mosaic-label-1.png)
Records with missing values have been omitted ] ] --- ### `mosaicplot` from a Grammar of Graphics perspective The `stat_...` part consists in computing the `\((n_{i,j}/n)_{i\in \mathcal{X}, j \in \mathcal{Y}}\)` The `geom_...` part consists in mapping the counts to rectangles: a `mosaicplot` associates a rectangle to each couple of modalities The surface area of the rectangle associated with `\((i,j) \in \mathcal{X}\times \mathcal{Y}\)` is proportional to `$$\underbrace{n_{i,j}}_{\propto\text{ surface area}} = \underbrace{n_{i, .}}_{\propto\text{ width}} \times \underbrace{\frac{n_{i, j}}{n_{i,.}}}_{\propto\text{ height}}$$` Rectangles are placed on a 2-dimensional grid --- ### From counts to probabilities - Normalized counts `$$(n_{i,j}/n)_{i\in \mathcal{X}, j \in \mathcal{Y}}$$` define a probability distribution on `\(\mathcal{X}\times \mathcal{Y}\)`, the so-called *empirical distribution* `\(P_n\)`: `$$P_n\big\{(i,j)\big\} = P_n\{ X=i \wedge Y=j\} = \frac{n_{i,j}}{n}$$` -- - Marginal counts `\((n_{i,.})_{i \in \mathcal{X}}\)` define _empirical marginal distributions_ -- - For each modality `\(i \in \mathcal{X}\)`, the sum of the heights of rectangles `\((i,j)_{j \in \mathcal{Y}}\)` is normalized The heights `\(\propto n_{i,j}/n_{i, \cdot}\)` define (empirical) *conditional probability distributions*: `$$\frac{n_{i,j}}{n_{i, \cdot}} = P_n \big\{Y=j \mid X=i\big\}$$` --
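In R, these empirical distributions can be read off the two-way contingency table with `prop.table` and `margin.table`. A minimal sketch, reusing the `tit` tibble built earlier (`table` silently drops the rows with missing `Survived`, as in the two-way table above):

```r
tab <- table(tit$Pclass, tit$Survived)   # two-way contingency table
prop.table(tab)                  # joint empirical distribution: n_ij / n
margin.table(tab, 1) / sum(tab)  # empirical marginal distribution of Pclass: n_i. / n
prop.table(tab, margin = 1)      # conditional distributions of Survived given Pclass: n_ij / n_i.
```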
This matters when we want to assess a possible _association_ between `\(X\)` and `\(Y\)`, that is, a departure of `\(P_n\)` (the empirical joint distribution) from the *product distribution* defined from the marginal distributions `$$P_n \circ X^{-1} \leftrightarrow \Big(\frac{n_{i,\cdot}}{n}\Big)_{i \in \mathcal{X}}$$` and `$$P_n \circ Y^{-1} \leftrightarrow \Big(\frac{n_{\cdot, j}}{n}\Big)_{j \in \mathcal{Y}}$$` --- ### Do it with `ggplot` and `dplyr`
.panelset[ .panel[.panel-name[Stat] We need to count rows for each combination of the modalities of `Pclass` and `Survived`, and also to gather total counts per modality of `Pclass` ```r tit %>% filter(!is.na(Survived)) %>% group_by(Pclass, Survived) %>% * summarise(count=n()) %>% ungroup() -> tmp tmp %>% dplyr::group_by(Pclass) %>% * summarise(margin=sum(count)) %>% ungroup() %>% dplyr::inner_join(tmp, by=c("Pclass")) %>% mutate(Prop = count/margin) -> df ``` In
SQL, use `GROUP BY ROLLUP(Pclass, Survived)` and a self `JOIN` or a `WINDOW` function ] .panel[.panel-name[Code] ```r df %>% ggplot(aes(x=Pclass, y=Prop, * fill=Survived)) + * geom_col(position = "stack", * aes(width=margin/sum(tmp$count))) + ggtitle("Hand-made mosaicplot") + xlab("Passenger class") ``` ``` ## Warning: Ignoring unknown aesthetics: width ``` ] .panel[.panel-name[Plot] <img src=cm-3-EDA_files/figure-html/mosaic-ggplot-label-1.png alt="Hand made mosaic plot" height="400", align="left"> ] ] ??? More work would allow us to put the bars closer and to mimic the mosaicplot more faithfully In the Grammar of Graphics perspective, - barplots, - two-way mosaicplots, - higher-dimensional mosaicplots are based on column plots and require stat functions involving aggregation operations from extended SQL --- exclude: true --- ### Transposing a two-way contingency table When building a `mosaicplot`, the two variables do not serve the same purpose: the variable mapped to the `x` axis serves as an *explanatory* variable. The one-way contingency table generated by the explanatory variable can be read directly from the widths of the different columns. This is not the case for the other variable The messages conveyed by the `mosaicplot` of the transpose of a contingency table differ from the original message --- ### Variable order matters! .panelset[ .panel[.panel-name[A tale of two tables] Rather than telling us the fate of the different passenger classes, this `mosaicplot` tells us about the class composition of survivors and casualties
The two stories are related but one is harder to comment about ] .panel[.panel-name[Survival explained by Class] ```r mosaicplot(formula= Pclass ~ Survived, data= tit) ``` <img src="cm-3-EDA_files/figure-html/mosaictitanic1-1.png" width="504" /> ] .panel[.panel-name[Class according to Survival] ```r mosaicplot(formula= Survived ~ Pclass, data= tit) ``` <img src="cm-3-EDA_files/figure-html/mosaictitanic2-1.png" width="504" /> ] ] --- ### Mosaicplots and `tidyverse` Package `ggmosaic` is an extension of `ggplot2` that delivers mosaicplots that fits in the `tidyverse` suite and comply with *Grammar of Graphics*. .panelset[ .panel[.panel-name[Code] ```r pacman::p_load(ggmosaic) tit %>% dplyr::select(Pclass, Survived) %>% ggplot() + * geom_mosaic(aes(x = product(Survived, Pclass), fill=Survived)) + labs(x= "Passenger class", y="Fate") + * scale_fill_viridis_d() + ggtitle("Titanic mosaic with tidyverse flavor") ``` ] .panel[.panel-name[Plot it] <img src=cm-3-EDA_files/figure-html/titanic-ggmosaic-label-1.png alt="ggmosaic plot" height="400" align="left"> ] .panel[.panel-name[Plot it again] <img src=cm-3-EDA_files/figure-html/titanic-ggmosaic-label-transpose-1.png alt="ggmosaic plot" height="400" align="left"> Here again the order of the variable names that are passed to `ggmosaic::product` is important. We are (implicitly) trying to visualize the impact of (passenger) class on fate. It makes sense to map `Pclass` on the `x` axis and `Survived` on the `y` axis. ] ] --- class: middle, center, inverse ## Quantitative bivariate samples --- ### Numerical summaries The numerical summary of a numerical bivariate sample consists of an _empirical mean_ `$$\begin{pmatrix}\overline{X}_n \\ \overline{Y}_n \end{pmatrix} = \frac{1}{n} \sum_{i=1}^n \begin{pmatrix} x_i \\ y_i \end{pmatrix}$$` and an _empirical covariance matrix_ `$$\begin{pmatrix}\operatorname{var}_n(X) & \operatorname{cov}_n(X, Y) \\ \operatorname{cov}_n(X, Y) & \operatorname{var}_n(Y)\end{pmatrix}$$` with `$$\operatorname{var}_n(X, Y) = \frac{1}{n}\sum_{k=1}^n \Big(x_i-\overline{X}_n\Big)^2$$` and `$$\operatorname{cov}_n(X, Y) = \frac{1}{n}\sum_{k=1}^n \Big(x_i-\overline{X}_n\Big)\times \Big(y_i-\overline{Y}_n\Big)$$` --- ### Covariance matrices have properties The empirical covariance matrix is the *covariance matrix of the joint empirical distribution*. As a covariance matrix, the empirical covariance matrix is *symmetric*, *semi-definite positive (SDP)* ###
- A square `\(n \times n\)` matrix `\(A\)` is semi-definite positive (SDP) iff `$$\forall u \in \mathbb{R}^n, \qquad u^T \times A u = \langle u, Au \rangle \geq 0$$` - A square `\(n \times n\)` matrix `\(A\)` is definite positive (DP) iff `$$\forall u \in \mathbb{R}^n \setminus \{0\}, \qquad u^T \times A u = \langle u, Au \rangle > 0$$` --- ### Linear correlation coefficient The **linear correlation coefficient** is defined from the covariance matrix as `$$\rho = \frac{\operatorname{cov}_n(X, Y)}{\sqrt{\operatorname{var}_n(X) \operatorname{var}_n(Y)}}$$`
By the Cauchy-Schwarz inequality, we always have `$$-1 \leq \rho \leq 1$$`
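A quick numerical check on simulated toy data (a minimal sketch): the correlation coefficient is the cosine of the angle between the *centered* columns, which is one way to see the Cauchy-Schwarz bound

```r
set.seed(42)                               # toy data, for illustration only
x <- rnorm(50); y <- 2 * x + rnorm(50)
xc <- x - mean(x); yc <- y - mean(y)       # centered columns
cos_angle <- sum(xc * yc) / sqrt(sum(xc^2) * sum(yc^2))
all.equal(cos_angle, cor(x, y))            # TRUE: cor() is this cosine
```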
Translating and/or rescaling (by a positive factor) the columns does not modify the linear correlation coefficient! Functions `cov` and `cor` from base R
perform the computations --- ### Do it the SQL way
```r data <- read_delim('./DATA/Enfants.txt', delim='\t') data %>% dplyr::select(MASSE, TAILLE) %>% dplyr::summarise(mx= mean(TAILLE), my=mean(MASSE), m2x=mean(TAILLE^2), m2y=mean(MASSE^2), mxy=mean(TAILLE*MASSE)) %>% dplyr::mutate(var_taille=m2x-mx^2, var_masse=m2y-my^2, cov_masse_taille=mxy -mx*my) %>% dplyr::mutate(cor=cov_masse_taille/sqrt(var_taille * var_masse)) %>% dplyr::select(- starts_with('m')) ``` ``` ## # A tibble: 1 × 4 ## var_taille var_masse cov_masse_taille cor ## <dbl> <dbl> <dbl> <dbl> ## 1 2119. 44.0 48.7 0.160 ``` --- ### Visualizing quantitative bivariate samples <img src="img/pexels-katerina-holmes-5905611.jpg" align="right" width="200px"> Suppose now, we want to visualize a quantitative bivariate sample of length `\(n\)`. This bivariate sample (a dataframe) may be handled as a _real matrix_ with `\(n\)` rows and `\(2\)` columns Geometric concepts come into play --- ### Exploring column space We may attempt to visualize the two columns, that is the two `\(n\)`-dimensional vectors or the rows, that is `\(n\)` points on the real plane.
If we try to visualize the two columns, we simplify the problem by _projecting on the plane generated by the two columns_ Then what matters is the _angle_ between the two (centered) vectors: its _cosine_ is precisely the _linear correlation coefficient_ defined above --- ### Exploring row space If we try to visualize the rows, the most basic visualization of a quantitative bivariate sample is the *scatterplot*. In the grammar of graphics parlance, it consists in mapping the two variables on the two axes, and mapping rows to points using `geom_point` and `stat_identity` --- ### A Gaussian cloud
.panelset[ .panel[.panel-name[Simulation] We build an artificial bivariate sample, by first building a covariance matrix `COV` (it is randomly generated). Then we build a bivariate normal sample `s` of length `n` and turn it into a dataframe `u`. The dataframe is then fed to `ggplot`. ```r set.seed(1515) # for the sake of reproducibility n <- 100 V <- matrix(rnorm(4, 1, 1), nrow = 2) COV <- V %*% t(V) # a random covariance matrix, COV is symmetric and SDP s <- t(V %*% matrix(rnorm(2 * 10 * n), ncol=10*n)) u <- tibble(X=s[,1], Y=s[, 2]) # a bivariate normal sample emp_mean <- as_data_frame(t(colMeans(u))) ``` ``` ## Warning: `as_data_frame()` was deprecated in tibble 2.0.0. ## Please use `as_tibble()` instead. ## The signature and semantics have changed, see `?as_tibble`. ``` ```r p_scatter_gaussian <- ggplot(u, aes(x=X, y=Y)) + geom_point(alpha=.5, size=1) + geom_point(data=emp_mean, color=2, size=5) + coord_fixed() + ggtitle(stringr::str_c("Gaussian cloud, cor = ", round(cor(u$X, u$Y), 2), sep="")) p_scatter_gaussian ``` ] .panel[.panel-name[Numerical Summary] - Mean vector (Empirical mean) ```r t(colMeans(u)) %>% knitr::kable(digits = 3, col.names = c("$\\overline{X_n}$", "$\\overline{Y_n}$")) ``` <table> <thead> <tr> <th style="text-align:right;"> `\(\overline{X_n}\)` </th> <th style="text-align:right;"> `\(\overline{Y_n}\)` </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.004 </td> <td style="text-align:right;"> -0.004 </td> </tr> </tbody> </table> - Covariance matrix (Empirical covariance matrix) ```r cov(u) %>% as.data.frame() %>% knitr::kable(digits = 3) ``` <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> X </th> <th style="text-align:right;"> Y </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> X </td> <td style="text-align:right;"> 4.370 </td> <td style="text-align:right;"> -0.706 </td> </tr> <tr> <td style="text-align:left;"> Y </td> <td style="text-align:right;"> -0.706 </td> <td style="text-align:right;"> 1.212 </td> </tr> </tbody> </table> ] .panel[.panel-name[Plot] ![](cm-3-EDA_files/figure-html/gaussiancloud-1.png) ] ] --- class: center, middle, inverse ## Qualitative and quantitative variables --- ### Back to
Back to the Titanic dataset: let us consider the variables `\(X=\)` `Pclass` (qualitative) and `\(Y=\)` `Fare` (quantitative) The numerical summary of such a bivariate sample consists of a _list of numerical summaries of univariate samples_ For each modality of the qualitative variable `\(X\)`, we compute the _conditional mean_ and _variance_ of the quantitative variable `\(Y\)` As before, `\(\overline{Y}_n\)` denotes the empirical mean of `\(Y\)` and `\(\sigma^2_Y\)` the empirical variance of `\(Y\)` (also called the _total variance_) --- ### Conditional summaries For each modality `\(i \in \mathcal{X}\)`, we define: - Conditional Mean of `\(Y\)` given `\(\{ X = i \}\)` `$$\overline{Y}_{n\mid i} = \frac{1}{n_i} \sum_{k\leq n} \mathbb{I}_{x_k =i} \times y_k$$` - Conditional Variance of `\(Y\)` given `\(\{ X= i\}\)` `$$\sigma^2_{Y\mid i} = \frac{1}{n_i} \sum_{k \leq n} \mathbb{I}_{x_k =i} \times \bigg( y_k - \overline{Y}_{n \mid i}\bigg)^2$$` --- ### Huygens-Pythagoras formula `$$\sigma^2_{Y} = \underbrace{\sum_{i\in \mathcal{X}} \frac{n_i}{n} \sigma^2_{Y \mid i}}_{\text{mean of conditional variances}} + \underbrace{\sum_{i\in \mathcal{X}} \frac{n_i}{n} \big(\overline{Y}_{n \mid i} - \overline{Y}_{n}\big)^2}_{\text{variance of conditional means}}$$`
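A minimal sketch checking this decomposition numerically, using the `tit` tibble from the earlier slides with `\(Y=\)` `Fare` and `\(X=\)` `Pclass` (any other qualitative/quantitative pair would do):

```r
tit %>%
  filter(!is.na(Fare)) %>%
  group_by(Pclass) %>%
  summarise(n_i   = n(),
            cmean = mean(Fare),
            cvar  = mean((Fare - cmean)^2)) %>%          # conditional variances (1/n_i normalization)
  summarise(within  = sum(n_i * cvar) / sum(n_i),        # mean of conditional variances
            between = sum(n_i * (cmean - weighted.mean(cmean, n_i))^2) / sum(n_i)) %>%
  mutate(total = within + between)  # equals mean((Fare - mean(Fare))^2) on the same rows
```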
Check it on a dataset of your choice --- ### Robust bivariate summaries It is also possible and fruitful to compute - conditional quantiles (median, quartiles) and - conditional interquartile ranges (IQR) Conditional mean, variance, median, IQR (with `dplyr`
) ```r tit %>% dplyr::select(Survived, Fare) %>% dplyr::group_by(Survived) %>% * dplyr::summarise(cmean=mean(Fare, na.rm=TRUE), csd=sd(Fare,na.rm = TRUE), cmedian=median(Fare, na.rm = TRUE), cIQR=IQR(Fare,na.rm = TRUE)) ``` ``` ## # A tibble: 3 × 5 ## Survived cmean csd cmedian cIQR ## <fct> <dbl> <dbl> <dbl> <dbl> ## 1 Survived 48.4 66.6 26 44.5 ## 2 Deceased 22.1 31.4 10.5 18.1 ## 3 <NA> 35.6 55.9 14.5 23.6 ``` --- ### Visualization of mixed bivariate samples Visualization of qualitative/quantitative bivariate samples consists in displaying visual summaries of conditional distribution of `\(Y\)` given `\(X=i, i \in \mathcal{X}\)` `Boxplots` and `violinplots` are relevant here --- ### Mixed bivariate samples from Titanic (violine plots) .panelset[ .panel[.panel-name[Code] ```r filtered_tit <- tit %>% dplyr::select(Pclass, Survived, Fare) %>% dplyr::filter(Fare > 0 ) v <- filtered_tit %>% ggplot() + aes(y=Fare) + scale_y_log10() vv <- v + geom_violin() ``` ] .panel[.panel-name[Fare versus Passenger Class] ```r vv + aes(x=Pclass) + ggtitle("Titanic: Fare versus Passenger Class") ``` <img src="cm-3-EDA_files/figure-html/unnamed-chunk-12-1.png" width="504" /> ] .panel[.panel-name[Fare versus Survival] ```r vv + aes(x=Survived) + ggtitle("Titanic: Fare versus Survival") ``` <img src="cm-3-EDA_files/figure-html/unnamed-chunk-13-1.png" width="504" /> ] ] --- ### Mixed bivariate samples from Titanic (boxplots) .panelset[ .panel[.pane-name[Code] - Comply with the `DRY` principle - Avoid `WET` ```r vw <- v + geom_boxplot() ``` ] .panel[.panel-name[Fare versus Passenger Class] ```r vw + aes(x=Pclass) + ggtitle("Titanic: Fare versus Passenger Class") ``` <img src="cm-3-EDA_files/figure-html/unnamed-chunk-15-1.png" width="504" /> ] .panel[.panel-name[Fare versus Survival] ```r vw + aes(x=Survived) + ggtitle("Titanic: Fare versus Survival") ``` <img src="cm-3-EDA_files/figure-html/unnamed-chunk-16-1.png" width="504" /> ] ] --- ### Dataset `whiteside` (from package `MASS` of
R) > Mr Derek Whiteside of the UK Building Research Station recorded the weekly gas consumption and average external temperature at his own house in south-east England for two heating seasons, one of 26 weeks before, and one of 30 weeks after cavity-wall insulation was installed. The object of the exercise was to assess the effect of the insulation on gas consumption. --- ### Dataset `whiteside` `Gas` and `Temp` are both quantitative variables while `Insul` is qualitative with two modalities (`Before`, `After`). `Insul` : A factor, before or after insulation. `Temp` : Purportedly the average outside temperature in degrees Celsius. (These values are far too low for any 56-week period in the 1960s in South-East England. It might be the weekly average of daily minima.) `Gas` : The weekly gas consumption in 1000s of cubic feet. ---
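### Dataset `whiteside`: conditional summaries

As for the Titanic data, conditional summaries of `Gas` given `Insul` can be computed with `dplyr` (a minimal sketch, assuming `dplyr` is loaded as in the rest of the deck):

```r
MASS::whiteside %>%
  group_by(Insul) %>%
  summarise(n       = n(),
            cmean   = mean(Gas),
            csd     = sd(Gas),
            cmedian = median(Gas),
            cIQR    = IQR(Gas))
```

---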
```r MASS::whiteside %>% ggplot(aes(x=Insul, y=Temp)) + geom_violin() + ggtitle("Whiteside data: violinplots") ``` <img src="cm-3-EDA_files/figure-html/unnamed-chunk-17-1.png" width="504" /> --- class: middle, center, inverse ## Simple linear regression --- - We now explore association between _two_ quantitative variables - We investigate the association between two quantitative variables as a _prediction_ problem - We aim at predicting the value of `\(Y\)` as a function of `\(X\)`. - We restrict our attention to linear/affine prediction. --- We look for `\(a, b \in \mathbb{R}\)` such that `$$y_i \approx a x_i +b$$` Making `\(\approx\)` meaningful compels us to choose a _goodness of fit_ criterion. -- Several criteria are possible, for example: `$$\begin{array}{rl}\text{Mean absolute deviation} & = \frac{1}{n}\sum_{i=1}^n \big|y_i - a x_i -b \big| \\\text{Mean quadratic deviation} & = \frac{1}{n}\sum_{i=1}^n \big|y_i - a x_i -b \big|^2 \end{array}$$` --- In their days, Laplace championed the mean absolute deviation, while Gauss advocated the mean quadratic deviation. For computational reasons, we focus on minimizing the mean quadratic deviation. .pull-left[ <img src="img/Rue_Laplace_(Paris).JPG" align="right" width="200"> > The fourth chapter of Laplace treatise includes an exposition of the _method of least squares_, a remarkable testimony to Laplace's command over the processes of analysis. > In 1805 Legendre had published the _method of least squares_, making no attempt to tie it to the theory of probability. ] .pull-right[ <img src="img/gauss_10_DM.jpg" align="right" height="100"> > In 1809 Gauss had derived the _normal distribution_ from the principle that the arithmetic mean of observations gives the most probable value for the quantity measured; then, turning this argument back upon itself, he showed that, if the errors of observation are normally distributed, the _least squares estimates_ give the most probable values for the coefficients in _regression_ situations ] .tr[Wikipedia] --- class: center, middle, inverse ## Least Square Regression --- ### Minimizing a cost function The *Least Square Regression* problem consists of minimizing (with respect to `\((a,b)\)`): `$$\begin{array}{rl} \ell_n(a,b) & = \sum_{i=1}^n \big(y_i - a x_i -b \big)^2 \\ & = \sum_{i=1}^n \big((y_i - \overline{Y}_n) - a (x_i - \overline{X}_n) + \overline{Y}_n - a \overline{X}_n-b \big)^2 \\ & = \sum_{i=1}^n \big((y_i - \overline{Y}_n) - a (x_i - \overline{X}_n) \big)^2 + n \big(\overline{Y}_n - a \overline{X}_n-b\big)^2 \end{array}$$` --- ### Deriving the solution The function to be minimized is smooth and strictly convex over `\(\mathbb{R}^2\)` : a unique minimum is attained where the gradient vanishes -- It is enough to compute the partial derivatives. `$$\begin{array}{rl}\frac{\partial \ell_n}{\partial a} & = - 2 \operatorname{cov}(X,Y) + 2 a \operatorname{var}(X) -2 n \big(\overline{Y}_n - a \overline{X}_n-b\big) \overline{X}_n \\ \frac{\partial \ell_n}{\partial b} & = -2 n \big(\overline{Y}_n - a \overline{X}_n-b\big)\end{array}$$` --- ### A closed-form solution Zeroing partial derivatives leads to `$$\begin{array}{rl} \widehat{a} & = \frac{\operatorname{cov}(X,Y)}{\operatorname{var}(X)} \\ \widehat{b} & = \overline{Y}_n - \frac{\operatorname{cov}(X,Y)}{\operatorname{var}(X)} \overline{X}_n \end{array}$$` -- or `$$\begin{array}{rl} \widehat{a} & = \rho \frac{\sigma_y}{\sigma_x} \\ \widehat{b} & = \overline{Y}_n - \rho\frac{\sigma_y}{\sigma_x} \overline{X}_n \end{array}$$` --
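A quick sanity check of the closed-form solution (a minimal sketch reusing the simulated tibble `u` from the Gaussian cloud slides):

```r
a_hat <- cov(u$X, u$Y) / var(u$X)       # slope: cov(X, Y) / var(X)
b_hat <- mean(u$Y) - a_hat * mean(u$X)  # intercept
c(a_hat = a_hat, b_hat = b_hat)
coef(lm(Y ~ X, data = u))               # should coincide, up to rounding
```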
If the sample were standardized, that is, if `\(X\)` (resp. `\(Y\)`) were divided by `\(\sigma_X\)` (resp. `\(\sigma_Y\)`), the slope of the regression line would be the correlation coefficient --- ### Overplotting the Gaussian cloud .left-column[ - The _slope_ and _intercept_ can be computed from the sample summary (empirical mean and covariance matrix) - In higher dimension, coefficients are from `lm(...)` ] .right-column[ <img src="cm-3-EDA_files/figure-html/unnamed-chunk-18-1.png" width="504" /> ] --- ### `lm(formula, data)` ```r mod <- lm(formula=Y ~ X, data=u) mod %>% summary() ``` ``` ## ## Call: ## lm(formula = Y ~ X, data = u) ## ## Residuals: ## Min 1Q Median 3Q Max ## -3.0168 -0.7106 -0.0079 0.7294 3.5773 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -0.003685 0.033145 -0.111 0.911 ## X -0.161562 0.015864 -10.184 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.048 on 998 degrees of freedom ## Multiple R-squared: 0.09415, Adjusted R-squared: 0.09324 ## F-statistic: 103.7 on 1 and 998 DF, p-value: < 2.2e-16 ``` --- ### Residuals .panelset[ .panel[.panel-name[Residuals] The _residuals_ are the prediction errors `\(\left(y_i - \widehat{a}x_i - \widehat{b}\right)_{i\leq n}\)` Residuals play a central role in _regression diagnostic_ The `Residual Standard Error`, is the square root of the normalized sum of squared residuals: `$$\frac{1}{n-2}\sum_{i=1}^n \left(y_i - \widehat{a}x_i - \widehat{b}\right)^2$$` The normalization coefficient is the number of rows `\(n\)` diminished by the number of adjusted parameters (the so-called _degrees of freedom_)
This makes sense if we adopt a modeling perspective, if we accept the _Gaussian Linear Models_ assumptions from the Statistical Inference course ] .panel[.panel-name[Code] ```r p_scatter_gaussian %+% * broom::augment(lm(Y ~ X, u)) + geom_line(aes(x=X, y=.fitted)) + geom_segment(aes(x=X, xend=X, y=.fitted, yend=Y, color=forcats::as_factor(sign(.resid))), alpha=.2) + theme(legend.position = "None") + ggtitle("Gaussian cloud",subtitle = "with residuals") ``` ] .panel[.panel-name[Plot] ![](cm-3-EDA_files/figure-html/scatplot-residuals-1.png) The residuals are the lengths of the segments connecting sample points to their projections on the regression line ] .panel[.panel-name[Multiple R-squared] Technically, the `Multiple R-squared` or : _coefficient of determination_ is the squared empirical correlation coefficient `\(\rho^2\)` between the explanatory and the response variables (in simple linear regression) `$$1 - \frac{\sum_{i=1}^n \left(y_i - \widehat{a}x_i - \widehat{b}\right)^2}{\sum_{i=1}^n \left(y_i - \overline{Y}_n\right)^2}= 1 - \frac{\sum_{i=1}^n \left(y_i - \widehat{y}_i \right)^2}{\sum_{i=1}^n \left(y_i - \overline{Y}_n\right)^2}$$` It is also understood as the share of the variance of the response variable that is _explained_ by the explanatory variable ] .panel[.panel-name[Adjusted R-squared] The `Adjusted R-squared` is a deflated version of `Multiple R-squared` `$$1 - \frac{\sum_{i=1}^n \left(y_i - \widehat{a}x_i - \widehat{b}\right)^2/(n-p-1)}{\sum_{i=1}^n \left(y_i - \overline{Y}_n\right)^2/(n-1)}$$` It is useful when comparing the merits of several competing models (this takes us beyond the scope of this lesson) ] ] --- count: false ### Visualizing residuals .panel1-flip-scatplot-residuals-auto[ ```r *p_scatter_gaussian ``` ] .panel2-flip-scatplot-residuals-auto[ <img src="cm-3-EDA_files/figure-html/flip-scatplot-residuals_auto_01_output-1.png" width="504" /> ] --- count: false ### Visualizing residuals .panel1-flip-scatplot-residuals-auto[ ```r p_scatter_gaussian %+% * broom::augment(lm(Y ~ X, u)) #<< ``` ] .panel2-flip-scatplot-residuals-auto[ <img src="cm-3-EDA_files/figure-html/flip-scatplot-residuals_auto_02_output-1.png" width="504" /> ] --- count: false ### Visualizing residuals .panel1-flip-scatplot-residuals-auto[ ```r p_scatter_gaussian %+% * broom::augment(lm(Y ~ X, u)) + * geom_line(aes(x=X, y=.fitted)) ``` ] .panel2-flip-scatplot-residuals-auto[ <img src="cm-3-EDA_files/figure-html/flip-scatplot-residuals_auto_03_output-1.png" width="504" /> ] --- count: false ### Visualizing residuals .panel1-flip-scatplot-residuals-auto[ ```r p_scatter_gaussian %+% * broom::augment(lm(Y ~ X, u)) + geom_line(aes(x=X, y=.fitted)) + * geom_segment(aes(x=X, * xend=X, * y=.fitted, * yend=Y, * color=as_factor(sign(.resid))), * alpha=.2) ``` ] .panel2-flip-scatplot-residuals-auto[ <img src="cm-3-EDA_files/figure-html/flip-scatplot-residuals_auto_04_output-1.png" width="504" /> ] --- count: false ### Visualizing residuals .panel1-flip-scatplot-residuals-auto[ ```r p_scatter_gaussian %+% * broom::augment(lm(Y ~ X, u)) + geom_line(aes(x=X, y=.fitted)) + geom_segment(aes(x=X, xend=X, y=.fitted, yend=Y, color=as_factor(sign(.resid))), alpha=.2) + * theme(legend.position = "None") ``` ] .panel2-flip-scatplot-residuals-auto[ <img src="cm-3-EDA_files/figure-html/flip-scatplot-residuals_auto_05_output-1.png" width="504" /> ] --- count: false ### Visualizing residuals .panel1-flip-scatplot-residuals-auto[ ```r p_scatter_gaussian %+% * broom::augment(lm(Y ~ X, u)) + geom_line(aes(x=X, y=.fitted)) 
+ geom_segment(aes(x=X, xend=X, y=.fitted, yend=Y, color=as_factor(sign(.resid))), alpha=.2) + theme(legend.position = "None") + * ggtitle("Gaussian cloud", * subtitle = "with residuals!") ``` ] .panel2-flip-scatplot-residuals-auto[ <img src="cm-3-EDA_files/figure-html/flip-scatplot-residuals_auto_06_output-1.png" width="504" /> ] <style> .panel1-flip-scatplot-residuals-auto { color: black; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-flip-scatplot-residuals-auto { color: black; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-flip-scatplot-residuals-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- class: middle, inverse, center ## `\(y = x^T \beta + \sigma \epsilon\)`: The biggest lie? --- ### Attention!
- Any numeric bivariate sample can be fed to `lm` - Whatever the bivariate dataset, you will obtain a linear prediction model - It is not wise to rely exclusively on the `Multiple R-squared` to assess a linear model -
Different datasets can lead to the same regression line, the same `Multiple R-squared`, and the same `Adjusted R-squared` --- ### Anscombe quartet .fl.w-50.pa2[ Four simple linear regression problems packaged in the dataframe `datasets::anscombe` - `y1 ~ x1` - `y2 ~ x2` - `y3 ~ x3` - `y4 ~ x4` ] .fl.w-50.pa2[ ```r anscombe <- datasets::anscombe anscombe %>% gt() ```
| x1| x2| x3| x4|    y1|   y2|    y3|    y4|
|--:|--:|--:|--:|-----:|----:|-----:|-----:|
| 10| 10| 10|  8|  8.04| 9.14|  7.46|  6.58|
|  8|  8|  8|  8|  6.95| 8.14|  6.77|  5.76|
| 13| 13| 13|  8|  7.58| 8.74| 12.74|  7.71|
|  9|  9|  9|  8|  8.81| 8.77|  7.11|  8.84|
| 11| 11| 11|  8|  8.33| 9.26|  7.81|  8.47|
| 14| 14| 14|  8|  9.96| 8.10|  8.84|  7.04|
|  6|  6|  6|  8|  7.24| 6.13|  6.08|  5.25|
|  4|  4|  4| 19|  4.26| 3.10|  5.39| 12.50|
| 12| 12| 12|  8| 10.84| 9.13|  8.15|  5.56|
|  7|  7|  7|  8|  4.82| 7.26|  6.42|  7.91|
|  5|  5|  5|  8|  5.68| 4.74|  5.73|  6.89|
] --- ### Anscombe quartet: 4 datasets, 1 linear fit with almost identical goodness of fits .panelset[ .panel[.panel-name[`y1 ~ x1`] ```r lm(y1 ~ x1, anscombe) %>% summary ``` ``` ## ## Call: ## lm(formula = y1 ~ x1, data = anscombe) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.92127 -0.45577 -0.04136 0.70941 1.83882 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.0001 1.1247 2.667 0.02573 * ## x1 0.5001 0.1179 4.241 0.00217 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.237 on 9 degrees of freedom ## Multiple R-squared: 0.6665, Adjusted R-squared: 0.6295 ## F-statistic: 17.99 on 1 and 9 DF, p-value: 0.00217 ``` ] .panel[.panel-name[`y2 ~ x2`] ```r lm(y2 ~ x2, anscombe) %>% summary ``` ``` ## ## Call: ## lm(formula = y2 ~ x2, data = anscombe) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.9009 -0.7609 0.1291 0.9491 1.2691 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.001 1.125 2.667 0.02576 * ## x2 0.500 0.118 4.239 0.00218 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.237 on 9 degrees of freedom ## Multiple R-squared: 0.6662, Adjusted R-squared: 0.6292 ## F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002179 ``` ] .panel[.panel-name[`y3 ~ x3`] ```r lm(y3 ~ x3, anscombe) %>% summary ``` ``` ## ## Call: ## lm(formula = y3 ~ x3, data = anscombe) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.1586 -0.6146 -0.2303 0.1540 3.2411 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.0025 1.1245 2.670 0.02562 * ## x3 0.4997 0.1179 4.239 0.00218 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.236 on 9 degrees of freedom ## Multiple R-squared: 0.6663, Adjusted R-squared: 0.6292 ## F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002176 ``` ] .panel[.panel-name[`y4 ~ x4`] ```r lm(y4 ~ x4, anscombe) %>% summary ``` ``` ## ## Call: ## lm(formula = y4 ~ x4, data = anscombe) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.751 -0.831 0.000 0.809 1.839 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.0017 1.1239 2.671 0.02559 * ## x4 0.4999 0.1178 4.243 0.00216 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.236 on 9 degrees of freedom ## Multiple R-squared: 0.6667, Adjusted R-squared: 0.6297 ## F-statistic: 18 on 1 and 9 DF, p-value: 0.002165 ``` ] ] ??? --- ### Anscombe quartet (continued) All four numerical summaries look similar: - `Intercept` `\(\approx 3.0017\)` - `slope` `\(\approx 0.5\)` - Residual standard error `\(\approx 1.236\)` - Multiple R-squared `\(\approx .67\)` - F-statistic `\(\approx 18\)` `\(n\)` is equal to 11 The number of adjusted parameters `\(p\)` is 2 The number of degrees of freedom is `\(n-p=9\)` ??? How is RSE computed ? `$$\frac{1}{n-p}\sum_{i=1}^n \left(y_j[i] - \widehat{y}_j[i] \right)^2$$` How is R-squared computed ? How is adjusted R-squared ? --- --- ### Anscombe quartet (continued) .panelset[ .panel[.panel-name[Beyond numbers] Visual inspection of the data reveals that some linear models are more relevant than others This is the message of the Anscombe quartet. It is made of four bivariate samples with `\(n=11\)` individuals. 
] .panel[.panel-name[Anscombe in long format] ```r datasets::anscombe %>% pivot_longer(everything(), names_to = c(".value", "group"), names_pattern = "(.)(.)" ) %>% rename(X=x, Y=y) %>% arrange(group)-> anscombe ``` From [https://tidyr.tidyverse.org/articles/pivot.html](https://tidyr.tidyverse.org/articles/pivot.html) ] .panel[.panel-name[Before] ```r datasets::anscombe %>% head(8) %>% knitr::kable() ``` <table> <thead> <tr> <th style="text-align:right;"> x1 </th> <th style="text-align:right;"> x2 </th> <th style="text-align:right;"> x3 </th> <th style="text-align:right;"> x4 </th> <th style="text-align:right;"> y1 </th> <th style="text-align:right;"> y2 </th> <th style="text-align:right;"> y3 </th> <th style="text-align:right;"> y4 </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 8.04 </td> <td style="text-align:right;"> 9.14 </td> <td style="text-align:right;"> 7.46 </td> <td style="text-align:right;"> 6.58 </td> </tr> <tr> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 6.95 </td> <td style="text-align:right;"> 8.14 </td> <td style="text-align:right;"> 6.77 </td> <td style="text-align:right;"> 5.76 </td> </tr> <tr> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 7.58 </td> <td style="text-align:right;"> 8.74 </td> <td style="text-align:right;"> 12.74 </td> <td style="text-align:right;"> 7.71 </td> </tr> <tr> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 8.81 </td> <td style="text-align:right;"> 8.77 </td> <td style="text-align:right;"> 7.11 </td> <td style="text-align:right;"> 8.84 </td> </tr> <tr> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 8.33 </td> <td style="text-align:right;"> 9.26 </td> <td style="text-align:right;"> 7.81 </td> <td style="text-align:right;"> 8.47 </td> </tr> <tr> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 9.96 </td> <td style="text-align:right;"> 8.10 </td> <td style="text-align:right;"> 8.84 </td> <td style="text-align:right;"> 7.04 </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 7.24 </td> <td style="text-align:right;"> 6.13 </td> <td style="text-align:right;"> 6.08 </td> <td style="text-align:right;"> 5.25 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 19 </td> <td style="text-align:right;"> 4.26 </td> <td style="text-align:right;"> 3.10 </td> <td style="text-align:right;"> 5.39 </td> <td style="text-align:right;"> 12.50 </td> </tr> </tbody> </table> ] .panel[.panel-name[After] ```r anscombe %>% head(8) %>% 
knitr::kable() ``` <table> <thead> <tr> <th style="text-align:left;"> group </th> <th style="text-align:right;"> X </th> <th style="text-align:right;"> Y </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 8.04 </td> </tr> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 6.95 </td> </tr> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 7.58 </td> </tr> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 8.81 </td> </tr> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 8.33 </td> </tr> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> 9.96 </td> </tr> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 7.24 </td> </tr> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 4.26 </td> </tr> </tbody> </table> ] ] --- count: false ### Pivoting Anscombe .panel1-pivot_anscombe-auto[ ```r *datasets::anscombe ``` ] .panel2-pivot_anscombe-auto[ ``` ## x1 x2 x3 x4 y1 y2 y3 y4 ## 1 10 10 10 8 8.04 9.14 7.46 6.58 ## 2 8 8 8 8 6.95 8.14 6.77 5.76 ## 3 13 13 13 8 7.58 8.74 12.74 7.71 ## 4 9 9 9 8 8.81 8.77 7.11 8.84 ## 5 11 11 11 8 8.33 9.26 7.81 8.47 ## 6 14 14 14 8 9.96 8.10 8.84 7.04 ## 7 6 6 6 8 7.24 6.13 6.08 5.25 ## 8 4 4 4 19 4.26 3.10 5.39 12.50 ## 9 12 12 12 8 10.84 9.13 8.15 5.56 ## 10 7 7 7 8 4.82 7.26 6.42 7.91 ## 11 5 5 5 8 5.68 4.74 5.73 6.89 ``` ] --- count: false ### Pivoting Anscombe .panel1-pivot_anscombe-auto[ ```r datasets::anscombe %>% * pivot_longer(everything(), * names_to = c(".value", "group"), * names_pattern = "(.)(.)" * ) ``` ] .panel2-pivot_anscombe-auto[ ``` ## # A tibble: 44 × 3 ## group x y ## <chr> <dbl> <dbl> ## 1 1 10 8.04 ## 2 2 10 9.14 ## 3 3 10 7.46 ## 4 4 8 6.58 ## 5 1 8 6.95 ## 6 2 8 8.14 ## 7 3 8 6.77 ## 8 4 8 5.76 ## 9 1 13 7.58 ## 10 2 13 8.74 ## # … with 34 more rows ``` ] --- count: false ### Pivoting Anscombe .panel1-pivot_anscombe-auto[ ```r datasets::anscombe %>% pivot_longer(everything(), names_to = c(".value", "group"), names_pattern = "(.)(.)" ) %>% * rename(X=x, Y=y) ``` ] .panel2-pivot_anscombe-auto[ ``` ## # A tibble: 44 × 3 ## group X Y ## <chr> <dbl> <dbl> ## 1 1 10 8.04 ## 2 2 10 9.14 ## 3 3 10 7.46 ## 4 4 8 6.58 ## 5 1 8 6.95 ## 6 2 8 8.14 ## 7 3 8 6.77 ## 8 4 8 5.76 ## 9 1 13 7.58 ## 10 2 13 8.74 ## # … with 34 more rows ``` ] --- count: false ### Pivoting Anscombe .panel1-pivot_anscombe-auto[ ```r datasets::anscombe %>% pivot_longer(everything(), names_to = c(".value", "group"), names_pattern = "(.)(.)" ) %>% rename(X=x, Y=y) %>% * arrange(group)-> anscombe_long ``` ] .panel2-pivot_anscombe-auto[ ] --- count: false ### Pivoting Anscombe .panel1-pivot_anscombe-auto[ ```r datasets::anscombe %>% pivot_longer(everything(), names_to = c(".value", "group"), names_pattern = "(.)(.)" ) %>% rename(X=x, Y=y) %>% arrange(group)-> anscombe_long *anscombe_long ``` ] .panel2-pivot_anscombe-auto[ ``` ## # A tibble: 44 × 3 ## group X Y ## <chr> <dbl> <dbl> ## 1 1 10 8.04 ## 2 1 8 6.95 ## 3 1 13 7.58 ## 4 1 9 8.81 ## 5 1 11 8.33 ## 6 1 14 9.96 ## 7 1 6 7.24 ## 8 1 4 4.26 ## 9 1 12 10.8 ## 10 1 7 4.82 ## # … with 34 more rows ``` ] <style> 
.panel1-pivot_anscombe-auto { color: black; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-pivot_anscombe-auto { color: black; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-pivot_anscombe-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> ??? Interpreting `pivot_longer` --- ### Performing regression per group For each value of `group` we perform a linear regression of `Y` versus `X` ```r list_lm <- purrr::map(anscombe_long$group , .f = function(g) lm(Y ~ X, anscombe, subset = anscombe$group==g)) ```
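An equivalent, arguably tidier formulation (a minimal sketch, assuming the long-format `anscombe_long` tibble from the flipbook above) iterates over the four distinct groups only:

```r
list_lm2 <- anscombe_long %>%
  split(.$group) %>%                   # a named list with one tibble per group
  purrr::map(~ lm(Y ~ X, data = .x))   # one linear fit per group
purrr::map(list_lm2, coef)             # compare intercepts and slopes across groups
```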
Don't Repeat Yourself (DRY) We use _functional programming_: `purrr::map(.l, .f)` where - `.l` is a list - `.f` is a function to be applied to each item of list `.l` or a `formula` to be evaluated on each list item [`purrr` package](https://purrr.tidyverse.org/reference/map.html) --- ### Inspecting summaries All four regressions lead to the same intercept and the same slope All four regressions have the same Sum of Squared Residuals All four regressions have the same Adjusted R-square We are tempted to conclude that > all four linear regressions are equally relevant Plotting points and lines helps dispel this illusion --- ### Unveiling points .panelset[ .panel[.panel-name[Code] ```r p <- anscombe %>% ggplot(aes(x=X, y=Y)) + * geom_smooth(method="lm", se=FALSE) + * facet_wrap(~ group) + ggtitle("Anscombe quartet: linear regression Y ~ X") ``` ] .panel[.panel-name[Regression lines] .pull-left[ ``` ## `geom_smooth()` using formula 'y ~ x' ``` <img src="cm-3-EDA_files/figure-html/unnamed-chunk-33-1.png" width="504" /> ] .pull-right[ Least squares minimization leads to the same optimum on the four datasets, that is to the same intercept and slope The distribution of residuals differ substantially from one dataset to the next ] ] .panel[.panel-name[Lines and points] .pull-left[ ``` ## `geom_smooth()` using formula 'y ~ x' ``` <img src="cm-3-EDA_files/figure-html/unnamed-chunk-34-1.png" width="504" /> ] .pull-right[ Among the four datasets, only the two left ones are righteously handled using simple linear regression The bottom left dataset outlines the impact of *outliers* on Least Squares Minimization ] ] ] ??? What happens if we fit a line by minimizing Least Absolute Deviation? (adopt an `\(\ell_1\)` criterion rather than an `\(\ell_2\)`) --- class: center, middle, inverse ## Association between qualitative variables --- We showcase the approach by assessing the association between Passenger class (`Pclass`) and Survival (`Survived`) in the Titanic data Modeling assumptions (often implicit) - Each class defines a population. The fates of individuals within each population are assumed to be independent and identically distributed - The fates of individuals from different population are assumed to be independent -- Do you take these modeling assumptions for granted? --- ### Pearson's `\(\chi^2\)` association statistic Each population is associated with a Bernoulli distribution. The question is: > are the three Bernoulli distributions identical? The quantity we compute indexes the departure of the joint empirical distrition from the product of its marginal distributions `$$\sum_{i\in \mathcal{X}, j \in \mathcal{Y}} \frac{\Big(n_{i, j} - \frac{n_{i, \cdot}n_{\cdot, j}}{n} \Big)^2}{\frac{n_{i, \cdot}n_{\cdot, j}}{n}}$$` If this statistic is large with respect to `\((|\mathcal{X}|-1)(|\mathcal{Y}-1|)\)`. If it is much larger, this indicates a strong departure of the joint empirical distribution from a product distribution --- ### The `\(\chi^2\)` homogeneity statistics ```r tit %>% dplyr::select(Pclass, Survived) %>% table() %>% chisq.test() %>% broom::tidy() ``` ``` ## # A tibble: 1 × 4 ## statistic p.value parameter method ## <dbl> <dbl> <int> <chr> ## 1 103. 4.55e-23 2 Pearson's Chi-squared test ``` --- ```r tit %>% dplyr::group_by(Pclass, Survived) %>% dplyr::summarize(n =n()) %>% knitr::kable(format="markdown") ``` ``` ## `summarise()` has grouped output by 'Pclass'. You can override using the `.groups` argument. 
``` |Pclass |Survived | n| |:------|:--------|---:| |1 |Survived | 136| |1 |Deceased | 80| |1 |NA | 107| |2 |Survived | 87| |2 |Deceased | 97| |2 |NA | 93| |3 |Survived | 119| |3 |Deceased | 372| |3 |NA | 218| --- Function `ctable` from `summarytools` can output the Pearson `\(\chi^2\)` statistic: ```r summarytools::ctable(x = tit$Pclass, y=tit$Survived, chisq = TRUE, style="rmarkdown", headings = FALSE) ``` | | | | | | | |-------:|---------:|------------:|------------:|------------:|--------------:| | | Survived | Survived | Deceased | \<NA\> | Total | | Pclass | | | | | | | 1 | | 136 (42.1%) | 80 (24.8%) | 107 (33.1%) | 323 (100.0%) | | 2 | | 87 (31.4%) | 97 (35.0%) | 93 (33.6%) | 277 (100.0%) | | 3 | | 119 (16.8%) | 372 (52.5%) | 218 (30.7%) | 709 (100.0%) | | Total | | 342 (26.1%) | 549 (41.9%) | 418 (31.9%) | 1309 (100.0%) | ---------------------------- Chi.squared df p.value ------------- ---- --------- 102.889 2 0 ---------------------------- --- ### `\(p\)`-value If the modeling assumptions are correct, if you are accepting to reject the independence hypothesis with probability `\(\alpha\)` while independence holds, then you should reject the independence hypothesis when the so-called `\(p\)`-value is smaller than `\(\alpha\)`. --- class: middle, center, inverse background-image: url('./img/pexels-cottonbro-3171837.jpg') background-size: cover # The End