Univariate statistics

---
name: layout-general
layout: true
class: left, middle

.remark-slide-number .progress-bar-container {
  position: absolute;
  bottom: 0;
  height: 4px;
  display: block;
  left: 0;
  right: 0;
}

.remark-slide-number .progress-bar {
  height: 100%;
  background-color: red;
}
</style>

<div>
<style type="text/css">.xaringan-extra-logo {
width: 110px;
height: 128px;
z-index: 0;
background-image: url(./img/Universite_Paris_logo_horizontal.jpg);
background-size: contain;
background-repeat: no-repeat;
position: absolute;
top:1em;right:1em;
}
</style>
<script>(function () {
  let tries = 0
  function addLogo () {
    if (typeof slideshow === 'undefined') {
      tries += 1
      if (tries < 10) {
        setTimeout(addLogo, 100)
      }
    } else {
      document.querySelectorAll('.remark-slide-content:not(.hide_logo)')
        .forEach(function (slide) {
          const logo = document.createElement('a')
          logo.classList = 'xaringan-extra-logo'
          logo.href = 'http://master.math.univ-paris-diderot.fr/annee/m1-mi/'
          slide.appendChild(logo)
        })
    }
  }
  document.addEventListener('DOMContentLoaded', addLogo)
})()</script>
</div>

---

# Exploratory Data Analysis :

### 2022-02-02

#### [Master I MIDS & EDA]()

#### [Analyse Exploratoire des Données](http://stephane-v-boucheron.fr/courses/isidata/)

#### [Stéphane Boucheron](http://stephane-v-boucheron.fr)

---
exclude: true
template: inter-slide

# Univariate Statistics

### 2022-02-02

#### [Master I MIDS & EDA]()

#### [Analyse Exploratoire des Données](http://stephane-v-boucheron.fr/courses/isidata/)

#### [Stéphane Boucheron](http://stephane-v-boucheron.fr)

---
template: inter-slide

## <svg aria-hidden="true" role="img" viewBox="0 0 576 512" style="height:1em;width:1.12em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:white;overflow:visible;position:relative;"><path d="M0 117.66v346.32c0 11.32 11.43 19.06 21.94 14.86L160 416V32L20.12 87.95A32.006 32.006 0 0 0 0 117.66zM192 416l192 64V96L192 32v384zM554.06 33.16L416 96v384l139.88-55.95A31.996 31.996 0 0 0 576 394.34V48.02c0-11.32-11.43-19.06-21.94-14.86z"/></svg>

### [Samples, observations an variables ](#samples)

### [Summary statistics](#numerical-summaries)

### [Robust summary statistics](#robust)

### [Boxplots](#boxplots)

### [Histograms](#histograms)

### [Cumulative distribution function](#ecdf)

### [Quantile function](#quantile-function)

---
template: inter-slide
name: samples

## Samples, observations an variables

---

### <svg aria-hidden="true" role="img" viewBox="0 0 640 512" style="height:1em;width:1.25em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M624 448H16c-8.84 0-16 7.16-16 16v32c0 8.84 7.16 16 16 16h608c8.84 0 16-7.16 16-16v-32c0-8.84-7.16-16-16-16zM80.55 341.27c6.28 6.84 15.1 10.72 24.33 10.71l130.54-.18a65.62 65.62 0 0 0 29.64-7.12l290.96-147.65c26.74-13.57 50.71-32.94 67.02-58.31 18.31-28.48 20.3-49.09 13.07-63.65-7.21-14.57-24.74-25.27-58.25-27.45-29.85-1.94-59.54 5.92-86.28 19.48l-98.51 49.99-218.7-82.06a17.799 17.799 0 0 0-18-1.11L90.62 67.29c-10.67 5.41-13.25 19.65-5.17 28.53l156.22 98.1-103.21 52.38-72.35-36.47a17.804 17.804 0 0 0-16.07.02L9.91 230.22c-10.44 5.3-13.19 19.12-5.57 28.08l76.21 82.97z"/></svg> `flights`dataset from `nycflights` package

.fl.w-50.pa2[

```r
pacman::p_load("nycflights13")
```

> On-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013.

`flights` is a multivariate _sample_

Each `row` corresponds to an _observation_

Each column corresponds to a _measurement_

]

.fl.w-50.pa2.f6[

```
Columns: 19
$ year           <int> 2013,…
$ month          <int> 1, 1,…
$ day            <int> 1, 1,…
$ dep_time       <int> 517, …
$ sched_dep_time <int> 515, …
$ dep_delay      <dbl> 2, 4,…
$ arr_time       <int> 830, …
$ sched_arr_time <int> 819, …
$ arr_delay      <dbl> 11, 2…
$ carrier        <chr> "UA",…
$ flight         <int> 1545,…
$ tailnum        <chr> "N142…
$ origin         <chr> "EWR"…
$ dest           <chr> "IAH"…
$ air_time       <dbl> 227, …
$ distance       <dbl> 1400,…
$ hour           <dbl> 5, 5,…
$ minute         <dbl> 15, 2…
$ time_hour      <dttm> 2013…
```

]

---
### Definition(s)

A _sample_ is a sequence of items of the same type

In <svg aria-hidden="true" role="img" viewBox="0 0 581 512" style="height:1em;width:1.13em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"/></svg>, if the type matches a basetype, a sample is typically represented using a `vector`

A _numerical_ (quantitative) sample is just a sequence of numbers (either integers
or floats), it is represented by a `numeric` vector.

A _categorical_ (qualitative)  sample is a sequence of values  from  some predefined finite set

In <svg aria-hidden="true" role="img" viewBox="0 0 581 512" style="height:1em;width:1.13em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"/></svg>, categorical samples are handled using a special class: `factor`

---

In this lesson, we are concerned with _univariate statistics_, that
is either with

- _numerical samples_,

- _categorical samples_ taken from some finite (or seldom countable) set

---

### `flights` dataset

- Columns `carrier`, `dest`, `origin` are considered categorical

```r
levels(factor(flights$origin))
```

```
## [1] "EWR" "JFK" "LGA"
```

- Columns `arr_delay`, `dep_delay`, distance are numerical

- Column `time_hour`, ... contains timestamps

---

We will learn about:

- numerical summaries

- graphical displays

for univariate samples

---
template: inter-slide
name: numerical-summaries

## Summary statistics

---

### Numerical summaries for quantitative variables

- __Mean__:

`$$\overline{X}_n = \frac{1}{n}\sum_{i=1}^n x_i \qquad  \textbf{(location)}$$`

- **Standard deviation**:

`$$s^2 = \frac{1}{n-1} \sum_{i=1}^n \big(x_i - \overline{X}_n\big)^2 \quad \text{or}\quad s^2 = \frac{1}{n} \sum_{i=1}^n \big(x_i - \overline{X}_n\big)^2\qquad\textbf{(dispersion)}$$`

---

### Beyond location et dispersion

- **Skewness**

`$$\frac{\frac{1}{n} \sum_{i=1}^n \big(x_i - \overline{X}_n\big)^3}{s^3}$$`

- **Kurtosis**

`$$\kappa = \frac{\frac{1}{n} \sum_{i=1}^n \big(x_i -\overline{X}_n\big)^4}{s^4} - 2$$`

---

### Numerical summaries at work

```r
s <- rnorm(1000)   # a random sample from the standard normal distribution

*mean(s) - sum(s)/length(s)
```

```
## [1] 0
```

```r
*sd(s) - sqrt(sum((s-mean(s))^2)/(length(s)-1))
```

```
## [1] 0
```

```r
*mean((s- mean(s))^4)/sd(s)^4 -3
```

```
## [1] 0.06372208
```

---

### Summarizing `arr_delay`

.fl.w-40.pa2.f6[

### One at a time

```r
mean(flights$arr_delay)
```

```
## [1] NA
```

```r
*mean(flights$arr_delay, na.rm=T)
```

```
## [1] 6.895377
```

```r
sd(flights$arr_delay)
```

```
## [1] NA
```

```r
*sd(flights$arr_delay, na.rm=T)
```

```
## [1] 44.63329
```
]

.fl.w-60.pa2.f6[

### All inclusive

```r
*summary(flights$arr_delay) %>%
  broom::tidy() %>% 
  knitr::kable(digits = 2,
  caption = "Summary statistics for nycflights13 arrival delay")
```

<table>
<caption>Summary statistics for nycflights13 arrival delay</caption>
 <thead>
  <tr>
   <th style="text-align:right;"> minimum </th>
   <th style="text-align:right;"> q1 </th>
   <th style="text-align:right;"> median </th>
   <th style="text-align:right;"> mean </th>
   <th style="text-align:right;"> q3 </th>
   <th style="text-align:right;"> maximum </th>
   <th style="text-align:right;"> na </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> -86 </td>
   <td style="text-align:right;"> -17 </td>
   <td style="text-align:right;"> -5 </td>
   <td style="text-align:right;"> 6.9 </td>
   <td style="text-align:right;"> 14 </td>
   <td style="text-align:right;"> 1272 </td>
   <td style="text-align:right;"> 9430 </td>
  </tr>
</tbody>
</table>

```r
sum_funs <- list(mean, 
  sd, median, IQR)

*purrr::map_dbl(sum_funs,
* ~ .x( flights$arr_delay, na.rm=T))
```

```
## [1]  6.895377 44.633292 -5.000000 31.000000
```

]
---

### With `dplyr::summarise`

```r
flights %>% 
  select(- year, -month, -day, -hour, -minute , -time_hour, -ends_with("time"), -flight) %>% 
* summarise(across(where(is.numeric),
*                  .fns=list("mean"=mean, "median"=median , "IQR"=IQR),
*                  na.rm=T,
*                  .names="{.col}_{.fn}")) %>%
  pivot_longer(cols = everything()) %>% 
  knitr::kable(digits=2)
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> name </th>
   <th style="text-align:right;"> value </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> dep_delay_mean </td>
   <td style="text-align:right;"> 12.64 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> dep_delay_median </td>
   <td style="text-align:right;"> -2.00 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> dep_delay_IQR </td>
   <td style="text-align:right;"> 16.00 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> arr_delay_mean </td>
   <td style="text-align:right;"> 6.90 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> arr_delay_median </td>
   <td style="text-align:right;"> -5.00 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> arr_delay_IQR </td>
   <td style="text-align:right;"> 31.00 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> distance_mean </td>
   <td style="text-align:right;"> 1039.91 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> distance_median </td>
   <td style="text-align:right;"> 872.00 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> distance_IQR </td>
   <td style="text-align:right;"> 887.00 </td>
  </tr>
</tbody>
</table>

---
exclude: true

### What if some data are missing?

```r
s <- rnorm(10)
s[c(2, 3, 7)] = NA_real_
s
```

```
##  [1] -0.33618433          NA          NA -0.07696039 -0.90840922 -0.68383652
##  [7]          NA -0.72035314  1.39659206 -0.50888751
```

```r
*mean(s)
```

```
## [1] NA
```

```r
*mean(s, na.rm = TRUE)
```

```
## [1] -0.262577
```

```r
mean(s[-c(2, 3, 7)])
```

```
## [1] -0.262577
```

---

### Missing values and trimming

The code for `mean.default` goes beyond computing the weighted mean

```r
function (x, trim = 0, na.rm = FALSE, ...)
{
    if (!is.numeric(x) && !is.complex(x) && !is.logical(x)) {
        warning("argument is not numeric or logical: returning NA")
        return(NA_real_)
    }
*   if (na.rm)
*       x <- x[!is.na(x)]
    if (!is.numeric(trim) || length(trim) != 1L)
        stop("'trim' must be numeric of length one")
    n <- length(x)
    if (trim > 0 && n) {
        if (is.complex(x))
            stop("trimmed means are not defined for complex data")
        if (anyNA(x))
            return(NA_real_)
        if (trim >= 0.5)
            return(stats::median(x, na.rm = FALSE))
*       lo <- floor(n * trim) + 1
*       hi <- n + 1 - lo
*       x <- sort.int(x, partial = unique(c(lo, hi)))[lo:hi]
    }
*   .Internal(mean(x))
}
```
]

---

### <svg aria-hidden="true" role="img" viewBox="0 0 576 512" style="height:1em;width:1.12em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M208 0c-29.9 0-54.7 20.5-61.8 48.2-.8 0-1.4-.2-2.2-.2-35.3 0-64 28.7-64 64 0 4.8.6 9.5 1.7 14C52.5 138 32 166.6 32 200c0 12.6 3.2 24.3 8.3 34.9C16.3 248.7 0 274.3 0 304c0 33.3 20.4 61.9 49.4 73.9-.9 4.6-1.4 9.3-1.4 14.1 0 39.8 32.2 72 72 72 4.1 0 8.1-.5 12-1.2 9.6 28.5 36.2 49.2 68 49.2 39.8 0 72-32.2 72-72V64c0-35.3-28.7-64-64-64zm368 304c0-29.7-16.3-55.3-40.3-69.1 5.2-10.6 8.3-22.3 8.3-34.9 0-33.4-20.5-62-49.7-74 1-4.5 1.7-9.2 1.7-14 0-35.3-28.7-64-64-64-.8 0-1.5.2-2.2.2C422.7 20.5 397.9 0 368 0c-35.3 0-64 28.6-64 64v376c0 39.8 32.2 72 72 72 31.8 0 58.4-20.7 68-49.2 3.9.7 7.9 1.2 12 1.2 39.8 0 72-32.2 72-72 0-4.8-.5-9.5-1.4-14.1 29-12 49.4-40.6 49.4-73.9z"/></svg>

- What happens when `na.rm` is `TRUE`?

- What are the possible values of argument `trim`?

- If `trim` belongs to `$(0, 1/2)$` what happens?

- Which lines actually compute the (uniformly) weighted mean?

???

- The first `if (...) {...}` is an example of defensive programming

- `if (na.rm) { ... }` handles missing data when the default behavior is overriden.
`NA` are filtered out

- When positive, parameter  `trim` allows us to compute _trimmed means_

Depending on optional arguments, fucntion `mean` allows to compute either arithmetic means, or trimmed (arithmetic) means

---

### <svg aria-hidden="true" role="img" viewBox="0 0 512 512" style="height:1em;width:1em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M512 199.652c0 23.625-20.65 43.826-44.8 43.826h-99.851c16.34 17.048 18.346 49.766-6.299 70.944 14.288 22.829 2.147 53.017-16.45 62.315C353.574 425.878 322.654 448 272 448c-2.746 0-13.276-.203-16-.195-61.971.168-76.894-31.065-123.731-38.315C120.596 407.683 112 397.599 112 385.786V214.261l.002-.001c.011-18.366 10.607-35.889 28.464-43.845 28.886-12.994 95.413-49.038 107.534-77.323 7.797-18.194 21.384-29.084 40-29.092 34.222-.014 57.752 35.098 44.119 66.908-3.583 8.359-8.312 16.67-14.153 24.918H467.2c23.45 0 44.8 20.543 44.8 43.826zM96 200v192c0 13.255-10.745 24-24 24H24c-13.255 0-24-10.745-24-24V200c0-13.255 10.745-24 24-24h48c13.255 0 24 10.745 24 24zM68 368c0-11.046-8.954-20-20-20s-20 8.954-20 20 8.954 20 20 20 20-8.954 20-20z"/></svg>

- In a _tidy_ sample, _missing values_ are explicitely pointed as missing information using special objects `NA`, `NA_integer_`, `NA_real_`, ...

- To compute summary statistics when facing missing values,  `na.rm` has to be `TRUE`.

- What is the class of `NA`, `NA_integer_`, `NA_real_`?

---

Why is `var(sample)` defined by

`$$\sum_{i=1}^n \frac{1}{n-1} (x_i - \overline{X}_n)^2$$`

when `$\texttt{sample}=x_1, \ldots, x_n$`?

---
template: inter-slide
name: robust

## Robust summary statistics

---
exclude: true
### `summary(...)`

```r
summary(s)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -0.9084 -0.7021 -0.5089 -0.2626 -0.2066  1.3966       3
```

---

- **Order statistics**: `$x_{1:n} \leq x_{2:n} \leq \ldots \leq x_{n:n}$`

- **Median**  `$x_{\lfloor n/2\rfloor+1:n}$` if `$n$` is odd,
`$(x_{n/2:n}+ x_{n/2+1:n})/2$` if `$n$` is even

- **Quartiles** `$x_{\lfloor n/4\rfloor:n}$` and `$x_{\lfloor 3 n/4\rfloor:n}$`

- **Quintiles** `$x_{\lfloor k n/5\rfloor:n}$` pour `$k \in 1, \ldots, 4$`

- **Deciles** `$x_{\lfloor k n/10\rfloor:n}$` pour `$k \in 1, \ldots, 9$`

- **Percentiles**

- ...

---

Compute the derivative of

- the mean

- the median

- the trimmed mean

with respect to `$x_{(n)}$`

---
exclude: true

Why _robustness_?

.bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[

'Est etincidunt non adipisci quaerat sed. Ipsum adipisci numquam neque ipsum modi neque modi. Amet velit tempora aliquam etincidunt. Quaerat neque labore velit dolorem dolore eius ipsum. Modi sed dolorem aliquam dolorem non neque. Dolor velit porro etincidunt magnam aliquam labore. Sed quisquam dolore tempora dolor ipsum sit est. Porro numquam adipisci aliquam. Amet labore etincidunt dolore porro.'

]

---
exclude: true

Let `$x_1, \ldots,  x_n$` be a numerical sample.

```r
rapportools::kurtosis
```

```
## function (x, na.rm = TRUE) 
## {
##     if (is.variable(x)) {
##         if (na.rm) 
##             x <- na.omit(x)
##         m <- base::mean(x)
##         s <- stats::sd(x)
##         n <- length(x)
##         (((base::sum((x - m)^4)/n)/s^4) - 3)
##     }
##     else {
##         if (is.matrix(x)) 
##             apply(x, 2, kurtosis, na.rm = na.rm)
##         else if (is.data.frame(x)) 
##             sapply(x, kurtosis, na.rm = na.rm)
##         else stop("unsupported type")
##     }
## }
## <bytecode: 0x7fd33d41fa00>
## <environment: namespace:rapportools>
```

Check that the kurtosis and the skewness are translation invariant.

---
template: inter-slide
name: boxplots

## Boxplots

---

.bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[

In descriptive statistics, a box plot or boxplot is a method for graphically depicting groups of numerical data through their _quantiles_ (more specifically _quartiles_)

Box plots also have lines extending from the boxes (__whiskers__) indicating variability outside the upper and lower quartiles, hence the terms _box-and-whisker plot_ and box-and-whisker diagram

_Outliers_ may be plotted as individual points

Box plots are _non-parametric_: they display variation in samples of a statistical population without making any assumptions of the underlying statistical distribution (though Tukey's boxplot assumes symmetry for the whiskers and normality for their length)

The spacings between the different parts of the box indicate the degree of _dispersion_ (spread) and _skewness_ in the data, and show _outliers_

]

---

> Tukey's boxplot assumes symmetry for the whiskers and normality for their length

What does this mean?

Assume that we _boxplot_ the `$\mathcal{N}(\mu, \sigma^2) \sim \mu + \sigma \mathcal{N}(0, 1)$` distribution instead of an empirical distribution.

Then:

- The median is equal to `$\mu$` because of the symmetry of `$\mathcal{N}(0,1)$`,

- The quartiles are `$\mu - \sigma \Phi^{\leftarrow}(3/4)$` and `$\mu + \sigma \Phi^{\leftarrow}(3/4)$`.

- The IQR is `$2 \sigma  \Phi^{\leftarrow}(3/4) \approx 1.35\times \sigma$`

- The default length of the whiskers (under `geom_boxplot`) is
`$1.5 \times \text{IQR}$`

- For a Gaussian population, `$1.5 \text{IQR} = 3 \sigma \Phi^{\leftarrow}(3/4)$`

- In a Gaussian population, the probability that a sample point lies outside the whiskers is `$2 \times (1 - \Phi(4 \times \Phi^{\leftarrow}(3/4))) \approx 7/1000$`

---
class: top, left

### Boxplot of a Gaussian sample

```r
n <- 100
fake_data <- tibble(normal=rnorm(n),t=rt(n, df=3))
title <- stringr::str_c("Standard Gaussian sample, size : ", n)

fake_data %>%
  ggplot(mapping = aes(x=1L, y=normal)) +
* geom_boxplot(outlier.colour = "red", outlier.shape = 2, coef=1.5) +
  ggtitle(title) +
  ylab("") +
  ggplot2::theme(
  axis.text.x = element_blank(),
  axis.ticks.x = element_blank()
  )
```

```r
summary(fake_data$normal)
```

```
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.10528 -0.67292 -0.20090 -0.07495  0.43615  2.29645
```

---

### Boxplot of a Gaussian sample

.fl.w-60.pa2[

![gauss-boxplot-label](cm-2-EDA_files/figure-html/normal-boxplot-label-1.png)

]

???

Playing with boxplots

- Tune whiskers lengths using `coef`

- Jitter to spot the whole

---

If we _boxplot_ a Gaussian sample, on average, `$4\%$` of the points lie outside the whiskers. Such points are called _extremes_

Deciding which points should be considered as extremes is partly a matter of taste

Boxplots provide a visual assessment of the departure of a sample from normality/Gaussianity

[`geom_boxplot` documentation](https://ggplot2.tidyverse.org/reference/geom_boxplot.html)

???

Point out the distinction between outliers and extremes

---

### Boxplot of a non-Gaussian sample

The sampling distribution ( `$t$`-distribution with `$3$` degrees of freedom) is _heavy-tailed_

Some sample points lie far away from the whiskers

```r
title <- stringr::str_c("Student sample with 3 df, size : ", n)
ggplot(fake_data,
       aes(y=t, x=1L)) +
* geom_boxplot(outlier.shape=2, outlier.colour = 'red', outlier.alpha=.5, coef=1.5) +
  ylab("") +
  ggtitle(title) +
  ggplot2::theme(
  axis.text.x = element_blank(),
  axis.ticks.x = element_blank()
  )
```

```r
summary(fake_data$t)
```

```
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -5.629301 -0.736732 -0.005991  0.038924  0.841745  5.314943
```

![non-gauss-boxplot-label](cm-2-EDA_files/figure-html/non-gauss-boxplot-label-1.png)

]

---

### Boxplot of a non-Gaussian sample

.fl.w-60.pa2[

![non-gauss-boxplot-label](cm-2-EDA_files/figure-html/non-gauss-boxplot-label-1.png)

]

---

### Plotting two boxplots side by side

.fl.w-50.pa2[

```r
n <- 1000

tibble(normal=rnorm(n), 
       t=rt(n, df=3)) %>%
* pivot_longer(cols=c(normal, t),
*              names_to='Distribution') %>%
* ggplot() +
* aes(x=Distribution, y=value) +
* geom_boxplot(outlier.alpha = .5,
               outlier.colour = 'red',
               outlier.shape = 2) +
  ggtitle("Gaussian versus heavy-tailed samples")
```
]

.fl.w-50.pa2[

]

---

### Boxplots for normal and heavy-tailed distributions

With usual whiskers length

- In Gaussian samples: 
  + very few sample points outside the whiskers (few extremes/outliers), 
  + extremes lie close to the whiskers endpoints

- In heavy-tailed distributions: 
  + many sample points outside the whiskers, and 
  + some extremes may lie far far away

---
exclude: true

### Boxplotting Titanic Fare prices

.fl.w-60.pa2[

```
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
```

<img src="cm-2-EDA_files/figure-html/titanic-fares-boxplot-label-1.png" width="504" />
]

.fl.w-40.pa2[

Titanic fare prices distribution exhibits a non-Gaussian behavior
<svg aria-hidden="true" role="img" viewBox="0 0 496 512" style="height:1em;width:0.97em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M0 256c0 137 111 248 248 248s248-111 248-248S385 8 248 8 0 119 0 256zm200-48c0 17.7-14.3 32-32 32s-32-14.3-32-32 14.3-32 32-32 32 14.3 32 32zm158.5 16.5c-14.8-13.2-46.2-13.2-61 0L288 233c-8.3 7.4-21.6.4-19.8-10.8 4-25.2 34.2-42.1 59.9-42.1S384 197 388 222.2c1.7 11.1-11.4 18.3-19.8 10.8l-9.7-8.5zM157.8 325.8C180.2 352.7 213 368 248 368s67.8-15.4 90.2-42.2c13.6-16.2 38.1 4.2 24.6 20.5C334.3 380.4 292.5 400 248 400s-86.3-19.6-114.8-53.8c-13.5-16.3 11.2-36.7 24.6-20.4z"/></svg>

The median looks much closer to the first quartile than to the third quartile (asymmetry)

There are many extreme points

Extreme lie far away above the whiskers

]

---

### Boxplots for delays

.fl.w-40.pa2.f6[

```r
flights %>% 
  select(
    ends_with("_delay")
    ) %>% 
  pivot_longer(
    cols=ends_with("_delay"),
    names_pattern = "(.*)_delay", 
    values_to = "delay"
    ) %>% 
  filter(
    abs(delay) < 120
    ) %>% 
  ggplot() +
  aes(x=name, y=delay) +
  geom_boxplot()
```

]

.fl.w-60.pa2[

]

---
exclude: true
template: inter-slide
name: violinplots

## Violinplots

---
exclude: true

.fl.w-60.pa2[

```
## Warning: Removed 1 rows containing non-finite values (stat_ydensity).
```

<img src="cm-2-EDA_files/figure-html/titanic-fares-violinplot-label-1.png" width="504" />
]

.fl.w-40.pa2[

They are a mix of three populations that also exhibit non-Gaussian behavior
]

---
exclude: true

.bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[

]

---
template: inter-slide
name: histograms

## Histograms

---

### Definition(s)

.fl.w-40.pa2[

A histogram is defined by the number and position of the breaks (bins)

A histogram may be regular or not

]

.fl.w-60.pa2[

]

---

```r
*tibble(x=rnorm(1000, mean = 2))
```
]
 
.panel2-histogram-gauss-auto[

```
## # A tibble: 1,000 × 1
##        x
##    <dbl>
##  1  1.63
##  2  3.34
##  3  2.17
##  4  1.69
##  5  2.23
##  6  1.73
##  7  2.74
##  8  2.95
##  9  2.70
## 10  1.97
## # … with 990 more rows
```
]

---
count: false

```r
tibble(x=rnorm(1000, mean = 2)) %>%
* ggplot()
```
]
 
.panel2-histogram-gauss-auto[
<img src="cm-2-EDA_files/figure-html/histogram-gauss_auto_02_output-1.png" width="504" />
]

---
count: false

```r
tibble(x=rnorm(1000, mean = 2)) %>%
  ggplot() +
* aes(x=x)
```
]
 
.panel2-histogram-gauss-auto[
<img src="cm-2-EDA_files/figure-html/histogram-gauss_auto_03_output-1.png" width="504" />
]

---
count: false

```r
tibble(x=rnorm(1000, mean = 2)) %>%
  ggplot() +
  aes(x=x) +
* geom_rug(alpha=.1)
```
]
 
.panel2-histogram-gauss-auto[
<img src="cm-2-EDA_files/figure-html/histogram-gauss_auto_04_output-1.png" width="504" />
]

---
count: false

```r
tibble(x=rnorm(1000, mean = 2)) %>%
  ggplot() +
  aes(x=x) +
  geom_rug(alpha=.1) +
* geom_histogram(aes(y=..density..),
*                alpha=.5,
*                fill="white",
*                colour="black",
*                bins=30)
```
]
 
.panel2-histogram-gauss-auto[
<img src="cm-2-EDA_files/figure-html/histogram-gauss_auto_05_output-1.png" width="504" />
]

---
count: false

```r
tibble(x=rnorm(1000, mean = 2)) %>%
  ggplot() +
  aes(x=x) +
  geom_rug(alpha=.1) +
  geom_histogram(aes(y=..density..),
                 alpha=.5,
                 fill="white",
                 colour="black",
                 bins=30) +
* stat_function(linetype="dashed",
*               fun=dnorm,
*               args=list(mean=2, sd=1))
```
]
 
.panel2-histogram-gauss-auto[
<img src="cm-2-EDA_files/figure-html/histogram-gauss_auto_06_output-1.png" width="504" />
]

---
count: false

```r
tibble(x=rnorm(1000, mean = 2)) %>%
  ggplot() +
  aes(x=x) +
  geom_rug(alpha=.1) +
  geom_histogram(aes(y=..density..),
                 alpha=.5,
                 fill="white",
                 colour="black",
                 bins=30) +
  stat_function(linetype="dashed",
                fun=dnorm,
                args=list(mean=2, sd=1)) +
* annotate("text", x=4.5, y=.3,
*          label="Gaussian density with mean 2")
```
]
 
.panel2-histogram-gauss-auto[
<img src="cm-2-EDA_files/figure-html/histogram-gauss_auto_07_output-1.png" width="504" />
]

---
count: false

```r
tibble(x=rnorm(1000, mean = 2)) %>%
  ggplot() +
  aes(x=x) +
  geom_rug(alpha=.1) +
  geom_histogram(aes(y=..density..),
                 alpha=.5,
                 fill="white",
                 colour="black",
                 bins=30) +
  stat_function(linetype="dashed",
                fun=dnorm,
                args=list(mean=2, sd=1)) +
  annotate("text", x=4.5, y=.3,
           label="Gaussian density with mean 2") +
* ggtitle("Histogram for Gaussian sample")
```
]
 
.panel2-histogram-gauss-auto[
<img src="cm-2-EDA_files/figure-html/histogram-gauss_auto_08_output-1.png" width="504" />
]

---
count: false

```r
tibble(x=rnorm(1000, mean = 2)) %>%
  ggplot() +
  aes(x=x) +
  geom_rug(alpha=.1) +
  geom_histogram(aes(y=..density..),
                 alpha=.5,
                 fill="white",
                 colour="black",
                 bins=30) +
  stat_function(linetype="dashed",
                fun=dnorm,
                args=list(mean=2, sd=1)) +
  annotate("text", x=4.5, y=.3,
           label="Gaussian density with mean 2") +
  ggtitle("Histogram for Gaussian sample") +
* labs(caption="Sample size=1000, 30 bins")
```
]
 
.panel2-histogram-gauss-auto[
<img src="cm-2-EDA_files/figure-html/histogram-gauss_auto_09_output-1.png" width="504" />
]

---

### Construction

- Binning the sample values
  + Picking the right (?) number of bins
  + Placing the breaks
  
-

---

### Histograms for (trimmed) delay distributions

.fl.w-40.pa2.f6[

.fl.w-60.pa2.f6[

]

---
exclude: true

.bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[

]

---
template: inter-slide
name: ecdf

## Cumulative distribution function

---

.bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[

`$P$` be a probability distribution over `$\mathbb{R}$` (endowed with the
Borelian `$\sigma$`-algebra)

The cumulative distribution function (CDF) `$F$` is a function from `$\mathbb{R}$` to `$[0,1]$`

`$$F(x) = P(-\infty, x)$$`

]

---

A real-valued sample defines a probability distribution, called the _empirical distribution_.
Each sample point is assigned a probability equal to its relative frequency.

The cumulative distribution function associated with this empirical distribution
is called the _empirical cumulative distribution function_ (ECDF).

---

Let `$x_1, \ldots, x_n$` denote the points from a real sample (they may not be pairwise
distincts), let `$P_n$` be the empirical distribution and `$F_n$` be the ECDF.

`$$\begin{array}{rl}  P_n(A) & = \frac{1}{n} \sum_{i=1}^n \mathbb{I}_{x_i \in A}  \\ F_n(z) & = \frac{1}{n} \sum_{i=1}^n \mathbb{I}_{x_i \leq z} \end{array}$$`

---

Let `$x_{1:n} \leq x_{2:n} \leq \ldots \leq x_{n:n}$` denote the _order statistics_
of the sample (that is the non-decreasing rearrangement of sample, with possible
repetitions).

The ECDF is easily computed from the order statistics:

`$$F_n(z) = \begin{cases} 0           & \text{if } z < x_{1:n}               \\ \frac{k}{n} & \text{if } x_{k:n} \leq z < x_{k+1:n}\\ 1           & \text{if } x_{n:n} \leq z \end{cases}$$`

---

Recall that ECDFs are determined  by order statistics (and conversely)

.bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[

### Proposition

The empirical cumulative distribution function `$F_n$` is

- right-continuous over `$\mathbb{R}$`

- non-decreasing

- takes values in `$\{ k/n , k \in \{0, 1, \ldots, n\}\}$`

- is constant between consecutive order statistics

]

---

### Example

Any numerical dataset may be used to illustrate ECDFs.

For example,

`faithful`: a data frame with `272` observations on `2` variables concerning waiting time between eruptions and the duration of the eruption for the _Old Faithful_ geyser in Yellowstone National Park, Wyoming, USA.

- `eruptions`:	`numeric`,	Eruption time in minutess

- `wating`:	`numeric`	Waiting time to next eruption (in minutes)

> A closer look at `faithful$eruptions` reveals that these are heavily rounded times originally in seconds, where multiples of 5 are more frequent than expected under non-human measurement.

---

###  A glimpse at the faithfull data

```r
data(faithful)
faithful %>%
  glimpse()
```

```
## Rows: 272
## Columns: 2
## $ eruptions <dbl> 3.600, 1.800, 3.333, 2.283, 4.533, 2.883, 4.700, 3.600, 1.95…
## $ waiting   <dbl> 79, 54, 74, 62, 85, 55, 88, 85, 51, 85, 54, 84, 78, 47, 83, …
```

---

### Plotting an empirical CDF

.fl.w-third.pa2[

Eruptions durations

Under the hood, `ggplot` combines `geom_step`  and `ecdf` statistic

```r
faithful %>%
  ggplot() +
  aes(x=eruptions) +
* geom_step(stat="ecdf") +
  ylab("ECDF") +
  xlab("Eruptions duration in minutes") +
  ggtitle("Old Faithful Geyser Data")
```

]

.fl.w-two-thirds.pa2[

![](cm-2-EDA_files/figure-html/geyser-label-1.png)

]

---
exclude: false

### Computing the ECDF statistic

.fl.w-third.pa2[

ECDF plots may be used to compare two samples or to compare an empirical
cumulative distribution function with the cumulative distribution function
of some well-known probability distribution.

```r
*Fn <- ecdf(faithful$eruptions)

x_grid <- seq(min(faithful$eruptions)-1,
              max(faithful$eruptions)+1,
              0.01)

df <- data_frame(x=x_grid,
                 Fn=Fn(x_grid))
```

```
## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
```

```r
df %>%
  ggplot() +
  aes(x=x, y=Fn) +
  geom_step(direction="hv")
```

]

.fl.w-two-thirds.pa2[

![](cm-2-EDA_files/figure-html/geyser-label-2-1.png)

]

---
template: inter-slide
name: quantile-function

## Quantile function

---
exclude: false

.bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[

The _quantile function_ of a probability distribution over `$\mathbb{R}$` is
the generalized, left-continuous inverse function of the cumulative
distribution function.

For `$p \in (0,1)$`

`$$F_n^{\leftarrow}(p) = \inf \Big\{ z : F_n(z) \geq p\Big\}$$`

`$$F_n^{\leftarrow}(p) =  X_{k+1:n} \qquad \text{if } \qquad\frac{k}{n} < p \leq \frac{k+1}{n}$$`

]

---

.fl.w-third.pa2[

In order to plot the quantile function, it is almost enough to exchange the
role of the axes in the ECDF plot.

```r
faithful[1:10,] %>%
  ggplot() +
  aes(x=eruptions) +
  geom_step(stat="ecdf", 
            direction='vh') +
* coord_flip() +
  xlab("Quantiles") +
  ylab("Probabilities") +
  ggtitle("Old Faithful Geyser Data")
```

Why is it safe to set the `direction` argument in the call to `geom_step()`

]

.fl.w-two-thirds.pa2[

![](cm-2-EDA_files/figure-html/eruptions-label-1.png)

]

---
exclude: false

### Comparing sample quantiles with distributional quantiles

When it comes to comparing the empirical distributions defined by two samples,
_quantile-quantile plots_ ( `qqplot` ) are favored.

- `$F_n$` and `$G_m$` : two ECDFs

- the quantile-quantile function is defined by

`$$G_m^\leftarrow \circ F_n(z) = Y_{k+1:m}\qquad \text{if} \qquad \frac{k}{m}< F_n(z) \leq \frac{k+1}{m}$$`

---

If `$G_m = F_n$` (if the two samples are equal up to a permutation), then

`$$F_n^\leftarrow \circ F_n(z) = X_{k+1:n} \text{if} \qquad X_{k:n} < z \leq X_{k+1:n}$$`

The quantile-quantile plot lies above line `$y=x$` and meets this line at every
unique order statistics.

We can for example compare the first and second half of the `faithfull`
dataset.

---

### Defining an empirical quantile function

Function `equantile()` returns a function

The returned function is _parametrized_ by argument
`sample`

- A function captures (encloses) the _environment_ in which it is created

- Function `equantile` creates a new _execution environment_ every time it is run. This environment is usually ephemeral, but here it becomes the _enclosing environment_ of the manufactured function

- The _enclosing environment_ of the manufactured function is an _execution environment_ of the function factory

]

#### A quantile function factory

```r
equantile <- function(sample){
  sample <- sort(sample)
  m <- length(sample)

* function(p){
*   ks <- pmax(1, ceiling(m * p))
*   sample[ks]
  }
}
```

See [Advanced R programming](https://adv-r.hadley.nz/function-factories.html)
]

]

---

### QQ plot

.fl.w-40.pa2.f6[

```r
not_cancelled <- flights %>% 
  filter(!is.na(arr_delay)) %>% 
  sample_n(100) %>% 
  arrange(dep_delay) %>% 
  filter(abs(arr_delay) < 120)

nr <- nrow(not_cancelled)

not_cancelled %>% 
  ggplot() +
  aes(x=dep_delay, 
      y=quantile(arr_delay, 
                 seq(1, nr)/nr)
      ) +
  geom_point(size=.1) +
  coord_fixed() +
  geom_abline(slope=1, 
              intercept = 0, 
              alpha=.5) +
  ggtitle("QQ plot clipped arr_delay versus dep_delay") +
  ylab("arr_delay") +
  xlab("dep-delay")
```
]

.fl.w-60.pa2[

]

---
exclude: true

We can repeat this experiment after sample shuffling.

.fl.w-third.pa2[

```r
unfaithful <- faithful[sample(seq(1,nrow(faithful)),
                               size=nrow(faithful),
                               replace = FALSE),
                        ]

first <- unfaithful[1:135,]
second <- unfaithful[-(1:135),]

Fn <- ecdf(first$eruptions)

Qm <- equantile(second$eruptions)

df <- data.frame(x= seq(min(first$eruptions)-1,
                        max(first$eruptions)+1, 0.01)) %>%
  dplyr::mutate(y = Qm(Fn(x)))

df %>%
  ggplot(mapping = aes(x=x, y=y)) +
  geom_step(direction='vh') +
  geom_abline(slope=1,
    intercept = 0,
    linetype='dashed')
```
]

.fl.w-two-thirds.pa2[

![](cm-2-EDA_files/figure-html/faithful-shuffle-label-1.png)

]

---
exclude: true

```r
first <- faithful[1:135,]
second <- faithful[-(1:135),]

Fn <- ecdf(first$eruptions)
Qm <- equantile(second$eruptions)

df <- data.frame(x= seq(min(first$eruptions)-1, max(first$eruptions)+1, 0.01)) %>%
  dplyr::mutate(y = Qm(Fn(x)))

df %>%
  ggplot() +
  aes(x=x, y=y) +
  geom_step(direction='vh') +
  geom_abline(slope=1,
              intercept = 0,
              linetype='dashed')
```

---

background-image: url('./img/pexels-cottonbro-3171837.jpg')
background-size: cover

# The End