EDA I.3: Introduction to R and Data Analysis

---
name: layout-general
layout: true
class: left, middle

.remark-slide-number .progress-bar-container {
  position: absolute;
  bottom: 0;
  height: 4px;
  display: block;
  left: 0;
  right: 0;
}

.remark-slide-number .progress-bar {
  height: 100%;
  background-color: red;
}

/* custom.css */
.plot-callout {
  width: 300px;
  bottom: 5%;
  right: 5%;
  position: absolute;
  padding: 0px;
  z-index: 100;
}
.plot-callout img {
  width: 100%;
  border: 1px solid #23373B;
}
</style>

<div>
<style type="text/css">.xaringan-extra-logo {
width: 110px;
height: 128px;
z-index: 0;
background-image: url(./img/UniversiteParisCite_logo_horizontal_couleur_RVB.jpeg);
background-size: contain;
background-repeat: no-repeat;
position: absolute;
top:1em;right:1em;
}
</style>
<script>(function () {
  let tries = 0
  function addLogo () {
    if (typeof slideshow === 'undefined') {
      tries += 1
      if (tries < 10) {
        setTimeout(addLogo, 100)
      }
    } else {
      document.querySelectorAll('.remark-slide-content:not(.hide_logo)')
        .forEach(function (slide) {
          const logo = document.createElement('a')
          logo.classList = 'xaringan-extra-logo'
          logo.href = 'http://master.math.univ-paris-diderot.fr/annee/m1-isifar/'
          slide.appendChild(logo)
        })
    }
  }
  document.addEventListener('DOMContentLoaded', addLogo)
})()</script>
</div>

---

# Analyse des Données : Introduction to table manipulation(s)

### 2023-01-15

#### [Master I MFA et MIDS](https://master.math.univ-paris-diderot.fr/annee/m1-isifar/)

#### [Analyse de Données](http://stephane-v-boucheron.fr/courses/isidata/)

#### [Stéphane Boucheron](http://stephane-v-boucheron.fr)

---
template: inter-slide

## <svg aria-hidden="true" role="img" viewBox="0 0 576 512" style="height:1em;width:1.12em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:white;overflow:visible;position:relative;"><path d="M565.6 36.24C572.1 40.72 576 48.11 576 56V392C576 401.1 569.8 410.9 560.5 414.4L392.5 478.4C387.4 480.4 381.7 480.5 376.4 478.8L192.5 417.5L32.54 478.4C25.17 481.2 16.88 480.2 10.38 475.8C3.882 471.3 0 463.9 0 456V120C0 110 6.15 101.1 15.46 97.57L183.5 33.57C188.6 31.6 194.3 31.48 199.6 33.23L383.5 94.52L543.5 33.57C550.8 30.76 559.1 31.76 565.6 36.24H565.6zM48 421.2L168 375.5V90.83L48 136.5V421.2zM360 137.3L216 89.3V374.7L360 422.7V137.3zM408 421.2L528 375.5V90.83L408 136.5V421.2z"/></svg>

### [Tables](#dt)

### [SQL and Relational algebra with `dplyr`](#sql)

### [Tidy tables](#tidytables)

### [Aggregations](#aggregation)

### [Pivoting](#pivots)

### [Pipe](#pipe)

---
template: inter-slide
name: dt

## Tables

---

### Tables (examples)

- Spreadsheets (Excel)

- <svg aria-hidden="true" role="img" viewBox="0 0 448 512" style="height:1em;width:0.88em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M448 80v48c0 44.2-100.3 80-224 80S0 172.2 0 128V80C0 35.8 100.3 0 224 0S448 35.8 448 80zM393.2 214.7c20.8-7.4 39.9-16.9 54.8-28.6V288c0 44.2-100.3 80-224 80S0 332.2 0 288V186.1c14.9 11.8 34 21.2 54.8 28.6C99.7 230.7 159.5 240 224 240s124.3-9.3 169.2-25.3zM0 346.1c14.9 11.8 34 21.2 54.8 28.6C99.7 390.7 159.5 400 224 400s124.3-9.3 169.2-25.3c20.8-7.4 39.9-16.9 54.8-28.6V432c0 44.2-100.3 80-224 80S0 476.2 0 432V346.1z"/></svg> Relational tables

- Dataframes in data science frameworks

- <svg aria-hidden="true" role="img" viewBox="0 0 581 512" style="height:1em;width:1.13em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"/></svg>: `data.frame`, `tibble`, ...
  - <svg aria-hidden="true" role="img" viewBox="0 0 448 512" style="height:1em;width:0.88em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M439.8 200.5c-7.7-30.9-22.3-54.2-53.4-54.2h-40.1v47.4c0 36.8-31.2 67.8-66.8 67.8H172.7c-29.2 0-53.4 25-53.4 54.3v101.8c0 29 25.2 46 53.4 54.3 33.8 9.9 66.3 11.7 106.8 0 26.9-7.8 53.4-23.5 53.4-54.3v-40.7H226.2v-13.6h160.2c31.1 0 42.6-21.7 53.4-54.2 11.2-33.5 10.7-65.7 0-108.6zM286.2 404c11.1 0 20.1 9.1 20.1 20.3 0 11.3-9 20.4-20.1 20.4-11 0-20.1-9.2-20.1-20.4.1-11.3 9.1-20.3 20.1-20.3zM167.8 248.1h106.8c29.7 0 53.4-24.5 53.4-54.3V91.9c0-29-24.4-50.7-53.4-55.6-35.8-5.9-74.7-5.6-106.8.1-45.2 8-53.4 24.7-53.4 55.6v40.7h106.9v13.6h-147c-31.1 0-58.3 18.7-66.8 54.2-9.8 40.7-10.2 66.1 0 108.6 7.6 31.6 25.7 54.2 56.8 54.2H101v-48.8c0-35.3 30.5-66.4 66.8-66.4zm-6.7-142.6c-11.1 0-20.1-9.1-20.1-20.3.1-11.3 9-20.4 20.1-20.4 11 0 20.1 9.2 20.1 20.4s-9 20.3-20.1 20.3z"/></svg>: `pandas.dataframe`
  - Spark: `DataFrame`
  - Dask: `DataFrame`
  - and many others

---

### Tables (why?)

In Data Science, each framework comes with its own flavor(s) of table(s)

In <svg aria-hidden="true" role="img" viewBox="0 0 581 512" style="height:1em;width:1.13em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"/></svg> legacy dataframes shape the life of statisticians and data scientists

The purpose of this session is

- to describe dataframes from an end-user viewpoint (we leave implementations aside)

- to present tools for
  - accessing information within dataframes (*querying*)
  - summarizing information (*aggregation queries*)
  - cleaning dataframes (*tidying*)

???

---

### Loading tables and packages

```r
pacman::p_load("tidyverse")      # All we need is there

# Almost all. Helper packages
pacman::p_load("nycflights13")    # for flight data
# for manipulating dates and times
pacman::p_load("lubridate")
pacman::p_load("stringr")
# nice table output for web presentations
pacman::p_load("DT")
pacman::p_load("gt")
pacman::p_load("kableExtra")

# load the `flights` table into the workspace
data(flights)
```

---

### About loaded packages

- Metapackage [`tidyverse`](https://www.tidyverse.org) provides tools to create, query, tidy dataframes as well as tools to load data from various sources and save them in persistent storage

- [`nycflights13`](https://github.com/tidyverse/nycflights13) provides the dataframes we play with

- [`DT`](https://rstudio.github.io/DT/) is an interface to the JavaScript library `DataTables`, which displays dataframes as interactive HTML tables

---

### The `flights` table

```r
head(flights) %>%
  glimpse(width = 50) 
```

```
## Rows: 6
## Columns: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2…
## $ month          <int> 1, 1, 1, 1, 1, 1
## $ day            <int> 1, 1, 1, 1, 1, 1
## $ dep_time       <int> 517, 533, 542, 544, 554, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, …
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4
## $ arr_time       <int> 830, 850, 923, 1004, 812,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837,…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12
## $ carrier        <chr> "UA", "UA", "AA", "B6", "…
## $ flight         <int> 1545, 1714, 1141, 725, 46…
## $ tailnum        <chr> "N14228", "N24211", "N619…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN…
## $ air_time       <dbl> 227, 227, 160, 183, 116, …
## $ distance       <dbl> 1400, 1416, 1089, 1576, 7…
## $ hour           <dbl> 5, 5, 5, 5, 6, 5
## $ minute         <dbl> 15, 29, 40, 45, 0, 58
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013…
```


???

A dataframe is a two-way (two-dimensional) table

`head(df)` returns the first 6 rows of `df` (6 is the default value of argument `n`)

The vectors making a dataframe may have different types/classes (a dataframe is not a matrix)

Compare `str()`, `glimpse()`, `head()`
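
The remarks above can be checked on a toy dataframe. The following is a minimal sketch in base R; the table `df` and its values are hypothetical and only serve the illustration.

```r
# A toy dataframe (hypothetical data): columns of different classes
df <- data.frame(
  carrier  = c("UA", "AA", "B6"),
  distance = c(1400, 1089, 1576),
  on_time  = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Each column keeps its own class
col_classes <- sapply(df, class)
print(col_classes)

# Coercing to a matrix forces one common type for all cells
m <- as.matrix(df)
print(typeof(m))

# head() returns the first rows (6 by default); `n` changes that
first_two <- head(df, n = 2)
print(first_two)

# str() gives a compact, column-by-column description
str(df)
```

`glimpse()` from the tidyverse plays the same role as `str()`, with an output that adapts to the available width.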

---

### Table schema

.fl.w-30.pa2.f6[

A table is a _list_ of _columns_

Each _column_ has

- _name_ and
- _type_ (_class_ in <svg aria-hidden="true" role="img" viewBox="0 0 581 512" style="height:1em;width:1.13em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"/></svg>)

```r
*glimpse(flights,
        width=50)
```
]

.fl.w-70.pa2.f6[

```
## Rows: 336,776
## Columns: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2…
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time       <int> 517, 533, 542, 544, 554, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, …
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, …
## $ arr_time       <int> 830, 850, 923, 1004, 812,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837,…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12,…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "…
## $ flight         <int> 1545, 1714, 1141, 725, 46…
## $ tailnum        <chr> "N14228", "N24211", "N619…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN…
## $ air_time       <dbl> 227, 227, 160, 183, 116, …
## $ distance       <dbl> 1400, 1416, 1089, 1576, 7…
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6…
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0,…
## $ time_hour      <dttm> 2013-01-01 05:00:00, 201…
```

]

???

- `flights` has 19 columns
- Each column is a sequence (`vector`) of items with the same type/class
- All columns have the same length
- `flights` has 336776 rows
- In <svg aria-hidden="true" role="img" viewBox="0 0 448 512" style="height:1em;width:0.88em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M448 80v48c0 44.2-100.3 80-224 80S0 172.2 0 128V80C0 35.8 100.3 0 224 0S448 35.8 448 80zM393.2 214.7c20.8-7.4 39.9-16.9 54.8-28.6V288c0 44.2-100.3 80-224 80S0 332.2 0 288V186.1c14.9 11.8 34 21.2 54.8 28.6C99.7 230.7 159.5 240 224 240s124.3-9.3 169.2-25.3zM0 346.1c14.9 11.8 34 21.2 54.8 28.6C99.7 390.7 159.5 400 224 400s124.3-9.3 169.2-25.3c20.8-7.4 39.9-16.9 54.8-28.6V432c0 44.2-100.3 80-224 80S0 476.2 0 432V346.1z"/></svg> parlance, a row is (often) called a _tuple_
- In statistics parlance, a column is (often) called a _variable_ (an _attribute_ in database parlance)

---

### Column types

.fl.w-50.pa2[

| class |  columns |
|:-----:|:---------|
| `integer`   |  'year' 'month' 'day' 'dep_time' 'sched_dep_time' 'arr_time' 'sched_arr_time' 'flight'  |
| `numeric`  | 'dep_delay' 'arr_delay' 'air_time' 'distance' 'hour' 'minute'  |
| `character`   |  'carrier' 'tailnum' 'origin' 'dest' |
| `POSIXct`   |  'time_hour' |
| `POSIXt`   |  'time_hour' |

]

.fl.w-50.pa2[

A column, as a vector, may belong to several classes (`time_hour` is both `POSIXct` and `POSIXt`)

Other classes:  `factor` for categorical variables

Columns `dest`, `origin` and `carrier` could be coerced to factors

Should columns `dest`  and `origin` be coerced to the same factor?

]
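
One possible answer to the question above, as a sketch in base R: since `origin` and `dest` both hold airport codes, they can share a single set of levels. The table `trips` and the name `airport_levels` are hypothetical.

```r
# Toy data (hypothetical rows mimicking `flights`)
trips <- data.frame(
  origin = c("EWR", "LGA", "JFK", "JFK"),
  dest   = c("IAH", "IAH", "MIA", "LGA"),
  stringsAsFactors = FALSE
)

# Separate factors: each column gets its own set of levels
origin_f <- factor(trips$origin)
dest_f   <- factor(trips$dest)
print(levels(origin_f))
print(levels(dest_f))

# Shared factor: one common set of airport codes for both columns
airport_levels <- sort(union(trips$origin, trips$dest))
trips$origin <- factor(trips$origin, levels = airport_levels)
trips$dest   <- factor(trips$dest,   levels = airport_levels)

# Both columns now use the same encoding
print(identical(levels(trips$origin), levels(trips$dest)))
```

With shared levels, the two columns can be compared or tabulated against each other without recoding.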

---

### `nycflights13`

![](./img/bd_2023-nycflights13.png)

---

### Columns specification

.fl.w-30.pa2[

```r
as.col_spec(flights)
```
]

.fl.w-70.pa2.f6[

```r
cols(
  year = col_integer(),
  month = col_integer(),
  day = col_integer(),
  dep_time = col_integer(),
  sched_dep_time = col_integer(),
  dep_delay = col_double(),
  arr_time = col_integer(),
  sched_arr_time = col_integer(),
  arr_delay = col_double(),
  carrier = col_character(),
  flight = col_integer(),
  tailnum = col_character(),
  origin = col_character(),
  dest = col_character(),
  air_time = col_double(),
  distance = col_double(),
  hour = col_double(),
  minute = col_double(),
  time_hour = col_datetime(format = "")
)
```
]

???

A column specification plays roughly (`$\approx$`) the role of a table schema in relational databases

Column specifications are useful when loading dataframes from structured text files
like `.csv` files

`.csv` files do not contain typing information

File loaders from package `readr` can be told the column classes through column specifications
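
As an illustration, here is a minimal sketch of a load with and without a column specification. The CSV content is hypothetical, and passing literal text through `I()` assumes `readr` version 2.0 or later.

```r
library(readr)

# A tiny CSV given as literal text (hypothetical data);
# I() tells readr this is data, not a file path
csv_text <- I("year,carrier,dep_delay\n2013,UA,2\n2013,AA,-1\n")

# Without a specification, readr guesses the column classes
guessed <- read_csv(csv_text, show_col_types = FALSE)

# With a specification, the classes are imposed at load time
typed <- read_csv(
  csv_text,
  col_types = cols(
    year      = col_integer(),
    carrier   = col_factor(),
    dep_delay = col_double()
  )
)

print(sapply(guessed, class))
print(sapply(typed, class))
```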

---
template: inter-slide
name: sql

## SQL and Relational algebra with `dplyr`

???

---

- SQL stands for Structured Query Language

- A query language developed during the 1970s at IBM by D. Chamberlin and R. Boyce, building on E. F. Codd's relational model

- Geared towards exploitation of collections of relational tables

- Less expressive than a general-purpose programming language, but simpler to use

- `dplyr` is a principled <svg aria-hidden="true" role="img" viewBox="0 0 581 512" style="height:1em;width:1.13em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"/></svg>-friendly
implementation of SQL ideas (and other things)

At the core of SQL lies the idea of a table calculus called **relational algebra**

---

### Relational algebra (basics)

Convention: `$R$`  is a table with columns `$A_1, \ldots, A_k$`

- Projection (picking columns)

`$\pi(R, A_1, A_3)$`

- Selection/Filtering (picking rows)

`$\sigma(R, {\text{condition}})$`

- Join (multi-table operation)

`$\bowtie(R,S, {\text{condition}})$`

???

Relational algebra relies on a small set of basic operations `$\pi, \sigma, \bowtie$`

Each operation has one or two table **operands** and produces a table

---

`$\pi(R, {A_1, A_3})$`

A projection `$\pi(\cdot, {A_1, A_3})$` is defined by a set of column names, say `$A_1, A_3$`

If `$R$` has columns with these names, the result is a table with columns `$A_1, A_3$` and one row per row of `$R$`

A projection is parametrized by a list of column names

???

- Checks

- Variable number of arguments or list argument

- What if `$R$` does not have columns named `$A_1, A_3$`?
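
The last question can be explored with `dplyr::select()`. This is a sketch on a hypothetical toy table; the column `A9` is deliberately absent.

```r
library(dplyr)

# Toy table (hypothetical data)
R <- tibble(A1 = 1:3, A2 = c("x", "y", "z"), A3 = c(TRUE, FALSE, TRUE))

# Projection on existing columns: same number of rows, fewer columns
proj <- select(R, A1, A3)

# Projection on a missing column raises an error, caught here for illustration
res <- tryCatch(
  select(R, A1, A9),
  error = function(e) "error: no column A9"
)

# any_of() ignores the names that are absent instead of failing
lenient <- select(R, any_of(c("A1", "A9")))

print(proj)
print(res)
print(lenient)
```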

---

### <svg aria-hidden="true" role="img" viewBox="0 0 512 512" style="height:1em;width:1em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M78.6 5C69.1-2.4 55.6-1.5 47 7L7 47c-8.5 8.5-9.4 22-2.1 31.6l80 104c4.5 5.9 11.6 9.4 19 9.4h54.1l109 109c-14.7 29-10 65.4 14.3 89.6l112 112c12.5 12.5 32.8 12.5 45.3 0l64-64c12.5-12.5 12.5-32.8 0-45.3l-112-112c-24.2-24.2-60.6-29-89.6-14.3l-109-109V104c0-7.5-3.5-14.5-9.4-19L78.6 5zM19.9 396.1C7.2 408.8 0 426.1 0 444.1C0 481.6 30.4 512 67.9 512c18 0 35.3-7.2 48-19.9L233.7 374.3c-7.8-20.9-9-43.6-3.6-65.1l-61.7-61.7L19.9 396.1zM512 144c0-10.5-1.1-20.7-3.2-30.5c-2.4-11.2-16.1-14.1-24.2-6l-63.9 63.9c-3 3-7.1 4.7-11.3 4.7H352c-8.8 0-16-7.2-16-16V102.6c0-4.2 1.7-8.3 4.7-11.3l63.9-63.9c8.1-8.1 5.2-21.8-6-24.2C388.7 1.1 378.5 0 368 0C288.5 0 224 64.5 224 144l0 .8 85.3 85.3c36-9.1 75.8 .5 104 28.7L429 274.5c49-23 83-72.8 83-130.5zM104 432c0 13.3-10.7 24-24 24s-24-10.7-24-24s10.7-24 24-24s24 10.7 24 24z"/></svg> Package `dplyr`

.fl.w-30.pa2[

- [_Transformation_ chapter in R4DS](https://r4ds.had.co.nz/transform.html)

- [Cheat sheet I](https://github.com/rstudio/cheatsheets/blob/main/data-transformation.pdf)

- [Cheat sheet II](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)
]

.fl.w-70.pa2[

[https://dplyr.tidyverse.org](https://dplyr.tidyverse.org)

]

???

Base <svg aria-hidden="true" role="img" viewBox="0 0 581 512" style="height:1em;width:1.13em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"/></svg> provides tools to perform relational algebra operations

But:

- Base <svg aria-hidden="true" role="img" viewBox="0 0 581 512" style="height:1em;width:1.13em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"/></svg> does not provide a consistent API

- The lack of a consistent API makes operation chaining tricky

---

### `dplyr` verbs

Five basic verbs:

- Pick observations/rows by their values (`filter()`)  .fr[ σ(...) ]

- Pick variables by their names (`select()`)    .fr[ π(...)]

- Reorder the rows (`arrange()`)

- Create new variables with functions of existing variables (`mutate()`)

- Collapse many values down to a single summary (`summarise()`)

And

- `group_by()`  changes the scope of each function from operating on the entire dataset to operating on it group-by-group
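
The verbs listed above can be tried on a small table. The following is a sketch; `mini` is a hypothetical five-row extract shaped like `flights` (distances in miles, air times in minutes).

```r
library(dplyr)

# Hypothetical mini version of `flights`
mini <- tibble(
  carrier  = c("UA", "UA", "AA", "B6", "AA"),
  distance = c(1400, 1416, 1089, 1576, 733),
  air_time = c(227, 227, 160, 183, 116)
)

long_hauls <- filter(mini, distance > 1000)                    # rows by value
two_cols   <- select(mini, carrier, distance)                  # columns by name
sorted     <- arrange(mini, desc(distance))                    # reorder rows
with_speed <- mutate(mini, speed = distance / air_time * 60)   # new column (miles per hour)
overall    <- summarise(mini, mean_distance = mean(distance))  # one summary row

# group_by() makes summarise() work carrier by carrier
by_carrier <- mini %>%
  group_by(carrier) %>%
  summarise(mean_distance = mean(distance), n = n())

print(by_carrier)
```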

???

---

### <svg aria-hidden="true" role="img" viewBox="0 0 576 512" style="height:1em;width:1.12em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M346.3 271.8l-60.1-21.9L214 448H32c-17.7 0-32 14.3-32 32s14.3 32 32 32H544c17.7 0 32-14.3 32-32s-14.3-32-32-32H282.1l64.1-176.2zm121.1-.2l-3.3 9.1 67.7 24.6c18.1 6.6 38-4.2 39.6-23.4c6.5-78.5-23.9-155.5-80.8-208.5c2 8 3.2 16.3 3.4 24.8l.2 6c1.8 57-7.3 113.8-26.8 167.4zM462 99.1c-1.1-34.4-22.5-64.8-54.4-77.4c-.9-.4-1.9-.7-2.8-1.1c-33-11.7-69.8-2.4-93.1 23.8l-4 4.5C272.4 88.3 245 134.2 226.8 184l-3.3 9.1L434 269.7l3.3-9.1c18.1-49.8 26.6-102.5 24.9-155.5l-.2-6zM107.2 112.9c-11.1 15.7-2.8 36.8 15.3 43.4l71 25.8 3.3-9.1c19.5-53.6 49.1-103 87.1-145.5l4-4.5c6.2-6.9 13.1-13 20.5-18.2c-79.6 2.5-154.7 42.2-201.2 108z"/></svg> tidyverse

.fl.w-50.pa2[

> All verbs work similarly:

> The first argument is a data frame (table).

> The subsequent arguments describe what to do with the data frame, using the variable/column names (without quotes)

> The result is a new data frame (table)

]

.fl.w-50.pa2[

]

???

`dplyr` is part of `tidyverse`

`dplyr` provides a consistent API

---

### `dplyr::select()` as a projection operator (π)

`$\pi(R, \underbrace{A_1, A_3}_{\text{column names}})$`

```r
*select(R, A1, A3)
```

or,  equivalently

```r
*R %>% select(A1, A3)
```

???

Function `select` has a variable number of arguments

Function `select` lets you pick columns by name (and much more)

Note that in the current environment, there are no objects called `A1`, `A3`: the names are looked up among the columns of `R`

The consistent API makes it possible to use the pipe operator
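
These notes can be illustrated as follows. This is a sketch on a hypothetical three-column table; the helper functions shown (`all_of()`, `starts_with()`) are only a sample of the "much more".

```r
library(dplyr)

# Toy table (hypothetical data)
R <- tibble(A1 = 1:3, A2 = letters[1:3], A3 = c(0.5, 1.5, 2.5))

# Bare names are looked up among the columns of R, not in the environment
direct <- select(R, A1, A3)
piped  <- R %>% select(A1, A3)
print(identical(direct, piped))

# Names held in a character vector go through all_of()
wanted   <- c("A1", "A3")
from_chr <- select(R, all_of(wanted))

# Helpers, ranges and negation
by_prefix <- select(R, starts_with("A"))  # columns whose name starts with "A"
by_range  <- select(R, A1:A2)             # a range of adjacent columns
without   <- select(R, -A2)               # everything but A2
```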

---

### Toy tables

.fl.w-50.pa2[

```r
set.seed(42)  # make the random draws below reproducible

R <-  tibble(A1=seq(2, 10, 2),
             A2=sample(letters, 5),
             A3=seq(from=date("2021-10-21"),
                    to=date("2021-11-20"),
                    by=7),
             D=sample(letters, 5))

S <- tibble(E=c(3,4,6,9, 10),
            F=sample(letters, 5),
            G=seq(from=date("2021-10-21"),
                   to=date("2021-10-21")+4, by=1),
            D=sample(letters,5)
          )
```
]

.fl.w-50.pa2[

]

???

---

### Projecting `flights` on `origin`  and `dest`

.fl.w-50.pa2[

```r
flights %>%
* select(origin, dest) %>%
  head()
```

A more readable equivalent of

```r
head(select(flights,
            origin,
            dest))
```
]

.fl.w-50.pa2[

```
## # A tibble: 6 × 2
##   origin dest 
##   <chr>  <chr>
## 1 EWR    IAH  
## 2 LGA    IAH  
## 3 JFK    MIA  
## 4 JFK    BQN  
## 5 LGA    ATL  
## 6 EWR    ORD
```

```sql
SELECT origin, dest
FROM flights
LIMIT 6;
```

]

???

---

`$\sigma(R, \text{condition})$`

A selection/filtering operation is defined by a condition that can be checked on the rows of any table with a suitable schema

`$\sigma(R, \text{condition})$` returns a table with the same schema as `$R$`

The resulting table contains the rows/tuples of `$R$` that satisfy `$\text{condition}$`

`$\sigma(R, \text{FALSE})$` returns an empty table with the same schema as `$R$`
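
The two properties above (same schema, possibly no rows) can be checked with `dplyr::filter()`. This is a sketch on a hypothetical toy table.

```r
library(dplyr)

# Toy table (hypothetical data)
R <- tibble(A1 = c(2, 4, 6), A2 = c("a", "q", "z"))

# Rows that satisfy the condition, same columns as R
kept <- filter(R, A1 > 3)

# A condition that is never satisfied: no rows, but the schema survives
empty <- filter(R, FALSE)

print(kept)
print(empty)
```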

---

### Chaining filtering and projecting

.fl.w-50.pa2[

```r
start <- date("2021-10-27")
end <- start + 21

R %>%
* filter(A2 > "n" ,
         between(A3, start, end)) %>%
* select(A1, A3)
```

]

.fl.w-50.pa2[

```
## # A tibble: 0 × 2
## # … with 2 variables: A1 <dbl>, A3 <date>
```

]
---

### Selecting `flights` based on `origin`  and `dest`

and then projecting on `dest, time_hour, carrier`

.fl.w-50.pa2[

```r
flights %>%
* filter(dest %in% c('ATL', 'LAX'),
         origin == 'JFK') %>%
* select(dest, time_hour, carrier) %>%
  head()
```

- In SQL (<svg aria-hidden="true" role="img" viewBox="0 0 448 512" style="height:1em;width:0.88em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M448 80v48c0 44.2-100.3 80-224 80S0 172.2 0 128V80C0 35.8 100.3 0 224 0S448 35.8 448 80zM393.2 214.7c20.8-7.4 39.9-16.9 54.8-28.6V288c0 44.2-100.3 80-224 80S0 332.2 0 288V186.1c14.9 11.8 34 21.2 54.8 28.6C99.7 230.7 159.5 240 224 240s124.3-9.3 169.2-25.3zM0 346.1c14.9 11.8 34 21.2 54.8 28.6C99.7 390.7 159.5 400 224 400s124.3-9.3 169.2-25.3c20.8-7.4 39.9-16.9 54.8-28.6V432c0 44.2-100.3 80-224 80S0 476.2 0 432V346.1z"/></svg>) parlance:

```sql
SELECT dest, time_hour, carrier
FROM flights
WHERE dest IN ('ATL', 'LAX') AND
      origin = 'JFK'
LIMIT 6;
```
]

.fl.w-50.pa2[

```
## # A tibble: 6 × 3
##   dest  time_hour           carrier
##   <chr> <dttm>              <chr>  
## 1 LAX   2013-01-01 06:00:00 UA     
## 2 ATL   2013-01-01 06:00:00 DL     
## 3 LAX   2013-01-01 07:00:00 VX     
## 4 LAX   2013-01-01 07:00:00 B6     
## 5 LAX   2013-01-01 07:00:00 AA     
## 6 ATL   2013-01-01 08:00:00 DL
```

]
???

Filtering is also called subsetting

---

### Logical operations

`filter(R, condition_1, condition_2)` is meant to return the rows of `R` that satisfy `condition_1` **and** `condition_2`

`filter(R, condition_1 & condition_2)` is an equivalent formulation

`filter(R, condition_1 | condition_2)` is meant to return the rows of `R` that satisfy `condition_1` **or** `condition_2` (possibly both)

`filter(R, xor(condition_1,condition_2))` is meant to return the rows of `R` that satisfy **either** `condition_1` **or** `condition_2` (just one of them)

`filter(R, ! condition_1)` is meant to return the rows of `R` that **do not** satisfy  `condition_1`
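
A small sketch of these five forms, on a hypothetical table holding the integers 1 to 6, with `condition_1` being `x > 2` and `condition_2` being "`x` is even":

```r
library(dplyr)

R <- tibble(x = 1:6)

both_comma <- filter(R, x > 2, x %% 2 == 0)       # and, written with a comma
both_amp   <- filter(R, x > 2 & x %% 2 == 0)      # and, written with &
either     <- filter(R, x > 2 | x %% 2 == 0)      # or (possibly both)
just_one   <- filter(R, xor(x > 2, x %% 2 == 0))  # exactly one of the two
negated    <- filter(R, !(x > 2))                 # not condition_1

print(both_comma$x)
print(either$x)
print(just_one$x)
print(negated$x)
```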

---

### Overview of set and boolean operations

![](./img/transform-logical.png)

---

### <svg aria-hidden="true" role="img" viewBox="0 0 512 512" style="height:1em;width:1em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M400 128c0 44.4-25.4 83.5-64 106.4V256c0 17.7-14.3 32-32 32H208c-17.7 0-32-14.3-32-32V234.4c-38.6-23-64-62.1-64-106.4C112 57.3 176.5 0 256 0s144 57.3 144 128zM200 176c17.7 0 32-14.3 32-32s-14.3-32-32-32s-32 14.3-32 32s14.3 32 32 32zm144-32c0-17.7-14.3-32-32-32s-32 14.3-32 32s14.3 32 32 32s32-14.3 32-32zM35.4 273.7c7.9-15.8 27.1-22.2 42.9-14.3L256 348.2l177.7-88.8c15.8-7.9 35-1.5 42.9 14.3s1.5 35-14.3 42.9L327.6 384l134.8 67.4c15.8 7.9 22.2 27.1 14.3 42.9s-27.1 22.2-42.9 14.3L256 419.8 78.3 508.6c-15.8 7.9-35 1.5-42.9-14.3s-1.5-35 14.3-42.9L184.4 384 49.7 316.6c-15.8-7.9-22.2-27.1-14.3-42.9z"/></svg> Missing values!

Numerical column `dep_time` contains many `NA's` (missing values)

```r
summary(flights$dep_time)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       1     907    1401    1349    1744    2400    8255
```

```r
NA & TRUE
```

[1] NA

```r
NA | TRUE
```

[1] TRUE
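
Missing values matter for filtering: `filter()` keeps only the rows where the condition is `TRUE`, so rows where it evaluates to `NA` are dropped. A sketch on a hypothetical three-row table:

```r
library(dplyr)

# Hypothetical departures, one of them with a missing dep_time
dep <- tibble(flight = c(1, 2, 3), dep_time = c(517, NA, 2400))

# The comparison yields NA for flight 2, so that row is dropped
early <- filter(dep, dep_time < 1200)

# Missing values have to be asked for explicitly with is.na()
late_or_missing <- filter(dep, dep_time >= 1200 | is.na(dep_time))

# Comparing with NA always gives NA, so this keeps nothing
never <- filter(dep, dep_time == NA)

print(early)
print(late_or_missing)
print(nrow(never))
```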

---

### Truth tables for three-valued logic

.fl.w-50.pa2[

```r
v <- c(TRUE, FALSE, NA) # truth values

*list_tt <- map(c(`&`, `|`, xor),
*              ~ outer(v, v, .x))

for (i in seq_along(list_tt)){
  colnames(list_tt[[i]]) <- v
  rownames(list_tt[[i]]) <- v
}

names(list_tt) <- c('& AND',
                    '| OR',
                    'XOR')
```
]

.fl.w-50.pa2.f6.tl[

<table class=" lightable-minimal" style='font-family: "Trebuchet MS", verdana, sans-serif; width: auto !important; '>
<caption>&amp; AND</caption>
 <thead>
  <tr>
   <th style="text-align:left;font-weight: bold;background-color: lightgray !important;">   </th>
   <th style="text-align:left;font-weight: bold;background-color: lightgray !important;"> TRUE </th>
   <th style="text-align:left;font-weight: bold;background-color: lightgray !important;"> FALSE </th>
   <th style="text-align:left;font-weight: bold;background-color: lightgray !important;"> NA </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;font-weight: bold;background-color: lightgray !important;"> TRUE </td>
   <td style="text-align:left;"> TRUE </td>
   <td style="text-align:left;"> FALSE </td>
   <td style="text-align:left;"> NA </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;background-color: lightgray !important;"> FALSE </td>
   <td style="text-align:left;"> FALSE </td>
   <td style="text-align:left;"> FALSE </td>
   <td style="text-align:left;"> FALSE </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;background-color: lightgray !important;"> NA </td>
   <td style="text-align:left;"> NA </td>
   <td style="text-align:left;"> FALSE </td>
   <td style="text-align:left;"> NA </td>
  </tr>
</tbody>
</table><br>

<table class=" lightable-minimal" style='font-family: "Trebuchet MS", verdana, sans-serif; width: auto !important; '>
<caption>| OR</caption>
 <thead>
  <tr>
   <th style="text-align:left;font-weight: bold;background-color: lightgray !important;">   </th>
   <th style="text-align:left;font-weight: bold;background-color: lightgray !important;"> TRUE </th>
   <th style="text-align:left;font-weight: bold;background-color: lightgray !important;"> FALSE </th>
   <th style="text-align:left;font-weight: bold;background-color: lightgray !important;"> NA </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;font-weight: bold;background-color: lightgray !important;"> TRUE </td>
   <td style="text-align:left;"> TRUE </td>
   <td style="text-align:left;"> TRUE </td>
   <td style="text-align:left;"> TRUE </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;background-color: lightgray !important;"> FALSE </td>
   <td style="text-align:left;"> TRUE </td>
   <td style="text-align:left;"> FALSE </td>
   <td style="text-align:left;"> NA </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;background-color: lightgray !important;"> NA </td>
   <td style="text-align:left;"> TRUE </td>
   <td style="text-align:left;"> NA </td>
   <td style="text-align:left;"> NA </td>
  </tr>
</tbody>
</table><br>

<table class=" lightable-minimal" style='font-family: "Trebuchet MS", verdana, sans-serif; width: auto !important; '>
<caption>XOR</caption>
 <thead>
  <tr>
   <th style="text-align:left;font-weight: bold;background-color: lightgray !important;">   </th>
   <th style="text-align:left;font-weight: bold;background-color: lightgray !important;"> TRUE </th>
   <th style="text-align:left;font-weight: bold;background-color: lightgray !important;"> FALSE </th>
   <th style="text-align:left;font-weight: bold;background-color: lightgray !important;"> NA </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;font-weight: bold;background-color: lightgray !important;"> TRUE </td>
   <td style="text-align:left;"> FALSE </td>
   <td style="text-align:left;"> TRUE </td>
   <td style="text-align:left;"> NA </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;background-color: lightgray !important;"> FALSE </td>
   <td style="text-align:left;"> TRUE </td>
   <td style="text-align:left;"> FALSE </td>
   <td style="text-align:left;"> NA </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;background-color: lightgray !important;"> NA </td>
   <td style="text-align:left;"> NA </td>
   <td style="text-align:left;"> NA </td>
   <td style="text-align:left;"> NA </td>
  </tr>
</tbody>
</table><br>

]

---
exclude: true

|  <div style="width:80px">`and`</div> | <div style="width:80px">`TRUE`</div>  | <div style="width:80px">`FALSE`</div> |<div style="width:80px">`NA`</div>    |
|:-----|:-----|:-----|:-----|
|**`TRUE`**  |TRUE  |FALSE |NA    |
|**`FALSE`** |FALSE |FALSE |FALSE |
|**`NA`**    |NA    |FALSE |NA    |

<br>

|  <div style="width:80px">`or`</div> | <div style="width:80px">`TRUE`</div>  | <div style="width:80px">`FALSE`</div> |<div style="width:80px">`NA`</div>    |
|:-----|:----|:-----|:----|
|**`TRUE`**  |TRUE |TRUE  |TRUE |
|**`FALSE`** |TRUE |FALSE |NA   |
|**`NA`**    |TRUE |NA    |NA   |

<br>

|  <div style="width:80px">`xor`</div> | <div style="width:80px">`TRUE`</div>  | <div style="width:80px">`FALSE`</div> |<div style="width:80px">`NA`</div>    |
|:-----|:-----|:-----|:--|
|**`TRUE`** |FALSE |TRUE  |NA |
|**`FALSE`** |TRUE  |FALSE |NA |
|**`NA`**   |NA    |NA    |NA |

---

### `slice()`: choosing rows based on location

.fl.w-50.pa2[

In base <svg aria-hidden="true" role="img" viewBox="0 0 581 512" style="height:1em;width:1.13em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"/></svg> dataframe cells can be addressed by
indices

`flights[5000:5010,seq(1, 19, by=5)]` returns rows `5000:5010` and columns
`1, 6, 11, 16` from dataframe `flights`

This can be done in a (verbose) `dplyr` way using `slice()` and `select()`

]

.fl.w-50.pa2[

```r
flights %>%
* slice(5001:5005) %>%
  select(seq(1, 19, by=5))
```

```
## # A tibble: 5 × 4
##    year dep_delay flight distance
##   <int>     <dbl>  <int>    <dbl>
## 1  2013         3   4437      602
## 2  2013        43   1016      187
## 3  2013        -2   2190     1089
## 4  2013        -1     91     2576
## 5  2013         5   2131      502
```
]

???

Useful variant `slice_sample()`
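
A minimal sketch on a hypothetical toy tibble (not `flights`): base indexing and the `slice()` + `select()` pipeline should pick the same cells, while `slice_sample()` draws rows at random.

```r
library(dplyr)

# Hypothetical toy table: 6 rows, 4 columns
toy <- tibble(id = 1:6,
              x  = letters[1:6],
              y  = (1:6)^2,
              z  = rev(1:6))

# Base R: rows 2 to 4, columns 1 and 3
base_way <- toy[2:4, seq(1, 4, by = 2)]

# dplyr: the same selection, by position
dplyr_way <- toy %>%
  slice(2:4) %>%
  select(seq(1, 4, by = 2))

# slice_sample() picks n rows uniformly at random
random_rows <- toy %>% slice_sample(n = 3)
```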

---
template: inter-slide

## Joins : multi-table queries

---

`$\bowtie(R,S, {\text{condition}})$`

]

stands for

> join rows/tuples of `$R$` and rows/tuples of `$S$`  that satisfy `$\text{condition}$`

---

### `nycflights` tables

The `nycflights13` package  offers five related tables:

- _Fact_ tables:
  - `flights`
  - `weather`  (hourly weather conditions at different locations)

- _Dimension_ tables:
  - `airports`  (airports full names, location, ...)
  - `planes`    (model, manufacturer, year, ...)
  - `airlines`  (full names)

This is an instance of a [Star Schema](https://en.wikipedia.org/wiki/Star_schema)

<img src="./img/bd_2023-nycflights13.png" width="443" />
]

???

---

### Star schema

> Fact tables record measurements for a specific event

> Fact tables generally consist of numeric values, and foreign keys to dimensional data where descriptive information is kept

---

### Star schema illustrated

![](./img/relational-nycflights.png)

---

### <svg aria-hidden="true" role="img" viewBox="0 0 512 512" style="height:1em;width:1em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M288 32c0 17.7 14.3 32 32 32h32c17.7 0 32 14.3 32 32s-14.3 32-32 32H32c-17.7 0-32 14.3-32 32s14.3 32 32 32H352c53 0 96-43 96-96s-43-96-96-96H320c-17.7 0-32 14.3-32 32zm64 352c0 17.7 14.3 32 32 32h32c53 0 96-43 96-96s-43-96-96-96H32c-17.7 0-32 14.3-32 32s14.3 32 32 32H416c17.7 0 32 14.3 32 32s-14.3 32-32 32H384c-17.7 0-32 14.3-32 32zM128 512h32c53 0 96-43 96-96s-43-96-96-96H32c-17.7 0-32 14.3-32 32s14.3 32 32 32H160c17.7 0 32 14.3 32 32s-14.3 32-32 32H128c-17.7 0-32 14.3-32 32s14.3 32 32 32z"/></svg> weather conditions

```r
weather %>%
  glimpse(width = 50)
```

```
## Rows: 26,115
## Columns: 15
## $ origin     <chr> "EWR", "EWR", "EWR", "EWR", "…
## $ year       <int> 2013, 2013, 2013, 2013, 2013,…
## $ month      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ day        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ hour       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10…
## $ temp       <dbl> 39.02, 39.02, 39.02, 39.92, 3…
## $ dewp       <dbl> 26.06, 26.96, 28.04, 28.04, 2…
## $ humid      <dbl> 59.37, 61.63, 64.43, 62.21, 6…
## $ wind_dir   <dbl> 270, 250, 240, 250, 260, 240,…
## $ wind_speed <dbl> 10.35702, 8.05546, 11.50780, …
## $ wind_gust  <dbl> NA, NA, NA, NA, NA, NA, NA, N…
## $ precip     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ pressure   <dbl> 1012.0, 1012.3, 1012.5, 1012.…
## $ visib      <dbl> 10, 10, 10, 10, 10, 10, 10, 1…
## $ time_hour  <dttm> 2013-01-01 01:00:00, 2013-01…
```

]

---

### Connecting `flights`  and `weather`

We want to complement the information in `flights` with data from `weather`

Motivation: we would like to relate delays (`arr_delay`) and weather conditions

- can we explain (justify) delays using weather data?

- can we predict delays using weather data?

---

### <svg aria-hidden="true" role="img" viewBox="0 0 640 512" style="height:1em;width:1.25em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M381 114.9L186.1 41.8c-16.7-6.2-35.2-5.3-51.1 2.7L89.1 67.4C78 73 77.2 88.5 87.6 95.2l146.9 94.5L136 240 77.8 214.1c-8.7-3.9-18.8-3.7-27.3 .6L18.3 230.8c-9.3 4.7-11.8 16.8-5 24.7l73.1 85.3c6.1 7.1 15 11.2 24.3 11.2H248.4c5 0 9.9-1.2 14.3-3.4L535.6 212.2c46.5-23.3 82.5-63.3 100.8-112C645.9 75 627.2 48 600.2 48H542.8c-20.2 0-40.2 4.8-58.2 14L381 114.9zM0 480c0 17.7 14.3 32 32 32H608c17.7 0 32-14.3 32-32s-14.3-32-32-32H32c-17.7 0-32 14.3-32 32z"/></svg> ⋈  <svg aria-hidden="true" role="img" viewBox="0 0 512 512" style="height:1em;width:1em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M288 32c0 17.7 14.3 32 32 32h32c17.7 0 32 14.3 32 32s-14.3 32-32 32H32c-17.7 0-32 14.3-32 32s14.3 32 32 32H352c53 0 96-43 96-96s-43-96-96-96H320c-17.7 0-32 14.3-32 32zm64 352c0 17.7 14.3 32 32 32h32c53 0 96-43 96-96s-43-96-96-96H32c-17.7 0-32 14.3-32 32s14.3 32 32 32H416c17.7 0 32 14.3 32 32s-14.3 32-32 32H384c-17.7 0-32 14.3-32 32zM128 512h32c53 0 96-43 96-96s-43-96-96-96H32c-17.7 0-32 14.3-32 32s14.3 32 32 32H160c17.7 0 32 14.3 32 32s-14.3 32-32 32H128c-17.7 0-32 14.3-32 32s14.3 32 32 32z"/></svg>

For each flight (row in `flights`)

- `year`, `month`, `day`, `hour` (computed from `time_hour`) indicate
the approximate time of departure

- `origin` indicates the airport where the plane takes off

Each row of `weather` contains corresponding information

---

### `inner_join`: natural join

.fl.w-40.pa2[

```r
f_w <- flights %>%
* inner_join(weather)

f_w %>% 
  select(seq(1, 
             ncol(f_w),
             by=2)) %>% 
  glimpse(width=50)
```
]

.fl.w-60.pa2.f6[

```
## Joining, by = c("year", "month", "day", "origin", "hour", "time_hour")
```

```
## Rows: 335,220
## Columns: 14
## $ year           <int> 2013, 2013, 2013, 2013, 2…
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ sched_dep_time <int> 515, 529, 540, 545, 600, …
## $ arr_time       <int> 830, 850, 923, 1004, 812,…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12,…
## $ flight         <int> 1545, 1714, 1141, 725, 46…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK…
## $ air_time       <dbl> 227, 227, 160, 183, 116, …
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6…
## $ time_hour      <dttm> 2013-01-01 05:00:00, 201…
## $ dewp           <dbl> 28.04, 24.98, 26.96, 26.9…
## $ wind_dir       <dbl> 260, 250, 260, 260, 260, …
## $ wind_gust      <dbl> NA, 21.86482, NA, NA, 23.…
## $ pressure       <dbl> 1011.9, 1011.4, 1012.1, 1…
```

]

???

---

### Join schema

```
## Rows: 335,220
## Columns: 28
## $ year           <int> 2013, 2013, 2013, 2013, 2…
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time       <int> 517, 533, 542, 544, 554, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, …
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, …
## $ arr_time       <int> 830, 850, 923, 1004, 812,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837,…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12,…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "…
## $ flight         <int> 1545, 1714, 1141, 725, 46…
## $ tailnum        <chr> "N14228", "N24211", "N619…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN…
## $ air_time       <dbl> 227, 227, 160, 183, 116, …
## $ distance       <dbl> 1400, 1416, 1089, 1576, 7…
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6…
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0,…
## $ time_hour      <dttm> 2013-01-01 05:00:00, 201…
## $ temp           <dbl> 39.02, 39.92, 39.02, 39.0…
## $ dewp           <dbl> 28.04, 24.98, 26.96, 26.9…
## $ humid          <dbl> 64.43, 54.81, 61.63, 61.6…
## $ wind_dir       <dbl> 260, 250, 260, 260, 260, …
## $ wind_speed     <dbl> 12.65858, 14.96014, 14.96…
## $ wind_gust      <dbl> NA, 21.86482, NA, NA, 23.…
## $ precip         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ pressure       <dbl> 1011.9, 1011.4, 1012.1, 1…
## $ visib          <dbl> 10, 10, 10, 10, 10, 10, 1…
```
]

???

The schema of the result is the union of the schemas of the operands

A tuple from `flights` matches a tuple from `weather` if the two tuples have the same values in the common columns

year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr_time, arr_delay, carrier, flight, tailnum, origin, dest, air_time, distance, hour, minute, time_hour, temp, dewp, humid, wind_dir, wind_speed, wind_gust, precip, pressure, visib
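
A small check of the "union of schemas" claim, on hypothetical toy tables (the names `x`, `y`, `k1`, `k2`, `u`, `v` are made up for illustration):

```r
library(dplyr)

# Hypothetical toy tables sharing the columns `k1` and `k2`
x <- tibble(k1 = c(1, 2, 3), k2 = c("a", "b", "c"), u = c(10, 20, 30))
y <- tibble(k1 = c(1, 2, 4), k2 = c("a", "b", "d"), v = c(TRUE, FALSE, TRUE))

joined <- inner_join(x, y, by = c("k1", "k2"))

# Schema of the result = union of the two schemas
same_schema <- base::setequal(names(joined),
                              base::union(names(x), names(y)))
```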

---

### Which columns are used when joining tables `$R$` and `$S$`?

- _default behavior_ of `inner_join`: all columns shared by  `$R$` and `$S$`. Common columns  have the same name
in both schemas. They are expected to have the same class

- _manual definition_: in many settings, we  want to overrule the default behavior. We specify
manually which column from `$R$` should match which column from `$S$`
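
A minimal sketch of the manual definition, assuming two hypothetical tables whose key columns carry different names (`id` in `R`, `ref` in `S`):

```r
library(dplyr)

# Hypothetical tables: the key is called `id` in R and `ref` in S
R <- tibble(id = c(1, 2, 3), a = c("x", "y", "z"))
S <- tibble(ref = c(2, 3, 4), b = c(TRUE, FALSE, TRUE))

# No shared column name, so the default (natural) join cannot be used:
# the matching columns have to be spelled out
RS <- inner_join(R, S, by = c("id" = "ref"))
```

The result keeps the key under the name it has in the first table (`id`).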

---

### Natural join of  `flights`  and `weather`:

```r
common_names <- base::intersect(names(weather),
                                names(flights))

setequal(
  inner_join(flights, weather),
  inner_join(flights,
             weather,
             by=common_names)
)
```

```
## Joining, by = c("year", "month", "day", "origin", "hour", "time_hour")
```

```
## [1] TRUE
```

---

### <svg aria-hidden="true" role="img" viewBox="0 0 512 512" style="height:1em;width:1em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M184 0c30.9 0 56 25.1 56 56V456c0 30.9-25.1 56-56 56c-28.9 0-52.7-21.9-55.7-50.1c-5.2 1.4-10.7 2.1-16.3 2.1c-35.3 0-64-28.7-64-64c0-7.4 1.3-14.6 3.6-21.2C21.4 367.4 0 338.2 0 304c0-31.9 18.7-59.5 45.8-72.3C37.1 220.8 32 207 32 192c0-30.7 21.6-56.3 50.4-62.6C80.8 123.9 80 118 80 112c0-29.9 20.6-55.1 48.3-62.1C131.3 21.9 155.1 0 184 0zM328 0c28.9 0 52.6 21.9 55.7 49.9c27.8 7 48.3 32.1 48.3 62.1c0 6-.8 11.9-2.4 17.4c28.8 6.2 50.4 31.9 50.4 62.6c0 15-5.1 28.8-13.8 39.7C493.3 244.5 512 272.1 512 304c0 34.2-21.4 63.4-51.6 74.8c2.3 6.6 3.6 13.8 3.6 21.2c0 35.3-28.7 64-64 64c-5.6 0-11.1-.7-16.3-2.1c-3 28.2-26.8 50.1-55.7 50.1c-30.9 0-56-25.1-56-56V56c0-30.9 25.1-56 56-56z"/></svg> Are you surprised by the next chunk?

```r
dtu  <- inner_join(flights,
           weather,
           by=c("year", "month", "day", "origin", "hour"))

dtv <- inner_join(flights,
           weather,
           by=c("origin", "time_hour"))

setequal(dtu, dtv)
```

```
## [1] FALSE
```

Recall that columns `year`, `month`, `day` and `hour` can be computed from `time_hour`

```r
# helper for datetime objects
require(lubridate)

flights %>%
  filter(year!=year(time_hour) |
         month!=month(time_hour) |
         day!=day(time_hour) |
         hour!=hour(time_hour)) %>%
  nrow()
```

```
## [1] 0
```

???

This is an example of functional dependency
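
One way to test such a dependency, sketched on a hypothetical toy table: `time_hour` determines `(month, day, hour)` when no value of `time_hour` is paired with more than one such triple.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy table with a datetime and columns derived from it
toy <- tibble(
  time_hour = ymd_hms(c("2013-01-01 05:00:00",
                        "2013-01-01 05:00:00",
                        "2013-02-03 17:00:00"))
) %>%
  mutate(month = month(time_hour),
         day   = day(time_hour),
         hour  = hour(time_hour))

# Values of time_hour associated with more than one (month, day, hour)
violations <- toy %>%
  distinct(time_hour, month, day, hour) %>%
  count(time_hour) %>%
  filter(n > 1)
```

An empty `violations` table means the dependency holds on this sample.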

---

The two results do not have the same schema!

```r
setdiff(colnames(dtv), colnames(dtu))
```

```
## [1] "year.x"    "month.x"   "day.x"     "hour.x"    "time_hour" "year.y"   
## [7] "month.y"   "day.y"     "hour.y"
```

```r
setdiff(colnames(dtu), colnames(dtv))
```

```
## [1] "year"        "month"       "day"         "hour"        "time_hour.x"
## [6] "time_hour.y"
```

Fixing

```r
dtu  <- inner_join(flights,
           weather,
           by=c("year", "month", "day", "origin", "hour"),
*          suffix= c("", ".y")) %>%
*          select(-ends_with(".y"))

dtv <- inner_join(flights,
           weather,
           by=c("origin", "time_hour"),
*          suffix= c("", ".y")) %>%
*          select(-ends_with(".y"))

setequal(dtu, dtv)
```

```
## [1] TRUE
```
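
The same `suffix` trick on hypothetical toy tables, where `k` is the join key and `v` is a shared column left out of `by`:

```r
library(dplyr)

# Hypothetical tables: `k` is the join key, `v` is a shared non-key column
x <- tibble(k = c(1, 2), v = c("a", "b"))
y <- tibble(k = c(1, 2), v = c("a", "b"), w = c(10, 20))

# Default suffixes: both copies of `v` are renamed
default_names <- names(inner_join(x, y, by = "k"))

# Empty suffix for x, ".y" for y, then drop the y copies
clean <- inner_join(x, y, by = "k", suffix = c("", ".y")) %>%
  select(-ends_with(".y"))
```

Dropping the `.y` copies is only safe when they duplicate the columns kept from the first table, as is the case here.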

---

### About `inner_join`

.fl.w-40.pa2[

```r
inner_join(
  x, y,
* by = NULL,
  copy = FALSE,
* suffix = c(".x", ".y"),
  ...,
* keep = FALSE,
* na_matches = "na")
```

]

.fl.w-60.pa2[

- `by`:
  - `by=c("A1", "A3", "A7")` row `r` from `R` and `s` from `S` match if `r.A1 == s.A1`,
  `r.A3 == s.A3`,   `r.A7 == s.A7`
  - `by=c("A1"="B", "A3"="C", "A7"="D")` row `r` from `R` and `s` from `S` match if `r.A1 == s.B`,
  `r.A3 == s.C`,   `r.A7 == s.D`

- `suffix`: If there are non-joined duplicate variables in `x` and `y`, these suffixes will be added to the output to disambiguate them.

- `keep`: Should the join keys from _both_ `x` and `y` be preserved in the output?

- `na_matches`: Should NA and NaN values match one another?

]

???
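
A short sketch of `na_matches` and `keep` on hypothetical two-row tables:

```r
library(dplyr)

x <- tibble(k = c(1, NA), a = c("x1", "x2"))
y <- tibble(k = c(1, NA), b = c("y1", "y2"))

# Default (na_matches = "na"): NA keys match each other
n_default <- nrow(inner_join(x, y, by = "k"))

# na_matches = "never": rows with NA keys are never matched
n_never <- nrow(inner_join(x, y, by = "k", na_matches = "never"))

# keep = TRUE: key columns from both inputs survive, with suffixes
kept_names <- names(inner_join(x, y, by = "k", keep = TRUE))
```

`na_matches = "never"` is closer to the behavior of SQL joins, where `NULL` never equals `NULL`.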

---

### Join flavors

Different flavors of `join` can be used to join one table to columns from another, matching values with the rows that they correspond to

Each join retains a different combination of values from the tables

- `left_join(x, y, by = NULL, suffix = c(".x", ".y"), ...)` Join matching values from `y` to `x`.
Retain all rows of `x`, padding missing values from `y` with `NA`

- `semi_join` ...

- `anti_join` ...

???

---

### Toy examples : `inner_join`

.fl.w-30.pa2.f6[

]

.fl.w-70.pa2.f6[

<table>
<caption>inner_join(S, R, by=c("E"="A1"))</caption>
 <thead>
  <tr>
   <th style="text-align:right;"> E </th>
   <th style="text-align:left;"> F </th>
   <th style="text-align:left;"> G </th>
   <th style="text-align:left;"> D.x </th>
   <th style="text-align:left;"> A2 </th>
   <th style="text-align:left;"> A3 </th>
   <th style="text-align:left;"> D.y </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:left;"> e </td>
   <td style="text-align:left;"> 2021-10-22 </td>
   <td style="text-align:left;"> c </td>
   <td style="text-align:left;"> e </td>
   <td style="text-align:left;"> 2021-10-28 </td>
   <td style="text-align:left;"> q </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:left;"> n </td>
   <td style="text-align:left;"> 2021-10-23 </td>
   <td style="text-align:left;"> i </td>
   <td style="text-align:left;"> a </td>
   <td style="text-align:left;"> 2021-11-04 </td>
   <td style="text-align:left;"> o </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 10 </td>
   <td style="text-align:left;"> r </td>
   <td style="text-align:left;"> 2021-10-25 </td>
   <td style="text-align:left;"> e </td>
   <td style="text-align:left;"> d </td>
   <td style="text-align:left;"> 2021-11-18 </td>
   <td style="text-align:left;"> d </td>
  </tr>
</tbody>
</table>

]

---

### Toy examples : `left_join`

.fl.w-30.pa2.f6[

]

.fl.w-70.pa2.f6[

<table>
<caption>left_join(S, R, by=c("E"="A1"))</caption>
 <thead>
  <tr>
   <th style="text-align:right;"> E </th>
   <th style="text-align:left;"> F </th>
   <th style="text-align:left;"> G </th>
   <th style="text-align:left;"> D.x </th>
   <th style="text-align:left;"> A2 </th>
   <th style="text-align:left;"> A3 </th>
   <th style="text-align:left;"> D.y </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:left;"> y </td>
   <td style="text-align:left;"> 2021-10-21 </td>
   <td style="text-align:left;"> o </td>
   <td style="text-align:left;"> NA </td>
   <td style="text-align:left;"> NA </td>
   <td style="text-align:left;"> NA </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:left;"> e </td>
   <td style="text-align:left;"> 2021-10-22 </td>
   <td style="text-align:left;"> c </td>
   <td style="text-align:left;"> e </td>
   <td style="text-align:left;"> 2021-10-28 </td>
   <td style="text-align:left;"> q </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:left;"> n </td>
   <td style="text-align:left;"> 2021-10-23 </td>
   <td style="text-align:left;"> i </td>
   <td style="text-align:left;"> a </td>
   <td style="text-align:left;"> 2021-11-04 </td>
   <td style="text-align:left;"> o </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 9 </td>
   <td style="text-align:left;"> t </td>
   <td style="text-align:left;"> 2021-10-24 </td>
   <td style="text-align:left;"> d </td>
   <td style="text-align:left;"> NA </td>
   <td style="text-align:left;"> NA </td>
   <td style="text-align:left;"> NA </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 10 </td>
   <td style="text-align:left;"> r </td>
   <td style="text-align:left;"> 2021-10-25 </td>
   <td style="text-align:left;"> e </td>
   <td style="text-align:left;"> d </td>
   <td style="text-align:left;"> 2021-11-18 </td>
   <td style="text-align:left;"> d </td>
  </tr>
</tbody>
</table>
]

---

### Toy examples : `semi_join` `anti_join`

.fl.w-30.pa2.f6[

]

.fl.w-70.pa2.f6[

]
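
A sketch of both filtering joins. `S` is rebuilt from the `left_join` output shown on the previous slide; `R` contains only the rows visible in that output, so the original `R` may hold more rows.

```r
library(dplyr)

# S as it appears in the left_join output
S <- tibble(E = c(3, 4, 6, 9, 10),
            F = c("y", "e", "n", "t", "r"),
            G = as.Date("2021-10-21") + 0:4,
            D = c("o", "c", "i", "d", "e"))

# Only the rows of R that are visible in the join output (assumption)
R <- tibble(A1 = c(4, 6, 10),
            A2 = c("e", "a", "d"),
            A3 = as.Date(c("2021-10-28", "2021-11-04", "2021-11-18")),
            D  = c("q", "o", "d"))

# Rows of S that have at least one match in R (columns of S only)
semi <- semi_join(S, R, by = c("E" = "A1"))

# Rows of S that have no match in R
anti <- anti_join(S, R, by = c("E" = "A1"))
```

Together, `semi` and `anti` partition the rows of `S`; neither adds columns from `R`.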

---
### Conditional/ `$\theta$` -join

In relational databases, joins are not restricted to _natural joins_

`$$U \leftarrow R \bowtie_{\theta} S$$`

reads as

`$$\begin{array}{rl}
T & \leftarrow R \times S\\
U & \leftarrow \sigma(T, \theta)\end{array}$$`

where

- `$R \times S$` is the _cartesian product_ of `$R$` and `$S$`

- `$\theta$` is a boolean expression that can be evaluated on any tuple of `$R \times S$`

---

### Do we need conditional/ `$\theta$` -joins?

- <svg aria-hidden="true" role="img" viewBox="0 0 512 512" style="height:1em;width:1em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M320 408c0-6.428-.8457-12.66-2.434-18.6C338.2 376.7 352 353.9 352 328c0-6.428-.8457-12.66-2.434-18.6C370.2 296.7 384 273.9 384 248c0-2.705-.1484-5.373-.4414-8H440C479.7 240 512 207.7 512 168S479.7 96 440 96H243.7C227.5 76.51 203.2 64 176 64H126.1C94.02 64 64.47 81.1 49 108.6L17.65 164.5C6.104 185.1 0 208.4 0 231.8v107.9C0 417.1 64.6 480 144 480h104C287.7 480 320 447.7 320 408zM280 304c13.23 0 24 10.78 24 24S293.2 352 280 352H232.1C218.9 352 208 341.2 208 328S218.8 304 232 304H280zM312 224c13.23 0 24 10.78 24 24S325.2 272 312 272h-48c-3.029 0-5.875-.7012-8.545-1.73C260.7 259.9 264 248.4 264 236V224H312zM440 144c13.23 0 24 10.78 24 24S453.2 192 440 192h-176V152c0-2.686-.5566-5.217-.793-7.84C263.5 144.2 263.7 144 264 144H440zM48 339.7V231.8c0-15.25 3.984-30.41 11.52-43.88l31.34-55.78C97.84 119.7 111.4 112 126.1 112H176c22.06 0 40 17.94 40 40v84c0 15.44-12.56 28-28 28S160 251.4 160 236V184C160 170.8 149.3 160 136 160S112 170.8 112 184v52c0 33.23 21.58 61.25 51.36 71.54C161.3 314 160 320.9 160 328c0 5.041 1.166 9.836 2.178 14.66C137.4 354 120 378.1 120 408c0 7.684 1.557 14.94 3.836 21.87C80.56 420.9 48 383.9 48 339.7zM192 432c-13.23 0-24-10.78-24-24S178.8 384 192 384h56c13.23 0 24 10.78 24 24s-10.77 24-24 24H192z"/></svg>: We can implement `$\theta$`/conditional-joins by pipelining a cross product and a filtering

.fr[
[About conditional join](https://www.r-bloggers.com/2018/02/in-between-a-rock-and-a-conditional-join/)
]
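
A minimal sketch of the cross product + filter pipeline, on hypothetical one-column tables with `$\theta$` taken to be `a < b`. It assumes `dplyr` 1.1.0 or later, which provides `cross_join()` and `join_by()`.

```r
library(dplyr)

R <- tibble(a = c(1, 5, 9))
S <- tibble(b = c(4, 8))

# T <- R x S, then U <- sigma(T, theta) with theta: a < b
U <- cross_join(R, S) %>%
  filter(a < b)

# dplyr >= 1.1.0 can also express the condition directly in the join
U_direct <- inner_join(R, S, by = join_by(a < b))
```

The cross product has `nrow(R) * nrow(S)` rows, so this pipeline is only reasonable for small tables.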

---

### A conditional join between `flights` and `weather`

- The natural join between `flights` and `weather` we implemented can be regarded as an ad hoc conditional join between normalized versions of `weather` and `flights` <svg aria-hidden="true" role="img" viewBox="0 0 384 512" style="height:1em;width:0.75em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M112.1 454.3c0 6.297 1.816 12.44 5.284 17.69l17.14 25.69c5.25 7.875 17.17 14.28 26.64 14.28h61.67c9.438 0 21.36-6.401 26.61-14.28l17.08-25.68c2.938-4.438 5.348-12.37 5.348-17.7L272 415.1h-160L112.1 454.3zM192 0C90.02 .3203 16 82.97 16 175.1c0 44.38 16.44 84.84 43.56 115.8c16.53 18.84 42.34 58.23 52.22 91.45c.0313 .25 .0938 .5166 .125 .7823h160.2c.0313-.2656 .0938-.5166 .125-.7823c9.875-33.22 35.69-72.61 52.22-91.45C351.6 260.8 368 220.4 368 175.1C368 78.8 289.2 .0039 192 0zM288.4 260.1c-15.66 17.85-35.04 46.3-49.05 75.89h-94.61c-14.01-29.59-33.39-58.04-49.04-75.88C75.24 236.8 64 206.1 64 175.1C64 113.3 112.1 48.25 191.1 48C262.6 48 320 105.4 320 175.1C320 206.1 308.8 236.8 288.4 260.1zM176 80C131.9 80 96 115.9 96 160c0 8.844 7.156 16 16 16S128 168.8 128 160c0-26.47 21.53-48 48-48c8.844 0 16-7.148 16-15.99S184.8 80 176 80z"/></svg>

- Tables `flights` and `weather` are redundant: `year`, `month`, `day`, `hour` can be computed from `time_hour`

- Assume `flights` and `weather` are trimmed so as to become non-redundant

- The conditional join is then based on _truncations_ of variables `time_hour`

```sql
SELECT *
FROM flights AS f, weather AS w
WHERE f.origin = w.origin
  AND date_trunc('hour', f.time_hour) = date_trunc('hour', w.time_hour)
```

- Adding redundant columns to `flights` and `weather` allows us to transform
a tricky conditional join into a simple natural join <svg aria-hidden="true" role="img" viewBox="0 0 640 512" style="height:1em;width:1.25em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M320 128V49.1L186.6 .3c-11.4-4.2-24 .9-29.5 11.7L71.8 181.1c-30.8 61-8 133.8 48.1 167.4l-28 77.4L32.1 403.9C19.7 399.4 6 405.8 1.4 418.3s1.9 26.3 14.3 30.8l164.6 60.3c12.4 4.5 26.1-1.9 30.6-14.4s-1.9-26.3-14.3-30.8l-59.9-21.9 28-77.3c68.1 11.6 135.7-32.8 150.1-103.6l5.1-24.8 5.1 24.8c14.5 70.8 82 115.2 150.1 103.6l28 77.3-59.9 21.9c-12.4 4.5-18.8 18.3-14.3 30.8s18.2 18.9 30.6 14.4l164.6-60.3c12.4-4.5 18.8-18.3 14.3-30.8s-18.2-18.9-30.6-14.4l-59.9 21.9-28-77.4c56.1-33.6 78.8-106.4 48.1-167.4L482.9 12C477.4 1.1 464.7-3.9 453.4 .3L320 49.1V128h0zm-35.7 44.4L153.9 124.6l36.3-71.9L300.6 93.1l-16.2 79.3zm71.3 0L339.4 93.1 449.8 52.7l36.3 71.9L355.7 172.4z"/></svg>
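
A sketch of the truncation idea in R, on hypothetical non-redundant toy tables (`fl`, `we`, `dep_dt` and the temperatures are made up). Truncating the departure time with `lubridate::floor_date()` turns the conditional join into an equi-join.

```r
library(dplyr)
library(lubridate)

# Hypothetical, non-redundant toy tables
fl <- tibble(origin = c("EWR", "JFK"),
             dep_dt = ymd_hm(c("2013-01-01 05:17", "2013-01-01 06:42")))
we <- tibble(origin    = c("EWR", "JFK", "JFK"),
             time_hour = ymd_hm(c("2013-01-01 05:00",
                                  "2013-01-01 05:00",
                                  "2013-01-01 06:00")),
             temp      = c(39.0, 39.9, 41.0))

# Truncate the departure time to the hour, then the conditional join
# reduces to an equi-join on (origin, time_hour)
fw <- fl %>%
  mutate(time_hour = floor_date(dep_dt, unit = "hour")) %>%
  inner_join(we, by = c("origin", "time_hour"))
```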

---
template: inter-slide
name: beyonddplyr

## Creating new columns

---

Creation of new columns may happen

- on the fly

- when altering (enriching) the schema of a table

In databases, creation of new columns may be the result of a query or be the result of altering a table schema with `ALTER TABLE ... ADD COLUMN ...`

In the `tidyverse`, we use the verbs `mutate`  or `add_column` to add columns to the input table

---

### `mutate`

.fl.w-50.pa2[

```r
*mutate(
  .data,
* new_col= expression,
* ...,
  .keep = c("all", "used", "unused", "none"),
  .before = NULL,
  .after = NULL
)
```

]

.fl.w-50.pa2[

`.data`: the input data frame

`new_col= expression`:

-  `new_col` is the name of a new column

-  `expression` is evaluated on the columns of `.data`: its value has one entry per row, or it is a vector of length `1` (which is recycled)

- `.keep = "all"` is the default behavior: it retains all columns from `.data`

]
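
A minimal sketch of these arguments on a hypothetical three-row tibble:

```r
library(dplyr)

df <- tibble(x = 1:3, y = c(10, 20, 30), z = letters[1:3])

# Vectorised expression: one value per row
with_ratio <- df %>% mutate(ratio = y / x)

# Length-1 value: recycled to every row; .before controls placement
with_const <- df %>% mutate(const = 0, .before = x)

# .keep = "used": keep only the columns the expression reads, plus the new one
used_only <- df %>% mutate(ratio = y / x, .keep = "used")
```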

---

### Creating a categorical column to spot large delays

.fl.w-50.pa2[

```r
breaks_delay <- with(flights,
  c(min(arr_delay, na.rm=TRUE),
    0, 30,
    max(arr_delay, na.rm=TRUE)))

level_delay <- c("None",
                 "Moderate",
                 "Large")

flights %>%
* mutate(large_delay = cut(arr_delay,
*   breaks=breaks_delay,
*   labels=level_delay,
*   include.lowest=TRUE,
*   ordered_result=TRUE)) %>%
  select(large_delay, arr_delay) %>%
  sample_n(5)
```

]

.fl.w-50.pa2[

```
## # A tibble: 5 × 2
##   large_delay arr_delay
##   <ord>           <dbl>
## 1 Large             219
## 2 Moderate           18
## 3 None              -19
## 4 None              -16
## 5 None               -1
```
]

???

```r
flights %>%
* mutate(foo = if_else(arr_time > sched_arr_time,
                              arr_time - sched_arr_time,
                              0L,
                              missing = NA_integer_)) %>%
  group_by( (foo >0) & abs(foo - arr_delay)  > 100) %>%
  summarise(N=n())
```

```
## # A tibble: 3 × 2
##   `(foo > 0) & abs(foo - arr_delay) > 100`      N
##   <lgl>                                     <int>
## 1 FALSE                                    322281
## 2 TRUE                                       5157
## 3 NA                                         9338
```
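
A sketch of how `cut()` treats interval endpoints, on a hypothetical vector of delays. When the smallest break is the minimum of the data, the default right-closed intervals leave that minimum unclassified.

```r
x <- c(-20, 0, 15, 30, 200)
breaks <- c(min(x), 0, 30, max(x))
labels <- c("None", "Moderate", "Large")

# Default: intervals are (lo, hi], so the minimum itself falls outside
default_cut <- cut(x, breaks = breaks, labels = labels,
                   ordered_result = TRUE)

# include.lowest = TRUE closes the first interval on the left
closed_cut <- cut(x, breaks = breaks, labels = labels,
                  ordered_result = TRUE, include.lowest = TRUE)
```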

---

### Changing the class of a column

.fl.w-50.pa2[

```r
flights %>%
* mutate(large_delay = cut(arr_delay,
    breaks=breaks_delay,
    labels=level_delay,
    include.lowest=TRUE,
    ordered_result=TRUE),
*   origin = as.factor(origin),
*   dest = as.factor(dest)
  ) %>%
  select(large_delay,
    arr_delay,
    origin,
    dest) %>%
  sample_n(5)
```

]

.fl.w-50.pa2[

```
## # A tibble: 5 × 4
##   large_delay arr_delay origin dest 
##   <ord>           <dbl> <fct>  <fct>
## 1 None              -44 LGA    CVG  
## 2 None              -15 EWR    DAY  
## 3 Large             136 EWR    DEN  
## 4 None               -9 EWR    TPA  
## 5 Moderate           14 LGA    TPA
```
]

---
template: inter-slide
name: tidytables

## Tidy tables

---

Tidying tables is part of data cleaning

> A (tidy) dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative)

> Values are organised in two ways

> Every value belongs to a _variable_ and an _observation_

> A _variable_ contains all values that measure the same underlying attribute (like height, temperature, duration) across _units_

> An _observation_ contains all values measured on the same _unit_ (like a person, or a day, or a race) across attributes

> The principles of tidy data are tied to those of relational databases and Codd's relational algebra

.fr[[<svg aria-hidden="true" role="img" viewBox="0 0 448 512" style="height:1em;width:0.88em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M96 0C43 0 0 43 0 96V416c0 53 43 96 96 96H384h32c17.7 0 32-14.3 32-32s-14.3-32-32-32V384c17.7 0 32-14.3 32-32V32c0-17.7-14.3-32-32-32H384 96zm0 384H352v64H96c-17.7 0-32-14.3-32-32s14.3-32 32-32zm32-240c0-8.8 7.2-16 16-16H336c8.8 0 16 7.2 16 16s-7.2 16-16 16H144c-8.8 0-16-7.2-16-16zm16 48H336c8.8 0 16 7.2 16 16s-7.2 16-16 16H144c-8.8 0-16-7.2-16-16s7.2-16 16-16z"/></svg> The tidy data paper](https://vita.had.co.nz/papers/tidy-data.html)]

---

In a _tidy_ table

- Each variable is a column

- Each observation is a row

- Every cell is a single value

???

---

### Untidy data

> Column headers are values, not variable names.

> Multiple variables are stored in one column.

> Variables are stored in both rows and columns.

> Multiple types of observational units are stored in the same table.

> A single observational unit is stored in multiple tables.

> ...

.fr[ <svg aria-hidden="true" role="img" viewBox="0 0 512 512" style="height:1em;width:1em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M78.6 5C69.1-2.4 55.6-1.5 47 7L7 47c-8.5 8.5-9.4 22-2.1 31.6l80 104c4.5 5.9 11.6 9.4 19 9.4h54.1l109 109c-14.7 29-10 65.4 14.3 89.6l112 112c12.5 12.5 32.8 12.5 45.3 0l64-64c12.5-12.5 12.5-32.8 0-45.3l-112-112c-24.2-24.2-60.6-29-89.6-14.3l-109-109V104c0-7.5-3.5-14.5-9.4-19L78.6 5zM19.9 396.1C7.2 408.8 0 426.1 0 444.1C0 481.6 30.4 512 67.9 512c18 0 35.3-7.2 48-19.9L233.7 374.3c-7.8-20.9-9-43.6-3.6-65.1l-61.7-61.7L19.9 396.1zM512 144c0-10.5-1.1-20.7-3.2-30.5c-2.4-11.2-16.1-14.1-24.2-6l-63.9 63.9c-3 3-7.1 4.7-11.3 4.7H352c-8.8 0-16-7.2-16-16V102.6c0-4.2 1.7-8.3 4.7-11.3l63.9-63.9c8.1-8.1 5.2-21.8-6-24.2C388.7 1.1 378.5 0 368 0C288.5 0 224 64.5 224 144l0 .8 85.3 85.3c36-9.1 75.8 .5 104 28.7L429 274.5c49-23 83-72.8 83-130.5zM104 432c0 13.3-10.7 24-24 24s-24-10.7-24-24s10.7-24 24-24s24 10.7 24 24z"/></svg> ]

???

Sources of untidiness

---

### Functions from `tidyr::...`

- `pivot_wider` and `pivot_longer`

- `separate` and  `unite`

- Handling missing values with `complete`, `fill`, ...

- ...

[`tidyr` website](https://tidyr.tidyverse.org)
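
A short sketch of `separate()`, `unite()` and `fill()` on hypothetical toy tables (the `code` strings are made up):

```r
library(dplyr)
library(tidyr)

# separate(): split one column into several; unite(): the reverse
codes <- tibble(code = c("EWR_1", "JFK_2"))
parts <- codes %>%
  separate(code, into = c("origin", "n"), sep = "_", convert = TRUE)
glued <- parts %>% unite("code", origin, n, sep = "_")

# fill(): carry the last non-missing value downwards
gaps <- tibble(g = c("a", NA, "b", NA), v = 1:4)
filled <- gaps %>% fill(g)
```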

---

### Pivot longer

.fl.w-50.pa2[

> `pivot_longer()` is commonly needed to tidy wild-caught datasets as they often optimise for ease of data entry or ease of comparison rather than ease of analysis.

]

.fl.w-50.pa2[

```r
messy %>% pivot_longer(
* cols=c(-row),
  names_to = "name",
  values_to = "value",
)  %>% kable()
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> row </th>
   <th style="text-align:left;"> name </th>
   <th style="text-align:right;"> value </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> A </td>
   <td style="text-align:left;"> a </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> A </td>
   <td style="text-align:left;"> b </td>
   <td style="text-align:right;"> 4 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> A </td>
   <td style="text-align:left;"> c </td>
   <td style="text-align:right;"> 7 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> B </td>
   <td style="text-align:left;"> a </td>
   <td style="text-align:right;"> 2 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> B </td>
   <td style="text-align:left;"> b </td>
   <td style="text-align:right;"> 5 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> B </td>
   <td style="text-align:left;"> c </td>
   <td style="text-align:right;"> 8 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> C </td>
   <td style="text-align:left;"> a </td>
   <td style="text-align:right;"> 3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> C </td>
   <td style="text-align:left;"> b </td>
   <td style="text-align:right;"> 6 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> C </td>
   <td style="text-align:left;"> c </td>
   <td style="text-align:right;"> 9 </td>
  </tr>
</tbody>
</table>

]

???

> `pivot_longer()` makes datasets longer by increasing the number of rows and decreasing the number of columns. I don’t believe it makes sense to describe a dataset as being in “long form”. Length is a relative term, and you can only say (e.g.) that dataset A is longer than dataset B.
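
The table `messy` is not defined on this slide. The sketch below uses a reconstruction inferred from the displayed output, so the column types are an assumption.

```r
library(dplyr)
library(tidyr)

# Reconstruction of `messy`, inferred from the output shown on the slide
messy <- tibble(row = c("A", "B", "C"),
                a = c(1, 2, 3),
                b = c(4, 5, 6),
                c = c(7, 8, 9))

# Every column except `row` is folded into a (name, value) pair
long <- messy %>%
  pivot_longer(cols = -row, names_to = "name", values_to = "value")
```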

---

### Pivot wider

.fl.w-50.pa2[

```r
*pivot_wider(
  data,
* id_cols = NULL,
* names_from = name,
  names_prefix = "",
* values_from = value,
  ...
)
```
<svg aria-hidden="true" role="img" viewBox="0 0 512 512" style="height:1em;width:1em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M320 408c0-6.428-.8457-12.66-2.434-18.6C338.2 376.7 352 353.9 352 328c0-6.428-.8457-12.66-2.434-18.6C370.2 296.7 384 273.9 384 248c0-2.705-.1484-5.373-.4414-8H440C479.7 240 512 207.7 512 168S479.7 96 440 96H243.7C227.5 76.51 203.2 64 176 64H126.1C94.02 64 64.47 81.1 49 108.6L17.65 164.5C6.104 185.1 0 208.4 0 231.8v107.9C0 417.1 64.6 480 144 480h104C287.7 480 320 447.7 320 408zM280 304c13.23 0 24 10.78 24 24S293.2 352 280 352H232.1C218.9 352 208 341.2 208 328S218.8 304 232 304H280zM312 224c13.23 0 24 10.78 24 24S325.2 272 312 272h-48c-3.029 0-5.875-.7012-8.545-1.73C260.7 259.9 264 248.4 264 236V224H312zM440 144c13.23 0 24 10.78 24 24S453.2 192 440 192h-176V152c0-2.686-.5566-5.217-.793-7.84C263.5 144.2 263.7 144 264 144H440zM48 339.7V231.8c0-15.25 3.984-30.41 11.52-43.88l31.34-55.78C97.84 119.7 111.4 112 126.1 112H176c22.06 0 40 17.94 40 40v84c0 15.44-12.56 28-28 28S160 251.4 160 236V184C160 170.8 149.3 160 136 160S112 170.8 112 184v52c0 33.23 21.58 61.25 51.36 71.54C161.3 314 160 320.9 160 328c0 5.041 1.166 9.836 2.178 14.66C137.4 354 120 378.1 120 408c0 7.684 1.557 14.94 3.836 21.87C80.56 420.9 48 383.9 48 339.7zM192 432c-13.23 0-24-10.78-24-24S178.8 384 192 384h56c13.23 0 24 10.78 24 24s-10.77 24-24 24H192z"/></svg> some optional arguments are missing

]

.fl.w-50.pa2[
When reporting, we often use `pivot_wider` (explicitly or implicitly)
to make results more readable, possibly to conform to a tradition

- Life tables in demography and actuarial science
- Longitudinal data
- See slide [How many flights per day of week per departure airport?](#aggregate-pivot-wider)
]
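
---

### Pivot wider: a toy example

A minimal sketch of what `pivot_wider()` does, assuming a hypothetical toy table `long` (not part of `nycflights13`) with one row per airport and day. The names `long` and `wide` are for illustration only.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per (origin, day) pair
long <- tribble(
  ~origin, ~day,  ~count,
  "EWR",   "Mon", 3L,
  "EWR",   "Tue", 5L,
  "JFK",   "Mon", 2L,
  "JFK",   "Tue", 4L
)

# One row per origin, one column per distinct value of `day`
wide <- long %>%
  pivot_wider(
    id_cols = origin,
    names_from = day,
    values_from = count
  )

wide
```

The result has fewer rows and more columns: the values of `day` become column names and the values of `count` fill the cells.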

---
template: inter-slide
name: aggregation

## Aggregations

---

### How many flights per carrier?

.fl.w-50.pa2[

```r
flights %>%
* group_by(carrier) %>%
* summarise(count=n()) %>%
  arrange(desc(count))
```

```sql
SELECT carrier, COUNT(*) AS n
FROM flights
GROUP BY carrier
ORDER BY n DESC
```
]

.fl.w-50.pa2[

```
## # A tibble: 16 × 2
##    carrier count
##    <chr>   <int>
##  1 UA      58665
##  2 B6      54635
##  3 EV      54173
##  4 DL      48110
##  5 AA      32729
##  6 MQ      26397
##  7 US      20536
##  8 9E      18460
##  9 WN      12275
## 10 VX       5162
## 11 FL       3260
## 12 AS        714
## 13 F9        685
## 14 YV        601
## 15 HA        342
## 16 OO         32
```

]

???

> `group_by`

> `summarise`

> `arrange`
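
---

### How many flights per carrier? A shorthand

The `group_by()`/`summarise()`/`arrange()` chain can also be written with `dplyr::count()`. The sketch below uses a hypothetical toy table `toy` standing in for `flights`, so it runs without `nycflights13`.

```r
library(dplyr)

# Hypothetical toy table standing in for `flights`
toy <- tibble(carrier = c("UA", "B6", "UA", "DL", "UA", "B6"))

# Explicit version: group, count rows per group, sort by decreasing count
by_hand <- toy %>%
  group_by(carrier) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

# Shorthand: count() groups, tallies and (optionally) sorts in one call
shorthand <- toy %>%
  count(carrier, name = "count", sort = TRUE)

# Compare the two results column by column
identical(by_hand$carrier, shorthand$carrier)
identical(by_hand$count, shorthand$count)
```

The explicit chain remains the better choice as soon as several summaries (mean, median, ...) are computed per group.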

---
name: aggregate-pivot-wider

### How many flights per day of week per departure airport?

```r
flights %>%
* group_by(origin, day_of_week=wday(time_hour, abbr=TRUE, label=TRUE)) %>%
* summarise(count=n(), .groups="drop") %>%
* pivot_wider(
*   id_cols="origin",
*   names_from="day_of_week",
*   values_from="count") %>%
  kable(caption="Departures per day")
```


.plot-callout.f6[

<table>
<caption>Departures per day</caption>
 <thead>
  <tr>
   <th style="text-align:left;"> origin </th>
   <th style="text-align:right;"> Sun </th>
   <th style="text-align:right;"> Mon </th>
   <th style="text-align:right;"> Tue </th>
   <th style="text-align:right;"> Wed </th>
   <th style="text-align:right;"> Thu </th>
   <th style="text-align:right;"> Fri </th>
   <th style="text-align:right;"> Sat </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> EWR </td>
   <td style="text-align:right;"> 16425 </td>
   <td style="text-align:right;"> 18329 </td>
   <td style="text-align:right;"> 18243 </td>
   <td style="text-align:right;"> 18180 </td>
   <td style="text-align:right;"> 18169 </td>
   <td style="text-align:right;"> 18142 </td>
   <td style="text-align:right;"> 13347 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> JFK </td>
   <td style="text-align:right;"> 15966 </td>
   <td style="text-align:right;"> 16104 </td>
   <td style="text-align:right;"> 16017 </td>
   <td style="text-align:right;"> 15841 </td>
   <td style="text-align:right;"> 16087 </td>
   <td style="text-align:right;"> 16176 </td>
   <td style="text-align:right;"> 15088 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> LGA </td>
   <td style="text-align:right;"> 13966 </td>
   <td style="text-align:right;"> 16257 </td>
   <td style="text-align:right;"> 16162 </td>
   <td style="text-align:right;"> 16039 </td>
   <td style="text-align:right;"> 15963 </td>
   <td style="text-align:right;"> 15990 </td>
   <td style="text-align:right;"> 10285 </td>
  </tr>
</tbody>
</table>

]

---
name: pipe
class: middle, left, inverse
background-image: url('./img/pexels-andris-bergmanis-7891767.jpg')
background-size: cover

## Pipelines/chaining operations

---

### `%>%`, `|>` and other pipes

> All `dplyr` functions take a table as the first argument

> Rather than forcing the user to either save intermediate objects or nest functions, `dplyr` provides the `%>%` operator from `magrittr`

> `x %>% f(y)` turns into `f(x, y)`

> The result from one step is _piped_ into the next step

> Use `%>%` to rewrite nested operations as a chain that reads left-to-right/top-to-bottom

```r
g(f(x, y), z)

x %>%
  f(y) %>%
  g(z)
```
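
---

### `%>%`: a concrete chain

A small illustration of the rewriting rule with base R functions. The vector `x` and the names `nested` and `piped` are made up for the example.

```r
library(magrittr)

x <- c(4, 9, 16)

# Nested calls: must be read inside-out
nested <- round(mean(sqrt(x)), 1)

# Piped calls: read top-to-bottom, one step per line
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(1)

identical(nested, piped)
```

Both expressions compute the same value; only the reading order differs.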

---
exclude: true

### Unix pipe `|`

---

### Magrittr `%>%`

.fl.w-50.pa2[

`%>%` is not tied to `dplyr`

`%>%` can be used with packages from `tidyverse`

`%>%` can be used outside `tidyverse`, that is, with functions which take a table (or something else) as a second, third or keyword argument: the placeholder `.` marks where the left-hand side goes

]

.fl.w-50.pa2[

Second argument of `g` has the same type as the result of `f`

```r
g(z, f(x, y))

x %>%
  f(y) %>%
* g(z, .)
```

`x %>% f(y)` is a shorthand for `x %>% f(., y)`
]
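
---

### Magrittr `%>%`: the dot with a keyword argument

A sketch of the placeholder in a realistic setting, assuming the built-in `mtcars` data frame. `lm()` takes a formula first and the table as the `data` argument, so the dot tells `%>%` where to put the left-hand side.

```r
library(magrittr)

# The table is not the first argument of lm():
# the dot marks the argument that receives the left-hand side
fit <- mtcars %>%
  lm(mpg ~ wt, data = .)

coef(fit)
```

When the dot appears as an argument of the right-hand side call, `%>%` does not also insert the left-hand side as first argument.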

---

### Standard pipe `|>` (version ≥ 4.1)

As of version 4.1 (2021), base <svg aria-hidden="true" role="img" viewBox="0 0 581 512" style="height:1em;width:1.13em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"/></svg> offers a pipe operator denoted by `|>`

.fl.w-50.pa2[

`x |> f(y)` turns into `f(x, y)`

```r
g(f(x, y), z)

x |>
  f(y) |>
  g(z)
```

]

.fl.w-50.pa2[

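`|>` always passes the left-hand side as the first argument. The workaround for another argument is another new construct, the anonymous function shorthand `\(x)`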

```r
g(z, w=x)

x |>
  (\(x) g(z, w=x))()
```

```r
"une" |>
  (\(x) str_c("ceci n'est pas", x, sep=" "))() |>
  str_c("pipe", sep=" ") |>
  cat()
```

```
## ceci n'est pas une pipe
```
]
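
---

### Standard pipe `|>`: the `_` placeholder

Besides the anonymous-function workaround, more recent versions of R offer a placeholder for the native pipe. The sketch below assumes R ≥ 4.2 and the built-in `mtcars` data frame. The placeholder `_` must be passed as a named argument.

```r
# Requires R >= 4.2: `_` marks where the left-hand side goes,
# and it has to be supplied as a *named* argument
fit_placeholder <- mtcars |>
  lm(mpg ~ wt, data = _)

# Same computation with the anonymous-function construct (R >= 4.1)
fit_lambda <- mtcars |>
  (\(d) lm(mpg ~ wt, data = d))()

all.equal(coef(fit_placeholder), coef(fit_lambda))
```

The placeholder version is shorter, while the anonymous function also works when the left-hand side is needed several times.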

---

### Other pipes

`magrittr` offers several variants of `%>%`

- Tee operator `%T>%`
- Assignment pipe `%<>%`
- Exposition operator `%$%`
- ...

.fr[See [pipes for beginners](https://www.r-bloggers.com/2017/12/pipes-in-r-tutorial-for-beginners/)]
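
---

### Other pipes: a quick tour

A short sketch of the three variants, assuming `magrittr` is attached. The names `total`, `v` and `r` are made up for the example.

```r
library(magrittr)

# Tee: run print() for its side effect, then pass the *original* value on
total <- c(1, 2, 3) %T>%
  print() %>%
  sum()

# Assignment pipe: pipe, then overwrite the left-hand side with the result
v <- c(4, 9, 16)
v %<>% sqrt()

# Exposition: expose the columns of the left-hand side by name
r <- mtcars %$% cor(mpg, wt)
```

`%<>%` modifies `v` in place, which makes scripts harder to re-run step by step, so it may be wise to use it sparingly.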

---
template: inter-slide

- [R for Data Science](https://r4ds.had.co.nz)
  + [Data transformation](https://r4ds.had.co.nz/transform.html)
- RStudio cheat sheets
  + [dplyr](https://www.rstudio.com/resources/cheatsheets/)
  + [tidyr](https://www.rstudio.com/resources/cheatsheets/)
  + [data.table](https://www.rstudio.com/resources/cheatsheets/)
  + [readr](https://www.rstudio.com/resources/cheatsheets/)

---
exclude: true

```r
family <- tibble::tribble(
  ~family,  ~dob_child1,  ~dob_child2, ~gender_child1, ~gender_child2,
       1L, "1998-11-26", "2000-01-29",             1L,             2L,
       2L, "1996-06-22",           NA,             2L,             NA,
       3L, "2002-07-11", "2004-04-05",             2L,             2L,
       4L, "2004-10-10", "2009-08-27",             1L,             1L,
       5L, "2000-12-05", "2005-02-28",             2L,             1L,
)

family %>%
  mutate(across(starts_with("dob"),  readr::parse_date)) %>%
  pivot_longer(
   !family,
   names_to = c(".value", "child"),
   names_pattern = "([a-z]*)_child(.)",
   values_drop_na = TRUE
 )
```

```
## # A tibble: 9 × 4
##   family child dob        gender
##    <int> <chr> <date>      <int>
## 1      1 1     1998-11-26      1
## 2      1 2     2000-01-29      2
## 3      2 1     1996-06-22      2
## 4      3 1     2002-07-11      2
## 5      3 2     2004-04-05      2
## 6      4 1     2004-10-10      1
## 7      4 2     2009-08-27      1
## 8      5 1     2000-12-05      2
## 9      5 2     2005-02-28      1
```

```r
pivot_longer_spec
```

```
## function (data, spec, names_repair = "check_unique", values_drop_na = FALSE, 
##     values_ptypes = NULL, values_transform = NULL) 
## {
##     spec <- check_pivot_spec(spec)
##     spec <- deduplicate_spec(spec, data)
##     v_fct <- factor(spec$.value, levels = unique(spec$.value))
##     values <- split(spec$.name, v_fct)
##     value_names <- names(values)
##     value_keys <- split(spec[-(1:2)], v_fct)
##     keys <- vec_unique(spec[-(1:2)])
##     if (identical(values_ptypes, list())) {
##         values_ptypes <- NULL
##     }
##     values_ptypes <- check_list_of_ptypes(values_ptypes, value_names, 
##         "values_ptypes")
##     values_transform <- check_list_of_functions(values_transform, 
##         value_names, "values_transform")
##     vals <- set_names(vec_init(list(), length(values)), value_names)
##     for (value in value_names) {
##         cols <- values[[value]]
##         col_id <- vec_match(value_keys[[value]], keys)
##         val_cols <- vec_init(list(), nrow(keys))
##         val_cols[col_id] <- unname(as.list(data[cols]))
##         val_cols[-col_id] <- list(rep(NA, nrow(data)))
##         if (has_name(values_transform, value)) {
##             val_cols <- lapply(val_cols, values_transform[[value]])
##         }
##         val_type <- vec_ptype_common(!!!set_names(val_cols[col_id], 
##             cols), .ptype = values_ptypes[[value]])
##         out <- vec_c(!!!val_cols, .ptype = val_type)
##         n_vals <- nrow(data) * length(val_cols)
##         idx <- t(matrix(seq_len(n_vals), ncol = length(val_cols)))
##         vals[[value]] <- vec_slice(out, as.integer(idx))
##     }
##     vals <- as_tibble(vals)
##     df_out <- drop_cols(as_tibble(data, .name_repair = "minimal"), 
##         spec$.name)
##     out <- wrap_error_names(vec_cbind(vec_rep_each(df_out, vec_size(keys)), 
##         vec_rep(keys, vec_size(data)), vals, .name_repair = names_repair))
##     if (values_drop_na) {
##         out <- vec_slice(out, !vec_equal_na(vals))
##     }
##     out$.seq <- NULL
##     reconstruct_tibble(data, out)
## }
## <bytecode: 0x7fcbec32f820>
## <environment: namespace:tidyr>
```

```r
#> # A tibble: 5 × 5
#>   family dob_child1 dob_child2 gender_child1 gender_child2
#>    <int> <date>     <date>             <int>         <int>
#> 1      1 1998-11-26 2000-01-29             1             2
#> 2      2 1996-06-22 NA                     2            NA
#> 3      3 2002-07-11 2004-04-05             2             2
#> 4      4 2004-10-10 2009-08-27             1             1
#> 5      5 2000-12-05 2005-02-28             2             1
```

---

background-image: url('./img/pexels-cottonbro-3171837.jpg')
background-size: cover

# The End