name: inter-slide class: left, middle, inverse {{ content }} --- name: layout-general layout: true class: left, middle <style> .remark-slide-number { position: inherit; } .remark-slide-number .progress-bar-container { position: absolute; bottom: 0; height: 4px; display: block; left: 0; right: 0; } .remark-slide-number .progress-bar { height: 100%; background-color: red; } /* custom.css */ .plot-callout { width: 300px; bottom: 5%; right: 5%; position: absolute; padding: 0px; z-index: 100; } .plot-callout img { width: 100%; border: 1px solid #23373B; } </style>
--- class: middle, left, inverse # Analyse des Données : Introduction to table manipulation(s) ### 2023-01-15 #### [Master I MFA et MIDS](https://master.math.univ-paris-diderot.fr/annee/m1-isifar/) #### [Analyse de Données](http://stephane-v-boucheron.fr/courses/isidata/) #### [Stéphane Boucheron](http://stephane-v-boucheron.fr) --- template: inter-slide ##
### [Tables](#dt) ### [SQL and Relational algebra with `dplyr`](#sql) ### [Tidy tables](#tidytables) ### [Aggregations](#aggregation) ### [Pivoting](#pivots) ### [Pipe](#pipe) --- template: inter-slide name: dt ## Tables --- ### Tables (examples) - Speadsheets (Excel) -
Relational tables - Dataframes in datascience frameworks -
: `data.frame`, `tibble`, ... -
: `pandas.dataframe` - `spark`: `dataframe` - `Dask`: `dataframe` - and many others --- ### Tables (Why ?) In Data Science, each framework comes with its own flavor(s) of table(s)
Tables from relational databases serve as inspiration In
legacy dataframes shape the life of statisticians and data scientists The purpose of this session is - describe dataframes from an end-user viewpoint (we leave aside implementations) - presenting tools for - accessing information within dataframes (*querying*) - summarizing information (*aggregation queries*) - cleaning/cleaning dataframes (*tidying*) ??? --- ### Loading tables and packages ```r pacman::p_load("tidyverse") # All we need is there # Almost all. Helper packages pacman::p_load("nycflights13") # for flight data # for manipulating dates and times pacman::p_load("lubridate") pacman::p_load("stringr") # nice table output for web presentations pacman::p_load("DT") pacman::p_load("gt") pacman::p_load("kableExtra") # data(flights) ``` --- ### About loaded packages - Metapackage [`tidyverse`](https://www.tidyverse.org) provides tools to create, query, tidy dataframes as well as tools to load data from various sources and save them in persistent storage - [`nycflights13`](https://github.com/tidyverse/nycflights13) provides the dataframes we play with - [`DT`](https://rstudio.github.io/DT/) is a gateway to a `Javascript` library that enables gracious display of dataframes on the WWW --- ### The `flights` table .f6[ ```r head(flights) %>% glimpse(width = 50) ``` ``` ## Rows: 6 ## Columns: 19 ## $ year <int> 2013, 2013, 2013, 2013, 2… ## $ month <int> 1, 1, 1, 1, 1, 1 ## $ day <int> 1, 1, 1, 1, 1, 1 ## $ dep_time <int> 517, 533, 542, 544, 554, … ## $ sched_dep_time <int> 515, 529, 540, 545, 600, … ## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4 ## $ arr_time <int> 830, 850, 923, 1004, 812,… ## $ sched_arr_time <int> 819, 830, 850, 1022, 837,… ## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12 ## $ carrier <chr> "UA", "UA", "AA", "B6", "… ## $ flight <int> 1545, 1714, 1141, 725, 46… ## $ tailnum <chr> "N14228", "N24211", "N619… ## $ origin <chr> "EWR", "LGA", "JFK", "JFK… ## $ dest <chr> "IAH", "IAH", "MIA", "BQN… ## $ air_time <dbl> 227, 227, 160, 183, 116, … ## $ distance <dbl> 1400, 1416, 1089, 1576, 7… ## $ hour <dbl> 5, 5, 5, 5, 6, 5 ## $ minute <dbl> 15, 29, 40, 45, 0, 58 ## $ time_hour <dttm> 2013-01-01 05:00:00, 2013… ``` ] ??? A dataframe is a two-ways (two-dimensional) table `head(df)` displays the first 6 rows of its first argument The vectors making a dataframe may have different types/classes (a dataframe is not a matrix) Compare `str()`, `glimpse()`, `head()` --- ### Table schema .fl.w-30.pa2.f6[ A table is a _list_ of _columns_ Each _column_ has - _name_ and - _type_ (_class_ in
) ```r *glimpse(flights, width=50) ``` ] .fl.w-70.pa2.f6[ ``` ## Rows: 336,776 ## Columns: 19 ## $ year <int> 2013, 2013, 2013, 2013, 2… ## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1… ## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1… ## $ dep_time <int> 517, 533, 542, 544, 554, … ## $ sched_dep_time <int> 515, 529, 540, 545, 600, … ## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, … ## $ arr_time <int> 830, 850, 923, 1004, 812,… ## $ sched_arr_time <int> 819, 830, 850, 1022, 837,… ## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12,… ## $ carrier <chr> "UA", "UA", "AA", "B6", "… ## $ flight <int> 1545, 1714, 1141, 725, 46… ## $ tailnum <chr> "N14228", "N24211", "N619… ## $ origin <chr> "EWR", "LGA", "JFK", "JFK… ## $ dest <chr> "IAH", "IAH", "MIA", "BQN… ## $ air_time <dbl> 227, 227, 160, 183, 116, … ## $ distance <dbl> 1400, 1416, 1089, 1576, 7… ## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6… ## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0,… ## $ time_hour <dttm> 2013-01-01 05:00:00, 201… ``` ] ??? - `flights` has 19 columns - Each column is a sequence (`vector`) of items with the same type/class - All columns have the same length - `flights` has 336776 rows - In
parlance, a row is (often) called a _tuple_ - In
parlance, a column is (often) called a _variable_ --- ### Column types .fl.w-50.pa2[ | class | columns | |:-----:|:---------| | `integer` | 'year' 'month' 'day' 'dep_time' 'sched_dep_time' 'arr_time' 'sched_arr_time' 'flight' | | `numeric` | 'dep_delay' 'arr_delay' 'air_time' 'distance' 'hour' 'minute' | | `character` | 'carrier' 'tailnum' 'origin' 'dest' | | `POSIXct` | 'time_hour' | | `POSIXt` | 'time_hour' | ] .fl.w-50.pa2[ A column, as a vector, may be belong to different classes Other classes: `factor` for categorical variables Columns `dest`, `origin` `carrier` could be coerced as factors Should columns `dest` and `origin` be coerced to the same factor? ] --- ### `nycflights13` ![](./img/bd_2023-nycflights13.png) --- ### Columns specification .fl.w-30.pa2[ ```r as.col_spec(flights) ``` ] .fl.w-70.pa2.f6[ ```r cols( year = col_integer(), month = col_integer(), day = col_integer(), dep_time = col_integer(), sched_dep_time = col_integer(), dep_delay = col_double(), arr_time = col_integer(), sched_arr_time = col_integer(), arr_delay = col_double(), carrier = col_character(), flight = col_integer(), tailnum = col_character(), origin = col_character(), dest = col_character(), air_time = col_double(), distance = col_double(), hour = col_double(), minute = col_double(), time_hour = col_datetime(format = "") ) ``` ] ??? `\(\approx\)` table schema in relational databases Column specifications are useful when loading dataframes from structured text files like `.csv` files `.csv` files do not contain typing information File loaders from package `readr` can be tipped about column classes using column specifications --- template: inter-slide name: sql ## SQL and Relational algebra with `dplyr` ??? --- - SQL stands for structured/simple Query Language - A query language elaborated during the 1970's at IBM by E. Codd - Geared towards exploitation of collections of relational tables - Less powerful but simpler to use than a programming language - `dplyr` is a principled
-friendly implementation of SQL ideas (and other things) At the core of SQL lies the idea of a table calculus called **relational algebra** --- ### Relational algebra (basics) Convention: `\(R\)` is a table with columns `\(A_1, \ldots, A_k\)` - Projection (picking columns) .f4[ `\(\pi(R, A_1, A_3)\)` ] - Selection/Filtering (picking rows) .f4[ `\(\sigma(R, {\text{condition}})\)` ] - Join (mulitple tables operation) .f4[ `\(\bowtie(R,S, {\text{condition}})\)` ]
Any operation produces a table
The schema of the derived table depends on the operation (but does not depend on the content/value of the operands) ??? Relational calculus relies on a small set of basic operations `\(\pi, \sigma, \bowtie\)` Each operation has one or two table **operands** and produce a table
There is more to SQL than relational algebra --- .f3[ `\(\pi(R, {A_1, A_3})\)` ] A projection `\(\pi(\cdot, {A_1, A_3})\)` is defined by a set of column names, say `\(A_1, A_3\)` If `\(R\)` has columns with given names, the result is a table with names `\(A_1, A_3\)` and one row per row of `\(R\)` A projection is parametrized by a list of column names ??? - Checks - Variable number of arguments or list argument - What if `\(R\)` does not have columns named `\(A_1, A_3\)`? --- ###
Package `dplyr` .fl.w-30.pa2[ - [_Tranformation_ chapter in R4DS](https://r4ds.had.co.nz/transform.html) - [Cheat sheet I](https://github.com/rstudio/cheatsheets/blob/main/data-transformation.pdf) - [Cheat sheet II](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) ] .fl.w-70.pa2[ <iframe src="https://dplyr.tidyverse.org" width="504" height="400px" data-external="1"></iframe> [https://dplyr.tidyverse.org](https://dplyr.tidyverse.org) ] ??? Base
provides tools to perform relational algebra operations But: - Base
does not provide a consistent API - The lack of a consistent API makes operation chaining tricky --- ### `dplyr` verbs Five basic verbs: - Pick observations/rows by their values (`filter()`) .fr[ σ(...) ] - Pick variables by their names (`select()`) .fr[ π(...)] - Reorder the rows (`arrange()`) - Create new variables with functions of existing variables (`mutate()`) - Collapse many values down to a single summary (`summarise()`) -- And - `group_by()` changes the scope of each function from operating on the entire dataset to operating on it group-by-group ??? --- ###
tidyverse .fl.w-50.pa2[ > All verbs work similarly: > The first argument is a data frame (table). > The subsequent arguments describe what to do with the data frame, using the variable/column names (without quotes) > The result is a new data frame (table) ] .fl.w-50.pa2[ <iframe src="https://www.tidyverse.org" width="504" height="400px" data-external="1"></iframe> ] ??? `dplyr` is part of `tidyverse` `dplyr` provides a consistent API --- ### `dplyr::select()` as a projection operator (π) `\(\pi(R, \underbrace{A_1, \ldots, A_3}_{\text{column names}})\)` ```r *select(R, A1, A3) ``` or, equivalently ```r *R %>% select(A1, A3) ```
`%>%` is the pipe operator from `magrittr`
`x %>% f(y, z)` is translated to `f(x, y, z)` and then evaluated ??? Function `select` has a variable number of arguments Function `select` has a variable number of arguments Function `select` allows to pick column by names (and much more) Note that in the current environment, there are no objects called `A1`, `A3` The consistent API allows to use the pipe operator --- ### Toy tables .fl.w-50.pa2[ ```r spam <- set.seed(42) R <- tibble(A1=seq(2, 10, 2), A2=sample(letters, 5), A3=seq(from=date("2021-10-21"), to=date("2021-11-20"), by=7), D=sample(letters, 5)) S <- tibble(E=c(3,4,6,9, 10), F=sample(letters, 5), G=seq(from=date("2021-10-21"), to=date("2021-10-21")+4, by=1), D=sample(letters,5) ) ``` ] .fl.w-50.pa2[ <table> <caption>R</caption> <thead> <tr> <th style="text-align:right;"> A1 </th> <th style="text-align:left;"> A2 </th> <th style="text-align:left;"> A3 </th> <th style="text-align:left;"> D </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> q </td> <td style="text-align:left;"> 2021-10-21 </td> <td style="text-align:left;"> r </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> e </td> <td style="text-align:left;"> 2021-10-28 </td> <td style="text-align:left;"> q </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:left;"> a </td> <td style="text-align:left;"> 2021-11-04 </td> <td style="text-align:left;"> o </td> </tr> <tr> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> j </td> <td style="text-align:left;"> 2021-11-11 </td> <td style="text-align:left;"> g </td> </tr> <tr> <td style="text-align:right;"> 10 </td> <td style="text-align:left;"> d </td> <td style="text-align:left;"> 2021-11-18 </td> <td style="text-align:left;"> d </td> </tr> </tbody> </table> <table> <caption>S</caption> <thead> <tr> <th style="text-align:right;"> E </th> <th style="text-align:left;"> F </th> <th style="text-align:left;"> G </th> <th style="text-align:left;"> D </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> y </td> <td style="text-align:left;"> 2021-10-21 </td> <td style="text-align:left;"> o </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> e </td> <td style="text-align:left;"> 2021-10-22 </td> <td style="text-align:left;"> c </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:left;"> n </td> <td style="text-align:left;"> 2021-10-23 </td> <td style="text-align:left;"> i </td> </tr> <tr> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> t </td> <td style="text-align:left;"> 2021-10-24 </td> <td style="text-align:left;"> d </td> </tr> <tr> <td style="text-align:right;"> 10 </td> <td style="text-align:left;"> r </td> <td style="text-align:left;"> 2021-10-25 </td> <td style="text-align:left;"> e </td> </tr> </tbody> </table> ] ??? --- ### Projecting `flights` on `origin` and `dest` .fl.w-50.pa2[ ```r flights %>% * select(origin, dest) %>% head() ``` A more readable equivalent of ```r head(select(flights, origin, dest), 10) ``` ] .fl.w-50.pa2[ ``` ## # A tibble: 6 × 2 ## origin dest ## <chr> <chr> ## 1 EWR IAH ## 2 LGA IAH ## 3 JFK MIA ## 4 JFK BQN ## 5 LGA ATL ## 6 EWR ORD ``` ```sql SELECT origin, dest FROM flights LIMIT 10; ``` ] ??? --- .f3[ `\(\sigma(R, \text{condition})\)` ] A selection/filtering operation is defined by a condition that can be checked on the rows of tables with convenient schema `\(\sigma(R, \text{condition})\)` returns a table with the same schema as `\(R\)` The resulting table contains the rows/tuples of `\(R\)` that satisfy `\(\text{condition}\)` `\(\sigma(R, \text{FALSE})\)` returns an empty table with the same schema as `\(R\)` --- ### Chaining filtering and projecting .fl.w-50.pa2[ ```r start <- date("2021-10-27") end <- start + 21 R %>% * filter(A2 > "n" , between(A3, start, end)) %>% * select(A1, A3) ``` ] .fl.w-50.pa2[ ``` ## # A tibble: 0 × 2 ## # … with 2 variables: A1 <dbl>, A3 <date> ``` ] --- ### Selecting `flights` based on `origin` and `dest` and then projecting on `dest, time_hour, carrier` .fl.w-50.pa2[ ```r flights %>% * filter(dest %in% c('ATL', 'LAX'), origin == 'JFK') %>% * select(dest, time_hour, carrier) %>% head() ``` - In SQL (
) parlance: ```sql SELECT dest, time_hour, carrier FROM flights WHERE dest IN ('ATL', 'LAX') AND origin = 'JFK' LIMIT 6 ``` ] .fl.w-50.pa2[ ``` ## # A tibble: 6 × 3 ## dest time_hour carrier ## <chr> <dttm> <chr> ## 1 LAX 2013-01-01 06:00:00 UA ## 2 ATL 2013-01-01 06:00:00 DL ## 3 LAX 2013-01-01 07:00:00 VX ## 4 LAX 2013-01-01 07:00:00 B6 ## 5 LAX 2013-01-01 07:00:00 AA ## 6 ATL 2013-01-01 08:00:00 DL ``` ] ??? Filtering is also called subsetting --- ### Logical operations `filter(R, condition_1, condition_2)` is meant to return the rows of `R` that satisfy `condition_1` **and** `condition_2` `filter(R, condition_1 & condition_2)` is an equivalent formulation `filter(R, condition_1 | condition_2)` is meant to return the rows of `R` that satisfy `condition_1` **or** `condition_2` (possibly both) `filter(R, xor(condition_1,condition_2))` is meant to return the rows of `R` that satisfy **either** `condition_1` **or** `condition_2` (just one of them) `filter(R, ! condition_1)` is meant to return the rows of `R` that **do not** satisfy `condition_1` --- ### Overview of set and boolean operations ![](./img/transform-logical.png) --- ###
Missing values! Numerical column `dep_time` contains many `NA's` (missing values) ```r summary(flights$dep_time) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's ## 1 907 1401 1349 1744 2400 8255 ```
Missing values (`NA` and variants) should be handled with care ```r NA & TRUE ``` [1] NA ```r NA | TRUE ``` [1] TRUE --- ### Truth tables for three-valued logic .fl.w-50.pa2[
uses _three-valued logic_
Generate complete truth tables for `and, or, xor` ```r v <- c(TRUE, FALSE, NA) # truth values *list_tt <- map(c(`&`, `|`, xor), * ~ outer(v, v, .x)) for (i in seq_along(list_tt)){ colnames(list_tt[[i]]) <- v rownames(list_tt[[i]]) <- v } names(list_tt) <- c('& AND', '| OR', 'XOR') ``` ] -- .fl.w-50.pa2.f6.tl[ <table class=" lightable-minimal" style='font-family: "Trebuchet MS", verdana, sans-serif; width: auto !important; '> <caption>& AND</caption> <thead> <tr> <th style="text-align:left;font-weight: bold;background-color: lightgray !important;"> </th> <th style="text-align:left;font-weight: bold;background-color: lightgray !important;"> TRUE </th> <th style="text-align:left;font-weight: bold;background-color: lightgray !important;"> FALSE </th> <th style="text-align:left;font-weight: bold;background-color: lightgray !important;"> NA </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;font-weight: bold;background-color: lightgray !important;"> TRUE </td> <td style="text-align:left;"> TRUE </td> <td style="text-align:left;"> FALSE </td> <td style="text-align:left;"> NA </td> </tr> <tr> <td style="text-align:left;font-weight: bold;background-color: lightgray !important;"> FALSE </td> <td style="text-align:left;"> FALSE </td> <td style="text-align:left;"> FALSE </td> <td style="text-align:left;"> FALSE </td> </tr> <tr> <td style="text-align:left;font-weight: bold;background-color: lightgray !important;"> NA </td> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> FALSE </td> <td style="text-align:left;"> NA </td> </tr> </tbody> </table><br> <!-- --> <table class=" lightable-minimal" style='font-family: "Trebuchet MS", verdana, sans-serif; width: auto !important; '> <caption>| OR</caption> <thead> <tr> <th style="text-align:left;font-weight: bold;background-color: lightgray !important;"> </th> <th style="text-align:left;font-weight: bold;background-color: lightgray !important;"> TRUE </th> <th style="text-align:left;font-weight: bold;background-color: lightgray !important;"> FALSE </th> <th style="text-align:left;font-weight: bold;background-color: lightgray !important;"> NA </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;font-weight: bold;background-color: lightgray !important;"> TRUE </td> <td style="text-align:left;"> TRUE </td> <td style="text-align:left;"> TRUE </td> <td style="text-align:left;"> TRUE </td> </tr> <tr> <td style="text-align:left;font-weight: bold;background-color: lightgray !important;"> FALSE </td> <td style="text-align:left;"> TRUE </td> <td style="text-align:left;"> FALSE </td> <td style="text-align:left;"> NA </td> </tr> <tr> <td style="text-align:left;font-weight: bold;background-color: lightgray !important;"> NA </td> <td style="text-align:left;"> TRUE </td> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> NA </td> </tr> </tbody> </table><br> <!-- --> <table class=" lightable-minimal" style='font-family: "Trebuchet MS", verdana, sans-serif; width: auto !important; '> <caption>XOR</caption> <thead> <tr> <th style="text-align:left;font-weight: bold;background-color: lightgray !important;"> </th> <th style="text-align:left;font-weight: bold;background-color: lightgray !important;"> TRUE </th> <th style="text-align:left;font-weight: bold;background-color: lightgray !important;"> FALSE </th> <th style="text-align:left;font-weight: bold;background-color: lightgray !important;"> NA </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;font-weight: bold;background-color: lightgray !important;"> TRUE </td> <td style="text-align:left;"> FALSE </td> <td style="text-align:left;"> TRUE </td> <td style="text-align:left;"> NA </td> </tr> <tr> <td style="text-align:left;font-weight: bold;background-color: lightgray !important;"> FALSE </td> <td style="text-align:left;"> TRUE </td> <td style="text-align:left;"> FALSE </td> <td style="text-align:left;"> NA </td> </tr> <tr> <td style="text-align:left;font-weight: bold;background-color: lightgray !important;"> NA </td> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> NA </td> </tr> </tbody> </table><br> <!-- --> ] --- exclude: true | <div style="width:80px">`and`</div> | <div style="width:80px">`TRUE`</div> | <div style="width:80px">`FALSE`</div> |<div style="width:80px">`NA`</div> | |:-----|:-----|:-----|:-----| |**`TRUE`** |TRUE |FALSE |NA | |**`FALSE`** |FALSE |FALSE |FALSE | |**`NA`** |NA |FALSE |NA | <br> | <div style="width:80px">`or`</div> | <div style="width:80px">`TRUE`</div> | <div style="width:80px">`FALSE`</div> |<div style="width:80px">`NA`</div> | |:-----|:----|:-----|:----| |**`TRUE`** |TRUE |TRUE |TRUE | |**`FALSE`** |TRUE |FALSE |NA | |**`NA`** |TRUE |NA |NA | <br> | <div style="width:80px">`xor`</div> | <div style="width:80px">`TRUE`</div> | <div style="width:80px">`FALSE`</div> |<div style="width:80px">`NA`</div> | |:-----|:-----|:-----|:--| |**`TRUE`** |FALSE |TRUE |NA | |**`FALSE`** |TRUE |FALSE |NA | |**`NA`** |NA |NA |NA | --- ### `slice()`: choosing rows based on location .fl.w-50.pa2[ In base
dataframe cells can be addressed by indices `flights[5000:5010,seq(1, 19, by=5)]` returns rows `5000:5010` and columns `1, 6, 11` from dataframe `flights` This can be done in a (verbose) `dplyr` way using `slice()` and `select()` ] .fl.w-50.pa2[ ```r flights %>% * slice(5001:5005) %>% select(seq(1, 19, by=5)) ``` ``` ## # A tibble: 5 × 4 ## year dep_delay flight distance ## <int> <dbl> <int> <dbl> ## 1 2013 3 4437 602 ## 2 2013 43 1016 187 ## 3 2013 -2 2190 1089 ## 4 2013 -1 91 2576 ## 5 2013 5 2131 502 ``` ] ??? Useful variant `slice_sample()` --- template: inter-slide ## Joins : multi-table queries --- .f3[ `\(\bowtie(R,S, {\text{condition}})\)` ] stands for > join rows/tuples of `\(R\)` and rows/tuples of `\(S\)` that satisfy `\(\text{condition}\)` --- ### `nycflights` tables The `nycflights13` package offers five related tables: - _Fact_ tables: - `flights` - `weather` (hourly weather conditions at different locations) - _Dimension_ tables: - `airports` (airports full names, location, ...) - `planes` (model, manufacturer, year, ...) - `airlines` (full names) This is an instance of a [Star Schema](https://en.wikipedia.org/wiki/Star_schema) .plot-callout[ <img src="./img/bd_2023-nycflights13.png" width="443" /> ] ??? --- ### Star schema > Fact tables record measurements for a specific event > Fact tables generally consist of numeric values, and foreign keys to dimensional data where descriptive information is kept .fr[From [Wikipedia](https://en.wikipedia.org/wiki/Star_schema)] --- ### Star schema illustrated ![](./img/relational-nycflights.png) --- ###
weather conditions .f6[ ```r weather %>% glimpse(width = 50) ``` ``` ## Rows: 26,115 ## Columns: 15 ## $ origin <chr> "EWR", "EWR", "EWR", "EWR", "… ## $ year <int> 2013, 2013, 2013, 2013, 2013,… ## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,… ## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,… ## $ hour <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10… ## $ temp <dbl> 39.02, 39.02, 39.02, 39.92, 3… ## $ dewp <dbl> 26.06, 26.96, 28.04, 28.04, 2… ## $ humid <dbl> 59.37, 61.63, 64.43, 62.21, 6… ## $ wind_dir <dbl> 270, 250, 240, 250, 260, 240,… ## $ wind_speed <dbl> 10.35702, 8.05546, 11.50780, … ## $ wind_gust <dbl> NA, NA, NA, NA, NA, NA, NA, N… ## $ precip <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,… ## $ pressure <dbl> 1012.0, 1012.3, 1012.5, 1012.… ## $ visib <dbl> 10, 10, 10, 10, 10, 10, 10, 1… ## $ time_hour <dttm> 2013-01-01 01:00:00, 2013-01… ``` ] --- ### Connecting `flights` and `weather` We want to complement information in `flights` using data `weather` Motivation: we would like to relate delays (`arr_delay`) and weather conditions - can we explain (justify) delays using weather data? - can we predict delays using weather data? --- ###
⋈
For each flight (row in `flights`) - `year`, `month`, `day`, `hour` (computed from `time_hour`) indicate the approaximate time of departure - `origin` indicates the airport where the plane takes off Each row of `weather` contains corresponding information
for each row of `flights` we look for rows of `weather` with matching values in `year`, `month`, `day`, `hour` and `origin`
NATURAL INNER JOIN between the tables --- ### `inner_join`: natural join .fl.w-40.pa2[ ```r f_w <- flights %>% * inner_join(weather) f_w %>% select(seq(1, ncol(f_w), by=2)) %>% glimpse(width=50) ``` ] .fl.w-60.pa2.f6[ ``` ## Joining, by = c("year", "month", "day", "origin", "hour", "time_hour") ``` ``` ## Rows: 335,220 ## Columns: 14 ## $ year <int> 2013, 2013, 2013, 2013, 2… ## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1… ## $ sched_dep_time <int> 515, 529, 540, 545, 600, … ## $ arr_time <int> 830, 850, 923, 1004, 812,… ## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12,… ## $ flight <int> 1545, 1714, 1141, 725, 46… ## $ origin <chr> "EWR", "LGA", "JFK", "JFK… ## $ air_time <dbl> 227, 227, 160, 183, 116, … ## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6… ## $ time_hour <dttm> 2013-01-01 05:00:00, 201… ## $ dewp <dbl> 28.04, 24.98, 26.96, 26.9… ## $ wind_dir <dbl> 260, 250, 260, 260, 260, … ## $ wind_gust <dbl> NA, 21.86482, NA, NA, 23.… ## $ pressure <dbl> 1011.9, 1011.4, 1012.1, 1… ``` ] ??? --- ### Join schema .f6[ ``` ## Rows: 335,220 ## Columns: 28 ## $ year <int> 2013, 2013, 2013, 2013, 2… ## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1… ## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1… ## $ dep_time <int> 517, 533, 542, 544, 554, … ## $ sched_dep_time <int> 515, 529, 540, 545, 600, … ## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, … ## $ arr_time <int> 830, 850, 923, 1004, 812,… ## $ sched_arr_time <int> 819, 830, 850, 1022, 837,… ## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12,… ## $ carrier <chr> "UA", "UA", "AA", "B6", "… ## $ flight <int> 1545, 1714, 1141, 725, 46… ## $ tailnum <chr> "N14228", "N24211", "N619… ## $ origin <chr> "EWR", "LGA", "JFK", "JFK… ## $ dest <chr> "IAH", "IAH", "MIA", "BQN… ## $ air_time <dbl> 227, 227, 160, 183, 116, … ## $ distance <dbl> 1400, 1416, 1089, 1576, 7… ## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6… ## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0,… ## $ time_hour <dttm> 2013-01-01 05:00:00, 201… ## $ temp <dbl> 39.02, 39.92, 39.02, 39.0… ## $ dewp <dbl> 28.04, 24.98, 26.96, 26.9… ## $ humid <dbl> 64.43, 54.81, 61.63, 61.6… ## $ wind_dir <dbl> 260, 250, 260, 260, 260, … ## $ wind_speed <dbl> 12.65858, 14.96014, 14.96… ## $ wind_gust <dbl> NA, 21.86482, NA, NA, 23.… ## $ precip <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ pressure <dbl> 1011.9, 1011.4, 1012.1, 1… ## $ visib <dbl> 10, 10, 10, 10, 10, 10, 1… ``` ] ??? The schema of the result is the union of the schemas of the operands A tuple from `flights` matches a tuple from `weather` if the tuple have the same values in the common columns year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr_time, arr_delay, carrier, flight, tailnum, origin, dest, air_time, distance, hour, minute, time_hour, temp, dewp, humid, wind_dir, wind_speed, wind_gust, precip, pressure, visib --- ### Which columns are used when joining tables `\(R\)` and `\(S\)`? - _default behavior_ of `inner_join`: all columns shared by `\(R\)` and `\(S\)`. Common columns have the same name in both schema. They are expected to have the same class - _manual definition_: in many settings, we want to overrule the default behavior. We specify manually which column from `\(R\)` should match which column from `\(S\)` --- ### Natural join of `flights` and `weather`: ```r common_names <- base::intersect(names(weather), names(flights)) setequal( inner_join(flights, weather), inner_join(flights, weather, by=common_names) ) ``` ``` ## Joining, by = c("year", "month", "day", "origin", "hour", "time_hour") ``` ``` ## [1] TRUE ``` --- ###
Are you surprised by the next chunk? ```r dtu <- inner_join(flights, weather, by=c("year", "month", "day", "origin", "hour")) dtv <- inner_join(flights, weather, by=c("origin", "time_hour")) setequal(dtu, dtv) ``` ``` ## [1] FALSE ``` Recall that columns `year`, `month` `day` `hour` can be computed from `time_hour` ```r # helper for datetime objects require(lubridate) flights %>% filter(year!=year(time_hour) | month!=month(time_hour) | day!=day(time_hour) | hour!=hour(time_hour)) %>% nrow() ``` ``` ## [1] 0 ``` ??? This is an example of functional dependency --- The two results do not have the same schema! ```r setdiff(colnames(dtv), colnames(dtu)) ``` ``` ## [1] "year.x" "month.x" "day.x" "hour.x" "time_hour" "year.y" ## [7] "month.y" "day.y" "hour.y" ``` ```r setdiff(colnames(dtu), colnames(dtv)) ``` ``` ## [1] "year" "month" "day" "hour" "time_hour.x" ## [6] "time_hour.y" ``` -- Fixing ```r dtu <- inner_join(flights, weather, by=c("year", "month", "day", "origin", "hour"), * suffix= c("", ".y")) %>% * select(-ends_with(".y")) dtv <- inner_join(flights, weather, by=c("origin", "time_hour"), * suffix= c("", ".y")) %>% * select(-ends_with(".y")) setequal(dtu, dtv) ``` ``` ## [1] TRUE ``` --- ### About `inner_join` .fl.w-40.pa2[ ```r inner_join( x, y, * by = NULL, copy = FALSE, * suffix = c(".x", ".y"), ..., * keep = FALSE, * na_matches = "na") ``` ] .fl.w-60.pa2[ - `by`: - `by=c("A1", "A3", "A7")` row `r` from `R` and `s` from `S` match if `r.A1 == s.A1`, `r.A3 == s.A3`, `r.A7 == s.A7` - `by=c("A1"="B", "A3"="C", "A7"="D")` row `r` from `R` and `s` from `S` match if `r.A1 == s.B`, `r.A3 == s.C`, `r.A7 == s.D` - `suffix`: If there are non-joined duplicate variables in `x` and `y`, these suffixes will be added to the output to disambiguate them. - `keep`: Should the join keys from _both_ `x` and `y` be preserved in the output? - `na_matches`: Should NA and NaN values match one another? .fl[From online documentation] ] ??? --- ### Join flavors Different flavors of `join` cab be used to join one table to columns from another, matching values with the rows that they correspond to Each join retains a different combination of values from the tables - `left_join(x, y, by = NULL, suffix = c(".x", ".y"), ...)` Join matching values from `y` to `x`. Retain all rows of `x` padding missing values from `y` by `NA` - `semi_join` ... - `anti_join` ... ??? --- ### Toy examples : `inner_join` .fl.w-30.pa2.f6[ <table> <caption>R</caption> <thead> <tr> <th style="text-align:right;"> A1 </th> <th style="text-align:left;"> A2 </th> <th style="text-align:left;"> A3 </th> <th style="text-align:left;"> D </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> q </td> <td style="text-align:left;"> 2021-10-21 </td> <td style="text-align:left;"> r </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> e </td> <td style="text-align:left;"> 2021-10-28 </td> <td style="text-align:left;"> q </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:left;"> a </td> <td style="text-align:left;"> 2021-11-04 </td> <td style="text-align:left;"> o </td> </tr> <tr> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> j </td> <td style="text-align:left;"> 2021-11-11 </td> <td style="text-align:left;"> g </td> </tr> <tr> <td style="text-align:right;"> 10 </td> <td style="text-align:left;"> d </td> <td style="text-align:left;"> 2021-11-18 </td> <td style="text-align:left;"> d </td> </tr> </tbody> </table> <table> <caption>S</caption> <thead> <tr> <th style="text-align:right;"> E </th> <th style="text-align:left;"> F </th> <th style="text-align:left;"> G </th> <th style="text-align:left;"> D </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> y </td> <td style="text-align:left;"> 2021-10-21 </td> <td style="text-align:left;"> o </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> e </td> <td style="text-align:left;"> 2021-10-22 </td> <td style="text-align:left;"> c </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:left;"> n </td> <td style="text-align:left;"> 2021-10-23 </td> <td style="text-align:left;"> i </td> </tr> <tr> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> t </td> <td style="text-align:left;"> 2021-10-24 </td> <td style="text-align:left;"> d </td> </tr> <tr> <td style="text-align:right;"> 10 </td> <td style="text-align:left;"> r </td> <td style="text-align:left;"> 2021-10-25 </td> <td style="text-align:left;"> e </td> </tr> </tbody> </table> ] .fl.w-70.pa2.f6[ <table> <caption>inner_join(S, R, by=c("E"="A1"))</caption> <thead> <tr> <th style="text-align:right;"> E </th> <th style="text-align:left;"> F </th> <th style="text-align:left;"> G </th> <th style="text-align:left;"> D.x </th> <th style="text-align:left;"> A2 </th> <th style="text-align:left;"> A3 </th> <th style="text-align:left;"> D.y </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> e </td> <td style="text-align:left;"> 2021-10-22 </td> <td style="text-align:left;"> c </td> <td style="text-align:left;"> e </td> <td style="text-align:left;"> 2021-10-28 </td> <td style="text-align:left;"> q </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:left;"> n </td> <td style="text-align:left;"> 2021-10-23 </td> <td style="text-align:left;"> i </td> <td style="text-align:left;"> a </td> <td style="text-align:left;"> 2021-11-04 </td> <td style="text-align:left;"> o </td> </tr> <tr> <td style="text-align:right;"> 10 </td> <td style="text-align:left;"> r </td> <td style="text-align:left;"> 2021-10-25 </td> <td style="text-align:left;"> e </td> <td style="text-align:left;"> d </td> <td style="text-align:left;"> 2021-11-18 </td> <td style="text-align:left;"> d </td> </tr> </tbody> </table> ] --- ### Toy examples : `left_join` .fl.w-30.pa2.f6[ <table> <caption>R</caption> <thead> <tr> <th style="text-align:right;"> A1 </th> <th style="text-align:left;"> A2 </th> <th style="text-align:left;"> A3 </th> <th style="text-align:left;"> D </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> q </td> <td style="text-align:left;"> 2021-10-21 </td> <td style="text-align:left;"> r </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> e </td> <td style="text-align:left;"> 2021-10-28 </td> <td style="text-align:left;"> q </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:left;"> a </td> <td style="text-align:left;"> 2021-11-04 </td> <td style="text-align:left;"> o </td> </tr> <tr> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> j </td> <td style="text-align:left;"> 2021-11-11 </td> <td style="text-align:left;"> g </td> </tr> <tr> <td style="text-align:right;"> 10 </td> <td style="text-align:left;"> d </td> <td style="text-align:left;"> 2021-11-18 </td> <td style="text-align:left;"> d </td> </tr> </tbody> </table> <table> <caption>S</caption> <thead> <tr> <th style="text-align:right;"> E </th> <th style="text-align:left;"> F </th> <th style="text-align:left;"> G </th> <th style="text-align:left;"> D </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> y </td> <td style="text-align:left;"> 2021-10-21 </td> <td style="text-align:left;"> o </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> e </td> <td style="text-align:left;"> 2021-10-22 </td> <td style="text-align:left;"> c </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:left;"> n </td> <td style="text-align:left;"> 2021-10-23 </td> <td style="text-align:left;"> i </td> </tr> <tr> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> t </td> <td style="text-align:left;"> 2021-10-24 </td> <td style="text-align:left;"> d </td> </tr> <tr> <td style="text-align:right;"> 10 </td> <td style="text-align:left;"> r </td> <td style="text-align:left;"> 2021-10-25 </td> <td style="text-align:left;"> e </td> </tr> </tbody> </table> ] .fl.w-70.pa2.f6[ <table> <caption>left_join(S, R, by=c("E"="A1"))</caption> <thead> <tr> <th style="text-align:right;"> E </th> <th style="text-align:left;"> F </th> <th style="text-align:left;"> G </th> <th style="text-align:left;"> D.x </th> <th style="text-align:left;"> A2 </th> <th style="text-align:left;"> A3 </th> <th style="text-align:left;"> D.y </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> y </td> <td style="text-align:left;"> 2021-10-21 </td> <td style="text-align:left;"> o </td> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> NA </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> e </td> <td style="text-align:left;"> 2021-10-22 </td> <td style="text-align:left;"> c </td> <td style="text-align:left;"> e </td> <td style="text-align:left;"> 2021-10-28 </td> <td style="text-align:left;"> q </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:left;"> n </td> <td style="text-align:left;"> 2021-10-23 </td> <td style="text-align:left;"> i </td> <td style="text-align:left;"> a </td> <td style="text-align:left;"> 2021-11-04 </td> <td style="text-align:left;"> o </td> </tr> <tr> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> t </td> <td style="text-align:left;"> 2021-10-24 </td> <td style="text-align:left;"> d </td> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> NA </td> </tr> <tr> <td style="text-align:right;"> 10 </td> <td style="text-align:left;"> r </td> <td style="text-align:left;"> 2021-10-25 </td> <td style="text-align:left;"> e </td> <td style="text-align:left;"> d </td> <td style="text-align:left;"> 2021-11-18 </td> <td style="text-align:left;"> d </td> </tr> </tbody> </table> ] --- ### Toy examples : `semi_join` `anti_join` .fl.w-30.pa2.f6[ <table> <caption>R</caption> <thead> <tr> <th style="text-align:right;"> A1 </th> <th style="text-align:left;"> A2 </th> <th style="text-align:left;"> A3 </th> <th style="text-align:left;"> D </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> q </td> <td style="text-align:left;"> 2021-10-21 </td> <td style="text-align:left;"> r </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> e </td> <td style="text-align:left;"> 2021-10-28 </td> <td style="text-align:left;"> q </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:left;"> a </td> <td style="text-align:left;"> 2021-11-04 </td> <td style="text-align:left;"> o </td> </tr> <tr> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> j </td> <td style="text-align:left;"> 2021-11-11 </td> <td style="text-align:left;"> g </td> </tr> <tr> <td style="text-align:right;"> 10 </td> <td style="text-align:left;"> d </td> <td style="text-align:left;"> 2021-11-18 </td> <td style="text-align:left;"> d </td> </tr> </tbody> </table> <table> <caption>S</caption> <thead> <tr> <th style="text-align:right;"> E </th> <th style="text-align:left;"> F </th> <th style="text-align:left;"> G </th> <th style="text-align:left;"> D </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> y </td> <td style="text-align:left;"> 2021-10-21 </td> <td style="text-align:left;"> o </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> e </td> <td style="text-align:left;"> 2021-10-22 </td> <td style="text-align:left;"> c </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:left;"> n </td> <td style="text-align:left;"> 2021-10-23 </td> <td style="text-align:left;"> i </td> </tr> <tr> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> t </td> <td style="text-align:left;"> 2021-10-24 </td> <td style="text-align:left;"> d </td> </tr> <tr> <td style="text-align:right;"> 10 </td> <td style="text-align:left;"> r </td> <td style="text-align:left;"> 2021-10-25 </td> <td style="text-align:left;"> e </td> </tr> </tbody> </table> ] .fl.w-70.pa2.f6[ <table> <caption>semi_join(S, R, by=c("E"="A1"))</caption> <thead> <tr> <th style="text-align:right;"> E </th> <th style="text-align:left;"> F </th> <th style="text-align:left;"> G </th> <th style="text-align:left;"> D </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> e </td> <td style="text-align:left;"> 2021-10-22 </td> <td style="text-align:left;"> c </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:left;"> n </td> <td style="text-align:left;"> 2021-10-23 </td> <td style="text-align:left;"> i </td> </tr> <tr> <td style="text-align:right;"> 10 </td> <td style="text-align:left;"> r </td> <td style="text-align:left;"> 2021-10-25 </td> <td style="text-align:left;"> e </td> </tr> </tbody> </table> <br><br> <table> <caption>anti_join(S, R, by=c("E"="A1"))</caption> <thead> <tr> <th style="text-align:right;"> E </th> <th style="text-align:left;"> F </th> <th style="text-align:left;"> G </th> <th style="text-align:left;"> D </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> y </td> <td style="text-align:left;"> 2021-10-21 </td> <td style="text-align:left;"> o </td> </tr> <tr> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> t </td> <td style="text-align:left;"> 2021-10-24 </td> <td style="text-align:left;"> d </td> </tr> </tbody> </table> ] --- ### Conditional/ `\(\theta\)` -join In relational databases, joins are not restricted to _natural joins_ -- `$$U \leftarrow R \bowtie_{\theta} S$$` reads as `$$\begin{array}{rl} T & \leftarrow R \times S\\ U & \leftarrow \sigma(T, \theta)\end{array}$$` where - `\(R \times S\)` is the _cartesian product_ of `\(R\)` and `\(S\)` - `\(\theta\)` is a boolean expression that can be evaluated on any tuple of `\(R \times S\)` --- ### Do we need conditional/ `\(\theta\)` -joins? -
: We can implement `\(\theta\)`/conditional-joins by pipelining a cross product and a filtering -- -
: Cross products are costly: + `\(\#\text{rows}(R \times S) = \#\text{rows}(R) \times \#\text{rows}(S)\)` + `\(\#\text{cols}(R \times S) = \#\text{cols}(R) + \#\text{cols}(S)\)` -- -
: RDBMS use query planning and optimization, indexing to circumvent the cross product bottleneck (when possible) -- -
: if we need to perform a `\(\theta\)`-join + outsource it to a RDBMS, or + design an ad hoc pipeline .fr[ [About conditional join](https://www.r-bloggers.com/2018/02/in-between-a-rock-and-a-conditional-join/) ] --- ### A conditional join between `flights` and `weather` - The natural join between `flights` and `weather` we implemented can be regarded as an ad hoc conditional join between normalized versions of `weather` and `flights`
- Table `flights` and `weather` are redundant: `year`, `month`, `day`, `hour` can be computed from `time_hour` - Assume `flights` and `weather` are trimmed so as to become irredundant - The conditional join is then based on _truncations_ of variables `time_hour` ```sql SELECT * FROM flights AS f, weather AS w WHERE date_trunc('hour', f.time_hour) = date_trunc('hour', w.time_hour) ``` - Adding redundant columns to `flights` and `weather` allows us to transform a tricky conditional join into a simple natural join
.fr[ [PostgreSQL documentation](https://www.postgresql.org/docs/14/index.html) ] --- template: inter-slide name: beyonddplyr ## Creating new columns --- Creation of new columns may happen - on the fly - when altering (enriching) the schema of a table In databases, creation of new columns may be the result of a query or be the result of altering a table schema with `ALTER TABLE ADD COLUMN ...` In `tidyverse()` we use verbs `mutate` or `add_column` to add columns to the input table --- ### `mutate` .fl.w-50.pa2[ ```r *mutate( .data, * new_col= expression, * ..., .keep = c("all", "used", "unused", "none"), .before = NULL, .after = NULL ) ``` ] .fl.w-50.pa2[ `.data`: the input data frame `new_col= expression`: - `new_col` is the name of a new column - `expression` is evaluated on each row of `.data` or it is a vector of length `1` - `all` is the default behavior, retains all columns from `.data` ] --- ### Creating a categorical column to spot large delays .fl.w-50.pa2[ ```r breaks_delay <- with(flights, c(min(arr_delay, na.rm=TRUE), 0, 30, max(arr_delay, na.rm=TRUE))) level_delay <- c("None", "Moderate", "Large") flights %>% * mutate(large_delay = cut(arr_delay, * breaks=breaks_delay, * labels=level_delay, * ordered_result=TRUE)) %>% select(large_delay, arr_delay) %>% sample_n(5) ``` ] .fl.w-50.pa2[ ``` ## # A tibble: 5 × 2 ## large_delay arr_delay ## <ord> <dbl> ## 1 Large 219 ## 2 Moderate 18 ## 3 None -19 ## 4 None -16 ## 5 None -1 ``` ] ??? ```r flights %>% * mutate(foo = if_else(arr_time > sched_arr_time, arr_time - sched_arr_time, 0L, missing = NA_integer_)) %>% group_by( (foo >0) & abs(foo - arr_delay) > 100) %>% summarise(N=n()) ``` ``` ## # A tibble: 3 × 2 ## `(foo > 0) & abs(foo - arr_delay) > 100` N ## <lgl> <int> ## 1 FALSE 322281 ## 2 TRUE 5157 ## 3 NA 9338 ``` --- ### Changing the class of a column .fl.w-50.pa2[ ```r flights %>% * mutate(large_delay = cut(arr_delay, breaks=breaks_delay, labels=level_delay, ordered_result=TRUE), * origin = as.factor(origin), * dest = as.factor(dest) ) %>% select(large_delay, arr_delay, origin, dest) %>% sample_n(5) ``` ] .fl.w-50.pa2[ ``` ## # A tibble: 5 × 4 ## large_delay arr_delay origin dest ## <ord> <dbl> <fct> <fct> ## 1 None -44 LGA CVG ## 2 None -15 EWR DAY ## 3 Large 136 EWR DEN ## 4 None -9 EWR TPA ## 5 Moderate 14 LGA TPA ``` ] --- template: inter-slide name: tidytables ## Tidy tables --- Tidying tables is part of data cleaning > A (tidy) dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative) > Values are organised in two ways > Every value belongs to a _variable_ and an _observation_ > A _variable_ contains all values that measure the same underlying attribute (like height, temperature, duration) across _units_ > An _observation_ contains all values measured on the same _unit_ (like a person, or a day, or a race) across attributes > The principles of tidy data are tied to those of relational databases and Codd's relational algebra .fr[[
The tidy data paper](https://vita.had.co.nz/papers/tidy-data.html)] --- ###
`dplyr` functions expect and return _tidy_ tables In a _tidy_ table - Each variable is a column - Each observation is a row - Every cell is a single value .fr[[
The tidy data paper](https://vita.had.co.nz/papers/tidy-data.html)] ???
In order to tell whether a table is tidy, we need to know what is the _population_ under investigation, what are the observations/individuals, which measures are performed on each individual, ... --- ### Untidy data > Column headers are values, not variable names. > Multiple variables are stored in one column. > Variables are stored in both rows and columns. > Multiple types of observational units are stored in the same table. > A single observational unit is stored in multiple tables. > ... .fr[
] ??? Source of untidyness --- ### Functions from `tidyr::...` - `pivot_wider` and `pivot_longer` - `separate` and `unite` - Handling missing values with `complete`, `fill`, ... - ... [`tidyr` website](https://tidyr.tidyverse.org) --- ### Pivot longer .fl.w-50.pa2[ > `pivot_longer()` is commonly needed to tidy wild-caught datasets as they often optimise for ease of data entry or ease of comparison rather than ease of analysis. <table> <thead> <tr> <th style="text-align:left;"> row </th> <th style="text-align:right;"> a </th> <th style="text-align:right;"> b </th> <th style="text-align:right;"> c </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 7 </td> </tr> <tr> <td style="text-align:left;"> B </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 8 </td> </tr> <tr> <td style="text-align:left;"> C </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 9 </td> </tr> </tbody> </table> ] .fl.w-50.pa2[ ```r messy %>% pivot_longer( * cols=c(-row), names_to = "name", values_to = "value", ) %>% kable() ``` <table> <thead> <tr> <th style="text-align:left;"> row </th> <th style="text-align:left;"> name </th> <th style="text-align:right;"> value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> A </td> <td style="text-align:left;"> a </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> A </td> <td style="text-align:left;"> b </td> <td style="text-align:right;"> 4 </td> </tr> <tr> <td style="text-align:left;"> A </td> <td style="text-align:left;"> c </td> <td style="text-align:right;"> 7 </td> </tr> <tr> <td style="text-align:left;"> B </td> <td style="text-align:left;"> a </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:left;"> B </td> <td style="text-align:left;"> b </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:left;"> B </td> <td style="text-align:left;"> c </td> <td style="text-align:right;"> 8 </td> </tr> <tr> <td style="text-align:left;"> C </td> <td style="text-align:left;"> a </td> <td style="text-align:right;"> 3 </td> </tr> <tr> <td style="text-align:left;"> C </td> <td style="text-align:left;"> b </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:left;"> C </td> <td style="text-align:left;"> c </td> <td style="text-align:right;"> 9 </td> </tr> </tbody> </table> ] ??? > `pivot_longer()` makes datasets longer by increasing the number of rows and decreasing the number of columns. I don’t believe it makes sense to describe a dataset as being in “long form”. Length is a relative term, and you can only say (e.g.) that dataset A is longer than dataset B. --- ### Pivot wider .fl.w-50.pa2[ ```r *pivot_wider( data, * id_cols = NULL, * names_from = name, names_prefix = "", * values_from = value, ... ) ```
some optional arguments are missing ] .fl.w-50.pa2[ When reporting, we often use `pivot_wider` (explicitely or implicitely) to make results more readable, possibly to conform to a tradition - Life tables in demography and actuarial science - Longitudinal data - See slide [How many flights per day of week per departure airport?](#aggregate-pivot-wider) ] --- template: inter-slide name: aggregation ## Aggregations --- ### How many flights per carrier? .fl.w-50.pa2[ ```r flights %>% * group_by(carrier) %>% * summarise(count=n()) %>% arrange(desc(count)) ``` ```sql SELECT carrier, COUNT(*) AS n FROM flights GROUP BY carrier ORDER BY n DESCENDING ``` ] .fl.w-50.pa2[ ``` ## # A tibble: 16 × 2 ## carrier count ## <chr> <int> ## 1 UA 58665 ## 2 B6 54635 ## 3 EV 54173 ## 4 DL 48110 ## 5 AA 32729 ## 6 MQ 26397 ## 7 US 20536 ## 8 9E 18460 ## 9 WN 12275 ## 10 VX 5162 ## 11 FL 3260 ## 12 AS 714 ## 13 F9 685 ## 14 YV 601 ## 15 HA 342 ## 16 OO 32 ``` ] ??? > `group_by` > `summarise` > `arrange` --- name: aggregate-pivot-wider ### How many flights per day of week per departure airport? .f6[ ```r flights %>% * group_by(origin, wday(time_hour, abbr=T, label=T)) %>% * summarise(count=n(), .groups="drop") %>% rename(day_of_week=`wday(time_hour, abbr = T, label = T)`) %>% * pivot_wider( * id_cols="origin", * names_from="day_of_week", * values_from="count") %>% kable(caption="Departures per day") ``` ] .plot-callout.f6[ <table> <caption>Departures per day</caption> <thead> <tr> <th style="text-align:left;"> origin </th> <th style="text-align:right;"> Sun </th> <th style="text-align:right;"> Mon </th> <th style="text-align:right;"> Tue </th> <th style="text-align:right;"> Wed </th> <th style="text-align:right;"> Thu </th> <th style="text-align:right;"> Fri </th> <th style="text-align:right;"> Sat </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> EWR </td> <td style="text-align:right;"> 16425 </td> <td style="text-align:right;"> 18329 </td> <td style="text-align:right;"> 18243 </td> <td style="text-align:right;"> 18180 </td> <td style="text-align:right;"> 18169 </td> <td style="text-align:right;"> 18142 </td> <td style="text-align:right;"> 13347 </td> </tr> <tr> <td style="text-align:left;"> JFK </td> <td style="text-align:right;"> 15966 </td> <td style="text-align:right;"> 16104 </td> <td style="text-align:right;"> 16017 </td> <td style="text-align:right;"> 15841 </td> <td style="text-align:right;"> 16087 </td> <td style="text-align:right;"> 16176 </td> <td style="text-align:right;"> 15088 </td> </tr> <tr> <td style="text-align:left;"> LGA </td> <td style="text-align:right;"> 13966 </td> <td style="text-align:right;"> 16257 </td> <td style="text-align:right;"> 16162 </td> <td style="text-align:right;"> 16039 </td> <td style="text-align:right;"> 15963 </td> <td style="text-align:right;"> 15990 </td> <td style="text-align:right;"> 10285 </td> </tr> </tbody> </table> ] --- name: pipe class: middle, left, inverse background-image: url('./img/pexels-andris-bergmanis-7891767.jpg') background-size: cover ## Pipelines/chaining operations --- ### `%>%`, `|>` and other pipes > All `dplyr` functions take a table as the first argument > Rather than forcing the user to either save intermediate objects or nest functions, `dplyr` provides the `%>%` operator from `magrittr` > `x %>% f(y)` turns into `f(x, y)` > The result from one step is _piped_ into the next step > Use `%>%` to rewrite multiple operations that you can read left-to-right/top-to-bottom ```r g(f(x, y), z) x %>% f(y) %>% g(z) ``` .fr[From [dplyr vignette](https://dplyr.tidyverse.org/articles/dplyr.html)] --- exclude: true ### Unix pipe `|` --- ### Magrittr `%>%` .fl.w-50.pa2[ `%>%` is not tied to `dplyr` `%>%` can be used with packages from `tidyverse` `%>%` can be used outside `tidyverse` that is with functions which take a table (or something else) as a second, third or keyword argument
Use pronoun `.` to denote the LHS of the pipe expression ] .fl.w-50.pa2[ Second argument of `g` has the same type as the result of `f` ```r g(z, f(x, y)) x %>% f(y) %>% * g(z, .) ``` `x %>% f(y)` is a shorthand for `x %>% f(., y)` ] --- ### Standard pipe `|>` (version > 4.) As of version 4.1 (2021), base
offers a pipe operator denoted by `|>` .fl.w-50.pa2[ `x |> f(y)` turns into `f(x, y)` ```r g(f(x, y), z) x |> f(y) |> g(z) ``` ] .fl.w-50.pa2[
the standard pipe `|>` has no pronoun/placeholder to denote the LHS of the pipe expression The roundabout consists in using another new construct `\(x)` ```r g(z, w) x |> (\(x) g(z, w=x))() ``` ```r "une" |> (\(x) str_c("ceci n'est pas", x, sep=" "))() |> str_c("pipe", sep=" ") |> cat() ``` ``` ## ceci n'est pas une pipe ``` ] .fr[See [Blog on the new standard pipe](https://www.r-bloggers.com/2021/05/the-new-r-pipe/)] --- ### Other pipes `Magrittr` offers several variants of `%>%` - Tee operator `%T>%` - Assignement pipe `%<>%` - Exposition operator `%$%` - ... .fr[See [pipes for beginners](https://www.r-bloggers.com/2017/12/pipes-in-r-tutorial-for-beginners/)]
Base
has a `pipe()` function to manipulate connections (Files, URLs, ...) --- template: inter-slide ###
References - [R for Data Science](https://r4ds.had.co.nz) + [Data transformation](https://r4ds.had.co.nz/transform.html) - Rstudio cheat sheets + [dplyr](https://www.rstudio.com/resources/cheatsheets/) + [tidyr](https://www.rstudio.com/resources/cheatsheets/) + [datatable](https://www.rstudio.com/resources/cheatsheets/) + [readr](https://www.rstudio.com/resources/cheatsheets/) --- exclude: true ```r family <- tibble::tribble( ~family, ~dob_child1, ~dob_child2, ~gender_child1, ~gender_child2, 1L, "1998-11-26", "2000-01-29", 1L, 2L, 2L, "1996-06-22", NA, 2L, NA, 3L, "2002-07-11", "2004-04-05", 2L, 2L, 4L, "2004-10-10", "2009-08-27", 1L, 1L, 5L, "2000-12-05", "2005-02-28", 2L, 1L, ) family %>% mutate(across(starts_with("dob"), readr::parse_date)) %>% pivot_longer( !family, names_to = c(".value", "child"), names_pattern = "([a-z]*)_child(.)", values_drop_na = TRUE ) ``` ``` ## # A tibble: 9 × 4 ## family child dob gender ## <int> <chr> <date> <int> ## 1 1 1 1998-11-26 1 ## 2 1 2 2000-01-29 2 ## 3 2 1 1996-06-22 2 ## 4 3 1 2002-07-11 2 ## 5 3 2 2004-04-05 2 ## 6 4 1 2004-10-10 1 ## 7 4 2 2009-08-27 1 ## 8 5 1 2000-12-05 2 ## 9 5 2 2005-02-28 1 ``` ```r pivot_longer_spec ``` ``` ## function (data, spec, names_repair = "check_unique", values_drop_na = FALSE, ## values_ptypes = NULL, values_transform = NULL) ## { ## spec <- check_pivot_spec(spec) ## spec <- deduplicate_spec(spec, data) ## v_fct <- factor(spec$.value, levels = unique(spec$.value)) ## values <- split(spec$.name, v_fct) ## value_names <- names(values) ## value_keys <- split(spec[-(1:2)], v_fct) ## keys <- vec_unique(spec[-(1:2)]) ## if (identical(values_ptypes, list())) { ## values_ptypes <- NULL ## } ## values_ptypes <- check_list_of_ptypes(values_ptypes, value_names, ## "values_ptypes") ## values_transform <- check_list_of_functions(values_transform, ## value_names, "values_transform") ## vals <- set_names(vec_init(list(), length(values)), value_names) ## for (value in value_names) { ## cols <- values[[value]] ## col_id <- vec_match(value_keys[[value]], keys) ## val_cols <- vec_init(list(), nrow(keys)) ## val_cols[col_id] <- unname(as.list(data[cols])) ## val_cols[-col_id] <- list(rep(NA, nrow(data))) ## if (has_name(values_transform, value)) { ## val_cols <- lapply(val_cols, values_transform[[value]]) ## } ## val_type <- vec_ptype_common(!!!set_names(val_cols[col_id], ## cols), .ptype = values_ptypes[[value]]) ## out <- vec_c(!!!val_cols, .ptype = val_type) ## n_vals <- nrow(data) * length(val_cols) ## idx <- t(matrix(seq_len(n_vals), ncol = length(val_cols))) ## vals[[value]] <- vec_slice(out, as.integer(idx)) ## } ## vals <- as_tibble(vals) ## df_out <- drop_cols(as_tibble(data, .name_repair = "minimal"), ## spec$.name) ## out <- wrap_error_names(vec_cbind(vec_rep_each(df_out, vec_size(keys)), ## vec_rep(keys, vec_size(data)), vals, .name_repair = names_repair)) ## if (values_drop_na) { ## out <- vec_slice(out, !vec_equal_na(vals)) ## } ## out$.seq <- NULL ## reconstruct_tibble(data, out) ## } ## <bytecode: 0x7fcbec32f820> ## <environment: namespace:tidyr> ``` ```r #> # A tibble: 5 × 5 #> family dob_child1 dob_child2 gender_child1 gender_child2 #> <int> <date> <date> <int> <int> #> 1 1 1998-11-26 2000-01-29 1 2 #> 2 2 1996-06-22 NA 2 NA #> 3 3 2002-07-11 2004-04-05 2 2 #> 4 4 2004-10-10 2009-08-27 1 1 #> 5 5 2000-12-05 2005-02-28 2 1 ``` --- class: middle, center, inverse background-image: url('./img/pexels-cottonbro-3171837.jpg') background-size: cover # The End