LAB: dplyr and SQL

Published

January 31, 2024

Code
to_be_loaded <- c("tidyverse", 
                  "glue",
                  "cowplot",
                  "patchwork",
                  "nycflights13",
                  "DBI",
                  "RSQLite",
                  "RPostgreSQL",
                  "dtplyr",
                  "dbplyr"
)

for (pck in to_be_loaded) {
  if (!require(pck, character.only = T)) {
    install.packages(pck, repos="http://cran.rstudio.com/")
    stopifnot(require(pck, character.only = T))
  }  
}
Code
old_theme <- theme_set(theme_minimal())

Objectives

From the Documentation

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:

-mutate() adds new variables that are functions of existing variables - select() picks variables based on their names. - filter() picks cases based on their values. - summarise() reduces multiple values down to a single summary. - arrange() changes the ordering of the rows.

dplyr provides an elegant implementation of table calculus, as embodied by SQL.

We will play with the nycflights13 dataset

NYCFlights13

Loading nycflights

In memory

Code
flights <- nycflights13::flights
weather <- nycflights13::weather
airports <- nycflights13::airports
airlines <- nycflights13::airlines
planes <- nycflights13::planes
Code
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
flights_lite <- copy_to(con, nycflights13::flights)
airports_lite <- copy_to(con, nycflights13::airports)
planes_lite <- copy_to(con, nycflights13::planes)
weather_lite <- copy_to(con, nycflights13::weather)
airlines_lite <- copy_to(con, nycflights13::airlines)

Pure relational algebra \(\sigma, \pi, \bowtie, \cap, \cup, \setminus\)

Projection, \(\pi\), select(...)

Projection of a table \(R\) on columns \(A_{i_1},\ldots, A_{i_k}\) results in a table with schema \(A_{i_1},\ldots, A_{i_k}\) and one row for each row of \(R\). Projection is denoted by \(\pi(R, A_{i_1},\ldots, A_{i_k})\).

In SQL this reads as

SELECT Ai1, , Aik
FROM R 
Code
R_lite <- copy_to(con, R)
S_lite <- copy_to(con, S)

In the sequel, we illustrate operations on the next two toy tables

Table R

A B C
2024-02-06 4 q
2024-02-06 7 l
2024-02-06 5 c
2024-02-06 10 b
2024-02-08 6 w
2024-02-09 8 m
2024-02-06 8 l
2024-02-02 7 y
2024-02-07 1 w
2024-02-08 8 d

Table S

A D F
2024-02-06 21 s
2024-02-02 21 j
2024-02-06 29 l
2024-02-06 29 v
2024-02-06 22 p
2024-02-06 29 i
2024-02-07 27 d
2024-02-06 20 g

In Relational Algebra, tables are sets rather than multisets, there are no duplicates. In SQL we are handling multisets of rows, duplicates need to be removed explicitly

SELECT DISTINCT Ai1, ..., Aik
FROM R 

dplyr has one verb select(...) for \(\pi\) or SELECT, and verb distinct() for SELECT DISTINCT ....

If we have no intention to remove duplicates:

select(R, Ai1, ..., Aik)
# or
R |> 
  select(Ai1, ..., Aik)

If we want to remove deplicates

distinct(R, Ai1, ..., Aik)
# or
R |> 
  distinct(Ai1, ..., Aik)

\(\pi(R, B, C)\) (SELECT B, C FROM R) leads to

B C
4 q
7 l
5 c
10 b
6 w
8 m
8 l
7 y
1 w
8 d

\(\pi(R, B)\) and SELECT DISTINCT B FROM R lead to

B
4
7
5
10
6
8
1

For each departure airport (denoted by origin'), each day of the year, list the codes (denoted bycarrier`) of the airlines that have one or more planes taking off from that airport on that day.

Selection, \(\sigma\), filter(...)

Selection of a table \(R\) according to condition \(\texttt{expr}\) is an expression that can be evaluated on each row of \(R\) results in a table with the same schema as \(R\) and all rows of \(R\) where \(\texttt{expr}\) evaluates to TRUE. Selection is denoted by \(\sigma(R, \texttt{expr})\).

In SQL this reads as

SELECT R.*
FROM R 
WHERE expr

dplyr has one verb filter(...) for \(\sigma\).

\(\sigma(R, A < \text{2024-02-06} \wedge \text{2024-02-02} \leq A)\) (SELECT * FROM R WHERE A < CAST('2024-02-06' AS DATE) AND A >= CAST('2024-02-02' AS DATE)) leads to

A B C
2024-02-02 7 y

SELECT DISTINCT B FROM R leads to

B
4
7
5
10
6
8
1

List all the planes built by a manufacturer named like AIRBUS between 2005 and 2010

Joins, \(\bowtie\), xxx_join(...)

In relational algebra, a \(\theta\)-join boils down to a selection according to expression \(\theta\) over a cross product (possibly after renaming some columns)

\[\bowtie(R, S, \theta) \approx \sigma(R \times S, \theta)\] dplyr does not (yet?) offer such a general join (which anyway can be very expensive on a cutting edge RDBMS) several variants of equijoin.

ChatGPT asserts:

An equijoin is a type of join operation in relational databases where the join condition involves an equality comparison between two attributes from different tables. In other words, it’s a join operation that combines rows from two tables based on matching values in specified columns. These specified columns are usually called the “join columns” or “join keys.”

Joins in dplyr documentation

  • inner_join()
  • left_join()
  • right_join()
  • full_join()

but also

  • semi_join()
  • anti_join()

the matching columns are stated using optional argument by=...

If argument by is omitted, NATURAL JOIN is assumed.

\(\bowtie(R, S)\) (SELECT * FROM R NATURAL JOIN S) leads to

Joining with `by = join_by(A)`
Warning in inner_join(R, S): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 1 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.

\(\bowtie^{\text{Left outer}}(R, S)\) and SELECT * FROM R LEFT JOIN S ON (R.A=S.A) lead to

Joining with `by = join_by(A)`
Warning in left_join(R, S): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 1 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.

List weather conditions at departure for all flights operated by airline named Delta ....

Agregation, summarize(...)

According to ChatGPT:

In SQL, an aggregation function is a function that operates on a set of values and returns a single aggregated value summarizing those values. These functions are commonly used in SQL queries to perform calculations across multiple rows and produce meaningful results. Some common aggregation functions in SQL include:

  • COUNT: Counts the number of rows in a result set.
  • SUM: Calculates the sum of values in a column.
  • AVG: Calculates the average of values in a column.
  • MIN: Finds the minimum value in a column.
  • MAX: Finds the maximum value in a column.

Count the number of airport whose name starts with International

Partition, group_by

Following again ChatGPT

Aggregation functions are often used with the GROUP BY clause in SQL queries to group rows that have the same values in specified columns, allowing the aggregation functions to operate on each group separately. This enables powerful analysis and reporting capabilities in SQL, allowing users to extract useful insights from large datasets.

In table \(R\), for each value in column A sum the values in column B

A s
2024-02-02 7
2024-02-06 34
2024-02-07 1
2024-02-08 14
2024-02-09 8

dplyr offers a group_by() verb that proves powerful and flexible. The resulting grouped tibble can be used both for aggregation and for implementing certain kinds of windows.

For each departure airport, each airline, count the number of flights operated by this airline from this airport.

List the features of planes that have been operated by several airlines

Adding/modifying columns, mutate(...)

In the SELECT clause of an SQL query, certain columns can be computed. Verb select from dplyr does not offer this possibility. We have to add the computed columns using verb mutate and then to perform projection using select (if necessary)

Assume we want to add one day to every value in column A from \(R\) so as to obtain:

# A tibble: 10 × 3
   A              B C    
   <date>     <int> <chr>
 1 2024-02-07     4 q    
 2 2024-02-07     7 l    
 3 2024-02-07     5 c    
 4 2024-02-07    10 b    
 5 2024-02-09     6 w    
 6 2024-02-10     8 m    
 7 2024-02-07     8 l    
 8 2024-02-03     7 y    
 9 2024-02-08     1 w    
10 2024-02-09     8 d    

In SQL we can proceed like this

SELECT A + 1  AS A, B, C
FROM R

Window functions

Asking ChatGPT we obtain

In SQL, a window function (also known as an analytic function or windowed function) is a type of function that performs a calculation across a set of rows related to the current row within a query result set. Unlike aggregation functions which collapse multiple rows into a single row, window functions operate on a “window” of rows defined by a partition or an ordering.

Key features of window functions include:

Partitioning: The window function can be partitioned by one or more columns, dividing the result set into groups or partitions. The function is applied independently within each partition.

Ordering: The window function can be ordered by one or more columns, defining the sequence of rows within each partition. This determines the rows included in the window for each calculation.

Frame specification: Optionally, a window function can have a frame specification, which further refines the rows included in the window based on their position relative to the current row.

Window functions allow for advanced analytics and reporting tasks that require comparisons or calculations across multiple rows without collapsing the result set. They can be used to compute running totals, calculate moving averages, rank rows within partitions, and perform other complex analyses.

Some common window functions in SQL include ROW_NUMBER(), RANK(), DENSE_RANK(), NTILE(), LAG(), LEAD(), SUM() OVER(), AVG() OVER(), etc.

If we focus on window queries with a single window built using PARTITION BY and ORDER BY, we just need to combine, group_by(), possibly arrange(), and mutate()

For each departure airport, and day, list the 10 most delayed flights.

Use the cheatsheet