LAB: dplyr and SQL

Published

February 1, 2024

L3 MIASHS
Université Paris Cité
Année 2023-2024
Course Homepage
Moodle

Code

to_be_loaded <- c("tidyverse", 
                  "glue",
                  "cowplot",
                  "patchwork",
                  "nycflights13",
                  "DBI",
                  "RSQLite",
                  "RPostgreSQL",
                  "dtplyr",
                  "dbplyr"
)

for (pck in to_be_loaded) {
  if (!require(pck, character.only = T)) {
    install.packages(pck, repos="http://cran.rstudio.com/")
    stopifnot(require(pck, character.only = T))
  }  
}

Code

old_theme <- theme_set(theme_minimal())

Objectives

From the Documentation

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:

-mutate() adds new variables that are functions of existing variables - select() picks variables based on their names. - filter() picks cases based on their values. - summarise() reduces multiple values down to a single summary. - arrange() changes the ordering of the rows.

dplyr provides an elegant implementation of table calculus, as embodied by SQL.

We will play with the nycflights13 dataset

Loading `nycflights`

In memory

Code

flights <- nycflights13::flights
weather <- nycflights13::weather
airports <- nycflights13::airports
airlines <- nycflights13::airlines
planes <- nycflights13::planes

Code

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
flights_lite <- copy_to(con, nycflights13::flights)
airports_lite <- copy_to(con, nycflights13::airports)
planes_lite <- copy_to(con, nycflights13::planes)
weather_lite <- copy_to(con, nycflights13::weather)
airlines_lite <- copy_to(con, nycflights13::airlines)

Pure relational algebra \(\sigma, \pi, \bowtie, \cap, \cup, \setminus\)

Projection, \(\pi\), `select(...)`

Projection of a table \(R\) on columns \(A_{i_1},\ldots, A_{i_k}\) results in a table with schema \(A_{i_1},\ldots, A_{i_k}\) and one row for each row of \(R\). Projection is denoted by \(\pi(R, A_{i_1},\ldots, A_{i_k})\).

In SQL this reads as

SELECT Ai1, , Aik
FROM R

Code

R_lite <- copy_to(con, R)
S_lite <- copy_to(con, S)

In the sequel, we illustrate operations on the next two toy tables

Table R

A	B	C
2024-02-03	5	j
2024-02-08	9	i
2024-02-01	3	n
2024-02-02	3	b
2024-02-07	4	l
2024-02-04	4	t
2024-02-06	3	o
2024-02-07	1	a
2024-02-03	7	j
2024-02-07	1	k

Table S

A	D	F
2024-02-01	26	b
2024-02-03	27	p
2024-02-07	24	q
2024-02-07	27	t
2024-02-07	30	z
2024-02-04	30	m
2024-02-07	21	s
2024-02-03	26	g

In Relational Algebra, tables are sets rather than multisets, there are no duplicates. In SQL we are handling multisets of rows, duplicates need to be removed explicitly

SELECT DISTINCT Ai1, ..., Aik
FROM R

dplyr has one verb select(...) for \(\pi\) or SELECT, and verb distinct() for SELECT DISTINCT ....

If we have no intention to remove duplicates:

select(R, Ai1, ..., Aik)
# or
R |> 
  select(Ai1, ..., Aik)

If we want to remove deplicates

distinct(R, Ai1, ..., Aik)
# or
R |> 
  distinct(Ai1, ..., Aik)

\(\pi(R, B, C)\) (SELECT B, C FROM R) leads to

B	C
5	j
9	i
3	n
3	b
4	l
4	t
3	o
1	a
7	j
1	k

\(\pi(R, B)\) and SELECT DISTINCT B FROM R lead to

B
5
9
3
4
1
7

For each departure airport (denoted by origin'), each day of the year, list the codes (denoted bycarrier`) of the airlines that have one or more planes taking off from that airport on that day.

In SQL, we can phrase this query like this:

SELECT DISTINCT f.origin, f.year, f.month, f.day, f.carrier
FROM nycflights.flights f ;

Using dplyr and chaining with standard pipe |> or %>% from magrittr, we can write.

Code

q1 <- . %>%
  distinct(origin, year, month, day, carrier) 

q1(flights) |>
  head()

1: To define a unary function implementing the short pipeline, we have to use %>%.

# A tibble: 6 × 5
  origin  year month   day carrier
  <chr>  <int> <int> <int> <chr>  
1 EWR     2013     1     1 UA     
2 LGA     2013     1     1 UA     
3 JFK     2013     1     1 AA     
4 JFK     2013     1     1 B6     
5 LGA     2013     1     1 DL     
6 EWR     2013     1     1 B6

We can reuse the pipeline to query the lazy tables., we can even

Code

q1(flights_lite) |> 
  head()

# Source:   SQL [6 x 5]
# Database: sqlite 3.41.2 [:memory:]
  origin  year month   day carrier
  <chr>  <int> <int> <int> <chr>  
1 EWR     2013     1     1 UA     
2 LGA     2013     1     1 UA     
3 JFK     2013     1     1 AA     
4 JFK     2013     1     1 B6     
5 LGA     2013     1     1 DL     
6 EWR     2013     1     1 B6

Code

q1(flights_lite) |> 
  show_query()

<SQL>
SELECT DISTINCT `origin`, `year`, `month`, `day`, `carrier`
FROM `nycflights13::flights`

Code

q1(flights_lite) |> 
  explain()

<SQL>
SELECT DISTINCT `origin`, `year`, `month`, `day`, `carrier`
FROM `nycflights13::flights`

<PLAN>
  id parent notused                       detail
1  3      0       0   SCAN nycflights13::flights
2 15      0       0 USE TEMP B-TREE FOR DISTINCT

Selection, \(\sigma\), `filter(...)`

Selection of a table \(R\) according to condition \(\texttt{expr}\) is an expression that can be evaluated on each row of \(R\) results in a table with the same schema as \(R\) and all rows of \(R\) where \(\texttt{expr}\) evaluates to TRUE. Selection is denoted by \(\sigma(R, \texttt{expr})\).

In SQL this reads as

SELECT R.*
FROM R 
WHERE expr

dplyr has one verb filter(...) for \(\sigma\).

\(\sigma(R, A < \text{2024-02-06} \wedge \text{2024-02-02} \leq A)\) (SELECT * FROM R WHERE A < CAST('2024-02-06' AS DATE) AND A >= CAST('2024-02-02' AS DATE)) leads to

A	B	C
2024-02-03	5	j
2024-02-02	3	b
2024-02-04	4	t
2024-02-03	7	j

SELECT DISTINCT B FROM R leads to

B
5
9
3
4
1
7

List all the planes built by a manufacturer named like AIRBUS between 2005 and 2010

Code

qx <- . %>% 
  filter(str_like(manufacturer, 'AIRBUS%') &  year >= 2005 & year <= 2010)

qx(planes)

# A tibble: 143 × 9
   tailnum  year type              manufacturer model engines seats speed engine
   <chr>   <int> <chr>             <chr>        <chr>   <int> <int> <int> <chr> 
 1 N125UW   2009 Fixed wing multi… AIRBUS       A320…       2   182    NA Turbo…
 2 N126UW   2009 Fixed wing multi… AIRBUS       A320…       2   182    NA Turbo…
 3 N127UW   2010 Fixed wing multi… AIRBUS       A320…       2   182    NA Turbo…
 4 N128UW   2010 Fixed wing multi… AIRBUS       A320…       2   182    NA Turbo…
 5 N193UW   2008 Fixed wing multi… AIRBUS       A321…       2   199    NA Turbo…
 6 N195UW   2008 Fixed wing multi… AIRBUS       A321…       2   199    NA Turbo…
 7 N196UW   2009 Fixed wing multi… AIRBUS       A321…       2   199    NA Turbo…
 8 N197UW   2009 Fixed wing multi… AIRBUS       A321…       2   199    NA Turbo…
 9 N201FR   2008 Fixed wing multi… AIRBUS       A320…       2   182    NA Turbo…
10 N202FR   2008 Fixed wing multi… AIRBUS       A320…       2   182    NA Turbo…
# ℹ 133 more rows

Code

qx(planes_lite) |> 
  show_query()

<SQL>
SELECT *
FROM `nycflights13::planes`
WHERE (`manufacturer` LIKE 'AIRBUS%' AND `year` >= 2005.0 AND `year` <= 2010.0)

Joins, \(\bowtie\), `xxx_join(...)`

In relational algebra, a \(\theta\)-join boils down to a selection according to expression \(\theta\) over a cross product (possibly after renaming some columns)

\[\bowtie(R, S, \theta) \approx \sigma(R \times S, \theta)\] dplyr does not (yet?) offer such a general join (which anyway can be very expensive on a cutting edge RDBMS) several variants of equijoin.

ChatGPT asserts:

An equijoin is a type of join operation in relational databases where the join condition involves an equality comparison between two attributes from different tables. In other words, it’s a join operation that combines rows from two tables based on matching values in specified columns. These specified columns are usually called the “join columns” or “join keys.”

Joins in dplyr documentation

inner_join()
left_join()
right_join()
full_join()

but also

semi_join()
anti_join()

the matching columns are stated using optional argument by=...

If argument by is omitted, NATURAL JOIN is assumed.

\(\bowtie(R, S)\) (SELECT * FROM R NATURAL JOIN S) leads to

Joining with `by = join_by(A)`

Warning in inner_join(R, S): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 3 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.

\(\bowtie^{\text{Left outer}}(R, S)\) and SELECT * FROM R LEFT JOIN S ON (R.A=S.A) lead to

Joining with `by = join_by(A)`

Warning in left_join(R, S): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 3 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.

List weather conditions at departure for all flights operated by airline named Delta ....

In SQL, we can rely on the next query

WITH delta AS (
    SELECT al.carrier
    FROM nycflights.airlines al
    WHERE al."name" = 'Delta Air Lines Inc.'
),
delta_f AS (
    SELECT f.origin, f.flight, f.year, f.month, f.day, f.hour 
    FROM nycflights.flights f
    WHERE f.carrier IN (SELECT * FROM delta)
)

SELECT f.flight, w.*
FROM  nycflights.weather w NATURAL JOIN delta_f f;

Using dplyr, we can mimick this approach.

Code

delta <- . %>% 
  filter(str_like(name, "Delta%")) 

delta_f <- . %>% 
  semi_join(delta(airlines),
            by = join_by(carrier))

delta_f(flights) |>  head()

jb <- join_by(origin, year, month, day, hour)

q3 <- .  %>% 
  inner_join(weather, by = jb)

q3(delta_f(flights)) |> 
  head()

1: Check semi_join

# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      554            600        -6      812            837
2  2013     1     1      602            610        -8      812            820
3  2013     1     1      606            610        -4      837            845
4  2013     1     1      615            615         0      833            842
5  2013     1     1      653            700        -7      936           1009
6  2013     1     1      655            655         0     1021           1030
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 6 × 29
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      554            600        -6      812            837
2  2013     1     1      602            610        -8      812            820
3  2013     1     1      606            610        -4      837            845
4  2013     1     1      615            615         0      833            842
5  2013     1     1      653            700        -7      936           1009
6  2013     1     1      655            655         0     1021           1030
# ℹ 21 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour.x <dttm>, temp <dbl>, dewp <dbl>,
#   humid <dbl>, wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
#   precip <dbl>, pressure <dbl>, visib <dbl>, time_hour.y <dttm>

We can (almost) reuse the pipeline on lazy tables

Code

delta_f <- . %>% 
  semi_join(delta(airlines_lite), by = join_by(carrier))

delta_f(flights_lite) |> 
  head()

# Source:   SQL [6 x 19]
# Database: sqlite 3.41.2 [:memory:]
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      554            600        -6      812            837
2  2013     1     1      602            610        -8      812            820
3  2013     1     1      606            610        -4      837            845
4  2013     1     1      615            615         0      833            842
5  2013     1     1      653            700        -7      936           1009
6  2013     1     1      655            655         0     1021           1030
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dbl>

Code

delta_f(flights_lite) |>  
  show_query()

<SQL>
SELECT *
FROM `nycflights13::flights`
WHERE EXISTS (
  SELECT 1 FROM (
  SELECT *
  FROM `nycflights13::airlines`
  WHERE (`name` LIKE 'Delta%')
) AS `RHS`
  WHERE (`nycflights13::flights`.`carrier` = `RHS`.`carrier`)
)

Code

q3 <- .  %>% 
  inner_join(weather_lite, by = jb)

q3(delta_f(flights_lite)) |> 
  show_query()

<SQL>
SELECT
  `LHS`.`year` AS `year`,
  `LHS`.`month` AS `month`,
  `LHS`.`day` AS `day`,
  `dep_time`,
  `sched_dep_time`,
  `dep_delay`,
  `arr_time`,
  `sched_arr_time`,
  `arr_delay`,
  `carrier`,
  `flight`,
  `tailnum`,
  `LHS`.`origin` AS `origin`,
  `dest`,
  `air_time`,
  `distance`,
  `LHS`.`hour` AS `hour`,
  `minute`,
  `LHS`.`time_hour` AS `time_hour.x`,
  `temp`,
  `dewp`,
  `humid`,
  `wind_dir`,
  `wind_speed`,
  `wind_gust`,
  `precip`,
  `pressure`,
  `visib`,
  `nycflights13::weather`.`time_hour` AS `time_hour.y`
FROM (
  SELECT *
  FROM `nycflights13::flights`
  WHERE EXISTS (
    SELECT 1 FROM (
    SELECT *
    FROM `nycflights13::airlines`
    WHERE (`name` LIKE 'Delta%')
) AS `RHS`
    WHERE (`nycflights13::flights`.`carrier` = `RHS`.`carrier`)
  )
) AS `LHS`
INNER JOIN `nycflights13::weather`
  ON (
    `LHS`.`origin` = `nycflights13::weather`.`origin` AND
    `LHS`.`year` = `nycflights13::weather`.`year` AND
    `LHS`.`month` = `nycflights13::weather`.`month` AND
    `LHS`.`day` = `nycflights13::weather`.`day` AND
    `LHS`.`hour` = `nycflights13::weather`.`hour`
  )

Agregation, `summarize(...)`

According to ChatGPT:

In SQL, an aggregation function is a function that operates on a set of values and returns a single aggregated value summarizing those values. These functions are commonly used in SQL queries to perform calculations across multiple rows and produce meaningful results. Some common aggregation functions in SQL include:

COUNT: Counts the number of rows in a result set.
SUM: Calculates the sum of values in a column.
AVG: Calculates the average of values in a column.
MIN: Finds the minimum value in a column.
MAX: Finds the maximum value in a column.

Count the number of airport whose name starts with International

Code

qb <- . %>% 
  filter(str_like(name, 'International%')) %>% 
  summarise(n=n())

airports %>% 
  filter(str_like(name, 'International%')) %>%
  count()

# A tibble: 1 × 1
      n
  <int>
1     1

Code

qb(airports_lite) |>
  show_query()

<SQL>
SELECT COUNT(*) AS `n`
FROM `nycflights13::airports`
WHERE (`name` LIKE 'International%')

Partition, `group_by`

Following again ChatGPT

Aggregation functions are often used with the GROUP BY clause in SQL queries to group rows that have the same values in specified columns, allowing the aggregation functions to operate on each group separately. This enables powerful analysis and reporting capabilities in SQL, allowing users to extract useful insights from large datasets.

In table \(R\), for each value in column A sum the values in column B

A	s
2024-02-01	3
2024-02-02	3
2024-02-03	12
2024-02-04	4
2024-02-06	3
2024-02-07	6
2024-02-08	9

dplyr offers a group_by() verb that proves powerful and flexible. The resulting grouped tibble can be used both for aggregation and for implementing certain kinds of windows.

For each departure airport, each airline, count the number of flights operated by this airline from this airport.

SELECT f.origin, f.carrier, COUNT(*) AS n
FROM nycflights.flights f
GROUP BY f.origin, f.carrier 
ORDER BY f.carrier, n DESC;

Code

q4 <- . %>% 
  group_by(carrier, origin) %>% 
  summarise(n=n(), .groups="drop") %>%
  arrange(desc(carrier))
  
q4(flights)

# A tibble: 35 × 3
   carrier origin     n
   <chr>   <chr>  <int>
 1 YV      LGA      601
 2 WN      EWR     6188
 3 WN      LGA     6087
 4 VX      EWR     1566
 5 VX      JFK     3596
 6 US      EWR     4405
 7 US      JFK     2995
 8 US      LGA    13136
 9 UA      EWR    46087
10 UA      JFK     4534
# ℹ 25 more rows

Code

q4(flights_lite) |> 
  show_query()

<SQL>
SELECT `carrier`, `origin`, COUNT(*) AS `n`
FROM `nycflights13::flights`
GROUP BY `carrier`, `origin`
ORDER BY `carrier` DESC

SELECT f.origin, f.year, f.month, f.day, f.carrier, COUNT(DISTINCT tailnum)
FROM nycflights.flights f 
GROUP BY f.origin, f.year, f.month, f.day, f.carrier
ORDER BY f.origin, f.year, f.month, f.day, f.carrier;

Code

q2 <- . %>% 
  group_by(origin, year, month, day, carrier) %>% 
  summarise(n_tailnum=n_distinct(tailnum), .groups = "drop") %>% 
  arrange(origin, year, month, day, carrier) 

q2(flights) |> 
  head()

# A tibble: 6 × 6
  origin  year month   day carrier n_tailnum
  <chr>  <int> <int> <int> <chr>       <int>
1 EWR     2013     1     1 AA              9
2 EWR     2013     1     1 AS              2
3 EWR     2013     1     1 B6             18
4 EWR     2013     1     1 DL              6
5 EWR     2013     1     1 EV             66
6 EWR     2013     1     1 MQ              7

Code

q2(flights_lite) |>
  show_query()

<SQL>
SELECT
  `origin`,
  `year`,
  `month`,
  `day`,
  `carrier`,
  COUNT(DISTINCT `tailnum`) AS `n_tailnum`
FROM `nycflights13::flights`
GROUP BY `origin`, `year`, `month`, `day`, `carrier`
ORDER BY `origin`, `year`, `month`, `day`, `carrier`

List the features of planes that have been operated by several airlines

WITH for_hire AS (
    SELECT f.tailnum, COUNT(DISTINCT f.carrier) AS n_carrier
    FROM nycflights.flights f
    GROUP BY f.tailnum 
    HAVING COUNT(DISTINCT f.carrier) >=2 
)

SELECT p.*
FROM nycflights.planes p NATURAL JOIN for_hire ;

Code

for_hire <- . %>% 
  group_by(tailnum) %>% 
  summarise(n_carriers=n_distinct(carrier)) %>% 
  filter(n_carriers >= 2) %>% 
  select(tailnum)

for_hire(flights) |>  head()

# A tibble: 6 × 1
  tailnum
  <chr>  
1 N146PQ 
2 N153PQ 
3 N176PQ 
4 N181PQ 
5 N197PQ 
6 N200PQ

Code

planes %>% 
  semi_join(for_hire(flights), by=join_by(tailnum))

# A tibble: 17 × 9
   tailnum  year type              manufacturer model engines seats speed engine
   <chr>   <int> <chr>             <chr>        <chr>   <int> <int> <int> <chr> 
 1 N146PQ   2007 Fixed wing multi… BOMBARDIER … CL-6…       2    95    NA Turbo…
 2 N153PQ   2007 Fixed wing multi… BOMBARDIER … CL-6…       2    95    NA Turbo…
 3 N176PQ   2008 Fixed wing multi… BOMBARDIER … CL-6…       2    95    NA Turbo…
 4 N181PQ   2008 Fixed wing multi… BOMBARDIER … CL-6…       2    95    NA Turbo…
 5 N197PQ   2008 Fixed wing multi… BOMBARDIER … CL-6…       2    95    NA Turbo…
 6 N200PQ   2008 Fixed wing multi… BOMBARDIER … CL-6…       2    95    NA Turbo…
 7 N228PQ   2009 Fixed wing multi… BOMBARDIER … CL-6…       2    95    NA Turbo…
 8 N232PQ   2009 Fixed wing multi… BOMBARDIER … CL-6…       2    95    NA Turbo…
 9 N933AT   2000 Fixed wing multi… BOEING       717-…       2   100    NA Turbo…
10 N935AT   2000 Fixed wing multi… BOEING       717-…       2   100    NA Turbo…
11 N977AT   2002 Fixed wing multi… BOEING       717-…       2   100    NA Turbo…
12 N978AT   2002 Fixed wing multi… BOEING       717-…       2   100    NA Turbo…
13 N979AT   2002 Fixed wing multi… BOEING       717-…       2   100    NA Turbo…
14 N981AT   2002 Fixed wing multi… BOEING       717-…       2   100    NA Turbo…
15 N989AT   2001 Fixed wing multi… BOEING       717-…       2   100    NA Turbo…
16 N990AT   2001 Fixed wing multi… BOEING       717-…       2   100    NA Turbo…
17 N994AT   2002 Fixed wing multi… BOEING       717-…       2   100    NA Turbo…

Code

planes_lite %>% 
  semi_join(for_hire(flights_lite), by=join_by(tailnum)) %>% 
  show_query()

<SQL>
SELECT *
FROM `nycflights13::planes`
WHERE EXISTS (
  SELECT 1 FROM (
  SELECT `tailnum`
  FROM `nycflights13::flights`
  GROUP BY `tailnum`
  HAVING (COUNT(DISTINCT `carrier`) >= 2.0)
) AS `RHS`
  WHERE (`nycflights13::planes`.`tailnum` = `RHS`.`tailnum`)
)

Adding/modifying columns, `mutate(...)`

In the SELECT clause of an SQL query, certain columns can be computed. Verb select from dplyr does not offer this possibility. We have to add the computed columns using verb mutate and then to perform projection using select (if necessary)

Assume we want to add one day to every value in column A from \(R\) so as to obtain:

# A tibble: 10 × 3
   A              B C    
   <date>     <int> <chr>
 1 2024-02-04     5 j    
 2 2024-02-09     9 i    
 3 2024-02-02     3 n    
 4 2024-02-03     3 b    
 5 2024-02-08     4 l    
 6 2024-02-05     4 t    
 7 2024-02-07     3 o    
 8 2024-02-08     1 a    
 9 2024-02-04     7 j    
10 2024-02-08     1 k

In SQL we can proceed like this

SELECT A + 1  AS A, B, C
FROM R

Window functions

Asking ChatGPT we obtain

In SQL, a window function (also known as an analytic function or windowed function) is a type of function that performs a calculation across a set of rows related to the current row within a query result set. Unlike aggregation functions which collapse multiple rows into a single row, window functions operate on a “window” of rows defined by a partition or an ordering.

Key features of window functions include:

Partitioning: The window function can be partitioned by one or more columns, dividing the result set into groups or partitions. The function is applied independently within each partition.

Ordering: The window function can be ordered by one or more columns, defining the sequence of rows within each partition. This determines the rows included in the window for each calculation.

Frame specification: Optionally, a window function can have a frame specification, which further refines the rows included in the window based on their position relative to the current row.

Window functions allow for advanced analytics and reporting tasks that require comparisons or calculations across multiple rows without collapsing the result set. They can be used to compute running totals, calculate moving averages, rank rows within partitions, and perform other complex analyses.

Some common window functions in SQL include ROW_NUMBER(), RANK(), DENSE_RANK(), NTILE(), LAG(), LEAD(), SUM() OVER(), AVG() OVER(), etc.

If we focus on window queries with a single window built using PARTITION BY and ORDER BY, we just need to combine, group_by(), possibly arrange(), and mutate()

For each departure airport, and day, list the 10 most delayed flights.

In SQL we can proceed like this:

WITH f_delayed AS (
  SELECT f.*, RANK() OVER w AS rnk
  FROM nycflights.flights f
  WHERE f.dep_time IS NOT NULL
  WINDOW w AS (PARTITION BY f.origin, f.year, f.month, f.day ORDER BY f.dep_delay DESC)  
)

SELECT fd.origin, fd.year, fd.month, fd.day, fd.tailnum
FROM f_delayed fd
WHERE fd.rnk <= 10;

Code

f_delayed <- . %>% 
  filter(!is.na(dep_time)) %>% 
  group_by(origin, year, month, day) %>% 
  mutate(rnk=min_rank(desc(dep_delay))) %>% 
  ungroup()

f_delayed(flights) |>
  filter(rnk <= 10) |> 
  head()

# A tibble: 6 × 20
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      811            630       101     1047            830
2  2013     1     1      848           1835       853     1001           1950
3  2013     1     1      957            733       144     1056            853
4  2013     1     1     1114            900       134     1447           1222
5  2013     1     1     1301           1150        71     1518           1345
6  2013     1     1     1400           1250        70     1645           1502
# ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>, rnk <int>

Use the cheatsheet

Objectives

Loading nycflights

In memory

Pure relational algebra \(\sigma, \pi, \bowtie, \cap, \cup, \setminus\)

Projection, \(\pi\), select(...)

Selection, \(\sigma\), filter(...)

Joins, \(\bowtie\), xxx_join(...)

Agregation, summarize(...)

Partition, group_by

Adding/modifying columns, mutate(...)

Window functions

Loading `nycflights`

Projection, \(\pi\), `select(...)`

Selection, \(\sigma\), `filter(...)`

Joins, \(\bowtie\), `xxx_join(...)`

Agregation, `summarize(...)`

Partition, `group_by`

Adding/modifying columns, `mutate(...)`