dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:
-mutate() adds new variables that are functions of existing variables - select() picks variables based on their names. - filter() picks cases based on their values. - summarise() reduces multiple values down to a single summary. - arrange() changes the ordering of the rows.
dplyr provides an elegant implementation of table calculus, as embodied by SQL.
Pure relational algebra \(\sigma, \pi, \bowtie, \cap, \cup, \setminus\)
Projection, \(\pi\), select(...)
Projection of a table \(R\) on columns \(A_{i_1},\ldots, A_{i_k}\) results in a table with schema \(A_{i_1},\ldots, A_{i_k}\) and one row for each row of \(R\). Projection is denoted by \(\pi(R, A_{i_1},\ldots, A_{i_k})\).
In SQL this reads as
SELECT Ai1, , AikFROM R
Code
R_lite <-copy_to(con, R)S_lite <-copy_to(con, S)
In the sequel, we illustrate operations on the next two toy tables
Table R
A
B
C
2024-02-03
5
j
2024-02-08
9
i
2024-02-01
3
n
2024-02-02
3
b
2024-02-07
4
l
2024-02-04
4
t
2024-02-06
3
o
2024-02-07
1
a
2024-02-03
7
j
2024-02-07
1
k
Table S
A
D
F
2024-02-01
26
b
2024-02-03
27
p
2024-02-07
24
q
2024-02-07
27
t
2024-02-07
30
z
2024-02-04
30
m
2024-02-07
21
s
2024-02-03
26
g
In Relational Algebra, tables are sets rather than multisets, there are no duplicates. In SQL we are handling multisets of rows, duplicates need to be removed explicitly
SELECTDISTINCT Ai1, ..., AikFROM R
dplyr has one verb select(...) for \(\pi\) or SELECT, and verb distinct() for SELECT DISTINCT ....
\(\pi(R, B)\) and SELECT DISTINCT B FROM R lead to
B
5
9
3
4
1
7
For each departure airport (denoted by origin'), each day of the year, list the codes (denoted bycarrier`) of the airlines that have one or more planes taking off from that airport on that day.
In SQL, we can phrase this query like this:
SELECTDISTINCT f.origin, f.year, f.month, f.day, f.carrierFROM nycflights.flights f ;
Using dplyr and chaining with standard pipe |> or %>% from magrittr, we can write.
To define a unary function implementing the short pipeline, we have to use %>%.
# A tibble: 6 × 5
origin year month day carrier
<chr> <int> <int> <int> <chr>
1 EWR 2013 1 1 UA
2 LGA 2013 1 1 UA
3 JFK 2013 1 1 AA
4 JFK 2013 1 1 B6
5 LGA 2013 1 1 DL
6 EWR 2013 1 1 B6
We can reuse the pipeline to query the lazy tables., we can even
Code
q1(flights_lite) |>head()
# Source: SQL [6 x 5]
# Database: sqlite 3.41.2 [:memory:]
origin year month day carrier
<chr> <int> <int> <int> <chr>
1 EWR 2013 1 1 UA
2 LGA 2013 1 1 UA
3 JFK 2013 1 1 AA
4 JFK 2013 1 1 B6
5 LGA 2013 1 1 DL
6 EWR 2013 1 1 B6
Code
q1(flights_lite) |>show_query()
<SQL>
SELECT DISTINCT `origin`, `year`, `month`, `day`, `carrier`
FROM `nycflights13::flights`
Code
q1(flights_lite) |>explain()
<SQL>
SELECT DISTINCT `origin`, `year`, `month`, `day`, `carrier`
FROM `nycflights13::flights`
<PLAN>
id parent notused detail
1 3 0 0 SCAN nycflights13::flights
2 15 0 0 USE TEMP B-TREE FOR DISTINCT
Selection, \(\sigma\), filter(...)
Selection of a table \(R\) according to condition \(\texttt{expr}\) is an expression that can be evaluated on each row of \(R\) results in a table with the same schema as \(R\) and all rows of \(R\) where \(\texttt{expr}\) evaluates to TRUE. Selection is denoted by \(\sigma(R, \texttt{expr})\).
In SQL this reads as
SELECT R.*FROM R WHERE expr
dplyr has one verb filter(...) for \(\sigma\).
\(\sigma(R, A < \text{2024-02-06} \wedge \text{2024-02-02} \leq A)\) (SELECT * FROM R WHERE A < CAST('2024-02-06' AS DATE) AND A >= CAST('2024-02-02' AS DATE)) leads to
A
B
C
2024-02-03
5
j
2024-02-02
3
b
2024-02-04
4
t
2024-02-03
7
j
SELECT DISTINCT B FROM R leads to
B
5
9
3
4
1
7
List all the planes built by a manufacturer named like AIRBUS between 2005 and 2010
Code
qx <- . %>%filter(str_like(manufacturer, 'AIRBUS%') & year >=2005& year <=2010)qx(planes)
# A tibble: 143 × 9
tailnum year type manufacturer model engines seats speed engine
<chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
1 N125UW 2009 Fixed wing multi… AIRBUS A320… 2 182 NA Turbo…
2 N126UW 2009 Fixed wing multi… AIRBUS A320… 2 182 NA Turbo…
3 N127UW 2010 Fixed wing multi… AIRBUS A320… 2 182 NA Turbo…
4 N128UW 2010 Fixed wing multi… AIRBUS A320… 2 182 NA Turbo…
5 N193UW 2008 Fixed wing multi… AIRBUS A321… 2 199 NA Turbo…
6 N195UW 2008 Fixed wing multi… AIRBUS A321… 2 199 NA Turbo…
7 N196UW 2009 Fixed wing multi… AIRBUS A321… 2 199 NA Turbo…
8 N197UW 2009 Fixed wing multi… AIRBUS A321… 2 199 NA Turbo…
9 N201FR 2008 Fixed wing multi… AIRBUS A320… 2 182 NA Turbo…
10 N202FR 2008 Fixed wing multi… AIRBUS A320… 2 182 NA Turbo…
# ℹ 133 more rows
Code
qx(planes_lite) |>show_query()
<SQL>
SELECT *
FROM `nycflights13::planes`
WHERE (`manufacturer` LIKE 'AIRBUS%' AND `year` >= 2005.0 AND `year` <= 2010.0)
Joins, \(\bowtie\), xxx_join(...)
In relational algebra, a \(\theta\)-join boils down to a selection according to expression \(\theta\) over a cross product (possibly after renaming some columns)
\[\bowtie(R, S, \theta) \approx \sigma(R \times S, \theta)\]dplyr does not (yet?) offer such a general join (which anyway can be very expensive on a cutting edge RDBMS) several variants of equijoin.
ChatGPT asserts:
An equijoin is a type of join operation in relational databases where the join condition involves an equality comparison between two attributes from different tables. In other words, it’s a join operation that combines rows from two tables based on matching values in specified columns. These specified columns are usually called the “join columns” or “join keys.”
the matching columns are stated using optional argument by=...
If argument by is omitted, NATURAL JOIN is assumed.
\(\bowtie(R, S)\) (SELECT * FROM R NATURAL JOIN S) leads to
Joining with `by = join_by(A)`
Warning in inner_join(R, S): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 3 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
\(\bowtie^{\text{Left outer}}(R, S)\) and SELECT * FROM R LEFT JOIN S ON (R.A=S.A) lead to
Joining with `by = join_by(A)`
Warning in left_join(R, S): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 3 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
List weather conditions at departure for all flights operated by airline named Delta ....
In SQL, we can rely on the next query
WITH delta AS (SELECT al.carrierFROM nycflights.airlines alWHERE al."name"='Delta Air Lines Inc.'),delta_f AS (SELECT f.origin, f.flight, f.year, f.month, f.day, f.hourFROM nycflights.flights fWHERE f.carrier IN (SELECT*FROM delta))SELECT f.flight, w.*FROM nycflights.weather w NATURALJOIN delta_f f;
<SQL>
SELECT *
FROM `nycflights13::flights`
WHERE EXISTS (
SELECT 1 FROM (
SELECT *
FROM `nycflights13::airlines`
WHERE (`name` LIKE 'Delta%')
) AS `RHS`
WHERE (`nycflights13::flights`.`carrier` = `RHS`.`carrier`)
)
Code
q3 <- . %>%inner_join(weather_lite, by = jb)q3(delta_f(flights_lite)) |>show_query()
<SQL>
SELECT
`LHS`.`year` AS `year`,
`LHS`.`month` AS `month`,
`LHS`.`day` AS `day`,
`dep_time`,
`sched_dep_time`,
`dep_delay`,
`arr_time`,
`sched_arr_time`,
`arr_delay`,
`carrier`,
`flight`,
`tailnum`,
`LHS`.`origin` AS `origin`,
`dest`,
`air_time`,
`distance`,
`LHS`.`hour` AS `hour`,
`minute`,
`LHS`.`time_hour` AS `time_hour.x`,
`temp`,
`dewp`,
`humid`,
`wind_dir`,
`wind_speed`,
`wind_gust`,
`precip`,
`pressure`,
`visib`,
`nycflights13::weather`.`time_hour` AS `time_hour.y`
FROM (
SELECT *
FROM `nycflights13::flights`
WHERE EXISTS (
SELECT 1 FROM (
SELECT *
FROM `nycflights13::airlines`
WHERE (`name` LIKE 'Delta%')
) AS `RHS`
WHERE (`nycflights13::flights`.`carrier` = `RHS`.`carrier`)
)
) AS `LHS`
INNER JOIN `nycflights13::weather`
ON (
`LHS`.`origin` = `nycflights13::weather`.`origin` AND
`LHS`.`year` = `nycflights13::weather`.`year` AND
`LHS`.`month` = `nycflights13::weather`.`month` AND
`LHS`.`day` = `nycflights13::weather`.`day` AND
`LHS`.`hour` = `nycflights13::weather`.`hour`
)
Agregation, summarize(...)
According to ChatGPT:
In SQL, an aggregation function is a function that operates on a set of values and returns a single aggregated value summarizing those values. These functions are commonly used in SQL queries to perform calculations across multiple rows and produce meaningful results. Some common aggregation functions in SQL include:
COUNT: Counts the number of rows in a result set.
SUM: Calculates the sum of values in a column.
AVG: Calculates the average of values in a column.
MIN: Finds the minimum value in a column.
MAX: Finds the maximum value in a column.
Count the number of airport whose name starts with International
<SQL>
SELECT COUNT(*) AS `n`
FROM `nycflights13::airports`
WHERE (`name` LIKE 'International%')
Partition, group_by
Following again ChatGPT
Aggregation functions are often used with the GROUP BY clause in SQL queries to group rows that have the same values in specified columns, allowing the aggregation functions to operate on each group separately. This enables powerful analysis and reporting capabilities in SQL, allowing users to extract useful insights from large datasets.
In table \(R\), for each value in column A sum the values in column B
A
s
2024-02-01
3
2024-02-02
3
2024-02-03
12
2024-02-04
4
2024-02-06
3
2024-02-07
6
2024-02-08
9
dplyr offers a group_by() verb that proves powerful and flexible. The resulting grouped tibble can be used both for aggregation and for implementing certain kinds of windows.
For each departure airport, each airline, count the number of flights operated by this airline from this airport.
SELECT f.origin, f.carrier, COUNT(*) AS nFROM nycflights.flights fGROUPBY f.origin, f.carrier ORDERBY f.carrier, n DESC;
# A tibble: 6 × 6
origin year month day carrier n_tailnum
<chr> <int> <int> <int> <chr> <int>
1 EWR 2013 1 1 AA 9
2 EWR 2013 1 1 AS 2
3 EWR 2013 1 1 B6 18
4 EWR 2013 1 1 DL 6
5 EWR 2013 1 1 EV 66
6 EWR 2013 1 1 MQ 7
Code
q2(flights_lite) |>show_query()
<SQL>
SELECT
`origin`,
`year`,
`month`,
`day`,
`carrier`,
COUNT(DISTINCT `tailnum`) AS `n_tailnum`
FROM `nycflights13::flights`
GROUP BY `origin`, `year`, `month`, `day`, `carrier`
ORDER BY `origin`, `year`, `month`, `day`, `carrier`
List the features of planes that have been operated by several airlines
WITH for_hire AS (SELECT f.tailnum, COUNT(DISTINCT f.carrier) AS n_carrierFROM nycflights.flights fGROUPBY f.tailnum HAVINGCOUNT(DISTINCT f.carrier) >=2)SELECT p.*FROM nycflights.planes p NATURALJOIN for_hire ;
<SQL>
SELECT *
FROM `nycflights13::planes`
WHERE EXISTS (
SELECT 1 FROM (
SELECT `tailnum`
FROM `nycflights13::flights`
GROUP BY `tailnum`
HAVING (COUNT(DISTINCT `carrier`) >= 2.0)
) AS `RHS`
WHERE (`nycflights13::planes`.`tailnum` = `RHS`.`tailnum`)
)
Adding/modifying columns, mutate(...)
In the SELECT clause of an SQL query, certain columns can be computed. Verb select from dplyr does not offer this possibility. We have to add the computed columns using verb mutate and then to perform projection using select (if necessary)
Assume we want to add one day to every value in column A from \(R\) so as to obtain:
# A tibble: 10 × 3
A B C
<date> <int> <chr>
1 2024-02-04 5 j
2 2024-02-09 9 i
3 2024-02-02 3 n
4 2024-02-03 3 b
5 2024-02-08 4 l
6 2024-02-05 4 t
7 2024-02-07 3 o
8 2024-02-08 1 a
9 2024-02-04 7 j
10 2024-02-08 1 k
In SQL we can proceed like this
SELECT A +1AS A, B, CFROM R
Window functions
Asking ChatGPT we obtain
In SQL, a window function (also known as an analytic function or windowed function) is a type of function that performs a calculation across a set of rows related to the current row within a query result set. Unlike aggregation functions which collapse multiple rows into a single row, window functions operate on a “window” of rows defined by a partition or an ordering.
Key features of window functions include:
Partitioning: The window function can be partitioned by one or more columns, dividing the result set into groups or partitions. The function is applied independently within each partition.
Ordering: The window function can be ordered by one or more columns, defining the sequence of rows within each partition. This determines the rows included in the window for each calculation.
Frame specification: Optionally, a window function can have a frame specification, which further refines the rows included in the window based on their position relative to the current row.
Window functions allow for advanced analytics and reporting tasks that require comparisons or calculations across multiple rows without collapsing the result set. They can be used to compute running totals, calculate moving averages, rank rows within partitions, and perform other complex analyses.
Some common window functions in SQL include ROW_NUMBER(), RANK(), DENSE_RANK(), NTILE(), LAG(), LEAD(), SUM() OVER(), AVG() OVER(), etc.
If we focus on window queries with a single window built using PARTITION BY and ORDER BY, we just need to combine, group_by(), possibly arrange(), and mutate()
For each departure airport, and day, list the 10 most delayed flights.
In SQL we can proceed like this:
WITH f_delayed AS (SELECT f.*, RANK() OVER w AS rnkFROM nycflights.flights fWHERE f.dep_time ISNOTNULL WINDOW w AS (PARTITIONBY f.origin, f.year, f.month, f.dayORDERBY f.dep_delay DESC) )SELECT fd.origin, fd.year, fd.month, fd.day, fd.tailnumFROM f_delayed fdWHERE fd.rnk <=10;