Code
```{r}
old_theme <- theme_set(theme_minimal())
```
We will use the following packages. If needed, we install them.
```{r}
old_theme <- theme_set(theme_minimal())
```
Check nycflights13 for any explanation concerning the tables and their columns.
```{r}
flights <- nycflights13::flights
weather <- nycflights13::weather
airports <- nycflights13::airports
airlines <- nycflights13::airlines
planes <- nycflights13::planes
```
```{r}
#| eval: true
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
flights_lite <- copy_to(con, nycflights13::flights)
airports_lite <- copy_to(con, nycflights13::airports)
planes_lite <- copy_to(con, nycflights13::planes)
weather_lite <- copy_to(con, nycflights13::weather)
airlines_lite <- copy_to(con, nycflights13::airlines)
```
```{r}
flights_lite %>%
select(contains("delay")) %>%
show_query()
```
<SQL>
SELECT `dep_delay`, `arr_delay`
FROM `nycflights13::flights`
View data in spreadsheet style.
```{r}
#| eval: false
View(flights)
```
Ask for help about table flights
dplyr
way)Find all flights that
IAH
or HOU
)Package stringr
could be useful.
```{r}
airlines %>%
filter(stringr::str_starts(name, "United") |
stringr::str_starts(name, "American") |
stringr::str_starts(name, "Delta"))
airlines %>%
filter(stringr::str_detect(name, ("United|American|Delta"))) %>%
pluck("carrier")
```
# A tibble: 3 × 2
carrier name
<chr> <chr>
1 AA American Airlines Inc.
2 DL Delta Air Lines Inc.
3 UA United Air Lines Inc.
[1] "AA" "DL" "UA"
```{r}
airlines_lite %>%
filter(stringr::str_starts(name, "United") |
stringr::str_starts(name, "American") |
stringr::str_starts(name, "Delta")) %>%
show_query()
```
<SQL>
SELECT *
FROM `nycflights13::airlines`
WHERE (str_starts(`name`, 'United') OR str_starts(`name`, 'American') OR str_starts(`name`, 'Delta'))
SELECT *
FROM `nycflights13::airlines`
WHERE "name" LIKE 'United%' OR
"name" LIKE 'American%' OR
"name" LIKE 'Delta%' ;
When manipulating temporal information (date, time, duration), keep an eye on what lubridate
offers. The API closely parallels what RDMS and Python offer.
origin
have a missing dep_time
?The introduction to tidyselect
is a must read.
```{r}
not_cancelled <- flights %>%
filter(!is.na(dep_time))
```
flight
report the number of missing values.arrange()
to sort all missing values to the start? (Hint: use is.na()
).dep_delay
)The database provides all we need with columns distance
and air_time
. Otherwise, with the positions of airports from table airports
, we should be able to compute distances using :
‘Haversine’ formula.
dep_time
, dep_delay
, arr_time
, and arr_delay
from flights.
select()
call?any_of()
function do? Why might it be helpful in conjunction with this vector?vars <- c("year", "month", "day", "dep_delay", "arr_delay")
```{r}
#| eval: true
select(flights, contains("TIME", ignore.case =TRUE)) %>%
head()
```
# A tibble: 6 × 6
dep_time sched_dep_time arr_time sched_arr_time air_time time_hour
<int> <int> <int> <int> <dbl> <dttm>
1 517 515 830 819 227 2013-01-01 05:00:00
2 533 529 850 830 227 2013-01-01 05:00:00
3 542 540 923 850 160 2013-01-01 05:00:00
4 544 545 1004 1022 183 2013-01-01 05:00:00
5 554 600 812 837 116 2013-01-01 06:00:00
6 554 558 740 728 150 2013-01-01 05:00:00
dep_time
and sched_dep_time
are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.air_time
with arr_time - dep_time
. What do you expect to see? What do you see? What do you need to do to fix it?dep_time
, sched_dep_time
, and dep_delay.
How would you expect those three numbers to be related?Carefully read the documentation for min_rank()
.
```{r}
flights %>%
group_by(dest) %>%
summarise(n_cancelled = sum(is.na(dep_time)))
```
# A tibble: 105 × 2
dest n_cancelled
<chr> <int>
1 ABQ 0
2 ACK 0
3 ALB 20
4 ANC 0
5 ATL 317
6 AUS 21
7 AVL 12
8 BDL 31
9 BGR 15
10 BHM 25
# ℹ 95 more rows
```{r}
flights_lite %>%
group_by(dest) %>%
summarise(n_cancelled = sum(is.na(dep_time))) %>%
show_query()
```
Warning: Missing values are always removed in SQL aggregation functions.
Use `na.rm = TRUE` to silence this warning
This warning is displayed once every 8 hours.
<SQL>
SELECT `dest`, SUM((`dep_time` IS NULL)) AS `n_cancelled`
FROM `nycflights13::flights`
GROUP BY `dest`
not_cancelled %>% count(dest)
and (without using
count()`).(is.na(dep_delay) | is.na(arr_delay) )
is slightly suboptimal. Why? Which is the most important column?Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights %>% group_by(carrier, dest) %>% summarise(n())
)
sort
argument to count()
do. When might you use it?carriers
serve all destination airports (in the table) ?tailnum
) has the worst on-time record amongst planes with at least ten flights?Using dplyr
, it is easy. See A second look at group_by
lag()
, explore how the delay of a flight is related to the delay of the immediately preceding flight.lag()
is an example of window function. If we were using SQL
, we would define a WINDOW
using an expression like
As (PARTITION BY origin ORDER BY year, month, day, sched_dep_time) WINDOW w
Something still needs fixing here: some flights never took off (is.na(dep_time)
). Should they be sided out? assigned an infinite departure delay?
Consider all flights with average speed above \(950\text{km/h}\) as suspicious.
Let us visualize destinations and origins of the speedy flights.
Assume a plane is characterized by tailnum
. Some flights have no tailnum
. We ignore them.