We will reproduce the animated demonstration using
ggplot2: an implementation of the grammar of graphics in R
plotly: a bridge between R and the JavaScript library plotly.js, which is itself built on D3.js
Using plotly and opting for HTML output brings the possibility of interactivity and animation
Install and load packages
Code
require("gapminder")
Insist on the difference between installing and loading a package
How do we get the list of installed packages?
How do we get the list of loaded packages?
Which objects are made available by a package?
solution
The (usually very long) list of installed packages can be obtained by a simple function call.
Code
df <- installed.packages()
head(df)
Package
addinexamples "addinexamples"
alphavantager "alphavantager"
anytime "anytime"
arrow "arrow"
askpass "askpass"
assertthat "assertthat"
LibPath Version
addinexamples "/home/boucheron/R/x86_64-pc-linux-gnu-library/4.1" "0.1.0"
alphavantager "/home/boucheron/R/x86_64-pc-linux-gnu-library/4.1" "0.1.3"
anytime "/home/boucheron/R/x86_64-pc-linux-gnu-library/4.1" "0.3.9"
arrow "/home/boucheron/R/x86_64-pc-linux-gnu-library/4.1" "13.0.0.1"
askpass "/home/boucheron/R/x86_64-pc-linux-gnu-library/4.1" "1.2.0"
assertthat "/home/boucheron/R/x86_64-pc-linux-gnu-library/4.1" "0.2.1"
Priority Depends
addinexamples NA "R (>= 3.0.0)"
alphavantager NA "R (>= 3.3.0)"
anytime NA "R (>= 3.2.0)"
arrow NA "R (>= 3.4)"
askpass NA NA
assertthat NA NA
Imports
addinexamples "shiny (>= 0.13), miniUI (>= 0.1.1), rstudioapi (>= 0.4),\nformatR"
alphavantager "dplyr (>= 0.7.0), glue (>= 1.1.1), httr (>= 1.2.1), jsonlite\n(>= 1.5), purrr (>= 0.2.2.2), readr (>= 1.1.1), stringr (>=\n1.2.0), tibble (>= 1.3.3), tidyr (>= 0.6.3), timetk (>=\n0.1.1.1)"
anytime "Rcpp (>= 0.12.9)"
arrow "assertthat, bit64 (>= 0.9-7), glue, methods, purrr, R6, rlang\n(>= 1.0.0), stats, tidyselect (>= 1.0.0), utils, vctrs"
askpass "sys (>= 2.1)"
assertthat "tools"
LinkingTo
addinexamples NA
alphavantager NA
anytime "Rcpp (>= 0.12.9), BH"
arrow "cpp11 (>= 0.4.2)"
askpass NA
assertthat NA
Suggests
addinexamples NA
alphavantager "testthat, knitr"
anytime "tinytest (>= 1.0.0), gettz"
arrow "blob, cli, DBI, dbplyr, decor, distro, dplyr, duckdb (>=\n0.2.8), hms, jsonlite, knitr, lubridate, pillar, pkgload,\nreticulate, rmarkdown, stringi, stringr, sys, testthat (>=\n3.1.0), tibble, tzdb, withr"
askpass "testthat"
assertthat "testthat, covr"
Enhances License License_is_FOSS
addinexamples NA "MIT + file LICENSE" NA
alphavantager NA "GPL (>= 3)" NA
anytime NA "GPL (>= 2)" NA
arrow NA "Apache License (>= 2.0)" NA
askpass NA "MIT + file LICENSE" NA
assertthat NA "GPL-3" NA
License_restricts_use OS_type MD5sum NeedsCompilation Built
addinexamples NA NA NA "no" "4.1.2"
alphavantager NA NA NA "no" "4.1.2"
anytime NA NA NA "yes" "4.1.2"
arrow NA NA NA "yes" "4.1.2"
askpass NA NA NA "yes" "4.1.2"
assertthat NA NA NA "no" "4.1.2"
Note that the output is tabular (it is a matrix, hence an array) and contains much more than the names of installed packages. If we just want the names of the installed packages, we can extract the column named Package.
Matrices and arrays represent mathematical objects and are fit for computations. They are not so convenient as far as querying is concerned. Dataframes, which are also tabular objects, can be queried like tables in a relational database.
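As a minimal sketch of the two remarks above, the Package column can be extracted by matrix indexing, and the matrix can be converted into a dataframe to make querying easier. The names pkg_names, pkg_df and compiled are illustrative and do not come from the original material.

```r
# installed.packages() returns a character matrix with one row per package
df <- installed.packages()

# Extract the column named "Package" by matrix indexing
pkg_names <- unname(df[, "Package"])
head(pkg_names)

# Drop row names first: a package may be installed in several libraries
rownames(df) <- NULL

# Convert to a data frame so that columns can be queried by name
pkg_df <- as.data.frame(df, stringsAsFactors = FALSE)

# Example query: packages that contain compiled code
compiled <- subset(pkg_df, NeedsCompilation == "yes",
                   select = c(Package, Version))
head(compiled)
```

Only base R is used here, so the sketch should run in any session.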
Loading a package amounts to making a number of objects available in the current session. The objects are made available through namespaces.
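The remaining questions (which packages are loaded, which objects a package provides) can be explored with a few base R calls. This is only a sketch, using the base package stats as an example because it is attached by default in a standard session.

```r
# Packages attached to the search path of the current session
search()

# Namespaces that are loaded (attached or not)
loadedNamespaces()

# Objects exported by an attached package, here the base package stats
head(ls("package:stats"))

# An object can be reached through its namespace without attaching the package
stats::median(c(1, 5, 9))

# requireNamespace() loads a namespace without attaching it and returns
# FALSE (instead of raising an error) when the package is not installed
requireNamespace("stats", quietly = TRUE)
```

Installing a package copies it into a library on disk once; loading it has to be repeated in every session where its objects are needed.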
# A tibble: 6 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
Even an empty dataframe has a schema:
Code
gapminder %>%
  head(0) %>%
  glimpse()
Rows: 0
Columns: 6
$ country <fct>
$ continent <fct>
$ year <int>
$ lifeExp <dbl>
$ pop <int>
$ gdpPercap <dbl>
Code
glimpse(head(gapminder, 0))
Rows: 0
Columns: 6
$ country <fct>
$ continent <fct>
$ year <int>
$ lifeExp <dbl>
$ pop <int>
$ gdpPercap <dbl>
solution
The schema of a dataframe/tibble is the list of column names and classes. The content of a dataframe is made of the rows. A dataframe may have empty content (zero rows) and still carry its schema.
Code
gapminder %>%
  filter(FALSE) %>%
  glimpse()
Rows: 0
Columns: 6
$ country <fct>
$ continent <fct>
$ year <int>
$ lifeExp <dbl>
$ pop <int>
$ gdpPercap <dbl>
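The schema can also be computed as an ordinary R object. The sketch below assumes the gapminder package is installed; the names empty and schema are illustrative.

```r
library(gapminder)

# Keep zero rows: the content is empty but the columns remain
empty <- head(gapminder, 0)

# The schema: one class per column name
schema <- vapply(empty, function(col) class(col)[1], character(1))
schema

# The schema does not depend on the content
identical(schema,
          vapply(gapminder, function(col) class(col)[1], character(1)))
```

Note that glimpse() reports storage types (such as dbl) whereas class() reports numeric for the same columns.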
Get a feeling of the dataset
Pick two random rows for each continent using slice_sample()
solution
To pick a slice at random, we can use function slice_sample. We can even perform sampling within groups defined by the value of a column.
Code
gapminder %>%
  slice_sample(n = 2, by = continent)
# A tibble: 10 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Japan Asia 1952 63.0 86459025 3217.
2 Indonesia Asia 2002 68.6 211060000 2874.
3 Norway Europe 1982 76.0 4114787 26299.
4 Albania Europe 1962 64.8 1728137 2313.
5 Angola Africa 1992 40.6 8735988 2628.
6 Djibouti Africa 1987 50.0 311025 2880.
7 Costa Rica Americas 1992 75.7 3173216 6160.
8 Cuba Americas 1977 72.6 9537988 6380.
9 New Zealand Oceania 1982 73.8 3210650 17632.
10 Australia Oceania 1992 77.6 17481977 23425.
Code
# or equivalently
gapminder %>%
  group_by(continent) %>%
  slice_sample(n = 2)
# A tibble: 10 × 6
# Groups: continent [5]
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Niger Africa 1982 42.6 6437188 910.
2 Zambia Africa 1977 51.4 5216550 1589.
3 Argentina Americas 1977 68.5 26983828 10079.
4 Panama Americas 1992 72.5 2484997 6619.
5 Oman Asia 1962 43.2 628164 2925.
6 China Asia 1967 58.4 754550000 613.
7 Serbia Europe 2007 74.0 10150265 9787.
8 Iceland Europe 1987 77.2 244676 26923.
9 Australia Oceania 1967 71.1 11872264 14526.
10 New Zealand Oceania 2002 79.1 3908037 23190.
gapminder is redundant: column country completely determines the content of column continent. In database parlance, we have a functional dependency country → continent, whereas the key of the table is made of columns country and year.
Table gapminder is not in Boyce-Codd Normal Form (BCNF), not even in Third Normal Form (3NF).
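Both claims (the functional dependency and the key) can be checked on the data. The following is a sketch assuming dplyr and gapminder are installed; fd_violations and key_violations are illustrative names.

```r
library(dplyr)
library(gapminder)

# country -> continent holds if no country is paired with two continents
fd_violations <- gapminder %>%
  distinct(country, continent) %>%
  count(country) %>%
  filter(n > 1)
nrow(fd_violations)

# (country, year) is a key if no pair occurs more than once
key_violations <- gapminder %>%
  count(country, year) %>%
  filter(n > 1)
nrow(key_violations)
```

Both tables of violations are expected to be empty. Since continent depends on only a part of the key (country), the dependency is what database texts call a partial dependency.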
Gapminder tibble (extract)
Extract/filter a subset of rows using dplyr::filter(...)
solution
Code
gapminder %>%
  filter(country == 'France') %>%
  head()
# A tibble: 6 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 France Europe 1952 67.4 42459667 7030.
2 France Europe 1957 68.9 44310863 8663.
3 France Europe 1962 70.5 47124000 10560.
4 France Europe 1967 71.6 49569000 13000.
5 France Europe 1972 72.4 51732000 16107.
6 France Europe 1977 73.8 53165019 18293.
Note that equality testing is performed using ==, not = (which is used for assignment and for passing named arguments).
Filtering (selection \(σ\) from database theory): Picking one year of data
There is a simple way to filter rows satisfying some condition. It consists in mimicking indexing in a matrix, leaving the column index empty and replacing the row index by a condition statement (a logical expression), also called a mask.
Have a look at gapminder$year==2002. What is the type/class of this expression?
This is possible in base R and very often convenient.
Nevertheless, this way of performing row filtering does not emphasize the connection between the dataframe and the condition. Any logical vector with the right length could be used as a mask. Moreover, this way of performing filtering is not very functional.
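The base R approach described above can be sketched as follows, assuming the gapminder package is installed. The names mask and gapminder_2002_base are illustrative.

```r
library(gapminder)

# The mask: one logical value per row
mask <- gapminder$year == 2002
class(mask)
length(mask) == nrow(gapminder)

# Matrix-style indexing: condition as row index, column index left empty
gapminder_2002_base <- gapminder[mask, ]
nrow(gapminder_2002_base)
```

Nothing ties mask to gapminder: any logical vector of the same length would be accepted as a row index, which is the weakness pointed out in the text.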
In the parlance of Relational Algebra, filter performs a selection of rows. Relational expression \[σ_{\text{condition}}(\text{Table})\] translates to
Code
filter(Table, condition)
where \(\text{condition}\) is a boolean expression that can be evaluated on each row of \(\text{Table}\). In SQL, the relational expression would translate into SELECT * FROM Table WHERE condition.
# A tibble: 142 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 2002 42.1 25268405 727.
2 Albania Europe 2002 75.7 3508512 4604.
3 Algeria Africa 2002 71.0 31287142 5288.
4 Angola Africa 2002 41.0 10866106 2773.
5 Argentina Americas 2002 74.3 38331121 8798.
6 Australia Oceania 2002 80.4 19546792 30688.
7 Austria Europe 2002 79.0 8148312 32418.
8 Bahrain Asia 2002 74.8 656397 23404.
9 Bangladesh Asia 2002 62.0 135656790 1136.
10 Belgium Europe 2002 78.3 10311970 30486.
# ℹ 132 more rows
Note that in stating the condition, we simply write year==2002 even though year is not the name of an object in our current session. This is possible because filter() uses data masking: year is understood as a column of gapminder.
The ability to use data masking is one of the great strengths of the R programming language.
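The chunk that creates gapminder_2002 does not appear in this extract. A plausible sketch, assuming it is obtained with dplyr::filter(), is given below. The variable yoi and the use of the .env pronoun are illustrative additions that show how a session variable can be distinguished from a column.

```r
library(dplyr)
library(gapminder)

# year is looked up among the columns of gapminder (data masking)
gapminder_2002 <- gapminder %>%
  filter(year == 2002)

# A variable from the calling environment can be made explicit with .env
yoi <- 2002
n_rows <- gapminder %>%
  filter(year == .env$yoi) %>%
  nrow()
n_rows
```

Both filters select the same rows: one row per country for the year 2002.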
Static plotting: First attempt
Define a plot with respect to gapminder_2002
solution
Code
p <- gapminder_2002 %>%
  ggplot()
You should define a ggplot object with data layer gapminder_2002 and call this object p for further reuse.
Map variables gdpPercap and lifeExp to axes x and y
solution
Code
p <- p +
  aes(x = gdpPercap, y = lifeExp)
p
Use ggplot object p and add a global aesthetic mapping of gdpPercap and lifeExp to axes x and y (using + from ggplot2).
For each row, draw a point at coordinates defined by the mapping
solution
Code
p +
  geom_point()
You need to add a geom_ layer to your ggplot object, in this case geom_point() will do.
We are building a graphical object (a ggplot object) around a data frame (gapminder)
We supply aesthetic mappings (aes()) that can be either global or bound to some geometries (geom_point()) or statistics
The global aesthetic mapping defines which columns are
mapped to which axes,
possibly mapped to colours, linetypes, shapes, …
Geometries and Statistics describe the building blocks of graphics
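The difference between a global mapping and a mapping bound to a geometry can be sketched as follows, assuming dplyr, ggplot2 and gapminder are installed. The names p_global and p_local are illustrative, and gapminder_2002 is rebuilt here so that the chunk is self-contained.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

gapminder_2002 <- filter(gapminder, year == 2002)

# Global mapping: inherited by every layer added afterwards
p_global <- ggplot(gapminder_2002, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

# Layer-bound mapping: only geom_point() sees it
p_local <- ggplot(gapminder_2002) +
  geom_point(aes(x = gdpPercap, y = lifeExp))

# Both specifications place the points at the same coordinates
all.equal(layer_data(p_global)[c("x", "y")],
          layer_data(p_local)[c("x", "y")])
```

The two plots look identical; they differ only once a second layer is added, since that layer inherits the mapping in the first case and not in the second.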
What’s missing here?
When comparing to the Gapminder demonstration, we can spot that
colors are missing
bubble sizes are all the same. They should reflect the population size of the country
titles and legends are missing. This means the graphic object is useless.
We will add layers to the graphical object to complete the plot
Second attempt: display more information
Map continent to color (use aes())
Map pop to bubble size (use aes())
Make points transparent by tuning alpha (inside geom_point()) to mitigate overplotting
solution
Code
p <- p +
  aes(color = continent, size = pop) +
  geom_point(alpha = .5)
p
solution
In this enrichment of the graphical object, guides have been automatically added for two aesthetics: color and size. Those two guides are deemed necessary since the reader has no way to guess the mapping from the five levels of continent to color (the color scale), and the reader needs help to connect population size and bubble size.
ggplot2 provides us with helpers to fine tune guides.
The scalings on the x and y axis do not deserve guides: the ticks along the coordinate axes provide enough information.
Scaling
In order to pay tribute to Hans Rosling, we need to take care of two scaling issues:
the GDP per capita axis should be logarithmic: scale_x_log10()
the area of the point should be proportional to the population: scale_size_area()
solution
Code
p <- p +
  scale_x_log10() +
  scale_size_area()
p
Motivate the proposed scalings.
Why is it important to use logarithmic scaling for gdp per capita?
When is it important to use logarithmic scaling on some axis (in other contexts)?
Why is it important to specify scale_size_area() ?
solution
Code
p +
  scale_radius()
Scale for size is already present.
Adding another scale for size, which will replace the existing scale.
Scale for size is already present.
Adding another scale for size, which will replace the existing scale.
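The object ptchwrk used in the next chunk is not defined in this extract. One possible construction, assuming the patchwork package is installed, places the two scalings side by side. The name base_plot and the panel titles are illustrative, and the second panel uses scale_radius() as in the chunk above (the annotation title in the next chunk mentions scale_size, so the original may have differed).

```r
library(dplyr)
library(ggplot2)
library(gapminder)
library(patchwork)

gapminder_2002 <- filter(gapminder, year == 2002)

# Common specification, without any size scale yet
base_plot <- ggplot(gapminder_2002,
                    aes(x = gdpPercap, y = lifeExp, size = pop)) +
  geom_point(alpha = .5) +
  scale_x_log10()

# Left: point area proportional to population
# Right: point radius mapped linearly to population
ptchwrk <- (base_plot + scale_size_area() + ggtitle("scale_size_area()")) +
  (base_plot + scale_radius() + ggtitle("scale_radius()"))
```

Building both panels from base_plot, which carries no size scale, also avoids the "Scale for size is already present" message.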
Code
ptchwrk +
  plot_annotation(
    title = 'Comparing scale_size_area and scale_size',
    caption = 'In the current setting, scale_size_area() should be favored'
  )
In perspective
Add a plot title
Make axes titles
explicit
readable
Use labs(...)
solution
Code
yoi <- 2002
p <- p +
  labs(
    title = glue('The world in year {yoi}'),
    x = "Gross Domestic Product per capita (US$ 2009, corrected for PPP)",
    y = "Life expectancy at birth"
  )
p
solution
We should also fine tune the guides: replace pop by Population and titlecase continent.
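A minimal sketch of this fine tuning relies on labs(), which accepts one argument per aesthetic and uses it as the title of the corresponding guide. The plot is rebuilt from scratch so that the chunk is self-contained, and the name p_guides is illustrative.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

p_guides <- gapminder %>%
  filter(year == 2002) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) +
  geom_point(alpha = .5) +
  scale_x_log10() +
  scale_size_area() +
  # Guide titles: one argument per aesthetic
  labs(color = "Continent", size = "Population")

p_guides
```

The same titles could presumably be set through the name argument of the corresponding scale functions; labs() keeps all titles in one place.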
What should be the respective purposes of Title, Subtitle, Caption, … ?
Theming using ggthemes (or not)
Theming
Code
require("ggthemes")
Look at the online help on pacman::p_load(), how does pacman::p_load() relate to require() and library()?
A theme defines the look and feel of plots
Within a single document, we should use only one theme
p <- p +
  scale_size_area(max_size = 15) +
  scale_color_manual(values = neat_color_scale)
Scale for size is already present.
Adding another scale for size, which will replace the existing scale.
Code
p
Choosing a color scale is a difficult task
viridis is often a good pick.
solution
Minimalist themes are often a good pick.
Code
p <- p +
  scale_size_area(
    max_size = 15,
    labels = scales::label_number(scale = 1/1e6, suffix = " M")
  ) +
  scale_color_manual(values = neat_color_scale) +
  theme_minimal() +
  labs(
    title = glue("Gapminder {min(gapminder$year)}-{max(gapminder$year)}"),
    x = "Yearly Income per Capita",
    y = "Life Expectancy",
    caption = "From sick and poor (bottom left) to healthy and rich (top right)"
  )
Scale for size is already present.
Adding another scale for size, which will replace the existing scale.
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Code
p +
  theme(legend.position = "none")
Zooming on a continent
Code
zoom_continent <- 'Europe'  # choose another continent at your convenience
As all rows in gapminder_2002 relate to year 2002, we need to rebuild the graphical object along the same lines (using the same graphical pipeline) but starting from the whole gapminder dataset.
Should we do this using cut and paste?
No
Don’t Repeat Yourself (DRY)
Abide by the DRY principle using operator %+%: the ggplot2 object p can be fed with another dataframe, and all you need is proper faceting.
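The idea can be sketched as follows, assuming dplyr, ggplot2 and gapminder are installed. The plot p is rebuilt in a reduced form so that the chunk is self-contained, faceting by year is one possible reading of "proper faceting", and the name p_zoom is illustrative.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

zoom_continent <- "Europe"

# A plot specification built once, on the 2002 extract
p <- gapminder %>%
  filter(year == 2002) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) +
  geom_point(alpha = .5) +
  scale_x_log10() +
  scale_size_area()

# Swap in another data frame with %+%, then facet by year
p_zoom <- p %+% filter(gapminder, continent == zoom_continent) +
  facet_wrap(vars(year))

p_zoom
```

Since %+% binds more tightly than +, the data are replaced first and the facets are added to the resulting plot; mappings, scales and geometries are reused without any cut and paste.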