In this lab, we load the data from the hard drive. The data are read from a file located somewhere in our directory tree. Loading requires determining the correct filepath. This filepath is often relative: it is resolved against the directory where the R session or the R script was launched. Base R offers functions that help you find your way around the directory tree.
getwd() # Where are we?
## [1] "/home/boucheron/Documents/COURS/EDA_LABS"
head(list.files()) # List the files in the current directory
## [1] "_extensions" "_handout" "_handout_fr"
## [4] "_handout_solution" "_metadata.yml" "_quarto-french.yml"
head(list.dirs()) # List sub-directories
## [1] "."
## [2] "./_extensions"
## [3] "./_extensions/quarto-ext"
## [4] "./_extensions/quarto-ext/fontawesome"
## [5] "./_extensions/quarto-ext/fontawesome/assets"
## [6] "./_extensions/quarto-ext/fontawesome/assets/css"
Objectives
In this lab, we continue our walk through univariate analysis by introducing univariate analysis for categorical variables.
This amounts to exploring, summarizing, visualizing categorical columns of a dataset.
This also often involves table wrangling: retyping some columns, relabelling, reordering, lumping levels of factors, that is factor re-engineering.
Summarizing univariate categorical samples amounts to counting the number of occurrences of levels in the sample.
Visualizing categorical samples starts with
Bar plots
Column plots
This exploratory work seldom makes it to the final report. Nevertheless, it has to be done in an efficient, reproducible way.
This is an opportunity to introduce the DRY principle.
At the end, we shall see that skimr::skim() can be very helpful.
Dataset Recensement (Census, bis)
Since 1948, the US Census Bureau has carried out a monthly Current Population Survey, collecting data on residents aged above 15 from \(150\,000\) households. This survey is one of the most important sources of information about the American workforce. Data reported in file Recensement.txt originate from the 2012 census.
Dataset Recensement can be found in file Recensement.csv in your DATA repository.
Have a look at the text file. Choose a loading function for each format. The RStudio IDE provides a valuable helper.
Load the data into the session environment and call it df.
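The loading step could be sketched as follows. This is only an illustration: it assumes a comma-separated file with a header row, and it reads a tiny made-up file written to a temporary directory rather than the real Recensement.csv (the name `df_toy` and the values are hypothetical).

```r
# Toy stand-in for DATA/Recensement.csv (values are made up for illustration)
path <- file.path(tempdir(), "Recensement_toy.csv")
writeLines(c("AGE,SEXE,REGION", "58,F,NE", "40,M,W", "29,M,S"), path)

# Base R reader; readr::read_csv(path) would return a tibble instead
df_toy <- read.csv(path, stringsAsFactors = FALSE)

# Inspect column names and types before going further
str(df_toy)
```

If the real file uses another separator (tabulation, semicolon), `read.table()` or `read.delim()` with the appropriate `sep` argument may be a better fit.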
education_lookup <- c(
  "32" = "<= 4 years schooling",
  "33" = "between 5 and 6 years",
  "34" = "between 7 and 8 years",
  "35" = "9 years schooling",
  "36" = "10 years schooling",
  "37" = "11 years schooling",
  "38" = "12 years schooling, no diploma",
  "39" = "12 years schooling, HS diploma",
  "40" = "College without diploma",
  "41" = "Associate degree, vocational",
  "42" = "Associate degree, academic",
  "43" = "Bachelor",
  "44" = "Master",
  "45" = "Specific School Diploma",
  "46" = "PhD"
)
code_education <- vector2tibble(education_lookup)
Which columns should be considered as categorical/factor?
Deciding which variables are categorical sometimes requires judgement.
Let us attempt to base the decision on a checkable criterion: determine the number of distinct values in each column, and consider those columns with fewer than 20 distinct values as factors.
We can find the names of the columns with few unique values by iterating over the column names.
Note that columns NB_PERS and NB_ENF have few unique values and nevertheless we could consider them as quantitative.
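A possible sketch of this criterion is given below. It runs on a small made-up data frame (`df_toy`, hypothetical values), so the threshold is lowered from 20 to 4 to make the selection visible.

```r
library(dplyr)

# Toy data frame standing in for df (values are made up)
df_toy <- tibble(
  AGE    = c(58, 40, 29, 59, 51, 19),
  SEXE   = c("F", "M", "M", "M", "M", "M"),
  NB_ENF = c(0, 2, 1, 0, 0, 0)
)

# Number of distinct values in each column
n_values <- sapply(df_toy, n_distinct)

# Columns below the threshold (20 in the text, 4 here because the toy data are tiny)
few_values <- names(n_values)[n_values < 4]
few_values
```

As with NB_PERS and NB_ENF in the real data, a count column such as NB_ENF passes the test even though it could be treated as quantitative, so the output still has to be reviewed by hand.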
Coerce the relevant columns as factors.
Use dplyr and forcats verbs to perform this coercion.
Use the across() construct to perform a kind of tidy selection (as with select) with verb mutate.
You may use forcats::as_factor() to transform columns when needed.
Verb dplyr::mutate is a convenient way to modify a dataframe.
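One way to combine mutate(), across() and forcats::as_factor() is sketched below on made-up data. Selecting with `where(is.character)` is an assumption made for the sketch; in the lab, the selection would rather come from the list of columns with few distinct values.

```r
library(dplyr)
library(forcats)

# Toy data frame standing in for df (values are made up)
df_toy <- tibble(
  AGE    = c(58, 40, 29),
  SEXE   = c("F", "M", "M"),
  REGION = c("NE", "W", "S")
)

# Coerce every character column to a factor, leaving numeric columns untouched
df_toy <- df_toy %>%
  mutate(across(where(is.character), as_factor))

# as_factor() orders levels by first appearance, not alphabetically
levels(df_toy$REGION)
```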
Relabel the levels of REV_FOYER using the breaks.
Relabel the levels of the different factors so as to make the data more readable.
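Relabelling can be done with forcats::fct_recode(), as in the sketch below. The new labels ("Female", "Male") are hypothetical choices made for the illustration.

```r
library(forcats)

# Toy factor standing in for df$SEXE
sexe <- factor(c("F", "M", "M", "F"))

# fct_recode() takes new_label = "old_label" pairs
sexe_readable <- fct_recode(sexe, Female = "F", Male = "M")

levels(sexe_readable)
table(sexe_readable)
```

A named lookup vector such as education_lookup can be applied in a similar spirit, since it already maps codes to readable labels.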
Search for missing data (optional)
Check whether some columns contain missing data (use is.na).
Useful functions:
dplyr::summarise
across
tidyr::pivot_longer
dplyr::arrange
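The functions listed above can be chained as in the following sketch, which counts missing values per column on a made-up data frame (`df_toy` and its values are hypothetical).

```r
library(dplyr)
library(tidyr)

# Toy data frame with a few missing values (made up)
df_toy <- tibble(
  AGE    = c(58, NA, 29),
  SEXE   = c("F", "M", NA),
  REGION = c("NE", NA, NA)
)

na_counts <- df_toy %>%
  # One row, one column per variable: number of NAs
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  # Reshape to one row per variable
  pivot_longer(everything(), names_to = "column", values_to = "n_missing") %>%
  # Most incomplete columns first
  arrange(desc(n_missing))

na_counts
```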
SEXE
Counting
Use table, prop.table from base R to compute the frequencies and proportions of the different levels. In statistics, the result of table() is a (one-way) contingency table.
What is the class of the object generated by table? Is it a vector, a list, a matrix, an array?
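A short sketch on a made-up factor (`sexe`, hypothetical values) shows how these questions can be explored from the console.

```r
# Toy factor standing in for df$SEXE
sexe <- factor(c("F", "M", "M", "F", "M"))

# One-way contingency table: counts per level
ta_toy <- table(sexe)
ta_toy

# Proportions instead of counts
prop_toy <- prop.table(ta_toy)
prop_toy

# Inspect the nature of the object
class(ta_toy)
is.array(ta_toy)
dim(ta_toy)
```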
as.data.frame() (or as_tibble) can transform a table object into a dataframe.
ta <- rename(as.data.frame(ta), SEXE = `.`)
ta
SEXE Freq
1 F 297
2 M 302
You may use knitr::kable(), possibly knitr::kable(., format="markdown"), to tweak the output.
In order to feed ggplot with a contingency table, it is useful to build contingency tables as dataframes. Use dplyr::count() to do this.
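A minimal sketch with dplyr::count() on made-up data follows; the `prop` column is an optional addition made for the illustration.

```r
library(dplyr)

# Toy data frame standing in for df (values are made up)
df_toy <- tibble(SEXE = factor(c("F", "M", "M", "F", "M")))

# count() returns a data frame with one row per level and a column n
sexe_counts <- df_toy %>%
  count(SEXE) %>%
  mutate(prop = n / sum(n))

sexe_counts
```

Because the result is an ordinary data frame, it can be passed directly to ggplot() with geom_col().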
skimr::skim() allows us to perform univariate categorical analysis all at once.
df %>% skimr::skim(where(is.factor))
Data summary

| | |
|---|---|
| Name | Piped data |
| Number of rows | 599 |
| Number of columns | 11 |
| Column type frequency: factor | 9 |
| Group variables | None |

Variable type: factor

| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| SEXE | 0 | 1 | FALSE | 2 | M: 302, F: 297 |
| REGION | 0 | 1 | FALSE | 4 | S: 200, W: 148, NE: 129, NW: 122 |
| STAT_MARI | 0 | 1 | FALSE | 5 | M: 325, C: 193, D: 61, S: 14 |
| SYNDICAT | 0 | 1 | FALSE | 2 | non: 496, oui: 103 |
| CATEGORIE | 0 | 1 | FALSE | 10 | Lib: 133, Ser: 125, Adm: 94, Sel: 48 |
| NIV_ETUDES | 0 | 1 | FALSE | 15 | 12 : 187, Col: 148, Bac: 114, Ass: 45 |
| NB_PERS | 0 | 1 | FALSE | 9 | 2: 196, 4: 130, 3: 122, 1: 63 |
| NB_ENF | 0 | 1 | FALSE | 7 | 0: 413, 1: 86, 2: 76, 3: 18 |
| REV_FOYER | 0 | 1 | FALSE | 16 | 600: 89, 750: 77, 500: 71, 400: 70 |
The output can be tailored to your specific objectives and fed to functions that are geared to displaying large tables (see packages knitr, DT, and gt).
Save the (polished) data
Saving polished data in self-documented formats can be time-saving. Base R offers the .RDS format.
df %>% saveRDS("./DATA/Recensement.RDS")
By saving into this format we can persist our work.
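The round trip can be checked with readRDS(), as in this sketch. It writes a made-up data frame to a temporary directory rather than to ./DATA (the file name and `df_toy` are hypothetical).

```r
# Toy data frame with a factor column (values are made up)
df_toy <- data.frame(SEXE = factor(c("F", "M", "M"), levels = c("F", "M")))

# Write to a temporary location, then read back
path <- file.path(tempdir(), "Recensement_toy.RDS")
saveRDS(df_toy, path)
df_back <- readRDS(path)

# Column types and factor levels survive the round trip
identical(df_toy, df_back)
str(df_back)
```

This is what distinguishes the format from a plain CSV export, where factors would come back as character columns.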
Build a barplot to visualize the distribution of the SEXE column.
Use
geom_bar (working directly with the data)
geom_col (working with a contingency table)
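Both routes are sketched below on made-up data (`df_toy`, hypothetical values): geom_bar() counts the rows itself, whereas geom_col() expects the counts to be already computed.

```r
library(dplyr)
library(ggplot2)

# Toy data frame standing in for df (values are made up)
df_toy <- tibble(SEXE = factor(c("F", "M", "M", "F", "M")))

# Route 1: geom_bar() works on the raw data and counts occurrences
p_bar <- ggplot(df_toy) +
  aes(x = SEXE) +
  geom_bar() +
  labs(title = "Distribution of SEXE (geom_bar)")

# Route 2: geom_col() works on a contingency table stored as a data frame
p_col <- df_toy %>%
  count(SEXE) %>%
  ggplot() +
  aes(x = SEXE, y = n) +
  geom_col() +
  labs(title = "Distribution of SEXE (geom_col)", y = "count")

p_bar
p_col
```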
When investigating relations between categorical columns, we will often rely on mosaicplot(). Indeed, barplot and mosaicplot belong to the collection of area plots that are used to visualize counts (statistical summaries for categorical variables).
Repeat the same operation for each qualitative variable (DRY)
Using a for loop
We have to build a barplot for each categorical variable. Here, we just have nine of them. We could do this using cut and paste, and some editing. In doing so, we would not comply with the DRY (Don’t Repeat Yourself) principle.
In order to remain DRY, we will attempt to abstract the recipe we used to build our first barplot.
This recipe is pretty simple:
Build a ggplot object with df as the data layer.
Add an aesthetic mapping a categorical column to axis x.
Add a geometry using geom_bar.
Add labels telling the reader which column is under scrutiny.
We first need to gather the names of the categorical columns. The following chunk does this in a simple way.
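A simple way to gather these names, assuming the categorical columns have already been coerced to factors, is sketched here on made-up data.

```r
library(dplyr)

# Toy data frame standing in for df (values are made up)
df_toy <- tibble(
  AGE    = c(58, 40, 29),
  SEXE   = factor(c("F", "M", "M")),
  REGION = factor(c("NE", "W", "S"))
)

# Keep factor columns only, then extract their names
cat_cols <- df_toy %>%
  select(where(is.factor)) %>%
  names()

cat_cols
```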
In the next chunk, we shall build a named list of ggplot objects consisting of barplots. The for loop body is almost obtained by cutting and pasting the recipe for the first barplot.
Note an important difference: instead of something like aes(x=col) where col denotes a column in the dataframe, we shall write aes(x=.data[[col]]) where col is a string that matches a column name. Writing aes(x=col) would not work.
The loop variable col iterates over the column names, not over the columns themselves.
When using ggplot in interactive computations, we write aes(x=col) and, under the hood, ggplot2 relies on tidy evaluation (built on R's non-standard evaluation) to map df$col to the x axis.
ggplot functions like aes() use data masking to alleviate the burden of the working Statistician.
Within the context of ggplot programming, pronoun .data refers to the data layer of the graphical object.
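The loop could be sketched as follows on made-up data. One detail is a choice made for this sketch and not necessarily the lab's own solution: the mapping is written `.data[[!!col]]`. aes() captures its arguments without evaluating them, so when plots are stored in a list and printed after the loop, a plain `.data[[col]]` would be resolved with the last value taken by col. The `!!` operator injects the current string into the mapping.

```r
library(ggplot2)

# Toy data frame standing in for df (values are made up)
df_toy <- data.frame(
  SEXE   = factor(c("F", "M", "M", "F", "M")),
  REGION = factor(c("NE", "S", "S", "S", "W"))
)
cat_cols <- c("SEXE", "REGION")

# Named list of barplots, one per categorical column
list_plots <- list()
for (col in cat_cols) {
  list_plots[[col]] <- ggplot(df_toy) +
    aes(x = .data[[!!col]]) +   # !! freezes the current value of col
    geom_bar() +
    labs(title = paste("Distribution of", col), x = col)
}

names(list_plots)
list_plots[["SEXE"]]
```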
If the labels on the x-axis are not readable, we need to tweak them. This amounts to modifying the theme layer in the ggplot object, and more specifically the axis.text.x attribute.
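A possible tweak is sketched below; the 45 degree angle and the long level names are arbitrary choices made for the illustration.

```r
library(ggplot2)

# Toy data frame with long level names (made up)
df_toy <- data.frame(
  CATEGORIE = factor(c("Administration", "Liberal profession",
                       "Services", "Services"))
)

p_rotated <- ggplot(df_toy) +
  aes(x = CATEGORIE) +
  geom_bar() +
  # Rotate the x-axis labels and right-align them under the ticks
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

p_rotated
```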
Using functional programming (lapply, purrr::...)
Another way to compute the list of graphical objects replaces the for loop by calling a functional programming tool. This mechanism relies on the fact that in R, functions are first-class objects.
Package purrr offers a large range of tools with a clean API. Base R offers lapply().
We shall first define a function that takes as arguments a dataframe, a column name, and a title. We do not perform any defensive programming. Call your function foo.
Functional programming can make code easier to understand.
Use foo, lapply or purrr::map() to build the list of graphical objects.
With purrr::map(), you may use either a formula or an anonymous function. With lapply, use an anonymous function.
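A minimal sketch of foo and of both ways of calling it is given below, on made-up data. The body of foo is only one possible implementation of the recipe, and the plot titles are hypothetical.

```r
library(ggplot2)
library(purrr)

# Toy data frame standing in for df (values are made up)
df_toy <- data.frame(
  SEXE   = factor(c("F", "M", "M", "F", "M")),
  REGION = factor(c("NE", "S", "S", "S", "W"))
)
cat_cols <- c("SEXE", "REGION")

# Barplot of one categorical column, given by name (no defensive programming)
foo <- function(df, col, title) {
  ggplot(df) +
    aes(x = .data[[col]]) +
    geom_bar() +
    labs(title = title, x = col)
}

# Base R: lapply() with an anonymous function, names added afterwards
plots_lapply <- lapply(
  cat_cols,
  function(col) foo(df_toy, col, paste("Distribution of", col))
)
names(plots_lapply) <- cat_cols

# purrr: map() with a formula; set_names() makes the result a named list
plots_map <- map(
  set_names(cat_cols),
  ~ foo(df_toy, .x, paste("Distribution of", .x))
)
```

Inside a function, each call has its own value of col, so `.data[[col]]` can be used as is.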
Package patchwork offers functions for displaying collections of related plots.
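For instance, a list of barplots can be laid out on a grid with patchwork::wrap_plots(). The sketch below uses two made-up plots; the two-column layout is an arbitrary choice.

```r
library(ggplot2)
library(patchwork)

# Toy data frame standing in for df (values are made up)
df_toy <- data.frame(
  SEXE   = factor(c("F", "M", "M", "F", "M")),
  REGION = factor(c("NE", "S", "S", "S", "W"))
)

p_sexe   <- ggplot(df_toy) + aes(x = SEXE) + geom_bar()
p_region <- ggplot(df_toy) + aes(x = REGION) + geom_bar()

# wrap_plots() accepts a list of ggplot objects
combined <- wrap_plots(list(p_sexe, p_region), ncol = 2)
combined
```

For two or three plots, the operator form `p_sexe + p_region` gives the same kind of layout.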