In Exploratory analysis of tabular data, bivariate analysis is the second step. It consists in exploring, summarizing, visualizing pairs of columns of a dataset.
Bivariate techniques depend on the types of columns we are facing.
For numerical/numerical samples
Scatter plots
Smoothed lineplots (for example linear regression)
2-dimensional density plots
For categorical/categorical samples : mosaicplots and variants
For numerical/categorical samples
Boxplots per group
Histograms per group
Density plots per group
Quantile-Quantile plots
Dataset
Once again we rely on the Census dataset.
Since 1948, the US Census Bureau carries out a monthly Current Population Survey, collecting data concerning residents aged above 15 from \(150 000\) households. This survey is one of the most important sources of information concerning the american workforce. Data reported in file Recensement.txt originate from the 2012 census.
Load the data into the session environment and call it df. Take advantage of the fact that we saved the result of our data wrangling job in a self-documented file format. Download a parquet file from the following URL:
fname <-"Recensement.parquet"datapath <-"./DATA"fpath <-paste(datapath, fname, sep="/")if (!file.exists(fpath)) {tryCatch(expr = { url <-'https://stephane-v-boucheron.fr/data/Recensement.parquet' rep <- httr::GET(url)stopifnot(rep$status_code==200) con <-file(fpath, open="wb")writeBin(rep$content, con)close(con) }, warning =function(w) {glue("Successful download but {w}") }, error =function(e) {stop("Houston, we have a problem!") # error-handler-code }, finally = {if (exists("con") &&isOpen(con)){close(con) } } )} df <- arrow::read_parquet(fpath)
Code
df |>glimpse()## Rows: 599## Columns: 11## $ AGE <dbl> 58, 40, 29, 59, 51, 19, 64, 23, 47, 66, 26, 23, 54, 44, 56,…## $ SEXE <fct> F, M, M, M, M, M, F, F, M, F, M, F, F, F, F, F, F, M, M, F,…## $ REGION <fct> NE, W, S, NE, W, NW, S, NE, NW, S, NE, NE, W, NW, S, S, NW,…## $ STAT_MARI <fct> C, M, C, D, M, C, M, C, M, D, M, C, M, C, M, C, S, M, S, C,…## $ SAL_HOR <dbl> 13.25, 12.50, 14.00, 10.60, 13.00, 7.00, 19.57, 13.00, 20.1…## $ SYNDICAT <fct> non, non, non, oui, non, non, non, non, oui, non, non, non,…## $ CATEGORIE <fct> "Administration", "Building ", "Administration", "Services"…## $ NIV_ETUDES <fct> "Bachelor", "12 years schooling, no diploma", "Associate de…## $ NB_PERS <fct> 2, 2, 2, 4, 8, 6, 3, 2, 3, 1, 3, 2, 6, 5, 4, 4, 3, 2, 3, 2,…## $ NB_ENF <fct> 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…## $ REV_FOYER <fct> [35000-40000), [17500-20000), [75000-1e+05), [17500-20000),…df |>head()## # A tibble: 6 × 11## AGE SEXE REGION STAT_MARI SAL_HOR SYNDICAT CATEGORIE NIV_ETUDES NB_PERS## <dbl> <fct> <fct> <fct> <dbl> <fct> <fct> <fct> <fct> ## 1 58 F NE C 13.2 non "Administrat… Bachelor 2 ## 2 40 M W M 12.5 non "Building " 12 years … 2 ## 3 29 M S C 14 non "Administrat… Associate… 2 ## 4 59 M NE D 10.6 oui "Services" 12 years … 4 ## 5 51 M W M 13 non "Services" 9 years s… 8 ## 6 19 M NW C 7 non "Services" 12 years … 6 ## # ℹ 2 more variables: NB_ENF <fct>, REV_FOYER <fct>
Categorical/Categorical pairs
Code
df |>select(where(is.factor)) |>head()
# A tibble: 6 × 9
SEXE REGION STAT_MARI SYNDICAT CATEGORIE NIV_ETUDES NB_PERS NB_ENF REV_FOYER
<fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct>
1 F NE C non "Administ… Bachelor 2 0 [35000-4…
2 M W M non "Building… 12 years … 2 0 [17500-2…
3 M S C non "Administ… Associate… 2 0 [75000-1…
4 M NE D oui "Services" 12 years … 4 1 [17500-2…
5 M W M non "Services" 9 years s… 8 1 [75000-1…
6 M NW C non "Services" 12 years … 6 0 [1e+05-1…
Explore the connection between CATEGORIE and SEX. Compute the 2-ways contingency table using table(), and count() from dplyr.
Use tibble::as_tibble() to transform the output of table() into a dataframe/tibble.
Use tidyr::pivot_wider() so as to obtain a wide (but messy) tibble with the same the same shape as the output of table(). Can you spot a difference?
SEXE
CATEGORIE F M
Business, Management and Finance 23 23
Liberal profession 82 51
Services 75 50
Selling 30 18
Administration 72 22
Agriculture, Fishing, Forestry 2 8
Building 0 36
Repair and maintenance 0 32
Production 9 30
Commodities Transport 4 32
Code
tb2 <- df |>count(CATEGORIE, SEXE)tb2
# A tibble: 18 × 3
CATEGORIE SEXE n
<fct> <fct> <int>
1 "Business, Management and Finance" F 23
2 "Business, Management and Finance" M 23
3 "Liberal profession" F 82
4 "Liberal profession" M 51
5 "Services" F 75
6 "Services" M 50
7 "Selling" F 30
8 "Selling" M 18
9 "Administration" F 72
10 "Administration" M 22
11 "Agriculture, Fishing, Forestry" F 2
12 "Agriculture, Fishing, Forestry" M 8
13 "Building " M 36
14 "Repair and maintenance" M 32
15 "Production" F 9
16 "Production" M 30
17 "Commodities Transport" F 4
18 "Commodities Transport" M 32
# A tibble: 12 × 3
CATEGORIE SEXE n
<fct> <fct> <int>
1 Business, Management and Finance F 23
2 Business, Management and Finance M 23
3 Liberal profession F 82
4 Liberal profession M 51
5 Services F 75
6 Services M 50
7 Selling F 30
8 Selling M 18
9 Administration F 72
10 Administration M 22
11 Primary-Secondary F 15
12 Primary-Secondary M 138