Published

February 8, 2024

General Social Survey (GSS)

We will explore a (small) subset of the GSS dataset.

The GSS has been a reliable source of data to help researchers, students, and journalists monitor and explain trends in American behaviors, demographics, and opinions. You’ll find the complete GSS data set on this site, and can access the GSS Data Explorer to explore, analyze, extract, and share custom sets of GSS data.

Data gathering

Download the data

Code
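download_data <- function(fname,
                          baseurl = 'https://stephane-v-boucheron.fr/data',
                          datapath = "./DATA") {
  # Create the target directory if it does not exist yet
  dir.create(datapath, showWarnings = FALSE, recursive = TRUE)
  fpath <- paste(datapath, fname, sep = "/")

  if (!file.exists(fpath)) {
    url <- paste(baseurl, fname, sep = "/")

    # Fetch the file and stop unless the server answers 200 (OK)
    rep <- httr::GET(url)
    stopifnot(rep$status_code == 200)

    # Write the raw response body to disk in binary mode
    con <- file(fpath, open = "wb")
    writeBin(rep$content, con)
    close(con)

    print(glue::glue('File "{fname}" downloaded!'))
  } else {
    print(glue::glue('File "{fname}" already on hard drive!'))
  }
}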
download_data(fname="sub-data.txt")
download_data(fname="sub-cdbk.txt")

Base R (package utils) offers a function download.file() that does the same job:

fname <- 'sub-data.txt'
baseurl <- 'https://stephane-v-boucheron.fr/data'
download.file(url=paste(baseurl, fname, sep="/"),
              destfile=paste('./DATA', fname, sep="/"))

There is no need to (always) reinvent the wheel!

Load the data in your session

File inspection shows that the data file sub-data.txt is indeed a CSV file.

$ file DATA/sub-data.txt
DATA/sub-data.txt: CSV text

We do not know the peculiarities of this file's formatting. We load it as if fields were separated by commas (,), since this is an American file, and we prevent any type inference by asserting that all columns should be treated as character (c).

Solution
Code
# Read every column as character ("c") to prevent any type inference
df <- readr::read_csv("./DATA/sub-data.txt",
                      col_types = readr::cols(.default = "c")
                      )

dim(df)
[1] 21370   540

Answer the following questions:

  • What are the observations/individuals/sample points?
  • What do the columns stand for?
  • Is the dataset tidy/messy?

Inspect the schema of the dataframe (there are 540 columns!)

df |> 
  glimpse()

NULL values

In the dataframe, missing values (NULL) are encoded in several ways. From the metadata, we learn:

           VALUE  LABEL
              .d  don't know
              .i  iap
              .j  I don't have a job
              .m  dk, na, iap
              .n  no answer
              .p  not imputable
              .q  not imputable
              .r  refused
              .s  skipped on web
              .u  uncodeable
              .x  not available in this release
              .y  not available in this year
              .z  see codebook
              
Missing-data codes: .d,.i,.j,.m,.n,.p,.q,.r,.s,.u,.x,.y,.z

Using a brute-force approach, we replace the missing-data codes with NA: not the string 'NA', but the missing value for character vectors, NA_character_.

We first define a regular expression that will allow us to detect the presence of a missing-data code in a string and to replace it with NA_character_.

The repeated backslashes in na_patterns are due to the way R handles the escape character \ in string literals. Both \ and . play an important role in the definition of regular expressions.
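The escaping rules can be checked on a few strings. This is a minimal sketch in base R; the vector codes and the variable escaped are made up for illustration and do not come from the GSS file.

```r
# The R string literal "\\." holds two characters: a backslash and a dot
nchar("\\.")        # 2
writeLines("\\.")   # prints \.

# Hypothetical test strings
codes <- c(".d", "xd", "22.3")

# An unescaped dot matches any character, so "xd" matches too
grepl(".d", codes)
# An escaped dot only matches a literal dot
grepl("\\.d", codes)

# In a replacement string the backslash is special as well, hence four
# backslashes in the literal to emit a single backslash in the result
escaped <- gsub("\\.", "\\\\.", ".d")
writeLines(escaped)  # prints \.d
```

The last call mirrors what str_replace_all('\\.', '\\\\.') does when building na_patterns.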

Code
na_patterns <- '.d,.i,.j,.m,.n,.p,.q,.r,.s,.u,.x,.y,.z' |> 
  str_replace_all('\\.', '\\\\.') |> 
  str_replace_all(',', '|')

na_patterns
[1] "\\.d|\\.i|\\.j|\\.m|\\.n|\\.p|\\.q|\\.r|\\.s|\\.u|\\.x|\\.y|\\.z"
Regular expressions

Regular expressions are a Swiss army knife when dealing with text data. Get acquainted with them. They are useful whenever you work with data or edit a file.

See Regular expressions in R

This is also useful when programming with Python or querying a relational database.

Code
df <- df |> 
  mutate(across(
    everything(),
    \(x) str_replace(x, na_patterns, NA_character_)))  # \(x) is the anonymous-function shorthand (R >= 4.1)

Our handling of the missing-data codes is fast, sloppy, and dirty. The occurrence of a specific code, say .i rather than .r, might be valuable information. For some columns, a specific treatment may be needed if we do not want to waste information.
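One possible way to keep that information is to copy the code into a companion column before blanking it out. The sketch below runs on made-up values; the tibble toy, the column rincome_why, and the anchored pattern na_code are assumptions introduced for illustration, not part of the original workflow.

```r
library(dplyr)
library(stringr)

# Toy data (hypothetical values, not taken from the GSS file)
toy <- tibble(rincome = c("12", ".i", ".r", "10", ".d"))

# Anchored pattern: the whole string must be a dot followed by one code letter
na_code <- "^\\.[dijmnpqrsuxyz]$"

toy <- toy |>
  mutate(
    # Keep the reason for missingness in a companion column ...
    rincome_why = if_else(str_detect(rincome, na_code), rincome, NA_character_),
    # ... then blank out the code in the original column
    rincome     = if_else(str_detect(rincome, na_code), NA_character_, rincome)
  )

toy
```

Anchoring the pattern with ^ and $ also guards against replacing a legitimate value that merely contains one of the codes.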

Downsizing the data

Project the dataframe df onto columns year, id, age, sex, race, hispanic, ethnic, columns ending with educ, degree, columns ending with deg, columns starting with dwel, columns whose name contains income, hompop, earnrs, coninc, conrinc.

Call the resulting dataframe df_redux.

Open the metadata file sub-cdbk.txt in your favorite editor to get a feeling for the meaning of the column names and for the encoding conventions.

Solution
Code
df_redux <- df |> 
  select(`year`, id, age, sex, race, hispanic, ethnic, 
         ends_with('educ'), 
         degree, 
         ends_with('deg'),
         starts_with('dwel'),
         contains('income'),
         hompop,
         earnrs,
         coninc,
         conrinc
         )
Tidy selection

dplyr::select() allows us to use helpers to denote all columns of a given type, or all columns whose names match some pattern. Get acquainted with those helpers. They are time savers.
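The helpers can be tried on a one-row table. This is a small sketch; the tibble toy only borrows a few GSS column names and its values are invented.

```r
library(dplyr)

# Hypothetical one-row table reusing some GSS column names
toy <- tibble(educ = "12", paeduc = "16", padeg = "1", dwelown = "2",
              income = "7", rincome = "12", hompop = 2L)

by_suffix   <- names(select(toy, ends_with("educ")))    # name ends with "educ"
by_prefix   <- names(select(toy, starts_with("inc")))   # name starts with "inc"
by_contains <- names(select(toy, contains("income")))   # name contains "income"
by_type     <- names(select(toy, where(is.integer)))    # column type is integer

by_suffix
by_prefix
by_contains
by_type
```

Note how starts_with("inc") misses rincome while contains("income") catches it, which is why the solution above relies on contains().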

Code
df_redux |> 
  glimpse()
Rows: 21,370
Columns: 38
$ year        <chr> "2008", "2008", "2008", "2008", "2008", "2008", "2008", "2…
$ id          <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "…
$ age         <chr> "49", "48", "47", "32", "37", "72", "21", "36", "48", "56"…
$ sex         <chr> "1", "1", "1", "1", "2", "1", "2", "2", "2", "2", "1", "1"…
$ race        <chr> "3", "3", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2"…
$ hispanic    <chr> "3", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"…
$ ethnic      <chr> "22", "39", "1", "1", "28", "1", "39", "39", "39", "39", N…
$ educ        <chr> "12", "20", "13", "10", "12", "16", "12", "12", "14", "9",…
$ paeduc      <chr> "12", NA, NA, NA, "12", NA, "16", NA, "8", NA, "6", "8", "…
$ maeduc      <chr> "12", "12", "16", "12", "12", "12", "16", "12", "8", "9", …
$ speduc      <chr> NA, "19", "14", NA, NA, "16", NA, NA, NA, NA, "16", NA, NA…
$ sei10educ   <chr> "22.3", "93.7", "28.0", "36.5", "40.8", "42.0", "35.0", "9…
$ spsei10educ <chr> NA, "79.8", NA, NA, NA, "98.9", NA, NA, NA, NA, "76.9", NA…
$ pasei10educ <chr> "24.1", NA, NA, NA, "27.0", "22.9", "20.4", "45.2", "45.2"…
$ masei10educ <chr> "55.9", "22.3", "56.1", "56.1", "37.0", NA, "38.4", NA, NA…
$ nateduc     <chr> "1", "1", "1", "1", "2", "1", "2", "2", "1", "1", "1", "1"…
$ degree      <chr> "1", "4", "1", "1", "1", "3", "1", "1", "1", "0", "4", "4"…
$ padeg       <chr> "1", NA, NA, NA, "1", "0", "3", NA, "0", NA, "0", "0", "1"…
$ madeg       <chr> "1", "1", "3", "1", "1", "1", "3", "1", "0", "0", "1", "0"…
$ spdeg       <chr> NA, "4", "1", NA, NA, "3", NA, NA, NA, NA, "3", NA, NA, "1…
$ dwelling    <chr> "8", "8", "8", "9", "8", "6", "5", "6", "8", "8", "8", "8"…
$ dwelngh     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ dwelcity    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ dwelown     <chr> "2", "2", "2", NA, "1", "1", NA, "2", "2", NA, "2", NA, NA…
$ income      <chr> "7", "12", "12", "12", "12", NA, NA, NA, NA, NA, "12", "12…
$ rincome     <chr> NA, "12", "12", "10", "12", NA, NA, NA, NA, NA, "12", "12"…
$ income72    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ income77    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ income82    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ income86    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ income91    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ income98    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ income06    <chr> "7", "23", "18", "17", "16", NA, NA, NA, NA, NA, "25", "23…
$ income16    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ hompop      <chr> "2", "1", "2", "3", "3", "2", "5", "5", "1", "1", "2", "1"…
$ earnrs      <chr> "0", "2", "1", "2", "1", "2", "4", "3", "1", "0", "2", "1"…
$ coninc      <chr> "6.22875000000000000e+03", "9.96600000000000000e+04", "3.7…
$ conrinc     <chr> NA, "6.85162500000000000e+04", "3.73725000000000000e+04", …

There are still many columns and some of them do not look very exciting.

How many missing values per column?

Code
null_columns <- df_redux |> 
  skimr::skim() |> 
  skimr::yank("character") |> 
  select(skim_variable, n_missing, complete_rate) |> 
  arrange(desc(n_missing))

null_columns

Variable type: character

skim_variable n_missing complete_rate
dwelngh 21370 0.00
dwelcity 21370 0.00
income72 21370 0.00
income77 21370 0.00
income82 21370 0.00
income86 21370 0.00
income91 21370 0.00
income98 21370 0.00
spsei10educ 15397 0.28
income06 13719 0.36
masei10educ 12069 0.44
speduc 11923 0.44
spdeg 11834 0.45
nateduc 10812 0.49
pasei10educ 10414 0.51
income16 10003 0.53
rincome 8869 0.58
conrinc 8869 0.58
sei10educ 8221 0.62
dwelown 7212 0.66
dwelling 6092 0.71
hompop 5796 0.73
paeduc 5571 0.74
padeg 5082 0.76
ethnic 3439 0.84
maeduc 2368 0.89
income 2352 0.89
coninc 2352 0.89
madeg 1828 0.91
age 585 0.97
earnrs 283 0.99
sex 112 0.99
educ 111 0.99
hispanic 108 0.99
race 107 0.99
degree 32 1.00
year 0 1.00
id 0 1.00
skimr

skimr is a package that summarizes the columns of a dataframe. Columns are handled according to their base type. The output consists of one dataframe per base type (numeric, factor, …). In each returned dataframe, a row corresponds to a column of the dataframe under investigation. The row contains the column name (as skim_variable), the number of missing values (n_missing), the proportion of non-missing values (complete_rate), and type-dependent information.
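To see what n_missing and complete_rate stand for, the same two quantities can be computed by hand. The sketch below uses dplyr and tidyr on an invented tibble; toy and missing_by_hand are hypothetical names, and this is not how skimr is implemented internally.

```r
library(dplyr)
library(tidyr)

# Hypothetical data with some missing values
toy <- tibble(a = c("1", NA, "3", NA),
              b = c("x", "y", "z", "w"))

missing_by_hand <- toy |>
  # One count of NA per column
  summarise(across(everything(), \(x) sum(is.na(x)))) |>
  # One row per original column, as in the skimr output
  pivot_longer(everything(),
               names_to = "skim_variable",
               values_to = "n_missing") |>
  mutate(complete_rate = 1 - n_missing / nrow(toy)) |>
  arrange(desc(n_missing))

missing_by_hand
```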

Using skimr

Drop NULL columns

Code
to_be_dropped <- null_columns |> 
  filter(complete_rate < 1e-10) |> 
  pull(skim_variable)

to_be_dropped
[1] "dwelngh"  "dwelcity" "income72" "income77" "income82" "income86" "income91"
[8] "income98"
Code
df_redux <- df_redux |> 
  select(-all_of(to_be_dropped)) 

df_redux |> 
  glimpse()
Rows: 21,370
Columns: 30
$ year        <chr> "2008", "2008", "2008", "2008", "2008", "2008", "2008", "2…
$ id          <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "…
$ age         <chr> "49", "48", "47", "32", "37", "72", "21", "36", "48", "56"…
$ sex         <chr> "1", "1", "1", "1", "2", "1", "2", "2", "2", "2", "1", "1"…
$ race        <chr> "3", "3", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2"…
$ hispanic    <chr> "3", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"…
$ ethnic      <chr> "22", "39", "1", "1", "28", "1", "39", "39", "39", "39", N…
$ educ        <chr> "12", "20", "13", "10", "12", "16", "12", "12", "14", "9",…
$ paeduc      <chr> "12", NA, NA, NA, "12", NA, "16", NA, "8", NA, "6", "8", "…
$ maeduc      <chr> "12", "12", "16", "12", "12", "12", "16", "12", "8", "9", …
$ speduc      <chr> NA, "19", "14", NA, NA, "16", NA, NA, NA, NA, "16", NA, NA…
$ sei10educ   <chr> "22.3", "93.7", "28.0", "36.5", "40.8", "42.0", "35.0", "9…
$ spsei10educ <chr> NA, "79.8", NA, NA, NA, "98.9", NA, NA, NA, NA, "76.9", NA…
$ pasei10educ <chr> "24.1", NA, NA, NA, "27.0", "22.9", "20.4", "45.2", "45.2"…
$ masei10educ <chr> "55.9", "22.3", "56.1", "56.1", "37.0", NA, "38.4", NA, NA…
$ nateduc     <chr> "1", "1", "1", "1", "2", "1", "2", "2", "1", "1", "1", "1"…
$ degree      <chr> "1", "4", "1", "1", "1", "3", "1", "1", "1", "0", "4", "4"…
$ padeg       <chr> "1", NA, NA, NA, "1", "0", "3", NA, "0", NA, "0", "0", "1"…
$ madeg       <chr> "1", "1", "3", "1", "1", "1", "3", "1", "0", "0", "1", "0"…
$ spdeg       <chr> NA, "4", "1", NA, NA, "3", NA, NA, NA, NA, "3", NA, NA, "1…
$ dwelling    <chr> "8", "8", "8", "9", "8", "6", "5", "6", "8", "8", "8", "8"…
$ dwelown     <chr> "2", "2", "2", NA, "1", "1", NA, "2", "2", NA, "2", NA, NA…
$ income      <chr> "7", "12", "12", "12", "12", NA, NA, NA, NA, NA, "12", "12…
$ rincome     <chr> NA, "12", "12", "10", "12", NA, NA, NA, NA, NA, "12", "12"…
$ income06    <chr> "7", "23", "18", "17", "16", NA, NA, NA, NA, NA, "25", "23…
$ income16    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ hompop      <chr> "2", "1", "2", "3", "3", "2", "5", "5", "1", "1", "2", "1"…
$ earnrs      <chr> "0", "2", "1", "2", "1", "2", "4", "3", "1", "0", "2", "1"…
$ coninc      <chr> "6.22875000000000000e+03", "9.96600000000000000e+04", "3.7…
$ conrinc     <chr> NA, "6.85162500000000000e+04", "3.73725000000000000e+04", …
all_of()

all_of() is a tidyselect helper re-exported by dplyr. It allows us to project on a collection of columns denoted by their names (specified as strings).
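A short illustration of the strictness of all_of(), compared with its lenient sibling any_of(). The tibble toy and the column name "zzz" are invented for the example.

```r
library(dplyr)

# Hypothetical table
toy <- tibble(year = 2008L, age = 49L, sex = "1")

# Column names held in a character vector
wanted <- c("year", "age")
kept <- names(select(toy, all_of(wanted)))

# any_of() silently ignores names that are absent from the table
lenient <- names(select(toy, any_of(c("year", "zzz"))))

# all_of() is strict: a missing column raises an error
strict <- tryCatch(
  names(select(toy, all_of(c("year", "zzz")))),
  error = function(e) "error"
)

kept
lenient
strict
```

The strict behaviour is what we want when dropping the all-NA columns: a typo in to_be_dropped would be reported rather than ignored.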

I would like us to have such a device in SQL

Tidy selection

Count the number of observations per year

Count for each year

Code
df_redux |> 
  count(`year`)
# A tibble: 8 × 2
  year      n
  <chr> <int>
1 2008   2023
2 2010   2044
3 2012   1974
4 2014   2538
5 2016   2867
6 2018   2348
7 2021   4032
8 2022   3544

count() is a shortcut for

df_redux |> 
  group_by(`year`) |> 
  summarize(n=n()) 

In SQL, we would write:

SELECT df."year", COUNT(*) AS n
FROM df_redux AS df
GROUP BY df."year"

Plot the number of rows per year as a barplot

Solution
Code
p <- df_redux |> 
  ggplot() +
  aes(x=`year`) +
  geom_bar()  +
  labs(caption="year as a string")

p

Code
q <- df_redux |> 
  ggplot() +
  aes(x=as.numeric(`year`)) +
  geom_bar()  +
  labs(caption="year as a numeric")

# Combining two ggplots with `+` relies on package patchwork
library(patchwork)
p + q

Should year be handled as a numeric?

Explore columns with name containing inc

Find the number of unique values in each column.

Code
skim_inc <- df_redux |> 
  select(contains('inc')) |> 
  skimr::skim() |> 
  skimr::yank("character") 

skim_inc

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
income 2352 0.89 1 2 0 12 0
rincome 8869 0.58 1 2 0 12 0
income06 13719 0.36 1 2 0 25 0
income16 10003 0.53 1 2 0 26 0
coninc 2352 0.89 23 23 0 178 0
conrinc 8869 0.58 23 23 0 178 0

What are the unique values in columns whose name contains income?

Solution
Code
df_redux |> 
  select(contains("income")) |> 
  summarise(across(everything(), \(x) paste(sort(unique(x)), collapse=", "))) |> 
  pivot_longer(cols=everything(),
               names_to = "col",
               values_to = "unique_vals")
# A tibble: 4 × 2
  col      unique_vals                                                          
  <chr>    <chr>                                                                
1 income   1, 10, 11, 12, 2, 3, 4, 5, 6, 7, 8, 9                                
2 rincome  1, 10, 11, 12, 2, 3, 4, 5, 6, 7, 8, 9                                
3 income06 1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, 22, 23, 24, 25…
4 income16 1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, 22, 23, 24, 25…

Make income and rincome factors

Solution
Code
df_redux <- df_redux |> 
  mutate(across(ends_with('income'), as_factor)) 

df_redux |> 
  select(contains('inc')) |> 
  glimpse()
Rows: 21,370
Columns: 6
$ income   <fct> 7, 12, 12, 12, 12, NA, NA, NA, NA, NA, 12, 12, 11, 12, NA, NA…
$ rincome  <fct> NA, 12, 12, 10, 12, NA, NA, NA, NA, NA, 12, 12, NA, NA, NA, N…
$ income06 <chr> "7", "23", "18", "17", "16", NA, NA, NA, NA, NA, "25", "23", …
$ income16 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ coninc   <chr> "6.22875000000000000e+03", "9.96600000000000000e+04", "3.7372…
$ conrinc  <chr> NA, "6.85162500000000000e+04", "3.73725000000000000e+04", "1.…

Summarize and Visualize the distributions of income and rincome

Solution
Code
p_income <- df_redux |> 
  ggplot() +
  aes(x=income) +
  geom_bar() 

p_rincome <- df_redux |> 
  ggplot() +
  aes(x=rincome) +
  geom_bar()

p_income + p_rincome

The factors need reordering

Solution
Code
cat(levels(df_redux$rincome))
12 10 11 3 8 1 9 6 5 7 2 4
Code
df_redux$rincome <- fct_relevel(df_redux$rincome, as.character(seq(1,12)))

df_redux$income <- fct_relevel(df_redux$income,
            as.character(seq(1,12)))

cat(levels(df_redux$rincome))
1 2 3 4 5 6 7 8 9 10 11 12
Code
p_income <- df_redux |> 
  ggplot() +
  aes(x=income) +
  geom_bar() 

p_rincome <- df_redux |> 
  ggplot() +
  aes(x=rincome) +
  geom_bar()

p_income + p_rincome
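Why was the reordering needed? forcats::as_factor() keeps levels in order of first appearance, while base factor() sorts them as strings; neither gives the numeric order. A minimal sketch on made-up values, where the vector x and the factors f and f2 are hypothetical:

```r
library(forcats)

# Hypothetical codes, in the order they might appear in a column
x <- c("12", "10", "3", "1")

f <- as_factor(x)
levels(f)            # order of first appearance: "12" "10" "3" "1"

levels(factor(x))    # sorted as strings: "1" "10" "12" "3"

# Reorder by sorting the levels as integers, then converting back to strings
f2 <- fct_relevel(f, as.character(sort(as.integer(levels(f)))))
levels(f2)           # numeric order: "1" "3" "10" "12"
```

Sorting the existing levels as integers avoids hard-coding seq(1, 12) when some codes happen to be absent from the data.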

Recode factors

We have to search the metadata in order to figure out the way columns like income or rincome have been encoded.

38. Did you earn any income from [OCCUPATION DESCRIBED IN Q2]
           last year? a. If yes: In which of these groups did your earnings
           from [OCCUPATION IN Q2] for last year fall? That is, before
           taxes or other deductions.

           VALUE  LABEL
               1  under $1,000
               2  $1,000 to $2,999
               3  $3,000 to $3,999
               4  $4,000 to $4,999
               5  $5,000 to $5,999
               6  $6,000 to $6,999
               7  $7,000 to $7,999
               8  $8,000 to $9,999
               9  $10,000 to $14,999
              10  $15,000 to $19,999
              11  $20,000 to $24,999
              12  $25,000 or more
              .d  don't know
              .i  iap
              .j  I don't have a job
              .m  dk, na, iap
              .n  no answer
              .p  not imputable
              .q  not imputable
              .r  refused
              .s  skipped on web
              .u  uncodeable
              .x  not available in this release
              .y  not available in this year
              .z  see codebook

Now, the roadmap is simple: define an encoding table to map values to labels (and vice versa), make sure columns income and rincome are factors, and recode their levels using the encoding table.

Code
income_encoding <- tribble(
 ~VALUE,   ~LABEL,
               '1',  'under $1,000',
               '2',  '$1,000 to $2,999',
               '3',  '$3,000 to $3,999',
               '4',  '$4,000 to $4,999',
               '5',  '$5,000 to $5,999',
               '6',  '$6,000 to $6,999',
               '7',  '$7,000 to $7,999',
               '8',  '$8,000 to $9,999',
               '9',  '$10,000 to $14,999',
              '10',  '$15,000 to $19,999',
              '11',  '$20,000 to $24,999',
              '12',  '$25,000 or more',
              '.d',  'do not know',
              '.i',  'iap',
              '.j',  'I do not have a job',
              '.m',  'dk, na, iap',
              '.n',  'no answer',
              '.p',  'not imputable',
              '.q',  'not imputable',
              '.r',  'refused',
              '.s',  'skipped on web',
              '.u',  'uncodeable',
              '.x',  'not available in this release',
              '.y',  'not available in this year',
              '.z',  'see codebook'
)

income_encoding
# A tibble: 25 × 2
   VALUE LABEL             
   <chr> <chr>             
 1 1     under $1,000      
 2 2     $1,000 to $2,999  
 3 3     $3,000 to $3,999  
 4 4     $4,000 to $4,999  
 5 5     $5,000 to $5,999  
 6 6     $6,000 to $6,999  
 7 7     $7,000 to $7,999  
 8 8     $8,000 to $9,999  
 9 9     $10,000 to $14,999
10 10    $15,000 to $19,999
# ℹ 15 more rows
Code
income_labels <- income_encoding$VALUE
names(income_labels) <- income_encoding$LABEL
Code
stopifnot(require(rlang))
Loading required package: rlang

Attaching package: 'rlang'
The following objects are masked from 'package:purrr':

    %@%, flatten, flatten_chr, flatten_dbl, flatten_int, flatten_lgl,
    flatten_raw, invoke, splice
Code
df_redux$income_2 <- fct_recode(df_redux$income, 
                  !!!income_labels)
Warning: Unknown levels in `f`: .d, .i, .j, .m, .n, .p, .q, .r, .s, .u, .x, .y,
.z
Code
df_redux$rincome_2 <- fct_recode(df_redux$rincome, 
                  !!!income_labels)
Warning: Unknown levels in `f`: .d, .i, .j, .m, .n, .p, .q, .r, .s, .u, .x, .y,
.z
Code
df_redux |> 
  count(income_2, rincome_2)
# A tibble: 143 × 3
   income_2         rincome_2              n
   <fct>            <fct>              <int>
 1 under $1,000     under $1,000          37
 2 under $1,000     $1,000 to $2,999       2
 3 under $1,000     $3,000 to $3,999       5
 4 under $1,000     $5,000 to $5,999       1
 5 under $1,000     $6,000 to $6,999       1
 6 under $1,000     $15,000 to $19,999     2
 7 under $1,000     $20,000 to $24,999     2
 8 under $1,000     $25,000 or more        2
 9 under $1,000     <NA>                 259
10 $1,000 to $2,999 under $1,000          16
# ℹ 133 more rows
Code
p <- df_redux |> 
  ggplot() +
  aes(x=income_2) +
  geom_bar()

q <- p + theme(axis.text.x = element_text(angle = 45)) 

p + q 

The right plot looks more readable.

List/vector unpacking in R

The second argument of fct_recode() uses a very convenient feature provided by package rlang: !!! (bang-bang-bang). When applied to a named vector inside a function call, it unpacks the vector, and the named elements behave like keyword arguments of the function fct_recode(). Very practical if you do not enjoy typing.
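The unpacking can be compared with the spelled-out call on a tiny factor. This is a sketch on invented data; f, labels, g1, and g2 are hypothetical names.

```r
library(forcats)

# Hypothetical factor and a named vector of the form new_label = old_level
f <- factor(c("1", "2", "1"))
labels <- c("under $1,000" = "1", "$1,000 to $2,999" = "2")

# Unpacking the named vector with !!! ...
g1 <- fct_recode(f, !!!labels)

# ... is equivalent to typing every pair by hand
g2 <- fct_recode(f, "under $1,000" = "1", "$1,000 to $2,999" = "2")

identical(g1, g2)
levels(g1)
```

The operator is interpreted by fct_recode() itself, so the call works here without attaching rlang.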

Distribution of year

Solution
Code
q1 <- df_redux |> 
  ggplot() +
  aes(x=`year`) + 
  geom_bar()

Make year an integer column

Solution
Code
df_redux <- df_redux |> 
  mutate(`year`=as.integer(`year`)) 

q2 <- df_redux |> 
  ggplot() +
  aes(x=`year`) + 
  geom_bar()

q1 + q2

Plot rincome and income distributions with respect to year

Solution
Code
df_redux |> 
  count(`year`, rincome_2) |>
  group_by(`year`) |> 
  mutate(n_year=sum(n)) |> 
  ggplot() +
  aes(x=rincome_2, y=n/n_year) +
  geom_col() +
  facet_wrap(vars(year), ncol=2) +
  theme(axis.text.x = element_text(angle = 45)) 

Scatterplot of conrinc (y) with respect to coninc, facet by sex

Solution

Let us first retype the two columns

Code
df_redux <- df_redux |> 
  mutate(across(all_of(c("coninc", "conrinc")), as.numeric)) 
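The strings in coninc and conrinc use scientific notation, which as.numeric() parses directly. A quick check in base R on two values copied from the glimpse() output above; the vectors x and y and the string ".i" passed at the end are only for illustration.

```r
# Two values as they appear in the character column, plus a missing one
x <- c("6.22875000000000000e+03", "9.96600000000000000e+04", NA)

y <- as.numeric(x)
y   # 6228.75 99660 NA

# A leftover missing-data code would become NA too, but with a warning,
# which is one more reason to clean the codes before retyping
leftover <- suppressWarnings(as.numeric(".i"))
leftover
```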

Let us summarize them

Code
df_redux |> 
  select(where(is.numeric)) |> 
  skimr::skim() |> 
  skimr::yank("numeric") |> 
  select(-n_missing, - complete_rate)

Variable type: numeric

skim_variable mean sd p0 p25 p50 p75 p100 hist
year 2016.22 4.75 2008 2012.00 2016.00 2021.0 2022.0 ▅▂▆▂▇
coninc 50874.98 44652.49 336 18480.00 37372.50 67200.0 178712.5 ▇▆▂▁▂
conrinc 38068.37 43077.07 336 13143.75 26991.25 47317.5 434612.4 ▇▁▁▁▁

skimr::skim() tidies the output of summary() for numerical columns.

We compute the numerical summaries of the columns

Code
df_redux$coninc |> summary() 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    336   18480   37372   50875   67200  178712    2352 
Code
mean(df_redux$coninc, na.rm = TRUE)
[1] 50874.98
Code
sd(df_redux$coninc, na.rm=TRUE)
[1] 44652.49
Code
median(df_redux$coninc, na.rm = TRUE)
[1] 37372.5
Code
IQR(df_redux$coninc, na.rm= TRUE)
[1] 48720
Code
quantile(df_redux$coninc, probs= c(.25, .5, .75), na.rm = TRUE)
    25%     50%     75% 
18480.0 37372.5 67200.0 
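The interquartile range is the difference between the third and first quartiles; with the values printed above, 67200 - 18480 = 48720, which matches the output of IQR(). The relation can be checked on a small made-up vector (x, q, and iqr_by_hand are hypothetical names):

```r
# Hypothetical sample with one missing value
x <- c(1, 2, 3, 4, NA)

# First and third quartiles (default quantile type)
q <- quantile(x, probs = c(.25, .75), na.rm = TRUE)

# IQR computed by hand, dropping the quantile names
iqr_by_hand <- unname(q[2] - q[1])

iqr_by_hand
IQR(x, na.rm = TRUE)
```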
Code
df_redux |> 
  drop_na(coninc, conrinc) |> 
  filter(sex %in% c("1", "2")) |> 
  ggplot() +
  aes(x=coninc, y=conrinc, shape=sex) +
  facet_wrap(~ sex) +
  geom_point()

Facet histogram for conrinc according to income

Solution
Code
df_redux |> 
  drop_na(coninc, conrinc) |> 
  filter(sex %in% c("1", "2")) |> 
  ggplot() +
  aes(x=conrinc) +
  geom_histogram(aes(y=after_stat(density))) +
  facet_wrap(~ income_2) 
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

TODO

  • Retype age
  • Distribution of age (summary and visualization)
  • Distribution of age (summary and visualization) with respect to sex
  • Scatterplot of conrinc with respect to age
  • Boxplot of conrinc with respect to sex