Published

February 8, 2024

General Social Survey (GSS)

We will explore a (small) subset of the GSS dataset

The GSS has been a reliable source of data to help researchers, students, and journalists monitor and explain trends in American behaviors, demographics, and opinions. You’ll find the complete GSS data set on this site, and can access the GSS Data Explorer to explore, analyze, extract, and share custom sets of GSS data.

Data gathering

Download the data

Code
download_data <-  function(fname,
                           baseurl = 'https://stephane-v-boucheron.fr/data',
                           datapath = "./DATA") {
  fpath <- paste(datapath, fname, sep = "/")
  
  if (!file.exists(fpath)) {
    url <- paste(baseurl, fname, sep = "/")
    
    rep <- httr::GET(url)
    stopifnot(rep$status_code == 200)
    
    con <- file(fpath, open = "wb")
    writeBin(rep$content, con)
    close(con)
    
    print(glue('File "{fname}" downloaded!'))
  } else {
    print(glue('File "{fname}" already on hard drive!'))
  }
}
download_data(fname="sub-data.txt")
download_data(fname="sub-cdbk.txt")

Base R (package utils) offers a function download.file(). There is

fname <- 'sub-data.txt'
baseurl <- 'https://stephane-v-boucheron.fr/data'
download.file(url=paste(baseurl, fname, sep="/"),
              destfile=paste('./DATA', fname, sep="/"))

There is no need to (always) reinvent the wheel!

Load the data in your session

File inspection shows that the data file sub-data.txt is indeed a csv file

09:01 $ file DATA/sub-data.txt
DATA/sub-data.txt: CSV text

We do not know the peculiarities of this file formatting. We load it as if fields were separated by coma (,, this is an American file). and prevent any type inference by asserting that all columns should be treated as character (c).

Answer the following questions:

  • What are the observations/individuals/sample points?
  • What do the columns stand for?
  • Is the dataset tidy/messy?

Inspect the schema of dataframe (there are 540 columns!)

NULL values

In the dataframe, NULL are encoded in several ways. From the metadata, we learn

           VALUE  LABEL
              .d  don't know
              .i  iap
              .j  I don't have a job
              .m  dk, na, iap
              .n  no answer
              .p  not imputable
              .q  not imputable
              .r  refused
              .s  skipped on web
              .u  uncodeable
              .x  not available in this release
              .y  not available in this year
              .z  see codebook
              
Missing-data codes: .d,.i,.j,.m,.n,.p,.q,.r,.s,.u,.x,.y,.z

Using a brute force approach, we replace the missing data codes with NA, not the string 'NA' but NULL value for character vectors 'NA_character_'.

We first define a regular expression that will allow us to detect the presence of missing data codes in a string and to replace the missing data code by 'NA_character_'

The repeated backslashes in na_patterns are due to the way R handles escape/control characters like \ or . which play an important role in the definition of regular expressions.

Code
na_patterns <- '.d,.i,.j,.m,.n,.p,.q,.r,.s,.u,.x,.y,.z' |> 
  str_replace_all('\\.', '\\\\.') |> 
  str_replace_all(',', '|')

na_patterns
[1] "\\.d|\\.i|\\.j|\\.m|\\.n|\\.p|\\.q|\\.r|\\.s|\\.u|\\.x|\\.y|\\.z"
Regular expressions

Regular expressions are a Swiss army knife when dealing with text data. Get acquainted with them. It is useful whenver you work data or edit a file

See Regular expressions in R

This is also useful when programming with Python or querying a relational database.

Code
df <- df |> 
  mutate(across(
    everything(),
    \(x) str_replace(x, na_patterns, NA_character_)))  # Anonymous function in Python 4....

Our handling of the Missing-data codes is fast, sloppy, and dirty. The occurrence of a specific code, say .i rather than .r might be a valuable information. For some columns, a specific treatment may be indeed if we do not want to waste information.

Downsizing the data

Project the dataframe df onto columns year, age, sex, race, ethnic, columns ending with educ, ending with deg, starting with dwel, starting with income, hompop, earnrs, coninc, conrinc.

Call the resulting dataframe df_redux.

Open the metadata file sub-cdbk.txt in your favorite editor to get a feeling of the column names meaning and of encoding conventions.

Howm many missing values per column ?

Drop NULL columns

Count the number of observations per year

Count for each year

Code
df_redux |> 
  count(`year`)
# A tibble: 8 × 2
  year      n
  <chr> <int>
1 2008   2023
2 2010   2044
3 2012   1974
4 2014   2538
5 2016   2867
6 2018   2348
7 2021   4032
8 2022   3544

count() is a shortcut for

df_redux |> 
  group_by(`year`)
  summarize(n=n()) 

In SQL, we would write:

SELECT df."year", COUNT(*) AS n
FROM df_redux AS df
GROUP BY df."year"

Plot the number of rows per year as a barplot

Explore columns with name containing inc

Find the number of unique values in each column.

What are the unique values in columns whose name contains income ?

Make income and rincome a factor

Summarize and Visualize the distributions of income and rincome

The factors need reordering

Recode factors

Distribution of year

Make year an integer column

Plot rincome and income distributions with respect to year

Scatterplot of conrinc (y) with respect to coninc, facet by sex

Facet histogram for conrinc according to income

TODO

  • Retype age
  • Distribution of age (summary and visualization)
  • Distribution of age (summary and visualization) with respect to sex
  • Scatterplot of conrinc with respect to age
  • Boxplot of conrinc with respect to sex