We will explore a (small) subset of the GSS (General Social Survey) dataset.
The GSS has been a reliable source of data to help researchers, students, and journalists monitor and explain trends in American behaviors, demographics, and opinions. You’ll find the complete GSS data set on this site, and can access the GSS Data Explorer to explore, analyze, extract, and share custom sets of GSS data.
Data gathering
Download the data
download_data <- function(fname,
                          baseurl = 'https://stephane-v-boucheron.fr/data',
                          datapath = "./DATA") {
  # Make sure the target directory exists before writing into it
  dir.create(datapath, showWarnings = FALSE, recursive = TRUE)
  fpath <- paste(datapath, fname, sep = "/")
  if (!file.exists(fpath)) {
    url <- paste(baseurl, fname, sep = "/")
    rep <- httr::GET(url)
    stopifnot(rep$status_code == 200)   # abort on any HTTP error
    con <- file(fpath, open = "wb")     # binary mode: write the bytes unchanged
    writeBin(rep$content, con)
    close(con)
    print(glue::glue('File "{fname}" downloaded!'))
  } else {
    print(glue::glue('File "{fname}" already on hard drive!'))
  }
}

download_data(fname = "sub-data.txt")
download_data(fname = "sub-cdbk.txt")
Base R (package utils) also offers the function download.file(), so httr is not strictly required.
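As an illustration of the base R route, here is a minimal sketch built on utils::download.file(). The function name download_data_base is hypothetical (it does not appear in the original material); the default URL and directory are copied from download_data() above.

```r
# Sketch: same contract as download_data(), using only base R.
# `download_data_base` is a hypothetical name chosen for this example.
download_data_base <- function(fname,
                               baseurl = "https://stephane-v-boucheron.fr/data",
                               datapath = "./DATA") {
  dir.create(datapath, showWarnings = FALSE, recursive = TRUE)
  fpath <- file.path(datapath, fname)
  if (!file.exists(fpath)) {
    url <- paste(baseurl, fname, sep = "/")
    # mode = "wb" keeps the bytes unchanged on every platform;
    # download.file() returns 0 on success
    status <- utils::download.file(url, destfile = fpath,
                                   mode = "wb", quiet = TRUE)
    stopifnot(status == 0)
  }
  invisible(fpath)  # return the local path so callers can chain on it
}
```

Calling download_data_base("sub-data.txt") should then behave like the httr version, without the status messages.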
File inspection shows that the data file sub-data.txt is indeed a CSV file:
$ file DATA/sub-data.txt
DATA/sub-data.txt: CSV text
We do not know the peculiarities of this file's formatting. We load it as if fields were separated by commas (this is an American file) and prevent any type inference by asserting that all columns should be treated as character (c).
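The loading step could look like the sketch below. It assumes the readr package is used (the compact code c for character columns suggests so, but the original call is not shown); the helper name load_gss is hypothetical.

```r
# Sketch: read a comma-separated file with every column kept as character.
# `load_gss` is a hypothetical helper name; readr is an assumption.
load_gss <- function(path) {
  readr::read_csv(
    path,
    col_types = readr::cols(.default = readr::col_character()),  # no type guessing
    progress = FALSE
  )
}

# On the downloaded file (only if it is present on disk):
if (file.exists("./DATA/sub-data.txt")) {
  df <- load_gss("./DATA/sub-data.txt")
}
```

Keeping everything as character postpones all parsing decisions until the missing-data codes have been dealt with.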
Answer the following questions:
What are the observations/individuals/sample points?
What do the columns stand for?
Is the dataset tidy/messy?
Inspect the schema of the dataframe (there are 540 columns!)
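With this many columns, printing the dataframe is of little help. One possible way to get an overview is to build a small table with one row per column, as in the sketch below. The helper name summarise_schema is hypothetical, and the toy dataframe only stands in for df.

```r
# Sketch: one row per column, with its class and its number of distinct values.
# `summarise_schema` is a hypothetical helper, not part of the original lab.
summarise_schema <- function(df) {
  data.frame(
    column     = names(df),
    type       = vapply(df, \(x) class(x)[1], character(1)),
    n_distinct = vapply(df, \(x) length(unique(x)), integer(1)),
    row.names  = NULL
  )
}

# Toy stand-in for the real data
toy <- data.frame(year = c("2018", "2021", "2021"),
                  sex  = c("1", "2", ".i"))
schema <- summarise_schema(toy)
print(schema)
```

On the real data, summarise_schema(df) should return 540 rows; dplyr::glimpse(df) or str(df) are common alternatives.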
NULL values
In the dataframe, missing values (NULLs) are encoded in several ways. From the metadata, we learn:
VALUE LABEL
.d don't know
.i iap
.j I don't have a job
.m dk, na, iap
.n no answer
.p not imputable
.q not imputable
.r refused
.s skipped on web
.u uncodeable
.x not available in this release
.y not available in this year
.z see codebook
Missing-data codes: .d,.i,.j,.m,.n,.p,.q,.r,.s,.u,.x,.y,.z
Using a brute-force approach, we replace the missing-data codes with NA: not the string 'NA', but the missing value for character vectors, NA_character_.
We first define a regular expression that allows us to detect the presence of a missing-data code in a string and to replace that code with NA_character_.
The repeated backslashes in na_patterns are due to the way R handles escape characters. Both \ (in R string literals) and . (in regular expressions) have a special meaning, so a literal dot has to be written \\. in R source code.
Regular expressions are also useful when programming with Python or querying a relational database.
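The definition of na_patterns is not reproduced here. A plausible sketch, assuming that a missing-data code always fills the whole cell, is given below; the anchors ^ and $ are part of that assumption and protect values such as 1.5 from being touched.

```r
# Sketch (assumption): a cell is a missing-data code when it consists of a dot
# followed by exactly one of the letters listed in the codebook.
na_patterns <- "^\\.[dijmnpqrsuxyz]$"   # \\. in R source is the regex \. (a literal dot)

toy <- c(".i", "34", ".d", "male", "1.5")

# Which cells hold a missing-data code?
is_code <- stringr::str_detect(toy, na_patterns)
print(is_code)   # TRUE FALSE TRUE FALSE FALSE

# With replacement = NA_character_, stringr sets the whole matching string to NA
cleaned <- stringr::str_replace(toy, na_patterns, NA_character_)
print(cleaned)
```

An unanchored pattern would also work on this data as long as no legitimate value contains a dot followed by one of these letters.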
df <- df |>
  mutate(across(everything(), \(x) str_replace(x, na_patterns, NA_character_))) # Anonymous function in Python 4....
Our handling of the missing-data codes is fast, sloppy, and dirty. The occurrence of a specific code, say .i rather than .r, might be valuable information. For some columns, a specific treatment may indeed be needed if we do not want to waste information.
Downsizing the data
Project the dataframe df onto columns year, age, sex, race, ethnic, columns ending with educ, ending with deg, starting with dwel, starting with income, hompop, earnrs, coninc, conrinc.
Call the resulting dataframe df_redux.
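One way to write this projection with dplyr's selection helpers is sketched below. The helper name project_gss and the toy column names are hypothetical; any_of() is used so that the sketch does not fail if one of the listed columns is absent.

```r
library(dplyr)

# Sketch: keep the requested columns, matching some by name and others by
# prefix or suffix. `project_gss` is a hypothetical helper name.
project_gss <- function(df) {
  df |>
    select(
      any_of(c("year", "age", "sex", "race", "ethnic")),
      ends_with("educ"),
      ends_with("deg"),
      starts_with("dwel"),
      starts_with("income"),
      any_of(c("hompop", "earnrs", "coninc", "conrinc"))
    )
}

# Toy stand-in for df (column names invented for the example)
toy <- tibble(id = "1", year = "2018", age = "34", sex = "1", race = "1",
              ethnic = "1", educ = "12", paeduc = "10", padeg = "1",
              dwelown = "1", income16 = "20", hompop = "2", earnrs = "1",
              coninc = "1", conrinc = "1", other = "x")
toy_redux <- project_gss(toy)
print(names(toy_redux))
```

On the real data, the same call would read df_redux <- project_gss(df).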
Open the metadata file sub-cdbk.txt in your favorite editor to get a feeling for the meaning of the column names and for the encoding conventions.