Published

February 15, 2024

Install and use package gssr

We work again with General Social Survey (GSS) data.

We take advantage of the R package gssr, installed here from GitHub if it is not already available:

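# Install gssr from GitHub only if it is missing
if (!requireNamespace("gssr", quietly = TRUE)) {
  # remotes provides install_github()
  if (!requireNamespace("remotes", quietly = TRUE)) {
    install.packages("remotes")
  }
  remotes::install_github("kjhealy/gssr")
}

library(gssr)  # attach gssr so that its datasets and help pages are found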

Get data for year 2018

Since 1994, the GSS has generally been carried out every two years; earlier rounds were close to annual. It offers both cross-sectional data and panel data.

Package gssr offers a simple way to retrieve yearly data.

df_2018 <- gssr::gss_get_yr(2018)
Fetching: https://gss.norc.org/documents/stata/2018_stata.zip

Inspect the data

  • How many observations?
  • How many variables?
  • Are the data tidy/messy?
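One possible way to answer the first two questions is with nrow(), ncol() and dim(). The sketch below uses a small made-up data frame, toy, as a stand-in so that it runs without downloading anything; in the lab the same calls would be applied to df_2018. The column values are hypothetical and are not GSS data.

```r
# `toy` is a made-up stand-in for df_2018 (hypothetical values, not GSS data)
toy <- data.frame(
  id       = 1:6,
  age      = c(23, 70, 48, 27, 61, NA),
  agekdbrn = c(NA, 25, 22, NA, 30, 19),
  sex      = c(2, 1, 2, 2, 1, 1)
)

n_obs  <- nrow(toy)   # number of observations (rows)
n_vars <- ncol(toy)   # number of variables (columns)
dim(toy)              # both at once: rows, then columns

# Compact overview of column types and first values
str(toy)

# Share of missing values per column, a first hint about how messy the data are
colMeans(is.na(toy))
```

Whether the data are tidy is a judgment call: each row being one respondent and each column one variable argues for tidy, while numeric codes standing for categories and many missing values argue for some cleaning.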

Numerical summaries for age and agekdbrn

The 2018 data provide (among many other things) columns named age and agekdbrn (the respondent's age when their first child was born). Compute numerical summaries for these two columns.
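A minimal sketch of such summaries, using base R only. The data frame toy is a made-up stand-in for df_2018 (hypothetical values). The as.numeric() step is an assumption about how one might handle the labelled columns that gssr returns; on plain numeric columns it changes nothing.

```r
# Made-up stand-in for the two columns of df_2018 (hypothetical values)
toy <- data.frame(
  age      = c(23, 70, 48, 27, 61, NA, 35, 52),
  agekdbrn = c(NA, 25, 22, NA, 30, 19, 28, 24)
)

# GSS columns are labelled vectors; as.numeric() keeps only the numbers.
# On plain numeric columns, as here, it is a no-op.
age      <- as.numeric(toy$age)
agekdbrn <- as.numeric(toy$agekdbrn)

summary(age)        # min, quartiles, mean, max, and the number of NAs
summary(agekdbrn)

# Measures of spread; na.rm = TRUE is needed because both columns contain NAs
age_sd  <- sd(age, na.rm = TRUE)
age_iqr <- IQR(age, na.rm = TRUE)
```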

Thanks to gssr, you can get meta-information about the columns:

?age
?agekdbrn
?sex

How is sex encoded? Is it worth recoding it?
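If sex turns out to be stored as numeric codes, recoding it as a factor makes tables and plot legends easier to read. The sketch below assumes the coding 1 = male, 2 = female, which should be checked against ?sex or the value labels before use; the input vector is hypothetical.

```r
# Hypothetical sex codes, mimicking the first rows of a GSS file
sex_code <- c(2, 1, 2, 2, 2, 1, 1, 1)

# Assumption: 1 = male, 2 = female (check with ?sex or the value labels)
sex <- factor(sex_code, levels = c(1, 2), labels = c("male", "female"))

table(sex)   # counts per category, now with readable names
```

For labelled columns, haven::as_factor() is an alternative that builds the factor from the value labels stored in the data.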

Histogram and density plots of the age distribution, faceted by sex
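One way to draw this with ggplot2 is sketched below, on simulated data rather than on df_2018 (the data frame toy and its values are hypothetical). The histogram is put on the density scale so that it can share an axis with the kernel density estimate.

```r
library(ggplot2)

set.seed(2018)
# Simulated stand-in for df_2018 (hypothetical values, not GSS data)
toy <- data.frame(
  age = round(runif(400, 18, 89)),
  sex = factor(sample(c("male", "female"), 400, replace = TRUE))
)

p <- ggplot(toy, aes(x = age)) +
  # Histogram on the density scale so the density curve is comparable
  geom_histogram(aes(y = after_stat(density)), binwidth = 5,
                 fill = "grey80", colour = "white") +
  geom_density() +
  facet_wrap(~ sex) +      # one panel per sex
  labs(x = "age (years)", y = "density")

p
```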

Compare sample age distribution with population age distribution

knitr::include_url("https://perspective.usherbrooke.ca/bilan/servlet/BMPagePyramide/USA/2018/?", height=600)

Sherbrooke University provides visualizations of the age structure of the population for a wide range of countries.

Following standard demographic practice, the age structure is presented as an age pyramid.

Note that an age pyramid is a special kind of histogram: two histograms of age, one per sex, sharing the same age classes and drawn back to back with the age axis vertical.
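To make the link between pyramid and histogram concrete, here is a minimal base R sketch on simulated ages (hypothetical values, not GSS data). It computes two histograms on common 5-year classes and draws them as horizontal bars, males to the left and females to the right.

```r
set.seed(42)
# Simulated ages (hypothetical, not GSS data)
age_m <- sample(18:89, 300, replace = TRUE)
age_f <- sample(18:89, 350, replace = TRUE)

breaks <- seq(15, 90, by = 5)   # common 5-year age classes

# Same breaks for both groups, counts only (no plotting yet)
h_m <- hist(age_m, breaks = breaks, plot = FALSE)
h_f <- hist(age_f, breaks = breaks, plot = FALSE)

labels <- paste0(head(breaks, -1) + 1, "-", tail(breaks, -1))
lim    <- max(h_m$counts, h_f$counts)

# Males to the left (negative counts), females to the right
barplot(-h_m$counts, horiz = TRUE, names.arg = labels, las = 1,
        xlim = c(-lim, lim), col = "steelblue",
        xlab = "count (males shown as negative)",
        main = "Age pyramid (simulated data)")
barplot(h_f$counts, horiz = TRUE, add = TRUE, col = "salmon", axes = FALSE)
```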

Parallel boxplots of age with respect to sex
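A minimal base R sketch of parallel boxplots, again on simulated data (the data frame toy is hypothetical). The formula interface draws one box per level of the grouping factor on a shared axis.

```r
set.seed(7)
# Simulated stand-in for df_2018 (hypothetical values, not GSS data)
toy <- data.frame(
  age = round(runif(300, 18, 89)),
  sex = factor(sample(c("male", "female"), 300, replace = TRUE))
)

# The formula age ~ sex draws one box per level of sex on a shared axis
bp <- boxplot(age ~ sex, data = toy, ylab = "age (years)",
              main = "Age by sex (simulated data)")

# Rows: lower whisker, lower hinge, median, upper hinge, upper whisker
bp$stats
```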

QQplot comparing sample male and female age distributions

Make your own qqplot
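A two-sample QQ plot pairs the quantiles of one sample with the quantiles of the other at the same probabilities. The sketch below builds one by hand with quantile() on simulated ages (hypothetical values, not GSS data); the choice of 100 probabilities from ppoints() is an arbitrary assumption.

```r
set.seed(1)
# Simulated ages for two groups (hypothetical, not GSS data)
age_m <- round(runif(200, 18, 89))
age_f <- round(runif(260, 18, 89))

# Evaluate both empirical quantile functions on the same grid of probabilities
probs <- ppoints(100)
q_m <- quantile(age_m, probs = probs, na.rm = TRUE, names = FALSE)
q_f <- quantile(age_f, probs = probs, na.rm = TRUE, names = FALSE)

plot(q_m, q_f,
     xlab = "male age quantiles", ylab = "female age quantiles",
     main = "Hand-made QQ plot (simulated data)")
abline(a = 0, b = 1, lty = 2)   # points near this line: similar distributions
```

The result can be compared with the built-in qqplot(age_m, age_f) from the stats package.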

Scatterplot for age and agekdbrn, facet by sex

Working with gss_sub

library(dplyr)  # provides glimpse()

# gss_sub ships with gssr
data("gss_sub", package = "gssr")

gss_sub |>
  glimpse()
Rows: 72,390
Columns: 19
$ year     <dbl+lbl> 1972, 1972, 1972, 1972, 1972, 1972, 1972, 1972, 1972, 197…
$ id       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
$ ballot   <dbl+lbl> NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), N…
$ age      <dbl+lbl> 23, 70, 48, 27, 61, 26, 28, 27, 21, 30, 30, 56, 54, 49, 4…
$ race     <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, …
$ sex      <dbl+lbl> 2, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 2, 1, 1, 1, 2, 2, …
$ degree   <dbl+lbl> 3, 0, 1, 3, 1, 1, 1, 3, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 3, …
$ padeg    <dbl+lbl>     0,     0,     0,     3,     0,     3,     3,     3,  …
$ madeg    <dbl+lbl> NA(i),     0,     0,     1,     0,     4,     1,     1,  …
$ relig    <dbl+lbl> 3, 2, 1, 5, 1, 1, 2, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ polviews <dbl+lbl> NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), N…
$ fefam    <dbl+lbl> NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), N…
$ vpsu     <dbl+lbl> NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), N…
$ vstrat   <dbl+lbl> NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), N…
$ oversamp <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ formwt   <dbl+lbl> NA(y), NA(y), NA(y), NA(y), NA(y), NA(y), NA(y), NA(y), N…
$ wtssall  <dbl+lbl> 0.4446, 0.8893, 0.8893, 0.8893, 0.8893, 0.4446, 0.4446, 0…
$ sampcode <dbl+lbl> NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), N…
$ sample   <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …

Education through generations

What kind of information do the variables degree and padeg provide?

?degree
?padeg

Compute contingency table for degree and padeg
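A minimal sketch of a two-way contingency table with base R's table(). The two vectors are hypothetical codes, not values taken from gss_sub; in the lab they would be replaced by gss_sub$degree and gss_sub$padeg.

```r
# Hypothetical degree codes for respondents and their fathers
degree <- c(3, 0, 1, 3, 1, 1, 1, 3, 1, 0)
padeg  <- c(0, 0, 0, 3, 0, 3, 3, 3, 1, NA)

# Cross-tabulate; useNA = "ifany" keeps missing values visible
tab <- table(degree = degree, padeg = padeg, useNA = "ifany")
tab

# Same table restricted to complete pairs
tab_complete <- table(degree = degree, padeg = padeg)

# Row proportions: distribution of father's degree within each respondent degree
prop.table(tab_complete, margin = 1)
```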

Visualize contingency table for degree and padeg
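A mosaic plot is one common way to display a two-way table: each tile has an area proportional to the corresponding cell count. The sketch below uses base R's mosaicplot() on made-up codes (hypothetical values, not GSS data).

```r
# Hypothetical degree codes for respondents and their fathers
degree <- c(3, 0, 1, 3, 1, 1, 1, 3, 1, 0, 4, 2)
padeg  <- c(0, 0, 0, 3, 0, 3, 3, 3, 1, 1, 4, 0)

tab <- table(degree = degree, padeg = padeg)

# Tile areas are proportional to cell counts
mosaicplot(tab, main = "degree vs padeg (made-up data)", color = TRUE)
```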

Rearrange the levels of degree and padeg
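When a categorical variable is turned into a factor from text labels, the default level order is alphabetical, which hides the natural ordering of educational attainment. The sketch below sets the order explicitly. The labels in edu_levels are assumptions made for illustration; the actual labels should be read from the data.

```r
# Hypothetical degree labels, as they might come out of a labelled column
degree <- c("bachelor", "lt high school", "high school", "graduate",
            "junior college", "high school")

# Default factor levels are alphabetical, which scrambles the natural order
levels(factor(degree))

# Assumed educational ordering, from lowest to highest attainment
edu_levels <- c("lt high school", "high school", "junior college",
                "bachelor", "graduate")
degree_f <- factor(degree, levels = edu_levels, ordered = TRUE)

table(degree_f)   # counts now appear in educational order
```

Applying the same levels to padeg keeps the rows and columns of the contingency table in matching order. forcats::fct_relevel() is an alternative for reordering an existing factor.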