Published

February 15, 2024

Code
if (!require(gssr)) {
  if (!require(remotes)){
    install.packages("remotes")
  }
  remotes::install_github("kjhealy/gssr")
}

Install and use package gssr

We work again with General Social Survey (GSS) data.

We take advantage of R package gssr

if (!require(gssr)) {
  if (!require(remotes)){
    install.packages("remotes")
  }
  remotes::install_github("kjhealy/gssr")
}

Get data for year 2018

The GSS is carried out every two years. It offers both cross-sectional data and panel data.

Package gssr offers a simple way to retrieve yearly data.

df_2018 <- gssr::gss_get_yr(2018)
Fetching: https://gss.norc.org/documents/stata/2018_stata.zip

Inspect the data

  • How many observations?
  • How many variables?
  • Are the data tidy/messy?
Code
dim(df_2018)
[1] 2348 1068

Numerical summaries for age and agekdbrn

The 2018 data provide (among too many other things) columns named age abd agekdbrn. Get numerical summaries about these two columns.

Code
df_2018 |> 
  dplyr::select(age, agekdbrn) |> 
  skimr::skim() |> 
  skimr::yank("numeric")

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 7 1.00 48.97 18.06 18 34 48 63 89 ▇▇▇▆▃
agekdbrn 682 0.71 24.30 5.74 12 20 23 28 51 ▃▇▃▁▁

Thanks to gssr, you can get meta-information about the columns

?aged
?agekdbrn
?sex

How is sex encoded? Is it worth recoding it?

Code
df_2018 |> 
  mutate(sex=as_factor(sex)) |> 
  skimr::skim(sex) |> 
  skimr::yank("factor")

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
sex 0 1 FALSE 2 fem: 1296, mal: 1052, don: 0, iap: 0

Histogram and density plots for age distribution/facet by sex

Code
p_age <- df_2018 |> 
  mutate(sex=as_factor(sex)) |> 
  ggplot() +
  aes(x=age) +
  facet_wrap(~ sex, ) 
Code
p_age +
  geom_histogram(aes(y=after_stat(density)),
                 fill="white", 
                 color="black",
                 bins=72) +
  labs(
    title="GSS 2018",
    subtitle = "Age distribution of respondents"
  )
Warning: Removed 7 rows containing non-finite values (`stat_bin()`).

  • Play with number of bins
  • Spot the irregular behavior of the histograms
  • Something special at the right edge of both histograms
Code
p_age + 
  stat_density(aes(y=after_stat(density)),
               fill="white", 
               alpha=.5,
               color="black",
               bw = "SJ",
               adjust = .25
               )
Warning: Removed 7 rows containing non-finite values (`stat_density()`).

  • Play with arguments bw and adjust of stat_density
  • Same comments

Compare sample age distribution with population age distribution

knitr::include_url("https://perspective.usherbrooke.ca/bilan/servlet/BMPagePyramide/USA/2018/?", height=600)

Sherbrooke University offers visual information about the age structure of population of a wide range of countries.

Following demographic usage, the age structure is presented through an age pyramid.

Note that an age pyramid is a special kind of histogram

Code

Parallel boxplots of age with respect to sex

Code
df_2018 |> 
  mutate(sex=as_factor(sex)) |> 
  ggplot() +
  aes(y=age, x=sex) +
  geom_boxplot(varwidth = T) +
  xlab("sex")
Warning: Removed 7 rows containing non-finite values (`stat_boxplot()`).

QQplot comparing sample male and female age distributions

Code
with(df_2018,
     qqplot(x=age[sex==1], y=age[sex==2]))

Make your own qqplot

Code
cdf_age_2018_1 <- ecdf(df_2018$age[df_2018$sex==1])

tb <- df_2018 |> 
  dplyr::filter(sex==2) |> 
  dplyr::select(age) |> 
  mutate(Fn=rank(age, ties.method = "max")/n()) |> 
  distinct() |> 
  arrange(age)
  
eqf_age_2018_2 <- with(tb,
     stepfun(x=Fn, y=c(age, max(age)), right = T, f = 1))
Code
filter(df_2018, sex==1) |> 
  ggplot() +
  aes(x=age, y=eqf_age_2018_2(cdf_age_2018_1(age))) +
  geom_point(alpha=.1, fill="white") + 
  geom_abline(intercept = 0, slope=1, linetype="dotted") +
  coord_fixed() +
  xlab("Age (men)") +
  ylab("Age (women)")
Warning: Removed 15 rows containing missing values (`geom_point()`).

Code
# data(gss_all)
Code
data("gss_dict")
Code
gss_dict |> 
  filter(variable=="age")
# A tibble: 1 × 12
    pos variable label      missing var_doc_label value_labels var_text years   
  <int> <chr>    <chr>        <int> <chr>         <chr>        <chr>    <list>  
1    90 age      age of re…     769 age of respo… [89] 89 or … 13. Res… <tibble>
# ℹ 4 more variables: var_yrtab <list>, col_type <chr>, var_type <chr>,
#   var_na_codes <chr>
Code
# gss_which_years(gss_all, c("age", "agekdbrn"))

Scatterplot for age and agekdbrn, facet by sex `

Working with gss_sub

Code
data("gss_sub")

gss_sub |> 
  glimpse()
Rows: 72,390
Columns: 19
$ year     <dbl+lbl> 1972, 1972, 1972, 1972, 1972, 1972, 1972, 1972, 1972, 197…
$ id       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
$ ballot   <dbl+lbl> NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), N…
$ age      <dbl+lbl> 23, 70, 48, 27, 61, 26, 28, 27, 21, 30, 30, 56, 54, 49, 4…
$ race     <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, …
$ sex      <dbl+lbl> 2, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 2, 1, 1, 1, 2, 2, …
$ degree   <dbl+lbl> 3, 0, 1, 3, 1, 1, 1, 3, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 3, …
$ padeg    <dbl+lbl>     0,     0,     0,     3,     0,     3,     3,     3,  …
$ madeg    <dbl+lbl> NA(i),     0,     0,     1,     0,     4,     1,     1,  …
$ relig    <dbl+lbl> 3, 2, 1, 5, 1, 1, 2, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ polviews <dbl+lbl> NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), N…
$ fefam    <dbl+lbl> NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), N…
$ vpsu     <dbl+lbl> NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), N…
$ vstrat   <dbl+lbl> NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), N…
$ oversamp <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ formwt   <dbl+lbl> NA(y), NA(y), NA(y), NA(y), NA(y), NA(y), NA(y), NA(y), N…
$ wtssall  <dbl+lbl> 0.4446, 0.8893, 0.8893, 0.8893, 0.8893, 0.4446, 0.4446, 0…
$ sampcode <dbl+lbl> NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), N…
$ sample   <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
Code
gss_sub |> 
  head()
# A tibble: 6 × 19
  year         id ballot      age    race    sex     degree  padeg   madeg      
  <dbl+lbl> <dbl> <dbl+lbl>   <dbl+> <dbl+l> <dbl+l> <dbl+l> <dbl+l> <dbl+lbl>  
1 1972          1 NA(i) [iap] 23     1 [whi… 2 [fem… 3 [bac… 0 [les… NA(i) [iap]
2 1972          2 NA(i) [iap] 70     1 [whi… 1 [mal… 0 [les… 0 [les…     0 [les…
3 1972          3 NA(i) [iap] 48     1 [whi… 2 [fem… 1 [hig… 0 [les…     0 [les…
4 1972          4 NA(i) [iap] 27     1 [whi… 2 [fem… 3 [bac… 3 [bac…     1 [hig…
5 1972          5 NA(i) [iap] 61     1 [whi… 2 [fem… 1 [hig… 0 [les…     0 [les…
6 1972          6 NA(i) [iap] 26     1 [whi… 1 [mal… 1 [hig… 3 [bac…     4 [gra…
# ℹ 10 more variables: relig <dbl+lbl>, polviews <dbl+lbl>, fefam <dbl+lbl>,
#   vpsu <dbl+lbl>, vstrat <dbl+lbl>, oversamp <dbl+lbl>, formwt <dbl+lbl>,
#   wtssall <dbl+lbl>, sampcode <dbl+lbl>, sample <dbl+lbl>
Code
gss_sub |> 
  dplyr::select(-id, -year) |> 
  summarise(across(everything(), n_distinct)) |> 
  pivot_longer(cols = everything(), names_to="name_col", values_to = "n_distct") |> 
  arrange(n_distct) |> 
  filter(n_distct < 15) |> 
  left_join(gss_dict, by=c("name_col"="variable"))
# A tibble: 11 × 13
   name_col n_distct   pos label     missing var_doc_label value_labels var_text
   <chr>       <int> <int> <chr>       <int> <chr>         <chr>        <chr>   
 1 sex             3   125 responde…     112 respondents … [1] male; [… 23. Cod…
 2 race            4   126 race of …     107 race of resp… [1] white; … 24. Wha…
 3 ballot          5  6072 ballot u…   21875 ballot used … [1] ballot … 1659. B…
 4 fefam           5   784 better f…   37259 better for m… [1] strongl… 252. No…
 5 oversamp        5  6078 weights …       0 weights for … [1] not 198… None    
 6 degree          6    98 r's high…     196 r's highest … [0] less th… 19. If …
 7 padeg           6    99 father's…   17881 father's hig… [0] less th… 20. If …
 8 madeg           6   100 mothers …    8971 mothers high… [0] less th… 21. If …
 9 polviews        8   227 think of…    9672 think of sel… [1] extreme… 67a. We…
10 sample         11  6077 sampling…    4032 sampling fra… [1] 1960 sa… 1664. T…
11 relig          14   336 r's reli…     437 r's religiou… [1] protest… 104. Wh…
# ℹ 5 more variables: years <list>, var_yrtab <list>, col_type <chr>,
#   var_type <chr>, var_na_codes <chr>

Education through generations

What kind of information do we get through variables degree and padeg?

?degree
?padeg

Compute contingency table for degree and padeg

Code
tab_degree_padeg <- gss_sub |> 
  dplyr::select(degree, padeg) |> 
  mutate(across(everything(), as_factor)) |> 
  table()  
Code
tab_degree_padeg |> 
  chisq.test()
Warning in chisq.test(tab_degree_padeg): Chi-squared approximation may be
incorrect

    Pearson's Chi-squared test

data:  tab_degree_padeg
X-squared = NaN, df = 256, p-value = NA

Visualize contingency table for degree and padeg

Code
tab_degree_padeg |> 
  t() |> 
  mosaicplot(color = T)

Rearrange the levels of degree and padeg