lab-gss-r

Published

February 15, 2024

Code

if (!require(gssr)) {
  if (!require(remotes)){
    install.packages("remotes")
  }
  remotes::install_github("kjhealy/gssr")
}

L3 MIASHS
Université Paris Cité
Année 2023-2024
Course Homepage
Moodle

Install and use package `gssr`

We work again with General Social Survey (GSS) data.

We take advantage of R package gssr

if (!require(gssr)) {
  if (!require(remotes)){
    install.packages("remotes")
  }
  remotes::install_github("kjhealy/gssr")
}

Get data for year 2018

The GSS is carried out every two years. It offers both cross-sectional data and panel data.

Package gssr offers a simple way to retrieve yearly data.

df_2018 <- gssr::gss_get_yr(2018)

Fetching: https://gss.norc.org/documents/stata/2018_stata.zip

Inspect the data

How many observations?
How many variables?
Are the data tidy/messy?

Code

dim(df_2018)

[1] 2348 1068

Numerical summaries for `age` and `agekdbrn`

The 2018 data provide (among too many other things) columns named age abd agekdbrn. Get numerical summaries about these two columns.

Code

df_2018 |> 
  dplyr::select(age, agekdbrn) |> 
  skimr::skim() |> 
  skimr::yank("numeric")

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
age	7	1.00	48.97	18.06	18	34	48	63	89	▇▇▇▆▃
agekdbrn	682	0.71	24.30	5.74	12	20	23	28	51	▃▇▃▁▁

Thanks to gssr, you can get meta-information about the columns

?aged
?agekdbrn
?sex

How is `sex` encoded? Is it worth recoding it?

Code

df_2018 |> 
  mutate(sex=as_factor(sex)) |> 
  skimr::skim(sex) |> 
  skimr::yank("factor")

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
sex	0	1	FALSE	2	fem: 1296, mal: 1052, don: 0, iap: 0

Histogram and density plots for `age` distribution/facet by `sex`

Code

p_age <- df_2018 |> 
  mutate(sex=as_factor(sex)) |> 
  ggplot() +
  aes(x=age) +
  facet_wrap(~ sex, )

Code

p_age +
  geom_histogram(aes(y=after_stat(density)),
                 fill="white", 
                 color="black",
                 bins=72) +
  labs(
    title="GSS 2018",
    subtitle = "Age distribution of respondents"
  )

Warning: Removed 7 rows containing non-finite values (`stat_bin()`).

Play with number of bins
Spot the irregular behavior of the histograms
Something special at the right edge of both histograms

Code

p_age + 
  stat_density(aes(y=after_stat(density)),
               fill="white", 
               alpha=.5,
               color="black",
               bw = "SJ",
               adjust = .25
               )

Warning: Removed 7 rows containing non-finite values (`stat_density()`).

Play with arguments bw and adjust of stat_density
Same comments

Compare sample `age` distribution with population `age` distribution

knitr::include_url("https://perspective.usherbrooke.ca/bilan/servlet/BMPagePyramide/USA/2018/?", height=600)

Sherbrooke University offers visual information about the age structure of population of a wide range of countries.

Following demographic usage, the age structure is presented through an age pyramid.

Note that an age pyramid is a special kind of histogram

Code

Parallel boxplots of `age` with respect to `sex`

Code

df_2018 |> 
  mutate(sex=as_factor(sex)) |> 
  ggplot() +
  aes(y=age, x=sex) +
  geom_boxplot(varwidth = T) +
  xlab("sex")

Warning: Removed 7 rows containing non-finite values (`stat_boxplot()`).

QQplot comparing sample male and female age distributions

Code

with(df_2018,
     qqplot(x=age[sex==1], y=age[sex==2]))

Make your own qqplot

Code

cdf_age_2018_1 <- ecdf(df_2018$age[df_2018$sex==1])

tb <- df_2018 |> 
  dplyr::filter(sex==2) |> 
  dplyr::select(age) |> 
  mutate(Fn=rank(age, ties.method = "max")/n()) |> 
  distinct() |> 
  arrange(age)
  
eqf_age_2018_2 <- with(tb,
     stepfun(x=Fn, y=c(age, max(age)), right = T, f = 1))

Code

filter(df_2018, sex==1) |> 
  ggplot() +
  aes(x=age, y=eqf_age_2018_2(cdf_age_2018_1(age))) +
  geom_point(alpha=.1, fill="white") + 
  geom_abline(intercept = 0, slope=1, linetype="dotted") +
  coord_fixed() +
  xlab("Age (men)") +
  ylab("Age (women)")

Warning: Removed 15 rows containing missing values (`geom_point()`).

Code

# data(gss_all)

Code

data("gss_dict")

Code

gss_dict |> 
  filter(variable=="age")

# A tibble: 1 × 12
    pos variable label      missing var_doc_label value_labels var_text years   
  <int> <chr>    <chr>        <int> <chr>         <chr>        <chr>    <list>  
1    90 age      age of re…     769 age of respo… [89] 89 or … 13. Res… <tibble>
# ℹ 4 more variables: var_yrtab <list>, col_type <chr>, var_type <chr>,
#   var_na_codes <chr>

Code

# gss_which_years(gss_all, c("age", "agekdbrn"))

Working with `gss_sub`

Code

data("gss_sub")

gss_sub |> 
  glimpse()

Rows: 72,390
Columns: 19
$ year     <dbl+lbl> 1972, 1972, 1972, 1972, 1972, 1972, 1972, 1972, 1972, 197…
$ id       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
$ ballot   <dbl+lbl> NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), N…
$ age      <dbl+lbl> 23, 70, 48, 27, 61, 26, 28, 27, 21, 30, 30, 56, 54, 49, 4…
$ race     <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, …
$ sex      <dbl+lbl> 2, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 2, 1, 1, 1, 2, 2, …
$ degree   <dbl+lbl> 3, 0, 1, 3, 1, 1, 1, 3, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 3, …
$ padeg    <dbl+lbl>     0,     0,     0,     3,     0,     3,     3,     3,  …
$ madeg    <dbl+lbl> NA(i),     0,     0,     1,     0,     4,     1,     1,  …
$ relig    <dbl+lbl> 3, 2, 1, 5, 1, 1, 2, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ polviews <dbl+lbl> NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), N…
$ fefam    <dbl+lbl> NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), N…
$ vpsu     <dbl+lbl> NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), N…
$ vstrat   <dbl+lbl> NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), N…
$ oversamp <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ formwt   <dbl+lbl> NA(y), NA(y), NA(y), NA(y), NA(y), NA(y), NA(y), NA(y), N…
$ wtssall  <dbl+lbl> 0.4446, 0.8893, 0.8893, 0.8893, 0.8893, 0.4446, 0.4446, 0…
$ sampcode <dbl+lbl> NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), N…
$ sample   <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …

Code

gss_sub |> 
  head()

# A tibble: 6 × 19
  year         id ballot      age    race    sex     degree  padeg   madeg      
  <dbl+lbl> <dbl> <dbl+lbl>   <dbl+> <dbl+l> <dbl+l> <dbl+l> <dbl+l> <dbl+lbl>  
1 1972          1 NA(i) [iap] 23     1 [whi… 2 [fem… 3 [bac… 0 [les… NA(i) [iap]
2 1972          2 NA(i) [iap] 70     1 [whi… 1 [mal… 0 [les… 0 [les…     0 [les…
3 1972          3 NA(i) [iap] 48     1 [whi… 2 [fem… 1 [hig… 0 [les…     0 [les…
4 1972          4 NA(i) [iap] 27     1 [whi… 2 [fem… 3 [bac… 3 [bac…     1 [hig…
5 1972          5 NA(i) [iap] 61     1 [whi… 2 [fem… 1 [hig… 0 [les…     0 [les…
6 1972          6 NA(i) [iap] 26     1 [whi… 1 [mal… 1 [hig… 3 [bac…     4 [gra…
# ℹ 10 more variables: relig <dbl+lbl>, polviews <dbl+lbl>, fefam <dbl+lbl>,
#   vpsu <dbl+lbl>, vstrat <dbl+lbl>, oversamp <dbl+lbl>, formwt <dbl+lbl>,
#   wtssall <dbl+lbl>, sampcode <dbl+lbl>, sample <dbl+lbl>

Code

gss_sub |> 
  dplyr::select(-id, -year) |> 
  summarise(across(everything(), n_distinct)) |> 
  pivot_longer(cols = everything(), names_to="name_col", values_to = "n_distct") |> 
  arrange(n_distct) |> 
  filter(n_distct < 15) |> 
  left_join(gss_dict, by=c("name_col"="variable"))

# A tibble: 11 × 13
   name_col n_distct   pos label     missing var_doc_label value_labels var_text
   <chr>       <int> <int> <chr>       <int> <chr>         <chr>        <chr>   
 1 sex             3   125 responde…     112 respondents … [1] male; [… 23. Cod…
 2 race            4   126 race of …     107 race of resp… [1] white; … 24. Wha…
 3 ballot          5  6072 ballot u…   21875 ballot used … [1] ballot … 1659. B…
 4 fefam           5   784 better f…   37259 better for m… [1] strongl… 252. No…
 5 oversamp        5  6078 weights …       0 weights for … [1] not 198… None    
 6 degree          6    98 r's high…     196 r's highest … [0] less th… 19. If …
 7 padeg           6    99 father's…   17881 father's hig… [0] less th… 20. If …
 8 madeg           6   100 mothers …    8971 mothers high… [0] less th… 21. If …
 9 polviews        8   227 think of…    9672 think of sel… [1] extreme… 67a. We…
10 sample         11  6077 sampling…    4032 sampling fra… [1] 1960 sa… 1664. T…
11 relig          14   336 r's reli…     437 r's religiou… [1] protest… 104. Wh…
# ℹ 5 more variables: years <list>, var_yrtab <list>, col_type <chr>,
#   var_type <chr>, var_na_codes <chr>

Education through generations

What kind of information do we get through variables degree and padeg?

?degree
?padeg

Compute contingency table for `degree` and `padeg`

Code

tab_degree_padeg <- gss_sub |> 
  dplyr::select(degree, padeg) |> 
  mutate(across(everything(), as_factor)) |> 
  table()

Code

tab_degree_padeg |> 
  chisq.test()

Warning in chisq.test(tab_degree_padeg): Chi-squared approximation may be
incorrect


    Pearson's Chi-squared test

data:  tab_degree_padeg
X-squared = NaN, df = 256, p-value = NA

Visualize contingency table for `degree` and `padeg`

Code

tab_degree_padeg |> 
  t() |> 
  mosaicplot(color = T)

Install and use package `gssr`

Get data for year 2018

Inspect the data

Numerical summaries for `age` and `agekdbrn`

How is `sex` encoded? Is it worth recoding it?

Histogram and density plots for `age` distribution/facet by `sex`

Compare sample `age` distribution with population `age` distribution

Parallel boxplots of `age` with respect to `sex`

QQplot comparing sample male and female age distributions

Make your own qqplot

Scatterplot for `age` and `agekdbrn`, facet by `sex` `

Working with `gss_sub`

Education through generations

Compute contingency table for `degree` and `padeg`

Visualize contingency table for `degree` and `padeg`

Rearrange the levels of `degree` and `padeg`

Install and use package gssr

Get data for year 2018

Inspect the data

Numerical summaries for age and agekdbrn

How is sex encoded? Is it worth recoding it?

Histogram and density plots for age distribution/facet by sex

Compare sample age distribution with population age distribution

Parallel boxplots of age with respect to sex

QQplot comparing sample male and female age distributions

Make your own qqplot

Scatterplot for age and agekdbrn, facet by sex `

Working with gss_sub

Education through generations

Compute contingency table for degree and padeg

Visualize contingency table for degree and padeg

Rearrange the levels of degree and padeg

Install and use package `gssr`

Numerical summaries for `age` and `agekdbrn`

How is `sex` encoded? Is it worth recoding it?

Histogram and density plots for `age` distribution/facet by `sex`

Compare sample `age` distribution with population `age` distribution

Parallel boxplots of `age` with respect to `sex`

Scatterplot for `age` and `agekdbrn`, facet by `sex` `

Working with `gss_sub`

Compute contingency table for `degree` and `padeg`

Visualize contingency table for `degree` and `padeg`

Rearrange the levels of `degree` and `padeg`