gssr
We work again with General Social Survey (GSS) data, taking advantage of the R package gssr.
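# Install gssr from GitHub if it is not available yet
if (!require(gssr)) {
  if (!require(remotes)) {
    install.packages("remotes")
  }
  remotes::install_github("kjhealy/gssr")
  library(gssr)  # attach the freshly installed package
}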
The GSS has been carried out every two years since 1994 (and in most years before that, starting in 1972). It offers both cross-sectional and panel data.
The gssr package offers a simple way to retrieve the data for a given year.
df_2018 <- gssr::gss_get_yr(2018)
Fetching: https://gss.norc.org/documents/stata/2018_stata.zip
dim(df_2018)
[1] 2348 1068
age and agekdbrn
The 2018 data provide (among many other things) columns named age and agekdbrn. Get numerical summaries of these two columns.
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
age | 7 | 1.00 | 48.97 | 18.06 | 18 | 34 | 48 | 63 | 89 | ▇▇▇▆▃ |
agekdbrn | 682 | 0.71 | 24.30 | 5.74 | 12 | 20 | 23 | 28 | 51 | ▃▇▃▁▁ |
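The table above has the layout of skimr output, but the call that produced it is not shown. As a minimal sketch of what such a summary computes, the same columns (missing count, completeness, mean, standard deviation, quartiles) can be obtained by hand in base R. The data frame `toy` and the helper `summarise_numeric()` are hypothetical names introduced for illustration, and the values are made up.

```r
# Toy stand-in for the two GSS columns (values are made up for illustration).
toy <- data.frame(
  age      = c(23, 35, 48, 61, NA, 72),
  agekdbrn = c(21, NA, 25, 19, NA, 30)
)

# Hypothetical helper: missing count, completeness, mean, sd and quartiles.
summarise_numeric <- function(x) {
  q <- quantile(x, probs = c(0, .25, .5, .75, 1), na.rm = TRUE, names = FALSE)
  c(n_missing     = sum(is.na(x)),
    complete_rate = mean(!is.na(x)),
    mean          = mean(x, na.rm = TRUE),
    sd            = sd(x, na.rm = TRUE),
    p0 = q[1], p25 = q[2], p50 = q[3], p75 = q[4], p100 = q[5])
}

# One row per column, one column per statistic.
summary_tab <- t(sapply(toy, summarise_numeric))
round(summary_tab, 2)
```

On the real data, a call along the lines of `df_2018 |> dplyr::select(age, agekdbrn) |> skimr::skim()` would presumably give the table shown above; that call is an assumption, not taken from the source.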
Thanks to gssr, you can get meta-information about the columns:
?age
?agekdbrn
?sex
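Besides the help pages, the coding of a column can be inspected directly, since GSS columns are stored as labelled numeric vectors (the `<dbl+lbl>` type in the printouts). The sketch below builds a toy labelled vector with the haven package; `sex_toy` is a hypothetical stand-in, and the codes 1 = male, 2 = female follow the labels visible in the gss_sub printout.

```r
library(haven)

# Toy labelled vector mimicking the GSS coding of sex (1 = male, 2 = female).
sex_toy <- labelled(c(2, 1, 2, 2, 1), labels = c(male = 1, female = 2))

# The value labels travel with the vector as an attribute.
attr(sex_toy, "labels")

# as_factor() replaces the numeric codes by their labels.
sex_fct <- as_factor(sex_toy)
levels(sex_fct)
table(sex_fct)
```

Converting to a factor is what makes facet strips and axis ticks read "male" and "female" rather than 1 and 2.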
How is sex encoded? Is it worth recoding it?
age distribution/facet by sex
p_age <- df_2018 |>
  mutate(sex = as_factor(sex)) |>  # replace numeric codes by their labels
  ggplot() +
  aes(x = age) +
  facet_wrap(~ sex)

p_age +
  geom_histogram(aes(y = after_stat(density)),
                 fill = "white",
                 color = "black",
                 bins = 72) +  # ages range over 18..89, that is 72 values
  labs(
    title = "GSS 2018",
    subtitle = "Age distribution of respondents"
  )
Warning: Removed 7 rows containing non-finite values (`stat_bin()`).
p_age +
  stat_density(aes(y = after_stat(density)),
               fill = "white",
               alpha = .5,
               color = "black",
               bw = "SJ",      # Sheather-Jones bandwidth selector
               adjust = .25)   # multiply the selected bandwidth by 0.25
Warning: Removed 7 rows containing non-finite values (`stat_density()`).
bw and adjust of stat_density
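The roles of these two parameters can be checked with base R's `density()`, which uses the same argument names: `bw` selects the bandwidth (or a rule for choosing it), and `adjust` multiplies the selected bandwidth. This is a minimal sketch on simulated values; `age_toy` is a hypothetical stand-in for the age column, not GSS data.

```r
set.seed(1)
# Simulated values standing in for the age column (not GSS data).
age_toy <- rnorm(500, mean = 49, sd = 18)

d_sj   <- density(age_toy, bw = "SJ")                # Sheather-Jones bandwidth
d_fine <- density(age_toy, bw = "SJ", adjust = .25)  # same rule, bandwidth times 0.25

# Bandwidths actually used by the two estimates.
c(sj = d_sj$bw, fine = d_fine$bw)

# The smaller bandwidth gives a wigglier estimate.
plot(d_fine, main = "Effect of adjust on a kernel density estimate")
lines(d_sj, lty = 2)
```

A small `adjust` follows the data closely and reveals local bumps (for instance age heaping), at the price of a noisier curve.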
Compare age distribution with population age distribution
knitr::include_url("https://perspective.usherbrooke.ca/bilan/servlet/BMPagePyramide/USA/2018/?", height=600)
The Université de Sherbrooke offers visualizations of the age structure of the population for a wide range of countries.
Following demographic usage, the age structure is presented as an age pyramid.
Note that an age pyramid is a special kind of histogram: two histograms of age, one per sex, drawn back to back with age on the vertical axis.
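To make the link between pyramid and histogram concrete, here is a minimal base R sketch that bins simulated ages into 5-year classes and draws the two sets of counts back to back. The data frame `toy` is made up, and the bin width and colours are arbitrary choices.

```r
set.seed(2)
# Simulated respondents (made-up data, not the GSS sample).
toy <- data.frame(
  age = sample(18:89, 400, replace = TRUE),
  sex = sample(c("male", "female"), 400, replace = TRUE)
)

# Bin ages into 5-year classes, as a histogram would.
breaks <- seq(15, 90, by = 5)
toy$age_class <- cut(toy$age, breaks = breaks, right = FALSE)
counts <- table(toy$age_class, toy$sex)

# Men to the left (negative counts), women to the right.
pyramid <- cbind(male = -counts[, "male"], female = counts[, "female"])
barplot(pyramid[, "male"], horiz = TRUE, space = 0, las = 1,
        xlim = c(-1, 1) * max(counts), col = "grey60", xlab = "count")
barplot(pyramid[, "female"], horiz = TRUE, space = 0, add = TRUE,
        col = "white", axes = FALSE, axisnames = FALSE)
```

The negative counts are only a drawing device; the tick labels on the left half should be read in absolute value.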
age with respect to sex
df_2018 |>
  mutate(sex = as_factor(sex)) |>
  ggplot() +
  aes(y = age, x = sex) +
  geom_boxplot(varwidth = TRUE) +  # box width grows with group size
  xlab("sex")
Warning: Removed 7 rows containing non-finite values (`stat_boxplot()`).
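The quantile-quantile plot that follows relies on two helpers, `cdf_age_2018_1` and `eqf_age_2018_2`, whose definitions are not shown. Judging from their names and from the axis labels, they are presumably the empirical CDF of men's ages (sex code 1) and the empirical quantile function of women's ages (sex code 2). The sketch below builds such helpers on made-up ages; this reading is an assumption, not the author's confirmed code.

```r
# Toy ages (made up). With the real data one might use, for instance,
# ages_men <- as.numeric(df_2018$age[df_2018$sex == 1]).
ages_men   <- c(20, 30, 40, 50, NA)
ages_women <- c(22, 33, 44, 55)

# Empirical CDF of men's ages: maps an age to a probability in [0, 1].
# ecdf() silently drops missing values.
cdf_age_2018_1 <- ecdf(ages_men)

# Empirical quantile function of women's ages: maps a probability back to
# an age (type = 1 is the inverse of the empirical CDF).
eqf_age_2018_2 <- function(p) {
  quantile(ages_women, probs = p, type = 1, na.rm = TRUE, names = FALSE)
}

# A man aged 30 sits at the median of the men; the composition returns the
# age of the woman sitting at the same rank among the women.
eqf_age_2018_2(cdf_age_2018_1(30))
```

If the two age distributions were identical, the composed map would be close to the identity, which is why the plot includes the dotted diagonal as a reference.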
filter(df_2018, sex == 1) |>
  ggplot() +
  aes(x = age, y = eqf_age_2018_2(cdf_age_2018_1(age))) +
  geom_point(alpha = .1, fill = "white") +
  geom_abline(intercept = 0, slope = 1, linetype = "dotted") +
  coord_fixed() +
  xlab("Age (men)") +
  ylab("Age (women)")
Warning: Removed 15 rows containing missing values (`geom_point()`).
# data(gss_all)
data("gss_dict")
gss_dict |>
  filter(variable == "age")
# A tibble: 1 × 12
pos variable label missing var_doc_label value_labels var_text years
<int> <chr> <chr> <int> <chr> <chr> <chr> <list>
1 90 age age of re… 769 age of respo… [89] 89 or … 13. Res… <tibble>
# ℹ 4 more variables: var_yrtab <list>, col_type <chr>, var_type <chr>,
# var_na_codes <chr>
# gss_which_years(gss_all, c("age", "agekdbrn"))
age and agekdbrn, facet by sex
gss_sub
data("gss_sub")
gss_sub |>
  glimpse()
Rows: 72,390
Columns: 19
$ year <dbl+lbl> 1972, 1972, 1972, 1972, 1972, 1972, 1972, 1972, 1972, 197…
$ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
$ ballot <dbl+lbl> NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), N…
$ age <dbl+lbl> 23, 70, 48, 27, 61, 26, 28, 27, 21, 30, 30, 56, 54, 49, 4…
$ race <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, …
$ sex <dbl+lbl> 2, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 2, 1, 1, 1, 2, 2, …
$ degree <dbl+lbl> 3, 0, 1, 3, 1, 1, 1, 3, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 3, …
$ padeg <dbl+lbl> 0, 0, 0, 3, 0, 3, 3, 3, …
$ madeg <dbl+lbl> NA(i), 0, 0, 1, 0, 4, 1, 1, …
$ relig <dbl+lbl> 3, 2, 1, 5, 1, 1, 2, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ polviews <dbl+lbl> NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), N…
$ fefam <dbl+lbl> NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), N…
$ vpsu <dbl+lbl> NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), N…
$ vstrat <dbl+lbl> NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), N…
$ oversamp <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ formwt <dbl+lbl> NA(y), NA(y), NA(y), NA(y), NA(y), NA(y), NA(y), NA(y), N…
$ wtssall <dbl+lbl> 0.4446, 0.8893, 0.8893, 0.8893, 0.8893, 0.4446, 0.4446, 0…
$ sampcode <dbl+lbl> NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), NA(i), N…
$ sample <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
gss_sub |>
  head()
# A tibble: 6 × 19
year id ballot age race sex degree padeg madeg
<dbl+lbl> <dbl> <dbl+lbl> <dbl+> <dbl+l> <dbl+l> <dbl+l> <dbl+l> <dbl+lbl>
1 1972 1 NA(i) [iap] 23 1 [whi… 2 [fem… 3 [bac… 0 [les… NA(i) [iap]
2 1972 2 NA(i) [iap] 70 1 [whi… 1 [mal… 0 [les… 0 [les… 0 [les…
3 1972 3 NA(i) [iap] 48 1 [whi… 2 [fem… 1 [hig… 0 [les… 0 [les…
4 1972 4 NA(i) [iap] 27 1 [whi… 2 [fem… 3 [bac… 3 [bac… 1 [hig…
5 1972 5 NA(i) [iap] 61 1 [whi… 2 [fem… 1 [hig… 0 [les… 0 [les…
6 1972 6 NA(i) [iap] 26 1 [whi… 1 [mal… 1 [hig… 3 [bac… 4 [gra…
# ℹ 10 more variables: relig <dbl+lbl>, polviews <dbl+lbl>, fefam <dbl+lbl>,
# vpsu <dbl+lbl>, vstrat <dbl+lbl>, oversamp <dbl+lbl>, formwt <dbl+lbl>,
# wtssall <dbl+lbl>, sampcode <dbl+lbl>, sample <dbl+lbl>
# A tibble: 11 × 13
name_col n_distct pos label missing var_doc_label value_labels var_text
<chr> <int> <int> <chr> <int> <chr> <chr> <chr>
1 sex 3 125 responde… 112 respondents … [1] male; [… 23. Cod…
2 race 4 126 race of … 107 race of resp… [1] white; … 24. Wha…
3 ballot 5 6072 ballot u… 21875 ballot used … [1] ballot … 1659. B…
4 fefam 5 784 better f… 37259 better for m… [1] strongl… 252. No…
5 oversamp 5 6078 weights … 0 weights for … [1] not 198… None
6 degree 6 98 r's high… 196 r's highest … [0] less th… 19. If …
7 padeg 6 99 father's… 17881 father's hig… [0] less th… 20. If …
8 madeg 6 100 mothers … 8971 mothers high… [0] less th… 21. If …
9 polviews 8 227 think of… 9672 think of sel… [1] extreme… 67a. We…
10 sample 11 6077 sampling… 4032 sampling fra… [1] 1960 sa… 1664. T…
11 relig 14 336 r's reli… 437 r's religiou… [1] protest… 104. Wh…
# ℹ 5 more variables: years <list>, var_yrtab <list>, col_type <chr>,
# var_type <chr>, var_na_codes <chr>
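The code behind the table above is not shown. One way to assemble a table of this shape is to count the distinct values in each column, keep the columns with few distinct values, and join the result with the dictionary. The sketch below does this on toy data; `toy_sub` and `toy_dict` are hypothetical stand-ins for gss_sub and gss_dict, the cut-off of 4 is arbitrary, and only the column names `name_col` and `n_distct` are taken from the printout.

```r
library(dplyr)
library(tidyr)

# Toy stand-ins (made up): a small survey extract and a matching dictionary.
toy_sub <- tibble(
  sex    = c(1, 2, 2, 1, NA),
  degree = c(0, 1, 3, 1, 4),
  id     = 1:5
)
toy_dict <- tibble(
  variable = c("sex", "degree", "id"),
  label    = c("respondent's sex", "highest degree", "respondent id")
)

result <- toy_sub |>
  summarise(across(everything(), n_distinct)) |>   # NA counts as a value
  pivot_longer(everything(),
               names_to = "name_col", values_to = "n_distct") |>
  filter(n_distct <= 4) |>                          # keep low-cardinality columns
  arrange(n_distct) |>
  inner_join(toy_dict, by = c("name_col" = "variable"))

result
```

Because `n_distinct()` counts a missing value as a value of its own, a two-category column containing missing entries shows up with 3 distinct values, as sex does in the table.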
What kind of information do we get through variables degree and padeg?
?degree
?padeg
degree and padeg
tab_degree_padeg |>
  chisq.test()
Warning in chisq.test(tab_degree_padeg): Chi-squared approximation may be
incorrect
Pearson's Chi-squared test
data: tab_degree_padeg
X-squared = NaN, df = 256, p-value = NA
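The NaN statistic deserves a word. With df = 256 = 16 × 16, the table appears to be 17 × 17, far more categories than are actually observed for degree and padeg, so some rows or columns are probably empty. An empty row or column has expected count 0, and the corresponding term of the chi-squared statistic is 0/0. The sketch below reproduces the effect on made-up factors and shows that dropping unused levels restores a finite statistic; how `tab_degree_padeg` was really built is not shown in the source, so this diagnosis is an assumption.

```r
# Toy education levels (made up); "lt_hs" is declared but never observed.
lv <- c("lt_hs", "hs", "ba")
degree_toy <- factor(rep(c("hs", "hs", "hs", "ba", "ba", "ba", "hs", "ba"), 10),
                     levels = lv)
padeg_toy  <- factor(rep(c("hs", "hs", "ba", "ba", "ba", "hs", "hs", "ba"), 10),
                     levels = lv)

# 3 x 3 table with an empty row and an empty column.
tab_full <- table(degree_toy, padeg_toy)
res_full <- suppressWarnings(chisq.test(tab_full))
res_full$statistic   # NaN: cells with expected count 0 contribute 0/0

# Dropping the unused levels leaves a 2 x 2 table with positive expected counts.
tab_used <- table(droplevels(degree_toy), droplevels(padeg_toy))
res_used <- chisq.test(tab_used)
res_used$statistic
```

On the GSS data, the analogous step would be to drop unused levels (or filter out missing codes) before building the contingency table.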
degree and padeg
tab_degree_padeg |>
  t() |>
  mosaicplot(color = TRUE)
degree and padeg