name: inter-slide class: left, middle, inverse {{ content }} --- name: layout-general layout: true class: left, middle <style> .remark-slide-number { position: inherit; } .remark-slide-number .progress-bar-container { position: absolute; bottom: 0; height: 4px; display: block; left: 0; right: 0; } .remark-slide-number .progress-bar { height: 100%; background-color: red; } </style>
--- class: middle, left, inverse # Exploratory Data Analysis : Categorical Univariate Statistics ### 2022-02-09 #### [Master I MIDS & MFA]() #### [Analyse Exploratoire de Données](http://stephane-v-boucheron.fr/courses/isidata/) #### [Stéphane Boucheron](http://stephane-v-boucheron.fr) --- template: inter-slide ###
### Numerical summaries for categorical samples ### Barplots ### Pieplots ### Semmelweis quasi-experiment --- template: inter-slide name: num-sum-cat-samples ## Numerical summaries for categorical samples --- ### Counts For qualitative samples, summary statistics consists of _counts_: for each modality, we record the _number of occurences_ or _frequency_
In English, the _frequency_ of a symbol in a sequence is the number of occurences (a count) In French, the _frequence_ of that symbol is the _ratio_ between the number of occurences of the symbol and the total length of the sequence --- ### Summarizing `carrier` .fl.w-40.pa2.f6[ Load `nycflights` and inspect `flights` dataset ```r tmp <- purrr::map( c("tidyverse", "broom", "knitr", "nycflights13"), ~ require(., character.only=T, quietly=T)) flights %>% glimpse(w=40) ``` We focus on variable `carrier` ```r v_carrier <- flights$carrier *v_carrier <- flights %>% * pull(carrier) ``` ] .fl.w-60.pa2.f6[ ``` ## Warning: package 'broom' was built under R version 4.1.2 ``` ``` ## Rows: 336,776 ## Columns: 19 ## $ year <int> 2013, 2013, 201… ## $ month <int> 1, 1, 1, 1, 1, … ## $ day <int> 1, 1, 1, 1, 1, … ## $ dep_time <int> 517, 533, 542, … ## $ sched_dep_time <int> 515, 529, 540, … ## $ dep_delay <dbl> 2, 4, 2, -1, -6… ## $ arr_time <int> 830, 850, 923, … ## $ sched_arr_time <int> 819, 830, 850, … ## $ arr_delay <dbl> 11, 20, 33, -18… ## $ carrier <chr> "UA", "UA", "AA… ## $ flight <int> 1545, 1714, 114… ## $ tailnum <chr> "N14228", "N242… ## $ origin <chr> "EWR", "LGA", "… ## $ dest <chr> "IAH", "IAH", "… ## $ air_time <dbl> 227, 227, 160, … ## $ distance <dbl> 1400, 1416, 108… ## $ hour <dbl> 5, 5, 5, 5, 6, … ## $ minute <dbl> 15, 29, 40, 45,… ## $ time_hour <dttm> 2013-01-01 05:… ``` ] --- ### Function `table()` .fl.w-50.pa2.f6[ Function `table()` builds a _contingency table_ of the counts For univariate samples, `table()` returns a 1-way contingency table ```r *table(v_carrier) %>% tidy() %>% head() %>% kable(col.names = c("Carrier", "Frequency")) ```
Function `table` puts dust under the carpet: NAs are not reported ] .fl.w-50.pa2.f6[ <table> <thead> <tr> <th style="text-align:left;"> Carrier </th> <th style="text-align:right;"> Frequency </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 9E </td> <td style="text-align:right;"> 18460 </td> </tr> <tr> <td style="text-align:left;"> AA </td> <td style="text-align:right;"> 32729 </td> </tr> <tr> <td style="text-align:left;"> AS </td> <td style="text-align:right;"> 714 </td> </tr> <tr> <td style="text-align:left;"> B6 </td> <td style="text-align:right;"> 54635 </td> </tr> <tr> <td style="text-align:left;"> DL </td> <td style="text-align:right;"> 48110 </td> </tr> <tr> <td style="text-align:left;"> EV </td> <td style="text-align:right;"> 54173 </td> </tr> </tbody> </table> ] --- ### Function `summary()` .fl.w-50.pa2.f6[ ```r summary(v_carrier) *summary(factor(v_carrier)) %>% as_tibble(rownames="Carrier") %>% head() %>% kable() ```
Function `summary()` tells us everything there is to know, counts, and NAs
Function `summary()` is generic: it calls a method that depends on the class of the argument ] .fl.w-50.pa2.f6[ ``` ## Length Class Mode ## 336776 character character ``` <table> <thead> <tr> <th style="text-align:left;"> Carrier </th> <th style="text-align:right;"> value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 9E </td> <td style="text-align:right;"> 18460 </td> </tr> <tr> <td style="text-align:left;"> AA </td> <td style="text-align:right;"> 32729 </td> </tr> <tr> <td style="text-align:left;"> AS </td> <td style="text-align:right;"> 714 </td> </tr> <tr> <td style="text-align:left;"> B6 </td> <td style="text-align:right;"> 54635 </td> </tr> <tr> <td style="text-align:left;"> DL </td> <td style="text-align:right;"> 48110 </td> </tr> <tr> <td style="text-align:left;"> EV </td> <td style="text-align:right;"> 54173 </td> </tr> </tbody> </table> ] --- template: inter-slide name: graphical-summaries ## Graphical summaries for categorical samples --- ### Bar plotting in one step using `geom_bar` .fl.w-40.pa2[ ```r flights %>% ggplot() + aes(x=origin) + * geom_bar(width=.3) + ylab("Count") + ggtitle("Flights: destinations distribution") ``` ] .fl.w-60.pa2[ <img src="cm-2.3-EDA-categorical_files/figure-html/flights-dest-out-1.png" width="504" /> ] --- ### Barplotting as a two-step process. 1. Compute _statistics_ from the raw data: we _count the number of occurrences of each symbol_. This is a simple yet non-trivial task. In SQL, this consists in `group by` and then `count`. How would you proceed if you had to code this counting step? 2. Plot the counting data: for example, map the modalities/categories on the `X-axis` (possibly choosing an ordering), the counts are mapped on the `Y-axis`, they specify the height of each bar Most of the flexibility lies in the mapping of modalities on the `X-axis` or doing something different ??? Emphasize - distinction betweem `geom_` and `stat_` --- ### Bar plotting in two steps using `group_by()`, `n()` and `geom_col()` .fl.w-40.pa2.f5[ ```r p <- flights %>% group_by(origin) %>% summarise(Count=n()) %>% ggplot(mapping=aes(x=origin, y=Count)) + ggtitle("Flights: origin distribution") *p + geom_col(width=.3) ``` ] .fl.w-60.pa2[ <img src="cm-2.3-EDA-categorical_files/figure-html/flights-origin-label-out-1.png" width="504" /> ] --- ### Playing with `stat_` and `geom_`
In the `ggplot` framework, a `geom_` function is paired with a default `stat_` function Once a data source and an (aesthetic) mapping have been fixed (telling which columns are fed to which axis), the data are (possibly) transformed using the `stat_`: a sample may be transformed into counts in order to build a contingency table and the result is then fed to the `geom_` --- .fl.w-40.pa2[ We could obtain the same plot using `geom_bar` using the `identity` stat. ```r p + * geom_bar(stat="identity", * width=.3) ``` ] .fl.w-60.pa2[ <img src="cm-2.3-EDA-categorical_files/figure-html/tweak-bar-stat-label-out-1.png" width="504" /> ] --- .fl.w-40.pa2.fa6[ Barplots provide us with a great way to compare counts of different categories. But it is somewhat difficult to figure out the relative frequencies of the different categories. One way to facilitate this consists of normalizing the counts by dividing all counts by sample size. ```r dt <- flights %>% group_by(dest) %>% summarise(Count=n()) %>% mutate(Prop=Count/sum(Count)) p <- dt %>% ggplot() + aes(x=dest, y=Prop) + ggtitle("Flights: dest distribution") (p + geom_col(width=.3)) ``` ] .fl.w-60.pa2[ <img src="cm-2.3-EDA-categorical_files/figure-html/normalize-col-plot-label-out-1.png" width="504" /> ] --- .fl.w-30.pa2[ It may be a good idea to doctor the tick labels on the Y-axis, indicating percentages rather than numbers between `\(0\)` and `\(1\)` (this depends on the readership) We may focus on flights without repeating the whole process, just upgrading the data. Defining a `ggplot` object is very much like defining a _pipeline_ ] .fl.w-70.pa2[ <img src="cm-2.3-EDA-categorical_files/figure-html/unnamed-chunk-3-1.png" width="504" /> ] --- ### Reordering .fl.w-40.pa2.f6[ ```r dt %>% * mutate(dest=fct_reorder(factor(dest), * Prop, .desc = T)) %>% ggplot() + aes(x=dest, y=Prop) + ggtitle("Flights: dest distribution") + geom_col(width=.3) + theme(axis.text.x=element_text(angle=45, hjust=1, size =4)) + xlab("Destination Airport") + ylab("Prop") ``` ] .fl.w-60.pa2[ <img src="cm-2.3-EDA-categorical_files/figure-html/fct_reorder-out-1.png" width="504" /> ] --- count: false .panel1-fct_reorder-auto[ ```r *dt ``` ] .panel2-fct_reorder-auto[ ``` ## # A tibble: 105 × 3 ## dest Count Prop ## <chr> <int> <dbl> ## 1 ABQ 254 0.000754 ## 2 ACK 265 0.000787 ## 3 ALB 439 0.00130 ## 4 ANC 8 0.0000238 ## 5 ATL 17215 0.0511 ## 6 AUS 2439 0.00724 ## 7 AVL 275 0.000817 ## 8 BDL 443 0.00132 ## 9 BGR 375 0.00111 ## 10 BHM 297 0.000882 ## # … with 95 more rows ``` ] --- count: false .panel1-fct_reorder-auto[ ```r dt %>% * mutate(dest=fct_reorder(factor(dest), #<< * Prop, #<< * .desc = T)) ``` ] .panel2-fct_reorder-auto[ ``` ## # A tibble: 105 × 3 ## dest Count Prop ## <fct> <int> <dbl> ## 1 ABQ 254 0.000754 ## 2 ACK 265 0.000787 ## 3 ALB 439 0.00130 ## 4 ANC 8 0.0000238 ## 5 ATL 17215 0.0511 ## 6 AUS 2439 0.00724 ## 7 AVL 275 0.000817 ## 8 BDL 443 0.00132 ## 9 BGR 375 0.00111 ## 10 BHM 297 0.000882 ## # … with 95 more rows ``` ] --- count: false .panel1-fct_reorder-auto[ ```r dt %>% * mutate(dest=fct_reorder(factor(dest), * Prop, .desc = T)) %>% * ggplot() ``` ] .panel2-fct_reorder-auto[ <img src="cm-2.3-EDA-categorical_files/figure-html/fct_reorder_auto_03_output-1.png" width="504" /> ] --- count: false .panel1-fct_reorder-auto[ ```r dt %>% * mutate(dest=fct_reorder(factor(dest), * Prop, .desc = T)) %>% ggplot() + * aes(x=dest, y=Prop) ``` ] .panel2-fct_reorder-auto[ <img src="cm-2.3-EDA-categorical_files/figure-html/fct_reorder_auto_04_output-1.png" width="504" /> ] --- count: false .panel1-fct_reorder-auto[ ```r dt %>% * mutate(dest=fct_reorder(factor(dest), * Prop, .desc = T)) %>% ggplot() + aes(x=dest, y=Prop) + * ggtitle("Flights: dest distribution") ``` ] .panel2-fct_reorder-auto[ <img src="cm-2.3-EDA-categorical_files/figure-html/fct_reorder_auto_05_output-1.png" width="504" /> ] --- count: false .panel1-fct_reorder-auto[ ```r dt %>% * mutate(dest=fct_reorder(factor(dest), * Prop, .desc = T)) %>% ggplot() + aes(x=dest, y=Prop) + ggtitle("Flights: dest distribution") + * geom_col(width=.3) ``` ] .panel2-fct_reorder-auto[ <img src="cm-2.3-EDA-categorical_files/figure-html/fct_reorder_auto_06_output-1.png" width="504" /> ] --- count: false .panel1-fct_reorder-auto[ ```r dt %>% * mutate(dest=fct_reorder(factor(dest), * Prop, .desc = T)) %>% ggplot() + aes(x=dest, y=Prop) + ggtitle("Flights: dest distribution") + geom_col(width=.3) + * theme(axis.text.x=element_text(angle=45, * hjust=1, * size =4)) ``` ] .panel2-fct_reorder-auto[ <img src="cm-2.3-EDA-categorical_files/figure-html/fct_reorder_auto_07_output-1.png" width="504" /> ] --- count: false .panel1-fct_reorder-auto[ ```r dt %>% * mutate(dest=fct_reorder(factor(dest), * Prop, .desc = T)) %>% ggplot() + aes(x=dest, y=Prop) + ggtitle("Flights: dest distribution") + geom_col(width=.3) + theme(axis.text.x=element_text(angle=45, hjust=1, size =4)) + * xlab("Destination Airport") ``` ] .panel2-fct_reorder-auto[ <img src="cm-2.3-EDA-categorical_files/figure-html/fct_reorder_auto_08_output-1.png" width="504" /> ] --- count: false .panel1-fct_reorder-auto[ ```r dt %>% * mutate(dest=fct_reorder(factor(dest), * Prop, .desc = T)) %>% ggplot() + aes(x=dest, y=Prop) + ggtitle("Flights: dest distribution") + geom_col(width=.3) + theme(axis.text.x=element_text(angle=45, hjust=1, size =4)) + xlab("Destination Airport") + * ylab("Prop") ``` ] .panel2-fct_reorder-auto[ <img src="cm-2.3-EDA-categorical_files/figure-html/fct_reorder_auto_09_output-1.png" width="504" /> ] --- count: false .panel1-fct_reorder-auto[ ```r dt %>% * mutate(dest=fct_reorder(factor(dest), * Prop, .desc = T)) %>% ggplot() + aes(x=dest, y=Prop) + ggtitle("Flights: dest distribution") + geom_col(width=.3) + theme(axis.text.x=element_text(angle=45, hjust=1, size =4)) + xlab("Destination Airport") + ylab("Prop") ``` ] .panel2-fct_reorder-auto[ <img src="cm-2.3-EDA-categorical_files/figure-html/fct_reorder_auto_10_output-1.png" width="504" /> ] <style> .panel1-fct_reorder-auto { color: black; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-fct_reorder-auto { color: black; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-fct_reorder-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- template: inter-slide name: pieplots ## Pieplots --- .fl.w-40.pa2.f6[ Another possibility is to use a `Pieplot` Counts are normalized and mapped to angular values (fractions of `\(2\pi\)`). It is up to the data scientist to decide whether one plot is more relevant than the other. ```r flights %>% group_by(carrier) %>% summarize(Count=n()) %>% mutate(Prop=Count/sum(Count)) %>% mutate(carrier=fct_reorder(carrier, Count, .desc = F)) %>% ggplot() + * geom_bar(mapping=aes(x=factor(1), * y=Prop, * fill=carrier), * stat='identity') + * coord_polar(theta='y') + xlab("") + ylab("") ``` ] .fl.w-60.pa2[ <img src="cm-2.3-EDA-categorical_files/figure-html/pieplot-label-out-1.png" width="504" /> ] --- template: inter-slide ## Barplots in action --- ### Semmelweis inquiry .fl.w-third.pa2[ .middle[ ![](img/355px-Ignaz_Semmelweis_1860.jpg) [Filip Semmelweis 1818-1865](https://en.wikipedia.org/wiki/Ignaz_Semmelweis) ] ] .fl.w-two-thirds.pa2[ .bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[ Semmelweis discovered that the incidence of _puerperal_ fever could be cut by the use of hand disinfection in obstetrical clinics. Puerperal fever was common in mid-19th-century hospitals and often fatal. Semmelweis proposed the practice of washing hands with chlorinated lime solutions in 1847 while working in Vienna General Hospital's First Obstetrical Clinic, where doctors' wards had three times the mortality of midwives' wards. .tr[ [wikipedia](https://en.wikipedia.org/wiki/Ignaz_Semmelweis) ] ] ] --- class: middle, center ### Vienna General Hospital ![](img/AAKH-1784.jpg) --- ### Semmelweis inquiry (continued) .small[ .bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[ .small[ Maternity institutions were set up [...] to address problems of infanticide of illegitimate children. They were set up as gratis institutions and offered to care for the infants, which made them attractive to underprivileged women [...]. In return for the free services, the women would be subjects for the training of doctors and midwives. _Two maternity clinics_ were at the Viennese hospital. The First Clinic had an average maternal mortality rate of about `\(10\%\)` due to _puerperal fever_. The Second Clinic's rate was considerably lower, averaging less than `\(4\%\)`. This fact was known outside the hospital. _The two clinics admitted on alternate days_, but women begged to be admitted to the Second Clinic, [...]. Some women even preferred to give birth in the streets, pretending to have given sudden birth en route to the hospital (a practice known as street births), which meant they would still qualify for the child care benefits without having been admitted to the clinic. ] ] ] --- ### Semmelweis inquiry (continued) .fl.w-two-thirds.pa2[ .bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[ .small[ Semmelweis was puzzled that puerperal fever was rare among women giving street births. > "To me, it appeared logical that patients who experienced street births would become ill at least as frequently as those who delivered in the clinic. [...] What protected those who delivered outside the clinic from these destructive unknown endemic influences?" ]]] .fl.w-third.pa2[ <table> <thead> <tr> <th style="text-align:right;"> Year </th> <th style="text-align:right;"> RateM </th> <th style="text-align:right;"> RateL </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1841 </td> <td style="text-align:right;"> 7.8 </td> <td style="text-align:right;"> 3.5 </td> </tr> <tr> <td style="text-align:right;"> 1842 </td> <td style="text-align:right;"> 15.8 </td> <td style="text-align:right;"> 7.6 </td> </tr> <tr> <td style="text-align:right;"> 1843 </td> <td style="text-align:right;"> 9.0 </td> <td style="text-align:right;"> 6.0 </td> </tr> <tr> <td style="text-align:right;"> 1844 </td> <td style="text-align:right;"> 8.2 </td> <td style="text-align:right;"> 2.3 </td> </tr> <tr> <td style="text-align:right;"> 1845 </td> <td style="text-align:right;"> 6.9 </td> <td style="text-align:right;"> 2.0 </td> </tr> <tr> <td style="text-align:right;"> 1846 </td> <td style="text-align:right;"> 11.4 </td> <td style="text-align:right;"> 2.8 </td> </tr> </tbody> </table> ] --- ### Semmelweis inquiry (continued) .bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[ The result was the mortality rate in the First Clinic declined `\(90\%\)`, and was then comparable to that in the Second Clinic. The mortality rate in April 1847 was `\(18.3\%\)`. After hand washing was instituted in mid-May, the rates in June were `\(2.2\%\)`, July `\(1.2\%\)`, August `\(1.9\%\)` and, for the first time since the introduction of anatomical orientation, the death rate was zero in two months in the year following this discovery. ] --- - What is the sample here? - Is there just one sample? - Does the Semmelweis dataset contain a sample or a statistic (something computed from a sample)? --- Plot the yearly death rates at both clinics. Choose aesthetics so as to emphasize the difference in mortality rates in the clinic supervised by doctors and the clinic supervised by midwifes. --- ### Pivoting .fl.w-third.pa2[ `semmelweis` dataframe has _wide format_ It is convenient (but not necessary) to `pivot` columns `RateM` and `RateL` to a pair of columns: `Clinic` and `Death_rate` Column `Clinic` is retyped as a `factor` Column `Year` is also retyped as a `factor` (It would be more natural to handle "Year" as a `Date` object) ] .fl.w-two-thirds.pa2[ ```r tmp <- semmelweis %>% select(Year, RateM, RateL) %>% * pivot_longer(cols=c(RateM, RateL), names_to="Clinic", values_to="Death_rate" ) %>% * mutate(Clinic = as_factor(Clinic), Year = as_factor(Year)) %>% * mutate(Clinic = fct_recode(Clinic, Doctors="RateM", Midwifes="RateL") ) ``` <table> <thead> <tr> <th style="text-align:left;"> Year </th> <th style="text-align:left;"> Clinic </th> <th style="text-align:right;"> Death_rate </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1841 </td> <td style="text-align:left;"> Doctors </td> <td style="text-align:right;"> 7.8 </td> </tr> <tr> <td style="text-align:left;"> 1841 </td> <td style="text-align:left;"> Midwifes </td> <td style="text-align:right;"> 3.5 </td> </tr> <tr> <td style="text-align:left;"> 1842 </td> <td style="text-align:left;"> Doctors </td> <td style="text-align:right;"> 15.8 </td> </tr> <tr> <td style="text-align:left;"> 1842 </td> <td style="text-align:left;"> Midwifes </td> <td style="text-align:right;"> 7.6 </td> </tr> </tbody> </table> ] --- ### Keywords and buzzwords - Randomized clinical trials - A/B testing - Case Control studies - Quasi-experiments - Observational data - Experimental data ??? --- ### Data wrangling makes ploting easy. .panelset[ <div class="panel"> <div class="panel-name">Code</div> ```r tmp %>% ggplot() + aes(x=Year, y=Death_rate, group=Clinic, fill=Clinic, linetype=Clinic) + geom_col(position="dodge", #<< colour="black", width=.2) + theme(legend.position=c(5.5/6, 13/15)) + ggtitle("Semmelweis data") + ylab("Death rate (%)") ``` </div> <div class="panel"> <div class="panel-name">Semmelweis barplot</div> <img src="cm-2.3-EDA-categorical_files/figure-html/semmelweis-label-1.png" width="504" /> </div> ] --- exclude: true .fl.w-third.pa2[ ] .fl.w-two-thirds.pa2[ ![](cm-2.3-EDA-categorical_files/figure-html/semmelweis-label-1.png) ] --- class: center, middle, inverse ## Quasi-experiments and beyond --- ### Milestones - Semmelweis circa 1847 Puerperal fever - Snow circa 1852 Cholera - Hill _Randomized Clinical Trial Streptomyces versus Collapsotherapia against Tuberculosis_ 1948 - ... --- exclude: true class: inverse, center, middle ## Inequalities --- exclude: true ### Why and what - Quantifying income inequalities - Quantifying market share concentration --- exclude: true ### How Normalizing income distribution Non-decreasing rearrangement --- exclude: true ### Lorenz curve ```r ineq::Lc function (x, n = rep(1, length(x)), plot = FALSE) { ina <- !is.na(x) n <- n[ina] x <- as.numeric(x)[ina] k <- length(x) o <- order(x) x <- x[o] n <- n[o] x <- n * x p <- cumsum(n)/sum(n) L <- cumsum(x)/sum(x) p <- c(0, p) L <- c(0, L) L2 <- L * mean(x)/mean(n) Lc <- list(p, L, L2) names(Lc) <- c("p", "L", "L.general") class(Lc) <- "Lc" if (plot) plot(Lc) Lc } ``` --- exclude: true ### Indices and Schur monotonicity - Gini - Atkinson - Decile ratios --- class: middle, center, inverse background-image: url('./img/pexels-cottonbro-3171837.jpg') background-size: cover # The End