name: inter-slide class: left, middle, inverse {{ content }} --- name: layout-general layout: true class: left, middle <style> .remark-slide-number { position: inherit; } .remark-slide-number .progress-bar-container { position: absolute; bottom: 0; height: 4px; display: block; left: 0; right: 0; } .remark-slide-number .progress-bar { height: 100%; background-color: red; } </style>
--- template: inter-slide # Univariate Statistics ### 2021-09-30 #### [Master I ISIFAR]() #### [Analyse de Données](http://stephane-v-boucheron.fr/courses/isidata/) #### [Stéphane Boucheron](http://stephane-v-boucheron.fr) --- template: inter-slide ##
### Samples and variables ### Summary statistics for numerical samples ### Graphical summaries for numerical samples ### Summary statistics for categorical samples ### Graphical summaries for categorical samples ### Cummulative distribution functions and quantile functions --- exclude: true Est etincidunt non adipisci quaerat sed. Ipsum adipisci numquam neque ipsum modi neque modi. Amet velit tempora aliquam etincidunt. Quaerat neque labore velit dolorem dolore eius ipsum. Modi sed dolorem aliquam dolorem non neque. Dolor velit porro etincidunt magnam aliquam labore. Sed quisquam dolore tempora dolor ipsum sit est. Porro numquam adipisci aliquam. Amet labore etincidunt dolore porro. .font50[ Est etincidunt non adipisci quaerat sed. Ipsum adipisci numquam neque ipsum modi neque modi. Amet velit tempora aliquam etincidunt. Quaerat neque labore velit dolorem dolore eius ipsum. Modi sed dolorem aliquam dolorem non neque. Dolor velit porro etincidunt magnam aliquam labore. Sed quisquam dolore tempora dolor ipsum sit est. Porro numquam adipisci aliquam. Amet labore etincidunt dolore porro. ] --- template: inter-slide ## Samples --- ### Titanic passenger dataset .fl.w-third.pa2[
The `Titanic` dataset comprises a list of (related) univariate samples: `Survived`, `Pclass`, `Sex`, `Embarked`, ...
Each sample element corresponds to a unique passenger on board of HMS Titanic during her first transatlantic trip on April 1912.
The dataset obtained from [kaggle]() is built from two `csv` files: `train.csv` and `test.csv`. Column `Survived` is absent from file `test`. ] .fl.w-two-thirds.pa2[ ```r tit_col_types = cols( PassengerId = col_integer(), Survived = col_factor(levels=c("0", "1"), include_na = TRUE), * Pclass = col_factor(levels=c("1", "2", "3"), * ordered = TRUE, * include_na = TRUE), Sex = col_factor(levels = c("female", "male")), Age = col_double(), SibSp = col_integer(), Parch = col_integer(), Embarked = col_factor(levels = c("S", "C", "Q"), include_na = TRUE) ) ``` ] --- ### Viewing (part of) the dataset
Table loading and wrangling ```r train <- read_csv("DATA/titanic/train.csv", col_types=tit_col_types) test <- read_csv("DATA/titanic/test.csv", col_types=tit_col_types) ``` ``` ## Warning: The following named parsers don't match the column names: Survived ``` ```r *test <- mutate(test, * Survived=NA) tit <- union(train, test) ``` --- ###
Reporting .fl.w-40.pa2[ ```r tit %>% * dplyr::select(Survived, Pclass, Sex, Age) %>% * dplyr::sample_n(5) %>% * knitr::kable() ``` We _project_ the table on four columns, then _sample_ 5 rows at random, and then _display_ ] .fl.w-60.pa2[ <table> <thead> <tr> <th style="text-align:left;"> Survived </th> <th style="text-align:left;"> Pclass </th> <th style="text-align:left;"> Sex </th> <th style="text-align:right;"> Age </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 49 </td> </tr> <tr> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> NA </td> </tr> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> female </td> <td style="text-align:right;"> 7 </td> </tr> <tr> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 27 </td> </tr> <tr> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 13 </td> </tr> </tbody> </table> ] --- ### Definition A _sample_ is a sequence of items of the same type In
, if the type matches a basetype, a sample is typically represented using a `vector` A numerical (quantitative) sample is just a sequence of numbers (either integers or floats), it is represented by a `numeric` vector. A _categorical_ (qualitative) sample is a sequence of values from some predefined finite set In
, categorical samples are handled using a special class: `factor` --- In this lesson, we are concerned with _univariate statistics_, that is either with - _numerical samples_, - with _categorical samples_ taken from some finite (or seldom countable) set
The boundary between qualitative and quantitative is not always straightforward --- We will learn about: - numerical summaries - graphical displays for univariate samples --- class: middle, center, inverse ## Summary statistics --- ### Numerical summaries for quantitative variables - __Mean__: `$$\overline{X}_n = \frac{1}{n}\sum_{i=1}^n x_i \qquad \textbf{(location)}$$` - **Standard deviation**: `$$s^2 = \frac{1}{n-1} \sum_{i=1}^n \big(x_i - \overline{X}_n\big)^2 \quad \text{or}\quad s^2 = \frac{1}{n} \sum_{i=1}^n \big(x_i - \overline{X}_n\big)^2\qquad\textbf{(dispersion)}$$` - **Skewness** `$$\frac{\frac{1}{n} \sum_{i=1}^n \big(x_i - \overline{X}_n\big)^3}{s^3}$$` - **Kurtosis** `$$\kappa = \frac{\frac{1}{n} \sum_{i=1}^n \big(x_i -\overline{X}_n\big)^4}{s^4} - 2$$` --- ### Numerical summaries at work ```r s <- rnorm(1000) # a random sample from the standard normal distribution *mean(s) - sum(s)/length(s) ``` ``` ## [1] -4.336809e-19 ``` ```r *sd(s) - sqrt(sum((s-mean(s))^2)/(length(s)-1)) ``` ``` ## [1] 0 ``` ```r *mean((s- mean(s))^4)/sd(s)^4 -3 ``` ``` ## [1] -0.05186417 ``` --- ### What if some data are missing? ```r s <- rnorm(10) s[c(2, 3, 7)] = NA_real_ s ``` ``` ## [1] 0.97326414 NA NA 0.47182634 0.04304689 0.07657912 ## [7] NA 0.87267271 0.74392937 -1.34577888 ``` ```r *mean(s) ``` ``` ## [1] NA ``` ```r *mean(s, na.rm = TRUE) ``` ``` ## [1] 0.26222 ``` ```r mean(s[-c(2, 3, 7)]) ``` ``` ## [1] 0.26222 ``` --- ### Missing values and trimming The code for `mean.default` goes beyond computing the weighted mean ```r function (x, trim = 0, na.rm = FALSE, ...) { if (!is.numeric(x) && !is.complex(x) && !is.logical(x)) { warning("argument is not numeric or logical: returning NA") return(NA_real_) } * if (na.rm) * x <- x[!is.na(x)] if (!is.numeric(trim) || length(trim) != 1L) stop("'trim' must be numeric of length one") n <- length(x) if (trim > 0 && n) { if (is.complex(x)) stop("trimmed means are not defined for complex data") if (anyNA(x)) return(NA_real_) if (trim >= 0.5) return(stats::median(x, na.rm = FALSE)) * lo <- floor(n * trim) + 1 * hi <- n + 1 - lo * x <- sort.int(x, partial = unique(c(lo, hi)))[lo:hi] } .Internal(mean(x)) } ``` --- ###
- What happens when `na.rm` is `TRUE`? - What are the possible values of argument `trim`? - If `trim` belongs to `\((0, 1/2)\)` what happens? - Which lines actually compute the (uniformly) weighted mean? ??? - The first `if (...) {...}` is an example of defensive programming - `if (na.rm) { ... }` handles missing data when the default behavior is overriden. `NA` are filtered out - When positive, parameter `trim` allows us to compute _trimmed means_ Depending on optional arguments, fucntion `mean` allows to compute either arithmetic means, or trimmed (arithmetic) means --- ###
- In a _tidy_ sample, _missing values_ are explicitely pointed as missing information using special objects `NA`, `NA_integer_`, `NA_real_`, ... - To compute summary statistics when facing missing values, `na.rm` has to be `TRUE`. - What is the class of `NA`, `NA_integer_`, `NA_real_`? --- template: inter-slide ## Robust summary statistics --- ### `summary(...)` ```r summary(s) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's ## -1.34578 0.05981 0.47183 0.26222 0.80830 0.97326 3 ``` --- - **Order statistics**: `\(x_{1:n} \leq x_{2:n} \leq \ldots \leq x_{n:n}\)` - **Median** `\(x_{\lfloor n/2\rfloor+1:n}\)` if `\(n\)` is odd, `\((x_{n/2:n}+ x_{n/2+1:n})/2\)` if `\(n\)` is even - **Quartiles** `\(x_{\lfloor n/4\rfloor:n}\)` and `\(x_{\lfloor 3 n/4\rfloor:n}\)` - **Quintiles** `\(x_{\lfloor k n/5\rfloor:n}\)` pour `\(k \in 1, \ldots, 4\)` - **Deciles** `\(x_{\lfloor k n/10\rfloor:n}\)` pour `\(k \in 1, \ldots, 9\)` - **Percentiles** - ... --- exclude: true Why _robustness_? .bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[ 'Est etincidunt non adipisci quaerat sed. Ipsum adipisci numquam neque ipsum modi neque modi. Amet velit tempora aliquam etincidunt. Quaerat neque labore velit dolorem dolore eius ipsum. Modi sed dolorem aliquam dolorem non neque. Dolor velit porro etincidunt magnam aliquam labore. Sed quisquam dolore tempora dolor ipsum sit est. Porro numquam adipisci aliquam. Amet labore etincidunt dolore porro.' ] --- exclude: true Let `\(x_1, \ldots, x_n\)` be a numerical sample. ```r rapportools::kurtosis ``` ``` ## function (x, na.rm = TRUE) ## { ## if (is.variable(x)) { ## if (na.rm) ## x <- na.omit(x) ## m <- base::mean(x) ## s <- stats::sd(x) ## n <- length(x) ## (((base::sum((x - m)^4)/n)/s^4) - 3) ## } ## else { ## if (is.matrix(x)) ## apply(x, 2, kurtosis, na.rm = na.rm) ## else if (is.data.frame(x)) ## sapply(x, kurtosis, na.rm = na.rm) ## else stop("unsupported type") ## } ## } ## <bytecode: 0x7fb4cfea8bc8> ## <environment: namespace:rapportools> ``` Check that the kurtosis and the skewness are translation invariant. --- template: inter-slide ## Boxplots --- .bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[ In descriptive statistics, a box plot or boxplot is a method for graphically depicting groups of numerical data through their _quantiles_ (more specifically _quartiles_) Box plots also have lines extending from the boxes (__whiskers__) indicating variability outside the upper and lower quartiles, hence the terms _box-and-whisker plot_ and box-and-whisker diagram. _Outliers_ may be plotted as individual points. Box plots are non-parametric: they display variation in samples of a statistical population without making any assumptions of the underlying statistical distribution (though Tukey's boxplot assumes symmetry for the whiskers and normality for their length). The spacings between the different parts of the box indicate the degree of _dispersion_ (spread) and _skewness_ in the data, and show _outliers_. In addition to the points themselves, they allow one to visually estimate various _L-estimators_, notably the _interquartile range_, _midhinge_, _range_, _mid-range_, and _trimean_. .tr[ [Wikipedia](https://en.wikipedia.org/wiki/Box_plot) ] ] --- > Tukey's boxplot assumes symmetry for the whiskers and normality for their length What does this mean? Assume that we _boxplot_ the `\(\mathcal{N}(\mu, \sigma^2) \sim \mu + \sigma \mathcal{N}(0, 1)\)` distribution instead of an empirical distribution. Then: - The median is equal to `\(\mu\)` because of the symmetry of `\(\mathcal{N}(0,1)\)`, - The quartiles are `\(\mu - \sigma \Phi^{\leftarrow}(3/4)\)` and `\(\mu + \sigma \Phi^{\leftarrow}(3/4)\)`. - The IQR is `\(2 \sigma \Phi^{\leftarrow}(3/4) \approx 1.35\times \sigma\)` - The default length of the whiskers (under `geom_boxplot`) is `\(1.5 \times \text{IQR}\)` - For a Gaussian population, `\(1.5 \text{IQR} = 3 \sigma \Phi^{\leftarrow}(3/4)\)` - In a Gaussian population, the probability that a sample point lies outside the whiskers is `\(2 \times (1 - \Phi(4 \times \Phi^{\leftarrow}(3/4))) \approx 7/1000\)` --- class: top, left ### Boxplot of a Gaussian sample ```r n <- 100 fake_data <- tibble(normal=rnorm(n),t=rt(n, df=3)) title <- stringr::str_c("Standard Gaussian sample, size : ", n) fake_data %>% ggplot(mapping = aes(x=1L, y=normal)) + * geom_boxplot(outlier.colour = "red", outlier.shape = 2, coef=1.5) + ggtitle(title) + ylab("") + ggplot2::theme( axis.text.x = element_blank(), axis.ticks.x = element_blank() ) ``` ```r summary(fake_data$normal) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## -2.6090 -0.4832 0.3411 0.2160 0.8663 2.5145 ``` .plot-callout[  ] --- ### Boxplot of a Gaussian sample .fl.w-60.pa2[  ] ??? Playing with boxplots - Tune whiskers lengths using `coef` - Jitter to spot the whole --- If we _boxplot_ a Gaussian sample, on average, `\(4\%\)` of the points lie outside the whiskers. Such points are called _extremes_ Deciding which points should be considered as extremes is partly a matter of taste Boxplots provide a visual assessment of the departure of a sample from normality/Gaussianity [`geom_boxplot` documentation](https://ggplot2.tidyverse.org/reference/geom_boxplot.html) ??? Point out the distinction between outliers and extremes --- ### Boxplot of a non-Gaussian sample The sampling distribution ( `\(t\)`-distribution with `\(3\)` degrees of freedom) is _heavy-tailed_ Some sample points lie far away from the whiskers. ```r title <- stringr::str_c("Student sample with 3 df, size : ", n) ggplot(fake_data, aes(y=t, x=1L)) + * geom_boxplot(outlier.shape=2, outlier.colour = 'red', outlier.alpha=.5, coef=1.5) + ylab("") + ggtitle(title) + ggplot2::theme( axis.text.x = element_blank(), axis.ticks.x = element_blank() ) ``` ```r summary(fake_data$t) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## -7.93685 -0.63715 -0.07165 -0.08659 0.53770 3.72539 ``` .plot-callout[  ] --- ### Boxplot of a non-Gaussian sample .fl.w-60.pa2[  ] --- ### Plotting two boxplots side by side .fl.w-60.pa2[ <img src="cm-2-EDA_files/figure-html/unnamed-chunk-9-1.png" width="504" /> ] .fl.w-40.pa2[ With usual whiskers length - in Gaussian samples: very few sample points outside the whiskers (few extremes/outliers), extremes lie close to the whiskers endpoints - in heavy-tailed distributions: many sample points outside the whiskers, and some extremes may lie far far away ] ??? Comment on pivoting --- ### Boxplotting Titanic Fare prices .fl.w-60.pa2[ ``` ## Warning: Removed 1 rows containing non-finite values (stat_boxplot). ``` <img src="cm-2-EDA_files/figure-html/titanic-fares-boxplot-label-1.png" width="504" /> ] .fl.w-40.pa2[ Titanic fare prices distribution exhibits a non-Gaussian behavior
The median looks much closer to the first quartile than to the third quartile (asymmetry) There are many extreme points Extreme lie far away above the whiskers ] --- template: inter-slide ## Violinplots --- .fl.w-60.pa2[ ``` ## Warning: Removed 1 rows containing non-finite values (stat_ydensity). ``` <img src="cm-2-EDA_files/figure-html/titanic-fares-violinplot-label-1.png" width="504" /> ] .fl.w-40.pa2[ Titanic fare prices distribution exhibits a non-Gaussian behavior
They are a mix of three populations that also exhibit non-Gaussian behavior ] --- exclude: true .bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[ 'Est etincidunt non adipisci quaerat sed. Ipsum adipisci numquam neque ipsum modi neque modi. Amet velit tempora aliquam etincidunt. Quaerat neque labore velit dolorem dolore eius ipsum. Modi sed dolorem aliquam dolorem non neque. Dolor velit porro etincidunt magnam aliquam labore. Sed quisquam dolore tempora dolor ipsum sit est. Porro numquam adipisci aliquam. Amet labore etincidunt dolore porro.' ] --- template: inter-slide ## Histograms --- exclude: true .bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[ 'Est etincidunt non adipisci quaerat sed. Ipsum adipisci numquam neque ipsum modi neque modi. Amet velit tempora aliquam etincidunt. Quaerat neque labore velit dolorem dolore eius ipsum. Modi sed dolorem aliquam dolorem non neque. Dolor velit porro etincidunt magnam aliquam labore. Sed quisquam dolore tempora dolor ipsum sit est. Porro numquam adipisci aliquam. Amet labore etincidunt dolore porro.' ] --- template: inter-slide ## Numerical summaries for categorical samples --- ### Counts For qualitative samples, summary statistics consists of _counts_: for each modality, we record the _number of occurences_ or _frequency_
In English, the _frequency_ of a symbol in a sequence is the number of occurences In French, the _frequence_ of that symbol is the ratio between the number of occurences of the symbol and the total length of the sequence --- ### Summarizing `Pclass` ```r summary(train$Pclass) ``` ``` ## 1 2 3 ## 216 184 491 ``` ```r *table(train$Pclass) %>% kable(col.names = c("PClass", "Frequency")) ``` <table> <thead> <tr> <th style="text-align:left;"> PClass </th> <th style="text-align:right;"> Frequency </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 216 </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:right;"> 184 </td> </tr> <tr> <td style="text-align:left;"> 3 </td> <td style="text-align:right;"> 491 </td> </tr> </tbody> </table>
: function `table()` uses the cross-classifying factors to build a _contingency table_ of the counts at each combination of factor levels. For univariate samples, ... --- ### Summarizing `Embarked` ```r summary(train$Embarked) ``` ``` ## S C Q NA's ## 644 168 77 2 ``` ```r *table(train$Embarked) %>% kable(col.names = c("Embarkment", "Frequency")) ``` <table> <thead> <tr> <th style="text-align:left;"> Embarkment </th> <th style="text-align:right;"> Frequency </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> S </td> <td style="text-align:right;"> 644 </td> </tr> <tr> <td style="text-align:left;"> C </td> <td style="text-align:right;"> 168 </td> </tr> <tr> <td style="text-align:left;"> Q </td> <td style="text-align:right;"> 77 </td> </tr> </tbody> </table>
Base function `summary` tells us everything there is to know: number of passengers for every class and passengers, and number of sample elements with unknown class
Base function `table` puts dust under the carpet: number of passengers with unknown class is unreported --- exclude: true .bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[ 'Est etincidunt non adipisci quaerat sed. Ipsum adipisci numquam neque ipsum modi neque modi. Amet velit tempora aliquam etincidunt. Quaerat neque labore velit dolorem dolore eius ipsum. Modi sed dolorem aliquam dolorem non neque. Dolor velit porro etincidunt magnam aliquam labore. Sed quisquam dolore tempora dolor ipsum sit est. Porro numquam adipisci aliquam. Amet labore etincidunt dolore porro.' ] --- exclude: true ### Contingency tables ```r # data %>% # table() %>% # kable() ``` --- template: inter-slide ## Graphical summaries for categorical samples --- exclude: true .bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[ 'Est etincidunt non adipisci quaerat sed. Ipsum adipisci numquam neque ipsum modi neque modi. Amet velit tempora aliquam etincidunt. Quaerat neque labore velit dolorem dolore eius ipsum. Modi sed dolorem aliquam dolorem non neque. Dolor velit porro etincidunt magnam aliquam labore. Sed quisquam dolore tempora dolor ipsum sit est. Porro numquam adipisci aliquam. Amet labore etincidunt dolore porro.' ] xx --- template: inter-slide ## Barplots --- For qualitative samples, `barplot` are a simple way of visualizing counts --- ### Bar plotting in one step using `geom_bar` .fl.w-50.pa2[ ```r tit %>% ggplot(mapping=aes(x=Pclass)) + * geom_bar(width=.3) + ylab("Count") + ggtitle("Titanic: boarding distributions") ``` ] .fl.w-50.pa2[  ] --- ### Barplotting as a two-step process. 1. Compute _statistics_ from the raw data: we _count the number of occurrences of each symbol_. This is a simple yet non-trivial task. In SQL, this consists in `group by` and then `count`. How would you proceed if you had to code this counting step? 2. Plot the counting data: for example, map the modalities/categories on the `X-axis` (possibly choosing an ordering), the counts are mapped on the `Y-axis`, they specify the height of each bar Most of the flexibility lies in the mapping of modalities on the `X-axis` or doing something different ??? Emphasize - distinction betweem `geom_` and `stat_` --- ### Bar plotting in two steps using `group_by()`, `n()` and `geom_col()` .fl.w-50.pa2.f5[ ```r p <- tit %>% group_by(Pclass) %>% summarise(Count=n()) %>% ggplot(mapping=aes(x=Pclass, y=Count)) + ggtitle("Titanic: boarding distributions") *p + geom_col(width=.3) ``` ] .fl.w-50.pa2[  ] --- ### Playing with `stat_` and `geom_`
In the `ggplot` framework, a `geom_` function is paired with a default `stat_` function Once a data source and an (aesthetic) mapping have been fixed (telling which columns are fed to which axis), the data are (possibly) transformed using the `stat_`: a sample may be transformed into counts in order to build a contingency table and the result is then fed to the `geom_` --- .fl.w-50.pa2[ We could obtain the same plot using `geom_bar` using the `identity` stat. ```r p + * geom_bar(stat="identity", * width=.3) ``` ] .fl.w-50.pa2[  ] --- .fl.w-50.pa2[ Barplots provide us with a great way to compare counts of different categories. But it is somewhat difficult to figure out the relative frequencies of the different categories. One way to facilitate this consists of normalizing the counts by dividing all counts by sample size. ```r dt <- tit %>% group_by(Pclass) %>% dplyr::summarise(Count=n() ) %>% dplyr::mutate(Prop=Count/sum(Count)) p <- dt %>% ggplot(mapping=aes(x=Pclass, y=Prop)) + ggtitle("Titanic: boarding distributions") (p + geom_col(width=.3)) ``` ] .fl.w-50.pa2[  ] --- .fl.w-third.pa2[ It may be a good idea to doctor the tick labels on the Y-axis, indicating percentages rather than numbers between `\(0\)` and `\(1\)` (this depends on the readership) We may focus on passengers without repeating the whole process, just upgrading the data. Defining a `ggplot` object is very much like defining a _pipeline_ ] .fl.w-two-thirds.pa2[ <img src="cm-2-EDA_files/figure-html/unnamed-chunk-13-1.png" width="504" /> ] --- exclude: true .fl.w-third.pa2[ A stacked barplot offers another possibility to show the different proportions ] .fl.w-two-thirds.pa2[ ```r p <- dt %>% ggplot(mapping=aes(x=Cat, y=Prop)) ``` ] --- class: middle, inverse, center ## Pieplots --- .fl.w-third.pa2[ Another possibility is to use a `Pieplot` Counts are normalized and mapped to angular values (fractions of `\(2\pi\)`). It is up to the data scientist to decide whether one plot is more relevant than the other. ```r dt %>% ggplot() + geom_bar(mapping=aes(x=factor(1), y=Prop, fill=Pclass), stat='identity') + coord_polar(theta='y') + xlab("") + ylab("") ``` ] .fl.w-two-thirds.pa2[  ] --- template: inter-slide ## Barplots in action --- ### Semmelweis inquiry .fl.w-third.pa2[ .middle[  [Filip Semmelweis 1818-1865](https://en.wikipedia.org/wiki/Ignaz_Semmelweis) ] ] .fl.w-two-thirds.pa2[ .bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[ Semmelweis discovered that the incidence of _puerperal_ fever could be cut by the use of hand disinfection in obstetrical clinics. Puerperal fever was common in mid-19th-century hospitals and often fatal. Semmelweis proposed the practice of washing hands with chlorinated lime solutions in 1847 while working in Vienna General Hospital's First Obstetrical Clinic, where doctors' wards had three times the mortality of midwives' wards. .tr[ [wikipedia](https://en.wikipedia.org/wiki/Ignaz_Semmelweis) ] ] ] --- class: middle, center ### Vienna General Hospital  --- ### Semmelweis inquiry (continued) .small[ .bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[ .small[ Maternity institutions were set up [...] to address problems of infanticide of illegitimate children. They were set up as gratis institutions and offered to care for the infants, which made them attractive to underprivileged women [...]. In return for the free services, the women would be subjects for the training of doctors and midwives. _Two maternity clinics_ were at the Viennese hospital. The First Clinic had an average maternal mortality rate of about `\(10\%\)` due to _puerperal fever_. The Second Clinic's rate was considerably lower, averaging less than `\(4\%\)`. This fact was known outside the hospital. _The two clinics admitted on alternate days_, but women begged to be admitted to the Second Clinic, [...]. Some women even preferred to give birth in the streets, pretending to have given sudden birth en route to the hospital (a practice known as street births), which meant they would still qualify for the child care benefits without having been admitted to the clinic. ] ] ] --- ### Semmelweis inquiry (continued) .fl.w-two-thirds.pa2[ .bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[ .small[ Semmelweis was puzzled that puerperal fever was rare among women giving street births. > "To me, it appeared logical that patients who experienced street births would become ill at least as frequently as those who delivered in the clinic. [...] What protected those who delivered outside the clinic from these destructive unknown endemic influences?" ]]] .fl.w-third.pa2[ <table> <thead> <tr> <th style="text-align:right;"> Year </th> <th style="text-align:right;"> RateM </th> <th style="text-align:right;"> RateL </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1841 </td> <td style="text-align:right;"> 7.8 </td> <td style="text-align:right;"> 3.5 </td> </tr> <tr> <td style="text-align:right;"> 1842 </td> <td style="text-align:right;"> 15.8 </td> <td style="text-align:right;"> 7.6 </td> </tr> <tr> <td style="text-align:right;"> 1843 </td> <td style="text-align:right;"> 9.0 </td> <td style="text-align:right;"> 6.0 </td> </tr> <tr> <td style="text-align:right;"> 1844 </td> <td style="text-align:right;"> 8.2 </td> <td style="text-align:right;"> 2.3 </td> </tr> <tr> <td style="text-align:right;"> 1845 </td> <td style="text-align:right;"> 6.9 </td> <td style="text-align:right;"> 2.0 </td> </tr> <tr> <td style="text-align:right;"> 1846 </td> <td style="text-align:right;"> 11.4 </td> <td style="text-align:right;"> 2.8 </td> </tr> </tbody> </table> ] --- ### Semmelweis inquiry (continued) .bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[ The result was the mortality rate in the First Clinic declined `\(90\%\)`, and was then comparable to that in the Second Clinic. The mortality rate in April 1847 was `\(18.3\%\)`. After hand washing was instituted in mid-May, the rates in June were `\(2.2\%\)`, July `\(1.2\%\)`, August `\(1.9\%\)` and, for the first time since the introduction of anatomical orientation, the death rate was zero in two months in the year following this discovery. ] --- - What is the sample here? - Is there just one sample? - Does the Semmelweis dataset contain a sample or a statistic (something computed from a sample)? --- Plot the yearly death rates at both clinics. Choose aesthetics so as to emphasize the difference in mortality rates in the clinic supervised by doctors and the clinic supervised by midwifes. --- ### Pivoting .fl.w-third.pa2[ `semmelweis` dataframe has _wide format_ It is convenient (but not necessary) to `pivot` columns `RateM` and `RateL` to a pair of columns: `Clinic` and `Death_rate` Column `Clinic` is retyped as a `factor` Column `Year` is also retyped as a `factor` (It would be more natural to handle "Year" as a `Date` object) ] .fl.w-two-thirds.pa2[ ```r tmp <- semmelweis %>% select(Year, RateM, RateL) %>% * pivot_longer(cols=c(RateM, RateL), names_to="Clinic", values_to="Death_rate" ) %>% * mutate(Clinic = as_factor(Clinic), Year = as_factor(Year)) %>% * mutate(Clinic = fct_recode(Clinic, Doctors="RateM", Midwifes="RateL") ) ``` <table> <thead> <tr> <th style="text-align:left;"> Year </th> <th style="text-align:left;"> Clinic </th> <th style="text-align:right;"> Death_rate </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1841 </td> <td style="text-align:left;"> Doctors </td> <td style="text-align:right;"> 7.8 </td> </tr> <tr> <td style="text-align:left;"> 1841 </td> <td style="text-align:left;"> Midwifes </td> <td style="text-align:right;"> 3.5 </td> </tr> <tr> <td style="text-align:left;"> 1842 </td> <td style="text-align:left;"> Doctors </td> <td style="text-align:right;"> 15.8 </td> </tr> <tr> <td style="text-align:left;"> 1842 </td> <td style="text-align:left;"> Midwifes </td> <td style="text-align:right;"> 7.6 </td> </tr> </tbody> </table> ] --- ### Keywords and buzzwords - Randomized clinical trials - A/B testing - Case Control studies - Quasi-experiments - Observational data - Experimental data ??? --- ### Data wrangling makes ploting easy. .panelset[ <div class="panel"> <div class="panel-name">Code</div> ```r tmp %>% ggplot() + aes(x=Year, y=Death_rate, group=Clinic, fill=Clinic, linetype=Clinic) + geom_col(position="dodge", #<< colour="black", width=.2) + theme(legend.position=c(5.5/6, 13/15)) + ggtitle("Semmelweis data") + ylab("Death rate (%)") ``` </div> <div class="panel"> <div class="panel-name">Semmelweis barplot</div> <img src="cm-2-EDA_files/figure-html/semmelweis-label-1.png" width="504" /> </div> ] --- exclude: true .fl.w-third.pa2[ ] .fl.w-two-thirds.pa2[  ] --- class: center, middle, inverse ## Quasi-experiments and beyond --- ### Milestones - Semmelweis circa 1847 Puerperal fever - Snow circa 1852 Cholera - Hill _Randomized Clinical Trial Streptomyces versus Collapsotherapia against Tuberculosis_ 1948 - ... --- class: center, middle, inverse ## Cumulative distribution function --- .bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[ `\(P\)` be a probability distribution over `\(\mathbb{R}\)` (endowed with the Borelian `\(\sigma\)`-algebra) The cumulative distribution function (CDF) `\(F\)` is a function from `\(\mathbb{R}\)` to `\([0,1]\)` `$$F(x) = P(-\infty, x)$$` ] --- A real-valued sample defines a probability distribution, called the _empirical distribution_. Each sample point is assigned a probability equal to its relative frequency. The cumulative distribution function associated with this empirical distribution is called the _empirical cumulative distribution function_ (ECDF). --- Let `\(x_1, \ldots, x_n\)` denote the points from a real sample (they may not be pairwise distincts), let `\(P_n\)` be the empirical distribution and `\(F_n\)` be the ECDF. `$$\begin{array}{rl} P_n(A) & = \frac{1}{n} \sum_{i=1}^n \mathbb{I}_{x_i \in A} \\ F_n(z) & = \frac{1}{n} \sum_{i=1}^n \mathbb{I}_{x_i \leq z} \end{array}$$` --- Let `\(x_{1:n} \leq x_{2:n} \leq \ldots \leq x_{n:n}\)` denote the _order statistics_ of the sample (that is the non-decreasing rearrangement of sample, with possible repetitions). The ECDF is easily computed from the order statistics: `$$F_n(z) = \begin{cases} 0 & \text{if } z < x_{1:n} \\ \frac{k}{n} & \text{if } x_{k:n} \leq z < x_{k+1:n}\\ 1 & \text{if } x_{n:n} \leq z \end{cases}$$` --- Recall that ECDFs are determined by order statistics (and conversely) .bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[ ### Proposition The empirical cumulative distribution function `\(F_n\)` is - right-continuous over `\(\mathbb{R}\)` - non-decreasing - takes values in `\(\{ k/n , k \in \{0, 1, \ldots, n\}\}\)` - is constant between consecutive order statistics ] --- ### Example Any numerical dataset may be used to illustrate ECDFs. For example, `faithful`: a data frame with `272` observations on `2` variables concerning waiting time between eruptions and the duration of the eruption for the _Old Faithful_ geyser in Yellowstone National Park, Wyoming, USA. - `eruptions`: `numeric`, Eruption time in minutess - `wating`: `numeric` Waiting time to next eruption (in minutes) > A closer look at `faithful$eruptions` reveals that these are heavily rounded times originally in seconds, where multiples of 5 are more frequent than expected under non-human measurement. --- ```r data(faithful) faithful %>% glimpse() ``` ``` ## Rows: 272 ## Columns: 2 ## $ eruptions <dbl> 3.600, 1.800, 3.333, 2.283, 4.533, 2.883, 4.700, 3.600, 1.95… ## $ waiting <dbl> 79, 54, 74, 62, 85, 55, 88, 85, 51, 85, 54, 84, 78, 47, 83, … ``` --- ### Plotting an empirical CDF .fl.w-third.pa2[ Eruptions durations Under the hood, `ggplot` combines `geom_step` and `ecdf` statistic ```r faithful %>% ggplot(mapping=aes(x=eruptions)) + * geom_step(stat="ecdf") + ylab("ECDF") + xlab("Eruptions duration in minutes") + ggtitle("Old Faithful Geyser Data") ``` ] .fl.w-two-thirds.pa2[  ] --- exclude: true ### Computing the ECDF statistic .fl.w-third.pa2[ ECDF plots may be used to compare two samples or to compare an empirical cumulative distribution function with the cumulative distribution function of some well-known probability distribution. ```r *Fn <- ecdf(faithful$eruptions) x_grid <- seq(min(faithful$eruptions)-1, max(faithful$eruptions)+1, 0.01) df <- data_frame(x=x_grid, Fn=Fn(x_grid)) ``` ``` ## Warning: `data_frame()` was deprecated in tibble 1.1.0. ## Please use `tibble()` instead. ``` ```r df %>% ggplot(mapping=aes(x=x, y=Fn)) + geom_step(direction="hv") ``` ] .fl.w-two-thirds.pa2[  ] --- class: center, middle, inverse ## Quantile function --- exclude: true .bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[ The _quantile function_ of a probability distribution over `\(\mathbb{R}\)` is the generalized, left-continuous inverse function of the cumulative distribution function. For `\(p \in (0,1)\)` `$$F_n^{\leftarrow}(p) = \inf \Big\{ z : F_n(z) \geq p\Big\}$$` `$$F_n^{\leftarrow}(p) = X_{k+1:n} \qquad \text{if } \qquad\frac{k}{n} < p \leq \frac{k+1}{n}$$` ] --- .fl.w-third.pa2[ In order to plot the quantile function, it is almost enough to exchange the role of the axes in the ECDF plot. ```r faithful[1:10,] %>% ggplot(mapping=aes(x=eruptions)) + geom_step(stat="ecdf", direction='vh') + coord_flip() + xlab("Quantiles") + ylab("Probabilities") + ggtitle("Old Faithful Geyser Data") ``` Why is it safe to set the `direction` argument in the call to `geom_step()` ] .fl.w-two-thirds.pa2[  ] --- exclude: true ### Comparing sample quantiles with distributional quantiles
When it comes to comparing the empirical distributions defined by two samples, _quantile-quantile plots_ ( `qqplot` ) are favored. - `\(F_n\)` and `\(G_m\)` : two ECDFs - the quantile-quantile function is defined by `$$G_m^\leftarrow \circ F_n(z) = Y_{k+1:m}\qquad \text{if} \qquad \frac{k}{m}< F_n(z) \leq \frac{k+1}{m}$$` --- If `\(G_m = F_n\)` (if the two samples are equal up to a permutation), then `$$F_n^\leftarrow \circ F_n(z) = X_{k+1:n} \text{if} \qquad X_{k:n} < z \leq X_{k+1:n}$$` The quantile-quantile plot lies above line `\(y=x\)` and meets this line at every unique order statistics.
The departure of the quantile-quantile funcion from identity reflects the differences between the two empirical distribution functions. We can for example compare the first and second half of the `faithfull` dataset. --- ### Defining an empirical quantile function .pull-left[ Function `equantile()` returns a function The returned function is _parametrized_ by argument `sample` .small[ - A function captures (encloses) the _environment_ in which it is created - Function `equantile` creates a new _execution environment_ every time it is run. This environment is usually ephemeral, but here it becomes the _enclosing environment_ of the manufactured function - The _enclosing environment_ of the manufactured function is an _execution environment_ of the function factory ] ] .pull-right[ #### A quantile function factory ```r equantile <- function(sample){ sample <- sort(sample) m <- length(sample) * function(p){ * ks <- pmax(1, ceiling(m * p)) * sample[ks] } } ``` See [Advanced R programming](https://adv-r.hadley.nz/function-factories.html) ] ] --- .fl.w-third.pa2[ ```r first <- faithful[1:135,] second <- faithful[-(1:135),] Fn <- ecdf(first$eruptions) Qm <- equantile(second$eruptions) df <- data.frame(x= seq(min(first$eruptions)-1, max(first$eruptions)+1, 0.01)) %>% dplyr::mutate(y = Qm(Fn(x))) df %>% ggplot(mapping = aes(x=x, y=y)) + geom_step(direction='vh') + geom_abline(slope=1, intercept = 0, linetype='dashed') ``` ] .fl.w-two-thirds.pa2[  ] --- We can repeat this experiment after sample shuffling. .fl.w-third.pa2[ ```r unfaithful <- faithful[sample(seq(1,nrow(faithful)), size=nrow(faithful), replace = FALSE), ] first <- unfaithful[1:135,] second <- unfaithful[-(1:135),] Fn <- ecdf(first$eruptions) Qm <- equantile(second$eruptions) df <- data.frame(x= seq(min(first$eruptions)-1, max(first$eruptions)+1, 0.01)) %>% dplyr::mutate(y = Qm(Fn(x))) df %>% ggplot(mapping = aes(x=x, y=y)) + geom_step(direction='vh') + geom_abline(slope=1, intercept = 0, linetype='dashed') ``` ] .fl.w-two-thirds.pa2[  ] --- exclude: true class: inverse, center, middle ## Inequalities --- exclude: true ### Why and what - Quantifying income inequalities - Quantifying market share concentration --- exclude: true ### How Normalizing income distribution Non-decreasing rearrangement --- exclude: true ### Lorenz curve ```r ineq::Lc function (x, n = rep(1, length(x)), plot = FALSE) { ina <- !is.na(x) n <- n[ina] x <- as.numeric(x)[ina] k <- length(x) o <- order(x) x <- x[o] n <- n[o] x <- n * x p <- cumsum(n)/sum(n) L <- cumsum(x)/sum(x) p <- c(0, p) L <- c(0, L) L2 <- L * mean(x)/mean(n) Lc <- list(p, L, L2) names(Lc) <- c("p", "L", "L.general") class(Lc) <- "Lc" if (plot) plot(Lc) Lc } ``` --- exclude: true ### Indices and Schur monotonicity - Gini - Atkinson - Decile ratios --- class: middle, center, inverse background-image: url('./img/pexels-cottonbro-3171837.jpg') background-size: 112% # The End