name: inter-slide class: left, middle, inverse {{ content }} --- name: layout-general layout: true class: left, middle <style> .remark-slide-number { position: inherit; } .remark-slide-number .progress-bar-container { position: absolute; bottom: 0; height: 4px; display: block; left: 0; right: 0; } .remark-slide-number .progress-bar { height: 100%; background-color: red; } </style>
--- class: middle, left, inverse # Exploratory Data Analysis : ### 2022-02-02 #### [Master I MIDS & EDA]() #### [Analyse Exploratoire des Données](http://stephane-v-boucheron.fr/courses/isidata/) #### [Stéphane Boucheron](http://stephane-v-boucheron.fr) --- exclude: true template: inter-slide # Univariate Statistics ### 2022-02-02 #### [Master I MIDS & EDA]() #### [Analyse Exploratoire des Données](http://stephane-v-boucheron.fr/courses/isidata/) #### [Stéphane Boucheron](http://stephane-v-boucheron.fr) --- template: inter-slide ##
### [Samples, observations an variables ](#samples) ### [Summary statistics](#numerical-summaries) ### [Robust summary statistics](#robust) ### [Boxplots](#boxplots) ### [Histograms](#histograms) ### [Cumulative distribution function](#ecdf) ### [Quantile function](#quantile-function) --- template: inter-slide name: samples ## Samples, observations an variables --- ###
`flights`dataset from `nycflights` package .fl.w-50.pa2[ ```r pacman::p_load("nycflights13") ``` > On-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013. `flights` is a multivariate _sample_ Each `row` corresponds to an _observation_ Each column corresponds to a _measurement_ ] .fl.w-50.pa2.f6[ ``` Columns: 19 $ year <int> 2013,… $ month <int> 1, 1,… $ day <int> 1, 1,… $ dep_time <int> 517, … $ sched_dep_time <int> 515, … $ dep_delay <dbl> 2, 4,… $ arr_time <int> 830, … $ sched_arr_time <int> 819, … $ arr_delay <dbl> 11, 2… $ carrier <chr> "UA",… $ flight <int> 1545,… $ tailnum <chr> "N142… $ origin <chr> "EWR"… $ dest <chr> "IAH"… $ air_time <dbl> 227, … $ distance <dbl> 1400,… $ hour <dbl> 5, 5,… $ minute <dbl> 15, 2… $ time_hour <dttm> 2013… ``` ] --- ### Definition(s) A _sample_ is a sequence of items of the same type In
, if the type matches a basetype, a sample is typically represented using a `vector` A _numerical_ (quantitative) sample is just a sequence of numbers (either integers or floats), it is represented by a `numeric` vector. A _categorical_ (qualitative) sample is a sequence of values from some predefined finite set In
, categorical samples are handled using a special class: `factor` --- In this lesson, we are concerned with _univariate statistics_, that is either with - _numerical samples_, - _categorical samples_ taken from some finite (or seldom countable) set
The boundary between qualitative and quantitative is not always straightforward --- ### `flights` dataset - Columns `carrier`, `dest`, `origin` are considered categorical ```r levels(factor(flights$origin)) ``` ``` ## [1] "EWR" "JFK" "LGA" ``` - Columns `arr_delay`, `dep_delay`, distance are numerical - Column `time_hour`, ... contains timestamps --- We will learn about: - numerical summaries - graphical displays for univariate samples --- template: inter-slide name: numerical-summaries ## Summary statistics --- ### Numerical summaries for quantitative variables - __Mean__: `$$\overline{X}_n = \frac{1}{n}\sum_{i=1}^n x_i \qquad \textbf{(location)}$$` - **Standard deviation**: `$$s^2 = \frac{1}{n-1} \sum_{i=1}^n \big(x_i - \overline{X}_n\big)^2 \quad \text{or}\quad s^2 = \frac{1}{n} \sum_{i=1}^n \big(x_i - \overline{X}_n\big)^2\qquad\textbf{(dispersion)}$$` --- ### Beyond location et dispersion - **Skewness** `$$\frac{\frac{1}{n} \sum_{i=1}^n \big(x_i - \overline{X}_n\big)^3}{s^3}$$` - **Kurtosis** `$$\kappa = \frac{\frac{1}{n} \sum_{i=1}^n \big(x_i -\overline{X}_n\big)^4}{s^4} - 2$$` --- ### Numerical summaries at work ```r s <- rnorm(1000) # a random sample from the standard normal distribution *mean(s) - sum(s)/length(s) ``` ``` ## [1] 0 ``` ```r *sd(s) - sqrt(sum((s-mean(s))^2)/(length(s)-1)) ``` ``` ## [1] 0 ``` ```r *mean((s- mean(s))^4)/sd(s)^4 -3 ``` ``` ## [1] 0.06372208 ``` --- ### Summarizing `arr_delay` .fl.w-40.pa2.f6[ ### One at a time ```r mean(flights$arr_delay) ``` ``` ## [1] NA ``` ```r *mean(flights$arr_delay, na.rm=T) ``` ``` ## [1] 6.895377 ``` ```r sd(flights$arr_delay) ``` ``` ## [1] NA ``` ```r *sd(flights$arr_delay, na.rm=T) ``` ``` ## [1] 44.63329 ``` ] .fl.w-60.pa2.f6[ ### All inclusive ```r *summary(flights$arr_delay) %>% broom::tidy() %>% knitr::kable(digits = 2, caption = "Summary statistics for nycflights13 arrival delay") ``` <table> <caption>Summary statistics for nycflights13 arrival delay</caption> <thead> <tr> <th style="text-align:right;"> minimum </th> <th style="text-align:right;"> q1 </th> <th style="text-align:right;"> median </th> <th style="text-align:right;"> mean </th> <th style="text-align:right;"> q3 </th> <th style="text-align:right;"> maximum </th> <th style="text-align:right;"> na </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> -86 </td> <td style="text-align:right;"> -17 </td> <td style="text-align:right;"> -5 </td> <td style="text-align:right;"> 6.9 </td> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> 1272 </td> <td style="text-align:right;"> 9430 </td> </tr> </tbody> </table> ```r sum_funs <- list(mean, sd, median, IQR) *purrr::map_dbl(sum_funs, * ~ .x( flights$arr_delay, na.rm=T)) ``` ``` ## [1] 6.895377 44.633292 -5.000000 31.000000 ```
NAs matter! ] --- ### With `dplyr::summarise` ```r flights %>% select(- year, -month, -day, -hour, -minute , -time_hour, -ends_with("time"), -flight) %>% * summarise(across(where(is.numeric), * .fns=list("mean"=mean, "median"=median , "IQR"=IQR), * na.rm=T, * .names="{.col}_{.fn}")) %>% pivot_longer(cols = everything()) %>% knitr::kable(digits=2) ``` <table> <thead> <tr> <th style="text-align:left;"> name </th> <th style="text-align:right;"> value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> dep_delay_mean </td> <td style="text-align:right;"> 12.64 </td> </tr> <tr> <td style="text-align:left;"> dep_delay_median </td> <td style="text-align:right;"> -2.00 </td> </tr> <tr> <td style="text-align:left;"> dep_delay_IQR </td> <td style="text-align:right;"> 16.00 </td> </tr> <tr> <td style="text-align:left;"> arr_delay_mean </td> <td style="text-align:right;"> 6.90 </td> </tr> <tr> <td style="text-align:left;"> arr_delay_median </td> <td style="text-align:right;"> -5.00 </td> </tr> <tr> <td style="text-align:left;"> arr_delay_IQR </td> <td style="text-align:right;"> 31.00 </td> </tr> <tr> <td style="text-align:left;"> distance_mean </td> <td style="text-align:right;"> 1039.91 </td> </tr> <tr> <td style="text-align:left;"> distance_median </td> <td style="text-align:right;"> 872.00 </td> </tr> <tr> <td style="text-align:left;"> distance_IQR </td> <td style="text-align:right;"> 887.00 </td> </tr> </tbody> </table> --- exclude: true ### What if some data are missing? ```r s <- rnorm(10) s[c(2, 3, 7)] = NA_real_ s ``` ``` ## [1] -0.33618433 NA NA -0.07696039 -0.90840922 -0.68383652 ## [7] NA -0.72035314 1.39659206 -0.50888751 ``` ```r *mean(s) ``` ``` ## [1] NA ``` ```r *mean(s, na.rm = TRUE) ``` ``` ## [1] -0.262577 ``` ```r mean(s[-c(2, 3, 7)]) ``` ``` ## [1] -0.262577 ``` --- ### Missing values and trimming The code for `mean.default` goes beyond computing the weighted mean .f6[ ```r function (x, trim = 0, na.rm = FALSE, ...) { if (!is.numeric(x) && !is.complex(x) && !is.logical(x)) { warning("argument is not numeric or logical: returning NA") return(NA_real_) } * if (na.rm) * x <- x[!is.na(x)] if (!is.numeric(trim) || length(trim) != 1L) stop("'trim' must be numeric of length one") n <- length(x) if (trim > 0 && n) { if (is.complex(x)) stop("trimmed means are not defined for complex data") if (anyNA(x)) return(NA_real_) if (trim >= 0.5) return(stats::median(x, na.rm = FALSE)) * lo <- floor(n * trim) + 1 * hi <- n + 1 - lo * x <- sort.int(x, partial = unique(c(lo, hi)))[lo:hi] } * .Internal(mean(x)) } ``` ] --- ###
- What happens when `na.rm` is `TRUE`? - What are the possible values of argument `trim`? - If `trim` belongs to `\((0, 1/2)\)` what happens? - Which lines actually compute the (uniformly) weighted mean? ??? - The first `if (...) {...}` is an example of defensive programming - `if (na.rm) { ... }` handles missing data when the default behavior is overriden. `NA` are filtered out - When positive, parameter `trim` allows us to compute _trimmed means_ Depending on optional arguments, fucntion `mean` allows to compute either arithmetic means, or trimmed (arithmetic) means --- ###
- In a _tidy_ sample, _missing values_ are explicitely pointed as missing information using special objects `NA`, `NA_integer_`, `NA_real_`, ... - To compute summary statistics when facing missing values, `na.rm` has to be `TRUE`. - What is the class of `NA`, `NA_integer_`, `NA_real_`? --- ###
Why is `var(sample)` defined by `$$\sum_{i=1}^n \frac{1}{n-1} (x_i - \overline{X}_n)^2$$` when `\(\texttt{sample}=x_1, \ldots, x_n\)`? --- template: inter-slide name: robust ## Robust summary statistics --- exclude: true ### `summary(...)` ```r summary(s) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's ## -0.9084 -0.7021 -0.5089 -0.2626 -0.2066 1.3966 3 ``` --- - **Order statistics**: `\(x_{1:n} \leq x_{2:n} \leq \ldots \leq x_{n:n}\)` - **Median** `\(x_{\lfloor n/2\rfloor+1:n}\)` if `\(n\)` is odd, `\((x_{n/2:n}+ x_{n/2+1:n})/2\)` if `\(n\)` is even - **Quartiles** `\(x_{\lfloor n/4\rfloor:n}\)` and `\(x_{\lfloor 3 n/4\rfloor:n}\)` - **Quintiles** `\(x_{\lfloor k n/5\rfloor:n}\)` pour `\(k \in 1, \ldots, 4\)` - **Deciles** `\(x_{\lfloor k n/10\rfloor:n}\)` pour `\(k \in 1, \ldots, 9\)` - **Percentiles** - ... --- ###
Compute the derivative of - the mean - the median - the trimmed mean with respect to `\(x_{(n)}\)` --- exclude: true Why _robustness_? .bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[ 'Est etincidunt non adipisci quaerat sed. Ipsum adipisci numquam neque ipsum modi neque modi. Amet velit tempora aliquam etincidunt. Quaerat neque labore velit dolorem dolore eius ipsum. Modi sed dolorem aliquam dolorem non neque. Dolor velit porro etincidunt magnam aliquam labore. Sed quisquam dolore tempora dolor ipsum sit est. Porro numquam adipisci aliquam. Amet labore etincidunt dolore porro.' ] --- exclude: true Let `\(x_1, \ldots, x_n\)` be a numerical sample. ```r rapportools::kurtosis ``` ``` ## function (x, na.rm = TRUE) ## { ## if (is.variable(x)) { ## if (na.rm) ## x <- na.omit(x) ## m <- base::mean(x) ## s <- stats::sd(x) ## n <- length(x) ## (((base::sum((x - m)^4)/n)/s^4) - 3) ## } ## else { ## if (is.matrix(x)) ## apply(x, 2, kurtosis, na.rm = na.rm) ## else if (is.data.frame(x)) ## sapply(x, kurtosis, na.rm = na.rm) ## else stop("unsupported type") ## } ## } ## <bytecode: 0x7fd33d41fa00> ## <environment: namespace:rapportools> ``` Check that the kurtosis and the skewness are translation invariant. --- template: inter-slide name: boxplots ## Boxplots --- .bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[ In descriptive statistics, a box plot or boxplot is a method for graphically depicting groups of numerical data through their _quantiles_ (more specifically _quartiles_) Box plots also have lines extending from the boxes (__whiskers__) indicating variability outside the upper and lower quartiles, hence the terms _box-and-whisker plot_ and box-and-whisker diagram _Outliers_ may be plotted as individual points Box plots are _non-parametric_: they display variation in samples of a statistical population without making any assumptions of the underlying statistical distribution (though Tukey's boxplot assumes symmetry for the whiskers and normality for their length) The spacings between the different parts of the box indicate the degree of _dispersion_ (spread) and _skewness_ in the data, and show _outliers_ .tr[ [Wikipedia](https://en.wikipedia.org/wiki/Box_plot) ] ] --- > Tukey's boxplot assumes symmetry for the whiskers and normality for their length What does this mean? Assume that we _boxplot_ the `\(\mathcal{N}(\mu, \sigma^2) \sim \mu + \sigma \mathcal{N}(0, 1)\)` distribution instead of an empirical distribution. Then: - The median is equal to `\(\mu\)` because of the symmetry of `\(\mathcal{N}(0,1)\)`, - The quartiles are `\(\mu - \sigma \Phi^{\leftarrow}(3/4)\)` and `\(\mu + \sigma \Phi^{\leftarrow}(3/4)\)`. - The IQR is `\(2 \sigma \Phi^{\leftarrow}(3/4) \approx 1.35\times \sigma\)` - The default length of the whiskers (under `geom_boxplot`) is `\(1.5 \times \text{IQR}\)` - For a Gaussian population, `\(1.5 \text{IQR} = 3 \sigma \Phi^{\leftarrow}(3/4)\)` - In a Gaussian population, the probability that a sample point lies outside the whiskers is `\(2 \times (1 - \Phi(4 \times \Phi^{\leftarrow}(3/4))) \approx 7/1000\)` --- class: top, left ### Boxplot of a Gaussian sample ```r n <- 100 fake_data <- tibble(normal=rnorm(n),t=rt(n, df=3)) title <- stringr::str_c("Standard Gaussian sample, size : ", n) fake_data %>% ggplot(mapping = aes(x=1L, y=normal)) + * geom_boxplot(outlier.colour = "red", outlier.shape = 2, coef=1.5) + ggtitle(title) + ylab("") + ggplot2::theme( axis.text.x = element_blank(), axis.ticks.x = element_blank() ) ``` ```r summary(fake_data$normal) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## -2.10528 -0.67292 -0.20090 -0.07495 0.43615 2.29645 ``` .plot-callout[ ![gauss-boxplot-label](cm-2-EDA_files/figure-html/normal-boxplot-label-1.png) ] --- ### Boxplot of a Gaussian sample .fl.w-60.pa2[ ![gauss-boxplot-label](cm-2-EDA_files/figure-html/normal-boxplot-label-1.png) ] ??? Playing with boxplots - Tune whiskers lengths using `coef` - Jitter to spot the whole --- If we _boxplot_ a Gaussian sample, on average, `\(4\%\)` of the points lie outside the whiskers. Such points are called _extremes_ Deciding which points should be considered as extremes is partly a matter of taste Boxplots provide a visual assessment of the departure of a sample from normality/Gaussianity [`geom_boxplot` documentation](https://ggplot2.tidyverse.org/reference/geom_boxplot.html) ??? Point out the distinction between outliers and extremes --- ### Boxplot of a non-Gaussian sample The sampling distribution ( `\(t\)`-distribution with `\(3\)` degrees of freedom) is _heavy-tailed_ Some sample points lie far away from the whiskers ```r title <- stringr::str_c("Student sample with 3 df, size : ", n) ggplot(fake_data, aes(y=t, x=1L)) + * geom_boxplot(outlier.shape=2, outlier.colour = 'red', outlier.alpha=.5, coef=1.5) + ylab("") + ggtitle(title) + ggplot2::theme( axis.text.x = element_blank(), axis.ticks.x = element_blank() ) ``` ```r summary(fake_data$t) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## -5.629301 -0.736732 -0.005991 0.038924 0.841745 5.314943 ``` .plot-callout[ ![non-gauss-boxplot-label](cm-2-EDA_files/figure-html/non-gauss-boxplot-label-1.png) ] --- ### Boxplot of a non-Gaussian sample .fl.w-60.pa2[ ![non-gauss-boxplot-label](cm-2-EDA_files/figure-html/non-gauss-boxplot-label-1.png) ] --- ### Plotting two boxplots side by side .fl.w-50.pa2[ ```r n <- 1000 tibble(normal=rnorm(n), t=rt(n, df=3)) %>% * pivot_longer(cols=c(normal, t), * names_to='Distribution') %>% * ggplot() + * aes(x=Distribution, y=value) + * geom_boxplot(outlier.alpha = .5, outlier.colour = 'red', outlier.shape = 2) + ggtitle("Gaussian versus heavy-tailed samples") ``` ] .fl.w-50.pa2[ <img src="cm-2-EDA_files/figure-html/two-boxplots-out-1.png" width="504" /> ] --- ### Boxplots for normal and heavy-tailed distributions With usual whiskers length - In Gaussian samples: + very few sample points outside the whiskers (few extremes/outliers), + extremes lie close to the whiskers endpoints - In heavy-tailed distributions: + many sample points outside the whiskers, and + some extremes may lie far far away --- exclude: true ### Boxplotting Titanic Fare prices .fl.w-60.pa2[ ``` ## Warning: Removed 1 rows containing non-finite values (stat_boxplot). ``` <img src="cm-2-EDA_files/figure-html/titanic-fares-boxplot-label-1.png" width="504" /> ] .fl.w-40.pa2[ Titanic fare prices distribution exhibits a non-Gaussian behavior
The median looks much closer to the first quartile than to the third quartile (asymmetry) There are many extreme points Extreme lie far away above the whiskers ] --- ### Boxplots for delays .fl.w-40.pa2.f6[ ```r flights %>% select( ends_with("_delay") ) %>% pivot_longer( cols=ends_with("_delay"), names_pattern = "(.*)_delay", values_to = "delay" ) %>% filter( abs(delay) < 120 ) %>% ggplot() + aes(x=name, y=delay) + geom_boxplot() ``` ] .fl.w-60.pa2[ <img src="cm-2-EDA_files/figure-html/boxplots-delays-out-1.png" width="504" /> ] --- exclude: true template: inter-slide name: violinplots ## Violinplots --- exclude: true .fl.w-60.pa2[ ``` ## Warning: Removed 1 rows containing non-finite values (stat_ydensity). ``` <img src="cm-2-EDA_files/figure-html/titanic-fares-violinplot-label-1.png" width="504" /> ] .fl.w-40.pa2[ Titanic fare prices distribution exhibits a non-Gaussian behavior
They are a mix of three populations that also exhibit non-Gaussian behavior ] --- exclude: true .bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[ 'Est etincidunt non adipisci quaerat sed. Ipsum adipisci numquam neque ipsum modi neque modi. Amet velit tempora aliquam etincidunt. Quaerat neque labore velit dolorem dolore eius ipsum. Modi sed dolorem aliquam dolorem non neque. Dolor velit porro etincidunt magnam aliquam labore. Sed quisquam dolore tempora dolor ipsum sit est. Porro numquam adipisci aliquam. Amet labore etincidunt dolore porro.' ] --- template: inter-slide name: histograms ## Histograms --- ### Definition(s) .fl.w-40.pa2[
A histogram defines a piecewise constant probability density function A histogram is defined by the number and position of the breaks (bins) A histogram may be regular or not ] .fl.w-60.pa2[ <img src="cm-2-EDA_files/figure-html/histogram-gauss-1.png" width="504" /> ] --- count: false .panel1-histogram-gauss-auto[ ```r *tibble(x=rnorm(1000, mean = 2)) ``` ] .panel2-histogram-gauss-auto[ ``` ## # A tibble: 1,000 × 1 ## x ## <dbl> ## 1 1.63 ## 2 3.34 ## 3 2.17 ## 4 1.69 ## 5 2.23 ## 6 1.73 ## 7 2.74 ## 8 2.95 ## 9 2.70 ## 10 1.97 ## # … with 990 more rows ``` ] --- count: false .panel1-histogram-gauss-auto[ ```r tibble(x=rnorm(1000, mean = 2)) %>% * ggplot() ``` ] .panel2-histogram-gauss-auto[ <img src="cm-2-EDA_files/figure-html/histogram-gauss_auto_02_output-1.png" width="504" /> ] --- count: false .panel1-histogram-gauss-auto[ ```r tibble(x=rnorm(1000, mean = 2)) %>% ggplot() + * aes(x=x) ``` ] .panel2-histogram-gauss-auto[ <img src="cm-2-EDA_files/figure-html/histogram-gauss_auto_03_output-1.png" width="504" /> ] --- count: false .panel1-histogram-gauss-auto[ ```r tibble(x=rnorm(1000, mean = 2)) %>% ggplot() + aes(x=x) + * geom_rug(alpha=.1) ``` ] .panel2-histogram-gauss-auto[ <img src="cm-2-EDA_files/figure-html/histogram-gauss_auto_04_output-1.png" width="504" /> ] --- count: false .panel1-histogram-gauss-auto[ ```r tibble(x=rnorm(1000, mean = 2)) %>% ggplot() + aes(x=x) + geom_rug(alpha=.1) + * geom_histogram(aes(y=..density..), * alpha=.5, * fill="white", * colour="black", * bins=30) ``` ] .panel2-histogram-gauss-auto[ <img src="cm-2-EDA_files/figure-html/histogram-gauss_auto_05_output-1.png" width="504" /> ] --- count: false .panel1-histogram-gauss-auto[ ```r tibble(x=rnorm(1000, mean = 2)) %>% ggplot() + aes(x=x) + geom_rug(alpha=.1) + geom_histogram(aes(y=..density..), alpha=.5, fill="white", colour="black", bins=30) + * stat_function(linetype="dashed", * fun=dnorm, * args=list(mean=2, sd=1)) ``` ] .panel2-histogram-gauss-auto[ <img src="cm-2-EDA_files/figure-html/histogram-gauss_auto_06_output-1.png" width="504" /> ] --- count: false .panel1-histogram-gauss-auto[ ```r tibble(x=rnorm(1000, mean = 2)) %>% ggplot() + aes(x=x) + geom_rug(alpha=.1) + geom_histogram(aes(y=..density..), alpha=.5, fill="white", colour="black", bins=30) + stat_function(linetype="dashed", fun=dnorm, args=list(mean=2, sd=1)) + * annotate("text", x=4.5, y=.3, * label="Gaussian density with mean 2") ``` ] .panel2-histogram-gauss-auto[ <img src="cm-2-EDA_files/figure-html/histogram-gauss_auto_07_output-1.png" width="504" /> ] --- count: false .panel1-histogram-gauss-auto[ ```r tibble(x=rnorm(1000, mean = 2)) %>% ggplot() + aes(x=x) + geom_rug(alpha=.1) + geom_histogram(aes(y=..density..), alpha=.5, fill="white", colour="black", bins=30) + stat_function(linetype="dashed", fun=dnorm, args=list(mean=2, sd=1)) + annotate("text", x=4.5, y=.3, label="Gaussian density with mean 2") + * ggtitle("Histogram for Gaussian sample") ``` ] .panel2-histogram-gauss-auto[ <img src="cm-2-EDA_files/figure-html/histogram-gauss_auto_08_output-1.png" width="504" /> ] --- count: false .panel1-histogram-gauss-auto[ ```r tibble(x=rnorm(1000, mean = 2)) %>% ggplot() + aes(x=x) + geom_rug(alpha=.1) + geom_histogram(aes(y=..density..), alpha=.5, fill="white", colour="black", bins=30) + stat_function(linetype="dashed", fun=dnorm, args=list(mean=2, sd=1)) + annotate("text", x=4.5, y=.3, label="Gaussian density with mean 2") + ggtitle("Histogram for Gaussian sample") + * labs(caption="Sample size=1000, 30 bins") ``` ] .panel2-histogram-gauss-auto[ <img src="cm-2-EDA_files/figure-html/histogram-gauss_auto_09_output-1.png" width="504" /> ] <style> .panel1-histogram-gauss-auto { color: black; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-histogram-gauss-auto { color: black; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-histogram-gauss-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- ### Construction - Binning the sample values + Picking the right (?) number of bins + Placing the breaks - --- ### Histograms for (trimmed) delay distributions .fl.w-40.pa2.f6[ ```r flights %>% select( ends_with("_delay") ) %>% pivot_longer( cols=ends_with("_delay"), names_pattern = "(.*)_delay", values_to = "delay" ) %>% filter( abs(delay) < 120 ) %>% ggplot() + * aes(x=delay, * y=..density.., * fill=name) + * geom_histogram(alpha=.3, * bins=30) ``` ] .fl.w-60.pa2.f6[ <img src="cm-2-EDA_files/figure-html/histograms-arr_delay-out-1.png" width="504" /> ] --- exclude: true .bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[ 'Est etincidunt non adipisci quaerat sed. Ipsum adipisci numquam neque ipsum modi neque modi. Amet velit tempora aliquam etincidunt. Quaerat neque labore velit dolorem dolore eius ipsum. Modi sed dolorem aliquam dolorem non neque. Dolor velit porro etincidunt magnam aliquam labore. Sed quisquam dolore tempora dolor ipsum sit est. Porro numquam adipisci aliquam. Amet labore etincidunt dolore porro.' ] --- template: inter-slide name: ecdf ## Cumulative distribution function --- .bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[ `\(P\)` be a probability distribution over `\(\mathbb{R}\)` (endowed with the Borelian `\(\sigma\)`-algebra) The cumulative distribution function (CDF) `\(F\)` is a function from `\(\mathbb{R}\)` to `\([0,1]\)` `$$F(x) = P(-\infty, x)$$` ] --- A real-valued sample defines a probability distribution, called the _empirical distribution_. Each sample point is assigned a probability equal to its relative frequency. The cumulative distribution function associated with this empirical distribution is called the _empirical cumulative distribution function_ (ECDF). --- Let `\(x_1, \ldots, x_n\)` denote the points from a real sample (they may not be pairwise distincts), let `\(P_n\)` be the empirical distribution and `\(F_n\)` be the ECDF. `$$\begin{array}{rl} P_n(A) & = \frac{1}{n} \sum_{i=1}^n \mathbb{I}_{x_i \in A} \\ F_n(z) & = \frac{1}{n} \sum_{i=1}^n \mathbb{I}_{x_i \leq z} \end{array}$$` --- Let `\(x_{1:n} \leq x_{2:n} \leq \ldots \leq x_{n:n}\)` denote the _order statistics_ of the sample (that is the non-decreasing rearrangement of sample, with possible repetitions). The ECDF is easily computed from the order statistics: `$$F_n(z) = \begin{cases} 0 & \text{if } z < x_{1:n} \\ \frac{k}{n} & \text{if } x_{k:n} \leq z < x_{k+1:n}\\ 1 & \text{if } x_{n:n} \leq z \end{cases}$$` --- Recall that ECDFs are determined by order statistics (and conversely) .bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[ ### Proposition The empirical cumulative distribution function `\(F_n\)` is - right-continuous over `\(\mathbb{R}\)` - non-decreasing - takes values in `\(\{ k/n , k \in \{0, 1, \ldots, n\}\}\)` - is constant between consecutive order statistics ] --- ### Example Any numerical dataset may be used to illustrate ECDFs. For example, `faithful`: a data frame with `272` observations on `2` variables concerning waiting time between eruptions and the duration of the eruption for the _Old Faithful_ geyser in Yellowstone National Park, Wyoming, USA. - `eruptions`: `numeric`, Eruption time in minutess - `wating`: `numeric` Waiting time to next eruption (in minutes) > A closer look at `faithful$eruptions` reveals that these are heavily rounded times originally in seconds, where multiples of 5 are more frequent than expected under non-human measurement. --- ### A glimpse at the faithfull data ```r data(faithful) faithful %>% glimpse() ``` ``` ## Rows: 272 ## Columns: 2 ## $ eruptions <dbl> 3.600, 1.800, 3.333, 2.283, 4.533, 2.883, 4.700, 3.600, 1.95… ## $ waiting <dbl> 79, 54, 74, 62, 85, 55, 88, 85, 51, 85, 54, 84, 78, 47, 83, … ``` --- ### Plotting an empirical CDF .fl.w-third.pa2[ Eruptions durations Under the hood, `ggplot` combines `geom_step` and `ecdf` statistic ```r faithful %>% ggplot() + aes(x=eruptions) + * geom_step(stat="ecdf") + ylab("ECDF") + xlab("Eruptions duration in minutes") + ggtitle("Old Faithful Geyser Data") ``` ] .fl.w-two-thirds.pa2[ ![](cm-2-EDA_files/figure-html/geyser-label-1.png) ] --- exclude: false ### Computing the ECDF statistic .fl.w-third.pa2[ ECDF plots may be used to compare two samples or to compare an empirical cumulative distribution function with the cumulative distribution function of some well-known probability distribution. ```r *Fn <- ecdf(faithful$eruptions) x_grid <- seq(min(faithful$eruptions)-1, max(faithful$eruptions)+1, 0.01) df <- data_frame(x=x_grid, Fn=Fn(x_grid)) ``` ``` ## Warning: `data_frame()` was deprecated in tibble 1.1.0. ## Please use `tibble()` instead. ## This warning is displayed once every 8 hours. ## Call `lifecycle::last_warnings()` to see where this warning was generated. ``` ```r df %>% ggplot() + aes(x=x, y=Fn) + geom_step(direction="hv") ``` ] .fl.w-two-thirds.pa2[ ![](cm-2-EDA_files/figure-html/geyser-label-2-1.png) ] --- template: inter-slide name: quantile-function ## Quantile function --- exclude: false .bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[ The _quantile function_ of a probability distribution over `\(\mathbb{R}\)` is the generalized, left-continuous inverse function of the cumulative distribution function. For `\(p \in (0,1)\)` `$$F_n^{\leftarrow}(p) = \inf \Big\{ z : F_n(z) \geq p\Big\}$$` `$$F_n^{\leftarrow}(p) = X_{k+1:n} \qquad \text{if } \qquad\frac{k}{n} < p \leq \frac{k+1}{n}$$` ] --- .fl.w-third.pa2[ In order to plot the quantile function, it is almost enough to exchange the role of the axes in the ECDF plot. ```r faithful[1:10,] %>% ggplot() + aes(x=eruptions) + geom_step(stat="ecdf", direction='vh') + * coord_flip() + xlab("Quantiles") + ylab("Probabilities") + ggtitle("Old Faithful Geyser Data") ``` Why is it safe to set the `direction` argument in the call to `geom_step()` ] .fl.w-two-thirds.pa2[ ![](cm-2-EDA_files/figure-html/eruptions-label-1.png) ] --- exclude: false ### Comparing sample quantiles with distributional quantiles
When it comes to comparing the empirical distributions defined by two samples, _quantile-quantile plots_ ( `qqplot` ) are favored. - `\(F_n\)` and `\(G_m\)` : two ECDFs - the quantile-quantile function is defined by `$$G_m^\leftarrow \circ F_n(z) = Y_{k+1:m}\qquad \text{if} \qquad \frac{k}{m}< F_n(z) \leq \frac{k+1}{m}$$` --- If `\(G_m = F_n\)` (if the two samples are equal up to a permutation), then `$$F_n^\leftarrow \circ F_n(z) = X_{k+1:n} \text{if} \qquad X_{k:n} < z \leq X_{k+1:n}$$` The quantile-quantile plot lies above line `\(y=x\)` and meets this line at every unique order statistics.
The departure of the quantile-quantile function from identity reflects the differences between the two empirical distribution functions. We can for example compare the first and second half of the `faithfull` dataset. --- ### Defining an empirical quantile function .pull-left[ Function `equantile()` returns a function The returned function is _parametrized_ by argument `sample` .small[ - A function captures (encloses) the _environment_ in which it is created - Function `equantile` creates a new _execution environment_ every time it is run. This environment is usually ephemeral, but here it becomes the _enclosing environment_ of the manufactured function - The _enclosing environment_ of the manufactured function is an _execution environment_ of the function factory ] ] .pull-right[ #### A quantile function factory ```r equantile <- function(sample){ sample <- sort(sample) m <- length(sample) * function(p){ * ks <- pmax(1, ceiling(m * p)) * sample[ks] } } ``` See [Advanced R programming](https://adv-r.hadley.nz/function-factories.html) ] ] --- ### QQ plot .fl.w-40.pa2.f6[ ```r not_cancelled <- flights %>% filter(!is.na(arr_delay)) %>% sample_n(100) %>% arrange(dep_delay) %>% filter(abs(arr_delay) < 120) nr <- nrow(not_cancelled) not_cancelled %>% ggplot() + aes(x=dep_delay, y=quantile(arr_delay, seq(1, nr)/nr) ) + geom_point(size=.1) + coord_fixed() + geom_abline(slope=1, intercept = 0, alpha=.5) + ggtitle("QQ plot clipped arr_delay versus dep_delay") + ylab("arr_delay") + xlab("dep-delay") ``` ] .fl.w-60.pa2[ <img src="cm-2-EDA_files/figure-html/qq_delays-out-1.png" width="504" /> ] --- exclude: true We can repeat this experiment after sample shuffling. .fl.w-third.pa2[ ```r unfaithful <- faithful[sample(seq(1,nrow(faithful)), size=nrow(faithful), replace = FALSE), ] first <- unfaithful[1:135,] second <- unfaithful[-(1:135),] Fn <- ecdf(first$eruptions) Qm <- equantile(second$eruptions) df <- data.frame(x= seq(min(first$eruptions)-1, max(first$eruptions)+1, 0.01)) %>% dplyr::mutate(y = Qm(Fn(x))) df %>% ggplot(mapping = aes(x=x, y=y)) + geom_step(direction='vh') + geom_abline(slope=1, intercept = 0, linetype='dashed') ``` ] .fl.w-two-thirds.pa2[ ![](cm-2-EDA_files/figure-html/faithful-shuffle-label-1.png) ] --- exclude: true ```r first <- faithful[1:135,] second <- faithful[-(1:135),] Fn <- ecdf(first$eruptions) Qm <- equantile(second$eruptions) df <- data.frame(x= seq(min(first$eruptions)-1, max(first$eruptions)+1, 0.01)) %>% dplyr::mutate(y = Qm(Fn(x))) df %>% ggplot() + aes(x=x, y=y) + geom_step(direction='vh') + geom_abline(slope=1, intercept = 0, linetype='dashed') ``` --- class: middle, center, inverse background-image: url('./img/pexels-cottonbro-3171837.jpg') background-size: cover # The End