name: inter-slide class: left, middle, inverse {{ content }} --- name: layout-general layout: true class: left, middle <style> .remark-slide-number { position: inherit; } .remark-slide-number .progress-bar-container { position: absolute; bottom: 0; height: 4px; display: block; left: 0; right: 0; } .remark-slide-number .progress-bar { height: 100%; background-color: red; } </style>
--- class: middle, left, inverse # Exploratory Data Analysis : Bivariate statistics ### 2021-12-10 #### [Master I MIDS & MFA]() #### [Analyse Exploratoire de Données](http://stephane-v-boucheron.fr/courses/eda/) #### [Stéphane Boucheron](http://stephane-v-boucheron.fr) --- exclude:true template: inter-slide # Bivariate Statistics ### 2021-12-10 #### [Master I MIDS & MFA]() #### [Analyse Exploratoire de Données](http://stephane-v-boucheron.fr/courses/eda/) #### [Stéphane Boucheron](http://stephane-v-boucheron.fr) --- class: inverse, middle ##
### Bivariate samples ### Contingency tables ### From barplots to mosaic plots ### Pearson's statistic ### Linear regression --- class: center, middle, inverse ## Bivariate samples --- ### Definition A bivariate sample is a sequence of couples from `\(\mathcal{X} \times \mathcal{Y}\)` Computationally, a bivariate sample is a two-dimensional array (not a matrix) with `\(n\)` rows and `\(2\)` columns The two columns may be of the same type (`numeric`, `integer`, `character`, `factor`, `date`) or not -- Assume that `\(\mathcal{X} \times \mathcal{Y}\)` may be endowed with a `\(\sigma\)`-algebra and a probability distribution `\(P\)` If we collect an i.i.d. sample from `\(P\)`, we obtain a _bivariate_ sample --- ### Setting I For a health survey, we can repeatedly poll a well-defined human population by picking elements of the population (individuals) uniformly at random For each individual, we measure at wake-up - *blood pressure* - *heart rate* (pulses per minute) We obtain a bivariate sample Both variables are quantitative (and non-negative) --- ### Setting II
Consider the collection of passengers on board RMS Titanic in April 1912 For each passenger, we record class (`Pclass`) and Fate (`Survived/Deceased`). This is again a bivariate sample Both variables are _qualitative_/categorical Finally, consider again the `Titanic` dataset, but record class (`Pclass`) and fare (`Fare`). One variable is _qualitative_, the other _quantitative._ --- ### Convention When we deal with a generic bivariate sample, we denote by `\(X\)` the first coordinate, and `\(Y\)` the second coordinate If `\((x,y) \in \mathcal{X} \times \mathcal{Y}\)` then `$$X(x, y) =x \qquad \text{and} \qquad Y(x, y)=y$$` In statistical parlance, `\(X\)` and `\(Y\)` are called _variables_ The ranges of the two coordinates may be finite or not The ranges may be different or not --- ### Two-dimensional arrays: `dataframe` .panelset[ .panel[.panel-name[List of vectors] An `\(n \times p\)` array with columns of different types is not a `matrix` In
R, Python, and database systems, such arrays are called _tables_ (databases) or _dataframes_ (R, Python)

_dataframes_ are _column-oriented_

In R
, a _dataframe_ is a `list` of `vectors` All vectors in the `list` have the same length, they may be of different types (`class`) ] .panel[.panel-name[Dataframes, Tibble, ...] We should think of a _bivariate_ sample as dataframe `df` with two _columns_ `X` and `Y` and as many _rows_ as there are _individuals_ in the sample If we project on either coordinate, we obtain a _univariate_ sample (`df$X`, `df[["X"]]` or `df$Y`) ```r df <- tibble::tibble(X=letters[seq(1,6,2)], Y=rnorm(3)) df # a bivariate sample of length 3 ``` ``` ## # A tibble: 3 × 2 ## X Y ## <chr> <dbl> ## 1 a 0.412 ## 2 c -0.431 ## 3 e 0.977 ``` ] ] --- class: center, middle, inverse ## Summarizing bivariate samples --- ### Roadmap Just as we did for univariate samples, we review the different statistics used to summarize bivariate samples We start with qualitative bivariate samples, then proceed to quantitative bivariate samples and to mixed qualitative/quantitative samples Summary statistics are enhanced by visualization, that is standard graphic displays that depend on the kind of bivariate sample under consideration --- ### Goals The main goal of bivariate sample exploration is the assessment of possible _association_ between the two variables _Association_ is a loose term _Association_ can be assigned technical definitions if we consider purely quantitative or purely qualitative bivariate samples - linear correlation and regression - chi-square statistics --- class: center, middle, inverse ## Qualitative bivariate samples --- ### Two-ways contingency tables Qualitative univariate samples are summarized using _one-way contingency tables_, that is by counting the number of occurrences of each modality Qualitative bivariate samples are summarized using _two-way contingency tables_: the counts of co-occurrences for each couples of modalities When `\(X\)` and `\(Y\)` are qualitative variables with respectively `\(p\)` and `\(q\)` modalities, let `$$\begin{array}{rl} n_{i,j} & = \sum_{k=1}^n \mathbb{I}_{X_k=i, Y_k=j} \\ n_{i,\cdot} & = \sum_{k=1}^n \mathbb{I}_{X_k=i} \\ n_{\cdot,j} & = \sum_{k=1}^n \mathbb{I}_{Y_k=j} \\ n & = \sum_{i \leq p} n_{i,\cdot} = \sum_{j\leq q} n_{\cdot,j} \end{array}$$` Counts like `\(n_{i, \cdot}\)` or `\(n_{\cdot, j}\)` are called _marginal counts_ --- ### Class struggle
.panelset[ .panel[.panel-name[Two columns] From the Titanic dataset (`Kaggle`), we extract columns `Pclass` and `Survived` - `Pclass` has three modalities: `(1, 2, 3)` - `Survived` has two modalities: (`Deceased`, `Survived`) ] .panel[.panel-name[Preparing] ```r tit_col_types = cols( PassengerId = col_integer(), * Survived = col_factor(levels=c("0", "1"), * include_na = TRUE), * Pclass = col_factor(levels=c("1", "2", "3"), * ordered = TRUE, * include_na = TRUE), Sex = col_factor(levels = c("female", "male")), Age = col_double(), SibSp = col_integer(), Parch = col_integer(), Embarked = col_factor(levels = c("S", "C", "Q"), include_na = TRUE) ) ``` ] .panel[.panel-name[Retyping] ```r train <- read_csv("DATA/titanic/train.csv", col_types=tit_col_types) test <- read_csv("DATA/titanic/test.csv", col_types=tit_col_types) ``` ``` ## Warning: The following named parsers don't match the column names: Survived ``` ```r *test <- mutate(test, * Survived=NA) tit <- union(train, test) *tit$Survived <- forcats::fct_recode(tit$Survived, * "Deceased"="0", * "Survived"="1") %>% * forcats::fct_relevel(c("Survived", "Deceased")) ``` ] .panel[.panel-name[Two-ways table] ```r tit %>% dplyr::select(Pclass, Survived) %>% # Projection on two columns * table() ``` ``` ## Survived ## Pclass Survived Deceased ## 1 136 80 ## 2 87 97 ## 3 119 372 ``` ] .panel[.panel-name[Nicer output] ```r tit %>% dplyr::select(Pclass, Survived) %>% table() %>% broom::tidy() %>% # make it a dataframe tidyr::pivot_wider(names_from=Survived, values_from=n) %>% # with usual look and feel knitr::kable(format="markdown") ``` ``` ## Warning: 'tidy.table' is deprecated. ## See help("Deprecated") ``` |Pclass | Survived| Deceased| |:------|--------:|--------:| |1 | 136| 80| |2 | 87| 97| |3 | 119| 372| ] ] --- ### Supercharged contingency tables Package `summarytools` provides richer contingency tables. ```r pacman::p_load(summarytools) ctable(x=tit$Pclass, y=tit$Survived, style="rmarkdown" , headings = FALSE) ``` | | __Survival__ | Survived | Deceased | NA | Total | |:-------:|---------:|------------:|------------:|------------:|--------------:| | __Pclass__ | | | | | | | 1 | | 136 (42.1%) | 80 (24.8%) | 107 (33.1%) | 323 (100.0%) | | 2 | | 87 (31.4%) | 97 (35.0%) | 93 (33.6%) | 277 (100.0%) | | 3 | | 119 (16.8%) | 372 (52.5%) | 218 (30.7%) | 709 (100.0%) | | Total | | 342 (26.1%) | 549 (41.9%) | 418 (31.9%) | 1309 (100.0%) | --- class: center, middle, inverse background-image: url(img/Piet_Mondriaan_1921_-_Composition_en_rouge_jaune_bleu_et_noir.jpg) background-size: cover ## Mosaicplots --- ### Mosaicplots as tweaked barplots A handy way of portraying a contingency table, and especially a two-way contingency table consists in building a **mosaic plot**. Function `mosaicplot` belongs to base `R`, it takes as input a contingency table and outputs a plot --- ### A mosaic on the Titanic
.panelset[ .panel[.panel-name[Code] ```r tit %>% dplyr::select(Pclass, Survived) %>% * table() %>% * mosaicplot() ``` For each Passenger Class `Pclass`, we draw a _stacked bar plot_ The _width_ of each bar is proportional to the Class frequency (count)
Within each bar (Class), the _height_ of each (Survival Status) component is proportional to the _relative frequency_ of this Survival Status within that Passenger Class ] .panel[.panel-name[Plot] ![](cm-3-EDA_files/figure-html/titanic-mosaic-label-1.png)
Records with missing values have been omitted ] ] --- ### `mosaicplot` from a Grammar of Graphics perspective The `stat_...` part consists in computing the `\((n_{i,j}/n)_{i\in \mathcal{X}, j \in \mathcal{Y}}\)` The `geom_...` part consists in mapping the counts to rectangles: a `mosaicplot` associates a rectangle to each couple of modalities The surface area of the rectangle associated with `\((i,j) \in \mathcal{X}\times \mathcal{Y}\)` is proportional to `$$\underbrace{n_{i,j}}_{\propto\text{ surface area}} = \underbrace{n_{i, .}}_{\propto\text{ width}} \times \underbrace{\frac{n_{i, j}}{n_{i,.}}}_{\propto\text{ height}}$$` Rectangles are placed on a 2-dimensional grid --- ### From counts to probabilities - Normalized counts `$$(n_{i,j}/n)_{i\in \mathcal{X}, j \in \mathcal{Y}}$$` define a probability distribution on `\(\mathcal{X}\times \mathcal{Y}\)`, the so-called *empirical distribution* `\(P_n\)`: `$$P_n\big\{(i,j)\big\} = P_n\{ X=i \wedge Y=j\} = \frac{n_{i,j}}{n}$$` -- - Marginal counts `\((n_{i,.})_{i \in \mathcal{X}}\)` define _empirical marginal distributions_ -- - For each modality `\(i \in \mathcal{X}\)`, the sum of the heights of rectangles `\((i,j)_{j \in \mathcal{Y}}\)` is normalized The heights `\(\propto n_{i,j}/n_{i, \cdot}\)` define (empirical) *conditional probability distributions*: `$$\frac{n_{i,j}}{n_{i, \cdot}} = P_n \big\{Y=j \mid X=i\big\}$$` --
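In R, these empirical distributions can be read off the two-way contingency table with `prop.table` and `margin.table`. A minimal sketch, reusing the `tit` tibble built earlier (`table` silently drops the rows with missing `Survived`, as in the two-way table above):

```r
tab <- table(tit$Pclass, tit$Survived)   # two-way contingency table
prop.table(tab)                  # joint empirical distribution: n_ij / n
margin.table(tab, 1) / sum(tab)  # empirical marginal distribution of Pclass: n_i. / n
prop.table(tab, margin = 1)      # conditional distributions of Survived given Pclass: n_ij / n_i.
```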
This matters when we want to assess a possible _association_ between `\(X\)` and `\(Y\)`, that is, a departure of `\(P_n\)` (the empirical joint distribution) from the *product distribution* defined from the marginal distributions `$$P_n \circ X^{-1} \leftrightarrow \Big(\frac{n_{i,\cdot}}{n}\Big)_{i \in \mathcal{X}}$$` and `$$P_n \circ Y^{-1} \leftrightarrow \Big(\frac{n_{\cdot, j}}{n}\Big)_{j \in \mathcal{Y}}$$` --- ### Do it with `ggplot` and `dplyr`
.panelset[ .panel[.panel-name[Stat] We need to count rows for each combination of the modalities of `Pclass` and `Survived`, and also to gather total counts per modality of `Pclass` ```r tit %>% filter(!is.na(Survived)) %>% group_by(Pclass, Survived) %>% * summarise(count=n()) %>% ungroup() -> tmp tmp %>% dplyr::group_by(Pclass) %>% * summarise(margin=sum(count)) %>% ungroup() %>% dplyr::inner_join(tmp, by=c("Pclass")) %>% mutate(Prop = count/margin) -> df ``` In
SQL, use `GROUP BY ROLLUP(Pclass, Survived)` and a self `JOIN` or a `WINDOW` function ] .panel[.panel-name[Code] ```r df %>% ggplot(aes(x=Pclass, y=Prop, * fill=Survived)) + * geom_col(position = "stack", * aes(width=margin/sum(tmp$count))) + ggtitle("Hand-made mosaicplot") + xlab("Passenger class") ``` ``` ## Warning: Ignoring unknown aesthetics: width ``` ] .panel[.panel-name[Plot] <img src=cm-3-EDA_files/figure-html/mosaic-ggplot-label-1.png alt="Hand made mosaic plot" height="400", align="left"> ] ] ??? More work would allow us to put the bars closer and to mimic the mosaicplot more faithfully In the Grammar of Graphics perspective, - barplots, - two-way mosaicplots, - higher-dimensional mosaicplots are based on column plots and require stat functions involving aggregation operations from extended SQL --- exclude: true --- ### Transposing a two-way contingency table When building a `mosaicplot`, the two variables do not serve the same purpose: the variable mapped to the `x` axis serves as an *explanatory* variable. The one-way contingency table generated by the explanatory variable can be read directly from the widths of the different columns. This is not the case for the other variable The messages conveyed by the `mosaicplot` of the transpose of a contingency table differ from the original message --- ### Variable order matters! .panelset[ .panel[.panel-name[A tale of two tables] Rather than telling us the fate of the different passenger classes, this `mosaicplot` tells us about the class composition of survivors and casualties
The two stories are related but one is harder to comment about ] .panel[.panel-name[Survival explained by Class] ```r mosaicplot(formula= Pclass ~ Survived, data= tit) ``` <img src="cm-3-EDA_files/figure-html/mosaictitanic1-1.png" width="504" /> ] .panel[.panel-name[Class according to Survival] ```r mosaicplot(formula= Survived ~ Pclass, data= tit) ``` <img src="cm-3-EDA_files/figure-html/mosaictitanic2-1.png" width="504" /> ] ] --- ### Mosaicplots and `tidyverse` Package `ggmosaic` is an extension of `ggplot2` that delivers mosaicplots that fits in the `tidyverse` suite and comply with *Grammar of Graphics*. .panelset[ .panel[.panel-name[Code] ```r pacman::p_load(ggmosaic) tit %>% dplyr::select(Pclass, Survived) %>% ggplot() + * geom_mosaic(aes(x = product(Survived, Pclass), fill=Survived)) + labs(x= "Passenger class", y="Fate") + * scale_fill_viridis_d() + ggtitle("Titanic mosaic with tidyverse flavor") ``` ] .panel[.panel-name[Plot it] <img src=cm-3-EDA_files/figure-html/titanic-ggmosaic-label-1.png alt="ggmosaic plot" height="400" align="left"> ] .panel[.panel-name[Plot it again] <img src=cm-3-EDA_files/figure-html/titanic-ggmosaic-label-transpose-1.png alt="ggmosaic plot" height="400" align="left"> Here again the order of the variable names that are passed to `ggmosaic::product` is important. We are (implicitly) trying to visualize the impact of (passenger) class on fate. It makes sense to map `Pclass` on the `x` axis and `Survived` on the `y` axis. ] ] --- class: middle, center, inverse ## Quantitative bivariate samples --- ### Numerical summaries The numerical summary of a numerical bivariate sample consists of an _empirical mean_ `$$\begin{pmatrix}\overline{X}_n \\ \overline{Y}_n \end{pmatrix} = \frac{1}{n} \sum_{i=1}^n \begin{pmatrix} x_i \\ y_i \end{pmatrix}$$` and an _empirical covariance matrix_ `$$\begin{pmatrix}\operatorname{var}_n(X) & \operatorname{cov}_n(X, Y) \\ \operatorname{cov}_n(X, Y) & \operatorname{var}_n(Y)\end{pmatrix}$$` with `$$\operatorname{var}_n(X, Y) = \frac{1}{n}\sum_{k=1}^n \Big(x_i-\overline{X}_n\Big)^2$$` and `$$\operatorname{cov}_n(X, Y) = \frac{1}{n}\sum_{k=1}^n \Big(x_i-\overline{X}_n\Big)\times \Big(y_i-\overline{Y}_n\Big)$$` --- ### Covariance matrices have properties The empirical covariance matrix is the *covariance matrix of the joint empirical distribution*. As a covariance matrix, the empirical covariance matrix is *symmetric*, *semi-definite positive (SDP)* ###
- A square `\(n \times n\)` matrix `\(A\)` is semi-definite positive (SDP) iff `$$\forall u \in \mathbb{R}^n, \qquad u^T \times A u = \langle u, Au \rangle \geq 0$$` - A square `\(n \times n\)` matrix `\(A\)` is definite positive (DP) iff `$$\forall u \in \mathbb{R}^n \setminus \{0\}, \qquad u^T \times A u = \langle u, Au \rangle > 0$$` --- ### Linear correlation coefficient The **linear correlation coefficient** is defined from the covariance matrix as `$$\rho = \frac{\operatorname{cov}_n(X, Y)}{\sqrt{\operatorname{var}_n(X) \operatorname{var}_n(Y)}}$$`
By the Cauchy-Schwarz inequality, we always have `$$-1 \leq \rho \leq 1$$`
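A quick numerical check on simulated toy data (a minimal sketch): the correlation coefficient is the cosine of the angle between the *centered* columns, which is one way to see the Cauchy-Schwarz bound

```r
set.seed(42)                               # toy data, for illustration only
x <- rnorm(50); y <- 2 * x + rnorm(50)
xc <- x - mean(x); yc <- y - mean(y)       # centered columns
cos_angle <- sum(xc * yc) / sqrt(sum(xc^2) * sum(yc^2))
all.equal(cos_angle, cor(x, y))            # TRUE: cor() is this cosine
```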
Translating and/or rescaling (by a positive factor) the columns does not modify the linear correlation coefficient! Functions `cov` and `cor` from base R
perform the computations --- ### Do it the SQL way
```r data <- read_delim('./DATA/Enfants.txt', delim='\t') data %>% dplyr::select(MASSE, TAILLE) %>% dplyr::summarise(mx= mean(TAILLE), my=mean(MASSE), m2x=mean(TAILLE^2), m2y=mean(MASSE^2), mxy=mean(TAILLE*MASSE)) %>% dplyr::mutate(var_taille=m2x-mx^2, var_masse=m2y-my^2, cov_masse_taille=mxy -mx*my) %>% dplyr::mutate(cor=cov_masse_taille/sqrt(var_taille * var_masse)) %>% dplyr::select(- starts_with('m')) ``` ``` ## # A tibble: 1 × 4 ## var_taille var_masse cov_masse_taille cor ## <dbl> <dbl> <dbl> <dbl> ## 1 2119. 44.0 48.7 0.160 ``` --- ### Visualizing quantitative bivariate samples <img src="img/pexels-katerina-holmes-5905611.jpg" align="right" width="200px"> Suppose now, we want to visualize a quantitative bivariate sample of length `\(n\)`. This bivariate sample (a dataframe) may be handled as a _real matrix_ with `\(n\)` rows and `\(2\)` columns Geometric concepts come into play --- ### Exploring column space We may attempt to visualize the two columns, that is the two `\(n\)`-dimensional vectors or the rows, that is `\(n\)` points on the real plane.
If we try to visualize the two columns, we simplify the problem by _projecting on the plane generated by the two columns_ Then what matters is the _angle_ between the two (centered) vectors: its _cosine_ is precisely the _linear correlation coefficient_ defined above --- ### Exploring row space If we try to visualize the rows, the most basic visualization of a quantitative bivariate sample is the *scatterplot*. In the grammar of graphics parlance, it consists in mapping the two variables on the two axes, and mapping rows to points using `geom_point` and `stat_identity` --- ### A Gaussian cloud
.panelset[ .panel[.panel-name[Simulation] We build an artificial bivariate sample, by first building a covariance matrix `COV` (it is randomly generated). Then we build a bivariate normal sample `s` of length `n` and turn it into a dataframe `u`. The dataframe is then fed to `ggplot`. ```r set.seed(1515) # for the sake of reproducibility n <- 100 V <- matrix(rnorm(4, 1, 1), nrow = 2) COV <- V %*% t(V) # a random covariance matrix, COV is symmetric and SDP s <- t(V %*% matrix(rnorm(2 * 10 * n), ncol=10*n)) u <- tibble(X=s[,1], Y=s[, 2]) # a bivariate normal sample emp_mean <- as_data_frame(t(colMeans(u))) ``` ``` ## Warning: `as_data_frame()` was deprecated in tibble 2.0.0. ## Please use `as_tibble()` instead. ## The signature and semantics have changed, see `?as_tibble`. ``` ```r p_scatter_gaussian <- ggplot(u, aes(x=X, y=Y)) + geom_point(alpha=.5, size=1) + geom_point(data=emp_mean, color=2, size=5) + coord_fixed() + ggtitle(stringr::str_c("Gaussian cloud, cor = ", round(cor(u$X, u$Y), 2), sep="")) p_scatter_gaussian ``` ] .panel[.panel-name[Numerical Summary] - Mean vector (Empirical mean) ```r t(colMeans(u)) %>% knitr::kable(digits = 3, col.names = c("$\\overline{X_n}$", "$\\overline{Y_n}$")) ``` <table> <thead> <tr> <th style="text-align:right;"> `\(\overline{X_n}\)` </th> <th style="text-align:right;"> `\(\overline{Y_n}\)` </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.004 </td> <td style="text-align:right;"> -0.004 </td> </tr> </tbody> </table> - Covariance matrix (Empirical covariance matrix) ```r cov(u) %>% as.data.frame() %>% knitr::kable(digits = 3) ``` <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> X </th> <th style="text-align:right;"> Y </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> X </td> <td style="text-align:right;"> 4.370 </td> <td style="text-align:right;"> -0.706 </td> </tr> <tr> <td style="text-align:left;"> Y </td> <td style="text-align:right;"> -0.706 </td> <td style="text-align:right;"> 1.212 </td> </tr> </tbody> </table> ] .panel[.panel-name[Plot] ![](cm-3-EDA_files/figure-html/gaussiancloud-1.png) ] ] --- class: center, middle, inverse ## Qualitative and quantitative variables --- ### Back to
Back to the Titanic dataset: let us consider the variables `\(X=\)` `Pclass` (qualitative) and `\(Y=\)` `Fare` (quantitative) The numerical summary of such a bivariate sample consists of a _list of numerical summaries of univariate samples_ For each modality of the qualitative variable `\(X\)`, we compute the _conditional mean_ and _variance_ of the quantitative variable `\(Y\)` As before, `\(\overline{Y}_n\)` denotes the empirical mean of `\(Y\)` and `\(\sigma^2_Y\)` the empirical variance of `\(Y\)` (also called the _total variance_) --- ### Conditional summaries For each modality `\(i \in \mathcal{X}\)`, we define: - Conditional Mean of `\(Y\)` given `\(\{ X = i \}\)` `$$\overline{Y}_{n\mid i} = \frac{1}{n_i} \sum_{k\leq n} \mathbb{I}_{x_k =i} \times y_k$$` - Conditional Variance of `\(Y\)` given `\(\{ X= i\}\)` `$$\sigma^2_{Y\mid i} = \frac{1}{n_i} \sum_{k \leq n} \mathbb{I}_{x_k =i} \times \bigg( y_k - \overline{Y}_{n \mid i}\bigg)^2$$` --- ### Huygens-Pythagoras formula `$$\sigma^2_{Y} = \underbrace{\sum_{i\in \mathcal{X}} \frac{n_i}{n} \sigma^2_{Y \mid i}}_{\text{mean of conditional variances}} + \underbrace{\sum_{i\in \mathcal{X}} \frac{n_i}{n} \big(\overline{Y}_{n \mid i} - \overline{Y}_{n}\big)^2}_{\text{variance of conditional means}}$$`
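A minimal sketch checking this decomposition numerically, using the `tit` tibble from the earlier slides with `\(Y=\)` `Fare` and `\(X=\)` `Pclass` (any other qualitative/quantitative pair would do):

```r
tit %>%
  filter(!is.na(Fare)) %>%
  group_by(Pclass) %>%
  summarise(n_i   = n(),
            cmean = mean(Fare),
            cvar  = mean((Fare - cmean)^2)) %>%          # conditional variances (1/n_i normalization)
  summarise(within  = sum(n_i * cvar) / sum(n_i),        # mean of conditional variances
            between = sum(n_i * (cmean - weighted.mean(cmean, n_i))^2) / sum(n_i)) %>%
  mutate(total = within + between)  # equals mean((Fare - mean(Fare))^2) on the same rows
```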
Check it on a dataset of your choice --- ### Robust bivariate summaries It is also possible and fruitful to compute - conditional quantiles (median, quartiles) and - conditional interquartile ranges (IQR) Conditional mean, variance, median, IQR (with `dplyr`
) ```r tit %>% dplyr::select(Survived, Fare) %>% dplyr::group_by(Survived) %>% * dplyr::summarise(cmean=mean(Fare, na.rm=TRUE), csd=sd(Fare,na.rm = TRUE), cmedian=median(Fare, na.rm = TRUE), cIQR=IQR(Fare,na.rm = TRUE)) ``` ``` ## # A tibble: 3 × 5 ## Survived cmean csd cmedian cIQR ## <fct> <dbl> <dbl> <dbl> <dbl> ## 1 Survived 48.4 66.6 26 44.5 ## 2 Deceased 22.1 31.4 10.5 18.1 ## 3 <NA> 35.6 55.9 14.5 23.6 ``` --- ### Visualization of mixed bivariate samples Visualization of qualitative/quantitative bivariate samples consists in displaying visual summaries of conditional distribution of `\(Y\)` given `\(X=i, i \in \mathcal{X}\)` `Boxplots` and `violinplots` are relevant here --- ### Mixed bivariate samples from Titanic (violine plots) .panelset[ .panel[.panel-name[Code] ```r filtered_tit <- tit %>% dplyr::select(Pclass, Survived, Fare) %>% dplyr::filter(Fare > 0 ) v <- filtered_tit %>% ggplot() + aes(y=Fare) + scale_y_log10() vv <- v + geom_violin() ``` ] .panel[.panel-name[Fare versus Passenger Class] ```r vv + aes(x=Pclass) + ggtitle("Titanic: Fare versus Passenger Class") ``` <img src="cm-3-EDA_files/figure-html/unnamed-chunk-12-1.png" width="504" /> ] .panel[.panel-name[Fare versus Survival] ```r vv + aes(x=Survived) + ggtitle("Titanic: Fare versus Survival") ``` <img src="cm-3-EDA_files/figure-html/unnamed-chunk-13-1.png" width="504" /> ] ] --- ### Mixed bivariate samples from Titanic (boxplots) .panelset[ .panel[.pane-name[Code] - Comply with the `DRY` principle - Avoid `WET` ```r vw <- v + geom_boxplot() ``` ] .panel[.panel-name[Fare versus Passenger Class] ```r vw + aes(x=Pclass) + ggtitle("Titanic: Fare versus Passenger Class") ``` <img src="cm-3-EDA_files/figure-html/unnamed-chunk-15-1.png" width="504" /> ] .panel[.panel-name[Fare versus Survival] ```r vw + aes(x=Survived) + ggtitle("Titanic: Fare versus Survival") ``` <img src="cm-3-EDA_files/figure-html/unnamed-chunk-16-1.png" width="504" /> ] ] --- ### Dataset `whiteside` (from package `MASS` of
R) > Mr Derek Whiteside of the UK Building Research Station recorded the weekly gas consumption and average external temperature at his own house in south-east England for two heating seasons, one of 26 weeks before, and one of 30 weeks after cavity-wall insulation was installed. The object of the exercise was to assess the effect of the insulation on gas consumption. --- ### Dataset `whiteside` `Gas` and `Temp` are both quantitative variables while `Insul` is qualitative with two modalities (`Before`, `After`). `Insul` : A factor, before or after insulation. `Temp` : Purportedly the average outside temperature in degrees Celsius. (These values are far too low for any 56-week period in the 1960s in South-East England. It might be the weekly average of daily minima.) `Gas` : The weekly gas consumption in 1000s of cubic feet. ---
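### Dataset `whiteside`: conditional summaries

As for the Titanic data, conditional summaries of `Gas` given `Insul` can be computed with `dplyr` (a minimal sketch, assuming `dplyr` is loaded as in the rest of the deck):

```r
MASS::whiteside %>%
  group_by(Insul) %>%
  summarise(n       = n(),
            cmean   = mean(Gas),
            csd     = sd(Gas),
            cmedian = median(Gas),
            cIQR    = IQR(Gas))
```

---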
```r MASS::whiteside %>% ggplot(aes(x=Insul, y=Temp)) + geom_violin() + ggtitle("Whiteside data: violinplots") ``` <img src="cm-3-EDA_files/figure-html/unnamed-chunk-17-1.png" width="504" /> --- class: middle, center, inverse ## Simple linear regression --- - We now explore association between _two_ quantitative variables - We investigate the association between two quantitative variables as a _prediction_ problem - We aim at predicting the value of `\(Y\)` as a function of `\(X\)`. - We restrict our attention to linear/affine prediction. --- We look for `\(a, b \in \mathbb{R}\)` such that `$$y_i \approx a x_i +b$$` Making `\(\approx\)` meaningful compels us to choose a _goodness of fit_ criterion. -- Several criteria are possible, for example: `$$\begin{array}{rl}\text{Mean absolute deviation} & = \frac{1}{n}\sum_{i=1}^n \big|y_i - a x_i -b \big| \\\text{Mean quadratic deviation} & = \frac{1}{n}\sum_{i=1}^n \big|y_i - a x_i -b \big|^2 \end{array}$$` --- In their days, Laplace championed the mean absolute deviation, while Gauss advocated the mean quadratic deviation. For computational reasons, we focus on minimizing the mean quadratic deviation. .pull-left[ <img src="img/Rue_Laplace_(Paris).JPG" align="right" width="200"> > The fourth chapter of Laplace treatise includes an exposition of the _method of least squares_, a remarkable testimony to Laplace's command over the processes of analysis. > In 1805 Legendre had published the _method of least squares_, making no attempt to tie it to the theory of probability. ] .pull-right[ <img src="img/gauss_10_DM.jpg" align="right" height="100"> > In 1809 Gauss had derived the _normal distribution_ from the principle that the arithmetic mean of observations gives the most probable value for the quantity measured; then, turning this argument back upon itself, he showed that, if the errors of observation are normally distributed, the _least squares estimates_ give the most probable values for the coefficients in _regression_ situations ] .tr[Wikipedia] --- class: center, middle, inverse ## Least Square Regression --- ### Minimizing a cost function The *Least Square Regression* problem consists of minimizing (with respect to `\((a,b)\)`): `$$\begin{array}{rl} \ell_n(a,b) & = \sum_{i=1}^n \big(y_i - a x_i -b \big)^2 \\ & = \sum_{i=1}^n \big((y_i - \overline{Y}_n) - a (x_i - \overline{X}_n) + \overline{Y}_n - a \overline{X}_n-b \big)^2 \\ & = \sum_{i=1}^n \big((y_i - \overline{Y}_n) - a (x_i - \overline{X}_n) \big)^2 + n \big(\overline{Y}_n - a \overline{X}_n-b\big)^2 \end{array}$$` --- ### Deriving the solution The function to be minimized is smooth and strictly convex over `\(\mathbb{R}^2\)` : a unique minimum is attained where the gradient vanishes -- It is enough to compute the partial derivatives. `$$\begin{array}{rl}\frac{\partial \ell_n}{\partial a} & = - 2 \operatorname{cov}(X,Y) + 2 a \operatorname{var}(X) -2 n \big(\overline{Y}_n - a \overline{X}_n-b\big) \overline{X}_n \\ \frac{\partial \ell_n}{\partial b} & = -2 n \big(\overline{Y}_n - a \overline{X}_n-b\big)\end{array}$$` --- ### A closed-form solution Zeroing partial derivatives leads to `$$\begin{array}{rl} \widehat{a} & = \frac{\operatorname{cov}(X,Y)}{\operatorname{var}(X)} \\ \widehat{b} & = \overline{Y}_n - \frac{\operatorname{cov}(X,Y)}{\operatorname{var}(X)} \overline{X}_n \end{array}$$` -- or `$$\begin{array}{rl} \widehat{a} & = \rho \frac{\sigma_y}{\sigma_x} \\ \widehat{b} & = \overline{Y}_n - \rho\frac{\sigma_y}{\sigma_x} \overline{X}_n \end{array}$$` --
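A quick sanity check of the closed-form solution (a minimal sketch reusing the simulated tibble `u` from the Gaussian cloud slides):

```r
a_hat <- cov(u$X, u$Y) / var(u$X)       # slope: cov(X, Y) / var(X)
b_hat <- mean(u$Y) - a_hat * mean(u$X)  # intercept
c(a_hat = a_hat, b_hat = b_hat)
coef(lm(Y ~ X, data = u))               # should coincide, up to rounding
```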
If the sample were standardized, that is, if `\(X\)` (resp. `\(Y\)`) were divided by `\(\sigma_X\)` (resp. `\(\sigma_Y\)`), the slope of the regression line would be the correlation coefficient --- ### Overplotting the Gaussian cloud .left-column[ - The _slope_ and _intercept_ can be computed from the sample summary (empirical mean and covariance matrix) - In higher dimension, coefficients are from `lm(...)` ] .right-column[ <img src="cm-3-EDA_files/figure-html/unnamed-chunk-18-1.png" width="504" /> ] --- ### `lm(formula, data)` ```r mod <- lm(formula=Y ~ X, data=u) mod %>% summary() ``` ``` ## ## Call: ## lm(formula = Y ~ X, data = u) ## ## Residuals: ## Min 1Q Median 3Q Max ## -3.0168 -0.7106 -0.0079 0.7294 3.5773 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -0.003685 0.033145 -0.111 0.911 ## X -0.161562 0.015864 -10.184 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.048 on 998 degrees of freedom ## Multiple R-squared: 0.09415, Adjusted R-squared: 0.09324 ## F-statistic: 103.7 on 1 and 998 DF, p-value: < 2.2e-16 ``` --- ### Residuals .panelset[ .panel[.panel-name[Residuals] The _residuals_ are the prediction errors `\(\left(y_i - \widehat{a}x_i - \widehat{b}\right)_{i\leq n}\)` Residuals play a central role in _regression diagnostic_ The `Residual Standard Error`, is the square root of the normalized sum of squared residuals: `$$\frac{1}{n-2}\sum_{i=1}^n \left(y_i - \widehat{a}x_i - \widehat{b}\right)^2$$` The normalization coefficient is the number of rows `\(n\)` diminished by the number of adjusted parameters (the so-called _degrees of freedom_)
This makes sense if we adopt a modeling perspective, if we accept the _Gaussian Linear Models_ assumptions from the Statistical Inference course ] .panel[.panel-name[Code] ```r p_scatter_gaussian %+% * broom::augment(lm(Y ~ X, u)) + geom_line(aes(x=X, y=.fitted)) + geom_segment(aes(x=X, xend=X, y=.fitted, yend=Y, color=forcats::as_factor(sign(.resid))), alpha=.2) + theme(legend.position = "None") + ggtitle("Gaussian cloud",subtitle = "with residuals") ``` ] .panel[.panel-name[Plot] ![](cm-3-EDA_files/figure-html/scatplot-residuals-1.png) The residuals are the lengths of the segments connecting sample points to their projections on the regression line ] .panel[.panel-name[Multiple R-squared] Technically, the `Multiple R-squared` or : _coefficient of determination_ is the squared empirical correlation coefficient `\(\rho^2\)` between the explanatory and the response variables (in simple linear regression) `$$1 - \frac{\sum_{i=1}^n \left(y_i - \widehat{a}x_i - \widehat{b}\right)^2}{\sum_{i=1}^n \left(y_i - \overline{Y}_n\right)^2}= 1 - \frac{\sum_{i=1}^n \left(y_i - \widehat{y}_i \right)^2}{\sum_{i=1}^n \left(y_i - \overline{Y}_n\right)^2}$$` It is also understood as the share of the variance of the response variable that is _explained_ by the explanatory variable ] .panel[.panel-name[Adjusted R-squared] The `Adjusted R-squared` is a deflated version of `Multiple R-squared` `$$1 - \frac{\sum_{i=1}^n \left(y_i - \widehat{a}x_i - \widehat{b}\right)^2/(n-p-1)}{\sum_{i=1}^n \left(y_i - \overline{Y}_n\right)^2/(n-1)}$$` It is useful when comparing the merits of several competing models (this takes us beyond the scope of this lesson) ] ] --- count: false ### Visualizing residuals .panel1-flip-scatplot-residuals-auto[ ```r *p_scatter_gaussian ``` ] .panel2-flip-scatplot-residuals-auto[ <img src="cm-3-EDA_files/figure-html/flip-scatplot-residuals_auto_01_output-1.png" width="504" /> ] --- count: false ### Visualizing residuals .panel1-flip-scatplot-residuals-auto[ ```r p_scatter_gaussian %+% * broom::augment(lm(Y ~ X, u)) #<< ``` ] .panel2-flip-scatplot-residuals-auto[ <img src="cm-3-EDA_files/figure-html/flip-scatplot-residuals_auto_02_output-1.png" width="504" /> ] --- count: false ### Visualizing residuals .panel1-flip-scatplot-residuals-auto[ ```r p_scatter_gaussian %+% * broom::augment(lm(Y ~ X, u)) + * geom_line(aes(x=X, y=.fitted)) ``` ] .panel2-flip-scatplot-residuals-auto[ <img src="cm-3-EDA_files/figure-html/flip-scatplot-residuals_auto_03_output-1.png" width="504" /> ] --- count: false ### Visualizing residuals .panel1-flip-scatplot-residuals-auto[ ```r p_scatter_gaussian %+% * broom::augment(lm(Y ~ X, u)) + geom_line(aes(x=X, y=.fitted)) + * geom_segment(aes(x=X, * xend=X, * y=.fitted, * yend=Y, * color=as_factor(sign(.resid))), * alpha=.2) ``` ] .panel2-flip-scatplot-residuals-auto[ <img src="cm-3-EDA_files/figure-html/flip-scatplot-residuals_auto_04_output-1.png" width="504" /> ] --- count: false ### Visualizing residuals .panel1-flip-scatplot-residuals-auto[ ```r p_scatter_gaussian %+% * broom::augment(lm(Y ~ X, u)) + geom_line(aes(x=X, y=.fitted)) + geom_segment(aes(x=X, xend=X, y=.fitted, yend=Y, color=as_factor(sign(.resid))), alpha=.2) + * theme(legend.position = "None") ``` ] .panel2-flip-scatplot-residuals-auto[ <img src="cm-3-EDA_files/figure-html/flip-scatplot-residuals_auto_05_output-1.png" width="504" /> ] --- count: false ### Visualizing residuals .panel1-flip-scatplot-residuals-auto[ ```r p_scatter_gaussian %+% * broom::augment(lm(Y ~ X, u)) + geom_line(aes(x=X, y=.fitted)) 
+ geom_segment(aes(x=X, xend=X, y=.fitted, yend=Y, color=as_factor(sign(.resid))), alpha=.2) + theme(legend.position = "None") + * ggtitle("Gaussian cloud", * subtitle = "with residuals!") ``` ] .panel2-flip-scatplot-residuals-auto[ <img src="cm-3-EDA_files/figure-html/flip-scatplot-residuals_auto_06_output-1.png" width="504" /> ] <style> .panel1-flip-scatplot-residuals-auto { color: black; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-flip-scatplot-residuals-auto { color: black; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-flip-scatplot-residuals-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- class: middle, inverse, center ## `\(y = x^T \beta + \sigma \epsilon\)`: The biggest lie? --- ### Attention!
- Any numeric bivariate sample can be fed to `lm` - Whatever the bivariate dataset, you will obtain a linear prediction model - It is not wise to rely exclusively on the `Multiple R-squared` to assess a linear model -
Different datasets can lead to the same regression line, the same `Multiple R-squared`, and the same `Adjusted R-squared` --- ### Anscombe quartet .fl.w-50.pa2[ Four simple linear regression problems packaged in the dataframe `datasets::anscombe` - `y1 ~ x1` - `y2 ~ x2` - `y3 ~ x3` - `y4 ~ x4` ] .fl.w-50.pa2[ ```r anscombe <- datasets::anscombe anscombe %>% gt() ```
| x1| x2| x3| x4|    y1|   y2|    y3|    y4|
|--:|--:|--:|--:|-----:|----:|-----:|-----:|
| 10| 10| 10|  8|  8.04| 9.14|  7.46|  6.58|
|  8|  8|  8|  8|  6.95| 8.14|  6.77|  5.76|
| 13| 13| 13|  8|  7.58| 8.74| 12.74|  7.71|
|  9|  9|  9|  8|  8.81| 8.77|  7.11|  8.84|
| 11| 11| 11|  8|  8.33| 9.26|  7.81|  8.47|
| 14| 14| 14|  8|  9.96| 8.10|  8.84|  7.04|
|  6|  6|  6|  8|  7.24| 6.13|  6.08|  5.25|
|  4|  4|  4| 19|  4.26| 3.10|  5.39| 12.50|
| 12| 12| 12|  8| 10.84| 9.13|  8.15|  5.56|
|  7|  7|  7|  8|  4.82| 7.26|  6.42|  7.91|
|  5|  5|  5|  8|  5.68| 4.74|  5.73|  6.89|
] --- ### Anscombe quartet: 4 datasets, 1 linear fit with almost identical goodness of fits .panelset[ .panel[.panel-name[`y1 ~ x1`] ```r lm(y1 ~ x1, anscombe) %>% summary ``` ``` ## ## Call: ## lm(formula = y1 ~ x1, data = anscombe) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.92127 -0.45577 -0.04136 0.70941 1.83882 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.0001 1.1247 2.667 0.02573 * ## x1 0.5001 0.1179 4.241 0.00217 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.237 on 9 degrees of freedom ## Multiple R-squared: 0.6665, Adjusted R-squared: 0.6295 ## F-statistic: 17.99 on 1 and 9 DF, p-value: 0.00217 ``` ] .panel[.panel-name[`y2 ~ x2`] ```r lm(y2 ~ x2, anscombe) %>% summary ``` ``` ## ## Call: ## lm(formula = y2 ~ x2, data = anscombe) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.9009 -0.7609 0.1291 0.9491 1.2691 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.001 1.125 2.667 0.02576 * ## x2 0.500 0.118 4.239 0.00218 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.237 on 9 degrees of freedom ## Multiple R-squared: 0.6662, Adjusted R-squared: 0.6292 ## F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002179 ``` ] .panel[.panel-name[`y3 ~ x3`] ```r lm(y3 ~ x3, anscombe) %>% summary ``` ``` ## ## Call: ## lm(formula = y3 ~ x3, data = anscombe) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.1586 -0.6146 -0.2303 0.1540 3.2411 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.0025 1.1245 2.670 0.02562 * ## x3 0.4997 0.1179 4.239 0.00218 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.236 on 9 degrees of freedom ## Multiple R-squared: 0.6663, Adjusted R-squared: 0.6292 ## F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002176 ``` ] .panel[.panel-name[`y4 ~ x4`] ```r lm(y4 ~ x4, anscombe) %>% summary ``` ``` ## ## Call: ## lm(formula = y4 ~ x4, data = anscombe) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.751 -0.831 0.000 0.809 1.839 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.0017 1.1239 2.671 0.02559 * ## x4 0.4999 0.1178 4.243 0.00216 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.236 on 9 degrees of freedom ## Multiple R-squared: 0.6667, Adjusted R-squared: 0.6297 ## F-statistic: 18 on 1 and 9 DF, p-value: 0.002165 ``` ] ] ??? --- ### Anscombe quartet (continued) All four numerical summaries look similar: - `Intercept` `\(\approx 3.0017\)` - `slope` `\(\approx 0.5\)` - Residual standard error `\(\approx 1.236\)` - Multiple R-squared `\(\approx .67\)` - F-statistic `\(\approx 18\)` `\(n\)` is equal to 11 The number of adjusted parameters `\(p\)` is 2 The number of degrees of freedom is `\(n-p=9\)` ??? How is RSE computed ? `$$\frac{1}{n-p}\sum_{i=1}^n \left(y_j[i] - \widehat{y}_j[i] \right)^2$$` How is R-squared computed ? How is adjusted R-squared ? --- --- ### Anscombe quartet (continued) .panelset[ .panel[.panel-name[Beyond numbers] Visual inspection of the data reveals that some linear models are more relevant than others This is the message of the Anscombe quartet. It is made of four bivariate samples with `\(n=11\)` individuals. 
] .panel[.panel-name[Anscombe in long format] ```r datasets::anscombe %>% pivot_longer(everything(), names_to = c(".value", "group"), names_pattern = "(.)(.)" ) %>% rename(X=x, Y=y) %>% arrange(group)-> anscombe ``` From [https://tidyr.tidyverse.org/articles/pivot.html](https://tidyr.tidyverse.org/articles/pivot.html) ] .panel[.panel-name[Before] ```r datasets::anscombe %>% head(8) %>% knitr::kable() ``` <table> <thead> <tr> <th style="text-align:right;"> x1 </th> <th style="text-align:right;"> x2 </th> <th style="text-align:right;"> x3 </th> <th style="text-align:right;"> x4 </th> <th style="text-align:right;"> y1 </th> <th style="text-align:right;"> y2 </th> <th style="text-align:right;"> y3 </th> <th style="text-align:right;"> y4 </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 8.04 </td> <td style="text-align:right;"> 9.14 </td> <td style="text-align:right;"> 7.46 </td> <td style="text-align:right;"> 6.58 </td> </tr> <tr> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 6.95 </td> <td style="text-align:right;"> 8.14 </td> <td style="text-align:right;"> 6.77 </td> <td style="text-align:right;"> 5.76 </td> </tr> <tr> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 7.58 </td> <td style="text-align:right;"> 8.74 </td> <td style="text-align:right;"> 12.74 </td> <td style="text-align:right;"> 7.71 </td> </tr> <tr> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 8.81 </td> <td style="text-align:right;"> 8.77 </td> <td style="text-align:right;"> 7.11 </td> <td style="text-align:right;"> 8.84 </td> </tr> <tr> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 8.33 </td> <td style="text-align:right;"> 9.26 </td> <td style="text-align:right;"> 7.81 </td> <td style="text-align:right;"> 8.47 </td> </tr> <tr> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 9.96 </td> <td style="text-align:right;"> 8.10 </td> <td style="text-align:right;"> 8.84 </td> <td style="text-align:right;"> 7.04 </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 7.24 </td> <td style="text-align:right;"> 6.13 </td> <td style="text-align:right;"> 6.08 </td> <td style="text-align:right;"> 5.25 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 19 </td> <td style="text-align:right;"> 4.26 </td> <td style="text-align:right;"> 3.10 </td> <td style="text-align:right;"> 5.39 </td> <td style="text-align:right;"> 12.50 </td> </tr> </tbody> </table> ] .panel[.panel-name[After] ```r anscombe %>% head(8) %>% 
knitr::kable() ``` <table> <thead> <tr> <th style="text-align:left;"> group </th> <th style="text-align:right;"> X </th> <th style="text-align:right;"> Y </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 8.04 </td> </tr> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 6.95 </td> </tr> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 7.58 </td> </tr> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 8.81 </td> </tr> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 8.33 </td> </tr> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> 9.96 </td> </tr> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 7.24 </td> </tr> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 4.26 </td> </tr> </tbody> </table> ] ] --- count: false ### Pivoting Anscombe .panel1-pivot_anscombe-auto[ ```r *datasets::anscombe ``` ] .panel2-pivot_anscombe-auto[ ``` ## x1 x2 x3 x4 y1 y2 y3 y4 ## 1 10 10 10 8 8.04 9.14 7.46 6.58 ## 2 8 8 8 8 6.95 8.14 6.77 5.76 ## 3 13 13 13 8 7.58 8.74 12.74 7.71 ## 4 9 9 9 8 8.81 8.77 7.11 8.84 ## 5 11 11 11 8 8.33 9.26 7.81 8.47 ## 6 14 14 14 8 9.96 8.10 8.84 7.04 ## 7 6 6 6 8 7.24 6.13 6.08 5.25 ## 8 4 4 4 19 4.26 3.10 5.39 12.50 ## 9 12 12 12 8 10.84 9.13 8.15 5.56 ## 10 7 7 7 8 4.82 7.26 6.42 7.91 ## 11 5 5 5 8 5.68 4.74 5.73 6.89 ``` ] --- count: false ### Pivoting Anscombe .panel1-pivot_anscombe-auto[ ```r datasets::anscombe %>% * pivot_longer(everything(), * names_to = c(".value", "group"), * names_pattern = "(.)(.)" * ) ``` ] .panel2-pivot_anscombe-auto[ ``` ## # A tibble: 44 × 3 ## group x y ## <chr> <dbl> <dbl> ## 1 1 10 8.04 ## 2 2 10 9.14 ## 3 3 10 7.46 ## 4 4 8 6.58 ## 5 1 8 6.95 ## 6 2 8 8.14 ## 7 3 8 6.77 ## 8 4 8 5.76 ## 9 1 13 7.58 ## 10 2 13 8.74 ## # … with 34 more rows ``` ] --- count: false ### Pivoting Anscombe .panel1-pivot_anscombe-auto[ ```r datasets::anscombe %>% pivot_longer(everything(), names_to = c(".value", "group"), names_pattern = "(.)(.)" ) %>% * rename(X=x, Y=y) ``` ] .panel2-pivot_anscombe-auto[ ``` ## # A tibble: 44 × 3 ## group X Y ## <chr> <dbl> <dbl> ## 1 1 10 8.04 ## 2 2 10 9.14 ## 3 3 10 7.46 ## 4 4 8 6.58 ## 5 1 8 6.95 ## 6 2 8 8.14 ## 7 3 8 6.77 ## 8 4 8 5.76 ## 9 1 13 7.58 ## 10 2 13 8.74 ## # … with 34 more rows ``` ] --- count: false ### Pivoting Anscombe .panel1-pivot_anscombe-auto[ ```r datasets::anscombe %>% pivot_longer(everything(), names_to = c(".value", "group"), names_pattern = "(.)(.)" ) %>% rename(X=x, Y=y) %>% * arrange(group)-> anscombe_long ``` ] .panel2-pivot_anscombe-auto[ ] --- count: false ### Pivoting Anscombe .panel1-pivot_anscombe-auto[ ```r datasets::anscombe %>% pivot_longer(everything(), names_to = c(".value", "group"), names_pattern = "(.)(.)" ) %>% rename(X=x, Y=y) %>% arrange(group)-> anscombe_long *anscombe_long ``` ] .panel2-pivot_anscombe-auto[ ``` ## # A tibble: 44 × 3 ## group X Y ## <chr> <dbl> <dbl> ## 1 1 10 8.04 ## 2 1 8 6.95 ## 3 1 13 7.58 ## 4 1 9 8.81 ## 5 1 11 8.33 ## 6 1 14 9.96 ## 7 1 6 7.24 ## 8 1 4 4.26 ## 9 1 12 10.8 ## 10 1 7 4.82 ## # … with 34 more rows ``` ] <style> 
.panel1-pivot_anscombe-auto { color: black; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-pivot_anscombe-auto { color: black; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-pivot_anscombe-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> ??? Interpreting `pivot_longer` --- ### Performing regression per group For each value of `group` we perform a linear regression of `Y` versus `X` ```r list_lm <- purrr::map(anscombe_long$group , .f = function(g) lm(Y ~ X, anscombe, subset = anscombe$group==g)) ```
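An equivalent, arguably tidier formulation (a minimal sketch, assuming the long-format `anscombe_long` tibble from the flipbook above) iterates over the four distinct groups only:

```r
list_lm2 <- anscombe_long %>%
  split(.$group) %>%                   # a named list with one tibble per group
  purrr::map(~ lm(Y ~ X, data = .x))   # one linear fit per group
purrr::map(list_lm2, coef)             # compare intercepts and slopes across groups
```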
Don't Repeat Yourself (DRY) We use _functional programming_: `purrr::map(.l, .f)` where - `.l` is a list - `.f` is a function to be applied to each item of list `.l` or a `formula` to be evaluated on each list item [`purrr` package](https://purrr.tidyverse.org/reference/map.html) --- ### Inspecting summaries All four regressions lead to the same intercept and the same slope All four regressions have the same Sum of Squared Residuals All four regressions have the same Adjusted R-square We are tempted to conclude that > all four linear regressions are equally relevant Plotting points and lines helps dispel this illusion --- ### Unveiling points .panelset[ .panel[.panel-name[Code] ```r p <- anscombe %>% ggplot(aes(x=X, y=Y)) + * geom_smooth(method="lm", se=FALSE) + * facet_wrap(~ group) + ggtitle("Anscombe quartet: linear regression Y ~ X") ``` ] .panel[.panel-name[Regression lines] .pull-left[ ``` ## `geom_smooth()` using formula 'y ~ x' ``` <img src="cm-3-EDA_files/figure-html/unnamed-chunk-33-1.png" width="504" /> ] .pull-right[ Least squares minimization leads to the same optimum on the four datasets, that is to the same intercept and slope The distribution of residuals differ substantially from one dataset to the next ] ] .panel[.panel-name[Lines and points] .pull-left[ ``` ## `geom_smooth()` using formula 'y ~ x' ``` <img src="cm-3-EDA_files/figure-html/unnamed-chunk-34-1.png" width="504" /> ] .pull-right[ Among the four datasets, only the two left ones are righteously handled using simple linear regression The bottom left dataset outlines the impact of *outliers* on Least Squares Minimization ] ] ] ??? What happens if we fit a line by minimizing Least Absolute Deviation? (adopt an `\(\ell_1\)` criterion rather than an `\(\ell_2\)`) --- class: center, middle, inverse ## Association between qualitative variables --- We showcase the approach by assessing the association between Passenger class (`Pclass`) and Survival (`Survived`) in the Titanic data Modeling assumptions (often implicit) - Each class defines a population. The fates of individuals within each population are assumed to be independent and identically distributed - The fates of individuals from different population are assumed to be independent -- Do you take these modeling assumptions for granted? --- ### Pearson's `\(\chi^2\)` association statistic Each population is associated with a Bernoulli distribution. The question is: > are the three Bernoulli distributions identical? The quantity we compute indexes the departure of the joint empirical distrition from the product of its marginal distributions `$$\sum_{i\in \mathcal{X}, j \in \mathcal{Y}} \frac{\Big(n_{i, j} - \frac{n_{i, \cdot}n_{\cdot, j}}{n} \Big)^2}{\frac{n_{i, \cdot}n_{\cdot, j}}{n}}$$` If this statistic is large with respect to `\((|\mathcal{X}|-1)(|\mathcal{Y}-1|)\)`. If it is much larger, this indicates a strong departure of the joint empirical distribution from a product distribution --- ### The `\(\chi^2\)` homogeneity statistics ```r tit %>% dplyr::select(Pclass, Survived) %>% table() %>% chisq.test() %>% broom::tidy() ``` ``` ## # A tibble: 1 × 4 ## statistic p.value parameter method ## <dbl> <dbl> <int> <chr> ## 1 103. 4.55e-23 2 Pearson's Chi-squared test ``` --- ```r tit %>% dplyr::group_by(Pclass, Survived) %>% dplyr::summarize(n =n()) %>% knitr::kable(format="markdown") ``` ``` ## `summarise()` has grouped output by 'Pclass'. You can override using the `.groups` argument. 
``` |Pclass |Survived | n| |:------|:--------|---:| |1 |Survived | 136| |1 |Deceased | 80| |1 |NA | 107| |2 |Survived | 87| |2 |Deceased | 97| |2 |NA | 93| |3 |Survived | 119| |3 |Deceased | 372| |3 |NA | 218| --- Function `ctable` from `summarytools` can output the Pearson `\(\chi^2\)` statistic: ```r summarytools::ctable(x = tit$Pclass, y=tit$Survived, chisq = TRUE, style="rmarkdown", headings = FALSE) ``` | | | | | | | |-------:|---------:|------------:|------------:|------------:|--------------:| | | Survived | Survived | Deceased | \<NA\> | Total | | Pclass | | | | | | | 1 | | 136 (42.1%) | 80 (24.8%) | 107 (33.1%) | 323 (100.0%) | | 2 | | 87 (31.4%) | 97 (35.0%) | 93 (33.6%) | 277 (100.0%) | | 3 | | 119 (16.8%) | 372 (52.5%) | 218 (30.7%) | 709 (100.0%) | | Total | | 342 (26.1%) | 549 (41.9%) | 418 (31.9%) | 1309 (100.0%) | ---------------------------- Chi.squared df p.value ------------- ---- --------- 102.889 2 0 ---------------------------- --- ### `\(p\)`-value If the modeling assumptions are correct, if you are accepting to reject the independence hypothesis with probability `\(\alpha\)` while independence holds, then you should reject the independence hypothesis when the so-called `\(p\)`-value is smaller than `\(\alpha\)`. --- class: middle, center, inverse background-image: url('./img/pexels-cottonbro-3171837.jpg') background-size: cover # The End