Categorical Univariate Statistics

---
name: layout-general
layout: true
class: left, middle

.remark-slide-number .progress-bar-container {
  position: absolute;
  bottom: 0;
  height: 4px;
  display: block;
  left: 0;
  right: 0;
}

.remark-slide-number .progress-bar {
  height: 100%;
  background-color: red;
}
</style>

<div>
<style type="text/css">.xaringan-extra-logo {
width: 110px;
height: 128px;
z-index: 0;
background-image: url(./img/Universite_Paris_logo_horizontal.jpg);
background-size: contain;
background-repeat: no-repeat;
position: absolute;
top:1em;right:1em;
}
</style>
<script>(function () {
  let tries = 0
  function addLogo () {
    if (typeof slideshow === 'undefined') {
      tries += 1
      if (tries < 10) {
        setTimeout(addLogo, 100)
      }
    } else {
      document.querySelectorAll('.remark-slide-content:not(.hide_logo)')
        .forEach(function (slide) {
          const logo = document.createElement('a')
          logo.classList = 'xaringan-extra-logo'
          logo.href = 'http://master.math.univ-paris-diderot.fr/annee/m1-mi/'
          slide.appendChild(logo)
        })
    }
  }
  document.addEventListener('DOMContentLoaded', addLogo)
})()</script>
</div>

---

# Exploratory Data Analysis : Categorical Univariate Statistics

### 2022-02-09

#### [Master I MIDS & MFA]()

#### [Analyse Exploratoire de Données](http://stephane-v-boucheron.fr/courses/isidata/)

#### [Stéphane Boucheron](http://stephane-v-boucheron.fr)

---
template: inter-slide

### <svg aria-hidden="true" role="img" viewBox="0 0 576 512" style="height:1em;width:1.12em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:white;overflow:visible;position:relative;"><path d="M0 117.66v346.32c0 11.32 11.43 19.06 21.94 14.86L160 416V32L20.12 87.95A32.006 32.006 0 0 0 0 117.66zM192 416l192 64V96L192 32v384zM554.06 33.16L416 96v384l139.88-55.95A31.996 31.996 0 0 0 576 394.34V48.02c0-11.32-11.43-19.06-21.94-14.86z"/></svg>

### Numerical summaries for categorical samples

### Barplots

### Pieplots

### Semmelweis quasi-experiment

---
template: inter-slide
name: num-sum-cat-samples

## Numerical summaries for categorical samples

---

### Counts

For qualitative samples, summary statistics consists of _counts_: for each modality, we record the _number of occurences_ or _frequency_

In French, the _frequence_ of that symbol is the _ratio_ between the number
of occurences of the symbol and the total length of the sequence

---

### Summarizing `carrier`

.fl.w-40.pa2.f6[

Load `nycflights`  and inspect 
`flights` dataset

```r
tmp <- purrr::map(
  c("tidyverse", 
  "broom", 
  "knitr",
  "nycflights13"),  
  ~ require(., 
            character.only=T,
            quietly=T))

flights %>%  
  glimpse(w=40)
```

We  focus on variable `carrier`

```r
v_carrier <- flights$carrier

*v_carrier <- flights %>%
* pull(carrier)
```

]

.fl.w-60.pa2.f6[

```
## Warning: package 'broom' was built under R version 4.1.2
```

```
## Rows: 336,776
## Columns: 19
## $ year           <int> 2013, 2013, 201…
## $ month          <int> 1, 1, 1, 1, 1, …
## $ day            <int> 1, 1, 1, 1, 1, …
## $ dep_time       <int> 517, 533, 542, …
## $ sched_dep_time <int> 515, 529, 540, …
## $ dep_delay      <dbl> 2, 4, 2, -1, -6…
## $ arr_time       <int> 830, 850, 923, …
## $ sched_arr_time <int> 819, 830, 850, …
## $ arr_delay      <dbl> 11, 20, 33, -18…
## $ carrier        <chr> "UA", "UA", "AA…
## $ flight         <int> 1545, 1714, 114…
## $ tailnum        <chr> "N14228", "N242…
## $ origin         <chr> "EWR", "LGA", "…
## $ dest           <chr> "IAH", "IAH", "…
## $ air_time       <dbl> 227, 227, 160, …
## $ distance       <dbl> 1400, 1416, 108…
## $ hour           <dbl> 5, 5, 5, 5, 6, …
## $ minute         <dbl> 15, 29, 40, 45,…
## $ time_hour      <dttm> 2013-01-01 05:…
```

]

---

### Function `table()`

.fl.w-50.pa2.f6[

Function `table()` builds a _contingency table_ of the counts

For univariate samples, `table()` returns a 1-way contingency table

```r
*table(v_carrier) %>%
  tidy() %>% 
  head() %>% 
  kable(col.names = c("Carrier", 
                      "Frequency"))
```
<svg aria-hidden="true" role="img" viewBox="0 0 576 512" style="height:1em;width:1.12em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M569.517 440.013C587.975 472.007 564.806 512 527.94 512H48.054c-36.937 0-59.999-40.055-41.577-71.987L246.423 23.985c18.467-32.009 64.72-31.951 83.154 0l239.94 416.028zM288 354c-25.405 0-46 20.595-46 46s20.595 46 46 46 46-20.595 46-46-20.595-46-46-46zm-43.673-165.346l7.418 136c.347 6.364 5.609 11.346 11.982 11.346h48.546c6.373 0 11.635-4.982 11.982-11.346l7.418-136c.375-6.874-5.098-12.654-11.982-12.654h-63.383c-6.884 0-12.356 5.78-11.981 12.654z"/></svg> Function `table` puts dust under the carpet: NAs are not reported 
]

.fl.w-50.pa2.f6[

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Carrier </th>
   <th style="text-align:right;"> Frequency </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 9E </td>
   <td style="text-align:right;"> 18460 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> AA </td>
   <td style="text-align:right;"> 32729 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> AS </td>
   <td style="text-align:right;"> 714 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> B6 </td>
   <td style="text-align:right;"> 54635 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> DL </td>
   <td style="text-align:right;"> 48110 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> EV </td>
   <td style="text-align:right;"> 54173 </td>
  </tr>
</tbody>
</table>
]
---

### Function  `summary()`

.fl.w-50.pa2.f6[

```r
summary(v_carrier)

*summary(factor(v_carrier)) %>%
  as_tibble(rownames="Carrier") %>% 
  head() %>% 
  kable()
```

.fl.w-50.pa2.f6[

```
##    Length     Class      Mode 
##    336776 character character
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Carrier </th>
   <th style="text-align:right;"> value </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 9E </td>
   <td style="text-align:right;"> 18460 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> AA </td>
   <td style="text-align:right;"> 32729 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> AS </td>
   <td style="text-align:right;"> 714 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> B6 </td>
   <td style="text-align:right;"> 54635 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> DL </td>
   <td style="text-align:right;"> 48110 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> EV </td>
   <td style="text-align:right;"> 54173 </td>
  </tr>
</tbody>
</table>

]

---
template: inter-slide
name: graphical-summaries

## Graphical summaries for categorical samples

---

### Bar plotting in one step using `geom_bar`

.fl.w-40.pa2[

```r
flights %>%
  ggplot() +
  aes(x=origin) +
* geom_bar(width=.3) +
  ylab("Count") +
  ggtitle("Flights: destinations distribution")
```

]

.fl.w-60.pa2[

]

---

### Barplotting as a two-step process.

1.  Compute _statistics_ from the raw
data: we _count the number of occurrences of each symbol_. This is a simple yet non-trivial task. In SQL, this consists in `group by` and then `count`.
How would you proceed if you had to code this counting step?

2. Plot the counting data: for example, map the modalities/categories on the `X-axis` (possibly choosing an ordering), the counts are mapped on the `Y-axis`, they specify the height of each bar

Most of the flexibility lies in the mapping of modalities on the `X-axis` or doing something different

???

Emphasize

- distinction betweem `geom_`  and `stat_`

---

### Bar plotting in two steps using `group_by()`, `n()` and `geom_col()`

.fl.w-40.pa2.f5[

```r
p <- flights %>%
  group_by(origin) %>%
  summarise(Count=n()) %>%
  ggplot(mapping=aes(x=origin,
                     y=Count)) +
  ggtitle("Flights: origin distribution")

*p + geom_col(width=.3)
```

]

.fl.w-60.pa2[

]

---

### Playing with `stat_` and `geom_`

In the `ggplot` framework, a `geom_` function is paired with a default `stat_` function

Once a data source and an (aesthetic) mapping have been fixed (telling which columns are fed to which axis), the data are (possibly) transformed using the `stat_`: a sample may be transformed into counts in order to build a contingency table and the result is then fed to the `geom_`

---

.fl.w-40.pa2[

We could obtain the same plot using `geom_bar` using the `identity` stat.

```r
p +
* geom_bar(stat="identity",
*          width=.3)
```

]

.fl.w-60.pa2[

]

---

.fl.w-40.pa2.fa6[

Barplots provide us with a great way to compare counts of different categories.

But it is somewhat difficult to figure out the relative frequencies
of the different categories.

One way to facilitate this consists of normalizing
the counts by dividing all counts by sample size.

```r
dt <- flights %>%
    group_by(dest) %>%
    summarise(Count=n()) %>%
    mutate(Prop=Count/sum(Count))

p <- dt %>%
  ggplot() +
  aes(x=dest, y=Prop) +
  ggtitle("Flights: dest distribution")

(p + geom_col(width=.3))
```

]

.fl.w-60.pa2[

]

---

.fl.w-30.pa2[

It may be a good idea to doctor the tick labels on the Y-axis, indicating
percentages rather than numbers between `$0$` and `$1$` (this depends on the readership)

We may focus on flights without repeating the whole process, just
upgrading the data.

Defining a `ggplot` object is very much
like defining a _pipeline_

]

.fl.w-70.pa2[

]

---

### Reordering

.fl.w-40.pa2.f6[

```r
dt %>%
* mutate(dest=fct_reorder(factor(dest),
*                   Prop,
                    .desc = T)) %>% 
  ggplot() +
  aes(x=dest,  y=Prop) +
  ggtitle("Flights: dest distribution") +
  geom_col(width=.3) +
  theme(axis.text.x=element_text(angle=45, 
                                 hjust=1, 
                                 size =4))  +
  xlab("Destination Airport") +
  ylab("Prop")
```

]

.fl.w-60.pa2[

]

---

```r
*dt
```
]
 
.panel2-fct_reorder-auto[

```
## # A tibble: 105 × 3
##    dest  Count      Prop
##    <chr> <int>     <dbl>
##  1 ABQ     254 0.000754 
##  2 ACK     265 0.000787 
##  3 ALB     439 0.00130  
##  4 ANC       8 0.0000238
##  5 ATL   17215 0.0511   
##  6 AUS    2439 0.00724  
##  7 AVL     275 0.000817 
##  8 BDL     443 0.00132  
##  9 BGR     375 0.00111  
## 10 BHM     297 0.000882 
## # … with 95 more rows
```
]

---
count: false

```r
dt %>%
* mutate(dest=fct_reorder(factor(dest),  #<<
*                   Prop,  #<<
*                   .desc = T))
```
]
 
.panel2-fct_reorder-auto[

```
## # A tibble: 105 × 3
##    dest  Count      Prop
##    <fct> <int>     <dbl>
##  1 ABQ     254 0.000754 
##  2 ACK     265 0.000787 
##  3 ALB     439 0.00130  
##  4 ANC       8 0.0000238
##  5 ATL   17215 0.0511   
##  6 AUS    2439 0.00724  
##  7 AVL     275 0.000817 
##  8 BDL     443 0.00132  
##  9 BGR     375 0.00111  
## 10 BHM     297 0.000882 
## # … with 95 more rows
```
]

---
count: false

```r
dt %>%
* mutate(dest=fct_reorder(factor(dest),
*                   Prop,
                    .desc = T)) %>%
* ggplot()
```
]
 
.panel2-fct_reorder-auto[
<img src="cm-2.3-EDA-categorical_files/figure-html/fct_reorder_auto_03_output-1.png" width="504" />
]

---
count: false

```r
dt %>%
* mutate(dest=fct_reorder(factor(dest),
*                   Prop,
                    .desc = T)) %>%
  ggplot() +
* aes(x=dest,  y=Prop)
```
]
 
.panel2-fct_reorder-auto[
<img src="cm-2.3-EDA-categorical_files/figure-html/fct_reorder_auto_04_output-1.png" width="504" />
]

---
count: false

```r
dt %>%
* mutate(dest=fct_reorder(factor(dest),
*                   Prop,
                    .desc = T)) %>%
  ggplot() +
  aes(x=dest,  y=Prop) +
* ggtitle("Flights: dest distribution")
```
]
 
.panel2-fct_reorder-auto[
<img src="cm-2.3-EDA-categorical_files/figure-html/fct_reorder_auto_05_output-1.png" width="504" />
]

---
count: false

```r
dt %>%
* mutate(dest=fct_reorder(factor(dest),
*                   Prop,
                    .desc = T)) %>%
  ggplot() +
  aes(x=dest,  y=Prop) +
  ggtitle("Flights: dest distribution") +
* geom_col(width=.3)
```
]
 
.panel2-fct_reorder-auto[
<img src="cm-2.3-EDA-categorical_files/figure-html/fct_reorder_auto_06_output-1.png" width="504" />
]

---
count: false

```r
dt %>%
* mutate(dest=fct_reorder(factor(dest),
*                   Prop,
                    .desc = T)) %>%
  ggplot() +
  aes(x=dest,  y=Prop) +
  ggtitle("Flights: dest distribution") +
  geom_col(width=.3) +
* theme(axis.text.x=element_text(angle=45,
*                                hjust=1,
*                                size =4))
```
]
 
.panel2-fct_reorder-auto[
<img src="cm-2.3-EDA-categorical_files/figure-html/fct_reorder_auto_07_output-1.png" width="504" />
]

---
count: false

```r
dt %>%
* mutate(dest=fct_reorder(factor(dest),
*                   Prop,
                    .desc = T)) %>%
  ggplot() +
  aes(x=dest,  y=Prop) +
  ggtitle("Flights: dest distribution") +
  geom_col(width=.3) +
  theme(axis.text.x=element_text(angle=45,
                                 hjust=1,
                                 size =4))  +
* xlab("Destination Airport")
```
]
 
.panel2-fct_reorder-auto[
<img src="cm-2.3-EDA-categorical_files/figure-html/fct_reorder_auto_08_output-1.png" width="504" />
]

---
count: false

```r
dt %>%
* mutate(dest=fct_reorder(factor(dest),
*                   Prop,
                    .desc = T)) %>%
  ggplot() +
  aes(x=dest,  y=Prop) +
  ggtitle("Flights: dest distribution") +
  geom_col(width=.3) +
  theme(axis.text.x=element_text(angle=45,
                                 hjust=1,
                                 size =4))  +
  xlab("Destination Airport") +
* ylab("Prop")
```
]
 
.panel2-fct_reorder-auto[
<img src="cm-2.3-EDA-categorical_files/figure-html/fct_reorder_auto_09_output-1.png" width="504" />
]

---
count: false

```r
dt %>%
* mutate(dest=fct_reorder(factor(dest),
*                   Prop,
                    .desc = T)) %>%
  ggplot() +
  aes(x=dest,  y=Prop) +
  ggtitle("Flights: dest distribution") +
  geom_col(width=.3) +
  theme(axis.text.x=element_text(angle=45,
                                 hjust=1,
                                 size =4))  +
  xlab("Destination Airport") +
  ylab("Prop")
```
]
 
.panel2-fct_reorder-auto[
<img src="cm-2.3-EDA-categorical_files/figure-html/fct_reorder_auto_10_output-1.png" width="504" />
]

---
template: inter-slide
name: pieplots

## Pieplots

---

.fl.w-40.pa2.f6[

Another possibility is to use a `Pieplot`

Counts are normalized and mapped
to angular values (fractions of `$2\pi$`).

It is up to the data scientist to decide whether one plot is more
relevant than the other.

```r
flights %>%
  group_by(carrier) %>% 
  summarize(Count=n()) %>% 
  mutate(Prop=Count/sum(Count)) %>% 
  mutate(carrier=fct_reorder(carrier, 
                             Count, 
                             .desc = F)) %>% 
  ggplot() +
* geom_bar(mapping=aes(x=factor(1),
*                      y=Prop,
*                      fill=carrier),
*           stat='identity') +
*  coord_polar(theta='y') +
   xlab("") +
  ylab("")
```

]

.fl.w-60.pa2[

]

---
template: inter-slide

## Barplots in action

---

### Semmelweis inquiry

.fl.w-third.pa2[
.middle[

![](img/355px-Ignaz_Semmelweis_1860.jpg)

[Filip Semmelweis 1818-1865](https://en.wikipedia.org/wiki/Ignaz_Semmelweis)
]
]

.fl.w-two-thirds.pa2[

.bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[

Semmelweis discovered that the incidence of _puerperal_ fever could be cut by the use of hand disinfection in obstetrical clinics.

Puerperal fever was common in mid-19th-century hospitals and often fatal.

Semmelweis proposed the practice of washing hands with chlorinated lime solutions in 1847 while working in Vienna General Hospital's First Obstetrical Clinic, where doctors' wards had three times the mortality of midwives' wards.

]

---
class: middle, center

### Vienna General Hospital

![](img/AAKH-1784.jpg)

---

### Semmelweis inquiry (continued)

.bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[

.small[
Maternity institutions were set up [...] to address problems of infanticide of illegitimate children. They were set up as gratis institutions and offered to care for the infants, which made them attractive to underprivileged women [...]. In return for the free services, the women would be subjects for the training of doctors and midwives.

_Two maternity clinics_ were at the Viennese hospital.

The First Clinic had an average maternal mortality rate of about `$10\%$` due to _puerperal fever_.

The Second Clinic's rate was considerably lower, averaging less than `$4\%$`.

This fact was known outside the hospital. _The two clinics admitted on alternate days_, but women begged to be admitted to the Second Clinic, [...].

Some women even preferred to give birth in the streets, pretending to have given sudden birth en route to the hospital (a practice known as street births), which meant they would still qualify for the child care benefits without having been admitted to the clinic.
]

]
]

---

### Semmelweis inquiry (continued)

.fl.w-two-thirds.pa2[

.bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[

> "To me, it appeared logical that patients who experienced street births would become ill at least as frequently as those who delivered in the clinic. [...] What protected those who delivered outside the clinic from these destructive unknown endemic influences?"

]]]

.fl.w-third.pa2[

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> Year </th>
   <th style="text-align:right;"> RateM </th>
   <th style="text-align:right;"> RateL </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 1841 </td>
   <td style="text-align:right;"> 7.8 </td>
   <td style="text-align:right;"> 3.5 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1842 </td>
   <td style="text-align:right;"> 15.8 </td>
   <td style="text-align:right;"> 7.6 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1843 </td>
   <td style="text-align:right;"> 9.0 </td>
   <td style="text-align:right;"> 6.0 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1844 </td>
   <td style="text-align:right;"> 8.2 </td>
   <td style="text-align:right;"> 2.3 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1845 </td>
   <td style="text-align:right;"> 6.9 </td>
   <td style="text-align:right;"> 2.0 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1846 </td>
   <td style="text-align:right;"> 11.4 </td>
   <td style="text-align:right;"> 2.8 </td>
  </tr>
</tbody>
</table>

]

---

### Semmelweis inquiry (continued)

.bg-light-gray.b--light-gray.ba.bw2.br3.shadow-5.ph4.mt5[

The result was the mortality rate in the First Clinic declined `$90\%$`, and was then comparable to that in the Second Clinic.

The mortality rate in April 1847 was `$18.3\%$`.

After hand washing was instituted in mid-May, the rates in June were `$2.2\%$`, July `$1.2\%$`, August `$1.9\%$` and, for the first time since the introduction of anatomical orientation, the death rate was zero in two months in the year following this discovery.

]

---

- What is the sample here?

- Is there just one sample?

- Does the Semmelweis dataset contain a sample or a statistic (something computed from a sample)?

---

Plot the yearly death rates at both clinics. Choose aesthetics
so as to emphasize the difference in mortality rates in the
clinic supervised by doctors and the clinic supervised by midwifes.

---

### Pivoting

.fl.w-third.pa2[

`semmelweis` dataframe has _wide format_

It is convenient (but not necessary) to `pivot` columns
`RateM` and `RateL` to a pair of columns: `Clinic`  and `Death_rate`

Column `Clinic` is retyped as a `factor`

Column `Year` is also retyped as a `factor` (It would be
more natural to handle "Year" as a `Date` object)

]

.fl.w-two-thirds.pa2[

```r
tmp <- semmelweis %>%
  select(Year, RateM, RateL) %>%
* pivot_longer(cols=c(RateM, RateL),
               names_to="Clinic",
               values_to="Death_rate"
             ) %>%
* mutate(Clinic = as_factor(Clinic),
         Year = as_factor(Year)) %>%
* mutate(Clinic = fct_recode(Clinic,
                             Doctors="RateM",
                             Midwifes="RateL")
         )
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Year </th>
   <th style="text-align:left;"> Clinic </th>
   <th style="text-align:right;"> Death_rate </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 1841 </td>
   <td style="text-align:left;"> Doctors </td>
   <td style="text-align:right;"> 7.8 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 1841 </td>
   <td style="text-align:left;"> Midwifes </td>
   <td style="text-align:right;"> 3.5 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 1842 </td>
   <td style="text-align:left;"> Doctors </td>
   <td style="text-align:right;"> 15.8 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 1842 </td>
   <td style="text-align:left;"> Midwifes </td>
   <td style="text-align:right;"> 7.6 </td>
  </tr>
</tbody>
</table>

]

---

### Keywords and buzzwords

- Randomized clinical trials

- A/B testing

- Case Control studies

- Quasi-experiments

- Observational data

- Experimental data

???

---

### Data wrangling makes ploting easy.

```r
tmp %>%
  ggplot() +
  aes(x=Year,
      y=Death_rate,
      group=Clinic,
      fill=Clinic,
      linetype=Clinic) +
  geom_col(position="dodge",   #<<
           colour="black",
           width=.2) +
  theme(legend.position=c(5.5/6, 13/15)) +
  ggtitle("Semmelweis data") +
  ylab("Death rate (%)")
```

</div>

<div class="panel">
<div class="panel-name">Semmelweis barplot</div>

</div>

]

---
exclude: true

.fl.w-third.pa2[

]

.fl.w-two-thirds.pa2[

![](cm-2.3-EDA-categorical_files/figure-html/semmelweis-label-1.png)

]

---
class: center, middle, inverse

## Quasi-experiments and beyond

---

### Milestones

- Semmelweis circa 1847 Puerperal fever

- Snow circa 1852 Cholera

- Hill _Randomized Clinical Trial Streptomyces versus Collapsotherapia against Tuberculosis_ 1948

- ...

---
exclude: true
class: inverse, center, middle

## Inequalities

---
exclude: true

### Why and what

- Quantifying income inequalities

- Quantifying market share concentration

---
exclude: true

### How

Normalizing income distribution

Non-decreasing rearrangement

---
exclude: true

### Lorenz curve

```r
ineq::Lc
function (x, n = rep(1, length(x)), plot = FALSE)
{
    ina <- !is.na(x)
    n <- n[ina]
    x <- as.numeric(x)[ina]
    k <- length(x)
    o <- order(x)
    x <- x[o]
    n <- n[o]
    x <- n * x
    p <- cumsum(n)/sum(n)
    L <- cumsum(x)/sum(x)
    p <- c(0, p)
    L <- c(0, L)
    L2 <- L * mean(x)/mean(n)
    Lc <- list(p, L, L2)
    names(Lc) <- c("p", "L", "L.general")
    class(Lc) <- "Lc"
    if (plot)
        plot(Lc)
    Lc
}
```

---
exclude: true

### Indices and Schur monotonicity

- Gini

- Atkinson

- Decile ratios

---

background-image: url('./img/pexels-cottonbro-3171837.jpg')
background-size: cover

# The End