name: inter-slide class: left, middle, inverse {{ content }} --- name: layout-general layout: true class: left, middle <style> .remark-slide-number { position: inherit; } .remark-slide-number .progress-bar-container { position: absolute; bottom: 0; height: 4px; display: block; left: 0; right: 0; } .remark-slide-number .progress-bar { height: 100%; background-color: red; } </style>
--- template: inter-slide # Introduction to Probability Theory ### 2021-09-07 #### [Probability Master I MIDS](http://stephane-v-boucheron.fr/courses/probability) #### [Stéphane Boucheron](http://stephane-v-boucheron.fr) --- template: inter-slide name: xxx ##
### [Hashing](#hashing) ### [Probability spaces](#space) ### [Independence](#independence) ### [Convergences](#conv2poisson) ??? In this chapter we survey the basic definitions of Probability Theory starting from a simple modeling problem from computer science. The notions are formally defined in the next chapters. The simple context allows us to carry out computations and to outline the kind of results we will look for during the course: moments, tail bounds, law of large numbers, central limit theorems, and possibly other kinds of weak convergence results. --- template: inter-slide name: hashing ## Hashing --- ### From hashing to random allocations Hashing is a computational technique that is used in almost every area of computing, from databases to compilers through (big) data warehouses. Every book on algorithms contains a discussion of hashing, see for example [`Introduction to Hashing by Jeff Erickson`](http://jeffe.cs.illinois.edu/teaching/algorithms/notes/05-hashing.pdf) --- ### From hashing to random allocations (continued) Under _idealized conditions_, hashing `\(n\)` items to `\(m\)` values consists of applying a function picked _uniformly_ at random among the `\(m^n\)` functions from `\(1, \ldots, n\)` to `\(1, \ldots, m\)` The performance of a hashing method (how many cells have to be probed during a search operation?) depends on the typical properties of such a random function. It is convenient to think of the values in `\(1, \ldots, m\)` as `\(m\)` numbered _bins_ and of the items as `\(n\)` numbered _balls_ Picking a random function amounts to throwing the `\(n\)` balls independently into the `\(m\)` bins The probability that a given ball falls into a given bin is `\(1/m\)` ??? --- ### Questions around the random functions
- How many _empty_ bins on _average_? - _Distribution_ of the number of empty bins? - How many bins with `\(r\)` balls? - What is the maximum number of balls in a single bin? Have a look at the post [http://stephane-v-boucheron.fr/post/2019-09-02-idealizedhashing/](http://stephane-v-boucheron.fr/post/2019-09-02-idealizedhashing/) and download the notebook from there. --- This toy model is an opportunity to recall basic notions of probability theory We call this framework the _random allocations_ experiment An outcome of the random allocation experiment with `\(n= 10\)` and `\(m= 5\)` <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> 1 </th> <th style="text-align:right;"> 2 </th> <th style="text-align:right;"> 3 </th> <th style="text-align:right;"> 4 </th> <th style="text-align:right;"> 5 </th> <th style="text-align:right;"> 6 </th> <th style="text-align:right;"> 7 </th> <th style="text-align:right;"> 8 </th> <th style="text-align:right;"> 9 </th> <th style="text-align:right;"> 10 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> `\(\omega\)` </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 5 </td> </tr> </tbody> </table> Line `\(\omega\)` represents the outcome of the random allocation experiment: `$$\omega_4 = 2 \qquad \omega_5 = 4 \qquad \ldots$$` --- template: inter-slide name: space ## A Probability space --- ###
Universe `\(\Omega\)` The set of outcomes is called the _universe_ In the random allocations setting it is the set of `\(1, \ldots, m\)`-valued _sequences_ of length `\(n\)` A sequence is also a function mapping `\(\{1, \ldots, n\}\)` to `\(\{1, \ldots, m\}\)` We denote a generic _outcome_ by `\(\omega\)` The `\(i^{\text{th}}\)` element of `\(\omega\)` is denoted by `\(\omega_i\)` This universe is denoted by `\(\Omega\)`, here it is finite with cardinality `\(|\Omega|=m^n\)` --- ###
In this setting, the uniform probability distribution on the universe assigns to each subset `\(A\)` of `\(\Omega\)` the probability `$$|A|/|\Omega|$$` When the universe is finite or countable, all subsets of the universe are _events_
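For instance, the uniform probability of an event can be computed by brute-force enumeration. The following R sketch is not part of the course material (parameters and names are illustrative); it computes `\(|A|/|\Omega|\)` for the event "bin 1 stays empty" in a tiny allocation experiment:

```r
# Enumerate the whole universe of a tiny allocation experiment (n = 3 balls, m = 2 bins)
# and compute the uniform probability of the event A = "bin 1 receives no ball"
n <- 3; m <- 2
Omega <- expand.grid(rep(list(1:m), n))   # all m^n = 8 outcomes, one per row

A <- rowSums(Omega == 1) == 0             # outcomes belonging to A
sum(A) / nrow(Omega)                      # |A| / |Omega| = (1 - 1/m)^n = 0.125
```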
When the universe is finite or countable, assigning a probability to every subset of the universe is not an issue --- ###
Definition: probability distribution A _probability distribution_ `\(P\)` maps a collection `\(\mathcal{F}\)` of subsets of the universe `\(\Omega\)` to `\([0,1]\)` `$$P : \mathcal{F} \to [0,1] \qquad \text{with} \quad \mathcal{F} \subseteq 2^\Omega$$` and satisfies: 1. `\(P(\emptyset)=0\)` 1. `\(P(\Omega)=1\)` 1. for any _countable_ subcollection of _pairwise disjoint_ events `\(A_1, A_2, \ldots, A_n, \ldots\)`, `$$P(\cup_{n=1}^\infty A_n) = \sum_{n=1}^\infty P(A_n)$$` ??? See Section \@ref(distribution). --- ### Consequences `$$P(A_1 \cup A_2 \cup \ldots \cup A_k) = \sum_{i=1}^k P(A_i)$$` for all finite collections of pairwise disjoint subsets `\(A_1, \ldots, A_k\)` For the domain of `\(P\)` to be well-defined, the collection of subsets `\(\mathcal{F}\)` has to be closed under - countable unions, - countable intersections, - complementation, and to contain both the empty set `\(\emptyset\)` and the universe `\(\Omega\)`. In words, it has to be a `\(\sigma\)`-_algebra_ ??? Note that other probability distributions make sense on this simple universe. --- ### Definition: `\(\sigma\)`-_algebra_ A collection `\(\mathcal{A}\)` of subsets of `\(\Omega\)` is a `\(\sigma\)`-algebra iff 1. `\(\emptyset \in \mathcal{A}\)` 2. `\(\Omega \in \mathcal{A}\)` 3. If `\(A_1, \ldots, A_n, \ldots \in \mathcal{A}\)`, then `\(\cup_{i=1}^\infty A_i \in \mathcal{A}\)` 4. If `\(A, B \in \mathcal{A}\)` then `\(A \setminus B \in \mathcal{A}\)` --- exclude: true ### Example: Balanced allocations In the balanced allocations scenario, the random functions from `\(1, \ldots, n\)` to `\(1, \ldots, m\)` are constructed sequentially. We first construct `\(\omega_1\)` by picking a number uniformly at random from `\(1, \ldots, m\)`. Now, assume we have constructed `\(\omega_1, \ldots, \omega_i\)` for some `\(i<n\)`. In order to determine `\(\omega_{i+1}\)`, we pick uniformly at random two numbers from `\(1, \ldots, m\)`, say `\(j\)` and `\(k\)`. We compute `$$c_j = \Big|\{ \ell : 1\leq \ell \leq i, \omega_\ell = j\}\Big| \qquad\text{and} \qquad c_k = \Big|\{ \ell : 1\leq \ell \leq i, \omega_\ell = k\}\Big| \, .$$` If `\(c_j < c_k\)`, `\(\omega_{i+1}= j\)`, otherwise `\(\omega_{i+1}= k\)`. This iterative construction defines a (unique) probability distribution over `\(\{1, \ldots, m\}^n\)` that differs from the uniform probability distribution. It is non-trivial to show that it achieves a strong balancing guarantee for the sizes of the preimages induced by `\(\omega\)`. --- template: inter-slide name: randomvariables ## Random variables --- ### Random variables Consider the real-valued functions from `\(\Omega\)` to `\(\mathbb{R}\)` defined by: `$$X_{i, j}(\omega) = \begin{cases} 1 & \text{if } \omega_i = j \\ 0 & \text{otherwise} \, .
\end{cases}$$` This function is a special case of a _random variable_ In the example <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> 1 </th> <th style="text-align:right;"> 2 </th> <th style="text-align:right;"> 3 </th> <th style="text-align:right;"> 4 </th> <th style="text-align:right;"> 5 </th> <th style="text-align:right;"> 6 </th> <th style="text-align:right;"> 7 </th> <th style="text-align:right;"> 8 </th> <th style="text-align:right;"> 9 </th> <th style="text-align:right;"> 10 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> `\(\omega\)` </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 5 </td> </tr> </tbody> </table> we have `\(X_{4,2}(\omega)= 1, X_{5,1}(\omega)= 0, ...\)` --- ###
Don't mess with terminology .fl.w-50.pa2[
The definition of the random variable has nothing to do with the probability distribution we have considered so far.
There is nothing random in a random variable. A random variable is not a variable: it is a function ] .fl.w-50.pa2[ You may question this terminology, but it has been sanctified by tradition ![](./img/pexels-oleg-zaicev-4834891.jpg) ] --- In the probability space `\((\Omega, 2^\Omega, \Pr)\)`, the distribution of the random variable `\(X_{i,j}\)` is a _Bernoulli_ distribution with _success parameter_ `\(1/m\)` `$$\Pr \Big\{ X_{i,j} = 1\Big\} = \frac{1}{m} \qquad \Pr \Big\{ X_{i,j} = 0\Big\} = 1 - \frac{1}{m}$$` This comes from `$$\Pr \Big\{\omega : X_{i,j}(\omega) = 1\Big\} = \frac{\Big|\{ \omega : X_{i,j}(\omega) = 1 \}\Big|}{m^n} = \frac{m^{n-1}}{m^n} = \frac{1}{m}$$`
`\(\Pr \Big\{ X_{i,j} = 1\Big\}\)` is a shorthand for `\(\Pr \Big\{\omega : X_{i,j}(\omega) = 1\Big\}\)` --- Fix some `\(j \in \{1, \ldots, m\}\)` and consider the collection of random variables `\((X_{i, j})_{i \leq n}\)`. For each `\(i\)`, we can define events (subsets of `\(\Omega\)`) from the value of `\(X_{i,j}\)`: `$$\begin{array}{rl} & \Big\{ \omega : X_{i,j}(\omega) = 1\Big\} \\ & \Big\{ \omega : X_{i,j}(\omega) = 0\Big\} \end{array}$$` and together with `\(\Omega, \emptyset\)` they form the collection `\(\sigma(X_{i,j})\)` of events that are definable from `\(X_{i,j}\)`
Those events are said to be `\(\sigma(X_{i,j})\)`-measurable --- template: inter-slide name: independence ## Independence --- Recall the definition of _independent events_ or rather the definition of a _collection of independent events_. ### Definition: Collection of independent events A collection of events `\(E_1, E_2, \ldots, E_k\)` from `\((\Omega, 2^{\Omega})\)` is _independent_ with respect to `\(\Pr\)` if for all `\(I \subseteq \{1, \ldots, k\}\)`, `$$\Pr \Big\{\cap_{i \in I} E_i \Big\} = \prod_{i \in I} \Pr \{ E_i \}$$` --- One can check that for each fixed `\(j \leq m\)`, `\((X_{i, j})_{i \leq n}\)` is a _collection of independent random variables_ under `\(\Pr\)` By this we mean that every collection `\(E_1, E_2, \ldots, E_n\)` of events with `\(E_i \in \sigma(X_{i,j})\)` for each `\(i \in \{1, \ldots, n\}\)` is an independent collection of events under `\(\Pr\)` --- ### Definition: independent collection of random variables A collection of integer-valued random variables `\(X_1, \ldots, X_n\)` over some probability space `\((\Omega, \mathcal{F}, P)\)` is independent (with respect to `\(P\)`) iff for all collections of subsets `\(A_1, \ldots, A_n\)` of `\(\mathbb{N}\)`, the collection of events `$$X_i^{-1}(A_i) = \{ \omega : X_i(\omega) \in A_i\}$$` is an independent collection of events ??? --- ###
The notion of independence is a cornerstone of probability theory <br> Concretely, this means that for any sequence `\((b_1, \ldots, b_n) \in \{0,1\}^n\)`, a possible outcome for the sequence of random variables `\(X_{1,j}, X_{2,j}, \ldots, X_{n,j}\)`, we have `$$\begin{array}{rl} \Pr \Big\{ \bigwedge_{i=1}^n X_{i,j}(\omega) = b_i \Big\} & = \prod_{i=1}^n \Pr \Big\{ X_{i,j}(\omega) = b_i \Big\} \\ & = \prod_{i=1}^n \left(\frac{1}{m}\right)^{b_i} \left(1-\frac{1}{m}\right)^{1-b_i} \\ & = \left(\frac{1}{m}\right)^{\sum_{i=1}^n b_i} \left(1-\frac{1}{m}\right)^{n- \sum_{i=1}^n b_i} \end{array}$$` Observe that the probability that the sequence `\(X_{1,j}, \ldots, X_{n,j}\)` equals `\(b_1, \ldots, b_n\)` only depends on the sum `\(\sum_{i=1}^n b_i\)` This greatly simplifies computations
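---

### Checking the product formula by simulation

The following R sketch is not taken from the course notebook (parameters, seed and names are ours); it estimates `\(\Pr\{X_{1,j} = 1, X_{2,j} = 1\}\)` by simulation and compares it with the product of the marginal probabilities, `\((1/m)^2\)`:

```r
# Estimate the joint law of the two indicators X_{1,1} and X_{2,1} under the
# uniform random allocation and compare with the product of the marginals
set.seed(7)
n <- 10; m <- 5; n_sim <- 1e5

sims <- replicate(n_sim, sample(1:m, n, replace = TRUE))  # one column per outcome omega

X11 <- sims[1, ] == 1      # X_{1,1}(omega) over the simulated outcomes
X21 <- sims[2, ] == 1      # X_{2,1}(omega)

mean(X11 & X21)            # empirical Pr{X_{1,1} = 1, X_{2,1} = 1}
mean(X11) * mean(X21)      # product of marginals; both are close to (1/m)^2 = 0.04
```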
--- We are interested in the number of elements from `\(1, \ldots, n\)` that are mapped (allocated) to `\(j\)` through the random function `\(\omega\)`. Let `$$Y_j(\omega) = \sum_{i=1}^n X_{i, j}(\omega)$$` This is the occupancy score of bin `\(j\)` when we throw `\(n\)` balls --- In the toy example, `\(Y_3(\omega) = 2\)` while `\(Y_5(\omega)= 1\)` and `\(Y_4(\omega)=2\)`: <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <tbody> <tr> <td style="text-align:left;"> `\(j\)` </td> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 5 </td> </tr> <tr> <td style="text-align:left;"> `\(Y_j\)` </td> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> 1 </td> </tr> </tbody> </table> ---
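### Simulating the occupancy scores

Here is a minimal R sketch of the random allocation experiment (it is not the code of the linked notebook; the seed and variable names are ours):

```r
# Throw n balls independently and uniformly into m bins,
# then tabulate the occupancy scores Y_1, ..., Y_m
set.seed(42)
n <- 10; m <- 5

omega <- sample(1:m, size = n, replace = TRUE)   # one outcome of the experiment
omega

Y <- tabulate(omega, nbins = m)                  # Y_j = number of balls in bin j
Y
```

---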
In the probability space `\((\Omega, 2^\Omega, \Pr)\)`, the random variable `\(Y_j\)` is distributed as a sum of independent, identically distributed Bernoulli random variables, that is, according to a _Binomial distribution_ `$$\Pr \Big\{ Y_j = r \Big\} = \binom{n}{r} p^r (1-p)^{n-r} \qquad \text{with} \quad p =\frac{1}{m} \qquad \text{for } r \in 0, \ldots, n$$` `$$\begin{array}{rl} \Pr \Big\{ Y_j = r \Big\} & = \sum_{\omega : Y_j(\omega) = r} \Pr\Big\{\omega\Big\} \\ & = \sum_{\omega : Y_j(\omega) = r} \left(\frac{1}{m}\right)^{r} \left(1-\frac{1}{m}\right)^{n- r} \\ & = \left| \Big\{ \omega : \omega \in \Omega, Y_j(\omega) = r \Big\} \right| \times \left(\frac{1}{m}\right)^{r} \left(1-\frac{1}{m}\right)^{n- r} \\ & = \binom{n}{r} \left(\frac{1}{m}\right)^{r} \left(1-\frac{1}{m}\right)^{n- r} \end{array}$$` ---
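### Checking the Binomial distribution by simulation

As a sanity check, the Binomial formula above can be compared with empirical frequencies obtained by repeating the allocation experiment. This is a sketch under our own choice of parameters, not part of the original material:

```r
# Repeat the allocation experiment and compare the empirical distribution
# of Y_1 with the Binomial(n, 1/m) probability mass function
set.seed(1)
n <- 10; m <- 5; n_sim <- 1e5

Y1 <- replicate(n_sim, sum(sample(1:m, n, replace = TRUE) == 1))  # occupancy of bin 1

empirical   <- as.vector(table(factor(Y1, levels = 0:n))) / n_sim
theoretical <- dbinom(0:n, size = n, prob = 1 / m)

round(cbind(r = 0:n, empirical, theoretical), 4)
mean(Y1)   # close to the expectation n / m = 2
```

---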
For large `\(n, m\)`, the Binomial distribution tends to be concentrated around its _mean value_ or _expectation_ `$$\mathbb{E} Y_j = \sum_{r=0}^n r \times \Pr \Big\{ Y_j = r \Big\} = \frac{n}{m}$$` We will develop a systematic approach to expectation, variance and higher moments, based on integration theory A last-chapter lesson on concentration is dedicated to the development of _tail bounds_ for random variables like `\(Y_j\)` that are _smooth functions of independent random variables_
On a countable probability space, the expectation of a random variable `\(Z\)` can be defined as `$$\mathbb{E} Z = \sum_{\omega \in \Omega} \Pr\{\omega\} \times Z(\omega)$$` provided the series is absolutely convergent --- In principle, a binomial random variable with parameters `\(n=5000\)` and `\(p=.001\)` can take any value between `\(0\)` and `\(5000\)`. However, most (more than `\(95\%\)`) of the probability mass is supported by `\(\{1, \ldots, 10\}\)`. <div class="figure" style="text-align: center"> <img src="cm-1-introduction_files/figure-html/binom-pmf-1.png" alt="Probability mass function of Binomial(5000,0.001)" width="504" /> <p class="caption">Probability mass function of Binomial(5000,0.001)</p> </div> --- template: inter-slide name: conv2poisson ## Convergences --- ### Law of rare events If we let `\(n,m\)` tend to infinity while `\(n/m\)` tends toward `\(c>0\)`, we observe that, for each fixed `\(r\geq 0\)` the sequence `\(\Pr \Big\{ Y_j = r \Big\} = \binom{n}{r} (1/m)^r (1-1/m)^{n-r}\)` tends towards `$$\mathrm{e}^{-c} \frac{c^r}{r !}$$` which is the probability that a Poisson distributed random variable with expectation `\(c\)` equals `\(r\)` This is an instance of the _law of rare events_, a special case of _convergence in distribution_ --- ### Binomial/Poisson approximation illustrated The difference between the probability mass functions of the Binomial distributions with parameters `\(n=250, p=0.02\)` and `\(n=2500, p=0.002\)` and the Poisson distribution with parameter `\(5\)` is small -- If we choose parameters `\(n=2500, p=0.002\)`, the difference between Binomial and Poisson is barely visible <div class="figure" style="text-align: center"> <img src="cm-1-introduction_files/figure-html/binom-poisson-1.png" alt="Probability mass functions of Binomial(250,0.02) (left), Binomial(2500,0.002) (middle) and Poisson(5) (right)" width="504" /> <p class="caption">Probability mass functions of Binomial(250,0.02) (left), Binomial(2500,0.002) (middle) and Poisson(5) (right)</p> </div> --- ### Quantifying closeness between probability distributions The proximity between Binomial`\((n, \lambda/n)\)` and Poisson`\((\lambda)\)` can be quantified in different ways A simple one consists in computing `$$\sum_{x \in \mathbb{N}} \Big| p_{n, \lambda/n}(x) - q_\lambda(x) \Big|$$` where `\(p_{n, \lambda/n}\)` (resp. `\(q_{\lambda}\)`) stands for the Binomial (resp. Poisson) probability mass function This quantity is called the _variation distance_ between the two probability distributions --- ### Variation distance between Binomial and Poisson .fl.w-30.pa2[ The distance between the Binomial distribution with parameters `\(n, 5/n\)` and Poisson(5) is plotted against `\(n\)` (beware logarithmic scales) This plot suggests that the variation distance decays like `\(1/n\)`. ] .fl.w-70.pa2[ <div class="figure" style="text-align: center"> <img src="cm-1-introduction_files/figure-html/lawrareevents-1.png" alt="Law of rare events: distance between Binomial(n, 5/n) and Poisson(5) as a function of n" width="504" /> <p class="caption">Law of rare events: distance between Binomial(n, 5/n) and Poisson(5) as a function of n</p> </div> ] ---
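### Computing the variation distance

The variation distance plotted above can be computed directly from `dbinom()` and `dpois()`. The sketch below is ours (it did not generate the figure) and illustrates the `\(1/n\)` decay:

```r
# Variation distance between Binomial(n, lambda/n) and Poisson(lambda):
# sum |p - q| over 0..n, plus the Poisson mass beyond n (where the Binomial puts no mass)
variation_distance <- function(n, lambda) {
  x <- 0:n
  sum(abs(dbinom(x, size = n, prob = lambda / n) - dpois(x, lambda))) +
    ppois(n, lambda, lower.tail = FALSE)
}

sapply(c(10, 100, 1000, 10000), variation_distance, lambda = 5)
# successive values shrink roughly by a factor 10, consistent with a 1/n decay
```

---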
In the probability space `\((\Omega, 2^\Omega, \Pr)\)`, the random variables `\(Y_j, Y_{j'}, j\neq j'\)` are not independent It suffices to exhibit two events `\(E_j, E_{j'}\)` that are not independent, where membership in `\(E_j\)` is determined by `\(Y_j\)` and membership in `\(E_{j'}\)` is determined by `\(Y_{j'}\)` (later, we will concisely say `\(E_j \in \sigma(Y_j)\)` or that `\(E_j\)` is `\(Y_j\)`-measurable) Choose `\(E_j = \{ \omega : Y_j(\omega) =r\}\)` and `\(E_{j'} = \{ \omega : Y_{j'}(\omega) =r\}\)`. `$$\begin{array}{rl} \Pr(E_j) & = \binom{n}{r} \left(\frac{1}{m}\right)^r \left(1 - \frac{1}{m}\right)^{n-r} \\ \Pr(E_j \cap E_{j'}) & = \binom{n}{r} \times \binom{n-r}{r} \left(\frac{1}{m}\right)^{2r} \left(1 - \frac{2}{m}\right)^{n-2r} \, \end{array}$$` `$$\frac{\Pr(E_j \cap E_{j'}) }{\Pr(E_j) \times \Pr(E_{j'})} = \frac{\left(1 - \frac{2}{m}\right)^{n-2r}}{\left(1 - \frac{1}{m}\right)^{2n-2r}} \frac{((n-r)!)^2}{n!(n-2r)!} \neq 1$$` --- If we define `$$K_{n,r}(\omega) = \sum_{j=1}^m \mathbb{I}_{Y_j(\omega)=r}$$` as the number of elements of `\(1, \ldots, m\)` that occur exactly `\(r\)` times in `\(\omega\)`, the random variable `\(K_{n,r}\)` is not expressed as a sum of independent random variables. Nevertheless, it is possible to gather a lot of information about its moments and distribution. If we let again `\(n,m\)` tend to infinity while `\(n/m\)` tends toward `\(c>0\)`, we observe that the distribution of `\(K_{n,r}/m\)` tends to concentrate around `\(\mathrm{e}^{-c} \frac{c^r}{r !}\)`. This is an example of _convergence in probability_ Now, if we consider the sequence of recentered and rescaled random variables `$$(K_{n,r} - \mathbb{E}K_{n,r})/\sqrt{\operatorname{var}(K_{n,r})}$$` we observe that its _distribution function_ converges pointwise towards the distribution function of the standard Gaussian distribution. --- ### Profile of toy example <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> `\(K_{n,1}\)` </th> <th style="text-align:right;"> `\(K_{n,2}\)` </th> <th style="text-align:right;"> `\(K_{n,4}\)` </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 1 </td> </tr> </tbody> </table> --- template: inter-slide name: summary ## Summary --- In this chapter, we investigated a toy stochastic model: _random allocations_. This toy model was motivated by the analysis of hashing, a widely used technique from computer science. To perform the analysis, we introduced notation and notions from probability theory: - Universe, - Events, - `\(\sigma\)`-algebras, - Probability distributions, - Preimages, - Random variables, - Expectation, - Variance, - Independence of events, - Independence of random variables, - Binomial distribution, - Poisson distribution. --- Through numerical simulations, we got a feeling of several important phenomena: - Law of rare events: Poisson approximation of certain Binomial distributions. - Law of large numbers for normalized sums of identically distributed random variables that are not independent. 
- Central limit theorems for centered and normalized sums of identically distributed random variables that are not independent. At this point, our elementary approach does not provide us with the notions and tools that make a rigorous analysis of these phenomena possible --- class: middle, center, inverse background-image: url('./img/pexels-cottonbro-3171837.jpg') background-size: 112% # The End