name: inter-slide class: left, middle, inverse {{ content }} --- name: layout-general layout: true class: left, middle <style> .remark-slide-number { position: inherit; } .remark-slide-number .progress-bar-container { position: absolute; bottom: 0; height: 4px; display: block; left: 0; right: 0; } .remark-slide-number .progress-bar { height: 100%; background-color: red; } </style>
--- template: inter-slide # Probability V: A modicum of Integration ### 2021-09-08 #### [Probability Master I MIDS](http://stephane-v-boucheron.fr/courses/probability) #### [Stéphane Boucheron](http://stephane-v-boucheron.fr) --- class: inverse, middle, left ##
.fl.w-50[ ### [Simple functions](#simplefunctions) ### [Integration](#integration) ### [Limit theorems](#limittheorems) ### [Expectation](#expectation) ] .fl.w-50[ ### [Jensen's inequality](#jensen) ### [Variance](#variance) ### [Higher moments](#highermoments) ### [Median and interquartile range](#medianiqr) ### [ `\(\mathcal{L}_p\)` and `\(L_p\)` spaces](#lpspaces) ] --- name: roadmapintegration ### Roadmap First, we define _simple functions,_ a subclass of piecewise constant measurable functions. Defining the integral of a simple function with respect to a measure is straightforward. Some more work allows us to derive useful properties: linearity and monotonicity, to name a few. We define the integral of a non-negative measurable function as a supremum of integrals of simple functions. This definition is theoretically sound and it lends itself to computations. We state three convergence theorems, culminating with the _dominated convergence theorem_. We relate the notion of _expectation_ of a random variable to the notion of integral. The _Transfer Theorem_ is a key instrument in the characterization of image distributions. ??? We start by reviewing basic definitions and results from integration theory. We follow the measure-theoretic approach. --- name: simplefunctions template: inter-slide ## Simple functions --- The integral of a `\(\{0,1\}\)`-valued measurable function `\(f\)` with respect to a measure `\(\mu\)` is defined by `$$\int_{\Omega} f \mathrm{d}\mu = \mu\Big(f^{-1}(\{1\})\Big)$$` or, equivalently, `$$\int_{\Omega} \mathbb{I}_A \mathrm{d}\mu = \mu(A) \qquad \text{for any measurable set } A \, .$$` -- The next step consists in defining the integral of finite linear combinations of `\(\{0,1\}\)`-valued measurable functions. --- ### Definition: Simple function Let `\((\Omega, \mathcal{F})\)` be a measurable space. The function `\(f : \Omega \to \mathbb{R}\)` is said to be _simple_ iff - `\(f\)` takes finitely many values: `\(\Big|\big\{ f(x) : x \in \Omega\big\} \Big|<\infty\)` - For each `\(y \in f(\Omega) \subset \mathbb{R}\)`, `\(f^{-1}(\{y\}) \in \mathcal{F}\)` --- A simple function defines a partition of `\(\Omega\)` into finitely many measurable classes. The simple function is constant on each class. -- If `\(f\)` is a simple function, then the `\(\sigma\)`-algebra `$$f^{-1}(\mathcal{B}(\mathbb{R})) = \left\{f^{-1}(B) : B \in \mathcal{B}(\mathbb{R})\right\}$$` is finite --- ### Example Simple functions are finite linear combinations of set characteristic (indicator) functions - For each `\(A \in \mathcal{F}\)`, `\(\mathbb{I}_A\)` is simple - For any finite collection `\(A_1, \ldots, A_n\)` of measurable subsets of `\(\Omega\)` and any sequence `\(c_1, \ldots, c_n\)` of real numbers, `\(\sum_{i \leq n} c_i \mathbb{I}_{A_i}\)` is a simple function - For any measurable function `\(f: \Omega \to \mathbb{R}\)` and any `\(n \in \mathbb{N}\)`, the function `\(g_n\)` defined by `$$g_n(\omega) = n \wedge (-n \vee \lfloor f(\omega) \rfloor)$$` is simple --- The definition of the integral of a simple function with respect to a measure is straightforward: it is a finite sum ### Definition: Integral of a simple function Let `\((\Omega, \mathcal{F}, \mu)\)` be a measured space.
Let `\(f : \Omega \to \mathbb{R}\)` be a non-negative simple function which is defined by a finite partition of `\(\Omega\)` into measurable sets `\(A_1, A_2, \ldots, A_n\)` and numbers `\(f_1, \ldots, f_n\)`: `$$f(\omega) = \sum_{i \leq n} f_i \mathbb{I}_{A_i}(\omega) \,.$$` The integral of `\(f\)` with respect to `\(\mu\)` is defined by `$$\int_\Omega f \mathrm{d}\mu = \sum_{i \leq n} f_i \mu(A_i)$$` ---
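### Example: integrating a simple function numerically

On a finite sample space, the definition above is a plain finite sum. The sketch below (not from the original lecture; `omega`, `mu` and `f` are ad hoc toy objects) groups the points of `\(\Omega\)` into the classes `\(A_i\)` induced by `\(f\)` and computes `\(\sum_i f_i \mu(A_i)\)`.

```python
# A minimal sketch: integrating a simple function with respect to a
# measure on a finite sample space Omega (all objects are illustrative).

omega = ["a", "b", "c", "d"]
mu = {"a": 0.5, "b": 1.0, "c": 2.0, "d": 0.5}  # a (non-probability) measure
f = {"a": 3.0, "b": 3.0, "c": 0.0, "d": 1.0}   # simple: finitely many values

# Group points by value: the classes A_i of the partition induced by f,
# then sum f_i * mu(A_i) over the finitely many values f_i.
values = set(f.values())
integral = sum(v * sum(mu[w] for w in omega if f[w] == v) for v in values)

print(integral)  # 3*(0.5+1.0) + 0*2.0 + 1*0.5 = 5.0
```

---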
If the measure `\(\mu\)` is not finite, the integral of a non-negative simple function may be infinite. If `\(\mu(A_i)=\infty\)` and `\(f_i=0\)`, we agree on `\(f_i \mu(A_i) =0\)`. --- If we turn to signed simple functions, it is enough to notice that > if `\(f\)` is simple, so are `\((f)_+\)` and `\((f)_-\)` and to define `\(\int_\Omega f \mathrm{d}\mu\)` as `$$\int_\Omega (f)_+ \mathrm{d}\mu - \int_\Omega (f)_- \mathrm{d}\mu$$` provided at least one of the two summands is finite --- Although they are simple, simple functions have interesting approximation capabilities: any non-negative measurable function can be approximated from below by non-negative simple functions
--- ### Proposition: Approximation of measurable functions .bg-light-gray.b--dark-gray.ba.br3.shadow-5.ph4.mt5.f6[ Let `\((\Omega, \mathcal{F})\)` be a measurable space. Any non-negative measurable function `\(f: \Omega \to \mathbb{R}\)` is the monotone pointwise limit of simple functions: there exists a sequence of simple functions `\(f_1, \ldots, f_n, \ldots\)` such that for each `\(\omega \in \Omega\)`, the following holds: `$$f_1(\omega) \leq f_2(\omega) \leq \ldots \leq f_n(\omega) \leq \ldots \leq f(\omega)$$` and `$$\lim_n f_n(\omega) = f(\omega)$$` ] --- ### Proof Define `\(f_n\)` as `$$f_n(\omega) = n \wedge \Big(2^{-n} \big\lfloor 2^n f(\omega) \big\rfloor \Big)$$` As `$$\big\lfloor 2^n f(\omega) \big\rfloor \leq 2^n f(\omega)$$` we have `\(f_n(\omega)\leq f(\omega)\)` for all `\(\omega\)`. The range of `\(f_n\)` is `\(\big\{ i \times 2^{-n} : i=0, \ldots, n \times 2^n \big\}\)`. For each `\(i \in \{0, \ldots, n \times 2^n - 1\}\)`, `$$f_n^{-1}\Big(\{i \times 2^{-n}\}\Big) =f^{-1}\Big(\Big[\frac{i}{2^n}, \frac{i+1}{2^n}\Big)\Big)$$` which is in `\(\mathcal{F}\)` because `\(f\)` is measurable and `\(\Big[\frac{i}{2^n}, \frac{i+1}{2^n}\Big) \in \mathcal{B}(\mathbb{R})\)` Likewise `\(f_n^{-1}\Big(\{n\}\Big) =f^{-1}\big(\big[n, \infty\big)\big)\)` belongs to `\(\mathcal{F}\)`. --- ### Proof (continued) To check that `\(f_n \leq f_{n+1}\)`, we consider two cases. 1. `\(f_{n+1}(\omega)\geq n\)`. This entails `\(f(\omega)\geq n\)` and thus `\(f_n(\omega)=n \leq f_{n+1}(\omega)\)` 2. `\(f_{n+1}(\omega) = k + i 2^{-n-1}\)` for `\(k<n\)` and `\(i<2^{n+1}\)`. This entails `\(f_{n}(\omega) = k + \lfloor i/2\rfloor 2^{-n} \leq f_{n+1}(\omega)\)`. Finally, if `\(f(\omega) \leq n\)`, `\(0 \leq f(\omega) - f_n(\omega) \leq 2^{-n}\)`. This implies that `\(\lim_n f_n(\omega)=f(\omega)\)` for all `\(\omega\)`.
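---

### Sketch: the dyadic approximation in code

Before looking at the plot on the next slide, here is a hedged numerical sketch (not part of the original deck) of the construction used in the proof; `dyadic_approx` is an ad hoc helper name.

```python
import math

def dyadic_approx(f, n):
    """Return the simple function f_n(w) = min(n, 2**-n * floor(2**n * f(w))),
    as in the proof above."""
    def fn(w):
        return min(n, math.floor((2 ** n) * f(w)) / (2 ** n))
    return fn

f = math.exp
for n in (2, 3, 4):
    fn = dyadic_approx(f, n)
    # f_n is below f and non-decreasing in n, at every point
    assert fn(0.3) <= dyadic_approx(f, n + 1)(0.3) <= f(0.3)
    print(n, fn(0.3), f(0.3))
```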
--- ### Approximation of the exponential function .fl.w-30[ Consider the sequence of simple functions `$$\omega \mapsto n \wedge \Big(2^{-n} \big\lfloor 2^n \exp(\omega) \big\rfloor \Big)$$` for `\(n=2, 3, 4, ...\)` ] .fl.w-70[ <img src="cm-5-integration-101_files/figure-html/approxexpsimple-1.png" width="504" /> ] --- ### Proposition .bg-light-gray.b--dark-gray.ba.br3.shadow-5.ph4.mt5.f6[ If `\(f,g\)` are two non-negative simple functions on `\((\Omega, \mathcal{F})\)` then for all `\(a, b\in \mathbb{R}_+\)`, - `\(a f + b g\)` and - `\(fg\)` are non-negative simple functions. ] --
Check the proposition. --- ### Proposition (Monotonicity of integration of simple functions) .bg-light-gray.b--dark-gray.ba.br3.shadow-5.ph4.mt5.f6[ If - `\(f,g\)` are two non-negative simple functions and - `\(\mu\)` a non-negative measure on `\((\Omega, \mathcal{F})\)` such that `$$\mu\Big\{ \omega: f(\omega)> g(\omega)\Big\} = 0$$` ( `\(f\)` is less than or equal to `\(g\)` `\(\mu\)`-almost everywhere ), then `$$\int f \, \mathrm{d}\mu \leq \int g \, \mathrm{d}\mu$$` ] --
Check the proposition. --- ### Proposition (Linearity of integration of simple functions) .bg-light-gray.b--dark-gray.ba.br3.shadow-5.ph4.mt5.f6[ If - `\(f,g\)` are two non-negative simple functions and - `\(\mu\)` a non-negative measure on `\((\Omega, \mathcal{F})\)`, then for all `\(a, b\in \mathbb{R}_+\)`, `$$\int a f + b g \, \mathrm{d}\mu = a \int f \, \mathrm{d}\mu + b \int g \, \mathrm{d}\mu$$` ] --
Check the proposition. --- name: integration template: inter-slide ## Integration --- Let `\(\mathcal{S}_+\)` denote the set of non-negative simple functions on `\((\Omega, \mathcal{F})\)` ### Definition (Integration with respect to a measure) Let `\(f\)` be a non-negative measurable function on `\((\Omega, \mathcal{F}, \mu)\)`, then for any `\(A \in \mathcal{F}\)`, the integral of `\(f\)` over `\(A\)` with respect to measure `\(\mu\)` is defined by: `$$\int_A f \, \mathrm{d}\mu = \sup_{s \in \mathcal{S}_+: s \leq f} \int_A s \, \mathrm{d}\mu$$` --
If the supremum is finite, the function is said to be _integrable_ with respect to `\(\mu\)`, or to be `\(\mu\)`-integrable --- ### Proposition (Monotonicity of integration) .bg-light-gray.b--dark-gray.ba.br3.shadow-5.ph4.mt5.f6[ If - `\(f,g\)` are two non-negative measurable functions and - `\(\mu\)` a non-negative measure on `\((\Omega, \mathcal{F})\)` such that `$$\mu\Big\{ \omega: f(\omega)> g(\omega)\Big\} = 0$$` ( `\(f\)` is less than or equal to `\(g\)` `\(\mu\)`-almost everywhere ), then `$$\int f \, \mathrm{d}\mu \leq \int g \, \mathrm{d}\mu$$` ] --
Prove the proposition. --- ### Proposition (Linearity of integration) .bg-light-gray.b--dark-gray.ba.br3.shadow-5.ph4.mt5.f6[ If `\(f,g\)` are two non-negative measurable functions and `\(\mu\)` a non-negative measure on `\((\Omega, \mathcal{F})\)`, then for all `\(a, b\in \mathbb{R}_+\)`, `$$\int a f + b g \, \mathrm{d}\mu = a \int f \, \mathrm{d}\mu + b \int g \, \mathrm{d}\mu$$` ] --
Prove the proposition. --- The integral of a signed measurable function is defined by a decomposition argument. Let `\(f\)` be a measurable function and `\(f= (f)_+ - (f)_-\)`, then `$$\int_{\Omega} f \mathrm{d}\mu = \int_{\Omega} (f)_+ \mathrm{d}\mu - \int_{\Omega} (f)_- \mathrm{d}\mu$$` provided at least one of `\(\int_{\Omega} (f)_+ \mathrm{d}\mu\)` and `\(\int_{\Omega} (f)_- \mathrm{d}\mu\)` is finite. --- name: limittheorems template: inter-slide ## Limit theorems --- ###
- Measurable functions are meant to be real-valued, and - `\(\mathbb{R}\)` is endowed with the Borel `\(\sigma\)`-algebra ( `\(\mathcal{B}(\mathbb{R})\)` ) ###
- The Monotone Convergence Theorem - Fatou's Lemma - The Dominated Convergence Theorem are the three pillars of integral calculus --- ### Theorem (Monotone convergence theorem) .bg-light-gray.b--dark-gray.ba.br3.shadow-5.ph4.mt5.f6[ Let `\((\Omega, \mathcal{F}, \mu)\)` be a measured space. Let `\((f_n)_n\)` be a non-decreasing sequence of non-negative measurable functions converging towards `\(f\)`. Then `$$\int \lim_n \uparrow f_n \, \mathrm{d}\mu = \lim_n \uparrow \int f_n \, \mathrm{d}\mu.$$` ] --- The proof of the monotone convergence theorem boils down to the definition of a positive measure and to the continuity property `\(\mu(\lim_n \uparrow A_n)= \lim_n \uparrow \mu(A_n)\)`. ### Proof Let the function `\(f\)` be defined by `\(f(\omega)=\lim_n \uparrow f_n(\omega)\)` for all `\(\omega \in \Omega\)`. Note that if `\(f(\omega)=0\)`, then `\(f_n(\omega)=0\)` for all `\(n\in \mathbb{N}\)`. The function `\(f\)` is non-negative and measurable. In order to prove the monotone convergence theorem, it is enough to check that for every non-negative simple function `\(g\)` such that `\(g \leq f\)` everywhere, for any `\(a\in [0, 1)\)`, the following holds: `$$a \int g \, \mathrm{d} \mu \leq \lim_n \uparrow \int f_n \, \mathrm{d}\mu \,.$$` For each `\(n \in \mathbb{N}\)`, define `$$E_n = \Big\{ \omega : f_n(\omega) \geq a g(\omega)\Big\}.$$` --- ### Proof (continued) Note that as `\((f_n)_n\)` is non-decreasing, the sequence `\((E_n)_n\)` is non-decreasing. Moreover, if `\(f(\omega)>0\)`, then, as `\(\lim_n \uparrow f_n(\omega)=f(\omega) > a f(\omega) \geq a g(\omega)\)`, we have `\(\omega \in E_n\)` for all sufficiently large `\(n\)` (beware: there is no uniformity guarantee); if `\(f(\omega)=0\)`, then `\(g(\omega)=0\)` and `\(\omega \in E_n\)` for every `\(n\)`. Hence `$$\lim_n \uparrow E_n = \Omega$$` Combining the different remarks, we have `\(\mathbb{I}_{E_n} a g \leq f_n\)` everywhere. Monotonicity of integration entails `$$\int \mathbb{I}_{E_n} a g \,\mathrm{d}\mu \leq \int f_n \,\mathrm{d}\mu \qquad\forall n$$` Now, for each `\(n\)`, `\(\mathbb{I}_{E_n} a g\)` is a non-negative simple function, and the sequence `\((\mathbb{I}_{E_n} a g)_n\)` is a non-decreasing sequence of non-negative simple functions converging towards the simple function `\(ag\)`. --- ### Proof (continued) Let `\(g = \sum_{i \leq k} c_i \mathbb{I}_{A_i}\)` where `\((A_i)_{i\leq k}\)` is a finite partition of `\(\Omega\)` into measurable subsets. As `\(\mathbb{I}_{E_n} g = \sum_{i \leq k} c_i \mathbb{I}_{A_i \cap E_n}\)`, we have `$$\begin{array}{rl} \int \mathbb{I}_{E_n} a g\, \mathrm{d}\mu & = a \sum_{i \leq k} c_i \int \mathbb{I}_{A_i \cap E_n}\, \mathrm{d}\mu \\ & = a \sum_{i \leq k} c_i \mu(A_i \cap E_n) \, . \end{array}$$` For each `\(i \leq k\)`, continuity from below gives `\(\lim_n \uparrow \mu(A_i \cap E_n) = \mu(A_i)\)`, hence `$$\lim_n \uparrow \int \mathbb{I}_{E_n} a g \, \mathrm{d}\mu = a \sum_{i \leq k} c_i \mu(A_i) = a \int g \, \mathrm{d}\mu \, .$$` Thus `$$a \int g \, \mathrm{d}\mu \leq \lim_n \uparrow \int f_n \, \mathrm{d}\mu \qquad \forall a \in [0,1), \ \forall g \in \mathcal{S}_+ \text{ with } g \leq f \, .$$` Letting `\(a \uparrow 1\)` and taking the supremum over `\(g\)` yields `\(\int f \, \mathrm{d}\mu \leq \lim_n \uparrow \int f_n \, \mathrm{d}\mu\)`. The reverse inequality follows from monotonicity of integration, since `\(f_n \leq f\)` for every `\(n\)`.
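---

### Sketch: monotone convergence on the integers

A hedged numerical illustration (not from the deck): with the counting measure on `\(\mathbb{N}\)`, take `\(f_n = f \mathbb{I}_{\{0,\ldots,n-1\}}\)`; integrals are partial sums and increase to the integral of the limit. Names are ad hoc.

```python
# f_n = f * indicator{0..n-1} increases pointwise to f; its integral with
# respect to the counting measure is the partial sum of the series.

f = lambda k: 2.0 ** (-k)   # non-negative and summable: total mass 2
integrals = [sum(f(k) for k in range(n)) for n in range(1, 30)]

assert all(a <= b for a, b in zip(integrals, integrals[1:]))  # non-decreasing
print(integrals[-1])  # close to 2, the integral of the limit function f
```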
---
The non-negativity assumption on `\(f_n\)` is not necessary. It is enough to assume `\(\int f_1 \mathrm{d}\mu > - \infty\)`. Prove this. --
Let `\((f_n)_n\)` be a monotone decreasing sequence of non-negative measurable functions. Let `\(f = \lim_n \downarrow f_n\)` (check the existence of `\(f\)`). Is it true that `\(\int \lim_n \downarrow f_n \mathrm{d}\mu = \lim_n \downarrow \int f_n \mathrm{d}\mu\)`? Answer the same question assuming `\(\int f_1 \mathrm{d}\mu < \infty\)`. Answer the same question if `\(\mu\)` is assumed to be a probability measure. --- ### Theorem (Fatou's Lemma) .bg-light-gray.b--dark-gray.ba.br3.shadow-5.ph4.mt5.f6[ Let `\((\Omega, \mathcal{F}, \mu)\)` be a measured space. Let `\((f_n)_n\)` be a sequence of non-negative measurable functions. Then `$$\int \liminf_n f_n \mathrm{d}\mu \leq \liminf_n \int f_n \mathrm{d}\mu.$$` ] --- ### Proof Define `\(h_n(\omega) = \inf_{m\geq n} f_m(\omega)\)`. Each `\(h_n\)` is also non-negative and measurable. By monotonicity, `$$\int h_n \mathrm{d}\mu \leq \inf_{m\geq n} \int f_m \mathrm{d}\mu \, .$$` The sequence `\((h_n)_n\)` is non-decreasing, and `\(\lim_n \uparrow h_n(\omega) = \liminf_n f_n(\omega)\)` for all `\(\omega \in \Omega\)`. --- ### Proof (continued) By the monotone convergence theorem, `$$\int \lim_n \uparrow h_n \mathrm{d}\mu = \lim_n \uparrow \int h_n \mathrm{d}\mu$$` so that `$$\int \liminf_n f_n \mathrm{d}\mu = \lim_n \uparrow \int h_n \mathrm{d}\mu$$` and `$$\int \liminf_n f_n \mathrm{d}\mu \leq \lim_n \inf_{m\geq n} \int f_m \mathrm{d}\mu = \liminf_{n} \int f_n \mathrm{d}\mu$$`
--- ### Theorem (Dominated convergence theorem) .bg-light-gray.b--dark-gray.ba.br3.shadow-5.ph4.mt5.f6[ Let `\((\Omega, \mathcal{F}, \mu)\)` be a measured space. Let `\((f_n)_n\)` be a sequence of measurable functions that converges pointwise towards a function `\(f\)`. Assume that there exists an integrable function `\(g\)` that dominates `\((f_n)_n\)`: for all `\(n\)`, all `\(\omega \in \Omega\)`, `\(|f_n(\omega)|\leq g(\omega)\)`. Then `\(f\)` is integrable and `$$\int f \mathrm{d}\mu = \int \lim_n f_n \mathrm{d}\mu = \lim_n \int f_n \mathrm{d}\mu$$` ] --- ### Proof Let us first check that `\(f\)` is integrable. Observe that `\(\lim_n |f_n| = |f|\)` and thus `\(\liminf_n |f_n| = |f|\)`. By Fatou's Lemma, `$$\int |f| \mathrm{d}\mu = \int \liminf_n |f_n| \mathrm{d}\mu \leq \liminf_n \int |f_n| \mathrm{d}\mu \leq \int g \, \mathrm{d}\mu < \infty \,.$$` Now define `\(h_n = \inf_{m\geq n} f_m\)` and `\(j_n = \sup_{m \geq n}f_m\)`. We have `\(\lim_n \uparrow h_n = f\)` and `\(\lim_n \downarrow j_n=f.\)` --- ### Proof (continued) Note that `$$\int h_n \mathrm{d}\mu \leq \int f_n \mathrm{d}\mu \leq \int j_n \mathrm{d}\mu \, .$$` By monotone convergence (applied to the non-negative, monotone sequences `\((h_n + g)_n\)` and `\((g - j_n)_n\)`), `$$\int h_n \mathrm{d}\mu \uparrow \int f\mathrm{d}\mu$$` and `$$\int j_n \mathrm{d}\mu \downarrow \int f\mathrm{d}\mu$$` This entails `\(\lim_n \int f_n \mathrm{d}\mu = \int f \mathrm{d}\mu\)`.
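---

### Caution: domination matters

A hedged counter-example sketch (not from the deck): for the counting measure on `\(\mathbb{N}\)` and `\(f_n = \mathbb{I}_{\{n\}}\)`, the pointwise limit is `\(0\)` while every integral equals `\(1\)`; no integrable dominating `\(g\)` exists.

```python
# f_n = indicator of {n}: f_n -> 0 pointwise, yet int f_n d(counting) = 1.

def f(n, k):
    return 1.0 if k == n else 0.0

N = 10_000  # truncation of the integer line, for display only
for n in (1, 10, 100):
    print(n, sum(f(n, k) for k in range(N)))  # always 1.0
# For each fixed k, f(n, k) = 0 as soon as n > k (pointwise convergence).
# A dominating g would need g(k) >= 1 for all k: not integrable here.
```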
--- ### Exercise Let `\(g: \Omega \times \mathbb{R} \to \mathbb{R}\)` be a function of two variables such that for each `\(t \in \mathbb{R}\)`, `\(g(\cdot, t)\)` is measurable. Assume that for each `\(t \in \mathbb{R}\)`, `\(g(\cdot, t)\)` is `\(\mu\)`-integrable and that for each `\(\omega \in \Omega\)`, `\(g(\omega, \cdot)\)` is differentiable. Define `\(G(t)= \int_{\Omega} g(\omega, t) \mathrm{d}\mu(\omega)\)`. Is it always true that `\(G\)` is differentiable at every `\(t\)`? Provide sufficient conditions for `\(G\)` to be differentiable and `$$G'(t) = \int \frac{\partial g}{\partial s}(\omega, s)_{|s=t} \mathrm{d}\mu(\omega) \, .$$` --- name: densities template: inter-slide ## Probability distributions defined by a density --- ### Proposition .bg-light-gray.b--dark-gray.ba.br3.shadow-5.ph4.mt5.f6[ Let `\((\Omega, \mathcal{F})\)` be a measurable space and `\(\mu\)` be a `\(\sigma\)`-finite measure over `\((\Omega, \mathcal{F})\)`. Let `\(f\)` be a non-negative measurable real function over `\((\Omega, \mathcal{F})\)`. Let `\(\nu : \mathcal{F} \to \mathbb{R}_+\)` be defined by `$$\nu(A) = \int \mathbb{I}_A f \, \mathrm{d}\mu = \int_A f\, \mathrm{d}\mu \,.$$` `\(\nu\)` is a measure over `\((\Omega, \mathcal{F})\)`. The function `\(f\)` is said to be a density of `\(\nu\)` with respect to `\(\mu\)`. ] --- ### Proof The fact that `\(\nu(\emptyset)=0\)` is immediate. The fact that `\(\nu\)` is `\(\sigma\)`-additive follows from the monotone convergence theorem. If `\(A_1, \ldots, A_n, \ldots\)` is a collection of pairwise disjoint measurable sets, `$$\begin{array}{rl} \nu(\cup_n A_n) & = \int \mathbb{I}_{\cup_n A_n} f \, \mathrm{d}\mu \\ & = \int \Big(\lim_n \sum_{k\leq n}\mathbb{I}_{A_k}\Big) f \, \mathrm{d}\mu \\ & = \int \Big(\lim_n \sum_{k\leq n}\mathbb{I}_{A_k} f \Big) \, \mathrm{d}\mu \\ & = \lim_n \sum_{k\leq n} \int \mathbb{I}_{A_k} f \, \mathrm{d}\mu \\ & = \lim_n \sum_{k\leq n} \nu(A_k) \\ & = \sum_{k=1}^\infty \nu(A_k) \, . \end{array}$$` The fourth equality is justified by the monotone convergence theorem; the other equalities follow from the fact that we are handling non-negative series.
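---

### Sketch: a density as a reweighting

On a finite space, `\(\nu(A) = \int_A f \, \mathrm{d}\mu\)` just reweights the masses of `\(\mu\)` by `\(f\)`. A hedged toy sketch (names are ad hoc, not from the lecture):

```python
mu = {"a": 1.0, "b": 2.0, "c": 0.5}
f = {"a": 0.0, "b": 1.5, "c": 4.0}   # a non-negative density

def nu(A):
    # nu(A) = sum over A of f times the mass of mu
    return sum(f[w] * mu[w] for w in A)

# additivity on disjoint sets
assert nu({"a", "b"}) + nu({"c"}) == nu({"a", "b", "c"})
print(nu({"a", "b", "c"}))  # 0.0 + 3.0 + 2.0 = 5.0
```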
--- Let `\((A_n)_n\)` be such that `\(A_n \in \mathcal{F}, \mu(A_n)<\infty\)` for each `\(n\)` and `\(\cup_n A_n = \Omega\)`. Let `\(B_m = \{\omega : f(\omega) \leq m\}\)` for `\(m \in \mathbb{N}\)`. As `\(f\)` is real-valued, `\(\cup_m B_m = \Omega\)`. For each pair `\((n, m)\)`, we have `$$\nu(A_n \cap B_m) = \int_{A_n \cap B_m} f \,\mathrm{d}\mu \leq m\, \mu(A_n) < \infty$$` and `\(\cup_{n,m} (A_n \cap B_m) = \Omega\)`. This proves that if `\(\mu\)` is `\(\sigma\)`-finite, so is `\(\nu\)`. --
Check that if `\(A \in \mathcal{F}\)` satisfies `\(\mu(A)=0\)`, then `\(\nu(A)=0\)`. --- name: expectation template: inter-slide ## Expectation --- The expectation of a real random variable is a (Lebesgue) integral with respect to a probability measure. We have to get familiar with probabilistic notation. ### Definition Let `\((\Omega, \mathcal{F}, P)\)` be a probability space. The random variable `\(X\)` defined on `\((\Omega, \mathcal{F})\)` is said to be `\(P\)`-integrable if the measurable function `\(|X|: \omega \mapsto |X(\omega)|\)` is `\(P\)`-integrable. In that case, we agree on: `$$\mathbb{E} X = \mathbb{E}_P X = \int_{\Omega} X(\omega) \mathrm{d}P(\omega) =\int X \mathrm{d}P$$`. --
Check the consistency of this definition with the definition used in the discrete setting. --- The next statement, called the _transfer formula_, can be used to compute the density of an image distribution or to simplify the computation of an expectation. ### Theorem (Transfer formula) .bg-light-gray.b--dark-gray.ba.br3.shadow-5.ph4.mt5.f6[ Let - `\((\mathcal{X}, \mathcal{F}, P)\)` be a probability space, - `\((\mathcal{Y}, \mathcal{G})\)` be a measurable space, - `\(f\)` be a measurable function from `\((\mathcal{X}, \mathcal{F})\)` to `\((\mathcal{Y}, \mathcal{G})\)`. Let `\(Q\)` denote the probability distribution that is the image of `\(P\)` by `\(f\)`: `\(Q = P \circ f^{-1}\)`. Then, for `\(X \sim P\)` and `\(Y = f(X) \sim Q\)`, for all measurable functions `\(h\)` from `\((\mathcal{Y}, \mathcal{G})\)` to `\((\mathbb{R}, \mathcal{B}(\mathbb{R}))\)` `$$\mathbb{E}[h(Y)] = \int_{\mathcal{Y}} h(y) \mathrm{d}Q(y) = \int_{\mathcal{X}} h\circ f(x) \mathrm{d}P(x) = \mathbb{E} h\circ f(X) \,$$` if either integral is defined. ] --- ### Proof Assume first that `\(h= \mathbb{I}_B\)` where `\(B \in \mathcal{G}\)`. Then `$$\begin{array}{rl} \mathbb{E} h(Y) & = \int_{\mathcal{Y}} \mathbb{I}_B(y) \, \mathrm{d}Q(y) \\ & = Q(B) \\ & = P \circ f^{-1}(B) \\ & = P \Big\{ x : f(x) \in B \Big\} \\ & = P \Big\{ x : h \circ f(x) =1 \Big\} \\ & = \int_{\mathcal{X}} h \circ f(x) \mathrm{d}P(x) \\ & = \mathbb{E} h\circ f(X) \, . \end{array}$$` Then, by linearity, the transfer formula holds for all simple functions from `\(\mathcal{Y}\)` to `\(\mathbb{R}\)`. By the definition of the Lebesgue integral, the transfer formula holds for non-negative measurable functions. The usual decomposition argument completes the proof.
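---

### Sketch: the transfer formula on a finite space

A hedged illustration (toy `P`, `f`, `h`, not from the lecture): computing `\(\mathbb{E}[h(f(X))]\)` on the source space with `\(P\)`, or on the image space with `\(Q = P \circ f^{-1}\)`, gives the same number.

```python
P = {1: 0.2, 2: 0.3, 3: 0.5}   # distribution of X on {1, 2, 3}
f = lambda x: x % 2            # measurable map to {0, 1}
h = lambda y: 10.0 * y + 1.0

# image distribution Q = P o f^{-1}
Q = {}
for x, p in P.items():
    Q[f(x)] = Q.get(f(x), 0.0) + p

lhs = sum(h(y) * q for y, q in Q.items())      # integral against Q
rhs = sum(h(f(x)) * p for x, p in P.items())   # integral against P
assert abs(lhs - rhs) < 1e-12
print(lhs, rhs)  # both equal 8.0 here
```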
??? It is clear that the expectation of a random variable only depends on the probability distribution of the random variable. --- name: jensen template: inter-slide ## Jensen's inequality --- The tools from integration theory we have reviewed so far serve to compute or approximate integrals and expectations. The next theorem circumvents computations and allows us to compare expectations. Jensen's inequality is a workhorse of Information Theory, Statistics and large parts of Probability Theory. It embodies the interaction between _convexity_ and _expectation_. We first introduce a modicum of convexity theory and notation. ### Definition (Lower semi-continuity) A function `\(f\)` from some metric space `\(\mathcal{X}\)` to `\(\mathbb{R}\)` is _lower semi-continuous_ at `\(x \in \mathcal{X}\)`, if `$$\liminf_{x_n \to x} f(x_n) \geq f(x) \, .$$` --- A continuous function is lower semi-continuous
The converse is not true. If `\(A \subseteq \mathcal{X}\)` is an open set, then `\(\mathbb{I}_A\)` is lower semi-continuous but, unless it is constant, it is not continuous at the boundary of `\(A\)`. --- ### Definition (Convex subset) Let `\(\mathcal{X}\)` be a vector space. A subset `\(C \subseteq \mathcal{X}\)` is said to be _convex_ if for all `\(x,y \in C\)`, all `\(\lambda \in [0,1]\)`: `$$\lambda x + (1-\lambda) y \in C \, .$$` --
Let `\(C\)` be a convex subset of some (topological real) vector space, let `\(\overline{C}\)` be the closure of `\(C\)`. Prove that `\(\overline{C}\)` is convex. Is `\(\overline{C} \setminus C\)` always convex?
A convex set may be neither closed nor open. Provide examples. --- In the next definition, we consider functions from some vector space to `\(\mathbb{R} \cup \{+\infty\}\)`. ### Definition (Convex functions) Let `\(\mathcal{X}\)` be a (topological) vector space. Let `\(C \subseteq \mathcal{X}\)` be a convex subset. A function `\(f\)` from `\(C\)` to `\(\mathbb{R} \cup \{\infty\}\)` is convex if for all `\(x,y \in C\)`, all `\(\lambda \in [0,1]\)`, `$$f(\lambda x + (1-\lambda) y) \leq \lambda f(x) + (1-\lambda) f(y) \, .$$` The _domain_ `\(\operatorname{Dom}(f)\)` of `\(f\)` is the subset of `\(C\)` where `\(f\)` is finite. --- The function `\(f : x \mapsto \mathbb{I}_{x<0}|x| + \mathbb{I}_{x\geq 0} x^2\)` is convex and continuous. It is differentiable everywhere except at `\(x=0\)`. The dotted lines define affine functions that are below the curve `\(y=f(x)\)`. The dotted lines define supporting hyperplanes for the epigraph of `\(f\)`. <img src="cm-5-integration-101_files/figure-html/convexfunfig-1.png" width="504" /> ---
Check that a convex function `\(f\)` is lower semi-continuous iff the sets `\(\{ x : f(x) \leq t\}\)` are closed intervals for all `\(t \in \mathbb{R}\)`. The next result warrants that any convex lower semi-continuous function has a dual representation. This dual representation is a precious tool when comparing expectations of random variables. --- ### Theorem (Fenchel-Legendre duality) .bg-light-gray.b--dark-gray.ba.br3.shadow-5.ph4.mt5.f6[ Let `\(f\)` be a convex lower-semi-continuous function on `\(\mathbb{R}\)` with a closed domain. The dual function `\(f^*\)` of `\(f\)` is defined over `\(\mathbb{R}\)` by `$$f^*(y) = \sup_{x \in \text{Dom}(f)} xy - f(x) \, .$$` Then - `\(f^*\)` is convex - `\(f^*\)` is lower-semi-continuous - If `\(f^*(y)= xy - f(x)\)` then `\(y\)` is a sub-gradient of `\(f\)` at `\(x\)`. - If `\(y\)` is a sub-gradient of `\(f\)` at `\(x\)`, `\(f^*(y) = xy -f(x)\)`. - `\(f= (f^{*})^*\)`, the dual function of the dual function equals the original function: `\(f(x) = \sup_{y} xy -f^*(y).\)` ] --- ### Example The next dual pairs will be used in several places. - if `\(f(x) = \frac{|x|^p}{p}\)` ( `\(p> 1\)` ), then `\(f^*(y)= \frac{|y|^q}{q}\)` where `\(q=p/(p-1)\)` - if `\(f(x) = |x|\)`, then `\(f^*(y)= 0\)` for `\(y \in [-1,1]\)` and `\(\infty\)` for `\(|y|>1\)` - if `\(f(x) = \exp(x)\)`, then `\(f^*(y) = y \log y - y\)` for `\(y>0\)`, `\(f^*(0)=0\)`, and `\(f^*(y)=\infty\)` for `\(y<0\)` --- ### Proof The fact that `\(f^*\)` is `\(\mathbb{R} \cup \{\infty\}\)`-valued and convex is immediate. To check lower semi-continuity, assume `\(y_n \to y\)`, with `\(y_n \in \operatorname{Dom}(f^*)\)` and `\(f^*(y) > \liminf_n f^*(y_n)\)`. Assume first that `\(y \in \operatorname{Dom}(f^*)\)`. Then for some sufficiently large `\(m\)` and some `\(x \in \operatorname{Dom}(f)\)` `$$f^*(y) \geq xy - f(x) -\frac{1}{m} > \liminf_n f^*(y_n) \geq \liminf_n y_n x -f(x) = yx -f(x)$$` which is contradictory. Assume now that `\(y \not\in \operatorname{Dom}(f^*)\)` and `\(\liminf_n f^*(y_n) < \infty\)`. Extract a subsequence `\((y_{m_n})_n\)` such that `\(\lim_n f^*(y_{m_n}) = \liminf_n f^*(y_n)\)`. There exists `\(x \in \operatorname{Dom}(f)\)` such that `$$f^*(y) > xy -f(x) > \liminf_n f^*(y_n) = \lim_n f^*(y_{m_n}) \geq \lim_n xy_{m_n} -f (x) = xy - f(x)$$` which is again contradictory. --- ### Proof (continued) The fact that `\(y\)` is a sub-gradient of `\(f\)` at `\(x\)` if `\(f^*(y)= xy - f(x)\)` is a rephrasing of the definition of sub-gradients. Note that if `\(x \in \operatorname{Dom}(f)\)` and `\(y\in \operatorname{Dom}(f^*)\)` then `\(f(x)+f^*(y)\geq xy\)`. This observation entails that `\((f^*)^*(x)\leq f(x)\)` for all `\(x \in \operatorname{Dom}(f)\)`. If there existed some `\(x \in \operatorname{Dom}(f)\)` with `\((f^*)^*(x)>f(x)\)`, there would exist some `\(y \in \operatorname{Dom}(f^*)\)` with `\(xy - f^*(y) > f(x)\)`, which is not possible. In order to prove that `\((f^*)^*(x)\geq f(x)\)` for all `\(x \in \operatorname{Dom}(f)\)`, we rely on the convexity and lower semi-continuity of `\(f\)` and `\(f^*\)`, and on the closure of `\(\operatorname{Dom}(f)\)`. Under these conditions, every point `\(x\)` in `\(\operatorname{Dom}(f)\)` has a sub-gradient `\(y\)`, and this entails `\(f(x) + f^*(y)= xy\)`, hence `\((f^*)^*(x) \geq xy - f^*(y) = f(x)\)`.
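---

### Sketch: dual pairs on a grid

A hedged numerical sketch (not from the deck): computing `\(f^*\)` by brute-force maximization on a grid for `\(f(x)=x^2/2\)`, which is its own dual, and checking `\((f^*)^* = f\)` approximately. `legendre` is an ad hoc helper.

```python
import numpy as np

xs = np.linspace(-5, 5, 2001)
f_vals = xs ** 2 / 2

def legendre(values, grid, ys):
    # f*(y) = sup_x (x*y - f(x)), approximated on the grid
    return np.array([np.max(grid * y - values) for y in ys])

ys = np.linspace(-2, 2, 9)
fstar = legendre(f_vals, xs, ys)
print(np.max(np.abs(fstar - ys ** 2 / 2)))   # tiny: f*(y) = y^2/2

# biconjugate at a few points: should recover f
ys_dense = np.linspace(-5, 5, 2001)
fstar_dense = legendre(f_vals, xs, ys_dense)
xs_check = np.linspace(-1, 1, 5)
fss = legendre(fstar_dense, ys_dense, xs_check)
print(np.max(np.abs(fss - xs_check ** 2 / 2)))  # tiny: (f*)* = f
```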
---
Extend the notion of Fenchel-Legendre duality to lower-semi-continuous convex functions over `\(\mathbb{R}^k\)`.
Are all convex functions lower-semi-continuous? Are they measurable?
Are all convex lower-semi-continuous functions measurable? --- ### Remark It is possible to define `\(f^*\)` as `\(f^*(y) =\sup_x xy -f(x)\)` even if `\(f\)` is not convex and lower semi-continuous. The function `\(f^*\)` retains the convexity and lower semi-continuity properties. But `\(f \neq (f^{*})^*\)` in general: we only get `\(f \geq (f^{*})^*\)`. Indeed, `\((f^{*})^*\)` is the largest convex lower semi-continuous minorant of `\(f\)`. --- ### Theorem (Jensen's inequality) .bg-light-gray.b--dark-gray.ba.br3.shadow-5.ph4.mt5.f6[ Let - `\(X\)` be a real-valued random variable and - `\(f: \mathbb{R} \to \mathbb{R}\)` be _convex, lower-semi-continuous_ such that `\(\text{supp}(\mathcal{L}(X)) \subseteq \text{Dom}(f)\)` (a closed set) and `\(\mathbb{E} |f(X)|< \infty\)`, then `$$f(\mathbb{E} X) \leq \mathbb{E} f(X) \, .$$` ] --- ### Remark In view of the definition of convexity and of the fact that taking expectation extends the idea of taking a convex combination, Jensen's inequality is not a surprise. --- ### Proof `$$\begin{array}{rl} \mathbb{E} f(X) & = \mathbb{E} (f^*)^*(X) \\ & = \mathbb{E} \Big[ \sup_y \Big( yX - f^*(y)\Big)\Big] \\ & \geq \sup_y \Big( y \mathbb{E} X - f^*(y)\Big) \\ & = (f^*)^*\Big( \mathbb{E} X \Big) \\ & = f\Big( \mathbb{E} X \Big) \, . \end{array}$$`
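---

### Sketch: a Monte Carlo sanity check of Jensen's inequality

A hedged illustration (not from the deck): for the convex function `\(f(x)=\exp(x)\)` and `\(X \sim \mathcal{N}(0,1)\)`, `\(f(\mathbb{E}X)=1\)` while `\(\mathbb{E}f(X)=e^{1/2}\approx 1.65\)`.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
lhs = np.exp(x.mean())    # f(E X), approximately exp(0) = 1
rhs = np.exp(x).mean()    # E f(X), approximately exp(1/2) ~ 1.6487
print(lhs, rhs, lhs <= rhs)
```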
---
In the argument above, it is not _a priori_ obvious that `\(\sup_y \Big( yX - f^*(y)\Big)\)` is measurable, since the supremum is taken over an uncountable collection. Check that this is not an issue. We will see many applications of Jensen's inequality: - comparison of sampling with replacement with sampling without replacement (comparison of binomial and hypergeometric tails) - Cauchy-Schwarz and Hölder's inequalities - Derivation of maximal inequalities - Non-negativity of relative entropy - Derivation of Efron-Stein-Steele's inequalities - ... --- name: variance template: inter-slide ## Variance --- The variance (when it is defined) is an index of dispersion of the distribution of a random variable. ### Proposition (Characterizations of variance) .bg-light-gray.b--dark-gray.ba.br3.shadow-5.ph4.mt5.f6[ Let `\(X\)` be a random variable over some probability space. The variance of `\(X\)` is finite iff `\(\mathbb{E}X^2 <\infty\)` and it may be defined using the next three equalities: `$$\begin{array}{rl} \operatorname{var}(X) & = \mathbb{E}\left[(X - \mathbb{E}X)^2\right] \\ & = \inf_{a \in \mathbb{R}} \mathbb{E}\left[(X - a)^2\right] \\ & = \mathbb{E}X^2 - (\mathbb{E}X)^2 \,. \end{array}$$` ] --- We need to check that the three right-hand sides are finite as soon as one of them is, and that, when they are finite, they are all equal. ### Proof Assume `\(\mathbb{E}X^2 < \infty\)`. As `\(|X| \leq \frac{X^2}{2} + \frac{1}{2}\)`, this entails `\(\mathbb{E} |X|<\infty\)`. The right-hand side on the third line is finite if `\(\mathbb{E}X^2 < \infty\)`. As `\((x-b)^2 \leq 2 x^2 + 2 b^2\)` for all `\(x,b\)`, the right-hand side on the first line and the infimum on the second line are finite when `\(\mathbb{E} X^2 <\infty.\)` Conversely, as `\(X^2 \leq 2 (X- \mathbb{E}X)^2 + 2 (\mathbb{E}X)^2\)`, `\(\mathbb{E}X^2<\infty\)` if `\(\mathbb{E}\left[(X - \mathbb{E}X)^2\right] <\infty.\)` --- ### Proof (continued) Assume now that `\(\mathbb{E}X^2 < \infty\)`. `$$\begin{array}{rl} \mathbb{E}\left[(X - a)^2\right] & = \mathbb{E}\left[(X - \mathbb{E}X - (a-\mathbb{E}X))^2\right] \\ & = \mathbb{E}\left[(X- \mathbb{E}X)^2\right] - 2 (a-\mathbb{E}X)\, \mathbb{E}\left[X-\mathbb{E}X\right] + (a-\mathbb{E}X)^2 \\ & = \mathbb{E}\left[(X- \mathbb{E}X)^2\right] + (a-\mathbb{E}X)^2 \, . \end{array}$$` As `\((a- \mathbb{E}X)^2\geq 0\)`, we have established that `\(\mathbb{E}\left[(X - \mathbb{E}X)^2\right] = \inf_{a \in \mathbb{R}} \mathbb{E}\left[(X - a)^2\right]\)`. Moreover, the infimum is a minimum: it is achieved at the single point `\(a = \mathbb{E}X\)`.
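---

### Sketch: the three characterizations on a sample

A hedged check on an empirical measure (not from the deck): the three expressions agree, and shifting `\(a\)` away from the mean only increases the quadratic error.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(size=50_000)

m = x.mean()
v1 = ((x - m) ** 2).mean()
v2 = min(((x - a) ** 2).mean() for a in np.linspace(m - 1, m + 1, 201))
v3 = (x ** 2).mean() - m ** 2
print(v1, v2, v3)  # all close to 1, the variance of Exp(1)
```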
---
The first and second characterizations of variance assert that the expectation minimizes the average quadratic error, a fact of great importance in Statistics. --
Check that if `\(P\left\{ X \in [a,b]\right\} =1\)`, then `\(\operatorname{var}(X)\leq \frac{(b-a)^2}{4}\)` --- name: highermoments template: inter-slide ## Higher moments --- In this section, we relate `\(\mathbb{E} |X|^p\)` with `\(\mathbb{E} |X|^q\)` for different values of `\(p, q \in \mathbb{R}_+\)`. Our starting point is a small technical result in real analysis. ### Proposition (Young's inequality) .bg-light-gray.b--dark-gray.ba.br3.shadow-5.ph4.mt5.f6[ Let `\(p, q>1\)` be _conjugate_ ( `\(1/p + 1/q =1\)` ), and `\(x, y>0\)`, then `$$xy \leq \frac{x^p}{p} + \frac{y^q}{q} \,.$$` ] --- ### Proof Note that if `\(p\)` and `\(q\)` are conjugate, then `\(q= p/(p-1)\)` and `\((p-1)(q-1)=1\)`. It suffices to check that for all `\(x,y>0\)`, `$$\frac{x^p}{p} \geq xy - \frac{y^q}{q} \, .$$` Fix `\(x>0\)`, consider the function over `\([0,\infty)\)` defined by `$$z \mapsto xz - \frac{z^q}{q} \,.$$` This function is differentiable with derivative `\(x - z^{q-1} = x - z^{1/(p-1)}\)`. It achieves its maximum at `\(z=x^{p-1}\)` and the maximum is equal to `$$x x^{p-1} - \frac{x^{q(p-1)}}{q} = x^p - \frac{x^p}{q} = \frac{x^p}{p} \, .$$`
--- ### Graphic proof of Young's inequality .fl.w-50.f6[ We choose `\(p=1.5\)` and `\(q= 3\)`, `\(x = 1.5\)` and `\(y= 1\)`. The black point is located at `\((x,y)^T\)`. The product `\(xy\)` equals the area of the rectangle located between the origin and `\((x,y)^T\)` (delimited by the dashed segments). The dotted line represents the function `\(s \mapsto s^{p-1}\)`, and, interchanging the axes, the function `\(t \mapsto t^{q-1} = t^{1/(p-1)}\)`. The area of the light grey surface under the dotted line equals `\(\frac{x^p}{p} = \int_0^x s^{p-1} \mathrm{d}s\)`, while the area of the darker grey surface below line `\(y=1\)` and above the dotted line equals `\(\frac{y^q}{q} = \int_0^y t^{q-1} \mathrm{d}t\)`. The union of the two disjoint surfaces covers the rectangle located between the origin and `\((x,y)^T\)`. Equality occurs when the dotted line passes through `\((x,y)^T\)`, that is when `\(y=x^{p-1}\)`. ] .fl.w-50[ <img src="cm-5-integration-101_files/figure-html/graphyoung-1.png" width="504" /> ] --- A special case of Young's inequality is obtained by taking `\(p=q=2\)`. We are now in a position to prove three fundamental inequalities: Cauchy-Schwarz, Hölder and Minkowski. ### Theorem (Cauchy-Schwarz) .bg-light-gray.b--dark-gray.ba.br3.shadow-5.ph4.mt5.f6[ Let `\(X\)` and `\(Y\)` be two random variables on the same probability space. Assume both `\(\mathbb{E}X^2\)` and `\(\mathbb{E}Y^2\)` are finite. Then `$$\mathbb{E} [XY] \leq \sqrt{\mathbb{E}X^2} \times \sqrt{\mathbb{E}Y^2}$$` ] --- ### Proof If either `\(\sqrt{\mathbb{E}X^2}=0\)` or `\(\sqrt{\mathbb{E}Y^2}=0\)`, the inequality is trivially satisfied. So, without loss of generality, assume `\(\sqrt{\mathbb{E}X^2}>0\)` and `\(\sqrt{\mathbb{E}Y^2}>0\)`. Then, because `\(ab \leq a^2/2 + b^2/2\)` for all real `\(a,b\)`, everywhere, `$$\frac{|XY|}{\sqrt{\mathbb{E}X^2}\sqrt{\mathbb{E}Y^2}} \leq \frac{|X|^2}{2\mathbb{E}X^2} + \frac{|Y|^2}{2\mathbb{E}Y^2} \,.$$` Taking expectation on both sides leads to the desired result.
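---

### Sketch: spot-checking Young's inequality

A hedged numerical spot-check (not from the deck) of `\(xy \leq x^p/p + y^q/q\)` for conjugate exponents, on random positive inputs.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 1.5
q = p / (p - 1)   # conjugate exponent, here q = 3
x = rng.uniform(0.01, 10, 1000)
y = rng.uniform(0.01, 10, 1000)
gap = x ** p / p + y ** q / q - x * y
print(gap.min() >= -1e-12, gap.min())  # non-negative; 0 iff y = x**(p-1)
```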
---
Why is the inequality trivially satisfied if `\(\sqrt{\mathbb{E}X^2}=0\)`?
Check that if `\(X\)` and `\(Y\)` are square-integrable, then `\(XY\)` is integrable. --- Hölder's inequality generalizes Cauchy-Schwarz inequality. Indeed, Cauchy-Schwarz inequality is just Hölder's inequality for `\(p=q=2\)` ( `\(2\)` is its own conjugate ) ### Theorem (Hölder's inequality) .bg-light-gray.b--dark-gray.ba.br3.shadow-5.ph4.mt5.f6[ Let `\(X\)` and `\(Y\)` be two random variables on the same probability space. Let `\(p, q>1\)` be _conjugate_ ( `\(1/p + 1/q =1\)` ), assume both `\(\mathbb{E}|X|^p\)` and `\(\mathbb{E}|Y|^q\)` are finite. Then we have `$$\mathbb{E} [XY] \leq \left(\mathbb{E}|X|^p\right)^{1/p} \times \left(\mathbb{E}|Y|^q\right)^{1/q}$$` ] --- ### Proof If either `\(\mathbb{E}|X|^p=0\)` or `\(\mathbb{E}|Y|^q=0\)`, the inequality is trivially satisfied. Assume that `\(\mathbb{E}|X|^p > 0\)` and `\(\mathbb{E}|Y|^q > 0\)`. Follow the proof of Cauchy-Schwarz inequality, but replace `\(2 ab \leq a^2 +b^2\)` by Young's inequality: `$$ab \leq \frac{|a|^p}{p} + \frac{|b|^q}{q}\qquad \forall a,b \in \mathbb{R}$$` if `\(1/p+ 1/q=1\)`. --- ### Proof (continued) The inequality below is a consequence of Young's inequality and of the monotonicity of expectation: `$$\begin{array}{rl} \frac{\mathbb{E}|XY|}{\mathbb{E}[|X|^p]^{1/p}\mathbb{E}[|Y|^q]^{1/q}} & = \mathbb{E}\Big[\frac{|X|}{\mathbb{E}[|X|^p]^{1/p}} \frac{|Y|}{\mathbb{E}[|Y|^q]^{1/q}} \Big] \\ & \leq \mathbb{E}\Big[\frac{|X|^p}{p \mathbb{E}[|X|^p]} + \frac{|Y|^q}{q \mathbb{E}[|Y|^q]} \Big] \\ & = \frac{1}{p} + \frac{1}{q} \\ & = 1 \, . \end{array}$$`
--- ### Corollary .bg-light-gray.b--dark-gray.ba.br3.shadow-5.ph4.mt5.f6[ For `\(1\leq p < q\)`, `$$\mathbb{E}\Big[|X|^p\Big]^{1/p} \leq \mathbb{E}\Big[|X|^q\Big]^{1/q} \, .$$` ] --- For `\(p \in [1, \infty)\)`, `\(X \mapsto (\mathbb{E}|X|^p)^{1/p}\)` defines a semi-norm on the set of random variables for which `\((\mathbb{E}|X|^p)^{1/p}\)` is finite. Minkowski's inequality asserts that `\(X \mapsto (\mathbb{E}|X|^p)^{1/p}\)` satisfies the triangle inequality. ### Theorem (Minkowski's inequality) .bg-light-gray.b--dark-gray.ba.br3.shadow-5.ph4.mt5.f6[ Let `\(X, Y\)` be two real-valued random variables defined on the same probability space. Let `\(1 \leq p < \infty\)` Assume that `\(\mathbb{E}|X|^p <\infty\)` and `\(\mathbb{E}|Y|^p<\infty\)` Then we have: `$$\left(\mathbb{E} [| X + Y|^p]\right)^{1/p} \leq \left(\mathbb{E} [| X|^p]\right)^{1/p} + \left(\mathbb{E} [|Y|^p]\right)^{1/p}$$` which entails `\(\mathbb{E}|X+Y|^p <\infty.\)` ] --- The proof of Minkowski's inequality follows from Hölder's inequality. ### Proof The inequality below follows from the triangle inequality on `\(\mathbb{R}\)` and from monotonicity; the equality follows from linearity of expectation: `$$\begin{array}{rl} \mathbb{E} \Big[ |X+Y|^p\Big] & \leq \mathbb{E} \Big[ (|X|+|Y|) \times |X+Y|^{p-1}\Big] \\ & = \mathbb{E} \Big[ |X| \times |X+Y|^{p-1}\Big] + \mathbb{E} \Big[ |Y| \times |X+Y|^{p-1}\Big] \, . \end{array}$$` This is enough to handle the case `\(p=1\)`. --- ### Proof (continued) From now on, assume `\(p>1\)`. Hölder's inequality entails the next inequality and a similar upper bound for `\(\mathbb{E} \Big[ |Y| \times |X+Y|^{p-1}\Big]\)`. `$$\begin{array}{rl} \mathbb{E} \Big[ |X| \times |X+Y|^{p-1}\Big] & \leq \mathbb{E} \Big[ |X|^p\Big]^{1/p} \times \mathbb{E} \Big[ |X+Y|^{p}\Big]^{(p-1)/p} \, \end{array}$$` Summing the two upper bounds, we obtain `$$\begin{array}{rl} \mathbb{E} \Big[ |X+Y|^p\Big] & \leq \left(\mathbb{E} \Big[ |X|^p\Big]^{1/p} + \mathbb{E} \Big[ |Y|^p\Big]^{1/p}\right) \times \mathbb{E} \Big[ |X+Y|^{p}\Big]^{(p-1)/p} \, . \end{array}$$` Note that `\(|X+Y|^p \leq 2^{p-1}(|X|^p + |Y|^p)\)` warrants `\(\mathbb{E}|X+Y|^p < \infty\)`, and that the case `\(\mathbb{E}|X+Y|^p = 0\)` is trivial. Dividing both sides by `\(\mathbb{E} \Big[ |X+Y|^{p}\Big]^{(p-1)/p}\)` proves Minkowski's inequality for `\(p>1\)`.
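---

### Sketch: the triangle inequality for empirical `\(p\)`-norms

A hedged check (not from the deck): Minkowski's inequality read as the triangle inequality for `\(\|X\|_p\)` on the empirical measure of a sample.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=10_000)
y = rng.standard_t(df=5, size=10_000)

def pnorm(z, p):
    # empirical p-norm: (mean |z|^p)^(1/p)
    return (np.abs(z) ** p).mean() ** (1 / p)

for p in (1.0, 2.0, 3.5):
    assert pnorm(x + y, p) <= pnorm(x, p) + pnorm(y, p) + 1e-9
    print(p, pnorm(x + y, p), pnorm(x, p) + pnorm(y, p))
```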
--- name: medianiqr template: inter-slide ## Median and interquartile range --- Robust and non-robust indices of location. ### Definition Let `\(X\)` be a real random variable over some probability space. Let `\(F\)` be the cumulative distribution function of `\(X\)`. The median of the distribution of `\(X\)` is `\(F^{\leftarrow}(1/2)\)`. --- The median minimizes the mean absolute deviation. ### Proposition .bg-light-gray.b--dark-gray.ba.br3.shadow-5.ph4.mt5.f6[ If `\(m\)` is such that `\(P\{ X > m\} = P\{ X<m\}\)` then `\(m\)` is a median of the distribution of `\(X\)`, and if `\(X\)` is integrable: `$$\mathbb{E}\Big| X - m \Big| = \min_{a \in \mathbb{R}} \mathbb{E}\Big| X - a \Big|$$` ] --- ### Proof Assume `\(a<m\)`, `$$\begin{array}{rl} \mathbb{E} \left[\Big| X - a \Big| - \Big| X - m \Big| \right] & = - (m-a) P(-\infty, a] + \int_{(a, m)} (2 x - (a+m)) \mathrm{d}P(x) + (m-a)P[m,\infty) \\ & \geq - (m-a) P(-\infty, a] - (m-a) P(a,m) + (m-a)P[m,\infty) \\ & = (m-a) \Big(P[m,\infty) - P(-\infty, m)\Big) \\ & \geq (m-a) \Big(P(m,\infty) - P(-\infty, m)\Big) = 0 \, . \end{array}$$` The same line of reasoning handles the case `\(a>m\)` and allows us to conclude.
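---

### Sketch: the median minimizes mean absolute deviation

A hedged empirical check (not from the deck): on a sample from `\(\text{Exp}(1)\)`, `\(a \mapsto \mathbb{E}|X-a|\)` is minimized near the sample median, itself near `\(\log 2\)`.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.exponential(size=20_001)
med = np.median(x)

grid = np.linspace(med - 1, med + 1, 401)
mad = [np.abs(x - a).mean() for a in grid]
best = grid[int(np.argmin(mad))]
print(med, best)  # best is close to the sample median, near log(2) ~ 0.693
```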
--- Combining three of the inequalities we have just proved allows us to establish an interesting connection between expectation, median and standard deviation. ### Theorem (Lévy's inequality) .bg-light-gray.b--dark-gray.ba.br3.shadow-5.ph4.mt5.f6[ Let `\(m\)` be the median of the distribution of `\(X\)`, a square-integrable random variable over some probability space. Then `$$\Big| m - \mathbb{E} X\Big| \leq \sqrt{\operatorname{var}(X)} \, .$$` ] --- The robust and non-robust indices of location differ by at most the standard deviation, which may be infinite. --- ### Proof By convexity of `\(x \mapsto |x|\)`, we have `$$\begin{array}{rl} \Big| m - \mathbb{E} X\Big| & \leq \mathbb{E} \Big| m - X\Big| \\ & \text{by Jensen's inequality} \\ & \leq \mathbb{E} \Big| \mathbb{E}X - X\Big| \\ & \text{the median minimizes the mean absolute error} \\ & \leq \left(\mathbb{E} \Big| \mathbb{E}X - X\Big|^2\right)^{1/2} \\ & \text{by Cauchy-Schwarz inequality.} \end{array}$$`
--- ### Remark The mean and the median may differ. First, the median is always defined, while the mean may not be. Think for example of the standard Cauchy distribution, which has density `\(\frac{1}{\pi}\frac{1}{1+x^2}\)` over `\(\mathbb{R}\)`. If `\(X\)` is Cauchy distributed, then `\(\mathbb{E}|X|=\infty\)`: the mean is not defined. But as the density is an even function, `\(X\)` is symmetric ( `\(X\)` and `\(-X\)` are distributed the same way), and this implies that the median of (the distribution of) `\(X\)` is `\(0\)`. Consider the exponential distribution with density `\(\exp(-x)\)` over `\([0, \infty)\)`: it has mean `\(1\)`, median `\(\log(2)\)`, and variance `\(1\)`. If we turn to the exponential distribution with density `\(\lambda \exp(-\lambda x)\)`, it has mean `\(1/\lambda\)`, median `\(\log(2)/\lambda\)`, and variance `\(1/\lambda^2\)`. Lévy's inequality does not tell us more than what we can compute with bare hands. Finally, consider the Gamma distribution with shape parameter `\(p\)` and intensity parameter `\(\lambda\)`. It has mean `\(p/\lambda\)` and variance `\(p/\lambda^2\)`. The median is not easily computed, though we can easily check that it is equal to `\(g(p)/\lambda\)` where `\(g(p)\)` is the median of the Gamma distribution with parameters `\(p\)` and `\(1\)`. Lévy's inequality tells us that `\(|g(p) - p|\leq \sqrt{p}\)`. --- template: inter-slide name: lpspaces ## `\(\mathcal{L}_p\)` and `\(L_p\)` spaces --- Let `\(p \in [1, \infty)\)`. Let `\((\Omega, \mathcal{F}, P)\)` be a probability space. Define `\(\mathcal{L}_p(\Omega, \mathcal{F}, P)\)` (often abbreviated to `\(\mathcal{L}_p(P)\)` or even `\(\mathcal{L}_p\)` when there is no ambiguity) as `$$\mathcal{L}_p(\Omega, \mathcal{F}, P) = \Big\{ X : X \text{ is a real random variable over } (\Omega, \mathcal{F}, P), \quad \mathbb{E}|X|^p < \infty \Big\} \, .$$` Let `\(\| X \|_p\)` be defined by `\(\| X\|_p = \Big(\mathbb{E} |X|^p\Big)^{1/p}\)`. Let `\(\mathcal{L}_0(\Omega, \mathcal{F}, P)\)` denote the vector space of random variables over `\((\Omega, \mathcal{F}, P)\)`. We first notice that the sets `\(\mathcal{L}_p(\Omega, \mathcal{F}, P)\)` form a nested sequence. --- ### Proposition .bg-light-gray.b--dark-gray.ba.br3.shadow-5.ph4.mt5.f6[ Let `\((\Omega, \mathcal{F}, P)\)` be a probability space, then for `\(1 \leq p \leq q <\infty\)`: 1. `\(\|X\|_p \leq \| X\|_q\)`. 2. `\(\mathcal{L}_q(\Omega, \mathcal{F}, P) \subseteq \mathcal{L}_p(\Omega, \mathcal{F}, P)\)`. ] --- ### Proof Assume `\(1 \leq p \leq q <\infty\)`. As `\(x \mapsto x^{q/p}\)` is convex on `\([0, \infty)\)`, by Jensen's inequality, we have `$$\begin{array}{rl} \mathbb{E} [|X|^p]^{q/p} & \leq \mathbb{E} [|X|^q] \,. \end{array}$$` This establishes 1. Point 2 is an immediate consequence of 1.
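---

### Sketch: nested `\(\mathcal{L}_p\)` norms on a sample

A hedged check (not from the deck) that `\(p \mapsto \|X\|_p\)` is non-decreasing on a probability space, here the empirical measure of a heavy-ish tailed sample.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.lognormal(size=100_000)
ps = (1, 1.5, 2, 3, 4)
norms = [(np.abs(x) ** p).mean() ** (1 / p) for p in ps]
print(dict(zip(ps, norms)))
assert all(a <= b + 1e-9 for a, b in zip(norms, norms[1:]))
```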
--- The proposition above is about inclusion of sets. The next theorem summarizes several points: the sets `\(\mathcal{L}_p\)` are linear subspaces of `\(\mathcal{L}_0\)`, and they are complete as pseudo-metric (pseudo-normed) spaces. ### Theorem (completeness of `\(\mathcal{L}_p\)`) .bg-light-gray.b--dark-gray.ba.br3.shadow-5.ph4.mt5.f6[ For `\(1 \leq p < \infty\)`, let `\(\mathcal{L}_p(\Omega, \mathcal{F}, P)\)` and `\(\|\cdot\|_p\)` be defined as above. Then, 1. `\(\mathcal{L}_p(\Omega, \mathcal{F}, P)\)` is a linear subspace of the space of real random variables. 1. `\(\| \cdot\|_p\)` is a pseudo-norm on `\(\mathcal{L}_p(\Omega, \mathcal{F}, P)\)`. 1. If `\((X_n)_n\)` is a sequence in `\(\mathcal{L}_p(\Omega, \mathcal{F}, P)\)` that satisfies `$$\lim_n \sup_{m\geq n} \Big\| X_n - X_m \Big\|_p = 0$$` then - There exists `\(X \in \mathcal{L}_p(\Omega, \mathcal{F}, P)\)` such that `\(\lim_n \| X_n - X\|_p=0\)`. - There exists a subsequence `\((X_{m_n})_{n}\)` such that `\(X_{m_n} \to X\)` `\(P\)`-almost surely. ] --- ### Remark In a pseudo-metric space, to prove that a Cauchy sequence converges, it is enough to check convergence of a subsequence. Picking a convenient subsequence, and possibly relabeling elements, we may assume `\(\Big\| X_n - X_m \Big\|_p \leq 2^{- n \wedge m}\)` for all `\(n,m\)`. --- name: borelCant1 ### First Borel-Cantelli Lemma .bg-light-gray.b--dark-gray.ba.br3.shadow-5.ph4.mt5.f6[ Let `\((A_n)_n\)` be a sequence of events from some probability space `\((\Omega, \mathcal{F}, P)\)`. Assume `\(\sum_{n} P(A_n) < \infty\)`, then, with probability `\(1\)`, only finitely many events `\(A_n\)` are realized: `$$P \left\{ \omega : \sum_n \mathbb{I}_{A_n}(\omega) < \infty \right\} = 1 \,.$$` ] --- ### Proof (Borel-Cantelli Lemma) The event `$$\left\{ \omega : \sum_n \mathbb{I}_{A_n}(\omega) = \infty \right\}$$` coincides with `\(\cap_n \cup_{m\geq n} A_m\)`: `$$P \left\{ \sum_n \mathbb{I}_{A_n}(\omega) = \infty\right\} = P(\cap_n \cup_{m\geq n} A_m)$$` --- ### Proof (continued) Now, the sequence `\((\cup_{m\geq n} A_m)_n\)` is monotone decreasing: `$$\lim_n \downarrow \cup_{m\geq n} A_m = \cap_n \cup_{m\geq n} A_m$$` By Fatou's Lemma, `$$\begin{array}{rl} \mathbb{E} \lim_n \mathbb{I}_{\cup_{m\geq n} A_m} & = \mathbb{E} \liminf_n\mathbb{I}_{\cup_{m\geq n} A_m} \\ & \leq \liminf_n \mathbb{E} \mathbb{I}_{\cup_{m\geq n} A_m} \\ & \leq \liminf_n \sum_{m\geq n} P(A_m) \\ & = 0 \, . \end{array}$$` The last equation comes from the fact that the remainders of a convergent series are vanishing.
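---

### Sketch: simulating the first Borel-Cantelli Lemma

A hedged simulation (not from the deck): independent events `\(A_n\)` with `\(P(A_n)=1/n^2\)` (a summable series), so almost surely only finitely many occur; across simulated sample points, occurrence counts stay small.

```python
import numpy as np

rng = np.random.default_rng(6)
n_max, n_samples = 2_000, 1_000
ns = np.arange(1, n_max + 1)
hits = rng.uniform(size=(n_samples, n_max)) < 1.0 / ns ** 2
counts = hits.sum(axis=1)
print(counts.max(), counts.mean())  # mean close to sum 1/n^2 ~ 1.64
```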
--- ### Proof (completeness of `\(\mathcal{L}_p\)`) Points 1) and 2) follow from Minkowski's inequality. This entails that `\(\|\cdot\|_p\)` defines a pseudo-norm on `\(\mathcal{L}_p\)`. If two random variables `\(X,Y\)` from `\(\mathcal{L}_p\)` satisfy `\(\| X- Y\|_p=0\)`, then `\(X=Y\)` `\(P\)`-a.s. To establish 3), we need to check that the sequence converges almost surely, and that an almost sure limit belongs to `\(\mathcal{L}_p\)`. Define event `\(A_n\)` by `$$A_n = \Big\{ \omega : \Big| X_n(\omega) - X_{n+1}(\omega) \Big| > \frac{1}{n^2}\Big\} \, .$$` By Markov's inequality, `$$P(A_n) \leq \mathbb{E}\Big[n^{2p} \Big| X_n - X_{n+1} \Big|^p \Big] \leq n^{2p} 2^{-np} \, .$$` Hence, `\(\sum_{n\geq 1} P(A_n) < \infty.\)` By the first Borel-Cantelli Lemma, on some event `\(E\)` with probability `\(1\)`, only finitely many `\(A_n\)` are realized. --- ### Proof (continued) If `\(\omega \in E\)`, the condition `\(\Big| X_n(\omega) - X_{n+1}(\omega) \Big| > \frac{1}{n^2}\)` is realized for only finitely many indices `\(n\)`. Thus the real-valued sequence `\((X_n(\omega))_n\)` is a Cauchy sequence. It has a limit we denote `\(X(\omega)\)`. If `\(\omega \not\in E\)`, we agree on `\(X(\omega)=0.\)` On `\(\Omega\)`, we have `$$X(\omega) = \lim_n \mathbb{I}_E(\omega) X_n(\omega) \, .$$` A limit of random variables is a random variable. Hence `\(X\)` is a random variable. It remains to check that `\(X \in \mathcal{L}_p\)`. Note first that `$$\Big| \big\| X_m \big\|_p - \big\|X_n \big\|_p \Big| \leq \big\| X_m - X_n \big\|_p \,.$$` Hence `\(\big(\big\|X_n \big\|_p \big)_n\)` is a Cauchy sequence and converges to some finite limit. As `$$|X(\omega)| \leq \liminf_n |X_n(\omega)|$$` by Fatou's Lemma `$$\mathbb{E} |X|^p \leq \liminf_n \mathbb{E} |X_n|^p < \infty\, .$$` --- ### Proof (continued) Hence `\(X \in \mathcal{L}_p\)`. Finally we check that `\(\lim_m \|X_m - X\|_p =0\)`. By Fatou's lemma again, `$$\mathbb{E} \Big| X - X_m \Big|^p \leq \liminf_n \mathbb{E} \Big| X_n - X_m \Big|^p$$` Hence `$$\lim_m \mathbb{E} \Big| X - X_m \Big|^p \leq \lim_m \liminf_n \mathbb{E} \Big| X_n - X_m \Big|^p = 0 \, .$$`
--- ### Remark Can we extend the almost sure convergence to the whole sequence? This is not the case. Consider `\(([0,1], \mathcal{B}([0,1]), P)\)` where `\(P\)` is the uniform distribution. For `\(k= j+ n(n-1)/2\)`, `\(1\leq j\leq n\)`, let `\(X_k = \mathbb{I}_{[(j-1)/n, j/n]}\)`. The sequence `\((X_k)_k\)` converges to `\(0\)` in `\(\mathcal{L}_p\)` for all `\(p \in [1, \infty)\)`. Indeed `\(\|X_k\|_p = n^{-1/p}\)` for `\(k= j+ n(n-1)/2\)`, `\(1\leq j\leq n\)`. For any `\(\omega \in [0,1]\)`, the sequence `\((X_k(\omega))_k\)` oscillates between `\(0\)` and `\(1\)` infinitely many times. --- `\(\mathcal{L}_p\)` spaces provide us with a bridge between probability and analysis. In analysis, the fact that `\(\|\cdot \|_p\)` is just a pseudo-norm leads to considering `\(L_p\)` spaces. `\(L_p\)` spaces are defined from `\(\mathcal{L}_p\)` spaces by taking equivalence classes of random variables. Indeed, define the relation `\(\equiv\)` over `\(\mathcal{L}_p(\Omega, \mathcal{F}, P)\)` by `\(X \equiv X'\)` iff `\(P\{X=X'\}=1\)`. This relation is an equivalence relation (reflexive, symmetric and transitive). If `\(X \equiv X'\)` and `\(Y \equiv Y'\)`, then `\(\|X -Y\|_p = \|X' -Y\|_p = \|X' - Y'\|_p\)`. `\(L_p(\Omega, \mathcal{F}, P)\)` is the quotient space of `\(\mathcal{L}_p\)` by the relation `\(\equiv\)`. --- We have the fundamental result. ### Theorem .bg-light-gray.b--dark-gray.ba.br3.shadow-5.ph4.mt5.f6[ For `\(1 \leq p <\infty\)`, `\(L_p(\Omega, \mathcal{F}, P)\)` equipped with `\(\| \cdot\|_p\)` is a complete normed space (Banach space). ] This eventually allows us to invoke theorems from functional analysis. --- template: inter-slide exclude: true ## Bibliographic remarks {#bibmoments} --- exclude: true @MR1932358 gives a self-contained and thorough treatment of measure and integration theory with probability theory in mind. @MR1261420 is an excellent and accessible reference on convexity. --- class: middle, center, inverse background-image: url('./img/pexels-cottonbro-3171837.jpg') background-size: 112% # The End