In a nutshell:
A function of many independent random variables that does not depend too much on any of them is approximately constant (Talagrand)
The concentration of measure phenomenon describes the deviations of smooth functions (random variables) around their median/mean in some probability spaces
In Gaussian probability spaces, the Poincaré Inequality asserts (see the numerical sketch below):
If X_1, \ldots, X_n \sim_{\text{i.i.d.}} \mathcal{N}(0,1) and f is L-Lipschitz,
\operatorname{Var}(f(X_1, \ldots, X_n )) \leq L^2
Borell-Gross-Cirelson inequalities show that similar bounds hold for exponential moments.
Comparable results hold in product spaces
We need workable definitions of smoothness
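As a quick illustration (an example chosen here, not taken from the slides), the Gaussian Poincaré inequality can be checked by simulation with the 1-Lipschitz function f(x) = \|x\|_2; a minimal sketch in Python:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_mc = 20, 200_000

X = rng.standard_normal(size=(n_mc, n))   # rows are i.i.d. N(0, I_n) vectors
Z = np.linalg.norm(X, axis=1)             # f(x) = ||x||_2 is 1-Lipschitz (L = 1)

print(Z.var())                            # empirically below the Poincare bound L^2 = 1
```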
X_1, \ldots, X_n denote independent random variables on some probability space with values in \mathcal{X}_1, \ldots, \mathcal{X}_n,
f denotes a measurable function from \mathcal{X}_1 \times \ldots \times \mathcal{X}_n to \mathbb{R}.
Z=f(X_1, \ldots, X_n)
Z is a general function of independent random variables
We assume Z is integrable.
If we had Z = \sum_{i=1}^n X_i, we could write
\operatorname{var}(Z) = \sum_{i=1}^n \operatorname{var}(X_i) = \sum_{i=1}^n \mathbb{E}\Big[\operatorname{var}( Z \mid X_1, \ldots, X_{i-1}, X_{i+1}, \ldots X_n)\Big]
even though the last expression looks pedantic
Our aim is to show that even if f is not as simple as the sum of its arguments, the last expression can still serve as an upper bound on the variance
We express Z-\mathbb{E} Z as a sum of differences
Denote by \mathbb{E}_i the conditional expectation operator, conditioned on \left(X_{1},\ldots,X_{i}\right):
\mathbb{E}_i Y = \mathbb{E}\left[ Y \mid \sigma(X_{1},\ldots,X_{i})\right]
Convention: \mathbb{E}_0=\mathbb{E}
For every i=1,\ldots,n:
\Delta_{i}=\mathbb{E}_i Z -\mathbb{E}_{i-1} Z
Z - \mathbb{E}Z = \sum_{i=1}^n \left(\mathbb{E}_i Z - \mathbb{E}_{i-1}Z \right)= \sum_{i=1}^n \Delta_i
Starting from the decomposition
Z-\mathbb{E} Z =\sum_{i=1}^{n}\Delta_{i}
one has
\operatorname{var}\left(Z\right) =\mathbb{E}\left[ \left( \sum_{i=1}^{n}\Delta_{i}\right) ^{2}\right] =\sum_{i=1}^{n}\mathbb{E}\left[ \Delta_{i}^{2}\right] +2\sum_{j>i}\mathbb{E}\left[ \Delta_{i}\Delta _{j}\right]
Now if j>i, \mathbb{E}_i \Delta_{j} =0 implies that
\mathbb{E}_i\left[ \Delta_{j}\Delta_{i}\right] =\Delta_{i}\mathbb{E}_{i} \Delta_{j} =0
and, a fortiori,
\mathbb{E}\left[ \Delta_{j}\Delta_{i}\right] =0
We obtain the following analog of the additivity formula of the variance:
\operatorname{var}\left( Z\right) =\mathbb{E}\left[ \left( \sum_{i=1}^{n}\Delta_{i}\right) ^{2}\right] =\sum_{i=1}^{n}\mathbb{E}\left[ \Delta_{i}^{2}\right]
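The identity \operatorname{var}(Z) = \sum_{i=1}^n \mathbb{E}[\Delta_i^2] can be verified exactly on a tiny discrete example by brute-force enumeration. The sketch below is only an illustration with an arbitrarily chosen f, not part of the argument:

```python
import itertools
import numpy as np

# Three independent Rademacher variables, uniform on {-1, +1}^3.
outcomes = list(itertools.product([-1, 1], repeat=3))
prob = 1.0 / len(outcomes)                     # each outcome has probability 1/8

def f(x):
    return float(max(x[0] + x[1], x[2]))       # an arbitrary non-additive function

def cond_exp(i, x):
    """E_i Z = E[f(X) | X_1 = x_1, ..., X_i = x_i], computed by enumeration."""
    tails = itertools.product([-1, 1], repeat=3 - i)
    return np.mean([f(x[:i] + t) for t in tails])

EZ = sum(prob * f(x) for x in outcomes)
var_Z = sum(prob * (f(x) - EZ) ** 2 for x in outcomes)

# Martingale differences Delta_i = E_i Z - E_{i-1} Z (with E_0 = E).
sum_E_delta_sq = sum(
    prob * (cond_exp(i, x) - cond_exp(i - 1, x)) ** 2
    for i in range(1, 4)
    for x in outcomes
)

print(var_Z, sum_E_delta_sq)                   # the two quantities coincide
```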
Up to now, we have not made any use of the fact that Z is a function of independent variables X_{1},\ldots,X_{n}
Independence may be used as in the following argument:
For any integrable function Z= f\left( X_{1},\ldots,X_{n}\right) one may write, by the Tonelli-Fubini theorem,
\mathbb{E}_i Z =\int _{\mathcal{X}^{n-i}}f\left( X_{1},\ldots,X_{i},x_{i+1},\ldots,x_{n}\right) d\mu_{i+1}\left( x_{i+1}\right) \ldots d\mu_{n}\left( x_{n}\right) \text{,}
where X_j \sim \mu_{j} for j= 1,\ldots,n
Denote by \mathbb{E}^{(i)} the conditional expectation operator conditioned on X^{(i)}=(X_{1},\ldots,X_{i-1},X_{i+1},\ldots,X_{n}),
\mathbb{E}^{(i)} Y = \mathbb{E}\left[ Y \mid \sigma(X_{1},\ldots,X_{i-1},X_{i+1},\ldots,X_{n})\right]
\mathbb{E}^{(i)}Z =\int_{\mathcal{X}} f\left( X_{1},\ldots,X_{i-1},x_{i},X_{i+1},\ldots,X_{n}\right) d\mu_{i}\left(x_{i}\right)
Again by the Tonelli-Fubini theorem:
\mathbb{E}_i\left[ \mathbb{E}^{\left( i\right) } Z \right] =\mathbb{E}_{i-1} Z
Let X_1,\ldots,X_n be independent random variables and let Z=f(X) be a square-integrable function of X=\left( X_{1},\ldots,X_{n}\right).
Then
\operatorname{var}\left( Z\right) \leq \sum_{i=1}^n\mathbb{E}\left[ \left( Z-\mathbb{E}^{(i)} Z \right)^2\right] = v
Let X_1',\ldots,X_n' be independent copies of X_1,\ldots,X_n and
Z_i'= f\left(X_1,\ldots,X_{i-1},X_i',X_{i+1},\ldots,X_n\right)~,
then
v=\frac{1}{2}\sum_{i=1}^n\mathbb{E}\left[ \left( Z-Z_i'\right)^2\right] =\sum_{i=1}^n\mathbb{E}\left[ \left( Z-Z_i'\right)_+^2\right] =\sum_{i=1}^n\mathbb{E}\left[ \left( Z-Z_i'\right)_-^2\right]
where x_+=\max(x,0) and x_-=\max(-x,0) denote the positive and negative parts of a real number x.
v=\inf_{Z_{i}}\sum_{i=1}^{n}\mathbb{E}\left[ \left( Z-Z_{i}\right)^2\right]~,
where the infimum is taken over the class of all X^{(i)}-measurable and square-integrable variables Z_{i}, i=1,\ldots,n.
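The Efron-Stein bound and the symmetrized form of v can be explored by simulation. The sketch below is an illustration under assumptions made here (Z is the maximum of n uniform variables); it estimates \operatorname{var}(Z) and \frac{1}{2}\sum_{i=1}^n\mathbb{E}[(Z-Z_i')^2] by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_mc = 10, 200_000

X = rng.uniform(size=(n_mc, n))              # X_1, ..., X_n
Xp = rng.uniform(size=(n_mc, n))             # independent copies X_1', ..., X_n'
Z = X.max(axis=1)                            # Z = f(X) = max_i X_i

# v in its symmetrized form: (1/2) * sum_i E[(Z - Z_i')^2],
# where Z_i' is recomputed with X_i replaced by X_i'.
v = 0.0
for i in range(n):
    Xi = X.copy()
    Xi[:, i] = Xp[:, i]
    v += 0.5 * np.mean((Z - Xi.max(axis=1)) ** 2)

print(Z.var(), v)                            # empirically, var(Z) <= v
```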
Using
\mathbb{E}_i\left[\mathbb{E}^{\left(i\right)} Z \right] = \mathbb{E}_{i-1} Z
we may write
\Delta_{i}=\mathbb{E}_i\left[ Z-\mathbb{E}^{\left( i\right) } Z \right]
By the conditional Jensen Inequality,
\Delta_{i}^{2}\leq\mathbb{E}_i\left[ \left( Z-\mathbb{E}^{\left(i\right) }Z \right) ^{2}\right]
To prove the identities for v, denote by \operatorname{var}^{\left(i\right) } the conditional variance operator conditioned on X^{\left( i\right) }:
\operatorname{var}^{\left(i\right)}(Y) = \mathbb{E}\left[ \left(Y - \mathbb{E}^{\left(i\right)}Y\right)^2\mid \sigma(X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n)\right]
Then we may write v as
v=\sum_{i=1}^{n}\mathbb{E}\left[ \operatorname{var}^{\left( i\right) }\left(Z\right) \right]
To handle \operatorname{var}^{\left(i\right)}(Z), one may simply use (conditionally) the elementary fact that if X and Y are independent and identically distributed real-valued random variables, then
\operatorname{var}(X)=(1/2) \mathbb{E}[(X-Y)^2]
Conditionally on X^{\left( i\right) }, Z_i' is an independent copy of Z
\operatorname{var}^{\left( i\right) }\left( Z\right) =\frac{1}{2}\mathbb{E} ^{\left( i\right) }\left[ \left( Z-Z_i'\right)^2\right] =\mathbb{E}^{\left( i\right) }\left[ \left( Z-Z_i'\right)_+^2\right] =\mathbb{E}^{\left( i\right) }\left[ \left( Z-Z_i'\right)_-^2\right]
where we used the fact that the conditional distributions of Z and Z_i' are identical
X =\begin{pmatrix}0 & \epsilon_{1,2} & \ldots & \epsilon_{1,n} \\ \epsilon_{1,2} & 0 & \ddots & \vdots \\ \vdots & \ddots & \ddots & \epsilon_{n-1,n}\\ \epsilon_{1,n} & \ldots & \epsilon_{n-1,n} & 0 \end{pmatrix}
where (\epsilon_{i,j})_{i<j} are i.i.d. random symmetric signs
Z = \sup_{\|\lambda\|_2 \leq 1} \lambda^T X \lambda = 2 \sup_{\|\lambda\|_2 \leq 1} \sum_{i< j} \lambda_i \lambda_j \epsilon_{i,j}
\operatorname{var}\left(Z\right) \leq 4
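Since X has zero trace and is never the zero matrix, Z is the largest eigenvalue of X. A quick simulation (an illustration with a matrix size assumed here, not part of the proof) suggests how conservative the bound \operatorname{var}(Z)\leq 4 is:

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_mc = 30, 2_000
top_eig = np.empty(n_mc)

for k in range(n_mc):
    signs = rng.choice([-1.0, 1.0], size=(n, n))
    upper = np.triu(signs, k=1)              # keep only the strictly upper-triangular signs
    X = upper + upper.T                      # symmetric sign matrix with zero diagonal
    top_eig[k] = np.linalg.eigvalsh(X)[-1]   # Z = largest eigenvalue of X

print(top_eig.var())                         # typically far below the Efron-Stein bound 4
```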
Laws of large numbers are asymptotic statements. In applications, for instance in Statistics and in Statistical Learning Theory, it is often desirable to have guarantees for fixed n. Exponential inequalities are refinements of Chebyshev's inequality. Under strong integrability assumptions on the summands, it is possible, and relatively easy, to derive sharp tail bounds for sums of independent random variables.
The upper bound on the variance of Y, a random variable taking its values in [a,b] almost surely, has been established.
Now let P denote the distribution of Y, write \psi_Y(\lambda) = \log \mathbb{E}\, \mathrm{e}^{\lambda (Y - \mathbb{E}Y)} for the logarithm of the moment generating function of Y-\mathbb{E}Y, and let P_{\lambda} be the probability distribution with density
x \rightarrow e^{-\psi_{Y}\left( \lambda\right) }e^{\lambda (x - \mathbb{E}Y)}
with respect to P.
Since P_{\lambda} is concentrated on [a,b] ( P_\lambda([a, b]) = P([a, b]) =1 ), the variance of a random variable Z with distribution P_{\lambda} is bounded by (b-a)^2/4
Note that P_0 = P.
Dominated convergence arguments allow us to compute the derivatives of \psi_Y(\lambda).
Namely
\psi'_Y(\lambda) = \frac{\mathbb{E}\Big[ (Y- \mathbb{E}Y) e^{\lambda (Y- \mathbb{E}Y)} \Big]}{\mathbb{E} e^{\lambda (Y- \mathbb{E}Y)}} = \mathbb{E}_{P_\lambda} Z
and
\psi^{\prime\prime}_Y(\lambda) = \frac{\mathbb{E}\Big[ (Y- \mathbb{E}{Y})^2 e^{\lambda (Y- \mathbb{E}Y)} \Big]}{\mathbb{E} e^{\lambda (Y- \mathbb{E}Y)}} - \Bigg(\frac{\mathbb{E}\Big[ (Y- \mathbb{E}{Y}) e^{\lambda (Y- \mathbb{E}Y)} \Big]}{\mathbb{E} e^{\lambda (Y- \mathbb{E}Y)}}\Bigg)^2 = \operatorname{var}_{P_\lambda}(Z)
Hence, thanks to the variance upper bound: \begin{align*} \psi_Y^{\prime\prime}(\lambda) & \leq \frac{(b-a)^2}{4}~. \end{align*}
Note that \psi_{Y}(0) = \psi_{Y}'(0) =0, and by Taylor's theorem, for some \theta \in [0,\lambda],
\psi_Y(\lambda) = \psi_Y(0) + \lambda\psi_Y'(0) + \frac{\lambda^2}{2}\psi_Y''(\theta) \leq \frac{\lambda^2(b-a)^2}{8}
The upper bound on the variance is sharp in the special case of a Rademacher random variable X whose distribution is defined by
P\{X =-1\} = P\{X =1\} = 1/2
Then one may take a=-1 and b=1, so that \operatorname{var}(X) =1=\left( b-a\right)^2/4.
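Hoeffding's Lemma is easy to check numerically for a fixed distribution. The sketch below (with a Bernoulli(0.3) variable chosen here as an example, so a=0 and b=1) evaluates \psi_Y(\lambda) on a grid and compares it with \lambda^2(b-a)^2/8:

```python
import numpy as np

p, a, b = 0.3, 0.0, 1.0                     # Y ~ Bernoulli(p) takes values in [a, b]
lambdas = np.linspace(-5.0, 5.0, 201)

# psi_Y(lambda) = log E exp(lambda (Y - EY)) for this two-point distribution
EY = p
psi = np.log((1 - p) * np.exp(lambdas * (a - EY)) + p * np.exp(lambdas * (b - EY)))
bound = lambdas ** 2 * (b - a) ** 2 / 8

print(bool(np.all(psi <= bound + 1e-12)))   # True: Hoeffding's Lemma holds on the grid
```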
We can now build on Hoeffding's Lemma to derive very practical tail bounds for sums of bounded independent random variables.
Let X_1,\ldots,X_n be independent random variables such that X_i takes its values in [a_i,b_i] almost surely for all i\leq n.
Let
S=\sum_{i=1}^n\left(X_i- \mathbb{E} X_i \right)
Then
\operatorname{var}(S) \leq v = \sum_{i=1}^n \frac{(b_i-a_i)^2}{4}
\forall \lambda \in \mathbb{R}, \qquad \log \mathbb{E} \mathrm{e}^{\lambda S} \leq \frac{\lambda^2 v}{2}
\forall t>0, \qquad P\left\{ S \geq t \right\} \le \exp\left( -\frac{t^2}{2 v}\right)
The proof is based on the so-called Cramér-Chernoff bounding technique and on Hoeffding's Lemma.
The upper bound on variance follows from \operatorname{var}(S) = \sum_{i=1}^n \operatorname{var}(X_i) and from the first part of Hoeffding's Lemma.
For the upper-bound on \log \mathbb{E} \mathrm{e}^{\lambda S},
\begin{array}{rl}\log \mathbb{E} \mathrm{e}^{\lambda S} & = \log \mathbb{E} \mathrm{e}^{\sum_{i=1}^n \lambda (X_i - \mathbb{E} X_i)} \\ & = \log \mathbb{E} \Big[\prod_{i=1}^n \mathrm{e}^{\lambda (X_i - \mathbb{E} X_i)}\Big] \\ & = \log \Big(\prod_{i=1}^n \mathbb{E} \Big[\mathrm{e}^{\lambda (X_i - \mathbb{E} X_i)}\Big]\Big) \\ & = \sum_{i=1}^n \log \mathbb{E} \Big[\mathrm{e}^{\lambda (X_i - \mathbb{E} X_i)}\Big] \\ & \leq \sum_{i=1}^n \frac{\lambda^2 (b_i-a_i)^2}{8} \\ & = \frac{\lambda^2 v}{2}\end{array}
where the third equality comes from independence of the X_i's and the inequality follows from invoking Hoeffding's Lemma for each summand.
The Cramér-Chernoff technique consists of using Markov's inequality with exponential moments.
\begin{array}{rl}P \big\{ S \geq t \big\} & \leq \inf_{\lambda\geq 0}\frac{\mathbb{E} \mathrm{e}^{\lambda S}}{\mathrm{e}^{\lambda t}} \\ & \leq \exp\Big(- \sup_{\lambda \geq 0} \big( \lambda t - \log \mathbb{E} \mathrm{e}^{\lambda S}\big) \Big)\\ & \leq \exp\Big(- \sup_{\lambda \geq 0}\big( \lambda t - \frac{\lambda^2 v}{2}\big) \Big) \\ & = \mathrm{e}^{- \frac{t^2}{2v} }\end{array}
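For a concrete feel for the tail bound, the sketch below (with parameters assumed here: n centered uniform [0,1] summands) compares the empirical tail P\{S\geq t\} with \exp(-t^2/(2v)):

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_mc = 50, 100_000

X = rng.uniform(0.0, 1.0, size=(n_mc, n))   # X_i uniform on [a_i, b_i] = [0, 1]
S = (X - 0.5).sum(axis=1)                   # S = sum_i (X_i - E X_i)
v = n * (1.0 - 0.0) ** 2 / 4                # Hoeffding variance proxy

for t in [2.0, 4.0, 6.0]:
    print(t, np.mean(S >= t), np.exp(-t ** 2 / (2 * v)))   # empirical tail vs bound
```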
Hoeffding's inequality provides interesting tail bounds for binomial random variables which are sums of independent [0,1]-valued random variables.
However in some cases, the variance upper bound used in Hoeffding's inequality is excessively conservative.
Think for example of a binomial random variable with parameters n and \mu/n: the variance upper bound obtained from the boundedness assumption is n/4, while the true variance is \mu(1-\mu/n)\leq\mu
In this section we combine Hoeffding's inequality and conditioning to establish the so-called bounded differences inequality (also known as McDiarmid's inequality). This inequality is a first example of the concentration of measure phenomenon. This phenomenon is best portrayed by the following saying:
A function of many independent random variables that does not depend too much on any of them is concentrated around its mean or median value.
Let X_1, \ldots, X_n be independent with values in \mathcal{X}_1, \mathcal{X}_2, \ldots, \mathcal{X}_n.
Let f : \mathcal{X}_1 \times \mathcal{X}_2 \times \ldots \times \mathcal{X}_n \to \mathbb{R} be measurable
Assume there exist non-negative constants c_1, \ldots, c_n satisfying
\forall (x_1, \ldots, x_n) \in \prod_{i=1}^n \mathcal{X}_i, \ \forall (y_1, \ldots, y_n) \in \prod_{i=1}^n \mathcal{X}_i,
\Big| f(x_1, \ldots, x_n) - f(y_1, \ldots, y_n)\Big| \leq \sum_{i=1}^n c_i \mathbb{I}_{x_i\neq y_i}
Let Z = f(X_1, \ldots, X_n) and v = \sum_{i=1}^n \frac{c_i^2}{4}
Then \operatorname{var}(Z) \leq v
\log \mathbb{E} \mathrm{e}^{\lambda (Z -\mathbb{E}Z)} \leq \frac{\lambda^2 v}{2}\qquad \text{and} \qquad P \Big\{ Z \geq \mathbb{E}Z + t \Big\} \leq \mathrm{e}^{-\frac{t^2}{2v}}
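As an illustration of the bounded differences condition (an example chosen here, not from the slides), let Z be the number of distinct values among n independent draws from \{1,\ldots,k\}: changing one draw changes Z by at most 1, so c_i=1 and v=n/4. A Monte Carlo sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, n_mc = 100, 50, 20_000

draws = rng.integers(1, k + 1, size=(n_mc, n))
Z = np.array([np.unique(row).size for row in draws])    # number of distinct values

v = n / 4                                               # c_i = 1 for every coordinate
t = 5.0
print(Z.var(), v)                                       # var(Z) <= v = 25

# Z.mean() below is a Monte Carlo stand-in for E[Z].
print(np.mean(Z >= Z.mean() + t), np.exp(-t ** 2 / (2 * v)))   # tail estimate vs bound
```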
The variance bound is an immediate consequence of the Efron-Stein-Steele inequalities.
The tail bound follows from the upper bound on the logarithmic moment generating function by Cramér-Chernoff bounding.
To check the upper-bound on the logarithmic moment generating function, we proceed by induction on the number of arguments n.
If n=1, the upper-bound on the logarithmic moment generating function is just Hoeffding's Lemma.
Assume the upper-bound is valid up to n-1.
\begin{array}{rl} \mathbb{E} \mathrm{e}^{\lambda (Z - \mathbb{E}Z)} & = \mathbb{E}\Big[ \mathbb{E}_{n-1}\mathrm{e}^{\lambda (Z - \mathbb{E}Z)} \Big] \\ & = \mathbb{E}\Big[ \mathbb{E}_{n-1}\big[\mathrm{e}^{\lambda (Z - \mathbb{E}_{n-1}Z)}\big] \times \mathrm{e}^{\lambda (\mathbb{E}_{n-1}Z - \mathbb{E}Z)} \Big]\end{array}
Now,
\mathbb{E}_{n-1}Z = \int_{\mathcal{X}_n} f(X_1,\ldots,X_{n-1}, u) \mathrm{d}P_{X_n}(u) \qquad\text{a.s.}
and
\begin{array}{rl} & \mathbb{E}_{n-1}\big[\mathrm{e}^{\lambda (Z - \mathbb{E}_{n-1}Z)}\big] \\ & = \int_{\mathcal{X}_n} \exp\Big(\lambda \int_{\mathcal{X}_n} f(X_1,\ldots,X_{n-1}, v) -f(X_1,\ldots,X_{n-1}, u) \mathrm{d}P_{X_n}(u) \Big) \mathrm{d}P_{X_n}(v)\end{array}
For every x_1, \ldots, x_{n-1} \in \mathcal{X_1} \times \ldots \times \mathcal{X}_{n-1}, for every v, v' \in \mathcal{X}_n,
\begin{array}{rl} & \Big| \int_{\mathcal{X}_n} f(x_1,\ldots,x_{n-1}, v) -f(x_1,\ldots,x_{n-1}, u) \mathrm{d}P_{X_n}(u) \\ & - \int_{\mathcal{X}_n} f(x_1,\ldots,x_{n-1}, v') -f(x_1,\ldots,x_{n-1}, u) \mathrm{d}P_{X_n}(u)\Big| \leq c_n \end{array}
By Hoeffding's Lemma
\mathbb{E}_{n-1}\big[\mathrm{e}^{\lambda (Z - \mathbb{E}_{n-1}Z)}\big] \leq \mathrm{e}^{\frac{\lambda^2 c_n^2}{8}}
\begin{array}{rl} \mathbb{E} \mathrm{e}^{\lambda (Z - \mathbb{E}Z)} & \leq \mathbb{E}\Big[ \mathrm{e}^{\lambda (\mathbb{E}_{n-1}Z - \mathbb{E}Z)} \Big] \times \mathrm{e}^{\frac{\lambda^2 c_n^2}{8}} \, . \end{array}
But, on the event \{X_1=x_1, \ldots, X_{n-1}=x_{n-1}\},
\mathbb{E}_{n-1}Z - \mathbb{E}Z = \int_{\mathcal{X}_n} f(x_1,\ldots,x_{n-1}, v) \mathrm{d}P_{X_n}(v) - \mathbb{E}Z \,,
so \mathbb{E}_{n-1}Z - \mathbb{E}Z is a function of the n-1 independent random variables X_1, \ldots, X_{n-1} that satisfies the bounded differences condition with constants c_1, \ldots, c_{n-1}.
By the induction hypothesis:
\mathbb{E}\Big[ \mathrm{e}^{\lambda (\mathbb{E}_{n-1}Z - \mathbb{E}Z)} \Big] \leq \mathrm{e}^{\frac{\lambda^2}{2} \sum_{i=1}^{n-1} \frac{c_i^2}{4}}
Combining the last two displays yields \mathbb{E}\, \mathrm{e}^{\lambda (Z - \mathbb{E}Z)} \leq \mathrm{e}^{\frac{\lambda^2}{2} \sum_{i=1}^{n} \frac{c_i^2}{4}} = \mathrm{e}^{\lambda^2 v/2}, which completes the induction.
The main idea behind maximal inequalities is perhaps most transparent if we consider sub-Gaussian random variables.
Let Z_1,\ldots,Z_N be real-valued random variables and let v>0 be such that, for every i=1,\ldots,N, the logarithm of the moment generating function of Z_i satisfies \psi_{Z_i}(\lambda) \leq \lambda^2v/2 for all \lambda >0.
Then by Jensen's inequality,
\begin{array}{rcl} \exp \left(\lambda\,\mathbb{E} \max_{i=1,\ldots,N} Z_i \right) & \leq & \mathbb{E} \exp\left(\lambda \max_{i=1,\ldots,N} Z_i \right) \\ & = & \mathbb{E} \max_{i=1,\ldots,N} e^{\lambda Z_i} \\ & \leq & \sum_{i=1}^N \mathbb{E} e^{\lambda Z_i} \\ & \leq & N e^{\lambda^2v/2} \end{array}
Taking logarithms on both sides and dividing by \lambda>0, we have
\mathbb{E} \max_{i=1,\ldots,N} Z_i \le \frac{\log N}{\lambda} + \frac{\lambda v}{2}
The upper bound is minimized for \lambda = \sqrt{2\log N/v} which yields
\mathbb{E} \max_{i=1,\ldots,N}Z_i\le \sqrt {2v\log N}
This simple bound is (asymptotically) sharp if the Z_i are i.i.d. \mathcal{N}(0,v) random variables.
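The sketch below (a simulation with values of v and N assumed here) compares a Monte Carlo estimate of \mathbb{E}\max_{i} Z_i for i.i.d. \mathcal{N}(0,v) variables with \sqrt{2v\log N}:

```python
import numpy as np

rng = np.random.default_rng(5)
v, N, n_mc = 1.0, 1_000, 20_000

Z = rng.normal(scale=np.sqrt(v), size=(n_mc, N))   # each row: Z_1, ..., Z_N i.i.d. N(0, v)
emp = Z.max(axis=1).mean()                         # Monte Carlo estimate of E max_i Z_i
bound = np.sqrt(2 * v * np.log(N))

print(emp, bound)                                  # the estimate sits below, but close to, the bound
```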
Let \psi be a convex and continuously differentiable function defined on \left[ 0,b\right) where 0<b\leq\infty.
Assume that \psi\left( 0\right) =\psi'\left( 0\right) =0 and set, for every t\geq0,
\psi^*(t) =\sup_{\lambda\in (0,b)} \left( \lambda t-\psi(\lambda)\right)
Then \psi^* is a nonnegative convex and nondecreasing function on [0,\infty).
For every y\geq 0, \left\{ t \ge 0: \psi^*(t) >y\right\}\neq \emptyset and the generalized inverse of \psi^*, defined by
\psi^{*\leftarrow}(y) =\inf\left\{ t\ge 0:\psi^*(t) >y \right\}
can also be written as
\psi^{*\leftarrow}(y) =\inf_{\lambda\in (0,b) } \left[ \frac{y +\psi(\lambda)}{\lambda}\right]
By definition, \psi^* is the supremum of convex and nondecreasing functions on [0,\infty) and \psi^*(0) =0,
therefore
\psi^* is a nonnegative, convex, and nondecreasing function on [0,\infty).
Given \lambda\in (0,b), since \psi^*(t) \geq\lambda t-\psi(\lambda) for every t\ge 0, \psi^* is unbounded, which shows that
\forall y\geq 0, \qquad \left\{ t\geq 0:\psi^*(t) >y\right\} \neq \emptyset
Define
u=\inf_{\lambda\in (0,b)} \left[ \frac{y+\psi(\lambda) }{\lambda}\right]
Then, for every t \ge 0, we have u\geq t if and only if
\forall \lambda \in (0,b), \qquad \frac{y+\psi(\lambda) }{\lambda}\geq t
As this is equivalent to y\ge \psi^*(t), we have \left\{ t\ge 0:\psi^*(t)> y\right\} = (u,\infty)
This proves that u=\psi^{*\leftarrow}(y) by definition of \psi^{*\leftarrow}.
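To see the two expressions for \psi^{*\leftarrow} agree numerically, the sketch below (an illustration in the sub-Gaussian case \psi(\lambda)=v\lambda^2/2, with values assumed here, where \psi^{*\leftarrow}(y)=\sqrt{2vy}) evaluates both:

```python
import numpy as np

v, y = 2.0, np.log(1000.0)                   # assumed variance factor and level y

lambdas = np.linspace(1e-3, 20.0, 200_000)   # grid standing in for (0, b) with b = infinity
psi = v * lambdas ** 2 / 2

inf_form = np.min((y + psi) / lambdas)       # inf_lambda (y + psi(lambda)) / lambda
closed_form = np.sqrt(2 * v * y)             # psi^{*<-}(y) computed from psi*(t) = t^2 / (2 v)

print(inf_form, closed_form)                 # the two values agree up to grid error
```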
Let Z_1,\ldots,Z_N be real-valued random variables such that for every \lambda\in (0,b) and i=1,\ldots,N, the logarithm of the moment generating function of Z_i satisfies
\psi_{Z_i}(\lambda) \leq \psi(\lambda)
where \psi is a convex and continuously differentiable function on \left[0,b\right) with 0<b\leq\infty such that \psi(0)=\psi'(0)=0
Then
\mathbb{E} \max_{i=1,\ldots,N} Z_i \leq \psi^{*\leftarrow}(\log N)
By Jensen's inequality, for any \lambda\in (0,b),
\exp\left( \lambda \mathbb{E} \max_{i=1,\ldots,N}Z_i \right) \leq \mathbb{E} \exp\left( \lambda\max_{i=1,\ldots,N}Z_i \right) = \mathbb{E} \max_{i=1,\ldots,N}\exp\left(\lambda Z_i \right)
Recalling that \psi_{Z_i}(\lambda) =\log\mathbb{E}\exp\left(\lambda Z_i \right),
\exp\left( \lambda \mathbb{E} \max_{i=1,\ldots,N}Z_i \right)\leq \sum_{i=1}^N \mathbb{E} \exp\left(\lambda Z_i\right) \leq N \exp\left( \psi(\lambda) \right)
Therefore, for any \lambda\in (0,b),
\lambda \mathbb{E} \max_{i=1,\ldots,N}Z_i -\psi(\lambda) \leq \log N
which means that
\mathbb{E} \max_{i=1,\ldots,N}Z_i \leq \inf_{\lambda\in (0,b)}\left( \frac{\log N +\psi(\lambda) }{\lambda}\right)
and the result follows from the preceding lemma.
chi-squared distribution
If p is a positive integer, a gamma random variable with shape parameter a=p/2 and scale parameter b=2 is said to have the chi-square distribution with p degrees of freedom ( \chi^2_p )
If Y_1,\ldots,Y_p \sim_{\text{i.i.d.}} \mathcal{N}(0,1), then \sum_{i=1}^p Y_i^2 \sim \chi^2_p
If X_1,\ldots,X_N each have the chi-square distribution with p degrees of freedom,
then
\mathbb{E}\left[ \max_{i=1,\ldots,N} X_i - p\right] \leq 2\sqrt{p\log N }+ 2\log N
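The bound can be compared with a simulation; the sketch below (with p and N assumed here) estimates \mathbb{E}[\max_{i} X_i - p] for i.i.d. \chi^2_p variables and prints it next to 2\sqrt{p\log N}+2\log N:

```python
import numpy as np

rng = np.random.default_rng(6)
p, N, n_mc = 10, 500, 20_000

X = rng.chisquare(p, size=(n_mc, N))               # each row: X_1, ..., X_N i.i.d. chi^2_p
emp = (X.max(axis=1) - p).mean()                   # Monte Carlo estimate of E[max_i X_i - p]
bound = 2 * np.sqrt(p * np.log(N)) + 2 * np.log(N)

print(emp, bound)                                  # the bound dominates the estimate
```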