name: layout-general layout: true class: left, middle <style> .remark-slide-number { position: inherit; } .remark-slide-number .progress-bar-container { position: absolute; bottom: 0; height: 4px; display: block; left: 0; right: 0; } .remark-slide-number .progress-bar { height: 100%; background-color: red; } </style>
--- class: middle, center, inverse # Non-asymptotic results: concentration ### 2021-01-08 #### [Probabilités Master I MIDS](http://stephane-v-boucheron.fr/courses/probability/) #### [Stéphane Boucheron](http://stephane-v-boucheron.fr) --- class: middle, inverse ## <svg style="height:0.8em;top:.04em;position:relative;fill:white;" viewBox="0 0 576 512"><path d="M0 117.66v346.32c0 11.32 11.43 19.06 21.94 14.86L160 416V32L20.12 87.95A32.006 32.006 0 0 0 0 117.66zM192 416l192 64V96L192 32v384zM554.06 33.16L416 96v384l139.88-55.95A31.996 31.996 0 0 0 576 394.34V48.02c0-11.32-11.43-19.06-21.94-14.86z"/></svg> ### Concentration ### Variance bounds for functions of independent random variables ### Exponential inequalities ### Maximal inequalities --- exclude: true class: inverse, center, middle ## Motivation(s) ## <svg style="height:0.8em;top:.04em;position:relative;fill:white;" viewBox="0 0 496 512"><path d="M248 8C111 8 0 119 0 256s111 248 248 248 248-111 248-248S385 8 248 8zm80 168c17.7 0 32 14.3 32 32s-14.3 32-32 32-32-14.3-32-32 14.3-32 32-32zm-160 0c17.7 0 32 14.3 32 32s-14.3 32-32 32-32-14.3-32-32 14.3-32 32-32zm80 256c-60.6 0-134.5-38.3-143.8-93.3-2-11.8 9.3-21.6 20.7-17.9C155.1 330.5 200 336 248 336s92.9-5.5 123.1-15.2c11.3-3.7 22.6 6.1 20.7 17.9-9.3 55-83.2 93.3-143.8 93.3z"/></svg> --- exclude: true ### Example of non-asymptotic results - Bounds on approximation error in limit theory `$$\mathrm{d}_{\text{TV}}\left(\text{Poi}(\lambda), \text{Binom}\left(n, \frac{\lambda}{n}\right)\right) \leq \min\left(\frac{\lambda}{n}, \frac{1}{n}\right)$$` - Non-asymptotic tail bounds .center[ ≈ Hoeffding's inequality] ??? --- exclude: true ### Why? - Refining limit theorems - High dimensional probability ??? --- class: inverse, middle, center ## Concentration --- ### Concentration in product spaces In a nutshell: .cf[ > A function of many independent random variables that does not depend too much > on any of them is approximately constant .fr[Talagrand] ] The concentration of measure phenomenon describes the deviations of _smooth_ functions (random variables) around their median/mean in some probability spaces - Product spaces - Gaussian spaces - High-dimensional spheres - Compact topological groups - ... --- In Gaussian probability spaces, the Poincaré Inequality asserts: > If `\(X_1, \ldots, X_n \sim_{\text{i.i.d.}} \mathcal{N}(0,1)\)` and `\(f\)` is `\(L\)`-Lipschitz, > `$$\operatorname{Var}(f(X_1, \ldots, X_n )) \leq L^2$$` Borell-Gross-Tsirelson inequalities show that similar bounds hold for exponential moments. -- Comparable results hold in product spaces We need workable definitions of _smoothness_ --- class: inverse, center, middle name: ess ## Efron-Stein-Steele inequalities --- ### Scene `\(X_1, \ldots, X_n\)` denote _independent_ random variables on some probability space with values in `\(\mathcal{X}_1, \ldots, \mathcal{X}_n\)`, `\(f\)` denotes a measurable function from `\(\mathcal{X}_1 \times \ldots \times \mathcal{X}_n\)` to `\(\mathbb{R}\)`. `$$Z=f(X_1, \ldots, X_n)$$` `\(Z\)` is a general function of independent random variables We assume `\(Z\)` is integrable.
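---

### A toy example (simulation sketch)

The next slides are easier to follow with a concrete `\(Z\)` in mind. Below is a minimal Monte Carlo sketch (illustration only: the choice `\(f=\max\)`, the uniform distribution and the sample sizes are arbitrary) of the kind of quantity we want to control.

```python
import numpy as np

rng = np.random.default_rng(42)

n, n_mc = 20, 100_000          # n variables, n_mc Monte Carlo replicates

# Z = f(X_1, ..., X_n) with f = max and X_i i.i.d. uniform on [0, 1]
X = rng.uniform(size=(n_mc, n))
Z = X.max(axis=1)

print(f"E[Z]   ≈ {Z.mean():.4f}")   # exact value: n / (n + 1)
print(f"var(Z) ≈ {Z.var():.6f}")    # exact value: n / ((n + 1)^2 (n + 2))
```

Here `\(Z\)` is a genuinely non-additive function of independent random variables; the goal of the next slides is to bound `\(\operatorname{var}(Z)\)` without computing the distribution of `\(Z\)`.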
--- If we had `\(Z = \sum_{i=1}^n X_i\)`, we could write `$$\operatorname{var}(Z) = \sum_{i=1}^n \operatorname{var}(X_i) = \sum_{i=1}^n \mathbb{E}\Big[\operatorname{var}( Z \mid X_1, \ldots, X_{i-1}, X_{i+1}, \ldots X_n)\Big]$$` -- .cf[.fr[even though the last expression looks pedantic]] -- Our aim is to show that even if `\(f\)` is not as simple as the sum of its arguments, the last expression can still serve as an upper bound on the variance --- ### Doob's embedding We express `\(Z-\mathbb{E} Z\)` as a sum of differences Denote by `\(\mathbb{E}_i\)` the conditional expectation operator, conditioned on `\(\left(X_{1},\ldots,X_{i}\right)\)`: `$$\mathbb{E}_i Y = \mathbb{E}\left[ Y \mid \sigma(X_{1},\ldots,X_{i})\right]$$` Convention: `\(\mathbb{E}_0=\mathbb{E}\)` -- For every `\(i=1,\ldots,n\)`: `$$\Delta_{i}=\mathbb{E}_i Z -\mathbb{E}_{i-1} Z$$` `$$Z - \mathbb{E}Z = \sum_{i=1}^n \left(\mathbb{E}_i Z - \mathbb{E}_{i-1}Z \right)= \sum_{i=1}^n Δ_i$$` --- <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 576 512"><path d="M208 0c-29.9 0-54.7 20.5-61.8 48.2-.8 0-1.4-.2-2.2-.2-35.3 0-64 28.7-64 64 0 4.8.6 9.5 1.7 14C52.5 138 32 166.6 32 200c0 12.6 3.2 24.3 8.3 34.9C16.3 248.7 0 274.3 0 304c0 33.3 20.4 61.9 49.4 73.9-.9 4.6-1.4 9.3-1.4 14.1 0 39.8 32.2 72 72 72 4.1 0 8.1-.5 12-1.2 9.6 28.5 36.2 49.2 68 49.2 39.8 0 72-32.2 72-72V64c0-35.3-28.7-64-64-64zm368 304c0-29.7-16.3-55.3-40.3-69.1 5.2-10.6 8.3-22.3 8.3-34.9 0-33.4-20.5-62-49.7-74 1-4.5 1.7-9.2 1.7-14 0-35.3-28.7-64-64-64-.8 0-1.5.2-2.2.2C422.7 20.5 397.9 0 368 0c-35.3 0-64 28.6-64 64v376c0 39.8 32.2 72 72 72 31.8 0 58.4-20.7 68-49.2 3.9.7 7.9 1.2 12 1.2 39.8 0 72-32.2 72-72 0-4.8-.5-9.5-1.4-14.1 29-12 49.4-40.6 49.4-73.9z"/></svg> Check that `$$\mathbb{E} \Delta_i=0$$` and that for `\(j>i\)`, `$$\mathbb{E}_i \Delta_j=0 \qquad \text{a.s.}$$` --- Starting from the decomposition `$$Z-\mathbb{E} Z =\sum_{i=1}^{n}\Delta_{i}$$` one has `$$\operatorname{var}\left(Z\right) =\mathbb{E}\left[ \left( \sum_{i=1}^{n}\Delta_{i}\right) ^{2}\right] =\sum_{i=1}^{n}\mathbb{E}\left[ \Delta_{i}^{2}\right] +2\sum_{j>i}\mathbb{E}\left[ \Delta_{i}\Delta _{j}\right]$$` -- Now if `\(j>i\)`, `\(\mathbb{E}_i \Delta_{j} =0\)` implies that `$$\mathbb{E}_i\left[ \Delta_{j}\Delta_{i}\right] =\Delta_{i}\mathbb{E}_{i} \Delta_{j} =0$$` and, a fortiori, `$$\mathbb{E}\left[ \Delta_{j}\Delta_{i}\right] =0$$` --- We obtain the following analog of the additivity formula of the variance: `$$\operatorname{var}\left( Z\right) =\mathbb{E}\left[ \left( \sum_{i=1}^{n}\Delta_{i}\right) ^{2}\right] =\sum_{i=1}^{n}\mathbb{E}\left[ \Delta_{i}^{2}\right]$$` <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M512 199.652c0 23.625-20.65 43.826-44.8 43.826h-99.851c16.34 17.048 18.346 49.766-6.299 70.944 14.288 22.829 2.147 53.017-16.45 62.315C353.574 425.878 322.654 448 272 448c-2.746 0-13.276-.203-16-.195-61.971.168-76.894-31.065-123.731-38.315C120.596 407.683 112 397.599 112 385.786V214.261l.002-.001c.011-18.366 10.607-35.889 28.464-43.845 28.886-12.994 95.413-49.038 107.534-77.323 7.797-18.194 21.384-29.084 40-29.092 34.222-.014 57.752 35.098 44.119 66.908-3.583 8.359-8.312 16.67-14.153 24.918H467.2c23.45 0 44.8 20.543 44.8 43.826zM96 200v192c0 13.255-10.745 24-24 24H24c-13.255 0-24-10.745-24-24V200c0-13.255 10.745-24 24-24h48c13.255 0 24 10.745 24 24zM68 368c0-11.046-8.954-20-20-20s-20 8.954-20 20 8.954 20 20 20 20-8.954 20-20z"/></svg> Up to now, we have not made any use of the fact that `\(Z\)` is a function of independent variables 
`\(X_{1},\ldots,X_{n}\)` --- ### Independence at work Independence may be used as in the following argument: For any integrable function `\(Z= f\left( X_{1},\ldots,X_{n}\right)\)` one may write, by the Tonelli-Fubini theorem, `$$\mathbb{E}_i Z =\int_{\mathcal{X}_{i+1} \times \ldots \times \mathcal{X}_{n}}f\left( X_{1},\ldots,X_{i},x_{i+1},\ldots,x_{n}\right) d\mu_{i+1}\left( x_{i+1}\right) \ldots d\mu_{n}\left( x_{n}\right) \text{,}$$` where `\(X_j \sim \mu_{j}\)` for `\(j= 1,\ldots,n\)` --- Denote by `\(\mathbb{E}^{(i)}\)` the conditional expectation operator conditioned on `\(X^{(i)}=(X_{1},\ldots,X_{i-1},X_{i+1},\ldots,X_{n})\)`, `$$\mathbb{E}^{(i)} Y = \mathbb{E}\left[ Y \mid \sigma(X_{1},\ldots,X_{i-1},X_{i+1},\ldots,X_{n})\right]$$` -- `$$\mathbb{E}^{(i)}Z =\int_{\mathcal{X}_i} f\left( X_{1},\ldots,X_{i-1},x_{i},X_{i+1},\ldots,X_{n}\right) d\mu_{i}\left(x_{i}\right)$$` Again by the Tonelli-Fubini theorem: .bg-light-gray.b--dark-gray.ba.bw1.br3.shadow-5.ph4.mt5[ `$$\mathbb{E}_i\left[ \mathbb{E}^{\left( i\right) } Z \right] =\mathbb{E}_{i-1} Z$$` ] --- .bg-light-gray.b--dark-gray.ba.bw1.br3.shadow-5.ph4.mt5[ ### Theorem: Efron-Stein-Steele's inequalities (I) Let `\(X_1,\ldots,X_n\)` be independent random variables and let `\(Z=f(X)\)` be a square-integrable function of `\(X=\left( X_{1},\ldots,X_{n}\right)\)`. Then `$$\operatorname{var}\left( Z\right) \leq \sum_{i=1}^n\mathbb{E}\left[ \left( Z-\mathbb{E}^{(i)} Z \right)^2\right] = v$$` ] --- .bg-light-gray.b--dark-gray.ba.bw1.br3.shadow-5.ph4.mt5[ ### Theorem: Efron-Stein-Steele's inequalities (II) Let `\(X_1',\ldots,X_n'\)` be independent copies of `\(X_1,\ldots,X_n\)` and `$$Z_i'= f\left(X_1,\ldots,X_{i-1},X_i',X_{i+1},\ldots,X_n\right)~,$$` then `$$v=\frac{1}{2}\sum_{i=1}^n\mathbb{E}\left[ \left( Z-Z_i'\right)^2\right] =\sum_{i=1}^n\mathbb{E}\left[ \left( Z-Z_i'\right)_+^2\right] =\sum_{i=1}^n\mathbb{E}\left[ \left( Z-Z_i'\right)_-^2\right]$$` where `\(x_+=\max(x,0)\)` and `\(x_-=\max(-x,0)\)` denote the positive and negative parts of a real number `\(x\)`. `$$v=\inf_{Z_{i}}\sum_{i=1}^{n}\mathbb{E}\left[ \left( Z-Z_{i}\right)^2\right]~,$$` where the infimum is taken over the class of all `\(X^{(i)}\)`-measurable and square-integrable variables `\(Z_{i}\)`, `\(i=1,\ldots,n\)`. ]
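---

### Sanity check (simulation sketch)

Before the proof, a quick numerical check of the Efron-Stein-Steele bound on the toy example `\(Z=\max(X_1,\ldots,X_n)\)` with uniform `\(X_i\)` (a sketch; the Monte Carlo sample sizes are arbitrary): we estimate `\(v=\frac{1}{2}\sum_{i=1}^n\mathbb{E}\left[(Z-Z_i')^2\right]\)` by redrawing one coordinate at a time.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_mc = 20, 100_000

# Toy example: Z = max(X_1, ..., X_n), X_i i.i.d. uniform on [0, 1]
X = rng.uniform(size=(n_mc, n))
Z = X.max(axis=1)

# v = (1/2) sum_i E[(Z - Z_i')^2], with Z_i' obtained by redrawing coordinate i
Xp = rng.uniform(size=(n_mc, n))     # independent copies X_i'
v_hat = 0.0
for i in range(n):
    Xi = X.copy()
    Xi[:, i] = Xp[:, i]              # replace coordinate i only
    Zi = Xi.max(axis=1)              # Z_i'
    v_hat += 0.5 * np.mean((Z - Zi) ** 2)

print(f"var(Z) ≈ {Z.var():.6f}  <=  v ≈ {v_hat:.6f}")
```

The inequality `\(\operatorname{var}(Z)\leq v\)` shows up clearly, and for this choice of `\(f\)` the bound is within a factor of about two of the true variance.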
--- ### Proof Using .bg-light-gray.b--dark-gray.ba.bw1.br3.shadow-5.ph4.mt5[ `$$\mathbb{E}_i\left[\mathbb{E}^{\left(i\right)} Z \right] = \mathbb{E}_{i-1} Z$$` ] we may write `$$\Delta_{i}=\mathbb{E}_i\left[ Z-\mathbb{E}^{\left( i\right) } Z \right]$$` By the conditional Jensen Inequality, `$$\Delta_{i}^{2}\leq\mathbb{E}_i\left[ \left( Z-\mathbb{E}^{\left(i\right) }Z \right) ^{2}\right]$$` --- ### Proof (continued) Using `$$\operatorname{var}(Z) = \sum_{i=1}^n \mathbb{E}\left[ \Delta_i^2\right]$$` we obtain `$$\operatorname{var}(Z) \leq \sum_{i=1}^n \mathbb{E}\left[\mathbb{E}_i\left[ \left( Z-\mathbb{E}^{\left(i\right) }Z \right) ^{2}\right]\right]$$` --- ### Proof (continued) To prove the identities for `\(v\)`, denote by `\(\operatorname{var}^{\left(i\right) }\)` the conditional variance operator conditioned on `\(X^{\left( i\right) }\)` `$$\operatorname{var}^{\left(i\right)}(Y) = \mathbb{E}\left[ \left(Y - \mathbb{E}^{\left(i\right)}Y\right)^2\mid \sigma(X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n)\right]$$` Then we may write `\(v\)` as `$$v=\sum_{i=1}^{n}\mathbb{E}\left[ \operatorname{var}^{\left( i\right) }\left(Z\right) \right]$$` --- ### Proof (continued) <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M512 199.652c0 23.625-20.65 43.826-44.8 43.826h-99.851c16.34 17.048 18.346 49.766-6.299 70.944 14.288 22.829 2.147 53.017-16.45 62.315C353.574 425.878 322.654 448 272 448c-2.746 0-13.276-.203-16-.195-61.971.168-76.894-31.065-123.731-38.315C120.596 407.683 112 397.599 112 385.786V214.261l.002-.001c.011-18.366 10.607-35.889 28.464-43.845 28.886-12.994 95.413-49.038 107.534-77.323 7.797-18.194 21.384-29.084 40-29.092 34.222-.014 57.752 35.098 44.119 66.908-3.583 8.359-8.312 16.67-14.153 24.918H467.2c23.45 0 44.8 20.543 44.8 43.826zM96 200v192c0 13.255-10.745 24-24 24H24c-13.255 0-24-10.745-24-24V200c0-13.255 10.745-24 24-24h48c13.255 0 24 10.745 24 24zM68 368c0-11.046-8.954-20-20-20s-20 8.954-20 20 8.954 20 20 20 20-8.954 20-20z"/></svg> one may simply use (conditionally) the elementary fact that if `\(X\)` and `\(Y\)` are independent and identically distributed real-valued random variables, then `$$\operatorname{var}(X)=(1/2) \mathbb{E}[(X-Y)^2]$$` Conditionally on `\(X^{\left( i\right) }\)`, `\(Z_i'\)` is an independent copy of `\(Z\)` `$$\operatorname{var}^{\left( i\right) }\left( Z\right) =\frac{1}{2}\mathbb{E} ^{\left( i\right) }\left[ \left( Z-Z_i'\right)^2\right] =\mathbb{E}^{\left( i\right) }\left[ \left( Z-Z_i'\right)_+^2\right] =\mathbb{E}^{\left( i\right) }\left[ \left( Z-Z_i'\right)_-^2\right]$$` where we used the fact that the conditional distributions of `\(Z\)` and `\(Z_i'\)` are identical --- ### Proof (continued) <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M201.5 174.8l55.7 55.8c3.1 3.1 3.1 8.2 0 11.3l-11.3 11.3c-3.1 3.1-8.2 3.1-11.3 0l-55.7-55.8-45.3 45.3 55.8 55.8c3.1 3.1 3.1 8.2 0 11.3l-11.3 11.3c-3.1 3.1-8.2 3.1-11.3 0L111 265.2l-26.4 26.4c-17.3 17.3-25.6 41.1-23 65.4l7.1 63.6L2.3 487c-3.1 3.1-3.1 8.2 0 11.3l11.3 11.3c3.1 3.1 8.2 3.1 11.3 0l66.3-66.3 63.6 7.1c23.9 2.6 47.9-5.4 65.4-23l181.9-181.9-135.7-135.7-64.9 65zm308.2-93.3L430.5 2.3c-3.1-3.1-8.2-3.1-11.3 0l-11.3 11.3c-3.1 3.1-3.1 8.2 0 11.3l28.3 28.3-45.3 45.3-56.6-56.6-17-17c-3.1-3.1-8.2-3.1-11.3 0l-33.9 33.9c-3.1 3.1-3.1 8.2 0 11.3l17 17L424.8 223l17 17c3.1 3.1 8.2 3.1 11.3 0l33.9-34c3.1-3.1 3.1-8.2 0-11.3l-73.5-73.5 45.3-45.3 28.3 28.3c3.1 3.1 8.2 3.1 11.3 0l11.3-11.3c3.1-3.2 3.1-8.2 0-11.4z"/></svg> For any real-valued random
variable `\(X\)`, `$$\operatorname{var}(X) = \inf_{a\in \mathbb{R}} \mathbb{E}[(X-a)^2]$$` Using this fact conditionally, for every `\(i=1,\ldots,n\)`, `$$\operatorname{var}^{\left( i\right) }\left( Z\right) =\inf_{Z_{i}}\mathbb{E}^{\left( i\right) }\left[ \left( Z-Z_{i}\right)^2\right]$$` <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M512 199.652c0 23.625-20.65 43.826-44.8 43.826h-99.851c16.34 17.048 18.346 49.766-6.299 70.944 14.288 22.829 2.147 53.017-16.45 62.315C353.574 425.878 322.654 448 272 448c-2.746 0-13.276-.203-16-.195-61.971.168-76.894-31.065-123.731-38.315C120.596 407.683 112 397.599 112 385.786V214.261l.002-.001c.011-18.366 10.607-35.889 28.464-43.845 28.886-12.994 95.413-49.038 107.534-77.323 7.797-18.194 21.384-29.084 40-29.092 34.222-.014 57.752 35.098 44.119 66.908-3.583 8.359-8.312 16.67-14.153 24.918H467.2c23.45 0 44.8 20.543 44.8 43.826zM96 200v192c0 13.255-10.745 24-24 24H24c-13.255 0-24-10.745-24-24V200c0-13.255 10.745-24 24-24h48c13.255 0 24 10.745 24 24zM68 368c0-11.046-8.954-20-20-20s-20 8.954-20 20 8.954 20 20 20 20-8.954 20-20z"/></svg> the infimum is achieved whenever `\(Z_{i}=\mathbb{E}^{\left(i\right)}Z\)`. <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 448 512"><path d="M400 32H48C21.5 32 0 53.5 0 80v352c0 26.5 21.5 48 48 48h352c26.5 0 48-21.5 48-48V80c0-26.5-21.5-48-48-48z"/></svg> --- <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 576 512"><path d="M569.517 440.013C587.975 472.007 564.806 512 527.94 512H48.054c-36.937 0-59.999-40.055-41.577-71.987L246.423 23.985c18.467-32.009 64.72-31.951 83.154 0l239.94 416.028zM288 354c-25.405 0-46 20.595-46 46s20.595 46 46 46 46-20.595 46-46-20.595-46-46-46zm-43.673-165.346l7.418 136c.347 6.364 5.609 11.346 11.982 11.346h48.546c6.373 0 11.635-4.982 11.982-11.346l7.418-136c.375-6.874-5.098-12.654-11.982-12.654h-63.383c-6.884 0-12.356 5.78-11.981 12.654z"/></svg> When `\(Z=\sum_{i=1}^n X_i\)` is a sum of independent random variables (with finite variance) then the Efron-Stein-Steele inequality becomes an equality. The bound in the Efron-Stein-Steele inequality is, _in a sense_, not improvable. --- class: inverse, center, middle name: mcdiarmid ## Bounding the variance: example --- exclude: true ### Random graphs (Erdős-Rényi) - clique number - chromatic number - size of giant component (for super-critical graphs) --- exclude: true ### Longest Increasing Subsequence - Ulam's problem --- exclude: true ### Norms of random vectors --- ### Random matrices - Largest eigenvalue of a random symmetric matrix with bounded entries `$$X =\begin{pmatrix}0 & \epsilon_{1,2} & \ldots & \epsilon_{1,n} \\ \epsilon_{1,2} & 0 & \ddots & \vdots \\ \vdots & \ddots & \ddots & \epsilon_{n-1,n}\\ \epsilon_{1,n} & \ldots & \epsilon_{n-1,n} & 0 \end{pmatrix}$$` where `\((\epsilon_{i,j})_{i<j}\)` are i.i.d. random symmetric signs `$$Z = \sup_{\|\lambda\|_2 \leq 1} \lambda^T X \lambda = 2 \sup_{\|\lambda\|_2 \leq 1} \sum_{i< j} \lambda_i \lambda_j \epsilon_{i,j}$$` `$$\operatorname{var}\left(Z\right) \leq 4$$` --- exclude: true ### Bin packing --- exclude: true ### Order statistics --- class: inverse, center, middle name: hoeffding ## Hoeffding's inequality --- Laws of large numbers are asymptotic statements. In applications, in Statistics, in Statistical Learning Theory, it is often desirable to have guarantees for fixed `\(n\)`. Exponential inequalities are refinements of Chebyshev's inequality.
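For instance (a back-of-the-envelope comparison, with numbers chosen only for illustration): if `\(S=\sum_{i=1}^{100}(X_i - \mathbb{E}X_i)\)` with independent `\(X_i\)` taking values in `\([0,1]\)`, then `\(\operatorname{var}(S)\leq 25\)` and `$$P\left\{ S \geq 20 \right\} \leq \frac{25}{20^2} \approx 6.3 \times 10^{-2} \quad \text{(Chebyshev)} \qquad\text{versus}\qquad P\left\{ S \geq 20 \right\} \leq \mathrm{e}^{-\frac{20^2}{2\times 25}} \approx 3.4 \times 10^{-4} \quad \text{(Hoeffding, derived below)}$$`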
Under strong integrability assumptions on the summands, it is possible and relatively easy to derive sharp tail bounds for sums of independent random variables. --- .bg-light-gray.b--dark-gray.ba.bw1.br3.shadow-5.ph4.mt5[ ### Lemma: Hoeffding's Lemma Let `\(Y\)` be a random variable taking values in a bounded interval `\([a,b]\)` and let `\(\psi_Y(\lambda)=\log \mathbb{E} e^{\lambda (Y- \mathbb{E}Y)}\)`. Then, for every `\(\lambda \in \mathbb{R}\)`, `$$\operatorname{var}(Y) \leq \frac{(b-a)^2}{4}\qquad \text{and} \qquad \psi_Y(\lambda) \leq \frac{\lambda^2}{2} \frac{(b-a)^2}{4}$$` ] --- ### Proof The upper bound on the variance of `\(Y\)` has been established. Now let `\(P\)` denote the distribution of `\(Y\)` and let `\(P_{\lambda}\)` be the probability distribution with density `$$x \rightarrow e^{-\psi_{Y}\left( \lambda\right) }e^{\lambda (x - \mathbb{E}Y)}$$` with respect to `\(P\)`. Since `\(P_{\lambda}\)` is concentrated on `\([a,b]\)` ( `\(P_\lambda([a, b]) = P([a, b]) =1\)` ), the variance of a random variable `\(Z\)` with distribution `\(P_{\lambda}\)` is bounded by `\((b-a)^2/4\)` --- ### Proof (continued) Note that `\(P_0 = P\)`. Dominated convergence arguments allow one to compute the derivatives of `\(\psi_Y(\lambda)\)`. Namely `$$\psi'_Y(\lambda) = \frac{\mathbb{E}\Big[ (Y- \mathbb{E}Y) e^{\lambda (Y- \mathbb{E}Y)} \Big]}{\mathbb{E} e^{\lambda (Y- \mathbb{E}Y)}} = \mathbb{E}_{P_\lambda} Z$$` and `$$\psi^{\prime\prime}_Y(\lambda) = \frac{\mathbb{E}\Big[ (Y- \mathbb{E}{Y})^2 e^{\lambda (Y- \mathbb{E}Y)} \Big]}{\mathbb{E} e^{\lambda (Y- \mathbb{E}Y)}} - \Bigg(\frac{\mathbb{E}\Big[ (Y- \mathbb{E}{Y}) e^{\lambda (Y- \mathbb{E}Y)} \Big]}{\mathbb{E} e^{\lambda (Y- \mathbb{E}Y)}}\Bigg)^2 = \operatorname{var}_{P_\lambda}(Z)$$` --- ### Proof (continued) Hence, thanks to the variance upper bound: `\begin{align*} \psi_Y^{\prime\prime}(\lambda) & \leq \frac{(b-a)^2}{4}~. \end{align*}` Note that `\(\psi_{Y}(0) = \psi_{Y}'(0) =0\)`, and by Taylor's theorem, for some `\(\theta \in [0,\lambda]\)`, `$$\psi_Y(\lambda) = \psi_Y(0) + \lambda\psi_Y'(0) + \frac{\lambda^2}{2}\psi_Y''(\theta) \leq \frac{\lambda^2(b-a)^2}{8}$$` <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 448 512"><path d="M400 32H48C21.5 32 0 53.5 0 80v352c0 26.5 21.5 48 48 48h352c26.5 0 48-21.5 48-48V80c0-26.5-21.5-48-48-48z"/></svg> --- The upper bound on the variance is sharp in the special case of a _Rademacher_ random variable `\(X\)` whose distribution is defined by `$$P\{X =-1\} = P\{X =1\} = 1/2$$` Then one may take `\(b=-a=1\)` and `\(\operatorname{var}(X) =1=\left( b-a\right)^2/4\)`. -- We can now build on Hoeffding's Lemma to derive very practical tail bounds for sums of bounded independent random variables. --- .bg-light-gray.b--dark-gray.ba.bw1.br3.shadow-5.ph4.mt5[ ### Theorem: Hoeffding's inequality Let `\(X_1,\ldots,X_n\)` be independent random variables such that `\(X_i\)` takes its values in `\([a_i,b_i]\)` almost surely for all `\(i\leq n\)`. Let `$$S=\sum_{i=1}^n\left(X_i- \mathbb{E} X_i \right)$$` Then `$$\operatorname{var}(S) \leq v = \sum_{i=1}^n \frac{(b_i-a_i)^2}{4}$$` `$$\forall \lambda \in \mathbb{R}, \qquad \log \mathbb{E} \mathrm{e}^{\lambda S} \leq \frac{\lambda^2 v}{2}$$` `$$\forall t>0, \qquad P\left\{ S \geq t \right\} \le \exp\left( -\frac{t^2}{2 v}\right)$$` ] --- The proof is based on the so-called Cramér-Chernoff bounding technique and on Hoeffding's Lemma. ### Proof The upper bound on variance follows from `\(\operatorname{var}(S) = \sum_{i=1}^n \operatorname{var}(X_i)\)` and from the first part of Hoeffding's Lemma.
For the upper-bound on `\(\log \mathbb{E} \mathrm{e}^{\lambda S}\)`, `$$\begin{array}{rl}\log \mathbb{E} \mathrm{e}^{\lambda S} & = \log \mathbb{E} \mathrm{e}^{\sum_{i=1}^n \lambda (X_i - \mathbb{E} X_i)} \\ & = \log \mathbb{E} \Big[\prod_{i=1}^n \mathrm{e}^{\lambda (X_i - \mathbb{E} X_i)}\Big] \\ & = \log \Big(\prod_{i=1}^n \mathbb{E} \Big[\mathrm{e}^{\lambda (X_i - \mathbb{E} X_i)}\Big]\Big) \\ & = \sum_{i=1}^n \log \mathbb{E} \Big[\mathrm{e}^{\lambda (X_i - \mathbb{E} X_i)}\Big] \\ & \leq \sum_{i=1}^n \frac{\lambda^2 (b_i-a_i)^2}{8} \\ & = \frac{\lambda^2 v}{2}\end{array}$$` where the third equality comes from independence of the `\(X_i\)`'s and the inequality follows from invoking Hoeffding's Lemma for each summand. --- ### Proof (continued) The Cramér-Chernoff technique consists of using Markov's inequality with exponential moments. `$$\begin{array}{rl}P \big\{ S \geq t \big\} & \leq \inf_{\lambda\geq 0}\frac{\mathbb{E} \mathrm{e}^{\lambda S}}{\mathrm{e}^{\lambda t}} \\ & \leq \exp\Big(- \sup_{\lambda \geq 0} \big( \lambda t - \log \mathbb{E} \mathrm{e}^{\lambda S}\big) \Big)\\ & \leq \exp\Big(- \sup_{\lambda \geq 0}\big( \lambda t - \frac{\lambda^2 v}{2}\big) \Big) \\ & = \mathrm{e}^{- \frac{t^2}{2v} }\end{array}$$` <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 448 512"><path d="M400 32H48C21.5 32 0 53.5 0 80v352c0 26.5 21.5 48 48 48h352c26.5 0 48-21.5 48-48V80c0-26.5-21.5-48-48-48z"/></svg> --- Hoeffding's inequality provides interesting tail bounds for binomial random variables which are sums of independent `\([0,1]\)`-valued random variables. However in some cases, the variance upper bound used in Hoeffding's inequality is excessively conservative. Think for example of a binomial random variable with parameters `\(n\)` and `\(\mu/n\)`: the variance upper bound obtained from the boundedness assumption is `\(n/4\)` while the true variance is `\(\mu(1-\mu/n) \leq \mu\)` --- class: inverse, center, middle name: mcdiarmid ## Bounded differences inequality --- In this section we combine Hoeffding's inequality and conditioning to establish the so-called _Bounded differences inequality_ (also known as McDiarmid's inequality). This inequality is a first example of the _concentration of measure phenomenon_. This phenomenon is best portrayed by the following saying: > A function of many independent random variables that does not depend too much on any of them is concentrated around its mean or median value. --- .bg-light-gray.b--dark-gray.ba.bw1.br3.shadow-5.ph4.mt5[ ### Theorem: Bounded Differences Inequality Let `\(X_1, \ldots, X_n\)` be independent with values in `\(\mathcal{X}_1, \mathcal{X}_2, \ldots, \mathcal{X}_n\)`. Let `\(f : \mathcal{X}_1 \times \mathcal{X}_2 \times \ldots \times \mathcal{X}_n \to \mathbb{R}\)` be measurable. Assume there exist non-negative constants `\(c_1, \ldots, c_n\)` satisfying `\(\forall (x_1, \ldots, x_n) \in \prod_{i=1}^n \mathcal{X}_i\)`, `\(\forall (y_1, \ldots, y_n) \in \prod_{i=1}^n \mathcal{X}_i\)`, `$$\Big| f(x_1, \ldots, x_n) - f(y_1, \ldots, y_n)\Big| \leq \sum_{i=1}^n c_i \mathbb{I}_{x_i\neq y_i}$$` Let `\(Z = f(X_1, \ldots, X_n)\)` and `\(v = \sum_{i=1}^n \frac{c_i^2}{4}\)` Then `\(\operatorname{var}(Z) \leq v\)` `$$\log \mathbb{E} \mathrm{e}^{\lambda (Z -\mathbb{E}Z)} \leq \frac{\lambda^2 v}{2}\qquad \text{and} \qquad P \Big\{ Z \geq \mathbb{E}Z + t \Big\} \leq \mathrm{e}^{-\frac{t^2}{2v}}$$` ] --- ### Proof The variance bound is an immediate consequence of the Efron-Stein-Steele inequalities.
The tail bound follows from the upper bound on the logarithmic moment generating function by Cramér-Chernoff bounding. To check the upper-bound on the logarithmic moment generating function, we proceed by induction on the number of arguments `\(n\)`. If `\(n=1\)`, the upper-bound on the logarithmic moment generating function is just Hoeffding's Lemma. Assume the upper-bound is valid up to `\(n-1\)`. `$$\begin{array}{rl} \mathbb{E} \mathrm{e}^{\lambda (Z - \mathbb{E}Z)} & = \mathbb{E}\Big[ \mathbb{E}_{n-1}\mathrm{e}^{\lambda (Z - \mathbb{E}Z)} \Big] \\ & = \mathbb{E}\Big[ \mathbb{E}_{n-1}\big[\mathrm{e}^{\lambda (Z - \mathbb{E}_{n-1}Z)}\big] \times \mathrm{e}^{\lambda (\mathbb{E}_{n-1}Z - \mathbb{E}Z)} \Big]\end{array}$$` --- ### Proof (continued) Now, `$$\mathbb{E}_{n-1}Z = \int_{\mathcal{X}_n} f(x_1,\ldots,x_{n-1}, u) \mathrm{d}P_{X_n}(u) \qquad\text{a.s.}$$` and `$$\begin{array}{rl} & \mathbb{E}_{n-1}\big[\mathrm{e}^{\lambda (Z - \mathbb{E}_{n-1}Z)}\big] \\ & = \int_{\mathcal{X}_n} \exp\Big(\lambda \int_{\mathcal{X}_n} \big(f(x_1,\ldots,x_{n-1}, v) -f(x_1,\ldots,x_{n-1}, u)\big) \mathrm{d}P_{X_n}(u) \Big) \mathrm{d}P_{X_n}(v)\end{array}$$` For every `\((x_1, \ldots, x_{n-1}) \in \mathcal{X}_1 \times \ldots \times \mathcal{X}_{n-1}\)`, for every `\(v, v' \in \mathcal{X}_n\)`, `$$\begin{array}{rl} & \Big| \int_{\mathcal{X}_n} \big(f(x_1,\ldots,x_{n-1}, v) -f(x_1,\ldots,x_{n-1}, u)\big) \mathrm{d}P_{X_n}(u) \\ & - \int_{\mathcal{X}_n} \big(f(x_1,\ldots,x_{n-1}, v') -f(x_1,\ldots,x_{n-1}, u)\big) \mathrm{d}P_{X_n}(u)\Big| \leq c_n \end{array}$$` --- ### Proof (continued) By Hoeffding's Lemma `$$\mathbb{E}_{n-1}\big[\mathrm{e}^{\lambda (Z - \mathbb{E}_{n-1}Z)}\big] \leq \mathrm{e}^{\frac{\lambda^2 c_n^2}{8}}$$` `$$\begin{array}{rl} \mathbb{E} \mathrm{e}^{\lambda (Z - \mathbb{E}Z)} & \leq \mathbb{E}\Big[ \mathrm{e}^{\lambda (\mathbb{E}_{n-1}Z - \mathbb{E}Z)} \Big] \times \mathrm{e}^{\frac{\lambda^2 c_n^2}{8}} \, . \end{array}$$` But, if `\(X_1=x_1, \ldots, X_{n-1}=x_{n-1}\)`, `$$\mathbb{E}_{n-1}Z - \mathbb{E}Z = \int_{\mathcal{X}_n} f(x_1,\ldots,x_{n-1}, v) \mathrm{d}P_{X_n}(v) - \mathbb{E}Z \,,$$` so `\(\mathbb{E}_{n-1}Z - \mathbb{E}Z\)` is a function of the `\(n-1\)` independent random variables `\(X_1, \ldots, X_{n-1}\)` that satisfies the bounded differences conditions with constants `\(c_1, \ldots, c_{n-1}\)`. By the induction hypothesis: `$$\mathbb{E}\Big[ \mathrm{e}^{\lambda (\mathbb{E}_{n-1}Z - \mathbb{E}Z)} \Big] \leq \mathrm{e}^{\frac{\lambda^2}{2} \sum_{i=1}^{n-1} \frac{c_i^2}{4}}$$` Combining the last two displays yields `\(\mathbb{E} \mathrm{e}^{\lambda (Z - \mathbb{E}Z)} \leq \mathrm{e}^{\frac{\lambda^2 v}{2}}\)`, which completes the induction <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 448 512"><path d="M400 32H48C21.5 32 0 53.5 0 80v352c0 26.5 21.5 48 48 48h352c26.5 0 48-21.5 48-48V80c0-26.5-21.5-48-48-48z"/></svg> --- exclude: true ## Bibliographic remarks {#bibconditionning} Conditional expectations can be constructed from the Radon-Nikodym Theorem, see [@MR1932358]. It is also possible to prove the Radon-Nikodym Theorem starting from the construction of conditional expectation in `\(\mathcal{L}_2\)`, see [@MR1155402]. The Section on Efron-Stein-Steele's inequalities is from [@BoLuMa13]. The bounded differences inequality is due to C. McDiarmid. It became popular in (Theoretical) computer science during the 1990s.
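---

### Illustration: balls in bins (simulation sketch)

A small numerical illustration of the bounded differences inequality (a sketch; the statistic and the constants below are chosen for illustration only). Throw `\(n\)` balls uniformly at random into `\(n\)` bins and let `\(Z\)` be the number of occupied bins: moving one ball changes `\(Z\)` by at most `\(1\)`, so `\(c_i=1\)` and `\(v=n/4\)`.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_mc = 100, 200_000

# Z = number of occupied bins when n balls are thrown into n bins
balls = rng.integers(0, n, size=(n_mc, n))
srt = np.sort(balls, axis=1)
Z = n - np.sum(srt[:, 1:] == srt[:, :-1], axis=1)   # distinct values per row

EZ = n * (1 - (1 - 1 / n) ** n)                     # exact expectation of Z
t = 10.0
empirical = np.mean(Z >= EZ + t)
bound = np.exp(-t ** 2 / (2 * (n / 4)))             # exp(-t^2 / (2 v))

print(f"empirical P(Z >= EZ + {t:g}) ≈ {empirical:.5f}")
print(f"bounded differences bound   ≈ {bound:.5f}")
```

The bound holds but is conservative here, as expected from a result that only uses the constants `\(c_i\)`.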
--- class: inverse, middle, center ## Maximal inequalities --- ### Maximal inequalities: simplest setting `\(Z = \max(X_1, \ldots, X_n)\)` with `\(X_1, \ldots, X_n \sim_{\text{i.i.d.}} P\)` -- Goal: `$$\mathbb{E} Z \leq \text{something that depends on }n \text{ and on }P$$` Dependence on `\(n\)` is tied to the tail behavior of `\(P\)` --- The purpose of this section is to show how information on the Cramér transform of random variables in a finite collection can be used to bound the expected maximum of these random variables. --- The main idea is perhaps most transparent if we consider _sub-Gaussian_ random variables. Let `\(Z_1,\ldots,Z_N\)` be real-valued random variables such that, for some `\(v>0\)` and every `\(i=1,\ldots,N\)`, the logarithm of the moment generating function of `\(Z_i\)` satisfies `\(\psi_{Z_i}(\lambda) \leq \lambda^2v/2\)` for all `\(\lambda >0\)`. Then by Jensen's inequality, `$$\begin{array}{rcl} \exp \left(\lambda\,\mathbb{E} \max_{i=1,\ldots,N} Z_i \right) & \leq & \mathbb{E} \exp\left(\lambda \max_{i=1,\ldots,N} Z_i \right) \\ & = & \mathbb{E} \max_{i=1,\ldots,N} e^{\lambda Z_i} \\ & \leq & \sum_{i=1}^N \mathbb{E} e^{\lambda Z_i} \\ & \leq & N e^{\lambda^2v/2} \end{array}$$` --- Taking logarithms on both sides, we have `$$\mathbb{E} \max_{i=1,\ldots,N} Z_i \le \frac{\log N}{\lambda} + \frac{\lambda v}{2}$$` The upper bound is minimized for `\(\lambda = \sqrt{2\log N/v}\)` which yields `$$\mathbb{E} \max_{i=1,\ldots,N}Z_i\le \sqrt {2v\log N}$$` This simple bound is (asymptotically) sharp if the `\(Z_i\)` are i.i.d. normal random variables. --- The argument above may be generalized beyond sub-Gaussian variables. Next we formalize such a general inequality, but we start with a technical result that establishes a useful formula for the generalized inverse of the Fenchel-Legendre dual of a smooth convex function. --- .bg-light-gray.b--dark-gray.ba.bw1.br3.shadow-5.ph4.mt5[ ### Lemma Let `\(\psi\)` be a convex and continuously differentiable function defined on `\(\left[ 0,b\right)\)` where `\(0<b\leq\infty\)`. Assume that `\(\psi\left( 0\right) =\psi'\left( 0\right) =0\)` and set, for every `\(t\geq0\)`, `$$\psi^*(t) =\sup_{\lambda\in (0,b)} \left( \lambda t-\psi(\lambda)\right)$$` Then `\(\psi^*\)` is a nonnegative convex and nondecreasing function on `\([0,\infty)\)`. For every `\(y\geq 0\)`, `\(\left\{ t \ge 0: \psi^*(t) >y\right\}\neq \emptyset\)` and the generalized inverse of `\(\psi^*\)`, defined by `$$\psi^{*\leftarrow}(y) =\inf\left\{ t\ge 0:\psi^*(t) >y \right\}$$` can also be written as `$$\psi^{*\leftarrow}(y) =\inf_{\lambda\in (0,b) } \left[ \frac{y +\psi(\lambda)}{\lambda}\right]$$` ] --- ### Proof By definition, `\(\psi^*\)` is the supremum of convex and nondecreasing functions on `\([0,\infty)\)` and `\(\psi^*(0) =0\)`, therefore `\(\psi^*\)` is a nonnegative, convex, and nondecreasing function on `\([0,\infty)\)`. Given `\(\lambda\in (0,b)\)`, since `\(\psi^*(t) \geq\lambda t-\psi(\lambda)\)`, `\(\psi^*\)` is unbounded, which shows that `$$\forall y\geq 0, \qquad \left\{ t\geq 0:\psi^*(t) >y\right\} \neq \emptyset$$` Define `$$u=\inf_{\lambda\in (0,b)} \left[ \frac{y+\psi(\lambda) }{\lambda}\right]$$` For every `\(t \ge 0\)`, we have `\(u\geq t\)` iff `$$\forall \lambda \in (0,b), \qquad \frac{y+\psi(\lambda) }{\lambda}\geq t$$` that is, iff `\(y\ge \psi^*(t)\)`. Hence `\(\left\{ t\ge 0:\psi^*(t)> y\right\} = (u,\infty)\)` and `\(u=\psi^{*\leftarrow}(y)\)` by definition of `\(\psi^{*\leftarrow}\)`.
<svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 448 512"><path d="M400 32H48C21.5 32 0 53.5 0 80v352c0 26.5 21.5 48 48 48h352c26.5 0 48-21.5 48-48V80c0-26.5-21.5-48-48-48z"/></svg> --- The next result offers a convenient bound for the expected value of the maximum of finitely many exponentially integrable random variables. This type of bound has been used in _chaining arguments_ for bounding suprema of Gaussian or empirical processes. --- .bg-light-gray.b--dark-gray.ba.bw1.br3.shadow-5.ph4.mt5[ ### Theorem Let `\(Z_1,\ldots,Z_N\)` be real-valued random variables such that for every `\(\lambda\in (0,b)\)` and `\(i=1,\ldots,N\)`, the logarithm of the moment generating function of `\(Z_i\)` satisfies `$$\psi_{Z_i}(\lambda) \leq \psi(\lambda)$$` where `\(\psi\)` is a convex and continuously differentiable function on `\([0,b)\)` with `\(0<b\leq\infty\)` such that `\(\psi(0)=\psi'(0)=0\)` Then `$$\mathbb{E} \max_{i=1,\ldots,N} Z_i \leq \psi^{*\leftarrow}(\log N)$$` ] --- If the `\(Z_i\)` are sub-Gaussian with variance factor `\(v\)`, that is, `\(\psi(\lambda) =\lambda^2v/2\)` for every `\(\lambda\in (0,\infty)\)`, then `$$\mathbb{E} \max_{i=1,\ldots,N}Z_i \leq \sqrt {2v\log N}$$` --- ### Proof By Jensen's inequality, for any `\(\lambda\in (0,b)\)`, `$$\exp\left( \lambda \mathbb{E} \max_{i=1,\ldots,N}Z_i \right) \leq \mathbb{E} \exp\left( \lambda\max_{i=1,\ldots,N}Z_i \right) = \mathbb{E} \max_{i=1,\ldots,N}\exp\left(\lambda Z_i \right)$$` Recalling that `\(\psi_{Z_i}(\lambda) =\log\mathbb{E}\exp\left(\lambda Z_i \right)\)`, `$$\exp\left( \lambda \mathbb{E} \max_{i=1,\ldots,N}Z_i \right)\leq \sum_{i=1}^N \mathbb{E} \exp\left(\lambda Z_i\right) \leq N \exp\left( \psi(\lambda) \right)$$` Therefore, for any `\(\lambda\in (0,b)\)`, `$$\lambda \mathbb{E} \max_{i=1,\ldots,N}Z_i -\psi(\lambda) \leq \log N$$` which means that `$$\mathbb{E} \max_{i=1,\ldots,N}Z_i \leq \inf_{\lambda\in (0,b)}\left( \frac{\log N +\psi(\lambda) }{\lambda}\right)$$` and the result follows from the preceding Lemma. <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 448 512"><path d="M400 32H48C21.5 32 0 53.5 0 80v352c0 26.5 21.5 48 48 48h352c26.5 0 48-21.5 48-48V80c0-26.5-21.5-48-48-48z"/></svg> --- .bg-light-gray.b--dark-gray.ba.bw1.br3.shadow-5.ph4.mt5[ ### Corollary Let `\(Z_1,\ldots,Z_N\)` be real-valued random variables belonging to `\(\Gamma_+(v,c)\)`, that is, sub-gamma on the right tail with variance factor `\(v\)` and scale parameter `\(c\)`. Then `$$\mathbb{E} \max_{i=1,\ldots,N} Z_i \leq\sqrt{2v\log N}+ c\log N$$` ] --- <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M416 48c0-8.84-7.16-16-16-16h-64c-8.84 0-16 7.16-16 16v48h96V48zM63.91 159.99C61.4 253.84 3.46 274.22 0 404v44c0 17.67 14.33 32 32 32h96c17.67 0 32-14.33 32-32V288h32V128H95.84c-17.63 0-31.45 14.37-31.93 31.99zm384.18 0c-.48-17.62-14.3-31.99-31.93-31.99H320v160h32v160c0 17.67 14.33 32 32 32h96c17.67 0 32-14.33 32-32v-44c-3.46-129.78-61.4-150.16-63.91-244.01zM176 32h-64c-8.84 0-16 7.16-16 16v48h96V48c0-8.84-7.16-16-16-16zm48 256h64V128h-64v160z"/></svg> .ttc[chi-squared distribution] If `\(p\)` is a positive integer, a gamma random variable with parameters `\(a=p/2\)` and `\(b=2\)` is said to have chi-square distribution with `\(p\)` _degrees of freedom_ ( `\(\chi^2_p\)` ). If `\(Y_1,\ldots,Y_p \sim_{\text{i.i.d.}} \mathcal{N}(0,1)\)` then `\(\sum_{i=1}^p Y_i^2 \sim \chi^2_p\)`. If `\(X_1,\ldots,X_N\)` have chi-square distribution with `\(p\)` degrees of freedom, then `$$\mathbb{E}\left[ \max_{i=1,\ldots,N} X_i - p\right] \leq 2\sqrt{p\log N }+ 2\log N$$`
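---

### Illustration (simulation sketch)

A quick numerical look at the last bound (a sketch; the values of `\(p\)`, `\(N\)` and the Monte Carlo size are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(2)
p, N, n_mc = 10, 1000, 2000

# For each replicate, draw N chi-square(p) variables and record the maximum
X = rng.chisquare(df=p, size=(n_mc, N))
lhs = (X.max(axis=1) - p).mean()                    # ≈ E[max_i X_i - p]
rhs = 2 * np.sqrt(p * np.log(N)) + 2 * np.log(N)    # maximal inequality bound

print(f"E[max X_i - p] ≈ {lhs:.2f}   vs   bound = {rhs:.2f}")
```

The logarithmic growth in `\(N\)` becomes visible if one replays the experiment for several values of `\(N\)`.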
---
class: middle, center, inverse
background-image: url('./img/pexels-cottonbro-3171837.jpg')
background-size: 112%

# The End