In a nutshell:
A function of many independent random variables that does not depend too much on any of them is approximately constant (Talagrand)
The concentration of measure phenomenon describes the deviations of smooth functions (random variables) around their median/mean in some probability spaces
In Gaussian probability spaces, the Poincaré Inequality asserts (see the numerical sketch below):
If X_1, \ldots, X_n \sim_{\text{i.i.d.}} \mathcal{N}(0,1) and f is L-Lipschitz,
\operatorname{Var}(f(X_1, \ldots, X_n )) \leq L^2
Borell-Gross-Cirelson inequalities show that similar bounds hold for exponential moments.
Comparable results hold in product spaces
We need workable definitions of smoothness
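As a quick illustration (an example chosen here, not taken from the slides), the Gaussian Poincaré inequality can be checked by simulation with the 1-Lipschitz function f(x) = \|x\|_2; a minimal sketch in Python:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_mc = 20, 200_000

X = rng.standard_normal(size=(n_mc, n))   # rows are i.i.d. N(0, I_n) vectors
Z = np.linalg.norm(X, axis=1)             # f(x) = ||x||_2 is 1-Lipschitz (L = 1)

print(Z.var())                            # empirically below the Poincare bound L^2 = 1
```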
X_1, \ldots, X_n denote independent random variables on some probability space with values in \mathcal{X}_1, \ldots, \mathcal{X}_n,
f denotes a measurable function from \mathcal{X}_1 \times \ldots \times \mathcal{X}_n to \mathbb{R}.
Z=f(X_1, \ldots, X_n)
Z is a general function of independent random variables
We assume Z is integrable.
If we had Z = \sum_{i=1}^n X_i, we could write
\operatorname{var}(Z) = \sum_{i=1}^n \operatorname{var}(X_i) = \sum_{i=1}^n \mathbb{E}\Big[\operatorname{var}( Z \mid X_1, \ldots, X_{i-1}, X_{i+1}, \ldots X_n)\Big]
even though the last expression looks pedantic
Our aim is to show that even if f is not as simple as the sum of its arguments, the last expression can still serve as an upper bound on the variance
We express Z-\mathbb{E} Z as a sum of differences
Denote by \mathbb{E}_i the conditional expectation operator, conditioned on \left(X_{1},\ldots,X_{i}\right):
\mathbb{E}_i Y = \mathbb{E}\left[ Y \mid \sigma(X_{1},\ldots,X_{i})\right]
Convention: \mathbb{E}_0=\mathbb{E}
For every i=1,\ldots,n:
\Delta_{i}=\mathbb{E}_i Z -\mathbb{E}_{i-1} Z
Z - \mathbb{E}Z = \sum_{i=1}^n \left(\mathbb{E}_i Z - \mathbb{E}_{i-1}Z \right)= \sum_{i=1}^n \Delta_i
Starting from the decomposition
Z-\mathbb{E} Z =\sum_{i=1}^{n}\Delta_{i}
one has
\operatorname{var}\left(Z\right) =\mathbb{E}\left[ \left( \sum_{i=1}^{n}\Delta_{i}\right) ^{2}\right] =\sum_{i=1}^{n}\mathbb{E}\left[ \Delta_{i}^{2}\right] +2\sum_{j>i}\mathbb{E}\left[ \Delta_{i}\Delta _{j}\right]
Now if j>i, \mathbb{E}_i \Delta_{j} =0 implies that
\mathbb{E}_i\left[ \Delta_{j}\Delta_{i}\right] =\Delta_{i}\mathbb{E}_{i} \Delta_{j} =0
and, a fortiori,
\mathbb{E}\left[ \Delta_{j}\Delta_{i}\right] =0
We obtain the following analog of the additivity formula of the variance:
\operatorname{var}\left( Z\right) =\mathbb{E}\left[ \left( \sum_{i=1}^{n}\Delta_{i}\right) ^{2}\right] =\sum_{i=1}^{n}\mathbb{E}\left[ \Delta_{i}^{2}\right]
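The identity \operatorname{var}(Z) = \sum_{i=1}^n \mathbb{E}[\Delta_i^2] can be verified exactly on a tiny discrete example by brute-force enumeration. The sketch below is only an illustration with an arbitrarily chosen f, not part of the argument:

```python
import itertools
import numpy as np

# Three independent Rademacher variables, uniform on {-1, +1}^3.
outcomes = list(itertools.product([-1, 1], repeat=3))
prob = 1.0 / len(outcomes)                     # each outcome has probability 1/8

def f(x):
    return float(max(x[0] + x[1], x[2]))       # an arbitrary non-additive function

def cond_exp(i, x):
    """E_i Z = E[f(X) | X_1 = x_1, ..., X_i = x_i], computed by enumeration."""
    tails = itertools.product([-1, 1], repeat=3 - i)
    return np.mean([f(x[:i] + t) for t in tails])

EZ = sum(prob * f(x) for x in outcomes)
var_Z = sum(prob * (f(x) - EZ) ** 2 for x in outcomes)

# Martingale differences Delta_i = E_i Z - E_{i-1} Z (with E_0 = E).
sum_E_delta_sq = sum(
    prob * (cond_exp(i, x) - cond_exp(i - 1, x)) ** 2
    for i in range(1, 4)
    for x in outcomes
)

print(var_Z, sum_E_delta_sq)                   # the two quantities coincide
```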
Up to now, we have not made any use of the fact that Z is a function of independent variables X_{1},\ldots,X_{n}
Independence may be used as in the following argument:
For any integrable function Z= f\left( X_{1},\ldots,X_{n}\right) one may write, by the Tonelli-Fubini theorem,
\mathbb{E}_i Z =\int _{\mathcal{X}^{n-i}}f\left( X_{1},\ldots,X_{i},x_{i+1},\ldots,x_{n}\right) d\mu_{i+1}\left( x_{i+1}\right) \ldots d\mu_{n}\left( x_{n}\right) \text{,}
where X_j \sim \mu_{j} for j= 1,\ldots,n
Denote by \mathbb{E}^{(i)} the conditional expectation operator conditioned on X^{(i)}=(X_{1},\ldots,X_{i-1},X_{i+1},\ldots,X_{n}),
\mathbb{E}^{(i)} Y = \mathbb{E}\left[ Y \mid \sigma(X_{1},\ldots,X_{i-1},X_{i+1},\ldots,X_{n})\right]
\mathbb{E}^{(i)}Z =\int_{\mathcal{X}} f\left( X_{1},\ldots,X_{i-1},x_{i},X_{i+1},\ldots,X_{n}\right) d\mu_{i}\left(x_{i}\right)
Again by the Tonelli-Fubini theorem:
\mathbb{E}_i\left[ \mathbb{E}^{\left( i\right) } Z \right] =\mathbb{E}_{i-1} Z
Let X_1,\ldots,X_n be independent random variables and let Z=f(X) be a square-integrable function of X=\left( X_{1},\ldots,X_{n}\right).
Then
\operatorname{var}\left( Z\right) \leq \sum_{i=1}^n\mathbb{E}\left[ \left( Z-\mathbb{E}^{(i)} Z \right)^2\right] = v
Let X_1',\ldots,X_n' be independent copies of X_1,\ldots,X_n and
Z_i'= f\left(X_1,\ldots,X_{i-1},X_i',X_{i+1},\ldots,X_n\right)~,
then
v=\frac{1}{2}\sum_{i=1}^n\mathbb{E}\left[ \left( Z-Z_i'\right)^2\right] =\sum_{i=1}^n\mathbb{E}\left[ \left( Z-Z_i'\right)_+^2\right] =\sum_{i=1}^n\mathbb{E}\left[ \left( Z-Z_i'\right)_-^2\right]
where x_+=\max(x,0) and x_-=\max(-x,0) denote the positive and negative parts of a real number x.
v=\inf_{Z_{i}}\sum_{i=1}^{n}\mathbb{E}\left[ \left( Z-Z_{i}\right)^2\right]~,
where the infimum is taken over the class of all X^{(i)}-measurable and square-integrable variables Z_{i}, i=1,\ldots,n.
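The Efron-Stein bound and the symmetrized form of v can be explored by simulation. The sketch below is an illustration under assumptions made here (Z is the maximum of n uniform variables); it estimates \operatorname{var}(Z) and \frac{1}{2}\sum_{i=1}^n\mathbb{E}[(Z-Z_i')^2] by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_mc = 10, 200_000

X = rng.uniform(size=(n_mc, n))              # X_1, ..., X_n
Xp = rng.uniform(size=(n_mc, n))             # independent copies X_1', ..., X_n'
Z = X.max(axis=1)                            # Z = f(X) = max_i X_i

# v in its symmetrized form: (1/2) * sum_i E[(Z - Z_i')^2],
# where Z_i' is recomputed with X_i replaced by X_i'.
v = 0.0
for i in range(n):
    Xi = X.copy()
    Xi[:, i] = Xp[:, i]
    v += 0.5 * np.mean((Z - Xi.max(axis=1)) ** 2)

print(Z.var(), v)                            # empirically, var(Z) <= v
```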
Using
\mathbb{E}_i\left[\mathbb{E}^{\left(i\right)} Z \right] = \mathbb{E}_{i-1} Z
we may write
\Delta_{i}=\mathbb{E}_i\left[ Z-\mathbb{E}^{\left( i\right) } Z \right]
By the conditional Jensen Inequality,
\Delta_{i}^{2}\leq\mathbb{E}_i\left[ \left( Z-\mathbb{E}^{\left(i\right) }Z \right) ^{2}\right]
To prove the identities for v, denote by \operatorname{var}^{\left(i\right) } the conditional variance operator conditioned on X^{\left( i\right) }:
\operatorname{var}^{\left(i\right)}(Y) = \mathbb{E}\left[ \left(Y - \mathbb{E}^{\left(i\right)}Y\right)^2\mid \sigma(X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n)\right]
Then we may write v as
v=\sum_{i=1}^{n}\mathbb{E}\left[ \operatorname{var}^{\left( i\right) }\left(Z\right) \right]
To handle \operatorname{var}^{\left(i\right)}(Z), one may simply use (conditionally) the elementary fact that if X and Y are independent and identically distributed real-valued random variables, then
\operatorname{var}(X)=(1/2) \mathbb{E}[(X-Y)^2]
Conditionally on X^{\left( i\right) }, Z_i' is an independent copy of Z
\operatorname{var}^{\left( i\right) }\left( Z\right) =\frac{1}{2}\mathbb{E} ^{\left( i\right) }\left[ \left( Z-Z_i'\right)^2\right] =\mathbb{E}^{\left( i\right) }\left[ \left( Z-Z_i'\right)_+^2\right] =\mathbb{E}^{\left( i\right) }\left[ \left( Z-Z_i'\right)_-^2\right]
where we used the fact that the conditional distributions of Z and Z_i' are identical
X =\begin{pmatrix}0 & \epsilon_{1,2} & \ldots & \epsilon_{1,n} \\ \epsilon_{1,2} & 0 & \ddots & \vdots \\ \vdots & \ddots & \ddots & \epsilon_{n-1,n}\\ \epsilon_{1,n} & \ldots & \epsilon_{n-1,n} & 0 \end{pmatrix}
where (\epsilon_{i,j})_{i<j} are i.i.d. random symmetric signs
Z = \sup_{\|\lambda\|_2 \leq 1} \lambda^T X \lambda = 2 \sup_{\|\lambda\|_2 \leq 1} \sum_{i< j} \lambda_i \lambda_j \epsilon_{i,j}
\operatorname{var}\left(Z\right) \leq 4
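Since X has zero trace and is never the zero matrix, Z is the largest eigenvalue of X. A quick simulation (an illustration with a matrix size assumed here, not part of the proof) suggests how conservative the bound \operatorname{var}(Z)\leq 4 is:

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_mc = 30, 2_000
top_eig = np.empty(n_mc)

for k in range(n_mc):
    signs = rng.choice([-1.0, 1.0], size=(n, n))
    upper = np.triu(signs, k=1)              # keep only the strictly upper-triangular signs
    X = upper + upper.T                      # symmetric sign matrix with zero diagonal
    top_eig[k] = np.linalg.eigvalsh(X)[-1]   # Z = largest eigenvalue of X

print(top_eig.var())                         # typically far below the Efron-Stein bound 4
```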
Laws of large numbers are asymptotic statements. In applications, for instance in Statistics and in Statistical Learning Theory, it is often desirable to have guarantees for fixed n. Exponential inequalities are refinements of Chebyshev's inequality. Under strong integrability assumptions on the summands, it is possible, and relatively easy, to derive sharp tail bounds for sums of independent random variables.
The upper bound on the variance of Y, a random variable taking its values in [a,b] almost surely, has been established.
Now let P denote the distribution of Y, write \psi_Y(\lambda) = \log \mathbb{E}\, \mathrm{e}^{\lambda (Y - \mathbb{E}Y)} for the logarithm of the moment generating function of Y-\mathbb{E}Y, and let P_{\lambda} be the probability distribution with density
x \rightarrow e^{-\psi_{Y}\left( \lambda\right) }e^{\lambda (x - \mathbb{E}Y)}
with respect to P.
Since P_{\lambda} is concentrated on [a,b] ( P_\lambda([a, b]) = P([a, b]) =1 ), the variance of a random variable Z with distribution P_{\lambda} is bounded by (b-a)^2/4
Note that P_0 = P.
Dominated convergence arguments allow us to compute the derivatives of \psi_Y(\lambda).
Namely
\psi'_Y(\lambda) = \frac{\mathbb{E}\Big[ (Y- \mathbb{E}Y) e^{\lambda (Y- \mathbb{E}Y)} \Big]}{\mathbb{E} e^{\lambda (Y- \mathbb{E}Y)}} = \mathbb{E}_{P_\lambda} Z
and
\psi^{\prime\prime}_Y(\lambda) = \frac{\mathbb{E}\Big[ (Y- \mathbb{E}{Y})^2 e^{\lambda (Y- \mathbb{E}Y)} \Big]}{\mathbb{E} e^{\lambda (Y- \mathbb{E}Y)}} - \Bigg(\frac{\mathbb{E}\Big[ (Y- \mathbb{E}{Y}) e^{\lambda (Y- \mathbb{E}Y)} \Big]}{\mathbb{E} e^{\lambda (Y- \mathbb{E}Y)}}\Bigg)^2 = \operatorname{var}_{P_\lambda}(Z)
Hence, thanks to the variance upper bound: \begin{align*} \psi_Y^{\prime\prime}(\lambda) & \leq \frac{(b-a)^2}{4}~. \end{align*}
Note that \psi_{Y}(0) = \psi_{Y}'(0) =0, and by Taylor's theorem, for some \theta \in [0,\lambda],
\psi_Y(\lambda) = \psi_Y(0) + \lambda\psi_Y'(0) + \frac{\lambda^2}{2}\psi_Y''(\theta) \leq \frac{\lambda^2(b-a)^2}{8}
The upper bound on the variance is sharp in the special case of a Rademacher random variable X whose distribution is defined by
P\{X =-1\} = P\{X =1\} = 1/2
Then one may take a=-1 and b=1, so that \operatorname{var}(X) =1=\left( b-a\right)^2/4.
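Hoeffding's Lemma is easy to check numerically for a fixed distribution. The sketch below (with a Bernoulli(0.3) variable chosen here as an example, so a=0 and b=1) evaluates \psi_Y(\lambda) on a grid and compares it with \lambda^2(b-a)^2/8:

```python
import numpy as np

p, a, b = 0.3, 0.0, 1.0                     # Y ~ Bernoulli(p) takes values in [a, b]
lambdas = np.linspace(-5.0, 5.0, 201)

# psi_Y(lambda) = log E exp(lambda (Y - EY)) for this two-point distribution
EY = p
psi = np.log((1 - p) * np.exp(lambdas * (a - EY)) + p * np.exp(lambdas * (b - EY)))
bound = lambdas ** 2 * (b - a) ** 2 / 8

print(bool(np.all(psi <= bound + 1e-12)))   # True: Hoeffding's Lemma holds on the grid
```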
We can now build on Hoeffding's Lemma to derive very practical tail bounds for sums of bounded independent random variables.
Let X_1,\ldots,X_n be independent random variables such that X_i takes its values in [a_i,b_i] almost surely for all i\leq n.
Let
S=\sum_{i=1}^n\left(X_i- \mathbb{E} X_i \right)
Then
\operatorname{var}(S) \leq v = \sum_{i=1}^n \frac{(b_i-a_i)^2}{4}
\forall \lambda \in \mathbb{R}, \qquad \log \mathbb{E} \mathrm{e}^{\lambda S} \leq \frac{\lambda^2 v}{2}
\forall t>0, \qquad P\left\{ S \geq t \right\} \le \exp\left( -\frac{t^2}{2 v}\right)
The proof is based on the so-called Cramér-Chernoff bounding technique and on Hoeffding's Lemma.
The upper bound on variance follows from \operatorname{var}(S) = \sum_{i=1}^n \operatorname{var}(X_i) and from the first part of Hoeffding's Lemma.
For the upper-bound on \log \mathbb{E} \mathrm{e}^{\lambda S},
\begin{array}{rl}\log \mathbb{E} \mathrm{e}^{\lambda S} & = \log \mathbb{E} \mathrm{e}^{\sum_{i=1}^n \lambda (X_i - \mathbb{E} X_i)} \\ & = \log \mathbb{E} \Big[\prod_{i=1}^n \mathrm{e}^{\lambda (X_i - \mathbb{E} X_i)}\Big] \\ & = \log \Big(\prod_{i=1}^n \mathbb{E} \Big[\mathrm{e}^{\lambda (X_i - \mathbb{E} X_i)}\Big]\Big) \\ & = \sum_{i=1}^n \log \mathbb{E} \Big[\mathrm{e}^{\lambda (X_i - \mathbb{E} X_i)}\Big] \\ & \leq \sum_{i=1}^n \frac{\lambda^2 (b_i-a_i)^2}{8} \\ & = \frac{\lambda^2 v}{2}\end{array}
where the third equality comes from independence of the X_i's and the inequality follows from invoking Hoeffding's Lemma for each summand.
The Cramér-Chernoff technique consists of using Markov's inequality with exponential moments.
\begin{array}{rl}P \big\{ S \geq t \big\} & \leq \inf_{\lambda\geq 0}\frac{\mathbb{E} \mathrm{e}^{\lambda S}}{\mathrm{e}^{\lambda t}} \\ & \leq \exp\Big(- \sup_{\lambda \geq 0} \big( \lambda t - \log \mathbb{E} \mathrm{e}^{\lambda S}\big) \Big)\\ & \leq \exp\Big(- \sup_{\lambda \geq 0}\big( \lambda t - \frac{\lambda^2 v}{2}\big) \Big) \\ & = \mathrm{e}^{- \frac{t^2}{2v} }\end{array}
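For a concrete feel for the tail bound, the sketch below (with parameters assumed here: n centered uniform [0,1] summands) compares the empirical tail P\{S\geq t\} with \exp(-t^2/(2v)):

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_mc = 50, 100_000

X = rng.uniform(0.0, 1.0, size=(n_mc, n))   # X_i uniform on [a_i, b_i] = [0, 1]
S = (X - 0.5).sum(axis=1)                   # S = sum_i (X_i - E X_i)
v = n * (1.0 - 0.0) ** 2 / 4                # Hoeffding variance proxy

for t in [2.0, 4.0, 6.0]:
    print(t, np.mean(S >= t), np.exp(-t ** 2 / (2 * v)))   # empirical tail vs bound
```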
Hoeffding's inequality provides interesting tail bounds for binomial random variables which are sums of independent [0,1]-valued random variables.
However in some cases, the variance upper bound used in Hoeffding's inequality is excessively conservative.
Think for example of a binomial random variable with parameters n and \mu/n: the variance upper bound obtained from the boundedness assumption is n/4, while the true variance is \mu(1-\mu/n)\leq\mu
In this section we combine Hoeffding's inequality and conditioning to establish the so-called bounded differences inequality (also known as McDiarmid's inequality). This inequality is a first example of the concentration of measure phenomenon. This phenomenon is best portrayed by the following saying:
A function of many independent random variables that does not depend too much on any of them is concentrated around its mean or median value.
Let X_1, \ldots, X_n be independent with values in \mathcal{X}_1, \mathcal{X}_2, \ldots, \mathcal{X}_n.
Let f : \mathcal{X}_1 \times \mathcal{X}_2 \times \ldots \times \mathcal{X}_n \to \mathbb{R} be measurable
Assume there exist non-negative constants c_1, \ldots, c_n satisfying
\forall (x_1, \ldots, x_n) \in \prod_{i=1}^n \mathcal{X}_i, \ \forall (y_1, \ldots, y_n) \in \prod_{i=1}^n \mathcal{X}_i,
\Big| f(x_1, \ldots, x_n) - f(y_1, \ldots, y_n)\Big| \leq \sum_{i=1}^n c_i \mathbb{I}_{x_i\neq y_i}
Let Z = f(X_1, \ldots, X_n) and v = \sum_{i=1}^n \frac{c_i^2}{4}
Then \operatorname{var}(Z) \leq v
\log \mathbb{E} \mathrm{e}^{\lambda (Z -\mathbb{E}Z)} \leq \frac{\lambda^2 v}{2}\qquad \text{and} \qquad P \Big\{ Z \geq \mathbb{E}Z + t \Big\} \leq \mathrm{e}^{-\frac{t^2}{2v}}
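As an illustration of the bounded differences condition (an example chosen here, not from the slides), let Z be the number of distinct values among n independent draws from \{1,\ldots,k\}: changing one draw changes Z by at most 1, so c_i=1 and v=n/4. A Monte Carlo sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, n_mc = 100, 50, 20_000

draws = rng.integers(1, k + 1, size=(n_mc, n))
Z = np.array([np.unique(row).size for row in draws])    # number of distinct values

v = n / 4                                               # c_i = 1 for every coordinate
t = 5.0
print(Z.var(), v)                                       # var(Z) <= v = 25

# Z.mean() below is a Monte Carlo stand-in for E[Z].
print(np.mean(Z >= Z.mean() + t), np.exp(-t ** 2 / (2 * v)))   # tail estimate vs bound
```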
The variance bound is an immediate consequence of the Efron-Stein-Steele inequalities.
The tail bound follows from the upper bound on the logarithmic moment generating function by Cramér-Chernoff bounding.
To check the upper-bound on the logarithmic moment generating function, we proceed by induction on the number of arguments n.
If n=1, the upper-bound on the logarithmic moment generating function is just Hoeffding's Lemma.
Assume the upper-bound is valid up to n-1.
\begin{array}{rl} \mathbb{E} \mathrm{e}^{\lambda (Z - \mathbb{E}Z)} & = \mathbb{E}\Big[ \mathbb{E}_{n-1}\mathrm{e}^{\lambda (Z - \mathbb{E}Z)} \Big] \\ & = \mathbb{E}\Big[ \mathbb{E}_{n-1}\big[\mathrm{e}^{\lambda (Z - \mathbb{E}_{n-1}Z)}\big] \times \mathrm{e}^{\lambda (\mathbb{E}_{n-1}Z - \mathbb{E}Z)} \Big]\end{array}
Now,
\mathbb{E}_{n-1}Z = \int_{\mathcal{X}_n} f(X_1,\ldots,X_{n-1}, u) \mathrm{d}P_{X_n}(u) \qquad\text{a.s.}
and
\begin{array}{rl} & \mathbb{E}_{n-1}\big[\mathrm{e}^{\lambda (Z - \mathbb{E}_{n-1}Z)}\big] \\ & = \int_{\mathcal{X}_n} \exp\Big(\lambda \int_{\mathcal{X}_n} f(X_1,\ldots,X_{n-1}, v) -f(X_1,\ldots,X_{n-1}, u) \mathrm{d}P_{X_n}(u) \Big) \mathrm{d}P_{X_n}(v)\end{array}
For every x_1, \ldots, x_{n-1} \in \mathcal{X_1} \times \ldots \times \mathcal{X}_{n-1}, for every v, v' \in \mathcal{X}_n,
\begin{array}{rl} & \Big| \int_{\mathcal{X}_n} f(x_1,\ldots,x_{n-1}, v) -f(x_1,\ldots,x_{n-1}, u) \mathrm{d}P_{X_n}(u) \\ & - \int_{\mathcal{X}_n} f(x_1,\ldots,x_{n-1}, v') -f(x_1,\ldots,x_{n-1}, u) \mathrm{d}P_{X_n}(u)\Big| \leq c_n \end{array}
By Hoeffding's Lemma
\mathbb{E}_{n-1}\big[\mathrm{e}^{\lambda (Z - \mathbb{E}_{n-1}Z)}\big] \leq \mathrm{e}^{\frac{\lambda^2 c_n^2}{8}}
\begin{array}{rl} \mathbb{E} \mathrm{e}^{\lambda (Z - \mathbb{E}Z)} & \leq \mathbb{E}\Big[ \mathrm{e}^{\lambda (\mathbb{E}_{n-1}Z - \mathbb{E}Z)} \Big] \times \mathrm{e}^{\frac{\lambda^2 c_n^2}{8}} \, . \end{array}
But, on the event \{X_1=x_1, \ldots, X_{n-1}=x_{n-1}\},
\mathbb{E}_{n-1}Z - \mathbb{E}Z = \int_{\mathcal{X}_n} f(x_1,\ldots,x_{n-1}, v) \mathrm{d}P_{X_n}(v) - \mathbb{E}Z \,,
so \mathbb{E}_{n-1}Z - \mathbb{E}Z is a function of the n-1 independent random variables X_1, \ldots, X_{n-1} that satisfies the bounded differences condition with constants c_1, \ldots, c_{n-1}.
By the induction hypothesis:
\mathbb{E}\Big[ \mathrm{e}^{\lambda (\mathbb{E}_{n-1}Z - \mathbb{E}Z)} \Big] \leq \mathrm{e}^{\frac{\lambda^2}{2} \sum_{i=1}^{n-1} \frac{c_i^2}{4}}
Combining the last two displays yields \mathbb{E}\, \mathrm{e}^{\lambda (Z - \mathbb{E}Z)} \leq \mathrm{e}^{\frac{\lambda^2}{2} \sum_{i=1}^{n} \frac{c_i^2}{4}} = \mathrm{e}^{\lambda^2 v/2}, which completes the induction.
The main idea behind maximal inequalities is perhaps most transparent if we consider sub-Gaussian random variables.
Let Z_1,\ldots,Z_N be real-valued random variables and let v>0 be such that, for every i=1,\ldots,N, the logarithm of the moment generating function of Z_i satisfies \psi_{Z_i}(\lambda) \leq \lambda^2v/2 for all \lambda >0.
Then by Jensen's inequality,
\begin{array}{rcl} \exp \left(\lambda\,\mathbb{E} \max_{i=1,\ldots,N} Z_i \right) & \leq & \mathbb{E} \exp\left(\lambda \max_{i=1,\ldots,N} Z_i \right) \\ & = & \mathbb{E} \max_{i=1,\ldots,N} e^{\lambda Z_i} \\ & \leq & \sum_{i=1}^N \mathbb{E} e^{\lambda Z_i} \\ & \leq & N e^{\lambda^2v/2} \end{array}
Taking logarithms on both sides and dividing by \lambda>0, we have
\mathbb{E} \max_{i=1,\ldots,N} Z_i \le \frac{\log N}{\lambda} + \frac{\lambda v}{2}
The upper bound is minimized for \lambda = \sqrt{2\log N/v} which yields
\mathbb{E} \max_{i=1,\ldots,N}Z_i\le \sqrt {2v\log N}
This simple bound is (asymptotically) sharp if the Z_i are i.i.d. \mathcal{N}(0,v) random variables.
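The sketch below (a simulation with values of v and N assumed here) compares a Monte Carlo estimate of \mathbb{E}\max_{i} Z_i for i.i.d. \mathcal{N}(0,v) variables with \sqrt{2v\log N}:

```python
import numpy as np

rng = np.random.default_rng(5)
v, N, n_mc = 1.0, 1_000, 20_000

Z = rng.normal(scale=np.sqrt(v), size=(n_mc, N))   # each row: Z_1, ..., Z_N i.i.d. N(0, v)
emp = Z.max(axis=1).mean()                         # Monte Carlo estimate of E max_i Z_i
bound = np.sqrt(2 * v * np.log(N))

print(emp, bound)                                  # the estimate sits below, but close to, the bound
```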
Let \psi be a convex and continuously differentiable function defined on \left[ 0,b\right) where 0<b\leq\infty.
Assume that \psi\left( 0\right) =\psi'\left( 0\right) =0 and set, for every t\geq0,
\psi^*(t) =\sup_{\lambda\in (0,b)} \left( \lambda t-\psi(\lambda)\right)
Then \psi^* is a nonnegative convex and nondecreasing function on [0,\infty).
For every y\geq 0, \left\{ t \ge 0: \psi^*(t) >y\right\}\neq \emptyset and the generalized inverse of \psi^*, defined by
\psi^{*\leftarrow}(y) =\inf\left\{ t\ge 0:\psi^*(t) >y \right\}
can also be written as
\psi^{*\leftarrow}(y) =\inf_{\lambda\in (0,b) } \left[ \frac{y +\psi(\lambda)}{\lambda}\right]
By definition, \psi^* is the supremum of convex and nondecreasing functions on [0,\infty) and \psi^*(0) =0,
therefore
\psi^* is a nonnegative, convex, and nondecreasing function on [0,\infty).
Given \lambda\in (0,b), since \psi^*(t) \geq\lambda t-\psi(\lambda) for every t\ge 0, \psi^* is unbounded, which shows that
\forall y\geq 0, \qquad \left\{ t\geq 0:\psi^*(t) >y\right\} \neq \emptyset
Define
u=\inf_{\lambda\in (0,b)} \left[ \frac{y+\psi(\lambda) }{\lambda}\right]
Then, for every t \ge 0, we have u\geq t if and only if
\forall \lambda \in (0,b), \qquad \frac{y+\psi(\lambda) }{\lambda}\geq t
As this is equivalent to y\ge \psi^*(t), we have \left\{ t\ge 0:\psi^*(t)> y\right\} = (u,\infty)
This proves that u=\psi^{*\leftarrow}(y) by definition of \psi^{*\leftarrow}.
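To see the two expressions for \psi^{*\leftarrow} agree numerically, the sketch below (an illustration in the sub-Gaussian case \psi(\lambda)=v\lambda^2/2, with values assumed here, where \psi^{*\leftarrow}(y)=\sqrt{2vy}) evaluates both:

```python
import numpy as np

v, y = 2.0, np.log(1000.0)                   # assumed variance factor and level y

lambdas = np.linspace(1e-3, 20.0, 200_000)   # grid standing in for (0, b) with b = infinity
psi = v * lambdas ** 2 / 2

inf_form = np.min((y + psi) / lambdas)       # inf_lambda (y + psi(lambda)) / lambda
closed_form = np.sqrt(2 * v * y)             # psi^{*<-}(y) computed from psi*(t) = t^2 / (2 v)

print(inf_form, closed_form)                 # the two values agree up to grid error
```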
Let Z_1,\ldots,Z_N be real-valued random variables such that for every \lambda\in (0,b) and i=1,\ldots,N, the logarithm of the moment generating function of Z_i satisfies
\psi_{Z_i}(\lambda) \leq \psi(\lambda)
where \psi is a convex and continuously differentiable function on \left[0,b\right) with 0<b\leq\infty such that \psi(0)=\psi'(0)=0
Then
\mathbb{E} \max_{i=1,\ldots,N} Z_i \leq \psi^{*\leftarrow}(\log N)
By Jensen's inequality, for any \lambda\in (0,b),
\exp\left( \lambda \mathbb{E} \max_{i=1,\ldots,N}Z_i \right) \leq \mathbb{E} \exp\left( \lambda\max_{i=1,\ldots,N}Z_i \right) = \mathbb{E} \max_{i=1,\ldots,N}\exp\left(\lambda Z_i \right)
Recalling that \psi_{Z_i}(\lambda) =\log\mathbb{E}\exp\left(\lambda Z_i \right),
\exp\left( \lambda \mathbb{E} \max_{i=1,\ldots,N}Z_i \right)\leq \sum_{i=1}^N \mathbb{E} \exp\left(\lambda Z_i\right) \leq N \exp\left( \psi(\lambda) \right)
Therefore, for any \lambda\in (0,b),
\lambda \mathbb{E} \max_{i=1,\ldots,N}Z_i -\psi(\lambda) \leq \log N
which means that
\mathbb{E} \max_{i=1,\ldots,N}Z_i \leq \inf_{\lambda\in (0,b)}\left( \frac{\log N +\psi(\lambda) }{\lambda}\right)
and the result follows from the preceding lemma.
chi-squared distribution
If p is a positive integer, a gamma random variable with shape parameter a=p/2 and scale parameter b=2 is said to have the chi-square distribution with p degrees of freedom ( \chi^2_p )
If Y_1,\ldots,Y_p \sim_{\text{i.i.d.}} \mathcal{N}(0,1), then \sum_{i=1}^p Y_i^2 \sim \chi^2_p
If X_1,\ldots,X_N each have the chi-square distribution with p degrees of freedom,
then
\mathbb{E}\left[ \max_{i=1,\ldots,N} X_i - p\right] \leq 2\sqrt{p\log N }+ 2\log N
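The bound can be compared with a simulation; the sketch below (with p and N assumed here) estimates \mathbb{E}[\max_{i} X_i - p] for i.i.d. \chi^2_p variables and prints it next to 2\sqrt{p\log N}+2\log N:

```python
import numpy as np

rng = np.random.default_rng(6)
p, N, n_mc = 10, 500, 20_000

X = rng.chisquare(p, size=(n_mc, N))               # each row: X_1, ..., X_N i.i.d. chi^2_p
emp = (X.max(axis=1) - p).mean()                   # Monte Carlo estimate of E[max_i X_i - p]
bound = 2 * np.sqrt(p * np.log(N)) + 2 * np.log(N)

print(emp, bound)                                  # the bound dominates the estimate
```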