
Saturday, July 2, 2016

On the separation of nearest neighbors

We work through Lemma 3 (called the "$A-B$ Lemma" or the "cleaning procedure") of [2], adopting a cleaner and more thorough approach.

Necessary tools

Definition: The inverse of the complex-valued function $f(z) = ze^z$ is called the Lambert $W$-function and denoted by $W = f^{-1}$. When restricted to the real numbers, it is double-valued on the interval $(-1/e,0)$, so it is split into two branches: the principal branch $W_0$ (taking values $\geqslant -1$) and $W_{-1}$ (taking values $\leqslant -1$).
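As a quick numerical illustration of the two real branches, here is a small sketch using SciPy's scipy.special.lambertw (assuming SciPy is available; the input $z=-0.2$ is arbitrary):

import numpy as np
from scipy.special import lambertw

z = -0.2  # any z in (-1/e, 0) lies on both real branches
w0 = lambertw(z, k=0).real    # principal branch W_0, value >= -1
wm1 = lambertw(z, k=-1).real  # branch W_{-1}, value <= -1

# both satisfy the defining relation W(z) * exp(W(z)) = z
print(w0, w0 * np.exp(w0))      # approx -0.2592, -0.2
print(wm1, wm1 * np.exp(wm1))   # approx -2.5426, -0.2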

Hoeffding's inequality bounds the probability that a sum (or average) of bounded, independent random variables deviates from its mean by a given amount. The authors of [2] use a similar inequality called the Chernoff bound, but Hoeffding gives a tighter bound on the desired event.

Proposition:
(Hoeffding - Theorem 2 and Equation (1.4) of [1])
Let $X_1,\dots,X_n$ be independent random variables, with $X_i$ bounded on the interval $[a_i,b_i]$. Then
\[
P\left(\left| \frac1n\sum_{i=1}^n X_i - \frac 1n\sum_{i=1}^n E[X_i]\right|\geqslant t\right) \leqslant 2\exp\left(\frac{-2t^2n^2}{\sum_{i=1}^n(b_i-a_i)^2}\right).
\]
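To get a feel for how conservative the bound is, here is a small simulation sketch; the uniform distribution and the choices of $n$ and $t$ below are arbitrary, not taken from [1] or [2].

import numpy as np

rng = np.random.default_rng(0)
n, trials, t = 200, 100_000, 0.05

# X_i ~ Uniform[0,1], so each is bounded on [a_i, b_i] = [0, 1] with mean 1/2
samples = rng.uniform(0.0, 1.0, size=(trials, n))
deviations = np.abs(samples.mean(axis=1) - 0.5)

empirical = np.mean(deviations >= t)
hoeffding = 2 * np.exp(-2 * t**2 * n**2 / n)  # here sum (b_i - a_i)^2 = n

print("empirical deviation probability:", empirical)
print("Hoeffding upper bound:", hoeffding)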

The union bound (or Boole's inequality) says that the probability that at least one event in a collection occurs is no larger than the sum of the individual probabilities of the events.

Proposition:
Let $A_1,A_2,\dots$ be a countable collection of events. Then $P(\bigcup_i A_i) \leqslant \sum_i P(A_i)$.
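A quick numerical check of the union bound, with two arbitrary overlapping events under a uniform draw (purely illustrative):

import numpy as np

rng = np.random.default_rng(2)
u = rng.uniform(size=1_000_000)
A1 = u < 0.3                 # P(A1) = 0.3
A2 = (u > 0.2) & (u < 0.6)   # P(A2) = 0.4, overlaps A1
print(np.mean(A1 | A2), "<=", np.mean(A1) + np.mean(A2))  # approx 0.6 <= 0.7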

The setup

Let $P$ be a probability distribution on $\R^n$ and $X=\{x_1,\dots,x_k\}\subseteq \R^n$ a finite set of points drawn according to $P$. These points may be considered as random variables $X_1,X_2,\dots,X_k$ on the sample space $\R^n$, with $X_i$ evaluating to 1 only on $x_i$, and 0 otherwise. Choose $s>0$ and construct the nearest neighbor graph $G$ on $X$ with parameter $s$, in which $x_i$ and $x_j$ are adjacent exactly when $x_j\in B^n(s,x_i)$. Write $X=A\cup B$ and set
\[
\eta := \inf_{a\in A,b\in B}\left\{|| a-b||\right\}
\hspace{1cm},\hspace{1cm}
\alpha_s := \inf_{a\in A}\left\{P(B^n(s,a))\right\}
\hspace{1cm},\hspace{1cm}
\beta_s := \sup_{b\in B}\left\{P(B^n(s,b))\right\},
\]
with $h = (\alpha_s-\beta_s)/2$. We assume that
  • $\eta>0$, so $A$ and $B$ are disjoint;
  • $s<\eta/2$, so $A$ and $B$ are in separate components of $G$; and
  • $\alpha_s >\beta_s$, so a ball of radius $s$ about any point of $A$ carries more $P$-mass than a ball of radius $s$ about any point of $B$.
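Before stating the proposition, here is a minimal computational sketch of this setup. The two Gaussian clusters, the value $s=1.5$, and the function name are invented for illustration; adjacency is taken to mean lying in $B^n(s,\cdot)$ (with the ball taken to be open, an assumed convention), matching the use of $\deg_G$ in the proof below.

import numpy as np

def nn_graph_degrees(points: np.ndarray, s: float) -> np.ndarray:
    # degrees in the graph on `points` where x_i ~ x_j (i != j) iff ||x_i - x_j|| < s
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    adjacency = (dists < s) & ~np.eye(len(points), dtype=bool)
    return adjacency.sum(axis=1)

# toy example: A clustered near the origin, B clustered near (10, 0),
# so eta is roughly 10 and s = 1.5 < eta / 2
rng = np.random.default_rng(1)
A_pts = rng.normal(loc=(0.0, 0.0), scale=0.5, size=(30, 2))
B_pts = rng.normal(loc=(10.0, 0.0), scale=0.5, size=(30, 2))
X_pts = np.vstack([A_pts, B_pts])

degrees = nn_graph_degrees(X_pts, s=1.5)
print(degrees[:30])   # degrees of the points near A
print(degrees[30:])   # degrees of the points near B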
Proposition: Choose $\delta\in (0,1)$. If $|X| >-W_{-1}(-\delta h^2e^{-2h^2})/(2h^2)$, then with probability at least $1-\delta$, for all $a\in A$ and $b\in B$,
\[
\frac{\deg_G(a)}{k - 1} > \frac{\alpha_s+\beta_s}2
\hspace{1cm}\text{and}\hspace{1cm}
\frac{\deg_G(b)}{k - 1} < \frac{\alpha_s+\beta_s}2.
\]
The statement also holds with $\alpha,\beta$ in place of $\alpha_s,\beta_s$, for any $\alpha,\beta$ such that $\alpha_s\geqslant \alpha >\beta \geqslant \beta_s$, which may be useful for bounding the degrees of vertices in $G$.
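The threshold on $|X|$ can be evaluated numerically; the following sketch (with arbitrary $\delta$ and $h$, again relying on SciPy's lambertw) also checks it against the inequality $\delta > 2ke^{-2h^2(k-1)}$ that drives the proof below.

import numpy as np
from scipy.special import lambertw

def required_sample_size(delta: float, h: float) -> float:
    # the proposition's threshold: -W_{-1}(-delta h^2 e^{-2h^2}) / (2 h^2)
    return -lambertw(-delta * h**2 * np.exp(-2 * h**2), k=-1).real / (2 * h**2)

delta, h = 0.05, 0.1
k_min = required_sample_size(delta, h)
print(k_min)  # roughly 496 points for these values

# sanity check: delta > 2 k exp(-2 h^2 (k - 1)) should fail just below
# the threshold and hold just above it
for k in (int(np.floor(k_min)), int(np.ceil(k_min))):
    print(k, 2 * k * np.exp(-2 * h**2 * (k - 1)) < delta)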

The proof

For each $i=1,\dots,k$, define new random variables $Y_{ij}$ ($j\neq i$) on the sample space $X$, with $Y_{ij}$ evaluating to 1 on $x_j$ iff $x_j\in B^n(s,x_i)$, and evaluating to 0 otherwise. The mean of $Y_{ij}$ is $P(B^n(s,x_i))$. For fixed $i$, the $Y_{ij}$ are independent and bounded on $[0,1]$ with common mean, so Hoeffding's inequality (with $n=k-1$ and $t=h$) gives that
\[
\left(\begin{array}{c}
\text{the probability that the fraction of the}\\
\text{sampled $x_j$ lying in $B^n(s,x_i)$ differs} \\
\text{from $P(B^n(s,x_i))$ by at least $h$}
\end{array}\right)
=
P\Bigg(\underbrace{\left|\frac{1}{k-1}\sum_{j\neq i} Y_{ij} - P(B^n(s,x_i))\right| \geqslant h}_{\text{event}\ A_i}\Bigg)  \leqslant 2e^{-2h^2(k-1)}.
\]
The union bound gives that
\[
\left(\begin{array}{c}
\text{the probability that at}\\
\text{least one $A_i$ occurs}
\end{array}\right)
=
P\left(\bigcup_{i=1}^k A_i\right) \leqslant \sum_{i=1}^k P(A_i) \leqslant 2ke^{-2h^2(k-1)}.
\]
Note that $\sum_{j\neq i} Y_{ij} = \deg_G(x_i)$ for every $i$, so whenever $\delta>2ke^{-2h^2(k-1)}$, with probability at least $1-\delta$ we have, for every $i$ simultaneously,
\[
\left| \frac{\deg_G(x_i)}{k-1} - P(B^n(s,x_i))\right| < h
\hspace{1cm}\text{or, equivalently,}\hspace{1cm}
P(B^n(s,x_i)) -h < \frac{\deg_G(x_i)}{k-1} < P(B^n(s,x_i)) + h.
\]
When $x_i = a\in A$ we have the lower bound $P(B^n(s,a))\geqslant \alpha_s$, and when $x_i = b\in B$ we have the upper bound $P(B^n(s,b))\leqslant \beta_s$. Recalling that $h = (\alpha_s-\beta_s)/2$, it follows that
\[
\frac{\deg_G(a)}{k-1} > \alpha_s- h =  \frac{\alpha_s+\beta_s}2
\hspace{1cm}\text{and}\hspace{1cm}
\frac{\deg_G(b)}{k-1} < \beta_s + h = \frac{\alpha_s+\beta_s}2.
\]
To find how many points we need to sample, we solve for $k$ in the inequality $ \delta > 2ke^{-2h^2(k-1)}$. With the aid of a computer algebra system, we find that
\[
k > \frac{-1}{2h^2}W_{-1}\left(-\delta h^2e^{-2h^2}\right),
\]
completing the proof.
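For readers who prefer to avoid the computer algebra system, the manipulation behind the Lambert $W$ form can be sketched by hand, treating the boundary case of equality:
\[
\delta = 2ke^{-2h^2(k-1)}
\quad\Longleftrightarrow\quad
\frac{\delta e^{-2h^2}}{2} = ke^{-2h^2k}
\quad\Longleftrightarrow\quad
-\delta h^2e^{-2h^2} = \left(-2h^2k\right)e^{-2h^2k},
\]
so $-2h^2k$ is a value of $W$ at $-\delta h^2e^{-2h^2}$. Taking the branch $W_{-1}$ selects the larger of the two roots in $k$, beyond which $2ke^{-2h^2(k-1)}$ stays below $\delta$.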

References:
[1] Hoeffding (Probability inequalities for sums of bounded random variables)
[2] Niyogi, Smale, and Weinberger (A topological view of unsupervised learning from noisy data)