
Saturday, July 2, 2016

On the separation of nearest neighbors

We work through Lemma 3 (called the "$A-B$ Lemma" or the "cleaning procedure") of [2], adopting a cleaner and more thorough approach.

Necessary tools

Definition: The inverse of the complex-valued function $f(z) = ze^z$ is called the Lambert $W$-function and denoted by $W = f^{-1}$. When restricted to the real numbers, it is double-valued on the interval $(-1/e,0)$, so it is split into two branches: the principal branch $W_0$ (taking values $\geqslant -1$) and $W_{-1}$ (taking values $\leqslant -1$).
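As a quick numerical illustration of the two real branches, here is a small sketch using SciPy's scipy.special.lambertw (assuming SciPy is available; the input $z=-0.2$ is arbitrary):

import numpy as np
from scipy.special import lambertw

z = -0.2  # any z in (-1/e, 0) lies on both real branches
w0 = lambertw(z, k=0).real    # principal branch W_0, value >= -1
wm1 = lambertw(z, k=-1).real  # branch W_{-1}, value <= -1

# both satisfy the defining relation W(z) * exp(W(z)) = z
print(w0, w0 * np.exp(w0))      # approx -0.2592, -0.2
print(wm1, wm1 * np.exp(wm1))   # approx -2.5426, -0.2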

Hoeffding's inequality bounds the probability that a sum (or average) of bounded, independent random variables deviates from its mean by a given amount. The authors of [2] use a similar inequality called the Chernoff bound, but Hoeffding gives a tighter bound on the desired event.

Proposition:
(Hoeffding - Theorem 2 and Equation (1.4) of [1])
Let $X_1,\dots,X_n$ be independent random variables, with $X_i$ bounded on the interval $[a_i,b_i]$. Then
\[
P\left(\left| \frac1n\sum_{i=1}^n X_i - \frac 1n\sum_{i=1}^n E[X_i]\right|\geqslant t\right) \leqslant 2\exp\left(\frac{-2t^2n^2}{\sum_{i=1}^n(b_i-a_i)^2}\right).
\]
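To get a feel for how conservative the bound is, here is a small simulation sketch; the uniform distribution and the choices of $n$ and $t$ below are arbitrary, not taken from [1] or [2].

import numpy as np

rng = np.random.default_rng(0)
n, trials, t = 200, 100_000, 0.05

# X_i ~ Uniform[0,1], so each is bounded on [a_i, b_i] = [0, 1] with mean 1/2
samples = rng.uniform(0.0, 1.0, size=(trials, n))
deviations = np.abs(samples.mean(axis=1) - 0.5)

empirical = np.mean(deviations >= t)
hoeffding = 2 * np.exp(-2 * t**2 * n**2 / n)  # here sum (b_i - a_i)^2 = n

print("empirical deviation probability:", empirical)
print("Hoeffding upper bound:", hoeffding)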

The union bound (or Boole's inequality) says that the probability that at least one event in a collection occurs is no larger than the sum of the individual probabilities of the events.

Proposition:
Let $A_1,A_2,\dots$ be a countable collection of events. Then $P(\bigcup_i A_i) \leqslant \sum_i P(A_i)$.
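A quick numerical check of the union bound, with two arbitrary overlapping events under a uniform draw (purely illustrative):

import numpy as np

rng = np.random.default_rng(2)
u = rng.uniform(size=1_000_000)
A1 = u < 0.3                 # P(A1) = 0.3
A2 = (u > 0.2) & (u < 0.6)   # P(A2) = 0.4, overlaps A1
print(np.mean(A1 | A2), "<=", np.mean(A1) + np.mean(A2))  # approx 0.6 <= 0.7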

The setup

Let $P$ be a probability distribution on $\R^n$ and $X=\{x_1,\dots,x_k\}\subseteq \R^n$ a finite set of points drawn according to $P$. These points may be considered as random variables $X_1,X_2,\dots,X_k$ on the sample space $\R^n$, with $X_i$ evaluating to 1 only on $x_i$, and 0 otherwise. Choose $s>0$ and construct the nearest neighbor graph $G$ on $X$ with parameter $s$, in which $x_i$ and $x_j$ are adjacent exactly when $x_j\in B^n(s,x_i)$. Write $X=A\cup B$ and set
\[
\eta := \inf_{a\in A,b\in B}\left\{|| a-b||\right\}
\hspace{1cm},\hspace{1cm}
\alpha_s := \inf_{a\in A}\left\{P(B^n(s,a))\right\}
\hspace{1cm},\hspace{1cm}
\beta_s := \sup_{b\in B}\left\{P(B^n(s,b))\right\},
\]
with $h = (\alpha_s-\beta_s)/2$. We assume that
  • $\eta>0$, so $A$ and $B$ are disjoint;
  • $s<\eta/2$, so $A$ and $B$ are in separate components of $G$; and
  • $\alpha_s >\beta_s$, so a ball of radius $s$ about any point of $A$ carries more $P$-mass than a ball of radius $s$ about any point of $B$.
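Before stating the proposition, here is a minimal computational sketch of this setup. The two Gaussian clusters, the value $s=1.5$, and the function name are invented for illustration; adjacency is taken to mean lying in $B^n(s,\cdot)$ (with the ball taken to be open, an assumed convention), matching the use of $\deg_G$ in the proof below.

import numpy as np

def nn_graph_degrees(points: np.ndarray, s: float) -> np.ndarray:
    # degrees in the graph on `points` where x_i ~ x_j (i != j) iff ||x_i - x_j|| < s
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    adjacency = (dists < s) & ~np.eye(len(points), dtype=bool)
    return adjacency.sum(axis=1)

# toy example: A clustered near the origin, B clustered near (10, 0),
# so eta is roughly 10 and s = 1.5 < eta / 2
rng = np.random.default_rng(1)
A_pts = rng.normal(loc=(0.0, 0.0), scale=0.5, size=(30, 2))
B_pts = rng.normal(loc=(10.0, 0.0), scale=0.5, size=(30, 2))
X_pts = np.vstack([A_pts, B_pts])

degrees = nn_graph_degrees(X_pts, s=1.5)
print(degrees[:30])   # degrees of the points near A
print(degrees[30:])   # degrees of the points near B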
Proposition: Choose $\delta\in (0,1)$. If $|X| >-W_{-1}(-\delta h^2e^{-2h^2})/(2h^2)$, then with probability at least $1-\delta$, for all $a\in A$ and $b\in B$,
\[
\frac{\deg_G(a)}{k - 1} > \frac{\alpha_s+\beta_s}2
\hspace{1cm}\text{and}\hspace{1cm}
\frac{\deg_G(b)}{k - 1} < \frac{\alpha_s+\beta_s}2.
\]
The statement also holds with $\alpha,\beta$ in place of $\alpha_s,\beta_s$, for any $\alpha,\beta$ such that $\alpha_s\geqslant \alpha >\beta \geqslant \beta_s$, which may be useful for bounding the degrees of vertices in $G$.
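The threshold on $|X|$ can be evaluated numerically; the following sketch (with arbitrary $\delta$ and $h$, again relying on SciPy's lambertw) also checks it against the inequality $\delta > 2ke^{-2h^2(k-1)}$ that drives the proof below.

import numpy as np
from scipy.special import lambertw

def required_sample_size(delta: float, h: float) -> float:
    # the proposition's threshold: -W_{-1}(-delta h^2 e^{-2h^2}) / (2 h^2)
    return -lambertw(-delta * h**2 * np.exp(-2 * h**2), k=-1).real / (2 * h**2)

delta, h = 0.05, 0.1
k_min = required_sample_size(delta, h)
print(k_min)  # roughly 496 points for these values

# sanity check: delta > 2 k exp(-2 h^2 (k - 1)) should fail just below
# the threshold and hold just above it
for k in (int(np.floor(k_min)), int(np.ceil(k_min))):
    print(k, 2 * k * np.exp(-2 * h**2 * (k - 1)) < delta)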

The proof

For each $i=1,\dots,k$, define new random variables $Y_{ij}$ ($j\neq i$) on the sample space $X$, with $Y_{ij}$ evaluating to 1 on $x_j$ iff $x_j\in B^n(s,x_i)$, and evaluating to 0 otherwise. The mean of $Y_{ij}$ is $P(B^n(s,x_i))$. For fixed $i$, the $Y_{ij}$ are independent and bounded on $[0,1]$ with common mean, so Hoeffding's inequality (with $n=k-1$ and $t=h$) gives that
\[
\left(\begin{array}{c}
\text{the probability that the fraction of the}\\
\text{sampled $x_j$ lying in $B^n(s,x_i)$ differs} \\
\text{from $P(B^n(s,x_i))$ by at least $h$}
\end{array}\right)
=
P\Bigg(\underbrace{\left|\frac{1}{k-1}\sum_{j\neq i} Y_{ij} - P(B^n(s,x_i))\right| \geqslant h}_{\text{event}\ A_i}\Bigg)  \leqslant 2e^{-2h^2(k-1)}.
\]
The union bound gives that
\[
\left(\begin{array}{c}
\text{the probability that at}\\
\text{least one $A_i$ occurs}
\end{array}\right)
=
P\left(\bigcup_{i=1}^k A_i\right) \leqslant \sum_{i=1}^k P(A_i) \leqslant 2ke^{-2h^2(k-1)}.
\]
Note that $\sum_{j\neq i} Y_{ij} = \deg_G(x_i)$ for every $i$, so whenever $\delta>2ke^{-2h^2(k-1)}$, with probability at least $1-\delta$ we have, for every $i$ simultaneously,
\[
\left| \frac{\deg_G(x_i)}{k-1} - P(B^n(s,x_i))\right| < h
\hspace{1cm}\text{or, equivalently,}\hspace{1cm}
P(B^n(s,x_i)) -h < \frac{\deg_G(x_i)}{k-1} < P(B^n(s,x_i)) + h.
\]
When $x_i = a\in A$ we have the lower bound $P(B^n(s,a))\geqslant \alpha_s$, and when $x_i = b\in B$ we have the upper bound $P(B^n(s,b))\leqslant \beta_s$. Recalling that $h = (\alpha_s-\beta_s)/2$, it follows that
\[
\frac{\deg_G(a)}{k-1} > \alpha_s- h =  \frac{\alpha_s+\beta_s}2
\hspace{1cm}\text{and}\hspace{1cm}
\frac{\deg_G(b)}{k-1} < \beta_s + h = \frac{\alpha_s+\beta_s}2.
\]
To find how many points we need to sample, we solve for $k$ in the inequality $ \delta > 2ke^{-2h^2(k-1)}$. With the aid of a computer algebra system, we find that
\[
k > \frac{-1}{2h^2}W_{-1}\left(-\delta h^2e^{-2h^2}\right),
\]
completing the proof.
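For readers who prefer to avoid the computer algebra system, the manipulation behind the Lambert $W$ form can be sketched by hand, treating the boundary case of equality:
\[
\delta = 2ke^{-2h^2(k-1)}
\quad\Longleftrightarrow\quad
\frac{\delta e^{-2h^2}}{2} = ke^{-2h^2k}
\quad\Longleftrightarrow\quad
-\delta h^2e^{-2h^2} = \left(-2h^2k\right)e^{-2h^2k},
\]
so $-2h^2k$ is a value of $W$ at $-\delta h^2e^{-2h^2}$. Taking the branch $W_{-1}$ selects the larger of the two roots in $k$, beyond which $2ke^{-2h^2(k-1)}$ stays below $\delta$.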

References:
[1] Hoeffding (Probability inequalities for sums of bounded random variables)
[2] Niyogi, Smale, and Weinberger (A topological view of unsupervised learning from noisy data)