We follow the article [1] and add more background and clarifications. To make calculations easier, we make some assumptions that are not explicitly mentioned in the article.
Background in probability, measure theory, topology
Let $X$ be a random variable taking values in a space $A$. Recall that for $B\subseteq A$, the expression $P(X\in B)$ is a number in $[0,1]$ describing the probability of the event $X\in B$ happening. The assignment $B\mapsto P(X\in B)$ is called the probability distribution of $X$. Here we will consider continuous random variables, so $P(X=x)=0$ for any single element $x\in A$.
Definition: The probability density function of $X$ is the function $f:A\to \R$ satisfying
- $f(x)\geqslant 0$ for all $x\in A$, and
- $\int_B f(x)\ dx = P(X\in B)$ for any (measurable) $B\subseteq A$.
The second condition implies $\int_A f(x)\ dx=1$.
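Example: For the uniform distribution on $A=[0,1]$, the probability density function is $f(x)=1$: it is nonnegative, and for any interval $B=[u,v]\subseteq[0,1]$ we get $P(X\in B)=\int_u^v 1\ dx = v-u$, the length of $B$; in particular $\int_A f(x)\ dx = 1$.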
Often authors use just $P$ instead of $f$, and write $P(x)$ instead of $P(X=x)$.
Definition: Let $Y=g(X)$ be another random variable. The expected value of $Y$ is
\[
E[Y] = E[g(X)] = \int_Ag(x)f(x)\ dx.
\]
The mean of $X$ is $\mu= E[X]$, and the variance of $X$ is $\sigma^2 = E[(X-\mu)^2]$. If $\vec X=(X_1\ \cdots\ X_n)^T$ is a multivariate random variable, then $\vec \mu=E[\vec X]$ is an $n$-vector, and the variance is an $(n\times n)$-matrix given as
\[
\Sigma = E[(\vec X-E[\vec X])(\vec X-E[\vec X])^T]
\hspace{1cm}
\text{or}
\hspace{1cm}
\Sigma_{ij} = E[(X_i-E[X_i])(X_j-E[X_j])].
\]
The covariance of $X$ and $Y$ is $E[(X-E[X])(Y-E[Y])]$. Note that the covariance of $X$ with itself is just the usual variance of $X$.
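These definitions translate directly into empirical estimates. Below is a minimal Python sketch (the toy data is an arbitrary choice of ours) estimating the mean vector and covariance matrix of a bivariate sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: X_1 uniform on [0, 1], X_2 = X_1 + small uniform noise,
# so the two coordinates are correlated (values chosen arbitrarily).
x1 = rng.uniform(0, 1, size=100_000)
x2 = x1 + rng.uniform(-0.1, 0.1, size=100_000)
samples = np.column_stack([x1, x2])   # shape (n, 2): n draws of (X_1, X_2)

mu_hat = samples.mean(axis=0)         # estimates the mean vector E[X]
centered = samples - mu_hat
Sigma_hat = centered.T @ centered / len(samples)  # Sigma_ij = E[(X_i-mu_i)(X_j-mu_j)]

print(mu_hat)     # ~ [0.5, 0.5]
print(Sigma_hat)  # diagonal ~ [1/12, 1/12 + 1/300]; off-diagonal ~ 1/12
```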
Example: One example of a probability distribution is the normal (or Gaussian) distribution, and we say a random variable with the normal distribution is normally distributed. If a random variable $X$ is normally distributed with mean $\mu$ and variance $\sigma^2$, then the probability density function of $X$ is
\[
f(x) = \frac{\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)}{\sigma\sqrt{2\pi}}.
\]
If $\vec X=(X_1\ \cdots\ X_n)^T$ is a normally distributed multivariate random variable, then $\vec \mu = (E[X_1]\ \cdots\ E[X_n])^T$ and the probability density function of $\vec X$ is
\[
f(\vec x) = \frac{\exp\left(-\frac 12 (\vec x-\vec \mu)^T\Sigma^{-1}(\vec x-\vec \mu)\right)}{\sqrt{(2\pi)^n\det(\Sigma)}}.
\]
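As a quick numerical sanity check of this formula, the following Python sketch evaluates the density directly and compares it against scipy.stats.multivariate_normal (the parameter values and evaluation point below are arbitrary).

```python
import numpy as np
from scipy.stats import multivariate_normal

def normal_pdf(x, mu, Sigma):
    """Evaluate the multivariate normal density at x, straight from the formula."""
    n = len(mu)
    diff = x - mu
    exponent = -0.5 * diff @ np.linalg.solve(Sigma, diff)  # (x-mu)^T Sigma^{-1} (x-mu)
    return np.exp(exponent) / np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))

mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
x = np.array([0.5, 0.5])

print(normal_pdf(x, mu, Sigma))               # direct evaluation of the formula
print(multivariate_normal(mu, Sigma).pdf(x))  # scipy agrees
```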
Definition: A measure on $\R^D$ is a function $m:\{$measurable subsets of $\R^D\}\to [0,\infty]$ such that $m(\emptyset) = 0$ and $m(\bigcup_{i\in I} E_i) = \sum_{i\in I} m(E_i)$ for every countable collection $\{E_i\}_{i\in I}$ of pairwise disjoint measurable subsets of $\R^D$ (see [2] for the measurability details). A probability measure on $\R^D$ is a measure $m$ on $\R^D$ with the added condition that $m(\R^D)=1$.
A probability distribution is an example of a probability measure.
Definition: Let $U= \{U_i\}_{i\in I}$ be a covering of a topological space $M$. The nerve of the covering $U$ is a set $N$ of subsets of $I$ given by
\[
N = \left\{\text{nonempty finite } J\subseteq I\ :\ \bigcap_{j\in J} U_j \neq\emptyset\right\}.
\]
Note that this makes $N$ into an abstract simplicial complex, as $J\in N$ implies $J'\in N$ for all nonempty $J'\subseteq J$.
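For a covering of a finite set by finitely many subsets, the nerve can be computed by brute force. Below is a minimal Python sketch (the function name and the toy cover are our own choices) listing all simplices of the nerve up to a chosen dimension.

```python
from itertools import combinations

def nerve(cover, max_dim):
    """List the simplices (index sets J with nonempty intersection)
    of the nerve of a covering by finite sets, up to dimension max_dim."""
    sets = [set(U) for U in cover]
    simplices = []
    for k in range(1, max_dim + 2):          # a k-element J is a (k-1)-simplex
        for J in combinations(range(len(sets)), k):
            if set.intersection(*(sets[j] for j in J)):
                simplices.append(J)
    return simplices

# Three sets covering {0, 1, 2, 3, 4}, overlapping in a cycle.
U = [{0, 1, 2}, {2, 3}, {3, 4, 0}]
print(nerve(U, max_dim=2))
# [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]
# Three vertices and three edges but no 2-simplex (U_0 cap U_1 cap U_2 is empty),
# so the nerve is a hollow triangle, i.e. a circle up to homotopy.
```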
Now let $M$ be a smooth compact submanifold of $\R^D$. By the tubular neighborhood theorem (see Theorem 2.11.4 in [3]), $M$ has an $\epsilon$-tubular neighborhood for some $\epsilon>0$.
Definition: For a particular embedding of $M$, let the condition number of $M$ be $\tau=\sup\{\epsilon\ :$ $M$ has an $\epsilon$-tubular neighborhood$\}$.
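Example: If $M$ is a circle of radius $\rho$, embedded in $\R^2$ in the standard way, then $\tau=\rho$: the normal segments of length $\epsilon<\rho$ through distinct points of $M$ are disjoint, but for $\epsilon\geqslant\rho$ they all meet at the center of the circle.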
Distributions on a manifold
Let $M$ be a $d$-dimensional manifold embedded in $\R^D$, with $D>d$. Recall that every element of $NM$, the normal bundle of $M$, may be represented as a pair $(\vec x,\vec y)$, where $\vec x\in M$ and $\vec y\in T^\perp$, the normal space to $M$ at $\vec x$ (since $M$ is a manifold, all the normal spaces are isomorphic, each of dimension $D-d$). Hence we may consider a probability distribution $P$ on $NM$, with $\vec X$ the $d$-multivariate random variable representing points on $M$ and $\vec Y$ the $(D-d)$-multivariate random variable representing points in the space normal to $M$ at a point of $M$. We assume that $\vec X$ and $\vec Y$ are independent, that is,
\[
P(\vec X, \vec Y) = P_M(\vec X)P_{T^\perp}(\vec Y).
\]
That is, $P_{T^\perp}$ is a probability distribution that is the same at any point on the manifold.
Definition: Let $P$ be a probability distribution on $NM$ and $f_M$ the probability density function of $P_M$. In the context described above, $P$ satisfies the strong variance condition if
- there exist $a,b>0$ such that $f_M(\vec x)\in [a,b]$ for all $\vec x\in M$, and
- $P_{T^\perp}(\vec Y)$ is normally distributed with $\vec \mu = 0$ and $\Sigma = \sigma^2I$.
The second condition implies that the covariance of $Y_i$ and $Y_j$ is zero whenever $i\neq j$, and that all the $Y_i$s have the same variance $\sigma^2$. Since $\Sigma^{-1}=\sigma^{-2}I$ and $\det(\Sigma)=\sigma^{2(D-d)}$, the normally distributed multivariate example above tells us that the probability density function $f^\perp$ of $\vec Y$ is
\[
f^\perp(\vec y) = \frac{\exp\left(\displaystyle-\frac{1}{2\sigma^2}\sum_{i=1}^{D-d}y_i^2\right)}{\sigma^{D-d}\sqrt{(2\pi)^{D-d}}}.
\]
Theorem: In the context described above, let $P$ be a probability distribution on $NM$ satisfying the strong variance condition, and let $\delta>0$. If there is $c>1$ such that
\[
\sigma <\frac{\tau(\sqrt9-\sqrt 8)}{9c\sqrt{8(D-d)}},
\]
then there is an algorithm that computes the homology of $M$ from a random sample of $n$ points, with probability $1-\delta$. The number $n$ depends on $\tau,\delta,c,d,D$, and the diameter of $M$.
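Remark: For $c=1$ this bound on $\sigma$ is exactly the condition that the interval for $r$ in step 2 of the algorithm below is nonempty: since $\sqrt 8=2\sqrt 2$,
\[
2\sqrt{2(D-d)}\,\sigma < \frac\tau9\left(3-2\sqrt 2\right)
\quad\Longleftrightarrow\quad
\sigma < \frac{\tau(\sqrt 9-\sqrt 8)}{9\sqrt{8(D-d)}},
\]
so requiring $c>1$ leaves room to choose such an $r$.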
The homology computing algorithm
Below is a broad view of the algorithm described in sections 3, 4, and 5 of [1]; a rough code sketch of steps 3 through 7 is given after the list. Let $M$ be a $d$-manifold embedded in $\R^D$, and $P$ a probability measure on $NM$ satisfying the strong variance condition.
1. Calculate the following numbers:
\begin{align*}
\tau & = \text{condition number of $M$}\\
\text{vol}(M) & = \text{volume of $M$}\\
\sigma^2 & = \text{variance of the normal factor $P_{T^\perp}$ of $P$}
\end{align*}
2. Define (or choose) the following numbers:
\begin{align*}
\delta & \in (0,1) \\
r & \in \left(2\sqrt{2(D-d)}\sigma,\textstyle\frac\tau9 (3-2\sqrt 2)\right) \\
n & > \text{function}(a,r,\tau,d,\delta,\text{vol}(M)) & (\max(A,B)\ \text{in Proposition 9 of [1]}) \\
s & = 4r \\
\text{deg} & > \textstyle \frac{3a}4 \left(1-\left(\frac r{2\tau}\right)^2\right)^{d/2}\text{vol}\left(B^d(r,0)\right)\\
R & = (9r+\tau)/2
\end{align*}
3. Choose $n$ points randomly from $NM$ according to $P$.
4. From these $n$ points, construct the neighborhood graph $G$ whose vertices are the points, with an edge between two points whenever the distance between them is at most $s$.
5. Remove from $G$ all the vertices of degree $<\text{deg}$ to get a refined graph $G'$.
6. Set $U=\bigcup_{\vec x\in V(G')}B^D(R,\vec x)$ and construct the simplicial complex $K$ given by the nerve of the covering $\{B^D(R,\vec x)\}_{\vec x\in V(G')}$ of $U$.
7. Compute the homology of $K$, which, with probability $1-\delta$, is the homology of $M$.
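To make the steps concrete, here is a rough Python sketch of steps 3 through 7 for the toy case where $M$ is the unit circle in $\R^2$ (so $\tau=1$). It is not a faithful implementation of [1]: the parameter values are chosen to make the small example behave well rather than to satisfy the bounds in step 2, and the nerve in step 6 is approximated Rips-style, declaring a simplex present when its vertices are pairwise within $2R$ (exact for edges, an approximation of the true nerve for higher simplices). Step 7 computes Betti numbers over $\mathbb Q$ from ranks of boundary matrices.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)

# Step 3 (toy data): n points near M = unit circle in R^2, with Gaussian
# noise of standard deviation sigma in the normal (radial) direction.
n, sigma = 100, 0.01
theta = rng.uniform(0, 2 * np.pi, size=n)
radii = 1.0 + rng.normal(0, sigma, size=n)
pts = np.column_stack([radii * np.cos(theta), radii * np.sin(theta)])

# Toy parameters (chosen for this small example, not via the bounds above).
s, deg, R = 0.4, 3, 0.3

# Step 4: graph with an edge between points within distance s.
dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
adj = (dist < s) & ~np.eye(n, dtype=bool)

# Step 5: discard vertices of degree < deg.
kept = np.flatnonzero(adj.sum(axis=1) >= deg)
dist = dist[np.ix_(kept, kept)]
m = len(kept)

# Step 6 (Rips-style approximation of the nerve of the balls B^D(R, x)):
# a simplex is included when its vertices are pairwise within 2R.
close = dist < 2 * R
edges = [e for e in combinations(range(m), 2) if close[e]]
tris = [t for t in combinations(range(m), 3)
        if all(close[i, j] for i, j in combinations(t, 2))]

# Step 7: Betti numbers over Q from the ranks of the boundary maps
# of the 2-skeleton of K.
eidx = {e: k for k, e in enumerate(edges)}
d1 = np.zeros((m, len(edges)))
for k, (u, v) in enumerate(edges):
    d1[u, k], d1[v, k] = -1.0, 1.0
d2 = np.zeros((len(edges), len(tris)))
for k, (a, b, c) in enumerate(tris):
    d2[eidx[(b, c)], k] = 1.0
    d2[eidx[(a, c)], k] = -1.0
    d2[eidx[(a, b)], k] = 1.0
r1, r2 = np.linalg.matrix_rank(d1), np.linalg.matrix_rank(d2)
print("beta_0 =", m - r1)                 # expect 1 (connected)
print("beta_1 =", len(edges) - r1 - r2)   # expect 1 (the circle's loop)
```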
References:
[1] Niyogi, Smale, and Weinberger (A topological view of unsupervised learning from noisy data)
[2] Folland (Real analysis, Chapter 10.1)
[3] Bredon (Topology and Geometry, Chapter 2.11)