We follow the article [1] and add more background and clarifications. To make calculations easier, we make some assumptions that are not explicitly stated in the article.
Background in probability, measure theory, and topology
Let X be a random variable taking values in a space A. Recall that for a subset B\subseteq A, the expression P(X\in B) is a number in [0,1] describing the probability of the event X\in B; the function P is called a probability distribution. Here we will consider continuous random variables, so P(X=x)=0 for any single element x\in A.
Definition: The probability density function of X is the function f:A→R satisfying
- f(x)\geq 0 for all x\in A, and
- \int_B f(x)\ dx = P(X\in B) for any B\subseteq A.
The second condition implies \int_A f(x)\ dx=1.
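To make the definition concrete, here is a minimal numerical check (the density f(x)=2x on A=[0,1] is our own illustrative choice):

```python
import numpy as np

# Illustrative density on A = [0, 1]: f(x) = 2x, which is nonnegative on A.
f = lambda x: 2.0 * x

dx = 1e-5
x = np.arange(0.0, 1.0, dx)
print((f(x) * dx).sum())   # Riemann sum of f over A: ~1.0

B = x[x >= 0.5]            # B = [0.5, 1] as a subset of A
print((f(B) * dx).sum())   # P(X in B): ~0.75, matching 1 - 0.5**2
```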
Often authors use just P instead of f, writing P(x) for the density value f(x).
Definition: Let Y=g(X) be another random variable. The expected value of Y is
E[Y] = E[g(X)] = \int_Ag(x)f(x)\ dx.
The mean of X is \mu= E[X], and the variance of X is \sigma^2 = E[(X-\mu)^2]. If \vec X=(X_1\ \cdots\ X_n)^T is a multivariate random variable, then \vec \mu=E[\vec X] is an n-vector, and the variance is an (n\times n)-matrix given as
\Sigma = E[(\vec X-E[\vec X])(\vec X-E[\vec X])^T] \hspace{1cm} \text{or} \hspace{1cm} \Sigma_{ij} = E[(X_i-E[X_i])(X_j-E[X_j])].
The covariance of X and Y is E[(X-E[X])(Y-E[Y])]. Note that the covariance of X with itself is just the usual variance of X.
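All of these quantities are easy to estimate from samples. The following sketch (with illustrative means and covariances of our own choosing) estimates \mu, \sigma^2, and \Sigma empirically with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Univariate: estimate mu = E[X] and sigma^2 = E[(X - mu)^2] from samples.
X = rng.normal(loc=2.0, scale=3.0, size=200_000)      # mu = 2, sigma^2 = 9
mu_hat = X.mean()
print(mu_hat, ((X - mu_hat) ** 2).mean())             # ~2.0, ~9.0

# Multivariate: Sigma_ij = E[(X_i - E[X_i])(X_j - E[X_j])].
Sigma_true = np.array([[2.0, 0.5],
                       [0.5, 1.0]])
Xvec = rng.multivariate_normal(mean=[0.0, 1.0], cov=Sigma_true, size=200_000)
centered = Xvec - Xvec.mean(axis=0)
print(centered.T @ centered / len(Xvec))              # empirical Sigma, ~Sigma_true
```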
Example: One example of a probability distribution is the normal (or Gaussian) distribution, and we say a random variable with the normal distribution is normally distributed. If a random variable X is normally distributed with mean \mu and variance \sigma^2, then the probability density function of X is
f(x) = \frac{\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)}{\sigma\sqrt{2\pi}}.
If \vec X=(X_1\ \cdots\ X_n)^T is a normally distributed multivariate random variable, then \vec \mu = (E[X_1]\ \cdots\ E[X_n])^T and the probability density function of \vec X is
f(\vec x) = \frac{\exp\left(-\frac 12 (\vec x-\vec \mu)^T\Sigma^{-1}(\vec x-\vec \mu)\right)}{\sqrt{(2\pi)^n\det(\Sigma)}}.
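As a sanity check of this formula, here is a short sketch that evaluates the density directly (the values of \vec\mu and \Sigma are illustrative):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate normal density, computed straight from the formula."""
    n = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)     # (x - mu)^T Sigma^{-1} (x - mu)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(gaussian_pdf(np.array([0.5, 0.5]), mu, Sigma))
# Cross-check, if SciPy is available:
# from scipy.stats import multivariate_normal
# print(multivariate_normal(mean=mu, cov=Sigma).pdf([0.5, 0.5]))
```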
Definition: A measure on \R^D is a function m:\{\text{subsets of } \R^D\}\to [0,\infty] such that m(\emptyset) = 0 and m(\bigcup_{i\in I} E_i) = \sum_{i\in I} m(E_i) for every countable collection \{E_i\}_{i\in I} of pairwise disjoint subsets of \R^D. (Strictly speaking, m need only be defined on a \sigma-algebra of subsets; see [2].) A probability measure on \R^D is a measure m on \R^D with the added condition that m(\R^D)=1.
A probability distribution is an example of a probability measure.
Definition: Let U= \{U_i\}_{i\in I} be a covering of a topological space M. The nerve of the covering U is a set N of subsets of I given by
N = \left\{J\subseteq I\ \text{finite}\ :\ \bigcap_{j\in J} U_j \neq\emptyset\right\}.
Note that this makes N into an abstract simplicial complex, as J\in N implies J'\in N for all J'\subseteq J.
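Here is a small sketch of the nerve construction for a finite cover. To keep the intersection tests exact we take the U_i to be intervals in \R (an illustrative choice): a finite family of intervals has nonempty intersection iff the largest left endpoint is at most the smallest right endpoint.

```python
from itertools import combinations

# A cover of [0, 8] by four overlapping intervals U_i = [lo_i, hi_i].
cover = {0: (0.0, 2.0), 1: (1.5, 4.0), 2: (3.5, 6.0), 3: (5.5, 8.0)}

def nerve(cover, max_dim=3):
    N = []
    for k in range(1, max_dim + 2):                 # subsets J of size k
        for J in combinations(cover, k):
            lo = max(cover[j][0] for j in J)
            hi = min(cover[j][1] for j in J)
            if lo <= hi:                            # the U_j, j in J, intersect
                N.append(J)
    return N

print(nerve(cover))
# [(0,), (1,), (2,), (3,), (0, 1), (1, 2), (2, 3)]:
# the nerve of a "chain" of intervals is a path, as expected.
```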
Let M be a smooth compact submanifold of \R^D. By the tubular neighborhood theorem (see Theorem 2.11.4 in [3]), M has an \epsilon-tubular neighborhood for some \epsilon>0.
Definition: For a particular embedding of M, let the condition number of M be \tau=\sup\{\epsilon\ :\ M \text{ has an } \epsilon\text{-tubular neighborhood}\}. For example, the condition number of a circle of radius R embedded in \R^2 is \tau=R.
Distributions on a manifold
Let M be a d-dimensional manifold embedded in \R^D, with D>d. Recall that every element in NM\subseteq \R^D, the normal bundle of M, may be represented as a pair (\vec x,\vec y), where \vec x\in M and \vec y\in T^\perp_{\vec x}M, the normal space to M at \vec x (since M is a manifold, the normal spaces at all points are isomorphic to \R^{D-d}). Hence we may consider a probability distribution P on NM, with \vec X the d-multivariate random variable representing points on M and \vec Y the (D-d)-multivariate random variable representing points in the space normal to M at a point of M. We assume that \vec X and \vec Y are independent, that is,
P(\vec X, \vec Y) = P_M(\vec X)P_{T^\perp}(\vec Y).
That is, P_{T^\perp} is a probability distribution that is the same at any point on the manifold.
Definition: Let P be a probability distribution on NM and f_M the probability density function of P_M. In the context described above, P satisfies the strong variance condition if
- there exist a,b>0 such that f_M(\vec x)\in [a,b] for all \vec x\in M, and
- P_{T^\perp}(\vec Y) is normally distributed with \vec \mu = 0 and \Sigma = \sigma^2I.
The second condition implies that the covariance of Y_i with Y_j is zero whenever i\neq j, and that the variance of every Y_i is the same \sigma^2. From the multivariate normal example above, this also tells us that the probability density function f^\perp of \vec Y is
f^\perp(\vec y) = \frac{\exp\left(\displaystyle-\frac{1}{2\sigma^2}\sum_{i=1}^{D-d}y_i^2\right)}{\sigma^{D-d}\sqrt{(2\pi)^{D-d}}}.
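To make this concrete, here is a minimal sampling sketch with M the unit circle in \R^2, so that d=1, D=2, and each normal space is the radial line through a point of M; the values of \sigma and n and the choice of the uniform distribution for P_M are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

sigma, n = 0.05, 2000

# X ~ uniform on the circle M, so f_M is the constant 1/(2*pi), and we may
# take a = b = 1/(2*pi) in the strong variance condition.
theta = rng.uniform(0.0, 2 * np.pi, size=n)
x_on_M = np.stack([np.cos(theta), np.sin(theta)], axis=1)

# Y ~ N(0, sigma^2 I) in the (D - d)-dimensional normal space; for the
# circle this is a single radial coordinate at each point.
y = rng.normal(0.0, sigma, size=(n, 1))
samples = x_on_M * (1.0 + y)       # push each point distance y along its normal

print(samples.shape)               # (2000, 2): noisy points concentrated near M
```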
Theorem: In the context described above, let P be a probability distribution on NM satisfying the strong variance condition, and let \delta>0. If there is 0<c<1 such that
\sigma <\frac{c\tau(\sqrt9-\sqrt 8)}{9\sqrt{8(D-d)}},
then there is an algorithm that computes the homology of M from a random sample of n points, with probability at least 1-\delta. The number n depends on \tau,\delta,c,d,D, and the diameter of M.
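For a sense of how restrictive the hypothesis is, here is a one-line evaluation of the bound (the values of \tau, D, d, and c are illustrative choices):

```python
import numpy as np

tau, D, d, c = 1.0, 3, 1, 0.9      # illustrative choices
bound = c * tau * (np.sqrt(9) - np.sqrt(8)) / (9 * np.sqrt(8 * (D - d)))
print(bound)                        # ~0.0043: sigma must be tiny relative to tau
```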
The homology computing algorithm
Below is a broad view of the algorithm described in sections 3, 4, and 5 of [1]; a small code sketch of steps 3 through 6 follows the list. Let M be a d-manifold embedded in \R^D, and P a probability measure on NM satisfying the strong variance condition.
1. Calculate the following numbers:
\begin{align*} \tau & = \text{condition number of $M$}\\ \text{vol}(M) & = \text{volume of $M$}\\ \sigma^2 & = \text{variance of $P$} \end{align*}
2. Define (or choose) the following numbers:
\begin{align*}
\delta & \in (0,1) \\
r & \in \left(2\sqrt{2(D-d)}\sigma,\textstyle\frac\tau9 (3-2\sqrt 2)\right) \\
n & > \text{function}(a,r,\tau,d,\delta,\text{vol}(M)) & (\max(A,B)\ \text{in Proposition 9 of [1]}) \\
s & = 4r \\
deg & > \textstyle \frac{3a}4 \left(1-\left(\frac r{2\tau}\right)^2\right)^{d/2}\text{vol}\left(B^d(r,0)\right)\\
R & = (9r+\tau)/2
\end{align*}
3. Choose n points randomly from NM according to P.
4. From these n points, construct the graph G with an edge between two points whenever they are within distance s of each other.
5. Remove from G all the vertices of degree < deg to get a refined graph G'.
6. Set U=\bigcup_{\vec x\in V(G')}B^D(R,\vec x) and construct the simplicial complex K given by the nerve of the covering \{B^D(R,\vec x)\}_{\vec x\in V(G')}.
7. Compute the homology of K, which is the homology of M with probability at least 1-\delta.
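The following is a rough end-to-end sketch of steps 3 through 6, reusing the `samples` array from the circle sketch above. The values of s, deg, and R below are illustrative choices rather than the formulas from step 2, and instead of the full homology of K (which calls for a simplicial homology library) it computes only \beta_0, the number of connected components, from the 1-skeleton of the nerve:

```python
import numpy as np

def pairwise_dists(P):
    diff = P[:, None, :] - P[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def beta_0(adj):
    """Number of connected components of a graph (boolean adjacency matrix)."""
    parent = list(range(len(adj)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in zip(*np.nonzero(np.triu(adj, 1))):
        parent[find(i)] = find(j)
    return len({find(i) for i in range(len(adj))})

s, deg, R = 0.3, 5, 0.4                              # illustrative, not from step 2

dists = pairwise_dists(samples)                      # step 3 happened above
G = (dists < s) & ~np.eye(len(samples), dtype=bool)  # step 4: graph at scale s
V = samples[G.sum(axis=1) >= deg]                    # step 5: drop low-degree vertices

# Step 6, 1-skeleton only: B^D(R, x) and B^D(R, x') intersect iff |x - x'| <= 2R,
# which gives exactly the edges of the nerve of the covering U.
nerve_edges = (pairwise_dists(V) <= 2 * R) & ~np.eye(len(V), dtype=bool)
print("beta_0 =", beta_0(nerve_edges))               # expect 1 for the circle
```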
References:
[1] P. Niyogi, S. Smale, and S. Weinberger, A topological view of unsupervised learning from noisy data, SIAM Journal on Computing 40 (2011), no. 3, 646-663.
[2] G. B. Folland, Real Analysis: Modern Techniques and Their Applications, Chapter 10.1.
[3] G. E. Bredon, Topology and Geometry, Chapter 2.11.