
Thursday, May 26, 2016

Reconstructing a manifold from sample data, with noise

We follow the article [1] and add more background and clarifications. Some assumptions are made that are not explicitly mentioned in the article, to make calculations easier.

Background in probability, measure theory, topology

Let $X$ be a random variable over a space $A$. Recall that the expression $P(X)$ is a number in $[0,1]$ describing the probability of the event $X$ happening; the function $P$ is called a probability distribution. Here we will consider continuous random variables, so $P(X=x)=0$ for any single element $x\in A$.

Definition: The probability density function of $X$ is the function $f\colon A\to\mathbb{R}$ satisfying
  • $f(x)\geqslant 0$ for all $x\in A$, and
  • $\int_B f(x)\,dx = P(X\in B)$ for any $B\subseteq A$.
The second condition implies $\int_A f(x)\,dx = 1$.
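As a quick numerical sanity check of these two conditions (a toy illustration with the standard normal density, not part of [1]), we can integrate the density over essentially all of $A=\mathbb{R}$ and over a smaller set $B$:

```python
import math

# Density of a standard normal random variable (A = R, mu = 0, sigma = 1).
def f(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def integrate(g, lo, hi, steps=100_000):
    """Trapezoidal rule; accurate enough for a sanity check."""
    h = (hi - lo) / steps
    total = 0.5 * (g(lo) + g(hi))
    for i in range(1, steps):
        total += g(lo + i * h)
    return total * h

total_mass = integrate(f, -10, 10)   # integral over essentially all of A
p_interval = integrate(f, -1, 1)     # P(X in B) for B = [-1, 1], about 0.683
```

The first integral comes out (numerically) equal to $1$, and the second recovers the familiar probability that a standard normal variable lands within one standard deviation of its mean.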

Often authors use just $P$ instead of $f$, and write $P(x)$ instead of $P(X=x)$.

Definition: Let $Y=g(X)$ be another random variable. The expected value of $Y$ is
$$E[Y]=E[g(X)]=\int_A g(x)f(x)\,dx.$$
The mean of $X$ is $\mu=E[X]$, and the variance of $X$ is $\sigma^2=E[(X-\mu)^2]$. If $X=(X_1\ \cdots\ X_n)^T$ is a multivariate random variable, then $\mu=E[X]$ is an $n$-vector, and the variance is an $(n\times n)$-matrix given as
$$\Sigma=E\left[(X-E[X])(X-E[X])^T\right] \qquad\text{or}\qquad \Sigma_{ij}=E\left[(X_i-E[X_i])(X_j-E[X_j])\right].$$
The covariance of $X$ and $Y$ is $E[(X-E[X])(Y-E[Y])]$. Note that the covariance of $X$ with itself is just the usual variance of $X$.
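These definitions are easy to check empirically. A minimal Monte Carlo sketch (toy values, not from [1]) for $X$ uniform on $[0,1]$, where $\mu=1/2$ and $\sigma^2=1/12$, also illustrating that the covariance of $X$ with itself recovers the variance:

```python
import random

random.seed(0)

# Monte Carlo estimates for X ~ Uniform(0, 1): mu = 1/2, sigma^2 = 1/12.
n = 200_000
xs = [random.random() for _ in range(n)]

mu_hat = sum(xs) / n
var_hat = sum((x - mu_hat) ** 2 for x in xs) / n

# The covariance of X with itself is computed by the same formula,
# so it agrees with the variance exactly.
cov_xx = sum((x - mu_hat) * (x - mu_hat) for x in xs) / n
```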

Example: One example of a probability distribution is the normal (or Gaussian) distribution, and we say a random variable with the normal distribution is normally distributed. If a random variable $X$ is normally distributed with mean $\mu$ and variance $\sigma^2$, then the probability density function of $X$ is
$$f(x)=\frac{\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)}{\sigma\sqrt{2\pi}}.$$
If $X=(X_1\ \cdots\ X_n)^T$ is a normally distributed multivariate random variable, then $\mu=(E[X_1]\ \cdots\ E[X_n])^T$ and the probability density function of $X$ is
$$f(x)=\frac{\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)}{\sqrt{(2\pi)^n\det(\Sigma)}}.$$
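When $\Sigma$ is diagonal the multivariate density factors as a product of univariate normal densities, which gives a cheap consistency check of the two formulas. Here is that check at one point (the numbers are arbitrary illustrative choices):

```python
import math

# Toy values: mu and standard deviations for a 2-dimensional normal
# with diagonal covariance Sigma = diag(0.25, 4.0).
mu = [1.0, -2.0]
sigmas = [0.5, 2.0]
x = [1.3, -1.1]

def normal_1d(x, m, s):
    """Univariate normal density with mean m and standard deviation s."""
    return math.exp(-(x - m) ** 2 / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

# Multivariate formula: for diagonal Sigma, det(Sigma) = s1^2 * s2^2 and
# Sigma^{-1} = diag(1/s1^2, 1/s2^2), so the quadratic form is a simple sum.
quad = sum((xi - mi) ** 2 / (si * si) for xi, mi, si in zip(x, mu, sigmas))
det = (sigmas[0] * sigmas[1]) ** 2
f_multi = math.exp(-0.5 * quad) / math.sqrt((2 * math.pi) ** 2 * det)

# Product of the two univariate densities.
f_product = normal_1d(x[0], mu[0], sigmas[0]) * normal_1d(x[1], mu[1], sigmas[1])
```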

Definition: A measure on $\mathbb{R}^D$ is a function $m\colon\{\text{subsets of }\mathbb{R}^D\}\to[0,\infty]$ such that $m(\emptyset)=0$ and $m\left(\bigcup_{i\in I}E_i\right)=\sum_{i\in I}m(E_i)$ for $\{E_i\}_{i\in I}$ a countable collection of pairwise disjoint subsets of $\mathbb{R}^D$. (Strictly speaking, $m$ need only be defined on a $\sigma$-algebra of measurable subsets; see [2].) A probability measure on $\mathbb{R}^D$ is a measure $m$ on $\mathbb{R}^D$ with the added condition that $m(\mathbb{R}^D)=1$.

A probability distribution is an example of a probability measure.

Definition: Let $U=\{U_i\}_{i\in I}$ be a covering of a topological space $M$. The nerve of the covering $U$ is a set $N$ of subsets of $I$ given by
$$N=\left\{J\subseteq I\ :\ \bigcap_{j\in J}U_j\neq\emptyset\right\}.$$
Note that this makes $N$ into an abstract simplicial complex, as $J\in N$ implies $J'\in N$ for all $J'\subseteq J$.
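To make the definition concrete, here is the nerve of a small toy cover (invented for illustration, not from [1]): three sets covering $\{1,\dots,5\}$ that overlap pairwise but have empty triple intersection, so the nerve is the boundary of a $2$-simplex — the combinatorial shadow of three arcs covering a circle:

```python
from itertools import combinations

# A toy cover of {1, ..., 5}, indexed by I = {0, 1, 2}: pairwise overlaps,
# but the intersection of all three sets is empty.
cover = {0: {1, 2}, 1: {2, 3, 4}, 2: {4, 5, 1}}

def nerve(cover):
    """All nonempty J subset I whose member sets have a common point."""
    I = list(cover)
    N = set()
    for k in range(1, len(I) + 1):
        for J in combinations(I, k):
            if set.intersection(*(cover[j] for j in J)):
                N.add(frozenset(J))
    return N

N = nerve(cover)   # three vertices and three edges, but no 2-simplex
```

Downward closure is visible here: every subset of a set in $N$ is again in $N$.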

Let $M$ be a smooth compact submanifold of $\mathbb{R}^d$. By the tubular neighborhood theorem (see Theorem 2.11.4 in [3]), $M$ has an $\epsilon$-tubular neighborhood for some $\epsilon>0$.

Definition: For a particular embedding of $M$, let the condition number of $M$ be $\tau=\sup\{\epsilon\ :\ M\text{ has an }\epsilon\text{-tubular neighborhood}\}$. For example, a circle of radius $\rho$ embedded in the standard way in $\mathbb{R}^2$ has condition number $\tau=\rho$.

Distributions on a manifold

Let $M$ be a $d$-dimensional manifold embedded in $\mathbb{R}^D$, with $D>d$. Recall that every element of $NM$, the normal bundle of $M$ in $\mathbb{R}^D$, may be represented as a pair $(x,y)$, where $x\in M$ and $y\in T\cong\mathbb{R}^{D-d}$ (since $M$ is a manifold, all the normal spaces are isomorphic). Hence we may consider a probability distribution $P$ on $NM$, with $X$ the $d$-multivariate random variable representing points on $M$ and $Y$ the $(D-d)$-multivariate random variable representing points in the space normal to $M$ at a point of $M$. We make the assumption that $X$ and $Y$ are independent, or that
$$P(X,Y)=P_M(X)P_T(Y).$$
That is, $P_T$ is a probability distribution that is the same at every point of the manifold.
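A minimal sketch of sampling from such a $P$, for the unit circle $M\subset\mathbb{R}^2$ (so $d=1$, $D=2$; the uniform angle and the noise level are toy choices, not from [1]):

```python
import math
import random

random.seed(1)

# Sampling from the normal bundle of the unit circle M in R^2 (d = 1, D = 2):
# X is a uniform angle on M (playing the role of P_M), Y is 1-dimensional
# Gaussian noise in the normal (radial) direction (P_T), and the two draws
# are independent, as in P(X,Y) = P_M(X) P_T(Y).
sigma, n = 0.05, 20_000

def sample():
    theta = random.uniform(0, 2 * math.pi)   # draw X from P_M
    y = random.gauss(0, sigma)               # draw Y from P_T, independently
    return ((1 + y) * math.cos(theta), (1 + y) * math.sin(theta))

pts = [sample() for _ in range(n)]
# The distance of a sample to the origin is 1 + Y, so its mean is near 1.
mean_radius = sum(math.hypot(px, py) for px, py in pts) / n
```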

Definition: Let $P$ be a probability distribution on $NM$ and $f_M$ the probability density function of $P_M$. In the context described above, $P$ satisfies the strong variance condition if
  • there exist $a,b>0$ such that $f_M(x)\in[a,b]$ for all $x\in M$, and
  • $P_T(Y)$ is normally distributed with $\mu=0$ and $\Sigma=\sigma^2I$.
The second condition implies that the covariance of $Y_i$ with $Y_j$ is trivial whenever $i\neq j$, and that the variance of all the $Y_i$'s is the same. From the normally distributed multivariate example above, this also tells us that the probability density function $f$ of $Y$ is
$$f(y)=\frac{\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{D-d}y_i^2\right)}{\sigma^{D-d}\sqrt{(2\pi)^{D-d}}}.$$

Theorem:
In the context described above, let $P$ be a probability distribution on $NM$ satisfying the strong variance condition, and let $\delta>0$. If there is $c>1$ such that
$$\sigma<\frac{\tau\left(3-2\sqrt{2}\right)}{9c\sqrt{8(D-d)}},$$
then there is an algorithm that computes the homology of $M$ from a random sample of $n$ points, with probability $1-\delta$. The number $n$ depends on $\tau,\delta,c,d,D$, and the diameter of $M$.
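To get a feel for how restrictive this is, we can evaluate the bound $\sigma<\tau(3-2\sqrt{2})/\bigl(9c\sqrt{8(D-d)}\bigr)$ at some illustrative values (these particular $\tau$, $c$, $D$, $d$ are toy choices, not from [1]):

```python
import math

# Toy values: a curve (d = 1) in R^3 (D = 3) with condition number
# tau = 1, and slack factor c = 2.
tau, c, D, d = 1.0, 2.0, 3, 1

sigma_max = tau * (3 - 2 * math.sqrt(2)) / (9 * c * math.sqrt(8 * (D - d)))
# Roughly 0.0024: the admissible noise is tiny compared to tau.
```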

The homology computing algorithm

Below is a broad view of the algorithm described in sections 3, 4, and 5 of [1]. Let $M$ be a $d$-manifold embedded in $\mathbb{R}^D$, and let $P$ be a probability measure on $NM$ satisfying the strong variance condition.

1. Calculate the following numbers:
$$\tau=\text{condition number of }M,\qquad \mathrm{vol}(M)=\text{volume of }M,\qquad \sigma^2=\text{variance of }P_T.$$
2. Define (or choose) the following numbers:
$$\delta\in(0,1),\qquad r\in\left(2\sqrt{2(D-d)}\,\sigma,\ \frac{\tau}{9}\left(3-2\sqrt{2}\right)\right),$$
$$n>\text{a function of }a,r,\tau,d,\delta,\mathrm{vol}(M)\quad\left(\max(A,B)\text{ in Proposition 9 of }[1]\right),$$
$$s=4r,\qquad \deg>\frac{3a}{4}\left(1-\left(\frac{r}{2\tau}\right)^2\right)^{d/2}\mathrm{vol}\left(B^d(r,0)\right),\qquad R=\frac{9r+\tau}{2}.$$
3. Choose $n$ points randomly from $NM$ according to $P$.
4. From these $n$ points, construct the nearest neighbor graph $G$ with distance $s$.
5. Remove from $G$ all the vertices of degree $<\deg$ to get a refined graph $G'$.
6. Set $U=\bigcup_{x\in V(G')}B^D(R,x)$ and construct the simplicial complex $K$ of its nerve.
7. Compute the homology of $K$, which is the homology of $M$ with probability $1-\delta$.
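As a toy illustration of steps 3–7 (emphatically not the parameter choices of [1]: the radii $s$ and $R$ and the degree threshold below are picked by hand for a 12-point sample of the unit circle, and the Čech nerve of step 6 is approximated by a Vietoris–Rips complex), the pipeline can be sketched as:

```python
import math
import random
from itertools import combinations

random.seed(0)

# Step 3: a 12-point noisy sample of the unit circle M in R^2 (d = 1, D = 2).
# Evenly spaced angles keep the picture clean; the noise is Gaussian in the
# normal (radial) direction, as in the strong variance condition.
sigma, n = 0.005, 12
pts = []
for i in range(n):
    theta = 2 * math.pi * i / n
    r = 1 + random.gauss(0, sigma)
    pts.append((r * math.cos(theta), r * math.sin(theta)))

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Step 4: nearest neighbor graph with distance s (hand-picked here).
s = 0.6
edges = [(i, j) for i, j in combinations(range(n), 2) if dist(pts[i], pts[j]) < s]

# Step 5: remove vertices of degree < deg (threshold 1: nothing is removed).
degree = {i: 0 for i in range(n)}
for i, j in edges:
    degree[i] += 1
    degree[j] += 1
keep = [i for i in range(n) if degree[i] >= 1]

# Step 6: nerve of the union of balls B(R, x), approximated by a
# Vietoris-Rips complex: a simplex for every subset of diameter < 2R.
R = s / 2
verts = keep
E = [e for e in edges if e[0] in keep and e[1] in keep]
T = [t for t in combinations(keep, 3)
     if all(dist(pts[i], pts[j]) < 2 * R for i, j in combinations(t, 2))]

# Step 7: Betti numbers over GF(2) from the ranks of the boundary maps.
def rank_mod2(rows):
    """Rank of a 0/1 matrix whose rows are given as sets of column indices."""
    rows = [set(r) for r in rows]
    rank = 0
    for col in sorted({c for r in rows for c in r}):
        pivot = next((r for r in rows if col in r), None)
        if pivot is None:
            continue
        rows.remove(pivot)
        rows = [r ^ pivot if col in r else r for r in rows]
        rank += 1
    return rank

vidx = {v: k for k, v in enumerate(verts)}
eidx = {e: k for k, e in enumerate(E)}
d1 = [{vidx[i], vidx[j]} for i, j in E]                     # boundaries of edges
d2 = [{eidx[(t[0], t[1])], eidx[(t[0], t[2])], eidx[(t[1], t[2])]} for t in T]

r1, r2 = rank_mod2(d1), rank_mod2(d2)
b0 = len(verts) - r1        # connected components
b1 = len(E) - r1 - r2       # independent loops
```

For this sample the complex is a 12-gon, so the computed Betti numbers $b_0=1$ and $b_1=1$ agree with the homology of the circle.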

References:
[1] Niyogi, Smale, and Weinberger (A topological view of unsupervised learning from noisy data)
[2] Folland (Real analysis, Chapter 10.1)
[3] Bredon (Topology and Geometry, Chapter 2.11)
