# American Institute of Mathematical Sciences

March  2019, 1(1): 1-38. doi: 10.3934/fods.2019001

## Consistent manifold representation for topological data analysis

 Department of Mathematical Sciences, Fairfax, VA 22030, USA

* Corresponding author: Timothy Sauer

Published  March 2019

Fund Project: The authors are partially supported by NSF grant DMS-1723128.

For data sampled from an arbitrary density on a manifold embedded in Euclidean space, the Continuous k-Nearest Neighbors (CkNN) graph construction is introduced. It is shown that CkNN is geometrically consistent in the sense that under certain conditions, the unnormalized graph Laplacian converges to the Laplace-Beltrami operator, spectrally as well as pointwise. It is proved for compact (and conjectured for noncompact) manifolds that CkNN is the unique unweighted construction that yields a geometry consistent with the connected components of the underlying manifold in the limit of large data. Thus CkNN produces a single graph that captures all topological features simultaneously, in contrast to persistent homology, which represents each homology generator at a separate scale. As applications we derive a new fast clustering algorithm and a method to identify patterns in natural images topologically. Finally, we conjecture that CkNN is topologically consistent, meaning that the homology of the Vietoris-Rips complex (implied by the graph Laplacian) converges to the homology of the underlying manifold (implied by the Laplace-de Rham operators) in the limit of large data.

Citation: Tyrus Berry, Timothy Sauer. Consistent manifold representation for topological data analysis. Foundations of Data Science, 2019, 1 (1) : 1-38. doi: 10.3934/fods.2019001
##### References:

show all references

##### References:
For $N \in \{1000,5000,20000\}$ uniformly random data points on a unit circle. (a) Using $\delta = 3N^{-1/7}$ and letting $\vec f_i = \sin(\theta_i)$ we compare the pointwise estimate $\left(c^{-1}L_{\rm un}\vec f \right)_i$ (red, dashed) to the true operator $\Delta f(x_i) = -\sin(\theta_i)$ (black, solid). (b) Using $\delta = 3N^{-1/7}$ we compare the first 9 eigenvalues of $c^{-1}L_{\rm un}$ (red, dashed) to the first 9 eigenvalues of $\Delta$ (black, solid). (c, d) Same as (a, b) respectively but with $\delta = 3N^{-1/3}$
(a) Standard $\epsilon$-ball persistent 1-homology reveals that no value of $\epsilon$ captures both holes simultaneously: Each homology class persists at separate $\epsilon$ scales. (b) When $\epsilon = 0.16$ we see that the small hole is completely triangulated (the black segment completes the triangulation), while the large hole is still not connected. (c) CkNN persistent 1-homology as a function of $\delta$ shows that both homology classes (top two lines) are present simultaneously over a large range of $\delta$ values. (d) When $\delta = 0.1$, the resulting graph represents the true topology of the manifold
(a) Data sampled from a 2-dimensional Gaussian with a radial gap forming a non-compact manifold. (b) Standard fixed $\epsilon$-ball persistence diagram has no persistent region with 2 connected components and one non-contractible hole. (c) When $\epsilon = 0.135$ the 1-homology is correct, but outliers are still not connected. (d) For $\epsilon = 0.16$ the outliers are still not connected and the middle section is bridged. For any size gap this bridge will happen in the limit of large data as the outliers become more spaced out. (e) CkNN persistence in terms of $\delta$ shows the correct homology ($\beta_0 = 2$ and $\beta_1 = 1$) for $0.180<\delta<0.215$. (f) The CkNN construction for $\delta = 0.20$
Rectangular regions indicate the true clusters. (a) Circles of a fixed radius $\epsilon$ show that bridging of the dense regions occurs before any connections are made in the sparse region. (b) Graph connecting all points with distance less than $\epsilon$
Data set from Fig. 1.(a) Graph based on connecting each point to its (first) nearest neighbor leaves all regions disconnected internally. (b) Connecting each point to its two nearest neighbors bridges the sparse region to one of the dense regions before fully connecting the left and center boxes
Data set from Fig. 1.(a) Color indicates the relative values of the bandwidth function $\rho(x) = d(x,x_k)$ using $k = 10$ and optimally tuning $\delta$ (blue = low, red = high). (b) Graph connecting pairs of data points with $||x-y|| < \delta \sqrt{\rho(x)\rho(y)}$. Notice that the connected components are fully connected and triangulated, yielding the correct homology in dimensions $0, \; 1$, and $2$
Non-compact version of data set from Fig. 1, density of leftmost 'box' decays continuously to zero. (a) The kNN 'AND' construction bridges the components while the 1-homology still has spurious generators (holes) in the sparse regions. (b) The CkNN is capable of capturing the correct homology for this data set. Both methods use $k = 10$ and the value of $\delta$ was tuned to give the best results for each method. Decreasing $\delta$ for (a) would obtain the correct clusters but introduce more spurious holes
The fast clustering algorithm applied to a set of three spirals with densities which decay exponentially along the length of the spiral. (a) 1000 total points sampled from three 2-dimensional spirals in the plane. (b) Diagram shows number of clusters as a function of the percentage of possible edges added to the graph (a unitless measure of persistence) (c) Graph corresponding to the maximum number of edges with 3 components, colored according to identification by depth-first-search. (d) 2000 total points sampled from three 3-dimensional spirals in $\mathbb{R}^3$. (e) Large interval identifies the correct number of clusters. (f) Graph with maximum number of edges having 3 components, colored by clustering algorithm
Persistence of the clustering of a cut-Gaussian distribution using the multi-scale graph construction with $\rho = q^{\beta}$, where the distribution is in dimension (a) $1$ (b) $2$ (c) $3$ (d) $4$. Persistence is measured as a percentage of the total number of possible edges in the graph, $N(N-1)/2$ as a function of $\beta$. Each data point represents an average over 500 data sets, each data set starts with sufficiently many points randomly sampled from an $m$-dimensional Gaussian so that after points in the gap region are rejected there are $N$ points remaining. The correct homology (two connected components) is most persistent when $\beta$ is near $-1/m$ implying the optimal multi-scale graph construction is the CkNN construction where $\rho\propto q^{-1/m}$
Persistence of the true homology using the multi-scale graph construction with $\rho = q^{\beta}$. The underlying data sets are (a) a cut 1-dimensional Gaussian embedded in the plane with a loop introduced and (b) the same 2-dimensional cut Gaussian density as in Fig. 8(b). Persistence is measured as a percentage of the total number of possible edges in the graph, $N(N-1)/2$ as a function of $\beta$. Each data point represents an average over 200 data sets. The correct homology ($\beta_0 = 2$ and $\beta_1 = 1$ for both (a) and (b)) is most persistent when $\beta$ is near $-1/m$ implying the optimal multi-scale graph construction is the CkNN construction where $\rho\propto q^{-1/m}$
(a) Four simple patterns whose sub-image orbifolds exhibit nontrivial homology. The true values of $\beta_1$ for the four patterns are $1,2,2,$ and $3$ respectively. (b) Using fixed $\epsilon$-ball graph construction and selecting $\epsilon$ from the region where the homology generators span the largest proportion $L$ of total edges without changing (the most persistent homology). (c) Using the CkNN construction and selecting $\delta$ from the region where the homology generators span the largest $L$ without changing. The homology was computed with JavaPlex [42]
(a) Image of zebra stripes [17]. (b) Persistence diagram for the space of subimages. (c) The CkNN graph construction on the subimage space with $\delta$ chosen from the longest region where the homology is constant. (d) Image of fish scales [24]. (e) Persistence diagram. (f) The CkNN graph construction on the PCA projection of the subimage space with $\delta$ chosen from the longest region where the homology is constant. The homology was computed with JavaPlex [42]
For $N \in \{250,500,1000,2000,4000,8000\}$ we show (a) the root mean squared error (RMSE) of the pointwise approximation of $c^{-1}L_{\rm un}\vec f \approx -\sin(x_i)$ where $\vec f_i = \sin(x_i)$ and (b) the RMSE of the first 5 eigenvalues of $c^{-1}L_{\rm un}$. Both (a) and (b) are averaged over 10 random data sets for each $N$ and each $\delta$ and the minimum RMSE in each curve is highlighted with a black point. (c) For each $N$, we plot the optimal $\delta$ for pointwise approximation (black points) and the theoretical power law $\delta \propto N^{-1/7}$ (black, dashed line) and the optimal $\delta$ for spectral approximation (red points) and the theoretical power law $\delta \propto N^{-1/3}$ (red, dashed line)
 [1] Huai-Dong Cao and Jian Zhou. On quantum de Rham cohomology theory. Electronic Research Announcements, 1999, 5: 24-34. [2] Mark F. Demers, Hong-Kun Zhang. Spectral analysis of the transfer operator for the Lorentz gas. Journal of Modern Dynamics, 2011, 5 (4) : 665-709. doi: 10.3934/jmd.2011.5.665 [3] George Siopsis. Quantum topological data analysis with continuous variables. Foundations of Data Science, 2019, 1 (4) : 419-431. doi: 10.3934/fods.2019017 [4] Jiang Xie, Junfu Xu, Celine Nie, Qing Nie. Machine learning of swimming data via wisdom of crowd and regression analysis. Mathematical Biosciences & Engineering, 2017, 14 (2) : 511-527. doi: 10.3934/mbe.2017031 [5] S. Aaron, Z. Conn, Robert S. Strichartz, H. Yu. Hodge-de Rham theory on fractal graphs and fractals. Communications on Pure & Applied Analysis, 2014, 13 (2) : 903-928. doi: 10.3934/cpaa.2014.13.903 [6] Matthew O. Williams, Clarence W. Rowley, Ioannis G. Kevrekidis. A kernel-based method for data-driven koopman spectral analysis. Journal of Computational Dynamics, 2015, 2 (2) : 247-265. doi: 10.3934/jcd.2015005 [7] Massimiliano Guzzo, Giancarlo Benettin. A spectral formulation of the Nekhoroshev theorem and its relevance for numerical and experimental data analysis. Discrete & Continuous Dynamical Systems - B, 2001, 1 (1) : 1-28. doi: 10.3934/dcdsb.2001.1.1 [8] Luigi Fontana, Steven G. Krantz and Marco M. Peloso. Hodge theory in the Sobolev topology for the de Rham complex on a smoothly bounded domain in Euclidean space. Electronic Research Announcements, 1995, 1: 103-107. [9] Baolan Yuan, Wanjun Zhang, Yubo Yuan. A Max-Min clustering method for $k$-means algorithm of data clustering. Journal of Industrial & Management Optimization, 2012, 8 (3) : 565-575. doi: 10.3934/jimo.2012.8.565 [10] Daniel Mckenzie, Steven Damelin. Power weighted shortest paths for clustering Euclidean data. Foundations of Data Science, 2019, 1 (3) : 307-327. doi: 10.3934/fods.2019014 [11] O. A. Veliev. Essential spectral singularities and the spectral expansion for the Hill operator. Communications on Pure & Applied Analysis, 2017, 16 (6) : 2227-2251. doi: 10.3934/cpaa.2017110 [12] Aurore Back, Emmanuel Frénod. Geometric two-scale convergence on manifold and applications to the Vlasov equation. Discrete & Continuous Dynamical Systems - S, 2015, 8 (1) : 223-241. doi: 10.3934/dcdss.2015.8.223 [13] Keonhee Lee, Ngoc-Thach Nguyen, Yinong Yang. Topological stability and spectral decomposition for homeomorphisms on noncompact spaces. Discrete & Continuous Dynamical Systems - A, 2018, 38 (5) : 2487-2503. doi: 10.3934/dcds.2018103 [14] Saikat Mazumdar. Struwe's decomposition for a polyharmonic operator on a compact Riemannian manifold with or without boundary. Communications on Pure & Applied Analysis, 2017, 16 (1) : 311-330. doi: 10.3934/cpaa.2017015 [15] Eduardo Lara, Rodolfo Rodríguez, Pablo Venegas. Spectral approximation of the curl operator in multiply connected domains. Discrete & Continuous Dynamical Systems - S, 2016, 9 (1) : 235-253. doi: 10.3934/dcdss.2016.9.235 [16] Mario Ahues, Filomena D. d'Almeida, Alain Largillier, Paulo B. Vasconcelos. Defect correction for spectral computations for a singular integral operator. Communications on Pure & Applied Analysis, 2006, 5 (2) : 241-250. doi: 10.3934/cpaa.2006.5.241 [17] A. M. Micheletti, Angela Pistoia. Multiple eigenvalues of the Laplace-Beltrami operator and deformation of the Riemannian metric. Discrete & Continuous Dynamical Systems - A, 1998, 4 (4) : 709-720. doi: 10.3934/dcds.1998.4.709 [18] Shingo Takeuchi. Partial flat core properties associated to the $p$-laplace operator. Conference Publications, 2007, 2007 (Special) : 965-973. doi: 10.3934/proc.2007.2007.965 [19] Micol Amar, Roberto Gianni. Laplace-Beltrami operator for the heat conduction in polymer coating of electronic devices. Discrete & Continuous Dynamical Systems - B, 2018, 23 (4) : 1739-1756. doi: 10.3934/dcdsb.2018078 [20] Chengxiang Wang, Li Zeng, Yumeng Guo, Lingli Zhang. Wavelet tight frame and prior image-based image reconstruction from limited-angle projection data. Inverse Problems & Imaging, 2017, 11 (6) : 917-948. doi: 10.3934/ipi.2017043

Impact Factor:

## Tools

Article outline

Figures and Tables