For data sampled from an arbitrary density on a manifold embedded in Euclidean space, the Continuous k-Nearest Neighbors (CkNN) graph construction is introduced. It is shown that CkNN is geometrically consistent in the sense that under certain conditions, the unnormalized graph Laplacian converges to the Laplace-Beltrami operator, spectrally as well as pointwise. It is proved for compact (and conjectured for noncompact) manifolds that CkNN is the unique unweighted construction that yields a geometry consistent with the connected components of the underlying manifold in the limit of large data. Thus CkNN produces a single graph that captures all topological features simultaneously, in contrast to persistent homology, which represents each homology generator at a separate scale. As applications we derive a new fast clustering algorithm and a method to identify patterns in natural images topologically. Finally, we conjecture that CkNN is topologically consistent, meaning that the homology of the Vietoris-Rips complex (implied by the graph Laplacian) converges to the homology of the underlying manifold (implied by the Laplace-de Rham operators) in the limit of large data.
Citation: |
Figure 12. For $ N \in \{1000,5000,20000\} $ uniformly random data points on a unit circle. (a) Using $ \delta = 3N^{-1/7} $ and letting $ \vec f_i = \sin(\theta_i) $ we compare the pointwise estimate $ \left(c^{-1}L_{\rm un}\vec f \right)_i $ (red, dashed) to the true operator $ \Delta f(x_i) = -\sin(\theta_i) $ (black, solid). (b) Using $ \delta = 3N^{-1/7} $ we compare the first 9 eigenvalues of $ c^{-1}L_{\rm un} $ (red, dashed) to the first 9 eigenvalues of $ \Delta $ (black, solid). (c, d) Same as (a, b) respectively but with $ \delta = 3N^{-1/3} $
Figure 5. (a) Standard $ \epsilon $-ball persistent 1-homology reveals that no value of $ \epsilon $ captures both holes simultaneously: Each homology class persists at separate $ \epsilon $ scales. (b) When $ \epsilon = 0.16 $ we see that the small hole is completely triangulated (the black segment completes the triangulation), while the large hole is still not connected. (c) CkNN persistent 1-homology as a function of $ \delta $ shows that both homology classes (top two lines) are present simultaneously over a large range of $ \delta $ values. (d) When $ \delta = 0.1 $, the resulting graph represents the true topology of the manifold
Figure 6. (a) Data sampled from a 2-dimensional Gaussian with a radial gap forming a non-compact manifold. (b) Standard fixed $ \epsilon $-ball persistence diagram has no persistent region with 2 connected components and one non-contractible hole. (c) When $ \epsilon = 0.135 $ the 1-homology is correct, but outliers are still not connected. (d) For $ \epsilon = 0.16 $ the outliers are still not connected and the middle section is bridged. For any size gap this bridge will happen in the limit of large data as the outliers become more spaced out. (e) CkNN persistence in terms of $ \delta $ shows the correct homology ($ \beta_0 = 2 $ and $ \beta_1 = 1 $) for $ 0.180<\delta<0.215 $. (f) The CkNN construction for $ \delta = 0.20 $
Figure 2. Data set from Fig. 1.(a) Graph based on connecting each point to its (first) nearest neighbor leaves all regions disconnected internally. (b) Connecting each point to its two nearest neighbors bridges the sparse region to one of the dense regions before fully connecting the left and center boxes
Figure 3. Data set from Fig. 1.(a) Color indicates the relative values of the bandwidth function $\rho(x) = d(x,x_k)$ using $k = 10$ and optimally tuning $\delta$ (blue = low, red = high). (b) Graph connecting pairs of data points with $||x-y|| < \delta \sqrt{\rho(x)\rho(y)}$. Notice that the connected components are fully connected and triangulated, yielding the correct homology in dimensions $0, \; 1$, and $2$
Figure 4. Non-compact version of data set from Fig. 1, density of leftmost 'box' decays continuously to zero. (a) The kNN 'AND' construction bridges the components while the 1-homology still has spurious generators (holes) in the sparse regions. (b) The CkNN is capable of capturing the correct homology for this data set. Both methods use $k = 10$ and the value of $\delta$ was tuned to give the best results for each method. Decreasing $\delta$ for (a) would obtain the correct clusters but introduce more spurious holes
Figure 7. The fast clustering algorithm applied to a set of three spirals with densities which decay exponentially along the length of the spiral. (a) 1000 total points sampled from three 2-dimensional spirals in the plane. (b) Diagram shows number of clusters as a function of the percentage of possible edges added to the graph (a unitless measure of persistence) (c) Graph corresponding to the maximum number of edges with 3 components, colored according to identification by depth-first-search. (d) 2000 total points sampled from three 3-dimensional spirals in $\mathbb{R}^3$. (e) Large interval identifies the correct number of clusters. (f) Graph with maximum number of edges having 3 components, colored by clustering algorithm
Figure 8. Persistence of the clustering of a cut-Gaussian distribution using the multi-scale graph construction with $\rho = q^{\beta}$, where the distribution is in dimension (a) $1$ (b) $2$ (c) $3$ (d) $4$. Persistence is measured as a percentage of the total number of possible edges in the graph, $N(N-1)/2$ as a function of $\beta$. Each data point represents an average over 500 data sets, each data set starts with sufficiently many points randomly sampled from an $m$-dimensional Gaussian so that after points in the gap region are rejected there are $N$ points remaining. The correct homology (two connected components) is most persistent when $\beta$ is near $-1/m$ implying the optimal multi-scale graph construction is the CkNN construction where $\rho\propto q^{-1/m}$
Figure 9. Persistence of the true homology using the multi-scale graph construction with $\rho = q^{\beta}$. The underlying data sets are (a) a cut 1-dimensional Gaussian embedded in the plane with a loop introduced and (b) the same 2-dimensional cut Gaussian density as in Fig. 8(b). Persistence is measured as a percentage of the total number of possible edges in the graph, $N(N-1)/2$ as a function of $\beta$. Each data point represents an average over 200 data sets. The correct homology ($\beta_0 = 2$ and $\beta_1 = 1$ for both (a) and (b)) is most persistent when $\beta$ is near $-1/m$ implying the optimal multi-scale graph construction is the CkNN construction where $\rho\propto q^{-1/m}$
Figure 10. (a) Four simple patterns whose sub-image orbifolds exhibit nontrivial homology. The true values of $\beta_1$ for the four patterns are $1,2,2,$ and $3$ respectively. (b) Using fixed $\epsilon$-ball graph construction and selecting $\epsilon$ from the region where the homology generators span the largest proportion $L$ of total edges without changing (the most persistent homology). (c) Using the CkNN construction and selecting $\delta$ from the region where the homology generators span the largest $L$ without changing. The homology was computed with JavaPlex [42]
Figure 11. (a) Image of zebra stripes [17]. (b) Persistence diagram for the space of subimages. (c) The CkNN graph construction on the subimage space with $ \delta $ chosen from the longest region where the homology is constant. (d) Image of fish scales [24]. (e) Persistence diagram. (f) The CkNN graph construction on the PCA projection of the subimage space with $ \delta $ chosen from the longest region where the homology is constant. The homology was computed with JavaPlex [42]
Figure 13. For $ N \in \{250,500,1000,2000,4000,8000\} $ we show (a) the root mean squared error (RMSE) of the pointwise approximation of $ c^{-1}L_{\rm un}\vec f \approx -\sin(x_i) $ where $ \vec f_i = \sin(x_i) $ and (b) the RMSE of the first 5 eigenvalues of $ c^{-1}L_{\rm un} $. Both (a) and (b) are averaged over 10 random data sets for each $ N $ and each $ \delta $ and the minimum RMSE in each curve is highlighted with a black point. (c) For each $ N $, we plot the optimal $ \delta $ for pointwise approximation (black points) and the theoretical power law $ \delta \propto N^{-1/7} $ (black, dashed line) and the optimal $ \delta $ for spectral approximation (red points) and the theoretical power law $ \delta \propto N^{-1/3} $ (red, dashed line)
[1] | M. Belkin and P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Computation, 15 (2003), 1373-1396. doi: 10.1162/089976603321780317. |
[2] | M. Belkin and P. Niyogi, Convergence of Laplacian eigenmaps, Advances in Neural Information Processing Systems, (2007), 139-136. |
[3] | T. Berry, J. R. Cressman, Z. G. Ferenček and T. Sauer, Time-scale separation from diffusion-mapped delay coordinates, SIAM J. Appl. Dyn. Sys, 12 (2013), 618-649. doi: 10.1137/12088183X. |
[4] | T. Berry and J. Harlim, Variable bandwidth diffusion kernels, Appl. Comp. Harmonic Anal., 40 (2016), 68-96. doi: 10.1016/j.acha.2015.01.001. |
[5] | T. Berry and T. Sauer, Local kernels and the geometric structure of data, Appl. Comp. Harmonic Anal., 40 (2016), 439-469. doi: 10.1016/j.acha.2015.03.002. |
[6] | T. Berry and D. Giannakis, Spectral exterior calculus, arXiv preprint, arXiv: 1802.01209. |
[7] | O. Bobrowski, S. Mukherjee, J. E. Taylor et al., Topological consistency via kernel estimation, Bernoulli, 23 (2017), 288–328. doi: 10.3150/15-BEJ744. |
[8] | G. Carlsson, Topology and data, Bulletin of the American Mathematical Society, 46 (2009), 255-308. doi: 10.1090/S0273-0979-09-01249-X. |
[9] | J. Chacón, A population background for nonparametric density-based clustering, Statistical Science, 30 (2015), 518-532. doi: 10.1214/15-STS526. |
[10] | K. Chaudhuri, S. Dasgupta, S. Kpotufe and U. von Luxburg, Consistent procedures for cluster tree estimation and pruning, Information Theory, IEEE Transactions on, 60 (2014), 7900-7912. doi: 10.1109/TIT.2014.2361055. |
[11] | A. Cianchi and V. Maz'ya, On the discreteness of the spectrum of the laplacian on noncompact riemannian manifolds, J. Differential Geom., 87 (2011), 469–492, URL http://projecteuclid.org/euclid.jdg/1312998232. doi: 10.4310/jdg/1312998232. |
[12] | R. Coifman and S. Lafon, Diffusion maps, Appl. Comp. Harmonic Anal., 21 (2006), 5-30. doi: 10.1016/j.acha.2006.04.006. |
[13] | R. Coifman, S. Lafon, B. Nadler and I. Kevrekidis, Diffusion maps, spectral clustering and reaction coordinates of dynamical systems, Appl. Comp. Harmonic Anal., 21 (2006), 113-127. doi: 10.1016/j.acha.2005.07.004. |
[14] | R. Coifman, Y. Shkolnisky, F. Sigworth and A. Singer, Graph Laplacian tomography from unknown random projections, IEEE Trans. on Image Proc., 17 (2008), 1891-1899. doi: 10.1109/TIP.2008.2002305. |
[15] | M. Desbrun, E. Kanso and Y. Tong, Discrete differential forms for computational modeling, in Discrete Differential Geometry, Springer, 38 (2008), 287–324. doi: 10.1007/978-3-7643-8621-4_16. |
[16] | H. Edelsbrunner and J. Harer, Computational Toplogy: An Introduction, American Mathematical Soc., 2010. |
[17] | S. Fazel, Zebra in mikumi.jpg, 2012, URL https://commons.wikimedia.org/wiki/File:Zebra_in_Mikumi.JPG, https://commons.wikimedia.org/wiki/File:Zebra_in_Mikumi.JPG; accessed June 3, 2016; Creative Commons License. |
[18] | R. Ghrist, Barcodes: The persistent topology of data, Bulletin of the American Mathematical Society, 45 (2008), 61-75. doi: 10.1090/S0273-0979-07-01191-3. |
[19] | D. Giannakis and A. J. Majda, Nonlinear laplacian spectral analysis for time series with intermittency and low-frequency variability, Proceedings of the National Academy of Sciences, 109 (2012), 2222-2227. doi: 10.1073/pnas.1118984109. |
[20] | M. Hein, Geometrical aspects of statistical learning theory, Thesis, URL http://elib.tu-darmstadt.de/diss/000673. |
[21] | M. Hein, Uniform convergence of adaptive graph-based regularization, in Learning Theory, Springer, 4005 (2006), 50–64. doi: 10.1007/11776420_7. |
[22] | M. Hein, J.-Y. Audibert and U. Von Luxburg, From graphs to manifolds–weak and strong pointwise consistency of graph Laplacians, in Learning Theory, Springer, 3559 (2005), 470–485. doi: 10.1007/11503415_32. |
[23] | A. N. Hirani, Discrete Exterior Calculus, PhD thesis, California Institute of Technology, 2003. |
[24] | Kallerna, Scale common roach.jpg, 2009, URL https://commons.wikimedia.org/wiki/File:Scale_Common_Roach.JPG, https://commons.wikimedia.org/wiki/File:Scale_Common_Roach.JPG; accessed June 3, 2016; Creative Commons License. |
[25] | J. Latschev, Vietoris-Rips complexes of metric spaces near a closed Riemannian manifold, Archiv der Mathematik, 77 (2001), 522-528. doi: 10.1007/PL00000526. |
[26] | D. Loftsgaarden, C. Quesenberry et al., A nonparametric estimate of a multivariate density function, The Annals of Mathematical Statistics, 36 (1965), 1049-1051. doi: 10.1214/aoms/1177700079. |
[27] | M. Maier, M. Hein and U. Von Luxburg, Cluster identification in nearest-neighbor graphs, in Algorithmic Learning Theory, Springer, 2007, 196–210. |
[28] | M. Maier, M. Hein and U. von Luxburg, Optimal construction of k-nearest-neighbor graphs for identifying noisy clusters, Theoretical Computer Science, 410 (2009), 1749-1764. |
[29] | M. Maier, U. Von Luxburg and M. Hein, How the result of graph clustering methods depends on the construction of the graph, ESAIM: Probability and Statistics, 17 (2013), 370-418. doi: 10.1051/ps/2012001. |
[30] | B. Nadler and M. Galun, Fundamental limitations of spectral clustering methods, in Advances in Neural Information Processing Systems 19 (eds. B. Schölkopf, J. Platt and T. Hoffman), MIT Press, Cambridge, MA, 2007. |
[31] | B. Nadler, S. Lafon, R. Coifman and I. Kevrekidis, Diffusion maps-a probabilistic interpretation for spectral embedding and clustering algorithms, in Principal Manifolds for Data Visualization and Dimension Reduction, Springer, NY, 58 (2008), 238–260. doi: 10.1007/978-3-540-73750-6_10. |
[32] | B. Nadler, S. Lafon, I. Kevrekidis and R. Coifman, Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators, in Advances in Neural Information Processing Systems, MIT Press, 2005, 955–962. |
[33] | A. Ng, M. Jordan and Y. Weiss, On spectral clustering: Analysis and an algorithm, Neur. Inf. Proc. Soc. |
[34] | P. Niyogi, S. Smale and S. Weinberger, Finding the homology of submanifolds with high confidence from random samples, Discrete & Computational Geometry, 39 (2008), 419-441. doi: 10.1007/s00454-008-9053-2. |
[35] | S. Rosenberg, The Laplacian on a Riemannian Manifold, Cambridge University Press, 1997. doi: 10.1017/CBO9780511623783. |
[36] | D. S. Rufat, Spectral Exterior Calculus and Its Implementation, PhD thesis, California Institute of Technology, 2017, URL http://resolver.caltech.edu/CaltechTHESIS:05302017-094600781. |
[37] | B. Schölkopf, A. Smola and K. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Computation, 10 (1998), 1299-1319. |
[38] | E. Schulz and G. Tsogtgerel, Convergence of discrete exterior calculus approximations for Poisson problems, 2016. |
[39] | Z. Shi, Convergence of laplacian spectra from random samples., arXiv preprint, arXiv: 1507.00151. |
[40] | A. Singer, From graph to manifold Laplacian: The convergence rate, Applied and Computational Harmonic Analysis, 21 (2006), 128-134. doi: 10.1016/j.acha.2006.03.004. |
[41] | I. Steinwart, Fully adaptive density-based clustering, The Annals of Statistics, 43 (2015), 2132-2167. doi: 10.1214/15-AOS1331. |
[42] | A. Tausz, M. Vejdemo-Johansson and H. Adams, JavaPlex: A research software package for persistent (co)homology, in Proceedings of ICMS 2014 (eds. H. Hong and C. Yap), Lecture Notes in Computer Science 8592, 2014, 129–136, Software available at http://appliedtopology.github.io/javaplex/. |
[43] | D. G. Terrell and D. W. Scott, Variable kernel density estimation, Annals of Statistics, 20 (1992), 1236-1265. doi: 10.1214/aos/1176348768. |
[44] | D. Ting, L. Huang and M. Jordan, An analysis of the convergence of graph Laplacians, in Proceedings of the 27th International Conference on Machine Learning (ICML), 2010. |
[45] | N. G. Trillos, M. Gerlach, M. Hein and D. Slepcev, Error estimates for spectral convergence of the graph Laplacian on random geometric graphs towards the Laplace-Beltrami operator, arXiv preprint, arXiv: 1801.10108. |
[46] | N. G. Trillos and D. Slepčev, A variational approach to the consistency of spectral clustering, Applied and Computational Harmonic Analysis, 45 (2018), 239-281. |
[47] | U. Von Luxburg, A tutorial on spectral clustering, Statistics and Computing, 17 (2007), 395-416. doi: 10.1007/s11222-007-9033-z. |
[48] | U. Von Luxburg, M. Belkin and O. Bousquet, Consistency of spectral clustering, Annals of Statistics, 36 (2008), 555-586. doi: 10.1214/009053607000000640. |
[49] | L. Zelnick-Manor and P. Perona, Self-tuning spectral clustering, Adv. Neur. Inf. Proc. Sys. (2005) |
For
(a) Standard
(a) Data sampled from a 2-dimensional Gaussian with a radial gap forming a non-compact manifold. (b) Standard fixed
Rectangular regions indicate the true clusters. (a) Circles of a fixed radius
Data set from Fig. 1.(a) Graph based on connecting each point to its (first) nearest neighbor leaves all regions disconnected internally. (b) Connecting each point to its two nearest neighbors bridges the sparse region to one of the dense regions before fully connecting the left and center boxes
Data set from Fig. 1.(a) Color indicates the relative values of the bandwidth function
Non-compact version of data set from Fig. 1, density of leftmost 'box' decays continuously to zero. (a) The kNN 'AND' construction bridges the components while the 1-homology still has spurious generators (holes) in the sparse regions. (b) The CkNN is capable of capturing the correct homology for this data set. Both methods use
The fast clustering algorithm applied to a set of three spirals with densities which decay exponentially along the length of the spiral. (a) 1000 total points sampled from three 2-dimensional spirals in the plane. (b) Diagram shows number of clusters as a function of the percentage of possible edges added to the graph (a unitless measure of persistence) (c) Graph corresponding to the maximum number of edges with 3 components, colored according to identification by depth-first-search. (d) 2000 total points sampled from three 3-dimensional spirals in
Persistence of the clustering of a cut-Gaussian distribution using the multi-scale graph construction with
Persistence of the true homology using the multi-scale graph construction with
(a) Four simple patterns whose sub-image orbifolds exhibit nontrivial homology. The true values of
(a) Image of zebra stripes [17]. (b) Persistence diagram for the space of subimages. (c) The CkNN graph construction on the subimage space with
For