Foundations of Data Science, March 2021, 3(1): 49-66. doi: 10.3934/fods.2021005

A topological approach to spectral clustering

Antonio Rieser, Centro de Investigación en Matemáticas, Calle Jalisco S/N, Colonia Valenciana, Guanajuato C.P. 36023, GTO, Mexico

* Corresponding author: Antonio Rieser

Received: July 2020. Revised: January 2021. Published: March 2021.

Fund Project: This research was supported in part by the TOPOSYS project FP7-ICT-318493-STREP, Cátedras CONACYT 1076, and Grant #N62909-19-1-2134 from the US Office of Naval Research Global and the Southern Office of Aerospace Research and Development of the US Air Force Office of Scientific Research.

We propose two related unsupervised clustering algorithms which take as input data assumed to be sampled from a uniform distribution supported on a metric space $ X $, and which output a clustering of the data based on the selection of a topological model for the connected components of $ X $. Both algorithms work by selecting a graph on the samples from a natural one-parameter family of graphs, using a geometric criterion in the first case and an information-theoretic criterion in the second. The estimated connected components of $ X $ are identified with the kernel of the associated graph Laplacian, which allows the algorithms to work without requiring the expected number of clusters or other auxiliary data as input.
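The core spectral idea in the abstract can be sketched as follows. This is a minimal illustration, assuming a simple $\varepsilon$-neighborhood graph and the standard unnormalized graph Laplacian; the function names are hypothetical, and the paper's actual graph-selection criteria (ARE and ALVR) are defined in the full text:

```python
import numpy as np
from scipy.sparse.csgraph import laplacian
from scipy.spatial.distance import cdist

def epsilon_graph(points, eps):
    """Adjacency matrix of the eps-neighborhood graph on the samples."""
    D = cdist(points, points)
    A = (D <= eps).astype(float)
    np.fill_diagonal(A, 0.0)  # no self-loops
    return A

def count_clusters(points, eps, tol=1e-8):
    """Estimate beta_0 as the dimension of the kernel of the graph Laplacian.

    Each zero eigenvalue of the Laplacian corresponds to one connected
    component of the graph, so no target number of clusters is needed.
    """
    A = epsilon_graph(points, eps)
    L = laplacian(A)
    eigvals = np.linalg.eigvalsh(L)
    return int(np.sum(np.abs(eigvals) < tol))
```

For points forming two well-separated chains, a scale between the within-chain spacing and the gap recovers two components; a scale below the spacing disconnects every point.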

Citation: Antonio Rieser. A topological approach to spectral clustering. Foundations of Data Science, 2021, 3 (1) : 49-66. doi: 10.3934/fods.2021005
Figure 6.1.  An accurate classification of $ 500 $ points using the Average Local Volume Ratio method, illustrated by color. The classification produced by the Average Relative Entropy method was identical in this experiment
Figure 6.2.  Average Relative Entropy vs. Scale, for the experiment in Figure 6.1. Note that local maxima appear where the smaller circles join to a larger cluster
Figure 6.3.  Average Local Volume Ratio vs. Scale for the experiment in Figure 6.1. Note that local minima appear where the smaller circles join with the larger circle
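Figures 6.2 and 6.3 show scores used to select a scale from the one-parameter family of graphs; the ARE and ALVR scores themselves are defined in the full text. As a stand-in, the sketch below (assumed names, not the paper's code) shows the quantity both criteria ultimately control, namely how the estimated number of clusters varies as the neighborhood scale sweeps through the family:

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import connected_components

def beta0_vs_scale(points, scales):
    """Number of connected components of the eps-neighborhood graph at each scale."""
    D = cdist(points, points)
    counts = []
    for eps in scales:
        A = (D <= eps).astype(float)
        np.fill_diagonal(A, 0.0)  # no self-loops
        n_components, _ = connected_components(A, directed=False)
        counts.append(n_components)
    return counts
```

Sweeping the scale over a grid and inspecting where the curve plateaus (or where a score such as ARE/ALVR has a local extremum) is what selects the working scale.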
Figure A.1.  Example classification of 500 points using the Average Local Volume Ratio method, illustrated by color, for an experiment where the algorithm reported 4 clusters. One can see that the vast majority of points are correctly classified, although a small group of points in the circle on the left are assigned to their own cluster
Figure A.2.  The classification of 500 points using the Average Local Volume Ratio method, illustrated by color, for an experiment where the algorithm reported 5 clusters. One can see that the majority of points are correctly classified, even though the horizontal circle is split into three groups
Figure A.3.  Example classification of 1000 points using the Average Local Volume Ratio method, illustrated by color, for an experiment where the algorithm reported 4 clusters. One can see that the vast majority of points are correctly classified, although one outlier is assigned to its own cluster
Figure A.4.  Example classification of 1000 points using the Average Local Volume Ratio method, illustrated by color, for an experiment where the algorithm reported 5 clusters. One can see that the vast majority of points are correctly classified, and there are two outliers visible in different shades of light blue, one partially obscured by the circle on the left
Figure A.5.  Example classification of 1000 points using the Average Local Volume Ratio method, illustrated by color, for an experiment where the algorithm reported 6 clusters. As in Figures A.3 and A.4, the vast majority of points were classified correctly, and the additional clusters are from (partially obscured) outliers
Figure A.6.  The classification of 500 points using the Average Relative Entropy method, illustrated by color, for an experiment where the algorithm reported 4 clusters. One can see that the vast majority of points are correctly classified, although there are three outliers at least partially visible: the darker green point on the left circle, a light violet point in the center circle, and a dark violet point which is mostly obscured by the center circle
Figure A.7.  The classification of 500 points using the Average Relative Entropy method, illustrated by color, for an experiment where the algorithm reported 5 clusters. The majority of points are correctly classified, although there are two extra groups assigned to the slightly separated groups of points in the circle on the left
Figure A.8.  The classification of 500 points using the Average Relative Entropy method, illustrated by color, for an experiment where the algorithm reported 6 clusters. As in Figures A.6 and A.7, the majority of points were classified correctly, although the circle on the right is split into top and bottom parts, there is an outlier (in blue) on the horizontal circle, as well as a small group of separated points (in indigo) on the horizontal circle
Figure A.9.  Example classification of 1000 points using the Average Relative Entropy method, illustrated by color, for an experiment where the algorithm reported 4 clusters. One can see that the vast majority of points are correctly classified, although there is an outlier on the right circle given its own cluster
Figure A.10.  Example classification of 1000 points using the Average Relative Entropy Method, illustrated by color, for an experiment where the algorithm reported 5 clusters. One can see that the vast majority of points are correctly classified, although the center circle is split into 2 subgroups, and there is an outlier which is given its own cluster
Figure A.11.  Example classification of 1000 points using the Average Relative Entropy method, illustrated by color, for an experiment where the algorithm reported 6 clusters. One can see that the majority of points were classified correctly, although the left circle is split into two subgroups, and there are two outliers given their own cluster
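The colorings in the figures above assign each sample to its estimated connected component. A minimal sketch of that labeling step, again assuming a simple eps-neighborhood graph (the function name is hypothetical):

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import connected_components

def cluster_labels(points, eps):
    """Label each sample by its connected component in the eps-neighborhood graph."""
    D = cdist(points, points)
    A = (D <= eps).astype(float)
    np.fill_diagonal(A, 0.0)  # no self-loops
    _, labels = connected_components(A, directed=False)
    return labels
```

The returned integer labels can be passed directly to a plotting call such as `plt.scatter(x, y, c=labels)` to reproduce a coloring of the kind shown in the figures.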
Table 1.  Results for the Average Relative Entropy method, using $ 500 $ and $ 1000 $ sample points. This table shows the percentages, rounded to the nearest hundredth, of the $ 150 $ trials using the standard deviation in the left-hand column in which the ARE algorithm returned a given number $ \beta_0 $ of clusters
Average Relative Entropy Method, 500 sample points

SD | β0    1      2      3      4      5      6      7     8     9     ≥ 10
0.01       0      0      64     18.67  10.67  5.33   0.67  0.67  0     0
0.02       0      0      51.33  27.33  12.67  4      2     2     0.67  0
0.03       0      0      54.67  26.67  10.67  4      3.33  0.67  0     0
0.04       0.67   2      44     23.33  13.33  5.33   3.33  2.67  2     3.33
0.05       6      26     18     20     8.67   4.67   4.67  4.67  2     5.33

Average Relative Entropy Method, 1000 sample points

SD | β0    1      2      3      4      5      6      7     8     9     ≥ 10
0.01       0      0      62.67  26     7.33   0.67   2     1.33  0     0
0.02       0      0      59.33  22     8      9.33   0.67  0.67  0     0
0.03       0      0      23.33  28     22     10.67  4.67  1.33  0     10
0.04       0      0      4      9.33   12     2      5.33  4     5.33  58.01
0.05       0      0      1.33   5.33   2.67   4.67   2     5.33  2.67  76
Table 2.  Results for the Average Local Volume Ratio method, using $ 500 $ and $ 1000 $ sample points. This table shows the percentages, rounded to the nearest hundredth, of the $ 150 $ trials using the standard deviation in the left-hand column in which the ALVR algorithm returned a given number $ \beta_0 $ of clusters
Average Local Volume Ratio Method, 500 sample points

SD | β0    1      2      3      4      5      6      7     8     9     ≥ 10
0.01       0      0      96     3.33   0.67   0      0     0     0     0
0.02       18     0      76     6      0      0      0     0     0     0
0.03       65.33  0      32     2.67   0      0      0     0     0     0
0.04       95.33  1.33   2.67   0.67   0      0      0     0     0     0
0.05       91.33  7.33   1.33   0      0      0      0     0     0     0

Average Local Volume Ratio Method, 1000 sample points

SD | β0    1      2      3      4      5      6      7     8     9     ≥ 10
0.01       0      0      98.67  1.33   0      0      0     0     0     0
0.02       0      0      89.33  8      1.33   1.33   0     0     0     0
0.03       0      0      45.33  28     16     6.67   2.67  1.33  0     0
0.04       1.33   0      13.33  21.33  26     11.33  12    6     3.33  0
0.05       75.33  11.33  2.67   4.67   1.33   2      0.67  1.33  0.67  5.38
