March 2021, 3(1): 49-66. doi: 10.3934/fods.2021005

A topological approach to spectral clustering

Centro de Investigación en Matemáticas, Calle Jalisco S/N, Colonia Valenciana, Guanajuato C.P. 36023, GTO, Mexico

* Corresponding author: Antonio Rieser

Received July 2020. Revised January 2021. Published March 2021. Early access March 2021.

Fund Project: This research was supported in part by the TOPOSYS project FP7-ICT-318493-STREP, Cátedras CONACYT 1076, and Grant #N62909-19-1-2134 from the US Office of Naval Research Global and the Southern Office of Aerospace Research and Development of the US Air Force Office of Scientific Research.

We propose two related unsupervised clustering algorithms which take as input data assumed to be sampled from a uniform distribution supported on a metric space $ X $, and which output a clustering of the data based on the selection of a topological model for the connected components of $ X $. Both algorithms work by selecting a graph on the samples from a natural one-parameter family of graphs, using a geometric criterion in the first case and an information-theoretic criterion in the second. The estimated connected components of $ X $ are identified with the kernel of the associated graph Laplacian, which allows the algorithms to work without requiring the expected number of clusters or other auxiliary data as input.
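The abstract's central device, identifying the estimated connected components with the kernel of a graph Laplacian, can be made concrete in a few lines. The sketch below is not the paper's implementation: it builds a single $ \varepsilon $-neighborhood graph (one member of a one-parameter family of graphs of the kind mentioned above), forms the unnormalized Laplacian $ L = D - A $, and counts clusters as the multiplicity of the eigenvalue $ 0 $. The choice of $ \varepsilon $, the tolerance tol, and the use of the unnormalized Laplacian are assumptions made for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def laplacian_cluster_count(points, eps, tol=1e-10):
    """Estimate the number of connected components (beta_0) of the
    eps-neighborhood graph on `points` as the dimension of the kernel
    of the unnormalized graph Laplacian L = D - A."""
    dists = squareform(pdist(points))      # pairwise Euclidean distances
    adj = (dists <= eps).astype(float)     # eps-neighborhood graph
    np.fill_diagonal(adj, 0.0)             # no self-loops
    lap = np.diag(adj.sum(axis=1)) - adj   # unnormalized Laplacian L = D - A
    eigvals = np.linalg.eigvalsh(lap)      # L is symmetric positive semidefinite
    return int(np.sum(eigvals < tol))      # multiplicity of the eigenvalue 0

# Two well-separated blobs should yield a two-dimensional kernel.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(50, 2))
print(laplacian_cluster_count(np.vstack([blob_a, blob_b]), eps=1.0))  # -> 2
```

Reading the number of clusters off the kernel in this way is what allows the algorithms to run without the expected number of clusters as input.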

Citation: Antonio Rieser. A topological approach to spectral clustering. Foundations of Data Science, 2021, 3 (1) : 49-66. doi: 10.3934/fods.2021005

Figure 6.1.  An accurate classification of $ 500 $ points using the Average Local Volume Ratio method, illustrated by color. The classification produced by the Average Relative Entropy method was identical in this experiment
Figure 6.2.  Average Relative Entropy vs. Scale, for the experiment in Figure 6.1. Note that local maxima appear where the smaller circles join the larger cluster
Figure 6.3.  Average Local Volume Ratio vs. Scale for the experiment in Figure 6.1. Note that local minima appear where the smaller circles join with the larger circle
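The curves in Figures 6.2 and 6.3 are computed over the one-parameter family of graphs obtained by varying the connectivity scale. The Average Relative Entropy and Average Local Volume Ratio criteria are defined in the body of the paper, not in this excerpt, so the sketch below uses a simpler stand-in diagnostic, the number of connected components of the $ \varepsilon $-neighborhood graph as a function of the scale $ \varepsilon $, to show the shape of such a scale sweep; the scale grid and the merge-detection rule are assumptions.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import pdist, squareform

def beta0_vs_scale(points, scales):
    """Record the estimated number of connected components of the
    eps-neighborhood graph over a one-parameter family of graphs,
    one graph per scale in `scales`."""
    dists = squareform(pdist(points))
    return np.array([
        connected_components(dists <= eps, directed=False)[0]
        for eps in scales
    ])

# Scales at which the component count drops are where clusters merge; these
# play the role of the local extrema in Figures 6.2 and 6.3.
# scales = np.linspace(0.05, 1.0, 40)            # assumed scale grid
# betas = beta0_vs_scale(data, scales)
# merge_scales = scales[1:][np.diff(betas) < 0]  # assumed merge-detection rule
```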
Figure A.1.  Example classification of 500 points using the Average Local Volume Ratio method, illustrated by color, for an experiment where the algorithm reported 4 clusters. One can see that the vast majority of points are correctly classified, although a small group of points in the circle on the left are assigned to their own cluster
Figure A.2.  The classification of 500 points using the Average Local Volume Ratio method, illustrated by color, for an experiment where the algorithm reported 5 clusters. One can see that the majority of points are correctly classified, even though the horizontal circle is split into three groups
Figure A.3.  Example classification of 1000 points using the Average Local Volume Ratio method, illustrated by color, for an experiment where the algorithm reported 4 clusters. One can see that the vast majority of points are correctly classified, although one outlier is assigned to its own cluster
Figure A.4.  Example classification of 1000 points using the Average Local Volume Ratio method, illustrated by color, for an experiment where the algorithm reported 5 clusters. One can see that the vast majority of points are correctly classified, and there are two outliers visible in different shades of light blue, one partially obscured by the circle on the left
Figure A.5.  Example classification of 1000 points using the Average Local Volume Ratio method, illustrated by color, for an experiment where the algorithm reported 6 clusters. As in Figures A.3 and A.4, the vast majority of points were classified correctly, and the additional clusters are from (partially obstructed) outliers
Figure A.6.  The classification of 500 points using the Average Relative Entropy method, illustrated by color, for an experiment where the algorithm reported 4 clusters. One can see that the vast majority of points are correctly classified, although there are three outliers at least partially visible: the darker green point on the left circle, a light violet point in the center circle, and a dark violet point which is mostly obscured by the center circle
Figure A.7.  The classification of 500 points using the Average Relative Entropy method, illustrated by color, for an experiment where the algorithm reported 5 clusters. The majority of points are correctly classified, although there are two extra groups assigned to the slightly separated groups of points in the circle on the left
Figure A.8.  The classification of 500 points using the Average Relative Entropy method, illustrated by color, for an experiment where the algorithm reported 6 clusters. As in Figures A.6 and A.7, the majority of points were classified correctly, although the circle on the right is split into top and bottom parts, there is an outlier (in blue) on the horizontal circle, as well as a small group of separated points (in indigo) on the horizontal circle
Figure A.9.  Example classification of 1000 points using the Average Relative Entropy method, illustrated by color, for an experiment where the algorithm reported 4 clusters. One can see that the vast majority of points are correctly classified, although there is an outlier on the right circle given its own cluster
Figure A.10.  Example classification of 1000 points using the Average Relative Entropy method, illustrated by color, for an experiment where the algorithm reported 5 clusters. One can see that the vast majority of points are correctly classified, although the center circle is split into two subgroups, and there is an outlier which is given its own cluster
Figure A.11.  Example classification of 1000 points using the Average Relative Entropy method, illustrated by color, for an experiment where the algorithm reported 6 clusters. One can see that the majority of points were classified correctly, although the left circle is split into two subgroups, and there are two outliers, each given its own cluster
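The experiments in Figures 6.1 and A.1-A.11 use synthetic circle data with $ 500 $ or $ 1000 $ points, perturbed by Gaussian noise whose standard deviation is the left-hand column of Tables 1 and 2 below. The exact radii and centers of the circles are not given in this excerpt, so the generator below is a hypothetical stand-in; only the sample sizes and noise levels come from the text.

```python
import numpy as np

def noisy_circle(n, center, radius, sd, rng):
    """Sample n points on a circle and perturb them with isotropic
    Gaussian noise of standard deviation sd (the SD column below)."""
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    pts = radius * np.column_stack([np.cos(theta), np.sin(theta)])
    return pts + np.asarray(center) + rng.normal(scale=sd, size=(n, 2))

# Hypothetical three-circle benchmark: the radii and centers are invented
# for illustration; only the sample sizes (500 or 1000 points) and the
# noise levels (SD 0.01-0.05) come from the figures and tables.
rng = np.random.default_rng(1)
data = np.vstack([
    noisy_circle(167, center=(-1.2, 0.0), radius=0.5, sd=0.03, rng=rng),
    noisy_circle(167, center=(0.0, 0.0), radius=0.5, sd=0.03, rng=rng),
    noisy_circle(166, center=(1.2, 0.0), radius=0.5, sd=0.03, rng=rng),
])  # 500 points in total
```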
Table 1.  Results for the Average Relative Entropy method, using $ 500 $ and $ 1000 $ sample points. Each entry is the percentage, rounded to the nearest hundredth, of the $ 150 $ trials at the standard deviation shown in the left-hand column for which the ARE algorithm returned the given number $ \beta_0 $ of clusters

Average Relative Entropy Method, 500 sample points
SD | β0    1      2      3      4      5      6      7     8     9     ≥ 10
0.01       0      0      64     18.67  10.67  5.33   0.67  0.67  0     0
0.02       0      0      51.33  27.33  12.67  4      2     2     0.67  0
0.03       0      0      54.67  26.67  10.67  4      3.33  0.67  0     0
0.04       0.67   2      44     23.33  13.33  5.33   3.33  2.67  2     3.33
0.05       6      26     18     20     8.67   4.67   4.67  4.67  2     5.33

Average Relative Entropy Method, 1000 sample points
SD | β0    1      2      3      4      5      6      7     8     9     ≥ 10
0.01       0      0      62.67  26     7.33   0.67   2     1.33  0     0
0.02       0      0      59.33  22     8      9.33   0.67  0.67  0     0
0.03       0      0      23.33  28     22     10.67  4.67  1.33  0     10
0.04       0      0      4      9.33   12     2      5.33  4     5.33  58.01
0.05       0      0      1.33   5.33   2.67   4.67   2     5.33  2.67  76
Table 2.  Results for the Average Local Volume Ratio method, using $ 500 $ and $ 1000 $ sample points. Each entry is the percentage, rounded to the nearest hundredth, of the $ 150 $ trials at the standard deviation shown in the left-hand column for which the ALVR algorithm returned the given number $ \beta_0 $ of clusters

Average Local Volume Ratio Method, 500 sample points
SD | β0    1      2      3      4      5     6      7     8     9     ≥ 10
0.01       0      0      96     3.33   0.67  0      0     0     0     0
0.02       18     0      76     6      0     0      0     0     0     0
0.03       65.33  0      32     2.67   0     0      0     0     0     0
0.04       95.33  1.33   2.67   0.67   0     0      0     0     0     0
0.05       91.33  7.33   1.33   0      0     0      0     0     0     0

Average Local Volume Ratio Method, 1000 sample points
SD | β0    1      2      3      4      5     6      7     8     9     ≥ 10
0.01       0      0      98.67  1.33   0     0      0     0     0     0
0.02       0      0      89.33  8      1.33  1.33   0     0     0     0
0.03       0      0      45.33  28     16    6.67   2.67  1.33  0     0
0.04       1.33   0      13.33  21.33  26    11.33  12    6     3.33  0
0.05       75.33  11.33  2.67   4.67   1.33  2      0.67  1.33  0.67  5.38
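Each cell of Tables 1 and 2 is the share of the $ 150 $ trials at a fixed noise level that returned a given $ \beta_0 $, rounded to the nearest hundredth. A helper of the following form (a sketch, not the authors' tabulation code, with the hypothetical argument beta0_per_trial) would reproduce one row, including the tables' final $ \geq 10 $ bin:

```python
import numpy as np

def row_percentages(beta0_per_trial, max_col=10):
    """Turn the beta_0 reported by each trial into one table row:
    percentage columns for 1..9 clusters plus a final '>= 10' bin."""
    counts = np.zeros(max_col)
    for b in beta0_per_trial:
        counts[min(b, max_col) - 1] += 1   # clamp large beta_0 into the last bin
    return np.round(100.0 * counts / len(beta0_per_trial), 2)

# e.g. for 150 simulated trials at one noise level:
# print(row_percentages(trial_betas))  # one row in the format above
```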