
-
Previous Article
Eigenstructure assignment for polynomial matrix systems ensuring normalization and impulse elimination
- MFC Home
- This Issue
-
Next Article
Data modeling analysis on removal efficiency of hexavalent chromium
Using distribution analysis for parameter selection in repstream
Curtin University, Kent Street, Perth, Western Australia 6102, Australia |
One of the most significant challenges in data clustering is the evolution of the data distributions over time. Many clustering algorithms have been introduced to deal specifically with streaming data, but common amongst them is that they require users to set input parameters. These inform the algorithm about the criteria under which data points may be clustered together. Setting the initial parameters for a clustering algorithm is itself a non-trivial task, but the evolution of the data distribution over time could mean even optimally set parameters could become non-optimal as the stream evolves. In this paper we extend the RepStream algorithm, a combination graph and density-based clustering algorithm, in a way which allows the primary input parameter, the $ K $ value, to be automatically adjusted over time. We introduce a feature called the edge distribution score which we compute for data in memory, as well as introducing an incremental method for adjusting the $ K $ parameter over time based on this score. We evaluate our methods against RepStream itself, and other contemporary stream clustering algorithms, and show how our method of automatically adjusting the $ K $ value over time leads to higher quality clustering output even when the initial parameters are set poorly.
References:
[1] |
M. R. Ackermann, M. Märtens, C. Raupach, K. Swierkot, C. Lammersen and C. Sohler, Streamkm++: A clustering algorithm for data streams, Journal of Experimental Algorithmics, 17 (2012), Article 2.4, 30 pp.
doi: 10.1145/2133803.2184450. |
[2] |
C. Aggarwal, J. Han, J. Wang And P. Yu, A framework for projected clustering of high dimensional data streams, in Proceedings 2004 VLDB Conference, Elsevier, 2004, 852–863.
doi: 10.1016/B978-012088469-8.50075-9. |
[3] |
C. C. Aggarwal, P. S. Yu, J. Han and J. Wang, A framework for clustering evolving data streams, in Proceedings 2003 VLDB Conference, Elsevier, 2003, 81–92.
doi: 10.1016/B978-012722442-8/50016-1. |
[4] |
J. P. Barddal, H. M. Gomes and F. Enembreck, SNCStream: A social network-based data stream clustering algorithm, in Proceedings of the 30th Annual ACM Symposium on Applied Computing - SAC'15, SAC'15, ACM Press, New York, New York, USA, 2015, 935–940.
doi: 10.1145/2695664.2695674. |
[5] |
V. Bhatnagar, S. Kaur and S. Chakravarthy,
Clustering data streams using grid-based synopsis, Knowledge and Information Systems, 41 (2014), 127-152.
doi: 10.1007/s10115-013-0659-1. |
[6] |
J. A. Blackard and D. J. Dean, Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables, Computers and Electronics in Agriculture, 24 (1999), 131–151.
doi: 10.1016/S0168-1699(99)00046-0. |
[7] |
F. Cao, M. Estert, W. Qian and A. Zhou, Density-based clustering over an evolving data stream with noise, in Proceedings of the 2006 SIAM International Conference on Data Mining, Society for Industrial and Applied Mathematics, Philadelphia, PA, 2006, 328–339.
doi: 10.1137/1.9781611972764.29. |
[8] |
Y. Chen and L. Tu, Density-based clustering for real-time stream data, in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD'07, ACM, ACM Press, New York, New York, USA, 2007, 133–142.
doi: 10.1145/1281192.1281210. |
[9] |
A. Forestiero, C. Pizzuti and G. Spezzano,
A single pass algorithm for clustering evolving data streams based on swarm intelligence, Data Mining and Knowledge Discovery, 26 (2013), 1-26.
doi: 10.1007/s10618-011-0242-x. |
[10] |
J. Forrest, Stream: A framework for data stream modeling in r, Bachelor Thesis, Department of Computer Science and Engineering, SMU. |
[11] |
M. Hahsler and M. Bolaos, Clustering data streams based on shared density between micro-clusters, IEEE Transactions on Knowledge and Data Engineering, 28 (2016), 1449–1461, http://ieeexplore.ieee.org/document/7393836/.
doi: 10.1109/TKDE.2016.2522412. |
[12] |
S. Hettich and S. D. Bay, The UCI KDD Archive, http://kdd.ics.uci.edu, 1999. |
[13] |
A. K. Jain, M. N. Murty and P. J. Flynn,
Data clustering: A review, ACM computing surveys (CSUR), 31 (1999), 264-323.
doi: 10.1145/331499.331504. |
[14] |
S. Kaur, V. Bhatnagar and S. Chakravarthy, Stream clustering algorithms: A primer, in Big Data in Complex Systems, Springer, 9 (2015), 105–145, http://link.springer.com/10.1007/978-3-319-11056-1.
doi: 10.1007/978-3-319-11056-1_4. |
[15] |
H. Kremer, P. Kranen, T. Jansen, T. Seidl, A. Bifet, G. Holmes and B. Pfahringer, An effective evaluation measure for clustering on evolving data streams, in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2011, 868–876.
doi: 10.1145/2020408.2020555. |
[16] |
S. Lühr and M. Lazarescu, Incremental clustering of dynamic data streams using connectivity based representative points, Data & Knowledge Engineering, 68 (2009), 1–27, http://linkinghub.elsevier.com/retrieve/pii/S0169023X08001092. |
[17] |
J. Schneider and M. Vlachos, Fast parameterless density-based clustering via random projections, in Proceedings of the 22nd ACM International Conference on Conference on Information & Knowledge Management - CIKM'13, ACM, ACM Press, New York, New York, USA, 2013, 861–866.
doi: 10.1145/2505515.2505590. |
[18] |
N. Wattanakitrungroj, S. Maneeroj and C. Lursinsap, Bestream: Batch capturing with elliptic function for one-pass data stream clustering, Data & Knowledge Engineering, 117 (2018), 53–70, http://www.sciencedirect.com/science/article/pii/S0169023X17301027.
doi: 10.1016/j.datak.2018.07.002. |
[19] |
G. Widmer and M. Kubat,
Learning in the presence of concept drift and hidden contexts, Machine Learning, 23 (1996), 69-101.
doi: 10.1007/BF00116900. |
[20] |
X. Zhang, C. Furtlehner and M. Sebag, Data streaming with affinity propagation, in Proc. Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2008, 628–643.
doi: 10.1007/978-3-540-87481-2_41. |
[21] |
A. Zhou, F. Cao, W. Qian and C. Jin,
Tracking clusters in evolving data streams over sliding windows, Knowledge and Information Systems, 15 (2008), 181-214.
doi: 10.1007/s10115-007-0070-x. |
show all references
References:
[1] |
M. R. Ackermann, M. Märtens, C. Raupach, K. Swierkot, C. Lammersen and C. Sohler, Streamkm++: A clustering algorithm for data streams, Journal of Experimental Algorithmics, 17 (2012), Article 2.4, 30 pp.
doi: 10.1145/2133803.2184450. |
[2] |
C. Aggarwal, J. Han, J. Wang And P. Yu, A framework for projected clustering of high dimensional data streams, in Proceedings 2004 VLDB Conference, Elsevier, 2004, 852–863.
doi: 10.1016/B978-012088469-8.50075-9. |
[3] |
C. C. Aggarwal, P. S. Yu, J. Han and J. Wang, A framework for clustering evolving data streams, in Proceedings 2003 VLDB Conference, Elsevier, 2003, 81–92.
doi: 10.1016/B978-012722442-8/50016-1. |
[4] |
J. P. Barddal, H. M. Gomes and F. Enembreck, SNCStream: A social network-based data stream clustering algorithm, in Proceedings of the 30th Annual ACM Symposium on Applied Computing - SAC'15, SAC'15, ACM Press, New York, New York, USA, 2015, 935–940.
doi: 10.1145/2695664.2695674. |
[5] |
V. Bhatnagar, S. Kaur and S. Chakravarthy,
Clustering data streams using grid-based synopsis, Knowledge and Information Systems, 41 (2014), 127-152.
doi: 10.1007/s10115-013-0659-1. |
[6] |
J. A. Blackard and D. J. Dean, Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables, Computers and Electronics in Agriculture, 24 (1999), 131–151.
doi: 10.1016/S0168-1699(99)00046-0. |
[7] |
F. Cao, M. Estert, W. Qian and A. Zhou, Density-based clustering over an evolving data stream with noise, in Proceedings of the 2006 SIAM International Conference on Data Mining, Society for Industrial and Applied Mathematics, Philadelphia, PA, 2006, 328–339.
doi: 10.1137/1.9781611972764.29. |
[8] |
Y. Chen and L. Tu, Density-based clustering for real-time stream data, in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD'07, ACM, ACM Press, New York, New York, USA, 2007, 133–142.
doi: 10.1145/1281192.1281210. |
[9] |
A. Forestiero, C. Pizzuti and G. Spezzano,
A single pass algorithm for clustering evolving data streams based on swarm intelligence, Data Mining and Knowledge Discovery, 26 (2013), 1-26.
doi: 10.1007/s10618-011-0242-x. |
[10] |
J. Forrest, Stream: A framework for data stream modeling in r, Bachelor Thesis, Department of Computer Science and Engineering, SMU. |
[11] |
M. Hahsler and M. Bolaos, Clustering data streams based on shared density between micro-clusters, IEEE Transactions on Knowledge and Data Engineering, 28 (2016), 1449–1461, http://ieeexplore.ieee.org/document/7393836/.
doi: 10.1109/TKDE.2016.2522412. |
[12] |
S. Hettich and S. D. Bay, The UCI KDD Archive, http://kdd.ics.uci.edu, 1999. |
[13] |
A. K. Jain, M. N. Murty and P. J. Flynn,
Data clustering: A review, ACM computing surveys (CSUR), 31 (1999), 264-323.
doi: 10.1145/331499.331504. |
[14] |
S. Kaur, V. Bhatnagar and S. Chakravarthy, Stream clustering algorithms: A primer, in Big Data in Complex Systems, Springer, 9 (2015), 105–145, http://link.springer.com/10.1007/978-3-319-11056-1.
doi: 10.1007/978-3-319-11056-1_4. |
[15] |
H. Kremer, P. Kranen, T. Jansen, T. Seidl, A. Bifet, G. Holmes and B. Pfahringer, An effective evaluation measure for clustering on evolving data streams, in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2011, 868–876.
doi: 10.1145/2020408.2020555. |
[16] |
S. Lühr and M. Lazarescu, Incremental clustering of dynamic data streams using connectivity based representative points, Data & Knowledge Engineering, 68 (2009), 1–27, http://linkinghub.elsevier.com/retrieve/pii/S0169023X08001092. |
[17] |
J. Schneider and M. Vlachos, Fast parameterless density-based clustering via random projections, in Proceedings of the 22nd ACM International Conference on Conference on Information & Knowledge Management - CIKM'13, ACM, ACM Press, New York, New York, USA, 2013, 861–866.
doi: 10.1145/2505515.2505590. |
[18] |
N. Wattanakitrungroj, S. Maneeroj and C. Lursinsap, Bestream: Batch capturing with elliptic function for one-pass data stream clustering, Data & Knowledge Engineering, 117 (2018), 53–70, http://www.sciencedirect.com/science/article/pii/S0169023X17301027.
doi: 10.1016/j.datak.2018.07.002. |
[19] |
G. Widmer and M. Kubat,
Learning in the presence of concept drift and hidden contexts, Machine Learning, 23 (1996), 69-101.
doi: 10.1007/BF00116900. |
[20] |
X. Zhang, C. Furtlehner and M. Sebag, Data streaming with affinity propagation, in Proc. Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2008, 628–643.
doi: 10.1007/978-3-540-87481-2_41. |
[21] |
A. Zhou, F. Cao, W. Qian and C. Jin,
Tracking clusters in evolving data streams over sliding windows, Knowledge and Information Systems, 15 (2008), 181-214.
doi: 10.1007/s10115-007-0070-x. |

























Algorithm name | Density Parameters | Grid Granularity | Reclustering Method | Decay Parameter | Distance Parameter | Other Parameters |
CluStream | ||||||
SWClustering | ||||||
DenStream | ||||||
HPStream | ||||||
D-Stream | ||||||
ExCC | Grid Granularity in all dimensions | |||||
MR-Stream | Hierarchical | |||||
FlockStream | ||||||
DeBaRa | ||||||
DBStream | ||||||
SNCStream | ||||||
PASCAL | Evolutionary | Evolutionary EDA hyper-parameters | ||||
HASTREAM | Hierarchical | |||||
BEStream | ||||||
ADStream | ||||||
PatchWork | ||||||
RepStream |
Algorithm name | Density Parameters | Grid Granularity | Reclustering Method | Decay Parameter | Distance Parameter | Other Parameters |
CluStream | ||||||
SWClustering | ||||||
DenStream | ||||||
HPStream | ||||||
D-Stream | ||||||
ExCC | Grid Granularity in all dimensions | |||||
MR-Stream | Hierarchical | |||||
FlockStream | ||||||
DeBaRa | ||||||
DBStream | ||||||
SNCStream | ||||||
PASCAL | Evolutionary | Evolutionary EDA hyper-parameters | ||||
HASTREAM | Hierarchical | |||||
BEStream | ||||||
ADStream | ||||||
PatchWork | ||||||
RepStream |
Dataset | Best K | Best |
Worst K | Worst |
DS1 | 7 | 0.7208 | 18 | 0.2767 |
DS2 | 7 | 0.6371 | 21 | 0.2594 |
SynTest | 9 | 0.8614 | 5 | 0.5435 |
Closer | 9 | 0.7989 | 5 | 0.4345 |
TreeCov | 29 | 0.6108 | 5 | 0.2978 |
KDD99 | 30 | 0.7898 | 5 | 0.2636 |
Dataset | Best K | Best |
Worst K | Worst |
DS1 | 7 | 0.7208 | 18 | 0.2767 |
DS2 | 7 | 0.6371 | 21 | 0.2594 |
SynTest | 9 | 0.8614 | 5 | 0.5435 |
Closer | 9 | 0.7989 | 5 | 0.4345 |
TreeCov | 29 | 0.6108 | 5 | 0.2978 |
KDD99 | 30 | 0.7898 | 5 | 0.2636 |
Dataset | Best F-Measure | Worst F-Measure | Dynamic |
DS1 | 0.7208 | 0.2767 | 0.5153 |
DS2 | 0.6371 | 0.2594 | 0.4137 |
SynTest | 0.7989 | 0.5435 | 0.7091 |
Closer | 0.8614 | 0.4345 | 0.8410 |
TreeCov | 0.6108 | 0.2978 | 0.5920 |
KDD99 | 0.7898 | 0.2636 | 0.7882 |
Dataset | Best F-Measure | Worst F-Measure | Dynamic |
DS1 | 0.7208 | 0.2767 | 0.5153 |
DS2 | 0.6371 | 0.2594 | 0.4137 |
SynTest | 0.7989 | 0.5435 | 0.7091 |
Closer | 0.8614 | 0.4345 | 0.8410 |
TreeCov | 0.6108 | 0.2978 | 0.5920 |
KDD99 | 0.7898 | 0.2636 | 0.7882 |
[1] |
Jia Chen, Ioannis D. Schizas. Multimodal correlations-based data clustering. Foundations of Data Science, 2022 doi: 10.3934/fods.2022011 |
[2] |
Daniel Roggen, Martin Wirz, Gerhard Tröster, Dirk Helbing. Recognition of crowd behavior from mobile sensors with pattern analysis and graph clustering methods. Networks and Heterogeneous Media, 2011, 6 (3) : 521-544. doi: 10.3934/nhm.2011.6.521 |
[3] |
Baolan Yuan, Wanjun Zhang, Yubo Yuan. A Max-Min clustering method for $k$-means algorithm of data clustering. Journal of Industrial and Management Optimization, 2012, 8 (3) : 565-575. doi: 10.3934/jimo.2012.8.565 |
[4] |
Daniel Mckenzie, Steven Damelin. Power weighted shortest paths for clustering Euclidean data. Foundations of Data Science, 2019, 1 (3) : 307-327. doi: 10.3934/fods.2019014 |
[5] |
Michael Herty, Lorenzo Pareschi, Giuseppe Visconti. Mean field models for large data–clustering problems. Networks and Heterogeneous Media, 2020, 15 (3) : 463-487. doi: 10.3934/nhm.2020027 |
[6] |
Jinyuan Zhang, Aimin Zhou, Guixu Zhang, Hu Zhang. A clustering based mate selection for evolutionary optimization. Big Data & Information Analytics, 2017, 2 (1) : 77-85. doi: 10.3934/bdia.2017010 |
[7] |
Gurkan Ozturk, Mehmet Tahir Ciftci. Clustering based polyhedral conic functions algorithm in classification. Journal of Industrial and Management Optimization, 2015, 11 (3) : 921-932. doi: 10.3934/jimo.2015.11.921 |
[8] |
Lars Eirik Danielsen. Graph-based classification of self-dual additive codes over finite fields. Advances in Mathematics of Communications, 2009, 3 (4) : 329-348. doi: 10.3934/amc.2009.3.329 |
[9] |
Pengyu Yan, Shi Qiang Liu, Cheng-Hu Yang, Mahmoud Masoud. A comparative study on three graph-based constructive algorithms for multi-stage scheduling with blocking. Journal of Industrial and Management Optimization, 2019, 15 (1) : 221-233. doi: 10.3934/jimo.2018040 |
[10] |
Jiangchuan Fan, Xinyu Guo, Jianjun Du, Weiliang Wen, Xianju Lu, Brahmani Louiza. Analysis of the clustering fusion algorithm for multi-band color image. Discrete and Continuous Dynamical Systems - S, 2019, 12 (4&5) : 1233-1249. doi: 10.3934/dcdss.2019085 |
[11] |
Antonio Rieser. A topological approach to spectral clustering. Foundations of Data Science, 2021, 3 (1) : 49-66. doi: 10.3934/fods.2021005 |
[12] |
Yongbin Ou, Cun-Quan Zhang. A new multimembership clustering method. Journal of Industrial and Management Optimization, 2007, 3 (4) : 619-624. doi: 10.3934/jimo.2007.3.619 |
[13] |
Baoli Shi, Zhi-Feng Pang, Jing Xu. Image segmentation based on the hybrid total variation model and the K-means clustering strategy. Inverse Problems and Imaging, 2016, 10 (3) : 807-828. doi: 10.3934/ipi.2016022 |
[14] |
Elissar Nasreddine. Two-dimensional individual clustering model. Discrete and Continuous Dynamical Systems - S, 2014, 7 (2) : 307-316. doi: 10.3934/dcdss.2014.7.307 |
[15] |
Elissar Nasreddine. Well-posedness for a model of individual clustering. Discrete and Continuous Dynamical Systems - B, 2013, 18 (10) : 2647-2668. doi: 10.3934/dcdsb.2013.18.2647 |
[16] |
Haixia Liu, Jian-Feng Cai, Yang Wang. Subspace clustering by (k,k)-sparse matrix factorization. Inverse Problems and Imaging, 2017, 11 (3) : 539-551. doi: 10.3934/ipi.2017025 |
[17] |
Guojun Gan, Qiujun Lan, Shiyang Sima. Scalable clustering by truncated fuzzy $c$-means. Big Data & Information Analytics, 2016, 1 (2&3) : 247-259. doi: 10.3934/bdia.2016007 |
[18] |
Adela DePavia, Stefan Steinerberger. Spectral clustering revisited: Information hidden in the Fiedler vector. Foundations of Data Science, 2021, 3 (2) : 225-249. doi: 10.3934/fods.2021015 |
[19] |
Pawan Lingras, Farhana Haider, Matt Triff. Fuzzy temporal meta-clustering of financial trading volatility patterns. Big Data & Information Analytics, 2018 doi: 10.3934/bdia.2017018 |
[20] |
Guojun Gan, Kun Chen. A soft subspace clustering algorithm with log-transformed distances. Big Data & Information Analytics, 2016, 1 (1) : 93-109. doi: 10.3934/bdia.2016.1.93 |
Impact Factor:
Tools
Metrics
Other articles
by authors
[Back to Top]