January 2016, 1(1): 93-109. doi: 10.3934/bdia.2016.1.93

A soft subspace clustering algorithm with log-transformed distances

1. Department of Mathematics, University of Connecticut, 196 Auditorium Rd, Storrs, CT 06269-3009, United States

2. Department of Statistics, University of Connecticut, 215 Glenbrook Road, Storrs, CT 06269, United States

Received May 2015; Revised August 2015; Published September 2015

Entropy weighting used in some soft subspace clustering algorithms is sensitive to the scaling parameter. In this paper, we propose a novel soft subspace clustering algorithm that uses log-transformed distances in the objective function. Because the entropy weighting in the proposed algorithm is less sensitive to the scaling parameter, users can choose a value for it more easily. In addition, the proposed algorithm is less sensitive to noise because a point far away from its cluster center is given a small weight in the cluster center calculation. Experiments on both synthetic and real datasets demonstrate the performance of the proposed algorithm.
Citation: Guojun Gan, Kun Chen. A soft subspace clustering algorithm with log-transformed distances. Big Data & Information Analytics, 2016, 1 (1) : 93-109. doi: 10.3934/bdia.2016.1.93
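
The abstract does not state the formulas, so the following is only a rough sketch of the idea under stated assumptions, not the authors' implementation. It assumes that entropy weights take the usual softmax form, proportional to exp(-D_j / gamma) for a per-dimension dispersion D_j and a scaling parameter here called gamma; that the log transform is ln(1 + squared distance); and that the center update weights each point by 1 / (1 + squared distance). All function and variable names are hypothetical.

```python
import numpy as np

def entropy_weights(dispersion, gamma):
    """Softmax-style entropy weights: w_j proportional to exp(-D_j / gamma)."""
    scaled = -np.asarray(dispersion, dtype=float) / gamma
    scaled -= scaled.max()  # shift for numerical stability; result is unchanged
    w = np.exp(scaled)
    return w / w.sum()

def dispersions(points, center, log_transform):
    """Per-dimension dispersion of one cluster around its center."""
    sq = (points - center) ** 2
    if log_transform:
        sq = np.log1p(sq)  # ASSUMED transform: ln(1 + squared distance)
    return sq.sum(axis=0)

def reweighted_center(points, center):
    """One ASSUMED center update: each point weighted by 1 / (1 + squared distance)."""
    weights = 1.0 / (1.0 + (points - center) ** 2)
    return (weights * points).sum(axis=0) / weights.sum(axis=0)

rng = np.random.default_rng(0)
# Synthetic cluster: dimension 0 is tight, dimension 1 is widely spread.
points = rng.normal(loc=0.0, scale=[0.5, 5.0], size=(50, 2))
center = points.mean(axis=0)

# Compare how the weights react to the scaling parameter with and without the log.
for gamma in (1.0, 10.0, 100.0):
    w_raw = entropy_weights(dispersions(points, center, False), gamma)
    w_log = entropy_weights(dispersions(points, center, True), gamma)
    print(f"gamma={gamma:6.1f}  raw={w_raw.round(3)}  log={w_log.round(3)}")
```

Under these assumptions, the log transform compresses large dispersions, so the gap between dimensions that is fed into the exponential is much smaller, and the weights should change more gradually as gamma varies. The same compression is why a distant point contributes little to `reweighted_center`.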