SUBSPACE CLUSTERING BY (k, k)-SPARSE MATRIX FACTORIZATION

Abstract. High-dimensional data often lie in low-dimensional subspaces instead of the whole space. Subspace clustering is the problem of analyzing data drawn from multiple low-dimensional subspaces and clustering the data into the corresponding subspaces. In this work, we propose a (k, k)-sparse matrix factorization method for subspace clustering. In this method, the data itself is considered as the "dictionary", and each data point is represented as a linear combination of the basis of its cluster in the dictionary. Thus, the coefficient matrix is low-rank and sparse; with an appropriate permutation, it is also block-diagonal, with each block corresponding to a cluster. Under the assumption that each block is no more than k-by-k in matrix recovery, we seek a low-rank and (k, k)-sparse coefficient matrix, which is then used to construct the affinity matrix in spectral clustering. The advantage of our proposed method is that we recover a coefficient matrix that is (k, k)-sparse and low-rank simultaneously, which better fits subspace clustering. Numerical results illustrate its effectiveness: it outperforms SSC and LRR on real-world classification problems such as face clustering and motion segmentation.


1. Introduction. There is a huge amount of data generated and collected in our daily life. Despite their different sources and natures, these seemingly structureless data often possess good structure. One commonly used structure is that high-dimensional data belonging to the same category lie in a low-dimensional subspace instead of being arbitrarily distributed in the whole space. For example, handwritten images of the same digit with different rotations, translations, or thicknesses lie approximately in a low-dimensional subspace. Other examples are human face image data, where the face images of each human subject under different illuminations form a low-dimensional subspace, and video sequences with multiple moving objects, where each moving object across frames belongs to a low-dimensional subspace. When multiple categories are present, we have a data set that is a union of low-dimensional subspaces, one category corresponding to each subspace. The problem of subspace clustering refers to the extraction of the low-dimensional subspaces from such a data set with multiple categories. Many applications can be formulated as subspace clustering problems, including face clustering and motion segmentation in computer vision, community clustering in social networks, and so on.
In spectral clustering-based methods, an affinity matrix W is constructed with the (i, j)-th entry measuring the similarity between the i-th and j-th data points. One of the challenges in applying spectral clustering-based methods is how to choose the affinity matrix W. Usually, W is constructed from the coefficient matrix obtained when the data set is represented as a linear combination of itself. Due to the union-of-subspaces structure, the coefficient matrix possesses both low-rank and sparsity properties, which are exploited in some state-of-the-art subspace clustering methods. In sparse subspace clustering (SSC) [8], a sparse coefficient matrix is assumed and sought. Low-rank representation (LRR) [14] tries to find a low-rank coefficient matrix. Although SSC and LRR are both successful in subspace clustering, they are problematic under some circumstances, because only one of the sparsity and low-rank properties is used and the other is ignored. LRR has not been shown to succeed without the restrictive assumption of "independent subspaces" [27]. SSC obtains an over-sparse coefficient matrix, based on which the affinity graph may not be connected for the data points from the same cluster [27, 16]. Moreover, numerical experiments on the Hopkins155 data show that the instances where SSC fails are often different from those where LRR fails [27].
To fix these problems, one may consider both the low-rank and the sparsity properties of the coefficient matrix. For this purpose, it is proposed in [27] to minimize a weighted sum of the nuclear norm and the ℓ1-norm of the coefficient matrix. Since the nuclear norm promotes low rank and the ℓ1-norm promotes sparsity, it is expected that the coefficient matrix is simultaneously low-rank and sparse. However, it was shown in [18] that, when a matrix is jointly low-rank and sparse, minimizing a weighted sum of the nuclear norm and the ℓ1-norm offers little improvement over minimizing just one of the two norms.
We therefore seek new regularization terms that promote the simultaneously low-rank and sparse structure. In particular, the coefficient matrix in subspace clustering is jointly block-sparse and low-rank: it is a linear combination of a few block-sparse, rank-1 matrices. Based on this observation, atomic norm [4] minimization is proposed in [22] to find the coefficient matrix. Unfortunately, the atomic norm optimization for jointly block-sparse and low-rank recovery is NP-hard [22]. In this paper, we propose an algorithm to approximate the solution of the atomic norm minimization. Our algorithm is based on a low-rank factorization of the coefficient matrix in which the factors are sparse. In this way, we obtain a jointly sparse and low-rank coefficient matrix, which leads to effective subspace clustering. Numerical experiments demonstrate that our algorithm outperforms state-of-the-art subspace clustering algorithms.
The rest of this paper is organized as follows. In Section 2, we present the formulation of the atomic norm for jointly block-sparse and low-rank matrices. In Section 3, a new algorithm, called (k, k)-sparse matrix factorization, is proposed for matrix recovery from corrupted data; it is applied to spectral-based subspace clustering in Section 4. Numerical results on two real-world clustering problems, face clustering and motion segmentation, are given in Section 5. Finally, the paper is concluded in Section 6.
2. Jointly (k, k)-sparse and low-rank matrices and subspace clustering. In this section, we present the atomic norm minimization for the construction of the coefficient matrix for subspace clustering.
2.1. The atomic norm. Our goal is to cluster a collection of data points drawn from a union of low-dimensional subspaces. Let X = [x_1, x_2, · · · , x_n] ∈ R^{p×n} be the data matrix, where each data point x_i ∈ R^p, i = 1, · · · , n, is drawn from a union of t (t is unknown) linear subspaces {S_i}_{i=1}^t, with the dimension of each subspace, d_i = dim(S_i), i = 1, · · · , t, also unknown. We assume that the dimension of each subspace is low enough that it is smaller than the number of data points in that subspace. Then, each data point can be represented as a linear combination of data points from its cluster. Therefore, when each data point is represented as a linear combination of the other data points in the same set, the coefficient matrix is block-diagonal under an appropriate permutation, and each diagonal block corresponds to a subspace. Moreover, since each subspace is low-dimensional, each diagonal block is also of low rank. More precisely, we use Z to denote the coefficient matrix of X under the linear representation of itself, i.e.,
(1) X = XZ.
Then, in the ideal case, Z is a block-diagonal low-rank matrix after suitable permutation. We assume that each nonzero block is no more than a k-by-k submatrix.
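The self-expressive property X = XZ and the resulting block-diagonal coefficient matrix can be illustrated with a small numpy sketch. Everything here (the data sizes, and the construction of Z by within-cluster least squares) is synthetic and for illustration only; the paper's method does not assume cluster memberships are known.

```python
import numpy as np

rng = np.random.default_rng(0)
p, d, m = 30, 3, 20  # ambient dim, subspace dim, points per subspace

# Two low-dimensional subspaces; columns of X are the data points.
bases = [np.linalg.qr(rng.standard_normal((p, d)))[0] for _ in range(2)]
X = np.hstack([B @ rng.standard_normal((d, m)) for B in bases])  # p x 2m

# Represent each point using only the other points from its own cluster
# (least squares).  The resulting Z is block-diagonal by construction.
n = 2 * m
Z = np.zeros((n, n))
for j in range(n):
    block = range(m * (j // m), m * (j // m + 1))
    idx = [i for i in block if i != j]
    coef, *_ = np.linalg.lstsq(X[:, idx], X[:, j], rcond=None)
    Z[idx, j] = coef

print(np.linalg.norm(X - X @ Z))   # ~0: X = XZ holds
print(np.linalg.norm(Z[:m, m:]))   # 0: the off-diagonal block is empty
```

Since each point lies in the span of the other points of its cluster, the least-squares residual vanishes and the support of Z never crosses cluster boundaries.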
In the following, we call a matrix (k, k)-sparse if it is block-diagonal under an appropriate permutation and each nonzero block is no more than a k-by-k submatrix.
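The definition can be made operational: the nonzero pattern of a (k, k)-sparse matrix, viewed as a bipartite graph between row and column indices, decomposes into connected components that each touch at most k rows and at most k columns. The following `is_kk_sparse` helper is hypothetical (not from the paper), but checks exactly this, permutation included.

```python
import numpy as np
from collections import deque

def is_kk_sparse(M, k, tol=0.0):
    """Check whether M is (k, k)-sparse: each connected component of the
    bipartite support graph (rows vs. columns joined by nonzero entries)
    must touch at most k rows and at most k columns."""
    n_rows, n_cols = M.shape
    support = np.abs(M) > tol
    seen_r = [False] * n_rows
    seen_c = [False] * n_cols
    for start in range(n_rows):
        if seen_r[start]:
            continue
        seen_r[start] = True
        rows, cols = 1, 0
        queue = deque([("row", start)])
        while queue:  # breadth-first search over one component
            kind, i = queue.popleft()
            neighbors = np.nonzero(support[i, :] if kind == "row" else support[:, i])[0]
            for j in neighbors:
                if kind == "row" and not seen_c[j]:
                    seen_c[j] = True
                    cols += 1
                    queue.append(("col", j))
                elif kind == "col" and not seen_r[j]:
                    seen_r[j] = True
                    rows += 1
                    queue.append(("row", j))
        if rows > k or cols > k:
            return False
    return True

# Two 2x2 blocks of ones: (2,2)-sparse but not (1,1)-sparse,
# and still (2,2)-sparse after any symmetric permutation.
B = np.kron(np.eye(2), np.ones((2, 2)))
perm = [2, 0, 3, 1]
print(is_kk_sparse(B, 2), is_kk_sparse(B, 1), is_kk_sparse(B[np.ix_(perm, perm)], 2))
```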
To promote the (k, k)-sparse and low-rank properties simultaneously, we use the atomic norm [4, 22] minimization. Let
A_{k,k} = { uv^T : u, v ∈ R^n, ||u||_0 ≤ k, ||v||_0 ≤ k, ||u||_2 = ||v||_2 = 1 }.
In other words, A_{k,k} is a dictionary of infinitely many atoms, and each atom in A_{k,k} is a rank-1 matrix that contains at most k nonzero rows and k nonzero columns. The coefficient matrix Z can be represented as a linear combination of a small number of elements in A_{k,k},
(2) Z = Σ_{i=1}^r c_i A_i, A_i ∈ A_{k,k}, c_i > 0,
where r is the rank of Z. That is, Z has a sparse representation under the dictionary A_{k,k}. For a given arbitrary matrix M, its sparsity under the dictionary A_{k,k} is defined as
T^{(0)}_{k,k}(M) = min { r : M = Σ_{i=1}^r c_i A_i, A_i ∈ A_{k,k}, c_i > 0 }.
In subspace clustering, we want to find the coefficient matrix Z such that T^{(0)}_{k,k}(Z) is as small as possible subject to suitable constraints. That is, we seek a coefficient matrix Z that uses as few atoms of A_{k,k} as possible. However, the function T^{(0)}_{k,k} is non-convex and discontinuous. Its best convex relaxation is the atomic norm [4, 22]
(3) T_{k,k}(M) = inf { Σ_i c_i : M = Σ_i c_i A_i, A_i ∈ A_{k,k}, c_i ≥ 0 }.
Instead of the minimization of T^{(0)}_{k,k}, the minimization of T_{k,k} is sought for the jointly (k, k)-sparse and low-rank coefficient matrix Z.

2.2. Robust recovery to noise or outliers.
In applications, the data points are often corrupted by noise and possibly by outliers. To be precise, let Y = [y_1, y_2, · · · , y_n] be the data matrix, where each y_i ∈ R^p, i = 1, · · · , n, is a corrupted data point with y_i = x_i + ξ_i, where x_i is the clean data point and ξ_i is the corresponding error vector. Or, in matrix notation,
(4) Y = X + Ξ, where Ξ = [ξ_1, ξ_2, · · · , ξ_n].
According to Equations (1) and (4), we have
(5) Y = YZ + E, where E = Ξ − ΞZ.
We know the coefficient matrix Z ∈ R^{n×n} is jointly low-rank and (k, k)-sparse. If we assume Ξ is sparse, then E is also sparse; this strategy was also used in [8, 28]. Therefore,
(6) Y = YZ + E,
where Z is the coefficient matrix and E is the sparse error matrix. Hence, we can recover the jointly (k, k)-sparse and low-rank coefficient matrix via the following convex optimization:
(7) min_{Z,E} T_{k,k}(Z) + μ_e ||E||_1 subject to Y = YZ + E.

3. Matrix recovery by (k, k)-sparse matrix factorization. In this section, we present an algorithm based on matrix factorization to approximate a solution of (7).
3.1. Matrix factorization approximation of the atomic norm. In this section, we present an approximation of (7). Our starting point is the following theorem, which is a corollary of Theorem 11 in [22].
Theorem 3.1 (a corollary of [22, Theorem 11]). Let Z ∈ R^{n×n} be a matrix. Then
(8) T_{k,k}(Z) = inf { (1/2) Σ_i ( (||u_i||^sp_k)² + (||v_i||^sp_k)² ) : Z = Σ_i u_i v_i^T }.
Here ||·||^sp_k is the k-support norm [2], defined as follows:
||w||^sp_k = min { Σ_{I ∈ G_k} ||w_I||_2 : supp(w_I) ⊆ I, Σ_{I ∈ G_k} w_I = w },
where G_k denotes the set of all subsets of {1, 2, · · · , n} of cardinality at most k and supp(w_I) is the support of the vector w_I.

Inverse Problems and Imaging, Volume 11, No. 3 (2017), 539-551
Proof. By [22, Theorem 11],
(9) T_{k,k}(Z) = inf { Σ_i ||u_i||^sp_k ||v_i||^sp_k : Z = Σ_i u_i v_i^T }.
Let the infimum on the right-hand side of (8) be S_1(Z), and that of (9) be S_2(Z). According to the Cauchy-Schwarz inequality, ||u_i||^sp_k ||v_i||^sp_k ≤ (1/2)((||u_i||^sp_k)² + (||v_i||^sp_k)²), so S_2(Z) ≤ S_1(Z). It remains to prove S_1(Z) ≤ S_2(Z). For a given arbitrary precision ε > 0, let u_i and v_i be fixed such that Z = Σ_i u_i v_i^T and Σ_i ||u_i||^sp_k ||v_i||^sp_k ≤ S_2(Z) + ε. Define ũ_i = (||v_i||^sp_k / ||u_i||^sp_k)^{1/2} u_i and ṽ_i = (||u_i||^sp_k / ||v_i||^sp_k)^{1/2} v_i. It can be easily checked that Z = Σ_i ũ_i ṽ_i^T and ||ũ_i||^sp_k = ||ṽ_i||^sp_k, so
(1/2) Σ_i ((||ũ_i||^sp_k)² + (||ṽ_i||^sp_k)²) = Σ_i ||u_i||^sp_k ||v_i||^sp_k ≤ S_2(Z) + ε.
Hence S_1(Z) ≤ S_2(Z) + ε, and since ε > 0 is arbitrary, S_1(Z) ≤ S_2(Z).

Note that the number of pairs u_i, v_i is not fixed in (8). If we fix this number to s, then we get an approximation of T_{k,k} as follows:
(10) T^s_{k,k}(Z) = min { (1/2) Σ_{i=1}^s ((||u_i||^sp_k)² + (||v_i||^sp_k)²) : Z = UV^T, U = [u_1, · · · , u_s], V = [v_1, · · · , v_s] }.
Instead of (7), we solve its approximation
(11) min_{U,V,E} (1/2) Σ_{i=1}^s ((||u_i||^sp_k)² + (||v_i||^sp_k)²) + μ_e ||E||_1 subject to Y = Y(UV^T) + E.
Model (11) finds a (k, k)-sparse and low-rank representation by a matrix factorization method. In the following, we call this method (k, k)-sparse matrix factorization.
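The rescaling step in the proof can be checked numerically. Here the Euclidean norm stands in for the k-support norm, which is an assumption of the sketch; the identity only uses absolute homogeneity, so it holds for any norm.

```python
import numpy as np

rng = np.random.default_rng(1)
u, v = rng.standard_normal(6), rng.standard_normal(6)
norm = np.linalg.norm  # any absolutely homogeneous norm works here

# Rescale so both factors have equal norm; the outer product is unchanged.
t = np.sqrt(norm(v) / norm(u))
u_t, v_t = t * u, v / t

assert np.allclose(np.outer(u_t, v_t), np.outer(u, v))  # same rank-1 matrix
assert np.isclose(norm(u_t), norm(v_t))                 # equal norms
# Half-sum of squares now equals the product of the original norms:
assert np.isclose(0.5 * (norm(u_t) ** 2 + norm(v_t) ** 2), norm(u) * norm(v))
print("rescaling identity verified")
```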

3.2. Alternating direction method of multipliers (ADMM). In order to solve (11), we use the alternating direction method of multipliers (ADMM) [19]. By introducing an auxiliary variable Z, (11) is converted to the equivalent problem
(12) min_{Z,U,V,E} (1/2) Σ_{i=1}^s ((||u_i||^sp_k)² + (||v_i||^sp_k)²) + μ_e ||E||_1 subject to Z = UV^T, Y = YZ + E.
Then the augmented Lagrangian function of (12) is
(13) L(Z, U, V, E, λ_1, λ_2) = (1/2) Σ_{i=1}^s ((||u_i||^sp_k)² + (||v_i||^sp_k)²) + μ_e ||E||_1 + ⟨λ_1, Z − UV^T⟩ + (μ_1/2)||Z − UV^T||²_F + ⟨λ_2, Y − YZ − E⟩ + (μ_2/2)||Y − YZ − E||²_F.
ADMM minimizes (13) with respect to one block of variables at a time and then updates the multipliers:
1. Update Z:
(14) Z ← argmin_Z (μ_1/2)||Z − UV^T + λ_1/μ_1||²_F + (μ_2/2)||Y − YZ − E + λ_2/μ_2||²_F.
2. Update E and U, V, respectively:
(15) E ← argmin_E μ_e ||E||_1 + (μ_2/2)||Y − YZ − E + λ_2/μ_2||²_F,
(16) (U, V) ← argmin_{U,V} (1/2) Σ_{i=1}^s ((||u_i||^sp_k)² + (||v_i||^sp_k)²) + (μ_1/2)||Z − UV^T + λ_1/μ_1||²_F.
3. Update λ_1 and λ_2:
(17) λ_1 ← λ_1 + μ_1 (Z − UV^T),
(18) λ_2 ← λ_2 + μ_2 (Y − YZ − E).
Subproblem (14) is a least squares problem and can be solved efficiently by linear system solvers. The solution of (15) is given by entrywise soft-thresholding. However, subproblem (16) is not trivial to solve, and we employ an iterative method that will be discussed in Section 3.3 in detail. Algorithm 1 outlines the whole algorithm for solving (11) by ADMM.
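The entrywise soft-thresholding that solves the E-update (15) can be sketched in a few lines. The helper name `soft_threshold` is illustrative, and identifying the threshold with the ratio of the ℓ1 weight to the quadratic penalty parameter is an assumption of this sketch.

```python
import numpy as np

def soft_threshold(A, tau):
    """Entrywise soft-thresholding: the proximity operator of tau * ||.||_1."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

A = np.array([[1.5, -0.3], [0.05, -2.0]])
E = soft_threshold(A, 0.5)
print(E)  # large entries shrink toward zero by 0.5; small entries become exactly 0
```

Because the ℓ1 norm separates over entries, the matrix problem reduces to independent scalar problems, each solved by this shrinkage.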

3.3. (k, k)-sparse matrix factorization.
In this subsection, we discuss efficient numerical solvers for subproblem (16) in ADMM. In particular, proximal alternating linearized minimization (PALM) [3] is applied. For simplicity, we denote Z̃ = Z + λ_1/μ_1. In (16), we find two matrices U ∈ R^{n×s} and V ∈ R^{n×s} that satisfy Z̃ ≈ UV^T and UV^T is a (k, k)-sparse matrix, where s is a fixed integer approximating the rank. If we further define E(U, V) = (1/2)||Z̃ − UV^T||²_F, then (16) is rewritten as
(19) [U, V] = argmin_{U,V} E(U, V) + (1/(2α)) Σ_{i=1}^s ((||u_i||^sp_k)² + (||v_i||^sp_k)²),
where α := μ_1. Note that (19) is a non-convex optimization problem with respect to U and V; finding the global minimum is generally challenging. Nevertheless, one standard approach for finding a local minimum of (19) is the Gauss-Seidel iteration, also known as alternating minimization or block coordinate descent:
(20) U^{m+1} = argmin_U E(U, V^m) + (1/(2α)) Σ_{i=1}^s (||u_i||^sp_k)², V^{m+1} = argmin_V E(U^{m+1}, V) + (1/(2α)) Σ_{i=1}^s (||v_i||^sp_k)².
Unfortunately, the subproblems involved in (20) do not have closed-form solutions, which means that nested iterative algorithms are needed. Furthermore, a necessary condition for the convergence of alternating minimization is that the minimum in each step is uniquely attained [29]; otherwise, it is possible that the method cycles without convergence [20].
If each subproblem in (20) is approximately solved by one step of the proximal forward-backward iteration [19], then we get the proximal alternating linearized minimization (PALM) algorithm [3] as follows:
(21) U^{m+1} = prox_{(1/(2αζ))(||·||^sp_k)²}( U^m − (1/ζ)(U^m (V^m)^T − Z̃) V^m ),
V^{m+1} = prox_{(1/(2αξ))(||·||^sp_k)²}( V^m − (1/ξ)(V^m (U^{m+1})^T − Z̃^T) U^{m+1} ),
where the proximity operator is applied column-wise and prox_{(1/(2α))(||·||^sp_k)²} is the proximity operator of (1/(2α))(||·||^sp_k)², as follows:
(22) prox_{(1/(2α))(||·||^sp_k)²}(h) = argmin_p (1/2)||p − h||²_2 + (1/(2α))(||p||^sp_k)².
It is shown in [3] that (21) converges to a stationary point of (19), provided that the step sizes ζ and ξ satisfy certain constraints.
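A runnable sketch of the PALM iteration (21) follows. It is not the paper's algorithm: a plain ridge (squared Frobenius) penalty stands in for the squared k-support norms, so the prox step is a simple scaling rather than Algorithm 2, and the step sizes follow the usual rule ζ > ||V||₂² (the Lipschitz constant of ∇_U E). The function name `palm_factorize` and all parameter values are illustrative.

```python
import numpy as np

def palm_factorize(Z_tilde, s, alpha=100.0, iters=300):
    """PALM sketch for min_{U,V} 0.5*||Z_tilde - U V^T||_F^2
       + (1/(2*alpha)) * (||U||_F^2 + ||V||_F^2).
    NOTE: the ridge penalty is a stand-in for the squared k-support norms
    of the columns; only the prox step would differ."""
    # SVD-based initialization, mirroring the initialization described below.
    P, sig, Qt = np.linalg.svd(Z_tilde, full_matrices=False)
    U = P[:, :s] * np.sqrt(sig[:s])
    V = Qt[:s].T * np.sqrt(sig[:s])
    for _ in range(iters):
        zeta = 1.1 * np.linalg.norm(V, 2) ** 2 + 1e-8   # > Lipschitz const. of grad_U
        U = U - (U @ V.T - Z_tilde) @ V / zeta          # forward (gradient) step
        U = U * (alpha * zeta) / (alpha * zeta + 1.0)   # backward (prox) step
        xi = 1.1 * np.linalg.norm(U, 2) ** 2 + 1e-8
        V = V - (U @ V.T - Z_tilde).T @ U / xi
        V = V * (alpha * xi) / (alpha * xi + 1.0)
    return U, V

rng = np.random.default_rng(2)
Z = rng.standard_normal((20, 2)) @ rng.standard_normal((2, 20))  # exact rank 2
U, V = palm_factorize(Z, s=2)
print(np.linalg.norm(Z - U @ V.T) / np.linalg.norm(Z))  # small relative error
```

With the true k-support prox plugged in, the same loop would promote column sparsity of U and V in addition to the low rank enforced by s.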
In order to get a practical algorithm, it remains to find an explicit expression of the proximity operator. To this end, let G = [g_1, · · · , g_s] and H = [h_1, · · · , h_s] in (22). Since the objective in (22) is separable in the columns, we have g_i = p_i, where
p_i = argmin_p (1/2)||p − h_i||²_2 + (1/(2α))(||p||^sp_k)², for i = 1, . . . , s.
A closed-form formula of p_i for a given h_i is provided by [2], which is outlined in Algorithm 2. For simplicity, we drop the index i in the vectors h_i and p_i.
Since (19) is a non-convex optimization, the limit of our algorithm (21) depends on the initial guess. In general, the initialization U = 0 and V = 0 is not good. Instead, we initialize U and V as follows. When (21) is called in the first iteration of Algorithm 1, we initialize U and V with U = P_s Σ_s^{1/2}, V = Q_s Σ_s^{1/2}, where (P_s, Σ_s, Q_s) are the singular value triplets corresponding to the largest s singular values of Z̃. In the subsequent iterations of Algorithm 1, we use the U, V calculated in the previous iteration as the initial guess of the current iteration, and we terminate the algorithm when a prescribed stopping criterion is met.

Data: h ∈ R^n and the parameter α.
Result: p = argmin_p (1/2)||p − h||²_2 + (1/(2α))(||p||^sp_k)².
1. Sort |h| in non-increasing order, so that h̄_i is the i-th largest element of |h|. Let Π be the permutation matrix such that h̄ = Π|h|. For simplicity, define h̄_0 := +∞, h̄_{n+1} := −∞, and γ_{r,l} := Σ_{i=k−r}^{l} h̄_i.
2. Find r ∈ {0, · · · , k − 1} and l ∈ {k, · · · , n} such that the conditions given in [2] are satisfied.

Algorithm 2: The algorithm for the proximity operator of (1/(2α))(||·||^sp_k)² with input h.

4. Spectral clustering framework.
In this section, we present a framework of spectral subspace clustering based on the coefficient matrix C = U V T found by Algorithm 1 that solves (11).
As stated in the introduction, spectral clustering is a popular and powerful tool for subspace clustering. In spectral clustering-based methods, an affinity matrix W ∈ R^{n×n} is constructed with the (i, j)-th entry measuring the similarity between the i-th and j-th data points. First of all, we build a graph G = (V, E, W), with V being the set of nodes and E being the set of edges; W is the affinity matrix whose entries are the weights of the edges connecting pairs of nodes, and a weight equals 0 if there is no edge between the two nodes. In an ideal graph, nodes from the same cluster are connected together and there are no edges between nodes from different clusters. One of the main challenges in applying spectral clustering-based methods is how to choose the affinity matrix.
We use the coefficient matrix C = UV^T produced by Algorithm 1 to construct the affinity matrix. Recall that Algorithm 1 solves (11), where we use the data itself as the dictionary matrix, i.e., each sample is represented as a linear combination of the data points themselves. Moreover, we assume that each data point can be represented as a linear combination of the basis of the data points from its own cluster. Ideally, for the coefficient matrix C = UV^T, the (i, j)-th entry c_{ij} of C is nonzero if the i-th and j-th data points lie in the same cluster, and c_{ij} = 0 otherwise; under an appropriate permutation, the matrix C is block-diagonal. We construct the affinity matrix W from the coefficient matrix C for spectral clustering. Note that the coefficient matrix may not be symmetric, and some of its entries may be negative. In our experiments, the affinity matrix is chosen as W = |C| + |C^T|, with w_{ij} = |c_{ij}| + |c_{ji}|. Therefore, W is symmetric with nonnegative entries. Ideally, W is a block-diagonal matrix under an appropriate permutation, with each block associated with one cluster.
Based on the affinity matrix W, we can apply normalized spectral clustering [17] for subspace clustering. More specifically, we compute the eigenvectors x_1, x_2, · · · , x_t of the normalized Laplacian L_sym corresponding to the t smallest eigenvalues (t is the number of clusters); then we normalize the rows of [x_1, · · · , x_t] to unit norm to get a matrix Φ = [φ_1^T, φ_2^T, · · · , φ_n^T]^T; finally, we cluster {φ_j}_{j=1}^n by K-means.
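The steps above can be sketched end-to-end. The function name `spectral_cluster` is illustrative, and the K-means step is a tiny deterministic Lloyd loop (farthest-point initialization) only to keep the example self-contained; in practice a library routine would be used.

```python
import numpy as np

def spectral_cluster(C, t, n_iter=30):
    """Normalized spectral clustering on W = |C| + |C^T| (a sketch)."""
    W = np.abs(C) + np.abs(C.T)
    d = np.maximum(W.sum(axis=1), 1e-12)
    D = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(W)) - D @ W @ D          # normalized Laplacian
    vals, vecs = np.linalg.eigh(L_sym)          # ascending eigenvalues
    Phi = vecs[:, :t]                           # t smallest eigenvectors
    Phi = Phi / np.maximum(np.linalg.norm(Phi, axis=1, keepdims=True), 1e-12)
    # Farthest-point initialization, then Lloyd's iterations on the rows.
    centers = [Phi[0]]
    for _ in range(1, t):
        dists = np.min([np.linalg.norm(Phi - c, axis=1) for c in centers], axis=0)
        centers.append(Phi[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(((Phi[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(t):
            if np.any(labels == c):
                centers[c] = Phi[labels == c].mean(axis=0)
    return labels

# Ideal block-diagonal coefficient matrix with two clusters of three points.
C = np.zeros((6, 6))
C[:3, :3] = 1.0
C[3:, 3:] = 1.0
labels = spectral_cluster(C, 2)
print(labels)  # first three points share one label, last three the other
```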
5. Numerical results. In this section, we evaluate our proposed method on face clustering and motion segmentation. Our method is compared with two state-of-the-art subspace clustering methods, SSC [8] and LRR [14]. For a fair comparison, we omit the post-processing steps used in the original papers of SSC [8] and LRR [14], and we construct the affinity matrix directly from the coefficient matrix for subspace clustering.

5.1. Choice of s. In our proposed method, matrix factorization is performed for matrix recovery, so it is necessary to find an appropriate value of s. In practice, we do not know the reduced dimension s a priori, and a suitable s depends on the given data. Generally speaking, s must be large enough to retain the necessary information and small enough to filter out the noise. In other words, if s is less than the true rank, then the data cannot be adequately approximated, while if s is equal to or larger than the true rank, then the reduction of the residual sum of squares (RSS) is very slow. Therefore, in the plot of the RSS versus the reduced dimension s, there is an inflection point where s matches the proper rank [12].
We use one experiment to investigate the relationship between the RSS and the reduced dimension s. Based on model (11), the RSS is defined as
RSS = ||Y − Y(UV^T) − E||²_F.
Figure 1 plots the RSS versus the reduced dimension s for s = 5, · · · , 10 on one of the video sequences in the Hopkins155 data. We see that there is an inflection point on the curve, which is chosen as the value of s in our algorithm. We also observe that when s exceeds the inflection point, the RSS becomes stable. Therefore, our method is not sensitive to the choice of the parameter s, as long as it is larger than the proper rank.
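The elbow behavior can be reproduced on synthetic data. Here the RSS of the best rank-s approximation (by Eckart-Young, the tail sum of squared singular values) stands in for the RSS of model (11); the sizes and noise level are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, true_rank = 50, 7
Z = rng.standard_normal((n, true_rank)) @ rng.standard_normal((true_rank, n))
Y = Z + 0.01 * rng.standard_normal((n, n))   # low-rank signal + small noise

sigma = np.linalg.svd(Y, compute_uv=False)
# Residual of the best rank-s approximation, for s = 1, ..., 11.
rss = [float(np.sum(sigma[s:] ** 2)) for s in range(1, 12)]
for s, r in zip(range(1, 12), rss):
    print(f"s={s:2d}  RSS={r:12.4f}")
```

The printed residuals drop steeply until s reaches the true rank (7) and then flatten, which is exactly the inflection point used to pick s.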

5.2. Face clustering.
Face clustering aims to group the face images belonging to the same human subject into one cluster. In the following, we evaluate our proposed method on the Extended Yale Database B [13]. This dataset contains face images of 38 human subjects, with 64 face images per subject taken under different illumination conditions; Figure 2 illustrates the face images of one subject under different illumination conditions. In our experiments, we test on t ∈ {2, 3, 5, 8, 10} subjects for subspace clustering. In order to save time and memory storage [10, 8], we resize each face image from 192×168 pixels to 48×42 pixels, and then further project each resulting 2016-dimensional vector to a 9t-dimensional subspace via the singular value decomposition (SVD). For the parameters of SSC and LRR, we use the default values set by the authors, with α = 20 for SSC and α = 0.18 for LRR. For our proposed method, (k, k)-SMF, the parameters are set as μ_1 = α / min_i max_{j≠i} ||y_j||_1, μ_2 = α, and μ_e = α/50000, with α = 20.
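The dimension-reduction step described here (projecting each vectorized image to a 9t-dimensional space via SVD) can be sketched on random stand-in data; the array shapes follow the text, while the data itself is synthetic.

```python
import numpy as np

rng = np.random.default_rng(4)
t = 2                       # number of subjects in one experiment
n_img = 64 * t              # 64 images per subject
dim = 48 * 42               # a vectorized 48x42 image (downsampled from 192x168)
Y = rng.standard_normal((dim, n_img))  # stand-in for vectorized face images

# Project onto the top 9t left singular vectors of the data matrix.
r = 9 * t
U, _, _ = np.linalg.svd(Y, full_matrices=False)
Y_proj = U[:, :r].T @ Y     # the (9t) x n_img matrix fed to the clustering step
print(Y_proj.shape)
```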

5.3. Motion segmentation.
In computer vision, motion segmentation is the process of separating regions, features, or trajectories in a video sequence into coherent subsets of space and time, where the subsets correspond to independently moving objects in the scene. In the following, we use the Hopkins155 dataset to evaluate our proposed method for motion segmentation. This dataset contains 156 video sequences, including Checkerboard, Traffic, and Articulated/non-rigid sequences; each sequence, with its feature points extracted in all frames, is an independent data set. In our numerical experiments, there is no pre- or post-processing on the dataset for any of the three methods. The parameters of SSC and LRR are again the default values set by the authors. Since the outliers in the data have been manually removed and the overall error level is therefore low, we set the error parameter μ_e = α/10 with α = 80. Table 2 lists the error rate (mean %/median %) for motion segmentation on the Hopkins155 dataset with the reduced dimension s = 10. From Table 2, we see that our method has a smaller mean error than both SSC and LRR.

Table 2. The error rate (mean %/median %) for motion segmentation on the Hopkins155 dataset.

6. Conclusion. In this work, we have proposed a (k, k)-sparse matrix factorization method for subspace clustering. The method consists of matrix recovery by (k, k)-sparse matrix factorization followed by spectral clustering. In the matrix recovery step, we recover a low-rank and (k, k)-sparse coefficient matrix, which is used to construct the affinity matrix for spectral clustering. Our (k, k)-sparse matrix factorization algorithm is robust to outliers contained in the corrupted data. The method is evaluated by experiments on the Extended Yale Database B for face clustering and the Hopkins155 dataset for motion segmentation.
Numerical results demonstrate that our proposed method outperforms the two state-of-the-art subspace clustering methods SSC and LRR.
There are several possible future research directions. One is to establish the convergence and theoretical guarantee of the proposed algorithm for subspace clustering. Another one is to investigate theoretical results of the atomic norm minimization for recovering low-rank and (k, k)-sparse matrix.