On linear convergence of projected gradient method for a class of affine rank minimization problems

The affine rank minimization problem is to find a low-rank matrix satisfying a set of linear equations; it includes the well-known matrix completion problem as a special case and has drawn much attention in recent years. In this paper, a new model for the affine rank minimization problem is proposed. The new model not only enhances the robustness of the affine rank minimization problem, but also leads to high nonconvexity. We show that if the classical projected gradient method is applied to solve the new model, a linear convergence rate can be established under some conditions. Some preliminary experiments are conducted to show the efficiency and effectiveness of our method.


2010 Mathematics Subject Classification. Primary: 90C30, 49M37; Secondary: 65K05.
Key words and phrases. Linear convergence, projected gradient method, affine rank minimization problems.
This work is partially supported by the National Natural Science Foundation of China (Grant No. 11401322) and the Fundamental Research Funds for the Central Universities (Grant No. NKZXB1447). * Corresponding author: Su Zhang.

1. Introduction. The affine rank minimization problem has drawn much attention in recent years [1,6,15]. The problem aims at finding a low-rank matrix satisfying a set of linear equations. Mathematically, it can be written as

    min_{X ∈ ℝ^{m×n}} rank(X)   s.t.   A(X) = b,                                (1)

where X ∈ ℝ^{m×n} is the decision matrix, the vector b ∈ ℝ^p is given, and A : ℝ^{m×n} → ℝ^p is a linear map defined by

    A(X) := (⟨A_1, X⟩, ⟨A_2, X⟩, …, ⟨A_p, X⟩)^T,

where A_i ∈ ℝ^{m×n}, i = 1, …, p. Problems of the above form naturally arise in the area of linear inverse problems, such as system identification [13], optimal control [5], and low-dimensional embedding in Euclidean space [12]. Among these models and applications, the matrix completion problem

    min_{X ∈ ℝ^{m×n}} rank(X)   s.t.   X_{ij} = B_{ij}, (i, j) ∈ Ω,             (2)
is of particular interest [4]. Such a problem aims at inferring the unknown entries of a low-rank matrix when only partial entries are observed. In the above problem, X represents the matrix to be inferred, while Ω is the set of indices of the known entries. Matrix completion problems find applications in collaborative filtering [2], image inpainting [6], gene prediction [14], etc. A variety of approaches, methods and algorithms have been proposed to address affine rank minimization problems as well as matrix completion problems. A well-known approach is to replace the rank function by its convex counterpart, i.e., to replace rank(X) by ‖X‖_*, the nuclear norm, which is the sum of the singular values of X, and hence obtain convex relaxations of (1) and (2). Under some assumptions, exact recovery results have been obtained; see, e.g., [4,9,10,15]. To deal with the noisy case, the nuclear norm has been imposed as a regularization term, yielding the following problem

    min_{X ∈ ℝ^{m×n}} (1/2) ‖A(X) − b‖_2² + λ ‖X‖_*,

with λ > 0 being a regularization parameter.
To enhance the robustness of the problems, the least absolute deviation loss has been adopted. For example, the robust version in the matrix completion setting is as follows [3]:

    min_{X ∈ ℝ^{m×n}} ‖X_Ω − B_Ω‖_1 + λ ‖X‖_*,

where the first term controls and penalizes the outliers, and B_Ω denotes the known entries. Correspondingly, exact recovery results can be obtained [3].
In this paper, we propose a new model to resolve the affine rank minimization problem (1) and its matrix completion counterpart (2). Unlike the existing approaches, which use certain convex loss functions to penalize the noise or outliers, our model employs the Cauchy loss ℓ_σ(t) = (σ²/2) log(1 + t²/σ²). This loss function is known to be robust to outliers in robust statistics [7], and our optimization problem is to minimize a cost function built from this loss under a rank constraint. The projected gradient method (PGM) is then used to solve the proposed optimization problem. The Cauchy loss makes the objective function nonconvex, and together with the rank constraint it yields a highly nonconvex optimization problem. Interestingly, however, we show that under some conditions a linear convergence rate can be established. This is the main result of this paper. Along with the main result, we also explore several properties of, and relationships between, the Cauchy loss and the least squares loss, which may be of independent interest and useful for other related linear inverse problems.
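To make the robustness claim concrete, the following short calculation (our own illustration, using the notation ℓ_σ introduced above rather than material from the original text) compares the influence function, i.e., the derivative of the loss, of the Cauchy loss with that of the least squares loss; the boundedness of the Cauchy influence function is what limits the effect of any single outlier.

```latex
% Influence functions (loss derivatives) for a residual t:
% least squares vs. the Cauchy loss \ell_\sigma used in the text.
\begin{align*}
  \ell(t) &= \tfrac{1}{2} t^{2}, &
  \ell'(t) &= t \quad (\text{unbounded in } t),\\
  \ell_{\sigma}(t) &= \tfrac{\sigma^{2}}{2}\log\!\left(1 + t^{2}/\sigma^{2}\right), &
  \ell_{\sigma}'(t) &= \frac{t}{1 + t^{2}/\sigma^{2}},
  \qquad \lvert \ell_{\sigma}'(t) \rvert \le \frac{\sigma}{2}
  \ \text{ for all } t,
\end{align*}
% the bound \sigma/2 being attained at t = \pm\sigma, so a single large outlier
% can shift the Cauchy-loss gradient by at most \sigma/2 per residual.
```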
The rest of this paper is organized as follows. In Section 2 we introduce our model and the algorithm. The linear convergence rate of PGM for solving the proposed optimization problem is established in Section 3. Some preliminary experiments are reported in Section 4. Section 5 draws some conclusions.

2. Model and algorithm. When there is noise or outliers in the affine rank minimization problem, a suitable loss function should be introduced to penalize the noise or outliers. To measure and penalize the difference between ⟨A_i, X⟩ and b_i, we adopt the function

    ℓ_σ(t) = (σ²/2) log(1 + t²/σ²),

which is known as the Cauchy loss function in robust statistics [7]. Here σ > 0 controls the robustness: the smaller the parameter σ is, the more robust the problem is. By letting t = ⟨A_i, X⟩ − b_i and summing ℓ_σ over i from 1 to p, we arrive at the cost function

    F_σ(X) := Σ_{i=1}^{p} ℓ_σ(⟨A_i, X⟩ − b_i).

Similarly, in the matrix completion setting, the cost function is given by

    F_σ(X) := Σ_{(i,j) ∈ Ω} ℓ_σ(X_{ij} − B_{ij}).

With the low rank constraint rank(X) ≤ R, the new optimization problem is given by

    min_{X ∈ ℝ^{m×n}} F_σ(X)   s.t.   rank(X) ≤ R.                              (3)

Because F_σ(·) is differentiable, the conventional projected gradient method (PGM) can be applied to solve (3). Denoting S_R := {X ∈ ℝ^{m×n} | rank(X) ≤ R}, the PGM for solving (3) iterates as follows:

    Y^{(k+1)} = X^{(k)} − (1/α) ∇F_σ(X^{(k)}),    X^{(k+1)} = P_{S_R}(Y^{(k+1)}).

Here ∇F_σ(X) is the gradient of F_σ(·) at X, which is given by

    ∇F_σ(X) = Σ_{i=1}^{p} [ (⟨A_i, X⟩ − b_i) / (1 + (⟨A_i, X⟩ − b_i)²/σ²) ] A_i.

In the matrix completion setting, the gradient is given by

    (∇F_σ(X))_{ij} = (X_{ij} − B_{ij}) / (1 + (X_{ij} − B_{ij})²/σ²) if (i, j) ∈ Ω, and 0 otherwise.

1/α > 0 is the stepsize, and P_{S_R}(Y) denotes the projection of Y onto S_R, which is the best rank-R approximation to Y, i.e.,

    P_{S_R}(Y) ∈ argmin_{X ∈ S_R} ‖X − Y‖_F.

This projection can be computed analytically via a singular value decomposition. Note that PGM has been proposed to solve affine rank minimization/matrix completion problems before [1,6,8]. The difference is that those optimization problems all use a least squares loss in the objective function, which is quadratic and convex, whereas the objective function of (3) is nonconvex.
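To make the iteration concrete, the following Python sketch implements the PGM just described for the matrix completion version of (3). It is a minimal reimplementation from the description above, not the authors' code; the function names (pgm_cauchy_mc, cauchy_grad, project_rank) and the default parameter values are our own choices. For matrix completion the sampling operator has spectral norm 1, so the step size 1/α = 1 is a natural default.

```python
import numpy as np

def cauchy_grad(X, B, mask, sigma):
    """Gradient of F_sigma for matrix completion:
    (X_ij - B_ij)/(1 + ((X_ij - B_ij)/sigma)^2) on Omega, 0 elsewhere."""
    R = (X - B) * mask                       # residuals on the observed set Omega
    return mask * (R / (1.0 + (R / sigma) ** 2))

def project_rank(Y, r):
    """Best rank-r approximation of Y via truncated SVD (projection onto S_R)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

def pgm_cauchy_mc(B, mask, r, sigma=1.0, alpha=1.0, max_iter=1000, tol=1e-5):
    """Projected gradient method for min F_sigma(X) s.t. rank(X) <= r.

    B     : matrix holding the observed entries (arbitrary values elsewhere)
    mask  : 0/1 array, 1 on the observed index set Omega
    alpha : for matrix completion the sampling operator has spectral norm 1,
            so alpha = 1 (step size 1/alpha = 1) is a natural default.
    """
    X = np.zeros_like(B)
    for _ in range(max_iter):
        Y = X - (1.0 / alpha) * cauchy_grad(X, B, mask, sigma)
        X_new = project_rank(Y, r)
        if np.linalg.norm(X_new - X) <= tol:
            return X_new
        X = X_new
    return X
```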
3. Linear convergence rate. In this section, we prove the linear convergence of PGM for solving (3) under some conditions. It has been shown in [1] that if the objective function is convex and quadratic, PGM for affine rank minimization problems has a linear convergence rate. However, the analysis becomes much more difficult when the objective function is nonconvex, as in (3). In the following, we show how to tackle the difficulties step by step. We first explore some properties of the objective function.
3.1. Some necessary properties. We first fix some notation. Denote by vec(·) the vectorization operator over any matrix space ℝ^{s×t}, with vec(B) ∈ ℝ^{st}. We further define the matrix A ∈ ℝ^{p×mn} whose i-th row is vec(A_i)^T, i = 1, …, p. Based on this notation, the vectorization of A(X) can be written as vec(A(X)) = A vec(X), and the gradient of F_σ at X can be rewritten as

    vec(∇F_σ(X)) = A^T Λ (A vec(X) − b),

where Λ ∈ ℝ^{p×p} is the diagonal matrix with Λ_{ii} = 1 / (1 + ((⟨A_i, X⟩ − b_i)/σ)²). The matrix Λ can be seen as a weight matrix: if the magnitude of the noise or outliers in the i-th residual ⟨A_i, X⟩ − b_i is large, then Λ_{ii} is small, and hence a small weight is assigned to the large noise or outliers, which helps to reduce their influence. This gives an intuitive explanation of why the model and the algorithm may be robust. The parameter σ plays a similar role: the smaller σ is, the smaller the weight is. Let ‖A‖_2 be the spectral norm of A and denote L := ‖A‖_2². The following proposition shows that the gradient of F_σ is Lipschitz continuous.
Proposition 1. For any X, Y ∈ ℝ^{m×n}, it holds that ‖∇F_σ(X) − ∇F_σ(Y)‖_F ≤ L ‖X − Y‖_F.

Proof. By the above representation of the gradient,

    vec(∇F_σ(X)) − vec(∇F_σ(Y)) = A^T ( Λ_X (A vec(X) − b) − Λ_Y (A vec(Y) − b) ),

where Λ_X and Λ_Y are the diagonal matrices corresponding to ∇F_σ(X) and ∇F_σ(Y), respectively. We then need to show that

    ‖Λ_X (A vec(X) − b) − Λ_Y (A vec(Y) − b)‖_2 ≤ ‖A vec(X) − A vec(Y)‖_2.

It is not hard to check that for any t_1, t_2 ∈ ℝ and σ > 0,

    | t_1/(1 + t_1²/σ²) − t_2/(1 + t_2²/σ²) | ≤ | t_1 − t_2 |.

Applying this inequality componentwise with t_1 = ⟨A_i, X⟩ − b_i and t_2 = ⟨A_i, Y⟩ − b_i gives the claimed bound. As a result, we have ‖∇F_σ(X) − ∇F_σ(Y)‖_F ≤ ‖A‖_2 ‖A vec(X) − A vec(Y)‖_2 ≤ ‖A‖_2² ‖X − Y‖_F = L ‖X − Y‖_F. The proof has been completed.
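The vectorized representation vec(∇F_σ(X)) = A^T Λ (A vec(X) − b) and the Lipschitz bound of Proposition 1 can be sanity-checked numerically on random data. The following short script is our own illustration, with randomly generated A_i, b, X and Y; it is not part of the original analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p, sigma = 6, 5, 12, 0.7

Ai = rng.standard_normal((p, m, n))           # the matrices A_1, ..., A_p
b = rng.standard_normal(p)
A = Ai.reshape(p, m * n)                      # row i of A is vec(A_i)^T

def grad_F(X):
    """Direct gradient: sum_i ell_sigma'(<A_i, X> - b_i) * A_i."""
    r = np.tensordot(Ai, X, axes=([1, 2], [0, 1])) - b   # residuals <A_i, X> - b_i
    w = r / (1.0 + (r / sigma) ** 2)
    return np.tensordot(w, Ai, axes=(0, 0))

X, Y = rng.standard_normal((m, n)), rng.standard_normal((m, n))

# 1) vectorized representation: vec(grad) = A^T Lambda (A vec(X) - b)
res = A @ X.ravel() - b
Lam = np.diag(1.0 / (1.0 + (res / sigma) ** 2))
print(np.allclose(grad_F(X).ravel(), A.T @ Lam @ res))    # expected: True

# 2) Lipschitz bound with L = ||A||_2^2 (Proposition 1)
L = np.linalg.norm(A, 2) ** 2
lhs = np.linalg.norm(grad_F(X) - grad_F(Y), 'fro')
print(lhs <= L * np.linalg.norm(X - Y, 'fro'))             # expected: True
```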
The following conclusion is an immediate consequence of Proposition 1.

Proposition 2. For any X, Y ∈ ℝ^{m×n}, it holds that

    F_σ(Y) ≤ F_σ(X) + ⟨∇F_σ(X), Y − X⟩ + (L/2) ‖Y − X‖_F².

We also need the following lemmas.

3.2. Linear convergence analysis. The matrix RIP condition [15] is crucial in analyzing matrix recovery properties, while the matrix Scalable Restricted Isometry Property (SRIP) is important in establishing the linear convergence rate for the affine rank minimization problem when the objective function is quadratic and convex [1]. The linear convergence analysis in our setting also relies on it.
Definition 3.4 (SRIP [1]). The linear map A is said to satisfy the SRIP of rank r if there exist constants ν_r, μ_r > 0 such that

    ν_r ‖X‖_F ≤ ‖A(X)‖_2 ≤ μ_r ‖X‖_F                                            (7)

holds for any matrix X with rank(X) ≤ r.

To prove our results, we have to make the following assumptions.

Assumption 1.
1. At the (k + 1)-th iteration of the algorithm, the parameter σ_{k+1} of F_σ is chosen according to a rule depending on β and σ̄, where 0.99 < β < 1 and σ̄ is an arbitrary constant.
2. The spectral norm of A is upper bounded as ‖A‖_2² ≤ (6/5) ν_{2R}².

Although the second assumption seems restrictive, there are some linear operators satisfying it, e.g., the identity operator. Now we arrive at the main result of this paper.

Theorem 3.5 (Linear convergence of PGM for solving (3)). Assume that X* is a rank-R matrix satisfying A(X*) = b. Suppose that Assumption 1 holds. Let {X^{(k)}} be the sequence generated by the projected gradient method for solving (3), with α = ‖A‖_2². Then the algorithm converges linearly, i.e.,

    ‖X^{(k+1)} − X*‖_F ≤ q_1 ‖X^{(k)} − X*‖_F,
    ‖A(X^{(k+1)} − X*)‖_2² ≤ q_2 ‖A(X^{(k)} − X*)‖_2²,
    F_{σ_{k+1}}(X^{(k+1)}) ≤ q_3 F_{σ_k}(X^{(k)}),

where 0 < q_1, q_2, q_3 < 1 depend only on the choice of β.
Proof. Since rank(X*) = R and X^{(k+1)} is the best rank-R approximation to Y^{(k+1)}, we have

    ‖X^{(k+1)} − Y^{(k+1)}‖_F ≤ ‖X* − Y^{(k+1)}‖_F.

Substituting Y^{(k+1)} = X^{(k)} − (1/α) ∇F_{σ_{k+1}}(X^{(k)}) and expanding, we obtain inequality (8). To prove the first claim, we need to bound the first and the third terms in (8) by ‖X^{(k)} − X*‖_F². We first deal with the first term. By the fact that ‖Λ‖_2² ≤ 1 and α = ‖A‖_2², we observe that the first term of (8) can be upper bounded as in (9). We then focus on how to bound the third term. Denoting y_i^k := ⟨A_i, X^{(k)} − X*⟩ = ⟨A_i, X^{(k)}⟩ − b_i for i = 1, …, p and recalling that Λ_{ii} = 1/(1 + (y_i^k/σ_{k+1})²), we obtain, from the choice of σ_{k+1} and σ̄, a bound that holds for i = 1, …, p and yields inequality (11). The range of β then gives γ < −0.9608. Using the SRIP condition, (8) together with (9) and (11) implies an estimate of ‖X^{(k+1)} − X*‖_F², in which the last inequality follows from the assumption that ν_{2R}²/α ≥ 5/6 and the fact that γ < 0. Denote q_1 := 2√(1 + 5γ/6). Then the range of β implies that q_1 ∈ (√2/√3, 0.8929). Therefore,

    ‖X^{(k+1)} − X*‖_F ≤ q_1 ‖X^{(k)} − X*‖_F,

and the first inequality has been verified. Next we focus on the second assertion. It follows from the SRIP condition and the first assertion that

    ‖A(X^{(k+1)} − X*)‖_2² ≤ μ_{2R}² ‖X^{(k+1)} − X*‖_F² ≤ μ_{2R}² q_1² ‖X^{(k)} − X*‖_F² ≤ (6/5) q_1² ‖A(X^{(k)} − X*)‖_2²,

where the last inequality is due to μ_{2R}² ≤ ‖A‖_2² ≤ (6/5) ν_{2R}². Denote q_2 := 6q_1²/5. Then we see that q_2 ∈ (4/9, 0.9568), which verifies the second inequality.
We now proceed to the last assertion: the linear convergence in the F_σ sense. It follows from the fact that X^{(k+1)} is the best rank-R approximation to Y^{(k+1)} that a first inequality holds, while (6) yields a second one. Combining the above two inequalities, we obtain an upper bound on F_{σ_{k+1}}(X^{(k+1)}). The remaining work is to derive upper bounds for ⟨∇F_{σ_{k+1}}(X^{(k)}), X* − X^{(k)}⟩ and for ‖X^{(k)} − X*‖_F². We first consider the second term. Using the SRIP condition, and recalling the relationship between β and σ_{k+1}, we get δ ≤ 2(1 − β). Invoking Lemma 3.2 and summing the resulting inequalities over i from 1 to p, we obtain a bound on ‖X^{(k)} − X*‖_F² in terms of F_{σ_{k+1}}(X^{(k)}). We then proceed to bound ⟨∇F_{σ_{k+1}}(X^{(k)}), X* − X^{(k)}⟩, which follows from (11) and Lemma 3.1. Combining (13), (14) and (15), we get the desired estimate with q_3 := 1 + (6/5)β − 2/(3 − 2β). Again, the range of β shows that q_3 ∈ (0.2, 0.2272). Last, we have to replace F_{σ_{k+1}}(X^{(k)}) by F_{σ_k}(X^{(k)}). Lemma 3.3 tells us that the function (σ²/2) log(1 + t²/σ²) is nondecreasing with respect to σ > 0. Since (12) implies σ_{k+1} ≤ σ_k, we get F_{σ_{k+1}}(X^{(k)}) ≤ F_{σ_k}(X^{(k)}), and finally there holds

    F_{σ_{k+1}}(X^{(k+1)}) ≤ q_3 F_{σ_k}(X^{(k)}).

The proof has been completed.
Remark 1. The SRIP condition (7) does not hold in the matrix completion setting, because ν_r in this case is zero. As a result, Theorem 3.5 may not be directly applied to the matrix completion setting. Fortunately, [8] proved that if some conditions are met, a refined RIP condition similar to (7) holds for matrix completion problems with high probability; see [8, Theorem 4.2] for details. Therefore, it is reasonable to expect that Theorem 3.5 carries over to matrix completion problems solved by PGM under the same conditions as in [8].
Even if A does not satisfy (7), the following result shows that the sequence {F_σ(X^{(k)})} generated by PGM for solving (3) is nonincreasing, which also indicates the convergence of the objective values produced by PGM.
Proposition 3. Let {F_σ(X^{(k)})} be generated by PGM for solving (3) and choose α > L := ‖A‖_2². Then {F_σ(X^{(k)})} is nonincreasing.

Proof. Since rank(X^{(k)}) ≤ R for all k ≥ 1, and X^{(k+1)} is the best rank-R approximation of Y^{(k+1)} = X^{(k)} − (1/α) ∇F_σ(X^{(k)}), we have ‖X^{(k+1)} − Y^{(k+1)}‖_F² ≤ ‖X^{(k)} − Y^{(k+1)}‖_F², which can be rearranged as

    ⟨∇F_σ(X^{(k)}), X^{(k+1)} − X^{(k)}⟩ + (α/2) ‖X^{(k+1)} − X^{(k)}‖_F² ≤ 0.          (16)

It follows from Proposition 2 that

    F_σ(X^{(k+1)}) ≤ F_σ(X^{(k)}) + ⟨∇F_σ(X^{(k)}), X^{(k+1)} − X^{(k)}⟩ + (L/2) ‖X^{(k+1)} − X^{(k)}‖_F².   (17)

Since α > L, (16) and (17) imply that

    F_σ(X^{(k+1)}) ≤ F_σ(X^{(k)}) − ((α − L)/2) ‖X^{(k+1)} − X^{(k)}‖_F² ≤ F_σ(X^{(k)}),

which implies that the sequence {F_σ(X^{(k)})} is nonincreasing. The proof is completed.

4. Numerical experiments. This section presents some preliminary numerical experiments. First we randomly generate some matrices of size 500 × 500 and truncate them to be low rank, where the rank varies from 5 to 50. Then 30% of the entries are contaminated by outliers with values in [0, 1]. Last, some entries are randomly missing, where the missing ratio varies over {0.3, 0.5, 0.7, 0.9}. We compare our method, PGM for solving (3), with RPCA [3], which uses the alternating direction method to solve the corresponding robust model. The PGM for solving the following least squares problem [8]

    min_{X ∈ ℝ^{m×n}} (1/2) ‖X_Ω − B_Ω‖_2²   s.t.   rank(X) ≤ R                 (18)

serves as a baseline, denoted PGM-LS for short. The stopping criterion is that either ‖X^{(k+1)} − X^{(k)}‖_F ≤ 10⁻⁵ or the number of iterations exceeds 1000. For PGM-LS, the maximum number of iterations is 500.
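For reference, synthetic instances of the kind described above (500 × 500 matrices of rank 5 to 50, 30% of the entries replaced by outliers in [0, 1], and missing ratios in {0.3, 0.5, 0.7, 0.9}) can be generated along the following lines. This is only one plausible reading of the setup sketched in the text; the exact generation procedure (e.g., how the low-rank matrices are produced) is not specified in the paper, so the construction and names below are our own.

```python
import numpy as np

def make_instance(n=500, rank=5, outlier_ratio=0.3, missing_ratio=0.7, seed=0):
    """Generate a low-rank matrix, contaminate a fraction of entries with
    outliers in [0, 1], and hide a fraction of entries at random."""
    rng = np.random.default_rng(seed)
    # true low-rank matrix (one plausible construction: product of two factors)
    B_true = rng.standard_normal((n, rank)) @ rng.standard_normal((rank, n))

    # corrupt a fraction of the entries with outliers drawn uniformly from [0, 1]
    B_obs = B_true.copy()
    out = rng.random((n, n)) < outlier_ratio
    B_obs[out] = rng.random(out.sum())

    # observation mask: entries are missing with probability `missing_ratio`
    mask = (rng.random((n, n)) >= missing_ratio).astype(float)
    return B_true, B_obs * mask, mask

B_true, B_obs, mask = make_instance(rank=5, missing_ratio=0.7)
# relative error of a recovered matrix X_hat (our notation), as defined below:
# np.linalg.norm(X_hat - B_true, 'fro') / np.linalg.norm(B_true, 'fro')
```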
We report the relative error and the CPU time. Here the relative error is defined as ‖X* − B‖_F / ‖B‖_F, where X* represents the recovered matrix and B is the true low rank matrix; the CPU time is reported in seconds. The results are given in Table 1. From the table, we see that PGM-LS performs worst. The reason is that the model (18) uses the least squares loss to penalize the noise, which is not resistant to outliers. When the missing ratio is not high, i.e., 0.3, and the matrix has low rank, i.e., rank = 5, 10, 20, RPCA can recover the matrix with high precision. However, when the missing ratio and the rank increase, our method, PGM for solving (3), performs better than RPCA in most cases, which indicates that our method is more resistant to outliers. Apart from this, the results also show that our method is more efficient than RPCA.
To intuitively show the effectiveness of our method, in Fig. 1 we plot an example of the matrix recovered by our method, where the rank of the matrix is 5 and the missing ratio is 0.7. Fig. 1c shows that our method can almost recover the true matrix. We also plot the relative error curve of our method in this setting in Fig. 2, which shows that our method converges fast.
The curve of the logarithm of F_σ(X^{(k)}) is plotted in Fig. 3, where the rank of the matrix is 5 and the missing ratio is 0.7. From the figure we observe that PGM for solving (3) converges at least linearly.

5. Conclusions. In this paper, we propose a new model for the affine rank minimization problem. The new model gains robustness from the Cauchy loss function used in robust statistics, but also brings nonconvexity. We apply the projected gradient method to solve it and prove a linear convergence rate under some conditions. The efficiency of the proposed method for solving the new model has been shown in some preliminary numerical tests.
In the future, we will consider other methods for solving this new model to speed up the convergence. The Cauchy loss function could also be used in other problems to enhance robustness.