CONVERGENCE AND STABILITY OF ITERATIVELY REWEIGHTED LEAST SQUARES FOR LOW-RANK MATRIX RECOVERY



1. Introduction.
1.1. Low rank matrix recovery. Low rank matrix recovery is a burgeoning topic which has drawn the attention of many researchers in different fields in the past several years. It has wide applications in quantum state tomography [21,22], image processing [2], system identification [29], recovering shape and motion from image streams [32], etc. The fundamental problem of low rank matrix recovery is to reconstruct an unknown (approximately) low rank matrix M ∈ R^{n1×n2} from its linear measurements b = A(M) + z, where A : R^{n1×n2} → R^m (m < n1 n2) is a linear measurement operator and z ∈ R^m is a noise vector bounded by ‖z‖_2 ≤ η. The direct method to tackle this inverse problem is to solve the following rank minimization problem

(1)  min_{X∈R^{n1×n2}} rank(X)  subject to  ‖A(X) − b‖_2 ≤ η.
Note that when X is restricted to be a diagonal matrix, the rank minimization problem (1) reduces to the following sparse signal recovery problem in compressed sensing

(2)  min_{x∈R^n} ‖x‖_0  subject to  ‖Ax − b‖_2 ≤ η,

where ‖x‖_0 denotes the number of non-zero entries of the vector x [7,8]. Therefore, the rank minimization problem (1) includes the sparse signal recovery problem (2) as a special case. Unfortunately, the rank minimization problem (1) is NP-hard and thus computationally infeasible. To circumvent this intractability, a popular approach is to consider its relaxation, i.e., the following Schatten-p norm minimization problem (0 < p ≤ 1)

(3)  min_{X∈R^{n1×n2}} ‖X‖_p^p  subject to  ‖A(X) − b‖_2 ≤ η,

where the Schatten-p (quasi) norm of X is defined by ‖X‖_p = (Σ_i σ_i^p(X))^{1/p}, with σ_i(X) being the singular values of X. When p = 1, problem (3) is the convex nuclear norm minimization problem. Nuclear norm minimization is a semi-definite program (SDP) and can be solved efficiently by convex optimization methods [16,35,4]. When 0 < p < 1, ‖X‖_p is a quasi-norm and problem (3) is a non-convex optimization problem [34,42]. A number of numerical algorithms for the non-convex Schatten-p norm minimization problem have been developed, and they empirically perform better than nuclear norm minimization [26,30,31]. Theoretical analysis showed that the Schatten-p norm minimization problem (3) (0 < p ≤ 1) can exactly recover a low rank matrix from noiseless measurements (i.e., η = 0) and stably recover a low rank matrix in the presence of observation noise (i.e., η > 0) when the measurement map A satisfies the matrix restricted isometry property (M-RIP) [5,35,42,41]. Recht et al. [35] generalized the sparsity restricted isometry property (RIP) [7], defined for sparse vectors in compressed sensing, to low rank matrix recovery. The matrix restricted isometry property (M-RIP) is defined as follows:

Definition 1.1. Given a linear operator A : R^{n1×n2} → R^m, for every integer r with 1 ≤ r ≤ n1, the matrix restricted isometry constant δ_r is defined as the minimum constant such that

(1 − δ_r)‖X‖_F^2 ≤ ‖A(X)‖_2^2 ≤ (1 + δ_r)‖X‖_F^2

holds for all X ∈ R^{n1×n2} with rank(X) ≤ r.
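As a concrete illustration of the Schatten-p quasi-norm just defined, the following minimal numpy sketch (the function name and test matrix are our own, not from the paper) evaluates ‖X‖_p from the singular values; p = 1 recovers the nuclear norm.

```python
import numpy as np

def schatten_p(X, p):
    """Schatten-p (quasi) norm ||X||_p = (sum_i sigma_i(X)^p)^(1/p).

    p = 1 gives the nuclear norm ||X||_*.
    """
    s = np.linalg.svd(X, compute_uv=False)  # singular values of X
    return np.sum(s ** p) ** (1.0 / p)

# quick check on a random rank-2 matrix
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 7))
print(np.isclose(schatten_p(X, 1.0), np.linalg.norm(X, "nuc")))  # True
```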
It was shown in [41] that when the unknown matrix M satisfies rank(M) ≤ r and the measurement map A satisfies the M-RIP of order 2r, the optimal solution X* of (3) satisfies

(4)  ‖X* − M‖_F ≤ Cη,

where C is a constant depending only on the M-RIP constant δ_2r and the rank r. Candès et al. [12] proved that any random measurement map A which satisfies the following concentration inequality: for any given X ∈ R^{n1×n2}, for fixed 0 < δ < 1 and fixed constants c, t > 0,

(5)  P(|‖A(X)‖_2^2 − ‖X‖_F^2| ≥ δ‖X‖_F^2) ≤ c e^{−tm},

will satisfy the M-RIP with restricted isometry constant δ_r ≤ δ with high probability provided that m ≥ cδ^{−2} n2 r. Thus many types of random measurement maps, such as Gaussian, subgaussian and random partial Fourier measurement maps, have the M-RIP with high probability [12,25]. However, in the matrix completion problem, the projection operator P_Ω does not satisfy the M-RIP, since there are rank one matrices in the kernel of P_Ω. The literature [9,10] shows that under certain incoherence assumptions on the singular vectors of the unknown low rank matrix, the matrix completion problem has good recovery performance. More studies about matrix completion can be found in [11,36,23,40], etc.
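To make the concentration phenomenon tangible, here is a small numpy experiment (dimensions and seed are arbitrary choices of ours): a Gaussian measurement map with i.i.d. N(0, 1/m) entries applied to a fixed X yields ‖A(X)‖_2^2 close to ‖X‖_F^2 once m is moderately large.

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, m = 20, 30, 2000
X = rng.standard_normal((n1, n2))

# A acts on vec(X); rows are i.i.d. N(0, 1/m), so E||A(X)||_2^2 = ||X||_F^2
A = rng.standard_normal((m, n1 * n2)) / np.sqrt(m)
ratio = np.linalg.norm(A @ X.ravel()) ** 2 / np.linalg.norm(X, "fro") ** 2
print(ratio)  # close to 1, deviating by O(1/sqrt(m))
```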
1.2. Related iteratively reweighted least squares algorithms. Although the Schatten-p norm minimization problem (3) (0 < p ≤ 1) can be solved using standard optimization tools, the iteratively reweighted least squares algorithm has been suggested as an attractive, simple and efficient alternative for low rank matrix recovery (abbreviated as IRLS-M) in the literature [26,17,30,31]. Indeed, the iteratively reweighted least squares (abbreviated as IRLS) algorithm appeared for the first time in approximation theory, for solving uniform approximation problems [27]. Subsequently, [1,14,15,18,26,39] used the iteratively reweighted least squares algorithm as an efficient method to solve convex and nonconvex l_p (0 < p ≤ 1) minimization problems in compressed sensing. In particular, the IRLS algorithm has been studied in detail for recovering a sparse signal in noiseless compressed sensing by Daubechies et al. [15]. Then Ba et al. [1] used the IRLS algorithm to solve noisy l_p (0 < p ≤ 1) minimization problems; its kth iteration is given by

x^k = argmin_x Σ_i w_i^{k−1} x_i^2  subject to  ‖Ax − b‖_2 ≤ η,

where w^{k−1} ∈ R^n is a weight vector with w_i^{k−1} = (|x_i^{k−1}|^2 + ε^2)^{p/2−1} (ε > 0 being a smoothing parameter added to ensure that w^{k−1} is well defined). They proved that the IRLS algorithm has good theoretical and numerical recovery performance for the sparse signal recovery problem in the presence of observation noise.
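For intuition, the following sketch implements the noiseless variant of this scheme (the equality-constrained weighted least squares step of Daubechies et al. [15], which has a closed-form solution); in the noisy setting of Ba et al. [1], the update instead minimizes the same weighted quadratic over the set ‖Ax − b‖_2 ≤ η and would require a generic convex solver. The function name, iteration count, and sparsity estimate K are our own illustrative choices.

```python
import numpy as np

def irls_lp_noiseless(A, b, K, p=1.0, n_iter=50):
    """IRLS sketch for min ||x||_p^p s.t. Ax = b (noiseless case, as in [15]).

    Weights w_i = (|x_i|^2 + eps^2)^(p/2 - 1); each step solves the weighted
    least squares problem min sum_i w_i x_i^2 s.t. Ax = b in closed form.
    """
    m, n = A.shape
    x = np.linalg.lstsq(A, b, rcond=None)[0]  # least-squares initialization
    eps = 1.0
    for _ in range(n_iter):
        w = (x ** 2 + eps ** 2) ** (p / 2.0 - 1.0)
        D = np.diag(1.0 / w)                  # D = diag(w)^{-1}
        # closed form: x = D A^T (A D A^T)^{-1} b
        x = D @ A.T @ np.linalg.solve(A @ D @ A.T, b)
        # decrease eps, driven by the (K+1)-st largest magnitude as in [15]
        eps = min(eps, np.sort(np.abs(x))[::-1][K] / n)
    return x
```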
Fornasier et al. [17] extended the IRLS algorithm [15] from sparse signal recovery to the IRLS-M algorithm for solving the noiseless nuclear norm minimization problem. Then Mohan and Fazel used IRLS-M to solve the Schatten-p minimization (0 < p ≤ 1) problem for noiseless low rank recovery [30,31]. In addition, [17,30,31] showed that the IRLS-M algorithm actually minimizes a certain smooth approximation to the Schatten-p norm (0 < p ≤ 1), i.e., IRLS-M solves the following smoothed noiseless Schatten-p minimization (0 < p ≤ 1) problem:

min_{X∈R^{n1×n2}} Tr((XX^T + ε^2 I)^{p/2})  subject to  A(X) = b,

where ε > 0 is the smoothing parameter. Note that as ε → 0, Tr((XX^T + ε^2 I)^{p/2}) approximates the Schatten-p norm ‖X‖_p^p. Convergence of IRLS-M has been proved rigorously, and for p = 1 it was shown that the IRLS-M algorithm allows for efficient implementations while having theoretical guarantees similar to nuclear norm minimization in the absence of noise [17,30,31]. Numerical experiments in [17,30,31] showed that the IRLS-M algorithm saves computational time and is also efficient for the matrix completion problem in the presence of noise. However, these papers lack a rigorous theoretical analysis of the convergence and stability of the IRLS-M algorithm for low rank matrix recovery in the presence of noise. In addition, the theoretical analysis of the algorithm is given only for p = 1.

1.3. Contribution.
In this paper, inspired by the good performance of the IRLS algorithm for solving the noisy sparse signal recovery problem [1], we extend the IRLS-M algorithm to low rank matrix recovery in the presence of observation noise. We use the IRLS-M algorithm to solve the following smoothed noisy Schatten-p (0 < p ≤ 1) minimization problem:

(6)  min_{X∈R^{n1×n2}} Tr((XX^T + ε^2 I)^{p/2})  subject to  ‖A(X) − b‖_2 ≤ η,

whose kth iteration is the weighted least squares problem

X^k = argmin_{X: ‖A(X)−b‖_2≤η} ‖(W^{k−1})^{1/2} X‖_F^2,  with  W^{k−1} = (X^{k−1}(X^{k−1})^T + ε^2 I)^{p/2−1},

and ε > 0 ensures that W^{k−1} is well-conditioned. To recover a low rank matrix or sparse vector, IRLS-M [17,30,31] and IRLS [15,14,26] always vary ε, starting at a large value and gradually reducing it. The update rule of ε in our IRLS-M algorithm for noisy low rank matrix recovery is the same as the one in [15,17] (see Algorithm 1 in Section 2.2). We first give a convergence analysis of the IRLS-M algorithm for all 0 < p ≤ 1. While the matrix null space property plays a central role in [17,30,31], it is no longer useful for our IRLS-M algorithm for noisy low rank matrix recovery, as the iteration error matrix X^k − X no longer lies in the null space of the measurement map A. We use the M-RIP of the measurement map A to establish the stability of our IRLS-M algorithm and to give a local error bound for all 0 < p ≤ 1. In particular, when p = 1, we prove that the M-RIP condition δ_2r < √2 − 1 is sufficient for IRLS-M to recover an unknown (approximately) low rank matrix with an error that is proportional to the noise level.

1.4. Organization. The rest of this paper is organized as follows. In Section 2, we first introduce some basic notation used throughout the paper. Subsequently, we state our IRLS-M algorithm for low rank matrix recovery in the presence of noise and give some preliminaries for proving the main results in Section 3. Section 3 gives the convergence and stability analysis of the IRLS-M algorithm in the noisy case, provided that the measurement map A satisfies the matrix restricted isometry property. Section 4 gives some conclusions.
2. IRLS-M algorithm for noisy low rank matrix recovery.
2.1. Notation. For a general matrix X ∈ R^{n1×n2} (without loss of generality assume n1 ≤ n2), let X^T be the transpose of X. We write the singular value decomposition (abbreviated as SVD) of X as X = UΣV^T for orthogonal matrices U ∈ R^{n1×n1}, V ∈ R^{n2×n2} and a diagonal matrix Σ = diag(σ_1(X), σ_2(X), . . . , σ_{n1}(X)) ∈ R^{n1×n2}, where σ_1(X) ≥ σ_2(X) ≥ · · · ≥ σ_{n1}(X) ≥ 0 are the singular values of X. We denote by σ(X) = (σ_1(X), σ_2(X), . . . , σ_{n1}(X)) the vector of singular values of X. The trace of a square matrix X ∈ R^{n1×n1} is the sum of its diagonal entries, i.e., Tr(X) = Σ_{i=1}^{n1} X_{ii}. The trace satisfies Tr(X) = Tr(X^T) and Tr(XY) = Tr(YX). For a positive definite and self-adjoint matrix W ∈ R^{n1×n1}, we define the weighted Frobenius norm of X with respect to W by ‖X‖_{F,W} := ‖W^{1/2}X‖_F = (Tr(WXX^T))^{1/2}. More generally, we consider the Schatten-p (quasi) norms of X: ‖X‖_p = (Σ_{i=1}^{n1} σ_i^p(X))^{1/p}, 0 < p ≤ 1. In particular, when p = 1, ‖X‖_1 = ‖X‖_* denotes the nuclear norm of X. The spectral norm of a matrix X is defined as ‖X‖ = sup_{‖y‖_2=1} ‖Xy‖_2, which equals the largest singular value of X. The rank of X equals the number of nonzero singular values of X. For any integer 1 ≤ r ≤ n1, the best rank-r approximation X_[r] of X is defined as X_[r] = U_[r] Σ_[r] V_[r]^T, where σ_[r](X) = (σ_1(X), σ_2(X), . . . , σ_r(X)), U_[r] ∈ R^{n1×n1} and V_[r] ∈ R^{n2×n2} contain the left and right singular vectors of X corresponding to the r largest singular values, respectively, and Σ_[r] ∈ R^{n1×n2} is a diagonal matrix with the r largest singular values of X on its diagonal.
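The notation above translates directly into code; the following minimal numpy helpers (our own, for illustration) compute the best rank-r approximation X_[r] from the top r singular triplets, which by the Eckart–Young theorem minimizes ‖X − Z‖_F over all Z of rank at most r, and the weighted Frobenius norm.

```python
import numpy as np

def best_rank_r(X, r):
    """Best rank-r approximation X_[r] of X (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

def weighted_fro(X, W):
    """Weighted Frobenius norm ||W^{1/2} X||_F = Tr(W X X^T)^{1/2}, W = W^T > 0."""
    return np.sqrt(np.trace(W @ X @ X.T))
```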

2.2. IRLS-M algorithm.
In this subsection, we extend the IRLS-M algorithm of [17,30,31] to low rank matrix recovery in the presence of observation noise. We aim to solve the minimization problem (6). Note that for ε > 0, the function F_ε^p(X) = Tr((XX^T + ε^2 I)^{p/2}) for 0 < p ≤ 1 is a convex (p = 1) or nonconvex (0 < p < 1) smoothed approximation to the rank function of X. In addition, for ε > 0, F_ε^p(X) is differentiable for all p > 0. We denote the feasible set by C := {X ∈ R^{n1×n2} : ‖A(X) − b‖_2 ≤ η}. Motivated by the algorithms in [1,17,30,31], we state the IRLS-M algorithm for solving problem (6) (see Algorithm 1 below).
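Since F_ε^p(X) depends on X only through the eigenvalues of XX^T, it can be evaluated from the singular values of X. The short sketch below (our own helper) makes the smoothing explicit: each singular value σ_i contributes (σ_i^2 + ε^2)^{p/2} ≥ σ_i^p, so F_ε^p(X) decreases to ‖X‖_p^p as ε → 0.

```python
import numpy as np

def F_smooth(X, eps, p):
    """F_eps^p(X) = Tr((X X^T + eps^2 I)^{p/2}) = sum_i (sigma_i(X)^2 + eps^2)^{p/2}.

    Assumes n1 <= n2 (as in the paper), so X X^T has eigenvalues sigma_i(X)^2.
    """
    s = np.linalg.svd(X, compute_uv=False)  # the n1 singular values of X
    return np.sum((s ** 2 + eps ** 2) ** (p / 2.0))
```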
Next we give some remarks on our IRLS-M algorithm (Algorithm 1).

Remark 1.
Each iteration of the IRLS-M algorithm corresponds to a weighted least squares problem constrained to the closed convex set C (defined by a quadratic constraint), and can be efficiently solved using standard convex optimization methods [4,19,20]. The literature [38] also gives an efficient method to solve a general class of least-squares constrained problems. In addition, the estimated rank r is one of the most important parameters. The literature [26,31,17] gives several methods of rank estimation, such as taking r = 1.5r*, where r* is the true rank of the unknown matrix M.

Algorithm 1 Iteratively reweighted least squares minimization for noisy low rank matrix recovery
Input: vector b, linear operator A, error bound η and estimated rank r;
Output: matrix X ∈ R^{n1×n2};
1: Initialize W^0 = I ∈ R^{n1×n1}, set ε_0 = 1 and 0 < γ < 1, choose p ∈ (0, 1];
2: For k = 1, 2, . . .
3:   X^k = argmin_{X∈C} ‖(W^{k−1})^{1/2}X‖_F^2;
4:   ε_k = min(ε_{k−1}, γσ_{r+1}(X^k));
5:   W^k = (X^k(X^k)^T + ε_k^2 I)^{p/2−1};
6: End for
The algorithm stops if ε_k = 0; in this case, define X^j := X^k for j > k. In general, the algorithm generates an infinite sequence {X^k} of matrices.
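As a reference point, here is a compact numpy sketch of Algorithm 1 for the noiseless special case η = 0 (so the feasible set C becomes the affine set A(X) = b and step 3 has a closed-form solution via Lagrange multipliers); for η > 0, step 3 is a convex quadratically constrained problem and would be handed to a generic solver [4,19,20]. The measurement operator is represented by an explicit m × (n1 n2) matrix acting on the column-major vectorization of X; all names and defaults are ours.

```python
import numpy as np

def mat_power(S, a):
    """S^a for a symmetric positive definite matrix S, via eigendecomposition."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs * vals ** a) @ vecs.T

def irls_m_noiseless(Amat, b, n1, n2, r, p=1.0, gamma=0.9, n_iter=100):
    """Sketch of Algorithm 1 with eta = 0: A(X) = b, Amat of shape (m, n1*n2)."""
    eps, W = 1.0, np.eye(n1)
    X = np.zeros((n1, n2))
    for _ in range(n_iter):
        # step 3: min Tr(W X X^T) s.t. Amat vec(X) = b; with D = I_{n2} (x) W^{-1},
        # the KKT conditions give vec(X) = D Amat^T (Amat D Amat^T)^{-1} b
        D = np.kron(np.eye(n2), np.linalg.inv(W))
        lam = np.linalg.solve(Amat @ D @ Amat.T, b)
        X = (D @ (Amat.T @ lam)).reshape((n1, n2), order="F")
        # step 4: eps_k = min(eps_{k-1}, gamma * sigma_{r+1}(X^k))
        s = np.linalg.svd(X, compute_uv=False)
        eps = min(eps, gamma * s[r])
        if eps == 0.0:  # X has rank <= r: stop (see the remark above)
            break
        # step 5: W^k = (X^k (X^k)^T + eps_k^2 I)^{p/2 - 1}
        W = mat_power(X @ X.T + eps ** 2 * np.eye(n1), p / 2.0 - 1.0)
    return X
```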
Remark 2. Note that the smoothing parameter ε > 0 is used to prevent the weight matrix W^k from becoming ill-conditioned. In addition, in this sense the limit of X^k (x^k) is often a good approximation of the true matrix (vector) up to O(ε) in magnitude [15,14,26]. Empirical analysis has shown that starting at a large value of ε and gradually reducing it in each step dramatically improves the ability of the IRLS [15,14,26] and IRLS-M [26,17,30,31] algorithms to recover sparse signals and low rank matrices. There are various ways to choose ε_k, adaptively or in a static fashion. Ba et al. [1] analyzed the IRLS algorithm with fixed ε_k = ε > 0 for simplicity of presentation. Mohan and Fazel [31] used the update rule 0 < ε_{k+1} ≤ ε_k. Daubechies et al. [15], Fornasier et al. [17], Mohan and Fazel [30] and Lai et al. [26] used the same adaptive update rule for ε_k as ours, which makes F_ε^p(X) provide a better and better approximation to the Schatten-p norm ‖X‖_p^p. Actually, the theoretical analysis for the fixed ε_k and 0 < ε_{k+1} ≤ ε_k cases can be seen as a part of our analysis of the IRLS-M algorithm.
To derive the convergence of the IRLS-M algorithm, we need a useful auxiliary function J_p(X, W, ε). Assume X ∈ R^{n1×n2}, 0 ≺ W = W^T ∈ R^{n1×n1} and, for 0 < p ≤ 1, consider the function

J_p(X, W, ε) = (p/2) Tr(W(XX^T + ε^2 I)) + ((2 − p)/2) Tr(W^{p/(p−2)}).

We can express the iterations of IRLS-M as

(7)  X^{k+1} = argmin_{X∈C} J_p(X, W^k, ε_k),
(8)  W^{k+1} = argmin_{W≻0} J_p(X^{k+1}, W, ε_{k+1}),

where (7) is obvious and (8) comes from the discussion in Section 3.1 of [31]; in particular, the minimizing weight in (8) is W^{k+1} = (X^{k+1}(X^{k+1})^T + ε_{k+1}^2 I)^{p/2−1}, which attains the value J_p(X^{k+1}, W^{k+1}, ε_{k+1}) = Tr((X^{k+1}(X^{k+1})^T + ε_{k+1}^2 I)^{p/2}). Moreover, the function J_p(X, W, ε) has the following monotonicity property [31]: for all k ∈ N,

(9)  J_p(X^{k+1}, W^{k+1}, ε_{k+1}) ≤ J_p(X^{k+1}, W^k, ε_k) ≤ J_p(X^k, W^k, ε_k).

Note that by Algorithm 1, for each k ≥ 0, X^{k+1} is completely determined by W^k. In particular, for k = 0, X^1 is determined solely by W^0. Therefore, inequality (9) for k = 0 holds for arbitrary X^0 ∈ C.
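The alternating-minimization view can be checked numerically. The snippet below (dimensions, seed and perturbation are our own choices) verifies that the weight produced by step 5 of Algorithm 1 indeed minimizes J_p(X, ·, ε) as in (8) and attains the smoothed objective value Tr((XX^T + ε^2 I)^{p/2}), while any other positive definite W gives a larger value.

```python
import numpy as np

rng = np.random.default_rng(0)
p, eps, n1, n2 = 0.7, 0.1, 4, 6
X = rng.standard_normal((n1, n2))
S = X @ X.T + eps ** 2 * np.eye(n1)   # X X^T + eps^2 I

def mat_power(M, a):
    vals, vecs = np.linalg.eigh(M)
    return (vecs * vals ** a) @ vecs.T

def J(W):
    # the auxiliary function J_p(X, W, eps) written out above
    return (p / 2) * np.trace(W @ S) + ((2 - p) / 2) * np.trace(mat_power(W, p / (p - 2)))

W_star = mat_power(S, p / 2 - 1)       # the weight update of Algorithm 1
F_val = np.trace(mat_power(S, p / 2))  # F_eps^p(X) = Tr((X X^T + eps^2 I)^{p/2})
assert np.isclose(J(W_star), F_val)    # the minimum over W recovers F_eps^p(X)
assert J(W_star + 0.05 * np.eye(n1)) > J(W_star)  # any perturbed W does worse
```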
Next, we give the definition of a stationary point over a convex set.
Definition 2.1. Let S ⊆ R^{n1×n2} be an arbitrary convex subset and F(X) : R^{n1×n2} → R a continuously differentiable function over S. If ⟨∇F(X*), X − X*⟩ ≥ 0 for arbitrary X ∈ S, we say that X* is a stationary point of F(X) over S.
Lemma 2.2. If X* is a local minimizer of F(X) over a convex subset S ⊂ R^{n1×n2}, then ⟨∇F(X*), X − X*⟩ ≥ 0 for arbitrary X ∈ S. Moreover, if F(X) is convex over S, then the condition ⟨∇F(X*), X − X*⟩ ≥ 0 for arbitrary X ∈ S is sufficient for X* to minimize F(X) over S.
Proof. Let 0 ≤ t ≤ 1. Then for arbitrary X ∈ S, X* + t(X − X*) ∈ S. Consider the differentiable function φ(t) = F(X* + t(X − X*)). Since X* is a local minimizer of F(X) over S, we have φ(t) ≥ φ(0) for all sufficiently small t > 0, and hence ⟨∇F(X*), X − X*⟩ = φ′(0) ≥ 0. Moreover, if F(X) is convex over S, then by convex analysis [3], F(X) ≥ F(X*) + ⟨∇F(X*), X − X*⟩ ≥ F(X*) for all X ∈ S. Then X* is a minimizer of F(X) over S.
The following lemma is useful for analyzing the convergence of the IRLS-M algorithm for noisy low rank matrix recovery.

Lemma 2.3. Assume that W ∈ R^{n1×n1} and W = W^T ≻ 0. Let S denote an arbitrary non-empty, closed and convex subset of R^{n1×n2}. If X̂ = argmin_{X∈S} ‖W^{1/2}X‖_F^2, then ⟨WX̂, X − X̂⟩ ≥ 0 for all X ∈ S.
Proof. We give a simple proof of Lemma 2.3. For arbitrary X ∈ S and all t ∈ [0, 1], X̂ + t(X − X̂) ∈ S. By the optimality of X̂, we have ‖W^{1/2}X̂‖_F^2 ≤ ‖W^{1/2}(X̂ + t(X − X̂))‖_F^2, which is equivalent to

(11)  Tr(WX̂X̂^T) ≤ Tr(W(X̂ + t(X − X̂))(X̂ + t(X − X̂))^T).

Expanding the right-hand side, (11) is equivalent to

0 ≤ 2t Tr(WX̂(X − X̂)^T) + t^2 Tr(W(X − X̂)(X − X̂)^T).

Dividing by t > 0 and letting t → 0, this implies Tr(WX̂(X − X̂)^T) = ⟨WX̂, X − X̂⟩ ≥ 0.
The next lemma gives some simple properties of the iterates {X^k}_{k∈N} of the IRLS-M algorithm.

Lemma 2.4. Let {X^k}_{k∈N} be the sequence produced by the IRLS-M algorithm (Algorithm 1). Then {X^k}_{k∈N} is bounded and Σ_{k≥1} ‖X^{k+1} − X^k‖_F^2 < ∞; in particular, lim_{k→∞} ‖X^{k+1} − X^k‖_F = 0.

Proof. The proof of Lemma 2.4 is similar to the proof of Theorem 2 in [31]; the main difference is the use of Lemma 2.3 instead of Lemma 13 in [31]. For completeness, we give the details. A direct computation shows that, for each k ≥ 1,

‖X^k‖_p^p = Σ_{i=1}^{n1} σ_i^p(X^k) ≤ Tr((X^k(X^k)^T + ε_k^2 I)^{p/2}) = J_p(X^k, W^k, ε_k) ≤ J_p(X^1, W^1, ε_1),

where the last inequality follows from (9). Therefore, the sequence {X^k}_{k∈N} is bounded. In addition, for each k ≥ 1, using (9) we have

J_p(X^k, W^k, ε_k) − J_p(X^{k+1}, W^{k+1}, ε_{k+1}) ≥ J_p(X^k, W^k, ε_k) − J_p(X^{k+1}, W^k, ε_k)
= (p/2)[‖(W^k)^{1/2}X^k‖_F^2 − ‖(W^k)^{1/2}X^{k+1}‖_F^2]
= (p/2)[‖(W^k)^{1/2}(X^k − X^{k+1})‖_F^2 + 2⟨W^k X^{k+1}, X^k − X^{k+1}⟩]
≥ (p/2)‖(W^k)^{1/2}(X^k − X^{k+1})‖_F^2,

where the second inequality follows from X^{k+1} = argmin_{X∈C} ‖(W^k)^{1/2}X‖_F^2 and Lemma 2.3. On the other hand, for any i = 1, 2, . . . , n1, the eigenvalues of (W^k)^{−1} satisfy λ_i((W^k)^{−1}) = (σ_i^2(X^k) + ε_k^2)^{1−p/2} ≤ (σ_1^2(X^k) + 1)^{1−p/2}. Hence, by the boundedness of {X^k}_{k∈N}, there exists a positive constant D_0 such that ‖(W^k)^{−1}‖ ≤ D_0 for all k, and therefore

‖X^{k+1} − X^k‖_F^2 ≤ ‖(W^k)^{−1}‖ ‖(W^k)^{1/2}(X^{k+1} − X^k)‖_F^2 ≤ (2D_0/p)[J_p(X^k, W^k, ε_k) − J_p(X^{k+1}, W^{k+1}, ε_{k+1})].

Summing the above inequalities over all k ≥ 1 and telescoping, we obtain

Σ_{k≥1} ‖X^{k+1} − X^k‖_F^2 ≤ (2D_0/p) J_p(X^1, W^1, ε_1) < ∞,

so that lim_{k→∞} ‖X^{k+1} − X^k‖_F = 0.

3. Convergence and stability of IRLS-M for noisy low rank matrix recovery. In this section, the convergence and stability of the IRLS-M algorithm for all 0 < p ≤ 1 in the presence of observation noise are studied. In particular, when p = 1, we give a weaker sufficient condition, based on the matrix restricted isometry property, for the IRLS-M algorithm to recover an unknown (approximately) low rank matrix from noisy measurements than those of [17,26]. Our analysis is inspired by [1,15,17,30,31,26].
3.1. Convergence of the IRLS-M algorithm. It is clear from Algorithm 1 that {ε_k}_{k∈N} is a nonincreasing sequence bounded below by zero, so {ε_k}_{k∈N} must converge to a nonnegative constant ε̄. In addition, because of the boundedness of {X^k}_{k∈N} (see Lemma 2.4), it has at least one convergent subsequence. Below we show that when ε̄ > 0, the limit of any convergent subsequence is a stationary point of F_ε̄^p(X). When ε̄ = 0, we show that there exists a convergent subsequence which converges to a low rank matrix. Theorem 3.1 gives the main convergence result.
Theorem 3.1. Suppose that the sequence of matrices {X^k}_{k∈N} is produced by the IRLS-M algorithm (Algorithm 1) and ε_k → ε̄ as k → ∞. Then {X^k}_{k∈N} has at least one convergent subsequence. If ε̄ > 0, the limit of any convergent subsequence of {X^k}_{k∈N} is a stationary point of F_ε̄^p(X) = Tr((XX^T + ε̄^2 I)^{p/2}) over C := {X ∈ R^{n1×n2} : ‖A(X) − b‖_2 ≤ η}. If ε̄ = 0, then there exists a convergent subsequence of {X^k}_{k∈N} which converges to a rank-r matrix X̄.
Proof. By Lemma 2.4, {X^k}_{k∈N} is a bounded sequence. Hence there exists a subsequence {X^{k_j}}_{j∈N} of {X^k}_{k∈N} converging to a matrix X̄. Note that ‖A(X̄) − b‖_2 = lim_{j→∞} ‖A(X^{k_j}) − b‖_2 ≤ η, so X̄ ∈ C. If lim_{k→∞} ε_k = ε̄ > 0, then by the definition W^{k_j} = (X^{k_j}(X^{k_j})^T + ε_{k_j}^2 I)^{p/2−1}, we have W^{k_j} → W̄ = (X̄X̄^T + ε̄^2 I)^{p/2−1} as j → ∞. By invoking Lemma 2.4, we also have X^{k_j+1} → X̄ as j → ∞. Hence, for arbitrary X ∈ C, by the optimality of X^{k_j+1} and Lemma 2.3, we have ⟨W^{k_j}X^{k_j+1}, X − X^{k_j+1}⟩ ≥ 0, and letting j → ∞ gives ⟨W̄X̄, X − X̄⟩ ≥ 0. For the matrix X̄ ∈ C and fixed ε̄ > 0, we can use Proposition 7.4 in [17] to get ∇F_ε̄^p(X̄) = pW̄X̄. Hence ⟨∇F_ε̄^p(X̄), X − X̄⟩ = p⟨W̄X̄, X − X̄⟩ ≥ 0, and by Definition 2.1, X̄ is a stationary point of F_ε̄^p(X) over the constraint set C. If ε̄ = 0, then by the update rule in Algorithm 1 there is a subsequence along which ε_{k_j} = γσ_{r+1}(X^{k_j}) → 0; any convergent subsequence of it has a limit X̄ ∈ C with σ_{r+1}(X̄) = 0, i.e., rank(X̄) ≤ r.
Theorem 3.2. Suppose that the measurement map A satisfies the M-RIP of order 2r with δ_2r < 1 and that lim_{k→∞} ε_k = 0. Then for arbitrary X ∈ C := {X ∈ R^{n1×n2} : ‖A(X) − b‖_2 ≤ η}, the rank-r limit matrix X̄ of the convergent subsequence of {X^k}_{k∈N} given by Theorem 3.1 satisfies

‖X̄ − X‖_F ≤ C_1 η + C_2 ‖X − X_[r]‖_F,

where C_1, C_2 are positive constants and X_[r] is the best rank-r approximation of X.

When lim_{k→∞} ε_k = ε̄ > 0, the next theorem states that the IRLS-M algorithm for 0 < p ≤ 1 is stable if the limit point X̄ of the convergent subsequence of {X^k}_{k∈N} coincides with a minimizer of F_ε̄^p(X) = Tr((XX^T + ε̄^2 I)^{p/2}) over C.

Theorem 3.3. Suppose that the measurement map A satisfies the M-RIP and that the limit point X̄ of the convergent subsequence of {X^k}_{k∈N} generated by the IRLS-M algorithm coincides with a minimizer of F_ε̄^p(X) = Tr((XX^T + ε̄^2 I)^{p/2}) over C. Then for any integer t < r + 1 − C_0 (C_0 is a positive constant explicitly given in (16)) and arbitrary matrix X ∈ C := {X ∈ R^{n1×n2} : ‖A(X) − b‖_2 ≤ η}, we have

‖X̄ − X‖_p^p ≤ C_1 η^p + C_2 ‖X − X_[t]‖_p^p,

where C_1, C_2 are positive constants and X_[t] is the best rank-t approximation of X.

When p = 1, F_ε̄^1(X) = Tr((XX^T + ε̄^2 I)^{1/2}) is strictly convex with respect to X. Hence, the limit point X̄ of the convergent subsequence of {X^k}_{k∈N} generated by the IRLS-M algorithm is the unique global minimizer of F_ε̄^1(X) over the convex set C := {X ∈ R^{n1×n2} : ‖A(X) − b‖_2 ≤ η}. The next theorem gives this special case p = 1.

Theorem 3.4. Suppose that p = 1 and that the measurement map A satisfies the M-RIP with δ_2r < √2 − 1. Then for any integer t ≤ r and arbitrary matrix X ∈ C := {X ∈ R^{n1×n2} : ‖A(X) − b‖_2 ≤ η}, the limit point X̄ of the convergent subsequence of {X^k}_{k∈N} generated by the IRLS-M algorithm satisfies

‖X̄ − X‖_* ≤ C_1 η + C_2 ‖X − X_[t]‖_*,

where C_1, C_2 are positive constants and X_[t] is the best rank-t approximation of X.
Remark 5. Note that for p = 1, [31] used the matrix NSP to give the matrix recovery guarantee, while we use the M-RIP to obtain the recovery guarantee. Our sufficient condition δ_2r < √2 − 1 is slightly weaker than δ_3r/(1 − δ_2r) < 1 in [26] and √2 δ_4r/(1 − δ_3r) < 1 in [17]. We have not tried to optimize the M-RIP constant; we expect that, with a more complicated proof as in [5], one can still improve this condition. In addition, the literature [26,17] has successfully used the Woodbury matrix identity to give the error analysis of the IRLS-M algorithm for the matrix completion problem. We think it is also possible to extend our analysis of the IRLS-M algorithm to the noisy matrix completion problem by using the Woodbury matrix identity. We leave this to the interested reader.
Proof. In the case lim_{k→∞} ε_k = 0, by Theorem 3.1, there exists a subsequence {X^{k_j}}_{j∈N} of {X^k}_{k∈N} converging to a rank-r matrix X̄ ∈ C. For any matrix X ∈ C, let X_[r] be the best rank-r approximation of X. Note that rank(X̄ − X_[r]) ≤ 2r. Since A satisfies the M-RIP with δ_2r < 1, we can use the triangle inequality to obtain

‖X̄ − X‖_F ≤ ‖X̄ − X_[r]‖_F + ‖X_[r] − X‖_F
≤ (1 − δ_2r)^{−1/2} ‖A(X̄ − X_[r])‖_2 + ‖X − X_[r]‖_F
≤ (1 − δ_2r)^{−1/2} (‖A(X̄) − b‖_2 + ‖A(X) − b‖_2 + ‖A(X − X_[r])‖_2) + ‖X − X_[r]‖_F
≤ (1 − δ_2r)^{−1/2} (2η + ‖A‖ ‖X − X_[r]‖_F) + ‖X − X_[r]‖_F,

where ‖A‖ is the operator norm of A. This gives the claimed bound with C_1 = 2(1 − δ_2r)^{−1/2} and C_2 = 1 + ‖A‖(1 − δ_2r)^{−1/2}.

Proof of Theorem 3.3.
For arbitrary X ∈ C, let H = X̄ − X and let X_[t] be the best rank-t approximation of X. The next lemma (Lemma 3.5) gives the standard block decomposition of a matrix H as H = H_0 + H_1 + H_2 + · · ·, where each H_j has rank at most t and is built from the jth consecutive group of t singular values of H arranged in decreasing order; the details of the proof can be found in [6,17,35]. The following lemma is also useful for our proof (see [42]).

Lemma 3.6. Let 0 < p ≤ 1. Then for arbitrary matrices Y_1, Y_2 ∈ R^{n1×n2},

‖Y_1 + Y_2‖_p^p ≤ ‖Y_1‖_p^p + ‖Y_2‖_p^p,

where we have used Lemma 3.4 in [28] to get the first inequality. For estimating the second term in (19), by the feasibility of X and X̄, we get ‖A(H)‖_2 ≤ ‖A(X̄) − b‖_2 + ‖A(X) − b‖_2 ≤ 2η. Substituting (20) and (22)