Guarantees of Riemannian Optimization for Low Rank Matrix Completion

We study Riemannian optimization methods on the embedded manifold of low rank matrices for the problem of matrix completion: recovering a low rank matrix from a subset of its entries. Assume $m$ entries of an $n\times n$ rank $r$ matrix are sampled independently and uniformly with replacement. We first prove that with high probability the Riemannian gradient descent and conjugate gradient descent algorithms initialized by one step hard thresholding are guaranteed to converge linearly to the measured matrix provided \begin{align*} m\geq C_\kappa n^{1.5}r\log^{1.5}(n), \end{align*} where $C_\kappa$ is a numerical constant depending on the condition number of the underlying matrix. The sampling complexity is further improved to \begin{align*} m\geq C_\kappa nr^2\log^{2}(n) \end{align*} via a resampled Riemannian gradient descent initialization. The analysis of the new initialization procedure relies on an asymmetric restricted isometry property of the sampling operator and on the curvature of the low rank matrix manifold. Numerical simulations show that the algorithms are able to recover a low rank matrix from nearly the minimum number of measurements.


Introduction
The problem of matrix completion attempts to recover a low rank matrix from a subset of sampled entries. This problem arises in a wide variety of practical contexts, such as model reduction [33], pattern recognition [18], and machine learning [4,5]. Since the data matrix is assumed to be low rank, it is natural to seek the lowest rank matrix consistent with the observed entries by solving the rank minimization problem
$$\min_{Z\in\mathbb{R}^{n\times n}} \operatorname{rank}(Z) \quad\text{subject to}\quad \mathcal{P}_\Omega(Z) = \mathcal{P}_\Omega(X). \tag{1}$$

The proofs of the main results are presented in Sec. 4, and Sec. 5 concludes this manuscript with potential future directions. Finally, the Appendix provides proofs for the supporting technical lemmas.

Throughout this manuscript, we denote matrices by uppercase letters and vectors by lowercase letters. In particular, $X$ denotes the $n\times n$ rank $r$ matrix to be reconstructed, with $\sigma_{\min}(X)$ and $\sigma_{\max}(X)$ being its smallest nonzero and largest singular values respectively. The condition number of $X$ is denoted by $\kappa$, defined as $\kappa = \sigma_{\max}(X)/\sigma_{\min}(X)$. We always assume that $X$ obeys the conditions set out in A0 and A1. For any matrix $Z$, the spectral norm of $Z$ is denoted by $\|Z\|$, the Frobenius norm by $\|Z\|_F$, and the maximum magnitude of its entries by $\|Z\|_\infty$. We use $Z^{(i)}$ to represent the $i$-th row of $Z$. The Euclidean norm of a vector $x$ is denoted by $\|x\|$. Operators that map matrices to matrices are denoted by calligraphic letters; in particular, $\mathcal{I}$ denotes the identity operator, and the spectral norm of a linear operator $\mathcal{A}$ is denoted by $\|\mathcal{A}\|$.
Given a collection of indices $\Omega$ with $|\Omega| = m$ (counting multiplicity) for the observed entries, we define $p = m/n^2$, the probability of each entry being observed, and let $\mathcal{P}_\Omega$ denote the sampling operator that maps a matrix to its component-wise product with $\sum_{(i,j)\in\Omega} e_i e_j^*$. In the main results, $\varepsilon_0$ is a positive numerical constant which controls the size of the contraction neighborhood of an algorithm; its value depends only on the algorithm. Finally, we use $C$ to denote an absolute numerical constant whose value may change from context to context.
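To make the sampling model concrete, here is a minimal numerical sketch (ours, not from the manuscript; sizes and names are illustrative) of drawing $\Omega$ with replacement and applying $\mathcal{P}_\Omega$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 2000
p = m / n**2  # probability that any fixed entry is observed

# Sample m indices independently and uniformly with replacement from
# {0,...,n-1} x {0,...,n-1}; duplicate indices are allowed and counted.
rows = rng.integers(0, n, size=m)
cols = rng.integers(0, n, size=m)

def P_Omega(Z, rows, cols, n):
    """Component-wise product of Z with sum_{(i,j) in Omega} e_i e_j^*,
    where duplicates in Omega contribute their multiplicity."""
    mult = np.zeros((n, n))
    np.add.at(mult, (rows, cols), 1.0)  # accumulate multiplicities
    return mult * Z

Z = rng.standard_normal((n, n))
PZ = P_Omega(Z, rows, cols, n)
assert np.count_nonzero(PZ) <= m  # supported on at most m entries
```

Because the sampling is with replacement, the multiplicity matrix can have entries larger than one, which is exactly why $\|\mathcal{P}_\Omega\|$ needs the tail bound discussed below.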

Algorithms and Main Results
We consider the sampling with replacement model for $\Omega$, in which each index is sampled independently from the uniform distribution on $\{1,\cdots,n\}\times\{1,\cdots,n\}$. This model can be viewed as a proxy for the uniform sampling model: the failure probability under the uniform sampling model is less than or equal to the failure probability under the sampling with replacement model [38]. Though there may exist duplicates in $\Omega$, the maximum multiplicity of any index can be upper bounded with high probability.
It follows immediately from Lem. 2.1 that with high probability $\|\mathcal{P}_\Omega\| \leq \frac{8}{3}\beta\log(n)$. Many computationally efficient algorithms have been designed to target the matrix completion problem directly by considering the following non-convex formulation; see [7,25,30,43,23,50,44,47,36,34,35,26,8] and references therein for an incomplete review. Normalized iterative hard thresholding [43] (NIHT, also known as SVP [25] or IHT [21] when the stepsize is fixed) is a projected gradient descent algorithm which updates the estimate along the gradient descent direction of the objective function in (4), followed by the projection onto the set of rank $r$ matrices via the hard thresholding operator.

Algorithm 1 Riemannian Gradient Descent (RGrad)
Initialization: $X_0$
for $l = 0, 1, \ldots$ do
1. $G_l = \mathcal{P}_\Omega(X - X_l)$
2. $\alpha_l = \dfrac{\left\|\mathcal{P}_{T_l}(G_l)\right\|_F^2}{\left\langle \mathcal{P}_{T_l}(G_l),\, \mathcal{P}_\Omega\mathcal{P}_{T_l}(G_l)\right\rangle}$
3. $W_l = X_l + \alpha_l \mathcal{P}_{T_l}(G_l)$
4. $X_{l+1} = \mathcal{H}_r(W_l)$
end for

Here the hard thresholding operator $\mathcal{H}_r$ first computes the SVD of a matrix and then sets all but the $r$ largest singular values to zero:
$$\mathcal{H}_r(Z) := P\Lambda_r Q^*, \quad\text{where } Z = P\Lambda Q^* \text{ is the SVD of } Z \text{ and } \Lambda_r(i,i) := \begin{cases}\Lambda(i,i) & i \leq r\\ 0 & i > r.\end{cases}$$
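The hard thresholding operator is simply a truncated SVD; a small illustrative sketch in NumPy:

```python
import numpy as np

def hard_threshold(Z, r):
    """H_r(Z): keep the r largest singular values of Z and zero the rest,
    i.e. the best rank-r approximation of Z in Frobenius norm."""
    P, s, Qh = np.linalg.svd(Z, full_matrices=False)
    return (P[:, :r] * s[:r]) @ Qh[:r, :]

rng = np.random.default_rng(1)
n, r = 50, 3
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))  # exact rank r
assert np.allclose(hard_threshold(X, r), X)  # fixed point on rank-r input
W = rng.standard_normal((n, n))
assert np.linalg.matrix_rank(hard_threshold(W, r)) <= r
```

This direct implementation costs a full SVD; the point of the Riemannian algorithms below is to avoid exactly this cost by exploiting the tangent space structure.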
As a gradient descent algorithm, NIHT suffers from the slow asymptotic convergence rate typical of first order methods. To overcome this, a conjugate gradient iterative hard thresholding (CGIHT) algorithm has been developed in [7]. In each iteration of CGIHT, the new search direction is computed as a weighted sum of the gradient descent direction and the past search direction, and the weights are selected so that the new search direction is conjugate orthogonal to the past search direction when projected onto the column space of the current estimate. In both NIHT and CGIHT, we need to compute the largest $r$ singular values and their associated singular vectors of an $n\times n$ matrix in each iteration. Although Krylov subspace methods can be applied to compute this truncated SVD, it is still computationally expensive, especially when $n$ is large and $r$ is moderate. The Riemannian optimization algorithms based on the embedded manifold of rank $r$ matrices can reduce the computational cost of the truncated SVD dramatically by exploiting the local structure of low rank matrices.
We first present the Riemannian gradient descent method for matrix completion in Alg. 1. Let $X_l = U_l\Sigma_l V_l^*$ be the current estimate and $T_l$ be the tangent space of the rank $r$ matrix manifold at $X_l$; that is,
$$T_l = \left\{ U_l Z_1^* + Z_2 V_l^* \;:\; Z_1, Z_2 \in \mathbb{R}^{n\times r}\right\}.$$
The new estimate $X_{l+1}$ is obtained by updating $X_l$ along $\mathcal{P}_{T_l}(G_l)$, the gradient descent direction projected onto the tangent space $T_l$, using the steepest descent stepsize $\alpha_l$, followed by hard thresholding the result back onto the set of rank $r$ matrices.
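For reference, the projection onto $T_l$ admits the closed form $\mathcal{P}_{T_l}(Z) = U_lU_l^*Z + ZV_lV_l^* - U_lU_l^*ZV_lV_l^*$, standard for the embedded fixed-rank manifold; the sketch below (ours, illustrative sizes) checks two of its basic properties:

```python
import numpy as np

def proj_tangent(Z, U, V):
    """Project Z onto T = {U A^* + B V^* : A, B in R^{n x r}} via
    P_T(Z) = U U^* Z + Z V V^* - U U^* Z V V^*."""
    UUZ = U @ (U.T @ Z)
    ZVV = (Z @ V) @ V.T
    return UUZ + ZVV - (U @ (U.T @ Z @ V)) @ V.T

rng = np.random.default_rng(2)
n, r = 40, 4
A = rng.standard_normal((n, n))
U, _ = np.linalg.qr(rng.standard_normal((n, r)))
V, _ = np.linalg.qr(rng.standard_normal((n, r)))
PZ = proj_tangent(A, U, V)
# P_T is a projection: applying it twice changes nothing
assert np.allclose(proj_tangent(PZ, U, V), PZ)
# elements of the tangent space have rank at most 2r
assert np.linalg.matrix_rank(PZ) <= 2 * r
```

The rank bound $2r$ for tangent matrices is what makes the efficient SVD computation described after Alg. 2 possible.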
In the Riemannian conjugate gradient descent method (Alg.2), the search direction is a linear combination of the projected gradient descent direction and the past search direction projected onto the tangent space of the current estimate.The orthogonalization weight β l in Alg. 2 is selected in a way such that P T l (P l ) is conjugate orthogonal to P T l (P l−1 ), which follows from CGIHT in [7].It is worth noting that there are other selections for β l following from the non-linear conjugate gradient descent method in convex optimization [47].In particular, the Riemannian conjugate gradient descent algorithm (LRGeomCG) developed in [47] for matrix completion uses the Polak-Ribière+ selection for β l .In this manuscript, we interpret the Riemannian optimization methods directly as iterative hard thresholding algorithms with subspace projections.For the differential geometry ideas behind the algorithms, we refer the readers to [1].

Algorithm 2 Riemannian Conjugate Gradient Descent (RCG)
Initialization: $X_0$, $P_{-1} = 0$
for $l = 0, 1, \ldots$ do
1. $G_l = \mathcal{P}_\Omega(X - X_l)$
2. $\beta_l = -\dfrac{\left\langle \mathcal{P}_{T_l}(G_l),\, \mathcal{P}_\Omega\mathcal{P}_{T_l}(P_{l-1})\right\rangle}{\left\langle \mathcal{P}_{T_l}(P_{l-1}),\, \mathcal{P}_\Omega\mathcal{P}_{T_l}(P_{l-1})\right\rangle}$
3. $P_l = \mathcal{P}_{T_l}(G_l) + \beta_l \mathcal{P}_{T_l}(P_{l-1})$
4. $\alpha_l = \dfrac{\left\langle \mathcal{P}_{T_l}(G_l),\, \mathcal{P}_{T_l}(P_l)\right\rangle}{\left\langle \mathcal{P}_{T_l}(P_l),\, \mathcal{P}_\Omega\mathcal{P}_{T_l}(P_l)\right\rangle}$
5. $W_l = X_l + \alpha_l P_l$
6. $X_{l+1} = \mathcal{H}_r(W_l)$
end for

Though we still need to compute the truncated SVD of $W_l$ to find its best rank $r$ approximation, this truncated SVD can be computed efficiently using two QR factorizations of $n\times r$ matrices and one full SVD of a $2r\times 2r$ matrix, since matrices in $T_l$ have rank at most $2r$. To see this, notice that $W_l$ in Algs. 1 and 2 can be rewritten as $W_l = U_l Y_1^* + Y_2 V_l^*$ for some $Y_1, Y_2\in\mathbb{R}^{n\times r}$. Let
$$(I - V_lV_l^*)Y_1 = Q_1R_1 \quad\text{and}\quad (I - U_lU_l^*)Y_2 = Q_2R_2$$
be the QR factorizations of $(I - V_lV_l^*)Y_1$ and $(I - U_lU_l^*)Y_2$ respectively. Then $U_l^*Q_2 = 0$ and $V_l^*Q_1 = 0$, and we have
$$W_l = \begin{bmatrix} U_l & Q_2 \end{bmatrix} \begin{bmatrix} Y_1^*V_l + U_l^*Y_2 & R_1^* \\ R_2 & 0 \end{bmatrix} \begin{bmatrix} V_l & Q_1 \end{bmatrix}^* =: \begin{bmatrix} U_l & Q_2 \end{bmatrix} M_l \begin{bmatrix} V_l & Q_1 \end{bmatrix}^*.$$
Since $[U_l \; Q_2]$ and $[V_l \; Q_1]$ both have orthonormal columns, the SVD of $W_l$ can be obtained from the SVD of $M_l$, which is a $2r\times 2r$ matrix.
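A numerical sketch of this construction (ours; variable names are illustrative): build a generic $W_l \in T_l$, form the two QR factorizations and the $2r\times 2r$ middle matrix $M_l$, and check that the SVD of $M_l$ reproduces that of $W_l$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 200, 5

U, _ = np.linalg.qr(rng.standard_normal((n, r)))
V, _ = np.linalg.qr(rng.standard_normal((n, r)))
Y1 = rng.standard_normal((n, r))
Y2 = rng.standard_normal((n, r))
W = U @ Y1.T + Y2 @ V.T  # a generic element of the tangent space

# QR factorizations of the components orthogonal to V and U
Q1, R1 = np.linalg.qr(Y1 - V @ (V.T @ Y1))
Q2, R2 = np.linalg.qr(Y2 - U @ (U.T @ Y2))

# 2r x 2r middle matrix: W = [U Q2] M [V Q1]^*
M = np.block([[Y1.T @ V + U.T @ Y2, R1.T],
              [R2, np.zeros((r, r))]])
Phat, s, Qhat_t = np.linalg.svd(M)

# Singular values of W are exactly those of M (plus zeros)
s_full = np.linalg.svd(W, compute_uv=False)
assert np.allclose(s, s_full[:2 * r])
assert np.allclose(s_full[2 * r:], 0, atol=1e-8)

# Singular vectors: rotate [U Q2] and [V Q1] by the SVD factors of M
left = np.hstack([U, Q2]) @ Phat
right = np.hstack([V, Q1]) @ Qhat_t.T
assert np.allclose((left * s) @ right.T, W)
```

Only the $2r\times 2r$ SVD and two thin QR factorizations touch dense linear algebra, which is the source of the dramatic per-iteration savings over NIHT and CGIHT.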

Local Convergence
We first identify a small neighborhood around the measured low rank matrix such that for any given initial guess in this neighborhood, Algs. 1 and 2 will converge linearly to the true solution.
For the Riemannian gradient descent algorithm (Alg.1), we have the following theorem.
Theorem 2.2 (Local Convergence of Riemannian Gradient Descent). Let $X\in\mathbb{R}^{n\times n}$ be the measured rank $r$ matrix and $T$ be the tangent space of the rank $r$ matrix manifold at $X$. Suppose that conditions (7)-(9) hold, where $\beta > 1$ and $\varepsilon_0$ is a positive numerical constant such that (10) is satisfied. Then the iterates $X_l$ generated by Alg. 1 satisfy

To state a similar result for the Riemannian conjugate gradient descent algorithm, we introduce a restarted variant of Alg. 2; that is, $\beta_l$ is set to $0$ and restarting occurs whenever either of the conditions in (11) is violated. The restarting conditions are introduced not only for the sake of the proof, but also to improve the robustness of non-linear conjugate gradient descent methods [37]. The first restarting condition guarantees that the residual is substantially orthogonal to the past search direction when projected onto the tangent space of the current estimate, so that the new search direction is sufficiently gradient related. In the classical CG algorithm for linear systems, the residual is exactly orthogonal to all the past search directions. Roughly speaking, the second restarting condition implies that the projection of the current residual cannot be too large when compared to the projection of the past residual, since the search direction is gradient related by the first restarting condition.
Theorem 2.3 (Local Convergence of Restarted Riemannian Conjugate Gradient Descent). Let $X\in\mathbb{R}^{n\times n}$ be the measured rank $r$ matrix and $T$ be the tangent space of the rank $r$ matrix manifold at $X$. Suppose that conditions (12)-(14) hold, where $\beta > 1$ and $\varepsilon_0$ is a positive numerical constant. Assume further that $\varepsilon_0$ obeys (15). Then we have
$$\nu_{cg} = \frac{1}{2}\left(\tau_1 + \sqrt{\tau_1^2 + 4\tau_2}\right) < 1,$$
and the iterates $X_l$ generated by Alg. 2 subject to the restarting conditions in (11) satisfy

When $\kappa_1 = \kappa_2 = 0$, (15) reduces to (10). On the other hand, we have

So if $\kappa_1\kappa_2 < 1/3$, $\tau_1 + \tau_2$ can be less than one when $\varepsilon_0$ is small; in particular, one may take $\kappa_1 = 0.1$ and $\kappa_2 = 1$, the values used in our numerical experiments. The proofs of Thms. 2.2 and 2.3 are presented in Secs. 4.2 and 4.3 respectively.

Initialization and Recovery Guarantees
In Thms. 2.2 and 2.3, we list three conditions to guarantee the convergence of the algorithms.
The requirement for $\|\mathcal{P}_\Omega\|$ to be bounded in (7) and (12) is just an artifact of the sampling model, and it can be satisfied with probability at least $1 - n^{2-2\beta}$ following from Lem. 2.1. The second condition, (8) and (13), is a local restricted isometry property which has been established in [13] for the Bernoulli model and in [22,38] for the sampling with replacement model, and it also plays a key role in nuclear norm minimization for matrix completion. For the sampling with replacement model, Thm. A.2 implies that as long as $m$ is on the order of $\mu nr\log(n)$,
$$\left\|\mathcal{P}_T - p^{-1}\mathcal{P}_T\mathcal{P}_\Omega\mathcal{P}_T\right\| \leq \varepsilon_0$$
with probability at least $1 - 2n^{2-2\beta}$. Thus the only issue that remains to be addressed is how to produce an initial guess that is sufficiently close to the measured matrix. We will consider two initialization strategies.
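The local restricted isometry property is easy to probe numerically. The sketch below (ours, with illustrative sizes) estimates the deviation $\|\mathcal{P}_T - p^{-1}\mathcal{P}_T\mathcal{P}_\Omega\mathcal{P}_T\|$ through its action on random unit-norm elements of $T$, which lower bounds the operator norm:

```python
import numpy as np

rng = np.random.default_rng(5)
n, r, p = 200, 2, 0.5
m = int(p * n**2)

U, _ = np.linalg.qr(rng.standard_normal((n, r)))
V, _ = np.linalg.qr(rng.standard_normal((n, r)))

def P_T(Z):
    # projection onto the tangent space at U Sigma V^*
    return U @ (U.T @ Z) + (Z @ V) @ V.T - U @ (U.T @ Z @ V) @ V.T

# multiplicity matrix of a with-replacement sample Omega
mult = np.zeros((n, n))
np.add.at(mult, (rng.integers(0, n, m), rng.integers(0, n, m)), 1.0)

dev = 0.0
for _ in range(20):
    Z = P_T(rng.standard_normal((n, n)))
    Z /= np.linalg.norm(Z)
    dev = max(dev, np.linalg.norm(P_T(Z) - P_T(mult * Z) / p))
assert dev < 1.0  # far from losing injectivity at this sampling rate
```

At this sampling rate the observed deviation is well below one, consistent with the $\sqrt{\mu nr\log(n)/m}$ scaling the theory predicts.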

Initialization via One Step Hard Thresholding
A widely used initialization for matrix completion is to set $X_0 = \mathcal{H}_r\left(p^{-1}\mathcal{P}_\Omega(X)\right)$. The approximation error $\|X_0 - X\|_F$ can be estimated as follows.
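As a quick sanity check of this initialization (our own illustrative experiment, not from the manuscript), the relative error of $X_0$ shrinks as the number of samples grows:

```python
import numpy as np

rng = np.random.default_rng(4)
n, r = 200, 2
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))

def init_error(m):
    """Relative error of X0 = H_r(p^{-1} P_Omega(X)) for |Omega| = m."""
    mult = np.zeros((n, n))
    np.add.at(mult, (rng.integers(0, n, m), rng.integers(0, n, m)), 1.0)
    P, s, Qh = np.linalg.svd(mult * X * n**2 / m, full_matrices=False)
    X0 = (P[:, :r] * s[:r]) @ Qh[:r, :]
    return np.linalg.norm(X0 - X) / np.linalg.norm(X)

errs = [init_error(m) for m in (2000, 40000)]
assert errs[1] < errs[0]  # twenty times more samples, smaller error
```

Even at full sampling the with-replacement multiplicities keep $X_0$ from being exact, which is why the iterative refinement of Algs. 1 and 2 is still needed.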
Lemma 2.4. Suppose $\Omega$ with $|\Omega| = m$ is a set of indices sampled independently and uniformly with replacement, and let $X_0 = \mathcal{H}_r\left(p^{-1}\mathcal{P}_\Omega(X)\right)$. Then for all $\beta > 1$,

The proof of Lem. 2.4 is presented in Sec. 4.4; it follows from a direct application of Thm. A.3 for the sampling with replacement model. A similar inequality has been established in [29] for $X_0 = \mathcal{H}_r\left(p^{-1}\widetilde{\mathcal{P}}_\Omega(X)\right)$ under the Bernoulli model using random graph theory, where $\widetilde{\mathcal{P}}_\Omega(X)$ is a trimmed matrix obtained by setting all the rows and columns of $\mathcal{P}_\Omega(X)$ with too many observed entries to zero. Moreover, an application of the standard Chernoff bound can further show that with high probability there are no rows or columns of $\mathcal{P}_\Omega(X)$ with too many observed entries [27].

Algorithm 3 Initialization via Resampled Riemannian Gradient Descent and Trimming
Partition $\Omega$ into $L+1$ equal groups $\Omega_0, \cdots, \Omega_L$, with the size of each group denoted by $\widehat{m}$.
It follows from Lem. 2.4 that the third condition, (9) and (14), in Thms. 2.2 and 2.3 can be satisfied with high probability provided $m$ is sufficiently large. Therefore we can establish the following theorem.
Theorem 2.5 (Recovery Guarantee I). Let $X\in\mathbb{R}^{n\times n}$ be the measured rank $r$ matrix. Suppose $\Omega$ with $|\Omega| = m$ is a set of indices sampled independently and uniformly with replacement, and let $X_0 = \mathcal{H}_r\left(p^{-1}\mathcal{P}_\Omega(X)\right)$. Then for all $\beta > 1$, the iterates of the Riemannian gradient descent algorithm (Alg. 1) and the restarted Riemannian conjugate gradient descent algorithm (Alg. 2, restarting when either of the inequalities in (11) is violated) are guaranteed to converge linearly to $X$ with high probability, provided $m \geq C_\kappa n^{1.5}r\log^{1.5}(n)$.

Initialization via Resampled Riemannian Gradient Descent and Trimming
The sampling complexity in Thm.2.5 depends on n 1.5 , rather than linearly on n.To attenuate this dependence, we consider a more delicate initialization scheme via the resampled Riemannian gradient descent followed by trimming in each iteration, see Alg. 3. The trimming procedure (Alg.4) projects the estimate onto the set of µ 0 -incoherent matrices, while the resampling scheme breaks the dependence between the past iterate and the new sampling set.So we can establish the required isometry properties which are needed to prove the linear convergence of the iterates until they reach an order of p 1/2 neighborhood around the measured matrix where Thms.2.2 and 2.3 are activated.Since the main computational cost in each iteration of Alg. 3 is O(|Ω l |r) flops [47], the total computational cost of Alg. 3 is only slightly larger than one iteration of Alg. 1 applied on the full sampling set Ω.The output of Alg. 3 satisfies the following property.
Lemma 2.6. Let $X\in\mathbb{R}^{n\times n}$ be the measured rank $r$ matrix. Then for all $\beta > 1$, with probability at least

The proof of Lem. 2.6 is presented in Sec. 4.5. In Alg. 3 we use the fixed stepsize $n^2/\widehat{m}$ for ease of exposition; it can be replaced by an adaptive stepsize similar to that of Alg. 1. Lemma 2.6 implies that if we take $L$ sufficiently large, the third condition, (9) and (14), in Thms. 2.2 and 2.3 can be satisfied with probability at least $1 - 2n^{1-\beta}$.

Theorem 2.7 (Recovery Guarantee II). Let $X\in\mathbb{R}^{n\times n}$ be the measured rank $r$ matrix. Suppose $\Omega$ with $|\Omega| = m$ is a set of indices sampled independently and uniformly with replacement, and let $X_0$ be the output of Alg. 3. Then for all $\beta > 1$, the iterates of the Riemannian gradient descent algorithm (Alg. 1) and the restarted Riemannian conjugate gradient descent algorithm (Alg. 2, restarting when either of the inequalities in (11) is violated) with $\kappa_1\kappa_2 < 1/3$ are guaranteed to converge to $X$ with probability at least $1 - 2n^{1-\beta} - 24\log(n)\,n^{2-2\beta}$

Related Work
Since the pioneering work of Recht, Fazel, and Parrilo [39,19] and Candès and Recht [13], the low rank matrix reconstruction problem has received intensive investigation from both the theoretical and algorithmic aspects. Recht, Fazel, and Parrilo [39] studied nuclear norm minimization for the low rank matrix recovery problem, where each measurement is obtained by taking the inner product between the underlying low rank matrix and a dense measurement matrix. They showed that if the restricted isometry constant of the sensing operator is smaller than a numerical constant, then nuclear norm minimization is guaranteed to recover a low rank matrix exactly. Moreover, this condition can be satisfied for certain families of random measurement matrices provided the number of measurements is $O(nr)$ [12]. Many non-convex optimization algorithms have also been studied for low rank matrix recovery, based either on the restricted isometry constant of the sensing operator or directly on the random measurement models; see [30,31,43,25,45,17,53,40,6,52,51] and references therein. In particular, the Riemannian gradient descent and conjugate gradient descent algorithms for low rank matrix recovery have been studied in [49]. Candès and Recht [13] first studied matrix completion by nuclear norm minimization under the Bernoulli model and incoherence conditions, showing that $O(n^{1.2}r\log(n))$ measurements are sufficient for successful recovery with high probability. This result was subsequently sharpened to $O(nr\log^{\alpha}(n))$ in [22,38,14,15]. The nuclear norm minimization problem can be solved by either semidefinite programming [46] or iterative soft thresholding algorithms [9]. A gradient descent algorithm on the Grassmannian manifold of low dimensional subspaces was studied in [29]. It was shown that $O(\kappa^6 nr^2)$ measurements allow for convergence of the algorithm, which was established by penalizing the incoherence of the estimate in the loss function. The proof in [29] relies on stationary point analysis, and a linear convergence rate of the algorithm was not established. This framework was further extended in [42] to gradient descent algorithms based on the product factorization of low rank matrices. Alternating minimization was studied in [28,27], with the best known sampling complexity being $O(\kappa^8 nr\log(n/\epsilon))$, where the desired accuracy $\epsilon$ appears because the algorithm requires a fresh set of measurements in each iteration. In contrast, Lem. 2.6 only requires resampling in the initialization stage, and our algorithms converge linearly to the desired low rank matrix (to arbitrary precision) for a fixed set of measurements.

Numerical Experiments
In this section, we present empirical observations of the Riemannian gradient descent and conjugate gradient descent algorithms on random tests. We test three algorithms, namely the Riemannian gradient descent algorithm (Alg. 1), the Riemannian conjugate gradient descent algorithm (Alg. 2), and the Riemannian conjugate gradient descent algorithm restarted whenever one of the conditions in (11) is violated; these are abbreviated RGrad, RCG, and RCG restarted respectively. For RCG restarted, we take $\kappa_1 = 0.1$ and $\kappa_2 = 1$ in (11). All tested algorithms are initialized by one step hard thresholding instead of Alg. 3, as the former already leads to very good performance (see Figs. 1, 2 and 3) and preliminary numerical results did not find much difference between the two initialization strategies on random simulations. The numerical experiments are conducted on a Mac Pro laptop with a 2.5GHz quad-core Intel Core i7 CPU and 16 GB memory, and executed in Matlab 2014b.

Empirical Phase Transition
An important question in matrix completion is how many measurements are needed for an algorithm to reliably reconstruct a low rank matrix. We investigate the recovery abilities of the tested algorithms in the framework of phase transition, which compares the number of measurements $m$, the size of an $n\times n$ matrix, $n^2$, and the minimum number of measurements needed to recover an $n\times n$ rank $r$ matrix, $(2n-r)r$, through the undersampling ratio $p = m/n^2$ and the oversampling ratio $q = (2n-r)r/m$. The unit square $(p,q)\in[0,1]^2$ defines a phase transition space. Given a triple $(n,m,r)$ corresponding to a fixed pair $(p,q)$, we conduct tests on ten random instances. The test rank $r$ matrix is formed as the product of two random rank $r$ matrices; that is, $X = LR$, where $L\in\mathbb{R}^{n\times r}$ and $R\in\mathbb{R}^{r\times n}$ with the entries of $L$ and $R$ sampled from the standard Gaussian distribution. The measurement vector $\mathcal{P}_\Omega(X)$ is obtained by sampling $m$ entries of $X$ uniformly at random. An algorithm is considered to have successfully recovered $X$ if it returns a matrix $X_l$ which satisfies

The tests are conducted with $n = 800$ and $p$ taking 18 equispaced values from 0.1 to 0.95. The probabilities of successful recovery for the three tested algorithms RGrad, RCG and RCG restarted are displayed in Fig. 1. In the figure, white indicates that the algorithm recovered all ten random test matrices, while black indicates that the algorithm failed on every one of the randomly drawn matrices. For each tested algorithm, a clear phase transition occurs when $q$ is greater than 0.9 for all values of $p$. This implies that all three tested algorithms are able to recover a rank $r$ matrix from $m = C\cdot(2n-r)r$ measurements with $C$ only slightly larger than 1. In addition, Fig. 1c also shows the effectiveness of our restarting conditions for the Riemannian conjugate gradient descent algorithm.
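The manuscript does not spell out how $r$ is chosen for a target pair $(p, q)$; one natural choice (an assumption on our part, shown only for illustration) inverts $(2n-r)r = qm$ for $r$:

```python
import math

n = 800
p, q = 0.5, 0.8
m = round(p * n * n)
# Solve (2n - r) r = q m for r, taking the smaller root so that r <= n.
r = round(n - math.sqrt(n * n - q * m))
assert 0 < r < n
# The realized oversampling ratio is close to the target q
assert abs((2 * n - r) * r / m - q) < 0.01
```

Rounding $r$ to an integer perturbs the realized $q$ slightly, which is why the check above uses a tolerance rather than exact equality.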

Computational Efficiency
We compare the computational efficiency of RGrad, RCG, and RCG restarted by conducting random tests on $8000\times 8000$ matrices of rank 100. The tests are conducted with two different oversampling ratios $1/q\in\{2,3\}$. The algorithms are terminated when the relative residual is less than $10^{-9}$. The relative residual plotted against the number of iterations and the average recovery time are presented in Fig. 2. First, it can be observed that the convergence curves for RCG and RCG restarted are almost indistinguishable, differing only in one or two iterations; this again shows the effectiveness of our restarting conditions. A closer look at the computational results reveals that restarting usually occurs in the first few iterations of RCG restarted. Moreover, RCG and RCG restarted are substantially faster than RGrad, both in the number of iterations and in the average computation time. Each algorithm needs fewer iterations to converge below the desired accuracy as the number of measurements increases.

Robustness to Additive Noise
Following the test set-up in the last subsection, we further explore the performance of the algorithms under measurements with additive noise. Tests with additive noise have the sampled entries $\mathcal{P}_\Omega(X)$ corrupted by the vector

where the entries of $w$ are i.i.d. standard Gaussian random variables and $\sigma$ is referred to as the noise level. We conduct tests with nine different values of $\sigma$ from $10^{-4}$ to 1; for each $\sigma$, ten random tests are conducted. The average relative reconstruction error in dB plotted against the signal-to-noise ratio (SNR) is presented in Fig. 3 for RGrad, RCG and RCG restarted respectively. The plots clearly show the desirable linear scaling between the noise levels and the relative errors for all three tested algorithms. The relative error decreases as the number of measurements increases.
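The display defining the noise vector did not survive typesetting; a common construction consistent with the surrounding description (our assumption, not necessarily the manuscript's exact choice) scales i.i.d. Gaussian noise so that $\|e\|/\|\mathcal{P}_\Omega(X)\| = \sigma$:

```python
import numpy as np

rng = np.random.default_rng(7)
m = 5000
Px = rng.standard_normal(m)   # stands in for the sampled entries P_Omega(X)
w = rng.standard_normal(m)    # i.i.d. standard Gaussian noise
sigma = 1e-2                  # noise level

# Assumed scaling: e = sigma * ||P_Omega(X)|| / ||w|| * w, so that the
# relative perturbation of the measurements is exactly sigma.
e = sigma * np.linalg.norm(Px) / np.linalg.norm(w) * w
assert np.isclose(np.linalg.norm(e) / np.linalg.norm(Px), sigma)
```

Under this normalization the SNR in dB is simply $-20\log_{10}(\sigma)$, which matches the linear error-versus-SNR scaling reported in Fig. 3.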

Proofs
In this section, we prove the main results in Sec. 2. We first list several technical lemmas which are of independent interest.The proofs of these lemmas are presented in Appendix B.

Technical Lemmas
Lemma 4.1 (Bounds for Projections). Let $X_l = U_l\Sigma_lV_l^*$ be a rank $r$ matrix, and $T_l$ be the tangent space of the rank $r$ matrix manifold at $X_l$. Let $X = U\Sigma V^*$ be another rank $r$ matrix, and $T$ be the corresponding tangent space. Then

Lemma 4.2 (Restricted Isometry Property in a Local Neighborhood). Assume for some $0 < \varepsilon_0 < 1$ and $\beta > 1$. Then

Lemma 4.3 (Asymmetric Restricted Isometry Property). Let $X_l = U_l\Sigma_lV_l^*$ and $X = U\Sigma V^*$ be two fixed rank $r$ matrices. Assume

Lemma 4.4. Let $\{c_l\}_{l\geq 0}$ be a non-negative sequence satisfying $c_1 \leq \nu c_0$ and

Then if $\tau_1 + \tau_2 < 1$, we have $\nu < 1$ and

Lemma 4.5 (Chordal and Projection Distances). Let $U_l, U\in\mathbb{R}^{n\times r}$ be two orthogonal matrices.
Then there exists an $r\times r$ unitary matrix $Q$ such that

Proof of Theorem 2.2
We start with a lemma which bounds the search stepsize α l in Alg. 1.
Lemma 4.6. Assume $\left\|\mathcal{P}_{T_l} - p^{-1}\mathcal{P}_{T_l}\mathcal{P}_\Omega\mathcal{P}_{T_l}\right\| \leq 4\varepsilon_0$. Then the stepsize $\alpha_l$ in Alg. 1 can be bounded as

Proof. First, the assumption $\left\|\mathcal{P}_{T_l} - p^{-1}\mathcal{P}_{T_l}\mathcal{P}_\Omega\mathcal{P}_{T_l}\right\| \leq 4\varepsilon_0$ implies
$$\left\|\mathcal{P}_{T_l}\mathcal{P}_\Omega\mathcal{P}_{T_l}\right\| \leq p\left\|\mathcal{P}_{T_l} - p^{-1}\mathcal{P}_{T_l}\mathcal{P}_\Omega\mathcal{P}_{T_l}\right\| + p\left\|\mathcal{P}_{T_l}\right\| \leq (1 + 4\varepsilon_0)p.$$
On one hand, which implies the lower bound of α l .On the other hand, from which we can obtain the upper bound of α l by multiplying p on both sides of the above inequality followed by the rearrangement.
The next lemma bounds the local isometry of P Ω when the adaptive stepsize α l is used.
Lemma 4.7.Assume P T l − p −1 P T l P Ω P T l ≤ 4ε 0 and α l can be bounded as in Lem.4.6.Then the spectral norm of P T l − α l P T l P Ω P T l can be bounded as Proof.The lemma follows from direct calculations where the second inequality follows from the assumption and Lem.4.6.
Proof of Theorem 2.2.We first assume that in the l-th iteration X l satisfies So together with the first two assumptions ( 7) and ( 8) of Thm.2.2, the application of Lem.4.2 gives and which means the assumptions of Lems.4.6 and 4.7 are satisfied.
Recall that W l = X l + α l P T l (G l ) in Alg. 1.The proof begins with the following inequality where the second inequality follows from the fact that X l+1 is the best rank r approximation of W l .Plugging in W l = X l + α l P T l (G l ) gives First we have which follows from Lems.4.7 and 4.1 respectively.The third term I 3 can be bounded as follows.
where the second inequality follows from (22), and Lems.4.6 and 4.1, and the third inequality follows from (21).So inserting the bounds for I 1 , I 2 and I 3 into (24) gives where by the assumption of Thm.2.2.It only remains to verify (21).By the assumption of Thm.2.2, ( 21) is valid for l = 0. Since X l − X F is a contractive sequence following from ( 25), ( 21) is valid for all l ≥ 0 by induction.

Proof of Theorem 2.3
We first estimate α l and β l in Alg. 2 with the restarting conditions in (11).Lemma 4.8.Assume P T l − p −1 P T l P Ω P T l ≤ 4ε 0 .When restarting occurs, β l = 0 and α l can be bounded as in Lem.4.6; otherwise where .
Note that the bounds above are also valid even when restarting occurs since κ 1 ≥ 0 and κ 2 ≥ 0.
Proof. The orthogonalization weight $\beta_l$ can be bounded as follows:
$$|\beta_l| = \frac{\left|\left\langle \mathcal{P}_{T_l}(G_l), \mathcal{P}_\Omega\mathcal{P}_{T_l}(P_{l-1})\right\rangle\right|}{\left\langle \mathcal{P}_{T_l}(P_{l-1}), \mathcal{P}_\Omega\mathcal{P}_{T_l}(P_{l-1})\right\rangle} \leq \frac{\left|\left\langle \mathcal{P}_{T_l}(G_l), \left(\mathcal{P}_{T_l}\mathcal{P}_\Omega\mathcal{P}_{T_l} - p\mathcal{P}_{T_l}\right)(P_{l-1})\right\rangle\right|}{\left\langle \mathcal{P}_{T_l}(P_{l-1}), \mathcal{P}_\Omega\mathcal{P}_{T_l}(P_{l-1})\right\rangle} + \frac{p\left|\left\langle \mathcal{P}_{T_l}(G_l), \mathcal{P}_{T_l}(P_{l-1})\right\rangle\right|}{\left\langle \mathcal{P}_{T_l}(P_{l-1}), \mathcal{P}_\Omega\mathcal{P}_{T_l}(P_{l-1})\right\rangle},$$
where we used $\left\langle \mathcal{P}_{T_l}(P_{l-1}), \mathcal{P}_\Omega\mathcal{P}_{T_l}(P_{l-1})\right\rangle \geq (1 - 4\varepsilon_0)p\left\|\mathcal{P}_{T_l}(P_{l-1})\right\|_F^2$, deduced from (20), and the last two inequalities follow from the restarting conditions.
To bound α l , we first need to bound P T l (G l ) F in terms of P T l (P l ) F .Note that ) , P T l P Ω P T l (P l−1 ) P T l (P l−1 ) , P T l P Ω P T l (P l−1 ) P T l (G l ), P T l (P l−1 ) So The application of the Cauchy-Schwarz inequality gives Since α l can be rewritten as α l = P T l (G l ) , P T l (P l ) P T l (P l ) , P Ω P T l (P l ) = P T l (G l ) , p −1 P T l P Ω P T l (P l ) P T l (P l ) , P Ω P T l (P l ) + P T l (G l ) , (P T l − p −1 P T l P Ω P T l ) (P l ) P T l (P l ) , P Ω P T l (P l ) = P T l (G l ) , p −1 P Ω P T l (P l ) P T l (P l ) , P Ω P T l (P l ) + P T l (G l ) , (P T l − p −1 P T l P Ω P T l ) (P l ) P T l (P l ) , it can be bounded as follows which completes the proof.
The following lemma bounds the local isometry of P Ω when the adaptive stepsize α l is used.
Lemma 4.9.Assume P T l − p −1 P T l P Ω P T l ≤ 4ε 0 , and α l is bounded as in Lem.4.8.Then the spectral norm of P T l − α l P T l P Ω P T l can be bounded as Proof.Direct calculations give which completes the proof.
Proof of Theorem 2.3.We first assume that for all j ≤ l So together with the first two assumptions ( 7) and ( 8) of Thm.2.3, the application of Lem.4.2 gives and which means the assumptions of Lems.4.8 and 4.9 are satisfied for all j ≤ l. to (24), we have Following the argument for the proof of Thm.2.2, the first three terms can be similarly bounded as follows To bound I 7 , first note that β l P T l (P l−1 ) can be expressed in terms of all the previous gradients where the second inequality follows from the bound for β i in Lem.4.8 and the fact the fourth inequality follows from Lem. 4.1, and the last inequality follows from the bound for α l in Lem.4.8, ( 26), (27), and (28).Inserting the bounds for I 4 , I 5 , I 6 and I 7 into (29) gives Analogous to (24), the approximation error at the (l + 1)th iteration can be decomposed as follows Since Z l (and consequently T l ) is independent of Ω l+1 , Thm.A.2 implies with probability at least 1 − 2n 2−2n .So For I 9 , Lem. 4.1 implies where the last inequality follows from (33).
To bound I 10 , note again Z l (and consequently T l ) is independent of Ω l+1 with the incoherence parameter 100 81 µ 0 .So Lem.4.3 implies with probability at least 1 − 2n 2−2β .Since we have where the second equality follows from Putting the bounds for I 8 , I 9 and I 10 together gives with C being sufficiently large.Clearly ( 33) is also valid for (l + 1)th iteration.Since Z 0 = H r m n 2 P Ω 0 (X) , ( 33) is valid for l = 0 with probability 1 − 2n 1−β provided m ≥ Cβµ 2 1 κ 6 nr 2 log(n) (36) following from Lem. 2.4.Therefore taking a maximum of the right hand sides of ( 35) and (36) gives

Conclusion and Future Direction
This manuscript establishes the sampling complexity for a class of Riemannian gradient descent and conjugate gradient descent algorithms by interpreting them as iterative hard thresholding algorithms with subspace projections. To the best of our knowledge, this is the first work that provides recovery guarantees for Riemannian optimization algorithms on the embedded low rank matrix manifold for matrix completion. Our first sampling complexity, for the algorithms with one step hard thresholding initialization, is $O(n^{1.5}r\log^{1.5}(n))$. We would like to know whether this result can be improved to the nearly optimal $O(nr\,\mathrm{polylog}(n))$ [14], since the simulation results in Sec. 3.1 strongly suggest so. The extra $r$ term in the second recovery guarantee comes from bounding the matrix spectral norm by the Frobenius norm. To further optimize the second sampling complexity, we may need to establish convergence of the algorithms in terms of the matrix spectral norm rather than the Frobenius norm.
It would be of interest to investigate whether we can extend our results to the case where the unknown matrix is only approximately low rank. Since in this case $\mathcal{P}_\Omega(X)$ can be decomposed as $\mathcal{P}_\Omega(X) = \mathcal{P}_\Omega(X_r) + \mathcal{P}_\Omega(X - X_r)$, where $X_r$ is the best rank $r$ approximation of $X$, it is equivalent to studying the robustness of the algorithms under additive noise, which is interesting in its own right. The empirical observations in Sec. 3.3 suggest that both the Riemannian gradient descent and conjugate gradient descent algorithms are very robust under additive white Gaussian noise. A robustness analysis of the algorithms is left for future work. It would also be desirable to consider more general and practical sampling models in matrix completion for the Riemannian optimization algorithms (Algs. 1 and 2).
The Riemannian gradient descent and conjugate gradient descent algorithms presented in this manuscript apply equally to other low rank reconstruction problems, such as phase retrieval [20,10,11,48,16,41] and blind deconvolution [3,32]. This line of research will be pursued independently in the future. In phase retrieval and blind deconvolution, the underlying matrix after lifting is rank one. Since the condition number of a rank one matrix is always equal to one, it is worth investigating whether we can obtain recovery guarantees of the form $O(n\,\mathrm{polylog}(n))$ with a universal constant.
Two tail probability bounds for $\mathcal{P}_\Omega$ can be obtained using the Noncommutative Bernstein inequality under the sampling with replacement model.

Theorem A.2 ([38, 22, 13]). Suppose $\Omega$ with $|\Omega| = m$ is a set of indices sampled independently and uniformly with replacement. Let $X = U\Sigma V^*\in\mathbb{R}^{n\times n}$ be a rank $r$ matrix with the corresponding tangent space $T$. Assume $\left\|\mathcal{P}_U(e_i)\right\| \leq \sqrt{\mu r/n}$ and $\left\|\mathcal{P}_V(e_j)\right\| \leq \sqrt{\mu r/n}$ for $1\leq i,j\leq n$. Then for all $\beta > 1$,

with probability at least $1 - 2n^{2-2\beta}$, provided $m \geq \frac{32}{3}\beta\mu nr\log(n)$.
Theorem A.3 ([38, 22, 13]). Suppose $\Omega$ with $|\Omega| = m$ is a set of indices sampled independently and uniformly with replacement. Let $Z\in\mathbb{R}^{n\times n}$ be a fixed matrix. Then for all $\beta > 1$,

B Proofs of Technical Lemmas
B.1 Proof of Lemma 4.1

The proofs of the first five inequalities in this lemma can be found in [49], so it only remains to prove the last one. For any matrix $Z\in\mathbb{R}^{n\times n}$, we have

Taking the Frobenius norm on both sides gives

B.2 Proof of Lemma 4.2
For any $Z\in\mathbb{R}^{n\times n}$, we have

where the first inequality follows from the first assumption and the second inequality follows from the second assumption. So we have $\left\|\mathcal{P}_\Omega\mathcal{P}_T\right\|^2 \leq \frac{8}{3}\beta\log(n)(1 + \varepsilon_0)p$ and

where the second inequality follows from (16). To prove (18), we use
$$\begin{aligned}
\left\|\mathcal{P}_{T_l} - p^{-1}\mathcal{P}_{T_l}\mathcal{P}_\Omega\mathcal{P}_{T_l}\right\|
&\leq \left\|\mathcal{P}_{T_l} - \mathcal{P}_T\right\| + p^{-1}\left\|\mathcal{P}_{T_l}\mathcal{P}_\Omega\mathcal{P}_{T_l} - \mathcal{P}_{T_l}\mathcal{P}_\Omega\mathcal{P}_T\right\| + p^{-1}\left\|\mathcal{P}_{T_l}\mathcal{P}_\Omega\mathcal{P}_T - \mathcal{P}_T\mathcal{P}_\Omega\mathcal{P}_T\right\| + \left\|\mathcal{P}_T - p^{-1}\mathcal{P}_T\mathcal{P}_\Omega\mathcal{P}_T\right\|\\
&\leq \left\|\mathcal{P}_{T_l} - \mathcal{P}_T\right\| + p^{-1}\left\|\mathcal{P}_\Omega\mathcal{P}_{T_l}\right\|\left\|\mathcal{P}_{T_l} - \mathcal{P}_T\right\| + p^{-1}\left\|\mathcal{P}_\Omega\mathcal{P}_T\right\|\left\|\mathcal{P}_{T_l} - \mathcal{P}_T\right\| + \left\|\mathcal{P}_T - p^{-1}\mathcal{P}_T\mathcal{P}_\Omega\mathcal{P}_T\right\|\\
&\leq \frac{2\|X_l - X\|_F}{\sigma_{\min}(X)} + p^{-1}\left\|\mathcal{P}_\Omega\mathcal{P}_{T_l}\right\|\frac{2\|X_l - X\|_F}{\sigma_{\min}(X)} + p^{-1}\left\|\mathcal{P}_\Omega\mathcal{P}_T\right\|\frac{2\|X_l - X\|_F}{\sigma_{\min}(X)} + \left\|\mathcal{P}_T - p^{-1}\mathcal{P}_T\mathcal{P}_\Omega\mathcal{P}_T\right\|\\
&\leq 4\varepsilon_0,
\end{aligned}$$
where in the second inequality we utilize the fact $\mathcal{P}_\Omega^* = \mathcal{P}_\Omega$ so that $\left\|\mathcal{P}_{T_l}\mathcal{P}_\Omega\right\| = \left\|\mathcal{P}_\Omega\mathcal{P}_{T_l}\right\|$, and the last inequality follows from the assumption and the bounds for $\left\|\mathcal{P}_\Omega\mathcal{P}_{T_l}\right\|$ and $\left\|\mathcal{P}_\Omega\mathcal{P}_T\right\|$.

B.3 Proof of Lemma 4.3

Let $\mathcal{T}_{i_k,j_k}:\mathbb{R}^{n\times n}\to\mathbb{R}^{n\times n}$ be the rank one linear operator defined as $\mathcal{T}_{i_k,j_k} = \mathcal{P}_{T_l}\left(e_{i_k}e_{j_k}^*\right)\otimes\left(\mathcal{P}_U - \mathcal{P}_{U_l}\right)\left(e_{i_k}e_{j_k}^*\right)$. Then we have $\mathcal{T}_{i_k,j_k}^* = \left(\mathcal{P}_U - \mathcal{P}_{U_l}\right)\left(e_{i_k}e_{j_k}^*\right)\otimes\mathcal{P}_{T_l}\left(e_{i_k}e_{j_k}^*\right)$. Moreover,

So

To apply the Bernstein inequality, we also need to bound

and

The first one can be bounded as follows:

where the first inequality follows from (39) and (40), the third inequality follows from (43), and the fourth inequality follows from the fact
$$\mathbb{E}\left[\left(\mathcal{P}_U - \mathcal{P}_{U_l}\right)\left(e_{i_k}e_{j_k}^*\right)\otimes\left(\mathcal{P}_U - \mathcal{P}_{U_l}\right)\left(e_{i_k}e_{j_k}^*\right)\right] = \frac{1}{n^2}\left(\mathcal{P}_U - \mathcal{P}_{U_l}\right)^2.$$
Similarly, we have where the first inequality follows from (39) and (40), the third inequality follows from ( 41) and (42), and the fourth inequality follows from the fact E P T l e i k e * j k ⊗ P T l e i k e * j k = 1 n 2 P T l .

Figure 1: Empirical phase transition curves for (a) RGrad, (b) RCG, and (c) RCG restarted when $n = 800$. Horizontal axis $p = m/n^2$ and vertical axis $q = (2n-r)r/m$. White denotes successful recovery in all ten random tests, and black denotes failure in all tests.

Figure 2: Relative residual (mean and standard deviation over ten random tests) as a function of the number of iterations for $n = 8000$, $r = 100$, $1/q = 2$ (a) and $1/q = 3$ (b). The values after each algorithm are the average computational time (seconds) for convergence.

Figure 3: Average relative reconstruction error (dB) against the SNR for RGrad, RCG and RCG restarted.

Lemma 4.3 is an asymmetric version of Thm. A.2. We will use the Noncommutative Bernstein inequality (Thm. A.1) to prove it. Let $\Omega = \{(i_k, j_k)\}_{k=1}^m$ be the sampled set of indices. For any $Z\in\mathbb{R}^{n\times n}$, since
$$\left(\mathcal{P}_U - \mathcal{P}_{U_l}\right)(Z) = \sum_{i,j=1}^n \left\langle \left(\mathcal{P}_U - \mathcal{P}_{U_l}\right)(Z), e_ie_j^*\right\rangle e_ie_j^* = \sum_{i,j=1}^n \left\langle Z, \left(\mathcal{P}_U - \mathcal{P}_{U_l}\right)\left(e_ie_j^*\right)\right\rangle e_ie_j^*,$$
we have
$$\mathcal{P}_\Omega\left(\mathcal{P}_U - \mathcal{P}_{U_l}\right)(Z) = \sum_{k=1}^m \left\langle Z, \left(\mathcal{P}_U - \mathcal{P}_{U_l}\right)\left(e_{i_k}e_{j_k}^*\right)\right\rangle e_{i_k}e_{j_k}^*$$
and
$$\mathcal{P}_{T_l}\mathcal{P}_\Omega\left(\mathcal{P}_U - \mathcal{P}_{U_l}\right)(Z) = \sum_{k=1}^m \left\langle Z, \left(\mathcal{P}_U - \mathcal{P}_{U_l}\right)\left(e_{i_k}e_{j_k}^*\right)\right\rangle \mathcal{P}_{T_l}\left(e_{i_k}e_{j_k}^*\right).$$
Moreover,
$$\left\|\mathcal{T}_{i_k,j_k}\right\| \leq \left\|\left(\mathcal{P}_U - \mathcal{P}_{U_l}\right)\left(e_{i_k}e_{j_k}^*\right)\right\|_F \left\|\mathcal{P}_{T_l}\left(e_{i_k}e_{j_k}^*\right)\right\|_F \leq \left(\left\|\mathcal{P}_U\left(e_{i_k}e_{j_k}^*\right)\right\|_F + \left\|\mathcal{P}_{U_l}\left(e_{i_k}e_{j_k}^*\right)\right\|_F\right)\left\|\mathcal{P}_{T_l}\left(e_{i_k}e_{j_k}^*\right)\right\|_F,$$
with $\left\|\mathcal{P}_U\left(e_{i_k}e_{j_k}^*\right)\right\|_F = \left\|\mathcal{P}_U\left(e_{i_k}\right)e_{j_k}^*\right\|_F$, $\left\|\mathcal{P}_{U_l}\left(e_{i_k}e_{j_k}^*\right)\right\|_F = \left\|\mathcal{P}_{U_l}\left(e_{i_k}\right)e_{j_k}^*\right\|_F$, and
$$\left\|\mathcal{P}_{T_l}\left(e_{i_k}e_{j_k}^*\right)\right\|_F^2 = \left\langle \mathcal{P}_{T_l}\left(e_{i_k}e_{j_k}^*\right), e_{i_k}e_{j_k}^*\right\rangle = \left\|\mathcal{P}_{U_l}\left(e_{i_k}\right)\right\|^2 + \left\|\mathcal{P}_{V_l}\left(e_{j_k}\right)\right\|^2 - \left\|\mathcal{P}_{U_l}\left(e_{i_k}\right)\right\|^2\left\|\mathcal{P}_{V_l}\left(e_{j_k}\right)\right\|^2.$$
The expectation bounds on $\mathbb{E}\left[\mathcal{T}_{i_k,j_k}\mathcal{T}_{i_k,j_k}^*\right]$ and $\mathbb{E}\left[\mathcal{T}_{i_k,j_k}^*\mathcal{T}_{i_k,j_k}\right]$ then follow from the facts
$$\mathbb{E}\left[\left(\mathcal{P}_U - \mathcal{P}_{U_l}\right)\left(e_{i_k}e_{j_k}^*\right)\otimes\left(\mathcal{P}_U - \mathcal{P}_{U_l}\right)\left(e_{i_k}e_{j_k}^*\right)\right] = \frac{1}{n^2}\left(\mathcal{P}_U - \mathcal{P}_{U_l}\right)^2 \quad\text{and}\quad \mathbb{E}\left[\mathcal{P}_{T_l}\left(e_{i_k}e_{j_k}^*\right)\otimes\mathcal{P}_{T_l}\left(e_{i_k}e_{j_k}^*\right)\right] = \frac{1}{n^2}\mathcal{P}_{T_l}.$$