RECOVERING LOW-RANK MATRICES FROM BINARY MEASUREMENTS

Abstract. This article studies the approximate recovery of low-rank matrices acquired through binary measurements. Two types of recovery algorithms are considered, one based on hard singular value thresholding and the other one based on semidefinite programming. In case no thresholds are introduced before binary quantization, it is first shown that the direction of the low-rank matrices can be well approximated. Then, in case nonadaptive thresholds are incorporated, it is shown that both the direction and the magnitude can be well approximated. Finally, by allowing the thresholds to be chosen adaptively, we exhibit a recovery procedure for which low-rank matrices are fully approximated with an error decaying exponentially with the number of binary measurements. In all cases, the procedures are robust to prequantization error. The underlying arguments are essentially deterministic: they rely only on an unusual restricted isometry property of the measurement process, which is established once and for all for Gaussian measurement processes.


1. Introduction. Low-rank matrices appear throughout science and engineering in topics as diverse as recommender systems, system identification, and quantum-state tomography. The matrices encountered may be gigantic, yet they can often only be acquired via a seemingly incomplete set of measurements. Nonetheless, a growing body of research is showing that it is possible to exploit the low-rank structure when designing efficient algorithms to fully recover these matrices from the few measurements. We refer the reader to the survey [6] for a more in-depth discussion of applications and approaches.
Oftentimes, the few measurements also need to be quantized, i.e., mapped to only a finite set of possibilities. In the extreme case, there are only two possibilities, say +1 and −1. In the sparse-vector rather than low-rank-matrix setting, this scenario has given rise to the theory of one-bit compressive sensing, which is by now well established [1,9,10,7]. The setting of low-rank matrices is worthy of attention, too. Just as one example, many websites and services incorporate a basic upvote/downvote system when recommending relevant content based on already gathered votes.
We aim to provide some theoretical justification for easy-to-implement algorithms to recover low-rank matrices from few binary measurements. We shall consider the problem of recovering low-rank matrices X ∈ R^{n1×n2} from m binary measurements y_1, ..., y_m, each of them given as the sign of a linear functional applied to X, i.e., as

(1) y_i = sgn(⟨A_i, X⟩_F), i = 1, ..., m,

for some A_1, ..., A_m ∈ R^{n1×n2}. In short, we write

(2) y = sgn(A(X)),

where sgn is applied componentwise and A : R^{n1×n2} → R^m is the linear map defined by

(3) A(Z) := (⟨A_1, Z⟩_F, ..., ⟨A_m, Z⟩_F), Z ∈ R^{n1×n2}.

We shall also allow for thresholds τ_1, ..., τ_m ∈ R to be incorporated before binary quantization, leading to

(4) y = sgn(A(X) − τ).

This scenario of low-rank matrix recovery from binary measurements is a natural extension of the scenario of sparse vector recovery from binary measurements (and of one-bit compressive sensing), and the techniques used here are adapted from the ones used there. In fact, general results on the recovery of structured signals from nonlinear observations, such as the ones from [12], are already informative when particularized to our situation. But our contribution to the specific setting of low-rank matrix recovery from binary measurements goes further in several directions:
• the recovery procedure is not a generalized Lasso; instead it is either a hard singular value thresholding algorithm or a semidefinite program in the spirit of [9] (rather than of [10]);
• it can be enhanced to estimate not only the direction of low-rank matrices but also their magnitudes, thanks to the presence of thresholds;
• it can handle adversarial prequantization error;
• perhaps most importantly, probabilistic arguments and deterministic arguments are separated owing to an unusual restricted isometry property for the measurement map A : R^{n1×n2} → R^m (this parallels simplified arguments put forward in [7, Section 8.4] for one-bit compressive sensing, which incidentally enables the removal of a logarithmic factor from the number of binary measurements obtained by directly adapting the techniques of [9]).
In Section 2, we will first establish the main theoretical tool at the basis of all the forthcoming arguments. In Section 3, we will prove that the direction of low-rank matrices X ∈ R^{n1×n2} can be well estimated from measurements of the type (2) using reconstruction procedures based on hard singular value thresholding and on semidefinite programming. In Section 4, we will prove that it is also possible, if we allow thresholds in the binary measurements as in (4), to fully estimate the low-rank matrices X ∈ R^{n1×n2} using related reconstruction procedures. In Section 5, we will further prove that an adaptive choice of the thresholds can reduce the reconstruction error drastically. Finally, we report in Section 6 on some modest numerical experiments.
But before launching into the technicalities, we remark that we can, and from now on we will, assume that n_1 = n_2 =: n. Indeed, if n_1 > n_2, say, then instead of recovering the rectangular matrix X ∈ R^{n1×n2}, we would aim at recovering the square matrix X̃ := [X | 0] ∈ R^{n1×n1}, obtained by appending zero columns, which has the same rank as X and is in one-to-one correspondence with X. Clearly, a suitable measurement map Ã : R^{n1×n1} → R^m induces a suitable measurement map A : R^{n1×n2} → R^m.
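For concreteness, the measurement model (1)-(3) is easy to simulate. The following NumPy sketch (all sizes are illustrative choices, and the normalization of the A_i anticipates the one used in Section 2; none of this code is part of the article) generates the binary measurements of a random rank-r matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, m = 12, 2, 400

# A rank-r signal X in R^{n x n}, normalized to unit Frobenius norm.
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
X /= np.linalg.norm(X, "fro")

# Gaussian measurement matrices A_1, ..., A_m with i.i.d. entries of
# mean zero and standard deviation sqrt(pi/2)/m (normalization of Section 2).
A = rng.standard_normal((m, n, n)) * (np.sqrt(np.pi / 2) / m)

# The linear map A(X) = (<A_1, X>_F, ..., <A_m, X>_F) and its componentwise sign.
AX = np.einsum("ijk,jk->i", A, X)
y = np.sign(AX)

print(y[:10])
```

Note that replacing X by cX for any c > 0 leaves y unchanged, which is why Section 3 can only target the direction of X.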

2. A restricted isometry property. Our deterministic arguments rely on just one technical property of the map A : R^{n×n} → R^m, which is a version of the rank restricted isometry property modified to feature the ℓ1-norm as the inner norm. Precisely, we shall say that A satisfies the modified rank restricted isometry property of order r with constant δ ∈ (0, 1) (in short, A satisfies MRRIP(r, δ)) if

(1 − δ) ‖Z‖_F ≤ ‖A(Z)‖_1 ≤ (1 + δ) ‖Z‖_F for all Z ∈ R^{n×n} with rank(Z) ≤ r.

As a matter of fact, we will sometimes need this property to be valid not only for matrices Z ∈ R^{n×n} of rank at most r, but also for matrices Z ∈ R^{n×n} of effective rank effrank(Z) := ‖Z‖_*² / ‖Z‖_F² at most r. Here, ‖Z‖_* represents the nuclear norm (aka trace norm or Schatten 1-norm) of Z, i.e., the sum of its singular values. Thus, we shall write that A satisfies MRRIP_eff(r, δ) if

(1 − δ) ‖Z‖_F ≤ ‖A(Z)‖_1 ≤ (1 + δ) ‖Z‖_F for all Z ∈ R^{n×n} with effrank(Z) ≤ r.

The main result of this section states that properly normalized Gaussian random linear maps satisfy these modified restricted isometry properties, for genuinely and for effectively low-rank matrices, with high probability.

Remark 1. We have opted to include a low-key proof that does not give optimal powers of δ^{-1}. However, specializing results from [11] or [13], combined with mean-width estimates from [10], would yield the same conclusions under the weaker assumption m ≥ Cδ^{-2} nr.
Proof. For a fixed Z ∈ R^{n×n}, it is easy to notice that

‖A(Z)‖_1 = Σ_{i=1}^m |⟨A_i, Z⟩_F| is distributed as (√(π/2) ‖Z‖_F / m) Σ_{i=1}^m |g_i|,

where the g_i are independent standard normal random variables, and as such have first absolute moment equal to √(2/π), so that E ‖A(Z)‖_1 = ‖Z‖_F. It is also easy to notice that the map (A_1, ..., A_m) ↦ ‖A(Z)‖_1 is a Lipschitz function of the underlying Gaussian variables, so that Gaussian concentration guarantees, for any ε ∈ (0, 1),

(10) P( | ‖A(Z)‖_1 − ‖Z‖_F | ≥ ε ‖Z‖_F ) ≤ 2 exp(−c ε² m)

for some absolute constant c > 0.

First, to deal with genuine low rank, we consider a ρ-net {Z_1, ..., Z_K} for the set

(11) S_r := {Z ∈ R^{n×n} : ‖Z‖_F ≤ 1, rank(Z) ≤ r},

which means that, for any Z ∈ S_r, one can find k ∈ {1, ..., K} such that ‖Z − Z_k‖_F ≤ ρ. According to [4, Lemma 3.1], we can take K ≤ (1 + 6/ρ)^{(2n+1)r}, hence K ≤ exp(18nr/ρ). We place ourselves in the situation where

(12) | ‖A(Z_k)‖_1 − ‖Z_k‖_F | ≤ ε ‖Z_k‖_F for all k ∈ {1, ..., K},

which, by (10) and a union bound, occurs with failure probability at most 2K exp(−c ε² m). Let us introduce the smallest constant δ̄ ≥ 0 such that

‖A(Z)‖_1 ≤ (1 + δ̄) ‖Z‖_F for all Z ∈ R^{n×n} with rank(Z) ≤ r.

Given Z ∈ S_r, we select k ∈ {1, ..., K} with ‖Z − Z_k‖_F ≤ ρ and we observe that

‖A(Z)‖_1 ≤ ‖A(Z_k)‖_1 + ‖A(Z − Z_k)‖_1 ≤ (1 + ε) + ‖A(Z − Z_k)‖_1.

Since Z − Z_k has rank at most 2r, it can be written as Z − Z_k = M + N, where both M, N ∈ R^{n×n} have rank at most r and ‖M‖_F + ‖N‖_F ≤ √2 ‖Z − Z_k‖_F ≤ √2 ρ, and in turn

(17) ‖A(Z − Z_k)‖_1 ≤ ‖A(M)‖_1 + ‖A(N)‖_1 ≤ (1 + δ̄)(‖M‖_F + ‖N‖_F) ≤ (1 + δ̄) √2 ρ.

In view of (12) and (17), we therefore have ‖A(Z)‖_1 ≤ 1 + ε + (1 + δ̄) √2 ρ. Taking the supremum over all Z ∈ S_r yields 1 + δ̄ ≤ 1 + ε + (1 + δ̄) √2 ρ, i.e., δ̄ ≤ (ε + √2 ρ)/(1 − √2 ρ); a matching lower bound ‖A(Z)‖_1 ≥ ‖A(Z_k)‖_1 − ‖A(Z − Z_k)‖_1 ≥ (1 − ε)(1 − ρ) − (1 + δ̄) √2 ρ holds when ‖Z‖_F = 1, so that choosing ε and ρ as appropriately small multiples of δ ensures (1 − δ) ‖Z‖_F ≤ ‖A(Z)‖_1 ≤ (1 + δ) ‖Z‖_F for all Z of rank at most r. This means that A satisfies MRRIP(r, δ), as desired. Overall, the failure probability is at most 2K exp(−c ε² m) ≤ 2 exp(18nr/ρ − c ε² m), which is itself at most 2 exp(−c′ δ² m) as soon as m ≥ C δ^{-3} nr.

Second, to deal with effective low rank, we claim that we can find a ρ-net of cardinality K ≤ exp(72nr/ρ³) for the set

S_r^eff := {Z ∈ R^{n×n} : ‖Z‖_F ≤ 1, effrank(Z) ≤ r},

and the rest of the argument follows the lines of the genuine low-rank case so closely that it is not reproduced. In order to justify our covering number estimate, we start by considering a (ρ/2)-net {Z_1, ..., Z_K} for S_t with t := ⌈ρ^{-2} r⌉ ∈ [ρ^{-2} r, 2ρ^{-2} r] and with K ≤ (1 + 6/(ρ/2))^{(2n+1)t} ≤ exp(72nr/ρ³). Then, given Z ∈ S_r^eff, we denote by Z̃ its best rank-t approximant (with respect to the Frobenius norm), so that, in view of [8, Theorem 2.5], ‖Z − Z̃‖_F ≤ ‖Z‖_*/(2√t) ≤ ρ/2. Next, we select k ∈ {1, ..., K} such that ‖Z̃ − Z_k‖_F ≤ ρ/2. We then observe, since Z is of effective rank at most r, that ‖Z − Z_k‖_F ≤ ‖Z − Z̃‖_F + ‖Z̃ − Z_k‖_F ≤ ρ/2 + ρ/2 = ρ. This shows that {Z_1, ..., Z_K} is a ρ-net for S_r^eff, hence it concludes the proof.
We emphasize that, just like the standard rank restricted isometry property, this modified rank restricted isometry property holds with a number of measurements only proportional to nr, the number of 'degrees of freedom' of n × n matrices of rank r. An extra logarithmic factor, which is necessary in sparse vector recovery, does not appear in this situation.
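As a quick numerical sanity check of the normalization behind this property (not part of the article's argument, and with illustrative sizes), one can verify that, for Gaussian maps with i.i.d. entries of standard deviation √(π/2)/m, the quantity ‖A(Z)‖_1 concentrates around ‖Z‖_F for a fixed low-rank matrix Z:

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, m = 10, 2, 2000

# Fixed rank-r matrix Z, normalized so that ||Z||_F = 1.
Z = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
Z /= np.linalg.norm(Z, "fro")

# Gaussian map with entries of standard deviation sqrt(pi/2)/m,
# chosen so that E ||A(Z)||_1 = ||Z||_F = 1.
A = rng.standard_normal((m, n, n)) * (np.sqrt(np.pi / 2) / m)
l1_norm = np.sum(np.abs(np.einsum("ijk,jk->i", A, Z)))

print(l1_norm)  # should be close to 1
```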
3. Estimating the direction only. In this section, we suppose that genuinely or effectively low-rank matrices X ∈ R^{n×n} are acquired via y = sgn(A(X)) ∈ {±1}^m. In this setting, one can only hope to recover the direction of X, as all cX, c > 0, produce the same sign measurements. We study two reconstruction procedures, one based on hard singular value thresholding and one based on semidefinite programming. As we shall see, they are both robust to a prequantization error e ∈ R^m that corrupts the sign measurement vector to y = sgn(A(X) + e).

3.1. Hard singular value thresholding. The output of the hard singular value thresholding procedure is merely

(25) X^{ht} := best rank-r approximant to A*(y) = Σ_{i=1}^m y_i A_i,

computed by retaining only the r largest singular values in a singular value decomposition of A*(y). The main theorem of this subsection is stated below. Note that, according to Theorem 2.1, its assumption is met with high probability by (properly normalized) Gaussian measurement maps A : R^{n×n} → R^m, provided m ≥ Cδ^{-3} nr (or m ≥ Cδ^{-2} nr if the remark after Theorem 2.1 is taken into account).
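To illustrate the procedure (25), here is a small self-contained NumPy experiment (dimensions and oversampling level are arbitrary illustrative choices; this is a sketch, not the article's MATLAB implementation). It recovers the direction of a random rank-r matrix from unthresholded sign measurements:

```python
import numpy as np

rng = np.random.default_rng(2)
n, r, m = 15, 2, 6000

# Rank-r signal with unit Frobenius norm (only the direction is identifiable).
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
X /= np.linalg.norm(X, "fro")

# Gaussian measurement matrices, normalized as in Section 2, and sign measurements.
A = rng.standard_normal((m, n, n)) * (np.sqrt(np.pi / 2) / m)
y = np.sign(np.einsum("ijk,jk->i", A, X))

# Hard singular value thresholding: best rank-r approximant to A*(y) = sum_i y_i A_i.
M = np.einsum("i,ijk->jk", y, A)
U, s, Vt = np.linalg.svd(M)
X_ht = (U[:, :r] * s[:r]) @ Vt[:r, :]
X_ht /= np.linalg.norm(X_ht, "fro")

direction_error = np.linalg.norm(X - X_ht, "fro")
print(direction_error)
```

With this normalization, A*(y) has expectation X/‖X‖_F, which is why truncating its singular value decomposition at rank r already aligns well with the direction of X.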
Remark 2. The square-root scaling of ‖e‖_1 in Theorem 3.1 is peculiar. It is probably an artifact of the simplicity of the argument in Lemma 3.2 below. We believe that the recovery error should depend linearly on ‖e‖_1, which is supported by the numerical experiment presented in Subsection 6.1.
The whole argument is based on the one simple lemma stated below, which already shows that normalized low-rank matrices sharing the same sign measurements are necessarily close to one another. In this lemma, the operator P_S associated to a subspace S of R^{n×n} denotes the orthogonal projector onto S with respect to the Frobenius inner product.
Now that this key lemma has been established, the main result about direction recovery by hard singular value thresholding can be easily deduced.
Proof of Theorem 3.1. Since X^{ht} is the best rank-r approximant to A*(y), it is a better approximant than X itself, so we have

‖A*(y) − X^{ht}‖_F² ≤ ‖A*(y) − X‖_F².

Introducing X in the left-hand side, expanding the square, and simplifying leads to

‖X − X^{ht}‖_F² ≤ 2 ⟨A*(y) − X, X^{ht} − X⟩_F.

Note that X and X^{ht} are of rank at most r, so that both X and X − X^{ht} belong to a space S = span{u_1 v_1*, ..., u_{2r} v_{2r}*}. Hence, also using Lemma 3.2, we derive

‖X − X^{ht}‖_F² ≤ 2 ⟨P_S(A*(y)) − X, X^{ht} − X⟩_F ≤ 2 ‖P_S(A*(y)) − X‖_F ‖X − X^{ht}‖_F.

This immediately implies that

(37) ‖X − X^{ht}‖_F ≤ 2 √(5δ + 4 ‖e‖_1),

which already proves that X is well approximated by the unnormalized matrix X^{ht}. To realize that it is also well approximated by the normalization X^{ht}/‖X^{ht}‖_F of this matrix, we remark that the latter is the best normalized approximant to X^{ht} and hence it is a better approximant than X, so that ‖X^{ht} − X^{ht}/‖X^{ht}‖_F‖_F ≤ ‖X^{ht} − X‖_F, and in turn

‖X − X^{ht}/‖X^{ht}‖_F‖_F ≤ ‖X − X^{ht}‖_F + ‖X^{ht} − X^{ht}/‖X^{ht}‖_F‖_F ≤ 2 ‖X − X^{ht}‖_F ≤ 4 √(5δ + 4 ‖e‖_1).

This concludes the proof.

3.2. Semidefinite programming. The second recovery procedure is a semidefinite program in the spirit of [9]. In the absence of any information on the prequantization error, one can consider

(39) X^{sdp} := argmin over Z ∈ R^{n×n} of ‖Z‖_* subject to y_i ⟨A_i, Z⟩_F ≥ 0 for all i ∈ {1, ..., m} and ‖A(Z)‖_1 = 1.

In the realistic situation in which a bound ‖e‖_1 ≤ η is available, we refine the optimization procedure slightly and consider

(40) (X^{sdp}, e^{sdp}) := argmin over Z ∈ R^{n×n}, w ∈ R^m of ‖Z‖_* subject to y_i (⟨A_i, Z⟩_F + w_i) ≥ 0 for all i ∈ {1, ..., m}, ‖A(Z) + w‖_1 = 1, and ‖w‖_1 ≤ (10/7) η.

The value 10/7 is somewhat arbitrary: it has been chosen to make the subsequent arguments look nice. Note that the constraint ‖w‖_1 ≤ (10/7) η can still be recast as linear constraints, e.g. by introducing slack variables w⁺, w⁻ ≥ 0 with w = w⁺ − w⁻ and Σ_i (w_i⁺ + w_i⁻) ≤ (10/7) η. The main theorem of this subsection reads as follows. Once again, we emphasize that its assumption is met with high probability by (properly normalized) Gaussian measurement maps A : R^{n×n} → R^m, provided m ≥ Cδ^{-5} nr (or m ≥ Cδ^{-2} nr if the remark after Theorem 2.1 is taken into account).
for some absolute constants C, D > 0.
The argument relies on a lemma about the effective rank of the output of the semidefinite program.
Lemma 3.4. Suppose that η ≤ 1/10 and that A satisfies MRRIP(16r, 1/5). If X ∈ R^{n×n} satisfies rank(X) ≤ r and ‖X‖_F = 1, then any convex combination of X and X^{sdp} is of effective rank at most 16r. The same conclusion holds if effrank(X) ≤ r and ‖X‖_F = 1, provided A satisfies MRRIP_eff(16r, 1/5).
Proof. We set t := 16r and δ := 1/5. Taking note of

‖A(X) + e‖_1 ≥ ‖A(X)‖_1 − ‖e‖_1 ≥ (1 − δ) − η ≥ 7/10,

we first point out that the couple (X, e)/‖A(X) + e‖_1 is feasible for the optimization program in (40). It follows that

‖X^{sdp}‖_* ≤ ‖X‖_* / ‖A(X) + e‖_1 ≤ (10/7) √r.

Let X̃ := (1 − λ) X + λ X^{sdp}, λ ∈ [0, 1], be a convex combination of X and X^{sdp}. We also introduce ẽ := (1 − λ) e + λ e^{sdp}. We notice that

‖X̃‖_* ≤ (1 − λ) ‖X‖_* + λ ‖X^{sdp}‖_* ≤ ((1 − λ) + (10/7) λ) √r = (1 + 3λ/7) √r.

Moreover, thanks to sgn(A(X) + e) = sgn(A(X^{sdp}) + e^{sdp}) = sgn(A(X̃) + ẽ) = y, we have

‖A(X̃) + ẽ‖_1 = (1 − λ) ‖A(X) + e‖_1 + λ ‖A(X^{sdp}) + e^{sdp}‖_1 ≥ (1 − λ)(7/10) + λ = (7/10)(1 + 3λ/7),

while ‖ẽ‖_1 can be bounded from above as

‖ẽ‖_1 ≤ (1 − λ) ‖e‖_1 + λ ‖e^{sdp}‖_1 ≤ (1 − λ) η + λ (10/7) η ≤ (10/7) η ≤ 1/7,

so that ‖A(X̃)‖_1 can be bounded from below as

(47) ‖A(X̃)‖_1 ≥ ‖A(X̃) + ẽ‖_1 − ‖ẽ‖_1 ≥ (7/10)(1 + 3λ/7) − 1/7.

Besides, by applying the sort-and-split technique, we write X̃ = X̃_0 + X̃_1 + X̃_2 + ⋯, where the X̃_k's are rank-r matrices defined from the singular value decomposition of X̃ by grouping its singular values, sorted in nonincreasing order, into consecutive blocks, and we observe that this decomposition yields an upper bound (48) on ‖A(X̃)‖_1 in terms of ‖X̃‖_F and ‖X̃‖_*. Combining the lower bound (47) and the upper bound (48) on ‖A(X̃)‖_1 gives ‖X̃‖_F ≥ (1 + 3λ/7)/4. We finally arrive at

effrank(X̃) = ‖X̃‖_*² / ‖X̃‖_F² ≤ ((1 + 3λ/7) √r)² / ((1 + 3λ/7)/4)² = 16r.

This means that X̃ is effectively of rank at most 16r, as announced.
With this crucial lemma now established, the main result about direction recovery by semidefinite programming follows from a curvature argument already found in [11].
Proof of Theorem 3.3. Suppose first that δ ≤ 1/5 and η ≤ 1/10. The parallelogram identity gives

(50) ‖X − X^{sdp}‖_F² = 2 ‖X‖_F² + 2 ‖X^{sdp}‖_F² − 4 ‖(X + X^{sdp})/2‖_F².

According to Lemma 3.4, both X^{sdp} and (X + X^{sdp})/2 are effectively of rank at most 16r. Thus, on the one hand, the modified restricted isometry property allows us to compare ‖X^{sdp}‖_F and ‖(X + X^{sdp})/2‖_F with ‖A(X^{sdp})‖_1 and ‖A((X + X^{sdp})/2)‖_1, yielding (51), while on the other hand, using sgn(A(X) + e) = sgn(A(X^{sdp}) + e^{sdp}) = y, these ℓ1-norms can themselves be estimated, yielding (52). Substituting (51) and (52) into (50) leads to

‖X − X^{sdp}‖_F ≤ C″ √δ + D″ √η

for some absolute constants C″, D″ > 0 that could be determined explicitly if needed. In turn, since X^{sdp}/‖X^{sdp}‖_F is the best normalized approximant to X^{sdp}, we obtain

‖X − X^{sdp}/‖X^{sdp}‖_F‖_F ≤ 2 ‖X − X^{sdp}‖_F ≤ C √δ + D √η

for some constants C, D > 0 that could again be determined explicitly. Up to possibly changing these constants, the same estimate remains true if η ≥ 1/10, since then the trivial bound ‖X − X^{sdp}/‖X^{sdp}‖_F‖_F ≤ 2 already suffices. A similar reasoning would take care of the case δ ≥ 1/5. The proof is now complete.

4. Estimating the magnitude as well as the direction. In this section, we assume that a bound ‖X‖_F ≤ γ is available for the Frobenius norm of the low-rank matrices X ∈ R^{n×n}. We show that, in this case, we can estimate not only the direction X/‖X‖_F, but also the magnitude ‖X‖_F, i.e., it is possible to fully estimate X, provided the binary measurements feature well-chosen thresholds τ_1, ..., τ_m ∈ R. Namely, these measurements take the form y_i = sgn(⟨A_i, X⟩_F − τ_i), or more realistically, in the presence of prequantization error e ∈ R^m,

(56) y_i = sgn(⟨A_i, X⟩_F + e_i − τ_i), i = 1, ..., m.

For the recovery algorithm, one can choose between a hard singular value thresholding procedure and a semidefinite procedure. The argument involves a combination of the results from Section 3 and an augmentation trick already used in [3]. Precisely, matrices X ∈ R^{n×n} of (effective) rank at most r are augmented to matrices X_aug ∈ R^{(n+1)×(n+1)} of (effective) rank at most r + 1 via

(57) X_aug := [X, 0; 0, γ],

i.e., X occupies the top-left n × n block, the bottom-right entry equals γ, and the remaining entries are zero. One then works in the augmented space R^{(n+1)×(n+1)} to approximate the direction of X_aug, which provides a full approximation of the original X. This can be done because the thresholded binary measurements in the original space can be interpreted in the augmented space as the unthresholded binary measurements

(58) y_i = sgn(⟨A_i^aug, X_aug⟩_F), i = 1, ..., m,

where the augmented matrices A_1^aug, ..., A_m^aug ∈ R^{(n+1)×(n+1)} are given by

(59) A_i^aug := [A_i, u_i; v_i^T, −τ_i/γ]

for some vectors u_i, v_i ∈ R^n to be specified below, since indeed ⟨A_i^aug, X_aug⟩_F = ⟨A_i, X⟩_F − τ_i whatever the choice of u_i and v_i. We now formalize the main result of this section. It is worth pointing out, once again, that the necessary number of measurements does not comprise any spurious logarithmic factor.

Theorem 4.1. There exist absolute constants c, C > 0 such that, if m ≥ Cδ^{-3} nr, if independent random matrices A_1, ..., A_m ∈ R^{n×n} are populated by independent Gaussian variables with mean zero and standard deviation √(π/2)/m, and if random thresholds τ_1, ..., τ_m ∈ R are Gaussian variables with mean zero and standard deviation γ√(π/2)/m, independent from one another and from the A_i's, then the following holds with failure probability at most 2 exp(−c δ² m): every matrix X ∈ R^{n×n} which has rank at most r, satisfies ‖X‖_F ≤ γ, and is acquired via (56) is approximated as quantified in (60) and (61), with η an a priori bound on ‖e‖_1.

The matrices X^{ht}, X^{sdp} ∈ R^{n×n} and the scalars x^{ht}, x^{sdp} ∈ R are produced by

(62) [X^{ht}, *; *, x^{ht}] := best rank-(r + 1) approximant to A_aug^*(y) = Σ_{i=1}^m y_i A_i^aug,

where the blocks marked * are discarded, and by the analogous semidefinite program (63) in the augmented space, with extra randomness injected through the presence in (59) of independent random vectors u_1, v_1, ..., u_m, v_m ∈ R^n populated by independent Gaussian variables with mean zero and standard deviation √(π/2)/m.
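The augmentation trick can be checked directly: with the augmented signal X_aug (X in the top-left block, γ in the bottom-right corner) and the augmented measurement matrices of (59), the identity ⟨A_i^aug, X_aug⟩_F = ⟨A_i, X⟩_F − τ_i holds regardless of the vectors u_i and v_i. A short NumPy verification with arbitrary data (illustrative only, not from the article):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, gamma = 8, 5, 2.0

X = rng.standard_normal((n, n))
A = rng.standard_normal((m, n, n))
tau = rng.standard_normal(m)
u = rng.standard_normal((m, n))
v = rng.standard_normal((m, n))

# Augmented signal: X in the top-left block, gamma in the bottom-right corner.
X_aug = np.zeros((n + 1, n + 1))
X_aug[:n, :n] = X
X_aug[n, n] = gamma

deviations = []
for i in range(m):
    # Augmented measurement matrix of (59): [A_i, u_i; v_i^T, -tau_i/gamma].
    A_aug = np.zeros((n + 1, n + 1))
    A_aug[:n, :n] = A[i]
    A_aug[:n, n] = u[i]
    A_aug[n, :n] = v[i]
    A_aug[n, n] = -tau[i] / gamma
    # <A_i_aug, X_aug>_F should equal <A_i, X>_F - tau_i.
    deviations.append(np.sum(A_aug * X_aug) - (np.sum(A[i] * X) - tau[i]))

max_deviation = max(abs(d) for d in deviations)
print(max_deviation)  # numerically zero
```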
The following observation is central to the argument.

Proof. A vector analog of this statement appeared in [3, Lemma 8]. One can follow the steps presented there and adapt them to our situation. Alternatively, one can vectorize matrices in R^{(n+1)×(n+1)} to vectors in R^{n²+2n+1}, apply [3, Lemma 8] to bound the ℓ2-norm of the difference of two such vectors by the right-hand side of (65), and observe that the left-hand side of (65) is itself bounded by the ℓ2-norm of this difference.
Let us now exploit the observation made in Lemma 4.2 to prove the result stated above.
Proof of Theorem 4.1. The map A_aug : R^{(n+1)×(n+1)} → R^m is guaranteed to satisfy MRRIP(16(r + 1), δ) by the assumptions. Therefore, Theorems 3.1 and 3.3 imply that

(66) ‖X_aug/‖X_aug‖_F − X̄_•/‖X̄_•‖_F‖_F ≤ ε_•,

where X̄_• = [X_•, *; *, x_•] is either given by (62), in which case η_• := ‖e‖_1, or by (63), in which case η_• := η. Looking at the bottom-right entry, we derive that x_•/‖X̄_•‖_F stays bounded away from zero provided ε_• is small enough. Lemma 4.2, together with (66), now gives an estimate of the same order for the distance between X_aug/γ and the rescaled matrix X̄_•/x_•. Multiplying throughout by γ, we obtain a bound on ‖X_aug − γ X̄_•/x_•‖_F which, in view of ‖X_aug‖_F ≥ γ and of the fact that X and γ X_•/x_• occupy the top-left blocks, yields the estimates required in (60) and (61), up to an adjustment of the constants C, D > 0.
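As an illustration of the hard singular value thresholding variant (62), the sketch below (a simplified Python transcription with arbitrary sizes, not the article's MATLAB implementation) estimates both direction and magnitude by rescaling the augmented estimate so that its bottom-right entry equals γ:

```python
import numpy as np

rng = np.random.default_rng(4)
n, r, m, gamma = 12, 2, 20000, 3.0

# Rank-r signal with ||X||_F <= gamma.
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
X *= 0.8 * gamma / np.linalg.norm(X, "fro")

sigma = np.sqrt(np.pi / 2) / m
A = rng.standard_normal((m, n, n)) * sigma
u = rng.standard_normal((m, n)) * sigma
v = rng.standard_normal((m, n)) * sigma
tau = rng.standard_normal(m) * (gamma * sigma)

# Thresholded binary measurements (no prequantization error here).
y = np.sign(np.einsum("ijk,jk->i", A, X) - tau)

# Back-projection in the augmented space: M = sum_i y_i A_i_aug, see (59).
M = np.zeros((n + 1, n + 1))
M[:n, :n] = np.einsum("i,ijk->jk", y, A)
M[:n, n] = y @ u
M[n, :n] = y @ v
M[n, n] = y @ (-tau / gamma)

# Best rank-(r+1) approximant, then rescaling so the bottom-right entry is gamma.
U, s, Vt = np.linalg.svd(M)
H = (U[:, : r + 1] * s[: r + 1]) @ Vt[: r + 1, :]
X_full = gamma * H[:n, :n] / H[n, n]

relative_error = np.linalg.norm(X - X_full, "fro") / gamma
print(relative_error)
```

Unlike in Section 3, the magnitude ‖X‖_F is recovered here as well, thanks to the known bottom-right entry γ of the augmented signal.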
5. Exponential decay of the recovery error. In this section, we still work under the assumption that an a priori bound ‖X‖_F ≤ γ is available for the rank-r matrices X ∈ R^{n×n} of interest. In the absence of prequantization error, Section 4 essentially showed that random thresholds τ ∈ R^m in y = sgn(A(X) − τ) allow for the computation of approximants X̂ satisfying

(71) ‖X − X̂‖_F ≤ C γ λ^{-1/6},

where λ := m/(nr) represents an oversampling factor; note that no logarithmic factors are featured here. We now aim at proving that one can find a recovery procedure making the error ‖X − X̂‖_F decay exponentially fast with the oversampling factor λ, provided the thresholds τ are chosen adaptively. The vector analog of this result was established in [2] and we follow the same guiding idea here.
The procedure intertwines the measurement and reconstruction processes by dividing the m = qT binary measurements into T batches of q ≍ nr binary measurements, each of them having the form

y^(t) = sgn(A^(t)(X) + e^(t) − τ^(t)), t = 0, ..., T − 1,

for measurement maps A^(t) : R^{n×n} → R^q, prequantization errors e^(t) ∈ R^q, and thresholds τ^(t) ∈ R^q. Precisely, the procedure iteratively constructs, starting from X^(0) = 0, a sequence (τ^(t))_{t ∈ 0:T−1} of thresholds and a sequence (X^(t))_{t ∈ 0:T} of matrices according to

τ^(t) := A^(t)(X^(t)) + (γ/2^t)(√(π/2)/q) g^(t),
Δ^(t) := ∆(y^(t)), an approximant of X − X^(t), with ∆ one of the reconstruction maps from Theorem 4.1 applied with the bound γ/2^t in place of γ,
(74) X^(t+1) := best rank-r approximant to X^(t) + Δ^(t).

Theorem 5.1. Let the A_i^(t) ∈ R^{n×n} be independent random matrices populated by independent Gaussian variables with mean zero and standard deviation √(π/2)/q and let g^(t) ∈ R^q be random vectors populated by standard Gaussian variables independent from one another and from the A_i^(t)'s. There exist absolute constants c, c′, c″ > 0 such that, if q ≥ c″ nr, then the following holds with failure probability at most 2T exp(−c′ q): every matrix X ∈ R^{n×n} which has rank at most r, satisfies ‖X‖_F ≤ γ, and is acquired via the m = qT binary measurements y^(0), ..., y^(T−1), with prequantization errors obeying ‖e^(t)‖_1 ≤ ν(γ/2^t) for an appropriately small constant ν > 0, is approximated by the output X^(T) of the procedure described above with an error

(76) ‖X − X^(T)‖_F ≤ γ/2^T ≤ γ exp(−c λ).

Proof. We shall prove by induction on t ∈ {0, ..., T} that ‖X − X^(t)‖_F ≤ γ/2^t. Indeed, the result for t = T implies the desired bound (76), since γ/2^T = γ exp(−(ln 2) T) = γ exp(−(ln 2) m/q) ≤ γ exp(−c λ). The inductive hypothesis clearly holds for t = 0. Let us now suppose that it holds for some t ∈ {0, ..., T − 1}. We observe that

y^(t) = sgn(A^(t)(X) + e^(t) − τ^(t)) = sgn(A^(t)(X − X^(t)) + e^(t) − (γ/2^t)(√(π/2)/q) g^(t)),

so that y^(t) consists of thresholded binary measurements of X − X^(t) with the randomness of the thresholds scaled to γ/2^t. Since X − X^(t) has rank at most 2r and satisfies ‖X − X^(t)‖_F ≤ γ/2^t, Theorem 4.1 applied with a small enough δ ∈ (0, 1) yields

(80) ‖(X − X^(t)) − Δ^(t)‖_F ≤ (γ/2^t)/4.

But since X^(t+1) is the best rank-r approximant to X^(t) + Δ^(t), we have

(81) ‖X^(t+1) − (X^(t) + Δ^(t))‖_F ≤ ‖X − (X^(t) + Δ^(t))‖_F.

In turn, we deduce from (81) and (80) that

‖X − X^(t+1)‖_F ≤ ‖X − X^(t) − Δ^(t)‖_F + ‖(X^(t) + Δ^(t)) − X^(t+1)‖_F ≤ 2 ‖X − X^(t) − Δ^(t)‖_F ≤ γ/2^{t+1}.

This shows that the induction hypothesis holds for t + 1. The proof is therefore complete.
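The adaptive scheme is straightforward to prototype. The sketch below (an illustrative Python transcription with arbitrarily chosen sizes, using the hard singular value thresholding map for ∆; it is not the article's code) re-centers the thresholds at A^(t)(X^(t)) and halves the threshold scale at every batch, so the recorded errors should decrease across iterations:

```python
import numpy as np

rng = np.random.default_rng(5)
n, r, gamma, q, T = 10, 1, 1.0, 20000, 4

# Rank-r signal with ||X||_F <= gamma.
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
X *= 0.9 * gamma / np.linalg.norm(X, "fro")

sigma = np.sqrt(np.pi / 2) / q
X_t = np.zeros((n, n))
errors = [np.linalg.norm(X - X_t, "fro")]

for t in range(T):
    bound = gamma / 2**t  # a priori bound on ||X - X_t||_F at step t
    A = rng.standard_normal((q, n, n)) * sigma
    u = rng.standard_normal((q, n)) * sigma
    v = rng.standard_normal((q, n)) * sigma
    tau_rand = rng.standard_normal(q) * (bound * sigma)
    # Adaptive thresholds: re-center the quantizer at A(X_t), scale randomness to `bound`.
    tau = np.einsum("ijk,jk->i", A, X_t) + tau_rand
    y = np.sign(np.einsum("ijk,jk->i", A, X) - tau)
    # Delta(y): augmented hard singular value thresholding estimate of X - X_t,
    # a matrix of rank at most 2r with Frobenius norm at most `bound`.
    M = np.zeros((n + 1, n + 1))
    M[:n, :n] = np.einsum("i,ijk->jk", y, A)
    M[:n, n] = y @ u
    M[n, :n] = y @ v
    M[n, n] = y @ (-tau_rand / bound)
    U, s, Vt = np.linalg.svd(M)
    k = 2 * r + 1
    H = (U[:, :k] * s[:k]) @ Vt[:k, :]
    delta = bound * H[:n, :n] / H[n, n]
    # New iterate: best rank-r approximant to X_t + delta.
    U, s, Vt = np.linalg.svd(X_t + delta)
    X_t = (U[:, :r] * s[:r]) @ Vt[:r, :]
    errors.append(np.linalg.norm(X - X_t, "fro"))

print(errors)
```

The batch size q here is deliberately generous; with q too small relative to nr, the per-batch estimate no longer halves the residual and the geometric decay breaks down.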
6. Numerical experiments. The goal of this final section is merely illustrative. The investigations showcased here are far from exhaustive, since they mostly serve as a confirmation that the algorithms considered in this article are indeed implementable. Our own implementations (by no means optimized) can be downloaded from the first author's webpage as part of the MATLAB reproducible containing the code to generate the experiments below. In order to execute the semidefinite optimization procedures, cvx [5], a package for specifying and solving convex programs, is required.

6.1. Direction estimation. In this subsection, we look at the procedures (25) and (39)-(40) for estimating the direction of a low-rank matrix acquired via binary measurements without thresholds.

6.1.1. Hard singular value thresholding vs semidefinite programming. In a first experiment, we consider the situation where there is no prequantization error (e = 0). For fixed values of the rank r and of the oversampling factor λ, we vary n and record both the execution time and the recovery error (averaged over a reasonable number of tests). Figure 1(a) supports the intuitive prediction that the hard singular value thresholding procedure is much faster than the semidefinite procedure (the slopes even suggest a cost Θ(n^c) with c close to 1 vs close to 3), while Figure 1(b) supports the other intuitive prediction that the semidefinite procedure is more accurate than the hard singular value thresholding procedure (note also that the oversampling factor λ is fixed, so the recovery error is more or less constant as n varies).

6.1.2. Influence of the measurement error. In a second experiment, we assess the hard singular value thresholding procedure (25) in the presence of prequantization error e ∈ R^m to determine if the term √‖e‖_1 found in (26) reflects reality or if it is an artifact of our proofs. Figure 2 seems to suggest that the recovery error scales in reality like ‖e‖_1.
6.2. Full estimation. In this subsection, we verify empirically that incorporating thresholds before quantization allows one to estimate not only the direction but also the magnitude of low-rank matrices. However, since many binary measurements need to be made on low-rank matrices in order to observe any persuasive phenomenon, we exclude the demanding semidefinite procedures and focus on the swift hard singular value thresholding procedures. We work in the absence of prequantization error. We first validate that the procedure (62) results in a recovery error decaying like the inverse of a power of the oversampling factor λ, as indicated in (71). However, Figure 3 suggests that the power 1/6 (or even 1/4 if the remark after Theorem 2.1 is taken into account) is too pessimistic and could be replaced by 1/2.