NONCONVEX MIXED MATRIX MINIMIZATION

Given the ultrahigh dimensionality and the complex structure of such data, which contain both matrices and vectors, the mixed matrix minimization becomes crucial for their analysis. Recently, nonconvex functions such as the smoothly clipped absolute deviation, the minimax concave penalty, the capped ℓ1-norm penalty, and the ℓp quasi-norm with 0 < p < 1 have shown remarkable advantages in variable selection, since they can overcome over-penalization. In this paper, we propose and study a novel nonconvex mixed matrix minimization, which combines low-rank and sparse regularizations with nonconvex functions. The augmented Lagrangian method (ALM) is proposed to solve the nonconvex mixed matrix minimization problem. The resulting subproblems either have closed-form solutions or can be solved by fast solvers, which makes the ALM particularly efficient. In theory, we prove that the sequence generated by the ALM converges to a stationary point when the penalty parameters are above a computable threshold. Extensive numerical experiments illustrate that our proposed nonconvex mixed matrix minimization model outperforms the existing ones.

1. Introduction. Modern technologies are producing a wealth of data with complex structures. For instance, in two-dimensional digital imaging, flow cytometry, and electroencephalography, matrix-type covariates frequently arise when measurements are obtained for each combination of two underlying variables. To address scientific questions arising from such data, Zhou and Li [31] proposed the mixed matrix (MM) regression

y = ⟨X, B⟩ + ⟨z, γ⟩ + ε,    (1)

where y ∈ R is the response, X ∈ R^{m×q} is the design matrix, z ∈ R^p is the covariate vector, B ∈ R^{m×q} and γ ∈ R^p are the unknown regression coefficients, and ε ∈ R is the noise, which follows a normal distribution with mean 0 and standard deviation σ. The MM model is now a popular approach for data with complex structures. Next, we review the recent developments in this direction, with which our approach is aligned.
1.1. Overview of the problem. In fact, if B = 0, model (1) reduces to the traditional linear regression y = ⟨z, γ⟩ + ε.
Many researchers have contributed to its theory, algorithms, and applications, including the Lasso [21], the fused Lasso [22], the elastic net [30], and so on. If γ = 0, model (1) reduces to the traditional matrix regression y = ⟨X, B⟩ + ε.
To the best of our knowledge, this line of work can be traced back to the earlier literature [15], and was later developed into the nuclear norm approach [27] and the multivariate group Lasso [17], just to name a few.
To take the structure of B into consideration, Zhou and Li [31] studied the following low-rank regularized mixed matrix (LMM) minimization

min_{B,γ} (1/2) Σ_{i=1}^n ( y_i − ⟨X_i, B⟩ − ⟨z_i, γ⟩ )² + µ ‖B‖_*,    (2)

in which ‖B‖_* is the nuclear norm (defined as the sum of all singular values) of B, and µ > 0 is a tuning parameter. In statistics, they discussed the degrees of freedom; in algorithms, they established the Nesterov optimal gradient method for the spectral regularized matrix regression model (2). However, they ignored the sparsity of γ, which is also natural in practice. Thus, Shang et al. [20] considered the low-rank and sparse regularized mixed matrix (LSMM) minimization

min_{B,γ} (1/2) Σ_{i=1}^n ( y_i − ⟨X_i, B⟩ − ⟨z_i, γ⟩ )² + µ ‖B‖_* + ν ‖γ‖_1,    (3)

in which ‖γ‖_1 is the componentwise ℓ1 norm (defined as the sum of the absolute values of all entries) and ν > 0 is a tuning parameter. They discussed its degrees of freedom, but did not give an optimization algorithm.
Recently, nonconvex penalty functions, such as the hard thresholding penalty [8], the ℓp penalty with 0 < p < 1 [4], the smoothly clipped absolute deviation (SCAD) [7], the minimax concave penalty (MCP) [28], and the capped ℓ1 penalty [29], have attracted much attention. It is well known that the ℓ0 norm is the essential measure of sparsity, and the ℓ1 norm is its best convex relaxation. However, the ℓ1 norm usually induces over-penalization. These efforts show that nonconvex penalties have a nearly unbiased property and can overcome the drawbacks of the ℓ1 norm [9]. Hence, the ℓ1 norm can be replaced by a nonconvex penalty function when a sparser solution with better reconstruction performance is desired. Numerical experiments also illustrate that nonconvex regularized models offer many possibilities; see, e.g., signal processing [24, 25], image reconstruction [3, 16], and foreground extraction [26, 32].
By integrating the mixed matrix regression and the nonconvex penalty functions, in this paper we consider the nonconvex mixed matrix (NonMM) minimization

min_{B,γ} (1/2) Σ_{i=1}^n ( y_i − ⟨X_i, B⟩ − ⟨z_i, γ⟩ )² + µ ‖B‖_* + ν φ(γ),    (4)

where φ is a nonconvex penalty function of interest with parameter ν > 0.

1.2. Contributions. Compared to previous work, the main innovations of our paper are summarized below: 1) We propose a novel nonconvex mixed matrix minimization model, which combines low-rank and sparse regularizations with nonconvex functions. To the best of our knowledge, we are the first to model the mixed matrix minimization in such a nonconvex framework. 2) Algorithmically, we develop an efficient augmented Lagrangian method (ALM) to optimize the proposed model, and the resulting subproblems either have closed-form solutions or can be solved by fast solvers. We also show that the sequence generated by the proposed ALM converges to a stationary point when the penalty parameters are chosen above a computable threshold. 3) Moreover, we conduct numerical experiments, on both simulated and real-world data, to demonstrate that our proposed nonconvex mixed matrix minimization model achieves better performance than the existing state-of-the-art ones. The remainder of this paper is organized as follows. In Section 2, some notations and preliminaries are reviewed. The optimization algorithm and convergence result are discussed in Section 3. In Section 4, extensive experiments are conducted to substantiate the superiority of the proposed model over existing ones. The paper is concluded in Section 5.

2. Notations and preliminaries. In this paper, we use R^m and R^{m×n} to denote the sets of m-dimensional vectors and m × n matrices, respectively. For a vector x ∈ R^m, let x_i denote its i-th entry; for a matrix X ∈ R^{m×n}, let X_{ij} denote its ij-th entry. The corresponding Euclidean norms are denoted by ‖x‖_2 and ‖X‖_F. For two vectors x and y of the same size, we denote their inner product by ⟨x, y⟩ = Σ_i x_i y_i. Furthermore, Diag(x) or Diag(x_1, . . . , x_m) denotes a square matrix with the vector x on its main diagonal and zeros elsewhere. For a matrix X ∈ R^{m×n}, we use λ^X_max to denote the largest eigenvalue of XX^T, where X^T is the transpose of X. Finally, f : R^m → (−∞, +∞] denotes an extended-real-valued function. Let us recall a few definitions about lower semicontinuity and subdifferentials from Rockafellar and Wets [19], which will be used in the convergence analysis.
From the above definition, we can easily make the following observation: when f is convex, the above subdifferential coincides with the classical convex subdifferential of f. Moreover, from the generalized Fermat rule, we know that if x ∈ R^m is a local minimizer of f, then 0 ∈ ∂f(x). Next, we briefly review the ALM, which is widely used in computer vision, image processing, and statistical learning; see [1] for a comprehensive review. Let us consider the following minimization problem:

min_{x,y} f(x) + g(y)  s.t.  Ax + By = c,    (5)

where x, y ∈ R^m are two variables, A, B : R^m → R^n are linear maps, and c ∈ R^n is a given vector. The augmented Lagrangian function for (5) is

L_β(x, y; w) = f(x) + g(y) + ⟨w, Ax + By − c⟩ + (β/2) ‖Ax + By − c‖²,

where β > 0 is a penalty parameter and w ∈ R^n is the Lagrangian multiplier. Thus, the ALM for solving problem (5) can be summarized in the following iterative scheme:

x^{k+1} = argmin_x L_β(x, y^k; w^k),
y^{k+1} = argmin_y L_β(x^{k+1}, y; w^k),
w^{k+1} = w^k + τβ (Ax^{k+1} + By^{k+1} − c),

where τ ∈ (0, (1 + √5)/2). In most cases, the resulting subproblems admit closed-form solutions, which makes the ALM particularly efficient.
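As an illustration, the scheme above can be run on a toy instance of (5) with f(x) = (1/2)‖x − a‖², g(y) = λ‖y‖₁, A = I, B = −I, c = 0, so the constraint is x = y; the names below are ours and the instance is chosen only because its limit is known to be the soft-thresholding of a.

```python
import numpy as np

def alm_toy(a, lam, beta=1.0, tau=1.0, iters=300):
    # ALM for: min_{x,y} 0.5*||x - a||^2 + lam*||y||_1  s.t.  x - y = 0.
    x = y = w = np.zeros_like(a)
    for _ in range(iters):
        # x-step: quadratic subproblem, closed form.
        x = (a - w + beta * y) / (1.0 + beta)
        # y-step: soft-thresholding of x + w/beta with threshold lam/beta.
        t = x + w / beta
        y = np.sign(t) * np.maximum(np.abs(t) - lam / beta, 0.0)
        # multiplier ascent with step tau*beta on the residual x - y.
        w = w + tau * beta * (x - y)
    return y
```

For a = (3, −0.5, 1) and λ = 1 the iterates approach (2, 0, 0), the soft-thresholding of a, matching the known minimizer.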
When f and g are convex functions, the ALM was originally proposed in [12] and its convergence was analyzed in [10, 11]. In fact, convexity is the key property for characterizing the convergence and the convergence rate of the ALM. When f or g is nonconvex, however, the convergence analysis of the ALM is much more challenging. This motivates the study of the ALM for nonconvex problems; see [13, 14, 23, 26] for illustrations. However, none of the aforementioned results can be applied directly to solving (4). The difficulty is that the two variables are coupled through linear transformations, which makes the nonconvex problem more complex. In this paper, we attempt to overcome this difficulty and design a fast algorithm with convergence analysis.
At the end of this section, we review the following two lemmas, which will help us express the closed-form solutions in the proposed ALM algorithm. The minimization problem

min_X (1/2) ‖X − T‖²_F + µ ‖X‖_*

has a closed-form solution given by

X = U Diag(shrink(σ_1, µ), . . . , shrink(σ_m, µ)) V^T,

where shrink(σ, µ) = sign(σ) max(0, |σ| − µ) is the soft-thresholding operator defined in [5], and T = U Σ V^T with Σ = Diag(σ_1, . . . , σ_m) is the singular value decomposition (SVD) of T.
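This singular value thresholding step can be sketched in a few lines; a minimal numpy version, assuming the threshold µ is applied directly to the singular values:

```python
import numpy as np

def svt(T, mu):
    # Solves min_X 0.5*||X - T||_F^2 + mu*||X||_* by soft-thresholding
    # the singular values of T and rebuilding the matrix.
    U, sigma, Vt = np.linalg.svd(T, full_matrices=False)
    return U @ np.diag(np.maximum(sigma - mu, 0.0)) @ Vt
```

For example, applying `svt` with µ = 2 to Diag(3, 1) shrinks the first singular value to 1 and zeroes the second, producing a rank-1 result.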
Similarly, for a nonconvex penalty φ with parameter λ > 0, the minimization problem

min_s λ φ(s) + (1/2) ‖s − t‖²

has a closed-form solution s = shrink_{2,φ}(t, λ), where shrink_{2,φ}(·) is the associated shrinkage operator; see [25] for more details.
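For a concrete instance of shrink_{2,φ}, the SCAD penalty (used with a = 3.7 in our experiments) admits the standard three-piece thresholding formula of Fan and Li; the sketch below assumes a unit quadratic weight, i.e., it solves min_s (1/2)(s − t)² + SCAD(s; λ, a) elementwise.

```python
import numpy as np

def scad_shrink(t, lam, a=3.7):
    # Elementwise SCAD thresholding: argmin_s 0.5*(s - t)^2 + SCAD(s; lam, a).
    t = np.asarray(t, dtype=float)
    abs_t, sgn = np.abs(t), np.sign(t)
    soft = sgn * np.maximum(abs_t - lam, 0.0)            # |t| <= 2*lam: soft threshold
    mid = ((a - 1.0) * t - sgn * a * lam) / (a - 2.0)    # 2*lam < |t| <= a*lam
    return np.where(abs_t <= 2.0 * lam, soft,
                    np.where(abs_t <= a * lam, mid, t))  # |t| > a*lam: identity
```

The three regimes give the nearly unbiased behavior discussed above: small inputs are soft-thresholded, intermediate inputs are shrunk less aggressively, and large inputs pass through unchanged.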
3. Proposed algorithm and convergence analysis. To apply the ALM, we first introduce two auxiliary variables L and s and reformulate (4) as

min_{L,s,B,γ} (1/2) Σ_{i=1}^n ( y_i − ⟨X_i, B⟩ − ⟨z_i, γ⟩ )² + µ ‖L‖_* + ν φ(s)  s.t.  L = B, s = γ.    (6)

The auxiliary variable L is used to free B from the nonsmooth term ‖·‖_*, and the variable s frees γ from the nonsmooth term φ(·). The augmented Lagrangian function of (6) is defined as

L_β(L, s, B, γ; W, w) = (1/2) Σ_{i=1}^n ( y_i − ⟨X_i, B⟩ − ⟨z_i, γ⟩ )² + µ ‖L‖_* + ν φ(s) + ⟨W, L − B⟩ + (β_1/2) ‖L − B‖²_F + ⟨w, s − γ⟩ + (β_2/2) ‖s − γ‖²,

in which W ∈ R^{m×q} and w ∈ R^p are the Lagrangian multipliers and β_1, β_2 > 0 are the penalty parameters.
3.1. Proposed algorithm. Under the framework of the ALM, the minimization of L_β with respect to each variable is carried out in the following iterative scheme. We now show that each subproblem of the ALM either has a closed-form solution or can be solved by a fast solver. The subproblems of L_β with respect to L and s are separable, so they can be computed in parallel.
• L-subproblem: L^{k+1} = argmin_L µ ‖L‖_* + (β_1/2) ‖L − B^k + W^k/β_1‖²_F, which, by the singular value thresholding lemma above, is solved by

L^{k+1} = U Diag(shrink(σ, µ/β_1)) V^T,  where  U Diag(σ) V^T = B^k − W^k/β_1.    (7)

• s-subproblem: s^{k+1} = argmin_s ν φ(s) + (β_2/2) ‖s − γ^k + w^k/β_2‖², whose solution, by the shrinkage lemma above, is

s^{k+1} = shrink_{2,φ}(γ^k − w^k/β_2, ν/β_2).    (8)

The subproblem of L_β associated with B and γ is

(B^{k+1}, γ^{k+1}) = argmin_{B,γ} (1/2) Σ_{i=1}^n ( y_i − ⟨X_i, B⟩ − ⟨z_i, γ⟩ )² + (β_1/2) ‖L^{k+1} − B + W^k/β_1‖²_F + (β_2/2) ‖s^{k+1} − γ + w^k/β_2‖².    (9)

For convenience of description, we denote X := (vec(X_1), · · · , vec(X_n))^T, y := (y_1, y_2, · · · , y_n)^T, and Z := (z_1, · · · , z_n)^T, so that the loss can be written in terms of X vec(B) + Zγ. Note that the variables B and γ are not separable, so they must be computed jointly. By grouping vec(B) and γ, the first-order optimality conditions give the linear system

(X^T X + β_1 I) vec(B) + X^T Z γ = X^T y + β_1 vec(L^{k+1}) + vec(W^k),
Z^T X vec(B) + (Z^T Z + β_2 I) γ = Z^T y + β_2 s^{k+1} + w^k.

The coefficient blocks β_1 I + X^T X and β_2 I + Z^T Z are nonsingular whenever β_1, β_2 > 0, so the system can be solved via the Cholesky decomposition or the conjugate gradient method.
Finally, the multipliers are updated by

W^{k+1} = W^k + τ β_1 (L^{k+1} − B^{k+1}),  w^{k+1} = w^k + τ β_2 (s^{k+1} − γ^{k+1}).    (10)

We present our proposed ALM for solving model (6) (equivalently, (4)) in Algorithm 1.
Step 1: Compute (L^{k+1}, s^{k+1}) according to (7) and (8); Step 2: Compute (B^{k+1}, γ^{k+1}) according to (9); Step 3: Update (W^{k+1}, w^{k+1}) via (10). end while

3.2. Convergence analysis. In this section, we prove the convergence of Algorithm 1 in detail. We first present the first-order optimality condition of (4) at a local minimizer (B, γ), which is given by

0 ∈ Σ_{i=1}^n ( ⟨X_i, B⟩ + ⟨z_i, γ⟩ − y_i ) X_i + µ ∂‖B‖_*,
0 ∈ Σ_{i=1}^n ( ⟨X_i, B⟩ + ⟨z_i, γ⟩ − y_i ) z_i + ν ∂φ(γ).    (11)

In this paper, we say that (B^*, γ^*) is a stationary point of (4) if (B^*, γ^*) satisfies (11) in place of (B, γ). Then we present the first-order optimality conditions of the subproblems in Algorithm 1, which will be used repeatedly in our convergence analysis below.
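A self-contained sketch of Algorithm 1 in Python (our experiments use MATLAB): plain soft-thresholding stands in for the nonconvex operator shrink_{2,φ}, the multiplier sign conventions follow our Lagrangian above, and a dense solve replaces the Cholesky/CG solver, so treat every detail as an assumption rather than the reference implementation.

```python
import numpy as np

def svt(T, thresh):
    # Singular value soft-thresholding for the L-subproblem (7).
    U, sig, Vt = np.linalg.svd(T, full_matrices=False)
    return U @ np.diag(np.maximum(sig - thresh, 0.0)) @ Vt

def soft(t, thresh):
    # Stand-in for shrink_{2,phi} in (8); a nonconvex shrinkage (e.g. SCAD)
    # would be substituted here.
    return np.sign(t) * np.maximum(np.abs(t) - thresh, 0.0)

def alm_nonmm(X, Z, y, m, q, mu, nu, beta1, beta2, tau=1.0, iters=200):
    # ALM loop for model (6); rows of X are vec(X_i), rows of Z are z_i.
    n, p = Z.shape
    B, W = np.zeros((m, q)), np.zeros((m, q))
    gamma, w = np.zeros(p), np.zeros(p)
    # Coefficient matrix of the joint (B, gamma) normal equations (9).
    A = np.block([[X.T @ X + beta1 * np.eye(m * q), X.T @ Z],
                  [Z.T @ X, Z.T @ Z + beta2 * np.eye(p)]])
    for _ in range(iters):
        # Step 1: separable L- and s-subproblems, closed form.
        L = svt(B - W / beta1, mu / beta1)
        s = soft(gamma - w / beta2, nu / beta2)
        # Step 2: joint least-squares solve for (B, gamma).
        rhs = np.concatenate([X.T @ y + beta1 * L.ravel() + W.ravel(),
                              Z.T @ y + beta2 * s + w])
        sol = np.linalg.solve(A, rhs)
        B, gamma = sol[:m * q].reshape(m, q), sol[m * q:]
        # Step 3: multiplier updates (10).
        W = W + tau * beta1 * (L - B)
        w = w + tau * beta2 * (s - gamma)
    return B, gamma
```

The coefficient matrix A is positive definite for β_1, β_2 > 0 and does not change across iterations, so at scale one would factor it once (Cholesky) or apply conjugate gradients instead of the dense solve.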
Our convergence analysis is largely based on the following potential function Θ_{β,η}, where η is a tuning parameter and {(L^k, s^k, B^k, γ^k; W^k, w^k)} is a sequence generated by Algorithm 1. If β_1, β_2 satisfy some mild conditions, it is then easy to conclude the bound, where the third equality follows from (e) and (f). We then analyze the upper bound, where the last equality follows from (e), and it can be transformed accordingly. Hence, for k > 1, we consider two separate cases: 0 < τ ≤ 1 and 1 < τ < (1 + √5)/2. Here λ^X_max is the largest eigenvalue of XX^T and λ^Z_max is the largest eigenvalue of ZZ^T. The first inequality comes from the convexity of ‖·‖²_F, the second follows from the basic inequality (u + v)² ≤ 2u² + 2v², and the third stems from the properties of λ^X_max and λ^Z_max. We then add the corresponding Frobenius-norm term to both sides of the above inequality and simplify, where the last equality follows from (e) in (12).
Thus, for 0 < τ < (1 + √5)/2, combining (15) and (16) and recalling the definition of η in Θ_{β,η}, we obtain the desired bound. Similarly, from (d), we can derive the analogous estimate. Next, note that the function γ ↦ L_β(L^{k+1}, s^{k+1}, B^{k+1}, γ; W^k, w^k) is strongly convex with modulus at least λ^Z_min + β_2. Using this fact and the definition of γ^{k+1}, we obtain the corresponding descent inequality. Then, B ↦ L_β(L^{k+1}, s^{k+1}, B, γ^k; W^k, w^k) is also strongly convex with modulus at least λ^X_min + β_1, and an analogous inequality holds. Moreover, because s^{k+1} is a minimizer of the s-subproblem, the objective does not increase. Finally, L ↦ L_β(L, s^k, B^k, γ^k; W^k, w^k) is strongly convex with modulus at least β_1, giving one more descent inequality. Summing (17)–(22), we easily conclude that if β_1 ≥ β̄_1 and β_2 ≥ β̄_2, the potential function is decreasing. This completes the proof. Lemma 3.1 shows that the sequence {Θ_{β,η}(L^k, s^k, B^k, γ^k; W^k, w^k)}_{k=1}^∞ generated by Algorithm 1 is decreasing under these mild conditions; moreover, it is not hard to check that such thresholds exist. For notational simplicity, from now on, we write the conditions as β_1 ≥ β̄_1, β_2 ≥ β̄_2. We next show that the corresponding sequence {(L^k, s^k, B^k, γ^k; W^k, w^k)}_{k=1}^∞ is also bounded.
Proof. From Lemma 3.1, we know that the sequence {Θ_{β,η}(L^k, s^k, B^k, γ^k; W^k, w^k)}_{k=1}^∞ generated by Algorithm 1 is decreasing, so it is bounded above by Θ_{β,η}(L^1, s^1, B^1, γ^1; W^1, w^1). Thus, for k ≥ 1, we have the corresponding bound. We next derive the upper bound for ‖W^k‖²_F. We start by substituting (e) into (c) to obtain (25). We now consider two different cases: • For 0 < τ ≤ 1, it follows from the convexity of ‖·‖²_F and (25) that the bound holds, where the equality follows from (e) and the last inequality comes from the basic inequality property; we then obtain (26). • For 1 < τ < (1 + √5)/2, dividing both sides of (25) by τ, and then, since 0 < 1/τ < 1, using the convexity of ‖·‖²_F and (e), we obtain (27). Thus, combining (26) and (27), we obtain (28). Similarly, for ‖w^k‖²_2, we can derive (29). Putting (28) and (29) into (24), we obtain (30). With (30) established, we are now ready to prove the boundedness of the sequence. We start with the observation that for 0 < τ < (1 + √5)/2, β_1 ≥ β̄_1, β_2 ≥ β̄_2, and the definition of η, we always have a uniform lower bound. This, together with the fact that ‖L^k‖_* and φ(s^k) are nonnegative, shows that all these terms are bounded, from which we see immediately that {L^k} and {s^k} are bounded, and then {B^k} and {γ^k} are also bounded. The boundedness of {W^k} and {w^k} now follows from (28) and (29). We conclude that the sequence {(L^k, s^k, B^k, γ^k; W^k, w^k)}_{k=1}^∞ is bounded, and the proof is complete.
Before stating the convergence result, we also need the following lemma.
Summing (23) from k = 1 to k = k_i − 1, we obtain a bound on Θ_{β,η}(L^{k_i}, s^{k_i}, B^{k_i}, γ^{k_i}; W^{k_i}, w^{k_i}) − Θ_{β,η}(L^1, s^1, B^1, γ^1; W^1, w^1). Passing to the limit and rearranging terms in the resulting relation, together with c_1, we obtain the desired limit. Next, summing both sides of (17) from k = 1 to k = k_i and taking limits, we conclude (31). Similarly, we can also get (32) and (33). Finally, by combining (31), (32), (33), and (c)–(d) in (12), the proof is complete.
Based on the above three lemmas, we are now ready to prove our convergence result for Algorithm 1, which also characterizes the cluster points of the sequence. Theorem 3.4. Suppose that β_1 ≥ β̄_1, β_2 ≥ β̄_2 and {(L^k, s^k, B^k, γ^k; W^k, w^k)} is a sequence generated by Algorithm 1. Then any cluster point (L^*, s^*, B^*, γ^*; W^*, w^*) of the sequence is a stationary point of (4).
Proof. Since Θ_{β,η}(·) is lower semicontinuous, we have (34). On the other hand, from the definition of L^{k_i+1} and s^{k_i+1}, we obtain the corresponding inequality for Θ_{β,η}(L^{k_i+1}, s^{k_i+1}, B^{k_i}, γ^{k_i}; W^{k_i}, w^{k_i}). Taking limits on both sides of this inequality, we derive (35). Then, together with (34) and (35), we see that (36) holds. Thus, passing to the limit in (a)–(f) in (12) along {(L^{k_i}, s^{k_i}, B^{k_i}, γ^{k_i}; W^{k_i}, w^{k_i})}, invoking (36), and rearranging terms in the resulting relations, we immediately conclude that (L^*, s^*, B^*, γ^*; W^*, w^*) is a stationary point of (4). This completes the proof.
Remark. By Theorem 3.4, the penalty parameters β_1, β_2 should be chosen so that β_1 ≥ β̄_1, β_2 ≥ β̄_2; it is not hard to check that such thresholds exist. In fact, if β_1, β_2 are too large, the ALM will be very slow, because β_1, β_2 are associated with the step size. In practice, we can apply an incremental strategy. Specifically, we initialize with small β_1, β_2 (less than β̄_1, β̄_2), and then increase β_1, β_2 by a constant ratio until β_1 ≥ β̄_1, β_2 ≥ β̄_2. Clearly, after at most finitely many steps, the penalty parameters get above the thresholds β̄_1, β̄_2 and the convergence of the resulting algorithm is guaranteed.

4. Numerical experiments. In this section, we conduct experiments on synthetic data and real-world data to demonstrate the superiority of our proposed NonMM (4) over the LMM (2) and the LSMM (3). For the sake of fairness, we run each algorithm 100 times to illustrate its performance. All the experiments are performed in MATLAB (R2018b) on a laptop with an Intel Core i7 CPU at 1.4 GHz and 16 GB of memory.
For the LMM, the results are generated by the source codes released by its authors, with all parameters set to their defaults. For the LSMM, we implement it ourselves. For our proposed NonMM, we choose φ to be the SCAD with a = 3.7 for the nonconvex regularization. As is well known, the accuracy of the solution depends on the latent parameters λ, which we set empirically using constants 0 < α_1 < 1 and α_2 > 0. In addition, for all tested algorithms, the maximum iteration number is set to 50000, and the stopping criterion is Tol < 10^{−4}.

4.1. Comparison results for synthetic data. In this subsection, we test synthetic data to compare the LMM, the LSMM, and the NonMM. Following [31], we first generate matrix covariates X and vector covariates z, both containing pseudorandom values drawn from the standard normal distribution. Then, we generate the true matrix coefficient B = B_1 B_2^T, where B_1 ∈ R^{m×r} and B_2 ∈ R^{q×r} also have standard normal entries and r controls the rank of B; thus we obtain a rank-r matrix B. In addition, we randomly generate the true coefficient γ whose elements are drawn from a binomial distribution with parameters 1 and 0.5, with the percentage of nonzero entries controlled by a sparsity level s. Finally, we obtain the observations y via y = ⟨X, B⟩ + ⟨z, γ⟩ + ε with ε ∼ N(0, σ²I). We test the cases r = 5, 10, 15, 20, 30 and s = 0.01, 0.05, 0.1, 0.2, 0.5 for (m, n, p, q) = (100, 50, 20000, 30). Table 1 presents the comparison results for r = 5, 10, 15, 20, 30, where RMSE denotes the root mean squared error and the best method for each test case is highlighted in boldface. Two observations can be made. (1) For all r = 5, 10, 15, 20, 30 and s = 0.01, 0.05, 0.1, 0.2, 0.5, the NonMM always performs best in terms of RMSE(B) and RMSE(γ), which illustrates that our proposed method is the most effective. For example, for r = 5 and s = 0.5, the gain in RMSE(B) is 0.1139 over the LMM and 0.1731 over the LSMM, and the gain in RMSE(γ) is 0.0279 over the LMM and 0.8776 over the LSMM. (2) Compared with the LMM, the LSMM and the NonMM reduce RMSE(γ) dramatically, because they apply the sparse regularization. Moreover, compared with the LSMM, the NonMM always achieves a smaller RMSE(γ); this superiority is due to the nonconvex regularization, which exploits the structural property.
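The generation procedure above can be sketched as follows; the exact interplay between the Bernoulli draws and the sparsity level s is our reading of the description, so the masking step and the function name are assumptions.

```python
import numpy as np

def make_synthetic(n, m, q, p, r, s, sigma=0.1, seed=0):
    # Generate (X_i, z_i, y_i) with a rank-r coefficient B and a sparse gamma.
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((m, r)) @ rng.standard_normal((q, r)).T  # rank <= r
    gamma = rng.binomial(1, 0.5, size=p).astype(float)  # Bernoulli(0.5) entries
    gamma[rng.random(p) >= s] = 0.0                     # keep ~s fraction nonzero
    X = rng.standard_normal((n, m, q))                  # matrix covariates X_i
    Z = rng.standard_normal((n, p))                     # vector covariates z_i
    # y_i = <X_i, B> + <z_i, gamma> + noise
    y = np.einsum('nij,ij->n', X, B) + Z @ gamma + sigma * rng.standard_normal(n)
    return X, Z, y, B, gamma
```

The product of two standard normal factors gives a generic rank-r matrix, matching the B = B_1 B_2^T construction, and the Gaussian noise term plays the role of ε ∼ N(0, σ²I).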

4.2. Comparison results for real-world data. In this subsection, we test trip-time prediction from partial trajectories. Estimating travel time for taxis is of great importance for electronic dispatch systems. The dataset describes the trajectories performed by the 442 taxis running in the city of Porto over a period of one year. We choose a partial dataset with 7733 trajectories. It also contains 7 regular variates: trip id, call type, origin call, origin stand, taxi id, day type, and start timestamp. Trip id is a unique identifier for each trip. Call type identifies the way the service was demanded, with three types: dispatched from the central, demanded directly from a taxi driver at a specific stand, and otherwise (demanded on a random street). Origin call is a unique identifier for each phone number used to demand the service. Origin stand is a unique identifier for the taxi stand. Taxi id is a unique identifier for the 404 taxi drivers that performed the trips. Day type identifies the type of the trip's start day: a holiday or other special day, a day before a holiday, and otherwise. Start timestamp identifies the trip's start. Thus, X is a 922 × 2 matrix and z is a 7-dimensional vector of regular covariates. Furthermore, we removed 32 trajectories due to missing coordinate observations. Our aim is to extract features to predict the taxi's travel time for complete journeys. We summarize the comparison results for RMSE(y) in Table 2, where the data are divided into training and testing samples using 5-fold or 10-fold cross-validation. The results show that the NonMM enjoys a smaller RMSE(y) than the LMM and the LSMM.

5. Conclusion. In this paper, we propose a novel nonconvex mixed matrix minimization model. The key feature of our model is the nonconvex regularization, which can achieve much higher accuracy than the convex low-rank and sparse regularizations. We also demonstrate that the proposed nonconvex mixed matrix minimization can be solved efficiently by the nonconvex ALM, with a convergence analysis. The accuracy and efficiency of the model are verified by numerical examples. Naturally, many questions deserve deeper exploration. For future research, an interesting direction is to integrate nonconvex functions with the nuclear norm, which may yield further improvement.