Fused LASSO penalized least absolute deviation estimator for high dimensional linear regression

The least absolute shrinkage and selection operator (LASSO) plays an important role in variable selection and dimensionality reduction for high dimensional linear regression under zero-mean or Gaussian assumptions on the noise. However, these assumptions may not hold in practice, in which case the least absolute deviation is a popular and useful method. In this paper, we focus on the least absolute deviation via Fused LASSO, called Robust Fused LASSO, under the assumption that the unknown vector is sparse both in its coefficients and in its successive differences. The Robust Fused LASSO estimator requires no knowledge of the standard deviation of the noise and no moment assumptions on the noise. We show that the Robust Fused LASSO estimator possesses near oracle performance, i.e., with large probability, the $\ell_2$ norm of the estimation error is of order $O(\sqrt{k(\log p)/n})$. The result holds for a wide range of noise distributions, even for the Cauchy distribution. In addition, we apply the linearized alternating direction method of multipliers to find the Robust Fused LASSO estimator, and the method is globally convergent. Numerical results are reported to demonstrate the efficiency of our proposed method.


1. Introduction. High dimensional data are frequently collected in a large variety of research areas such as information technology, genomics, functional magnetic resonance imaging, medical imaging and diagnosis, and finance [5,15,18]. Analysis of high dimensional data poses many challenges for statisticians, and calls for new methods and theories for the high dimensional linear regression model in which the number $n$ of observations is much less than the number $p$ of unknown coefficients. For a high dimensional linear regression problem, a key assumption is the sparsity of the true coefficient vector; that is, the number $k$ of nonzero elements in the true regression coefficients is small relative to $p$. This is the so-called high dimensional sparse linear model. The ordinary least squares method is not consistent in this setting, since the problem is ill-posed and least squares need not lead to a sparse solution. Many other methods have been proposed to solve this problem. In particular, methods based on $\ell_1$ penalized least squares or constrained $\ell_1$ minimization have been extensively studied. A remarkable model is the least absolute shrinkage and selection operator (LASSO) proposed originally by Tibshirani [20]. The LASSO model has inspired a number of impressive articles [8,11,23,31,32], which established the fundamental result that $\ell_1$ penalized least squares estimators achieve the rate $O(\sqrt{k(\log p)/n})$, very close to the oracle rate $O(\sqrt{k/n})$. However, the LASSO ignores the ordering of the features. To overcome this problem, an interesting generalization of the LASSO, the Fused LASSO (FLASSO) model, was introduced by Tibshirani et al. [21]; it promotes sparsity both in the coefficients and in their successive differences. Some asymptotic properties of the FLASSO model have been established in Tibshirani et al. [21] and Rinaldo [19]. They also suggest reformulating it as a second-order cone program or a quadratic program, and then applying standard convex optimization tools, see, e.g., [9,21,22]. Multivariate adaptive regression splines (MARS) is also a popular method, see, e.g., [7,12,17,28]. Due to its extremely expensive computation, this approach is generally not applicable to large scale data sets. Li et al. [13] proposed the linearized alternating direction method of multipliers (LADMM) to solve the FLASSO model, and numerical experiments demonstrated its efficiency.
Note that the LASSO and FLASSO methods have nice properties under zero-mean and known-variance, or Gaussian, assumptions on the noise. However, these assumptions may not hold in practice, and estimating the standard deviation is not easy. The least absolute deviation (LAD) method was therefore proposed for data subject to heavy-tailed errors or outliers, which are commonly encountered in applications; see, e.g., [24,25,26]. Based on the LAD model, Wang et al. [24] combined the usual LAD criterion with a LASSO-type penalty to produce the LAD-LASSO method. Compared with LAD regression, LAD-LASSO performs parameter estimation and variable selection simultaneously. In contrast to the LASSO, LAD-LASSO is robust against heavy-tailed errors and outliers in the response. Furthermore, Wang [25] showed that the LAD-LASSO method achieves near oracle performance under some mild conditions, which implies that the LAD-LASSO estimator enjoys the same asymptotic efficiency as the oracle estimator in the high-dimensional setting. However, it cannot handle sparsity in both the coefficients and their successive differences. Motivated by the above analysis, and in order to accommodate this more complex sparsity of the successive differences under non-Gaussian assumptions, a natural question arises: can we propose a new model by combining the usual LAD criterion with the FLASSO-type penalty? This paper focuses on this issue and gives an affirmative answer. That is, we develop the least absolute deviation via Fused LASSO, called Robust Fused LASSO (RFLASSO), which replaces the least squares criterion by the least absolute deviation in the FLASSO model. We consider a choice of the penalty level for the RFLASSO that does not depend on any unknown parameters or on the noise distribution. Our analysis shows that, with high probability, the RFLASSO estimator has near oracle performance, as does the LAD-LASSO estimator. As with the LAD-LASSO, we impose no moment assumptions on the noise; we only need a scale parameter to control the tail probability of the noise. The result holds for a wide range of noise distributions, even for Cauchy distributed noise, whose first moment does not exist. Moreover, we introduce the LADMM to find the estimator of the RFLASSO model, and show that the global convergence result is maintained for the proposed method. Numerical experiments show that the simple LADMM is quite efficient for solving the RFLASSO model.
The rest of this paper is organized as follows. We introduce the RFLASSO model for the high dimensional linear model and discuss its statistical properties in Section 2. In Section 3 we give the optimization method for the RFLASSO model by applying the LADMM and prove its convergence. Section 4 reports numerical results to demonstrate the efficiency of the proposed method. Finally, we draw some conclusions in Section 5.
2. Statistical Properties of the RFLASSO Method. We first introduce the RFLASSO model for high dimensional linear regression, and then we discuss the choice of the penalty level and the near oracle property of the RFLASSO estimator.
We begin with the following high dimensional linear regression model:

$$ y = X\beta + \varepsilon, \qquad (1) $$

where $y = (y_1, y_2, \cdots, y_n)^T \in \mathbb{R}^n$ is a response vector, $X = (x_1, x_2, \cdots, x_n)^T \in \mathbb{R}^{n\times p}$ is a fixed design matrix, $\beta = (\beta_1, \beta_2, \cdots, \beta_p)^T \in \mathbb{R}^p$ is a regression coefficient vector, and $\varepsilon \in \mathbb{R}^n$ is an error vector. Throughout the paper, we assume the median of $\varepsilon_i$ is 0. Considering the ordering of the features, the FLASSO model was introduced by Tibshirani et al. [21] under the Gaussian assumption; it promotes sparsity both in the coefficients and in their successive differences. The FLASSO model is

$$ \min_{\beta\in\mathbb{R}^p} \frac{1}{2}\|y - X\beta\|_2^2 + \lambda_1\|\beta\|_1 + \lambda_2\|A\beta\|_1, \qquad (2) $$

where $A \in \mathbb{R}^{(p-1)\times p}$ is the first-difference matrix with $(A\beta)_i = \beta_{i+1} - \beta_i$ for $i = 1, \ldots, p-1$. However, for large scale data in the high dimensional setting, the Gaussian assumption may not hold in practice, and estimating the standard deviation is not easy. The LAD method was proposed for data subject to heavy-tailed errors or outliers, which are commonly encountered in applications. Motivated by the above work, in order to deal with big data that are sparse both in the coefficients and in their successive differences, we combine the LAD criterion and the FLASSO penalty. We define the Robust Fused LASSO model as

$$ \min_{\beta\in\mathbb{R}^p} \|y - X\beta\|_1 + \lambda_1\|\beta\|_1 + \lambda_2\|A\beta\|_1. \qquad (3) $$

Correspondingly, we define the FLASSO penalized LAD regression estimator (the RFLASSO estimator)

$$ \hat\beta \in \arg\min_{\beta\in\mathbb{R}^p} \left\{ \|y - X\beta\|_1 + \lambda_1\|\beta\|_1 + \lambda_2\|A\beta\|_1 \right\}. $$

We will study the statistical properties of the RFLASSO estimator under suitable regularity conditions, establishing an upper bound for the estimation error $h = \hat\beta - \beta^*$ in the $\ell_2$ norm, where $\beta^*$ is the true value in model (1). In what follows, for any set $E \subset \{1, 2, \ldots, p\}$ and vector $h \in \mathbb{R}^p$, let $h_E = hI(E)$ denote the $p$-dimensional vector that keeps the coordinates of $h$ whose indices are in $E$ and sets the others to 0. We assume $T = \operatorname{supp}(\beta^*)$ with cardinality $|T| = k < n$. The set $T$ of nonzero coefficients, i.e., of significant variables of $\beta^*$, is unknown.
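To make the objective in (3) concrete, the following is a minimal sketch in Python (NumPy) of the first-difference matrix $A$ and the RFLASSO objective; the function names and the use of NumPy are our own illustrative choices, not part of the original development.

```python
import numpy as np

def difference_matrix(p):
    """First-difference matrix A in R^{(p-1) x p} with (A b)_i = b_{i+1} - b_i."""
    return np.eye(p - 1, p, k=1) - np.eye(p - 1, p)

def rflasso_objective(beta, X, y, lam1, lam2):
    """RFLASSO objective (3): LAD loss plus the two fused LASSO penalties."""
    A = difference_matrix(len(beta))
    return (np.abs(y - X @ beta).sum()
            + lam1 * np.abs(beta).sum()
            + lam2 * np.abs(A @ beta).sum())
```

Both penalty terms are nonsmooth, as is the LAD loss itself, which is why the optimization method of Section 3 splits the three $\ell_1$ terms apart.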
2.1. Choice of Penalty. We discuss the choice of the penalty levels $\lambda_1$ and $\lambda_2$ for the RFLASSO estimator $\hat\beta$; this choice controls the estimation error with large probability and does not depend on any unknown parameters.
Let $S = X^T\operatorname{sign}(\varepsilon)$. Since $\|\lambda_1\operatorname{sign}(\beta)\|_\infty \le \lambda_1$ and $\|\lambda_2 A^T\operatorname{sign}(A\beta)\|_\infty \le 2\lambda_2$, we choose penalty levels $\lambda_1$ and $\lambda_2$ satisfying

$$ \lambda_1 + 2\lambda_2 \ge c\,\Lambda(1-\alpha \mid X), \qquad (6) $$

for a given constant $c > 1$ and a given $\alpha \in (0,1)$, where $\Lambda(1-\alpha \mid X)$ denotes the $(1-\alpha)$-quantile of $\|S\|_\infty$. When the distribution of $\operatorname{sign}(\varepsilon)$ is known, the distribution of $\|S\|_\infty$ is known for any given $X$ and does not depend on any unknown parameters. In the next subsection we give further analysis of this penalty level, which shows its connection with the error bound of the RFLASSO estimator.
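Since $\|S\|_\infty = \|X^T\operatorname{sign}(\varepsilon)\|_\infty$ is pivotal once the distribution of $\operatorname{sign}(\varepsilon)$ is known (for median-zero noise, each sign is $\pm 1$ with probability $1/2$), its quantile can be estimated by simulation. A minimal sketch under these assumptions; the function name and simulation size are illustrative:

```python
import numpy as np

def penalty_level(X, alpha=0.05, c=1.1, n_sim=1000, rng=None):
    """Estimate the (1 - alpha)-quantile of ||X^T sign(eps)||_inf by Monte Carlo
    and return c times that quantile as the combined level lambda_1 + 2*lambda_2."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    sims = np.empty(n_sim)
    for s in range(n_sim):
        signs = rng.choice([-1.0, 1.0], size=n)  # sign(eps) for median-zero noise
        sims[s] = np.abs(X.T @ signs).max()
    return c * np.quantile(sims, 1 - alpha)
```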
2.2. Near Oracle Property. We will show that the RFLASSO estimator possesses near oracle performance, i.e., with high probability the $\ell_2$ norm of the estimation error is of order $O(\sqrt{k(\log p)/n})$. To do so, we first present a useful lemma.
Proof. Since $\beta^*_{T^c} = 0$, we have

$$ \|\hat\beta\|_1 - \|\beta^*\|_1 \ge \|h_{T^c}\|_1 - \|h_T\|_1, \qquad (4) $$

and also

$$ \|A\hat\beta\|_1 - \|A\beta^*\|_1 \ge -\|Ah\|_1 \ge -2\|h\|_1. \qquad (5) $$

Since the subdifferential of $\|y - X\beta\|_1$ at $\beta^*$ is generated by $S = X^T\operatorname{sign}(\varepsilon)$, it is obvious that $\|y - X\hat\beta\|_1 - \|y - X\beta^*\|_1 \ge -S^T h$. It follows from (4) and (5), together with the optimality of $\hat\beta$ in (3), that for any $\lambda_1 > 0$, $\lambda_2 > 0$,

$$ \lambda_1\big(\|h_{T^c}\|_1 - \|h_T\|_1\big) - 2\lambda_2\|h\|_1 \le S^T h \le \|S\|_\infty\|h\|_1. $$

Thus, under the condition $\lambda_1 + 2\lambda_2 \ge c\|S\|_\infty$, which holds with probability at least $1-\alpha$ under (6), the error $h$ lies in the restricted set asserted by the lemma. Therefore we prove the assertion.
In order to study the near oracle property of the RFLASSO estimator, we also need to introduce some restricted eigenvalue concepts for the design matrix $X$. First, we define some important quantities of the design matrix $X$ based on the $\ell_2$ norm. Set

$$ \lambda^u_k = \max_{\|d\|_0 \le k,\, d\neq 0} \frac{\|Xd\|_2^2}{\|d\|_2^2}, \qquad \lambda^l_k = \min_{\|d\|_0 \le k,\, d\neq 0} \frac{\|Xd\|_2^2}{\|d\|_2^2}, $$

where $\|d\|_0 \le k$ means $d \in \mathbb{R}^p$ is a $k$-sparse vector with at most $k$ nonzero coordinates. Similarly, set

$$ \theta_{k_1,k_2} = \max\left\{ \frac{|\langle Xd, Xd'\rangle|}{\|d\|_2\|d'\|_2} : \|d\|_0 \le k_1,\ \|d'\|_0 \le k_2,\ \operatorname{supp}(d)\cap\operatorname{supp}(d') = \emptyset \right\}. $$

The above quantities $\lambda^u_k$, $\lambda^l_k$, $\theta_{k_1,k_2}$ are related to the sparse recovery conditions in compressed sensing, an interesting research area with many applications. For instance, [3] introduced a restricted isometry property of a sensing matrix which guarantees recovery of a sparse signal via convex relaxation; see, e.g., [2,3,4] for more details. Since we focus on the LAD loss with the fused penalty, we additionally define a restricted eigenvalue of the design matrix $X$ based on the $\ell_1$ norm, as in [25]:

$$ K^l_k(t) = \inf\left\{ \frac{\|Xh\|_1}{n\|h\|_2} : h \neq 0,\ \|h_{T^c}\|_1 \le t\|h_T\|_1,\ |T| \le k \right\}. $$

We write $K^l_k(t)$ as $K^l_k$ for short when this causes no confusion. To bound the estimation error of the RFLASSO model, we introduce a scale assumption on the errors $\varepsilon_i$: suppose there exists a constant $a > 0$ such that

$$ P(0 \le \varepsilon_i \le x) \ge \frac{ax}{4} \quad\text{and}\quad P(-x \le \varepsilon_i \le 0) \ge \frac{ax}{4} \quad\text{for all } 0 \le x \le \frac{1}{a}. \qquad (8) $$

Here $a > 0$ is a scale parameter of the distribution of $\varepsilon_i$. This is a weak condition; for instance, the Cauchy distribution satisfies it, see [25] for more details. This is an advantage over traditional methods, which rely significantly on the Gaussian assumption and on the variance of the errors.
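For intuition, the $k$-sparse extreme eigenvalues $\lambda^u_k$ and $\lambda^l_k$ can be computed by brute force on toy problems (feasible only for small $p$, since the number of supports grows combinatorially). A minimal illustrative sketch, not from the original development:

```python
import numpy as np
from itertools import combinations

def sparse_eigs(X, k):
    """Brute-force lambda^u_k and lambda^l_k: extreme values of
    ||X d||_2^2 / ||d||_2^2 over k-sparse d, via per-support eigenvalues."""
    p = X.shape[1]
    G = X.T @ X
    lo, hi = np.inf, -np.inf
    for S in combinations(range(p), k):
        vals = np.linalg.eigvalsh(G[np.ix_(S, S)])  # eigenvalues of the submatrix
        lo, hi = min(lo, vals[0]), max(hi, vals[-1])
    return hi, lo  # (lambda^u_k, lambda^l_k)
```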
We are ready to establish our main result.
Theorem 2.2. Consider model (1), and assume that $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ are independent and identically distributed random variables satisfying (8) and that the penalty levels satisfy (6). Then, with high probability,

$$ \|\hat\beta - \beta^*\|_2 = O\Big(\sqrt{k(\log p)/n}\Big). $$

Proof. Without loss of generality, assume that the coordinates of $h = \hat\beta - \beta^*$ on $T^c$ are ordered so that their magnitudes are decreasing, and partition $\{1, 2, \ldots, p\}$ into $T$ and consecutive blocks $T_1, T_2, \ldots$ of size $k$. Writing $\lambda = \lambda_1 + 2\lambda_2$, Lemma 2.1 bounds $\|h_{T^c}\|_1$ by a multiple of $\|h_T\|_1$, and it is easy to see that the blockwise norms $\|h_{T_i}\|_2$, $i \ge 1$, are then controlled by $\|h_T\|_2$. Now, for any fixed vector $d$, by Lemma 3 of [25], the empirical LAD process evaluated at $d$ concentrates around its expectation with the stated probability; again by Lemma 3 of [25], the same holds for each block $i \ge 1$, where $C_1 = 1 + 2C_2\lambda^u_k$ and $C_2 > 1$ is a constant. Combining the above inequalities with (4) and (9), we obtain, with high probability, a lower bound on $\|y - X\hat\beta\|_1 - \|y - X\beta^*\|_1$ in terms of $\|Xh\|_1$. Next, we consider two cases. First, if $\|Xh\|_1 \ge 3n/a$, then Lemmas 4 and 5 of [25] together with assumption (8) give a lower bound that is linear in $\|Xh\|_1$. On the other hand, if $\|Xh\|_1 \le 3n/a$, then from Lemmas 4 and 5 of [25], and by the same argument as in the proofs of Theorems 3.1 and 3.2 in [4], the lower bound is quadratic in $\|h\|_2$. Hence, combining (12) and (14), we conclude that with high probability $\|h\|_2 = O(\sqrt{k(\log p)/n})$. Thus we complete the proof.
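To see heuristically where the rate in Theorem 2.2 comes from, note that the penalty level scales as $\lambda = \lambda_1 + 2\lambda_2 \asymp \sqrt{n\log p}$; balancing a restricted-strong-convexity-type lower bound on the loss difference against the penalty contribution, which grows at most like $\lambda\|h_T\|_1 \le \lambda\sqrt{k}\,\|h\|_2$, gives the following back-of-the-envelope sketch under our reconstruction, not the formal argument of the proof:

$$ \kappa\, n\, \|h\|_2^2 \;\lesssim\; \lambda\sqrt{k}\,\|h\|_2 \quad\Longrightarrow\quad \|h\|_2 \;\lesssim\; \frac{\lambda\sqrt{k}}{\kappa n} \;\asymp\; \sqrt{\frac{k\log p}{n}}, $$

where $\kappa > 0$ is a restricted eigenvalue type constant.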
Notice that, from the above proof and (6), if we replace $\|S\|_\infty$ with $\sqrt{2A(\alpha)\,n\log p}$, we obtain a valid penalty choice for any $\lambda_1 > 0$, $\lambda_2 > 0$ with $\lambda_1 + 2\lambda_2 \ge c\sqrt{2A(\alpha)\,n\log p}$. It then follows from Lemma 1 in [25] that the event $\|S\|_\infty \le \sqrt{2A(\alpha)\,n\log p}$ holds with probability at least $1 - 2p^{1-A(\alpha)}$. Similarly, setting $A(\alpha) = 2$ yields a fully explicit penalty level. Hence, we can easily obtain the following corollary.
Corollary 1. Consider model (1), and assume that $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ are independent and identically distributed random variables satisfying (8), and that the penalty levels are chosen at the explicit level above so that the conditions of Theorem 2.2 hold. Then, with high probability, the RFLASSO estimator $\hat\beta$ satisfies

$$ \|\hat\beta - \beta^*\|_2 = O\Big(\sqrt{k(\log p)/n}\Big). $$

This shows that the RFLASSO penalized least absolute deviation estimator has near oracle performance.

3. Optimal Algorithm. We will apply the linearized alternating direction method of multipliers (LADMM) to find the RFLASSO estimator; the global convergence result is also given. This approach is inspired by the LADMM for the least squares problem with the FLASSO penalty [13].
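Since the update formulas are not reproduced above, the following is a minimal self-contained sketch of one plausible LADMM scheme for (3), based on the splitting $X\beta + \tau = y$, $A\beta - \gamma = 0$ used in the convergence analysis below; the parameter names ($\mu$, $\nu$, $\eta$) follow the numerical section, but the update order, initialization, and stopping rule are our illustrative choices rather than the exact algorithm of the paper.

```python
import numpy as np

def soft(v, t):
    """Elementwise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def rflasso_ladmm(X, y, lam1, lam2, mu=0.5, nu=0.5, max_iter=1000, tol=2.5e-3):
    n, p = X.shape
    A = np.eye(p - 1, p, k=1) - np.eye(p - 1, p)      # first-difference matrix
    # eta must exceed the spectral radius of mu*X'X + nu*A'A
    eta = 1.01 * np.linalg.norm(mu * X.T @ X + nu * A.T @ A, 2)
    beta = np.ones(p); gam = np.ones(p - 1); tau = np.ones(n)
    alpha = np.ones(p - 1); theta = np.ones(n)        # Lagrange multipliers
    for _ in range(max_iter):
        # beta-step: linearize the augmented quadratic terms at the current beta
        r1 = y - X @ beta - tau                       # residual of X*beta + tau = y
        r2 = A @ beta - gam                           # residual of A*beta - gamma = 0
        grad = X.T @ (theta - mu * r1) + A.T @ (nu * r2 - alpha)
        beta_new = soft(beta - grad / eta, lam1 / eta)
        # gamma-step and tau-step: exact proximal (soft-thresholding) updates
        gam = soft(A @ beta_new - alpha / nu, lam2 / nu)
        tau = soft(y - X @ beta_new - theta / mu, 1.0 / mu)
        # multiplier updates
        theta -= mu * (y - X @ beta_new - tau)
        alpha -= nu * (A @ beta_new - gam)
        rel_err = np.linalg.norm(beta_new - beta) / max(np.linalg.norm(beta), 1.0)
        beta = beta_new
        if rel_err < tol:
            break
    return beta
```

The $\beta$-step avoids inverting $\mu X^TX + \nu A^TA$ by linearizing the quadratic at the current iterate, which is precisely what makes this class of methods attractive for large $p$.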

3.1. Convergence Analysis. In this subsection, we establish the convergence of the aforementioned LADMM algorithm. Our analysis is motivated by existing work in different settings [1,6,10,16,27,29,30]. While the existing work deals with the FLASSO penalized least squares model, we focus on the FLASSO penalized least absolute deviation setting, based on the variational characterization of the RFLASSO model.
The Lagrange function of the RFLASSO model (19), the constrained reformulation of (3) with $\tau = y - X\beta$ and $\gamma = A\beta$, is

$$ L(\beta, \gamma, \tau, \theta, \alpha) = \|\tau\|_1 + \lambda_1\|\beta\|_1 + \lambda_2\|\gamma\|_1 - \theta^T(y - X\beta - \tau) - \alpha^T(A\beta - \gamma), $$

where $\theta \in \mathbb{R}^n$ and $\alpha \in \mathbb{R}^{p-1}$ are the Lagrange multipliers.
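For reference, the augmented Lagrangian that the LADMM works with adds quadratic penalty terms with parameters $\mu, \nu > 0$, and the $\beta$-subproblem is linearized at the current iterate with proximal parameter $\eta > \rho(\mu X^TX + \nu A^TA)$; the following displays are our reconstruction of this standard scheme, not copied from the original text:

$$ L_{\mu,\nu}(\beta,\gamma,\tau,\theta,\alpha) = L(\beta,\gamma,\tau,\theta,\alpha) + \frac{\mu}{2}\|y - X\beta - \tau\|_2^2 + \frac{\nu}{2}\|A\beta - \gamma\|_2^2, $$

and the linearized $\beta$-update takes the closed form

$$ \beta^{k+1} = \operatorname{shrink}\!\left(\beta^k - \frac{1}{\eta}\, g^k,\ \frac{\lambda_1}{\eta}\right), \qquad g^k = \nabla_\beta\big[L_{\mu,\nu} - \lambda_1\|\beta\|_1\big](\beta^k), $$

where $\operatorname{shrink}(v, t)_i = \operatorname{sign}(v_i)\max\{|v_i| - t, 0\}$ is the soft-thresholding operator.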
Lemma 3.2. Let $\{w^k\}$ be the sequence generated by the LADMM algorithm. Then the inequality (37) holds for any $w^* \in S^*$.

Proof. Let $w^*$ be an arbitrary solution point in $S^*$. The inequality (30) in Lemma 3.1 yields (38). Since $w^* \in S^*$, we have $y - X\beta^* - \tau^* = 0$ and $A\beta^* - \gamma^* = 0$, so (38) leads to (39). On the other hand, the convexity of $\|\cdot\|_1$ implies that the mapping $F(w)$ defined in (28) is monotone. We thus have (40). Replacing $(w^{k+1} - w^*)$ by $(w^{k+1} - w^k) + (w^k - w^*)$ in (40), we obtain (37) for any $w^* \in S^*$.

Combining the above two lemmas, we can now show that the sequence $\{w^k\}$ generated by the LADMM algorithm is contractive with respect to the solution set $S^*$.

Lemma 3.3. Let $\{w^k\}$ be the sequence generated by the LADMM algorithm. Then the contraction inequality (42) holds for any $w^* \in S^*$.

Proof. It follows from (37) in Lemma 3.2 that (42) holds for all $w^* \in S^*$ once the cross terms are shown to vanish. As shown above, $\lambda_2 g(\gamma^k) + \alpha^k = 0$ and $g_3(\tau^k) + \theta^k = 0$ for every $k$; therefore (43) and (44) hold, and inserting (43) and (44) into (42) proves the assertion.

Lemma 3.3 implies that the sequence generated by the LADMM algorithm is contractive with respect to the solution set $S^*$. The assertions summarized in the following corollary are standard consequences of Lemma 3.3.

Corollary 2. Let $\{w^k\}$ be the sequence generated by the LADMM algorithm. Then $\{w^k\}$ is bounded, and $\|w^k - w^{k+1}\| \to 0$ as $k \to \infty$.

We now show the convergence of the LADMM algorithm based on Corollary 2.
4. Numerical Results. In this section, we apply the proposed LADMM algorithm to solve the RFLASSO model and compare it with the SLEP package [14]. The numerical experiments were conducted on a desktop computer with an Intel(R) Core(TM) i5 2.80 GHz CPU and 3.88 GB of RAM, running Windows 8.1. Our code implementing the LADMM is written in MATLAB (R2014a) (MathWorks, Natick, MA).
For the implementation, we now address the initialization and termination criteria of the method. The initial iterate is taken as $(\beta^0, \gamma^0, \tau^0, \alpha^0, \theta^0) = (1, 1, 1, 1, 1)$, with each component an all-ones vector of the appropriate dimension. For the tested algorithms, the maximum iteration number is set to 1000, and the stopping criterion is RelErr $< 2.5\times 10^{-3}$, where RelErr measures the relative change between successive iterates. We first describe how the synthetic data are generated. A Gaussian matrix $X \in \mathbb{R}^{n\times p}$ with unit column norms is generated randomly. To generate a sparse solution whose successive coefficients are closely related, we divide the $p$ coordinates into 80 groups and randomly select 10 groups, denoted by the sample set $G$ with cardinality $g$. The coefficient vector $\beta^*$ is then generated by assigning each selected group a common value drawn from $U[-3, 3]$ and setting all other entries to zero, where $U[-3, 3]$ denotes the uniform distribution on the interval $[-3, 3]$. Finally, we obtain the observations by $y = X\beta^* + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2 I)$.
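A minimal sketch of this data-generation recipe in Python (NumPy); assigning a common $U[-3,3]$ value within each selected group is our assumption for producing coefficients with sparse successive differences:

```python
import numpy as np

def make_data(i=1, sigma=0.001, n_groups=80, n_active=10, rng=None):
    """Synthetic data: n x p Gaussian design with unit column norms and a
    piecewise-constant sparse coefficient vector (assumed group structure)."""
    rng = np.random.default_rng(rng)
    n, p = 360 * i, 1280 * i
    X = rng.standard_normal((n, p))
    X /= np.linalg.norm(X, axis=0)                 # unit column norms
    beta = np.zeros(p)
    group_size = p // n_groups                     # 16*i coordinates per group
    active = rng.choice(n_groups, n_active, replace=False)
    for gidx in active:                            # common value per active group
        beta[gidx * group_size:(gidx + 1) * group_size] = rng.uniform(-3, 3)
    y = X @ beta + sigma * rng.standard_normal(n)  # Gaussian noise N(0, sigma^2 I)
    return X, y, beta
```

Note that 10 active groups of size $16i$ give exactly $g = 160i$ nonzero coefficients, matching the setting $(n, p, g) = (360i, 1280i, 160i)$ below.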
We test the cases $\sigma = 0.001$, 0.005 and 0.01, with $(n, p, g) = (360i, 1280i, 160i)$ for $i = 1, \cdots, 4$. As suggested by the SLEP package, the tuning parameters are chosen as $\lambda_1 = 0.1$ and $\lambda_2 = 0.1$ in (3). The parameters $\mu$ and $\nu$ are both taken as 0.5. Regarding the requirement $\eta > \rho(\mu X^TX + \nu A^TA)$, we report the values of $\rho(\mu X^TX + \nu A^TA)$ for several cases of $\sigma$ and $i$ in Table 2. As suggested in [13], the value of $\eta$ is updated by the rule

$$ \eta_{k+1} = \begin{cases} \eta_k, & \text{if } \operatorname{mod}(k, 5) \neq 0; \\ \max\{0.78\,\eta_k,\ 0.1\}, & \text{if } \operatorname{mod}(k, 5) = 0 \text{ and } k \le 20; \\ \min\{1.01\,\eta_k,\ 50\}, & \text{if } \operatorname{mod}(k, 5) = 0 \text{ and } k > 20, \end{cases} $$

starting from $\eta_0 = 5.5$.

Table 2. $\rho(0.5X^TX + 0.5A^TA)$ for design matrices with unit column norms.

Table 3 reports the results of our algorithm for (3) in terms of the number of iterations ("Iter"), the computing time in seconds ("Time"), the objective function value ("Obj") and the estimation error ("Error") measured by $\|\hat\beta - \beta^*\|_1$. We can see that the computing time of the algorithm increases as the value of $i$ increases. Figures 1 and 2 visualize the estimators generated by RFLASSO and FLASSO under Gaussian noise with $i = 2$ and $\sigma = 0.001$, respectively; both estimators are very close to the true value. Furthermore, we summarize the results for $i = 2$ and $\sigma = 0.001$ in Table 4 (selected results for RFLASSO and FLASSO). In this table we report the number of correctly recovered elements, a more specific measure of the proximity of the approximate solution $\hat\beta$ to the true solution $\beta^*$ for the RFLASSO model. The table illustrates that our proposed RFLASSO model is comparable with the FLASSO model under Gaussian noise. Next, we compare RFLASSO with FLASSO under non-Gaussian noise, namely the Cauchy distribution. Figures 3 and 4 visualize the results of RFLASSO and FLASSO for the Cauchy noise case with $i = 2$ and $\sigma = 0.001$. We further summarize the results in Table 5. This table shows that RFLASSO achieves solutions closer to the true solution $\beta^*$; that is, RFLASSO is more robust than FLASSO when the noise is heavy-tailed.
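The stepwise $\eta$ adjustment used above is easy to implement; a one-function sketch mirroring the displayed rule:

```python
def update_eta(eta, k):
    """Adjust the proximal parameter eta every 5 iterations, as in the rule above:
    shrink (floor 0.1) during the first 20 iterations, then grow slowly (cap 50)."""
    if k % 5 != 0:
        return eta
    return max(eta * 0.78, 0.1) if k <= 20 else min(eta * 1.01, 50.0)
```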

5. Conclusions. In this article we deal with the Robust Fused LASSO under the assumption that both the unknown vector and its successive differences are sparse. The main contribution of this paper is to prove theoretical results on the Robust Fused LASSO estimator. For a wide range of noise distributions, including heavy-tailed ones such as the Cauchy distribution, we show that under some mild conditions the Robust Fused LASSO estimator possesses near oracle performance, i.e., with high probability, the $\ell_2$ norm of the estimation error is of order $O(\sqrt{k(\log p)/n})$. Moreover, we apply the linearized alternating direction method of multipliers to find the Robust Fused LASSO estimator, and the method is globally convergent. We compare the Robust Fused LASSO and the Fused LASSO under different noise distributions in numerical experiments, which demonstrate the efficiency of our proposed method. In future work, we will apply our model to real data.