LEAST ABSOLUTE DEVIATIONS LEARNING OF MULTIPLE TASKS

Abstract. In this paper, we propose a new multitask feature selection model based on least absolute deviations. However, due to the inherent nonsmoothness of the $\ell_1$ norm, optimizing this model is challenging. To tackle this problem efficiently, we introduce an alternating iterative optimization algorithm. Moreover, under some mild conditions, its global convergence can be established. Experimental results and comparison with the state-of-the-art algorithm SLEP show the efficiency and effectiveness of the proposed approach in solving multitask learning problems.


1. Introduction. Multitask feature selection (MTFS), which aims to learn explanatory features across multiple related tasks, has been successfully applied in various applications including character recognition [18], classification [21], medical diagnosis [23], and object tracking [2]. One key assumption behind many MTFS models is that all tasks are interrelated. In this paper, we mainly consider the regression problem. In the problem setup, we assume that there are $k$ regression problems, or "tasks", with all data coming from the same space. For each task $j$ there are $m_j$ points, so the dataset is $D = \bigcup_{j=1}^{k} D_j$, where $D_j = \{(x_j^i, y_j^i)\}_{i=1}^{m_j}$ is sampled from an underlying distribution $P_j$, $x_j^i \in \mathbb{R}^n$ denotes the $i$-th training sample for the $j$-th task, $y_j^i \in \mathbb{R}$ denotes the corresponding response, the superscript $i$ indexes the independent and identically distributed (i.i.d.) observations for each task, $m_j$ is the number of samples for the $j$-th task, and the total number of training samples is $m = \sum_{j=1}^{k} m_j$. The goal of MTFS is to learn $k$ decision functions $\{f_j\}_{j=1}^{k}$ such that $f_j(x_j^i)$ approximates $y_j^i$. Typically, in multitask learning models, the decision function $f_j$ for the $j$-th task is assumed to be a hyperplane parameterized by the model weight vector $w_j \in \mathbb{R}^n$. The objective of MTFS models is to learn a weight matrix $W \in \mathbb{R}^{n \times k}$. To keep the notation uncluttered, we express the weight matrix $W$ in terms of its columns and rows, i.e., $W = [w_1, \ldots, w_k] = [w^1; \ldots; w^n]$, where $w_j$ denotes the $j$-th column and $w^i$ the $i$-th row.
For convenience, let $X_j = [x_j^1, \cdots, x_j^{m_j}]^T \in \mathbb{R}^{m_j \times n}$ denote the sample matrix for the $j$-th task, let $y_j = [y_j^1, \cdots, y_j^{m_j}]^T \in \mathbb{R}^{m_j}$, and let $L_j(X_j w_j, y_j)$ be a loss function on the sample $(X_j, y_j)$ for task $j$. A standard method for finding a sparse $W$ is to solve the following $\ell_1$ minimization problem:
$$\min_{W \in \mathbb{R}^{n \times k}} \ \sum_{j=1}^{k} L_j(X_j w_j, y_j) + \mu \sum_{j=1}^{k} \|w_j\|_1, \qquad (1)$$
where $\mu > 0$ is the regularization parameter used to balance the loss term and the regularization term. Solving (1) leads to individual sparsity patterns for each $w_j$. Many methods have been proposed to select features globally by using variants of $\ell_1$ regularization, or more specifically, by imposing a mixed $\ell_{p,1}$ norm such as the $\ell_{2,1}$ matrix norm [6,10,12,17,19,22,25]. As argued in [26], the advantage of these matrix norms is that they not only benefit from the $\ell_1$ norm, which promotes sparse solutions, but also achieve group sparsity through the $\ell_p$ norm. The $\ell_{2,1}$ norm is essentially the sum of the $\ell_2$ norms of the rows, and the $\ell_{\infty,1}$ norm penalizes the sum of the maximum absolute values of each row. One appealing property of the two matrix norms is that they encourage predictors from different tasks to share similar parameter sparsity patterns and discover solutions where only a few features are nonzero. One commonly used choice for $L_j(X_j w_j, y_j)$ is the squared loss, that is, $\|X_j w_j - y_j\|_2^2$, which can be viewed as regression with Gaussian noise [14]. In this paper, we focus on the regression problem in the context of MTFS. Instead of choosing the conventional squared loss, we consider a nonsmooth loss function and present our MTFS model from a probabilistic perspective. The key contributions of this paper are highlighted as follows.
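For concreteness, the two mixed norms mentioned above can be computed row-wise; the following is a small numpy sketch (illustrative only, not part of the original formulation):

```python
import numpy as np

def l21_norm(W):
    """l_{2,1} norm: sum of the l2 norms of the rows of W."""
    return np.linalg.norm(W, axis=1).sum()

def linf1_norm(W):
    """l_{inf,1} norm: sum of the maximum absolute values of each row of W."""
    return np.abs(W).max(axis=1).sum()

W = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [1.0, 0.0]])
# Row l2 norms are 5, 0, 1, so l21_norm(W) = 6;
# row-wise max absolute values are 4, 0, 1, so linf1_norm(W) = 5.
```

Penalizing these norms drives entire rows of $W$ to zero, which is exactly the shared feature-sparsity pattern across tasks described above.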
• We propose a new MTFS model within a probabilistic framework and develop an iterative algorithm to solve it.
• We theoretically provide a convergence result for the developed algorithm.
• We conduct experiments on both synthetic data and real data to show the performance of the proposed approach.

The rest of this paper is organized as follows. In Section 2, we introduce the multitask learning formulation and present an optimization algorithm to solve the proposed model. In Section 3, we provide a theoretical analysis of the algorithm. Experimental comparisons and results are reported in Section 4. Finally, we conclude this paper and discuss future work in Section 5.

2. Model formulation and algorithm. In this section, we first introduce our formulation for multitask learning and then present the solution procedure.
2.1. Model formulation. Given $x_j \in \mathbb{R}^n$, suppose that the corresponding output $y_j \in \mathbb{R}$ for task $j$ has a Laplacian distribution with location parameter $w_j^T x_j$ and scale parameter $\sigma_j > 0$; that is, its probability density function has the form
$$p(y_j \mid x_j, w_j, \sigma_j) = \frac{1}{2\sigma_j} \exp\Big(-\frac{|y_j - w_j^T x_j|}{\sigma_j}\Big). \qquad (2)$$
Denote $\sigma = [\sigma_1, \ldots, \sigma_k] \in \mathbb{R}^k$, and assume that the data $\{A, y\}$ are drawn i.i.d. according to the distribution in (2); then the likelihood function can be written as
$$p(y \mid W, A, \sigma) = \prod_{j=1}^{k} \prod_{i=1}^{m_j} \frac{1}{2\sigma_j} \exp\Big(-\frac{|y_j^i - w_j^T x_j^i|}{\sigma_j}\Big).$$
To capture the task relatedness, we impose an exponential prior on the $i$-th row of $W$, i.e.,
$$p(w^i \mid \delta_i) \propto \exp\big(-\delta_i \|w^i\|_2\big), \qquad (3)$$
where $\delta_i > 0$ is the so-called rate parameter. Denote $\delta = [\delta_1, \ldots, \delta_n] \in \mathbb{R}^n$ and assume that $w^1, \ldots, w^n$ are drawn i.i.d. according to (3); then we can express the prior on $W$ as $p(W \mid \delta) = \prod_{i=1}^{n} p(w^i \mid \delta_i)$. It follows that the posterior distribution of $W$ is $p(W \mid A, y, \sigma, \delta) \propto p(y \mid W, A, \sigma)\, p(W \mid \delta)$.
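Up to additive constants, maximizing this posterior amounts to minimizing a $\sigma$-weighted sum of absolute residuals plus a $\delta$-weighted sum of row norms. A hedged numpy sketch of evaluating this negative log posterior (function and variable names are ours, not the paper's):

```python
import numpy as np

def neg_log_posterior(W, Xs, ys, sigma, delta):
    """Negative log posterior of W (up to additive constants) under the
    Laplacian likelihood (2) and the exponential row prior (3).
    Xs, ys: per-task design matrices and responses; sigma: per-task scales;
    delta: per-row rate parameters."""
    # Laplacian likelihood contributes sum_j ||X_j w_j - y_j||_1 / sigma_j
    loss = sum(np.abs(Xj @ W[:, j] - yj).sum() / sigma[j]
               for j, (Xj, yj) in enumerate(zip(Xs, ys)))
    # Exponential prior contributes sum_i delta_i * ||w^i||_2
    prior = (delta * np.linalg.norm(W, axis=1)).sum()
    return loss + prior
```

Minimizing this quantity over $W$ is exactly the MAP problem derived next.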
With the above likelihood and prior, we can obtain a maximum a posteriori solution of $W$ by solving the following optimization problem:
$$\min_{W \in \mathbb{R}^{n \times k}} \ \sum_{j=1}^{k} \frac{1}{\sigma_j} \|X_j w_j - y_j\|_1 + \sum_{i=1}^{n} \delta_i \|w^i\|_2. \qquad (4)$$
Clearly, the solution of the first term in (4) can be viewed as a least absolute deviations (LAD) solution. Thus, we name (4) the LAD multitask feature selection model. For simplicity, we assume that $\sigma_j = \sigma$ for all $j$ and $\delta_i = \delta$ for all $i$. Letting $\mu = \sigma\delta$, the proposed MTFS model based on LAD is given as follows:
$$\min_{W \in \mathbb{R}^{n \times k}} \ \sum_{j=1}^{k} \|X_j w_j - y_j\|_1 + \mu \sum_{i=1}^{n} \|w^i\|_2. \qquad (5)$$

2.2. Algorithm. In the following, we show how to solve the LAD multitask learning model (5). Actually, (5) can be rewritten compactly as
$$\min_{W \in \mathbb{R}^{n \times k}} \ \|\mathcal{X}(W) - y\|_1 + \mu \|W\|_{2,1}, \qquad (6)$$
where $\mathcal{X}: \mathbb{R}^{n \times k} \to \mathbb{R}^m$ is a map defined by matrix-vector multiplication on each task, i.e., $\mathcal{X}(W) = [X_1 w_1; \cdots; X_k w_k] \in \mathbb{R}^m$, and $y = [y_1; \cdots; y_k] \in \mathbb{R}^m$. By introducing an artificial variable $r \in \mathbb{R}^m$, (6) can be rewritten as
$$\min_{W \in \mathbb{R}^{n \times k},\, r \in \mathbb{R}^m} \ \|r\|_1 + \mu \|W\|_{2,1} \quad \text{s.t.} \quad \mathcal{X}(W) - r = y, \qquad (7)$$
which is a convex optimization problem and can be solved by many methods. In this paper, we adopt an alternating direction method (ADM) that minimizes the following augmented Lagrangian function:
$$\mathcal{L}(W, r, \lambda) = \|r\|_1 + \mu \|W\|_{2,1} - \lambda^T(\mathcal{X}(W) - r - y) + \frac{\beta}{2}\|\mathcal{X}(W) - r - y\|_2^2, \qquad (8)$$
where $\lambda \in \mathbb{R}^m$ is the Lagrange multiplier and $\beta > 0$ is the penalty parameter of the linear constraint in (7). The basic idea of ADM dates back to the work of Gabay and Mercier [8], and it has been applied in many different fields such as image restoration [20,28], quadratic programming [16], online learning [24], and background-foreground extraction [27]. Given $(r^k, \lambda^k)$, ADM generates the next iterate via
$$\begin{cases} W^{k+1} = \arg\min_{W} \ \mathcal{L}(W, r^k, \lambda^k), \\ r^{k+1} = \arg\min_{r} \ \mathcal{L}(W^{k+1}, r, \lambda^k), \\ \lambda^{k+1} = \lambda^k - \beta(\mathcal{X}(W^{k+1}) - r^{k+1} - y). \end{cases} \qquad (9)$$
We can see from (9) that at each iteration the main computation of ADM lies in solving the two subproblems for $W$ and $r$.
Firstly, for $r = r^k$ and $\lambda = \lambda^k$, the minimizer $W^{k+1}$ of (8) with respect to $W$ is given by
$$W^{k+1} = \arg\min_{W} \ \mu\|W\|_{2,1} - (\lambda^k)^T(\mathcal{X}(W) - r^k - y) + \frac{\beta}{2}\|\mathcal{X}(W) - r^k - y\|_2^2. \qquad (10)$$
Instead of solving (10) exactly, we approximate it by
$$W^{k+1} = \arg\min_{W} \ \mu\|W\|_{2,1} + \beta\langle G^k, W - W^k\rangle + \frac{\beta}{2\tau}\|W - W^k\|_F^2, \qquad (11)$$
where $\tau > 0$ is a proximal parameter and
$$G^k = \mathcal{X}^*\Big(\mathcal{X}(W^k) - r^k - y - \frac{\lambda^k}{\beta}\Big),$$
with $\mathcal{X}^*$ denoting the adjoint of $\mathcal{X}$. Problem (11) is equivalent to
$$W^{k+1} = \arg\min_{W} \ \mu\|W\|_{2,1} + \frac{\beta}{2\tau}\|W - V\|_F^2. \qquad (12)$$
Let $V = W^k - \tau G^k$ and let $v^i$ be the $i$-th row of $V$; then the optimization problem (12) can be decomposed into $n$ separate subproblems, namely,
$$\hat{w}^i = \arg\min_{w^i} \ \mu\|w^i\|_2 + \frac{\beta}{2\tau}\|w^i - v^i\|_2^2, \quad i = 1, \ldots, n. \qquad (13)$$
According to [13], the closed-form solutions of (13) can be given explicitly as
$$\hat{w}^i = \Big(1 - \frac{\tau\mu}{\beta\|v^i\|_2}\Big)_+ v^i, \qquad (14)$$
where $(\cdot)_+ = \max(\cdot, 0)$ and all operations in (14) are performed componentwise. Therefore, the solution of (12) is
$$W^{k+1} = [\hat{w}^1; \ldots; \hat{w}^n]. \qquad (15)$$
Secondly, given $(W^{k+1}, \lambda^k)$, minimizing (8) with respect to $r$ is equivalent to
$$r^{k+1} = \arg\min_{r} \ \|r\|_1 + \frac{\beta}{2}\Big\|r - \Big(\mathcal{X}(W^{k+1}) - y - \frac{\lambda^k}{\beta}\Big)\Big\|_2^2. \qquad (16)$$
By using the soft thresholding operator $S$,³ the solution of (16) can be written as
$$r^{k+1} = S_{1/\beta}\Big(\mathcal{X}(W^{k+1}) - y - \frac{\lambda^k}{\beta}\Big). \qquad (17)$$
Finally, the multiplier $\lambda$ is updated as
$$\lambda^{k+1} = \lambda^k - \beta(\mathcal{X}(W^{k+1}) - r^{k+1} - y). \qquad (18)$$
This is an inexact ADM since the $W$-subproblem is solved only approximately. We name this method LADL21 and outline it in Algorithm 1. The iteration procedure is repeated until the algorithm converges.

Input: $\{X_j, y_j\}_{j=1}^{k}$, $\mu > 0$, $\beta > 0$, $0 < \tau < 1/\lambda_{\max}(\mathcal{X}^*\mathcal{X})$.
Initialize $W^0 = 0$, $r^0 = 0$, $\lambda^0 = 0$, $k = 0$.
while not converged do
    Compute $W^{k+1}$ via (15);
    Compute $r^{k+1}$ via (17);
    Update $\lambda^{k+1}$ via (18);
    $k \leftarrow k + 1$;
end while
Algorithm 1: LADL21 - An efficient iterative algorithm to solve the optimization problem in (5).
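For readers who want to experiment, the updates above can be sketched in a few lines of numpy. This is an illustrative re-implementation under default choices of the penalty β and step size τ (chosen below 1/λmax as required for convergence), not the authors' Matlab code:

```python
import numpy as np

def row_shrink(V, kappa):
    """Row-wise l2 shrinkage: solves min_W kappa*||W||_{2,1} + 0.5*||W - V||_F^2."""
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    scale = np.maximum(1.0 - kappa / np.maximum(norms, 1e-12), 0.0)
    return scale * V

def soft_threshold(x, rho):
    """Componentwise soft thresholding operator S_rho."""
    return np.sign(x) * np.maximum(np.abs(x) - rho, 0.0)

def ladl21(Xs, ys, mu, beta=1.0, tau=None, iters=200):
    """Inexact ADM sketch for min sum_j ||X_j w_j - y_j||_1 + mu*||W||_{2,1}."""
    n = Xs[0].shape[1]
    m = sum(len(yj) for yj in ys)
    if tau is None:
        # X*X is block diagonal, so lambda_max is the largest ||X_j||_2^2
        lam = max(np.linalg.norm(Xj, 2) ** 2 for Xj in Xs)
        tau = 0.99 / lam
    W = np.zeros((n, len(Xs)))
    r = np.zeros(m)
    lmbda = np.zeros(m)
    y = np.concatenate(ys)

    def X_op(W):  # stacks X_j w_j over tasks
        return np.concatenate([Xj @ W[:, j] for j, Xj in enumerate(Xs)])

    def X_adj(u):  # adjoint: splits u by task and applies X_j^T
        out = np.zeros_like(W); start = 0
        for j, Xj in enumerate(Xs):
            mj = Xj.shape[0]
            out[:, j] = Xj.T @ u[start:start + mj]
            start += mj
        return out

    for _ in range(iters):
        G = X_adj(X_op(W) - r - y - lmbda / beta)          # gradient term
        W = row_shrink(W - tau * G, tau * mu / beta)       # W-update
        r = soft_threshold(X_op(W) - y - lmbda / beta, 1.0 / beta)  # r-update
        lmbda = lmbda - beta * (X_op(W) - r - y)           # multiplier update
    return W
```

A fixed iteration count stands in for the paper's relative-change stopping rule; in practice one would stop once successive iterates change little.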
3. Convergence analysis. This section is devoted to establishing the convergence property of LADL21. We first show that the proposed LADL21 algorithm belongs to the framework of He et al. [11], which is designed to solve structured variational inequality (SVI) problems, and then the convergence result of LADL21 follows directly.
3.1. Preparations. We begin with some preparations for analyzing the convergence property. In order to show the global convergence of LADL21 clearly, we first consider the SVI problem: find a vector $u^* \in \Omega$ such that
$$(u - u^*)^T F(u^*) \geq 0, \quad \forall u \in \Omega, \qquad (19)$$
where $\Omega$ is a nonempty closed convex subset of $\mathbb{R}^{a+b}$, $F$ is a mapping from $\mathbb{R}^{a+b}$ to itself,
$$u = \begin{pmatrix} s \\ t \end{pmatrix}, \quad F(u) = \begin{pmatrix} f(s) \\ g(t) \end{pmatrix}, \quad \Omega = \{(s, t) : s \in S, \ t \in T, \ Ms + Nt = c\},$$
where $S$ and $T$ are given nonempty closed convex subsets of $\mathbb{R}^a$ and $\mathbb{R}^b$, respectively, $M \in \mathbb{R}^{l \times a}$ and $N \in \mathbb{R}^{l \times b}$ are given matrices, $c \in \mathbb{R}^l$ is a given vector, and $f: S \to \mathbb{R}^a$ and $g: T \to \mathbb{R}^b$ are given monotone operators. By attaching a Lagrange multiplier vector $\lambda \in \mathbb{R}^l$ to the linear constraint $Ms + Nt = c$, one can obtain an equivalent form of (19): find $w^* = (s^*, t^*, \lambda^*) \in \mathcal{W} = S \times T \times \mathbb{R}^l$ such that
$$(w - w^*)^T Q(w^*) \geq 0, \quad \forall w \in \mathcal{W}, \qquad (20)$$
where
$$Q(w) = \begin{pmatrix} f(s) - M^T\lambda \\ g(t) - N^T\lambda \\ Ms + Nt - c \end{pmatrix}.$$
To solve (20), the method proposed by He et al. [11] produces a new iterate $(s^{k+1}, t^{k+1}, \lambda^{k+1})$ from $(t^k, \lambda^k)$ via the following procedure. Firstly, $s^{k+1}$ is obtained by solving
$$(s' - s)^T \big\{ f(s) - M^T[\lambda^k - H_k(Ms + Nt^k - c)] + R_k(s - s^k) \big\} \geq 0, \quad \forall s' \in S. \qquad (21)$$
Then, $t^{k+1}$ is produced by solving
$$(t' - t)^T \big\{ g(t) - N^T[\lambda^k - H_k(Ms^{k+1} + Nt - c)] + S_k(t - t^k) \big\} \geq 0, \quad \forall t' \in T. \qquad (22)$$
Finally, the multiplier is updated by
$$\lambda^{k+1} = \lambda^k - H_k(Ms^{k+1} + Nt^{k+1} - c). \qquad (23)$$
Here $\{H_k\}$, $\{R_k\}$ and $\{S_k\}$ are sequences of both lower and upper bounded symmetric positive definite matrices. Under mild conditions, He et al. [11] established the convergence of their method.

³The soft thresholding operator $S$ is defined componentwise as $S_\rho(x) = \mathrm{sign}(x) \odot \max\{|x| - \rho, 0\}$. Another formula, which shows that the soft thresholding operator is a shrinkage operator (i.e., it moves a point toward zero), is $S_\rho(x) = x - P_{[-\rho,\rho]}(x)$, where $P_{[-\rho,\rho]}$ denotes the componentwise projection onto $[-\rho, \rho]$.

3.2. Convergence result.
Based on the analysis above, we now consider (9) in the SVI framework. For simplicity, let $\partial(\cdot)$ denote the subgradient operator of a function. The optimality condition of problem (7) can be characterized as finding a vector $z^* = (W^*, r^*, \lambda^*) \in Z = \mathbb{R}^{n \times k} \times \mathbb{R}^m \times \mathbb{R}^m$ such that, for all $z = (W, r, \lambda) \in Z$,
$$(z - z^*)^T \begin{pmatrix} \xi_W^* - \mathcal{X}^*(\lambda^*) \\ \xi_r^* + \lambda^* \\ \mathcal{X}(W^*) - r^* - y \end{pmatrix} \geq 0, \quad \xi_W^* \in \partial(\mu\|W^*\|_{2,1}), \ \xi_r^* \in \partial\|r^*\|_1. \qquad (24)$$
The optimality condition of the approximate $W$-subproblem (12), written in SVI form, reads
$$(W' - W^{k+1})^T \Big\{ \xi_W^{k+1} - \mathcal{X}^*\big[\lambda^k - \beta(\mathcal{X}(W^{k+1}) - r^k - y)\big] + \beta\Big(\frac{1}{\tau}I - \mathcal{X}^*\mathcal{X}\Big)(W^{k+1} - W^k) \Big\} \geq 0, \quad \forall W', \qquad (25)$$
and that of the $r$-subproblem (16) reads
$$(r' - r^{k+1})^T \big\{ \xi_r^{k+1} + \lambda^k - \beta(\mathcal{X}(W^{k+1}) - r^{k+1} - y) \big\} \geq 0, \quad \forall r' \in \mathbb{R}^m, \ \xi_r^{k+1} \in \partial\|r^{k+1}\|_1. \qquad (26)$$
In the end, the multiplier is updated via
$$\lambda^{k+1} = \lambda^k - \beta(\mathcal{X}(W^{k+1}) - r^{k+1} - y). \qquad (27)$$
Comparing (25)-(27) with (21)-(23), one can see that our method is a special case of the framework of [11] with $H_k = \beta I$, $R_k = \beta(\frac{1}{\tau}I - \mathcal{X}^*\mathcal{X})$ and $S_k = 0$, where $I$ is the identity matrix and $0$ denotes the zero matrix. Hence, convergence follows directly when $\frac{1}{\tau}I - \mathcal{X}^*\mathcal{X}$ is symmetric and positive definite. Let $\lambda_{\max}(\mathcal{X}^*\mathcal{X})$ be the largest eigenvalue of $\mathcal{X}^*\mathcal{X}$; then, directly from [11, Theorem 4], the main convergence property of Algorithm 1 can be stated as follows.

Theorem 3.1. Suppose $0 < \tau < 1/\lambda_{\max}(\mathcal{X}^*\mathcal{X})$. Then the sequence $\{(W^k, r^k, \lambda^k)\}$ generated by Algorithm 1 converges to a solution of problem (7).
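Because $\mathcal{X}^*\mathcal{X}$ acts block-diagonally across tasks, $\lambda_{\max}(\mathcal{X}^*\mathcal{X})$ equals the largest squared spectral norm of the task matrices $X_j$. A small sketch of computing the resulting step-size bound (illustrative; names are ours):

```python
import numpy as np

def max_step_size(Xs):
    """Upper bound on tau: any tau < 1/lambda_max(X^*X) keeps
    (1/tau)I - X^*X positive definite. Since X^*X is block diagonal,
    lambda_max is the largest lambda_max(X_j^T X_j) = ||X_j||_2^2."""
    lam_max = max(np.linalg.norm(Xj, 2) ** 2 for Xj in Xs)
    return 1.0 / lam_max

Xs = [np.array([[2.0, 0.0], [0.0, 1.0]]), np.array([[1.0, 1.0]])]
tau_bound = max_step_size(Xs)  # largest block eigenvalue is 4, so bound is 0.25
```

In practice one would take, say, τ = 0.99 times this bound to satisfy the strict inequality of Theorem 3.1.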

4. Experiments.
In the following, we conduct experiments to demonstrate the performance of the proposed approach in solving MTFS problems. All experiments are implemented in Matlab. The compared algorithm is SLEP (short for "sparse learning with efficient projections") [15]. SLEP is an open Matlab package which provides a set of learning algorithms for solving $\ell_{2,1}$ regularized learning problems, whose functions are based on the work in [14]. In [14], the $\ell_{2,1}$ regularized multitask learning takes the form
$$\min_{W} \ \sum_{j=1}^{k} \|X_j w_j - y_j\|_2^2 + \rho \|W\|_{2,1}, \qquad (28)$$
where $\rho > 0$ is the regularization parameter. Considering that the objective function in (28) is nonsmooth, the authors first transform (28) into the following equivalent constrained smooth convex optimization problem:
$$\min_{W, u} \ \sum_{j=1}^{k} \|X_j w_j - y_j\|_2^2 + \rho \sum_{i=1}^{n} u_i \quad \text{s.t.} \quad \|w^i\|_2 \leq u_i, \ \forall i = 1, \ldots, n, \qquad (29)$$
where $u = [u_1, \ldots, u_n]^T$, and then solve (29) via Nesterov's method. Parameter settings for SLEP will be specified when we discuss each individual experiment.

4.1. Synthetic data. In this section, we use synthetic data to test our model and algorithm. We create the synthetic data by generating task parameters $w_j$ from a Gaussian distribution with zero mean and covariance Cov. The training and test datasets $X_j$ are Gaussian matrices whose elements are generated by the Matlab function randn($m_j$, $n$). The outputs are computed by
$$y_j = X_j w_j + \varepsilon_j,$$
where $\varepsilon_j$ is zero-mean Gaussian noise with standard deviation 0.01. Let $W^*$ be the solution obtained by an algorithm and $\bar{W}$ the ground-truth weight matrix; we use the relative error (RelErr) to measure the quality of $W^*$:
$$\mathrm{RelErr} = \frac{\|W^* - \bar{W}\|_F}{\|\bar{W}\|_F}.$$
We terminate the two methods when the relative change (RelChg) between two consecutive iterations is less than a pre-set positive threshold $\epsilon$:
$$\mathrm{RelChg} = \frac{\|W^{k+1} - W^k\|_F}{\|W^k\|_F} < \epsilon.$$
Firstly, let Cov = diag{1, 0.25, 0.1, 0.05, 0.01} and Cov = diag{0.81, 0.64, 0.49, 0.36, 0.25, 0.16, 0.09, 0.04}, respectively, and to these we keep adding up to 20 irrelevant dimensions which are exactly zero. We set $k = 200$, $n = 10$, $m_j = 100$ for all $j = 1, \ldots, k$, and $\epsilon = 10^{-3}$. In the SLEP method, we set mFlag = 1 and lFlag = 1, which means an adaptive line search is used.
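The synthetic-data protocol and the RelErr measure can be sketched as follows (an illustrative numpy analogue of the Matlab setup; function names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_synthetic(k, n, m_j, cov_diag, noise_std=0.01):
    """Task weights drawn from N(0, diag(cov_diag)), padded with zero rows
    for the irrelevant dimensions; Gaussian designs; noisy linear outputs."""
    d = len(cov_diag)
    W_true = np.zeros((n, k))
    W_true[:d, :] = rng.normal(size=(d, k)) * np.sqrt(cov_diag)[:, None]
    Xs = [rng.normal(size=(m_j, n)) for _ in range(k)]
    ys = [Xj @ W_true[:, j] + noise_std * rng.normal(size=m_j)
          for j, Xj in enumerate(Xs)]
    return W_true, Xs, ys

def rel_err(W_est, W_true):
    """Relative error in the Frobenius norm against the ground truth."""
    return np.linalg.norm(W_est - W_true) / np.linalg.norm(W_true)

W_true, Xs, ys = make_synthetic(k=5, n=10, m_j=20,
                                cov_diag=[1, 0.25, 0.1, 0.05, 0.01])
```

The zero rows of W_true are the "irrelevant dimensions" that a row-sparse method should recover as exactly zero.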
The other parameter values are the same as in the previous test. Figure 1 shows the decrease of the relative errors as the number of iterations and the CPU time increase. We can see from Figure 1 that the proposed method is faster than SLEP and achieves better accuracy.
Secondly, considering that the number of samples, dimensions and tasks may also affect the performance of each method, we report the numerical results of the compared algorithms with different numbers of samples, dimensions and tasks. In the test, we take Cov = diag{1, 0.64, 0.49, 0.36, 0.25}, and Table 1 shows the simulation results. These results indicate that our method can obtain better quality solutions, which demonstrates the merit of the proposed approach.

4.2. School data.
We now conduct experiments on the School data, which has been widely used in multitask learning [1,3,7,14]. This dataset consists of the exam scores of 15362 students from 139 secondary schools in London from 1985 to 1987, with each sample containing 28 attributes. Here, each school is treated as one task; hence, we have 139 regression tasks corresponding to predicting student performance in each school. We randomly take 75% of each task's data for training and the rest for testing. We run each method with $\mu = 10^{-5}$ for 200 iterations and examine the behavior of the training and testing errors as each method proceeds. The fidelity of the training and testing data is measured by the root mean squared error (RMSE). Figure 2 shows the convergence behavior of the compared algorithms. These curves show that all methods eventually converge. From Figure 2, we find that the errors obtained by LADL21 are slightly larger than those of SLEP at the beginning, but become smaller in later iterations. In addition, Figures 2(b) and 2(d) indicate that our method is faster than SLEP. Overall, we propose a valid method for multitask learning that is comparable to the state-of-the-art algorithm SLEP.
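The 75%/25% per-task split and the RMSE measure used here are straightforward; a minimal sketch (illustrative, with hypothetical helper names):

```python
import numpy as np

rng = np.random.default_rng(0)

def split_task(X, y, train_frac=0.75):
    """Randomly take a fraction of one task's samples for training,
    the rest for testing."""
    m = len(y)
    idx = rng.permutation(m)
    cut = int(train_frac * m)
    tr, te = idx[:cut], idx[cut:]
    return X[tr], y[tr], X[te], y[te]

def rmse(y_pred, y_true):
    """Root mean squared error between predictions and responses."""
    return np.sqrt(np.mean((y_pred - y_true) ** 2))
```

Applying split_task independently to each of the 139 school tasks reproduces the evaluation protocol described above.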

5. Conclusion. In this paper, we proposed a new multitask feature selection model and presented an algorithm to solve it. We derived a closed-form solution to update the weight matrix, which ensures that the developed algorithm works well. Numerical results illustrated that the proposed approach is effective and promising in solving multitask learning problems.
Since the proposed LADL21 algorithm is a batch learning algorithm, it has to be retrained from scratch whenever a new sample arrives. A first direction for future work is therefore to generalize LADL21 to the online setting, which may require fewer flops than batch learning. Secondly, the proposed MTFS model is based on a convex formulation; it would be promising to design a nonconvex formulation for multitask learning via new regularization functions; see [9] for example. Thirdly, the hyper-parameter $\mu$ in the LADL21 algorithm is predefined; one could instead search for an optimal $\mu$ systematically, for example via bilevel optimization [5].