SUPERVISED DISTANCE PRESERVING PROJECTION USING ALTERNATING DIRECTION METHOD OF MULTIPLIERS

Supervised Distance Preserving Projection (SDPP) is a dimension reduction method in the supervised setting proposed recently by Zhu et al. in [43]. The method learns a linear mapping from the input space to the reduced feature space. While the method shows very promising results on regression tasks, its performance on classification problems is not satisfactory. The preservation of the distance relation with neighboring points forces data to project very close to one another in the projected space irrespective of their classes, which results in a low classification rate. To avoid this crowdedness of the SDPP approach we have proposed a modification of SDPP which handles both regression and classification problems and significantly improves the performance of SDPP. We have incorporated the total variance of the projected covariates into the SDPP problem, which is maximized to preserve the global structure. This approach not only facilitates efficient regression like SDPP but also successfully classifies data into different classes. We have formulated the proposed optimization problem as a Semidefinite Least Square SDPP (SLS-SDPP) problem. A two-block Alternating Direction Method of Multipliers has been developed to learn the transformation matrix by solving the SLS-SDPP, which can easily handle out-of-sample data.

1. Introduction. In this paper we consider a dimension reduction method in the supervised setting. Supervised Distance Preserving Projection (SDPP) is a dimension reduction method proposed recently by Zhu et al. [43] which has shown very promising results on regression problems [20]. The basic formulation of SDPP aims to project data points into a reduced feature space in such a way that the distances between data points in the projected space mimic the distances between them in the response space. The method learns a linear mapping from the input space to the reduced feature space, which leads to an efficient regression design. Suppose we have $n$ data points $\{x_1, x_2, \ldots, x_n\}$, $x_i \in \mathbb{R}^m$, and their responses $\{y_1, y_2, \ldots, y_n\}$, $y_i \in \mathbb{R}^k$. Assuming that the mapping $X \to Y$ is continuous and $X$ is well sampled, the idea is to project the high dimensional data $\{x_1, x_2, \ldots, x_n\}$ into a lower dimensional space $Z$ with dimensionality $r \ll m$ by $Z = f(X) = W^T X$, in such a way that the projection preserves distances locally between data points in the projected space (reduced feature space) and the output space. In [43], SDPP seeks the transformation matrix $W$ that minimizes the mismatch between these local distances (the stress function reviewed in Section 3); locality around any point $x_i$ is controlled by its $k$ nearest neighbors in $N(x_i)$.
A drawback of the SDPP approach is that, for classification problems, the preservation of local structure forces data of different classes to project very close to one another in the projected space, which results in a low classification rate. To avoid this crowdedness we have proposed a modification of SDPP which handles both regression and classification problems and significantly improves the performance of the baseline method.
Let $\mathbb{R}^{m\times n}$ be the set of all $m \times n$ matrices with the induced Frobenius norm $\|\cdot\|_F$. For any $X \in \mathbb{R}^{m\times n}$ its Frobenius norm is defined by $\|X\|_F = \bigl(\sum_{i=1}^{m}\sum_{j=1}^{n} |x_{ij}|^2\bigr)^{1/2}$. $S^n$ is the space of all real $n \times n$ symmetric matrices, $S^n_+$ is the cone of positive semidefinite matrices in $S^n$, $\Pi_{S^n_+}(X)$ is the projection of a given matrix $X \in S^n$ onto $S^n_+$, and $O^n$ is the set of all $n \times n$ orthogonal matrices. For any linear operator $A : X \to Y$, $Ax = (\langle A_1, x\rangle, \langle A_2, x\rangle, \ldots, \langle A_n, x\rangle)$, and $A^*$ is the conjugate of $A$, defined by $A^* y = A_1 y_1 + A_2 y_2 + \cdots + A_n y_n$, where $A_i$ is the $i$th row of $A$, so that $\langle Ax, y\rangle = \langle x, A^* y\rangle$. All further notations are either standard or defined in the text of the respective sections.
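As a computational aside, the projection $\Pi_{S^n_+}$ used throughout the paper can be realized by an eigenvalue decomposition. The following minimal sketch (in Python/NumPy, which is not the MATLAB implementation used in Section 6) illustrates this construction.

```python
import numpy as np

def proj_psd(X):
    """Projection of a symmetric matrix X onto the PSD cone S^n_+,
    computed by clipping negative eigenvalues to zero."""
    Xs = 0.5 * (X + X.T)                  # symmetrize to guard against round-off
    eigvals, eigvecs = np.linalg.eigh(Xs)
    eigvals = np.clip(eigvals, 0.0, None)
    return (eigvecs * eigvals) @ eigvecs.T
```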
In our research, we have incorporated the total variance $\sum_{i=1}^{n}\|z_i\|^2$ of the projected covariates $z_i = W^T x_i$ into the SDPP stress $F(W)$ and maximize it, which can be reformulated (as described in Section 4) as the optimization problem
$$\min_{W}\ F(W) - \nu\sum_{i=1}^{n}\|z_i\|^2, \qquad (1)$$
where $\nu > 0$ is the penalty parameter. To put equal emphasis on both terms of the objective function of (1) we choose the value $\nu = 1$ (details are discussed in Section 4).
The goal of our model SLS-SDPP is to determine the positive semidefinite matrix $X$ from which the transformation matrix $W$ can be obtained to project the data points into a lower dimensional space. The details of this model will be discussed in Section 4.
The paper is organized as follows. In Section 2, we give a brief overview of some prominent methods of supervised dimensionality reduction. Our review of the SDPP model introduced by Zhu et al. [43] is given in Section 3. We then introduce the total variance of the projected space into the SDPP model in Section 4; along the way, we study the formulation of our model as a semidefinite least square problem.
In Section 5, we develop a two-block ADMM method for our model. We examine the performance of the proposed method on a number of commonly used regression and classification problems in comparison with SDPP and some other state-of-the-art approaches in Section 6, where we demonstrate that the proposed model significantly improves the original SDPP model and outperforms the other models in most cases. We conclude the paper in Section 7.
2. Overview of supervised dimensionality reduction methods.

2.1. Fisher's discriminant analysis (FDA). The most widely used supervised dimension reduction method for classification tasks [36,37] is Fisher's discriminant analysis (FDA) [11] and its kernelized form, kernel FDA [28]. These methods maximize the ratio of the between-class and within-class covariances to obtain a good projection of the data into separate classes. For a general C-class problem, FDA maps the data into a (C − 1)-dimensional space.
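For a quick hands-on illustration of this (C − 1)-dimensional mapping, the short sketch below uses scikit-learn's LinearDiscriminantAnalysis on synthetic data; it is only an illustration, not the FDA implementation compared in Section 6.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic 3-class example: FDA/LDA can map the data to at most C - 1 = 2 dimensions.
X = np.random.randn(90, 10)
c = np.repeat([0, 1, 2], 30)                          # class labels, C = 3
lda = LinearDiscriminantAnalysis(n_components=2).fit(X, c)
Z = lda.transform(X)                                  # projected data, shape (90, 2)
```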
2.2. Sufficient dimensionality reduction (SDR). Sufficient dimensionality reduction (SDR) [14,25,42] seeks a central subspace, the intersection of all subspaces associated with an orthogonal transformation $U$ such that the output $Y$ and the input covariates $X$ are conditionally independent given $U^T X$, so that no information about the regression is lost in reducing the dimension. Unfortunately, for this approach to be successful, strong assumptions have to be made on the existence of $U$. Kernel dimension reduction (KDR) is a more recent methodology for SDR that overcomes this problem: it does not impose particular assumptions on the underlying joint distribution of $X$ and $Y$. KDR characterizes the conditional dependence through covariance operators on reproducing kernel Hilbert spaces and optimizes the projection accordingly. However, KDR is computationally highly demanding.
2.3. Partial Least Squares (PLS). Classical Partial Least Squares (PLS) [40,41] is a linear DR method for regression tasks comprising a family of techniques that analyze the relationship between blocks of data by constructing a low dimensional subspace with orthogonal latent components. At each iteration, PLS extracts a latent vector by maximizing the covariance between the projected covariates and the output responses. PLS does not consider the local structure of the data; consequently, this method cannot extract the intrinsic dimensionality of the data.
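A brief illustration of PLS as a dimension reduction step, using scikit-learn's PLSRegression on synthetic data (again, not the implementation evaluated later in Section 6).

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

# PLS extracts orthogonal latent components that maximize the covariance between
# projected covariates and responses; it works globally, without local structure.
X = np.random.randn(100, 20)
y = X[:, :3] @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.randn(100)
pls = PLSRegression(n_components=5).fit(X, y)
Z = pls.transform(X)                                  # latent scores, shape (100, 5)
y_hat = pls.predict(X)                                # regression through the latent space
```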
3. Supervised distance preserving projection. In this section we review the SDPP model introduced by Zhu et al. in [43].
3.1. SDPP model. The Supervised Distance Preserving Projection (SDPP) is a dimensionality reduction method that minimizes the differences between distances among projected covariates and distances among responses locally. It also preserves the continuity of the response space. Suppose we have $n$ data points $\{x_1, x_2, \ldots, x_n\}$, $x_i \in \mathbb{R}^m$, and their responses $\{y_1, y_2, \ldots, y_n\}$.
In [43] Zhu et al. proposed the following methodology. Assume that the mapping $h : X \to Y$ is continuous and that $X$ is well sampled; that is, for each point $x \in X$ and for every $\epsilon_y > 0$ there exists an $\epsilon_x > 0$ such that $\|x - x'\| < \epsilon_x$ implies $\|h(x) - h(x')\| < \epsilon_y$. SDPP projects the high dimensional data $\{x_1, x_2, \ldots, x_n\}$ into a lower dimensional space $Z$ with dimensionality $r \ll m$ through the linear function $f : \mathbb{R}^m \to \mathbb{R}^r$ defined by $f(x) = W^T x$, where the transformation matrix $W \in \mathbb{R}^{m \times r}$. The idea of SDPP is to project the data in such a way that the local geometrical structure of the lower dimensional subspace preserves the geometrical characteristics of the response space. Thus the method seeks the transformation matrix $W$ that minimizes the stress
$$F(W) = \frac{1}{n}\sum_{i=1}^{n}\sum_{x_j \in N(x_i)} \bigl(d_{ij}^2 - \delta_{ij}^2\bigr)^2, \qquad d_{ij} = \|W^T x_i - W^T x_j\|,$$
where $\delta_{ij}$ takes the form $\delta_{ij} = \|y_i - y_j\|$ for regression tasks, while for classification tasks $\delta_{ij}$ is defined from the class labels and is zero when $x_i$ and $x_j$ belong to the same class. Locality around any point $x_i$ is controlled by its $k$ nearest neighbors in $N(x_i)$, where the number $k$ is a hyper-parameter of SDPP that has to be set beforehand or tuned from the data. In [43] the value of $k$ is selected by a continuity measure that is discussed briefly in Section 3.2.
The schematic illustration of SDPP [43] is given in Fig. 1(a). For a point $x$ in the input space, consider its three nearest neighbors $N(x) = \{x_1, x_2, x_3\}$. Suppose that in the output space the neighborhood of $y$ is $\{y_1, y_3, y_4\}$, i.e. $y_2$ is outside the neighborhood of $y$. SDPP seeks the transformation matrix $W$ for which $z_2 = f(x_2)$ is moved outside the neighborhood in the $Z$-space while $z_4$ is moved inside, to match the local geometry of the $Y$-space, as shown in Fig. 1. Thus SDPP incorporates a neighborhood graph $G$ in the objective function, with entries
$$G_{ij} = \begin{cases} 1, & x_j \in N(x_i),\\ 0, & \text{otherwise}. \end{cases}$$
The objective of SDPP is then to minimize
$$F(W) = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n} G_{ij}\bigl(\|W^T x_i - W^T x_j\|^2 - \delta_{ij}^2\bigr)^2,$$
which can be written equivalently as
$$F(W) = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n} G_{ij}\bigl((x_i - x_j)^T W W^T (x_i - x_j) - \delta_{ij}^2\bigr)^2. \qquad (2)$$
The rest of this research is focused on solving model (2) with the neighborhood graph $G_{ij}$. A conjugate gradient (CG) method was used in [43] to optimize the objective function of SDPP efficiently.
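A minimal sketch of the SDPP stress in (2), assuming a 0/1 neighborhood graph built from the k nearest input-space neighbors; the helper name sdpp_stress and the use of scikit-learn's NearestNeighbors are illustrative choices, not part of the implementation in [43].

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def sdpp_stress(W, X, Y, k=5):
    """SDPP stress: squared mismatch between projected-space and response-space
    pairwise distances, restricted to each point's k nearest input-space neighbours."""
    n = X.shape[0]
    Z = X @ W                                    # projected covariates, shape (n, r)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)                    # idx[:, 0] is the point itself
    stress = 0.0
    for i in range(n):
        for j in idx[i, 1:]:
            d2 = np.sum((Z[i] - Z[j]) ** 2)      # squared distance in projected space
            delta2 = np.sum((Y[i] - Y[j]) ** 2)  # squared distance in response space
            stress += (d2 - delta2) ** 2
    return stress / n
```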
3.2. Continuity measure. The continuity measure $M^{Z\to Y}_{\mathrm{cont}}$ of the mapping $r : Z \to Y$ is defined in [43,39] by
$$M^{Z\to Y}_{\mathrm{cont}}(k_r) = 1 - C(k_r)\sum_{i=1}^{n}\sum_{j \in V_{k_r}(i)}\bigl(r_{ij} - k_r\bigr),$$
where $V_{k_r}(i)$ is the set of points that are in the $k_r$-neighborhood of point $z_i$ in the projection space $Z$ but not in the response space $Y$, and $r_{ij}$ is the rank of $y_j$ in the ordering based on its distance from $y_i$. The normalizing constant $C(k_r)$ is defined by
$$C(k_r) = \frac{2}{n k_r (2n - 3k_r - 1)}.$$
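The rank-based definition above can be computed directly from pairwise distances. The sketch below is a naive O(n^2) implementation; the normalization C(k_r) follows the reconstruction given here and should be checked against [39] before use.

```python
import numpy as np

def continuity(Z, Y, k_r):
    """Continuity measure M_cont^{Z->Y}(k_r): penalizes points that are among the
    k_r nearest neighbours of z_i in the projected space but not of y_i in the
    response space, weighted by their response-space rank."""
    n = Z.shape[0]
    Z = np.asarray(Z).reshape(n, -1)
    Y = np.asarray(Y).reshape(n, -1)
    dZ = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    dY = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    total = 0.0
    for i in range(n):
        nz = np.argsort(dZ[i]); nz = nz[nz != i][:k_r]   # k_r neighbours of z_i
        ny = np.argsort(dY[i]); ny = ny[ny != i][:k_r]   # k_r neighbours of y_i
        ranks = {j: r + 1 for r, j in enumerate(np.argsort(dY[i])[1:])}
        for j in set(nz) - set(ny):                      # points in V_{k_r}(i)
            total += ranks[j] - k_r
    return 1.0 - 2.0 / (n * k_r * (2 * n - 3 * k_r - 1)) * total
```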

3.3. Selection of the parameter. The value of the hyper-parameter $k$ can be selected in several ways, as discussed in [7,43]. We have used the continuity measure discussed in Section 3.2. First, different SDPP projection matrices $W(k)$ are learned using different locality widths $k$, in order to obtain different low-dimensional representations. Second, the unseen testing inputs $X_t$ are projected, and for each projection $Z_t(k)$ the corresponding continuity measure $M^{Z\to Y}_{\mathrm{cont}}(k_r)$ is calculated against the corresponding outputs $y_t$ for a sequence of region sizes $k_r$. The value of $k$ with jointly the highest continuity over the range of $k_r$ is then used to learn the final model with all the data. In [5], an analysis of the connectivity of nearest-neighbor graphs suggests that $k$ can also be selected heuristically by setting it to be of the order of $\log(n)$. For small sample sizes ($n \leq 100$), the value of $k$ can be selected as 10% of the available learning points, that is $k \approx 0.1n$.
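The selection procedure can be summarized by the following sketch, in which learn_sdpp stands in for any routine returning a projection matrix W(k) and continuity is the measure of Section 3.2; both names are placeholders rather than functions from the paper's code.

```python
import numpy as np

def select_k(train, test, candidate_ks, k_r_grid, learn_sdpp, continuity):
    """Hyper-parameter selection sketch for the locality width k."""
    X_tr, Y_tr = train
    X_te, Y_te = test
    scores = {}
    for k in candidate_ks:
        W = learn_sdpp(X_tr, Y_tr, k)        # low-dimensional representation for this k
        Z_te = X_te @ W                      # project the unseen test inputs
        scores[k] = np.mean([continuity(Z_te, Y_te, kr) for kr in k_r_grid])
    return max(scores, key=scores.get)       # k with jointly highest continuity
```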

4. Formulation of SDPP as a Semidefinite Least Square (SLS-SDPP). We now modify model (2) by incorporating the total variance of the projected covariates. The new objective of our model is
$$\min_{W}\ \ \sum_{(i,j)\in\xi} G_{ij}\bigl(\|W^T x_i - W^T x_j\|^2 - \delta_{ij}^2\bigr)^2 \;-\; \nu\sum_{i=1}^{n}\|W^T x_i\|^2, \qquad (3)$$
where $\nu > 0$, $G_{ij}$ is defined as before in SDPP, $(i,j) \in \xi$, and $\xi$ is the set of all pairs $(i,j)$ such that $x_j$ is a nearest neighbor of $x_i$. Since $k$ nearest neighbors are chosen for each of the $n$ points $x_i$, we have $|\xi| = kn = p$.

Let $X = WW^T$. The first term of (3) can then be written as
$$\sum_{(i,j)\in\xi} G_{ij}\bigl((x_i - x_j)^T X (x_i - x_j) - \delta_{ij}^2\bigr)^2,$$
and the total variance term becomes $\sum_{i=1}^{n}\|W^T x_i\|^2 = \langle \sum_{i=1}^{n} x_i x_i^T,\, X\rangle$, so that the objective of (3) is a sum of squared affine functions plus a linear function of the positive semidefinite matrix $X$. Then the optimization model of SLS-SDPP can be written as a least-squares problem in $X$ over the cone $S^m_+$. We note that the modified method is sensitive to the choice of the parameter $\nu$. For example, when $\nu \ll 1$ there is a significant level of failure of SLS-SDPP on regression tasks, while for $\nu \gg 1$ SLS-SDPP gives the same performance as SDPP on all the data sets (regression and classification). To obtain a better projection with SLS-SDPP, we simply use $\nu = 1$ to put equal emphasis on both terms of the objective function.

Therefore the optimization problem of SLS-SDPP takes the form of a least-squares problem over the positive semidefinite cone: a linear operator $A : S^m \to \mathbb{R}^p$ collects the maps $X \mapsto (x_i - x_j)^T X (x_i - x_j)$ for $(i,j) \in \xi$, a vector $b \in \mathbb{R}^p$ collects the corresponding targets $\delta_{ij}^2$, and a matrix $\Psi$ carries the linear total-variance term. Our goal is thus to find the best matrix $X$ which solves the resulting Semidefinite Least Square (SLS) problem, referred to as (P). In the next section we will study a two-block ADMM method to solve the SLS-SDPP problem (P).
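To make the reformulation concrete, the sketch below assembles, for every neighbor pair $(i,j) \in \xi$, the rank-one matrix representing the linear map $X \mapsto (x_i - x_j)^T X (x_i - x_j)$ together with the target $\delta_{ij}^2$ and the matrix of the total-variance term. It is a sketch under the notation reconstructed above, not the paper's implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_sls_data(X, Y, k=5):
    """Assemble the data of the least-squares reformulation in Section 4."""
    n, m = X.shape
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    A_mats, b = [], []
    for i in range(n):
        for j in idx[i, 1:]:
            tau = X[i] - X[j]
            A_mats.append(np.outer(tau, tau))        # <tau tau^T, X> = tau^T X tau
            b.append(np.sum((Y[i] - Y[j]) ** 2))     # target delta_ij^2
    B = X.T @ X                                       # sum_i x_i x_i^T (total-variance term)
    return np.array(A_mats), np.array(b), B
```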

5. Alternating Direction Method of Multipliers (ADMM) for SLS-SDPP.
The Alternating Direction Method of Multipliers (ADMM) is a very efficient and important algorithm that solves a convex optimization problem by breaking it into smaller and easier optimization problems. Due to its simplicity and efficiency, ADMM has recently found many applications in several areas such as imaging science, signal processing and machine learning. In this section we will study a two-block ADMM to determine the optimal $X$ of (P) and therefore the best transformation matrix $W$.
The classical 2-block ADMM was first introduced by Glowinski and Marrocco [18] and Gabay and Mercier [15,16]. Its many applications are well documented in the articles of Boyd et al. [3], Eckstein and Yao [10] and in [8,9,19,21,23,26,29,30,31]. The general form of the 2-block convex optimization problem is
$$\min_{z_1, z_2}\ \theta_1(z_1) + \theta_2(z_2) \quad \text{subject to} \quad \beta_1^* z_1 + \beta_2^* z_2 = b, \qquad (4)$$
where for each $i \in \{1,2\}$, $Z_i$ is a finite dimensional real Euclidean space equipped with an inner product $\langle\cdot,\cdot\rangle$ and its induced norm $\|\cdot\|$, $\beta_i : X \to Z_i$ is a linear map, $\beta_i^*$ is the conjugate of $\beta_i$, and $b \in X$ is given. The functions $\theta_i$ are closed, proper and convex, and have no overlapping variables. The dual of (4) is expressed in terms of the conjugate functions $\theta_1^*$ and $\theta_2^*$ of $\theta_1$ and $\theta_2$. For a given $\sigma > 0$, the augmented Lagrangian associated with (4) is
$$L_\sigma(z_1, z_2; x) = \theta_1(z_1) + \theta_2(z_2) + \langle x,\, \beta_1^* z_1 + \beta_2^* z_2 - b\rangle + \frac{\sigma}{2}\|\beta_1^* z_1 + \beta_2^* z_2 - b\|^2. \qquad (6)$$
Therefore, for a chosen $\tau > 0$ and $(z_1^0, z_2^0, x^0) \in \mathrm{dom}(\theta_1) \times \mathrm{dom}(\theta_2) \times X$, the successive iterations of the classical 2-block ADMM are
$$z_1^{k+1} = \arg\min_{z_1} L_\sigma(z_1, z_2^k; x^k), \quad z_2^{k+1} = \arg\min_{z_2} L_\sigma(z_1^{k+1}, z_2; x^k), \quad x^{k+1} = x^k + \tau\sigma\bigl(\beta_1^* z_1^{k+1} + \beta_2^* z_2^{k+1} - b\bigr).$$
The convergence of the 2-block ADMM has been discussed in the literature by several authors. For any $\tau \in (0,2)$, convergence of the 2-block ADMM was first proven by Gabay and Mercier [16] when $\theta_1$ is strongly convex, $\beta_1$ is the identity mapping and $\beta_2$ is injective. Glowinski in [17] and Fortin and Glowinski in [12,13] proved convergence for $\tau \in (0, (1+\sqrt{5})/2)$ when $\theta_2$ is a general nonlinear convex function. We note that the dual problem of (P), denoted (D), can be put in the 2-block form (4) with block variables $z \in \mathbb{R}^p$ and $S \in S^m_+$, while $X$ plays the role of the multiplier. For a given $\sigma > 0$, let $L_\sigma(z, S; X)$ be the augmented Lagrangian function for (D), where $(z, S, X) \in \mathbb{R}^p \times S^m_+ \times S^m_+$. We are now ready to introduce the ADMM algorithm (Algorithm 1) for our SLS-SDPP problem.
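The generic scheme can be written compactly as follows; the sketch fixes only the order of the two block updates and the multiplier step, and leaves the two sub-problem solvers to the caller. It is a schematic illustration of the classical 2-block ADMM for (4), not the specialized Algorithm 1 for (D).

```python
import numpy as np

def admm_2block(prox_update_z1, prox_update_z2, beta1_adj, beta2_adj, b,
                z1, z2, x, sigma=1.0, tau=1.618, max_iter=500, tol=1e-6):
    """Generic 2-block ADMM skeleton for
        min theta_1(z1) + theta_2(z2)  s.t.  beta1*(z1) + beta2*(z2) = b."""
    for it in range(max_iter):
        z1 = prox_update_z1(z2, x, sigma)               # argmin of L_sigma over z1
        z2 = prox_update_z2(z1, x, sigma)               # argmin of L_sigma over z2
        residual = beta1_adj(z1) + beta2_adj(z2) - b    # primal feasibility residual
        x = x + tau * sigma * residual                  # multiplier update, step length tau
        if np.linalg.norm(residual) <= tol * (1 + np.linalg.norm(b)):
            break
    return z1, z2, x
```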
• In (S.2), to update $z$ we need to solve linear systems involving the operator $AA^*$. The computation of $AA^*$ and its (sparse) Cholesky factorization, which only needs to be done once, can be carried out at a moderate cost [34].

Theorem 1. Suppose Assumption 1 holds and $A$ is surjective. Then the sequence $(S^k, z^k, X^k)$ generated by Algorithm 1 is well defined. Furthermore, under the condition that either (a) $\tau \in (0,2)$, or (b) $\tau \geq 2$ but $\sum_{k=0}^{\infty}\|S^{k+1} + A^* z^{k+1} + \Psi\|^2 < \infty$, the sequence $(S^k, z^k, X^k)$ converges to a unique limit, say $(S^\infty, z^\infty, X^\infty)$, satisfying the KKT conditions (7).
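The remark on step (S.2) can be made concrete: if the operator $A$ is stored row-wise as vectorized matrices, the Gram matrix $AA^*$ is fixed across iterations and its Cholesky factorization can be computed once and reused. The right-hand side of the linear system depends on the exact form of (D) and is left to the caller in this sketch.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def make_z_update(A_mats):
    """Factorize the Gram matrix A A* once; return a solver reused at every iteration."""
    A = np.array([M.ravel() for M in A_mats])       # p x m^2, rows are vec(A_ij)
    AAt = A @ A.T                                    # Gram matrix A A*
    factor = cho_factor(AAt + 1e-12 * np.eye(AAt.shape[0]))   # factorize once
    def solve(rhs):
        return cho_solve(factor, rhs)                # cheap triangular solves per iteration
    return solve
```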
It is important to note that the algorithm does not optimize the projection matrix $W$ directly; it optimizes the positive semidefinite matrix $X = WW^T$, from which $W$ is recovered afterwards.
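A minimal sketch of this recovery step, assuming $W$ is built from the leading eigenpairs of $X$ as described in the next paragraph; the function name is illustrative.

```python
import numpy as np

def recover_W(X_opt, r):
    """Recover an m x r projection matrix W from the optimal PSD matrix X = W W^T:
    the i-th column of W is sqrt(lambda_i) * p_i for the i-th largest eigenpair."""
    eigvals, eigvecs = np.linalg.eigh(0.5 * (X_opt + X_opt.T))
    order = np.argsort(eigvals)[::-1][:r]            # r largest eigenvalues
    lam = np.clip(eigvals[order], 0.0, None)         # guard against tiny negatives
    return eigvecs[:, order] * np.sqrt(lam)          # columns sqrt(lambda_i) p_i
```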
The projection matrix $W$ can be computed as the square root of the matrix $X$. If $X$ is optimal then $W$, being the square root of the positive semidefinite matrix $X$, is unique and hence optimal as well. Alternatively, $W$ can be computed by applying a singular value decomposition (SVD) to $X$. In our numerical experiments we have applied an SVD to $X$; the $i$th column of $W$ is calculated as $\sqrt{\lambda_i}\, p_i$, where $\lambda_i$ and $p_i$ are the $i$th eigenvalue and eigenvector respectively. If $X$ were not symmetric there would be many such $W$, and it would be reasonable to ask which one is best; in our case, however, $X$ is a positive semidefinite matrix.

6. Numerical experiments. In this section, we study the effectiveness of the proposed SLS-SDPP in comparison with the SDPP [43], SPCA [1], PLS [40,41], KDR [14] and FDA [28] methods. The performance of the methods is evaluated on several regression and classification problems.
All tests have been carried out using the 64-bit version of MATLAB R2015a on a Windows 7 desktop with 64-bit operating system having Intel(R) Core(TM) 2 Duo CPU of 3.16GHz and 4.0GB of RAM.
For our algorithm, we measure the accuracy of an approximate optimal solution $(X, S, z)$ by the relative residual $\eta$ defined as in [34], together with the complementarity measure
$$\eta_C = \frac{|\langle X, S\rangle|}{1 + \|X\| + \|S\|}.$$
We also compute the relative duality gap
$$\eta_g = \frac{\langle \Psi, X\rangle - \langle b, z\rangle}{1 + |\langle \Psi, X\rangle| + |\langle b, z\rangle|}.$$
We terminate the algorithm when $\eta < 10^{-6}$.
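A small sketch of the two quoted measures; the full relative residual $\eta$ of [34] also contains primal and dual feasibility terms and is not reproduced here.

```python
import numpy as np

def stopping_measures(X, S, z, Psi, b):
    """Complementarity measure eta_C and relative duality gap eta_g."""
    eta_C = abs(np.sum(X * S)) / (1.0 + np.linalg.norm(X) + np.linalg.norm(S))
    obj_P = np.sum(Psi * X)                    # <Psi, X>
    obj_D = float(np.dot(b, z))                # <b, z>
    eta_g = (obj_P - obj_D) / (1.0 + abs(obj_P) + abs(obj_D))
    return eta_C, eta_g
```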
6.1. Parameter setting and performance indicators. For each of the data sets, 60% of the data were initially selected at random as training data. In the numerical experiments, the weight matrix $W$ in $X = WW^T$ is initialized using PCA on the training data set. The value of the neighborhood size $k$ is selected as $k \leq 10$ using the continuity measure (Section 3.2). The step length $\tau$ is set to 1.618 as suggested in [34]. The maximum number of iterations is set to $\lfloor 0.2N \rfloor$, where $N$ is the number of data samples in the data set and $\lfloor 0.2N \rfloor$ is the largest integer not greater than $0.2N$.
For regression problems, the root mean squared error (RMSE) and the mean absolute error (MAE) are calculated to compare the regression accuracy of our algorithm with the other methods mentioned above.
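For completeness, the two error measures used below are the standard ones:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error used to compare regression accuracy."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error used to compare regression accuracy."""
    return float(np.mean(np.abs(y_true - y_pred)))
```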
For the classification problems, the 1-Nearest Neighbor rule is used on the projected low dimensional data set to assign points to classes. For each of the data sets, the classification error rate is calculated as the ratio of the number of misclassified points to the total number of test samples.
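A sketch of this classification protocol, using scikit-learn's KNeighborsClassifier as the 1-nearest-neighbor rule on the projected data; the function name is illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def one_nn_error_rate(Z_train, c_train, Z_test, c_test):
    """1-NN classification on the projected data; returns the fraction of
    misclassified test samples."""
    clf = KNeighborsClassifier(n_neighbors=1).fit(Z_train, c_train)
    pred = clf.predict(Z_test)
    return float(np.mean(pred != c_test))
```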

6.2. Regression.
In this section we evaluate the performance of the proposed method on two real world regression problems obtained from the UCI repository and compare it with other methods in terms of Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE). We use a number of graphs to show the improvement of SLS-SDPP over SDPP and some other leading methods.
Two well known data sets, Parkinson's Telemonitoring (PT) and Concrete Compressive Strength (CCS) [6], are considered for the experimental evaluation. Both data sets are preprocessed by mean centering and normalization to unit variance. After the dimension reduction step, a simple linear model is used for regression. Randomly chosen 60% of the total data are used for training and 40% for testing; 100 such samples are evaluated and their average results are shown in Table 2. In [38], the Parkinson's Telemonitoring data set is verified to be well fitted in a 6 dimensional space, so here we used our approach to obtain the best 6 relevant features. The value of the parameter $k$ is chosen to be 8 using the continuity measure shown in Fig. 3(a). Table 2 and Table 3 report the regression accuracy on the data sets in terms of average root mean square error (RMSE) and mean absolute error (MAE) for each of the five methods. The red colored value indicates the minimum error in fitting the data, blue colored values indicate the best estimation of each method, and bold numbers in each row indicate the lowest error along that dimension. The tables show that the data sets are best fitted by SLS-SDPP compared to the other four methods. Small values of the standard deviation indicate the stability of our algorithm. With SLS-SDPP the CCS data was best fitted at dimension 5; KDR showed the same performance in terms of MAE, which can also be verified from Fig. 4. SDPP achieved its best estimation at the lowest dimension (D = 2), which is useful for visualization purposes.
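The evaluation protocol described above can be summarized as follows; learn_projection is a placeholder for SLS-SDPP or any of the competing methods, and the preprocessing, splitting and linear regression step follow the description in the text.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def evaluate_regression(X, y, learn_projection, dim, n_repeats=100, seed=0):
    """Standardize, split 60/40, reduce dimension, fit a simple linear model,
    and average RMSE over repeated random splits."""
    rng = np.random.RandomState(seed)
    errors = []
    for _ in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.6,
                                                  random_state=rng.randint(1 << 30))
        scaler = StandardScaler().fit(X_tr)            # mean centering, unit variance
        X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)
        W = learn_projection(X_tr, y_tr, dim)          # m x dim transformation matrix
        reg = LinearRegression().fit(X_tr @ W, y_tr)   # simple linear model after DR
        pred = reg.predict(X_te @ W)
        errors.append(np.sqrt(np.mean((y_te - pred) ** 2)))
    return float(np.mean(errors)), float(np.std(errors))
```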

6.3. Classification.
This section is focused on classification problems. A number of real world benchmark data sets from the UCI machine learning repository (http://archive.ics.uci.edu/ml/datasets.html), listed in Table 1, are considered to illustrate the performance of SLS-SDPP in classification tasks compared to the other supervised dimensionality reduction methods SDPP, SPCA and KDR. In addition, we also compare our algorithm with Fisher's Discriminant Analysis (FDA). For each of the data sets, 60% of the data are used to compute the transformation matrix, which is then used to predict the class of the remaining 40% of the data using the nearest-neighbor classifier. Fig. 5 depicts the error rate on the testing data along different projection dimensions.
For the CTG data, SLS-SDPP projected the data with the lowest error rate of 0.1940 at dimension 3, and the error rate remained much lower than that of all other methods at the subsequent dimensions. SDPP produces its best estimation in a 5 dimensional space with error rate 0.2661, while for SPCA and KDR the minimum error rates are 0.2515 and 0.2115, obtained at projection dimensions 9 and 7 respectively, which can also be observed from Table 4. Since this data set has 3 classes, for FDA the solution rank is 2; therefore its error rate remains constant for any dimension m ≥ 2. For the Seismic Bumps data, the lowest error rate of 3.28% is obtained at dimension 1, and all five methods achieve this accuracy. SLS-SDPP shows similar performance on the CTG and Diabetic Retinopathy data. From Table 6 it can be observed that SLS-SDPP classifies the test data at dimension 2 more accurately than all other methods. The next best performance is obtained by SPCA at D = 4, as shown in Fig. 5(c), whereas FDA fails to produce a convincing projection. For the Mushroom data, the best estimation is obtained at D = 9 by SLS-SDPP, and from D = 4 onward the error rate of SLS-SDPP remains consistently lower than that of all other methods.

7. Conclusion. This paper introduces a supervised dimension reduction method, Semidefinite Least Square Supervised Distance Preserving Projection (SLS-SDPP), which is a modification of the recently proposed Supervised Distance Preserving Projection (SDPP) [43]. SDPP is a supervised learning method whose basic formulation aims to preserve distances locally between data points in the projected space (reduced feature space) and the output space. For each data point, the local structure is preserved by keeping the distances to its k nearest neighbors. The value of the parameter k is chosen by a continuity measure. Although the method works very well in regression tasks, for classification problems the preservation of local structure forces data to project very close to one another in the projected space irrespective of their classes, which results in a low classification rate.
To avoid this crowdedness of the SDPP approach we have proposed a modification of SDPP which handles both regression and classification problems and significantly improves the performance of SDPP. In our research, we have incorporated the total variance of the projected covariates into the SDPP problem, which prevents data of different classes from staying close together and therefore preserves the global structure. Thus the purpose of our model is to keep the distance relation with neighbors (local structure) and at the same time to preserve the global structure by maximizing the total variance. This approach not only facilitates efficient regression like SDPP but also successfully classifies data into different classes.
We have reformulated the modified SDPP as a semidefinite least square (SLS-SDPP) model. A two-block Alternating Direction Method of Multipliers (ADMM) has been developed to solve the SLS-SDPP problem. Several real world data sets are considered to demonstrate the performance of our model in reducing the dimension of data, by comparing the results with the baseline method and some other state-of-the-art approaches: Supervised Principal Component Analysis (SPCA), Partial Least Squares (PLS), Kernel Dimension Reduction (KDR) and Fisher's Discriminant Analysis (FDA). The experimental evaluation shows that, in most cases, our method successfully projects the data into a lower dimensional space that preserves the most effective features. In regression tasks the proposed method showed performance equivalent or superior to all other methods. In classification problems, SLS-SDPP significantly improves SDPP and outperforms the other existing state-of-the-art approaches.