PRIVACY PRESERVING FEATURE SELECTION AND MULTICLASS CLASSIFICATION FOR HORIZONTALLY DISTRIBUTED DATA

Abstract. In the last two decades, many scientific fields have experienced a huge growth in data volume and data complexity, which brings data miners many opportunities as well as many challenges. With the advent of the era of big data, applying data mining techniques to data assembled from multiple parties (or sources) has become a leading trend. However, such data mining tasks may divulge individuals' privacy, which leads to increased concerns about privacy preservation. In this work, a Privacy Preserving Feature Selection method (PPFS-IFW) and a Privacy Preserving Multiclass Classification method (PPM2C) are proposed. Experiments were conducted to validate the performance of the proposed approaches. Both PPFS-IFW and PPM2C were tested on six benchmark datasets. The results demonstrate PPFS-IFW's capability to enhance classification accuracy by selecting informative features. PPFS-IFW not only preserves private information but also outperforms several other state-of-the-art feature selection approaches. Experimental results also show that the proposed PPM2C method is workable and stable. In particular, it reduces the risk of over-fitting when compared with a regular Support Vector Machine. Meanwhile, by employing the Secure Sum Protocol to encrypt data at the bottom layer, users' privacy is preserved.

1. Introduction. In the last two decades, many scientific fields have experienced a huge growth in data volume and data complexity, which brings data miners many opportunities as well as many challenges. With the advent of the era of big data, machine learning and data mining approaches have been widely used in a large number of applications to analyze and assemble data. These approaches have become very important tools for discovering useful information in many domains, such as the analysis of medical data, consumer purchase data and census data. Since applying data mining approaches to aggregated datasets may enable us to generate more reliable prediction models and obtain more useful patterns, applying these techniques to data assembled from databases maintained by different sources has become quite popular [26, 6, 37]. Applications focusing on medical research, customer service or homeland security could all benefit from the development and use of data mining techniques [32, 31]. For example, the Center for Disease Control (CDC) may want to identify the trend of a disease via data mining techniques to better understand its progression, but face the dilemma that it does not have the relevant data, while insurance companies have collected considerable data regarding this issue and could provide assistance.
Data distribution can be roughly divided into two categories: horizontal distribution and vertical distribution. Data stored at different locations is horizontally distributed if the distributed records share common attributes [8, 45]; an example is customer information collected at different bank branches. Data can also be vertically distributed: in this case, the sites hold the same records but the attributes collected at each site may differ, as when a bank, an insurance company and an auto insurance company each collect different information about the same customers.
Data mining does play an effective role in uncovering hidden and useful information. However, mining on aggregated data might divulge sensitive information about individuals. This leads to increasing concerns about privacy protection during the data mining process, and a strong incentive to prevent different parties from sharing information (especially sensitive information). To deal with this challenge, many researchers have focused on Privacy Preserving Data Mining (PPDM) techniques. For example, a lot of work has been done in [47, 17, 50, 2, 41, 44, 30, 11, 52, 21, 22], which aims to provide a means to address the privacy preserving issue without accessing the actual data values, and thus avoid the disclosure of information beyond the final results.
Among machine learning and data mining approaches, feature selection and classification are two of the most important topics. Feature selection techniques address the issue of dimensionality reduction by selecting a suitable subset of features via predetermined selection criteria. They thus play a vital role in optimizing the mining procedure: selecting a smaller feature set improves the performance of data mining algorithms. Many feature selection approaches have been proposed [24, 43, 40, 49, 42], but few of them take privacy preservation into account. Classification aims to identify or classify data belonging to unknown groups into different categories by building effective classifiers. During the classification process, a set of data samples known as the training set is commonly used to generate and validate the classifiers. Multiclass classification, a branch of the classification problem, has been a hot research topic in many domains for the past few years and has become more and more important in the era of big data. During the classification process, irrelevant and redundant features are usually removed by a feature selection procedure; thus, a good feature selection strategy can reduce the computational complexity of the classification process. Researchers have proposed a large number of state-of-the-art multiclass classification approaches and algorithms based on traditional but popular classification algorithms, such as the Support Vector Machine (SVM) [13, 10], Decision Tree (DT) [39], Naive Bayes (NB) [38] and K-Nearest Neighbor (KNN) [5].
In this work, we investigate privacy preserving feature selection and multiclass classification for horizontally distributed data. Based on one of our previously published papers [33], we propose a method of Privacy Preserving Feature Selection via Integrating Filter and Wrapper approaches for horizontally distributed data, named PPFS-IFW, and a privacy preserving multiclass classification algorithm named PPM2C. Experiments were conducted to validate the performance of the proposed approaches. Both PPFS-IFW and PPM2C were tested on six benchmark datasets. The results demonstrate PPFS-IFW's capability to enhance classification accuracy by selecting informative features. PPFS-IFW not only preserves private information but also outperforms several other state-of-the-art feature selection approaches. Experimental results also show that the proposed PPM2C method is workable and stable. In particular, it reduces the risk of over-fitting when compared with a regular Support Vector Machine. Meanwhile, by employing the Secure Sum Protocol to encrypt data at the bottom layer, users' privacy is preserved.
In the following sections, the details of PPFS-IFW will be introduced in section 2. PPM2C will be introduced in section 3, followed by the experimental results and discussion presented in section 4. A short conclusion is included in section 5.
2. Privacy preserving feature selection via integrating filter and wrapper approaches.

2.1. Background. Feature selection methods can be grouped into two categories according to their search direction: forward selection and backward selection [35]. Forward selection usually starts searching for relevant features from an empty subset and adds one or more at each step until a stop criterion is met. In contrast, backward selection methods usually start from the entire feature space and eliminate one or more features at each step until a predetermined stop criterion is reached.
According to their selection strategies and procedures, feature selection methods can also be divided into three main categories: filter, wrapper and embedded approaches [24]. Filter methods usually take into account the statistical properties of features and rank them according to predefined relevance criteria. This step is always done prior to classification and is completely independent of the data mining algorithms; the selected feature subset is then applied to the classification or clustering algorithm. Since the selection procedure is independent of the mining algorithms, the selected subset is not biased toward any particular algorithm, and filter methods are fast. Just as the name implies, wrapper methods wrap the feature selection step into the data mining algorithm itself. Compared with filter methods, wrapper methods have the advantage that they take the performance of the data mining algorithm into account, so a better classification model can be constructed. However, they need to repeatedly train and test a classification model each time a subset of features is selected, which sharply increases the computational complexity.
The third kind of feature selection approach is the embedded method, which performs feature selection during the construction of the data mining model by adding to or modifying the optimization process of the classifier, as discussed in [36, 9]. Many feature selection approaches have been proposed [24, 43, 40, 49, 42, 28, 29] for data integrated into a central location, but the privacy concerns of sharing data among distributed parties bring many challenges to feature selection.
To solve the problem of privacy during feature selection, we propose a Privacy Preserving Feature Selection method via Integrating Filter and Wrapper approaches (PPFS-IFW in brief) for horizontally distributed data. PPFS-IFW is built on our previously published privacy preserving framework, the Privacy Aware Non-linear SVM for Multi-source Big Data (PAN-SVM) [33].

2.2. Methods. In this section, we discuss in detail the methods used in feature selection.
2.2.1. Filter Measurements. Filter methods often employ several independent measurements to evaluate features and select relevant, informative features before the classification process. They are usually much faster than wrapper methods; especially for high dimensional datasets, it is necessary to remove unnecessary features first to speed up the selection procedure. In the current work, three popular measurements are used as the filtering criteria: the Fisher score [16], Welch's t-test [46] and the between versus within class scatter ratio [15].
• Fisher Score.

$$F(j) = \frac{(\mu_j^+ - \mu_j^-)^2}{(\sigma_j^+)^2 + (\sigma_j^-)^2} \qquad (1)$$

• Welch's t-test.

$$t(j) = \frac{\mu_j^+ - \mu_j^-}{\sqrt{(\sigma_j^+)^2/n^+ + (\sigma_j^-)^2/n^-}} \qquad (2)$$

For the Fisher score and Welch's t-test, $j$ indexes the $j$-th feature; $\mu_j^+$, $\mu_j^-$, $\sigma_j^+$ and $\sigma_j^-$ denote the means and standard deviations of the $j$-th feature in the positive and negative classes, and $n^+$ and $n^-$ denote the sample numbers in class $+$ and class $-$, respectively.

• Between versus Within Class Scatter Ratio.

$$S_j = \frac{[S_b]_{jj}}{[S_w]_{jj}}, \qquad S_w = \sum_{i=1}^{c} \sum_{x_j \in \mathrm{class}\ i} (x_j - \mu_i)(x_j - \mu_i)^T, \qquad S_b = \sum_{i=1}^{c} n_i (\mu_i - \mu)(\mu_i - \mu)^T \qquad (3)$$

For the scatter ratio, $S_w$ and $S_b$ denote the within-class and between-class scatter matrices, respectively, and $S$ is the between versus within class scatter ratio, a vector with $n$ (the number of features) elements. $c$ equals the number of classes; in particular, $c = 2$ for binary classification. $x_j$ represents the $j$-th record in the $i$-th class, $n_i$ denotes the number of samples in the $i$-th class, $\mu_i$ denotes the mean of the $i$-th class, and $\mu$ denotes the overall mean over the $c$ classes.
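The three filter criteria above can be sketched in a few lines of NumPy. This is a minimal illustration under our own conventions, not the authors' implementation; it assumes binary labels coded as +1/−1 (the scatter ratio also handles more classes) and returns one score per feature:

```python
import numpy as np

def fisher_score(X, y):
    # Fisher score per feature: squared class-mean gap over the
    # summed class variances (equation (1)).
    pos, neg = X[y == 1], X[y == -1]
    return (pos.mean(0) - neg.mean(0)) ** 2 / (
        pos.std(0, ddof=1) ** 2 + neg.std(0, ddof=1) ** 2)

def welch_t(X, y):
    # Welch's t-statistic per feature, allowing unequal class
    # variances and sample sizes (equation (2)).
    pos, neg = X[y == 1], X[y == -1]
    se = np.sqrt(pos.var(0, ddof=1) / len(pos) + neg.var(0, ddof=1) / len(neg))
    return (pos.mean(0) - neg.mean(0)) / se

def scatter_ratio(X, y):
    # Diagonal of the between-class scatter over the diagonal of the
    # within-class scatter, per feature (equation (3)).
    mu = X.mean(0)
    s_b = np.zeros(X.shape[1])
    s_w = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        s_b += len(Xc) * (Xc.mean(0) - mu) ** 2
        s_w += ((Xc - Xc.mean(0)) ** 2).sum(0)
    return s_b / s_w
```

For all three measurements, a larger magnitude indicates a more discriminative feature, so they can be thresholded independently before the voting step.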

2.2.2. PAN-SVM Classifier.
When applied to a classification problem, the wrapper method for selecting features is closely tied to the classifier. In the current work, classification accuracy is used as the wrapper criterion, by which features are evaluated at each iteration, and PAN-SVM [33] is used as the binary classifier. PAN-SVM contains three layers, each performing a corresponding function. The bottom layer protects individuals' data privacy: sampled data from multiple parties are encrypted via the Secure Sum Protocol [12] and sent to the miner. At this layer, data are sampled by the k-means clustering method and the cluster centers are used as landmarks [51, 14]. At the middle layer, the landmark points are used to approximate the kernel matrix via the Nystrom technique [51, 14] to reduce the heavy computational burden, and the matrix is then decomposed via an eigenvalue decomposition method. After the kernel matrix approximation and decomposition, the non-linearly separable SVM is converted into a linearly separable SVM. The linear SVM is optimized and sped up by linear search and cutting plane techniques [18] at the top layer. Although classification accuracy is sacrificed slightly when compared with a regular SVM such as LIBSVM [54], individuals' private information is preserved. Furthermore, the training process is sped up compared with other distributed classification methods. Details about PAN-SVM can be found in [33].
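The Secure Sum Protocol at the bottom layer can be illustrated with a small sketch. This is a simplified single-round version under the usual assumptions (semi-honest parties arranged in a ring, inputs and their total smaller than the modulus); the actual protocol follows [12]:

```python
import random

def secure_sum(party_values, modulus=2**32):
    # The initiator masks the running total with a random value R in
    # [0, modulus); each party adds its private value modulo the modulus,
    # so the value passed along the ring looks uniformly random to every
    # intermediate party. Only the final total minus R is revealed.
    R = random.randrange(modulus)
    running = R
    for v in party_values:              # each party adds its private input
        running = (running + v) % modulus
    return (running - R) % modulus      # initiator removes the mask
```

For example, three hospitals could aggregate local case counts with `secure_sum([120, 340, 75])` and learn only the total, 535, while no party sees another's raw count.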

2.2.3. Workflow. The proposed privacy preserving feature selection algorithm PPFS-IFW integrates the filter and wrapper methods. The workflow can be summarized as follows:
Step 1. Calculate the three measurements as stated in equations (1), (2) and (3).
Step 2. Choose the features selected by all three measurements and remove the unselected ones. Features are selected if they meet the predetermined thresholds on the three measurements. The chosen feature set is referred to as the voted feature set in the following steps.
Step 3. Rank the features in the voted set (wrapper method). The ranking process can be summarized in five sub-steps. First, compute the overall classification accuracy with the selected features, denoted overall_acc. Second, compute the classification accuracy of each classifier (PAN-SVM) trained without feature i (i = 1...k), denoted acc_i; for data with k features, there will be k such classifiers. Third, rank the features based on these accuracies. Fourth, remove a feature as follows: suppose acc_i is the highest accuracy among the k accuracies calculated in the second sub-step. If acc_i > overall_acc, remove feature i, since eliminating it increases the classification accuracy. Otherwise, if no feature increases the accuracy, a local maximum of acc_i is reached; set overall_acc = acc_i and remove feature i, since it has the highest negative effect on the classifier. Last, repeat the ranking process.
Step 4. Return a ranked feature list.
In step 2, the user can define a threshold indicating the approximate number of features to be kept or eliminated. For example, for the Welch's t-test measurement, features with a p-value larger than 0.01 will be deleted; therefore, a large number of features can be eliminated at this step. The highest negative effect used in step 3 can be illustrated as follows: if removing feature i yields 98% classification accuracy while removing feature j yields 90%, we say feature i has a higher negative effect on the classifier than feature j, so feature i is eliminated at this step. 5-fold cross validation is used at each iteration for every classifier. PPFS-IFW returns a ranked list of the features selected by the three filter methods.
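The wrapper ranking of step 3 can be sketched as a backward-elimination loop. This is an illustrative reconstruction, not the authors' code; the hypothetical callable `cv_accuracy(subset)` stands in for the 5-fold cross-validated accuracy of a PAN-SVM classifier trained on that feature subset:

```python
def rank_features(features, cv_accuracy):
    # Backward elimination over the voted feature set: at each round,
    # train one classifier per candidate feature with that feature held
    # out, and remove the feature whose absence yields the highest
    # accuracy. Features removed last are the most important.
    remaining = list(features)
    removal_order = []
    while len(remaining) > 1:
        # accuracy of k classifiers, each trained without one feature
        accs = {f: cv_accuracy([g for g in remaining if g != f])
                for f in remaining}
        worst = max(accs, key=accs.get)  # its removal helps (or hurts least)
        # per the workflow, the feature is removed in both cases: either
        # it hurt the classifier, or a local accuracy maximum was reached
        # and it is the least informative feature at this point
        remaining.remove(worst)
        removal_order.append(worst)
    removal_order.extend(remaining)
    return list(reversed(removal_order))  # most important feature first
```

Plugging in a real `cv_accuracy` built on any binary classifier reproduces the ranked list that PPFS-IFW returns in step 4.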
3. Privacy preserving multiclass classification.

3.1. Methods. In this section, the details of the PPM2C method are presented.
3.1.1. Multiclass Support Vector Machines. Currently, methods for solving the multiclass classification problem fall into two categories. The first category aims at directly solving the multiclass problem by extending existing classifiers and considering all collected data in one optimization problem. The other category, on the contrary, solves the problem indirectly by converting it into several binary classification problems, constructing and combining multiple binary classifiers, which are usually SVM classifiers. SVM is a well-known, sophisticated classification method originally designed for binary classification; it has since been widely used in many other settings, including multiclass classification. The indirect approaches are commonly formulated in two ways: the One-Versus-All (OVA) method [7] and the One-Versus-One (OVO) method [27], which are briefly discussed in the following two subsections.
3.1.2. One Versus All. For a k-class classification problem, the OVA method constructs k SVM classifiers. The h-th SVM is trained by assigning all samples in the h-th class a positive label (+1) and all samples in the remaining classes a negative label (−1). Thus, given a multiclass classification problem, the h-th SVM solves the problem described in equation (4):

$$\min_{w^h,\, b^h,\, \xi^h}\ \frac{1}{2}\|w^h\|^2 + C\sum_{i=1}^{m}\xi_i^h \quad \text{s.t.}\quad (w^h)^T\Phi(x_i)+b^h \ge 1-\xi_i^h \ \text{ if } y_i = h,\quad (w^h)^T\Phi(x_i)+b^h \le -1+\xi_i^h \ \text{ if } y_i \ne h,\quad \xi_i^h \ge 0, \qquad (4)$$

where $x_i \in R^n$ $(i = 1, 2, ..., m)$ is a sample with $n$ attributes, $y_i \in \{1, 2, ..., k\}$ is the class label of $x_i$, and $\Phi$ is the feature mapping inducing the Radial Basis Function (RBF) kernel $K(x, x') = \exp(-\gamma\|x - x'\|^2)$ used here. $C$ is a penalty parameter. Training the h-th SVM is equivalent to finding the maximal separating hyperplane by maximizing the margin $2/\|w^h\|$. After solving equation (4), k decision functions are obtained and can be used to predict unknown samples. There are k output values for the k classifiers, and a "winner takes all" strategy is used to decide the label of a sample x: if exactly one classifier predicts a positive value for x, x is assigned to the positive class of that classifier; otherwise, x is assigned to the class whose decision function outputs the largest value. For example, if the classifiers for class 2 and class 3 output −2.1 and −2.9 respectively, and these are the largest values, then x belongs to class 2.
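The "winner takes all" decision rule can be sketched as follows; the decision values are assumed to come from the k already-trained OVA classifiers (PAN-SVM classifiers in this work):

```python
import numpy as np

def ova_predict(decision_values):
    # Winner takes all over the k one-vs-all decision values: the sample
    # is assigned to the class whose classifier outputs the largest value,
    # even when every value is negative.
    return int(np.argmax(decision_values)) + 1  # classes numbered 1..k
```

For instance, with three OVA classifiers outputting `[-3.4, -2.1, -2.9]`, the least negative value belongs to class 2, so class 2 is predicted.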
The advantage of the OVA scheme is that only k binary SVM classifiers have to be trained for a k-class classification problem, which speeds up the entire training process. However, the one-versus-all method might make the training data dramatically unbalanced.

3.1.3. One Versus One. For a k-class classification problem, the OVO method constructs k(k − 1)/2 binary classifiers to separate each pair of classes, such as class 1 vs class 2, class 1 vs class 3, class 2 vs class 3, ..., class k − 1 vs class k. Different approaches can be used to test unknown data after all the SVM classifiers are built. [19] proposed a "Max Wins" strategy based on a sign function: if the decision function for classes i and j assigns x to the i-th class, the vote for the i-th class is increased by one; otherwise, the vote for the j-th class is increased by one. In the end, the predicted class for x is the one with the largest vote. Although the OVO approach has to train more binary classifiers than the OVA strategy, it is usually much faster, because building many small SVM classifiers is faster than building a few large ones when solving the quadratic programming optimization problems inside SVM. However, a potential limitation of "Max Wins" is that it may fail when two classes receive identical votes. Since efficiency is an important concern, in this work the OVA schema is used for PPM2C.
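For comparison, the "Max Wins" voting of the OVO scheme can be sketched as below. Here `pairwise_decision(i, j)` is a hypothetical stand-in for the binary classifier separating classes i and j, returning the winning class label:

```python
from collections import Counter
from itertools import combinations

def ovo_predict(k, pairwise_decision):
    # "Max Wins": each of the k(k-1)/2 pairwise classifiers casts one
    # vote; the class with the most votes is predicted. Ties are left
    # unresolved here, which is the limitation noted above.
    votes = Counter(pairwise_decision(i, j)
                    for i, j in combinations(range(1, k + 1), 2))
    return votes.most_common(1)[0][0]
```

With k = 3 there are three pairwise votes; if class 2 beats both class 1 and class 3, it wins with two votes regardless of the (1, 3) outcome.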

3.1.4. Workflow. The workflow of PPM2C is briefly presented in Fig. 1. PPM2C is also based on [33]. PAN-SVM provides a privacy preserving framework for binary classification on horizontally distributed data. The data are encrypted via the Secure Sum Protocol, a Secure Multiparty Computation (SMC) technique presented in [12]. Furthermore, the computation of the kernel matrix in SVM is reduced significantly via the Nystrom kernel matrix approximation method discussed in [14, 48] and an eigenvalue decomposition approach. In PPM2C, PAN-SVM is used to construct the multiple binary classifiers for the purpose of preserving privacy during the data mining process.
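The Nystrom step can be illustrated with a small sketch: m landmark points yield an explicit feature map Z with K ≈ ZZ^T, which is what turns the non-linear SVM into a linear one. This is a generic illustration of the technique under our own naming, not the PAN-SVM implementation:

```python
import numpy as np

def rbf(X, Y, gamma=0.5):
    # RBF kernel matrix between the rows of X and the rows of Y.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystrom_features(X, landmarks, gamma=0.5):
    # Nystrom approximation: with W the m x m landmark kernel and C the
    # n x m cross-kernel, eigendecompose W and form Z = C W^{-1/2},
    # so that Z Z^T approximates the full n x n kernel matrix.
    W = rbf(landmarks, landmarks, gamma)
    C = rbf(X, landmarks, gamma)
    vals, vecs = np.linalg.eigh(W)
    keep = vals > 1e-10                  # drop near-zero eigenvalues
    return C @ vecs[:, keep] @ np.diag(vals[keep] ** -0.5)
```

A linear SVM trained on the rows of Z then behaves like an RBF SVM on the original data, at a cost driven by m rather than n.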

4. Experimental results and discussion. The performance is assessed by classification accuracy, formulated in equation (5):
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \qquad (5)$$

4.1. Evaluation of PPFS-IFW. The Cross Validation (CV) method is often used to assess the performance of a classifier when there is a lack of data that could serve as separate testing samples. During the cross validation process, the data are randomly split into k subsets (k-fold). At each training round, k − 1 subsets are used as the training data and the remaining subset is used as the testing set. However, as pointed out by [49], the feature selection results may vary due to even a single difference in the training set, especially for small datasets. Many feature selection methods are performed on all samples, with cross validation applied only during the classification process; this leaves the feature selection external to the cross validation procedure and leads to information leakage in the feature selection step. [49] calls this kind of cross validation error CV1, demonstrates the resulting bias via simulated data, and suggests another evaluation method, named CV2. Under the CV2 scenario, a separate dataset is used as test samples and excluded from the training set before the feature selection process. The benchmark datasets were obtained from [3], and the microarray datasets of Leukemia, Lymphoma and Colon [23] were downloaded from [53]. The details of the datasets are presented in Table 1. C is the penalty parameter for SVM and γ is a free parameter of the Radial Basis Function (RBF) kernel used in SVM; they are selected by 10-fold cross validation.
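The CV2 discipline, in which feature selection is redone inside each training fold, can be sketched as follows. This is a generic illustration assuming user-supplied `select` and `fit_predict` callables, not the authors' experimental harness:

```python
import numpy as np

def cv2_accuracy(X, y, select, fit_predict, k=5):
    # CV2-style evaluation: `select(X_train, y_train)` returns the chosen
    # feature indices using the training folds ONLY, so the held-out fold
    # never influences which features are kept (no CV1-style leakage).
    idx = np.arange(len(y))
    rng = np.random.default_rng(0)
    rng.shuffle(idx)
    folds = np.array_split(idx, k)
    accs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        feats = select(X[train], y[train])   # selection sees train only
        pred = fit_predict(X[train][:, feats], y[train], X[test][:, feats])
        accs.append(np.mean(pred == y[test]))
    return float(np.mean(accs))
```

Running the same harness with `select` computed once on the full dataset would reproduce the CV1 setting and its optimistic bias.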

4.1.2. Effectiveness of PPFS-IFW. To check whether the proposed algorithm is workable and effective, PPFS-IFW was tested on the six datasets under the CV1 and CV2 scenarios, respectively. The classification accuracies achieved by PAN-SVM before and after the feature selection of PPFS-IFW are shown in Figures 2 and 3 under the CV1 and CV2 testing schemas. These results illustrate that the proposed algorithm PPFS-IFW is not only workable but also effective.

4.1.3. Performance Comparison under CV1 and CV2. A further comparison of the classification accuracies achieved by PAN-SVM is conducted to check whether there is any difference between the CV1 and CV2 testing schemas. The detailed results are shown in Figure 4 (classification accuracy comparison before and after feature selection with PPFS-IFW) and Table 2. It can be seen from Figure 4 that before the proposed feature selection procedure, the classification accuracies of PAN-SVM with all of the features are comparable for the datasets DIA, Ionosphere, Colon and WBC, but are enhanced slightly under the CV1 schema for the datasets DLBCL and Leukemia. After feature selection, the classification accuracies are improved significantly for all of the datasets under either testing situation, but no obvious pattern is found. The results in Table 2 give a wider picture and illustrate that the accuracy improvement is higher under the CV1 test condition. This makes sense, because information leakage occurs during the feature selection step under the CV1 schema. The fourth and fifth columns in Table 2 show the number of selected features under CV1 and CV2. From these two columns we can observe that although the accuracy improvement is slightly higher under CV1, the number of selected features under CV2 is usually smaller than under CV1.
The experimental results show that PPFS-IFW can significantly improve the classification performance by selecting informative features under both the CV1 and CV2 testing situations. The improvement in classification accuracy shows no obvious pattern across datasets when tested under CV1 and CV2. Overall, a higher improvement is obtained under the CV1 testing scheme, but fewer features are selected under the CV2 testing situation, where a separate dataset is used as testing samples.

4.1.4. Comparison with Other Methods. We also compare the proposed method with other state-of-the-art methods, namely Fisher-SVM, FSV, RFE-SVM and KP-SVM [34, 25], on the three common datasets DIA, WBC and Colon; the results of these methods are taken from [34]. From Table 3, we can observe that the proposed PPFS-IFW (referred to as "ours" in the table) outperforms the other methods on the datasets DIA and WBC, which have few features, but is defeated by Fisher-SVM, RFE-SVM and KP-SVM on the high-dimensional Colon data. The proposed feature selection algorithm PPFS-IFW is based on PAN-SVM, which employs the Nystrom technique to approximate and decompose the kernel matrix in order to reduce the computation and communication costs of distributed mining; it is therefore reasonable, and acceptable, that classification accuracy is slightly sacrificed on some data. The most important difference from the other methods is that PPFS-IFW preserves individual privacy while still significantly improving classifier accuracy.

4.2.1. Datasets. To test the performance of the proposed PPM2C, experiments were conducted; all of the following results show the average values obtained from multiple runs. Specifically, PPM2C was tested on six datasets with different numbers of classes, sample sizes and features. The DNA, Vowel and Letter datasets were downloaded from the LIBSVM repository [54], and the Lung cancer dataset was downloaded from the University of California, Irvine (UCI) Machine Learning Repository [4]. The microarray Leukemia dataset [23] was downloaded from [53]. The details of the datasets are summarized in Table 4.

4.2.2. Reliability. The reliability of PPM2C was first tested to measure whether the proposed method is workable and reliable. PPM2C is considered reliable and workable if it can achieve approximately the same classification accuracies as a regular SVM, such as LIBSVM. The experimental results are shown in Fig. 5. From the results we can tell that the classification performance achieved by PPM2C is as good as LIBSVM for five out of six datasets (Leukemia 3 c, Leukemia 4 c, Vowel, Lung 3 c and Letter). Since the PAN-SVM used in PPM2C relies on landmarks (selected samples) [14, 51] to approximate the kernel matrix, the classification accuracies are sacrificed slightly, which is expected. In the long run, this sacrifice is worthwhile, because the PAN-SVM framework [33] sharply reduces the communication and computation costs of distributed classification and, more importantly, protects individuals' data. The classification accuracy for the DNA dataset is higher than that of LIBSVM, probably because the data in this dataset are sparse. Therefore, we conclude that PPM2C is workable and reliable.

4.2.3. Stability and Over-fitting. As in much of the literature, classification accuracy is usually assessed by cross validation (CV). During the cross validation process, data are randomly split into k subsets (k-fold). At each training round, k − 1 subsets are used as the training data and the remaining subset is used as the testing set. In other words, all samples are involved in both the training and testing process, which can lead to information leakage. As before, performance was tested under both the CV1 and CV2 schemas. Specifically, under the CV2 scenario, 1/5 of the total samples are randomly selected as a separate testing set and are not involved in the training process, while under CV1 all samples are used in either training or testing via 5-fold cross validation. Figs. 6 and 7 show the experimental results of PAN-SVM and LIBSVM under the CV1 and CV2 scenarios, respectively. As shown in Figs. 6 and 7, the prediction accuracies of PAN-SVM increase for the datasets Leukemia 3 c, Leukemia 4 c and Lung cancer, but only slightly (less than 1.68%), so the improvement is not significant. In other words, by employing PAN-SVM, PPM2C is stable under either testing situation. In contrast, the prediction performance of the LIBSVM classifier decreases (by about 5.19%) under the CV2 situation, which means that CV1 makes LIBSVM achieve higher classification accuracy, especially on small datasets such as Leukemia 3 c, Leukemia 4 c and Lung cancer. The decreased classification accuracy under the CV2 scenario indicates that LIBSVM suffers from over-fitting, especially on small datasets, while PAN-SVM is relatively stable and reduces such a risk. To further demonstrate the stability of PPM2C, more tests were conducted.
Since the core of PPM2C, PAN-SVM, depends on landmarks for approximating the kernel matrix, the tests were run with different percentages (25%, 30%, 35%, 40%, 45%, 50% and 55%) of landmarks under the CV1 and CV2 situations, respectively.
The experiments were run on the Leukemia 3 c, Leukemia 4 c, DNA and Lung cancer datasets, and the results are shown in Figs. 8 and 9. The curves in Figs. 8 and 9 show that under both the CV1 and CV2 cases, the classification accuracy has no obvious change for different numbers of landmarks (from 0.13% to 1.16% under CV1, and from 0.34% to 1.59% under CV2). This demonstrates PPM2C's capability to classify data and shows that the classification accuracy remains stable under different landmark (sample) sizes.

5. Conclusion. In this work, a privacy preserving feature selection method (PPFS-IFW) and a privacy preserving multiclass classification method (PPM2C) are proposed based on our previously proposed privacy preserving classification framework PAN-SVM. PPFS-IFW inherits the privacy preserving property of PAN-SVM and integrates three filter methods. Features are first evaluated based on three predefined measurements, and then voted on and selected according to the calculated scores. The voted features are further ranked according to their classification accuracy. Finally, PPFS-IFW returns a feature ranking list with the most important feature on top. PPM2C converts the multiclass classification problem into multiple binary classification problems, solved here by PAN-SVM classifiers. The privacy preserving step works just as in PAN-SVM: data are encrypted via the Secure Sum Protocol at the bottom layer, and sampled landmarks are used to approximate the kernel matrix that has to be computed during SVM training. PPM2C inherits the privacy preserving and effectiveness properties of PAN-SVM but can be used to solve multiclass classification problems. The performance of PPFS-IFW and PPM2C was tested on six benchmark datasets in two situations, referred to as CV1 and CV2. The experimental results demonstrate that PPFS-IFW and PPM2C outperform the compared methods.