# American Institute of Mathematical Sciences

November  2018, 1(4): 331-348. doi: 10.3934/mfc.2018016

## Privacy preserving feature selection and Multiclass Classification for horizontally distributed data

1 Department of Computer Science, 33 Gilmer Street SE, Atlanta, GA, USA

2 University of North Georgia, Dahlonega, GA, USA

3 Data-driven Intelligence Research Laboratory, College of Computing and Software Engineering, Kennesaw State University, 1100 South Marietta Pkwy, Marietta, GA, USA

* Corresponding author: Meng Han

Received August 2018; Revised October 2018; Published December 2018

Over the last two decades, many scientific fields have experienced rapid growth in data volume and complexity, which brings data miners many opportunities as well as many challenges. With the advent of the big data era, applying data mining techniques to data assembled from multiple parties (or sources) has become a leading trend. However, such data mining tasks may divulge individuals' private information, which has increased concerns about privacy preservation. In this work, a Privacy Preserving feature selection method (PPFS-IFW) and a Multiclass Classification method (PPM2C) are proposed. Experiments were conducted to validate the performance of the proposed approaches: both PPFS-IFW and PPM2C were tested on six benchmark datasets. The results demonstrate that PPFS-IFW improves classification accuracy by selecting informative features. PPFS-IFW not only preserves private information but also outperforms some other state-of-the-art feature selection approaches. Experimental results also show that the proposed PPM2C method is workable and stable. In particular, it reduces the risk of over-fitting compared with the regular Support Vector Machine. Meanwhile, by employing the Secure Sum Protocol to encrypt data at the bottom layer, users' privacy is preserved.
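The abstract names the Secure Sum Protocol but this page does not describe how the paper applies it. As a rough illustration only, the sketch below shows the textbook ring-based secure sum, in which the initiating party masks its value with a random offset so that no intermediate party sees a raw local value. The function name `secure_sum`, the modulus, and the example counts are assumptions made for this sketch and are not taken from the paper.

```python
import secrets


def secure_sum(local_values, modulus=2**61 - 1):
    """Textbook ring-based secure sum (illustrative, not the paper's code).

    Assumes every local value is a non-negative integer and that the true
    total is smaller than `modulus`.
    """
    if not local_values:
        raise ValueError("need at least one party")

    # Party 0 draws a secret random mask and sends (mask + its value) onward.
    mask = secrets.randbelow(modulus)
    running = (mask + local_values[0]) % modulus

    # Each remaining party only sees the masked running total and adds
    # its own private value before passing it on.
    for value in local_values[1:]:
        running = (running + value) % modulus

    # The total returns to party 0, which removes the mask.
    return (running - mask) % modulus


# Hypothetical example: three parties hold 300, 268 and 200 samples.
total = secure_sum([300, 268, 200])
print(total)
```

In a horizontally distributed setting each party holds different samples over the same features, so aggregate statistics such as counts or sums can be combined this way without exchanging the underlying records. A real deployment would also have to consider colluding neighbours, which this single-process sketch does not model.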

Citation: Yunmei Lu, Mingyuan Yan, Meng Han, Qingliang Yang, Yanqing Zhang. Privacy preserving feature selection and Multiclass Classification for horizontally distributed data. Mathematical Foundations of Computing, 2018, 1 (4) : 331-348. doi: 10.3934/mfc.2018016
##### Figures and Tables:

Workflow of PPM2C

Classification accuracy improved by PPFS-IFW under CV1 scenario

Classification accuracy improved by PPFS-IFW under CV2 scenario

Classification accuracy comparison before and after feature selection (PPFS-IFW)

Comparison of classification accuracy for PPM2C when using PAN-SVM and LIBSVM

Classification accuracy of PrivacySVM under CV1 and CV2

Classification accuracy of LIBSVM under CV1 and CV2

Classification accuracy of PrivacySVM under CV1

Classification accuracy of PrivacySVM under CV2

Details of Datasets used in Evaluation of PPFS-IFW

| Dataset | num. samples | num. features | C | $\gamma$ |
|---|---|---|---|---|
| Diabetes (DIA) | 768 | 8 | 512.0 | 0.0078125 |
| Ionosphere | 351 | 34 | 8.0 | 0.5 |
| Colon | 62 | 2000 | 32.0 | 0.0078125 |
| Leukemia | 72 | 7129 | 128.0 | 0.0001221 |
| Lymphoma (DLBCL) | 47 | 4026 | 2.0 | 0.0078125 |
| Breast Cancer (WBC) | 569 | 30 | 128.0 | 8.0 |

Accuracy improved under CV1 and CV2

| Dataset | CV2 | CV1 | CV1 num. of features | CV2 num. of features |
|---|---|---|---|---|
| DIA | $3.39\%$ | $2.10\%$ | $4$ | $4$ |
| Ionosphere | $0.35\%$ | $3.42\%$ | $2$ | $8$ |
| Colon | $3.08\%$ | $8.00\%$ | $34$ | $157$ |
| WBC | $2.47\%$ | $1.12\%$ | $10$ | $4$ |
| DLBCL | $5.57\%$ | $10.95\%$ | $394$ | $444$ |
| Leukemia | $8.57\%$ | $3.45\%$ | $537$ | $631$ |
| Sum | $23.43\%$ | $29.04\%$ | $981$ | $1248$ |

Accuracy comparison with other methods

| Dataset | Fisher SVM | FSV | RFE SVM | KP SVM | Ours (CV2) | Ours (CV1) |
|---|---|---|---|---|---|---|
| DIA | $76.42$ | $76.58$ | $76.56$ | $76.74$ | $79.87$ | $78.86$ |
| WBC | $94.7$ | $95.23$ | $95.25$ | $97.55$ | $99.11$ | $97.81$ |
| Colon | $87.46$ | $92.03$ | $92.52$ | $96.57$ | $85.00$ | $90.00$ |

Details of Datasets

| Dataset | num. of samples | num. of features | num. of classes |
|---|---|---|---|
| Leukemia_3c | 72 | 7129 | 3 |
| Leukemia_4a | 72 | 7129 | 4 |
| DNA | 2000 | 180 | 3 |
| Vowel | 528 | 10 | 11 |
| Lung | 32 | 56 | 3 |
| Letter | 15000 | 16 | 26 |