Privacy preserving feature selection and Multiclass Classification for horizontally distributed data

  • * Corresponding author: Meng Han

  • Over the last two decades, many scientific fields have seen rapid growth in data volume and complexity, which brings data miners both opportunities and challenges. With the advent of the big data era, applying data mining techniques to data assembled from multiple parties (or sources) has become a leading trend. However, such data mining tasks may divulge individuals' private information, which has raised concerns about privacy preservation. In this work, a privacy-preserving feature selection method (PPFS-IFW) and a privacy-preserving multiclass classification method (PPM2C) are proposed. Both PPFS-IFW and PPM2C were tested on six benchmark datasets to validate their performance. The results demonstrate that PPFS-IFW improves classification accuracy by selecting informative features. PPFS-IFW not only preserves private information but also outperforms some other state-of-the-art feature selection approaches. Experimental results also show that the proposed PPM2C method is workable and stable. In particular, it reduces the risk of over-fitting compared with the regular Support Vector Machine. Meanwhile, by employing the Secure Sum Protocol to encrypt data at the bottom layer, users' privacy is preserved.

    Mathematics Subject Classification: Primary: 58F15, 58F17; Secondary: 53C35.
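The abstract names the Secure Sum Protocol but this excerpt does not describe the variant used. As a rough illustration only, the sketch below simulates the classic ring-based secure sum in a single process: the initiating party adds a random mask to its value, each later party adds its own value to the running total, and the initiator removes the mask at the end. The function name `secure_sum`, the constant `MODULUS`, and the restriction to non-negative integers are assumptions made for this example, not details taken from the paper.

```python
import secrets

# Assumed public modulus; it must exceed any possible true total.
MODULUS = 2**61 - 1


def secure_sum(private_values, modulus=MODULUS):
    """Simulate a ring-based secure sum over non-negative integer inputs.

    Each element of private_values stands for the secret held by one party.
    Intermediate totals are masked, so a party that sees only the running
    value passed to it learns nothing about the inputs before it.
    """
    if not private_values:
        raise ValueError("at least one party is required")

    # The initiator draws a secret random mask and hides its own value with it.
    mask = secrets.randbelow(modulus)
    running = (mask + private_values[0]) % modulus

    # Every other party adds its private value to the masked running total.
    for value in private_values[1:]:
        running = (running + value) % modulus

    # Back at the initiator, removing the mask reveals only the total.
    return (running - mask) % modulus


total = secure_sum([3, 5, 7])
```

In a horizontally distributed setting, quantities such as per-party counts or partial sums could be aggregated this way; real-valued inputs would need a fixed-point encoding first, and this simple ring variant does not protect against colluding neighbours.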


  • Figure 1.  Workflow of PPM2C

    Figure 2.  Classification accuracy improved by PPFS-IFW under CV1 scenario

    Figure 3.  Classification accuracy improved by PPFS-IFW under CV2 scenario

    Figure 4.  Classification accuracy comparison before and after feature selection (PPFS-IFW)

    Figure 5.  Comparison of classification accuracy for PPM2C when using PAN-SVM and LIBSVM

    Figure 6.  Classification accuracy of PrivacySVM under CV1 and CV2

    Figure 7.  Classification accuracy of LIBSVM under CV1 and CV2

    Figure 8.  Classification accuracy of PrivacySVM under CV1

    Figure 9.  Classification accuracy of PrivacySVM under CV2

    Table 1.  Details of Datasets used in Evaluation of PPFS-IFW

    Dataset               num. samples   num. features   C       $\gamma$
    Diabetes (DIA)        768            8               512.0   0.0078125
    Ionosphere            351            34              8.0     0.5
    Colon                 62             2000            32.0    0.0078125
    Leukemia              72             7129            128.0   0.0001221
    Lymphoma (DLBCL)      47             4026            2.0     0.0078125
    Breast Cancer (WBC)   569            30              128.0   8.0
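The C and $\gamma$ values in Table 1 all sit on (or round to) powers of two, which is what an exponentially spaced grid search over SVM hyperparameters would produce. The excerpt does not say how the values were chosen, so that reading is an assumption. The short check below recovers the exponents; the dictionary `TABLE1_PARAMS` and the helper `nearest_power_of_two` are names introduced only for this illustration.

```python
import math

# (C, gamma) pairs copied from Table 1.
TABLE1_PARAMS = {
    "Diabetes (DIA)": (512.0, 0.0078125),
    "Ionosphere": (8.0, 0.5),
    "Colon": (32.0, 0.0078125),
    "Leukemia": (128.0, 0.0001221),
    "Lymphoma (DLBCL)": (2.0, 0.0078125),
    "Breast Cancer (WBC)": (128.0, 8.0),
}


def nearest_power_of_two(value):
    """Return the integer k for which 2**k is closest to value on a log scale."""
    return round(math.log2(value))


# Map each dataset to the exponents (log2 C, log2 gamma).
exponents = {
    name: (nearest_power_of_two(c), nearest_power_of_two(gamma))
    for name, (c, gamma) in TABLE1_PARAMS.items()
}

for name, (c_exp, gamma_exp) in exponents.items():
    print(f"{name}: C = 2^{c_exp}, gamma = 2^{gamma_exp}")
```

The Leukemia entry 0.0001221 is a rounded form of $2^{-13}$, so it is matched to the nearest exponent rather than tested for exact equality.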

    Table 2.  Accuracy improved under CV1 and CV2

    Dataset      Accuracy gain (CV2)   Accuracy gain (CV1)   CV1 num. of features   CV2 num. of features
    DIA          $3.39\%$              $2.10\%$              $4$                    $4$
    Ionosphere   $0.35\%$              $3.42\%$              $2$                    $8$
    Colon        $3.08\%$              $8.00\%$              $34$                   $157$
    WBC          $2.47\%$              $1.12\%$              $10$                   $4$
    DLBCL        $5.57\%$              $10.95\%$             $394$                  $444$
    Leukemia     $8.57\%$              $3.45\%$              $537$                  $631$
    Sum          $23.43\%$             $29.04\%$             $981$                  $1248$

    Table 3.  Accuracy comparison with other methods

    Dataset   Fisher SVM   FSV       RFE SVM   KP SVM    Ours (CV2)   Ours (CV1)
    DIA       $76.42$      $76.58$   $76.56$   $76.74$   $79.87$      $78.86$
    WBC       $94.7$       $95.23$   $95.25$   $97.55$   $99.11$      $97.81$
    Colon     $87.46$      $92.03$   $92.52$   $96.57$   $85.00$      $90.00$

    Table 4.  Details of Datasets

    Dataset       num. of samples   num. of features   num. of classes
    Leukemia_3c   72                7129               3
    Leukemia_4a   72                7129               4
    DNA           2000              180                3
    Vowel         528               10                 11
    Lung          32                56                 3
    Letter        15000             16                 26
