
A novel approach using incremental under sampling for data stream mining

Abstract
  • Data stream mining has become very popular in recent years, as advanced electronic devices generate continuous data streams. The performance of standard learning algorithms is compromised by the imbalanced nature of real-world data streams. In this paper, we propose an algorithm, Incremental Under Sampling for Data Streams (IUSDS), which uses a unique under-sampling technique to almost balance the data sets and thereby minimize the effect of imbalance on the stream mining process. The experimental analysis suggests that the proposed algorithm improves knowledge discovery over benchmark algorithms such as C4.5 and Hoeffding tree in terms of standard performance measures, namely accuracy, AUC, precision, recall, F-measure, TP rate, FP rate, and TN rate.

    Mathematics Subject Classification: Primary: 68T30; Secondary: 68T05.
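The abstract's per-chunk under-sampling idea can be sketched in Python. The exact IUSDS sampling rule is only described in the full text, so the function below is a generic sketch of random under-sampling of a chunk's majority class (the function name, `label_fn`, and `ratio` parameter are all hypothetical):

```python
import random

def undersample_chunk(chunk, label_fn, ratio=1.0, seed=42):
    """Randomly drop majority-class examples from one stream chunk
    until majority:minority is at most `ratio` (1.0 = almost balanced).
    Generic sketch only -- the exact IUSDS rule is in the full text."""
    pos = [x for x in chunk if label_fn(x) == 1]
    neg = [x for x in chunk if label_fn(x) == 0]
    maj, mino = (neg, pos) if len(neg) >= len(pos) else (pos, neg)
    keep = min(len(maj), int(ratio * len(mino)))
    sampled = random.Random(seed).sample(maj, keep)
    return sampled + mino

# Chunk 1 of Table 2 (dataset B1): 201 majority, 85 minority.
balanced = undersample_chunk([0] * 201 + [1] * 85, label_fn=lambda x: x)
# After sampling, the chunk holds 85 majority + 85 minority examples.
```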

  • Figure 1.  Trends in TN Rate for C4.5, Hoeffding Tree versus IUSDS on data stream

    Figure 2.  Trends in FP Rate for C4.5, Hoeffding Tree versus IUSDS on data stream

    Table 1.  Details of the Dataset

    S. No. Dataset Symbol Instances Majority Minority IR
    1. Breast-cancer B1 286 201 85 2.36
    2. Breast-w B2 699 458 241 1.90
    3. Colic C1 368 232 136 1.71
    4. Credit-g C2 1,000 700 300 2.33
    5. Diabetes D1 768 500 268 1.87
    6. Heart-c H1 303 165 138 1.19
    7. Heart-h H2 294 188 106 1.77
    8. Heart-stat H3 270 150 120 1.25
    9. Hepatitis H4 155 123 32 3.85
    10. Ionosphere I1 351 225 126 1.79
    11. Kr-vs-kp K1 3196 1669 1527 1.09
    12. Labor L1 57 37 20 1.85
    13. Mushroom M1 8124 4208 3916 1.08
    14. Sick S1 3772 3541 231 15.32
    15. Sonar S2 208 111 97 1.15

    Table 2.  Data Stream Description

    Dataset Instances Majority Minority IR
    Chunk 1:{B1} 286 201 85 2.36
    Chunk 2:{B1, B2} 985 659 326 2.02
    Chunk 3:{B1, B2, C1} 1353 891 462 1.92
    Chunk 4:{B1, B2, C1, C2} 2353 1591 1062 1.49
    Chunk 5:{B1, B2, C1, C2, D1} 3121 2091 1325 1.57
    Chunk 6:{B1, B2, C1, C2, D1, H1} 3424 2256 1463 1.52
    Chunk 7:{B1, B2, C1, C2, D1, H1, H2} 3718 2444 1569 1.55
    Chunk 8:{B1, B2, C1, C2, D1, H1, H2, H3} 3988 2594 1689 1.53
    Chunk 9:{B1, B2, C1, C2, D1, H1, H2, H3, H4} 4143 2717 1721 1.57
    Chunk 10:{B1, B2, C1, C2, D1, H1, H2, H3, H4, I1} 4494 2942 1847 1.59
    Chunk 11:{B1, B2, C1, C2, D1, H1, H2, H3, H4, I1, K1} 7690 4611 3374 1.36
    Chunk 12:{B1, B2, C1, C2, D1, H1, H2, H3, H4, I1, K1, L1} 7747 4648 3394 1.36
    Chunk 13:{B1, B2, C1, C2, D1, H1, H2, H3, H4, I1, K1, L1, M1} 15871 8856 7310 1.21
    Chunk 14:{B1, B2, C1, C2, D1, H1, H2, H3, H4, I1, K1, L1, M1, S1} 19643 12397 7541 1.64
    Chunk 15:{B1, B2, C1, C2, D1, H1, H2, H3, H4, I1, K1, L1, M1, S1, S2} 19851 12508 7638 1.63
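Table 2 is built by cumulatively merging the Table 1 datasets into growing chunks, with IR recomputed as majority over minority at each step. A small sketch of that construction (function name hypothetical; counts taken from Table 1, first three datasets shown):

```python
# (instances, majority, minority) per dataset, from Table 1.
DATASETS = [("B1", 286, 201, 85), ("B2", 699, 458, 241), ("C1", 368, 232, 136)]

def cumulative_chunks(datasets):
    """Yield Table 2-style rows: each chunk is the union of all
    datasets seen so far, with IR = majority / minority."""
    names, inst, maj, mino = [], 0, 0, 0
    for name, n, ma, mi in datasets:
        names.append(name)
        inst, maj, mino = inst + n, maj + ma, mino + mi
        yield ("{" + ", ".join(names) + "}", inst, maj, mino, round(maj / mino, 2))

for row in cumulative_chunks(DATASETS):
    print(row)  # e.g. ('{B1, B2}', 985, 659, 326, 2.02)
```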

    Table 3.  Average TN Rate for IUSDS versus C4.5 and Hoeffding Tree during the 11 time stamps after each status change for chunk-by-chunk learning

    Chunk no C4.5 HoeffdingTree IUSDS
    Chunk 1 (maj=201; min=85) 0.260$\bullet$ 0.395$\bullet$ 0.325
    Chunk 2 (maj=659; min=326) 0.596$\bullet$ 0.685$\bullet$ 0.652
    Chunk 3 (maj=891; min=462) 0.636$\bullet$ 0.693$\bullet$ 0.689
    Chunk 4 (maj=1591; min=762) 0.577$\bullet$ 0.642$\bullet$ 0.624
    Chunk 5 (maj=2091; min=1030) 0.582$\bullet$ 0.634$\bullet$ 0.638
    Chunk 6 (maj=2214; min=1062) 0.635$\bullet$ 0.674$\bullet$ 0.685
    Chunk 7 (maj=2439; min=1188) 0.679$\bullet$ 0.707$\bullet$ 0.723
    Chunk 8 (maj=2476; min=1208) 0.702$\bullet$ 0.738$\bullet$ 0.740
    Chunk 9 (maj=6017; min=1438) 0.721$\bullet$ 0.667$\bullet$ 0.759
    Chunk 10 (maj=6128; min=1536) 0.724$\bullet$ 0.657$\bullet$ 0.757
    Chunk 11 (maj=6395; min=1704) 0.745$\bullet$ 0.684$\bullet$ 0.778
    $\bullet$ Filled dot indicates a win for IUSDS; $\circ$ empty dot indicates a loss for IUSDS

    Table 4.  Average Accuracy for IUSDS versus C4.5 and Hoeffding Tree during the 11 time stamps after each status change for chunk-by-chunk learning

    Chunk no C4.5 HoeffdingTree IUSDS
    Chunk 1 (maj=201; min=85) 74.28$\circ$ 72.18$\circ$ 71.73
    Chunk 2 (maj=659; min=326) 84.64$\bullet$ 84.09$\bullet$ 84.94
    Chunk 3 (maj=891; min=462) 84.81$\circ$ 81.90$\bullet$ 84.03
    Chunk 4 (maj=1591; min=762) 81.42$\circ$ 80.19$\circ$ 79.10
    Chunk 5 (maj=2091; min=1030) 80.04$\circ$ 79.30$\circ$ 79.26
    Chunk 6 (maj=2214; min=1062) 79.90$\circ$ 79.86$\circ$ 79.59
    Chunk 7 (maj=2439; min=1188) 81.31$\circ$ 81.06$\bullet$ 81.23
    Chunk 8 (maj=2476; min=1208) 80.97$\bullet$ 82.15$\circ$ 81.01
    Chunk 9 (maj=6017; min=1438) 82.94 83.44$\circ$ 82.94
    Chunk 10 (maj=6128; min=1536) 82.01$\bullet$ 81.92$\bullet$ 82.17
    Chunk 11 (maj=6395; min=1704) 83.33$\bullet$ 83.11$\bullet$ 83.52
    $\bullet$ Filled dot indicates a win for IUSDS; $\circ$ empty dot indicates a loss for IUSDS

    Table 5.  Average FP Rate for IUSDS versus C4.5 and Hoeffding Tree during the 11 time stamps after each status change for chunk-by-chunk learning

    Chunk no C4.5 HoeffdingTree IUSDS
    Chunk 1 (maj=201; min=85) 0.740$\bullet$ 0.605$\bullet$ 0.675
    Chunk 2 (maj=659; min=326) 0.404$\bullet$ 0.315$\bullet$ 0.348
    Chunk 3 (maj=891; min=462) 0.364$\bullet$ 0.307$\bullet$ 0.311
    Chunk 4 (maj=1591; min=762) 0.423$\bullet$ 0.358$\bullet$ 0.376
    Chunk 5 (maj=2091; min=1030) 0.418$\bullet$ 0.366$\bullet$ 0.362
    Chunk 6 (maj=2214; min=1062) 0.365$\bullet$ 0.326$\bullet$ 0.315
    Chunk 7 (maj=2439; min=1188) 0.321$\bullet$ 0.293$\bullet$ 0.277
    Chunk 8 (maj=2476; min=1208) 0.298$\bullet$ 0.262$\bullet$ 0.260
    Chunk 9 (maj=6017; min=1438) 0.279$\bullet$ 0.333$\bullet$ 0.241
    Chunk 10 (maj=6128; min=1536) 0.276$\bullet$ 0.343$\bullet$ 0.243
    Chunk 11 (maj=6395; min=1704) 0.255$\bullet$ 0.316$\bullet$ 0.222
    $\bullet$ Filled dot indicates a win for IUSDS; $\circ$ empty dot indicates a loss for IUSDS

    Table 6.  Average AUC for IUSDS versus C4.5 and Hoeffding Tree during the 11 time stamps after each status change for chunk-by-chunk learning

    Chunk no C4.5 HoeffdingTree IUSDS
    Chunk 1 (maj=201; min=85) 0.606$\bullet$ 0.683$\circ$ 0.637
    Chunk 2 (maj=659; min=326) 0.782$\bullet$ 0.836$\circ$ 0.812
    Chunk 3 (maj=891; min=462) 0.802$\bullet$ 0.832$\bullet$ 0.833
    Chunk 4 (maj=1591; min=762) 0.764$\bullet$ 0.820$\circ$ 0.777
    Chunk 5 (maj=2091; min=1030) 0.761$\bullet$ 0.818$\circ$ 0.787
    Chunk 6 (maj=2214; min=1062) 0.746$\bullet$ 0.819$\circ$ 0.775
    Chunk 7 (maj=2439; min=1188) 0.766$\bullet$ 0.836$\circ$ 0.795
    Chunk 8 (maj=2476; min=1208) 0.761$\bullet$ 0.845$\circ$ 0.791
    Chunk 9 (maj=6017; min=1438) 0.782$\bullet$ 0.813$\circ$ 0.810
    Chunk 10 (maj=6128; min=1536) 0.779$\bullet$ 0.812$\circ$ 0.806
    Chunk 11 (maj=6395; min=1704) 0.798$\bullet$ 0.826$\circ$ 0.821
    $\bullet$ Filled dot indicates a win for IUSDS; $\circ$ empty dot indicates a loss for IUSDS

    Table 7.  Average Precision for IUSDS versus C4.5 and Hoeffding Tree during the 11 time stamps after each status change for chunk-by-chunk learning

    Chunk no C4.5 HoeffdingTree IUSDS
    Chunk 1 (maj=201; min=85) 0.753$\circ$ 0.774$\bullet$ 0.736
    Chunk 2 (maj=659; min=326) 0.859$\bullet$ 0.881$\bullet$ 0.861
    Chunk 3 (maj=891; min=462) 0.856$\bullet$ 0.866$\bullet$ 0.836
    Chunk 4 (maj=1591; min=762) 0.834$\bullet$ 0.849$\bullet$ 0.800
    Chunk 5 (maj=2091; min=1030) 0.827$\bullet$ 0.839$\bullet$ 0.802
    Chunk 6 (maj=2214; min=1062) 0.774$\bullet$ 0.795 0.785
    Chunk 7 (maj=2439; min=1188) 0.791$\bullet$ 0.802$\circ$ 0.804
    Chunk 8 (maj=2476; min=1208) 0.779$\bullet$ 0.811$\circ$ 0.798
    Chunk 9 (maj=6017; min=1438) 0.803$\bullet$ 0.826$\bullet$ 0.819
    Chunk 10 (maj=6128; min=1536) 0.795$\bullet$ 0.807$\bullet$ 0.813
    Chunk 11 (maj=6395; min=1704) 0.811$\bullet$ 0.822$\bullet$ 0.828
    $\bullet$ Filled dot indicates a win for IUSDS; $\circ$ empty dot indicates a loss for IUSDS

    Table 8.  Average Recall for IUSDS versus C4.5 and Hoeffding Tree during the 11 time stamps after each status change for chunk-by-chunk learning

    Chunk no C4.5 HoeffdingTree IUSDS
    Chunk 1 (maj=201; min=85) 0.947$\circ$ 0.860$\bullet$ 0.909
    Chunk 2 (maj=659; min=326) 0.953$\circ$ 0.906$\bullet$ 0.946
    Chunk 3 (maj=891; min=462) 0.946$\circ$ 0.875$\bullet$ 0.925
    Chunk 4 (maj=1591; min=762) 0.921 0.872$\bullet$ 0.888
    Chunk 5 (maj=2091; min=1030) 0.901$\bullet$ 0.866$\bullet$ 0.884
    Chunk 6 (maj=2214; min=1062) 0.813$\bullet$ 0.828$\bullet$ 0.820
    Chunk 7 (maj=2439; min=1188) 0.814 0.830$\bullet$ 0.824
    Chunk 8 (maj=2476; min=1208) 0.793 0.826$\bullet$ 0.808
    Chunk 9 (maj=6017; min=1438) 0.815$\bullet$ 0.844$\bullet$ 0.828
    Chunk 10 (maj=6128; min=1536) 0.806$\bullet$ 0.841$\bullet$ 0.822
    Chunk 11 (maj=6395; min=1704) 0.821$\bullet$ 0.851$\bullet$ 0.833
    $\bullet$ Filled dot indicates a win for IUSDS; $\circ$ empty dot indicates a loss for IUSDS

    Table 9.  Average F-measure for IUSDS versus C4.5 and Hoeffding Tree during the 11 time stamps after each status change for chunk-by-chunk learning

    Chunk no C4.5 HoeffdingTree IUSDS
    Chunk 1 (maj=201; min=85) 0.838$\circ$ 0.812$\bullet$ 0.812
    Chunk 2 (maj=659; min=326) 0.900$\bullet$ 0.890$\bullet$ 0.898
    Chunk 3 (maj=891; min=462) 0.896 0.867$\bullet$ 0.874
    Chunk 4 (maj=1591; min=762) 0.873$\bullet$ 0.857$\bullet$ 0.838
    Chunk 5 (maj=2091; min=1030) 0.860$\bullet$ 0.849$\bullet$ 0.838
    Chunk 6 (maj=2214; min=1062) 0.785$\bullet$ 0.803$\bullet$ 0.791
    Chunk 7 (maj=2439; min=1188) 0.794$\bullet$ 0.808$\bullet$ 0.804
    Chunk 8 (maj=2476; min=1208) 0.774$\bullet$ 0.810$\circ$ 0.790
    Chunk 9 (maj=6017; min=1438) 0.798$\bullet$ 0.827$\bullet$ 0.813
    Chunk 10 (maj=6128; min=1536) 0.790$\bullet$ 0.815$\bullet$ 0.807
    Chunk 11 (maj=6395; min=1704) 0.807$\bullet$ 0.828$\bullet$ 0.821
    $\bullet$ Filled dot indicates a win for IUSDS; $\circ$ empty dot indicates a loss for IUSDS

    Table 10.  Average FN Rate for IUSDS versus C4.5 and Hoeffding Tree during the 11 time stamps after each status change for chunk-by-chunk learning

    Chunk no C4.5 HoeffdingTree IUSDS
    Chunk 1 (maj=201; min=85) 0.053$\circ$ 0.140$\circ$ 0.091
    Chunk 2 (maj=659; min=326) 0.047$\circ$ 0.094$\circ$ 0.054
    Chunk 3 (maj=891; min=462) 0.054$\circ$ 0.125$\bullet$ 0.075
    Chunk 4 (maj=1591; min=762) 0.079 0.128$\bullet$ 0.112
    Chunk 5 (maj=2091; min=1030) 0.099$\circ$ 0.134$\bullet$ 0.116
    Chunk 6 (maj=2214; min=1062) 0.187$\circ$ 0.172$\bullet$ 0.180
    Chunk 7 (maj=2439; min=1188) 0.186 0.170$\bullet$ 0.176
    Chunk 8 (maj=2476; min=1208) 0.207 0.174$\bullet$ 0.192
    Chunk 9 (maj=6017; min=1438) 0.185$\circ$ 0.156$\bullet$ 0.172
    Chunk 10 (maj=6128; min=1536) 0.194$\circ$ 0.159$\bullet$ 0.178
    Chunk 11 (maj=6395; min=1704) 0.179$\circ$ 0.149$\circ$ 0.167
    $\bullet$ Filled dot indicates a win for IUSDS; $\circ$ empty dot indicates a loss for IUSDS

    Table 11.  Summary of experimental results for IUSDS

    Results Systems Wins Ties Losses
    TN Rate IUSDS vs. C4.5 11 0 0
    IUSDS vs. HoeffdingTree 11 0 0
    Accuracy IUSDS vs. C4.5 4 1 6
    IUSDS vs. HoeffdingTree 5 0 6
    FP Rate IUSDS vs. C4.5 11 0 0
    IUSDS vs. HoeffdingTree 11 0 0
    AUC IUSDS vs. C4.5 11 0 0
    IUSDS vs. HoeffdingTree 1 0 10
    Precision IUSDS vs. C4.5 10 0 1
    IUSDS vs. HoeffdingTree 8 1 2
    Recall IUSDS vs. C4.5 5 3 3
    IUSDS vs. HoeffdingTree 11 0 0
    F-measure IUSDS vs. C4.5 9 1 1
    IUSDS vs. HoeffdingTree 10 0 1
    FN Rate IUSDS vs. C4.5 5 3 3
    IUSDS vs. HoeffdingTree 11 0 0
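Every measure reported in Tables 3 through 10 except AUC can be derived from a single confusion matrix; AUC additionally requires ranked classifier scores. A minimal sketch of the derivations (function name hypothetical):

```python
def confusion_metrics(tp, fp, tn, fn):
    """Measures from Tables 3-10, derived from one confusion matrix.
    (AUC needs ranked scores, so it cannot be computed from counts alone.)"""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also the TP rate
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
        "tp_rate": recall,
        "tn_rate": tn / (tn + fp),
        "fp_rate": fp / (fp + tn),
        "fn_rate": fn / (fn + tp),
    }
```

Note that the TN rate and FP rate always sum to one, which is why Figures 1 and 2 show mirrored trends.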
