Selective further learning of hybrid ensemble for class imbalanced incremental learning

Incremental learning has been investigated by many researchers. However, only a few works have considered the situation where class imbalance occurs. In this paper, class imbalanced incremental learning is investigated and an ensemble-based method, named Selective Further Learning (SFL), is proposed. In SFL, a hybrid ensemble of Naive Bayes (NB) and Multilayer Perceptrons (MLPs) is employed. For the ensemble of MLPs, part of the MLPs are selected to learn from the new data set. Negative Correlation Learning (NCL) with Dynamic Sampling (DyS) for handling class imbalance is used as the basic training method. Besides, as an additive model, Naive Bayes is employed as an individual of the ensemble to learn the data sets incrementally. A group of weights (with the number of classes as its length) is updated for every individual of the ensemble to indicate the 'confidence' of the individual on each class. The ensemble combines all of the individuals by a weighted average according to the weights. Experiments on 3 synthetic data sets and 10 real-world data sets show that SFL is able to handle class imbalanced incremental learning and outperforms a recently proposed related approach.


1. Introduction. In normal machine learning problems, the learning model learns from all of the accumulated data, and all of the data are stored. In practice, however, the data are usually updated all the time, and new information needs to be learned from the new data [19]. Learning new information while accessing the previous data is usually time consuming, and storing the learned data is also expensive. In this situation, the learning model is required to be able to learn new information from new data and to preserve the previously learned information without accessing the previous data. Such a learning model is called incremental learning [8], [21].
In incremental learning, the whole data set is not available in a lump. In other words, only a part of the whole data set can be accessed at a time. We suppose that the whole data set S is divided into T subsets, i.e., S_1, S_2, ..., S_T. The rules (e.g., the classification boundaries in classification problems) of S and S_t are denoted as R and R_t, respectively. The aim of the learning model is to learn R by learning R_t from each S_t in turn. The main difficulty is that the previously learned rules may be forgotten when the model learns new rules from new data subsets, especially when the rules of different data subsets are different. This phenomenon is called catastrophic forgetting. If R_1 = R_2 = ... = R_T, the learning model can learn R_1 from S_1, and R_1 will not be forgotten when new data subsets are learned. In this case, incremental learning is not really challenging. In practice, however, R_t usually differs between data subsets, so catastrophic forgetting may happen.
In our assumption, even though the rules differ between data subsets, the target rules (i.e., R) do not change. This phenomenon is also called virtual concept drift [28]; it is different from real concept drift, in which the target concept changes as new data subsets become available. Virtual concept drift was called sampling shift in [22], and this term will be used in this paper. Some additive models can easily be adopted to learn incrementally when sampling shift occurs. For example, in Bayes Decision Theory, the rules can be represented by some parameters, and the parameters of the whole data set can be combined from those of all the data subsets. In this way, the model can learn the data subsets one by one and still form the same learner that would be obtained by learning the whole data set. However, these kinds of methods often require assumptions about the data distribution, and the resulting decision boundaries are usually simple. Neural networks have a strong ability to learn complex classification boundaries. Unfortunately, they are not additive. After training with new data subsets, the model tends to perform well on the new data subsets but poorly on the previous ones [8]. In other words, the model forgets the previously learned rules. Therefore, it is a challenge to employ neural networks to learn incrementally in this situation.
To exploit neural networks for incremental learning, some ensemble-based approaches have been proposed. In our previous work, i.e., Selective Negative Correlation Learning (SNCL) [26], a selective ensemble method was employed to prevent the model from forgetting previously learned information. There are also other ensemble-based methods for incremental learning, such as Fixed Size Negative Correlation Learning (FSNCL), Growing Negative Correlation Learning (GNCL) [15], and the Learn++ methods [21], [17], [5]. In SNCL and FSNCL (size-fixed methods), the model is able to learn new information from new data subsets while the size of the model stays fixed. However, their ability to preserve previously learned information is not as good as that of the Learn++ methods and GNCL (size-grown methods), in which the size of the model grows larger as new data subsets are learned. Since new data subsets keep becoming available in practice, the sizes of the models in the Learn++ methods and GNCL will eventually become too large. Therefore, it is worthwhile to design a method with the benefits of both size-fixed and size-grown methods.
Besides sampling shift, there is another issue in incremental learning, i.e., class imbalance. In normal learning models, the class imbalance problem has been studied by many researchers, and there is plenty of literature addressing class imbalance problems [11], [4], [9]. Class imbalance may also occur in incremental learning, and this kind of issue has also been investigated [5], [6]. There are mainly two cases of class imbalanced incremental learning:
(1) If the class distribution of the whole data set S is imbalanced, the class distribution of a data subset S_t will usually be imbalanced as well. Furthermore, it will be common for samples of the minority classes to be missing from some data subsets.
(2) Even though the class distribution of S is balanced, S_t may still be class imbalanced. In a typical case, all of the partial sets S_t are class imbalanced while the combined data set S is class balanced.
In this paper, we focus on class imbalance cases in which sampling shift also occurs. Specifically, when sampling shift occurs, new classes may come up in the new data subset, and some previous classes may be missing from it. When the class distribution of the whole data set is imbalanced, this is more likely to happen to the minority classes. This is the main issue addressed in this paper.
The rest of the paper is organized as follows. In Section 2, we briefly review some existing methods for incremental learning. Our method, i.e., Selective Further Learning (SFL), is described in Section 3. Section 4 presents the experimental studies. Finally, we conclude the paper and discuss future work in Section 5.
2. Related work. Some neural network based methods, such as the Adaptive Resonance Theory modules map (ARTMAP) [3], [23], [29], [2] and the Evolving Fuzzy Neural Network (EFuNN) [12], have been proposed for incremental learning. Both ARTMAP and EFuNN can learn new rules by changing the architecture of the model, such as self-organizing new clusters (in ARTMAP) or creating new neurons (in EFuNN) when new data are sufficiently different from previous data. However, it is usually a non-trivial task to estimate the difference between new data and previous data. Moreover, both of them are very sensitive to the parameters of the algorithms.
Research has shown the good performance of ARTMAP and EFuNN in incremental learning. However, their abilities to learn incrementally in class imbalance situations have not been well investigated. In [5], where Learn++.UDNC was proposed for class imbalanced incremental learning, fuzzy ARTMAP [2] was shown to perform poorly when class imbalance occurs. Learn++.UDNC is an ensemble-based method. It is one of the Learn++ family of methods [21], [20], which are based on AdaBoost [7]. Besides Learn++.UDNC, many versions of Learn++ have been proposed, such as Learn++.MT [16], Learn++.MT2 [16], Learn++.NC [18] and Learn++.SMOTE [6]. Among these versions, Learn++.MT and Learn++.NC were proposed for handling the out-voting problem when learning new classes, and Learn++.MT2 was proposed for handling the imbalance of examples between data subsets; these versions did not consider class imbalance situations. Class imbalance in incremental learning was addressed only in Learn++.UDNC and Learn++.SMOTE. In Learn++.UDNC, it was assumed that no real concept drift happens, while in Learn++.SMOTE, real concept drift in class imbalanced data was investigated. Therefore, the former matches the issue in this paper but the latter does not.
Besides Learn++, another type of ensemble-based method, i.e., methods based on Negative Correlation Learning (NCL) [14], has also been proposed for incremental learning [26], [15]. NCL is a method to construct neural network ensembles. It is capable of improving the generalization performance of the ensemble by simultaneously decreasing the error of every neural network and increasing the diversity between the neural networks. In [15], two NCL-based methods, i.e., FSNCL and GNCL, were proposed. In FSNCL, the size of the ensemble is fixed and all of the neural networks are trained when new data subsets become available. In GNCL, the size of the ensemble grows as the data sets are incrementally learned, and only the newly added neural networks are trained when new data subsets become available. In our previous work [26], SNCL was proposed. In SNCL, new neural networks are added and trained when new data subsets become available, and then a pruning method is employed to prune the ensemble so that its size remains fixed. Compared with the Learn++ methods and GNCL, FSNCL and SNCL keep the size of the ensemble fixed as more and more data sets come up, but their abilities to preserve previously learned information are poorer.
There are also some other methods with incremental learning ability. The Self-Organizing Neural Grove (SONG) [10] is an ensemble-based method with Self-Generating Neural Networks (SGNNs) [27] as the individual learners. Incremental Backpropagation Learning Networks (IBPLN) [8] employed neural networks for incremental learning by bounding the weights of the neural network and adding new nodes. However, these methods did not consider class imbalance in incremental learning.

3. Selective Further Learning.
3.1. Framework. In this paper, class imbalance is considered in incremental learning. In the existing work, Learn++.UDNC [5] was proposed for addressing this issue, and it has been shown to be more effective than other incremental learning methods that did not consider the class imbalance situation. However, as a size-grown method, the size of the ensemble in Learn++.UDNC keeps increasing as new data sets become available, and the ensemble may become too large. Our method is also ensemble based, and at the same time we aim at keeping the size of the ensemble at an acceptable level.
In our previous work, i.e., SNCL [26], selective ensemble was used to keep the size of the ensemble fixed. When a new data subset comes up, it is used to train a copy of the previous ensemble. The two ensembles are then combined, and half of the individuals are pruned to keep the size of the ensemble fixed. However, in this model, loss of previous information may easily occur because the pruning process is based on the latest data subset, so the ensemble will be biased towards it. Furthermore, if the rules of the latest data subset are quite different from those of the previous data subsets, i.e., high sampling shift occurs, all of the individuals of the previous ensemble might be pruned. In addition, since SNCL was designed without considering class imbalance, it might not be good at handling class imbalanced incremental problems.
To overcome the above drawbacks, we propose a new ensemble-based approach for incremental learning, i.e., Selective Further Learning (SFL). In SFL, a hybrid ensemble with two kinds of base classifiers is used. First of all, a group of Multi-Layer Perceptrons (MLPs) is used. When a new data subset becomes available, half of the MLPs in the current ensemble are selected to be trained with it. After training, the selected MLPs are placed back into the ensemble. No pruning process is executed, so the risk of forgetting previous information is reduced. At the same time, as an additive model, Naive Bayes (NB) is used as a component of the ensemble to incrementally learn from new data subsets. In this way, the strong incremental learning ability of NB helps the ensemble preserve the previous information if high sampling shift occurs.
In addition, a group of weights (namely, impact weights) is constructed for every individual (including the MLPs and NB). The weights and the outputs of the ith individual are denoted as {w_ik | k = 1, 2, ..., C} and {o_ik | k = 1, 2, ..., C}, respectively, where C is the number of classes. The impact weight w_ik is designed to indicate the confidence of the output produced by the ith individual on class k. At the testing stage, for an example, the output of the ensemble is calculated by the weighted average over all individuals:

y_k = ( Σ_{i=1}^{M} w_ik o_ik ) / ( Σ_{i=1}^{M} w_ik ),    (1)

where y_k is the kth output of the ensemble and indicates the probability that the example belongs to class k, and M is the number of individuals in the ensemble. Equation (1) is used only at the testing stage. At the training stage, the output of the ensemble is calculated as the arithmetical average of the individuals. w_ik is initialized to 0 at the initial stage and updated while learning every new data subset. When updating w_ik, two issues should be considered.
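As an illustration, the weighted-average combination used at the testing stage can be sketched as follows. This is a minimal sketch: the array shapes and the normalisation by the sum of weights are our assumptions, since the text only states that the outputs are combined by a weighted average according to the impact weights.

```python
import numpy as np

def ensemble_output(outputs, weights):
    """Weighted-average combination of individual outputs (testing stage).

    outputs: (M, C) array, outputs[i][k] = output o_ik of individual i on class k.
    weights: (M, C) array, weights[i][k] = impact weight w_ik.
    """
    outputs = np.asarray(outputs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    num = (weights * outputs).sum(axis=0)   # sum_i w_ik * o_ik
    den = weights.sum(axis=0)               # sum_i w_ik
    # fall back to the plain average for a class no individual is confident on
    return np.where(den > 0, num / np.maximum(den, 1e-12), outputs.mean(axis=0))

# toy usage: 2 individuals, 3 classes; the first individual has zero
# confidence on class 2, so only the second one contributes to it
o = [[0.9, 0.05, 0.05], [0.2, 0.7, 0.1]]
w = [[1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
y = ensemble_output(o, w)
```

The predicted class is then simply the index of the largest y_k.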
On the one hand, the degree to which the ith individual has learned the kth class is considered, i.e., the recall and the precision of the ith individual on class k. w_ik should be high when both the recall and the precision are high. To this end, the definition of the F-measure for multi-class problems [24] is introduced:

F_ik = 2 R_ik P_ik / (R_ik + P_ik),    (2)

where F_ik, R_ik and P_ik are the F-measure, recall and precision of the ith individual on class k, respectively. According to [24], R_ik and P_ik are defined as

R_ik = N_kk / Σ_{m=1}^{C} N_km,    (3)

P_ik = N_kk / Σ_{m=1}^{C} N_mk,    (4)

where N_km is the number of examples of class k that were classified as class m by the ith individual.
On the other hand, since the MLPs can easily be biased towards the latest data subset, if some classes from the previous data subsets do not come up in the new data subset, the outputs of the MLPs that are selected to be trained with the new data subset should be treated with suspicion. Therefore, a coefficient µ_i is defined for every individual i to degrade the impact weights:

µ_i = nt / nc,    (5)

where nt is the number of classes contained in the new data subset and nc is the number of classes in all the data subsets that have come up so far. Considering both of the above issues, w_ik is updated as:

w_ik = µ_i F_ik.    (6)

For the NB model, µ_i always equals 1, since NB is not biased towards the latest data subset. For the MLPs, µ_i is updated once the MLP is selected to be trained. For the MLPs that were not selected to be trained, and for the NB model, the counts N_km from the new data subset can be accumulated onto the previous ones to update R_ik and P_ik and then update w_ik. In this way, w_ik is not updated according to the current data subset only, which helps to preserve previous information.
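The impact-weight update for one individual can be sketched as follows. This is our reading of the text and not code from the paper: the confusion-count matrix N, the multi-class recall/precision/F-measure, and the degrading coefficient taken as the ratio of classes present in the new subset to all classes seen so far are all stated assumptions.

```python
import numpy as np

def update_impact_weights(N, classes_in_subset, classes_seen):
    """Sketch of the impact-weight update for one individual.

    N: (C, C) matrix of confusion counts, N[k][m] = number of examples of
       class k that the individual classified as class m (accumulated over
       data subsets for individuals that were not retrained).
    classes_in_subset / classes_seen: sets of class labels, used for the
       degrading coefficient mu = nt / nc (an assumption; mu is 1 for NB).
    """
    N = np.asarray(N, dtype=float)
    row = N.sum(axis=1)   # examples per true class
    col = N.sum(axis=0)   # examples per predicted class
    recall = np.where(row > 0, np.diag(N) / np.maximum(row, 1e-12), 0.0)
    precision = np.where(col > 0, np.diag(N) / np.maximum(col, 1e-12), 0.0)
    f = np.where(recall + precision > 0,
                 2 * recall * precision / np.maximum(recall + precision, 1e-12),
                 0.0)
    mu = len(classes_in_subset) / len(classes_seen)  # degrading coefficient
    return mu * f   # w_ik for k = 1..C

# toy usage: class 3 is absent from the current subset (all-zero row/column)
N = [[8, 2, 0], [1, 9, 0], [0, 0, 0]]
w = update_impact_weights(N, classes_in_subset={0, 1}, classes_seen={0, 1, 2})
```

Individuals that never saw a class end up with w_ik = 0 for it, so they contribute nothing to that class in the weighted average.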
The pseudo-code of the approach is presented in Fig. 1. In the pseudo-code, Select is the process of selecting MLPs from the ensemble to be trained with the new data subset, while MLPs-Training and NB-Training are the training processes for the MLPs and NB in the ensemble, respectively. The details of these processes are described in the following subsections.

3.2. Some details inside SFL.
3.2.1. Selecting process. The selecting process is based on the current data subset S_t. The individuals are added to Ens_sel one by one using a greedy strategy. In each round, every MLP in Ens_res is temporarily added to Ens_sel to estimate the performance (i.e., the arithmetical mean of the F-measures over all classes) on the current data subset. The MLPs in Ens_res are tested one by one, and the MLP that makes Ens_sel perform the worst is finally added to Ens_sel. If the current data subset does not contain some classes that have appeared in the previous data subsets, the selection process should ensure that not all of the MLPs that have been trained with the data of the lost classes are added to Ens_sel. Therefore, when an MLP is added to Ens_sel, the following constraint should be satisfied by the MLPs remaining in Ens_res:

Σ_{i∈Ens_res} w_ik > 0 for every k ∈ L,    (7)

where L = {k | class k is not contained in S_t}, i.e., for every lost class, at least one MLP with nonzero confidence on that class must remain in Ens_res. If no MLP can be added to Ens_sel, a newly initialized MLP is generated and added to Ens_sel. In this way, the MLPs that are not well trained are selected to be further trained, while the MLPs reserved in Ens_res preserve the previously learned information.
3.2.2. Training the model of Naive Bayes. According to Bayes Decision Theory, the probability of a testing example x = {x_i | i = 1, 2, ..., d} belonging to class k is

P(k|x) = P(x|k) P(k) / Σ_{m=1}^{C} P(x|m) P(m),    (8)

where P(k|x) is the posterior probability of the example x belonging to class k, C is the number of classes and P(k) is the prior probability of class k. In the class imbalance situation, we assume that P(k) is equal for all of the classes. Besides, all the features of the examples are assumed to be independent of each other. Therefore, the probability in (8) becomes

P(k|x) = Π_{i=1}^{d} P(x_i|k) / Σ_{m=1}^{C} Π_{i=1}^{d} P(x_i|m).    (9)

In incremental learning mode, P(x_i|k) is updated as every new data subset comes up. P(x_i|k) can be estimated in the form n(x_i, k)/n(k), where n(x_i, k) is the number of examples that belong to class k and whose ith feature takes the value x_i, and n(k) is the number of examples that belong to class k. Both n(x_i, k) and n(k) can be counted in each data subset and then accumulated to estimate P(x_i|k). In this way, NB can learn from new data subsets without any loss of previous information.
The estimation of P(k|x) in (9) requires the values of the features to be discrete. Therefore, for features with continuous values, average partitioning (equal-width binning) is used to discretize the features before calculating P(x_i|k).
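The incremental counting scheme for NB can be sketched as follows, assuming discretised features. The Laplace smoothing is our addition to avoid zero probabilities; the equal class priors follow the text.

```python
from collections import defaultdict

class IncrementalNB:
    """Incremental Naive Bayes over discretised features (sketch).

    Stores the counts n(x_i, k) and n(k) and accumulates them as each new
    data subset arrives; the class priors P(k) are taken as equal, as the
    text assumes for the class-imbalance setting.
    """
    def __init__(self, n_features):
        self.d = n_features
        # n_xk[i][k][v] = count of examples of class k with feature i == v
        self.n_xk = [defaultdict(lambda: defaultdict(int))
                     for _ in range(n_features)]
        self.n_k = defaultdict(int)   # n(k)

    def partial_fit(self, X, y):
        """Accumulate counts from one data subset; never forgets old counts."""
        for x, k in zip(X, y):
            self.n_k[k] += 1
            for i, v in enumerate(x):
                self.n_xk[i][k][v] += 1

    def predict(self, x):
        best_k, best_p = None, -1.0
        for k, nk in self.n_k.items():
            p = 1.0
            for i, v in enumerate(x):
                # P(x_i | k) ~ (n(x_i, k) + 1) / (n(k) + |values|), smoothed
                vals = len(self.n_xk[i][k]) or 1
                p *= (self.n_xk[i][k][v] + 1) / (nk + vals)
            if p > best_p:
                best_k, best_p = k, p
        return best_k

nb = IncrementalNB(n_features=2)
nb.partial_fit([(0, 1), (0, 0)], [0, 0])   # first data subset
nb.partial_fit([(1, 1), (1, 0)], [1, 1])   # second subset: no previous loss
```

Because only counts are stored, learning a new subset is a pure accumulation and previous information is never overwritten.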

3.2.3. Training the ensemble of MLPs. We have proposed a Dynamic Sampling (DyS) method for class imbalance problems [13], which can be used for training the ensemble of MLPs. Similarly to the approach proposed in [13], the main process of DyS for an ensemble is as follows (in one epoch):
Step 1. Randomly fetch an example x from the training set.
Step 2. Estimate the probability p that the example should be used for updating the ensemble.
Step 3. Generate a uniform random real number µ between 0 and 1.
Step 4. If µ < p, use x to update the ensemble using Negative Correlation Learning (NCL) [14], which makes every MLP negatively correlated with the other individuals (including the MLPs in Ens_res and the NB model).
Step 5. Repeat Steps 1 to 4 until there is no example left in the training set.
The above steps are repeated until the stop criterion is satisfied. The following describes the method for estimating p, which is the main issue in DyS.
In a problem with nc classes, we set nc output nodes for all of the MLPs, and for an example belonging to class k, we set the target output of the example as t = {t_i | t_k = 1, t_j = 0 for j ≠ k}. The real output for the example is denoted as y = {y_i | i = 1, 2, ..., nc}, and the node with the highest output designates the class. Both the hidden node functions and the output node functions of all MLPs are set to the logistic function ϕ(x) = 1/(1 + e^{−x}), so that y_i ∈ (0, 1).
As in [13], the probability p that an example belonging to class k will be used to update the ensemble is estimated based on δ = y_k − max_{i≠k} {y_i}, the confidence of the current ensemble in correctly classifying the example; examples classified with high confidence are less likely to be used for updating. For more details of DyS, please refer to [13]. By employing DyS, the MLPs in the ensemble are able to accommodate class imbalance situations. After learning a new data subset from which some previous classes are missing, the selected MLPs will be biased towards the classes contained in the current data subset. However, their impact weights are degraded by the coefficient in (5). Therefore, NB and the MLPs that have learned the missing classes will play the leading role in the ensemble when making the prediction.
Besides, as discussed before, NB is able to learn incrementally without forgetting previous information. The use of NB helps to prevent the ensemble from catastrophic forgetting. Furthermore, class imbalance is considered in the training of both NB and the MLPs. Therefore, SFL is able to deal with class imbalance in the new data subsets.

4. Experimental study.
To assess the performance of SFL, synthetic data sets and real-world data sets were used in the experiments. First, three types of synthetic data sets were generated to simulate the incremental learning process. Then, 5 real-world data sets with imbalanced class distributions from the UCI repository [1] were used to simulate incremental learning by randomly dividing the data sets. Finally, another 5 real-world data sets from the UCI repository, including 3 class imbalanced data sets and 2 class balanced data sets, were used to simulate the incremental learning process by dividing the data sets. In this part, the division of the data sets took into account new classes and the loss of previous classes in the new data subsets. The purpose of this part of the experiment is to assess the ability of SFL to learn from new classes and to preserve previous information when some classes are lost in the new data subsets. As a recently proposed approach that also addresses class imbalanced incremental learning, Learn++.UDNC [5] was used for comparison. Besides, in order to find out the contributions of the MLPs and NB to SFL, the model of ensembles with only MLPs (referred to as SFL.MLP) and the model of NB alone are also compared with SFL. The recall of every class and the arithmetic mean over the recalls of all classes are used as the metrics.
4.1. Experiments on synthetic data sets. The synthetic data were generated as follows. Data from four 2-dimensional Normal distributions were generated for four classes. The means were µ_1 = (0, 0), µ_2 = (0, 1), µ_3 = (1, 1) and µ_4 = (1, 0); the two features are independent, with variances σ_1 = σ_2 = σ_3 = σ_4 = 0.2. Three types of synthetic data sets were generated. TABLE I presents the class distributions of every data subset for the three types. In Type A, there are three majority classes (classes 1 to 3) and one minority class (class 4). Class 4 comes up as a new class in S_2.
Classes 1 to 3 each appear as a minority class in training subsets S_3 to S_5, respectively. This experiment was conducted to examine the performance of SFL on problems with multiple majority classes (which sometimes appear as minority classes) and a single minority class (which also comes up as a new class). In Type B, there are one majority class and three minority classes. Class 2 comes up at the beginning but is lost in the last two training subsets. Class 3 comes up as a new class in S_2 and is lost in the last training subset. Class 4 comes up as a new class in S_3. This experiment was conducted to examine the performance of SFL on problems with a single majority class and multiple minority classes, some of which come up as new classes and are lost in some data subsets. In Type C, the class distribution of the whole training set (i.e., the union of all the training subsets) is balanced. However, the training subsets are class imbalanced, and every training subset contains only two classes. This experiment was conducted to examine the performance of SFL on problems whose class distributions are balanced in total but imbalanced in the data subsets. In all three types, the distributions are quite different between the data subsets.
An ensemble of 10 MLPs, each with 20 hidden nodes, was used in SFL and SFL.MLP. The training stop error was 0.05, and the coefficient of the penalty term of NCL (referred to as λ) was 0.5. The data sets were generated 30 times independently, and the means and standard deviations over the 30 executions on the three types of data sets are presented in TABLE II, TABLE III and TABLE IV. The Wilcoxon signed-rank test with the level of significance α = 0.05 was employed for the comparison between SFL and the other methods. In the results of the other methods, underlined (bold) values denote that SFL performed significantly better (worse) than them on those values, and values in normal type denote that there is no significant difference. The results on the Type A data set are presented in TABLE II. Compared with Learn++.UDNC, SFL obtains better overall recalls (i.e., the average of the recalls of all classes). The recalls of SFL on classes 1 to 3, which are majority classes, are not as good as those of Learn++.UDNC. However, Learn++.UDNC is biased too much towards the majority classes and performs very poorly on the only minority class, while SFL performs in a more balanced way over all the classes. Therefore, SFL outperforms Learn++.UDNC on this data set. Compared with SFL.MLP and NB, there is little statistical difference in the average recalls, especially after training with S_3, S_4 and S_5. After training with S_2, where class 4 comes up as a new class, SFL learns class 4 better than SFL.MLP and as well as NB. At the same time, SFL does not lose as much performance on class 1 as NB does. Although SFL loses more performance on classes 1 and 3 than SFL.MLP, it performs better than SFL.MLP on class 2. Therefore, after training with S_2, SFL performs better than both SFL.MLP and NB on the average recall. This observation indicates that SFL is capable of combining the advantages of both MLPs and NB to make a better model.
The results on the Type B data set are presented in TABLE III. When comparing with Learn++.UDNC, similar observations can be made, and we can again conclude that SFL outperforms Learn++.UDNC on this data set. When comparing with SFL.MLP and NB, some values of SFL lie between the values of SFL.MLP and NB (always closer to the larger ones), while some values of SFL are significantly larger than both SFL.MLP and NB. Similar observations can be made from the results on the Type C data set in TABLE IV. All these results show that SFL outperforms Learn++.UDNC and is capable of combining the advantages of both MLPs and NB to make a better model.

4.2. Experiments on real-world data sets. The experiments on real-world data sets include three parts. First, 5 class imbalanced data sets were divided randomly to simulate the incremental learning process. Secondly, 3 class imbalanced data sets were divided taking into account new classes and the loss of classes in the new data subsets. Finally, 2 class balanced data sets were divided into class imbalanced subsets to simulate the incremental learning process; the situations of new classes and the loss of classes in the new data subsets were also considered.
The class distributions of the 5 class imbalanced data sets that were randomly divided are presented in TABLE V. Each of these data sets was first divided, in a stratified manner, into a training set (80%) and a testing set (20%), and the training set was then randomly divided into 5 training subsets. The other real-world data sets, including 3 class imbalanced data sets and 2 class balanced data sets, were divided according to predefined data distributions. The data distributions of all training subsets and testing sets are presented in TABLE VI. It can be observed from TABLE VI that, for all the data sets, the data distributions are quite different between different training subsets, and the situations of new classes and the loss of classes in the new data subsets occur in some training subsets.
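The data preparation for the randomly divided data sets can be sketched as follows. The 80/20 ratio and 5 subsets follow the text; the index-based bookkeeping and per-class rounding are illustrative choices, not the paper's exact procedure.

```python
import random
from collections import defaultdict

def incremental_split(y, test_frac=0.2, n_subsets=5, seed=0):
    """Stratified train/test split followed by a random division of the
    training indices into n_subsets parts, simulating incremental learning.

    y: list of class labels; returns (training subsets, test indices)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, k in enumerate(y):
        by_class[k].append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():          # stratified: split per class
        rng.shuffle(idx)
        cut = int(round(len(idx) * test_frac))
        test_idx += idx[:cut]
        train_idx += idx[cut:]
    rng.shuffle(train_idx)                 # then divide the training set
    subsets = [train_idx[j::n_subsets] for j in range(n_subsets)]
    return subsets, test_idx
```

Since the subsets are drawn at random, an imbalanced whole data set generally yields imbalanced subsets, matching case (1) described in the introduction.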
For all the data sets, an ensemble of 10 MLPs, each with 20 hidden nodes, was used in SFL, and the coefficient λ of NCL was 0.5. An independent execution was carried out for every data set to set the stop criterion for training the MLPs so as to ensure the convergence of the training process. All the data sets were divided 30 times independently, and for every division, all the compared methods were executed once. The means and standard deviations of the overall recalls after the arrival of every data subset over the 30 executions of all the real-world data sets are presented in TABLE VII. The Wilcoxon signed-rank test with the level of significance α = 0.05 was employed for the comparison between SFL and the other methods. In the results of the other methods, underlined (bold) values denote that SFL performed significantly better (worse) than them on those values, and values in normal type denote that there is no significant difference.
It can be observed from TABLE VII that SFL outperforms Learn++.UDNC on most of the data sets, including Soybean, Splice, Thyroid-allrep, Car, Nursery, Optdigits and Vehicle. On the other data sets, SFL does not perform significantly worse than Learn++.UDNC. When comparing with SFL.MLP and NB, the performance of SFL usually leans towards the better of SFL.MLP and NB, and sometimes SFL outperforms both of them, as on Soybean, Nursery, Optdigits and Vehicle. These observations further support the claim that SFL is capable of combining the advantages of both MLPs and NB to make a better model.
On Car, Nursery, Page-blocks, Optdigits and Vehicle, the data sets were divided according to the distributions presented in TABLE VI, where the appearance of new classes or the loss of previous classes usually occurs in the new data subsets. It is worth examining the detailed results of each class on these data sets. Therefore, the detailed results on two of them, i.e., Nursery (class imbalanced) and Optdigits (class balanced), are further presented.
The means and standard deviations over the 30 executions on Nursery are presented in TABLE VIII, with the same statistical test and notation as before. It can be observed from TABLE VIII that the performance of Learn++.UDNC on class 3 is much worse than that of SFL. Class 3 is a minority class; it comes up in S_2 as a new class and is lost in S_4. The observations indicate that SFL can handle this kind of problem. To see the effect of the MLPs and NB in SFL, we pay more attention to the comparison with SFL.MLP and NB. It can be observed that the MLPs learn better than NB when no class is lost. However, when class 3 is lost in S_4, the MLPs lose much more recall on class 3 than NB does. As the combination of MLPs and NB, SFL does not lose too much recall on class 3 and, at the same time, learns better than NB on the other classes, which leads to the better overall recalls. Therefore, on this data set, the MLPs help SFL to learn better and NB helps SFL to preserve the previously learned information, especially when class loss occurs.
The means and standard deviations over the 30 executions on Optdigits are presented in TABLE IX, again with the same statistical test and notation. In S_2, class 4 first comes up as a minority class, class 8 first comes up as a majority class, and classes 1, 6 and 10 are lost. After learning from S_2, the recall of SFL on class 4 is larger than those of Learn++.UDNC and NB, but not as large as that of SFL.MLP; the recall of SFL on class 8 is the best of all the methods. At the same time, the degradation of SFL on classes 1, 6 and 10 is much less than that of Learn++.UDNC and SFL.MLP and only a bit more than that of NB. This observation indicates that SFL can learn new classes with little performance degradation on the lost classes. Even though SFL.MLP can perform better on class 2 and class 4 when they first come up, it degrades much more on the other classes. Therefore, it is not surprising that SFL obtains the best overall recalls.
The experimental results indicate that the performance of Learn++.UDNC is usually biased toward the majority classes. Even when it performs better on a minority class, its performance on the other classes is usually degraded too much. In contrast, SFL usually achieves more balanced performance across classes and better overall performance. This stems from the different ways SFL and Learn++.UDNC handle class imbalance. In SFL, class imbalance is considered while training the model, and the method for training the MLPs has been shown to be effective for class imbalance problems. In Learn++.UDNC, the training process does not consider class imbalance; instead, a transfer function that accounts for class imbalance is applied to the outputs. The effectiveness of this method has not been well established: even in the results presented in [5], the performance on minority classes was much worse than that on majority classes. Therefore, it is not surprising that SFL outperforms Learn++.UDNC on most of the data sets.

Computational time.
The computational time of SFL and Learn++.UDNC on all the data sets is presented in TABLE X. It can be observed from TABLE X that SFL usually takes less computational time than Learn++.UDNC. In the experiments, the MLP structures and the stopping criterion were the same for SFL and Learn++.UDNC. However, more MLPs were trained by Learn++.UDNC for every new data subset. Moreover, the training process of SFL usually meets the stopping criterion earlier than that of Learn++.UDNC. Therefore, SFL is usually faster than Learn++.UDNC.

4.4.
Analyses of the components of SFL. In SFL, two kinds of base classifiers, i.e., MLPs and NB, are employed to construct the ensemble. The results have shown that SFL is capable of outperforming the models with only MLPs and the models with only NB. To find out why, the differences between SFL and its components (MLPs and NB), and the influences of these differences, are investigated in detail.
After every data subset is learned, four numbers are estimated on the testing data set: the number of examples correctly classified by only the MLPs (#1) or only NB (#2), and the number of examples correctly classified by only the MLPs or only NB that are also correctly classified by SFL (#3). Then four ratios, ρ1 to ρ4, are estimated from these numbers, where #t is the number of examples in the testing data set. The ratios are estimated for all the data sets, and the average values over 30 executions are presented in TABLE XI. ρ1 indicates the diversity (in making correct classification decisions) between the MLPs and NB. ρ4 indicates the benefit that SFL gains from the difference between the MLPs and NB. It can be observed from TABLE XI that the values of ρ4 are always close to the larger of ρ2 and ρ3 and sometimes exceed both of them. These observations partially explain why SFL always performs toward the better of MLPs and NB and sometimes exceeds both of them.

4.5.
Analyses of parameters. There are several parameters in SFL, including the number of MLPs, the number of hidden nodes in every MLP, the stopping criterion for training the MLPs and the coefficient λ in NCL. In our experimental studies, the number of MLPs and the number of hidden nodes in every MLP were set by experience. An independent execution was carried out for every data set to set the stopping criterion so as to ensure the convergence of the training process. The coefficient λ in NCL controls the diversity among the individuals in the ensemble (a larger λ leads to larger diversity). In the study of NCL [14], λ was suggested to be between 0 and 1; in our experimental studies, it was set to 0.5 for all the data sets. Since diversity is a very important issue for the success of ensemble learning methods [25], it is worth examining the performance of SFL with different values of λ.
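To make the role of λ concrete, the following is a minimal sketch of the per-network gradient signal in the standard NCL formulation of [14], where the error of network i is e_i = ½(F_i − d)² + λ(F_i − F̄)Σ_{j≠i}(F_j − F̄) and F̄ is the ensemble mean; the function name and pure-Python form are ours, not from the paper.

```python
def ncl_gradients(outputs, target, lam):
    """Gradient of each network's NCL error with respect to its own
    output: dE_i/dF_i = (F_i - d) + lam * sum_{j != i} (F_j - F_bar),
    which simplifies to (F_i - d) - lam * (F_i - F_bar), because the
    deviations from the ensemble mean F_bar sum to zero.  With lam = 0
    the networks are trained independently; a larger lam pushes each
    network further from the ensemble mean, i.e. encourages diversity."""
    f_bar = sum(outputs) / len(outputs)
    return [(f_i - target) - lam * (f_i - f_bar) for f_i in outputs]
```

For example, with outputs [0.2, 0.4, 0.9] and target 1.0, setting λ = 0 yields the plain error-correcting gradients, while λ = 0.5 shrinks the correction for networks far from the ensemble mean on the target's side and enlarges it for the others.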
Extra executions of SFL with λ = 0, 0.25, 0.75 and 1 were conducted on all the data sets. The Wilcoxon signed-rank test with significance level α = 0.05 was employed to compare the overall recalls after training on each data subset. The results of each setting of λ were compared with the results of the other four settings, and the win-draw-lose counts are presented in TABLE XII. It can be observed from TABLE XII that λ affects the performance on most data sets. On some data sets, such as Synthetic Type A, Synthetic Type B, Nursery, Page-blocks, Optdigits and Vehicle, the performance becomes better as λ decreases. In SFL, λ is not the only factor that encourages diversity. On one hand, the model built by NB may be quite different from the MLPs. On the other hand, in incremental learning, different MLPs may be trained on different data subsets, which also produces diversity, especially when the data subsets are quite different. Therefore, a large λ (such as 1) may over-emphasize diversity, so that the performance is degraded.
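The win-draw-lose counting above can be sketched as follows. This is an illustrative pure-Python version of the two-sided Wilcoxon signed-rank test under the normal approximation (zero differences dropped, tied absolute differences given average ranks); in practice a library routine such as scipy.stats.wilcoxon would be used, and the function names here are ours.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test, normal approximation.
    Returns the statistic T = min(W+, W-) and an approximate p-value.
    (The normal approximation is crude for very small samples.)"""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero diffs
    n = len(diffs)
    # Rank the absolute differences, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tie block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    t = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (t - mean) / sd  # z <= 0 since t is the smaller sum
    p = 2 * 0.5 * (1 + math.erf(z / math.sqrt(2)))  # 2 * Phi(z)
    return t, p

def win_draw_lose(results_a, results_b, alpha=0.05):
    """Tally one entry of a win-draw-lose comparison of paired results."""
    t, p = wilcoxon_signed_rank(results_a, results_b)
    if p >= alpha:
        return "draw"
    return "win" if sum(results_a) > sum(results_b) else "lose"
```

Repeating win_draw_lose over every data subset and every pair of λ settings, and summing the tallies, yields a table of win-draw-lose counts of the kind reported.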

5.
Conclusions and future work. This paper investigated incremental learning under class imbalance. An ensemble-based method, SFL, which is a hybrid of MLPs and NB, was proposed. A group of impact weights (with length equal to the number of classes) is updated for every individual of the ensemble to indicate the 'confidence' of the individual on each class. These weights determine the output of the ensemble as a weighted average of all individuals' outputs. The training of the MLPs and NB takes class imbalance into account so that the ensemble can adapt to class imbalance.
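As an illustration of this combination rule, the sketch below computes a per-class weighted average of the individuals' outputs. The normalization by the total per-class weight is our assumption for the sketch; the paper specifies only that the individuals are combined by a weighted average according to the per-class confidence weights.

```python
def combine(individual_outputs, weights):
    """Combine ensemble members by per-class weighted averaging.
    individual_outputs[k][c] -- output of individual k for class c
    weights[k][c]            -- confidence weight of individual k on class c
    Returns one combined score per class (normalized by the total
    weight for that class; this normalization is an assumption)."""
    n_classes = len(individual_outputs[0])
    combined = []
    for c in range(n_classes):
        num = sum(o[c] * w[c] for o, w in zip(individual_outputs, weights))
        den = sum(w[c] for w in weights)
        combined.append(num / den if den > 0 else 0.0)
    return combined
```

The predicted class is then the argmax over the combined scores; an individual whose weight on a class has decayed (e.g. after that class is lost from the data stream) contributes little to that class's score.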
The experimental studies on 3 synthetic data sets and 10 real-world data sets have shown that SFL performed better than a recently proposed approach for class imbalanced incremental learning, Learn++.UDNC [9]. The experimental results have also shown that SFL combines the advantages of both MLPs and NB to build a better model. The success of combining MLPs and NB suggests that incorporating additive models can bring progress in incremental learning. However, this is only a preliminary attempt; other additive models, such as parameter estimation models, might also help to improve SFL. This will be a direction of our future work.