WHY CURRICULUM LEARNING & SELF-PACED LEARNING WORK IN BIG/NOISY DATA: A THEORETICAL PERSPECTIVE

Since their recent introduction, curriculum learning (CL) and self-paced learning (SPL) have attracted increasing attention due to their many successful applications. While the rationale of this learning regime is heuristically inspired by the cognitive principles of humans, there is still no sound theory explaining the intrinsic mechanism behind its effectiveness, especially in successful applications to big/noisy data. To address this issue, this paper presents theoretical results that reveal the insights behind this learning scheme. Specifically, we first formulate a new learning problem that aims to learn a proper classifier from samples generated from a training distribution that deviates from the target distribution. We then find that the CL/SPL regime provides a feasible strategy for solving this learning problem. In particular, by first introducing high-confidence/easy samples and gradually involving low-confidence/complex ones in learning, the CL/SPL process implicitly minimizes an upper bound of the expected risk under the target distribution, using only data from the deviated training distribution. We further construct a new SPL algorithm based on random sampling, which better complies with our theory, and substantiate its effectiveness by experiments on synthetic and real data.

1. Introduction. Recently, curriculum learning (CL) [2] and self-paced learning (SPL) [12] have been attracting increasing attention in machine learning and computer vision. Both learning paradigms are inspired by the learning principle underlying the cognitive process of humans/animals, which generally starts with learning the easier aspects of a learning task and then gradually takes more complex examples into consideration.
Since its introduction, multiple variations of the CL/SPL regime, such as self-paced reranking [8], self-paced learning with diversity [9], and self-paced curriculum learning [10], have been proposed to further improve its capability. Its effectiveness has also been extensively validated in various machine learning and computer vision tasks, including object detector adaptation [20], dictionary learning [19], long-term tracking [18] and matrix factorization [23]. In particular, this paradigm has been integrated into the system developed by the CMU Informedia team, which achieved leading performance in the challenging semantic query (SQ)/000Ex tasks of the TRECVID MED/MER competition organized by NIST in 2014 [22]. As indicated by the initial work [2] along this line, two advantages of CL/SPL have been empirically substantiated, especially in big/noisy data scenarios [12,8,9,10,1,11]: improved generalization and faster convergence.
Despite its superior performance in applications, the rationale of the CL/SPL regime has only been explained intuitively through its cognitive interpretation, and a sound theory revealing the mechanism behind its effectiveness is still lacking. Specifically, current CL/SPL methods need to iteratively solve varying optimization problems under gradually increasing pace parameters [12,8,9,10], while no theoretical argument has yet been presented to clarify where these methods converge to and which objective they intrinsically solve.
To address this issue, this work initiates a learning theory for CL/SPL and provides an explanation for the mechanism behind the effectiveness of this line of learning schemes. Specifically, the main contributions of this paper can be summarized as follows.
Different from traditional learning theory, which assumes similar training and test distributions, a new theory is formalized to understand the learning problem under the assumption that the training distribution deviates from the test/target distribution. This is a case often encountered in the era of big data. Nowadays, in various learning tasks like object recognition, event detection and user behavior analysis, learners often need to acquire massive data sources for training. In general, these massive data are collected and annotated from company users (e.g., the Netflix database), the web (e.g., the LFW database) or through crowdsourcing (e.g., the ImageNet database). The subjective understanding of any annotator inevitably deviates, more or less, from the objective oracle knowledge underlying the data. This naturally induces a deviation between the training distribution (accumulated from the knowledge of all involved annotators) and the true target one, especially in ambiguously annotated regions. This inspires us to formulate this learning problem and investigate its learning theory.
Under the premise of the proposed learning theory, the insight of CL/SPL can be rationally explained. In particular, the theory clarifies that the CL/SPL regime actually attempts to minimize an upper bound of the expected risk under the target distribution, using only data generated from the deviated training distribution. Specifically, easy samples in CL/SPL correspond to those in the high-confidence annotated area of the training distribution, which is also consistent with the high-confidence region of the target distribution (where annotators can easily confirm and agree). Complex ones, however, are more likely to be located in the ambiguously annotated regions, corresponding to the area where the training and target distributions deviate more (where users easily become uncertain or even annotate wrongly). Thus, starting training from easy samples simulates learning from the high-confidence target region, while gradually adding complex ones means that samples residing in ambiguous training regions come to be involved. Through this process, the faithful information delivered by high-confidence/easy samples tends to soundly guide the learning towards the expected target, while being less hampered by low-confidence/complex samples that deviate relatively more from the target. This naturally leads to the advantages of SPL, i.e., better generalization to the target and faster convergence, as compared to the traditional learning mode, which considers or even emphasizes unreliable low-confidence samples throughout the learning process.
Besides, based on the proposed theory, we construct a new CL/SPL learning scheme based on random sampling. This new scheme better complies with the deduced upper bound of the expected risk on the target distribution, and thus can be more faithfully explained by our theory. We also substantiate the effectiveness of the proposed learning scheme by experiments on synthetic and real data.
The rest of this paper is organized as follows. Section 2 briefly reviews the related work on CL/SPL. Section 3 introduces the new learning problem and our motivations. Section 4 establishes the main learning theory for this learning problem, and clarifies its intrinsic relationship to CL/SPL. The SPL learning algorithm by random sampling is constructed in Section 5, and evaluated by experiments in Section 6. The paper is then concluded with a discussion of future research.
2. Related work. Inspired by the learning principle of humans/animals, [2] formulated the curriculum learning paradigm. Its core idea is to involve samples in learning iteratively and in sequence: easy samples are learned first, and more complex ones are gradually included when the learner is ready for them. These sample sequences, gradually extended from easy to complex, are called curriculums and are learned at different stages of training. Specifically, [2] formalized the CL problem as follows. Let $P_{train}(z)$ be the training distribution from which the input data are generated, where $z$ is a random variable representing a sample for the learner (corresponding to a pair $(x, y)$ for supervised learning). Let $0 \le W_\lambda(z) \le 1$ be the weight imposed on $z$ at step $\lambda$ in the curriculum sequence, with the pace parameter $0 \le \lambda \le 1$. The corresponding training distribution at step $\lambda$ is

$$Q_\lambda(z) \propto W_\lambda(z) P_{train}(z), \tag{1}$$

such that $\int_Z Q_\lambda(z)\,dz = 1$, where $Z$ denotes the whole training set. A sequence $Q_\lambda(z)$ can be called a curriculum if both its entropy $H(Q_\lambda)$ and its weight function $W_\lambda(z)$ are monotonically increasing with respect to the increasing pace $\lambda$. This strategy has been empirically shown to enhance generalization capability and accelerate convergence in multiple applications [17,1].
To make the CL idea more implementable in applications, [12] first formulated the key principle of CL as a concise optimization model named SPL. The SPL model includes a weighted loss term on all samples and a general SPL regularizer imposed on sample weights. By sequentially optimizing the model with a gradually increasing pace parameter on the SPL regularizer, more samples can be automatically included into training from easy to complex in a purely self-paced way. [8] and [23] further built a guideline to construct a rational SPL regularizer, and formalized the SPL model as the following optimization problem:

$$\min_{w, v} \sum_{i=1}^{n} v_i L(y_i, f(x_i, w)) + r(v; \lambda), \tag{2}$$

where $L(y, f(x, w))$ denotes the loss between the annotated label $y$ and the estimated one $f(x, w)$, with model parameter $w$, and $v_i$ denotes the binary variable which indicates whether the $i$-th sample is easy or not. $r(v; \lambda)$ is the SPL regularizer, and $\lambda$ is a parameter controlling the learning pace. The larger $\lambda$ is, the more samples are involved in training and the more "grown-up" the trained model is. Under this guideline, multiple variations of SPL models have been constructed, including self-paced reranking (SPaR) [8], self-paced learning with diversity [9], and self-paced curriculum learning [10], and multiple applications of this SPL framework have been attempted, such as object detector adaptation [20], specific-class segmentation learning [13], visual category discovery [14], long-term tracking [18] and background subtraction [23]. In particular, the SPaR method was integrated into the system developed by the CMU Informedia team, which achieved leading performance in the challenging SQ/000Ex tasks of the TRECVID MED/MER competition organized by NIST [22]. In this paper, we attempt to explore the reason behind these successful applications of CL/SPL. To the best of our knowledge, this is the first theoretical explanation of this newly emerging methodology.
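The alternating scheme behind the SPL model described above can be sketched in a few lines. The following is only an illustrative toy, not the method of any cited work: it assumes the hard-threshold regularizer $r(v; \lambda) = -\lambda \sum_i v_i$ (whose optimal weights are $v_i = 1$ if the loss is below $\lambda$ and $0$ otherwise), a scalar model $w$ fitted by a weighted mean under squared loss, and a geometric pace schedule. The function names and the schedule are hypothetical.

```python
def spl_hard_weights(losses, lam):
    """Optimal v for sum_i v_i * l_i - lam * sum_i v_i with v_i in [0, 1].

    Assumption: hard-threshold regularizer; a sample is 'easy' if its loss < lam.
    """
    return [1.0 if loss < lam else 0.0 for loss in losses]


def self_paced_mean(xs, w0, lam, growth, n_rounds):
    """Alternate between updating sample weights v and the model w.

    The model is a single scalar w with squared loss (x - w)^2; this choice
    is purely illustrative.
    """
    w = w0
    v = [0.0] * len(xs)
    for _ in range(n_rounds):
        # Step 1: fix w, compute losses and the closed-form weights.
        losses = [(x - w) ** 2 for x in xs]
        v = spl_hard_weights(losses, lam)
        # Step 2: fix v, refit w on the currently selected (easy) samples.
        if sum(v) > 0:
            w = sum(vi * xi for vi, xi in zip(v, xs)) / sum(v)
        # Step 3: enlarge the pace so that harder samples can enter later.
        lam *= growth
    return w, v


if __name__ == "__main__":
    data = [1.0, 1.1, 0.9, 1.05, 0.95, 10.0]  # last value plays the outlier
    w, v = self_paced_mean(data, w0=1.0, lam=0.1, growth=1.5, n_rounds=3)
    print(w, v)
```

With only a few rounds the outlier is never selected; letting the pace grow without a stopping rule would eventually admit it, which is the compromise discussed later in the paper.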

3. A new understanding for the learning problem in big data scenarios. Current learning tasks often need to collect a massive data set for training. At such a scale, the expected data can only be obtained through crowdsourcing, especially for supervised learning tasks. This often introduces a large number of samples that are ambiguous (or complex, in CL/SPL terms) for general users, as illustrated in Figure 1, which shows typical "hard" samples from the SIN and Pascal VOC data sets and from results returned by the Google image search engine. The reason is that any participant has his/her own specific viewpoint on a problem as compared to most others, and there is thus inevitably a deviation from each collector/annotator's subjective understanding to the objective oracle knowledge of the problem. This naturally leads to the problem that the training distribution, $P_{train}(z)$, accumulated from the knowledge of all collectors/annotators, is different from the test/target distribution, $P_{target}(z)$, to which the learning really needs to generalize.

Figure 2: Left: Illustration of the training/target distributions $P_{train}(x)$/$P_{target}(x)$, as well as a sequence of pace distributions $Q_\lambda(x)$ varying from $P_{target}(x)$ to $P_{train}(x)$. Note that $P_{train}(x)$ has an evident heavy tail as compared to $P_{target}(x)$. Right: The corresponding weight functions with respect to varying pace $\lambda$.
Although $P_{train}(z)$ deviates from $P_{target}(z)$, useful information about $P_{target}(z)$ can still be extracted from it. Most participants share the same common sense on high-confidence samples, and these faithful samples thus tend to be distributed in a region with relatively large density. For supervised learning problems, such a region should be located inside a class and relatively far from the classification boundary, where samples are easily misclassified. In these high-confidence areas, the subjective understanding of humans and the objective knowledge should be consistent, and $P_{train}(z)$ and $P_{target}(z)$ should agree. Comparatively, ambiguous/complex samples, caused by cognitive differences or even mistakes of annotators, should occupy a relatively smaller proportion of the data and be located in a region with smaller density. They should lie near the classification boundary or even inside wrong classes (e.g., noises/outliers) in supervised learning. This naturally leads to an evident heavy-tailed shape of $P_{train}(z)$ as compared to $P_{target}(z)$ in such low-confidence regions, as shown in Figure 2.
In small/clean sample cases, such a low-confidence region always contains few generated samples due to its small density and the small total number of samples. Thus it tends to appear as a blank "margin" area. By finding a classification surface that maximizes this margin, the decision boundary can always be effectively located [21]. In practical big/noisy data, however, such a margin tends to be very hard to locate. Both the relatively high density of marginal samples (caused by noise/outliers) and the large data cardinality (caused by big data) tend to fill the margin, and heavy noises/outliers can even seriously mislead the margin location. This might explain the failure cases of traditional margin-emphasizing algorithms such as SVM [21] and Adaboost [7] in some real data applications [8,9].
It is thus rational to put more emphasis on the high-confidence (i.e., easy) samples than on the low-confidence (i.e., complex) ones in certain real data cases, instead of treating the former as non-support-vectors and ignoring their role in learning. This constitutes the basic methodology of CL/SPL, which complies better with the human learning process. This idea of emphasizing high-confidence samples has also been employed to build never-ending machine learning systems that acquire the ability to extract structured information from unstructured data [4,15] by persistently picking up high-confidence samples in each iteration.
In sum, our argument is that in real big/noisy data scenarios, both learning theories and implementation methods need to be approached from new viewpoints. In theory, instead of being similar to the training distribution [5,6], the target distribution often deviates from it, especially in low-confidence regions; and in implementation, high-confidence samples, i.e., the traditional non-support-vectors, might deserve more emphasis in learning, as the CL/SPL methodology suggests.
In the following, we will provide some preliminary theoretical results on this new setting of the learning problem, and deliver a rational theoretical explanation for the working mechanism of the CL/SPL methodology.

4. SPL learning theory.
4.1. Problem setting. In this work we mainly investigate the binary classification problem. Following the classic setting of learning theory, our learning problem is as follows. Let $X$ be a compact subset of $\mathbb{R}^d$, $Y = \{-1, 1\}$ be the label set and $Z = X \times Y$ be the whole set. The binary classification problem aims at learning a proper classifier $f : X \to Y$ from the input training samples $\{z_i = (x_i, y_i)\}_{i=1}^{n}$ generated from the underlying training distribution $P_{train}(Z) = P_{train}(X|Y) P_{train}(Y)$ [6], such that the following expected risk can be minimized:

$$R(f) = \int_Z L_f(z)\, P_{target}(x|y)\, P_{target}(y)\, dz,$$

with $L_f(z) = \mathbb{1}_{f(x) \neq y}$ denoting the loss function measuring the difference between the predicted and true labels. Both $P_{train}(Z)$ and $P_{target}(Z)$ are fixed but unknown. The following empirical risk is thus considered for actual implementation:

$$R_{emp}(f) = \frac{1}{n} \sum_{i=1}^{n} L_f(z_i).$$

We assume $P_{target}(y = 1) = P_{train}(y = 1) = 1/2$ for ease of evaluation and denote $P^+_{train}(x) = P_{train}(x|y = 1)$, $P^-_{train}(x) = P_{train}(x|y = -1)$, $P^+_{target}(x) = P_{target}(x|y = 1)$, $P^-_{target}(x) = P_{target}(x|y = -1)$. Since the derivations for the $y = 1$ and $y = -1$ cases are exactly analogous, we only consider one case in the following and write $P_{train}(x)$ and $P_{target}(x)$, omitting the superscript $+$ or $-$.

4.2. A simulated curriculum format. We first formulate $P_{target}(x)$ as the weighted expression of $P_{train}(x)$:

$$P_{target}(x) = \frac{1}{\alpha^*} W_{\lambda^*}(x) P_{train}(x), \tag{4}$$

where $0 \le W_{\lambda^*}(x) \le 1$ and $\alpha^* = \int_X W_{\lambda^*}(x) P_{train}(x)\,dx$ denotes the normalization factor. We thus have $\alpha^* \le 1$, since $W_{\lambda^*}(x) P_{train}(x) \le P_{train}(x)$. Based on Eq. (4), $P_{target}(x)$ actually corresponds to a curriculum as defined in Eq. (1) under the weight function $W_{\lambda^*}(x)$. As analyzed in the last section, $W_{\lambda^*}(x)$ should take small values in the low-confidence area of $P_{target}$ where complex samples are located, and larger values (close to 1) in the high-confidence area where easy samples reside. This can be easily understood by observing Figure 2.

Eq. (4) can be equivalently reformulated as

$$P_{train}(x) = \alpha^* P_{target}(x) + (1 - \alpha^*) E(x), \tag{5}$$

where

$$E(x) = \frac{1}{1 - \alpha^*} \left(1 - W_{\lambda^*}(x)\right) P_{train}(x).$$

Here it is easy to see that $E(x)$ is a distribution ($\int_X E(x)\,dx = 1$) formulated by the weighted $P_{train}(x)$ under the weight function $(1 - W_{\lambda^*}(x))$. This term actually measures the deviation from $P_{target}$ to $P_{train}$. In the high-confidence area of $P_{target}$, $E(x)$ corresponds to the nearly zero-weighted $P_{train}$, and thus the deviations/errors tend to be small. On the contrary, in the low-confidence area, $E(x)$ imposes relatively large weights on $P_{train}$, naturally leading to large deviation values. This complies with our aforementioned analysis of the deviation measure: the more confidently a sample is annotated, the less its label should deviate from the true one.
We can then construct the following curriculum sequence for our theoretical evaluation:

$$Q_\lambda(x) = \alpha_\lambda P_{target}(x) + (1 - \alpha_\lambda) E(x), \tag{6}$$

where $\alpha_\lambda$ varies from 1 to $\alpha^*$ with increasing pace parameter $\lambda$. Correspondingly, the curriculum $Q_\lambda$ simulates the changing process from $P_{target}$ to $P_{train}$, as illustrated in Figure 2. Note that $Q_\lambda(x)$ can also be written in the curriculum formulation of Eq. (1) as follows:

$$Q_\lambda(x) \propto W_\lambda(x) P_{train}(x), \quad \text{where} \quad W_\lambda(x) \propto \frac{\alpha_\lambda P_{target}(x) + (1 - \alpha_\lambda) E(x)}{P_{train}(x)},$$

through normalizing its maximal value to 1.
Note that the initial stage of this CL process sets $W_\lambda \propto P_{target}/P_{train}$, which has larger weights in the high-confidence area and much smaller weights in the low-confidence area due to the heavy-tail problem. The weights thus vary widely. As the pace $\lambda$ increases, the large weights in the high-confidence area become smaller while the small ones in the low-confidence area become larger, leading to more uniformly distributed weights with smaller variations. After normalizing $W_\lambda(x)$ into the interval $[0, 1]$, its values tend to consistently increase in $\lambda$, which can be easily understood from Figure 2. This complies with the weight-increasing condition defined for a curriculum in [2].
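The construction above can be checked numerically on a small discrete example. The sketch below is only an illustration under stated assumptions: it reads Eq. (4) as $P_{target} = W_{\lambda^*} P_{train} / \alpha^*$, the error distribution as $E = (1 - W_{\lambda^*}) P_{train} / (1 - \alpha^*)$, and the pace distribution as $Q_\lambda = \alpha_\lambda P_{target} + (1 - \alpha_\lambda) E$. The five-bin densities and the weight values are made up for the example.

```python
import math

# Hypothetical discretised densities on five bins (illustrative values only).
p_train = [0.05, 0.20, 0.50, 0.20, 0.05]
w_star = [0.1, 0.8, 1.0, 0.8, 0.1]  # assumed weight function W_{lambda*}

# Normalisation factor alpha* and the induced target / error distributions.
alpha_star = sum(w * p for w, p in zip(w_star, p_train))
p_target = [w * p / alpha_star for w, p in zip(w_star, p_train)]
error = [(1 - w) * p / (1 - alpha_star) for w, p in zip(w_star, p_train)]


def pace_distribution(alpha):
    """Mixture alpha * P_target + (1 - alpha) * E for alpha in [alpha*, 1]."""
    return [alpha * t + (1 - alpha) * e for t, e in zip(p_target, error)]


def normalised_weights(alpha):
    """Weight function Q / P_train, rescaled so that its maximum equals 1."""
    ratio = [q / p for q, p in zip(pace_distribution(alpha), p_train)]
    top = max(ratio)
    return [r / top for r in ratio]


def entropy(q):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(x * math.log(x) for x in q if x > 0)


if __name__ == "__main__":
    for alpha in (1.0, 0.9, alpha_star):
        q = pace_distribution(alpha)
        print(round(alpha, 3), [round(x, 4) for x in q], round(entropy(q), 4))
```

In this toy case the pace distribution equals the target at $\alpha_\lambda = 1$ and the training distribution at $\alpha_\lambda = \alpha^*$, and its entropy grows along the way, matching the curriculum conditions of [2].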
By taking (6) as the pace distribution, we attempt to present some theoretical results on CL/SPL strategy. These results will help us get some useful insights under this interesting learning scheme.
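Definition 4.1. Let $G$ be a family of functions and $S = (x_1, \ldots, x_m)$ a fixed sample of size $m$. The empirical Rademacher complexity of $G$ with respect to $S$ is defined as

$$\hat{R}_m(G) = E_\sigma \left[ \sup_{g \in G} \frac{1}{m} \sum_{i=1}^{m} \sigma_i g(x_i) \right],$$

where the $\sigma_i$ are i.i.d. samples drawn from the uniform distribution on $\{-1, 1\}$. The Rademacher complexity of $G$ is defined by the expectation of $\hat{R}_m(G)$ over all samples $S$:

$$R_m(G) = E_S \left[ \hat{R}_m(G) \right].$$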

TIELIANG GONG, QIAN ZHAO, DEYU MENG AND ZONGBEN XU
Definition 4.2. The Kullback-Leibler divergence $D_{KL}(p \| q)$ between two densities $p(\Omega)$ and $q(\Omega)$ is defined by

$$D_{KL}(p \| q) = \int_\Omega p(x) \log \frac{p(x)}{q(x)}\, dx.$$

Based on the above definitions, we can estimate the generalization error bound for CL/SPL learning under the curriculum $Q_\lambda$. We first present the following lemmas needed for this task.
Lemma 4.4. [16] Let $H$ be a family of functions taking values in $\{-1, 1\}$ and $P$ be the distribution over the input space $X$. Then for any $\delta > 0$, with confidence at least $1 - \delta$ over a sample set $S$, the following holds for any $f \in H$:
In addition, we have
Lemma 4.5. Suppose $S \subseteq \{x : \|x\| \le R\}$ is a sample set of size $m$, and $H = \{x \mapsto \mathrm{sgn}(w^T x) : \min_S |w^T x| = 1 \wedge \|w\| \le B\}$ is the hypothesis class, where $w \in \mathbb{R}^n$, $x \in \mathbb{R}^n$. Then we have
Proof.
We now give the main results of this work.
and
where $E^+$, $E^-$ denote the error distributions corresponding to $P^+_{target}$, $P^-_{target}$, and $R^+_{emp}(f)$, $R^-_{emp}(f)$ denote the empirical risks on positive samples and negative samples, respectively.
Proof. We first rewrite the expected risk as
The empirical risk tends not to approximate the expected risk due to the inconsistency between $P_{train}$ and $P_{target}$. However, by introducing an intermediate risk under the pace distribution, namely the pace risk, denoted by $E_{Q_\lambda}(f)$ in the error analysis, we can formulate the following error decomposition

Here, $E_{Q^+_\lambda}(f)$ and $E_{Q^-_\lambda}(f)$ denote the pace risk with respect to positive samples and negative samples, respectively.
We first focus on the estimation of $A_1$. Since the 0-1 loss is bounded by 1, we have
The last inequality is obtained by Lemma 4.3. For the estimation of $A_2$, according to Lemma 4.4, the following holds with confidence $1 - \delta$:
In a similar way, we can bound $B_1$ and $B_2$ as follows
and
By taking $m^* = \min\{m^+, m^-\}$ and combining Eqs. (17)-(20), we can easily get Eq. (14). In addition, one can further get:
By replacing $R_m(H)$ in Eq. (14) with Eq. (21), we have (15). The proof is then completed.
Note that the above established error bounds upon 0-1 loss are hard to optimize. We thus further deduce another bound under the commonly utilized hinge loss.
samples drawn from the pace distribution $Q_\lambda$ with radius $|X| \le R$. Denote by $m^+$/$m^-$ the number of positive/negative samples and let $m^* = \min\{m^-, m^+\}$. Let $H = \{x \mapsto w^T x : \min_S |w^T x| = 1 \wedge \|w\| \le B\}$, and let $\phi(t) = (1 - t)_+$ for $t \in \mathbb{R}$ be the hinge loss function. Then for any $\delta > 0$ and $g \in H$, with confidence at least $1 - 2\delta$, it holds that:
Proof. Applying Lemma 4.5 to Eq. (15), together with the fact that the hinge loss is an upper bound of the 0-1 loss, we obtain the result.
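The step from the 0-1 loss to the hinge loss rests on the hinge loss $\phi(t) = (1 - t)_+$ dominating the 0-1 loss at every margin $t = y\,g(x)$. A minimal sketch of this pointwise relation follows; the function names are illustrative, and a zero margin is counted as an error, which is an assumed convention.

```python
def zero_one_loss(y, score):
    """0-1 loss on the margin y * score; a zero margin counts as an error."""
    return 1.0 if y * score <= 0 else 0.0


def hinge_loss(y, score):
    """Hinge loss phi(t) = max(0, 1 - t) evaluated at the margin t = y * score."""
    return max(0.0, 1.0 - y * score)


if __name__ == "__main__":
    for score in (-2.0, -0.5, 0.0, 0.5, 2.0):
        for y in (-1, 1):
            # The hinge loss never falls below the 0-1 loss.
            print(y, score, zero_one_loss(y, score), hinge_loss(y, score))
```

Because the inequality holds pointwise, it carries over to averages, so any bound stated with the empirical 0-1 risk remains valid with the empirical hinge risk in its place.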
Note that there are three components in the upper bound of the expected risk under $P_{target}$. The first row corresponds to the empirical risk on training samples generated from $Q_\lambda$. As $\lambda$ increases, these samples are at first mainly generated from the high-confidence (easy) area of $P_{target}$ and gradually involve more complex ones. The second row reflects how well the training samples approximate the information of $Q_\lambda$. The more samples are considered, the smaller this term is and the better the approximation that can be achieved. The last two rows measure the generalization capability of the learned classifier, and are monotonically increasing with respect to both the KL-divergence between the error distribution $E$ and the target $P_{target}$, and the pace parameter $\lambda$. That is, the more the error $E$ deviates from $P_{target}$, the more difficult it is to learn from training data a proper classifier that generalizes well on $P_{target}$. Also, in the late stage of CL/SPL (corresponding to large $\lambda$), the generalization of the learned classifier tends to be worse due to the gradually more evident deviation of the curriculum $Q_\lambda$ from $P_{target}$. The last two terms thus trade off the approximation and generalization capabilities of this CL/SPL process with $Q_\lambda$.

This theory reveals the following insights underlying the CL/SPL process. The "easy-to-complex" property of the curriculum $Q_\lambda$ intrinsically facilitates the information transfer from $P_{train}$ to $P_{target}$, and makes it feasible to approximate the solution of the learning problem set in Section 4.1, i.e., to learn a classifier with minimal expected risk on $P_{target}$ through the empirical risk on training samples generated from $P_{train}$. Specifically, we can approach the task of minimizing the expected risk on $P_{target}$ by gradually increasing the pace $\lambda$, generating relatively high-confidence (easy) samples from $Q_\lambda$, and minimizing the empirical risk on these samples.
This complies with the core idea of previous CL/SPL regimes. It is interesting that previous investigations attribute the advantage of CL/SPL to its performance being soundly guided by the faithful easy samples, while our theory further reveals that this regime helps learning approach good generalization to the target distribution.

5. SPL insight: Approximate rational curriculums from training data.

5.1. Simulate $Q_\lambda$ from training samples. When we only have samples $\{(x_i, y_i)\}_{i=1}^{n}$, $y_i \in \{-1, 1\}$, generated from $P_{train}$, we can approximately simulate a rational $Q_\lambda$ as in Eq. (6) in the following way. For ease of discussion, we again consider only one of the $+1$ and $-1$ cases, and omit the superscript $+$ or $-$.

First, let us approximate $P_{train} = \sum_i p_i \delta_{x_i}(x)$, where $\delta_{x_i}(x)$ denotes the Dirac delta function centered at $x_i$ and $p_i = 1/m$. It is easy to see that $P_{train}$ places a uniform density on each sample $x_i$. Next, at the initial paces $\lambda$, we impose smaller weights $v_i(\lambda)$ on low-confidence samples located near the inter-class boundary than on those in high-confidence regions, to formulate the initial $Q_\lambda(x) \propto \sum_{i=1}^{n} v_i(\lambda) p_i \delta_{x_i}(x)$.
By strongly suppressing the heavy-tailed region of $P_{train}$, i.e., by putting nearly zero weights $v_i(\lambda)$ on evidently low-confidence samples, $Q_0$ is expected to form a rational approximation to $P_{target}$. We then increase the pace $\lambda$ to gradually raise the small weights $v_i(\lambda)$ to 1. The corresponding $Q_\lambda(x) \propto \sum_{i=1}^{n} v_i(\lambda) p_i \delta_{x_i}(x)$ then approximates a curriculum sequence varying from $Q_0$ to $P_{train}$, as in Eq. (6).
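The reweighting just described can be sketched as follows. The text does not prescribe a particular weight schedule, so the one used here, $v_i(\lambda) = c_i^{1/\lambda - 1}$ for a per-sample confidence score $c_i \in [0, 1]$ and $\lambda \in (0, 1]$, is purely a hypothetical choice: it gives near-zero weights to low-confidence samples at small $\lambda$ and reaches $v_i = 1$ for all samples at $\lambda = 1$. The confidence scores themselves are assumed to be given.

```python
def pace_weights(confidences, lam):
    """Hypothetical schedule v_i(lam) = c_i ** (1 / lam - 1), lam in (0, 1].

    Weights increase with the confidence c_i and with the pace lam, and all
    equal 1 at lam = 1.
    """
    if not 0.0 < lam <= 1.0:
        raise ValueError("lam must lie in (0, 1]")
    return [c ** (1.0 / lam - 1.0) for c in confidences]


def empirical_pace_distribution(confidences, lam):
    """Probability masses of Q_lam on the samples, proportional to v_i * p_i.

    Uses uniform base masses p_i. If every weight is zero the normalisation
    is undefined and a ZeroDivisionError is raised.
    """
    n = len(confidences)
    masses = [v * (1.0 / n) for v in pace_weights(confidences, lam)]
    total = sum(masses)
    return [mass / total for mass in masses]


if __name__ == "__main__":
    scores = [0.9, 0.5, 0.1]  # illustrative confidence scores
    for lam in (0.2, 0.5, 1.0):
        print(lam, [round(q, 4) for q in empirical_pace_distribution(scores, lam)])
```

At $\lambda = 1$ the empirical pace distribution coincides with the uniform approximation of $P_{train}$, while at small $\lambda$ almost all mass sits on the high-confidence samples.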

5.2. Revisit previous SPL models. Instead of minimizing the empirical risk $R_{emp}(f)$ as indicated by our theory, let us minimize its expected value under $Q_\lambda$ as:
where the first expectation is taken with respect to $\{x_i\}_{i=1}^{n}$, which are i.i.d. samples drawn from $Q_\lambda$. As analyzed above, $v_i(\lambda)$ should satisfy: (1) under fixed $\lambda$, $v_i(\lambda)$ is monotonically increasing with the confidence degree of the sample; (2) for each sample $x_i$, $v_i(\lambda)$ is monotonically increasing with respect to the pace $\lambda$.
A useful cue for judging whether the label confidence of a sample is high or low is its learning error. That is, a high-confidence sample tends to be located inside the region of its category, and thus generally has a small training error, and vice versa. From this understanding, Eq. (23) exactly corresponds to current SPL learning models [8,23,10], which make these weight values satisfy similar requirements by supplementing a self-paced regularizer on $v_i(\lambda)$ in Eq. (23), as shown in the previous SPL model (2).
In this sense, we might explain the effectiveness of the previous SPL models by the following insight. Based on our theoretical results, this learning scheme tends to learn from the deviated training information to discover the ground-truth knowledge of the target distribution, by learning in a sound manner from high-confidence/easy/small-loss samples to low-confidence/complex/large-loss ones. Throughout this learning process, it intrinsically tries to minimize an upper bound of the expected risk on the target distribution, by terminating at a properly compromised pace. This fully complies with the experience of its real implementations in multiple applications [8,9,23].

5.3. SPL with random sampling. Note that current SPL models are all deterministic, while the empirical risk in the upper bound (22) is calculated on randomly generated samples. We thus want to build a new SPL algorithm that uses a random sampling mechanism. The core idea is to approximate the pace distribution $Q_\lambda$ by imposing weights on samples, and then to sample from this distribution to form new SPL training samples.
The implementation details are as follows. At each iteration, we first compute the losses of all training samples based on the current model. Then we solve the following optimization problem to form weights on all samples:

$$\min_{v} \sum_{i=1}^{n} v_i L(y_i, f(x_i, w)) + r(v, \lambda), \tag{24}$$

where $r(v, \lambda)$ is the self-paced regularizer as defined in Eq. (2). After that, we normalize $v$ by $v/\|v\|_1$ to construct the empirical pace distribution $Q_\lambda(x) = \sum_{i=1}^{n} v_i(\lambda) p_i \delta_{x_i}(x)$, and then redraw samples from the training set according to $Q_\lambda$. A new model is then recursively trained on these samples. The whole process is summarized in Algorithm 1.

Algorithm 1 Self-Paced Learning with Random Sampling (RS-SPL)
Input: training data $D = \{x_i, y_i\}_{i=1}^{n}$, initial pace parameter $\lambda$, $m$ and stepsizes $\mu$, $k$.
Output: model parameter $w$.
6: Train a new model on $D_\lambda$ to obtain $w$.
7: If $\lambda$ is small, increase $\lambda$ by $\mu$ and increase $m$ by $k$.
8: until stopping criteria satisfied
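The loop described above (compute losses, form weights, normalize, resample, retrain, enlarge the pace) can be sketched as follows. This is a minimal illustration rather than the authors' implementation, and it makes several assumptions not taken from the text: the base learner is a simple class-centroid linear rule instead of an SVM, the weights come from the hard-threshold regularizer ($v_i = 1$ if the loss is below $\lambda$, else $0$), the initial model is trained on all data, $\lambda$ is increased at every iteration, and the stopping criterion is a fixed number of iterations. All function names are hypothetical.

```python
import random


def train_centroid(data):
    """Assumed base learner: linear rule (w, b) through the class centroids.

    Returns None when one of the two classes is absent from the data.
    """
    pos = [x for x, y in data if y == 1]
    neg = [x for x, y in data if y == -1]
    if not pos or not neg:
        return None
    dim = len(pos[0])
    mu_p = [sum(x[d] for x in pos) / len(pos) for d in range(dim)]
    mu_n = [sum(x[d] for x in neg) / len(neg) for d in range(dim)]
    w = [p - n for p, n in zip(mu_p, mu_n)]
    b = -sum(wd * (p + n) / 2.0 for wd, p, n in zip(w, mu_p, mu_n))
    return w, b


def sample_hinge_loss(model, x, y):
    """Hinge loss of the linear model on a single labelled sample."""
    w, b = model
    score = sum(wd * xd for wd, xd in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)


def rs_spl(data, lam, m, mu, k, n_iters, seed=0):
    """Self-paced learning with random sampling (illustrative sketch)."""
    rng = random.Random(seed)
    model = train_centroid(data)  # assumption: initialise on all the data
    for _ in range(n_iters):
        # Losses of all training samples under the current model.
        losses = [sample_hinge_loss(model, x, y) for x, y in data]
        # Assumption: hard-threshold weights stand in for the solution of (24).
        v = [1.0 if loss < lam else 0.0 for loss in losses]
        total = sum(v)
        if total > 0:
            # Normalise v by its l1 norm and redraw m samples accordingly.
            q = [vi / total for vi in v]
            subset = rng.choices(data, weights=q, k=m)
            new_model = train_centroid(subset)
            if new_model is not None:
                model = new_model
        # Enlarge the pace and the number of drawn samples.
        lam += mu
        m += k
    return model


if __name__ == "__main__":
    clean = [((2.0, 0.0), 1), ((2.5, 0.5), 1), ((1.5, -0.5), 1), ((3.0, 0.0), 1),
             ((-2.0, 0.0), -1), ((-2.5, -0.5), -1), ((-1.5, 0.5), -1), ((-3.0, 0.0), -1)]
    noisy = [((-2.5, 0.0), 1), ((2.5, 0.0), -1)]  # mislabelled points
    print(rs_spl(clean + noisy, lam=1.0, m=8, mu=0.5, k=2, n_iters=3))
```

In this toy run the two mislabelled points keep a large loss and are never drawn, so the final rule is trained on clean samples only.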
There are many choices for $r(v, \lambda)$ based on the three axiomatic conditions defined on it [8]. We simply use the following due to its simplicity and effectiveness:
where $\gamma > 0$ is a tuning parameter. The optimal $v(\lambda)$ of (24) can be analytically computed by

6. Experiments. In this section, we conduct experiments on synthetic and real classification datasets. The linear SVM, implemented by LibSVM [3], is used as the comparison method.
6.1. A synthetic example. We first give a synthetic example to illustrate the behavior of the proposed RS-SPL algorithm. The data were generated as follows. Two 2-D Gaussian distributions, each associated with a class, were specified as the target distribution. The training distribution was further mixed with another two 2-D Gaussian distributions, each centered in the low-density area of the target distribution of the corresponding class to enforce deviation. We generated 2000 clean training samples, 1000 per class, and 2000 test samples from the target distributions. Then 400 samples from the deviated distributions, 200 per class, were added to the training set. The resulting training and test samples are shown in Figure 3.
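A data set of this shape can be generated with a short script. The sample counts below follow the description above; the Gaussian means and standard deviations are not given in the text, so the values used here (isotropic components, target means at $(\pm 2, 0)$, deviated components centered on the opposite side of the boundary) are assumptions chosen only for illustration.

```python
import random


def sample_gaussian(rng, mean, std, n):
    """Draw n points from an isotropic 2-D Gaussian."""
    return [(rng.gauss(mean[0], std), rng.gauss(mean[1], std)) for _ in range(n)]


def make_synthetic(seed=0):
    """Return (train, test) lists of ((x1, x2), label) pairs."""
    rng = random.Random(seed)
    # Assumed target components: one Gaussian per class.
    target = {1: ((2.0, 0.0), 1.0), -1: ((-2.0, 0.0), 1.0)}
    # Assumed deviated components, each placed in a low-density area of the
    # target distribution of its own class.
    deviated = {1: ((-1.0, 0.0), 0.5), -1: ((1.0, 0.0), 0.5)}

    train, test = [], []
    for label, (mean, std) in target.items():
        train += [(x, label) for x in sample_gaussian(rng, mean, std, 1000)]
        test += [(x, label) for x in sample_gaussian(rng, mean, std, 1000)]
    for label, (mean, std) in deviated.items():
        train += [(x, label) for x in sample_gaussian(rng, mean, std, 200)]
    return train, test


if __name__ == "__main__":
    train, test = make_synthetic()
    print(len(train), len(test))
```

The training set then contains 2000 clean and 400 deviated samples, while the test set is drawn from the target components only.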
In order to understand the behavior of RS-SPL, we applied Algorithm 1 to this synthetic data and plot in Figure 4 the selected samples and the learned separating hyperplane during the SPL process. It can be observed that samples from the high-density region of the training distribution are selected first. As the SPL iteration continues, more and more samples with comparatively high confidence are included for training the classifier, and the separating hyperplane tends to be learned more accurately. However, when "hard" samples, i.e., the deviated samples, are included at the later stages of SPL, the learned hyperplane tends to be disturbed. Such behavior is also substantiated by the accuracy trend on the test data shown in Figure 5. These results coincide with the SPL learning theory developed in Section 4, which asserts that the optimal expected risk tends to be achieved as a tradeoff between the better approximation capability of increasingly more samples and the worse generalization caused by the divergence of the pace distribution from the target.
6.2. Real data evaluation. We also applied the proposed method to 5 real-world classification datasets, including magic, image, waveform and ringnorm (Table 1). We randomly split each dataset into two subsets of equal size for training and testing, respectively. Then we applied the proposed RS-SPL algorithm to train an SVM classifier on the training set, and evaluated its performance in terms of classification accuracy on the test set. The parameters for SVM and RS-SPL were selected via hold-out validation on the training set. We averaged the performance for each dataset over 50 runs, as summarized in Table 2. As a comparison, we also include the results of the batch-trained SVM. We can see that the proposed RS-SPL algorithm improves the classification accuracy over batch training, which validates its effectiveness.

7. Conclusion. We have presented a theoretical explanation for the working insight underlying the CL/SPL paradigm. Specifically, we clarify that the insight of the CL/SPL strategy is to learn knowledge of the target distribution from given samples generated from a training distribution that deviates from the target. We have also argued that such a learning problem tends to arise in real big data scenarios due to the bias between the subjective understanding of data collectors/annotators and the objective oracle knowledge underlying the data. Besides, our theory suggests the importance of high-confidence/easy samples in learning, which are generally taken as non-support-vectors in traditional learning methods and whose role is more or less underestimated. We further designed a new SPL algorithm with random sampling, which better complies with our theory, and verified its effectiveness by experiments on synthetic and real data.
Our future research includes designing a feasible termination condition for the CL/SPL iteration based on our theory, deriving the theory under unequal probabilities $P(y = 1)$ and $P(y = -1)$, making the upper bound tighter, and applying the RS-SPL algorithm to more realistic big data sets.