Cluster, Classify, Regress: A General Method For Learning Discountinous Functions

This paper presents a method for solving the supervised learning problem in which the output is highly nonlinear and discontinuous. It is proposed to solve this problem in three stages: (i) cluster the pairs of input-output data points, resulting in a label for each point; (ii) classify the data, where the corresponding label is the output; and finally (iii) perform one separate regression for each class, where the training data corresponds to the subset of the original input-output pairs which have that label according to the classifier. It has not yet been proposed to combine these 3 fundamental building blocks of machine learning in this simple and powerful fashion. This can be viewed as a form of deep learning, where any of the intermediate layers can itself be deep. The utility and robustness of the methodology is illustrated on some toy problems, including one example problem arising from simulation of plasma fusion in a tokamak.


Introduction and motivation
Modern times have seen an explosion of available data. This largely originated from the internet, although improved abilities to capture and store data have lead to similar data deluge phenomena in science and engineering applications. Machine learning technology has recently achieved a level of maturity where it is of increasing interest to adapt and apply the fundamental algorithms to problems in science and engineering. However, aside from the big data aspect, the nature of the problems which arise in science are often fundamentally different in character from the problems for which standard machine learning technology was developed. For example, many problems which arise in science and engineering involve highly nonlinear and/or discontinuous functions over high-dimensional parameter spaces [14,26,23]. The term "high-dimensional" has different meanings in different contexts, but for the present work it will mean 10.
The fantasy of the domain scientist or engineer is that there is a magical black box in which he can dump his data, and out of which will appear a miraculous machine capable of reproducing the results, to suitable accuracy, of his experiments or code which had been months or even generations in development. Indeed, sometimes this is possible with traditional existing methodology. This problem has been studied for a long time in the approximation theory and statistics literature, where it may be referred to as a surrogate or an emulator [31,10]. The classical problems of machine learning are typically either of classification or regression type. Highly nonlinear and discontinuous functions are something in between, and their reconstruction has been the focus of much existing work in the literature [12,1,35,15,3,2,29,19], but it is still an active area [5,17,37,24,11]. In the present work, we propose a simple general framework which can be utilized to solve such problems, using as components the arsenal of available machine learning algorithms [25,6,13,30,16].
The paper will be organized as follows. In Section 2, the mathematical problem and algorithm will be presented. Section 3 describes an adaptive active learning extension to the current algorithm. The approach is capable of either selecting an opimal subset of training data among an existing set of labeled data, or adaptively selecting training data to label from the domain of all possible (unlabeled) data. Section 4 will feature some numerical illustrations. Finally Section 5 will conclude and discuss some extensions of the present work.

Mathematical Setup and Algorithm
The point of departure is a set of labeled training data where (x i , y i ) ∈ X × Y are assumed to correspond to input and output of some model and are otherwise considered independent. In other words, we want to do supervised learning. It is furthermore assumed that the model f : X → Y is highly irregular, including sharp features arising from strong nonlinearities, and most notably including discontinuities. The output space is taken as Y = R for simplicity. The important point is that it is continuous. The results can be easily generalized to vector valued outputs. We will also assume X = R d , although this can also be relaxed. If the data are exact, one can utilize interpolation or projection methods [9,36], which are generally outside the repertoire of machine learning. However, even in the case of exact data, these methods can be particularly sensitive to the choice of input data points used in the approximation, and can yield poor results. Ordinarily one would look to regression methods in this context, since the output space is continuous. Regression methods handle noisy data elegantly, and are also more robust to the peculiarities of the input data. However, typically regression methods will fail miserably when f has discontinuities (as will interpolation and projection methods). It is therefore natural to attempt to partition the domain according to the discontinuities, and then perform piecewise regression on the continuous components. This strategy has been proposed before in the literature [12,1,35,15,17,11,37]. However, a point which remains unclear is how to identify the continuous components, or equivalently the discontinuous hyper-surfaces, and importantly how to do so without relying on any grid of the input space. Here we propose to use a clustering algorithm to label the input-output training data pairs, coupled with a classification algorithm to label new inputs. To be precise the full method in generality is as follows.
The label function typically minimizes an objective function of the form where S l = {(x i , y i ); λ(x i , y i ) = l}, and l is some loss function associated to cluster l. For example, letting z i = (x i , y i ), l = |z i − µ l | 2 , and µ l = 1 |S l | i∈S l z i , where | · | denotes the Euclidean norm for points and counting norm for sets, we have K−means clustering. Choosing L, for example using the elbow method, completes the clustering phase of the algorithm [6,13]. (II) Classify. Letting l i = λ(x i , y i ), we now have an expanded set of training data The classification phase proceeds to find a function which appropriately labels the inputs according to the labels identified in the cluster phase: The classifier could be non-parametric or parametric, but the important point is that for each x ∈ X it provides an estimate f c : x → f (x) ∈ L such that f c (x i ) = l i for the majority of the data. This is crucial for the ultimate fidelity of the prediction. Note that the output data {y i } is ignored in this phase. The classification function minimizes where φ c : L × L → R + is small if f c (x i ) = l i and otherwise is not. For example, we can choose f c (x) = argmax l∈L g l (x), where g l (x) > 0, L l=1 g l (x) = 1 is a soft classifier, and φ c (l, f c (x)) = − log(g l (x)) [6,13,25], corresponding to cross-entropic loss.
(III) Regress. The final phase of approximation is to find a function which appropriately identifies the original output given the input and the label (5) f r : X × L → Y .
The regressor could be non-parametric or parametric, but the important point is that for each (x, l) ∈ X × L it provides an estimate f r : (x, l) → f r (x, l) ∈ Y such that now f r (x, f c (x)) ≈ y for both training data and test data. If successful then we can expect good reconstruction error for this ultimate predictor where f (·) = f r (·, f c (·)). The regression function can be found by minimizing [6,13]. Note that it is computationally more expedient to partition the data into C l = {i; f c (x i ) = l}, for l = 1, . . . , L, and then perform L separate regressions (which can also be done in parallel) Remark 2.1. A practical consideration with the proposed methodology is the relative scaling of the data in In particular, notice that if the range is a small fraction of the size of the domain, then even large discontinuities in the output may not be revealed in the clustering. Therefore, it is recommended for the clustering to scale as follows ỹ = C(y − min i∈{1,...,N } y)/|(y − min i∈{1,...,N } y)| , (10) for C > 1. In particular, it is recommended to choose C = 10d, where d = dimx. One could also use some other type of standardization, e.g. subtracting the mean and dividing by the standard deviation. For the regression it is recommended to set C = 1 above, as smaller variations in y are easier to deal with for regression. Remark 2.2. A critique of this method is that it re-uses the data in each phase. We note that there is a Bayesian formulation, which will be considered in a future work. Let , assume we have parametric models for the classifier g l (·; θ c ) = g l (·; θ l c ), and the regressor f r (·, l; θ l r ), for l = 1, . . . , L, where θ c = (θ 1 c , . . . , θ L c ) and θ r = (θ 1 r , . . . , θ L r ) index the parameters corresponding to each class, and let θ = (θ c , θ r ). Then the posterior density has the form where, for example, and π(l|x i , θ c ) = g l (x i ; θ c ) . In this case we may take for example where h l (·; θ l c ) are some standard parametric regressors.
It is important to note that fully Bayesian solutions, while elegant, clean, and complete with Occam's razor, are also very expensive. This example is grid-free, at least, and this feature will be desirable for high-dimensional problems of the type considered here. It will be interesting to compare this method to grid-based Bayesian approaches such as [11,24]. If the regressors are defined in terms of grids, for example θ l r are the coefficients of expansion in piecewise linear finite element nodal basis functions [38], then there are similarities between the methods. Remark 2.3. Another important critique is that the method will struggle if there are discontinuities which appear or vanish within the domain, for example the product of a heaviside and a vanishing function, as the phase (I) clustering will fail in this case. However, in practice, we have found that the method is able to handle such cases with a very large number of clusters, i.e. much greater than the number of continuous components.

Active learning extension
Here it is described how to embed the algorithm above into an active learning strategy [32]. First, observe that the fundamental steps above are (I-II), as a continuous approximation (III) of a function with a discontinuity will always be a poor approximation. Indeed the error in such approximations is typically concentrated around the discontinuities. For this reason, it is preferable to choose a soft-classifier (or probabilistic classifier) [25], even if the ultimate classification is taken as a thresholding of such continuous function, as described above in (II) for multiclass logistic regression. Suppose we have a soft classifier {g l (x)} l∈L , l∈L g l (x) = 1, such that g l (x) represents the probability that point x is in class l, and our classification is given by f c (x) = argmax l∈L g l (x). Suppose we have some initial set of N init training data points Active Strategy 1. It is natural to seek, for n > 0, for some appropriate compact domain D n ⊆ X . Active Strategy 1a. An extremely simple example would be the case in which D 1 = D init and D n = {x 1 , . . . x Nres }\D n−1 is a finite subset of the domain, consisting of N res points which we refer to as the "reservoir", which may for example be existing labeled points which we are aiming to scrupulously include in our training algorithm. Active Strategy 1b. Alternatively, we may constrain the algorithm to search only some continuous subset of the possible (currently unlabelled) input data. Here we either need an a priori defined domain, or some sensible way of adapting the domain. If we know the input data of interest lies within some hypercube , then it may be reasonable to choose this as D n = D. However, one expects that often the data may be concentrated on some manifold within such a hypercube. In that case, the naive strategy just mentioned would lead to many points in the large volume of the parameter space which is uninteresting and will never be queried in practice. An alternative which may be viable in the case that the initial data D init is suitably rich, and concentrated on the appropriate data manifold, would be to let D = convD init , the convex hull of the initial data set. Note that in this case the training data will always remain in this set.
Active Strategy 2. This strategy operates on the assumption of the latter version of strategy 1b, i.e. that D = convD init provides suitable coverage of the input domain of interest.
Now, let the batch of new training data at step n, B n , be defined by those points z * , one for each x ∈ S n and each y ∈ C n (x), such that This strategy is prone to degeneration, and so has to be regenerated once in a while, for example with strategy 1b or strategy 3 below. When all x ∈ S n and y ∈ C n (x) are exhausted, we have n * x∈Sn |C n (x)| new data points, D * which are concatenated to the present set D n+1 = D n ∪ D * . 4 Active Strategy 3. Here we choose some set A (different from above), for example with any one of the strategies above, and we then choose points z i (x) ∼ Q(x, ·), for i = 1, . . . , n, from a Markov kernel Q (i.e. Q(x, ·) is a probability distribution for each x ∈ X ), for each x ∈ A. We could let Q(x, ·) be a uniform on the d−dimensional hypercube centered at x, or a normal centered at x. It can be uniform/standard, or the covariance could be determined by the sample covariance of N n k (x) with a suitably large k, in order to stay faithful to the local manifold structure of the data.
Remark 3.1. Note with strategy 1 that there are likely infinitely many solutions to the optimization problem. Consider binary linear classification, where we choose label 1 if g 1 (x) > 1/2 and label 2 otherwise, and there is a separating hyperplane if g 1 is linear. Along the hyperplane we have g 1 (x) = 1/2, i.e. there is a linear subspace of dimension d − 1 which solves the problem (11). This is not a major concern, as we only need one point, and any of these points will equivalently enrich the training set.
Remark 3.2. We are mostly concerned with the case in which we can label any point, but the labelling itself is the limiting cost. So active strategy 1a may be of limited practical value.
Remark 3.3. The choice of domain D is important for strategy 1b. If D = X = R d , then the strategy above is likely to seek points outside the convex hull of the existing training data, where nothing is known about the classifier. The danger is that we may not want to know anything about the classifier in that region.
Remark 3.4. Motivated by the discussion at the beginning of this section, all strategies presented here are based on what is known as uncertainty sampling [32]. Other objectives can be used as well (or none at all, i.e. random sampling), including (1) Entropy sampling: one chooses the point whose class probability has the largest entropy.
(2) Margin sampling: one chooses the point for which the difference between the most and second most likely classes is the smallest.
Remark 3.5. This section is not to be confused with simple adaptive online learning, in which the machine may be embedded within a workflow and queried regularly for some input x new . In this case, one would employ a similar but distinct approach. First, evaluate the fitness of the machine for the queried point x new . For example, as in (11) we may compute max l∈L g l (x new ). Then, if the fitness is suitable (in this case the probability of the most likely class being sufficiently close to 1), we proceed with the machine. If the fitness is low, we augment the training set with this new point and refine the machine. In this case, one may venture to decide if the new query point is suitably close to the existing set of training data, and if not then refine within the convex hull as described in strategy 1b or 2 above.

Method used and Numerical examples
Clearly there is enormous potential for exploring various configurations of the component algorithms, but since the purpose of this paper is to introduce the method and illustrate its power on some simple examples we choose only some of the most basic component algorithms.
For clustering, K-means clustering [6] is used, as described in Section 2 (I). The standard iterative algorithm is used to optimize the objective function (2). The elbow method is used to determine the value of L [6]. For regression and classification, multilayered perceptrons (MLPs) are used [6], with a single hidden layer of 100d neurons and ReLU(x) = max{0, x} activation function for both. For the multi-classification, the output layer is given by a softmax, with a cross-entropy loss function, as described in Section 2 (II). For regression, the output activation is linear, and the loss function is quadratic, as described in Section 2 (III). A quadratic regularization is used in both cases, with parameter 0.001. The adaptive stochastic gradient descent solver Adam [21] is used to optimize (4)    forest method [8] is used, in which 100 bootstrapped classification or regression decision trees are aggregated [7], where each one involves a subset of only ceil(d/3) input parameters. The scikit learn python package [28] is used for all the basic learning algorithms employed.

Numerical experiments.
In this section several prototype models are considered. First, let X = R and consider the simple example f 1 (x) = x1 x≥1 , where The function is plotted in Figure 1 row (a), column (a), along with the prediction results of the final CCR machine output f r (x; f c (x)), and the intermediate f c (x). Figure  1 row (a), column (b) shows a scatter plot of the true f 1 (x) and the CCR machine f r (x; f c (x)), illustrating the correlation. Figure 1 row (a), column (d) shows a histogram of f r (x; f c (x)) − y(x), illustrating the dissimilarity between the CCR reconstruction and the truth. Notice that the previous example could have been dealt with using a simple clustering of the output values y ∈ R alone. Consider the slightly more complicated function Here the y-values alone cannot be used for 6 clustering, as they suggest a single cluster is optimal (e.g. using the elbow method would return L = 1). However, our method of clustering input-output pairs (x, y) ∈ R 2 works here, and elbow returns L = 2. The method is illustrated on this example in Figure 1, row (b) whose panels are the same as Figure 1a. Note that due to the gross simplicity of the model, if we look for 2 clusters based on y values alone, then we would get something like y ∈ [0, 0.5] and y ∈ (0.5, 1]. In other words, we would bypass the issue of the discontinuity, as a single class would not span the discontinuity, and this is the primary requirement in order for the piecewise regression to perform well. This will not hold in general, but it is illustrative of the robustness afforded by choosing extra clusters and extra amplification of y. Note that the first 2 functions can both be reconstructed exactly with relatively simple multilayer perceptrons, provided discontinuous activation functions are utilized. Consider a 3 node hidden layer with z 1 = max{0, x}, z 2 = max{0, −x}, z 3 = 1 x≥0 , and z 4 = 1 −x≥0 . Then we have f 1 (x) = 1 + z 1 − z 4 and f 2 (x) = 1 − z 2 + z 1 − z 3 . However, it is difficult in general to cook up an architecture which is appropriate for arbitrary highdimensional functions which have multiple discontinuities along nonlinear hyper-surfaces and highly nonlinear components, as will be introduced in the following examples. On the other hand, CCR is flexible and easy to implement, and its basic components are well-understood. Furthermore, discontinuous activation functions are problematic for gradient-based optimization methods, which are commonly employed.
To illustrate the benefit of CCR with simple MLPs in comparison to alternative regression methods, we compare two direct regression methods on f 2 . The first is an MLP with a single hidden layer of 100 neurons with ReLU activation functions, and a linear output. The second is a deep neural network (DNN) with 3 hidden layers of 200, 420, and 21 neurons each, with ReLU activation functions, and linear output. The results are presented in Figure 2. It is clear that neither of the simple regression methods are able to cleanly recover the discontinuity. Now we consider slightly more complicated functions, following the recent work [24]. First, consider X = [−4, 10] and Notice that the range of f 3 is very small in comparison to the domain X . Therefore this is a prime example where the transformation (9) needs to be utilized. The method is illustrated on this example in Figure 1 row (c), whose panels are the same as the rows above. The next example is simple the product of the previous, so X = [−4, 10] 2 and f 4 (x) = f 3 (x 1 )f 3 (x 2 ). The method is illustrated on this example in Figure 3. Here the information is the same as Figure 1, but now f 4 (x) is plotted in panel (a), and f r (x, f c (x)) and f c (x) are plotted in panels (b) and (c), respectively. Panel (e) shows a scatter plot of the true f 4 (x) and the CCR machine f r (x; f c (x)), illustrating the correlation. Panel (f) shows a histogram of f r (x; f c (x)) − y(x), illustrating the dissimilarity between the CCR reconstruction and the truth. Panel (g) shows the elbow plot. Notice that there is no 7 clear elbow. One might choose L = 4 clusters, but this is not sufficient to avoid any cluster spanning a discontinuity. From a visual inspection, L = 7 is the minimum number of continuous components, if we group components which meet at a point. In this case, sometimes we get a good set of clusters, but sometimes some clusters span a discontinuity. If we let these each correspond to 2 distinct components, then L = 10, which we find is sufficient to ensure no single cluster spans a discontinuity. This is another illustration of the robustness of choosing extra clusters. Interestingly, the clusters we find do not correspond to the continuous components at all, but rather (as in previous examples 2-3) partition the function primarily based on y-value contour level set intervals, crucially none of which span a discontinuity.

4.2.
Critical gradient model for tokamaks. A tokamak is a device which uses magnetic fields to confine hot plasma in the shape of a torus. It is the leading candidate for production of controlled thermonuclear power, for use in a prospective future fusion reactor [18]. One dimensional radial transport modeling [27] using theory-based models such as GLF23 [34], MMM95 [4], and TGLF [33] plays an essential role in interpreting experimental data and guiding new experiments for magnetically confined plasmas in tokamaks. Turbulent transport resulting from micro-instabilities have a strong nonlinear dependency on the temperature and density gradients. One of the key characteristics is a sharp increase of turbulent flux as the gradient of temperature increases beyond a certain critical value. This leads to a highly nonlinear and discontinuous function of the inputs.
Here we consider an analytic stiff transport model that describes turbulent ion energy transport in tokamak plasmas [20]: where χ is ion thermal diffusivity, H(·) is the Heaviside function, R is the major radius, and T is the radial derivative of ion temperature. The normalized critical gradient (RT /T ) crit of ion temperature is calculated using IFS/PPPL model [22], which is a 8   nonlinear function of electron density (n e ), electron and ion temperatures (T e , T ), safety factor (q), magnetic shear (ŝ), effective charge (Z eff ) and the normalized gradient of ion temperature (RT /T ) and density (Rn /n). It is assumed that S = 1 and α = 1.
Considering (T, T ) and (n, n ) as two input parameters each, this gives a total of 10 inputs x ∈ R 10 . The output (13) is y ∈ R + . Basic primitive model inputs are chosen uniformly at random from a hypercube ω ∈ [0, 1] 17 , which then give rise to realistic inputs x ∈ R 10 , which concentrate on a manifold in the ambient space. See [22] for details of the model, and Figure 4 (b) for visualization of the input distribution histogram. It is difficult to visualize a function over R 10 , so we plot some slices in Figure 4 (a) in order to get a sense of it. This is now explained. Let m = E(x), where the expectation is with respect to the input distribution. For i = 4 and j = 6, 9, 10 (in rows 1, 2, 3, respectively), χ(x) is plotted as x varies over a 2d grid (m 1 , . . . , m j−1 , x j , m j+1 , . . . m i−1 , x i , m i+1 , . . . , m 10 ).
The CCR method is illustrated on this example in Figure 5. Here N train = 300000 training data points are used, and N test = 500 test data points.
4.3. Accuracy and active learning extensions. The accuracy of the methods on test data is presented in Table 1. In particular, For example 5, the data used to compute the error is out-of-sample testing data. For the grid-based examples, it is the training data. Table 2 illustrates active learning with active strategy 1a from Section 3.

Discussion
In this work we present a simple general framework for using the existing arsenal of fundamental machine learning tools in order to solve the challenging problem of reconstructing a surrogate model, or machine, for approximating highly nonlinear and discontinuous functions over high dimensional spaces. We have illustrated the power and accuracy of the method on various examples. It is notable that the method admits a huge amount of flexibility, as any combination of clustering, classification, and regression models will work. It can also be wrapped around existing deep learning architectures, which typically already combine elements of the latter 2. Therefore, it is reasonable to call this super deep learning. The cost of the regression stage may be high if there are many classes/clusters, since there is one regression per class, but this stage is also embarrassingly parallel. 10