A difference of convex optimization algorithm for piecewise linear regression

The problem of finding a continuous piecewise linear function approximating a regression function is considered. This problem is formulated as a nonconvex nonsmooth optimization problem where the objective function is represented as a difference of convex (DC) functions. Subdifferentials of DC components are computed and an algorithm is designed based on these subdifferentials to find piecewise linear functions. The algorithm is tested using some synthetic and real world data sets and compared with other regression algorithms.


1. Introduction. The aim of this paper is to develop an algorithm for nonparametric estimation of a multivariate regression function over the space of continuous piecewise linear functions.
Nonparametric methods are usually applied to estimate a regression function when a priori information about this function is not known. These methods minimize a kind of least squares risk of the regression estimate over a given function space. Such methods include regression trees like CART [10], adaptive spline fitting like MARS [13], artificial neural networks (ANN) [14] and Support Vector Machines (SVM) for regression [11,23].
In papers [3,5], a function space consisting of continuous piecewise linear functions is considered to estimate a regression function. The representation of these functions as a maximum of minimum of linear functions is used to formulate an empirical least squares risk. This risk function is nonconvex and nonsmooth. In [4], an algorithm based on the discrete gradient method from [6,7] is developed to minimize the risk function. The discrete gradient method uses function values to approximate the subdifferential of the objective function and therefore, this method is time consuming for regression problems in large data sets.
In this paper, we introduce a new algorithm for the estimation of a regression function over the space of continuous piecewise linear functions. This algorithm exploits an implicit difference of convex (DC) representation of the empirical risk function. The use of such representation allows one to efficiently calculate subgradients of DC components. We present computational results with synthetic and real world data sets and compare the proposed algorithm with other algorithms for regression.
The structure of the paper is as follows. The problem statement is given in Section 2. Subdifferentials of the empirical least squares risk function are described in Section 3. In Section 4, the minimization algorithm is introduced. Numerical results are reported in Section 5. Section 6 concludes the paper.
In what follows, we denote by R^n the n-dimensional Euclidean space with the inner product ⟨u, v⟩ = Σ_{i=1}^n u_i v_i and the associated norm ‖u‖ = ⟨u, u⟩^{1/2}, u, v ∈ R^n. Vectors are considered as rows. B_ε(x) = {y ∈ R^n : ‖y − x‖ < ε} is the open ball centered at x with radius ε > 0, and "conv" denotes the convex hull of a set.
2. Problem statement. We consider the nonparametric regression estimation problem. The definition of this problem is given, for example, in [16]. The problem can be formulated as the minimization of the so-called L 2 -risk over the set of all measurable functions. Usually, in an application the underlying distribution is not known and, therefore, the L 2 -risk is estimated by an empirical L 2 -risk [3,4].
In regression analysis, an R n × R-valued random vector (a, b) with Eb 2 < ∞ is considered and the dependency of b on the value of a is of interest. More precisely, the goal is to find a function ϕ : R n → R such that ϕ(a) is a "good approximation" of b. In applications, usually the set of data A = {(a i , b i ) ∈ R n × R : i = 1, . . . , m} is given, where a i ∈ R n are values of n input variables and b i ∈ R is their output.
Then the empirical L2-risk can be formulated as: For least squares estimates, first a class F of functions ϕ : R^n → R is chosen and then the estimate is defined by minimizing (1) over F. In this paper, F is defined as a set of continuous piecewise linear functions. It is known that such functions can be represented as a max-min of a finite number of linear functions [9,15]. Therefore, all functions in F are given as a maximum of minima of linear functions, and this representation is used to formulate the empirical L2-risk function. More precisely, let K ∈ N and M_1, . . . , M_K ∈ N be parameters of the estimate and set for some x_kj ∈ R^n, y_kj ∈ R.
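To make this representation concrete, the following Python sketch (illustrative, not the authors' implementation) evaluates a max-min function f(a) = max_k min_j (⟨x_kj, a⟩ + y_kj) and the associated empirical L2-risk over a sample:

```python
def maxmin_eval(x, y, a):
    """Evaluate f(a) = max_k min_j (<x[k][j], a> + y[k][j]),
    a continuous piecewise linear (max-min) function."""
    return max(
        min(sum(c * ai for c, ai in zip(xkj, a)) + ykj
            for xkj, ykj in zip(xk, yk))
        for xk, yk in zip(x, y)
    )

def empirical_risk(x, y, data):
    """Empirical L2-risk: mean squared deviation of f(a_i) from b_i."""
    return sum((maxmin_eval(x, y, a) - b) ** 2 for a, b in data) / len(data)
```

For example, with K = 1, M_1 = 2, x = [[[1.0], [-1.0]]] and y = [[0.0, 0.0]], the function reduces to min(a, −a) = −|a|.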
Then the minimization of the empirical L2-risk function, defined in (1), over the set F can be reformulated as: subject to x_kj ∈ R^n, y_kj ∈ R, j = 1, . . . , M_k, k = 1, . . . , K. Here

3. Subdifferentials of the empirical L2-risk function. In this section, first we show that the empirical L2-risk function F, defined in (2), is DC and then formulate subdifferentials of its DC components (details concerning DC functions can be found, for example, in [1,17,24,25]). Consider the function This function is DC [4]: The function ψ1_ik is linear and the function ψ2_ik is convex piecewise linear since it is represented as a maximum of linear functions. Next, consider the function: This function is also DC [4]: Both functions ϕ_i1 and ϕ_i2 are convex piecewise linear.
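Since ψ2_ik is a maximum of functions that are linear in the variables (x, y), one subgradient is obtained by selecting any active affine piece. A minimal Python sketch (illustrative names, not the authors' code; written for one block k and one data point a_i):

```python
def subgrad_max_affine(xk, yk, a_i):
    """One subgradient of psi(x, y) = max_j (<x_kj, a_i> + y_kj) with
    respect to the block (x_k1, y_k1, ..., x_kM, y_kM): the pair
    (a_i, 1) in the slot of an active piece, and zeros elsewhere."""
    vals = [sum(c * ai for c, ai in zip(xkj, a_i)) + ykj
            for xkj, ykj in zip(xk, yk)]
    j_star = max(range(len(vals)), key=vals.__getitem__)  # an active index
    g = [([0.0] * len(a_i), 0.0) for _ in vals]
    g[j_star] = (list(a_i), 1.0)
    return g
```

Any convex combination of such active-piece gradients is also a subgradient; the sketch returns the vertex associated with one active index.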
Proposition 1. The function F is DC.
Proof. The function F can be rewritten as: Define the functions φ_1 : R^{nq} × R^q → R^m and φ_2 : R^m → R such that: Since each component function ϕ_i is DC, the function φ_1 is also DC. Moreover, the function φ_2, as a quadratic function, is DC. It is obvious that F = φ_2 ∘ φ_1. Then it follows from Proposition 3.4 in [25] that the function F is DC.

Remark 1.
Although the function F is DC, in general, it is not easy to get its explicit DC representation. However, we show that subdifferentials of its DC components can be formulated explicitly.
Some properties of the function F were studied in [4]. In particular, it is shown that this function is locally Lipschitz. This means that the function F is subdifferentiable on R^{nq} × R^q. Next, we formulate subdifferentials of the DC components of this function. Let I = {1, . . . , m}.
Proposition 2. Let F_1 and F_2 be DC components of the function F: F(x, y) = Then subdifferentials of the functions F_1 and F_2 at (x, y) can be expressed as: Here Proof. The function F can be represented as a composition of the functions φ_1 and φ_2: Since the functions F_1 and F_2 are convex on R^{nq} × R^q, they are directionally differentiable. Their directional derivatives at a point (x, y) ∈ R^{nq} × R^q in a direction d = (d_x, d_y) ∈ R^{nq} × R^q can be computed as: and Here ∂φ_2(z_1, . . . , z_m)/∂z_i is the partial derivative of the function φ_2 with respect to the variable z_i, i ∈ I. It is obvious that Then expressions (3) and (4) for the subdifferentials ∂F_1(x, y) and ∂F_2(x, y) follow directly from the corresponding expressions for the directional derivatives F_1'((x, y); d) and F_2'((x, y); d).
It follows from optimality conditions for general unconstrained nonsmooth DC optimization problems [7,12] that for a point (x*, y*) ∈ R^{nq} × R^q to be a local minimizer of Problem (2) it is necessary that However, in practice it is not easy to verify this condition as it requires the calculation of the whole subdifferential ∂F_2(x*, y*). Therefore, the necessary condition (5) is replaced with the more relaxed condition of criticality: Most existing algorithms guarantee convergence to critical points satisfying condition (6).

Remark 4.
As noted in Remark 1, in general we do not have an explicit DC representation of the function F. However, explicit expressions for the subdifferentials of its DC components are available. Most existing algorithms for local DC optimization use not only subgradients but also values of the DC components. These algorithms include, for example, the DCA algorithm [24] and the version of the proximal bundle method for DC optimization [18]. Such algorithms cannot be used to solve Problem (2) since the values of the DC components are not always available. Therefore, in the next section we design an algorithm which does not require the explicit DC representation of the function F. More specifically, the proposed algorithm does not use the values of the DC components; it uses only values of the function F and subgradients of its DC components.
4. Numerical algorithm. In this section, we design an algorithm for solving Problem (2). The algorithm starts from any initial point and finds a critical point of this problem. For convenience, we consider a more general form of Problem (2): where f(x) = f_1(x) − f_2(x) and the functions f_1 and f_2 are convex on R^n. Moreover, we assume that dom f_1 = dom f_2 = R^n, the subdifferentials ∂f_1(x) and ∂f_2(x) are polytopes at any x ∈ R^n and Notice that the objective function in Problem (2) satisfies these conditions.
Take any λ > 0 and define the following set at a point x ∈ R^n: Here S_1 = {d ∈ R^n : ‖d‖ = 1} is the unit sphere. It is obvious that the set Q_1(x, λ) is convex and compact for any x ∈ R^n and λ > 0. Denote by Q_2(x) the set of vertices of the subdifferential ∂f_2(x), x ∈ R^n. For a given v ∈ Q_2(x) define the following set

ADIL BAGIROV, SONA TAHERI AND SOODABEH ASADI
Convexity of the functions f_1 and f_2 implies that for any w ∈ Q̄_v(x, λ) and v ∈ Q_2(x) In turn, this implies that Let λ, δ > 0 be given numbers. A point x ∈ R^n is called a (λ, δ)-critical point of Problem (7) iff: If λ = δ = 0 then we get the definition of critical points.
Assume that a point x ∈ R n is not a (λ, δ)-critical point of Problem (7).
Proof. From the necessary condition for a minimum we have . This, together with (10), implies that d_0 is a descent direction satisfying (12).
It follows from Proposition 3 that if the point x is not (λ, δ)-critical, then the sets Q_1(x, λ) and Q_2(x) can be used to find a direction of sufficient decrease of the function f at x. However, the computation of these sets is not always possible. Next, we design an algorithm which uses a finite number of elements from these two sets to compute descent directions.
Let λ > 0 be a given number and x ∈ R n be a given point.

Algorithm 1 Computation of descent directions.
1: (Initialization). Select a search control parameter c ∈ (0, 1) and the initial direction then STOP. The direction d j+1 is a direction of sufficient decrease.
Remark 5. One can use the following scheme to compute subgradients v ∈ ∂f_2(x) such that f_2'(x; d) = ⟨v, d⟩ for d ∈ S_1. Let α > 0 be a sufficiently small number. Compute a subgradient v̄ ∈ ∂f_2(x + αd). Then semismoothness of the function f_2 implies that f_2'(x; d) ≈ ⟨v̄, d⟩ and the subgradient v can be replaced by the subgradient v̄.
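The scheme of Remark 5 can be sketched in Python as follows; `subgrad_f2` stands for any routine returning one subgradient of f_2 at a given point (an assumption of this sketch):

```python
def approx_directional_subgradient(subgrad_f2, x, d, alpha=1e-6):
    """Remark 5: take a subgradient at the slightly perturbed point
    x + alpha * d; by semismoothness, <v_bar, d> approximates f2'(x; d)."""
    x_shifted = [xi + alpha * di for xi, di in zip(x, d)]
    return subgrad_f2(x_shifted)

# Example with f2(t) = |t| = max{t, -t}: at x = 0 the perturbation
# selects the subgradient of the piece active along the direction d.
sign_subgrad = lambda x: [1.0 if x[0] >= 0 else -1.0]
```

At x = 0 with d = 1 the perturbed point is positive, so the returned subgradient v̄ = 1 satisfies ⟨v̄, d⟩ = 1 = f_2'(0; 1).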
Next, we prove that Algorithm 1 terminates after a finite number of iterations.
Proposition 4. If stopping criteria (13) and (14) in Algorithm 1 do not hold at the j-th iteration then Proof. If the algorithm does not stop at the j-th iteration, j ≥ 1, then and From the necessary condition for a minimum we have ū It follows from (9)

Let Then it follows from (18) that Furthermore, from the definition of the direction d_{j+1} we have Since This implies that p̄_j ≤ p_j, which contradicts (20). Proof. Assume the contrary, that is, Algorithm 1 generates an infinite sequence of pairs of sets [Q̄_1^j, Q̄_2^j]. It is obvious that for any j ≥ 1 Moreover, conditions (13) and (14) do not hold for any j ≥ 1. The sets Q̄_1^j, Q_1(x, λ) and ∂f_2(x) are convex and compact. Furthermore, the set Q̄_2^j is finite. Take any d ∈ R^n, d ≠ 0_n. Then the first two inclusions in (21) imply that the sequences {σ_i^j(d)}, i = 1, 2, are nondecreasing. Using the last two inclusions in (21) we conclude that these sequences are bounded above. Then they are convergent, and for any ε > 0 there exists j_ε > 1 such that for all j > j_ε. Since the set S_1 is compact, the inequalities (22) hold for any d ∈ S_1. Taking into account condition (13), we get from (20) that (23), and this completes the proof.
Algorithm 2 Finding (λ, δ)-critical points.
1: (Initialization). Select any starting point x_1 ∈ R^n and numbers c_1 ∈ (0, 1) and c_2 ∈ (0, c_1]. Set j := 1. 2: (Finding search direction) Apply Algorithm 1 with c = c_1 to find a search direction at the point x_j. This algorithm terminates after a finite number of iterations l_j > 0 with either ‖w_{l_j}‖ ≤ δ or a descent direction d_{l_j} ∈ S_1 such that (24) holds; then find the step-length α_j > 0 as follows: 5: (Updating iteration) Set x_{j+1} := x_j + α_j d_{l_j}, j := j + 1 and go to Step 2.
Proposition 6. Assume that the condition (8) holds. Then Algorithm 2 finds a (λ, δ)-critical point of Problem (7) in at most j_1 iterations, where Proof. Assume the contrary, that is, the sequence {x_j} is infinite and the points x_j are not (λ, δ)-critical for all j = 1, 2, . . .. This means that ‖w_{l_j}‖ > δ, j = 1, 2, . . .. Since c_2 ≤ c_1, it follows from (24) that α_j ≥ λ for any j > 0. Then which means that f(x_j) → −∞ as j → ∞. This contradicts the condition f* ≥ 0 and therefore the algorithm terminates after a finite number of iterations. Since f* ≤ f(x_{j+1}), it is obvious that the maximum number of iterations j_1 is given by (25). Now we are ready to design an algorithm to find critical points of Problem (7). Let {λ_j} and {δ_j} be sequences such that λ_j, δ_j ↓ 0 as j → ∞.

Algorithm 3 Finding critical points.
1: (Initialization). Select any starting point x_1 ∈ R^n and an optimality tolerance ε > 0. Set j := 1. 2: Apply Algorithm 2 starting from the point x_j to find the (λ_j, δ_j)-critical point x_{j+1}. 3: (Stopping criterion) If λ_j < ε and δ_j < ε then STOP. Otherwise, set j := j + 1 and go to Step 2. Proof. Algorithm 2 is a descent algorithm, and therefore x_j ∈ J(x_1) for all j ≥ 1. Since the set J(x_1) is compact, the sequence {x_j} has at least one limit point. Assume, without loss of generality, that x_j → x̄ as j → ∞. For each x_j one has This implies that there is a sequence {z_j} such that z_j ∈ ∂f_2(x_j) and z_j ∈ Q_1(x_j, λ_j) + B_{δ_j}(0_n) for all j ≥ 1. It follows from the upper semicontinuity of the subdifferential maps x → ∂f_i(x), i = 1, 2, that for any γ > 0 there exists j_1 = j_1(γ) > 0 such that for all j > j_1. Then, according to the definition of the map x → Q_1(x, λ), for γ > 0 there exists j_2 ≥ j_1 such that for all j > j_2. This, along with (26), implies that for all j > j_2. The sequence {z_j} is obviously bounded and therefore has at least one limit point. Let z̄ be a limit point of the sequence {z_j}. Since γ > 0 is arbitrary and δ_j ↓ 0 as j → ∞, it follows from (27) that z̄ ∈ ∂f_1(x̄) ∩ ∂f_2(x̄). This means that the point x̄ is a critical point of Problem (7).
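The outer loop of Algorithm 3 can be sketched in Python as follows; `solve_ld_critical` stands for the (λ, δ)-critical point solver (Algorithm 2), and the geometric decrease of λ_j, δ_j is one possible choice of the sequences tending to zero:

```python
def find_critical_point(x0, solve_ld_critical, eps=1e-4,
                        lam=1.0, delta=1.0, shrink=0.5):
    """Outer loop of Algorithm 3: repeatedly call the (lambda, delta)-
    critical point solver while driving lambda_j, delta_j down to zero;
    stop once both tolerances fall below eps."""
    x = x0
    while True:
        x = solve_ld_critical(x, lam, delta)
        if lam < eps and delta < eps:
            return x
        lam *= shrink
        delta *= shrink
```

In a real implementation `solve_ld_critical` would itself iterate Algorithm 1 and the line search; here it is a placeholder passed in by the caller.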

Remark 6.
There are some similarities between Algorithm 3 and an optimization algorithm introduced in [8]. Both algorithms are designed to solve nonsmooth unconstrained DC optimization problems, and they use the same sets to approximate the subdifferential of the first DC component. However, there are also significant differences between these two algorithms. The algorithm from [8] uses only one subgradient of the second DC component at each iteration, whereas Algorithm 3 uses a subset of the subdifferential of this function. In addition, in Algorithm 3 search directions are found by calculating the distance between two polytopes approximating the subdifferentials of the functions f_1 and f_2. In the algorithm from [8], these directions are computed using the set defined as the difference of the set approximating the subdifferential of f_1 and one subgradient of the function f_2. The use of more than one subgradient from the subdifferential of the second DC component leads to algorithms which can find solutions of better quality than algorithms which use only one subgradient. This is especially important for solving nonconvex nonsmooth optimization problems.

5. Numerical results. The synthetic data sets are generated using the following relationship between the input U and the output V: where s is a regression function, γ is standard normally distributed and independent of U, and σ ≥ 0. U is uniformly distributed on [−2, 2]^n. For the noise level (standard deviation) σ we use three different values: 0, 0.5 and 1. The regression functions s used to generate synthetic data in different spaces are given in [4,5].
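This generation scheme can be sketched as follows (Python; the function name is illustrative):

```python
import random

def make_synthetic(s, n, m, sigma, seed=0):
    """Generate m samples of V = s(U) + sigma * gamma, with U uniform
    on [-2, 2]^n and gamma standard normal and independent of U."""
    rng = random.Random(seed)
    data = []
    for _ in range(m):
        u = [rng.uniform(-2.0, 2.0) for _ in range(n)]
        data.append((u, s(u) + sigma * rng.gauss(0.0, 1.0)))
    return data
```

With σ = 0 the outputs are noise-free values of the regression function s.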
Real world data sets are used to study the prediction ability of the PWLREG algorithm and to compare it with other well-known regression algorithms: the multiple linear regression (MLR), the linear SVM for regression (SVM(Lin)), the SVM regression algorithm with the radial basis kernel function (SVM(RBF)), the adaptive spline fitting (MARS) algorithm and artificial neural networks (ANN).

5.1. Implementation of algorithms.
The PWLREG algorithm is based on Algorithm 3, which involves two other algorithms: Algorithms 1 and 2. Search directions are found by applying Algorithm 1. The most important step in this algorithm is Step 2, where search directions are computed by solving several quadratic programming subproblems. These problems are solved using a finitely convergent algorithm introduced in [26].
All prediction methods are applied without testing their underlying assumptions. This means that some of these methods may not perform well as not all their underlying assumptions are met. For example, in many real life applications the relationship between the dependent and independent variables is highly nonlinear and the errors usually exhibit non-normality. These violations of assumptions may impact the prediction performance of the MLR.
The prediction performance of the algorithms is estimated using five performance measures: the Root Mean Squared Error (RMSE), the Mean Absolute Error (MAE), the Mean Squared Error (MSE), the Nash-Sutcliffe Coefficient of Efficiency (CE) [21] and the Pearson correlation coefficient (r).
Definitions of these measures are given below. Assume that e_1, . . . , e_m, m ≥ 1, are observed values of some parameter e and ē_1, . . . , ē_m are their forecast values. In the MSE measure, p stands for the number of parameters in the model. In the CE measure, e_0 is the mean of the observed values. Small values of the RMSE, MAE and MSE measures indicate small deviations of the predictions from the actual observations. The CE ranges from −∞ to 1. An efficiency CE = 1 means a perfect prediction. An efficiency of 0 indicates that the model predictions are as accurate as the mean of the observed data, and an efficiency −∞ < CE < 0 occurs when the observed mean is a better predictor than the model. The PWLREG and DGMREG algorithms were implemented in Fortran 95 and compiled using the free compiler gfortran. We use R implementations of the MLR, SVM(Lin), SVM(RBF), MARS and ANN regression algorithms. Parameters in the SVM regression, MARS and ANN algorithms are chosen according to [19,20,22], respectively. Computational experiments were carried out on a 2.50 GHz Intel(R) Core(TM) i5 machine with 8 GB of RAM.
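For concreteness, the five measures can be computed as in the following Python sketch (with the m − p denominator for MSE as described above; the function name is illustrative):

```python
import math

def performance_measures(obs, fcst, p=0):
    """RMSE, MAE, MSE (with m - p denominator, p model parameters),
    Nash-Sutcliffe CE and Pearson correlation r for observed values
    obs and forecasts fcst."""
    m = len(obs)
    sse = sum((e - f) ** 2 for e, f in zip(obs, fcst))
    e0 = sum(obs) / m                      # mean of the observed values
    f0 = sum(fcst) / m
    cov = sum((e - e0) * (f - f0) for e, f in zip(obs, fcst))
    var_e = sum((e - e0) ** 2 for e in obs)
    var_f = sum((f - f0) ** 2 for f in fcst)
    return {
        "RMSE": math.sqrt(sse / m),
        "MAE": sum(abs(e - f) for e, f in zip(obs, fcst)) / m,
        "MSE": sse / (m - p),
        "CE": 1.0 - sse / var_e,
        "r": cov / math.sqrt(var_e * var_f),
    }
```

A perfect prediction gives RMSE = MAE = 0, CE = 1 and r = 1.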

5.2. Illustrative example.
In this subsection, using a synthetic data set, we demonstrate how the proposed algorithm works. The synthetic data set contains 1000 observations and one input variable, that is, we have (a_i, b_i), i = 1, . . . , 1000, where a_i ∈ R and b_i ∈ R. This data set is generated using the regression function and the noise level σ = 0. In the implementation of the PWLREG (and also the DGMREG) algorithm it is important to choose the right number K of minimum functions under the maximum and the number M_k, k = 1, . . . , K, of linear functions under each minimum function. The numbers M_k can vary depending on k ∈ {1, . . . , K}; however, in our calculations we choose them the same for all k. Thus, we use the notation (K, M) to give the configuration of the max-min functions. Here K stands for the number of minimum functions under the maximum and M stands for the number of linear functions under each minimum function. We consider the following configurations: (1, 1), (2, 1), (2, 2), (3, 2), (4, 2), (5, 2). Piecewise linear functions with larger values of K and M are not considered as their use may lead to overfitting.
Their DC representations are: ψ1_i1(x, y) = (x_11 + x_12)a_i + (y_11 + y_12), ψ2_i1(x, y) = max{x_11 a_i + y_11, x_12 a_i + y_12}. The functions ϕ_i, i = 1, . . . , 1000, are given as: and their DC representations are: Since the functions ϕ_i1 and ϕ_i2 are convex, their subgradients can be computed by applying subdifferential calculus. Using these subgradients, one can compute subgradients of the functions F_1 and F_2 by applying formulas (3) and (4), respectively. Algorithm 3 is applied to solve Problem (2) using subdifferentials of the functions F_1 and F_2 and to find piecewise linear functions for each configuration separately.
In order to determine the best configuration of the piecewise linear function the data set is divided into two parts: the training set containing 800 (80% of all) instances and the validation set containing the remaining 200 (20% of all) instances. The training set is used to compute the coefficients of the linear functions using the above-mentioned six configurations of piecewise linear functions. Then the validation set is used to identify the best configuration using the RMSE as a performance measure. More specifically, the piecewise linear function providing the smallest value of the RMSE measure on the validation set is accepted as the estimation of the regression function.
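This selection procedure can be sketched as follows; `fit` and `rmse` are hypothetical placeholders for the PWLREG trainer and the validation error evaluation, and are assumptions of this sketch:

```python
def select_configuration(train, valid, fit, rmse,
                         configs=((1, 1), (2, 1), (2, 2),
                                  (3, 2), (4, 2), (5, 2))):
    """Fit one max-min model per (K, M) configuration on the training
    set and keep the one with the smallest RMSE on the validation set."""
    best_err, best_cfg, best_model = None, None, None
    for K, M in configs:
        model = fit(train, K, M)          # train coefficients for (K, M)
        err = rmse(model, valid)          # validate on the held-out set
        if best_err is None or err < best_err:
            best_err, best_cfg, best_model = err, (K, M), model
    return best_cfg, best_model, best_err
```

The returned configuration is then accepted as the estimate of the regression function.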
The steps of the PWLREG algorithm for finding the best configuration are illustrated in Figure 1. In this figure, blue dots represent the simulation data and red lines represent the estimation functions. For each configuration we present the values of RMSE obtained using the validation set. One can see that the piecewise linear function with the configuration (3, 2) gives the smallest value of RMSE. Therefore, the PWLREG algorithm chooses this function as an estimation of the regression function.

5.3. Generalization ability of the PWLREG algorithm. In this subsection, synthetic data sets containing from 500 to 5000 instances and from 1 to 10 input variables are used to study the generalization ability of the proposed algorithm and to determine the CPU time required by this algorithm for training. These data sets are generated using regression functions given in [4,5] with the standard deviation σ = 1.
Data sets are divided into training and test sets. In all cases, 60% of the data are used for training and 40% for testing. Results are reported in Table 1, where we present the values of RMSE for both training and test sets as well as the CPU time (in seconds) required for training. These results show that overall the proposed algorithm demonstrates a good generalization ability, especially in data sets with the number of instances m = 3000 and m = 5000. In these data sets, the difference between the RMSE values for the training and test sets is small. In small data sets with the number of instances m ≤ 1000 this difference is in some cases quite large. These results show that the generalization ability of the PWLREG algorithm depends on the number of linear functions and the number of instances in a data set. In addition, the generalization ability of this algorithm depends on the number of input variables. In small data sets, the number of linear functions should be restricted to obtain a good generalization; however, this number can be increased as the number of instances of a data set increases. One can see that the CPU time used by the algorithm for training is reasonable in all cases. Moreover, an increase of the size of the data sets does not lead to a significant increase of the computational effort required by the algorithm. The graphical illustration of results by the PWLREG algorithm and their comparison with those obtained by the SVM(Lin) and SVM(RBF) algorithms are presented in Figures 2, 3, 4 and 5. In Figures 2 and 3 data sets with one input variable are used. These data sets contain 500 and 5000 instances and are obtained using the regression function: with the standard deviation σ = 1. In Figures 4 and 5 synthetic data sets with two input variables are used. They contain 500 and 5000 instances and are generated using the regression function: with the standard deviation σ = 1.
These figures demonstrate that the PWLREG algorithm finds a better estimate of the regression function than the SVM(Lin) algorithm and that its results are comparable with those obtained by the SVM(RBF) algorithm.

5.4.
Dependence on the level of noise. Synthetic data sets containing 5000 observations and one input variable are used to investigate the dependence of the performance of the PWLREG algorithm on the level of noise. These data sets are generated using the regression function (28) with the standard deviation σ ∈ {0, 0.5, 1}. Results are presented in Table 2. These results indicate that the performance of the PWLREG algorithm depends on the level of noise. An increase in the level of noise leads to an increase of the values of RMSE, MAE and MSE and to a decrease of the values of CE and correlation. One can see that the RMSE and MAE measures depend rather linearly on the level of noise. The dependence of the other measures on the level of noise is close to quadratic.

5.5.
Comparison of PWLREG and DGMREG algorithms. The PWLREG and DGMREG algorithms compute the same piecewise linear estimations; however, they use different optimization methods. Since they compute similar piecewise linear functions, we compare these two algorithms using the CPU time (in seconds).
Synthetic data sets with 5 · 10^4 instances and the number of input variables varying from 1 to 10 are used to demonstrate the performance of these algorithms depending on the number of input variables. Furthermore, data sets with 4 input variables and the number of instances varying from 500 to 3 · 10^5 are used to compare the performance of the algorithms depending on the number of instances. Results are presented in Figure 6. These results clearly demonstrate that the PWLREG algorithm requires significantly less computational time than the DGMREG algorithm.

5.6. Prediction performance of the PWLREG algorithm on real data sets. In this subsection, we test the proposed algorithm as a prediction method and compare it with the MLR, SVM(Lin), SVM(RBF), MARS and ANN algorithms. For this purpose, we consider five real world data sets. A brief description of these data sets is given in Table 3. Their detailed description can be found in [2]. These data sets are divided into training (80% of the whole data) and test (20% of the whole data) sets. The training data are normalized so that each input variable has mean 0 and standard deviation 1. An algorithm is trained using the training set and its prediction performance is checked using the test set. Results are reported in Table 4. These results show that the PWLREG algorithm is a better predictor than the MLR and SVM(Lin) algorithms. Pairwise comparison with the SVM(RBF) algorithm shows that in two data sets the PWLREG algorithm outperforms the SVM(RBF), in two other data sets the SVM(RBF) provides better prediction results and in one data set these two algorithms give comparable results. This means that for these five data sets the PWLREG and SVM(RBF) are comparable prediction algorithms. The PWLREG algorithm outperforms the MARS in three data sets, and their results are comparable in the Red Wine Quality and Combined Cycle Power Plant data sets.
Finally, comparison with the ANN algorithm shows that the PWLREG algorithm produces better results than the ANN in four data sets. Only in the Red Wine Quality data set are the results of these two algorithms comparable.
In medium size data sets (Airfoil Self-Noise, Red Wine Quality and White Wine Quality) with the number of data points m < 5000, the PWLREG algorithm either outperforms the other algorithms or demonstrates a similar performance to the best performing algorithms (the SVM(RBF) in the Airfoil Self-Noise data set and the ANN in the White Wine Quality data set). In relatively large data sets (Combined Cycle Power Plant and Physicochemical Properties of Protein Tertiary Structure) with m > 5000, the SVM(RBF) obtains the best results and the PWLREG algorithm achieves the second best results. There is no clear indication of the dependence of the performance of the algorithms on the number of input variables.

6. Conclusions. In this paper, we introduced an algorithm for solving the piecewise linear regression problem. This problem is formulated as a nonsmooth nonconvex optimization problem. The objective function of the optimization problem is a difference of convex functions, whose DC representation is not easy to express explicitly. However, the subdifferentials of the DC components can be formulated explicitly. Therefore, we developed an algorithm which uses information on these subdifferentials and the objective function. We proved that this algorithm converges to critical points of the piecewise linear regression problem.
Using synthetic data sets, we demonstrated that the proposed algorithm has a good generalization ability, although this ability may depend on the number of linear functions and the size of the data set. We also used synthetic data sets to compare the computational time used by the proposed algorithm and another piecewise linear regression algorithm introduced in [4]. Results demonstrate that the proposed algorithm significantly improves on the algorithm from [4] in terms of computational time.
We applied the proposed algorithm to real world data sets to test its prediction ability and to compare it with five other prediction algorithms: the MLR, SVM(Lin), SVM(RBF), MARS and ANN. The performance of these algorithms is compared using five statistical measures for prediction. Although the proposed algorithm is not always the best prediction method, these results show that it is a good alternative to other mainstream prediction methods.