COMPARISONS OF DIFFERENT METHODS FOR BALANCED DATA CLASSIFICATION UNDER THE DISCRETE NON-LOCAL TOTAL VARIATIONAL FRAMEWORK

Abstract. Because balanced constraints can overcome the trivial-solution problem of data classification via the minimum cut method, many techniques with different balance strategies have been proposed to improve classification accuracy. However, their performances have not been compared comprehensively so far. In this paper, we investigate seven balanced classification methods under the discrete non-local total variational framework and compare their accuracies on graphs. The two-class classification problem is studied with equality constraints, inequality constraints, and the Ratio Cut, Normalized Cut and Cheeger Cut models. For the equality constraint case, we first compare the Penalty Function Method (PFM) and the Augmented Lagrangian Method (ALM), both of which transform the constrained problem into an unconstrained one, to show the advantages of ALM. To make the comparison fair, all remaining models are also solved with ALM, using the same proportion of fidelity points and the same neighborhood size on the graph. Experimental results demonstrate that ALM with the equality balance constraint achieves the best classification accuracy among the seven methods.


1. Introduction. Data classification is one of the fundamental problems in data mining, machine learning, pattern recognition, computer vision and related fields. Its task is to divide a specific dataset into different parts via labels, without overlap or uncovered points. The minimum cut method on a graph has recently been accepted as an efficient tool for data classification, but its intrinsic trivial-solution problem must be overcome; the same problem exists in data clustering [15]. The trivial-solution problem is that, in many cases, the minimum cut simply separates one individual vertex from the rest of the graph, which is obviously not the desired partition.
Data classification by minimum cut [8] on a discrete graph usually starts with an undirected weighted graph, denoted as a triple $G(V, E, W)$, where $V$ is the collection of $n$ vertexes, $E$ is the collection of all edges describing the similarities of adjacent vertexes, and $W$ is the collection of weights defined on edges, with $w(x, y) \in W$, $x, y \in V$, $w(x, y) \geq 0$ and $w(x, y) = w(y, x)$. Additionally, the degree of a vertex, the degree of a graph and the number of vertexes are defined as $d(x) = \sum_{y \in V} w(x, y)$, $\mathrm{vol}(V) = \sum_{x \in V} d(x)$, and $|V| = n$, respectively. Based on the above descriptions, the two-class data classification problem can be stated as the following minimum cut problem:
$$\min_{V_1 \cup V_2 = V,\; V_1 \cap V_2 = \emptyset} \mathrm{Cut}(V_1, V_2) = \sum_{x \in V_1} \sum_{y \in V_2} w(x, y). \tag{1}$$
In order to avoid the trivial solutions of (1), researchers have investigated many balanced cut approaches, such as Ratio Cut (RC), Cheeger Cut (CC) [3,13,24,21,5,17] and Normalized Cut (NC) [20]:
$$\mathrm{RC}(V_1, V_2) = \frac{\mathrm{Cut}(V_1, V_2)}{|V_1|} + \frac{\mathrm{Cut}(V_1, V_2)}{|V_2|},$$
$$\mathrm{CC}(V_1, V_2) = \frac{\mathrm{Cut}(V_1, V_2)}{\min(|V_1|, |V_2|)},$$
$$\mathrm{NC}(V_1, V_2) = \frac{\mathrm{Cut}(V_1, V_2)}{\mathrm{vol}(V_1)} + \frac{\mathrm{Cut}(V_1, V_2)}{\mathrm{vol}(V_2)},$$
where $\mathrm{vol}(V_1) = \sum_{x \in V_1} d(x)$ is the degree of the sub-set $V_1$. All of these are NP-hard problems [23]. In order to solve them properly, a binary label function $u(x) \in \{0, 1\}$, $x \in V$, is introduced:
$$u(x) = \begin{cases} 1 & \text{if } x \in V_1, \\ 0 & \text{otherwise.} \end{cases}$$
Additionally, there are some balanced methods with equality and inequality constraints to deal with the trivial-solution problem. The former is mainly used when the number of points in each class is known exactly, and the latter when only a range for each class is known approximately. In semi-supervised classification problems, one usually assumes that the regions have approximately equal sizes. The simplest way to achieve this is to prescribe the size of each class, and the corresponding equality constraint [23,14,2] can be stated as
$$\sum_{x \in V} u_i(x) = \alpha_i, \quad i = 1, 2, \tag{5}$$
where $u_1 = u$, $u_2 = 1 - u$, and $\alpha_i$ is the prescribed size of class $V_i$. In this paper, we use two different methods to implement the constraint in (5): one is the equality constraint Penalty Function Method (EC-PFM), the other is the equality constraint Augmented Lagrangian Method (EC-ALM).
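To make these graph quantities concrete, the following sketch (dense NumPy matrices and our own helper names, not the paper's implementation) computes vertex degrees, the graph volume and the value of a cut for a toy partition:

```python
import numpy as np

def degrees(W):
    """Vertex degrees d(x) = sum_y w(x, y) for a symmetric weight matrix W."""
    return W.sum(axis=1)

def cut_value(W, labels):
    """Cut(V1, V2): total weight of edges crossing the partition.

    `labels` is the binary label function u(x) in {0, 1}, u(x) = 1 iff x in V1.
    """
    u = np.asarray(labels, dtype=float)
    return float(u @ W @ (1.0 - u))  # sums w(x, y) over x in V1, y in V2

# Tiny example: two tightly connected pairs joined by one weak edge.
W = np.array([[0.0, 2.0, 0.1, 0.0],
              [2.0, 0.0, 0.0, 0.0],
              [0.1, 0.0, 0.0, 2.0],
              [0.0, 0.0, 2.0, 0.0]])
d = degrees(W)              # d(x) for each vertex
vol = d.sum()               # vol(V) = sum_x d(x)
u = np.array([1, 1, 0, 0])  # V1 = {0, 1}, V2 = {2, 3}
print(cut_value(W, u))      # 0.1: only the weak edge crosses
```

Note that the unbalanced minimum cut of this toy graph would instead isolate a single vertex whenever its degree is small enough, which is exactly the trivial solution the balanced models are designed to exclude.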
The inequality constraint [3] can be stated as
$$\alpha_i^l \leq \sum_{x \in V} u_i(x) \leq \alpha_i^u, \quad i = 1, 2, \tag{6}$$
where $\alpha_i^u$ is the upper bound of the size of class $i$ and $\alpha_i^l$ is the lower bound. Note that (6) becomes (5) if $\alpha_i^l = \alpha_i^u = \alpha_i$. When the lower bound does not exist, we can use a single direction inequality constraint. Therefore, (6) can represent both the single direction inequality constraint (SDIC) and the double direction inequality constraint (DDIC). If piecewise constant label functions are used to partition the data classes, the Cut is exactly the non-local total variation (NL-TV), whose local version was proposed originally in image processing [19]. This equivalence has already been applied successfully to semi-supervised data classification [24,26], producing clearer boundaries than the Tikhonov model [22]. Systematic definitions of the non-local operators associated with NL-TV were given in [9,10] in the continuous domain, and were extended to discrete non-local operators on graphs [25,7]. These works have laid the foundation for the application of NL-TV to multi-class data classification and clustering [3], which is beyond the scope of this paper. The research focus of this paper is to investigate how two-class classification accuracy depends on the different balance constraints, in order to guide their application to multi-class classification. To our knowledge, there is no comprehensive study of this topic from either a theoretical or an experimental perspective. Because of the mathematical difficulties of a theoretical investigation, we resort to numerical experiments to draw some empirical conclusions. We start from two-class classification problems on an undirected weighted graph using discrete NL-TV models, and solve them using convex relaxation and thresholding strategies. For the models with explicit equality constraints, we design both a Penalty Function Method and an Augmented Lagrangian Method; for the other cases, we design Augmented Lagrangian Methods.
The paper starts with an introduction of the non-local operators on graphs in Section 2. Section 3 provides seven schemes of balanced classification, with or without explicit constraints, along with their algorithms. In Section 4, numerous experiments illustrate the effect of the different constraints on accuracy. Conclusions are drawn in Section 5.
2. Non-local total variation on graph. In this section, we introduce some basic definitions of discrete non-local operators [25,7] for two-class data classification on an undirected weighted graph. Let the dataset $V$ be divided into $V_1$ and $V_2$, and introduce a binary label function $u(x)$ to partition them, i.e.,
$$u(x) = \begin{cases} 1 & \text{if } x \in V_1, \\ 0 & \text{if } x \in V_2. \end{cases} \tag{7}$$
The discrete non-local derivative on an edge starting from $x$ is defined as
$$\partial_y u(x) = \sqrt{w(x, y)}\,\bigl(u(y) - u(x)\bigr), \quad y \in N(x),$$
where $N(x)$ denotes the neighbourhood of $x$. Based on it, the discrete non-local gradient $\nabla_w u(x) = (\partial_y u(x))_{y \in N(x)}$, the discrete non-local divergence of a non-local vector $v = v(x, y)$ (where $y \in N(x)$), the inner product of two non-local vectors $v_1$ and $v_2$, the norm of the non-local gradient, and the non-local Laplacian are expressed as
$$\operatorname{div}_w v(x) = \sum_{y \in N(x)} \sqrt{w(x, y)}\,\bigl(v(x, y) - v(y, x)\bigr),$$
$$\langle v_1, v_2 \rangle = \sum_{x \in V} \sum_{y \in N(x)} v_1(x, y)\, v_2(x, y),$$
$$|\nabla_w u(x)| = \sqrt{\sum_{y \in N(x)} w(x, y)\,\bigl(u(y) - u(x)\bigr)^2},$$
$$\Delta_w u(x) = \sum_{y \in N(x)} w(x, y)\,\bigl(u(y) - u(x)\bigr),$$
respectively; $\nabla_w u(x, y)$ is a non-local vector. Considering the equivalence of the Cut and the discrete NL-TV, [24,11] rewrite the minimum cut (1) as
$$\mathrm{Cut}(V_1, V_2) = \min_{u(x) \in \{0, 1\}} \frac{1}{2} \sum_{x \in V} \sum_{y \in N(x)} w(x, y)\,|u(x) - u(y)|,$$
i.e.,
$$\min_{u(x) \in \{0, 1\}} \frac{1}{2} \sum_{x \in V} \sum_{y \in N(x)} w(x, y)\,|u(x) - u(y)|, \tag{14}$$
which can be solved via convex relaxation and thresholding techniques. In order to overcome the problem of trivial solutions, various balanced constraint strategies have been proposed; we rewrite the corresponding discrete variational models combining them with (14) in the next section.
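The operator definitions above can be transcribed directly (dense matrices and the square-root-weighted convention; helper names are ours). A quick numerical check of the adjointness relation $\langle \nabla_w u, v \rangle = -\langle u, \operatorname{div}_w v \rangle$ doubles as a sanity test:

```python
import numpy as np

def nl_gradient(W, u):
    """Non-local gradient: grad_w u(x, y) = sqrt(w(x, y)) * (u(y) - u(x))."""
    s = np.sqrt(W)
    return s * (u[None, :] - u[:, None])

def nl_divergence(W, v):
    """Non-local divergence of a non-local vector v(x, y):
    div_w v(x) = sum_y sqrt(w(x, y)) * (v(x, y) - v(y, x))."""
    s = np.sqrt(W)
    return (s * (v - v.T)).sum(axis=1)

def nl_laplacian(W, u):
    """Graph Laplacian: Delta_w u(x) = sum_y w(x, y) * (u(y) - u(x))."""
    return W @ u - W.sum(axis=1) * u

# Adjointness check on a random symmetric weight matrix.
rng = np.random.default_rng(0)
n = 5
A = rng.random((n, n))
W = (A + A.T) / 2
np.fill_diagonal(W, 0.0)
u = rng.random(n)
v = rng.random((n, n))
lhs = (nl_gradient(W, u) * v).sum()          # <grad_w u, v>
rhs = -(u * nl_divergence(W, v)).sum()       # -<u, div_w v>
print(np.isclose(lhs, rhs))                  # True
```

The Laplacian of a constant function is identically zero, which is another quick consistency check on the weight symmetry.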
3. Different models of balanced classification and their algorithms. In this part, we consider the cases of explicit equality constraints, inequality constraints and Ratio Cut, Normalized Cut, Cheeger Cut without explicit constraints. The inequality constraints include single direction and double direction constraints. We use PFM or ALM to solve the corresponding minimization problems.
3.1. Equality constraint with penalty function method (EC-PFM). Using the binary function (7), the equality constraint (5) can be written as
$$\sum_{x \in V} u(x) = \alpha_1. \tag{15}$$
Augmenting this constraint to (14) and using the PFM leads to the convex model
$$\min_{u(x) \in [0, 1]} \sum_{x \in V} |\nabla_w u(x)| + \frac{\mu_0}{2} \Bigl( \sum_{x \in V} u(x) - \alpha_1 \Bigr)^2, \tag{16}$$
where $\mu_0$ is a penalty parameter. To solve it easily, we design its alternating optimization scheme by introducing a non-local auxiliary variable $v = \nabla_w u$:
$$(u^{k+1}, v^{k+1}) = \arg\min_{u, v} \sum_{x \in V} |v(x)| + \frac{\mu_0}{2} \Bigl( \sum_{x \in V} u(x) - \alpha_1 \Bigr)^2 + \langle \lambda_1^k, v - \nabla_w u \rangle + \frac{\mu_1}{2} \| v - \nabla_w u \|^2, \tag{17a}$$
$$\lambda_1^{k+1} = \lambda_1^k + \mu_1 \bigl( v^{k+1} - \nabla_w u^{k+1} \bigr), \tag{17b}$$
where $\lambda_1$ is a Lagrange multiplier and $\mu_1$ is another penalty parameter. The alternating optimization is implemented by solving a series of sub-problems alternately in each step, i.e., minimizing with respect to one variable while the others are fixed temporarily. So (17a) is divided into two sub-problems:
$$u^{k+1} = \arg\min_u \frac{\mu_0}{2} \Bigl( \sum_{x \in V} u(x) - \alpha_1 \Bigr)^2 + \langle \lambda_1^k, v^k - \nabla_w u \rangle + \frac{\mu_1}{2} \| v^k - \nabla_w u \|^2, \tag{18a}$$
$$v^{k+1} = \arg\min_v \sum_{x \in V} |v(x)| + \langle \lambda_1^k, v - \nabla_w u^{k+1} \rangle + \frac{\mu_1}{2} \| v - \nabla_w u^{k+1} \|^2. \tag{18b}$$
Employing the standard variational method to (18a), we obtain the Euler-Lagrange equation on $u$:
$$\mu_0 \Bigl( \sum_{x \in V} u(x) - \alpha_1 \Bigr) + \operatorname{div}_w \bigl( \lambda_1^k + \mu_1 (v^k - \nabla_w u) \bigr) = 0. \tag{19}$$
Its discrete iterative formulation follows by gradient descent with a time step $\tau$:
$$u^{k+1}(x) = u^k(x) - \tau \Bigl[ \mu_0 \Bigl( \sum_{y \in V} u^k(y) - \alpha_1 \Bigr) + \operatorname{div}_w \bigl( \lambda_1^k + \mu_1 (v^k - \nabla_w u^k) \bigr)(x) \Bigr]. \tag{20}$$
Using the same method for (18b), we get the generalized soft thresholding formula [6]
$$v^{k+1}(x) = \max\Bigl( |f(x)| - \frac{1}{\mu_1}, 0 \Bigr) \frac{f(x)}{|f(x)|}, \quad f = \nabla_w u^{k+1} - \frac{\lambda_1^k}{\mu_1}. \tag{21}$$
After the energy approaches its minimum, $u \in [0, 1]$ should be recovered to $u \in \{0, 1\}$ via thresholding:
$$u(x) = \begin{cases} 1 & \text{if } u(x) \geq 0.5, \\ 0 & \text{otherwise.} \end{cases} \tag{22}$$
To summarize the algorithm introduced in this sub-section, we write down the pseudo-code of algorithm EC-PFM here.
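The generalized soft thresholding step and the final binarization have simple closed forms; a minimal sketch follows (the scalar absolute value stands in for the row-wise norm of the non-local vector, and the function names are ours):

```python
import numpy as np

def soft_shrink(f, tau):
    """Generalized soft thresholding: the minimizer of
    |v| + (1/(2*tau)) * (v - f)^2 is v = max(|f| - tau, 0) * f / |f|."""
    norm = np.abs(f)  # for non-local vectors, a row-wise norm would be used
    return np.maximum(norm - tau, 0.0) / np.maximum(norm, 1e-12) * f

def threshold_binary(u, t=0.5):
    """Recover binary labels u in {0, 1} from the relaxed solution u in [0, 1]."""
    return (np.asarray(u) >= t).astype(int)
```

For the v-update, `f` would be the non-local gradient of the current `u` shifted by the scaled multiplier, and `tau` the reciprocal of the penalty parameter; entries whose magnitude falls below `tau` are set exactly to zero, which is what produces sharp class interfaces.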

Algorithm 1 EC-PFM
$u^0$ given by random initialization, $\lambda_1^0 = 0$
while energy not converged do
    $u^{k+1}$ given by (20)
    $v^{k+1}$ given by (21)
    $\lambda_1^{k+1}$ given by (17b)
end while

3.2. Equality constraint with augmented Lagrange method (EC-ALM). In order to circumvent the strong dependence of the optimization problem on the penalty parameter $\mu_0$ in the previous sub-section, we employ the ALM to enforce the equality constraint (15). The modified energy functional is
$$\min_{u(x) \in [0, 1]} \sum_{x \in V} |\nabla_w u(x)| + \lambda_0 \Bigl( \sum_{x \in V} u(x) - \alpha_1 \Bigr) + \frac{\mu_0}{2} \Bigl( \sum_{x \in V} u(x) - \alpha_1 \Bigr)^2, \tag{23}$$
where $\mu_0$ is a penalty parameter and $\lambda_0$ is a Lagrange multiplier. To solve it easily, we design its alternating optimization scheme by introducing a non-local auxiliary variable $v = \nabla_w u$:
$$(u^{k+1}, v^{k+1}) = \arg\min_{u, v} \sum_{x \in V} |v(x)| + \lambda_0^k \Bigl( \sum_{x \in V} u(x) - \alpha_1 \Bigr) + \frac{\mu_0}{2} \Bigl( \sum_{x \in V} u(x) - \alpha_1 \Bigr)^2 + \langle \lambda_1^k, v - \nabla_w u \rangle + \frac{\mu_1}{2} \| v - \nabla_w u \|^2, \tag{24a}$$
$$\lambda_1^{k+1} = \lambda_1^k + \mu_1 \bigl( v^{k+1} - \nabla_w u^{k+1} \bigr), \tag{24b}$$
$$\lambda_0^{k+1} = \lambda_0^k + \mu_0 \Bigl( \sum_{x \in V} u^{k+1}(x) - \alpha_1 \Bigr), \tag{24c}$$
where $\lambda_1$ is a Lagrange multiplier and $\mu_1$ is another penalty parameter. The sub-problems of optimization on $u$ and $v$ have the same forms as (18a) and (18b) respectively. Using a similar process as in the previous sub-section, we get the Euler-Lagrange equation on $u$,
$$\lambda_0^k + \mu_0 \Bigl( \sum_{x \in V} u(x) - \alpha_1 \Bigr) + \operatorname{div}_w \bigl( \lambda_1^k + \mu_1 (v^k - \nabla_w u) \bigr) = 0, \tag{25}$$
with its discrete iterative formulation (gradient descent with time step $\tau$)
$$u^{k+1}(x) = u^k(x) - \tau \Bigl[ \lambda_0^k + \mu_0 \Bigl( \sum_{y \in V} u^k(y) - \alpha_1 \Bigr) + \operatorname{div}_w \bigl( \lambda_1^k + \mu_1 (v^k - \nabla_w u^k) \bigr)(x) \Bigr]; \tag{26}$$
$v^{k+1}$ is the same as in (21). The algorithm of EC-ALM for solving (24) is stated as follows.
Algorithm 2 EC-ALM
$u^0$ given by random initialization, $\lambda_0^0, \lambda_1^0 \leftarrow 0$
while energy not converged do
    $u^{k+1}$ given by (26)
    $v^{k+1}$ given by (21)
    $\lambda_0^{k+1}$ given by (24c)
    $\lambda_1^{k+1}$ given by (24b)
end while

3.3. Single direction inequality constraint (SDIC). For the single direction inequality constraint $\sum_{x \in V} u(x) \leq \alpha$, using the KKT method for optimization problems with inequality constraints, (14) can be transformed into the following equivalent form:
$$\min_{u(x) \in [0, 1]} \max_{\lambda_0 \geq 0} \sum_{x \in V} |\nabla_w u(x)| + \lambda_0 \Bigl( \sum_{x \in V} u(x) - \alpha \Bigr). \tag{27}$$
After introducing a non-local auxiliary variable together with Lagrange multipliers and penalty parameters, we get its alternating direction optimization formulation
$$(u^{k+1}, v^{k+1}) = \arg\min_{u, v} \sum_{x \in V} |v(x)| + \lambda_0^k \Bigl( \sum_{x \in V} u(x) - \alpha \Bigr) + \frac{\mu_0}{2} \Bigl( \sum_{x \in V} u(x) - \alpha \Bigr)^2 + \langle \lambda_1^k, v - \nabla_w u \rangle + \frac{\mu_1}{2} \| v - \nabla_w u \|^2, \tag{28a}$$
$$\lambda_0^{k+1} = \max\Bigl( 0,\; \lambda_0^k + \mu_0 \Bigl( \sum_{x \in V} u^{k+1}(x) - \alpha \Bigr) \Bigr), \tag{28b}$$
$$\lambda_1^{k+1} = \lambda_1^k + \mu_1 \bigl( v^{k+1} - \nabla_w u^{k+1} \bigr). \tag{28c}$$
Solving (28a) via the same procedure as in the previous sub-sections, we get the Euler-Lagrange equation on $u$ and its discrete iterative formulation; $v^{k+1}$ is the same as in (21). The corresponding algorithm is as follows.

Algorithm 3 SDIC
$u^0$ given by random initialization, $\lambda_0^0, \lambda_1^0 \leftarrow 0$
while energy not converged do
    $u^{k+1}$ given by the discrete iterative formulation of (28a)
    $v^{k+1}$ given by (21)
    $\lambda_0^{k+1}$ given by (28b)
    $\lambda_1^{k+1}$ given by (28c)
end while
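The distinctive ingredient of the inequality-constrained case is that the KKT multiplier must stay non-negative, so its ALM ascent step is projected onto $[0, \infty)$. A minimal sketch (the helper name is ours):

```python
import numpy as np

def sdic_multiplier_step(u, alpha, lam, mu):
    """Projected ALM update for the constraint sum(u) <= alpha: the KKT
    multiplier must remain non-negative, so the ascent step is clipped at 0."""
    return max(0.0, lam + mu * (u.sum() - alpha))

# Slack constraint (sum(u) < alpha): the multiplier is driven towards 0.
print(sdic_multiplier_step(np.array([1.0, 0.0, 1.0]), 3.0, 0.2, 1.0))
# Violated constraint (sum(u) > alpha): the multiplier grows.
print(sdic_multiplier_step(np.array([1.0, 1.0, 1.0, 1.0]), 3.0, 0.2, 1.0))
```

When the constraint is inactive at the optimum, the projection returns the multiplier to zero and the method behaves like the unconstrained problem; when it is active, the update coincides with the equality-constrained ascent step.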
3.4. Double direction inequality constraint (DDIC). For the double direction constraint $\alpha^l \leq \sum_{x \in V} u(x) \leq \alpha^u$, using the same procedure as in the last sub-section, we get the projected updates of the two multipliers $\lambda^{k+1}$ associated with the upper and lower bounds.

SHIXIU ZHENG, ZHILEI XU, HUAN YANG, JINTAO SONG AND ZHENKUAN PAN
Solving (32a) via the variational method, we get the Euler-Lagrange equation that $u$ fulfils, together with its discrete iterative formulation. The algorithm is stated as Algorithm 4.

3.5. Ratio Cut (RC). To solve the Ratio Cut model easily, we design its alternating optimization scheme by introducing a non-local auxiliary variable $v = \nabla_w u$. Using the variational method, we get the updates of $u$ and $v$ in each loop, where $f$ is given by (40).

3.6. Cheeger Cut (CC). It has been shown in [21] that (40) is equivalent to a minmax problem, whose alternating optimization scheme can be designed as (44). Applying the alternating direction optimization method to (44a), we get the Euler-Lagrange equation on $u$ and its discrete iterative formulation; $v^{k+1}$ is the same as in (21). For $w^{k+1}$, it can be written analytically in closed form, with
$$f_w = u^{k+1} - m(u) - \frac{\lambda_0}{\mu_0},$$
where $m(u)$ denotes the median of $u$. To avoid the trivial solution described in the introduction, we apply a renormalization step on $w$. This algorithm is summarized as Algorithm 6.
Its ALM scheme is given by (50a)-(50b). From (50a), we get the Euler-Lagrange equation on $u$ (51) and its discrete iterative formulation (52). The algorithm is described as follows.

Algorithm 7 Normalized Cut
$u^0$ given by random initialization, $\lambda_1^0 \leftarrow 0$
while energy not converged do
    $u^{k+1}$ given by (52)
    $v^{k+1}$ given by (53)
    $\lambda_1^{k+1}$ given by (50b)
end while

4. Experimental results. In this section, we use three different datasets to analyze the performance of the seven algorithms of Section 3. Each algorithm iterates until
$$\frac{|E^{k+1} - E^k|}{|E^k|} < \eta,$$
where $E^{k+1}$ is the energy functional at the $(k+1)$-th step, $E^k$ is its counterpart at the $k$-th step, and $\eta$ is a small positive number controlling convergence; when this condition is fulfilled, the program terminates. The error rate is defined as the ratio between the number of incorrectly classified vertices and the number of all vertices on the graph. All experiments are performed on a 3.3 GHz Intel Core i5 quad-core computer. We use three datasets: two-moons, handwritten digits 3&8 and handwritten digits 4&9. The handwritten digits come from the MNIST database of the Courant Institute of New York University, which consists of 70000 images of handwritten digits (0-9). Two-moons is a synthetic dataset used to test the different methods. We construct the undirected graph using 9 nearest neighbors, with the scaling based on the 9th closest neighbor, i.e., the k-nearest-neighbor method. For the fidelity term, we choose 50 points per class in two-moons and 300 points per class in handwritten digits 3&8 and handwritten digits 4&9; the corresponding ratios are 5%, 4.3% and 4.35% respectively. The penalty parameters used in the different algorithms are listed in Table 0, where $n$ is the number of data points and $k$ is the number of classes. In Fig. 1 (a) and (c), we use red dots to denote the pre-labeled points, while in Fig. 1 (b) we use two different colors to label the randomly initialized data. This presentation gives a more intuitive understanding of the classification on an undirected graph. In Fig. 2, we give the reference classifications of the three datasets for comparison with the experimental results.
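One plausible reading of the graph construction above is local scaling in the style of Zelnik-Manor and Perona: each point's bandwidth is its distance to the 9th closest neighbor. The following dense-matrix sketch (function name and implementation are ours, not the paper's code) builds such a symmetric k-NN weight matrix:

```python
import numpy as np

def knn_graph(X, k=9):
    """Symmetric k-NN weight matrix with local scaling:
    w(x, y) = exp(-d(x, y)^2 / (sigma_x * sigma_y)) on k-NN edges,
    where sigma_x is the distance from x to its k-th nearest neighbor."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    order = np.argsort(D, axis=1)              # column 0 is the point itself
    sigma = D[np.arange(n), order[:, k]]       # k-th neighbor distance
    W = np.exp(-D**2 / np.outer(sigma, sigma))
    mask = np.zeros_like(W, dtype=bool)
    rows = np.repeat(np.arange(n), k)
    mask[rows, order[:, 1:k + 1].ravel()] = True   # keep only k-NN edges
    W = np.where(mask | mask.T, W, 0.0)            # symmetrize the edge set
    np.fill_diagonal(W, 0.0)
    return W
```

Usage: `W = knn_graph(X, k=9)` for a point cloud `X` of shape `(n, d)`; the resulting `W` feeds directly into the non-local operators of Section 2. For the dataset sizes used here, a sparse k-NN search would replace the dense distance matrix in practice.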
In Fig. 3, we give the initialization picture and the two-moons classification results using the different methods. It is difficult to see the differences in this figure, because they exist mainly near the interfaces between classes, so we list the error rates and ranks in Table 1 to compare the accuracies. The most accurate method is EC-ALM, with error rate 1.25%. For the equality constraint problems, EC-ALM is better than EC-PFM; for the inequality problems, SDIC is better than DDIC; for the problems without explicit constraints, CC is better than RC and NC. In Fig. 4 and Table 2, we present the classification results on the handwritten digits 3&8 dataset using the different algorithms, together with their accuracies. The most accurate method is again EC-ALM, with error rate 1.1385%, and for the equality constraint problems EC-ALM is again better than EC-PFM. In Fig. 5 and Table 3, we present the classification results on the handwritten digits 4&9 dataset using the different algorithms, together with their accuracies. The most accurate method is also EC-ALM, with error rate 1.2335%. For the equality constraint problems, EC-ALM is again better than EC-PFM; for the inequality problems, SDIC is better than DDIC; for the problems without explicit constraints, CC is better than RC and NC.
In order to avoid dependence on the initialization, each of the previous experiments was conducted ten times with different random initial conditions; the error rates listed in the tables are averages over the ten runs. In practice, we have also classified other sub-datasets of the handwritten digits, such as 0&1, 7&8 and 1&9; the results show the same ordering of error rates, so we do not report them here.
Obviously, the fidelity set size is an important factor in the error rates of the different methods. To show the dependence of the error rates on the fidelity set size, we set up two experiments, two-moons classification with 2000 points and 3&8 handwritten digit classification with 13966 digits, using the EC-ALM method; some results are listed in Table 4. The error rate decreases gradually as the fidelity set grows, and the effect on the error rate is small when the ratio of fidelity points to total points is less than 5%.
For the problems with double direction constraints, the DDIC method introduces a small positive range parameter, which may affect the error rates. Numerous experiments show that this effect is very small, although the smaller the range parameter, the smaller the error rate. Table 5 lists some results of an experiment on handwritten digits 3&8 classification.
5. Concluding remarks. The focus of this paper is to compare experimentally the accuracies of several balanced data classification methods with explicit equality constraints, with inequality constraints, and without explicit constraints. For a fair comparison, we reformulate their models under the discrete variational framework and design the corresponding algorithms: EC-PFM (equality constraint with penalty function method), EC-ALM (equality constraint with augmented Lagrangian method), SDIC (single direction inequality constraint), DDIC (double direction inequality constraint), RC (Ratio Cut), CC (Cheeger Cut) and NC (Normalized Cut). Numerous experiments show that it is hard to draw a definitive global conclusion about their numerical accuracies, because their performance can depend on the dataset. But we can draw some intuitive conclusions for further investigation. For the problems with explicit equality constraints, EC-ALM is better than EC-PFM; additionally, EC-ALM circumvents the problem of penalty parameter selection. These methods work when the exact numbers of points of the sub-datasets are known. For the problems with explicit inequality constraints, SDIC shows better accuracy than DDIC, but DDIC is preferable when only an approximate range of the number of points in a sub-dataset is known. The CC method is better than the RC and NC methods in all experiments. Although these methods do not need explicit constraints, their algorithms are more involved than those with explicit constraints.
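The advantage of the multiplier update over a pure penalty can be seen on a toy problem. The sketch below (purely illustrative; a quadratic data term stands in for the NL-TV energy so the u-subproblem has a closed form, and all names are ours) shows ALM driving $\sum_x u(x)$ to $\alpha$ exactly with a moderate, fixed $\mu_0$, where PFM would need $\mu_0 \to \infty$:

```python
import numpy as np

def alm_equality(g, alpha, mu0=1.0, iters=50):
    """Toy ALM for: min_u 0.5 * ||u - g||^2  s.t.  sum(u) = alpha.

    Mirrors the EC-ALM multiplier ascent lam <- lam + mu0 * (sum(u) - alpha);
    the quadratic data term replaces the NL-TV energy for illustration only.
    """
    n = len(g)
    lam = 0.0
    u = g.copy()
    for _ in range(iters):
        # u-subproblem: (u - g) + (lam + mu0 * (sum(u) - alpha)) * 1 = 0,
        # which gives u = g - c * 1 with c = (lam + mu0*(sum(g)-alpha)) / (1 + mu0*n).
        c = (lam + mu0 * (g.sum() - alpha)) / (1.0 + mu0 * n)
        u = g - c
        lam = lam + mu0 * (u.sum() - alpha)  # multiplier ascent step
    return u

g = np.array([0.2, 0.9, 0.4])
u = alm_equality(g, alpha=1.0)
print(round(u.sum(), 6))  # 1.0: the constraint holds exactly at convergence
```

The exact solution of this toy problem is $u = g - (\sum g - \alpha)/n$, and the iteration converges to it geometrically for any fixed $\mu_0 > 0$, which is precisely the penalty-parameter robustness noted above.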
It is worth mentioning that in all experiments the EC-ALM method is the most accurate and the NC method is the least accurate. NC is a variant of RC in which the vertex numbers of the sub-datasets are replaced by their degrees; since our comparisons are based on vertex numbers, RC's accuracy is better than NC's.
Although these conclusions are not derived theoretically, they are important for extending the investigated methods to multi-class data classification.