Approximate greatest descent in neural network optimization

Numerical optimization is required in artificial neural networks to update the weights iteratively for learning capability. In this paper, we propose the Approximate Greatest Descent (AGD) algorithm to optimize neural network weights in a long-term backpropagation manner. The modification and development of AGD into the stochastic diagonal AGD (SDAGD) algorithm could improve the learning ability and structural simplicity of deep learning neural networks. It is derived from the operation of a multi-stage decision control system and consists of two phases: (1) when the local search region does not contain the minimum point, the iterate is defined at the boundary of the local search region; (2) when the local search region contains the minimum point, the Newton method is approximated for faster convergence. The integration of SDAGD into a multilayer perceptron (MLP) network is investigated with the goal of improving learning ability and structural simplicity. Simulation results show that a two-layer MLP trained with SDAGD achieves a misclassification rate of 9.4% on a smaller Modified National Institute of Standards and Technology (MNIST) dataset. MNIST is a database of handwritten digit images suitable for algorithm prototyping in artificial neural networks.


1. Introduction.
Deep structured learning has emerged as a new area of Artificial Neural Network (ANN) research with a wide range of applications in signal and information processing. Two general aspects are often discussed in high-level descriptions of deep learning: (a) structural models consisting of multiple layers or stages of nonlinear information processing, and (b) optimization techniques for supervised or unsupervised learning of feature representations at sequentially deeper and more abstract layers. However, nonlinear high-level abstraction complicates the learning of good representations in the neural network and risks information loss and confusion. In addition, it aggravates training problems such as ill-conditioned weight initialization and trapping in local minima of the objective function.
The learning process of an ANN is achieved through neural network optimization. In general, optimization in mathematical terms is the minimization or maximization of an objective function. The current optimization literature can be classified into two major approaches: (a) line search and (b) trust region. The line search approach computes the step direction from the gradient information of the objective function, followed by a heuristic choice of step length to minimize the error. The trust region approach, on the other hand, computes the step direction via a quadratic model that approximates the objective function in a neighborhood of the search region. Well-known optimization techniques include gradient descent (GD), Newton, Quasi-Newton (QN), Gauss-Newton (GN) and Levenberg-Marquardt (LM) methods.
GD [8] uses the gradient (first-order) information of the objective function to construct iteration steps. Its computational cost is much lower than that of second-order derivative methods. However, GD often suffers from the local minima problem because the gradient information does not fully capture the actual geometry of the error surface. Second-order derivative methods can therefore achieve better performance than first-order methods, as the additional curvature information improves learning capability. One example is the Newton method, which uses the full Hessian [12] together with the gradient information to compute the weight updates of the network. As a result, the Newton method converges quickly when the objective function is quadratic and the initial guess is near the solution. Nevertheless, due to the computation of the inverse Hessian, the Newton method is not recommended for functions with many variables. It also suffers from the singular Hessian problem when dealing with an indefinite Hessian. QN [14] arises from the Newton method and avoids this computational burden by using an estimated Hessian. The most successful Hessian estimation scheme is the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method [8], but QN suffers from a storage problem due to the large set of variables that must be stored for the estimation. GN [8] is another variant of the Newton method that reduces the heavy computation by using a truncated Hessian. Although it avoids the high storage cost, it still suffers from the indefinite Hessian problem. LM [1] evolves from GN and resolves the indefinite Hessian problem by introducing a damping parameter: a relatively small value is added to the truncated Hessian to avoid running into a singular Hessian as the learning iterates. These problems become more significant when the methods are applied to ANN parameter optimization.
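The contrast between the first- and second-order methods above can be sketched on a simple quadratic objective, where the Newton method reaches the minimum in a single step while GD needs many small steps. The following is an illustrative example (not from the paper); the matrix A and vector b are arbitrary choices standing in for a positive definite Hessian and linear term.

```python
import numpy as np

# Illustrative quadratic f(w) = 0.5 * w^T A w - b^T w with minimum w* = A^{-1} b.
A = np.array([[3.0, 0.5], [0.5, 1.0]])   # symmetric positive definite "Hessian"
b = np.array([1.0, -2.0])

def grad(w):
    return A @ w - b

# Gradient descent: first-order information only, many small steps.
w_gd = np.zeros(2)
for _ in range(200):
    w_gd -= 0.1 * grad(w_gd)

# Newton's method: one full-Hessian step suffices on a quadratic.
w0 = np.zeros(2)
w_newton = w0 - np.linalg.solve(A, grad(w0))

w_star = np.linalg.solve(A, b)           # exact minimizer for comparison
```

On non-quadratic, high-dimensional objectives the picture is less clean, which is precisely why the estimated, truncated and damped Hessians (QN, GN, LM) discussed above exist.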
An efficient algorithm that avoids ill-conditioned weights and local optima while retaining an optimal trajectory could help boost the performance of ANNs. Although plenty of state-of-the-art learning algorithms are available for ANN training, a strategy that finds optimal ANN weights across all engineering applications has yet to be discovered.
ANNs are applied across a wide range of industries, from data analytics to trend estimation and autonomous decision making. The performance of ANNs has drawn huge attention from researchers due to their generalization across applications and their quick self-learning capability from real-life experience [9]. In the ANN learning scheme, the weights are learned from a set of labeled training samples with input patterns and targeted outputs using numerical optimization in the backpropagation (BP) manner. During BP, the network errors between the network output and the targeted values are iteratively computed and backpropagated across the network to update the weights. Standard BP uses the GD approach to search for an optimum solution. However, an ANN usually has a non-convex objective function with multiple local minima [5]. The search for the optimal solution may therefore get stuck in a local minimum and lead to poor performance during network training. In addition, standard BP is complicated by the tedious selection of an ideal step length and of network controlling parameters [8,3]. Some variants of GD are conjugate GD (CGD) [13] and stochastic GD (SGD) [4]. CGD avoids linear algebra in the iteration construction to reduce memory cost; it is suitable for functions with many variables, but its convergence rate is only comparable to other techniques. SGD introduces stochasticity to BP by training on samples with partial and noisy gradients, and is known to improve the convergence rate and network generalizability. Among the second-order derivative methods is the stochastic diagonal LM (SDLM). SDLM emerges from traditional LM with three approximations: (a) dropping the off-diagonal terms of the Hessian matrices, (b) utilizing the truncated Hessian of GN, and (c) estimating the Hessian from a subset of the entire training set [10,2].
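The stochastic, per-sample update that distinguishes SGD from batch GD can be sketched in a few lines. This is an illustrative toy problem (a noiseless linear model), not the paper's experimental setup; all names and constants here are arbitrary.

```python
import numpy as np

# Minimal SGD sketch: fit w in y = X w by sampling one example per step.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true

w = np.zeros(3)
eta = 0.05                       # step length (learning rate)
for step in range(4000):
    i = rng.integers(len(X))     # stochastic: one random sample per update
    err = X[i] @ w - y[i]        # partial, noisy error signal
    w -= eta * err * X[i]        # gradient of 0.5*err^2 w.r.t. w
```

Each update uses only one sample's gradient, so the steps are noisy but cheap; over many iterations the iterate still drifts toward the least-squares solution.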
The Hessian approximation aims to reduce computational cost while retaining the stochastic approach of SGD within a second-order derivative method. Nonetheless, the choice of damping parameter has little theoretical justification [6].
Approximate Greatest Descent (AGD) [6,7] is proposed in this paper as the learning algorithm for ANNs. AGD draws on optimal control theory by segregating the learning trajectory into two phases. To the best of our knowledge, the implementation of AGD for optimizing ANN parameters has not appeared in the literature; this work pioneers its application in the field of ANNs. In this paper, a modified version of AGD with stochastic diagonal approximation is proposed to iteratively compute errors and update the weights in the backpropagation manner. It combines the efficient learning of the AGD algorithm with stochastic diagonal approximations to achieve more optimal ANN training while keeping the training time low. We investigate the feasibility of stochastic diagonal AGD (SDAGD) in a multilayer perceptron (MLP) network in terms of misclassification rate and error rate. The aim of applying SDAGD is to improve the learning capability of the neural network while maintaining its network structure.
This paper is organized into five sections. Section 2 briefly presents the theory of neural networks. Section 3 shows the derivation of SDAGD within the backpropagation algorithm. Section 4 explains the experimental setup, followed by a discussion of the SDAGD implementation with MLP. The last section contains the conclusion and future developments.
2. Backpropagation in Multilayer Perceptron. An MLP [11] with the BP learning scheme is represented in Figure 1. The MLP performs a nonlinear input–output mapping M(X, W), and the network error is the difference between the targeted output D = {D_1, ..., D_n} and the network output M(X, W). In practice, the objective function in incremental neural network training is formulated as the squared error,

E(W) = (1/2) ||D − M(X, W)||².

The role of the objective function is to find the optimum value of W that minimizes E(W); this is called ANN optimization. For simplicity, the output of each layer is written as,

x_i = F(y_i), with y_i = Σ_j w_ij x_j,

where y_i is the weighted sum of inputs to unit i and F(·) is the squashing function. By applying the chain rule, the classical BP equations are obtained as,

∂E/∂w_ij = (∂E/∂y_i) x_j, with ∂E/∂y_i = F′(y_i) Σ_r w_ri (∂E/∂y_r),

where F′(·) is the partial derivative of the squashing function F(·). The weight updates of the network can be computed iteratively using BP to find an optimal solution that minimizes the objective function E(W).
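The forward pass and classical BP equations above can be sketched for a two-layer MLP with a sigmoid squashing function. This is an illustrative sketch under that assumption (the paper does not fix a particular F); all sizes and names are arbitrary.

```python
import numpy as np

# One BP step for a two-layer MLP: y = weighted sum of inputs, x = F(y).
def F(y):                        # sigmoid squashing function (assumed here)
    return 1.0 / (1.0 + np.exp(-y))

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 5))      # batch of 4 input patterns, 5 features
D = rng.random(size=(4, 2))      # targeted outputs
W1 = rng.normal(scale=0.1, size=(5, 8))
W2 = rng.normal(scale=0.1, size=(8, 2))

# Forward pass.
y1 = X @ W1;  x1 = F(y1)
y2 = x1 @ W2; x2 = F(y2)
E = 0.5 * np.sum((D - x2) ** 2)  # squared-error objective E(W)

# Backward pass (chain rule): d = dE/dy at each layer; F'(y) = x(1-x).
d2 = (x2 - D) * x2 * (1 - x2)
d1 = (d2 @ W2.T) * x1 * (1 - x1)
gW2 = x1.T @ d2                  # dE/dW2 = (dE/dy) * x of previous layer
gW1 = X.T @ d1                   # dE/dW1

eta = 0.1
W1 -= eta * gW1; W2 -= eta * gW2 # gradient-descent weight update
```

A single update along the negative gradient should reduce E for a sufficiently small step length, which is exactly the iterative minimization described above.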
W_(k+1) = W_k − η (∂E/∂W_k),

where W_(k+1) is the weight adjusted from W_k, k represents the iteration number and η is the learning rate. The iteration steps of BP can be generalized into the standard numerical optimization update rule W_(k+1) = W_k + α_k p_k, in which the learning rate η is equivalent to the step length α_k and −∂E/∂W represents the search direction p_k. Numerical optimization techniques are therefore the key to improving the search for an optimal solution in a neural network. The next section shows the derivation of second-order optimization in BP using the SDAGD algorithm [16].

3. Backpropagation using SDAGD. In order to adapt AGD in a stochastic diagonal manner, the Hessian is used for the iteration computation instead of the gradient information alone, together with three approximations. Letting g(W) = ∂E/∂W and H(W) = ∂²E/∂W², the exact Hessian for an MLP consists of the entries ∂²E/(∂w_ij ∂w_rs) [16], where w_ij and w_rs are the connection weights from unit j to unit i and from unit s to unit r respectively, and i, j, r and s index the nodes of the ANN structure. The first approximation of SDAGD drops the off-diagonal terms of the Hessian, as there are no intra-connections within the same layer. Hence, the diagonal Hessian entries are written as,

∂²E/∂w_ij² = (∂²E/∂y_i²) x_j².

The local second-order derivative with respect to the sum of inputs of the downstream unit can be computed by the product rule and chain rule; expanding ∂²E/∂y_i² yields,

∂²E/∂y_i² = F′(y_i)² Σ_r w_ri² (∂²E/∂y_r²) + F″(y_i) (∂E/∂x_i).

For the second approximation, the truncated Hessian is used instead of the full Hessian to save computation cost; it also removes the negative terms in the second-order derivatives. Hence, the truncated Hessian is written as,

∂²E/∂y_i² ≈ F′(y_i)² Σ_r w_ri² (∂²E/∂y_r²).

To further reduce the computation time of the Hessian estimation, the third approximation uses only a smaller subset of the training samples for the estimate, averaging the diagonal Hessian over the subset, where S is the number of training samples in the subset. With the intention of adapting AGD as the learning algorithm, (6) is modified to cater for the derivation of AGD.
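The three approximations above can be sketched for a single sigmoid layer trained with squared error, for which the truncated curvature at the output reduces to F′(y)². This is an illustrative sketch under those assumptions, not the paper's full multi-layer recursion; all sizes and names are arbitrary.

```python
import numpy as np

# Stochastic diagonal truncated Hessian for one sigmoid layer with
# squared error: (a) keep only the diagonal entries d2E/dw_ij^2,
# (b) truncate to the positive part F'(y_i)^2 * x_j^2,
# (c) average over a small subset of S training samples.
def F(y):
    return 1.0 / (1.0 + np.exp(-y))

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 6))            # full training set (illustrative)
W = rng.normal(scale=0.1, size=(6, 3))

S = 16                                   # subset size for Hessian estimation
subset = X[rng.choice(len(X), size=S, replace=False)]

y = subset @ W
Fp = F(y) * (1 - F(y))                   # F'(y) for the sigmoid
# d2E/dw_ij^2 ≈ mean over subset of F'(y_i)^2 * x_j^2 (one entry per weight).
H_diag = (subset ** 2).T @ (Fp ** 2) / S
```

Because every term is a square, the truncated diagonal estimate is nonnegative by construction, which is the point of dropping the F″ term.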
AGD arises from the concept of dynamical control analysis over a long-term optimal control trajectory. With regard to the long-term optimal trajectory, the optimization problem is segregated into two phases to search for the minimum point W*. Phase 1 of the AGD algorithm incorporates a trust-region-like method, seeking minimum points along a sequence of search-region boundaries, while phase 2 approximates the Newton method. The AGD iterations for a two-dimensional problem are illustrated in Figure 2.
Let the local search regions be Z_1, Z_2, ..., Z_(N−1); in each of these, the minimum point lies at the boundary of the search region. In the last search region Z_N, where N denotes the final search boundary, the minimum point occurs either in the interior or on the boundary of Z_N. Thus, the iteration steps of AGD can be defined on a circular search region, as shown in Figure 3. The circular search region defines the search boundary used to connect the iteration steps from W_k to W_(k+1). The step length and step direction are computed similarly to (6), but via the AGD iteration. Based on Figure 3, the iteration steps are constrained by the radius of the circular boundary as,

u_k^T u_k = R_k²,

where u_k = W_(k+1) − W_k. Thus, E(W_(k+1)) is minimized subject to (14) to define the step direction p_k,
where R_k is the radius of the search region at iteration k. Let α_k be the step length and p_k be the step direction, so that u_k = α_k p_k. To seek the minimum point within the local search region, the Lagrangian expression L(W_k) is written as,

L(W_k) = E(W_k + u_k) + λ_k (u_k^T u_k − R_k²),

where u_k = α_k p_k is the control vector and λ_k is the Lagrange multiplier. Applying the first-order derivative of the Lagrangian (15) with respect to u_k yields,

∂L(W_k)/∂u_k = g(W_k + u_k) + 2λ_k u_k.

According to optimal control theory, setting ∂L(W_k)/∂u_k = 0 and substituting u_k = α_k p_k into (16) gives,

g(W_(k+1)) + 2λ_k α_k p_k = 0.

To simplify the equation, let 2λ_k α_k = 1. Since the next iteration step is still unknown, and assuming the iteration steps are relatively small, the gradient at the next local search region is approximated by the gradient at the current one. Thus, the step direction is approximated as,

p_k = −g(W_(k+1)) ≈ −g(W_k).

Applying a Taylor series expansion to E(W_k + u_k) yields,

E(W_k + u_k) ≈ E(W_k) + g(W_k)^T u_k + (1/2) u_k^T H(W_k) u_k.

Letting u_k = α_k p_k with u_k^T u_k = R_k², as required by the constraint in (14), the relative step length of AGD follows as,

α_k = R_k / ||g(W_k)||.

Recalling p_k = −g(W_k) from (18) and letting µ_k = α_k^(−1) yields,

µ_k = ||g(W_k)|| / R_k.

The effect of µ is as follows: (a) a large µ encourages the linear convergence of the AGD iteration, and (b) a small µ encourages the quadratic convergence of the Newton method, which requires only a single iteration to converge on a quadratic function [6]. Unlike the LM method, AGD does not introduce any ad-hoc parameters, and the Hessian is not restricted to be positive definite. In the best-case scenario, the iterations generate points on the constructed boundaries using the trust-region-like approach and subsequently apply the Newton method in the final local search region to achieve quadratic convergence. Substituting the relative step length into (6), the SDAGD update can be written as,

W_(k+1) = W_k − κ [H(W_k) + µ_k I]^(−1) g(W_k),

where κ is the global learning rate and H(W_k) is the stochastic diagonal truncated Hessian. The proposed SDAGD is benchmarked against Stochastic Gradient Descent (SGD) [4] and Stochastic Diagonal Levenberg-Marquardt (SDLM) [10].
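The two-phase behavior of the relative step length can be sketched on a toy diagonal quadratic: far from the minimum, µ_k = ||g||/R_k is large and the update takes cautious boundary-limited steps; near the minimum, µ_k shrinks and the update approaches a diagonal Newton step. This is an illustrative sketch; the function, its diagonal Hessian h, and all constants are hypothetical.

```python
import numpy as np

# One SDAGD-style update with relative step length mu_k = ||g(W_k)|| / R_k.
def sdagd_step(W, g, H_diag, R=1.0, kappa=0.5):
    mu = np.linalg.norm(g) / R           # relative step length (damping)
    return W - kappa * g / (H_diag + mu) # element-wise diagonal "inverse"

# Hypothetical diagonal quadratic E(W) = 0.5 * sum(h * W^2).
h = np.array([4.0, 1.0, 0.25])           # exact diagonal Hessian
W = np.array([2.0, -3.0, 1.0])
for _ in range(500):
    g = h * W                            # gradient of the quadratic
    W = sdagd_step(W, g, h)
```

Note that no ad-hoc damping constant is tuned: µ_k is produced by the gradient norm itself and decays to zero as the iterate approaches the minimum.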
4. Experiments and Discussion. The training performance is measured by the mean squared error (MSE) between the network output and the target, where d_(i,j) is the targeted output and x_(r,s) is the network output.
The classification performance is measured by the misclassification rate (MCR), computed from the classification accuracy (TP + TN)/TS, where TP represents the true positives, TN the true negatives and TS the total number of samples. In Table 1, the second-order algorithms (SDLM and SDAGD) outperform the first-order algorithm (SGD) in terms of MSE at 50 epochs. The improvement of the second-order algorithms is due to the curvature information adapted in the construction of the iteration steps, which captures more information about the error surface. In addition, SDAGD achieves a faster roll-off rate and saturates at an earlier epoch compared with SDLM. In terms of MCR, SDAGD achieves the lowest training and testing misclassification rates among all the optimization techniques, at 6.46% and 9.4% respectively. These outcomes demonstrate the robustness of the two-phase optimization technique, which adaptively switches strategies through the control of the relative step length.
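The MCR reported above can be computed directly from the counts, assuming the standard definition that MCR is the complement of the accuracy (TP + TN)/TS. The counts below are hypothetical, not the paper's results.

```python
# Misclassification rate (MCR) in percent, assuming MCR = 1 - accuracy:
# TP and TN are correctly classified positives/negatives, TS is the
# total number of samples.
def misclassification_rate(tp, tn, ts):
    return 100.0 * (1.0 - (tp + tn) / ts)

mcr = misclassification_rate(tp=70, tn=20, ts=100)  # hypothetical counts
```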

5. Conclusion. This paper presented the integration of the AGD algorithm into ANN optimization, together with its numerical derivation. The simulation results have empirically shown that the proposed algorithm achieves a 9.4% testing misclassification rate, outperforming the other optimization techniques evaluated. This is due to the two-phase design, which solves the optimization problem in the neural network via a long-term optimal trajectory approach. The relative step length used in the proposed algorithm enables adaptive switching from the AGD iteration to the approximate Newton method. This adaptive optimization technique suggests a better-structured training scheme for ANNs and could potentially improve the learning capability and structural simplicity of deeper neural networks. In the future, the proposed algorithm will be implemented in deeper neural networks to probe its performance and robustness on large datasets.