A class of descent four-term extensions of the Dai–Liao conjugate gradient method based on the scaled memoryless BFGS update

Hybridizing the three-term conjugate gradient method proposed by Zhang et al. and the nonlinear conjugate gradient method proposed by Dai and Liao based on the scaled memoryless BFGS update, a one-parameter class of four-term conjugate gradient methods is proposed. It is shown that the suggested class of conjugate gradient methods possesses the sufficient descent property, without any convexity assumption on the objective function. A brief global convergence analysis is given for uniformly convex objective functions. Results of numerical comparisons are reported; they demonstrate the efficiency of a method of the proposed class in the sense of the Dolan–Moré performance profile.


1.
Introduction. Consider the following unconstrained minimization problem:

$$\min_{x\in\mathbb{R}^n} f(x), \tag{1}$$

in which $f:\mathbb{R}^n\to\mathbb{R}$ is a smooth nonlinear function whose gradient is available in analytic form. Among the most useful tools for solving large-scale instances of the problem (1) are the limited-memory BFGS (Broyden–Fletcher–Goldfarb–Shanno) methods [23] and the conjugate gradient (CG) methods [19], because the amount of memory storage required by these methods is low. This study is devoted to a class of four-term CG methods constructed based on the scaled memoryless BFGS update [29]. In addition to the low memory requirement, CG methods possess the attractive feature of a simple iterative formula, that is,

$$x_0\in\mathbb{R}^n,\qquad x_{k+1}=x_k+\alpha_k d_k,\quad \forall k\ge 0, \tag{2}$$

where $\alpha_k$ is a steplength to be computed by a line search procedure along the search direction $d_k$ defined by

$$d_0=-g_0,\qquad d_{k+1}=-g_{k+1}+\beta_k d_k,\quad \forall k\ge 0, \tag{3}$$

in which $g_k=\nabla f(x_k)$ and $\beta_k$ is a scalar called the CG (update) parameter. The steplength $\alpha_k$ is usually determined to fulfill the well-known Wolfe conditions [29], i.e.,

$$f(x_k+\alpha_k d_k)-f(x_k)\le \delta\alpha_k g_k^T d_k, \tag{4}$$

$$\nabla f(x_k+\alpha_k d_k)^T d_k\ge \sigma g_k^T d_k, \tag{5}$$

where $0<\delta<\sigma<1$. Different CG methods mainly correspond to different choices for the CG parameter [19]. One of the essential CG parameters has been suggested by Hestenes and Stiefel [20] (HS), that is,

$$\beta_k^{HS}=\frac{g_{k+1}^T y_k}{d_k^T y_k},$$

where $y_k=g_{k+1}-g_k$. As an important property, search directions of the HS method satisfy the conjugacy condition $d_{k+1}^T y_k=0$, $\forall k\ge 0$, independent of the line search and the objective function convexity.
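This property can be verified directly: taking the inner product of (3) with $y_k$ gives

$$d_{k+1}^T y_k=-g_{k+1}^T y_k+\beta_k^{HS}\,d_k^T y_k=-g_{k+1}^T y_k+\frac{g_{k+1}^T y_k}{d_k^T y_k}\,d_k^T y_k=0.$$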
In an attempt to suggest an extension of $\beta_k^{HS}$ employing the quasi-Newton aspects [29], Dai and Liao [10] (DL) dealt with the extended conjugacy condition $d_{k+1}^T y_k=-t\,g_{k+1}^T s_k$, with the nonnegative parameter $t$ and $s_k=x_{k+1}-x_k$, which, considering (3), leads to the following CG parameter:

$$\beta_k^{DL}=\frac{g_{k+1}^T y_k}{d_k^T y_k}-t\,\frac{g_{k+1}^T s_k}{d_k^T y_k}.$$

It is worth noting that if

$$t=2\frac{\|y_k\|^2}{s_k^T y_k}, \tag{6}$$

where $\|\cdot\|$ stands for the Euclidean norm, then the CG parameter proposed by Hager and Zhang [17] is achieved. Also, the choice

$$t=\tau_k+\frac{\|y_k\|^2}{s_k^T y_k}-\frac{s_k^T y_k}{\|s_k\|^2}, \tag{7}$$

in which $\tau_k$ is a parameter corresponding to the scaling factor in the scaled memoryless BFGS method [29], yields another CG parameter suggested by Dai and Kou [9]. The choices (6) and (7) are theoretically effective since they guarantee the sufficient descent condition, i.e.,

$$g_k^T d_k\le -c\|g_k\|^2,\quad \forall k\ge 0, \tag{8}$$

where $c$ is a positive constant, independent of the line search and the objective function convexity [2]. The Dai–Liao approach has attracted special attention. In several efforts, modified secant equations have been applied to modify the DL method. For example, Yabe and Takano [31] used the modified secant equation proposed by Zhang et al. [33]. Also, Zhou and Zhang [35] applied the modified secant equation proposed by Li and Fukushima [21]. Li et al. [22] used the modified secant equation proposed by Wei et al. [30]. Ford et al. [14] employed the multi-step quasi-Newton equations proposed by Ford et al. [13]. Babaie-Kafaki et al. [6] applied a revised form of the modified secant equation proposed by Zhang et al. [33], as well as the modified secant equation proposed by Yuan [32].
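Indeed, taking the inner product of (3) with $y_k$ and imposing the extended conjugacy condition determines the DL parameter:

$$d_{k+1}^T y_k=-g_{k+1}^T y_k+\beta_k\,d_k^T y_k=-t\,g_{k+1}^T s_k\ \Longrightarrow\ \beta_k=\frac{g_{k+1}^T y_k}{d_k^T y_k}-t\,\frac{g_{k+1}^T s_k}{d_k^T y_k}=\beta_k^{DL}.$$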
In another direction, the descent property of the DL method has been focused on. Recall that the search directions $\{d_k\}_{k\ge 0}$ are said to possess the descent property (or equivalently, to satisfy the descent condition) if and only if

$$g_k^T d_k<0,\quad \forall k\ge 0, \tag{9}$$

which is weaker than the sufficient descent condition (8). Recently, based on an eigenvalue analysis, Babaie-Kafaki and Ghanbari [3] proposed the following family of two-parameter choices for $t$:

$$t_k^{p,q}=p\frac{\|y_k\|^2}{s_k^T y_k}-q\frac{s_k^T y_k}{\|s_k\|^2},$$

with $p>\frac{1}{4}$ and $q\le\frac{1}{4}$, guaranteeing the descent property (9). It is interesting that if we let $(p,q)=(2,0)$, then $t_k^{p,q}$ reduces to the formula (6) proposed by Hager and Zhang [17], and if we let $(p,q)=(1,0)$, then $t_k^{p,q}$ reduces to the formula (7) proposed by Dai and Kou [9] with the optimal choice $\tau_k=\frac{s_k^T y_k}{\|s_k\|^2}$. More recently, Babaie-Kafaki and Ghanbari [4] dealt with a hybridization of the DL method and a three-term CG method proposed by Zhang et al. [34] (ZZL), with the following search directions:

$$d_0=-g_0,\qquad d_{k+1}=-g_{k+1}+\frac{g_{k+1}^T y_k}{s_k^T y_k}\,s_k-\frac{g_{k+1}^T s_k}{s_k^T y_k}\,y_k-t\,\frac{g_{k+1}^T s_k}{s_k^T y_k}\,s_k,\quad \forall k\ge 0, \tag{10}$$

where $t$ is a nonnegative parameter. As seen, the search directions (10) can be viewed as a four-term extension of the DL method. If $t=0$, then the method reduces to the three-term CG method ZZL, which satisfies the sufficient descent condition $g_k^T d_k=-\|g_k\|^2$, for all $k\ge 0$. Inheriting this from the ZZL method, the search directions (10) satisfy the sufficient descent condition (8) with $c=1$, independent of the line search and the objective function convexity. Based on the standard secant equation [29], an effective truncated choice $t_k$ (11) for the parameter $t$ in (10) has been suggested in [4], in which $\xi$ is a nonnegative constant.
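Indeed, multiplying (10) by $g_{k+1}^T$ makes the second and third terms cancel, so that

$$g_{k+1}^T d_{k+1}=-\|g_{k+1}\|^2-t\,\frac{(g_{k+1}^T s_k)^2}{s_k^T y_k}\le -\|g_{k+1}\|^2,$$

for all $t\ge 0$ with $s_k^T y_k>0$.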
Here, we deal with other proper choices for the parameter $t$ in (10). This work is organized as follows. In Section 2, we briefly study the scaled memoryless BFGS update. In Section 3, we propose a family of one-parameter choices for $t$ in (10), using the scaled memoryless BFGS update [29], and then conduct a brief global convergence analysis. We report comparative numerical results in Section 4. Finally, conclusions are drawn in Section 5.

2.
On the scaled memoryless BFGS update. As is known, quasi-Newton methods perform particularly well for solving (1) since they do not require explicit expressions of the second derivatives and are often globally and locally superlinearly convergent [29]. The iterative formula of the quasi-Newton methods is in the form of (2), in which the search direction $d_k$ is computed by

$$d_k=-H_k g_k, \tag{12}$$

where $H_k\in\mathbb{R}^{n\times n}$ is an approximation of the inverse Hessian; more precisely, $H_k\approx\nabla^2 f(x_k)^{-1}$. The methods are characterized by the fact that $H_k$ is effectively updated to achieve a new matrix $H_{k+1}$ as an approximation of $\nabla^2 f(x_{k+1})^{-1}$, satisfying a version of the secant (quasi-Newton) equation which implicitly includes second-order information [29]. The most popular equation is the standard secant equation, that is,

$$H_{k+1} y_k=s_k. \tag{13}$$

Having favorable numerical performance and strong theoretical properties in contrast to the other quasi-Newton updating formulas, the BFGS update, proposed separately by Broyden [8], Fletcher [12], Goldfarb [15] and Shanno [27], is given by

$$H_{k+1}=H_k-\frac{H_k y_k s_k^T+s_k y_k^T H_k}{s_k^T y_k}+\Big(1+\frac{y_k^T H_k y_k}{s_k^T y_k}\Big)\frac{s_k s_k^T}{s_k^T y_k}. \tag{14}$$

It can be seen that if $H_k$ is a positive definite matrix and the line search ensures that $s_k^T y_k>0$, then $H_{k+1}$ is also positive definite [29] and consequently, the descent condition (9) holds.
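In particular, the update (14) fulfills the secant equation (13); multiplying (14) by $y_k$ gives

$$H_{k+1}y_k=H_k y_k-\frac{H_k y_k\,(s_k^T y_k)+s_k\,(y_k^T H_k y_k)}{s_k^T y_k}+\Big(1+\frac{y_k^T H_k y_k}{s_k^T y_k}\Big)s_k=s_k.$$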
In order to achieve an ideal distribution of the eigenvalues of quasi-Newton updates, improving the condition number of the successive approximations of the inverse Hessian and consequently increasing the numerical stability of the iterative method (2)-(12), scaled quasi-Newton updates have been developed [29]. In this context, replacing $H_k$ by $\theta_k H_k$ in (14), in which $\theta_k>0$ is called the scaling parameter, the scaled BFGS update is achieved as follows:

$$H_{k+1}=\theta_k H_k-\theta_k\frac{H_k y_k s_k^T+s_k y_k^T H_k}{s_k^T y_k}+\Big(1+\theta_k\frac{y_k^T H_k y_k}{s_k^T y_k}\Big)\frac{s_k s_k^T}{s_k^T y_k}. \tag{15}$$

The most effective choices for $\theta_k$ in (15) have been proposed by Oren and Luenberger [25],

$$\theta_k=\frac{s_k^T y_k}{y_k^T H_k y_k}, \tag{16}$$

and Oren and Spedicato [26],

$$\theta_k=\frac{s_k^T H_k^{-1} s_k}{s_k^T y_k} \tag{17}$$

(see also [24]). Note that the scaled BFGS update with one of the parameters (16) or (17) is called a self-scaling BFGS update. Although self-scaling BFGS methods are numerically efficient [26], as an important defect they need to store the matrix $H_k\in\mathbb{R}^{n\times n}$ in each iteration, which is improper for solving large-scale problems. Hence, replacing $H_k$ by the identity matrix in (15), the self-scaling memoryless BFGS update has been proposed as follows:

$$\widetilde H_{k+1}^{\theta_k}=\theta_k I-\theta_k\frac{s_k y_k^T+y_k s_k^T}{s_k^T y_k}+\Big(1+\theta_k\frac{\|y_k\|^2}{s_k^T y_k}\Big)\frac{s_k s_k^T}{s_k^T y_k} \tag{18}$$

(see also [1]). Similarly, the memoryless versions of the scaling parameters (16) and (17) can be respectively written as

$$\theta_k=\frac{s_k^T y_k}{\|y_k\|^2} \tag{19}$$

and

$$\theta_k=\frac{\|s_k\|^2}{s_k^T y_k}. \tag{20}$$

Note that the scaling parameter (19) can also be determined based on a two-point approximation of the standard secant equation (13) [7].
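To illustrate why the memoryless update (18) suits large-scale problems, the product $\widetilde H_{k+1}^{\theta_k} g_{k+1}$ can be formed from a handful of inner products, with no $n\times n$ storage. The following minimal C++ sketch is our illustration rather than any reference implementation; the function name smlbfgs_direction and the plain std::vector representation are assumptions.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Inner product of two dense vectors.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    return std::inner_product(a.begin(), a.end(), b.begin(), 0.0);
}

// Matrix-free application of the scaled memoryless BFGS update (18):
// returns d = -H~_{k+1} g, using only s_k, y_k, g_{k+1} and theta_k.
// Assumes s^T y > 0, as guaranteed by the Wolfe condition (5).
std::vector<double> smlbfgs_direction(const std::vector<double>& g,
                                      const std::vector<double>& s,
                                      const std::vector<double>& y,
                                      double theta) {
    const double sy = dot(s, y);  // s_k^T y_k
    const double gs = dot(g, s);  // g_{k+1}^T s_k
    const double gy = dot(g, y);  // g_{k+1}^T y_k
    const double yy = dot(y, y);  // ||y_k||^2
    // From (18): H~ g = theta*g + cs*s + cy*y with scalar coefficients
    const double cs = -theta * gy / sy + (1.0 + theta * yy / sy) * gs / sy;
    const double cy = -theta * gs / sy;
    std::vector<double> d(g.size());
    for (std::size_t i = 0; i < g.size(); ++i)
        d[i] = -(theta * g[i] + cs * s[i] + cy * y[i]);
    return d;
}
```

Only four inner products and one vector combination are required, so the cost per iteration is O(n).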

3.
A class of modified Dai-Liao conjugate gradient methods. Here, we deal with a new approach for computing the parameter $t$ in the four-term extension of the DL method given by (10). In this context, we need to assume that $s_k^T y_k>0$, as guaranteed by the Wolfe condition (5).
It is worth noting that the search directions (10) can be written as

$$d_{k+1}=-Q_{k+1}g_{k+1}, \tag{21}$$

in which the matrix $Q_{k+1}$, called the search direction matrix, is given by

$$Q_{k+1}=I-\frac{s_k y_k^T-y_k s_k^T}{s_k^T y_k}+t\,\frac{s_k s_k^T}{s_k^T y_k}. \tag{22}$$

Hence, the structure of a CG method with the search directions (10) is similar to that of a quasi-Newton method. This motivated us to compute the parameter $t$ in (22) in a way that makes $Q_{k+1}$ as close as possible to an effective quasi-Newton update appropriate for large-scale problems. Thus, we employ the scaled memoryless BFGS update (18), due to its numerical and theoretical advantages. Define the matrix $E\in\mathbb{R}^{n\times n}$ as follows:

$$E=Q_{k+1}-\widetilde H_{k+1}^{\theta_k},$$

where the matrices $Q_{k+1}$ and $\widetilde H_{k+1}^{\theta_k}$ are respectively defined by (22) and (18). In order to decrease the distance between $Q_{k+1}$ and $\widetilde H_{k+1}^{\theta_k}$, here we determine the parameter $t$ as a minimizer of $\|E\|_F$, in which $\|\cdot\|_F$ stands for the Frobenius matrix norm. Since $\|E\|_F^2=\mathrm{tr}(E^T E)$, we can deal with the following minimization problem:

$$\min_{t\in\mathbb{R}}\ \Big(t-\theta_k\frac{\|y_k\|^2}{s_k^T y_k}-1\Big)^2\frac{\|s_k\|^4}{(s_k^T y_k)^2}+2(1+\theta_k)\Big(t-\theta_k\frac{\|y_k\|^2}{s_k^T y_k}-1\Big)\frac{\|s_k\|^2}{s_k^T y_k}+\zeta, \tag{23}$$

where $\zeta$ is a real constant, being independent of $t$. After some algebraic manipulations it can be seen that

$$t_{\theta_k}^*=\theta_k\frac{\|y_k\|^2}{s_k^T y_k}-(1+\theta_k)\frac{s_k^T y_k}{\|s_k\|^2}+1 \tag{24}$$

is the unique solution of the minimization problem (23). However, considering (21), since

$$g_{k+1}^T d_{k+1}=-\|g_{k+1}\|^2-t\,\frac{(g_{k+1}^T s_k)^2}{s_k^T y_k},$$

in order to ensure the sufficient descent condition (8) we need to have $t\ge 0$. So, here we suggest the following truncation of $t_{\theta_k}^*$:

$$\hat t_{\theta_k}^*=\max\{\xi,\ t_{\theta_k}^*\}, \tag{25}$$

where $\xi$ is a nonnegative constant. The formula (25) indicates a class of one-parameter choices for the parameter $t$ in (10) or, equivalently, in (22). Different choices for the scaling parameter $\theta_k$ lead to different choices for $\hat t_{\theta_k}^*$. The choice $\theta_k=1$ yields

$$\hat t_{k1}^*=\max\Big\{\xi,\ \frac{\|y_k\|^2}{s_k^T y_k}-2\frac{s_k^T y_k}{\|s_k\|^2}+1\Big\}, \tag{26}$$

while the choices (19) and (20) for $\theta_k$ respectively yield

$$\hat t_{k2}^*=\max\Big\{\xi,\ 2-\Big(1+\frac{s_k^T y_k}{\|y_k\|^2}\Big)\frac{s_k^T y_k}{\|s_k\|^2}\Big\} \tag{27}$$

and

$$\hat t_{k3}^*=\max\Big\{\xi,\ \frac{\|s_k\|^2\|y_k\|^2}{(s_k^T y_k)^2}-\frac{s_k^T y_k}{\|s_k\|^2}\Big\}. \tag{28}$$

In what follows, based on the convergence analysis of [28], we discuss the global convergence of a CG method with the search directions (21) in which the parameter $t$ is computed by (26), (27) or (28). In this context, the following standard assumptions are needed.
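The manipulations behind (24) amount to differentiating the quadratic in (23): since only the $s_k s_k^T$-term of $E$ depends on $t$,

$$\frac{d}{dt}\|E\|_F^2=\frac{2\,(E s_k)^T s_k}{s_k^T y_k}=\frac{2\|s_k\|^2}{s_k^T y_k}\Big[(1+\theta_k)+\Big(t-\theta_k\frac{\|y_k\|^2}{s_k^T y_k}-1\Big)\frac{\|s_k\|^2}{s_k^T y_k}\Big],$$

which vanishes exactly at $t=t_{\theta_k}^*$ given by (24), while $\dfrac{d^2}{dt^2}\|E\|_F^2=\dfrac{2\|s_k\|^4}{(s_k^T y_k)^2}>0$ shows that the minimizer is unique.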
Assumption A1. The level set $L=\{x\,|\,f(x)\le f(x_0)\}$, where $x_0$ is the starting point of the iterative method (2), is bounded.

Assumption A2.
In some open convex neighborhood $N$ of $L$, $f$ is continuously differentiable and its gradient is Lipschitz continuous; that is, there exists a positive constant $L$ such that

$$\|\nabla f(x)-\nabla f(y)\|\le L\|x-y\|,\quad \forall x,y\in N.$$

The following lemma is now immediate, ensuring the boundedness of $\hat t_{ki}^*$, $i=1,2,3$.

Lemma 3.1. Suppose that Assumptions A1 and A2 hold and that $f$ is uniformly convex on $N$. Then there exists a positive constant $\gamma$ such that $\hat t_{ki}^*\le\gamma$, $i=1,2,3$.
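To see where such a bound comes from, note that uniform convexity of $f$ on $N$ yields a constant $\mu>0$ with $s_k^T y_k\ge\mu\|s_k\|^2$, while Assumption A2 gives $\|y_k\|\le L\|s_k\|$. Hence,

$$\frac{\|y_k\|^2}{s_k^T y_k}\le\frac{L^2}{\mu},\qquad \frac{s_k^T y_k}{\|s_k\|^2}\le L,\qquad \frac{\|s_k\|^2\|y_k\|^2}{(s_k^T y_k)^2}\le\frac{L^2}{\mu^2},$$

so, for instance, $\hat t_{k1}^*\le\max\{\xi,\ L^2/\mu+1\}$, $\hat t_{k2}^*\le\max\{\xi,\ 2\}$ and $\hat t_{k3}^*\le\max\{\xi,\ L^2/\mu^2\}$.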
Lemma 3.1 plays an essential role in proving the following global convergence theorem, whose proof is similar to that of Theorem 2.1 of [4] and is omitted here.
Theorem 3.2. Suppose that Assumptions A1 and A2 hold. Consider the iterative method (2) with the search directions $\{d_k\}_{k\ge 0}$ given by (10), in which the parameter $t$ is computed by (26), (27) or (28). If the objective function $f$ is uniformly convex on $N$ and the steplength $\alpha_k$ is determined to fulfill the Wolfe conditions (4) and (5), then the method converges in the sense that $\liminf_{k\to\infty}\|g_k\|=0$.

4.
Numerical experiments. Here, we present some numerical results obtained by applying C++ implementations of the following four CG methods in the form of (2) with the search directions (10), in which:

• the NMDL1 method: $t=\hat t_{k1}^*$ given by (26);
• the NMDL2 method: $t=\hat t_{k2}^*$ given by (27);
• the NMDL3 method: $t=\hat t_{k3}^*$ given by (28);
• the MDL method: $t=t_k$ given by (11).

To justify the comparison of the new methods with MDL, note that it has been shown in [4] that MDL is numerically preferable to the CG_DESCENT and ZZL methods.
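For concreteness, the following minimal C++ sketch shows how the shared kernel of the NMDL variants could be realized; it is our illustration, not the authors' implementation, and the function names and dense-vector representation are assumptions. t_hat evaluates the truncation (25) for a given $\theta_k$, and direction forms the four-term search direction (10).

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Inner product of two dense vectors.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    return std::inner_product(a.begin(), a.end(), b.begin(), 0.0);
}

// Truncated parameter (25): t = max{xi, t*_theta} with t*_theta from (24),
// given sy = s^T y, ss = ||s||^2 and yy = ||y||^2.
double t_hat(double theta, double sy, double ss, double yy, double xi) {
    const double t_star = theta * yy / sy - (1.0 + theta) * sy / ss + 1.0;
    return std::max(xi, t_star);
}

// Four-term search direction (10):
// d = -g + (gy/sy) s - (gs/sy) y - t (gs/sy) s.
std::vector<double> direction(const std::vector<double>& g,
                              const std::vector<double>& s,
                              const std::vector<double>& y, double t) {
    const double sy = dot(s, y), gs = dot(g, s), gy = dot(g, y);
    std::vector<double> d(g.size());
    for (std::size_t i = 0; i < g.size(); ++i)
        d[i] = -g[i] + (gy / sy) * s[i] - (gs / sy) * y[i]
               - t * (gs / sy) * s[i];
    return d;
}
```

With sy, ss and yy computed once per iteration, the calls t_hat(1.0, sy, ss, yy, xi), t_hat(sy/yy, sy, ss, yy, xi) and t_hat(ss/sy, sy, ss, yy, xi) reproduce (26), (27) and (28), respectively.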
The codes were run on a computer with a 3.2 GHz CPU, 1 GB of RAM and the CentOS 6.2 server Linux operating system. Since CG methods are mainly designed to solve large-scale problems, the experiments were performed on a set of 64 unconstrained optimization test problems of the CUTEr collection [16] with default dimensions of at least 1000, as given on Hager's home page: http://www.math.ufl.edu/~hager/ (the test problem data are also described in [5]).
For the MDL method, we set $\xi=0.66$ in (11) as suggested in [4]. Also, for the NMDL1, NMDL2 and NMDL3 methods we respectively set $\xi=0.65$ in (26), $\xi=0.55$ in (27) and $\xi=0.40$ in (28), because of the promising numerical results obtained among the different values $\xi\in\{0.05k\}_{k=0}^{20}$. In all four methods, the steplength was computed by the line search procedure proposed in [18]. Also, all attempts to solve the test problems were terminated when $\|g_k\|_\infty<10^{-6}(1+|f(x_k)|)$. Efficiency comparisons were made using the performance profile introduced by Dolan and Moré [11], on the running time and on the total number of function and gradient evaluations, equal to $N_f+3N_g$ [18], where $N_f$ and $N_g$ respectively denote the number of function and gradient evaluations. The performance profile gives, for every $\omega\ge 1$, the proportion $p(\omega)$ of the test problems on which each considered algorithmic variant has a performance within a factor $\omega$ of the best. Figures 1 and 2 show the results of the comparisons from the perspectives of the total number of function and gradient evaluations and the CPU time. As the figures show, NMDL3 outperforms the other methods. Moreover, MDL is preferable to NMDL2, while NMDL2 outperforms NMDL1. Thus, proper choices for the parameter $\theta_k$ in (25) may lead to efficient CG methods in the form of (2) with the search directions (10). Also, the effectiveness of the scaling parameters (19) and (20) has been demonstrated by the promising numerical behaviors of NMDL2 and NMDL3.

Figure 1. Total number of function and gradient evaluations performance profiles for NMDL1, NMDL2, NMDL3 and MDL.

Figure 2. CPU time performance profiles for NMDL1, NMDL2, NMDL3 and MDL.
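The profile values $p(\omega)$ underlying Figures 1 and 2 are straightforward to compute from a table of per-problem costs. The following minimal C++ sketch is an illustration under the definition above, with hypothetical names; failed runs are assigned an infinite cost so that they are never within a factor $\omega$ of the best.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Dolan-More performance profile: cost[s][p] holds the measure (CPU time or
// N_f + 3 N_g) of solver s on problem p, with infinity marking a failure.
// Returns p(omega) for the given solver: the fraction of problems on which
// it is within a factor omega of the best solver.
double profile(const std::vector<std::vector<double>>& cost,
               std::size_t solver, double omega) {
    const std::size_t num_problems = cost[0].size();
    std::size_t count = 0;
    for (std::size_t p = 0; p < num_problems; ++p) {
        double best = std::numeric_limits<double>::infinity();
        for (const auto& row : cost) best = std::min(best, row[p]);
        if (cost[solver][p] <= omega * best) ++count;
    }
    return static_cast<double>(count) / num_problems;
}
```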

5.
Conclusions. Taking advantage of the scaled memoryless BFGS update, a class of one-parameter choices for the parameter of a recently proposed four-term extension of the Dai-Liao method has been suggested, leading to a class of nonlinear conjugate gradient methods. Under mild assumptions, it has been shown that the methods of the suggested class possess the sufficient descent property. A global convergence result has been established for three methods of the suggested class under the assumption of uniform convexity of the objective function. Preliminary numerical results showed that the practical efficiency of the methods of the suggested class may be promising, especially when the scaling parameter is chosen appropriately.