A MODIFIED SCALED MEMORYLESS BFGS PRECONDITIONED CONJUGATE GRADIENT ALGORITHM FOR NONSMOOTH CONVEX OPTIMIZATION

Abstract. This paper presents a nonmonotone scaled memoryless BFGS preconditioned conjugate gradient algorithm for solving nonsmooth convex optimization problems, which combines the scaled memoryless BFGS preconditioned conjugate gradient method with a nonmonotone technique and the Moreau-Yosida regularization. The proposed method uses approximate function and gradient values of the Moreau-Yosida regularization instead of the corresponding exact values. Under mild conditions, the global convergence of the proposed method is established. Preliminary numerical results and related comparisons show that the proposed method can be applied to solve large scale nonsmooth convex optimization problems.

1. Introduction. In this paper, we are concerned with the following unconstrained optimization problem

(1)    min_{x ∈ R^n} f(x),

where f : R^n → R is a possibly nondifferentiable convex function. Associated with this problem is the problem

(2)    min_{x ∈ R^n} F(x),

where F : R^n → R is the so-called Moreau-Yosida regularization of f [16], defined by

(3)    F(x) = min_{z ∈ R^n} { f(z) + (1/(2λ)) ‖z − x‖² },

where λ is a positive parameter. Throughout the paper, the value of λ is fixed. It is well known that problems (1) and (2) are equivalent in the sense that x ∈ R^n solves (1) if and only if it solves (2); see [16] for details. A remarkable feature of F is that it is a differentiable convex function whose gradient is Lipschitz continuous, even though f itself may be nondifferentiable.

The paper is organized as follows. In the next section, we review some basic results from convex and nonsmooth analysis and briefly outline the SCALCG algorithm. In Section 3, we propose a new scaled memoryless BFGS preconditioned CG method based on the Moreau-Yosida regularization and a modified nonmonotone line search scheme. Section 4 is devoted to analyzing its convergence properties. Numerical results and related comparisons are reported in Section 5. Finally, some concluding remarks are given in Section 6.
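To make the regularization (3) concrete, the following sketch evaluates F, the minimizer p(x) of (3), and the gradient g(x) = (x − p(x))/λ for the simple choice f(x) = ‖x‖₁, whose minimizer in (3) has the closed form of soft-thresholding. This is an illustrative example only; the function names are our own.

```python
import numpy as np

def moreau(f, prox, x, lam):
    """Moreau-Yosida regularization F(x) = f(p) + ||p - x||^2 / (2*lam),
    where p = prox(x, lam) is the unique minimizer in (3)."""
    p = prox(x, lam)
    return f(p) + np.dot(p - x, p - x) / (2.0 * lam)

# Example: f(x) = ||x||_1, whose proximal map is soft-thresholding.
f = lambda x: np.sum(np.abs(x))
prox = lambda x, lam: np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

lam = 1.0
x = np.array([3.0])
p = prox(x, lam)             # minimizer of (3): p = [2.0]
F = moreau(f, prox, x, lam)  # F(x) = 2 + 0.5 = 2.5
g = (x - p) / lam            # gradient of F: g = [1.0]
print(p, F, g)
```

Note that f and its regularization F share the minimizer x = 0 here, while F (the Huber function in this case) is smooth although f is not.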
Our notation is standard. For any points x, y ∈ R^n, ⟨x, y⟩ = x^T y stands for the Euclidean inner product, and ‖·‖ for the associated norm. Given a convex function f, we denote its subdifferential at the point x by ∂f(x) = {ξ : f(y) ≥ f(x) + ξ^T (y − x), ∀y} [16].

2. Preliminaries. In this section, we review some basic results and the SCALCG algorithm, which are useful in the subsequent discussions.
Proposition 2.1. The function F is finite-valued, convex, and everywhere differentiable, with gradient

g(x) := ∇F(x) = (x − p(x))/λ,

where p(x) is the unique minimizer in (3), i.e.,

p(x) = argmin_{z ∈ R^n} { f(z) + (1/(2λ)) ‖z − x‖² }.

Moreover, for all x, y ∈ R^n,

‖g(x) − g(y)‖² ≤ (1/λ) (g(x) − g(y))^T (x − y),

and hence

‖g(x) − g(y)‖ ≤ (1/λ) ‖x − y‖.

This proposition shows that the mapping g : R^n → R^n is globally Lipschitz continuous. The next proposition (see Theorem 15.4.1.7 in [16]) states the equivalence between problems (1) and (2).
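The two inequalities in Proposition 2.1 are easy to verify numerically for the ‖·‖₁ example, since the gradient of its Moreau envelope is available in closed form. The sketch below checks both on random point pairs; it is an illustration only, with names of our own choosing.

```python
import numpy as np

lam = 0.5
prox = lambda x: np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)  # prox of lam*||.||_1
grad = lambda x: (x - prox(x)) / lam   # gradient of the Moreau envelope

rng = np.random.default_rng(0)
violations = 0
for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    d = grad(x) - grad(y)
    # co-coercivity: ||g(x)-g(y)||^2 <= (g(x)-g(y))^T (x-y) / lam
    if np.dot(d, d) > np.dot(d, x - y) / lam + 1e-12:
        violations += 1
    # Lipschitz continuity: ||g(x)-g(y)|| <= ||x-y|| / lam
    if np.linalg.norm(d) > np.linalg.norm(x - y) / lam + 1e-12:
        violations += 1
print(violations)   # 0: both inequalities hold on all sampled pairs
```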
Proposition 2.2. The following statements are equivalent: (1) x minimizes f on R^n; (2) x minimizes F on R^n; (3) g(x) = 0.

The SCALCG algorithm for solving problem (2) is an iterative algorithm [1], which consists of iterations of the form

x_{k+1} = x_k + α_k d_k,

where α_k > 0 is a stepsize satisfying the standard Wolfe conditions (see [29] for details), and d_k is a search direction defined by

YIGUI OU AND XIN ZHOU
with the matrix Q_{k+1} ∈ R^{n×n} given by a scaled memoryless BFGS update, where s_k = x_{k+1} − x_k, g_k = ∇F(x_k), y_k = g_{k+1} − g_k, and θ_{k+1} is a scaling parameter determined by a two-point approximation of the standard secant equation [5].

Since it is in general impossible to evaluate exactly the function F defined by (3) and its gradient g at an arbitrary point x, we shall make use of approximate values of the gradient and the function itself. Fortunately, for each x ∈ R^n and any ε > 0, there exists an approximate vector p_α(x, ε) ∈ R^n such that

(10)    f(p_α(x, ε)) + (1/(2λ)) ‖p_α(x, ε) − x‖² ≤ F(x) + ε.

Hence, we can use p_α(x, ε) to define the approximations of F(x) and g(x) by

(11)    F_α(x, ε) = f(p_α(x, ε)) + (1/(2λ)) ‖p_α(x, ε) − x‖²

and

(12)    g_α(x, ε) = (x − p_α(x, ε))/λ,

respectively. Some implementable procedures for computing such an approximate minimizer p_α(x, ε) can be found in [11, 2]. A remarkable feature of F_α(x, ε) and g_α(x, ε) is given by the following proposition [12].
Proposition 2.3. Let p_α(x, ε) be a vector satisfying (10), and let F_α(x, ε) and g_α(x, ε) be defined by (11) and (12), respectively. Then

F(x) ≤ F_α(x, ε) ≤ F(x) + ε    and    ‖g_α(x, ε) − g(x)‖ ≤ √(2ε/λ).

The above proposition shows that the approximations F_α(x, ε) and g_α(x, ε) can be made arbitrarily close to the exact values F(x) and g(x), respectively, by choosing the parameter ε small enough. With these approximations, we redefine the search direction d_k accordingly, where ε_k is an appropriately chosen positive number and the matrix Q_{k+1} ∈ R^{n×n} is given by the modified update (17)-(22). Note that the matrix Q_{k+1} defined by (17) is precisely the self-scaling BFGS update in which the approximation of the inverse Hessian is restarted as θ_{k+1} I. Since it follows from (21) and (22) that w_k^T s_k > 0, the matrix Q_{k+1} is symmetric positive definite (see [29] for details). Furthermore, by the Sherman-Morrison formula (see [29] for details), it can be shown that the matrix P_{k+1} defined by (23) is the inverse of Q_{k+1}; thus P_{k+1} is also a symmetric positive definite matrix.
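As an illustration of the memoryless BFGS machinery, the sketch below computes a direction d = −P g, where P is the textbook memoryless BFGS update of the scaled identity θI from a single pair (s, y). The paper's modified update (17)-(23) differs in its choice of the pair w_k, so this is a hedged sketch of the general technique rather than the exact method.

```python
import numpy as np

def smbfgs_direction(g, s, y):
    """Direction d = -P g, where (requiring s^T y > 0)
        P = (I - s y^T / s^T y) theta I (I - y s^T / s^T y) + s s^T / s^T y
    is the memoryless BFGS update of theta*I with the pair (s, y)."""
    sy = float(np.dot(s, y))
    theta = sy / float(np.dot(y, y))            # two-point scaling parameter
    v = theta * (g - (np.dot(s, g) / sy) * y)   # theta * (I - y s^T / sy) g
    Pg = v - (np.dot(y, v) / sy) * s + (np.dot(s, g) / sy) * s
    return -Pg

rng = np.random.default_rng(1)
n = 5
g, s, y = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
if np.dot(s, y) <= 0:
    y = -y                                      # enforce the curvature condition
d = smbfgs_direction(g, s, y)

# Sanity checks: d matches the explicit matrix form of P, and since P is
# symmetric positive definite, d is a descent direction (g^T d < 0).
sy = np.dot(s, y)
theta = sy / np.dot(y, y)
I = np.eye(n)
P = (I - np.outer(s, y) / sy) @ (theta * (I - np.outer(y, s) / sy)) + np.outer(s, s) / sy
print(np.allclose(d, -P @ g), np.dot(g, d) < 0)
```

The matrix-free form above needs only inner products, which is what makes the memoryless update attractive for large scale problems.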
3. New method. In this section, we first introduce a modified nonmonotone line search scheme, and then describe the new algorithm for solving problem (2). We also give some remarks about the proposed method.
Step 1. Compute an approximation p_α(x_k, ε_k), then evaluate F_α(x_k, ε_k) and g_α(x_k, ε_k). If ‖g_α(x_k, ε_k)‖ ≤ ε, then stop.

Remark 3.2. From Step 1, we can see that computing the approximation p_α(x_k, ε_k) plays a key role in implementing Algorithm 3.1. In practical computation, we employ an implementable procedure from [11] to obtain an approximation p_α(x_k, ε_k) to the unique minimizer p(x_k) in (3) at the k-th iteration. The procedure is described as follows.
(b) Solve the quadratic programming subproblem (28) to obtain z_i and the optimal value F_k^i of the objective function. If the stopping test is satisfied, terminate with p_α(x_k, ε_k) := z_i; otherwise, choose s_i ∈ ∂f(z_i), set i := i + 1, and go to (b).
It should be mentioned that, in Step (b), z_i may be calculated in practice by solving the dual problem (30) associated with (28), using the correspondence (31). For more details about (30) and (31), see Lemma 15.2.4.1 in [16].
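A minimal one-dimensional sketch of this cutting-plane idea is given below. For illustration only, the QP (28) and its dual (30) are replaced by a scalar ternary search on the convex model problem, and we use the common stopping test f(z_i) − m_i(z_i) ≤ ε, which need not coincide with the test used in [11]; all function names are our own.

```python
import numpy as np

def approx_prox_1d(f, subgrad, x, lam, eps, max_iter=50):
    """Cutting-plane approximation of p(x) = argmin_z f(z) + (z-x)^2/(2*lam).
    The piecewise-linear model m_i(z) = max_j [f(z_j) + s_j (z - z_j)]
    replaces f; each model problem is convex in z and is minimized here by
    ternary search (a scalar stand-in for solving the QP or its dual)."""
    cuts = []                       # list of (f(z_j), slope s_j, point z_j)
    z = x
    for _ in range(max_iter):
        fz = f(z)
        if cuts:
            mz = max(fv + sv * (z - zv) for fv, sv, zv in cuts)
            if fz - mz <= eps:      # model is eps-accurate at z: accept z
                return z
        cuts.append((fz, subgrad(z), z))
        phi = lambda t: (max(fv + sv * (t - zv) for fv, sv, zv in cuts)
                         + (t - x) ** 2 / (2.0 * lam))
        # bracket assumes the true minimizer lies within 10*lam of x
        lo, hi = x - 10.0 * lam, x + 10.0 * lam
        for _ in range(200):        # ternary search on the convex model problem
            m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
            if phi(m1) < phi(m2):
                hi = m2
            else:
                lo = m1
        z = 0.5 * (lo + hi)
    return z

p = approx_prox_1d(abs, np.sign, 3.0, lam=1.0, eps=1e-8)
print(p)   # close to the exact value prox(3) = 2.0 for f = |.|
```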
Remark 3.3. Note that, by the construction of Step 4 in Algorithm 3.1, we always have the inequality ε_k ≤ ν α_k² ‖d_k‖² for all k, which is an essential condition for guaranteeing the global convergence of Algorithm 3.1 (see the next section for details).
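The reference value C_k appearing in the nonmonotone line search (24)-(25) can be illustrated with the recursion of Zhang-Hager type on which such nonmonotone schemes are commonly built; the paper's exact scheme may differ, so the following is only a sketch, with η ∈ [0, η_max]:

```python
def update_reference(C, Q, f_new, eta):
    """Zhang-Hager-type nonmonotone reference value: C_{k+1} is a convex
    combination of the previous reference C_k and the newest function
    value f_{k+1}, weighted through the counter Q_k (Q_0 = 1, C_0 = f_0)."""
    Q_new = eta * Q + 1.0
    C_new = (eta * Q * C + f_new) / Q_new
    return C_new, Q_new

# eta = 0 recovers the monotone case (C_k equals the latest value);
# eta = 1 makes C_k the average of all function values seen so far.
C, Q = 5.0, 1.0
C, Q = update_reference(C, Q, 3.0, eta=1.0)   # C = (5 + 3) / 2 = 4.0
print(C, Q)
```

Accepting steps that satisfy a sufficient decrease relative to C_k, rather than to the latest value, is what permits the nonmonotone behavior while keeping C_k ≥ f_k.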
4. Convergence analysis. In this section, we turn our attention to the convergence properties of Algorithm 3.1. To this end, we need the following assumption.
Assumption A. The level set L_0 determined by the available initial point x_0 is bounded.
Remark 4.1. As mentioned in [18], Assumption A is a weaker condition than the strong convexity of f as required in [25,26].
Lemma 4.1. Let {x_k} be the sequence generated by Algorithm 3.1. Then (33) holds for all k.

Proof. From (25), it follows that (34) holds. We proceed by induction. For k = 0, (33) holds, due to the facts that C_0 = F_α(x_0, ε_0) and ε_0 = γ^{−1}. Next, assume that (33) holds for some k ≥ 1. If we can show the assertion (35), then it follows from (34) and (35) that (33) holds for k + 1 as well.
From the proof of Lemma 4.1, we conclude that the following assertions also hold: (41) is satisfied, and x_k ∈ L_0 for all k.
Proof. From the line search scheme (24) and (25), it follows that the first assertion (41) is true. We now prove the second assertion by induction. Clearly, x_0 ∈ L_0, due to (13). Assume that x_k ∈ L_0. Then it follows from (13), (33) and (41) that x_{k+1} ∈ L_0, since C_0 = F_α(x_0, ε_0). This completes the proof.
Proof. Using the update formulas (17) and (23), we easily obtain the traces of Q_k and P_k as (42) and (43), respectively. These, together with (43) and (50)-(51), yield bounds on the traces. Let c_1 = 1/(nM) and c_2 = (m(n−1) + M)/m². Then, by using (16) and (52)-(53) and combining the resulting inequalities, we obtain the desired inequalities (48) and (49), respectively.
Lemma 4.6. There exists a constant ρ > 0 such that

Proof. For simplicity, we define two index sets as follows.
Using the lemmas above, we obtain the following global convergence result for Algorithm 3.1.

Proof. We first show that assertion (67) holds. From (25) and (54), it follows that (68) holds, and (69) holds due to η_max < 1. Consequently, (70) follows from (68) and (69). Since x_k ∈ L_0, it follows from Assumption A that F(x_k) is bounded from below, and so is C_k, due to the fact that C_k ≥ F_α(x_k, γ^{k−1}) ≥ F(x_k) for all k. Furthermore, by using ε_{i+1} = γ ε_i for all i, we have

Combining this inequality with (70) yields the desired bound, which implies that assertion (67) holds.
Thus, x* minimizes F by Proposition 2.2, and the proof of the second part is completed by noting the equivalence of problems (1) and (2). This completes the proof.

5. Numerical experiments. In this section, we present numerical experiments to evaluate the performance of the proposed method on both small scale and large scale problems, together with comparisons against related methods.

5.1. Numerical experiments for small scale problems. We first test the proposed algorithm on small scale problems chosen from [21] and listed in Table 1, where n, x_0 and f_opt denote the number of variables, the initial point, and the optimal function value, respectively.
To validate Algorithm 3.1 from a computational point of view, we compare it with the algorithm in [20] (abbreviated LWTR), the algorithm in [32] (abbreviated YWBB), the algorithm in [26] (abbreviated SFTR), and the algorithm in [25] (abbreviated RFBFGS). For SFTR, the parameters are chosen as follows: β = 1, γ = 0.5, η = 0.5, Δ_0 = 1, μ = 0.0001, and λ = 100. For RFBFGS, the parameters are: ρ = 0.5, σ = 0.001, ε_{−1} = 1, B_0 = I, γ = 0.85, and β_k = 1/(k+1)². Table 2 shows the detailed numerical results in the form n_i/n_f/f_k, where n_i, n_f and f_k denote the number of iterations, the number of function evaluations, and the function value f(x_k) at the final iteration, respectively. The numerical results of LWTR and YWBB are taken from [20] and [32], respectively.

We adopt the performance profile introduced by Dolan and Moré [10] to compare the efficiency of the five methods on the set of test problems, which has some advantages over other existing benchmarking tools. Figure 1 gives the performance profiles of the five methods with respect to the number of iterations and the number of function evaluations, respectively.

5.2. Numerical experiments for large scale problems. We further test the proposed algorithm on large scale problems chosen from [13] and listed in Table 3. Throughout these experiments, the parameter values are similar to those used for the small scale problems. To show the performance of the proposed algorithm, we also list the results of the algorithm of [33] (abbreviated CG-YWL), which was implemented in Fortran 90 and whose results can be found in [33]. Table 4 shows the detailed results in the form n_i/n_f/f_k, with the same meanings as in Table 2. Efficiency comparisons are again made using the performance profile [10].
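The Dolan-Moré performance profile used here is straightforward to compute: for each problem p and solver s with cost t_{p,s}, one forms the ratio r_{p,s} = t_{p,s} / min_s t_{p,s}, and ρ_s(τ) is the fraction of problems on which r_{p,s} ≤ τ. A sketch with hypothetical data (not the paper's results):

```python
import numpy as np

def performance_profile(T, taus):
    """Dolan-More performance profile. T[p, s] = cost (e.g. iterations) of
    solver s on problem p (np.inf marks a failure). Returns rho[s, i] =
    fraction of problems with performance ratio r_{p,s} <= taus[i]."""
    ratios = T / T.min(axis=1, keepdims=True)   # r_{p,s}
    return np.array([[np.mean(ratios[:, s] <= tau) for tau in taus]
                     for s in range(T.shape[1])])

# Hypothetical iteration counts for 3 problems x 2 solvers:
T = np.array([[10.0, 20.0],
              [30.0, 15.0],
              [np.inf, 40.0]])   # solver 1 fails on problem 3
rho = performance_profile(T, taus=[1.0, 2.0])
print(rho)
```

Here ρ_s(1) is the fraction of problems on which solver s is the fastest, and ρ_s(τ) for large τ approaches the fraction of problems it solves at all, which is why higher curves in Figures 1 and 2 indicate better performance.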
The performance profiles in Figure 2 compare Algorithm 3.1 with CG-YWL in terms of the number of iterations and the number of function evaluations, respectively. From Figure 2, we see that for the test problems in Table 3, Algorithm 3.1 is competitive with CG-YWL. While it would be unwise to draw firm conclusions from these limited numerical results, they indicate some promise for the new approach proposed in this paper; further improvement is expected from a more suitable implementation.

Table 3. Testing functions for large scale problems