Incremental gradient-free method for nonsmooth distributed optimization

In this paper we consider the minimization of the sum of local convex component functions distributed over a multi-agent network. We first extend Nesterov's random gradient-free method to the incremental setting. We then propose incremental gradient-free methods that select the component function in either a cyclic or a randomized order. We provide convergence and iteration-complexity analyses of the proposed methods under suitable stepsize rules. To illustrate the proposed methods, extensive numerical results on a distributed $l_1$-regression problem are presented. Compared with existing incremental subgradient-based methods, our methods require only the evaluation of function values rather than subgradients, which may be preferred by practitioners.


1. Introduction. In recent years, there has been increasing interest in minimizing the sum of a number of component functions that share a common decision variable [17,5,18,7]. Such problems are often termed distributed optimization [3,18,7], and they arise in many network applications, including in-network estimation, learning, signal processing, and resource allocation [21,23,24,29,31]. In these applications, there is no central coordinator with access to all the information that traditional optimization approaches require [27]. Thus, decentralized algorithms are needed to solve these problems.
There exist several useful techniques for solving optimization problems in a distributed manner. In terms of update strategies, most of them can be classified as consensus-based approaches [18,7,14] or incremental approaches [25,16,17,4,13]. In the consensus-based approach, the agents reach the global minimizer by sharing information locally (each agent exchanges information only with its neighbors). In [18], Nedić developed a general framework for distributed optimization based on subgradient methods and a consensus strategy over a time-varying network. The authors of [7] discussed constrained distributed optimization based on dual subgradient averaging. In [14], Li et al. extended Nesterov's gradient-free method [19] to the distributed setting based on a consensus scheme. Later, the authors of [15] further developed a distributed proximal gradient method for solving a class of convex optimization problems with inequality constraints. In [26], Yuan et al. proposed a derivative-free distributed method for multi-agent optimization based on Nesterov's gradient-free method and the push-sum strategy. In the incremental approach, the basic idea is to perform the (sub)gradient-based update incrementally, by sequentially taking steps along the (sub)gradients of the component functions, with intermediate adjustment of the variables after processing each component function [25,17]. Hence, it also works in a fully distributed manner and has been very successful in applications such as parameter estimation in wireless sensor networks [21,23] and stochastic programming [8]. In terms of the strategy by which component functions are selected, most existing incremental approaches can be classified into cyclic incremental methods [17,5], equiprobable randomized incremental methods [17,5], and Markov randomized incremental methods [12,22].
In [17], the authors developed the cyclic incremental subgradient method and the equiprobable randomized incremental subgradient method for distributed optimization, and provided explicit convergence rates for the proposed methods [16,17]. Going beyond the equiprobable randomized scheme, subsequent work extended the randomized incremental subgradient method to the more general setting in which the sequence of selected component functions follows a time-homogeneous Markov chain [12] or a time-nonhomogeneous Markov chain [22]. To minimize the sum of a number of composite component functions, the proximal point method was first incorporated into the incremental method in [5].
Most of the methods mentioned above rely on the assumption that the subgradients of the objective functions are available and easy to evaluate. However, it is well known that in a large number of problems the subgradient information is unavailable or costly to compute [6,8,19,30]. Moreover, for many practitioners, derivative-free methods are preferable, since computing subgradients often requires substantial knowledge of convex optimization and nonsmooth analysis. Thus, the development of derivative-free optimization schemes has attracted considerable research interest. For centralized optimization problems, derivative-free schemes have a long history and enjoy the desirable advantage of never requiring explicit subgradient calculations; for details we refer to [1,6,8,28,19,30]. However, gradient-free methods available in the distributed setting remain limited. In this paper, based on the incremental approach, we develop gradient-free computational schemes for distributed optimization problems.
The structure of this paper is as follows. In Section 2, we formulate the problem under consideration. In Section 3, we propose a cyclic incremental gradient-free algorithm for a ring-structured network and give its convergence analysis. In Section 4, we propose a randomized incremental gradient-free algorithm over an arbitrary network and give the corresponding convergence results. Numerical experiments are presented in Section 5, and conclusions are given last. Compared with existing incremental-based methods, this paper considers incremental gradient-free algorithms over two types of networks (a cyclic network in Section 3 and an arbitrary network in Section 4). More importantly, our proposed methods are gradient-free, which may be preferred by practitioners. Since only the values of the cost functions are required, our methods may suffer a factor of at most $d^2$ (where $d$ is the dimension of the problem variable) in iteration complexity compared with incremental subgradient-based methods in theory. However, our numerical simulations show that for some nonsmooth problems, our methods can even achieve better performance than subgradient-based methods under the same stepsize update strategies.
2. Problem and Gaussian smoothing. In this section, we formulate the problem of interest and describe the Gaussian smoothing technique that we will use.
2.1. Problem. Consider a network with $m$ agents, indexed by $i = 1, \dots, m$. The network objective is to solve the following optimization problem:
$$\min_{x \in X} \ F(x) := \sum_{i=1}^{m} f_i(x), \qquad (1)$$
where $x$ is a global decision vector, $X \subseteq \mathbb{R}^d$ is a closed convex set, and each $f_i : X \to \mathbb{R}$ is a convex but possibly nonsmooth function, known only by agent $i$.
In this paper, we are interested in the case where only the values of the cost functions in problem (1) are available. Our goal is to handle the situation in which each agent $i$ has access only to the value of its private cost function $f_i$ ($i = 1, \dots, m$), while all agents cooperatively minimize the sum of the agents' convex objective functions over a multi-agent network. For simplicity of notation, we define $F^* = \min_{x \in X} F(x)$ and $X^* = \{x \in X : F(x) = F^*\}$. Throughout the paper, we assume that $X^*$ is nonempty and $F^*$ is finite. To proceed further, we require the following assumption.
Assumption 1. Each function $f_i$ is $L$-Lipschitz with respect to the $l_2$-norm $\|\cdot\|$, that is, there exists a constant $L > 0$ such that $|f_i(x) - f_i(y)| \le L\|x - y\|$ for all $x, y \in X$. Note that Assumption 1 implies the boundedness of the subgradients of $f_i$, i.e., $\|g\| \le L$ for all $g \in \partial f_i(x)$ and all $x \in X$, where $\partial f_i(x)$ is the set of subgradients of $f_i$ at $x$ (see, e.g., [10]).

2.2. Gaussian smoothing. In order to address the difficulties associated with the nonsmooth objective function, we consider a smooth approximation of it. It is well known (see, e.g., Proposition 2.4 in [2]) that the convolution of two functions is at least as smooth as the smoother of the two. In particular, let $u$ be a $d$-dimensional standard Gaussian random vector and $\nu > 0$ be a smoothing parameter; then a smooth approximation of a nonsmooth function $f$ is defined by $f_\nu(x) = E_u[f(x + \nu u)]$. There are also other choices of smoothing distribution; for example, the uniform distribution on the $l_2$-ball or $l_\infty$-ball has been used in [8,30,9]. In what follows, we give some useful properties of $f_\nu(x)$.
Lemma 2.1 ([19]). Assume that $f$ is convex and $L$-Lipschitz on $X$. Then, for any smoothing parameter $\nu > 0$, the following properties hold:
(i): $f(x) \le f_\nu(x) \le f(x) + \nu L \sqrt{d}$ for all $x \in X$;
(ii): $f_\nu$ is convex and continuously differentiable, and its gradient is given by $\nabla f_\nu(x) = E_u\big[\frac{f(x+\nu u) - f(x)}{\nu}\, u\big]$;
(iii): for $\nu > 0$, the random gradient-free oracle $g(x,u) = \frac{f(x+\nu u) - f(x)}{\nu}\, u$ satisfies $E_u[g(x,u)] = \nabla f_\nu(x)$;
(iv): $E_u[\|g(x,u)\|^2] \le (d+4)^2 L^2$.
It can be seen from Lemma 2.1 (i) that $f_\nu$ is a Gaussian approximation of $f$. Compared with the nonsmooth function $f$, the smoothed version $f_\nu$ is relatively well behaved (see Lemma 2.1 (ii)); moreover, there is a close relationship between its gradient $\nabla f_\nu(x)$ and the random gradient-free oracle $g(x,u)$ (see Lemma 2.1 (iii)). Note, however, that the bound on $E_u[\|g(x,u)\|]$ scales as $d$ (see Lemma 2.1 (iv)), which is an additional penalty induced by the use of the gradient-free oracle.
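As a numerical sketch (not the paper's code), the oracle of Lemma 2.1 (iii) can be checked empirically: averaging $g(x,u)$ over many Gaussian directions $u$ approximates $\nabla f_\nu(x)$, which for the $l_1$-norm and small $\nu$ is close to the subgradient $\mathrm{sign}(x)$. The test function and sample counts below are illustrative choices.

```python
import numpy as np

def gf_oracle(f, x, nu, u):
    # Random gradient-free oracle g(x, u) = (f(x + nu*u) - f(x)) / nu * u
    return (f(x + nu * u) - f(x)) / nu * u

rng = np.random.default_rng(0)
d, nu = 5, 1e-3
f = lambda v: np.abs(v).sum()            # nonsmooth convex test function (l1-norm)
x = rng.standard_normal(d)               # a point with all coordinates away from 0

# Averaging the oracle over many Gaussian directions estimates grad f_nu(x);
# for small nu this is close to sign(x), a subgradient of the l1-norm at x.
samples = rng.standard_normal((20000, d))
g_avg = np.mean([gf_oracle(f, x, nu, u) for u in samples], axis=0)
```

A single oracle call has high variance (its second moment scales as $(d+4)^2 L^2$ by Lemma 2.1 (iv)); only the average over many directions is close to the smoothed gradient.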
Lemma 2.2 ([14]). Assume that $f$ is convex and $L$-Lipschitz on $X$. Then, for any smoothing parameter $\nu > 0$ and any $x, \bar{x} \in X$, we have $f(x) - f(\bar{x}) \le \langle \nabla f_\nu(x), x - \bar{x} \rangle + \nu L \sqrt{d}$.
3. Incremental gradient-free method with cyclic order. In this section, we first propose a cyclic incremental gradient-free algorithm for solving problem (1) and then give its convergence results. Our method is based on the cyclic incremental subgradient method [4]. Our focus is on the case where the subgradient evaluations of the agents are unavailable or prohibitive. We assume that all agents are connected in a directed network with a ring structure [17], and each agent exchanges information only with its immediate neighbors. Formally, the cyclic incremental gradient-free (CIGF, for short) algorithm is described as follows:
Algorithm CIGF
Initialization: Given an initial point $x_0 \in X$, stepsizes $\{\alpha_{k+1}\}_{k \ge 0}$, smoothing parameters $\mu_{k+1} > 0$. Set $k = 0$.
Step 1: Set $z_{0,k+1} = x_k$.
Step 2: For $i = 1, \dots, m$, do
Step 2.1: Generate $u_{i,k+1}$ by a Gaussian random vector generator and call the gradient-free oracle of the component function $f_i$ to compute $\tilde{g}(z_{i-1,k+1}, u_{i,k+1})$ given by
$$\tilde{g}(z_{i-1,k+1}, u_{i,k+1}) = \frac{f_i(z_{i-1,k+1} + \mu_{k+1} u_{i,k+1}) - f_i(z_{i-1,k+1})}{\mu_{k+1}}\, u_{i,k+1}. \qquad (3)$$
Step 2.2: Update
$$z_{i,k+1} = \Pi_X\big[z_{i-1,k+1} - \alpha_{k+1}\, \tilde{g}(z_{i-1,k+1}, u_{i,k+1})\big]. \qquad (4)$$
End.
Step 3: Set $x_{k+1} = z_{m,k+1}$ and $k = k + 1$.
Step 4: Until a predefined stopping criterion is met. Output $x_k$.
In the algorithm, the vector $x_k$ is the estimate at the end of cycle $k$, and $z_{i,k+1}$ is the intermediate estimate obtained after agent $i$ updates in the $(k+1)$st cycle. In addition, $u_{i,k+1}$ is an i.i.d. random vector, and $\Pi_X$ denotes the Euclidean projection onto the set $X$.
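A minimal sketch of the cyclic scheme, under assumed parameters and with $X$ taken as the ball $\{\|x\| \le R\}$ so that the projection has a closed form; this is an illustration of updates (3)–(4) on a small synthetic $l_1$-regression instance, not the authors' implementation.

```python
import numpy as np

def cigf(fs, x0, alpha, mu, cycles, R, rng):
    # Cyclic incremental gradient-free sketch: one pass over the m agents per
    # cycle, each agent stepping along its own gradient-free oracle (3), with
    # projection onto X = {||x|| <= R} as in (4).
    x, d = x0.copy(), x0.size
    for _ in range(cycles):
        z = x
        for f in fs:                               # agents in fixed cyclic order
            u = rng.standard_normal(d)
            g = (f(z + mu * u) - f(z)) / mu * u    # oracle (3)
            z = z - alpha * g                      # incremental step
            nz = np.linalg.norm(z)
            if nz > R:                             # projection Pi_X, cf. (4)
                z = z * (R / nz)
        x = z                                      # end-of-cycle estimate x_k
    return x

rng = np.random.default_rng(1)
m, d = 20, 3
A = rng.standard_normal((m, d))
target = np.array([1.0, -2.0, 0.5])
b = A @ target                                     # consistent synthetic data
fs = [lambda x, a=a, bi=bi: abs(a @ x - bi) for a, bi in zip(A, b)]
x_hat = cigf(fs, np.zeros(d), alpha=2e-3, mu=1e-4, cycles=3000, R=10.0, rng=rng)
```

With a constant stepsize the iterates settle in a neighborhood of the solution rather than converging exactly, consistent with Theorem 3.2.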
In terms of the updates (3) and (4), the algorithm CIGF generates a random sequence $\{x_k\}_{k\ge 1}$. Denote by $U^i_{k+1}$ the $\sigma$-field generated by the entire history of the random variables $u_{j,l}$ up to iteration $k+1$, i.e., $U^i_{k+1} = \sigma\{(u_{j,l},\ j = 1, \dots, m,\ l = 1, \dots, k);\ (u_{j,k+1},\ j = 1, \dots, i)\}$, $i = 1, \dots, m$, and let $U^m_k = U^0_{k+1}$. For any stepsize rule, we first establish a lemma revealing a basic relation for the iterates generated by the algorithm CIGF, which plays a key role in our subsequent analysis. All the results and proofs in this paper parallel those of the incremental subgradient method; see, e.g., [17,4]. The essential difference is that we use only evaluations of the agents' function values rather than subgradients. Lemma 3.1. Let $\{x_k\}_{k\ge 1}$ be the sequence generated by the algorithm CIGF. Then, for any non-increasing sequence $\{\alpha_k\}_{k\ge 1}$ of positive stepsizes, the basic relation (5) holds, where $\beta_1 = d + (d+4)^2/m$. Proof. Using the iterate update (4) and the nonexpansion property of the projection $\Pi_X$, we obtain a bound valid for any $y \in X$. Taking conditional expectations with respect to the $\sigma$-field $U^{i-1}_{k+1}$, then taking the expectation conditional on $U^m_k$, and summing over $i = 1, \dots, m$, we obtain a relation valid for any $y \in X$. We then estimate the remaining term by using Assumption 1, Lemma 2.1, and the iterate updates (3) and (4). Combining the preceding relations and letting $\beta_1 = d + (d+4)^2/m$, we obtain the desired result.
We first consider the case with a constant stepsize rule.
Theorem 3.2. Let $\{x_k\}_{k\ge 1}$ be the sequence generated by the algorithm CIGF with a constant stepsize rule, i.e., $\alpha_k = \alpha > 0$ for all $k \ge 1$. Then the error bound (6) holds. Proof. By contradiction, suppose that the result of the theorem does not hold. Then there exist an $\epsilon > 0$ and an index $\bar{k} > 0$ such that the bound fails for all $k \ge \bar{k}$. Letting $y = x^*$ in (5) and taking total expectations, and then combining the resulting relations, we arrive at an inequality that cannot hold for $k$ sufficiently large. Hence, (6) must hold.
Let $K$ represent the number of cycles. The following theorem provides an estimate of the number of cycles $K$ required to reach a given level of optimality up to an error tolerance. Let $\lceil a \rceil$ denote the smallest integer greater than or equal to $a \in \mathbb{R}$. Theorem 3.3. Let $\{x_k\}_{k\ge 1}$ be the sequence generated by the algorithm CIGF with a constant stepsize rule, i.e., $\alpha_k = \alpha > 0$ for all $k \ge 1$. Then, for any $\epsilon > 0$ and $x^* \in X^*$, the bound (7) holds, where $K = \lceil 3\|x_0 - x^*\|^2/(2\alpha\epsilon) \rceil$.
Proof. Suppose that (7) does not hold; then for all $k$ with $1 \le k \le K$, the opposite inequality holds. Letting $y = x^*$ and $\alpha_{k+1} = \alpha$ in (5) and taking total expectations gives rise to a recursion; summing these inequalities over $k = 0, \dots, K$ yields an inequality that contradicts the definition of $K$.
Remark 1. According to the iterates (3) and (4), every cycle requires $m$ subiterations, so the total number $N$ of component function evaluations required to satisfy (7) is given by $N = mK = m\lceil 3\|x_0 - x^*\|^2/(2\alpha\epsilon) \rceil$. In Theorem 3.3, for any given $\epsilon > 0$, if we choose the smoothing parameter $\mu$ and the constant stepsize $\alpha$ satisfying (8), we can achieve $\min_{1\le k\le K} E[F(x_k)] - F^* \le \epsilon$. This implies that the total number of necessary iterations is at most (9). Note that the iteration complexity bound (9) of the proposed algorithm CIGF is $O(d^2)$ times worse than that of the cyclic incremental subgradient method proposed in [16,17,4]. This can be explained by the bound on $E[\|g_i\|]$, which scales as $dL$ by Lemma 2.1 (iv), in contrast to the subgradient bound $\|g_i\| \le L$, $\forall g_i \in \partial f_i$, used in [16,17,4]. However, our method requires only the evaluation of function values rather than subgradients. When $m = 1$, the method reduces to the case considered in [19].
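As a quick arithmetic illustration of Theorem 3.3 and Remark 1, the cycle bound $K$ and the evaluation count $N = mK$ can be computed directly; all constants below are made-up example values, not taken from the paper's experiments.

```python
import math

# Illustrative (made-up) constants: m agents, constant stepsize alpha,
# tolerance eps, and an assumed squared initial distance ||x0 - x*||^2.
m, alpha, eps = 100, 1e-3, 0.01
dist_sq = 10.0 ** 2

K = math.ceil(3 * dist_sq / (2 * alpha * eps))   # cycles needed (Theorem 3.3)
N = m * K                                        # component evaluations (Remark 1)
```

The bound grows as $1/(\alpha\epsilon)$, so halving either the stepsize or the tolerance doubles the required number of cycles.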
We now consider a convergence result for a diminishing stepsize case.
Theorem 3.4. Let $\{x_k\}_{k\ge 1}$ be the sequence generated by the algorithm CIGF with a diminishing stepsize rule satisfying $\alpha_k > 0$, $\lim_{k\to\infty} \alpha_k = 0$, and $\sum_{k=1}^{\infty} \alpha_k = \infty$. Then (10) holds. Proof. By contradiction, suppose that (10) does not hold. Then there exist an $\epsilon > 0$ and an index $\bar{k} > 0$ such that the bound fails for all $k \ge \bar{k}$. Letting $y = x^*$, $x^* \in X^*$, in (5) and taking total expectations, and combining the preceding relations, we obtain an inequality involving the stepsizes. Since $\lim_{k\to\infty} \alpha_k = 0$, we may assume without loss of generality that $\bar{k}$ is large enough for the stepsize terms to be dominated. Thus, for all $k \ge \bar{k}$, we obtain an inequality that cannot hold for $k$ sufficiently large due to the condition $\sum_{k=1}^{\infty} \alpha_k = \infty$. Hence, (10) holds.
4. Incremental gradient-free method with randomized order. In this section, based on the randomized incremental subgradient method developed in [17,4], we first propose a randomized incremental gradient-free algorithm and then give its convergence analysis under different stepsize rules. Compared with the algorithm CIGF, the algorithm developed in this section is applicable to a broader class of networks. We now assume that the network of agents is fully connected [22], and develop an incremental algorithm in which the agent that updates (knowing only its own component function value) is selected randomly at each iteration. Formally, the randomized incremental gradient-free (RIGF, for short) algorithm is given as follows:
Algorithm RIGF
Initialization: Given an initial point $x_0 \in X$, stepsizes $\{\alpha_{k+1}\}_{k\ge 0}$, smoothing parameters $\mu_{k+1} > 0$. Set $k = 0$.
Step 5: Until a predefined stopping criterion is met. Output $x_k$. We assume that: 1) $\{\omega_{k+1}\}$ is a sequence of independent random variables, independent of the sequence $\{x_{k+1}\}$ [17]; 2) $\{u_{k+1}\}$ is a sequence of i.i.d. random vectors; 3) the sequences $\{\omega_{k+1}\}$ and $\{u_{k+1}\}$ are independent. Denote by $F_k = \sigma\{(\omega_j, u_j)\,|\,j = 1, \dots, k\}$ the $\sigma$-field generated by the entire history of the random variables $\omega_j$ and $u_j$ up to iteration $k$. For notational simplicity, we also assume that $\mu_{k+1} \le \mu$ for all $k$, with $\mu > 0$.
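A minimal sketch of the randomized scheme under assumed parameters, again with $X = \{\|x\| \le R\}$: each iteration draws one agent uniformly at random and queries only that agent's function values, in the spirit of update (11). This is an illustration, not the authors' implementation.

```python
import numpy as np

def rigf(fs, x0, alpha, mu, iters, R, rng):
    # Randomized incremental gradient-free sketch: at each iteration a single
    # agent omega_{k+1} is drawn uniformly; only its two function values are used.
    m, d = len(fs), x0.size
    x = x0.copy()
    for _ in range(iters):
        f = fs[rng.integers(m)]                    # equiprobable agent selection
        u = rng.standard_normal(d)
        g = (f(x + mu * u) - f(x)) / mu * u        # gradient-free oracle
        x = x - alpha * g                          # incremental step, cf. (11)
        nx = np.linalg.norm(x)
        if nx > R:                                 # projection onto X = {||x|| <= R}
            x = x * (R / nx)
    return x

rng = np.random.default_rng(3)
m, d = 20, 3
A = rng.standard_normal((m, d))
x_true = np.array([0.5, -1.0, 2.0])
b = A @ x_true                                     # consistent synthetic data
fs = [lambda x, a=a, bi=bi: abs(a @ x - bi) for a, bi in zip(A, b)]
x_hat = rigf(fs, np.zeros(d), alpha=2e-3, mu=1e-4, iters=60000, R=10.0, rng=rng)
```

Note that one RIGF iteration touches a single agent, whereas one CIGF cycle touches all $m$ agents; this is the source of the factor-$m$ difference discussed in Remark 2.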
We first deal with the case of a constant stepsize.
Theorem 4.1. Let $\{x_k\}_{k\ge 1}$ be the sequence generated by the algorithm RIGF with a constant stepsize rule, i.e., $\alpha_k = \alpha > 0$ for all $k \ge 1$. Then, with probability 1, the error bound (12) holds.
Proof. Using the update (11) and the nonexpansion property of the projection $\Pi_X$, we obtain a bound valid for any $y \in X$. Taking expectations with respect to $F_k$ in this inequality, and using Lemma 2.1 together with the assumption that $\omega_k$ and $u_k$ are independent, we obtain intermediate estimates. In addition, letting $f = f_{\omega_{k+1}}$, $x = x_k$, $\bar{x} = y$, $\nu = \mu_{k+1}$ in Lemma 2.2, then using the relation $E[g(x_k, \omega_{k+1}, u_{k+1})\,|\,F_k] = \nabla f^{\mu_{k+1}}_{\omega_{k+1}}(x_k)$ obtained from Lemma 2.1 and the bound $\mu_{k+1} \le \mu$, we obtain a further estimate. Combining the preceding relations gives rise to (13), where in the first equality we use the fact that $\omega_{k+1}$ takes the values $1, \dots, m$ with equal probability $1/m$, and in the second equality we let $\beta_2 = (d+4)^2$. Similar to the proof of Proposition 3.1 in [17], for a fixed positive scalar $\gamma$, we construct a level set $L_\gamma$ and let $y_\gamma \in X$ be such that $F(y_\gamma) = F^* + \frac{1}{\gamma}$. Obviously, $y_\gamma \in L_\gamma$ by construction. We then define a sequence $\{\hat{x}_k\}$ that is identical to the process $\{x_k\}$, except that once $x_k$ enters the level set $L_\gamma$, the process terminates with $\hat{x}_k = y_\gamma$. Letting $y = y_\gamma$ and $\alpha_{k+1} = \alpha$ in (13), we obtain (14), whose correction term $\xi_{k+1}$ satisfies $\xi_{k+1} \ge 0$ for all $k$. By (14) and the Supermartingale Convergence Theorem (see Proposition 2 in [5]), with probability 1 we obtain $\sum_{k=0}^{\infty} \xi_{k+1} < \infty$, which implies that $\hat{x}_k \in L_\gamma$ for sufficiently large $k$. Letting $\gamma \to \infty$, we obtain (12).
Next we obtain an estimate on the expected number of iterations for algorithm RIGF, which parallels Theorem 3.3 for the non-randomized case.
Theorem 4.2. Let $\{x_k\}_{k\ge 1}$ be the sequence generated by the algorithm RIGF with a constant stepsize rule, i.e., $\alpha_k = \alpha > 0$ for all $k \ge 1$. Then, for any $\epsilon > 0$ and $x^* \in X^*$, with probability 1, the bound (15) holds, where $N$ is a random variable whose expectation is bounded as derived below. Proof. Let $\hat{y} \in X^*$ be some fixed vector. Define a new process $\{\hat{x}_k\}$ which is identical to $\{x_k\}$, except that once $x_k$ enters a level set $L$, the process $\{\hat{x}_k\}$ terminates at $\hat{y}$. Similar to the proof of Theorem 4.1 (cf. Eq. (13)), for the process $\{\hat{x}_k\}$ we obtain for all $k$ the relation (16), where $\eta_{k+1}$ is defined by a case distinction depending on whether $\hat{x}_k \in L$, vanishing once the process has terminated.
In the case where $\hat{x}_k \notin L$, using the definitions of $L$ and $\eta_{k+1}$, we obtain a lower bound on $\eta_{k+1}$. By the Supermartingale Convergence Theorem (see Proposition 2 in [5]), from (16) we obtain $\sum_{k=0}^{\infty} \xi_{k+1} < \infty$ with probability 1, so that $\eta_{k+1} = 0$ for all $k \ge N$, where $N$ is a random variable. Hence $\hat{x}_N \in L$ with probability 1, implying that (17) holds in the original process with probability 1. Furthermore, taking the total expectation in (16), we obtain a bound for all $k$, where in the last inequality we use the facts $\hat{x}_0 = x_0$ and $E[\|\hat{x}_0 - x^*\|^2] = \|x_0 - x^*\|^2$. Therefore, letting $k \to \infty$ and using the definition of $\eta_{k+1}$ and relation (17), we obtain the stated bound on $E[N]$. Remark 2. For any given $\epsilon > 0$, if we choose the smoothing parameter $\mu$ and the constant stepsize $\alpha$ in Theorem 4.2 appropriately (cf. the condition (8)), we can achieve $\min_{1\le k\le N} F(x_k) - F^* \le \epsilon$, which implies that the expected number of iterations is at most (18). Comparing the bound (18) with (9), the algorithm RIGF is faster than the algorithm CIGF by a factor of $m$ in expectation. In addition, because subgradients are replaced by evaluations of objective function values only, the iteration complexity of the algorithm RIGF is $O(d^2)$ times worse than that of the randomized incremental subgradient method proposed in [16,17,4].
In parallel with the result of Theorem 3.4, we give the following convergence result for the diminishing stepsize case. Theorem 4.3. Let $\{x_k\}_{k\ge 1}$ be the sequence generated by the algorithm RIGF with a diminishing stepsize rule satisfying $\alpha_k > 0$, $\lim_{k\to\infty} \alpha_k = 0$, and $\sum_{k=1}^{\infty} \alpha_k = \infty$. Then (19) holds. Proof. To arrive at a contradiction, assume that there exist an $\epsilon > 0$ and an integer $k_1 > 0$ such that the bound fails for all $k \ge k_1$. Letting $y = x^*$, $x^* \in X^*$, in (13) and taking total expectations, we obtain a recursion. Since $\lim_{k\to\infty} \alpha_k = 0$, for the same $\epsilon$ there exists an integer $k_2 > 0$ such that the stepsize terms are dominated. Taking $k_0 = \max\{k_1, k_2\}$, for all $k \ge k_0$ we obtain an inequality that cannot hold for $k$ sufficiently large due to the condition $\sum_{k=1}^{\infty} \alpha_k = \infty$. Hence, (19) holds.
5. Numerical simulation. In this section, we illustrate the convergence behavior of the proposed incremental gradient-free algorithms as a function of the number of agents $m$ and the problem dimension $d$. We also compare our algorithms with the cyclic incremental subgradient (CISG, for short) algorithm and the randomized incremental subgradient (RISG, for short) algorithm proposed in [17] under the same stepsize update rules.
Consider a robust linear $l_1$-regression problem commonly studied in system identification [20]. Specifically, given $m$ pairs of the form $(a_i, b_i) \in \mathbb{R}^d \times \mathbb{R}$, we want to estimate a vector $x \in \mathbb{R}^d$ such that $a_i^T x \approx b_i$. The linear $l_1$-regression problem can be formulated as $\min_{\|x\| \le R} \sum_{i=1}^{m} |a_i^T x - b_i|$, where $\|x\| \le R$ is an $l_2$-norm constraint. Clearly, each component function $f_i(x) = |a_i^T x - b_i|$ is convex and $L$-Lipschitz with $L = \max_i \|a_i\|$.
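For one component $f_i(x) = |a_i^T x - b_i|$, both the Lipschitz constant and a subgradient (the quantity the CISG/RISG baselines would use) have simple closed forms; the snippet below spot-checks the Lipschitz bound on random illustrative data.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
a, b = rng.standard_normal(d), 0.7             # illustrative data for one component

f_i = lambda x: abs(a @ x - b)                 # one l1-regression component
subgrad = lambda x: np.sign(a @ x - b) * a     # a subgradient (used by CISG/RISG)
L = np.linalg.norm(a)                          # Lipschitz constant ||a_i||

# Spot-check |f_i(x) - f_i(y)| <= L ||x - y|| on random pairs.
for _ in range(100):
    x, y = rng.standard_normal(d), rng.standard_normal(d)
    assert abs(f_i(x) - f_i(y)) <= L * np.linalg.norm(x - y) + 1e-12
```

The gradient-free algorithms never evaluate `subgrad`; it is shown only to make the comparison with the subgradient-based baselines concrete.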
For a given network size m, we generate a random instance of a regression problem with m data points. In all tests, we set R = 10 and choose the parameter µ that satisfies (8).
We first consider the constant stepsize case by setting α = 0.001, with problem dimension d = 1 or 4 and number of agents m = 100 or 500. Fig. 1 depicts the value of F(x_k) − F* versus the number of cycles K for algorithms CIGF and CISG, and Fig. 2 plots the value of F(x_k) − F* versus the number of iterations N for algorithms RIGF and RISG. From both figures, we can clearly see that all algorithms achieve good convergence. By comparison, we find that our algorithms CIGF and RIGF converge to a better suboptimal value than algorithms CISG and RISG, although CIGF and RIGF require more iterations. More importantly, our algorithms exhibit much smaller oscillations than CISG and RISG under the constant stepsize setting. This can be explained by the fact that CIGF and RIGF are based on a smooth approximation of the original objective functions, whereas the subgradient-based algorithms CISG and RISG may frequently encounter iterations with F(x_{k+1}) ≮ F(x_k). Finally, in contrast to subgradient-based algorithms, our algorithms can be carried out without computing subgradients; they depend only on evaluations of function values, which is preferred by practitioners, since for nonsmooth optimization problems computing subgradients requires a certain knowledge of convex analysis.
To illustrate the incremental gradient-free methods with the diminishing stepsize rule, we fix m = 100 and d = 2, and consider the two diminishing stepsizes 1/k and 1/k^{2/3}. Fig. 3 plots the value of F(x_k) − F* versus the number of iterations under these choices. It can be observed from Fig. 3 that all algorithms achieve good convergence, but all of them are sensitive to the choice of diminishing stepsize.
The algorithms with the stepsize 1/k approach the optimal value much better than those with the stepsize 1/k^{2/3}. Under the stepsize 1/k, the convergence performance of our algorithms CIGF and RIGF is slightly better than that of algorithms CISG and RISG.
In Fig. 4, we present the actual behavior of algorithm CIGF as a function of the problem dimension d and the number of agents m, with a pre-fixed target accuracy ε = 0.01 and a constant stepsize α = 0.001. In each panel, each point on the heavy red curve is the average of 20 trials of algorithm CIGF, each point on the dotted blue curve is the average of 20 trials of algorithm CISG, and the vertical bars indicate the variability across trials. Fig. 4(a) depicts the number of iterations N(ε, d, m) versus the dimension d for a fixed number of agents m; the observed growth is consistent with the estimate (9), which is at most O(d²) times worse than that of algorithm CISG. Fig. 4(b) depicts the value of N(ε, d, m) versus the number of agents m for a fixed dimension d = 2. In Fig. 4(b), some small fluctuations in the number of iterations arise for both algorithms as m increases. The number of iterations N(ε, d, m) of algorithm CIGF is slightly greater than that of algorithm CISG. This is because we fix the dimension d = 2; thus, by the estimate (9), the number of iterations for algorithm CIGF is in theory at most d² = 4 times worse than that of algorithm CISG. These results show the excellent agreement of the empirical behavior with our theoretical predictions.
6. Conclusions. In this paper, based on the framework of the incremental approach, we have proposed incremental gradient-free methods for minimizing the sum of local convex functions distributed over multi-agent networks. We have proved the convergence of the proposed algorithms and provided iteration complexity bounds as functions of the number of agents m and the problem dimension d. A linear $l_1$-regression example was used to show the excellent agreement of the empirical behavior with our theoretical predictions. In particular, our proposed methods depend only on evaluations of function values rather than subgradients, which may be preferred by practitioners. Furthermore, for some nonsmooth problems, our methods can even achieve better numerical performance than subgradient-based methods.