THE DUAL STEP SIZE OF THE ALTERNATING DIRECTION METHOD CAN BE LARGER THAN 1.618 WHEN ONE FUNCTION IS STRONGLY CONVEX

Abstract. The alternating direction method of multipliers (ADMM) is one of the most well-known optimization schemes for solving linearly constrained separable convex problems. In the literature, Fortin and Glowinski proved that the step size for updating the Lagrange multiplier of the ADMM can be chosen in the open interval (0, (1+√5)/2), i.e., between zero and the golden ratio. However, it remains unknown whether the dual step size can be larger than the golden ratio. In this paper, for the case where one function term is strongly convex and the associated coefficient matrix has full column rank, we give an affirmative answer to this question. We then derive an exact relationship between the strong convexity modulus and the dual step size. Our analysis deepens the understanding of the convergence properties of the ADMM.

1. Introduction. In this paper, we restrict our discussion to the following convex programming problem

    min { θ₁(x) + θ₂(y) | Ax + By = b, x ∈ X, y ∈ Y },    (1)

where A ∈ ℝ^{m×n₁}, B ∈ ℝ^{m×n₂}, b ∈ ℝ^m, X ⊂ ℝ^{n₁} and Y ⊂ ℝ^{n₂} are closed convex sets, and θ₁ : ℝ^{n₁} → ℝ and θ₂ : ℝ^{n₂} → ℝ are proper, closed and convex (not necessarily smooth) functions. Throughout, we assume that the solution set of (1) is nonempty and that the matrix B has full column rank. The model (1) captures a broad range of problems arising in image sciences, signal processing and machine learning, including Lasso estimation, total variation denoising, stable principal component pursuit, etc. We refer the readers to the monographs [1,14] and the references therein for more examples.
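To make the template (1) concrete, here is a minimal sketch (not from the paper) that casts a hypothetical Lasso instance, one of the applications mentioned above, in the form of (1); the data C, d and the weight mu are invented for illustration.

```python
import numpy as np

# A hypothetical Lasso instance written in the form of problem (1):
#   min 0.5*||C x - d||^2 + mu*||y||_1   s.t.  x - y = 0,
# i.e., theta1(x) = 0.5*||C x - d||^2, theta2(y) = mu*||y||_1,
# A = I, B = -I, b = 0 and X = Y = R^n. C, d and mu are invented data.
rng = np.random.default_rng(0)
p, n = 20, 5
C, d, mu = rng.standard_normal((p, n)), rng.standard_normal(p), 0.1

theta1 = lambda x: 0.5 * np.sum((C @ x - d) ** 2)
theta2 = lambda y: mu * np.sum(np.abs(y))
A, B, b = np.eye(n), -np.eye(n), np.zeros(n)  # B = -I has full column rank
```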
In the literature, many first-order optimization methods have been proposed to solve (1). Among them, the alternating direction method of multipliers (ADMM), which was introduced by Glowinski and Marroco [10] and Gabay and Mercier [11], and subsequently studied, e.g., in [6,7,8,24], has regained much attention in recent years; we refer to the surveys [2,13] for an overview of its recent developments. For problem (1), the ADMM iterates as follows:

    x^{k+1} ∈ arg min { θ₁(x) − (λ^k)ᵀAx + (β/2)‖Ax + By^k − b‖² | x ∈ X },    (2a)
    y^{k+1} ∈ arg min { θ₂(y) − (λ^k)ᵀBy + (β/2)‖Ax^{k+1} + By − b‖² | y ∈ Y },    (2b)
    λ^{k+1} = λ^k − β(Ax^{k+1} + By^{k+1} − b),    (2c)

where β > 0 is a penalty parameter for violation of the linear constraints. Note that the ADMM decomposes the minimization problem (1) into two smaller subproblems so that each subproblem involves only one function θ_i(·). In this case, the generated subproblems can be much easier to solve, especially when θ₁(·) and θ₂(·) have special structures to be exploited. In addition, since the variables are treated individually, the computations of the ADMM can be implemented in distributed systems. Note that the ADMM enjoys a global convergence property, which can be derived from its relation to the Douglas-Rachford operator splitting method; see [6]. Furthermore, convergence rate results for the ADMM can be found in, e.g., [5,18,19,20].

There are many practical variants of the ADMM. One famous variant, proposed by Fortin and Glowinski in [9], appears in many guises in the literature. This variant, which attaches a relaxation factor to the Lagrange-multiplier-updating step of the ADMM, reads as follows:

    x^{k+1} ∈ arg min { θ₁(x) − (λ^k)ᵀAx + (β/2)‖Ax + By^k − b‖² | x ∈ X },    (3a)
    y^{k+1} ∈ arg min { θ₂(y) − (λ^k)ᵀBy + (β/2)‖Ax^{k+1} + By − b‖² | y ∈ Y },    (3b)
    λ^{k+1} = λ^k − γβ(Ax^{k+1} + By^{k+1} − b),    (3c)

where the relaxation factor γ is allowed to be in the interval (0, (1+√5)/2). Note that the variant (3) differs from the ADMM (2) only in that the step size for updating the Lagrange multiplier lies in the range (0, (1+√5)/2). Thus, the scheme (3) enjoys the same ease of implementation as the ADMM. It is worthwhile to mention that although the two algorithms are very similar, the convergence study of the variant (3) is very different from that of the ADMM. Indeed, as remarked in [8], they are "actually two distinct families of ADMM algorithms, one derived from the operator-splitting framework and the other derived from Lagrangian splitting". Numerically, taking a larger value of γ usually helps speed up the performance, and it has been observed empirically in many cases that taking the step size close to its upper bound, i.e., γ = 1.618, yields the best performance. We refer to, e.g., [12,21,23] for some numerical demonstrations. Some other variants and extensions of the ADMM, together with their convergence properties, can be found in, e.g., [3,4,5,16,17].
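For readers who prefer code, the following minimal Python sketch mirrors the iteration (3); the solver handles prox1 and prox2 are hypothetical stand-ins for the two subproblems (3a) and (3b) and are not part of the paper.

```python
import numpy as np

def admm_variant(prox1, prox2, A, B, b, beta, gamma, y0, lam0, iters=500):
    """A minimal sketch of scheme (3). `prox1(v, rho)` is a hypothetical solver
    handle assumed to return argmin_x theta1(x) + (rho/2)*||A x - v||^2 over X
    (similarly `prox2` over B and Y); neither is part of the paper."""
    y, lam = y0.copy(), lam0.copy()
    for _ in range(iters):
        x = prox1(b - B @ y + lam / beta, beta)         # (3a): x-subproblem
        y = prox2(b - A @ x + lam / beta, beta)         # (3b): y-subproblem
        lam = lam - gamma * beta * (A @ x + B @ y - b)  # (3c): dual step gamma
    return x, y, lam
```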
Recall that γ ∈ (0, (1+√5)/2) is only a sufficient condition to ensure the convergence of the ADMM, and that a larger value of γ can accelerate its convergence. It is therefore reasonable to ask whether γ in (3) can be larger than (1+√5)/2. Indeed, this issue stands as an open question raised by Glowinski in [12]. In [11], Gabay and Mercier proved that the ADMM with γ ∈ (0, 2) is convergent if one of the two functions in the objective is linear. Recently, the authors of [22] showed the convergence of the ADMM scheme (3) with γ ∈ (0, 2) when both functions θ₁ and θ₂ in (1) are quadratic. In this work, we are concerned with the case where one function term is strongly convex. We prove that the dual step size can be larger than 1.618 in this case, and we derive an exact relationship between the modulus of the strongly convex function and the dual step size. This result partially answers Glowinski's open question and advances the understanding of the convergence properties of the ADMM.
The paper is organized as follows: in Section 2, we present the assumptions and summarize the necessary background; in Section 3, we prove the main convergence results. Finally, we conclude the work in Section 4.

2. Preliminaries. Throughout this paper, we assume that θ₂(y) is strongly convex with modulus τ > 0. That is, there is a positive constant τ such that for any y, y′ ∈ Y and any η ∈ ∂θ₂(y), we have

    θ₂(y′) ≥ θ₂(y) + ηᵀ(y′ − y) + (τ/2)‖y′ − y‖².    (4)

Unless otherwise stated, the results in this paper are given under this assumption.
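As a sanity check of (4), the following sketch verifies the strong-convexity inequality numerically for a hypothetical quadratic θ₂; the matrix P and vector q are invented data, and τ = λ_min(P) is the modulus in this case.

```python
import numpy as np

# A numerical check of inequality (4) for a hypothetical strongly convex
# quadratic theta2(y) = 0.5*y'P y + q'y, whose modulus is tau = lambda_min(P)
# and whose unique subgradient at y is P y + q.
rng = np.random.default_rng(1)
n = 4
R = rng.standard_normal((n, n))
P = R.T @ R + np.eye(n)                  # symmetric positive definite
q = rng.standard_normal(n)
tau = np.linalg.eigvalsh(P).min()

theta2 = lambda y: 0.5 * y @ P @ y + q @ y
for _ in range(100):
    y, z = rng.standard_normal(n), rng.standard_normal(n)
    eta = P @ y + q                      # the subgradient of theta2 at y
    gap = theta2(z) - theta2(y) - eta @ (z - y)
    assert gap >= 0.5 * tau * (z - y) @ (z - y) - 1e-9
```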

Remark 1. Since τ is positive and the matrix B is assumed to have full column rank (so that ρ(BᵀB) > 0), there exists a positive constant s such that τ = s·βρ(BᵀB), where ρ(·) denotes the spectral radius of a matrix.
Let

    L(x, y, λ) = θ₁(x) + θ₂(y) − λᵀ(Ax + By − b)    (5)

be the Lagrangian function of (1), and let ∂θ₁(x) and ∂θ₂(y) denote the subdifferentials of θ₁(x) and θ₂(y), respectively. Then, finding a saddle point of (5) is equivalent to finding w* = (x*, y*, λ*) ∈ Ω := X × Y × ℝ^m, ξ* ∈ ∂θ₁(x*), and η* ∈ ∂θ₂(y*) such that the following variational inequalities (VI) hold:

    (x − x*)ᵀ(ξ* − Aᵀλ*) ≥ 0, ∀ x ∈ X,
    (y − y*)ᵀ(η* − Bᵀλ*) ≥ 0, ∀ y ∈ Y,    (6)
    Ax* + By* − b = 0.

We denote by Ω* the set of all such w*. Since θ₁(·) is convex and θ₂(·) is strongly convex, we have

    θ₁(x) − θ₁(x*) ≥ (x − x*)ᵀξ*, ∀ x ∈ X,    (7)

and

    θ₂(y) − θ₂(y*) ≥ (y − y*)ᵀη* + (τ/2)‖y − y*‖², ∀ y ∈ Y,    (8)

respectively. From the optimality condition (6), we thus have the following equivalent form:

    θ(u) − θ(u*) + (w − w*)ᵀF(w*) ≥ (τ/2)‖y − y*‖², ∀ w ∈ Ω,    (9a)

where

    u = (x, y), θ(u) = θ₁(x) + θ₂(y), w = (x, y, λ), and F(w) = ( −Aᵀλ, −Bᵀλ, Ax + By − b ).    (9b)

Note that the mapping F is affine with a skew-symmetric coefficient matrix, so (w − w̄)ᵀ(F(w) − F(w̄)) = 0 for any w and w̄.

To facilitate the analysis of the ADMM, we introduce some basic notations. First, for the iterate (x^{k+1}, y^{k+1}, λ^{k+1}) generated by the ADMM (3), we define an auxiliary vector w̃^k = (x̃^k, ỹ^k, λ̃^k) as

    x̃^k = x^{k+1}, ỹ^k = y^{k+1},    (10a)
    λ̃^k = λ^k − β(Ax^{k+1} + By^k − b).    (10b)

On page 14 of [2], it was commented that the algorithm state of the ADMM consists of y^k and λ^k, while x^k is an intermediate result that is not needed to execute the next iteration; thus (y, λ) are the variables essentially involved in the ADMM iteration. Following this idea, for w = (x, y, λ) and the iterate w^k = (x^k, y^k, λ^k) generated by (3), we use the notations

    v = (y, λ) and v^k = (y^k, λ^k)

to denote the essential parts of w and w^k, respectively. We denote the essential part of w* ∈ Ω* by v* = (y*, λ*) and let V* denote the collection of all such v*. Now, we give a lemma showing the relation among v^k, ṽ^k, and v^{k+1}.

Lemma 2.1. Let w^{k+1} be generated by the ADMM (3) and w̃^k be defined by (10). Then, we have

    v^{k+1} = v^k − M(v^k − ṽ^k),    (11)

where

    M = [  I      0
          −γβB   γI ].

Proof. It follows from (3c) and (10) that

    λ^{k+1} = λ^k − γβ(Ax^{k+1} + By^{k+1} − b) = λ^k − γ[ −βB(y^k − ỹ^k) + (λ^k − λ̃^k) ].

Together with y^{k+1} = ỹ^k, we have the following relationship:

    [ y^{k+1} ]   [ y^k ]   [  I      0  ] [ y^k − ỹ^k ]
    [ λ^{k+1} ] = [ λ^k ] − [ −γβB   γI ] [ λ^k − λ̃^k ].

The proof is complete.
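Under the reconstruction above, the correction relation (11) can be verified numerically; the following sketch uses randomly generated iterates in place of actual subproblem solutions, since (11) is an algebraic identity that does not depend on how x^{k+1} and y^{k+1} are produced.

```python
import numpy as np

# A small sanity check that the correction step v^{k+1} = v^k - M (v^k - vtil^k)
# reproduces the multiplier update of scheme (3), using random "iterates".
rng = np.random.default_rng(2)
m, n1, n2 = 3, 4, 5
A, B = rng.standard_normal((m, n1)), rng.standard_normal((m, n2))
b = rng.standard_normal(m)
beta, gamma = 1.5, 1.3

xk1, yk, yk1 = rng.standard_normal(n1), rng.standard_normal(n2), rng.standard_normal(n2)
lamk = rng.standard_normal(m)

lamtil = lamk - beta * (A @ xk1 + B @ yk - b)          # (10b)
lamk1 = lamk - gamma * beta * (A @ xk1 + B @ yk1 - b)  # (3c)

M = np.block([[np.eye(n2), np.zeros((n2, m))],
              [-gamma * beta * B, gamma * np.eye(m)]])
vk = np.concatenate([yk, lamk])
vtil = np.concatenate([yk1, lamtil])                   # ytil^k = y^{k+1}
vk1 = np.concatenate([yk1, lamk1])
assert np.allclose(vk1, vk - M @ (vk - vtil))          # identity (11)
```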
3. Convergence analysis. In this section, we present the convergence results of the ADMM with a larger step size when one function is strongly convex. We first establish several lemmas, which will be used to prove our main convergence results.
Lemma 3.1. Let w^{k+1} be generated by the ADMM (3) and w̃^k be defined by (10). Then, we have w̃^k ∈ Ω and

    θ(u) − θ(ũ^k) + (w − w̃^k)ᵀF(w̃^k) ≥ (v − ṽ^k)ᵀQ(v^k − ṽ^k), ∀ w ∈ Ω,    (12a)

where

    Q = [ βBᵀB     0
           −B    (1/β)I ].    (12b)

Proof. Using the optimality condition of the x-subproblem (3a), we have

    θ₁(x) − θ₁(x^{k+1}) + (x − x^{k+1})ᵀ( −Aᵀλ^k + βAᵀ(Ax^{k+1} + By^k − b) ) ≥ 0, ∀ x ∈ X.

Recalling the auxiliary vector w̃^k defined in (10), we have

    θ₁(x) − θ₁(x̃^k) + (x − x̃^k)ᵀ(−Aᵀλ̃^k) ≥ 0, ∀ x ∈ X.    (13a)

For the y-subproblem (3b), we get the optimality condition

    θ₂(y) − θ₂(y^{k+1}) + (y − y^{k+1})ᵀ( −Bᵀλ^k + βBᵀ(Ax^{k+1} + By^{k+1} − b) ) ≥ 0, ∀ y ∈ Y.

Rearranging the terms in the above inequality and using the notations defined in (10), we have

    θ₂(y) − θ₂(ỹ^k) + (y − ỹ^k)ᵀ(−Bᵀλ̃^k) ≥ (y − ỹ^k)ᵀβBᵀB(y^k − ỹ^k), ∀ y ∈ Y.    (13b)

For λ̃^k defined by (10b), we have Ax̃^k + By^k − b = (1/β)(λ^k − λ̃^k), and it can be written as

    (Ax̃^k + Bỹ^k − b) − [ −B(y^k − ỹ^k) + (1/β)(λ^k − λ̃^k) ] = 0.    (13c)

Combining (13a), (13b) and (13c), and also noting the definitions in (9b), the assertion of this lemma is proved.
In (12a), the term (v − ṽ^k)ᵀQ(v^k − ṽ^k) on the right-hand side plays a key role in evaluating the solution point, and we now focus on investigating it. Recalling the correction step (11), i.e., M(v^k − ṽ^k) = v^k − v^{k+1}, and substituting it into the right-hand side of (12a), we have

    (v − ṽ^k)ᵀQ(v^k − ṽ^k) = (v − ṽ^k)ᵀH(v^k − v^{k+1}),    (14)

where

    H = QM^{−1} = [ βBᵀB       0
                     0     (1/(γβ))I ].    (15)

Since the matrix B is assumed to have full column rank and the parameters β, γ are both positive, H is positive definite. Now, substituting (14) into (12a), we obtain

    θ(u) − θ(ũ^k) + (w − w̃^k)ᵀF(w̃^k) ≥ (v − ṽ^k)ᵀH(v^k − v^{k+1}), ∀ w ∈ Ω.    (16)

With the above analysis, we now present the following lemma, which is the basis for proving the strict contraction property of the scheme (3). It is also useful for estimating the convergence rate of the sequence generated by (3).
Lemma 3.2. Let w^{k+1} be generated by the ADMM (3) and w̃^k be defined by (10). Then, we have

    θ(u) − θ(ũ^k) + (w − w̃^k)ᵀF(w̃^k) ≥ (1/2)( ‖v − v^{k+1}‖²_H − ‖v − v^k‖²_H ) + (1/2)‖v^k − ṽ^k‖²_G, ∀ w ∈ Ω,    (17)

where H is defined in (15) and

    G := Qᵀ + Q − MᵀHM.    (18)

Proof. Since H is positive definite, applying the identity

    (a − b)ᵀH(c − d) = (1/2)( ‖a − d‖²_H − ‖a − c‖²_H ) + (1/2)( ‖c − b‖²_H − ‖d − b‖²_H )    (19)

to the right-hand side of (14) with a = v, b = ṽ^k, c = v^k and d = v^{k+1}, we obtain

    (v − ṽ^k)ᵀH(v^k − v^{k+1}) = (1/2)( ‖v − v^{k+1}‖²_H − ‖v − v^k‖²_H ) + (1/2)( ‖v^k − ṽ^k‖²_H − ‖v^{k+1} − ṽ^k‖²_H ).    (20)

For the last term on the right-hand side of (20), using (11) we have

    ‖v^k − ṽ^k‖²_H − ‖v^{k+1} − ṽ^k‖²_H = ‖v^k − ṽ^k‖²_H − ‖(v^k − ṽ^k) − M(v^k − ṽ^k)‖²_H = (v^k − ṽ^k)ᵀ(Qᵀ + Q − MᵀHM)(v^k − ṽ^k).    (21)

Substituting (21) into (20) and using the matrix G defined by (18), the assertion of this lemma is proved.

Now, let us make a further investigation of the matrix G defined by (18). Since HM = Q (see (15)), we have MᵀHM = MᵀQ. Note that

    MᵀQ = [ I   −γβBᵀ ] [ βBᵀB     0    ]   [ (1+γ)βBᵀB   −γBᵀ  ]
          [ 0     γI  ] [  −B    (1/β)I ] = [    −γB     (γ/β)I ].

Using the above equation, we have

    G = Qᵀ + Q − MᵀQ = [ (1−γ)βBᵀB    (γ−1)Bᵀ
                          (γ−1)B    ((2−γ)/β)I ].    (22)

Obviously, when γ > 1 the matrix G is symmetric but not necessarily positive definite. In this case, we slightly abuse the notation ‖v‖²_G to denote the quadratic form vᵀGv. Then, we have the following theorem.
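The block matrices Q, M, H and G and the relations HM = Q and (22) can be checked numerically; in the following sketch (assuming the reconstructions above), the printed minimum eigenvalue of G illustrates how positive definiteness is lost once γ exceeds 1.

```python
import numpy as np

# Build Q, M, H = Q M^{-1} and G = Q' + Q - M'HM for a random full-column-rank
# B, check HM = Q and that H is positive definite, and inspect how the
# definiteness of G depends on gamma.
rng = np.random.default_rng(3)
m, n2 = 6, 3
B = rng.standard_normal((m, n2))   # full column rank with probability one
beta = 0.8
I_m, I_n = np.eye(m), np.eye(n2)

def blocks(gamma):
    Q = np.block([[beta * B.T @ B, np.zeros((n2, m))], [-B, I_m / beta]])
    M = np.block([[I_n, np.zeros((n2, m))], [-gamma * beta * B, gamma * I_m]])
    H = Q @ np.linalg.inv(M)
    G = Q.T + Q - M.T @ H @ M
    return Q, M, H, G

for gamma in (0.8, 1.0, 1.3, 1.7):
    Q, M, H, G = blocks(gamma)
    assert np.allclose(H @ M, Q) and np.linalg.eigvalsh((H + H.T) / 2).min() > 0
    print(gamma, np.linalg.eigvalsh((G + G.T) / 2).min())  # drops below 0 for gamma > 1
```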
Theorem 3.3. Let w^{k+1} be generated by the ADMM (3) and w̃^k be defined by (10). Then, for any v* ∈ V*, we have

    ‖v^{k+1} − v*‖²_H ≤ ‖v^k − v*‖²_H − ‖v^k − ṽ^k‖²_G − τ‖ỹ^k − y*‖².    (23)

Proof. Setting w = w* in (17), we get

    θ(u*) − θ(ũ^k) + (w* − w̃^k)ᵀF(w̃^k) ≥ (1/2)( ‖v* − v^{k+1}‖²_H − ‖v* − v^k‖²_H ) + (1/2)‖v^k − ṽ^k‖²_G.    (24)

On the other hand, setting w = w̃^k in (9a), we get

    θ(ũ^k) − θ(u*) + (w̃^k − w*)ᵀF(w*) ≥ (τ/2)‖ỹ^k − y*‖².    (25)

Adding (24) and (25), we have

    (w̃^k − w*)ᵀ( F(w*) − F(w̃^k) ) ≥ (1/2)( ‖v^{k+1} − v*‖²_H − ‖v^k − v*‖²_H ) + (1/2)‖v^k − ṽ^k‖²_G + (τ/2)‖ỹ^k − y*‖².    (26)

Recalling the fact (w̃^k − w*)ᵀ( F(w*) − F(w̃^k) ) = 0, we obtain the assertion.
Recalling (22), when 0 < γ < 1 the matrix G is positive definite (and positive semidefinite when γ = 1). In this case, the inequality (23) indicates that the sequence {v^k} generated by (3) is strictly contractive with respect to the solution set V*, and the convergence of the sequence can be easily established. However, when γ > 1, G is not positive definite, and we need to further investigate the quadratic term ‖v^k − ṽ^k‖²_G. This is done in the following lemma.

Lemma 3.4. Let w^{k+1} be generated by the ADMM (3) and w̃^k be defined by (10). Then, we have

    ‖v^k − ṽ^k‖²_G = (2−γ)β‖Ax^{k+1} + By^{k+1} − b‖² + β‖B(y^k − y^{k+1})‖² + 2β(Ax^{k+1} + By^{k+1} − b)ᵀB(y^k − y^{k+1}).    (27)

Proof. By the definition of G (see (22)) and v = (y, λ), we have

    ‖v^k − ṽ^k‖²_G = (1−γ)β‖B(y^k − ỹ^k)‖² + 2(γ−1)(y^k − ỹ^k)ᵀBᵀ(λ^k − λ̃^k) + ((2−γ)/β)‖λ^k − λ̃^k‖².    (28)

Notice that (see (10))

    ỹ^k = y^{k+1} and λ^k − λ̃^k = β(Ax^{k+1} + By^k − b) = β(Ax^{k+1} + By^{k+1} − b) + βB(y^k − y^{k+1}).    (29)

Substituting (29) into the right-hand side of (28) and expanding the squares, the coefficients of ‖B(y^k − y^{k+1})‖² sum to (1−γ)β + 2(γ−1)β + (2−γ)β = β, the coefficients of the crossing term (Ax^{k+1} + By^{k+1} − b)ᵀB(y^k − y^{k+1}) sum to 2(γ−1)β + 2(2−γ)β = 2β, and the coefficient of ‖Ax^{k+1} + By^{k+1} − b‖² is (2−γ)β, which is exactly (27). This completes the proof.
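The identity (27) is easy to test on random data; the following sketch (again under the reconstructions above) compares both sides directly.

```python
import numpy as np

# Checking identity (27) on random data: the G-weighted norm of v^k - vtil^k
# expands into three terms involving the residual a = Ax^{k+1} + By^{k+1} - b
# and d = B(y^k - y^{k+1}).
rng = np.random.default_rng(4)
m, n1, n2, beta, gamma = 3, 4, 5, 1.1, 1.5
A, B, b = rng.standard_normal((m, n1)), rng.standard_normal((m, n2)), rng.standard_normal(m)
xk1, yk, yk1, lamk = (rng.standard_normal(k) for k in (n1, n2, n2, m))

a = A @ xk1 + B @ yk1 - b                      # residual at the new iterate
d = B @ (yk - yk1)
lamtil = lamk - beta * (A @ xk1 + B @ yk - b)  # (10b)

G = np.block([[(1 - gamma) * beta * B.T @ B, (gamma - 1) * B.T],
              [(gamma - 1) * B, (2 - gamma) / beta * np.eye(m)]])
dv = np.concatenate([yk - yk1, lamk - lamtil])
lhs = dv @ G @ dv
rhs = (2 - gamma) * beta * a @ a + beta * d @ d + 2 * beta * a @ d
assert np.isclose(lhs, rhs)                    # identity (27)
```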
In the following, we will treat the crossing term on the right-hand side of (27), namely β(Ax^{k+1} + By^{k+1} − b)ᵀB(y^k − y^{k+1}).
We want to find a lower bound for it.
Lemma 3.5. Let {w^k} be generated by the ADMM (3) with γ ≥ 1. Then, for any t > 0, we have

    β(Ax^{k+1} + By^{k+1} − b)ᵀB(y^k − y^{k+1}) ≥ τ‖y^k − y^{k+1}‖² − ((γ−1)β/2)( t‖Ax^k + By^k − b‖² + (1/t)‖B(y^k − y^{k+1})‖² ).    (30)

Proof. The optimality condition of the y-subproblem (3b) is: there exists η^{k+1} ∈ ∂θ₂(y^{k+1}) such that

    (y − y^{k+1})ᵀ( η^{k+1} − Bᵀλ^k + βBᵀ(Ax^{k+1} + By^{k+1} − b) ) ≥ 0, ∀ y ∈ Y.    (31)

Analogously, for the previous iteration, there exists η^k ∈ ∂θ₂(y^k) such that

    (y − y^k)ᵀ( η^k − Bᵀλ^{k−1} + βBᵀ(Ax^k + By^k − b) ) ≥ 0, ∀ y ∈ Y.    (32)

Setting y = y^k and y = y^{k+1} in (31) and (32), respectively, and then adding them, we get

    (y^k − y^{k+1})ᵀ(η^{k+1} − η^k) + (y^k − y^{k+1})ᵀBᵀ[ (λ^{k−1} − λ^k) + β(Ax^{k+1} + By^{k+1} − b) − β(Ax^k + By^k − b) ] ≥ 0.    (33)

Note that in the (k−1)-th iteration (see (3c)), we have

    λ^k = λ^{k−1} − γβ(Ax^k + By^k − b).    (34)

Substituting (34) into (33), and noting that the strong convexity of θ₂ implies (y^k − y^{k+1})ᵀ(η^k − η^{k+1}) ≥ τ‖y^k − y^{k+1}‖², with a simple manipulation we have

    β(Ax^{k+1} + By^{k+1} − b)ᵀB(y^k − y^{k+1}) ≥ τ‖y^k − y^{k+1}‖² + (1−γ)β(Ax^k + By^k − b)ᵀB(y^k − y^{k+1}).    (35)

Then, applying the Cauchy-Schwarz inequality to the crossing term on the right-hand side of (35), i.e.,

    (1−γ)β(Ax^k + By^k − b)ᵀB(y^k − y^{k+1}) ≥ −((γ−1)β/2)( t‖Ax^k + By^k − b‖² + (1/t)‖B(y^k − y^{k+1})‖² ), ∀ t > 0,    (36)

this lemma is proved.
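The bound (30) can also be probed numerically; the following sketch (under the reconstructions above) runs two iterations of scheme (3) on an invented unconstrained quadratic model and compares both sides of (30) for a few values of t.

```python
import numpy as np

# Two iterations of scheme (3) on a random quadratic model with
# theta1(x) = 0.5*||x||^2 and theta2(y) = 0.5*tau*||y||^2 (modulus tau),
# followed by a check of (30). All data are invented.
rng = np.random.default_rng(6)
m, n1, n2, beta, gamma, tau = 3, 4, 5, 1.0, 1.7, 2.0
A, B, b = rng.standard_normal((m, n1)), rng.standard_normal((m, n2)), rng.standard_normal(m)

y, lam = rng.standard_normal(n2), rng.standard_normal(m)
iterates = []
for _ in range(2):
    x = np.linalg.solve(np.eye(n1) + beta * A.T @ A, A.T @ (lam - beta * (B @ y - b)))
    y = np.linalg.solve(tau * np.eye(n2) + beta * B.T @ B, B.T @ (lam - beta * (A @ x - b)))
    lam = lam - gamma * beta * (A @ x + B @ y - b)   # (3c)
    iterates.append((x, y))
(xk, yk), (xk1, yk1) = iterates
ak, ak1 = A @ xk + B @ yk - b, A @ xk1 + B @ yk1 - b
lhs = beta * ak1 @ B @ (yk - yk1)
for t in (0.5, 1.0, 2.0):
    rhs = tau * (yk - yk1) @ (yk - yk1) - 0.5 * (gamma - 1) * beta * (
        t * ak @ ak + (1.0 / t) * np.dot(B @ (yk - yk1), B @ (yk - yk1)))
    assert lhs >= rhs - 1e-9                          # inequality (30)
```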
Based on Lemma 3.4 and Lemma 3.5, we now establish the following results.

Corollary 3.6. Let {w^k} be generated by the ADMM (3) with γ ≥ 1. Then, for any t > 0, we have

    ‖v^k − ṽ^k‖²_G ≥ (2−γ)β‖Ax^{k+1} + By^{k+1} − b‖² − (γ−1)tβ‖Ax^k + By^k − b‖² + ( 1 − (γ−1)/t )β‖B(y^k − y^{k+1})‖² + 2τ‖y^k − y^{k+1}‖².    (37)
Theorem 3.7. Let the sequence {w^k} be generated by the ADMM (3) with γ ≥ 1. Then, for any t > 0 and any v* ∈ V*, we have

    ‖v^{k+1} − v*‖²_H + (γ−1)tβ‖Ax^{k+1} + By^{k+1} − b‖²
      ≤ ‖v^k − v*‖²_H + (γ−1)tβ‖Ax^k + By^k − b‖² − τ‖ỹ^k − y*‖²
        − [ (2−γ) − (γ−1)t ]β‖Ax^{k+1} + By^{k+1} − b‖² − [ ( 1 − (γ−1)/t )β‖B(y^k − y^{k+1})‖² + 2τ‖y^k − y^{k+1}‖² ].    (38)

Proof. Substituting (37) into (23) and with a simple manipulation, namely splitting (2−γ)β‖Ax^{k+1} + By^{k+1} − b‖² into (γ−1)tβ‖Ax^{k+1} + By^{k+1} − b‖² + [(2−γ) − (γ−1)t]β‖Ax^{k+1} + By^{k+1} − b‖², we obtain the assertion directly.
According to our assumption that τ = sβρ(BᵀB) (see Remark 1), we have

    τ‖y^k − y^{k+1}‖² ≥ sβ‖B(y^k − y^{k+1})‖²,    (39)

since ‖B(y^k − y^{k+1})‖² ≤ ρ(BᵀB)‖y^k − y^{k+1}‖². So, if the following two inequalities hold:

    (2−γ) − (γ−1)t > 0 and 1 + 2s − (γ−1)/t > 0,    (40)

then the last two bracketed terms on the right-hand side of (38) are strictly positive whenever the corresponding residuals are nonzero. The two inequalities in (40) admit a common solution t > 0 if and only if

    (γ−1)/(1 + 2s) < (2−γ)/(γ−1), i.e., (γ−1)² < (2−γ)(1 + 2s).    (41)

Now, by solving the two inequalities, we have

    0 < γ < ( 1 − 2s + √(4s² + 12s + 5) ) / 2.    (42)

With the above analysis, we are now ready to state the convergence result of the algorithm.
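The following small sketch tabulates the reconstructed upper bound in (42) as a function of s; it returns the golden ratio at s = 0, increases with s, and stays below 2.

```python
import numpy as np

# The upper bound in (42)/(43) as a function of s (a sketch of the
# reconstruction): gamma_max(s) = (1 - 2s + sqrt(4s^2 + 12s + 5)) / 2.
def gamma_max(s):
    return (1 - 2 * s + np.sqrt(4 * s**2 + 12 * s + 5)) / 2

for s in (0.0, 0.1, 0.5, 1.0, 10.0):
    print(f"s = {s:5.1f}  ->  gamma_max = {gamma_max(s):.4f}")
# s = 0 gives (1 + sqrt(5))/2 = 1.6180...; the bound increases with s
# but never reaches 2.
```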
Theorem 3.8. Suppose that θ₂(·) is τ-strongly convex and τ = sβρ(BᵀB). Then, for the step size

    γ ∈ ( 0, ( 1 − 2s + √(4s² + 12s + 5) ) / 2 ),    (43)

the sequence {v^k} generated by the ADMM (3) converges to a solution point v^∞ ∈ V*.
Proof. Let (y^0, λ^0) be the initial point and let x^1 be computed from (y^0, λ^0) via (3a).
For γ in the range (43), we can fix a t satisfying the two inequalities in (40), so that both bracketed coefficients in (38) are positive. Summing (38) over k, we obtain

    lim_{k→∞} ‖Ax^k + By^k − b‖ = 0 and lim_{k→∞} ‖y^k − y^{k−1}‖ = 0,    (44)

where the second limit uses the full column rank of B. Furthermore, we know from (38) that the sequence {v^k} is bounded. Thus it must have a cluster point v^∞ = (y^∞, λ^∞), and there exists a subsequence {v^{k_j}} converging to this point. We now prove that v^∞ is a solution point of (6). Let x̃^∞ be induced by (3a) with (y^∞, λ^∞) given. Recall that the matrix B is assumed to have full column rank. From (44), we immediately have

    lim_{j→∞} (v^{k_j} − ṽ^{k_j}) = 0, and hence ṽ^{k_j} → v^∞.

Then, it follows from (12a), by taking the limit along this subsequence, that

    θ(u) − θ(u^∞) + (w − w^∞)ᵀF(w^∞) ≥ 0, ∀ w ∈ Ω.

We obtain that w̃^∞ = w^∞ = (x̃^∞, y^∞, λ^∞) is a solution point of (6). On the other hand, since (38) holds for any solution point of (6) and w^∞ ∈ Ω*, we have

    ‖v^{k+1} − v^∞‖²_H + (γ−1)tβ‖Ax^{k+1} + By^{k+1} − b‖² ≤ ‖v^k − v^∞‖²_H + (γ−1)tβ‖Ax^k + By^k − b‖².

Because lim_{k→∞} ‖Ax^k + By^k − b‖² = 0 and lim_{k→∞} ‖y^k − y^{k−1}‖² = 0, the sequence {v^k} cannot have another cluster point, and thus it converges to the solution point v* = v^∞ ∈ V*.
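As a quick illustration of Theorem 3.8 (a toy sketch, not the paper's experiment), the following run applies scheme (3) to a randomly generated problem with quadratic objectives, choosing γ just below the reconstructed bound (43); for this data γ typically exceeds 1.618, and the residual is observed to vanish.

```python
import numpy as np

# Toy run of scheme (3): theta1(x) = 0.5*||x||^2, theta2(y) = 0.5*y'P2 y with
# P2 = 2I (so tau = 2), and gamma chosen just below the reconstructed bound
# (43). All data are invented for illustration.
rng = np.random.default_rng(5)
m, n1, n2 = 4, 6, 5
A, B = rng.standard_normal((m, n1)), rng.standard_normal((m, n2))
b = rng.standard_normal(m)
P2, beta = 2.0 * np.eye(n2), 1.0
s = 2.0 / (beta * np.linalg.eigvalsh(B.T @ B).max())       # Remark 1
gamma = 0.99 * (1 - 2 * s + np.sqrt(4 * s**2 + 12 * s + 5)) / 2
print("gamma =", round(gamma, 4))                          # may exceed 1.618 here

y, lam = np.zeros(n2), np.zeros(m)
for k in range(2000):
    # (3a): min 0.5||x||^2 - lam'Ax + 0.5*beta*||Ax + By - b||^2
    x = np.linalg.solve(np.eye(n1) + beta * A.T @ A,
                        A.T @ (lam - beta * (B @ y - b)))
    # (3b): min 0.5*y'P2 y - lam'By + 0.5*beta*||Ax + By - b||^2
    y = np.linalg.solve(P2 + beta * B.T @ B,
                        B.T @ (lam - beta * (A @ x - b)))
    lam = lam - gamma * beta * (A @ x + B @ y - b)         # (3c)
print("final residual:", np.linalg.norm(A @ x + B @ y - b))
```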

Remark 2. To the best of our knowledge, this is the first time that such a formulation (43) is presented. As we can see, when s = 0, the parameter γ lies in (0, (1+√5)/2), which recovers Glowinski's classical result. Moreover, the upper bound in (43) is monotonically increasing with respect to s, so the step size can be larger than (1+√5)/2 whenever s is positive. Thus the step size is further enlarged compared with the golden-ratio step size.

Remark 3. Although the formula (43) is expressed in terms of s, we can easily estimate the feasible range of γ for given τ, β and matrix B by using Remark 1. This can be realized as follows: we first find the value of s such that τ = sβρ(BᵀB), and then substitute s into (43) to derive the feasible range of γ.
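The procedure of Remark 3 can be packaged as follows (a sketch under the reconstruction of (43)); the matrix B and the values of τ and β are invented for illustration.

```python
import numpy as np

# Remark 3 as a procedure: given tau, beta and B, recover s (Remark 1) and
# then the admissible range of gamma from (43).
def gamma_range(tau, beta, B):
    s = tau / (beta * np.linalg.eigvalsh(B.T @ B).max())           # Remark 1
    return 0.0, (1 - 2 * s + np.sqrt(4 * s**2 + 12 * s + 5)) / 2   # (43)

B = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])  # hypothetical data
print(gamma_range(tau=3.0, beta=1.0, B=B))
```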

4. Conclusions. In this paper, we focus on the convergence of the alternating direction method of multipliers (ADMM). We show that the dual step size of the ADMM can be larger than the golden ratio when one function term is strongly convex, a result not available in the existing ADMM literature. We also derive an exact relationship between the strong convexity modulus and the dual step size. Since the step size range considered here contains Glowinski's range, the convergence analysis of this paper covers the ADMM and Glowinski's variant as special cases.