A convergence analysis of the perturbed compositional gradient flow: averaging principle and normal deviations

We consider in this work a system of two stochastic differential equations called the perturbed compositional gradient flow. By introducing a separation of fast and slow scales between the two equations, we show that the limit of the slow motion is given by an averaged ordinary differential equation. We then demonstrate that the deviation of the slow motion from the averaged equation, after proper rescaling, converges to a stochastic process with Gaussian inputs. This indicates that the slow motion can be approximated, in the weak sense, by a standard perturbed gradient flow or the continuous-time stochastic gradient descent algorithm that solves the optimization problem for a composition of two functions. As an application, the perturbed compositional gradient flow corresponds to the diffusion limit of the Stochastic Composite Gradient Descent (SCGD) algorithm for minimizing a composition of two expected-value functions in the optimization literature. In the strongly convex case, such an analysis implies that the SCGD algorithm has the same convergence time asymptotics as the classical stochastic gradient descent algorithm. It thus validates, at the level of continuous-time approximation, the effectiveness of the SCGD algorithm in the strongly convex case.

Here (w, v) follows a certain distribution on an index set D; f_v : R^m → R and g_w : R^n → R^m are assumed to be of class C^(4); the vector ∇f_v(y) is the gradient column m-vector of f_v evaluated at y, and the matrix ∇g_w(x) is the n × m matrix formed by the gradient column n-vectors of the m components of g_w evaluated at x; ε > 0 and η > 0 are two small parameters. We assume that the functions f and g are supported on some compact subsets of R^m and R^n, respectively 1 .
Here W^1_t and W^2_t are independent standard Brownian motions in the spaces R^m and R^n, respectively. The diffusion matrices Σ_1(x) and Σ_2(x, y) satisfy the stated conditions, and both are assumed to be non-degenerate for any choice of x ∈ R^n and y ∈ R^m.
1 A further discussion of this assumption is provided in Remark 3 of Section 5.
The invariant measure µ^{X,ε}(dY) of the (multidimensional) OU process y^{X,ε}(t) is a Gaussian measure with mean Eg_w(X) and the covariance matrix in (9). Let us introduce the averaging operator (10), where q(X, Y) can be a scalar-, vector-, or matrix-valued function of the arguments X and Y. Since the fast motion Y^{ε,η}(t) runs at high speed, the process X^{ε,η}(t) in (7) plays the role of the slow motion. That is to say, X^{ε,η}(t) changes very little, and thus can be viewed as frozen, during a small time interval in which Y^{ε,η}(t) is running very fast. Roughly speaking, in the dynamics of X^{ε,η}(t), the fast component Y^{ε,η}(t) can be replaced by the invariant measure of y^{X^{ε,η}(t),ε} with frozen X^{ε,η}(t). This heuristic supports the following asymptotic picture: with ε > 0 fixed and η → 0, so that ε/η → ∞, one can approximate the slow process X^{ε,η}(t) in (7) by an averaged process X^ε(t) satisfying (11). The approximation of X^{ε,η}(t) by X^ε(t) is the content of the classical averaging principle and has been discussed extensively in the literature (see e.g. [16], [15], [10, Chapter 7]). In this paper we will show (see Proposition 1) that, with ε > 0 fixed and η → 0, the estimate (12) holds. This justifies approximating the slow process X^{ε,η}(t) by the averaged motion X^ε(t). It turns out, as we will prove quantitatively in Lemma A.1 below, that (13) holds as ε → 0. Therefore, as ε → 0, by (4) we obtain the approximate dynamics described next.

WENQING HU AND CHRIS JUNCHI LI
Thus as ε → 0, the process X^ε(t) approximates another process X̄(t) that solves an ordinary differential equation, with an error of O(√ε). In fact, equation (14) can be viewed as a gradient flow: it is the perturbed gradient flow with the stochastic noise terms removed [13]. This averaging principle hence explains why we call (1) the perturbed compositional gradient flow.
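To illustrate the averaging picture numerically, here is a minimal sketch using a hypothetical scalar instance: g_w(x) = x and f_v(y) = y²/2, so the composite objective is x²/2 and the averaged gradient flow is dX̄/dt = −X̄. The simulated equations below are an assumed caricature matching the described structure (a fast OU process centered at g(X), a slow drift −∇g(X)∇f(Y), small noise of size √η), not the paper's exact system.

```python
import numpy as np

# Toy scalar slow-fast system (a hypothetical instance of the general model):
#   fast:  dY = -(1/delta) (Y - g(X)) dt + sqrt(eps/delta) dW1,  delta = eta/eps,
#   slow:  dX = -g'(X) f'(Y) dt + sqrt(eta) dW2,
# with g(x) = x and f(y) = y^2/2, so the averaged ODE is dXbar/dt = -Xbar.
rng = np.random.default_rng(0)

eps, eta = 0.1, 0.01
delta = eta / eps                  # fast timescale
T, dt = 1.0, 1e-4
X, Y = 1.0, 1.0

for _ in range(int(T / dt)):
    dW1, dW2 = rng.normal(scale=np.sqrt(dt), size=2)
    Y += -(Y - X) / delta * dt + np.sqrt(eps / delta) * dW1
    X += -Y * dt + np.sqrt(eta) * dW2      # g'(x) = 1, f'(y) = y

X_avg = np.exp(-T)                 # averaged solution Xbar(T) = e^{-T}, Xbar(0) = 1
print(abs(X - X_avg))
```

With ε and η small, the simulated slow path stays in an O(√η) + O(√ε) neighborhood of the averaged solution e^{−t}, consistent with the heuristic above.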

1.2. A sharper rate via normal deviations. One major drawback of the classical averaging principle is that the approximation X^{ε,η}(t) → X^ε(t) as η → 0 in (12) can only identify the deterministic drift: the small diffusion part in the equation for X^{ε,η}(t) vanishes as η → 0. To overcome this difficulty, we consider the deviation X^{ε,η}(t) − X^ε(t) and rescale it by a factor of √η; that is, we consider the process Z^{ε,η}(t) = (X^{ε,η}(t) − X^ε(t))/√η in (15). We will show (see Proposition 2) that, as η → 0, the process Z^{ε,η}(t) converges weakly to a random process Z^ε_t. The process Z^ε_t has a deterministic drift part and is driven by two mean-zero Gaussian processes with explicitly computed covariance structures. Roughly speaking, this implies that from (15) we obtain the expansion (16) as η → 0. Here ≈_D means approximate equality of probability distributions. In fact, such approximate expansions have been introduced in the classical program in the context of stochastic climate models (see [12], [1, equation (4.8)]), and in physics this is also known as van Kampen's approximation (see [26]). Therefore, by (13), (14) and (16), we know that the slow motion X^{ε,η}(t) in (7) (or (2)) has an expansion (17) around the gradient flow in (14). Let us introduce the process X^{ε,η}(t) as the fast time-scale version (18) of the perturbed gradient flow [13], [14]. By (14) and (18) we obtain (19), so that by (17) we further have the corresponding expansion. From the perspective of mathematical techniques, there are two classical approaches to the averaging principle and normal deviations 2 . One is the classical Khasminskii averaging method [16]. This method chooses an intermediate time scale between the slow and fast scales, which enables the analysis of the averaging procedure by using a fast motion with frozen slow component. To demonstrate its effectiveness, in this work we use this method for our averaging analysis. Another, less intuitive, method is the corrector method, which relies on the solution of an auxiliary Poisson equation.
Upon obtaining appropriate a-priori estimates for this Poisson equation, one can reduce the averaging principle or normal deviations to an analysis based on Itô's formula. Since in our case the fast motion Y^{ε,η}(t) is an OU process, when applying the corrector method we are closest to the set-up of [23] (see also [22], [24], [7]). Our analysis of the normal deviations follows the corrector method and is based on the a-priori bounds provided in [23].
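Schematically, the normal-deviation expansion described above can be written as follows (a hedged reconstruction using only quantities already defined: X^ε from the averaged equation, X̄ from the limiting ODE, and Z^ε the Gaussian-driven deviation process):

```latex
X^{\varepsilon,\eta}(t)
  \;\overset{\mathcal{D}}{\approx}\;
  X^{\varepsilon}(t) + \sqrt{\eta}\, Z^{\varepsilon}(t)
  \;\overset{\mathcal{D}}{\approx}\;
  \bar{X}(t) + O(\sqrt{\varepsilon}) + \sqrt{\eta}\, Z^{\varepsilon}(t),
  \qquad \eta \to 0,
```

where the first approximate equality is the normal-deviation statement and the second combines it with the O(√ε) averaging error from (13).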
1.3. Connection with the stochastic compositional gradient descent algorithm.
In the field of statistical optimization, the stochastic composition optimization problem (20) has been of tremendous interest in both theory and application. Here (Ef_v ∘ Eg_w)(x) denotes the composite objective function, and (v, w) denotes a pair of random variables. [29] has shown that the optimization problem (20) includes many important applications in statistical learning and finance, such as reinforcement learning, statistical estimation, dynamic programming and portfolio management. Let us consider the following version of the Stochastic Composite Gradient Descent (SCGD) algorithm in [29, Algorithm 1], whose iteration takes the form (21). Here (w_k, v_k) are i.i.d. random vectors following some distribution D over the parameter space; 3 f_{v_k} : R^m → R and g_{w_k} : R^n → R^m are functions indexed by the aforementioned random vectors; the vector ∇f_{v_k}(y_{k+1}) is the gradient column m-vector of f_{v_k} evaluated at y_{k+1}, and the matrix ∇g_{w_k}(x_k) is the n × m matrix formed by the gradient column n-vectors of the m components of g_{w_k} evaluated at x_k. The SCGD algorithm (21) is a provably effective method for solving (20); see [8, 29] for early analyses of its convergence and rates of convergence. However, the convergence rate of the SCGD algorithm and its variants is not known to be comparable to that of its SGD counterpart [29, 30]. To drill further into this algorithm, we consider the coupled diffusion process (1), which is a continuum version, as both ε, η → 0 and ε/η → ∞, of the SCGD algorithm (21). We reproduce the perturbed compositional gradient flow (1) below for convenience as (22). Here (w, v) is taken to be distributed as D, and f_v : R^m → R and g_w : R^n → R^m are assumed to be of class C^(4). Without loss of generality, when considering an optimization problem (20), we can assume that the functions f and g are supported on some compact subsets of R^m and R^n, respectively 4 .
Also for convenience, let us further assume that the w and v in the (w, v)-pair drawn from D are independent. We do not believe this assumption is necessary (see the discussions in [29, 30]); however, it does simplify our analysis, since (W^1_t, W^2_t) in the perturbed compositional gradient flow (22) can then be chosen as an independent pair of Brownian motions, which in turn simplifies the proof.
Recall that (X^{ε,η}(t), Y^{ε,η}(t)) = (x(t/η), y(t/η)). In the case where the objective function (Ef_v ∘ Eg_w)(x) is strongly convex, the process in (18) enters a basin containing the minimizer of (20) in finite time T > 0, so that (19) implies that X^{ε,η}(t) in (2) also enters a basin containing the minimizer of (20) in finite time T > 0. This heuristic analysis validates, in the sense of convergence, the effectiveness of using the perturbed compositional gradient flow to solve (20) in the strongly convex case. Such an argument can be generalized to the convex case; we omit it due to space limitations.
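For concreteness, here is a minimal sketch of the two-timescale SCGD iteration (21) on a hypothetical strongly convex scalar problem: g_w(x) = w·x with E w = 1 and f_v(y) = y²/2, so the composite objective is x²/2 with minimizer x* = 0. The stepsize schedules below are illustrative choices satisfying the usual two-timescale condition α_k/β_k → 0, not the specific schedules analyzed in [29].

```python
import numpy as np

# Two-timescale SCGD iteration (cf. [29, Algorithm 1]):
#   y_{k+1} = (1 - beta_k) y_k + beta_k g_{w_k}(x_k)        (tracks E g_w(x_k))
#   x_{k+1} = x_k - alpha_k * grad g_{w_k}(x_k) * grad f_{v_k}(y_{k+1})
# Toy instance: g_w(x) = w x with E w = 1, f_v(y) = y^2/2; minimizer x* = 0.
rng = np.random.default_rng(1)

x, y = 2.0, 0.0
for k in range(20000):
    alpha = 1.0 / (k + 100)            # slow stepsize (illustrative schedule)
    beta = 1.0 / (k + 100) ** 0.75     # fast stepsize, alpha_k / beta_k -> 0
    w = 1.0 + 0.1 * rng.normal()       # sample the random index w_k
    y = (1.0 - beta) * y + beta * w * x
    x = x - alpha * w * y              # grad g_w(x) = w, grad f_v(y) = y
print(abs(x))
```

The auxiliary sequence y_k plays the role of the fast motion: it tracks Eg_w(x_k) on a faster timescale, mirroring the fast-slow separation in the continuous-time system.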
It is worth pointing out that in an early work in the probability literature [23], the authors briefly mentioned in the introduction the potential application of the averaging principle to the analysis of stochastic approximation algorithms. In contrast, in the classical literature on stochastic approximation algorithms (see [4], [5], [19]), the techniques of normal deviations have been addressed in the context of weak convergence to diffusion processes in the discrete setting. For example, [4, Chap. 4, Part II] analyzed the asymptotic behavior of a broad class of single-equation adaptive algorithms including SGD. Moreover, [19, Chap. 8] discussed the idea of multiple-timescale analysis for stochastic approximation algorithms; see also [5, Chap. 6] for a connection to the averaging principle for constant-stepsize algorithms. However, these mathematical theories focus on long-time asymptotic analysis rather than convergence rates, which are vital in many recent applications. The current work serves as an attempt at convergence-rate analysis using one algorithmic example (SCGD) and can be viewed as a further contribution along this line of research.
Organization. The paper is organized as follows. In Section 2 we show the averaging principle that justifies the convergence of X^{ε,η}(t) to X^ε(t) as η → 0. In Section 3 we consider the rescaled deviation Z^{ε,η}_t = (X^{ε,η}(t) − X^ε(t))/√η and show that as η → 0 it converges weakly to the process Z^ε_t; this justifies (16). In Section 4 we show the approximation (19) and justify the effectiveness of using SCGD in the strongly convex case. In Section 5 we discuss further problems, remarks and generalizations.
Notational Conventions. For an n-vector v = (v_1, ..., v_n) we define the norm |v|_{R^n}. If q is a vector or a matrix, then the norm |q| denotes either |q|_{R^n} when q is an n-vector, or ‖q‖_{R^n ⊗ R^n} when q is an n × n matrix. The standard inner product in R^n is denoted by ⟨•, •⟩_{R^n}.
The spaces C^(i)(D), i = 0, 1, ... (with C(D) = C^(0)(D)) are the spaces of i-times continuously differentiable functions on a domain D (D can be the whole space). For a function f ∈ C^(i)(D) we define ‖f‖_i to be the C^(i)(D) norm of f on D. In case we need to highlight the target space, we also write C^(i)(D; M) for functions from D into M. For vector- or matrix-valued functions, the Lipschitz norm is defined to be the largest Lipschitz norm of the corresponding component functions. Throughout the paper, capital letters X(t), Y(t), X̄(t), etc., denote quantities for the time-rescaled process (2), and lowercase x(t), y(t), x̄(t), etc., denote quantities for the original process (1). The constant C denotes a positive constant that may vary from line to line; to emphasize the dependence of this constant on other parameters, we may also write C = C(•). For notational convenience, we use X(t) and X_t interchangeably to denote a stochastic process.
2. The convergence of X ε,η (t) to X ε (t): Averaging principle. In this section we are going to show the convergence of X ε,η (t) to X ε (t) as η → 0 by arguing as in the classical averaging principle (see [15], [16]).
Proof. This lemma can be derived in the same way as [6, Lemma 4.2]. In fact, we can write the equation (7) for X^{ε,η}_t in integral form. For a matrix-valued random function σ(t) = σ(ω, t) adapted to the filtration of W_t we have the moment identities (see [17, (3.12) and (3.13)]), and therefore we obtain (23). We can write the solution Y^{ε,η}_t of (7) in mild form as (26). Then, applying Gronwall's inequality, we obtain the second estimate for 0 ≤ t ≤ T. The next lemma summarizes basic facts about the process y^{X,ε} defined in (8).
Lemma 2.2. Let the process y^{X,ε}(t) defined in (8) start from y^{X,ε}(0) = y ∈ R^m. Then for any function ϕ : R^m → R and some δ > 0 we have (27), where the constant C > 0 may depend on ε but is independent of X. Moreover, for some constant C > 0 we have (28). Finally, (29) and (30) hold, in which the constant C may depend on ε but is independent of X.
Proof. Let us first recall the auxiliary process y X,ε (t) in (8).
Let µ(dY) = N(0, (1/2) I_m) be the invariant measure of Z(t), where I_m is the identity matrix in R^m. Then we have the exponential mixing estimate (32) for δ > 0. This, together with the representation (31), as well as the boundedness of Eg_w(X) in terms of X, implies (27).
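The exponential mixing used here can be checked numerically on a toy normalization of the OU process: for dZ = −Z dt + dW_t, the conditional mean E_z[Z_t] = z e^{−t} relaxes to the invariant mean 0 at an exponential rate (the constants below are illustrative, not those of y^{X,ε}):

```python
import numpy as np

# Exponential mixing for the toy OU process dZ = -Z dt + dW_t:
# E_z[Z_t] = z e^{-t}, so expectations of test functions relax to the
# invariant N(0, 1/2) averages at an exponential rate.
rng = np.random.default_rng(2)

z0, t_final, dt, n_paths = 2.0, 1.0, 1e-3, 20000
Z = np.full(n_paths, z0)
for _ in range(int(t_final / dt)):
    Z += -Z * dt + rng.normal(scale=np.sqrt(dt), size=n_paths)

empirical = Z.mean()                  # Monte Carlo estimate of E_{z0}[Z_1]
theory = z0 * np.exp(-t_final)        # exact conditional mean
print(empirical, theory)
```

The empirical mean over many Euler-Maruyama paths matches the exponentially decaying conditional mean, which is the content of the mixing estimate for the test function ϕ(y) = y.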
From (32), by the same argument as in [6, Lemma 2.3], together with the representation (31), we obtain (28). Moreover, by (31) we infer (29). Finally, (30) is a result of (9) and the fact that Eg_w(X) and Σ_1(X) are uniformly bounded in X. Now we derive the averaging principle following the classical method in [16]. Let T > 0, and consider a partition of the time interval [0, T] into intervals of the same length ∆ > 0. Let us introduce the auxiliary processes Y^{ε,η}_t, X^{ε,η}_t by means of the relations (34) and (35). Lemma 2.3. The interval length ∆ = ∆(η) can be chosen such that η^{-1}∆(η) → ∞ and ∆(η) → 0 as η → 0, and such that for any small 0 < κ < 1 the estimate (36) holds uniformly in x_0 ∈ R^n, y_0 ∈ R^m and t ∈ [0, T].
Lemma 2.4. For any small 0 < κ < 1 the following estimate holds. Proof. By (7) we can write the process X^{ε,η}_t as an integral equation (40). Comparing (35) and (40), and using (36), the Cauchy-Schwarz inequality and the Lipschitz continuity of the coefficients, we obtain the corresponding bound. From the asymptotic fact that, for any small fixed a > 0, η^a ln η^{-1} → 0 as η → 0, we have ∆ ≤ η^{1−a/4}. This implies the desired bound, and thus, for a possibly different small κ > 0, the claimed estimate follows. Proposition 1. For any T > 0 and ε > 0, η > 0 small enough, for 0 ≤ t ≤ T and any small 0 < κ < 1 we have (41) for some constant C = C(T) > 0.
Proof. By the defining equation (34) of the process Y^{ε,η}_t we know that Y^{ε,η}_t = y^{X^{ε,η}_{k∆},ε}(tε/η) with y^{X^{ε,η}_{k∆},ε}(0) = Y^{ε,η}_{k∆}. This, together with estimate (28) of Lemma 2.2, implies a mixing bound for Y^{ε,η}_t. By making use of (36), (37), (39), (42) and (82), and taking into account the choice of ∆ in (38), we further infer that the corresponding bound holds when ε, η > 0 are sufficiently small. Combining this with (39), we infer that, for small ε, η > 0 and 0 ≤ t ≤ T, the desired estimate holds, so that we conclude with (41), taking into account that η^a ln η^{-1} → 0 for any a > 0. Remark 1. By using the corrector method of the next section, it is possible to remove κ in the estimate (41) and obtain the slightly better upper bound C(η²/ε² + η). However, the Khasminskii method we use here is more intuitive. See Lemma 3.5 for a precise statement and proof; see also Remark 2 in Section 5 for further discussion.
3. Normal deviations. In this section we consider normal deviations of the process X^{ε,η}(t) from the averaged motion X^ε(t). The method we use here is the corrector method. Similar techniques can be found in [18]. To apply this method, a-priori estimates for an auxiliary Poisson equation are needed, and various previous works are dedicated to obtaining such estimates. In [11], the authors considered the case where the diffusion matrix is a scalar multiple of the identity matrix, and the Poisson equation there corresponds to hypo-elliptic diffusions. Our analysis relies more on the estimates obtained in [23] (see also [22], [24]).

As in (15), we define
By (7) and (11) we see that the process Z^{ε,η}(t) satisfies an integral equation. Let us introduce the infinitesimal generator of the OU process y^{X,ε}(t) in (8) as the operator L^{X,ε}, and consider the auxiliary Poisson equations (47), for k = 1, 2, ..., n.
Since the variables (X, Y) ∈ R^n × R^m, we single out a unique solution u_k(X, Y) to (47) by imposing the restriction that, for each k = 1, 2, ..., n, ∫_{R^m} u_k(X, Y) µ^{X,ε}(dY) = 0.
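A one-dimensional toy illustration of the corrector construction: for the unit OU generator L = −y ∂_y + (1/2) ∂²_y (a hypothetical normalization) with invariant measure N(0, 1/2) and the centered right-hand side f(y) = y, the probabilistic representation u(y) = ∫_0^∞ E_y[f(Y_t)] dt evaluates to u(y) = y, since E_y[Y_t] = y e^{−t}. One can check directly that Lu = −f and that u is centered under the invariant measure.

```python
import numpy as np

# Corrector for the unit OU generator L = -y d/dy + (1/2) d^2/dy^2,
# invariant measure N(0, 1/2), right-hand side f(y) = y (so fbar = 0):
#   u(y) = int_0^infty E_y[f(Y_t)] dt = int_0^infty y e^{-t} dt = y.
y0 = 1.3
t = np.linspace(0.0, 50.0, 200001)
integrand = y0 * np.exp(-t)          # E_{y0}[f(Y_t)] in closed form for the OU process
u = np.sum((integrand[1:] + integrand[:-1]) * np.diff(t)) / 2.0   # trapezoid rule
print(u)
```

The quadrature over the transition semigroup reproduces the closed-form corrector u(y) = y; the exponential mixing is what makes the time integral converge, mirroring the role of the centering condition above.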
We have the following a-priori bounds for the solution u k (X, Y ) based on [23].
Lemma 3.1. The solution u_k(X, Y) belongs to C^(2)(R^n × R^m). Moreover, there exist an integer p > 0 and constants C_k > 0 such that the stated polynomial bounds hold for each k = 1, 2, ..., n. Proof. For simplicity of notation we suppress the index k throughout the proof.
The Poisson equation (47) can be written in an equivalent explicit form. An explicit representation of the solution u(X, Y) can be formulated as (50). Here p_t(Y, Y′; X) is the (parabolic) fundamental solution (transition probability density) corresponding to the operator L^{X,ε}. The fact that u_k(X, Y) ∈ C^(2)(R^n × R^m) is a consequence of Theorem 3 in [23]. Let us define (51), where p_∞(Y; X) is the density function of the invariant measure µ^{X,ε}(dY) in (9). Notice that the way we define the averaging operator in (10) guarantees the centering condition (52), which leads to the validity of the second equality in (51). From (50), combined with (51), we can write the solution in a decomposed form. Let us define p^{(j)}_t to be the j-th derivative of p_t with respect to X; it is a tensor with j indices. In a similar fashion, from (51), we define p^{(1)}_t(Y, f; X), which is an n-dimensional vector, and p^{(2)}_t(Y, f; X), which is an n × n matrix. By the estimates (14) and (15) from Theorem 2 in [23], taking into account the centering condition (52), we see that for any k > 0 there exist C, p > 0 such that the corresponding decay bounds hold for all t ≥ 1. Now we consider standard estimates, including derivatives with respect to X, for the remaining integral. We can apply standard parabolic estimates to the solution v, so that, taking into account the compact support of f(X, Y) in X and Y, we obtain the claimed bounds. The next lemma is a reproduction of Proposition 4.2, estimate (4.5), in [6].
Proof. First of all, we can write the equation (7) for X^{ε,η}_t in integral form. By making use of the identity (25), as well as the Burkholder-Davis-Gundy inequality (see [25, Corollary IV.4.2]), we obtain (62). From the mild form (26) and Gronwall's inequality, we know that for 0 ≤ t ≤ T the corresponding moment bound holds. It remains to estimate E|Γ(t)|^p_{R^m}. Again, by the identity (25) as well as the Burkholder-Davis-Gundy inequality, we obtain a bound which leads to (61) upon integrating over [0, T].
The next lemma shows how the averaging principle is used to evaluate integrals.
Proof. By the estimates (36) and (41), and making use of the fact that K(X, Y) is Lipschitz continuous in X and Y, we obtain a bound that vanishes as η → 0. Here the process Y^{ε,η}_t is defined as in (34). Reasoning as in (42), we have that for each k = 0, 1, ... and each interval [k∆, (k + 1)∆], ∆ > 0, the corresponding estimate holds.
We can estimate, by (71) and the same methods as in the proof of Lemma 3.4, that for 0 ≤ t ≤ T and some constant C = C(T) we have E|(I)|²_{R^n} ≤ C(η²/ε² + η). By Lemma A.2 we know that E|(II)|²_{R^n} ≤ C ∫_0^t E|X^{ε,η}_s − X^ε_s|²_{R^n} ds. It is straightforward to see that E|(III)|²_{R^n} ≤ Cη.
Combining the above three estimates and making use of Gronwall's inequality, we arrive at (74).
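For reference, the form of Gronwall's inequality invoked here (and in the earlier lemmas) is the standard integral version:

```latex
\varphi(t) \le a + C \int_0^t \varphi(s)\, ds \quad \text{for all } t \in [0, T]
\;\;\Longrightarrow\;\;
\varphi(t) \le a\, e^{C t} \quad \text{for all } t \in [0, T],
```

applied in this step with φ(t) = E|X^{ε,η}_t − X^ε_t|²_{R^n} and a = C(η²/ε² + η), which is consistent with the bounds on (I), (II) and (III) above.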

Proposition 2. As η → 0, the process Z^{ε,η}_t converges weakly on the interval [0, T], in the space C([0, T]; R^n), to the process Z^ε_t defined by the following equation, where N^ε_1(t) and N^ε_2(t) are mean-zero Gaussian processes with explicitly computed covariances, and M(X) is an n × n matrix function.
Finally, we address the weak convergence of U^{ε,η}_2(t) to ∫_0^t M(X^ε_s) Z^{ε,η}_s ds. By the boundedness of the second derivatives of B_2(X, Y) with respect to X, reasoning as in Lemma A.2, and making use of Lemma 3.5 together with Lemma A.2, we obtain the required estimates.
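The √η scaling underlying Proposition 2 can be probed numerically on a hypothetical scalar slow-fast toy (g(x) = x, f(y) = y²/2, averaged solution X̄(t) = e^{−t}). The simulated equations are an assumed caricature of the general system; the empirical standard deviation of X^{ε,η}(T) − X̄(T) should roughly double when η is multiplied by 4:

```python
import numpy as np

# sqrt(eta) scaling of the deviation X^{eps,eta}(T) - Xbar(T) for a toy
# scalar slow-fast system (assumed form): g(x) = x, f(y) = y^2/2, Xbar(t) = e^{-t}.
rng = np.random.default_rng(3)

def deviation_std(eta, eps=0.1, T=1.0, dt=1e-3, n_paths=2000):
    delta = eta / eps                      # fast timescale
    X = np.ones(n_paths)
    Y = np.ones(n_paths)
    for _ in range(int(T / dt)):
        dW1 = rng.normal(scale=np.sqrt(dt), size=n_paths)
        dW2 = rng.normal(scale=np.sqrt(dt), size=n_paths)
        Y += -(Y - X) / delta * dt + np.sqrt(eps / delta) * dW1
        X += -Y * dt + np.sqrt(eta) * dW2
    return float(np.std(X - np.exp(-T)))   # spread of the deviation at time T

ratio = deviation_std(0.04) / deviation_std(0.01)
print(ratio)   # roughly sqrt(0.04 / 0.01) = 2 if the deviation scales like sqrt(eta)
```

The ratio of empirical standard deviations across the two values of η staying near 2 is the Monte Carlo signature of the √η normal-deviation scaling established in this section.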