Decentralized gradient algorithm for solution of a linear equation

The paper develops a technique for solving a linear equation $Ax=b$ with a square and nonsingular matrix $A$, using a decentralized gradient algorithm. In the language of control theory, there are $n$ agents, each storing at time $t$ an $n$-vector, call it $x_i(t)$, and a graphical structure associating with each agent a vertex of a fixed, undirected and connected but otherwise arbitrary graph $\mathcal G$ with vertex set and edge set $\mathcal V$ and $\mathcal E$ respectively. We provide differential equation update laws for the $x_i$ with the property that each $x_i$ converges to the solution of the linear equation exponentially fast. The equation for $x_i$ includes additive terms weighting those $x_j$ for which vertices in $\mathcal G$ corresponding to the $i$-th and $j$-th agents are adjacent. The results are extended to the case where $A$ is not square but has full row rank, and bounds are given on the convergence rate.

result. For example, for a given square matrix, the equation might be initialized by the matrix and converge to a steady state solution that is a diagonal matrix with eigenvalues equal to those of the given matrix. Many of these results are set out in the book [1].
This paper offers another contribution along these lines: we show how to solve a linear equation Ax = b in a distributed way by a network of agents. Unlike parallel algorithms in computer science [2][3][4], which usually require a common shared memory, a major novel feature of distributed algorithms lies in their implementation on multi-agent networks in which agents are provided with local memories and can communicate with each other. In the parallel world, the graph is chosen by the designer to maximize efficiency of computation in some sense, whereas in ours it may be given by other considerations such as communication requirements. When a linear equation of interest involves millions or more unknowns, it may be impossible for a single memory even to store the whole linear equation. Such large-scale linear equations easily arise in electromagnetic field problems, in which the boundary integral technique is employed to recover the solution of a differential equation in space by an appropriate integration of the solution on the two-dimensional boundary surface [5]. Another potential problem with the shared memory in parallel algorithms is data safety. Security concerns often arise because different agents are not necessarily in the same domain of trust. For example, when a customer turns to a computer cluster for help with computations that involve sensitive information, such as business financial records or personally identifiable health information, the customer may not be willing to share all such information with computers in the cluster [6]. One natural way to avoid these problems is to develop distributed algorithms for multi-agent networks. Since agents may also be physically separated from each other, each agent is typically able to communicate only with certain other nearby agents.
There are typically communication constraints on the information flow across multi-agent networks, which consequently preclude centralized processing and generate great interest in distributed algorithms.
One direction for solving linear equations in a distributed way is to reformulate them as a distributed optimization problem and then to employ existing algorithms in [7][8][9][10] to solve them. Rather than go through the intermediate step of problem reformulation, the authors of [11][12][13] have recently proposed a distributed algorithm for directly solving linear equations based on a so-called agreement principle, which was implicitly used in [14]. Here is the key idea: each agent limits the update of its state vector so as to satisfy its private equation, which is part of the linear equation Ax = b; at the same time a control is developed to drive all agents' states to reach a consensus vector, and this consensus vector must then be the solution of Ax = b. Algorithms obtained along this direction were first shown to work for a non-singular matrix A on fixed undirected networks in [11] and were then generalized to handle time-varying networks in [13].
In contrast to the discrete-time algorithm proposed in [13], this paper proposes a continuous-time algorithm and in so doing provides a different perspective on solving a linear equation in a distributed manner. The particular update rule for agent i's state is
$$\dot x_i = \sum_{j \in \mathcal N_i} f_{ij}\,(x_j - x_i), \qquad i \in \mathbf n,$$
where, with $\mathbf n = \{1, 2, \dots, n\}$, $\mathcal N_i$ denotes the set of labels of agent i's neighbors from which agent i receives information, $x_i$ is the state vector associated with agent i, and the particular form of the coefficients $f_{ij}$ will be given subsequently. Now a feature of the algorithm is that it is, using the language of control theory, a form of consensus algorithm. This means that the different $x_i(t)$ converge as $t \to \infty$ to a common value, which will be the solution of the equation Ax = b. However, what is probably the most distinguishing feature of this work is its use of a basic result from differential geometry, which is perhaps not so well known in control theory, to give a more or less immediate proof of the paper's main result. In particular we use the fact that a gradient flow algorithm associated with a real analytic function on a real analytic Riemannian manifold necessarily converges to a fixed point. This result is perhaps better known for the case of gradient flow algorithms in Euclidean space, see [15]; its more general form on Riemannian manifolds goes back to work of [16]. The structure of the paper is as follows. In the next section, we motivate the form of the differential equations and state the main result. The following section proves the main result, and Section 4 offers some comments on the convergence rate. Section 5 contains remarks relevant to future research.
2. Establishing the consensus differential equations. Our starting point is the algebraic linear equation
$$Ax = b$$
in which $A \in \mathbb R^{n\times n}$ is non-singular and $b \in \mathbb R^n$. We shall rewrite the equation, with $a_i^\top$ denoting the i-th row of A, as
$$a_i^\top x = b_i, \qquad i \in \mathbf n.$$
We shall later impose the requirement that A is nonsingular, but for the moment the requirement is not in force. The second writing of the equations can be interpreted as stating that x is a member of n different manifolds (here affine subspaces), viz $M_i = \{y \in \mathbb R^n : a_i^\top y = b_i\}$. Our approach to finding x relies on finding n vectors $x_i(t)$, each of which belongs for all time to a particular manifold, in that
$$a_i^\top x_i(t) = b_i \quad \text{for all } t,$$
and on so moving the $x_i$ that they become more and more like each other. If and when they take a common value, call it x, it is obvious that it must satisfy the equation Ax = b. The basis for adjusting the $x_i(t)$ involves setting up a cost function which penalizes any differences between them. To this end, consider also a connected graph G = (V, E) with n vertices, each of which is associated with an n-vector $x_i$. Consider also the following cost function:
$$V = \frac{1}{2}\sum_{(i,j)\in E} \|x_i - x_j\|^2.$$
Clearly V achieves a global minimum of zero if and only if all $x_i$ have a common value. (Connectivity of the graph is essential to conclude the 'only if' statement.)
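As a concrete (if trivial) illustration, the following Python sketch, with names of our own choosing rather than from the paper, evaluates this cost function on a path graph and confirms that it vanishes exactly at consensus:

```python
import numpy as np

def consensus_cost(X, edges):
    """V(X) = (1/2) * sum over edges (i,j) of ||x_i - x_j||^2.

    X is an (agents x n) array whose rows are the agent states x_i;
    edges lists the undirected edges of the graph G as index pairs."""
    return 0.5 * sum(np.sum((X[i] - X[j]) ** 2) for i, j in edges)

edges = [(0, 1), (1, 2)]  # a connected path graph on three vertices

X_consensus = np.tile(np.array([1.0, 2.0, 3.0]), (3, 1))  # all agents agree
X_disagree = np.eye(3)                                    # all agents differ

print(consensus_cost(X_consensus, edges))  # 0.0: consensus is the global minimum
print(consensus_cost(X_disagree, edges))   # 2.0: disagreement is penalized
```

Connectivity matters here exactly as the text says: with the edge (1, 2) removed, agent 0 and agent 2 could disagree without the cost noticing.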
Formally, one can regard M as the set of vectors X in $\mathbb R^{n^2}$ obeying the n scalar constraints
$$a_i^\top x_i = b_i, \qquad i \in \mathbf n$$
(one should think of X as the vector obtained by stacking the $x_i$). The function V is obviously defined on $\mathbb R^{n^2}$, but it is also defined on the affine subspace $M \subset \mathbb R^{n^2}$. We observe that V is clearly convex. This implies that the restriction $V|_M$ of V to M is convex too, because the $M_i$ are affine spaces; this is a very useful property of $V|_M$. In the following proposition, we state two easily established properties linking the cost function to the linear equation.

Proposition 1. (i) $V|_M$ attains the value zero if and only if $x_1 = x_2 = \cdots = x_n = x$ for some x satisfying Ax = b. (ii) Every critical point of $V|_M$ is a global minimum of $V|_M$.

The second part (ii) of the above Proposition is an immediate consequence of standard properties of smooth convex functions on affine spaces such as M.
If we were to regard the function V as defined on $\mathbb R^{n^2}$, and sought to compute a gradient flow from an arbitrary initial point in an attempt to reach the minimum, there would result
$$\dot x_i = -\sum_{j\in\mathcal N_i}(x_i - x_j), \qquad i\in\mathbf n \qquad (9)$$
While this would result in a steady state in which all $x_i$ had the same value, they would not be guaranteed to remain on the original manifold. To remedy this defect, we obtain the gradient of V on the manifold M. The way this is done in the optimization literature [1,17], given that the manifolds $M_i$ and therefore the manifold M are embedded in a Euclidean space, is simply to project the gradient onto the tangent space of the manifold. Because $M = M_1 \times M_2 \times \cdots \times M_n$, it is not hard to check that this is equivalent to projecting each $\dot x_i$ onto the tangent space of the manifold $M_i$. Because the manifold $M_i$ is an affine subspace, its tangent space is the set of vectors $z \in \mathbb R^n$ for which $a_i^\top z = 0$. Now given an arbitrary $y \in \mathbb R^n$, its projection onto the tangent space of the manifold $M_i$ is simply $P_i y$, where $P_i$ denotes the orthogonal projection matrix onto the kernel of $a_i^\top$; for non-zero $a_i$ one has
$$P_i = I - \frac{a_i a_i^\top}{a_i^\top a_i} \qquad (10)$$
More precisely, this means that, replacing (9), we have for the gradient flow on the manifold M
$$\dot x_i = -P_i\sum_{j\in\mathcal N_i}(x_i - x_j), \qquad i\in\mathbf n \qquad (11)$$
Of course we also require that the trajectory begins on M, i.e. we require
$$a_i^\top x_i(0) = b_i, \qquad i\in\mathbf n \qquad (12)$$
It is easy to check using the differential equation that $a_i^\top \dot x_i(t) = 0$, which implies the trajectory stays on the manifold.
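To make the flow (11) with initial condition (12) concrete, here is a minimal Python sketch using a forward-Euler discretization; the ring graph, the randomly generated diagonally dominant A and the step size are our own illustrative choices, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n)) + n * np.eye(n)  # diagonally dominated, nonsingular
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)

# P_i: orthogonal projection onto ker(a_i^T), as in (10)
P = [np.eye(n) - np.outer(A[i], A[i]) / (A[i] @ A[i]) for i in range(n)]
nbrs = [[(i - 1) % n, (i + 1) % n] for i in range(n)]  # ring graph (connected)

# Initial condition (12): start each agent on its own manifold M_i
X = np.array([b[i] * A[i] / (A[i] @ A[i]) for i in range(n)])

dt = 0.05
for _ in range(20000):  # forward-Euler integration of (11)
    X = X + dt * np.array([-P[i] @ sum(X[i] - X[j] for j in nbrs[i])
                           for i in range(n)])

# Every trajectory stayed on its manifold, and all agents agree on the solution
print(max(abs(A[i] @ X[i] - b[i]) for i in range(n)))  # ~ 0 (manifold invariance)
print(np.linalg.norm(X - x_star))
```

Note that the invariance $a_i^\top x_i(t) = b_i$ is preserved exactly (up to floating point) even by the Euler discretization, since each increment lies in the kernel of $a_i^\top$.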
And now we can state the main result.

Theorem 1. Consider a connected graph G = (V, E) with n vertices, and let $\mathcal N_i$ denote the neighbor set of vertex i. If A is nonsingular, the equation set (11) with the initial condition (12) has the property that $x_i(t) \to x$, the solution of Ax = b, for all $i \in \mathbf n$ as $t \to \infty$. Moreover, convergence is exponentially fast. If $A \in \mathbb R^{m\times n}$ for some m and has full row rank, convergence occurs to a solution of Ax = b.
The proof of Theorem 1 will be given later. One might well imagine that in the course of solving the differential equation set, round-off or other errors could move the $x_i(t)$ off the relevant manifold $M_i$. This can be accommodated by adding a further term to the equations which restores the trajectory towards the manifold. The equations remain linear in character.

Corollary 1. Adopt the hypotheses of the theorem, save that (11) is replaced by
$$\dot x_i = -P_i\sum_{j\in\mathcal N_i}(x_i - x_j) - \frac{a_i(a_i^\top x_i - b_i)}{a_i^\top a_i}, \qquad i\in\mathbf n \qquad (13)$$
and let the initial condition now be free. Then the conclusions of the theorem continue to hold.
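A quick numerical sanity check of Corollary 1, again a forward-Euler sketch with illustrative choices of graph, gains and A (not from the paper), now starting from a free initial condition off the manifolds:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n)) + n * np.eye(n)  # nonsingular
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)

P = [np.eye(n) - np.outer(A[i], A[i]) / (A[i] @ A[i]) for i in range(n)]
nbrs = [[(i - 1) % n, (i + 1) % n] for i in range(n)]  # ring graph

X = rng.standard_normal((n, n))  # free initial condition: x_i(0) need not lie on M_i

dt = 0.05
for _ in range(20000):  # forward-Euler integration of (13)
    X = X + dt * np.array(
        [-P[i] @ sum(X[i] - X[j] for j in nbrs[i])
         - A[i] * (A[i] @ X[i] - b[i]) / (A[i] @ A[i])
         for i in range(n)])

print(np.linalg.norm(X - x_star))  # small: the restoring term pulls x_i onto M_i
```

The extra term vanishes once $a_i^\top x_i = b_i$, so on the manifold (13) reduces to (11).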
The proof of Corollary 1 will be given in the next section. Further generalization of (11) and (13) can be achieved by introducing scalar positive gain constants (varying with i) in the equations; thus (13) could be replaced, for arbitrary positive $\alpha$ and $\alpha_i$, by
$$\dot x_i = -\alpha P_i\sum_{j\in\mathcal N_i}(x_i - x_j) - \alpha_i\,\frac{a_i(a_i^\top x_i - b_i)}{a_i^\top a_i}, \qquad i\in\mathbf n$$

3. Proof of Main Results. We start this section with the proof of Theorem 1. By way of overview, we know from the general theory of real analytic gradient flows outlined in [16] that convergence must occur to an equilibrium point of the equations, and that because the manifolds $M_i$ and thus the manifold M are closed, the equilibrium point must lie in M and thus the equilibrium values of the $x_i$ must lie in each $M_i$. The equilibrium is necessarily a critical point of V. Now Proposition 1 (ii) indicates that the only critical points of V are the global minima of $V|_M$. Under the hypothesis that A is square and nonsingular, or that it has full row rank, there is an x such that Ax = b. An equilibrium point of the equations is given by $x_i = x$ for all i, which means that $X \in M$ assumes the form $X = [x^\top, x^\top, \dots, x^\top]^\top$. Such a value of X gives rise to V = 0, i.e. corresponds to a global minimum. Conversely, at a global minimum, we can argue that X must take the form indicated, and then it follows that Ax = b. For suppose, in order to obtain a contradiction, that at a global minimum there held $X = [x_1^\top, x_2^\top, \dots, x_n^\top]^\top$ with $x_i \neq x_j$ for some pair i, j. Because of the connectedness of the graph G, there is a path of edges in E connecting vertex i to vertex j. Since $x_i \neq x_j$, there must hold, for some edge along this path, call it rs, that $x_r \neq x_s$, and then immediately V is seen to be nonzero since $\|x_r - x_s\|^2$ is one of the summands making up V. Thus V does not attain its global minimum, a contradiction. Hence at any equilibrium point to which the equations converge, consensus is attained, i.e. $x_i = x_j$ for all i, j, and the common value is a solution of Ax = b.
We now give a more explicit algebraic derivation of the same conclusion, which will be useful in pinning down the exponential rate of convergence. Let L denote the Laplacian matrix associated with G; note that the connectivity property of G implies that L is nonnegative definite symmetric, with one eigenvalue at the origin, the associated eigenspace being the span of $\mathbf 1$, where $\mathbf 1$ denotes the vector with entries all 1. Let $e_i$ denote the unit vector in $\mathbb R^n$ with 1 in the i-th position. Denote the matrix formed by placing the $x_i \in \mathbb R^n$ next to one another as
$$\mathbf X = [x_1\ x_2\ \cdots\ x_n]$$
(this should be distinguished from X, which is obtained by stacking the $x_i$). At the equilibrium, there holds
$$P_i\,\mathbf X L e_i = 0, \qquad i\in\mathbf n.$$
This implies that $\mathbf X L e_i = \lambda_i a_i$ for some scalars $\lambda_i$, and, since $L\mathbf 1 = 0$, that $0 = \mathbf X L\mathbf 1 = \sum_i \lambda_i a_i$. Because the vectors $a_1, a_2, \dots, a_n$ are independent under the theorem hypothesis (whether or not A is square), we have that the $\lambda_i$ are all zero, that $\mathbf X L = 0$ and therefore the columns of $\mathbf X$ are identical, i.e. consensus holds. Exponential convergence follows from the fact that the equations are linear and time-invariant: if convergence occurs, it is necessarily exponential. This completes the proof of Theorem 1, and we shall return to the question of the rate of convergence subsequently.
To prove the corollary, we let $e_i = x_i - x^*$, where $x^*$ is a constant vector such that $Ax^* = b$. From $\dot e_i = \dot x_i$ and (13) one has
$$\dot e_i = -P_i\sum_{j\in\mathcal N_i}(e_i - e_j) - (I - P_i)e_i, \qquad i\in\mathbf n \qquad (21)$$
To write the equations (21) in a more compact form, we let $e = [e_1^\top\ e_2^\top\ \cdots\ e_n^\top]^\top$, $P = \operatorname{diag}[P_1, P_2, \dots, P_n]$ and $\bar L = L \otimes I_n$. One has
$$\dot e = -(P\bar L + I - P)e \qquad (22)$$
Proving that all $x_i$ converge to $x^*$ exponentially fast is equivalent to proving that e converges to 0 exponentially fast, for which we need the following lemma.

Lemma 1. All eigenvalues of $P\bar L + I - P$ are real and positive.

We first show that the eigenvalues of $P\bar L + I - P$ are the same as those of $P\bar L P + I - P$ (we will then analyse the eigenvalues of the latter matrix). To see this, observe that because $P^2 = P$, one has $P(P\bar L - P) = P\bar L - P$. Hence the eigenvalues of $P\bar L - P$ are the same as those of $(P\bar L - P)P = P\bar L P - P$ by Lemma 2 (the standard fact that MN and NM have the same eigenvalues for square M, N). Consequently, the eigenvalues of $P\bar L + I - P$ are the same as those of $P\bar L P + I - P$. Observe first that $P\bar L P + I - P$ is positive semi-definite, since $P\bar L P$ and $I - P$ are positive semi-definite. Thus to prove Lemma 1, it is sufficient to establish that $P\bar L P + I - P$ is non-singular. We prove this by assuming the contrary to obtain a contradiction. Suppose there is a nonzero $\alpha$ for which $[P\bar L P + (I - P)]\alpha = 0$. Premultiplying by $\alpha^\top$ and using the positive semi-definiteness of both summands shows that $\alpha^\top P\bar L P\alpha = 0$ and $(I - P)\alpha = 0$; then clearly
$$\bar L P\alpha = 0 \qquad (24)$$
It follows that $\bar L\alpha = 0$ and so $\alpha = \mathbf 1 \otimes q$ for some $q \in \mathbb R^n$. Then again from the fact that $(I - P)\alpha = 0$, we obtain $a_i^\top q = 0$ for all $i \in \mathbf n$, whence q, and thus $\alpha$, is zero, a contradiction. Hence $P\bar L P + I - P$ is non-singular, which establishes Lemma 1 and with it Corollary 1.
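Lemma 1 can be spot-checked numerically. The sketch below, with an illustrative ring graph and a randomly generated nonsingular A (our own choices), builds $P$ and $\bar L$ and verifies that the spectra of $P\bar L + I - P$ and $P\bar L P + I - P$ coincide and are positive:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
A = rng.standard_normal((n, n)) + n * np.eye(n)

# Block-diagonal P = diag[P_1, ..., P_n] with P_i = I - a_i a_i^T / (a_i^T a_i)
blocks = [np.eye(n) - np.outer(A[i], A[i]) / (A[i] @ A[i]) for i in range(n)]
P = np.zeros((n * n, n * n))
for i, Pi in enumerate(blocks):
    P[i * n:(i + 1) * n, i * n:(i + 1) * n] = Pi

# Laplacian of the ring graph on n = 4 vertices, and L_bar = L (x) I_n
L = np.array([[2., -1, 0, -1],
              [-1, 2, -1, 0],
              [0, -1, 2, -1],
              [-1, 0, -1, 2]])
L_bar = np.kron(L, np.eye(n))
I = np.eye(n * n)

eig_M = np.linalg.eigvals(P @ L_bar + I - P)          # generally non-symmetric
eig_S = np.linalg.eigvalsh(P @ L_bar @ P + I - P)     # symmetric counterpart

print(np.max(np.abs(eig_M.imag)))                     # ~ 0: eigenvalues are real
print(np.min(eig_S))                                  # > 0: matrix is nonsingular
print(np.allclose(np.sort(eig_M.real), eig_S, atol=1e-6))  # spectra coincide
```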
The fact that exponentially fast convergence occurs means that if b is varying, the algorithm will still lead to an approximate consensus solution of Ax = b, with the quality of the approximation linked to the rate of variation of b.

4. Convergence rates. There is an alternative way of writing the equations (11), namely in the compact form
$$\dot X = -P\bar L X \qquad (26)$$
Evidently, when the $a_i$ are linearly independent, there are n eigenvalues of $P\bar L$ at the origin, and the remainder are real and positive, so that the corresponding eigenvalues of the system matrix $-P\bar L$ have negative real parts. The rate of exponential convergence of the system (26) to its equilibrium is determined by the smallest non-zero eigenvalue of $P\bar L$, which we shall denote by ρ.
Note that $P\bar L$ and $P\bar L P$ have the same eigenvalues, because $P\bar L = P(P\bar L)$, and $P(P\bar L)$ and $(P\bar L)P$ have the same eigenvalues. Thus ρ is real and positive and is equal to the smallest non-zero eigenvalue of $P\bar L P$. In order to find a bound for ρ, we will study the non-zero eigenvalues of $P\bar L P$.

ANDERSON AND MOU AND MORSE AND HELMKE
Let $\bar Q = [\tilde Q\ \ Q]$ be a square orthogonal matrix such that the columns of $\tilde Q$ and Q form bases for $\ker P$ and $\operatorname{Im} P$, respectively. Then
$$\bar Q^\top P\bar L P\bar Q = \begin{bmatrix} 0 & 0 \\ 0 & Q^\top P\bar L P Q \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 0 & Q^\top \bar L Q \end{bmatrix} \qquad (29)$$
The last equality in (29) comes from $PQ = Q$, since the columns of Q form a basis for $\operatorname{Im} P$ and $P^2 = P$. Thus all non-zero eigenvalues of $\bar Q^\top P\bar L P\bar Q$, and hence of $P\bar L P$, are the same as those of the nonsingular matrix $Q^\top\bar L Q$. From $Q^\top Q = I_{n(n-1)}$ and the Poincaré separation theorem [19], one has
$$\lambda_1(\bar L) \le \lambda_1(Q^\top\bar L Q) \le \lambda_{n+1}(\bar L) \qquad (30)$$
where $\lambda_j(\cdot)$ denotes the j-th smallest eigenvalue of a Hermitian matrix. Since $\bar L = L \otimes I_n$ and L is the Laplacian of a connected graph, one has
$$\lambda_1(\bar L) = 0, \qquad \lambda_{n+1}(\bar L) = \lambda_2(L) \qquad (31)$$
Recall that ρ is equal to the smallest non-zero eigenvalue of $P\bar L P$, which is similar to $\bar Q^\top P\bar L P\bar Q$. Then
$$\rho = \lambda_1(Q^\top\bar L Q) \qquad (32)$$
From (30) to (32), one reaches a trivial lower bound for ρ, which is 0, and the following upper bound:

Theorem 2. The smallest non-zero eigenvalue ρ of $P\bar L$ is upper bounded by
$$\rho \le \lambda_2(L) \qquad (33)$$

Remark 1. $\lambda_2(L)$ is called the algebraic connectivity of a graph. It is bounded below by [20]
$$\lambda_2(L) \ge \frac{4}{nD} \qquad (34)$$
with D the diameter of the graph, and is bounded above by $\lambda_2(L) \le n$ for $n \ge 2$, with equality holding if and only if the graph is complete [21]. This further gives an upper bound on ρ in terms of the number of agents in the network.
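Theorem 2 is easy to check numerically on an example; in the sketch below, the ring graph on four vertices (for which $\lambda_2(L) = 2$) and the random diagonally dominant A are illustrative choices of our own:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4
A = rng.standard_normal((n, n)) + n * np.eye(n)

blocks = [np.eye(n) - np.outer(A[i], A[i]) / (A[i] @ A[i]) for i in range(n)]
P = np.zeros((n * n, n * n))
for i, Pi in enumerate(blocks):
    P[i * n:(i + 1) * n, i * n:(i + 1) * n] = Pi

L = np.array([[2., -1, 0, -1],
              [-1, 2, -1, 0],
              [0, -1, 2, -1],
              [-1, 0, -1, 2]])        # ring graph on 4 vertices; lambda_2(L) = 2
L_bar = np.kron(L, np.eye(n))

eig = np.linalg.eigvalsh(P @ L_bar @ P)   # same spectrum as P @ L_bar
rho = np.min(eig[eig > 1e-9])             # rho: smallest non-zero eigenvalue
lam2 = np.sort(np.linalg.eigvalsh(L))[1]  # algebraic connectivity of the graph

print(np.sum(eig < 1e-9))   # 4: exactly n eigenvalues at the origin
print(rho <= lam2 + 1e-12)  # True: the bound of Theorem 2 holds
```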
From (34) one observes that $\lambda_2(L)$ could be very small, which suggests that ρ may be close to 0 for certain graphs. Moreover, (32) implies that ρ is related to both L and Q, the latter of which is determined by the matrix A. It may be impossible to obtain a non-trivial lower bound for ρ without assuming more about A beyond non-singularity.

5. Conclusions. There are a number of issues that are related to the ideas of this paper, but remain unexplored. We comment briefly on some of them.
In case the matrix A is tall, in general the equation Ax = b cannot be solved, but it does make sense to search for a least squares solution. Theoretically, this could be obtained in the full rank case by working with the linear equation $A^\top Ax = A^\top b$, but any such approach requiring the initial computation of $A^\top A$ and $A^\top b$ might be contrary to the spirit of seeking a decentralized solution because of the associated computations. For example, in distributed parameter estimation [22], a multi-agent network aims to solve a group of observation equations $A_i x = b_i$. Because the $b_i$ are usually contaminated with measurement noise, the whole linear equation Ax = b is usually overdetermined. How then to obtain a least squares solution in a distributed way is for the moment an open problem. An alternative idea to obtain the least squares solution in a distributed way was briefly mentioned in [13], namely solving a larger linear equation than Ax = b. However, this idea requires each agent to control an augmented state vector whose dimension does not scale well with the number of agents in the network. This scaling problem has recently been partially ameliorated in [23] by assuming a sparse structure for the linear equations of interest.
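For reference, the normal-equations route mentioned above is easy to verify centrally; the sketch below (our own illustration, not a proposal for the distributed setting) confirms that solving $A^\top Ax = A^\top b$ recovers the least squares solution of an overdetermined system:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 8, 3
A = rng.standard_normal((m, n))   # tall matrix, full column rank (with prob. 1)
b = rng.standard_normal(m)

x_normal = np.linalg.solve(A.T @ A, A.T @ b)     # via the normal equations
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)  # reference least squares solve

print(np.allclose(x_normal, x_lstsq))  # True: both give the least squares solution
```

The point made in the text stands: forming $A^\top A$ and $A^\top b$ requires aggregating all rows in one place, which is exactly what a decentralized scheme is meant to avoid.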
It would clearly be relevant to contemplate algorithms in which rows of A were grouped together. The manifolds M i would be defined by equations like A i x i = b i where A i was a fat matrix. The changes to the algorithm and the associated proof are trivial to contemplate. One issue is what the difference in computational burden might be.
Evidently, the fact that the only requirement on the graph is that it be connected allows the overlaying of a concept like sparseness on the equations for the $x_i$: for example, with a path graph, the differential equation for $x_i$ need have at most two other $x_j$ feeding into it. How to exploit sparseness in the matrix A is less clear. Obviously though, inner products involving $a_i$ are easier to compute when A is sparse.
In the introduction, we noted that a discrete form of the algorithm is available, see [11][12][13]. It should be immediately derivable from the equations presented here. Whether there could be issues of stiffness that would cause problems in discretizing the equations is unknown.
In numerical linear algebra, the difficulty of executing certain calculations is frequently evaluated, and often the formula involves the dimension of the underlying matrices. It is evident here there are tradeoffs possible between convergence rate, storage requirements and complexity requirements in an implementation of a differential equation solver. How all these interact and what sort of comparison can be made with conventional counts for matrix inversion is unknown.
There are several other more significant directions in which this work might be developed. First, one could seek to add linear inequalities or more generally convex inequalities into the problem statement. Reference [14] can be thought of as explaining the use of consensus for distributed optimization in discrete time, and its existence is prima facie evidence of the reasonableness of seeking to generalize the ideas of this paper. Second, in coding theory there are sometimes requirements to solve linear equations with tall matrices whose entries are in a finite field. One might speculate as to whether a consensus approach could work, and be of benefit, for such problems. Of course, the notion of projection onto manifolds would not be expected to play any role. Last, we comment that the device of representing the equation to be solved as a consensus problem, and using a series of manifolds as a lens through which to view the problem, is a device that can find application to many control tasks. For example, consider the task of aligning three coordinate frames each in R 3 . This is a consensus problem on a sphere, and gradient flow on a sphere is easy to compute. Thus one can readily devise an algorithm to secure the alignment.