Geometry of Matrix Decompositions Seen Through Optimal Transport and Information Geometry

The space of probability densities is an infinite-dimensional Riemannian manifold, with Riemannian metrics in two flavors: Wasserstein and Fisher--Rao. The former is pivotal in optimal mass transport (OMT), whereas the latter occurs in information geometry---the differential geometric approach to statistics. The Riemannian structures restrict to the submanifold of multivariate Gaussian distributions, where they induce Riemannian metrics on the space of covariance matrices. Here we give a systematic description of classical matrix decompositions (or factorizations) in terms of Riemannian geometry and compatible principal bundle structures. Both Wasserstein and Fisher--Rao geometries are discussed. The link to matrices is obtained by considering OMT and information geometry in the category of linear transformations and multivariate Gaussian distributions. This way, OMT is directly related to the polar decomposition of matrices, whereas information geometry is directly related to the $QR$, Cholesky, spectral, and singular value decompositions. We also give a coherent description of gradient flow equations for the various decompositions; most flows are illustrated in numerical examples. The paper is a combination of previously known and original results. As a survey it covers the Riemannian geometry of OMT and polar decompositions (smooth and linear category), entropy gradient flows, and the Fisher--Rao metric and its geodesics on the statistical manifold of multivariate Gaussian distributions. The original contributions include new gradient flows associated with various matrix decompositions, new geometric interpretations of previously studied isospectral flows, and a new proof of the polar decomposition of matrices based on an entropy gradient flow.


1.
Introduction. The influence of matrix decompositions in scientific computing cannot be overstated. Numerical linear algebra, the subject treating computer algorithms for matrix decompositions, is part of the curriculum of almost every mathematics department. A typical course follows an algorithmic approach, based on algebra, combinatorics, and some analysis. It is also possible, although far less common, to follow a geometric approach, based on Riemannian geometry and Lie group theory; this point of view reveals hidden dynamical features of matrix decompositions. The best known examples are perhaps the isospectral flows [83], in particular the Toda flow [80] with its connection to the QR algorithm [78,33], and the work by Brockett [14], who produced a gradient flow that diagonalizes matrices. There are also other examples, known to specialists but less known among most practitioners of numerical linear algebra [72,73,10,29,11,28,64,32,24,46,47,26,41]. Overviews and further references are available in survey papers [83,25,27,81].
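To make Brockett's diagonalizing flow concrete, here is a minimal NumPy sketch of the double-bracket flow dH/dt = [H, [H, N]] from [14], integrated with forward Euler; the matrices, step size, and iteration count are illustrative choices of ours, not taken from the references.

```python
import numpy as np

def bracket(X, Y):
    """Matrix commutator [X, Y] = XY - YX."""
    return X @ Y - Y @ X

# Brockett's double-bracket flow  dH/dt = [H, [H, N]]  is isospectral and,
# for N with distinct diagonal entries, drives a symmetric H towards a
# diagonal matrix.  Forward-Euler time stepping (illustrative data):
H = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])
N = np.diag([1.0, 2.0, 3.0])
dt, steps = 0.002, 15000

eig0 = np.sort(np.linalg.eigvalsh(H))          # spectrum of the initial H
for _ in range(steps):
    H = H + dt * bracket(H, bracket(H, N))

offdiag = np.linalg.norm(H - np.diag(np.diag(H)))
drift = np.max(np.abs(np.sort(np.linalg.eigvalsh(H)) - eig0))
print("off-diagonal norm:", offdiag)   # close to zero
print("eigenvalue drift :", drift)     # small: Euler is only approximately isospectral
```

Note that the exact flow preserves the spectrum, while the Euler discretization introduces a small eigenvalue drift of order dt.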
In this paper we present a systematic Riemannian geometric description of classical matrix decompositions. We thereby give a Lagrangian perspective of decompositions that have previously been studied as Hamiltonian problems, such as the integrable Hamiltonian structure of the Toda flow and its generalizations [63,1,71,77]. We also show how entropy gradient flows, traditionally studied in infinite-dimensional settings, are connected to matrix decompositions. The examples treated are well suited for a first course on Riemannian geometry, as they provide students with new perspectives on topics familiar from numerical linear algebra. In addition, our approach gently introduces the more advanced subjects of optimal mass transport (see Villani [82]) and information geometry (see Amari and Nagaoka [3]). Throughout the paper we avoid most aspects of analysis, focusing on geometry.
The first step in our geometric approach to matrix decompositions is to provide the space of Gaussian distributions with a Riemannian metric. There are two standard choices: Wasserstein and Fisher-Rao. The former occurs in optimal mass transport (OMT) and gives the polar decomposition of matrices. The latter occurs in information geometry and gives the QR, Cholesky, spectral, and singular value decompositions. In § 2 we describe the Wasserstein geometry, through optimal mass transport, and the link to polar decompositions. We consider two different categories of transformations: diffeomorphisms (infinite-dimensional) and general linear transformations (finite-dimensional). In § 3 we describe the Fisher-Rao geometry, through information geometry, and we show how it gives rise to the aforementioned matrix decompositions. Throughout these sections we also give a coherent approach to vertical and horizontal Riemannian gradient flows, to recover the matrix decompositions.
We now continue by introducing the basic ingredients of the paper, without specifying any Riemannian structure. Let x = (x₁, …, xₙ) denote Euclidean coordinates on Rⁿ. A multivariate Gaussian distribution with mean zero is a distribution with probability density of the form
p(x, Σ) = (2π)^(−n/2) det(Σ)^(−1/2) exp(−xᵀΣ⁻¹x/2)
for some symmetric, positive definite covariance matrix Σ. In the language of information geometry (cf. Amari and Nagaoka [3]), the set N n of all such probability distributions constitutes a statistical manifold. That is, N n is a submanifold of the infinite-dimensional manifold Dens(Rⁿ) of all (smooth) probability densities on Rⁿ. There are several ways to equip Dens(Rⁿ) with an infinite-dimensional manifold structure, see for example [36,45,54]. However, as mentioned we shall not go into analysis.
Consider the space of all symmetric n-by-n matrices
S(n) = {S ∈ Rⁿˣⁿ | Sᵀ = S},
and let P(n) ⊂ S(n) denote the open subset of symmetric positive definite matrices. Then P(n) is a ½n(n+1)-dimensional manifold that is isomorphic to the statistical manifold N n. In other words, the mapping
P(n) ∋ Σ ↦ p(·, Σ) dx ∈ Dens(Rⁿ)
is injective. P(n) is therefore isomorphic, as a manifold, to N n. This is the first key ingredient towards matrix decompositions. Let us now explain the second.
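The identification can be made concrete numerically: injectivity of Σ ↦ p(·, Σ) reflects the fact that the covariance matrix is recovered from the distribution, e.g. through its second moments. A small sketch (function names and data are our illustrative choices):

```python
import numpy as np

def gaussian_density(x, Sigma):
    """Zero-mean multivariate Gaussian density p(x, Sigma) on R^n."""
    n = Sigma.shape[0]
    norm = (2 * np.pi) ** (-n / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm * np.exp(-0.5 * x @ np.linalg.solve(Sigma, x))

# Injectivity of Sigma -> p(., Sigma) in practice: the covariance is
# recovered from the distribution via its second moments (Monte Carlo check).
rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
samples = rng.multivariate_normal(np.zeros(2), Sigma, size=200_000)
Sigma_hat = samples.T @ samples / len(samples)
print(np.round(Sigma_hat, 2))   # close to Sigma
```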
Denote by Diff(Rⁿ) the set of diffeomorphisms of Rⁿ. Then Diff(Rⁿ) is an infinite-dimensional Lie group (with respect to a certain topology) acting on Dens(Rⁿ) from the right by pullback:
Diff(Rⁿ) × Dens(Rⁿ) ∋ (ϕ, p dx) ↦ ϕ*(p dx) = (p ∘ ϕ) det(Dϕ) dx ∈ Dens(Rⁿ).
The submanifold N n is invariant under the action of linear transformations (see § 2.3 for details). Since N n ≅ P(n) we can therefore replace the pair of infinite-dimensional manifolds Diff(Rⁿ) and Dens(Rⁿ) by the pair of finite-dimensional manifolds GL(n) and P(n). This is the second key ingredient towards matrix decompositions. Indeed, as we shall see, all the aforementioned decompositions are obtained through an interplay between the action of GL(n) on P(n) and compatible Riemannian structures.
Throughout the paper we use the notion of Riemannian submersions and descending metrics. A brief, self-contained presentation of these concepts is given in [60, § 4]. For more details, we refer to Lang [55,Ch. XIV] and Petersen [67, § 3.5].

1.1.
Contributions. This paper is a combination of a survey and original research. In this section we clarify what is old and what is new.
A first, general contribution is an explicit connection between matrix decompositions and the infinite-dimensional geometry of groups of diffeomorphisms (as studied in [54,52] and references therein). The more specific, original research contributions are new gradient flows associated with various matrix decompositions, new geometric interpretations of previously studied isospectral flows, and a new proof of the polar decomposition of matrices based on an entropy gradient flow.
As a survey, the paper covers
• the Riemannian structure of OMT and the Wasserstein distance ( § 2.1);
• the polar decomposition of diffeomorphisms described geometrically ( § 2.2);
• the Fokker-Planck and heat equations as entropy gradient flows ( § 2.2.2);
• the geometric description of optimal transport in the linear category ( § 2.3);
• the polar decomposition of matrices described geometrically ( § 2.4);
• the Fisher-Rao metric and its geodesics on the statistical manifold of multivariate Gaussian distributions ( § 3).
Some readers might be interested in specific parts, but not the whole paper. Therefore, we have tried to keep each section as independent as possible. For example, sections § 2.1- § 2.2, § 2.3- § 2.4, § 3.1- § 3.2, and § 3.4 are almost independent in themselves. Yet, care has been taken to point out both similarities and differences between the decompositions in Wasserstein geometry and in Fisher-Rao geometry.

2.
Wasserstein geometry and polar decompositions. Recall the (real) polar decomposition of matrices: if A ∈ GL(n) there are unique matrices P ∈ P(n) and Q ∈ O(n) such that A = P Q. Brenier [13] showed that this decomposition provides a finite-dimensional "toy example" of optimal mass transport. In this section we shall look at optimal mass transport and polar decompositions from the point of view of Riemannian geometry. Our presentation is essentially a combination of results found in [13,65] and in [52, App. 5], but with some new aspects, especially related to vertical and lifted gradient flows (as listed in § 1.1).

2.1.
Optimal transport in the smooth category. Let us recall the classical L² optimal mass transport (OMT) problem of Monge [61]. Usually, this problem is presented and analyzed in the very general setting of probability measures and measurable maps. Here we present it in the smooth category, focusing on geometry rather than analysis.
Problem 1 (Smooth OMT). Given µ0, µ1 ∈ Dens(Rⁿ), find ϕ ∈ Diff(Rⁿ) that minimizes
J(ϕ) = ∫ |ϕ(x) − x|² dµ0(x)   (1)
under the constraint
ϕ∗µ0 = µ1.   (2)
Before diving into the geometric description of this problem, we list some properties.
1. If we write the densities as µ0 = ρ0 dx and µ1 = ρ1 dx, then the constraint (2) reads
ρ1 = (ρ0 ∘ ϕ⁻¹) det(Dϕ⁻¹),
where det(Dϕ⁻¹) denotes the Jacobian determinant of ϕ⁻¹.
2. The problem is symmetric in µ0 and µ1 under the inversion ϕ → ϕ⁻¹. Indeed, if ϕ fulfills the constraint ϕ∗µ0 = µ1, then (ϕ⁻¹)∗µ1 = µ0. Thus, if ϕ is a solution to Problem 1, then ϕ⁻¹ is a solution to the reverse problem (µ0 exchanged for µ1 and vice versa).
To describe the geometry of Problem 1, we think of Diff(Rⁿ) and Dens(Rⁿ) as infinite-dimensional manifolds. The tangent space Tϕ Diff(Rⁿ) is identified with the space of smooth maps C∞(Rⁿ, Rⁿ). A Riemannian metric on Diff(Rⁿ) is given by
Gϕ(U, V) = ∫ U(x) · V(x) dµ0(x).   (3)
The geodesic equation associated with this Riemannian metric is very simple: since G is independent of the base point ϕ, and since C∞(Rⁿ, Rⁿ)-variations of elements in Diff(Rⁿ) remain in Diff(Rⁿ), it is given by
γ̈ = 0.   (4)
In particular, the geometry is flat. Given two diffeomorphisms ϕ0 and ϕ1, it follows from (4) that there is a unique geodesic curve [0,1] ∋ t ↦ γ(t) between them, given by
γ(t) = (1 − t)ϕ0 + tϕ1.
There is no guarantee that the path remains in Diff(Rⁿ), although it always remains in C∞(Rⁿ, Rⁿ), but we disregard this for now. Instead, let us compute the energy (squared length) of γ. By definition, it is given by
d²(ϕ0, ϕ1) = ∫₀¹ Gγ(t)(γ̇(t), γ̇(t)) dt = ∫ |ϕ1(x) − ϕ0(x)|² dµ0(x).
Therefore, if id denotes the identity mapping on Rⁿ, the functional J in (1) can be written J(ϕ) = d²(id, ϕ).
Hence, OMT becomes a problem of Riemannian geometry: find the shortest geodesic curve from the identity to the constraint set C(µ0, µ1) := {ϕ | ϕ∗µ0 = µ1}. This observation is already compelling, but it gets even more interesting. The pushforward yields a left action of Diff(Rⁿ) on Dens(Rⁿ), with action map
Diff(Rⁿ) × Dens(Rⁿ) ∋ (ϕ, µ) ↦ ϕ∗µ ∈ Dens(Rⁿ).
The isotropy group of µ0 with respect to this action is given by the subgroup
Diffµ0(Rⁿ) = {ψ ∈ Diff(Rⁿ) | ψ∗µ0 = µ0}.
Notice that the constraint set C(µ0, µ1) is closed under the right action of Diffµ0(Rⁿ). That is, if ζ ∈ C(µ0, µ1) and ψ ∈ Diffµ0(Rⁿ), then ζ ∘ ψ ∈ C(µ0, µ1). It is a short calculation to show that the reverse is also true: if ζ, η ∈ C(µ0, µ1), then there exists a ψ ∈ Diffµ0(Rⁿ) such that η = ζ ∘ ψ. From the so-called "Moser trick" [62], extended to non-compact manifolds [44], we get that C(µ0, µ1) is always non-empty. Thus, by fixing ζ ∈ C(µ0, µ1), the constraint set C(µ0, µ1) is isomorphic to Diffµ0(Rⁿ) by the mapping
Diffµ0(Rⁿ) ∋ ψ ↦ ζ ∘ ψ ∈ C(µ0, µ1),   (5)
so Diffµ0(Rⁿ) parameterizes C(µ0, µ1). If we define a projection mapping
π : Diff(Rⁿ) → Dens(Rⁿ), π(ϕ) = ϕ∗µ0,
then the constraint set C(µ0, µ1) is given by π⁻¹(µ1). The sets π⁻¹(µ) are then the fibers of a principal Diffµ0(Rⁿ)-bundle over Dens(Rⁿ):
Diffµ0(Rⁿ) → Diff(Rⁿ) → Dens(Rⁿ).   (6)
In other words, the space of left cosets is isomorphic to Dens(Rⁿ) by the mapping (5).
Let us now compute the derivative of the principal bundle projection π at ϕ ∈ Diff(Rⁿ). To this end, let γ(t) be a curve in Diff(Rⁿ) with γ(0) = ϕ and let u(t) = γ̇(t) ∘ γ(t)⁻¹. By the Lie derivative theorem (cf. [57, § 4.3]) it follows that
d/dt γ(t)*µ = γ(t)*(Lᵤ₍ₜ₎ µ)
for any µ ∈ Dens(Rⁿ). From the product rule and γ(t)∗µ0 := (γ(t)⁻¹)*µ0 we then get
0 = d/dt (γ(t)* γ(t)∗µ0) = γ(t)*(Lᵤ₍ₜ₎ γ(t)∗µ0) + γ(t)* d/dt (γ(t)∗µ0).
Applying the pullback of γ(t)⁻¹ and rearranging the terms now yields
d/dt (γ(t)∗µ0) = −Lᵤ₍ₜ₎ (γ(t)∗µ0).
Since γ(0)∗µ0 = π(ϕ), we get that the derivative of π applied to U ∈ Tϕ Diff(Rⁿ) is given by Dπ(ϕ)·U = −Lᵤ π(ϕ) with u = U ∘ ϕ⁻¹. The derivative Dπ can now be written
Dπ(ϕ)·(u ∘ ϕ) = −∇·(ρu) dx, π(ϕ) = ρ dx.   (7)
Thus, we see that a vector u ∘ ϕ ∈ Tϕ Diff(Rⁿ) is in the kernel of Dπ(ϕ) if and only if ∇·(ρu) = 0. This leads us to the vertical distribution
Verϕ = {u ∘ ϕ ∈ Tϕ Diff(Rⁿ) | ∇·(ρu) = 0, ρ dx = π(ϕ)}.   (8)
Geometrically, Verϕ is the tangent space of the fiber going through ϕ. Notice that the vertical distribution is defined without reference to a Riemannian metric: it is given solely by the principal bundle structure (6). Also notice that u ∘ ϕ ∈ Verϕ implies u ∈ T_id Diffρdx(Rⁿ), i.e., u is an element of the Lie algebra of Diffρdx(Rⁿ).
Let us now return to Riemannian geometry and the metric (3). The point is that the Riemannian metric G on Diff(Rⁿ) is compatible with the principal bundle (6). That is, G is invariant under the action from the right of Diffµ0(Rⁿ) on Diff(Rⁿ):
Gϕ∘ψ(U ∘ ψ, V ∘ ψ) = Gϕ(U, V), ψ ∈ Diffµ0(Rⁿ).   (9)
This means that G induces a Riemannian metric Ḡ on Diff(Rⁿ)/Diffµ0(Rⁿ) ≅ Dens(Rⁿ). Let us now compute what it is. We first need the orthogonal complement of the vertical distribution with respect to the Riemannian metric (3). This is the horizontal distribution, given by
Horϕ = {∇f ∘ ϕ | f ∈ C∞(Rⁿ)}.   (10)
Indeed, if u ∘ ϕ ∈ Verϕ, then, with π(ϕ) = ρ dx,
Gϕ(∇f ∘ ϕ, u ∘ ϕ) = ∫ (∇f ∘ ϕ)·(u ∘ ϕ) dµ0 = ∫ ∇f · u ρ dx = −∫ f ∇·(ρu) dx = 0.
That every horizontal vector is of the form in (10) follows from the classical Helmholtz decomposition of vector fields on Rⁿ. The vertical distribution depends on the reference volume form µ0. A curious property of the Wasserstein geometry is that the horizontal distribution does not depend on µ0.
Since Hor ϕ is transversal to the kernel of Dπ(ϕ), it follows that is an isomorphism. The Riemannian metricḠ is then defined as Due to the invariance (9) the definition is independent of the choice of ϕ ∈ π −1 (µ).
To compute Dπ(ϕ)⁻¹·µ̇ it follows from (10) that we need to find ∇f that fulfills Dπ(ϕ)·(∇f ∘ ϕ) = µ̇. From (7) this equation is given by
−∇·(ρ∇f) = ρ̇,
where µ̇ = ρ̇ dx. The operator ∆ρ := ∇·ρ∇ is an elliptic differential operator, invertible up to addition of constants. Thus, ∇f = −∇∆ρ⁻¹ρ̇, and Ḡ is given by
Ḡµ(µ̇, µ̇) = Gϕ(∇f ∘ ϕ, ∇f ∘ ϕ) = ∫ |∇∆ρ⁻¹ρ̇|² ρ dx = −∫ ρ̇ ∆ρ⁻¹ρ̇ dx.   (11)
Notice how the dependence on ϕ is removed (by the change of variables in the second equality), as expected from the geometric considerations.
Let us now explicitly explore horizontal solution geodesics of Problem 1. From (4) and the fact that Hor_id = {∇f | f ∈ C∞(Rⁿ)} it follows that such a curve is of the form
γ(t) = id + t∇f.   (12)
Thus we recover the long-established result that Problem 1 is equivalent to the Monge-Ampère equation: with ϕ = γ(1) = id + ∇f, the constraint ϕ∗µ0 = µ1 reads
ρ1(x + ∇f(x)) det(I + ∇²f(x)) = ρ0(x).   (13)
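In one space dimension the constraint can be verified by hand: for zero-mean Gaussian densities with standard deviations σ0 and σ1 the optimal map is linear, ϕ(x) = (σ1/σ0)x, the gradient of the convex function φ(x) = σ1x²/(2σ0). A sketch with illustrative parameters:

```python
import numpy as np

# 1-D check of the Monge-Ampere / pushforward constraint
#   rho1(phi(x)) * phi'(x) = rho0(x)
# for the optimal map between N(0, s0^2) and N(0, s1^2), phi(x) = (s1/s0) x.
s0, s1 = 1.0, 2.0

def rho(x, s):           # 1-D zero-mean Gaussian density
    return np.exp(-x**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

x = np.linspace(-4, 4, 401)
phi = (s1 / s0) * x      # monotone (gradient-of-convex) optimal map
dphi = s1 / s0           # phi'(x), constant since phi is linear

lhs = rho(phi, s1) * dphi
rhs = rho(x, s0)
print("max residual:", np.max(np.abs(lhs - rhs)))   # ~ 0
```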
Let us now summarize this section. The flat Riemannian metric (3) on Diff(Rⁿ) induces a Riemannian metric on the space of probability densities Dens(Rⁿ). Problem 1 then becomes a standard geodesic problem: on the Riemannian manifold (Dens(Rⁿ), Ḡ) find the shortest curve γ̄(t) such that γ̄(0) = µ0 and γ̄(1) = µ1. Once such a curve is found, the solution ϕ to Problem 1 is the endpoint of the corresponding horizontal geodesic γ(t), given by (12) with f = −∆ρ0⁻¹γ̄′(0). Remark 1. The setup in this section can be made completely geometric, in that Rⁿ can be exchanged for any Riemannian manifold M. The key is that the geodesic equation (4) instead becomes the point-wise geodesic equation of M, and the explicit formula (12) becomes
γ(t)(x) = exp_x(t∇f(x)),
where exp_x denotes the Riemannian exponential on M. For details we refer to McCann [58], who developed optimal transport on Riemannian manifolds. The extension of Otto's geometric framework to Riemannian manifolds is discussed by Lott [56] and by Villani [82].

2.2.
Polar decomposition of diffeomorphisms.
We now show how the geometry of OMT gives rise to the polar decomposition of maps, obtained by Brenier [13]. We shall also discuss different dynamical formulations, aiming to recover the polar decomposition as limits of gradient flows.
Let us first state the result.
Theorem 2.1 (Polar decomposition of diffeomorphisms). Let ϕ ∈ Diff(Rⁿ) and µ0 ∈ Dens(Rⁿ). Then there exists a strictly convex function φ ∈ C∞(Rⁿ), unique up to addition of a constant, and a unique diffeomorphism ψ ∈ Diffµ0(Rⁿ) such that
ϕ = ∇φ ∘ ψ.   (14)
The diffeomorphism ∇φ is the unique solution of Problem 1 with µ1 = π(ϕ).
To prove this result we need two lemmas. The subset K ♦ ⊂ Diff(R n ) of diffeomorphisms connected to the identity by horizontal geodesics is called the polar cone. Thus, ϕ ∈ K ♦ if and only if there is a horizontal geodesic γ(t) on Diff(R n ) such that γ(0) = id and γ(1) = ϕ. Lemma 2.2. Up to addition of constants, the mapping φ → ∇φ is an isomorphism between the space of smooth strictly convex functions on R n and the polar cone K ♦ . The polar cone itself is a convex subset of (Diff(R n ), G).
Proof. From (12) it follows that elements in K♦ are of the form ∇φ with φ(x) = |x|²/2 + f(x) for some f ∈ C∞(Rⁿ). Let φ(t, x) = |x|²/2 + tf(x), so that γ(t) = ∇φ(t, ·); by definition of K♦, this path is a horizontal geodesic on Diff(Rⁿ). Since ∇φ(t, ·) is a diffeomorphism for each t ∈ [0, 1], the symmetric matrix ∇²φ(t, x) has only non-zero eigenvalues. Since ∇²φ(0, x) = I has only positive eigenvalues, continuity of the eigenvalues implies that ∇²φ(x) = ∇²φ(1, x) is positive definite, i.e., φ is strictly convex.
Convexity of K ♦ follows since a convex combination of two strictly convex functions is again strictly convex.
The second lemma is the following non-trivial result.
Lemma 2.3. The polar cone K ♦ is a section of the principal bundle (6). That is, the mapping is an isomorphism.
This lemma follows from the work of Caffarelli [18] on regularity of solutions of the Monge-Ampère equation (see also [82, Ch. 12] for a wider discussion about regularity). Caffarelli's proof, however, is not based on the Riemannian geometry considered here, but rather on PDE analysis techniques. In the linear, finite-dimensional category of optimal transport in § 2.3 below, the analogue of Lemma 2.3 is proved geometrically by showing existence and uniqueness of a lifted gradient flow. We conjecture that the same proof technique can also be used in the smooth, infinite-dimensional category of optimal transport. A careful investigation of this, however, is outside the scope here and left for future work (see § 4.1.1 for a brief justification of the conjecture).
We are now ready to prove Theorem 2.1.
Proof. Let µ1 = π(ϕ). By Lemma 2.2 and Lemma 2.3 there is a unique element of the fiber π⁻¹(µ1) of the form ∇φ with φ strictly convex, unique up to addition of a constant. To obtain the actual decomposition, we notice that by construction ϕ and ∇φ belong to the same fiber π⁻¹(µ1), so ψ := (∇φ)⁻¹ ∘ ϕ is an element of Diffµ0(Rⁿ). Hence, we arrive at the decomposition ϕ = ∇φ ∘ ψ. From the geometry described in § 2.1 it follows that a solution to Problem 1 must be the endpoint of a horizontal geodesic from the identity, i.e., an element in K♦. The last assertion then follows since ∇φ is unique.
The geometric insights of OMT suggest the study of several Riemannian gradient flows. We consider three different types.
• Vertical gradient flow: a gradient flow restricted to the fibers of the principal bundle (6). The flow is constructed so that the element ∇φ in Theorem 2.1 is an equilibrium.
• Entropy type gradient flow: a gradient flow on Dens(Rⁿ) with respect to the Riemannian metric (11). We shall use relative entropy as functional, but other choices are also possible (for example the family of porous medium potentials discussed by Otto [65]).
• Lifted gradient flow: the lifting of an entropy type gradient flow to the polar cone. As mentioned, a strategy for a geometric proof of Lemma 2.3 is to establish existence and uniqueness of a limit. An interesting computational approach for Problem 1 is to discretize the lifted gradient flow numerically.
An illustration of the different types of gradient flows is given in Figure 1. Let us continue with a more detailed description of each type of flow.

2.2.1.
Vertical gradient flow.
Recall that the solution to Problem 1 is obtained as the point on the fiber of µ1 that is closest to the identity. Because we have the closed-form expression J, given by (1), for the squared distance from any diffeomorphism to the identity, it is natural to consider the constrained gradient flow
η̇ = −Pr_Ver(∇G J(η)),   (15)
where ∇G denotes the gradient with respect to G and Pr_Ver denotes orthogonal projection onto the vertical distribution. This gives us a gradient flow on the constraint manifold C(µ0, π(ϕ)) = π⁻¹(π(ϕ)) for which ∇φ in the polar decomposition (14) is an equilibrium. Since Gη(∇G J(η), η̇) = Gη(η − id, η̇) it follows that (15) becomes
η̇ = −(η − id) + ∇p ∘ η,   (16)
where the smooth function p is the Lagrange multiplier enforcing η to remain on the constraint manifold C(µ0, π(ϕ)). In particular, the constraint ensures that η is always a diffeomorphism. We mention that the term η − id in (16) should be interpreted as a tangent vector in Tη Diff(Rⁿ); in Riemannian notation, −(η − id) = logη(id), where logη is the inverse of the Riemannian exponential expη. Let us now turn to the Lagrange multiplier p in (16). Since η̇ ∈ Verη, it follows from (8) that
∇·(ρ1 u) = 0, u = η̇ ∘ η⁻¹,
where µ1 = ρ1 dx. Composing (16) by η⁻¹ from the right, and applying ∇·(ρ1 ·), we then obtain an equation for p, namely
∇·(ρ1 ∇p) = −∇·(ρ1 (η⁻¹ − id)).   (17)
We may write equation (16) in a "fluid formulation", using the right reduced variable u = η̇ ∘ η⁻¹. Indeed, composing (16) from the right by η⁻¹ leads to
u = η⁻¹ − id + ∇p,   (18)
where p, as before, is the Lagrange multiplier given by the solution of (17).
Remark 2. The idea of computing the optimal transport map by a flow along the fiber has been considered before, by Angenent, Haker, and Tannenbaum [4]. Their flow, however, is not a gradient flow with respect to the metric (3). Instead, it goes as follows.
Parameterize η as η(t) = η0 ∘ ψ(t)⁻¹, where ψ(t) ∈ Diffµ0(Rⁿ) (this is always possible because of the principal bundle structure (6)). Next, ψ itself is given as the flow of a time-dependent vector field v(t) with ∇·(ρ0 v(t)) = 0. Thus, ψ̇(t) = v(t) ∘ ψ(t). So far we have just carried out a change of variables (η, η̇) ↔ (ψ, v); the flow studied in [4] then corresponds to a particular choice (19) of v(t), where q is the Lagrange multiplier corresponding to the Helmholtz projection. Now let us compare (19) with (16). To this end, we need to see what the flow (16) looks like in the variables (ψ, v). Differentiating η(t) ∘ ψ(t) = η0 we get
η̇ ∘ ψ + (Dη ∘ ψ)·ψ̇ = 0.
Thus, u and v are related by minus conjugation by η(t). Using the fluid formulation (18) we then get the corresponding expression for v(t). Notice that this choice of v(t) is different from (19).

2.2.2.
Entropy gradient flow.
In this section we consider gradient flows on Dens(Rⁿ) with respect to the Wasserstein Riemannian metric (11). Such flows were studied by Jordan, Kinderlehrer, and Otto [50] for the entropy functional (giving the Fokker-Planck equation) and later by Otto [65] for a more general class of functionals (giving porous medium equations). Here, we focus on entropy as in [50]: for the relative entropy functional H(µ) = ∫ ρ log(ρ/ρ1) dx, the gradient flow with respect to (11) is
ρ̇ = ∇·(ρ ∇ log(ρ/ρ1)),   (21)
which, writing ρ1 ∝ e^(−V), is the Fokker-Planck equation ρ̇ = ∆ρ + ∇·(ρ∇V); for ρ1 constant it reduces to the heat equation.
The geometric insights of the flow (21) can be used for analysis, in particular the question of convergence towards a limit. Indeed, by proving negativity of the Hessian of the relative entropy functional H(µ) with respect to the Wasserstein metric, Otto [65] was able to give exponential rates of convergence. Of course, for the linear flow (21) this can be achieved by standard PDE techniques, but Otto's geometric analysis also works for non-linear porous medium flows.
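As a concrete illustration of the entropy gradient flow (21), the sketch below integrates the one-dimensional Fokker-Planck equation ρ̇ = (ρₓ + xρ)ₓ, i.e. the flow with target ρ1 the standard Gaussian, using a conservative finite-difference scheme; the grid, time step, and initial data are our illustrative choices.

```python
import numpy as np

# 1-D Fokker-Planck equation  rho_t = (rho_x + x*rho)_x : the entropy
# gradient flow (21) with target rho1 = standard Gaussian.  Conservative
# finite differences with no-flux boundaries (illustrative parameters).
L, m = 6.0, 241
x = np.linspace(-L, L, m)
dx = x[1] - x[0]
xm = 0.5 * (x[:-1] + x[1:])          # interface midpoints
dt, steps = 5e-4, 2000

rho1 = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)          # target density
rho = np.exp(-(x - 1.5)**2 / 2) / np.sqrt(2 * np.pi)   # shifted initial density

def relative_entropy(r):
    return np.sum(r * np.log(np.maximum(r, 1e-300) / rho1)) * dx

H0, mass0 = relative_entropy(rho), np.sum(rho) * dx
for _ in range(steps):
    F = np.diff(rho) / dx + xm * 0.5 * (rho[:-1] + rho[1:])  # flux rho_x + x*rho
    rhs = np.empty_like(rho)
    rhs[0], rhs[1:-1], rhs[-1] = F[0], np.diff(F), -F[-1]
    rho = rho + dt * rhs / dx

H1, mass1 = relative_entropy(rho), np.sum(rho) * dx
print("entropy:", H0, "->", H1)      # decreases along the flow
print("mass drift:", abs(mass1 - mass0))
```

The flux-difference form conserves total mass exactly (up to round-off), while the relative entropy decays, mirroring the convergence of (21) towards ρ1.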

2.2.3.
Lifted gradient flow.
Here we are interested in constructing a gradient flow of diffeomorphisms, evolving on the polar cone K♦ such that its limit is the solution to Problem 1. To do so, we consider lifting of the entropy gradient flow in § 2.2.2 with respect to the principal bundle (6). First, define the lifted functional F := H ∘ π, i.e., F(ϕ) = H(π(ϕ)). By construction, F is constant on the fibers, so its gradient ∇G F with respect to (3) is orthogonal to the fibers: ∇G F(ϕ) ∈ Horϕ. Thus, the unconstrained gradient flow
ϕ̇ = −∇G F(ϕ)   (22)
traces an integral curve of the horizontal distribution. Furthermore, since the projection π is a Riemannian submersion, it follows that
Dπ(ϕ)·∇G F(ϕ) = ∇Ḡ H(π(ϕ)),
so if ϕ(t) is an integral curve of (22), then µ(t) = π(ϕ(t)) is an integral curve of the entropy gradient flow (21). Since (21) has µ1 as a limit, it follows that ϕ(t) approaches the fiber π⁻¹(µ1) as t → ∞. At first sight, it therefore looks promising to use the flow (22) with initial data ϕ(0) = id as a way to compute the solution to Problem 1 (recall from § 2.1 that the solution to Problem 1 is a horizontal geodesic from id to π⁻¹(µ1)). However, things are not quite that simple, because the horizontal distribution is not integrable, so two different horizontal paths starting at id and ending at π⁻¹(µ1) typically end up at different points of the fiber π⁻¹(µ1).
As a remedy we shall instead consider the lifted gradient flow constrained to the polar cone K♦. Notice that, in general, TϕK♦ ≠ Horϕ, although T_idK♦ = Hor_id. (We also know that TϕK♦ ∩ Horϕ is at least 1-dimensional, since K♦ consists of endpoints of horizontal geodesics.) Consequently, K♦ is not invariant under the unconstrained gradient flow (22): we need to consider the projected flow
ϕ̇ = −Πϕ(∇G F(ϕ)),   (23)
where Πϕ : Tϕ Diff(Rⁿ) → TϕK♦ denotes the orthogonal projection. The flow (23) is then the Riemannian gradient flow of F restricted to K♦ with respect to the Riemannian metric (3) restricted to K♦.
Let us now work out (23) explicitly. First, recall from Lemma 2.2 that elements in the polar cone are of the form ∇φ for a strictly convex, smooth function φ. The functional F restricted to K♦ is given by
F(∇φ) = H((∇φ)∗µ0) = ∫ (log ρ0(x) − log det ∇²φ(x) − log ρ1(∇φ(x))) ρ0(x) dx,
where ∇²φ denotes the Hessian of φ, and the second equality follows from the change of variables induced by ∇φ. Taking the variation with respect to φ and integrating by parts (here ∇· applied to matrices denotes the divergence operator applied row-wise), the definition (3) of G then yields the gradient. Using Lemma 2.2 we can represent the polar cone by strictly convex functions φ, defined up to addition of constants. The Riemannian metric G on K♦ then induces the Riemannian metric on the space of strictly convex functions
Ĝφ(δφ1, δφ2) = ∫ ∇δφ1 · ∇δφ2 ρ0 dx.
Now, the constrained lifted gradient flow (23), written in the variable φ, is given by
φ̇ = −∇Ĝ F̂(φ), F̂(φ) := F(∇φ),
from which we obtain the explicit formulation of the flow; in the simple case when ρ0 ≡ 1 it simplifies further. As already mentioned, an approach for a geometric proof of Lemma 2.3 is to show that this flow has a unique limit in the set of strictly convex functions. We conjecture this to be true, at least when ρ1 is log-concave, based on calculations showing negativity of the Hessian of F̂ (see § 4.1.1 for a brief justification).

2.3.
Optimal transport in the linear category. As we have seen earlier, the multivariate Gaussian distributions N n constitute a submanifold of Dens(Rⁿ). In this section we consider the geometry of optimal transport restricted to the linear category, as studied by Takatsu [79].
If ϕ(x) = Ax then
J(ϕ) = ∫ |Ax − x|² dµ0(x) = tr((A − I)Σ0(A − I)ᵀ),   (30)
where Σ0 is the covariance matrix of µ0. Moreover, the pushforward of a zero-mean Gaussian distribution with covariance Σ0 by ϕ is again a zero-mean Gaussian, with covariance AΣ0Aᵀ. Since N n ≅ P(n) we can now reformulate Problem 2 in terms of covariance matrices.
Problem 3 (Linear OMT). Given Σ0, Σ1 ∈ P(n), find A ∈ GL(n) that minimizes (30) under the constraint
AΣ0Aᵀ = Σ1.   (31)
The action of GL(n) on P(n) is transitive, so for each pair Σ0, Σ1 ∈ P(n) there exists an A ∈ GL(n) fulfilling condition (31). This is the finite-dimensional analogue of the "Moser trick" used in § 2.1.
Let us now proceed with the geometry of Problem 3. Since
GL(n) ∋ A ↦ (x ↦ Ax) ∈ Diff(Rⁿ)   (32)
identifies GL(n) with a submanifold of Diff(Rⁿ), GL(n) inherits the Riemannian metric (3), so henceforth we think of GL(n) as a Riemannian submanifold. In essence, the result is that the geometry of Problem 3 duplicates that of the infinite-dimensional Problem 1. The key to see this is the following simple but important result.
Lemma 2.4. The submanifold GL(n) ⊂ Diff(Rⁿ), identified via (32), is totally geodesic.
Proof. Under the identification (32), tangent vectors of the submanifold GL(n) consist of linear mappings Rⁿ → Rⁿ. The result follows since the solution to (4) with ϕ(0) ∈ GL(n) and ϕ̇(0) a linear mapping remains a linear mapping.
Explicitly, the Riemannian metric (3) restricted to GL(n) is given by
G_A(Ȧ1, Ȧ2) = ∫ Ȧ1x · Ȧ2x dµ0(x),   (33)
which can also be written
G_A(Ȧ1, Ȧ2) = tr(Ȧ1Σ0Ȧ2ᵀ).   (34)
The corresponding Riemannian squared distance between A0, A1 ∈ GL(n) is given by
d²(A0, A1) = tr((A1 − A0)Σ0(A1 − A0)ᵀ).
A direct consequence of Lemma 2.4 is that the squared distance from the identity to A with respect to G is given by (30). Therefore, all the geometric aspects of Problem 1 are valid also for Problem 3, but in a finite-dimensional setting. In particular, solutions are given by horizontal geodesics. For completeness, we shall now derive explicitly the analogous finite-dimensional geometric concepts.
2.3.1. Principal bundle structure. The principal bundle analogous to (6) is
O(n, Σ0) → GL(n) → P(n), O(n, Σ0) := {Q ∈ GL(n) | QΣ0Qᵀ = Σ0}.   (35)
The projection π : GL(n) → P(n) is given by
π(A) = AΣ0Aᵀ.

Its derivative Dπ(A) is computed as follows:
Dπ(A)·Ȧ = ȦΣ0Aᵀ + AΣ0Ȧᵀ.   (36)
We encourage the reader to compare this formula with the infinite-dimensional case (7).
From (36) we get that the vertical distribution is given by
Ver_A = {ΩA | Ω ∈ o(n, Σ)}, Σ = π(A),   (37)
where o(n, Σ) = {Ω ∈ Rⁿˣⁿ | ΩΣ + ΣΩᵀ = 0} denotes the Lie algebra of O(n, Σ). Again, compare with the infinite-dimensional case (8).
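The Lie algebra condition can be checked numerically: matrices of the form Ω = YΣ⁻¹ with Y antisymmetric satisfy ΩΣ + ΣΩᵀ = 0, and their exponentials lie in O(n, Σ), so left multiplication by them preserves the fiber. A sketch (the parametrization and data are our illustrative choices):

```python
import numpy as np
from scipy.linalg import expm

Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])   # a point Sigma in P(2)
Y = np.array([[0.0, 0.7], [-0.7, 0.0]])      # antisymmetric matrix
Omega = Y @ np.linalg.inv(Sigma)             # element of o(2, Sigma)

# Lie algebra condition of (37):  Omega Sigma + Sigma Omega^T = 0
print(np.round(Omega @ Sigma + Sigma @ Omega.T, 12))

# Exponentials of o(n, Sigma) lie in O(n, Sigma):  Q Sigma Q^T = Sigma,
# so left multiplication by Q moves A within the fiber over pi(A) = Sigma.
Q = expm(Omega)
print(np.round(Q @ Sigma @ Q.T - Sigma, 12))
```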

2.3.2.
Descending metric. The formula (34) reveals that the metric G is O(n, Σ0) right-invariant, as expected. It therefore descends to a metric Ḡ on GL(n)/O(n, Σ0) ≅ P(n), corresponding to the restriction of the Wasserstein metric (11) to N n. The horizontal distribution (10) restricted to GL(n) is given by
Hor_A = {SA | S ∈ S(n)}.   (38)
Indeed, since x ↦ ȦA⁻¹x is a linear mapping, the function f in (10) must be a quadratic form f(x) = xᵀSx/2, so that ∇f(x) = Sx; we thereby recover (38). Another way to arrive at the same result is to directly compute the orthogonal complement in T_A GL(n) of Ver_A. The metric Ḡ is now defined by
Ḡ_Σ(Σ̇, Σ̇) = G_A(Dπ(A)⁻¹·Σ̇, Dπ(A)⁻¹·Σ̇), Σ = π(A).
To work it out explicitly, first notice that, for a horizontal vector SA, formula (36) gives
Dπ(A)·(SA) = SΣ + ΣS.
From (34) and (38) it then follows that
Ḡ_Σ(Σ̇, Σ̇) = tr(SΣS),   (39)
where S ∈ S(n) is the solution to the continuous Lyapunov equation (a special case of a Sylvester equation [76,7]) given by
SΣ + ΣS = Σ̇.
From (38) it follows that the horizontal geodesics from the identity are of the form
γ(t) = (1 − t)I + tP, P ∈ S(n).   (40)
Consequently, the finite-dimensional analogue of the Monge-Ampère equation (13) consists in finding P := γ(1) ∈ S(n) such that
PΣ0P = Σ1.
In particular, if Σ0 = I the solution is the matrix square root of Σ1. Hence we see that the solution to the Monge-Ampère equation (13) with ρ0 = 1 is, in a certain sense, a generalization of the matrix square root.
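Both the reduction to the matrix square root and the general equation PΣ0P = Σ1 can be checked numerically. The closed-form solution used below, P = Σ0^(−1/2)(Σ0^(1/2)Σ1Σ0^(1/2))^(1/2)Σ0^(−1/2), is the well-known transport map between Gaussians; it is our addition, not derived in the text.

```python
import numpy as np

def spd_sqrt(S):
    """Principal square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(w)) @ V.T

def transport_map(Sigma0, Sigma1):
    """Symmetric positive definite solution P of P Sigma0 P = Sigma1,
    via the closed form Sigma0^{-1/2}(Sigma0^{1/2} Sigma1 Sigma0^{1/2})^{1/2} Sigma0^{-1/2}."""
    R = spd_sqrt(Sigma0)
    Rinv = np.linalg.inv(R)
    return Rinv @ spd_sqrt(R @ Sigma1 @ R) @ Rinv

Sigma0 = np.array([[1.5, 0.3], [0.3, 1.0]])
Sigma1 = np.array([[2.0, -0.4], [-0.4, 0.8]])

P = transport_map(Sigma0, Sigma1)
print(np.round(P @ Sigma0 @ P - Sigma1, 10))   # ~ 0: P solves P Sigma0 P = Sigma1

# With Sigma0 = I the solution reduces to the matrix square root of Sigma1:
P_I = transport_map(np.eye(2), Sigma1)
print(np.allclose(P_I, spd_sqrt(Sigma1)))      # True
```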
2.4. Polar decomposition of matrices. The finite-dimensional polar cone K ♦ consists of those matrices that are connected to the identity matrix by a curve in GL(n) of the form (40), i.e., by a horizontal geodesic. The result corresponding to Theorem 2.2 is the following.
Lemma 2.5. The polar cone K ♦ ⊂ GL(n) consists of all positive definite symmetric matrices. It is a convex submanifold of GL(n).
Proof. The proof is almost identical to that of Lemma 2.2: Let P ∈ K♦ and take γ(t) to be the horizontal geodesic such that γ(0) = I and γ(1) = P. Then, for each t ∈ [0, 1], γ(t) is a symmetric matrix and an element of GL(n). Since I has only positive eigenvalues, and the eigenvalues of γ(t) depend continuously on t and never vanish, it follows that γ(t) has only positive eigenvalues. Thus, γ(t) is positive definite, so K♦ consists of positive definite symmetric matrices. Now, if P is any positive definite symmetric matrix, then γ(t) = (1 − t)I + tP is a horizontal geodesic originating from the identity. Thus, P = γ(1) ∈ K♦ by definition.
Convexity of K ♦ follows since a convex combination of positive definite symmetric matrices is positive definite symmetric.
Next follows the analogue of Theorem 2.3.
Lemma 2.6. The polar cone K ♦ is a section of the principal bundle (35). That is, the mapping π K♦ : K ♦ → P(n) is an isomorphism.
Whereas this result readily follows by linear algebraic techniques, we shall, as mentioned, give a new, geometric proof in § 2.4.3, based on a finite-dimensional analogue of the lifted Riemannian gradient flow in § 2.2.3.
The decomposition now reads as follows.
Theorem 2.7 (Polar decomposition of matrices). Let A ∈ GL(n) and Σ0 ∈ P(n). Then there exist unique matrices P ∈ P(n) and Q ∈ O(n, Σ0) such that
A = PQ.
The matrix P is the unique solution of Problem 3 with Σ1 = AΣ0Aᵀ.
Proof. Let Σ1 = π(A). From Lemma 2.6 we get a unique corresponding matrix P ∈ K♦ = P(n) such that π(P) = Σ1. By construction, A and P belong to the same fiber, so by the principal bundle structure (35) it follows that Q := P⁻¹A ∈ O(n, Σ0). That P is the solution of Problem 3 follows from the geometry, since it is the endpoint of a horizontal geodesic originating from the identity, which is the shortest curve between the identity matrix and the fiber π⁻¹(Σ1).
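The construction in the proof is directly computable. The sketch below (our code; the fiber point P is obtained from the closed-form Gaussian transport map rather than a gradient flow) verifies A = PQ with Q ∈ O(n, Σ0) on illustrative data:

```python
import numpy as np

def spd_sqrt(S):
    """Principal square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(w)) @ V.T

def generalized_polar(A, Sigma0):
    """Polar decomposition A = P Q with P in P(n), Q in O(n, Sigma0)."""
    Sigma1 = A @ Sigma0 @ A.T                      # pi(A)
    R = spd_sqrt(Sigma0)
    Rinv = np.linalg.inv(R)
    P = Rinv @ spd_sqrt(R @ Sigma1 @ R) @ Rinv     # solves P Sigma0 P = Sigma1
    Q = np.linalg.inv(P) @ A
    return P, Q

A = np.array([[1.0, 2.0], [-0.5, 1.0]])
Sigma0 = np.array([[1.5, 0.3], [0.3, 1.0]])
P, Q = generalized_polar(A, Sigma0)
print(np.allclose(P @ Q, A))                       # A = P Q
print(np.allclose(Q @ Sigma0 @ Q.T, Sigma0))       # Q in O(n, Sigma0)
print(np.allclose(P, P.T))                         # P symmetric positive definite
```

For Σ0 = I this reduces to the classical polar decomposition with Q orthogonal.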

2.4.1.
Vertical gradient flow.
Let ∇G denote the gradient with respect to the metric (33) and let Pr_Ver denote the orthogonal projection onto the vertical distribution (37). With J as in Problem 3, we are then interested in the constrained gradient flow Ḃ = Pr_Ver(∇G J(B)). An equilibrium of this flow is obtained at the symmetric matrix P in the polar decomposition A = P Q.
Since the horizontal distribution is given by (38), it follows that the vertical gradient flow takes the form (41), where S ∈ S(n) is a Lagrange multiplier and Σ1 = AΣ0Aᵀ.

Remark 3.
In order for the flow (41) to be able to reach P in Theorem 2.7 it is necessary that the initial data A belong to the identity component of GL(n). A future topic (see §4.1.3) is to find minimal conditions under which the flow converges to P .
By construction, Ḃ ∈ Ver_B. Multiplying (41) from the right by B⁻¹ and using from (37) that ḂB⁻¹ ∈ o(n, Σ1), we obtain a continuous Lyapunov equation for the Lagrange multiplier S.

It is possible to formulate the equations in the right reduced variable Ω := ḂB⁻¹, without using Lagrange multipliers. Indeed, multiplying (41) from the right by B⁻¹, subtracting the transpose of the whole equation, and using the characterization (37) of Ver, we obtain the alternative form (42) of the gradient flow (41). Notice that the inverse matrices appearing in (42) can be computed starting from A⁻¹ (the inverse of the initial data).

Example 1. We give here an explicit example of a vertical gradient flow. Take θ = π/3 and Σ0 = I. The matrix Σ1 is then given by Σ1 = AΣ0Aᵀ = AAᵀ. We discretize the vertical gradient flow (42) in time by the Lie-Euler method (cf. [21]), where Ωk is computed from Bk by solving the Sylvester equation in (42) using the Bartels-Stewart algorithm [7]. We use Δt = 0.1 as time-step.
The evolution of the matrix elements is shown in Figure 2; B starts at A and converges towards P∞. The convergence in squared Riemannian distance is shown in Figure 3; it appears to be exponential. To give a full explanation of the rapid convergence rate observed here is an interesting future topic (see § 4 below).
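The time discretization described in Example 1 combines a Sylvester solve with a Lie-Euler update B_{k+1} = exp(Δt Ωk)Bk. Since the explicit right-hand side of equation (42) is not reproduced above, the matrices M and R below are hypothetical stand-ins chosen only to illustrate the structure: with M symmetric positive definite and R skew-symmetric, the Bartels-Stewart solve returns a skew-symmetric Ω.

```python
import numpy as np
from scipy.linalg import expm, solve_sylvester

def lie_euler_step(B, Omega, dt):
    """One Lie-Euler step for a flow of the form Bdot = Omega B."""
    return expm(dt * Omega) @ B

rng = np.random.default_rng(2)
B = rng.standard_normal((3, 3)) + 3 * np.eye(3)
C = rng.standard_normal((3, 3))
Sigma1 = C @ C.T + np.eye(3)        # stand-in data (not the matrices of Example 1)

M = B @ B.T                         # symmetric positive definite coefficient
R = Sigma1 @ M - M @ Sigma1         # skew-symmetric right-hand side
Omega = solve_sylvester(M, M, R)    # solves M @ Omega + Omega @ M = R
                                    # (Bartels-Stewart under the hood)

assert np.allclose(M @ Omega + Omega @ M, R)
assert np.allclose(Omega, -Omega.T)  # skew, by uniqueness of the solution
B_next = lie_euler_step(B, Omega, 0.1)
```

Since exp(Δt Ω) is orthogonal for skew Ω, the update acts on B by rotation, which is consistent with a flow moving along a fiber.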

Entropy Gradient Flow.
Here we consider the analogue of the entropy gradient flow (21). To this end, note that the relative entropy functional (20) restricted to N_n ≅ P(n) is given by (44). To see this, let µ = p(·, Σ) and µ1 = p(·, Σ1) with Σ, Σ1 ∈ P(n), and compute the density quotient between µ and µ1.
From (20) we now get the expression (44), where the last equality follows from the same calculation as in (29). This proves the formula (44) for H(Σ).
The gradient flow Σ̇ = ∇ḠH(Σ) is therefore given by (45). By construction, this is the restriction to Gaussian distributions of the infinite-dimensional gradient flow (21). As in (21), the flow strives toward the maximum of the relative entropy H(Σ), which occurs at Σ = Σ1.
We shall now give a result on the convergence of (45). The essential ingredient is the following result on convexity of minus the relative entropy functional.
Lemma 2.8. The Hessian of the relative entropy function (44) with respect to the Riemannian metric (39) fulfills the following inequality: there exists α > 0 such that minus the Hessian is bounded below by α times the metric.

We postpone the proof of this result until the next section: to compute the Hessian of H it is easier to first lift it to a function F = H ∘ π on GL(n), then compute the Hessian, and then restrict it to the horizontal distribution.
A consequence of Lemma 2.8 is the following result.

Theorem 2.9. For any initial data Σ(0) ∈ P(n), the entropy gradient flow (45) converges exponentially fast towards the maximum Σ1 of the relative entropy H(Σ).

Lifted Gradient Flow.
The idea is now to lift the relative entropy functional (44) to F = H ∘ π on GL(n) and consider the gradient flow of F restricted to the polar cone K♦. By showing that −F is convex on K♦ we can thereby prove that the flow has a unique limit, which, as we shall see, implies that the mapping in Lemma 2.6 is an isomorphism. In addition, the lifted gradient flow provides a dynamical method for computing P in the polar decomposition A = P Q, or, equivalently, the solution to Problem 3. The relative entropy lifted to GL(n) is given by (46).

Proof. Given the expression (46) for F, a direct computation gives its gradient with respect to G, stated in (48); the corresponding gradient flow is (49). Before we continue, let us make a few remarks.
• Because F is lifted from a function on P(n), it is constant on the fibers: if A, B ∈ GL(n) and π(A) = π(B), then F(A) = F(B). Its gradient is therefore orthogonal to the fibers, i.e., ∇G F(A) is in the horizontal distribution Hor_A. From the characterization (38) of Hor, this means that ∇G F(A) A⁻¹ is a symmetric matrix. We encourage the reader to verify this from the expression (48).

We aim to restrict the gradient flow (49) to the polar cone K♦. Recall from Lemma 2.5 that the polar cone K♦ consists of all positive definite symmetric matrices. At the identity, it coincides with the horizontal distribution: T_I K♦ = Hor_I = S(n).
However, at points away from the identity this is not true; for one thing, the horizontal distribution is not integrable, i.e., it does not define a submanifold of GL(n). Thus, if P ∈ K♦ then ∇G F(P) is typically not an element of T_P K♦. Consequently, the polar cone K♦ is not invariant under the gradient flow (49). Indeed, from (48) we immediately see that for a generic positive definite symmetric matrix P, the vector ∇G F(P) fails to be a symmetric matrix. Another way to understand this is to observe that the fibers generally do not cut the polar cone orthogonally (they cut the horizontal distribution orthogonally by definition). Now, the gradient of F restricted to K♦ is given by (50), where Π_P is the orthogonal projection onto T_P K♦. (The gradient on a submanifold is the projection of the gradient on the ambient Riemannian manifold.) Thus, we need to know the orthogonal complement of the polar cone.
Lemma 2.10. Let P ∈ K♦. Then the orthogonal complement N_P K♦ of T_P K♦ inside T_P GL(n), with respect to the Riemannian metric (33), consists of those V for which VΣ0 is skew-symmetric.

Proof. By Lemma 2.5, T_P K♦ = S(n). The result now follows from (33), since the orthogonal complement of S(n) with respect to the Frobenius inner product consists of skew-symmetric matrices.
To compute (50) from (48) we therefore need to find V ∈ N_P K♦ such that ∇G F(P) − V is tangent to K♦; this yields the Sylvester equation (51). Finally, we thereby obtain the gradient flow on K♦ as (52), where V is the solution to the Sylvester equation (51). It is worth pointing out that if Σ0 = I, then we obtain the simple equation (53).

Example 2. Let us give a simple example of how the polar decomposition can be numerically computed by solving the lifted gradient flow (52). We use the same data as in Example 1, with θ = π/3. Our objective is to compute the polar decomposition of A, with Σ0 = I. Since P∞ ∈ K♦ and Q ∈ O(n, I), we know, by construction, the polar components of A. We first set Σ1 = AAᵀ. The lifted gradient flow on K♦ is then given by (53), with initial data P(0) = I.
We discretize the equation by the classical 4th order Runge-Kutta method [17, § 322], with time-step Δt = 0.1. The evolution of the matrix elements is shown in Figure 4; P starts at the identity and converges towards P∞. The rate of convergence is shown in Figure 5; both quantities −F(P(t)) and d²(P(t), P∞) converge exponentially to zero as t → ∞. We shall now give theoretical results that fully explain these numerical observations.
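Since equation (53) is not reproduced above, here is a hedged sketch of such a lifted flow for Σ0 = I: computing the Frobenius gradient of F(P) = H(P²) over symmetric P gives the ascent flow Ṗ = P⁻¹ − (Σ1⁻¹P + PΣ1⁻¹)/2, whose unique equilibrium in K♦ is P∞ = Σ1^(1/2), the symmetric polar factor (whether this matches (53) exactly depends on the normalization of the metric (33)). The matrices P∞ and Q∞ below are illustrative choices, not those of the paper's examples.

```python
import numpy as np
from scipy.linalg import sqrtm

def grad_F(P, Sigma1_inv):
    """Frobenius gradient of F(P) = H(P @ P) over symmetric P (see lead-in)."""
    return np.linalg.inv(P) - 0.5 * (Sigma1_inv @ P + P @ Sigma1_inv)

def rk4_step(P, dt, Sigma1_inv):
    """One classical 4th order Runge-Kutta step for Pdot = grad_F(P)."""
    k1 = grad_F(P, Sigma1_inv)
    k2 = grad_F(P + 0.5 * dt * k1, Sigma1_inv)
    k3 = grad_F(P + 0.5 * dt * k2, Sigma1_inv)
    k4 = grad_F(P + dt * k3, Sigma1_inv)
    return P + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

theta = np.pi / 3
Q_inf = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])   # illustrative rotation
P_inf = np.array([[2.0, 1.0], [1.0, 2.0]])            # illustrative SPD factor
A = P_inf @ Q_inf                                     # so that A = P_inf Q_inf
Sigma1_inv = np.linalg.inv(A @ A.T)

P = np.eye(2)                    # start the flow at the identity
for _ in range(1000):            # integrate up to t = 100 with dt = 0.1
    P = rk4_step(P, 0.1, Sigma1_inv)

assert np.allclose(P, np.real(sqrtm(A @ A.T)), atol=1e-6)
assert np.allclose(P, P_inf, atol=1e-6)   # sqrtm(A A^T) = P_inf here
```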
Lemma 2.11. The Hessian of the lifted relative entropy functional (46) with respect to the Riemannian metric (33) is given by (55).

Proof. Let γ(t) be a geodesic curve with γ(0) = A and γ̇(0) = Ȧ. From (4) and Theorem 2.4 we get that γ(t) = A + tȦ. The Hessian of F at A is a bilinear form on T_A GL(n). Applied to Ȧ, it is given by the second derivative of t ↦ F(γ(t)) at t = 0.
Differentiating once more and using that (d/dt)γ̇(t) = 0, we obtain the expression (55), which proves the result.
With formula (55) at hand, it is now straightforward to give a proof of Lemma 2.8.
Proof of Lemma 2.8. Since the projection π is a Riemannian submersion, it follows from general results in Riemannian geometry (see [55, Ch. XIV § 4]) that the Hessian of the lifted relative entropy on GL(n), restricted to the horizontal distribution, coincides with the Hessian of the relative entropy on P(n); this is the content of (56). Now, for (Σ, Σ̇) ∈ T P(n), take A ∈ π⁻¹(Σ) and Ȧ ∈ Hor_A such that Dπ(A) · Ȧ = Σ̇. Then (56) and Lemma 2.11 give the expression (57). Since Ȧ is horizontal, it follows from the characterization (38) that ȦA⁻¹ is symmetric. Therefore we obtain (58), where ‖·‖_F denotes the Frobenius norm. Next, since Σ1⁻¹ is a symmetric, positive definite matrix, we can use the Cholesky factorization to obtain Σ1⁻¹ = LLᵀ, where L is a lower triangular matrix with positive entries on the diagonal (a geometric description of the Cholesky factorization is given in § 3.3). Then the estimate (59) holds, where α is given by (60); that α > 0 follows since L is non-degenerate. Combining (58) and (59) with (57) gives the result.
We shall now prove existence and uniqueness of a limit for the lifted gradient flow (52). Again, the key is to give a positive bound on minus the Hessian of F restricted to K♦. However, we cannot directly use Lemma 2.8, since the tangent spaces of K♦ are not horizontal (equation (56) cannot be used). Nevertheless, we have the following result.

Theorem 2.12. Minus the Hessian of F, restricted to T_P K♦ for P ∈ K♦, is bounded from below by a positive constant; in particular, −F is strictly convex on K♦.
Proof. From Lemma 2.8 we obtain the splitting (61) of the Hessian into two terms. As in (59), the second term is estimated from below with the same L as in (60), but with the infimum now taken over T_P K♦ instead of Hor_A. For the first term of (61) we cannot immediately say that it is positive, since Ṗ is not necessarily horizontal, so ṖP⁻¹ is in general not a symmetric matrix. We can, however, use that P itself is symmetric positive definite. Indeed, let P⁻¹ = ZᵀZ be the Cholesky factorization of P⁻¹ and set S := ZṖZᵀ, so that tr(ṖP⁻¹ṖP⁻¹) = tr(SS). Since Ṗ is a symmetric matrix, it follows that S is symmetric. Therefore, tr(ṖP⁻¹ṖP⁻¹) = tr(SS) = tr(SᵀS) = ‖S‖²_F ≥ 0. This concludes the proof.
Recall that the tangent vectors of K♦ are not necessarily horizontal. Nevertheless, we have the following result.

Theorem 2.13. The tangent bundle T K♦ and the vertical distribution are transversal: T_P K♦ ∩ Ver_P = {0} for every P ∈ K♦.
Proof. Let Ṗ ∈ T_P K♦ ∩ Ver_P. Since X ∈ Hor_P if and only if X = SP for some S ∈ S(n), the condition Ṗ ∈ Ver_P means that Ṗ is G-orthogonal to SP for every S ∈ S(n). Let LLᵀ be the Cholesky factorization of Σ0. Taking S = L S̄ Lᵀ for S̄ ∈ S(n), we can reformulate the condition as tr(LᵀṖL S̄ LᵀPL) = 0 for all S̄ ∈ S(n). Taking S̄ = LᵀṖL and using that L is non-degenerate and LᵀPL is symmetric positive definite, we get Ṗ = 0. This concludes the proof.
Theorem 2.12 and Theorem 2.13 imply the following result, explaining the observed convergence in Example 2.

Theorem 2.14. The lifted relative entropy F restricted to K♦ admits a unique maximum P∞ ∈ K♦, fulfilling π(P∞) = Σ1 and F(P∞) = 0, and for any initial data P(0) ∈ K♦ the lifted gradient flow (52) converges exponentially fast towards P∞.
Proof. F is strictly concave by Theorem 2.12. Since the set K♦ is convex (Lemma 2.5), it follows that F admits a unique maximum P∞ ∈ K♦.
From general results on gradient flows on Riemannian manifolds (see [65, § 3.5] for details), it also follows from Theorem 2.12 that the flow converges exponentially fast towards P∞. We need to show that π(P∞) = Σ1 and F(P∞) = 0. For this, we use that Π_{P∞} ∇G F(P∞) = 0 which, by Theorem 2.13, implies ∇G F(P∞) = 0. Since H is strictly concave with respect to Ḡ (Lemma 2.8), the last equality implies that π(P∞) must give the maximum value of the relative entropy. Thus, F(P∞) = H(π(P∞)) = 0 and π(P∞) = Σ1. This concludes the proof.
We are now finally ready to give a geometric proof of Lemma 2.6, based on the existence and uniqueness of the limit in Theorem 2.14.

Geometric proof of Lemma 2.6. First we show that π|_{K♦} is surjective. Let Σ1 ∈ P(n) and consider the lifted relative entropy gradient flow (52). By Theorem 2.14 the limit of this flow gives an element P∞ ∈ K♦ such that π(P∞) = Σ1. Thus, π|_{K♦} is surjective.
Next, we show that π|_{K♦} is injective. Suppose P ∈ K♦ also fulfills π(P) = Σ1. Then F(P) = H(Σ1) = 0, so P is a maximum of F restricted to K♦. This implies that Π_P ∇G F(P) = 0. Since the flow (52) is Ṗ = Π_P ∇G F(P), taking initial data P(0) = P it follows that the limit as t → ∞ is given by P. Thus P = P∞, which proves that π|_{K♦} is injective.
3. Fisher-Rao geometry and matrix decompositions. In this section we consider the same basic setting as in § 2 but with respect to a different Riemannian structure, namely the Fisher-Rao metric. Whereas the Wasserstein metric is rooted in OMT, the Fisher-Rao metric originates from information geometry, a branch of statistics that combines information theory and differential geometry. Let us now give a brief introduction to both finite and infinite-dimensional information geometry. For details, we refer to the monograph by Amari and Nagaoka [3] (finite dimension) and the work by Khesin, Lenells, Misiolek, and Preston [51] (infinite dimension). Aspects of information geometry and multivariate Gaussian distributions different from those presented here are given by Barbaresco [6].
Consider a probability distribution function x ↦ p(x, θ) depending on parameters θ = (θ¹, ..., θᵏ), for example, a multivariate Gaussian distribution depending on the covariance as discussed earlier. Fisher's information matrix [37] is given by

I_ij(θ) = ∫ (∂ log p(x, θ)/∂θⁱ)(∂ log p(x, θ)/∂θʲ) p(x, θ) dx.

It measures the information about θ carried by a random variable with probability distribution p(·, θ) dx. Rao [69] interpreted I_ij(θ) as a Riemannian metric on the "manifold" of probability distributions parameterized by θ. This Fisher-Rao metric is invariant under changes of coordinates θ ↦ θ̃, which may appear curious at this stage. However, the invariance becomes perfectly transparent when turning to the infinite-dimensional space Dens(Rⁿ) of all probability distributions on Rⁿ, an approach pursued by Friedrich [40].
Notice something curious here: the Fisher-Rao metric is defined without using the Euclidean structure of Rⁿ. (In contrast, the Wasserstein metric (11) uses the Euclidean structure through the gradient and divergence operators.) As a consequence, it is invariant under arbitrary changes of coordinates, or, equivalently, under the pullback action of Diff(Rⁿ); explicitly, the invariance follows by a change of variables in the integral. What is then the connection to Rao's original finite-dimensional metric given by the Fisher information matrix? The answer is provided by the following result (Proposition 1): on a finite-dimensional statistical submanifold parameterized by θ¹, ..., θᵏ, the Fisher-Rao metric is given by the Fisher information matrix.

Proof. The expression g_ij(θ) for the Fisher-Rao metric in the local coordinates θ¹, ..., θᵏ reduces, by the chain rule, to I_ij(θ). This concludes the proof.
In this paper the primary example of a statistical manifold is, of course, the space of multivariate Gaussian distributions N_n ≅ P(n). Let us now discuss this example in more detail.
When dealing with the Fisher-Rao metric on N_n it is convenient to use a different parameterization: instead of the covariance matrix Σ we use its inverse W := Σ⁻¹. The underlying reason is that the principal bundle structure associated with the Fisher-Rao geometry is based on a right action (pullback) instead of a left action (pushforward) as in the Wasserstein geometry. Thus, the probability density function associated with W ∈ P(n) is given by (64).

Lemma 3.3. The Fisher-Rao metric on N_n ≅ P(n) is given by (65).

We shall prove this result in two ways: first by direct calculation, and then indirectly, by using the geometric invariance property, which gives the result up to multiplication by a scalar.
Direct proof of Lemma 3.3. First, we rewrite (64) in terms of the components w_ij of W. Computing the derivatives of log p with respect to w_ij (using Einstein notation), it follows from Proposition 1 that the metric tensor can be expressed in terms of fourth order moments. Using Isserlis' theorem in statistics, the fourth order moments E[x_i x_j x_k x_l] can be evaluated, and using also that W, U, V are symmetric matrices, the resulting expression reduces to (65). This proves the result.
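Isserlis' theorem for a centered Gaussian states that E[x_i x_j x_k x_l] = Σ_ij Σ_kl + Σ_ik Σ_jl + Σ_il Σ_jk. A quick Monte Carlo sanity check of one such moment (fixed seed, loose tolerance; the covariance matrix is an arbitrary choice):

```python
import numpy as np

# Check E[x_0^2 x_1^2] = S00*S11 + 2*S01^2 (Isserlis' theorem) by sampling.
Sigma = np.array([[2.0, 0.7], [0.7, 1.0]])
rng = np.random.default_rng(0)
x = rng.multivariate_normal(np.zeros(2), Sigma, size=400_000)

mc = np.mean(x[:, 0] ** 2 * x[:, 1] ** 2)            # Monte Carlo estimate
exact = Sigma[0, 0] * Sigma[1, 1] + 2.0 * Sigma[0, 1] ** 2
assert abs(mc - exact) < 0.1                         # loose statistical tolerance
```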
Indirect proof of Lemma 3.3. The action of A ∈ GL(n) lifted to act on (W, U) ∈ T P(n) is given by (W, U) · A = (AᵀW A, AᵀU A) (see (71) below). Using the cyclic property of the trace, tr(ABC) = tr(BCA), a direct computation shows that the right-hand side of (65) is unchanged under this action. The metric Ḡ is therefore invariant. From the classical uniqueness theorem by Cencov [22] it then follows that Ḡ is the Fisher-Rao metric up to multiplication by a positive scalar.
Of course, once a Riemannian metric is given, a succession of natural questions follows: the geodesic equation, a formula for geodesics, a formula for the distance, and a formula for the sectional curvature. For P(n) equipped with the Fisher-Rao metric, these questions have been addressed in detail [5,16,15,75,70,20]. Here, we give only a brief discussion.
Putting the last expression to zero and using the fundamental lemma of the calculus of variations, we get, after multiplying from the left and right by W, the geodesic equation (66): Ẅ = Ẇ W⁻¹ Ẇ. Next, we consider solutions to the geodesic equation (66). We treat here only the case where the initial data is the identity I (due to the invariance, each geodesic can be shifted, by the action of GL(n), to this case). We claim that the solution is (67): W(t) = exp(tẆ0), where Ẇ0 ∈ T_I P(n) = S(n) is the initial velocity and exp denotes the matrix exponential. Let us now verify this. We have Ẇ(t) = Ẇ0 exp(tẆ0) and Ẅ(t) = Ẇ0 Ẇ0 exp(tẆ0), so (66) is fulfilled; notice, in particular, that W(t)⁻¹ commutes with Ẇ(t) for all t. In § 3.4 below we shall give a geometric explanation of this observation.
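The solution formula (67) can be checked numerically. The sketch below (with an arbitrary SPD endpoint) uses the initial velocity Ẇ0 = log W1, so the geodesic reaches W1 at t = 1, and also illustrates that its midpoint is the symmetric square root of W1:

```python
import numpy as np
from scipy.linalg import expm, logm, sqrtm

def geodesic_from_identity(Wdot0, t):
    """The Fisher-Rao geodesic (67): W(t) = exp(t * Wdot0), W(0) = I."""
    return np.real(expm(t * Wdot0))

W1 = np.array([[2.0, 0.5], [0.5, 1.0]])   # arbitrary SPD endpoint
Wdot0 = np.real(logm(W1))                 # initial velocity reaching W1 at t = 1

assert np.allclose(geodesic_from_identity(Wdot0, 0.0), np.eye(2))
assert np.allclose(geodesic_from_identity(Wdot0, 1.0), W1)
assert np.allclose(geodesic_from_identity(Wdot0, 0.5), np.real(sqrtm(W1)))

# W(t)^{-1} commutes with Wdot(t) = Wdot0 exp(t Wdot0), as observed in the text
t = 0.7
W = geodesic_from_identity(Wdot0, t)
Wdot = Wdot0 @ W
assert np.allclose(np.linalg.inv(W) @ Wdot, Wdot @ np.linalg.inv(W))
```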
Therefore, it is enough to derive the distance from the identity to an element W1 ∈ P(n). By definition, it is given by the length of the geodesic curve γ(t) between I and W1. From (67) it follows that γ(t) = exp(t log W1), where log denotes the matrix logarithm. We thereby get an explicit expression for the distance in terms of log W1, and the invariance then yields the general distance formula (69).

3.1. Principal bundle structure. So far we have that the statistical manifold of multivariate Gaussian distributions N_n is identified with P(n) by (64), and the Fisher-Rao metric represented on P(n) is given by (65). Let us now discuss the connection to matrix decompositions. Recall from § 2 that a pivotal step in describing the geometry of the polar decomposition is to have (i) a principal bundle structure (to get fibers), and (ii) a Riemannian metric compatible with that bundle (to get horizontal geodesics). The setting in this section follows the same pattern. There are, however, some structural differences.
If µ0 = p(·, W0) dx is a Gaussian distribution and A ∈ GL(n), then the pullback of µ0 by the linear map x ↦ Ax is again a Gaussian distribution. Thus, the action of A ∈ GL(n) on W ∈ P(n) is W · A = AᵀW A, and the isotropy group of W consists of those A ∈ GL(n) fulfilling AᵀW A = W. We now see that the finite-dimensional principal bundle corresponding to (70) is (72), with projection (73) given by π(A) = AᵀW0 A. The corresponding vertical distribution is the kernel of Dπ.

Remark 5.
It is possible to describe the Fisher-Rao geometry using the pushforward bundle structure as in § 2, but that reverses the order of the elements in the matrix decompositions discussed below. Also, the standard convention in the literature is to consider Fisher-Rao in a right-invariant setting, as in this section.
The second structural difference is the following. In Wasserstein geometry we start with a natural metric on Diff(Rⁿ) (or GL(n)) and we show that it induces a metric on Dens(Rⁿ) (or P(n)). In Fisher-Rao geometry the situation is the reverse: we start with a metric on Dens(Rⁿ) (or P(n)), so we need to construct a compatible metric on Diff(Rⁿ) (or GL(n)). Such a metric is heavily constrained: it must have left-invariant properties in order to descend with respect to the principal bundle (70) (or (72)), but it must also be right-invariant in order to induce the Fisher-Rao metric on the base space. That such metrics exist is not obvious a priori. But they do exist. Indeed, a two-parameter family is given in [60], giving rise to optimal information transport (OIT), an information-theoretic analogue of OMT where the polar decomposition is replaced by a different factorization of diffeomorphisms that solves the OIT problem. However, in this paper we refrain from discussing OIT, and instead we focus on the finite-dimensional setting, which gives us matrix decompositions. Thus, our next objective is to derive a metric on GL(n) with the desired properties.
3.2. QR (or Iwasawa) decomposition. Following [60, § 5.2], we shall construct a right-invariant Riemannian metric on GL(n) such that the projection (73) with W0 = I becomes a Riemannian submersion with respect to the Fisher-Rao metric on P(n). To this end, consider two projection operators ℓ: gl(n) → gl(n) and σ: gl(n) → gl(n); in words, ℓ(U) selects the strictly lower triangular entries of U. We then define a right-invariant metric G on GL(n) by (74). Throughout this section, the corresponding distance function is denoted d(·, ·). Notice that GL(n) is not connected (it contains two connected components). As is customary for non-connected Riemannian manifolds, we define the distance between two elements of different components to be infinite.
The horizontal distribution is thereby given by the right translates of o(n)⊥. Here, the horizontal distribution fulfills an additional property: since the space of upper triangular matrices o(n)⊥ is a Lie algebra (closed under the commutator bracket), it follows that the horizontal distribution is integrable. That is, it is tangent to the leaves of a foliation of GL(n). As we shall see below in the proof of Theorem 3.7, the leaf that cuts through the identity is given by the Lie group of upper triangular matrices with positive diagonal entries.
Let us now state the relation between the constructed metric G and the Fisher-Rao metric Ḡ: by construction, the projection (73) with W0 = I is a Riemannian submersion with respect to G and Ḡ (Theorem 3.5).
Recall the objective of this section: a geometric description of the QR factorization of matrices. For any A ∈ GL(n), we aim to construct, geometrically, a Q ∈ O(n) and an upper triangular matrix R such that A = QR. The following central concept is analogous to the polar cone in § 2.
Definition 3.6. The upper triangular cone K ⊂ GL(n) is the set of all elements in GL(n) whose closest point on the identity fiber π⁻¹(I) = O(n) is the identity. The connection to the QR decomposition is established by the following result (Theorem 3.7), which shows that T_R K = Hor_R and that K consists exactly of the upper triangular matrices with positive diagonal entries, thereby motivating the name "upper triangular cone".

Proof. The horizontal distribution Hor is right-invariant and given at the identity by the upper triangular matrices o(n)⊥. But the space of upper triangular matrices is a Lie algebra g. Thus, the horizontal distribution is integrable, and the corresponding Lie group G connected to the identity is given exactly by the upper triangular matrices with positive diagonal entries. This proves that every element in K is upper triangular with positive diagonal entries. That every element in G is also an element in K follows from Theorem 3.5, since any two elements in P(n) are connected by a unique Fisher-Rao geodesic.
As expected, the upper triangular cone K gives us a subset of GL(n) that is in one-to-one relation with P(n); this is the content of the following result (Theorem 3.8): the mapping π|_K : K → P(n) is an isomorphism, indeed an isometry.
Proof. Whereas π being an isomorphism is clear from basic linear algebra (using, for example, the spectral decomposition), we shall give another, geometric proof that is independent of results in linear algebra.
First, surjectivity follows from Theorem 3.5, since any element in P(n) can be connected to I by a minimal geodesic, which is then lifted to a curve in K.
Let us now prove injectivity. Let R1, R2 ∈ K and assume that π(R1) = π(R2). Since K is integrable, it follows that A = R1R2⁻¹ ∈ K. Thus, there is a shortest horizontal geodesic γ(t) connecting I with A. Since π is a Riemannian submersion, we get a corresponding shortest geodesic γ̄(t) on P(n) between I and π(A). But, since R1 and R2 belong to the same fiber, we have π(A) = I. Therefore, γ̄(t) = I for all t, so γ̇(t) ∈ Ver for all t. But we also have γ̇(t) ∈ Hor. Since Ver and Hor are orthogonal, this implies γ̇(t) = 0. In turn, γ(t) = I for all t, so R1 = R2. This proves injectivity.
Finally, that π is an isometry follows directly from the fact that the horizontal distribution is integrable and T_R K = Hor_R, so any geodesic in K descends to a geodesic on P(n) since π is a Riemannian submersion.

The QR decomposition now follows.

Theorem 3.9 (QR decomposition). Let A ∈ GL(n). Then there exist a unique Q ∈ O(n) and a unique upper triangular matrix R with positive diagonal entries such that A = QR.

Proof. Let W = π(A). From Theorem 3.8 it follows that there is a unique R ∈ K such that π(R) = W. By construction, R and A belong to the same fiber, so Q := AR⁻¹ ∈ O(n). Q is unique since A is invertible and R is unique. That R is upper triangular with positive diagonal entries follows from Theorem 3.7.
3.2.1. Entropy Gradient Flow. Recall § 2.4, where we considered the entropy gradient flow on P(n) with respect to the Wasserstein metric. Here, we shall again derive the entropy gradient flow on P(n), but now with respect to the Fisher-Rao metric.
First, from (44) it follows that the relative entropy functional in the variable W is obtained by substituting Σ = W⁻¹. Thus, the gradient flow Ẇ = ∇ḠH(W) is given by (75), which, of course, has W1 as limit for any initial data. The exponential convergence of (75) towards W1 is not a coincidence; there is geometry concealed here as well. Namely, the Fisher-Rao metric is a Hessian metric, i.e., it is given by the Hessian of a convex function. The convex function is, in fact, minus the relative entropy functional (we encourage the reader to check this). Thus, (76) holds, so, by the same standard technique as in Theorem 2.9, we obtain a geometric proof of the exponential convergence of (75) towards W1 (which, of course, we already knew from basic linear ODE theory). For more details on the Hessian structure of Fisher-Rao we refer to Shima [74, Ch. 6]. Gradient flows of Hessian metrics, in particular in the non-smooth case, are discussed by Alvarez, Bolte, and Brahic [2].
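As a hedged illustration of the remark about linear ODE theory: under the assumption that the metric is normalized so that ∇ḠH(W) = W1 − W (the constant in (65) may differ), the flow (75) becomes the linear ODE Ẇ = W1 − W with an explicit solution:

```python
import numpy as np

def entropy_flow(W0, W1, t):
    """Closed-form solution of Wdot = W1 - W:  W(t) = W1 + exp(-t) (W0 - W1).
    (Hypothetical normalization; see the lead-in.)"""
    return W1 + np.exp(-t) * (W0 - W1)

W1 = np.array([[3.0, 1.0], [1.0, 2.0]])   # illustrative target precision matrix
W0 = np.eye(2)

# exponential convergence: the error contracts by exactly exp(-1) per unit time
e0 = np.linalg.norm(entropy_flow(W0, W1, 0.0) - W1)
e1 = np.linalg.norm(entropy_flow(W0, W1, 1.0) - W1)
e2 = np.linalg.norm(entropy_flow(W0, W1, 2.0) - W1)
assert np.isclose(e1 / e0, np.exp(-1.0))
assert np.isclose(e2 / e1, np.exp(-1.0))
```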

Lifted Gradient Flow.
In order to recover R in the QR decomposition by a horizontal gradient flow, we need to lift the flow (75) to the upper triangular cone K. Since the horizontal distribution is integrable, the situation is much simpler than for the corresponding lifted gradient flow in the Wasserstein geometry, treated in § 2.4.3. Indeed, first define the lifted relative entropy functional F on GL(n) by (77). Because F is constant on the fibers (by construction), and because T_R K = Hor_R, we automatically have that ∇G F(R) ∈ T_R K for any R ∈ K. Thus, we never have to project onto the cone as for the Wasserstein geometry in § 2.4.3. Furthermore, since π is a Riemannian submersion, the gradient flow (78) on K is in one-to-one relation with the entropy gradient flow (75). In other words, γ(t) is an integral curve of (78) if and only if γ̄(t) := π(γ(t)) is an integral curve of (75). We shall use this relation to derive the flow. Since Dπ(R) · Ṙ = ṘᵀR + RᵀṘ, we get from (75) a flow for R ∈ K. From [49] the equation for Ṙ can be rewritten in terms of a skew-symmetric matrix Z, which we interpret as a Lagrange multiplier ensuring that Ṙ is upper triangular. Multiplying from the right by R⁻¹ and using that ṘR⁻¹ is upper triangular, it follows that Z is determined explicitly. Using that R⁻ᵀW1R⁻¹ − I is a symmetric matrix, the lifted gradient flow (78) then becomes (79), where u is the operator on matrices that selects the upper triangular entries, u := id − ℓ with ℓ selecting the strictly lower triangular entries, and d is the operator that selects the diagonal entries.
Theorem 3.10. The lifted entropy functional (77) restricted to K admits a unique maximum R∞ ∈ K; it fulfills π(R∞) = W1. Furthermore, for any initial data R(0) ∈ K, the lifted gradient flow (79) converges towards R∞ as t → ∞, with exponential estimates in the distance function d(·, ·) of the metric (74).
Proof. From general results on gradient flows on Riemannian manifolds (see [65, § 3.5] for details), it follows from (76) that the entropy gradient flow (75) converges towards the unique maximum W1 with exponential rates, where d̄ denotes the Riemannian distance with respect to the Fisher-Rao metric, given by (69). Since solution curves of the lifted gradient flow on K project to solutions of the entropy gradient flow on P(n), the result in the theorem now follows from Theorem 3.8, as π : K → P(n) is an isometry. This concludes the proof.
Example 3. We give a simple example of how the QR decomposition can be numerically computed by solving the lifted gradient flow (79). Take θ = π/3 and let A := Q∞R∞. Our objective is to compute the QR factorization of A (which, of course, we already know to be Q∞R∞). We first set W1 = π(A) = AᵀA. The lifted gradient flow on K is then given by (79). The initial data is R(0) = I.
We discretize the equation by the classical 4th order Runge-Kutta method [17, § 322], with time-step Δt = 0.1. The evolution of the matrix elements is shown in Figure 6; R starts at the identity and converges towards R∞. The rate of convergence is shown in Figure 7; both quantities −F(R(t)) and d²(R(t), R∞) converge to zero with rate exp(−2t) as t → ∞. These results are fully explained by Theorem 3.10.
3.3. Cholesky decomposition. The Cholesky decomposition is another classical matrix factorization. We now show that it is a direct consequence of the geometry developed in the previous section.
Theorem 3.11 (Cholesky decomposition). Let W ∈ P(n). Then there is a unique lower triangular matrix L with positive entries on the diagonal such that W = LL .
Proof. By Theorem 3.8 there exists a unique R ∈ K such that π(R) = W; written out, this is equation (81): W = RᵀR. Let L = Rᵀ. In terms of L, equation (81) reads W = LLᵀ, which is equation (82). Since R is upper triangular with positive entries on the diagonal, it follows that L is lower triangular with positive entries on the diagonal. Thus, equation (82) is the Cholesky decomposition of W. L is unique since R is unique.
It follows from the geometry that one can use the Cholesky decomposition to obtain the QR decomposition of A ∈ GL(n). Indeed: 1. set W = AᵀA; 2. compute the Cholesky decomposition W = LLᵀ; 3. set R = Lᵀ and compute Q = AR⁻¹.
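The three steps translate directly into code (a sketch; in floating point this route squares the condition number of A through W = AᵀA, so Householder-based QR is preferred for ill-conditioned matrices):

```python
import numpy as np

def qr_via_cholesky(A):
    """QR factorization via the three steps above:
    W = A^T A,  Cholesky W = L L^T,  R = L^T,  Q = A R^{-1}."""
    W = A.T @ A
    L = np.linalg.cholesky(W)     # lower triangular, positive diagonal
    R = L.T                       # upper triangular, positive diagonal
    Q = A @ np.linalg.inv(R)      # in practice: use a triangular solve
    return Q, R

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4)) + 4 * np.eye(4)   # arbitrary invertible matrix
Q, R = qr_via_cholesky(A)
assert np.allclose(Q @ R, A)
assert np.allclose(Q.T @ Q, np.eye(4), atol=1e-6)   # Q orthogonal
assert np.allclose(R, np.triu(R)) and np.all(np.diag(R) > 0)
```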
Furthermore, notice that the lifted entropy gradient flow (79) gives the Cholesky factorization of W1, by taking L = R∞ᵀ.

Spectral decomposition.
In this section we show how the Fisher-Rao geometry of P(n) is related to the spectral decomposition. Our objective is to give a geometric description of the classical result that every W ∈ P(n) can be factorized as W = QΛQᵀ, where Λ is diagonal with positive entries and Q ∈ O(n).
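For reference, this factorization is exactly what a standard symmetric eigensolver computes; a quick check on an arbitrary W ∈ P(3):

```python
import numpy as np

# Spectral decomposition W = Q Lambda Q^T via a symmetric eigensolver.
W = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 0.5],
              [0.0, 0.5, 1.0]])          # diagonally dominant, hence in P(3)
lam, Q = np.linalg.eigh(W)               # ascending eigenvalues, orthonormal Q
assert np.all(lam > 0)                   # positive definiteness
assert np.allclose(Q @ np.diag(lam) @ Q.T, W)
assert np.allclose(Q.T @ Q, np.eye(3))
```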
From the principal bundle structure (72) it follows that P(n) is a homogeneous space with respect to the left action of O(n) on GL(n), i.e., P(n) ≅ O(n)\GL(n). Recall that the Fisher-Rao metric is obtained from the metric G on GL(n) given by (74). The remarkable property of G is that even after exhausting its left-invariant properties, to yield the Fisher-Rao metric on O(n)\GL(n), there are still right-invariant properties available. It is therefore natural to continue exhausting the invariance properties by a second quotient, now from the right; in other words, to study the double quotient space O(n)\GL(n)/O(n) ≅ P(n)/O(n). Roughly speaking, the results in this section are that P(n)/O(n) can be identified with the space of eigenvalues, and that the Fisher-Rao metric induces horizontal directions given by diagonal matrices. This gives rise to the spectral decomposition, much like the situation for the polar and QR decompositions. There is, however, one important difference: the quotient P(n)/O(n) is not a manifold. This complicates things, but we shall avoid getting technical.
We say that W and W̃ belong to the same orbit if there exists a Q ∈ O(n) such that W̃ = QᵀW Q. The set of orbits is therefore the quotient space P(n)/O(n), with the right action given by (83). Contrary to the situation in § 3.2, the action is not free, so the orbits are in general not isomorphic to each other: different orbits may have different dimensions. For example, the orbit of the identity matrix is zero-dimensional, since QᵀIQ = I for any Q ∈ O(n). Consequently, the action (83) does not give rise to a principal bundle as before. The highest-dimensional orbits are those for which the eigenvalues of W are all different; this is the generic case. Such orbits have the same dimension as O(n), namely n(n − 1)/2. We now give a different, more intuitive way to understand the orbits. Elements in P(n) can be thought of as n-dimensional ellipsoids embedded in Rⁿ, and the action of O(n) corresponds to rotations and reflections. Rotation about an axis changes an ellipsoid whose semi-axes are all different. If, however, the ellipsoid has two semi-axes of the same length, then rotations in the plane spanned by these axes do not change it, so the action is singular in such directions. Here is the key point: the orbits themselves represent ellipsoids without reference to orientation. In other words, the space of orbits is the space of ellipsoids. As we shall see later in this section, the Fisher-Rao metric induces a natural metric on the space of ellipsoids; a formal statement, since the quotient P(n)/O(n) has singular points, so it is not a manifold.
The next question is how to work with the space of orbits P(n)/O(n). One possibility is to represent an element in P(n)/O(n) by n strictly positive real numbers: the lengths of the semi-axes of the corresponding ellipsoid. This description, however, is redundant, since any permutation gives the same ellipsoid. Hence, the correct space is (R₊)ⁿ/Sₙ, where the symmetric group Sₙ consists of all permutations of n elements. This shows that (R₊)ⁿ is an n!-fold cover of P(n)/O(n).
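An orbit is determined precisely by the sorted eigenvalues (the semi-axis lengths); a minimal check, with `ortho_group` supplying a random orthogonal matrix:

```python
import numpy as np
from scipy.stats import ortho_group

# Conjugating W by an orthogonal Q preserves the sorted eigenvalue list,
# and any permutation of the eigenvalues gives the same orbit (ellipsoid).
W = np.diag([5.0, 1.0, 2.0])
Q = ortho_group.rvs(3, random_state=0)          # a random orthogonal matrix
assert np.allclose(Q.T @ Q, np.eye(3))
assert np.allclose(np.linalg.eigvalsh(Q.T @ W @ Q),   # ascending order
                   np.sort([5.0, 1.0, 2.0]))
```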
There is also another, more explicit way to work with P(n)/O(n): it is isomorphic to the set of monic polynomials with positive roots,

poly^+_n = { p(λ) = λ^n + c_{n−1} λ^{n−1} + · · · + c_0 : all roots of p are real and strictly positive }.

The projection χ : P(n) → poly^+_n is given by the characteristic polynomial

χ(W) = det(λ I − W).

We have already seen that the metric G on GL(n) descends to the Fisher-Rao metric on P(n). Since the Fisher-Rao metric on P(n) is invariant with respect to the right action (83), it formally descends to a "metric" on poly^+_n (recall that poly^+_n is not a manifold). Let us now derive what it is.
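The invariance of the characteristic polynomial on orbits is easy to check numerically; the following sketch (an illustration, not part of the original text) uses numpy's np.poly, which returns the monic coefficients of det(λI − W):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
W = A @ A.T + 4 * np.eye(4)                       # a point of P(4)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # a point of O(4)

def chi(M):
    # coefficients of the monic characteristic polynomial det(lambda*I - M)
    return np.poly(M)

# chi is constant on orbits: chi(Q^T W Q) = chi(W).
print(np.allclose(chi(Q.T @ W @ Q), chi(W)))  # True
```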
To start, we need the vertical directions of the second projection χ, i.e., the kernel of Dχ. From (83) we see that the infinitesimal action of ξ ∈ o(n) on W ∈ P(n) is

W ↦ ξ^T W + W ξ = [W, ξ].

Thus, the vertical directions at W are given by

(84) Ver_W = { [W, ξ] : ξ ∈ o(n) }.

Let us now derive the horizontal directions, i.e., the directions orthogonal to Ver_W. We have that

Ḡ_W(S, [W, ξ]) = (1/2) tr( (W^{−1} S − S W^{−1}) ξ ).

The condition on S for this expression to vanish for any ξ ∈ o(n) is that −S W^{−1} + W^{−1} S is a symmetric matrix. Since this matrix is automatically antisymmetric, this in turn implies

[S, W^{−1}] = 0.

Thus, the horizontal directions at W are given by

Hor_W = { S ∈ T_W P(n) : [S, W^{−1}] = 0 }.

Consequently, S ∈ Hor_W if and only if S and W^{−1} are simultaneously diagonalizable. Let us now discuss two special cases. First, when W = I. Since any matrix commutes with I^{−1} = I, we get that the horizontal directions at I fill the entire tangent space: Hor_I = T_I P(n). Thus, any geodesic originating from the identity is horizontal. Since an initially horizontal geodesic remains horizontal at all times [48], it follows that if W(t) is a geodesic with W(0) = I, then

[W(t), Ẇ(t)] = 0 for all t.

Thus, as declared, we get a geometric explanation of the observation in Remark 4. Second, when W = Λ is a diagonal matrix with positive entries. Since two diagonal matrices always commute, we immediately get that all diagonal matrices are contained in Hor_Λ. Let D(n) ⊂ P(n) be the submanifold of diagonal matrices with positive entries. Notice that D(n) itself is a Riemannian manifold, as it inherits, by restriction, the Fisher-Rao metric on P(n). The tangent space T_Λ D(n) consists of all diagonal matrices, so T_Λ D(n) ⊂ Hor_Λ. Thus, since initially horizontal geodesics remain horizontal, we get the following result.
Lemma 3.12. D(n) is a totally geodesic submanifold of P(n). That is, if γ(t) is a geodesic in D(n), then γ(t) is also a geodesic in P(n).
This result also follows directly from the geodesic equation (66). We now give a characterization of Hor_Λ for a generic Λ.

Lemma 3.13. Let Λ ∈ D(n) be generic (all its diagonal entries are distinct). Then

Hor_Λ = { S ∈ T_Λ P(n) : S is diagonal } = T_Λ D(n).

That is, Hor_Λ is the space of all diagonal matrices.
Proof. Since Λ is generic, its orbit has the maximal dimension n(n − 1)/2, so Hor_Λ has dimension n(n + 1)/2 − n(n − 1)/2 = n. Since the dimension of T_Λ D(n) is also n and T_Λ D(n) ⊂ Hor_Λ, it follows that Hor_Λ = T_Λ D(n).
From Lemma 3.12 we have that the submanifold D(n) of diagonal matrices with positive entries is totally geodesic in P(n). From our discussion it is also clear that it is tangential to the (non-regular) horizontal bundle Hor, and that Hor_Λ = T_Λ D(n) at generic points (Lemma 3.13). It is therefore natural to think of D(n) as the analogue of the upper triangular cone K (or the polar cone in § 2).
Let us now give the key result leading to the spectral decomposition. It is the analog of Theorem 2.6 and Theorem 3.8, although the result is slightly weaker because of the singular action (we do not have injectivity).

Theorem 3.14. The restriction of χ to D(n) is surjective (but not injective) onto poly^+_n.

Proof. Let p ∈ poly^+_n. Then p has n real positive roots λ_1, . . . , λ_n. The corresponding diagonal matrix Λ ∈ D(n) then fulfills χ(Λ) = p.

Theorem 3.15 (Spectral decomposition). Every W ∈ P(n) can be factorized as W = Q^T Λ Q for some Q ∈ O(n) and Λ ∈ D(n).

Proof. Let p = χ(W). From Theorem 3.14 we get that there exists Λ ∈ D(n) such that χ(Λ) = p. Since Λ and W belong to the same orbit, there exists a Q ∈ O(n) such that W = Q^T Λ Q.

Remark 6. The non-injectivity of χ|_{D(n)} in Theorem 3.14 is the reason that the spectral decomposition is not unique. This non-uniqueness is directly related to (R^+)^n ≅ D(n) being an n!-fold covering of P(n)/O(n) ≅ poly^+_n.

So far, we did not state what the Riemannian metric on poly^+_n is. Strictly speaking we cannot, since poly^+_n is not a manifold (we do not want to get into technical details of Riemannian orbifolds). But since D(n) is a manifold and also an n!-fold covering of poly^+_n, we derive the geodesics on D(n) instead. From Theorem 3.3 it is clear that the Fisher-Rao metric on D(n), expressed in the diagonal entries λ_1, . . . , λ_n, is given by

(85) Ĝ_Λ( (a_1, . . . , a_n), (b_1, . . . , b_n) ) = (1/2) Σ_i a_i b_i / λ_i^2.

For every l > 0 there is a unique geodesic λ(t) such that λ(0) = 1 and λ(1) = l. In consequence, every Λ ∈ D(n) is connected to the identity I by a unique horizontal geodesic.
Since, by Lemma 3.12, D(n) is totally geodesic in P(n), it follows directly from (66) that the geodesic equation is

λ̈_i = λ̇_i^2 / λ_i.

Let us now derive its Hamiltonian form, without using the reference to (66).
The Hamiltonian corresponding to the Lagrangian L(Λ, Λ̇) = Ĝ_Λ(Λ̇, Λ̇) is given by

H(Λ, m) = (1/2) Σ_i λ_i^2 m_i^2,

where m_i = λ̇_i / λ_i^2 are the momentum variables. Since H is separable in the index i (the ith term only depends on λ_i and m_i), it follows that the associated Hamiltonian system decouples into n independent systems, each given by

(86) λ̇ = λ^2 m, ṁ = −λ m^2.

It is straightforward to check that the quantity λm is a first integral. From λ̇ = λ (λm) it then follows that the solution to (86) is

λ(t) = λ(0) e^{λ(0) m(0) t}.

In particular, for every l_0, l_1 > 0, there is a unique geodesic curve λ(t) such that λ(0) = l_0 and λ(1) = l_1. It is given by

λ(t) = l_0 (l_1 / l_0)^t.

We encourage the reader to compare this formula with (68), which corresponds to l_0 = 1. The phase diagram of (86) is illustrated in Figure 8.
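As a sanity check of the Hamiltonian form, one can integrate (86) numerically and compare with the closed-form geodesic; the sketch below (not part of the original text; it assumes the formulas derived above) uses the conserved quantity λm = log(l_1/l_0) to set the initial momentum.

```python
import numpy as np

l0, l1 = 2.0, 5.0
lam = l0
m = np.log(l1 / l0) / l0        # so that lam * m = log(l1 / l0)

# Euler steps for (86): lam' = lam^2 m,  m' = -lam m^2, from t = 0 to t = 1.
dt, steps = 1e-4, 10_000
for _ in range(steps):
    lam += dt * lam**2 * m
    m -= dt * lam * m**2

# The closed form gives lam(1) = l0 * (l1 / l0)^1 = l1.
print(abs(lam - l1) < 1e-2)  # True
```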
Proposition 2. The mapping

D(n) ∋ Λ ↦ (1/√2) (log λ_1, . . . , log λ_n) ∈ R^n

is an isometric isomorphism between D(n), equipped with the Fisher-Rao metric (85), and R^n, equipped with the standard Euclidean structure.

3.4.3.
Vertical (or Isospectral) Flows. Let A(t) be a curve such that χ(A(t)) = χ(A(s)) for all s and t. That is, A(t) is a curve lying in a single orbit of the action (83). Clearly, this implies that the eigenvalues of A(t) are independent of t. Differential equations whose flows are such curves are called isospectral flows. From (84) we see that any isospectral flow must be of the form

Ȧ = [A, Φ(A)],

where Φ : P(n) → o(n) and [·, ·] is the matrix commutator.
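This structure can be exploited numerically: a discrete step of the form A ← Q^T A Q with Q = exp(∆t Φ(A)) stays exactly on the orbit. The following sketch (not part of the original text; the Toda-like choice of Φ is just one admissible example) checks that the spectrum is preserved to machine precision:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
A0 = rng.standard_normal((5, 5))
A = A0 @ A0.T + 5 * np.eye(5)    # symmetric positive definite start

def phi(A):
    # One admissible skew-symmetric choice (Toda-like): strict lower
    # triangle minus its transpose.
    L = np.tril(A, -1)
    return L - L.T

eigs0 = np.sort(np.linalg.eigvalsh(A))

# Discrete steps A <- Q^T A Q with Q = expm(dt * phi(A)) follow the flow
# A' = [A, phi(A)] while staying exactly on the orbit.
dt = 0.01
for _ in range(500):
    Q = expm(dt * phi(A))
    A = Q.T @ A @ Q

print(np.allclose(np.sort(np.linalg.eigvalsh(A)), eigs0))  # True
```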
The most studied case is the Toda flow [80], which describes waves on non-linear lattices. The Hamiltonian integrable structure of this flow and its connection to the KdV equation has been studied extensively, starting with the work of Flaschka [38]; see also [63,34,64]. Furthermore, Symes [78] showed that the Toda flow is a continuous version of the QR algorithm for computing eigenvalues, so there is a strong connection to numerical linear algebra; see also [33,83,23]. For a numerical treatment of general isospectral flows, see Calvo, Iserles, and Zanna [19] and references therein. The Toda flow is known to be a gradient flow [12], but we shall not discuss it further here; we refer to the survey paper by Tomei [81] for details and further references. Instead, we make a connection to the work by Brockett [14], who had the idea to construct isospectral gradient flows that diagonalize matrices.
Brockett's flow is constructed as follows. The orbit of M ∈ P(n) is parametrized by Q^T M Q with Q ∈ O(n). Thus, a flow γ(t) on the Lie group O(n) induces a flow γ(t)^T M γ(t) on the orbit of M. Now, O(n) comes with a canonical Riemannian metric (corresponding to minus the Killing form); at a base point Q ∈ O(n), for vectors U, V ∈ T_Q O(n), it is given by

H_Q(U, V) = (1/2) tr(U^T V).

Brockett constructed a gradient flow on O(n) such that if Q^T M Q is diagonal, then Q is an equilibrium. He came up with the functional

F(Q) = (1/2) tr(Q^T M Q N),

where N ∈ D(n), and showed that the corresponding (negative) gradient flow is

(89) Q̇ = Q [N, Q^T M Q].

Notice that if Q^T M Q is diagonal, then Q is an equilibrium. Thus, such an equilibrium corresponds to a diagonalization of M, i.e., it produces a spectral decomposition of M. We now show how Brockett's flow (89) is related to the Fisher-Rao geometry presented in this paper.
The action of O(n) on W_1 ∈ P(n) is encoded in the map

Γ : O(n) → P(n), Γ(Q) = Q^T W_1 Q.

Its derivative is DΓ(Q) · Q̇ = Q̇^T W_1 Q + Q^T W_1 Q̇. Written in the left translated variable ξ = Q^T Q̇ ∈ o(n), and using that ξ is skew-symmetric, we get

DΓ(Q) · Qξ = [Γ(Q), ξ].

Theorem 3.16. Let F̄ be a smooth functional on P(n) and let F = F̄ ∘ Γ. Then the gradient of F with respect to H is given by

∇F(Q) = Q [∇_Ḡ F̄(Γ(Q)), Γ(Q)^{−1}].

Proof. From the definitions (65) of Ḡ and H we get, for every ξ ∈ o(n),

H_Q(∇F(Q), Qξ) = Ḡ_{Γ(Q)}( ∇_Ḡ F̄(Γ(Q)), [Γ(Q), ξ] ) = H_Q( Q [∇_Ḡ F̄(Γ(Q)), Γ(Q)^{−1}], Qξ ),

where, in the last equality, we used the left invariance of H. This concludes the proof.
Consider now the entropy (44) relative to N ∈ D(n), given on P(n) by

H̄(W) = (1/2) ( log det(W^{−1} N) − tr(W^{−1} N) + n ).

Through pullback by Γ, we then obtain a functional on O(n), given by F(Q) = H̄(Γ(Q)). By Theorem 3.16, the gradient of F with respect to H is given by

∇F(Q) = Q [∇_Ḡ H̄(Γ(Q)), Γ(Q)^{−1}].

From (75) we get ∇_Ḡ H̄(Γ(Q)) = N − Γ(Q).
From Theorem 3.16 it then follows that the pullback entropy gradient flow is

(90) Q̇ = Q [N − Γ(Q), Γ(Q)^{−1}] = Q [N, Γ(Q)^{−1}],

where the last equality follows since [Γ(Q), Γ(Q)^{−1}] = 0. This proves the result.
Comparing the pullback entropy gradient flow (90) with Brockett's flow (89), and using that Γ(Q)^{−1} = Q^T W_1^{−1} Q, we immediately get the following result, which shows how Brockett's flow is related to entropy and Fisher-Rao geometry: on the orbit of W_1, the flow (90) induces

(91) Ẇ = [W, [N, W^{−1}]],

which, in the variable Σ = W^{−1} ∈ Orb(Σ_1) with Σ_1 := W_1^{−1}, becomes the double bracket flow

(92) Σ̇ = [Σ, [N, Σ]].

Implicitly, what this result shows is that Brockett's flow can be interpreted as the entropy gradient flow restricted to the Riemannian submanifold given by the orbit of W_1: if Orb(W_1) denotes the orbit of W_1 and Π_W : T_W P(n) → Ver_W denotes the orthogonal projection, then (91) is the flow of the projected gradient Π_W ∘ ∇_Ḡ H̄.
If N ∈ D(n) is generic (all its diagonal entries are distinct), then the double bracket flow (92) converges to a diagonal matrix in Orb(Σ_1), thus giving a spectral decomposition of Σ_1 [14, Th. 2].^5 Consequently, under the same condition, (91) converges to a diagonal matrix in Orb(W_1).
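The convergence is easy to observe numerically. The sketch below (not part of the original text) uses Brockett's double bracket flow in the form Ḣ = [H, [H, N]] — sign conventions vary across the literature — discretized by isospectral conjugation steps, so the spectrum is preserved exactly; with this sign, tr(HN) increases along the flow and the limit is diagonal with the eigenvalues ordered like the entries of N:

```python
import numpy as np
from scipy.linalg import expm

# A matrix with known spectrum {1, 2, 4}.
rng = np.random.default_rng(3)
Q0, _ = np.linalg.qr(rng.standard_normal((3, 3)))
W = Q0.T @ np.diag([1.0, 2.0, 4.0]) @ Q0
N = np.diag([3.0, 2.0, 1.0])     # generic: distinct diagonal entries

# Double bracket flow H' = [H, [H, N]], discretized isospectrally by
# H <- Q^T H Q with Q = expm(dt * [H, N]).
H = W.copy()
dt = 0.01
for _ in range(6000):
    B = H @ N - N @ H            # [H, N] is skew-symmetric
    Q = expm(dt * B)
    H = Q.T @ H @ Q

print(np.allclose(H, np.diag(np.diag(H)), atol=1e-5))       # True
print(np.allclose(np.diag(H), [4.0, 2.0, 1.0], atol=1e-5))  # True
```

Note the sorting effect: the largest eigenvalue pairs with the largest entry of N, which is why Brockett also viewed this flow as a dynamical system that sorts lists.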

3.4.4.
Horizontal Gradient Flow to Factorize Characteristic Polynomials. We have seen how to construct a vertical gradient flow on the orbit of W 1 that converges towards a spectral decomposition of W 1 . Here, we aim to construct a horizontal gradient flow, evolving on D(n), such that the eigenvalues of W 1 are obtained in the limit.
The idea goes as follows. For W_1 ∈ P(n), construct its characteristic polynomial p_1 = χ(W_1). Then define a functional F̂ : poly^+_n → R_+ such that F̂(p) = 0 if and only if p = p_1. Lift F̂ to a functional on D(n),

F = F̂ ∘ χ|_{D(n)},

and consider the gradient flow of F on D(n),

(93) Λ̇ = −∇_Ḡ F(Λ).

If F is a strictly convex functional on D(n) with respect to Ḡ, then this flow will converge to a limit Λ_∞ such that χ(Λ_∞) = p_1. There is, however, an obstruction to the suggested approach: F can never be strictly convex. If it were, then F would have a unique minimum, so there would be a unique sequence of eigenvalues of W_1. But this cannot be, since any reshuffling of the eigenvalues gives a different minimum.
Let us now consider the specific case

(94) F̂(p) = (1/2) ⟨p − p_1, p − p_1⟩,

where ⟨·, ·⟩ denotes the Euclidean inner product on the vectors of monomial coefficients.

^5 For numerical experiments confirming the convergence, see [14, Fig. 1].
The gradient flow (93) is thereby given by

(95) Λ̇ = −2 Λ^2 Y(Λ),

where Y(Λ) = diag(y_1, . . . , y_n) with y_i = ⟨χ(Λ) − p_1, ∂χ(Λ)/∂λ_i⟩. The question of convexity of F and convergence towards a limit of (95) is left for future work. We shall, however, give a numerical example indicating convergence. We take n = 3 and W_1 = Q^T diag(1/2, 1, 5) Q for some Q ∈ O(3). (What Q is does not affect the flow.) We then compute p_1(λ) = χ(W_1) = det(λI − W_1) in the monomial basis. From Theorem 3.17 we then get the gradient flow

(96) λ̇_i = −2 λ_i^2 y_i, y_i = ⟨χ(Λ) − p_1, ∂χ(Λ)/∂λ_i⟩,

with the inner product ⟨·, ·⟩ being the Euclidean inner product for the polynomial coefficients in the monomial basis.
Since W_1 is generic (no two eigenvalues are the same), it is conceivable that the flow converges to the ordered sequence of eigenvalues if we pick initial data in D_<(n). Thus, we need to come up with a strictly increasing sequence of initial data. To this end, we use that det(W_1) = (−1)^n p_1(0) = −b_0, and we take Λ_0 to be a geometric sequence scaled so that the initial data fulfills det(Λ_0) = det(W_1). We discretize (96) by the classical 4th order Runge-Kutta method [17, § 322], with time-step ∆t = 0.01. The evolution of Λ(t) is shown in Figure 9; Λ(t) appears to converge to the ordered sequence of eigenvalues of W_1, i.e., the sequence (1/2, 1, 5). A plot of F(Λ(t)) is given in Figure 10; it appears that F(Λ(t)) → 0 exponentially fast as t → ∞. A more careful study is left as a future research topic.

We end with the singular value decomposition: every A ∈ GL(n) can be factorized as A = U Σ V with U, V ∈ O(n) and Σ ∈ D(n).

Proof. Let W = π(A) = A^T A. From Theorem 3.15 we get W = Q_2^T Λ Q_2. Let √W := Q_2^T √Λ Q_2. Then A and √W belong to the same fiber (since π(A) = π(√W)), so there exists a Q_1 ∈ O(n) such that A = Q_1 √W. Thus, A = Q_1 Q_2^T √Λ Q_2. The result now follows by taking U = Q_1 Q_2^T, Σ = √Λ, and V = Q_2.
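Returning to the horizontal flow (96): the following sketch (not part of the original text; it assumes the reconstructed gradient form above, including the factor 2λ_i^2 coming from (85)) takes plain Euler steps and only checks that F decreases, leaving the question of convergence aside.

```python
import numpy as np

target = np.array([0.5, 1.0, 5.0])     # eigenvalues of W_1
p1 = np.poly(target)                   # monic characteristic polynomial of W_1

def F(lam):
    # F(Lambda) = (1/2) <chi(Lambda) - p1, chi(Lambda) - p1>
    d = np.poly(lam) - p1
    return 0.5 * np.dot(d, d)

def grad(lam):
    # Fisher-Rao gradient on D(n): 2 lam_i^2 <chi(Lambda) - p1, d chi / d lam_i>
    p = np.poly(lam)
    y = np.empty_like(lam)
    for i in range(lam.size):
        dp = -np.poly(np.delete(lam, i))   # d p / d lam_i, degree n - 1
        y[i] = np.dot(p - p1, np.concatenate(([0.0], dp)))
    return 2.0 * lam**2 * y

# Geometric initial sequence scaled so that det(Lambda_0) = det(W_1) = 2.5.
a = 2.5 ** (1.0 / 3.0)
lam = np.array([a / 2.0, a, 2.0 * a])

F0 = F(lam)
dt = 1e-5
for _ in range(10_000):
    lam = lam - dt * grad(lam)

print(F(lam) < F0, np.all(lam > 0))
```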

4.
Outlook. In this section we give a few topics and future research directions directly related to the geometry described in this paper.
4.1. New geometric techniques for optimal transport. We start with some ideas related to the geometry of optimal mass transport.
4.1.1. Limit of Lifted Gradient Flow. In § 2.4.3 we gave a geometric proof of Theorem 2.6, the central result in proving existence and uniqueness of the optimal transport problem in the linear category (Problem 3), or, equivalently, of the polar decomposition of matrices (Theorem 2.7). The proof is based on existence of a unique limit for the lifted gradient flow (52). In turn, this limit is obtained by showing that the Hessian of the functional for the gradient flow is negative definite. It is a natural idea to try to extend the finite-dimensional geometric proof to the infinite-dimensional, smooth case. That is, to prove that the infinite-dimensional lifted gradient flow (27) has a unique limit in the space of strictly convex functions. If that can be established, Theorem 2.3 follows as in finite dimensions.
Of course, gradient flows on infinite-dimensional spaces are more subtle than on finite-dimensional spaces. In addition, a Banach space setting might not suffice in the smooth category: for completeness one needs a Fréchet topology, for example as developed by Hamilton [45] or Kriegl and Michor [54]. Regardless of these technicalities, however, the first step is to show that the Hessian of the functional in (26) is negative definite. We shall now give some brief calculations in this direction.
Let φ = φ(t) be a geodesic curve in the space of strictly convex functions. Then, from (4) it follows that φ̈ = 0, and from (24) we obtain an expression for the second variation of the functional along φ consisting of two terms. That the first term is negative follows from the same calculation as in the proof of Theorem 2.12, since ∇²φ is positive definite at each point. If ρ_1 is log-concave, then the Hessian matrix ∇² log(ρ_1)(x) is negative semi-definite for all x ∈ R^n, so the second term is non-positive. We thereby arrive at the following conjecture.
Conjecture 1. Let ρ 1 be log-concave (that is, log •ρ 1 is concave). Then the flow (27) has a unique limit in the set of strictly convex functions.
Although most standard probability distribution functions in statistics are log-concave, the restriction on ρ_1 is rather stringent. (Notice that there is no restriction on ρ_0.) Ideally, Hess(F) < 0 for all ρ_1; to achieve this one might have to consider a different functional.
We aim to investigate the geometric approach for Theorem 2.3, as sketched here, in a forthcoming publication.

4.1.2.
Numerical Method Based on the Lifted Gradient Flow. If the lifted gradient flow (27) has nice convergence properties, it is natural to consider it for numerical computation of the solution to Problem 1. This, of course, requires space and time discretization to yield a fully discrete flow.
The field of numerical methods for optimal transport has been growing steadily since the classical paper by Benamou and Brenier [8]; surveys of the state of the art are given by Peyré and coauthors.

4.1.3.

Wasserstein Geometry on the Space of Riemannian Metrics. On the space of Riemannian metrics there is a natural Riemannian structure, first studied by Ebin [35], and later by Freed and Groisser [39], Gil-Medrano and Michor [42], and Clarke [30,31]. This Riemannian structure induces the Fisher-Rao metric via the projection taking a Riemannian metric to its corresponding volume form. Now, the observation is that the formula (65), for the Fisher-Rao metric restricted to Gaussian distributions, essentially recovers Ebin's metric (lacking only integration over the domain of interest).
The idea is then to study the metric on the space of Riemannian metrics originating, analogously, from the Wasserstein metric restricted to Gaussian distributions (39). This would yield a new geometry on the space of Riemannian metrics, related not to the Fisher-Rao metric but to the physically relevant Wasserstein geometry.

4.2.
New geometric techniques associated with Fisher-Rao. As already mentioned, the finite-dimensional Fisher-Rao geometry studied in § 3 naturally extends to infinite dimensions (see [40,51,60]). Therefore, it is natural to look for decompositions of diffeomorphisms analogous to the matrix decompositions studied in § 3. In addition, the horizontal flow in § 3.4.4 to factorize the characteristic polynomial would be interesting to study further.

4.2.1.

QR (or Iwasawa) Decomposition of Diffeomorphisms. Let us recap the Wasserstein geometry described in § 2: Brenier [13] showed that the polar decomposition of matrices has an infinite-dimensional analogue, namely the polar decomposition of maps. The geometric approach of Otto [65] makes the relation more transparent in terms of Riemannian metrics on diffeomorphisms and densities, together with the observation that the Gaussian distributions form a finite-dimensional submanifold of the space of all densities.
In Fisher-Rao geometry, there is a similar picture. Indeed, an infinite-dimensional analogue of the QR decomposition is developed in [60] (for diffeomorphisms on a compact manifold, in the category of Banach manifolds). It would be interesting to study this infinite-dimensional analogue of the QR decomposition in more detail, to see how much of the structure in § 3.2 is retained. For example, one could take a flat compact manifold, such as the n-torus T^n, and try to relate the finite- and infinite-dimensional QR decompositions, analogously to how the finite- and infinite-dimensional polar decompositions are related.

4.2.2.
Spectral decomposition of densities. The Fisher-Rao geometry is richer than the Wasserstein geometry: the Riemannian metric boasts more symmetries. These extra symmetries allow the continuation of the reduction scheme to get the spectral decomposition of symmetric matrices in § 3.4.
It is natural to look for an analogue of the spectral reduction in the infinite-dimensional setting, where the analogue of the space P(n) of symmetric matrices is Dens(R^n). Just as P(n) can be viewed as the space of normalized inner products on R^n, we may think of Dens(R^n) as the space of (smooth) normalized inner products on L^2(R^n). Changing from R^n to the torus T^n (things are easier to prove for compact manifolds), we are then looking for a way to represent (or at least cover, as in the finite-dimensional case) the quotient Dens(T^n)/Diff(T^n). The formal decomposition thereby obtained would be an infinite-dimensional spectral decomposition of smooth densities. Such a decomposition could have connections to integrable systems and the KAM theorem.

4.2.3.

New Numerical Methods for Spectral Decompositions. As discussed in § 3.4.3, there are several vertical (isospectral) flows for obtaining spectral decompositions, e.g., the Toda and Brockett flows. The horizontal flow (95) in § 3.4.4 to factorize characteristic polynomials represents a new type of flow for spectral decompositions. It would be interesting to study this class of flows in depth.
Convergence to a limit is, of course, one important question. Another is to develop numerical methods based on these flows. For that, one should look for a strictly convex functional F̂ on poly^+_n such that the gradient vector field ∇_Ḡ F(Λ) can be evaluated efficiently. For sparse, banded matrices there are O(n) algorithms for evaluating the characteristic polynomial [53]; this might be helpful. It is plausible that some known iterative methods for computing eigenvalues can be seen as discretizations of flows of the form in § 3.4.4. The computation of eigenvalues of symmetric matrices is discussed thoroughly in the monograph by Parlett [66]. For a summary of numerical methods for spectral decompositions, see [84,43].
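For instance, for a symmetric tridiagonal matrix the characteristic polynomial satisfies the classical three-term recurrence, which gives an O(n) evaluation at any point; a minimal sketch (an illustration, not part of the original text):

```python
import numpy as np

def charpoly_tridiag(a, b, x):
    """Evaluate det(x*I - T) in O(n), where T is symmetric tridiagonal with
    diagonal a and off-diagonal b, using the recurrence
    p_k(x) = (x - a_k) p_{k-1}(x) - b_{k-1}^2 p_{k-2}(x)."""
    p_prev, p = 1.0, x - a[0]
    for k in range(1, len(a)):
        p, p_prev = (x - a[k]) * p - b[k - 1] ** 2 * p_prev, p
    return p

a = np.array([2.0, 3.0, 4.0, 5.0])
b = np.array([1.0, 1.0, 1.0])
T = np.diag(a) + np.diag(b, 1) + np.diag(b, -1)
x = 0.7
print(np.isclose(charpoly_tridiag(a, b, x),
                 np.linalg.det(x * np.eye(4) - T)))  # True
```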