DISCRETE TIME MEAN FIELD GAMES: THE SHORT-STAGE LIMIT

In this note we provide a model for discrete time mean field games. Our main contributions are an explicit approximation in the discounted case and an approximation result for a mean field game with short-stage duration.

1. Introduction. In this paper we study a model for discrete time, discrete state space, finitely repeated stochastic games in which the transitions and the payoffs of the players depend on the positions in the state space and the actions of the adversaries, but not on their identities. We assume that all the players have the same dynamics and the same payoff; thus, for each player, we can capture the influence of the adversaries entirely through the empirical distribution of the state-action pairs.
Mean field games have been introduced independently by Huang, Caines and Malhamé [11] and by Lasry and Lions [13,14,15] and have received considerable attention in the literature. The aim of the mean field games paradigm is to describe situations with many interacting agents whose preferences and dynamics depend on the aggregate effect of the other agents. Mean field game models are composed of two parts: a backward component, where each agent considers the aggregate behavior as an external parameter and computes myopically his own optimal behavior, and a forward component, which is the evolution of the initial distribution in the state space under a common strategy. Mean field games have found applications in many different areas; we refer to [9] and the references therein for examples.
Most of the models studied in the literature so far are in continuous time, while the discrete time case has received less attention. The discrete time case is not only of independent interest, but also allows one to model more general transitions, in contrast with the assumption usually made in continuous time mean field games that the noise in the dynamics of the players is independent of their actions. In discrete time, we can also allow the players to choose their actions randomly, as in classical game theory. However, for some applications it might be relevant to consider frequent interactions between the players. This motivates the study of a limit model as the duration of each stage tends to zero, which we pursue in Section 3.
The main novelty of our work with respect to the previous work on discrete time mean field games is the short-stage version. Short-stage games have been recently introduced in [16]. The aim of this theory is to study games where players are allowed to interact more frequently. Incorporating this machinery, we obtain a limit object that provides an approximate Nash equilibrium for games with sufficiently many players and sufficiently frequent interactions.
In [8], a discrete time, finite state mean field game with a continuum of players is studied. The authors study a finite horizon game and prove the exponential convergence of the finite-horizon mean field equilibrium to a stationary solution. There are two significant differences with our work. First, we consider a fixed time horizon. Second, we are interested in constructing approximate equilibria for games with large numbers of players, while in [8] a continuum of players is considered. An anonymous referee pointed us to the recent paper [6] where a model for linear quadratic mean field games in discrete time is studied.
Our work is closer to [2], where a similar notion is studied for an infinite horizon, discounted stochastic game. While we restrict our framework to a finite state space (in [2] an unbounded state space is considered), we provide explicit approximation estimates in terms of the basic parameters of the game. Our estimate is of the same order as the one in [12] in continuous time.
Let us remark also that we study a discrete-time game by itself, and not the discretization of a continuous-time mean field game for numerical solution purposes. Numerical methods have been initially developed in [1]. A semi-Lagrangian scheme has been proposed in [4] for the deterministic, finite-horizon case. The full discretization has been studied in [5].
The paper is organized as follows. In Section 2, we describe the model and some results on the existence of mean field equilibrium, as well as the approximation results with explicit convergence rates. In Section 3, we introduce a short-stage version of the discounted stochastic mean field game. In the Appendix, we prove an approximation lemma which allows us to prove the results we present in Section 2.3.
2. The discrete time model.

2.1. Mean field equilibrium.
Let Ω and A denote respectively the state and action sets. We assume both to be finite. Let Z := ∆(Ω × A), where, for a finite set S, ∆(S) denotes the set of probability distributions over S. Consider a bounded payoff function g : Ω × A × Z → [0, 1] and a transition function Q : Ω × A × Z → ∆(Ω). Let n be a fixed positive integer. Let us define a family of auxiliary one-player games, parameterized by a vector z = (z_1, z_2, . . . , z_n) ∈ Z^n. The one-player dynamic programming problem Γ_n^z is defined as follows: at stage k, the player observes the state ω_k ∈ Ω and chooses an action a_k ∈ A, from which he receives the payoff g(ω_k, a_k, z_k), and the new state is chosen according to the law Q(ω_k, a_k, z_k). A pure behavior strategy (resp. mixed behavior strategy) is a sequence of functions σ = (σ_1, . . . , σ_n), where σ_k : H_k → A (resp. σ_k : H_k → ∆(A)). Here, H_k := (Ω × A)^{k−1} × Ω denotes the set of histories up to stage k, for k = 1, . . . , n. Let Σ_n denote the set of pure strategies. The player knows z and observes the payoff. We introduce the value function V_n : Ω × Z^n → [0, 1] for the game Γ_n^z:

V_n(ω, z) := max_{σ ∈ Σ_n} E_{ω,σ} [ (1/n) Σ_{k=1}^n g(ω_k, a_k, z_k) ].

One can also consider an infinitely repeated game Γ_λ^z with parameters z ∈ Z and λ ∈ (0, 1], played as before but where the payoff is evaluated by λ Σ_{k≥1} (1 − λ)^{k−1} g(ω_k, a_k, z); its value function is denoted V_λ : Ω × Z → [0, 1]. By familiar arguments, one can prove that the value functions satisfy the following recursive formulae (dynamic programming principle):

V_n(ω, z) = max_{a ∈ A} { (1/n) g(ω, a, z_1) + ((n−1)/n) Σ_{ω′ ∈ Ω} Q(ω, a, z_1)(ω′) V_{n−1}(ω′, z^+) }   (1)

and

V_λ(ω, z) = max_{a ∈ A} { λ g(ω, a, z) + (1 − λ) Σ_{ω′ ∈ Ω} Q(ω, a, z)(ω′) V_λ(ω′, z) }.   (2)

In (1), if z = (z_1, z_2, . . . , z_n), then z^+ denotes the shifted vector (z_2, z_3, . . . , z_n). The dynamic programming principle (1) also tells us that the player can restrict his attention to the set of Markovian strategies Σ_n^M ⊂ Σ_n, which consists of the strategies σ = (σ_1, . . . , σ_n) such that σ_k : Ω → ∆(A).
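The backward recursion (1) is straightforward to implement. The following is a minimal sketch on a small hypothetical instance, in which the stage payoffs and transitions are already evaluated along a fixed parameter vector z and stored as arrays; all sizes and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance: |Ω| = 3 states, |A| = 2 actions, n = 5 stages.
nS, nA, n = 3, 2, 5

# g[k][ω, a] = g(ω, a, z_k) and Q[k][ω, a, ω'] = Q(ω, a, z_k)(ω'),
# i.e. the model data already evaluated at the parameter sequence z.
g = rng.random((n, nS, nA))
Q = rng.random((n, nS, nA, nS))
Q /= Q.sum(axis=-1, keepdims=True)        # normalize rows into ∆(Ω)

def backward_induction(g, Q):
    """Dynamic programming for the average payoff (1/n) Σ_k g(ω_k, a_k, z_k)."""
    n = g.shape[0]
    V = np.zeros(g.shape[1])              # continuation value after the last stage
    policy = np.zeros((n, g.shape[1]), dtype=int)
    for k in range(n - 1, -1, -1):
        q = g[k] + Q[k] @ V               # unnormalized value of (ω, a) at stage k
        policy[k] = q.argmax(axis=1)      # a Markovian strategy σ_k : Ω → A
        V = q.max(axis=1)
    return V / n, policy                  # normalize to the average payoff

V, policy = backward_induction(g, Q)
assert V.shape == (nS,) and 0.0 <= V.min() <= V.max() <= 1.0
```

The returned `policy` is a pure Markovian strategy, consistent with the restriction to Σ_n^M justified by (1).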
Let m_1 ∈ ∆(Ω) be given and let Z_1^n := {z ∈ Z^n : z_1|_Ω = m_1}. For the rest of the paper, z_k|_Ω denotes the marginal distribution of z_k ∈ Z on the set Ω. Define Ψ_n : Z_1^n ⇉ Σ_n^M as the set-valued map that associates to every z ∈ Z_1^n the set of optimal Markovian strategies in Γ_n^z. Let Φ_n : Σ_n^M → Z_1^n be defined by σ ↦ z^σ, where the sequence z^σ is recursively defined by setting z_1^σ(ω, a) := m_1(ω) · σ_1[ω](a) and

z_{k+1}^σ(ω, a) := σ_{k+1}[ω](a) Σ_{(ω′,a′) ∈ Ω×A} z_k^σ(ω′, a′) Q(ω′, a′, z_k^σ)(ω).

We are interested in the fixed points of Φ_n ∘ Ψ_n. In order to apply fixed point theorems, one needs to ensure certain continuity and convexity properties, which will hold under the following assumptions.

Assumption 1. (Lipschitz continuity) There exist positive real numbers L_Q, L_g such that, for all (ω, a, y, z) ∈ Ω × A × Z × Z,

‖Q(ω, a, y) − Q(ω, a, z)‖_1 ≤ L_Q ‖y − z‖_1  and  |g(ω, a, y) − g(ω, a, z)| ≤ L_g ‖y − z‖_1.

For the existence results in this section this continuity assumption can be relaxed; however, for our main approximation results we need Lipschitz continuity. One way to ensure the convexity properties is the following:

Assumption 2. (Independent transitions) The transition function does not depend on the aggregate term, i.e. Q(ω, a, z) = Q(ω, a) for all z ∈ Z.

In order to avoid this assumption, one needs to impose a different assumption so that a convexity property can still be preserved:

Assumption 3. (Unique maximizer) The right-hand sides of the dynamic programming equations (1) and (2) have a unique maximizer.

It is possible to provide conditions on the basic model data that ensure that Assumption 3 holds; see for example Assumptions 2 and 3 in [2] or Assumptions 1-3 in [8]. As uniqueness of the maximizer might hold under other circumstances, we prefer not to write down explicit conditions.
A straightforward application of Brouwer's fixed point theorem yields the following result. Proposition 1. If Q satisfies Assumptions 1 and 3, then Φ n • Ψ n has a fixed point.
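Existence is obtained through a fixed point of Φ_n ∘ Ψ_n. As an illustration, the following sketch implements a naive best-response iteration on a toy instance with transitions independent of the aggregate term (as in Assumption 2) and a hypothetical crowd-averse payoff. It only illustrates the two maps Ψ_n (backward induction) and Φ_n (forward flow); unlike Proposition 1, which asserts existence, this simple iteration is not guaranteed to converge in general.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, n = 2, 2, 4

Q = rng.random((nS, nA, nS))
Q /= Q.sum(axis=-1, keepdims=True)        # Q(ω, a): independent of z (Assumption 2)
base = rng.random((nS, nA))
m1 = np.array([0.6, 0.4])                 # initial distribution m_1

def payoff(zk):
    # hypothetical crowd-averse payoff: g(ω, a, z) = base(ω, a) · (1 − z|_Ω(ω))
    return base * (1.0 - zk.sum(axis=1, keepdims=True))

def best_response(z_seq):                 # Ψ_n: a pure optimal Markovian strategy
    V = np.zeros(nS)
    sigma = np.zeros((n, nS), dtype=int)
    for k in range(n - 1, -1, -1):
        q = payoff(z_seq[k]) + Q @ V
        sigma[k] = q.argmax(axis=1)
        V = q.max(axis=1)
    return sigma

def flow(sigma):                          # Φ_n: the induced state-action flow z^σ
    z = np.zeros((n, nS, nA))
    m = m1.copy()
    for k in range(n):
        z[k][np.arange(nS), sigma[k]] = m # z_k(ω, a) = m_k(ω)·1{σ_k(ω) = a}
        m = (m[:, None] * Q[np.arange(nS), sigma[k]]).sum(axis=0)
    return z

z = flow(best_response(np.full((n, nS, nA), 1.0 / (nS * nA))))
for _ in range(20):                       # iterate z ← Φ_n(Ψ_n(z))
    z = flow(best_response(z))

# each z_k is a probability distribution on Ω × A, with z_1|_Ω = m_1
assert np.allclose(z.sum(axis=(1, 2)), 1.0)
assert np.allclose(z[0].sum(axis=1), m1)
```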
For the discounted case, from (2) one obtains that there exist optimal stationary strategies, i.e. strategies of the form σ : Ω → ∆(A). Let Σ denote the set of stationary strategies. Define Ψ_λ : Z ⇉ Z × Σ as the set-valued map that associates to every z ∈ Z the pairs (z, S_λ^z), where S_λ^z is the set of optimal stationary strategies in Γ_λ^z.
We will make the following ergodicity assumption throughout the paper.

Assumption 4. (Ergodicity) For every y ∈ Z and every stationary strategy σ ∈ Σ, the Markov chain on Ω × A with transition kernel

P[y, σ]((ω, a), (ω′, a′)) := Q(ω, a, y)(ω′) σ[ω′](a′)   (4)

has a unique stationary distribution z = z[y, σ]. In other words, for y, σ given, there exists a unique z such that the following holds:

z(ω, a) = σ[ω](a) Σ_{(ω′,a′) ∈ Ω×A} z(ω′, a′) Q(ω′, a′, y)(ω),  for all (ω, a) ∈ Ω × A.   (5)

Under Assumption 4, the map Φ_λ : Z × Σ → Z, (y, σ) ↦ z[y, σ], is well defined. One obtains a result analogous to Proposition 1.

Proposition 2. If Q satisfies Assumptions 1 and 4, then Φ_λ ∘ Ψ_λ has a fixed point.

Proof. The upper semicontinuity follows easily from the assumptions. For the convexity, let z ∈ Z and let z^1, z^2 ∈ Φ_λ(Ψ_λ(z)) be induced by optimal stationary strategies σ^1, σ^2; for θ ∈ [0, 1], set z_θ := θ z^1 + (1 − θ) z^2. Let σ_θ be the following strategy:

σ_θ[ω](a) := z_θ(ω, a) / z_θ|_Ω(ω),  whenever z_θ|_Ω(ω) > 0.

Observe that σ_θ is optimal for Γ_λ^z from the optimality of σ^1, σ^2. We make the convention that when z_θ(ω, a) = 0, σ_θ[ω](a) = 0. Note also that z_θ(ω, a) = 0 if and only if z^1(ω, a) = z^2(ω, a) = 0.
We have, up to excluding the above trivial cases, that z_θ(ω, a) = z_θ|_Ω(ω) σ_θ[ω](a), so that z_θ is the stationary distribution induced by the optimal strategy σ_θ and the image set is convex. In the case of pure strategies (for instance under Assumption 3), the composition Φ_λ ∘ Ψ_λ is single-valued and continuous, and the existence of a fixed point is a straightforward application of Brouwer's fixed point theorem.
Definition 2.2. A stationary mean field equilibrium is a pair (σ, z) ∈ Σ × Z such that z is a fixed point of Φ λ • Ψ λ and σ is the strategy associated to Ψ λ (z).
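Computing Φ_λ amounts to solving the linear stationarity equation z = z[y, σ] for the state-action chain induced by σ and Q(·, ·, y). A minimal numerical sketch, with hypothetical data for the transitions and a fixed stationary strategy:

```python
import numpy as np

nS, nA = 3, 2
rng = np.random.default_rng(2)
Q = rng.random((nS, nA, nS))
Q /= Q.sum(axis=-1, keepdims=True)        # Q(·, ·, y) for a fixed y
sigma = rng.random((nS, nA))
sigma /= sigma.sum(axis=-1, keepdims=True)  # a stationary strategy σ : Ω → ∆(A)

# Transition matrix of the state-action chain: (ω, a) → (ω', a')
# with probability Q(ω, a, y)(ω') · σ[ω'](a')
M = np.einsum('saw,wb->sawb', Q, sigma).reshape(nS * nA, nS * nA)

# Stationary distribution: solve z·M = z together with Σ z = 1
A_ = np.vstack([M.T - np.eye(nS * nA), np.ones(nS * nA)])
b = np.zeros(nS * nA + 1)
b[-1] = 1.0
z, *_ = np.linalg.lstsq(A_, b, rcond=None)

assert abs(z.sum() - 1.0) < 1e-8 and np.all(z > -1e-9)
assert np.allclose(z @ M, z, atol=1e-8)   # stationarity
z = z.reshape(nS, nA)                     # z[y, σ] ∈ Z = ∆(Ω × A)
```

With random strictly positive data the chain is ergodic, so the linear system has a unique solution, in line with Assumption 4; for structured data one should check irreducibility first.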

2.2. The N-player game. We consider an n-stage stochastic game Γ_{n,N}[m_1] with N + 1 identical players, i.e. with common state space Ω, action set A, stage payoff g : Ω × A × Z → [0, 1] and transition function Q : Ω × A × Z → ∆(Ω), played as follows: at stage k, for k = 1, . . . , n, each player i observes his own state ω_{k,N}^i and the states of all the adversaries and chooses his action a_k^i. The initial state of each player is sampled i.i.d. using the lottery m_1. The actions of the players are chosen simultaneously and independently. After the actions have been chosen, each player i has a state-action pair z_{k,N}^i := (ω_{k,N}^i, a_k^i);

z_{k,N} := (1/(N+1)) Σ_i δ_{z_{k,N}^i}

denotes the empirical distribution of the state-action pairs of the players after the play at stage k. The new state of player i, ω_{k+1,N}^i, is chosen according to the law Q(ω_{k,N}^i, a_k^i, z_{k,N}), and the situation is repeated. A behavioral strategy for player i is a vector π^i = (π_1^i, . . . , π_n^i), where π_k^i : H_{k,N} → ∆(A) and H_{k,N} is the set of all possible histories up to stage k. Denote by Σ_{n,N} the set of behavioral strategies of each player and note that Σ_n^M ⊂ Σ_{n,N}. A strategy profile is a vector π = (π^i)_i, where π^i is a behavioral strategy of player i. The average payoff of player i, when using the strategy π^i and when his adversaries use the strategy profile π^{−i}, is

γ_{n,N}^i[m_1](π) := E_π [ (1/n) Σ_{k=1}^n g(ω_{k,N}^i, a_k^i, z_{k,N}) ].

One can also consider a game Γ_{λ,N}[m_1] with infinite horizon and payoff

γ_{λ,N}^i[m_1](π) := E_π [ λ Σ_{k≥1} (1 − λ)^{k−1} g(ω_{k,N}^i, a_k^i, z_{k,N}) ],

where λ ∈ (0, 1].
Definition 2.3. An ε-Nash equilibrium for the average payoff, where ε > 0, is a strategy profile (π^i)_i such that, for all players i and all behavioral strategies τ^i,

γ_{n,N}^i[m_1](π) ≥ γ_{n,N}^i[m_1](τ^i, π^{−i}) − ε.

Analogously, an ε-Nash equilibrium for the λ-discounted payoff is a strategy profile (π^i)_i such that, for all players i and all behavioral strategies τ^i,

γ_{λ,N}^i[m_1](π) ≥ γ_{λ,N}^i[m_1](τ^i, π^{−i}) − ε.

2.3. Approximation results. We are ready to state our first main result in the finite horizon case, which is an easy consequence of the approximation lemma proved in the Appendix. This result is an estimate of the maximal deviation between the trajectories followed by a player when the observed aggregate state-action pair of his adversaries enters his own transition function, and an independent one-player game in which he plays the same actions but the transition function takes as argument the corresponding aggregate state-action pair of the mean field equilibrium. Throughout this section, Assumption 1 holds.
Once the difference between the trajectories of player i in the N-player game and in the one-player game, where the state-action term enters as a parameter, is bounded, we obtain the following result.

Theorem 2.4. Let ε > 0 be given. In the finite horizon, N-player game, there exists N_0 such that, for N > N_0, the mean field equilibrium is an ε-equilibrium, with an approximation error that is explicit in terms of n, N and the Lipschitz constants of the data.

Proof. Consider a game with N players and let us focus on the payoff function of player i. Let (σ, z) denote a mean field equilibrium for a given initial distribution m_1 and let ω_1 = ω. We have that, for all behavioral strategies τ, the payoffs of player i under the profiles (σ, . . . , σ) and (τ, σ^{−i}) can each be compared with the corresponding payoffs in the one-player game Γ_n^z; the inequality between the one-player payoffs comes from the optimality of σ. The result now follows by applying Lemma 4.2 to each of the comparison terms.

Remark 1.
Note that the discounted case can be reduced to the finite horizon case: indeed, it suffices to find K ∈ N large enough so that (1 − λ)^K ‖g‖_∞ < ε/2 and to consider the N_0 associated with ε/2 in Theorem 2.4. However, this may not be appropriate when λ is small, because the number of stages will be too large. For small λ, it makes more sense to consider the stationary construction we proposed. This will be the case, for instance, in Section 3.

Remark 2.
Our bound suggests that the number of players should be much larger than the length of the game. This seems intuitive, since one would expect that, if there are not enough players and they play for many stages, the empirical distribution at the early stages of the game may be too far from the predicted distribution, and this error would propagate.

Remark 3.
Our result is in the spirit of [12]: construct a limit object that induces ε-Nash equilibria in games with sufficiently many players. Our limit object corresponds heuristically to a game with a continuum of players. The complementary approach, studying the limits of a sequence of Nash equilibria of games with finitely many players, has been explored in some cases (see for instance [3,13]), but the general case remains open. We will illustrate this remark in Example 2.
We conclude this section with two illustrative examples. Example 1. As an application, let us consider the following example, adapted from learning by doing [7]. Consider the industry of online hotel booking, where many firms offer accommodation. In this case, the state space Ω is the reputation of the firm and the action set A is the capital to be invested. Each firm aims to improve its reputation by making investments and/or adjusting its offers. The reputation changes according to the transition function Q. Note that in this context it makes sense to consider independent transitions, as in Assumption 2. The firms interact with each other through the payoff function g, which represents their market share. For instance, if all the firms have similar reputations, customers might be indifferent and the utilities will be shared evenly, whereas if there are few firms with outstanding reputations, they may have higher revenues.
Example 2. Consider a game with N players, where each player chooses whether to drive on the left or right side of the street. Assume that the payoff for driving on the same side as everyone else is 1, and zero otherwise. Observe that 'everyone drives left' and 'everyone drives right' are Nash equilibria. However, this game has more equilibria, which are sensitive to the number of players in the game, for instance 'everyone drives right if N is even'. In this case it does not make sense to consider limits of Nash equilibria; it is rather desirable to have equilibria that are independent of the number of players.

3. A mean field game with frequent actions. The aim of this section is to study a model for mean field games where the players are allowed to play more frequently.
3.1. The one-player game. Let δ > 0 and z ∈ Z. In the spirit of [16], we consider a family of discrete time repeated games parameterized by δ, as follows: let µ : Ω × Ω × A × Z → R be bounded and such that, for all (ω, a, z) ∈ Ω × A × Z,

µ(ω, ω′, a, z) ≥ 0 for all ω′ ≠ ω,  and  Σ_{ω′ ∈ Ω} µ(ω, ω′, a, z) = 0.

That is, µ(ω, ω′, a, z) defines the transition rate from ω to ω′, while −µ(ω, ω, a, z) is the rate of leaving ω.
For δ small enough, the function

Q_δ(ω, a, z)(ω′) := δ µ(ω, ω′, a, z) for ω′ ≠ ω,  Q_δ(ω, a, z)(ω) := 1 + δ µ(ω, ω, a, z),   (7)

defines transition probabilities. Introduce the notation g_δ := δg and let Γ_z^{λ,δ} denote the one-player game with stage payoff g_δ and transition function Q_δ. For a fixed δ, this is exactly a discounted one-player game as introduced in Section 2 to define a stationary mean field equilibrium. The stationary mean field equilibrium defined through these games enjoys, for a fixed δ, the same approximation properties as in Section 2.3 in terms of the number of players. Our goal in this section is to provide a limit object that simultaneously provides good approximations for a large enough population of players and for a short enough time between plays.
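In code, (7) is just "identity plus δ times the rate matrix". A small sketch with a hypothetical rate matrix µ for a fixed (a, z), checking that Q_δ is a stochastic matrix once δ is below the threshold 1/max_ω |µ(ω, ω, a, z)|:

```python
import numpy as np

nS = 3
rng = np.random.default_rng(3)

# Hypothetical rate matrix µ(ω, ω') for a fixed (a, z): nonnegative
# off-diagonal entries, rows summing to zero (a Markov generator).
mu = rng.random((nS, nS))
np.fill_diagonal(mu, 0.0)
mu[np.diag_indices(nS)] = -mu.sum(axis=1)

def Q_delta(mu, delta):
    """Transition matrix (7): Q_δ = I + δµ, stochastic once δ·max|µ(ω,ω)| ≤ 1."""
    return np.eye(len(mu)) + delta * mu

delta = 1.0 / (2.0 * np.abs(np.diag(mu)).max())  # a short enough stage duration
Q = Q_delta(mu, delta)
assert np.all(Q >= 0) and np.allclose(Q.sum(axis=1), 1.0)
```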
Let 0 < ρ < 1. Informally, our aim is to approximate a mean field equilibrium for the stochastic game in continuous time with payoff ∫_0^∞ e^{−ρt} g(ω_t, a_t, z) dt via mean field equilibria of the discrete time games Γ_z^{λ,δ}. The discount factor needs to be adjusted so that the accumulated payoff at time t of the continuous time game is indeed the limit of the accumulated payoffs during the first t/δ stages of the discrete time game. This is achieved by taking the discount factor λ = ρδ and evaluating the total payoff of a play in Γ_z^{ρδ,δ} by

Σ_{m≥1} (1 − ρδ)^{m−1} g_δ(ω_m, a_m, z).   (6)

Denote by V_λ^δ the value function of the game Γ_z^{λ,δ}. Taking λ = ρδ in (2), evaluating the payoff as in (6), using (7) and dividing by δ yields

ρ V_{ρδ}^δ(ω, z) = max_{a ∈ A} { g(ω, a, z) + (1 − ρδ) Σ_{ω′ ∈ Ω} µ(ω, ω′, a, z) V_{ρδ}^δ(ω′, z) },   (8)

which suggests that if f is an accumulation point of (V_{ρδ}^δ)_{δ>0}, then it should satisfy

ρ f(ω, z) = max_{a ∈ A} { g(ω, a, z) + Σ_{ω′ ∈ Ω} µ(ω, ω′, a, z) f(ω′, z) }.   (9)
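The calibration λ = ρδ can be checked on the accumulated discount weights: the discrete sum Σ_k (1 − ρδ)^k δ over the first T/δ stages approaches ∫_0^T e^{−ρt} dt as δ → 0. A quick numerical sanity check (the values of ρ, δ and T are illustrative):

```python
import numpy as np

rho, delta, T = 0.3, 1e-3, 50.0
k = np.arange(int(T / delta))

# total discount weight carried by the first T/δ stages of the discrete game
discrete = ((1.0 - rho * delta) ** k * delta).sum()
# continuous-time counterpart: ∫_0^T e^{-ρt} dt
continuous = (1.0 - np.exp(-rho * T)) / rho

assert abs(discrete - continuous) < 1e-2
```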
Let us provide a proof of this convergence. The proof is inspired by the proof of Theorem 1 in [16].

Proposition 5. The equation (9) has a unique solution V_ρ, the value functions V_{ρδ}^δ converge to V_ρ as δ → 0, and any stationary strategy which is optimal in (9) is asymptotically optimal in Γ_z^{ρδ,δ}.
Proof. Let C(Ω × Z) denote the set of continuous real-valued functions over Ω × Z.
Let T : C(Ω × Z) → C(Ω × Z) be the following operator:

(T f)(ω, z) := (1/(‖µ‖_∞ + ρ)) max_{a ∈ A} { g(ω, a, z) + Σ_{ω′ ∈ Ω} µ(ω, ω′, a, z) f(ω′, z) + ‖µ‖_∞ f(ω, z) }.

Note that T(f + c1) = Tf + c ‖µ‖_∞/(‖µ‖_∞ + ρ), and that T is monotone, i.e. Tf ≥ Tg whenever f ≥ g. Here, 1 denotes the constant function equal to 1. Consequently, T is a ‖µ‖_∞/(‖µ‖_∞ + ρ)-contraction and has a unique fixed point. Besides, note that Tv = v if and only if v is a solution to the following implicit equation:

ρ v(ω, z) = max_{a ∈ A} { g(ω, a, z) + Σ_{ω′ ∈ Ω} µ(ω, ω′, a, z) v(ω′, z) }.   (10)

Denote by V_ρ the unique solution of (10) and let σ_z^ρ be an optimal stationary strategy in (10). Consider the stochastic one-player game Γ_z^{ρδ,δ} with initial state ω, let Y_m denote the state at stage m when the player uses σ_z^ρ, and set λ_δ := 1 − ρδ temporarily to alleviate the notation. Comparing, at every stage, the λ_δ-discounted payoff of σ_z^ρ with the implicit equation (10), one obtains a per-stage error of order δ²; the bound follows directly from (10) and the fact that ‖V_ρ‖_∞ ≤ ‖g‖_∞/ρ. Hence, summing over m (since we consider the payoff as in (6)), we obtain that

V_{ρδ}^δ(ω, z) ≥ V_ρ(ω, z) − (2 ‖µ‖_∞ ‖g‖_∞ / ρ) δ,

and σ_z^ρ is asymptotically optimal in Γ_z^{ρδ,δ} as δ → 0.
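The uniformized operator T from the proof is also a practical way to solve (10) numerically: since it is a ‖µ‖_∞/(‖µ‖_∞ + ρ)-contraction, plain iteration converges geometrically. A sketch on hypothetical data for a fixed z:

```python
import numpy as np

rng = np.random.default_rng(4)
nS, nA, rho = 3, 2, 0.5

g = rng.random((nS, nA))                  # g(ω, a, z) for a fixed z
mu = rng.random((nS, nA, nS))             # rates µ(ω, ω', a, z): mu[ω, a, ω']
idx = np.arange(nS)
mu[idx, :, idx] = 0.0
mu[idx, :, idx] = -mu.sum(axis=2)         # generator rows sum to zero

M = np.abs(mu[idx, :, idx]).max()         # ‖µ‖_∞

def T(f):
    # uniformized Bellman operator; monotone and a ‖µ‖_∞/(‖µ‖_∞+ρ)-contraction
    q = g + mu @ f                        # shape (nS, nA)
    return (q.max(axis=1) + M * f) / (rho + M)

f = np.zeros(nS)
for _ in range(500):                      # fixed-point iteration
    f = T(f)

# f solves (10): ρ f(ω) = max_a { g(ω, a) + Σ_{ω'} µ(ω, ω', a) f(ω') }
residual = rho * f - (g + mu @ f).max(axis=1)
assert np.abs(residual).max() < 1e-6
```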
Heuristically, the limit state-action pair as the stage duration goes to zero, corresponding to the limit strategy σ_y^ρ, should solve L[σ_y^ρ, y]z = 0, where L[σ, y] denotes the generator of the state-action dynamics induced by the stationary strategy σ and the rates µ(·, ·, ·, y).

Assumption 5. For every y ∈ Z and every stationary strategy σ ∈ Σ which is optimal in (9) with z = y, the equation L[σ, y]z = 0 has a unique solution z ∈ Z.
Define Ψ_ρ : Z ⇉ Σ as the map associating to every y ∈ Z the stationary strategies that are optimal in (9) with z = y, and, under Assumption 5, let Φ_ρ(y, σ) be the unique solution of L[σ, y]z = 0. A fixed point of their composition is called a limit stationary mean field equilibrium. One can prove its existence under an appropriate uniqueness assumption on the optimal action, analogous to Assumption 3.
Assumption 6. The right hand side of (9) has a unique maximizer.
By a straightforward application of Brouwer's fixed point theorem, one has: Proposition 6. Under Assumptions 5 and 6, the operator T ρ := Φ ρ • Ψ ρ has a fixed point.
3.3. The approximation of the N-player game with frequent actions. Let us now state our second approximation result, namely the approximation of the game with a sufficiently short stage and sufficiently many players.
Theorem 3.2. For every ε > 0 there exist δ_0 > 0 and N_0 ∈ N such that, for all δ < δ_0 and N > N_0, the strategy provided by the limit stationary mean field equilibrium is a 2ε-Nash equilibrium of the discounted mean field game with discount factor λ = ρδ and N players.
Proof. Let ε > 0 be fixed and consider a limit stationary mean field equilibrium (σ, z). Observe from the proof of Proposition 5 that one can choose δ_0 small enough so that σ is ε-optimal in the one-player discounted game with discount factor ρδ, for all δ < δ_0. Let K_0 be such that the payoff accumulated after stage K_0 in the game with discount factor ρδ is smaller than ε/2. Finally, take the N_0 given by Theorem 2.4 for the game with n = K_0 stages and error ε/2.
To conclude this section, let us provide an example of a possible application of our model.

Example 3.
As an example of application, let us revisit the example of the online booking industry (Example 1). We consider again the state space as the reputation of the firm, but restrict the action set to the offers the firm can post online. By monitoring each other's actions, firms can frequently update their offers and promotions (perhaps with the help of automated software) in order to adjust their reputation levels.
4. Concluding remarks. An interesting feature of mean field game models from the point of view of applications is the simplification they entail: in equilibrium, each player has at his disposal an extremely simple strategy that depends only on his current state, and he does not need to keep track of the other players, provided the number of players is large enough. This is because the aggregate state-action pair of the other players is regarded as a parameter, which deviates from the actual realization of the aggregate state-action pair only with very small probability.
However, this nice feature is also its curse. One problem is that the mean field equilibrium need not be unique. If there is a coordinator of the game that informs the players which mean field equilibrium should be played, there is no problem; in applications, this will typically not be the case. One way around this would be to provide the players with an adaptation mechanism. To explain this point, let us revisit the example of the driving game. Example 4. Consider the driving game of Example 2 with N players. The only equilibria that do not depend on N are 'everyone on the left' and 'everyone on the right'. Consider the following adaptation mechanism: each player chooses left or right with probability 1/2 at the first stage. At the second stage, observing the realizations of the first stage, each player looks at everyone's choices (and recalls his own) and imitates the choice of the majority. Thus, from stage three on, the players will be on an equilibrium path if N is odd. If N is even, there is positive probability that neither of the equilibria is reached.
A proper study of adaptation mechanisms for mean field games in the general case is clearly an interesting direction for future research.

Acknowledgments. Part of this work was carried out during a very pleasant stay at the University of Padova. The author is also grateful to an anonymous referee for useful comments.
Appendix: An approximation Lemma. Let S denote a finite set. We identify the set S with the canonical basis of R |S| . Denote by M the subset of R |S|×|S| consisting of transition matrices for Markov chains over S.
Let P : ∆(S) → M denote a Lipschitz continuous function with respect to the L^1 norm, with Lipschitz constant L_P. Since S is finite, the total variation distance on ∆(S), defined as d_TV(µ, ν) := max_{A ⊂ S} |µ(A) − ν(A)|, is related to the L^1 distance by d_TV(µ, ν) = (1/2) ‖µ − ν‖_1, so that an L_P-Lipschitz function in the L^1 norm is 2L_P-Lipschitz in the total variation distance.
Let T > 1 be an integer, representing the number of stages. For i = 1, . . . , N and k = 0, 1, . . . , T − 1, define the coupled dynamics

X_{k+1}^i ∼ P(m_k)(X_k^i, ·)  and  X_{k+1,N}^i ∼ P(m_{k,N})(X_{k,N}^i, ·),

where X_0^i = X_{0,N}^i is a random variable with law m_0 and the X_{0,N}^i are sampled i.i.d. with law m_0. Here, m_k denotes the law of X_k^i and m_{k,N} := (1/N) Σ_{i=1}^N X_{k,N}^i is the empirical distribution of the interacting system (recall that we identify S with the canonical basis of R^{|S|}). Observe that m_{k+1} = m_k P(m_k). Before we proceed to the approximation lemma, let us introduce ξ_{k,N} := (1/N) Σ_{i=1}^N X_k^i, the empirical distribution of the independent copies. Observe that m_k = E ξ_{k,N}. We have the following.

Proposition 7. The following estimate holds:

E ‖ξ_{k,N} − m_k‖_1 ≤ |S| / (2√N).

Proof. For every s ∈ S, the random variable ξ_{k,N}(s) is the average of N independent Bernoulli variables with parameter m_k(s). Hence, by the definition of the variance and Jensen's inequality,

E |ξ_{k,N}(s) − m_k(s)| ≤ ( m_k(s)(1 − m_k(s)) / N )^{1/2} ≤ 1/(2√N),

and summing over s ∈ S yields the result.

Let F_k denote the filtration generated by the observed history up to stage k. We are ready to prove the following approximation lemma.

Lemma 4.2. There exists a constant C, depending only on L_P, ‖P‖_∞, |S| and T, such that, for every i,

E max_{k ≤ T} ‖X_k^i − X_{k,N}^i‖_∞ ≤ C/√N.

Proof. Let D_k^i := E[ max_{s ≤ k} ‖X_{s+1}^i − X_{s+1,N}^i‖_∞ | F_k ] and D_k := max_{1≤i≤N} D_k^i. Now observe that, for any i and any stage ℓ ≤ k,

D_{ℓ+1} ≤ L_P E ‖m_ℓ − ξ_{ℓ,N}‖_1 + (‖P‖_∞ + L_P) D_ℓ.
From here the lemma follows by induction over ℓ. The first step in the recursion follows from the triangle inequality, comparing the transition kernels P(m_ℓ) and P(m_{ℓ,N}) through the intermediate kernel P(ξ_{ℓ,N}): the Lipschitz continuity of P controls the first comparison via Proposition 7, while the second comparison is controlled by D_ℓ. Combining these two inequalities and iterating over ℓ = 0, . . . , T − 1 yields the result.
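The concentration estimate of Proposition 7 is easy to check by Monte Carlo simulation. A small sketch with an illustrative law m on a four-point set; the simulated mean L^1 deviation stays below the bound |S|/(2√N):

```python
import numpy as np

rng = np.random.default_rng(5)
m = np.array([0.4, 0.3, 0.2, 0.1])        # an illustrative law on |S| = 4 points

def mean_l1_deviation(N, trials=2000):
    # ξ_N: empirical distribution of N i.i.d. samples from m;
    # return the Monte Carlo estimate of E‖ξ_N − m‖_1
    xi = rng.multinomial(N, m, size=trials) / N
    return np.abs(xi - m).sum(axis=1).mean()

for N in (100, 400, 1600):
    # Proposition 7: E‖ξ_N − m‖_1 ≤ |S| / (2√N)
    assert mean_l1_deviation(N) <= len(m) / (2.0 * np.sqrt(N))
```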