Control systems of interacting objects modeled as a game against nature under a mean field approach

This paper deals with discrete-time stochastic systems composed of a large number N of interacting objects (a.k.a. agents or particles). There is a central controller whose decisions, at each stage, affect the system behavior. Each object evolves randomly among a finite set of classes, according to a transition law which depends on an unknown parameter. Such a parameter is possibly non-observable and may change from stage to stage. Due to the lack of information and to the large number of agents, the control problem under study is rewritten as a game against nature according to mean field theory; that is, we introduce a game model associated to the proportions of the objects in each class, whereas the values of the unknown parameter are now considered as "actions" selected by an opponent to the controller (the nature). Then, letting N → ∞ (the mean field limit) and considering a discounted optimality criterion, the objective for the controller is to minimize the maximum cost, where the maximum is taken over all possible strategies of the nature.


(Communicated by Onésimo Hernández-Lerma)
1. Introduction. This paper studies discrete-time stochastic systems composed of a large number N of interacting objects whose evolution is affected both by the decisions of a central controller and by an unknown, possibly non-observable parameter that may change from stage to stage. Specifically, if X^N_n(t) ∈ S denotes the class of object n at time t, a_t the control or action selected by some central controller, and µ_t the unknown parameter at time t, then the conditional distribution of X^N_n(t+1) is given by

K_ij(a, µ) = Pr[X^N_n(t+1) = j | X^N_n(t) = i, a_t = a, µ_t = µ],  i, j ∈ S, n = 1, ..., N.

At each stage, a cost resulting from the movement of the objects among the classes and the selected control is incurred. In this sense, the controller aims to select actions that minimize a discounted optimality criterion of the form (9).
Typically, systems depending on an unknown parameter, which in turn comes from a random process, are analyzed by means of statistical estimation procedures, provided, of course, that such a parameter, or the random process, is observable. However, because of the lack of observability of the parameter µ, such an approach is not applicable to our model. Instead, we assume that the only information about the parameter µ_t is that, at each stage, it belongs to the set Γ. In this scenario, the system of interacting objects can be modeled as a game against nature, also known as a minimax control model, which consists in assuming that the controller has an opponent, the "nature", who, at each stage, selects the parameter µ_t from the set Γ. Under this new approach, the goal of the controller is to minimize the maximum cost, where the maximum is taken over all possible strategies of the nature, as specified in (11).
When dealing with a large number of objects (N ∼ ∞), it is not possible to compute optimal policies directly (at least from a practical point of view) since, for example, the corresponding dynamic programming equation contains terms depending on an N-dimensional integral. To overcome this obstacle, we appeal to mean field theory; that is, instead of analyzing each single object, which would be almost impossible, we focus on the proportion of objects occupying each class. This allows us to define, as a first step, a game against nature model M^N whose states are the vectors of proportions. By letting N → ∞ and applying an appropriate law of large numbers, a game against nature M independent of N is obtained, whose states are precisely the probability measures obtained as limits of the proportions. The model M is referred to as the mean field game against nature. Under this scheme, we can establish solutions for the game model M and obtain a "worst case" optimal control policy π*. Our objective is then to analyze the minimax deviation of π* when it is used to control the original process defined on M^N. Specifically, we aim to show that π* is a worst-case optimal control policy in an asymptotic sense as N → ∞.
Our approach also applies to other important models. For instance, consider the case where S ⊂ R and the process X^N_n(t) evolves according to a difference equation of the form

X^N_n(t+1) = F(X^N_n(t), a_t, ξ^n_t),  t = 0, 1, ...,  n = 1, 2, ..., N,   (1)

where {ξ^n_t} is a family of possibly non-observable random variables taking values in a set Z, identically distributed within each stage with common unknown distribution µ_t; that is, ξ^1_t, ξ^2_t, ..., ξ^N_t all have distribution µ_t, t ∈ N_0. Considering the distribution µ_t as the unknown parameter and Γ as a set of probability measures on Z, the transition law takes the form

K_ij(a, µ) = µ({z ∈ Z : F(i, a, z) = j}),  i, j ∈ S.   (2)

In other words, at each stage t, the movements of the objects n = 1, 2, ..., N are affected by the random variables ξ^1_t, ξ^2_t, ..., ξ^N_t, respectively, which are not necessarily equal but share the same distribution µ_t.
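As a small illustrative aside (not part of the paper's development), when both S and Z are finite the transition law (2) can be computed by direct enumeration: the entry K_ij(a, µ) is the total µ-mass of the noise values z driving class i to class j. The following Python sketch uses hypothetical stand-ins F, S, Z, and mu for the dynamics, class set, noise set, and noise distribution.

```python
import numpy as np

def kernel_from_dynamics(F, S, a, Z, mu):
    """Transition matrix K(a, mu) as in equation (2), for finite S and Z:
    K_ij = mu({z in Z : F(i, a, z) = j}).
    F(i, a, z) returns the next class; mu[z_idx] is the probability of Z[z_idx]."""
    K = np.zeros((len(S), len(S)))
    for i_idx, i in enumerate(S):
        for z_idx, z in enumerate(Z):
            j = F(i, a, z)               # deterministic next class given noise z
            K[i_idx, S.index(j)] += mu[z_idx]
    return K
```

For instance, if the next class is determined by the noise alone, every row of K equals the distribution µ, reflecting that the objects move independently of their current class.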
Within the context of (1)-(2), the specific case where {ξ^n_t} is a family of observable, independent and identically distributed random variables with common unknown distribution µ, i.e., µ_t = µ for all t ≥ 0, was previously analyzed in [13]. In particular, it was assumed there that the distribution µ has an unknown density ρ, so it was possible to combine statistical density estimation schemes for ρ with control procedures to prove the existence of nearly optimal policies.
To the best of our knowledge, our approach has not been previously studied in the literature, and its novelty lies in combining the two aforementioned techniques. Furthermore, this work can be considered an extension of [13] in the sense that our results are applicable to difference-equation models such as (1)-(2), but assuming that the driving process {ξ^n_t} is non-observable with unknown distribution. The paper is organized as follows. In Section 2 we describe the system of N objects under study and introduce the corresponding game against nature M^N. Next, in Section 3, we present the game against nature M associated to the mean field scenario, as well as the key convergence result. Furthermore, we state our main result, which establishes the existence of asymptotically minimax policies: policies obtained from the model M that are, in turn, approximately optimal for the model M^N. Section 4 contains two examples illustrating the elements of the proposed model. Finally, the proof of the main result is given in Section 5.
Notation and terminology. As usual, N (respectively, N_0) stands for the set of positive (resp., nonnegative) integers. Similarly, R (resp., R_+) denotes the set of real (resp., nonnegative real) numbers.
On the other hand, given a Borel space Z (that is, a Borel subset of a complete and separable metric space), its Borel σ-algebra is denoted by B(Z), and the attribute "measurable" will be applied to either Borel measurable sets or Borel measurable functions.
Let P(Z) be the set of probability measures on Z. If Z is finite, e.g., Z = {1, 2, ..., z}, we identify any p ∈ P(Z) with the vector p := (p(1), p(2), ..., p(z)), where p(i) ≥ 0, i ∈ Z, and ∑_{i=1}^{z} p(i) = 1. As usual, |·| denotes the norm on R. For Z finite, we endow P(Z) with the corresponding L∞ norm; that is, for each vector p ∈ P(Z),

‖p‖ := max_{i ∈ Z} |p(i)|.

2. Model description. We consider a discrete-time controlled stochastic system of N (N ∼ ∞) interacting objects, where X^N_n(t), n = 1, 2, ..., N, t ∈ N_0, represents the class of object n at time t, taking values in a given finite set S = {c_1, c_2, ..., c_s} := {1, 2, ..., s}, named the class set. At each stage, a central controller selects an action or control a_t from a given Borel set A, which produces a random movement of the objects from one class to another. Additionally, such a movement is influenced by a parameter which is unknown, possibly non-observable by the controller, and which can change from stage to stage. Hence, we assume that the objects evolve according to a transition law

K_ij(a, µ) := Pr[X^N_n(t+1) = j | X^N_n(t) = i, a_t = a, µ_t = µ],  i, j ∈ S,   (3)

where µ_t represents the parameter at time t ∈ N_0. We assume that µ_t belongs to a compact set Γ, the so-called parameter set. The relation (3) defines a stochastic kernel represented in matrix notation by

K(a, µ) := [K_ij(a, µ)]_{i,j ∈ S}.   (4)

In the present analysis, it is assumed that the objects are observable only through the class in which they are located, so that the controller can only determine the number of objects in each class i ∈ S. In this sense, the behavior of the system can be reformulated by means of the proportions of objects in each class. Indeed, let

M^N_i(t) := (1/N) ∑_{n=1}^{N} I{X^N_n(t) = i}

be the proportion of objects in class i ∈ S at time t, and denote by M^N(t) the vector whose components are the M^N_i(t)'s:

M^N(t) := (M^N_1(t), ..., M^N_s(t)) ∈ P_N(S),

where P_N(S) := {m ∈ P(S) : N m(i) ∈ N_0 for all i ∈ S}. There are several ways to represent the dynamics of the proportions M^N(t) by means of stochastic difference equations.
Some of them depend on the specific control problem under study, while others are obtained according to a simulation process. That is, there exists a function G^N such that

M^N(t+1) = G^N(M^N(t), a_t, µ_t, w_t),  t ∈ N_0,   (5)

where {w_t} is a sequence of i.i.d. random vectors on R^N with common distribution θ.
In our case, we adopt the representation obtained by a particular simulation process, borrowed from [9], which has suitable convergence properties. Indeed, for each t ∈ N_0, let {w^k_n(t) : n ∈ {1, ..., N}, k ∈ S} be a family of i.i.d. uniform random variables on [0, 1], and denote by w_t the random vector collecting them. The dynamics of the process M^N(t) are obtained as follows. We first define the sets

∆^µ_ij(a) := [Ψ^µ_{i,j−1}(a), Ψ^µ_ij(a)),  i, j ∈ S,   (6)

where

Ψ^µ_ij(a) := ∑_{l=1}^{j} K_il(a, µ),  with Ψ^µ_i0(a) := 0.   (7)

It is worth noting that, for each i ∈ S, the sets ∆^µ_ij(a), j ∈ S, form a partition of [0, 1), and Pr[w^i_n(t) ∈ ∆^µ_ij(a)] = K_ij(a, µ). Then, the function G^N in (5) takes the form

G^N_j(m, a, µ, w) := (1/N) ∑_{i∈S} ∑_{n=1}^{N m(i)} I{w^i_n ∈ ∆^µ_ij(a)},  j ∈ S.   (8)

The dynamics (5)-(8) also provide the convergence of M^N(·) to other dynamics associated to the mean field model (see Section 3). Now, let us assume that the one-stage cost depends on the configuration of the system M^N(t) and on the action selected by the controller. This one-stage cost can then be represented by a measurable function r : P(S) × A → R. With these ingredients, the optimal control problem for the controller reduces to finding a control policy directed to minimize his/her cost flow {r(M^N(t), a_t)}, M^N(t) ∈ P_N(S), a_t ∈ A, over an infinite horizon, using a discounted cost criterion of the form

E[ ∑_{t=0}^{∞} α^t r(M^N(t), a_t) ],   (9)

subject to (5). However, notice that at each stage the only information available to the controller about the parameter µ_t is that it belongs to the set Γ. This means that the problem must be addressed from an alternative approach, which is introduced below.
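The simulation construction above can be sketched numerically as follows. This Python fragment is an illustration only (0-based classes, K standing for the matrix K(a, µ) at a fixed pair (a, µ)), not the paper's implementation: each object in class i draws an independent uniform variable and jumps to the class whose cumulative interval contains it.

```python
import numpy as np

def step_proportions(m, K, N, rng):
    """One transition of the proportions process: each of the N*m[i] objects
    in class i draws w ~ Uniform[0,1] and moves to the class j whose interval
    [Psi_{i,j-1}, Psi_{ij}) contains w, where Psi_{ij} is the cumulative sum
    of row i of K."""
    s = len(m)
    counts = np.rint(N * np.asarray(m)).astype(int)   # objects per class
    new_counts = np.zeros(s, dtype=int)
    for i in range(s):
        psi = np.cumsum(K[i])                  # Psi_{i1}, ..., Psi_{is}
        w = rng.random(counts[i])              # one uniform per object in class i
        # first index with psi > w gives the destination class (clipped
        # to guard against floating-point rounding in the cumulative sums)
        dest = np.minimum(np.searchsorted(psi, w, side='right'), s - 1)
        new_counts += np.bincount(dest, minlength=s)
    return new_counts / N
```

Note that the total number of objects is conserved, so the output is again a vector in P_N(S).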
Games against nature formulation. Roughly speaking, in the games-against-nature formulation it is assumed that the controller has an opponent (the nature) who selects the parameter µ_t at each stage, so the goal of the controller is to minimize the maximum cost induced by the nature. More precisely, we define the discrete-time game against nature associated to the system of N interacting objects (N-GAN for short) as the model

M^N := (P_N(S), A, Γ, Q, r).   (10)

The evolution of the system associated to the model M^N can be interpreted as follows. At time t, the controller observes the system configuration via the state m = M^N(t) ∈ P_N(S), representing the proportions of the objects. Then he/she selects an action a = a_t ∈ A, whereas the opponent (the nature) picks a parameter µ = µ_t ∈ Γ. As a consequence, the following happens: (1) a cost r(m, a) is incurred; and (2) the system moves to a new state M^N(t+1) according to the transition law

Q(B | m, a, µ) := Pr[G^N(m, a, µ, w_t) ∈ B],  B ∈ B(P_N(S)),

with G^N as in (5). Once the transition to the state M^N(t+1) occurs, the process is repeated.
Observe that the changes in the system configuration governed by the transition law Q rely strongly on the parameters {µ_t} selected by the opponent. In this sense, the objective of the controller is to find a policy that minimizes the maximum total expected discounted cost introduced in (11) below.
In order to ensure the existence of minimizers, we impose the following continuity and compactness conditions on some elements of the game model.
Assumption 2.1. (a) The action space A is a compact Borel space, whose metric is denoted by ρ_A. (b) The parameter set Γ is compact. (c) For all i, j ∈ S and µ ∈ Γ, the mapping a → K_ij(a, µ), defined in (3), is continuous. (d) The one-stage cost r is bounded and uniformly Lipschitz in m with constant L_r; that is, for some constant R > 0, and for every a ∈ A and m, m′ ∈ P(S),

|r(m, a)| ≤ R  and  |r(m, a) − r(m′, a)| ≤ L_r ‖m − m′‖,

and the mapping a → r(m, a) is continuous.

A first consequence of Assumption 2.1 is the following result, whose proof is provided in Section 5.

Lemma 2.2. Under Assumption 2.1, for each m ∈ P_N(S), µ ∈ Γ, and w, the mapping a → G^N(m, a, µ, w) is continuous.

For each t ∈ N_0, let H^N_t := (P_N(S) × A)^t × P_N(S) be the space of histories up to time t, with elements of the form h^N_t = (m_0, a_0, ..., m_{t−1}, a_{t−1}, m_t). An admissible control policy is a sequence π^N = {π^N_t} of stochastic kernels π^N_t on A given H^N_t such that π^N_t(A | h^N_t) = 1 for all h^N_t ∈ H^N_t, t ∈ N_0. We denote by Π^N the set of all admissible control policies for the controller.
Let F be the set consisting of all measurable functions f : P(S) → A, and let F^N := F|_{P_N(S)} be the restriction of F to P_N(S). A policy π^N ∈ Π^N is said to be a (deterministic) Markov policy if there exists a sequence {f^N_t} ⊆ F^N such that, for all t ∈ N_0 and h^N_t ∈ H^N_t,

π^N_t(· | h^N_t) = δ_{f^N_t(m_t)}(·),

where δ_a(·) is the Dirac measure concentrated at the point a. In this case, π^N takes the form π^N = {f^N_t}.
In particular, if f^N_t ≡ f^N for some f^N ∈ F^N and all t ∈ N_0, we say that π^N is a stationary policy. We denote by Π^N_M and F^N the set of all Markov policies and the set of stationary policies for the controller, respectively. We also denote by Π_M the set of deterministic Markov policies for the controller when F is used instead of F^N; that is, Π_M is the family of sequences of functions {f_t} ⊆ F. Observe that any policy π = {f_t} ∈ Π_M whose elements f_t are restricted to P_N(S) turns out to be an element of Π^N; hence we can conclude that F^N ⊆ Π^N_M ⊆ Π^N. On the other hand, we assume that the nature's actions depend only on the time parameter; in other words, a control policy for the opponent is a sequence π̃ = {µ_t}, µ_t ∈ Γ, t ∈ N_0. We denote by Π_Γ the set of policies for the opponent.
For each pair of policies (π^N, π̃) ∈ Π^N × Π_Γ and initial state M^N(0) = m ∈ P_N(S), we define the total expected discounted cost as

V_N(m, π^N, π̃) := E^{π^N, π̃}_m [ ∑_{t=0}^{∞} α^t r(M^N(t), a_t) ],   (11)

where α ∈ (0, 1) is the so-called discount factor and E^{π^N, π̃}_m denotes the expectation operator with respect to the probability measure P^{π^N, π̃}_m induced by (π^N, π̃) ∈ Π^N × Π_Γ given M^N(0) = m (see, e.g., [8] for a construction of P^{π^N, π̃}_m). Hence, the aim of the controller is to find a policy π^{N*} ∈ Π^N such that

V^*_N(m) := inf_{π^N ∈ Π^N} sup_{π̃ ∈ Π_Γ} V_N(m, π^N, π̃) = sup_{π̃ ∈ Π_Γ} V_N(m, π^{N*}, π̃),  m ∈ P_N(S).   (12)

In this case, π^{N*} is said to be a minimax policy, whereas V^*_N is the N-value function. According to the description previously presented, and taking into account the continuity of the mapping a → G^N(m, a, µ, w) stated in Lemma 2.2, we can apply the methodology of minimax Markov control theory to characterize minimax policies as follows (see, e.g., [12, 22]).
For u ∈ B(P(S)) and (m, a, µ) ∈ P(S) × A × Γ, we define

H(u; m, a, µ) := r(m, a) + α ∫ u(G^N(m, a, µ, w)) θ(dw),   (13)

and the minimax operator

T u(m) := min_{a ∈ A} max_{µ ∈ Γ} H(u; m, a, µ),  m ∈ P(S).   (14)

Remark 1. Taking into account Lemma 2.2, under Assumption 2.1 we can apply arguments similar to those in [12] to prove that the operator T satisfies the following properties: (a) T is a contraction operator with modulus α; (b) T maps the space C_b(P_N(S)) of bounded continuous functions on P_N(S) into itself.

Theorem 2.3. Suppose that Assumption 2.1 holds. Then: (a) The N-value function V^*_N defined in (12) is the unique solution in C_b(P_N(S)) satisfying

V^*_N = T V^*_N,   (15)

where T is the operator in (14); that is,

V^*_N(m) = min_{a ∈ A} max_{µ ∈ Γ} [ r(m, a) + α ∫ V^*_N(G^N(m, a, µ, w)) θ(dw) ],  m ∈ P_N(S),   (16)

or, equivalently,

V^*_N(m) = min_{a ∈ A} Φ^N(m, a),   (17)

where, for each m ∈ P_N(S) and a ∈ A,

Φ^N(m, a) := max_{µ ∈ Γ} [ r(m, a) + α ∫ V^*_N(G^N(m, a, µ, w)) θ(dw) ].   (18)

(b) There exists f^N_* ∈ F^N such that, for each m ∈ P_N(S), f^N_*(m) attains the minimum in (17), and the stationary policy π^{N*} = {f^N_*} is a minimax policy for the controller; that is, π^{N*} satisfies (12).

Although Theorem 2.3 provides a way of finding minimax policies, from a practical point of view it is not very useful because N is too large (N ∼ ∞). To mention just one difficulty, observe that equations (15)-(16) depend heavily on the N-dimensional integral ∫ V^*_N(G^N(m, a, µ, w)) θ(dw), whose analysis is prohibitive for large N. To overcome this difficulty, we introduce a new game against nature, denoted by M and called the mean field game against nature, which becomes the limit of the GAN model M^N as N → ∞ and, of course, is independent of N. Thus, our challenge is to show that the resulting minimax policy on M is minimax in the original model M^N in an asymptotic sense as N → ∞; in other words, M will be used as an approximating model for M^N.
3. Mean field game against nature. Consider the function G : P(S) × A × Γ → P(S) defined as G(m, a, µ) := mK(a, µ), where K is the transition kernel (4). Let {m(t)} ⊂ P(S) be a controlled process defined by the deterministic difference equation

m(t+1) = G(m(t), a_t, µ_t),  t ∈ N_0,   (19)

with initial condition m(0) = m ∈ P(S). As before, a_t ∈ A represents the control selected and µ_t stands for the unknown parameter at time t.
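Since (19) is deterministic, a mean field trajectory is obtained by plain vector-matrix iteration. The following is a minimal sketch under illustrative assumptions (finite horizon, hypothetical finite sequences {a_t} and {µ_t}, and K_of(a, µ) standing for the kernel K(a, µ)):

```python
import numpy as np

def mean_field_trajectory(m0, K_of, actions, params):
    """Iterate the deterministic mean field dynamics m(t+1) = m(t) K(a_t, mu_t).
    K_of(a, mu) returns the s x s kernel matrix; actions and params are the
    sequences {a_t}, {mu_t} chosen by the controller and the nature."""
    m = np.asarray(m0, dtype=float)
    traj = [m.copy()]
    for a, mu in zip(actions, params):
        m = m @ K_of(a, mu)          # row vector times stochastic matrix
        traj.append(m.copy())
    return np.array(traj)
```

Each iterate stays in P(S) because K(a, µ) is a row-stochastic matrix.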
As established in (23) below, {m(t)} is the limit process of {M^N(t)} as N → ∞, and it plays the role of the mean field process. Hence, using the same one-stage cost r defined for the N-GAN (10), we can define the mean field game against nature (MFGAN-M for short) as M = (P(S), A, Γ, G, r).
Considering the deterministic nature of the process (19), it is clear that the dynamics are completely determined by the sequences {a_t} ⊂ A and {µ_t}, and by the initial condition m ∈ P(S). Furthermore, it is also well recognized that a control policy π for deterministic control systems is a sequence of decision rules π = {f_t} ⊆ F (see [4]). Therefore, in terms of the notation given in the definition of control policies, we can consider Π_M as the set of all control policies for the controller in the model M. In this way, for each (π, π̃) ∈ Π_M × Π_Γ and initial condition m(0) = m ∈ P(S), we define the total discounted cost for the mean field model as

v(m, π, π̃) := ∑_{t=0}^{∞} α^t r(m(t), a_t).   (20)

Thus, the mean field game against nature consists in finding a policy π* ∈ Π_M such that

v*(m) := inf_{π ∈ Π_M} sup_{π̃ ∈ Π_Γ} v(m, π, π̃) = sup_{π̃ ∈ Π_Γ} v(m, π*, π̃),  m ∈ P(S),   (21)

where v* is the mean field value function and π* is said to be a mean field-minimax policy for the MFGAN-M. Now, similarly as in Theorem 2.3, we can apply arguments from minimax theory (see, e.g., [12, 22]) to get the following results.
Theorem 3.1. Suppose that Assumption 2.1 holds. Then: (a) The mean field value function v* is the unique solution in C_b(P(S)) of the equation

v*(m) = min_{a ∈ A} max_{µ ∈ Γ} [ r(m, a) + α v*(G(m, a, µ)) ],  m ∈ P(S).   (22)

(b) There exists a control policy f* ∈ F minimizing the right-hand side of (22); and furthermore π* = {f*} ∈ Π_M becomes a mean field-minimax policy; that is, π* satisfies (21).

Observe that the existence of the minimax policy π* does not depend on N.
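Because the fixed-point equation (22) involves a contraction (cf. Remark 1), v* can be approximated by value iteration. The sketch below discretizes P(S) on a grid and takes A and Γ finite with nearest-neighbor evaluation off the grid; these are illustrative simplifications (the paper allows compact A and Γ), and all function names are stand-ins.

```python
import itertools
import numpy as np

def mean_field_value_iteration(r, K_of, A, Gamma, alpha, s, grid_n=10, iters=500):
    """Value iteration for v(m) = min_a max_mu [ r(m,a) + alpha * v(m K(a,mu)) ].
    r(m, a): one-stage cost; K_of(a, mu): s x s kernel; A, Gamma: finite sets.
    v is stored on a grid of the simplex P(S)."""
    # grid: all probability vectors with components that are multiples of 1/grid_n
    pts = np.array([np.array(c) / grid_n
                    for c in itertools.product(range(grid_n + 1), repeat=s)
                    if sum(c) == grid_n])
    v = np.zeros(len(pts))

    def v_at(m):                       # nearest grid point in sup-norm
        return v[np.argmin(np.abs(pts - m).max(axis=1))]

    for _ in range(iters):
        v_new = np.array([
            min(max(r(m, a) + alpha * v_at(m @ K_of(a, mu)) for mu in Gamma)
                for a in A)
            for m in pts])
        if np.max(np.abs(v_new - v)) < 1e-10:   # contraction: stop at fixed point
            v = v_new
            break
        v = v_new
    return pts, v
```

For instance, with a single action, a single parameter, the identity kernel, and cost r(m, a) = m(1), the fixed point is v*(m) = m(1)/(1 − α), which the iteration recovers on the grid.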
We have now arrived at our main result, concerning the existence of asymptotically minimax policies.
Theorem 3.2. Suppose that Assumption 2.1 holds, and let π* = {f*} ∈ Π_M be the mean field-minimax policy given by Theorem 3.1. Then, for each m ∈ P_N(S),

Φ^N(m, f*(m)) − min_{a ∈ A} Φ^N(m, a) → 0  as N → ∞,

where Φ^N is the function defined in (18).
Theorem 3.2 establishes that when the controller uses the policy π* to control the original process (5), corresponding to the model M^N, the "minimax deviation" vanishes as N → ∞; that is, for N sufficiently large, π* is an approximately minimax policy for the N-GAN M^N.

4. Examples. In this section we describe some examples in order to identify the elements of our model.

4.1. An investment system. We consider the following investment system, composed of N "small" investors whose actions have no economic impact on the market prices. Each investor is ranked in one of s wealth classes according to his/her capital; we denote by S := {1, 2, ..., s} the set of such classes. Assume that there is a central controller (a government or public body) who provides a subsidy or imposes a fee a ∈ A on the investors. In addition, at each time t ∈ N_0, each investor is able to invest his/her money in different risky and risk-free assets, which, depending on the controller's decision, may cause the investor's class to change at the next stage. We suppose that A = [−a*, a*] for some a* ≥ 0, with the following interpretation: at time t, a_t represents a fee of size −a_t (if a_t < 0) or a subsidy of size a_t (if a_t > 0). Let ξ^n_t be a random variable with values in a set Z, representing the return rate of the risky asset for investor n at time t, and assume that ξ^n_t has distribution γ_{µ_t}, independent of n; that is,

Pr[ξ^n_t ∈ B] = γ_{µ_t}(B),  ∀ n = 1, 2, ..., N,  B ∈ B(Z),

where {µ_t} ⊂ Γ is a sequence of unknown parameters. Clearly, the investors evolve among the classes according to a transition law determined by the distributions γ_{µ_t}. In this case, we suppose that there exists a function Ψ : S² × A × P(Z) → [0, 1] such that K_ij(a, µ) = Ψ(i, j, a, γ_µ) and a → Ψ(i, j, a, γ_µ) is continuous. Finally, the one-stage cost can be taken as an arbitrary measurable function satisfying Assumption 2.1(d).

4.2. Financial strategy of a company. A company that sells a certain product of brand L is interested in applying marketing techniques to a group of N families in order to improve its income. Each family has its own perception of the quality of the product, determined by several factors such as price, relative quality, appearance, and presentation, among others. Of course, this perception is difficult for the company to measure, which makes it an uncertain quantity. Instead, the company can only observe and accurately determine which of three given classes each family belongs to. Let ξ^n_t be the random variable representing the perception of family n about brand L at time t. Thus, it is reasonable to assume that ξ^n_t has a distribution depending on a parameter µ that is unknown and non-observable by the company.
The company wants to implement new policies to induce a greater number of families to buy its brand L; i.e., families that do not buy its product should start doing so, while families who regularly buy the product keep buying it. For this, the company has a set of marketing strategies A = {a_1, ..., a_k}. Let X^N_n(t) ∈ S be the class of family n at time t. Then the transitions among classes are modeled by a transition law K_ij(a, µ), i, j ∈ S, a ∈ A, µ ∈ Γ, which depends on the parameter µ related to the families' perception of the product quality.
Finally, we can assume a one-stage reward of the form r(m, a), defined in terms of m_2, the proportion of products of brand L sold by the company, and a suitable utility function u : A → R satisfying Assumption 2.1(d).

5. Proofs.

Proof of Lemma 2.2.
By the compactness of the set A, let {a_k} ⊂ A be a sequence such that a_k → a* ∈ A. Note that Assumption 2.1(c) yields that the mapping a → Ψ^µ_ij(a) (with Ψ^µ_ij(·) defined as in (7)) is continuous for all i, j ∈ S, µ ∈ Γ. Hence, ∆^µ_ij(a_k) → ∆^µ_ij(a*) as k → ∞, in the set sense. Now, for i, j ∈ S, w ∈ [0, 1], µ ∈ Γ, and a ∈ A, let δ_w(∆^µ_ij(a)) be the Dirac measure corresponding to the indicator function I{∆^µ_ij(a)}(w). Then, since δ_w(·) is a probability measure (and therefore continuous under set convergence), we conclude that

δ_w(∆^µ_ij(a_k)) → δ_w(∆^µ_ij(a*))  as k → ∞.

Combining this with (6), we obtain the continuity of the function G^N defined in (8).

Proof of Theorem 3.2.
The proof of Theorem 3.2 will be a consequence of the following facts.