Approachability in Population Games

This paper reframes approachability theory within the context of population games. Thus, whilst one player aims at driving her average payoff to a predefined set, her opponent is not malevolent but rather extracted randomly from a population of individuals with given distribution on actions. First, convergence conditions are revisited based on the common prior on the population distribution, and we define the notion of \emph{1st-moment approachability}. Second, we develop a model of two coupled partial differential equations (PDEs) in the spirit of mean-field game theory: one describing the best-response of every player given the population distribution (this is a \emph{Hamilton-Jacobi-Bellman equation}), the other capturing the macroscopic evolution of average payoffs if every player plays its best response (this is an \emph{advection equation}). Third, we provide a detailed analysis of existence, nonuniqueness, and stability of equilibria (fixed points of the two PDEs). Fourth, we apply the model to regret-based dynamics, and use it to establish convergence to Bayesian equilibrium under incomplete information.


Introduction
We consider a game played by a large population of individuals in continuous time. At every time, each individual engages in play with a random opponent extracted from the population, and the resulting payoff, which depends on the actions of both players, is a vector. Such vector payoffs can be interpreted as deriving from a collection of noninterchangeable goods. Consider, for instance, a negotiation between an employer and a candidate employee over salary, career prospects, maximal number of days off and so forth. Formally, we can think of the completeness axiom being satisfied along each dimension of our vector but failing across them, giving a special case of Aumann's [4] framework. Indeed, vector payoffs may also be appropriate when the continuity axiom fails (see [12]). Alternatively, each player may be representative of a group of individuals whose preferences cannot be aggregated into a single ordering, so that the vector payoff has one component for each individual in the group. Finally, payoff vectors also arise naturally when considering a player's regret at not having made each possible deviation.

Main results. First, we provide a new model that combines approachability and population games. Given that the opponent is randomly extracted from the population, the approach by Blackwell, which looks at the worst-case payoff, may appear conservative. Thus, we relax Blackwell's conditions, assuming that the opponent is not malevolent but instead is simply extracted from a population with given distribution; we call this 1st-moment approachability. Second, we build upon the theory of mean-field games and adapt the concept of mean-field equilibrium to our evolutionary set-up; we call this self-confirmed equilibrium. Third, we discuss existence and nonuniqueness of the equilibrium.
Finally, we explore the regret interpretation of our model; whereas 1st-moment approachability of nonpositive regrets no longer implies Nash equilibrium (as in [20]), we show that nonpositive maximal regret does imply Bayesian equilibrium under incomplete information. Related literature. The theory of "approachability" dates back to Blackwell [10] and culminates in the well known Blackwell's Theorem. Approachability arises in several areas of game theory, such as allocation processes in coalitional games [28], regret minimization [30,20], adaptive learning [13,16,18,19], excludability and bounded recall [31], and weak approachability [39], just to name a few. For instance, in coalitional games one asks whether the core is an approachable set, and which allocation processes can drive the complaint vector to that set. In regret minimization, one considers the nonpositive orthant in the space of regrets; a player tries to adjust her strategy based on the current regret so as to make that set approachable by the regret vector. Once all players have nonpositive regret, the resulting outcome is an equilibrium for the game. This idea of adapting the new action to the current state of the game is common to adaptive learning and evolutionary games as well, but in regret-based dynamics the state is in payoff (rather than strategy) space. Evolution under incomplete information has been relatively little studied, with the notable exception of Ely and Sandholm [15,36], who analyse a best response dynamic with a subpopulation for each possible type; here, by contrast, we have a single population of agents with nonconstant types who adopt (type-dependent) Bayesian strategies through time.
Despite its discrete-time nature in the original Blackwell formulation, approachability has been extended to continuous-time repeated games, thus showing common elements with Lyapunov theory [20]. Though first formalized in finite-dimensional spaces, a definition of approachability in infinite-dimensional space has been provided by Lehrer [29]. Approachability can be reframed within differential games and as such can be studied using differential calculus and stability theory [33,37]. In particular, in [33] the authors show that, beyond being an extension (to a vector space) of the von Neumann minmax theorem [40], the approachability principle also has elements in common with differential inclusions [2]. In addition to this, [37] establishes connections with viability theory [1], set-valued analysis [3] (see, e.g., the comparison between an approachable set and a discriminating set) and set invariance theory [11]. The approachability principle is also behind the notion of excludability; along this line, some authors investigate which sets are approachable and which ones excludable under imperfect information (bounded recall, delayed and/or stochastic monitoring) [31]. Connected to approachability as well is the concept of "attainability." Attainability is a new notion developed in [9,32] in the context of 2-player continuous-time repeated games with vector payoffs. Attainability arises in many contexts, such as transportation, distribution and production networks. The main question is: "Under what conditions does a strategy for player 1 exist such that the cumulative payoff converges (in the lim sup sense) to a preassigned set (in the space of vector payoffs) independently of the strategy used by player 2?"

A second stream of literature we follow in the present study is the one on mean field games. This theory originated in the work of M. Y. Huang, P. E. Caines and R. Malhamé [23,21,22], and independently in that of J. M.
Lasry and P. L. Lions [25,26,27], where the now standard terminology of Mean Field Games (MFG) was introduced. Explicit solutions in terms of mean field equilibria are not common unless the problem has a linear-quadratic structure, see [8]. Mean field games have connections to evolutionary games (see for instance [24]) and large games [5]. Actually, both the anonymous game in [24] and the large game in [5] build upon the notion of mass interaction and can be seen as a stationary mean field.

This paper is organized as follows. In Section 2, we set up the problem. In Section 3, we provide our population game motivation for the problem at hand. In Section 4, we establish the main results of the paper. In Section 5, we apply the model to a regret-based setting, and show under incomplete information that nonpositive maximal regrets that are approachable in 1st moment must be Bayesian equilibria. Finally, in Section 6, we draw concluding remarks.

Notation. We view vectors as columns. For a vector $x$, we use $x_i$ to denote its $i$th coordinate component. Occasionally we may write $(x_i)_{i=1,\ldots,m}$ to denote an $m$-dimensional column vector. For two vectors $x$ and $y$, we use $x < y$ ($x \le y$) to denote $x_i < y_i$ ($x_i \le y_i$) for all coordinate indices $i$. We let $x^T$ denote the transpose of a vector $x$, and $\|x\|$ its Euclidean norm. We write $P(x)$ to denote the projection of a vector $x$ on a set $X$, and $\mathrm{dist}(x, X)$ for the distance from $x$ to $X$, i.e. $P(x) = \arg\min_{y \in X} \|x - y\|$ and $\mathrm{dist}(x, X) = \|x - P(x)\|$, respectively. We also denote by conv the convex hull of a given set of points. $\partial_x$ indicates the first partial derivative with respect to $x$.

The Model
With the above preamble in mind, the game at hand is a two-player repeated game with vector payoffs in continuous time. We assume that the players use nonanticipating behavior strategies with delay. This means that the behavior of a player may depend only on past play. In other words, the way a player plays during a given interval of time does not affect the way the opponent plays during that same interval, although it may affect the opponent's play in subsequent intervals.
Let $A = \{1, 2, \ldots, n\}$ be a discrete set, $a_i : [0, T] \to A$ a measurable function of time and $a_j : [0, T] \to A$ a random disturbance. Let $u : A \times A \to M$, where $M = \{M_{lk},\ l, k \in A\}$ and $M_{lk} \in \mathbb{R}^m$ (each entry $M_{lk}$ is an $m$-dimensional vector). Let $X := \mathrm{conv}\{M_{lk} \mid l, k \in A\}$, where conv denotes the convex hull, and consider the differential equation (1) in $X$, where the initial state $x_0$ is generated according to a distribution law $\rho_0(x)$. More specifically, consider a probability density function $\rho(x, t)$ representing the density of the players whose state is $x$ at time $t$, which satisfies $\int_X \rho(x, t)\,dx = 1$ for every $t$. Let us also define the mean state over players at time $t$ as $\bar{\rho}(t) := \int_X x\,\rho(x, t)\,dx$. We also have $\rho(x, 0) = \rho_0(x)$. The objective of a player is to approach a given target $y : [0, T] \to X$. Then consider a running cost $g : X \times X \to [0, +\infty[$, $(x, y) \mapsto g(x, y)$, of the quadratic form (2), where $Q > 0$ is symmetric. This cost describes the (weighted) square deviation of an individual's state from the target.
Also consider a terminal cost $\Psi$ as in (3), where $S > 0$. The problem in its generic form is then the following.
Problem 1 Let the initial state $x(0)$ be given with density $\rho_0$. Given a finite horizon $T > 0$, a suitable running cost $g : X \times X \to [0, +\infty[$, $(x, y) \mapsto g(x, y)$, as in (2), a terminal cost $\Psi$ as in (3), and a suitable dynamics for $x$ as in (1), solve the corresponding optimization problem over $a_i(\cdot) \in C$, where $C$ is the set of all measurable functions $a_i(\cdot)$ from $[0, +\infty[$ to $A_i$, and $\mathbb{E}u(\cdot)$ in (1) must be consistent with the evolution of the distribution $\rho(\cdot)$ if every player behaves optimally.

Motivation: Population Games
Consider a population game where, continuously in time, every individual is matched with an opponent randomly extracted from the population and the resulting payoff is a vector. The resulting game Γ is a two-player repeated game with vector payoffs in continuous time that every individual plays against a population with given (evolving) distribution over actions. Let A be the finite set of actions of every individual; then the instantaneous payoff is given by a function $u : A \times A \to \mathbb{R}^m$, where m is a natural number. We assume w.l.o.g. that payoffs are bounded and correspond to the elements of the discrete set $M = \{M_{lk},\ l, k \in A\}$. We extend u to the set of mixed-action pairs, ∆(A) × ∆(A), in a bilinear fashion. The one-shot vector-payoff game (A, A, u) is denoted by G and we will say that the game in continuous time Γ is based on G.
The game Γ is played over the time interval [0, ∞). We assume that the players use markovian strategies, i.e., strategies that depend only on the current state $x \in X$, where $X := \mathrm{conv}\{M_{lk} \mid l, k \in A\}$ and x is the average (over time) expected (over the opponent's play) payoff defined in (5). Differentiating (5) with respect to t, we obtain equation (1), in the same spirit as in Hart and Mas-Colell's paper [20] on continuous-time approachability. Then, Problem 1 analyzes the approachability of a given target in the space of vector payoffs on the part of a population of individuals.
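The differentiation step can be sketched as follows. The explicit integral form of the time-averaged expected payoff used here is an assumption (the usual time average in continuous-time approachability), not a quotation of (5), and the resulting ODE is only meant to illustrate the structure that equation (1) is expected to have:

```latex
x(t) = \frac{1}{t}\int_0^t \mathbb{E}\,u\bigl(a_i(\tau), a_j(\tau)\bigr)\,d\tau
\quad\Longrightarrow\quad
\dot{x}(t)
 = -\frac{1}{t^2}\int_0^t \mathbb{E}\,u\bigl(a_i(\tau), a_j(\tau)\bigr)\,d\tau
   + \frac{1}{t}\,\mathbb{E}\,u\bigl(a_i(t), a_j(t)\bigr)
 = \frac{1}{t}\Bigl(\mathbb{E}\,u\bigl(a_i(t), a_j(t)\bigr) - x(t)\Bigr).
```

Under this assumption the state is always pulled toward the current expected payoff, with a gain that decays like 1/t.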

Main results
This section outlines the main results of this paper. After introducing the expected value of the projected game, Theorem 1 establishes conditions for approachability in 1st-moment. Theorem 2 introduces the notion of self-confirmed equilibrium. Theorems 3 and 4 elaborate on existence and nonuniqueness respectively.

Expected value of the projected game
Given the above game, we wish to analyze convergence properties in the space of distributions of the cumulative or average payoff $x_i(t)$, in the spirit of approachability. We will make use of the notion of projected game, which we recall next. Let $\lambda \in \mathbb{R}^m$ and denote by $\langle \lambda, G \rangle$ the one-shot zero-sum game whose set of players and their actions are as in game G, and in which the payoff that player j pays to player i is $\lambda^T u(a_i(t), a_j(t))$ for every $(a_i(t), a_j(t)) \in A_i \times A_j$. Observe that, as a zero-sum one-shot game, the game $\langle \lambda, G \rangle$ has a value, val(λ), obtained through a min-max operation. Given the stochastic nature of $a_j(t)$, this min-max operation is not useful for our purposes. We therefore consider the expected value of the game (where the inner maximization is replaced by an expectation) and discuss approachability in expectation. In the light of this, using the bilinear structure of the utility function and assuming markovian strategies, the expected value can be rewritten in terms of the mixed action q(t) of the population; in the case of a state-dependent payoff, the expression is modified accordingly. Note that here we use the notation $u(a_i(t), q(t))$ to mean $\mathbb{E}u(a_i(t), a_j(t))$.
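As an illustration, the expected value of the projected game can be computed by averaging the opponent's action over the population mix q and then optimizing over one's own pure actions. The sketch below assumes that the outer operation is a minimization over player i's actions; the names `expected_payoff` and `expected_value` and the payoff table `M` are hypothetical and chosen only for this example.

```python
from typing import List, Tuple

Vector = List[float]
PayoffTable = List[List[Vector]]  # M[l][k] is the vector payoff for the action pair (l, k)


def dot(a: Vector, b: Vector) -> float:
    """Euclidean inner product of two vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def expected_payoff(M: PayoffTable, l: int, q: List[float]) -> Vector:
    """u(l, q) = sum_k q_k * M[l][k]: payoff of action l against the population mix q."""
    m = len(M[l][0])
    return [sum(q[k] * M[l][k][d] for k in range(len(q))) for d in range(m)]


def expected_value(M: PayoffTable, lam: Vector, q: List[float]) -> Tuple[float, int]:
    """Return (min_l lam^T u(l, q), minimizing action l).

    Assumption: the inner maximization of the min-max value is replaced by
    the expectation over q, and player i minimizes over pure actions.
    """
    scores = [dot(lam, expected_payoff(M, l, q)) for l in range(len(M))]
    best = min(range(len(M)), key=scores.__getitem__)
    return scores[best], best


# Illustrative 2x2 game with payoffs in R^2 (made-up numbers).
M = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 0.0], [1.0, 1.0]],
]
value, action = expected_value(M, lam=[1.0, 0.0], q=[0.25, 0.75])
print(value, action)
```

Because the opponent's play enters only through q, the computation is linear in q and needs no linear program, unlike the worst-case value val(λ).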

Approachability in 1st-moment
Approachability theory was developed by Blackwell in 1956 [10] and is captured in the well known Blackwell's Theorem. We recall next the geometric (approachability) principle that lies behind Blackwell's Theorem.
To introduce the approachability principle, let Φ be a closed and convex set in R m and let P (x) be the projection of any point x ∈ R m (closest point to x in Φ).
Definition 1 (Approachable set) A closed and convex set Φ in $\mathbb{R}^m$ is approachable by player 1 if there exists a strategy for player 1 such that (9) holds true for every strategy of player 2.

The next result is the Blackwell approachability theorem.

Proposition 1 (Approachability principle [10,33]) A closed and convex set Φ in $\mathbb{R}^m$ is approachable by player 1 if for every x(t) there exists a strategy for player 1 such that (10) holds true for every strategy of player 2.

Note that condition (10) can be stated equivalently as condition (11). Now, if we assume that the opponent is committed to play a mixed strategy q ∈ ∆(A), condition (10) turns into (12), and the corresponding condition (11) can be rewritten accordingly.

Theorem 1 (Approachability in 1st-moment) Let q ∈ ∆(A) be given. The set of approachable targets is $T(q)$, the set of points of the form $y = \sum_{l,k \in A} p_l q_k M_{lk}$ with $p \in \Delta(A)$. Furthermore, there exists a partitioning $R_1, \ldots, R_n$ of X such that the approachable strategies are markovian and bang-bang, i.e., a player whose state x lies in $R_l$ plays action l.

Proof (sketch). (Sufficiency) Let $y \in T(q)$. Rewrite it as $y = \sum_{l,k \in A} p_l q_k M_{lk}$, where $p, q \in \Delta(A)$. Let us also take $\Phi = \{y(t)\}$.
Then for every $x \in X$, taking $\lambda = \frac{x - y}{\|x - y\|} \in \mathbb{R}^m$, the expected value of the projected game satisfies the approachability condition (12). (Necessity) Let $y \notin T(q)$. Then the above does not hold. Q.E.D.
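The sufficiency argument can be checked numerically with a discrete-time stand-in for the averaged dynamics. The sketch below is not the paper's construction: it assumes a fixed population mix q, a running average updated as x ← x + (u(l, q) − x)/(t + 1), and a bang-bang rule that plays the pure action minimizing λᵀu(l, q) with λ = x − y. The function names and the payoff table are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def mix(weights: List[float], points: List[Vector]) -> Vector:
    """Convex combination sum_i weights[i] * points[i]."""
    m = len(points[0])
    return [sum(w * p[d] for w, p in zip(weights, points)) for d in range(m)]


def distance(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def approach(M: List[List[Vector]], q: List[float], p: List[float],
             steps: int) -> Tuple[Vector, Vector]:
    """Drive the running average x toward y = sum_{l,k} p_l q_k M[l][k]."""
    u = [mix(q, row) for row in M]   # u[l] = u(l, q), expected payoff of action l
    y = mix(p, u)                    # target in T(q)
    x = list(u[0])                   # average after one play of action 0
    for t in range(1, steps + 1):
        lam = [xi - yi for xi, yi in zip(x, y)]
        # Bang-bang choice: the pure action pointing most strongly against lam.
        l = min(range(len(u)), key=lambda a: dot(lam, u[a]))
        x = [xi + (ul - xi) / (t + 1) for xi, ul in zip(x, u[l])]
    return x, y


# Made-up 3-action game with payoffs in R^2.
M = [
    [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]],
    [[1.0, 0.0], [0.0, 1.0], [0.5, 0.0]],
    [[0.0, 1.0], [1.0, 0.0], [0.0, 0.5]],
]
x, y = approach(M, q=[1.0, 0.0, 0.0], p=[0.2, 0.3, 0.5], steps=10000)
print(distance(x, y))
```

Since y is a convex combination of the points u(l, q), some pure action always satisfies λᵀ(u(l, q) − y) ≤ 0, which is what makes the distance shrink at rate of order 1/√t in this discrete-time version.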
In the problem at hand, one additional challenge is that q must be self-confirmed. This means that the mixed strategy q entering the computation of the expected value of the projected games $\mathrm{Eval}_x(\lambda)$ must reflect the current state distribution. In formulas, this corresponds to expanding (15) so that q is expressed in terms of the state distribution ρ. In the rest of the paper we look for self-confirmed solutions, which we call equilibria.

The mean field game
Let us denote by v(x, t) the value of the optimization problem starting from time t at state x. The first step is to show that the problem results in the following mean field game system for the unknown scalar functions v(x, t), and ρ(x, t) when each group behaves according to (4): where a * i (t, x) and q are the optimal time-varying state-feedback controls of players i and j, respectively, obtained as The mean field game system (17) appears in the form of two coupled PDEs intertwined in a forward-backward way. The first equation in (17) is the Hamilton-Jacobi-Bellman (HJB) equation with variable v(x, t) and parametrized in ρ(·).
Given the boundary condition on final state (second equation in (17)), and assuming a given population behavior captured by ρ(·), the HJB equation is solved backwards and returns the value function and best-response behavior of the individuals (first equation in (18)) as well as the worst adversarial response (second equation in (18)). The HJB equation is coupled with a second PDE, known as Fokker-Planck-Kolmogorov (FPK) (third equation in (17)), defined on variable ρ(·) and parametrized in v(x, t). Given the boundary condition on initial distribution ρ(0) = ρ 0 (fourth equation in (17)), and assuming a given individual behavior described by u * , the FPK equation is solved forward and returns the population behavior time evolution ρ(t).
Let condition (12) hold true. Now, for given x, take for λ the direction of the gradient $\partial_x v(x, t)$ of the value function with respect to x. Then, we can introduce the expected value of the projected anti-gradient game, denoted $\mathrm{Eval}_x[\partial_x v]$. We can then establish the following result.
Theorem 2 (Self-confirmed equilibria) Let condition (12) hold true. Then, the mean-field game formulation of Problem 1 is Furthermore, the optimal controls for players 1 and 2 are Proof.
Due to the bilinear structure of f, we can deduce that the best-response strategy $u^*$ and the worst adversarial disturbance $w^*$ are on a vertex of the associated simplices in $\mathbb{R}^p$ and $\mathbb{R}^q$, respectively. This corresponds to saying that both strategies are pure strategies. We recall here that under a pure strategy a player chooses a single predetermined action, in contrast with mixed strategies, where players select probabilities on actions and, at the time of play, a random mechanism consistent with the selected probability distribution determines the actual action. A consequence of this is that the mean field equilibrium, if it exists, is in pure strategies as well.
We can rewrite the value of the anti-gradient projected game as Best responses and adversarial strategies are then given by With the above definition of Eval x [∂ x v] in mind, the Hamilton-Jacobi part of (17) can be rewritten as It is left to observe that f (u * , w * ) = A i * j * and proves the third equation (FPK equation). Q.E.D.
In principle, to find the optimal control input we need to solve the two coupled PDEs in (19) in v and ρ with given boundary conditions (second and last conditions).

Existence and nonuniqueness of equilibria
In this section we investigate existence and nonuniqueness of equilibria. To do this, we analyze the time-dependence of an estimate error ν(t), which accounts for the deviation between an estimated density q(t) and the current one $\tilde{q}(t)$ at time t, as defined in (22). Observe that the time-dependence of $\tilde{q}(t)$ enters in (22) through the time-varying nature of the target y(t). Now, according to our procedure, we wish to hypothesize a pair (p, q), which constitutes the input, and obtain a new density $\tilde{q}(p, q)$ as an output. To see this, from $y = \sum_{l,k \in A} p_l q_k M_{lk}$, $\forall p, q \in \Delta(A)$, the expression (22) can be rewritten in terms of (p, q). Eventually, the procedure should return a fixed point. In other words, if we think of an equilibrium as the pair $(p^*, q^*)$ such that $\nu(p^*, q^*) = 0$, existence of an equilibrium is now related to existence of a fixed point for the above procedure, i.e., $\tilde{q}(p, q) = q$.
The above means that, given a pair (p, q) as input to our procedure, the output $\tilde{q}(p, q)$ coincides with the hypothesized density q. It is natural to represent the above algorithmic procedure as a continuous-time dynamical system, and thus to relate convergence to a fixed point to the asymptotic stability of the dynamics. The next assumption introduces conditions for the asymptotic stability to hold.
Assumption 1 There exists a pair $(\dot{p}, \dot{q})$ such that condition (24) holds. This condition describes the possibility of varying (p, q) in order to reduce the estimate error ν, whatever the current error is. The next result establishes the existence of an equilibrium based on the above condition.

Proof.
This proof is based on a Lyapunov stability approach. In particular, let us introduce a Lyapunov function L that is quadratic in the error, and show that its time derivative is strictly negative. The time derivative can be decomposed as the sum of two terms involving the gradient of L with respect to the two variables p and q. From condition (24), we then have $\dot{L} < 0$, which proves the thesis. Q.E.D.
Essentially, the above theorem shows that if we let the algorithm run for a long time, the estimate error asymptotically converges to zero, namely $\lim_{t \to \infty} \nu = 0$, which proves the existence of an equilibrium.
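The decomposition used in the proof can be written out as follows. The specific choice $L = \tfrac{1}{2}\nu^T\nu$ is an assumption made here for illustration (one natural instance of a Lyapunov function that is quadratic in the error), and $\partial_p \nu$, $\partial_q \nu$ denote the Jacobians of $\nu(p, q)$:

```latex
L(\nu) = \tfrac{1}{2}\,\nu^T \nu, \qquad
\dot{L} = (\partial_p L)^T \dot{p} + (\partial_q L)^T \dot{q}
        = \nu^T \bigl( \partial_p \nu \, \dot{p} + \partial_q \nu \, \dot{q} \bigr).
```

Under this choice, asking for a pair $(\dot{p}, \dot{q})$ that makes the right-hand side negative whenever $\nu \neq 0$ is the kind of requirement that Assumption 1 encodes, and it yields $\nu \to 0$ along the resulting dynamics.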
We are now in the position to study nonuniqueness of equilibria. In particular, we provide a variational condition under which the equilibrium is nonunique.

Theorem 4 (nonuniqueness) Starting at an equilibrium where
then there exists a $(\dot{p}, \dot{q})$ such that $\dot{L} = 0$, and thus the current equilibrium is nonunique.

Solution of the mean field game
This section investigates the microscopic dynamics of every player given an equilibrium (p, q) and the corresponding target, which is common prior and is denoted by $y = \sum_{l,k \in A} p_l q_k M_{lk}$.
As a result, we obtain that such dynamics are "potential" ones, in the sense that every player's current average payoff, which we call the state of the player, describes a trajectory along the anti-gradient of a potential function, the latter being the value function of the mean-field game introduced earlier. To this purpose, let us denote by e(t) the deviation between the target y that every player wishes to approach and the current average payoff x(t). Given that our running cost is quadratic, from dynamic programming it is natural to assume that the value function also has a quadratic structure. This is a recurrent approach, which needs an a posteriori verification of the consistency of the quadratic assumption. In particular, let us assume that the upper bound for the value function takes the quadratic form (27), where $\Phi_t$ is a suitable positive definite matrix, i.e., $\Phi_t > 0$. Likewise, consider a quadratic function for the terminal penalty. Then, the HJB equation in (29) can be rewritten as (28). Substituting the expression (27) for the value function in (28), we obtain an equation in which all terms are explicitly written as quadratic terms in the error e(t). Considering that the HJB has to hold true for every e(t), we can drop e(t) and thus obtain an expression in the only matrix variable $\Phi_t$. This expression has the form of a classical differential Riccati equation, which can be solved backward in time given the boundary condition on the matrix in the terminal penalty, $\Phi_T = \psi$. We can use such a result to analyze the microscopic dynamics of each player as detailed in the next subsection.

Microscopic model
Every single player is characterized by a system of equations involving the evolution of the average payoff (first equation), its best-response (second equation), and the expression for the density (third equation). Note that the expression for the best-response is obtained from (20), where $\partial_x v$ is now replaced by $\Phi_t e(t)$. This is a straightforward consequence of assuming the value function to be quadratic as in (27).

Application: Regret and Bayesian equilibrium
Perhaps the leading application of games with vector payoffs is in the study of regret-based dynamics, to which we now turn.

Regret targeting in classical two-player games
Given a symmetric normal-form game with common action set A and symmetric payoff function $\pi : A^2 \to \mathbb{R}$, let $r(k, \alpha)$ denote the regret of player i from not having played action k ∈ A under action profile $\alpha \in A^2$. A straightforward way to justify the vector payoffs introduced earlier is to make them coincide with the regret vector associated with each action profile, i.e. $u(\alpha) := (r(k, \alpha))_{k \in A}$. In Hart and Mas-Colell [20], approachability of the nonpositive orthant implies convergence to Nash equilibrium under such payoffs. This is no longer true for 1st-moment approachability, which drives expected (rather than maximum) regret to zero, so that some deviations could still have positive regret.
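To make the regret payoffs concrete, the sketch below builds the regret vector for a Prisoners' Dilemma. It assumes the usual definition of regret, r(k, α) = π(k, α_j) − π(α_i, α_j), i.e. the gain player i would have obtained by playing k instead of the action actually played; the payoff numbers and the function names are illustrative and not taken from the paper.

```python
from typing import List


def regret_vector(pi: List[List[float]], own: int, opp: int) -> List[float]:
    """r(k, alpha) for every alternative k, with alpha = (own, opp).

    Assumed definition: payoff of the deviation k minus the realized payoff.
    """
    realized = pi[own][opp]
    return [pi[k][opp] - realized for k in range(len(pi))]


def expected_regret(pi: List[List[float]], own: int, q: List[float]) -> List[float]:
    """u(own, q): regret vector averaged over an opponent drawn from the mix q."""
    n = len(pi)
    return [
        sum(q[opp] * regret_vector(pi, own, opp)[k] for opp in range(n))
        for k in range(n)
    ]


# Illustrative Prisoners' Dilemma: action 0 = cooperate, action 1 = defect;
# PD[a][b] is the row player's payoff when playing a against b.
PD = [[3, 0],
      [5, 1]]

print(regret_vector(PD, 0, 0))        # regrets of a cooperator facing a cooperator
print(expected_regret(PD, 1, [0, 1])) # regrets of a defector when everybody defects
```

The component of the regret vector indexed by the action actually played is always zero, so the interesting coordinates are those of the deviations.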
In the following, we turn standard games like the Prisoners' Dilemma, coordination games and Hawk-Dove games into games with regret vectors as payoffs.

[Figure: vector field and target $y = (0, -1)$.] The integral of the distribution ρ over $R_2$ must be consistent with the initial assumption, which means $q_2 = \int_{R_2} \rho(x, t)\,dx = 1/3$. If this occurs, the vector field is such that eventually all players converge to $y = (0, -1)$. Consequently, the distribution converges to a Dirac impulse in y.

[Figure: vector field and target $y = (-b, 0)$.] Here we consider a distribution on actions q = (0, 1), i.e. everybody plays k = 2; then u(1, q) = (0, b) and u(2, q) = (−b, 0). The arrows indicate the vector field dx(t), for which eventually all players converge to y = (−b, 0). Consequently, the distribution converges to a Dirac impulse in y. However, the vertex y is not at the equilibrium. To see this, note that the supporting hyperplane $H := \{\xi \mid (\xi - y)^T (u(1, q) - y) = 0\}$ (dot-dashed line) partitions X into two regions, which is proven to be necessary for the vertex not to be at the equilibrium. This will be explained in Theorem 2.

Maximum regret and Bayesian equilibrium
Whilst 1st-moment approachability gives interesting dynamics in regret-based population games, it does not give convergence to Nash equilibrium. In this section, however, we show how the model can be applied to an incomplete-information setting to yield convergence to Bayesian equilibrium. Suppose then that the continuous-time population game Γ is based on a game of incomplete information; in particular, we are given a Harsanyi game G (as described in [41]) with state of the world $\omega = (s(\omega); t_1(\omega), t_2(\omega))$ chosen by Nature from a finite set Y using a probability distribution θ. Players then learn their own types $t_i(\omega) \in T_i$, choose actions $\beta_i$ from a common finite set $B(\omega)$, and receive symmetric payoffs $\varpi_i(\beta; \omega)$, $\beta = (\beta_1, \beta_2)$; the state of nature is $s(\omega) = (B(\omega), \varpi)$, $\varpi = (\varpi_1, \varpi_2)$. Each player i then has a common finite set Σ of ($T_i$-measurable) pure Bayesian strategies $\sigma_i : Y \to B(\omega)$, which we identify with the action set A in our general framework. Given a strategy profile $\sigma \in \Sigma^2$, let the vector payoffs be given by maximal regrets. Players are continuously rematched against new opponents to play this game G, and a new state of the world is chosen for each such matching; hence, each play of G is one-shot in nature, as distinct from repeated games of incomplete information (see [6] and Ch. 14 of [34]), where the opponents and state remain constant through time. 1st-moment approachability of the nonpositive orthant in Γ then implies that the expected maximal regret is nonpositive. But since the maximum of convex functions is convex, Jensen's inequality implies that this expected maximum is no less than the maximum of the expected regrets, which is hence also nonpositive. Thus, we have a Nash equilibrium of the Harsanyi game, which is also a Bayesian equilibrium of the incomplete-information game by Harsanyi's [17] Theorem I.
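In symbols, the Jensen step has the following shape. The shorthand $r_k(\omega)$, for the regret of deviation k in state of the world ω, is introduced here only for illustration and is not the paper's notation:

```latex
0 \;\ge\; \mathbb{E}_{\omega}\Bigl[\,\max_{k}\, r_k(\omega)\Bigr]
  \;\ge\; \max_{k}\; \mathbb{E}_{\omega}\bigl[\,r_k(\omega)\bigr].
```

The first inequality is what 1st-moment approachability of the nonpositive orthant delivers when payoffs are maximal regrets; the second is Jensen's inequality applied to the convex function max, and it gives nonpositive expected regret for every deviation k.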
However, note none of the above strategies corresponds to a self-confirmed equilibrium according to Theorem 2. Indeed, let us take for instance the first strategy, a i = 2, for all x when q = (1, 0, 0, 0). But a i = 2, for all x implies R 2 = X and R 1 = R 3 = R 4 = ∅ which implies in turn q = (0, 1, 0, 0) and this contradicts the assumption q = (1, 0, 0, 0). We can repeat the same reasoning for any other strategy.

Conclusion
This paper has combined approachability theory, evolutionary games, and mean-field games in a unified framework. The game studied has a vector payoff and a large number of players, and admits a classical mean-field game representation involving two coupled PDEs, the Hamilton-Jacobi-Bellman equation and the advection equation. We have highlighted multiple contributions. First, we coin the notion of 1st-moment approachability and analyze the corresponding convergence conditions. Second, we use the mean-field game to introduce the self-confirmed equilibrium. Third, we discuss existence, nonuniqueness, and stability of equilibria as fixed points of the two PDEs.
Future work involves the stochastic analysis of the same game in the presence of an additional Brownian motion in the dynamics. This would capture uncertainty or model-misspecification. In a different direction, we are interested in extending the study to the case where each player can adopt a mixed strategy, which would imply a new definition of density distribution on the space of mixed strategies; so far, the density distribution is defined on the space of pure strategies. A third development will be a further analysis of the connections with the Bayesian approach.