A PROBABILITY

Abstract. This paper introduces a probability criterion for two-person zero-sum stochastic games. It focuses on the probability that the payoff accumulated before the first passage time to some target state set exceeds a level specified by both players, which represents the security level for player 1 and the risk level for player 2. For the game model based on discrete-time Markov chains, under a suitable condition on the game's primitive data, we establish the Shapley equation, from which the existence of the value of the game and of a pair of optimal policies is ensured. We also provide a recursive way of computing (or at least approximating) the value of the game. Finally, the application of our main result is illustrated via an inventory system.

1. Introduction. In this paper, we begin a discussion of discrete-time, noncooperative, two-person zero-sum stochastic games with a probability criterion in denumerable spaces. Under a mild condition, we establish the existence of the value of the game and of a pair of optimal policies, and also give a way of approximating the value of the game.
As is well known, the criteria commonly used in stochastic games are the average (payoff) criterion [3,8,13,14,15,16,22,24,25] and the (expected) discounted (payoff) criterion [4,5,20,24]; see also the many references therein. More precisely, for game models based on discrete-time stochastic processes, Hernández-Lerma and Lasserre [8] studied two-person zero-sum dynamic stochastic games under the average criterion in Borel spaces with a possibly unbounded payoff function, established the existence of a solution to the Shapley equation, and provided some interesting martingale characterizations of optimal policies; Saha [22] extended some of the results in [8] to the case of partial observations; Sennott [24] discussed both the average and discounted criteria with unbounded payoff functions in the nonzero-sum case. For models based on continuous-time stochastic processes, Guo and Hernández-Lerma studied the discounted criterion in [5] for the zero-sum case and in [4] for the nonzero-sum case. They also showed the existence of pairs of optimal policies for the average criterion in [3]. Jaśkiewicz [14] and Vega-Amaya [25] studied zero-sum semi-Markov games under the average criterion, where a solution to the optimality equation is obtained via a fixed-point argument. The average and discounted criteria share a common feature: both measure the expectation of the rewards/costs obtained by the players, focusing on the long-term and short-term expected rewards/costs, respectively. However, when a player cares about the probability of securing a given reward level rather than its expectation, the expected criteria are inadequate and a probability criterion is the natural choice. Although the probability criterion has been considered in [11,12,18,19,23], those works are restricted to Markov decision processes; see [18,19] for discrete-time Markov decision processes, and [11,12,23] for semi-Markov decision processes.
To the best of our knowledge, the probability criterion has not yet been studied for stochastic games. In addition, for two-person zero-sum games, the probability P^{π_1,π_2}(Z > λ) of a random payoff Z exceeding a level λ quantifies the ability of player 1 to reach the profit level λ, and measures, for player 2, the risk that the total cost incurred exceeds λ when the pair of policies (π_1, π_2) is used; such a criterion has rich applications for risk-sensitive players. Therefore, the study of the zero-sum game with this probability as the optimality criterion is new and meaningful.
Considering that in some practical situations the horizon is random rather than infinite or finite (e.g., in reliability systems, the players are interested in their safety before the system fails, regardless of what happens after the failure), we focus on the probability criterion defined over the horizon before the first passage time τ_D to some given target set D. That is, the optimality interval is [0, τ_D]. In particular, when D is empty, τ_D = ∞ and the random interval [0, τ_D] reduces to the standard infinite interval of [5,6,24]. Thus, the first passage probability criterion, under which an optimality problem is discussed in this paper, is more general. When playing two-person zero-sum games, player 1 wants to maximize the security probability to maintain a high level of safety, while player 2 wishes to minimize the risk probability to control his/her risk. Against this background, the goal of this paper is to find a pair of optimal policies, with maximum security for player 1 and minimum risk for player 2, over all history-dependent policies. To solve our optimality problem, we characterize the probability criterion (see Lemma 3.2) and impose a condition guaranteeing the existence of a solution to the Shapley equation, which allows us to ensure the existence of the value of the game and of a pair of optimal policies (see Theorem 3.6).
The rest of this paper is organized as follows. Section 2 states the game model and the optimization problem of interest. Our main result on the existence of the value of the game and of a pair of optimal policies is presented in Section 3, together with some basic facts and a condition; the result is illustrated by an inventory system in Section 4.
2. The game model. The two-person zero-sum stochastic game model we are interested in is specified by the collection

{S, (A(i), i ∈ S), (B(i), i ∈ S), Q, r, D},    (1)

where S is the state space, A is the action set for player 1 and B the action set for player 2, all assumed to be denumerable. A(i) and B(i) represent the sets of actions available to players 1 and 2 at state i ∈ S, respectively, both of which are assumed to be finite. D ⊆ S is a common target state set, denoting the class of all desired or undesired states according to practical needs, with D^c := S \ D the complement of D in S. The one-step transition law is described by a stochastic kernel Q on S given K, where K := {(i, a, b) | i ∈ S, a ∈ A(i), b ∈ B(i)} is the set of all feasible state–action pairs, endowed with the discrete topology. If actions a ∈ A(i) and b ∈ B(i) are chosen by players 1 and 2 in state i, respectively, then Q(j|i, a, b) is the probability that the next state is j ∈ S. Finally, the payoff function r : K → R_+ is the per-stage reward for player 1 (i.e., the per-stage cost for player 2), where R_+ := [0, +∞).

We now describe the evolution of the discrete-time stochastic game under the first passage probability criterion. Suppose the system is in state i_0 ∈ D^c at the initial decision epoch 0, and there is a common level λ_0 in the two players' minds, taken as a profit level by player 1 and as a cost level by player 2 (that is, before the system state falls in the target set D, player 1 tries his or her best to earn rewards of more than λ_0, while player 2 tries to keep the cost at no more than λ_0). The players, independently of each other, choose actions a_0 ∈ A(i_0) and b_0 ∈ B(i_0) depending on the initial state i_0 and the level λ_0. Consequently, the system jumps to state i_1 ∈ S at time 1 according to the one-step transition law Q(i_1|i_0, a_0, b_0), at which point a payoff r(i_0, a_0, b_0) is produced, representing the reward for player 1 and the cost for player 2.
There then remains a level λ_1 := λ_0 − r(i_0, a_0, b_0) for both players. Based on the current state i_1 and the current level λ_1, as well as the previous ones, player 1 adopts an action a_1 ∈ A(i_1) and player 2 uses an action b_1 ∈ B(i_1), and the same sequence of events occurs. The game evolves in this way, and so we obtain an admissible history h_n up to the nth decision epoch, i.e.,

h_n = (i_0, λ_0, a_0, b_0, ..., i_{n−1}, λ_{n−1}, a_{n−1}, b_{n−1}, i_n, λ_n),

where (i_m, a_m, b_m) ∈ K, λ_0 ∈ R := (−∞, +∞), and

λ_{m+1} := λ_m − r(i_m, a_m, b_m) for m = 0, 1, ..., n − 1.    (2)

For convenience, we denote by H_n the set of all admissible histories h_n of the system up to the nth decision epoch, and assume that H_n is endowed with a Borel σ-algebra.
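The level recursion (2) is easy to mechanize. The following sketch (our own illustration, with hypothetical names; the per-stage payoffs r(i_m, a_m, b_m) are supplied externally) builds the sequence of remaining levels λ_0, λ_1, ..., λ_n:

```python
def level_sequence(lam0, stage_payoffs):
    """Remaining levels lambda_0, ..., lambda_n obtained from the recursion
    lambda_{m+1} = lambda_m - r(i_m, a_m, b_m) in (2)."""
    levels = [lam0]
    for payoff in stage_payoffs:          # payoff = r(i_m, a_m, b_m)
        levels.append(levels[-1] - payoff)
    return levels
```

For instance, with λ_0 = 10 and per-stage payoffs 3 and 4, the remaining levels are 10, 7, 3. A negative level is possible and, as Remark 1 below notes, allowed: it signals that player 1's profit goal has been reached.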
Remark 1. The case λ_n < 0 is allowed here; it means that player 1's profit goal has been achieved and player 2's cost is under control. Indeed, for any fixed initial level λ, we could omit the sequence {λ_n} from the histories. However, since each decision depends on the current state i_n and the corresponding level λ_n, which may change with the time n, we keep {λ_n} in the histories in order to state the implementation of policies (see Theorem 3.6(d) below) as well as the arguments that follow (such as the operators T^{ϕ,φ} and T below).
To specify the decision rule, we need the concept of policies.
Definition 2.1. A randomized history-dependent policy, or simply a policy, for player 1 is a sequence π_1 = {π_{1,n}, n ≥ 0} of stochastic kernels π_{1,n} on A given H_n such that π_{1,n}(A(i_n)|h_n) = 1 for all h_n ∈ H_n and n = 0, 1, .... The set of all policies for player 1 is denoted by Π_1. To further specify player 1's game rule, we need to define some subsets of Π_1 as follows.
Let Φ 1 denote the set of all stochastic kernels ϕ on A given S × R satisfying ϕ(A(i)|i, λ) = 1 for all (i, λ) ∈ S × R.
Definition 2.2. (a) A policy π_1 = {π_{1,n}} is said to be randomized Markov if there is a sequence {ϕ_n} of stochastic kernels ϕ_n ∈ Φ_1 such that π_{1,n}(·|h_n) = ϕ_n(·|i_n, λ_n) for every h_n ∈ H_n and n ≥ 0. We write such a policy as π_1 = {ϕ_n}. (b) A randomized Markov policy π_1 = {ϕ_n} is said to be randomized stationary if the ϕ_n are independent of n. In that case we write π_1 = {ϕ, ϕ, ...} simply as ϕ.
We denote by Π_1^m and Π_1^s the sets of all randomized Markov and all randomized stationary policies for player 1, respectively. Similarly, with B and B(i) in lieu of A and A(i), we define for player 2 the set Π_2 of all randomized history-dependent policies, the set Φ_2 of stochastic kernels, the set Π_2^m of randomized Markov policies, and the set Π_2^s of randomized stationary policies.

By the well-known Ionescu-Tulcea theorem [9], for each pair of policies (π_1, π_2) ∈ Π_1 × Π_2 and each initial state i ∈ S and level λ ∈ R, there exist a unique probability measure P^{π_1,π_2}_{(i,λ)} and a stochastic process {(i_n, λ_n, a_n, b_n), n ≥ 0} such that, for each j ∈ S, a ∈ A, b ∈ B and h_n ∈ H_n with n = 0, 1, ...,

P^{π_1,π_2}_{(i,λ)}(a_n = a, b_n = b | h_n) = π_{1,n}(a|h_n) π_{2,n}(b|h_n),
P^{π_1,π_2}_{(i,λ)}(i_{n+1} = j | h_n, a_n, b_n) = Q(j|i_n, a_n, b_n),

where i_n, λ_n := λ_{n−1} − r(i_{n−1}, a_{n−1}, b_{n−1}), a_n and b_n denote the state, the level for the players, and the actions adopted by player 1 and player 2 at time n, respectively. The expectation operator associated with P^{π_1,π_2}_{(i,λ)} is denoted by E^{π_1,π_2}_{(i,λ)}.

For the target set D, we consider the random variable

τ_D := inf{n ≥ 0 : i_n ∈ D}   (with inf ∅ := +∞),

which is the first passage time of the state process {i_n, n ≥ 0} into the set D. Then, for each initial state i ∈ S and level λ ∈ R, using the convention ∑_{n=l}^{m} x_n := 0 for any sequence {x_n} whenever m < l, we define the (first passage) probability criterion G(i, λ, π_1, π_2) of the discrete-time two-person zero-sum game (1) under a pair of policies (π_1, π_2) ∈ Π_1 × Π_2 by

G(i, λ, π_1, π_2) := P^{π_1,π_2}_{(i,λ)} ( ∑_{n=0}^{τ_D − 1} r(i_n, a_n, b_n) > λ ),

which quantifies the ability of player 1 to reach the profit level λ, and also measures the risk of player 2 in controlling the cost level λ, up to the system's first entry into the set D, when the pair of policies (π_1, π_2) is used.

To introduce our optimality problem, we also need the following concepts. The functions

L(i, λ) := sup_{π_1 ∈ Π_1} inf_{π_2 ∈ Π_2} G(i, λ, π_1, π_2)  and  U(i, λ) := inf_{π_2 ∈ Π_2} sup_{π_1 ∈ Π_1} G(i, λ, π_1, π_2),

defined on S × R, are called the lower value and the upper value of the game, respectively. Clearly, L(i, λ) ≤ U(i, λ) for every (i, λ) ∈ S × R.
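To make the criterion G concrete, one can estimate it by simulation for a finite toy model under randomized stationary policies. The sketch below is our own illustration (all names hypothetical): Q, r and the policies are dictionaries supplied by the user, trajectories are drawn until the first passage to D, and we count how often the accumulated payoff exceeds the level λ, i.e. how often the remaining level becomes negative:

```python
import random

def sample(dist, rng):
    """Draw an outcome from a dict {outcome: probability}."""
    u, acc = rng.random(), 0.0
    for x, p in dist.items():
        acc += p
        if u <= acc:
            return x
    return x  # guard against floating-point rounding

def estimate_G(i0, lam0, pi1, pi2, Q, r, D, n_sims=10000, horizon=10**4, seed=0):
    """Monte Carlo estimate of G(i0, lam0, pi1, pi2): the probability that the
    payoff accumulated before the first passage time to D exceeds lam0, under
    randomized stationary policies pi1, pi2 (dicts: state -> action dist)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        i, lam = i0, lam0
        for _ in range(horizon):          # horizon truncates non-absorbed paths
            if i in D:
                break
            a = sample(pi1[i], rng)
            b = sample(pi2[i], rng)
            lam -= r[(i, a, b)]           # remaining level lambda_{m+1}, as in (2)
            i = sample(Q[(i, a, b)], rng)
        hits += (lam < 0)                 # accumulated payoff strictly exceeded lam0
    return hits / n_sims
```

For example, in a model where state 0 yields payoff 1 and moves surely into D, the estimate is 1 for any level λ < 1 and 0 for λ ≥ 1, matching the strict inequality in the definition of G.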
Definition 2.3. If L(i, λ) = U(i, λ) for every (i, λ) ∈ S × R, then the common function is called the value of the game, and is denoted by V.
Definition 2.4. Suppose that the value of the game V exists. A policy π_1^* ∈ Π_1 is said to be optimal for player 1 if

inf_{π_2 ∈ Π_2} G(i, λ, π_1^*, π_2) = V(i, λ) for every (i, λ) ∈ S × R.

Similarly, π_2^* ∈ Π_2 is said to be optimal for player 2 if

sup_{π_1 ∈ Π_1} G(i, λ, π_1, π_2^*) = V(i, λ) for every (i, λ) ∈ S × R.

If π_k^* ∈ Π_k is optimal for player k (k = 1, 2), then (π_1^*, π_2^*) is called a pair of optimal policies (also known as a saddle point).
3. Main results. In this section, we state our main result on the existence of a pair of optimal policies. To this end, we first give a condition and some basic facts.
Let P(U) denote the family of probability measures on a set U, endowed with the weak topology. Let F_m be the set of functions H : D^c × R → [0, 1] such that H(i, ·) is Borel-measurable on R for each i ∈ D^c and H(i, λ) = 1 if λ < 0 for each i ∈ D^c. In addition, we introduce the operators T^{ϕ,φ}, T and T^{π_1,π_2} on F_m as follows: for any H ∈ F_m, i ∈ D^c, ϕ ∈ P(A(i)), φ ∈ P(B(i)) and (π_1, π_2) ∈ Π_1^s × Π_2^s, if λ ≥ 0,

T^{ϕ,φ}H(i, λ) := ∑_{a∈A(i), b∈B(i)} ϕ(a) φ(b) [ ∑_{j∈D^c} Q(j|i, a, b) H(j, λ − r(i, a, b)) + I{λ − r(i, a, b) < 0} ∑_{j∈D} Q(j|i, a, b) ],

T H(i, λ) := sup_{ϕ∈P(A(i))} inf_{φ∈P(B(i))} T^{ϕ,φ}H(i, λ),

T^{π_1,π_2}H(i, λ) := T^{π_1(·|i,λ), π_2(·|i,λ)}H(i, λ);

and, if λ < 0, T^{ϕ,φ}H(i, λ) = T H(i, λ) = T^{π_1,π_2}H(i, λ) := 1.

Remark 2. It is clear that the operators T^{ϕ,φ}, T and T^{π_1,π_2} are monotone; this is crucial in characterizing the value of the game and the pair of optimal policies; see Theorem 3.6 below.
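Since A(i) and B(i) are finite, the sup-inf in the definition of T is the value of a finite matrix game, which can be computed by linear programming. The following sketch is our own illustration (hypothetical interface, not the paper's notation); it assumes that the (a, b)-entry of the local game averages H over next states in D^c, with transitions into D contributing the indicator I{λ − r(i, a, b) < 0}:

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(M):
    """Value sup_phi inf_psi phi^T M psi of a finite zero-sum matrix game
    (row player maximizes), via the standard LP formulation."""
    M = np.asarray(M, dtype=float)
    m, n = M.shape
    # variables: (x_1, ..., x_m, v); minimize -v
    c = np.zeros(m + 1)
    c[-1] = -1.0
    A_ub = np.hstack([-M.T, np.ones((n, 1))])     # v - (M^T x)_j <= 0 for each column j
    b_ub = np.zeros(n)
    A_eq = np.ones((1, m + 1))
    A_eq[0, -1] = 0.0                             # sum_i x_i = 1
    b_eq = np.ones(1)
    bounds = [(0.0, None)] * m + [(None, None)]   # x >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return float(res.x[-1])

def T_operator(H, i, lam, A_i, B_i, Q, r, D):
    """One application of T at (i, lam): the value of the local matrix game
    whose (a, b) entry averages H over the next state, with transitions into
    the target set D contributing the indicator of the level being exhausted."""
    if lam < 0:
        return 1.0
    def entry(a, b):
        lam2 = lam - r[(i, a, b)]
        total = 0.0
        for j, q in Q[(i, a, b)].items():
            total += q * ((1.0 if lam2 < 0 else 0.0) if j in D else H(j, lam2))
        return total
    M = [[entry(a, b) for b in B_i] for a in A_i]
    return matrix_game_value(M)
```

For instance, matrix_game_value applied to the 2×2 identity matrix returns 1/2, the value of the corresponding mixed-strategy game.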
Using a method similar to the proof of [21, Theorem 5.5.1], we can show by induction that (12) holds with π defined through (13), and thus the desired result follows.
Indeed, for n = 1 and every (i, λ) ∈ D^c × R, the claim holds directly. Now assume that (15) is true for some n ≥ 1. By the property of conditional expectation and the Markov property, with (n)π_1 ∈ Π_1^m and (n)π_2 ∈ Π_2^m as in (14), we obtain that (15) holds for n + 1. Then, letting n → ∞ in (15) implies the desired result.

Assumption 1.
For λ ≥ 0, we have u_{−1} ≤ T u_{−1} = u_0. Therefore, by the definition of {u_n, n ≥ −1} and the monotonicity of the operator T, we have u_{−1} ≤ u_0 ≤ ⋯ ≤ u_n ≤ ⋯, i.e., {u_n, n ≥ −1} is a nondecreasing sequence, and it converges to some function u^* ∈ F_m. For λ < 0, u^*(i, λ) = T u^*(i, λ) ≡ 1. To obtain the Shapley equation for λ ≥ 0, using the monotonicity of T again, we have T u^* ≥ T u_n = u_{n+1} for all n ≥ −1, which implies that T u^* ≥ u^*. To show the reverse inequality, it follows from the definition of the operator T that, for any ϕ ∈ P(A(i)),

T^{ϕ, φ_n^*(·|i,λ)} u_n(i, λ) ≤ T u_n(i, λ) = u_{n+1}(i, λ),    (20)

where the existence of φ_n^*(·|i, λ) ∈ P(B(i)) (which may depend on ϕ) is guaranteed by Lemma 3.5 and the measurable selection theorem [10]. By the compactness of P(B(i)), without loss of generality we may suppose that φ_n^*(·|i, λ) → φ^*(·|i, λ) ∈ P(B(i)). Taking n → ∞ in (20), it follows from the extended Fatou
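The monotone iteration u_{n+1} = T u_n, started from u_{−1}(i, λ) = I{λ < 0}, can be watched numerically. The sketch below is our own toy illustration, deliberately degenerate (one non-target state, singleton action sets, so the sup-inf disappears and T reduces to a plain expectation; all names are ours, not the paper's):

```python
def make_T(p_stay):
    """T for a toy model: one state 0 with per-stage payoff 1, staying in 0
    with probability p_stay and entering the target set D otherwise.
    A level below 0 means the payoff target has already been exceeded."""
    def T(u):
        def u_next(lam):
            if lam < 0:
                return 1.0
            lam2 = lam - 1.0                      # remaining level after one stage
            boundary = 1.0 if lam2 < 0 else 0.0   # transition into D
            return p_stay * u(lam2) + (1.0 - p_stay) * boundary
        return u_next
    return T

T = make_T(0.5)
u = lambda lam: 1.0 if lam < 0 else 0.0           # u_{-1}
values = []
for _ in range(5):                                # u_0, ..., u_4 at lam = 1.5
    u = T(u)
    values.append(u(1.5))
```

At the level λ = 1.5, the sequence u_n(1.5) is nondecreasing and stabilizes at 0.5, the probability of accumulating a payoff greater than 1.5 (i.e., of surviving at least two stages) in this toy model.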