A RISK MINIMIZATION PROBLEM FOR FINITE HORIZON SEMI-MARKOV DECISION PROCESSES WITH LOSS RATES

Abstract. This paper deals with the risk probability for finite horizon semi-Markov decision processes with loss rates. The criterion to be minimized is the risk probability that the total loss incurred during a finite horizon exceeds a loss level. For such an optimality problem, we first establish the optimality equation, and prove that the optimal value function is the unique solution to the optimality equation. We then show the existence of an optimal policy, and develop a value iteration algorithm for computing the value function and optimal policies. We also derive the approximation of the value function and the rules of iteration. Finally, a numerical example is given to illustrate our results.


1. Introduction. In the field of Markov decision processes (MDPs), many criteria have been proposed for the study of stochastic optimal control problems, such as expected discounted criteria, expected average criteria, risk-sensitive criteria, first passage criteria, and risk probability criteria [3,4,5,6,7,8,10,11,12,15,20,23]. Among all these criteria, risk probability criteria have received much attention because they have rich applications in many areas, such as equipment maintenance, queueing systems, reliability engineering and risk analysis [1,2,14,16,17,21].
Risk probability criteria have been widely studied in the literature on MDPs by many authors via different methods; see, for instance, [10,11,12,18,20] and their extensive references. According to the characterization of the optimality problems, the existing works on risk probability criteria can be roughly classified into two groups: (i) minimizing the risk probability that the total reward is not greater than (or less than) a given initial threshold value, and (ii) minimizing the risk probability that the total loss exceeds a given initial threshold value. We next briefly describe some existing works on the two groups, and then state our motivation and the main results of this paper concerning group (ii). In group (i), the risk probability is defined as P^π(Z_r ≤ λ), where Z_r denotes the nonnegative total reward. Several authors minimize the risk probability that a random total reward is less than or equal to a given initial threshold value in discrete-time Markov decision processes [15,22,25]. For example, Ohtsubo and Sakaguchi [22] consider the risk probability that the total reward is not greater (or less) than a given initial threshold value. In addition, a few works have been devoted to risk probability criteria in semi-Markov decision processes (SMDPs). More precisely, Huang, Guo and Song study the risk probability that the system reaches a prescribed reward level during a first passage time to some target set for SMDPs [10]. Recently, Huang and Guo investigate the risk probability criteria for finite horizon SMDPs and introduce a class of horizon-relevant policies which depend on the usual states and the planning horizons [12].
In group (ii), the risk probability is defined as P^π(Z_l > λ), where Z_l denotes the nonnegative total loss. To the best of the authors' knowledge, there are only a few studies dealing with the loss case; see, for instance, Ohtsubo [18], who considers the optimality problem of minimizing the risk probability that the total loss exceeds a threshold value. Obviously, from the fact that P^π(Z_r ≤ λ) = 1 − P^π(Z_r > λ), we see that the problem of minimizing P^π(Z_r > λ) is not equivalent to that of minimizing P^π(Z_r ≤ λ). A review of the existing references shows that all of the related works on risk probability criteria for SMDPs concern the case of rewards. However, the case of losses for SMDPs has not been explored yet. In the present paper, we devote ourselves to risk probability criteria for SMDPs with loss rates.
As is well known, the business cycle is a series of peaks and troughs [16,21]. A business cycle is basically defined in terms of periods of expansion or recession. During recessions, the economy is contracting, as measured by decreases in indicators like employment, industrial production, sales and personal incomes [16,21]. The decision maker is often concerned with the total loss incurred during the recessions. Hence, a natural optimality question is how to choose policies that minimize the risk probability that the total loss incurred during a finite horizon exceeds a given level. This is the main motivation of the current paper. The optimality problem of minimizing the risk probability that the total loss exceeds a given level is therefore meaningful and novel, and it is studied further in the present paper.
Compared with Huang, Guo and Li [12] for finite horizon semi-Markov decision processes with risk probability criteria, this paper: (1) deals with the risk probability criterion P^π(Z_l > λ) of group (ii) for finite horizon SMDPs with loss rates, rather than the reward case of [12]; (2) gives an approximation of the value function and the exact number of iteration steps after which the iterations can be stopped (see Theorem 3.2 below for details).
In this paper, we further study finite horizon SMDPs, and our goal is to find an optimal policy which minimizes the risk probability that the total loss incurred by a system during a finite horizon exceeds a loss level. The main contributions of this paper are as follows. Firstly, we characterize the risk probability (Lemma 3.1). Secondly, we establish the optimality equation via a successive approximation technique and, furthermore, we provide an iteration algorithm for computing the value function (Theorem 3.1 (a)) and prove that the value function is its unique solution (Theorem 3.1 (b)). Thirdly, we show the existence of a deterministic stationary optimal (ε-optimal) policy (Theorem 3.1 (c) and Theorem 3.2 (c)). Fourthly, we show that the error of the approximation of the value function can be made arbitrarily small within finitely many iteration steps (Theorem 3.2 (a),(b)). These results are new for risk probability criteria in finite horizon SMDPs.

The remainder of this paper is organized as follows. We describe the mathematical model and introduce the terminology in Section 2. We then prove that the optimal value function is the unique solution to an optimality equation and that there exist optimal (ε-optimal) policies in Section 3. An example illustrating possible applications of the obtained results is given in Section 4.

2. The model and optimality criterion. We shall give a brief review of the main concepts of SMDPs. The model of SMDPs we are concerned with is the five-tuple

{E, A, (A(i), i ∈ E), Q(·, ·|i, a), c(i, a)},    (1)

where E and A are denumerable state and action spaces, respectively; A(i) denotes the set of admissible actions at state i ∈ E, which is assumed to be finite. The transition mechanism of the SMDPs is described by the semi-Markov kernel Q(·, ·|i, a) on R_+ × E given K, where R_+ := [0, +∞) and K := {(i, a) | i ∈ E, a ∈ A(i)} is the set of all feasible state-action pairs. It is assumed that (i) Q(·, j|i, a), for any fixed j ∈ E and (i, a) ∈ K, is a nondecreasing, right continuous real function on R_+ such that Q(0, j|i, a) = 0; (ii) Q(t, ·|i, a), for each fixed t ∈ R_+, is a sub-stochastic kernel on E given K; and (iii) P(·|i, a) := Q(∞, ·|i, a) is a stochastic kernel on E given K.
If an action a ∈ A(i) is chosen in state i, then Q(t, j|i, a) is the joint probability that the sojourn time in state i is not greater than t ∈ R + and the next state is j ∈ E. Finally, the real function c : K −→ R + denotes the loss rate.
Remark 1. Compared with the risk probability criteria in SMDPs [10,12,20], in our model (1) a loss rate is considered, which results in a different definition of the risk probability in (7) below.
A finite horizon SMDP with risk probability criteria evolves as follows: at the initial decision epoch s_0 = 0, the system occupies state i_0, and the decision maker has a loss level λ_0 over a planning horizon t_0 in mind (that is, he/she should try to keep the total loss incurred during the planning horizon t_0 below the level λ_0); then he/she chooses an action a_0 ∈ A(i_0) according to the state i_0, the planning horizon t_0 and the loss level λ_0. As a consequence of this action selection, the system remains in i_0 until time s_1, at which point the system jumps to i_1 and the next decision epoch occurs. At time s_1, a loss c(i_0, a_0)(s_1 − s_0) is incurred, leading to a remaining loss level λ_1 := λ_0 − c(i_0, a_0)(s_1 − s_0) over a remaining planning horizon t_1 := [t_0 − (s_1 − s_0)]_+ for the decision maker, where [x]_+ := max{x, 0}. According to the current state i_1, the current planning horizon t_1 and the current loss level λ_1, as well as the previous state i_0 and the previous loss level λ_0, the decision maker adopts an action a_1 ∈ A(i_1), and the same sequence of events occurs. The decision process evolves in this way, and so we obtain an admissible horizon-relevant (h-r in short) history h_n of the SMDPs up to the nth decision epoch, i.e., h_n = (s_0, i_0, t_0, λ_0, a_0, . . . , s_{n−1}, i_{n−1}, t_{n−1}, λ_{n−1}, a_{n−1}, s_n, i_n, t_n, λ_n), where s_0 = 0, s_{m+1} ≥ s_m, (i_m, a_m) ∈ K, t_0 ∈ R_+, t_{m+1} := [t_m − (s_{m+1} − s_m)]_+, λ_0 ∈ R := (−∞, +∞), λ_{m+1} := λ_m − c(i_m, a_m)(s_{m+1} − s_m) for m = 0, 1, . . . , n − 1, and i_n ∈ E. Let H_n denote the set of all admissible h-r histories h_n of the system up to the nth decision epoch. Assume that H_n is endowed with a Borel σ-algebra.
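To make the evolution described above concrete, the following minimal Python sketch simulates a single h-r history (s_n, i_n, t_n, λ_n, a_n). It is an illustration only: the two-state model data, the kernel sampler sample_sojourn_and_next_state, the loss rates loss_rate and the decision rule f are hypothetical placeholders, not quantities specified in this paper.

```python
import random

# Hypothetical model data (illustration only): two states 0 and 1, two actions "a" and "b".
loss_rate = {(0, "a"): 2.0, (0, "b"): 1.0, (1, "a"): 0.5, (1, "b"): 0.0}

def sample_sojourn_and_next_state(i, a):
    """Sample (sojourn time, next state) from an assumed semi-Markov kernel Q(., .|i, a)."""
    sojourn = random.expovariate(1.0)      # assumed sojourn-time law
    next_state = random.choice([0, 1])     # assumed embedded transition law
    return sojourn, next_state

def f(i, t, lam):
    """A hypothetical h-r deterministic stationary decision rule f(i, t, lam)."""
    return "a" if lam > 1.0 else "b"

def simulate_history(i0, t0, lam0):
    """Generate the h-r history (s_n, i_n, t_n, lam_n, a_n) until the horizon is exhausted."""
    s, i, t, lam = 0.0, i0, t0, lam0
    history = []
    while t > 0:
        a = f(i, t, lam)
        sojourn, j = sample_sojourn_and_next_state(i, a)
        history.append((s, i, t, lam, a))
        s += sojourn
        lam -= loss_rate[(i, a)] * sojourn   # lam_{m+1} = lam_m - c(i_m, a_m)(s_{m+1} - s_m)
        t = max(t - sojourn, 0.0)            # t_{m+1} = [t_m - (s_{m+1} - s_m)]_+
        i = j
    return history, lam

if __name__ == "__main__":
    hist, final_lam = simulate_history(i0=0, t0=10.0, lam0=5.0)
    print(len(hist), "decision epochs; remaining loss level:", final_lam)
```

The two update lines mirror the recursions λ_{m+1} = λ_m − c(i_m, a_m)(s_{m+1} − s_m) and t_{m+1} = [t_m − (s_{m+1} − s_m)]_+ stated above.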

Remark 2. (a) The h-r history h_n here is similar to those in [11,12]. For more details, readers are referred to Remark 2.2 in [11,12]. (b) The case λ_n < 0 is allowed here, and the planning horizon t_n may be equal to 0 at some decision epoch; in contrast, in [12] the case λ_n < 0 is regarded as risk-free on behalf of the decision maker at the nth decision epoch.
To state the finite horizon SMDP optimization problem with risk probability criteria that we are concerned with, we need to introduce the classes of policies.
Definition 2.1. An h-r randomized historic policy, or simply an h-r policy, is a sequence π = {π n , n ≥ 0} of stochastic kernels π n on A given H n such that π n (A(i n )|h n ) = 1 ∀ h n ∈ H n , n = 0, 1, . . . .

Remark 3. Compared with the policies for infinite horizon discounted and expected average criteria [4,7,8,23], the h-r policy here depends on the loss levels, the decision epochs, the planning horizons, the states and the actions.
The set of all h-r policies is denoted by Π.
Notation. Let Φ denote the set of all stochastic kernels ϕ on A given E × R_+ × R satisfying ϕ(A(i)|i, t, λ) = 1 for all (i, t, λ) ∈ E × R_+ × R, and let F denote the set of all measurable functions f : E × R_+ × R → A satisfying f(i, t, λ) ∈ A(i) for all (i, t, λ) ∈ E × R_+ × R.

Let us now introduce several subclasses of policies.
Definition 2.2. (a) An h-r policy π = {π n } is said to be h-r randomized Markov if there is a sequence {ϕ n } of stochastic kernels ϕ n ∈ Φ such that π n (·|h n ) = ϕ n (·|i n , t n , λ n ) for every h n ∈ H n and n ≥ 0. We write such a policy as π = {ϕ n }.
(b) An h-r randomized Markov policy π = {ϕ n } is said to be h-r randomized stationary if ϕ n are independent of n. We write π = {ϕ, ϕ, . . .} as ϕ for brevity. (c) An h-r randomized Markov policy π = {ϕ n } is said to be h-r deterministic Markov if there is a sequence {f n } of measurable functions f n ∈ F such that ϕ n (·|i, t, λ) is the Dirac measure at f n (i, t, λ) for every (i, t, λ) ∈ E × R + × R and n ≥ 0. We write such a policy as π = {f n }.
(d) An h-r deterministic Markov policy π = {f n } is said to be h-r (deterministic) stationary if f n are independent of n. We write π = {f, f, . . .} as f for simplicity.
For convenience of description, we denote by Π_RM, Π_RS, Π_DM and Π_DS the sets of all h-r randomized Markov, h-r randomized stationary, h-r deterministic Markov and h-r deterministic stationary policies, respectively. Obviously, Π_DS ⊂ Π_DM ⊂ Π_RM ⊂ Π and Π_DS ⊂ Π_RS ⊂ Π_RM ⊂ Π.

Let (Ω, F) be the measurable space consisting of the sample space Ω and the corresponding product σ-algebra F.
Here, we define an underlying continuous-time state-action process {Z(t), W(t), t ∈ R_+} corresponding to the stochastic process {S_n, J_n, A_n, n ≥ 0} of decision epochs, states and actions, by

Z(t) := Σ_{n≥0} I_{{S_n ≤ t < S_{n+1}}} J_n + I_{{t ≥ S_∞}} ∂_E,    W(t) := Σ_{n≥0} I_{{S_n ≤ t < S_{n+1}}} A_n + I_{{t ≥ S_∞}} ∂_A,

where ∂_E and ∂_A are the extra state and action adjoined to E and A, respectively, and S_∞ is the accumulation point of the sequence {S_n}, i.e., S_∞ := lim_{n→∞} S_n.
To deal with the finite horizon optimization problem, we fix an arbitrary T-horizon (with T ∈ R_+). To ensure the existence of an optimal policy, we impose the following assumption.

Assumption 2.1. For each π ∈ Π and (i, t, λ) ∈ E × [0, T] × R, P^π_{(i,t,λ)}(S_∞ > T) = 1.
To verify Assumption 2.1 above, we need the following condition.
Condition 2.1. There exist constants δ > 0 and ξ > 0 such that Q(δ, E|i, a) ≤ 1 − ξ for all (i, a) ∈ K.

Remark 5. For the verification of Assumption 2.1, the reader is referred to the proof of Proposition 2.1 in [11]. Moreover, Condition 2.1 can be verified more easily since it is imposed on the primitive data of the model (1).
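As an illustration of how Condition 2.1 can be checked on the primitive data, the following sketch verifies it for a kernel of the product form Q(t, j|i, a) = G(t|i, a)p(j|i, a) with exponentially distributed sojourn times; the product form and the uniform rate bound μ̄ are assumptions of this sketch (the same computation reappears, with concrete constants, in the example of Section 4).

```latex
% Worked check of Condition 2.1 under illustrative assumptions: a product-form kernel
% Q(t, j | i, a) = G(t | i, a) p(j | i, a) with exponential sojourn times whose rates
% \mu(i,a) are bounded above by a finite constant \bar{\mu}.
\[
  Q(\delta, E \mid i, a)
  \;=\; G(\delta \mid i, a)\sum_{j \in E} p(j \mid i, a)
  \;=\; 1 - e^{-\mu(i,a)\,\delta}
  \;\le\; 1 - e^{-\bar{\mu}\,\delta},
  \qquad (i, a) \in K,
\]
% so Condition 2.1 holds for any \delta > 0 with \xi := e^{-\bar{\mu}\,\delta} > 0.
```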
For each π ∈ Π and (i, t, λ) ∈ E × [0, T] × R, the risk probability criterion is defined by

F^π(i, t, λ) := P^π_{(i,t,λ)}(∫_0^t c(Z(s), W(s)) ds > λ),    (7)

which measures the risk of the system that the total loss incurred during the interval [0, t] exceeds λ when using the policy π. Thus, the optimization problem we are
interested in is to minimize the system's risk F^π(i, t, λ) over the set Π. That is, our goal is to find a policy π* ∈ Π such that

F^{π*}(i, t, λ) = inf_{π∈Π} F^π(i, t, λ) =: F*(i, t, λ) for all (i, t, λ) ∈ E × [0, T] × R,

where F*(i, t, λ) is the optimal value function.
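Before turning to the optimality results, note that the risk probability (7) can also be approximated directly by simulation. The following Python sketch estimates F^π(i, t, λ) by Monte Carlo for a fixed stationary decision rule, reusing the hypothetical helpers loss_rate, sample_sojourn_and_next_state and f from the sketch in Section 2; it merely illustrates the criterion and is not part of the solution method developed below.

```python
def estimate_risk_probability(i0, t0, lam0, n_runs=10000):
    """Monte Carlo estimate of F^pi(i0, t0, lam0) = P(integral of c(Z(s), W(s)) ds over [0, t0] > lam0).

    Reuses loss_rate, sample_sojourn_and_next_state and f from the sketch in Section 2
    (all of them hypothetical placeholders).
    """
    exceed = 0
    for _ in range(n_runs):
        s, i, total_loss = 0.0, i0, 0.0
        while s < t0:
            lam = lam0 - total_loss                   # current remaining loss level
            a = f(i, t0 - s, lam)                     # action chosen by the decision rule
            sojourn, j = sample_sojourn_and_next_state(i, a)
            # only the loss incurred before the horizon t0 counts, as in (7)
            total_loss += loss_rate[(i, a)] * min(sojourn, t0 - s)
            s += sojourn
            i = j
        if total_loss > lam0:
            exceed += 1
    return exceed / n_runs

# Usage sketch (hypothetical data): print(estimate_risk_probability(0, 10.0, 5.0))
```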
Remark 6. (a) Different from the previous study on the probability criteria in SMDPs with reward rates [9,10,12], this paper deals with the loss rate case.
(b) Our optimality problem of minimizing F^π(i, t, λ) can be equivalently transformed into the problem of maximizing 1 − F^π(i, t, λ), but it is not equivalent to the one in [12], since [12] considers the problem of minimizing the risk probability for the reward case. Moreover, the reward rate r in [12] is replaced by the loss rate c in this paper.
Using arguments as in the proof of Proposition 2.2 in [11], we can also prove that Π_RM is a sufficient set of policies for our optimality problem, i.e.,

inf_{π∈Π} F^π(i, t, λ) = inf_{π∈Π_RM} F^π(i, t, λ) for all (i, t, λ) ∈ E × [0, T] × R,

which indicates that it suffices to find optimal policies in the class of h-r randomized Markov policies.
For a given ε ≥ 0, a policy π* ∈ Π is said to be ε-optimal if F^{π*}(i, t, λ) ≤ F*(i, t, λ) + ε for all (i, t, λ) ∈ E × [0, T] × R. A 0-optimal policy is simply called an optimal policy.
3. On the optimal value function and optimality equation. In this section, we show our main results. That is, we prove that the value function is the unique solution to the optimality equation, and that there exists an optimal (or ε-optimal) policy. In addition, we obtain an algorithm for computing ε-optimal policies.
We define the following sets of functions. Let F_m be the set of all Borel measurable functions F : E × [0, T] × R → [0, 1], and let F_r be the set of all F ∈ F_m such that F(i, t, ·) is nonincreasing and right continuous on R for each (i, t) ∈ E × [0, T], and F(i, ·, λ) is nondecreasing and right continuous on [0, T] for each (i, λ) ∈ E × R. For each a ∈ A(i), ϕ ∈ Φ and F ∈ F_m, define the operators

H_a F(i, t, λ) := I_{(−∞, c(i,a)t)}(λ) (1 − Q(t, E|i, a)) + Σ_{j∈E} ∫_0^t F(j, t − u, λ − c(i, a)u) Q(du, j|i, a),

H_ϕ F(i, t, λ) := Σ_{a∈A(i)} ϕ(a|i, t, λ) H_a F(i, t, λ),    HF(i, t, λ) := min_{a∈A(i)} H_a F(i, t, λ),

for every (i, t, λ) ∈ E × [0, T] × R; for f ∈ F we write H_f for H_ϕ with ϕ(·|i, t, λ) the Dirac measure at f(i, t, λ). The operators H_ϕ and H above are different from those for first passage SMDPs with risk probability criteria [12] and those for finite horizon DTMDPs with risk probability criteria [18]. Also, these operators are important in characterizing the value function and optimal policies for finite horizon SMDPs with risk probability criteria and loss rates; see Theorem 3.5 and Theorem 3.6 below. In fact, the operators H_ϕ and H are monotone, that is, H_ϕ F ≤ H_ϕ G and HF ≤ HG whenever F, G ∈ F_m satisfy F ≤ G.

For every π ∈ Π and (i, t, λ) ∈ E × [0, T] × R, we have

F^π(i, t, λ) = P^π_{(i,t,λ)}(∫_0^t c(Z(s), W(s)) ds > λ) = P^π_{(i,t,λ)}(∪_{n≥0} {∫_0^{S_{n+1}∧t} c(Z(s), W(s)) ds > λ}) = lim_{n→∞} P^π_{(i,t,λ)}(∫_0^{S_{n+1}∧t} c(Z(s), W(s)) ds > λ),    (8)

where the second equality follows from Assumption 2.1, and the last equality is due to the nonnegativity of the loss rate and the continuity of probability measures. Based on (8), we define F^π_{−1}(i, t, λ) := I_{(−∞,0)}(λ), and

F^π_n(i, t, λ) := P^π_{(i,t,λ)}(∫_0^{S_{n+1}∧t} c(Z(s), W(s)) ds > λ) for λ ≥ 0, and F^π_n(i, t, λ) := 1 otherwise,

for every n ≥ 0 and (i, t) ∈ E × [0, T]. Obviously, F^π_n ≤ F^π_{n+1} for all n ≥ −1 and lim_{n→∞} F^π_n = F^π. The following lemma is fundamental to our main results.
To give the proof of the following lemma, we need to introduce some further notation, namely a distribution function F_δ and its convolutions, which are used in the estimates below. The following lemma is key to the existence of optimal policies.
Let F_δ be the distribution function on R_+ defined by F_δ(t) := (1 − ξ) I_{[0,δ)}(t) + I_{[δ,+∞)}(t), where ξ, δ are as in Condition 2.1, and let F_δ^{(n)} denote the n-fold convolution of F_δ. It follows from Condition 2.1 and the argument of the proof of Theorem 1 in [17] and Lemma 3.1 in [13] that for any n > k and 0 < t ≤ T,

F_δ^{(n)}(t) ≤ (1 − ξ^k)^{⌊n/k⌋},

where k is a nonnegative integer satisfying k > T/δ, and ⌊n/k⌋ denotes the largest integer not larger than n/k. Below, under Condition 2.1, we use an induction argument to show that, for any two solutions F, G ∈ F_m of the equation F = H_f F with f ∈ F fixed,

|F(i, t, λ) − G(i, t, λ)| ≤ F_δ^{(n)}(t)    (9)

for all (i, t, λ) ∈ E × [0, T] × R and n ≥ 1.
In fact, when n = 1, noting that F − G ≤ 1, we get

|F(i, t, λ) − G(i, t, λ)| = |H_f F(i, t, λ) − H_f G(i, t, λ)| ≤ Q(t, E|i, f(i, t, λ)) ≤ F_δ(t),

where the last inequality is a direct result of Condition 2.1 and the definition of F_δ(t).
Suppose (9) holds for l = n. For the case of l = n + 1, suppose t ∈ [m_t δ, (m_t + 1)δ) for some nonnegative integer m_t; then, since F_δ^{(n)}(t) is a step function, we may rewrite the above integral as a finite sum. For the sake of notation, we denote p := Q([0, t − m_t δ), E|i, f(i, t, λ)). According to Condition 2.1, we have p ≤ 1 − ξ. Thus there exists a nonnegative real number q such that p + q = 1 − ξ.
where the last inequality is due to the nondecreasing property of F, for every (i, t, λ) ∈ E × [0, T] × R and all n ≥ 1. Therefore, noting that 0 < 1 − ξ^k < 1 and letting n → ∞ in (10), it is clear that F = G, which gives F^f = G by part (a), and so the uniqueness follows.

Proof. Since F_n ↑ F, the limit F = lim_{n→∞} F_n exists. It follows from the monotonicity of the operator H that HF_n ≤ HF for each n ≥ −1. Therefore, lim_{n→∞} HF_n ≤ HF. We shall show that the reverse inequality is also true if A(·) is finite. Fix an arbitrary (i, t, λ) ∈ E × [0, T] × R and consider the sets A_n := {a ∈ A(i) : H_a F_n(i, t, λ) ≤ HF(i, t, λ)}, n ≥ −1, and A^* := {a ∈ A(i) : H_a F(i, t, λ) ≤ HF(i, t, λ)}. By the finiteness of A(i), the monotonicity of H, and the assumption that F_n ↑ F, we see that each of these sets is nonempty and A_n ↓ A^*. Let a_n ∈ A_n be such that H_{a_n} F_n(i, t, λ) = HF_n(i, t, λ). By the finiteness of A(i) and the fact that A_n ↓ A^*, there exist a^* ∈ A^*, an integer m and a subsequence {a_{n_k}} of {a_n} satisfying a_{n_k} = a^* for all n_k ≥ m. Since H is a monotone operator and F_n ↑ F, the limit lim_{k→∞} HF_{n_k}(i, t, λ) exists and HF_{n_k}(i, t, λ) = H_{a^*} F_{n_k}(i, t, λ) ≥ H_{a^*} F_m(i, t, λ) for all n_k ≥ m. Letting m → ∞ in the above expression gives that lim_{n→∞} HF_n(i, t, λ) ≥ HF(i, t, λ). The proof is completed.
In fact, Lemma 3.3 gives a condition under which limits and minima can be interchanged.
Lemma 3.4. (a) If F ∈ F_m, then HF ∈ F_m, and there exists an f ∈ F such that HF = H_f F. (b) If F ∈ F_r, then both H_a F and HF are in F_r for any a ∈ A(i).
(c) If F n ∈ F r and F n ≤ F n+1 for each n ≥ 0, then lim n→∞ F n ∈ F r .
Proof. (a) Under Assumption 2.1, the measurable selection theorem (see Proposition D.5 in [7]) ensures the existence of an f ∈ F such that HF(i, t, λ) = H_f F(i, t, λ) for every (i, t, λ) ∈ E × [0, T] × R, and thus (a) follows. (b) It follows from the definition of H_a and F ∈ F_r that H_a F ∈ F_m; furthermore, H_a F(i, t, ·) is monotone nonincreasing and right continuous on R for each (i, t) ∈ E × [0, T], and, on the other hand, H_a F(i, ·, λ) is right continuous on [0, T] for each (i, λ) ∈ E × R and a ∈ A(i). To prove that H_a F ∈ F_r, we need only show that H_a F(i, ·, λ) is monotone nondecreasing on [0, T] for each (i, λ) ∈ E × R. Indeed, for fixed (i, λ) ∈ E × R and a ∈ A(i), if t_2 > t_1 > λ/c(i, a), a direct calculation shows that H_a F(i, t_2, λ) ≥ H_a F(i, t_1, λ). We now turn to proving HF ∈ F_r. By part (a), we easily see that HF ∈ F_m. By the monotonicity of H, HF(i, t, ·) is nonincreasing on R and HF(i, ·, λ) is nondecreasing on [0, T]. Hence, to prove that HF ∈ F_r, we need only show that HF(i, t, ·) is right continuous on R for each (i, t) ∈ E × [0, T] and that HF(i, ·, λ) is right continuous on [0, T] for each (i, λ) ∈ E × R. Indeed, for each λ ∈ R, let {λ_k} be an arbitrary sequence such that λ_k ↓ λ. Using the monotonicity of HF(i, t, ·) on R, we have HF(i, t, λ_k) ≤ HF(i, t, λ) for every k, which implies that lim_{λ_k↓λ} HF(i, t, λ_k) ≤ HF(i, t, λ). On the other hand, since A(i) is finite, there exist a subsequence {λ_{k_m}} and a^* ∈ A(i) such that HF(i, t, λ_{k_m}) = H_{a^*} F(i, t, λ_{k_m}). Then, by the right continuity of H_a F for any a ∈ A(i), it follows that lim_{λ_k↓λ} HF(i, t, λ_k) = lim_{m→∞} H_{a^*} F(i, t, λ_{k_m}) = H_{a^*} F(i, t, λ) ≥ HF(i, t, λ). Therefore, combining the two inequalities gives the desired result that lim_{λ_k↓λ} HF(i, t, λ_k) = HF(i, t, λ). We next show that HF(i, ·, λ) is right continuous on [0, T] for each (i, λ) ∈ E × R. For any sequence {t_k} ⊂ [0, T] with t_k ↓ t, choosing a_t ∈ A(i) such that HF(i, t, λ) = H_{a_t} F(i, t, λ) and using the right continuity of H_{a_t} F(i, ·, λ), we obtain lim_{t_k↓t} HF(i, t_k, λ) ≤ lim_{t_k↓t} H_{a_t} F(i, t_k, λ) = H_{a_t} F(i, t, λ) = HF(i, t, λ). On the other hand, noting that HF(i, ·, λ) is nondecreasing on [0, T], we have HF(i, t_k, λ) ≥ HF(i, t, λ), which implies that lim_{t_k↓t} HF(i, t_k, λ) ≥ HF(i, t, λ). Hence, we have lim_{t_k↓t} HF(i, t_k, λ) = HF(i, t, λ). This completes the proof.
(c) Since F_n ≤ F_{n+1} for each n ≥ 0, the limit F := lim_{n→∞} F_n exists. It is easy to see that F(i, t, ·) is nonincreasing on R and F(i, ·, λ) is nondecreasing on [0, T]. Now, to show F ∈ F_r, it suffices to prove that F(i, t, ·) is right continuous on R for each (i, t) ∈ E × [0, T] and F(i, ·, λ) is right continuous on [0, T]. We next show that F(i, t, ·) is right continuous on R, while the fact that F(i, ·, λ) is right continuous on [0, T] can be proved similarly. Indeed, for each (i, t, λ) ∈ E × [0, T] × R and any sequence {λ_k} ⊂ R such that λ_k ↓ λ, we have F_n(i, t, λ_k) ≤ F(i, t, λ_k) for any n, k ≥ 0. Hence, for every n ≥ 0, lim_{λ_k↓λ} F(i, t, λ_k) ≥ lim_{λ_k↓λ} F_n(i, t, λ_k) = F_n(i, t, λ), which implies that lim_{λ_k↓λ} F(i, t, λ_k) ≥ F(i, t, λ). On the other hand, recalling that F(i, t, ·) is nonincreasing on R, we immediately obtain lim_{λ_k↓λ} F(i, t, λ_k) ≤ F(i, t, λ), which, together with the previous inequality, yields that lim_{λ_k↓λ} F(i, t, λ_k) = F(i, t, λ). Hence, F ∈ F_r and the proof is completed.
We now state our first main result, which gives an iteration algorithm for computing the value function F*. Define F*_{−1}(i, t, λ) := I_{(−∞,0)}(λ) and F*_{n+1} := HF*_n for every n ≥ −1 and (i, t, λ) ∈ E × [0, T] × R.

Theorem 3.5. Suppose that Assumption 2.1 holds. Then the following assertions hold. (a) F*_n ≤ F*_{n+1} for all n ≥ −1 and lim_{n→∞} F*_n = F*; moreover, F* ∈ F_r. (b) F* is the unique solution in F_r to the optimality equation F = HF. (c) There exists an f ∈ F such that F* = H_f F*, and any such f, regarded as an h-r deterministic stationary policy, is optimal.

Proof. (a) From Lemma 3.4 (b), we have F*_n ∈ F_r. Hence F*_n is in F_m, and thus HF*_n is well defined for every n ≥ −1. Furthermore, it is easy to see that F*_n ≤ F*_{n+1} for all n ≥ −1, and hence lim_{n→∞} F*_n =: F̃ exists. By Lemma 3.4 (c), F̃ is in F_r. To complete the proof, it suffices to prove that F̃ ≤ F* and F̃ ≥ F*.
To show F̃ ≤ F*, it suffices to show that F*_n ≤ F^π_n for all π ∈ Π_RM and n ≥ −1. We now prove this fact by induction. Indeed, it is obviously true for n = −1. Suppose that F*_n ≤ F^π_n for all π ∈ Π_RM and some n ≥ −1. Then, for any η = {η_0, η_1, . . .} ∈ Π_RM, we have F*_{n+1} ≤ F^η_{n+1}, where the first inequality is due to the induction hypothesis, and the last equality follows from Lemma 3.1 (b). Therefore, by induction, we obtain F*_n ≤ F^π_n for all π ∈ Π_RM and n ≥ −1. Hence, F̃ ≤ F^π for all π ∈ Π_RM, which together with the arbitrariness of π yields F̃ ≤ F*. Now it remains to prove that F̃ ≥ F*. By Lemma 3.4 (a), we obtain some decision rule f ∈ Π_DM such that F̃ = F^f. Indeed, it follows from Lemma 3.3 that F̃ = lim_{n→∞} HF*_n = H lim_{n→∞} F*_n = HF̃. On the other hand, by Lemma 3.4 (a), there exists an f ∈ Π_DM such that HF̃ = H_f F̃. Therefore, we have F̃ = H_f F̃. It follows from Lemma 3.2 that F̃ = F^f, so that F̃ ≥ F*. We already know F̃ ≤ F*. This completes the proof.
(b) It follows from the proof of Theorem 3.1 (a) that F * satisfies the optimality equation F * = HF * . Suppose that G is a solution to F = HF . By Lemma 3.4 (a), there exist decision rules g, f such that G = H g G and F * = H f F * .
(c) Combining part (b) and Lemma 3.4 yields that there exists f ∈ F such that F * = H f F * . The optimality of f ∈ F is an immediate result of Lemma 3.2 (b).
We now state another main result, concerning the approximation of F* and the rules for stopping the iteration.
Theorem 3.6. Suppose that Condition 2.1 holds. (a) For any n ≥ 1, the error bound (18) holds, where k := ⌊T/δ⌋ + 1 is as in Lemma 3.2. Here and below, we write sup_{i,t,λ} instead of sup_{(i,t,λ)∈E×[0,T]×R} for the sake of convenience.
Indeed, for all l ≥ 0, it is clear that the base case holds, and by the induction hypothesis we find that the corresponding inequality also holds for l + 1; thus (21) holds for all l ≥ 0. It follows that the desired estimate holds, where the second inequality is due to (21) and the nonincreasing property of the quantities involved, where f_l ∈ F satisfies HF*_{l−1} = H_{f_l} F*_{l−1}. Thus the nonincreasing property follows.
The existence of f is ensured by Theorem 3.5 (a) and Lemma 3.4 (a). Using the argument as in (20) as well as Lemma 3.4 (b), we obtain an estimate which, together with (18), gives F^f − F*_{(n0+1)k} ≤ ε. Therefore, the fact that F* ≥ F*_{(n0+1)k} yields F^f − F* ≤ F^f − F*_{(n0+1)k} ≤ ε, which means that f is an ε-optimal policy.

By Theorem 3.6, we obtain an effective way of evaluating the value function and optimal policies. The iteration algorithm for ε-optimal policies consists of the following two steps:

Step 1. For any sufficiently small ε > 0, compute F*_{(n0+1)k−1} and F*_{(n0+1)k} using the iteration algorithm in Theorem 3.5 (a), with n_0 = ln(εξ^k/k)/ln(1 − ξ^k).

Step 2. Take f ∈ F such that H_f F*_{(n0+1)k−1} = HF*_{(n0+1)k−1}; then f is an ε-optimal policy.
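The two steps above can be organized as a plain fixed-point iteration. The Python sketch below is a minimal illustration under the assumption that an implementation apply_H of the operator H on a finite grid of (i, t, λ) values is available and returns both the updated value table and the minimizing actions; the grid, apply_H and the particular reading of the stopping index n_0 are assumptions of this sketch rather than prescriptions of the paper.

```python
import math

def value_iteration(apply_H, F_init, n_iterations):
    """Iterate F_{n+1} = H F_n starting from F_{-1} = F_init.

    apply_H: assumed user-supplied function mapping a value table to
             (updated value table, table of minimizing actions).
    F_init:  table representing F_{-1}(i, t, lam) = 1 if lam < 0 else 0 on the chosen grid.
    Returns the last two iterates and the minimizers attained at the final step.
    """
    F_prev, F_curr, argmin_actions = None, F_init, None
    for _ in range(n_iterations):
        F_prev = F_curr
        F_curr, argmin_actions = apply_H(F_prev)
    return F_prev, F_curr, argmin_actions

def stopping_index(epsilon, xi, k):
    """Total number of iterations suggested by Step 1 (one possible reading of the rule)."""
    n0 = math.log(epsilon * xi**k / k) / math.log(1.0 - xi**k)
    return (math.ceil(n0) + 1) * k          # (n0 + 1) * k applications of H in total

# Usage sketch: F_last, F_final, f_eps = value_iteration(apply_H, F_init, stopping_index(1e-4, xi, k));
# the action table f_eps then plays the role of the decision rule extracted in Step 2.
```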
4. An application to the financial market. In this section, we apply our results to the business cycle and demonstrate how to compute the value function and an optimal policy based on the value iteration algorithm. Example 4.1. (A business cycle) Consider a business cycle with three states, say 1, 2 and 3, which represent the depression, the recovery and the peak phases, respectively. During recessions, the company's decision maker tries to take a corresponding investment strategy to adapt the company to fluctuations in the business cycle. Suppose that in state 1 the decision maker may take either an action a_11 with a loss rate c(1, a_11) or an action a_12 with a loss rate c(1, a_12), while in state 2 he may choose either an action a_21 with a loss rate c(2, a_21) or an action a_22 with a loss rate c(2, a_22). However, the system in state 3 incurs the loss rate c(3, a_31) = 0 with the non-action denoted by a_31. The transition mechanism of the model is given by the following data. When the action a (a = a_11 or a_12) is used, the system jumps to state j with probability p(j|1, a) (j = 1, 2, 3) after staying at state 1 for a uniformly distributed random time on the interval [0, µ(1, a)] with µ(1, a) > 0. If the action a (a = a_21 or a_22) is chosen when the system is in state 2, the system jumps to state j with probability p(j|2, a) (j = 1, 2, 3) after an exponentially distributed random time with parameter µ(2, a) > 0. Moreover, we assume that when it falls into state 3, the system transits to state 3 with probability one after an exponentially distributed random time with parameter µ(3, a_31). For such a business cycle, the company manager wishes to find an optimal policy with the minimum risk probability over a finite horizon [0, T] with T > 0.
We now formulate the system as an SMDP model with the data as follows. The state space is E = {1, 2, 3}; the action sets are A(1) = {a_11, a_12}, A(2) = {a_21, a_22} and A(3) = {a_31}; the horizon T is assumed to be 15; the semi-Markov kernel Q(·, ·|i, a) is of the following particular form: Q(t, j|i, a) = G(t|i, a)p(j|i, a) for every t ∈ R_+, i ∈ E, j ∈ E and a ∈ A(i), in which G(t|i, a) and p(j|i, a) denote the distribution functions of the sojourn times and the transition probabilities, respectively, and the distribution functions G(t|i, a) are as described above (uniform in state 1 and exponential in states 2 and 3). From the data above, Condition 2.1 holds with δ = 1 and ξ = min{29/30, 19/20, e^{−0.05}, e^{−0.025}, e^{−0.3}} = e^{−0.3}, and thus Assumption 2.1 is fulfilled. Therefore, the value iteration is valid and the existence of an optimal policy is ensured by Theorems 3.1 (a) and 3.2. The constant k is taken to be 16 (so that k > T/δ). By Theorem 3.2, the iterations are stopped at n = 2023 =: n_0 (where ε := 10^{−4}), and the condition 0 ≤ F* − F*_{(n0+1)k} < 10^{−4} is satisfied.
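As a small sanity check on the constants used in this example (an illustrative script, not the computation carried out in the paper), one can verify the value of ξ and the choice of k numerically:

```python
import math

# Candidate values of 1 - Q(delta, E | i, a) listed in the text, with delta = 1.
candidates = [29 / 30, 19 / 20, math.exp(-0.05), math.exp(-0.025), math.exp(-0.3)]

xi = min(candidates)              # xi = e^{-0.3}, approximately 0.7408
T, delta = 15, 1
k = math.floor(T / delta) + 1     # k = 16, which indeed satisfies k > T / delta

print(f"xi = {xi:.4f}, k = {k}, k > T/delta: {k > T / delta}")
```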
Note that, since c(3, a_31) = 0 and state 3 is absorbing, F*(3, t, λ) = I_{(−∞,0)}(λ) for all (t, λ) ∈ [0, 15] × R. Hence, we only compute the functions F*_{(n0+1)k}(1, t, λ) and F*_{(n0+1)k}(2, t, λ). After applying the value iteration algorithm described in Theorem 3.1 (a), we obtain the numerical results shown in Figure 1 and Figure 2, which display the optimal value functions F*_{(n0+1)k}(i, t, λ) (i = 1, 2) with respect to both t and λ. Figure 1 and Figure 2 illustrate that the optimal value function is nonincreasing in the loss level λ and nondecreasing in the planning horizon t. Figure 3 plots the function H_a F*_{(n0+1)k−1}(i, 10, λ) for the planning horizon t = 10. It is clear that the risk with a_12 is much lower than that with a_11 for the case λ ∈ (0, 49.9), while the risk with a_12 is much higher than that with a_11 for the case λ ∈ (49.9, 120). Figure 4 leads to similar conclusions for the planning horizon t = 15. To better illustrate the ε-optimal policy, we also plot the function λ*(i, t) below, which depends on the states and planning horizons.
such that F*(i, t, λ) = H_{f*} F*(i, t, λ) for every (i, t, λ) ∈ E × [0, T] × R. By Theorem 3.6, such a policy f* is ε-optimal with the minimum risk probability.