A perturbation approach to a class of discounted approximate value iteration algorithms with Borel spaces

The present paper gives computable performance bounds for the approximate value iteration (AVI) algorithm when the approximation operators used satisfy the following properties: (i) they are positive linear operators; (ii) constant functions are fixed points of such operators; (iii) they have a certain continuity property. Such operators define transition probabilities on the state space of the controlled system. This has two important consequences: (a) one can see the approximating function as the average value of the target function with respect to the induced transition probability; (b) the approximation step in the AVI algorithm can be thought of as a perturbation of the original Markov model. These two facts enable us to give finite-time bounds for the AVI algorithm performance depending on the accuracy with which the operators approximate the cost function and the transition law of the system. The results are illustrated with numerical approximations for a class of inventory systems.


1. Introduction. Markov decision processes provide a highly flexible framework for analyzing sequential optimization problems in a number of fields. See, for instance, the books by Bertsekas [3], Bertsekas and Tsitsiklis [5], Hernández-Lerma [11], Hernández-Lerma and Lasserre [12,13] and Puterman [19] for the theory, and the survey papers by Stidham and Weber [23], White [27,28,29], Yakowitz [30] and Yeh [31] for real and potential applications. However, the range of usefulness of this framework is seriously limited by the so-called curse of dimensionality, which prevents numerical computation in most applications, especially in those cases having large or infinite state spaces. A plethora of approximation schemes has been proposed to break down, or at least alleviate, the curse of dimensionality, producing suboptimal but hopefully "good" solutions (see, for instance, Arruda et al. [2], Bertsekas [4], Jiang and Powell [14], Powell [16,17,18], Rust [20], Sutton [24]). The bulk of the references considers the approach called the approximate value iteration (AVI) algorithm and deals mainly with finite models. In the present paper we also study AVI algorithms, but for models with Borel spaces.
Roughly speaking, AVI algorithms combine suitable function approximation schemes with the standard value iteration algorithm. In many cases, the approximation schemes are represented by operators, and the quality of the resulting AVI algorithms depends strongly on their properties; for instance, convergence cannot be guaranteed unless the approximation operator is non-expansive (de Farias and Van Roy [9], Gordon [10]). Thus, this kind of procedure raises the following three important issues: (I) convergence or stability of the AVI algorithms; (II) once convergence is ensured, the second issue is to provide computable bounds for the error produced by the approximating functions; (III) the third one, perhaps the most important from the applications point of view, is to provide computable error bounds for the suboptimal policies generated by the algorithms.
Concerning issues (I)-(III) above, a quick glance at the references by Bertsekas [4], Bertsekas and Tsitsiklis [5], Powell [16,17,18] and Rust [20] (and their extensive references) evidences, on the one hand, the lack of general bounds for the performance of AVI algorithms, and for other approximation schemes as well, and, on the other hand, that most of the papers focus on finite models. There are of course several exceptions dealing with Borel spaces, among which we can mention the papers by Almudevar [1], Dufour and Prieto-Rumeau [8], Munos [15] and Stachurski [21]. The main differences between the present work and these latter papers are discussed below.
The present work studies a class of AVI algorithms for discounted Markov decision models with Borel spaces and bounded costs, and addresses issues (I)-(III) above for approximation operators with the following properties: (i) they are positive linear operators; (ii) constant functions are fixed points of such operators; (iii) they have a certain continuity property (see Definition 3.1 in Section 3 for a precise statement of these properties). Many operators studied in approximation theory satisfy these properties, for instance, piece-wise constant approximation operators, linear and multilinear interpolators, kernel-based interpolators (Gordon [10], Stachurski [21]), certain aggregation-projection operators (Van Roy [25]), Schoenberg splines, and the Hermite-Fejér and Bernstein operators (Beutel [6], DeVore [7]), among others.
The key point in this approach is that operators satisfying properties (i)-(iii) define transition probabilities on the state space of the controlled system. This has two important consequences: (a) one can see the approximating function as the average value of the target function with respect to the induced transition probability; (b) the approximation step in the AVI algorithm can be thought of as a perturbation of the original Markov model. These two facts allow us to give finite-time bounds for the AVI algorithm performance in terms of the accuracy of the approximations given by such operators for the primitive data of the model, namely, the one-step reward function and the system transition law. The accuracy of the approximations is measured by means of the supremum norm for bounded functions and the total variation norm for finite signed measures.
A remarkable and perhaps somewhat surprising fact is that, once the approximation step is seen as a perturbation of the original Markov model, the convergence in (I) and the bounds for problems (II)-(III) are directly established with quite elementary proofs. To the best of the authors' knowledge, facts (a) and (b) have largely passed unnoticed except in the paper by Gordon [10], who refers to property (a) as an "intriguing property" that allows one to see "averagers as Markov processes", whereas property (b) is plainly ignored. Thus, Gordon [10] does not take advantage of these facts to provide error bounds for the AVI algorithm performance. In fact, Gordon sets aside properties (i)-(iii) and focuses on non-expansive operators. As we shall see in Remark 3.2, properties (i)-(iii) imply that the operators are non-expansive. In spite of these differences, we follow Gordon's practice and, because of property (a) above, call operators with properties (i)-(iii) averagers.
The papers by Almudevar [1], Munos [15] and Stachurski [21] also deal with approximations for models with Borel spaces, but they differ from the present work in several ways. For instance, Almudevar [1] takes an abstract approach: he first studies general approximate fixed-point iteration schemes, and then applies the results to several types of Markov decision problems. However, Almudevar [1] only obtains asymptotic bounds under the assumption that the algorithm is stable (which means that the errors of the algorithm are uniformly bounded), and the bounds are not tied to the accuracy with which the approximation scheme represents the primitive data of the model. Additionally, for the continuous state space case, Almudevar [1] requires that the transition law have a (conditional) density function, which is not required in the present work.
On the other hand, Munos [15] provides two kinds of L^p bounds using certain quantities called "transition probabilities concentration coefficients" and "first and second discounted state future distribution concentration coefficients". To obtain the first kind of bounds, he assumes that the algorithm is stable and bounds the algorithm performance asymptotically, as in Almudevar's paper [1]. The second kind is based on the so-called "Bellman residual", but nothing is said about how it can be bounded with computable quantities. These "concentration coefficients" seem to be quite difficult to compute or to bound except in some simple cases. Moreover, in the continuous state space case, Munos requires that the transition law have a density.
Stachurski [21] focuses on non-expansive operators and first gives performance bounds in terms of quantities that are not directly computable. He removes this drawback under the additional assumption that the dynamic programming or Bellman operator preserves monotonicity of functions. This assumption seems suitable for some problems in economics but obviously limits the general usefulness of his results. Jiang and Powell [14] also study the AVI algorithm in problems with "monotone structure" and prove the convergence of the algorithm for finite models under several technical conditions. An alternative to AVI algorithms is given by discretization procedures. These procedures require that the control model satisfy strong structural properties, such as Lipschitz continuity of the one-step cost function and of the system evolution law, as well as of the multifunction defined by the admissible action sets (Hernández-Lerma [11], Dufour and Prieto-Rumeau [8]). Under such hypotheses, the latter two references give explicit bounds for the error in approximating the optimal value function; however, Hernández-Lerma [11] only analyzes the performance of the approximating policies asymptotically, as the size of the discretization mesh goes to zero, whereas Dufour and Prieto-Rumeau [8] provide a performance bound depending on a quantity that cannot be controlled by the size of the discretization mesh. It is worth mentioning that in the present paper we obtain the performance bounds assuming only that the model satisfies standard continuity-compactness hypotheses.
Summing up, the main contribution of the present work is a novel perturbation approach to analyze AVI algorithms defined by a class of approximation operators we call averagers. This framework allows us to give performance bounds depending on the accuracy with which the averagers approximate the one-step cost function and the transition law. The averagers include many of the operators used in function approximation theory and in approximate dynamic programming itself. For instance, the examples of non-expansive operators given in Stachurski [21] and the projection-aggregation operator in Van Roy [25] are in fact averagers. Note that discretization procedures can be recast as piece-wise constant approximations, which are also represented by averagers. Hence, the results of the present paper are applicable to all these cases.
The remainder of the present work is organized as follows. Section 2 is largely expository since it contains a brief description of the Markov control model and some well-known results for the discounted optimal control problem with bounded costs. Section 3 introduces the approximate value iteration algorithm, the kind of approximation operators we consider, and the perturbed Markov models associated with these operators. Next, Section 4 provides bounds for the approximation problem (II) and for the performance of the AVI algorithm (III). In Section 5, the results are illustrated with an inventory system with finite capacity, linear production cost, no set-up cost and no back-orders. The paper ends with some concluding remarks in Section 6.
2. The discounted cost criterion. Throughout the work we use the following notation. For a topological space (S, τ), B(S) denotes the Borel σ-algebra generated by the topology τ, and "measurability" always means Borel measurability. Moreover, M(S) is the class of measurable functions on S, whereas M_b(S) is the subspace of bounded measurable functions endowed with the supremum norm $||u||_\infty := \sup_{s \in S} |u(s)|$, u ∈ M_b(S). The subspace of bounded continuous functions is denoted by C_b(S). For a subset A ⊆ S, I_A stands for the indicator function of A, that is, I_A(s) = 1 for s ∈ A and I_A(s) = 0 for s ∉ A. A Borel space Y is a measurable subset of a complete separable metric space endowed with the inherited metric.
Let M = (X, A, {A(x) : x ∈ X}, R, Q) be the standard Markov control model. This is thought of as a model of a controlled stochastic process {(x_n, a_n)}, where the state process {x_n} takes values in the Borel space X and the control process {a_n} takes values in the Borel space A. The controlled process evolves as follows: at each time n ∈ N_0 := {0, 1, ...}, the controller observes the system in some state x_n = x and chooses a control a_n = a from the admissible control subset A(x), which is assumed to be a Borel subset of A. It is also assumed that the set of admissible state-action pairs K := {(x, a) : x ∈ X, a ∈ A(x)} is a Borel subset of X × A, and that the controller receives a reward R(x, a), where R is a measurable function on K. Moreover, the controlled system moves to a new state x_{n+1} = x' according to the probability measure Q(·|x, a), where Q is a stochastic kernel on X given K, that is, Q(·|x, a) is a probability measure on X for each pair (x, a) ∈ K, and Q(B|·, ·) is a Borel measurable function on K for each Borel subset B of X. Then the controller chooses a new control a_{n+1} = a' ∈ A(x'), receiving a reward R(x', a'), and so on.
Let H_n = K^n × X for n ∈ N and H_0 := X. Observe that a generic element of H_n has the form h_n = (x_0, a_0, x_1, a_1, ..., x_{n−1}, a_{n−1}, x_n), where (x_k, a_k) ∈ K for k = 0, ..., n − 1 and x_n ∈ X. A control policy is a sequence π = {π_n} where π_n(·|·) is a stochastic kernel on A given H_n satisfying the constraint π_n(A(x_n)|h_n) = 1 for all h_n ∈ H_n, n ∈ N_0. Now let F be the class of all measurable functions f : X → A such that f(x) ∈ A(x) for each x ∈ X; the elements of F are called selectors. A stationary policy is a policy π = {π_n} for which there exists a selector f ∈ F such that π_n(·|h_n) is concentrated at f(x_n) for each h_n ∈ H_n and n ∈ N_0. Following a standard convention, the stationary policy π is identified with the selector f. The class of all policies is denoted by Π, and the class of all stationary policies is identified with the class F.
Let Ω := (X × A)^∞ be the canonical sample space and F the product σ-algebra. For each policy π = {π_n} ∈ Π and "initial" state x_0 = x ∈ X, there exists a probability measure P_x^π on the measurable space (Ω, F) that governs the evolution of the controlled process {(x_n, a_n)}. The discounted reward criterion is given as
$$V_\pi(x) := E_x^\pi \sum_{k=0}^{\infty} \alpha^k R(x_k, a_k), \quad x \in X, \ \pi \in \Pi,$$
where the discount factor α ∈ (0, 1) is fixed and E_x^π denotes the expectation operator with respect to the probability measure P_x^π. The optimal control problem is to find a control policy π* ∈ Π such that
$$V_{\pi^*}(x) = V^*(x) := \sup_{\pi \in \Pi} V_\pi(x), \quad x \in X.$$
The policy π* is called a discounted optimal policy, while V* is called the discounted optimal value function.
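The expectation defining the discounted criterion can be approximated by simulation. The following Python sketch estimates the discounted return of a fixed stationary policy by truncated Monte Carlo on a toy finite-state model; the model, policy and parameters are illustrative choices of our own, not taken from the paper. Since the reward is bounded by 1, the estimate must lie below 1/(1 − α) = 2.5.

```python
import random

# Monte Carlo estimate of V_pi(x) = E[sum_k alpha^k R(x_k, a_k)]
# for a toy controlled random walk on X = {0,...,4} under a fixed
# stationary policy. All model ingredients here are illustrative.

ALPHA = 0.6          # discount factor alpha in (0, 1)

def reward(x, a):
    return 1.0 - 0.1 * abs(x - a)          # bounded one-step reward in [0.8, 1]

def step(x, a, rng):
    # transition law Q(.|x, a): drift toward a plus noise, clipped to {0,...,4}
    drift = 1 if a > x else (-1 if a < x else 0)
    nxt = x + drift + rng.choice([-1, 0, 1])
    return min(max(nxt, 0), 4)

def policy(x):
    return 2                                # stationary selector f(x) = 2

def discounted_return(x0, horizon, rng):
    total, disc, x = 0.0, 1.0, x0
    for _ in range(horizon):
        a = policy(x)
        total += disc * reward(x, a)
        disc *= ALPHA
        x = step(x, a, rng)
    return total

rng = random.Random(0)
est = sum(discounted_return(0, 50, rng) for _ in range(2000)) / 2000
print(round(est, 3))    # bounded above by 1/(1 - ALPHA) = 2.5
```

The truncation at horizon 50 introduces an error of at most α^50/(1 − α), which is negligible here.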
Each of Assumptions 1 and 2 below guarantees that the discounted reward criterion is well defined and that stationary optimal policies exist.
Assumption 1. (a) The one-step reward R is a bounded continuous function on K. (b) A(x) is a non-empty compact subset of A for each x ∈ X, and the multifunction x ↦ A(x) is continuous. (c) The mapping $(x, a) \mapsto \int_X u(y)\, Q(dy|x, a)$ is continuous on K for each function u ∈ C_b(X).

Assumption 2. (a) The function R(·, ·) is bounded by a constant K > 0; moreover, for each x ∈ X, the mapping a ↦ R(x, a) is continuous on A(x). (b) A(x) is a non-empty compact subset of A for each x ∈ X. (c) For each x ∈ X, the mapping $a \mapsto \int_X u(y)\, Q(dy|x, a)$ is continuous on A(x) for each function u ∈ M_b(X).
Throughout the paper, C(X) denotes either C_b(X) or M_b(X), depending on whether Assumption 1 or Assumption 2 is being used, respectively. Then, under either Assumption 1 or Assumption 2, the dynamic programming operator
$$Tu(x) := \sup_{a \in A(x)} \Big[ R(x, a) + \alpha \int_X u(y)\, Q(dy|x, a) \Big], \quad x \in X,$$
is a contraction operator from the Banach space (C(X), ||·||_∞) into itself with contraction factor α (see, for instance, Hernández-Lerma [11], Lemma 2.5, p. 20).

Ó. VEGA-AMAYA AND J. LÓPEZ-BORBÓN
Moreover, by a selection theorem (Hernández-Lerma [11], D.3, p. 130), for each u ∈ C(X) there exists a selector f_u ∈ F such that
$$Tu(x) = R(x, f_u(x)) + \alpha \int_X u(y)\, Q(dy|x, f_u(x)), \quad x \in X. \qquad (1)$$
We shall refer to the policy f_u as a u-greedy policy.
Moreover, for each f ∈ F we shall write
$$R_f(x) := R(x, f(x)) \quad \text{and} \quad Q_f u(x) := \int_X u(y)\, Q(dy|x, f(x)), \quad x \in X,$$
for each measurable function u on X for which the integral is well defined. Now define the operators
$$T_f u := R_f + \alpha Q_f u, \quad f \in F,$$
and observe that with this notation equation (1) becomes Tu = T_{f_u}u. Assumption 1(a) (or Assumption 2(a)) implies that T_f is a contraction operator from the Banach space (M_b(X), ||·||_∞) into itself with contraction factor α. The Banach fixed-point theorem and standard dynamic programming arguments yield the following well-known result (Hernández-Lerma [11], Theorem 2.2, p. 19).
Theorem 2.1. Assume either Assumption 1 or Assumption 2 holds. Then: (a) the optimal value function V* is the only fixed point in C(X) of the operator T; (b) a stationary policy f ∈ F is optimal if and only if V* = T_f V*; (c) there exists a stationary policy f* such that V* = T_{f*} V*; hence, f* is optimal; (d) ||T^n u − V*||_∞ → 0 at a geometric rate for any u ∈ C(X).

Theorem 2.1(a)-(c) gives a solution to the optimal control problem; however, the computation of an optimal stationary policy requires that the optimal value function be known in advance, which, unfortunately, only occurs in a few very simple cases. Thus, based on Theorem 2.1(d), one can seek approximations of the value function V* by means of the value iteration (VI) algorithm given as
$$V_k := T V_{k-1} = T^k V_0, \quad k \in \mathbb{N}, \qquad (2)$$
where V_0 ∈ C(X) is an arbitrary function. The VI algorithm prescribes the computation of a V_k-greedy policy f ∈ F once some stopping rule is satisfied, and then approximates the optimal value function V* by means of V_f. This section closes with a result that gives a bound for the approximation error ||V* − V_f||_∞.
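The VI recursion and the geometric convergence in Theorem 2.1(d) are easy to check numerically. The following Python sketch runs value iteration on an illustrative three-state, two-action model (the data are our own, not the paper's) and extracts a greedy selector:

```python
import numpy as np

# Value iteration V_k = T V_{k-1} on a small finite MDP, illustrating
# that ||T^n u - V*|| -> 0 geometrically with rate alpha.

alpha = 0.6
R = np.array([[1.0, 0.5],
              [0.0, 1.0],
              [0.5, 0.2]])                              # R[x, a]
Q = np.array([[[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]],
              [[0.2, 0.6, 0.2], [0.0, 0.1, 0.9]],
              [[0.5, 0.0, 0.5], [0.3, 0.3, 0.4]]])      # Q[x, a, y]

def T(v):
    # dynamic programming operator:
    # (Tv)(x) = max_a [ R(x,a) + alpha * sum_y Q(y|x,a) v(y) ]
    return (R + alpha * Q @ v).max(axis=1)

v = np.zeros(3)
gaps = []
for _ in range(60):
    v_new = T(v)
    gaps.append(np.abs(v_new - v).max())
    v = v_new

f_greedy = (R + alpha * Q @ v).argmax(axis=1)   # v-greedy selector
# successive gaps shrink at least by the factor alpha (contraction)
print(f_greedy, gaps[1] <= alpha * gaps[0] + 1e-12)
```

After 60 iterations the gap is below α^59, so v is a fixed point of T to machine precision.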
Theorem 2.2. Suppose either Assumption 1 or Assumption 2 holds. Let V ∈ C(X) be arbitrary and f be a V-greedy policy. Then
$$||V^* - V_f||_\infty \leq \frac{2\alpha}{1-\alpha}\, ||TV - V||_\infty.$$

3. Approximation operators and perturbed models. The computational burden associated with the VI algorithm (2) is prohibitive for systems with large finite state spaces and is plainly infeasible in the continuous case. The standard way of tackling this difficulty is to intersperse the application of the dynamic programming operator with an approximation scheme. In many cases the approximation scheme can be represented by means of an operator L that maps a function v into a function Lv belonging to a suitable subspace of functions; in this case, Lv represents an approximation to the function v. This yields two slightly different approximation procedures depending on which operator, T or L, is applied first. Both approximation procedures are known as approximate value iteration (AVI) algorithms, and they are defined as follows:
$$\widetilde{V}_k := \widetilde{T} \widetilde{V}_{k-1} \quad \text{and} \quad \widehat{V}_k := \widehat{T} \widehat{V}_{k-1}, \quad k \in \mathbb{N}, \qquad (3)$$
where $\widetilde{V}_0$, $\widehat{V}_0$ are arbitrary functions, $\widetilde{T} := TL$ and $\widehat{T} := LT$.
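Both iterations in (3) are easy to run on a finite model. The Python sketch below (with illustrative random data of our own) iterates LT and TL with a simple pairwise-averaging operator L and checks that the two fixed points are linked through L, a fact exploited later in the paper:

```python
import numpy as np

# The two AVI iterates v_hat_k = (L T) v_hat_{k-1} and
# v_til_k = (T L) v_til_{k-1} on a random finite MDP, with L a simple
# "averager": a row-stochastic matrix aggregating states in pairs.

alpha = 0.6
rng = np.random.default_rng(1)
nS, nA = 6, 2
R = rng.uniform(0, 1, (nS, nA))
Q = rng.uniform(0, 1, (nS, nA, nS))
Q /= Q.sum(axis=2, keepdims=True)             # transition law Q[x, a, y]

# averager: states {0,1}, {2,3}, {4,5} share the average of their values
L = np.zeros((nS, nS))
for i in range(nS):
    block = (i // 2) * 2
    L[i, block] = L[i, block + 1] = 0.5       # positive, rows sum to 1

def T(v):
    return (R + alpha * Q @ v).max(axis=1)

v_hat = np.zeros(nS)      # iterates of LT
v_til = np.zeros(nS)      # iterates of TL
for _ in range(200):
    v_hat = L @ T(v_hat)
    v_til = T(L @ v_til)

# both iterations converge (alpha-contractions), and their fixed points
# are linked: the LT fixed point is L applied to the TL fixed point
print(np.abs(v_hat - L @ v_til).max() < 1e-10)
```

The link follows because, if w is the fixed point of TL, then Lw = L(TLw) = (LT)(Lw), so Lw is the (unique) fixed point of LT.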
There are three important issues concerning these algorithms: (I) the first one is the convergence of the sequences in (3) to some functions $\widetilde{V}$ and $\widehat{V}$, respectively; (II) the second one is to bound the approximation errors $||V^* - \widetilde{V}||_\infty$ and $||V^* - \widehat{V}||_\infty$ provided the algorithms converge, or to bound the sequence of Bellman residuals; (III) the third one is to bound the algorithm performance, that is, to bound the quantity $||V^* - V_f||_\infty$, where the policy f ∈ F is $\widetilde{V}_k$-greedy or $\widehat{V}_k$-greedy. These issues are addressed below for a class of approximation operators we call averagers. The class of averagers may seem somewhat restrictive at first sight, but many approximation schemes define operators with these properties. Examples of averagers are given by piece-wise constant approximation operators, linear and multilinear interpolators, kernel-based interpolators (Gordon [10], Stachurski [21]), certain aggregation-projection operators (Van Roy [25]), Schoenberg splines, and the Hermite-Fejér and Bernstein operators (Beutel [6], DeVore [7]), among others.
The key point is that averagers allow us to see the approximation step in the AVI algorithms (3) as a perturbation of the original Markov model. Throughout, an averager is an operator L : M_b(X) → M_b(X) satisfying properties (i)-(iii) of the Introduction: (i) L is a positive linear operator; (ii) Lu = u for every constant function u; (iii) Lu_n ↓ Lu pointwise whenever u_n ↓ u pointwise, a continuity property that guarantees the countable additivity of the set function introduced in Lemma 3.2 below. To introduce the perturbed models we need several simple but important properties of averagers.
Remark 1. Suppose that L is an averager. Then: (a) L is monotone, that is, Lu ≥ Lv whenever u ≥ v; moreover, L is non-expansive with respect to the supremum norm ||·||_∞, that is, ||Lu − Lv||_∞ ≤ ||u − v||_∞ for all u, v ∈ M_b(X). These properties follow directly from the positivity and linearity of L and the fact that L fixes constant functions. (b) Hence, under Assumption 2, the operators $\widehat{T} = LT$ and $\widetilde{T} = TL$ are contraction operators from M_b(X) into itself with modulus α. (c) Moreover, if the averager L maps C_b(X) into itself and Assumption 1 holds, the operators $\widehat{T} = LT$ and $\widetilde{T} = TL$ are contraction operators from C_b(X) into itself with modulus α. This is the case, for instance, if the averager interpolates values using continuous functions.
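The properties in Remark 1 are straightforward to verify numerically for a finite-state averager, here an arbitrary row-stochastic matrix (illustrative data only):

```python
import numpy as np

# Sanity check of Remark 1: a positive linear operator L with L1 = 1
# (an "averager") is monotone and non-expansive in the sup norm.

rng = np.random.default_rng(0)
n = 8
L = rng.uniform(0, 1, (n, n))
L /= L.sum(axis=1, keepdims=True)        # positive entries, rows sum to 1

u = rng.normal(size=n)
v = u + rng.uniform(0, 1, n)             # v >= u pointwise

assert np.all(L @ v >= L @ u - 1e-12)                               # monotone
assert np.abs(L @ u - L @ v).max() <= np.abs(u - v).max() + 1e-12   # non-expansive
assert np.allclose(L @ np.ones(n), np.ones(n))   # constants are fixed points
print("averager checks passed")
```

Non-expansiveness follows from monotonicity and L1 = 1: from u ≤ v + ||u − v||_∞ one gets Lu ≤ Lv + ||u − v||_∞, and symmetrically.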
The following lemma plays a key role in our approach; it shows that averagers can also be seen as transition probabilities. Its proof is omitted because it follows from standard arguments.
Lemma 3.2. Let L be an averager and define L(D|x) := LI_D(x) for x ∈ X and D ∈ B(X). Then: (a) L(·|·) is a transition probability on X, that is, L(·|x) is a probability measure on X for each x ∈ X, and L(D|·) is a measurable function for each D ∈ B(X); (b) for each u ∈ M_b(X),
$$Lu(x) = \int_X u(y)\, L(dy|x), \quad x \in X;$$
(c) if L maps C_b(X) into itself, then L(·|·) is a weakly continuous transition probability.

Now consider the perturbed Markov model $\widetilde{M} := (X, A, \{A(x) : x \in X\}, R, \widetilde{Q})$, where the perturbed transition law is
$$\widetilde{Q}(B|x, a) := \int_X L(B|y)\, Q(dy|x, a), \quad (x, a) \in K, \ B \in B(X).$$
If the averager L maps C_b(X) into itself, then $\widetilde{Q}(\cdot|\cdot, \cdot)$ is clearly a weakly continuous stochastic kernel on X given K, because it is the composition of weakly continuous stochastic kernels (see Lemma 3.2(a) and (c)).
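In matrix terms, Lemma 3.2 says an averager is a stochastic matrix, and the perturbed law is then a composition of stochastic kernels. A small Python sketch with illustrative data:

```python
import numpy as np

# Finite-state version of the perturbed law: composing the transition
# law Q with an averager L gives Q_pert(z|x,a) = sum_y Q(y|x,a) L(z|y),
# i.e. the matrix product Q L, which is again a stochastic kernel.

rng = np.random.default_rng(2)
nS, nA = 5, 3
Q = rng.uniform(0, 1, (nA, nS, nS))
Q /= Q.sum(axis=2, keepdims=True)        # Q[a, x, y] = Q({y}|x, a)
L = rng.uniform(0, 1, (nS, nS))
L /= L.sum(axis=1, keepdims=True)        # L[x, y] = L({y}|x)

Q_pert = Q @ L                           # batched matrix product over actions

# the perturbed kernel is nonnegative with rows summing to 1
print(np.all(Q_pert >= 0), np.allclose(Q_pert.sum(axis=2), 1.0))
```

The averaged function is recovered as a matrix-vector product: Lu(x) = Σ_y L(y|x) u(y), matching Lemma 3.2(b).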
For a policy π ∈ Π and initial state x ∈ X, let {(x̃_k, ã_k)} be the controlled process resulting in the perturbed model M̃ and P̃_x^π the corresponding probability measure defined on the measurable space (Ω, F). Denote by Ẽ_x^π the expectation operator with respect to this measure. The discounted reward criterion and the discounted optimal value function for the perturbed model M̃ are given as
$$\widetilde{V}_\pi(x) := \widetilde{E}_x^\pi \sum_{k=0}^{\infty} \alpha^k R(\widetilde{x}_k, \widetilde{a}_k) \quad \text{and} \quad \widetilde{V}^*(x) := \sup_{\pi \in \Pi} \widetilde{V}_\pi(x), \quad x \in X,$$
respectively. A policy π̃* is said to be optimal for the model M̃ iff Ṽ*(·) = Ṽ_{π̃*}(·).

The algorithm driven by $\widehat{T} = LT$ leads instead to a perturbation acting on stationary policies. For each f ∈ F define
$$\widehat{R}_f := LR_f \quad \text{and} \quad \widehat{Q}_f(B|x) := \int_X Q_f(B|y)\, L(dy|x)$$
for each f ∈ F and B ∈ B(X). Again by Lemma 3.2(a), we have that Q̂_f(·|·), f ∈ F, is a transition probability on X because it is the composition of the transition probabilities L(·|·) and Q_f(·|·). Thus, for each stationary policy f ∈ F and initial state x ∈ X there exist a Markov chain {x̂_n} and a probability measure P̂_x^f defined on the measurable space (Ω, F) such that Q̂_f(·|·) is the one-step transition probability of {x̂_n}. The expectation operator with respect to P̂_x^f is denoted by Ê_x^f. The discounted reward criterion and the optimal value function are given as
$$\widehat{V}_f(x) := \widehat{E}_x^f \sum_{k=0}^{\infty} \alpha^k \widehat{R}_f(\widehat{x}_k) \quad \text{and} \quad \widehat{V}^*(x) := \sup_{f \in F} \widehat{V}_f(x), \quad x \in X.$$
A policy f* ∈ F is said to be optimal for the perturbed model M̂ iff V̂* = V̂_{f*}.

Lemma 3.3. Suppose that either (a) Assumption 1 holds and L is an averager mapping C_b(X) into itself, or (b) Assumption 2 holds and L is an averager. Then there exists a V̂*-greedy policy f̂ ∈ F, and the optimal value function V̂* is the unique solution in C(X) of the optimality equation V̂* = T̂V̂*; moreover, V̂* = V̂_{f̂}, so that f̂ is optimal for the perturbed model M̂. The analogous statements hold for the perturbed model M̃: Ṽ* is the unique solution in C(X) of Ṽ* = T̃Ṽ*, and any Ṽ*-greedy policy is optimal for M̃.

Proof. Since T̂ = LT is a contraction from C(X) into itself (Remark 1), there exists a unique function W in the space C(X) and a W-greedy policy f̂ ∈ F such that W = T̂W = LT_{f̂}W. Then the linearity and monotonicity of L together with Lemma 3.2(b) imply, on the one hand, that
$$W = L(R_{\hat f} + \alpha Q_{\hat f} W) = \widehat{R}_{\hat f} + \alpha \widehat{Q}_{\hat f} W$$
and, on the other hand, that W ≥ R̂_f + α Q̂_f W for each f ∈ F. Now, standard dynamic programming arguments yield W = V̂* = V̂_{f̂}, which proves the first two statements of the lemma. The proofs of the remaining results also follow from standard dynamic programming arguments, since the operators T̃ and T̃_f, f ∈ F, are contraction operators from C(X) into itself.

The AVI algorithms defined in (3) differ only in the order in which the operators T and L are applied.
Propositions 1 and 2 below show how this difference is passed down to the corresponding optimal control problems.

Proposition 1. Suppose the assumptions of Lemma 3.3 hold. Then: (a) V̂* = LṼ* and Ṽ* = TV̂*; (b) a stationary policy f ∈ F is optimal for the model M̃ if and only if it is V̂*-greedy; (c) if f is optimal for the model M̃, then f is optimal for the model M̂.
Proof of Proposition 1. (a) First recall, from Lemma 3.3, that Ṽ* is the unique fixed point of T̃ = TL in C_b(X). Then w := LṼ* = LT(LṼ*) = T̂w. Observe that w belongs to C_b(X); thus, there exists a w-greedy policy g ∈ F, that is, Tw = T_g w.

This latter fact implies that w = R̂_g + α Q̂_g w ≥ R̂_f + α Q̂_f w for all f ∈ F, which in turn implies that w = V̂* and also that V̂* = T̂V̂*, thus proving the first equality. Now observe that TV̂* = TLṼ* = Ṽ*, which proves the second one. (b) This is a direct consequence of part (a), Lemma 3.3, and Theorem 2.1.
Proposition 2 below gives bounds for the approximation errors ||V* − V̂*||_∞ and ||V* − Ṽ*||_∞. They are unsatisfactory because they depend on the (unknown) optimal value function V*; however, they show that such errors can be controlled by choosing sufficiently accurate approximation operators. This is the case, for instance, if the function V* is continuous, X is a compact subset of the set of real numbers R, and L is given by piece-wise constant approximations or linear interpolation schemes.
Proposition 2. Suppose the assumptions in Proposition 1 hold. Then:
$$\text{(a)} \quad ||V^* - \widehat{V}^*||_\infty \leq \frac{1}{1-\alpha}\, ||LV^* - V^*||_\infty; \qquad \text{(b)} \quad ||V^* - \widetilde{V}^*||_\infty \leq \frac{\alpha}{1-\alpha}\, ||LV^* - V^*||_\infty.$$

Proof of Proposition 2. The contraction property of T, Lemma 3.3 and Proposition 1(a) imply that
$$||V^* - \widetilde{V}^*||_\infty = ||TV^* - T\widehat{V}^*||_\infty \leq \alpha\, ||V^* - \widehat{V}^*||_\infty. \qquad (4)$$
Now, from Remark 1, Proposition 1(a) and (4), it follows that
$$||V^* - \widehat{V}^*||_\infty \leq ||V^* - LV^*||_\infty + ||LV^* - L\widetilde{V}^*||_\infty \leq ||LV^* - V^*||_\infty + \alpha\, ||V^* - \widehat{V}^*||_\infty,$$
which in turn implies that the inequality in part (a) holds. On the other hand, part (b) follows from (4) and part (a).

4. Bounds for the approximate value iteration algorithm. The accuracy of the approximations provided by the averagers is expressed in terms of the supremum norm ||·||_∞ for bounded functions and the total variation norm ||·||_TV for finite signed measures. The latter is defined as
$$||\lambda||_{TV} := \sup \Big\{ \Big| \int_X u\, d\lambda \Big| : u \in M_b(X), \ ||u||_\infty \leq 1 \Big\},$$
where λ is a finite signed measure on X. From the definition it follows that $|\int_X u\, d\lambda| \leq ||u||_\infty ||\lambda||_{TV}$ for all u ∈ M_b(X). Moreover, one can prove that if P_1 and P_2 are probability measures, then
$$||P_1 - P_2||_{TV} = 2 \sup_{B \in B(X)} |P_1(B) - P_2(B)| \leq 2. \qquad (5)$$

Let F̃_0 be a subclass of stationary policies that contains the stationary optimal policies for the original model M and for the perturbed model M̃, as well as the Ṽ_n-greedy policies for each n ∈ N. Similarly, F̂_0 is a subclass of F that contains the optimal policies for M and M̂, and the V̂_n-greedy policies. Next, define the quantities
$$\delta_R(\widehat{F}_0) := \sup_{f \in \widehat{F}_0} ||LR_f - R_f||_\infty, \qquad \delta_Q(\widehat{F}_0) := \sup_{f \in \widehat{F}_0} \sup_{x \in X} ||\widehat{Q}_f(\cdot|x) - Q_f(\cdot|x)||_{TV},$$
and analogously $\delta_Q(\widetilde{F}_0)$ with Q̃_f in place of Q̂_f, and observe that they measure the accuracy of the averager in approximating the one-step reward function and the transition law. Now we are ready to state the main results of the present work.
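On a finite state space the total variation norm reduces to a sum of absolute values, and the supremum in its dual definition is attained at u = sign(λ). A quick Python check with illustrative probability vectors:

```python
import numpy as np

# Total variation norm of a finite signed measure on a finite set:
# ||lam||_TV = sum_x |lam({x})| = sup { |∫ u d lam| : ||u||_inf <= 1 }.
# For two probability vectors this gives ||P1 - P2||_TV = sum |p1 - p2| <= 2.

p1 = np.array([0.5, 0.3, 0.2])
p2 = np.array([0.2, 0.3, 0.5])
lam = p1 - p2                              # signed measure with lam(X) = 0

tv = np.abs(lam).sum()
u_star = np.sign(lam)                      # the supremum is attained here
print(tv, float(np.dot(u_star, lam)))      # both equal 0.6
```

Note that the bound ||P_1 − P_2||_TV ≤ 2 is attained exactly when P_1 and P_2 are mutually singular, which is why sharper bounds require extra conditions on the transition law (Remark 2).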
Theorem 4.1. Suppose that Assumption 1 holds and that the averager L maps C_b(X) into itself. Then: (a) for each f ∈ F̃_0 it holds that
$$||V_f - \widetilde{V}_f||_\infty \leq \frac{\alpha K}{(1-\alpha)^2}\, \delta_Q(\widetilde{F}_0);$$
(b) the same bound holds for $||V^* - \widetilde{V}^*||_\infty$; (c) if f_n ∈ F is a Ṽ_n-greedy policy, then
$$||V^* - V_{f_n}||_\infty \leq \frac{2\alpha}{1-\alpha}\, ||\widetilde{V}_n - \widetilde{V}_{n-1}||_\infty + \frac{2\alpha K}{(1-\alpha)^2}\, \delta_Q(\widetilde{F}_0).$$

The next result provides the analogous bounds for the perturbed model M̂.

Theorem 4.2. Suppose that either (i) Assumption 1 holds and L is an averager mapping C_b(X) into itself, or (ii) Assumption 2 holds and L is an averager. Then: (a) for each f ∈ F̂_0 it holds that
$$||V_f - \widehat{V}_f||_\infty \leq \frac{1}{1-\alpha}\, \delta_R(\widehat{F}_0) + \frac{\alpha K}{(1-\alpha)^2}\, \delta_Q(\widehat{F}_0);$$
(b) the same bound holds for $||V^* - \widehat{V}^*||_\infty$; (c) if f_n ∈ F is a V̂_n-greedy policy, then
$$||V^* - V_{f_n}||_\infty \leq \frac{2\alpha}{1-\alpha}\, ||\widehat{V}_n - \widehat{V}_{n-1}||_\infty + \frac{2}{1-\alpha}\, \delta_R(\widehat{F}_0) + \frac{2\alpha K}{(1-\alpha)^2}\, \delta_Q(\widehat{F}_0).$$

Remark 2. The constants δ_Q(F̃_0) and δ_Q(F̂_0) are less than or equal to 2 (see (5)); however, in general, it may be quite hard to obtain sharper bounds unless some additional conditions are imposed on the transition law. For instance, Almudevar [1] and Munos [15] study the AVI algorithm for systems with continuous state space whose transition law admits a density function.

Proof of Theorem 4.2. To prove part (a), note that V̂_f = R̂_f + α Q̂_f V̂_f for each policy f ∈ F; thus we see that
$$V_f - \widehat{V}_f = (R_f - \widehat{R}_f) + \alpha Q_f (V_f - \widehat{V}_f) + \alpha (Q_f - \widehat{Q}_f)\widehat{V}_f.$$
Then, taking the supremum norm and using the definition of the total variation norm together with the bound $||\widehat{V}_f||_\infty \leq K/(1-\alpha)$,
$$||V_f - \widehat{V}_f||_\infty \leq \delta_R(\widehat{F}_0) + \alpha\, ||V_f - \widehat{V}_f||_\infty + \frac{\alpha K}{1-\alpha}\, \delta_Q(\widehat{F}_0),$$
which yields the inequality in part (a). The proofs of parts (b) and (c) follow arguments similar to those given in the proof of Theorem 4.1 and thus are omitted.

Proposition 3. For each f ∈ F,
$$\sup_{x \in X} ||\widehat{Q}_f(\cdot|x) - Q_f(\cdot|x)||_{TV} = \sup_{x \in X}\, \sup_{||u||_\infty \leq 1} |L(Q_f u)(x) - Q_f u(x)|, \qquad (7)$$
so that δ_Q(F̂_0) is determined by the accuracy with which L approximates the functions Q_f u.

Proof of Proposition 3. To prove (7), let f ∈ F and note that, by Lemma 3.2(b),
$$\widehat{Q}_f u(x) - Q_f u(x) = L(Q_f u)(x) - Q_f u(x), \quad x \in X, \ u \in M_b(X).$$
Moreover, from the total variation norm property in Hernández-Lerma [11], Appendix B.3, p. 125, we see that
$$||\widehat{Q}_f(\cdot|x) - Q_f(\cdot|x)||_{TV} = \sup_{||u||_\infty \leq 1} |\widehat{Q}_f u(x) - Q_f u(x)|.$$
This equality implies that (7) holds, which proves the desired result. The last statement of Proposition 3 follows directly from the latter equality.
5. An inventory system with finite capacity. This section provides numerical results to illustrate the approach developed in the previous sections. We chose a simple inventory control problem for which the optimal stationary policy is already known, in order to allow a comparison between the analytical and the numerical solutions. However, a similar analysis can be carried out for other models, such as inventory systems with positive set-up cost, fishery management, optimal replacement problems, optimal growth models, etc. Consider, then, a single-item inventory system with finite capacity θ > 0, no set-up cost and no back-orders. Let x_n be the item stock level and a_n the quantity ordered at the beginning of the nth decision epoch, and let w_n be the quantity demanded during the same epoch. Assuming that the quantity ordered is immediately supplied, the inventory system evolves according to the difference equation
$$x_{n+1} = (x_n + a_n - w_n)^+, \quad n \in \mathbb{N}_0, \qquad (8)$$
where x_0 = x is the initial stock and v^+ := max(0, v). Thus, X = A = [0, θ] and A(x) = [0, θ − x] for each x ∈ X. The mapping x ↦ A(x) = [0, θ − x] is clearly continuous. Instead of a reward function R we consider a one-step cost function C, so the optimal control problem is to find a policy with minimal cost.
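The dynamics (8) are easy to simulate under a base-stock policy. The following Python sketch uses placeholder parameter values and exponential demand, chosen only for illustration:

```python
import random

# Simulation of the inventory dynamics x_{n+1} = (x_n + a_n - w_n)^+
# on X = [0, theta] under a base-stock policy f(x) = max(S - x, 0).
# Capacity, re-order point and demand rate are illustrative placeholders.

theta, S = 20.0, 6.5
rng = random.Random(0)

def base_stock(x):
    # admissible order: a in A(x) = [0, theta - x]
    return min(max(S - x, 0.0), theta - x)

x, path = 0.0, []
for _ in range(10):
    a = base_stock(x)
    w = rng.expovariate(0.1)                 # demand w_n, mean 10
    x = max(x + a - w, 0.0)                  # next stock level
    path.append(x)

print(all(0.0 <= s <= theta for s in path))  # the state never leaves [0, theta]
```

Since a ≤ θ − x, the pre-demand level x + a never exceeds θ, and the positive part keeps the state nonnegative, so [0, θ] is indeed invariant.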

We assume the demand process {w_n} is a sequence of independent, identically distributed nonnegative random variables with continuous distribution function F. Then the inventory dynamics can also be expressed by means of the stochastic kernel
$$Q(B|x, a) := P[(x + a - w_0)^+ \in B], \quad (x, a) \in K, \ B \in B(X).$$
Moreover, observe that
$$\int_X u(y)\, Q(dy|x, a) = E_{w_0}\, u((x + a - w_0)^+), \quad (x, a) \in K,$$
for each u ∈ C_b(X). Here E_{w_0} stands for the expectation with respect to the distribution F of the random variable w_0. The latter equality implies that Q(·|·, ·) is weakly continuous on K.
Hence, this inventory model satisfies Assumption 1. Now consider the approximating operator L defined by the linear interpolation scheme with nodes s_0 = 0 < s_1 < ... < s_N = θ. Thus, for each bounded measurable function v on X, the operator L is given by
$$Lv(x) := v(s_i) + \frac{v(s_{i+1}) - v(s_i)}{s_{i+1} - s_i}\,(x - s_i) \quad \text{for } x \in [s_i, s_{i+1}], \ i = 0, \dots, N-1.$$
The operator L is clearly an averager that maps C_b(X) into itself.
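For linear interpolation the induced measure L(·|x) of Lemma 3.2 is concentrated on the two nodes bracketing x, with barycentric weights. A Python sketch (the grid is chosen for illustration):

```python
import numpy as np

# The linear-interpolation operator with nodes s_0 < ... < s_N as an
# averager: (Lv)(x) is a convex combination of v(s_i) and v(s_{i+1}),
# so L(.|x) is a two-point probability measure on the nodes.

nodes = np.array([0.0, 5.0, 10.0, 15.0, 20.0])   # s_0 = 0 < ... < s_N = theta

def weights(x):
    # probability measure L(.|x) concentrated on the two bracketing nodes
    i = np.searchsorted(nodes, x, side="right") - 1
    i = min(max(i, 0), len(nodes) - 2)
    t = (x - nodes[i]) / (nodes[i + 1] - nodes[i])
    w = np.zeros(len(nodes))
    w[i], w[i + 1] = 1.0 - t, t
    return w

def Lv(v_at_nodes, x):
    return weights(x) @ v_at_nodes            # (Lv)(x) = ∫ v dL(.|x)

v = nodes ** 2                                # v(s) = s^2 sampled at the nodes
assert abs(weights(7.0).sum() - 1.0) < 1e-12  # L(X|x) = 1: constants are fixed
print(Lv(v, 7.0))                             # 0.6*25 + 0.4*100 = 55.0
```

Positivity and L1 = 1 are immediate from the weights, so this L satisfies properties (i)-(iii).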
Bounds for the perturbed model M̃. For this model we consider an arbitrary continuous one-step cost function C and take F̃_0 = F. The perturbed transition law Q̃(·|·, ·) is given as
$$\widetilde{Q}(B|x, a) = \int_X L(B|y)\, Q(dy|x, a) = \sum_{i=0}^{N} \Big[ \int_X L(\{s_i\}|y)\, Q(dy|x, a) \Big] I_B(s_i), \quad B \in B(X);$$
that is, Q̃(·|x, a) is concentrated on the set of nodes. In this case, Theorem 4.1 yields a performance bound in terms of δ_Q(F̃_0) ≤ 2. Note that the latter bound is given in terms of known, or at least computable, quantities. Moreover, it is quite general because it holds under very weak assumptions, namely, the continuity of the distribution function F and of the one-step cost function C. However, it does not depend on the grid {s_0, s_1, ..., s_N} and is thus insensitive to refinements, which obviously is a disadvantage. This fault is not shared by the other AVI algorithm given in (3), as we show below.
Bounds for the perturbed model M̂. Consider the one-step cost function
$$C(x, a) := c\, a + h\, (x + a) + p\, E(w_0 - x - a)^+, \quad (x, a) \in K, \qquad (11)$$
where p, h and c are positive constants representing the unit penalty cost for unmet demand, the unit holding cost for inventory at hand, and the unit production cost, respectively. Clearly, C is continuous on K. Moreover, assume the random variable w_0 has positive finite expectation w̄ and also that the distribution function F has a bounded continuous density ρ. The latter assumption implies that Q(·|x, ·) is strongly continuous for each x ∈ X. Hence, Assumption 2 holds. Numerical experiments show that the optimal policies for the models M, M̃ and M̂ are base-stock policies. Recall that a stationary policy f is a base-stock policy if f(x) = S − x for x ∈ [0, S], and f(x) = 0 otherwise, where the constant S ≥ 0 is the so-called re-order point. In fact, it is shown in Vega-Amaya and Montes-de-Oca [26] that a base-stock policy is optimal for the inventory system (8) with a more general one-step cost function than (11), and also that the optimal re-order point S* for the cost (11) satisfies the equation F(S) = (p − h − c)/(p − αc) if p > h + c, and S* = 0 otherwise.
Thus, take F̂_0 to be the class of all base-stock policies. In order to estimate δ_C(F̂_0) and δ_Q(F̂_0), introduce auxiliary functions pv, Pv and I, defined for x ∈ X and v ∈ M_b(X) in terms of the demand distribution, and denote by F̄, p̄v, P̄v and Ī the linear interpolation functions of F, pv, Pv and I, respectively, with nodes at s_0 = 0, s_1, ..., s_N = θ.
Let f_S be a base-stock policy and suppose that s_i ≤ S < s_{i+1}. Then, after some elementary but cumbersome computations, one can verify that δ_C(F̂_0) is bounded above by a constant multiple of s̄ := max_i (s_{i+1} − s_i); this is the bound (12). Now, in order to get an estimate for δ_Q(F̂_0), we impose a last condition on the density ρ: we assume that the density function ρ is Lipschitz continuous with constant l > 0 on [0, θ], that is, |ρ(x) − ρ(y)| ≤ l|x − y| for all x, y ∈ [0, θ]. This implies that ||pv − p̄v||_∞ ≤ 2(θl + K̄)s̄ and ||F − F̄||_∞ ≤ 2K̄s̄ for all v ∈ M_b(X) with ||v||_∞ ≤ 1, where K̄ is a bound for the density function ρ; hence δ_Q(F̂_0) is also bounded above by a constant multiple of s̄, which is the bound (13). The bounds (12) and (13) may be quite conservative, but they show that the constants δ_C(F̂_0) and δ_Q(F̂_0) can be made arbitrarily small by taking fine enough grids. Moreover, they can be sharpened in specific cases, as is done in the numerical results given below.

Numerical results. Take c = 1.5, h = 0.5, p = 3, θ = 20 and α = 0.6, and consider a grid with N + 1 evenly spaced nodes. Moreover, assume that the product demand has an exponential density function ρ with parameter λ = 0.1. Note that ρ is bounded by K̄ = λ = 0.1 and also that it is Lipschitz with constant l = λ² = 0.01. For this case, the bound given in (12) improves to δ_C(F̂_0) ≤ (p + h + c)s̄ = 5s̄.
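For exponential demand the re-order equation F(S) = (p − h − c)/(p − αc) can be solved in closed form, which makes the comparison with the AVI output immediate. A Python check with the parameters above:

```python
import math

# Closed-form optimal re-order point for the inventory example:
# S* solves F(S) = (p - h - c)/(p - alpha*c) when p > h + c, with
# F(s) = 1 - exp(-lam*s) the exponential demand distribution.
# Parameters as in the numerical experiment of Section 5.

p, h, c, alpha, lam = 3.0, 0.5, 1.5, 0.6, 0.1

q = (p - h - c) / (p - alpha * c)     # = 1/2.1, about 0.476
S_star = -math.log(1.0 - q) / lam     # F^{-1}(q) for the exponential law

print(round(S_star, 6))               # about 6.466, close to the S* in Table 1
```

This is the quantity the AVI re-order points S_n should approach as the grid is refined and the tolerance ε decreases.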
The AVI algorithm is stopped once E_n := ||V̂_n − V̂_{n−1}||_∞ falls below a given error tolerance ε > 0. Let n_s be the first iteration at which this happens and f_{n_s} the corresponding V̂_{n_s}-greedy policy. As already mentioned, the greedy policies for the AVI algorithm are base-stock policies; thus, let S_n be the re-order point of the V̂_n-greedy policy. The numerical results are shown in Table 1 with the following notation: A_E := 2(1 − α)^{-1} δ_C(F̂_0) + 2αK(1 − α)^{-2} δ_Q(F̂_0), S_E := 2α(1 − α)^{-1} E_n and T_E := S_E + A_E. Figure 1 displays the graphs of the functions V̂_n for n = 1, 5, 10, 17. Note that the functions V̂_{10} and V̂_{17} are virtually indistinguishable. The numerical results in Table 1 and Figure 1 show that the approximate value iteration algorithm practically identifies the optimal re-order point S* = 6.466453, and also that the performance bounds are quite good; in fact, the bounds can be made arbitrarily small by choosing fine enough grids.

6. Concluding remarks. In this paper we propose a perturbation approach to analyze the AVI algorithm when it uses a class of approximation operators we call averagers. The key point is the dual nature of the averagers, which allows one to see the AVI algorithm as the standard value iteration algorithm in a certain perturbed Markov decision model. This fact offers a new framework for analyzing the major problems (I)-(III) raised by the AVI algorithm and described in the Introduction. We believe this perturbation approach can also be combined with the policy iteration algorithm and the linear programming approach, and used to approximate the average cost optimal control problem. Moreover, although the focus of this paper is the case of "continuous" spaces, it should be noted that the perturbation approach also works in the discrete space case. Thus, it could also be interesting to compare it with the existing numerical procedures for such models on some test problems.