On Adaptive Estimation for Dynamic Bernoulli Bandits

The multi-armed bandit (MAB) problem is a classic example of the exploration-exploitation dilemma. It is concerned with maximising the total rewards for a gambler by sequentially pulling an arm from a multi-armed slot machine where each arm is associated with a reward distribution. In static MABs, the reward distributions do not change over time, while in dynamic MABs, each arm's reward distribution can change, and the optimal arm can switch over time. Motivated by many real applications where rewards are binary counts, we focus on dynamic Bernoulli bandits. Standard methods like $\epsilon$-Greedy and Upper Confidence Bound (UCB), which rely on the sample mean estimator, often fail to track changes in the underlying reward for dynamic problems. In this paper, we overcome the shortcoming of slow response to change by deploying adaptive estimation in the standard methods and propose a new family of algorithms, which are adaptive versions of $\epsilon$-Greedy, UCB, and Thompson sampling. These new methods are simple and easy to implement. Moreover, they do not require any prior knowledge about the data, which is important for real applications. We examine the new algorithms numerically in different scenarios, and the results show solid improvements of our algorithms in dynamic environments.


Introduction
The multi-armed bandit (MAB) problem is a classic decision problem where one needs to balance acquiring new knowledge with optimising choices based on current knowledge, a dilemma commonly referred to as the exploration-exploitation trade-off. The problem, originally proposed by Robbins (1952), aims to sequentially make selections among a (finite) set of arms, A, and maximise the total reward obtained through selections during a (possibly infinite) time horizon T. The MAB framework is a natural model for many real-world problems. It was originally motivated by the design of clinical trials (Thompson, 1933; see also Press, 2009, and Villar et al., 2015, for some recent developments). Other applications include online advertising (Li et al., 2010; Scott, 2015), adaptive routing (Awerbuch and Kleinberg, 2008), and financial portfolio design (Brochu et al., 2011; Shen et al., 2015). In stochastic MABs, each arm a ∈ A is characterised by an unknown reward distribution. The Bernoulli distribution is a natural choice that appears often in the literature, because in many real applications the rewards can be represented by binary counts. For example, in clinical trials, we obtain a reward 1 for a successful treatment, and a reward 0 otherwise (Villar et al., 2015); in online advertising, counts of clicks are often used to measure success (Scott, 2010).
Formally, the MAB problem may be stated as follows: for discrete times t = 1, · · · , T, the decision maker selects one arm $a_t$ from A and receives a reward $Y_t(a_t)$. The goal is to optimise the arm selection sequence and maximise the total expected reward $\sum_{t=1}^{T} \mathbb{E}[Y_t(a_t)]$, or equivalently, minimise the total regret:

$R_T = \sum_{t=1}^{T} \mathbb{E}[Y_t(a_t^*)] - \sum_{t=1}^{T} \mathbb{E}[Y_t(a_t)]$,  (1)

where $a_t^*$ is the optimal arm at time t. The total regret can be interpreted as the difference between the total expected reward obtained by playing an optimal strategy (selecting the optimal arm at every step) and that obtained by the algorithm. For notational convenience, we let $\mu_t(a)$, a ∈ A, denote the expected reward of arm a at time t, i.e., $\mu_t(a) = \mathbb{E}[Y_t(a)]$. In the rest of this paper, we will also use notation like $Y_t$ and $\mu_t$ when we introduce methods/models that can be applied separately to different arms.
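As a concrete illustration of this objective, the following is a minimal sketch of the regret computation (the function name and array layout are our own, not from the paper):

```python
import numpy as np

def total_regret(mu, choices):
    """Total regret as in (1): expected reward of the per-step optimal arm
    minus that of the chosen arm, summed over t.
    mu: (T, n_arms) array of expected rewards; choices: length-T arm indices."""
    mu = np.asarray(mu, dtype=float)
    chosen = mu[np.arange(len(choices)), choices]
    return float((mu.max(axis=1) - chosen).sum())

# Two steps, two arms; the optimal arm switches between the steps, so
# always playing arm 0 incurs regret only at the second step.
assert total_regret([[0.75, 0.25], [0.25, 0.75]], [0, 0]) == 0.5
assert total_regret([[0.75, 0.25], [0.25, 0.75]], [0, 1]) == 0.0
```

Note that the per-step optimum $a_t^*$ is recomputed at every t, which is exactly what makes dynamic bandits harder than static ones.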
The classic MAB problem assumes the reward distribution structure does not change over time; in this case, the optimal arm is the same for all t. A MAB problem with static reward distributions is also known as the stationary, or static, MAB problem in the literature (e.g., Garivier and Moulines, 2011; Slivkins and Upfal, 2008). A dynamic MAB, where changes are allowed in the underlying reward distributions, is more realistic in real-world applications such as online advertising. An agent always seeks the best web position (that is, the placement of the advertisement on a webpage), and/or advertisement content, to maximise the probability of obtaining clicks. However, due to inherent changes in the marketplace, the optimal choice may change over time, and thus the assumption of static reward distributions is not adequate in this example.
Two main types of change have been studied in the literature on dynamic MABs: abrupt changes (Garivier and Moulines, 2011; Yu and Mannor, 2009), and drifting (Granmo and Berg, 2010; Gupta et al., 2011; Slivkins and Upfal, 2008). For abrupt changes, the expected reward of an arm remains constant for some period and changes suddenly at possibly unknown time instants (Garivier and Moulines, 2011). The study of drifting dynamic bandits follows the seminal work of Whittle (1988), in which the restless bandit was introduced. In Whittle's study, the state of an arm can change over time according to a Markov transition function whether it is selected or not. Restless bandits are regarded as intractable, i.e., it is not possible to derive an optimal strategy even if the transitions are deterministic (Papadimitriou and Tsitsiklis, 1999). In recent studies of drifting dynamic bandits, the expected reward of an arm is often modelled by a random walk (e.g., Granmo and Berg, 2010; Gupta et al., 2011; Slivkins and Upfal, 2008).
In this work, we look at the problem of dynamic bandits where the expectation of the reward distribution changes over time, focusing on the Bernoulli reward distribution because of its wide relevance in real applications. In addition, we will emphasise cases where the changes of the reward distribution can have a real effect on the decision making. As an example, consider a two-armed Bernoulli bandit where the expected reward of Arm 1 oscillates in [0.1, 0.3] over time, and the expected reward of Arm 2 oscillates in [0.8, 0.9]. The reward distributions for both arms change, but the optimal arm remains the same. We will not regard this example as a dynamic case.
Many algorithms have been proposed in the literature to perform arm selection for MAB. Some of the most popular ones include $\epsilon$-Greedy (Watkins, 1989), Upper Confidence Bound (UCB; Auer et al., 2002), and Thompson Sampling (TS; Thompson, 1933). These methods have been extended in various ways to improve performance. For example, Garivier and Cappé (2011) proposed the Kullback-Leibler UCB (KL-UCB) method, which satisfies a uniformly better regret bound than UCB. May et al. (2012) introduced the Optimistic Thompson Sampling (OTS) method to boost exploration in TS. Some more extensions will be described in Section 3. Even in their basic forms, all the aforementioned approaches can perform well in practice in many situations (e.g., Chapelle and Li, 2011; Kuleshov and Precup, 2014; Vermorel and Mohri, 2005). One thing these methods have in common is that they treat all the observations $Y_1, \cdots, Y_t$ equally when estimating or making inference about $\mu_t$. Specifically, $\epsilon$-Greedy and UCB use sample averages to estimate $\mu_t$. In static cases, given that $Y_1, \cdots, Y_t$ are i.i.d., this choice is a sensible one from a theoretical perspective, and one could invoke various asymptotic results here as justification (e.g., the law of large numbers, the central limit theorem, the Berry-Esseen inequality, etc.). From a practical point of view, when $\mu_t$ changes significantly with time, it can become a bottleneck in performance. The problem is that a sample average does not put more weight on the more recent data $Y_t$, which is a direct observation of $\mu_t$. In this paper we consider a different estimator for $\mu_t$, inspired by adaptive estimation (Haykin, 2002), and propose novel modifications of popular MAB algorithms.

Contributions and Organisation
We propose algorithms that use adaptive forgetting factors (Bodenham and Adams, 2016) in conjunction with the standard MAB methods. This results in a new family of algorithms for dynamic Bernoulli bandits. These algorithms overcome the shortcomings related to using sample averages for estimation of dynamically changing rewards. They are easy to implement and require very little tuning effort; they are quite robust to tuning parameters, and their initialisation does not require assumptions or knowledge of the model structure in advance.
The remainder of this paper is structured as follows: Section 2 briefly summarises some adaptive estimation techniques, focusing on Adaptive Forgetting Factors (AFFs). Section 3 introduces the methodology for arm selection. Section 4 presents a variety of numerical results for different dynamic models and MAB algorithms. We summarise our findings in Section 5.

Adaptive Estimation Using Forgetting Factors
Solving the MAB problem involves two main steps: learning the reward distribution of each arm (estimation step), and selecting one arm to play (selection step). The foundation of making a good selection is to correctly and efficiently track the expected reward of the arms, especially in the context of time-evolving reward distributions. Adaptive estimation approaches are useful for this task, as they provide an estimator that follows a moving target more closely; here the target is the expected reward (Anagnostopoulos et al., 2012; Bodenham and Adams, 2016). In this section, we introduce how to use an Adaptive Forgetting Factor (AFF) estimator for monitoring a single arm. For simplicity, we drop the dependence on arms in the notation when it is clear from context.
Assume now that we select one arm all the time until t and receive rewards $Y_1, \cdots, Y_t$. If the reward distribution is static, $Y_1, \cdots, Y_t$ are i.i.d. Therefore, it is natural to estimate the expected reward via the sample mean $\bar{Y}_t = \frac{1}{t} \sum_{i=1}^{t} Y_i$. This sample mean estimator is widely used in algorithms designed for the static MAB problem, such as $\epsilon$-Greedy and UCB. One problem with this estimator is that it often fails when the reward distribution changes over time. The adaptive filtering literature (Haykin, 2002) provides a generic and practical tool to track a time-evolving data stream, and it has recently been adapted to a variety of streaming machine learning problems (Anagnostopoulos et al., 2012; Bodenham and Adams, 2016). The key idea behind adaptive estimation is to gradually reduce the weight on older data as new data arrives (Haykin, 2002). For example, a fixed forgetting factor estimator employs a discount factor $\lambda \in [0, 1]$ and takes the form $\bar{Y}_t^{\lambda} = \frac{1}{w_{\lambda,t}} \sum_{i=1}^{t} \lambda^{t-i} Y_i$, where $w_{\lambda,t} = \sum_{i=1}^{t} \lambda^{t-i}$ is a normalising constant. Bodenham and Adams (2016) illustrated that the fixed forgetting factor estimator has some similarities with the Exponentially Weighted Moving Average (EWMA) scheme (Roberts, 1959), which is a basic approach in the change detection literature (Tsung and Wang, 2010).
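To make the forgetting idea concrete, here is a minimal sketch of the fixed-forgetting-factor mean (the function name is ours; with λ = 1 it reduces to the plain sample mean):

```python
import numpy as np

def ff_mean(y, lam):
    """Fixed-forgetting-factor mean: weight Y_i by lam**(t - i), normalised
    by the sum of the weights (the normalising constant w)."""
    y = np.asarray(y, dtype=float)
    t = len(y)
    weights = lam ** np.arange(t - 1, -1, -1)  # lam^(t-1), ..., lam^0
    return float(weights @ y / weights.sum())

y = [0, 1, 1, 0, 1]
assert abs(ff_mean(y, 1.0) - np.mean(y)) < 1e-12      # lam = 1: sample mean
assert ff_mean([0, 0, 0, 1], 0.5) > np.mean([0, 0, 0, 1])  # recent data upweighted
```

Smaller λ shortens the effective memory, which is exactly the behaviour we want when $\mu_t$ drifts.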
In this paper, we will use an adaptive forgetting factor, where the magnitude of the forgetting factor λ can be adjusted at each time step for better adaptation. One main advantage of an AFF estimator is that it can respond quickly to changes in a target without requiring any prior knowledge about the process. In addition, by using data-adaptive tuning of λ, we side-step the problem of setting a key control parameter. Therefore, it is very useful when applied to dynamic MABs where we do not have any knowledge about the dynamics of the reward distribution.
Our AFF formulation follows Bodenham and Adams (2016); we present here only the main methodology. For a data stream $Y_1, \cdots, Y_t$, the adaptive forgetting factor mean (denoted by $\hat{Y}_t$) is defined as follows:

$\hat{Y}_t = \frac{1}{w_t} \sum_{i=1}^{t} \Big( \prod_{p=i}^{t-1} \lambda_p \Big) Y_i$,  (2)

where the normalising constant $w_t = \sum_{i=1}^{t} \prod_{p=i}^{t-1} \lambda_p$ is selected to give unbiased estimation when the data $Y_1, \cdots, Y_t$ are i.i.d. For convenience, we set $\prod_{p=t}^{t-1} \lambda_p = 1$. We can update $\hat{Y}_t$ via the following recursive updating equations:

$\hat{Y}_t = m_t / w_t$,  (3)
$m_t = \lambda_{t-1} m_{t-1} + Y_t$,  (4)
$w_t = \lambda_{t-1} w_{t-1} + 1$,  (5)

with $m_0 = w_0 = 0$. The adaptive forgetting factor sequence $\vec{\lambda}_t = (\lambda_1, \lambda_2, \cdots, \lambda_t)$ is an expanding sequence over time, and the forgetting factor $\lambda_t$ is computed via a single gradient descent step:

$\lambda_t = \lambda_{t-1} - \eta \, \Delta(L_t, \lambda_{t-2})$,  (6)

where η (η ≪ 1) is the step size, and $L_t$ is a user-determined cost function of the estimator $\hat{Y}_t$. Here, we choose $L_t = (\hat{Y}_{t-1} - Y_t)^2$ for good mean-tracking performance, which can be interpreted as the one-step-ahead squared prediction error. Other choices are possible, such as the one-step-ahead negative log-likelihood (Anagnostopoulos et al., 2012), but this will not be pursued here. In addition, $\Delta(L_t, \lambda_{t-2})$ is a derivative-like function of $L_t$ with respect to $\lambda_{t-2}$ (see Bodenham and Adams, 2016, sect. 4.2.1 for details). Note, the index of λ is t − 2 as only $\lambda_1, \cdots, \lambda_{t-2}$ are involved in $L_t$. We require the following recursions to sequentially compute $\Delta(L_t, \lambda_{t-2})$:

$\dot{m}_t = m_{t-1} + \lambda_{t-1} \dot{m}_{t-1}$,  (7)
$\dot{w}_t = w_{t-1} + \lambda_{t-1} \dot{w}_{t-1}$,  (8)
$\Delta(\hat{Y}_t, \lambda_{t-1}) = (\dot{m}_t - \hat{Y}_t \dot{w}_t) / w_t$,  (9)

so that $\Delta(L_t, \lambda_{t-2}) = 2 (\hat{Y}_{t-1} - Y_t) \, \Delta(\hat{Y}_{t-1}, \lambda_{t-2})$. In addition to the mean, we may make use of an adaptive estimate of the variance. The adaptive forgetting factor variance is defined as:

$s_t^2 = \frac{1}{w_t} \sum_{i=1}^{t} \Big( \prod_{p=i}^{t-1} \lambda_p \Big) (Y_i - \hat{Y}_t)^2$.  (10)

Note that here we choose the same adaptive forgetting factor for the mean and the variance for convenience, though other formulations are possible; one can use a separate adaptive forgetting factor for the variance if needed. Again, $s_t^2$ can be computed recursively via the following equations:

$v_t = \lambda_{t-1} v_{t-1} + Y_t^2$,  (11)
$s_t^2 = v_t / w_t - \hat{Y}_t^2$.  (12)

The only tuning parameter in AFF estimation is the step size η used in (6), and its choice may affect the performance of estimation.
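The recursions above can be collected into a small tracker. The following is a condensed sketch of our own (the clipping of λ to [0, 1], the initialisation, and the exaggerated step size η in the example are implementation choices for illustration, not prescribed by the recursions):

```python
class AFF:
    """Adaptive-forgetting-factor mean/variance tracker (condensed sketch)."""

    def __init__(self, eta=0.01, lam0=1.0):
        self.eta = eta               # gradient step size, as in (6)
        self.lam = lam0              # current forgetting factor
        self.m = self.w = 0.0        # discounted sum and normaliser
        self.v = 0.0                 # discounted sum of squares (variance)
        self.mdot = self.wdot = 0.0  # derivatives of m, w w.r.t. lambda
        self.mean = 0.0
        self.var = 0.0

    def update(self, y):
        # Single gradient step on L_t = (Yhat_{t-1} - y)^2, clipped to [0, 1]
        if self.w > 0:
            yhat_dot = (self.mdot - self.mean * self.wdot) / self.w
            delta = 2.0 * (self.mean - y) * yhat_dot
            self.lam = min(1.0, max(0.0, self.lam - self.eta * delta))
        # Derivative recursions, then the mean/variance recursions
        self.mdot = self.m + self.lam * self.mdot
        self.wdot = self.w + self.lam * self.wdot
        self.m = self.lam * self.m + y
        self.w = self.lam * self.w + 1.0
        self.v = self.lam * self.v + y * y
        self.mean = self.m / self.w
        self.var = self.v / self.w - self.mean ** 2
        return self.mean
```

On a stream that jumps from 0 to 1, λ shrinks, old data is discarded, and the estimate moves to the new level much faster than the sample mean (which would sit at 0.5 after equal runs of zeros and ones).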
In Bodenham and Adams (2016), the authors observed that the size of the gradient step in (6) depends on the variance $\sigma^2$ of the data stream. That is to say, the forgetting factors $\lambda_1, \cdots, \lambda_t$ computed via (6) will be forced to be either 0 or 1 if $\sigma^2$ is too large. Therefore, before examining the influence of η, they scaled $\Delta(L_t, \lambda_{t-2})$ to $\Delta(L_t, \lambda_{t-2})/\sigma^2$ ($\sigma^2$ can be estimated during a burn-in period). However, in this paper we are only interested in Bernoulli rewards, for which $\sigma^2$ is less than 1, so it is not essential to devise an elaborate scaling scheme. We apply AFF estimation in standard MAB algorithms (see Section 3), and examine empirically the influence of η on these algorithms in Section 4.2.1.

Dealing with Missing Observations
In the MAB setting, we have at least two arms, and for each arm we construct an AFF estimator. However, we can only observe one arm at a time. This means that the estimates and intermediate quantities of an unobserved arm retain their previous values; that is, if arm a is not observed at time t, then $\hat{Y}_t(a) = \hat{Y}_{t-1}(a)$, $w_t(a) = w_{t-1}(a)$, $\lambda_t(a) = \lambda_{t-1}(a)$, and similarly for the other intermediate quantities. Not being able to update estimators poses more challenges in dynamic cases. In static cases, the sample mean estimator converges quickly to the expected reward with a few observations, and therefore it has little effect if the arm is not observed further. However, in dynamic cases, even if the estimator tracks the expected reward perfectly at a given moment, its precision may deteriorate quickly once it stops getting new observations. Therefore, it is more challenging to balance exploration and exploitation in dynamic cases.

Action Selection
Having discussed how to track the expected reward of arms in the previous section, we now move on to methods for the selection step. We will consider three of the most popular methods: $\epsilon$-Greedy (Watkins, 1989), UCB (Auer et al., 2002), and TS (Thompson, 1933). They are easy to implement and computationally efficient. Moreover, they have good performance in numerical evaluations (Chapelle and Li, 2011; Kuleshov and Precup, 2014; Vermorel and Mohri, 2005). Each of these methods uses a different mechanism to balance the exploration-exploitation trade-off. Deploying AFFs in these methods, we propose a new family of MAB algorithms for dynamic Bernoulli bandits, denoted with the prefix AFF- to emphasise the use of AFFs in estimation. Derived from $\epsilon$-Greedy, UCB, and TS, the new algorithms are AFF-d-Greedy, AFF-UCB, and AFF-TS/AFF-OTS respectively.
In the literature on dynamic bandits, many approaches have attempted to improve the performance of standard methods by choosing an estimator that uses the reward history wisely. Koulouriotis and Xanthopoulos (2008) applied exponentially-weighted average estimation in $\epsilon$-Greedy. Kocsis and Szepesvári (2006) introduced the discounted UCB method (also called D-UCB in Garivier and Moulines, 2011), which uses a fixed discounting factor in estimation. Garivier and Moulines (2011) proposed the Sliding Window UCB (SW-UCB) algorithm, where the reward history used for estimation is restricted by a window. The Dynamic Thompson Sampling (DTS) algorithm applies a bound on the reward history used for updating the hyperparameters of the posterior distribution of $\mu_t$ (Gupta et al., 2011). These sophisticated algorithms require accurate tuning of some input parameters, which relies on knowledge of the model/behaviour of $\mu_t$. For example, computing the window size of SW-UCB, or the discounting factor of D-UCB (Garivier and Moulines, 2011), requires knowing the number of switch points (i.e., time instants at which the optimal arm switches). While the idea behind our AFF MAB algorithms is similar, our approaches automate the tuning of the key parameters (i.e., the forgetting factors), and require only a little effort to tune the higher-level parameter η in (6). Moreover, we use the AFF technique to guide the tuning of the key parameter in the DTS algorithm, which will be discussed later in this section. Other approaches for dynamic bandits include UCBf (Slivkins and Upfal, 2008). DaCosta et al. (2008) used the Page-Hinkley test to restart the UCB algorithm in the application of adaptive operator selection.
In what follows, we discuss each AFF-deployed method separately. We review briefly the basics of each method and refer the reader to the references for more details. In addition, we will continue to use notation like $Y_t$ instead of $Y_t(a)$ when clear. In all the AFF MAB algorithms we propose below, we use a very short initialisation (or burn-in) period for the initial estimates. Normally, the length of the burn-in period is |A|, that is, each arm is selected once; for the algorithms that require estimates of the variance, we use a longer burn-in period by selecting each arm M times.

$\epsilon$-Greedy
$\epsilon$-Greedy (Watkins, 1989) is the simplest method for the static MAB problem. The expected reward of an arm is estimated by its sample mean, and a fixed parameter $\epsilon \in (0, 1)$ is used for selection. At each time step, with probability $\epsilon$, the algorithm selects an arm uniformly to explore, and with probability $1 - \epsilon$, the arm with the highest estimated reward is picked. $\epsilon$-Greedy is simple and easy to implement, which makes it appealing for dynamic bandits. However, it can have two main issues: first, the sample average is not ideal for tracking a moving reward; second, the parameter $\epsilon$ is the key to balancing the exploration-exploitation dilemma, but it is challenging to tune, as an optimal strategy in dynamic environments may require varying $\epsilon$ over time.
In Algorithm 1, we propose the AFF-d-Greedy algorithm to overcome the above weaknesses. In the algorithm, we use the AFF mean $\hat{Y}_t$ from (2) to estimate the expected reward. This estimator can respond quickly to changes; that is, for an arm that is frequently observed, it can closely follow the underlying reward, and for an arm that has not been observed for a long time, it can capture $\mu_t$ quickly once the arm is selected again. At each time step, we first identify the arm with the highest AFF mean; if the absolute difference between this arm's last two forgetting factors is smaller than d, we select it; otherwise, we select an arm from A uniformly. The threshold d ∈ (0, 1) is used to balance exploration and exploitation. Tuning d is easier than tuning $\epsilon$, as d is related to the step size η used in (6); this was confirmed in a large number of simulations. For Bernoulli dynamic bandits, we suggest setting d ≈ η.
We use the forgetting factors λ t (t = 1, 2, · · · ) in the decision rule as their magnitudes indicate the variability of the data stream. For example, if λ t is close to zero, it can be interpreted as a sudden change occurring at time t, and if close to 1, it indicates that the data stream is stable at time t. To understand the decision rule better, we illustrate it using two examples.
1. Variable arm example: let us say arm â was selected at time t − 1, and at time t, arm â has the highest estimated reward and $|\lambda_{t-1}(\hat{a}) - \lambda_{t-2}(\hat{a})| < d$. By the decision rule, the algorithm will select this arm again. We are interested in two cases: first, both $\lambda_{t-1}(\hat{a})$ and $\lambda_{t-2}(\hat{a})$ are close to 1; second, both are close to 0. It is easy to understand why the algorithm selects the arm in the first case, as it is currently stable and has the highest estimated reward. In the second case, $\mu_t(\hat{a})$ appears variable over the past two steps. However, even if $\mu_t(\hat{a})$ had kept moving down (that is, the worst possibility), the estimated reward would have fallen as well; since arm â still has the highest estimated reward, Algorithm 1 will select it.
2. Idle arm example: let us say arm â has the highest estimated reward at time t, and it was not selected at t − 1. Since the forgetting factors of arm â were not updated while it was idle, we have $\lambda_{t-1}(\hat{a}) = \lambda_{t-2}(\hat{a})$, and by the decision rule Algorithm 1 will select this arm. From these examples, we can see that exploration and exploitation are balanced in a way that takes into account the variability in the estimation procedure, rather than by simply flipping a coin. It boosts gaining knowledge for active-but-variable arms and for idle arms.
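The decision rule in the two examples above can be sketched as follows (a toy fragment with our own function name; the AFF quantities are assumed to be maintained per arm as in Section 2):

```python
import random

def aff_d_greedy_select(est_mean, lam_last, lam_prev, d, arms, rng=random):
    """AFF-d-Greedy decision rule (sketch): exploit the arm with the highest
    AFF mean if its last two forgetting factors are within d of each other,
    otherwise explore uniformly.  All dicts are keyed by arm."""
    best = max(arms, key=lambda a: est_mean[a])
    if abs(lam_last[best] - lam_prev[best]) < d:
        return best               # stable (or idle) leader: exploit
    return rng.choice(arms)       # leader looks variable: explore

arms = ["a1", "a2"]
est = {"a1": 0.9, "a2": 0.4}
lam_now = {"a1": 0.8, "a2": 0.9}
# Idle or stable leader: its factors are unchanged, so it is exploited.
assert aff_d_greedy_select(est, lam_now, dict(lam_now), 0.001, arms) == "a1"
```

Note that an idle leader always has identical last two factors, so the rule reduces to pure exploitation for it, matching the idle arm example.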

Upper Confidence Bound
Another type of algorithm uses upper confidence bounds for selection. The idea is that, instead of using the plain sample average, an exploration bonus $B_t$ is added to account for the uncertainty in the estimation, and the arm with the highest potential of being optimal is selected. This exploration bonus is typically derived using concentration inequalities (e.g., Hoeffding, 1963). The UCB1 algorithm introduced by Auer et al. (2002) is a classic method; in later works, UCB1 was often called simply UCB. For any reward distribution bounded in [0, 1], the UCB algorithm picks the arm that maximises the quantity $\bar{Y}_t + \sqrt{2 \log t / N_t}$, where $\bar{Y}_t$ is the sample average and $N_t$ is the number of times this arm was played up to time t. The exploration bonus $B_t = \sqrt{2 \log t / N_t}$ was derived using the Chernoff-Hoeffding bound. It was proved that the UCB algorithm achieves logarithmic regret uniformly over time (Auer et al., 2002).
For better adaptation in dynamic environments, we replace $\bar{Y}_t$ with $\hat{Y}_t$ and modify the upper bound accordingly. This results in the AFF-UCB algorithm in Algorithm 2. The upper bound for selection at time t + 1 takes the form $\hat{Y}_t + B_t$. We set $B_t$ to:

$B_t = \sqrt{\frac{-\log(0.05)}{2 w_t^2 / k_t}} + \frac{s_t^2}{w_t} (t - t_{\text{last}})^{1/|A|}$,  (14)

where $t_{\text{last}}$ is the last time instant at which the arm was observed; $w_t$, $k_t$, and $s_t^2$ are quantities related to the AFF estimation (see Section 2; $k_t = \sum_{i=1}^{t} \prod_{p=i}^{t-1} \lambda_p^2$, with recursive update $k_t = \lambda_{t-1}^2 k_{t-1} + 1$). From (14), $B_t$ is a combination of two components. It can be interpreted by considering two cases: 1. if an arm was observed at the previous time step t (i.e., $t - t_{\text{last}} = 0$), then $B_t = \sqrt{-\log(0.05)/(2 w_t^2 / k_t)}$; 2. if an arm was not observed at the previous time step, the second component is active as well. In the former case, $B_t$ is derived via the Chernoff-Hoeffding bound in a similar way to the derivation of UCB (see Appendix A for details). However, for an unselected arm, if we used only this expression, its upper bound would be static, since $\hat{Y}_t$, $w_t$, and $k_t$ do not change. As a consequence, it would only be selected if the arm with the currently highest upper bound dropped below it. This is not desirable since, in a changing environment, any sub-optimal arm can become optimal at any time. This motivates us to deliberately add some inflation to the upper bound of unselected arms to impose exploration, which leads to the term $\frac{s_t^2}{w_t} (t - t_{\text{last}})^{1/|A|}$. Note that this inflation decreases with the number of arms, |A|. This makes use of the fact that as |A| increases, the population of arms will "fill" more of the reward space and more opportunities will arise for picking high-reward arms.
Algorithm 2 AFF-UCB
Initialisation: play each arm M times.
for t = M|A| + 1, · · · , T do
    for all a ∈ A, compute $B_{t-1}(a)$ according to (14);
    find $a_t = \arg\max_{a' \in A} \{ \hat{Y}_{t-1}(a') + B_{t-1}(a') \}$;
    select arm $a_t$ and observe reward $Y_t(a_t)$;
    update $\hat{Y}_t(a_t)$, $w_t(a_t)$, $k_t(a_t)$, $s_t^2(a_t)$, and $t_{\text{last}}(a_t)$.
end for
Note here we use a longer burn-in period, since we need to initialise the estimate of the data variance. In the simulation study in Section 4, we choose M = 10.
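As a sketch, the bonus (14) can be computed from the per-arm AFF quantities like so (our own function; we assume the two components are summed, with the idle-time term vanishing when the arm was just observed, and `alpha=0.05` reflects the −log(0.05) term):

```python
import math

def aff_ucb_bonus(w, k, s2, t, t_last, n_arms, alpha=0.05):
    """Exploration bonus in the spirit of (14): a Chernoff-Hoeffding term
    with effective sample size w**2 / k, plus an inflation term that grows
    with the idle time t - t_last and vanishes when t == t_last."""
    hoeffding = math.sqrt(-math.log(alpha) / (2.0 * w * w / k))
    inflation = (s2 / w) * (t - t_last) ** (1.0 / n_arms)
    return hoeffding + inflation

# An arm observed at the previous step gets the Hoeffding term only;
# the bonus then grows the longer the arm stays idle.
b_now = aff_ucb_bonus(w=10.0, k=10.0, s2=0.25, t=100, t_last=100, n_arms=2)
b_idle = aff_ucb_bonus(w=10.0, k=10.0, s2=0.25, t=100, t_last=90, n_arms=2)
assert b_idle > b_now
```

With λ ≡ 1 we have $w_t = k_t = N_t$, so the effective sample size $w_t^2/k_t$ reduces to the play count $N_t$, recovering a bonus of the familiar $\sqrt{c/N_t}$ shape.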
Thompson Sampling

Recently, researchers (e.g., Scott, 2015) have given more attention to the Thompson Sampling (TS) method, which dates back to Thompson (1933). It is an approach based on Bayesian principles. A (usually conjugate) prior is assigned to the expected reward of each arm at the beginning, and the posterior distribution of the expected reward is sequentially updated through successive arm selection. A decision rule is constructed using this posterior distribution: at each round, a random sample is drawn from the posterior distribution of each arm, and the arm with the highest sample value is selected.

Algorithm 3 AFF-TS
Initialisation: play each arm once.
for t = |A| + 1, · · · , T do
    for all a ∈ A, draw a sample x(a) from Beta($\alpha_{t-1}(a)$, $\beta_{t-1}(a)$);
    find $a_t = \arg\max_{a' \in A} x(a')$;
    select arm $a_t$ and observe reward $Y_t(a_t)$;
    update $\alpha_t(a_t)$ and $\beta_t(a_t)$ according to (19)-(20).
end for
For the static Bernoulli bandit, following the approach of Chapelle and Li (2011), it is convenient to choose the Beta distribution, Beta($\alpha_0$, $\beta_0$), as a prior. The posterior distribution is then Beta($\alpha_t$, $\beta_t$) at time t, and the parameters $\alpha_t$ and $\beta_t$ can be updated recursively as follows: if an arm is selected at time t,

$\alpha_t = \alpha_{t-1} + Y_t$,  (15)
$\beta_t = \beta_{t-1} + (1 - Y_t)$;  (16)

otherwise,

$\alpha_t = \alpha_{t-1}$,  (17)
$\beta_t = \beta_{t-1}$.  (18)

The simplicity and effectiveness in real applications (Scott, 2015) make TS a good candidate for dynamic bandits. However, it has similar issues in tracking $\mu_t$ as $\epsilon$-Greedy and UCB. For illustration, assume an arm is observed all the time; one can then re-write the recursions in (15)-(16) as:

$\alpha_t = \alpha_0 + \sum_{i=1}^{t} Y_i, \qquad \beta_t = \beta_0 + \sum_{i=1}^{t} (1 - Y_i)$.

As a result, the posterior distribution Beta($\alpha_t$, $\beta_t$) keeps full memory of all the past observations, making posterior inference less responsive to observations near time t.
To modify the above updating, we use the intermediate quantities $m_t$ and $w_t$ from (4)-(5). If an arm is selected at time t,

$\alpha_t = \alpha_0 + m_t$,  (19)
$\beta_t = \beta_0 + (w_t - m_t)$;  (20)

otherwise, $\alpha_t$ and $\beta_t$ are updated via (17)-(18). Using these updates, we propose in Algorithm 3 the AFF-TS algorithm for dynamic Bernoulli bandits.
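One way to read this modification is that the raw success/failure counts in the posterior are replaced by their discounted counterparts $m_t$ and $w_t - m_t$. A sketch (our own function; the batch form over a reward history is for illustration only):

```python
def aff_ts_posterior(rewards, lams, alpha0=2.0, beta0=2.0):
    """Beta posterior with forgetting (sketch): the success/failure counts
    are replaced by the discounted counts m_t and w_t - m_t, added to the
    prior (alpha0, beta0).  lams[i] is the factor applied before reward i."""
    m = w = 0.0
    for y, lam in zip(rewards, lams):
        m = lam * m + y      # discounted number of successes
        w = lam * w + 1.0    # discounted number of observations
    return alpha0 + m, beta0 + (w - m)

# With all factors equal to 1 this recovers the standard TS update
# (alpha0 + #successes, beta0 + #failures):
assert aff_ts_posterior([1, 0, 1, 1], [1.0] * 4) == (5.0, 3.0)
```

With forgetting (λ < 1), $w_t$ stays bounded, so the posterior never becomes overconfident and keeps responding to recent rewards.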

Optimistic Thompson Sampling
We now look at some popular extensions of TS. May et al. (2012) introduced the optimistic version of Thompson sampling, called Optimistic Thompson Sampling (OTS), where the drawn sample value is replaced by its posterior mean if the former is smaller. That is to say, for each arm, the score used for the decision will never be smaller than the posterior mean. OTS boosts further the exploration of highly uncertain arms compared to TS, as it increases the probability of getting a high score for arms with high posterior variance. However, OTS has the same problem as TS when applied to a dynamic problem, in that it uses the full reward history to update the posterior distribution. We propose the AFF version of OTS in Algorithm 4.
Algorithm 4 AFF-OTS
Initialisation: play each arm once.
for t = |A| + 1, · · · , T do
    for all a ∈ A, draw a sample x(a) from Beta($\alpha_{t-1}(a)$, $\beta_{t-1}(a)$), and replace x(a) with $\frac{\alpha_{t-1}(a)}{\alpha_{t-1}(a) + \beta_{t-1}(a)}$ if x(a) is smaller;
    find $a_t = \arg\max_{a' \in A} x(a')$;
    select arm $a_t$ and observe reward $Y_t(a_t)$;
    update $\alpha_t(a_t)$ and $\beta_t(a_t)$ according to (19)-(20).
end for

Tuning Parameter C in Dynamic Thompson Sampling
The Dynamic Thompson Sampling (DTS) algorithm was introduced by Gupta et al. (2011) specifically for solving the dynamic Bernoulli bandit problem of interest here. The DTS algorithm uses a pre-determined threshold C in updating the posterior parameters $\alpha_t$ and $\beta_t$, while using the standard Thompson sampling technique for arm selection. For the arm that is selected at time t, if $\alpha_{t-1} + \beta_{t-1} < C$, the posterior parameters are updated via (15)-(16); otherwise, when $\alpha_{t-1} + \beta_{t-1} \geq C$,

$\alpha_t = (\alpha_{t-1} + Y_t) \frac{C}{C+1}$,  (21)
$\beta_t = (\beta_{t-1} + 1 - Y_t) \frac{C}{C+1}$,  (22)

so that $\alpha_t + \beta_t$ remains capped at C. To understand this, let $\hat{\mu}_t$ denote the posterior mean, and assume an arm is observed all the time. Say at time s the arm reaches the threshold, i.e., $\alpha_t + \beta_t = C$ for t = s and onwards.
Following (17)-(21) of Gupta et al. (2011), for t ≥ s the posterior mean satisfies

$\hat{\mu}_t = \frac{C}{C+1} \hat{\mu}_{t-1} + \frac{1}{C+1} Y_t$,

which is a weighted average of $\hat{\mu}_{t-1}$ and the observation $Y_t$. The recursion for $\hat{\mu}_t$ is similar to the EWMA scheme (Roberts, 1959). Essentially, the DTS algorithm uses the threshold C to bound the total amount of reward history used for updating the posterior distribution; once the threshold is reached, the algorithm puts more weight on newer observations. Although it was demonstrated in Gupta et al. (2011) that the DTS algorithm has the ability to track changes in the expected reward, its performance is very sensitive to the choice of C. In our numerical simulations (see Section 4.2.2), we found that the performance of the DTS algorithm varies considerably with different values of C. However, Gupta et al. (2011) did not provide tuning methods for C. To address this issue, we propose below two different ways to tune C adaptively at each time step using AFF estimates (AFF-DTS1 and AFF-DTS2, respectively).
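A sketch of the thresholded DTS update (our own function; the rescaling by C/(C + 1) follows the description above and keeps $\alpha_t + \beta_t$ capped at C once the threshold is reached):

```python
def dts_update(alpha, beta, y, C):
    """DTS posterior update (sketch): the standard Beta update while
    alpha + beta < C, rescaled by C / (C + 1) once the threshold is
    reached, so that alpha + beta never exceeds C."""
    if alpha + beta < C:
        return alpha + y, beta + (1 - y)
    scale = C / (C + 1.0)
    return (alpha + y) * scale, (beta + (1 - y)) * scale

# Once the threshold is reached, alpha + beta stays capped at C, and the
# posterior mean becomes an EWMA-style weighted average of old mean and y.
a, b = dts_update(6.0, 4.0, 1, C=10)
assert abs(a + b - 10.0) < 1e-12
```

The cap on $\alpha_t + \beta_t$ is what bounds the effective memory: the posterior behaves as if it had seen at most C observations.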

AFF-DTS1
From the numerical results in Gupta et al. (2011, sect. IV.C), the optimal C is related to the speed of change of $\mu_t$. This motivates us to tune C according to the variance of the data stream. We can use the AFF variance, $s_t^2$, defined in (10) as an estimate of the data variance. One option is to use $C_t \propto 1/s_t^2$: since a high $s_t^2$ indicates more dynamics in $\mu_t$, a shorter reward history is required. This is the form we use in the numerical examples in Section 4.2.2.

AFF-DTS2

Another way to set $C_t$ is based on the similarity of the posterior mean in DTS and the AFF mean introduced in Section 2. In particular, from (21)-(22) the posterior mean is given by:

$\hat{\mu}_t = \frac{C}{C+1} \hat{\mu}_{t-1} + \frac{1}{C+1} Y_t$.

Using (2)-(5), one can re-write the AFF mean as:

$\hat{Y}_t = \frac{w_t - 1}{w_t} \hat{Y}_{t-1} + \frac{1}{w_t} Y_t$.

Comparing the two recursions, $w_t$ plays the role of C + 1. Therefore, at each time step t, we can set $C_t = w_t - 1$.

Numerical Results
In this section, we illustrate the performance improvements on $\epsilon$-Greedy, UCB, and TS achieved by using AFFs. We consider two different dynamic scenarios for the expected reward $\mu_t$: abruptly changing and drifting. For the abruptly changing scenario, instead of manually setting up change points in $\mu_t$ as in Yu and Mannor (2009) and Garivier and Moulines (2011), we set up change-point instants for an arm by an exponential clock (see Section 4.1.1). In the drifting scenario, the evolution of the expected reward $\mu_t$ is driven by a random walk in the interval (0, 1). For the random walk case we use two different models: the first is inspired by Slivkins and Upfal (2008), where $\mu_t$ is modelled by a random walk with reflecting bounds; the second uses a transformation function applied to a random walk. For each scenario, we test the performance with 2, 50, and 100 arms; the two-armed examples are used for the purpose of illustration, and the larger examples (50 and 100 arms) are used to evaluate the performance with a large number of arms. We also demonstrate the robustness of the AFF MAB algorithms to tuning, specifically, sensitivity to the step size η. Finally, we use a two-armed example to show that the modified DTS algorithms, i.e., AFF-DTS1 and AFF-DTS2, can reduce the performance sensitivity of DTS to the input parameter C.

Performance for Different Dynamic Models
We first use two-armed examples to compare the performance of AFF-d-Greedy, AFF-UCB, and AFF-TS/AFF-OTS to the standard methods $\epsilon$-Greedy, UCB, and TS respectively. We consider four different cases: two for the abruptly changing scenario, and two for the drifting scenario; each case has 100 independent replications. The length of each simulated experiment is T = 10,000. For the $\epsilon$-Greedy method, we evaluate over a grid of choices $\epsilon \in \{0.1, \cdots, 0.9\}$ and report performance for the best choice. We use step size η = 0.001 for all AFF MAB algorithms. For AFF-d-Greedy, we set the threshold d = η. For all Thompson sampling based algorithms, we use Beta(2, 2) as the prior.

[Figure 1: single simulated paths of $\mu_t$ in the abruptly changing scenario, with parameters as in Table 1. Arm 1 is black and Arm 2 is red; left/right panels are for Case 1/2 respectively.]

Abruptly Changing Expected Reward
The expected reward $\mu_t$ is simulated by an exponential clock model: the durations between successive change points of an arm are drawn from an exponential distribution whose rate parameter $\theta$ determines the frequency at which change points occur. At each change point, the new expected reward is sampled from a uniform distribution $U(r_l, r_u)$. We generate two different cases, Cases 1 and 2; the parameters used for generating them can be found in Table 1. For visualisation purposes, Figure 1 displays a single simulated path of $\mu_t$ against $t$ for each case (Arm 1 in black, Arm 2 in red; left/right panels for Cases 1/2 respectively). For Case 1, we distinguish the two arms by varying their frequency of change, but in the long run, for large $T$, the time-averaged expected rewards $\bar{\mu}_T = \mathbb{E}\left[\frac{1}{T}\sum_{i=1}^{T}\mu_i\right]$ are the same. In Case 2, Arm 1 has a higher $\bar{\mu}_T$.
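For concreteness, rewards of this kind can be simulated along the following lines. This is a sketch rather than the authors' code: the function name is ours, and the interpretation of $\theta$ as the rate of the exponential clock (so segment durations have mean $1/\theta$) is an assumption.

```python
import numpy as np

def simulate_abrupt_mu(T, theta, r_l, r_u, seed=None):
    """Simulate an abruptly changing expected reward mu_t over T steps.

    Change points arrive according to an exponential clock with rate
    theta (assumed: segment durations ~ Exponential with mean 1/theta);
    at each change point the new expected reward is drawn from U(r_l, r_u).
    """
    rng = np.random.default_rng(seed)
    mu = np.empty(T)
    t = 0
    while t < T:
        # duration until the next change point (at least one step)
        duration = max(1, int(np.ceil(rng.exponential(1.0 / theta))))
        mu[t:t + duration] = rng.uniform(r_l, r_u)
        t += duration
    return mu

mu = simulate_abrupt_mu(T=10_000, theta=0.001, r_l=0.1, r_u=0.9, seed=0)
```

With $\theta = 0.001$ an arm changes roughly every 1,000 steps, giving about ten segments over a horizon of $T = 10{,}000$.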
In Figure 2, we present comparisons for each case. The bottom row of Figure 2 shows boxplots of the total regret $R_T$ as in (1), and the top row displays the cumulative regret over time, averaged over 100 independent replications. The plots are good evidence that our algorithms yield improved performance over the standard approaches. In particular, the improvement is most pronounced in Case 1, for which the two arms have the same $\bar{\mu}_T$. In the case that one arm's mean dominates in the long run (Case 2), the AFF MAB algorithms perform similarly to the standard methods; however, they have smaller variance among replications. In both cases, AFF-OTS has the best performance in terms of total regret.
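Equation (1) is not reproduced in this excerpt; assuming it has the usual dynamic form $R_T = \sum_{t=1}^{T} \left( \max_a \mu_t(a) - \mu_t(a_t) \right)$, the quantity being box-plotted can be computed as follows (the function name and array layout are ours):

```python
import numpy as np

def total_regret(mu, choices):
    """Total dynamic regret, assuming the standard definition
    R_T = sum_t ( max_a mu_t(a) - mu_t(a_t) ).

    mu:      (T, K) array of expected rewards, one row per time step
    choices: length-T sequence of selected arm indices a_t
    """
    mu = np.asarray(mu, dtype=float)
    t = np.arange(mu.shape[0])
    return float(np.sum(mu.max(axis=1) - mu[t, np.asarray(choices)]))
```

For example, always pulling the per-step best arm gives zero regret, and every suboptimal pull adds the per-step gap.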

Drifting Expected Reward
For the drifting scenario, we use two different models. The first is the random walk model with reflecting bounds introduced in Slivkins and Upfal (2008):
$$\mu_{t+1} = f(\mu_t + \zeta_{t+1}), \qquad (23)$$
where $\zeta_{t+1}$ is a zero-mean random increment with variance $\sigma^2_\mu$, and $f(x) = \langle x \rangle$ if $\langle x \rangle \le 1$ and $f(x) = 2 - \langle x \rangle$ otherwise, with $\langle x \rangle = |x| \pmod 2$, so that $f$ reflects the walk back into $[0,1]$. Slivkins and Upfal (2008) showed that $\mu_t$ generated by this model is stationary, that is, in the long run $\mu_t$ is uniformly distributed. The parameter $\sigma^2_\mu$ controls the rate of change of an arm. In the left panel of Figure 3, we illustrate a single sample from (23) with $\sigma^2_\mu = 0.0001$ (Case 3). Similar to Case 1, the two arms in Case 3 have the same $\bar{\mu}_T$.
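A minimal sketch of the reflecting random walk, assuming Gaussian increments (the increment distribution is not specified in this excerpt, and the function name is ours):

```python
import numpy as np

def simulate_reflecting_mu(T, mu0, sigma2, seed=None):
    """Random walk on [0,1] with reflecting bounds (cf. Slivkins & Upfal, 2008).

    Each step adds a zero-mean increment with variance sigma2 (assumed
    Gaussian here), then reflects the result back into [0,1] via the
    tent map x -> min(|x| mod 2, 2 - |x| mod 2).
    """
    rng = np.random.default_rng(seed)
    mu = np.empty(T)
    mu[0] = mu0
    step_sd = np.sqrt(sigma2)
    for t in range(1, T):
        x = abs(mu[t - 1] + rng.normal(0.0, step_sd)) % 2.0
        mu[t] = x if x <= 1.0 else 2.0 - x  # reflect back into [0,1]
    return mu
```

Because the reflection keeps the walk inside $[0,1]$ without absorbing it at the boundaries, long trajectories explore the whole interval, consistent with the stationary uniform distribution mentioned above.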
The second model we use to simulate drifting arms is
$$\mu_t = \phi(z_t), \qquad z_t = z_{t-1} + \zeta_t, \qquad (24)$$
where the expected reward $\mu_t$ is obtained by applying a monotone transformation $\phi : \mathbb{R} \to (0,1)$ to the random walk $z_t$. Since a random walk diverges in the long run, any trajectory will move closer and closer to one of the boundaries 0 or 1. Again, the parameter $\sigma^2_\mu$ (the variance of the increments $\zeta_t$) controls the speed at which $\mu_t$ evolves. In the right panel of Figure 3, we illustrate a single sample from (24) (Case 4).
The results for the drifting scenario can be found in Figure 4. The top row of Figure 4 displays the cumulative regret averaged over 100 independent replications, and the bottom row shows boxplots of total regret. For Case 3, simulated from the model in (23), the AFF MAB algorithms outperform the standard approaches. For Case 4, simulated from the model in (24), there is a solid improvement in the performance of TS, while UCB and AFF-UCB perform similarly. As in the abruptly changing scenario, AFF-OTS performs very well in both drifting cases in terms of total regret. Deploying adaptive estimation in UCB was more challenging, because the estimate from the AFF estimator (which is more dynamic, with less memory) is harder to interpret when modifying the upper bound.
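The transformed-random-walk model can be sketched as follows; the choice of the logistic function for $\phi$ and the Gaussian increments are assumptions (the excerpt only specifies a transformation onto (0,1)):

```python
import numpy as np

def simulate_transformed_mu(T, z0, sigma2, seed=None):
    """Drifting expected reward via a transformed random walk.

    z_t is a plain Gaussian random walk with increment variance sigma2;
    mu_t = phi(z_t), where phi is taken here to be the logistic function,
    a standard monotone map from R onto (0,1).
    """
    rng = np.random.default_rng(seed)
    z = z0 + np.cumsum(rng.normal(0.0, np.sqrt(sigma2), size=T))
    return 1.0 / (1.0 + np.exp(-z))  # logistic map onto (0,1)
```

Since $z_t$ diverges, $\mu_t$ drifts towards 0 or 1, which is exactly the boundary-seeking behaviour described in the text.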

Large Number of Arms
Modern applications of bandit problems can involve a large number of arms; for example, in online advertising, one may need to optimise among hundreds of websites. Therefore, we evaluate the performance of our AFF MAB algorithms with a large number of arms, repeating the earlier experiments with 50 and 100 arms. Results are shown in Figures 5-8. The performance gains hold for a large number of arms, and are very pronounced for all methods, including UCB (which was more challenging to improve). For Cases 1 and 3, the results for the fifty-armed and one-hundred-armed examples are very similar to the two-armed ones. For Cases 2 and 4, unlike the two-armed examples where the improvement of adaptive estimation on UCB is marginal, with 50 and 100 arms AFF-UCB performs better than UCB. In addition, AFF-OTS has good performance in all cases. In summary, with a large number of arms, our algorithms perform much better than the standard methods. Interestingly, in all cases, the results for 50 and 100 arms are very similar. This could be attributed to both numbers of arms being large enough to fill the reward space [0,1] well enough that the decision maker finds high-value arms in both settings.

Robustness to Tuning
We have already seen the improvements the AFF MAB algorithms can offer in different dynamic scenarios. We now move on to examine the sensitivity of performance to the tuning parameters.

Sensitivity to the Step Size in the AFF MAB Algorithms
In this section, we examine the influence of the step size $\eta$ on the AFF MAB algorithms. For brevity, we present results only for Case 3 (see Section 4.1.2); results for the other cases are very similar and hence omitted. For each AFF MAB algorithm, we run experiments with $\eta_1 = 0.0001$, $\eta_2 = 0.001$, $\eta_3 = 0.01$, and $\eta_4(t) = 0.0001/s^2_t$, where $s^2_t$ is the AFF variance defined in (10). Note that $\eta_1$, $\eta_2$, and $\eta_3$ are fixed, while $\eta_4$ can change over time. Figures 9-12 display the results for AFF-d-Greedy, AFF-UCB, AFF-TS, and AFF-OTS respectively. From the results, we can see that the algorithms are not particularly sensitive to the step size $\eta$.
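To make the role of $\eta$ concrete, the adaptive-forgetting mean estimate that underlies the AFF MAB algorithms can be sketched as below. This is a generic adaptive-forgetting-factor recursion, with $\eta$ scaling a gradient step on the one-step-ahead squared prediction error; the exact recursions and names are our assumptions, not the paper's algorithm verbatim.

```python
def aff_mean(xs, eta=0.001, lam0=0.99):
    """Adaptive forgetting factor (AFF) mean estimate (sketch).

    Maintains a forgetting-weighted sum m and weight w, with estimate
    m / w.  The forgetting factor lam is adapted by a gradient step of
    size eta on the one-step-ahead squared prediction error, tracked
    via the recursions for dm = d(m)/d(lam) and dw = d(w)/d(lam).
    """
    lam = lam0
    m = w = 0.0    # forgetting-weighted sum and total weight
    dm = dw = 0.0  # derivatives of m and w w.r.t. lam
    est = 0.0
    for x in xs:
        if w > 0.0:
            # d(est)/d(lam) by the quotient rule, then one gradient step
            dest = (dm * w - m * dw) / (w * w)
            lam -= eta * 2.0 * (est - x) * dest
            lam = min(max(lam, 0.0), 1.0)  # keep lam in [0, 1]
        dm = m + lam * dm
        dw = w + lam * dw
        m = lam * m + x
        w = lam * w + 1.0
        est = m / w
    return est
```

When the stream is stable the prediction error is near zero and $\lambda$ barely moves; after a change the error grows, $\lambda$ is pushed down, and the estimator forgets old data faster, which is the tracking behaviour the step size $\eta$ modulates.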
Sensitivity of DTS to the Parameter C

In Section 3.3.2, we discussed how adaptive estimation can be used to tune the input parameter $C$ in the DTS algorithm proposed by Gupta et al. (2011), and we offered two self-tuning solutions, AFF-DTS1 and AFF-DTS2. We use the two-armed abruptly changing example (Case 1 in Section 4.1.1) to illustrate how the AFF versions can reduce the sensitivity to $C$.
We test $C = 5$, 10, 100, and 1000 for DTS, AFF-DTS1, and AFF-DTS2; this value of $C$ also serves as the initial value $C_0$ of $C_t$ in AFF-DTS1 and AFF-DTS2. Step size $\eta = 0.001$ is used for all AFF-related algorithms. Figure 13 displays boxplots of the total regret (an acronym such as DTS-C5 denotes the DTS algorithm with parameter $C = 5$, and AFF-DTS1-C5 denotes the AFF-DTS1 algorithm with initial value $C_0 = 5$). We also plot the result of AFF-OTS as a benchmark, since it performed well in all cases studied in the previous section. From Figure 13, the performance of AFF-DTS1 and AFF-DTS2 is very stable, while DTS is very sensitive to $C$. With a bad choice of $C$ (here, 100 or 1000), the total regret of DTS is much higher than that of AFF-DTS1 and AFF-DTS2.

Conclusion
We have seen that the performance of popular MAB algorithms can be improved significantly using AFFs. The improvements are substantial when the arms are not distinguishable in the long run, i.e., when the arms have the same long-term averaged expected reward $\bar{\mu}_T$. For the case in which one arm has a higher $\bar{\mu}_T$ (e.g., the two-armed example in Case 2), the gains of the AFF MAB algorithms seem marginal, but there is no loss in performance, so practitioners can be encouraged to use our adaptive methods when they have no knowledge of how $\mu_t$ behaves over time. In addition, the performance gains with a large number of arms are very pronounced for all methods. Finally, the AFF MAB algorithms we propose are easy to implement; they do not require any prior knowledge about the dynamic environment, and appear more robust to tuning parameters.
Combining adaptive estimation with UCB was more challenging. The reason is that one needs to reinterpret the estimate of $\mu_t$ as a "more dynamic" estimator (with less memory) rather than a stable long-run average, and modify the upper bound accordingly. We should mention that our algorithm AFF-UCB turns out to be similar to D-UCB (Garivier and Moulines, 2011). D-UCB uses a fixed forgetting factor for estimation; however, tuning its forgetting factor requires knowing the number of switch points, which our approach does not require.
We conclude by mentioning some interesting avenues for future work. One extension is to apply AFF-based methods to more challenging problems, e.g., rotting bandits (Levine et al., 2017), contextual bandits (Langford and Zhang, 2008; Li et al., 2010), and applications such as online advertising. Another extension could involve a rigorous analysis of how the bias in AFF estimation varies with time and how this can affect arm selection in MAB problems.
However, as we are interested in dynamic cases, we favour a bound that maintains a certain level of exploration over time; hence we take a constant $\xi = 0.05$ and obtain the first part of the exploration bonus, $B^{(1)}_t$.
If we only use $B^{(1)}_t$ as the exploration bonus, then for an unselected arm its value will be static, since $w_t$ and $k_t$ do not change. As a consequence, such an arm will only be selected if the arm with the currently highest upper bound drops below it. This motivates us to deliberately add some inflation. A natural choice is to add the data variance, which leads to the second part $B^{(2)}_t = s^2_t / w_t$, where $s^2_t$ is the AFF variance defined in (10).
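Putting the two parts together, the AFF-UCB index might be computed as follows. This is a hypothetical sketch: the exact form of $B^{(1)}_t$ is not shown in this excerpt, so we assume a Hoeffding-type bound with the AFF weight $w_t$ playing the role of the sample count and constant confidence level $\xi$; only $B^{(2)}_t = s^2_t / w_t$ is taken directly from the text.

```python
import math

def aff_ucb_index(mu_hat, w_t, s2_t, xi=0.05):
    """Hypothetical AFF-UCB index: AFF estimate plus a two-part bonus.

    b1: assumed Hoeffding-type term sqrt(log(1/xi) / (2 * w_t)), with the
        AFF weight w_t standing in for the sample count (an assumption).
    b2: variance inflation s2_t / w_t, i.e. B^(2)_t as described above.
    """
    b1 = math.sqrt(math.log(1.0 / xi) / (2.0 * w_t))  # assumed form of B^(1)_t
    b2 = s2_t / w_t                                   # B^(2)_t from the text
    return mu_hat + b1 + b2
```

Because $s^2_t$ keeps evolving even for unselected arms as long as the AFF variance is refreshed, the $b2$ term provides the deliberate inflation that prevents an unselected arm's index from staying static.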