June 2019, 1(2): 197-225. doi: 10.3934/fods.2019009

## On adaptive estimation for dynamic Bernoulli bandits

Xue Lu, Niall Adams and Nikolas Kantas

Department of Mathematics, Imperial College London, London, SW7 2AZ, UK

* Corresponding author: Nikolas Kantas

Published June 2019

The multi-armed bandit (MAB) problem is a classic example of the exploration-exploitation dilemma. It is concerned with maximising the total reward for a gambler who sequentially pulls arms of a multi-armed slot machine, where each arm is associated with a reward distribution. In static MABs, the reward distributions do not change over time, while in dynamic MABs, each arm's reward distribution can change and the optimal arm can switch over time. Motivated by many real applications where rewards are binary, we focus on dynamic Bernoulli bandits. Standard methods like $\epsilon$-Greedy and Upper Confidence Bound (UCB), which rely on the sample mean estimator, often fail to track changes in the underlying reward for dynamic problems. In this paper, we overcome the shortcoming of slow response to change by deploying adaptive estimation in the standard methods and propose a new family of algorithms, which are adaptive versions of $\epsilon$-Greedy, UCB, and Thompson sampling. These new methods are simple and easy to implement. Moreover, they do not require any prior knowledge about the dynamic reward process, which is important for real applications. We examine the new algorithms numerically in different scenarios, and the results show that our algorithms yield solid improvements in dynamic environments.
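To make the role of adaptive estimation concrete, the sketch below runs $\epsilon$-Greedy on a dynamic Bernoulli bandit with the sample mean replaced by an exponentially weighted mean. This is a simplified stand-in for the adaptive forgetting factor (AFF) estimators studied in the paper, which tune the forgetting factor online rather than fixing it; the function names and the value $\lambda = 0.99$ are illustrative assumptions, not the paper's settings.

```python
# A minimal sketch (not the paper's exact method): epsilon-Greedy on a
# dynamic Bernoulli bandit, with the usual sample mean replaced by an
# exponentially weighted mean using a *fixed* forgetting factor. The
# paper's AFF estimators instead adapt the forgetting factor online;
# all names and lam = 0.99 below are illustrative assumptions.
import random

class ForgettingMean:
    """Exponentially weighted mean: recent rewards get more weight, so the
    estimate can track a reward probability that changes over time."""
    def __init__(self, lam=0.99):
        self.lam = lam  # forgetting factor in (0, 1]; lam = 1 gives the plain sample mean
        self.m = 0.0    # discounted sum of rewards
        self.w = 0.0    # discounted number of observations

    def update(self, x):
        self.m = self.lam * self.m + x
        self.w = self.lam * self.w + 1.0

    def estimate(self):
        return self.m / self.w if self.w > 0.0 else 0.5  # neutral default before any data

def eps_greedy(arms, horizon=10_000, eps=0.1):
    """arms: callables t -> reward in {0, 1}, allowing time-varying probabilities."""
    est = [ForgettingMean() for _ in arms]
    total = 0
    for t in range(horizon):
        if random.random() < eps:
            k = random.randrange(len(arms))                             # explore
        else:
            k = max(range(len(arms)), key=lambda i: est[i].estimate())  # exploit
        r = arms[k](t)
        est[k].update(r)
        total += r
    return total

# Example: a two-armed bandit whose success probabilities swap at t = 5,000.
arm_a = lambda t: int(random.random() < (0.8 if t < 5_000 else 0.2))
arm_b = lambda t: int(random.random() < (0.2 if t < 5_000 else 0.8))
print(eps_greedy([arm_a, arm_b]))
```

With $\lambda = 1$ the estimator reduces to the sample mean used by standard $\epsilon$-Greedy and UCB, which is exactly the slow-to-respond behaviour the paper addresses; smaller $\lambda$ discounts old rewards faster, trading estimation variance for tracking speed.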

Citation: Xue Lu, Niall Adams, Nikolas Kantas. On adaptive estimation for dynamic Bernoulli bandits. Foundations of Data Science, 2019, 1 (2) : 197-225. doi: 10.3934/fods.2019009
##### Figures:
Illustration of the difference between tuning $d$ in AFF-$d$-Greedy and tuning $\epsilon$ in 'adaptive estimation $\epsilon$-Greedy'. The step size $\eta = 0.01$
Performance of different algorithms in the case of a small number of changes
Abruptly changing scenario (Case 1): examples of $\mu_{t}$ sampled from the model in (28) with parameters of Case 1 displayed in Table 1
Abruptly changing scenario (Case 2): examples of $\mu_{t}$ sampled from the model in (28) with parameters of Case 2 displayed in Table 1
Results for the two-armed Bernoulli bandit with abruptly changing expected rewards. The top row displays the cumulative regret over time; results are averaged over 100 replications. The bottom row shows boxplots of total regret at time $t = 10,000$. Trajectories are sampled from (28) with parameters displayed in Table 1
Drifting scenario (Case 3): examples of $\mu_t$ simulated from the model in (29) with $\sigma^{2}_{\mu} = 0.0001$
Drifting scenario (Case 4): examples of $\mu_t$ simulated from the model in (30) with $\sigma^{2}_{\mu} = 0.001$
Results for the two-armed Bernoulli bandit with drifting expected rewards. The top row displays the cumulative regret over time; results are averaged over 100 independent replications. The bottom row shows boxplots of total regret at time $t = 10,000$. Trajectories for Case 3 are sampled from (29) with $\sigma^{2}_{\mu} = 0.0001$, and trajectories for Case 4 are sampled from (30) with $\sigma^{2}_{\mu} = 0.001$
Large number of arms: abruptly changing environment (Case 1)
Large number of arms: abruptly changing environment (Case 2)
Large number of arms: drifting environment (Case 3)
Large number of arms: drifting environment (Case 4)
AFF-$d$-Greedy algorithm with different $\eta$ values. $\eta_{1} = 0.0001$, $\eta_{2} = 0.001$, $\eta_{3} = 0.01$, and $\eta_{4}(t) = 0.0001/s^{2}_{t}$, where $s^{2}_{t}$ is as in (11)
AFF versions of UCB algorithm with different $\eta$ values. $\eta_{1} = 0.0001$, $\eta_{2} = 0.001$, $\eta_{3} = 0.01$, and $\eta_{4}(t) = 0.0001/s^{2}_{t}$, where $s^{2}_{t}$ is as in (11)
AFF versions of TS algorithm with different $\eta$ values. $\eta_{1} = 0.0001$, $\eta_{2} = 0.001$, $\eta_{3} = 0.01$, and $\eta_{4}(t) = 0.0001/s^{2}_{t}$, where $s^{2}_{t}$ is as in (11)
D-UCB and SW-UCB algorithms with different values of key parameters
Boxplots of total regret for algorithms DTS, AFF-DTS1, and AFF-DTS2. A label such as DTS-C5 denotes the DTS algorithm with parameter $C = 5$; similarly, AFF-DTS1-C5 denotes the AFF-DTS1 algorithm with initial value $C_{0} = 5$. The result of AFF-OTS is plotted as a benchmark
##### Tables:

Parameters used in the exponential clock model shown in (28)
|       | Case 1: $\theta$ | Case 1: $r_{l}$ | Case 1: $r_{u}$ | Case 2: $\theta$ | Case 2: $r_{l}$ | Case 2: $r_{u}$ |
| ----- | ---------------- | --------------- | --------------- | ---------------- | --------------- | --------------- |
| Arm 1 | 0.001            | 0.0             | 1.0             | 0.001            | 0.3             | 1.0             |
| Arm 2 | 0.010            | 0.0             | 1.0             | 0.010            | 0.0             | 0.7             |
