
On adaptive estimation for dynamic Bernoulli bandits
Department of Mathematics, Imperial College London, London, SW7 2AZ, UK 
The multi-armed bandit (MAB) problem is a classic example of the exploration-exploitation dilemma. It is concerned with maximising the total reward for a gambler who sequentially pulls arms of a multi-armed slot machine, where each arm is associated with a reward distribution. In static MABs, the reward distributions do not change over time, while in dynamic MABs, each arm's reward distribution can change, and the optimal arm can switch over time. Motivated by many real applications where rewards are binary, we focus on dynamic Bernoulli bandits. Standard methods like $\epsilon$-Greedy and Upper Confidence Bound (UCB), which rely on the sample mean estimator, often fail to track changes in the underlying reward for dynamic problems. In this paper, we overcome this slow response to change by deploying adaptive estimation in the standard methods, and we propose a new family of algorithms: adaptive versions of $\epsilon$-Greedy, UCB, and Thompson sampling. These new methods are simple and easy to implement. Moreover, they do not require any prior knowledge about the dynamic reward process, which is important for real applications. We examine the new algorithms numerically in different scenarios, and the results show solid improvements of our algorithms in dynamic environments.
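To illustrate why the sample mean responds slowly to change, the following is a minimal sketch of $\epsilon$-Greedy in which the sample mean is replaced by an exponentially weighted mean. It is not the algorithm proposed in the paper: the abstract does not specify the adaptive estimator, so a fixed forgetting factor `lam` is assumed here as a simplification, and the class `ForgettingMean`, the function `epsilon_greedy`, and the two-arm switching scenario `switching_probs` are all hypothetical names and settings chosen for illustration.

```python
import random


class ForgettingMean:
    """Exponentially weighted mean with a FIXED forgetting factor (an assumption).

    lam = 1.0 recovers the ordinary sample mean; lam < 1 down-weights old rewards.
    """

    def __init__(self, lam=0.95):
        self.lam = lam
        self.m = 0.0  # discounted sum of rewards
        self.w = 0.0  # discounted number of observations

    def update(self, reward):
        self.m = self.lam * self.m + reward
        self.w = self.lam * self.w + 1.0

    @property
    def mean(self):
        # Return 0.0 before the first observation to avoid dividing by zero.
        return self.m / self.w if self.w > 0 else 0.0


def switching_probs(t):
    """Hypothetical dynamic scenario: the optimal arm switches at t = 1000."""
    return (0.8, 0.2) if t < 1000 else (0.2, 0.8)


def epsilon_greedy(estimators, probs_at, horizon, eps=0.1, seed=0):
    """Run epsilon-Greedy on a Bernoulli bandit and return the total reward."""
    rng = random.Random(seed)
    n_arms = len(estimators)
    total = 0
    for t in range(horizon):
        if rng.random() < eps:
            arm = rng.randrange(n_arms)  # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimators[a].mean)  # exploit
        reward = 1 if rng.random() < probs_at(t)[arm] else 0
        estimators[arm].update(reward)
        total += reward
    return total
```

Calling `epsilon_greedy([ForgettingMean(1.0), ForgettingMean(1.0)], switching_probs, 2000)` runs the sample-mean baseline, while passing estimators built with `lam < 1` runs the forgetting variant on the same scenario. A fixed `lam` still has to be chosen by hand, which is the kind of prior knowledge the adaptive methods in the paper are meant to avoid.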
       | Case 1            | Case 2
Arm 1  | 0.001  0.0  1.0   | 0.001  0.3  1.0
Arm 2  | 0.010  0.0  1.0   | 0.010  0.0  0.7