
Previous Article
Levels and trends in the sex ratio at birth and missing female births for 29 states and union territories in India 1990–2016: A Bayesian modeling study
 FoDS Home
 This Issue

Next Article
EmT: Locating empty territories of homology group generators in a dataset
On adaptive estimation for dynamic Bernoulli bandits
Department of Mathematics, Imperial College London, London, SW7 2AZ, UK 
The multiarmed bandit (MAB) problem is a classic example of the explorationexploitation dilemma. It is concerned with maximising the total rewards for a gambler by sequentially pulling an arm from a multiarmed slot machine where each arm is associated with a reward distribution. In static MABs, the reward distributions do not change over time, while in dynamic MABs, each arm's reward distribution can change, and the optimal arm can switch over time. Motivated by many real applications where rewards are binary, we focus on dynamic Bernoulli bandits. Standard methods like $ \epsilon $Greedy and Upper Confidence Bound (UCB), which rely on the sample mean estimator, often fail to track changes in the underlying reward for dynamic problems. In this paper, we overcome the shortcoming of slow response to change by deploying adaptive estimation in the standard methods and propose a new family of algorithms, which are adaptive versions of $ \epsilon $Greedy, UCB, and Thompson sampling. These new methods are simple and easy to implement. Moreover, they do not require any prior knowledge about the dynamic reward process, which is important for real applications. We examine the new algorithms numerically in different scenarios and the results show solid improvements of our algorithms in dynamic environments.
References:
[1] 
C. Anagnostopoulos, D. K. Tasoulis, N. M. Adams, N. G. Pavlidis and D. J. Hand, Online linear and quadratic discriminant analysis with adaptive forgetting for streaming classification, Statistical Analysis and Data Mining: The ASA Data Science Journal, 5 (2012), 139166. doi: 10.1002/sam.10151. Google Scholar 
[2] 
P. Auer, N. CesaBianchi and P. Fischer, Finitetime analysis of the multiarmed bandit problem, Machine Learning, 47 (2002), 235256. Google Scholar 
[3] 
P. Auer, N. CesaBianchi, Y. Freund and R. E. Schapire, The nonstochastic multiarmed bandit problem, SIAM Journal on Computing, 32 (2002), 4877. doi: 10.1137/S0097539701398375. Google Scholar 
[4] 
B. Awerbuch and R. Kleinberg, Online linear optimization and adaptive routing, Journal of Computer and System Sciences, 74 (2008), 97114. doi: 10.1016/j.jcss.2007.04.016. Google Scholar 
[5] 
D. A. Bodenham and N. M. Adams, Continuous monitoring for changepoints in data streams using adaptive estimation, Statistics and Computing, 27 (2017), 12571270. doi: 10.1007/s1122201696848. Google Scholar 
[6] 
E. Brochu, M. D. Hoffman and N. de Freitas, Portfolio allocation for Bayesian optimization, preprint, arXiv: 1009.5419v2.Google Scholar 
[7] 
O. Chapelle and L. Li, An empirical evaluation of Thompson sampling, in Advances in Neural Information Processing Systems 24, Curran Associates, Inc., (2011), 2249–2257.Google Scholar 
[8] 
A. Garivier and O. Cappe, The KLUCB algorithm for bounded stochastic bandits and beyond, in Proceedings of the 24th Annual Conference on Learning Theory, vol. 19 of PMLR, (2011), 359–376.Google Scholar 
[9] 
A. Garivier and E. Moulines, On upperconfidence bound policies for switching bandit problems, in Algorithmic Learning Theory, vol. 6925 of Lecture Notes in Artificial Intelligence, SpringerVerlag Berlin, (2011), 174–188. doi: 10.1007/9783642244124_16. Google Scholar 
[10] 
P. W. Glynn and D. Ormoneit, Hoeffding's inequality for uniformly ergodic Markov chains, Statistics and Probability Letters, 56 (2002), 143146. doi: 10.1016/S01677152(01)001584. Google Scholar 
[11] 
O.C. Granmo and S. Berg, Solving nonstationary bandit problems by random sampling from sibling Kalman filters, in Proceedings of Trends in Applied Intelligent Systems, PT III, vol. 6098 of Lecture Notes in Artificial Intelligence, SpringerVerlag Berlin, (2010), 199–208.Google Scholar 
[12] 
N. Gupta, O.C. Granmo and A. Agrawala, Thompson sampling for dynamic multiarmed bandits, in Proceedings of the 10th International Conference on Machine Learning and Applications and Workshops, (2011), 484–489.Google Scholar 
[13] 
S. S. Haykin, Adaptive Filter Theory, 4^{th} edition, PrenticeHall, Upper Saddle River, N.J., 2002.Google Scholar 
[14] 
W. Hoeffding, Probability inequalities for sums of bounded random variables, Journal of the American Statistical Association, 58 (1963), 1330. Google Scholar 
[15] 
L. Kocsis and C. Szepesvari, Discounted UCB, in 2nd PASCAL Challenges Workshop, Venice, 2006. Available from: https://www.lri.fr/ sebag/Slides/Venice/Kocsis.pdf.Google Scholar 
[16] 
D. E. Koulouriotis and A. Xanthopoulos, Reinforcement learning and evolutionary algorithms for nonstationary multiarmed bandit problems, Applied Mathematics and Computation, 196 (2008), 913922. Google Scholar 
[17] 
V. Kuleshov and D. Precup, Algorithms for the multiarmed bandit problem, preprint, arXiv: 1402.6028v1.Google Scholar 
[18] 
J. Langford and T. Zhang, The epochgreedy algorithm for multiarmed bandits with side information, in Advances in Neural Information Processing Systems 20, Curran Associates, Inc., (2008), 817–824.Google Scholar 
[19] 
N. Levine, K. Crammer and S. Mannor, Rotting bandits, in Advances in Neural Information Processing Systems 30, Curran Associates, Inc., (2017) 3074–3083.Google Scholar 
[20] 
L. Li, W. Chu, J. Langford and R. E. Schapire, A contextualbandit approach to personalized news article recommendation, in Proceedings of the 19th International Conference on World Wide Web, ACM, (2010), 661–670.Google Scholar 
[21] 
B. C. May, N. Korda, A. Lee and D. S. Leslie, Optimistic Bayesian sampling in contextualbandit problems, The Journal of Machine Learning Research, 13 (2012), 20692106. Google Scholar 
[22] 
C. H. Papadimitriou and J. N. Tsitsiklis, The complexity of optimal queuing network control, Mathematics of Operations Research, 24 (1999), 293305. doi: 10.1287/moor.24.2.293. Google Scholar 
[23] 
W. H. Press, Bandit solutions provide unified ethical models for randomized clinical trials and comparative effectiveness research, Proceedings of the National Academy of Sciences of the United States of America, 106 (2009), 2238722392. Google Scholar 
[24] 
V. Raj and S. Kalyani, Taming nonstationary bandits: A Bayesian approach, arXiv: 1707.09727.Google Scholar 
[25] 
H. Robbins, Some aspects of the sequential design of experiments, Bulletin of the American Mathematical Society, 58 (1952), 527535. doi: 10.1090/S000299041952096208. Google Scholar 
[26] 
S. W. Roberts, Control chart tests based on geometric moving averages, Technometrics, 1 (1959), 239250. Google Scholar 
[27] 
S. L. Scott, A modern Bayesian look at the multiarmed bandit, Applied Stochastic Models in Business and Industry, 26 (2010), 639658. doi: 10.1002/asmb.874. Google Scholar 
[28] 
S. L. Scott, Multiarmed bandit experiments in the online service economy, Applied Stochastic Models in Business and Industry, 31 (2015), 3745. doi: 10.1002/asmb.2104. Google Scholar 
[29] 
W. Shen, J. Wang, Y.G. Jiang and H. Zha, Portfolio choices with orthogonal bandit learning, in Proceedings of the TwentyFourth International Joint Conference on Artificial Intelligence, (2015), 974–980.Google Scholar 
[30] 
A. Slivkins and E. Upfal, Adapting to a changing environment: The Brownian restless bandits, in 21st Conference on Learning Theory, (2008), 343–354.Google Scholar 
[31] 
W. R. Thompson, On the likelihood that one unknown probability exceeds another in view of the evidence of two samples, Biometrika, 25 (1933), 285294. Google Scholar 
[32] 
F. Tsung and K. Wang, Adaptive charting techniques: Literature review and extensions, in Frontiers in Statistical Quality Control 9, PhysicaVerlag HD, Heidelberg, (2010), 19–35,.Google Scholar 
[33] 
J. Vermorel and M. Mohri, Multiarmed bandit algorithms and empirical evaluation, in Proceedings of the 16th European Conference on Machine Learning, vol. 3720 of Lecture Notes in Computer Science, Springer, Berlin, (2005), 437–448.Google Scholar 
[34] 
S. S. Villar, J. Bowden and J. Wason, Multiarmed bandit models for the optimal design of clinical trials: Benefits and challenges, Statistical Science, 30 (2015), 199215. doi: 10.1214/14STS504. Google Scholar 
[35] 
C. J. C. H. Watkins, Learning from Delayed Rewards, Ph.D thesis, Cambridge University, 1989.Google Scholar 
[36] 
P. Whittle, Restless bandits: Activity allocation in a changing world, Journal of Applied Probability, 25 (1988), 287298. doi: 10.1017/s0021900200040420. Google Scholar 
[37] 
J. Y. Yu and S. Mannor, Piecewisestationary bandit problems with side observations, in Proceedings of the 26th International Conference on Machine Learning, (2009), 1177–1184.Google Scholar 
show all references
References:
[1] 
C. Anagnostopoulos, D. K. Tasoulis, N. M. Adams, N. G. Pavlidis and D. J. Hand, Online linear and quadratic discriminant analysis with adaptive forgetting for streaming classification, Statistical Analysis and Data Mining: The ASA Data Science Journal, 5 (2012), 139166. doi: 10.1002/sam.10151. Google Scholar 
[2] 
P. Auer, N. CesaBianchi and P. Fischer, Finitetime analysis of the multiarmed bandit problem, Machine Learning, 47 (2002), 235256. Google Scholar 
[3] 
P. Auer, N. CesaBianchi, Y. Freund and R. E. Schapire, The nonstochastic multiarmed bandit problem, SIAM Journal on Computing, 32 (2002), 4877. doi: 10.1137/S0097539701398375. Google Scholar 
[4] 
B. Awerbuch and R. Kleinberg, Online linear optimization and adaptive routing, Journal of Computer and System Sciences, 74 (2008), 97114. doi: 10.1016/j.jcss.2007.04.016. Google Scholar 
[5] 
D. A. Bodenham and N. M. Adams, Continuous monitoring for changepoints in data streams using adaptive estimation, Statistics and Computing, 27 (2017), 12571270. doi: 10.1007/s1122201696848. Google Scholar 
[6] 
E. Brochu, M. D. Hoffman and N. de Freitas, Portfolio allocation for Bayesian optimization, preprint, arXiv: 1009.5419v2.Google Scholar 
[7] 
O. Chapelle and L. Li, An empirical evaluation of Thompson sampling, in Advances in Neural Information Processing Systems 24, Curran Associates, Inc., (2011), 2249–2257.Google Scholar 
[8] 
A. Garivier and O. Cappe, The KLUCB algorithm for bounded stochastic bandits and beyond, in Proceedings of the 24th Annual Conference on Learning Theory, vol. 19 of PMLR, (2011), 359–376.Google Scholar 
[9] 
A. Garivier and E. Moulines, On upperconfidence bound policies for switching bandit problems, in Algorithmic Learning Theory, vol. 6925 of Lecture Notes in Artificial Intelligence, SpringerVerlag Berlin, (2011), 174–188. doi: 10.1007/9783642244124_16. Google Scholar 
[10] 
P. W. Glynn and D. Ormoneit, Hoeffding's inequality for uniformly ergodic Markov chains, Statistics and Probability Letters, 56 (2002), 143146. doi: 10.1016/S01677152(01)001584. Google Scholar 
[11] 
O.C. Granmo and S. Berg, Solving nonstationary bandit problems by random sampling from sibling Kalman filters, in Proceedings of Trends in Applied Intelligent Systems, PT III, vol. 6098 of Lecture Notes in Artificial Intelligence, SpringerVerlag Berlin, (2010), 199–208.Google Scholar 
[12] 
N. Gupta, O.C. Granmo and A. Agrawala, Thompson sampling for dynamic multiarmed bandits, in Proceedings of the 10th International Conference on Machine Learning and Applications and Workshops, (2011), 484–489.Google Scholar 
[13] 
S. S. Haykin, Adaptive Filter Theory, 4^{th} edition, PrenticeHall, Upper Saddle River, N.J., 2002.Google Scholar 
[14] 
W. Hoeffding, Probability inequalities for sums of bounded random variables, Journal of the American Statistical Association, 58 (1963), 1330. Google Scholar 
[15] 
L. Kocsis and C. Szepesvari, Discounted UCB, in 2nd PASCAL Challenges Workshop, Venice, 2006. Available from: https://www.lri.fr/ sebag/Slides/Venice/Kocsis.pdf.Google Scholar 
[16] 
D. E. Koulouriotis and A. Xanthopoulos, Reinforcement learning and evolutionary algorithms for nonstationary multiarmed bandit problems, Applied Mathematics and Computation, 196 (2008), 913922. Google Scholar 
[17] 
V. Kuleshov and D. Precup, Algorithms for the multiarmed bandit problem, preprint, arXiv: 1402.6028v1.Google Scholar 
[18] 
J. Langford and T. Zhang, The epochgreedy algorithm for multiarmed bandits with side information, in Advances in Neural Information Processing Systems 20, Curran Associates, Inc., (2008), 817–824.Google Scholar 
[19] 
N. Levine, K. Crammer and S. Mannor, Rotting bandits, in Advances in Neural Information Processing Systems 30, Curran Associates, Inc., (2017) 3074–3083.Google Scholar 
[20] 
L. Li, W. Chu, J. Langford and R. E. Schapire, A contextualbandit approach to personalized news article recommendation, in Proceedings of the 19th International Conference on World Wide Web, ACM, (2010), 661–670.Google Scholar 
[21] 
B. C. May, N. Korda, A. Lee and D. S. Leslie, Optimistic Bayesian sampling in contextualbandit problems, The Journal of Machine Learning Research, 13 (2012), 20692106. Google Scholar 
[22] 
C. H. Papadimitriou and J. N. Tsitsiklis, The complexity of optimal queuing network control, Mathematics of Operations Research, 24 (1999), 293305. doi: 10.1287/moor.24.2.293. Google Scholar 
[23] 
W. H. Press, Bandit solutions provide unified ethical models for randomized clinical trials and comparative effectiveness research, Proceedings of the National Academy of Sciences of the United States of America, 106 (2009), 2238722392. Google Scholar 
[24] 
V. Raj and S. Kalyani, Taming nonstationary bandits: A Bayesian approach, arXiv: 1707.09727.Google Scholar 
[25] 
H. Robbins, Some aspects of the sequential design of experiments, Bulletin of the American Mathematical Society, 58 (1952), 527535. doi: 10.1090/S000299041952096208. Google Scholar 
[26] 
S. W. Roberts, Control chart tests based on geometric moving averages, Technometrics, 1 (1959), 239250. Google Scholar 
[27] 
S. L. Scott, A modern Bayesian look at the multiarmed bandit, Applied Stochastic Models in Business and Industry, 26 (2010), 639658. doi: 10.1002/asmb.874. Google Scholar 
[28] 
S. L. Scott, Multiarmed bandit experiments in the online service economy, Applied Stochastic Models in Business and Industry, 31 (2015), 3745. doi: 10.1002/asmb.2104. Google Scholar 
[29] 
W. Shen, J. Wang, Y.G. Jiang and H. Zha, Portfolio choices with orthogonal bandit learning, in Proceedings of the TwentyFourth International Joint Conference on Artificial Intelligence, (2015), 974–980.Google Scholar 
[30] 
A. Slivkins and E. Upfal, Adapting to a changing environment: The Brownian restless bandits, in 21st Conference on Learning Theory, (2008), 343–354.Google Scholar 
[31] 
W. R. Thompson, On the likelihood that one unknown probability exceeds another in view of the evidence of two samples, Biometrika, 25 (1933), 285294. Google Scholar 
[32] 
F. Tsung and K. Wang, Adaptive charting techniques: Literature review and extensions, in Frontiers in Statistical Quality Control 9, PhysicaVerlag HD, Heidelberg, (2010), 19–35,.Google Scholar 
[33] 
J. Vermorel and M. Mohri, Multiarmed bandit algorithms and empirical evaluation, in Proceedings of the 16th European Conference on Machine Learning, vol. 3720 of Lecture Notes in Computer Science, Springer, Berlin, (2005), 437–448.Google Scholar 
[34] 
S. S. Villar, J. Bowden and J. Wason, Multiarmed bandit models for the optimal design of clinical trials: Benefits and challenges, Statistical Science, 30 (2015), 199215. doi: 10.1214/14STS504. Google Scholar 
[35] 
C. J. C. H. Watkins, Learning from Delayed Rewards, Ph.D thesis, Cambridge University, 1989.Google Scholar 
[36] 
P. Whittle, Restless bandits: Activity allocation in a changing world, Journal of Applied Probability, 25 (1988), 287298. doi: 10.1017/s0021900200040420. Google Scholar 
[37] 
J. Y. Yu and S. Mannor, Piecewisestationary bandit problems with side observations, in Proceedings of the 26th International Conference on Machine Learning, (2009), 1177–1184.Google Scholar 
Case 1  Case 2  
Arm 1  0.001  0.0  1.0  0.001  0.3  1.0 
Arm 2  0.010  0.0  1.0  0.010  0.0  0.7 
Case 1  Case 2  
Arm 1  0.001  0.0  1.0  0.001  0.3  1.0 
Arm 2  0.010  0.0  1.0  0.010  0.0  0.7 
[1] 
Tamar Friedlander, Naama Brenner. Adaptive response and enlargement of dynamic range. Mathematical Biosciences & Engineering, 2011, 8 (2) : 515528. doi: 10.3934/mbe.2011.8.515 
[2] 
Christopher Rackauckas, Qing Nie. Adaptive methods for stochastic differential equations via natural embeddings and rejection sampling with memory. Discrete & Continuous Dynamical Systems  B, 2017, 22 (7) : 27312761. doi: 10.3934/dcdsb.2017133 
[3] 
Dongho Kim, EunJae Park. Adaptive CrankNicolson methods with dynamic finiteelement spaces for parabolic problems. Discrete & Continuous Dynamical Systems  B, 2008, 10 (4) : 873886. doi: 10.3934/dcdsb.2008.10.873 
[4] 
Arthur Henrique Caixeta, Irena Lasiecka, Valéria Neves Domingos Cavalcanti. On long time behavior of MooreGibsonThompson equation with molecular relaxation. Evolution Equations & Control Theory, 2016, 5 (4) : 661676. doi: 10.3934/eect.2016024 
[5] 
Luciano Abadías, Carlos Lizama, Marina MurilloArcila. Hölder regularity for the MooreGibsonThompson equation with infinite delay. Communications on Pure & Applied Analysis, 2018, 17 (1) : 243265. doi: 10.3934/cpaa.2018015 
[6] 
Marta Pellicer, Joan SolàMorales. Optimal scalar products in the MooreGibsonThompson equation. Evolution Equations & Control Theory, 2019, 8 (1) : 203220. doi: 10.3934/eect.2019011 
[7] 
Alexandre J. Chorin, Fei Lu, Robert N. Miller, Matthias Morzfeld, Xuemin Tu. Sampling, feasibility, and priors in data assimilation. Discrete & Continuous Dynamical Systems  A, 2016, 36 (8) : 42274246. doi: 10.3934/dcds.2016.36.4227 
[8] 
Omri M. Sarig. Bernoulli equilibrium states for surface diffeomorphisms. Journal of Modern Dynamics, 2011, 5 (3) : 593608. doi: 10.3934/jmd.2011.5.593 
[9] 
Matthew Nicol. Induced maps of hyperbolic Bernoulli systems. Discrete & Continuous Dynamical Systems  A, 2001, 7 (1) : 147154. doi: 10.3934/dcds.2001.7.147 
[10] 
Hajnal R. Tóth. Infinite Bernoulli convolutions with different probabilities. Discrete & Continuous Dynamical Systems  A, 2008, 21 (2) : 595600. doi: 10.3934/dcds.2008.21.595 
[11] 
Peter Monk, Virginia Selgas. Sampling type methods for an inverse waveguide problem. Inverse Problems & Imaging, 2012, 6 (4) : 709747. doi: 10.3934/ipi.2012.6.709 
[12] 
Martin Hanke. Why linear sampling really seems to work. Inverse Problems & Imaging, 2008, 2 (3) : 373395. doi: 10.3934/ipi.2008.2.373 
[13] 
T. Hillen, K. Painter, Christian Schmeiser. Global existence for chemotaxis with finite sampling radius. Discrete & Continuous Dynamical Systems  B, 2007, 7 (1) : 125144. doi: 10.3934/dcdsb.2007.7.125 
[14] 
Kamil Rajdl, Petr Lansky. Fano factor estimation. Mathematical Biosciences & Engineering, 2014, 11 (1) : 105123. doi: 10.3934/mbe.2014.11.105 
[15] 
Jiying Liu, Jubo Zhu, Fengxia Yan, Zenghui Zhang. Compressive sampling and $l_1$ minimization for SAR imaging with low sampling rate. Inverse Problems & Imaging, 2013, 7 (4) : 12951305. doi: 10.3934/ipi.2013.7.1295 
[16] 
João M. Lemos, Fernando Machado, Nuno Nogueira, Luís Rato, Manuel Rijo. Adaptive and nonadaptive model predictive control of an irrigation channel. Networks & Heterogeneous Media, 2009, 4 (2) : 303324. doi: 10.3934/nhm.2009.4.303 
[17] 
Imen Bhouri, Houssem Tlili. On the multifractal formalism for Bernoulli products of invertible matrices. Discrete & Continuous Dynamical Systems  A, 2009, 24 (4) : 11291145. doi: 10.3934/dcds.2009.24.1129 
[18] 
Ben A. Vanderlei, Matthew M. Hopkins, Lisa J. Fauci. Error estimation for immersed interface solutions. Discrete & Continuous Dynamical Systems  B, 2012, 17 (4) : 11851203. doi: 10.3934/dcdsb.2012.17.1185 
[19] 
H.T. Banks, Jimena L. Davis. Quantifying uncertainty in the estimation of probability distributions. Mathematical Biosciences & Engineering, 2008, 5 (4) : 647667. doi: 10.3934/mbe.2008.5.647 
[20] 
Azmy S. Ackleh, Jeremy J. Thibodeaux. Parameter estimation in a structured erythropoiesis model. Mathematical Biosciences & Engineering, 2008, 5 (4) : 601616. doi: 10.3934/mbe.2008.5.601 
Impact Factor:
Tools
Article outline
Figures and Tables
[Back to Top]