多臂赌博机

在概率论和机器学习中，多臂赌博机问题（英语：multi-armed bandit problem）^[1]有时称为K-或N-臂赌博机问题（英语：K-or N-armed bandit problem）^[2]，是一个必须在竞争（替代）之间分配一组固定的有限资源的问题。当每个选择的属性在分配时仅部分已知时，以最大化其预期收益的方式进行选择，并且随着时间的推移或通过向该选择分配资源可能会更好地被理解。这是一个经典的强化学习问题，体现了探索-利用权衡困境^[3]^[4]。这个名字来源于想象一个赌徒坐在一排赌博机（或称角子机、老虎机）前（有时被称为“单臂赌博机”），他必须决定玩哪台机器，每台机器玩多少次以及玩的顺序^[5]，并且是否继续使用当前机器或尝试不同的机器。多臂赌博机问题也属于随机调度的广义范畴。

在该问题中，每台机器根据该机器特定的概率分布提供随机奖励，该奖励是先验未知的。赌徒的目标是最大化通过一系列杠杆拉动所获得的奖励总和^[4]。赌徒在每次试验中面临的关键权衡是在“利用”具有最高预期收益的机器和“探索”以获得有关其他机器的预期收益的更多信息之间^[3]。机器学习也面临着探索和利用之间的权衡。在实践中，多臂赌博机已用于对诸如管理大型组织（如科学基金会或制药公司）中的研究项目等问题进行建模^[3]^[4]。在问题的早期版本中，赌徒一开始对机器一无所知。

赫伯特·罗宾斯于1952年认识到该问题的重要性，在“实验序贯设计的某些方面”中构建了收敛种群选择策略^[6]。约翰·C·吉廷斯首次发表的吉廷斯指数定理给出了最大化预期折扣奖励的最优策略^[7]。

参考资料

^ Auer, P.; Cesa-Bianchi, N.; Fischer, P. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning. 2002, 47 (2/3): 235–256. doi:10.1023/A:1013689704352 .
^ Katehakis, M. N.; Veinott, A. F. The Multi-Armed Bandit Problem: Decomposition and Computation. Mathematics of Operations Research. 1987, 12 (2): 262–268. S2CID 656323. doi:10.1287/moor.12.2.262.
^ ^3.0 ^3.1 ^3.2 引用错误：没有为名为Gittins89的参考文献提供内容
^ ^4.0 ^4.1 ^4.2 引用错误：没有为名为BF的参考文献提供内容
^ Weber, Richard, On the Gittins index for multiarmed bandits, Annals of Applied Probability, 1992, 2 (4): 1024–1033, JSTOR 2959678, doi:10.1214/aoap/1177005588 
^ Robbins, H. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society. 1952, 58 (5): 527–535. doi:10.1090/S0002-9904-1952-09620-8 .
^ J. C. Gittins. Bandit Processes and Dynamic Allocation Indices. Journal of the Royal Statistical Society. Series B (Methodological). 1979, 41 (2): 148–177. JSTOR 2985029. S2CID 17724147. doi:10.1111/j.2517-6161.1979.tb01068.x.

延伸阅读

Scholia上有关多臂赌博机的信息

Guha, S.; Munagala, K.; Shi, P., Approximation algorithms for restless bandit problems, Journal of the ACM, 2010, 58: 1–50, S2CID 1654066, arXiv:0711.3861 , doi:10.1145/1870103.1870106
Dayanik, S.; Powell, W.; Yamazaki, K., Index policies for discounted bandit problems with availability constraints, Advances in Applied Probability, 2008, 40 (2): 377–400, doi:10.1239/aap/1214950209  .
Powell, Warren B., Chapter 10, Approximate Dynamic Programming: Solving the Curses of Dimensionality, New York: John Wiley and Sons, 2007, ISBN 978-0-470-17155-4 .
Robbins, H., Some aspects of the sequential design of experiments, Bulletin of the American Mathematical Society, 1952, 58 (5): 527–535, doi:10.1090/S0002-9904-1952-09620-8  .
Sutton, Richard; Barto, Andrew, Reinforcement Learning, MIT Press, 1998, ISBN 978-0-262-19398-6, （原始内容存档于2013-12-11） .

Allesiardo, Robin, A Neural Networks Committee for the Contextual Bandit Problem, Neural Information Processing – 21st International Conference, ICONIP 2014, Malaisia, November 03-06,2014, Proceedings, Lecture Notes in Computer Science 8834, Springer: 374–381, 2014, ISBN 978-3-319-12636-4, S2CID 14155718, arXiv:1409.8191 , doi:10.1007/978-3-319-12637-1_47 .

Weber, Richard, On the Gittins index for multiarmed bandits, Annals of Applied Probability, 1992, 2 (4): 1024–1033, JSTOR 2959678, doi:10.1214/aoap/1177005588  .
Katehakis, M.; C. Derman, Computing optimal sequential allocation rules in clinical trials, Adaptive statistical procedures and related topics, Institute of Mathematical Statistics Lecture Notes - Monograph Series 8: 29–39, 1986, ISBN 978-0-940600-09-6, JSTOR 4355518, doi:10.1214/lnms/1215540286 .
Katehakis, M.; A. F. Veinott, Jr., The multi-armed bandit problem: decomposition and computation, Mathematics of Operations Research, 1987, 12 (2): 262–268, JSTOR 3689689, S2CID 656323, doi:10.1287/moor.12.2.262.

外部链接

MABWiser, open source Python implementation of bandit strategies that supports context-free, parametric and non-parametric contextual policies with built-in parallelization and simulation capability.
PyMaBandits （页面存档备份，存于互联网档案馆）, open source implementation of bandit strategies in Python and Matlab.
Contextual （页面存档备份，存于互联网档案馆）, open source R package facilitating the simulation and evaluation of both context-free and contextual Multi-Armed Bandit policies.
bandit.sourceforge.net Bandit project （页面存档备份，存于互联网档案馆）, open source implementation of bandit strategies.
Banditlib （页面存档备份，存于互联网档案馆）, Open-Source implementation of bandit strategies in C++.
Leslie Pack Kaelbling and Michael L. Littman (1996). Exploitation versus Exploration: The Single-State Case.
Tutorial: Introduction to Bandits: Algorithms and Theory. Part1 （页面存档备份，存于互联网档案馆）. Part2 （页面存档备份，存于互联网档案馆）.
Feynman's restaurant problem, a classic example (with known answer) of the exploitation vs. exploration tradeoff.
Bandit algorithms vs. A-B testing （页面存档备份，存于互联网档案馆）.
S. Bubeck and N. Cesa-Bianchi A Survey on Bandits （页面存档备份，存于互联网档案馆）.
A Survey on Contextual Multi-armed Bandits （页面存档备份，存于互联网档案馆）, a survey/tutorial for Contextual Bandits.
Blog post on multi-armed bandit strategies, with Python code （页面存档备份，存于互联网档案馆）.
Animated, interactive plots （页面存档备份，存于互联网档案馆） illustrating Epsilon-greedy, Thompson sampling, and Upper Confidence Bound exploration/exploitation balancing strategies.

[doi10.1023/A:1013689704352-1] Auer, P.; Cesa-Bianchi, N.; Fischer, P. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning. 2002, 47 (2/3): 235–256. doi:10.1023/A:1013689704352 .

[2] Katehakis, M. N.; Veinott, A. F. The Multi-Armed Bandit Problem: Decomposition and Computation. Mathematics of Operations Research. 1987, 12 (2): 262–268. S2CID 656323. doi:10.1287/moor.12.2.262.

[Gittins89-3] 3.0 ^3.1 ^3.2 引用错误：没有为名为Gittins89的参考文献提供内容

[BF-4] 4.0 ^4.1 ^4.2 引用错误：没有为名为BF的参考文献提供内容

[weber-5] Weber, Richard, On the Gittins index for multiarmed bandits, Annals of Applied Probability, 1992, 2 (4): 1024–1033, JSTOR 2959678, doi:10.1214/aoap/1177005588 

[6] Robbins, H. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society. 1952, 58 (5): 527–535. doi:10.1090/S0002-9904-1952-09620-8 .

[Gittins1979-7] J. C. Gittins. Bandit Processes and Dynamic Allocation Indices. Journal of the Royal Statistical Society. Series B (Methodological). 1979, 41 (2): 148–177. JSTOR 2985029. S2CID 17724147. doi:10.1111/j.2517-6161.1979.tb01068.x.

[1]

[2]

[3]

[4]

[5]

[6]

[7]