Statistica Sinica 31 (2021), 2275-2287
Dong Woo Kim¹, Tze Leung Lai² and Huanzhong Xu²
Abstract: "Multi-armed bandits" were introduced as a new direction in the then-nascent field of sequential analysis, developed during World War II in response to the need for more efficient testing of anti-aircraft gunnery, and later as a concrete application of dynamic programming and optimal control of Markov decision processes. A comprehensive theory that unified both directions emerged in the 1980s, providing important insights and algorithms for diverse applications in many science, technology, engineering and mathematics fields. The turn of the millennium marked the onset of a "personalization revolution," from personalized medicine to online personalized advertising and recommender systems (e.g., Netflix's recommendations for movies and TV shows, Amazon's recommendations for products to purchase, and Microsoft's Matchbox recommender). This has required an extension of classical bandit theory to nonparametric contextual bandits, where "contextual" refers to the incorporation of personal information as covariates. Such theory is developed herein, together with illustrative applications, statistical models, and computational tools for its implementation.
Key words and phrases: Contextual multi-armed bandits, ϵ-greedy randomization, personalized medicine, recommender system, reinforcement learning.