Dong Woo Kim, Tze Leung Lai and Huanzhong Xu (2021). MULTI-ARMED BANDITS WITH COVARIATES: THEORY AND APPLICATIONS. Vol. 31, No. 5, 2275-2287.

Abstract: "Multi-armed bandits" were introduced as a new direction in the then-nascent field of sequential analysis, developed during World War II in response to the need for more efficient testing of anti-aircraft gunnery, and later as a concrete application of dynamic programming and optimal control of Markov decision processes. A comprehensive theory that unified both directions emerged in the 1980s, providing important insights and algorithms for diverse applications in many science, technology, engineering and mathematics fields. The turn of the millennium marked the onset of a "personalization revolution," from personalized medicine to online personalized advertising and recommender systems (e.g. Netflix's recommendations for movies and TV shows, Amazon's recommendations for products to purchase, and Microsoft's Matchbox recommender). This has required an extension of classical bandit theory to nonparametric contextual bandits, where "contextual" refers to the incorporation of personal information as covariates. Such theory is developed herein, together with illustrative applications, statistical models, and computational tools for its implementation.

Key words and phrases: Contextual multi-armed bandits, ϵ-greedy randomization, personalized medicine, recommender system, reinforcement learning.