Regret analysis of non-stationary multi-armed bandits
Author: Irakli Koberidze
Co-authors: David Soselia, Levan Shugliashvili, Shota Amashukeli, Sandro Jijavadze
Keywords: Exploration-exploitation dilemma, reinforcement learning, contextual bandit
Annotation:
The law of large numbers guarantees that the reward estimates of the most widely used exploration-exploitation algorithms, such as Limit Epsilon Greedy, Upper Confidence Bound (UCB), and Posterior Sampling, converge to the true expected rewards in stationary environments. However, these algorithms are less viable in non-stationary environments, such as real-world applications where people's preferences and the environment change from day to day. In this work, we consider a bandit problem in which the reward distributions of the arms change at unknown time instants, and we propose a method that accommodates these changes.
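The abstract does not describe the proposed method itself, so the following Python snippet is only an illustrative sketch of the problem setting: it simulates a piecewise-stationary Bernoulli bandit whose arm means are redrawn at change points unknown to the learner, and runs a standard sliding-window UCB baseline against it. All names and parameters here (n_arms, window, change_points, and so on) are hypothetical choices for the example, not details from the paper.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical piecewise-stationary Bernoulli bandit: arm means are
# redrawn at fixed change points known only to the simulator.
n_arms, horizon, window = 3, 6000, 500
change_points = [2000, 4000]                  # unknown to the learner
means = rng.uniform(size=n_arms)              # initial expected rewards

recent = []                                   # (arm, reward) pairs in the window
rewards_log = []

for t in range(1, horizon + 1):
    if t in change_points:                    # abrupt change: redraw all means
        means = rng.uniform(size=n_arms)

    # Sliding-window UCB: statistics use only the last `window` pulls,
    # so the estimates can track the drifting means.
    counts = np.zeros(n_arms)
    sums = np.zeros(n_arms)
    for arm, r in recent:
        counts[arm] += 1
        sums[arm] += r

    if np.any(counts == 0):                   # pull each arm at least once
        arm = int(np.argmin(counts))
    else:
        ucb = sums / counts + np.sqrt(2 * np.log(min(t, window)) / counts)
        arm = int(np.argmax(ucb))

    reward = float(rng.random() < means[arm]) # Bernoulli reward draw
    recent.append((arm, reward))
    if len(recent) > window:                  # discard observations outside the window
        recent.pop(0)

    rewards_log.append(reward)

print(f"average reward over horizon: {np.mean(rewards_log):.3f}")

A sliding window is just one standard remedy for abrupt changes; discounted statistics and explicit change-point detection are common alternatives. Which approach the paper actually develops is not stated in this abstract.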