On Thompson Sampling and Asymptotic Optimality
2017, pp. 4889–4893
Abstract
We discuss some recent results on Thompson sampling for nonparametric reinforcement learning in countable classes of general stochastic environments. These environments can be non-Markovian, non-ergodic, and partially observable. We show that Thompson sampling learns the environment class in the sense that (1) asymptotically its value converges in mean to the optimal value, and (2) given a recoverability assumption, regret is sublinear. We conclude with a discussion of optimality in reinforcement learning.
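The paper analyzes Thompson sampling over general stochastic environments, which is beyond what a short snippet can capture. As a rough illustration of the core posterior-sampling idea only, here is a minimal sketch on a Bernoulli bandit (arm means, step count, and Beta prior are illustrative assumptions, not from the paper): sample a hypothesis from the posterior, act optimally for that hypothesis, then update the posterior on the observed reward.

```python
import random

def thompson_sampling(true_means, steps, seed=0):
    """Toy Thompson sampling on a Bernoulli bandit (illustrative only)."""
    rng = random.Random(seed)
    k = len(true_means)
    # Beta(1, 1) prior over each arm's success probability,
    # tracked via success/failure counts.
    successes = [1] * k
    failures = [1] * k
    pulls = [0] * k
    for _ in range(steps):
        # Sample a mean for each arm from its posterior...
        samples = [rng.betavariate(successes[i], failures[i]) for i in range(k)]
        # ...and act greedily with respect to the sampled hypothesis.
        arm = max(range(k), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

pulls = thompson_sampling([0.2, 0.5, 0.8], steps=2000)
```

Over time the posterior concentrates and the best arm is pulled most often, which mirrors (in this much simpler setting) the asymptotic convergence to the optimal value discussed in the abstract.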
Related Papers
- Near-optimal Regret Bounds for Reinforcement Learning (2010)
- Regret Bounds for Learning State Representations in Reinforcement Learning (2019)
- Beyond Value-Function Gaps: Improved Instance-Dependent Regret Bounds for Episodic Reinforcement Learning (2021)
- Distributed Thompson Sampling (2020)