Adaptive Reward-Free Exploration
Abstract
Reward-free exploration is a reinforcement learning setting studied by Jin et al. (2020), who address it by running several algorithms with regret guarantees in parallel. In our work, we instead give a more natural adaptive approach for reward-free exploration, which directly reduces upper bounds on the maximum MDP estimation error. We show that, interestingly, our reward-free UCRL algorithm (RF-UCRL) can be seen as a variant of an algorithm of Fiechter from 1994, originally proposed for a different objective that we call best-policy identification. We prove that RF-UCRL needs on the order of $(SAH^4/\varepsilon^2)(\log(1/\delta) + S)$ episodes to output, with probability $1-\delta$, an $\varepsilon$-approximation of the optimal policy for any reward function. This bound improves over existing sample-complexity bounds in both the small-$\varepsilon$ and the small-$\delta$ regimes. We further investigate the relative complexities of reward-free exploration and best-policy identification.
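To get a feel for how the stated bound scales, the leading-order term can be evaluated numerically. The sketch below is only an illustration: the function name `rf_ucrl_episode_order` is hypothetical, the symbols are read as S states, A actions and horizon H (an assumption, since the abstract does not define them), and all constants and logarithmic factors hidden by "on the order of" are omitted. The returned value is therefore useful for comparing regimes, not as an actual episode count.

```python
import math


def rf_ucrl_episode_order(S, A, H, epsilon, delta):
    """Leading-order term (S*A*H^4 / eps^2) * (log(1/delta) + S).

    Constants and log factors are omitted (assumption: only the scaling
    stated in the abstract is reproduced, not the exact bound).
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if not 0 < delta < 1:
        raise ValueError("delta must lie in (0, 1)")
    return (S * A * H**4 / epsilon**2) * (math.log(1.0 / delta) + S)


if __name__ == "__main__":
    base = rf_ucrl_episode_order(S=10, A=4, H=5, epsilon=0.5, delta=0.1)
    finer = rf_ucrl_episode_order(S=10, A=4, H=5, epsilon=0.25, delta=0.1)
    # Halving epsilon multiplies the leading term by four.
    print(finer / base)
```

In the small-$\delta$ regime the $\log(1/\delta)$ term dominates the $S$ term inside the second factor, while for moderate $\delta$ and large state spaces the additive $S$ dominates.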
Related Papers
- The Sample Complexity of Exploration in the Multi-Armed Bandit Problem (2004), cited 328 times
- Improved Bounds on the Sample Complexity of Learning (2001), cited 159 times
- Towards Sample Efficient Reinforcement Learning (2018), cited 143 times
- Optimal Differentially Private Learning of Thresholds and Quasi-Concave Optimization (2023), cited 3 times