Combining online and offline knowledge in UCT
Abstract
The UCT algorithm learns a value function online using sample-based search. The TD(λ) algorithm can learn a value function offline for the on-policy distribution. We consider three approaches for combining offline and online value functions in the UCT algorithm. First, the offline value function is used as a default policy during Monte-Carlo simulation. Second, the UCT value function is combined with a rapid online estimate of action values. Third, the offline value function is used as prior knowledge in the UCT search tree. We evaluate these algorithms in 9 × 9 Go against GnuGo 3.7.10. The first algorithm performs better than UCT with a random simulation policy, but surprisingly, worse than UCT with a weaker, handcrafted simulation policy. The second algorithm outperforms UCT altogether. The third algorithm outperforms UCT with handcrafted prior knowledge. We combine these algorithms in MoGo, the world's strongest 9 × 9 Go program. Each technique significantly improves MoGo's playing strength.
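To make the three combinations concrete, here is a minimal Python sketch of a UCT tree node. It illustrates (1) taking simulation outcomes from playouts driven by an offline-learned policy, (2) blending a rapid all-moves-as-first (RAVE) estimate with the slower UCT estimate, and (3) initialising action statistics from an offline value function as a prior. The class and parameter names (`Node`, `offline_value`, `C_UCB`, `K_RAVE`, `N_PRIOR`) and the exact mixing schedule and constants are illustrative assumptions, not the paper's implementation.

```python
import math

# Illustrative constants; the paper tunes such parameters empirically.
C_UCB = 1.0      # UCB exploration constant (assumed value)
K_RAVE = 1000.0  # RAVE equivalence parameter (assumed value)
N_PRIOR = 50     # equivalent experience granted to the offline prior (assumed)

class Node:
    """One tree state holding UCT statistics, rapid (RAVE) statistics, and priors."""

    def __init__(self, actions, offline_value):
        self.n = 0  # total visits to this state
        self.stats = {}
        for a in actions:
            # Third combination: initialise each action as if the offline
            # value function had already been sampled N_PRIOR times.
            self.stats[a] = {
                "n": N_PRIOR, "q": offline_value(a),  # UCT estimate
                "n_rave": 0, "q_rave": 0.0,           # rapid (AMAF-style) estimate
            }

    def select(self):
        """Pick the action maximising the blended value plus a UCB bonus."""
        def score(a):
            s = self.stats[a]
            # Second combination: mix the fast-but-biased RAVE estimate with
            # the unbiased UCT estimate; beta decays as real visits accumulate.
            beta = math.sqrt(K_RAVE / (3 * self.n + K_RAVE))
            q = beta * s["q_rave"] + (1 - beta) * s["q"]
            bonus = C_UCB * math.sqrt(math.log(self.n + 1) / s["n"])
            return q + bonus
        return max(self.stats, key=score)

    def update(self, action, outcome, simulated_actions):
        """Back up one simulation outcome into both sets of statistics."""
        self.n += 1
        s = self.stats[action]
        s["n"] += 1
        s["q"] += (outcome - s["q"]) / s["n"]
        # RAVE update: credit every action played anywhere in the simulation.
        for a in simulated_actions:
            if a in self.stats:
                r = self.stats[a]
                r["n_rave"] += 1
                r["q_rave"] += (outcome - r["q_rave"]) / r["n_rave"]

# Toy usage: three moves, with the offline value function favouring move "b".
node = Node(["a", "b", "c"],
            offline_value=lambda a: {"a": 0.4, "b": 0.6, "c": 0.5}[a])
for _ in range(200):
    a = node.select()
    # First combination: in a full implementation this outcome would come
    # from a playout using the offline policy as the default policy.
    outcome = 1.0 if a == "b" else 0.0
    node.update(a, outcome, simulated_actions=[a])
print(node.select())  # converges towards "b"
```

In this sketch the prior counts act as equivalent experience: with `N_PRIOR` pseudo-visits, early UCB bonuses are damped and search effort is steered by the offline estimates until genuine simulation statistics dominate.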