Why This Matters

Many real-world decision-making systems operate in environments whose dynamics change over time, making previously learned policies suboptimal. This work provides theoretical guarantees about when and how to combine learned policies with online planning so that performance is maintained despite such changes. The approach applies to a wide range of domains, from emergency response to transportation, where a policy learned under one set of conditions may no longer be optimal once conditions evolve.

What We Did

This paper develops Policy-Augmented Monte Carlo Tree Search (PA-MCTS), a framework for decision-making in non-stationary environments where the agent's policy may become stale as conditions change. The approach combines action-value (Q) estimates learned offline in a previous environment with online MCTS planning that uses an up-to-date model of the shifted dynamics. A theoretical analysis establishes conditions under which combining the learned policy with online search guarantees that the algorithm selects optimal or near-optimal actions.
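To make the combination concrete, here is a minimal sketch of the action-selection idea: score each action by a convex combination of the stale offline Q-value and the fresh online search estimate. The weight `alpha`, the helper `pa_select_action`, and the toy numbers are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of policy-augmented action selection: blend stale offline
# Q-values with up-to-date online search estimates. The combination
# rule (a convex mix weighted by alpha) is an assumption for
# illustration, not necessarily the paper's exact rule.

def pa_select_action(actions, q_offline, q_search, alpha):
    """Pick the action maximizing alpha*Q_offline + (1-alpha)*Q_search.

    alpha=1 trusts only the (possibly stale) learned policy;
    alpha=0 trusts only the online search estimates.
    """
    def combined(a):
        return alpha * q_offline[a] + (1.0 - alpha) * q_search[a]
    return max(actions, key=combined)

# Toy example: the offline policy prefers "left", but the environment
# has shifted and online search now estimates "right" is better.
actions = ["left", "right"]
q_offline = {"left": 1.0, "right": 0.2}   # learned before the shift
q_search = {"left": 0.3, "right": 0.9}    # estimated after the shift

print(pa_select_action(actions, q_offline, q_search, alpha=0.9))  # left
print(pa_select_action(actions, q_offline, q_search, alpha=0.2))  # right
```

With a high `alpha` the stale policy dominates and the agent keeps choosing "left"; with a low `alpha` the online estimates win and it switches to "right", illustrating how the weight trades the speed of a learned policy against the adaptability of search.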

Key Results

The theoretical analysis provides conditions under which the policy-augmented approach is guaranteed to select optimal actions, along with bounds on the error incurred when the environment has shifted. Experimental validation on OpenAI Gym classic control tasks shows that the approach achieves robust performance superior to either pure offline learning or pure online planning in non-stationary environments. The method balances the speed of a learned policy with the adaptability of online search.

Full Abstract

Sequential decision-making is challenging in non-stationary environments, where the environment in which an agent operates can change over time. Policies learned before execution become stale when the environment changes, and relearning takes time and computational effort. Online search, on the other hand, can return sub-optimal actions when there are limitations on allowed runtime. In this paper, we introduce Policy-Augmented Monte Carlo tree search (PA-MCTS), which combines action-value estimates from an out-of-date policy with an online search using an up-to-date model of the environment. We prove several theoretical results about PA-MCTS. We also compare and contrast our approach with AlphaZero, another hybrid planning approach, and Deep Q Learning on several OpenAI Gym environments and show that PA-MCTS outperforms these baselines.

Cite This Paper

@inproceedings{pettet2024decision,
  author = {Pettet, Ava and Zhang, Yunuo and Luo, Baiting and Wray, Kyle and Baier, Hendrik and Laszka, Aron and Dubey, Abhishek and Mukhopadhyay, Ayan},
  booktitle = {Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems},
  title = {Decision Making in Non-Stationary Environments with Policy-Augmented Search},
  year = {2024},
  address = {Richland, SC},
  acceptance = {36},
  pages = {2417--2419},
  publisher = {International Foundation for Autonomous Agents and Multiagent Systems},
  series = {AAMAS '24},
  abstract = {Sequential decision-making is challenging in non-stationary environments, where the environment in which an agent operates can change over time. Policies learned before execution become stale when the environment changes, and relearning takes time and computational effort. Online search, on the other hand, can return sub-optimal actions when there are limitations on allowed runtime. In this paper, we introduce Policy-Augmented Monte Carlo tree search (PA-MCTS), which combines action-value estimates from an out-of-date policy with an online search using an up-to-date model of the environment. We prove several theoretical results about PA-MCTS. We also compare and contrast our approach with AlphaZero, another hybrid planning approach, and Deep Q Learning on several OpenAI Gym environments and show that PA-MCTS outperforms these baselines.},
  contribution = {lead},
  isbn = {9798400704864},
  note = {extended abstract},
  keywords = {non-stationary MDPs, policy learning, Monte Carlo tree search, sequential decision-making, online planning, offline learning, policy augmentation},
  location = {Auckland, New Zealand},
  numpages = {3}
}
Quick Info

Year: 2024
Series: AAMAS '24
Research Areas: POMDP, scalable AI