ReST-MCTS* is a novel approach to large language model (LLM) self-training that combines process reward guidance with tree search (MCTS*). The method is designed to collect higher-quality reasoning traces and per-step values for training the policy and process reward models.
Key Features
Integration of Process Reward Guidance: ReST-MCTS* combines process reward guidance with MCTS* tree search to improve both the quality of the collected reasoning traces and the accuracy of the per-step values, giving the models higher-quality data from which to learn to reason.
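To make the idea concrete, here is a minimal sketch, not the paper's exact formulation: an MCTS-style node whose value blends a process reward model (PRM) score for the partial reasoning trace with values backed up from completed rollouts, together with a UCT-style selection rule. The 0.5/0.5 blend and all helper names are illustrative assumptions.

```python
import math

class Node:
    def __init__(self, step_text, parent=None, prm_score=0.0):
        self.step_text = step_text    # reasoning step added at this node
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value_sum = 0.0          # value backed up from rollouts below this node
        self.prm_score = prm_score    # per-step score from the process reward model

    def value(self):
        # Blend the PRM's per-step score with the empirical backed-up value.
        backed_up = self.value_sum / self.visits if self.visits else 0.0
        return 0.5 * self.prm_score + 0.5 * backed_up

def select_child(node, c_explore=1.4):
    """UCT-style selection among existing children, guided by the PRM-informed value."""
    def uct(child):
        bonus = c_explore * math.sqrt(math.log(node.visits + 1) / (child.visits + 1))
        return child.value() + bonus
    return max(node.children, key=uct)

def backpropagate(node, reward):
    """Propagate a terminal reward (e.g. 1.0 for a correct final answer) back to the root."""
    while node is not None:
        node.visits += 1
        node.value_sum += reward
        node = node.parent

# Toy usage: pick between two candidate next steps.
root = Node("Question: what is 6 * 7?")
root.children = [Node("6 * 7 = 42", parent=root, prm_score=0.9),
                 Node("6 * 7 = 36", parent=root, prm_score=0.2)]
root.visits = 2
print(select_child(root).step_text)  # -> 6 * 7 = 42
```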
Tree-Search-Based Reinforcement Learning: The method sidesteps per-step manual annotation through tree-search-based reinforcement learning: it infers process rewards by estimating the probability that each step contributes to reaching the correct answer. These inferred rewards are then used to refine the process reward model and to select high-quality traces for policy model self-training.
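One natural way to realize this estimate, shown as a hedged sketch rather than the paper's exact procedure, is to roll out the policy from each step prefix and count how often it still reaches the correct answer. `rollout_from_prefix` and `is_correct` are hypothetical stand-ins for a policy model call and an answer checker.

```python
import random

def rollout_from_prefix(question, steps_so_far):
    """Placeholder for completing the trace with the policy LLM and returning its answer."""
    return random.choice(["42", "41"])  # stand-in for an actual model call

def is_correct(answer, reference):
    return answer.strip() == reference.strip()

def infer_step_rewards(question, steps, reference_answer, n_rollouts=16):
    """Monte Carlo estimate of the probability that each step prefix still leads to the
    correct answer; the estimates serve as per-step value targets."""
    rewards = []
    for k in range(1, len(steps) + 1):
        prefix = steps[:k]
        hits = sum(
            is_correct(rollout_from_prefix(question, prefix), reference_answer)
            for _ in range(n_rollouts)
        )
        rewards.append(hits / n_rollouts)
    return rewards

# Example: a three-step trace gets one inferred reward per step.
print(infer_step_rewards("What is 6 * 7?",
                         ["Rewrite 6 * 7.", "6 * 7 = 42.", "Answer: 42"], "42"))
```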
Comparison with Baselines: Within the same search budget, the tree-search policy in ReST-MCTS* achieves higher accuracy than prior LLM reasoning baselines such as Best-of-N and Tree-of-Thought.
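For context, a Best-of-N baseline under the same budget spends its N generations on independent complete solutions and keeps the highest-scoring one, rather than on reward-guided tree expansion. The sketch below is illustrative; `sample_solution` and `score_solution` are assumed stand-ins for the policy model and a reward model or verifier.

```python
import random

def best_of_n(question, sample_solution, score_solution, n_budget=16):
    """Sample n_budget complete solutions and return the one scored highest."""
    candidates = [sample_solution(question) for _ in range(n_budget)]
    return max(candidates, key=score_solution)

# Toy usage with random stand-ins.
print(best_of_n("What is 6 * 7?",
                sample_solution=lambda q: random.choice(["42", "36", "48"]),
                score_solution=lambda ans: 1.0 if ans == "42" else 0.0))
```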
Continuous Enhancement of Language Models: Using the traces found by the tree-search policy as training data, ReST-MCTS* can continuously enhance language models over multiple self-training iterations, outperforming other self-training algorithms such as ReST$^\text{EM}$ and Self-Rewarding LM.
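The overall iteration can be pictured with the schematic loop below. This is a sketch under stated assumptions: the callables `search_fn`, `finetune_policy_fn`, and `finetune_prm_fn` stand in for the reward-guided tree search and the two fine-tuning procedures, and the 0.8 threshold is likewise illustrative.

```python
def self_training_loop(policy, prm, questions, search_fn,
                       finetune_policy_fn, finetune_prm_fn,
                       n_iterations=3, value_threshold=0.8):
    for _ in range(n_iterations):
        traces, values = [], []
        for q in questions:
            trace, step_values = search_fn(policy, prm, q)  # guided tree search per question
            traces.append(trace)
            values.append(step_values)
        # Keep only traces whose inferred per-step values clear the threshold ...
        selected = [t for t, v in zip(traces, values) if v and min(v) >= value_threshold]
        policy = finetune_policy_fn(policy, selected)        # policy self-training
        # ... and use all inferred values as training targets for the PRM.
        prm = finetune_prm_fn(prm, traces, values)
    return policy, prm
```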
Dual Purpose of Inferred Rewards: The inferred rewards serve two purposes: they act as value targets for further refining the process reward model, and they guide the selection of high-quality traces for policy model self-training.
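A sketch of how a single searched trace plus its inferred per-step values can yield both kinds of training data: (partial trace, value) pairs for the PRM, and the full trace for policy fine-tuning when it clears a quality threshold. Field names and the threshold are illustrative assumptions.

```python
def build_training_examples(question, steps, step_values, final_correct, threshold=0.8):
    """Turn one trace and its inferred per-step values into PRM and policy training data."""
    prm_examples = [
        {"question": question, "steps": steps[:k + 1], "value_target": v}
        for k, v in enumerate(step_values)
    ]
    policy_examples = []
    if final_correct and step_values and min(step_values) >= threshold:
        policy_examples.append({"prompt": question, "completion": "\n".join(steps)})
    return prm_examples, policy_examples

# Toy usage.
prm_ex, pol_ex = build_training_examples(
    "What is 6 * 7?", ["Rewrite 6 * 7.", "6 * 7 = 42.", "Answer: 42"], [0.9, 0.95, 1.0], True)
print(len(prm_ex), len(pol_ex))  # -> 3 1
```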
Performance
References