
Automating Thought of Search: A Journey Towards Soundness and Completeness

Steve Omohundro

Aug 24, 2024, 12:49:06 PM
to guaranteed-safe-ai

https://arxiv.org/abs/2408.11326

Planning remains one of the last standing bastions for large language models (LLMs), which now turn their attention to search. Most of the literature uses language models as world models to define the search space, forgoing soundness for the sake of flexibility. A recent work, Thought of Search (ToS), proposed defining the search space with code, having the language models produce that code. ToS requires a human in the loop, collaboratively producing a sound successor function and goal test. The result, however, is worth the effort: all the tested datasets were solved with 100% accuracy. At the same time, LLMs have demonstrated significant progress in code generation and refinement for complex reasoning tasks. In this work, we automate ToS (AutoToS), completely taking the human out of the loop of solving planning problems. AutoToS guides the language model step by step towards the generation of sound and complete search components, through feedback from both generic and domain-specific unit tests. We achieve 100% accuracy, with minimal feedback iterations, using LLMs of various sizes on all evaluated domains.

Large language models have shown great promise across countless domains and fields, especially as their architectures become more advanced. Spurred by their abilities on natural language tasks, several recent works have studied AI planning with Large Language Models (LLMs) as a problem of code generation and refinement. The approaches range from giving a planning problem to an LLM and asking it to output an entire plan in a single call (Silver et al. 2022; Kambhampati et al. 2024a; Pallagani et al. 2022) to asking an LLM to generate a planning model that is then handed to an automated planner (Guan et al. 2023; Oswald et al. 2024; Gestrin, Kuhlmann, and Seipp 2024). Between these two extremes lies a body of work on using language models to plan by performing a combinatorial search (Hao et al. 2023a; Yao et al. 2023a; Besta et al. 2024; Sel et al. 2023). Among these, Thought of Search (ToS) (Katz et al. 2024) stands out: it uses the language models to define the search space for an entire domain at once, simply by soliciting two crucial search components, a successor function and a goal test. These components are then plugged into a standard search algorithm, such as Breadth-First Search (BFS) or Depth-First Search (DFS) (Cormen, Leiserson, and Rivest 1990).
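
To make the setup concrete, here is a minimal sketch of the kind of search components ToS solicits, using the 24 Game (one of the benchmark domains) as an example. The function names and the BFS wrapper are my own illustration under those assumptions, not code from the paper:

    from itertools import combinations
    from collections import deque

    # A state in the 24 Game is a list of numbers; each successor replaces
    # two of them with the result of one arithmetic operation.
    def get_successors(state):
        successors = []
        for i, j in combinations(range(len(state)), 2):
            a, b = state[i], state[j]
            rest = [state[k] for k in range(len(state)) if k not in (i, j)]
            results = {a + b, a - b, b - a, a * b}
            if b != 0:
                results.add(a / b)
            if a != 0:
                results.add(b / a)
            successors.extend(rest + [r] for r in results)
        return successors

    # Goal test: a single remaining number (approximately) equal to 24.
    def is_goal(state):
        return len(state) == 1 and abs(state[0] - 24) < 1e-6

    # The two components plug into an off-the-shelf search algorithm, e.g. BFS.
    def bfs(initial_state):
        queue = deque([initial_state])
        while queue:
            state = queue.popleft()
            if is_goal(state):
                return state
            queue.extend(get_successors(state))
        return None

    print(bfs([3, 3, 8, 8]) is not None)  # True: 8 / (3 - 8/3) = 24

Note that the search itself is entirely symbolic; the language model's only job is to write get_successors and is_goal, which is what makes soundness and completeness checkable.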

ToS achieves an impressive 100% accuracy on all tested benchmarks, and it produces a symbolic model whose soundness and completeness can be verified. However, ToS has a limitation: it requires a human expert in the loop, providing feedback to the model on the produced code. Our contribution is precisely there. We automate the iterative feedback and exception-handling process through the use of unit tests and printed debugging statements, combined with few-shot and Chain of Thought (CoT) prompting (Brown et al. 2020; Wei et al. 2022; Kojima et al. 2022), limiting the human expert's involvement with the language model. We test the search components for soundness and completeness and provide feedback to the model when a violation is detected. We use a mixture of domain-independent and domain-specific tests, based on a small number of held-out instances.
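
To illustrate what automating that feedback could look like, here is a rough sketch of such a loop. query_llm is a hypothetical placeholder for a language-model call, and the specific checks are illustrative stand-ins for the paper's domain-independent and domain-specific unit tests, not the actual AutoToS tests:

    def query_llm(prompt):
        # Hypothetical placeholder: wire this up to an actual LLM API.
        raise NotImplementedError

    def check_successor_function(get_successors, test_states):
        # Generic (domain-independent) checks, e.g. that the candidate
        # successor function returns a list and does not mutate its input.
        for state in test_states:
            before = repr(state)
            succs = get_successors(state)
            if repr(state) != before:
                return f"get_successors modified its input state {before}"
            if not isinstance(succs, list):
                return "get_successors must return a list of states"
        return None  # no violation detected

    def check_goal_test(is_goal, goal_states, non_goal_states):
        # Domain-specific checks on a few held-out labeled states: rejecting
        # a true goal breaks completeness, accepting a non-goal breaks
        # soundness.
        for state in goal_states:
            if not is_goal(state):
                return f"is_goal rejected the goal state {state!r}"
        for state in non_goal_states:
            if is_goal(state):
                return f"is_goal accepted the non-goal state {state!r}"
        return None

    def auto_tos(initial_prompt, test_states, goal_states, non_goal_states,
                 max_iterations=10):
        prompt = initial_prompt
        for _ in range(max_iterations):
            code = query_llm(prompt)   # hypothetical LLM call
            env = {}
            exec(code, env)            # run model-written code (sandbox this!)
            feedback = (check_successor_function(env["get_successors"],
                                                 test_states)
                        or check_goal_test(env["is_goal"], goal_states,
                                           non_goal_states))
            if feedback is None:
                return env["get_successors"], env["is_goal"]
            prompt = f"Your code failed a test: {feedback}. Please fix it."
        raise RuntimeError("no sound and complete components within budget")

Collapsing every violation into a single retry prompt is a simplification; the point is only that the human critique step in ToS is replaced by mechanical checks whose failure messages become the next prompt.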

We demonstrate our proposed approach on five representative search problems from the recent literature, testing with a variety of large language models of different sizes. Through automated feedback, the accuracy of the code generated by the language models consistently increases, reaching 100% across all tested domains. We show that the total number of calls to the language model is typically small, comparable to ToS with human feedback. In an ablation study, we confirm the importance of the soundness and completeness feedback for obtaining highly accurate final code.

Best,
Steve
