📃 [Paper] • 💻 [Github] • 🤗 [HuggingFace] • 📌 [Blog] • 📽 [Slides] • 📋 [Poster]
We’ve made some exciting advances in the world of large language models (LLMs), such as ChatGPT, GPT-4, Bard, and PaLM 2. These language understanding powerhouses have become rock stars in various tasks, amping up deployment and adoption in many applications. The emerging abilities of LLMs, particularly in complex reasoning, have not only dominated the spotlight of recent research but also ignited huge public curiosity.
One area of LLMs we’ve struck gold is zero-shot reasoning, which has shown tremendous promise in specific task domains. But the million-dollar question is, can LLMs flex their reasoning muscles on general tasks as effectively? Well, we’re about to push that envelope further.
In our recent work, we’ve decided to boost the zero-shot reasoning abilities of LLMs in general language understanding tasks. Picture this — we construct an autonomous agent that acts as a tutor for our LLMs, guiding them step-by-step through the task. Our agent churns out task-specific instructions that piece together the reasoning process of LLMs for each task. We call this strategy zero-shot agent instructed reasoning (or AgentInstruct).
Here’s a sneak peek into our playbook. We’ve drawn inspiration from a couple of key works. First, we’ve adopted language agents, who are like little elves designed to autonomously get the job done. Only, our agent makes a twist on this, providing not the solution to the task but instructing how to accomplish it — kind of like the mastermind behind the operation. Our agent does this by gathering task-relevant knowledge from the web, and churning out high-quality, step-by-step instructions to complete tasks.
Next, we’ve leaned on the concept of chain of thought (CoT), specifically zero-shot CoT reasoning, in LLMs. This is where a task is broken down into tidbits and solved step-by-step, kind of like solving a jigsaw puzzle. We sync up this process further by feeding task-specific instructions from our agent to guide the reasoning steps. The aim here is to broaden the horizon of zero-shot reasoning abilities of LLMs to include more tasks with this combined approach.
We’ve subjected our strategy to a test across a variety of language understanding tasks. And guess what? The results have been pretty great! We achieved state-of-the-art performance on 20 out of 29 datasets we evaluated using three stellar LLMs — Vicuna, Llama-2-chat, and GPT-3.5 Turbo. More importantly, our strategy resulted in a significant performance boost overall, including Vicuna-13b (13.3%), Llama-2-70b-chat (23.2%), and GPT-3.5 Turbo (17.0%), particularly in reasoning tasks.
We’re diving into zero-shot AgentInstruct, a method that employs an agent to instruct LLMs to reason toward the final prediction. More importantly, for each dataset, the instructions are generated only once. Then, these same instructions are used across all instances and models when doing inference on that dataset. We only run our agent 29 times overall, once for each of the 29 datasets, NOT on every test instance of each dataset. AgentInstruct is zero-shot as there is no training involved.
You know how we humans often rely on clear instructions to guide our thought processes when tackling a problem? Well, to make sense of, say, a movie review, instructions like “1. Understand the Dataset: … Movie Reviews dataset … 2. Analyze the Passage: Pay attention to … the tone of the review …” are exceptionally helpful. They allow us to break down the problem into manageable chunks of task-specific reasoning steps, which we can solve one by one to arrive at the final answer.
Inspired by this, we came up with a way to produce task-specific instructions that can guide the reasoning process of our LLMs. But we didn’t just want to handwrite these instructions. Instead, we built an agent to take over the task! AgentInstruct consists of two components:
Instruction generation: Built on ReAct’s implementation in LangChain, the agent is designed to be a high-quality synthesizer of instructions. Our agent uses GPT-4 to propose a series of thoughts, tool-based actions, and observations. The output of these actions? Our task-specific instructions! The intuition is that an agent is able to synthesize high-quality instructions with access to a wide range of existing task knowledge on the web.
Chain of thought reasoning: This instructs our LLMs to break down a task into simpler steps leading to the final answer, using our task-specific agent instructions as a guide.
Overall, zero-shot AgentInstruct wins on 9 of the 13 generation datasets, 11 of the 16 classification datasets, and 10 of the 12 reasoning datasets. These datasets consist of datasets from the HELM benchmark and CoT reasoning datasets. We compare zero-shot AgentInstruct to standard zero-shot prompting. On each model, zero-shot AgentInstruct wins on the majority of datasets, with no less than a 13.0% increase on average. Average performance versus the zero-shot setting is best on Llama-2-70b-chat with a 23.2% improvement.
Significantly, Llama-2-70b-chat beats the performance of zero-shot GPT-3.5 Turbo by 10.2% on average across all datasets. Additionally, the 13b version of Llama-2-chat outperforms zero-shot GPT-3.5 Turbo by over 2%.
The most immediate comparison to zero-shot AgentInstruct is zero-shot CoT, as zero-shot AgentInstruct uses task-specific instructions instead of a fixed manual instruction. On average, across all three models, zero-shot AgentInstruct beats zero-shot CoT by 6.5%, with the largest growth being Vicuna-13b at 9.5%.
In particular, we look into the performance on reasoning tasks. Of our three models, the average difference between zero-shot AgentInstruct and the zero-shot setting on reasoning tasks is 31.3%, whereas the difference between zero-shot AgentInstruct and zero-shot CoT is 10.5%. As an example, we show the CoT reasoning steps from an instance of IMDB when predictions are correct.
We also focus our analysis on GPT-4. We test the following methods: zero-shot, zero-shot CoT, ReAct, and zero-shot AgentInstruct. Zero-shot AgentInstruct outperforms zero-shot GPT-4 by 8.6% and ties the performance of zero-shot CoT GPT-4 for approximately one-tenth of the cost.
We also compare Llama-2-70b-chat zero-shot AgentInstruct with few-shot results. Surprisingly, zero-shot AgentInstruct reaches competitiveness with few-shot prompting.
AgentInstruct is built on agents and language models. Similar to the base methods or models, it has its own set of limitations and can pose potential harm when misused. A notable concern is LLMs’ tendency to generate non-factual responses with a high degree of confidence. It is crucial to investigate this limitation further.
When misused, LLMs’ hallucinated responses can contribute to the proliferation of misinformation, spam, and other undesirable content. Furthermore, the mechanism exhibits several other shortcomings shared by language models, including:
AgentInstruct has made progress towards addressing these limitations through the creation of task-specific agent instructions to guide the LLMs’ reasoning across tasks. However, it remains vital for future research to thoroughly examine and address these issues to enhance the overall performance and safety of techniques like AgentInstruct built upon language models.
So far, we have released the following:
We recognize the heightened risk surrounding this release, as any release that increases the performance of AI systems carries some inherent danger, especially given that our approach requires no additional training to implement. However, we believe that the benefit of our research to the academic community outweighs the potential costs.
Furthermore, we understand the potential drawbacks of releasing an enhanced method for LLMs, namely concerns regarding producing disinformation and harmful content. We are exploring further methods to mitigate these risks.
The usage of AgentInstruct is subject to the Licenses of corresponding LLMs. Please contact us if you find any potential violations.
AgentInstruct was developed as a joint effort between WashU NLP Group at WashU and UC Berkeley.
Students: Nick Crispino, Kyle Montgomery, Fankun Zeng
Advisors: Dawn Song, Chenguang Wang
Our work on this project greatly depends on previous work surrounding agents and LLMs. Specifically, we would like to thank Ce Zhang and Together AI. Furthermore, this project would not have been possible had Meta/Vicuna not provided us with research access to the model weights. Finally, we extend our gratitude to OpenAI for API access to GPT-3.5 and GPT-4; it has been integral in the successful completion of this work.
@article{crispino2023agent,
title={Agent Instructs Large Language Models to be General Zero-Shot Reasoners},
author={Crispino, Nicholas and Montgomery, Kyle and Zeng, Fankun and Song, Dawn and Wang, Chenguang},
journal={arXiv preprint arXiv:2310.03710},
year={2023}
}