In a bid to enhance the reasoning capabilities of large language models (LLMs), researchers from Google DeepMind and the University of Southern California have proposed a new ‘self-discover’ prompting framework.
Published on arXiv and Hugging Face this morning, the approach goes beyond existing prompting techniques and has been found to improve the performance of leading models, including OpenAI’s GPT-4 and Google’s PaLM 2.
“Self-discover substantially improves GPT-4 and PaLM 2’s performance on challenging reasoning benchmarks such as BigBench-Hard, grounded agent reasoning and MATH by as much as 32% compared to Chain of Thought (CoT),” the researchers write in the paper.
The framework revolves around LLMs self-discovering task-intrinsic reasoning structures to solve a problem. The models look at multiple atomic reasoning modules, such as critical thinking and step-by-step thinking, and compose them into an explicit reasoning structure for LLMs to follow during decoding.
Notably, the approach also requires 10 to 40 times less inference compute, a significant benefit for enterprises.
Self-discovering unique structures
LLMs have evolved to handle numerous tasks, thanks to their ability to follow instructions, reason and generate coherent responses. To make this happen, the models, powered by the transformer architecture, use various prompting techniques inspired by cognitive theories of how humans reason and solve problems. These include few-shot and zero-shot chain-of-thought prompting, inspired by how we solve a problem step by step; decomposition prompting, inspired by how we break a problem into multiple subproblems; and step-back prompting, inspired by how we reflect on the nature of a task to establish general principles.
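To make the distinction between these techniques concrete, here is a minimal sketch of what each prompting style looks like as a template. The exact wording of each template is a hypothetical paraphrase for illustration, not the precise prompts used in the cited papers.

```python
# Illustrative templates for the prompting styles mentioned above.
# Each takes a question and returns the text that would be sent to an LLM.

def zero_shot_cot(question: str) -> str:
    # Chain-of-thought: nudge the model to reason step by step.
    return f"{question}\nLet's think step by step."

def decomposition(question: str) -> str:
    # Decomposition prompting: split the problem into sub-problems first.
    return ("Break the following problem into smaller sub-problems, "
            f"solve each one, then combine the answers.\n{question}")

def step_back(question: str) -> str:
    # Step-back prompting: derive a general principle before answering.
    return ("First, state the general principle behind this problem. "
            f"Then apply it to answer.\n{question}")

print(zero_shot_cot("What is 17 * 24?"))
```

Each template encodes one fixed assumption about how to reason; the self-discover framework's premise is that no single one of these fits every task.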
While all these methods, most notably chain-of-thought, work well, they each make an implicit prior assumption about how to tackle a given task. This, the researchers argue, may not be optimal, as each task has a unique intrinsic structure and one particular technique may suit it better than the others.
With the latest research, the DeepMind and USC researchers have proposed a general prompting framework that self-discovers this unique underlying structure to pick the right reasoning technique for the task, while also remaining efficient.
“Self-discover is inspired by how humans internally devise a reasoning program for problem-solving. From a set of atomic reasoning modules described in natural language such as ‘break down into sub-tasks’ and ‘critical thinking’, an LLM, and task examples without labels, it composes a coherent reasoning structure intrinsic to the task (Stage 1) and then solves instances of the task using the discovered structure (Stage 2). Stage 1 operates at the task level and uses three actions to guide the LLM to generate a reasoning structure for the task. At Stage 2, during the final decoding, the LLM simply follows the self-discovered structure to arrive at the final answer,” the researchers explain.
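The two stages described above can be sketched as a thin wrapper around a generic text-completion function. This is a simplified sketch, not the paper's implementation: the `llm` callable is a stub standing in for GPT-4 or PaLM 2, and the Stage 1 prompts are paraphrases of the paper's three actions rather than its exact text.

```python
# Minimal sketch of the two-stage self-discover loop.
# Assumption: a generic `llm(prompt) -> str` completion function,
# stubbed here so the sketch runs without any model access.

REASONING_MODULES = [
    "How can I break down this problem into sub-tasks?",
    "Let's think step by step.",
    "Use critical thinking to analyze the problem from different angles.",
]

def llm(prompt: str) -> str:
    # Stub; replace with a real model call (GPT-4, PaLM 2, etc.).
    return f"[model output for: {prompt[:40]}...]"

def discover_structure(task_examples: list[str]) -> str:
    """Stage 1: run once per task, on unlabeled examples, to compose a structure."""
    examples = "\n".join(task_examples)
    modules = "\n".join(REASONING_MODULES)
    # Three guiding actions, paraphrased: pick relevant modules,
    # tailor them to the task, then write them out as an explicit plan.
    selected = llm(f"Select the modules useful for this task:\n{modules}\n\nExamples:\n{examples}")
    adapted = llm(f"Rephrase the selected modules to fit the task:\n{selected}\n\nExamples:\n{examples}")
    structure = llm(f"Turn the adapted modules into a step-by-step reasoning plan:\n{adapted}")
    return structure

def solve(structure: str, instance: str) -> str:
    """Stage 2: follow the discovered structure to answer one task instance."""
    return llm(f"Follow this reasoning structure to solve the task:\n{structure}\n\nTask:\n{instance}")

structure = discover_structure(["If I have 3 apples and eat 1, how many remain?"])
answer = solve(structure, "If I have 5 apples and eat 2, how many remain?")
```

The key efficiency point is visible in the shape of the code: Stage 1 runs only once per task, so the per-instance cost at decoding time is a single structured call, which is consistent with the reported 10 to 40 times reduction in inference compute.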
Notable performance improvements for known LLMs
To see how the new approach works, the researchers tested it with multiple models, including GPT-4 and PaLM 2-L, on 25 reasoning tasks, including Big-Bench Hard, Thinking for Doing and MATH. In 21 of the 25 tasks, self-discover outperformed chain-of-thought reasoning and other techniques, with performance gains of up to 32%. It also proved more efficient, requiring 10 to 40 times less inference compute.
According to the data shared in the paper, self-discover with GPT-4 achieved accuracies of 81%, 85% and 73% on the Big-Bench Hard, Thinking for Doing and MATH tasks, respectively. With chain-of-thought, the results dropped to 75%, 52% and 71%. A similar gap was observed when it was compared with the plan-and-solve approach.
PaLM 2-L, meanwhile, achieved accuracies of 67%, 69% and 50.5% across the three tasks. This is lower than GPT-4's results but still well above what was achieved with the chain-of-thought (60%, 40% and 42%) and plan-and-solve (61%, 42% and 49%) approaches.
Improved reasoning is key to AI success
While the self-discover prompting framework has only just been proposed, it has the potential to push the boundaries of problem-solving and give LLMs the ability to tackle challenging problems, ultimately moving toward the goal of general intelligence. Notably, the transferability studies conducted by the researchers show that the composed reasoning structures are universally applicable across model families and share commonalities with human reasoning patterns.
“Forward looking, we are excited to explore more on LLM structured reasoning to push the boundary of problem-solving and discover potentials for Human-AI collaboration,” the team added.