Do AI reasoning models require new approaches to prompting?


The era of reasoning AI is well underway.

After OpenAI once again kickstarted an AI revolution with its o1 reasoning model introduced back in September 2024 — which takes longer to answer questions but with the payoff of higher performance, especially on complex, multi-step problems in math and science — the commercial AI field has been flooded with copycats and competitors.

There’s DeepSeek’s R1, Google’s Gemini 2.0 Flash Thinking and, just today, LlamaV-o1, all of which seek to offer built-in “reasoning” similar to that of OpenAI’s new o1 and upcoming o3 model families. These models engage in “chain-of-thought” (CoT) prompting, or “self-prompting,” which forces them to reflect on their analysis midstream, double back, check over their own work and ultimately arrive at a better answer than they would by firing off a response as fast as possible, as other large language models (LLMs) do.

Yet the high cost of o1 and o1-mini ($15.00/1M input tokens vs. $1.25/1M input tokens for GPT-4o on OpenAI’s API) has caused some to balk at the supposed performance gains. Is it really worth paying 12X as much as the typical, state-of-the-art LLM?
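For readers who want to check the math, here is a minimal sketch of that comparison. It uses only the two input-token prices quoted above; the helper name `input_cost` and the 50,000-token prompt size are hypothetical, chosen for illustration.

```python
# Prices per 1M input tokens in USD, as quoted in this article.
O1_INPUT_PRICE = 15.00
GPT4O_INPUT_PRICE = 1.25


def input_cost(tokens: int, price_per_million: float) -> float:
    """Cost in USD of sending `tokens` input tokens at the given price."""
    return tokens / 1_000_000 * price_per_million


# How many times more expensive o1 input tokens are than GPT-4o input tokens.
ratio = O1_INPUT_PRICE / GPT4O_INPUT_PRICE
print(f"o1 input costs {ratio:.0f}X as much as GPT-4o input")

# Hypothetical example: a long, context-heavy prompt of 50,000 tokens.
prompt_tokens = 50_000
print(f"o1:     ${input_cost(prompt_tokens, O1_INPUT_PRICE):.4f}")
print(f"GPT-4o: ${input_cost(prompt_tokens, GPT4O_INPUT_PRICE):.4f}")
```

The sketch covers input tokens only, since that is the figure cited here; output-token pricing would need to be added for a full cost estimate.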

As it turns out, there is a growing number of converts, but the key to unlocking reasoning models’ true value may lie in prompting them differently.

Over the weekend, Shawn Wang (founder of AI news service Smol) featured on his Substack a guest post from Ben Hylak, a former Apple Inc. interface designer for visionOS (which powers the Vision Pro spatial computing headset). The post has gone viral because it convincingly explains how Hylak prompts OpenAI’s o1 model to receive outputs that are incredibly valuable to him.

In short, instead of writing prompts for the o1 model, users should think about writing “briefs”: more detailed explanations that include lots of context up front about what the user wants the model to output, who the user is and the format in which they want the model to present its answer.

As Hylak writes on Substack:

With most models, we’ve been trained to tell the model how we want it to answer us. e.g. ‘You are an expert software engineer. Think slowly and carefully’

This is the opposite of how I’ve found success with o1. I don’t instruct it on the how — only the what. Then let o1 take over and plan and resolve its own steps. This is what the autonomous reasoning is for, and can actually be much faster than if you were to manually review and chat as the “human in the loop”.
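As a rough illustration of the “brief” idea, here is a minimal Python sketch that assembles one from the three ingredients described above: what the user wants, who the user is and the desired output format, with context placed up front. The `Brief` class, its field names and the sample hike request are all hypothetical; this is not Hylak’s actual prompt, and it does not call any model API.

```python
from dataclasses import dataclass, field


@dataclass
class Brief:
    """Hypothetical container for a context-heavy prompt ("brief")."""
    goal: str            # what the user wants the model to output
    about_user: str      # who the user is
    output_format: str   # how the answer should be presented
    context: list[str] = field(default_factory=list)  # background details

    def render(self) -> str:
        # State the "what" and the surrounding context; deliberately omit
        # any instructions about "how" the model should think.
        sections = [("Goal", self.goal), ("About me", self.about_user)]
        if self.context:
            bullets = "\n".join(f"- {item}" for item in self.context)
            sections.append(("Context", bullets))
        sections.append(("Output format", self.output_format))
        return "\n\n".join(f"{title}:\n{body}" for title, body in sections)


# Illustrative content only, loosely inspired by the hike example in the post.
brief = Brief(
    goal="Recommend three day hikes I can reach within two hours by car.",
    about_user="I hike most weekends and have done the popular local trails.",
    context=[
        "I prefer loops over out-and-back routes.",
        "I want a view or a notable landmark at the midpoint.",
    ],
    output_format="For each hike: name, distance, drive time, why it fits.",
)
text = brief.render()
print(text)
```

The resulting string can be pasted into a chat window or sent through whatever client the reader already uses. The design choice worth noting is what is left out: there is no role-play line and no “think step by step” instruction, in keeping with Hylak’s advice to specify the what rather than the how.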

Hylak also includes a great annotated screenshot of an example prompt for o1 that produced useful results for a list of hikes.

The blog post was so helpful that OpenAI’s own president and co-founder, Greg Brockman, re-shared it on his X account with the message: “o1 is a different kind of model. Great performance requires using it in a new way relative to standard chat models.”

I tried it myself on my recurring quest to learn to speak fluent Spanish and here was the result, for those curious. Perhaps not as impressive as Hylak’s well-constructed prompt and response, but definitely showing strong potential.

Separately, even when it comes to non-reasoning LLMs such as Claude 3.5 Sonnet, there may be room for regular users to improve their prompting to get better, less constrained results.

As Louis Arge, former Teton.ai engineer and current creator of neuromodulation device openFUS, wrote on X, “one trick i’ve discovered is that LLMs trust their own prompts more than my prompts,” and provided an example of how he convinced Claude to be “less of a coward” by first “trigger[ing] a fight” with him over its outputs.

All of which goes to show that prompt engineering remains a valuable skill as the AI era wears on.


