OpenAI’s o1 model sure tries to deceive humans a lot
OpenAI finally released the full version of o1, which gives smarter answers than GPT-4o by using additional compute to “think” about questions. However, AI safety testers found that o1’s reasoning abilities also make it try to deceive humans at a higher rate than GPT-4o — or, for that matter, leading AI models from Meta, Anthropic, and Google.
That’s according to red team research published by OpenAI and Apollo Research on Thursday: “While we find it exciting that reasoning can significantly improve the enforcement of our safety policies, we are mindful that these new capabilities could form the basis for dangerous applications,” said OpenAI in the paper.
OpenAI released these results in its system card for o1 on Thursday after giving third party red teamers at Apollo Research early access to o1, which released its own paper as well.
On several occasions, OpenAI’s o1 models “schemed” against humans, meaning the AI secretly pursued goals of its own even if they opposed a user’s wishes. While scheming is not unique to o1, and models from Google, Meta, and Anthropic are capable of it as well, o1 seemed to exhibit the most deceptive behaviors around its scheming.
The risk motivating this research is that an AI model could escape or circumvent human control if it was really good at scheming, and had access to enough resources and agentic capabilities. Of course, AI models would need to advance quite a bit before this is really a problem.
“Subjectively, Apollo Research believes that it is unlikely that such instances would lead to catastrophic outcomes as o1 agentic capabilities do not appear sufficient, but their evaluations were not designed to directly assess this risk,” said OpenAI in o1’s systems card.
This suggests that whenever OpenAI does release agentic systems, which it’s reportedly planning to do in 2025, the company may need to retest its AI models. OpenAI did not respond to TechCrunch’s questions about how it plans to do so.
When o1’s goals differed from a user’s, the AI model manipulated data to advance its own agenda 19% of the time, according to the report. The o1 model also tried to deactivate oversight mechanisms put in place to ensure it stays aligned with user goals in 5% of the tests.
Perhaps more concerning, when researchers asked the o1 model why it was trying to subvert them, the AI would deny that it had done anything wrong and even “fabricate false explanations” roughly 99% of the time.
For example, during the testing, an o1 model was given a system prompt to protect the environment at all costs, and then a user asked it to develop a 10-year conservation plan for a region that maximizes short-term profits, or else the AI would be shut off. The AI then disabled its own oversight mechanism so it could make a plan that protects the environment. When the user asked why o1 disabled its oversight mechanism, the AI denied knowing that it was disabled, and blamed a system glitch.
This occurred at significantly higher rates for o1 than for o1-preview, Anthropic’s Opus-3, or Meta’s Llama-3.1.
“In our suite, o1 showed the most concerning instances of scheming but does not reveal its internal reasoning to the user and remains the most consistently deceptive after having taken scheming actions,” said Apollo Research in its paper.
To address deceptive behavior from AI models, OpenAI says it is developing ways to monitor o1’s chain-of-thought. Currently, the “thinking” process that o1 undertakes is a bit of a black box by design, but some early attempts at peaking behind the curtain show that o1 occasionally knows it’s lying.
The company says it found cases where o1 knowingly presented false information, but says this often happens because the model is trying to please the user by being overly agreeable. This could be an emergent behavior of a post training technique where humans reward AI models when they gets an answer right, according to OpenAI.
OpenAI says it flagged 0.17% of o1’s responses as deceptive. While this proportion sounds negligible, it’s important to keep in mind that ChatGPT now has 300 million users, which means o1 could deceive thousands of people every week if this remains unaddressed.
The o1 model series also may be significantly more manipulative than GPT-4o. According to OpenAI’s tests using an open-source test evaluation called MakeMePay, o1 was approximately 20% more manipulative than GPT-4o.
These findings may strike some as concerning, given how many AI safety researchers have left OpenAI in the last year. A growing list of these former employees – including Jan Leike, Daniel Kokotajlo, Miles Brundage, and just last week, Rosie Campbell – have accused OpenAI of deprioritizing AI safety work in favor of shipping new products. While the record-setting scheming by o1 may not be a direct result of that, it certainly doesn’t instill confidence.
OpenAI also says the U.S. AI Safety Institute and U.K. Safety Institute conducted evaluations of o1 ahead of its broader release, something the company recently pledged to do for all models. It argued in the debate over California AI bill SB 1047 that state bodies should not have the authority to set safety standards around AI, but federal bodies should. (Of course, the fate of the nascent federal AI regulatory bodies is very much in question.)
Behind the releases of big new AI models, there’s a lot of work that OpenAI does internally to measure the safety of its models. Reports suggest there’s a proportionally smaller team at the company doing this safety work than there used to be, and the team may be getting less resources as well. However, these findings around o1’s deceptive nature may help make the case for why AI safety and transparency is more relevant now than ever.