Industry observers say GPT-4.5 is an “odd” model, question its price

OpenAI has announced the release of GPT-4.5, which CEO Sam Altman previously said would be the last non-chain-of-thought (CoT) model. 

The company said the new model “is not a frontier model” but is still its biggest large language model (LLM) to date, with greater computational efficiency. Altman said that, even though GPT-4.5 does not reason the way OpenAI’s other new offerings, o1 and o3-mini, do, the new model still offers more human-like thoughtfulness. 

Industry observers, many of whom had early access to the new model, found GPT-4.5 an interesting move from OpenAI and tempered their expectations of what the model should be able to achieve. 

Wharton professor and AI commentator Ethan Mollick posted on social media that GPT-4.5 is a “very odd and interesting model,” noting it can get “oddly lazy on complex projects” despite being a strong writer. 

OpenAI co-founder and former Tesla AI head Andrej Karpathy said GPT-4.5 reminded him of when GPT-4 came out and he first saw that model’s potential. In a post to X, Karpathy said that, while using GPT-4.5, “everything is a little bit better, and it’s awesome, but also not exactly in ways that are trivial to point to.”

Karpathy, however, warned that people shouldn’t expect revolutionary impact from the model, as it “does not push forward model capability in cases where reasoning is critical (math, code, etc.).”

Industry thoughts in detail

Here’s what Karpathy had to say about the latest GPT iteration in a lengthy post on X:

“Today marks the release of GPT4.5 by OpenAI. I’ve been looking forward to this for ~2 years, ever since GPT4 was released, because this release offers a qualitative measurement of the slope of improvement you get out of scaling pretraining compute (i.e. simply training a bigger model). Each 0.5 in the version is roughly 10X pretraining compute. Now, recall that GPT1 barely generates coherent text. GPT2 was a confused toy. GPT2.5 was “skipped” straight into GPT3, which was even more interesting. GPT3.5 crossed the threshold where it was enough to actually ship as a product and sparked OpenAI’s “ChatGPT moment”. And GPT4 in turn also felt better, but I’ll say that it definitely felt subtle.

I remember being a part of a hackathon trying to find concrete prompts where GPT4 outperformed 3.5. They definitely existed, but clear and concrete “slam dunk” examples were difficult to find. It’s that … everything was just a little bit better but in a diffuse way. The word choice was a bit more creative. Understanding of nuance in the prompt was improved. Analogies made a bit more sense. The model was a little bit funnier. World knowledge and understanding was improved at the edges of rare domains. Hallucinations were a bit less frequent. The vibes were just a bit better. It felt like the water that rises all boats, where everything gets slightly improved by 20%. So it is with that expectation that I went into testing GPT4.5, which I had access to for a few days, and which saw 10X more pretraining compute than GPT4. And I feel like, once again, I’m in the same hackathon 2 years ago. Everything is a little bit better and it’s awesome, but also not exactly in ways that are trivial to point to. Still, it is incredibly interesting and exciting as another qualitative measurement of a certain slope of capability that comes “for free” from just pretraining a bigger model.

Keep in mind that GPT4.5 was only trained with pretraining, supervised finetuning and RLHF, so this is not yet a reasoning model. Therefore, this model release does not push forward model capability in cases where reasoning is critical (math, code, etc.). In these cases, training with RL and gaining thinking is incredibly important and works better, even if it is on top of an older base model (e.g. GPT4ish capability or so). The state of the art here remains the full o1. Presumably, OpenAI will now be looking to further train with reinforcement learning on top of GPT4.5 to allow it to think and push model capability in these domains.

HOWEVER. We do actually expect to see an improvement in tasks that are not reasoning heavy, and I would say those are tasks that are more EQ (as opposed to IQ) related and bottlenecked by e.g. world knowledge, creativity, analogy making, general understanding, humor, etc. So these are the tasks that I was most interested in during my vibe checks.

So below, I thought it would be fun to highlight 5 funny/amusing prompts that test these capabilities, and to organize them into an interactive “LM Arena Lite” right here on X, using a combination of images and polls in a thread. Sadly X does not allow you to include both an image and a poll in a single post, so I have to alternate posts that give the image (showing the prompt, and two responses, one from 4 and one from 4.5), and the poll, where people can vote which one is better. After 8 hours, I’ll reveal the identities of which model is which. Let’s see what happens :)”
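Karpathy’s rule of thumb, that each 0.5 step in the GPT version number corresponds to roughly 10X pretraining compute, implies a simple exponential relationship between version numbers. Here is a minimal sketch of that arithmetic; the heuristic is his, and the function below is only an illustration, not an official OpenAI compute figure:

```python
# Karpathy's heuristic: each +0.5 in the GPT version number ~ 10X pretraining
# compute. Illustrative only; not an official OpenAI figure.

def relative_pretraining_compute(version_from: float, version_to: float) -> float:
    """Estimated compute multiplier implied by two GPT version numbers."""
    return 10 ** ((version_to - version_from) / 0.5)

print(relative_pretraining_compute(4.0, 4.5))  # 10.0   (GPT-4 -> GPT-4.5)
print(relative_pretraining_compute(3.0, 4.0))  # 100.0  (GPT-3 -> GPT-4)
print(relative_pretraining_compute(1.0, 4.5))  # 1e7    (GPT-1 -> GPT-4.5)
```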

Box CEO’s thoughts on GPT-4.5

Other early users also saw potential in GPT-4.5. Box CEO Aaron Levie said on X that his company used GPT-4.5 to help extract structured data and metadata from complex enterprise content. 

“The AI breakthroughs just keep coming. OpenAI just announced GPT-4.5, and we’ll be making it available to Box customers later today in the Box AI Studio.

We’ve been testing GPT4.5 in early access mode with Box AI for advanced enterprise unstructured data use-cases, and have seen strong results. With the Box AI enterprise eval, we test models against a variety of different scenarios, like Q&A accuracy, reasoning capabilities and more. In particular, to explore the capabilities of GPT-4.5, we focused on a key area with significant potential for enterprise impact: the extraction of structured data, or metadata extraction, from complex enterprise content. 

At Box, we rigorously evaluate data extraction models using multiple enterprise-grade datasets. One key dataset we leverage is CUAD, which consists of over 510 commercial legal contracts. Within this dataset, Box has identified 17,000 fields that can be extracted from unstructured content and evaluated the model based on single-shot extraction for these fields (this is our hardest test, where the model only has one chance to extract all the metadata in a single pass vs. taking multiple attempts). In our tests, GPT-4.5 extracted fields 19 percentage points more accurately than GPT-4o, highlighting its improved ability to handle nuanced contract data.

Next, to ensure GPT-4.5 could handle the demands of real-world enterprise content, we evaluated its performance against a more rigorous set of documents, Box’s own challenge set. We selected a subset of complex legal contracts – those with multi-modal content, high-density information and lengths exceeding 200 pages – to represent some of the most difficult scenarios our customers face. On this challenge set, GPT-4.5 also consistently outperformed GPT-4o in extracting key fields with higher accuracy, demonstrating its superior ability to handle intricate and nuanced legal documents.

Overall, we’re seeing strong results with GPT-4.5 for complex enterprise data, which will unlock even more use-cases in the enterprise.”
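Levie’s description boils down to a field-level accuracy comparison under a single-pass constraint: the model gets one shot to emit every field, and the score is the fraction of gold fields it gets right. A minimal sketch of how such an eval might be scored follows; the field names, exact-match rule and data are hypothetical, since Box’s actual harness and matching rules are not public:

```python
# Sketch of a single-shot field-extraction eval in the spirit of Box's test.
# Field names, the exact-match scoring rule, and all data are hypothetical.

def field_accuracy(gold: dict[str, str], predicted: dict[str, str]) -> float:
    """Fraction of gold fields the model extracted correctly in one pass."""
    correct = sum(
        1 for field, value in gold.items()
        if predicted.get(field, "").strip().lower() == value.strip().lower()
    )
    return correct / len(gold)

gold    = {"governing_law": "Delaware", "term_years": "3", "auto_renewal": "yes"}
model_a = {"governing_law": "Delaware", "term_years": "3", "auto_renewal": "no"}
model_b = {"governing_law": "Delaware", "term_years": "36 months"}

acc_a, acc_b = field_accuracy(gold, model_a), field_accuracy(gold, model_b)
print(f"model A: {acc_a:.0%}, model B: {acc_b:.0%}, "
      f"gap: {(acc_a - acc_b) * 100:.0f} percentage points")
```

A production harness would also need per-field matching rules (dates, amounts, free text) and handling for fields the model emits that the gold set does not contain.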

Questions on price and its importance

Even as early users found GPT-4.5 workable — albeit a bit lazy — they questioned its release. 

For instance, prominent OpenAI critic Gary Marcus called GPT-4.5 a “nothingburger” on Bluesky.

Hot take: GPT 4.5 is a nothingburger; GPT-5 still fantasy.
• Scaling data is not a physical law; pretty much everything I told you was true.
• All the BS about GPT-5 we listened to for last few years: not so true.
• Fanboys like Cowen will blame users, but results just aren’t what they had hoped.

— Gary Marcus (@garymarcus.bsky.social) 2025-02-27T20:44:55.115Z

Hugging Face CEO Clement Delangue commented that GPT-4.5’s closed-source provenance makes it “meh.” 

For many, however, the complaint had nothing to do with GPT-4.5’s performance. Instead, people questioned why OpenAI would release a model priced almost prohibitively high while being less powerful than its other models. 

One user commented on X: “So you’re telling me GPT-4.5 is worth more than o1 yet it doesn’t perform as well on benchmarks…. Make it make sense.”
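For context on the pricing complaint, here is a back-of-the-envelope comparison using the per-token API list prices reported around GPT-4.5’s launch (roughly $75 per million input tokens and $150 per million output tokens, versus $15 and $60 for o1); treat these figures as a snapshot, since API pricing changes over time:

```python
# Rough per-request cost comparison at launch-time API list prices, in USD per
# 1M tokens. Figures are as reported at GPT-4.5's release and may have changed.
PRICES = {
    "gpt-4.5": {"input": 75.00, "output": 150.00},
    "o1":      {"input": 15.00, "output": 60.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the list prices above."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A typical long-document request: 10k tokens in, 1k tokens out.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 10_000, 1_000):.2f}")
# gpt-4.5: $0.90 vs. o1: $0.21 -- roughly 4x the cost per request.
```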

Other X users posited that the high token cost could be meant to deter competitors like DeepSeek from attempting “to distill the 4.5 model.”

DeepSeek became a major competitor to OpenAI in January, with industry leaders finding DeepSeek-R1’s reasoning to be as capable as OpenAI’s, but more affordable. 


