OpenAI releases 1-million-token coding model GPT-4.1, available immediately via API

OpenAI has released GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano to its API suite, phasing out GPT-4.5 Preview while advancing code generation, instruction following, and long-context processing capabilities.

Essentially signaling the failure of GPT-4.5, the new GPT-4.1 models introduce context windows of up to one million tokens, enabling native handling of full repositories, extensive documents, and complex multi-turn agent workflows within a single call.
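For developers switching over, the request shape is unchanged from earlier GPT-4-series models; only the model identifier differs. Here is a minimal sketch using the official Python SDK (the model names follow OpenAI’s announcement; the prompt itself is illustrative):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4.1",  # also available: "gpt-4.1-mini", "gpt-4.1-nano"
    messages=[
        {"role": "system", "content": "You are a senior Python developer."},
        {"role": "user", "content": "Write a function that deduplicates a list while preserving order."},
    ],
)
print(response.choices[0].message.content)
```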

While researching this article, I was able to use GPT-4.1 to ‘vibe code’ a simple Python-based dungeon crawler in five minutes and five prompts. The model made no errors in its code; the only issues related to identifying the relevant sprites in the asset atlas I imported.

Dungeon crawler demo built with GPT-4.1

Thanks to its large context window, the model was also able to identify the functionality of a large code repository within a few prompts.

Model Capabilities and Transition Path

Per OpenAI, GPT-4.1 achieves a 54.6% score on SWE-bench Verified, reflecting the improved ability to produce runnable code patches that resolve real-world repository issues. This outpaces GPT-4o’s 33.2% and GPT-4.5’s 38% under the same benchmark. The model also executes code diffs more precisely, with 53% accuracy on Aider’s polyglot benchmark in diff format, more than doubling GPT-4o’s 18%.

Instruction-following fidelity is also refined. On Scale’s MultiChallenge, GPT-4.1 reaches 38.3% accuracy, compared to 27.8% for GPT-4o. These improvements include adhering to strict output formats, complying with constraints, and following nested or contradictory instructions.

According to the AI coding platform Windsurf, internal evaluations show that GPT-4.1 produces cleaner diffs and is more aligned with structured developer workflows.

All three GPT-4.1 models support context windows of up to one million tokens, a substantial jump from the previous 128K-token limit.
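In practice, the larger window means a modest codebase can be packed into a single request rather than chunked across calls. A rough sketch of that pattern, where the file selection and prompt wording are my own illustration rather than an OpenAI recipe:

```python
from pathlib import Path
from openai import OpenAI

client = OpenAI()

# Concatenate every Python file in a (hypothetical) local repo into one prompt.
# For repos approaching the 1M-token limit, you would still count tokens first.
repo = Path("./my_project")
corpus = "\n\n".join(
    f"# FILE: {path}\n{path.read_text(errors='ignore')}"
    for path in sorted(repo.rglob("*.py"))
)

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{
        "role": "user",
        "content": f"Explain what this codebase does and how its modules relate:\n\n{corpus}",
    }],
)
print(response.choices[0].message.content)
```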

To validate this, OpenAI released MRCR, an open-source evaluation that tests a model’s ability to retrieve specific details from within dense, distractor-heavy context blocks. On the separate Video-MME benchmark, GPT-4.1 scored 72% in the long-video, no-subtitles category, setting a new high.

Efficiency gains across the series and agent use

The GPT-4.1 mini model provides latency and cost reductions while maintaining comparable performance. OpenAI stated that GPT-4.1 mini reduces inference latency by nearly 50% and cost by 83% relative to GPT-4o, with equal or superior scores on multiple intelligence evaluations.

Meanwhile, GPT-4.1 nano, optimized for low-latency tasks, achieves 80.1% on MMLU, 50.3% on GPQA, and 9.8% on Aider’s polyglot coding benchmark. These scores exceed GPT-4o mini’s in critical areas and position nano for use in classification, autocomplete, and reactive agentic systems.

There is no added cost for long-context use across the GPT-4.1 series. Token usage follows standard API pricing, allowing developers to scale applications involving large document retrieval, repository comprehension, or complete project editing without premium pricing tiers.

Improvements in instruction fidelity and context retention bolster the model family’s viability for agentic applications. With OpenAI’s Responses API, developers can deploy GPT-4.1-based systems to autonomously execute chained operations such as resolving customer tickets, mining documents for insights, or operating across multi-step task environments.
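As a sketch of what one step of such a workflow looks like, here is a single Responses API call; the support-ticket scenario is illustrative, and `output_text` is the Python SDK’s convenience accessor for the model’s text output:

```python
from openai import OpenAI

client = OpenAI()

# One step of a chained agent workflow: classify a ticket, then draft a reply.
response = client.responses.create(
    model="gpt-4.1",
    input=(
        "Classify this support ticket (bug / billing / how-to) and draft a "
        "short reply: 'My export job has been stuck at 99% for two hours.'"
    ),
)
print(response.output_text)
```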

GPT-4.5 Preview, launched as a research-focused offering, will be sunset on July 14. According to OpenAI, feedback from GPT-4.5’s testing phase informed the fine-tuning and deployment configurations now embodied in GPT-4.1. As such, GPT-4.1 is positioned as the replacement path for developers using GPT-4.5 in the API.

ChatGPT users will continue interacting with GPT-4o, where OpenAI is incrementally integrating instruction-following improvements. GPT-4.1 models, however, are API-exclusive.

Technical implications for code-first developers

The decision to scale the context window to one million tokens is likely a response to Google’s Gemini 2.5 Pro. It most directly affects developers managing large monorepos, documentation-heavy domains, or multi-file dependency chains.

GPT-4.1’s upgraded output limit, now up to 32,768 tokens, also enables single-call full-file rewrites, removing the need for post-processing or fragment merging.
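A hedged sketch of such a single-call rewrite; `legacy_module.py` is a hypothetical input file, and the `max_tokens` value mirrors the stated output cap:

```python
from pathlib import Path
from openai import OpenAI

client = OpenAI()

source = Path("legacy_module.py").read_text()  # hypothetical input file

response = client.chat.completions.create(
    model="gpt-4.1",
    max_tokens=32768,  # allow the entire rewritten file in one response
    messages=[{
        "role": "user",
        "content": (
            "Rewrite this module with type hints and docstrings. "
            "Return the complete file only:\n\n" + source
        ),
    }],
)
print(response.choices[0].message.content)
```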

Adherence to structured formats allows developers to optimize workflows around minimal output generation for code diffs, cutting token costs and increasing system responsiveness.
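For example, instructing the model to return only a unified diff keeps output generation to a minimum; the buggy snippet and the diff-only instruction below are illustrative:

```python
from openai import OpenAI

client = OpenAI()

# Deliberately buggy input: xs[len(xs)] raises IndexError; xs[-1] is intended.
buggy = "def last(xs):\n    return xs[len(xs)]\n"

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{
        "role": "user",
        "content": "Fix the bug below and reply with a unified diff only, no prose:\n\n" + buggy,
    }],
)
print(response.choices[0].message.content)  # expected: a short unified diff
```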

According to OpenAI’s internal tests, GPT-4.1 has already demonstrated improved production results across frontend development, legal parsing, and backend automation.

In comparative evaluations, paid graders preferred GPT-4.1-generated websites over GPT-4o results in 80% of test cases, citing superior functionality and clarity in HTML, CSS, and JavaScript output.

GPT-4.1 mini and nano models extend these benefits to low-resource environments and latency-critical settings. The introduction of nano provides a fast-reacting, low-cost LLM capable of replacing larger models in rapid iteration pipelines, chat interfaces, or embedded dev tools.

Developers using GPT-4.5 or GPT-4o mini are advised to evaluate migration paths now, as GPT-4.1’s performance and token economics favor its adoption in most deployment configurations. Model access, prompting guides, and updated benchmarks are available through the OpenAI developer platform.

Per OpenAI, GPT-4o and GPT-4o mini will continue to be supported in the API for the foreseeable future, but emphasis is being placed on the GPT-4.1 line as the preferred upgrade path.
