OpenAI GPT-5.5 Arrives: A New Class of Real-Work AI
A new model, a new pricing floor, and a lot of weight on the word "real"
OpenAI released GPT-5.5 on Thursday, April 23, 2026, barely six weeks after GPT-5.4 and one week after Anthropic reclaimed the coding crown with Claude Opus 4.7. The company is calling the release "a new class of intelligence for real work", which is the kind of phrasing that usually signals a product pitch rather than a capability statement. In this case, it is a little of both.
GPT-5.5 is the first fully retrained base model OpenAI has shipped since GPT-4.5. The GPT-5.1, 5.2, 5.3, and 5.4 releases that preceded it were primarily post-training refinements on the same foundation. GPT-5.5, by contrast, is new pretraining, new reasoning behavior, and a large jump on the benchmarks that matter if you are trying to run agents that actually finish jobs.
It is also, notably, twice as expensive per token on the API. Input moves from $2.50 to $5.00 per million tokens. Output moves from $15.00 to $30.00. That is a material increase for any team running production workloads, and it raises a question worth sitting with before signing up for the upgrade: is the jump from GPT-5.4 to GPT-5.5 actually worth 2x the token cost on your workload, or is this a premium you pay for the parts of the capability curve you do not use?
The answer depends on what you are doing. For agent builders, coding teams, and anyone running long-horizon work where token efficiency compounds, the math works in OpenAI's favor. For everyday knowledge work, the improvement is incremental. For consumers who do not see the API bill, it is a free upgrade that ships with the usual ChatGPT Plus subscription.
This is the detailed read on what actually changed, how the benchmarks shake out against Claude Opus 4.7, Gemini 3.1 Pro, and the restricted Claude Mythos Preview, and what the release signals about where OpenAI, Anthropic, and Google are taking the frontier model market next.
What GPT-5.5 actually is
GPT-5.5 is a reasoning model with text and image inputs and text output. It ships in three variants: the standard model, GPT-5.5 Thinking, and GPT-5.5 Pro. In ChatGPT, GPT-5.5 Thinking is available to Plus, Pro, Business, and Enterprise users. GPT-5.5 Pro, designed for even harder questions and higher-accuracy work, is available to Pro, Business, and Enterprise users. In Codex, GPT-5.5 is available for Plus, Pro, Business, Enterprise, Edu, and Go plans with a 400K context window.
The API numbers are the ones to memorize. gpt-5.5 will soon be available in the Responses and Chat Completions APIs at $5 per 1M input tokens and $30 per 1M output tokens, with a 1M context window. Batch and Flex pricing run at half the standard API rate, and Priority processing runs at 2.5x. Pro comes in at $30 and $180 per million tokens, which puts it squarely in "use only when you need it" territory.
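For a concrete sense of how those tiers compare on a single request, here is a minimal sketch of the per-request math using the rates listed above. The token counts are hypothetical, and the rates should be confirmed against OpenAI's pricing page before anyone budgets around them.

```python
# Per-million-token rates from the published GPT-5.5 price list (illustrative only).
RATES = {
    "standard": {"input": 5.00, "output": 30.00},
    "batch":    {"input": 2.50, "output": 15.00},   # half of standard
    "flex":     {"input": 2.50, "output": 15.00},   # half of standard
    "priority": {"input": 12.50, "output": 75.00},  # 2.5x standard
    "pro":      {"input": 30.00, "output": 180.00},
}

def request_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the given pricing tier."""
    rate = RATES[tier]
    return (input_tokens * rate["input"] + output_tokens * rate["output"]) / 1_000_000

# Example: a 20K-token prompt that produces a 4K-token response (hypothetical sizes).
for tier in RATES:
    print(f"{tier:>8}: ${request_cost(tier, 20_000, 4_000):.4f}")
```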
The Artificial Analysis profile puts the model at a score of 59 on the Artificial Analysis Intelligence Index, which currently ranks it #2 out of 141 evaluated models. The composite draws on ten benchmarks including GDPval, SciCode, GPQA Diamond, and Humanity's Last Exam. The median score in the comparable cohort is 33. In practical terms, the model sits at or near the top of every leaderboard that actually matters for general-purpose work.
The codename is "Spud." OpenAI uses codenames during development (GPT-5 was "Orion"), and they tend toward the informal; "Spud" is slang for potato. The name surfaced weeks before launch through an Azure configuration leak that briefly exposed a "gpt-5.5-turbo-preview" entry on OpenAI's servers. The model that eventually shipped is the same project.
What changed from GPT-5.4
The delta between GPT-5.4 and GPT-5.5 is not subtle, but it is specific. OpenAI highlighted four areas in its announcement: agentic coding, computer use, knowledge work, and early scientific research. The gains are concentrated there, and the wins are concrete enough that third-party evaluators have independently verified most of them.
Terminal-Bench 2.0 is the clearest demonstration. The benchmark measures multi-tool command line workflows that require planning, iteration, and recovery from mistakes. GPT-5.4 scored 75.1%. GPT-5.5 hits 82.7%. That is a 7.6-point jump in a single cycle, in a benchmark that is hard to saturate. For anyone building agents that run in a shell, this is the number that justifies looking at the upgrade.
GDPval, OpenAI's preferred benchmark for economically valuable work, moves from 83.0% to 84.9%. The gain is small, and on a benchmark designed to measure real-world task performance across 44 occupations, including financial analysis, legal drafting, and consulting work, a small gain is itself the finding: if GDPval measures what it claims to, GPT-5.5 is not a major leap for everyday professional tasks. The near-saturation reading is honest: if you were already happy with GPT-5.4 for writing, research, and document work, GPT-5.5 will feel like a polish pass rather than a new tier.
OSWorld-Verified, which tests desktop computer use, climbs from 75.0% to 78.7%. That is the difference between an agent that occasionally fumbles a UI and one that completes mixed software tasks reliably enough to deploy in production. Combined with BrowseComp at 90.1% on the Pro tier, GPT-5.5 is the first general-purpose OpenAI release where computer use agents feel like a product category rather than a demo.
Long-context performance also moved sharply. On the MRCR v2 benchmark, which tests how reliably a model can locate multiple pieces of hidden information across very long texts, GPT-5.5 jumps to 74.0% at context lengths of 512K to 1M tokens, up from 36.6% for GPT-5.4. On the Graphwalks BFS test with one million tokens, GPT-5.5 leaps from 9.4% (GPT-5.4) to 45.4%. Long-context was GPT-5.4's weakest dimension. It is still not where Gemini 3.1 Pro sits, but it is no longer a reason to route around the model.
Hallucinations are down meaningfully. GPT-5.5 scores 88.7% on SWE-bench and 92.4% on MMLU, with a 60% drop in hallucinations versus GPT-5.4. That 60% number deserves the usual skepticism reserved for first-party reliability claims, but it is consistent with the direction of travel OpenAI has been on since GPT-5, where safe-completion fine-tuning and improved abstention have been central to the training recipe.
The benchmark fight: where GPT-5.5 wins, where it loses
The honest read on GPT-5.5 against the rest of the frontier is that it dominates agentic benchmarks and loses on some coding-specific ones. This is a different story than the one OpenAI tells in its announcement, and both stories are partially correct.
On coding, Claude Opus 4.7 still has a real lead on the benchmark that gets cited most often. Opus 4.7 scores 64.3% on SWE-bench Pro (versus 58.6% for GPT-5.5) and 79.1% on MCP-Atlas (versus 75.3%). Anthropic itself flags memorization concerns on a subset of SWE-bench problems, and that caveat matters, but it does not erase the gap, and the lead in tool orchestration via MCP is independent of it. SWE-bench Pro evaluates real GitHub issue resolution end-to-end. If your workload looks like that, with multi-file refactors, large PRs, and complex interdependencies, Opus 4.7 is still the stronger choice.
On agentic workflows, the picture flips. On Terminal-Bench 2.0 (planning, iteration, and tool coordination across command-line workflows), GPT-5.5 scores 82.7% versus 69.4% for Opus 4.7 per OpenAI's eval. On the internal Expert-SWE benchmark — long-horizon coding tasks with a median estimated 20-hour human completion time — GPT-5.5 hits 73.1%; Opus 4.7 isn't reported on this internal eval. Terminal-Bench tests something closer to what a coding agent actually does in production: plan, execute, handle a tool failure, recover. GPT-5.5 wins that category by more than 13 points.
Mathematics is where the Pro tier shines brightest. FrontierMath Tier 4, the hardest tier of a benchmark built by working mathematicians to resist memorization, has GPT-5.5 Pro at 39.6% versus 22.9% for Claude Opus 4.7 and 16.7% for Gemini 3.1 Pro. Doubling the second-best frontier model on a deliberately hard benchmark is the kind of result that changes how research teams route difficult problems.
Against Gemini 3.1 Pro, GPT-5.5 takes most rows but leaves two important ones behind. Gemini retains its 2M token context window, which is more than twice the 922K that Artificial Analysis measured on GPT-5.5. Gemini also remains cheaper at the token level, which matters when your workload is bounded by cost rather than capability. For very long documents, RAG pipelines, or cost-sensitive deployments, Google's model still has a legitimate case.
The most awkward comparison is Claude Mythos Preview, Anthropic's gated frontier model that OpenAI has pointedly not included in its primary comparison table. Mythos Preview is not a generally available product: Anthropic has classified it as a strategic defensive asset because of its cybersecurity risk profile and restricts access to a small set of trusted partners and government agencies. Because Mythos is excluded from broad commercial use, the primary market competition remains between GPT-5.5, Gemini 3.1 Pro, and Claude Opus 4.7.
Where they do overlap on published numbers, Mythos leads on most of them: SWE-bench Pro (77.8 vs 58.6 / 64.3 for GPT-5.5 / Opus 4.7), HLE without tools (56.8 vs 41.4 / 46.9), HLE with tools (64.7 vs 52.2 / 54.7), CyberGym (83 vs 81.8 / 73.1), OSWorld-Verified (79.6 vs 78.7 / 78.0), and Graphwalks long context (a soft comparison, but 80 vs 45.4). GPQA Diamond (94.5 vs 93.6 / 94.2) and the Terminal-Bench 2.0 headline numbers (82 vs 82.7) are effectively tied. That is a sharp picture of the gap between what the public can buy and what the frontier labs are actually running internally. GPT-5.5 is the best model most enterprises can deploy. Mythos is stronger where it has been tested, but you cannot buy it.
Codex and the agentic coding story
The release that matters most commercially is not GPT-5.5 in ChatGPT. It is GPT-5.5 in Codex, OpenAI's agentic coding environment. That is where the efficiency story pays off, where the Terminal-Bench score converts directly into shipped work, and where NVIDIA's internal deployment shows what this capability tier actually looks like when a large engineering organization leans on it.
NVIDIA ran the numbers first. The company deployed Codex with GPT-5.5 internally to more than 10,000 employees, and reported that debugging cycles dropped from days to hours and that multi-week experimentation now completes overnight on complex codebases. Those are the kind of gains a productivity study would report, not benchmark anecdotes. NVIDIA also cites stronger reliability and fewer wasted cycles than earlier models, which matters because the practical ceiling on long-running agents is usually failure recovery, not raw capability.
OpenAI's framing on Codex efficiency deserves attention. The company says GPT-5.5 uses significantly fewer tokens to complete the same Codex tasks compared with GPT-5.4. That is the offsetting claim to the 2x price increase. If the model needs 40% fewer tokens to finish a task, the effective cost per task is closer to 1.2x than 2x. For long-running agents, where tokens compound across tool calls and reasoning steps, the net can even flip favorable.
This is also the benchmark that developers should pay the most attention to, because it is the closest proxy for how an agent actually behaves in production. A model that recovers from a shell error, re-plans, and finishes the task is worth more than a model that scores higher on a static eval and stalls the moment something goes wrong. Terminal-Bench 2.0 is designed to stress exactly that loop. The 82.7% headline is the single most defensible number in the GPT-5.5 release.
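To make that loop concrete, here is a deliberately simplified sketch of a plan-execute-recover cycle. It is not how Codex or Terminal-Bench is implemented; the plan_next_command helper is a placeholder for whatever model call an agent framework would make, and the point is only that the exit code and output get fed back so the model can re-plan after a failure instead of stalling.

```python
import subprocess

def plan_next_command(goal: str, history: list[dict]) -> str | None:
    """Placeholder for a model call that proposes the next shell command.

    A real agent would send the goal plus the execution history to the model
    and parse either a command or a "done" signal out of the response.
    """
    raise NotImplementedError

def run_agent(goal: str, max_steps: int = 20) -> list[dict]:
    history: list[dict] = []
    for _ in range(max_steps):
        command = plan_next_command(goal, history)
        if command is None:  # the model decides the task is finished
            break
        result = subprocess.run(command, shell=True, capture_output=True, text=True)
        # Feed the exit code and output back so the next planning step can
        # recover from the failure rather than repeating it.
        history.append({
            "command": command,
            "exit_code": result.returncode,
            "stdout": result.stdout[-2000:],
            "stderr": result.stderr[-2000:],
        })
    return history
```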
Knowledge work, research, and the GDPval ceiling
GDPval at 84.9% is the number OpenAI is using to argue that GPT-5.5 is ready for knowledge work at scale. The benchmark spans 44 occupations and targets tasks an economist would recognize as billable work: financial analysis, legal drafting, consulting write-ups, management presentations. The gain over GPT-5.4's 83.0% is small, and that is the interesting signal.
GDPval is saturating. The cleanest read of the near-flat improvement is that the capability needed to produce acceptable professional output is already near the top of what benchmarks can measure, and the frontier is now moving toward longer-horizon work that does not fit into a single-task eval. This is consistent with Ethan Mollick's extended hands-on testing. In his review, GPT-5.5 Pro produced PhD-quality academic output on hundreds of crowdfunding research files within four prompts, generating legitimate literature reviews and sophisticated statistical analyses without manual editing. Four prompts is the headline; the work itself is the story.
Mollick also spells out the remaining ceiling. Long-form fiction still shows an "uncanny" quality: ornate sentences, repetitive metaphors, dialogue where every character speaks in the same clipped tone. Hypothesis generation in statistics is technically sound but often uninteresting without expert prompting. The frontier remains jagged, which is the honest phrase. A model that can write a publishable literature review in four prompts still cannot reliably write a publishable novel.
The practical implication for knowledge workers is that the bottleneck has moved. A year ago, AI could draft, and the human had to do most of the thinking. Today AI can complete, and the human has to pick the right thing to hand over. That is a different job description than it was two models ago, and GDPval's flat improvement reflects the point that individual tasks are mostly solved — what is left is routing, judgment, and integration.
Computer use, browsing, and the superapp argument
OSWorld-Verified at 78.7% and BrowseComp at 90.1% on the Pro tier are the two numbers that underwrite OpenAI's superapp pitch. On the announcement call, Greg Brockman said GPT-5.5 was "a real step forward towards the kind of computing that we expect in the future", and added that the company still envisions a consolidated product that combines ChatGPT, Codex, and an AI browser into a single service for enterprise customers.
The superapp strategy puts OpenAI in open competition with Elon Musk's stated plans for X, and in an enterprise adoption race with Anthropic and Google. It also reflects a pivot OpenAI made quietly over the last six months, after executives described Anthropic's rise as a "code red" wake-up call and shifted strategy toward business customer adoption. GPT-5.5 is the first product release that reads as a deliberate response to that pivot.
Computer use agents need two things to be useful: the ability to drive a UI and the ability to find information they do not already have. OSWorld covers the first. BrowseComp covers the second. GPT-5.5 Pro scores at or near the top on both, which is why OpenAI is comfortable pushing Pro access further down the plan ladder than it ever has before.
The scale numbers are worth noting too, because they explain why OpenAI can keep shipping at this cadence. The company says Codex has 4 million active users and ChatGPT has 9 million paying business users, more than 900 million weekly active users, and over 50 million subscribers. That is the revenue base funding the compute that makes the GPT-5.5 Pro tier economically defensible. It is also the distribution moat that Anthropic and Google are trying to close.
Math, science, and the research partner thesis
FrontierMath Tier 4 at 39.6% on GPT-5.5 Pro is the result that probably best supports OpenAI's framing of the model as a research partner rather than an assistant. The benchmark is deliberately constructed by working mathematicians to resist pattern-matching, and the Tier 4 problems are difficult enough that professional mathematicians disagree about solutions. A publicly available model solving 40% of those is a qualitative shift in what math-heavy research teams can automate.
SciCode and GPQA Diamond, both represented in the Artificial Analysis composite, test graduate-level scientific reasoning and programming. GPT-5.5 performs at or near the top on both. Mark Chen, chief research officer at OpenAI, said that GPT-5.5 was better at navigating computer work than its predecessors, and also said that the model "shows meaningful gains on scientific and technical research workflows," with drug discovery cited as a specific application.
The research partner framing is accurate but needs context. GPT-5.5 does not replace a domain expert. It compresses the time between hypothesis and verification, which is the expensive step in most research workflows. For academic groups running simulations, literature review pipelines, or statistical analysis on large corpora, the efficiency delta is probably more important than the raw capability delta. You get more experiments per day from the same team, which is how AI actually changes research productivity.
GPT-5.5 Pro: when the higher tier is worth it
GPT-5.5 Pro is the variant that does the extended reasoning work. It is available to Pro, Business, and Enterprise users in ChatGPT, and at $30 input / $180 output per million tokens on the API. It is not the model you run by default. It is the model you run when the answer matters more than the latency or the cost.
The Pro-specific benchmarks are where the strongest GPT-5.5 numbers come from: 90.1% on BrowseComp (vs 84.4% on base), 39.6% on FrontierMath Tier 4 (vs 35.4%), 43.1% on HLE without tools (vs 41.4%), 57.2% on HLE with tools (vs 52.2%). These are not small deltas. If your workload actually needs the extra reasoning depth, the jump is meaningful.
The practical question is when. Pro is appropriate for deep research, long-horizon agent loops where a reasoning failure cascades across many steps, and tasks where a wrong answer has a real cost. It is overkill for everyday chat, routine coding, and document work that GPT-5.5 base handles at a sixth of the price. The efficiency work in GPT-5.5 narrowed the gap enough that Pro is now closer to a daily driver for demanding users than it used to be, but the 6x token price differential still enforces a clear hierarchy.
In Mollick's testing, GPT-5.5 Pro completed a 5000-year procedurally evolving harbor town simulation in 20 minutes, down from 33 minutes on the previous generation, and built a complete 101-page illustrated tabletop RPG from a single prompt. Those are the kind of long-horizon, self-sustaining loops Pro is actually built for. If your workload does not look like that, base is fine.
The pricing math and what it means for production workloads
The headline price doubling obscures a more nuanced cost picture. At Batch pricing, GPT-5.5 costs $2.50 / $15.00 per million tokens, identical to GPT-5.4's standard pricing. For offline workloads, the price doubling disappears. For any team running overnight evaluations, historical backfills, or batch jobs where 24-hour turnaround is acceptable, the upgrade from GPT-5.4 is essentially free at the token level.
The Flex tier also runs at 50% of standard pricing with variable wait times, which is a reasonable fit for background work where latency is not the binding constraint. Priority processing runs at 2.5x standard for teams that need guaranteed fast response times, and Fast mode in Codex generates tokens 1.5x faster at 2.5x the cost, which is designed for interactive development work where a human is waiting.
The right way to evaluate the upgrade is per-task rather than per-token. OpenAI claims GPT-5.5 completes equivalent Codex tasks using significantly fewer tokens than GPT-5.4. If that efficiency is real in your workload, a 40% token reduction offsets most of the per-token price increase. Net effect: the cost per completed task moves from 2x GPT-5.4 down into the 1.1x to 1.3x range, and that is before accounting for the quality gains.
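The arithmetic is simple enough to sanity-check directly. A minimal sketch, assuming the 40% token reduction holds on your workload and using the published output-token rates; the 100K tokens per task is a made-up figure standing in for whatever your agents actually consume:

```python
# Output-token rates in dollars per million tokens (input follows the same ratio).
GPT_5_4_RATE = 15.00
GPT_5_5_RATE = 30.00          # 2x per token

# Hypothetical workload: GPT-5.4 spends 100K tokens per task; GPT-5.5 needs 40% fewer.
tokens_5_4 = 100_000
tokens_5_5 = tokens_5_4 * 0.60

cost_5_4 = tokens_5_4 * GPT_5_4_RATE / 1_000_000   # $1.50 per task
cost_5_5 = tokens_5_5 * GPT_5_5_RATE / 1_000_000   # $1.80 per task

print(f"Effective per-task cost multiplier: {cost_5_5 / cost_5_4:.2f}x")  # 1.20x, not 2x
```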
For consumer-facing applications where the user does not see the bill, this analysis is moot. The upgrade is free on Plus and Pro subscriptions, and the only question is whether the model is actually better on the workload in question. For enterprise teams running production pipelines, the answer deserves a real A/B test before the default model flag changes.
The infrastructure story: GB200 NVL72 and the compute curve
GPT-5.5 runs on NVIDIA GB200 NVL72 rack-scale systems, which NVIDIA says deliver 35x lower cost per million tokens and 50x higher token output per second per megawatt compared with prior systems. The first GB200 NVL72 100,000 GPU cluster completed large-scale training runs for GPT-5.5 and set new reliability benchmarks at scale.
These numbers explain the release cadence. At 35x lower serving cost and 50x higher throughput per megawatt, the economics of retraining base models shift from "capital-constrained event every 18 months" to "capital-constrained event every six months." Jakub Pachocki, OpenAI's chief scientist, said "We see pretty significant improvements in the short term, extremely significant improvements in the medium term. In fact, I would say, like, I think the last two years have been surprisingly slow."
That statement reads as a product pitch until you stack it against the release cadence. GPT-5 shipped in August 2025. GPT-5.1 in November 2025. GPT-5.2 in December 2025. GPT-5.3 and GPT-5.4 in early 2026. GPT-5.5, a fully retrained base, in April 2026. Two years ago, a fully retrained base model was roughly an annual event. The new cadence is closer to six months between full retrains, with incremental post-training updates filling the gaps. If Pachocki is right about the medium term, the next release will come faster than most enterprise procurement cycles can track.
OpenAI has committed to deploying more than 10 gigawatts of NVIDIA systems for next-generation infrastructure. That is a planning-horizon number, not a current deployment, but it signals the compute footprint OpenAI is building toward. The strategic bet is that cost per token continues to decline faster than token demand per task grows, which keeps the unit economics of frontier models improving even as the models themselves get more capable.
Safety, Preparedness, and the cybersecurity question
OpenAI is treating GPT-5.5's cybersecurity and biology/chemistry capabilities as High under its Preparedness Framework. GPT-5.5 did not reach the Critical capability level for cybersecurity, but OpenAI says its evaluations and testing showed a clear step up from GPT-5.4. The release documentation describes work with government partners on critical infrastructure protection, a defensive-use-only access pathway at chatgpt.com/cyber for verified security teams, and 200 trusted early-access partners who tested the model before launch.
The broader safety context matters here. Anthropic positioned Claude Mythos Preview as a defensive cybersecurity tool and restricted access accordingly. OpenAI has taken a different approach with GPT-5.5: broad release with gated access for higher-risk use cases. The philosophical difference is real, and it maps to the two labs' overall strategies. Anthropic is betting on deep access to a narrow set of trusted partners for frontier cyber work. OpenAI is betting on broad deployment with safeguards, and on the feedback loop that broad deployment generates.
Neither approach is obviously correct. Anthropic's model reduces the probability of misuse by reducing access. OpenAI's model produces more information about how the model actually behaves in production, which matters for improving the next release. The two strategies will diverge further as the capability tier increases, and the industry does not yet have a consensus on which way to lean.
The competitive landscape: three labs, three bets
April 2026 has produced the clearest differentiation in frontier model strategy yet. OpenAI is betting on autonomous agents and economically valuable work. Anthropic is betting on coding precision and cybersecurity. Google is betting on reasoning depth, very long context, and cost-efficient scale. These are not marketing positions. They are where each lab has actually concentrated its benchmark wins.
GPT-5.5 is the default model for general work, agentic pipelines, and any workload where the breadth of capabilities matters more than the peak score on any single benchmark. It wins Terminal-Bench, GDPval, OSWorld, and Tau2-bench. For enterprise deployments looking for a single model that is competent across the full stack, this is the current pick.
Claude Opus 4.7 is the coding specialist. Its 64.3% on SWE-bench Pro is still the best among generally-available models for real GitHub issue resolution, and its MCP-Atlas lead matters for teams building around the Model Context Protocol tool-use standard. If your workload is refactor-heavy, multi-file, or lives inside a large existing codebase, Opus 4.7 is still worth the default slot.
Gemini 3.1 Pro is the reasoning and context specialist. Its 77.1% on ARC-AGI-2 and 2M token context window are genuine differentiators for long-context reasoning, RAG pipelines that cannot be broken up, and cost-sensitive deployments where Google's lower token pricing compounds.
Claude Mythos Preview sits outside the normal comparison framework because it is not generally available. Its 93.9% on SWE-bench Verified would be the top of every chart if it were buyable, but Anthropic has restricted it to government and critical infrastructure partners. For strategic planning purposes, Mythos is best understood as a ceiling signal: the capability tier the labs are running internally is already meaningfully above what the public can deploy.
The practical consequence for anyone building on frontier models is that a single-model default is increasingly a suboptimal strategy. A well-designed routing layer that sends coding tasks to Opus 4.7, long-context work to Gemini, and agentic workflows to GPT-5.5 will outperform any single-model deployment. That is a real operational shift for most teams.
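A routing layer does not have to be elaborate to capture most of that benefit. The sketch below routes on a coarse task label plus context length; the model identifiers are illustrative placeholders rather than confirmed API names, and a production version would also weigh latency budgets and cost caps.

```python
from dataclasses import dataclass

@dataclass
class Task:
    kind: str              # e.g. "coding", "long_context", "agentic", "general"
    prompt: str
    context_tokens: int

# Illustrative model identifiers; substitute whatever IDs your providers actually expose.
ROUTES = {
    "coding": "claude-opus-4.7",
    "long_context": "gemini-3.1-pro",
    "agentic": "gpt-5.5",
    "general": "gpt-5.5",
}

def pick_model(task: Task) -> str:
    # Very long inputs override the task-type route, because context window
    # becomes the binding constraint before capability does.
    if task.context_tokens > 900_000:
        return ROUTES["long_context"]
    return ROUTES.get(task.kind, ROUTES["general"])

print(pick_model(Task(kind="coding", prompt="refactor the billing module", context_tokens=40_000)))
```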
What this actually means
GPT-5.5 is not a new paradigm. It is a real jump: a fully retrained base model, state-of-the-art agentic benchmarks, a meaningful win on FrontierMath, and a pricing step up that asks enterprise customers to validate the value before upgrading by default.
The practical implications are worth being specific about.
For individual ChatGPT users, the upgrade is free and slightly better. GPT-5.5 Thinking handles harder problems faster, hallucinates less, and writes cleaner prose. If you use ChatGPT for daily work, you will notice the improvement, but you will not notice it as a new category of capability.
For agent and coding teams, the Codex efficiency gains and Terminal-Bench 2.0 score are the numbers that justify the price increase. Long-running agents become economically viable at a different tier than they were under GPT-5.4. Debugging cycles compress. Multi-week experimentation runs move overnight. If your production workload lives in Codex, the upgrade is probably worth the per-token increase before you even run the A/B test.
For research teams, GPT-5.5 Pro is the first generally-available model where the math and science benchmarks cross into territory where expert review is necessary rather than routine. The 39.6% on FrontierMath Tier 4 is the signal to watch.
For enterprise architects, the release raises a routing question. A single-model default is no longer optimal. The question is not "GPT-5.5 or Claude Opus 4.7" but "where in the stack does each model belong." Teams that answer that question in 2026 will have better unit economics than teams that pick one model and move on.
The release cadence is the other signal. If Pachocki is right that the last two years have been slow by the standards of what is coming, the next twelve months will produce more capability change than the last twenty-four. Procurement decisions made now on any single model as a long-term commitment will probably need to be revisited before the contract renews. The right architecture is one that can swap models without rewriting the application, which is a different engineering problem than picking the best model for a given benchmark.
GPT-5.5 is the best general-purpose model the public can buy today. Whether it is the best model for your specific workload is a question that pays for its own answer. Run the A/B test. The token cost of running both models on your actual traffic for a week is small compared to the cost of being wrong about which one belongs in production for the next quarter.
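One way to run that test without committing to either model is a thin harness that mirrors a small slice of real traffic to both and compares cost per completed task. The run_task function below is a placeholder for however your pipeline already calls the API, and "success" stands in for whatever acceptance check the workload uses; this is a sketch of the bookkeeping, not a benchmark methodology.

```python
import random

def run_task(model: str, task: str) -> dict:
    """Placeholder: run one task on the named model and report the outcome.

    Expected to return a dict with at least "success" (bool) and "total_tokens" (int).
    """
    raise NotImplementedError

def ab_test(tasks: list[str], sample_rate: float = 0.05) -> dict:
    results: dict[str, list[dict]] = {"gpt-5.4": [], "gpt-5.5": []}
    for task in tasks:
        if random.random() > sample_rate:   # mirror only a small slice of traffic
            continue
        for model in results:
            results[model].append(run_task(model, task))
    summary = {}
    for model, runs in results.items():
        completed = [r for r in runs if r["success"]]
        total_tokens = sum(r["total_tokens"] for r in runs)
        summary[model] = {
            "completion_rate": len(completed) / max(len(runs), 1),
            "tokens_per_completed_task": total_tokens / max(len(completed), 1),
        }
    return summary
```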