DeepSeek V4: 1.6T Params, 1M Context, Huawei Silicon
The V4 Preview Finally Ships
Fifteen months after R1 turned the AI industry upside down, DeepSeek is back. On April 24, 2026, the Hangzhou lab pushed preview weights for V4-Pro and V4-Flash to Hugging Face, turned on API access the same day, and made the new models the defaults at chat.deepseek.com in place of deepseek-chat and deepseek-reasoner. This is the first ground-up release from the company since R1, and it lands with the kind of quiet confidence that only comes from a team that has already read its own benchmarks.
The release pulls in three directions at once. It pushes hard on open-weight frontier capability with a 1.6-trillion-parameter flagship. It rewrites the attention stack to make million-token context actually affordable. And it is the first major DeepSeek model the company has positioned openly as running on Huawei Ascend silicon instead of Nvidia. Any one of those threads would be notable. Together, they put the closed-source incumbents on notice.
This piece walks through what was announced, what changed under the hood, how V4 stacks up against GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro across the benchmarks that matter, and why the hardware subplot may end up being a bigger story than the benchmarks themselves.
Two Models, One Architecture
The V4 family ships as two tiers, and they are not simply different-sized cuts of the same model. They are tuned for different workloads.
DeepSeek-V4-Pro is the flagship. It is a Mixture-of-Experts model with 1.6 trillion total parameters and 49 billion active per token, pre-trained on roughly 33 trillion tokens. This is the largest model DeepSeek has ever shipped, and it targets the same deployment slot as GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro. DeepSeek surfaces it as Expert Mode inside chat.deepseek.com and as deepseek-v4-pro through the API.
DeepSeek-V4-Flash is the efficiency tier. It carries 284 billion total parameters with 13 billion active per token, trained on around 32 trillion tokens. The active count is in Haiku and GPT-5.5-mini territory, which means inference is fast and cheap, but the model routes through a far deeper pool of specialized experts than a dense 13B model can offer. It shows up as Instant Mode in the chat interface and deepseek-v4-flash on the API.
Both models share the same feature set: a 1 million token context window, up to 384K tokens of output, thinking and non-thinking modes, JSON output, tool calls, FIM completion in non-thinking mode, and Chat Prefix Completion in beta. Both support the OpenAI ChatCompletions and Anthropic messages APIs, so existing code written against either provider can be pointed at DeepSeek with a base URL swap and a model ID change.
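In practice the swap looks like this. The sketch below uses the OpenAI Python SDK pointed at DeepSeek's API endpoint; the model IDs come from the release, the base URL is the one DeepSeek has used for prior models, and everything else is whatever the calling code already does.

```python
# Minimal sketch: existing OpenAI-SDK code pointed at the DeepSeek API.
# Only the base URL, API key, and model ID change versus an OpenAI call.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepseek.com",
    api_key="YOUR_DEEPSEEK_API_KEY",
)

response = client.chat.completions.create(
    model="deepseek-v4-pro",  # or "deepseek-v4-flash" for the efficiency tier
    messages=[
        {"role": "system", "content": "You are a concise coding assistant."},
        {"role": "user", "content": "Explain what a KV cache does in two sentences."},
    ],
)
print(response.choices[0].message.content)
```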
One operational note worth flagging immediately: the older deepseek-chat and deepseek-reasoner IDs are being fully retired on July 24, 2026. Until then, they route silently to V4-Flash in non-thinking and thinking mode respectively. Any production pipeline still calling the old IDs has a hard deprecation deadline, and the sooner the migration happens, the less painful it will be.
The Architecture That Makes V4 Interesting
The headline numbers sell the release, but the attention stack is where V4 actually earns its keep. DeepSeek has been telegraphing the direction for a while. V3.2-Exp introduced DeepSeek Sparse Attention last September as an experiment. V3.2 shipped it in production in December. V4 takes that foundation and rebuilds the whole top of the stack around it.
Hybrid Attention: CSA and HCA
The core move is a Hybrid Attention architecture that interleaves two new attention types through the network. DeepSeek calls them Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA), and they attack the quadratic cost of standard attention from two complementary angles.
CSA first compresses the KV cache along the sequence dimension at a ratio of roughly 4 to 1, then applies DeepSeek Sparse Attention on top. A lightning indexer scores compressed KV entries against each query and picks the top-k most relevant entries. V4-Pro selects the top 1,024 compressed tokens per query. V4-Flash selects the top 512. A short sliding window of 128 uncompressed tokens sits on the side so the model never loses local context near the query position.
HCA goes harder. It compresses at a ratio of 128 to 1, consolidating 128 tokens into a single compressed representation, then runs dense attention over that much smaller set. The result is a cheap global view of distant context that every layer gets to consult.
CSA and HCA layers alternate through the network. V4-Flash opens with pure sliding-window attention in its first two layers. V4-Pro opens with HCA. The interleaving is the point: every query gets a precise sparse view of nearby context plus a coarse dense view of everything else, at a fraction of the FLOPs of full attention.
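To make the CSA path concrete, here is a deliberately simplified sketch of the selection logic described above, for a single query and a single head. The mean-pooling compressor, the dot-product indexer, and the absence of causal masking are all stand-ins; DeepSeek has not published the real compressor or indexer designs, so treat this as an illustration of the shape of the computation rather than the actual implementation.

```python
# Simplified single-query, single-head sketch of the CSA path: compress KV 4:1,
# score compressed entries with a cheap indexer, attend over the top-k plus a
# short uncompressed sliding window. Compressor and indexer are stand-ins.
import torch

def csa_attend(q, k, v, compress_ratio=4, top_k=1024, local_window=128):
    # q: (1, d) current query; k, v: (T, d) uncompressed keys/values.
    T, d = k.shape
    # 1) Compress the KV cache along the sequence dimension (mean pooling here).
    Tc = T // compress_ratio
    kc = k[: Tc * compress_ratio].reshape(Tc, compress_ratio, d).mean(dim=1)
    vc = v[: Tc * compress_ratio].reshape(Tc, compress_ratio, d).mean(dim=1)
    # 2) "Lightning indexer" stand-in: cheap relevance score per compressed entry.
    scores = (kc @ q.squeeze(0)) / d ** 0.5
    top = scores.topk(min(top_k, Tc)).indices
    # 3) Sparse attention over only the selected compressed entries...
    k_sel, v_sel = kc[top], vc[top]
    # 4) ...plus a short sliding window of uncompressed tokens near the query.
    k_loc, v_loc = k[-local_window:], v[-local_window:]
    k_all, v_all = torch.cat([k_sel, k_loc]), torch.cat([v_sel, v_loc])
    attn = torch.softmax((q @ k_all.T) / d ** 0.5, dim=-1)
    return attn @ v_all
```

HCA follows the same pattern with the compression pushed to 128 to 1 and dense attention over the much smaller compressed set, which is why it is cheap enough to run as a global view at every HCA layer.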
The efficiency numbers are the part that actually matters. At the 1M-token context setting, DeepSeek-V4-Pro needs about 27% of the single-token inference FLOPs and 10% of the KV cache compared to V3.2. That is not a tweak. That is a reshaping of what million-token context costs to serve, and it is the reason DeepSeek was able to make 1M context the default across the entire product line rather than a premium tier.
Manifold-Constrained Hyper-Connections
The second architectural change is mHC, a replacement for conventional residual connections. Residual connections are the plumbing that lets gradients flow cleanly through deep networks, and at trillion-parameter scale they can become unstable. DeepSeek's report describes mHC as a reinforcement of residual pathways that keeps signal propagation stable without giving up expressivity. It is the kind of change that does not move benchmarks dramatically on its own, but makes the rest of the training run possible.
The Muon Optimizer
The third change is the optimizer. V4 ships with Muon, replacing AdamW for most parameters. DeepSeek reports faster convergence and better training stability at the trillion-parameter scale. Muon has been getting attention across the research community through 2025, and V4 is the largest production deployment of it to date. For anyone training frontier models at scale, this is the validation data point the community was waiting for.
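DeepSeek's report does not spell out its exact Muon variant, so the sketch below shows the publicly known form of the update for a single 2D weight matrix: momentum-accumulated gradient, orthogonalized by a few Newton-Schulz iterations, then applied with a shape-aware scale. The coefficients follow the open-source reference implementation; this is an illustration of the idea, not DeepSeek's training code.

```python
# Minimal sketch of the Muon update for one 2D weight matrix, per the public
# reference implementation; not DeepSeek's actual optimizer code.
import torch

def newton_schulz_orthogonalize(g, steps=5, eps=1e-7):
    # Quintic Newton-Schulz iteration: pushes the matrix toward the nearest
    # (semi-)orthogonal matrix without an explicit SVD.
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (g.norm() + eps)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x.T if transposed else x

def muon_step(param, grad, momentum_buf, lr=0.02, momentum=0.95):
    # Heavy-ball momentum, then orthogonalize the accumulated update.
    momentum_buf.mul_(momentum).add_(grad)
    update = newton_schulz_orthogonalize(momentum_buf)
    # Shape-aware scale keeps the effective step size roughly constant
    # across tall and wide matrices.
    scale = max(1.0, param.shape[0] / param.shape[1]) ** 0.5
    param.data.add_(update, alpha=-lr * scale)
```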
The rest of the V4 architecture inherits from V3: fine-grained routed and shared experts in the DeepSeekMoE layout, FP8 precision for most weights, Multi-Token Prediction, and the 128K-token tokenizer. The continuity matters. DeepSeek is not throwing out the V3 stack. It is surgically replacing the attention block and the optimizer while keeping everything else that already worked.
How V4 Stacks Up Against the Frontier
Benchmark numbers are messy and harnesses differ, but the overall shape of the competitive landscape is clear enough to call.
Coding and Agentic Workloads
This is where V4-Pro makes its strongest case. On LiveCodeBench Pass@1, V4-Pro-Max posts 93.5, ahead of Gemini 3.1 Pro at 91.7 and Claude Opus 4.6 Max at 88.8. On Codeforces rating, V4-Pro hits 3206, above GPT-5.4 xHigh at 3168 and Gemini 3.1 Pro at 3052. On Apex Shortlist Pass@1, V4-Pro takes the top score at 90.2, above Gemini's 89.1 and Opus 4.6's 85.9.
On the agentic side, V4-Pro scores 80.6 on SWE-Bench Verified, 55.4 on SWE-Bench Pro, 76.2 on SWE Multilingual, 67.9 on Terminal-Bench 2.0, 73.6 on MCPAtlas, and 51.8 on Toolathlon. The MCPAtlas and Toolathlon numbers are the ones worth looking at twice, because they test generalization across a wide set of external tools rather than performance on a fixed harness. V4-Pro is essentially tied with Opus 4.6 on MCPAtlas.
The caveat is the top of the agentic pile. GPT-5.5 still leads Terminal-Bench 2.0 by a real margin, and Opus 4.7 still leads SWE-Bench Pro. If your workload is the hardest multi-step, long-horizon agent execution, the closed frontier is still ahead by a few points. For the more typical 5-to-10-step agent loop with structured tool use, V4-Pro is genuinely competitive.
Reasoning and Math
V4-Pro is strong here. HMMT 2026 February Pass@1 comes in at 95.2, just behind GPT-5.4 at 97.7 and Opus 4.6 at 96.2. IMOAnswerBench at 89.8 is ahead of Opus 4.6's 75.3 but behind GPT-5.4 xHigh's 91.4. GPQA Diamond sits at 90.1, in the same neighborhood as the frontier closed models.
The gap that still matters is HLE. V4-Pro posts 37.7 against Opus 4.6's 40.0 and Gemini 3.1 Pro's 44.4. HLE measures expert-level cross-domain reasoning, and DeepSeek is honest about trailing on it. For users running research-adjacent workflows where the hardest synthesis questions matter, the closed frontier is still ahead.
World Knowledge
This is the weakest area for V4-Pro and DeepSeek does not pretend otherwise. SimpleQA-Verified Pass@1 lands at 57.9, compared to Gemini 3.1 Pro at 75.6. MMLU-Pro comes in at 87.5 against Gemini's 91.0. On plain factual recall and breadth, Gemini's training data and grounding pipelines still deliver a meaningful lead.
The read here is straightforward. V4-Pro is a coding and reasoning model with strong math and respectable knowledge coverage. It is not a research-oracle model. If your use case is "answer obscure factual questions correctly most of the time," Gemini is still the first call.
V4-Flash: Better Than It Has Any Right to Be
The Flash tier is the surprise of the release. On MMLU-Pro, Flash scores 86.2 against Pro's 87.5. On LiveCodeBench, 91.6 against 93.5. On SWE-Bench Verified, 79.0 against 80.6. Most of the gap is in the 1-to-3-point range on standard benchmarks.
Where Flash genuinely drops off is Terminal-Bench 2.0 (56.9 vs 67.9) and SimpleQA-Verified (34.1 vs 57.9). That pattern is consistent with what DeepSeek says in its own documentation: Flash is on par with Pro for simple agent tasks, but long-horizon tool use and deep factual recall are where the Pro model earns its cost. For most ordinary coding, code review, summarization, and routine agent workflows, Flash is a fully serious production model, not a discount fallback.
The Pricing Story That Defines the Release
Benchmarks get the headlines, but pricing is where V4 lands its hardest punch. The rate card DeepSeek published alongside the release puts the full frontier-class tier within reach at a cost structure the closed labs cannot match without rewriting their unit economics.
V4-Flash runs at $0.14 per million input tokens on a cache miss and $0.28 per million output tokens. Cache-hit input pricing is roughly $0.028 per million, a discount of about 80%. V4-Pro runs at $1.74 per million input tokens on a cache miss, $0.145 per million on a cache hit, and $3.48 per million output tokens. The cache-hit discount on Pro is closer to 92% on repeated prefixes.
Context caching is automatic on DeepSeek. Any request whose prefix is at least 1,024 tokens long and matches byte-for-byte against a prior request from the same account picks up the cache-hit rate with no code changes. For anyone running an assistant with a large stable system prompt, the real cost per call on Pro can sit close to the input cache-hit floor.
Put those numbers next to the closed-source frontier. Claude Opus 4.6 charges $5 per million input tokens and $25 per million output. GPT-5.4 runs $2.50 input and $15 output. At uncached rates, V4-Pro is around 1.4x cheaper than GPT-5.4 on input and roughly 4.3x cheaper on output. Against Opus 4.6, the gaps are closer to 2.9x on input and 7x on output. For coding agents that produce a lot of output tokens, which is the archetypal production workload, the math does not even require a spreadsheet.
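To put the rate cards side by side, here is a back-of-the-envelope calculation at uncached rates for an illustrative call with 20K input tokens and 4K output tokens. The call shape is made up; the per-million rates are the ones quoted above.

```python
# Back-of-the-envelope per-call cost at the uncached rates quoted above.
# The 20K-input / 4K-output call shape is illustrative, not a benchmark.
RATES = {                      # (input $/M uncached, output $/M)
    "deepseek-v4-pro": (1.74, 3.48),
    "gpt-5.4":         (2.50, 15.00),
    "claude-opus-4.6": (5.00, 25.00),
}

input_tokens, output_tokens = 20_000, 4_000

for model, (in_rate, out_rate) in RATES.items():
    cost = input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate
    print(f"{model:17s} ${cost:.4f} per call")

# deepseek-v4-pro   $0.0487 per call
# gpt-5.4           $0.1100 per call
# claude-opus-4.6   $0.2000 per call
```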
Two caveats belong here. First, GDPval-AA, an Elo-based measure of the economic value of knowledge work, still puts GPT-5.4 and Opus 4.6 ahead of V4-Pro. The closed models may cost more per token but produce more valuable output per dollar on certain enterprise workflows, and a clean price-per-token comparison misses that. Second, data sovereignty is a real constraint for some buyers. Routing prompts to a Chinese-hosted API is a non-starter in certain regulated industries, regardless of how the benchmarks look. For those buyers, the open weights under the MIT license are the actual product, and the hosted API pricing is irrelevant.
The Huawei Subplot and the Nvidia Problem
The most strategically loaded detail in the V4 release is not in the tech report. It is the fact that Huawei announced, on the same day, that its full Ascend supernode product line supports V4 series inference. DeepSeek did not specify which chips were used for training. But the company's previous release, V3, was trained on Nvidia hardware, and the framing around V4 has been deliberately different. Huawei is out in front as collaborator and deployment partner. Nvidia is absent.
For context, V3 landed DeepSeek in an awkward spot when US officials accused the company of sourcing Nvidia chips in violation of export controls. V4 leans the other direction. Running it end-to-end on Chinese silicon would be the single most significant validation to date of the thesis that frontier AI can be trained and served outside the Nvidia-CUDA ecosystem.
The market noticed. Nvidia stock gave back ground on the news, down about 1.4% on Thursday's session before partially recovering in Friday premarket. More telling was the reaction across Chinese AI equities. Zhipu AI and MiniMax, two of DeepSeek's closest domestic rivals, fell 9% and 7% respectively. Cambricon Technologies, a Chinese AI chipmaker, announced compatibility with V4 within hours of the release. Huatai Securities analysts flagged the explicit compatibility call-out in the V4 announcement as a signal that domestic chip capability is now production-ready, not just experimental.
Jensen Huang has been the most visible voice from Nvidia on what this shift means. In interviews leading up to the V4 launch, he argued that if Chinese developers consolidate on Huawei's Ascend platform, Nvidia's long-term footprint in the world's second-largest AI market could shrink even if export controls are relaxed. The H200 chips Nvidia has been preparing for Chinese buyers have yet to actually ship, and the terms of sale remain unresolved between Washington and Beijing. V4 makes the question of whether those shipments even happen feel a little less urgent than it did six months ago.
The broader point is that V4 is not just a model. It is the first highly visible demonstration that China's AI stack, from training silicon to inference deployment to frontier model weights, can now operate end-to-end without Nvidia. That is a structural change in the industry's geography, and it is going to ripple through infrastructure investment decisions for years.
Agent-First Design: Built for Claude Code, OpenClaw, OpenCode
One of the more quietly significant choices in the V4 release is how aggressively DeepSeek has built for the agent ecosystem rather than just the chat interface. The announcement explicitly calls out integration with Claude Code, OpenClaw, OpenCode, and CodeBuddy as first-class deployment targets. V4 is already powering DeepSeek's own internal agentic coding, and the company published a sample PDF generated end-to-end by V4-Pro inside an agent loop as part of the launch materials.
The OpenClaw integration is particularly interesting for anyone already working inside that ecosystem. OpenClaw has been building out an open-source agent framework that runs across models, and having V4-Pro as a first-class backend means agent workflows can now be pointed at an open-weight model that is genuinely competitive with Claude Opus on coding tasks, at a fraction of the token cost. For teams running high-volume coding agents where output tokens are the dominant cost driver, the economics shift meaningfully.
One sharp implementation detail surfaces in the API layer. When the DeepSeek API detects a request coming from Claude Code or OpenCode, thinking effort is automatically upgraded to max. The reasoning_effort parameter accepts high and max, and lower values like low or medium get silently mapped to the nearest supported level. This is an opinionated choice. DeepSeek is betting that agent workloads benefit from maximum reasoning depth by default, and it has baked that assumption into the API rather than asking callers to opt in.
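Through the OpenAI-compatible surface, setting the effort explicitly looks roughly like the snippet below. Passing the field via extra_body sidesteps client-side validation of the non-standard "max" value; whether the field sits at the top level of the request is an assumption based on how the parameter is described in the announcement.

```python
# Requesting maximum reasoning depth explicitly; values below "high" are
# mapped up to the nearest supported level by the server.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_DEEPSEEK_API_KEY")

response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "Refactor this function to be iterative."}],
    extra_body={"reasoning_effort": "max"},  # "high" is the other supported level
)
```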
The dual API compatibility, supporting both OpenAI ChatCompletions and the Anthropic messages format, is the other agent-friendly move. Any tooling already written for Claude or GPT can point at DeepSeek with minimal refactoring. That removes the single biggest operational barrier to trying the model in production: nothing needs to be rewritten.
The 1M Context Window Is Now the Baseline
One of the less flashy but most important consequences of the V4 architecture is that million-token context has stopped being a premium feature. Both V4-Pro and V4-Flash ship with 1M input context and 384K output tokens as the default. There is no separate long-context model, no pricing tier jump, no context-length extension that kicks in at high cost.
The knock-on effect is that RAG pipelines become optional for a wider set of workloads. Feeding an entire moderate-size codebase, a multi-year document archive, or an extended conversation history directly into the context is now cheap enough to treat as a default design choice rather than an expensive edge case. Retrieval layers still matter for very large corpora and for precision recall, but the threshold at which building a retrieval system pays off has moved. Many applications that would have been RAG-first in 2024 can be context-first in 2026.
The long-context ceiling is not infinite, and DeepSeek does not claim it is. On MRCR 1M, which tests retrieval of key information buried in a million tokens of context, V4-Pro scores 83.5. Gemini 3.1 Pro scores 76.3. Opus 4.6 leads at 92.9. On CorpusQA 1M, which is harder because it demands precise answers synthesized from a long document, V4-Pro hits 62.0 against Opus 4.6's 71.7. The ordering is consistent. Opus is still the long-context recall leader, and the gap above 128K tokens is real.
But for the typical use case, where context length is in the 50K to 400K range and the task is more about having material in the room than pinpoint retrieval, V4 closes enough of the gap to be a credible default. Combined with the cost advantage, the long-context story becomes one of the release's strongest selling points.
Where V4 Still Trails
Any honest read of the V4 launch has to flag where the model is not the leader, because that is where the deployment decisions actually get made.
Hardest agentic workloads. GPT-5.5's Terminal-Bench 2.0 at 82.7% and Opus 4.7's SWE-Bench Pro at 64.3% set a ceiling that V4-Pro does not reach. If the job is the longest, most complex multi-step agent execution, the closed frontier still has the edge.
Multimodal tasks. V4 is text-only in the preview. DeepSeek says the company is working on multimodal capabilities, but they are not here yet. Anything involving image or audio input needs a different model.
Frontier knowledge. The gap to Gemini 3.1 Pro on MMLU-Pro, SimpleQA-Verified, GPQA Diamond, and HLE is narrower than it was a year ago but it has not closed. If the workflow depends on accurate factual retrieval at the edge of human knowledge, Gemini is still the pick.
Long-context pinpoint retrieval. Above 128K tokens, retrieval accuracy on hard benchmarks drops into the 60s on some measures. Opus 4.6 holds the crown for near-perfect recall across a million tokens. V4 is good enough for the median case but not the best in class at the extreme.
Enterprise data sovereignty. The hosted API runs from Chinese infrastructure. Multiple US states, Australia, Taiwan, South Korea, Denmark, and Italy introduced restrictions on DeepSeek-R1 citing privacy and national security concerns, and some of those are still in effect. For regulated buyers, the only deployment option is self-hosting the MIT-licensed weights.
These are not dealbreakers. They are scope boundaries. V4-Pro is the strongest open-weight model ever shipped, and for coding, reasoning, and math it is genuinely frontier-competitive. But the frontier is not one model anymore. It is a set of specialized leaders across benchmark categories, and V4 claims some of those top slots while leaving others to the closed labs.
What the V4 Release Signals for 2026
Stepping back from the spec sheet, V4 is more interesting for what it says about where the industry is going than for the absolute numbers.
First, the gap between open-weight and closed-frontier models has gotten smaller than most observers expected twelve months ago. The Stanford AI Index 2026 already flagged that Chinese AI labs have effectively closed the performance gap with their US counterparts. V4 is the strongest single data point in support of that thesis. A Chinese open-weight model is now beating closed-source US models on multiple coding benchmarks, at prices that rewrite the cost curve.
Second, attention architecture is now the binding constraint on frontier progress, not raw parameter count. V4's big move is not "more parameters." It is "better attention." The same compute budget applied to a smarter attention stack produces better results than throwing more weights at the old architecture. Expect every major lab to ship its own answer to hybrid sparse attention over the next twelve months. The V4 tech report has reset the bar for what efficient long context looks like.
Third, the hardware independence story is real. The narrative that Chinese AI development is permanently bottlenecked by Nvidia chip access took a serious hit with V4. If the model was trained at least in part on Ascend silicon, and if it runs efficiently on Ascend for inference, then the entire premise of US chip export controls as a strategic constraint starts to look softer. The chips are not the moat everyone thought they were.
Fourth, pricing is going to be the brutal competitive axis of the next cycle. Closed labs cannot match DeepSeek's rate card at current unit economics. Either they find structural efficiency wins of their own, or they cede the price-sensitive half of the market. The "charge whatever because we're the only option" pricing posture that defined 2023 and 2024 is over.
Fifth, the agent stack matters more than the chat interface. DeepSeek did not ship V4 with a new chat experience. It shipped V4 with first-class integrations into Claude Code, OpenClaw, and OpenCode, with an API that silently opts agents into maximum reasoning effort. The center of gravity in AI deployment has moved from chatbot interaction to agent execution, and V4 is built for that world.
How to Actually Use V4 Today
For anyone already working with the Anthropic or OpenAI APIs, moving V4 into the stack is as close to a drop-in as it gets.
The base URL is the DeepSeek API endpoint. The model ID is deepseek-v4-pro or deepseek-v4-flash. Thinking depth is controlled through the reasoning_effort parameter; the chat interface exposes the same two models as Expert Mode (Pro) and Instant Mode (Flash). OpenAI-style calls work out of the box. Anthropic-style messages work out of the box. Tool calls are supported in both dialects.
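For tooling already written against Claude, the Anthropic-dialect version of the same swap looks like the sketch below. The /anthropic path on the base URL is an assumption about where DeepSeek exposes its Anthropic-compatible surface; the rest is the standard messages API.

```python
# The same base-URL swap through the Anthropic messages dialect. The /anthropic
# path is an assumption about where the compatible endpoint is exposed.
import anthropic

client = anthropic.Anthropic(
    base_url="https://api.deepseek.com/anthropic",
    api_key="YOUR_DEEPSEEK_API_KEY",
)

message = client.messages.create(
    model="deepseek-v4-flash",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize this diff in one paragraph."}],
)
print(message.content[0].text)
```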
For a coding agent workload with a large stable system prompt, the right play is to lean into context caching. Prefix the conversation with the stable system prompt and code context, keep that prefix byte-identical across calls, and the 80-to-92% cache-hit discount does most of the cost work automatically. No explicit opt-in is required. A 10,000-token system prompt at cache-hit rates on V4-Pro costs roughly the same per call as a 1,000-token prompt at uncached rates.
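Structurally, that just means the stable material comes first and stays byte-identical across calls, as in this sketch; the system prompt and repository context shown are placeholders.

```python
# Cache-friendly structure: the expensive prefix stays byte-identical across
# calls so it hits the cached-input rate; only the final user turn changes.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_DEEPSEEK_API_KEY")

SYSTEM_PROMPT = "You are a code-review assistant for the payments service."  # placeholder
REPO_CONTEXT = "<pinned snapshot of the relevant source files goes here>"     # placeholder

STABLE_PREFIX = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": REPO_CONTEXT},
]

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="deepseek-v4-pro",
        messages=STABLE_PREFIX + [{"role": "user", "content": question}],
    )
    return response.choices[0].message.content
```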
For a high-volume low-latency use case like autocomplete or chat assistants, V4-Flash is the natural target. The active parameter count puts inter-token latency in the sub-15-millisecond territory that real-time applications need, and the cost at scale is small enough that per-query cost stops being the primary engineering conversation.
For enterprise work where prompt data cannot leave controlled infrastructure, the open weights are on Hugging Face under an MIT license. Self-hosting V4-Pro is non-trivial at 1.6 trillion parameters, and it is genuinely expensive to stand up. V4-Flash at 284B is more tractable for mid-sized deployments. Either way, the licensing is permissive enough to run in production without negotiation.
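For the self-hosted path, the Flash tier is the realistic starting point, and something like the vLLM sketch below is the shape of the work. The Hugging Face repo ID is a placeholder and V4 architecture support in vLLM is assumed rather than confirmed; parallelism and quantization choices depend entirely on the hardware available.

```python
# Hypothetical self-hosting sketch for the Flash tier with vLLM. The repo ID is
# a placeholder and V4 support in vLLM is assumed, not confirmed; 284B total
# parameters still wants a multi-GPU node even with 13B active per token.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V4-Flash",  # placeholder Hugging Face repo ID
    tensor_parallel_size=8,
    trust_remote_code=True,
)

outputs = llm.generate(
    ["Write a unit test for a binary search function."],
    SamplingParams(max_tokens=512, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```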
For teams already invested in Claude Code or OpenClaw, the integration is explicit. V4 is a supported backend out of the gate. For anyone running an open-source agent framework that is not yet wired up, the dual OpenAI/Anthropic API compatibility means the integration work is usually a half day rather than a sprint.
The one operational call to make now is the deepseek-chat and deepseek-reasoner deprecation. Those IDs go away on July 24, 2026. Any production pipeline still calling them has three months to migrate.
The Bottom Line
V4 does not repeat the R1 moment. It does not send Nvidia's stock off a cliff or trigger a fresh round of panic about US AI competitiveness. The release is more measured than that, and the market reaction has been proportional. But taking V4 on its own terms, this is the most significant open-weight model release since R1, and arguably the most structurally important open-weight release ever.
The architecture is a genuine step forward, not a refinement. The pricing resets the cost floor for frontier AI. The Huawei integration marks the moment Chinese AI decoupled from Nvidia in a way that was theoretical a year ago and is operational now. And the agent-first deployment posture reflects where the industry's center of gravity has actually moved.
V4-Pro is not going to replace GPT-5.4 or Claude Opus 4.6 in every workload. It was not designed to. What it does is force a real comparison on every coding, reasoning, and math workload at a seventh of the cost, with open weights, on hardware that does not need US export approval. For a lot of production teams, that comparison is going to come out in V4's favor. And the next V4 release, whenever it lands, will have all the same leverage plus whatever gap the team decides to close next.