DeepSeek's V4 Pro Closes the Coding Gap With Claude

DeepSeek's open-weight V4 series has reset the price-performance curve for frontier-grade coding models, putting renewed pressure on the U.S. labs that dominated headlines through most of 2025.

Image: Rows of server racks inside a data center (illustrative). Source: BalticServers data center by BalticServers.com via Wikimedia Commons, CC BY-SA 3.0.

What Happened

On April 24, 2026, the Hangzhou-based lab DeepSeek released preview weights and an updated API for its V4 series, anchored by two variants. V4-Pro is a 1.6 trillion parameter Mixture-of-Experts flagship that activates roughly 49 billion parameters per token, while V4-Flash is a leaner 284 billion parameter model with 13 billion active parameters per token. Both ship with a one-million-token context window enabled by default, a step up from the company's V3.2 release earlier in the year.
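The parameter counts above imply that only a small share of each model's weights takes part in any single token. The short check below works that arithmetic out from the figures reported here; the dictionary layout and function name are illustrative only.

```python
# Parameter counts as reported in the article, in billions.
MODELS = {
    "V4-Pro": {"total_b": 1600.0, "active_b": 49.0},
    "V4-Flash": {"total_b": 284.0, "active_b": 13.0},
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of parameters activated per token in a Mixture-of-Experts model."""
    if total_b <= 0:
        raise ValueError("total parameter count must be positive")
    return active_b / total_b


for name, params in MODELS.items():
    frac = active_fraction(params["total_b"], params["active_b"])
    print(f"{name}: {frac:.1%} of parameters active per token")
```

On these numbers, V4-Pro activates about 3.1 percent of its weights per token and V4-Flash about 4.6 percent.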

The architectural headline is what DeepSeek calls Hybrid Attention. The mechanism stitches together two specialized components: Compressed Sparse Attention, which folds the key-value cache for each block of m tokens into a single compressed entry, and Heavily Compressed Attention, which is reserved for very long-range retrieval. Each query token then routes to a top-k selection of compressed entries through DeepSeek Sparse Attention. The result, according to the model card, is that V4-Pro inference at one-million-token context length consumes only 27 percent of the single-token FLOPs and 10 percent of the KV cache memory of V3.2 at the same context size.
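To make the compress-then-route idea concrete, here is a minimal single-query sketch in plain Python. It is not DeepSeek's implementation: mean pooling stands in for the learned compressor, which the article does not describe, the function names are invented for illustration, and the sketch does not distinguish between the two attention components named above.

```python
import math


def mean_pool_blocks(vectors, m):
    """Compress each run of m vectors into one entry by averaging.

    ASSUMPTION: averaging is a stand-in for the learned compressor.
    """
    blocks = []
    for start in range(0, len(vectors), m):
        block = vectors[start:start + m]
        dim = len(block[0])
        blocks.append([sum(v[i] for v in block) / len(block) for i in range(dim)])
    return blocks


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sparse_compressed_attention(query, keys, values, m, top_k):
    """Attend from one query to the top_k highest-scoring compressed entries."""
    c_keys = mean_pool_blocks(keys, m)
    c_values = mean_pool_blocks(values, m)
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, k) * scale for k in c_keys]

    # Keep only the top_k compressed entries for this query.
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]

    # Softmax over the selected scores only (shifted for numerical stability).
    peak = max(scores[i] for i in chosen)
    weights = [math.exp(scores[i] - peak) for i in chosen]
    total = sum(weights)
    dim = len(c_values[0])
    output = [
        sum(w * c_values[i][d] for w, i in zip(weights, chosen)) / total
        for d in range(dim)
    ]
    return output, sorted(chosen)


# Toy input: eight tokens, two blocks of m = 4, query aligned with block 1.
keys = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
values = [[1.0, 0.0]] * 4 + [[0.0, 2.0]] * 4
output, chosen = sparse_compressed_attention([0.0, 1.0], keys, values, m=4, top_k=1)
print(chosen, output)
```

In this toy setup the cache holds two compressed entries instead of eight per-token entries, and the query reads only one of them. Both savings come from the same two knobs, the block size m and the selection width top_k.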

The benchmark numbers put V4-Pro within reach of the leading proprietary model on coding. V4-Pro scores 80.6 percent on SWE-bench Verified, 0.2 percentage points behind Anthropic's Claude Opus 4.6, and reaches 87.5 percent on SWE-Bench Pro. It edges Claude on Terminal-Bench 2.0 at 67.9 percent versus 65.4 percent, takes a wider lead on LiveCodeBench at 93.5 percent versus 88.8 percent, and records a Codeforces rating of 3206. On the API side, V4-Pro is priced at 1.74 dollars per million input tokens and 3.48 dollars per million output tokens, roughly one-seventh of Claude Opus 4.6's listed output price of 25 dollars per million tokens.
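The size of the output-price gap is easier to judge against a concrete bill. The sketch below uses the two listed output prices cited in this article, 3.48 and 25 dollars per million tokens. The model keys are hypothetical labels rather than real API identifiers, and the monthly volume is an invented workload. Input-token costs are left out because the article gives no input price for Claude.

```python
# Listed output prices in US dollars per million tokens, as cited in the article.
# The dictionary keys are hypothetical labels, not official API model names.
OUTPUT_PRICE_PER_M = {
    "deepseek-v4-pro": 3.48,
    "claude-opus-4.6": 25.00,
}


def output_cost(model: str, output_tokens: int) -> float:
    """Dollar cost of generating output_tokens at the listed output rate."""
    return OUTPUT_PRICE_PER_M[model] * output_tokens / 1_000_000


# HYPOTHETICAL workload: 500 million output tokens per month.
monthly_tokens = 500_000_000
for model in OUTPUT_PRICE_PER_M:
    print(f"{model}: ${output_cost(model, monthly_tokens):,.2f}")

ratio = OUTPUT_PRICE_PER_M["claude-opus-4.6"] / OUTPUT_PRICE_PER_M["deepseek-v4-pro"]
print(f"output price ratio: {ratio:.2f}x")
```

At that assumed volume, the output bill is 1,740 dollars on V4-Pro and 12,500 dollars on Claude Opus 4.6, a ratio of about 7.18.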

Why It Matters

The economic gap is the part procurement teams will fixate on. A model that lands within striking distance of the leading proprietary system on coding benchmarks, while charging a fraction of the per-token output rate, changes the math for the workloads that consume the largest share of frontier inference budgets today. Software-engineering agents, IDE copilots, and autonomous code-review pipelines are token-heavy by design, and a roughly sevenfold reduction in output-token price rarely arrives alongside near-parity in capability.

The architectural story matters, too, even if it lands less viscerally than a price point. Long-context inference has been the expensive corner of the cost curve, because KV-cache memory grows linearly with sequence length under standard attention. Compressing that cache with a learned token-level compressor and then routing each query to only the top-k entries is not a new idea on its own, but DeepSeek is the first major lab to ship the combination at frontier scale with measurable production economics behind it. If the published efficiency numbers hold under independent testing, the V4 release is a reminder that brute-force scaling is being supplemented by structural efficiency gains, not replaced by them.
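The linear growth is simple to quantify. The sketch below computes the per-sequence cache size under standard attention, then shows how block compression shrinks it. The layer count, head count, head dimension, and block size are hypothetical values chosen for illustration and are not V4's actual configuration.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """KV-cache size for one sequence under standard attention.

    The leading 2 counts keys and values; the result is linear in seq_len.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


def compressed_kv_cache_bytes(seq_len, m, n_layers, n_kv_heads, head_dim,
                              bytes_per_value=2):
    """Cache size when every block of m tokens is stored as one entry."""
    entries = -(-seq_len // m)  # ceiling division
    return kv_cache_bytes(entries, n_layers, n_kv_heads, head_dim, bytes_per_value)


# HYPOTHETICAL model dimensions and block size, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM, BLOCK_M = 60, 8, 128, 16

for tokens in (8_000, 128_000, 1_000_000):
    full = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    small = compressed_kv_cache_bytes(tokens, BLOCK_M, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{tokens:>9,} tokens: {full:8.2f} GiB full, {small:7.2f} GiB compressed")
```

Under these assumed dimensions each token costs 245,760 bytes of cache, so doubling the context doubles the memory. Block compression divides the entry count by m but leaves the growth linear, which is why top-k routing is needed to bound the per-query compute as well.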

For the broader industry, V4 also reframes the open-weights debate. The model is released under DeepSeek's permissive license and is downloadable from Hugging Face, which means enterprise users can self-host on their own clusters. That option matters in jurisdictions where data residency and sovereignty rules complicate the use of U.S.-based API providers, and it gives cloud platforms that do not run their own frontier labs a credible third-party model to put behind their developer tools.

Reaction

Markets did not deliver a repeat of January 2025, when DeepSeek's R1 release helped erase nearly a trillion dollars of market value from U.S. AI-exposed equities in a single session. As StratNews Global noted, V4 arrived as an expected event rather than a shock, and Nvidia and the major cloud names traded broadly flat in the days after launch. The interpretation among sell-side analysts was that the U.S. side has had a year to adapt, with margin protection coming from custom silicon, retrieval infrastructure, and enterprise contracts rather than raw model lead alone.

Within the developer community, the early read on V4-Pro has been more enthusiastic. LMSYS Org, which maintains the SGLang inference framework, published day-zero support and reported that verified reinforcement-learning runs landed within hours of the release. On Hugging Face, the model has accumulated thousands of downloads, and several agent frameworks have already published configuration recipes that swap V4-Pro into the planner role formerly occupied by Claude or GPT-class models. The reaction inside Chinese tech media has, predictably, been celebratory, with state outlets framing the release as evidence that domestic labs can compete at the frontier despite export controls on advanced accelerators.

Competitor responses have been measured rather than urgent. Anthropic continued to roll out incremental updates to Claude Opus 4.6 and emphasized its enterprise security posture, while Google highlighted the cost profile of its Gemini 3.1 Flash-Lite line, which it positions for high-volume serving rather than head-to-head capability comparisons. OpenAI, fresh off its GPT-5.5 launch the prior week, declined to comment publicly on a specific competitor release, in line with its longstanding policy.

What's Next

The most consequential question is whether V4-Pro's published efficiency holds up in production. The 27 percent FLOPs and 10 percent KV-cache figures come from internal benchmarking, and independent reproductions over the next several weeks will determine how the architecture behaves under realistic batching, retrieval workloads, and concurrent agent traffic. Any meaningful gap between the headline numbers and observed serving costs will surface quickly through the open-source community, which now has direct access to the weights.

The second front is fine-tuning. DeepSeek has signaled that a Pro-Max variant is in training and may surface later in the second quarter, and the V4-Flash release is positioned as a base for vertical specializations rather than a finished consumer product. Expect a wave of community fine-tunes targeting coding agents, document-grounded enterprise search, and Asian-language workloads where the model's training mix is naturally strong. The economics question for application developers is no longer whether to use a Chinese-origin frontier model, but how to manage the dual-vendor, dual-region inference setups those models tend to imply.

Regulatory scrutiny is the wildcard. U.S. policymakers have signaled that future export-control updates will look more closely at inference-time compute access, not only at training-side accelerators, and at least one bipartisan Senate working group has flagged the V4 release as a case study. The current administration has not committed to specific measures, but enterprise legal teams considering self-hosted deployments are already factoring in the possibility of additional disclosure or licensing obligations later in the year.

Closing Thoughts

V4 is the clearest signal yet that the frontier model market is maturing into a multi-lab, multi-jurisdiction equilibrium rather than consolidating around a single provider. The performance gap between leading U.S. and Chinese labs has narrowed to within the noise of benchmark methodology on the workloads most enterprises actually run, and the price gap on long-context inference is now a structural advantage, not a temporary discount. None of that erases the strategic moats that the U.S. incumbents have built around enterprise distribution, safety testing infrastructure, and integrated tooling, but it does change the procurement conversation. For the next two quarters, the interesting question is no longer who has the best raw model, but who can route the right workload to the right model at the right price, and the labs that make that orchestration easy will be the ones that capture margin from the broader race.
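As a rough illustration of what that routing decision looks like in code, the sketch below picks the cheapest model that clears a quality bar. The class and function names are hypothetical. The scores are the SWE-bench Verified figures cited in this article, with Claude's 80.8 derived from the reported 0.2-point gap, and the prices are the listed output rates. A production router would weigh far more than one benchmark score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str
    score: float         # SWE-bench Verified score, in percent
    output_price: float  # dollars per million output tokens


def cheapest_meeting(options, min_score):
    """Return the lowest-priced option whose score meets min_score, else None."""
    eligible = [o for o in options if o.score >= min_score]
    return min(eligible, key=lambda o: o.output_price) if eligible else None


OPTIONS = [
    ModelOption("V4-Pro", 80.6, 3.48),
    ModelOption("Claude Opus 4.6", 80.8, 25.00),  # 80.6 + 0.2, derived from the article
]

for bar in (80.0, 80.7, 81.0):
    pick = cheapest_meeting(OPTIONS, bar)
    print(f"minimum score {bar}: {pick.name if pick else 'no model qualifies'}")
```

With a bar of 80.0 the router selects V4-Pro on price, at 80.7 only Claude Opus 4.6 qualifies, and at 81.0 neither does. The example shows how sensitive a threshold-based policy is to differences that fall within benchmark noise.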

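Summary

DeepSeek, an AI lab based in Hangzhou, China, released preview weights and an API for its V4 series on April 24. The 1.6-trillion-parameter V4-Pro scored 80.6 percent on SWE-bench Verified, narrowing the gap with Anthropic's Claude Opus 4.6 to 0.2 percentage points, and it pulled ahead on Terminal-Bench and LiveCodeBench. The biggest talking point in the market is its price of 3.48 dollars per million output tokens, roughly one-seventh of Claude's 25 dollars.

On the technical side, the core is a hybrid attention structure that combines Compressed Sparse Attention and Heavily Compressed Attention. The company says that at a one-million-token context it cut single-token inference FLOPs to 27 percent and KV-cache memory to 10 percent of V3.2's levels. If those figures hold up under external verification, the cost curve for long-context inference could itself be redrawn.

The market reaction was calm, unlike the R1 shock of January 2025. U.S. big tech has defended its margins with in-house chips, enterprise contracts, and safety infrastructure, and analysts judged V4 to be an expected event. Still, for companies with token-hungry workloads such as coding agents and IDE copilots, the center of gravity in procurement decisions is likely to shift toward price once again. The thing to watch next quarter is less the models themselves than which workloads get routed to which model, and how.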

Sources: DeepSeek API Docs, "V4 Preview Release"; Hugging Face, "DeepSeek-V4-Pro Model Card"; CNN Business, "DeepSeek V4"; LMSYS Blog, "DeepSeek-V4 on Day 0".