DeepSeek V4 Arrives: A 1.6T MoE With a 1M-Token Default

Claude
Colored neural network diagram by Glosser.ca, CC BY-SA 3.0, via Wikimedia Commons.

If you had to pick a single release that captured how fast frontier AI is moving in 2026, you could make a reasonable case for the one that dropped at the end of last week. On April 24, the Hangzhou-based lab DeepSeek pushed a preview of its long-anticipated V4 model family to Hugging Face and its own API. It is open-weight, it is massive, and, crucially, it treats a million-token context window as the default rather than a feature you pay extra for.

For a year, the market has been debating whether DeepSeek's previous "Sputnik moment" was a one-off shock or the start of a structural change in how AI labs compete. The V4 preview is, at minimum, a strong data point for the second view. It is worth unpacking what shipped, why investors and engineers are treating it seriously, and what it signals about the next round of pricing and product pressure on American incumbents.

What Happened

DeepSeek released two models under the V4 preview: V4-Pro, the flagship, and V4-Flash, the efficiency variant. Both are Mixture-of-Experts architectures and both ship under the permissive MIT license on Hugging Face, which lets developers download the weights, run them on their own hardware, and fine-tune them without commercial restrictions.

The headline numbers are unusually aggressive. V4-Pro has roughly 1.6 trillion total parameters with about 49 billion active per token, pre-trained on 33 trillion tokens. V4-Flash is the lighter sibling at 284 billion total parameters and 13 billion active, trained on 32 trillion tokens. Both variants natively support a 1,000,000-token context window, up from the 128,000 tokens offered by DeepSeek V3. They also support a dual inference mode that can toggle between a "thinking" chain-of-reasoning pass and a faster non-thinking pass, similar in spirit to the hybrid reasoning modes rolled out by other frontier labs.
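To put the "total versus active" figures in perspective, the short calculation below works out what share of each model's parameters is used for any single token. It uses only the approximate counts quoted above; the dictionary and function names are illustrative, not part of any DeepSeek tooling.

```python
# Parameter counts as quoted in the article (approximate).
MODELS = {
    "V4-Pro": {"total": 1.6e12, "active": 49e9},
    "V4-Flash": {"total": 284e9, "active": 13e9},
}


def active_fraction(total: float, active: float) -> float:
    """Share of parameters that participate in one token's forward pass."""
    return active / total


for name, params in MODELS.items():
    frac = active_fraction(params["total"], params["active"])
    print(f"{name}: {frac:.1%} of parameters active per token")
```

On these figures, V4-Pro activates about 3.1% of its weights per token and V4-Flash about 4.6%, which is why per-token compute tracks the active count rather than the headline total.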

The architectural shift that is getting the most attention is how V4 handles attention at long context. DeepSeek replaced standard full attention with a hybrid of Compressed Sparse Attention and Heavily Compressed Attention, which the team says reduces single-token inference FLOPs at the 1M-token setting to around 27% of V3.2, while using only about 10% of the KV cache. In plain terms, the model looks at long documents using a much cheaper mechanism than the quadratic attention used in most earlier transformers, which is what makes a million-token default commercially viable.
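A rough sense of why the KV cache matters at this scale can be had from the textbook cache-size formula for a conventional transformer. The sketch below is an assumption-heavy illustration: the layer count, head count, and head dimension are hypothetical and are not DeepSeek's, and the article's 10% figure is relative to V3.2 rather than to this generic baseline, so the "reduced" column only shows what a tenfold cut looks like in absolute terms.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """KV-cache size for one sequence in a conventional transformer.

    The leading factor 2 covers keys and values; bytes_per_value=2
    assumes 16-bit storage (fp16 or bf16).
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical configuration, NOT DeepSeek's actual architecture.
HYPOTHETICAL = dict(n_layers=60, n_kv_heads=8, head_dim=128)

for tokens in (128_000, 1_000_000):
    full = kv_cache_bytes(tokens, **HYPOTHETICAL)
    reduced = 0.10 * full  # illustrative tenfold cut, echoing the quoted figure
    print(f"{tokens:>9,} tokens: {full / 1e9:6.1f} GB full, "
          f"{reduced / 1e9:5.1f} GB at 10%")
```

Under these made-up dimensions a single 1M-token sequence would need roughly 246 GB of cache, more than any single accelerator holds today, which is the kind of pressure that makes cache compression a precondition for serving long contexts cheaply.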

On the training side, DeepSeek trains separate specialist experts per domain — mathematics, coding, agentic tool use, instruction following — runs supervised fine-tuning and reinforcement learning with domain-specific rewards on each, then distils the ensemble back into a single unified model using on-policy distillation. The approach effectively lets the final checkpoint inherit the strengths of each specialist without requiring users to route queries across different endpoints.

API pricing is the other shock. V4-Pro is listed at about $1.74 per million input tokens on a cache miss and $3.48 per million output tokens. V4-Flash is an order of magnitude cheaper at $0.14 input and $0.28 output per million tokens. For comparison, OpenAI's recently launched GPT-5.5 is priced at $5 per million input tokens and $30 per million output tokens. That makes V4-Pro roughly a third of the price on input and about a ninth on output, which is consistent with the widely quoted figure of about one-sixth of the cost of the leading proprietary models at similar quality tiers for a mixed workload. Pricing and release details are in the DeepSeek API announcement.
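The listed prices can be turned into a per-request comparison with a few lines of arithmetic. The sketch below uses only the per-million-token prices quoted above; the two workloads are hypothetical examples chosen to show how the cost ratio shifts with the mix of input and output tokens.

```python
# USD per million tokens, as quoted in the article.
PRICES = {
    "V4-Pro": {"input": 1.74, "output": 3.48},  # input is the cache-miss rate
    "V4-Flash": {"input": 0.14, "output": 0.28},
    "GPT-5.5": {"input": 5.00, "output": 30.00},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the listed per-million-token prices."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


def cost_ratio(model: str, baseline: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of `model` relative to `baseline` for the same request."""
    return request_cost(model, input_tokens, output_tokens) / request_cost(
        baseline, input_tokens, output_tokens
    )


# Hypothetical workloads: (label, input tokens, output tokens).
WORKLOADS = [
    ("long-document read", 800_000, 4_000),
    ("balanced chat", 10_000, 10_000),
]

for label, n_in, n_out in WORKLOADS:
    ratio = cost_ratio("V4-Pro", "GPT-5.5", n_in, n_out)
    print(f"{label}: V4-Pro costs {ratio:.0%} of GPT-5.5")
```

The ratio moves with the workload: input-heavy long-context requests sit close to the input price ratio (about 0.35), while output-heavy requests approach the output price ratio (about 0.12).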

Why It Matters

Three structural things matter here, and they are easy to miss if you only look at benchmarks.

The first is the normalisation of million-token context. Until now, million-token windows were either experimental features, narrow technical previews, or gated behind expensive enterprise tiers. V4 is the first widely available open-weight model family that treats 1M tokens as the standard context length. That affects how developers design applications: instead of painful retrieval pipelines that chunk documents and stitch together results, a product team can simply paste a full codebase, a long legal matter, or a week of call transcripts into the context and let the model read everything in one pass. The long-context cost curve has historically been the limiting factor for those workflows; V4's attention changes move that curve.
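In practice, the design decision described above comes down to a budget check before each request: does the material fit in one pass, or does it still need chunking? The sketch below is a minimal version of that check. The four-characters-per-token estimate is a crude heuristic for English text, not a real tokenizer, and the function names and the output reserve are assumptions made for illustration.

```python
CONTEXT_WINDOW = 1_000_000  # tokens, per the article
CHARS_PER_TOKEN = 4         # rough heuristic for English text (assumption)


def estimate_tokens(text: str) -> int:
    """Approximate token count via ceiling division on character length."""
    return -(-len(text) // CHARS_PER_TOKEN)


def plan_prompt(documents, reserved_for_output=8_000):
    """Return ("single_pass" | "needs_chunking", estimated_input_tokens)."""
    budget = CONTEXT_WINDOW - reserved_for_output
    total = sum(estimate_tokens(doc) for doc in documents)
    if total <= budget:
        return "single_pass", total
    return "needs_chunking", total


# Example: two short documents fit comfortably in one request.
print(plan_prompt(["contract text " * 1_000, "meeting notes " * 2_000]))
```

A production system would swap the heuristic for the model's actual tokenizer and, given the reliability caveats around very long inputs, might deliberately cap the budget well below the nominal window.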

The second is the pricing floor. DeepSeek's 2024 and 2025 releases already dragged inference prices down sharply. V4 pushes that trend further by pairing frontier-adjacent intelligence with a price tag closer to a mid-tier 2024 model. VentureBeat's technical review called out that V4 reaches roughly state-of-the-art intelligence at about one-sixth of the cost of Claude Opus 4.7 and GPT-5.5, a comparison you can find in their detailed write-up. Every proprietary lab now has to justify why their API price is several multiples of an open alternative that is close enough in capability for most enterprise workloads.

The third is strategic: V4 raises the bar for what "open" looks like in AI. An MIT-licensed 1.6T-parameter MoE with frontier-grade coding and mathematics performance is a policy event as much as a product event. It gives companies, universities, and governments outside the United States a credible alternative to proprietary American models for everything from internal copilots to sovereign AI infrastructure. That narrative is already reshaping procurement conversations in Europe and Asia, and it is part of why DeepSeek's release was reported prominently by mainstream outlets including CNN Business and Al Jazeera.

Reaction

Community reception has been a mix of technical respect and competitive anxiety. On the benchmark side, independent early runs show V4-Pro leading Claude on Terminal-Bench 2.0 at roughly 67.9% versus 65.4%, and on LiveCodeBench at about 93.5% versus 88.8%. It also posted a Codeforces rating north of 3,200, which puts it in the top tier of publicly released models on autonomous coding evaluations. On the frontier Putnam-2025 setup, V4 achieved a proof-perfect score, matching other frontier-tier reasoning systems.

Hugging Face's team publicly welcomed "the whale back" — a reference to the DeepSeek logo — and several maintainers noted that cost-effective 1M-token inference is no longer hypothetical. Developer forums have filled up with engineers reporting successful local runs of V4-Flash on single-node setups and early experiments with V4-Pro on shared clusters, along with fine-tuning guides for domain-specific tasks.

Investor reaction has been rougher on incumbents. Analysts cited in CNBC's release coverage noted that the preview launch sent U.S. AI-linked equities into a nervous session, with concerns that DeepSeek's pricing trajectory will erode API revenue growth at firms positioning themselves as frontier inference providers. The consensus on social media, meanwhile, has been that V4 is less a single product and more a reset of the competitive floor: if an MIT-licensed model can match proprietary models on most non-multimodal tasks, closed-source labs have to justify their premium on safety tooling, multimodal breadth, agentic reliability, or deep enterprise integration rather than raw capability alone.

There are caveats. V4 is still a preview, and several evaluation suites are being re-run with more rigorous contamination controls. The 1M-token numbers are impressive on synthetic retrieval benchmarks, but real-world long-context reliability — hallucination rates, instruction adherence at depth, multi-hop reasoning at scale — tends to degrade even for the best models. Early testers are explicitly flagging that "1M context available" is not the same as "1M context reliable for production."

What's Next

Three things are worth watching in the next 30 to 60 days.

The most obvious is a pricing response from OpenAI, Anthropic, and Google. When DeepSeek V3 shipped in 2024, every major provider eventually cut input token prices or re-bundled long-context tiers. V4's pricing is aggressive enough that another round of cuts — or at minimum, expanded free tiers for developers — looks likely. Google's recent announcements around its Ironwood TPUs and new agent tooling, covered by Bloomberg, suggest Mountain View is preparing for a long battle on both compute and application layers.

The second is the open-weight ecosystem. With an MIT-licensed 1.6T MoE in the wild, expect a surge of community fine-tunes, distilled variants, and quantised small models derived from V4. Past DeepSeek releases spawned hundreds of derivatives within weeks, and V4's scale means the derivative tree will be broader and deeper. That is structurally bullish for smaller cloud providers and on-premise vendors who can package those variants into enterprise products.

The third is regulatory. A Chinese-origin frontier model shipping under a permissive license will re-open debates about export controls, data residency, and national security reviews in the United States, the European Union, and Korea. Governments that have been drafting AI governance frameworks now have to address the reality that state-of-the-art weights are simply available for download. Expect renewed conversation about both compute export rules and open-weight transparency standards.

Closing Thoughts

The most useful way to read DeepSeek V4 is not as a single model launch but as a recalibration of expectations across three dimensions at once: long context becomes default, prices compress, and "open" keeps catching up to "frontier." None of those trends started last week, but V4 is the cleanest example yet of how fast they can move together when a single lab decides to ship aggressively.

For product teams, the practical takeaway is to stop treating long context and frontier reasoning as premium features you budget around and start treating them as commodity inputs to design with. For AI infrastructure investors, the harder question is which parts of the stack still capture durable margin once both intelligence and context are cheap. And for policymakers, the uncomfortable reality is that the future of AI governance is going to be shaped at least as much by what ships openly as by what is negotiated between a handful of closed labs. V4 will not be the last preview of this kind; the next one will probably move even faster.

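Summary

Chinese AI startup DeepSeek released a preview of its next-generation model, DeepSeek V4, on April 24. The flagship V4-Pro is an MoE with 1.6 trillion total parameters and 49 billion active, while the lightweight V4-Flash has 284 billion total parameters; both were released as open weights on Hugging Face under the MIT license. Both models support a 1-million-token context by default, and the key point is that new techniques such as Compressed Sparse Attention cut per-token inference compute at long context to roughly 27% of V3.2's level.

The price competitiveness is especially striking. The V4-Pro API costs about $1.74 per million input tokens and $3.48 per million output tokens, roughly one-sixth of GPT-5.5, and V4-Flash is more than ten times cheaper still. With coding and mathematics benchmark results on par with Claude Opus 4.7 and GPT-5.5, some in the global AI industry are already calling it a "second DeepSeek shock." Hugging Face and the wider developer community see it as a turning point that will drive the commercialisation of long context in earnest.

Over the next 30 to 60 days, expect pricing responses from major closed-model vendors such as OpenAI, Anthropic, and Google, a spread of derivative models across the open-weight ecosystem, and renewed regulatory debate in the United States, the EU, and Korea. Rather than a simple model launch, V4 is the first case in which three conditions (frontier-grade performance, a 1-million-token context, and ultra-low pricing) have been commercialised at the same time, and it sets a new baseline for product planners and policymakers alike.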