DeepSeek-V4-Pro is a 1.6-trillion-parameter text-to-text mixture-of-experts model released in 2026.
Model architecture
Architecture notes
The model introduces a few new optimizations in attention and expert routing:
- Layers use different, novel attention mechanisms:
  - Compressed Sparse Attention (CSA) compresses KV by 4x, then picks the top 1024 compressed entries per query token (see the sketch after this list).
  - Heavily Compressed Attention (HCA) compresses KV by 128x, then performs dense attention over all of the compressed entries.
- Output projections are grouped (see the sketch after this list):
  - The 128 attention head outputs are divided into 16 groups.
  - Each group is projected into a 1024-dimensional intermediate.
  - All 16 groups are concatenated into 16384 dimensions.
  - The result is then projected back to the hidden dimension of 7168.
- Expert routing uses a new method, where sqrtsoftplus replaces sigmoid and the first three MoE layers use static hash-based routing of tokens to experts instead of routing based on learned parameters (see the sketch after this list). This latter optimization eliminates the dense FFNs used in DeepSeek-V3.
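A minimal sketch of the CSA selection step, assuming the 4x-compressed KV entries and the per-entry indexer scores (the Lightning Indexer in the table below) are already computed. Tensor names, shapes, and the unfused two-step structure are illustrative assumptions, not DeepSeek's kernel:

```python
import torch
import torch.nn.functional as F

def csa_sketch(q, kv_compressed, index_scores, top_k=1024):
    """Illustrative CSA step: keep only the top-k compressed KV entries
    per query token, then attend densely over that selection.
    q:             [n_q, d]   query tokens (single KV head, MQA-style)
    kv_compressed: [n_c, d]   4x-compressed entries, each serving as K and V
    index_scores:  [n_q, n_c] cheap indexer scores per (query, entry) pair
    """
    n_c, d = kv_compressed.shape
    k = min(top_k, n_c)

    # Sparse selection: top-k compressed entries for every query token.
    top_idx = index_scores.topk(k, dim=-1).indices              # [n_q, k]
    kv_sel = kv_compressed[top_idx]                              # [n_q, k, d]

    # Dense attention restricted to the selected entries.
    logits = torch.einsum("qd,qkd->qk", q, kv_sel) / d ** 0.5    # [n_q, k]
    weights = F.softmax(logits, dim=-1)
    return torch.einsum("qk,qkd->qd", weights, kv_sel)           # [n_q, d]

# Toy shapes only; causality and the sliding-window branch are omitted.
out = csa_sketch(torch.randn(8, 64), torch.randn(256, 64),
                 torch.randn(8, 256), top_k=32)
```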
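A minimal sketch of the grouped output projection, assuming a separate down-projection per group; module and parameter names are hypothetical:

```python
import torch
import torch.nn as nn

class GroupedOutputProjection(nn.Module):
    """Sketch: 128 head outputs -> 16 groups of 8 heads -> 1024-dim
    intermediate per group -> concat to 16384 -> project to hidden 7168."""

    def __init__(self, n_heads=128, head_dim=512, n_groups=16,
                 group_rank=1024, hidden=7168):
        super().__init__()
        assert n_heads % n_groups == 0
        self.n_groups = n_groups
        self.heads_per_group = n_heads // n_groups
        # Per-group down-projection: (heads_per_group * head_dim) -> group_rank.
        self.w_group = nn.Parameter(
            0.02 * torch.randn(n_groups, self.heads_per_group * head_dim, group_rank))
        # Final projection: (n_groups * group_rank) -> hidden dimension.
        self.w_out = nn.Linear(n_groups * group_rank, hidden, bias=False)

    def forward(self, head_out):
        # head_out: [batch, seq, n_heads, head_dim]
        b, s, h, d = head_out.shape
        x = head_out.view(b, s, self.n_groups, self.heads_per_group * d)
        x = torch.einsum("bsgi,gio->bsgo", x, self.w_group)  # project each group
        x = x.reshape(b, s, -1)                              # concatenate groups
        return self.w_out(x)                                 # back to hidden dim

# Toy dimensions keep the example cheap; the real sizes are the defaults above.
proj = GroupedOutputProjection(n_heads=8, head_dim=16, n_groups=4,
                               group_rank=8, hidden=32)
y = proj(torch.randn(2, 5, 8, 16))   # -> [2, 5, 32]
```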
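A hedged sketch of the two routing changes: sqrtsoftplus gate scores in place of sigmoid, and parameter-free hash routing for the first three MoE layers. The exact hash function and gate details are assumptions:

```python
import torch
import torch.nn.functional as F

def routing_scores_sketch(hidden, gate_weight):
    """Gate scores with sqrt(softplus(x)) replacing the sigmoid gate of
    DeepSeek-V3; `gate_weight` maps hidden states to one logit per
    routed expert (names are illustrative)."""
    logits = hidden @ gate_weight            # [tokens, n_experts]
    return torch.sqrt(F.softplus(logits))    # non-negative, monotone scores

def hash_route_sketch(token_ids, n_experts=384, top_k=6):
    """Static hash routing for the first three MoE layers: each token id
    maps to a fixed set of experts with no learned parameters.  The
    multiplier below is an arbitrary illustrative choice."""
    base = token_ids.long().unsqueeze(-1) * 2654435761 + torch.arange(top_k)
    return base % n_experts                  # [tokens, top_k] expert indices

experts = hash_route_sketch(torch.arange(10))                      # fixed assignment
scores = routing_scores_sketch(torch.randn(4, 64), torch.randn(64, 384))
```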
| Field | Value |
|---|---|
| Organization | DeepSeek |
| Release date | 2025 |
| Model lineage | DeepSeek-V3 → DeepSeek-V4-Pro |
| Total parameters | 1.6T |
| Active parameters | 49B |
| Model layers | 61 + 1 MTP |
| Layer composition | layers 0–1 HCA; layers 2–60 CSA/HCA interleaved; all MoE (no dense prefix; first 3 use hash routing) |
| Model dimension | 7168 |
| Attention variant | CSA/HCA hybrid (novel) |
| Q heads / head dim | 128 heads / 512 dim (q_lora_rank=1536) |
| KV geometry | MQA (1 KV head, 512 dim); compressed entry serves as both K and V |
| Attention peculiarities | CSA: 4x KV compression + sparse top-1024 selection via Lightning Indexer; HCA: 128x KV compression + dense attention; both use sliding window branch (128 tokens) for local dependencies; grouped output projection (16 groups, o_lora_rank=1024) |
| Context length | 1M tokens |
| Position encoding | YaRN (x16, from 64k base); separate compress_rope_theta=160000 for compressed layers |
| FFN type | MoE |
| MoE experts | 384 total / 6 active + 1 shared |
| MoE activation ratio | 1.6% |
| MoE routing | sqrtsoftplus + noaux_tc + sequence balance loss; first 3 layers use hash routing |
| Multi-token prediction | 1 layer |
| Sequence mixer | attention only |
| Modalities | text |
| Native dtype | bf16 |
| Quantization | fp8 (e4m3, dynamic, 128x128 block) compute; fp4 (MXFP4) stored for routed expert weights via QAT |
| Quantization exclusions | |
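For the fp8 compute path in the table, a minimal sketch of 128x128 block-wise e4m3 quantization, assuming a recent PyTorch with float8 dtypes; the rounding and activation-scaling details are assumptions, not DeepSeek's exact recipe:

```python
import torch

def fp8_block_quantize_sketch(w, block=128):
    """Quantize a weight matrix in 128x128 blocks: scale each block so its
    max magnitude maps to the e4m3 maximum (448), cast to float8, and keep
    the per-block scales in fp32 for dequantization."""
    rows, cols = w.shape
    assert rows % block == 0 and cols % block == 0
    blocks = w.reshape(rows // block, block, cols // block, block)
    amax = blocks.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = 448.0 / amax                               # per-block scale
    q = (blocks * scale).to(torch.float8_e4m3fn)       # quantized blocks
    return q, scale                                    # dequant: q.float() / scale

w = torch.randn(256, 256)
q, scale = fp8_block_quantize_sketch(w)
w_hat = (q.float() / scale).reshape(256, 256)          # approximate reconstruction
```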
Data pipeline
The model was trained on 33 trillion tokens.