DeepSeek-V4-Pro is a 1.6-trillion-parameter text-to-text mixture-of-experts model released in 2026.

Model architecture

Architecture notes

The model introduces several new optimizations in attention and expert routing, each sketched in code after this list:

  • Different layers use one of two novel attention mechanisms:
    • Compressed Sparse Attention (CSA) compresses the KV cache by 4x, then selects the top 1024 compressed entries per query token.
    • Heavily Compressed Attention (HCA) compresses the KV cache by 128x, then applies dense attention over all compressed entries.
  • Output projections are grouped:
    • The 128 attention head outputs are divided into 16 groups of 8 heads each.
    • Each group is projected to a 1024-dimensional intermediate representation.
    • The 16 projected groups are concatenated into a 16384-dimensional vector.
    • That vector is then projected back to the hidden dimension of 7168.
  • Expert routing uses a new method: sqrtsoftplus replaces sigmoid in the router, and the first three MoE layers route tokens to experts with a static hash instead of learned parameters. The latter change eliminates the dense FFNs used in DeepSeek-V3.
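
A minimal single-head sketch of the two attention variants is below. The block mean-pooling compression and the raw dot-product selection scores (standing in for the Lightning Indexer) are assumptions made for illustration; only the compression ratios (4x, 128x), the top-1024 selection, and the reuse of each compressed entry as both K and V come from this description. Causal masking and the 128-token sliding-window branch are omitted.

```python
# Minimal single-head sketch of CSA and HCA as described above. Block mean
# pooling and dot-product selection scores are stand-ins (assumptions); causal
# masking and the sliding-window branch are omitted.
import torch
import torch.nn.functional as F

def compress_kv(kv: torch.Tensor, ratio: int) -> torch.Tensor:
    """Compress [T, D] token states into [T // ratio, D] entries by block mean pooling."""
    T, D = kv.shape
    T_c = T // ratio
    return kv[: T_c * ratio].reshape(T_c, ratio, D).mean(dim=1)

def csa(q: torch.Tensor, kv: torch.Tensor, ratio: int = 4, top_k: int = 1024) -> torch.Tensor:
    """Compressed Sparse Attention: 4x KV compression, then top-k compressed entries per query."""
    c = compress_kv(kv, ratio)                      # [T_c, D]; each entry acts as both K and V
    scores = q @ c.T / c.shape[-1] ** 0.5           # [T_q, T_c] selection / attention scores
    top = scores.topk(min(top_k, c.shape[0]), dim=-1)
    probs = F.softmax(top.values, dim=-1)           # attend only over the selected entries
    return torch.einsum("qk,qkd->qd", probs, c[top.indices])

def hca(q: torch.Tensor, kv: torch.Tensor, ratio: int = 128) -> torch.Tensor:
    """Heavily Compressed Attention: 128x KV compression, then dense attention over all entries."""
    c = compress_kv(kv, ratio)
    return F.softmax(q @ c.T / c.shape[-1] ** 0.5, dim=-1) @ c

T, D = 8192, 512
q, kv = torch.randn(T, D), torch.randn(T, D)
print(csa(q, kv).shape, hca(q, kv).shape)           # both torch.Size([8192, 512])
```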
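
The grouped output projection follows directly from the stated dimensions: 16 groups of 8 heads (8 × 512 = 4096 dims per group), each projected to 1024 dims, concatenated to 16384 dims, then projected back to 7168. A minimal sketch, assuming plain bias-free linear layers:

```python
# Sketch of the grouped output projection with the dimensions given above.
# Bias-free nn.Linear layers are an assumption; the dimensions come from the notes.
import torch
import torch.nn as nn

class GroupedOutputProjection(nn.Module):
    def __init__(self, n_heads=128, head_dim=512, n_groups=16, group_rank=1024, hidden_dim=7168):
        super().__init__()
        self.n_groups = n_groups
        heads_per_group = n_heads // n_groups                     # 128 / 16 = 8 heads per group
        self.group_proj = nn.ModuleList(
            nn.Linear(heads_per_group * head_dim, group_rank, bias=False)   # 4096 -> 1024
            for _ in range(n_groups)
        )
        self.out_proj = nn.Linear(n_groups * group_rank, hidden_dim, bias=False)  # 16384 -> 7168

    def forward(self, head_out: torch.Tensor) -> torch.Tensor:
        # head_out: [..., n_heads * head_dim] concatenated attention head outputs
        groups = head_out.chunk(self.n_groups, dim=-1)            # 16 chunks of 4096 dims each
        mid = torch.cat([proj(g) for proj, g in zip(self.group_proj, groups)], dim=-1)  # [..., 16384]
        return self.out_proj(mid)                                 # [..., 7168]

x = torch.randn(2, 4, 128 * 512)                                  # [batch, seq, concatenated head outputs]
print(GroupedOutputProjection()(x).shape)                         # torch.Size([2, 4, 7168])
```

For comparison, a single dense output projection over the full 128 × 512 = 65536-dim concatenation would need roughly 470M parameters at hidden size 7168, while the grouped factorization needs roughly 16 × (4096 × 1024) + 16384 × 7168 ≈ 184M.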
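
The routing changes can be sketched in a similar way. Here sqrtsoftplus is read as sqrt(softplus(x)) applied to the router logits in place of sigmoid, and the multiplicative hash used for the first three MoE layers is arbitrary; both are assumptions for illustration, and noaux_tc plus the sequence balance loss (listed in the table below) are not shown.

```python
# Sketch of the two routing changes. sqrtsoftplus is read as sqrt(softplus(x))
# and the multiplicative hash is arbitrary; both are assumptions for illustration.
# noaux_tc and the sequence balance loss are not shown.
import torch
import torch.nn.functional as F

def sqrtsoftplus_gate(logits: torch.Tensor, top_k: int = 6):
    """Learned routing: score experts with sqrt(softplus(.)) instead of sigmoid, keep the top-k."""
    scores = torch.sqrt(F.softplus(logits))                  # [tokens, n_experts]
    weights, experts = scores.topk(top_k, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)    # renormalizing kept weights is an assumption
    return weights, experts

def hash_route(token_ids: torch.Tensor, n_experts: int = 384, top_k: int = 6):
    """Static hash routing (first three MoE layers): the expert choice depends only on the
    token id, never on learned parameters, so no router is trained for these layers."""
    experts = (token_ids.unsqueeze(-1) * 2654435761 + torch.arange(top_k)) % n_experts
    weights = torch.full(experts.shape, 1.0 / top_k)          # uniform weights (assumption)
    return weights, experts

logits = torch.randn(10, 384)                    # 10 tokens, 384 routed experts
print(sqrtsoftplus_gate(logits)[1].shape)        # torch.Size([10, 6])
print(hash_route(torch.arange(10))[1].shape)     # torch.Size([10, 6])
```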

Organization: DeepSeek
Release date: 2025
Model lineage: DeepSeek-V3 → DeepSeek-V4-Pro
Total parameters: 1.6T
Active parameters: 49B
Model layers: 61 + 1 MTP
Layer composition: layers 0–1 HCA; layers 2–60 CSA/HCA interleaved; all MoE (no dense prefix; first 3 layers use hash routing)
Model dimension: 7168
Attention variant: CSA/HCA hybrid (novel)
Q heads / head dim: 128 heads / 512 dim (q_lora_rank=1536)
KV geometry: MQA (1 KV head, 512 dim); the compressed entry serves as both K and V
Attention peculiarities: CSA: 4x KV compression + sparse top-1024 selection via Lightning Indexer; HCA: 128x KV compression + dense attention; both use a sliding-window branch (128 tokens) for local dependencies; grouped output projection (16 groups, o_lora_rank=1024)
Context length: 1M tokens
Position encoding: YaRN (x16, from 64k base); separate compress_rope_theta=160000 for compressed layers
FFN type: MoE
MoE experts: 384 total / 6 active + 1 shared
MoE activation ratio: 1.6%
MoE routing: sqrtsoftplus + noaux_tc + sequence balance loss; first 3 layers use hash routing
Multi-token prediction: 1 layer
Sequence mixer: attention only
Modalities: text
Native dtype: bf16
Quantization: fp8 (e4m3, dynamic, 128x128 block) compute; fp4 (MXFP4) stored for routed expert weights via QAT
Quantization exclusions:
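
Two derived figures in the table can be checked with quick arithmetic, assuming the 64k base context means 64 × 1024 tokens and the activation ratio counts routed experts only:

```python
# Arithmetic check of two derived figures in the table above (assumptions: 64k
# base context = 64 * 1024 tokens; activation ratio counts routed experts only).
print(64 * 1024 * 16)   # 1048576 -> the ~1M-token context length after YaRN x16
print(6 / 384)          # 0.015625 -> the ~1.6% MoE activation ratio (6 of 384 routed experts)
```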

Data pipeline

The model was trained on 33 trillion tokens.