DeepSeek-V4-Pro is a 1.6-trillion-parameter text-to-text mixture-of-experts model released in 2026.

Model architecture

Architecture notes

The model introduces several new optimizations in attention and expert routing, each sketched in code after this list:

  • Different layers use one of two novel attention mechanisms:
    • Compressed Sparse Attention (CSA) compresses the KV cache by 4x, then selects the top 1024 compressed entries per query token.
    • Heavily Compressed Attention (HCA) compresses the KV cache by 128x, then applies dense attention over all compressed entries.
  • Output projections are grouped:
    • The 128 attention head outputs are divided into 16 groups of 8 heads each.
    • Each group is projected to a 1024-dimensional intermediate representation.
    • The 16 projected groups are concatenated into a 16384-dimensional vector.
    • That vector is then projected back to the hidden dimension of 7168.
  • Expert routing uses a new method: sqrtsoftplus replaces sigmoid in the router, and the first three MoE layers route tokens to experts with a static hash instead of learned parameters. The latter change eliminates the dense FFNs used in DeepSeek-V3.
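
A minimal single-head sketch of the two attention variants is below. The block mean-pooling compression and the raw dot-product selection scores (standing in for the Lightning Indexer) are assumptions made for illustration; only the compression ratios (4x, 128x), the top-1024 selection, and the reuse of each compressed entry as both K and V come from this description. Causal masking and the 128-token sliding-window branch are omitted.

```python
# Minimal single-head sketch of CSA and HCA as described above. Block mean
# pooling and dot-product selection scores are stand-ins (assumptions); causal
# masking and the sliding-window branch are omitted.
import torch
import torch.nn.functional as F

def compress_kv(kv: torch.Tensor, ratio: int) -> torch.Tensor:
    """Compress [T, D] token states into [T // ratio, D] entries by block mean pooling."""
    T, D = kv.shape
    T_c = T // ratio
    return kv[: T_c * ratio].reshape(T_c, ratio, D).mean(dim=1)

def csa(q: torch.Tensor, kv: torch.Tensor, ratio: int = 4, top_k: int = 1024) -> torch.Tensor:
    """Compressed Sparse Attention: 4x KV compression, then top-k compressed entries per query."""
    c = compress_kv(kv, ratio)                      # [T_c, D]; each entry acts as both K and V
    scores = q @ c.T / c.shape[-1] ** 0.5           # [T_q, T_c] selection / attention scores
    top = scores.topk(min(top_k, c.shape[0]), dim=-1)
    probs = F.softmax(top.values, dim=-1)           # attend only over the selected entries
    return torch.einsum("qk,qkd->qd", probs, c[top.indices])

def hca(q: torch.Tensor, kv: torch.Tensor, ratio: int = 128) -> torch.Tensor:
    """Heavily Compressed Attention: 128x KV compression, then dense attention over all entries."""
    c = compress_kv(kv, ratio)
    return F.softmax(q @ c.T / c.shape[-1] ** 0.5, dim=-1) @ c

T, D = 8192, 512
q, kv = torch.randn(T, D), torch.randn(T, D)
print(csa(q, kv).shape, hca(q, kv).shape)           # both torch.Size([8192, 512])
```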
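
The grouped output projection follows directly from the stated dimensions: 16 groups of 8 heads (8 × 512 = 4096 dims per group), each projected to 1024 dims, concatenated to 16384 dims, then projected back to 7168. A minimal sketch, assuming plain bias-free linear layers:

```python
# Sketch of the grouped output projection with the dimensions given above.
# Bias-free nn.Linear layers are an assumption; the dimensions come from the notes.
import torch
import torch.nn as nn

class GroupedOutputProjection(nn.Module):
    def __init__(self, n_heads=128, head_dim=512, n_groups=16, group_rank=1024, hidden_dim=7168):
        super().__init__()
        self.n_groups = n_groups
        heads_per_group = n_heads // n_groups                     # 128 / 16 = 8 heads per group
        self.group_proj = nn.ModuleList(
            nn.Linear(heads_per_group * head_dim, group_rank, bias=False)   # 4096 -> 1024
            for _ in range(n_groups)
        )
        self.out_proj = nn.Linear(n_groups * group_rank, hidden_dim, bias=False)  # 16384 -> 7168

    def forward(self, head_out: torch.Tensor) -> torch.Tensor:
        # head_out: [..., n_heads * head_dim] concatenated attention head outputs
        groups = head_out.chunk(self.n_groups, dim=-1)            # 16 chunks of 4096 dims each
        mid = torch.cat([proj(g) for proj, g in zip(self.group_proj, groups)], dim=-1)  # [..., 16384]
        return self.out_proj(mid)                                 # [..., 7168]

x = torch.randn(2, 4, 128 * 512)                                  # [batch, seq, concatenated head outputs]
print(GroupedOutputProjection()(x).shape)                         # torch.Size([2, 4, 7168])
```

For comparison, a single dense output projection over the full 128 × 512 = 65536-dim concatenation would need roughly 470M parameters at hidden size 7168, while the grouped factorization needs roughly 16 × (4096 × 1024) + 16384 × 7168 ≈ 184M.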
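
The routing changes can be sketched in a similar way. Here sqrtsoftplus is read as sqrt(softplus(x)) applied to the router logits in place of sigmoid, and the multiplicative hash used for the first three MoE layers is arbitrary; both are assumptions for illustration, and noaux_tc plus the sequence balance loss (listed in the table below) are not shown.

```python
# Sketch of the two routing changes. sqrtsoftplus is read as sqrt(softplus(x))
# and the multiplicative hash is arbitrary; both are assumptions for illustration.
# noaux_tc and the sequence balance loss are not shown.
import torch
import torch.nn.functional as F

def sqrtsoftplus_gate(logits: torch.Tensor, top_k: int = 6):
    """Learned routing: score experts with sqrt(softplus(.)) instead of sigmoid, keep the top-k."""
    scores = torch.sqrt(F.softplus(logits))                  # [tokens, n_experts]
    weights, experts = scores.topk(top_k, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)    # renormalizing kept weights is an assumption
    return weights, experts

def hash_route(token_ids: torch.Tensor, n_experts: int = 384, top_k: int = 6):
    """Static hash routing (first three MoE layers): the expert choice depends only on the
    token id, never on learned parameters, so no router is trained for these layers."""
    experts = (token_ids.unsqueeze(-1) * 2654435761 + torch.arange(top_k)) % n_experts
    weights = torch.full(experts.shape, 1.0 / top_k)          # uniform weights (assumption)
    return weights, experts

logits = torch.randn(10, 384)                    # 10 tokens, 384 routed experts
print(sqrtsoftplus_gate(logits)[1].shape)        # torch.Size([10, 6])
print(hash_route(torch.arange(10))[1].shape)     # torch.Size([10, 6])
```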

Organization: DeepSeek
Release date: 2025
Model lineage: DeepSeek-V3 → DeepSeek-V4-Pro
Total parameters: 1.6T
Active parameters: 49B
Model layers: 61 + 1 MTP
Layer composition: layers 0–1 HCA; layers 2–60 CSA/HCA interleaved; all MoE (no dense prefix; first 3 layers use hash routing)
Model dimension: 7168
Attention variant: CSA/HCA hybrid (novel)
Q heads / head dim: 128 heads / 512 dim (q_lora_rank=1536)
KV geometry: MQA (1 KV head, 512 dim); the compressed entry serves as both K and V
Attention peculiarities: CSA: 4x KV compression + sparse top-1024 selection via Lightning Indexer; HCA: 128x KV compression + dense attention; both use a sliding-window branch (128 tokens) for local dependencies; grouped output projection (16 groups, o_lora_rank=1024)
Context length: 1M tokens
Position encoding: YaRN (x16, from 64k base); separate compress_rope_theta=160000 for compressed layers
FFN type: MoE
MoE experts: 384 total / 6 active + 1 shared
MoE activation ratio: 1.6%
MoE routing: sqrtsoftplus + noaux_tc + sequence balance loss; first 3 layers use hash routing
Multi-token prediction: 1 layer
Sequence mixer: attention only
Modalities: text
Native dtype: bf16
Quantization: fp8 (e4m3, dynamic, 128x128 block) compute; fp4 (MXFP4) stored for routed expert weights via QAT
Quantization exclusions:
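
Two derived figures in the table can be checked with quick arithmetic, assuming the 64k base context means 64 × 1024 tokens and the activation ratio counts routed experts only:

```python
# Arithmetic check of two derived figures in the table above (assumptions: 64k
# base context = 64 * 1024 tokens; activation ratio counts routed experts only).
print(64 * 1024 * 16)   # 1048576 -> the ~1M-token context length after YaRN x16
print(6 / 384)          # 0.015625 -> the ~1.6% MoE activation ratio (6 of 384 routed experts)
```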

Data pipeline

The model was trained on 33 trillion tokens.