Kimi K2.5 is a multimodal model developed by Moonshot AI, trained with “early vision fusion”: text and image data are mixed into the data stream from an early stage of pretraining.
- Its transformer backbone is Kimi K2,1 a MoE model with 1.04 trillion total parameters, 32 billion of them active
- Its vision encoder is MoonViT-3D
Transformer / backbone
Its transformer backbone is a DeepSeek-V3-like 1.04-trillion-parameter MoE organized as follows:1
- 61 layers
- 1.04T parameters
- 32.6B active parameters
- 384 total experts
- 8 active experts per token
- 1 shared expert
- 64 attention heads
- 1 dense layer
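The routing pattern implied by the list above (top-8 of 384 routed experts, plus 1 always-on shared expert) can be sketched in a few lines. This is an illustrative toy, not Moonshot's released code: the hidden size, router initialization, and gate normalization here are assumptions.

```python
import numpy as np

# Toy sketch of top-k MoE routing as described above:
# each token selects the top-8 of 384 routed experts, and a shared
# expert runs unconditionally. D is a toy hidden size, not K2's.
N_EXPERTS, TOP_K, D = 384, 8, 64

rng = np.random.default_rng(0)
router_w = rng.standard_normal((D, N_EXPERTS)) / np.sqrt(D)

def route(token: np.ndarray):
    """Return (expert indices, normalized gate weights) for one token."""
    logits = token @ router_w                 # [N_EXPERTS] router scores
    top = np.argsort(logits)[-TOP_K:]         # indices of the top-8 experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                      # softmax over the selected 8
    return top, gates

idx, gates = route(rng.standard_normal(D))
# Counting the shared expert, each token touches TOP_K + 1 = 9 expert FFNs.
```

Only the 8 selected expert FFNs (plus the shared one) run for a given token, which is how 1.04T total parameters can coexist with ~32.6B active per token.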
Vision encoder
Kimi K2.5 uses the MoonViT-3D vision transformer which is organized as follows:2
- 27 layers
- 1152 model dimension
- 4304 intermediate size
- 16 attention heads
- 14x14 patches
This comes out to around 425 million parameters on top of the 1.04 trillion.
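As a sanity check, the figures above can be turned into a back-of-the-envelope parameter count. The MLP variant (plain two-matrix MLP vs. SwiGLU) and the omitted pieces (norms, biases, the projector into the LLM) are assumptions here, which is why this estimate lands slightly under the ~425M figure.

```python
# Rough parameter estimate for MoonViT-3D from the listed dimensions.
# Assumes a standard two-matrix MLP and ignores norms, biases, and the
# vision->LLM projector, so it should undershoot the reported ~425M.
layers, d, ffn, patch = 27, 1152, 4304, 14

attn = 4 * d * d                      # Q, K, V, and output projections
mlp = 2 * d * ffn                     # up-projection + down-projection
per_layer = attn + mlp
patch_embed = 3 * patch * patch * d   # RGB 14x14 patches -> model dim

total = layers * per_layer + patch_embed
print(f"~{total / 1e6:.0f}M parameters")   # roughly 412M before extras
```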
Training infrastructure
From the Kimi K2.5 technical report:3
Kimi K2.5 is trained on NVIDIA H800 GPU clusters with 8×400 Gbps RoCE interconnects across nodes. We employ a flexible parallelism strategy combining 16-way Pipeline Parallelism (PP) with virtual stages, 16-way Expert Parallelism (EP), and ZeRO-1 Data Parallelism, enabling training on any number of nodes that is a multiple of 32. EP all-to-all communication is overlapped with computation under interleaved 1F1B scheduling. To fit activations within GPU memory constraints, we apply selective recomputation for LayerNorm, SwiGLU, and MLA up-projections, compress insensitive activations to FP8-E4M3, and offload remaining activations to CPU with overlapped streaming.
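The "multiple of 32 nodes" constraint in the quote follows from arithmetic: assuming 8 GPUs per H800 node and that one PP×EP grid spans 16×16 = 256 GPUs, the smallest cluster is 32 nodes, and each additional 32-node increment adds one ZeRO-1 data-parallel replica. How Moonshot actually composes the three axes is not spelled out, so treat this as a hedged sketch.

```python
# Sketch of the cluster-size arithmetic behind the quoted passage.
# Assumptions: 8 GPUs per node, and one 16-way-PP x 16-way-EP grid
# occupying 256 GPUs; DP replicas are tiled over additional grids.
GPUS_PER_NODE, PP, EP = 8, 16, 16

def dp_degree(nodes: int) -> int:
    """Number of ZeRO-1 data-parallel replicas for a given node count."""
    gpus = nodes * GPUS_PER_NODE
    grid = PP * EP                       # 256 GPUs per PP x EP grid
    assert gpus % grid == 0, "node count must be a multiple of 32"
    return gpus // grid

print(dp_degree(32))   # minimal cluster: 1 replica
print(dp_degree(96))   # 3 replicas
```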