Kimi K2.5 is a multimodal model developed by Moonshot AI, trained with “early vision fusion”: text and image data are mixed into the data stream from an early stage of pretraining.
- Its transformer backbone is Kimi K2,1 a MoE model with 1.04 trillion total parameters, 32 billion of them active
- Its vision encoder is MoonViT-3D
Transformer / backbone
Its transformer backbone is a DeepSeek-V3-like 1.04-trillion-parameter MoE organized as follows:1
- 61 layers
- 1.04T parameters
- 32.6B active parameters
- 384 total experts
- 8 active experts per token
- 1 shared expert
- 64 attention heads
- 1 dense layer
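The routing pattern implied by the list above (top-8 of 384 routed experts, plus 1 always-on shared expert) can be sketched in a few lines. This is an illustrative toy, not Moonshot's released code: the hidden size, router initialization, and gate normalization here are assumptions.

```python
import numpy as np

# Toy sketch of top-k MoE routing as described above:
# each token selects the top-8 of 384 routed experts, and a shared
# expert runs unconditionally. D is a toy hidden size, not K2's.
N_EXPERTS, TOP_K, D = 384, 8, 64

rng = np.random.default_rng(0)
router_w = rng.standard_normal((D, N_EXPERTS)) / np.sqrt(D)

def route(token: np.ndarray):
    """Return (expert indices, normalized gate weights) for one token."""
    logits = token @ router_w                 # [N_EXPERTS] router scores
    top = np.argsort(logits)[-TOP_K:]         # indices of the top-8 experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                      # softmax over the selected 8
    return top, gates

idx, gates = route(rng.standard_normal(D))
# Counting the shared expert, each token touches TOP_K + 1 = 9 expert FFNs.
```

Only the 8 selected expert FFNs (plus the shared one) run for a given token, which is how 1.04T total parameters can coexist with ~32.6B active per token.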
Vision encoder
Kimi K2.5 uses the MoonViT-3D vision transformer which is organized as follows:2
- 27 layers
- 1152 model dimension
- 4304 intermediate size
- 16 attention heads
- 14x14 patches
This comes out to around 425 million parameters on top of the 1.04 trillion.
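As a sanity check, the figures above can be turned into a back-of-the-envelope parameter count. The MLP variant (plain two-matrix MLP vs. SwiGLU) and the omitted pieces (norms, biases, the projector into the LLM) are assumptions here, which is why this estimate lands slightly under the ~425M figure.

```python
# Rough parameter estimate for MoonViT-3D from the listed dimensions.
# Assumes a standard two-matrix MLP and ignores norms, biases, and the
# vision->LLM projector, so it should undershoot the reported ~425M.
layers, d, ffn, patch = 27, 1152, 4304, 14

attn = 4 * d * d                      # Q, K, V, and output projections
mlp = 2 * d * ffn                     # up-projection + down-projection
per_layer = attn + mlp
patch_embed = 3 * patch * patch * d   # RGB 14x14 patches -> model dim

total = layers * per_layer + patch_embed
print(f"~{total / 1e6:.0f}M parameters")   # roughly 412M before extras
```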
Training infrastructure
From the Kimi K2.5 technical report:3
Kimi K2.5 is trained on NVIDIA H800 GPU clusters with 8×400 Gbps RoCE interconnects across nodes. We employ a flexible parallelism strategy combining 16-way Pipeline Parallelism (PP) with virtual stages, 16-way Expert Parallelism (EP), and ZeRO-1 Data Parallelism, enabling training on any number of nodes that is a multiple of 32. EP all-to-all communication is overlapped with computation under interleaved 1F1B scheduling. To fit activations within GPU memory constraints, we apply selective recomputation for LayerNorm, SwiGLU, and MLA up-projections, compress insensitive activations to FP8-E4M3, and offload remaining activations to CPU with overlapped streaming.
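The "multiple of 32 nodes" constraint in the quote follows from arithmetic: assuming 8 GPUs per H800 node and that one PP×EP grid spans 16×16 = 256 GPUs, the smallest cluster is 32 nodes, and each additional 32-node increment adds one ZeRO-1 data-parallel replica. How Moonshot actually composes the three axes is not spelled out, so treat this as a hedged sketch.

```python
# Sketch of the cluster-size arithmetic behind the quoted passage.
# Assumptions: 8 GPUs per node, and one 16-way-PP x 16-way-EP grid
# occupying 256 GPUs; DP replicas are tiled over additional grids.
GPUS_PER_NODE, PP, EP = 8, 16, 16

def dp_degree(nodes: int) -> int:
    """Number of ZeRO-1 data-parallel replicas for a given node count."""
    gpus = nodes * GPUS_PER_NODE
    grid = PP * EP                       # 256 GPUs per PP x EP grid
    assert gpus % grid == 0, "node count must be a multiple of 32"
    return gpus // grid

print(dp_degree(32))   # minimal cluster: 1 replica
print(dp_degree(96))   # 3 replicas
```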