Kimi K2.5 is a multimodal model developed by Moonshot AI. It was trained with “early vision fusion,” in which text and image data were used jointly from an early stage of training.

  • Its transformer backbone is Kimi K2,1 a MoE model with 1.04 trillion parameters, 32 billion of them active.
  • Its vision encoder is MoonViT-3D.

Transformer / backbone

Its transformer backbone is a DeepSeek-V3-like 1.04-trillion-parameter MoE organized as follows:1

  • 61 layers
  • 1.04T parameters
  • 32.6B active parameters
  • 384 total experts
  • 8 active experts per token
  • 1 shared expert
  • 64 attention heads
  • 1 dense layer
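The routing behavior implied by the list above (8 of 384 routed experts active per token, plus 1 shared expert that always fires) can be sketched with a simple top-k selection. This is a minimal illustration, not Moonshot's implementation; all names here are made up for the example.

```python
import heapq
import random

# Illustrative MoE routing parameters matching the spec list above.
N_EXPERTS = 384  # routed experts per MoE layer
TOP_K = 8        # routed experts activated per token

def route(scores):
    """Return the indices of the TOP_K highest-scoring experts for one token.

    scores: list of N_EXPERTS router logits for a single token.
    """
    return heapq.nlargest(TOP_K, range(N_EXPERTS), key=scores.__getitem__)

random.seed(0)
scores = [random.gauss(0.0, 1.0) for _ in range(N_EXPERTS)]
chosen = route(scores)

# The 1 shared expert is unconditional: every token passes through it
# in addition to its TOP_K routed experts.
print(len(chosen))  # 8 routed experts for this token
```

Because only 8 + 1 experts run per token, the active parameter count (32.6B) is a small fraction of the 1.04T total.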

Vision encoder

Kimi K2.5 uses the MoonViT-3D vision transformer,2 which adds roughly 425 million parameters on top of the backbone's 1.04 trillion.

Training infrastructure

From the Kimi K2.5 technical report:3

Kimi K2.5 is trained on NVIDIA H800 GPU clusters with 8×400 Gbps RoCE interconnects across nodes. We employ a flexible parallelism strategy combining 16-way Pipeline Parallelism (PP) with virtual stages, 16-way Expert Parallelism (EP), and ZeRO-1 Data Parallelism, enabling training on any number of nodes that is a multiple of 32. EP all-to-all communication is overlapped with computation under interleaved 1F1B scheduling. To fit activations within GPU memory constraints, we apply selective recomputation for LayerNorm, SwiGLU, and MLA up-projections, compress insensitive activations to FP8-E4M3, and offload remaining activations to CPU with overlapped streaming.
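The "multiple of 32 nodes" constraint follows from the parallelism arithmetic in the quote. As a back-of-the-envelope check (my own arithmetic, not from the report): one model replica spans 16 PP stages × 16 EP ranks = 256 GPUs, and an H800 node holds 8 GPUs, so a replica occupies 32 nodes; ZeRO-1 data parallelism then scales out in whole replicas.

```python
# Sketch of the cluster-sizing arithmetic implied by the report's
# parallelism layout. Constants are from the quoted passage; the
# 8-GPUs-per-node figure is the standard H800 node configuration.
PP = 16            # pipeline-parallel stages
EP = 16            # expert-parallel ranks
GPUS_PER_NODE = 8  # H800 node

gpus_per_replica = PP * EP                             # 256 GPUs
nodes_per_replica = gpus_per_replica // GPUS_PER_NODE  # 32 nodes

# ZeRO-1 data parallelism replicates the whole grid, so valid cluster
# sizes are multiples of 32 nodes.
for dp in (1, 2, 4):
    print(f"DP={dp}: {dp * nodes_per_replica} nodes")
```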

Footnotes

  1. [2507.20534] Kimi K2: Open Agentic Intelligence

  2. config.json · moonshotai/Kimi-K2.5 at main

  3. [2602.02276] Kimi K2.5: Visual Agentic Intelligence