Kimi K2.5 is a multimodal model developed by Moonshot AI. It was trained using “early vision fusion”, in which text and image data were mixed from an early stage of training.

Model architecture

Transformer / backbone

Its transformer backbone is a DeepSeek-V3-like mixture-of-experts (MoE) model with 1.04 trillion parameters, organized as follows [1][2]:

Field | Value
--- | ---
Organization | Moonshot AI
Release date | 2026
Model lineage | DeepSeek-V3 derivative
Total parameters | ~1.04T (est.)
Active parameters | ~32B (est.)
Model layers | 61
Layer composition | 1 dense FFN prefix, 60 MoE layers; homogeneous full attention
Attention variant | MLA
Q heads / head dim | 64 heads / 192 dim
KV geometry | kv_lora_rank=512, q_lora_rank=1536
Attention peculiarities | none
Context length | 256k tokens
Position encoding | YaRN (×64, from 4k base)
FFN type | MoE
MoE experts | 384 total / 8 active + 1 shared
MoE activation ratio | 2.1%
MoE routing | sigmoid + noaux_tc
Multi-token prediction | none
Sequence mixer | attention only
Modalities | text + vision + video
Native dtype | bf16
Quantization | int4-g32
Quantization exclusions | attention, shared expert, dense FFN prefix
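
Two of the derived figures in the table can be checked with simple arithmetic. The sketch below is only a sanity check on the listed values; it assumes that "4k base" means 4096 tokens and that the activation ratio counts routed experts only (the shared expert is excluded).

```python
# Values copied from the table; the variable names are illustrative.
total_experts = 384
active_routed_experts = 8      # assumption: the shared expert is not counted
yarn_factor = 64
base_context = 4096            # assumption: "4k base" means 4096 tokens

# Fraction of routed experts selected for each token.
activation_ratio = active_routed_experts / total_experts
print(f"MoE activation ratio: {activation_ratio:.1%}")

# Context window after YaRN scaling of the base window.
extended_context = base_context * yarn_factor
print(f"Extended context: {extended_context} tokens ({extended_context // 1024}k)")
```

Under these assumptions the ratio rounds to 2.1% and the scaled window comes to 256k tokens, matching the table.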

Vision encoder

Kimi K2.5 uses the MoonViT-3D vision transformer, which is organized as follows [2]:

Field | Value
--- | ---
Name | MoonViT-3D
Parameters | ~425M (est.)
Layers | 27
Hidden size | 1152
Intermediate size | 4304
Attention heads | 16
Patch size | 14x14 px
Input resolution | dynamic tiling
Token compression | 2x2 spatial merge + patchmerger projector
Video support | yes (spatial-temporal attention, divided position embeddings)
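
To get a feel for what the patch size and the 2x2 spatial merge mean for sequence length, the following sketch estimates how many vision tokens a single image produces. The helper `vision_tokens` is hypothetical, and it assumes the image is padded up to a whole number of merged 28x28 px cells; the actual preprocessing (resizing, tiling limits) may differ.

```python
import math

PATCH = 14   # patch size in pixels, from the table
MERGE = 2    # 2x2 spatial merge, from the table


def vision_tokens(height_px: int, width_px: int) -> int:
    """Estimate the tokens handed to the projector for one image.

    Assumption: the image is padded up to a whole number of merged cells,
    each covering (PATCH * MERGE) x (PATCH * MERGE) pixels.
    """
    cell = PATCH * MERGE
    rows = math.ceil(height_px / cell)
    cols = math.ceil(width_px / cell)
    return rows * cols


print(vision_tokens(448, 448))  # 16 x 16 merged cells -> 256
```

Without the merge the same 448x448 image would yield 32 x 32 = 1024 patch tokens, so the merge cuts the vision sequence length by a factor of four.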

Projector

A multilayer perceptron (MLP) projector connects the vision encoder to the transformer backbone. It takes the spatially merged patch features and projects them into the transformer’s embedding dimension. After the 2x2 spatial merge, each output token has 4 × 1152 = 4608 dimensions, which are then projected into the LLM’s 7168-dimensional hidden space. The projector consists of:

  • a LayerNorm with 9216 parameters (a learned scale and shift for each of the 4608 merged features), which normalizes each token’s feature vector
  • the first linear projection (this may project directly into the transformer’s 7168-dimensional hidden space; its output width is not confirmed here)
  • a GELU (Gaussian Error Linear Unit) activation, a smooth nonlinearity that adds no parameters
  • the second linear projection
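
The parameter counts for these components follow from standard formulas: a LayerNorm over d features has 2d parameters (scale and shift), and a linear layer from m to n features has m·n weights plus n biases. The sketch below applies them under stated assumptions. The width of the intermediate layer (`mid_dim`) is not given above, so two hypothetical choices are shown; bias terms on both linear layers are also an assumption.

```python
VIT_HIDDEN = 1152               # vision encoder hidden size
MERGED = 4 * VIT_HIDDEN         # 2x2 spatial merge concatenates 4 patches -> 4608
LLM_HIDDEN = 7168               # transformer hidden size


def layernorm_params(dim: int) -> int:
    """One scale and one shift parameter per feature."""
    return 2 * dim


def linear_params(in_dim: int, out_dim: int, bias: bool = True) -> int:
    """Weight matrix plus an optional bias vector."""
    return in_dim * out_dim + (out_dim if bias else 0)


def projector_params(mid_dim: int) -> dict:
    """Parameter counts for LayerNorm -> Linear -> GELU -> Linear.

    mid_dim is hypothetical: the output width of the first projection
    is not confirmed by the text.
    """
    return {
        "layernorm": layernorm_params(MERGED),
        "linear_1": linear_params(MERGED, mid_dim),
        "gelu": 0,  # activation functions carry no parameters
        "linear_2": linear_params(mid_dim, LLM_HIDDEN),
    }


# Two candidate widths: keep 4608, or go straight to 7168.
for mid in (MERGED, LLM_HIDDEN):
    counts = projector_params(mid)
    print(f"mid_dim={mid}: {counts} total={sum(counts.values())}")
```

Either way the LayerNorm term comes to 2 × 4608 = 9216, which agrees with the figure in the list; the linear-layer totals depend on which intermediate width the model actually uses.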

Training infrastructure

From the Kimi K2.5 technical report [3]:

Kimi K2.5 is trained on NVIDIA H800 GPU clusters with 8×400 Gbps RoCE interconnects across nodes. We employ a flexible parallelism strategy combining 16-way Pipeline Parallelism (PP) with virtual stages, 16-way Expert Parallelism (EP), and ZeRO-1 Data Parallelism, enabling training on any number of nodes that is a multiple of 32. EP all-to-all communication is overlapped with computation under interleaved 1F1B scheduling. To fit activations within GPU memory constraints, we apply selective recomputation for LayerNorm, SwiGLU, and MLA up-projections, compress insensitive activations to FP8-E4M3, and offload remaining activations to CPU with overlapped streaming.
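
One plausible reading of the "multiple of 32" constraint is that 16-way PP times 16-way EP spans 256 GPUs, which is 32 nodes if each node holds 8 GPUs. The per-node GPU count is an assumption here (the quoted passage does not state it), and the helper name `parallel_units` is hypothetical. A rough sketch of that arithmetic:

```python
GPUS_PER_NODE = 8   # assumption: not stated in the quoted passage
PP = 16             # pipeline-parallel degree, from the quote
EP = 16             # expert-parallel degree, from the quote

gpus_per_unit = PP * EP                           # GPUs in one PP x EP grid
nodes_per_unit = gpus_per_unit // GPUS_PER_NODE   # nodes in one such grid


def parallel_units(num_nodes: int) -> int:
    """Number of complete PP x EP grids that fit on num_nodes nodes.

    Raises ValueError when the node count is not a whole multiple of the
    grid size, mirroring the constraint described in the quote.
    """
    if num_nodes % nodes_per_unit != 0:
        raise ValueError(f"node count must be a multiple of {nodes_per_unit}")
    return num_nodes // nodes_per_unit


print(gpus_per_unit, nodes_per_unit)   # 256 GPUs, 32 nodes
print(parallel_units(64))              # 2 grids on 64 nodes
```

Under this reading, adding nodes in steps of 32 adds whole grids, across which ZeRO-1 data parallelism can shard optimizer state.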

Footnotes

  1. [2507.20534] Kimi K2: Open Agentic Intelligence

  2. config.json · moonshotai/Kimi-K2.5 at main

  3. [2602.02276] Kimi K2.5: Visual Agentic Intelligence