Kimi K2.5 is a multimodal model developed by Moonshot AI. It was trained with “early vision fusion”: text and image data were mixed together from an early stage of training.
- Its transformer backbone is Kimi K2 [1], an MoE model with 1.04 trillion total parameters, of which 32 billion are active per token.
- Its vision encoder is MoonViT-3D, with around 425 million parameters.
- It uses a multilayer perceptron projector to connect the vision encoder to the transformer backbone.
Model architecture
Transformer / backbone
Its transformer backbone is a DeepSeek-V3-like MoE with 1.04 trillion parameters, organized as follows: [1][2]
| Field | Value |
|---|---|
| Organization | Moonshot AI |
| Release date | 2026 |
| Model lineage | DeepSeek-V3 derivative |
| Total parameters | ~1.04T (est.) |
| Active parameters | ~32B (est.) |
| Model layers | 61 |
| Layer composition | 1 dense FFN prefix, 60 MoE layers; homogeneous full attention |
| Attention variant | MLA |
| Q heads / head dim | 64 heads / 192 dim |
| KV geometry | kv_lora_rank=512, q_lora_rank=1536 |
| Attention peculiarities | none |
| Context length | 256k tokens |
| Position encoding | YaRN (×64, from 4k base) |
| FFN type | MoE |
| MoE experts | 384 total / 8 active + 1 shared |
| MoE activation ratio | 2.1% (8 of 384 routed experts) |
| MoE routing | sigmoid + noaux_tc |
| Multi-token prediction | none |
| Sequence mixer | attention only |
| Modalities | text + vision + video |
| Native dtype | bf16 |
| Quantization | int4-g32 |
| Quantization exclusions | attention, shared expert, dense FFN prefix |
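The ratios in the table can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check on the listed values, not anything taken from the model's code. It assumes that "4k base" means 4096 tokens and uses the 1.04 trillion total stated in the text; the constant and function names are made up for illustration.

```python
# Values copied from the backbone table and the surrounding text.
ROUTED_EXPERTS = 384
ACTIVE_ROUTED_EXPERTS = 8
SHARED_EXPERTS = 1
TOTAL_PARAMS = 1.04e12      # ~1.04 trillion, as stated in the text
ACTIVE_PARAMS = 32e9        # ~32 billion
ROPE_BASE_CONTEXT = 4096    # assumption: "4k base" means 4096 tokens
YARN_FACTOR = 64
NUM_LAYERS = 61
DENSE_PREFIX_LAYERS = 1


def summarize():
    """Derive the headline ratios from the table values."""
    return {
        # Fraction of routed experts selected for each token.
        "routed_expert_ratio": ACTIVE_ROUTED_EXPERTS / ROUTED_EXPERTS,
        # Expert FFNs actually evaluated per token (routed + shared).
        "experts_run_per_token": ACTIVE_ROUTED_EXPERTS + SHARED_EXPERTS,
        # Fraction of all parameters touched per token.
        "active_param_fraction": ACTIVE_PARAMS / TOTAL_PARAMS,
        # Context window after YaRN scaling.
        "extended_context": ROPE_BASE_CONTEXT * YARN_FACTOR,
        # Layers that use MoE rather than a dense FFN.
        "moe_layers": NUM_LAYERS - DENSE_PREFIX_LAYERS,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value}")
```

Under these assumptions the routed-expert ratio is 8/384, which rounds to the 2.1% in the table, and 4096 × 64 gives 262,144 tokens, matching the 256k context length. The share of parameters active per token (32B of 1.04T) is a different quantity, roughly 3%.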
Vision encoder
Kimi K2.5 uses the MoonViT-3D vision transformer, which is organized as follows: [2]
| Field | Value |
|---|---|
| Name | MoonViT-3D |
| Parameters | ~425M (est.) |
| Layers | 27 |
| Hidden size | 1152 |
| Intermediate size | 4304 |
| Attention heads | 16 |
| Patch size | 14x14 px |
| Input resolution | dynamic tiling |
| Token compression | 2x2 spatial merge + patchmerger projector |
| Video support | yes (spatial-temporal attention, divided position embeddings) |
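As a rough plausibility check on the parameter estimate, the table values can be plugged into the usual parameter count for a ViT block. This is a sketch under stated assumptions, not the actual MoonViT-3D layout: it assumes standard pre-norm blocks with biased Q/K/V/output projections and a two-matrix (non-gated) MLP, and it ignores the patch embedding, position embeddings and the projector. The helper names are hypothetical.

```python
# Values copied from the vision encoder table.
HIDDEN = 1152
INTERMEDIATE = 4304
LAYERS = 27
PATCH = 14   # pixels per patch side
MERGE = 2    # 2x2 spatial merge


def block_params(hidden=HIDDEN, intermediate=INTERMEDIATE):
    """Parameters in one transformer block (assumed standard ViT layout)."""
    # Q, K, V and output projections, each hidden x hidden plus a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two-matrix MLP: up-projection and down-projection, each with a bias.
    mlp = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    # Two LayerNorms, each with a scale and a shift vector.
    norms = 2 * (2 * hidden)
    return attention + mlp + norms


def encoder_block_params(layers=LAYERS):
    """Parameters in all transformer blocks, excluding embeddings."""
    return layers * block_params()


def tokens_after_merge(height_px, width_px):
    """Visual tokens for one image tile after patching and 2x2 merging."""
    cell = PATCH * MERGE
    if height_px % cell or width_px % cell:
        raise ValueError(f"image sides must be multiples of {cell} px")
    patches = (height_px // PATCH) * (width_px // PATCH)
    return patches // (MERGE * MERGE)


if __name__ == "__main__":
    print(f"blocks only: {encoder_block_params() / 1e6:.1f}M parameters")
    print(f"448x448 tile: {tokens_after_merge(448, 448)} tokens")
```

With these assumptions the blocks alone come to roughly 411M parameters, the same order as the ~425M estimate in the table; the remainder would have to come from components this sketch leaves out.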
Projector
A multilayer perceptron (MLP) projector connects the vision encoder to the transformer backbone. It takes the spatially merged patch features and projects them into the transformer’s embedding dimension. After the 2x2 spatial merge, each output token has 4 × 1152 = 4608 dimensions, which are then projected into the LLM’s 7168-dimensional hidden size. The projector contributes:
- 9216 parameters for the LayerNorm (a scale and a shift for each of the 4608 input dimensions), which normalizes each token’s features to zero mean and unit variance before rescaling them.
- The parameters of the first linear projection. Whether this layer projects directly into the transformer’s 7168-dimensional hidden size is not confirmed here.
- A GELU (Gaussian Error Linear Unit) activation, a smooth nonlinearity applied elementwise between the two projections. It adds no parameters.
- The parameters of the second linear projection.
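The projector components listed above can be sketched with a small parameter counter plus reference implementations of LayerNorm and GELU. This is an illustration, not the model's actual code. The merged width of 4608 follows from the 2x2 merge of 1152-dimensional patch features. The width between the two linear layers (`mid_dim`) is not given in the text, so it is left as a parameter, and the presence of bias terms is an assumption.

```python
import math

VIT_HIDDEN = 1152    # vision encoder hidden size
MERGE = 2            # 2x2 spatial merge
LLM_HIDDEN = 7168    # transformer hidden size
MERGED_DIM = VIT_HIDDEN * MERGE * MERGE  # 4608 features per merged token


def layernorm_params(dim):
    """One scale and one shift value per feature."""
    return 2 * dim


def linear_params(d_in, d_out, bias=True):
    """Weight matrix plus optional bias (bias is an assumption here)."""
    return d_in * d_out + (d_out if bias else 0)


def projector_params(mid_dim):
    """LayerNorm -> Linear -> GELU -> Linear; mid_dim is hypothetical."""
    return (
        layernorm_params(MERGED_DIM)
        + linear_params(MERGED_DIM, mid_dim)
        + linear_params(mid_dim, LLM_HIDDEN)
    )


def layer_norm(values, eps=1e-5):
    """Normalize one token's features to zero mean and unit variance.

    The learned scale and shift are omitted (equivalent to scale=1, shift=0).
    """
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def gelu(x):
    """Exact GELU: x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


if __name__ == "__main__":
    print("LayerNorm parameters:", layernorm_params(MERGED_DIM))
    # Two candidate layouts, since the intermediate width is not confirmed.
    for mid in (MERGED_DIM, LLM_HIDDEN):
        print(f"mid_dim={mid}: {projector_params(mid):,} parameters")
```

The LayerNorm count reproduces the 9216 figure. The two linear layers dominate the total, so the projector's size depends heavily on which intermediate width is actually used.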
Training infrastructure
From the Kimi K2.5 technical report: [3]
Kimi K2.5 is trained on NVIDIA H800 GPU clusters with 8×400 Gbps RoCE interconnects across nodes. We employ a flexible parallelism strategy combining 16-way Pipeline Parallelism (PP) with virtual stages, 16-way Expert Parallelism (EP), and ZeRO-1 Data Parallelism, enabling training on any number of nodes that is a multiple of 32. EP all-to-all communication is overlapped with computation under interleaved 1F1B scheduling. To fit activations within GPU memory constraints, we apply selective recomputation for LayerNorm, SwiGLU, and MLA up-projections, compress insensitive activations to FP8-E4M3, and offload remaining activations to CPU with overlapped streaming.
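One way to read the "multiple of 32 nodes" constraint is that 16-way PP times 16-way EP needs 256 GPUs, which is 32 nodes if each node holds 8 GPUs. The quote does not state the GPU count per node, so that figure is an assumption in the sketch below, as are the function and field names. How EP ranks map onto the data-parallel dimension is also not specified, so the sketch only counts complete PP × EP groups.

```python
GPUS_PER_NODE = 8   # assumption: not stated in the quoted passage
PP_WAYS = 16        # pipeline parallelism, from the quote
EP_WAYS = 16        # expert parallelism, from the quote
NODE_MULTIPLE = 32  # node-count granularity, from the quote


def parallel_layout(num_nodes):
    """Count complete PP x EP groups for a given number of nodes."""
    if num_nodes <= 0 or num_nodes % NODE_MULTIPLE != 0:
        raise ValueError(f"node count must be a positive multiple of {NODE_MULTIPLE}")
    total_gpus = num_nodes * GPUS_PER_NODE
    gpus_per_group = PP_WAYS * EP_WAYS  # 256 GPUs hold one PP x EP grid
    return {
        "total_gpus": total_gpus,
        "gpus_per_group": gpus_per_group,
        "groups": total_gpus // gpus_per_group,
    }


if __name__ == "__main__":
    for nodes in (32, 64, 128):
        print(nodes, parallel_layout(nodes))
```

Under the 8-GPUs-per-node assumption, every valid node count yields a whole number of 256-GPU groups, which is consistent with the constraint quoted from the report.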