Kimi K2.5

Kimi K2.5 is a multimodal model developed by Moonshot AI. It was trained with “early vision fusion”, in which text and image data are mixed from an early stage of training.

Its transformer backbone is Kimi K2,¹ an MoE model with 1.04 trillion total parameters, of which 32 billion are active.
Its vision encoder is MoonViT-3D, with around 425 million parameters.
It uses a multilayer perceptron projector to connect the vision encoder to the transformer backbone.

Model architecture

Transformer / backbone

Its transformer backbone is a DeepSeek-V3-like MoE with 1.04 trillion parameters, organized as follows:¹²

Field	Value
Organization	Moonshot AI
Release date	2026
Model lineage	DeepSeek-V3 derivative
Total parameters	~1.04T (est.)
Active parameters	~32B (est.)
Model layers	61
Layer composition	1 dense FFN prefix, 60 MoE layers; homogeneous full attention
Attention variant	MLA
Q heads / head dim	64 heads / 192 dim
KV geometry	`kv_lora_rank`=512, `q_lora_rank`=1536
Attention peculiarities	none
Context length	256k tokens
Position encoding	YaRN (×64, from 4k base)
FFN type	MoE
MoE experts	384 total / 8 active + 1 shared
MoE activation ratio	2.1%
MoE routing	sigmoid + noaux_tc
Multi-token prediction	none
Sequence mixer	attention only
Modalities	text + vision + video
Native dtype	bf16
Quantization	int4-g32
Quantization exclusions	attention, shared expert, dense FFN prefix
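
A few of the table's figures can be cross-checked against each other. The sketch below is only a sanity check, not anything taken from the model's code: the constant names are made up for illustration, and it assumes that “4k” means 4096 tokens and that the activation ratio counts routed experts only.

```python
# Values copied from the backbone table; names are illustrative only.
TOTAL_EXPERTS = 384
ROUTED_ACTIVE = 8
YARN_FACTOR = 64
BASE_CONTEXT = 4 * 1024   # assumption: "4k" means 4096 tokens
TOTAL_PARAMS = 1.04e12    # ~1.04T (est.)
ACTIVE_PARAMS = 32e9      # ~32B (est.)


def routed_activation_ratio(active: int, total: int) -> float:
    """Fraction of routed experts selected per token (shared expert excluded)."""
    return active / total


def extended_context(base: int, factor: int) -> int:
    """Context length after scaling the base window by the YaRN factor."""
    return base * factor


if __name__ == "__main__":
    ratio = routed_activation_ratio(ROUTED_ACTIVE, TOTAL_EXPERTS)
    print(f"routed expert ratio: {ratio:.1%}")
    print(f"context tokens: {extended_context(BASE_CONTEXT, YARN_FACTOR)}")
    print(f"active / total parameters: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

The expert ratio (8 of 384) and the parameter ratio (32B of 1.04T) differ because attention, the dense FFN prefix, and the shared expert are active for every token.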

Vision encoder

Kimi K2.5 uses the MoonViT-3D vision transformer which is organized as follows:²

Field	Value
Name	MoonViT-3D
Parameters	~425M (est.)
Layers	27
Hidden size	1152
Intermediate size	4304
Attention heads	16
Patch size	14x14 px
Input resolution	dynamic tiling
Token compression	2x2 spatial merge + patchmerger projector
Video support	yes (spatial-temporal attention, divided position embeddings)
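
To get a feel for what the patch size and the 2x2 spatial merge mean for sequence length, here is a rough token-count sketch for a single image. The function name is hypothetical, and it assumes the image has already been resized so that both sides are multiples of 28 px (14 px patch times merge factor 2); the actual resizing, tiling, and video handling are not covered by this note.

```python
PATCH_PX = 14   # patch size from the table
MERGE = 2       # 2x2 spatial merge from the table


def vision_tokens(height_px: int, width_px: int,
                  patch: int = PATCH_PX, merge: int = MERGE) -> int:
    """Number of tokens handed to the projector for one image.

    Assumes both sides are exact multiples of patch * merge.
    """
    unit = patch * merge
    if height_px % unit or width_px % unit:
        raise ValueError(f"sides must be multiples of {unit} px")
    patches_h = height_px // patch
    patches_w = width_px // patch
    # Each merge x merge block of patches becomes one token.
    return (patches_h // merge) * (patches_w // merge)


if __name__ == "__main__":
    print(vision_tokens(448, 448))   # 32x32 patches -> 16x16 tokens
    print(vision_tokens(896, 448))   # 64x32 patches -> 32x16 tokens
```

The merge cuts the token count by a factor of four compared with feeding every patch to the LLM.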

Projector

A multilayer perceptron (MLP) projector connects the vision encoder to the transformer backbone. It takes the $2 \times 2$ spatially merged patch features and projects them into the transformer’s embedding dimension. After the spatial merge, each output token has $4 \times 1152 = 4608$ dimensions, which are then projected into the LLM’s hidden size of 7168. The projector contributes the following parameters:

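9216 parameters for the LayerNorm, which normalizes each merged token’s 4608 features to zero mean and unit variance and then applies a learned per-feature scale and shift ($2 \times 4608 = 9216$)
$4608 \times 7168 \approx 33 M$ parameters for the first linear projection, which, judging by this count, maps the 4608-dimensional merged features directly into the transformer’s 7168-dimensional hidden space
A GELU (Gaussian Error Linear Unit) activation is applied at this point; it is a smooth nonlinearity with no parameters
$7168 \times 7168 \approx 51 M$ parameters for the second projection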
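
The counts above can be reproduced in a few lines, along with minimal reference definitions of LayerNorm and GELU. This is a sketch that follows the layer shapes as described in this note, not the model's actual code: the function names are made up, bias terms of the linear layers are left out to match the counts above, and the `eps` value is an assumption.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one feature vector, then apply learned scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]


def gelu(v):
    """Exact GELU: v times the standard normal CDF evaluated at v."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def projector_param_count(vit_hidden=1152, merge=2, llm_hidden=7168):
    """Weight counts per projector stage (linear biases excluded)."""
    merged = merge * merge * vit_hidden          # 4 * 1152 = 4608
    return {
        "layer_norm": 2 * merged,                # scale + shift
        "linear_1": merged * llm_hidden,
        "linear_2": llm_hidden * llm_hidden,
    }


if __name__ == "__main__":
    counts = projector_param_count()
    for name, n in counts.items():
        print(f"{name}: {n:,}")
    print(f"total: {sum(counts.values()):,}")
    print(layer_norm([1.0, 2.0, 3.0], [1.0] * 3, [0.0] * 3))
    print(gelu(1.0))
```

Under these assumptions the projector adds up to roughly 84M parameters, small next to the ~425M vision encoder.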

Training infrastructure

From the Kimi K2.5 technical report:³

Kimi K2.5 is trained on NVIDIA H800 GPU clusters with 8×400 Gbps RoCE interconnects across nodes. We employ a flexible parallelism strategy combining 16-way Pipeline Parallelism (PP) with virtual stages, 16-way Expert Parallelism (EP), and ZeRO-1 Data Parallelism, enabling training on any number of nodes that is a multiple of 32. EP all-to-all communication is overlapped with computation under interleaved 1F1B scheduling. To fit activations within GPU memory constraints, we apply selective recomputation for LayerNorm, SwiGLU, and MLA up-projections, compress insensitive activations to FP8-E4M3, and offload remaining activations to CPU with overlapped streaming.
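
One way to read the “multiple of 32” constraint: a 16-way PP by 16-way EP layout needs 256 GPUs, which is 32 nodes if each node holds 8 GPUs. The GPUs-per-node figure is an assumption (the quote only gives the interconnect count), and the function names below are hypothetical; the sketch just checks that this reading is arithmetically consistent with the quote.

```python
PP = 16              # pipeline-parallel degree, from the quote
EP = 16              # expert-parallel degree, from the quote
GPUS_PER_NODE = 8    # assumption: not stated in the quote


def min_node_unit(pp: int = PP, ep: int = EP,
                  gpus_per_node: int = GPUS_PER_NODE) -> int:
    """Smallest node count that fits one full PP x EP layout."""
    gpus = pp * ep
    if gpus % gpus_per_node:
        raise ValueError("PP * EP must fill whole nodes")
    return gpus // gpus_per_node


def layout_copies(num_nodes: int) -> int:
    """How many PP x EP layouts fit on a cluster of num_nodes nodes."""
    unit = min_node_unit()
    if num_nodes % unit:
        raise ValueError(f"node count must be a multiple of {unit}")
    return num_nodes // unit


if __name__ == "__main__":
    print(min_node_unit())       # nodes per PP x EP layout
    print(layout_copies(64))     # copies on a 64-node cluster
```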
