GLM-5 is a mixture-of-experts (MoE) model developed by Zhipu AI and Tsinghua University. It has 744B total parameters and 256 experts; each token activates 1 shared expert and 8 routed experts, for about 40B active parameters per token. It was trained on 28.5T tokens.
Architecture
From the GLM-5 technical report [1]:
| Model | GLM-4.5 | GLM-5 |
|---|---|---|
| # Total Parameters | 355B | 744B |
| # Activated Parameters | 32B | 40B |
| # Dense Layers | 3 | 3 |
| # MoE Layers | 89 | 75 |
| # MTP Layers | 1 | 1 |
| Hidden Dim | 5120 | 6144 |
| Dense Intermediate Dim | 12288 | 12288 |
| MoE Intermediate Dim | 1536 | 2048 |
| QK Head Dim | 128 | 192 |
| V Head Dim | 128 | 256 |
| Q LoRA Dim | – | 2048 |
| KV LoRA Dim | – | 512 |
| # Attention Heads | 96 | 64 |
| # Key-Value Heads | 8 | – |
| # Indexer Attn Heads | – | 32 |
| # Indexer Head Dim | – | 128 |
| # Experts (total) | 160 | 256 |
| # Routed Experts | 8 | 8 |
| # Shared Experts | 1 | 1 |
| Vocabulary Size | 151552 | 154880 |
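
As a rough sanity check on the table, the feed-forward (FFN) weights alone can be tallied from the listed dimensions. The sketch below is not taken from the report. It assumes each FFN is a gated block with three bias-free weight matrices (gate, up, down), and that "# Experts (total)" counts the routed pool with the shared expert held in addition. It ignores attention, indexer, MTP, embedding, and router weights, so it should land below the headline totals. The names `MoEConfig`, `gated_ffn_params`, and `ffn_param_counts` are hypothetical helpers introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEConfig:
    """FFN-related rows copied from the architecture table."""
    hidden_dim: int
    dense_layers: int
    moe_layers: int
    dense_intermediate_dim: int
    moe_intermediate_dim: int
    total_experts: int    # assumed to be the routed pool per MoE layer
    routed_experts: int   # routed experts activated per token
    shared_experts: int   # always-active experts, assumed extra to the pool


def gated_ffn_params(hidden_dim: int, intermediate_dim: int) -> int:
    # Assumption: gate, up, and down projections with no biases.
    return 3 * hidden_dim * intermediate_dim


def ffn_param_counts(cfg: MoEConfig) -> tuple[int, int]:
    """Return (total, active-per-token) FFN parameter counts."""
    expert = gated_ffn_params(cfg.hidden_dim, cfg.moe_intermediate_dim)
    dense = cfg.dense_layers * gated_ffn_params(
        cfg.hidden_dim, cfg.dense_intermediate_dim
    )
    total = dense + cfg.moe_layers * (cfg.total_experts + cfg.shared_experts) * expert
    active = dense + cfg.moe_layers * (cfg.routed_experts + cfg.shared_experts) * expert
    return total, active


GLM_4_5 = MoEConfig(5120, 3, 89, 12288, 1536, 160, 8, 1)
GLM_5 = MoEConfig(6144, 3, 75, 12288, 2048, 256, 8, 1)

for name, cfg in (("GLM-4.5", GLM_4_5), ("GLM-5", GLM_5)):
    total, active = ffn_param_counts(cfg)
    print(f"{name}: FFN total = {total / 1e9:.1f}B, FFN active = {active / 1e9:.1f}B")
```

Under these assumptions the FFN weights account for roughly 728B of GLM-5's 744B total parameters and roughly 26B of its 40B activated parameters. The remainder would come from the components the sketch leaves out.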
Data pipeline
The model was trained on 28.5 trillion tokens.