Trinity is a family of open-source MoE transformers developed by Arcee.

From Trinity Large:

Training Infrastructure & Scale

  • 2,048 B300 GPUs for 33 days of pretraining (claimed as “largest publicly stated” B300 run)
  • Total cost: $20M all-in for 4 models over 6 months (compute, salaries, data, storage, ops)
  • Training throughput optimization via HSDP with expert parallelism=8, totaling 2,048 data-parallel ranks (one possible mesh layout is sketched after this list)
  • Batch size increased after 5T tokens (justified by the model’s high sparsity plus Muon’s tolerance for a larger critical batch size)
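
A minimal device-mesh sketch (PyTorch) of one way such a layout could be expressed, assuming a hypothetical 32 x 64 replicate/shard factoring of the 2,048 ranks with the 8-way expert-parallel group carved out of the shard dimension; Arcee has not published its actual mesh shapes, so every number below except the world size and the EP degree is an assumption.

```python
# Illustrative sketch only; intended to be launched with torchrun across 2,048 ranks.
# The 32 x 64 factoring and the placement of the EP group are assumptions, not Arcee's setup.
from torch.distributed.device_mesh import init_device_mesh

WORLD_SIZE = 2048                  # 2,048 B300 GPUs
SHARD = 64                         # hypothetical FSDP shard-group size
REPLICATE = WORLD_SIZE // SHARD    # 32 replica groups
EP = 8                             # expert-parallel degree stated in the post

# HSDP: parameters are sharded within each "dp_shard" group and replicated across groups.
mesh = init_device_mesh(
    "cuda",
    (REPLICATE, SHARD),
    mesh_dim_names=("dp_replicate", "dp_shard"),
)

# If the 8-way expert-parallel group is taken from inside the shard dimension,
# every one of the 2,048 ranks still acts as a data-parallel rank for the dense
# layers, which is consistent with the "2,048 data-parallel ranks" framing.
assert SHARD % EP == 0
```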

Architecture & Sparsity Trade-offs

400B total parameters, 13B active per token:

  • 256 experts, 4 active per token (a 1.56% routing fraction, sparser than every model they compare against except Llama 4 Maverick at 0.78%); see the quick check after this list
  • 6 dense layers (increased from planned 3) to stabilize routing at this sparsity level
  • Claims 2-3x inference throughput advantage vs same-weight-class models on same hardware
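
Quick arithmetic on those figures, computed directly from the numbers above (the contrast between routing fraction and active-parameter fraction is my own framing, not Arcee’s):

```python
# Sanity check on the sparsity figures quoted above.
experts_total, experts_active = 256, 4
routing_fraction = experts_active / experts_total
print(f"routing fraction: {routing_fraction:.2%}")               # 1.56%

params_total, params_active = 400e9, 13e9
active_param_fraction = params_active / params_total
print(f"active params per token: {active_param_fraction:.2%}")   # 3.25%
# The active-parameter fraction is roughly twice the routing fraction because
# attention, embeddings, and the 6 dense layers run for every token.
```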

MoE routing stability mechanics:

  • Momentum-based expert load balancing (router bias adjustment with tanh clipping + per-sequence balance loss); a rough sketch of this combination follows this list
  • z-loss regularization to prevent LM-head logit drift
  • They explicitly state routing stability was a challenge requiring architectural adjustment mid-design
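
A minimal sketch of what that combination can look like, assuming an auxiliary-loss-free-style bias that only affects expert selection, a momentum-plus-tanh-clipped bias update, a Switch-style balance loss computed per sequence, and a z-loss applied to the LM-head logits (matching the bullet above). Every function name, update rule, and coefficient here is an illustrative assumption; Arcee has not published the router code.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes and coefficients, for illustration only.
NUM_EXPERTS, TOP_K = 256, 4
BIAS_LR, MOMENTUM, BAL_COEF, Z_COEF = 1e-3, 0.9, 1e-2, 1e-4

expert_bias = torch.zeros(NUM_EXPERTS)    # added to router scores for selection only
bias_velocity = torch.zeros(NUM_EXPERTS)  # momentum buffer for the bias update


def route(router_logits: torch.Tensor):
    """Top-k routing for one sequence; the bias steers selection but not the gates."""
    scores = router_logits + expert_bias                       # bias affects selection only
    topk = scores.topk(TOP_K, dim=-1).indices                  # [tokens, TOP_K]
    gates = F.softmax(router_logits.gather(-1, topk), dim=-1)  # gates from unbiased logits
    return topk, gates


def update_expert_bias(topk: torch.Tensor):
    """Momentum-based load balancing: push the bias against overloaded experts,
    with tanh clipping bounding the step size (no gradients involved)."""
    global expert_bias, bias_velocity
    load = torch.bincount(topk.flatten(), minlength=NUM_EXPERTS).float()
    imbalance = load / load.mean() - 1.0          # >0 overloaded, <0 underloaded
    bias_velocity = MOMENTUM * bias_velocity + imbalance
    expert_bias = expert_bias - BIAS_LR * torch.tanh(bias_velocity)


def aux_losses(router_logits: torch.Tensor, topk: torch.Tensor,
               lm_logits: torch.Tensor) -> torch.Tensor:
    """Per-sequence balance loss plus z-loss on the LM-head logits."""
    probs = F.softmax(router_logits, dim=-1)                   # [tokens, NUM_EXPERTS]
    frac_routed = F.one_hot(topk, NUM_EXPERTS).float().sum(1).mean(0)
    frac_prob = probs.mean(0)
    balance = NUM_EXPERTS * (frac_routed * frac_prob).sum()    # Switch-style balance term
    z_loss = torch.logsumexp(lm_logits, dim=-1).pow(2).mean()  # keeps LM logits from drifting
    return BAL_COEF * balance + Z_COEF * z_loss
```

The design point in bias-based schemes of this kind is that the bias only steers which experts get picked and never enters the gradient path, so load can be rebalanced without leaning on a large auxiliary loss.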

Data Pipeline

17T tokens across 3 phases (10T/4T/3T):

  • Curated by DatologyAI
  • 8T tokens of synthetic data (web, code, math, reasoning, multilingual—14 non-English languages)
  • “State-of-the-art rephrasing approaches” (no specifics)
  • Data mix evolved specifically for Trinity Large vs smaller Trinity models

This is heavy on synthetic data relative to typical ratios. The fact that they call out “curation advancements” suggests they learned something between smaller Trinity models and this one, but they don’t say what broke.

Training dynamics worth noting:

  • Smooth loss curve with “clear phase transitions, no spikes”
  • They frame this as a success after getting “stability dialed in”
  • Muon optimizer mentioned as enabling larger batch sizes than AdamW; a minimal sketch of the Muon update follows this list
  • They reference MiniMax-01 paper for batch-size scaling justification
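
For reference, the core Muon update is SGD-style momentum followed by an approximate orthogonalization of each 2-D weight update via a quintic Newton-Schulz iteration. The sketch below mirrors the structure (and Newton-Schulz coefficients) of the public reference implementation, reduced to a single weight matrix; it is not Arcee’s training code, and the lr/momentum values are placeholders.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2-D update via quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315     # coefficients from the Muon reference impl
    X = G.bfloat16()
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T
    X = X / (X.norm() + 1e-7)             # normalize so the iteration converges
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return (X.T if transposed else X).to(G.dtype)


def muon_step(param: torch.Tensor, grad: torch.Tensor, buf: torch.Tensor,
              lr: float = 0.02, momentum: float = 0.95) -> None:
    """One Muon update for a single 2-D weight matrix (simplified, no Nesterov)."""
    buf.mul_(momentum).add_(grad)               # SGD-style momentum accumulation
    update = newton_schulz_orthogonalize(buf)   # replace the update with its orthogonalized form
    scale = max(1.0, param.size(0) / param.size(1)) ** 0.5
    param.add_(update, alpha=-lr * scale)
```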

This suggests they had instability issues early and had to tune heavily to get clean training. The architectural changes (3 to 6 dense layers) and routing tweaks likely came from failed runs.

Three Checkpoint Release Strategy

There are three variants:

  1. Preview: Light post-training, instruct-style (non-reasoning), optimized for creative tasks and agentic workflows
  2. Base: Full 17T pretraining checkpoint
  3. TrueBase: 10T checkpoint with zero instruct data, no LR annealing. Explicitly marketed as “real baseline” for researchers

Inference Context & Hosting

  • Native 512K context support
  • Preview API running at 128K with 8-bit quantization while they tune infrastructure
  • The launch was framed as a “preview of hosting platform” as much as a model release

Claims vs. Substance

What they say:

  • “Frontier-class foundation model”
  • Matches/exceeds open-base peers across benchmarks
  • 2-3x inference throughput advantage

What’s missing:

  • No detailed hardware utilization metrics (MFU, throughput/GPU); a back-of-the-envelope version is sketched after this list
  • No ablations on sparsity vs performance trade-off
  • Vague on what routing instability they hit and how they fixed it
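
For context, the rough throughput math is recoverable from numbers the post does state (17T tokens, 33 days, 2,048 GPUs, 13B active parameters); turning it into MFU would require the B300 peak FLOP/s and the real measured throughput, which is exactly what is missing. A back-of-the-envelope sketch:

```python
# Implied average throughput from figures stated in the post; actual MFU would
# divide the achieved FLOP/s by the B300's peak dense FLOP/s, which isn't given.
tokens = 17e12                   # 17T training tokens
seconds = 33 * 24 * 3600         # 33 days of pretraining
gpus = 2048
active_params = 13e9             # 13B active parameters per token

tokens_per_sec_per_gpu = tokens / (seconds * gpus)
# ~6N FLOPs per token for a decoder transformer, counting only active params
# and ignoring attention FLOPs (which are non-trivial at long context).
achieved_flops_per_gpu = 6 * active_params * tokens_per_sec_per_gpu

print(f"{tokens_per_sec_per_gpu:,.0f} tok/s/GPU, "
      f"~{achieved_flops_per_gpu / 1e12:.0f} TFLOP/s/GPU achieved on average")
```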