Llama 4 is a family of mixture-of-experts (MoE) models that was meant to follow Llama 3. As far as I can tell, it was a failure.

The full variant of the model, Behemoth, was aborted before it was ever released due to poor performance.[1] This is despite Meta’s claim that it outperforms GPT-4.5 (which itself was a failure).

| Model Variant | Total Parameters | Experts | Active Parameters | Context Length |
|---------------|------------------|---------|-------------------|----------------|
| Scout         | 109B             | 16      | 17B               | 10M tokens     |
| Maverick      | 400B             | 128     | 17B               | 1M tokens      |
| Behemoth      | 2T               | 16      | 288B              | ?              |

Behemoth was trained on 32K GPUs on over 30T tokens.[2]

From *The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation*:

> Llama 4 Maverick models have 17B active parameters and 400B total parameters. We use alternating dense and mixture-of-experts (MoE) layers for inference efficiency. MoE layers use 128 routed experts and a shared expert. Each token is sent to the shared expert and also to one of the 128 routed experts.
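
A minimal sketch of what that layer could look like, written from the description above (this is my PyTorch reconstruction, not Meta's code; the hidden sizes are placeholders):

```python
import torch
import torch.nn as nn


class MoELayer(nn.Module):
    """Sketch of the described MoE layer: one always-on shared expert
    plus 128 routed experts with top-1 (one expert per token) routing."""

    def __init__(self, d_model=4096, d_ff=8192, n_experts=128):
        super().__init__()
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model)
        )
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)  # scores each token per expert

    def forward(self, x):  # x: (n_tokens, d_model)
        # Every token always goes through the shared expert.
        out = self.shared_expert(x)
        # Each token is additionally sent to exactly one routed expert.
        weights, choice = self.router(x).softmax(dim=-1).max(dim=-1)
        for e, expert in enumerate(self.experts):
            mask = choice == e
            if mask.any():
                out[mask] = out[mask] + weights[mask].unsqueeze(-1) * expert(x[mask])
        return out
```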

> With Llama 4, we revamped our post-training pipeline by adopting a different approach: lightweight supervised fine-tuning (SFT) > online reinforcement learning (RL) > lightweight direct preference optimization (DPO). A key learning was that SFT and DPO can over-constrain the model…

> Furthermore, we implemented a continuous online RL strategy, where we alternated between training the model and then using it to continually filter and retain only medium-to-hard difficulty prompts.
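
I read this as a curriculum loop: train, then re-score the prompt pool with the current policy and keep only prompts the model sometimes, but not always, solves. A hedged sketch of that loop; `rl_train_step`, `solve_rate`, and the pass-rate thresholds are my inventions, not Meta's:

```python
def continuous_online_rl(policy, prompt_pool, n_rounds=10, k=8,
                         min_pass=0.1, max_pass=0.8):
    """Alternate between RL training and re-filtering the prompt pool,
    keeping only medium-to-hard prompts for the *current* policy.
    `rl_train_step` and `solve_rate` are hypothetical helpers."""
    for _ in range(n_rounds):
        policy = rl_train_step(policy, prompt_pool)  # one round of online RL
        kept = []
        for prompt in prompt_pool:
            # Fraction of k sampled responses that pass the reward/verifier.
            rate = solve_rate(policy, prompt, n_samples=k)
            # Too easy (always solved) or too hard (never solved) -> drop.
            if min_pass <= rate <= max_pass:
                kept.append(prompt)
        prompt_pool = kept
    return policy
```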

> Llama 4 Scout is a general purpose model with 17 billion active parameters, 16 experts, and 109 billion total parameters that delivers state-of-the-art performance for its class. Llama 4 Scout dramatically increases the supported context length from 128K in Llama 3 to an industry leading 10 million tokens.

> A key innovation in the Llama 4 architecture is the use of interleaved attention layers without positional embeddings. Additionally, we employ inference time temperature scaling of attention to enhance length generalization.
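
The post doesn't give the formula for the temperature scaling, but the idea is presumably to sharpen attention logits as the query position grows so that relevant tokens still stand out at very long contexts. A sketch of one plausible form (the log-of-position schedule and the constants here are my guesses, not from the post):

```python
import math
import torch
import torch.nn.functional as F


def attention_with_temp_scaling(q, k, v, pos, floor=8192.0, scale=0.1):
    """Scaled dot-product attention whose logits are sharpened as the query
    position grows. Causal masking omitted for brevity.
    q: (seq_q, d), k/v: (seq_k, d), pos: float positions, shape (seq_q,),
    e.g. torch.arange(seq_q, dtype=torch.float32)."""
    # Temperature grows slowly (logarithmically) with absolute position.
    temp = 1.0 + scale * torch.log(torch.floor(pos / floor) + 1.0)  # (seq_q,)
    q = q * temp.unsqueeze(-1)  # sharpen long-range queries
    logits = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return F.softmax(logits, dim=-1) @ v
```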

> …we had to prune 95% of the SFT data, as opposed to 50% for smaller models, to achieve the necessary focus on quality and efficiency.
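
In other words, score every SFT example and keep only the top slice. A trivial sketch, assuming some quality scorer exists (`score` is hypothetical; the 5% figure is from the quote):

```python
def prune_sft_data(examples, score, keep_fraction=0.05):
    """Keep only the top 5% of SFT examples by quality score, per the
    quote above (vs. keeping 50% for smaller models). `score` is a
    hypothetical judge: higher = higher quality / harder."""
    ranked = sorted(examples, key=score, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_fraction))]
```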

> We also found that dynamically filtering out prompts with zero advantage during training and constructing training batches with mixed prompts from multiple capabilities were instrumental in providing a performance boost on math, reasoning, and coding.
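
Zero advantage means every sampled response for a prompt earned the same reward (all correct or all wrong), so under a group-relative baseline the policy gradient for that prompt vanishes and it only burns compute. A sketch of the filter, assuming GRPO-style group-relative advantages (`score` and `policy.sample` are hypothetical helpers):

```python
import torch


def build_batch(prompts_by_capability, policy, group_size=8):
    """Mix prompts from several capabilities (math, reasoning, coding, ...)
    into one batch, dropping any prompt whose sampled responses all score
    identically, since its group-relative advantage is zero everywhere."""
    batch = []
    for capability, prompts in prompts_by_capability.items():
        for prompt in prompts:
            rewards = torch.tensor(
                [score(policy.sample(prompt)) for _ in range(group_size)]
            )
            advantages = rewards - rewards.mean()  # group-relative baseline
            if torch.any(advantages != 0):         # zero everywhere -> skip
                batch.append((capability, prompt, advantages))
    return batch
```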

Data quality matters a lot.

> We developed a fully asynchronous online RL training framework that enhanced flexibility.
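
“Fully asynchronous” presumably means generation and training run concurrently rather than in lockstep, with rollouts coming from slightly stale weights. A toy sketch of that decoupling using a queue (my construction, not Meta's framework; every helper here is hypothetical):

```python
import queue
import threading

rollouts = queue.Queue(maxsize=256)


def generator(policy_snapshot):
    """Produces rollouts continuously with whatever weights it last saw."""
    while True:
        prompt = sample_prompt()                  # hypothetical helper
        rollouts.put(policy_snapshot.rollout(prompt))


def learner(policy, batch_size=32, publish_every=100):
    """Consumes rollouts as they arrive; never blocks on generation."""
    step = 0
    while True:
        batch = [rollouts.get() for _ in range(batch_size)]
        policy.update(batch)                      # hypothetical RL step
        step += 1
        if step % publish_every == 0:
            publish_weights(policy)               # refresh the generators


threading.Thread(target=generator, args=(load_snapshot(),), daemon=True).start()
learner(load_policy())
```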

Footnotes

  1. Meta’s New Superintelligence Lab Is Discussing Major A.I. Strategy Changes - The New York Times

  2. The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation