Nemotron 3 is a family of text-to-text hybrid transformer models developed by NVIDIA Research that comes in three flavors: Nano, Super, and Ultra.

Super

Super has 120.6B parameters with 12B active parameters per token.

Architecture

Super has 88 layers which are predominantly Mamba-2 layers followed by mixture of experts layers. There are also attention layers dispersed throughout, but they are not evenly spaced:

SegmentPatternMambaExpertAttention
1MEMEMEM*431
2EMEMEMEM*441
3EMEMEMEM*441
4EMEMEMEMEM*551
5EMEMEMEMEM*551
6EMEMEMEMEM*551
7EMEMEMEMEM*551
8EMEMEMEM*441
9EMEMEMEME450
Total40408

The model’s hidden dimension is 4096. In total, there are1

  • 40 Mamba-2 layers, each with
    • SSM recurrent state:
      • head dimension of 64
      • 128 Mamba heads
      • SSM state size of 128
      • 1,048,576 elements at FP322
      • 2 MiB of SSM state per layer, independent of input token count, must be cached between decodes
    • Convolution state:
      • kernel size of 4
      • state expansion factor of 2
  • 8 Attention layers, each with
    • head dimension of 128
    • 32 query heads (GQA)
    • 2 key-value heads (GQA)
    • 512 elements at FP83 must be cached between decodes
    • 512 bytes per layer per input token must be cached between decodes
  • 40 Expert layers (actually LatentMoE layers), each with
    • 1 shared expert
      • intermediate size of 5376
    • 512 routed experts (22 active per token)
      • latent size of 1024
      • intermediate size of 2688

KV cache savings

Because the model replaces full attention with Mamba layers, the model uses significantly less KV cache than a similarly sized transformer that does not use Mamba layers.

Attention cache

The amount of KV cache that must be maintained grows linearly with the sequence length. As more tokens are generated, so too does the KV cache size.

Each token’s state cache for each attention layer is governed by:

  • Key matrix: [num_key_value_heads x head_dim]
  • Value matrix: [num_key_value_heads x head_dim]

For Nemotron 3 Super, this comes out to be:

  • Key matrix: 2 x 128 = 256 elements, stored in 8-bit precision, for a total of 256 bytes per token
  • Value matrix: 2 x 128 = 256 elements, stored in 8-bit precision, for a total of 256 bytes per token

Across all 8 attention layers, the cache size requires 4096 bytes per token in Nemotron 3 Super.

Mamba cache

The Mamba-2 layers have a fixed state cache that does not grow with sequence length. It is governed by

  • SSM recurrent state: [mamba_num_heads x mamba_head_dim x ssm_state_size]
  • Convolution state: `[(mamba_num_heads x mamba_head_dim + 2 x n_groups x ssm_state_size) x conv_kernel]

For Nemotron 3 Super, this comes out to be

  • SSM recurrent state: 128 x 64 x 128 = 1,048,576 elements, stored in 32-bit precision4 for a total of 4 MiB.
  • Convolution state: 10,240 x 4 = 40,960 elements, stored in the default model precision (16-bit)4 for a total of 80 KiB.

Across all 40 Mamba-2 layers, the cached size is fixed at around 163 MiB in Nemotron 3 Super.

MTP cache

Nemotron 3 Super also includes optional multi-token prediction (MTP) layers which have a small amount of KV cache. The Huggingface implementation does not include this,1 so I won’t either.

Overall

Break-even sequence lengths:

  • vs Qwen3.5-122B: 10,440 tokens (this doesn’t include the DeltaNet layers though)
  • vs GPT-OSS-120B: 5,801 tokens

Below the break-even, GPT-OSS has a smaller total cache because of Nemotron’s fixed Mamba state. Above it, Nemotron wins as KV savings compound.

Training

It was trained on 15,573,172,908,990 (15.6T) tokens from across 153 datasets,4 some of which are public.

Footnotes

  1. config.json · nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-Base-BF16 at main 2

  2. The recurrent state is saved as FP32 because errors compound. See NVIDIA-Nemotron-3-Super-Technical-Report.pdf

  3. The model card4 shows a vLLM example with --kv-cache-dtype fp8.

  4. nemotron-3-super-120b-a12b Model by NVIDIA | NVIDIA NIM 2 3 4