Mamba is an alternative to attention in which, instead of re-computing all pairwise interactions between tokens the way attention does, the model keeps a fixed-size hidden state that is updated each time a new token arrives. This hidden state is designed to carry long-range dependencies (like attention), but during training and prefill its compute cost scales linearly with sequence length rather than quadratically (as attention's does).
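
The update rule can be sketched with a toy scalar recurrence (this is an illustration of the linear-scan structure, not the actual Mamba kernel; the coefficients a, b, c are placeholders):

```python
import numpy as np

def recurrent_scan(x, a=0.9, b=0.1, c=1.0):
    """Toy state-space recurrence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.

    One fixed-cost update per token, so total cost is O(L) in sequence
    length, vs. attention's O(L^2) pairwise scores.
    """
    h = 0.0
    ys = []
    for x_t in x:
        h = a * h + b * x_t  # fixed-size state carries the history
        ys.append(c * h)
    return np.array(ys)

y = recurrent_scan(np.ones(8))  # 8 tokens -> 8 fixed-cost updates
```

Real Mamba layers use matrix-valued states and input-dependent coefficients, but the constant-size-state, one-update-per-token structure is the same.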

Mamba-2 is an updated form of Mamba.

Principles

A single Mamba layer has two components that are defined by a few variables:1

  1. SSM recurrent state
    • head count (mamba_num_heads)
    • head dimension (mamba_head_dim)
    • SSM state size (ssm_state_size)
    • Shape: [mamba_num_heads x mamba_head_dim x ssm_state_size]
    • Elements: mamba_num_heads x mamba_head_dim x ssm_state_size
  2. Convolution state
    • head count (mamba_num_heads)
    • head dimension (mamba_head_dim)
    • number of groups (n_groups)
    • SSM state size (ssm_state_size)
    • kernel size (conv_kernel)
    • Shape: [(mamba_num_heads x mamba_head_dim + 2 x n_groups x ssm_state_size) x conv_kernel]
      • The first term (mamba_num_heads x mamba_head_dim) is the main input channels
      • The second term (2 x n_groups x ssm_state_size) accounts for the B and C projections, each of which contributes n_groups x ssm_state_size channels that also pass through the convolution.
    • Elements: (mamba_num_heads x mamba_head_dim + 2 x n_groups x ssm_state_size) x conv_kernel; e.g., (128 x 64 + 2 x 8 x 128) x 4 = (8192 + 2048) x 4 = 40,960
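
The element counts above can be sketched directly from the hyperparameters (the concrete values are the ones used in this section's worked example):

```python
# Per-layer state sizes from the Mamba-2 hyperparameters.
mamba_num_heads = 128
mamba_head_dim = 64
ssm_state_size = 128
n_groups = 8
conv_kernel = 4

# SSM recurrent state: one (head_dim x state_size) matrix per head.
ssm_state_elems = mamba_num_heads * mamba_head_dim * ssm_state_size

# Convolution state: main input channels plus B and C projection channels,
# each channel holding conv_kernel past values.
conv_channels = mamba_num_heads * mamba_head_dim + 2 * n_groups * ssm_state_size
conv_state_elems = conv_channels * conv_kernel

print(ssm_state_elems)   # 1048576
print(conv_state_elems)  # 40960
```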

This state never grows during decode (unlike attention's KV cache), but it does need to be carried forward from step to step (like a KV cache).

The SSM state's memory footprint is just the product of these three dimensions (the convolution state adds a comparatively small amount on top). For example, Nemotron 3 Super had

  • mamba_head_dim = 64,
  • mamba_num_heads = 128,
  • ssm_state_size = 128

So each Mamba-2 layer's SSM state had 128 x 64 x 128 = 1,048,576 elements. At bf16 (2 bytes per element), that's 2 MiB per layer.
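
The bf16 arithmetic, spelled out:

```python
# SSM state footprint for the Nemotron 3 Super hyperparameters above.
elements = 128 * 64 * 128   # mamba_num_heads * mamba_head_dim * ssm_state_size
bytes_bf16 = elements * 2   # bf16 = 2 bytes per element

print(bytes_bf16 / 2**20)   # 2.0 (MiB per Mamba-2 layer)
```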

Infrastructure requirements

During training and prefill, compute requirements scale linearly with sequence length. This makes training on, and prefilling, very long contexts much less computationally expensive than with attention.

During decode, there is no KV cache, so the memory footprint is independent of sequence length. Each generated token simply updates the fixed-size Mamba state in place.
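
A minimal sketch of a decode step, assuming a toy scalar-per-head decay in place of Mamba-2's input-dependent A, and per-head B/C for simplicity (sizes are illustrative, not Nemotron's). The point is that the state tensor keeps the same shape no matter how many tokens are generated:

```python
import numpy as np

rng = np.random.default_rng(0)
heads, head_dim, state = 4, 8, 16  # toy sizes

# Fixed-size SSM state, carried across decode steps.
ssm_state = np.zeros((heads, head_dim, state))

def decode_step(ssm_state, x, a, B, C):
    """In-place update: h <- a*h + x (outer) B per head; output y = h @ C."""
    ssm_state *= a                                  # decay the old state
    ssm_state += np.einsum('hd,hs->hds', x, B)      # write the new token in
    return np.einsum('hds,hs->hd', ssm_state, C)    # read out the output

for _ in range(5):  # 5 decode steps, constant memory throughout
    x = rng.standard_normal((heads, head_dim))
    B = rng.standard_normal((heads, state))
    C = rng.standard_normal((heads, state))
    y = decode_step(ssm_state, x, a=0.95, B=B, C=C)
```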

Footnotes

  1. State Space Duality (Mamba-2) Part I - The Model | Tri Dao