Mamba is an alternative to attention: instead of re-computing all pairwise interactions between tokens the way attention does, it keeps a hidden state that is updated as each new token arrives. This hidden state is designed to carry long-range dependencies (like attention) but scales computationally with sequence length as O(L) rather than O(L²) (like attention) during training and prefill.
Mamba-2 is an updated form of Mamba.
Principles
A single Mamba layer has two components that are defined by a few variables:1

- SSM recurrent state
  - head count (`mamba_num_heads`)
  - head dimension (`mamba_head_dim`)
  - SSM state size (`ssm_state_size`)
  - Shape: `[mamba_num_heads x mamba_head_dim x ssm_state_size]`
  - Elements: `mamba_num_heads x mamba_head_dim x ssm_state_size`
- Convolution state
  - head count (`mamba_num_heads`)
  - head dimension (`mamba_head_dim`)
  - number of groups (`n_groups`)
  - SSM state size (`ssm_state_size`)
  - kernel size (`conv_kernel`)
  - Shape: `[(mamba_num_heads x mamba_head_dim + 2 x n_groups x ssm_state_size) x conv_kernel]`
    - The first term (`mamba_num_heads x mamba_head_dim`) is the main input channels.
    - The second term (`2 x n_groups x ssm_state_size`) accounts for the B and C projections, each of which contributes one factor of `n_groups x ssm_state_size` and also passes through the convolution.
  - Elements: `(128 x 64 + 2 x 8 x 128) x 4 = (8192 + 2048) x 4 = 40,960`
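As a quick sketch, the element counts above can be computed directly from the listed variables. The values below follow the worked numbers in the Elements lines (including `n_groups` = 8 and `conv_kernel` = 4, which appear only in that example):

```python
# Per-layer Mamba-2 state sizes, computed from the variables listed above.
# These specific values are illustrative, taken from the worked example.
mamba_num_heads = 128
mamba_head_dim = 64
ssm_state_size = 128
n_groups = 8
conv_kernel = 4

# SSM recurrent state: [mamba_num_heads x mamba_head_dim x ssm_state_size]
ssm_elements = mamba_num_heads * mamba_head_dim * ssm_state_size

# Convolution state: main input channels plus the B and C projections,
# each carried through a kernel of size conv_kernel.
conv_elements = (mamba_num_heads * mamba_head_dim
                 + 2 * n_groups * ssm_state_size) * conv_kernel

print(ssm_elements)   # 1048576
print(conv_elements)  # 40960
```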
This state never grows during decode (unlike with attention), but it does need to be carried forward (like KV cache).
The memory footprint of the SSM recurrent state is just the product of these three variables. For example, Nemotron 3 Super had `mamba_head_dim` = 64, `mamba_num_heads` = 128, and `ssm_state_size` = 128, so each Mamba-2 layer had 1,048,576 elements of SSM state. At bf16, that's 2 MiB per Mamba-2 layer.
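That arithmetic, as a quick check (bf16 stores 2 bytes per element):

```python
# SSM recurrent state size for the example values above
ssm_elements = 128 * 64 * 128        # mamba_num_heads x mamba_head_dim x ssm_state_size
mib = ssm_elements * 2 / 2**20       # bf16 bytes -> MiB
print(ssm_elements, mib)             # 1048576 2.0
```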
Infrastructure requirements
During training and prefill, compute requirements scale linearly with sequence length L. This makes training and prefilling very long contexts much less computationally expensive than with attention.
During decode, there is no KV cache, so the memory footprint is independent of sequence length. Every generated token simply updates the fixed-size Mamba state in place.
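A minimal toy illustration of why decode memory is constant: each new token updates a fixed-size state rather than appending to a growing cache. This single-head recurrence (decay the state, add an outer-product update from the current token, read out through C) is a simplified sketch, not the actual Mamba-2 kernel; all names, shapes, and the constant decay are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
head_dim, state_size = 64, 128

# Fixed-size recurrent state: its shape never changes during decode.
h = np.zeros((head_dim, state_size))

for step in range(1000):                  # decode 1000 tokens
    x = rng.standard_normal(head_dim)     # current token's channel values
    a = 0.9                               # decay (data-dependent in real Mamba)
    B = rng.standard_normal(state_size)   # input projection for this token
    C = rng.standard_normal(state_size)   # output projection for this token
    h = a * h + np.outer(x, B)            # state update: same shape every step
    y = h @ C                             # per-channel output for this token

# After 1000 tokens the state is exactly as large as after 1.
print(h.shape)  # (64, 128)
```

Contrast with attention, where each decoded token appends a new key/value pair, so the cache grows linearly with the number of tokens generated.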