DeepSeek-R1 is a 671-billion-parameter reasoning model implemented as a mixture-of-experts transformer, with 37 billion parameters active during inference. It was fine-tuned from DeepSeek-V3-Base using chain-of-thought reasoning examples and group relative policy optimization [1] to excel at reasoning.
It has a context window of 128,000 tokens.
Why is it important?
The real breakthrough is that it is as good as OpenAI’s o1 model, which came out only a few months earlier, and it’s open-source, so anyone can jump on the test-time compute bandwagon. DeepSeek also used a new algorithm, group relative policy optimization, to cut the cost of reinforcement learning.
A few reasons:
- It is a high-quality reasoning model that rivals OpenAI’s best publicly available reasoning model.
- It was released only one month after the version of OpenAI o1 against which it was benchmarked.
- It is open-source, so anyone can use it.
- It was developed in China despite the ban on exporting top-end NVIDIA GPUs to the country.
- It was cheap to train and is cheap to run at inference time: roughly 95% cheaper than o1 [2].
They also demonstrated that distilling this reasoning model into existing dense transformers makes those models significantly better. Since DeepSeek-R1 is open-source, all model developers can leverage synthetic data much more heavily to reduce the cost of training.
What was novel?
FP8 training
This is one of the first frontier models I’ve seen that was trained in 8-bit precision. By comparison, Llama-3 was trained in 16-bit precision.
Latent attention
Latent attention is a new way of implementing attention where the key/value projection is factorized into two smaller matrices. The down-projection compresses each token into a small latent vector, and that latent is what goes into the KV cache; the up-projection holds the weights used to rehydrate the full keys and values only when they are needed. Both take far less memory than caching full-size keys and values.
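Here is a minimal PyTorch sketch of that low-rank idea: cache only a small latent per token and rehydrate keys/values on demand. The dimensions, the class name, and the use of plain linear layers are illustrative assumptions, not DeepSeek’s actual implementation (which also compresses queries and handles rotary embeddings separately).

```python
import torch
import torch.nn as nn

class LowRankKVCache(nn.Module):
    """Toy sketch of latent attention: cache a small latent per token and
    rehydrate full keys/values only when attention needs them."""

    def __init__(self, d_model=1024, d_latent=128, d_head=64, n_heads=16):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)           # compression (its output is cached)
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # rehydrate keys
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # rehydrate values
        self.n_heads, self.d_head = n_heads, d_head

    def compress(self, hidden):                  # hidden: [batch, seq, d_model]
        return self.down(hidden)                 # latent: [batch, seq, d_latent] -- all we cache

    def rehydrate(self, latent):
        b, s, _ = latent.shape
        k = self.up_k(latent).view(b, s, self.n_heads, self.d_head)
        v = self.up_v(latent).view(b, s, self.n_heads, self.d_head)
        return k, v

x = torch.randn(2, 10, 1024)
mla = LowRankKVCache()
latent = mla.compress(x)      # 128 floats per token instead of 2 * 16 * 64 = 2048
k, v = mla.rehydrate(latent)  # full keys/values materialized only at attention time
```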
Mixture of experts
Each token activates nine experts: one shared and eight routed. DeepSeek’s MoE router restricts a token’s eight routed experts to at most four nodes, keeping the communication domain small.
They also overlap the computation of some experts with the communication for experts that did not receive local tokens.
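Below is a toy sketch of node-limited top-k routing in PyTorch. The expert count, the experts-per-node layout, and the per-node scoring heuristic (sum of each node’s top expert scores) are assumptions for illustration; DeepSeek’s real router also includes the shared expert, gating weights, and load-balancing terms.

```python
import torch

def node_limited_routing(scores, experts_per_node=32, top_nodes=4, top_experts=8):
    """Toy node-limited routing: pick 8 routed experts per token, but only
    from the 4 nodes whose experts score highest for that token."""
    n_tokens, n_experts = scores.shape
    n_nodes = n_experts // experts_per_node

    # Score each node by the sum of its best expert scores for this token
    # (the exact per-node heuristic here is an assumption).
    per_node = scores.view(n_tokens, n_nodes, experts_per_node)
    node_scores = per_node.topk(k=2, dim=-1).values.sum(-1)       # [tokens, nodes]
    keep_nodes = node_scores.topk(top_nodes, dim=-1).indices      # [tokens, 4]

    # Mask out experts on nodes this token is not allowed to talk to.
    node_of_expert = torch.arange(n_experts) // experts_per_node  # [experts]
    allowed = (node_of_expert[None, :, None] == keep_nodes[:, None, :]).any(-1)
    masked = scores.masked_fill(~allowed, float("-inf"))

    # Finally pick the top-8 routed experts inside the allowed nodes.
    return masked.topk(top_experts, dim=-1).indices               # [tokens, 8]

scores = torch.randn(5, 256)                  # 5 tokens, 256 routed experts on 8 nodes
print(node_limited_routing(scores))
```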
Multi-token prediction
DeepSeek trained with multi-token prediction to get a stronger learning signal per step. This requires more computation, but fewer steps are required to reach the same model quality when compared to standard single-token prediction.
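As a rough illustration, here is a simplified multi-token prediction loss in PyTorch: extra heads predict tokens two or more steps ahead from the same hidden states. DeepSeek-V3’s actual MTP modules are sequential transformer blocks rather than independent linear heads, so treat this purely as a sketch of the extra learning signal.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenHead(nn.Module):
    """Simplified multi-token prediction: besides the usual next-token head,
    extra heads predict tokens 2..k steps ahead from the same hidden state."""

    def __init__(self, d_model=512, vocab=32000, horizon=2):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(horizon))
        self.horizon = horizon

    def loss(self, hidden, targets):            # hidden: [B, T, d], targets: [B, T]
        total = 0.0
        for depth, head in enumerate(self.heads, start=1):
            logits = head(hidden[:, :-depth])   # only positions with a target `depth` steps ahead
            total = total + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)),
                targets[:, depth:].reshape(-1),
            )
        return total / self.horizon

hidden = torch.randn(2, 16, 512)
targets = torch.randint(0, 32000, (2, 16))
print(MultiTokenHead().loss(hidden, targets))
```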
Communication
DeepSeek did not use tensor parallelism because of its cost (presumably because they did not want to rely on a strong NVLink domain; China only has access to the H800, which has a significantly weaker NVLink backplane than the H100). Instead, they used pipeline parallelism, expert parallelism, and data parallelism (see Workload partitioning).
DualPipe parallelism breaks the pipeline into two parts:
- computation pipeline: attention, feed-forward network
- communication pipeline: all-to-all on the forward pass, all-to-all on backpropagation
Breaking up the pipeline shrinks the bubbles by alternating computation and communication across GPUs: while some GPUs are communicating, others are computing, and then they swap roles.
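A crude way to picture the overlap (this is a toy timeline, not DeepSeek’s DualPipe scheduler): at each step a GPU computes on one microbatch chunk while the all-to-all for the previous chunk is still in flight, instead of sitting idle waiting for communication to finish.

```python
# Toy timeline of compute/communication overlap on a single GPU.
def overlapped_schedule(n_chunks=4):
    steps = []
    for t in range(n_chunks + 1):
        compute = f"compute chunk {t}" if t < n_chunks else "idle"
        comm = f"all-to-all chunk {t - 1}" if t > 0 else "idle"
        steps.append((t, compute, comm))
    return steps

for t, compute, comm in overlapped_schedule():
    print(f"t={t}: {compute:16s} | {comm}")
```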
They also dedicated 20 SMs per GPU to communication and implemented their expert parallelism so that each token’s communication is confined to at most four nodes.
FP8 training
DeepSeek applied fine-grained quantization, in which values within a tensor are grouped into 1D tiles or 2D blocks and each group gets its own scaling factor, controlling the error introduced by the loss of precision. This contrasts with the more typical tensor-level quantization, which has hardware support.
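To make the tile-wise idea concrete, here is a small simulation in PyTorch: each 1x128 tile gets its own scale before being cast to FP8 (e4m3), so one outlier only hurts its own tile. The tile size and the e4m3 maximum follow the commonly described recipe, but the function names and everything else here are illustrative rather than DeepSeek’s kernels, and the float8 dtype requires a fairly recent PyTorch build.

```python
import torch

E4M3_MAX = 448.0  # largest representable magnitude in float8 e4m3

def quantize_tilewise(x, tile=128):
    """Simulated fine-grained FP8 quantization: one scale per 1x128 tile."""
    rows, cols = x.shape
    assert cols % tile == 0
    tiles = x.view(rows, cols // tile, tile)
    scales = tiles.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / E4M3_MAX
    q = (tiles / scales).to(torch.float8_e4m3fn)   # needs a recent PyTorch build
    return q, scales

def dequantize_tilewise(q, scales):
    return (q.to(torch.float32) * scales).view(q.shape[0], -1)

x = torch.randn(4, 512) * 5.0
q, s = quantize_tilewise(x)
print((x - dequantize_tilewise(q, s)).abs().max())  # small reconstruction error
```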
In addition, they maintained some values in higher precision:
- the outputs of forward-pass GEMMs were kept in 16- or 32-bit precision
- optimizer states were kept in 16- or 32-bit precision
Communications were also reduced to FP8 to lessen the load on their interconnect, but some gradients were preserved in BF16.