DeepSeek-R1 is a 671-billion-parameter reasoning model implemented as a mixture-of-experts transformer, with 37 billion parameters active during inference. It was fine-tuned from DeepSeek-V3-Base using chain-of-thought reasoning examples and group relative policy optimization [1] to excel at reasoning.
It has a context window of 128,000 tokens.
Why is it important?
The real breakthrough is that it is as good as OpenAI’s o1 model, which came out only a few months earlier, and it’s open-source, so anyone can jump on the test-time compute bandwagon. DeepSeek also used a new algorithm, group relative policy optimization, to cut the cost of reinforcement learning.
A few reasons:
- It is a high-quality reasoning model that rivals OpenAI’s best publicly available reasoning model.
- It was released only one month after the version of OpenAI o1 against which it was benchmarked.
- It is open-source, so anyone can use it.
- It was developed in China despite the ban on exporting top-end NVIDIA GPUs to the country.
- It was cheap to train and is cheap to run at inference time: roughly 95% cheaper than o1 [2].
They also demonstrated that distilling this reasoning model into existing dense transformers makes those models significantly better. Since DeepSeek-R1 is open-source, all model developers can leverage synthetic data much more heavily to reduce the cost of training.
What was novel?
FP8 training
This is one of the first frontier models I’ve seen that was trained in 8-bit precision. By comparison, Llama-3 was trained in 16-bit precision.
Latent attention
Latent attention is a new way of implementing attention where the key/value projection is factorized into two smaller matrices. The down-projection compresses each token into a small latent vector, and that latent is what goes into the KV cache; the up-projection holds the weights used to rehydrate the full keys and values only when they are needed. Both take far less memory than caching full-size keys and values.
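Here is a minimal PyTorch sketch of that low-rank idea: cache only a small latent per token and rehydrate keys/values on demand. The dimensions, the class name, and the use of plain linear layers are illustrative assumptions, not DeepSeek’s actual implementation (which also compresses queries and handles rotary embeddings separately).

```python
import torch
import torch.nn as nn

class LowRankKVCache(nn.Module):
    """Toy sketch of latent attention: cache a small latent per token and
    rehydrate full keys/values only when attention needs them."""

    def __init__(self, d_model=1024, d_latent=128, d_head=64, n_heads=16):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)           # compression (its output is cached)
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # rehydrate keys
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # rehydrate values
        self.n_heads, self.d_head = n_heads, d_head

    def compress(self, hidden):                  # hidden: [batch, seq, d_model]
        return self.down(hidden)                 # latent: [batch, seq, d_latent] -- all we cache

    def rehydrate(self, latent):
        b, s, _ = latent.shape
        k = self.up_k(latent).view(b, s, self.n_heads, self.d_head)
        v = self.up_v(latent).view(b, s, self.n_heads, self.d_head)
        return k, v

x = torch.randn(2, 10, 1024)
mla = LowRankKVCache()
latent = mla.compress(x)      # 128 floats per token instead of 2 * 16 * 64 = 2048
k, v = mla.rehydrate(latent)  # full keys/values materialized only at attention time
```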
Mixture of experts
Each token activates nine experts: one shared and eight routed. DeepSeek’s MoE router restricts a token’s eight routed experts to at most four nodes, keeping the communication domain small.
They also overlap the computation of some experts with the communication for experts that did not receive local tokens.
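Below is a toy sketch of node-limited top-k routing in PyTorch. The expert count, the experts-per-node layout, and the per-node scoring heuristic (sum of each node’s top expert scores) are assumptions for illustration; DeepSeek’s real router also includes the shared expert, gating weights, and load-balancing terms.

```python
import torch

def node_limited_routing(scores, experts_per_node=32, top_nodes=4, top_experts=8):
    """Toy node-limited routing: pick 8 routed experts per token, but only
    from the 4 nodes whose experts score highest for that token."""
    n_tokens, n_experts = scores.shape
    n_nodes = n_experts // experts_per_node

    # Score each node by the sum of its best expert scores for this token
    # (the exact per-node heuristic here is an assumption).
    per_node = scores.view(n_tokens, n_nodes, experts_per_node)
    node_scores = per_node.topk(k=2, dim=-1).values.sum(-1)       # [tokens, nodes]
    keep_nodes = node_scores.topk(top_nodes, dim=-1).indices      # [tokens, 4]

    # Mask out experts on nodes this token is not allowed to talk to.
    node_of_expert = torch.arange(n_experts) // experts_per_node  # [experts]
    allowed = (node_of_expert[None, :, None] == keep_nodes[:, None, :]).any(-1)
    masked = scores.masked_fill(~allowed, float("-inf"))

    # Finally pick the top-8 routed experts inside the allowed nodes.
    return masked.topk(top_experts, dim=-1).indices               # [tokens, 8]

scores = torch.randn(5, 256)                  # 5 tokens, 256 routed experts on 8 nodes
print(node_limited_routing(scores))
```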
Multi-token prediction
DeepSeek trained with multi-token prediction to get a stronger learning signal per step. This requires more computation, but fewer steps are required to reach the same model quality when compared to standard single-token prediction.
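As a rough illustration, here is a simplified multi-token prediction loss in PyTorch: extra heads predict tokens two or more steps ahead from the same hidden states. DeepSeek-V3’s actual MTP modules are sequential transformer blocks rather than independent linear heads, so treat this purely as a sketch of the extra learning signal.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenHead(nn.Module):
    """Simplified multi-token prediction: besides the usual next-token head,
    extra heads predict tokens 2..k steps ahead from the same hidden state."""

    def __init__(self, d_model=512, vocab=32000, horizon=2):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(horizon))
        self.horizon = horizon

    def loss(self, hidden, targets):            # hidden: [B, T, d], targets: [B, T]
        total = 0.0
        for depth, head in enumerate(self.heads, start=1):
            logits = head(hidden[:, :-depth])   # only positions with a target `depth` steps ahead
            total = total + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)),
                targets[:, depth:].reshape(-1),
            )
        return total / self.horizon

hidden = torch.randn(2, 16, 512)
targets = torch.randint(0, 32000, (2, 16))
print(MultiTokenHead().loss(hidden, targets))
```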
Communication
DeepSeek did not use tensor parallelism because of its cost (presumably because they did not want to rely on a strong NVLink domain; China only has access to the H800, which has a significantly weaker NVLink backplane than the H100). Instead, they used pipeline parallelism, expert parallelism, and data parallelism (see Workload partitioning).
DualPipe parallelism breaks the pipeline into two parts:
- computation pipeline: attention, feed-forward network
- communication pipeline: all-to-all on the forward pass, all-to-all on backpropagation
Breaking up the pipeline shrinks the bubbles by alternating computation and communication across GPUs: while some GPUs are communicating, others are computing, and then they swap roles.
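A crude way to picture the overlap (this is a toy timeline, not DeepSeek’s DualPipe scheduler): at each step a GPU computes on one microbatch chunk while the all-to-all for the previous chunk is still in flight, instead of sitting idle waiting for communication to finish.

```python
# Toy timeline of compute/communication overlap on a single GPU.
def overlapped_schedule(n_chunks=4):
    steps = []
    for t in range(n_chunks + 1):
        compute = f"compute chunk {t}" if t < n_chunks else "idle"
        comm = f"all-to-all chunk {t - 1}" if t > 0 else "idle"
        steps.append((t, compute, comm))
    return steps

for t, compute, comm in overlapped_schedule():
    print(f"t={t}: {compute:16s} | {comm}")
```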
They also dedicated 20 SMs per GPU to communication and implemented their expert parallelism so that each token’s communication is confined to at most four nodes.
FP8 training
DeepSeek applied fine-grained quantization, in which values within a tensor are grouped into 1D tiles or 2D blocks and each group gets its own scaling factor, controlling the error introduced by the loss of precision. This contrasts with the more typical tensor-level quantization, which has hardware support.
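To make the tile-wise idea concrete, here is a small simulation in PyTorch: each 1x128 tile gets its own scale before being cast to FP8 (e4m3), so one outlier only hurts its own tile. The tile size and the e4m3 maximum follow the commonly described recipe, but the function names and everything else here are illustrative rather than DeepSeek’s kernels, and the float8 dtype requires a fairly recent PyTorch build.

```python
import torch

E4M3_MAX = 448.0  # largest representable magnitude in float8 e4m3

def quantize_tilewise(x, tile=128):
    """Simulated fine-grained FP8 quantization: one scale per 1x128 tile."""
    rows, cols = x.shape
    assert cols % tile == 0
    tiles = x.view(rows, cols // tile, tile)
    scales = tiles.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / E4M3_MAX
    q = (tiles / scales).to(torch.float8_e4m3fn)   # needs a recent PyTorch build
    return q, scales

def dequantize_tilewise(q, scales):
    return (q.to(torch.float32) * scales).view(q.shape[0], -1)

x = torch.randn(4, 512) * 5.0
q, s = quantize_tilewise(x)
print((x - dequantize_tilewise(q, s)).abs().max())  # small reconstruction error
```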
In addition, they maintained some values in higher precision:
- the outputs of forward-pass GEMMs were kept in 16- or 32-bit precision
- optimizer states were kept in 16- or 32-bit precision
Communications were also reduced to FP8 to lessen the load on their interconnect, but some gradients were preserved in BF16.