Reinforcement learning (RL) is used to fine-tune a pre-trained transformer by training it to prefer better responses based on feedback. The model learns by generating different responses and getting reward signals that guide it toward more useful or aligned outputs.

Reinforcement learning is generally a three-step process:

  1. Collecting feedback: The pre-trained transformer generates a bunch of responses to a prompt, and something ranks those responses from best to worst. When that “something” is a group of humans, we call this reinforcement learning from human feedback (RLHF).
  2. Training the reward model: Those ranked responses are used to train a separate reward model, which is itself a smaller transformer that takes a response as input and outputs a score. The reward model learns how to score responses generated by the pre-trained transformer based on the data collected in Step 1 (a minimal sketch of this step follows the list).
  3. Fine-tuning the transformer: The pre-trained transformer is then fine-tuned against the reward model. Algorithms like Proximal Policy Optimization (PPO) are applied here; the transformer generates responses, the reward model scores those responses, and the transformer is gradually nudged toward producing highly rewarded outputs.
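
To make Step 2 concrete, here is a minimal sketch of reward-model training on pairwise preferences. It assumes the responses have already been encoded as fixed-size vectors and uses a tiny linear layer as a stand-in for the reward model; all names, shapes, and hyperparameters are illustrative, not from any particular library.

```python
# A minimal sketch of Step 2 (reward-model training) on pairwise preference data:
# for each prompt, a "chosen" (preferred) and a "rejected" response.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins: pretend each response has already been encoded as a vector.
dim = 16
chosen = torch.randn(32, dim)    # embeddings of preferred responses
rejected = torch.randn(32, dim)  # embeddings of dis-preferred responses

reward_model = nn.Linear(dim, 1)  # maps a response embedding to a scalar score
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

for step in range(100):
    r_chosen = reward_model(chosen)
    r_rejected = reward_model(rejected)
    # Pairwise (Bradley-Terry style) loss: push the chosen score above the rejected one.
    loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```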

Techniques

Proximal Policy Optimization (PPO)

Proximal Policy Optimization (PPO) is a common algorithm that balances improvement with stability. It uses a clipped objective to preserve the core behavior of the pre-trained model during training.
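As a rough illustration, here is a minimal sketch of that clipped objective on toy tensors. The function name, tensor shapes, and epsilon value are illustrative assumptions, not taken from any specific PPO implementation.

```python
# A minimal sketch of PPO's clipped surrogate objective on toy tensors.
import torch

def ppo_clipped_loss(logp_new, logp_old, advantages, epsilon=0.2):
    """The probability ratio is clipped to [1 - epsilon, 1 + epsilon], so a
    single update can't move the policy too far from the model that
    generated the responses."""
    ratio = torch.exp(logp_new - logp_old)          # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()    # maximize reward => minimize negative

# Toy usage: per-token log-probs under the updated and original policies.
logp_old = torch.randn(8)
logp_new = logp_old + 0.1 * torch.randn(8)
advantages = torch.randn(8)  # how much better each output was than expected
print(ppo_clipped_loss(logp_new, logp_old, advantages))
```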

Direct Preference Optimization (DPO)

Direct Preference Optimization (DPO) directly optimizes the model to match preference probabilities. There is no reward model; instead, feedback is given as paired preference data (which response is better?). It isn’t as generalizable as PPO because there is no reward model, but it is much simpler when enough high-quality preference data is present. It also reduces the risk of unintended behavior, since it directly follows the feedback it is given without a reward model in the middle that might get creative.
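As a rough illustration, here is a minimal sketch of the DPO loss on toy log-probabilities. The beta value, tensor values, and function name are illustrative assumptions.

```python
# A minimal sketch of the DPO loss: no reward model, just paired preferences.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Increase the policy's margin between chosen and rejected responses,
    measured relative to a frozen reference model."""
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy usage: summed log-probs of whole responses under policy and reference models.
policy_chosen = torch.tensor([-12.0, -15.0])
policy_rejected = torch.tensor([-14.0, -15.5])
ref_chosen = torch.tensor([-13.0, -15.2])
ref_rejected = torch.tensor([-13.5, -15.4])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```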

Llama-3 was fine-tuned using DPO.

Researchy approaches

There are also a bunch of research demonstrations. Group Relative Policy Optimization (GRPO) is an example: instead of learning a separate value model to judge how good a response is, it samples a group of responses for each prompt and scores each one relative to the group average. That drops an entire model from the PPO setup and makes the changes to the pre-trained model cheaper and more stable.
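
As a rough illustration, here is a minimal sketch of the group-relative advantage computation on toy rewards; the group size and reward values are made up for illustration.

```python
# A minimal sketch of GRPO-style group-relative advantages on toy rewards.
import torch

def group_relative_advantages(rewards):
    """Score each sampled response relative to its group: subtract the group
    mean and divide by the group std, so no learned value model is needed."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Toy usage: rewards for a group of 4 responses sampled from the same prompt.
group_rewards = torch.tensor([0.2, 0.9, 0.4, 0.1])
print(group_relative_advantages(group_rewards))
# These advantages then plug into a PPO-style clipped objective in place of
# the critic's value estimates.
```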

Seminal papers

I stole this reading list from No Hype DeepSeek-R1 Reading List.