Reinforcement learning (RL) is used to fine-tune a pre-trained transformer by training it to prefer better responses based on feedback. The model learns by generating different responses and getting reward signals that guide it toward more useful or aligned outputs.

Reinforcement learning is generally a three-step process:

  1. Collecting feedback: The pre-trained transformer generates a bunch of responses to a prompt, and something ranks those responses from best to worst. When that “something” is a group of humans, we call this reinforcement learning from human feedback (RLHF).
  2. Training the reward model: Those ranked responses are used to train a separate reward model, which is itself a smaller transformer that takes a response as input and outputs a score. The reward model learns how to score responses generated by the pre-trained transformer based on the data collected in Step 1 (a minimal sketch of this step follows the list).
  3. Fine-tuning the transformer: The pre-trained transformer is then fine-tuned against the reward model. Algorithms like Proximal Policy Optimization (PPO) are applied here; the transformer generates responses, the reward model scores those responses, and the transformer is gradually nudged toward producing highly rewarded outputs.
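
To make Step 2 concrete, here is a minimal sketch of reward-model training on pairwise preferences. It assumes the responses have already been encoded as fixed-size vectors and uses a tiny linear layer as a stand-in for the reward model; all names, shapes, and hyperparameters are illustrative, not from any particular library.

```python
# A minimal sketch of Step 2 (reward-model training) on pairwise preference data:
# for each prompt, a "chosen" (preferred) and a "rejected" response.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins: pretend each response has already been encoded as a vector.
dim = 16
chosen = torch.randn(32, dim)    # embeddings of preferred responses
rejected = torch.randn(32, dim)  # embeddings of dis-preferred responses

reward_model = nn.Linear(dim, 1)  # maps a response embedding to a scalar score
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

for step in range(100):
    r_chosen = reward_model(chosen)
    r_rejected = reward_model(rejected)
    # Pairwise (Bradley-Terry style) loss: push the chosen score above the rejected one.
    loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```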

Techniques

Proximal Policy Optimization (PPO)

Proximal Policy Optimization (PPO) is a common algorithm that balances improvement with stability. It uses a clipped objective to preserve the core behavior of the pre-trained model during training.
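As a rough illustration, here is a minimal sketch of that clipped objective on toy tensors. The function name, tensor shapes, and epsilon value are illustrative assumptions, not taken from any specific PPO implementation.

```python
# A minimal sketch of PPO's clipped surrogate objective on toy tensors.
import torch

def ppo_clipped_loss(logp_new, logp_old, advantages, epsilon=0.2):
    """The probability ratio is clipped to [1 - epsilon, 1 + epsilon], so a
    single update can't move the policy too far from the model that
    generated the responses."""
    ratio = torch.exp(logp_new - logp_old)          # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()    # maximize reward => minimize negative

# Toy usage: per-token log-probs under the updated and original policies.
logp_old = torch.randn(8)
logp_new = logp_old + 0.1 * torch.randn(8)
advantages = torch.randn(8)  # how much better each output was than expected
print(ppo_clipped_loss(logp_new, logp_old, advantages))
```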

Direct Preference Optimization (DPO)

Direct Preference Optimization (DPO) directly optimizes the model to match preference probabilities. There is no reward model; instead, feedback is given as paired preference data (which response is better?). It isn’t as generalizable as PPO because there is no reward model, but it is much simpler when enough high-quality preference data is present. It also reduces the risk of unintended behavior, since it directly follows the feedback it is given without a reward model in the middle that might get creative.
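As a rough illustration, here is a minimal sketch of the DPO loss on toy log-probabilities. The beta value, tensor values, and function name are illustrative assumptions.

```python
# A minimal sketch of the DPO loss: no reward model, just paired preferences.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Increase the policy's margin between chosen and rejected responses,
    measured relative to a frozen reference model."""
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy usage: summed log-probs of whole responses under policy and reference models.
policy_chosen = torch.tensor([-12.0, -15.0])
policy_rejected = torch.tensor([-14.0, -15.5])
ref_chosen = torch.tensor([-13.0, -15.2])
ref_rejected = torch.tensor([-13.5, -15.4])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```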

Llama-3 was fine-tuned using DPO.

Researchy approaches

There are also a bunch of research demonstrations. Group Relative Policy Optimization (GRPO) is an example: instead of learning a separate value model to judge how good a response is, it samples a group of responses for each prompt and scores each one relative to the group average. That drops an entire model from the PPO setup and makes the changes to the pre-trained model cheaper and more stable.
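
As a rough illustration, here is a minimal sketch of the group-relative advantage computation on toy rewards; the group size and reward values are made up for illustration.

```python
# A minimal sketch of GRPO-style group-relative advantages on toy rewards.
import torch

def group_relative_advantages(rewards):
    """Score each sampled response relative to its group: subtract the group
    mean and divide by the group std, so no learned value model is needed."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Toy usage: rewards for a group of 4 responses sampled from the same prompt.
group_rewards = torch.tensor([0.2, 0.9, 0.4, 0.1])
print(group_relative_advantages(group_rewards))
# These advantages then plug into a PPO-style clipped objective in place of
# the critic's value estimates.
```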

Seminal papers

I stole this reading list from No Hype DeepSeek-R1 Reading List.