Direct preference optimization (DPO) is a post-pretraining process whereby you make two copies of the model being fine-tuned: a reference model (which never changes, let’s call it π_ref) and the model being updated (let’s call it π_θ).
1. You start with records consisting of a prompt, a good response, and a bad response.
2. For each record, both the reference model (π_ref) and the updating model (π_θ) score the two responses by how likely each model considers them, given the prompt. At the start, π_ref and π_θ are identical, so they produce the same scores.
3. π_θ is updated to make the good response more likely than the bad one. π_ref does not change, because it’s frozen.
4. More records are scored by π_ref and π_θ.
5. The difference between π_ref’s and π_θ’s scores is used to tweak π_θ further, making it more likely to produce the good response.
6. Loop back to step 4.
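The per-record update in the loop above can be sketched as a loss function. This is a minimal illustration, not a full training loop: it assumes each argument is a summed token log-probability of a whole response under the respective model, and `beta` (a standard DPO hyperparameter) controls how strongly π_θ may deviate from π_ref.

```python
import math

def dpo_loss(logp_theta_good, logp_theta_bad,
             logp_ref_good, logp_ref_bad, beta=0.1):
    """DPO loss for one (prompt, good response, bad response) record."""
    # How much more the tuned model prefers the good response than
    # the reference model does (the "implicit reward" margin).
    margin = (logp_theta_good - logp_theta_bad) - (logp_ref_good - logp_ref_bad)
    # Logistic loss: approaches 0 as the margin grows.
    return -math.log(1 / (1 + math.exp(-beta * margin)))

# At the start, π_θ and π_ref are identical, so the margin is 0
# and the loss is -log(0.5).
start = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# After updates widen π_θ's preference for the good response,
# the loss drops.
later = dpo_loss(-10.0, -18.0, -12.0, -15.0)
```

In a real implementation the gradient of this loss with respect to π_θ's parameters drives the update; π_ref only ever contributes the fixed baseline term.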
This process produces a fine-tuned model (π_θ) that is still anchored to whatever fundamentals were present in the original (π_ref). This means that DPO only tweaks behavior rather than allowing the tuned model to acquire strange new behavior.
Example input record
DPO inputs are tuples of (prompt, preferred response, rejected response):
- Prompt: “Explain why the sky is blue in two sentences.”
- Preferred response: “Sunlight is scattered by molecules in the atmosphere, and shorter blue wavelengths scatter more strongly than longer red wavelengths. That’s why the sky appears blue.”
- Rejected response: “The sky is blue because it reflects the ocean, and blue light is heavier than other colors.”
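In code, a record like the one above might be represented as a simple mapping. The field names `prompt`, `chosen`, and `rejected` are an assumption for illustration; preference datasets vary in their exact schema.

```python
# Hypothetical record format; the field names are an assumption.
record = {
    "prompt": "Explain why the sky is blue in two sentences.",
    "chosen": (
        "Sunlight is scattered by molecules in the atmosphere, and shorter "
        "blue wavelengths scatter more strongly than longer red wavelengths. "
        "That's why the sky appears blue."
    ),
    "rejected": (
        "The sky is blue because it reflects the ocean, and blue light is "
        "heavier than other colors."
    ),
}
```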
Importance of the reference model
If you don’t anchor the model to a reference, you might wind up teaching it to take shortcuts or learn the lesson in the wrong way. For example, let’s say you have a model that exhibits “good manners” and you want to teach it to “give clear answers.” Without anchoring this fine-tuning process to the reference model that has good manners, you might wind up fine-tuning the model to…
- only give single-word yes/no answers to everything, because those are the clearest
- not answer anything, since anything it says may be unclear
DPO allows a model to learn through comparisons while continuously checking to ensure that what it’s learning (“be clear”) remains consistent with the behavior of the reference (“have good manners”).
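This anchoring is visible in DPO’s objective from the original paper, where the reference model’s probabilities appear directly inside the loss. Here σ is the logistic function and β controls how far π_θ may drift from π_ref:

```latex
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
      \;-\;
      \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
    \right)
  \right]
```

Because π_θ’s probabilities are always measured as ratios against π_ref, any shortcut that drifts far from the reference’s behavior is penalized.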
Uses
Llama 3 and Llama 4 were fine-tuned using DPO, likely because Meta had lots of direct preference feedback from its social platforms.
History
DPO was developed at Stanford and first published in 2023.