Direct preference optimization (DPO) is a post-pretraining process whereby you make two copies of the model being fine-tuned: a reference model (which never changes, let’s call it π_ref) and the model being updated (let’s call it π_θ).
1. You start with records consisting of a prompt, a good response, and a bad response.
2. For each record, both the reference model (π_ref) and the updating model (π_θ) score the two responses by how likely each model considers them, given the prompt. At the start, π_ref and π_θ are identical, so they produce the same scores.
3. π_θ is updated to make the good response more likely than the bad one. π_ref does not change, because it’s frozen.
4. More records are scored by π_ref and π_θ.
5. The difference between π_ref’s and π_θ’s scores is used to tweak π_θ further, making it more likely to produce the good response.
6. Loop back to step 4.
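The per-record update in the loop above can be sketched as a loss function. This is a minimal illustration, not a full training loop: it assumes each argument is a summed token log-probability of a whole response under the respective model, and `beta` (a standard DPO hyperparameter) controls how strongly π_θ may deviate from π_ref.

```python
import math

def dpo_loss(logp_theta_good, logp_theta_bad,
             logp_ref_good, logp_ref_bad, beta=0.1):
    """DPO loss for one (prompt, good response, bad response) record."""
    # How much more the tuned model prefers the good response than
    # the reference model does (the "implicit reward" margin).
    margin = (logp_theta_good - logp_theta_bad) - (logp_ref_good - logp_ref_bad)
    # Logistic loss: approaches 0 as the margin grows.
    return -math.log(1 / (1 + math.exp(-beta * margin)))

# At the start, π_θ and π_ref are identical, so the margin is 0
# and the loss is -log(0.5).
start = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# After updates widen π_θ's preference for the good response,
# the loss drops.
later = dpo_loss(-10.0, -18.0, -12.0, -15.0)
```

In a real implementation the gradient of this loss with respect to π_θ's parameters drives the update; π_ref only ever contributes the fixed baseline term.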
This process produces a fine-tuned model (π_θ) that is still anchored to whatever fundamentals were present in the original (π_ref). This means that DPO only tweaks behavior rather than allowing the tuned model to acquire strange new behavior.
Example input record
DPO inputs are tuples of (prompt, preferred response, rejected response):
- Prompt: “Explain why the sky is blue in two sentences.”
- Preferred response: “Sunlight is scattered by molecules in the atmosphere, and shorter blue wavelengths scatter more strongly than longer red wavelengths. That’s why the sky appears blue.”
- Rejected response: “The sky is blue because it reflects the ocean, and blue light is heavier than other colors.”
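In code, a record like the one above might be represented as a simple mapping. The field names `prompt`, `chosen`, and `rejected` are an assumption for illustration; preference datasets vary in their exact schema.

```python
# Hypothetical record format; the field names are an assumption.
record = {
    "prompt": "Explain why the sky is blue in two sentences.",
    "chosen": (
        "Sunlight is scattered by molecules in the atmosphere, and shorter "
        "blue wavelengths scatter more strongly than longer red wavelengths. "
        "That's why the sky appears blue."
    ),
    "rejected": (
        "The sky is blue because it reflects the ocean, and blue light is "
        "heavier than other colors."
    ),
}
```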
Importance of the reference model
If you don’t anchor the model to a reference, you might wind up teaching it to take shortcuts or learn the lesson in the wrong way. For example, let’s say you have a model that exhibits “good manners” and you want to teach it to “give clear answers.” Without anchoring this fine-tuning process to the reference model that has good manners, you might wind up fine-tuning the model to…
- only give single-word yes/no answers to everything, because those are the clearest
- not answer anything, since anything it says may be unclear
DPO allows a model to learn through comparisons while continuously checking to ensure that what it’s learning (“be clear”) remains consistent with the behavior of the reference (“have good manners”).
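This anchoring is visible in DPO’s objective from the original paper, where the reference model’s probabilities appear directly inside the loss. Here σ is the logistic function and β controls how far π_θ may drift from π_ref:

```latex
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
      \;-\;
      \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
    \right)
  \right]
```

Because π_θ’s probabilities are always measured as ratios against π_ref, any shortcut that drifts far from the reference’s behavior is penalized.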
Uses
Llama 3 and Llama 4 were fine-tuned using DPO, likely because Meta had lots of direct preference feedback from its social platforms.
History
DPO was developed at Stanford and first published in 2023.