Group-Relative Policy Optimization (GRPO) is an online RL method in which:
- A model is given an input query and generates multiple answers (a group)
- Each answer is scored
- The mean score of the group is calculated, and each answer's score is compared to that group average
- The model is updated to be more likely to produce the better-than-average answers and less likely to produce the worse-than-average ones, while still being anchored to a reference model (as in Importance of the reference model)
This process is repeated.
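The group-relative scoring step above can be sketched in a few lines. This is a minimal illustration, not DeepSeek's implementation: `grpo_advantages` is a hypothetical helper that normalizes each answer's reward by the group's mean (and standard deviation, as in the GRPO paper), so better-than-average answers get positive advantages and worse-than-average ones get negative advantages.

```python
import statistics

def grpo_advantages(rewards):
    """Compute group-relative advantages for one group of sampled answers.

    Each reward is centered on the group mean and scaled by the group's
    standard deviation, so the sign of the advantage says whether an
    answer beat the group average.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# A group of 4 answers to one query, each scored by some reward function:
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = grpo_advantages(rewards)
```

Answers scoring above the group mean end up with positive advantages (the update pushes the model toward them), and answers below the mean end up with negative advantages; the advantages of a group always sum to zero.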
GRPO is “group-relative” because answers are scored relative to each other.
DeepSeek-R1 was fine-tuned using GRPO.