Group-Relative Policy Optimization (GRPO) is a method of online RL whereby

  1. The model is given an input query and generates multiple answers (a group)
  2. Each answer is scored by a reward function
  3. The group's mean score is computed, and each answer's score is compared to that average to get its relative advantage
  4. The model is updated to make better-than-average answers more likely and worse-than-average answers less likely, while still being anchored to a reference model (see Importance of the reference model)

This process is repeated.
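Step 3 can be sketched in a few lines. This is a simplified illustration, not the full GRPO objective; the `group_relative_advantages` helper is a name chosen here, and following the common formulation, scores are also divided by the group's standard deviation:

```python
import statistics

def group_relative_advantages(rewards):
    """Score each answer relative to its group: subtract the group's
    mean reward, then divide by the group's standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# A group of 4 answers to one query, each scored (e.g. 1.0 = correct, 0.0 = wrong).
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
# → [1.0, -1.0, 1.0, -1.0]
```

Answers with positive advantage are reinforced in step 4; answers with negative advantage are pushed down.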

GRPO is “group-relative” because answers are scored relative to each other.

DeepSeek-R1 was fine-tuned using GRPO.