Group-Relative Policy Optimization (GRPO) is an online RL method in which:
- A model is given an input query and generates multiple answers (a group)
- Each answer is scored
- The mean score of the group is calculated, and each answer's score is compared to that group average
- The model is updated to be more likely to produce the better-than-average answers and less likely to produce the worse-than-average ones, while still being anchored to a reference model (as in Importance of the reference model)
This process is repeated.
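The group-relative scoring step above can be sketched in a few lines. This is a minimal illustration, not DeepSeek's implementation: `grpo_advantages` is a hypothetical helper that normalizes each answer's reward by the group's mean (and standard deviation, as in the GRPO paper), so better-than-average answers get positive advantages and worse-than-average ones get negative advantages.

```python
import statistics

def grpo_advantages(rewards):
    """Compute group-relative advantages for one group of sampled answers.

    Each reward is centered on the group mean and scaled by the group's
    standard deviation, so the sign of the advantage says whether an
    answer beat the group average.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# A group of 4 answers to one query, each scored by some reward function:
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = grpo_advantages(rewards)
```

Answers scoring above the group mean end up with positive advantages (the update pushes the model toward them), and answers below the mean end up with negative advantages; the advantages of a group always sum to zero.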
GRPO is “group-relative” because answers are scored relative to each other.
DeepSeek-R1 was fine-tuned using GRPO.