Reinforcement learning (RL) is a way to fine-tune a model using feedback on its outputs rather than fixed, correct target responses. RL techniques follow a pattern where

  1. The model is given a query and generates a response
  2. The response is scored in some way, producing a reward signal
  3. The model is trained to shift its behavior towards outputs that generate higher rewards

RL is defined by this reward-driven learning, and the different variants are distinguished by how the input queries are generated and how the reward signal is calculated.
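The three-step pattern above can be sketched in code. This is a toy illustration, not a real training API: generate, score, and update are hypothetical stand-ins, and the "model" is just a dictionary.

```python
# Toy sketch of the generic RL fine-tuning loop. All names here are
# illustrative stand-ins, not a real library's API.

def generate(model, prompt):
    # 1. The model is given a query and generates a response.
    return model["template"].format(prompt=prompt)

def score(response):
    # 2. The response is scored in some way, producing a reward signal.
    # Here: a trivial rule-based reward (longer responses score higher, capped at 1.0).
    return min(len(response) / 100, 1.0)

def update(model, prompt, response, reward):
    # 3. The model's behavior shifts toward outputs that earn higher rewards.
    # A real implementation would adjust weights; here we only record the reward.
    model["reward_history"].append(reward)
    return model

model = {"template": "Answer to '{prompt}': ...", "reward_history": []}
for prompt in ["Why is the sky blue?", "Explain entropy."]:
    response = generate(model, prompt)
    reward = score(response)
    model = update(model, prompt, response, reward)
```

The variants described below differ only in where the prompts come from and what sits inside the scoring step.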

Online and offline

Offline RL

Offline reinforcement learning is a post-pretraining process whereby

  1. The model being fine-tuned is given records of the form (prompt, response, score) produced elsewhere
  2. The model is updated to increase the likelihood that it generates higher-scoring responses and decrease the likelihood that it generates lower-scoring ones.

The process is “offline” because the inputs come from logged interactions collected earlier, before this RL process began. These records may have come from

  • an older version of this model, or a completely different model
  • humans
  • actual interactions with end-users who provided thumbs up/down preferences which were converted into scores
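A minimal sketch of the offline update, assuming a toy "model" that only tracks per-response preference weights (the centered-baseline update rule here is illustrative, not any particular algorithm). The key property: the (prompt, response, score) records are fixed and came from somewhere else, and the model never generates anything during training.

```python
# Fixed, pre-collected records; the model being trained did not produce them.
logged_records = [
    ("Why is the sky blue?", "Rayleigh scattering.", 0.9),  # e.g. from an older model
    ("Why is the sky blue?", "Because it is sad.", 0.1),    # e.g. from a human
]

def offline_update(weights, records, lr=0.5):
    # Nudge the likelihood weight of high-scoring responses up and
    # low-scoring ones down (here, relative to a 0.5 baseline).
    for prompt, response, reward in records:
        key = (prompt, response)
        weights[key] = weights.get(key, 0.0) + lr * (reward - 0.5)
    return weights

weights = offline_update({}, logged_records)
```

After the update, the high-scoring response has a positive weight and the low-scoring one a negative weight, mirroring step 2 above.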

Online RL

Online reinforcement learning is a post-pretraining process whereby

  1. The model being fine-tuned is given prompts and generates responses
  2. The responses are scored by some kind of reward mechanism (a reward model, an AI judge, or verifiable tests)
  3. The model is updated to make higher-scoring responses more likely and lower-scoring responses less likely
  4. The updated model is then used to repeat this loop

The “online” part means that, during fine-tuning, the system continuously creates new (prompt, response, score) records using a version of the model that is being continuously updated. Critically,

  • Online RL fine-tunes based on records that the model itself is generating.
  • Offline RL fine-tunes based on a fixed set of records that came from somewhere else (a different model, humans, or whatever else).
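The online loop can be sketched the same way, again with hypothetical stand-ins for the model and the reward mechanism. The essential difference from the offline sketch: each iteration creates fresh (prompt, response, score) records using the very model being updated.

```python
# Toy online RL loop. The "policy" is a dict with one knob ("verbosity")
# standing in for the model weights; the reward mechanism is a stub.

def generate(policy, prompt):
    # 1. The current policy produces a response.
    return prompt + " " + "very " * policy["verbosity"] + "blue."

def reward_model(response):
    # 2. Stand-in reward mechanism: prefers shorter responses.
    return 1.0 if len(response) < 40 else 0.0

policy = {"verbosity": 5}
for step in range(10):
    response = generate(policy, "The sky is")  # record created by the model itself
    reward = reward_model(response)
    if reward == 0.0:
        # 3. Update toward higher-scoring behavior.
        policy["verbosity"] = max(0, policy["verbosity"] - 1)
    # 4. The *updated* policy generates in the next iteration.
```

Because the updated policy feeds the next iteration, the training distribution shifts as the model improves, which is exactly what the offline setting lacks.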

Example input record

  • Prompt: “Explain why the sky is blue in two sentences.”
  • Model response: “The sky looks blue because air molecules scatter blue light more than red light. This scattering sends blue light in many directions, including into your eyes.”
  • Reward score: 0.86 (from a reward model / judge / rules)

Offline RL starts with all three of these defined a priori.

Online RL starts with only the prompts, because responses and scores are generated continuously. However, the actual RL is still updating the model based on (prompt, response, score) records.
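One way to represent such a record in code (the field names are illustrative, not from any particular library):

```python
from dataclasses import dataclass

@dataclass
class Record:
    prompt: str
    response: str
    score: float

# Offline RL: all three fields arrive already filled in, a priori.
offline_record = Record(
    prompt="Explain why the sky is blue in two sentences.",
    response=("The sky looks blue because air molecules scatter blue light "
              "more than red light. This scattering sends blue light in many "
              "directions, including into your eyes."),
    score=0.86,
)

# Online RL: only the prompts exist up front; responses and scores
# are filled in continuously during the loop.
prompts = ["Explain why the sky is blue in two sentences."]
```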

Jargon

Let’s say we’re applying reinforcement learning to fine-tune a model called OpenLLM-4.

  • The policy is OpenLLM-4 as it currently exists during fine-tuning. It’s the exact set of weights that are being updated. As RL proceeds, “the policy” is “the latest version of the model.”
  • The agent is OpenLLM-4 acting in a loop. It receives input, produces an output, gets feedback, and is updated. In papers, “agent” usually means “the thing you’re training,” so it is also OpenLLM-4. Policy and agent refer to the same model, but in different contexts.
  • The environment is everything outside the model that it interacts with during reinforcement learning, including
    • where the prompts come from
    • what produces feedback (reward model, AI judge, humans)
  • An action is the model’s output. Depending on the context, it could be a single token or a whole generated response.
  • A reward is a score computed after the model outputs something; in other words, a reward follows an action.
  • A transition (or experience) is one recorded step that includes the model input, model output, reward score, and (sometimes) the next input.
  • A trajectory (or episode) is a sequence of transitions that result from running the model on a task. A single-turn question and answer has a trajectory length of 1, but multi-turn conversations have longer trajectories. Trajectories have a well-defined end (task is finished).
  • A rollout is simply running the current model in the environment to generate one or more trajectories. Some people use “rollout” to mean just the model outputs (actions), and some include the scores as well.
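The jargon above maps naturally onto data structures. This is a sketch with illustrative names, not a standard API:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Transition:
    # One recorded step: model input, model output (the action),
    # the reward that follows, and (sometimes) the next input.
    observation: str
    action: str
    reward: float
    next_observation: Optional[str] = None

@dataclass
class Trajectory:
    # A sequence of transitions with a well-defined end.
    transitions: list = field(default_factory=list)

# A single-turn Q&A has a trajectory of length 1:
qa = Trajectory([Transition("Why is the sky blue?", "Rayleigh scattering.", 0.9)])

# A rollout runs the current policy in the environment to produce
# one or more trajectories (scores might be attached later).
def rollout(policy, prompts):
    return [Trajectory([Transition(p, policy(p), 0.0)]) for p in prompts]

trajs = rollout(lambda p: p.upper(), ["hi there"])
```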

Interesting factoids

Using a reward model for RL introduces a risk that the model inadvertently learns to game the reward through scheming and hallucinating. OpenAI found that rewarding the model for confessing, in a way that doesn’t alter the main RL reward, can stymie this behavior.1

Footnotes

  1. Joglekar et al. Training LLMs for Honesty via Confessions. 2025.