The following content was generated by ChatGPT and is not reviewed for correctness.
Here’s a concrete, soup-to-nuts sketch of an always-learning RL system for a chat product at massive scale. I’ll keep it crisp but actionable.
0) High-level shape
Users → Inference Edge → (Traffic Router) → Canary/Experiments
Inference Edge → Logging + Privacy Filter → Event Bus / Stream → Feature Store → Reward Services (R)
Event Bus / Stream → Data Lake ← Labeling (HITL/RLAIF) / Heuristics / Metrics
Reward Services (R) → Training Orchestrator (Policy π) ← Replay/On-Policy Buffers
Training Orchestrator → PPO/GRPO/DPO Workers (PEFT or Full) ← KL/Constraint Controllers
PPO/GRPO/DPO Workers → Checkpoint Registry / Eval Harness (offline + counterfactual)
Eval Harness → Gradual Rollout (shadow → canary → % ramp) → Inference Edge
1) Traffic & inference
- Traffic Router: splits user requests across model variants/policies (a routing sketch follows this list):
- Prod π_prod: current stable policy.
- Shadow π_shadow: gets mirrored traffic (no user-visible replies).
- Exploration π_exp: a small % for structured exploration (temperature, tool-use, self-ask).
- Bandit allocator: a multi-armed bandit or Bayesian Thompson sampling (TS) allocates canary traffic under regret bounds.
- Guardrails at edge: PII scrubbing before logging; safety filters (policy + heuristics) pre- and post-generation.
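A minimal routing sketch, assuming a dict-shaped request carrying a prompt-taxonomy tag; the policy handles, the 2% exploration budget, and the sensitive-category list are illustrative placeholders, not product values.

```python
import random

# Hypothetical policy handles; names are placeholders, not a real API.
POLICIES = {"prod": "pi_prod", "shadow": "pi_shadow", "exp": "pi_exp"}
EXPLORATION_BUDGET = 0.02                      # small structured-exploration slice
SENSITIVE_CATEGORIES = {"medical", "self_harm", "legal"}

def route(request: dict) -> dict:
    """Pick the policy that serves the user-visible reply and decide
    whether to mirror the request to the shadow policy (never user-visible)."""
    category = request.get("taxonomy", "general")

    # Exploration only outside sensitive categories and within budget.
    if category not in SENSITIVE_CATEGORIES and random.random() < EXPLORATION_BUDGET:
        serving_policy = POLICIES["exp"]
    else:
        serving_policy = POLICIES["prod"]

    return {"serve": serving_policy, "mirror": POLICIES["shadow"]}

# route({"taxonomy": "coding"}) -> {"serve": "pi_prod", "mirror": "pi_shadow"} (most of the time)
```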
2) Logging & privacy
- On-path redactors: deterministic PII scrubbing (emails, phone numbers, addresses, unique IDs), plus optional salted hashing for session linking (a redaction sketch follows this list).
- Consent flags & purposes: per-request + per-user; non-consented data is excluded from learning buffers.
- Minimize retention: store features and signatures (hashes) rather than raw text where possible; keep only short rolling windows of raw data.
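A minimal sketch of an on-path redactor, assuming regex-based scrubbing plus a salted hash for session linking; the patterns below are illustrative and far from exhaustive.

```python
import hashlib
import re

# Illustrative PII patterns; a production redactor would cover many more types.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub(text: str) -> str:
    """Deterministically replace PII spans with typed placeholders before logging."""
    text = EMAIL.sub("<EMAIL>", text)
    text = PHONE.sub("<PHONE>", text)
    return text

def session_key(user_id: str, salt: str) -> str:
    """Salted hash so sessions can be linked without storing raw identifiers."""
    return hashlib.sha256((salt + user_id).encode()).hexdigest()[:16]

print(scrub("Mail me at jane@example.com or call +1 415 555 0100"))
# -> "Mail me at <EMAIL> or call <PHONE>"
```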
3) Rewards (R)
Multiple complementary signals; combine into a scalar with weights and constraints:
- Implicit signals: dwell time, re-ask rate, user edits, abandonment, thumbs up/down.
- Heuristic scores: toxicity, factuality checks (retrieval cross-checks), style adherence, latency penalties.
- Reward Models (Rφ): learned preference models (pairwise or listwise) per domain (reasoning, coding, safety, helpfulness).
- Human-in-the-loop (HITL): targeted labeling on uncertainty spikes, policy disagreements, or high-impact prompts.
- Aggregation: r = w_h r_h + w_i r_i + \sum_k w_k R_{\phi_k}, with per-component caps and risk penalties (a safety violation maps to a large negative reward); a mixing sketch follows below.
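A minimal sketch of that mixer, assuming a handful of named components; the weights, cap, and safety penalty are illustrative, not tuned values.

```python
# Scalar reward r = w_h*r_h + w_i*r_i + sum_k w_k*R_phi_k with caps and a
# dominating negative penalty for safety violations. Weights are assumptions.
WEIGHTS = {"heuristic": 0.3, "implicit": 0.2, "rm_helpfulness": 0.35, "rm_safety": 0.15}
CAP = 1.0               # clip each component to [-CAP, CAP]
SAFETY_PENALTY = -10.0  # dwarfs every other term

def aggregate_reward(components: dict, safety_violation: bool) -> float:
    if safety_violation:
        return SAFETY_PENALTY
    clipped = {k: max(-CAP, min(CAP, v)) for k, v in components.items()}
    return sum(WEIGHTS[k] * clipped[k] for k in WEIGHTS if k in clipped)

# aggregate_reward({"heuristic": 0.8, "implicit": -0.1, "rm_helpfulness": 1.4}, False)
# -> 0.3*0.8 + 0.2*(-0.1) + 0.35*1.0 = 0.57
```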
4) Data plumbing
- Event Bus (Kafka/PubSub): each turn logs {prompt signature, compacted context features, policy_id, response_id, logits sketch, latency, reward features} (an example record follows this list).
- Feature Store: canonicalizes per-turn features (user/device/session, prompt taxonomy, tool-calls, retrieval stats).
- Buffers:
- On-policy buffer for PPO/GRPO (fresh rollouts from π_candidate).
- Replay buffer (curated, deduped) for off-policy or DPO/ORPO.
- Hard-negative queue for tricky prompts; curriculum queues per domain.
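An illustrative shape for one logged turn and the downstream buffers; the field and buffer names are assumptions for the sketch, not a fixed schema.

```python
from dataclasses import dataclass, field

@dataclass
class TurnEvent:
    prompt_signature: str      # hash of the scrubbed prompt, not raw text
    context_features: dict     # compacted per-turn features from the feature store
    policy_id: str             # which policy variant produced the response
    response_id: str
    logits_sketch: list        # compressed sketch of output logits / action log-probs
    latency_ms: float
    reward_features: dict = field(default_factory=dict)  # filled in as signals arrive

# Buffers fed by the stream; curation/dedup logic omitted.
on_policy_buffer: list[TurnEvent] = []     # fresh rollouts from the candidate policy
replay_buffer: list[TurnEvent] = []        # curated + deduped, for off-policy / DPO / ORPO
hard_negative_queue: list[TurnEvent] = []  # tricky prompts for curriculum queues
```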
5) Training orchestrator
- Schedulers: allocate GPU pools for:
- Policy RL updates (PPO/GRPO with KL control to reference π_ref).
- Reward-model training (periodic refresh on new preferences).
- SFT/DPO refresh (stabilize policy; reduce RM overfitting).
- PEFT first: LoRA/QLoRA adapters for fast iterations; periodically distill/merge into a full checkpoint to avoid adapter sprawl (a LoRA setup sketch follows below).
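A minimal LoRA setup sketch using Hugging Face PEFT, assuming a causal-LM base; the model name is a placeholder and the target modules (q_proj/v_proj) depend on the base architecture.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("base-model-name")  # placeholder checkpoint

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # architecture-dependent assumption
    task_type="CAUSAL_LM",
)
policy = get_peft_model(base, lora_cfg)    # only adapter weights train during RL updates
policy.print_trainable_parameters()
```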
6) RL loop details (practical)
- Rollouts: sample K responses per selected prompt; cap length; store action log-probs.
- KL regularization: add -\beta \cdot \mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}}) to the reward to discourage drift; tune β with a target-KL controller (sketched after this list).
- Advantage estimation: GAE over token-wise or sequence-wise rewards (often sequence-level; optionally token shaping for reasoning steps).
- PPO updates: small number of epochs/minibatches per batch (early stop on target KL or reward plateau).
- Stability tools: entropy bonus, response length penalty, value-function clipping, gradient norm clip, mixed-precision with loss-scale auto-tune.
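A sketch of the target-KL controller and the early-stop check, in the spirit of the adaptive-KL PPO variant; the target and multipliers are illustrative constants.

```python
TARGET_KL = 0.05  # illustrative target for KL(pi_theta || pi_ref) per batch

def update_beta(beta: float, measured_kl: float, target_kl: float = TARGET_KL) -> float:
    """Adapt the KL coefficient so the measured KL tracks the target."""
    if measured_kl > 1.5 * target_kl:
        beta *= 2.0          # drifting too far from pi_ref: penalize harder
    elif measured_kl < target_kl / 1.5:
        beta /= 2.0          # too conservative: relax the penalty
    return beta

def should_early_stop(measured_kl: float, target_kl: float = TARGET_KL) -> bool:
    """Abort the remaining PPO epochs for this batch once KL overshoots."""
    return measured_kl > 2.0 * target_kl
```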
7) Continuous reward-model (RM) upkeep
- Uncertainty-aware sampling: send high-variance or high-disagreement items to human raters (a disagreement check is sketched after this list).
- RM ensembles: per-domain heads; calibrate with temperature scaling/Platt scaling.
- Adversarial checks: probe for reward hacking; rotate adversarial prompts; add counter-bias terms.
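A small sketch of ensemble-disagreement routing plus a single-head temperature-scaling calibrator; the disagreement threshold is an assumption.

```python
import numpy as np

DISAGREEMENT_THRESHOLD = 0.3  # illustrative cutoff on ensemble score spread

def needs_human_label(ensemble_scores: np.ndarray) -> bool:
    """ensemble_scores: shape (n_models,) RM scores for one (prompt, response)."""
    return float(ensemble_scores.std()) > DISAGREEMENT_THRESHOLD

def calibrated_preference(logit: float, temperature: float) -> float:
    """Temperature-scaled sigmoid; the temperature is fit on a held-out preference set."""
    return float(1.0 / (1.0 + np.exp(-logit / temperature)))

# needs_human_label(np.array([0.1, 0.9, -0.4])) -> True (raters see this item)
```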
8) Evaluation & gating
- Offline eval: nightly suites (reasoning/math/code/safety/multilingual); exact-match, Pass@k, judge-LLM pairwise preferences.
- Counterfactual evaluation: inverse propensity scoring (IPS) and doubly robust (DR) estimators on logged data to estimate a candidate policy's uplift without full production exposure (an IPS sketch follows this list).
- Canary protocol: shadow → 0.5% → 5% → 20% with guardrail thresholds (safety, latency, CTR, satisfaction). Auto-rollback if breached.
- Versioning: immutable checkpoints in a registry; fast rollback path; reproducible training manifests.
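A minimal clipped-IPS estimator of a candidate policy's average reward from logged data; DR would additionally subtract a reward-model baseline. Variable names and the clip value are illustrative.

```python
import numpy as np

def ips_estimate(rewards, logging_propensities, candidate_propensities, clip=10.0):
    """Clipped inverse-propensity estimate over logged (prompt, response) pairs.
    rewards[i]                 observed reward for the logged response
    logging_propensities[i]    probability the logging policy gave that response
    candidate_propensities[i]  probability the candidate policy would give it
    """
    weights = np.minimum(candidate_propensities / logging_propensities, clip)
    return float(np.mean(weights * rewards))

# Illustrative numbers: the candidate upweights the well-rewarded response.
# ips_estimate(np.array([1.0, 0.0]), np.array([0.5, 0.5]), np.array([0.8, 0.2])) -> 0.8
```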
9) Exploration without wrecking UX
- Small exploration budget (e.g., 1–3%) routed to π_exp with controlled higher temperature, alternative tool strategies, or chain-of-thought/tree-of-thought variants (where allowed internally).
- Contextual bandits at the router pick between candidate policies per prompt taxonomy and decay bad arms quickly (a bandit sketch follows this list).
- Safe exploration: exploration responses still pass safety filters; in sensitive categories exploration = 0%.
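A sketch of a per-taxonomy bandit with discounted Beta posteriors (Thompson sampling), so evidence against a bad arm takes effect quickly; the arm names and decay rate are assumptions.

```python
import random

DECAY = 0.99  # forget old evidence so regressions surface fast

class TaxonomyBandit:
    """One instance per prompt-taxonomy bucket at the router."""

    def __init__(self, arms):
        self.posteriors = {a: [1.0, 1.0] for a in arms}  # Beta(alpha, beta) per policy

    def pick(self) -> str:
        samples = {a: random.betavariate(p[0], p[1]) for a, p in self.posteriors.items()}
        return max(samples, key=samples.get)

    def update(self, arm: str, success: bool) -> None:
        for p in self.posteriors.values():               # decay every arm each step
            p[0] = 1.0 + DECAY * (p[0] - 1.0)
            p[1] = 1.0 + DECAY * (p[1] - 1.0)
        self.posteriors[arm][0 if success else 1] += 1.0

bandit = TaxonomyBandit(["pi_prod", "pi_candidate"])
arm = bandit.pick()
bandit.update(arm, success=True)  # success = guardrails passed and user signal positive
```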
10) Cost & throughput controls
- Token budgets per tier: cap max output length by user/product tier and by experiment arm.
- Adaptive batching & speculative decoding at inference edge; kv-cache reuse across variants where safe.
- Prioritize learning on “high-leverage prompts” (high frequency × high uncertainty × high impact); a scoring sketch follows below.
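One way to operationalize that score, assuming a log-dampened frequency and an impact weight; the exact combination is an assumption, not a prescribed formula.

```python
import math

def leverage_score(frequency: int, rm_disagreement: float, impact_weight: float) -> float:
    """High frequency x high uncertainty x high impact, with log-dampened frequency."""
    return math.log1p(frequency) * rm_disagreement * impact_weight

def top_k_for_training(prompt_stats, k: int = 1000):
    """prompt_stats: iterable of (prompt_signature, frequency, disagreement, impact)."""
    scored = [(sig, leverage_score(f, d, i)) for sig, f, d, i in prompt_stats]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]
```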
11) Governance & compliance
- Data contracts: explicit tables for “learnable” vs “non-learnable” data; enforce at query time.
- Right-to-delete: training manifests maintain data lineage; when hard deletion from model weights isn't feasible, redact the data's influence in future updates via SFT/DPO counter-training.
- Red-team loops: continuous adversarial eval; blocklist regression tests.
12) A concrete “minute in the life” (E2E)
- User sends prompt P. Router assigns to π_prod (and mirrors to π_shadow).
- π_prod generates response A; guardrails pass; the user sees the reply. Metrics are emitted.
- Logging pipeline scrubs P and A, computes heuristic scores; reward models score A; implicit signals arrive later.
- Orchestrator selects a batch of (P, A, rewards) → computes advantages vs π_ref → runs PPO (LoRA) for N steps.
- New π_candidate passes offline eval; IPS/DR estimates show uplift; safety OK.
- Canary at 1% traffic; bandit allocator increases share if metrics improve.
- Nightly: reward models retrained on fresh preferences; distillation merges LoRA into new base; π_ref updated.
- If any metric regresses, auto-rollback; incident is triaged with logged artifacts.
13) What changes with “massive” compute
- Scale out rewards: bigger, domain-specialized RMs; sequence-level critics; multi-objective Pareto fronts.
- Richer environments: tool-use, code execution sandboxes, retrieval-augmented tasks; automatic creation of hard curricula.
- Always-on co-training: policy π and reward R co-evolve; frequent small updates (minutes-scale) with strong gating.
- Deeper exploration: broader prompt space, synthetic task generators, disagreement-driven sampling.
14) Minimal knobs to get right
- Target-KL controller (prevents drift).
- Reward mixture weights (don’t let style/sycophancy dominate truthfulness).
- Canary gating thresholds (safety first, then user value, then cost).
- RM refresh cadence (avoid overoptimization to stale R).