Peter Wofford

Staff Engineer · AI Infrastructure

GRPO Has a Ten-Step Window

The Resilience I Found Was Real. It Was Also Temporary.

In my last post, I concluded that GRPO can't teach strategy but can teach resilience. I had the data: +25 percentage points on attribute removal, +10 on forced bad starts, all in five training steps. The agent learned to pivot around disabled attributes, always submit a guess, check its candidate list at decision points.

Then I ran the full 50-step experiments. Both collapsed.

What the Smoke Tests Showed

Quick recap: an SFT-trained 20 Questions agent hitting 95% accuracy on clean games, matching GPT-5.2. I wanted to know whether GRPO could teach it to handle adversity. Three perturbation types, three results:

| Perturbation | Baseline | After 5 GRPO steps |
|---|---|---|
| Answer corruption (invisible) | 10% | 5% (worse) |
| Forced bad start (observable) | 90% | 100% |
| Attribute removal (observable) | 60% | 85% |

The observable perturbations showed clear improvement. When is_food returns "unknown," try is_edible. When your memorized sequence breaks, check get_top_candidates and adapt. Not formatting tricks; I'd already fixed the environment to rule that out.
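Sketched as code, that pivot behavior looks something like this. The `FALLBACKS` map and the `env` interface are hypothetical stand-ins; only `is_food`, `is_edible`, and `get_top_candidates` come from the actual environment:

```python
# Hypothetical sketch of the attribute-pivot behavior GRPO reinforced.
# The fallback map and env interface are illustrative assumptions.

FALLBACKS = {
    "is_food": ["is_edible", "is_organic"],
    "is_alive": ["is_animal"],
}

def ask_with_pivot(env, attribute):
    """Ask about an attribute; if it's disabled, try semantic neighbors."""
    answer = env.ask(attribute)
    if answer != "unknown":
        return attribute, answer
    for alt in FALLBACKS.get(attribute, []):
        answer = env.ask(alt)
        if answer != "unknown":
            return alt, answer
    # No usable substitute: fall back to checking the candidate list.
    return None, env.get_top_candidates()
```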

Five steps looked like a clean win.

What Fifty Steps Showed

| Perturbation | Baseline | Step 10 (peak) | Step 50 (final) |
|---|---|---|---|
| Forced bad start | 55% | 98% | 26% |
| Attribute removal | 60% | ~80% | 0% |

Both peaked around step 10. Then GRPO kept optimizing.

The forced bad start run degraded to suicide guessing: guessing immediately, even with 14 candidates remaining, avoids the accumulated cost of asking questions. Questions dropped from 12 to 6 by step 25. Reward went negative and stayed there.

The attribute removal run was worse. Two collapse phases: suicide guessing (steps 30-40, questions crashed from 12 to under 1), then complete policy death (steps 40-50, zero questions, zero guesses, 76 candidates remaining). The agent stopped doing anything. The same safe-haven collapse from my very first experiment: doing nothing beats getting penalized.

Step-by-step for attribute removal:

| Step | Correct rate | Avg questions | What's happening |
|---|---|---|---|
| 1 | 80% | 12.4 | Normal play |
| 10 | 75% | 10.8 | Peak zone |
| 25 | 78% | 12.5 | Still functional |
| 30 | 80% | 12.7 | Reward turns negative |
| 35 | 100% | 5.2 | Collapse begins |
| 40 | 100% | 0.75 | Near-total collapse |
| 45 | 0% | 0.0 | Policy death |
| 50 | 0% | 0.0 | Agent does nothing |

The 100% correct rate at steps 35-40 is a mirage. The agent guesses on every trajectory, and some happen to be right, but it's stopped asking questions. By step 45, it's stopped doing anything at all.

The Same Attractors, Every Time

I've now seen policy collapse in four different settings:

  1. Run 1: GRPO from scratch. Collapsed to inaction by step 20.
  2. Run 4b (50 steps): GRPO from SFT checkpoint, forced bad starts. Collapsed to suicide guessing by step 25.
  3. Run 4c (50 steps): GRPO from SFT checkpoint, attribute removal. Collapsed to inaction by step 45.
  4. Run 4a (5 steps): Answer corruption. Didn't collapse but degraded; early signs of the same drift.

Every collapse converges to one of two modes: guess immediately with too many candidates, or do nothing. These aren't bugs. They're attractors in the reward landscape. Any RL setting with costly action sequences, sparse terminal reward, and a zero-cost default will have them.
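A toy expected-reward calculation shows how the suicide-guessing attractor opens up. Every number here (question cost, win/lose reward, how much each question narrows the field) is an illustrative assumption, not the actual reward function:

```python
# Toy model of the ask-vs-guess tradeoff. All constants are
# illustrative assumptions, not the environment's real reward function.

def expected_reward(n_candidates, n_questions, split=2.0,
                    question_cost=0.3, win=1.0, lose=-1.0):
    """Ask n_questions (each shrinking the candidate set by `split`),
    then guess uniformly among the survivors."""
    remaining = max(1.0, n_candidates / split ** n_questions)
    p_correct = 1.0 / remaining
    return (p_correct * win + (1 - p_correct) * lose
            - n_questions * question_cost)

# When questions reliably halve the field, asking pays off. When a
# perturbation makes them less informative (split closer to 1), the
# per-question cost dominates and guessing immediately becomes optimal.
best_clean = max(range(15), key=lambda q: expected_reward(14, q, split=2.0))
best_perturbed = max(range(15), key=lambda q: expected_reward(14, q, split=1.2))
```

With these toy numbers, `best_clean` lands at a handful of questions while `best_perturbed` is zero questions: the suicide guess. The attractor isn't a bug in the policy; it's the argmax of the landscape once questions stop paying for themselves.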

The SFT starting point delays collapse (Run 1 hit it at step 20; the SFT-initialized runs lasted until steps 25-45) but doesn't prevent it. Starting closer to optimal buys time. It doesn't change the landscape.

The Sweet Spot Is Narrow

GRPO's optimization has three phases on top of an SFT checkpoint:

Phase 1 (steps 1-10): Genuine refinement. GRPO reinforces behaviors the SFT model almost-but-not-quite learned. Pivoting around disabled attributes. Checking candidates before guessing. Always submitting a guess. The SFT model had the right strategy but brittle execution. GRPO smooths the edges.

Phase 2 (steps 10-30): Plateau and drift. The easy improvements are captured. Accuracy stays high but volatile โ€” 75-80%, questions holding around 12. Functional, but GRPO is finding shortcuts. Reward turns negative around step 30. Bending toward collapse but not broken yet.

Phase 3 (steps 30+): Collapse. The shortcuts take over. Questions drop to near-zero. Once the policy commits to guessing early or doing nothing, reward variance drops and there's no gradient signal to recover.

The smoke tests only saw Phase 1. The full runs saw all three.

What This Means for My Previous Post

The first post's conclusion ("give the model the strategy through SFT, then use RL to make it survive the real world") was directionally right but overstated. More accurately:

Give the model the strategy through SFT. Then use RL very carefully and briefly to make it slightly more robust. Then stop before it forgets everything.

That's less catchy. It's also more honest.

The resilience improvements were real. Attribute-pivoting, always-guess, strategic candidate checking: GRPO genuinely taught those in 5-10 steps. But GRPO doesn't know when to stop. Left running, it optimizes past the sweet spot into the same degenerate attractors it always finds. The improvements and the collapse are both real. They just happen at different timescales.

The Practical Takeaway

If you're using GRPO (or similar on-policy RL) to refine an SFT-trained agent:

  1. Run smoke tests first. 5-10 steps will tell you if there's signal. No improvement in 10 steps means no improvement in 50.
  2. Eval early and often. The peak is around step 10. You won't know until you measure.
  3. Early stop aggressively. Stopping too early leaves improvement on the table. Stopping too late destroys everything the SFT taught. Guess which costs more.
  4. Don't trust training metrics past the peak. The 100% correct rate at step 35 in my attribute removal run was the agent guessing randomly on every trajectory. High accuracy during training can mean collapse, not improvement.
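The checkpoint-and-stop discipline from the list above can be sketched as a loop. `train_step`, `evaluate`, and the patience threshold are all placeholders for your own setup, not ART's API:

```python
# Sketch of an eval-driven early-stopping loop for a short GRPO pass.
# train_step() and evaluate() are placeholders for your actual GRPO
# step and held-out eval; eval_every and patience are assumptions.

def train_with_early_stop(train_step, evaluate, max_steps=50,
                          eval_every=5, patience=2):
    """Stop once held-out accuracy fails to improve `patience` evals
    in a row; return the best checkpoint's step and accuracy."""
    best_acc, best_step, strikes = -1.0, 0, 0
    for step in range(1, max_steps + 1):
        train_step()
        if step % eval_every:
            continue
        acc = evaluate()  # held-out eval, NOT the training reward
        if acc > best_acc:
            best_acc, best_step, strikes = acc, step, 0
        else:
            strikes += 1
            if strikes >= patience:
                break  # past the peak: restore the best checkpoint
    return best_step, best_acc
```

The key design choice is that the stopping signal comes from a held-out eval, never from training metrics, which (per takeaway 4) can read 100% while the policy is collapsing.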

The deeper question is whether standard RL stabilization techniques could extend the window. A KL divergence penalty from the SFT reference policy would directly penalize wandering too far from the learned strategy; it's the standard RLHF fix for this kind of drift, and it makes the degenerate attractors more costly to reach. An entropy bonus would push back on the variance collapse I keep seeing, where all trajectories converge to the same degenerate behavior. I didn't test either. They're where I'd start next.
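In plain Python, those two terms added to a policy loss look roughly like this. The k1 KL estimator and the coefficients are assumptions for illustration, not what ART or any specific library ships:

```python
import math

# Sketch of a GRPO-style loss with the two stabilizers discussed above:
# a KL penalty toward the SFT reference and an entropy bonus.
# Coefficients and the k1 KL estimator are illustrative assumptions.

def stabilized_loss(logp, logp_ref, advantages, probs,
                    kl_coef=0.05, ent_coef=0.01):
    """logp/logp_ref: per-sample log-probs under current/SFT policy.
    probs: per-sample action distributions (lists of probabilities)."""
    n = len(logp)
    pg = -sum(a * lp for a, lp in zip(advantages, logp)) / n
    kl = sum(lp - lr for lp, lr in zip(logp, logp_ref)) / n  # k1 estimate
    entropy = -sum(p * math.log(p) for dist in probs for p in dist) / n
    return pg + kl_coef * kl - ent_coef * entropy
```

The KL term grows as the policy drifts from the SFT checkpoint, so the attractors cost more to reach; the entropy term grows as trajectories converge on one behavior, directly taxing the variance collapse.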

The Revised Thesis

After ten runs and $65 in compute:

GRPO is a refinement layer, not a training method. It can polish SFT-learned behavior within a narrow window (error pivoting, always-guess heuristics, candidate checking). It cannot teach strategy, cannot teach reasoning, and extended application destroys the very behaviors it initially improves.

The recipe: pretraining for reasoning capacity. SFT for strategy. A carefully early-stopped GRPO pass for robustness. And the discipline to stop.


This is a follow-up to The Reward Function Is Never the Problem You Think It Is. Built with ART (OpenPipe), Qwen-14B, and ~$65 total compute across all experiments. Code on GitHub.