The Reward Function Is Never the Problem You Think It Is
RL Refines Execution, It Doesn't Teach Reasoning: A 20 Questions Case Study
Ten steps into training, my 20 Questions agent hit 65% accuracy, matching GPT-4o. By step 20, it scored 0%. Not because it guessed wrong. Because it stopped trying.
The agent found a loophole: doing nothing yields a timeout reward of 0.0, which beats the negative reward from a wrong guess. GRPO reinforced inaction. Every trajectory converged to the same degenerate loop: check the attribute list, check the candidate count, repeat until time runs out. Reward standard deviation dropped to zero. No variance, no gradient signal, no escape.
I spent the next twelve hours figuring out that this wasn't a reward function problem. It was a question about what RL can and can't teach a language model.
Why 20 Questions
20 Questions with predefined boolean attributes is a solved problem. A greedy information-gain algorithm (about 30 lines of Python) achieves 100% accuracy in ~6.5 questions across 76 objects. An oracle exists. Optimal play is fully defined.
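For reference, that greedy player really does fit in a few lines. A minimal sketch, assuming each object is a dict mapping an `id` to a set of its true boolean attributes (the data layout here is mine, not the environment's):

```python
import math

def entropy(p):
    # Binary entropy of a yes-fraction p; 0 when every candidate agrees.
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def best_question(candidates, attributes):
    # Greedy information gain: ask the attribute whose yes/no split
    # comes closest to halving the remaining candidate set.
    def gain(attr):
        yes = sum(1 for obj in candidates if attr in obj["attrs"])
        return entropy(yes / len(candidates))
    return max(attributes, key=gain)

def play(objects, secret, max_questions=20):
    candidates = list(objects)
    attributes = set().union(*(o["attrs"] for o in objects))
    asked = 0
    while len(candidates) > 1 and attributes and asked < max_questions:
        attr = best_question(candidates, attributes)
        answer = attr in secret["attrs"]
        candidates = [o for o in candidates if (attr in o["attrs"]) == answer]
        attributes.discard(attr)
        asked += 1
    return (candidates[0]["id"] if candidates else None), asked
```

With 76 objects and reasonably discriminative attributes, each question removes roughly half the candidates, which is where the ~6.5-question average comes from (log2 76 ≈ 6.25).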
I wasn't trying to build the best 20Q agent. I was using 20Q as a testbed because when you know what optimal looks like, you can measure exactly where RL deviates. In real agent tasks you don't know what optimal looks like. Here I did.
The thesis: GRPO acquires behavioral patterns but cannot acquire novel algorithmic reasoning from reward signal alone. RL refines execution. It doesn't teach reasoning.
The Collapse and What It Taught Me
| Metric | Step 0 | Step 10 (peak) | Step 20 (collapsed) |
|---|---|---|---|
| Accuracy | 30% | 65% | 0% |
| Avg questions asked | 12.7 | 10.5 | 0.2 |
| Reward std dev | ~12 | ~8 | ~0 |
Step 0 is the base Qwen-14B with no training: 30% accuracy out of the box from pretraining alone. GRPO sharpened that: by step 10, the model asked fewer but better questions, narrowed candidates efficiently, and matched GPT-4o without ever being shown a single example of good play.
Then it collapsed. Timeout reward (0.0) beats wrong-guess penalty (-15), so once a few trajectories discovered the do-nothing loop, GRPO amplified it. The reward_std_dev chart tells the whole story: it drops from ~12 to ~0 as every trajectory converges to identical behavior. At std_dev ≈ 0, GRPO's advantage computation produces no gradient. Game over.
This isn't a hyperparameter issue. It's structural: trajectory-level reward plus on-policy RL means degenerate low-variance policies are stable equilibria.
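To see why the equilibrium is stable, it helps to write out GRPO's group-normalized advantage. A simplified sketch (real implementations differ in epsilon handling and clipping):

```python
def grpo_advantages(rewards, eps=1e-8):
    # GRPO's per-group advantage: normalize each trajectory's reward
    # against the mean and std of its own rollout group.
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

A mixed group like `[10.0, -15.0, 0.0, 10.0]` yields positive and negative advantages, so the winners get reinforced. Once every rollout is the do-nothing timeout, the rewards are all 0.0, mean and std are both zero, and every advantage is exactly 0.0. The update is a no-op, and nothing ever perturbs the policy back out.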
The Metrics Lie (Read Your Trajectories)
Two stories. Same moral.
Story one: the silent failure. Early on, I tried expert iteration: injecting perfect oracle trajectories into the training pipeline. W&B showed 100% accuracy. Loss curves looked healthy. Everything green. But ART is on-policy: it needs logprobs from the model's own generation to compute gradients. The oracle trajectories had no logprobs. The optimizer accepted the data, logged the metrics, and silently skipped every gradient update.¹ The model didn't change. I spent hours celebrating a result that didn't exist.
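ART's internals aren't shown here, but the failure mode generalizes to any on-policy trainer. A hypothetical guard (the trajectory schema and field names are invented) that would have turned the silent skip into a loud error:

```python
def usable_for_grpo(trajectory):
    # On-policy trainers differentiate through the sampled tokens'
    # logprobs. An injected oracle trajectory has none, so it can be
    # logged and scored but contributes zero gradient.
    return all(step.get("logprob") is not None for step in trajectory["steps"])

def filter_batch(trajectories):
    usable = [t for t in trajectories if usable_for_grpo(t)]
    if len(usable) < len(trajectories):
        raise ValueError(
            f"{len(trajectories) - len(usable)} trajectories lack logprobs "
            "and would be silently skipped by the optimizer"
        )
    return usable
```

Five lines of validation would have saved hours of celebrating flat gradients.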
Story two: the formatting confound. Later, testing GRPO's resilience to perturbations, I saw 15% → 80% accuracy. Dramatic. Except trajectory inspection revealed that 90% of baseline "failures" were correct identifications submitted by name ("Dog") instead of opaque ID ("d4t6u"). The environment did strict string matching. The model's reasoning was 90% accurate at baseline; the 15% number reflected a formatting convention, not a reasoning failure. True GRPO improvement: +10 percentage points, not +65.
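A toy version of the scoring check, with made-up IDs, shows how much rides on a single string comparison:

```python
OBJECTS = {"d4t6u": "Dog", "k9w2m": "Cat"}  # hypothetical ID -> display-name table

def is_correct(guess, true_id, strict=True):
    if guess == true_id:
        return True
    if strict:
        return False
    # Lenient scoring also accepts the object's display name, which is
    # what most of the "failed" baseline trajectories actually submitted.
    return guess.lower() == OBJECTS.get(true_id, "").lower()
```

Under strict matching, submitting "Dog" for `d4t6u` counts as a total failure; under lenient matching it's a win. The 15% vs. 90% baseline gap lives entirely in that branch.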
Same moral: aggregate metrics will lie to you in exactly the way you want to be lied to. Reading 20 individual trajectories is tedious. It's also where you find that your gradients are zero, or that your environment is measuring the wrong thing.
SFT Does in One Pass What GRPO Couldn't
If GRPO can't discover the strategy, can supervised fine-tuning just hand it over?
First attempt: 0% accuracy. The model learned the oracle's question-asking strategy perfectly (7.0 avg questions, 1.1 candidates remaining) but hallucinated fake object IDs at guess time. The oracle has direct state access and never calls get_top_candidates to look up IDs. The LLM can only see IDs through that tool call. Since the training data never included it, the model invented plausible-looking IDs instead.
The fix: inject a get_top_candidates call before every submit_guess in the 76 oracle trajectories. Second attempt: 95% accuracy. That's in the same range as GPT-5.2, which I benchmarked at 96.7% on the same environment.² Seventy-six training examples on a 14B model matched a frontier model orders of magnitude larger.
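The trajectory patch itself is mechanical. A sketch, assuming trajectories are lists of tool-call messages (the message schema is mine; the tool names are the environment's):

```python
def patch_oracle_trajectory(messages):
    # The oracle guesses from direct state access, so its trajectories
    # never show the get_top_candidates call the LLM needs in order to
    # see real object IDs. Splice one in before every submit_guess.
    # (A real pipeline would also splice in the tool's response
    # carrying the true IDs, so the guess is grounded in context.)
    patched = []
    for msg in messages:
        if msg.get("tool") == "submit_guess":
            patched.append({"tool": "get_top_candidates", "args": {"n": 3}})
        patched.append(msg)
    return patched
```

One mechanical edit to 76 examples took the model from 0% to 95%: the strategy was never the problem, the missing grounding step was.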
SFT trivially teaches what GRPO couldn't learn. The base model already understands binary search and information theory from pretraining. It just needs to be shown the pattern once.
What GRPO Can Learn: Resilience
SFT-trained agents are brittle: they've only seen the happy path. When something goes wrong mid-game, they have no experience recovering. I designed three perturbation experiments to test whether GRPO could teach resilience, even if it couldn't teach strategy.
Answer corruption (invisible perturbation): Flip 15% of yes/no answers silently. The environment state updates consistently based on the lie; nothing looks wrong from the agent's perspective. Result: SFT accuracy dropped from 95% to 10%. GRPO made it worse, down to 5%. Recovery would require reasoning: "I asked is_animal and got yes, but Desk is still in the candidate list; that's inconsistent." That's cross-referencing world knowledge against tool outputs. GRPO can't teach it.
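For concreteness, here's the kind of check recovery would require, with a stand-in knowledge table. The point is that this table has to come from the model's world knowledge; nothing in a scalar trajectory reward tells the agent to run this check:

```python
# Hypothetical slice of world knowledge about candidate objects.
WORLD_KNOWLEDGE = {"desk": {"is_animal": False}, "dog": {"is_animal": True}}

def find_contradictions(answers, candidates):
    # Cross-reference the environment's answers against prior knowledge:
    # if we were told is_animal=yes but Desk survives the filter,
    # some earlier answer must have been a lie.
    flagged = []
    for obj in candidates:
        known = WORLD_KNOWLEDGE.get(obj, {})
        for attr, answer in answers.items():
            if attr in known and known[attr] != answer:
                flagged.append((obj, attr))
    return flagged
```

Given `answers={"is_animal": True}` and `candidates=["desk", "dog"]`, the check flags `desk` immediately, which is exactly the inference neither the SFT nor the GRPO agent ever made.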
Forced bad start (observable perturbation): Pre-ask 2-3 random non-optimal questions before the agent takes control. The agent inherits a candidate set it didn't create. Result: +10 percentage points (90% → 100%). The SFT agent was already surprisingly robust to bad starts: its binary search strategy transferred even from non-optimal states. But GRPO polished the remaining failures. The agent learned to call get_top_candidates mid-game and pick attributes that discriminate among the actual remaining candidates, rather than rigidly following its memorized sequence.
Attribute removal (observable perturbation): Disable 15% of attributes per episode; they return "unknown" when queried. Result: +25 percentage points (60% → 85%) in just 5 training steps. This was the strongest signal. The SFT model would compulsively retry disabled attributes: one episode asked is_animal seven times in a row, getting "unknown" each time, never pivoting. Post-GRPO, the agent learned three distinct behavioral adaptations: skip disabled attributes and try functional equivalents (is_animal blocked → try is_living), use get_top_candidates when narrowing stalls, and always submit a guess rather than timing out. The remaining failures were all cases where critical branching attributes were disabled, requiring full decision-tree replanning that GRPO couldn't provide.
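Those three adaptations compose into a small control policy, which is exactly the kind of thing a reward signal can shape. A sketch (the fallback table is invented; the tool names are the environment's):

```python
# Hypothetical map from blocked attributes to functional equivalents.
FALLBACK = {"is_animal": "is_living", "is_electronic": "is_man_made"}

def choose_action(preferred_attr, disabled, candidates, questions_left):
    # Adaptation 3: never time out; when the clock (or candidate set)
    # is nearly exhausted, submit a guess instead of stalling.
    if questions_left <= 1 or len(candidates) == 1:
        return ("submit_guess", candidates[0])
    # Adaptation 1: skip disabled attributes, pivot to an equivalent.
    attr = preferred_attr
    if attr in disabled:
        attr = FALLBACK.get(attr)
    # Adaptation 2: if no usable attribute remains, re-ground the
    # search by asking the environment for the live candidates.
    if attr is None or attr in disabled:
        return ("get_top_candidates", None)
    return ("ask", attr)
```

Every branch here is a local, observable if-then; none of it requires replanning the whole decision tree, which is why reward alone could instill it.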
The pattern across all three is clean: GRPO teaches resilience when perturbations are observable and recovery requires local behavioral adaptation. It fails when recovery requires reasoning about invisible state or global replanning.
The Refined Thesis
After a day of experiments, the thesis sharpened:
Use SFT or pretraining to teach the strategy. Use GRPO to make it resilient.
GRPO isn't a strategy teacher. It's a robustness layer. Behavioral patterns (tool-use rituals, skip-and-pivot, always-submit-a-guess) are learnable from reward signal. Algorithmic reasoning (entropy maximization, inconsistency detection, replanning around structural failures) is not.
If you're building RL-trained agents, don't expect the reward function to discover your agent's strategy. You will tune it six ways. You will add bonuses and penalties and shaping terms. You will convince yourself the next version will fix the collapse. It won't. The reward function was never the problem. The problem was asking RL to teach something it can't learn.
Give the model the strategy through SFT. Then use RL to make it survive the real world.
Built with ART (OpenPipe), Qwen-14B, and approximately $50 in compute across all experiments. The entire stack is open-source: ART for GRPO training, a serverless backend for compute (no reserved GPUs), and W&B for logging. Code is on GitHub.
Follow-up: GRPO Has a Ten-Step Window
¹ This appears to be expected behavior for ART's on-policy architecture, though I didn't find it documented. If you inject trajectories without model logprobs, the optimizer has nothing to differentiate against.
² These are my own evaluations on the same 76-object environment, not published benchmarks from OpenAI.