Amrith Setlur and Aviral Kumar, Carnegie Mellon University
This blog post answers the above question using insights from some of our works, and we encourage readers to refer to them for a deeper dive into how test-time scaling for reasoning is essentially a meta reinforcement learning (RL) problem, how this perspective provides methods that scale test-time compute beyond the budgets models were trained on, how dense rewards help, and why RL, rather than SFT or distillation, is the right tool for optimizing test-time compute.
With the rise of strong RL-trained reasoning models that generate long chains of thought, a key question remains unanswered: how exactly does reinforcement learning (RL) fine-tuning with a 0/1 reward improve reasoning (or is this an illusion)? Since the release of OpenAI o1 and the DeepSeek-R1 technical report, researchers have explored two seemingly divergent themes to answer this question. On one side, researchers have studied how RL encourages longer chains of thought (CoT) and how long CoTs improve reasoning and scale test-time compute better. The other line of work argues that RL simply “sharpens” the base model’s reasoning behavior, as evidenced by the efficacy of training with self-defined rewards or even noisy and incorrect rewards. These two viewpoints appear contradictory: by definition, the latter implies that RL cannot discover new strategies and only “sharpens” around correct responses within the base LLM’s support, while the former argues that RL “discovers” new strategies that are never sampled by the base LLM being fine-tuned.
Given that both types of approaches have been shown to improve performance (subject to evaluation caveats), a natural question is what RL is precisely doing. In this blog post, we aim to reconcile these perspectives and answer: can RL with simple 0/1 outcome rewards actually discover new capabilities beyond the base model? And if so, what does it take to make that happen in practice?
Recent work has shown that even spurious reward functions, such as rewards for formatting, confidence, or simple heuristics like majority voting, improve performance on math reasoning benchmarks like MATH500 and AMC, while other work shows that RL training on just a single example can boost performance on the same benchmarks. Many papers (and X posts) also suggest that RL only makes minor improvements by amplifying reasoning traces already within the base model, with most of the heavy lifting done by pre-training or mid-training (see here).
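To make the notion of a “spurious” or self-defined reward concrete, below is a minimal sketch of a majority-vote reward that consults no ground-truth labels at all. The function name and signature are ours for illustration and are not taken from any of the cited papers.

```python
from collections import Counter

def majority_vote_reward(sampled_answers: list[str], candidate_answer: str) -> float:
    """Self-defined reward: a rollout is rewarded if its final answer matches
    the most common answer among the model's own samples for the same prompt.
    No ground-truth label is consulted anywhere."""
    majority_answer, _ = Counter(sampled_answers).most_common(1)[0]
    return 1.0 if candidate_answer == majority_answer else 0.0

# Toy usage: the majority answer among the model's own samples is "42".
print(majority_vote_reward(["42", "42", "17", "42"], "42"))  # -> 1.0
```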
We argue that this is largely true, but only when we operate within specific training and evaluation setups. In particular, under short training budgets (e.g., <8192 tokens), narrow training datasets (e.g., a training set consisting only of AIME problems or MATH-12k), and binary (0/1) outcome-based rewards, RL can attain higher training rewards by only sharpening the base LLM distribution. In our prior work on process rewards (PAVs) (see Figure 1 below), we found a similar result: outcome-reward (ORM) RL training on MATH-12k rarely enables the model to solve harder problems that were not already solved by the base model at least once under a large sampling budget, and it attains worse pass@N. In contrast, dense rewards from the PAV technique addressed this issue, enabling a higher pass@N and higher accuracy on hard problems.
Figure 1: Policies trained with dense-reward RL discover solutions to hard problems and explore the search space of token traces more efficiently: (a) Best-of-N performance for policies trained with SFT, RFT, outcome-reward RL, and dense-reward RL (PAVs). (b) Amongst hard problems that remain unsolved by Best-of-256 over the SFT policy, we check how many are solved by Best-of-N over a policy trained with dense rewards (PAV-RL) or one trained only with outcome rewards (ORM-RL). PAVs solve many of these questions, implying that they find reasoning strategies not present in the base model.
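For readers less familiar with the pass@N metric used above and in Figure 1, the snippet below shows the standard unbiased estimator (popularized by the Codex paper) for the probability that at least one of N drawn samples is correct; this is only an illustration of the metric, not the evaluation code used in our experiments.

```python
from math import comb

def pass_at_n(n_samples: int, n_correct: int, n: int) -> float:
    """Unbiased pass@N estimator: given `n_samples` rollouts for a problem,
    of which `n_correct` are correct, return the probability that a random
    subset of size `n` contains at least one correct rollout."""
    if n_samples - n_correct < n:
        return 1.0  # every size-n subset must contain a correct rollout
    return 1.0 - comb(n_samples - n_correct, n) / comb(n_samples, n)

# Toy usage: 256 rollouts, 4 of them correct; estimate pass@8 and pass@256.
print(round(pass_at_n(256, 4, 8), 3))  # ~0.12
print(pass_at_n(256, 4, 256))          # 1.0
```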
Why? The “rich-gets-richer” effect. Under tight length budgets, the main way RL can optimize the training reward is by sampling a successful rollout and then increasing the model’s probability of sampling that rollout again. As training progresses, token entropy drops and completions shorten, indicating a “rich-gets-richer” effect (e.g., see Figure 5d here). This sharpening of the LLM’s probability distribution explains the gains on the earlier benchmarks, but it does not reflect genuine discovery of new reasoning abilities. As such, it resembles an implicit “prompt tuning” effect: careful prompting that drives the base model towards this sharpened mode might recover fine-tuning performance in many cases.
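The toy simulation below illustrates this dynamic under a deliberately simplified assumption: a softmax policy over a handful of fixed candidate rollouts, updated with REINFORCE on a 0/1 reward. This is not the actual training setup of any of the papers above; it only shows how rollouts the policy already samples get reinforced and how entropy collapses as a result.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "policy": a softmax over 5 fixed candidate rollouts for one prompt.
# Only rollout 0 earns the 0/1 outcome reward.
logits = np.zeros(5)
rewards = np.array([1.0, 0.0, 0.0, 0.0, 0.0])

def entropy(p):
    return float(-(p * np.log(p)).sum())

lr = 0.5
for step in range(201):
    p = np.exp(logits) / np.exp(logits).sum()
    i = rng.choice(5, p=p)                         # sample a rollout
    advantage = rewards[i] - (p * rewards).sum()   # 0/1 reward minus its mean
    grad_logp = -p                                 # d log p_i / d logits ...
    grad_logp[i] += 1.0                            # ... = one_hot(i) - p
    logits += lr * advantage * grad_logp           # REINFORCE update
    if step % 50 == 0:
        print(f"step {step:3d}  p(correct)={p[0]:.3f}  entropy={entropy(p):.3f}")

# p(correct) climbs toward 1 and entropy collapses: the policy sharpens around
# a rollout it could already sample, rather than discovering new ones.
```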
Crucially, these sharpened models often underperform on harder benchmarks like AIME and HMMT, which demand algorithmic exploration and multi-strategy reasoning. Training with spurious or self-generated rewards has not shown promising results on these benchmarks; the closest result shows that self-training on a diverse mix of prompts (from DAPO) quickly becomes unstable and substantially underperforms RL with actual, correct 0/1 rewards on these datasets. So, how can RL be done correctly to actually discover novel reasoning behavior?
When does RL go beyond sharpening and actually enable discovery? We argue that the key lies in training the model to chain basic skills it already possesses in a coherent, per-problem manner. This chaining, i.e., dynamically combining base-model abilities such as solving a subproblem, verifying an intermediate step, or revising the solution or the plan of attack, enables the model to execute a fairly general “algorithmic procedure” and obtain high performance across a diverse set of prompts.
Figure 2: Asymmetric capabilities in the base model, such as the verification-generation gap, can be leveraged to chain verification and generation steps to explore and discover new strategies during RL training. When this happens, RL goes beyond sharpening, since long RL traces that chain several asymmetries beyond the length of a typical base model response are less likely under the base model itself.
To enable chaining in practice, RL training must explicitly incentivize it. In our recent work e3, we show that when the base model exhibits “asymmetric” competence across basic skills, such as a verification-generation gap, RL training can incentivize chaining of skills (we refer to this as chaining asymmetries), which naturally leads to longer completions (see Figure 2 above). This is because traces with longer chains (e.g., more generate-verify-revise attempts) are more likely to end in a correct answer, and are hence rewarded and reinforced. Subsequent training is expected to amplify this process even further, resulting in longer chains (and longer responses) and higher pass rates. Such systematic chaining can happen even when the base LLM itself has a substantially lower probability of ever sampling a long, coherent chain, and it explains the increasing response length observed in DeepSeek-R1, Kimi-1.5, and our work e3. If done right, systematic chaining should lead the trained model to solve hard problems via extrapolation, i.e., by simply letting the model run for longer at deployment time, even at budgets it was never trained on (see Figure 3 for an illustration).
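To make the chaining structure concrete, here is a minimal sketch of the kind of generate-verify-revise loop described above. The functions `generate`, `verify`, and `revise` are hypothetical stand-ins for base-model skills (they are not part of the e3 codebase), and the token budget caps how many asymmetries can be chained; extrapolation corresponds to simply raising this budget at deployment time.

```python
def solve_with_chaining(problem, generate, verify, revise, token_budget):
    """Chain basic skills (generate -> verify -> revise) until the budget runs
    out or a verified answer is found. `generate`, `verify`, and `revise` are
    placeholders for abilities already present in the base model; a larger
    `token_budget` at deployment lets the model chain more attempts than it
    ever saw during training (extrapolation)."""
    trace, used = [], 0
    attempt = generate(problem)                       # first solution attempt
    while used < token_budget:
        is_correct, feedback = verify(problem, attempt)  # cheap self-check
        trace.append((attempt, is_correct))
        used += len(attempt)
        if is_correct:
            return attempt, trace                     # verified answer found
        attempt = revise(problem, attempt, feedback)  # revise and try again
    return attempt, trace                             # budget exhausted

# Toy usage with stub skills (purely illustrative):
answer, trace = solve_with_chaining(
    problem="2 + 2",
    generate=lambda p: "5",
    verify=lambda p, a: (a == "4", "arithmetic slip"),
    revise=lambda p, a, fb: "4",
    token_budget=16,
)
print(answer, len(trace))  # -> 4 2
```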