# What is Scaling RL in LLMs in train-time?

Canonical URL: https://www.adaline.ai/blog/what-is-scaling-rl
LLM text URL: https://www.adaline.ai/blog/what-is-scaling-rl/llms.txt
Published: 2025-04-28T00:00:00.000Z
Modified: 2025-04-29T17:36:30.675Z
Author: Nilesh Barla
Category: Research
Visibility: public
Reading time: 15 min
Topics: Research, Adaline, AI agent observability, agent evals, self-improving agents

## Summary

A Technical Exploration of Reinforcement Learning Scaling in LLMs

## Article

Scaling RL in LLMs means expanding the reinforcement learning phase across three dimensions: model capacity, training data, and computing resources. Unlike traditional scaling, it focuses on teaching models to use existing capacity more effectively through algorithms like PPO, GRPO, and RLVR to improve reasoning abilities.

Another key aspect of scaling RL at train time is that it enables the LLM to carry out long chain-of-thought (CoT) reasoning. Long CoT puts the additional train-time compute to use, which lets the model iterate on and refine its reasoning before it yields an answer.

Image: https://a-us.storyblok.com/f/1023026/3076x1466/0476a422a2/longcot.png

_Comparison of various LLMs in different benchmarks with long-CoT_ | **Source**: [Kimi k1.5: Scaling Reinforcement Learning with LLMs](https://arxiv.org/abs/2501.12599)

> Scaling RL in train-time allows LLMs to spend more time producing longer CoT or reasoning steps by efficiently utilizing train-time compute.

# Why Reinforcement Learning Matters for LLMs

Reinforcement Learning teaches models to align with human preferences rather than just predict the next word. It also makes it possible to explore a wide range of steps or reasoning paths before yielding the right output. RL enables the model to observe and reflect on each step it takes, and thus to correct itself before it gives an incorrect answer.

This approach also transforms raw language capabilities into useful, helpful, and safe AI assistants.

1. [Preference Alignment] RL helps models learn what humans actually want, not just what appears in training data.
2. [Behavior Refinement] Models can improve specific skills like reasoning, truthfulness, and helpfulness through targeted rewards.
3. [Safety Enhancement] Harmful or misleading outputs can be penalized, teaching the model to avoid unwanted behaviors.
4. [Task Adaptation] RL enables models to excel at specific tasks like math or coding without needing massive specialized datasets.

## Understanding RLHF Through InstructGPT

Reinforcement Learning from Human Feedback (RLHF) fundamentally changed how we train language models. OpenAI's [InstructGPT](https://documents/1) demonstrates this approach through a three-stage process.

1. Assuming that the language model (LM) has already been trained to predict the next word or token, human labelers provide examples of desired outputs for specific prompts. **The model undergoes supervised fine-tuning on this dataset**. This is essentially where the model is trained on question-answer pairs. **Several models are then trained on similar datasets, and a large number of responses is collected from each model**.
2. Next, the human labelers score each output according to their preference, or “**human preference**”. This creates a dataset of LM outputs and scores. This dataset is then used to train a **reward model** that predicts human preferences.
3. Finally, the LM is optimized using **Proximal Policy Optimization** (PPO) to maximize this reward function.

PPO is an algorithm that teaches AI models to improve through trial and error. It works like a coach who carefully adjusts a player's strategy: the AI tries different approaches, gets feedback on what worked well, and then makes small, controlled changes to get better results next time. The "proximal" part means it doesn't change too much at once, which keeps the learning process stable and prevents wild, unpredictable behavior.

Image: https://a-us.storyblok.com/f/1023026/2788x1736/3ff6e5fb6b/instructgpt.png

_The three phases of InstructGPT where it trains the LM using a Reward model and PPO_ | **Source**: [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)

This process transformed GPT-3 into a more helpful, harmless, and honest assistant. Despite having 100x fewer parameters, the 1.3B InstructGPT model produced outputs preferred over the 175B GPT-3 model.

## The Core Components of RL in LLMs

Reinforcement learning for language models requires several key elements:

1. [Policy Network] The language model itself, generating token sequences
2. [Reward Function] Evaluates output quality based on human preferences
3. [Value Function] Estimates expected future rewards from current state

Unlike traditional pre-training that only minimizes next-token prediction error, RL optimizes for longer-term objectives across entire sequences.
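To make the difference between a one-step loss and a sequence-level objective concrete, here is a minimal sketch of how per-token rewards turn into returns and advantages. The function names, the discount factor `gamma`, and the example numbers are illustrative assumptions, not values from any specific system.

```python
def discounted_returns(rewards, gamma=1.0):
    """Return-to-go at every step: r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..."""
    returns, running = [], 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    return returns[::-1]


def advantages(rewards, values, gamma=1.0):
    """Advantage = observed return minus the value function's estimate."""
    return [g - v for g, v in zip(discounted_returns(rewards, gamma), values)]


# A three-token response that is only rewarded at the end.
example_returns = discounted_returns([0.0, 0.0, 1.0], gamma=0.5)
example_advantages = advantages([0.0, 0.0, 1.0], values=[0.2, 0.4, 0.9], gamma=0.5)
```

Even though only the last token is rewarded, every earlier token receives a non-zero return, which is what lets RL credit early decisions for the quality of the whole sequence.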

## RL vs. Next-Token Prediction

Traditional language models train through next-token prediction, essentially memorizing token distributions from their training data. Reinforcement learning provides several advantages:

```csv
Aspect	Next-Token Prediction	Reinforcement Learning
Objective	Minimize prediction error	Maximize cumulative reward
Scope	Local token probabilities	Global sequence quality
Alignment	Implicit from data	Explicit from human feedback
Optimization	One-step loss	Multi-step returns
```

## Policy-Network View of Language Models

When viewing language models through an RL lens, the decoder functions as a policy network πθ(a|s) where:

- Actions (a): tokens in vocabulary
- States (s): text history (previous tokens)
- Policy: probability distribution over next tokens

This framework allows for sophisticated optimization beyond simple text prediction.
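The mapping above can be sketched in a few lines of plain Python. `toy_logits` is a hypothetical stand-in for a decoder forward pass, and the three-token vocabulary is made up for illustration.

```python
import math
import random


def policy(logits):
    """Softmax over next-token logits: the distribution pi_theta(a | s)."""
    peak = max(logits.values())
    exps = {tok: math.exp(value - peak) for tok, value in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def toy_logits(state):
    """Hypothetical decoder: maps a state (tuple of tokens) to logits."""
    if state and state[-1] == "2+2=":
        return {"4": 3.0, "5": 0.5, "<eos>": -1.0}
    return {"4": 0.0, "5": 0.0, "<eos>": 2.0}


def step(state, rng):
    """Sample one action and return (next_state, action, action_probability)."""
    probs = policy(toy_logits(state))
    tokens, weights = zip(*probs.items())
    action = rng.choices(tokens, weights=weights, k=1)[0]
    return state + (action,), action, probs[action]


next_state, action, prob = step(("2+2=",), random.Random(0))
```

The state is simply the text so far, so taking an action (emitting a token) deterministically produces the next state.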

# What “Scaling RL” Means in 2025

Reinforcement Learning (RL) scaling applies the "bigger is better" principle to the reinforcement learning phase of language model training. After a model completes its initial pre-training and supervised fine-tuning, RL scaling expands three critical dimensions:

1. [Model capacity] Increasing the size of policy networks and reward models. Note that the policy network can also be scaled up after pre-training, for example by ensembling _multiple_ SFT models together.
2. [Training data] Collecting more human feedback and preference examples
3. [Computing resources] Dedicating more processing power to the RL training process.

Unlike traditional scaling that just makes models bigger, RL scaling focuses on teaching models to use their existing capacity more effectively. The goal is to help models:

- Think through problems step-by-step by producing more reasoning steps
- Generate safer and more helpful responses
- Show their reasoning process clearly

When properly implemented, RL scaling produces models that demonstrate significantly better reasoning abilities (2-3× improvement on complex tasks), more transparent thinking processes, and safer outputs that better align with human preferences.

Image: https://a-us.storyblok.com/f/1023026/1043x1200/2e2f3a93bd/r1-arch.jpg

_This diagram shows how DeepSeek makes its “thinking” better in steps. First, it learns from examples (SFT). Then it practices reasoning with rewards (RL). It gathers many “chains of thought” and mixes them with normal answers. Next, it “distills” or shrinks the big model into an even smarter one. Finally, it repeats this loop: learn, practice, combine, and distill. Each time, the model grows its capacity, sees more examples, and uses more compute, so it reasons more deeply and gives better answers._ | **Source**: [An Analysis of DeepSeek's R1-Zero and R1](https://arcprize.org/blog/r1-zero-r1-results-analysis)

A practical rule: When you double the model size, you typically need 2.2 times more feedback data and 1.8 times more training steps to maintain performance gains.

## Axes of Scale

Recent [Kimi K1.5](https://arxiv.org/abs/2501.12599v2) research identifies key scaling dimensions that impact performance:

1. [Model Size] Larger parameter counts provide better capabilities but require more compute
2. [Reward Model Size] Often smaller than the main model (Kimi uses 6B reward models)
3. [Rollout Length] Longer token sequences enable complex reasoning (Kimi scales to 128K)
4. [Batch Size] Larger batches improve training stability (Kimi uses 512 with 64 minibatches)
5. [Preference Labels] More human judgments create better reward models
6. [Gradient Update Budget] Number of training iterations (Kimi uses 256K episodes)

## Why Rollouts Matter in RL Scaling

A rollout is like one complete conversation with the AI. It includes:

- The starting prompt given to the model
- Each token (word piece) the model generates one by one
- Feedback scores on how good the response was
- When and why the response ended

Rollouts are the building blocks of RL training. The model learns by generating many rollouts and receiving feedback on each one.

## What Gets Recorded During Rollouts

```csv
Component	Description	Purpose
States	The growing sequence of tokens	Shows context at each decision point
Actions	Each next token choice	Reveals what the model decided
Rewards	Scores from a reward model	Tells the model what was good/bad
```
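A rollout record along these lines could be represented as a small data structure. The class and field names below are assumptions chosen for illustration; real training frameworks use their own layouts.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Rollout:
    prompt: List[str]                                    # starting prompt tokens
    actions: List[str] = field(default_factory=list)     # generated tokens
    rewards: List[float] = field(default_factory=list)   # one score per action
    stop_reason: Optional[str] = None                    # e.g. "eos" or "max_length"

    def record(self, token, reward=0.0):
        """Append one generated token and the reward it received."""
        self.actions.append(token)
        self.rewards.append(reward)

    def states(self):
        """Context seen before each action: prompt plus tokens generated so far."""
        return [self.prompt + self.actions[:i] for i in range(len(self.actions))]

    def total_reward(self):
        return sum(self.rewards)


rollout = Rollout(prompt=["What", "is", "2+2", "?"])
rollout.record("4")
rollout.record("<eos>", reward=1.0)
rollout.stop_reason = "eos"
```

Note that the states are not stored separately: they can always be rebuilt from the prompt and the actions.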

## How RL Makes Models Think Longer

RL doesn't make the AI think faster. Instead, it teaches the AI to use more computing budget wisely. This often leads to longer, more careful thinking:

**Rewards for Good Thinking Steps**

- The LLM gets rewards for showing its work, not just the final answer
- It learns that explaining things step-by-step earns more points

**Longer CoT Can Be Better**

- Each extra word gives another chance to earn reward
- The LLM learns to write more when that helps solve problems

**Trying Multiple Approaches**

- The LLM might try solving a problem along several different reasoning paths
- It then picks the best solution or shows all its attempts
- This helps it double-check its own work

**Knowing When to Stop**

- The AI learns when more thinking won't help
- It stops when the value of adding more words gets too small
- This helps it be efficient with its thinking time

## Infrastructure Patterns

Image: https://a-us.storyblok.com/f/1023026/1263x553/ebe91d8e97/overview-of-kimi-1-5-workflow.png

_Overview of Kimi 1.5 workflow_ | **Source**: [Kimi k1.5: Scaling Reinforcement Learning with LLMs](https://arxiv.org/abs/2501.12599)

Modern RL training systems distribute work across thousands of GPUs:

1. [Distributed Rollouts] Generate experiences in parallel across many machines
2. [Partial Rollouts] Reuse previous trajectory chunks to improve efficiency
3. [Experience Replay] Store and reuse valuable training examples
4. [Mixed-Precision Training] Use lower precision for efficiency where possible
5. [Synthetic Preference Generation] Smaller LLM models grade outputs to create preference pairs

For example, Kimi's infrastructure uses a hybrid deployment framework combining training and inference phases. Each phase handles different computational tasks:

1. Training phase runs Megatron for policy updates
2. Inference phase executes vLLM for efficient rollouts
3. A "checkpoint engine" manages weight sharing between phases

This approach reduces GPU idle time and enables efficient scaling to massive training volumes.
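The partial-rollout idea can be illustrated with a toy loop: each iteration gives every trajectory a fixed token budget, and anything unfinished is parked in a buffer and resumed later instead of being regenerated. This is a simplified sketch, not Kimi's implementation; `generate_token`, `toy_generate`, and the dictionary layout are hypothetical.

```python
from collections import deque


def run_iteration(prompts, buffer, generate_token, token_budget):
    """Advance every trajectory by at most `token_budget` tokens.

    Returns the trajectories that finished; unfinished ones are put back
    into `buffer` so the next iteration can continue them.
    """
    work = list(buffer) + [{"prompt": p, "tokens": []} for p in prompts]
    buffer.clear()
    finished = []
    for traj in work:
        for _ in range(token_budget):
            token = generate_token(traj["prompt"], traj["tokens"])
            traj["tokens"].append(token)
            if token == "<eos>":
                finished.append(traj)
                break
        else:
            buffer.append(traj)  # budget exhausted: keep the chunk and resume later
    return finished


def toy_generate(prompt, tokens):
    """Toy generator: the response has as many tokens as the prompt has characters."""
    return "<eos>" if len(tokens) + 1 >= len(prompt) else "step"


buffer = deque()
first = run_iteration(["ab", "abcdefgh"], buffer, toy_generate, token_budget=4)
second = run_iteration([], buffer, toy_generate, token_budget=4)
```

In this toy run the short prompt finishes in the first iteration, while the long one spends its budget, waits in the buffer, and completes in the second iteration without repeating its first four tokens.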

# Core Algorithms Powering Scaled RL

There are three core algorithms that we will touch on in this article:

1. PPO, which is the industry standard.
2. GRPO, which is a new method introduced by DeepSeek.
3. RLVR, which is a verification-based reward algorithm.

## PPO: The Industry Standard

[PPO](https://arxiv.org/pdf/1707.06347) remains the backbone of large-scale reinforcement learning for language models. OpenAI's [o-series models](https://www.louisbouchard.ai/rft/) continue to rely on PPO during their RL training stage. The algorithm uses a trust region approach that prevents excessive policy changes during updates.

The core PPO objective can be simplified as:

```math
J_{\text{PPO}}(\theta) = \mathbb{E}\left[ \min\left( \frac{\pi_\theta}{\pi_{\text{old}}} \cdot A, \text{clip}\left(\frac{\pi_\theta}{\pi_{\text{old}}}, 1-\varepsilon, 1+\varepsilon\right) \cdot A \right) \right]
```

Where:

- πθ is the current policy
- πold is the previous policy
- A is the advantage function
- ε is a small constant (usually 0.2)

PPO works well at scale because it:

1. Handles large batch sizes efficiently
2. Provides stable updates even with noisy rewards
3. Integrates easily with KL-divergence penalties to prevent output degradation
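As a rough illustration of the clipped objective, the sketch below evaluates it for a batch of tokens in plain Python. It works on log-probabilities, so the ratio is `exp(logp_new - logp_old)`; the function and argument names are assumptions, and a real implementation would use tensors and automatic differentiation.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantages, epsilon=0.2):
    """Mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) over tokens."""
    total = 0.0
    for new, old, advantage in zip(logp_new, logp_old, advantages):
        ratio = math.exp(new - old)  # pi_theta / pi_old for this token
        clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        total += min(ratio * advantage, clipped * advantage)
    return total / len(advantages)


# Ratio 1.5 with a positive advantage: the gain is capped at 1.2 * A.
capped_gain = ppo_clip_objective([math.log(1.5)], [0.0], [1.0])
# Ratio 0.5 with a negative advantage: the objective stays at 0.8 * A.
capped_loss = ppo_clip_objective([math.log(0.5)], [0.0], [-1.0])
```

The `min` makes the objective pessimistic: once the ratio leaves the `[1 - ε, 1 + ε]` band in the direction that would increase the objective, further movement earns nothing.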

## GRPO: Efficiency Innovation

Image: https://a-us.storyblok.com/f/1023026/1156x502/01378f59bd/ppo-and-grpo.png

_Comparison between PPO and GRPO_ | **Source**: [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://arxiv.org/abs/2402.03300)

Group Relative Policy Optimization (GRPO) represents a significant advancement in RL efficiency. Introduced by [DeepSeekMath](https://arxiv.org/pdf/2402.03300), GRPO eliminates the need for a separate value network by using group statistics as baselines.

Key innovations in GRPO include:

```csv
Feature	Benefit
Group-based advantage	Reduces memory requirements
No critic model	25-40% fewer GPU-hours
Relative reward normalization	Improves training stability
```

This approach has proven especially effective for mathematical reasoning tasks, where the algorithm helped DeepSeek achieve over 50% accuracy on competition-level math problems.
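The group-based baseline can be sketched as follows: sample several completions for one prompt, score them, and normalize each reward against the group's mean and standard deviation. The function name and the small `eps` guard are assumptions, and whether to use the population or sample standard deviation varies between implementations.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each completion's reward against its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation of the group
    return [(reward - mean) / (std + eps) for reward in rewards]


# Four completions for the same prompt: two correct (1.0), two wrong (0.0).
group_advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline comes from the group itself, no critic network has to be trained or held in memory. A group whose completions all receive the same reward produces zero advantages and therefore no learning signal.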

## RLVR: Verification-Based Rewards

[Reinforcement Learning with Verifiable Rewards](https://arxiv.org/pdf/2504.13837) (RLVR) tackles one of RL's fundamental challenges: reward hacking. Instead of using subjective human preferences, RLVR employs external verification mechanisms to validate outputs.

RLVR offers unique advantages for scaling:

1. [Objectivity] Rewards based on verifiable criteria rather than subjective judgments
2. [Automation] Reduces dependence on human feedback collection
3. [Precision] Particularly valuable for domains with clear right/wrong answers
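For a domain with clear right and wrong answers, a verifiable reward can be as simple as a rule-based check. The sketch below assumes the model is asked to put its final answer inside `\boxed{...}`; that convention and the function names are illustrative assumptions, and a production verifier would need more robust answer normalization.

```python
import re


def extract_final_answer(completion):
    """Return the contents of the first \\boxed{...} in the text, or None."""
    match = re.search(r"\\boxed\{([^{}]*)\}", completion)
    return match.group(1).strip() if match else None


def verifiable_reward(completion, reference):
    """1.0 if the extracted answer matches the reference exactly, else 0.0."""
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == reference.strip() else 0.0


reward_correct = verifiable_reward("6 * 7 = 42, so the answer is \\boxed{42}", "42")
reward_wrong = verifiable_reward("I think it is \\boxed{41}", "42")
reward_missing = verifiable_reward("The answer is probably 42", "42")
```

No learned reward model is involved, so there is nothing for the policy to exploit beyond the check itself, although a check this strict will also reject correct answers written in an unexpected format.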

Early [research](https://www.interconnects.ai/p/openais-o3-over-optimization-is-back?utm_source=chatgpt.com) suggests RLVR principles are being incorporated into the latest o3 alignment techniques. This approach helps models maintain truthfulness even as system scale increases.

The most effective implementations now combine elements from multiple algorithms, creating hybrid approaches tailored to specific domains like coding, mathematics, and factual reasoning.

# End-to-End Training Workflow

Reinforcement learning fine-tuning for LLMs follows a structured workflow with several distinct phases. Each phase builds upon the previous one to create increasingly capable models.

## 1. Data Collection

The process begins with gathering high-quality data. For [InstructGPT](https://arxiv.org/abs/2203.02155), this involved:

- Human-written demonstrations of desired behaviors
- Comparison data where humans ranked model outputs
- Prompts collected from the API for diverse use cases

Quality matters more than quantity. A few thousand well-crafted examples often outperform millions of lower-quality samples.

## 2. Reward Model Training

The reward model (RM) learns to predict human preferences from the collected comparisons. Key considerations include:

- Using separate models for reward and policy to prevent overfitting
- Ensuring the RM generalizes to new examples through validation
- Training with careful learning rate scheduling to avoid collapse
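Reward models trained on comparisons are commonly fit with a pairwise loss of the form `-log(sigmoid(r_chosen - r_rejected))`, which is the form used in the InstructGPT paper. The sketch below computes that loss for a single comparison; the function name is an assumption, and the scores stand in for the scalar outputs of a reward model.

```python
import math


def pairwise_preference_loss(score_chosen, score_rejected):
    """-log(sigmoid(chosen - rejected)), computed in a numerically stable way."""
    margin = score_chosen - score_rejected
    # softplus(-margin) equals -log(sigmoid(margin))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


loss_agrees = pairwise_preference_loss(2.0, 0.0)     # model ranks the pair correctly
loss_undecided = pairwise_preference_loss(0.0, 0.0)  # model cannot tell them apart
loss_disagrees = pairwise_preference_loss(0.0, 2.0)  # model ranks the pair backwards
```

The loss depends only on the difference between the two scores, so the reward model learns a relative ranking rather than an absolute scale.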

## 3. Rollout Generation

During this phase, the policy model generates responses that will be evaluated by the reward model. Best practices include:

- Using nucleus sampling with temperature 0.7-1.0 for exploration
- Forcing visible Chain-of-Thought (CoT) reasoning when appropriate
- Generating multiple completions per prompt (64+ for [GRPO](https://arxiv.org/pdf/2402.03300))
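Temperature and nucleus (top-p) sampling can be sketched in plain Python as below. The `top_p=0.9` default is an assumption (the text above only gives a temperature range), and the dictionary-of-logits input is a simplification of what an inference engine actually exposes.

```python
import math
import random


def nucleus_sample(logits, temperature=0.8, top_p=0.9, rng=random):
    """Sample one token from the smallest set of tokens whose mass reaches top_p."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((e / total, tok) for tok, e in exps.items()), reverse=True)

    nucleus, cumulative = [], 0.0
    for prob, tok in ranked:
        nucleus.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break  # remaining low-probability tokens are discarded

    tokens, weights = zip(*nucleus)
    return rng.choices(tokens, weights=weights, k=1)[0]


def sample_group(logits, n, rng):
    """Draw n independent samples, as when generating a group of completions."""
    return [nucleus_sample(logits, rng=rng) for _ in range(n)]


samples = sample_group({"a": 1.0, "b": 0.5, "c": -3.0}, n=8, rng=random.Random(0))
```

Higher temperatures flatten the distribution and a larger `top_p` keeps more of the tail, so both knobs trade exploration against the risk of sampling low-quality tokens.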

## 4. Policy Optimization

This is where the actual RL training happens, with several algorithm options:

```csv
Algorithm	Key Advantage	Resource Usage
PPO	Stability with KL penalty	High (needs value model)
GRPO	Memory efficiency	Medium (no value model)
RLVR	Objective verification	Medium (external verifier)
```

Fine-grained tips that improve results:

- Token-level rewards provide denser learning signals than sequence-level rewards
- KL-annealing gradually increases divergence from reference model
- Curriculum learning progresses from simple to complex examples
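One way to read the first two tips together is a per-token reward that subtracts a KL penalty against the reference model, with a penalty coefficient that shrinks over training so the policy is gradually allowed to move further from the reference. The linear schedule and the values `0.1` and `0.01` below are hypothetical choices for illustration only.

```python
def kl_coefficient(step, total_steps, beta_start=0.1, beta_end=0.01):
    """Linearly anneal the KL penalty weight from beta_start to beta_end."""
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return beta_start + fraction * (beta_end - beta_start)


def kl_shaped_rewards(token_rewards, logp_policy, logp_ref, beta):
    """Per-token reward minus beta * (log pi_policy - log pi_ref)."""
    return [
        reward - beta * (lp - lr)
        for reward, lp, lr in zip(token_rewards, logp_policy, logp_ref)
    ]


beta_early = kl_coefficient(step=0, total_steps=100)
beta_late = kl_coefficient(step=100, total_steps=100)
shaped = kl_shaped_rewards([0.0, 1.0], [-1.0, -1.0], [-2.0, -1.0], beta=beta_early)
```

In the example, the first token is more likely under the policy than under the reference, so it is penalized; the second token has equal likelihood under both and keeps its full reward.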

## 5. Safety and Evaluation Loops

Regular evaluation ensures the model improves without unwanted behaviors:

1. Check performance on benchmark tasks
2. Verify safety guardrails remain effective
3. Test for new failure modes
4. Return to data collection if necessary

## 6. Optional Distillation

For deployment efficiency, the final model can be distilled:

- Teacher model (full RL-trained) guides a smaller student
- Knowledge transfers through supervised learning
- Trading minimal performance for significant speed gains
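When distillation is done through supervised learning, the student is trained to assign high probability to the tokens the teacher produced. A minimal sketch of that loss is shown below; the function name and the dictionary representation of the student's per-step distributions are assumptions.

```python
import math


def distillation_loss(student_probs, teacher_tokens):
    """Mean negative log-likelihood of the teacher's tokens under the student.

    student_probs: one {token: probability} dict per generation step.
    teacher_tokens: the token the RL-trained teacher emitted at each step.
    """
    nll = [
        -math.log(step_probs[token])
        for step_probs, token in zip(student_probs, teacher_tokens)
    ]
    return sum(nll) / len(nll)


student_probs = [{"4": 0.5, "5": 0.5}, {"<eos>": 1.0}]
loss = distillation_loss(student_probs, ["4", "<eos>"])
```

This is the same cross-entropy objective as ordinary supervised fine-tuning; only the source of the target text changes from human demonstrations to teacher outputs.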

This workflow represents the current best practice for creating models that align with human preferences while maintaining high capability.

# Evidence & Case-Studies — Benchmarks, Cost-Efficiency, and Chain-of-Thought Gains

## Performance × Cost Snapshot (April 2025)

```csv
Model (RL stage)	ARC-AGI-1 (%)	ARC-AGI-2 (%)	MMLU (%)	$ / 1M Input	$ / 1M Output	$ / ARC-AGI-2 task	Notes
o3-medium	53	3.0	85.3	$10.00	$40.00	$2.53	PPO, visible-CoT policy
o4-mini-medium	42	2.4	83.2	$1.10	$4.40	$0.23	RLHF → short-GRPO pass, compact MoE
DeepSeek R1 (GRPO)	—	1.3	84.4	$0.55	$2.19	$0.08	Critic-free GRPO; open-source weights
```

Sources: ARC-AGI-1 scores from X (formerly Twitter); ARC-AGI-2 scores and cost per task from ARC Prize; MMLU scores from Artificial Analysis.

Key Take-aways:

- **Cost curves flatten faster than performance curves**; o4-mini reaches about 98 % of o3’s MMLU score (83.2 vs. 85.3) with roughly 9× cheaper tokens.
- **Scaled RL + visible CoT** improves ARC-AGI-1 scores sharply, but raw “AGI-2” generalisation still needs verifier-based rewards.
- **GRPO** shows strong efficiency: DeepSeek hits GPT-4-level MMLU at roughly one-eighteenth of o3’s token price ($0.55 vs. $10 per 1M input tokens), albeit with lower ARC-AGI.

## Chain-of-Thought–Centric RL Recipes

Today's most powerful language models don't just produce answers—they think through problems step by step. This approach, centered on Chain-of-Thought (CoT) reasoning, has become essential in reinforcement learning recipes.

Image: https://a-us.storyblok.com/f/1023026/2662x1304/64c225c1e7/short-cot.png

_Comparison of various LLMs in different benchmarks with short-CoT_ | **Source**: [Kimi k1.5: Scaling Reinforcement Learning with LLMs](https://arxiv.org/abs/2501.12599)

## OpenAI's Deliberative Approach

Image: https://a-us.storyblok.com/f/1023026/1010x904/c0bec6f34f/deliberate-alignment.png

_Illustration of how OpenAI uses deliberative alignment to fine-tune the LLM_ | **Source**: [Deliberative Alignment: Reasoning Enables Safer Language Models](https://arxiv.org/abs/2412.16339)

OpenAI's o-series models, including the recent o3 system, implement what could be called "[deliberative alignment](https://arxiv.org/pdf/2412.16339)." This method follows a specific pattern:

1. The model is required to explicitly write out its reasoning process
2. This internal reasoning trace is evaluated for correctness and safety
3. Rewards are assigned based on both process and outcome quality
4. For deployment, the trace can be hidden or distilled away

The deliberative approach enables what François Chollet describes as "natural language program search and execution within token space." With it, a preview version of o3 reached 75.7% on the semi-private evaluation set of the challenging ARC-AGI-1 benchmark.

## DeepSeek's Efficient Math Training

DeepSeek's R1 model takes a specialized approach to mathematical reasoning using:

- Step-by-step math proofs with individual rewards at each reasoning stage
- [Group Relative Policy Optimization](https://arxiv.org/pdf/2402.03300) (GRPO) that eliminates the need for a separate value network
- Relative normalization of rewards within sample groups

This efficiency-focused method reduces GPU training hours by approximately 40% while achieving state-of-the-art performance on competition-level math problems.

## Future Direction: Verified Reasoning

The next frontier combines Reinforcement Learning with Verifiable Rewards (RLVR) and CoT verification. This approach would:

```python
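def verify_chain(chain_of_thought, verify_logical_consistency, verify_factual_accuracy, penalty=-1.0):
    """Return (reward, index_of_failed_step); the verifier callables are placeholders."""
    for i, reasoning_step in enumerate(chain_of_thought):
        # Both checks must pass before the chain can earn any reward.
        if not (verify_logical_consistency(reasoning_step)
                and verify_factual_accuracy(reasoning_step)):
            return penalty, i  # penalize and stop at the first failed step
    return 1.0, None  # every step was verified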
```

This verification process can catch deceptive or truncated reasoning before it receives any reward. [Process supervision](https://documents/2) evaluates not just final answers but how the model arrives at them.

# Key Benefits of CoT-Centric Training

1. [Transparency] Reasoning becomes visible and auditable
2. [Accuracy] Complex problems benefit from structured thinking
3. [Alignment] Rewards target both process and outcome
4. [Efficiency] Better training signal from intermediate steps

As models continue to scale, this focus on explicit reasoning will likely become even more central to reinforcement learning workflows.

# Conclusion

Reinforcement learning scaling represents a significant shift in LLM training methodology. Rather than simply increasing model size, RL scaling focuses on optimizing how models use their existing capacity through enhanced training processes.

The key components of scaled RL include:

- **Core algorithms**: PPO remains the industry standard, while newer approaches like GRPO offer memory efficiency and RLVR provides objective verification.
- **Multi-dimensional scaling**: Effective RL requires balanced growth across model capacity, training data, and computing resources.
- **Chain-of-thought focus**: Modern RL training emphasizes visible reasoning processes that improve both transparency and accuracy.
- **Infrastructure patterns**: Distributed rollouts, experience replay, and hybrid deployment frameworks enable efficient scaling.

The results of properly implemented RL scaling are substantial: significantly better reasoning abilities, more transparent thinking processes, and outputs that better align with human preferences.

As the field advances, we can expect further innovations in verification-based rewards and process supervision that will enable even more capable, aligned models. The most successful implementations will continue to combine elements from multiple algorithms, creating tailored approaches for specific domains.