
Scaling RL in LLMs means expanding the reinforcement learning phase across three dimensions: model capacity, training data, and computing resources. Unlike traditional scaling, it focuses on teaching models to use existing capacity more effectively through algorithms like PPO, GRPO, and RLVR to improve reasoning abilities.
Another key aspect of scaling RL at train time is that it enables the LLM to carry out long chain-of-thought (CoT) reasoning. Long CoT makes effective use of train-time compute and scales with it. This, in turn, allows the model to iterate on and refine its thinking process before yielding the answer.

Comparison of various LLMs in different benchmarks with long-CoT | Source: Kimi k1.5: Scaling Reinforcement Learning with LLMs
Scaling RL at train time allows LLMs to spend more time producing longer CoT or reasoning steps by efficiently utilizing train-time compute
Why Reinforcement Learning Matters for LLMs
Reinforcement Learning teaches models to align with human preferences rather than just predict the next word. It also makes it possible to explore a wide range of steps or reasoning paths before yielding the right output. RL enables the model to observe and reflect on each step it takes, and thus to correct itself before it provides an incorrect answer.
This approach also transforms raw language capabilities into useful, helpful, and safe AI assistants.
1. Preference Alignment: RL helps models learn what humans actually want, not just what appears in training data.
2. Behavior Refinement: Models can improve specific skills like reasoning, truthfulness, and helpfulness through targeted rewards.
3. Safety Enhancement: Harmful or misleading outputs can be penalized, teaching the model to avoid unwanted behaviors.
4. Task Adaptation: RL enables models to excel at specific tasks like math or coding without needing massive specialized datasets.
Understanding RLHF Through InstructGPT
Reinforcement Learning from Human Feedback (RLHF) fundamentally changed how we train language models. OpenAI's InstructGPT demonstrates this approach through a three-stage process.
1. Assuming the language model (LM) has already been trained to predict the next word or token, human labelers provide examples of desired outputs for specific prompts, and the model undergoes supervised fine-tuning on this dataset. This is essentially training on question-answer pairs. Several model variants are trained on similar data, and a large number of responses is collected from each.
2. Next, human labelers score each output and rank it according to their preferences. This creates a dataset of LM outputs paired with preference scores, which is then used to train a reward model that predicts human preferences.
3. Finally, the LM is optimized using Proximal Policy Optimization (PPO) to maximize this learned reward.
PPO is an algorithm that teaches AI models to improve through trial and error. It works like a coach who carefully adjusts a player's strategy: the AI tries different approaches, gets feedback on what worked well, and then makes small, controlled changes to get better results next time. The "proximal" part means it doesn't change too much at once, which keeps the learning process stable and prevents wild, unpredictable behavior.

The three phases of InstructGPT where it trains the LM using a Reward model and PPO | Source: Training language models to follow instructions with human feedback
This process transformed GPT-3 into a more helpful, harmless, and honest assistant. Despite having 100x fewer parameters, the 1.3B InstructGPT model produced outputs preferred over the 175B GPT-3 model.
The Core Components of RL in LLMs
Reinforcement learning for language models requires several key elements:
1. Policy Network: The language model itself, generating token sequences
2. Reward Function: Evaluates output quality based on human preferences
3. Value Function: Estimates expected future rewards from the current state
Unlike traditional pre-training that only minimizes next-token prediction error, RL optimizes for longer-term objectives across entire sequences.
RL vs. Next-Token Prediction
Traditional language models train through next-token prediction, essentially memorizing token distributions from their training data. Reinforcement learning provides several advantages over this objective, starting with a different way of framing the model itself.
Policy-Network View of Language Models
When viewing language models through an RL lens, the decoder functions as a policy network πθ(a|s) where:
- Actions (a): tokens in vocabulary
- States (s): text history (previous tokens)
- Policy: probability distribution over next tokens
This framework allows for sophisticated optimization beyond simple text prediction.
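To make the policy view concrete, here is a minimal sketch in plain PyTorch, with placeholder logits standing in for a real decoder, showing how the next-token distribution acts as πθ(a|s):

```python
import torch
import torch.nn.functional as F

# Toy illustration: treat the decoder's next-token distribution as a policy pi_theta(a|s).
vocab_size = 8
state = torch.tensor([3, 1, 4])       # previous tokens (the state s); a real decoder maps this to logits
logits = torch.randn(vocab_size)      # placeholder for the decoder's output at this state

policy = F.softmax(logits, dim=-1)    # pi_theta(. | s): probability distribution over actions (tokens)
action = torch.multinomial(policy, 1) # sample the next token (the action a)
log_prob = torch.log(policy[action])  # log pi_theta(a|s), the quantity policy-gradient methods reweight

print(f"next token: {action.item()}, log-prob: {log_prob.item():.3f}")
```

The sampled token's log-probability is exactly what algorithms like PPO later use to compute policy ratios.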
What “Scaling RL” Means in 2025
Reinforcement Learning (RL) scaling applies the "bigger is better" principle to the reinforcement learning phase of language model training. After a model completes its initial pre-training and supervised fine-tuning, RL scaling expands three critical dimensions:
1. Model capacity: Increasing the size of policy networks and reward models. Keep in mind that policy capacity can also be scaled up after pre-training, for example by ensembling multiple SFT models.
2. Training data: Collecting more human feedback and preference examples.
3. Computing resources: Dedicating more processing power to the RL training process.
Unlike traditional scaling that just makes models bigger, RL scaling focuses on teaching models to use their existing capacity more effectively. The goal is to help models:
- Think through problems step-by-step by producing more reasoning steps
- Generate safer and more helpful responses
- Show their reasoning process clearly
When properly implemented, RL scaling produces models that demonstrate significantly better reasoning abilities (2-3× improvement on complex tasks), more transparent thinking processes, and safer outputs that better align with human preferences.

This diagram shows how DeepSeek improves its "thinking" in stages. First, it learns from examples (SFT). Then it practices reasoning with rewards (RL). It gathers many chains of thought and mixes them with normal answers. Next, it distills the big model's reasoning into smaller models. Finally, it repeats this loop: learn, practice, combine, and distill. Each time, the model grows its capacity, sees more examples, and uses more compute, so it reasons more deeply and gives better answers. | Source: An Analysis of DeepSeek's R1-Zero and R1
A practical rule: When you double the model size, you typically need 2.2 times more feedback data and 1.8 times more training steps to maintain performance gains.
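As a quick illustration of this rule of thumb (the 2.2× and 1.8× factors come from the rule above; the starting numbers below are made up):

```python
def scaled_requirements(feedback_examples, training_steps, doublings=1):
    """Rule of thumb from the text: each doubling of model size calls for roughly
    2.2x the feedback data and 1.8x the training steps to maintain gains."""
    return (feedback_examples * 2.2 ** doublings,
            training_steps * 1.8 ** doublings)

# Example: two doublings of model size (e.g., 7B -> 28B policy)
print(scaled_requirements(feedback_examples=100_000, training_steps=50_000, doublings=2))
```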
Axes of Scale
Recent Kimi K1.5 research identifies key scaling dimensions that impact performance:
1. Model Size: Larger parameter counts provide better capabilities but require more compute
2. Reward Model Size: Often smaller than the main model (Kimi uses 6B reward models)
3. Rollout Length: Longer token sequences enable complex reasoning (Kimi scales to 128K)
4. Batch Size: Larger batches improve training stability (Kimi uses 512 with 64 minibatches)
5. Preference Labels: More human judgments create better reward models
6. Gradient Update Budget: Number of training iterations (Kimi uses 256K episodes)
Why Rollouts Matter in RL Scaling
A rollout is like one complete conversation with the AI. It includes:
- The starting prompt given to the model
- Each token (word piece) the model generates one by one
- Feedback scores on how good the response was
- When and why the response ended
Rollouts are the building blocks of RL training. The model learns by generating many rollouts and receiving feedback on each one.
What Gets Recorded During Rollouts
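A rollout record captures, at minimum, the items listed above. A hedged sketch of what such a record might look like (the field names are hypothetical, not taken from any specific framework):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RolloutRecord:
    """Illustrative fields a single rollout might record."""
    prompt: str                                            # the starting prompt given to the model
    tokens: List[int] = field(default_factory=list)        # each generated token id, in order
    log_probs: List[float] = field(default_factory=list)   # log pi_theta(a|s) for each generated token
    reward: float = 0.0                                     # feedback score for the full response
    stop_reason: str = "eos"                                # when and why generation ended (eos, max length, etc.)
```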
How RL Makes Models Think Longer
RL doesn't make the AI think faster. Instead, it teaches the AI to use more computing budget wisely. This often leads to longer, more careful thinking:
Rewards for Good Thinking Steps
- The LLM gets rewards for showing its work, not just the final answer
- It learns that explaining things step-by-step earns more points
Longer CoT Can Be Better
- Each extra word gives another chance to earn reward
- The LLM learns to write more when that helps solve problems
Trying Multiple Approaches
- The LLM might try solving a problem along several different reasoning paths
- It then picks the best solution or shows all its attempts
- This helps it double-check its own work
Knowing When to Stop
- The AI learns when more thinking won't help
- It stops when the value of adding more words gets too small
- This helps it be efficient with its thinking time
Infrastructure Patterns

Overview of Kimi 1.5 workflow | Source: Kimi k1.5: Scaling Reinforcement Learning with LLMs
Modern RL training systems distribute work across thousands of GPUs:
1. Distributed Rollouts: Generate experiences in parallel across many machines
2. Partial Rollouts: Reuse previous trajectory chunks to improve efficiency
3. Experience Replay: Store and reuse valuable training examples
4. Mixed-Precision Training: Use lower precision for efficiency where possible
5. Synthetic Preference Generation: Smaller LLMs grade outputs to create preference pairs
For example, Kimi's infrastructure uses a hybrid deployment framework combining training and inference phases. Each phase handles different computational tasks:
1. The training phase runs Megatron for policy updates
2. The inference phase executes vLLM for efficient rollouts
3. A "checkpoint engine" manages weight sharing between phases
This approach reduces GPU idle time and enables efficient scaling to massive training volumes.
Core Algorithms Powering Scaled RL
There are three core algorithms that we will touch on in this article:
1. PPO, the industry standard.
2. GRPO, a newer method introduced by DeepSeek.
3. RLVR, a verification-based reward approach.
PPO: The Industry Standard
PPO remains the backbone of large-scale reinforcement learning for language models. OpenAI's o-series models continue to rely on PPO during their RL training stage. The algorithm uses a trust region approach that prevents excessive policy changes during updates.
The core PPO objective can be simplified as:

L(θ) = E[ min( r(θ) · A, clip(r(θ), 1 − ε, 1 + ε) · A ) ]

Where:
- r(θ) = πθ(a|s) / πold(a|s) is the probability ratio between the current and previous policies
- πθ is the current policy
- πold is the previous policy
- A is the advantage function
- ε is a small constant (usually 0.2)
PPO works well at scale because it:
1. Handles large batch sizes efficiently
2. Provides stable updates even with noisy rewards
3. Integrates easily with KL-divergence penalties to prevent output degradation
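For illustration, here is a minimal PyTorch sketch of the clipped objective shown above, applied to a batch of token log-probabilities (tensor names and values are placeholders):

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, epsilon=0.2):
    """Clipped PPO policy loss over a batch of token-level log-probabilities (illustrative sketch)."""
    ratio = torch.exp(log_probs_new - log_probs_old)            # r(theta) = pi_theta / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()                # negate: the objective is maximized

# Example with dummy values
new_lp = torch.tensor([-1.0, -0.5, -2.0])
old_lp = torch.tensor([-1.2, -0.6, -1.8])
adv = torch.tensor([0.5, -0.3, 1.0])
print(ppo_clip_loss(new_lp, old_lp, adv))
```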
GRPO: Efficiency Innovation

Comparison between PPO and GRPO | Source: DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Group Relative Policy Optimization (GRPO) represents a significant advancement in RL efficiency. Introduced by DeepSeekMath, GRPO eliminates the need for a separate value network by using group statistics as baselines.
Key innovations in GRPO include:
- Eliminating the separate value network, which cuts memory and compute requirements
- Sampling a group of completions per prompt and scoring each one
- Using relative normalization of rewards within each group as the baseline for advantage estimates
This approach has proven especially effective for mathematical reasoning tasks, where the algorithm helped DeepSeek achieve over 50% accuracy on competition-level math problems.
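A minimal sketch of the group-relative baseline, assuming a batch of scalar rewards for completions sampled from the same prompt:

```python
import torch

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style baseline: normalize each sample's reward against its group's statistics.
    A sketch of the idea; no separate value network is needed."""
    mean = rewards.mean()
    std = rewards.std()
    return (rewards - mean) / (std + eps)

# Example: rewards for a group of completions sampled from one prompt (1 = correct, 0 = incorrect)
rewards = torch.tensor([0.0, 1.0, 1.0, 0.0, 1.0])
print(group_relative_advantages(rewards))
```

Because the baseline comes from the group itself, no value network has to be trained or stored, which is where the memory savings come from.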
RLVR: Verification-Based Rewards
Reinforcement Learning with Verifiable Rewards (RLVR) tackles one of RL's fundamental challenges: reward hacking. Instead of using subjective human preferences, RLVR employs external verification mechanisms to validate outputs.
RLVR offers unique advantages for scaling:
1. Objectivity: Rewards based on verifiable criteria rather than subjective judgments
2. Automation: Reduces dependence on human feedback collection
3. Precision: Particularly valuable for domains with clear right/wrong answers
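As a toy illustration of a verifiable reward, the sketch below grades an answer by exact match against a reference; real verifiers are typically richer (unit tests, symbolic math checkers, proof checkers), and this function is purely illustrative:

```python
def verifiable_reward(model_answer: str, reference_answer: str) -> float:
    """Minimal verifier-style reward: exact-match check against a known-correct answer."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

# Example: grading a math completion by its final answer
print(verifiable_reward("42", "42"))   # 1.0
print(verifiable_reward("41", "42"))   # 0.0
```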
Early research suggests RLVR principles are being incorporated into the latest o3 alignment techniques. This approach helps models maintain truthfulness even as system scale increases.
The most effective implementations now combine elements from multiple algorithms, creating hybrid approaches tailored to specific domains like coding, mathematics, and factual reasoning.
End-to-End Training Workflow
Reinforcement learning fine-tuning for LLMs follows a structured workflow with several distinct phases. Each phase builds upon the previous one to create increasingly capable models.
1. Data Collection
The process begins with gathering high-quality data. For InstructGPT, this involved:
- Human-written demonstrations of desired behaviors
- Comparison data where humans ranked model outputs
- Prompts collected from the API for diverse use cases
Quality matters more than quantity. A few thousand well-crafted examples often outperform millions of lower-quality samples.
2. Reward Model Training
The reward model (RM) learns to predict human preferences from the collected comparisons. Key considerations include:
- Using separate models for reward and policy to prevent overfitting
- Ensuring the RM generalizes to new examples through validation
- Training with careful learning rate scheduling to avoid collapse
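A common way to train the RM on ranked comparisons is a pairwise (Bradley-Terry style) loss. The sketch below assumes the RM outputs a scalar score per response:

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(chosen_scores, rejected_scores):
    """Pairwise ranking loss often used for reward models:
    push the score of the preferred output above the rejected one."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Example: scalar scores the reward model assigned to preferred vs. rejected responses
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.7, 0.5, 1.1])
print(pairwise_preference_loss(chosen, rejected))
```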
3. Rollout Generation
During this phase, the policy model generates responses that will be evaluated by the reward model. Best practices include:
- Using nucleus sampling with temperature 0.7-1.0 for exploration
- Forcing visible Chain-of-Thought (CoT) reasoning when appropriate
- Generating multiple completions per prompt (64+ for GRPO)
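A hedged sketch of rollout generation with nucleus sampling using Hugging Face transformers ("gpt2" is only a stand-in for the actual policy model):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder policy model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Solve step by step: what is 17 * 24?"
inputs = tokenizer(prompt, return_tensors="pt")

# Temperature in the 0.7-1.0 range and multiple completions per prompt for exploration
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.95,
    temperature=0.8,
    max_new_tokens=128,
    num_return_sequences=8,
    pad_token_id=tokenizer.eos_token_id,
)
completions = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```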
4. Policy Optimization
This is where the actual RL training happens, using one of the algorithm options covered earlier (PPO, GRPO, or RLVR).
Fine-grained tips that improve results:
- Token-level rewards provide denser learning signals than sequence-level rewards
- KL-annealing gradually adjusts the KL penalty, allowing controlled divergence from the reference model as training progresses
- Curriculum learning progresses from simple to complex examples
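To make the first two tips concrete, here is a sketch of token-level reward shaping with a KL penalty against the reference model; the β coefficient is what a KL-annealing schedule would adjust over training (all values below are placeholders):

```python
import torch

def shaped_token_rewards(task_rewards, log_probs_policy, log_probs_ref, beta=0.05):
    """Token-level reward shaping: subtract a per-token KL penalty against the reference model.
    The beta coefficient can be annealed to allow more divergence later in training."""
    kl_per_token = log_probs_policy - log_probs_ref   # sampled per-token log-ratio (KL estimate)
    return task_rewards - beta * kl_per_token

# Example: a sparse sequence reward on the final token, plus per-token KL shaping
task_rewards = torch.tensor([0.0, 0.0, 0.0, 1.0])
lp_policy = torch.tensor([-1.0, -0.8, -1.5, -0.4])
lp_ref = torch.tensor([-1.1, -0.9, -1.2, -0.6])
print(shaped_token_rewards(task_rewards, lp_policy, lp_ref))
```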
5. Safety and Evaluation Loops
Regular evaluation ensures the model improves without unwanted behaviors:
1. Check performance on benchmark tasks
2. Verify safety guardrails remain effective
3. Test for new failure modes
4. Return to data collection if necessary
6. Optional Distillation
For deployment efficiency, the final model can be distilled:
- Teacher model (full RL-trained) guides a smaller student
- Knowledge transfers through supervised learning
- Trading minimal performance for significant speed gains
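A minimal sketch of a soft-label distillation loss, assuming the RL-trained teacher and the smaller student produce logits over the same vocabulary:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label distillation: the student matches the RL-trained teacher's token distribution."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # KL(teacher || student), scaled by t^2 as is conventional for distillation
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

# Example with dummy logits over a small vocabulary
student = torch.randn(4, 10)
teacher = torch.randn(4, 10)
print(distillation_loss(student, teacher))
```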
This workflow represents the current best practice for creating models that align with human preferences while maintaining high capability.
Evidence & Case-Studies — Benchmarks, Cost-Efficiency, and Chain-of-Thought Gains
Performance × Cost Snapshot (April 2025)
Key Take-aways:
- Cost curves flatten faster than performance curves; o4-mini delivers 80 % of o3’s MMLU with ~10× cheaper tokens.
- Scaled RL + visible CoT improves ARC-AGI-1 scores sharply—but raw “AGI-2” generalisation still needs verifier-based rewards.
- GRPO shows strong efficiency: DeepSeek hits GPT-4-level MMLU for one-tenth the price of o3, albeit with lower ARC-AGI.
Chain-of-Thought–Centric RL Recipes
Today's most powerful language models don't just produce answers—they think through problems step by step. This approach, centered on Chain-of-Thought (CoT) reasoning, has become essential in reinforcement learning recipes.

Comparison of various LLMs in different benchmarks with short-CoT | Source: Kimi k1.5: Scaling Reinforcement Learning with LLMs
OpenAI's Deliberative Approach

Illustration of how OpenAI uses deliberative alignment to fine-tune the LLM | Source: Deliberative Alignment: Reasoning Enables Safer Language Models
OpenAI's o-series models, including the recent o3 system, implement what could be called "deliberative alignment." This method follows a specific pattern:
1. The model is required to explicitly write out its reasoning process
2. This internal reasoning trace is evaluated for correctness and safety
3. Rewards are assigned based on both process and outcome quality (a toy sketch of such a blended reward follows this list)
4. For deployment, the trace can be hidden or distilled away
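The exact reward design is not public; the following is only a minimal sketch of blending per-step (process) scores with a final-answer (outcome) score, with the weighting chosen arbitrarily rather than reflecting OpenAI's recipe:

```python
def combined_reward(step_scores, outcome_score, process_weight=0.5):
    """Blend per-step (process) scores with the final-answer (outcome) score.
    The weighting and scoring functions are illustrative assumptions only."""
    process_score = sum(step_scores) / max(len(step_scores), 1)
    return process_weight * process_score + (1 - process_weight) * outcome_score

# Example: three reasoning steps judged individually, plus a correct final answer
print(combined_reward(step_scores=[1.0, 0.5, 1.0], outcome_score=1.0))
```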
The deliberative approach enables what François Chollet describes as "natural language program search and execution within token space." This allows o3 to achieve an impressive 75.7% score on the challenging ARC-AGI benchmark.
DeepSeek's Efficient Math Training
DeepSeek's R1 model takes a specialized approach to mathematical reasoning using:
- Step-by-step math proofs with individual rewards at each reasoning stage
- Group Relative Policy Optimization (GRPO) that eliminates the need for a separate value network
- Relative normalization of rewards within sample groups
This efficiency-focused method reduces GPU training hours by approximately 40% while achieving state-of-the-art performance on competition-level math problems.
Future Direction: Verified Reasoning
The next frontier combines Reinforcement Learning with Verifiable Rewards (RLVR) and CoT verification. Under this approach, external verifiers would check the reasoning trace itself, not just the final answer, before any reward is assigned. This verification can catch deceptive or truncated reasoning, because process supervision evaluates not just final answers but how the model arrives at them.
Key Benefits of CoT-Centric Training
1. Transparency: Reasoning becomes visible and auditable
2. Accuracy: Complex problems benefit from structured thinking
3. Alignment: Rewards target both process and outcome
4. Efficiency: Better training signal from intermediate steps
As models continue to scale, this focus on explicit reasoning will likely become even more central to reinforcement learning workflows.
Conclusion
Reinforcement learning scaling represents a significant shift in LLM training methodology. Rather than simply increasing model size, RL scaling focuses on optimizing how models use their existing capacity through enhanced training processes.
The key components of scaled RL include:
- Core algorithms: PPO remains the industry standard, while newer approaches like GRPO offer memory efficiency and RLVR provides objective verification.
- Multi-dimensional scaling: Effective RL requires balanced growth across model capacity, training data, and computing resources.
- Chain-of-thought focus: Modern RL training emphasizes visible reasoning processes that improve both transparency and accuracy.
- Infrastructure patterns: Distributed rollouts, experience replay, and hybrid deployment frameworks enable efficient scaling.
The results of properly implemented RL scaling are substantial: significantly better reasoning abilities, more transparent thinking processes, and outputs that better align with human preferences.
As the field advances, we can expect further innovations in verification-based rewards and process supervision that will enable even more capable, aligned models. The most successful implementations will continue to combine elements from multiple algorithms, creating tailored approaches for specific domains.