# What is Scaling RL in LLMs in train-time?

Canonical URL: https://www.adaline.ai/blog/what-is-scaling-rl
LLM text URL: https://www.adaline.ai/blog/what-is-scaling-rl/llms.txt
Published: 2025-04-28T00:00:00.000Z
Modified: 2025-04-29T17:36:30.675Z
Author: Nilesh Barla
Category: Research
Visibility: public
Reading time: 15 min
Topics: Research, Adaline, AI agent observability, agent evals, self-improving agents

## Summary

A Technical Exploration of Reinforcement Learning Scaling in LLMs

## Article

Scaling RL in LLMs means expanding the reinforcement learning phase across three dimensions: model capacity, training data, and computing resources. Unlike traditional scaling, it focuses on teaching models to use existing capacity more effectively through algorithms like PPO, GRPO, and RLVR to improve reasoning abilities.

Another key aspect of scaling RL in train-time is that it allows the LLM to carry out long chain-of-thought (CoT) reasoning. Long CoT makes use of the train-time compute and scales with it effectively. This, in turn, lets the model iterate on and refine its thinking before yielding the answer.

Image: https://a-us.storyblok.com/f/1023026/3076x1466/0476a422a2/longcot.png
_Comparison of various LLMs in different benchmarks with long-CoT_ | **Source**: [Kimi k1.5: Scaling Reinforcement Learning with LLMs](https://arxiv.org/abs/2501.12599)

> Scaling RL in train-time allows LLMs to spend more time producing longer CoT or reasoning steps by efficiently utilizing train-time compute.

# Why Reinforcement Learning Matters for LLMs

Reinforcement Learning teaches models to align with human preferences rather than just predict the next word. It also makes it possible to explore a wide range of reasoning paths before yielding the right output. RL enables the model to observe and reflect on each step it takes, and thus correct itself before giving an incorrect answer.
This approach also transforms raw language capabilities into useful, helpful, and safe AI assistants.

1. [Preference Alignment] RL helps models learn what humans actually want, not just what appears in training data.
2. [Behavior Refinement] Models can improve specific skills like reasoning, truthfulness, and helpfulness through targeted rewards.
3. [Safety Enhancement] Harmful or misleading outputs can be penalized, teaching the model to avoid unwanted behaviors.
4. [Task Adaptation] RL enables models to excel at specific tasks like math or coding without needing massive specialized datasets.

## Understanding RLHF Through InstructGPT

Reinforcement Learning from Human Feedback (RLHF) fundamentally changed how we train language models. OpenAI's [InstructGPT](https://documents/1) demonstrates this approach through a three-stage process.

1. Starting from a language model (LM) that has been trained to predict the next word or token, human labelers provide examples of desired outputs for specific prompts. **The model undergoes supervised fine-tuning on this dataset**. This is essentially where the model is trained on question-answer pairs. Similarly, **multiple models can be trained on similar data, and a large number of responses is collected from each model**.
2. Next, human labelers score each output according to their preference, or “**human preference**”. This creates a dataset of LM outputs and scores. This dataset is then used to train a **reward model** that predicts human preferences.
3. Finally, the LM is optimized using **Proximal Policy Optimization** (PPO) to maximize this reward function. PPO is an algorithm that teaches AI models to improve through trial and error. It works like a coach who carefully adjusts a player's strategy: the AI tries different approaches, gets feedback on what worked well, and then makes small, controlled changes to get better results next time. The "proximal" part means it doesn't change too much at once, which keeps the learning process stable and prevents wild, unpredictable behavior.

Image: https://a-us.storyblok.com/f/1023026/2788x1736/3ff6e5fb6b/instructgpt.png
_The three phases of InstructGPT, in which the LM is trained using a reward model and PPO_ | **Source**: [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)

This process transformed GPT-3 into a more helpful, harmless, and honest assistant. Despite having 100x fewer parameters, the 1.3B InstructGPT model produced outputs preferred over the 175B GPT-3 model.

## The Core Components of RL in LLMs

Reinforcement learning for language models requires several key elements:

1. [Policy Network] The language model itself, generating token sequences
2. [Reward Function] Evaluates output quality based on human preferences
3. [Value Function] Estimates expected future rewards from the current state

Unlike traditional pre-training that only minimizes next-token prediction error, RL optimizes for longer-term objectives across entire sequences.

## RL vs. Next-Token Prediction

Traditional language models train through next-token prediction, essentially learning token distributions from their training data. Reinforcement learning provides several advantages:

```csv
Aspect,Next-Token Prediction,Reinforcement Learning
Objective,Minimize prediction error,Maximize cumulative reward
Scope,Local token probabilities,Global sequence quality
Alignment,Implicit from data,Explicit from human feedback
Optimization,One-step loss,Multi-step returns
```

## Policy-Network View of Language Models

When viewing language models through an RL lens, the decoder functions as a policy network πθ(a|s) where:

- Actions (a): tokens in the vocabulary
- States (s): text history (previous tokens)
- Policy: probability distribution over next tokens

This framework allows for sophisticated optimization beyond simple text prediction.

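
The policy-network view above can be made concrete with a toy example. The sketch below is only an illustration under stated assumptions: `VOCAB`, `toy_logits`, `policy`, and `rollout` are hypothetical names, and the hand-written logit rule stands in for a real decoder. It shows the state as the token history, the action as the sampled next token, and the policy as a softmax distribution over the vocabulary.

```python
import math
import random

# Hypothetical toy vocabulary; a real model has tens of thousands of tokens.
VOCAB = ["<eos>", "the", "cat", "sat"]


def toy_logits(state):
    """Stand-in for a decoder: map a token history (state) to one logit per token.

    Assumption for illustration only: <eos> becomes more likely as the
    sequence grows, all other tokens stay equally likely.
    """
    return [0.5 * len(state), 1.0, 1.0, 1.0]


def policy(state):
    """pi_theta(a | s): softmax over the next-token logits."""
    logits = toy_logits(state)
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def rollout(prompt, max_new_tokens=8, seed=0):
    """Sample actions (tokens) one at a time and track the sequence log-probability."""
    rng = random.Random(seed)
    state = list(prompt)
    log_prob = 0.0
    for _ in range(max_new_tokens):
        probs = policy(state)
        action = rng.choices(range(len(VOCAB)), weights=probs)[0]
        log_prob += math.log(probs[action])
        state.append(VOCAB[action])  # the new state is the old state plus the action
        if VOCAB[action] == "<eos>":
            break
    return state, log_prob


if __name__ == "__main__":
    tokens, log_prob = rollout(["the", "cat"])
    print(tokens, round(log_prob, 3))
```

The accumulated log-probability is the quantity that policy-gradient methods such as PPO reweight by the reward signal.
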
# What “Scaling RL” Means in 2025

Reinforcement Learning (RL) scaling applies the "bigger is better" principle to the reinforcement learning phase of language model training. After a model completes its initial pre-training and supervised fine-tuning, RL scaling expands three critical dimensions:

1. [Model capacity] Increasing the size of policy networks and reward models. Keep in mind that the policy network can still be scaled up after pre-training, for example by ensembling _multiple_ SFT models.
2. [Training data] Collecting more human feedback and preference examples.
3. [Computing resources] Dedicating more processing power to the RL training process.

Unlike traditional scaling that just makes models bigger, RL scaling focuses on teaching models to use their existing capacity more effectively. The goal is to help models:

- Think through problems step-by-step by producing more reasoning steps
- Generate safer and more helpful responses
- Show their reasoning process clearly

When properly implemented, RL scaling produces models that demonstrate significantly better reasoning abilities (2-3× improvement on complex tasks), more transparent thinking processes, and safer outputs that better align with human preferences.

Image: https://a-us.storyblok.com/f/1023026/1043x1200/2e2f3a93bd/r1-arch.jpg
_This diagram shows how DeepSeek makes its “thinking” better in steps. First, it learns from examples (SFT). Then it practices reasoning with rewards (RL). It gathers many “chains of thought” and mixes them with normal answers. Next, it “distills” or shrinks the big model into a smaller one. Finally, it repeats this loop: learn, practice, combine, and distill. Each time, the model grows its capacity, sees more examples, and uses more compute, so it reasons more deeply and gives better answers._ | **Source**: [An Analysis of DeepSeek's R1-Zero and R1](https://arcprize.org/blog/r1-zero-r1-results-analysis)

As a rough rule of thumb: when you double the model size, you typically need 2.2 times more feedback data and 1.8 times more training steps to maintain performance gains.

## Axes of Scale

Recent [Kimi K1.5](https://arxiv.org/abs/2501.12599v2) research identifies key scaling dimensions that impact performance:

1. [Model Size] Larger parameter counts provide better capabilities but require more compute
2. [Reward Model Size] Often smaller than the main model (Kimi uses 6B reward models)
3. [Rollout Length] Longer token sequences enable complex reasoning (Kimi scales to 128K)
4. [Batch Size] Larger batches improve training stability (Kimi uses 512 with 64 minibatches)
5. [Preference Labels] More human judgments create better reward models
6. [Gradient Update Budget] Number of training iterations (Kimi uses 256K episodes)

## Why Rollouts Matter in RL Scaling

A rollout is like one complete conversation with the AI. It includes:

- The starting prompt given to the model
- Each token (word piece) the model generates, one by one
- Feedback scores on how good the response was
- When and why the response ended

Rollouts are the building blocks of RL training. The model learns by generating many rollouts and receiving feedback on each one.

## What Gets Recorded During Rollouts

```csv
Component,Description,Purpose
States,The growing sequence of tokens,Shows context at each decision point
Actions,Each next token choice,Reveals what the model decided
Rewards,Scores from a reward model,Tells the model what was good/bad
```

## How RL Makes Models Think Longer

RL doesn't make the AI think faster. Instead, it teaches the AI to use a larger computing budget wisely.
This often leads to longer, more careful thinking:

**Rewards for Good Thinking Steps**

- The LLM gets rewards for showing its work, not just the final answer
- It learns that explaining things step-by-step earns more points

**Longer CoT Can Be Better**

- Each extra word gives another chance to earn reward
- The LLM learns to write more when that helps solve problems

**Trying Multiple Approaches**

- The LLM might try solving a problem along several different reasoning paths
- It then picks the best solution or shows all its attempts
- This helps it double-check its own work

**Knowing When to Stop**

- The AI learns when more thinking won't help
- It stops when the value of adding more words gets too small
- This helps it be efficient with its thinking time

## Infrastructure Patterns

Image: https://a-us.storyblok.com/f/1023026/1263x553/ebe91d8e97/overview-of-kimi-1-5-workflow.png
_Overview of Kimi 1.5 workflow_ | **Source**: [Kimi k1.5: Scaling Reinforcement Learning with LLMs](https://arxiv.org/abs/2501.12599)

Modern RL training systems distribute work across thousands of GPUs:

1. [Distributed Rollouts] Generate experiences in parallel across many machines
2. [Partial Rollouts] Reuse previous trajectory chunks to improve efficiency
3. [Experience Replay] Store and reuse valuable training examples
4. [Mixed-Precision Training] Use lower precision for efficiency where possible
5. [Synthetic Preference Generation] Smaller LLMs grade outputs to create preference pairs

For example, Kimi's infrastructure uses a hybrid deployment framework combining training and inference phases. Each phase handles different computational tasks:

1. The training phase runs Megatron for policy updates
2. The inference phase executes vLLM for efficient rollouts
3. A "checkpoint engine" manages weight sharing between phases

This approach reduces GPU idle time and enables efficient scaling to massive training volumes.

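
To make the rollout and experience-replay ideas above more tangible, here is a minimal sketch of how a rollout record and a small replay buffer might look. It is an illustration only, not Kimi's implementation: the `Rollout` and `ReplayBuffer` names, the fields, and the "keep the highest-reward rollouts" policy are all assumptions chosen for brevity.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Rollout:
    """One complete generation, with the components described above."""
    prompt: str        # starting prompt given to the model
    tokens: list       # actions: the generated tokens, in order
    rewards: list      # feedback scores (per token, or a single final score)
    stop_reason: str   # e.g. "eos" or "max_length"

    @property
    def total_reward(self):
        return sum(self.rewards)


class ReplayBuffer:
    """Keeps the `capacity` highest-reward rollouts seen so far (assumed policy)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []     # min-heap of (total_reward, insertion_id, rollout)
        self._next_id = 0   # unique tie-breaker so rollouts are never compared

    def add(self, rollout):
        item = (rollout.total_reward, self._next_id, rollout)
        self._next_id += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        else:
            # Push the new item, then drop whichever item is now the lowest.
            heapq.heappushpop(self._heap, item)

    def best(self):
        """Return stored rollouts, highest total reward first."""
        return [rollout for _, _, rollout in sorted(self._heap, reverse=True)]


if __name__ == "__main__":
    buffer = ReplayBuffer(capacity=2)
    for score in (1.0, 3.0, 2.0):
        buffer.add(Rollout("2+2=?", ["4"], [score], "eos"))
    print([r.total_reward for r in buffer.best()])
```

A production system would shard such a buffer across machines and would likely sample from it rather than always keeping the top scorers.
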
# Core Algorithms Powering Scaled RL

There are three core algorithms that we will touch on in this article:

1. PPO, which is the industry standard.
2. GRPO, which is a newer method introduced by DeepSeek.
3. RLVR, which is a verification-based reward approach.

## PPO: The Industry Standard

[PPO](https://arxiv.org/pdf/1707.06347) remains the backbone of large-scale reinforcement learning for language models. OpenAI's [o-series models](https://www.louisbouchard.ai/rft/) reportedly continue to rely on PPO during their RL training stage.

The algorithm uses a trust-region-style clipped objective that prevents excessive policy changes during updates. The core PPO objective can be simplified as:

```math
J_{\text{PPO}}(\theta) = \mathbb{E}\left[ \min\left( \frac{\pi_\theta}{\pi_{\text{old}}} \cdot A, \text{clip}\left(\frac{\pi_\theta}{\pi_{\text{old}}}, 1-\varepsilon, 1+\varepsilon\right) \cdot A \right) \right]
```

Where:

- πθ is the current policy
- πold is the previous policy
- A is the advantage function
- ε is a small constant (usually 0.2)

PPO works well at scale because it:

1. Handles large batch sizes efficiently
2. Provides stable updates even with noisy rewards
3. Integrates easily with KL-divergence penalties to prevent output degradation

## GRPO: Efficiency Innovation

Image: https://a-us.storyblok.com/f/1023026/1156x502/01378f59bd/ppo-and-grpo.png
_Comparison between PPO and GRPO_ | **Source**: [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://arxiv.org/abs/2402.03300)

Group Relative Policy Optimization (GRPO) represents a significant advancement in RL efficiency. Introduced by [DeepSeekMath](https://arxiv.org/pdf/2402.03300), GRPO eliminates the need for a separate value network by using group statistics as baselines.
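
A minimal sketch of how the two ideas fit together: the PPO clipped term from the formula above, with the advantage A replaced by a group-relative baseline as in GRPO. This is a simplified illustration under stated assumptions, not DeepSeek's implementation: the function names are hypothetical, log-probabilities are taken per whole response rather than per token, the population standard deviation is used for normalization, and the KL penalty term is omitted.

```python
import math
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group: (r - mean) / std.

    The group mean acts as the baseline, so no value network is needed.
    `eps` guards against division by zero when all rewards are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped term: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(logp_new - logp_old)  # pi_theta / pi_old
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped * advantage)


def grpo_objective(logps_new, logps_old, rewards, clip_eps=0.2):
    """Average the clipped term over a group of responses to one prompt."""
    advantages = group_relative_advantages(rewards)
    terms = [
        clipped_surrogate(new, old, adv, clip_eps)
        for new, old, adv in zip(logps_new, logps_old, advantages)
    ]
    return sum(terms) / len(terms)


if __name__ == "__main__":
    # Four sampled answers to one prompt; two were judged correct (reward 1).
    rewards = [1.0, 0.0, 0.0, 1.0]
    print(group_relative_advantages(rewards))
    print(grpo_objective([-1.0, -2.0, -2.0, -0.9], [-1.1, -1.9, -2.0, -1.0], rewards))
```

Because the baseline comes from the other samples in the group, responses that beat their siblings get a positive advantage and the rest get a negative one, which is what removes the critic from the training loop.
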

Key innovations in GRPO include:

```csv
Feature,Benefit
Group-based advantage,Reduces memory requirements
No critic model,25-40% fewer GPU-hours
Relative reward normalization,Improves training stability
```

This approach has proven especially effective for mathematical reasoning tasks, where the algorithm helped DeepSeek achieve over 50% accuracy on competition-level math problems.

## RLVR: Verification-Based Rewards

[Reinforcement Learning with Verifiable Rewards](https://arxiv.org/pdf/2504.13837) (RLVR) tackles one of RL's fundamental challenges: reward hacking. Instead of using subjective human preferences, RLVR employs external verification mechanisms to validate outputs.

RLVR offers unique advantages for scaling:

1. [Objectivity] Rewards based on verifiable criteria rather than subjective judgments
2. [Automation] Reduces dependence on human feedback collection
3. [Precision] Particularly valuable for domains with clear right/wrong answers

Early [research](https://www.interconnects.ai/p/openais-o3-over-optimization-is-back?utm_source=chatgpt.com) suggests RLVR principles are being incorporated into the latest o3 alignment techniques. This approach helps models maintain truthfulness even as system scale increases.

The most effective implementations now combine elements from multiple algorithms, creating hybrid approaches tailored to specific domains like coding, mathematics, and factual reasoning.

# End-to-End Training Workflow

Reinforcement learning fine-tuning for LLMs follows a structured workflow with several distinct phases. Each phase builds upon the previous one to create increasingly capable models.

## 1. Data Collection

The process begins with gathering high-quality data. For [InstructGPT](https://arxiv.org/abs/2203.02155), this involved:

- Human-written demonstrations of desired behaviors
- Comparison data where humans ranked model outputs
- Prompts collected from the API for diverse use cases

Quality matters more than quantity.
A few thousand well-crafted examples often outperform millions of lower-quality samples.

## 2. Reward Model Training

The reward model (RM) learns to predict human preferences from the collected comparisons. Key considerations include:

- Using separate models for reward and policy to prevent overfitting
- Ensuring the RM generalizes to new examples through validation
- Training with careful learning rate scheduling to avoid collapse

## 3. Rollout Generation

During this phase, the policy model generates responses that will be evaluated by the reward model. Best practices include:

- Using nucleus sampling with temperature 0.7-1.0 for exploration
- Forcing visible Chain-of-Thought (CoT) reasoning when appropriate
- Generating multiple completions per prompt (64+ for [GRPO](https://arxiv.org/pdf/2402.03300))

## 4. Policy Optimization

This is where the actual RL training happens, with several algorithm options:

```csv
Algorithm,Key Advantage,Resource Usage
PPO,Stability with KL penalty,High (needs value model)
GRPO,Memory efficiency,Medium (no value model)
RLVR,Objective verification,Medium (external verifier)
```

Fine-grained tips that improve results:

- Token-level rewards provide denser learning signals than sequence-level rewards
- KL annealing gradually relaxes the KL penalty, so the policy can diverge further from the reference model over time
- Curriculum learning progresses from simple to complex examples

## 5. Safety and Evaluation Loops

Regular evaluation ensures the model improves without unwanted behaviors:

1. Check performance on benchmark tasks
2. Verify safety guardrails remain effective
3. Test for new failure modes
4. Return to data collection if necessary

## 6. Optional Distillation

For deployment efficiency, the final model can be distilled:

- The teacher model (full RL-trained) guides a smaller student
- Knowledge transfers through supervised learning
- A small loss in performance is traded for significant speed gains

This workflow represents the current best practice for creating models that align with human preferences while maintaining high capability.

# Evidence & Case-Studies: Benchmarks, Cost-Efficiency, and Chain-of-Thought Gains

## Performance × Cost Snapshot (April 2025)

```csv
Model (RL stage),ARC-AGI-1 (%),ARC-AGI-2 (%),MMLU (%),$ / 1M Input,$ / 1M Output,$ / ARC-AGI-2 task,Notes
o3-medium,53 %,3.0 %,85.3 %,$10,$40.00,$2.53,"PPO, visible-CoT policy"
o4-mini-medium,42 %,2.4 %,83.2 %,$1.10,$4.40,$0.23,"RLHF → short-GRPO pass, compact MoE"
DeepSeek R1 (GRPO),—,1.3 %,84.4 %,$0.55,$2.19,$0.08,Critic-free GRPO; open-source weights
```

Sources cited per column: ARC-AGI-1 from X (formerly Twitter); ARC-AGI-2 and $ / ARC-AGI-2 task from ARC Prize; MMLU from Artificial Analysis.

Key Take-aways:

- **Cost curves flatten faster than performance curves**; o4-mini delivers about 98 % of o3’s MMLU score (83.2 % vs. 85.3 %) with roughly 9× cheaper tokens.
- **Scaled RL + visible CoT** improves ARC-AGI-1 scores sharply, but raw “AGI-2” generalisation still needs verifier-based rewards.
- **GRPO** shows strong efficiency: DeepSeek hits GPT-4-level MMLU at roughly one-eighteenth of o3's token price, albeit with lower ARC-AGI scores.

## Chain-of-Thought–Centric RL Recipes

Today's most powerful language models don't just produce answers; they think through problems step by step. This approach, centered on Chain-of-Thought (CoT) reasoning, has become essential in reinforcement learning recipes.

Image: https://a-us.storyblok.com/f/1023026/2662x1304/64c225c1e7/short-cot.png
_Comparison of various LLMs in different benchmarks with short-CoT_ | **Source**: [Kimi k1.5: Scaling Reinforcement Learning with LLMs](https://arxiv.org/abs/2501.12599)

## OpenAI's Deliberative Approach

Image: https://a-us.storyblok.com/f/1023026/1010x904/c0bec6f34f/deliberate-alignment.png
_Illustration of how OpenAI uses deliberative alignment to fine-tune the LLM_ | **Source**: [Deliberative Alignment: Reasoning Enables Safer Language Models](https://arxiv.org/abs/2412.16339)

OpenAI's o-series models, including the recent o3 system, implement what could be called "[deliberative alignment](https://arxiv.org/pdf/2412.16339)." This method follows a specific pattern:

1. The model is required to explicitly write out its reasoning process
2. This internal reasoning trace is evaluated for correctness and safety
3. Rewards are assigned based on both process and outcome quality
4. For deployment, the trace can be hidden or distilled away

The deliberative approach enables what François Chollet describes as "natural language program search and execution within token space." This allowed an early preview of o3 to score 75.7% on the challenging ARC-AGI benchmark.

## DeepSeek's Efficient Math Training

DeepSeek's R1 model takes a specialized approach to mathematical reasoning using:

- Step-by-step math proofs with individual rewards at each reasoning stage
- [Group Relative Policy Optimization](https://arxiv.org/pdf/2402.03300) (GRPO) that eliminates the need for a separate value network
- Relative normalization of rewards within sample groups

This efficiency-focused method reduces GPU training hours by approximately 40% while achieving state-of-the-art performance on competition-level math problems.

## Future Direction: Verified Reasoning

The next frontier combines Reinforcement Learning with Verifiable Rewards (RLVR) and CoT verification.
This approach would work roughly as follows:

```python
def score_chain_of_thought(chain_of_thought, verify_logical_consistency,
                           verify_factual_accuracy, penalty=-1.0):
    """Sketch only: both verifiers are hypothetical callables (step -> bool)."""
    for reasoning_step in chain_of_thought:
        verification_failed = not (
            verify_logical_consistency(reasoning_step)
            and verify_factual_accuracy(reasoning_step)
        )
        if verification_failed:
            return penalty  # apply the penalty and stop at the first bad step
    return 0.0  # every step passed verification, so no penalty
```

This verification process can catch deceptive or truncated reasoning before it receives any reward. [Process supervision](https://documents/2) of this kind evaluates not just final answers but how the model arrives at them.

# Key Benefits of CoT-Centric Training

1. [Transparency] Reasoning becomes visible and auditable
2. [Accuracy] Complex problems benefit from structured thinking
3. [Alignment] Rewards target both process and outcome
4. [Efficiency] Better training signal from intermediate steps

As models continue to scale, this focus on explicit reasoning will likely become even more central to reinforcement learning workflows.

# Conclusion

Reinforcement learning scaling represents a significant shift in LLM training methodology. Rather than simply increasing model size, RL scaling focuses on optimizing how models use their existing capacity through enhanced training processes.

The key components of scaled RL include:

- **Core algorithms**: PPO remains the industry standard, while newer approaches like GRPO offer memory efficiency and RLVR provides objective verification.
- **Multi-dimensional scaling**: Effective RL requires balanced growth across model capacity, training data, and computing resources.
- **Chain-of-thought focus**: Modern RL training emphasizes visible reasoning processes that improve both transparency and accuracy.
- **Infrastructure patterns**: Distributed rollouts, experience replay, and hybrid deployment frameworks enable efficient scaling.

The results of properly implemented RL scaling are substantial: significantly better reasoning abilities, more transparent thinking processes, and outputs that better align with human preferences.
As the field advances, we can expect further innovations in verification-based rewards and process supervision that will enable even more capable, aligned models. The most successful implementations will continue to combine elements from multiple algorithms, creating tailored approaches for specific domains.