Teaching Small LLMs to Think: Reinforcement Learning with Agent-R1

Zain ul Abideen
August 4, 2025

Learn how we successfully trained Qwen2.5-1.5B, a small LLM, to tackle complex question-answering tasks using reinforcement learning with the Agent-R1 framework, achieving performance comparable to larger models.

The world of Large Language Models (LLMs) is often dominated by talk of ever-increasing model sizes. But what if we could achieve impressive results with their smaller, more nimble counterparts? This post dives into how we successfully trained Qwen2.5-1.5B, a relatively small LLM, to tackle complex question-answering tasks using reinforcement learning (RL) with the Agent-R1 framework, yielding performance comparable to larger models.

The Quest for Efficient Power: Small LLMs and RL

Large models are powerful, but their size can be a barrier due to computational costs, deployment challenges, and inference latency. Smaller LLMs offer a path to democratize AI, making powerful tools more accessible. However, coaxing high performance out of these compact models requires smart training strategies.

Reinforcement Learning from Human Feedback (RLHF) and similar RL techniques have emerged as potent methods for fine-tuning LLMs to align with desired behaviors and improve performance on specific tasks. The Agent-R1 framework is designed to facilitate exactly this, providing the tools to train language agents using RL. Our goal was to leverage Agent-R1 to see if we could elevate a 1.5 billion parameter model to new heights on a challenging benchmark like HotpotQA.
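
Conceptually, this style of training alternates between letting the agent work through a question (retrieving documents, reasoning over them, answering) and updating its policy from the reward its final answer earns. The sketch below illustrates that loop in plain Python; every name in it (ToyAgent, rollout, compute_reward, ppo_update) is a hypothetical placeholder for illustration, not Agent-R1's actual API.

```python
# Conceptual sketch of RL fine-tuning for a language agent: roll out the agent
# on a question, score its final answer, and update the policy from the reward.
# All names here are hypothetical placeholders, not Agent-R1's real API.
import random

class ToyAgent:
    def answer(self, question: str) -> str:
        return random.choice(["Paris", "London"])

def rollout(agent: ToyAgent, question: str) -> str:
    # A real agent would issue tool calls (e.g. retrieval) over several turns.
    return agent.answer(question)

def compute_reward(prediction: str, gold: str) -> float:
    # Binary correctness signal; the experiments below use exact match on HotpotQA.
    return float(prediction.strip().lower() == gold.strip().lower())

def ppo_update(agent: ToyAgent, rollouts, rewards) -> None:
    pass  # In practice: estimate advantages, apply the clipped PPO update.

agent = ToyAgent()
dataset = [("What is the capital of France?", "Paris")]
for step in range(3):
    rollouts = [(q, rollout(agent, q), gold) for q, gold in dataset]
    rewards = [compute_reward(pred, gold) for _, pred, gold in rollouts]
    ppo_update(agent, rollouts, rewards)
    print(f"step {step}: mean reward = {sum(rewards) / len(rewards):.2f}")
```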

Setting the Stage: The HotpotQA Challenge and Agent-R1

HotpotQA is a question-answering dataset that requires reasoning over multiple supporting documents to find an answer. It's a great testbed for an LLM's ability to gather information, synthesize it, and respond accurately.
The original Agent-R1 team showcased results on HotpotQA using PPO (Proximal Policy Optimization), a popular RL algorithm. Their graphs indicated a critic/em_score/mean (Exact Match score) of around 0.4–0.5 after ~60 training steps, with an average of ~1.5 turns per response. This served as our initial benchmark.
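
For reference, the Exact Match metric behind scores like critic/em_score/mean is conventionally computed with the standard SQuAD/HotpotQA answer normalization (lowercasing, stripping punctuation and articles, collapsing whitespace). The snippet below is a minimal sketch of that convention; Agent-R1's own scorer may differ in details.

```python
import re
import string

def normalize_answer(s: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))

print(exact_match("The Eiffel Tower", "eiffel tower"))  # 1.0
```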

Our Experimental Journey with Qwen2.5-1.5B

We chose Qwen2.5-1.5B, a capable small model, for our experiments. Here's how we iterated to unlock its potential:

Parameter Comparison:

Experiment 2: First Steps and Learnings

Parameters: We initially adjusted the actor learning rate (actor_lr) upwards to 5e-6, halved the ppo_mini_batch_size to 32, and significantly increased the kl_coef to 0.01.
Results:
  • val-core/hotpotqa/hotpot_qa/acc/mean@1: 0.2558
  • val-aux/hotpotqa/hotpot_qa/turns/mean@1: 0.0
Analysis: The accuracy of ~0.256 was our starting point. The aggressive actor_lr and very high kl_coef (0.01) likely contributed to unstable learning; a high KL coefficient can overly penalize deviations from the initial policy, effectively stifling the agent's ability to explore and learn new, beneficial behaviors. The smaller ppo_mini_batch_size of 32 (compared to the original 64) increases the variance of gradient updates, which, combined with a high learning rate and restrictive KL, could hinder convergence.
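
For concreteness, here is Experiment 2's setup written out as a flat Python dictionary. The keys simply mirror the parameter names discussed above; Agent-R1's actual configuration files are structured differently, and the total_epochs value is inferred from Experiment 3 later doubling it to 2.

```python
# Experiment 2 hyperparameters as described above, flattened for illustration.
exp2_config = {
    "actor_lr": 5e-6,            # raised aggressively
    "ppo_mini_batch_size": 32,   # halved from the original 64
    "kl_coef": 0.01,             # very strong KL penalty
    "total_epochs": 1,           # inferred: Experiment 3 "doubles" this to 2
}
```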

Experiment 3: Significant Tuning and Improvement

Building on learnings from Experiment 2, we refined the hyperparameters, leading to a substantial performance leap.
Parameters: We adopted a more conservative actor_lr (2e-6) and a much smaller critic_lr (1e-6). Crucially, the kl_coef was reduced to 0.0003, allowing the model more freedom to explore while still regularizing. We also doubled the total_epochs to 2, giving the model more time to learn.
Results:
  • val-core/hotpotqa/hotpot_qa/acc/mean@1: 0.515625
  • val-aux/hotpotqa/hotpot_qa/reward/mean@1: 0.515625
  • val-aux/hotpotqa/hotpot_qa/format/mean@1: 1.0 (Perfect formatting!)
  • val-aux/hotpotqa/hotpot_qa/turns/mean@1: 1.0
Analysis: The dramatic improvement to ~0.516 accuracy can be attributed to several synergistic changes. The significantly lower kl_coef (0.0003) was pivotal. This less restrictive KL penalty allowed the actor to explore policy space more freely and learn more effectively from the RL rewards, without straying too drastically from the valuable prior knowledge of the pre-trained model. It struck a better balance between exploration and exploitation. The more conservative learning rates (2e-6 for actor, 1e-6 for critic) fostered more stable training. A lower critic_lr is particularly important as a well-estimated value function (provided by the critic) is crucial for the quality of advantage estimates that guide PPO updates. This stability is essential when the KL constraint is relaxed. Increasing total_epochs to 2 provided the agent with more learning iterations under these more favorable conditions, allowing it to better converge towards an effective policy. The combined effect was a more stable learning process, enabling the agent to achieve not only higher accuracy but also perfect response formatting and efficient single-turn interactions.
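
Because the quality of the advantages hinges on the critic, it helps to see exactly where the value estimates enter the computation. The sketch below shows standard Generalized Advantage Estimation (GAE), the estimator PPO implementations typically use; it is a generic illustration under that assumption, not Agent-R1's exact code.

```python
# Generic GAE: noisy critic values propagate directly into the advantages
# that weight every PPO policy-gradient step.
def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    """`values` carries one extra bootstrap entry: len(values) == len(rewards) + 1."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Toy example: sparse terminal reward, per-step values from the critic.
rewards = [0.0, 0.0, 1.0]
values = [0.4, 0.5, 0.7, 0.0]  # last entry is the bootstrap value
print(compute_gae(rewards, values))
```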

Experiment 4: Advanced Tuning and Breakthrough Performance! 🏆

Building on the solid foundation established in Experiment 3, further precise adjustments to hyperparameters enabled the model to reach its peak performance.
Parameters: We moderately increased the actor_lr to 3e-6. The critic_lr (1e-6), ppo_mini_batch_size (32), and total_epochs (2) were maintained from Experiment 3's successful setup. The kl_coef was carefully set to 0.0008.
Results:
  • val-core/hotpotqa/hotpot_qa/acc/mean@1: 0.801
  • val-aux/hotpotqa/hotpot_qa/reward/mean@1: 0.6150
  • val-aux/hotpotqa/hotpot_qa/format/mean@1: 1.0
  • val-aux/hotpotqa/hotpot_qa/turns/mean@1: 1.0
Analysis: The jump to an exceptional 0.801 accuracy marked a significant breakthrough. This was achieved by leveraging the stability of Experiment 3 and making targeted refinements. The slightly higher actor_lr (3e-6) gave policy exploration more momentum. The crucial adjustment was setting kl_coef to 0.0008; this value appears to have struck an optimal balance, encouraging the policy to explore further and discover more effective strategies than in Experiment 3 without sacrificing stability, thanks to the unchanged critic learning rate and epoch count. The result was a more refined policy capable of much higher accuracy while still averaging roughly one turn per response, demonstrating how precise hyperparameter tuning can unlock substantial performance gains in RL.
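
Putting the three runs side by side makes the tuning trajectory easier to see. The values below are collected from the experiment descriptions above; parameters not reported for a run, or only implied by later experiments, are marked as such.

```python
# Side-by-side view of the three runs (values taken from the text above).
# Experiment 2's critic_lr was not reported; its total_epochs and Experiment 3's
# mini-batch size are inferred from the surrounding descriptions.
experiments = {
    "Experiment 2": {"actor_lr": 5e-6, "critic_lr": "unstated", "kl_coef": 0.01,
                     "ppo_mini_batch_size": 32, "total_epochs": 1, "val_acc": 0.2558},
    "Experiment 3": {"actor_lr": 2e-6, "critic_lr": 1e-6, "kl_coef": 0.0003,
                     "ppo_mini_batch_size": 32, "total_epochs": 2, "val_acc": 0.515625},
    "Experiment 4": {"actor_lr": 3e-6, "critic_lr": 1e-6, "kl_coef": 0.0008,
                     "ppo_mini_batch_size": 32, "total_epochs": 2, "val_acc": 0.801},
}
for name, cfg in experiments.items():
    print(name, cfg)
```
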
The actor/reward_kl_penalty chart for this experiment (shown below) further illuminates the learning process. This penalty reflects how far the policy being trained has drifted from its original state, scaled by kl_coef.
[Chart: actor/reward_kl_penalty over training steps for Experiment 4]
The chart shows the penalty starting near zero and steadily increasing throughout most of training, peaking around 0.3 before stabilizing slightly. This trend is indicative of an active learning process in which the actor policy is substantially updated and refined to optimize for the HotpotQA task rewards. The kl_coef of 0.0008 proved crucial here; it was loose enough to allow substantial policy deviation and exploration (as evidenced by the rising penalty), yet tight enough to prevent catastrophic forgetting or instability.
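
In RLHF-style PPO, a penalty like this is commonly formed per token as kl_coef times the log-probability gap between the current policy and a frozen reference model, and folded into the reward. The sketch below shows that shaping under the standard formulation; Agent-R1's exact computation, and the precise quantity plotted in the chart above, may differ.

```python
import torch

def kl_shaped_rewards(task_reward, logp_actor, logp_ref, kl_coef=0.0008):
    """Subtract a per-token KL penalty from a sparse end-of-sequence task reward."""
    # Approximate per-token KL as the log-prob gap between actor and reference.
    kl = logp_actor - logp_ref
    penalty = kl_coef * kl
    rewards = -penalty           # dense penalty applied to every token
    rewards[-1] += task_reward   # task reward credited at the final token
    # The mean penalty is roughly the kind of quantity a reward_kl_penalty chart tracks.
    return rewards, penalty.mean()

logp_actor = torch.tensor([-0.8, -1.0, -0.5])
logp_ref = torch.tensor([-1.2, -1.1, -0.9])
rewards, mean_penalty = kl_shaped_rewards(1.0, logp_actor, logp_ref)
print(rewards, mean_penalty)
```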

Why This Matters: The Power of Small, Tuned Models

Our experiments demonstrate a key insight: small language models, when carefully tuned with advanced RL techniques like those enabled by Agent-R1, can achieve performance comparable to their larger counterparts on complex tasks.
This has significant implications:
  • Accessibility: High-performing models become more accessible for researchers and developers with limited computational resources.
  • Efficiency: Smaller models mean lower deployment costs, faster inference times, and reduced energy consumption.
  • Specialization: These models can be fine-tuned to excel at specific tasks, creating highly optimized and efficient solutions.
  • New Benchmarks for Small Models: Our findings set a new bar for what can be expected from 1.5B parameter models, encouraging further research into optimizing compact architectures.

Conclusion

Our work with Qwen2.5-1.5B and Agent-R1 highlights the strong potential of smaller language models. From the early stages in Experiment 2 to the clear gains in Experiment 3, and ultimately reaching an impressive accuracy of 0.801 in Experiment 4, the progression demonstrates how consistent reinforcement learning tuning can significantly boost performance.
These outcomes are not just promising — they mark a meaningful shift. They prove that smaller models, when trained effectively, can deliver results that rival much larger systems. As RL techniques and frameworks like Agent-R1 continue to evolve, the capabilities of compact, efficient models are set to grow even further, shaping the future of AI development.
Tags: AI, Machine Learning, Reinforcement Learning, Small LLMs, Agent-R1, PPO, HotpotQA, Model Optimization, Queryloop