Teaching Small LLMs to Think: Reinforcement Learning with Agent-R1
Learn how we successfully trained Qwen2.5-1.5B, a small LLM, to tackle complex question-answering tasks using reinforcement learning with the Agent-R1 framework, achieving performance comparable to larger models.
The world of Large Language Models (LLMs) is often dominated by talk of ever-increasing model sizes. But what if we could achieve impressive results with their smaller, more nimble counterparts? This post dives into how we successfully trained Qwen2.5-1.5B, a relatively small LLM, to tackle complex question-answering tasks using reinforcement learning (RL) with the Agent-R1 framework, yielding performance comparable to larger models.
The Quest for Efficient Power: Small LLMs and RL
Reinforcement Learning from Human Feedback (RLHF) and similar RL techniques have emerged as potent methods for fine-tuning LLMs to align with desired behaviors and improve performance on specific tasks. The Agent-R1 framework is designed to facilitate exactly this, providing the tools to train language agents using RL. Our goal was to use Agent-R1 to see whether a 1.5-billion-parameter model could reach strong performance on a challenging multi-hop benchmark, HotpotQA.
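Under the hood, RL fine-tuning frameworks in this family typically optimize the policy with a PPO-style clipped surrogate objective. Below is a minimal NumPy sketch of that objective; the function name and defaults are our own illustration, not Agent-R1's actual API:

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """PPO clipped surrogate objective (to be maximized).

    logp_new / logp_old: per-token log-probs under the current and
    behavior (rollout) policies. advantages: per-token advantage
    estimates. eps: the clip range (0.2 is the common default).
    """
    ratio = np.exp(logp_new - logp_old)          # importance weight
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    # Taking the elementwise minimum makes the objective pessimistic,
    # which discourages large policy updates.
    return np.minimum(unclipped, clipped).mean()
```

When the new and old policies agree (ratio = 1), the objective reduces to the mean advantage; once the ratio drifts outside [1 - eps, 1 + eps], clipping caps the incentive to move further.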
Setting the Stage: The HotpotQA Challenge and Agent-R1
Our Experimental Journey with Qwen2.5-1.5B
Experiment 2: First Steps and Learnings
- val-core/hotpotqa/hotpot_qa/acc/mean@1: 0.2558
- val-aux/hotpotqa/hotpot_qa/turns/mean@1: 0.0
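A quick note on reading these metric names: `val-core` appears to mark the headline validation metric while `val-aux` marks auxiliary ones, and the `@1` suffix means the score is averaged over a single sampled rollout per validation question. A hedged sketch of that mean@k reading (the helper name is ours):

```python
def mean_at_k(per_question_scores, k=1):
    """Average the first k rollouts' scores per question, then average
    over questions -- our reading of the mean@k naming convention.

    per_question_scores: list of per-question score lists, one score
    per sampled rollout.
    """
    per_question_means = [sum(scores[:k]) / k for scores in per_question_scores]
    return sum(per_question_means) / len(per_question_means)
```

With k=1 this is simply the fraction of questions answered correctly on a single attempt, which is what acc/mean@1 reports above.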
Experiment 3: Significant Tuning and Improvement
- val-core/hotpotqa/hotpot_qa/acc/mean@1: 0.515625
- val-aux/hotpotqa/hotpot_qa/reward/mean@1: 0.515625
- val-aux/hotpotqa/hotpot_qa/format/mean@1: 1.0 (Perfect formatting!)
- val-aux/hotpotqa/hotpot_qa/turns/mean@1: 1.0
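The separate acc, format, and reward metrics suggest a shaped reward that combines answer correctness with adherence to the expected output format. The sketch below is our assumption about how such shaping could work (the gating and weights are hypothetical, not Agent-R1's actual reward function); note that with perfect formatting it makes reward equal accuracy, consistent with the numbers in this run:

```python
def shaped_reward(answer_correct: bool, format_ok: bool,
                  acc_weight: float = 1.0, format_bonus: float = 0.0) -> float:
    """Combine correctness and format adherence into one scalar reward.

    Hard-gates on format: a malformed response earns nothing, so the
    policy first learns to emit well-formed outputs, then to be correct.
    """
    if not format_ok:
        return 0.0  # malformed output: no credit for the answer
    return acc_weight * float(answer_correct) + format_bonus
```

Hard gating like this is one common way to explain why format/mean@1 saturates at 1.0 early: formatting errors are maximally punished, so the model eliminates them before accuracy climbs.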

Experiment 4: Advanced Tuning and Breakthrough Performance! 🏆
- val-core/hotpotqa/hotpot_qa/acc/mean@1: 0.801
- val-aux/hotpotqa/hotpot_qa/reward/mean@1: 0.6150
- val-aux/hotpotqa/hotpot_qa/format/mean@1: 1.0
- val-aux/hotpotqa/hotpot_qa/turns/mean@1: 1.0
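For context on what acc measures: HotpotQA answers are typically scored by exact match after normalizing both the prediction and the gold answer (lowercasing, stripping punctuation and articles, collapsing whitespace). A sketch following the standard SQuAD-style normalization, which we assume is roughly what the accuracy metric here uses:

```python
import re
import string

def normalize_answer(s: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation and
    articles (a/an/the), collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, gold: str) -> bool:
    """True if prediction and gold agree after normalization."""
    return normalize_answer(prediction) == normalize_answer(gold)
```

So a prediction like "The Eiffel Tower." matches a gold answer of "eiffel tower", which keeps the metric robust to surface phrasing while still demanding the right answer.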

Why This Matters: The Power of Small, Tuned Models
- Accessibility: High-performing models become more accessible for researchers and developers with limited computational resources.
- Efficiency: Smaller models mean lower deployment costs, faster inference times, and reduced energy consumption.
- Specialization: These models can be fine-tuned to excel at specific tasks, creating highly optimized and efficient solutions.
- New Benchmarks for Small Models: Our findings set a new bar for what can be expected from 1.5B parameter models, encouraging further research into optimizing compact architectures.