Teaching Small LLMs to Think — Part 2: Reinforcement Learning with OpenPipe ART
Explore how we used Group Relative Policy Optimization (GRPO) via the OpenPipe ART framework to achieve faster and more stable reinforcement learning training than PPO, reaching a more than 2x improvement in accuracy in under 40 minutes.
Our research focuses on improving AI agent performance using reinforcement learning (RL). In earlier work, we fine-tuned Qwen2.5-1.5B on HotpotQA using Agent-R1 with Proximal Policy Optimization (PPO). The results were promising: we reached over 80% accuracy through careful hyperparameter tuning.
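To make the GRPO-vs-PPO contrast concrete: GRPO drops PPO's learned value-function baseline and instead standardizes each rollout's reward against the other rollouts sampled for the same prompt. A minimal sketch of that group-relative advantage computation (illustrative only, not the ART framework's internal code; the function name and reward values are assumptions):

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages for one prompt's sampled rollouts.

    Instead of a critic network (as in PPO), each rollout's advantage
    is its reward standardized against the group mean and std.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical example: four rollouts for one question, where only the
# first two produced a correct answer (reward 1.0).
advantages = grpo_advantages([1.0, 1.0, 0.0, 0.0])
print(advantages)
```

Because the baseline comes from sibling rollouts rather than a separately trained critic, there is no value network to tune, which is one reason GRPO training tends to be simpler and more stable than PPO.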