Teaching Small LLMs to Think — Part 2: Reinforcement Learning with OpenPipe ART

Zain ul Abideen
August 5, 2025
6 min read

Explore how we used Group Relative Policy Optimization (GRPO) via the OpenPipe ART framework to achieve faster and more stable reinforcement learning training than PPO, reaching a more than 2x improvement in answer accuracy in under 40 minutes.

Our research focuses on improving AI agent performance using reinforcement learning (RL). In earlier work, we fine-tuned Qwen2.5–1.5B on HotpotQA using Agent-R1 with Proximal Policy Optimization (PPO). The results were promising: we reached over 80% accuracy through careful hyperparameter tuning.

Now, we explore a newer RL approach — Group Relative Policy Optimization (GRPO) — via the OpenPipe ART framework. This post presents our follow-up experiments using the same data and model setup, comparing GRPO's efficiency and learning behavior against PPO.

Why OpenPipe and GRPO?

Agent-R1, while powerful, is complex: it uses separate actor and critic models, intricate advantage calculations, and a multi-stage PPO loop. Fine-tuning often takes 6–8 hours even on well-optimized setups.
OpenPipe ART, in contrast:
  • Uses GRPO, which eliminates the need for a value function.
  • Integrates KL divergence directly into the loss, improving stability.
  • Runs on Unsloth.ai, enabling training in under 40 minutes.
  • Offers a plug-and-play design, compatible with OpenAI-style chat completions.
This makes it ideal for fast iteration and accessible reinforcement learning experiments.
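To make the first two points concrete, here is a minimal sketch (our own illustration, not OpenPipe's code) of the group-relative advantage at the heart of GRPO: each sampled completion's reward is normalized against the mean and standard deviation of its own group, so no learned critic is needed, and a KL penalty against a frozen reference policy is added directly to the loss rather than folded into the reward.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantage: normalize each completion's reward against the
    mean and standard deviation of its own sampling group, so no learned
    value function (critic) is required."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each, binary rewards.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(group_relative_advantages(rewards))
```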

Dataset and Reward Function

We used the HotpotQA dataset, a multi-hop question-answering benchmark where answers require reasoning across multiple context snippets. Each sample includes:
  • A prompt (question)
  • Ground-truth answer
  • Supporting fact references
For RL, we used a binary reward function:
  • Reward = 1 if the model's output contains the ground-truth string (case-insensitive match).
  • Reward = 0 otherwise.
This sparse but clear reward signal was sufficient to guide improvements under GRPO.
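In code, this reward is just a case-insensitive containment check; a minimal version (function name ours) looks like this:

```python
def binary_reward(model_output: str, ground_truth: str) -> float:
    """Return 1.0 if the ground-truth answer appears anywhere in the model's
    output (case-insensitive containment), 0.0 otherwise."""
    return 1.0 if ground_truth.lower() in model_output.lower() else 0.0

# Examples drawn from the Q&A snapshots later in this post.
assert binary_reward("Michael Rooker plays Yondu in Guardians of the Galaxy.",
                     "Guardians of the Galaxy") == 1.0
assert binary_reward("Michael Rooker is an American actor.",
                     "Guardians of the Galaxy") == 0.0
```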

Experimental Setup

We reused the HotpotQA data from Agent-R1's repository and conducted four experiments using OpenPipe ART and GRPO, each with a different hyperparameter setup.
[Table: experimental setup for the four runs]
All models were trained for 10 RL steps using a binary reward (1 if the generated answer matched the ground truth).
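Schematically, one RL step then looks like the loop below. This is a conceptual sketch rather than ART's actual API: `sample_completions` and `grpo_update` are hypothetical stand-ins for the framework's rollout and training calls, and `binary_reward` is the containment check defined above.

```python
# Conceptual sketch of a single GRPO step; sample_completions() and
# grpo_update() are hypothetical stand-ins for the framework's rollout and
# training calls, not OpenPipe ART's real API.
GROUP_SIZE = 4  # completions sampled per question (illustrative value)

def rl_step(model, questions, answers):
    groups = []
    for question, answer in zip(questions, answers):
        # Roll out several chat-style completions for the same question.
        completions = sample_completions(model, question, n=GROUP_SIZE)
        # Score each completion with the binary containment reward.
        rewards = [binary_reward(c, answer) for c in completions]
        groups.append((completions, rewards))
    # GRPO normalizes rewards within each group (no value function) and
    # applies the policy update with a KL penalty folded into the loss.
    grpo_update(model, groups)
```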

Why These Hyperparameters?

Our hyperparameter selection was informed by both prior PPO experience and the architectural characteristics of GRPO:
  • Learning Rate: We tested values from 1e-5 to 5e-5 to study how aggressive updates influence convergence. Higher rates (like 5e-5) encourage faster policy shifts, while lower rates ensure smoother, more stable training.
  • GPU Utilization: Varying from 0.65 to 0.95 helped us assess how resource limits impact throughput and sequence processing in practice.
  • Max Tokens: We lowered token limits (e.g., 128) in high-resource configs to prevent memory overflow, and increased them (e.g., 512) in the long-context experiment to test multi-hop comprehension.
  • Context Length: 6144 tokens were used to evaluate whether longer inputs provide better evidence aggregation for complex reasoning.
This exploration helped us understand not just what works best, but why — which is essential when generalizing RL training strategies across different tasks.
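For reference, the ranges we swept can be summarized as a plain config dict (field names are ours; the pairing of values in each run follows the run names used in the results, e.g. lr5e5-gpu095 pairs a 5e-5 learning rate with 0.95 GPU memory utilization):

```python
# Hyperparameter ranges explored across the four GRPO runs.
# Field names are ours; values come from the discussion above.
sweep = {
    "learning_rate": [1e-5, 3e-5, 5e-5],           # aggressive vs. conservative updates
    "gpu_memory_utilization": [0.65, 0.85, 0.95],  # resource limits vs. throughput
    "max_completion_tokens": [128, 512],           # low in high-utilization runs, high for long context
    "context_length": [6144],                      # long-context experiment
    "rl_steps": 10,                                # fixed for every run
}
```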

Accuracy Before vs After RL

We compared model accuracy by evaluating 100 random validation samples before and after RL training.
[Chart: accuracy before vs. after RL]
This confirms a >2x improvement in match accuracy using GRPO on a small model in less than one hour of training.
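The evaluation itself is the same containment check applied to a random sample of validation questions; a sketch is below, with `generate_answer` as a hypothetical generation helper.

```python
import random

def match_accuracy(model, validation_set, n_samples=100, seed=0):
    """Fraction of sampled validation questions whose generated answer
    contains the ground-truth string (case-insensitive)."""
    rng = random.Random(seed)
    sampled = rng.sample(validation_set, n_samples)
    hits = 0
    for question, answer in sampled:
        output = generate_answer(model, question)  # hypothetical generation helper
        hits += binary_reward(output, answer) == 1.0
    return hits / n_samples
```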

Example Q&A Pairs: Before vs After RL

Here are a few illustrative before/after snapshots showing how answers improved:

Example 1

Q: What movie do Michael Rooker and Yondu have in common?
Ground Truth: Guardians of the Galaxy
  • Before RL: Michael Rooker is an American actor. Yondu is a fictional character.
  • After RL: Michael Rooker plays Yondu in Guardians of the Galaxy.
✅ After RL, the model correctly links the actor and character through the right movie.

Example 2

Q: What former Florida State player had 217 receiving yards in his NFL debut?
Ground Truth: Anquan Boldin
  • Before RL: Florida State has had many NFL players.
  • After RL: Anquan Boldin, a former Florida State player, had 217 yards in his NFL debut.
✅ The RL-trained model gives a direct, factually complete answer.

Example 3

Q: Pennsylvania Route 213 runs next to a mall located next to which amusement park?
Ground Truth: Sesame Place
  • Before RL: Pennsylvania Route 213 is a state highway.
  • After RL: Pennsylvania Route 213 runs near Oxford Valley Mall, which is next to Sesame Place.
✅ Demonstrates successful multi-hop reasoning across locations.

Results Summary

Training Stability

  • Exception rate was 0.0 across all GRPO runs.
  • Gradient norms were smooth and controlled, unlike PPO's frequent spikes.
  • Training was stable even at high learning rates (e.g., 5e-5).

Reward Behavior

  • lr5e5-gpu095 achieved the highest average reward after 5–6 steps.
  • GRPO runs reached competitive reward scores in far fewer steps than PPO (which took 60+).
  • lr3e5-gpu085 balanced reward gain and learning stability best.

Training Speed and Simplicity

  • OpenPipe ART finished training in under 40 minutes per run.
  • PPO with Agent-R1 required 6–8 hours for similar experiments.
  • GRPO's no-value-function design simplified both setup and tuning.

Result Charts

[Result charts]

OpenPipe vs Agent-R1: Key Differences

[Table: OpenPipe vs Agent-R1 key differences]

Conclusion

Switching to OpenPipe ART with GRPO gave us:
  • Faster training cycles
  • Higher stability
  • Simpler configuration
While Agent-R1 with PPO showed strong performance in earlier work, GRPO proved to be a faster and more scalable alternative — particularly when tuning small models for tasks like multi-hop QA. For researchers and practitioners looking to fine-tune LLMs efficiently, OpenPipe ART is a powerful tool worth adopting.
AI · Machine Learning · Reinforcement Learning · GRPO · OpenPipe ART · Small LLMs · HotpotQA · Model Training · PPO · Queryloop