Enhancing AI Agents with LADDER and TTRL: A Self-Improving Approach

Zain ul Abideen
April 14, 2025

Explore how LADDER and TTRL frameworks can enable AI agents to improve without human supervision through self-directed learning and test-time reinforcement.

As AI agents evolve to become increasingly capable at tool use and autonomous reasoning, the need for methods that allow them to improve without human supervision becomes more critical. In this article, I explore the integration of LADDER (Learning by Auto-Decomposing and Deriving Easier Representations) and TTRL (Test-Time Reinforcement Learning) into a data analysis agent.

This research builds upon the concepts introduced in the Evaluating AI Agents course by DeepLearning.AI.


Overview of LADDER

LADDER is a self-supervised learning framework, introduced in the original LADDER paper, that allows large language models to decompose complex problems into simpler variants, solve them, and use the resulting feedback to incrementally improve their performance.
The goal is to simulate a curriculum-like progression, enabling the model to solve increasingly difficult tasks by practicing on easier subproblems it generates itself.

Example Variant Generation

def generate_variants(prompt: str, tool_type: str, num_variants: int = 3) -> list:
    """Generate simpler variants of a prompt for the given tool type."""
    variants = [prompt]

    if tool_type == "lookup":
        if "store" in prompt.lower() and "date" in prompt.lower():
            variants.append("Show total sales for all stores.")
            variants.append("Show sales for store 1320.")
        else:
            variants.append("Show total sales.")
            variants.append("Show sales by region.")

    elif tool_type == "analysis":
        variants.append("Summarize total sales.")
        variants.append("Find average sales.")

    elif tool_type == "visualization":
        variants.append("Create a bar chart of total sales.")
        variants.append("Plot sales over time.")

    return variants[:num_variants]

While this implementation serves as a prototype, it illustrates the core mechanism of generating simpler prompts from a complex one.
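For instance, a quick call to the prototype (the prompt string here is only an illustration) returns the original prompt plus two simpler variants:

variants = generate_variants(
    "Show sales for store 1320 on date 2021-11-01",
    tool_type="lookup",
)
for v in variants:
    print(v)
# Show sales for store 1320 on date 2021-11-01
# Show total sales for all stores.
# Show sales for store 1320.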

Test-Time Reinforcement Learning (TTRL)

TTRL complements LADDER by refining the agent at inference time. If an initial result is suboptimal or fails verification, TTRL engages by selecting variants, evaluating their output, and updating the agent accordingly.

Simplified TTRL Workflow

import random

def ttrl_refinement(prompt: str, tool_type: str, initial_result: str, rl_steps: int = 3):
    """Refine a failing result at inference time via reward-driven variant practice."""
    variants = generate_variants(prompt, tool_type)

    for step in range(rl_steps):
        variant = random.choice(variants)
        result = execute_variant(tool_type, variant)
        reward = 1.0 if verify_output(result, tool_type) else -0.1
        agent_model.train(reward)  # assuming `agent_model` is defined elsewhere

    return execute_variant(tool_type, prompt)
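
The workflow above assumes execute_variant and verify_output helpers. A minimal sketch of what they could look like for this agent follows; the dispatch branches and checks are placeholders standing in for the real SQL, analysis, and plotting tools, not the actual implementation:

def execute_variant(tool_type: str, prompt: str) -> str:
    # Placeholder dispatch: in the real agent each branch would call the
    # corresponding tool (SQL generation, data analysis, or chart code generation).
    if tool_type == "lookup":
        return f"SELECT * FROM sales  -- generated for: {prompt}"
    if tool_type == "analysis":
        return f"Summary statistics for: {prompt}"
    if tool_type == "visualization":
        return f"import matplotlib.pyplot as plt  # chart code for: {prompt}"
    raise ValueError(f"Unknown tool type: {tool_type}")


def verify_output(result: str, tool_type: str) -> bool:
    # Cheap structural checks stand in for real verification.
    if not result:
        return False
    if tool_type == "lookup":
        return "SELECT" in result.upper()
    if tool_type == "visualization":
        return "plt" in result
    return True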

Research Context and Setup

The agent developed for this study was capable of executing three types of tasks:
  • Database Lookups: Generating SQL queries from natural language
  • Data Analysis: Extracting insights from tabular data
  • Visualization: Producing Python code for data visualization
I designed the agent with modular support for tool invocation, verification checks, and feedback-based self-improvement. LADDER and TTRL were incorporated into the pipeline to handle training and refinement respectively. LADDER was used in a proactive capacity, generating task variants to improve model accuracy via self-supervised practice. TTRL, on the other hand, was invoked reactively — when a result failed verification or fell below the confidence threshold.
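
As a rough sketch, the proactive LADDER phase can be a simple loop over the self-generated variants, practicing the simpler ones first; this illustrates the idea rather than the exact training routine used:

def ladder_training(prompt: str, tool_type: str, passes: int = 2) -> None:
    # Proactive LADDER phase: practice on self-generated simpler variants,
    # easiest first, and feed the verification reward back to the model.
    variants = generate_variants(prompt, tool_type)
    for _ in range(passes):
        for variant in reversed(variants):  # simpler variants before the original prompt
            result = execute_variant(tool_type, variant)
            reward = 1.0 if verify_output(result, tool_type) else -0.1
            agent_model.train(reward)  # assuming `agent_model` is defined elsewhere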
Each interaction was logged with timestamped accuracy tracking using standard Python logging. Logs captured details such as variant prompt execution, verification results, reward scores, and changes in agent accuracy. This enabled detailed observation of how LADDER and TTRL influenced performance on each task.
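
A minimal version of that logging setup, assuming a small helper that records each variant execution (the field names and format are illustrative, not the exact schema used in the experiments):

import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger("ladder_ttrl_agent")

def log_variant_result(source: str, variant: str, reward: float, accuracy: float) -> None:
    # source is e.g. "LADDER" or "TTRL Step 2"; accuracy is the running estimate.
    logger.info("%s Variant '%s': Reward %.2f, Accuracy %.2f", source, variant, reward, accuracy)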

Prompt Example

prompt = (
    "Show me all the sales for store 1320 on November 1st, 2021, "
    "analyze the trends, and create a bar chart showing the results."
)
result = run_agent(prompt)

In this example, the agent uses three chained tool invocations: SQL lookup, data analysis, and chart generation. LADDER attempts training on decomposed variants of this multi-step query. If any part fails during execution, TTRL engages to iterate and improve outputs in real-time.
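
One way to wire this together is a thin orchestration loop that runs each tool in sequence and falls back to TTRL whenever verification fails. A hedged sketch, reusing the helpers above (the actual agent's orchestration and confidence handling may differ):

def run_agent(prompt: str) -> dict:
    # Chain the three tools; if a stage fails verification, hand it to
    # TTRL for reward-driven refinement before moving on.
    results = {}
    for tool_type in ("lookup", "analysis", "visualization"):
        result = execute_variant(tool_type, prompt)
        if not verify_output(result, tool_type):
            result = ttrl_refinement(prompt, tool_type, result)
        results[tool_type] = result
    return results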

Observations

  • LADDER performs best on complex tasks where decomposition into simpler sub-tasks is meaningful (e.g., forecasting, trend analysis).
  • Limited effectiveness on simple prompts: When the task is direct (e.g., "Get sales for store 1320 on Nov 1st, 2021"), the generated variants add little instructional value.
  • Mock training can dilute feedback: Without executing actual outputs during training, the learning signal may become noisy.
  • TTRL often salvages underperforming results by conducting targeted, reward-driven refinement in real-time.

Logging Sample

INFO - LADDER Variant 'Show total sales.': Reward -0.10, Accuracy 0.48
INFO - TTRL Step 2 - Variant 'Simple lookup task: trends': Reward 1.00, Accuracy 0.59
INFO - Final Result Verification: Passed

Conclusion

Integrating LADDER and TTRL into tool-using agents offers a promising approach for enabling self-improvement without external supervision. However, to fully realize their potential, variant generation must be intelligent, recursive, and context-aware.
Future work will focus on:
  • Incorporating GPT-based recursive decomposition
  • Learning prompt complexity heuristics
  • Improving training feedback loops
Exploring these self-improving strategies reveals an exciting direction for AI agents: learning to learn, autonomously.