Automating RAG Optimization: Finding Optimal Configurations Through Systematic Testing

Queryloop Team
April 29, 2025
9 min read

Learn how Queryloop automates RAG optimization through systematic testing of parameter combinations to maximize accuracy, minimize latency, and control costs for complex document analysis.

Retrieval Augmented Generation (RAG) systems have emerged as a powerful approach for building accurate and reliable AI applications by connecting language models to external knowledge sources. However, achieving optimal performance requires carefully tuning numerous parameters, a process that traditionally demands extensive manual experimentation.

Queryloop's platform automates this optimization process, systematically testing parameter combinations to identify configurations that maximize accuracy, minimize latency, and control costs. This blog post examines how Queryloop's automated experiments helped us discover the optimal RAG configuration for analyzing complex financial documents.

Testing Ground: The Docugami SEC 10-Q Dataset

For our experiments, we selected the Docugami Knowledge Graph Retrieval Augmented Generation dataset, specifically focusing on the SEC 10-Q collection. SEC 10-Q reports are quarterly financial documents that publicly traded companies must file with the Securities and Exchange Commission, containing detailed financial data and management analysis.
We chose this dataset because it represents real-world challenges in enterprise document analysis:
  • It contains multiple documents from major tech companies (AAPL, AMZN, INTC, MSFT, NVDA)
  • The documents are long-form with complex structures (tables, sections, footnotes)
  • The questions require different retrieval capabilities, from simple lookups to multi-document synthesis
For our evaluation, we selected 20 representative questions covering three difficulty levels:
  • 10 Single-Doc, Single-Chunk questions (answers found in one contiguous section)
  • 5 Single-Doc, Multi-Chunk questions (answers requiring information from multiple sections)
  • 5 Multi-Doc questions (answers synthesized from multiple documents)

Experiment 1: Finding the Optimal Chunk Size

Our first experiment explored how document chunking affects retrieval performance. Chunking determines how documents are segmented into smaller pieces for indexing and retrieval.
Constant parameters:
  • Metric Type: cosine
  • Retrieval Method: basic
  • Post Retrieval: none
  • Top K: 20
  • Embedding Model: text-embedding-3-large
  • Document Parser: Basic
  • Query Transformation: Basic
Results:
| Chunk Size | Accuracy (%) | Latency (s) | Cost ($) |
|------------|--------------|-------------|----------|
| 4200       | 81.6         | 1.39        | 0.00     |
| 1800       | 78.2         | 1.01        | 0.00     |
| 700        | 59.3         | 1.61        | 0.01     |
| 300        | 41.7         | 1.37        | 0.01     |
Analysis: Larger chunk sizes dramatically improved accuracy on this dataset, with performance nearly doubling from 41.7% at 300 tokens to 81.6% at 4200 tokens. This indicates that financial questions often require substantial contextual information to be answered accurately. The performance gain from larger chunks came with minimal latency impact and no additional cost.
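To make the chunking step concrete, here is a minimal sketch of fixed-size chunking. The experiments measure chunk size in tokens; since the post does not specify Queryloop's tokenizer or chunker, this sketch approximates tokens with whitespace-separated words.

```python
def chunk_text(text, chunk_size, overlap=0):
    """Split text into fixed-size chunks of whitespace 'tokens'.

    Illustrative stand-in for a token-based chunker: real systems
    count model tokens, not words, but the mechanics are the same.
    """
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), step)]

# A 10,000-word document at chunk size 4200 yields 3 large chunks,
# each keeping far more surrounding financial context together than
# a 300-token setting would.
doc = ("revenue " * 10000).strip()
chunks = chunk_text(doc, chunk_size=4200)
print(len(chunks))  # 3
```

Larger chunks mean fewer index entries and more context per retrieved passage, which matches the accuracy gains seen in the table above.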

Experiment 2: Evaluating Distance Metrics

Next, we tested different distance metrics for measuring similarity between query and document vectors.
Constant parameters:
  • Chunk Size: 4200
  • Retrieval Method: basic
  • Post Retrieval: none
  • Top K: 20
  • Embedding Model: text-embedding-ada-002
  • Document Parser: Basic
  • Query Transformation: Basic
Results:
| Distance Metric | Accuracy (%) | Latency (s) | Cost ($) |
|-----------------|--------------|-------------|----------|
| hybrid          | 92.2         | 1.06        | 0.00     |
| dotproduct      | 89.8         | 0.89        | 0.00     |
| cosine          | 88.2         | 1.04        | 0.00     |
| euclidean       | 64.5         | 0.98        | 0.00     |
Analysis: The hybrid distance metric achieved the highest accuracy at 92.2%, significantly outperforming euclidean distance (64.5%). The difference between hybrid, dotproduct, and cosine metrics was smaller but still notable, suggesting that for financial documents, the hybrid approach better captures semantic relationships. Latency remained consistent across all metrics, with no measurable cost differences.
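The three pure vector metrics are standard formulas; "hybrid" search typically blends a dense (embedding) score with a sparse (keyword) score. The exact blend Queryloop uses is not shown in this post, so the convex combination below is an assumption for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Angle-based similarity, invariant to vector magnitude.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean(a, b):
    # Straight-line distance; smaller means more similar.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hybrid_score(dense_sim, sparse_sim, alpha=0.5):
    # Assumed weighting: convex combination of dense (embedding)
    # and sparse (keyword, e.g. BM25) similarity scores.
    return alpha * dense_sim + (1 - alpha) * sparse_sim
```

For financial filings, the sparse component helps match exact tickers and line-item names that embeddings alone can blur, which is one plausible reason hybrid leads this table.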

Experiment 3: Testing Query Transformation Techniques

Our third experiment compared various query transformation methods, which modify the original query to improve retrieval performance.
Constant parameters:
  • Chunk Size: 4200
  • Metric Type: hybrid
  • Retrieval Method: basic
  • Post Retrieval: none
  • Top K: 5
  • Embedding Model: text-embedding-ada-002
  • Document Parser: Basic
Results:
| Query Transformation     | Accuracy (%) | Latency (s) | Cost ($) |
|--------------------------|--------------|-------------|----------|
| Basic                    | 82.6         | 1.02        | 0.00     |
| HyDE                     | 80.6         | 4.82        | 0.00     |
| Multi Phrasing           | 75.9         | 3.90        | 0.00     |
| Expansion                | 65.5         | 2.65        | 0.00     |
| Deconstruction           | 62.7         | 4.08        | 0.00     |
| Iterative Deconstruction | 37.7         | 12.75       | 0.13     |
Analysis: Surprisingly, the Basic approach delivered both the highest accuracy (82.6%) and lowest latency (1.02s). Advanced techniques like Hypothetical Document Embedding (HyDE) performed slightly worse while requiring significantly more processing time. The most complex technique, Iterative Deconstruction, performed poorly at 37.7% accuracy with latency exceeding 12 seconds and introducing measurable costs. This suggests that for this financial dataset, simpler query approaches are more effective.
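To show why HyDE adds latency, here is a sketch of its flow: instead of embedding the raw question, the system first generates a hypothetical answer passage and embeds that. The `generate` and `embed` callables below are stand-ins for an LLM call and an embedding model, not Queryloop's actual interfaces.

```python
def hyde_query(question, generate, embed):
    """HyDE: embed a hypothetical answer rather than the raw question.

    The extra LLM generation step is what drives the ~4x latency
    increase seen in the experiment.
    """
    hypothetical_doc = generate(f"Write a passage answering: {question}")
    return embed(hypothetical_doc)

# Toy stand-ins so the flow is runnable end to end.
fake_generate = lambda prompt: "Apple reported quarterly revenue of ..."
fake_embed = lambda text: [len(text), text.count(" ")]  # toy embedding

vec = hyde_query("What was AAPL's Q2 revenue?", fake_generate, fake_embed)
```

The retrieval index is then queried with `vec` exactly as it would be with a plain question embedding; the hope is that an answer-shaped text lands closer to answer-bearing chunks.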

Experiment 4: Optimizing Retrieval Method and Top K

Our final experiment explored different retrieval methods and the number of retrieved chunks (Top K).
Constant parameters:
  • Chunk Size: 4200
  • Metric Type: hybrid
  • Post Retrieval: none
  • Embedding Model: text-embedding-ada-002
  • Document Parser: Basic
  • Query Transformation: Basic
Results:
| Retrieval Method  | Top K | Accuracy (%) | Latency (s) |
|-------------------|-------|--------------|-------------|
| sentence-window   | 10    | 96.0         | 1.16        |
| basic             | 20    | 94.4         | 1.07        |
| context_retrieval | 20    | 91.4         | 0.99        |
| sentence-window   | 5     | 85.8         | 1.04        |
| basic             | 10    | 80.2         | 0.88        |
| context_retrieval | 10    | 76.2         | 0.89        |
| basic             | 5     | 74.1         | 1.24        |
| context_retrieval | 5     | 69.7         | 0.96        |
Analysis: The sentence-window retrieval method with Top K=10 achieved a remarkable 96% accuracy, outperforming every other configuration tested. Notably, our system could not run sentence-window with Top K=20, so that combination remains untested. The results show that increasing Top K generally improves performance, but the retrieval method has an even stronger impact. The sentence-window method, which returns not just matching sentences but also their surrounding context, proved particularly effective for financial document analysis.
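The core idea of sentence-window retrieval can be sketched in a few lines: the index matches a single sentence, but the context passed to the LLM includes its neighbors. The window size below is illustrative, not Queryloop's internal default.

```python
def sentence_window(sentences, hit_index, window=2):
    """Return the matched sentence plus `window` neighbors on each side.

    Matching stays precise (one sentence), while the LLM still sees
    surrounding context such as a table row's header or a footnote,
    which matters in dense 10-Q filings.
    """
    start = max(0, hit_index - window)
    end = min(len(sentences), hit_index + window + 1)
    return " ".join(sentences[start:end])

sents = ["S0.", "S1.", "S2.", "S3.", "S4.", "S5."]
print(sentence_window(sents, hit_index=3, window=1))  # S2. S3. S4.
```

This explains why sentence-window at Top K=10 can beat basic retrieval at Top K=20: each of the 10 hits arrives pre-packaged with its local context.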

The Optimal Configuration

Through systematic experimentation, Queryloop identified an optimal configuration that achieved 96% accuracy on complex financial questions:
  • Chunk Size: 4200
  • Metric Type: hybrid
  • Retrieval Method: sentence-window
  • Top K: 10
  • Embedding Model: text-embedding-ada-002
  • Query Transformation: Basic
This represents a significant improvement over many baseline configurations: more than 2.5x better than the worst-performing setup. Even compared to reasonably good starting configurations, we observed a 10-15% accuracy boost.
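The winning settings can be captured as a plain configuration object. Queryloop's own configuration format is not shown in this post, so the dictionary below is purely illustrative of how the tuned values fit together.

```python
# Illustrative config for the best-performing setup from our experiments.
optimal_rag_config = {
    "chunk_size": 4200,                          # tokens per chunk
    "metric_type": "hybrid",                     # dense + sparse scoring
    "retrieval_method": "sentence-window",       # hit + surrounding sentences
    "top_k": 10,                                 # chunks retrieved per query
    "embedding_model": "text-embedding-ada-002",
    "query_transformation": "basic",             # no query rewriting
}
```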

Conclusion: The Value of Automated Parameter Optimization

Our experiments demonstrate that RAG system performance depends critically on configuration parameters, with accuracy ranging from 37.7% to 96% across different setups. Traditional manual tuning would require tedious trial-and-error across hundreds of combinations.
Queryloop's automated parameter optimization eliminates this burden by systematically exploring the parameter space. For this financial document analysis use case, we discovered that:
  • Larger chunk sizes (4200 tokens) significantly outperform smaller ones
  • Hybrid distance metrics provide better results than standard cosine or euclidean metrics
  • Simple query transformation approaches performed better than complex ones
  • The sentence-window retrieval method with moderate Top K values (10) achieves optimal results
These insights would have taken weeks to discover manually, but Queryloop's automated experimentation identified them efficiently. For enterprises building RAG systems in complex domains like financial analysis, this automated approach enables the creation of high-performing applications in a fraction of the time typically required. (Part 2 of this blog will cover the generation module.)
Tags: RAG, Optimization, Automation, AI, Financial Documents, Embedding, Retrieval, Queryloop