Automating RAG Optimization: Finding Optimal Configurations Through Systematic Testing

Queryloop Team
April 29, 2025
9 min read

Learn how Queryloop automates RAG optimization through systematic testing of parameter combinations to maximize accuracy, minimize latency, and control costs for complex document analysis.

Retrieval Augmented Generation (RAG) systems have emerged as a powerful approach for building accurate and reliable AI applications by connecting language models to external knowledge sources. However, achieving optimal performance requires carefully tuning numerous parameters, a process that traditionally demands extensive manual experimentation.

Queryloop's platform automates this optimization process, systematically testing parameter combinations to identify configurations that maximize accuracy, minimize latency, and control costs. This blog post examines how Queryloop's automated experiments helped us discover the optimal RAG configuration for analyzing complex financial documents.

Testing Ground: The Docugami SEC 10-Q Dataset

For our experiments, we selected the Docugami Knowledge Graph Retrieval Augmented Generation dataset, specifically focusing on the SEC 10-Q collection. SEC 10-Q reports are quarterly financial documents that publicly traded companies must file with the Securities and Exchange Commission, containing detailed financial data and management analysis.
We chose this dataset because it represents real-world challenges in enterprise document analysis:
  • It contains multiple documents from major tech companies (AAPL, AMZN, INTC, MSFT, NVDA)
  • The documents are long-form with complex structures (tables, sections, footnotes)
  • The questions require different retrieval capabilities, from simple lookups to multi-document synthesis
For our evaluation, we selected 20 representative questions covering three difficulty levels:
  • 10 Single-Doc, Single-Chunk questions (answers found in one contiguous section)
  • 5 Single-Doc, Multi-Chunk questions (answers requiring information from multiple sections)
  • 5 Multi-Doc questions (answers synthesized from multiple documents)

Experiment 1: Finding the Optimal Chunk Size

Our first experiment explored how document chunking affects retrieval performance. Chunking determines how documents are segmented into smaller pieces for indexing and retrieval.
Constant parameters:
  • Metric Type: cosine
  • Retrieval Method: basic
  • Post Retrieval: none
  • Top K: 20
  • Embedding Model: text-embedding-3-large
  • Document Parser: Basic
  • Query Transformation: Basic
Results:
| Chunk Size | Accuracy (%) | Latency (s) | Cost ($) |
|------------|--------------|-------------|----------|
| 4200       | 81.6         | 1.39        | 0.00     |
| 1800       | 78.2         | 1.01        | 0.00     |
| 700        | 59.3         | 1.61        | 0.01     |
| 300        | 41.7         | 1.37        | 0.01     |
Analysis: Larger chunk sizes dramatically improved accuracy on this dataset, with performance nearly doubling from 41.7% at 300 tokens to 81.6% at 4200 tokens. This indicates that financial questions often require substantial contextual information to be answered accurately. The performance gain from larger chunks came with minimal latency impact and no additional cost.
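To make the chunking step concrete, here is a minimal sketch of fixed-size chunking. The experiments measure chunk size in tokens; since the post does not specify Queryloop's tokenizer or chunker, this sketch approximates tokens with whitespace-separated words.

```python
def chunk_text(text, chunk_size, overlap=0):
    """Split text into fixed-size chunks of whitespace 'tokens'.

    Illustrative stand-in for a token-based chunker: real systems
    count model tokens, not words, but the mechanics are the same.
    """
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), step)]

# A 10,000-word document at chunk size 4200 yields 3 large chunks,
# each keeping far more surrounding financial context together than
# a 300-token setting would.
doc = ("revenue " * 10000).strip()
chunks = chunk_text(doc, chunk_size=4200)
print(len(chunks))  # 3
```

Larger chunks mean fewer index entries and more context per retrieved passage, which matches the accuracy gains seen in the table above.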

Experiment 2: Evaluating Distance Metrics

Next, we tested different distance metrics for measuring similarity between query and document vectors.
Constant parameters:
  • Chunk Size: 4200
  • Retrieval Method: basic
  • Post Retrieval: none
  • Top K: 20
  • Embedding Model: text-embedding-ada-002
  • Document Parser: Basic
  • Query Transformation: Basic
Results:
| Distance Metric | Accuracy (%) | Latency (s) | Cost ($) |
|-----------------|--------------|-------------|----------|
| hybrid          | 92.2         | 1.06        | 0.00     |
| dotproduct      | 89.8         | 0.89        | 0.00     |
| cosine          | 88.2         | 1.04        | 0.00     |
| euclidean       | 64.5         | 0.98        | 0.00     |
Analysis: The hybrid distance metric achieved the highest accuracy at 92.2%, significantly outperforming euclidean distance (64.5%). The difference between hybrid, dotproduct, and cosine metrics was smaller but still notable, suggesting that for financial documents, the hybrid approach better captures semantic relationships. Latency remained consistent across all metrics, with no measurable cost differences.
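The three pure vector metrics are standard formulas; "hybrid" search typically blends a dense (embedding) score with a sparse (keyword) score. The exact blend Queryloop uses is not shown in this post, so the convex combination below is an assumption for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Angle-based similarity, invariant to vector magnitude.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean(a, b):
    # Straight-line distance; smaller means more similar.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hybrid_score(dense_sim, sparse_sim, alpha=0.5):
    # Assumed weighting: convex combination of dense (embedding)
    # and sparse (keyword, e.g. BM25) similarity scores.
    return alpha * dense_sim + (1 - alpha) * sparse_sim
```

For financial filings, the sparse component helps match exact tickers and line-item names that embeddings alone can blur, which is one plausible reason hybrid leads this table.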

Experiment 3: Testing Query Transformation Techniques

Our third experiment compared various query transformation methods, which modify the original query to improve retrieval performance.
Constant parameters:
  • Chunk Size: 4200
  • Metric Type: hybrid
  • Retrieval Method: basic
  • Post Retrieval: none
  • Top K: 5
  • Embedding Model: text-embedding-ada-002
  • Document Parser: Basic
Results:
| Query Transformation     | Accuracy (%) | Latency (s) | Cost ($) |
|--------------------------|--------------|-------------|----------|
| Basic                    | 82.6         | 1.02        | 0.00     |
| HyDE                     | 80.6         | 4.82        | 0.00     |
| Multi Phrasing           | 75.9         | 3.90        | 0.00     |
| Expansion                | 65.5         | 2.65        | 0.00     |
| Deconstruction           | 62.7         | 4.08        | 0.00     |
| Iterative Deconstruction | 37.7         | 12.75       | 0.13     |
Analysis: Surprisingly, the Basic approach delivered both the highest accuracy (82.6%) and lowest latency (1.02s). Advanced techniques like Hypothetical Document Embedding (HyDE) performed slightly worse while requiring significantly more processing time. The most complex technique, Iterative Deconstruction, performed poorly at 37.7% accuracy with latency exceeding 12 seconds and introducing measurable costs. This suggests that for this financial dataset, simpler query approaches are more effective.
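To show why HyDE adds latency, here is a sketch of its flow: instead of embedding the raw question, the system first generates a hypothetical answer passage and embeds that. The `generate` and `embed` callables below are stand-ins for an LLM call and an embedding model, not Queryloop's actual interfaces.

```python
def hyde_query(question, generate, embed):
    """HyDE: embed a hypothetical answer rather than the raw question.

    The extra LLM generation step is what drives the ~4x latency
    increase seen in the experiment.
    """
    hypothetical_doc = generate(f"Write a passage answering: {question}")
    return embed(hypothetical_doc)

# Toy stand-ins so the flow is runnable end to end.
fake_generate = lambda prompt: "Apple reported quarterly revenue of ..."
fake_embed = lambda text: [len(text), text.count(" ")]  # toy embedding

vec = hyde_query("What was AAPL's Q2 revenue?", fake_generate, fake_embed)
```

The retrieval index is then queried with `vec` exactly as it would be with a plain question embedding; the hope is that an answer-shaped text lands closer to answer-bearing chunks.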

Experiment 4: Optimizing Retrieval Method and Top K

Our final experiment explored different retrieval methods and the number of retrieved chunks (Top K).
Constant parameters:
  • Chunk Size: 4200
  • Metric Type: hybrid
  • Post Retrieval: none
  • Embedding Model: text-embedding-ada-002
  • Document Parser: Basic
  • Query Transformation: Basic
Results:
| Retrieval Method  | Top K | Accuracy (%) | Latency (s) |
|-------------------|-------|--------------|-------------|
| sentence-window   | 10    | 96.0         | 1.16        |
| basic             | 20    | 94.4         | 1.07        |
| context_retrieval | 20    | 91.4         | 0.99        |
| sentence-window   | 5     | 85.8         | 1.04        |
| basic             | 10    | 80.2         | 0.88        |
| context_retrieval | 10    | 76.2         | 0.89        |
| basic             | 5     | 74.1         | 1.24        |
| context_retrieval | 5     | 69.7         | 0.96        |
Analysis: The sentence-window retrieval method with Top K=10 achieved a remarkable 96% accuracy, outperforming every other configuration tested. Notably, our system could not run sentence-window with Top K=20, so that combination remains untested. The results show that increasing Top K generally improves performance, but the retrieval method has an even stronger impact. The sentence-window method, which returns not just matching sentences but also their surrounding context, proved particularly effective for financial document analysis.
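The core idea of sentence-window retrieval can be sketched in a few lines: the index matches a single sentence, but the context passed to the LLM includes its neighbors. The window size below is illustrative, not Queryloop's internal default.

```python
def sentence_window(sentences, hit_index, window=2):
    """Return the matched sentence plus `window` neighbors on each side.

    Matching stays precise (one sentence), while the LLM still sees
    surrounding context such as a table row's header or a footnote,
    which matters in dense 10-Q filings.
    """
    start = max(0, hit_index - window)
    end = min(len(sentences), hit_index + window + 1)
    return " ".join(sentences[start:end])

sents = ["S0.", "S1.", "S2.", "S3.", "S4.", "S5."]
print(sentence_window(sents, hit_index=3, window=1))  # S2. S3. S4.
```

This explains why sentence-window at Top K=10 can beat basic retrieval at Top K=20: each of the 10 hits arrives pre-packaged with its local context.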

The Optimal Configuration

Through systematic experimentation, Queryloop identified an optimal configuration that achieved 96% accuracy on complex financial questions:
  • Chunk Size: 4200
  • Metric Type: hybrid
  • Retrieval Method: sentence-window
  • Top K: 10
  • Embedding Model: text-embedding-ada-002
  • Query Transformation: Basic
This represents a significant improvement over many baseline configurations: more than 2.5x better than the worst-performing setup. Even compared to reasonably good starting configurations, we observed a 10-15% accuracy boost.
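The winning settings can be captured as a plain configuration object. Queryloop's own configuration format is not shown in this post, so the dictionary below is purely illustrative of how the tuned values fit together.

```python
# Illustrative config for the best-performing setup from our experiments.
optimal_rag_config = {
    "chunk_size": 4200,                          # tokens per chunk
    "metric_type": "hybrid",                     # dense + sparse scoring
    "retrieval_method": "sentence-window",       # hit + surrounding sentences
    "top_k": 10,                                 # chunks retrieved per query
    "embedding_model": "text-embedding-ada-002",
    "query_transformation": "basic",             # no query rewriting
}
```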

Conclusion: The Value of Automated Parameter Optimization

Our experiments demonstrate that RAG system performance depends critically on configuration parameters, with accuracy ranging from 37.7% to 96% across different setups. Traditional manual tuning would require tedious trial-and-error across hundreds of combinations.
Queryloop's automated parameter optimization eliminates this burden by systematically exploring the parameter space. For this financial document analysis use case, we discovered that:
  • Larger chunk sizes (4200 tokens) significantly outperform smaller ones
  • Hybrid distance metrics provide better results than standard cosine or euclidean metrics
  • Simple query transformation approaches performed better than complex ones
  • The sentence-window retrieval method with moderate Top K values (10) achieves optimal results
These insights would have taken weeks to discover manually, but Queryloop's automated experimentation identified them efficiently. For enterprises building RAG systems in complex domains like financial analysis, this automated approach enables the creation of high-performing applications in a fraction of the time typically required. (Part 2 of this blog will cover the generation module.)
Tags: RAG, Optimization, Automation, AI, Financial Documents, Embedding, Retrieval, Queryloop