Solving Complex Table Parsing in RAG Systems: A Comparative Analysis

Queryloop Team
May 1, 2025
3 min read

Discover how we compared 8 different parsing solutions to tackle hierarchical tables, merged cells, and horizontally tiled tables in PDFs for RAG applications.

RAG over PDFs containing complex tables has consistently posed a significant challenge. Developers experiment with various parsing solutions, but identifying the most effective parser for a specific use case remains tedious. Recently, we tackled a particularly challenging parsing scenario for one of our client's RAG use cases. The PDF presented several distinct parsing challenges:

Key Parsing Challenges:

  • Hierarchical Tables: Tables with nested headings and multiple levels of subheadings.
  • Merged Cells: Cells spanning multiple rows or columns. Accurately assigning each merged cell's value to every cell it covers was crucial (a minimal sketch of one approach appears after this list).
  • Horizontally Tiled Tables: Differentiating between separate tables placed side by side was essential to prevent treating horizontally aligned tables as a single, contiguous table.
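
To make the merged-cell problem concrete, here is a minimal sketch of one way to propagate merged-cell values after extraction. It assumes the raw table arrives as rows where a merged cell holds its value only at the top-left of the span and None elsewhere; this is an illustration, not the logic of any of the parsers we tested.

```python
# Minimal sketch: propagate merged-cell values, assuming a merged cell holds
# its value only at the top-left of its span and None elsewhere. Illustrative
# only; production code needs span metadata, since a blind fill can
# over-propagate into cells that are genuinely empty.
import pandas as pd

def propagate_merged_cells(rows: list[list]) -> pd.DataFrame:
    """Fill merged-cell gaps so every covered cell carries its own value."""
    df = pd.DataFrame(rows)
    df = df.ffill(axis=0)  # vertical merges: repeat values downward
    df = df.ffill(axis=1)  # horizontal merges: repeat headings rightward
    return df

# Hypothetical fragment: "Primary Residence" spans two rows,
# "Purchase" spans two columns.
raw = [
    ["Occupancy",         "Purchase", None,   "FICO"],
    ["Primary Residence", "LTV",      "CLTV", None],
    [None,                "80%",      "85%",  "680"],
]
print(propagate_merged_cells(raw))
```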

Here is an example of the complex data structure that we encountered.

[Figure: example of the complex table structure]
To determine the most suitable parsing solution, we used Queryloop's automation to experiment with the following parsers:
  • Basic Python parser
  • GPT-4o
  • Gemini 2.0 Flash
  • Claude 3.5
  • Claude 3.7
  • Unstructured.io
  • LlamaParse Premium mode
  • LlamaParse Layout Agent
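
For a sense of the harness shape, here is a rough sketch: a registry of interchangeable parser callables, each turning the same PDF into text for evaluation. Only the baseline pdfplumber parser is filled in; the LLM and vendor parsers are left as stubs since each SDK has its own calling conventions, and this is not Queryloop's internal implementation.

```python
# Sketch of a multi-parser comparison harness (not Queryloop's internal code).
import pdfplumber

def basic_python_parser(pdf_path: str) -> str:
    """Baseline: dump every extracted table as tab-separated text."""
    chunks = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                for row in table:
                    chunks.append("\t".join(cell or "" for cell in row))
                chunks.append("")  # blank line between tables
    return "\n".join(chunks)

PARSERS = {
    "basic_python": basic_python_parser,
    # "gpt_4o": ...,              # e.g. send rendered page images to the OpenAI API
    # "llamaparse_premium": ...,  # LlamaParse with premium mode
    # "unstructured": ...,        # unstructured.io's partition_pdf
}

def parse_with_all(pdf_path: str) -> dict[str, str]:
    """Run every registered parser on the same PDF."""
    return {name: parse(pdf_path) for name, parse in PARSERS.items()}
```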
Parsing quality was objectively assessed by passing the parsed context to Claude 3.7 and evaluating the model's ability to answer predefined questions correctly. We intentionally avoided purely visual assessment, as clarity in a human-readable format doesn't necessarily translate to effective comprehension by LLMs.

Test Questions:

  • Question 1: What is the minimum FICO score required for a $1.8 million loan?
    Expected answer: 680
  • Question 2: For a second home purchase transaction using full documentation, with a loan amount of $2.4 million and borrower FICO score of 730, what is the maximum allowable LTV?
    Expected answer: 75%
  • Question 3: What is the maximum allowable LTV for a $3.1 million cash-out loan on a primary residence using alternative documentation?
    Expected answer: Cash-out loans under these circumstances are not permitted.
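
As an illustration of this evaluation loop, here is a minimal sketch using the Anthropic SDK. The model id, the prompt wording, and the substring-based grading are assumptions made for clarity, not our exact grading logic.

```python
# Minimal sketch of the evaluation loop: Claude sees only the parsed context
# and a question, and we count correct answers. Model id, prompt wording, and
# substring grading are illustrative assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

QUESTIONS = [
    ("What is the minimum FICO score required for a $1.8 million loan?",
     "680"),
    ("For a second home purchase transaction using full documentation, with a "
     "loan amount of $2.4 million and borrower FICO score of 730, what is the "
     "maximum allowable LTV?",
     "75%"),
    ("What is the maximum allowable LTV for a $3.1 million cash-out loan on a "
     "primary residence using alternative documentation?",
     "not permitted"),  # key phrase from the expected answer
]

def score(parsed_context: str) -> int:
    """Return how many of the predefined questions Claude answers correctly."""
    correct = 0
    for question, expected in QUESTIONS:
        response = client.messages.create(
            model="claude-3-7-sonnet-latest",  # assumed model id
            max_tokens=256,
            messages=[{
                "role": "user",
                "content": (
                    "Answer strictly from the table data below.\n\n"
                    f"{parsed_context}\n\nQuestion: {question}"
                ),
            }],
        )
        answer = response.content[0].text
        correct += int(expected.lower() in answer.lower())
    return correct
```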

Results:

The output from the LlamaParse Layout Agent enabled Claude 3.7 to answer all three questions correctly. LlamaParse Premium mode and Unstructured.io followed closely, with Claude correctly answering 2 of the 3 questions from their output. Thus, within an hour, we were able to identify the optimal parser for our customer's use case.
[Figure: parser comparison results]

Claude's detailed responses for each of the above parsers are given here.

Tags: RAG · PDF Parsing · LLM · Data Extraction · Table Parsing · Claude · LlamaParse · Unstructured.io · Queryloop