Solving Complex Table Parsing in RAG Systems: A Comparative Analysis

Queryloop Team
May 1, 2025
3 min read

Discover how we compared 8 different parsing solutions to tackle hierarchical tables, merged cells, and horizontally tiled tables in PDFs for RAG applications.

RAG over PDFs containing complex tables has consistently posed a significant challenge. Developers experiment with various parsing solutions, but identifying the most effective parser for a specific use case remains tedious. Recently, we tackled a particularly challenging parsing scenario for one of our client's RAG use cases. The PDF presented several distinct parsing challenges:

Key Parsing Challenges:

  • Hierarchical Tables: Tables with nested headings and multiple levels of subheadings.
  • Merged Cells: Cells spanning multiple rows or columns. Accurately assigning each merged cell's value to every cell it covers was crucial (a minimal sketch of one approach appears after this list).
  • Horizontally Tiled Tables: Differentiating between separate tables placed side by side was essential to prevent treating horizontally aligned tables as a single, contiguous table.
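
To make the merged-cell problem concrete, here is a minimal sketch of one way to propagate merged-cell values after extraction. It assumes the raw table arrives as rows where a merged cell holds its value only at the top-left of the span and None elsewhere; this is an illustration, not the logic of any of the parsers we tested.

```python
# Minimal sketch: propagate merged-cell values, assuming a merged cell holds
# its value only at the top-left of its span and None elsewhere. Illustrative
# only; production code needs span metadata, since a blind fill can
# over-propagate into cells that are genuinely empty.
import pandas as pd

def propagate_merged_cells(rows: list[list]) -> pd.DataFrame:
    """Fill merged-cell gaps so every covered cell carries its own value."""
    df = pd.DataFrame(rows)
    df = df.ffill(axis=0)  # vertical merges: repeat values downward
    df = df.ffill(axis=1)  # horizontal merges: repeat headings rightward
    return df

# Hypothetical fragment: "Primary Residence" spans two rows,
# "Purchase" spans two columns.
raw = [
    ["Occupancy",         "Purchase", None,   "FICO"],
    ["Primary Residence", "LTV",      "CLTV", None],
    [None,                "80%",      "85%",  "680"],
]
print(propagate_merged_cells(raw))
```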

Here is an example of the complex data structure that we encountered.

[Figure: example of the complex table structure]
To determine the most suitable parsing solution, we used Queryloop's automation to experiment with the following parsers:
  • Basic Python parser
  • GPT-4o
  • Gemini 2.0 Flash
  • Claude 3.5
  • Claude 3.7
  • Unstructured.io
  • LlamaParse Premium mode
  • LlamaParse Layout Agent
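
For a sense of the harness shape, here is a rough sketch: a registry of interchangeable parser callables, each turning the same PDF into text for evaluation. Only the baseline pdfplumber parser is filled in; the LLM and vendor parsers are left as stubs since each SDK has its own calling conventions, and this is not Queryloop's internal implementation.

```python
# Sketch of a multi-parser comparison harness (not Queryloop's internal code).
import pdfplumber

def basic_python_parser(pdf_path: str) -> str:
    """Baseline: dump every extracted table as tab-separated text."""
    chunks = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                for row in table:
                    chunks.append("\t".join(cell or "" for cell in row))
                chunks.append("")  # blank line between tables
    return "\n".join(chunks)

PARSERS = {
    "basic_python": basic_python_parser,
    # "gpt_4o": ...,              # e.g. send rendered page images to the OpenAI API
    # "llamaparse_premium": ...,  # LlamaParse with premium mode
    # "unstructured": ...,        # unstructured.io's partition_pdf
}

def parse_with_all(pdf_path: str) -> dict[str, str]:
    """Run every registered parser on the same PDF."""
    return {name: parse(pdf_path) for name, parse in PARSERS.items()}
```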
Parsing quality was objectively assessed by passing the parsed context to Claude 3.7 and evaluating the model's ability to answer predefined questions correctly. We intentionally avoided purely visual assessment, as clarity in a human-readable format doesn't necessarily translate to effective comprehension by LLMs.

Test Questions:

  • Question 1: What is the minimum FICO score required for a $1.8 million loan?
    Expected answer: 680
  • Question 2: For a second home purchase transaction using full documentation, with a loan amount of $2.4 million and borrower FICO score of 730, what is the maximum allowable LTV?
    Expected answer: 75%
  • Question 3: What is the maximum allowable LTV for a $3.1 million cash-out loan on a primary residence using alternative documentation?
    Expected answer: Cash-out loans under these circumstances are not permitted.
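
As an illustration of this evaluation loop, here is a minimal sketch using the Anthropic SDK. The model id, the prompt wording, and the substring-based grading are assumptions made for clarity, not our exact grading logic.

```python
# Minimal sketch of the evaluation loop: Claude sees only the parsed context
# and a question, and we count correct answers. Model id, prompt wording, and
# substring grading are illustrative assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

QUESTIONS = [
    ("What is the minimum FICO score required for a $1.8 million loan?",
     "680"),
    ("For a second home purchase transaction using full documentation, with a "
     "loan amount of $2.4 million and borrower FICO score of 730, what is the "
     "maximum allowable LTV?",
     "75%"),
    ("What is the maximum allowable LTV for a $3.1 million cash-out loan on a "
     "primary residence using alternative documentation?",
     "not permitted"),  # key phrase from the expected answer
]

def score(parsed_context: str) -> int:
    """Return how many of the predefined questions Claude answers correctly."""
    correct = 0
    for question, expected in QUESTIONS:
        response = client.messages.create(
            model="claude-3-7-sonnet-latest",  # assumed model id
            max_tokens=256,
            messages=[{
                "role": "user",
                "content": (
                    "Answer strictly from the table data below.\n\n"
                    f"{parsed_context}\n\nQuestion: {question}"
                ),
            }],
        )
        answer = response.content[0].text
        correct += int(expected.lower() in answer.lower())
    return correct
```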

Results:

The output from the LlamaParse Layout Agent enabled Claude 3.7 to answer all three questions correctly. LlamaParse Premium mode and Unstructured.io followed closely, with Claude correctly answering 2 of the 3 questions from their output. Thus, within an hour, we were able to identify the optimal parser for our customer's use case.
[Figure: parser comparison results]

Claude's detailed responses for each of the above parsers are given here.

Tags: RAG · PDF Parsing · LLM · Data Extraction · Table Parsing · Claude · LlamaParse · Unstructured.io · Queryloop