Empirical Evaluation of PDF Parsing and Chunking for Financial Question Answering with RAG

🚀 Accepted at ICSE 2026 (SEIP track), Rio de Janeiro, Brazil.
📄 Paper: link to paper

This repository contains the code, data, and experiments for our systematic study of how PDF parsing and text chunking choices affect end‑to‑end Retrieval‑Augmented Generation (RAG) performance on financial documents. It also hosts TableQuest, a new table‑focused QA benchmark built from real-world SEC filings and earnings reports.

Overview

This work presents the first systematic investigation of how two key design choices in a Retrieval‑Augmented Generation (RAG) pipeline, PDF parsing and text chunking, affect end‑to‑end QA performance over financial documents, and provides actionable guidance for practitioners.

Key Contributions

QA‑Centered Evaluation: We frame PDF understanding as a question‑answering task to mirror real analytical workflows.
Benchmarks: We use two financial QA datasets: FinanceBench and TableQuest, our newly released table‑focused benchmark.
Component Analysis: We compare multiple open‑source PDF parsers and six common chunking strategies (with varied overlap), and study their interactions.
Practical Guidelines: Our results offer clear recommendations for building robust RAG systems on PDF corpora.

TableQuest: A Table‑Focused Financial QA Benchmark

TableQuest targets QA over tables in financial documents (10‑K, 10‑Q, earnings reports). Each sample includes the PDF page(s) with tables plus QA pairs, organized by difficulty.

Source PDFs originate from the FinanceBench dataset.

Difficulty Levels

The dataset is organized into three difficulty tiers based on the cognitive complexity required to answer the questions:

Tier	Definition	Cognitive steps required	Example
Easy (single-table extractive)	Answer is copied verbatim from one table—a single cell, header, or total.	1. Locate the table. 2. Read the target cell.	What is the total amount of future maturities of long-term debt for 2026?
Medium (single-table numerical)	Answer requires ≤ 2 arithmetic operations (add, subtract, ratio, % change, etc.) within one table.	1. Find the table. 2. Identify 2–3 cells. 3. Compute result.	What is the total of "Accruals not currently deductible" and "Pension costs" for the year 2022?
Hard (multi-table cross-table)	Answer requires combining data from ≥ 2 tables on the same page, often plus a small calculation or comparison.	1. Detect all tables. 2. Select related tables. 3. Align rows/columns. 4. Merge or compare values (and optionally compute).	Analyze the impact of special items on the operating income margin for both the "Safety and Industrial" and "Transportation and Electronics" segments. How do these adjustments affect the overall financial performance of 3M in Q2 2023?
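
The tier definitions above can be read as a small decision rule. The sketch below is only an illustration of that reading, not the script used to build TableQuest; the function name `classify_tier` and its two inputs are hypothetical. The table does not say how a single-table question needing more than two operations is tiered, so the sketch returns `None` for that case rather than guess.

```python
from typing import Optional


def classify_tier(num_tables: int, num_operations: int) -> Optional[str]:
    """Map a question's requirements to a tier name (hypothetical helper).

    num_tables: how many tables must be consulted to answer the question.
    num_operations: how many arithmetic operations the answer needs.
    """
    if num_tables < 1 or num_operations < 0:
        raise ValueError("need at least one table and a non-negative operation count")
    if num_tables >= 2:
        # Hard: combines two or more tables, with or without a calculation.
        return "hard"
    if num_operations == 0:
        # Easy: the answer is copied verbatim from one table.
        return "easy"
    if num_operations <= 2:
        # Medium: at most two arithmetic operations within one table.
        return "medium"
    # One table but more than two operations: not covered by the tier table.
    return None


print(classify_tier(1, 0))
print(classify_tier(1, 2))
print(classify_tier(2, 1))
```

Under this reading, the Medium example above (adding two cells from one table) needs one table and one operation.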

Directory Structure

├── tablequest_dataset/            # TableQuest dataset
│   ├── annotation/                # Annotation and verification scripts for QA validation
│   │   ├── prompts/               # Prompts for automatic annotation tasks
│   │   ├── verification_openai/   # OpenAI-based verification
│   │   ├── analyze_vote_results.py
│   │   ├── openai_voter.py
│   │   └── README.md
│   ├── metadata/                  # Page and table metadata, sampling information
│   ├── prompts/                   # LLM prompt templates for different difficulty levels
│   ├── qa_pairs/                  # JSON files with question-answer pairs
│   ├── sampled_pages_pdf/         # Sampled pages organized by difficulty (PDF files)
│   ├── scripts/                   # Data processing and QA generation scripts
│   └── stats/                     # Statistics and analysis scripts/notebooks
├── src/                           # Evaluation framework
│   ├── chunkers/                  # Text chunking implementations
│   ├── evaluation/                # Evaluation metrics and classes (retrieval and answer correctness)
│   ├── generators/                # Answer generation modules
│   ├── parsers/                   # PDF parsing implementations
│   ├── preprocessing/             # Data preprocessing utilities
│   ├── retrievers/                # Information retrieval methods (BM25, dense, hybrid, ColBERT)
│   ├── stats/                     # Stats utilities
│   └── test/                      # End-to-end testing and evaluation
├── requirements.txt               # Python dependencies
└── README.md                      # This file

Installation

# Clone the repository
git clone <repository-url>
cd finqa-tablequest-bench

# Create and activate Conda env
conda create -n finqa-tablequest python=3.11 -y
conda activate finqa-tablequest

# Install dependencies
pip install -r requirements.txt

Note: If you encounter issues with installation, you can comment out the 'Layout detection dependencies' section in requirements.txt, install the remaining packages, then uncomment and reinstall.

Usage

Core Pipeline Components

1. PDF Parsing

# Parse PDF documents into text
python src/parsers/text_parsers.py

2. Text Chunking

# Chunk parsed text using different strategies
python src/chunkers/chunkers.py
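
To make the idea of token chunking with overlap concrete, here is a minimal sketch. It is not the code in `src/chunkers/chunkers.py`: it assumes whitespace-separated words as tokens, and the function name `chunk_tokens` and its default sizes are made up for illustration.

```python
from typing import List


def chunk_tokens(text: str, chunk_size: int = 8, overlap: int = 2) -> List[str]:
    """Split text into fixed-size token windows that share `overlap` tokens.

    Tokens are whitespace-separated words (an assumption of this sketch;
    a real pipeline would more likely use a model tokenizer).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    tokens = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


for chunk in chunk_tokens("a b c d e f g h i j", chunk_size=4, overlap=1):
    print(chunk)
```

With `chunk_size=4` and `overlap=1`, each chunk repeats the last token of the previous one, so a value that straddles a chunk boundary still appears whole in at least one chunk.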

3. Information Retrieval & Evaluation

# End-to-end indexing, retrieval & evaluation pipeline
python src/test/retrieval_evaluation.py
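
This README does not list the retrieval metrics the script reports. As an illustration of what retrieval evaluation typically measures, the sketch below computes recall@k and reciprocal rank for one query. The function names and the chunk-ID representation are assumptions, not the repository's API.

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant chunk IDs found among the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical ranking for one query; IDs are placeholders.
ranking = ["c3", "c7", "c1", "c9"]
gold = {"c1", "c4"}
print(recall_at_k(ranking, gold, k=3))
print(reciprocal_rank(ranking, gold))
```

Averaging these per-query values over a dataset gives one score per parser and chunker combination, which makes the configurations directly comparable.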

4. Answer Generation

FinanceBench Dataset:

# Generate answers using Ollama models
python src/generators/ollama_answer_generation_financebench.py

# Generate answers using OpenAI models
python src/generators/openai_generation_financebench.py

TableQuest Dataset:

# Generate answers using Ollama models
python src/generators/ollama_answer_generation_tablequest.py

# Generate answers using OpenAI models
python src/generators/openai_generation_tablequest.py

5. Evaluation

# Answer correctness evaluation
python src/evaluation/llm_judge_evaluation_openai.py
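
An LLM-judge evaluation usually has two local steps around the model call: building a grading prompt and parsing the verdict. The sketch below shows only those two steps and makes no API call. The prompt wording, the CORRECT/INCORRECT protocol, and both function names are assumptions for illustration. They are not taken from `llm_judge_evaluation_openai.py`.

```python
from typing import Optional

# Hypothetical grading template; the repository's actual prompt may differ.
JUDGE_TEMPLATE = (
    "You are grading an answer to a question about a financial document.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Reply with CORRECT or INCORRECT on the first line."
)


def build_judge_prompt(question: str, reference: str, candidate: str) -> str:
    """Fill the grading template for one QA pair."""
    return JUDGE_TEMPLATE.format(
        question=question, reference=reference, candidate=candidate
    )


def parse_verdict(reply: str) -> Optional[bool]:
    """Return True/False for a clear verdict, None if the reply is unparseable."""
    stripped = reply.strip()
    if not stripped:
        return None
    first_line = stripped.splitlines()[0].strip().upper()
    # Check INCORRECT first: the word CORRECT is a substring of it.
    if first_line.startswith("INCORRECT"):
        return False
    if first_line.startswith("CORRECT"):
        return True
    return None


print(parse_verdict("CORRECT\nThe values match."))
print(parse_verdict("incorrect: wrong fiscal year"))
```

Returning `None` for unparseable replies keeps them out of the accuracy count, so they can be re-queried or inspected by hand rather than silently scored as wrong.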

Supported Tools

Parsers: PyPDF2, PyMuPDF, pdfplumber, pypdfium2, docling, unstructured
Chunkers: token, sentence, semantic, recursive, SDPM, neural
Retrievers: BM25, dense embeddings, ColBERT, SPLADE
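
To get a sense of how many pipeline configurations these tools span, the sketch below enumerates every parser, chunker, and retriever combination listed above. It only illustrates the size of the design space. Whether the study runs every combination is not stated here, and the function name `pipeline_grid` and the dictionary layout are hypothetical.

```python
from itertools import product
from typing import Dict, List, Sequence

# Tool names copied from the lists above.
PARSERS = ["PyPDF2", "PyMuPDF", "pdfplumber", "pypdfium2", "docling", "unstructured"]
CHUNKERS = ["token", "sentence", "semantic", "recursive", "SDPM", "neural"]
RETRIEVERS = ["BM25", "dense embeddings", "ColBERT", "SPLADE"]


def pipeline_grid(
    parsers: Sequence[str], chunkers: Sequence[str], retrievers: Sequence[str]
) -> List[Dict[str, str]]:
    """Return one configuration dict per (parser, chunker, retriever) triple."""
    return [
        {"parser": p, "chunker": c, "retriever": r}
        for p, c, r in product(parsers, chunkers, retrievers)
    ]


grid = pipeline_grid(PARSERS, CHUNKERS, RETRIEVERS)
print(len(grid))  # 6 parsers x 6 chunkers x 4 retrievers = 144
print(grid[0])
```

Varying chunk overlap as well would multiply this count by the number of overlap settings tried.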
