How Intelligent Chunking Strategies Makes LLMs Better at Extracting Long Documents
https://unstract.com/blog/how-intelligent-chunking-strategies-makes-llms-better-at-extracting-long-documents/
Mon, 16 Mar 2026 09:42:32 +0000

Learn how intelligent chunking and retrieval strategies improve LLM accuracy on long documents, and how Unstract implements them in practice.

The post How Intelligent Chunking Strategies Makes LLMs Better at Extracting Long Documents first appeared on Unstract.com.


Chunking Strategy: An Introduction

Large language models (LLMs) have proven remarkably capable at understanding and synthesizing information. However, when presented with long, dense documents (contracts spanning dozens of pages, financial statements with complex tables, dense policy documents with interconnected rules), their performance degrades. The model may hallucinate (confidently produce false or factually incorrect information), miss details, or simply hit token limits.

The problem isn’t the model itself. It’s what precedes the model: how the document is broken into chunks before being indexed and retrieved. This is where intelligent chunking and purpose-fit retrieval strategies become essential.

In this article, we’ll look at what chunking is, why your specific strategy makes or breaks your retrieval quality, and the different retrieval strategies that Unstract offers to help improve document extraction. Finally, we will look at how to implement these strategies within Unstract to build reliable extraction pipelines.

What is Document Chunking?

Chunking is the process of breaking down large documents into smaller, semantically meaningful pieces called chunks. This is a crucial first step in preparing data for use with LLMs, enabling them to process documents more accurately and efficiently. 

Chunking is an important factor for any Retrieval-Augmented Generation (RAG) system. How documents are split affects your system’s ability to find relevant information and provide accurate answers.

TL;DR

If you want to skip directly to the retrieval strategies and see how Unstract’s Prompt Studio supports different retrieval modes for long, complex documents, click here.

Why Chunking Strategy Matters

The reason for chunking in LLMs is the limited context window, which means models can only focus on a certain amount of text at once. If there is too much text within the context window, LLMs suffer from the lost-in-the-middle problem, resulting in incomplete or inaccurate answers.

Bad chunking = Bad retrieval = Bad answers

Chunking solves this by creating smaller, focused pieces of content that allow the retriever to surface only the most relevant context for a given query. Chunk size, overlap, and semantic boundaries all influence embedding quality and retrieval precision.
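As a concrete illustration, here is a minimal fixed-size chunker with overlap, written in plain Python. This is a sketch of the general technique, not Unstract's implementation, and the default sizes are arbitrary:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the final chunk reached the end of the text
    return chunks
```

Because each chunk repeats the tail of its predecessor, a sentence cut at a boundary still appears whole in at least one chunk.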

In practice, naive chunking introduces a few recurring failures:

  • Broken Logic and Clauses: Critical clauses, steps, or conditions get split across chunks, so the retriever surfaces only half the reasoning. The LLM then answers from partial context and produces logically incorrect or incomplete outputs.
  • Tables and Numbers Stripped of Meaning: When tables are split into header-only and row-only chunks, numeric values lose their labels. Retrieval might find the right numbers but not the corresponding field names, leading to misinterpreted totals, dates, or risk scores.
  • Oversized, Mixed-Topic Chunks: Very large chunks often contain multiple unrelated sections. This increases retrieval noise and raises the risk of hallucinations, as the model is forced to reason over loosely related content.

Published evaluations in clinical RAG systems show that changing the chunking method can significantly alter retrieval performance. As a result, chunking decisions have a significant downstream impact on the overall behavior of a RAG system.

Best Practices for Intelligent Chunking

Regardless of the strategy you choose, the quality of your chunks defines the limit of your system’s accuracy.

Below are the core principles for effective chunking in RAG systems.

  1. Maintain Logical Boundaries: Chunk boundaries should follow the document’s natural structure. Split at meaningful points such as headings, paragraphs, clauses, or table ends.
  2. Optimize Chunk Size: Chunk size should reflect the length and density of the source material. Shorter documents, such as invoices and forms, benefit from smaller, more focused chunks, while longer or more complex materials may require larger segments to maintain coherence.
  3. Use Overlaps Wisely: A small amount of overlap helps maintain continuity between chunks, especially when ideas span multiple sections. Excessive overlap increases token usage and can bias retrieval toward repeated text. Keep the overlap large enough to preserve context without creating unnecessary repetition.
  4. Preserve Structural Elements: Tables, lists, and headers often carry meaning as a whole. Maintain row–column relationships in tables by chunking them as structured blocks. Structural fidelity improves both retrieval relevance and downstream reasoning.
  5. Adapt Chunking to Query Type: Different question types benefit from different chunking strategies. Match chunk size to the type of questions users are likely to ask.
  • Quick lookup or fact-based queries work best with smaller, precise chunks.
  • Analytical or summarization queries benefit from larger, more comprehensive segments.

This ensures the retriever surfaces the appropriate granularity of information for each query style.

  6. Monitor Token Limits: Each model has a finite context window, so the retrieval layer must operate within it. Efficient chunking ensures that each chunk fits well within the model’s context limit, preventing overflows. Smaller, well-structured chunks also reduce the number of tokens processed per query, lowering inference cost and latency while improving output stability.
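The "logical boundaries" principle above can be sketched with a paragraph-aware chunker that packs whole paragraphs into chunks instead of cutting at a fixed offset. This is illustrative only; a production system would also account for headings and tables:

```python
def chunk_by_paragraphs(text: str, max_chars: int = 500) -> list[str]:
    """Pack whole paragraphs into chunks, never splitting mid-paragraph."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

A paragraph longer than `max_chars` still becomes its own chunk here, which is usually preferable to splitting it mid-thought.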

Retrieval Strategies You Can Use

Once documents are chunked and embedded, the next step is choosing how to retrieve the right chunks when a query comes in. Chunking determines how information is broken down and indexed, while retrieval controls which of those pieces are actually surfaced at query time.

This is why retrieval strategy matters: it directly affects accuracy, especially when dealing with complex or domain-heavy documents. To address this, Unstract provides several retrieval modes inside Prompt Studio, each suited to different document types and query patterns. Below is a high-level overview of the strategies and how they work in practice.

Strategy 1: Simple Vector Retrieval (Baseline)

This is the standard approach used in most basic retrieval-based systems. The document is split into chunks, each chunk is embedded, and the system retrieves the Top-K most semantically similar chunks using cosine similarity.

How it Works: The system embeds the incoming query and computes cosine similarity between the query vector and every stored chunk vector. The top-K most similar chunks are selected as the retrieval result. Since this method depends entirely on vector proximity, chunk size and overlap play a significant role; well-structured chunks that preserve boundaries typically produce cleaner embeddings and more accurate rankings.

Cosine similarity measures how closely two vectors align in embedding space, and it is computed as:

cos(Q, D) = (Q · D) / (‖Q‖ ‖D‖)

Here, Q represents the query embedding, and D represents a document chunk embedding. Higher scores indicate stronger semantic alignment between the query and the retrieved chunk.
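Conceptually, the whole retrieval step fits in a few lines. The sketch below computes cosine similarity via the dot-product formula and ranks a toy set of chunk embeddings; a real system delegates this to a vector database:

```python
import math

def cosine_similarity(q: list[float], d: list[float]) -> float:
    """cos(Q, D) = (Q . D) / (||Q|| * ||D||)."""
    dot = sum(a * b for a, b in zip(q, d))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_d = math.sqrt(sum(b * b for b in d))
    return dot / (norm_q * norm_d)

def top_k(query: list[float], chunks: dict[str, list[float]], k: int) -> list[str]:
    """Rank chunk embeddings by similarity to the query and keep the best k."""
    ranked = sorted(chunks,
                    key=lambda name: cosine_similarity(query, chunks[name]),
                    reverse=True)
    return ranked[:k]
```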

Implementation: In Prompt Studio, Simple Vector Retrieval operates as the default retrieval mode. This retrieval uses cosine similarity over embedded chunks to find the most relevant context efficiently.

When to Use: Simple vector retrieval is efficient for short, direct queries where the answer is explicitly stated in a single section of text. It also serves as a dependable baseline to build more advanced retrieval strategies on top of.

Limitations: This strategy can miss deeper context because each chunk is treated independently. It is sensitive to chunking choices and may struggle with multi-part or analytical queries where relevant information is distributed across the document.

Strategy 2: Sub-Question Retrieval

Sub-Question retrieval decomposes a complex, multi-part query into smaller, distinct sub-questions and executes a separate retrieval for each one. The system then aggregates the results from these individual searches to synthesize a single, comprehensive answer.

How it Works: The system breaks the complex query into smaller sub-questions. For example, if the user asks, “How does the revenue of Company A compare to Company B?”, the system may generate two lookups: “Revenue of Company A” and “Revenue of Company B,” retrieve context for both, and synthesize the answer.
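The decompose-retrieve-synthesize flow can be sketched with a toy retriever. The corpus, the pre-split sub-questions, and the join-based "synthesis" below are stand-ins for what an LLM and a vector index would do in a real pipeline:

```python
# Toy corpus keyed by sub-question; a real system generates the sub-questions
# with an LLM and retrieves each one from a vector index.
CORPUS = {
    "Revenue of Company A": ["Company A revenue: $10M"],
    "Revenue of Company B": ["Company B revenue: $8M"],
}

def retrieve(sub_question: str) -> list[str]:
    return CORPUS.get(sub_question, [])

def answer_multi_part(sub_questions: list[str]) -> str:
    # One retrieval per sub-question, then a naive synthesis step.
    contexts = {sq: retrieve(sq) for sq in sub_questions}
    return " | ".join(chunk for ctx in contexts.values() for chunk in ctx)
```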

Sub-Question Retrieval Pipeline

Implementation: Unstract uses LlamaIndex’s SubQuestionQueryEngine for this strategy, which orchestrates decomposition and then synthesizes comprehensive answers.

When to Use: This approach is essential for multi-hop questions, financial comparisons, and compliance summaries, where an answer depends on multiple variables.

Limitations: Sub-Question Retrieval can suffer from a lost-in-retrieval problem when individual sub-questions retrieve incomplete context. The final response may miss important connections or produce an imbalanced result. This approach also increases computational cost due to multiple retrieval and synthesis steps.

Strategy 3: Fusion Retrieval (RAG Fusion)

Fusion retrieval generates multiple variations of the user’s query and combines the results using the Reciprocal Rank Fusion (RRF) algorithm, ensuring the most pertinent information is prioritized.

How it Works: 

Conceptually, RAG Fusion works in three main steps:

  1. Generate Multiple Related Queries: Use an LLM to derive several variations of the user’s question (synonyms, narrower or broader phrasing) or potential follow-up questions the user may not have considered.
  2. Run Multiple Retrievals: For each query variant, it performs a vector search and retrieves multiple ranked lists of candidate chunks.
  3. Fuse Results with RRF: It assigns a score to each chunk based on its rank in each list. RRF combines search results from multiple queries and gives higher weight to items that appear at the top of any ranked list. 

RRFscore(d) = Σᵢ 1 / (k + rankᵢ(d))

where k is a small constant (commonly 60) and rankᵢ(d) is the rank of document d in list i.
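A sketch of RRF itself, assuming each input list is already ordered from most to least relevant:

```python
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Score each document as sum(1 / (k + rank)) across every list it appears in."""
    scores: dict[str, float] = {}
    for results in ranked_lists:
        for rank, doc in enumerate(results, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Documents ranked highly in several lists accumulate the largest scores.
    return sorted(scores, key=scores.get, reverse=True)
```

The constant k dampens the influence of any single top rank, which is what lets consistent mid-rank documents outrank one-off top hits.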

RAG Fusion Architecture

Implementation: In Unstract, Fusion Retrieval generates 3-5 query variations using an LLM, then merges and ranks the results using RRF scoring. Users can also implement it with LangChain RAG Fusion pipelines. 

When to Use: This retrieval is ideal when you have semantically uniform text but ambiguous user queries. It prioritizes recall (finding relevant answers) over speed, as it requires multiple vector searches.

Limitations: Fusion Retrieval is computationally expensive due to multiple retrieval passes and LLM calls. It can produce off-topic answers when the generated queries are not sufficiently relevant to the original query, which may reduce precision.

Strategy 4: Recursive Retrieval

Recursive retrieval is optimized for documents with hierarchical or interlinked structures. Instead of stopping at the initial set of retrieved chunks, the system explores related sections to build a more connected understanding of the content.

How it Works: It starts with an initial semantic match and then follows explicit document relationships (e.g., parent/child, references) or metadata links to gather related nodes. The retriever builds an expanded context graph by traversing connected nodes up to a configured depth.
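The traversal amounts to a bounded breadth-first search over document links. A minimal sketch, where `links` is a hypothetical adjacency map of chunk IDs:

```python
def recursive_retrieve(seeds: list[str], links: dict[str, list[str]],
                       max_depth: int = 2) -> set[str]:
    """Expand an initial semantic match by following document links breadth-first."""
    visited = set(seeds)
    frontier = list(seeds)
    for _ in range(max_depth):
        # Collect unvisited neighbors one hop further out.
        frontier = [n for node in frontier for n in links.get(node, [])
                    if n not in visited]
        visited.update(frontier)
    return visited
```

The `max_depth` bound is what keeps token usage in check: each extra level of traversal pulls in another ring of related chunks.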

Implementation: In Unstract, Recursive Retrieval uses LlamaIndex’s RecursiveRetriever, which traverses document relationships up to a configurable depth to assemble connected context.

When to Use: This strategy works best on hierarchical or highly interlinked sources, such as policies, technical standards, and research papers with citations, where meaningful answers require traversing document structure or reference chains.

Limitations: Recursive Retrieval significantly increases token usage and retrieval cost due to hierarchical traversal of related chunks. If some nodes are disconnected, this may reduce the ability to capture broader or long-range semantic connections.

Strategy 5: Router-Based Retrieval

A router is the component that decides which retriever or index to use to handle that request. Router-based retrieval uses an LLM as a classifier to analyze a query’s intent and dynamically route it to the most appropriate retrieval tool. 

How it Works: The router (e.g., an LLM) analyzes the intent, domain, and structure of the query, then directs it to the appropriate retrieval mechanism (semantic vector search, keyword lookup, summary index, or a specialized domain retriever). The final retrieved context is assembled from the chosen retriever. The system effectively becomes multi-modal within the retrieval layer.
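A toy router makes the idea concrete. The keyword heuristic below stands in for the LLM classifier an actual router would use, and the route names are illustrative:

```python
def route_query(query: str) -> str:
    """Pick a retrieval backend from query intent (toy stand-in for an LLM router)."""
    q = query.lower()
    if any(w in q for w in ("total", "amount", "invoice", "number")):
        return "keyword_table"   # exact-value lookup in structured fields
    if any(w in q for w in ("summarize", "compare", "overview")):
        return "summary_index"   # broad, analytical questions
    return "vector_search"       # default semantic retrieval
```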

Implementation: Router-Based Retrieval in Unstract uses LlamaIndex RouterQueryEngine, which analyzes query intent and dynamically selects the most appropriate retrieval strategy at runtime.

When to Use: This approach is used when your document repository spans multiple content types. For example, a workspace combining HR manuals, financial ledgers, and legal contracts benefits greatly from a router because no single retrieval strategy performs well across all domains.

Limitations: The central limitation of router-based systems is the “routing error”. Incorrect routing decisions can degrade retrieval quality and lead to suboptimal results.

Strategy 6: Keyword Table Retrieval

Keyword Table Retrieval targets structured documents by indexing table fields, labels, and metadata, enabling direct lookup of values through keyword or fuzzy matching instead of semantic similarity.

How it Works: During indexing, the system analyzes each document and extracts keywords, table headers, labels, and other structured fields. When a query comes in, the retriever identifies keywords from the question and performs exact or fuzzy matching against the indexed keyword table. Results may be ranked using a Term Frequency-Inverse Document Frequency (TF-IDF) scoring mechanism. 

TF gives higher weight to terms that are important in a specific document. IDF gives lower weight to terms that are overly common across the entire dataset. This allows the system to return highly targeted results with minimal noise.

The TF-IDF weight for a term t in document d over a corpus D is typically expressed as:

tfidf(t, d, D) = tf(t, d) × idf(t, D)

Here,

tf(t, d) is the normalized frequency of term t in document d.

idf(t, D) is the logarithmically scaled inverse document frequency, typically log(N / df(t)), where N is the number of documents in D and df(t) is the number of documents containing t.
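The formula translates directly into code. A minimal sketch over tokenized documents (real systems use optimized implementations such as scikit-learn's TfidfVectorizer):

```python
import math

def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    """tfidf(t, d, D) = tf(t, d) * idf(t, D) over tokenized documents."""
    tf = doc.count(term) / len(doc)                  # normalized term frequency
    df = sum(1 for d in corpus if term in d)         # document frequency
    idf = math.log(len(corpus) / df) if df else 0.0  # inverse document frequency
    return tf * idf
```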

Implementation: In Unstract, Keyword Table Retrieval uses LlamaIndex SimpleKeywordTableIndex, and ranks matches using TF-IDF scoring.

When to Use: This strategy works best for highly structured documents such as invoices, forms, financial statements, and spreadsheets, where labels and field names are consistent, and the user often seeks a specific value.

Limitations: This approach depends on exact or near-exact keyword matches. It performs poorly on unstructured text and does not capture semantic meaning, and can miss results when labels and terms are inconsistent.

Strategy 7: Auto-Merging Retrieval

Auto-Merging Retrieval expands context when relevant information spans across adjacent or semantically related chunks and automatically merges them into a single context window for improved continuity.

How it Works: When the base retrieval identifies relevant leaf chunks, the system checks neighboring or parent chunks to determine whether additional context should be folded in. If the content is connected, the chunks are automatically merged into a larger segment before being passed to the LLM.
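The merge decision can be sketched as a coverage test: if enough of a parent's children were retrieved, return the parent instead. The `parent_of`/`children_of` maps and the threshold below are illustrative, not Unstract's actual parameters:

```python
def auto_merge(leaves: list[str], parent_of: dict[str, str],
               children_of: dict[str, list[str]], threshold: float) -> list[str]:
    """Replace retrieved leaf chunks with their parent once enough siblings match."""
    hits, merged, consumed = set(leaves), [], set()
    for leaf in leaves:
        if leaf in consumed:
            continue
        parent = parent_of.get(leaf)
        siblings = children_of.get(parent, [])
        # Merge up when the retrieved fraction of siblings clears the threshold.
        if siblings and len(hits & set(siblings)) / len(siblings) >= threshold:
            merged.append(parent)
            consumed.update(siblings)
        else:
            merged.append(leaf)
            consumed.add(leaf)
    return merged
```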

Implementation: In Unstract Prompt Studio, Auto-Merging Retrieval uses LlamaIndex AutoMergingRetriever to merge leaf nodes with relevant parent nodes. 

When to Use: This strategy is especially useful for long-form narrative documents, legal clauses that flow across paragraphs, or any context where breaking content into strict chunk boundaries risks losing meaning.

Limitations: Auto-merging uses threshold parameters that determine the length of retrieved context from child nodes. Increasing the merge threshold adds more surrounding context but can quickly consume tokens, while lower thresholds may trim important context and reduce answer quality.

Deep Dive: Retrieval Strategies in Real Documents

The differences between retrieval strategies become obvious when you test them against real documents. To make this concrete, let’s look at two scenarios of how these strategies impact results in Unstract Prompt Studio, showing how strategy choice changes the quality of extracted outputs, particularly when documents contain a mix of tables, multi-column layouts, and fragmented fields.

Unstract’s Prompt Studio allows users to create custom prompts and select retrieval strategies through configurable LLM Profiles, allowing the extraction of specific fields from complex documents.

Use Case 1: Financial Report with Keyword Table vs Recursive Retrieval

In this use case, we work with a financial report that consists of structured tables and narrative sections. The goal is to evaluate how Keyword Table Retrieval and Recursive Retrieval behave for different types of financial queries.

Download Financial Report →

Task A: Extract a field from a Cash Flow Statement

Extract the “Depreciation, amortisation and impairment” from a Consolidated Cash Flow Statement

Result:

Cash-flow statement extraction

Keyword Table Retrieval performs better for this task because the target value appears as a labeled line item within a structured financial table. The keyword-based lookup is able to directly associate the label with its corresponding numeric value. Recursive Retrieval, on the other hand, does not consistently surface the exact row, as it is optimized for navigating related sections rather than precise table lookups.

Task B: Compare Unilever Group financial performance: 2021 vs 2022

Prompt:

Compare the performance in 2021 with 2022 in terms of the financial Unilever group performance.

Result:

Extracting Multi-section financial comparison

For this analytical, multi-section question, Recursive Retrieval performs better. The comparison requires context from multiple sections of the report rather than a single table row. Recursive Retrieval is able to traverse related sections and aggregate relevant information, producing a more coherent and contextual response than Keyword Table Retrieval.

Use Case 2: Safety Certificate with Auto-Merging vs Simple Retrieval

This use case focuses on a long safety certificate document, which includes well-defined sections and multiple related fields. We compare Simple Vector Retrieval and Auto-Merging Retrieval to evaluate how each strategy handles focused extraction in long-form documents.

Download document →

Task A: Extract the “System Description” section 

Prompt:

Extract the “system description” section of this FOAM system certificate

Result:

Extracting a “System Description” section

The results show that Simple Vector Retrieval is sufficient for this task. The section is clearly bounded and contained within a single part of the document, making additional merging unnecessary. Using a more complex retrieval strategy here adds overhead without improving results.

Task B: Extract multi-field certificate details

Prompt:

Extract the below things in JSON format:
- Product description
- Terms of validity
- Test carried out
- Fire test report
- Order no of the high expansion foam system
- Hull no of the high expansion foam system
- Certificate number

Result:

Multi-field certificate extraction

Auto-Merging Retrieval performs better in this scenario. The required fields are distributed across adjacent sections and paragraphs. Auto-merging combines related chunks before passing context to the LLM, resulting in more complete and reliable extraction compared to Simple Retrieval.

Choosing the Right Retrieval Strategy for Your Use Case

The optimal choice of retrieval strategy depends on document structure, query complexity, data format, and the level of reasoning required. To help you select the right configuration in Unstract, here is a comparison of how these strategies stack up:

Strategy | Best For | Document Types
Simple Vector Retrieval | General Q&A, direct queries | Policies, SOPs, FAQs, shorter contracts
Fusion Retrieval | Robust search across varied phrasings | Interlinked docs (contracts, policies)
Sub-Question Retrieval | Multi-part analytical questions | Multi-section docs (agreements, financial reports)
Recursive Retrieval | Multi-hop, cross-referenced content | Academic papers, policies, regulations
Router-Based Retrieval | Mixed-domain repositories | Mixed (heterogeneous) repositories
Keyword Table Retrieval | Exact term matching or specific lookups | Invoices, forms, spreadsheets
Auto-Merging Retrieval | Narrative flow, context continuity | Long-form narratives (certificates, long clauses)

How Unstract Implements These Retrieval Strategies

Unstract is an open-source, no-code LLM platform for turning long, complex documents into structured, machine-readable data. It combines layout-preserving OCR, configurable chunking, and vector search to build RAG-style extraction workflows that can be deployed as APIs or ETL pipelines. It enables teams to automate document-heavy processes with high accuracy, speed, and adaptability.

Prompt Studio for Designing Extraction Workflows

Within the Unstract platform, Prompt Studio is the central workspace where teams can design custom extraction schemas with field-level rules. This ensures deterministic, structured outputs across various formats.

To get started, create a new project in Prompt Studio. This project will be a dedicated workspace for designing prompts tailored to your specific data extraction needs.

Creating and Configuring an LLM Profile:

Within Prompt Studio, each extraction project is backed by an LLM Profile, which defines how the system will process, chunk, embed, retrieve, and interpret document data. It also enables users to configure their own custom LLM profiles with preferred components.

LLM Profile generally includes the following pre-configured components:

  • LLM Selection: Specifies the model used to process prompts and generate responses.
  • Vector Database: Stores the embeddings and serves as the backend for similarity-based retrieval.
  • Embedding Model: Enables semantic representation of text for context-aware retrieval.
  • Text Extractor: Handles OCR and content extraction from uploaded documents (e.g., LLMWhisperer).
  • Chunk Size: Determines how the document is segmented into smaller units before embedding. 
  • Overlap: Defines how many tokens or characters are shared between adjacent chunks to preserve continuity when chunking is enabled.
  • Retrieval Strategy: Selects the algorithm used to retrieve relevant context (e.g., vector, fusion, keyword-table, auto-merge, sub-question, router-based).
  • Matching Count Limit (Similarity Top-K): Specifies how many retrieved chunks are passed to the LLM as input context (e.g., if Top-K = 3, the top three most similar chunks are included in the final prompt sent to the model.)
LLM Profile settings
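For orientation, the components above can be pictured as a single configuration object. The keys and values below are illustrative stand-ins, not Unstract's actual profile schema, and the component choices are examples only:

```python
# Hypothetical LLM Profile; key names mirror the settings listed above.
llm_profile = {
    "llm": "gpt-4o",                               # model that answers prompts
    "embedding_model": "text-embedding-3-large",   # semantic representation
    "vector_database": "qdrant",                   # similarity-search backend
    "text_extractor": "llmwhisperer",              # OCR / content extraction
    "chunk_size": 1024,                            # tokens per chunk
    "chunk_overlap": 128,                          # shared tokens between chunks
    "retrieval_strategy": "subquestion",           # one of the modes above
    "similarity_top_k": 3,                         # chunks passed to the LLM
}
```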

Choosing and Combining Retrieval Strategies:

Unstract allows you to choose the retrieval strategy that best matches your document type and extraction goal, directly within the LLM Profile’s advanced settings. 

Retrieval Strategies Selection

Creating the Bill of Lading Extraction Project

For this demo, we extract structured shipment details from a sample Ocean Bill of Lading, a document type known for dense, multi-column layouts with shipping, financial, and cargo details scattered across sections.

Download document sample →

  1. Create a Prompt Studio Project
  • Navigate to Prompt Studio
  • Click New Project and name it “Bill of Lading Extraction”
  • Upload the Ocean Bill of Lading PDF under Manage Documents
  • Define custom prompts for extracting critical information
  2. Select Retrieval and Chunking Strategy

For the retrieval strategy, we use keyword-table retrieval, which works well for structured documents with clearly labeled fields such as Bill of Lading Number or Booking Number.

Chunk size and overlap are both set to 0. Since the document follows a single-page, form-style layout, splitting it into multiple chunks provides no benefit and increases the risk of separating labels from their corresponding values.

Below is the LLM Profile configuration:

LLM Profile configuration

3. Testing Retrieval and Extraction

  • In Prompt Studio, click Add Prompts and define your extraction fields to identify any shipment details.
Extract the following fields from the document:
- Shipper name and address
- Consignee name and address
- Notify Party name and address
- Bill of Lading Number
- Booking Number
- Export References
Extracting Shipment Details from Ocean Bill of Lading

4. Deploying as APIs or Workflows

  • After validating the extraction performance, Unstract allows you to deploy this workflow as a secure API, making it easy to integrate with Enterprise Resource Planning (ERP) systems, shipping management tools, or logistics platforms.
Deploying and Testing the API in Postman

Finally, retrieve the JSON response; the API returns the structured shipment data as clean JSON output.


{
  "status": "COMPLETED",
  "message": [
    {
      "file": "ocean_bill_of_lading.pdf",
      "file_execution_id": "9e7703e0-e61c-4c7d-93c2-aef9ff6aba6c",
      "status": "Success",
      "result": {
        "output": {
          "shipper_exporter": {
            "name": "Rubber Mart Exports",
            "address": "20 Cooper Square, New York, NY 10003, USA"
          },
          "consignee": {
            "name": "Floral Street",
            "address": "40-44 Floral Street Covent Garden, London WC2E 9TB"
          },
          "notify_party": {
            "name": "John Smith",
            "address": "40-44 Floral Street Covent Garden London WC2E 9TB"
          },
          "bill_of_lading_details": {
            "bill_of_lading_number": "99504",
            "booking_number": "758993033",
            "export_references": "845993"
          }
        }
      },
      "error": null,
      "metadata": {
        "source_name": "ocean_bill_of_lading.pdf",
        "organization_id": "org_IhPvmnIoe0yYmbzL",
        "workflow_id": "bill_of_lading_1765616056168",
        "execution_id": "14472be2-882f-465f-bd7a-2211a16b5ea",
        "file_execution_id": "9e7703e0-e61c-4c7d-93c2-aef9ff6aba6c",
        "total_elapsed_time": 18.4,
        "tool_metadata": [
          {
            "tool_name": "structure_tool",
            "output_type": "JSON"
          }
        ]
      }
    }
  ]
}

You now have a production-ready API that applies configurable chunking and retrieval strategies to extract structured responses from complex documents.

Conclusion

Handling long, complex documents with LLMs depends heavily on how the data is prepared before retrieval. Intelligent chunking ensures that semantic boundaries are preserved, tables and clauses remain coherent, and multi-step logic is retained. Choosing the right retrieval strategy determines whether the model can access the right context for accurate, complete answers.

Unstract brings these capabilities together in a single platform. With configurable chunking parameters, overlap control, LLMs, and multiple retrieval strategies inside Prompt Studio, teams can adapt their extraction pipelines to any document type.

For organizations working with contracts, financial documents, or policy-heavy content, applying the right chunking and retrieval approach is essential. Prompt Studio enables teams to test and refine these workflows, helping them achieve higher accuracy, better cost efficiency, and faster document extraction at scale.

Book a free demo with Unstract to see how configurable chunking and retrieval strategies can improve accuracy and reliability across complex document workflows.

Why PDF to Markdown Fails for LLM-Based Document Data Extraction
https://unstract.com/blog/why-pdf-to-markdown-ocr-fails-for-ai-document-processing/
Wed, 04 Mar 2026 03:57:23 +0000

Markdown-based OCR falls short for LLM-driven structured data extraction. This article compares it with LLMWhisperer, a layout-preserving OCR built for LLM pre-processing, highlighting how retaining spatial structure and confidence scores enables more accurate downstream extraction.

The post Why PDF to Markdown Fails for LLM-Based Document Data Extraction first appeared on Unstract.com.


Why Relying on PDF to Markdown Hurts LLM Accuracy and Reliability

In recent years, modern OCR tools have increasingly embraced Markdown as their default output format, often describing it as “LLM-ready” due to its simplicity, human readability, and version control friendliness. At first glance, this approach seems practical: Markdown is lightweight, easy to parse, and works well for simple text documents. However, when it comes to structured data extraction at scale, the story is very different.

The problem is that Markdown fundamentally discards critical layout and metadata that are essential for accurate extraction by large language models (LLMs). Bounding boxes, reading order, confidence scores, and complex table structures are often lost during the conversion to Markdown. This information loss can silently corrupt data, introduce errors, and make downstream LLM processing unreliable, particularly for complex enterprise documents such as multi-column reports, financial statements, scientific papers, and scanned forms.

This article compares Markdown-based OCR outputs with LLMWhisperer, a layout-preserving OCR solution designed specifically for LLM pre-processing, to highlight why Markdown is insufficient for high-fidelity, LLM-driven structured data extraction. Unlike Markdown-based tools that prioritize human readability, LLMWhisperer retains critical layout information, confidence scores, and spatial relationships that enable more accurate downstream extraction. By evaluating real-world document scenarios, we will show how Markdown fails to retain the structural and semantic richness needed for accurate automated processing, and how LLMWhisperer overcomes these challenges.

The Trend: Markdown in Modern OCR

Over the past few years, a clear trend has emerged in the OCR ecosystem: Markdown has become the preferred output format for many modern OCR tools. Solutions such as OLMOCR2, Docling, and Mistral OCR increasingly default to Markdown when converting PDFs and scanned documents into machine-readable text. This shift is often framed as a natural evolution driven by the rise of LLMs and the need for “LLM-friendly” inputs.

The rationale behind this choice is easy to understand. Markdown is human-readable, lightweight, and easy to diff and version-control using tools like Git. It converts reasonably well to other formats, integrates smoothly into documentation workflows, and avoids the verbosity and complexity of traditional OCR formats. For simple documents, plain text PDFs, basic reports, or single-column layouts, Markdown often appears “good enough” and convenient.

However, while Markdown works well as a presentation or documentation format, it performs poorly as an OCR output format for LLM-based structured data extraction. In practice, Markdown-based OCR pipelines consistently lose crucial information: spatial layout, reading order, bounding boxes, confidence scores, and the nuanced structure of tables and forms. These losses are not always obvious (many conversions succeed silently), but they significantly degrade extraction accuracy when the output is fed into downstream LLM workflows.

As document complexity increases, with multi-column layouts, nested tables, financial statements, scanned forms, or mixed text-and-table PDFs, the limitations of Markdown become increasingly apparent. What initially looks like clean, readable output often masks structural corruption and missing metadata, leading to unreliable LLM responses and costly post-processing.

This growing reliance on Markdown highlights an important misconception in modern OCR pipelines: being LLM-compatible is not the same as being LLM-accurate. In the sections that follow, we’ll examine where Markdown-based OCR breaks down in real-world scenarios and why layout-preserving approaches like LLMWhisperer are better suited for enterprise-scale, structured data extraction.


Why Markdown Fails for OCR: Key Challenges

Markdown was never designed to be an OCR interchange format. It excels at representing simple, linear text for humans, but OCR, especially OCR intended for LLM-based structured data extraction, requires far richer information.

When documents are converted to Markdown, a large amount of critical signal is lost, often irreversibly. The following paragraphs describe the most common and impactful issues observed in real-world OCR pipelines.

Loss of Layout and Positional Data

One of the most fundamental problems with Markdown is that it has no concept of spatial layout. During OCR-to-Markdown conversion, bounding boxes and coordinates disappear entirely. There is no way to express where a word, line, or block of text appeared on the page, nor how different regions relate to each other spatially.

This loss directly impacts reading order. Multi-column documents, common in bank statements, invoices, research papers, and reports, are flattened into a single linear sequence. The original left-to-right, top-to-bottom reading flow is often destroyed, resulting in text from different columns being interleaved incorrectly.
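To make this concrete, here is a toy Python sketch (purely illustrative, not any OCR tool's actual algorithm) showing how bounding-box coordinates let a two-column page be read column by column, something that is impossible once the coordinates have been discarded during conversion to Markdown:

```python
# Why bounding boxes matter: with (x, y) coordinates we can rebuild the
# column-aware reading order that Markdown conversion destroys.
def reading_order(words, column_gap=100):
    """Sort OCR words into left-to-right columns, top-to-bottom within each.

    `words` is a list of dicts with `text`, `x`, `y` (top-left coordinates,
    a simplified stand-in for full bounding boxes). `column_gap` is an
    illustrative bucketing width, not a universal constant.
    """
    def column(word):
        # Assign each word to a column bucket by its x position.
        return word["x"] // column_gap

    return [w["text"] for w in sorted(words, key=lambda w: (column(w), w["y"], w["x"]))]

words = [
    {"text": "Left-1", "x": 10, "y": 10},
    {"text": "Right-1", "x": 310, "y": 12},
    {"text": "Left-2", "x": 10, "y": 40},
    {"text": "Right-2", "x": 310, "y": 45},
]
# A naive top-to-bottom pass would interleave the columns
# (Left-1, Right-1, Left-2, Right-2); column-aware sorting does not.
print(reading_order(words))  # ['Left-1', 'Left-2', 'Right-1', 'Right-2']
```

Once the output is flattened to Markdown, the `x`/`y` inputs to this kind of logic no longer exist, so neither the OCR vendor nor the downstream consumer can recover the intended reading order.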

Metadata Loss

Beyond layout, Markdown discards nearly all OCR metadata that production systems rely on. Confidence scores, essential for identifying uncertain OCR regions, have nowhere to live in Markdown. Without them, it is impossible to automatically flag low-quality extractions or route problematic documents for human review.

Other critical metadata is also lost, including:

  • Font styles and emphasis
  • Colour information
  • Semantic roles such as headers, footnotes, captions, or labels

This loss severely impacts human-in-the-loop (HITL) workflows and large-scale batch processing. Teams lose the ability to verify, audit, highlight, or selectively reprocess OCR output based on confidence or semantic importance.
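As a sketch of what confidence metadata enables once it is preserved (the threshold and routing rule below are illustrative assumptions, not any specific product's logic):

```python
def route_document(word_confidences, threshold=0.9, max_flagged_ratio=0.05):
    """Decide whether a document can be auto-processed or needs human review.

    `word_confidences` is a list of per-word OCR confidence scores, exactly
    the metadata that has nowhere to live in a Markdown output. The 0.9
    threshold and 5% flag ratio are illustrative defaults.
    """
    if not word_confidences:
        return "review"  # nothing extracted at all
    flagged = [c for c in word_confidences if c < threshold]
    ratio = len(flagged) / len(word_confidences)
    return "review" if ratio > max_flagged_ratio else "auto"

print(route_document([0.99, 0.98, 0.95, 0.97]))  # auto
print(route_document([0.99, 0.42, 0.55, 0.97]))  # review
```

Without per-word scores in the OCR output, every document must be treated identically, so this kind of selective HITL routing is impossible.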

Tables and Complex Layouts Break

Tables are one of the most fragile structures in Markdown-based OCR. Markdown tables are limited to simple row-and-column grids and cannot represent merged cells (colspan/rowspan), nested tables, or irregular layouts. As a result, complex tables, especially those found in financial statements, scientific papers, and regulatory documents, are frequently corrupted.

Common failure patterns include:

  • Repeated text across rows
  • Values appearing under incorrect headers
  • Nested tables collapsing into unreadable text blocks
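A toy Python sketch (purely illustrative, not any specific converter's code) makes the first failure pattern concrete: a cell that spans two rows in the source table has no representation in Markdown's flat grid, so its value is duplicated on every row it covered, and the span itself is lost:

```python
# A row whose label cell spans two source rows (rowspan=2) cannot be
# expressed in a Markdown table, so converters typically repeat it.
source_rows = [
    {"label": "Revenue", "rowspan": 2, "values": ["Q1: 100", "Q2: 120"]},
    {"label": "Costs", "rowspan": 1, "values": ["Q1: 80"]},
]

def to_markdown(rows):
    lines = ["| Label | Value |", "| --- | --- |"]
    for row in rows:
        for value in row["values"]:
            # The merged cell is flattened: its label reappears on each row,
            # and the fact that it was a single cell (`rowspan`) is dropped.
            lines.append(f"| {row['label']} | {value} |")
    return "\n".join(lines)

print(to_markdown(source_rows))
```

The round trip is lossy by construction: given only the Markdown output, there is no way to tell a genuinely repeated value apart from a flattened merged cell.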

Unlock the full value of tables hidden inside your PDFs.

In this article, discover how to use Unstract’s LLMWhisperer to extract exactly what you need, page by page, with output optimized specifically for LLM processing. See how it differs from traditional OCR and why it makes structured data extraction far more accurate and reliable.

Read →

Silent Corruption via LLM Pipelines

Many modern OCR tools rely on LLMs or vision-language models to generate Markdown directly. While powerful, this introduces a dangerous failure mode: silent corruption. LLMs may hallucinate content, omit sections, or truncate output without raising any explicit error.

Because the resulting Markdown still “looks valid,” these failures often go unnoticed until extraction results are wrong, at which point diagnosing the root cause becomes extremely difficult.

Other Limitations

Several additional issues further limit Markdown’s usefulness as an OCR output format:

  • Special characters and Unicode are frequently misrendered, especially in non-Latin scripts.
  • Images in scanned PDFs are often replaced with placeholders such as <!-- image -->, losing all embedded content.
  • Nested structures, such as lists within tables or tables within tables, regularly break parsing logic.
  • Manual correction of these issues is cumbersome and impractical at enterprise scale, particularly for non-technical users.

Summary: Markdown OCR Limitations

| Issue Category | Description | User Impact |
| --- | --- | --- |
| Positional Data Loss | Multi-column flattening, no bounding boxes | Difficult navigation and review |
| Metadata Loss | Confidence scores and semantic info discarded | Manual correction and reprocessing |
| Formatting Limitations | Tables, equations, nested lists fail | Loss of structural integrity |
| Silent Corruption | LLM-generated output may hallucinate or omit | Undetected extraction errors |

Taken together, these challenges show that Markdown is not merely an imperfect OCR format; it is structurally incapable of representing the information required for reliable, large-scale, LLM-based structured data extraction.

In the next section, we’ll contrast these limitations with layout-preserving approaches and demonstrate in practice how tools like LLMWhisperer address many of these gaps.

Comparison: Markdown-Based OCR vs LLMWhisperer

To move beyond theoretical limitations, we now compare Markdown-based OCR tools with LLMWhisperer using real-world documents and extraction scenarios.

The goal of this comparison is not to discredit Markdown-based tools, but to identify where they struggle in practice, and to clearly show how a layout-preserving approach changes extraction outcomes when LLMs are involved.

Comparing PDF to Markdown OCR Tools: Strengths and Limitations

The following tools represent the current state of Markdown-first OCR pipelines. Each of them aims to produce LLM-compatible text, but does so by flattening documents into Markdown:

  • OLMOCR2 – Used for selected document types where its OCR performance is strong, particularly on cleaner or more uniform layouts.
  • Docling – Applied to documents requiring broader PDF parsing capabilities, including text, tables, and mixed content.
  • Mistral OCR – Used for cases where LLM-powered OCR is advantageous, especially for scanned documents and noisy inputs.

These tools all output Markdown and share similar strengths: simplicity, readability, and ease of integration, along with the same structural limitations discussed earlier.

On the other hand, LLMWhisperer takes a different approach. Instead of optimizing for Markdown output, it produces layout-preserved plain text with JSON metadata, explicitly designed for LLM pre-processing. Spatial relationships, reading flow, confidence scores, and form elements are retained to the extent required for accurate structured extraction.
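As a sketch of what consuming that output can look like, here is a minimal example. The endpoint URL, header name, and response field names are assumptions for illustration only; consult the LLMWhisperer API documentation for the exact contract.

```python
import json
import urllib.request

# Assumed endpoint for illustration; check the LLMWhisperer docs for the
# real base URL, authentication header, and async status/retrieve flow.
API_URL = "https://llmwhisperer-api.us-central.unstract.com/api/v2/whisper"

def whisper_pdf(path: str, api_key: str) -> dict:
    """POST a PDF's raw bytes to the API and return the parsed JSON reply."""
    with open(path, "rb") as f:
        payload = f.read()
    req = urllib.request.Request(
        API_URL,
        data=payload,
        headers={"unstract-key": api_key,
                 "Content-Type": "application/octet-stream"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def layout_text(response: dict) -> str:
    """Pull the layout-preserved text out of an (assumed) response payload."""
    return response.get("extraction", {}).get("result_text", "")
```

The key point is the shape of the result: layout-preserved plain text plus structured JSON metadata, rather than a Markdown string with everything else discarded.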

The following subsections illustrate these differences using real documents across multiple scenarios, demonstrating how each approach handles complex layouts, tables, and mixed content.

Scenario 1 – Nested Tables

This document contains nested tables, checkboxes embedded within tables, and multi-level row/column relationships, making it a good stress test for OCR systems.

Download sample →

To evaluate how Markdown-based OCR handles complex layouts, we tested this document using the OLMOCR2 demo site, https://olmocr.allenai.org/.

The Markdown output produced by OLMOCR2 correctly recognizes most of the text inside individual cells:

However, once the document structure becomes complex, critical issues emerge:

  • Column alignment is lost: Text belonging to the same row is split across different logical rows in the Markdown output.
  • Row semantics break down: In several places, column labels (such as “A”) disappear entirely, making it impossible to determine which value belongs to which category.
  • Nested tables lose meaning: While the text exists, the hierarchical relationship between inner and outer tables is flattened.
  • Checkbox context is unclear: Checkboxes are detected as text artifacts, but their association with specific rows or columns is ambiguous.

Additionally, OLMOCR2 provides no real page or positional metadata:

Processing the document with the LLMWhisperer API using a Python script, we obtain the following output:

As you can see, in contrast, LLMWhisperer produces a near-perfect translation of the nested tables:

  • All rows and columns remain aligned, both within inner tables and across outer tables.
  • Nested table structure is preserved, allowing the document to retain its original hierarchy.
  • Checkboxes are explicitly identified, with clear association to their respective rows.
  • Reading order is preserved, ensuring that related fields appear together in the correct sequence.
  • Confidence scores are provided, as shown below, enabling verification and targeted review of low-confidence items.

Confidence score metadata is provided only for words with a confidence score below 0.9. As you can see, in this document, only one word is flagged:

Scenario 2 – Multi-Column Layouts

This document contains a multi-column layout combined with diverse table structures, a common pattern in bank statements and financial reports.

Download sample →

To evaluate how Markdown-based OCR handles complex layouts, we tested Docling using a Python-based extraction script that converts PDFs into Markdown.

Running the script, we get the following result:

This output has some key issues:

  • Extraction accuracy: While much of the text is recognized, structural loss leads to incorrect field associations and unreliable LLM outputs.
  • Layout preservation: Multi-column layouts are flattened, destroying the visual and logical separation of content.
  • Metadata preservation: Positional data, confidence scores, and structural metadata are entirely absent.
  • Tackling complex formats and layouts: Calendars, mixed layouts, and tables are misinterpreted or restructured incorrectly.
  • Confidence scores: Not available, preventing quality checks and human-in-the-loop workflows.
  • Bounding boxes: No coordinate data is preserved, making validation and spatial reasoning impossible.
  • Reading and semantic order: Content order is inferred and frequently incorrect, breaking the document’s semantic flow.

We then processed the same document using LLMWhisperer’s Python script, calling the API as before:

As you can see, LLMWhisperer results are far better:

  • Layout preserved: Multi-column sections remain aligned, nested tables retain their hierarchy, and rows/columns are correctly associated.
  • Checkboxes and form elements: Clearly marked as checked/unchecked and linked to their respective rows or fields.
  • Metadata included: Word-level confidence scores are present, enabling HITL workflows and quality validation.
  • Positional data: Line-level Y-coordinates allow downstream systems to reason about spatial relationships.
  • Reading and semantic order: Text flows naturally according to the original layout, preserving semantic meaning.
  • Complex formats handled: Calendars, tables, and mixed content regions are represented correctly, without introducing spurious headers or footers.

For LLMWhisperer, word-level confidence metadata is provided for all words with confidence < 0.9:

Scenario 3 – Handwritten Forms with Checkboxes

This document contains handwritten text combined with checkboxes and form fields, representing a common but challenging scenario for OCR systems processing real-world forms such as tax returns, applications, and surveys.

Download sample →

For the Markdown-based OCR, we used Mistral OCR with a small Python script to automate PDF conversion:

Results from Markdown-based OCR from Mistral OCR:

  • Handwriting recognized – individual handwritten text is detected reasonably well.
  • Checkboxes detected – presence of checkboxes is noted, but their checked/unchecked state may be ambiguous.
  • Layout not preserved – multi-column sections and nested tables are flattened or misaligned.
  • Reading flow corrupted – rows and columns are interleaved incorrectly, breaking the semantic order of the document.
  • No highlights or confidence scores – low-confidence regions cannot be flagged, making validation and HITL workflows impossible.

We then processed the same document using the LLMWhisperer API, as before, via Python:

As you can see, LLMWhisperer provides:

  • Accurate handwriting recognition – both printed and handwritten text are correctly captured.
  • Layout preservation – multi-column sections, nested tables, and forms remain aligned.
  • Highlights and annotations – low-confidence regions can be flagged for review, supporting HITL workflows.
  • Correct checkbox interpretation – checked and unchecked states are clearly marked and associated with the right rows.

As shown before, we get the confidence score metadata:

Scenario 4 – Complex Scientific and Financial Tables

This document contains several tables with complex relationships and detailed descriptions of the data.

Download sample →

Analysing the document with Docling, we obtain the following Markdown output:

Even though Docling does a reasonable job reproducing basic tables, the Markdown output is missing several critical elements:

  • Merged table cells are lost – hierarchical or grouped data is flattened.
  • Totals are missing in the first table – key summary values are dropped during conversion.
  • Layout is not preserved – tables and surrounding text lose their original spatial relationships.
  • No confidence or metadata available – there is no way to assess extraction quality or flag uncertain values.

We then processed the same document using the LLMWhisperer API, using the same Python-based approach as before:

Notably, LLMWhisperer delivers a complete and reliable extraction, addressing the shortcomings seen in Markdown-based OCR:

  • Accurate layout replication – the original visual and structural layout is preserved.
  • Merged cells retained – complex table hierarchies and grouped data remain intact.
  • Totals correctly extracted – summary rows and calculated fields are not lost.
  • Document flow preserved – reading order and semantic relationships are maintained.
  • Confidence metadata included – low-confidence items are explicitly flagged for review.

And for the few words where the confidence level was not high, LLMWhisperer explicitly exposes this information via metadata:

Scenario 5 – Mixed Content PDFs

This document contains a complex mix of tables and descriptive text, with relationships between cells, rows, and surrounding explanations that are critical for understanding the data.

Download sample →

Using Mistral OCR, we get the following Markdown output:

The Markdown output has the following limitations:

  • Tables and nested tables scrambled – inner tables and corresponding descriptive text lose their structure.
  • Flow not replicated – reading order is corrupted, making it difficult to follow relationships between data and explanations.
  • Hard to read and interpret – flattened and misaligned content reduces usability for humans and LLMs alike.

Using LLMWhisperer, we can see the benefits of layout preservation:

Benefits of LLMWhisperer for this document:

  • Tables beautifully replicated – outer tables maintain their original structure and alignment.
  • Inner tables preserved – nested tables retain their layout and hierarchy.
  • Corresponding text maintained – descriptive text stays correctly associated with tables and data.
  • Confidence metadata provided – although not needed here, low-confidence words would be flagged automatically.

For this document, all words have confidence scores higher than 0.9, indicating very high OCR accuracy:

Key Differences: Why PDF to Markdown Falls Short Compared to LLMWhisperer

The table below summarizes the practical differences between LLMWhisperer and Markdown-based OCR tools when used for LLM-driven structured data extraction.

| Aspect | LLMWhisperer | Markdown-based OCR | Notes |
| --- | --- | --- | --- |
| Bounding boxes | ✅ Line-level Y coordinates | ❌ None | Positional data is critical for validation, review, and downstream processing |
| Confidence scores | ✅ Word-level | ❌ None | Enables detection of low-confidence regions and HITL workflows |
| Multi-column layout | ✅ Yes | ❌ Flattened | LLMWhisperer preserves reading order; Markdown interleaves columns |
| Complex tables | ✅ ASCII-based visual preservation | ❌ Collapsed or corrupted | Markdown cannot represent merged cells or nested tables |
| Semantic structure | ⚠ Headers and footers detected | ❌ Lost | Footnotes, captions, and roles are discarded |

This comparison makes the trade-off explicit: Markdown-based OCR optimizes for simplicity, while LLMWhisperer optimizes for extraction fidelity. For LLM-centric pipelines, the latter consistently provides more reliable and auditable inputs.

LLMWhisperer Advantages

As the comparison above suggests, LLMWhisperer takes a fundamentally different approach from Markdown-based OCR tools.

Instead of optimizing for human readability or document conversion, it is designed specifically for LLM pre-processing, with a strong emphasis on preserving layout, structure, and signals that matter for downstream extraction.

This design choice addresses many of the core failures discussed earlier.

Layout preservation leads to higher extraction accuracy: LLMWhisperer preserves the visual structure of documents far more effectively than Markdown. By maintaining spatial alignment and relative positioning in its output, it allows LLMs to reason about documents in a way that closely resembles how humans read them. This is especially valuable for multi-column layouts, dense reports, and mixed-content PDFs where reading order is critical.

Word-level confidence enables HITL workflows: Unlike Markdown-based outputs, LLMWhisperer provides word-level confidence metadata, exposing low-confidence regions explicitly.

This enables production-grade workflows such as:

  • Automatically flagging uncertain extractions
  • Routing problematic documents for human review
  • Applying quality thresholds in batch processing

These capabilities are essential for enterprise deployments where accuracy, auditability, and traceability matter.

Robust handling of multi-column and nested structures: LLMWhisperer performs significantly better on complex layouts, including multi-column documents, nested tables, and mixed text–table regions. Rather than flattening or interleaving content, it preserves spatial relationships, reducing logical corruption during extraction.

Semantic detection for common document elements: The system detects headers, footers, and form elements such as checkboxes and radio buttons, which are frequently lost or misrepresented in Markdown pipelines. This semantic awareness improves structured extraction from forms, regulatory documents, and enterprise PDFs.

Reduced silent corruption: Because LLMWhisperer is not primarily generating Markdown via free-form LLM responses, it avoids many silent failure modes such as hallucinated content, truncated output, or incomplete responses that appear valid but are semantically wrong.

Practical Recommendations

Based on real-world experiments and comparative analysis, the choice of OCR output format should be driven by document complexity and downstream usage, not by convenience alone.

The following recommendations can help teams avoid common pitfalls and design more reliable LLM-based extraction pipelines.

Use Markdown for small-scale, human-readable OCR: Markdown works reasonably well for simple use cases: single-column documents, short reports, or content intended primarily for human reading and lightweight editing. If the goal is documentation, quick previews, or basic text extraction with minimal structure, Markdown can be an acceptable choice.

Avoid Markdown for enterprise-grade or complex documents: For large-scale processing or documents with complex layouts, such as multi-column reports, financial statements, scientific papers, scanned forms, or nested tables, Markdown becomes a liability. The loss of layout, metadata, and confidence information leads to extraction errors that are difficult to detect and expensive to correct downstream.

Use LLMWhisperer for LLM preprocessing and metadata-rich workflows: When the primary goal is accurate structured data extraction using LLMs, LLMWhisperer is a far better fit. Its layout-preserving output, confidence metadata, and improved handling of complex structures make it suitable for production systems where accuracy, traceability, and HITL workflows are critical.

Adopt hybrid pipelines for specialized requirements: No single OCR tool or format solves every problem. Documents with heavy mathematical notation, strict compliance requirements, or a need for full coordinate precision may still require specialized formats. In such cases, a hybrid pipeline, combining LLMWhisperer for LLM preprocessing with domain-specific OCR outputs for specialized needs, offers the most robust and flexible approach.

Markdown should be treated as a convenience format, not a canonical OCR representation.

PDF to Markdown in OCR: What is next?

The growing popularity of Markdown as an OCR output format is driven largely by convenience and perceived LLM compatibility. However, as this article has shown, Markdown discards critical information that is essential for reliable, production-grade structured data extraction.

Layout, positional data, confidence scores, and semantic context are either flattened or lost entirely, leading to silent corruption and degraded extraction accuracy, especially for complex, real-world documents.

In contrast, LLMWhisperer’s layout-preserving approach, combined with confidence metadata and better handling of complex structures, consistently produces outputs that LLMs can reason over more accurately and reliably. LLMWhisperer aligns far more closely with the needs of structured extraction pipelines than Markdown-centric solutions.

Ultimately, OCR outputs should be evaluated not by how “LLM-friendly” they appear, but by how well they preserve the signals that matter in real-world use cases. Organizations building serious LLM-driven extraction systems must look beyond convenience formats and choose tools based on extraction accuracy, metadata preservation, and robustness at scale. Markdown may be useful at the margins, but for high-fidelity, structured data extraction, it is simply not enough.


Frequently Asked Questions (FAQ)

Why is Markdown a popular OCR output format, and what are its limitations for developers?

Markdown is popular for OCR because it’s human-readable, lightweight, and integrates well with version control systems like Git. However, it discards critical layout information, bounding boxes, and confidence scores, making it unreliable for structured data extraction, especially for complex documents like financial statements or multi-column reports.

How does Markdown-based OCR handle tables and nested structures?

Markdown struggles with complex tables, often flattening or misaligning rows, columns, and nested structures. It cannot represent merged cells, hierarchical relationships, or irregular layouts, leading to corrupted data and unreliable downstream LLM processing.

What are the risks of using Markdown for enterprise-scale OCR workflows?

Markdown-based OCR loses positional data, confidence scores, and semantic context, which can silently corrupt data and introduce errors. This makes validation, auditing, and human-in-the-loop (HITL) workflows difficult, especially for large-scale or mission-critical document processing.

How does LLMWhisperer improve structured data extraction compared to Markdown?

LLMWhisperer preserves spatial layout, reading order, and confidence scores, ensuring high-fidelity extraction. It handles multi-column layouts, nested tables, and complex forms accurately, making it more reliable for LLM-driven pipelines than Markdown-based tools.

When should developers use Markdown for OCR, and when should they avoid it?

Use Markdown for simple, single-column documents or human-readable outputs like basic reports or documentation. Avoid it for complex layouts, financial statements, or enterprise-scale extraction, where structural integrity and metadata preservation are critical.



The post Why PDF to Markdown Fails for LLM-Based Document Data Extraction first appeared on Unstract.com.

]]>
Build Document Workflows That Adapt https://unstract.com/blog/build-document-workflows-that-adapt/?utm_source=rss&utm_medium=rss&utm_campaign=build-document-workflows-that-adapt Wed, 11 Feb 2026 15:56:53 +0000 https://unstract.com/?p=518168

Unstract moves from hardcoded workflows to adaptive, data-driven pipelines. Learn how post-processing webhooks, custom data variables, and prompt chaining enable flexible, future-ready document automation.

The post Build Document Workflows That Adapt first appeared on Unstract.com.

]]>

Why “Adaptive” Document Workflows Matter

Document automation has improved significantly, but most systems still struggle in real production environments. Documents are not static: invoice layouts change, new fields appear, and vendors update formats without notice. At the same time, business rules evolve, and requirements vary across clients, regions, and policies. What works for one use case or customer, more often than not, fails for another.

Traditional document extraction relies heavily on static logic. Prompts and rules are tightly coupled to specific layouts and assumptions, and business rules are often hardcoded into the extraction itself. This approach may work initially, but it breaks down at scale. As variations increase, teams end up duplicating prompts, maintaining fragile workflows, and constantly rewriting logic for small changes.

Unstract is shifting this model from hardcoded workflows to adaptive, data-driven pipelines. Extraction logic remains stable, while business context, validation rules, and downstream actions are applied dynamically. This separation allows document workflows to automatically adapt to changing formats, policies, and use cases without constant rework.

In this article, we explore three new advanced Unstract features that enable this adaptability: Post-Processing Webhooks, Custom Data Variables, and Prompt Chaining. Together, they provide developers with fine-grained control over extraction, validation, and workflow orchestration.

Mental Model: How These Features Work Together

To understand how these features fit together, it helps to think of document processing as a pipeline rather than a single extraction step. Each component has a clear responsibility, and together they form an adaptive workflow that can handle change without constant rewrites.

Unstract’s Prompt Studio is responsible for extracting the raw structure from documents. Its job is to identify and pull out fields, values, and relationships, without embedding business rules or assumptions about how the data will be used.

Custom Data provides the runtime business context. It answers questions like why this document is being processed, which rules apply right now, and what thresholds or expectations should be used. This context is injected dynamically, allowing the same prompt to behave differently across clients, regions, or workflows.

Post-Processing Webhooks take the extracted data and make it production-ready. This is where validation, transformation, enrichment, calculations, and system integration happen. Webhooks allow developers to apply deterministic logic and connect extracted data to real systems such as ERPs, databases, or internal services.

Finally, Prompt Chaining connects these steps into multi-stage workflows. Outputs from earlier prompts guide later extraction and validation, enabling conditional logic, cross-document checks, and sequential decision-making, all without reprocessing documents or duplicating prompts.
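As a rough sketch of the idea (plain Python dictionaries and generic templating, not Unstract's actual chaining syntax):

```python
# Prompt-chaining sketch: the output of one extraction step selects the
# prompt used in the next step. Template text and the classification value
# are illustrative assumptions.
doc_type = "invoice"  # pretend this came from an earlier classification prompt

followup_templates = {
    "invoice": "Extract invoice_number, total_amount and due_date as JSON.",
    "purchase_order": "Extract po_number and line items as JSON.",
}

# Conditional logic: the earlier prompt's output drives the later prompt,
# without reprocessing the document or duplicating prompt projects.
next_prompt = followup_templates.get(doc_type, "Summarize the document.")
print(next_prompt)
```

The same selection logic generalizes to cross-document checks or multi-stage validation: each stage consumes structured output from the previous one instead of re-reading the raw document.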

Conceptually, the flow looks like this:

Build Document Workflows That Adapt with Unstract

This flow keeps extraction, context, and workflow logic cleanly separated, making document pipelines easier to scale, adapt, and maintain.

Feature 1: Post-Processing Webhooks

A Post-Processing Webhook is a developer-controlled step that runs immediately after Prompt Studio completes extraction. Instead of consuming raw extracted JSON directly, Prompt Studio sends the output to a webhook endpoint where additional logic can be applied before the data is finalized.

The webhook receives structured JSON from Prompt Studio and executes custom logic such as validation, transformation, enrichment, calculations, or integration with external systems. Based on this logic, it can return a cleaned and enriched payload, or respond with errors and flags indicating that the document requires review.

From an implementation perspective, post-processing webhooks are typically built as lightweight Python services, using frameworks like FastAPI or Flask. This approach keeps the logic deterministic, testable, and easy to evolve, while allowing developers to enforce real business rules outside of the prompt itself.
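As a minimal sketch of that handler logic (the payload fields and response shape below are illustrative assumptions, not Unstract's exact webhook contract), stripped of the FastAPI/Flask route wiring:

```python
def post_process(payload: dict) -> dict:
    """Validate, normalize, and enrich extracted JSON before it is finalized.

    Field names and the response shape are illustrative assumptions; in a
    real deployment this function would sit behind a FastAPI or Flask route.
    """
    errors = []

    # Validation: deterministic checks the prompt itself should not own.
    total = payload.get("total_amount")
    if not isinstance(total, (int, float)):
        errors.append("total_amount missing or not numeric")

    # Transformation: normalize values for downstream matching.
    vendor = (payload.get("vendor_name") or "").strip().upper()
    cleaned = {**payload, "vendor_name": vendor}

    # Errors flag the document for human review instead of failing silently.
    if errors:
        return {"status": "needs_review", "errors": errors, "data": cleaned}
    return {"status": "ok", "data": cleaned}
```

Because this logic is plain code rather than prompt text, it is deterministic, unit-testable, and can change independently of the extraction prompts.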

In practice, webhooks are most commonly used to:

  • Validate or clean extracted data
  • Transform values into normalized or formatted outputs
  • Enrich results with data from external systems or reference sources

For more information, check out the Unstract Post-Processing Webhook Documentation.

Use Case 1 – Purchase Order Validation and Enrichment

Document

For this example, we’ll use the following purchase order document as the input.

Download Document →

What Prompt Studio Extracts

Let’s create a new Prompt Studio project to extract customer information. We’ll begin by extracting the customer details using the following prompt:

Extract the customer, bill to and deliver to sections. Return each as a record (address and postal code should be separate fields).

This returns the following result:

Build Document Workflows That Adapt with Unstract

Make sure the return format is set to JSON. You can enable the new webhook feature by clicking the three vertical dots in the top-right corner of the prompt section, then activating it and entering your webhook endpoint:

Build Document Workflows That Adapt with Unstract

For reference, here is the resulting JSON from the previous prompt execution:

{
    "bill_to": {
        "address": "9090, West river avenue",
        "company": "Hive view Inc",
        "name": "John Armstrong",
        "postal_code": "Los Angeles, CA 92802"
    },
    "customer": {
        "address": "9090, West river avenue",
        "company": "Hive view Inc",
        "name": "John Armstrong",
        "postal_code": "Los Angeles, CA 92802"
    },
    "deliver_to": {
        "address": "9090, West river avenue",
        "company": "Hive view Inc",
        "name": "Simon Jones",
        "postal_code": "Los Angeles, CA 92802"
    }
}

Post-Processing Webhook Actions

Let’s start by looking at validation and cleaning, beginning with the customer data:

from typing import Any, Dict
import logging

logger = logging.getLogger(__name__)

def compare_addresses(bill_to: Dict[str, Any], deliver_to: Dict[str, Any]) -> bool:
    """
    Compare billing and delivery addresses.
    Returns True if addresses match, False otherwise.
    """
    bill_address = bill_to.get("address", "").strip().lower()
    bill_postal = bill_to.get("postal_code", "").strip().lower()
   
    deliver_address = deliver_to.get("address", "").strip().lower()
    deliver_postal = deliver_to.get("postal_code", "").strip().lower()
   
    addresses_match = (bill_address == deliver_address and bill_postal == deliver_postal)
    logger.info(f"Addresses match: {addresses_match}")

    return addresses_match

This function, compare_addresses, takes two dictionaries representing billing and delivery addresses and checks whether they match. It compares both the address and postal_code fields after stripping whitespace and converting to lowercase.
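As a quick illustration of that normalization (the sample addresses below are made up), differences in casing and surrounding whitespace do not prevent a match:

```python
def norm(value: str) -> str:
    """Normalize an address fragment: trim whitespace and lowercase."""
    return value.strip().lower()

# Hypothetical sample data: the same address with different casing/spacing
bill_to = {"address": "9090, West river avenue", "postal_code": "Los Angeles, CA 92802"}
deliver_to = {"address": " 9090, West River Avenue", "postal_code": "los angeles, ca 92802"}

addresses_match = (norm(bill_to["address"]) == norm(deliver_to["address"])
                   and norm(bill_to["postal_code"]) == norm(deliver_to["postal_code"]))
print(addresses_match)  # True
```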

For the enrichment example, we can use a CSV file containing customer bank information and look up the relevant details to include in the processed data:

def get_customer_billing_info(bill_to: Dict[str, Any]) -> Dict[str, Any] | None:
    """
    Get customer billing info.
    Returns a dictionary with the customer's billing info, or None if no match is found.
    """
   
    # Read customers billing info from csv file
    with open('customers_billing.csv', 'r') as file:
        reader = csv.reader(file)
        for row in reader:
            # Skip header row
            if row[0] == "contact_name":
                continue
            # Check if company matches
            if row[1].lower().strip() == bill_to.get("company", "").lower().strip():
                return {
                    "contact_name": row[0],
                    "company": row[1],
                    "bank_name": row[2],
                    "bank_account_number": row[3]
                }
    return None

This function, get_customer_billing_info, looks up a customer’s billing information from a CSV file. It reads through customers_billing.csv, finds the row matching the company in the bill_to dictionary, and returns a dictionary containing the customer’s contact name, company, bank name, and bank account number. If no match is found, it returns None.

For reference, a CSV file could be defined like this:

contact_name,company,bank_name,bank_account_number
Alice Johnson,Acme Corp,Bank of America,123456789
Bob Smith,Global Supplies,Citi Bank,987654321
Carol Lee,Tech Solutions,Wells Fargo,456789123
John Financial,Hive view Inc,Bank of America,1234567890

Each row represents a customer, with columns for the contact name, company, bank name, and bank account number.

This allows the webhook to add additional context, such as bank account numbers, bank names, or routing codes, directly into the extracted JSON, making it ready for downstream processing or integration.

Let’s now look at the complete source code for the webhook, implemented as a FastAPI application:

import json
from fastapi import FastAPI, Request, HTTPException
from fastapi.responses import JSONResponse
import uvicorn
import logging
from typing import Dict, Any
import csv

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


app = FastAPI(title="Unstract PostProcessing Webhook", version="1.0.0")


def compare_addresses(bill_to: Dict[str, Any], deliver_to: Dict[str, Any]) -> bool:
    """
    Compare billing and delivery addresses.
    Returns True if addresses match, False otherwise.
    """
    bill_address = bill_to.get("address", "").strip().lower()
    bill_postal = bill_to.get("postal_code", "").strip().lower()
   
    deliver_address = deliver_to.get("address", "").strip().lower()
    deliver_postal = deliver_to.get("postal_code", "").strip().lower()
   
    addresses_match = (bill_address == deliver_address and bill_postal == deliver_postal)
    logger.info(f"Addresses match: {addresses_match}")

    return addresses_match

def get_customer_billing_info(bill_to: Dict[str, Any]) -> Dict[str, Any] | None:
    """
    Get customer billing info.
    Returns a dictionary with the customer's billing info, or None if no match is found.
    """
   
    # Read customers billing info from csv file
    with open('customers_billing.csv', 'r') as file:
        reader = csv.reader(file)
        for row in reader:
            # Skip header row
            if row[0] == "contact_name":
                continue
            # Check if company matches
            if row[1].lower().strip() == bill_to.get("company", "").lower().strip():
                return {
                    "contact_name": row[0],
                    "company": row[1],
                    "bank_name": row[2],
                    "bank_account_number": row[3]
                }
    return None


@app.post("/webhook/customer")
async def receive_webhook_customer(request: Request):
    """
    Receive webhook POST requests
   
    This endpoint accepts webhook payloads, compares billing and delivery addresses,
    and returns the structured output with an additional field indicating if addresses match.
    """
    try:
        # Get the raw body as bytes
        body = await request.body()
       
        # Try to parse as JSON
        try:
            payload = await request.json()
            logger.info(f"Received webhook with JSON payload: {payload}")
        except Exception:
            # If not JSON, log as text
            payload = body.decode('utf-8')
            logger.info(f"Received webhook with text payload: {payload}")
            raise HTTPException(status_code=400, detail="Invalid JSON payload")
       
        # Extract structured_output from payload
        structured_output = payload.get("structured_output", {})
       
        # Get bill_to and deliver_to addresses
        bill_to = structured_output.get("bill_to", {})
        deliver_to = structured_output.get("deliver_to", {})
       
        # Compare addresses and set field for address match
        addresses_match = compare_addresses(bill_to, deliver_to)
        structured_output["billing_delivery_address_match"] = addresses_match
       
        # Get customer billing info and set field for customer billing info
        customer_billing_info = get_customer_billing_info(bill_to)
        structured_output["customer_billing_info"] = customer_billing_info

        # Prepare response with structured_output with the additional fields
        response_data = {
            "structured_output": structured_output,
            "highlight_data": payload.get("highlight_data", [])
        }
       
        # Return response
        return JSONResponse(
            status_code=200,
            content=response_data
        )   
    except HTTPException:
        # Let intentional HTTP errors (e.g., the 400 for invalid JSON) pass through
        raise
    except Exception as e:
        logger.error(f"Error processing webhook: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Error processing webhook: {str(e)}")

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

It includes the helper functions to:

  • Compare addresses (compare_addresses) between billing and delivery for validation.
  • Enrich customer data (get_customer_billing_info) by looking up bank info from a CSV file.

The endpoint receives extracted customer data, validates addresses, enriches with billing info, and returns the structured output.

Final Output

We’ve reviewed the Prompt Studio prompts, the extraction process, and the webhook processing code with its endpoints.

Now, let’s look at the results produced when the extraction runs and the post-processing webhook is invoked, and how the final, enriched output differs from the raw Prompt Studio response:

Build Document Workflows That Adapt with Unstract

The JSON returned after the post-processing webhook has been applied:

{
    "bill_to": {
        "address": "9090, West river avenue",
        "company": "Hive view Inc",
        "name": "John Armstrong",
        "postal_code": "Los Angeles, CA 92802"
    },
    "billing_delivery_address_match": true,
    "customer": {
        "address": "9090, West river avenue",
        "company": "Hive view Inc",
        "name": "John Armstrong",
        "postal_code": "Los Angeles, CA 92802"
    },
    "customer_billing_info": {
        "bank_account_number": "1234567890",
        "bank_name": "Bank of America",
        "company": "Hive view Inc",
        "contact_name": "John Financial"
    },
    "deliver_to": {
        "address": "9090, West river avenue",
        "company": "Hive view Inc",
        "name": "Simon Jones",
        "postal_code": "Los Angeles, CA 92802"
    }
}

As you can see, the processed output now includes additional fields representing the address match and customer banking information, enhancing the extracted data with validation and enrichment results.


Use Case 2 – Bank Statement Validation and Transform

Document

For this example, we will use a bank statement document:

Download Document →

What Prompt Studio Extracts

Now, let’s define the prompt to extract the bank statement transaction list details:

Extract the list of items with the individual fields (all lower-case field names) in a record ‘items’. Additionally, return another field with the Purchases and Advances as ‘total’.

Which produces the following results:

Build Document Workflows That Adapt with Unstract

And the corresponding JSON:

{
    "items": [
        {
            "amount": 223.26,
            "date": "11/12/23",
            "description": "Wallmart Purchase"
        },
        {
            "amount": 313.39,
            "date": "12/12/23",
            "description": "Sam vehicle service"
        },
        {
            "amount": 582.96,
            "date": "13/12/23",
            "description": "FEE PRE PAID ELECTRICITY"
        },
        {
            "amount": 125,
            "date": "14/12/23",
            "description": "Interac purchase - 1361 - The Bay"
        },
        {
            "amount": 215,
            "date": "15/12/23",
            "description": "Amazon Audible Book"
        },
        {
            "amount": 656.86,
            "date": "16/12/23",
            "description": "Hawai International Stay"
        },
        {
            "amount": 246,
            "date": "17/12/23",
            "description": "Adobe Inc CC cloud"
        },
        {
            "amount": 571.1,
            "date": "18/12/23",
            "description": "ATM withdrawal"
        },
        {
            "amount": 965,
            "date": "19/12/23",
            "description": "St louis Pharmacy"
        }
    ],
    "total": 3898.57
}

You can enable the webhook for this prompt by following the same process as before. Webhook configuration is defined per prompt.

Post-Processing Webhook Actions

For the individual transactions, we can perform validation by checking that the totals are correct:

def calculate_and_verify_totals(structured_output: Dict[str, Any]) -> Dict[str, Any]:
    """
    Calculate totals from items and verify against provided totals.
    Returns a dictionary with calculated totals and verification result.
    """
    items = structured_output.get("items", [])
   
    # Calculate line item totals
    calculated_total = 0.0
   
    for item in items:
        try:
            # Get amount
            amount = item.get("amount", 0)
           
            # Handle string values that might contain currency symbols or commas
            amount = float(str(amount).replace(",", "").replace("$", "").strip())
           
            # Calculate line total
            calculated_total += amount
        except (ValueError, TypeError) as e:
            logger.warning(f"Error processing item {item}: {str(e)}")
            continue
   
    # Round subtotal
    calculated_total = round(calculated_total, 2)
    logger.info(f"Calculated total: {calculated_total}")

    # Get provided subtotal
    provided_total = structured_output.get("total", 0)
    logger.info(f"Provided total: {provided_total}")

    # Check if subtotal is correct
    is_total_correct = calculated_total == provided_total
    logger.info(f"Total is correct: {is_total_correct}")
   
    return {
        "totals_correct": is_total_correct, 
        "calculated_total": calculated_total,
        "provided_total": provided_total
    }

This function, calculate_and_verify_totals, sums the amounts of all transactions and compares the result against the total extracted from the document. It tolerates amounts supplied as strings by stripping whitespace, currency symbols, and commas before summing, and rounds the result to two decimal places.
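The cleaning-and-summing step boils down to the following (the amounts here are made up to show the mixed string/number handling):

```python
# Amounts may arrive as numbers or as strings with symbols; clean, then sum
items = [{"amount": "223.26"}, {"amount": "$313.39"}, {"amount": 125}]

calculated_total = round(
    sum(float(str(item["amount"]).replace(",", "").replace("$", "").strip())
        for item in items),
    2,
)
print(calculated_total)  # 661.65
```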

For transform, here’s an example of formatting the amounts as currency:

def format_currency_items_totals(structured_output: Dict[str, Any]):
    """
    Format currency items.
    Returns a list of formatted items.
    """

    # Define currency
    currency = "USD"

    # Format items
    items = structured_output.get("items", [])
    for item in items:
        item["amount"] = f"{item['amount']:.2f} {currency}"

    # Format totals
    total = structured_output.get("total", 0)
    total = f"{total:.2f} {currency}"

    # Set formatted items and totals in structured_output
    structured_output["items"] = items
    structured_output["total"] = total
    return structured_output

This function, format_currency_items_totals, formats all transaction amounts and the overall total as currency strings, converting each item's amount, as well as total, into a consistent format (e.g., 12.34 USD).
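The formatting itself is a plain two-decimal f-string; for example:

```python
currency = "USD"

# Hypothetical amounts demonstrating the two-decimal currency format
for amount in (125, 571.1, 223.26):
    print(f"{amount:.2f} {currency}")
# 125.00 USD
# 571.10 USD
# 223.26 USD
```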

Combining all the previous code examples, the full source code is:

import json
from fastapi import FastAPI, Request, HTTPException
from fastapi.responses import JSONResponse
import uvicorn
import logging
from typing import Dict, Any
import csv

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


app = FastAPI(title="Unstract PostProcessing Webhook", version="1.0.0")


def calculate_and_verify_totals(structured_output: Dict[str, Any]) -> Dict[str, Any]:
    """
    Calculate totals from items and verify against provided totals.
    Returns a dictionary with calculated totals and verification result.
    """
    items = structured_output.get("items", [])
   
    # Calculate line item totals
    calculated_total = 0.0
   
    for item in items:
        try:
            # Get amount
            amount = item.get("amount", 0)
           
            # Handle string values that might contain currency symbols or commas
            amount = float(str(amount).replace(",", "").replace("$", "").strip())
           
            # Calculate line total
            calculated_total += amount
        except (ValueError, TypeError) as e:
            logger.warning(f"Error processing item {item}: {str(e)}")
            continue
   
    # Round subtotal
    calculated_total = round(calculated_total, 2)
    logger.info(f"Calculated total: {calculated_total}")

    # Get provided subtotal
    provided_total = structured_output.get("total", 0)
    logger.info(f"Provided total: {provided_total}")

    # Check if subtotal is correct
    is_total_correct = calculated_total == provided_total
    logger.info(f"Total is correct: {is_total_correct}")
   
    return {
        "totals_correct": is_total_correct, 
        "calculated_total": calculated_total,
        "provided_total": provided_total
    }


def format_currency_items_totals(structured_output: Dict[str, Any]):
    """
    Format currency items.
    Returns a list of formatted items.
    """

    # Define currency
    currency = "USD"

    # Format items
    items = structured_output.get("items", [])
    for item in items:
        item["amount"] = f"{item['amount']:.2f} {currency}"

    # Format totals
    total = structured_output.get("total", 0)
    total = f"{total:.2f} {currency}"

    # Set formatted items and totals in structured_output
    structured_output["items"] = items
    structured_output["total"] = total
    return structured_output


@app.post("/webhook/items")
async def receive_webhook_items(request: Request):
    """
    Receive webhook POST requests for items
   
    This endpoint accepts webhook payloads, calculates totals from items,
    verifies them against provided totals, and returns the result.
    """
    try:
        # Get the raw body as bytes
        body = await request.body()
       
        # Try to parse as JSON
        try:
            payload = await request.json()
            logger.info(f"Received webhook with JSON payload: {payload}")
        except Exception:
            # If not JSON, log as text
            payload = body.decode('utf-8')
            logger.info(f"Received webhook with text payload: {payload}")
            raise HTTPException(status_code=400, detail="Invalid JSON payload")
       
        # Extract structured_output from payload
        structured_output = payload.get("structured_output", {})

        # Calculate and verify totals
        totals_verification = calculate_and_verify_totals(structured_output)
        structured_output["totals_verification"] = {
            "totals_correct": totals_verification["totals_correct"],
            "calculated_total": totals_verification["calculated_total"],
            "provided_total": totals_verification["provided_total"]
        }

        # Format currency items and totals
        formatted_structured_output = format_currency_items_totals(structured_output)

        # Prepare response with the formatted structured_output (also contains the totals verification)
        response_data = {
            "structured_output": formatted_structured_output,
            "highlight_data": payload.get("highlight_data", [])
        }
       
        # Return response
        return JSONResponse(
            status_code=200,
            content=response_data
        )   
    except HTTPException:
        # Let intentional HTTP errors (e.g., the 400 for invalid JSON) pass through
        raise
    except Exception as e:
        logger.error(f"Error processing webhook: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Error processing webhook: {str(e)}")


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

It receives structured JSON from Prompt Studio, recalculates the total amount by summing item values, and verifies it against the extracted total to detect inconsistencies.

The service then formats monetary values as currency, enriches the output with a totals_verification section, and returns a clean, production-ready JSON response.

This keeps business logic deterministic, testable, and fully decoupled from the prompt itself.

Final Output

Executing the prompt and invoking the post-processing webhook returns the following result:

Build Document Workflows That Adapt with Unstract

With the corresponding JSON:

{
    "items": [
        {
            "amount": "223.26 USD",
            "date": "11/12/23",
            "description": "Wallmart Purchase"
        },
        {
            "amount": "313.39 USD",
            "date": "12/12/23",
            "description": "Sam vehicle service"
        },
        {
            "amount": "582.96 USD",
            "date": "13/12/23",
            "description": "FEE PRE PAID ELECTRICITY"
        },
        {
            "amount": "125.00 USD",
            "date": "14/12/23",
            "description": "Interac purchase - 1361 - The Bay"
        },
        {
            "amount": "215.00 USD",
            "date": "15/12/23",
            "description": "Amazon Audible Book"
        },
        {
            "amount": "656.86 USD",
            "date": "16/12/23",
            "description": "Hawai International Stay"
        },
        {
            "amount": "246.00 USD",
            "date": "17/12/23",
            "description": "Adobe Inc CC cloud"
        },
        {
            "amount": "571.10 USD",
            "date": "18/12/23",
            "description": "ATM withdrawal"
        },
        {
            "amount": "965.00 USD",
            "date": "19/12/23",
            "description": "St louis Pharmacy"
        }
    ],
    "total": "3898.57 USD",
    "totals_verification": {
        "calculated_total": 3898.57,
        "provided_total": 3898.57,
        "totals_correct": true
    }
}

As you can see, the currency values are properly formatted, and an additional total verification record is added with the calculated amounts and validation result.

This approach allows you not only to validate the extracted data, but also to transform it outside of the prompt itself, keeping prompts clean while enforcing reliable, production-grade logic.

Feature 2: Custom Data Variables

Custom Data allows developers to inject runtime business context into Prompt Studio without modifying the prompt itself, cleanly separating extraction logic from business rules and configuration.

In this model, the prompt defines how to extract information, what fields to look for and how to interpret the document structure, while Custom Data defines what conditions apply right now, such as thresholds, expectations, policies, or workflow intent.

Custom Data is passed at runtime as a simple JSON object, allowing the same prompt to behave differently depending on the client, region, document purpose, or processing scenario, all without duplicating or rewriting prompt logic.

By moving variability into data instead of text, Custom Data eliminates prompt sprawl and enables scalable, multi-tenant document workflows that are easier to maintain and evolve.

A helpful way to reframe Custom Data is to shift the question from “who is this document about?” to “why is this document being processed?”. Rather than capturing personal attributes or static metadata, Custom Data represents the business context that determines how a document should be interpreted in a given workflow.

In practice, it encodes factors such as business intent, risk profile, and the applicable policy or rule set, directly influencing validation, thresholds, comparisons, and decisions, all without altering the underlying extraction logic.

For more information, check out the Unstract Custom Data Documentation.

Use Case 1 – Air Shipment Validation

Document

As a document for the air shipment validation, we will use this airway bill example:

Download Document →

Custom Data

Let’s assume we want to use custom data to validate whether the cargo weight falls within predefined limits for a specific customer account and route. We could define data like this:

{
  "account_number": "HY73221",
  "route": {
    "origin": "New York",
    "destination": "London"
  },
  "thresholds": {
    "minimum": 50,
    "maximum": 80,
    "unit": "Kg"
  }
}

Prompt Studio

In Prompt Studio, you can define custom data by following these steps:

  1. Open your Prompt Studio project
  2. Click the Settings icon (gear) in the toolbar
  3. Select Custom Data from the menu

From there, you can add and manage the custom data that will be available to your prompts:

Build Document Workflows That Adapt with Unstract

This definition can be used for testing within Prompt Studio. When using the API, you can pass the same information via the custom_data parameter in your request to apply it at runtime.

Let’s consider the following prompt, which leverages Custom Data to process and validate the information in the document:

Extract the total shipment weight for customer {{custom_data.account_number}} in the route from {{custom_data.route.origin}} to {{custom_data.route.destination}}.

The total weight should be between {{custom_data.thresholds.minimum}} and {{custom_data.thresholds.maximum}} of unit {{custom_data.thresholds.unit}}.

Return field ‘is_between_thresholds’ with True or False depending on the outcome of the data.

This generates the following output:

Build Document Workflows That Adapt with Unstract

And corresponding JSON:

{
    "is_between_thresholds": true,
    "total_shipment_weight_kg": 72
}

As you can see, by using Custom Data, the validation logic is kept separate from the prompt itself and can be applied dynamically, allowing processing to adapt to specific customers, routes, or any other contextual data without modifying the extraction logic.
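Prompt Studio resolves the {{custom_data.*}} placeholders internally before the prompt runs. Purely to illustrate that mechanism, here is a minimal resolver sketch that walks the dotted paths over the custom data defined above:

```python
import re

custom_data = {
    "account_number": "HY73221",
    "route": {"origin": "New York", "destination": "London"},
    "thresholds": {"minimum": 50, "maximum": 80, "unit": "Kg"},
}

template = ("Extract the total shipment weight for customer "
            "{{custom_data.account_number}} in the route from "
            "{{custom_data.route.origin}} to {{custom_data.route.destination}}.")

def resolve(match: re.Match) -> str:
    # Walk the dotted path, skipping the leading 'custom_data' segment
    value = custom_data
    for key in match.group(1).split(".")[1:]:
        value = value[key]
    return str(value)

resolved = re.sub(r"\{\{(custom_data[.\w]+)\}\}", resolve, template)
print(resolved)
# Extract the total shipment weight for customer HY73221 in the route from New York to London.
```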

Use Case 2 – Medical Document

Document

Let’s look at a second example of using Custom Data, this time with a medical document:

Download Document →

Custom Data

The advantage of using Custom Data is that we can “override” the default reference ranges defined in the report. For example, based on a patient’s history, doctors may expect different thresholds than the standard ranges.

In this case, we can define a custom metric to reflect those expectations:

{
  "metric_check": {
    "name": "Hemoglobin",
    "expected_min": 13.0,
    "expected_max": 17.0,
    "unit": "g/dL"
  }
}

Prompt Studio

As described before, let’s add this Custom Data to the Settings of the Prompt Studio project:

Build Document Workflows That Adapt with Unstract

Reminder: when using the API, you can pass the same information via the custom_data parameter in your request to apply it dynamically at runtime.

Let’s define the prompt that uses the custom metric to extract and return a properly formatted output:

Extract the value and unit for {{custom_data.metric_check.name}} from the lab report.

Compare the extracted value against the expected provided range: {{custom_data.metric_check.expected_min}} to {{custom_data.metric_check.expected_max}} of unit {{custom_data.metric_check.unit}}.

Record a field name ‘is_expected’ with only one of these possibilities:
– below_expected
– within_expected
– above_expected

This generates the following output:

Build Document Workflows That Adapt with Unstract

And corresponding JSON:

{
    "Hemoglobin": {
        "unit": "gm/dL",
        "value": 12.4
    },
    "is_expected": "below_expected"
}

As you can see, by using Custom Data, medical professionals can adjust the expected metrics for a report based on a patient’s history, enabling more accurate diagnostics.

This clearly demonstrates the advantages of separating extraction logic from workflow-specific rules using Custom Data.

Feature 3: Prompt Chaining

Prompt Chaining is a technique for building multi-stage document workflows by connecting multiple prompts in sequence. Instead of treating each extraction as an isolated step, prompts are executed one after another, where the output of one prompt becomes the input to the next.

This approach enables workflows that can reason progressively over documents. Earlier extraction results can influence what happens downstream, allowing for conditional logic, cross-document validation, and multi-stage analysis without reprocessing entire documents or duplicating prompts.

By breaking complex logic into smaller, focused steps, Prompt Chaining makes document workflows easier to understand, more reliable, and better suited for real-world scenarios where decisions depend on earlier results.

For more information, check out the Unstract Prompt Chaining Documentation.

Data Flow Patterns

Prompt Chaining supports two primary patterns for passing data between prompts, depending on whether the output needs additional processing before being reused.

Direct Value Passing: In this pattern, the raw output of a prompt is passed directly into the next prompt using the syntax {{prompt_key}}. This is ideal when the extracted value is already in the required format and can be reused as-is, for example, passing a name, identifier, or simple field between prompts.

Processed Value Passing: In cases where the output needs validation, normalization, or enrichment before being reused, Prompt Chaining supports processed value passing. Here, the prompt output is first sent to a post-processing webhook and then injected into the next prompt using the syntax {{[prompt_key] <post_processing_url>}}. This allows developers to apply deterministic logic between prompt stages while keeping the overall workflow modular and flexible.

Use Case 1 – Loan Application + KYC Verification

Document

For this example, we will use a loan application document to illustrate the use of Direct Value Passing:

Download Document →

Prompts

First, we begin by extracting the applicant’s name from the document:

Extract the applicant’s full name from the loan application. Reply only with the name.

This prompt was defined with the key applicant_name:

Build Document Workflows That Adapt with Unstract

Which returns the following result:

Roger Deakins

We can now reuse this extracted value in the second prompt:

Check if the name “{{applicant_name}}” appears in the driver license documents on page 2 and 3.

Reply with a list of records composed by ‘page’ and ‘match_type’ fields.

For the ‘match_type’ field return only one of these values:
– exact_match
– partial_match
– no_match

We define the prompt key as kyc_name_match:

Build Document Workflows That Adapt with Unstract

The resulting JSON looks like this:

[
    {
        "match_type": "exact_match",
        "page": 2
    },
    {
        "match_type": "no_match",
        "page": 3
    }
]

As you can see, by reusing the applicant name extracted in the first prompt, we can automatically validate the driver’s license details on pages 2 and 3, ensuring they correctly match the applicant.

This approach requires no manual intervention and is fully reusable across different loan application documents and workflows.

Use Case 2 – Bill Of Lading + Payment Lookup

Document

For this example, we will use a bill of lading document to illustrate the use of Processed Value Passing:

Download Document →

Prompts

Let’s define a prompt to extract the booking number and amount:

Extract the booking no and cod amount.

This prompt was defined with the key booking:

Build Document Workflows That Adapt with Unstract

Which returns the following result:

{
    "booking_no": "758993033",
    "cod_amount": "$ 800"
}

We can now reuse this extracted value in the second prompt, using a webhook to process the values:

print the result {{[booking]https://unstract-webhook.developer-service.blog/webhook/amount_due}}

We define the prompt key as amount_due:

Build Document Workflows That Adapt with Unstract

The resulting JSON looks like this:

{
    "amount_due": 450,
    "amount_paid": 350,
    "booking_number": "758993033",
    "cod_amount": 800,
    "payment_date": "2026-01-16"
}

As shown in this example, prompt chaining with processed value passing allows you to combine the output of earlier prompts with data retrieved from external systems.

The webhook exposes a single endpoint at /webhook/amount_due (source code below), which accepts POST requests containing the shipment billing data. When a webhook payload is received, the processing flow is as follows:

  • The webhook extracts the booking_no and cod_amount.
  • It looks up the booking number in the bill_lading_payments.csv file to check for existing payment records.
  • If a match is found, it retrieves the corresponding amount_paid and payment_date.
  • The outstanding balance is then calculated as amount_due = cod_amount - amount_paid.

This pattern provides a high degree of flexibility for validating and enriching extracted data, while keeping prompts generic, reusable, and easy to maintain.

Webhook Source Code

The webhook referenced above is implemented using the following source code:

import json
from fastapi import FastAPI, Request, HTTPException
from fastapi.responses import JSONResponse
import uvicorn
import logging
from typing import Dict, Any
import csv

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


app = FastAPI(title="Unstract PostProcessing Webhook", version="1.0.0")


@app.post("/webhook/amount_due")
async def receive_webhook_amount_due(request: Request):
    """
    Receive webhook POST requests for amount due
   
    This endpoint accepts webhook payloads, calculates amount due from totals,
    and returns the result.
    """
    try:
        # Get the raw body as bytes
        body = await request.body()
       
        # Try to parse as JSON
        try:
            payload = await request.json()
            logger.info(f"Received webhook with JSON payload: {payload}")
        except Exception:
            # If not JSON, log as text
            payload = body.decode('utf-8')
            logger.info(f"Received webhook with text payload: {payload}")
            raise HTTPException(status_code=400, detail="Invalid JSON payload")
       
        # payload is already a dictionary at this point (request.json() parsed it)

        # Get booking number and cod amount
        booking_number = payload.get("booking_no", "")
        cod_amount = payload.get("cod_amount", 0)

        # Read bill lading payments from csv file
        amount_paid = 0
        payment_date = ""
        with open('bill_lading_payments.csv', 'r') as file:
            reader = csv.reader(file)
            for row in reader:
                # Skip header row
                if row[0] == "booking_no":
                    continue
                # Check if booking number matches
                if row[0] == booking_number:
                    amount_paid = float(row[1].replace("$", "").replace(",", "").strip())
                    payment_date = row[2]
                    break

        # Calculate amount due (cod_amount may arrive as "$ 800" or as a number)
        cod_amount = float(str(cod_amount).replace("$", "").replace(",", "").strip())
        amount_due = cod_amount - amount_paid
        response_data = {
            "booking_number": booking_number,
            "cod_amount": cod_amount,
            "amount_due": amount_due,
            "amount_paid": amount_paid,
            "payment_date": payment_date
        }
       
        # Return response
        return JSONResponse(
            status_code=200,
            content=response_data
        )   
    except Exception as e:
        logger.error(f"Error processing webhook: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Error processing webhook: {str(e)}")


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

The results were derived from data stored in a CSV file, which simulates an integration with external systems:

booking_no,amount_paid,payment_date
1234567890,1000.00,2026-01-19
3253252352,200.00,2026-01-10
4365347654,300.00,2026-01-05
758993033,350.00,2026-01-16
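For reference, the lookup and balance calculation the webhook performs can be reproduced as a standalone function. This is a sketch that mirrors the webhook logic above, inlining the CSV so it runs without the FastAPI server:

```python
import csv
import io

PAYMENTS = """booking_no,amount_paid,payment_date
1234567890,1000.00,2026-01-19
758993033,350.00,2026-01-16
"""

def compute_amount_due(booking_no: str, cod_amount: str, payments_csv: str) -> dict:
    """Look up prior payments for a booking and compute the outstanding balance."""
    amount_paid, payment_date = 0.0, ""
    for row in csv.DictReader(io.StringIO(payments_csv)):
        if row["booking_no"] == booking_no:
            amount_paid = float(row["amount_paid"].replace("$", "").replace(",", "").strip())
            payment_date = row["payment_date"]
            break
    cod = float(cod_amount.replace("$", "").replace(",", "").strip())
    return {
        "booking_number": booking_no,
        "cod_amount": cod,
        "amount_paid": amount_paid,
        "amount_due": cod - amount_paid,
        "payment_date": payment_date,
    }

print(compute_amount_due("758993033", "$ 800", PAYMENTS))
# amount_due is 450.0, matching the JSON response shown earlier
```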

Why This Approach Scales in Real Systems

This approach scales because it reduces duplication and cleanly isolates change. Instead of creating new prompts for every variation, teams maintain a smaller set of stable prompts and control behaviour through data and workflow logic.

Key reasons it scales well:

  • Fewer prompts: One prompt can serve multiple clients, regions, and use cases by injecting runtime context instead of duplicating logic.
  • Smaller, composable logic blocks: Extraction, validation, transformation, and orchestration are separated, making each piece easier to build, test, and maintain.
  • Clear separation of concerns: Prompts extract data, Custom Data defines context, webhooks enforce rules, and prompt chains handle flow control.

This design also makes change easy to absorb:

  • New clients are onboarded by passing different Custom Data, not by rewriting prompts
  • New rules are added through webhook logic or configuration updates
  • New document formats can be supported without breaking downstream workflows

The result is a document automation system that remains flexible as complexity grows, instead of becoming harder to manage over time.

Conclusion: From Extraction to Intelligent Workflows

These features move Unstract beyond traditional OCR and basic data extraction. Instead of treating documents as isolated inputs, Unstract enables intelligent workflows where extraction, context, validation, and decision-making work together as a single system.

By combining Prompt Studio, Custom Data, Post-Processing Webhooks, and Prompt Chaining, teams can build production-grade document workflows that are flexible, reliable, and easy to evolve.

Developers retain full control over business logic and integrations, while workflows automatically adapt to new formats, rules, and use cases without introducing unnecessary complexity.

If you’re exploring how to design scalable document automation, the best way to understand these concepts is to try them yourself.

Start with the provided sample documents, experiment with different configurations, and build your own adaptive workflows to see how Unstract behaves in real-world scenarios.

The post Build Document Workflows That Adapt first appeared on Unstract.com.

]]>
Automating End-to-End Document Processing Workflows with Unstract https://unstract.com/blog/automating-end-to-end-document-processing-workflows/?utm_source=rss&utm_medium=rss&utm_campaign=automating-end-to-end-document-processing-workflows Thu, 29 Jan 2026 18:01:12 +0000 https://unstract.com/?p=517679

Learn how to replace manual document processing with a controlled inbox-to-database workflow that improves accuracy, predictability and trust in downstream data.

The post Automating End-to-End Document Processing Workflows with Unstract first appeared on Unstract.com.

]]>

Organizations run on documents, yet critical data still arrives as PDFs, scans, emailed forms and spreadsheets. That creates unstructured-to-structured workflows that many teams still handle manually.

Teams manually pull fields from documents and re-enter them into finance systems, CRMs and databases, which leads to errors and unpredictable processing times. Template-based OCR and one-off scripts fail when document layouts change, forcing continuous fixes and manual intervention.

Unstract provides a production-grade inbox-to-database workflow with versioned, testable extraction and validation. This guide explains how to replace manual document processing with a controlled inbox-to-database workflow that improves accuracy, predictability and trust in downstream data.

What Are Document Processing Workflows?

A document processing workflow defines the end-to-end path a document follows from ingestion to the creation of reliable, structured data. It establishes a controlled, repeatable process that makes document-derived data predictable, auditable and suitable for use in production systems.

A simple way to understand this is to follow a monthly bank statement as it moves through each stage from ingestion to downstream delivery.

  • Ingestion: The statement arrives through a shared inbox or secure file drop. The workflow assigns an ID, records source and owner and queues the file for processing.
  • OCR and preprocessing: The workflow converts the PDF into machine-readable text while preserving layout, including tables, headings and page structure.
  • Data extraction: The workflow captures balances, transaction rows, dates, and account identifiers, then maps them to your data model and returns structured records.
  • Validation: Business rules check for missing, inconsistent or out-of-range values before any downstream write.
  • Targeted review: The workflow sends only rule failures or low-confidence fields to reviewers, who verify values against the source document.
  • Delivery: The workflow writes approved outputs to databases or exposes them through APIs in a consistent, traceable format for downstream systems.

Managed as a single workflow, this approach provides clear visibility into document status, consistent extraction behavior and operational control that isolated scripts or point automations cannot deliver.

Why Document Processing Needs Automation

Automation is essential in inbox-to-database pipelines to increase straight-through processing and control errors, exceptions and audit lineage.

  • Inbox-to-database pipelines need control at scale. Automation increases straight-through processing and reduces errors by standardising how documents enter the workflow and how exceptions get handled.
  • Manual re-keying degrades data quality. Re-entering fields from emails, PDFs and scans introduces avoidable errors that then propagate into systems of record and become costly to correct.
  • Unstructured volume and layout drift increase exception rates. As document volume grows and formats change, teams spend more time handling edge cases, remediating failures and maintaining brittle processes.
  • Audit and compliance require traceability. Regulated workflows need a reproducible record of where values came from and how the workflow captured, transformed, validated and delivered them.
  • Automation supports sustainable throughput. A well-defined workflow validates extracted fields before downstream writes and routes only exceptions for review, keeping operational effort focused on high-risk cases.

Challenges In Automating Document Processing Workflows

Document workflows fail to automate cleanly because they combine changing layouts, probabilistic extraction and compliance requirements inside a single pipeline. 

  • Document layouts change over time: Vendor invoices, bank statements and customer forms regularly change column order, field labels and table structure. If extraction depends on fixed coordinates or a single template, it will fail as soon as the next layout version arrives.
  • Template-based OCR has limitations: Traditional OCR works best on clean, consistent pages. It often fails on multi-page tables, rotated scans, stamps, handwriting, faint text and heavy headers or footers, which leads to missing or wrong values that someone must correct.
  • Every new format creates ongoing update work: New document variants force updates to templates, rules or prompts. Without regression tests on a representative set of documents, a change that fixes one format often breaks another.
  • Documents enter through multiple systems: Files arrive through shared inboxes, portals, SFTP drops and shared drives, then move across different tools. Without a single intake path, document IDs and a clear status for each step, teams lose time tracking where a document failed and who needs to act.
  • Human review becomes a queue without strict rules: Compliance and risk teams often require review for high-dollar or regulated documents. If the system cannot flag specific fields that failed checks and show the source location in the PDF, reviewers end up re-checking whole documents and throughput collapses.

The Role of AI / LLMs in Processing Unstructured Documents

Most critical business processes still rely on PDFs, scans and email attachments instead of clean, structured data. AI and LLMs enable unstructured-to-structured workflows. They convert document content into consistent fields that finance, operations and risk systems can use.

Industry examples include:

  • Insurance: AI reads claims, policies and supporting reports, extracts key identifiers such as policy numbers, claim amounts and loss details and supports faster triage, fraud checks and settlement workflows.
  • Banking and Mortgages: AI processes statements, applications and KYC documents, normalizes data for affordability and risk checks and reduces turnaround time in review cycles.
  • Logistics: AI interprets bills of lading and delivery and customs documents, captures shipment details for TMS/ERP accuracy and supports billing and exception handling.
  • Accounting and Finance: AI extracts invoice and statement data, standardizes accounts mapping, and reduces manual entry so teams can focus on reconciliation and analysis.

How AI And LLMs Change Document Processing Workflows, But Not Completely

LLMs can reduce the amount of document-specific extraction logic you must maintain. Instead of rebuilding templates when labels, wording or table layouts change, you can maintain a stable extraction specification and rely on the model to map variations to your schema.

However, this benefit holds only when inputs are controlled and verified. LLMs do not reliably compensate for poor OCR, lost table structures or scrambled reading order. If the input text is incorrect or incomplete, extracted fields will be unreliable. Weak prompts or vague schemas can also yield outputs that appear plausible but fail validation.

This is why LLMs should not operate as a standalone component in a document workflow. In production, they must be embedded in a layered pipeline that controls inputs, constrains outputs and prevents invalid data from reaching downstream systems.

At a minimum, you must set OCR confidence thresholds, preserve table structures during preprocessing and enforce schema-level validation before posting to business systems.


Automating Document Processing Workflows from Ingestion to Output with Unstract

Unstract is an open-source, low-code platform designed for unstructured data extraction and full document workflow automation, so your team can move from experiments to production without rebuilding everything from scratch.

It brings together three core components:

  • LLMWhisperer handles OCR and layout-aware text extraction, producing structured document output that preserves tables, headings and handwriting.
  • Prompt Studio defines extraction logic, so subject-matter experts and technical leads can describe required fields, refine prompts and manage extraction behaviour without scattered scripts.
  • The workflow layer automates ingestion, extraction, validation and human review as a single flow, with clear control points for manual intervention when confidence is low or rules fail.

The platform enables teams to build production-ready document pipelines that scale across business units, adapt to new document types and evolve as rules or AI models change, without constant rework.

OCR Workflows With LLMWhisperer

Handwritten air waybills are a reliable OCR test case because they mix handwriting, checkboxes and table-like blocks that generic OCR often flattens into unstructured text.

How Unstract Processes Complex Documents 

LLMWhisperer is an OCR and text-extraction service designed to produce LLM-ready output from real-world documents. It supports extraction modes tuned for different inputs, including low-quality scans, handwriting-heavy forms and table-dense documents.

Layout Preservation plays a key role in this OCR workflow, as it helps maintain the document’s reading order and structure required for extraction. It allows users to maintain field groupings and table rows rather than returning a single stream of text. This keeps labels aligned with values and line items usable for schema-shaped JSON extraction.

Steps to run later in the Playground (using the handwritten air waybill):

  1. Open the LLMWhisperer Playground and upload the handwritten air waybill (PDF or image).
  2. Select an extraction mode that matches the document (form/handwriting-friendly when checkboxes and handwriting matter).
  3. Run the extraction.

Running the OCR extraction

  4. Review the output for:
    • Handwritten fields captured as text (not dropped).
    • Checkbox states represented clearly.
    • Table sections preserved as rows/columns (not merged into a blob).

Extracting Text from a Credit Application PDF Using the LLMWhisperer API (Postman)

These steps show how to send a credit application PDF to LLMWhisperer, track processing and retrieve layout-preserving text for downstream extraction.

  • Get the Postman collection and your API key
    • Open the API Keys area in LLMWhisperer
    • Copy your API key and download the pre-built Postman collection (if provided).
Downloading the Postman Collection from the API Keys Section in LLMWhisperer
  • Import the collection into Postman
    • Import the collection file.
    • Set the API key as an environment variable (or directly in the request headers), so every request authenticates correctly.
  • Upload the credit application PDF (POST)
    • Use the collection’s POST upload request.
    • Attach the credit application PDF as the file payload.
    • Send the request.
    • The response should return a processing status and a document identifier (often a hash or ID).
Sending credit application PDF via POST

  • Poll for completion (GET status)
    • Use the GET status request with the returned document ID.

Continue until the status changes to complete (or equivalent).

GET status with the document ID until it shows “complete”

  • Retrieve the extracted text (GET result)
    • Use the GET result request for the same document ID.
    • Save the returned output. This should be layout-consistent text that you can feed into Unstract for structured extraction.
Use GET result with the same ID, then save the extracted text
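The same upload, poll and retrieve flow can be scripted outside Postman. The base URL, endpoint paths and the unstract-key header below reflect the LLMWhisperer v2 API as documented at the time of writing; confirm them against the current API reference before relying on this sketch:

```python
import json
import time
import urllib.request

# Assumed v2 base URL; verify against the LLMWhisperer API documentation.
BASE = "https://llmwhisperer-api.us-central.unstract.com/api/v2"

def _call(url: str, api_key: str, data=None) -> bytes:
    """Issue a request with the unstract-key auth header and return the body."""
    req = urllib.request.Request(url, data=data, headers={"unstract-key": api_key})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def upload(pdf_path: str, api_key: str) -> str:
    """POST the PDF bytes; the response carries a whisper_hash used for polling."""
    with open(pdf_path, "rb") as f:
        body = _call(f"{BASE}/whisper", api_key, data=f.read())
    return json.loads(body)["whisper_hash"]

def wait_for_text(whisper_hash: str, api_key: str, poll_seconds: int = 3) -> str:
    """Poll the status endpoint until processing completes, then fetch the text."""
    while True:
        status = json.loads(
            _call(f"{BASE}/whisper-status?whisper_hash={whisper_hash}", api_key)
        )
        if status.get("status") == "processed":
            break
        time.sleep(poll_seconds)
    return _call(
        f"{BASE}/whisper-retrieve?whisper_hash={whisper_hash}", api_key
    ).decode("utf-8")

if __name__ == "__main__":
    h = upload("credit_application.pdf", "YOUR_API_KEY")
    print(wait_for_text(h, "YOUR_API_KEY"))
```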


Unstract: Unstructured Data Extraction & Automating Document Processing Workflows

LLMWhisperer is a standalone OCR API. If your team already handles structured data extraction or has existing systems in place, and OCR is the only missing piece, you can use LLMWhisperer on its own—either as a standalone tool or via API.

Unstract is an end-to-end document processing workflow automation platform. At its core is Prompt Studio, where you write natural language prompts to extract structured data from documents. Unstract integrates seamlessly with LLMWhisperer and other OCR tools, making OCR just one step in a complete extraction workflow.

Every Unstract workflow begins with creating a Prompt Studio project.

Create a New Prompt Studio Project

  • Create a project specifically for bank statement PDFs.
  • Upload one or more representative bank statements as test inputs.
Create a Prompt Studio project for bank statements and upload sample PDFs

Write Extraction Prompts for Your Target Schema

  • Define the fields you need as structured output. For example:
    • Account holder name, account number (masked if needed), bank name
    • Statement period start/end dates
    • Opening balance, closing balance
    • Transaction table rows (date, description, debit/credit, amount, running balance)
    • Fees, interest, totals (if present)
  • Write prompts that return JSON matching your schema (field names and formats aligned to your database).
Write extraction prompts that return JSON matching the target schema

Configure Project Settings

Before creating a Prompt Studio project, you need to configure the extraction engine. This is the essential first step—it powers everything that follows.

You’ll set up four key components:

  1. LLM model – Provides the intelligence for extraction
  2. Vector database – Stores and retrieves document embeddings
  3. Embedding model – Converts documents into vector representations
  4. OCR text parser – Extracts text from documents (LLMWhisperer integration available, plus support for other parsers)

Once these are configured, you’re ready to create your Prompt Studio project.

  • Select the LLM model that will read the extracted text and return structured JSON in your target schema.
  • Select the embeddings model that turns your statements into vectors so you can run retrieval, similarity search and reuse prompts across documents.
  • Configure the vector database that stores these embeddings to enable fast search, indexing and retrieval across many statements.
  • Set the text extraction tool or input source so the project uses LLMWhisperer output for the bank statement PDF.
Set project LLM, embeddings model, vector DB and text extractor tool

Export the Prompt Studio Project as a Reusable Tool

  • Export the project so it can be invoked from workflows as a single extraction step.
Export the Prompt Studio project as a reusable extraction tool for workflows

Create the Workflow and Wire Inputs or Outputs

  • Create a new workflow in Unstract.
Create a new Unstract workflow and connect the project inputs/outputs

  • Set the API file upload (bank statement PDF) as the input.
  • Add the exported bank statement extraction tool as the processing step.
  • Set the output to an API response.
Set API upload as input, run the extraction tool, return JSON as API output

Deploy the Workflow as an API and Test with Postman

  • Deploy the workflow.
  • Download the workflow’s Postman collection (if Unstract provides one) or create requests manually.
  • Import the collection into Postman.
Download the API collection from Unstract and import it into Postman

Send a POST request with a bank statement PDF.

Select the sample bank statement and send it via a POST request
  • Verify the response returns clean JSON matching your schema.
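A quick way to verify that is a structural check of the response against your target schema. The field names below are the illustrative ones from this walkthrough, not a fixed Unstract contract:

```python
# Structural check on the workflow's JSON response before trusting it
# downstream. Field names mirror the example bank-statement schema above.
REQUIRED = ["bank_name", "account_holder_name", "opening_balance",
            "closing_balance", "transactions"]

def matches_schema(payload: dict) -> list:
    """Return a list of problems; an empty list means the shape looks right."""
    problems = [f for f in REQUIRED if f not in payload]
    for i, row in enumerate(payload.get("transactions", [])):
        for col in ("date", "description", "amount"):
            if col not in row:
                problems.append(f"transactions[{i}] missing {col}")
    return problems

sample = {
    "bank_name": "Example Bank",
    "account_holder_name": "J. Doe",
    "opening_balance": 1200.00,
    "closing_balance": 910.50,
    "transactions": [
        {"date": "2026-01-03", "description": "Card payment", "amount": -289.50}
    ],
}
print(matches_schema(sample))  # [] means the response matches the target schema
```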

Automate Document Processing Workflow with ETL Pipeline

An ETL workflow lets you run bank statement extraction as a repeatable pipeline. Files are picked up from a source folder, processed through your exported Prompt Studio project and written to a database table for downstream use.

How the ETL Workflow Runs

  • Input: Bank statement PDFs are pulled automatically from Google Drive or Dropbox.
  • Processing: The workflow calls your exported Prompt Studio project (BankStatement) to extract schema-shaped JSON.
  • Output: Results are written to Postgres (NeonDB) using Unstract’s PostgreSQL connector.

Key Steps to Configure and Deploy Workflow to an ETL Pipeline

  • Configure Source Connector: Select and configure your input connector.
    • Connector: Drive_Folder_Bank_Statements
    • Confirm: folder path, file type filter (PDF) and polling/schedule settings.
Configure the Drive folder connector with PDF filter and schedule settings

  • Configure Destination Connector: Select and configure your output connector.
    • Connector: Unstract’s PostgreSQL connector (Postgres via NeonDB)
    • Confirm: connection details, target schema/table and write mode (insert/upsert).
Configure the PostgreSQL connector with database, table and write-mode settings

  • Select Exported Prompt Studio Project: Choose the exported Prompt Studio project that will process each incoming file.
    • Project: BankStatement
    • Confirm: the project outputs fields and transaction rows in the JSON structure your table expects.
Select the BankStatement project and confirm its JSON matches the table schema

  • Deploy Workflow: Deploy the workflow so it runs end-to-end for new files.
    • Confirm: deployment status is active and the workflow can process a sample statement from the source folder through to Postgres.

Automating Document Processing Workflow with n8n

In this workflow, n8n runs an agentic intake loop for bank-statement PDFs. Documents arrive through approved sources (email or SFTP) and n8n does more than deposit files in a folder. It evaluates intake signals, selects the next step from an approved set of actions and records each decision with traceable metadata.

Once a document passes intake policy checks and is routed correctly, Unstract’s workflow layer performs OCR (via LLMWhisperer), extraction (via Prompt Studio), validation and human review where required.

You can set up the following n8n intake flow for bank statements:

  • Trigger and intake: Trigger on a new email from approved senders or an SFTP upload. Assign a unique document ID and capture source metadata (sender, received timestamp, channel, mailbox/SFTP path).
  • Policy gating (accept, quarantine or hold): Enforce intake rules before the document enters the pipeline. Reject non-PDFs, detect password-protected or encrypted PDFs where possible and quarantine corrupted or out-of-policy files with a reason code. Where resubmission is required, notify the owner of the approved format requirements.
  • File-level inspection (routing signals, not extraction): Capture lightweight signals such as file size, page count (if available) and whether the PDF appears text-based versus image-based. Use these signals to route documents through the appropriate processing path.
  • Rule-based routing to the correct processing path: Route documents to the bank-statement workflow based on trusted intake context (approved mailbox, sender allowlist, subject conventions or SFTP drop location). Route unknown or non-conforming items to a separate triage path rather than forcing them into the bank-statement pipeline.
  • Deduplication (intake signature): Deduplicate using a stable signature available at intake (for example, file hash plus sender and a received-time window) to prevent reprocessing when filenames vary.
  • Normalize filenames and attach canonical metadata: Rename using an operational deterministic pattern that does not depend on extracted content (for example, source_sender_YYYY-MM-DD_documentID.pdf). Attach canonical metadata (document ID, route chosen, reason codes, intake timestamp), so downstream systems can maintain lineage.
  • Handoff to Unstract (two deployment patterns): Choose one of the following handoff methods based on your operational needs:
    • Folder handoff: Upload to the designated Drive path (for example, /Finance/BankStatements/YYYY/MM) that Unstract’s ETL workflow monitors.
    • API handoff: When near real-time processing is required, call the deployed Unstract workflow API directly and persist the returned correlation ID for status tracking.
  • Outcome tracking and operational controls: Log each handoff with a correlation ID and track processing outcomes (success, queued for human review, failed). Use retries with backoff for transient errors, alerts for repeated failures and quarantine routing when items repeatedly fail intake or handoff.
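The intake signature used for deduplication can be sketched as follows. The 24-hour window and the lowercased sender are assumptions to tune for your intake volume, and bucketing by fixed windows means arrivals straddling a boundary get different signatures:

```python
import hashlib
from datetime import datetime, timedelta

WINDOW = timedelta(hours=24)  # received-time window; an assumption to tune

def intake_signature(pdf_bytes: bytes, sender: str, received: datetime) -> str:
    """Stable signature: file hash + normalized sender + bucketed receive time."""
    # Floor the receive time to the start of its window bucket.
    bucket = received - (received - datetime.min) % WINDOW
    digest = hashlib.sha256(pdf_bytes).hexdigest()
    return f"{digest}:{sender.lower()}:{bucket.isoformat()}"

_seen = set()

def is_duplicate(sig: str) -> bool:
    """True when the same signature was already processed in this run."""
    if sig in _seen:
        return True
    _seen.add(sig)
    return False
```

This way, the same PDF re-sent with a different filename within the window produces the same signature and is skipped instead of reprocessed.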

For a concrete walkthrough of the exact nodes and configuration, watch Building Agentic Document Workflows with Unstract + n8n and review the companion blog.

Human in the Loop (HITL) Workflow for Document Processing

Unstract positions HITL review between automated extraction and the final destination, so reviewers can validate or correct extracted fields before the workflow stores results in a database or passes them to the next processing step.

You can follow the steps below to set up HITL in your ETL document processing workflow:

  • Rules for review routing: Set the percentage of documents that require manual review, choose AND/OR logic and add conditions based on extracted fields. Any document that matches the rule routes to HITL review. 
Set review rules so matching documents are routed to HITL for manual review

  • Post-review routing: After approval, configure where Unstract sends results (for example, the destination database or a queue for further processing).
  • Roles and permissions: Assign roles such as reviewer and supervisor to control who reviews and who approves.
  • Reviewer operations: Use the Human Quality Review dashboard to pull the next queued document and track review status. 
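The AND/OR rule logic behaves roughly as below. The (field, operator, value) condition format is illustrative, not Unstract's actual configuration schema:

```python
import operator

# Supported comparison operators for review conditions (illustrative).
OPS = {">": operator.gt, "<": operator.lt, "==": operator.eq}

def needs_review(fields: dict, conditions: list, logic: str = "AND") -> bool:
    """Evaluate (field, op, value) conditions; matching documents route to HITL.

    Assumes every referenced field is present with a comparable value.
    """
    results = [OPS[op](fields.get(name), value) for name, op, value in conditions]
    return all(results) if logic == "AND" else any(results)

rules = [("cod_amount", ">", 500), ("payment_date", "==", None)]
print(needs_review({"cod_amount": 800, "payment_date": None}, rules, "AND"))
# High-value shipment with no recorded payment date: routed to review
```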

For a more in-depth review of the process, you can refer to the webinar recording, which includes a live demo covering setup, access permissions and the end-to-end review process.


Stepping into Modern Document Processing Workflow with Unstract

End-to-end document automation delivers ROI more reliably when it runs as a version-controlled pipeline. That means clear ownership, tests and monitoring, rather than a pile of templates and scripts.

LLM-based OCR provides clean, layout-aware input for each document, which gives downstream extraction a reliable starting point. When you run the workflow as a version-controlled pipeline, you also maintain clear ownership, testing and monitoring instead of relying on brittle templates and ad hoc scripts.

Unstract combines layout-preserving OCR with prompt-based extraction to return JSON aligned with your target schema. It then validates that output with deterministic rules before any database write or downstream API call.

Most documents proceed automatically. The workflow routes only the exceptions to Human-in-the-Loop (HITL), such as fields or documents that fail your rules or fall below configured confidence thresholds.

Reviewers validate the extracted values in the context of the source PDF, which supports auditable approvals. Those review outcomes also help you refine prompts and rules so the workflow becomes more automated and more dependable over time.

Build your first extraction project in Prompt Studio using a small, representative set of real bank statements or invoices. Export the project and run it in your ETL pipeline or n8n intake, then add validation and exception routing early so layout changes or model updates do not create an expanding review queue.


Document Processing Workflow: FAQ

What is a document processing workflow, and why does it need automation?

A document processing workflow defines the end-to-end path a document follows from ingestion to structured data output. Automation is essential because manual re-keying introduces errors, layout changes increase exception rates, and audit requirements demand traceability that ad hoc scripts cannot provide.

How does an AI-powered intelligent document processing workflow differ from traditional OCR approaches?

A modern intelligent document processing workflow combines layout-aware OCR, LLM-based extraction, and validation rules instead of relying on fixed templates. Unlike traditional OCR that fails when column order or table structure changes, intelligent workflows adapt to layout drift and route only exceptions for human review.

What role do LLMs play in modern document workflow automation?

LLMs convert unstructured content into consistent, schema-shaped fields that finance and operations systems can use. However, they must be embedded within a layered document workflow automation pipeline that controls inputs, enforces validation, and prevents invalid data from reaching downstream systems.

How does Unstract enable end-to-end document workflow automation?

Unstract combines LLMWhisperer for layout-aware OCR, Prompt Studio for defining extraction logic, and a workflow layer for automating ingestion, validation, and delivery. This creates a complete intelligent document processing workflow that scales across business units without constant rework.

What is a Human-in-the-Loop (HITL) workflow, and when is it needed?

A HITL workflow sits between automated extraction and final delivery, allowing reviewers to validate fields that fail confidence thresholds or business rules. In document workflow automation, this ensures only exceptions require manual review while most documents proceed automatically.



The post Automating End-to-End Document Processing Workflows with Unstract first appeared on Unstract.com.

]]>
LLMWhisperer: Best OCR for Document Management https://unstract.com/blog/llmwhisperer-best-ocr-for-document-management/?utm_source=rss&utm_medium=rss&utm_campaign=llmwhisperer-best-ocr-for-document-management Thu, 22 Jan 2026 18:23:26 +0000 https://unstract.com/?p=517712

Learn how LLMWhisperer and Unstract handle document management end-to-end. LLMWhisperer acts as a next-generation OCR and document parsing engine, preserving layout, understanding checkboxes and handwriting, and extracting high-fidelity data from all major formats, while Unstract applies LLMs for enterprise-grade classification, splitting, parsing, and automated workflows.

The post LLMWhisperer: Best OCR for Document Management first appeared on Unstract.com.

]]>

Introduction

Managing documents used to be a simple problem — a filing cabinet, a label, and a person who knew where everything lived. Today, that world no longer exists. Modern businesses generate thousands of documents every week: scanned invoices, onboarding forms, contracts, compliance reports, shipping manifests, insurance claims, HR files, purchase orders, project documents, and more. Each one contains critical information, but they arrive from different sources, in different formats, and often as scanned PDFs or smartphone photos.

As organizations scale, this creates a massive operational burden. Teams spend hours searching for files, retyping information, correcting errors, or hunting down missing paperwork. The cost of this inefficiency becomes clear:

  • Workflows slow down because teams can’t find the right documents
  • Compliance risks increase when information is misplaced
  • Operational insights get buried in unstructured PDFs
  • Manual data entry drains time and increases error rates

This is why more companies are turning to OCR-powered document management software — systems that not only store documents but also understand them. OCR (Optical Character Recognition) converts scanned documents into searchable, usable data, transforming a folder full of PDFs into a structured, queryable knowledge layer.

However, traditional OCR is no longer enough. Businesses now handle multilingual documents, low-quality scans, handwritten forms, and complex layouts like tables, form fields, checkboxes, and spreadsheets. This shift demands OCR engines that are fast, accurate, layout-aware, AI-friendly, and easy to integrate into modern document workflows.

This is where LLMWhisperer and Unstract enter the picture.
LLMWhisperer acts as the next-generation OCR/document parsing engine — capable of handling all major formats while preserving layout, understanding checkboxes, reading handwriting, and extracting data with high fidelity. Unstract completes the workflow by applying LLMs to the extracted text, enabling enterprise-grade document classification, splitting, parsing, and data automation.

Together, they redefine what businesses expect from OCR in document management — not just text extraction, but intelligent, scalable, automation-ready data processing.

What is Document Management?

Document management is the discipline of capturing, storing, organizing, and retrieving documents in a secure, searchable, and compliant way. Modern organizations rely on a Document Management System (DMS) to ensure that documents — whether digital or scanned — are always available to the right people at the right time.

A robust DMS typically includes:

  • Document capture (upload, scan, import from emails, cloud drives, APIs)
  • Storage & repository management
  • Search & retrieval (meta tagging, OCR search, full-text search)
  • Version control & audit trails
  • Security & permissions
  • Document classification & routing
  • Workflow automation (reviews, approvals, notifications)

Historically, document management was manual. Paper files lived in cabinets. “Search” meant asking someone in the office who remembered where a document might be. Classification relied on colored folders. Retrieval required physically walking to storage rooms.

As businesses digitized, these systems evolved into electronic DMS solutions, allowing teams to upload PDFs instead of filing paper. But digitization alone created a new problem: digital clutter. If a company uploads 10,000 PDFs to a shared drive without structure, it becomes as chaotic as the paper era.

That is why the industry shifted to intelligent document management, where the system not only stores PDFs but also extracts and understands the content inside them — using OCR, AI, and automation.

Real-World Examples of Document Management in Action

  1. HR Departments
    • Store employee contracts, onboarding forms, ID documents, performance reviews
    • OCR enables quick search: “Show me all employees with contract renewal in 2024”
  2. Finance & Accounts Payable
    • Automate invoice capture, extract vendor name, amount, due date
    • Reduce manual entry and eliminate human errors
  3. Legal & Compliance Teams
    • Manage contracts, agreements, regulatory filings
    • Ensure versions are tracked and documents are audit-ready
  4. Insurance & Banking
    • Process scanned claims, KYC forms, policy documents
    • Classify and extract data automatically using OCR document management tools
  5. Operations & Logistics
    • Manage bills of lading, shipping manifests, delivery receipts
    • Use OCR to extract shipment details instantly

Without a modern DMS — especially one enhanced by OCR — companies lose visibility into their most critical information. With it, they gain speed, efficiency, compliance, and the ability to automate previously manual processes.

Why Document Management Is Important in Business

Modern businesses run on information. Contracts, invoices, HR files, compliance documents, vendor agreements, tax records, customer files, policy papers—every department depends on accurate and timely access to documents. Without a proper document management system (DMS), even a fast-growing company can collapse under the weight of its own paperwork.

A strong document management strategy is no longer optional—it is foundational. Here’s why:

1. Centralized Storage, Version Control, and Audit Trails

When documents are scattered across emails, desktops, shared drives, and paper folders, confusion becomes inevitable. A DMS creates a single source of truth, ensuring that:

  • Every document has one authoritative version
  • Teams always know who uploaded, edited, or approved a file
  • Old versions are archived, not lost
  • Decision-makers work with the correct data, every time

In industries like legal, finance, insurance, and healthcare, version control and audit logs are not just helpful—they are legally required. A DMS automatically maintains audit trails, tracking every change for compliance and transparency.

2. Compliance and Data Governance

Regulatory frameworks such as GDPR, SOC 2, HIPAA, and ISO standards demand strict control over document access, retention, and privacy. Without automation:

  • Sensitive documents may be overexposed
  • Retention periods may be missed
  • Unauthorized edits may go unnoticed
  • Compliance audits become painful and expensive

A robust DMS enforces access permissions, retention rules, encryption, and role-based controls—ensuring compliance without relying on manual policing.

3. Improved Accessibility, Collaboration, and Decision-Making

Remote teams, distributed offices, and digital workflows need instant access to documents. A modern DMS enables:

  • Full-text search to find documents instantly
  • Cloud access for remote collaboration
  • Automated routing and approvals
  • Faster decision-making backed by accurate information

When documents are accessible and searchable, teams spend less time hunting for files and more time acting on insights. This accelerates everything—from approving invoices to onboarding employees to negotiating contracts.

Role of OCR in Document Management Systems

Traditional document management systems focus on storage. Modern systems focus on understanding documents — and that leap is powered by OCR (Optical Character Recognition).

OCR transforms unstructured content—scanned PDFs, images, faxed forms, photographed documents—into searchable, machine-readable text. In other words, OCR is what turns a folder full of scanned paperwork into a structured knowledge base.

1. What OCR Means Inside a Document Management System

In a DMS, OCR performs three major functions:

  • Text Extraction
    Converts images and scanned PDFs into digital text.
  • Content Indexing
    Enables full-text search, filtering, and intelligent retrieval.
  • Data Structuring
    Supports automatic extraction of fields, tables, labels, checkboxes, and handwritten notes.

This is why modern solutions are often referred to as OCR document management systems—because OCR is now the engine that powers intelligent document workflows.

2. How OCR Document Management Systems Automate Data Extraction

A DMS equipped with OCR does much more than store files. It automates critical workflows:

  • Extracting key fields (names, dates, amounts) from invoices, forms, contracts
  • Reading tables in financial or operational documents
  • Capturing handwritten notes in applications or inspection reports
  • Recognizing checkbox states in forms
  • Auto-tagging documents with metadata like document type, category, or department

This automation allows businesses to eliminate manual data entry—one of the biggest sources of delays and human error.

3. Benefits of OCR in Document Management

Here are the real-world advantages organizations gain by using OCR:

Speed

Data that once took days to extract can now be processed in seconds.

Searchability

With OCR, even scanned documents become fully searchable. You can instantly search for:

  • Contract names
  • Vendor IDs
  • Policy numbers
  • Dates or amounts
  • Customer information

Metadata Generation

OCR enables automatic tagging:
For example:
“Invoice → Vendor: ABC Corp → Amount: $12,543 → Due: 30 days”

These tags fuel intelligent routing, approvals, and analytics.
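
Tags like these can be derived from OCR output with simple pattern matching. A minimal sketch, assuming the field labels shown in the example above (real layouts vary, and production pipelines would use a proper extraction step rather than regexes):

```python
import re

# Sketch: turn OCR'd invoice text into routing tags.
# The field labels ("Vendor:", "Amount:", "Due:") are illustrative assumptions.

def tag_invoice(text: str) -> dict:
    vendor = re.search(r"Vendor:\s*(.+)", text)
    amount = re.search(r"Amount:\s*\$?([\d,]+(?:\.\d{2})?)", text)
    due = re.search(r"Due:\s*(.+)", text)
    return {
        "vendor": vendor.group(1).strip() if vendor else None,
        "amount": amount.group(1) if amount else None,
        "due": due.group(1).strip() if due else None,
    }

text = "Invoice\nVendor: ABC Corp\nAmount: $12,543\nDue: 30 days"
print(tag_invoice(text))  # {'vendor': 'ABC Corp', 'amount': '12,543', 'due': '30 days'}
```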

Reduced Human Error

Manual data entry is slow and prone to mistakes. OCR eliminates typos, copy-paste issues, and inconsistent labeling.

Better Decision-Making

When documents become structured data, organizations gain insights into:

  • Cash flow
  • Contract risks
  • Vendor performance
  • Compliance exposure
  • Operational bottlenecks

OCR turns document chaos into a structured, searchable, business-ready dataset.

Why Accurate OCR Is Vital

In document-driven industries, the value of OCR isn’t simply about “reading text.” It’s about reading correct text. A single misread digit, checkbox, or date can ripple across compliance workflows, billing systems, customer communication, and automated decision engines.

This is why accuracy, especially in OCR document management systems, is non-negotiable.

1. How OCR Errors Impact Compliance, Billing, and Automation

Even minor extraction errors can create major downstream consequences:

  • Compliance failures
    Misreading a policy number, contract clause, or expiration date can trigger audit issues, penalties, or legal exposure.
  • Billing discrepancies
    A misplaced decimal in an invoice amount or tax figure can lead to overbilling, underbilling, or reconciliation delays.
  • Broken automation flows
    Automated document routing and approval systems depend on correct fields.
    If OCR mislabels “Vendor Name” or misreads an “Invoice Due Date,” the workflow fails silently.
  • Customer dissatisfaction
    Incorrect extraction in claims, applications, or customer forms results in longer turnaround times and repeated document submissions.

Accurate OCR isn’t just about convenience—it directly influences operational reliability and financial accuracy.

2. Accuracy Challenges: Scanned, Handwritten, Multilingual Documents

Modern organizations deal with documents in every possible condition:

  • Low-resolution scans from offices or mobile apps
  • Handwritten notes, signatures, and free-text fields
  • Multilingual content and mixed-script PDFs
  • Documents with noise, shadows, stains, folds, or aging artifacts
  • Forms with checkboxes, radio buttons, and micro-labels

Traditional OCR engines often fail here—especially with handwriting or multilingual content.
This is where next-generation tools like LLMWhisperer excel, offering high accuracy even on low-quality inputs.

3. Importance of Layout Preservation and Data Normalization

Accuracy is not only about text—it’s also about structure.

Document management workflows rely heavily on layout fidelity:

  • Tables must maintain rows and columns
  • Headings must map to the right fields
  • Checkboxes must be extracted as clear booleans
  • Numeric data should maintain decimals, symbols, and currency
  • Dates must be preserved or normalized into standard formats

Without layout preservation, document automation collapses.
With it, OCR results become clean, structured, and ready for downstream AI or rule-based processing.
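
The normalization step can be sketched as below; the input formats (US-style currency, dd/mm/yyyy dates) are assumptions for illustration, and real pipelines handle many more variants:

```python
from datetime import datetime
from decimal import Decimal

# Sketch: normalize raw OCR field values into standard types.
# The input formats handled here are illustrative assumptions.

def normalize_amount(raw: str) -> Decimal:
    """'$12,543.00' -> Decimal('12543.00'), preserving decimals exactly."""
    return Decimal(raw.replace("$", "").replace(",", "").strip())

def normalize_date(raw: str) -> str:
    """'02/02/2025' (dd/mm/yyyy) -> ISO '2025-02-02'."""
    return datetime.strptime(raw.strip(), "%d/%m/%Y").date().isoformat()

print(normalize_amount("$12,543.00"))  # 12543.00
print(normalize_date("02/02/2025"))    # 2025-02-02
```

Using `Decimal` rather than `float` avoids rounding drift in currency values, which matters for the billing and reconciliation workflows discussed earlier.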

Selecting the Right OCR for Document Management

Choosing the right OCR document management software can make or break your automation strategy. The ideal OCR engine must balance speed, accuracy, flexibility, and developer-friendliness, all while fitting seamlessly into your existing DMS stack.

Here are the key considerations.

1. Evaluation Criteria for Modern OCR Engines

When comparing document management OCR tools, organizations typically assess:

Speed

The engine must process thousands of pages efficiently, especially in bulk ingestion scenarios.

Accuracy

Core for structured documents (invoices, forms) and unstructured documents (contracts, letters).
Accuracy includes handwriting recognition, table fidelity, checkbox detection, and multilingual support.

Supported Formats

A strong OCR engine should handle:

  • PDFs (native + scanned)
  • Images (JPG, PNG, TIFF, WebP)
  • Office files (DOCX, XLSX, PPTX)
  • Form-heavy PDFs
  • Mixed-content pages

This is critical for enterprise DMS pipelines where documents come from diverse sources.

Multilingual Capability

Global organizations demand OCR that can parse 100+ languages—including dialects, accented text, and mixed-language content.

Integration Flexibility

Systems should provide:

  • REST APIs
  • SDKs or client libraries
  • Webhooks
  • On-premise deployment options

This ensures compatibility with platforms like SharePoint, Alfresco, OpenText, Box, OneDrive, or custom DMS solutions.

2. Cloud vs. On-Premise OCR for Document Management

Cloud OCR

  • Easy to deploy
  • Low infrastructure overhead
  • Perfect for general files and distributed teams

On-Premise OCR

  • Required in regulated industries (finance, healthcare, insurance)
  • Ensures complete data security and sovereignty
  • Enables processing sensitive documents fully within private infrastructure

LLMWhisperer uniquely offers both models — cloud-based simplicity and secure on-premise deployment.

3. Why Enterprises Prefer AI-Augmented OCR Engines Like LLMWhisperer

Legacy OCR engines rely solely on pattern recognition.
Modern document ecosystems require much more:

  • Layout preservation for tables, forms, and contracts
  • Handwriting recognition
  • Checkbox/radio button detection
  • Low-fidelity document enhancement
  • Spatial mapping through bounding boxes
  • Support for high-entropy or multi-format documents

LLMWhisperer delivers all this—while staying AI-friendly, meaning it prepares perfect input for downstream LLMs in document management workflows.

This combination of:

  • High accuracy
  • Multi-format support
  • Enterprise-grade integration
  • On-premise availability
  • Layout-preserving output

is exactly why organizations now choose LLMWhisperer as their primary OCR for document management.

What is LLMWhisperer?

LLMWhisperer is Unstract’s high-precision OCR and text-parsing engine designed specifically for structured document understanding. Unlike traditional OCR tools that simply read characters from a PDF or image, LLMWhisperer focuses on preserving the structure, layout, and semantics of a document so that downstream automation systems — including LLMs — can interpret the content accurately.

Not an LLM — but the ideal preprocessing layer for LLMs

A key distinction is that LLMWhisperer is not a large language model.
It does not generate or infer meaning. Instead, its job is to:

  • Extract raw text with exceptional accuracy
  • Preserve layout, indentation, tables, checkboxes, and spatial regions
  • Clean and normalize messy scans, photos, and multi-format files
  • Output AI-ready text that LLMs can reason over without confusion

Think of LLMWhisperer as the bridge between messy real-world documents and intelligent AI processing:

  • OCR → structure preserved
  • Structure preserved → LLMs understand relationships
  • LLMs understand relationships → clean, structured data

This makes it indispensable for modern document management systems where PDFs, TIFF scans, Excel sheets, and photographed documents all flow into a central automation pipeline.

The Bridge Between Raw Text and Intelligent Parsing

LLMWhisperer solves the biggest failure point in legacy OCR workflows:
OCR extracts text, but AI needs structure.

For example:

  • Invoices have columns
  • Claims forms have checkboxes
  • Contracts have indentation and clause hierarchy
  • Financial statements have multi-row, multi-sheet tables

If OCR destroys the structure, downstream extraction breaks.
LLMWhisperer preserves:

  • Column alignment
  • Table structures
  • Visual markers
  • Line numbers
  • Bounding boxes
  • Checkmark states
  • Mixed-language text

It guarantees that the output is not just text — but organized text, ready for any AI, rule-based, or workflow engine.

Why LLMWhisperer Is the Best OCR for Document Management

Modern document management systems (DMS) require more than scanned-PDF OCR. They need a robust engine that can handle:

  • Millions of documents
  • Multiple formats (PDF, images, Word, Excel, CSV)
  • Noisy scans and mobile captures
  • Complex financial tables
  • Forms, checkboxes, radio buttons
  • Multilingual text

LLMWhisperer was built for exactly this environment.

1. Scalability at Enterprise Level

Businesses managing HR archives, insurance forms, legal files, or financial documents must process high volumes without failures.
LLMWhisperer delivers:

  • High-throughput processing
  • Stable performance across thousands of pages
  • Auto-repair of problematic PDFs
  • Intelligent fallback modes for low-quality inputs

Whether processing a handful of documents or an entire archive, it remains fast, predictable, and accurate.

2. Industry-Leading Layout Accuracy

OCR accuracy means nothing if the structure collapses.
LLMWhisperer’s layout-preserving output ensures:

  • Tables maintain row/column alignment
  • Multi-level lists and clauses retain indentation
  • Tables from Excel remain parseable
  • Forms keep checkbox states
  • Even complex insurance, banking, and healthcare PDFs remain intact

This level of fidelity makes it ideal for any document management OCR workflow where structure carries meaning.

3. Exceptional Low-Fidelity Tolerance

Real-world documents are rarely perfect.

  • Shadows
  • Folds
  • Skewed camera angles
  • Faint handwriting
  • Mixed fonts
  • Watermarks

LLMWhisperer’s preprocessing engine applies:

  • De-skewing
  • Denoising
  • Auto-contrast
  • Median/Gaussian filtering
  • AI-enhanced image correction

Even documents considered “unusable” by traditional OCR engines become readable and well-structured.

4. Reliability Across All Common File Types

LLMWhisperer supports an unusually broad set of formats essential for document management:

  • PDFs (native + scanned)
  • TIFF, JPG, PNG, BMP
  • DOC / DOCX
  • XLS / XLSX
  • ODT, ODS, ODP
  • CSV, TXT, XML, HTML

This means a DMS no longer needs multiple tools for different files — LLMWhisperer handles them end-to-end.

5. Integration-Ready API + Secure On-Premise Deployment

Every modern DMS needs an OCR engine that “plugs in” easily.
LLMWhisperer exposes a clean REST API:

  • Simple file→text endpoint
  • Multiple output modes (native, low-cost, high-quality, form, table)
  • Webhook support
  • Easy Postman testing
  • SDKs for Python, JS, and n8n automation

For regulated industries (insurance, banking, government), LLMWhisperer also offers self-hosted, on-premise deployment — giving full:

  • Data control
  • Infrastructure control
  • Compliance alignment (HIPAA, GDPR, SOC requirements)

This combination of ease-of-integration + enterprise security makes it uniquely suited for modern document management.
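
To illustrate the integration surface described above, here is a minimal Python sketch that assembles a file-to-text request. The endpoint URL, header name, and parameter names below are assumptions made for illustration; verify them against the official LLMWhisperer API reference before use.

```python
# Sketch: assembling a file->text request for the LLMWhisperer REST API.
# NOTE: the endpoint URL, header name, and parameter names below are
# assumptions made for illustration -- verify them against the official
# API reference before use.

API_URL = "https://llmwhisperer-api.us-central.unstract.com/api/v2/whisper"  # assumed

def build_whisper_request(api_key: str, mode: str = "form",
                          output_mode: str = "layout_preserving") -> dict:
    """Assemble headers and query parameters for a whisper call."""
    return {
        "headers": {"unstract-key": api_key},
        "params": {"mode": mode, "output_mode": output_mode},
    }

req = build_whisper_request("YOUR_API_KEY")
# To actually send (requires the `requests` package):
#   import requests
#   resp = requests.post(API_URL, headers=req["headers"], params=req["params"],
#                        data=open("claim_form.pdf", "rb"))
```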

Key Features of LLMWhisperer 


🔹 1. Comprehensive File Format Support 

LLMWhisperer is engineered to ingest nearly every file type encountered in modern insurance operations. Its versatility eliminates the need for pre-conversion workflows and ensures document pipelines remain clean and predictable.

Supported Formats (All-in-One Table)

  Category            Formats
  Word Processing     DOCX, DOC, ODT
  Presentations       PPTX, PPT, ODP
  Spreadsheets        XLSX, XLS, ODS
  Documents & Text    PDF, TXT, CSV, JSON, TSV, XML, HTML
  Images              BMP, GIF, JPEG, JPG, PNG, TIF, TIFF, WEBP

Insurance relevance:

  • Claims photos from field agents (JPG/PNG)
  • Excel-based underwriting or performance reports (XLS/XLSX)
  • Typed policy documents and endorsements (DOC/DOCX)
  • Complex PDF forms such as ACORD 125/126/140

🔹 2. Advanced OCR Modes 

LLMWhisperer includes multiple modes to suit different insurance document scenarios.
Each mode maps to an API parameter and is optimized for a specific document challenge.

Mode Comparison Table

  Mode          Ideal Use Case                                 Handwriting   Checkboxes   Language Support   Notable Advantage
  Form          ACORD forms, policy apps, compliance docs      Yes           Yes          300+               Best for field detection
  High Quality  Low-res scans, handwritten claims              Yes           Yes          300+               AI/ML enhancements + skew repair
  Table         Loss runs, financial reports, premium tables   Yes           Yes          300+               High-fidelity table extraction
  Low Cost      Standard scans, bulk ingestion                 Basic         No           120+               Cost-efficient for volume processing
  Native Text   Digital PDFs                                   No            No           All Unicode        Fastest performance

Why this matters:
Insurance ecosystems include everything from mobile photos to Excel extracts—these modes ensure each document flows through the most accurate OCR logic for its structure.
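
The mode choice above can be encoded as a simple dispatcher. This is an illustrative sketch, not an official Unstract selection policy; the mode strings mirror the names in the comparison table.

```python
# Sketch: pick an OCR mode from document traits, mirroring the mode table.
# The selection rules and mode strings are illustrative assumptions.

def pick_mode(is_scanned: bool, has_handwriting: bool,
              table_heavy: bool, bulk: bool) -> str:
    if not is_scanned:
        return "native_text"   # digital PDFs: fastest path
    if table_heavy:
        return "table"         # loss runs, financial reports, premium tables
    if has_handwriting:
        return "high_quality"  # low-res scans, handwritten claims
    if bulk:
        return "low_cost"      # standard scans at volume
    return "form"              # default for forms and compliance docs

print(pick_mode(is_scanned=True, has_handwriting=True,
                table_heavy=False, bulk=False))  # high_quality
```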

🔹 3. Layout Preservation 

Preserving visual structure is crucial, especially for insurance documents where meaning depends heavily on alignment.

Core Layout Parameters

  Parameter                        What It Does
  output_mode=layout_preserving    Maintains visual spacing, indentation, and grouping
  mark_vertical_lines              Identifies column boundaries in tables and grids
  mark_horizontal_lines            Indicates row separators
  add_line_nos                     Produces consistent line numbering for review and auditing

Example:
In ACORD 125, premium values for “Commercial Auto,” “General Liability,” and “Truckers” appear in parallel columns.
Without layout preservation, values shift—leading to misinterpreted coverage.

🔹 4. Supported Document Types 

LLMWhisperer handles all structures used across insurance workflows:

  • Native PDFs and scanned documents
  • Mobile-captured paperwork (angled, noisy, shadowed)
  • Forms with radio buttons and checkboxes
  • Typed forms with handwritten corrections
  • Documents containing multi-column layouts
  • Table-heavy reports (financials, underwriting summaries, P&L extracts)

Insurance examples:

  • ACORD applications
  • Disability claim forms
  • Multi-page premium statements
  • Photographed vehicle inspection sheets

🔹 5. Multilingual OCR 

LLMWhisperer supports 300+ languages, enabling insurers to process global submissions without translation layers.

Use case:
A German homeowner’s insurance application or French medical claim can be processed entirely as-is, with no accuracy trade-offs.

🔹 6. Preprocessing Pipeline for Imperfect Documents 

LLMWhisperer includes sophisticated image correction tools:

  • Automatic deskewing of rotated pages
  • Noise reduction via median & Gaussian filters
  • PDF auto-repair for corrupted or partial files
  • Contrast enhancement for faint ink or washed-out scans

Useful for:
Faxed claims, old scanned policies, outdoor photos of damage reports.

🔹 7. Table Extraction 

The Table Mode reconstructs financial and underwriting tables without losing structure—even when borders are faint or missing.

Typical use cases:

  • Premium breakdown charts
  • Loss history tables
  • Insurance performance reports
  • Reinsurance summaries

🔹 8. Bounding Boxes 

Every extracted text segment includes coordinates (x, y, width, height), enabling:

  • Audit and compliance visualizations
  • Verification dashboards
  • Human review workflows
  • Highlight-on-hover UI features

Particularly valuable in regulated industries where every extracted item must be traceable.
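
As a sketch of how those coordinates might drive a review UI, the snippet below filters segments by region. The segment schema (`x`, `y`, `width`, `height`, `text`) is an assumed shape for illustration; consult the API reference for the exact metadata fields.

```python
# Sketch: use per-segment coordinates to power a highlight-on-hover review UI.
# The segment schema (x, y, width, height, text) is an illustrative assumption.

def segments_in_region(segments, x0, y0, x1, y1):
    """Return segments whose top-left corner falls inside the given region."""
    return [s for s in segments
            if x0 <= s["x"] <= x1 and y0 <= s["y"] <= y1]

segments = [
    {"text": "Total", "x": 40, "y": 700, "width": 50, "height": 12},
    {"text": "$ 2500", "x": 120, "y": 700, "width": 60, "height": 12},
    {"text": "PAGE 11", "x": 500, "y": 820, "width": 70, "height": 10},
]
print([s["text"] for s in segments_in_region(segments, 0, 690, 200, 710)])
# ['Total', '$ 2500']
```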

🔹 9. Form Element Recognition 

LLMWhisperer not only captures text but also:

  • Detects checkboxes (checked / unchecked)
  • Identifies radio button selections
  • Maps form fields into structured outputs
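
Once checkbox states appear in layout-preserved output as `[X]` / `[ ]` markers (the style shown in the sample extraction in this post), mapping them to booleans is straightforward. A minimal sketch:

```python
import re

# Sketch: map "[X]" / "[ ]" markers in layout-preserved text to booleans.
# Assumes the bracket-marker style shown in the sample extraction.

def parse_checkboxes(text: str) -> list[tuple[str, bool]]:
    """Return (label, checked) pairs for every checkbox line found."""
    results = []
    for m in re.finditer(r"\[( |X)\]\s*(.+)", text):
        results.append((m.group(2).strip(), m.group(1) == "X"))
    return results

text = (
    "[X] There is at least one working smoke alarm in each bedroom\n"
    "[ ] A caravan or sleep-out is present on the property"
)
print(parse_checkboxes(text))
```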

🔹 10. Handwriting Recognition 

Handwritten notes such as adjuster comments, doctor annotations, or manually filled policy details are captured accurately in High Quality, Form, and Table modes.

🔹 11. Spreadsheet Extraction 

LLMWhisperer processes XLSX, XLS, and ODS files directly, making it ideal for:

  • Underwriting models
  • Performance analytics
  • Broker-submitted premium spreadsheets

No CSV conversion required.

🔹 12. Low-Fidelity Tolerance 

Handles damaged, skewed, low-resolution, stained, or shadowed documents with high accuracy.
Reduces the need for re-uploads or manual re-entry—improving customer satisfaction and operational efficiency.


🔹 13. Usage Metrics Dashboard

Unstract Cloud provides detailed metrics such as:

  • Pages processed
  • Mode breakdown
  • Success vs. error trends
  • Consumption forecasting

Useful for SLA-driven insurance operations.

🔹 14. Self-Hosted / On-Premise Deployment 

Carriers and TPAs can deploy LLMWhisperer entirely within their secure infrastructure:

  • No external data transfer
  • Full control over processing
  • Meets privacy rules (GDPR, HIPAA, NAIC, PCI, IRDAI)
  • Ideal for sensitive claim or policy workflows

🔹 15. Simple, Predictable Pricing 

Straightforward pay-per-page billing with transparent usage tiers—easy for insurers to budget per claim file or per policy bundle.

Summary Table

  Feature               Legacy OCR            LLMWhisperer
  Layout Fidelity       ❌ Loses structure    ✅ Columns, tables, & boxes preserved
  Handwriting           Limited               Advanced + multi-mode support
  Checkboxes / Radios   Often missed          Captured as structured booleans
  Languages             Restricted            300+
  Table Extraction      Poor alignment        Financial-grade table mode
  Data Privacy          Vendor cloud          On-premise supported
  Output                Unstructured text     Layout-preserving with coordinates

Example Use Cases: Playground & API

Playground Example — Scanned, Handwritten Contract Form

To illustrate how LLMWhisperer performs in real document-management workflows, we begin with the LLMWhisperer Playground.
For this test, we used a document containing multi-column sections, dense printed text, checkboxes, and amount details. This kind of document typically breaks traditional OCR tools, which struggle with rotation, mixed handwriting, and layout reconstruction.

Steps

  1. Open the LLMWhisperer Playground from the Unstract interface.
  2. Upload the scanned, handwritten contract form.
  3. Select High Quality or Form mode to enable handwriting recognition, de-skewing, and checkbox/field detection.
  4. Submit the document and view the extraction in the results panel.


      Tenancy Services 

      PROPERTY INSPECTION REPORT 

      This report is intended to help avoid disputes 
      This should be used to record the condition of the property at the start of the tenancy. 
      The landlord and the tenant should fill out this form together, and tick the appropriate box if the condition is acceptable, 
      or record any damage or defects. 

                                      CONDITION ACCEPTABLE? 
          ROOM AND ITEM               LANDLORD     TENANTS    DAMAGE/DEFECTS 

      LOUNGE Wall/Doors             YES          NO           Vertical crack at front wall 
          Lights/Power points       YES          YES 
          Floors/Fl. Coverings      YES          YES          High moisture content 
          Windows                   YES 
          Blinds/Curtains 
          Wall/Doors                YES          NO           Vertical crack at front wall 
      KITCHEN/DINING Lights/Power points YES     NO           Improper finish at door reveal 
          Floors/Fl. Coverings 
          Windows                   YES          YES 
          Blinds/Curtains           YES          YES 
          Cupboards 
          Sinks/Benches             YES          NO           Seepage patch above sliding window 
          Oven 
          Refrigerator 
          Wall/Doors 
          Lights/Power points       YES          YES 

      BATHROOM Floors/Fl. Windows Coverings YES YES NO NO     Crack at left frame bottom side 
                                                              High moisture content 
          Blinds/Curtains 
          Mirror/Cabinet 
          Bath 
          Shower 
          Wash basin 
          Toilet (WC) 
          Wall/Doors 
      LAUNDRY Lights/Power points   YES          NO           Bad wiring in the door plug socket 
          Floors/Fl. Coverings      YES          YES          Chips falling of in the right floor corner tiles 
          Windows                   YES          YES 
          Blinds/Curtains 
          Washing machine 
          Wash tub 
      1   Wall/Doors                YES          NO           Improper filling between frame and wall High moisture content 
          Lights/Power points       YES          NO 
          Floors/Fl. Coverings      YES          NO           Improper filling between frame and wall High moisture content 
      BEDROOM Windows               YES          NO           Improper filling between frame and wall High moisture content 
          Blinds/Curtains           YES          NO 
      2   Wall/Doors 
          Lights/Power points 
          Floors/Fl. Coverings 

      BEDROOM Windows 
          Blinds/Curtains 
      3   Wall/Doors 
          Lights/Power points 
          Floors/Fl. Coverings 

      BEDROOM Windows 
          Blinds/Curtains 

RTA01 Residential Tenancy Agreement                      www.tenancy.govt.nz                                              PAGE 10 
<<<

      Tenancy Services 

      4    Wall/Doors 
           Lights/Power points        YES          NO            Tile cut extra at front wall socket 
           Floors/Fl. Coverings       YES          NO            Gap at floor laminate and bathroom 1 door frame 
      BEDROOM Windows Blinds/Curtains 

      GENERAL Rubbish bins 
           Locks 
           Garage/Car port 
           Grounds 
           No. keys supplied 

      Smoke alarms 
      Landlords must have working smoke alarms installed in all rental premises. These must meet the requirements in the 
      Residential Tenancies (Smoke Alarms and Insulation) Regulation 2016, set out below. A landlord who fails to comply is 
      committing an unlawful act and may be liable for a penalty of up to $7,200. 
      Landlord - please confirm you have met at least these minimum legal requirements before you rent the premises: 
      [X] There is at least one working smoke alarm in each bedroom or within three metres of each bedroom's door - this applies 
          to any room a person might reasonably sleep in. 
      [X] If there is more than one storey or level, there is at least one working smoke alarm on each storey or level, even if no-one 
          sleeps there. 
      [X] If there is a caravan, sleep-out or similar, there is at least one working smoke alarm in it. 
      [X] None of the smoke alarms has passed the manufacturer's expiry or recommended replacement date. 
      [X] All new or replacement smoke alarms, installed from 1 July 2016 onward, are long-life photoelectric smoke alarms with a total 
          battery life when installed of at least eight years or a hard-wired smoke alarm system, and meet the product standards in 
          the Residential Tenancies (Smoke Alarms and Insulation) Regulation 2016. 
      [X] All the smoke alarms are properly installed by the landlord or their agent in accordance with the manufacturer's instructions. 
      [X] All the smoke alarms are working at the start of the tenancy, including having working batteries. 

      For important details go to www.tenancy.govt.nz/smoke-alarms 

      List of furniture and chattels                               Signatures for Property 
      Provided by the landlord                                     Inspection Report 
      Television                                                   Do not sign unless you agree to all the details in the 

      Airconditioner                                               Property Inspection Report 

      Tables                                                                                            02/02/2025 

      200 KW Sump motor                                            Signed by                                   Date signed 
      Generator                                                                          LANDLORD 
                                                                                                        02/02/2025 
                                                                   Signed by                                   Date signed 
                                                                                          TENANT 

                                                                   Rent and Bond Receipt 

                                                                   Initial rent payment         $ 2000 
                                                                   Bond                         $ 500 

                                                                   Total                        $ 2500 

                                                                   To (name)   Roger deakins 

      Water Meter Reading                                          Date paid   2/2/2024 

      For use if charging for water 

      At start of tenancy                                          Signed as received 

                                                                                                                    4700 TEN 02/22 

RTA01 Residential Tenancy Agreement                         www.tenancy.govt.nz                                                 PAGE 11 
<<<

Result

The output demonstrates why LLMWhisperer is one of the best OCR engines for document management systems:

  • Perfect layout preservation
    Multi-column sections, labels, spacing, and block structures were retained exactly, allowing downstream LLMs to interpret relationships between fields.
  • Accurate field extraction
    Names, numeric fields, dates, and checkbox responses were captured with high fidelity.
  • No data loss
    Every printed and handwritten character across all sections was extracted.

Overall, the Playground test reveals that LLMWhisperer handles even difficult real-world contract forms with the same precision expected from a professional document management OCR system.

API Example — Bulk Parsing of a Photographed Air Waybill (Handwritten, Multi-Column)

For large-scale ingestion pipelines inside document management software, the LLMWhisperer API is the preferred approach.
Here, we processed a photographed Air Waybill—an old, slightly degraded document with handwritten values, multi-column cargo details, table blocks, and uneven lighting typical of scanned shipping paperwork.

Steps Using Postman

  1. Obtain your LLMWhisperer API key from the Unstract dashboard.
  2. Open Postman → New Request.
  3. Configure:
    • Method: POST
    • URL: https://llmwhisperer-api.us-central.unstract.com/api/v2/whisper
    • Header: unstract-key: <YOUR_API_KEY>
    • Body: form-data → files → (upload Airway_bill_photographed-handwritten.pdf)
  4. Send the request.
  5. Review the response (layout-preserving output in plain text).
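The same request can also be scripted instead of run through Postman. Below is a minimal Python sketch using the `requests` library; the endpoint and `unstract-key` header are the ones configured above, while the function name and form-field handling are illustrative assumptions (check the LLMWhisperer API reference for your plan's exact request and response shape):

```python
import requests

# Endpoint and header as configured in the Postman steps above.
URL = "https://llmwhisperer-api.us-central.unstract.com/api/v2/whisper"

def ocr_pdf(pdf_path: str, api_key: str) -> str:
    """Upload a PDF and return LLMWhisperer's layout-preserving text."""
    with open(pdf_path, "rb") as f:
        # form-data upload, mirroring the Postman body configuration above
        resp = requests.post(URL, headers={"unstract-key": api_key},
                             files={"files": f})
    resp.raise_for_status()
    return resp.text
```

Calling `ocr_pdf("Airway_bill_photographed-handwritten.pdf", api_key)` yields plain text like the output below.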


   Shipper's Name and Address                               Shipper's Account Number 
                                                                                                  Not Negotiable 
                                                                                                  issued Air Waybill by 
     Simon               Jones                            HY73221 

    20, cooper square,                                New york 

                                    NY         10011, USA 
  Consignee's Name and Address                             Consignee's Account Number             Copies 1, 2 and 3 of this Air Waybill are originals and have the same validity. 
                                                                                                 It is agreed that the goods declared herein are accepted in apparent good 
                                                                                                  (except as noted) for carriage SUBJECT TO THE CONDITIONS OF              order and condition 
                                                                                                 REVERSE HEREOF. ALL GOODS MAY BE CARRIED BY ANY OTHER CONTRACT MEANS INCLUDING ON THE 
     Rogen                gates                                                                   ROAD OR ANY OTHER CARRIER UNLESS SPECIFIC CONTRARY 
                                                                                                  BE GIVEN CARRIED HEREON VIA BY INTERMEDIATE THE SHIPPER, STOPPING AND SHIPPER AGREES THAT THE INSTRUCTIONS SHIPMENT ARE MAY 
    78,       Union Street, Bristol                                                              APPROPRIATE. THE SHIPPER'S                         PLACES WHICH THE CARRIER DEEMS 
                                                                                                  CARRIER'S LIMITATION OF LIABILITY. ATTENTION Shipper IS DRAWN TO THE NOTICE CONCERNING 
                                     BS84BF                                                                                                          may increase such limitation of 
 Issuing Carrier's Agent Name and City                                                            declaring a higher value for carriage and paying a supplemental charge if required. liability by 
                                                                                                  Accounting Information 
       IDF               Cargo 

                       New york 
 Agent's IATA Code                                Account No. 
         75632                                        9973261 
 Airport of Departure (Addr. of First Carrier) and Requested Routing                                   Reference Number                Optional Shipping Information 
        New york                                                                                   735926 
 To        By First Carrier.   Routing and Destination       to         by      to        by      Currency CHGS    WT/VAL      Other 
                                                                                                                  PPD COLL PPD COLL     Declared Value for Carriage    Declared Value for Customs 
                                                                                                                                           $200                         $200. 
              Airport of Destination                          Requested Flight/Date                   Amount of Insurance      INSURANCE - If carrier offers insurance, and such insurance is requested 
     London                                                                12/12/2025                   $20.                   in accordance with the conditions thereof, indicate amount to be insured in 
                                                                                                                              figures in box marked "Amount of Insurance". 
Handling Information 

                                                                                                                                                                                  SCI 

No. of                              Rate Class 
Pieces          Gross          kg                              Chargeable           Rate                               Total                           Nature and Quantity of Goods 
RCP             Weight         lb         Commodity Item No       Weight                     Charge                                                     (incl. Dimensions or Volume) 

 3           30kg                           3                   30kg                 $ 2                          $ 60                         News print Paper 

                                                                                                                                                Package Paper. 
5           20 kg                         72                    30 kg                $ 3                        $    60 

                                                                                                                                                Print           Ink          Solution 
3           10 kg                         75                    10kg                 $     6                   $ 70 

                                                                                                                $190. 

     Prepaid                  Weight Charge                 Collect          Other Charges 

                                                                                                                            taxes of handling 
                            Valuation Charge                                                   $      10/- 

                                    Tax 

                                                                                                                                                      correct and that insofar as consignment any part of are the 
                     Total Other Charges Due Agent                            I hereby certify that the particulars on the face hereof are 
                                                                              consignment contains dangerous goods. I hereby certify that the contents of this 
                                                                              fully and accurately described above condition by proper for carriage shipping by name air according and are to classified, applicable packaged, national 
                     Total Other Charges Due Carrier                          marked and labeled, and in proper 
                                                                              governmental regulations. 

                                                                                                                             Signature of Shipper or his Agent 

          Total Prepaid                            Total Collect 

                                                                                 12/10/2025                                     Newyork. 
  Currency Conversion Rates               CC Charges in Dest. Currency                                                             at (place)                      Signature of Issuing Carrier or its Agent 
                                                                               Executed on (date) 

                                             Charges at Destination                   Total Collect Charges 
   For Carrier's Use only 
        at Destination 
<<<

Result

The API extraction produced exceptional fidelity:

  • All handwritten cargo details and values were captured accurately, including weights, consignee names, and reference numbers.
  • Multi-column table structure was preserved, enabling downstream LLMs to correctly associate numeric amounts with the right column and row.
  • Aged, low-contrast text was reconstructed cleanly, thanks to noise reduction and preprocessing.
  • Zero loss of content—no missing rows, labels, or numbers.
  • Perfect alignment across columns, even in sections where the original document had faded or uneven spacing.

This demonstrates the strength of LLMWhisperer as a backend OCR service for enterprise document-management systems, where bulk accuracy, stability, and structure retention are non-negotiable.

How LLMWhisperer Supports Document Management

Document Ingestion: API & Cloud Connectors

A document management system is only as strong as its ingestion layer. LLMWhisperer integrates seamlessly into Unstract’s connector ecosystem, allowing businesses to move documents from any storage environment into their OCR pipeline without friction.

Unstract supports ingestion from a wide range of data sources—cloud storage, file servers, object stores, and databases. Documents can be fed into LLMWhisperer in two primary ways:

1. Ingestion via Connectors (Cloud & File Systems)

Unstract’s connector framework allows organizations to plug in their existing storage systems directly into a workflow. This enables automated ingestion of large document volumes—rent agreements, contracts, invoices, claims, HR files, policy forms, and more.

How ingestion works:

  • Navigate to Settings → Connectors, or add a connector while building a workflow.
  • Choose a connector type (e.g., AWS S3, Azure Blob, Google Drive, Salesforce, SFTP, PostgreSQL, etc.).
  • Provide the authentication details (bucket names, access keys, database URLs, credentials).
  • Test Connection to validate access.
  • Save and attach the connector as the Source for your workflow.

When the workflow runs, documents from the connector automatically flow into LLMWhisperer for OCR processing.

Why this matters for document management:
Businesses no longer need to manually upload files or export data. A connector-enabled workflow ingests documents continuously and reliably, making LLMWhisperer a scalable backbone for enterprise document automation.

Document Parsing: OCR-Powered Layout & Text Extraction for AI

Once documents enter the system through API or connectors, LLMWhisperer handles the second stage of document management—parsing. This is where raw files (PDFs, scans, images, Excel sheets) are transformed into structured and layout-aware text ready for downstream AI processing.

How parsing works:

  • LLMWhisperer identifies the document type (scanned, native, Excel, form, table-heavy, handwritten).
  • It selects the appropriate OCR mode (native, low-cost, high-quality, form, or table).
  • The engine performs OCR, layout analysis, handwriting extraction, and structure reconstruction.
  • Output is returned in a clean, consistent format—preserving indentation, columns, tables, checkboxes, and line order.
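The mode-selection step above can be made concrete with a small sketch. This is a hypothetical helper, not Unstract's internal logic: the mode names mirror the ones listed above, and the document traits are simplified to three booleans for illustration:

```python
# Hypothetical helper: pick an OCR mode from coarse document traits.
# Mode names follow the modes described above; treat them as assumptions
# and confirm exact values against the LLMWhisperer API reference.
def choose_mode(is_scanned: bool, has_tables: bool, is_form: bool) -> str:
    if is_form:
        return "form"        # checkbox- and field-heavy documents
    if has_tables:
        return "table"       # table-dominant layouts
    # scanned pages need full OCR; born-digital PDFs can use native text
    return "high_quality" if is_scanned else "native_text"
```

A form always wins over table-heavy or scanned traits here because field alignment is the hardest structure to reconstruct after the fact.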

This parsed output becomes the foundation for deeper intelligence tasks such as:

  • Classification
  • Entity extraction
  • Policy data mapping
  • Contract clause detection
  • Claims processing automation

Why this parsing layer is crucial:
OCR isn’t just about reading characters. In document management, the structure is as important as the text. LLMWhisperer’s ability to preserve layout (columns, tables, checkboxes, field alignment) ensures that AI/LLM models receive data in a format they can accurately interpret—leading to higher accuracy in automated workflows.

Combined with Unstract’s processing tools, LLMWhisperer becomes a core component of enterprise document automation, enabling organizations to move from raw, unstructured files to validated, searchable, and fully structured data — all while minimizing manual effort.

What is Unstract? The AI/LLM Layer for Document Understanding

Modern enterprises deal with thousands of unstructured documents every day—contracts, invoices, claims, forms, reports, and handwritten submissions. Traditional IDP and RPA tools struggle with long, complex, multi-page documents because they lack semantic understanding.

Unstract solves this problem.

Unstract is an open-source, no-code platform built specifically for automating complex business processes involving unstructured documents—powered by Large Language Models (LLMs) and Human-in-the-Loop (HITL) capabilities. Instead of relying only on template-based OCR, Unstract adds an intelligent interpretation layer that understands meaning, relationships, and context inside documents.

Where OCR (like LLMWhisperer) extracts text and structure, Unstract extracts understanding.

How Unstract Uses LLMs to Extract Meaning from OCR Outputs

Once LLMWhisperer converts PDFs, images, scans, and spreadsheets into clean, layout-preserving text, Unstract takes over:

1. LLMs interpret the extracted text

LLMs analyze the OCR output, detect entities, relationships, classifications, intent, and numerical meaning.
Examples:

  • Identifying coverage limits, deductibles, and premiums in insurance forms
  • Extracting tenant names, unit numbers, rent, and status in rent rolls
  • Finding clauses, renewal dates, or penalties in contracts

2. Embeddings & Vector Databases enhance accuracy

Unstract converts documents into vector embeddings, enabling:

  • Semantic search
  • Context retrieval (“retrieve the correct section before extraction”)
  • Multi-page reasoning and cross-referencing

This is critical when similar terms appear in different contexts (e.g., “total premium” vs. “annual premium”).

3. Prompt Studio orchestrates structured extraction

Using prompt engineering, users can define extraction rules in plain English.
Example:
“Extract policyholder info, claim details, deductible, effective dates, and all checkboxes from the document.”

LLMs then return structured JSON output that downstream systems can consume.

4. Human-in-the-loop validation (HITL) for accuracy

For sensitive use cases—insurance claims, property reports, healthcare forms—Unstract allows humans to review low-confidence fields before finalizing.

This creates enterprise-grade reliability.

Core Components: Prompt Studio, Embeddings, Vector DBs, and Workflows

Unstract’s power comes from its modular ecosystem:

1. Prompt Studio — The Brain of Document Understanding

A no-code environment where users design extraction logic using natural language prompts.

Capabilities:

  • Build custom parsers for any document type
  • Test prompts across real samples
  • View field fill-rates and prompt accuracy
  • Optimize extraction with iterations

Prompt Studio turns non-technical teams into AI automation creators.

2. Embeddings — Semantic Understanding Layer

Embeddings convert each section of the document into vectors that help LLMs:

  • Retrieve the right context
  • Understand multi-page documents
  • Disambiguate similar terms
  • Improve extraction accuracy

Unstract supports OpenAI embeddings and others.

3. Vector Databases (Vector DBs)

A Vector DB stores embeddings for fast, intelligent retrieval.

Used for:

  • Chunk-level retrieval before prompting
  • Knowledge-grounded extractions
  • Indexing large volumes of documents

Unstract integrates with Postgres, Pinecone, and other vector backends.
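To make "chunk-level retrieval before prompting" concrete, here is a minimal, self-contained sketch of the core operation a Vector DB performs: scoring stored chunk embeddings against a query embedding and returning the best matches. Production deployments use Pinecone or Postgres (pgvector) indexes rather than a linear scan, and real embeddings come from a model; the vectors and function names here are toy illustrations:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query_vec, chunk_vecs, chunks, k=2):
    """Return the k chunks whose embeddings are closest to the query."""
    scores = [cosine_sim(query_vec, v) for v in chunk_vecs]
    order = np.argsort(scores)[::-1][:k]  # highest similarity first
    return [chunks[i] for i in order]
```

Only the retrieved top-k chunks are placed into the prompt, which is how extraction stays accurate on documents far larger than the model's context window.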

4. Workflows — Deployment and Automation Engine

Once a project is ready, Unstract lets teams automate document understanding at scale.

Workflows allow you to:

  • Connect to document sources (S3, Drive, Blob Storage, etc.)
  • Run OCR + LLM extraction pipelines end-to-end
  • Send structured data into databases (Snowflake, BigQuery, Redshift, Postgres, etc.)
  • Deploy as an API for real-time document processing
  • Create ETL pipelines for batch automation

Workflows can also launch custom Q&A apps for internal teams, each secured with SSO.

In Summary

Unstract is the intelligence layer that transforms raw OCR output into meaningful, structured information using LLMs.
Where LLMWhisperer reads documents, Unstract understands them.

Together, they create a next-generation AI document automation ecosystem capable of handling:

  • Long, complex documents
  • Multi-page reports
  • Financial tables
  • Insurance forms
  • Contracts
  • Handwritten and scanned records

Unstract in Action

To demonstrate how Unstract turns raw OCR output into structured, machine-ready data, we tested it on one of the most challenging document types:
a scanned, handwritten contract form—tilted nearly 30°, filled with multi-column text, handwritten entries, dense legal clauses, and uneven print quality.

This is the kind of document that routinely breaks traditional OCR and RPA systems. Rotation, handwriting, shadows, mixed formatting, and unpredictable spacing lead to broken outputs.
But with LLMWhisperer + Unstract, the pipeline remains fully intact: layout preserved, handwriting captured, and the entire structure interpreted accurately.

1. Build a Prompt Studio Project

Inside Unstract’s Prompt Studio, we created a lightweight extraction project designed specifically for the scanned contract. No coding, no template design—just natural-language instructions.

Examples of fields defined via prompts:

  • Contract title, parties involved, addresses
  • Handwritten filled-in details (names, dates, initials, signatures)
  • Payment terms, validity period, and obligations
  • Checkbox or selection fields
  • Multi-column clauses and sub-clauses
  • Final acknowledgment / signature blocks

Because the OCR output is layout-preserving, Prompt Studio can reason across tilted sections, uneven spacing, and multi-line handwriting with impressive consistency.

After a few iterations inside the testing panel, fill rates stabilized, and the extracted fields matched the source document with high accuracy.

2. Extract Relevant Data Fields from JSON

Once the prompts were ready, Unstract generated clean, structured JSON representing the contract’s contents.
All key sections—including handwritten fields—were extracted with:

  • Correct line order
  • Preserved relationships (e.g., which signature belongs to which signer)
  • Intact table/column structures
  • Proper date and numeric reconstruction

3. Deploy and Test as an API (Postman Example)

After validating the extraction logic in Prompt Studio, we deployed the project as an Unstract API workflow—again, with no custom backend coding.

Deployment Summary:

  • Source Connector: API (accepts documents via POST)
  • Destination: API (returns structured JSON)
  • Selected Tool: The exported “Handwritten Contract Parser”
  • Mode: Deploy as API
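Once deployed, the workflow is callable like any HTTP API. The sketch below is illustrative only: the deployment URL path, bearer-token header, and form-field name are placeholders (your actual values appear on the API deployment screen in Unstract):

```python
import requests

# Hypothetical deployment URL and token; substitute the values shown
# on your Unstract API deployment screen.
API_URL = "https://us-central.unstract.com/deployment/api/ORG_ID/handwritten-contract-parser/"
HEADERS = {"Authorization": "Bearer YOUR_API_DEPLOYMENT_KEY"}

def extract(pdf_path: str) -> dict:
    """POST a document to the deployed workflow and return its JSON result."""
    with open(pdf_path, "rb") as f:
        resp = requests.post(API_URL, headers=HEADERS, files={"files": f})
    resp.raise_for_status()
    return resp.json()
```

The structured JSON below is the kind of response such a call returns.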

{
  "status": "COMPLETED",
  "message": [
    {
      "file": "scanned-handwritten-contract-form.pdf",
      "file_execution_id": "fb8ce0a8-114e-4e1f-bc55-98f2f346c252",
      "status": "Success",
      "result": {
        "output": {
          "additional_contract_clauses": {
            "AssignmentRestrictions": "SERVICE PROVIDER needs permission to assign to a third party. Seller may not assign any of its rights under this Agreement or delegate any performance under this Agreement, except with the prior permission.",
            "ForceMajeure": "Service Provider shall not be responsible for any claims or damages resulting from any delays in performance or for non-performance due to unforeseen circumstances or causes beyond Service Provider's reasonable control.",
            "LimitationOfLiability": "Service Provider will not be liable for any indirect, special, consequential, or punitive damages (including lost profits) arising out of or relating to this Agreement or the transactions it contemplates (whether for breach of contract, tort, negligence, or other form of action) and irrespective of whether Service Provider has been advised of the possibility of any such damage. In no event will Service Provider's liability exceed the price paid by Buyer for the Services giving rise to the claim or cause of action.",
            "SecurityInterest": "Buyer hereby grants to Service Provider a security interest in any final products resulting from said services, until Buyer has paid Service Provider in full. Buyer shall sign and deliver any document needed to perfect the security interest that Service Provider reasonably requests."
          },
          "inspection_and_remedies": {
            "buyer_remedies": [
              "Request one revision of the product provided.",
              "Terminate the contract following payment for 50% of the services."
            ],
            "inspection_rights": [
              "There is NO right to inspection.",
              "Buyer shall be allowed to examine the final products once received."
            ],
            "notification_timelines": [
              "Buyer shall notify Service Provider within days after completion of the services or discovery of the problems, whichever is sooner."
            ]
          },
          "party_information": {
            "Agreement Date": "March 5, 2024",
            "Buyer Address": "123B, Beach walk avenue, CA",
            "Buyer Name": "Twinings threads Inc",
            "Service Provider Address": "23, rosewood avenue, CA 96162",
            "Service Provider Name": "Valley wood works Inc"
          },
          "payment_terms": {
            "payment_method": "Credit or debit card",
            "payment_schedule": {
              "condition": "Full payment upon the completion of the services",
              "installment_condition": "Installments option available until the purchase price has been paid in full",
              "total_payment_due": "$10000"
            },
            "tax_responsibility": "Service Provider"
          },
          "services_and_pricing": {
            "services": [
              {
                "description": "Building Paint work, external",
                "number_of_projects": 1,
                "price_per_project": "$2350"
              },
              {
                "description": "Building Paint work, internal",
                "number_of_projects": 1,
                "price_per_project": "$3000"
              },
              {
                "description": "logistics estimate",
                "number_of_projects": 1,
                "price_per_project": "$430"
              },
              {
                "description": "Tools cost estimate",
                "number_of_projects": 1,
                "price_per_project": "$3000"
              }
            ],
            "total_purchase_price": "$10000"
          }
        }
      },
      "error": null,
      "metadata": {
        "source_name": "scanned-handwritten-contract-form.pdf",
        "source_hash": "d0b40d6fb160c377870a2792216d79b4624288e48be18802872d12b945e50c3e",
        "organization_id": "org_0LUeZOOihFhndmjm",
        "workflow_id": "390c7096-2783-42e2-b2d6-5854d335160d",
        "execution_id": "80dfb331-62c8-47a3-9afe-74fc91851c8c",
        "file_execution_id": "fb8ce0a8-114e-4e1f-bc55-98f2f346c252",
        "tags": [],
        "workflow_start_time": 1765287522.0249608,
        "total_elapsed_time": 34.934743881225586,
        "tool_metadata": [
          {
            "tool_name": "structure_tool",
            "elapsed_time": 22.159697,
            "output_type": "JSON"
          }
        ]
      }
    }
  ]
}

Unstract Document Ingestion

Document ingestion in Unstract is designed to support real-world enterprise flows where documents arrive from cloud drives, internal file systems, and automated workflow engines. Unstract provides a unified ingestion layer through Connectors and through n8n-based automation, ensuring that documents move from source → extraction → destination with zero manual handling.

1. Ingestion via Unstract Connectors

Unstract supports ingestion from a wide range of data sources using built-in connectors. These connectors allow systems such as cloud storage, file systems, and databases to push documents directly into extraction workflows.

How Connectors Work

Unstract lets you add connectors in two ways:

  • From the Connectors dashboard (Settings → Connectors)
  • Directly inside a workflow when configuring the source or destination

Once added, each connector follows a simple process:

  1. Select the connector type (e.g., S3/MinIO, Google Drive, Dropbox, Azure Blob, PostgreSQL, MySQL, etc.)
  2. Configure authentication fields
  3. Test the connection
  4. Save it for use in ETL pipelines, API deployments, or task workflows

These connectors become the entry points for automated ingestion. For example:

  • A folder in Google Drive can automatically trigger new extraction jobs
  • A new file landing in Amazon S3 can be processed through a Prompt Studio project
  • Files stored in on-premise file systems can be consumed using FileSystem connectors
  • Extracted data can be routed directly into databases like PostgreSQL or Snowflake

Unstract Document Ingestion via n8n Workflow Automation

Unstract integrates seamlessly with n8n to create fully automated document ingestion pipelines. In this setup, n8n orchestrates the flow of documents, while Unstract and LLMWhisperer handle OCR, preprocessing, and structured extraction.

Steps in the n8n + Unstract Ingestion Workflow

  1. n8n retrieves new documents from configured sources (email inboxes, cloud drives, APIs, or shared folders).
  2. n8n sends the document to LLMWhisperer for OCR and layout-preserving preprocessing.
  3. The OCR output is passed to an Unstract API (built from Prompt Studio) for structured JSON extraction.
  4. n8n routes the extracted JSON to downstream destinations such as Slack, Google Sheets, databases, or accounting systems.

You can watch the full workflow demonstration in the official webinar:

Building agentic document workflows with Unstract + n8n

Unstract API Hub: Document Splitting & Classification

Unstract’s API Hub provides a suite of intelligent, production-ready APIs that solve one of the most difficult challenges in document management: automatically splitting multi-document PDFs and classifying document types without templates, rules, or manual effort. Built using a blend of Vision AI and LLM-driven semantic analysis, these APIs work across every industry and document format.

AI-Powered PDF Splitting API

The Document Splitter API is engineered for real-world, mixed PDFs—loan packages, insurance claim bundles, logistics files, onboarding packets, tax folders, and more. Instead of relying on page numbers or keyword rules, the API uses advanced machine-learning models to detect natural document boundaries based on layout, structure, visual cues, and semantic meaning.

When you submit a multi-document PDF, the API returns:

  • Individual PDFs, each corresponding to a split document
  • A ZIP file containing all extracted documents
  • A detailed JSON boundary report (document type, page ranges, header/footer text, entities, date ranges, etc.)

This approach eliminates the need for manual page selection or template configuration—a critical advantage when document sets vary in order, length, and formatting.

Key Features

High-Accuracy Vision Model Boundary Detection
Two-pass AI analysis enables reliable detection of document breaks, achieving confidence scores of 0.9 or higher. Ideal for inconsistent scans, rotated pages, watermarks, and mixed-resolution files.

Fast Processing for Large PDFs
Handles 100+ page, multi-document files in minutes. Optimized pipeline ensures consistent performance even when scaling to thousands of files.

Dynamic Windowing Technology
Automatically adapts to PDFs of any size or complexity—whether you’re splitting a 10-page insurance packet or a 500-page regulatory submission.

Enterprise-Grade Security & Compliance
Supports secure, encrypted processing with compliance across HIPAA, GDPR, SOC 2, and industry-standard privacy requirements.

Document Integrity Preservation
Splits retain original formatting, ensuring downstream workflows (claims intake, underwriting, loan processing, audits) receive clean, usable outputs.

Industry-Agnostic Operation
Works for banking, insurance, healthcare, logistics, education, real estate, BPOs, and government workflows—no custom training required.

AI Classification API

The API Hub also offers document classification endpoints. These identify the type of each document—such as:

  • Loss Run Summary
  • ACORD 125 / 140 / 126
  • Payslip
  • KYC Form
  • Tax Form (e.g., 1040, 990)
  • Onboarding Documents
  • Shipping Manifests or Bills of Lading

Classification works even when documents vary by layout, language, orientation, or scan quality. This makes it suitable for automated foldering, indexing, routing, and downstream workflow orchestration.

These classification APIs integrate smoothly with:

  • Document Management Systems (DMS)
  • ETL pipelines
  • RPA and automation tools
  • Workflow engines like n8n
  • Ingestion platforms (S3, GDrive, Dropbox)

API Endpoints

The PDF Splitter API offers three primary endpoints:

POST: /api/v1/doc-splitter/documents/upload
Uploads the combined PDF and initiates the splitting job.

GET: /api/v1/doc-splitter/jobs/status
Checks the job status using the returned job_id.

GET: /api/v1/doc-splitter/jobs/download
Fetches the ZIP file containing separated PDFs and the JSON boundary metadata.

Postman Workflow (High-Level)

  1. Upload
    Send the mixed PDF via POST → receive a job_id.
  2. Status Polling
    Query the status endpoint until the job shows as “completed”.
  3. Download
    Use the same job_id to download the ZIP containing:
    • Split PDFs
    • JSON boundary metadata (document_type, page ranges, content descriptors, extracted entities)
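
The upload → poll → download sequence above can be sketched in Python. The endpoint paths are the ones listed earlier in this section; the base URL, the `apikey` header name, and the exact response field names (`job_id`, `status`) are assumptions and may differ in the real API.

```python
import json
import time
import urllib.request

BASE = "https://api.example.com"      # placeholder; substitute the real API Hub base URL
HEADERS = {"apikey": "YOUR_API_KEY"}  # auth header name is an assumption

def endpoint(path, job_id=None):
    """Build a full endpoint URL, optionally appending the job_id query parameter."""
    url = BASE + path
    return f"{url}?job_id={job_id}" if job_id else url

def upload(pdf_path):
    """Step 1: upload the combined PDF and return the job_id.

    The raw-bytes body is a simplification; the real endpoint may expect
    multipart/form-data.
    """
    with open(pdf_path, "rb") as f:
        body = f.read()
    req = urllib.request.Request(
        endpoint("/api/v1/doc-splitter/documents/upload"),
        data=body,
        headers={**HEADERS, "Content-Type": "application/pdf"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["job_id"]

def wait_and_download(job_id, poll_interval=5.0):
    """Steps 2-3: poll until the job reports completed, then fetch the result ZIP."""
    while True:
        req = urllib.request.Request(
            endpoint("/api/v1/doc-splitter/jobs/status", job_id), headers=HEADERS
        )
        with urllib.request.urlopen(req) as resp:
            if json.load(resp).get("status") == "completed":
                break
        time.sleep(poll_interval)
    req = urllib.request.Request(
        endpoint("/api/v1/doc-splitter/jobs/download", job_id), headers=HEADERS
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()  # ZIP with split PDFs + JSON boundary metadata
```

The returned ZIP bytes can then be written to disk and unpacked to retrieve the split PDFs and the boundary metadata JSON.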

Reference Blog: https://unstract.com/blog/pdf-splitter-api-ai-powered-mixed-combined-pdf-splitter/

Unstract Document Classification for Document Management

Modern document management systems depend heavily on accurate, automated classification—especially when dealing with large volumes of invoices, policies, claims, contracts, statements, onboarding packets, and scanned submissions. Unstract brings a practical, enterprise-ready approach to this challenge by combining Prompt Studio, LLMWhisperer, and API Deployments into a unified classification pipeline.

Using Prompt Studio to Classify Documents

Prompt Studio acts as the intelligence layer of Unstract’s classification engine. Instead of building and training a custom machine-learning pipeline, teams simply write natural-language prompts to describe classification rules.

For example, a classification prompt can identify whether a file is:

  • An invoice
  • A claims document
  • An insurance policy
  • A contract
  • A bank statement
  • An ACORD form

This approach allows organizations to classify both broad categories and highly specific subtypes—without building templates or rules.

How it works in practice

  1. Upload sample documents (invoices, ACORD forms, contracts, etc.) into Prompt Studio.
  2. Write classification prompts that instruct the LLM to determine document type based on content and structure.
  3. Run test executions to view classification accuracy.
  4. Validate results using layout-preserved OCR from LLMWhisperer (ensuring consistent input for the LLM).

Prompt Studio eliminates the fragility of traditional keyword-based classifiers by grounding classification in semantic understanding.

Exposing Classification Logic as an API

Once the classification prompts are tested and approved, Unstract allows the entire logic to be deployed as an API with a single click.

The deployed API:

  • Accepts PDFs, scans, photos, and documents of any format
  • Automatically applies OCR (via LLMWhisperer)
  • Runs the cleaned text through the classification prompt
  • Returns structured JSON containing the document type

This makes it effortless to integrate classification into enterprise workflows.

Example JSON Output

{
  "document_classification": {
    "classification": "BANK"
  }
}

Where this API can be used

  • Auto-sorting documents as they arrive in S3, GDrive, Dropbox, or internal file systems
  • Routing incoming claims to the correct insurance queue
  • Feeding documents into an ERP, CRM, or DMS for categorization
  • Classifying bulk historical archives during digital transformation

The API removes manual sorting entirely and enables large-scale automated processing.

Integration with DMS Tools for Automated Sorting & Metadata Tagging

Unstract’s classification API integrates seamlessly with:

  • Document Management Systems (SharePoint, Alfresco, OpenText)
  • Workflow engines (n8n, Airflow, Zapier)
  • Storage systems (S3, GCS, Azure Blob, Dropbox, GDrive)
  • Enterprise ETL systems and warehouse platforms

A typical automated workflow looks like this:

  1. Documents arrive in a storage bucket.
  2. A workflow automation tool (n8n, Airflow, etc.) retrieves each file.
  3. The file is sent to the Unstract Classification API.
  4. The API returns the document type as structured JSON.
  5. Based on this value, the automation:
    • Places the document into the correct folder
    • Adds metadata tags to the DMS
    • Sends the file to downstream extraction workflows
    • Routes documents to the correct compliance or business teams
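
Step 5 of this workflow reduces to a lookup from the classification label to a destination. A minimal sketch follows, using the `document_classification` JSON shape returned by the API; the label set and folder names are illustrative, not a fixed taxonomy.

```python
# Map classification labels (as returned by the classification API) to target folders.
# Both sides of this mapping are examples; adapt them to your own DMS layout.
ROUTES = {
    "INVOICE": "finance/invoices",
    "CLAIM": "insurance/claims",
    "POLICY": "insurance/policies",
    "BANK": "banking/statements",
    "CONTRACT": "legal/contracts",
}

def route_for(api_response: dict) -> str:
    """Pick a destination folder from the classification JSON.

    Expects the shape {"document_classification": {"classification": "BANK"}}.
    Unknown or missing types fall back to a review queue for manual triage.
    """
    label = (
        api_response.get("document_classification", {})
        .get("classification", "")
        .upper()
    )
    return ROUTES.get(label, "review/unclassified")
```

For example, `route_for({"document_classification": {"classification": "BANK"}})` returns `"banking/statements"`, which the automation tool can use to move the file and tag the DMS record.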

Because LLMWhisperer preprocesses every file (OCR, layout normalization, table/form preservation), even poorly scanned, multi-language, or handwritten documents are classified reliably.

How Unstract + LLMWhisperer Strengthen Document Management

This combined stack supports every stage of a modern document-management lifecycle:

1. Document Capture

Integration with all major cloud storage, data warehouses, inboxes, and n8n workflows ensures documents enter the system seamlessly.

2. Document Parsing (OCR)

LLMWhisperer provides clean, structured text—preserving layout, tables, checkboxes, multilingual content, and handwritten elements.

3. Document Splitting

The PDF Splitter API separates large, mixed PDFs into individual documents before classification.

4. Document Classification

Prompt Studio + Unstract APIs deliver high-accuracy categorization at scale.

5. Document Extraction

Unstract’s AI-powered extraction converts classified documents into usable structured data fields.

Unstract provides an end-to-end approach to document classification by combining layout-accurate OCR, LLM-powered reasoning, and API automation. With Prompt Studio defining the classification logic and LLMWhisperer ensuring high-quality OCR inputs, enterprises can automate:

  • Sorting
  • Tagging
  • Routing
  • Indexing
  • Metadata management

across thousands of documents with minimal human intervention.

This transforms document management from a manual, error-prone burden into an automated, scalable, and intelligent workflow.

How Unstract + LLMWhisperer Empower Document Management

Modern document management demands far more than storage—it requires intelligent, end-to-end understanding of every document entering the system. Unstract and LLMWhisperer work together to form a unified pipeline that handles all five essential stages: capture, parse, split, classify, and extract. Each stage solves a critical business bottleneck.

1. Capture: Seamless Integration with Storage Systems + n8n Automation

Unstract connects directly to leading cloud storage and enterprise environments:

  • AWS S3
  • Google Drive
  • SharePoint
  • Azure Blob
  • Dropbox
  • On-premise file systems

Combined with the Unstract and LLMWhisperer nodes for n8n, organizations can automate ingestion from email inboxes, CRMs, legacy systems, shared folders, and multi-step workflows—triggering document processing the moment files arrive.

This turns fragmented document intake into a synchronized, reliable entry point for all downstream automation.

2. Parse: OCR + Structural Understanding via LLMWhisperer

LLMWhisperer performs advanced OCR that preserves:

  • Layout
  • Tables
  • Checkboxes
  • Columns
  • Multi-language text
  • Handwritten content

By producing structured, layout-preserving text, it creates a clean foundation for LLM-powered reasoning. This eliminates the brittle outputs of traditional OCR and ensures downstream AI workflows fully understand the document’s context.

3. Split: Intelligent Document Separation

Using Unstract’s AI-powered PDF Splitter API, combined PDFs—loan packets, legal bundles, onboarding packets, insurance claim packages—are automatically separated into their individual documents.

Key advantages:

  • Detects boundaries using vision + LLM reasoning
  • No rules, templates, or page heuristics required
  • Produces split PDFs + boundary metadata (JSON)
  • Supports high-volume enterprise-grade throughput

This ensures documents are organized before classification and extraction even begin.

4. Classify: AI-Based Categorization at Scale

Unstract’s Prompt Studio allows teams to define custom classification logic using natural language prompts. Once deployed as an API, this logic can classify:

  • Invoices
  • Claims
  • Policies
  • Bank statements
  • ACORD forms
  • Contracts
  • HR documents
  • Tax forms

This enables automated routing, smart foldering in DMS systems, and metadata tagging—replacing manual sorting with a resilient AI-driven model.

5. Extract: LLM-Powered Structured Data Extraction

Finally, Unstract’s LLM layer converts OCR + classification outputs into:

  • Clean JSON
  • Normalized fields
  • Entity summaries
  • Key-values
  • Table structures

This transforms unstructured content into decision-ready data that can be pushed directly into databases, ERPs, underwriting systems, CRMs, or analytics dashboards.

Together, these five layers create a straight-through automation pipeline for the entire document lifecycle.

Conclusion

The combination of LLMWhisperer + Unstract marks a fundamental shift in how enterprises approach document management and OCR. Instead of stitching together fragmented tools, organizations gain a unified system that:

  • Reads any document with high accuracy
  • Understands layout, handwriting, and structure
  • Splits mixed PDFs automatically
  • Classifies documents using AI
  • Extracts meaningful data with LLMs
  • Integrates seamlessly with existing DMS and automation workflows

This approach delivers the three outcomes modern enterprises care about most:

Scalability:
Handles thousands of documents a day without rule maintenance or manual review.

Compliance:
Preserves layout, metadata, and audit trails—critical for insurance, banking, healthcare, and legal operations.

Intelligence:
Transforms documents from static files into actionable data that moves through automated pipelines.

In a world where businesses are overwhelmed by unstructured documents, LLMWhisperer provides the foundation, and Unstract provides the intelligence—making document management faster, smarter, and ready for the future.

The post LLMWhisperer: Best OCR for Document Management first appeared on Unstract.com.

]]>
AI OCR vs. AI-ready OCR: The Right Solution for Document Extraction Use Cases https://unstract.com/blog/when-to-use-llms-vs-traditional-ocr-for-document-processing/?utm_source=rss&utm_medium=rss&utm_campaign=when-to-use-llms-vs-traditional-ocr-for-document-processing Wed, 14 Jan 2026 18:38:18 +0000 https://unstract.com/?p=516971

Find out why traditional OCR remains the most reliable and cost-effective solution for the vast majority of document-processing workloads.

The post AI OCR vs. AI-ready OCR: The Right Solution for Document Extraction Use Cases first appeared on Unstract.com.

]]>

Introduction

Enterprises process millions of documents every day, and a question keeps coming up:
Can large language models like ChatGPT and Claude finally replace traditional OCR?

On paper, the idea is appealing. LLMs can understand context, reason about content, and handle complex layouts that often trip up classic OCR engines. But when we put that promise to the test, a different reality emerges. Today, LLMs are not a good replacement for OCR. Faster, simpler, and more reliable tools already exist, and they deliver better results at a fraction of the cost.

Our conclusion is clear: traditional OCR remains the most reliable and cost-effective solution for the vast majority of document-processing workloads. LLMs do offer real advantages, but only in specific edge cases. Used indiscriminately, they introduce higher costs, slower processing, and new reliability risks. Used strategically, they can complement OCR, but they shouldn’t replace it.

In this article, we break down when LLMs can help and when they hurt, the practical challenges of using them for OCR (including hallucinations, scalability, and cost), and real-world performance comparisons between ChatGPT 5.2, Claude Sonnet 4.5, and LLMWhisperer (in Unstract). We also explore how hybrid architectures combine OCR’s speed and determinism with LLMs’ reasoning capabilities, and why this approach makes the most sense for enterprises.

Whether you’re designing a document-processing pipeline or evaluating OCR technologies for production use, this analysis will help you choose the right tool for the right job, with clarity, data, and real-world insight.

Understanding Traditional OCR vs. AI OCR

Traditional OCR (Optical Character Recognition) tools, such as Tesseract, PaddleOCR, Azure Document Intelligence, AWS Textract, and Google Document AI, are purpose-built for extracting text from images and documents.

They rely on pattern recognition techniques to identify characters, words, and text structures, converting scanned documents into machine-readable text.

Key strengths of traditional OCR:

  • High speed, capable of processing thousands of pages per hour
  • Strong accuracy on clean, printed, and well-structured documents
  • Very low cost per page, ideal for large-scale processing
  • Predictable and deterministic output
  • Well-suited for forms, invoices, receipts, and standardized layouts

Traditional OCR performs best on:

  • Printed text
  • Clear or high-quality scans
  • Structured forms and tables
  • Documents with consistent formatting

For most enterprise document-processing workflows, traditional OCR remains a reliable and efficient choice.

AI OCR takes a fundamentally different approach. Instead of pattern recognition, large language models use language understanding to interpret document content. This allows them to handle more complex layouts and challenging inputs.

Tesseract OCR is one of the most popular and powerful open-source OCR tools available today.

While Tesseract is an excellent tool for basic OCR tasks, it relies heavily on traditional image processing techniques and pre-trained models that may not perform well with non-standard or complex documents.

LLMWhisperer, on the other hand, uses deep learning models that can adapt to the nuances of different writing styles, languages, and document structures.

See comparison here →

Potential advantages of AI OCR:

  • Better handling of complex or irregular layouts
  • Support for multi-column text and nested tables
  • Improved tolerance for noisy scans or partial data
  • Context-aware extraction and field inference

However, these benefits come with important trade-offs:

  • Slower processing times due to model inference
  • Significantly higher costs at scale
  • Non-deterministic outputs
  • Reduced reliability for high-volume workloads
  • Not optimized for raw text extraction

LLMs are designed primarily for language understanding tasks, not for the speed and precision required by large-scale document extraction.

The core difference is one of purpose:

  • Traditional OCR: Pattern recognition and text extraction
  • LLMs: Language understanding and interpretation

Effective document processing requires:

  • Fast and accurate extraction first
  • Interpretation, enrichment, and structuring second

Using an LLM as a pure OCR engine is like using a Swiss Army knife when you need a scalpel: it may work, but it’s not the right tool. This is why hybrid approaches, which combine OCR’s efficiency with LLM-driven intelligence, often deliver the best overall results.

Key Challenges with AI OCR

One of the most serious risks of using LLMs for OCR is hallucination: the generation of information that does not exist in the source document. Unlike traditional OCR, which either extracts text correctly or fails to recognize it, LLMs can confidently produce plausible but entirely fabricated content. At enterprise scale, this is particularly dangerous.

Key risks of hallucinations include:

  • Invented values in financial, legal, or compliance documents
  • Subtle, believable errors that pass quality checks
  • Increased risk of regulatory violations and legal exposure
  • Lower trust in automated document-processing pipelines

Cost and performance are equally significant concerns. Traditional OCR tools typically process documents for pennies per page and can handle thousands of pages per hour. In contrast, LLM-based processing can cost dollars per document and scale poorly.

Economic and performance challenges:

  • Cost: Millions of pages can translate to tens of thousands of dollars in LLM usage
  • Speed: OCR completes in seconds, while LLM calls may take minutes per document
  • Scalability: High-volume batches can stretch from hours to days or weeks
  • Operational overhead: API rate limits, retries, and orchestration complexity

LLMs also struggle with long-form documents. As document length increases, maintaining accuracy and contextual consistency becomes harder, often leading to incomplete or inconsistent extraction.

Limitations with long documents:

  • Degraded accuracy as context windows fill
  • Missed or partially extracted fields
  • Inconsistent outputs across document chunks

This is why best-practice systems follow an extraction-first approach:

  • Use traditional OCR for fast, reliable text extraction
  • Apply LLMs only for interpretation, structuring, or enrichment
  • Combine both in a controlled, auditable pipeline
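
A hedged sketch of that extraction-first shape: `run_ocr` and `call_llm` are stand-ins for whatever OCR engine and LLM client you use (both hypothetical here). The point is that the LLM only ever sees already-extracted text, with a prompt that constrains it to structure data rather than invent it.

```python
def run_ocr(path: str) -> str:
    """Stand-in for a deterministic OCR engine (e.g. a layout-preserving tool).

    In a real pipeline this is the fast, cheap, auditable step."""
    raise NotImplementedError  # wire up your OCR engine here

def call_llm(prompt: str) -> str:
    """Stand-in for an LLM client, used only for interpretation, never extraction."""
    raise NotImplementedError  # wire up your LLM client here

def build_prompt(text: str, fields: list) -> str:
    """Ground the LLM in the OCR text so it structures, rather than invents, data."""
    wanted = ", ".join(fields)
    return (
        "Using ONLY the document text below, return a JSON object with these "
        f"fields: {wanted}. Use null for anything not present in the text.\n\n"
        f"--- DOCUMENT TEXT ---\n{text}"
    )

def extract(path: str, fields: list) -> str:
    text = run_ocr(path)                          # step 1: deterministic extraction
    return call_llm(build_prompt(text, fields))   # step 2: interpretation only
```

Keeping the two steps separate also makes the pipeline auditable: the OCR text can be logged and diffed, and every LLM answer can be traced back to the exact input it was given.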

Beyond these core challenges, LLMs introduce additional structural limitations:

  • Token limits: Require chunking that can break semantic context
  • Non-determinism: Identical inputs may produce different outputs
  • Debugging difficulty: Errors are harder to diagnose and reproduce
  • Reliability concerns: Unpredictable failures in production systems

For enterprise environments that demand accuracy, scalability, and predictable behavior, these challenges make AI models a poor choice for primary OCR. Instead, they are best used strategically, augmenting traditional OCR rather than replacing it.

Optical Character Recognition (OCR) technology has become indispensable in today’s digital landscape. The effectiveness of OCR software can greatly impact workflows, data accuracy, and overall operational efficiency.

Here are some of the top OCR tools in 2026:

  1. Tesseract
  2. PaddleOCR
  3. Azure Document Intelligence
  4. Amazon Textract
  5. LLMWhisperer

For a comprehensive overview and comparison, click here →

Real-World OCR Performance Comparison

Research Methodology

To evaluate how LLMs perform for OCR tasks in real-world scenarios, we conducted a comprehensive comparison of ChatGPT 5.2, Claude Sonnet 4.5, and LLMWhisperer (in Unstract).

Our evaluation included a diverse set of documents spanning varying complexity levels. These documents were chosen to represent typical enterprise use cases.

Each solution was tested using the same extraction requirements to ensure a fair and consistent comparison.

Evaluation metrics included:

  • Accuracy: Correctness of extracted text and values
  • Completeness: Coverage of all relevant fields and sections
  • Reliability: Consistency and stability across multiple documents

The results reveal significant differences in performance, highlighting critical considerations when selecting OCR and document-processing tools.

Why Direct AI OCR Parsing Often Fails

The core problem with sending raw documents directly to LLMs is that these models are designed for text, not for images or complex document structures.

When a PDF or scanned image is sent to ChatGPT or Claude, the model must:

  • Interpret the visual layout
  • Understand spatial relationships
  • Extract text while maintaining context and structure

This multi-step process introduces multiple points of failure:

  • Missing entire sections in complex layouts
  • Misinterpreting tables and losing relationships between data points
  • Confusing multi-column layouts and mixing content from different columns
  • Inconsistent extraction of structured data across similar documents
  • Hallucinating information when document quality is poor or the text is unclear

Technically, this occurs because when LLMs receive raw documents, the visual information is converted into tokens, losing critical spatial and structural cues. For example:

  • Tables are flattened into linear text
  • Multi-column layouts become jumbled paragraphs
  • Visual cues such as borders, spacing, and alignment, which humans rely on to understand structure, are lost

The model then attempts to reconstruct the structure using language understanding alone. While this may work for simple documents, it fails for complex layouts.

Why LLMWhisperer Performs Better

LLMWhisperer is an advanced OCR tool that intelligently parses complex documents. It cleans up noisy scans and faithfully preserves the original layout, delivering data that’s perfectly formatted for LLMs and downstream systems.

By separating extraction from interpretation, LLMWhisperer avoids the fundamental mismatch that causes direct AI parsing to fail.

Performance Results: Side-by-Side Comparison

Let’s examine the results from ChatGPT 5.2, Claude Sonnet 4.5, and Unstract (with LLMWhisperer).

For each example document, we’ll review the outputs from each LLM alongside Unstract’s results.

We’ll highlight any gaps, flaws, and inaccuracies when they occur, as well as note areas where the results are accurate.

Document 1 – Photo of an Invoice

Chat GPT 5.2:

Let’s use the following prompt:

For this receipt, capture the items purchased, along with their corresponding costs and total.
Capture also the payment method, date of purchase, card number, and transaction ID.
Return only a formatted JSON.

The returned JSON:

{
  "merchant": "UNIMARKT",
  "items": [
    { "name": "Kaisersemmel", "price_eur": 3.29 },
    { "name": "Papiertragetasche", "price_eur": 0.29 },
    { "name": "Rufda Brötchen 6er", "price_eur": 3.29 },
    { "name": "Rufda Brötchen 6er (2)", "price_eur": 3.29 },
    { "name": "Saatenriegel", "price_eur": 0.69 },
    { "name": "Mohnweckerl 4er", "price_eur": 3.19 },
    { "name": "Rufda Brötchen 6er (3)", "price_eur": 4.69 },
    { "name": "Feinkost Waage", "price_eur": 4.1 },
    { "name": "Schärd Pizzakäse", "price_eur": 2.59 },
    { "name": "Schärd Pizzakäse (2)", "price_eur": 2.59 },
    { "name": "Schärd Mayertopf", "price_eur": 3.45 },
    { "name": "Schärdinger Joghurt", "price_eur": 1.05 },
    { "name": "Soletti", "price_eur": 1.49 },
    { "name": "Schärd Butter", "price_eur": 2.59 },
    { "name": "Philadelphia Light", "price_eur": 1.69 },
    { "name": "NFU Bio Tomaten", "price_eur": 1.49 },
    { "name": "Trauben weiß", "price_eur": 1.99 },
    { "name": "Schärdinger Obers", "price_eur": 1.39 },
    { "name": "Omega Beeren", "price_eur": 2.49 },
    { "name": "Schärd Käseauf", "price_eur": 2.45 },
    { "name": "Paprika Mix 3er", "price_eur": 2.49 },
    { "name": "NFU Bio Karotten", "price_eur": 1.86 },
    { "name": "Kelly Chips Party", "price_eur": 2.99 },
    { "name": "UNI Käsescheiben", "price_eur": 2.09 },
    { "name": "Kotanyi rot", "price_eur": 1.99 },
    { "name": "Ferrero Nutella", "price_eur": 1.79 },
    { "name": "Champignon", "price_eur": 2.29 },
    { "name": "Gebindeautomat", "price_eur": -0.22 }
  ],
  "total_eur": 83.65,
  "payment": {
    "method": "DEBIT MASTERCARD (contactless)",
    "card_number": "5579 74XX XXXX 4461",
    "transaction_id": "23602048"
  },
  "date_of_purchase": "2022-01-15",
  "time_of_purchase": "11:44:43"
}

Although the overall total is correct, several line items are inaccurate.

“Rufda Brötchen 6er” is listed instead of “Aufba Brötchen 6er”, and the price is incorrect (3.29 instead of 0.69).

Only one “Feinkost Waage” item is included, whereas there should be two entries with different prices.

Claude Sonnet 4.5

Let’s use the following prompt:

For this receipt, capture the items purchased, along with their corresponding costs and total.
Capture also the payment method, date of purchase, card number, and transaction ID.
Return only a formatted JSON.

The returned JSON:

{
  "store": "UNIMARKT",
  "location": "Partner Senekal, 4232 Hagenberg, Hauptstraße 97",
  "date": "15.01.22",
  "time": "11:44",
  "items": [
    {
      "name": "Kaisersemmel",
      "price": 3.29,
      "currency": "EUR"
    },
    {
      "name": "Papierfragltasche",
      "price": 0.22,
      "currency": "EUR"
    },
    {
      "name": "Kaisersemmel",
      "price": 3.29,
      "currency": "EUR"
    },
    {
      "name": "Aufba-Brötchen 6er",
      "price": 0.69,
      "currency": "EUR"
    },
    {
      "name": "Aufba-Brötchen 6er",
      "price": 0.69,
      "currency": "EUR"
    },
    {
      "name": "Saatenriegel",
      "price": 5.99,
      "currency": "EUR"
    },
    {
      "name": "Kornweckerl 4er",
      "price": 1.29,
      "currency": "EUR"
    },
    {
      "name": "Aufba-Brötchen 6er",
      "price": 0.69,
      "currency": "EUR"
    },
    {
      "name": "Fajnkost Waage",
      "price": 4.1,
      "currency": "EUR"
    },
    {
      "name": "Frischfisch",
      "price": 16.31,
      "currency": "EUR"
    },
    {
      "name": "Scharü Pizzakäse",
      "price": 2.59,
      "currency": "EUR"
    },
    {
      "name": "Scharü Pizzakäse",
      "price": 2.59,
      "currency": "EUR"
    },
    {
      "name": "Scharü Magertopc",
      "quantity": 15,
      "price": 3.45,
      "currency": "EUR"
    },
    {
      "name": "Scharü Bio Jog",
      "price": 1.05,
      "currency": "EUR"
    },
    {
      "name": "Soletti",
      "price": 1.49,
      "currency": "EUR"
    },
    {
      "name": "Scharü Butter",
      "price": 2.09,
      "currency": "EUR"
    },
    {
      "name": "Schögl Jausenwürstchen",
      "price": 2.89,
      "currency": "EUR"
    },
    {
      "name": "Philadelphia Kräu",
      "price": 1.69,
      "currency": "EUR"
    },
    {
      "name": "Bio Jausenwürstl",
      "price": 1.89,
      "currency": "EUR"
    },
    {
      "name": "NFU Bio Tomaten Aktion",
      "price": 1.49,
      "currency": "EUR"
    },
    {
      "name": "Trauben weiß",
      "price": 5.99,
      "currency": "EUR"
    },
    {
      "name": "Schärdinger Obers",
      "price": 1.39,
      "currency": "EUR"
    },
    {
      "name": "Sepia Beugen",
      "price": 2.49,
      "currency": "EUR"
    },
    {
      "name": "Osttirol Käse pur",
      "price": 2.89,
      "currency": "EUR"
    },
    {
      "name": "Paprika Mix Ser",
      "price": 2.49,
      "currency": "EUR"
    },
    {
      "name": "NFU Bio Preirig Aktion",
      "price": 1.86,
      "currency": "EUR"
    },
    {
      "name": "Bio Kochsahne10%",
      "price": 2.99,
      "currency": "EUR"
    },
    {
      "name": "UNI Käsescheiben",
      "price": 2.09,
      "currency": "EUR"
    },
    {
      "name": "Kolangl rot",
      "price": 1.59,
      "currency": "EUR"
    },
    {
      "name": "Ferrero Nutella Aktion",
      "price": 1.79,
      "currency": "EUR"
    },
    {
      "name": "Champigno",
      "price": 2.29,
      "currency": "EUR"
    },
    {
      "name": "Feuerzangenbowle",
      "price": -0.22,
      "currency": "EUR"
    }
  ],
  "subtotal": 83.65,
  "currency": "EUR",
  "total": 83.65,
  "payment_method": "DEBIT MASTERCARD",
  "card_number": "**** **** **** 4541",
  "transaction_id": "53779103009",
  "terminal_id": "23602048",
  "trace_number": "ROOO000041010",
  "authorization_code": "109305",
  "tax_info": {
    "rate_20_percent": {
      "net": 75.96,
      "tax": 7.69,
      "gross": 83.58
    },
    "rate_10_percent": {
      "net": 0.06,
      "tax": 0.01,
      "gross": 0.07
    }
  },
  "cashier_id": "001",
  "register_id": "Erhwert Brigi"
}

Even though it appears to do a good job overall, some fields contain incorrect names.

For example, “Papierfragltasche” is used instead of “Papiertragetasche”, and “Kornweckerl 4er” is listed instead of “Mohnweckerl 4er”, with an incorrect price of 1.29 instead of 3.19.

Unstract + LLMWhisperer


Create a Prompt Studio project, add the document, and use the following prompt:

For this receipt, capture the items purchased, along with their corresponding costs and total.
Capture also the payment method, date of purchase, card number, and transaction ID.
Return only a formatted JSON.

The returned JSON:

{
  "card_number": "5574 74XX XXXX 4461",
  "date_of_purchase": "15.01.2022",
  "items": [
    {
      "cost": 3.29,
      "name": "Kaisersemmel"
    },
    {
      "cost": 0.29,
      "name": "Papiertragetasche"
    },
    {
      "cost": 3.29,
      "name": "Kaisersemmel"
    },
    {
      "cost": 0.69,
      "name": "Aufba. Brötchen 6er"
    },
    {
      "cost": 0.69,
      "name": "Aufba. Brötchen 6er"
    },
    {
      "cost": 5.99,
      "name": "Saatenriegel"
    },
    {
      "cost": 3.19,
      "name": "Mohnweckerl 4er"
    },
    {
      "cost": 0.69,
      "name": "Aufba. Brötchen 6er"
    },
    {
      "cost": 4.1,
      "name": "Feinkost Waage"
    },
    {
      "cost": 16.31,
      "name": "Feinkost Waage"
    },
    {
      "cost": 2.59,
      "name": "Schärd. Pizzakäse"
    },
    {
      "cost": 2.59,
      "name": "Schärd. Pizzakäse"
    },
    {
      "cost": 3.45,
      "name": "Schärd. Magertopf"
    },
    {
      "cost": 1.05,
      "name": "Schärdinger Jogh."
    },
    {
      "cost": 1.49,
      "name": "Soletti"
    },
    {
      "cost": 2.59,
      "name": "Schärd. Butter"
    },
    {
      "cost": 2.89,
      "name": "Schögl Jauseneier"
    },
    {
      "cost": 1.69,
      "name": "Philadelphia Kräu"
    },
    {
      "cost": 1.69,
      "name": "Philadelphia Lipt."
    },
    {
      "cost": 1.49,
      "name": "NFU Bio Tomaten Aktion"
    },
    {
      "cost": 1.99,
      "name": "Trauben weiß"
    },
    {
      "cost": 1.39,
      "name": "Schärdinger Obers"
    },
    {
      "cost": 2.49,
      "name": "Omega Beeren"
    },
    {
      "cost": 2.45,
      "name": "Schard, Käseauf"
    },
    {
      "cost": 2.49,
      "name": "Paprika Mix 3er"
    },
    {
      "cost": 1.86,
      "name": "NFU Bio Ramiro Aktion"
    },
    {
      "cost": 2.99,
      "name": "Kelly Chips Party"
    },
    {
      "cost": 2.09,
      "name": "UNI Käsescheiben"
    },
    {
      "cost": 1.99,
      "name": "Kotanyi rot"
    },
    {
      "cost": 1.79,
      "name": "Ferrero Nutella Aktion"
    },
    {
      "cost": 2.29,
      "name": "Champignon"
    },
    {
      "cost": -0.22,
      "name": "Gebindeautomat"
    }
  ],
  "payment_method": "Kreditkarte",
  "total": 83.65,
  "transaction_id": "53779013059"
}

All fields were correctly identified, including those that the LLMs misspelled, and multiple occurrences of items were accurately captured with their corresponding prices.

Document 2 – Scan of Shipping Goods

Chat GPT 5.2:

Let’s use the following prompt:

Return all the items in the shipment, their details, their model number their quantity, weight, total weight and measurements.
Also include any additional notes.
Return only a formatted JSON.

The returned JSON:

{
  "shipment": {
    "shipper": "HINDALCO INDUSTRIES LIMITED",
    "consignee": "TO ORDER",
    "notify_party": "ACCESS WORLD (KOREA) LTD",
    "carrier": "MAERSK",
    "bill_of_lading_number": "212202748",
    "booking_number": "212202748",
    "service_contract": "16978006",
    "vessel": "GRACE BRIDGE",
    "voyage_number": "131E",
    "port_of_loading": "Jawaharlal Nehru, India",
    "port_of_discharge": "Busan New Port, Korea",
    "place_of_issue": "Mumbai",
    "date_of_issue": "2021-08-07",
    "shipped_on_board_date": "2021-08-07"
  },
  "items": [
    {
      "description": "Primary Aluminium Ingot P1020A",
      "brand": "HINDALCO",
      "hs_code": "760110",
      "origin": "India",
      "model_number": "P1020A",
      "container_type": "20 DRY 8'6",
      "bundles_per_container": 25,
      "number_of_containers": 12,
      "total_bundles": 300,
      "total_pieces": 13200,
      "containers": [
        {
          "container_number": "MRKU7866431",
          "seal_number": "ML-IN4105171",
          "bundles": 25,
          "weight_kgs": 24843.0
        },
        {
          "container_number": "MSKU7248191",
          "seal_number": "ML-IN4105176",
          "bundles": 25,
          "weight_kgs": 24982.0
        },
        {
          "container_number": "TCLU2563110",
          "seal_number": "ML-IN4105162",
          "bundles": 25,
          "weight_kgs": 24624.0
        },
        {
          "container_number": "MRKU7035620",
          "seal_number": "ML-IN4105064",
          "bundles": 25,
          "weight_kgs": 24695.0
        },
        {
          "container_number": "MSKU5184077",
          "seal_number": "ML-IN4105082",
          "bundles": 25,
          "weight_kgs": 24545.0
        },
        {
          "container_number": "PONU0013467",
          "seal_number": "ML-IN4105123",
          "bundles": 25,
          "weight_kgs": 24519.0
        },
        {
          "container_number": "FCIU4274611",
          "seal_number": "ML-IN4105167",
          "bundles": 25,
          "weight_kgs": 24777.0
        }
      ]
    }
  ],
  "weights": {
    "gross_weight_kgs": 297022.0,
    "net_weight_kgs": 297022.0,
    "gross_weight_mt": 297.022,
    "net_weight_mt": 297.022
  },
  "measurements": {
    "containers": 12,
    "container_size": "20 DRY 8'6",
    "measurement_details": "Not explicitly stated on document"
  },
  "additional_notes": [
    "Shippers load, count and seal",
    "Freight prepaid",
    "12 containers said to contain 300 bundles",
    "Above particulars as declared by shipper, without responsibility of carrier",
    "Original bill of lading - 1 of 3",
    "Document marked ORIGINAL"
  ]
}

Although it does a good job of recognizing most of the text in the scan, it fails to organize the information properly and even introduces non-existent fields such as “model_number” and “measurement_details”, which it then flags as “not stated in the document”.
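Invented fields like these can be caught automatically by validating the extracted keys against the schema you expect. The sketch below walks the JSON recursively and reports any key outside a whitelist; the `ALLOWED` set and the sample result are hypothetical, trimmed down for illustration.

```python
def unexpected_keys(extracted: dict, allowed: set, prefix: str = "") -> list:
    """Recursively collect keys that are not in the expected schema."""
    bad = []
    for key, value in extracted.items():
        path = f"{prefix}{key}"
        if key not in allowed:
            bad.append(path)
        if isinstance(value, dict):
            bad.extend(unexpected_keys(value, allowed, prefix=path + "."))
    return bad

# Hypothetical expected fields for a bill-of-lading extraction
ALLOWED = {"shipment", "items", "description", "hs_code", "origin", "weights"}
result = {"shipment": {"hs_code": "760110", "model_number": "P1020A"}}
print(unexpected_keys(result, ALLOWED))  # ['shipment.model_number']
```

In production you would typically express the whitelist as a JSON Schema with `additionalProperties: false` instead of a hand-rolled set, but the idea is the same.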

Claude Sonnet 4.5:

Let’s use the following prompt:

Return all the items in the shipment, their details, their model number their quantity, weight, total weight and measurements. Also include any additional notes. Return only a formatted JSON.

The returned JSON:

{
  "document_type": "Bill of Lading for Ocean Transport or Multimodal Transport",
  "scac": "MAEU",
  "bill_of_lading_number": "212202748",
  "booking_number": "212202748",
  "export_reference": "16978006",
  "shipper": {
    "name": "HINDALCO INDUSTRIES LIMITED",
    "address": "AHURA CENTRE, 1ST FLOOR B WING, MAHAKALI CAVES ROAD, ANDHERI EAST MUMBAI, MAHARASHTRA INDIA 400093",
    "phone": "022 66917000"
  },
  "consignee": {
    "name": "ACCESS WORLD (KOREA) LTD",
    "address": "ROOM NO., 902, HANJU BLDG, 7, GUKJEGEUMYUNG-RO 2-GIL, YEONGDEUNGPO-GU, SEOUL, KOREA 07327",
    "phone": "(82) 2 723 6311",
    "fax": "(82) 2 723 6314",
    "email": "[email protected]"
  },
  "vessel": {
    "name": "GRACE BRIDGE",
    "voyage_number": "131E"
  },
  "port_of_loading": "JAWAHARLAL NEHRU, INDIA",
  "port_of_discharge": "Busan New Port, Korea",
  "shipment_summary": {
    "total_containers": 12,
    "container_type": "20' CONTAINERS",
    "total_bundles": 300,
    "total_pieces": 13200,
    "product": "PRIMARY ALUMINIUM INGOT P1020A",
    "brand": "HINDALCO",
    "hs_code": "760110",
    "origin": "INDIA",
    "gross_weight": "297.022 MT",
    "net_weight": "297.022 MT",
    "total_weight_kgs": 297022.0
  },
  "shipping_documents": {
    "shipping_bill_number": "3467675",
    "shipping_bill_date": "29.07.2021",
    "invoice_number": "RFG/P/2022/590",
    "invoice_date": "28.07.2021",
    "freight_terms": "PREPAID"
  },
  "container_details": [
    {
      "container_number": "MRKU7686431",
      "seal_number": "ML-IN4105171",
      "size": "20'",
      "type": "DRY",
      "bundles": 25,
      "weight_kgs": 24843.0
    },
    {
      "container_number": "MSKU7248191",
      "seal_number": "ML-IN4105176",
      "size": "20'",
      "type": "DRY",
      "bundles": 25,
      "weight_kgs": 24982.0
    },
    {
      "container_number": "TCLU2563110",
      "seal_number": "ML-IN4105162",
      "size": "20'",
      "type": "DRY",
      "bundles": 25,
      "weight_kgs": 24624.0
    },
    {
      "container_number": "MRKU7035620",
      "seal_number": "ML-IN4105064",
      "size": "20'",
      "type": "DRY",
      "bundles": 25,
      "weight_kgs": 24695.0
    },
    {
      "container_number": "MSKU5184077",
      "seal_number": "ML-IN4105082",
      "size": "20'",
      "type": "DRY",
      "bundles": 25,
      "weight_kgs": 24545.0
    },
    {
      "container_number": "PONU0013467",
      "seal_number": "ML-IN4105123",
      "size": "20'",
      "type": "DRY",
      "bundles": 25,
      "weight_kgs": 24519.0
    },
    {
      "container_number": "FCIU4274611",
      "seal_number": "ML-IN4105167",
      "size": "20'",
      "type": "DRY",
      "bundles": 25,
      "weight_kgs": 24777.0
    }
  ],
  "bill_of_lading_details": {
    "number_of_original_bls": "1/THREE",
    "date_of_issue": "2021-08-07",
    "shipped_on_board_date": "2021-08-07"
  },
  "carrier": {
    "name": "Maersk Line India Pvt Ltd",
    "signatory": "Signatory for Carrier Maersk A/S"
  },
  "notes": [
    "SHIPPERS LOAD, COUNT AND SEAL",
    "12 containers said to contain 300 BUNDLES",
    "Above particulars as declared by Shipper, but without responsibility of or representation by Carrier (see clause 14)",
    "Place of Receipt: Applicable only when document used as Multimodal Transport B/L (see clause 1)",
    "Place of Delivery: Applicable only when document used as Multimodal Transport B/L (see clause 1)",
    "Onward inland routing (Not part of carriage as defined in clause 1. For account and risk of Merchant)",
    "This transport document has one or more numbered pages"
  ],
  "document_reference": "2020-Msk-A1 001468994"
}

It appears correct at first glance, but the “export_reference” field is incorrect, the “bill_of_lading_details” section does not exist in the document, and the “signatory” field contains no meaningful data.

Unstract:

Create a Prompt Studio project, add the document, and use the following prompt:

Return all the items in the shipment, their details, their model number their quantity, weight, total weight and measurements. Also include any additional notes. Return only a formatted JSON.

The returned JSON:

{
  "shipment": {
    "additional_notes": null,
    "items": [
      {
        "brand": "HINDALCO",
        "containers": [
          {
            "container_no": "MRKU7686431",
            "quantity_bundles": 25,
            "weight": 24843
          },
          {
            "container_no": "MSKU7248191",
            "quantity_bundles": 25,
            "weight": 24982
          },
          {
            "container_no": "TCLU2563110",
            "quantity_bundles": 25,
            "weight": 24624
          },
          {
            "container_no": "MRKU7035620",
            "quantity_bundles": 25,
            "weight": 24695
          },
          {
            "container_no": "MSKU5184077",
            "quantity_bundles": 25,
            "weight": 24545
          },
          {
            "container_no": "PONU0013467",
            "quantity_bundles": 25,
            "weight": 24519
          },
          {
            "container_no": "FCIU4274611",
            "quantity_bundles": 25,
            "weight": 24777
          }
        ],
        "description": "PRIMARY ALUMINIUM INGOT P1020A",
        "gross_weight": 297022,
        "hs_code": "760110",
        "measurements": "12 × 20' containers",
        "model_number": null,
        "net_weight": 297022,
        "origin": "INDIA",
        "quantity_bundles": 300,
        "quantity_pieces": 13200,
        "total_weight": 297022
      }
    ]
  }
}

It accurately returns the required information and organizes the output in a clean, structured format, while avoiding the introduction of non-existent fields.
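Shipment documents like this one also lend themselves to arithmetic cross-checks: the per-container bundle counts and weights should reconcile with the declared totals. Below is a minimal sketch using the same field names as the JSON above; the two-container item is illustrative, not the full shipment.

```python
def check_shipment(item: dict) -> dict:
    """Cross-check per-container figures against the declared totals."""
    containers = item["containers"]
    return {
        "bundles_match": sum(c["quantity_bundles"] for c in containers)
        == item["quantity_bundles"],
        "weight_match": sum(c["weight"] for c in containers)
        == item["gross_weight"],
    }

# Illustrative item with two containers (field names follow the JSON above)
item = {
    "containers": [
        {"container_no": "AAAA0000001", "quantity_bundles": 25, "weight": 24843},
        {"container_no": "AAAA0000002", "quantity_bundles": 25, "weight": 24982},
    ],
    "quantity_bundles": 50,
    "gross_weight": 49825,
}
print(check_shipment(item))  # {'bundles_match': True, 'weight_match': True}
```

A mismatch here usually means the extractor dropped a container row or misread a digit, both of which are common failure modes on scanned tables.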

Document 3 – PDF Psychological Report

Download sample →

ChatGPT 5.2:

Let’s use the following prompt:

Return from pages 13 to 18, the information for the questionnaire and structure of the questions to the options and sub-responses.
For the ones that have been filled out, include the sub-responses.
Return only a formatted JSON.

The returned JSON:

{
  "questionnaire": {
    "name": "Peters et al. Delusions Inventory (PDI-21)",
    "pages": "13-18",
    "description": "Self-report questionnaire assessing delusional ideation in the general population. Each item is first answered Yes/No. If 'Yes', three additional dimensions are rated: Distress, Preoccupation, and Conviction."
  },
  "questions": [
    {
      "item_number": 1,
      "question": "Do you ever feel as if people seem to drop hints about you or say things with a double meaning?",
      "response_structure": {
        "primary_response": ["Yes", "No"],
        "sub_responses_if_yes": {
          "distress": {
            "scale": "1-5",
            "label": "How distressing is this belief?"
          },
          "preoccupation": {
            "scale": "1-5",
            "label": "How often do you think about it?"
          },
          "conviction": {
            "scale": "1-5",
            "label": "How strongly do you believe it is true?"
          }
        }
      }
    },
    {
      "item_number": 2,
      "question": "Do you ever feel as if things in magazines or on TV were written especially for you?",
      "response_structure": {
        "primary_response": ["Yes", "No"],
        "sub_responses_if_yes": {
          "distress": { "scale": "1-5" },
          "preoccupation": { "scale": "1-5" },
          "conviction": { "scale": "1-5" }
        }
      }
    },
    {
      "item_number": 3,
      "question": "Do you ever feel as if someone is deliberately trying to harm you?",
      "response_structure": {
        "primary_response": ["Yes", "No"],
        "sub_responses_if_yes": {
          "distress": { "scale": "1-5" },
          "preoccupation": { "scale": "1-5" },
          "conviction": { "scale": "1-5" }
        }
      }
    },
    {
      "item_number": 4,
      "question": "Do you ever feel as if you are being persecuted in some way?",
      "response_structure": {
        "primary_response": ["Yes", "No"],
        "sub_responses_if_yes": {
          "distress": { "scale": "1-5" },
          "preoccupation": { "scale": "1-5" },
          "conviction": { "scale": "1-5" }
        }
      }
    },
    {
      "item_number": 5,
      "question": "Do you ever feel as if there is a conspiracy against you?",
      "response_structure": {
        "primary_response": ["Yes", "No"],
        "sub_responses_if_yes": {
          "distress": { "scale": "1-5" },
          "preoccupation": { "scale": "1-5" },
          "conviction": { "scale": "1-5" }
        }
      }
    },
    {
      "item_number": 6,
      "question": "Do you ever feel as if people are laughing at you behind your back?",
      "response_structure": {
        "primary_response": ["Yes", "No"],
        "sub_responses_if_yes": {
          "distress": { "scale": "1-5" },
          "preoccupation": { "scale": "1-5" },
          "conviction": { "scale": "1-5" }
        }
      }
    },
    {
      "item_number": 7,
      "question": "Do you ever feel as if strangers are looking at you critically?",
      "response_structure": {
        "primary_response": ["Yes", "No"],
        "sub_responses_if_yes": {
          "distress": { "scale": "1-5" },
          "preoccupation": { "scale": "1-5" },
          "conviction": { "scale": "1-5" }
        }
      }
    },
    {
      "item_number": 8,
      "question": "Do you ever feel as if people are deliberately acting to annoy you?",
      "response_structure": {
        "primary_response": ["Yes", "No"],
        "sub_responses_if_yes": {
          "distress": { "scale": "1-5" },
          "preoccupation": { "scale": "1-5" },
          "conviction": { "scale": "1-5" }
        }
      }
    },
    {
      "item_number": 9,
      "question": "Do you ever feel as if you are being watched or monitored?",
      "response_structure": {
        "primary_response": ["Yes", "No"],
        "sub_responses_if_yes": {
          "distress": { "scale": "1-5" },
          "preoccupation": { "scale": "1-5" },
          "conviction": { "scale": "1-5" }
        }
      }
    },
    {
      "item_number": 10,
      "question": "Do you ever feel as if your thoughts are being interfered with?",
      "response_structure": {
        "primary_response": ["Yes", "No"],
        "sub_responses_if_yes": {
          "distress": { "scale": "1-5" },
          "preoccupation": { "scale": "1-5" },
          "conviction": { "scale": "1-5" }
        }
      }
    }
  ],
  "response_scales": {
    "distress": {
      "range": "1-5",
      "anchors": {
        "1": "Not at all distressing",
        "5": "Extremely distressing"
      }
    },
    "preoccupation": {
      "range": "1-5",
      "anchors": {
        "1": "Hardly ever think about it",
        "5": "Think about it all the time"
      }
    },
    "conviction": {
      "range": "1-5",
      "anchors": {
        "1": "Do not believe it is true",
        "5": "Believe it is absolutely true"
      }
    }
  },
  "notes": [
    "Only questions endorsed with a 'Yes' receive sub-scale ratings.",
    "Total PDI score can be calculated from number of endorsed items and summed sub-scale scores.",
    "Pages 13-18 present the questionnaire items and scoring instructions."
  ]
}

It returns every possible option but ignores the actual selected answers, which is what the prompt requested.

Claude Sonnet 4.5:

Let’s use the following prompt:

Return from pages 13 to 18, the information for the questionnaire and structure of the questions to the options and sub-responses.
For the ones that have been filled out, include the sub-responses.
Return only a formatted JSON.

The returned JSON:

{
  "questionnaire": {
    "title": "P.D.I.-21 (Peters et al. Delusions Inventory - 21 item version)",
    "description": "This questionnaire is designed to measure beliefs and vivid mental experiences. We believe that they are much more common than has previously been supposed, and that most people have had some such experiences during their lives.",
    "instructions": {
      "main": "Please answer the following questions as honestly as you can. There are no right or wrong answers, and there are no trick questions.",
      "exclusion": "Please note that we are NOT interested in experiences people may have had when under the influence of drugs.",
      "importance": "IT IS IMPORTANT THAT YOU ANSWER ALL QUESTIONS.",
      "rating_instruction": "For the questions you answer YES to, we are interested in: (a) how distressing these beliefs or experiences are (b) how often you think about them; and (c) how true you believe them to be.",
      "scale_instruction": "On the right hand side of the page we would like you to circle the number which corresponds most closely to how distressing this belief is, how often you think about it, and how much you believe that it is true.",
      "no_answer_instruction": "If you answer NO please move on to the next question."
    },
    "response_format": {
      "primary_response": "YES/NO",
      "secondary_ratings": {
        "distress": {
          "scale": "1-5",
          "anchors": {
            "1": "Not at all distressing",
            "5": "Very distressing"
          }
        },
        "frequency": {
          "scale": "1-5",
          "anchors": {
            "1": "Hardly ever think about it",
            "5": "Think about it all the time"
          }
        },
        "conviction": {
          "scale": "1-5",
          "anchors": {
            "1": "Don't believe it's true",
            "5": "Believe it is absolutely true"
          }
        }
      }
    },
    "questions": [
      {
        "number": 1,
        "text": "Do you ever feel as if people seem to drop hints about you or say things with a double meaning?",
        "response": "Not filled out",
        "ratings": null
      },
      {
        "number": 2,
        "text": "Do you ever feel as if things in magazines or on TV were written especially for you?",
        "response": "Not filled out",
        "ratings": null
      },
      {
        "number": 3,
        "text": "Do you ever feel as if some people are not what they seem to be?",
        "response": "Not filled out",
        "ratings": null
      },
      {
        "number": 4,
        "text": "Do you ever feel as if you are being persecuted in some way?",
        "response": "Not filled out",
        "ratings": null
      },
      {
        "number": 5,
        "text": "Do you ever feel as if there is a conspiracy against you?",
        "response": "Not filled out",
        "ratings": null
      },
      {
        "number": 6,
        "text": "Do you ever feel as if you are, or destined to be someone very important?",
        "response": "Not filled out",
        "ratings": null
      },
      {
        "number": 7,
        "text": "Do you ever feel that you are a very special or unusual person?",
        "response": "Not filled out",
        "ratings": null
      },
      {
        "number": 8,
        "text": "Do you ever feel that you are especially close to God?",
        "response": "Not filled out",
        "ratings": null
      },
      {
        "number": 9,
        "text": "Do you ever think people can communicate telepathically?",
        "response": "Not filled out",
        "ratings": null
      },
      {
        "number": 10,
        "text": "Do you ever feel as if electrical devices such as computers can influence the way you think?",
        "response": "Not filled out",
        "ratings": null
      },
      {
        "number": 11,
        "text": "Do you ever feel as if you have been chosen by God in some way?",
        "response": "Not filled out",
        "ratings": null
      },
      {
        "number": 12,
        "text": "Do you believe in the power of witchcraft, voodoo or the occult?",
        "response": "Not filled out",
        "ratings": null
      },
      {
        "number": 13,
        "text": "Are you often worried that your partner may be unfaithful?",
        "response": "Not filled out",
        "ratings": null
      },
      {
        "number": 14,
        "text": "Do you ever feel that you have sinned more than the average person?",
        "response": "Not filled out",
        "ratings": null
      },
      {
        "number": 15,
        "text": "Do you ever feel that people look at you oddly because of your appearance?",
        "response": "Not filled out",
        "ratings": null
      },
      {
        "number": 16,
        "text": "Do you ever feel as if you had no thoughts in your head at all?",
        "response": "Not filled out",
        "ratings": null
      },
      {
        "number": 17,
        "text": "Do you ever feel as if the world is about to end?",
        "response": "Not filled out",
        "ratings": null
      },
      {
        "number": 18,
        "text": "Do your thoughts ever feel alien to you in some way?",
        "response": "Not filled out",
        "ratings": null
      },
      {
        "number": 19,
        "text": "Have your thoughts ever been so vivid that you were worried other people would hear them?",
        "response": "Not filled out",
        "ratings": null
      },
      {
        "number": 20,
        "text": "Do you ever feel as if your own thoughts were being echoed back to you?",
        "response": "Not filled out",
        "ratings": null
      },
      {
        "number": 21,
        "text": "Do you ever feel as if you are a robot or zombie without a will of your own?",
        "response": "Not filled out",
        "ratings": null
      }
    ],
    "scoring": {
      "yes_no_score": {
        "range": "0-21",
        "method": "Assign 1 to each YES answer and 0 to each NO answer, then sum all 21 items"
      },
      "dimension_scores": {
        "distress": {
          "range": "0-105",
          "method": "Sum ratings on distress dimension for all 21 items (0 for NO answers, 1-5 for YES answers)"
        },
        "preoccupation": {
          "range": "0-105",
          "method": "Sum ratings on frequency dimension for all 21 items (0 for NO answers, 1-5 for YES answers)"
        },
        "conviction": {
          "range": "0-105",
          "method": "Sum ratings on conviction dimension for all 21 items (0 for NO answers, 1-5 for YES answers)"
        }
      },
      "total_pdi": {
        "range": "0-336",
        "method": "Sum of yes/no scores + distress + preoccupation + conviction"
      }
    }
  }
}

It provides a lot of information about the questionnaire and its questions, but completely misses the responses that were filled in.

Unstract:

Create a Prompt Studio project, add the document, and use the following prompt:

Return from pages 13 to 18, the information for the questionnaire and structure of the questions to the options and sub-responses.
For the ones that have been filled out, include the sub-responses.
Return only a formatted JSON.

The returned JSON:

{
  "questionnaire": {
    "example": {
      "question": "Do you ever feel as if people are reading your mind?",
      "sub-responses": {
        "conviction": "5",
        "distress": "1",
        "preoccupation": "3"
      }
    },
    "questions": [
      {
        "question": "Do you ever feel as if people seem to drop hints about you or say things with a double meaning?",
        "response": "NO",
        "sub-responses": {}
      },
      {
        "question": "Do you ever feel as if things in magazines or on TV were written especially for you?",
        "response": "NO",
        "sub-responses": {}
      },
      {
        "question": "Do you ever feel as if some people are not what they seem to be?",
        "response": "YES",
        "sub-responses": {
          "conviction": "3",
          "distress": "3",
          "preoccupation": "3"
        }
      },
      {
        "question": "Do you ever feel as if you are being persecuted in some way?",
        "response": "NO",
        "sub-responses": {}
      },
      {
        "question": "Do you ever feel as if there is a conspiracy against you?",
        "response": "NO",
        "sub-responses": {}
      },
      {
        "question": "Do you ever feel as if you are, or destined to be someone very important?",
        "response": "NO",
        "sub-responses": {}
      },
      {
        "question": "Do you ever feel that you are a very special or unusual person?",
        "response": "YES",
        "sub-responses": {
          "conviction": "3",
          "distress": "2",
          "preoccupation": "3"
        }
      },
      {
        "question": "Do you ever feel that you are especially close to God?",
        "response": "NO",
        "sub-responses": {}
      },
      {
        "question": "Do you ever think people can communicate telepathically?",
        "response": "NO",
        "sub-responses": {}
      },
      {
        "question": "Do you ever feel as if electrical devices such as computers can influence the way you think?",
        "response": "NO",
        "sub-responses": {}
      },
      {
        "question": "Do you ever feel as if you have been chosen by God in some way?",
        "response": "NO",
        "sub-responses": {}
      },
      {
        "question": "Do you believe in the power of witchcraft, voodoo or the occult?",
        "response": "NO",
        "sub-responses": {}
      },
      {
        "question": "Are you often worried that your partner may be unfaithful?",
        "response": "NO",
        "sub-responses": {}
      },
      {
        "question": "Do you ever feel that you have sinned more than the average person?",
        "response": "NO",
        "sub-responses": {}
      },
      {
        "question": "Do you ever feel that people look at you oddly because of your appearance?",
        "response": "NO",
        "sub-responses": {}
      },
      {
        "question": "Do you ever feel as if you had no thoughts in your head at all?",
        "response": "NO",
        "sub-responses": {}
      },
      {
        "question": "Do you ever feel as if the world is about to end?",
        "response": "NO",
        "sub-responses": {}
      },
      {
        "question": "Do your thoughts ever feel alien to you in some way?",
        "response": "NO",
        "sub-responses": {}
      },
      {
        "question": "Have your thoughts ever been so vivid that you were worried other people would hear them?",
        "response": "NO",
        "sub-responses": {}
      },
      {
        "question": "Do you ever feel as if your own thoughts were being echoed back to you?",
        "response": "NO",
        "sub-responses": {}
      },
      {
        "question": "Do you ever feel as if you are a robot or zombie without a will of your own?",
        "response": "NO",
        "sub-responses": {}
      }
    ]
  }
}

As you can see, it correctly identifies the responses marked in the document and only attaches sub-responses when an answer is provided.
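Because this extraction preserves the filled-in responses, it can feed directly into scoring. The sketch below applies the PDI scoring rules (1 point per YES, plus summed 1–5 ratings on each dimension) to a list shaped like the `questions` array above; the three-item list is a shortened stand-in for the full 21 items.

```python
def score_pdi(questions: list) -> dict:
    """Score a PDI extraction: 1 per YES, plus summed dimension ratings."""
    yes_no = sum(1 for q in questions if q["response"] == "YES")
    dims = {"distress": 0, "preoccupation": 0, "conviction": 0}
    for q in questions:
        for dim in dims:
            # Ratings are extracted as strings ("3"); NO items have no ratings.
            dims[dim] += int(q["sub-responses"].get(dim, 0))
    return {"yes_no": yes_no, **dims, "total": yes_no + sum(dims.values())}

# The two endorsed items from the extraction above, plus one NO item
questions = [
    {"response": "YES",
     "sub-responses": {"conviction": "3", "distress": "3", "preoccupation": "3"}},
    {"response": "YES",
     "sub-responses": {"conviction": "3", "distress": "2", "preoccupation": "3"}},
    {"response": "NO", "sub-responses": {}},
]
print(score_pdi(questions))
# {'yes_no': 2, 'distress': 5, 'preoccupation': 6, 'conviction': 6, 'total': 19}
```

This kind of downstream computation is only possible when the extractor captures the selected answers rather than just the questionnaire structure.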

Document 4 – PDF of a Medical Document

Download sample →

ChatGPT 5.2:

Let’s use the following prompt:

Capture the Study Calendar table from this document and sort by each Baseline Evaluation, starting with AZD2171.
For any crossed boxes, mark them True. For non-crossed boxes, mark them NULL.
Return only a formatted JSON.

The returned JSON:

[
  {
    "Procedure": "AZD2171",
    "PreStudy": true,
    "C1D1": true,
    "C2D1": true,
    "C3D1": true,
    "C4D1": true,
    "C5D1": true,
    "OffStudy": null
  },
  {
    "Procedure": "Prednisone",
    "PreStudy": true,
    "C1D1": true,
    "C2D1": true,
    "C3D1": true,
    "C4D1": true,
    "C5D1": true,
    "OffStudy": null
  },
  {
    "Procedure": "Informed consent",
    "PreStudy": true,
    "C1D1": null,
    "C2D1": null,
    "C3D1": null,
    "C4D1": null,
    "C5D1": null,
    "OffStudy": null
  },
  {
    "Procedure": "Demographics",
    "PreStudy": true,
    "C1D1": null,
    "C2D1": null,
    "C3D1": null,
    "C4D1": null,
    "C5D1": null,
    "OffStudy": null
  },
  {
    "Procedure": "Medical history",
    "PreStudy": true,
    "C1D1": null,
    "C2D1": null,
    "C3D1": null,
    "C4D1": null,
    "C5D1": null,
    "OffStudy": null
  },
  {
    "Procedure": "Med List",
    "PreStudy": true,
    "C1D1": true,
    "C2D1": true,
    "C3D1": true,
    "C4D1": true,
    "C5D1": true,
    "OffStudy": true
  },
  {
    "Procedure": "Physical exam",
    "PreStudy": true,
    "C1D1": true,
    "C2D1": true,
    "C3D1": true,
    "C4D1": true,
    "C5D1": true,
    "OffStudy": true
  },
  {
    "Procedure": "Vital signs",
    "PreStudy": true,
    "C1D1": true,
    "C2D1": true,
    "C3D1": true,
    "C4D1": true,
    "C5D1": true,
    "OffStudy": true
  },
  {
    "Procedure": "Height",
    "PreStudy": true,
    "C1D1": null,
    "C2D1": null,
    "C3D1": null,
    "C4D1": null,
    "C5D1": null,
    "OffStudy": null
  },
  {
    "Procedure": "Weight",
    "PreStudy": true,
    "C1D1": true,
    "C2D1": true,
    "C3D1": true,
    "C4D1": true,
    "C5D1": true,
    "OffStudy": true
  },
  {
    "Procedure": "Performance status",
    "PreStudy": true,
    "C1D1": true,
    "C2D1": true,
    "C3D1": true,
    "C4D1": true,
    "C5D1": true,
    "OffStudy": true
  },
  {
    "Procedure": "CBC w/diff, plts, PSA",
    "PreStudy": true,
    "C1D1": true,
    "C2D1": true,
    "C3D1": true,
    "C4D1": true,
    "C5D1": true,
    "OffStudy": true
  },
  {
    "Procedure": "Serum chemistry",
    "PreStudy": true,
    "C1D1": true,
    "C2D1": true,
    "C3D1": true,
    "C4D1": true,
    "C5D1": true,
    "OffStudy": true
  },
  {
    "Procedure": "Serum PAP",
    "PreStudy": true,
    "C1D1": null,
    "C2D1": null,
    "C3D1": null,
    "C4D1": null,
    "C5D1": null,
    "OffStudy": null
  },
  {
    "Procedure": "PKs",
    "PreStudy": true,
    "C1D1": true,
    "C2D1": true,
    "C3D1": true,
    "C4D1": true,
    "C5D1": true,
    "OffStudy": null
  },
  {
    "Procedure": "Serum Troponin T",
    "PreStudy": true,
    "C1D1": true,
    "C2D1": true,
    "C3D1": true,
    "C4D1": true,
    "C5D1": true,
    "OffStudy": null
  },
  {
    "Procedure": "TSH, free T4",
    "PreStudy": true,
    "C1D1": true,
    "C2D1": true,
    "C3D1": null,
    "C4D1": null,
    "C5D1": null,
    "OffStudy": null
  },
  {
    "Procedure": "EKG",
    "PreStudy": true,
    "C1D1": true,
    "C2D1": true,
    "C3D1": true,
    "C4D1": true,
    "C5D1": true,
    "OffStudy": null
  },
  {
    "Procedure": "Urine dipstick for protein",
    "PreStudy": true,
    "C1D1": true,
    "C2D1": true,
    "C3D1": true,
    "C4D1": true,
    "C5D1": true,
    "OffStudy": null
  },
  {
    "Procedure": "Imaging studies",
    "PreStudy": true,
    "C1D1": null,
    "C2D1": true,
    "C3D1": null,
    "C4D1": true,
    "C5D1": null,
    "OffStudy": null
  },
  {
    "Procedure": "Biomarkers",
    "PreStudy": true,
    "C1D1": true,
    "C2D1": true,
    "C3D1": true,
    "C4D1": true,
    "C5D1": true,
    "OffStudy": true
  }
]

The AZD2171 row is incorrect, the Pre-Study column is inaccurate, and the Prednisone row is completely wrong.
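Rather than eyeballing errors like these, you can diff an extracted checkbox table against a hand-verified reference row by row. The sketch below reports each disagreeing cell; the one-row `extracted` and `reference` tables are illustrative stand-ins, not the verified ground truth for this document.

```python
def diff_rows(extracted: list, reference: list, key: str = "Procedure") -> list:
    """List cells where an extracted table disagrees with a reference table."""
    ref_by_key = {row[key]: row for row in reference}
    mismatches = []
    for row in extracted:
        ref = ref_by_key.get(row[key])
        if ref is None:
            mismatches.append(f"{row[key]}: missing from reference")
            continue
        for col, value in row.items():
            if col != key and ref.get(col) != value:
                mismatches.append(f"{row[key]}.{col}: {value!r} != {ref.get(col)!r}")
    return mismatches

# Illustrative rows (not the verified ground truth for this document)
extracted = [{"Procedure": "AZD2171", "PreStudy": True, "C1D1": True}]
reference = [{"Procedure": "AZD2171", "PreStudy": None, "C1D1": True}]
print(diff_rows(extracted, reference))  # ["AZD2171.PreStudy: True != None"]
```

Counting mismatched cells over a labeled sample gives you a per-model accuracy number, which is far easier to compare than side-by-side JSON dumps.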

Claude Sonnet 4.5:

Let’s use the following prompt:

Capture the Study Calendar table from this document and sort by each Baseline Evaluation, starting with AZD2171.
For any crossed boxes, mark them True. For non-crossed boxes, mark them NULL.
Return only a formatted JSON.

The returned JSON:

{
  "study_calendar": {
    "protocol": "NCI Protocol #: CTEP LOI 7395, Local Protocol #: 07C0059",
    "title": "A Phase II Study of AZD2171 in metastatic androgen independent prostate cancer",
    "cycle_duration": "28 days",
    "evaluations": [
      {
        "evaluation": "AZD2171",
        "pre_study": null,
        "C1D1": true,
        "C2D1": true,
        "C3D1": true,
        "C4D1": true,
        "C5D1": true,
        "off_study": null,
        "notes": "Dose as assigned. Administered orally, daily at a fixed dose. Please keep a pill diary (Appendix B). Cycles = 28 days."
      },
      {
        "evaluation": "Prednisone",
        "pre_study": null,
        "C1D1": true,
        "C2D1": true,
        "C3D1": true,
        "C4D1": true,
        "C5D1": true,
        "off_study": null,
        "notes": "Dose as assigned. Administered orally, daily at a fixed dose. Please keep a pill diary (Appendix B). Cycles = 28 days."
      },
      {
        "evaluation": "Informed consent",
        "pre_study": true,
        "C1D1": null,
        "C2D1": null,
        "C3D1": null,
        "C4D1": null,
        "C5D1": null,
        "off_study": null,
        "notes": null
      },
      {
        "evaluation": "Demographics",
        "pre_study": true,
        "C1D1": null,
        "C2D1": null,
        "C3D1": null,
        "C4D1": null,
        "C5D1": null,
        "off_study": null,
        "notes": null
      },
      {
        "evaluation": "Medical history",
        "pre_study": true,
        "C1D1": null,
        "C2D1": null,
        "C3D1": null,
        "C4D1": null,
        "C5D1": null,
        "off_study": null,
        "notes": null
      },
      {
        "evaluation": "Med List",
        "pre_study": true,
        "C1D1": true,
        "C2D1": true,
        "C3D1": true,
        "C4D1": true,
        "C5D1": true,
        "off_study": true,
        "notes": null
      },
      {
        "evaluation": "Physical exam",
        "pre_study": true,
        "C1D1": true,
        "C2D1": true,
        "C3D1": true,
        "C4D1": true,
        "C5D1": true,
        "off_study": true,
        "notes": null
      },
      {
        "evaluation": "Vital signs",
        "pre_study": true,
        "C1D1": true,
        "C2D1": true,
        "C3D1": true,
        "C4D1": true,
        "C5D1": true,
        "off_study": true,
        "notes": "Blood pressure will be measured at each clinic visit. In addition, patients will be asked to measure their blood pressure twice daily at home, as follows: a resting morning blood pressure and an evening blood pressure. Any abnormal blood pressure measurements should be reported to the research nurse within 24 hours. Furthermore, the patient will be asked to record blood pressure readings in a blood pressure diary (Appendix C)."
      },
      {
        "evaluation": "Height",
        "pre_study": true,
        "C1D1": null,
        "C2D1": null,
        "C3D1": null,
        "C4D1": null,
        "C5D1": null,
        "off_study": null,
        "notes": null
      },
      {
        "evaluation": "Weight",
        "pre_study": true,
        "C1D1": true,
        "C2D1": true,
        "C3D1": true,
        "C4D1": true,
        "C5D1": true,
        "off_study": true,
        "notes": null
      },
      {
        "evaluation": "Performance status",
        "pre_study": true,
        "C1D1": true,
        "C2D1": true,
        "C3D1": true,
        "C4D1": true,
        "C5D1": true,
        "off_study": true,
        "notes": null
      },
      {
        "evaluation": "CBC w/diff, plts, PSA",
        "pre_study": true,
        "C1D1": true,
        "C2D1": true,
        "C3D1": true,
        "C4D1": true,
        "C5D1": true,
        "off_study": true,
        "notes": null
      },
      {
        "evaluation": "Serum chemistry",
        "pre_study": true,
        "C1D1": true,
        "C2D1": true,
        "C3D1": true,
        "C4D1": true,
        "C5D1": true,
        "off_study": true,
        "notes": "Albumin, alkaline phosphatase, total bilirubin, bicarbonate, BUN, calcium, chloride, creatinine, glucose, LDH, phosphorus, potassium, total protein, SGOT[AST], SGPT[ALT], sodium."
      },
      {
        "evaluation": "Serum PAP",
        "pre_study": true,
        "C1D1": null,
        "C2D1": null,
        "C3D1": null,
        "C4D1": null,
        "C5D1": null,
        "off_study": null,
        "notes": null
      },
      {
        "evaluation": "PKs",
        "pre_study": true,
        "C1D1": true,
        "C2D1": true,
        "C3D1": true,
        "C4D1": true,
        "C5D1": true,
        "off_study": true,
        "notes": "See Section 8.3 for details as to pharmacokinetic timepoints to be drawn Pre-study. At the beginning of each cycle, a trough level will be drawn at each clinic visit prior to ingesting the first dose of drug for that cycle."
      },
      {
        "evaluation": "Serum Troponin T",
        "pre_study": true,
        "C1D1": true,
        "C2D1": true,
        "C3D1": true,
        "C4D1": true,
        "C5D1": true,
        "off_study": true,
        "notes": "Serum Troponin is to be obtained pre-study and once a cycle thereafter."
      },
      {
        "evaluation": "TSH, free T4",
        "pre_study": true,
        "C1D1": true,
        "C2D1": true,
        "C3D1": null,
        "C4D1": null,
        "C5D1": null,
        "off_study": null,
        "notes": "TSH and free T4 is to be obtained pre-study, once a cycle for the first two cycles, then only if clinically indicated."
      },
      {
        "evaluation": "EKG",
        "pre_study": true,
        "C1D1": true,
        "C2D1": true,
        "C3D1": true,
        "C4D1": true,
        "C5D1": true,
        "off_study": true,
        "notes": null
      },
      {
        "evaluation": "Urine dipstick for protein",
        "pre_study": true,
        "C1D1": true,
        "C2D1": true,
        "C3D1": true,
        "C4D1": true,
        "C5D1": true,
        "off_study": true,
        "notes": "If patient has significant proteinuria, obtain a 24-hour urine for protein and creatinine clearance."
      },
      {
        "evaluation": "Imaging studies",
        "pre_study": true,
        "C1D1": null,
        "C2D1": true,
        "C3D1": null,
        "C4D1": true,
        "C5D1": null,
        "off_study": true,
        "notes": "CT, Bone scan, Echo, DCE-MRI and/or 18 FDG-PET. See Section 8.2.1.3 for the frequency of Imaging studies."
      },
      {
        "evaluation": "Biomarkers",
        "pre_study": true,
        "C1D1": true,
        "C2D1": true,
        "C3D1": true,
        "C4D1": true,
        "C5D1": true,
        "off_study": true,
        "notes": null
      }
    ],
    "general_notes": "Baseline evaluations are to be conducted within 16 days prior to administration of protocol therapy. Scans and x-rays must be done 4 weeks prior to the start of therapy. In the event that the patient's condition is deteriorating, laboratory evaluations should be repeated within 48 hours prior to initiation of the next cycle of therapy. All patients will receive AZD2171 20 mg po daily and prednisone 10mg po daily. AZD-2171 dose reductions will be made in 5 mg increments; the lowest dose level is dose level -2, or 10 mg. All baseline and follow up evaluations can be done on the last week of the prior cycle."
  }
}

Although the output appears accurate at first glance, the “PKs” row is incorrect, as is “Serum Troponin T,” where “off_study” is wrongly marked as true.

Unstract:

Create a Prompt Studio project, add the document, and use the following prompt:

Capture the Study Calendar table from this document and sort by each Baseline Evaluation, starting with AZD2171.
For any crossed boxes, mark them True. For non-crossed boxes, mark them NULL.
Return only a formatted JSON.

This is the returned JSON:

{
  "StudyCalendar": [
    {
      "BaselineEvaluation": "AZD2171",
      "C1D1": "True",
      "C2D1": "True",
      "C3D1": "True",
      "C4D1": "True",
      "C5D1": "True",
      "OffStudy": "NULL",
      "Pre-Study": "NULL"
    },
    {
      "BaselineEvaluation": "Prednisone",
      "C1D1": "True",
      "C2D1": "True",
      "C3D1": "True",
      "C4D1": "True",
      "C5D1": "True",
      "OffStudy": "NULL",
      "Pre-Study": "NULL"
    },
    {
      "BaselineEvaluation": "Informed consent",
      "C1D1": "NULL",
      "C2D1": "NULL",
      "C3D1": "NULL",
      "C4D1": "NULL",
      "C5D1": "NULL",
      "OffStudy": "NULL",
      "Pre-Study": "True"
    },
    {
      "BaselineEvaluation": "Demographics",
      "C1D1": "NULL",
      "C2D1": "NULL",
      "C3D1": "NULL",
      "C4D1": "NULL",
      "C5D1": "NULL",
      "OffStudy": "NULL",
      "Pre-Study": "True"
    },
    {
      "BaselineEvaluation": "Medical history",
      "C1D1": "NULL",
      "C2D1": "NULL",
      "C3D1": "NULL",
      "C4D1": "NULL",
      "C5D1": "NULL",
      "OffStudy": "NULL",
      "Pre-Study": "True"
    },
    {
      "BaselineEvaluation": "Med List",
      "C1D1": "True",
      "C2D1": "True",
      "C3D1": "True",
      "C4D1": "True",
      "C5D1": "True",
      "OffStudy": "True",
      "Pre-Study": "True"
    },
    {
      "BaselineEvaluation": "Physical exam",
      "C1D1": "True",
      "C2D1": "True",
      "C3D1": "True",
      "C4D1": "True",
      "C5D1": "True",
      "OffStudy": "True",
      "Pre-Study": "True"
    },
    {
      "BaselineEvaluation": "Vital signs9",
      "C1D1": "True",
      "C2D1": "True",
      "C3D1": "True",
      "C4D1": "True",
      "C5D1": "True",
      "OffStudy": "True",
      "Pre-Study": "True"
    },
    {
      "BaselineEvaluation": "Height",
      "C1D1": "NULL",
      "C2D1": "NULL",
      "C3D1": "NULL",
      "C4D1": "NULL",
      "C5D1": "NULL",
      "OffStudy": "NULL",
      "Pre-Study": "True"
    },
    {
      "BaselineEvaluation": "Weight",
      "C1D1": "True",
      "C2D1": "True",
      "C3D1": "True",
      "C4D1": "True",
      "C5D1": "True",
      "OffStudy": "True",
      "Pre-Study": "True"
    },
    {
      "BaselineEvaluation": "Performance status",
      "C1D1": "True",
      "C2D1": "True",
      "C3D1": "True",
      "C4D1": "True",
      "C5D1": "True",
      "OffStudy": "True",
      "Pre-Study": "True"
    },
    {
      "BaselineEvaluation": "CBC w/diff, plts, PSA",
      "C1D1": "True",
      "C2D1": "True",
      "C3D1": "True",
      "C4D1": "True",
      "C5D1": "True",
      "OffStudy": "True",
      "Pre-Study": "True"
    },
    {
      "BaselineEvaluation": "Serum chemistryb",
      "C1D1": "True",
      "C2D1": "True",
      "C3D1": "True",
      "C4D1": "True",
      "C5D1": "True",
      "OffStudy": "True",
      "Pre-Study": "True"
    },
    {
      "BaselineEvaluation": "Serum PAP",
      "C1D1": "NULL",
      "C2D1": "NULL",
      "C3D1": "NULL",
      "C4D1": "NULL",
      "C5D1": "NULL",
      "OffStudy": "NULL",
      "Pre-Study": "True"
    },
    {
      "BaselineEvaluation": "PKsh",
      "C1D1": "True",
      "C2D1": "True",
      "C3D1": "True",
      "C4D1": "True",
      "C5D1": "True",
      "OffStudy": "NULL",
      "Pre-Study": "True"
    },
    {
      "BaselineEvaluation": "Serum Troponin Td",
      "C1D1": "True",
      "C2D1": "True",
      "C3D1": "True",
      "C4D1": "True",
      "C5D1": "True",
      "OffStudy": "NULL",
      "Pre-Study": "True"
    },
    {
      "BaselineEvaluation": "TSH, free T4e",
      "C1D1": "True",
      "C2D1": "True",
      "C3D1": "NULL",
      "C4D1": "NULL",
      "C5D1": "NULL",
      "OffStudy": "NULL",
      "Pre-Study": "True"
    },
    {
      "BaselineEvaluation": "EKG",
      "C1D1": "True",
      "C2D1": "True",
      "C3D1": "True",
      "C4D1": "True",
      "C5D1": "True",
      "OffStudy": "NULL",
      "Pre-Study": "True"
    },
    {
      "BaselineEvaluation": "Urine dipstick for protein",
      "C1D1": "True",
      "C2D1": "True",
      "C3D1": "True",
      "C4D1": "True",
      "C5D1": "True",
      "OffStudy": "NULL",
      "Pre-Study": "True"
    },
    {
      "BaselineEvaluation": "Imaging studies",
      "C1D1": "NULL",
      "C2D1": "True",
      "C3D1": "NULL",
      "C4D1": "True",
      "C5D1": "NULL",
      "OffStudy": "True",
      "Pre-Study": "True"
    },
    {
      "BaselineEvaluation": "Biomarkers",
      "C1D1": "True",
      "C2D1": "True",
      "C3D1": "True",
      "C4D1": "True",
      "C5D1": "True",
      "OffStudy": "True",
      "Pre-Study": "True"
    }
  ]
}

As you can see, all check marks are correctly identified and associated with the appropriate label in each row.
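One practical note: the cell values in this output are the strings "True" and "NULL" rather than native JSON booleans and nulls. If you consume the result programmatically, a small normalization pass helps; the sketch below assumes the structure shown above (the `normalize_calendar` helper is ours, not part of Unstract’s API):

```python
import json

def normalize_cell(value):
    """Map the string markers in the extracted JSON to native Python types."""
    if value == "True":
        return True
    if value == "NULL":
        return None
    return value  # leave labels such as the evaluation name untouched

def normalize_calendar(raw_json):
    """Normalize every cell of each StudyCalendar row."""
    data = json.loads(raw_json)
    return [
        {key: normalize_cell(val) for key, val in row.items()}
        for row in data["StudyCalendar"]
    ]

raw = '{"StudyCalendar": [{"BaselineEvaluation": "Height", "C1D1": "NULL", "Pre-Study": "True"}]}'
rows = normalize_calendar(raw)
print(rows[0])  # {'BaselineEvaluation': 'Height', 'C1D1': None, 'Pre-Study': True}
```

With values normalized this way, downstream checks (for example, flagging rows where "off_study" should be null) become simple boolean comparisons.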

Deep Dive: LLMWhisperer’s Unique Approach

LLMWhisperer bridges the gap between traditional OCR and LLM-based document understanding. Unlike direct LLM parsing, which often fails on complex layouts, LLMWhisperer transforms documents into LLM-ready formats that preserve relationships, maintain context, and enable accurate extraction.

Its core philosophy is simple: LLMs excel at understanding structured text, but struggle with raw visual layouts, so LLMWhisperer handles layout interpretation before the LLM ever sees the document.

Methodology

LLMWhisperer uses a pre-processing pipeline designed to maximize extraction accuracy and reliability:

Traditional OCR Extraction

  • Uses proven OCR engines to extract all text content reliably
  • Eliminates errors that occur when LLMs read documents directly

Layout Analysis

  • Identifies document structures: tables, multi-column layouts, nested elements
  • Preserves spatial relationships and logical flow

Document Format Optimization

  • Tables are converted to structured data with clear relationships
  • Multi-column layouts are separated and labeled for clarity
  • Visual elements (checkboxes, markers) are translated into explicit, machine-readable formats

By transforming documents before they reach the LLM, LLMWhisperer separates layout interpretation from text understanding, allowing each technology to do what it does best.
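To make “layout-preserving” concrete, here is a toy sketch of the idea (our own simplification, not LLMWhisperer’s actual algorithm): words with OCR bounding-box positions are placed on a monospaced grid so that table columns stay aligned in the text the LLM eventually sees:

```python
def render_layout(words, char_width=8):
    """Place OCR words on a monospaced grid so column alignment survives.

    `words` is a list of (text, x, y) tuples with pixel coordinates. This is
    a toy approximation of layout-preserving rendering, not LLMWhisperer's
    actual algorithm.
    """
    lines = {}
    for text, x, y in sorted(words, key=lambda w: (w[2], w[1])):
        col = x // char_width                       # pixel x -> character column
        lines[y] = lines.get(y, "").ljust(col) + text
    return "\n".join(lines[y] for y in sorted(lines))

# Two table rows whose second column starts at the same x offset,
# with a crossed checkbox rendered as the explicit marker "[X]":
words = [
    ("Evaluation", 0, 0), ("C1D1", 160, 0),
    ("Height",     0, 20), ("[X]", 160, 20),
]
print(render_layout(words))
```

Because the checkbox marker and the column positions survive as plain text, the LLM can associate each mark with its row and column without ever seeing the page image.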

Technical Advantages

This hybrid architecture combines OCR’s extraction reliability with LLM’s contextual understanding, enabling accurate processing of documents that challenge both traditional OCR and direct LLM parsing:

  • Improved text extraction accuracy: OCR handles the initial extraction, reducing recognition errors
  • Better handling of complex layouts: Specialized layout analysis preserves table relationships and multi-column flows better than direct AI OCR
  • Enhanced reliability: Deterministic pre-processing ensures consistent output; only the final structuring step uses LLMs, minimizing probabilistic variability

Business Benefits

This approach ensures that LLMWhisperer not only improves document understanding but also delivers operational efficiency, reliability, and cost-effective scalability, addressing the limitations of both traditional OCR and direct AI parsing:

  • Reduced hallucinations: LLMs process pre-structured, clearly formatted text, minimizing invented information
  • Cost efficiency: OCR handles visual processing cheaply, while LLMs process only smaller, structured outputs, reducing token usage and API costs
  • Scalability: Parallel OCR pre-processing and smaller LLM inputs enable enterprise-scale processing
  • Higher accuracy rates: Combines reliable extraction with intelligent structuring
  • Enterprise-ready reliability: Deterministic pipeline, robust error handling, and auditability make LLMWhisperer suitable for mission-critical applications

Visual Demonstrations

Let’s see LLMWhisperer in action. You can try it yourself in the LLMWhisperer playground.

First, let’s demonstrate its OCR capabilities using a skewed photo of a receipt:

Processing it produces the following output:

As you can see, it not only fully recognizes all the text and values but also preserves the original layout structure.

Next, let’s see how it handles a PDF of an SEC document:

Download sample →

Focusing on the table on page 33, we obtain the following output:

Here we can see how powerful the layout extraction is, as it preserves the entire table structure, allowing the full context to be used in the pipeline.

Recommendations, Enterprise Considerations, and Future Outlook

For most document processing needs, traditional OCR tools remain the clear choice. Tools like Tesseract, PaddleOCR, Azure Document Intelligence, AWS Textract, and Google Document AI excel at speed, accuracy, and cost-efficiency, processing thousands of pages per hour at pennies per page. They are well-tested in enterprise environments and optimized over decades for reliability.

Best use cases for traditional OCR:

  • Standard business documents: invoices, forms, reports, contracts with consistent formatting
  • High-volume processing needs
  • Scenarios requiring predictable, deterministic outputs

Key advantages:

  • High speed and throughput
  • Low cost per page ($0.001–$0.01)
  • Reliable and consistent extraction
  • Proven enterprise integrations and auditability

When LLMs Can Help

LLMs can outperform traditional OCR in specific edge cases. They are particularly useful for low-quality or noisy scans, recognizing handwritten text, handling complex layouts such as nested tables or multi-column formats, and processing documents that require contextual understanding.

However, AI-based OCR comes with significant trade-offs. Costs per page are 10–100 times higher than traditional OCR, processing times are slower (minutes per document versus seconds), and outputs can be inconsistent or include hallucinated information.

The recommended approach is to use LLMs strategically. Combine them with traditional OCR in a hybrid workflow, such as LLMWhisperer, where OCR handles raw text extraction first. LLMs can then focus on interpretation, structuring, and extracting relationships, leveraging their intelligence only where it adds value.
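Such a hybrid workflow can be sketched as a simple confidence-based router (illustrative only; the threshold and the `(text, confidence)` OCR result shape are our assumptions, not any specific product’s API):

```python
def process_page(page, ocr, llm, confidence_threshold=0.90):
    """Hybrid routing sketch: try OCR first, fall back to an LLM when unsure.

    `ocr` and `llm` are caller-supplied callables; the threshold and the
    OCR result shape are illustrative assumptions.
    """
    text, confidence = ocr(page)
    if confidence >= confidence_threshold:
        return text, "ocr"        # cheap, deterministic path
    return llm(page), "llm"       # expensive path for hard pages

# Stub engines standing in for real services:
clean_ocr = lambda page: ("INVOICE #1042", 0.98)
noisy_ocr = lambda page: ("1NV01CE #l04Z", 0.55)
llm_reader = lambda page: "INVOICE #1042"

print(process_page("page1.png", clean_ocr, llm_reader))   # stays on OCR
print(process_page("page2.png", noisy_ocr, llm_reader))   # escalates to LLM
```

The design point is that the LLM is invoked only on the minority of pages where cheap extraction is uncertain, which keeps both cost and variability low.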

Key Enterprise Considerations

Document Volume: High-volume workflows generally favor traditional OCR, which can process thousands of pages quickly and reliably. LLMs are better suited for small batches of complex documents where their language understanding adds value.

Accuracy Requirements: Mission-critical applications demand deterministic outputs. Traditional OCR provides reliable extraction, while LLMs can produce plausible but incorrect results, making them less predictable for high-stakes use cases.

Cost Management: OCR costs just pennies per page, whereas LLMs can cost dollars per page. Enterprises should consider the total cost of ownership (TCO), including infrastructure, maintenance, and API usage, when evaluating options.

Compliance and Audit: Combining OCR with human validation remains the gold standard for regulatory compliance. LLM outputs are more difficult to trace and validate, which can complicate audits.

Scalability: OCR scales easily, running on-premises or in the cloud. LLMs face API rate limits, higher latency, and infrastructure dependencies that can hinder large-scale deployment.

System Integration: OCR tools offer mature APIs, SDKs, and pre-built connectors for enterprise platforms. LLM APIs evolve rapidly, which can introduce breaking changes and require ongoing maintenance.

Data Security and Privacy: OCR supports deployment on-premises, in private clouds, or encrypted public clouds, giving enterprises control over sensitive data. LLM APIs typically send documents to third-party services, raising potential compliance and privacy concerns.

Total Cost of Ownership (TCO) Considerations

Processing costs differ dramatically. Traditional OCR costs just $0.001–$0.01 per page, while LLM-based OCR can range from $0.10 to $5.00 per page, depending on document complexity.

At enterprise scale, these differences compound. Processing millions of pages with LLMs can cost between $100,000 and $5,000,000, whereas OCR processing for the same volume may only cost $1,000–$10,000.
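These figures follow directly from the per-page rates; a quick back-of-the-envelope calculation using the rates quoted above makes the gap explicit:

```python
def volume_cost(pages, per_page_low, per_page_high):
    """Return the (low, high) processing-cost range for a page volume."""
    return pages * per_page_low, pages * per_page_high

pages = 1_000_000  # an enterprise-scale volume

ocr_low, ocr_high = volume_cost(pages, 0.001, 0.01)   # traditional OCR rates
llm_low, llm_high = volume_cost(pages, 0.10, 5.00)    # LLM-based OCR rates

print(f"OCR: ${ocr_low:,.0f} - ${ocr_high:,.0f}")     # $1,000 - $10,000
print(f"LLM: ${llm_low:,.0f} - ${llm_high:,.0f}")     # $100,000 - $5,000,000
```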

Infrastructure requirements also vary. OCR runs predictably on standard servers or cloud environments. LLMs, in contrast, depend on API access and reliable network connectivity, which can introduce bottlenecks.

Maintenance and support are additional considerations. OCR technology is mature, well-documented, and supported, while LLM solutions require ongoing adjustments, monitoring, and updates to maintain accuracy and reliability.

Benchmarking and Continuous Improvement

Start by selecting a representative sample of documents, including edge cases that may challenge your extraction pipeline.

Compare the extracted results against a ground truth dataset, evaluating key metrics such as accuracy (correct fields), completeness (all relevant fields extracted), reliability (consistency across multiple runs), speed (processing time per document), and cost (total per-document cost).

Use these insights to identify failure patterns, refine your pipelines, and re-test iteratively. Continuous benchmarking ensures your document processing solution remains accurate, efficient, and scalable.
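As a starting point, accuracy and completeness can be computed at the field level; the sketch below uses simple illustrative definitions (our own, to be adapted to your benchmark):

```python
def field_metrics(predicted, truth):
    """Field-level accuracy and completeness against a ground-truth record.

    Accuracy     = correct fields / fields the pipeline returned
    Completeness = fields returned / fields in the ground truth
    (Illustrative definitions; adapt them to your own benchmark.)
    """
    returned = [k for k in truth if k in predicted]
    correct = [k for k in returned if predicted[k] == truth[k]]
    accuracy = len(correct) / len(returned) if returned else 0.0
    completeness = len(returned) / len(truth) if truth else 0.0
    return accuracy, completeness

truth     = {"invoice_no": "1042", "total": "99.50", "date": "2026-01-15"}
predicted = {"invoice_no": "1042", "total": "95.50"}  # one wrong, one missing

acc, comp = field_metrics(predicted, truth)
print(f"accuracy={acc:.2f} completeness={comp:.2f}")  # accuracy=0.50 completeness=0.67
```

Running such metrics per document type, and re-running them after every pipeline change, is what turns benchmarking into continuous improvement.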

OCR Future Outlook

Traditional OCR continues to evolve, incorporating deep learning to handle challenging documents, improving table extraction and multi-column layouts, and offering better support for handwritten text and non-Latin scripts. Incremental accuracy improvements mean top OCR tools now achieve 99%+ accuracy on standard documents.

AI-based OCR is also advancing. Vision-language models may improve visual understanding, and handling of complex layouts or low-quality documents could get better. Cost reductions are possible but unlikely to match OCR-level economics, and hallucinations may persist due to the probabilistic nature of LLMs.

Hybrid approaches are emerging as the most practical solution. By combining OCR extraction with LLM interpretation, like LLMWhisperer, enterprises can leverage OCR for speed and reliability while using LLMs for intelligence and context. The future likely favors integrated pipelines rather than a full replacement of one technology with the other.

Enterprise takeaways: Choose the right tool for the task, OCR for standard, high-volume documents and LLMs for complex edge cases. Hybrid solutions provide flexibility, accuracy, and scalability. Always prioritize auditability, cost-effectiveness, and reliability for mission-critical applications.

Build AI-Ready Document Extraction: OCR Solutions for LLM Workflows

Our research shows that there is no one-size-fits-all solution for OCR. The best tool depends on your document types, processing volume, accuracy requirements, and budget. For most enterprise applications, traditional OCR remains the fastest, most reliable, and cost-effective option, delivering accurate extraction at a fraction of the cost of LLMs.

Standard business documents, such as invoices, forms, contracts, and reports, are best handled by OCR tools like Tesseract, PaddleOCR, or cloud services such as AWS Textract and Azure Document Intelligence. These tools excel at high-volume processing, predictable outputs, and auditability.

LLMs are useful in specific edge cases, such as low-quality scans, handwritten text, or complex nested structures where OCR may struggle. However, they come with trade-offs: higher cost, slower processing, and potential hallucinations or inconsistent results.

The most effective strategy is a hybrid approach. Use OCR for initial extraction to ensure speed and accuracy, and apply LLMs for interpretation, structuring, or complex layout understanding. Tools like LLMWhisperer and Unstract demonstrate that combining OCR and LLM capabilities yields better extraction quality while avoiding the pitfalls of AI OCR.

Enterprise best practices include benchmarking on representative documents, measuring accuracy, cost, and speed, and identifying cases where OCR alone falls short. LLMs should be used strategically, not as a primary OCR solution.

In summary, for most document processing tasks, OCR provides reliability and scalability, while LLMs add intelligence for complex scenarios. A hybrid workflow delivers the best balance of speed, accuracy, and contextual understanding. Try LLMWhisperer in Unstract to see how a hybrid approach can improve extraction quality on your documents.

The post AI OCR vs. AI-ready OCR: The Right Solution for Document Extraction Use Cases first appeared on Unstract.com.

Unstract – A Better, Modern Nanonets Alternative for Document Processing Automation https://unstract.com/blog/unstract-a-better-modern-nanonets-alternative/ Thu, 18 Dec 2025 05:32:34 +0000

Explore Unstract, a modern AI-native alternative to Nanonets, offering a prompt-driven, modular platform for multi-service text extraction, human-in-the-loop validation, and seamless deployment via ETL pipelines and APIs.

The post Unstract – A Better, Modern Nanonets Alternative for Document Processing Automation first appeared on Unstract.com.


Introduction

Organizations are increasingly relying on Intelligent Document Processing (IDP) and document automation tools to streamline workflows, reduce manual effort, and improve data accuracy.

From invoices and purchase orders to contracts and quality checklists, the volume and complexity of documents that companies handle are growing exponentially. As a result, tools that can quickly extract, process, and structure this information have become critical to operational efficiency.

Among the many solutions available, Nanonets is a popular choice for AI-powered OCR and data extraction. Its platform allows users to automate document workflows and extract structured data from a variety of sources. However, while Nanonets has proven useful for standard use cases, it faces notable limitations.

Users often encounter challenges with accuracy (especially on complex or handwritten documents), with flexibility in adapting to changing document formats, and with cost, which can escalate quickly under a block-based pricing model. Additionally, Nanonets’ closed, black-box system limits visibility and control over the underlying AI models, making it harder for teams to customize or optimize extraction workflows.

This article explores how Unstract, a modern and AI-native alternative, is designed to address these challenges. Unstract offers a prompt-driven, modular platform that supports multi-service text extraction, human-in-the-loop validation, and seamless deployment through ETL pipelines and APIs. Its architecture is built for flexibility, scalability, and transparency, giving organizations greater control over their document processing workflows.

About Nanonets

Nanonets is an AI-powered platform designed to automate document processing and data extraction. It combines Optical Character Recognition (OCR) with machine learning to help businesses convert unstructured documents, such as invoices, forms, contracts, and receipts, into structured, machine-readable data. By automating these workflows, Nanonets aims to reduce manual data entry, minimize errors, and accelerate business operations.

Strengths

Nanonets has several advantages that make it appealing for certain use cases:

  • Pre-built Models: The platform offers ready-to-use models for common document types, enabling quick adoption without extensive setup.
  • Integrations: Nanonets provides connections to popular tools and services, such as Google Drive, Zapier, and cloud storage solutions, allowing extracted data to flow seamlessly into existing workflows.
  • Ease of Use for Structured Documents: For relatively simple and standardized documents, Nanonets allows users to quickly set up extraction rules and obtain accurate results.

Limitations

Despite its strengths, Nanonets has several limitations that can hinder more complex or large-scale document processing initiatives:

  • Manual Template-Based Extraction: Users often need to define extraction areas by drawing borders and boxes around data fields. This approach is time-consuming, error-prone, and not easily scalable. If the document layout changes, the extraction setup must be redone.
  • Cost Scales with Usage: Nanonets uses a block-based pricing model, where every document processed counts toward your usage quota. For high-volume workflows, costs can escalate quickly.
  • Limited Control Over AI Stack: The platform operates as a closed system, giving users little flexibility to choose the underlying OCR engine, LLM, or AI models. This restricts customization and optimization.
  • Difficulty with Complex or Unstructured Documents: Documents with irregular layouts, handwritten text, or mixed content often yield inconsistent results, reducing the platform’s reliability for advanced use cases.

Overall, while Nanonets remains a strong choice for standard, structured documents, its limitations become apparent when handling complex, unstructured, or high-volume workflows. These gaps create an opportunity for a more flexible, AI-native alternative like Unstract, which is designed to address these challenges.

Trying Out Nanonets

To begin evaluating Nanonets, start by creating a free trial account. Visit the official signup page at https://app.nanonets.com/#/signup and register using your email address.

The onboarding process is quick, and after verifying your email, you’ll gain access to the Nanonets dashboard, where you can begin exploring its pre-built models, uploading documents, and testing its extraction features.

The first test document is a multi-row, multi-column table that includes checkboxes, mixed text alignment, and varying cell structures:

Test document 1 → Download

The second document is a handwritten, scanned page: Test document 2 → Download

With the documents ready, head to the Nanonets dashboard to begin the actual extraction tests.

OCR Within Nanonets

First, we will understand how Nanonets handles OCR in general. Its extraction engine is designed for straightforward, structured documents, but performance can vary significantly depending on layout complexity, handwriting, or non-standard formats.

Document With Table And Checkboxes

On the main interface, look for and select the “Instant Training Model” option. This is Nanonets’ guided setup for quickly creating a model by uploading sample documents and defining the fields you want to extract:

Inside it, select “Instant Training Model – Extract Data”, which allows you to quickly test the platform’s OCR and structured data extraction by uploading a file:

After uploading this first test document, a table with checkboxes, Nanonets automatically processes the file using its OCR and extraction models:

When we review the OCR results for the first test document, we can see that Nanonets’ initial recognition correctly identified some of the text fields and the overall table structure. It was able to detect rows, columns, and the textual content within the cells.

However, the platform completely missed the checkboxes, failing to recognize them as distinct elements or actionable fields.

As a result, additional manual corrections or template adjustments would be required to accurately capture all the data, reducing efficiency and scalability for documents that combine text with form elements like checkboxes.

Handwritten Scan

Next, we tested Nanonets using the second document, a scanned handwritten page. This document contains freeform text, mixed handwriting styles, and varying line spacing, making it a more challenging case for OCR:

The results show that Nanonets struggled to accurately recognize the handwritten content. While some of the more clearly written words were detected correctly, many characters were misread, and entire sections of the text were either partially captured or completely missed.

Extracting Data From Documents

Once the text has been successfully extracted using OCR, the next step is transforming that raw text into structured, usable data.

Document With Table And Checkboxes

To test structured data extraction in Nanonets, we created a new workflow project and uploaded one of the test documents. Nanonets requires users to manually define the fields they want to extract:

This process involves drawing borders and boxes around each data point in the document and labeling them appropriately. For simple documents, this can be manageable, but for complex tables, scanned forms, or documents with multiple sections, the task quickly becomes time-consuming and tedious.

Within the workflow project in Nanonets, you have the option to add new fields in the AI section to automate extraction for specific data points:

However, even when adding a new field, you still need to manually define a box around the field on the document and provide a label: 

The AI then uses this template to extract the same field across similar documents automatically.

Handwritten Scan

After testing the first document, we set up a new workflow for the second document, which contains handwritten, scanned text. As with the first workflow, Nanonets requires a manual setup, as each data field you want to extract must be defined with a labeled box:

For handwritten content, this process becomes even more challenging. The AI relies on the boxes to locate and interpret text, but variations in handwriting, line spacing, and scan quality can make the extraction less accurate.

After processing the documents in Nanonets, the next step is to export the extracted data. While the platform allows you to download results, the export options are somewhat limited:

Common formats include CSV, Excel, and integrations with third-party apps, but there is no direct option to export structured data as JSON.
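If you need JSON downstream, one workaround is converting the CSV export yourself. A minimal sketch using only the Python standard library (the sample export content is hypothetical):

```python
import csv
import io
import json

def csv_export_to_json(csv_text):
    """Convert a CSV export (as downloaded from the platform) into JSON records."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows, indent=2)

# A small stand-in for an exported file:
export = "field,value\ninvoice_no,1042\ntotal,99.50\n"
print(csv_export_to_json(export))
```

This adds an extra conversion step to every workflow, which is exactly the kind of friction a direct JSON export would remove.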

One of the major limitations we observed while using Nanonets is that the workflow setup is clunky and requires extensive manual tweaking.

Even after manually creating the workflow and adjusting the extraction regions, Nanonets still requires you to manually define every field you want to export. Instead of automatically detecting or suggesting structured fields based on the document content, users must name each field, map it, and ensure the correct bounding box is linked to it.

For HITL (Human-in-the-loop), the options are mostly limited to having a human review the entire document. There are no configurable ratios, thresholds, or rule-based triggers to selectively route only uncertain fields for validation.

Additionally, when trying to set up an integration using the official API documentation
(https://docs.nanonets.com/docs/integrate-via-api), we found that much of the information is outdated or inaccurate.

Observations

Working hands-on with Nanonets reveals several friction points that become apparent as soon as you move beyond simple, well-structured documents. The platform offers useful baseline automation, but its workflow quickly becomes cumbersome when dealing with real-world variability.

Manual setup is required at nearly every stage: creating fields, drawing bounding boxes, adjusting layouts, and re-training for even small changes in document structure. This slows down experimentation and makes it difficult to handle diverse file types without repeated configuration work.

Additionally, the limited Human-in-the-Loop (HITL) functionality means there’s no smooth way to validate low-confidence predictions or correct mistakes using reviewers. Instead, the user must step in manually and repeatedly to fix outputs or redefine fields.

Finally, integrations that should streamline export and automation are hindered by outdated documentation (e.g., the API integration guide), resulting in failed or inconsistent setups.

Overall, the system demands significant tweaking and rework, especially for complex layouts, handwritten documents, or documents with checkboxes, and this manual burden compounds as document formats evolve.


Introducing Unstract

Unstract is a modern, AI-native, open-source platform designed to reimagine document processing and structured data extraction. Unlike traditional IDP tools, which often rely on rigid templates, manual training, or closed proprietary models, Unstract is built from the ground up to leverage large language models (LLMs), vector databases, and modular OCR services to handle a wide range of document types: structured, semi-structured, and completely unstructured.

At its core, Unstract aims to solve the common pain points faced by users of platforms like Nanonets: the need for manual template setup, the high cost of scaling, limited control over AI models, and difficulty processing complex or handwritten documents. By providing a flexible, prompt-driven approach, Unstract enables teams to extract data efficiently and accurately without extensive setup or technical expertise.

From data extraction to validation and deployment, the platform is designed to fit seamlessly into existing IT infrastructure. Its architecture supports self-hosted deployments for organizations with strict security or compliance requirements, as well as cloud-based deployments for teams seeking speed and ease of use.

With built-in human-in-the-loop (HITL) capabilities, multiple users can review and validate extracted data by defining rules, ensuring quality and consistency at scale. The system also supports ETL pipelines and API-first deployments, making it suitable for enterprises, SaaS applications, and data-driven organizations that require real-time access to structured information from diverse document sources.

Unstract is not just a tool; it’s a complete document automation ecosystem that combines the power of modern AI, flexible integration, and user-friendly design to enable smarter, faster, and more accurate document workflows.

Core Advantages Over Nanonets

  • Prompt Studio: Extract structured data using simple English prompts, no training or template setup required. Prompts are intuitive and easily adaptable to changing document formats, making workflows faster and more scalable.
  • Multi-Service Text Extraction: Choose from specialized OCR solutions like LLMWhisperer or bring your own LLMs and vector databases. This flexibility improves accuracy and allows cost-efficient extraction tailored to your needs.
  • Flexible AI Stack: Modular design lets you select OCR engines, LLMs, embeddings, and databases. This eliminates vendor lock-in and gives full control over performance, scalability, and cost.
  • Human-in-the-Loop (HITL) Review: Low-confidence extractions can be routed to reviewers for validation, ensuring consistent, high-quality data output.
  • ETL Pipelines & Pre-Built Connectors: Automatically deliver extracted data to cloud storage, databases, or analytics platforms, enabling end-to-end workflow automation without additional engineering.
  • API-First Deployment: Deploy extraction workflows as APIs to provide real-time structured data access across teams, applications, and services.

Trying Out Unstract

To get started, visit the Unstract website and sign up for a free account. The registration process is quick and gives you immediate access to Unstract’s core tools, including Prompt Studio and LLMWhisperer.

Each new account includes a 14-day free trial and 1 million LLM tokens, providing everything you need to start building and testing your own document extraction pipelines right away.

OCR with LLMWhisperer

Unstract offers access to LLMWhisperer, a next-generation OCR engine designed to handle messy, complex, and real-world documents with far greater accuracy.

LLMWhisperer can be tested directly in the browser via the official playground at https://playground.llmwhisperer.unstract.com/.

This makes it easy to upload documents, including handwritten scans, noisy PDFs, checkbox forms, or irregular tables, and instantly see how the AI interprets them.

Document With Table And Checkboxes

By uploading this document to the LLMWhisperer playground, we can immediately see how effectively the model parses table structures, identifies checkbox states, and maintains contextual accuracy throughout the extraction: 

The table entries are accurately identified, the checkboxes are correctly recognized, and the overall structure of the document is preserved, making the extracted output clean, reliable, and ready to be used directly in downstream pipelines.

Handwritten Scan

Handwritten documents are typically the most challenging for traditional OCR systems, especially when the text varies in style, spacing, or scan quality. With LLMWhisperer, however, these limitations are significantly reduced: 

The overall layout is preserved, all handwritten text is accurately detected, and even the embedded table is recognized with its structure intact. This demonstrates once again how LLMWhisperer delivers exceptional OCR performance, enabling seamless use of the extracted data in automated pipelines.

Extracting Data From Documents

Once documents have been processed with OCR, the next step is to test workflow-based extraction in Unstract. The platform combines accuracy-enhancing tools, multi-service text extraction, and flexible deployment options to create a fully automated pipeline.

Key features include:

  • LLMChallenge: Ensures extracted data is accurate by automatically validating and correcting uncertain fields.
  • Multi-Service Text Extraction: Choose between LLMWhisperer, custom LLMs, or vector databases for optimal accuracy and cost efficiency.
  • ETL & API Integration: Easily connect extracted data to databases, analytics platforms, or deploy workflows as APIs for real-time processing.
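The idea behind LLMChallenge (a second model cross-checking the first extraction) can be sketched as a simple field-by-field comparison. This is an illustration of the concept only, not Unstract's actual implementation:

```python
def challenge(primary: dict, challenger: dict) -> dict:
    """Accept fields where two independent extractions agree; flag the rest.

    Conceptual sketch of challenge-style validation: disagreements are
    nulled out and listed so they can be retried or routed for review.
    """
    result, flagged = {}, []
    for field in sorted(primary.keys() | challenger.keys()):
        if primary.get(field) == challenger.get(field):
            result[field] = primary.get(field)
        else:
            result[field] = None
            flagged.append(field)
    return {"values": result, "flagged": flagged}

out = challenge(
    {"Case no.": "3443", "Job title": "Engineer"},
    {"Case no.": "3443", "Job title": "Enginer"},  # challenger disagrees here
)
```

Only the fields both extractions agree on survive into the final record, which is what makes this pattern effective against hallucinated values.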

Once logged in to Unstract, open Prompt Studio and create a new project for your document(s). Navigate to the Manage Documents section to upload the files you want to process. After uploading, you can define prompts that describe the exact data fields you want to extract, such as names, dates, totals, or checkbox selections.

In Unstract, prompts act as instructions for the AI, specifying what to extract and how to structure the output. This ensures results match your schema and maintain consistency across multiple documents. Unlike Nanonets’ static extraction, where the system guesses fields based on its internal model, Unstract gives you full control over extraction logic, allowing you to enforce data types, adjust output structure, and refine prompts as needed.

Document With Table And Checkboxes

For this document, we will create a generic prompt to extract the data from the table:

Retrieve the information from the table, organizing each column with the corresponding name, marked by the (). Return it as a JSON with a list of rows with the columns inside.

Running the prompt: 

You can see that even with a generic prompt, the AI correctly matches the columns, captures the values from the checkboxes, and organizes the data neatly into rows.

The extracted data is now ready for downstream processing.

Let’s take a look at the full JSON output:

[
    {
        "All other illnesses": false,
        "Case no.": "3443",
        "Date of injury or onset of illness": "24 month/day",
        "Days away from work": "32 days",
        "Describe injury or illness, parts of body affected, and object/substance that directly injured or made person ill": "First degree burns in arms yard laceration in the neck",
        "Employee's name": "Roger Smith",
        "Hearing loss": false,
        "Injury": true,
        "Job title": "Engineer",
        "On job transfer or restriction": "24 days",
        "Poisoning": false,
        "Respiratory condition": false,
        "Skin disorder": false,
        "Where the event occurred": "Dock yard"
    },
    {
        "All other illnesses": false,
        "Case no.": "8932",
        "Date of injury or onset of illness": "2 / 4 month/day",
        "Days away from work": "16 days",
        "Describe injury or illness, parts of body affected, and object/substance that directly injured or made person ill": "Fractured right leg",
        "Employee's name": "William potter",
        "Hearing loss": false,
        "Injury": true,
        "Job title": "Engineer",
        "On job transfer or restriction": "",
        "Poisoning": false,
        "Respiratory condition": true,
        "Skin disorder": false,
        "Where the event occurred": "Dock yard"
    },
    {
        "All other illnesses": false,
        "Case no.": "767",
        "Date of injury or onset of illness": "24 month/day",
        "Days away from work": "",
        "Describe injury or illness, parts of body affected, and object/substance that directly injured or made person ill": "",
        "Employee's name": "Simon Dawes",
        "Hearing loss": false,
        "Injury": true,
        "Job title": "Engineer",
        "On job transfer or restriction": "",
        "Poisoning": false,
        "Respiratory condition": false,
        "Skin disorder": false,
        "Where the event occurred": "Dock"
    }
]
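Because the output is plain JSON, downstream cleanup is easy to script. The sketch below, using a trimmed copy of the rows above, normalizes the free-text "Days away from work" values into integers:

```python
import json
import re

# Trimmed copy of the extraction output shown above.
records_json = """[
  {"Case no.": "3443", "Days away from work": "32 days", "Injury": true},
  {"Case no.": "8932", "Days away from work": "16 days", "Injury": true},
  {"Case no.": "767",  "Days away from work": "",        "Injury": true}
]"""

def days_away(value: str) -> int:
    """Parse strings like '32 days' into an integer; empty fields become 0."""
    match = re.search(r"\d+", value)
    return int(match.group()) if match else 0

records = json.loads(records_json)
total_days = sum(days_away(r["Days away from work"]) for r in records)
print(total_days)  # → 48
```

A normalization pass like this is usually all that stands between the raw extraction and a database insert.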

Human-in-the-Loop (HITL) Review

One of Unstract’s most powerful capabilities is its human-in-the-loop (HITL) review system, which lets users verify and correct uncertain extractions in real time. HITL ensures that challenging cases, such as handwritten notes, irregular or skewed tables, noisy scans, or low-confidence fields, are captured with maximum accuracy.

Within any workflow, you can seamlessly integrate HITL alongside downstream actions. For example, you can configure a database connector to automatically receive processed data while still routing selected documents for manual validation:

Unstract gives you full control over how the review process behaves. You can define what percentage of documents require manual review, set confidence thresholds, and specify validation rules for any field. The platform provides multiple configuration options, allowing teams to fine-tune their quality checks with precision.
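The kind of rule-based routing described above can be sketched as follows. The threshold, sample rate, and per-field confidences here are hypothetical values for illustration, not Unstract's internal logic:

```python
import random

def needs_review(fields: dict, *, threshold: float = 0.8,
                 sample_rate: float = 0.1, rng=None) -> bool:
    """Decide whether a document should be routed to human review.

    Routes a document if any field confidence falls below `threshold`,
    plus a random `sample_rate` fraction of documents for ongoing QA.
    """
    rng = rng or random.Random()
    if any(conf < threshold for conf in fields.values()):
        return True
    return rng.random() < sample_rate

# Hypothetical per-field confidence scores for one document:
doc = {"Employee's name": 0.97, "Days away from work": 0.62}
print(needs_review(doc, rng=random.Random(0)))  # → True (0.62 < 0.8)
```

Seeding the random generator, as above, makes the sampling behavior reproducible when testing the routing rules.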

By blending automation with targeted human oversight, Unstract’s HITL workflow delivers the ideal balance between speed and reliability, making it an excellent choice for organizations that demand both scalability and high-accuracy document processing.

Handwritten Scan

For this document, we will create a prompt to extract the table data as well.

Extract the information from the table that contains the gross weight, do extract all the columns. Return it as a JSON list with rows and inside the columns with the corresponding names.

Running the prompt: 

Even with minimal guidance, the AI correctly extracted the columns and rows, accurately recognizing the handwritten text from a slightly skewed scan.

Let’s take a look at the resulting JSON:

[
    {
        "Chargeable Weight": "30kg",
        "Commodity Item No": 3,
        "Gross Weight kg": "30kg",
        "Gross Weight lb": null,
        "Nature and Quantity of Goods": "News print Paper",
        "No. of Pieces": 3,
        "Rate Charge": "$2",
        "Total Charge": "$60"
    },
    {
        "Chargeable Weight": "30 kg",
        "Commodity Item No": null,
        "Gross Weight kg": null,
        "Gross Weight lb": "72",
        "Nature and Quantity of Goods": "Package Paper.",
        "No. of Pieces": 5,
        "Rate Charge": "$3",
        "Total Charge": "$60"
    },
    {
        "Chargeable Weight": "10kg",
        "Commodity Item No": null,
        "Gross Weight kg": null,
        "Gross Weight lb": "75",
        "Nature and Quantity of Goods": "Print Ink Solution",
        "No. of Pieces": 3,
        "Rate Charge": "$6",
        "Total Charge": "$70"
    }
]
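One practical follow-up: the extraction above mixes kg and lb gross-weight columns, so a small normalization step brings everything into kilograms. The conversion factor is standard (1 lb = 0.45359237 kg); the helper name is ours:

```python
def gross_weight_kg(row: dict):
    """Normalize the mixed 'Gross Weight kg' / 'Gross Weight lb' columns
    from the extraction into kilograms. Returns None when neither
    column is populated."""
    kg, lb = row.get("Gross Weight kg"), row.get("Gross Weight lb")
    if kg:
        return float(str(kg).lower().replace("kg", "").strip())
    if lb:
        return round(float(lb) * 0.45359237, 2)
    return None

# Two rows mirroring the JSON above: one reported in kg, one in lb.
rows = [
    {"Gross Weight kg": "30kg", "Gross Weight lb": None},
    {"Gross Weight kg": None, "Gross Weight lb": "72"},
]
print([gross_weight_kg(r) for r in rows])  # → [30.0, 32.66]
```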

Deploying as an API

Deploying to an API is simple: just click the “Deploy as API” button in Prompt Studio. Follow the short, two-step wizard, and the API is live. No manual configuration or additional settings are required: 
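Once deployed, the endpoint can be called like any REST API. The sketch below only assembles the request pieces; the URL shape, header, and parameter names are placeholders, so substitute the endpoint and API key shown in your own deployment wizard:

```python
# Illustrative only: the host, path, and field names below are placeholders,
# not Unstract's documented endpoint format.

def build_extraction_request(base_url: str, api_name: str, api_key: str,
                             file_path: str) -> dict:
    """Assemble the URL, auth header, and file payload for a deployed
    extraction API, without sending anything."""
    return {
        "url": f"{base_url.rstrip('/')}/{api_name}/",
        "headers": {"Authorization": f"Bearer {api_key}"},
        "files": {"files": file_path},
    }

req = build_extraction_request(
    "https://example.invalid/deployment/api/my-org",  # placeholder host
    "injury-report-extractor",                        # placeholder API name
    "UNSTRACT_API_KEY",                               # placeholder key
    "report.pdf",
)
# e.g. with the requests library:
# requests.post(req["url"], headers=req["headers"],
#               files={"files": open("report.pdf", "rb")})
```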

Observations

Testing Unstract with complex documents highlights several practical advantages over traditional template-based systems:

  • Faster Iteration: Prompt-based extraction allows for quick adjustments and immediate reprocessing of documents, eliminating the need to redraw boxes or retrain models.
  • Reduced Manual Work: Multi-service OCR, schema-driven prompts, and human-in-the-loop validation minimize repetitive tasks, freeing up time for more valuable work.
  • Greater Control: Users have full control over extraction logic, data schema, and workflow integration, enabling more accurate and consistent results across varied document types.

Overall, Unstract streamlines document processing, improves accuracy, and provides a flexible, scalable solution for enterprise workflows.

Nanonets vs. Unstract: Key Differences

When choosing a document processing platform, it’s important to look beyond basic OCR capabilities and consider factors like workflow flexibility, accuracy, and integration options.

Below is a detailed comparison between Nanonets and Unstract:

Feature | Nanonets | Unstract
Extraction Method | Template-based; requires manually drawing borders and boxes around fields | Prompt Studio: extract structured data using simple English prompts, no manual template setup
OCR Accuracy | Limited performance with complex, handwritten, or noisy documents | Multi-service OCR, including LLMWhisperer, ensuring higher accuracy even on challenging inputs
Structured Data Extraction | Manual setup, retraining needed for new formats | Instant extraction using adjustable prompts; adapts easily to new document layouts
Human-in-the-Loop (HITL) | Minimal support for review | Full-featured HITL for validation and quality assurance of uncertain extractions
Flexibility / AI Stack | Closed system, limited control over AI models | Modular design: bring your own LLMs, OCR engines, and vector databases
Cost Model | Usage-based (block pricing) | Open-source/self-host or cloud deployment → predictable and potentially lower cost
ETL & Deployment | Basic integrations | Full ETL pipelines, API-first deployment, and pre-built connectors for servers and analytics platforms
Transparency | Black-box AI models | Full visibility into extraction logic and workflow configurations
Handling Complex Documents | Struggles with long, unstructured, or intricate formats | Handles long, unstructured, and highly complex documents with ease

Additional Differentiators:

  • Scalable Workflows: No need to redraw extraction boxes for new documents or formats
  • Privacy & Compliance: Self-hosted options make it better suited for regulated or sensitive data
  • Future-Proof Architecture: Easily integrates with LLMs and vector databases for advanced document automation

Unstract’s modern, AI-native architecture clearly addresses the limitations of traditional template-based platforms like Nanonets, making it a more accurate, flexible, and scalable solution for enterprise document processing.

Unstract, a better Nanonets alternative: What’s next?

Unstract offers a modern, flexible, and AI-native alternative to Nanonets, designed to address the limitations of traditional document processing platforms.

By combining prompt-based extraction, multi-service OCR, and LLM integration, it enables organizations to process complex documents more accurately and efficiently.

The platform’s human-in-the-loop validation ensures high-quality data, while its support for full ETL pipelines and API deployment allows seamless integration into existing workflows and enterprise systems.

For teams looking to streamline document automation, reduce manual effort, and gain full control over their AI-powered extraction workflows, Unstract represents a compelling choice.

By testing Unstract with your own complex documents, you can experience firsthand how its flexible, scalable, and cost-efficient architecture outperforms Nanonets and sets a new standard for intelligent document processing.

Unstract, A better Nanonets Alternative: Related topics to explore

  1. Extract PDF Data to Excel Spreadsheet with Unstract
  2. High Accuracy OCR for extracting data from invoice
  3. Guide to extracting data from contracts
  4. Best OCR for parsing accounts payable documents

The post Unstract – A Better, Modern Nanonets Alternative for Document Processing Automation first appeared on Unstract.com.

Unstract: A Better Reducto Alternative for Document Processing Automation https://unstract.com/blog/reducto-alternative-unstract-vs-reducto/?utm_source=rss&utm_medium=rss&utm_campaign=reducto-alternative-unstract-vs-reducto Fri, 14 Nov 2025 04:49:33 +0000 https://unstract.com/?p=16712

Hands-on comparison of Reducto and Unstract, showing how both tools perform on real-world documents. It demonstrates that while Reducto is simple and quick to start with, Unstract delivers far higher accuracy, control, and reliability.

The post Unstract: A Better Reducto Alternative for Document Processing Automation first appeared on Unstract.com.


Introduction

In the ever-evolving landscape of AI-powered document processing, businesses are increasingly turning to automation tools to extract structured data from PDFs, scanned documents, and handwritten forms. The goal is simple: to reduce manual effort while improving accuracy, efficiency, and data usability across workflows.

Among the new players in this space is Reducto AI, a Y Combinator–backed startup that has quickly gained visibility for its use of vision-language models (VLMs) to extract data from unstructured documents. While its approach emphasizes simplicity and quick onboarding, Reducto represents just one step in the broader evolution of intelligent document understanding.

On the other end of the spectrum is Unstract, an open-source, enterprise-ready platform built for complete end-to-end ETL (Extract, Transform, Load) workflows. Unlike black-box systems, Unstract offers full transparency and modularity, allowing users to choose their own OCR engines, LLMs, and data pipelines. It’s designed not just to extract data but to integrate, transform, and validate it at scale, making it ideal for teams that need flexibility, accuracy, and control.

This article takes a practical, hands-on look at both tools in action. We’ll start by walking through Reducto’s setup and testing a few sample documents, then explore how Unstract approaches the same challenges, comparing their performance, flexibility, and readiness for real-world enterprise use.


About Reducto

Reducto AI is an advanced, AI-powered document processing platform built to help businesses automatically extract structured data from PDFs, images, and scanned documents. It aims to remove the manual effort involved in data entry, form parsing, and text extraction by leveraging vision-language models (VLMs), deep learning systems trained to understand both the textual content and the visual structure of documents.

Reducto can interpret both the content and layout of documents—detecting tables, headers, signatures, and multi-column formats—allowing it to handle structured, semi-structured, and fully unstructured files. But ultimately, accuracy is what determines the value of any extraction system, and this post examines how reliably Reducto performs in real-world scenarios.

Platform Overview

Reducto offers two primary ways for users to interact with its capabilities:

No-Code Web Application: A browser-based interface designed for non-technical users.

  • Users can upload PDFs or images directly, review extracted data in real-time, and export results as JSON or CSV.
  • The interface provides automatic field detection and labelling, allowing quick validation of extracted information.
  • It’s particularly useful for teams wanting to experiment or run smaller-scale extractions without any coding knowledge.

Developer-Focused API: Built for developers who want to integrate Reducto’s extraction engine into their own systems or products.

  • The REST API supports asynchronous document uploads and result retrieval.
  • Developers can embed Reducto in internal dashboards, CRM systems, or ETL pipelines.
  • The API documentation is well-organized, with example requests and prebuilt SDKs in Python and JavaScript for faster integration.

This dual-interface approach makes Reducto accessible to both technical and non-technical audiences.

Features

Reducto’s feature set focuses on simplicity, automation, and speed to deployment. Some of its standout functionalities include:

Automatic Schema Generation: Reducto automatically identifies fields and data types (e.g., “invoice number,” “total,” “date”) by analysing the uploaded document. This helps users get quick results without defining field templates manually.

Built-in OCR: The platform includes integrated Optical Character Recognition, allowing it to extract text even from scanned or low-quality images. Users don’t need to connect third-party OCR tools, which simplifies the workflow.

Batch Document Processing: Reducto supports multi-file uploads, enabling parallel document processing. This is useful for organizations processing large volumes of similar files, such as receipts or purchase orders.

Adaptive Vision-Language Models: The system uses large vision-language models that can understand the context and visual structure of a document simultaneously. This makes it capable of extracting meaningful data even from unstructured layouts.

Simple Pricing Model: Reducto follows a transparent pay-as-you-go model, typically based on document or page count, making it accessible for small teams and startups experimenting with AI-based document extraction.

Limitations and Design Philosophy

While Reducto offers impressive out-of-the-box automation, it’s built as a closed, managed platform, meaning users do not have visibility into or control over the underlying models, prompts, or intermediate processing steps.

This “black-box” approach prioritizes ease of use and quick setup but limits the flexibility needed for custom integrations, data governance, or enterprise-grade ETL pipelines. Teams looking for fine-grained control over prompts, models, or intermediate steps may find the platform restrictive.


Trying Out Reducto

Before diving into feature comparisons, it’s important to experience Reducto firsthand. In this section, we’ll walk through the process of getting started with the platform, from registration to testing a few sample documents.

This hands-on approach helps illustrate how Reducto’s vision-language model–based extraction works in practice and what kind of results users can expect.

Registration

You can register for Reducto AI by visiting https://studio.reducto.ai/.

After verifying your email, you’ll gain access to the dashboard, where you can explore several available options and workflows:

In this article, we’ll focus specifically on the Parse and Extract features.

Parsing Documents

This functionality is designed to simulate how a human would interpret and extract information from a real-world form.

To evaluate the Parse feature, we’ll begin with a PDF of a commercial insurance application containing handwritten text, checkboxes, and various field types:

Uniform Underwriting and Transmittal Summary (Form 1008)

In the Reducto Studio dashboard, click on Parse and give your pipeline a name.

Next, upload a test document, in this case, the PDF of the Universal Underwriting and Transmittal Summary:

Clicking Run will process the document and return the following results:

Results of parsing:

This document has a highly complex layout — it spans three to four columns and includes tables, text fields, and multiple checkboxes. Reducto performs fairly well in parsing this structure: it accurately recognizes most text elements and identifies checkboxes across different layout sections.

However, recognition alone isn’t enough. The real question is accuracy. In this case, Reducto misses certain fields. For example:

  • The AUS Recommendation text field was not detected.
  • Two checkbox fields — “Community Lending/Affordable Housing Initiative:” and “Home Buyers/Homeownership Education Certificate in file:” — were incorrectly identified, with the system failing to mark the checked boxes correctly.

Overall, while Reducto handles complex layouts competently, its accuracy is not perfect in this particular complex extraction use case.

Reducto clearly missed certain fields and could not parse checkboxes in certain instances

Parsing with Unstract’s LLMWhisperer Pre-processor

When you compare the same document’s extraction with Unstract’s LLMWhisperer, it captures every element—text, checkboxes, tables—with 100% accuracy, including fields that Reducto misses.

The key advantage is its layout-preserving OCR: LLMWhisperer retains the original structure of complex, multi-layout documents, where meaning often depends on how information is arranged. By preserving layout, it preserves context, which leads to consistently higher accuracy and far more reliable outputs for any LLM-based downstream task.

LLMWhisperer accurately parses the complex document without any omission.

Extracting Data From Documents using Reducto

The Extract feature allows you to define a schema either manually or automatically from a text description, enabling the extraction of more structured data.

Example 1: Skewed Packing List Scan

Let’s now test the Extraction feature with additional examples, starting with a skewed scan of a packing list document:

We can see that it becomes confused, merging shipping line values, introducing inconsistencies, and producing entirely incorrect results:

This issue occurs across multiple records and several columns:


You can also observe these inconsistencies reflected in the JSON output:

When checking for ways to adjust or improve output quality, the available options are quite limited, mainly focused on processing figures and tables, and they require custom, highly specific prompts:


Example 2: Product datasheet with complex nested table

This document is a datasheet from Central Semiconductor Corporation for a series of Zener diodes.

Prompt Used:

“Capture the min, nom, and max voltage values of the zener voltage for each diode.

Structure in beautiful JSON.”

“Capture each of the maximum zener impedance values for each diode and its milliampere value.

Structure in beautiful JSON.”

Extraction results from Reducto:

Output for the first prompt: Reducto captured only the first 3 values.

Reducto: Incomplete extraction from complex tables

The output for the second prompt incorrectly identifies the maximum impedance and corresponding mA values by looking in the wrong places.

Reducto: Fails to perform complex table extraction


Example: Uniform Underwriting and Transmittal Summary Document


Prompts used:

Capture the borrower full name, co-borrower full name, property address, SSN of each of the borrowers, property type selected, project classification selected, occupancy status selected, and property rights selected. Also important are the number of units. The sales price. And the appraised value. 

Extraction results:

Reducto runs into issues again: the values it extracts for property type, project classification, and occupancy status are incorrect. It also indicates the sources of its information, but can cite only 5 to 10 references from the entire document.

Reducto successfully “parsed” the document, but it failed to extract data accurately during the extraction stage.


Summary Analysis of Reducto’s Document Processing Capabilities

After testing multiple document types, including a commercial insurance application, a packing list, and a mortgage form, several consistent patterns emerged in Reducto’s processing performance:

Ease of Use: The platform is intuitive and simple to get started with. Uploading documents and triggering extractions require minimal setup, making it ideal for quick trials or demonstrations.

Automatic Structuring: Reducto automatically classifies and segments documents into sections, tables, and fields without requiring user configuration. However, this classification is rigid, users cannot modify or influence how sections are defined.

Accuracy and Consistency: While Reducto performs reasonably well on straightforward, typed documents, it struggles with:

  • Handwritten text or non-standard layouts.
  • Tables containing merged or multi-line entries.
  • Checkbox fields, which are often misread or ignored.

JSON Output Quality: The resulting JSON output lacks consistency. Fields are scattered and unstructured, and there’s no way to define a custom schema for name–value pairing. This limits its usefulness for integration into structured databases or downstream systems.

Customization Limitations: Aside from a few prompt-based adjustments for tables or figures, there are no substantial options to improve output quality or guide the model toward more accurate extractions.

Overall, Reducto delivers a streamlined experience for simple extraction tasks but demonstrates notable limitations in field accuracy, schema flexibility, and data consistency. It’s a solid starting point for quick document parsing but not yet suitable for complex or enterprise-grade ETL workflows where precision, control, and customization are essential.


Introducing Unstract

Unstract is an open-source, enterprise-ready platform designed to help organizations extract, transform, and load (ETL) data from all types of unstructured documents, including PDFs, images, scanned files, and multi-format archives. Unlike closed or single-model systems, Unstract is built on a modular, transparent, and LLM-agnostic architecture, giving users full control over every part of the document processing pipeline.

At its core, Unstract’s mission is to democratize document intelligence, enabling developers, analysts, and enterprises to design highly customized extraction workflows without being tied to a single model or vendor. Every stage in the pipeline, from ingestion and OCR to schema creation and validation, is configurable, auditable, and extendable.

Getting Started With Unstract: Best AI-Powered Document Data Extractor

Modular Architecture

Unstract’s plug-and-play architecture allows users to mix and match components to fit their needs:

  • OCR Providers: LLMWhisperer, Unstructured.io, LlamaParse, Tesseract, or custom engines.
  • Large Language Models (LLMs): OpenAI, Anthropic, Mistral, Claude, or local models via Ollama.
  • Vector Databases: Pinecone, Milvus, FAISS, and others for semantic search or RAG pipelines.
  • Connectors: AWS S3, PostgreSQL, BigQuery, Snowflake, Redshift, and REST APIs for data ingestion and export.

This modularity empowers teams to create precise, cost-efficient pipelines for use cases ranging from invoice extraction and contract analysis to large-scale document ETL workflows.

Deployment Flexibility

Unstract supports a variety of deployment modes, making it adaptable for teams of any size and compliance level:

  • Cloud (SaaS): Fully managed, ideal for quick setup and experimentation.
  • Local (Desktop): Perfect for developers running small extractions or local testing.
  • Self-Hosted (Enterprise): Deployable on private clouds, Kubernetes, or on-prem environments, providing maximum control, security, and compliance.
  • MCP Server and n8n Nodes: For embedding Unstract into agent tooling and workflow automation platforms.

This flexibility makes Unstract equally suitable for startups building fast prototypes and for large enterprises in finance, healthcare, or government that require strict data governance.

Unified Interface

Unstract offers a cohesive, intuitive interface that combines simplicity and power:

  • A dashboard for monitoring jobs, datasets, and performance metrics.
  • A visual pipeline builder to connect OCR, LLM, and transformation steps.
  • The Prompt Studio, where users can define and test custom extraction schemas.

This design bridges no-code convenience with developer-level configurability, allowing teams to iterate quickly without writing boilerplate code.

Key Features and Advantages

Unstract is more than a document parser: it's a complete, modular ETL platform that transforms unstructured content into structured, actionable data. It combines flexibility, transparency, and enterprise-grade capabilities in one ecosystem.

End-to-End ETL Pipelines: Unstract handles the full data journey, from ingestion and extraction to validation and export. With pre-built connectors for databases, warehouses, and cloud storage (like S3 and Snowflake), it easily integrates into existing workflows. Pipelines can be orchestrated visually or automated via n8n nodes and MCP servers.

Infinite Customizability: Built to be LLM-agnostic, Unstract lets users mix and match models, OCR tools, and vector databases. Whether using OpenAI, Mistral, or local LLMs via Ollama, teams can design pipelines optimized for their cost, accuracy, and privacy needs.

Prompt Studio: The Prompt Studio allows teams to design custom extraction schemas with field-level rules. This ensures deterministic, structured outputs, a major improvement over automatically generated, less flexible schemas.

Multi-Service Text Extraction: By combining multiple tools, such as LLMWhisperer for OCR, Unstract reduces hallucinations and improves accuracy across complex layouts and scanned documents.

Enterprise-Grade Features: Designed for scale and compliance, Unstract includes human-in-the-loop validation, single-pass extraction for cost efficiency, and LLM Challenge tools to enhance precision. These make it ideal for regulated industries like finance, insurance, and healthcare.

Open Source and Community Trust: With over 6,000 GitHub stars, Unstract’s open-source foundation ensures transparency, auditability, and continuous innovation. Its active community and public roadmap make it a trustworthy long-term choice for developers and enterprises alike.


Trying out Unstract

Registration

To get started, visit the Unstract website and sign up for a free account. The registration process is quick and gives you immediate access to Unstract’s core tools, including Prompt Studio and LLMWhisperer.

Each new account includes a 14-day free trial and 1 million LLM tokens, providing everything you need to start building and testing your own document extraction pipelines right away.

Extracting Data From Documents

Once logged in, open Prompt Studio and create a new project for your document(s).

Then, go to the Manage Documents section to upload the file you want to process and extract the data from. After uploading, you can begin defining prompts that describe the exact data fields you want to extract, such as names, dates, totals, or checkbox selections.

In Unstract, prompts act as instructions for the AI; they define what to extract and how to structure the output, ensuring the results align with your schema.

This schema-driven approach contrasts sharply with Reducto’s static extraction, where the system automatically guesses fields based on its internal model and doesn’t allow users to fully adjust or refine them. With Unstract, you’re in full control, able to guide the extraction logic, enforce data types, and ensure consistency across documents.
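Once a Prompt Studio project is deployed as an API, the same extraction can be triggered programmatically. The sketch below is illustrative only: the URL, auth header, and multipart field name are assumptions for demonstration, not copied from Unstract's documentation, so check your own deployment's details before use.

```python
# Illustrative sketch of triggering a deployed Unstract extraction API.
# The URL, auth header, and "files" field name are assumptions for
# demonstration -- consult your deployment's actual settings.
from urllib import request

API_URL = "https://example.com/deployment/api/my-org/packing-list/"  # hypothetical
API_KEY = "YOUR_API_KEY"

def build_request(pdf_bytes: bytes, filename: str = "doc.pdf") -> request.Request:
    """Build a minimal multipart/form-data upload request for one document."""
    boundary = "unstract-demo-boundary"
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="files"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode() + pdf_bytes + f"\r\n--{boundary}--\r\n".encode()
    req = request.Request(API_URL, data=body, method="POST")
    req.add_header("Authorization", f"Bearer {API_KEY}")
    req.add_header("Content-Type", f"multipart/form-data; boundary={boundary}")
    return req

req = build_request(b"%PDF-1.4 ...")
# request.urlopen(req) would submit the document and return the structured result.
```

In practice a client library such as `requests` would handle the multipart encoding; the point is simply that a deployed prompt becomes a callable endpoint returning schema-conformant JSON.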


Example: Skewed Packing List Scan

For the skewed scan of a packing list document, let’s define a simple yet generic prompt designed to generate detailed and structured extraction results.

Extract the list of items from the table with the associated columns

This produces the following output:

Unstract: Returning a clean, well-structured, and accurate dataset.

And the formatted JSON:

[
    {
        "Dimensions Inches (L W H)": "12 12 12",
        "Item Description, Sales Order No., Customer PO No.": "Print packaging, 23445",
        "Per package gross weight LBS.": 10,
        "Quantity": 10,
        "Shipment Line No.": 12,
        "Shipped Packaging Type": "Box"
    },
    {
        "Dimensions Inches (L W H)": "16 16 16",
        "Item Description, Sales Order No., Customer PO No.": "Print packaging, 345232",
        "Per package gross weight LBS.": 20,
        "Quantity": 10,
        "Shipment Line No.": 13,
        "Shipped Packaging Type": "Box"
    },
    {
        "Dimensions Inches (L W H)": "10 8 8",
        "Item Description, Sales Order No., Customer PO No.": "Black Ink cartridges, 342900",
        "Per package gross weight LBS.": 15,
        "Quantity": 20,
        "Shipment Line No.": 14,
        "Shipped Packaging Type": "Glass"
    }
]

Incredibly, even without a detailed prompt, Unstract not only extracted the data accurately (avoiding the missing columns and overlapping rows seen with Reducto) but also interpreted the content intelligently, returning a clean, well-structured dataset.
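Because the output is already structured, downstream code can consume it without any cleanup. A minimal Python sketch using the quantities and weights from the packing-list JSON above:

```python
# Compute shipment totals directly from the structured packing-list output.
# Values below are taken from the JSON shown above.
rows = [
    {"Per package gross weight LBS.": 10, "Quantity": 10},
    {"Per package gross weight LBS.": 20, "Quantity": 10},
    {"Per package gross weight LBS.": 15, "Quantity": 20},
]

total_packages = sum(r["Quantity"] for r in rows)
total_weight = sum(r["Per package gross weight LBS."] * r["Quantity"] for r in rows)

print(total_packages)  # 40
print(total_weight)    # 600
```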


Example 2: Product datasheet with complex nested table

This document is a datasheet from Central Semiconductor Corporation for a series of Zener diodes.

Prompt Used:

“Capture the min, nom, and max voltage values of the zener voltage for each diode.

Structure in beautiful JSON.”

“Capture each of the maximum zener impedance values for each diode and its milliampere value.

Structure in beautiful JSON.”

Extraction results from Unstract:

Output for the first prompt: Unstract accurately captures all voltage values and sorts them by type number.

Output for the second prompt: Unstract is again able to capture all the data by reading the correct column group, “Maximum Zener Impedance”.

Unstract: Accurate structured data extraction from documents with complex tables


Example: Uniform Underwriting and Transmittal Summary Document

Next, we’ll test another document, the mortgage application that includes multiple checkbox options spread across its pages.

Let’s design a simple prompt to extract the key mortgage information:

Extract the key mortgage information, focusing on fields such as loan type, amortization type, loan purpose and lien position.

This produces the following output:

Once again, Unstract delivers outstanding results; it accurately recognizes the checkbox selections and correctly associates them with their corresponding information.

In contrast, Reducto often misidentifies or mismatches values for several of these fields.


Summary Analysis of Unstract’s Processing

After testing Unstract with the same set of documents, the results demonstrate clear improvements in accuracy, structure, and interpretability compared to Reducto.

Accuracy and Completeness: Unstract consistently captured all relevant data fields, including those previously missed by Reducto (such as expirationDate in the insurance form and checkbox selections in the mortgage document). Its ability to interpret handwritten text, identify layout variations, and handle skewed or low-quality scans shows a more reliable understanding of document structure.

Structured and Consistent Output: Even with simple or generic prompts, Unstract produced clean and logically organized JSON outputs. Data was properly aligned under field names and grouped within the correct sections. This level of consistency reduces the need for post-processing or manual data cleaning, a major advantage for large-scale ETL pipelines.

Prompt Flexibility and Control: The use of Prompt Studio proved to be a key differentiator. Instead of relying on automatic schema generation, users can explicitly define what information to extract and how it should be represented. This schema-driven approach not only improves reliability but also allows easy adaptation to new document types without retraining or extensive setup.

Interpretation and Context Awareness: Unstract’s underlying architecture demonstrates contextual understanding: it doesn’t just extract text, but interprets the relationships between fields. For example, it successfully linked checkbox selections with their corresponding options and grouped related values in structured arrays, resulting in a more semantically accurate dataset.

Ease of Use and Efficiency: Despite its flexibility, the platform remains accessible. Setting up a pipeline, defining prompts, and obtaining structured output required minimal effort. The system’s ability to achieve high accuracy with simple prompt definitions makes it practical for both technical and non-technical users.

Unstract stands out as a robust and adaptable document processing solution. It combines high extraction accuracy, schema-level control, and contextual understanding with the transparency of an open-source platform. Whether dealing with handwritten forms, structured tables, or checkbox-heavy applications, Unstract consistently delivers precise and interpretable results, a clear step ahead of Reducto’s more limited, black-box approach.


Reducto vs. Unstract: Unstract Captures What Reducto Misses

Using Prompt Studio, we processed the same three documents to compare results and evaluate Unstract’s performance against Reducto’s. The difference was immediately clear: Unstract successfully captured key data fields that Reducto had previously missed.

By defining a precise schema and guiding the extraction through structured prompts, Unstract produced consistent and accurate outputs across all files. This highlights the advantage of a schema-driven, transparent extraction approach over fully automated, black-box systems.

Here’s a clear comparison table summarizing the extraction results between Reducto and Unstract for the same three test documents:

Reducto vs. Unstract

Document Type | Challenges | Reducto | Unstract
Scanned Skewed Packing List | Very dense, complex table | Inconsistent and incorrect extraction | 100% accurate extraction
Product datasheet | Very dense, complex table | Incomplete and incorrect extraction | 100% accurate extraction
Form 1008 Mortgage document | Complex layout with checkboxes | Inaccurate extraction | 100% accurate extraction

Across all test cases, Unstract’s Prompt Studio demonstrated greater precision, schema control, and data consistency. By allowing users to define field names, types, and validation logic, it avoided the common issues of missing, merged, or misclassified data observed with Reducto’s automatic extraction.


Unstract: A better Reducto alternative for document processing

Both Reducto and Unstract represent important steps forward in the evolution of AI-powered document understanding, but they serve very different needs.

Reducto shines in its simplicity. It’s a great choice for teams that want to get started quickly without worrying about setup or infrastructure. Startups, small businesses, and developers who need a plug-and-play OCR or document extraction tool will appreciate its minimal learning curve and intuitive interface. For quick trials, prototypes, or smaller workloads, Reducto offers a fast path to automation.

Unstract, on the other hand, is built for depth and control. It’s an enterprise-grade, open-source, and highly customizable platform designed for organizations that need scalable, reliable, and auditable ETL pipelines. With its modular design, Prompt Studio, and multi-service extraction capabilities, Unstract allows teams to fine-tune every layer of their data pipeline, from OCR to LLM orchestration and downstream integration. Whether the goal is compliance, cost optimization, or accuracy, Unstract provides the tools and transparency needed to operate at scale.


Reducto vs. Unstract Comparison: Related topics to explore

  1. Why LLMs Are Not (Yet) the Silver Bullet for Unstructured Data Processing
  2. Unstructured data ETL with Unstract & Snowflake
  3. How to Extract Data from PDF to Database (Postgres)
  4. Guide to Extracting Data from PDF Form with Unstract

The post Unstract: A Better Reducto Alternative for Document Processing Automation first appeared on Unstract.com.

]]>
Best OCR for Insurance Document Processing Automation https://unstract.com/blog/best-ocr-for-insurance-document-processing-automation/?utm_source=rss&utm_medium=rss&utm_campaign=best-ocr-for-insurance-document-processing-automation Mon, 03 Nov 2025 09:06:22 +0000 https://unstract.com/?p=16665

Modern Insurance OCR is the foundation for automating document-intensive processes like claims and underwriting. The best OCR for insurance goes beyond simple text recognition to accurately extract data from complex layouts, handwritten fields, and multilingual forms.

The post Best OCR for Insurance Document Processing Automation first appeared on Unstract.com.

]]>

Insurance OCR: An Introduction

Insurance runs on documents. Every claim form, policy application, underwriting report, and proof of coverage contains crucial data that determines whether a customer gets paid on time—or ends up stuck in a cycle of manual review. Yet, most of this information still lives inside scanned PDFs, handwritten forms, or legacy systems.

That’s where OCR insurance automation changes the game. By converting these static, unsearchable documents into machine-readable data, insurers can streamline claims processing, accelerate policy approvals, and drastically reduce human error. Instead of teams re-keying data from claim forms or verifying customer details line by line, intelligent insurance OCR systems extract everything—names, policy numbers, coverage limits, amounts claimed, and even signatures—within seconds.

The impact is profound. Claims that once took days to validate can now move through straight-through processing in minutes. Underwriters can focus on risk decisions rather than data entry. Compliance officers can automatically verify whether documents meet regulatory standards.

In short, OCR in insurance isn’t just a technology upgrade—it’s the foundation of digital transformation across the insurance value chain. From claims to underwriting and compliance, it brings speed, accuracy, and trust to processes that once depended entirely on paper.

Modern insurance organizations now combine OCR with AI, using advanced document-understanding engines like LLMWhisperer to go beyond text recognition. They achieve a new level of automation: interpreting context, understanding layouts, and intelligently mapping unstructured data into usable formats. This is the evolution of insurance OCR—from text capture to true document intelligence.

TL;DR

If you wish to skip directly to the solution section, where you can see how the LLMWhisperer OCR API handles insurance documents of any complexity — document scans, handwritten forms, skewed images, multi-language documents — click here.


Try LLMWhisperer Insurance OCR for free on the demo Playground

What is Insurance OCR?

Insurance OCR (Optical Character Recognition for the insurance industry) is the technology that converts physical or digital documents—such as scanned claim forms, invoices, or handwritten applications—into structured, editable text that computers can understand. In simple terms, it’s how insurers “read” documents automatically.

Definition and Working Principle

OCR works by analyzing the shapes and patterns of printed or handwritten characters in an image or PDF and transforming them into digital text. In the context of OCR in insurance, this means converting everything from policy declarations to accident photos or claim receipts into structured, searchable data.

For example, when a customer uploads a claim form with sections like Policy Number, Date of Loss, and Amount Claimed, an insurance OCR system scans each line, recognizes the characters, and outputs digital text—often in JSON or CSV format. This data can then be pushed directly into claims-management systems, underwriting tools, or fraud-detection engines.
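To make that concrete, here is a small sketch (field names are hypothetical) of flattening one extracted claim record from JSON into a CSV row for a claims-management import:

```python
import csv
import io
import json

# Hypothetical extraction output for one claim form (field names illustrative).
extracted = json.loads(
    '{"policy_number": "PN-48213", "date_of_loss": "2025-02-14", "amount_claimed": 1250.0}'
)

# Flatten the record into a CSV row that downstream systems can ingest.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(extracted))
writer.writeheader()
writer.writerow(extracted)
csv_text = buf.getvalue()
print(csv_text)
```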

But traditional OCR stops at recognition. It doesn’t understand meaning. That’s why modern insurance OCR solutions now combine OCR with artificial intelligence (AI) and large language models (LLMs). This combination allows systems not just to read text but to interpret it—recognizing that “Sum Insured” and “Coverage Limit” refer to the same concept, or that a handwritten “Approved” stamp means the claim has cleared internal review.

OCR in Insurance and Its Relevance

In insurance, OCR stands for Optical Character Recognition, but in practice, it represents far more than character recognition. It’s a bridge between unstructured, paper-bound information and digital workflows. The relevance of OCR in the insurance sector can’t be overstated:

  • Speed: Instant data extraction from thousands of claim forms or policy applications.
  • Accuracy: Reduced manual errors in claim amounts, policy numbers, or personal details.
  • Compliance: Easier recordkeeping and audit trails for regulatory reporting.
  • Customer Experience: Faster turnaround times for claim settlements and renewals.

In an age where insurers compete on efficiency and trust, OCR in insurance gives them a tangible edge—turning paper into intelligence, delays into decisions, and disconnected processes into seamless automation.

Traditional OCR vs. AI-Powered OCR

Traditional OCR tools read static text; they work best with clean, typed documents. But real insurance data is messy: scanned forms, handwritten fields, checkboxes, stamps, watermarks, and multiple document layouts. AI-powered insurance OCR systems solve this by combining OCR with layout detection, semantic understanding, and contextual analysis.

Here’s the difference:

Aspect | Traditional OCR | AI-Powered OCR in Insurance
Accuracy on scanned forms | Moderate; fails on noisy scans | High; uses image enhancement + context recognition
Understanding of layout | Linear text only | Preserves columns, tables, checkboxes
Language & handwriting | Limited to a few fonts | Multilingual + handwriting recognition
Output quality | Flat text | Structured, labeled JSON for direct system use
Adaptability | Template-based | Works with dynamic, unseen document formats

By merging insurance OCR technology with AI and LLM-based understanding, insurers can automatically process claim documents, identify key entities, detect anomalies, and even validate extracted data against internal databases. This hybrid approach turns OCR from a back-office utility into a strategic enabler of end-to-end automation in OCR claims and policy processing.

Why Accurate OCR Matters in Insurance Processing

In the insurance world, accuracy isn’t just important — it’s everything. A single misplaced number in a policy document can delay a claim by weeks or even cost thousands of dollars in errors. That’s why OCR in insurance has become a cornerstone of modern insurance operations. It enables insurers to process documents faster, with fewer mistakes, while maintaining full compliance with regulations.

The Role of OCR in Claims, Underwriting, and Policy Management

Think about what happens when a customer files a claim after a car accident. They upload photos, fill out a handwritten claim form, and attach invoices from the repair shop. A claims team must review each document, check the policy number, verify coverage, and calculate reimbursement. Without automation, this takes hours.

Now imagine using insurance OCR — a system that reads each document, recognizes the policy number, extracts the repair costs, and cross-checks the customer’s name and date automatically. Within seconds, the claim can be verified and queued for approval. This is the power of OCR claims automation — fewer errors, faster processing, and happier customers.

Similarly, in underwriting, OCR extracts applicant details, income proofs, and prior insurance data from submitted forms. It ensures the underwriter sees only clean, validated information rather than scanning through dozens of pages.

And in policy management, OCR helps insurers manage renewals, identify expired policies, and extract key details (like coverage limits or premium amounts) from legacy documents that were never digitized.

Common Data Challenges in Insurance Documents

Despite its potential, insurance data is messy. That’s where intelligent OCR insurance tools like Unstract’s LLMWhisperer shine — they handle real-world complexity such as:

  • Handwritten fields: Older claim forms or medical statements often contain handwritten notes like “Approved,” “Pending,” or “See attached report.” Standard OCR tools may skip or misread these, but advanced OCR captures them accurately.
  • Scanned and low-quality copies: Many policies or claims are scanned from old files or photos taken on mobile phones. Blurry images, tilted pages, and stamps make extraction difficult. Smart insurance OCR platforms deskew, enhance, and process such images for clean results.
  • Form checkboxes and radio buttons: Insurance forms often include options like “Yes / No” or “Accident / Theft / Fire.” Traditional OCR sees these as random marks, but modern insurance OCR systems identify checked boxes and return clear binary results for automation.

Impact on Compliance, Speed, and Customer Satisfaction

Accurate OCR directly influences three core insurance priorities:

  1. Compliance: Insurers operate under strict data regulations. Missing a policyholder’s consent form or misplacing a claim date can result in audit failures. OCR ensures that every piece of information — from signatures to coverage amounts — is captured and traceable.
  2. Speed: Faster document processing means faster claims settlement. Customers expect near-instant responses today. OCR claims automation helps insurers achieve that by cutting manual review time from hours to minutes.
  3. Customer Satisfaction: Imagine a customer submitting a medical claim at night and getting an automated confirmation by morning — that’s the kind of experience OCR makes possible. Accuracy drives trust, and trust drives retention.

In short, precise OCR in insurance bridges the gap between paperwork and performance — turning every document into usable, auditable, and customer-ready data.

Key Insurance Processes Enhanced by OCR

Modern insurance OCR systems go far beyond reading text — they streamline almost every major workflow in the insurance lifecycle. Below are some of the most impactful examples of insurance OCR in action.

1. Claims Processing and Form Digitization

Filing claims is one of the most document-heavy processes in insurance. From motor accident claims to health reimbursements, every submission involves multiple forms, invoices, and receipts. With OCR claims technology, insurers can automatically:

  • Extract claim IDs, policyholder names, and amounts from scanned forms.
  • Match them to internal databases to prevent duplicates.
  • Flag incomplete claims for human review.

For example, a customer uploading a medical claim with a hospital bill in PDF format no longer needs to wait for manual validation. The OCR engine identifies hospital name, admission/discharge dates, and billed amount instantly.

2. Policy Application and Underwriting Automation

Underwriting often involves verifying hundreds of pages — identity proofs, income documents, health reports, and previous policy records. OCR in insurance underwriting automates these tasks:

  • Extracts applicant details like name, address, and date of birth.
  • Parses medical reports to identify pre-existing conditions.
  • Cross-references old policy numbers for continuity.

This helps underwriters focus on decision-making rather than data entry, reducing approval time from days to hours.

3. Insurance Verification and KYC

When a customer buys a policy or renews one, verification documents such as driver’s licenses, passports, and identity proofs must be validated. Insurance OCR automatically reads and matches these IDs to ensure authenticity. It detects tampered images, extracts expiry dates, and verifies that the policyholder’s name matches across all documents — making KYC faster and more secure.
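The cross-document name check described above can be approximated with a few lines of Python. This is a simplified sketch of the idea, not Unstract's actual matching logic: normalize each extracted name, then compare with a similarity ratio.

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    # Lowercase, strip punctuation, and collapse whitespace for a fair comparison.
    return " ".join(
        "".join(c for c in name.lower() if c.isalnum() or c.isspace()).split()
    )

def names_match(a: str, b: str, threshold: float = 0.9) -> bool:
    """True when two extracted names are similar enough to count as the same person."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

# Names as they might appear on a policy form vs. a driver's license scan.
print(names_match("John A. Smith", "john a smith"))  # True
print(names_match("John A. Smith", "Jane B. Doe"))   # False
```

A production pipeline would add stricter rules (date-of-birth checks, exact ID-number matches), but the threshold comparison illustrates how the automated cross-check works.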

4. Certificates of Liability and Proof of Insurance

Companies and individuals often need to share proof of coverage or certificates of liability with partners or clients. These documents contain policy limits, effective dates, and insurer details. OCR extracts all relevant fields, ensuring that compliance teams can validate certificates without manual inspection.

For example, a construction company renewing multiple liability certificates can upload them all at once. The insurance OCR platform reads and validates them automatically, flagging any missing endorsements or expired dates.

5. Health, Auto, and Life Insurance Documentation

Each category of insurance comes with its own unique documentation challenges:

  • Health Insurance: Extracts diagnosis codes, billing details, and patient information from medical claim forms.
  • Auto Insurance: Processes handwritten accident statements, repair invoices, and police reports.
  • Life Insurance: Parses policy schedules, beneficiary lists, and premium payment records.

By automating all of these, insurance OCR ensures that insurers spend less time reading PDFs and more time serving customers.

Accurate and intelligent insurance OCR transforms every aspect of insurance operations — from OCR claims handling to policy creation, underwriting, and verification. It not only cuts costs and errors but also empowers insurers to deliver what customers value most: speed, transparency, and trust.


Introducing LLMWhisperer for Insurance OCR

In the world of OCR insurance automation, accuracy and layout integrity are everything. Insurance documents—claims, ACORD forms, policy schedules, invoices—aren’t simple paragraphs of text. They include complex tables, checkboxes, handwritten endorsements, and even multi-language entries.

LLMWhisperer by Unstract is designed precisely for this world. It’s a layout-preserving OCR engine that transforms scanned PDFs, handwritten claim forms, and Excel-based statements into clean, structured text while preserving spatial context.

Unlike traditional OCR tools that flatten data, LLMWhisperer reproduces structure. Columns stay aligned, checkboxes remain checkboxes, and even tiny footnotes keep their place. The result is a readable, layout-true output that’s ready for downstream AI models or human validation.

Get started with LLMWhisperer: Best OCR API for AI Insurance Document Workflows


Try LLMWhisperer Insurance OCR for free on the demo Playground

Why LLMWhisperer Is AI-Friendly (Not AI-Powered)

LLMWhisperer is not an AI or large language model itself — it’s the foundation that enables AI. Its job is to feed your LLM or automation engine with the cleanest, richest, and most structured text possible.

It doesn’t “guess” meaning. It preserves it.

That makes it AI-friendly: instead of raw, jumbled text, your downstream LLM receives clear, layout-preserving, token-efficient content — dramatically improving accuracy for insurance OCR, underwriting analytics, and OCR claims processing.

Think of it like this:

Traditional OCR gives your AI a messy desk full of papers.
LLMWhisperer hands it a perfectly organized binder—indexed, labeled, and easy to read.

How It Improves Accuracy and Readability for Insurance Workflows

  • Better structured input → fewer LLM hallucinations and errors in downstream parsing.
  • Consistent layout cues → AI can understand “table rows” vs. “headings” vs. “signatures.”
  • Readable text extraction → ideal for human-in-the-loop reviews and audit trails.

Example:
A rent roll or claims summary PDF with six columns, checkboxes for claim types, and handwritten notes (“approved”, “awaiting review”) is fully captured. The layout is maintained, text is clean, and coordinates are provided for precise review.

That’s why LLMWhisperer + Unstract has become the new standard for OCR in insurance—accuracy, structure, and AI readiness.

What Is LLMWhisperer?

LLMWhisperer is a general-purpose OCR and text parser optimized for unstructured and semi-structured insurance documents. It extracts not just text, but context—preserving the layout, spatial order, and document structure that AI needs to reason about meaning.

Whether it’s a handwritten health claim, a multi-page ACORD form, or an Excel-based loss report, LLMWhisperer ensures your extracted text is not just machine-readable, but also semantically ready for processing by AI and LLMs.

Features and Benefits of LLMWhisperer OCR

Below is the most comprehensive feature overview—combining the official documentation (v2.0.0) with Unstract’s real-world insurance automation focus.

1. Layout Preservation and Output Modes

Key Feature: Keeps original document structure—columns, tables, checkboxes, headers, and footers.

Parameter | Description
output_mode | Layout-preserving mode keeps spacing and alignment for visual and semantic accuracy.
mark_vertical_lines | Marks column separators to help AI interpret table boundaries.
mark_horizontal_lines | Marks table rows and section dividers for better segmentation.
add_line_nos | Adds stable line numbers for traceability and review.

Why it matters:
In insurance, layout defines meaning. For example, “Coverage Limit” and “Deductible” often sit side by side. Without layout retention, amounts can be mismatched. LLMWhisperer ensures every value stays where it belongs.

2. File Formats Supported (Comprehensive List)

LLMWhisperer supports virtually all file types used in insurance operations.

Category | Supported Formats
Word Processing | DOCX, DOC, ODT
Presentations | PPTX, PPT, ODP
Spreadsheets | XLSX, XLS, ODS
Documents & Text | PDF, TXT, CSV, JSON, TSV, XML, HTML
Images | BMP, GIF, JPEG, JPG, PNG, TIF, TIFF, WEBP

Insurance use case examples:

  • Extract data from Excel-based performance reports
  • Read scanned PDF claims or JPG inspection photos
  • Parse Word policy documents for clause analysis

3. Supported Document Types

LLMWhisperer is built to handle every document format the insurance industry uses:

  • Scanned or native PDFs
  • Photographed documents (mobile submissions)
  • PDF forms with checkboxes and radio buttons (ACORD, claim forms)
  • Handwritten documents
  • Complex tables like coverage matrices or loss runs

Example:
A low-quality, scanned “Claim for Disability Benefits” form with faint ink and ticked checkboxes still extracts perfectly—with checkmark detection and correct field mapping.

4. LLMWhisperer Modes (API Parameters)

Different insurance documents need different OCR precision levels. LLMWhisperer provides five optimized modes:

Mode | Ideal Use Case | Handwriting | Checkboxes | Languages | Notes
Native Text | Clean digital PDFs, cost-sensitive apps | ❌ | ❌ | All (Unicode) | Fastest mode
Low Cost | High-quality scans/images | Basic | ❌ | 120+ | Good for standard PDFs
High Quality | Low-quality scans, handwritten claims | ✅ | ✅ | 300+ | Uses AI/ML enhancement
Form | ACORD forms, insurance applications | ✅ | ✅ | 300+ | Detects radio buttons
Table | Financial reports, loss runs, Excel extracts | ✅ | ✅ | 300+ | Best for structured data

Example:

  • Use Form mode for ACORD applications with multiple checkbox fields.
  • Use Table mode for financial statements and premium breakdowns.
  • Use High Quality mode for handwritten or faxed medical claim forms.
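
The selection rules above can be expressed as a tiny routing helper. The mode strings follow the table; the decision order is an illustrative assumption for a claims pipeline, not an official recommendation.

```python
# Illustrative mode router following the table above; the decision order
# is an assumption, not an official recommendation.

def pick_mode(has_tables=False, has_form_fields=False,
              has_handwriting=False, is_clean_digital=False):
    if has_tables:
        return "table"         # loss runs, premium breakdowns
    if has_form_fields:
        return "form"          # ACORD applications, checkbox-heavy forms
    if has_handwriting:
        return "high_quality"  # faxed or handwritten medical claims
    if is_clean_digital:
        return "native_text"   # born-digital PDFs, fastest and cheapest
    return "low_cost"          # ordinary good-quality scans

print(pick_mode(has_form_fields=True, has_handwriting=True))  # form
```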

5. Multilingual OCR (300+ Languages)

Supports over 300 languages, including English, German, French, Spanish, Hindi, and Arabic—ideal for global insurers handling multilingual claim submissions.

Example: A German car insurance application can be processed without translation or data loss.

6. Advanced Preprocessing for Scanned Documents

LLMWhisperer includes built-in image repair and enhancement:

  • Rotation and skew compensation
  • Auto repair of corrupted PDFs
  • Median and Gaussian filtering for noise reduction
  • Contrast enhancement for faint scans

Insurance relevance: Faxed claim forms, photo captures from phones, or older scanned documents are processed cleanly without needing re-upload.

7. Table Extraction and Line Preservation

Financial and claims data often appear in tabular layouts. LLMWhisperer’s table mode reproduces lines, rows, and headers precisely, even when borders are missing.

Example: Extract “Premium,” “Coverage,” and “Deductible” columns from a PDF summary without manual template building.

8. Bounding Boxes and Spatial Context

Each text region includes x/y coordinates, enabling:

  • Visual overlays (highlight text regions)
  • Confidence visualization in dashboards
  • Integration into review UIs (click-to-view on PDF)

This feature is especially valuable for audit trails and compliance.
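
A minimal sketch of how a review UI or rules engine might use those coordinates: given text regions with x/y positions (a simplified stand-in for the actual bounding-box payload), find the value sitting on the same visual row as a label.

```python
# Simplified regions: text plus top-left x/y. The real bounding-box payload
# is richer; this shape is an assumption for illustration.

def value_right_of(label, regions, row_tolerance=5):
    """Nearest region on the same visual row, to the right of the label."""
    anchor = next(r for r in regions if r["text"] == label)
    same_row = [r for r in regions
                if abs(r["y"] - anchor["y"]) <= row_tolerance
                and r["x"] > anchor["x"]]
    return min(same_row, key=lambda r: r["x"])["text"] if same_row else None

regions = [
    {"text": "Coverage Limit", "x": 40,  "y": 120},
    {"text": "$500,000",       "x": 210, "y": 121},
    {"text": "Deductible",     "x": 40,  "y": 150},
    {"text": "$1,000",         "x": 210, "y": 149},
]
print(value_right_of("Deductible", regions))  # $1,000
```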

9. Form Element Extraction

  • Detects checkboxes and radio buttons and returns them as boolean fields.
  • Preserves field structure for form-based PDFs.

Example: “Accident Type: [✔] Fire [ ] Theft” → { "accident_type": "Fire" }
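
A sketch of that normalization step, assuming the extracted line arrives as plain text with [✔]/[ ] marks (the parsing logic here is illustrative, not LLMWhisperer's internal implementation):

```python
import re

def parse_checkbox_line(line):
    """'Accident Type: [✔] Fire [ ] Theft' -> {'accident_type': 'Fire'}"""
    field, _, options = line.partition(":")
    key = field.strip().lower().replace(" ", "_")
    # Capture each mark and its following label; keep only ticked options.
    ticked = [label.strip() for mark, label in
              re.findall(r"\[([Xx✔ ]?)\]\s*([^\[]+)", options)
              if mark.strip()]
    return {key: ticked[0] if ticked else None}

print(parse_checkbox_line("Accident Type: [✔] Fire [ ] Theft"))
# {'accident_type': 'Fire'}
```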

10. Handwriting Recognition

LLMWhisperer uses intelligent handwriting support (in High Quality, Form, and Table modes) to read short annotations, notes, or numbers.

Example: Reads an adjuster’s handwritten “Approved” on a scanned claim or a doctor’s note on a medical reimbursement form.

11. Excel Sheet and Spreadsheet Extraction

Extracts structured data directly from Excel files — no copy-pasting or CSV conversion needed. Ideal for underwriting analytics or performance reports.

Example: “Premium by Product” or “Loss Ratio by Month” sheets are parsed cleanly into structured JSON.

12. Low-Fidelity Tolerance

Handles noisy, low-resolution, or stained documents—common in legacy archives or field submissions.

Example: A faded, crumpled photo of a health insurance claim still produces usable text output, reducing rejection rates.

13. Multilingual Data Extraction (End-to-End)

Maintains alignment between original and translated text even when multiple languages appear on a single page—critical for multinational insurers.

14. Usage Metrics Dashboard

LLMWhisperer provides transparent processing metrics inside Unstract Cloud:

  • Pages processed per day
  • Error rates and success ratios
  • Throughput per project

Helps insurers monitor SLA compliance and plan scaling efficiently.

15. Secure, Private, and Self-Hosted Option

LLMWhisperer can be self-hosted on-premise, offering:

  • Full control over infrastructure
  • Guaranteed data privacy for sensitive insurance information
  • Compliance with ISO, GDPR, and regional regulations

Ideal for: Carriers, TPAs, or reinsurers handling confidential customer data.

16. Simple, Transparent Pricing

No hidden costs, no complex licensing.
LLMWhisperer follows a pay-per-page model with straightforward usage tiers—making it easy to predict OCR insurance costs per claim, policy, or monthly volume.
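
That makes cost forecasting simple arithmetic. A sketch, with a placeholder per-page rate (not a published price):

```python
def monthly_ocr_cost(pages_per_doc, docs_per_month, rate_per_page=0.01):
    """Pay-per-page estimate; the rate here is a placeholder assumption."""
    return pages_per_doc * docs_per_month * rate_per_page

print(monthly_ocr_cost(pages_per_doc=6, docs_per_month=10_000))  # 600.0
```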

Summary: Why LLMWhisperer Stands Out in Insurance OCR

Feature              Traditional OCR   LLMWhisperer OCR
Layout Preservation  ❌ Flattened       ✅ Retained with columns, boxes, and line markers
Handwriting Support  Limited           ✅ Yes (advanced)
Checkboxes / Forms   Inaccurate        ✅ Form-aware extraction
Multilingual OCR     Few languages     ✅ 300+ languages
Table Extraction     Loses structure   ✅ Table & Financial mode
Data Privacy         Cloud-only        ✅ On-premise or self-hosted
Output               Plain text        ✅ Layout-preserving, spatially tagged
Use Cases            Generic           ✅ Insurance-ready (ACORD, claims, policy, financials)

In short:

LLMWhisperer is not just another OCR.
It’s the document foundation layer that makes every insurance OCR workflow—from ocr claims to underwriting—more accurate, faster, and enterprise-ready.

It reads, preserves, and prepares every document so that your AI or workflow automation tools can finally understand them—just like a trained insurance analyst would.

How LLMWhisperer Differs from Traditional OCR Tools

Traditional OCR was built to read characters; LLMWhisperer is built to preserve meaning. That distinction is why it consistently outperforms legacy engines in ocr insurance and ocr claims workflows.

1) Intelligent layout preservation & contextual awareness

  • What legacy OCR does: Flattens pages into linear text. Columns bleed together, tables lose alignment, and labels drift away from values.
  • What LLMWhisperer does: Keeps the visual logic of the document intact.
    • Layout-preserving output_mode: Maintains columnar spacing so “Coverage Limit” stays aligned with its amount.
    • mark_vertical_lines / mark_horizontal_lines: Optional markers help LLMs infer table boundaries without converting to HTML.
    • add_line_nos: Stable line numbers for pinpoint reviews and audit references.
  • Why it matters for ocr in insurance: ACORD pages, EOBs, and coverage schedules often use tight tables and mixed fonts. Preserving layout prevents “Deductible” from being mistaken for “Copay” or “Limit.”

Everyday example: Think of sending a well-formatted spreadsheet vs. a copy-pasted wall of text. Your teammate (or LLM) will understand the spreadsheet faster and with fewer mistakes—same with LLMWhisperer’s layout-preserving output.
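
To make the spreadsheet analogy concrete: with spacing preserved, even a naive split on runs of two or more spaces recovers label/value pairs that flattened text would merge (the sample line is illustrative):

```python
import re

# A layout-preserved line keeps labels beside their values, so a naive
# column split works. Flattened text would merge these into one ambiguous run.
line = "Coverage Limit    $500,000      Deductible    $1,000"
cells = re.split(r"\s{2,}", line.strip())
pairs = dict(zip(cells[::2], cells[1::2]))
print(pairs)  # {'Coverage Limit': '$500,000', 'Deductible': '$1,000'}
```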

2) Support for multi-format & multi-lingual inputs

  • Formats: PDFs (native & scanned), images (JPEG/PNG/TIFF), Excel workbooks/sheets.
  • Document styles: Mixed content (text + images), multi-column pages, forms with checkboxes/radio buttons.
  • Languages: 300+ languages supported out of the box.
  • Why it matters for insurance ocr: Carriers and TPAs see a flood of inputs—photos from mobile phones, spreadsheet-based performance reports, German application forms, etc. One engine handles all of it consistently.

Everyday example: A claimant snaps a picture of a hospital bill on their phone. It’s skewed, shadowed, and a bit blurry. LLMWhisperer still extracts the totals, dates, and provider name reliably—and in the right places.

3) Accuracy in mixed-content documents (images, tables, handwriting)

  • Preprocessing built-in: Deskew, denoise, contrast/levels—so low-fidelity scans become readable before text extraction starts.
  • Tables that stay usable: LLMWhisperer uses spacing and markers to keep columns aligned—even with merged headers or borderless tables.
  • Form elements: Extracts checkboxes and radio buttons as normalized values (true/false or explicit option names).
  • Handwriting recognition: Short notes, approvals (“Approved”), initials, or brief diagnosis remarks are captured.
  • Bounding boxes: Emits coordinates for each text region. Your review UI can highlight exactly where a value came from; your rules engine can cross-check positions for extra assurance.

Everyday example: An auto claim packet contains a photo of the paper ACORD, a scanned repair invoice, and a typed denial letter. LLMWhisperer treats all three correctly, keeping their layouts so your ocr insurance pipeline (and LLM prompts) don’t fall apart on the mix.

Common Insurance Documents Parsed with LLMWhisperer

Below are the everyday documents where ocr in insurance must be both accurate and context-aware. For each, LLMWhisperer’s features map directly to pain points in ocr claims and policy ops.

ACORD Insurance Forms

  • Why hard: Dense tables, small fonts, many checkboxes/radio buttons, and policy/coverage fields that look similar.
  • LLMWhisperer advantage:
    • Layout-preserving mode keeps columns intact (Coverage / Limit / Deductible).
    • Checkbox and radio extraction normalizes options (e.g., Collision = true).
    • Bounding boxes enable point-and-click verification in review tools.

Certificates of Liability Insurance

  • Why hard: Frequent updates; tiny differences in carrier templates; legal text in footers.
  • Advantage:
    • add_line_nos helps auditors reference specific lines.
    • Small text in footers is preserved (no accidental omission).
    • Coordinates let compliance teams trace values to the exact certificate location.

Health Insurance Claim Forms (HCFA/UB/EOBs)

  • Why hard: Mix of codes (ICD/CPT), amounts, and long tabular sections; scanned and faxed frequently.
  • Advantage:
    • Preprocessing improves readability of faint faxes.
    • Table alignment keeps billed/allowed/paid amounts tied to the right codes.
    • Handwritten notes (e.g., “see attachment”) are captured for routing.

Personal Automobile Insurance Forms

  • Why hard: Handwritten statements, mixed attachments (photos + forms), signature fields.
  • Advantage:
    • Handwriting capture for short notes/initials.
    • Mixed-content extraction (image + text) without losing context.
    • Checkboxes (e.g., “Accident,” “Theft”) returned as clean booleans.

Applications for Individual Life Insurance

  • Why hard: Multi-page forms, sensitive PII, radio buttons for health disclosures.
  • Advantage:
    • Radio/checkbox elements become structured values; no guesswork.
    • Multilingual support handles non-English addenda.
    • On-prem option supports strict privacy requirements.

Employees’ Reports of Injury

  • Why hard: Field forms with handwritten descriptions and dates; heavily scanned/photocopied.
  • Advantage:
    • Handwriting recognition for short content (notes, descriptions).
    • Deskew + denoise recovers legibility from rough scans.
    • Bounding boxes give safety/compliance teams exact context in review.

Insurance Verification Forms

  • Why hard: Many variants across providers; small check marks; key dates (effective/expiry).
  • Advantage:
    • Form element extraction ensures “Yes/No” options are reliable data.
    • Date fields are consistently captured and remain aligned with their labels.

Claim for Disability Insurance Benefits

  • Why hard: Physician sections, handwritten notes, mixed tables with rates/durations.
  • Advantage:
    • Table extraction maintains the relationship between rate and duration.
    • Handwritten physician notes are captured for downstream LLM interpretation.
    • Coordinates help reviewers jump to the exact medical statement line.

Home Insurance Application Forms

  • Why hard: Property details with long, multi-column questionnaires; checkboxes galore.
  • Advantage:
    • Layout-preserving output stops columns from bleeding.
    • Checkboxes/radio buttons give normalized values for repair history, risk factors.
    • Line numbering aids QA teams during audits.

Proof of Insurance / Insurance ID Cards

  • Why hard: Small text, background graphics, photographed at angles.
  • Advantage:
    • Low-fidelity tolerance (angles, shadows, low resolution).
    • Coordinates let mobile apps highlight key fields (policy, VIN) to users.

If your goal is reliable ocr insurance at scale—across ocr claims, underwriting, and policy ops—LLMWhisperer provides the missing foundation: layout-true text, structured form signals, and spatial context. It’s not an AI; it’s the AI-ready substrate that lets your LLMs (or rules) deliver consistent accuracy in ocr in insurance scenarios where traditional tools struggle.

Practical Use Cases – Where LLMWhisperer Excels

Insurance documents come in every possible shape and format — scanned ACORD forms, Excel-based premium sheets, multilingual claim statements, and even handwritten policy applications. Traditional OCR tools struggle to handle this diversity, often flattening layouts, skipping checkboxes, or misreading handwriting.

LLMWhisperer was designed to solve exactly these pain points. It goes beyond basic text extraction to deliver layout-aware, form-intelligent OCR that preserves structure, recognizes handwriting, and captures even the smallest details from scanned or low-quality inputs.

In this section, we’ll explore real-world insurance scenarios where LLMWhisperer consistently outperforms traditional OCR — starting with one of the most common and complex document types: the handwritten and scanned ACORD form.

1) Handwritten & scanned ACORD form

A) Nature of the Document & Processing Challenges

The ACORD 125 Commercial Insurance Application is a scanned, mixed-format form containing typed text, handwritten fields, and checkboxes — a nightmare for traditional OCR.
Challenges include:

  • Mixed inputs: printed + handwritten values (George Simon, John Adams, address lines).
  • Complex layout: multiple sections and dense tables (Lines of Business, Attachments, Applicant Info).
  • Checkbox logic: true/false fields for coverage and entity types.
  • Merged headers & columns: easy column bleed in plain-text OCR.
  • Multiple numeric/date fields: policy numbers, premiums, effective dates.
  • Scan distortions: skew, light ink, and misalignment affecting text order.

These structural complexities often break conventional OCR — but LLMWhisperer preserves clarity, order, and accuracy across all regions.

B) Features Used (and Why)

Mode: form – Ideal for structured forms with checkboxes, radio buttons, and handwriting.

  • Detects tick marks, captures short handwritten text, normalizes booleans.

Fallback: high_quality – Adds image enhancement, de-skewing, and AI-assisted text repair for low-quality scans.

Layout Preservation – Keeps original grid/section alignment with layout_preserving, mark_vertical_lines, and mark_horizontal_lines.

Bounding Boxes – Returns spatial coordinates for review and validation.

Low-Fidelity Tolerance – Handles faint text, smudges, and light ink effortlessly.

Together, these ensure every handwritten note, checkbox, and numeric value is extracted cleanly—making LLMWhisperer ideal for insurance OCR and claims automation.

Notes on Extraction Fidelity

  • Handwritten fields (names, address, premiums) are clearly captured.
  • Checkboxes (e.g., Corporation, Trust, Attachments) are normalized to true/false.
  • Layout is perfectly preserved — each premium aligns under its correct Line of Business with no column bleed.
  • Numeric fields (premiums, deposits) parse cleanly; currency symbols and decimals remain intact.
  • Dates retain their original format, ready for normalization (e.g., ISO-8601) in post-processing.
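
The date-normalization step mentioned above is a one-liner in post-processing, assuming the usual ACORD MM/DD/YYYY format:

```python
from datetime import datetime

def to_iso(us_date):
    """Normalize an extracted MM/DD/YYYY date to ISO-8601."""
    return datetime.strptime(us_date.strip(), "%m/%d/%Y").date().isoformat()

print(to_iso("03/02/2024"))  # 2024-03-02
```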

Parsed text 

                ®                COMMERCIAL INSURANCE APPLICATION                                                  DATE (MM/DD/YYYY) 

                                            APPLICANT INFORMATION SECTION                                        03/02/2024 
                                                                                                                         NAIC CODE 
 AGENCY                                                            CARRIER 
                                                                          Fincorp          Insurance                     52678 
           Fincorp           insurance                             COMPANY POLICY OR PROGRAM NAME                    PROGRAM CODE 

                                                                            COMINS                                     23 
                                                                   POLICY NUMBER 
                                                                                7532685 
  CONTACT                                                          UNDERWRITER                      UNDERWRITER OFFICE 
  NAME:        Simon 
  PHONE                                                              John adams 
  (A/C, No. Ext): 23986582 
 FAX                                                                                   QUOTE             ISSUE POLICY     RENEW 
 (A/C, No);                                                                        [X]               [ ]              [ ] 
  E-MAIL                                                           STATUS OF           BOUND (Give Date and/or Attach Copy): 
 ADDRESS:                                                          TRANSACTION     [ ] 
                                                                                                  DATE           TIME        AM 
 CODE:                            SUBCODE:                                         [ ] CHANGE                            [ ] 
 AGENCY CUSTOMER ID:                                                               [ ] CANCEL                            [ ] PM 
  LINES OF BUSINESS 
 INDICATE LINES OF BUSINESS    PREMIUM                                     PREMIUM                                     PREMIUM 
                               $                 CYBER AND PRIVACY         $                YACHT                      $ 
 [ ] BOILER & MACHINERY                      [ ]                                        [ ] 
     BUSINESS AUTO             $ 3000            FIDUCIARY LIABILITY       $                                           $ 
 [X]                                         [ ]                                         [ ] 
     BUSINESS OWNERS           $                 GARAGE AND DEALERS        $                                           $ 
 [ ]                                         [ ] 
                                                 LIQUOR LIABILITY          $                                           $ 
 [X] COMMERCIAL GENERAL LIABILITY $ 4000     [ ] 
                               $                 MOTOR CARRIER             $ 5000                                      $ 
 [ ] COMMERCIAL INLAND MARINE                [X] 
                               $                 TRUCKERS                  $                                           $ 
 [ ] COMMERCIAL PROPERTY                     [ ]                                        [ ] 
     CRIME                     $                 UMBRELLA                  $                                           $ 
 [ ]                                         [ ]                                        [ ] 
 ATTACHMENTS 
                              PAPERS             GLASS AND SIGN SECTION                     STATEMENT / SCHEDULE OF VALUES 
 [ ] ACCOUNTS RECEIVABLE / VALUABLE          [ ]                                        [ ] 
                                                 HOTEL / MOTEL SUPPLEMENT                   STATE SUPPLEMENT (If applicable) 
 [ ] ADDITIONAL INTEREST SCHEDULE            [ ]                                        [ ] 
                               SCHEDULE          INSTALLATION / BUILDERS RISK SECTION       VACANT BUILDING SUPPLEMENT 
 [ ] ADDITIONAL PREMISES INFORMATION         [X]                                        [X] 
                                                 INTERNATIONAL LIABILITY EXPOSURE SUPPLEMENT VEHICLE SCHEDULE 
 [X] APARTMENT BUILDING SUPPLEMENT           [ ]                                        [ ] 
                                                                    EXPOSURE SUPPLEMENT 
 [ ] CONDO ASSN BYLAWS (for D&O Coverage only) [X] INTERNATIONAL PROPERTY               [ ] 
 [X] CONTRACTORS SUPPLEMENT                  [ ] LOSS SUMMARY                           [ ] 
                                                 OPEN CARGO SECTION 
 [ ] COVERAGES SCHEDULE                      [ ] 
                                                 PREMIUM PAYMENT SUPPLEMENT 
 [ ] DEALERS SECTION                         [ ] 
 [ ] DRIVER INFORMATION SCHEDULE             [ ] PROFESSIONAL LIABILITY SUPPLEMENT      [ ] 
     ELECTRONIC DATA PROCESSING SECTION          RESTAURANT / TAVERN SUPPLEMENT 
 [ ]                                         [ ]                                        [ ] 
 POLICY INFORMATION 
                                                                                                            MINIMUM         PREMIUM 
 PROPOSED EFF DATE PROPOSED EXP DATE BILLING PLAN      PAYMENT PLAN METHOD OF PAYMENT AUDIT    DEPOSIT      PREMIUM    POLICY 
                                                                                            $            $ 
                                                                     cash                     3000          200        $ 5000 
                 03/02/28        [X] DIRECT [ ] AGENCY 
 APPLICANT INFORMATION 
 NAME (First Named Insured) AND MAILING ADDRESS (including ZIP+4) GL CODE          SIC              NAICS           FEIN OR SOC SEC # 
                                                                                   5032              56382 
         George         Simon 
                                                                   BUSINESS PHONE #: 302567 

      53B, Beach Ville               Avenue                        WEBSITE ADDRESS 
                                                                                  www.aivent.com 
                                     Florida 

                                                                         SUBCHAPTER "S" CORPORATION 
 [X] CORPORATION [ ] JOINT VENTURE             [ ] NOT FOR PROFIT ORG [ ]                            [ ] 
                          NO. OF MEMBERS 
    INDIVIDUAL        LLC AND MANAGERS:            PARTNERSHIP           TRUST 
 [ ]              [ ]                          [ ]                   [X] 
 NAME (Other Named Insured) AND MAILING ADDRESS (including ZIP+4) GL CODE          SIC              NAICS           FEIN OR SOC SEC # 

                                                                   BUSINESS PHONE #: 
                                                                   WEBSITE ADDRESS 

                      JOINT VENTURE                NOT FOR PROFIT ORG    SUBCHAPTER "S" CORPORATION 
 [ ] CORPORATION [ ]                           [ ]                   [ ]                             [ ] 
                          NO. OF MEMBERS                                 TRUST 
    INDIVIDUAL        LLC AND MANAGERS:            PARTNERSHIP 
 [ ]              [ ]                          [ ]                   [ ] 
 NAME (Other Named Insured) AND MAILING ADDRESS (including ZIP+4) GL CODE          SIC             NAICS            FEIN OR SOC SEC # 

                                                                   BUSINESS PHONE #: 
                                                                   WEBSITE ADDRESS 

                     JOINT VENTURE                 NOT FOR PROFIT ORG    SUBCHAPTER "S" CORPORATION 
[ ] CORPORATION [ ]                            [ ]                   [ ]                             [ ] 
                          NO. OF MEMBERS 
    INDIVIDUAL        LLC AND MANAGERS:            PARTNERSHIP           TRUST 
[ ]              [ ]                           [ ]                   [ ] 
 ACORD 125 (2016/03)                                         Page 1 of 4        1993-2015 ACORD CORPORATION. All rights reserved. 
                                       The ACORD name and logo are registered marks of ACORD 
<<<

Why This Matters for ACORD Workflows

  • Accuracy: Structured alignment and boolean mapping ensure reliable data for compliance and rule-based triggers.
  • Auditability: Bounding boxes and line numbers enable instant source traceability—critical for regulated OCR claims.
  • Efficiency: Handwriting capture and low-fidelity tolerance minimize rework, accelerating quote, policy issue, and FNOL cycles.

2) Insurance performance reports

A) Nature & challenges 

Highly structured, multi-table financial report with:

  • Merged headers (e.g., “Single Premiums → Non-Linked / Investment-Linked”).
  • Wide, multi-column numeric data (negatives, zeros, large comma-formatted values).
  • Two parts (premiums vs. investment results) and Direct Insurers vs. Reinsurers blocks.
  • Potential row/column misalignment and header wrapping when flattened by basic OCR.

B) Features used 

  • Mode: table to preserve column fidelity in wide financial tables.
  • Layout-preserving output with mark_vertical_lines/mark_horizontal_lines + add_line_nos for review.
  • Bounding boxes for cell-level traceability in QA/HITL.
  • High-quality fallback (deskew/repair) if sourced from scans/PDF prints.
  • Numeric normalization (retain commas/negatives; optional currency tagging downstream).
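
The numeric-normalization step can be sketched as follows: keep the sign, strip thousands separators, and treat empty cells (common in the Reinsurers block) as missing:

```python
def to_number(cell):
    """'1,393,810' -> 1393810, '-7,312' -> -7312, '' -> None."""
    cell = cell.strip()
    return int(cell.replace(",", "")) if cell else None

print([to_number(c) for c in ["1,393,810", "-7,312", "0", ""]])
# [1393810, -7312, 0, None]
```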

Parsed text



         TABLE L1 : LIFE INSURANCE PROFIT AND LOSS ACCOUNT : INCOME OF SINGAPORE LIFE INSURANCE FUNDS FOR THE YEAR ENDED 31ST DECEMBER 2021 (PART I) 

DIRECT INSURERS 

                                                                                                                              ($'000) 

                                   SINGLE PREMIUMS                   OTHER PREMIUMS                  OUTWARD REINSURANCE PREMIUMS 

         COMPANIES 
                                           INVESTMENT-                                                                 INVESTMENT- 
                             NON-LINKED                      NON-LINKED       INVESTMENT-LINKED       NON-LINKED 
                                              LINKED                                                                     LINKED 

AIA SPORE                          743,546       1,393,810           3,672,810             641,912             270,272        93,629 

ALLIANZ SINGAPORE                       0             0                   0                    0                   0              0 

AXA INSURANCE                        6,219        127,946             417,655              296,107             60,184          3,516 

CHINA LIFE                           2,997            0                88,816                  0                3,015             0 

CHINA TAIPING                      214,169            0                81,808                  0               159,634            0 

ETIQA PL                            890,420         1,004             309,514                  0               313,331            0 

FRIENDS PROVIDENT                       0          74,796                475                70,138               256              0 

FWD SINGAPORE                       20,909         18,713              14,406               17,132              5,443             0 

GEG                                     0             0                   0                    0                   0              0 

GREAT EASTERN LIFE                9,103,959       853,369            4,347,403             467,983             236,495        11,325 

HSBC INSURANCE                      547,973         2,527             119,885              118,902             156,047          910 

LIFE INSURANCE CORP                  4,435            0                   2                    0                   1              0 

MANULIFE                          3,961,337       739,592            1,916,647            1,160,659            627,973            0 

NTUC INCOME                        1,688,157      223,691            2,284,594              44,815             110,542          159 

PRUDENTIAL                        2,664,942       545,455            4,654,828             528,341             197,864         7,780 

QUILTER INTERNATIONAL                   0         287,586                 0                    0                   0            233 

RAFFLES HEALTH                          0             0               100,952                  0               51,533             0 

SINGAPORE LIFE                      360,029          818             2,268,381              20,868             369,569           87 

SINGLIFE                           1,074,961          0                -7,312               27,160              4,351             0 

ST. JAMES'S PLACE                       0         252,098                 0                    0                   0              0 

SUN LIFE                            26,874            0                   0                    0               26,337             0 

SWISS LIFE                              0          34,444                 0                    0                   0           1,260 

TOKIO MARINE LIFE                   17,880         46,538             663,022               74,335             122,894           52 

TRANSAMERICA                        84,874            0                10,261                  0                6,306             0 

UTMOST WORLDWIDE                        0          56,861                 0                 22,466                 0              0 

ZURICH INTERNATIONAL                    0            172                1,164               62,263               506             13 
<<<

REINSURERS 

                                                                                                                              ($'000) 

                                   SINGLE PREMIUMS                   OTHER PREMIUMS                  OUTWARD REINSURANCE PREMIUMS 

         COMPANIES 
                                           INVESTMENT-                                                                 INVESTMENT- 
                             NON-LINKED                      NON-LINKED        INVESTMENT-LINKED      NON-LINKED 
                                              LINKED                                                                     LINKED 

ALLIANZ SE                                                                                                         0              0 

ASIA CAPITAL RE                                                                                                    0              0 

CHINA REINSURANCE                                                                                              541,177            0 

GENERAL RE                                                                                                         0              0 

MAPFRE RE                                                                                                          0              0 

MUNICH RE                                                                                                          0              0 

PACIFIC LIFE INTERNATIONAL                                                                                         0              0 

PACIFIC LIFE LIMITED                                                                                               0              0 

PARTNER RE ASIA                                                                                                 7,667             0 

RGA INTL                                                                                                       148,736            0 

SCOR RE AP                                                                                                         0              0 

SCOR SE                                                                                                            6              0 

SWISS RE ASIA                                                                                                  227,338            0 

  TABLE L1 : LIFE INSURANCE PROFIT AND LOSS ACCOUNT : INCOME OF SINGAPORE LIFE INSURANCE FUNDS FOR THE YEAR ENDED 31ST DECEMBER 2021 (PART II) 

DIRECT INSURERS 

                                                                                                               ($'000) 

                                           INVESTMENT REVENUE 
         COMPANIES                                                           INVESTMENT EXPENSES     OTHER INCOME 
                              INTEREST / REALISED GAINS UNREALISED CHANGES 

AIA SPORE                          1,452,123      924,422            -1,174,377            208,519             92,334 

ALLIANZ SINGAPORE                       0             0                   0                    0                   0 

AXA INSURANCE                       45,819        126,406             -132,948              17,737             100,040 

CHINA LIFE                          10,460          8,227               4,137               1,668                 130 

CHINA TAIPING                       15,306          1,225              -31,533              1,207                720 

ETIQA PL                            60,684         18,450             -102,016              4,559               10,419 

FRIENDS PROVIDENT                    3,063         44,329              40,781               5,033               3,232 
<<<

FWD SINGAPORE                        239            -79               -175                  1               2,157 

GEG                                    0             0                  0                   0                  0 

GREAT EASTERN LIFE                1,540,512     1,337,586         -1,317,089            148,788            67,109 

HSBC INSURANCE                     225,231       67,739            -174,197              8,890              14,975 

LIFE INSURANCE CORP                 1,861          -367               -991                133                  0 

MANULIFE                           531,773       124,885            -75,347             52,300             170,209 

NTUC INCOME                       1,162,962     1,189,788         -1,807,318            159,231             9,403 

PRUDENTIAL                         990,286     2,249,555          -1,462,115            143,634            30,044 

QUILTER INTERNATIONAL              19,538        55,726             100,635             20,836             23,847 

RAFFLES HEALTH                       187             0                  0                   0               2,514 

SINGAPORE LIFE                     181,949       201,832           -270,474             27,596              5,510 

SINGLIFE                           13,737         -2,323             -4,525               746               1,138 

ST. JAMES'S PLACE                   9,242        26,673             57,609                917               11,378 

SUN LIFE                             751           -391               -670                246                 64 

SWISS LIFE                             0             0              29,233                  0               7,205 

TOKIO MARINE LIFE                  288,121        -5,975           -908,456             -31,498             20,116 

TRANSAMERICA                       80,034         2,481            -100,031              2,588                 1 

UTMOST WORLDWIDE                     579          7,232             22,457                912               6,021 

ZURICH INTERNATIONAL                 575             0               75,638               367               6,280 

REINSURERS 

                                                                                                           ($'000) 

                                          INVESTMENT REVENUE 
         COMPANIES                                                         INVESTMENT EXPENSES   OTHER INCOME 
                             INTEREST / REALISED GAINS UNREALISED CHANGES 

ALLIANZ SE                             0             0                  0                   0                  0 

ASIA CAPITAL RE                        2             0                 -25                  0                  0 

CHINA REINSURANCE                    965           -498               132                  111              1,026 

GENERAL RE                           114             0                 13                  26                 50 

MAPFRE RE                              0             0                  0                   0                  0 

MUNICH RE                           3,886         2,284             -35,298               276               11,148 

PACIFIC LIFE INTERNATIONAL             0             0                  0                   0                  0 

PACIFIC LIFE LIMITED                2,058          339               -9,263               180                 90 

PARTNER RE ASIA                      204            14                -207                 26                  0 

RGA INTL                             593            -6                -606                 37                  0 
<<<

SCOR RE AP                            0             0                  0                   0                  0 

SCOR SE                              475            0                -383                 42                  1 

SWISS RE ASIA                        255          -710               -993                 21               1,174 

Why this helps

  • No column bleed: premiums, investment lines, and reinsurance outputs stay under the right headers.
  • Review-ready: line numbers + boxes enable cell-accurate QA.
  • LLM-friendly: stable, layout-true text lets downstream prompts compute totals, ratios (e.g., loss/expense), and compare Direct vs Reinsurer segments without re-engineering the table.
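The last bullet can be made concrete. Because each row survives extraction with its columns intact, a few lines of downstream code can pull the numeric columns out and compute derived figures. This is a minimal sketch, not part of LLMWhisperer itself; the row below is copied from the parsed output above, and the column meanings (interest, realised gains, unrealised changes, expenses, other income) are read off the table headers.

```python
def parse_amounts(line: str) -> tuple[str, list[int]]:
    """Split a fixed-layout row into a company name and its numeric columns."""
    amounts, name_parts = [], []
    for tok in line.split():
        cleaned = tok.replace(",", "")
        if cleaned.lstrip("-").isdigit():
            amounts.append(int(cleaned))
        else:
            name_parts.append(tok)
    return " ".join(name_parts), amounts

row = "AIA SPORE    1,452,123   924,422   -1,174,377   208,519   92,334"
company, cols = parse_amounts(row)
interest, realised, unrealised, expenses, other = cols
net_investment = interest + realised + unrealised - expenses
print(company, net_investment)  # AIA SPORE 993649
```

Because the layout is stable, the same parser works for every row in both the Direct Insurers and Reinsurers segments, which is what makes segment-level comparisons cheap.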

3) Photographed insurance application forms with checkboxes

1) Nature & Challenges 

  • Photographed form-style document with printed fields, checkboxes, and dense numeric values.
  • Includes camera artifacts like skew, glare, and shadows.
  • Wrapped headers/columns make text extraction alignment tricky.
  • Common OCR risks:
    • Loss of checkbox states (checked/unchecked).
    • Column bleed in premium and amount grids.
    • Misreads in small numeric, date, and ID fields.

2) Features Used

  • Mode: form – Detects checkboxes/radio buttons and short handwritten or typed fields.
  • High-Quality Fallback (high_quality) – Corrects skew, glare, and noise from phone photos.
  • Layout-Preserving Output – Maintains column structure using:
    • mark_vertical_lines
    • mark_horizontal_lines
    • add_line_nos
  • Bounding Boxes – Provides coordinates for visual validation and review.
  • Numeric/Date Normalization – Keeps symbols, decimals, and formats consistent for accurate processing.
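The features listed above map onto request parameters. The sketch below shows how they might be assembled into an HTTP payload. The parameter names (`mode`, `mark_vertical_lines`, `mark_horizontal_lines`, `add_line_nos`) are the ones named in this article; the `output_mode` value, endpoint path, and API-key header are assumptions — check the LLMWhisperer API documentation for the exact values before using this.

```python
# Hypothetical request payload for form-mode extraction; parameter names
# follow this article, everything else is an assumption.
params = {
    "mode": "form",                      # checkbox/radio + short-field detection
    "output_mode": "layout_preserving",  # assumed name for layout-true output
    "mark_vertical_lines": "true",
    "mark_horizontal_lines": "true",
    "add_line_nos": "true",
}

# With the `requests` library, the call would look roughly like:
#   resp = requests.post(f"{BASE_URL}/api/v2/whisper",   # assumed endpoint
#                        params=params,
#                        headers={"unstract-key": API_KEY},  # assumed header
#                        data=open("application.jpg", "rb"))
print(params)
```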

Parsed text 

                 ® 
                                   COMMERCIAL INSURANCE APPLICATION                                                           DATE (MM/DD/YYYY) 
                                               APPLICANT INFORMATION SECTION                                                    08/20/2023 
 AGENCY                                                                  CARRIER                                                    NAIC CODE 
  Smith Insurance Agency                                                  AlphaSure Insurance                                        A4S8F3G 
  123 Main Street, Anytown, NY, 10001                                    COMPANY POLICY OR PROGRAM NAME                         PROGRAM CODE 
                                                                          ShieldGuard                                           SGK5H9P2 
                                                                         POLICY NUMBER 
                                                                          POL6D4J1NO 
 CONTACT                                                                 UNDERWRITER                         UNDERWRITER OFFICE 
 NAME:    John Smith 
 PHONE                                                                    Veronica Lee                        New York, NY 
 (A/C, No, Ext); (555) 123-4567 
 FAX           123-4567 
 (A/C, No): (555)                                                                         [ ] QUOTE           [ ] ISSUE POLICY   [ ] RENEW 
 E-MAIL                                                                  STATUS OF 
 ADDRESS: [email protected]                                         TRANSACTION      [X] BOUND (Give Date and/or Attach Copy): 
        SIA123                                 SIA123-001                                     CHANGE       DATE            TIME         AM 
 CODE:                               SUBCODE:                                             [ ]                                       [ ] 
 AGENCY CUSTOMER ID:                                                                      [ ] CANCEL     08/19/2023     12:00       [X] PM 
 SECTIONS ATTACHED 
 INDICATE SECTIONS ATTACHED      PREMIUM                                          PREMIUM                                         PREMIUM 
    ACCOUNTS RECEIVABLE /        $ 873.05            ELECTRONIC DATA PROC         $                  TRANSPORTATION /             $ 
[X] VALUABLE PAPERS                              [ ]                                             [ ] MOTOR TRUCK CARGO 
[X] BOILER & MACHINERY           $ 364.03        [ ] EQUIPMENT FLOATER            $              [X] TRUCKERS / MOTOR CARRIER     $ 9.04 
[ ] BUSINESS AUTO                $               [X] GARAGE AND DEALERS           $ 391.00      [ ] UMBRELLA                      $ 
[ ] BUSINESS OWNERS              $               [ ] GLASS AND SIGN               $              [X] YACHT                        $ 495.02 
[ ] COMMERCIAL GENERAL LIABILITY $               [ ] INSTALLATION / BUILDERS RISK $              [ ]                              $ 
[X] CRIME / MISCELLANEOUS CRIME $ 3,394.00       [ ] OPEN CARGO                   $             [ ]                               $ 
[ ] DEALERS                      $               [ ] PROPERTY                     $              [ ]                              $ 
 ATTACHMENTS 
[ ] ADDITIONAL INTEREST                          [ ] PREMIUM PAYMENT SUPPLEMENT                 [ ] 
[ ] ADDITIONAL PREMISES                          [X] PROFESSIONAL LIABILITY SUPPLEMENT           [ ] 
[ ] APARTMENT BUILDING SUPPLEMENT                [ ] RESTAURANT / TAVERN SUPPLEMENT              [ ] 
[X] CONDO ASSN BYLAWS (for D&O Coverage only)    [ ] STATEMENT / SCHEDULE OF VALUES              [ ] 
[ ] CONTRACTORS SUPPLEMENT                       [X] STATE SUPPLEMENT (If applicable)           [ ] 
[X] COVERAGES SCHEDULE                           [ ] VACANT BUILDING SUPPLEMENT                 [ ] 
[X] DRIVER INFORMATION SCHEDULE                  [X] VEHICLE SCHEDULE                           [ ] 
[ ] INTERNATIONAL LIABILITY EXPOSURE SUPPLEMENT [ ]                                             [ ] 
[ ] INTERNATIONAL PROPERTY EXPOSURE SUPPLEMENT   [ ] 
[ ] LOSS SUMMARY                                 [ ]                                            [ ] 
 POLICY INFORMATION 
 PROPOSED EFF DATE PROPOSED EXP DATE     BILLING PLAN       PAYMENT PLAN   METHOD OF PAYMENT AUDIT      DEPOSIT       PREMIUM MINIMUM POLICY PREMIUM 
    08/10/2023       01/19/2024                            MONTH           CASH               1      $ 73,484.00   $ 634.00       $ 4,392.00 
                                   [ ] DIRECT [X] AGENCY 
 APPLICANT INFORMATION 
 NAME (First Named Insured) AND MAILING ADDRESS (including ZIP+4)        GL CODE           SIC               NAICS             FEIN OR SOC SEC # 
  Alex Johnson                                                            A7B2C8D5          3574              561730            12-3456789 
  P.O. Box 123, 123 Main Street, Anytown, NY, 10001                      BUSINESS PHONE #: (555) 123-4567 
                                                                         WEBSITE ADDRESS 

[ ] CORPORATION    [X] JOINT VENTURE               [ ] NOT FOR PROFIT ORG   [ ] SUBCHAPTER "S" CORPORATION     [ ] 
                            NO. OF MEMBERS                                      TRUST 
[ ] INDIVIDUAL     [ ] LLC AND MANAGERS:           [ ] PARTNERSHIP          [ ] 
 NAME (Other Named Insured) AND MAILING ADDRESS (including ZIP+4)        GL CODE           SIC               NAICS             FEIN OR SOC SEC # 
  Emily Smith                                                             E3F8G6H1          8742              722513            98-7654321 
  P.O. Box 456, 456 Oak Avenue, Anothercity, NY, 20002                   BUSINESS PHONE #: (555) 987-6543 
                                                                         WEBSITE ADDRESS 

[ ] CORPORATION    [ ] JOINT VENTURE                [X] NOT FOR PROFIT ORG [ ] SUBCHAPTER "S" CORPORATION      [ ] 
                            NO. OF MEMBERS 
[ ] INDIVIDUAL     [X] LLC AND MANAGERS:   10       [ ] PARTNERSHIP         [ ] TRUST 
 NAME (Other Named Insured) AND MAILING ADDRESS (including ZIP+4)        GL CODE           SIC               NAICS             FEIN OR SOC SEC # 

                                                                         BUSINESS PHONE #: 
                                                                         WEBSITE ADDRESS 

[ ] CORPORATION    [ ] JOINT VENTURE               [ ] NOT FOR PROFIT ORG   [ ] SUBCHAPTER "S" CORPORATION     [ ] 
                            NO. OF MEMBERS 
[ ] INDIVIDUAL     [ ] LLC AND MANAGERS:           [ ] PARTNERSHIP          [ ] TRUST 
 ACORD 125 (2013/01)                                              Page 1 of 4           1993-2013 ACORD CORPORATION. All rights reserved. 
                                          The ACORD name and logo are registered marks of ACORD 

Why This Helps

  • Layout and data hierarchy remain fully preserved.
  • Checkboxes are converted to booleans (true/false) for automation.
  • Currency and dates remain accurate and machine-readable.
  • Enables reliable downstream automation for OCR insurance workflows such as:
    • Claims intake
    • Underwriting validation
    • Policy issuance
  • Eliminates manual verification, improving accuracy and turnaround time.
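The checkbox-to-boolean conversion mentioned above can be approximated with a small parser over the layout-preserved text. This is a simplified sketch: it keys each boolean on the uppercase label following the marker, whereas a production ACORD parser would use the bounding boxes and column positions to pair markers with labels reliably.

```python
import re

def checkbox_states(line: str) -> dict[str, bool]:
    """Map '[X]' / '[ ]' markers in a parsed line to booleans, keyed by the
    uppercase label that follows each marker. Simplified label handling."""
    states = {}
    for mark, label in re.findall(r"\[([Xx ])\]\s*([A-Z][A-Z &/]+[A-Z])", line):
        states[label.strip()] = mark.upper() == "X"
    return states

line = "[X] BOILER & MACHINERY           $ 364.03        [ ] EQUIPMENT FLOATER"
print(checkbox_states(line))
# {'BOILER & MACHINERY': True, 'EQUIPMENT FLOATER': False}
```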

4) Coverage documents with complex tables

1) Nature & challenges 

  • Table-heavy PDF: multi-row grid with merged headers (“What You Will Pay”), nested sections (e.g., “If you have outpatient surgery”), and two pricing columns (Network vs Out-of-Network).
  • Inline notes in the far-right column (“Limitations, Exceptions…”) with long wrapped sentences and policy conditions (deductibles, penalties).
  • Mixed typography (links/underlines like “coinsurance”), line breaks inside cells, and repeated phrases (“Deductible applies.”) that typical OCRs often fragment or reorder.

2) Features used 

  • Mode: table → preserves grid structure; stabilizes row/column boundaries even for merged headers.
  • Layout-preserving output with mark_vertical_lines, mark_horizontal_lines, add_line_nos → keeps columns aligned for downstream parsing.
  • Bounding boxes → per-cell coordinates for UI highlighting or reviewer jump-to-cell.
  • Low-fidelity tolerance / repair → de-skew and denoise any scan artifacts; keeps small percentage symbols legible.
  • Line-splitting strategy + normalization → consistent rendering of “30% coinsurance”, “Not covered (mail)”, etc., while retaining original wording.

Parsed text



                                                                    What You Will Pay 
Common                                                                                                     Limitations, Exceptions, & Other 
                       Services You May Need        Network Provider          Out-Of-Network Provider 
Medical Event                                                                                              Important Information 
                                                  (You will pay the least)     (You will pay the most) 
                                                                                                           Deductible applies. 
                                                                                                           Covers up to a 30 day supply (retail), 
                                                                          50% coinsurance (retail)         covers up to a 90 day supply (mail). No 
                       Non-preferred brand drugs 50% coinsurance 
                                                                          Not covered (mail)               charge for specific preventive drugs. 
                                                                                                           Prior authorization required for some 
                                                                                                           drugs. 
                                                                                                           Deductible applies. 
                                                                                                           Covers up to a 30 day supply. Only 
                       Specialty drugs           40% coinsurance          40% coinsurance                  covered at specific contracted specialty 
                                                                                                           pharmacies. Prior authorization 
                                                                                                           required for some drugs. 
                                                                                                           Deductible applies. 
                                                                                                           Prior authorization required for some 
                       Facility fee (e.g.,       30%                      Non-Preferred: 40% coinsurance 
                                                     coinsurance                                           services. The penalty is: 50% of the 
If you have outpatient ambulatory surgery center)                         Non-Participating: 60% coinsurance allowable charge up to a maximum of 
surgery                                                                                                    $1,500 per occurrence. 
                                                                          Non-Preferred: 40% coinsurance 
                       Physician/surgeon fees    30% coinsurance                                           Deductible applies. 
                                                                          Non-Participating: 60% coinsurance 
                       Emergency room care       30% coinsurance          30% coinsurance                  Deductible applies. 
                       Emergency medical 
                                                 30% coinsurance          30% coinsurance                  Deductible applies. 
If you need immediate transportation 
medical attention                                Hospital-based: 30%      Hospital-based: 30% coinsurance Hospital-Based: Deductible applies. 
                                                 coinsurance              Freestanding center: Non-Preferred: Freestanding center: Deductible does 
                       Urgent care 
                                                 Freestanding center: $60 40% coinsurance                  not apply in-network. Deductible 
                                                 copayment                Non-Participating: 60% coinsurance applies out-of-network. 
                                                                                                           Deductible applies. 
                                                                                                           Prior authorization required for all 
                       Facility fee (e.g., hospital 30%                   Non-Preferred: 40% coinsurance 
                                                     coinsurance                                           planned inpatient stays. The penalty is: 
If you have a hospital room)                                              Non-Participating: 60% coinsurance 50% of the allowable charge up to a 
stay                                                                                                       maximum of $1,500 per occurrence. 
                                                                          Non-Preferred: 40% coinsurance 
                       Physician/surgeon fees    30% coinsurance                                           Deductible applies. 
                                                                          Non-Participating: 60% coinsurance 

                                                                                                                                        3 of 8 

Why this helps: The table structure is preserved (no column bleed), percentages and phrases like “coinsurance” remain intact, and policy notes stay bound to their originating rows—so downstream rules or LLM prompts can safely compute member liability, flag prior-auth requirements, or populate OCR insurance benefit summaries without manual clean-up.
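Computing member liability from an intact benefit cell is straightforward precisely because phrases like "30% coinsurance" and "$60 copayment" survive extraction unfragmented. The sketch below illustrates the kind of downstream rule the paragraph describes; the allowed-amount figures passed in are made up for the example.

```python
import re

def member_share(cell: str, allowed_amount: float) -> float:
    """Compute the member's share from a benefit cell such as
    '30% coinsurance' or 'Freestanding center: $60 copayment'."""
    m = re.search(r"(\d+(?:\.\d+)?)\s*%\s*coinsurance", cell)
    if m:
        return round(allowed_amount * float(m.group(1)) / 100, 2)
    m = re.search(r"\$\s*([\d,]+(?:\.\d+)?)\s*copayment", cell)
    if m:
        return float(m.group(1).replace(",", ""))
    raise ValueError(f"unrecognized benefit cell: {cell!r}")

print(member_share("30% coinsurance", 1000.0))                  # 300.0
print(member_share("Freestanding center: $60 copayment", 0.0))  # 60.0
```

If the OCR had split "30%" from "coinsurance" across columns, neither regex would fire — which is exactly the failure mode layout-preserving extraction avoids.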

5) Multilingual (e.g., German) application forms

1) Nature & challenges 

  • Bilingual layout (German/English) with dense instructions, mixed currencies (GBP/EUR/USD), and many checkboxes/radio options.
  • Small numeric tokens (dates, postcodes, phones), diacritics (e.g., “Rheinland-Pfalz”), and multi-line addresses increase OCR error risk.
  • Typical camera/scan artifacts can skew boxes and break alignment between labels and chosen options.

2) Features used 

  • Mode: form — checkbox/radio detection, short printed + handwritten fields.
  • Layout-preserving output — keep columns/option groups intact; optionally mark_vertical_lines, mark_horizontal_lines, add_line_nos for review.
  • High-quality fallback — deskew/denoise/auto-repair for photographed pages.
  • Handwriting + multilingual (300+ languages) — reliable capture of names with diacritics and German text.
  • Bounding boxes — precise field coordinates for audit UI.
  • Numeric/date normalization hooks — retain symbols; normalize downstream (ISO dates, E.164 phones) as needed.
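The normalization hooks in the last bullet can be sketched as two small helpers operating on the values captured from the German form above. The +49 country code and trunk-zero stripping are assumptions for this example; a production pipeline would derive the country code from the form's "Land/Country" field.

```python
from datetime import datetime

def to_iso_date(dd: str, mm: str, yyyy: str) -> str:
    """The form captures DD-MM-YYYY in separate boxes; emit ISO 8601."""
    return datetime(int(yyyy), int(mm), int(dd)).date().isoformat()

def to_e164_de(raw: str) -> str:
    """Strip spacing and the trunk '0', prefix the German country code.
    Assumes a German number; real pipelines should key off the Country field."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    return "+49" + digits.lstrip("0")

print(to_iso_date("21", "12", "1967"))  # 1967-12-21
print(to_e164_de("06701 15 61 36"))     # +496701156136
```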

Parsed text



   In welcher Währung möchten Sie Ihre Prämie zahlen? Ihre Versicherungsleistungen werden ebenfalls in dieser Währung erfolgen. 
   In which currency would you like to pay your premium? Your policy benefits will also be in this currency. 
   [ ] GB£      [X] Euro€      [ ] US$ 
   Wie viel Selbstbeteiligung möchten Sie übernehmen? Selbstbeteiligung gilt pro Person pro Versicherungsjahr und nicht für die Optionen Reguläre 
   Schwangerschaft und Entbindung, Zahnärztliche Behandlung, Evakuierung oder Repatriierung oder Leistungen für Wellness, Optik und Impfungen. Wählen Sie 
   eine höhere Selbstbeteiligung, um Ihre Versicherungsprämie zu reduzieren. 
   How much excess would you like to pay? Excess is per person per policy year and does not apply to Routine Pregnancy & Childbirth, Dental Treatment, Evacuation or Repatriation options or Well-being, Optical and 
   Vaccination benefits. To reduce your premium amount, choose a higher policy excess. 
   [ ] Nil                                        [X] £ 50: € 60: US$ 75                        [ ] £ 150: € 180: US$ 225                      [ ] £ 300: € 360: US$ 450 
   [ ] £ 500: € 600: US$ 750                     [X] £ 1,000: € 1,200: US$ 1,500                [ ] £ 2,500: € 3,000: US$ 3,750                [ ] £ 5,000: € 6,000: US$ 7,500 

   [ ] £ 7,500: € 9,000: US$ 11,250 

   Wie möchten Sie Ihre Prämie zahlen? Nachdem wir Ihren Antrag angenommen haben, werden wir Ihnen weitere Details zukommen lassen. 
   How would you like to pay your premium? We'll send details following acceptance of your application. 
   [X] Jährlich                                  [X] Kredit-/Bankkarte                [ ] SEPA-Lastschrift#                [ ] Banküberweisung 
       Annually                                      Credit / Debit Card                  SEPA Direct Debit                    Bank Transfer 
   [ ] Vierteljährlich                           [X] Kredit-/Bankkarte                [ ] SEPA-Lastschrift#                [ ] Banküberweisung 
       Quarterly                                     Credit / Debit Card                  SEPA Direct Debit                    Bank Transfer 
   [ ] Monatlich                                 [ ] Kredit-/Bankkarte                [ ] SEPA-Lastschrift#                [ ] Banküberweisung 
       Monthly                                       Credit / Debit Card                  SEPA Direct Debit                    Bank Transfer 
       # SEPA-Lastschriftzahlungen nur von EU/EWR-Bankkonten. 
       #SEPA Direct Debit payments from EU/EEA bank accounts only 

  2 
        Ihre Daten 

         Your details 

Angaben zum Versicherungsnehmer 
Policyholder details 
 Anrede                                                                                          Wohnanschrift 
Title 
                                                                                                 Wohnsitz (Land) - Adresse wo Sie derzeitig leben 
                                                                                                 Home address 
 [X] Herr [ ] Frau [ ] Frl.       Sonstige:                                                      (Country of Residence - address where you currently live) 
     Mr        Mrs       Miss      Other:                                                           Guentzelstrasse 23 

Vorname(n)                                                                                          Rheinland-pfalz 
First name(s) 

 Klaus 
                                                                                                  Postleitzahl: 55576                   Land 
                                                                                                                                        Country   Germany 
Nachname                                                                                          Postcode: 
Surname 
                                                                                                 Korrespondenzadresse (falls abweichend) 
 Paul                                                                                            Correspondence address (if different) 

                                                                                                    Kantstrasse 63 
Geburtsdatum (TT-MM-JJJJ)                     Geschlecht 
 Date of birth (DD-MM-YYYY)                   Gender 
                                                                                                    Bayreuth 
  21         12        1967                    Male 

 Höhe (cm/ft)                                  Gewicht (kg/lbs) 
Height (cm/ft)                                 Weight (kg/lbs)                                    PLZ:             95408                Land      Germany 
                                                                                                  Postcode:                             Country 
  187                                           87 
                                                                                                 Telefonnummern 
 Branche                                                                                         Phone numbers 
Industry 
                                                                                                  Privat:      06701 15 61 36 
                                                                                                  Home: 
  Software 
                                                                                                  Beruflich: 
                                                                                                  Work:        0921 10 29 84 
 Beruf (bitte vollständige Angaben machen) 
 Occupation (please give full details) 
                                                                                                  Mobil: 
                                                                                                  Mobile: 
  Engineer 
                                                                                                   Fax: 
 Nationalität                                                                                      Fax: 
Nationality 

  England 

 E-Mail-Adresse 
 Email address 

  [email protected] 

 Soll der Versicherungsnehmer im Rahmen                                   [X] Ja     [ ] Nein 
 dieser Police versichert werden? 
 Is the Policyholder to be insured under this policy?                        Yes         No 

Seite 2 von 9                                                                               ALC Global Health Insurance ... wir sind anders, weil wir uns um Sie kümmern. 

Insurance OCR: Conclusion

Why LLMWhisperer is the best choice for OCR insurance automation

In the insurance industry—where every checkbox, policy number, or handwritten field carries financial and legal weight—LLMWhisperer stands out as the most reliable OCR solution for document processing and automation. Its layout-preserving architecture, multi-format compatibility, and handwriting recognition capabilities make it ideal for transforming complex insurance documents—ACORD forms, claim reports, policy applications, and benefit tables—into clean, structured data.

Traditional OCR tools can read text, but they often lose context and structure. LLMWhisperer goes further—it retains the visual hierarchy, captures checkboxes and radio buttons as true data, and supports 300+ languages, ensuring insurers can handle multilingual and multi-region submissions without error. Whether it’s scanned claim forms or performance reports with merged tables, LLMWhisperer consistently delivers clarity, accuracy, and speed.

Scalability, accuracy, and compliance benefits

  • Scalable by design — Process thousands of PDFs, scans, and Excel sheets daily through APIs or self-hosted deployments.
  • Accuracy that adapts — Advanced layout modeling and noise tolerance ensure >99% text fidelity, even for low-quality scans or handwritten inputs.
  • Compliance-ready — Every output can include bounding boxes and line numbers, enabling precise audits and traceability for regulated workflows.
  • Data privacy — LLMWhisperer supports on-premise deployment, ensuring full control over infrastructure, storage, and sensitive customer data—vital for insurers bound by GDPR, HIPAA, or IRDAI guidelines.

Try LLMWhisperer Insurance OCR for free on the demo Playground.

How Unstract bridges the gap between OCR and intelligent insurance automation

While LLMWhisperer perfects text extraction, Unstract takes it a step further—transforming that raw text into meaningful, AI-driven insights. By connecting LLMWhisperer’s structured output to Prompt Studio, vector databases, and large language models, Unstract enables insurers to:

  • Parse policies, claims, and medical reports contextually (not just textually).
  • Build no-code, repeatable workflows that auto-classify documents, flag missing fields, or trigger claims validation.
  • Deploy intelligent APIs that integrate directly into underwriting, claims, and compliance systems.

Together, LLMWhisperer + Unstract deliver the full spectrum of OCR insurance automation—from accurate document digitization to contextual understanding and AI-led decision-making. It’s not just about reading insurance documents anymore—it’s about understanding them.

Getting Started With Unstract: Best AI-Powered Document Data Extractor


Insurance OCR API: Related topics to explore

  1. Extract PDF Data to Excel Spreadsheet with Unstract
  2. High Accuracy OCR API for extracting data from invoice
  3. Best OCR API for reading bookkeeping documents
  4. Best OCR API for accounts payable documents

The post Best OCR for Insurance Document Processing Automation first appeared on Unstract.com.

]]>
Unstract PDF Splitter API: AI-Powered Mixed/Combined PDF Splitter https://unstract.com/blog/pdf-splitter-api-ai-powered-mixed-combined-pdf-splitter/?utm_source=rss&utm_medium=rss&utm_campaign=pdf-splitter-api-ai-powered-mixed-combined-pdf-splitter Fri, 24 Oct 2025 08:02:21 +0000 https://unstract.com/?p=16629

Unstract’s AI-powered multi-document PDF splitter API reads both text and layout to detect boundaries, then returns perfectly labeled, individual PDFs plus JSON metadata—no templates, keywords, or page numbers required

The post Unstract PDF Splitter API: AI-Powered Mixed/Combined PDF Splitter first appeared on Unstract.com.

]]>

Introduction

What if you could take one messy PDF packed with different documents, and instantly get back clean, perfectly labeled files?

Most PDF splitters today only handle simple, well-organized files, like splitting an eBook by chapters or breaking up a report every 10 pages. But real business documents aren’t that neat. Companies often deal with mixed PDFs, like a loan application that bundles a KYC form, salary slips, and tax documents all in one file. Traditional tools struggle because these files don’t follow a set pattern: the number of pages changes, the order shifts, and the formats are all over the place.

Now, imagine uploading that messy loan application and getting back separate, neatly labeled files, one for the loan form, one for the payslip, one for the tax proof, without lifting a finger. No manual page selection, no setting up keyword rules, no trial and error.

That’s what Unstract’s Document Splitter API does. It’s an AI-powered tool that works for any industry. Using smart language and image recognition, it automatically detects where one document ends and another begins, splitting mixed files into individual documents, no manual work, no templates, no complicated rules. Just upload, and let the API do the rest.


The Problem with Splitting Mixed-Document PDFs

For most companies, dealing with combined PDFs is just part of the daily routine. A single file might contain all kinds of unrelated documents, like a loan application packed with KYC forms, payslips, and tax records, or an insurance claim bundle with forms, medical reports, and policy copies. Industries like banking, insurance, healthcare, logistics, and outsourcing handle thousands of these files every day.

Splitting these PDFs manually is a slow, inconsistent, and expensive process. Workers have to go through each page, figure out where one document ends and another starts, and then save them one by one. It’s easy to make mistakes, and the process doesn’t scale well, especially when every document set looks different, with varying layouts, orders, and page counts.

Traditional PDF splitters weren’t designed for this. They work best with predictable, structured files, not messy, mixed-document batches. Most rely on fixed page numbers or keyword rules, which fail as soon as a document’s format changes.

Why Old-School, General Purpose PDF Splitters Don’t Work

  • Manual Page Selection: Users have to pick exact page numbers to split files, which is impossible when every document is different.
  • Header/Footer Detection: These tools look for repeating patterns in headers or footers, but those often change or disappear in scanned files.
  • Keyword or Text Matching: Searching for specific words (like “Invoice” or “Tax Form”) fails when layouts, fonts, or scan quality vary.
  • Template-Based Systems: These require predefined layouts or zones, which can’t adapt to new formats or messy scans.

The Fallout

The result? A slowdown in every workflow that relies on documents:

  • Extra Work for Teams: Employees spend hours fixing mistakes and double-checking splits.
  • Compliance Risks: Misplaced or incomplete documents can lead to missed deadlines or regulatory problems.
  • Higher Costs: Manual splitting means hiring more people, not automating the process.

Old-school PDF splitters just can’t handle the complexity, variety, and sheer volume of real-world mixed-document PDFs.


Meet the Unstract Multi-Document PDF Splitter API

The Unstract Document Splitter API is an AI-powered tool designed to handle real-world PDFs, messy, mixed, and unpredictable.

Unlike traditional PDF splitters that rely on rigid rules or manual page selection, Unstract’s API actually understands what’s inside your documents. It uses advanced AI (Large Language Models + Vision AI) to analyze both the text and the visual layout of each page. This helps it accurately detect where one document ends and another begins, just like a human would.

How It Works

  • Finds document breaks automatically, even in chaotic, unstructured files.
  • Handles both digital and scanned PDFs, including faxes, phone photos, and image-based documents.
  • Uses AI to classify and split documents based on content, layout, and meaning, not just keywords or page numbers.

Why It Stands Out

  • No templates or manual setup needed, the AI adapts to each PDF’s unique structure on the fly.
  • Works for any industry, banking, insurance, healthcare, logistics, government, legal, and more.
  • Easy to integrate, just plug it into your workflow via a simple API, whether you’re processing 10 PDFs or 10,000.

The Unstract Document Splitter API takes the hassle out of splitting documents, making it faster, more accurate, and fully automated, so organizations can finally handle mixed PDFs without the manual grind.


How Different Industries Use Unstract’s PDF Splitter

Unstract’s Intelligent Document Splitter API works across all kinds of industries, handling any document format, order, or structure. Whether you’re dealing with insurance claims, loan applications, or medical records, the API automatically finds where one document ends and another begins. It sorts everything into the right files, so you can process them smoothly.

Here are some real-world examples of how businesses use it to save time and work more efficiently:

  • Banking & Finance: Processing loans and splitting KYC files. Example: Automatically pull out KYC forms, payslips, and tax documents from a single loan application.
  • Insurance: Handling claims submissions. Example: Separate claim forms, policy documents, and attachments for faster processing and review.
  • Healthcare: Managing patient records. Example: Split admission forms, test results, and lab reports from patient files for easier data entry.
  • Legal: Organizing case files. Example: Break down large case files into individual documents, like contracts, filings, and evidence.
  • Outsourcing (BPO): Processing bulk documents. Example: Automatically sort through high volumes of contracts, onboarding forms, and compliance files.
  • Government: Processing citizen applications. Example: Separate permits, ID documents, and application forms for quicker verification and filing.

Why an Accurate PDF Splitter Matters

These industries often deal with huge stacks of unstructured, scanned, or mixed documents. Manually sorting through them is slow and unreliable. Unstract’s API brings automation and intelligence to the process, so teams don’t have to waste time splitting files by hand. Everything stays organized, easy to search, and ready to use, making workflows faster and more efficient.


How to Split Multi-Document, Mixed PDFs with Unstract

Unstract’s Document Splitter API takes the hassle out of splitting messy, multi-document PDFs.

Just upload your file, and the API does the rest. It automatically figures out where each document starts and ends, then gives you back clean, organized files, no manual work required.

Everything happens on its own, so you get the results you need without the extra effort.

Upload & Process

1. Upload the PDF – Just send your combined PDF to the API. Whether it’s 5 pages or 500, Unstract handles it all without any fuss.

2. AI Does the Work – Once uploaded, the API uses advanced AI (including language and vision models) to analyze each page. It:

  • Finds where documents switch, detecting when one document ends and another begins.
  • Splits files intelligently, using context, meaning, and visual clues, not just keywords.
  • Labels each document, identifying what each section is (like a KYC form, tax proof, or payslip).

3. Get Your Results – The API sends back a .zip file with:

  • Separate PDFs, each document split and saved individually, keeping the original look and quality.
  • A JSON file, with helpful details like:
    • Which pages belong to which document
    • What type of document each one is

This makes it easy to plug the results into your existing systems, like document management tools, automation workflows, or compliance checks.
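The upload → status → download flow above can be scripted in a few lines. The sketch below is a hedged illustration, not official client code: the base URL, endpoint paths, auth header, and the `state` field are placeholder assumptions, and only the job_id and the three-step sequence come from this walkthrough. Check your Postman collection for the real endpoints and response shapes.

```python
import json
import time
import urllib.request

BASE_URL = "https://api.example.com/splitter"  # placeholder; use your real endpoint
API_KEY = "YOUR_API_KEY"                       # from the Unstract platform

def _request(url, data=None, content_type=None):
    """Minimal authenticated HTTP helper (GET when data is None, POST otherwise)."""
    req = urllib.request.Request(url, data=data,
                                 headers={"Authorization": f"Bearer {API_KEY}"})
    if content_type:
        req.add_header("Content-Type", content_type)
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def upload_pdf(path):
    """Send the combined PDF to the Upload endpoint and return the job_id."""
    with open(path, "rb") as f:
        body = _request(f"{BASE_URL}/upload", data=f.read(),
                        content_type="application/pdf")
    return json.loads(body)["job_id"]

def wait_until_done(job_id, fetch_status=None, interval=2.0):
    """Poll the Status endpoint until the job reaches a terminal state.
    fetch_status is injectable so the polling loop can run without a network."""
    if fetch_status is None:
        fetch_status = lambda jid: json.loads(_request(f"{BASE_URL}/status/{jid}"))
    while True:
        status = fetch_status(job_id)
        if status.get("state") in ("completed", "failed"):
            return status
        time.sleep(interval)

def download_results(job_id, out_path):
    """Save the ZIP of split PDFs plus JSON metadata to out_path."""
    with open(out_path, "wb") as f:
        f.write(_request(f"{BASE_URL}/download/{job_id}"))
```

The status-fetching callable is injectable purely so the retry loop can be exercised in isolation; in production the default network path is used.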

PDF Splitter Metadata:

Here’s a sample of what the JSON file might include for each document:

[
...
    {
      "file_name": "Multimodal_Ocean_Bill_of_Lading_20240102_Rubber_Mat_Exports_original_loc_1.pdf",
      "start_page": 1,
      "end_page": 1,
      "document_type": "Multimodal Ocean Bill of Lading",
      "header_info": "DHL (logo), Shipping document from Rubber Mat Exports, 20 Cooper Square, New York, NY 10003, USA",
      "footer_info": "Issued by Roger smith on 1/2/2024, License No. 11862N",
      "key_entities": [
        "Rubber Mat Exports",
        "Floral Street 40-44 Floral Street Covent Garden London WC2E 9TB",
        "John Smith",
        "New York",
        "Madison Maersk",
        "London"
      ],
      "date_period": "1/2/2024",
      "main_content": "Shipping document detailing export of newspaper print rolls and print ink from New York to London, UK, with shipping and cargo details including weight, measurements, and liability limitations"
    },
...
]

This structured information means every document is easy to find, sort, and use, helping your business run smoother and faster.


See the PDF Splitter API in Action

Let’s take a look at how the Unstract Document Splitter API works in real-life situations, especially for industries that deal with complicated PDFs packed with different types of documents. Each example shows:

  • What the mixed PDF file looks like,
  • The kinds of documents inside it, and
  • What you get after the API splits it intelligently.

Pricing: You only pay for what you use, at $0.03 per page; a 500-page batch, for example, costs $15. This makes it easy to start small and grow as you need.

You can sign up here: Unstract Document Splitter API

Once you’re signed up:

  1. Download the Postman collection from the platform.
  2. Grab your API key, you’ll need it to log in.

In the examples below, we’ll use Postman to test the API. Just remember to swap out the example API key for your own before sending any requests.

Unstract PDF Splitter: Insurance Use Case

In the insurance world, a single PDF often contains an entire claim package, a mix of forms, policies, and extra documents all stuffed into one big file. This could include things like standard forms, policy details, claim descriptions, photos, and other evidence, all jumbled together in different orders.

Sorting through these manually is slow and easy to mess up, especially since no two claims look exactly the same.

With the Unstract Document Splitter API, the whole process is automatic. The AI figures out where each document starts and ends, labels everything correctly, and gives you clean, separate PDFs, ready to review, save, or process.

For this example, we’ll use this file: 

![[Insurance-submission-intake-documents.pdf]]

Let’s try out the API using Postman. Here’s how it works.

Upload your file: Send your sample insurance PDF to the Upload endpoint.

When you get a response, make sure to copy the job_id, you’ll need it for the next steps.

Check the progress: Use that job_id to ask the Status endpoint how the processing is going.

Download the results: Once it’s done, use the same job_id to call the Download endpoint. This will give you a ZIP file with:

  • The split PDFs (each document separated out)
  • A JSON file with extra details about each document
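Once the ZIP arrives, its contents can be unpacked programmatically. Here is a small sketch; the name of the metadata file inside the archive is an assumption for illustration, so adjust it to match what your download actually contains:

```python
import io       # io is used in the usage example below
import json
import zipfile

def read_results(zip_path, metadata_name="metadata.json"):
    """Return the split PDF names and the parsed metadata from the result ZIP.
    zip_path may be a file path or any file-like object."""
    with zipfile.ZipFile(zip_path) as zf:
        pdfs = sorted(n for n in zf.namelist() if n.lower().endswith(".pdf"))
        metadata = json.loads(zf.read(metadata_name))
    return pdfs, metadata
```

Because `zipfile.ZipFile` accepts file-like objects, the same helper works whether you saved the ZIP to disk or are holding the HTTP response bytes in memory.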

Now, let’s open that ZIP file and see what’s inside the JSON metadata.

{
  "doc_id": "2b0c2757-5986-47bf-b467-a995f9cfdcaf",
  "boundaries": [
    {
      "file_name": "Loss_Run_Summary_20190101-20231231_Zurich_original_loc_1.pdf",
      "start_page": 1,
      "end_page": 1,
      "document_type": "Loss Run Summary",
      "header_info": "Zurich Insurance Company Logo",
      "footer_info": "Run Date: 8/8/24",
      "key_entities": [
        "Zurich"
      ],
      "date_period": "Policy Years 2019-2023, Data valued as of: 08/07/2024",
      "main_content": "Insurance claims summary showing claim counts, paid amounts, and incurred losses for multiple policy years, with a total of 32 claims across different years"
    },
    {
      "file_name": "Loss_Detail_Report_20190101_Brody_Inc_original_loc_2.pdf",
      "start_page": 2,
      "end_page": 8,
      "document_type": "Zurich North America Customer Loss Runs Loss Detail",
      "header_info": "Zurich logo, Brody Inc.",
      "footer_info": "Run Date: 8/8/24",
      "key_entities": [
        "Zurich North America",
        "Brody Inc.",
        "Nguyen, Lan N",
        "Biddle, Kerry",
        "Breen, Mallory"
      ],
      "date_period": "Policy Year: 2019",
      "main_content": "Insurance claims report detailing auto physical damage and liability claims for Brody Inc., showing claim numbers, dates, states, status, and financial details"
    },
    {
      "file_name": "Glossary_of_Terms_Zurich_original_loc_9.pdf",
      "start_page": 9,
      "end_page": 9,
      "document_type": "Glossary of Terms",
      "header_info": null,
      "footer_info": null,
      "key_entities": [
        "Zurich"
      ],
      "date_period": null,
      "main_content": "A detailed explanation of terminology used in insurance claim documentation, including definitions for terms like Claim Number, Claimant, Loss Type, Coverage Type, Date of Loss, Policy Symbol, and other key insurance claim-related terms"
    },
    {
      "file_name": "Commercial_Insurance_App_20241115_Sea_Nagar_Condo_original_loc_10.pdf",
      "start_page": 10,
      "end_page": 13,
      "document_type": "Commercial Insurance Application",
      "header_info": "ACORD logo, Arthur J. Gallagher Risk Management Services, LLC, 1395 Panther Lane Ste 100, Naples, FL 34109",
      "footer_info": "\u00a9 1993-2015 ACORD CORPORATION. All rights reserved. ACORD 125 (2016/03), Page 1 of 4",
      "key_entities": [
        "Arthur J. Gallagher Risk Management Services, LLC",
        "Sea Nagar Condominium Association, Inc."
      ],
      "date_period": "Proposed Policy Period: 11/15/2024 to 11/15/2025",
      "main_content": "Commercial insurance application form for collecting applicant and business information, including lines of business, policy details, and contact information"
    },
    {
      "file_name": "Property_Insurance_Cert_20241115_Sea_Nagar_Condo_original_loc_14.pdf",
      "start_page": 14,
      "end_page": 16,
      "document_type": "ACORD Property Insurance Certificate/Declarations Page",
      "header_info": "ACORD logo, Agency: Arthur J. Gallagher Risk Management Services, LLC",
      "footer_info": "ACORD 140 (2014/12), Copyright notice for ACORD",
      "key_entities": [
        "Sea Nagar Condominium Association, Inc.",
        "Arthur J. Gallagher Risk Management Services, LLC",
        "1450 East Sea Blvd, Miami, FL 33132"
      ],
      "date_period": {
        "policy_effective_date": "11/15/2024",
        "document_date": "09/30/2024"
      },
      "main_content": "Detailed property insurance policy declaration page for a residential condominium, listing property values, coverage amounts, and insurance details for building, contents, pool, and other property elements"
    },
    {
      "file_name": "Insurance_Schedule_20240930_SEANCLO-0_original_loc_17.pdf",
      "start_page": 17,
      "end_page": 17,
      "document_type": "Agency Customer Insurance Schedule",
      "header_info": "SEANCLO-0 Agency",
      "footer_info": "APPLIED 140SOI (2014/12)",
      "key_entities": [
        "SEANCLO-0",
        "Miami, FL 33132",
        "1450 East Sea Blvd"
      ],
      "date_period": "09/30/2024",
      "main_content": "Detailed insurance schedule for a residential condo property, listing multiple subjects of insurance with specific coverage amounts, deductibles, and loss causes for various property elements like storage, lighting, fencing, walls, signs, gates, equipment, and HVAC systems"
    },
    {
      "file_name": "Commercial_Insurance_App_ACORD_Corporation_original_loc_18.pdf",
      "start_page": 18,
      "end_page": 21,
      "document_type": "Commercial Insurance Application",
      "header_info": "ACORD logo, Commercial Insurance Application form",
      "footer_info": "\u00a9 1993-2015 ACORD CORPORATION. All rights reserved. ACORD 125 (2016/03)",
      "key_entities": [
        "ACORD Corporation"
      ],
      "date_period": null,
      "main_content": "Comprehensive commercial insurance application form for businesses to provide detailed information about their insurance needs, including lines of business, policy information, and applicant details"
    },
    {
      "file_name": "Environmental_Service_App_20241101_Wonder_Corporation_original_loc_22.pdf",
      "start_page": 22,
      "end_page": 29,
      "document_type": "Environmental Service Providers Application",
      "header_info": "WONDER CORPORATION, 420 S. Oscar Best Road, Suite 350, Nashville, TN 37621-1546",
      "footer_info": "FEI-300-ECC-0712",
      "key_entities": [
        "Wonder Corporation",
        "Everest Indemnity Ins Co"
      ],
      "date_period": "Proposed Effective Date: 11/01/2024, Retroactive Date: 11/01/2001",
      "main_content": "Insurance application for environmental service providers, requesting renewal of commercial general liability, contractors pollution liability, and professional liability coverage with $1M/$2M limits and $1,000 deductible"
    },
    {
      "file_name": "Contractor_Questionnaire_Brody_Inc_original_loc_30.pdf",
      "start_page": 30,
      "end_page": 40,
      "document_type": "Contractor Questionnaire",
      "header_info": "Zurich Insurance Company logo, Zurich branding",
      "footer_info": "Toll-free number (866) 903-1192, Reference to ZurichAmerican Insurance Company",
      "key_entities": [
        "Brody Inc",
        "David Young",
        "Zurich"
      ],
      "date_period": "Five-year historical financial data (4 years prior + current term + upcoming term)",
      "main_content": "Detailed financial questionnaire for a contractor, capturing payroll, revenue, subcontract costs, and power units over multiple years. Includes sections for required attachments and desired supplementary documentation."
    },
    {
      "file_name": "Insurance_Coverage_Request_general_original_loc_41.pdf",
      "start_page": 41,
      "end_page": 41,
      "document_type": "Insurance Coverage Request Appendix",
      "header_info": "APPENDIX",
      "footer_info": null,
      "key_entities": [],
      "date_period": null,
      "main_content": "Insurance coverage request form checking 'Yes' for Auto Liability, Workers' Compensation, and Commercial General Liability coverage"
    }
  ],
  "total_documents": 10
}

The response gives you helpful details about each document it pulled out, like:

  • file_name – what each split document is called
  • start_page and end_page – which pages in the original PDF it covers
  • main_content – a quick summary of what’s inside the document
  • plus other useful info to help you organize and find things later
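Because the metadata is structured, downstream routing is easy to script. This minimal sketch uses only the field names shown in the sample above to group split files by their document_type and to sanity-check how many pages the boundaries account for:

```python
from collections import defaultdict

def group_by_type(metadata):
    """Map each document_type to the file names of the split PDFs of that type."""
    groups = defaultdict(list)
    for doc in metadata["boundaries"]:
        groups[doc["document_type"]].append(doc["file_name"])
    return dict(groups)

def pages_covered(metadata):
    """Total pages accounted for by the boundaries; comparing this to the
    original PDF's page count is a cheap completeness check."""
    return sum(d["end_page"] - d["start_page"] + 1 for d in metadata["boundaries"])
```

A mismatch between `pages_covered` and the source document's page count would flag pages that were dropped or double-assigned during splitting.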

Now, let’s check out the list of files you get back.

Let’s take a look at some of the split files to see how well the API picked out and separated each document: 

![[Commercial_Insurance_App_20241115_Sea_Nagar_Condo_original_loc_10.pdf]]

![[Glossary_of_Terms_Zurich_original_loc_9.pdf]]

As you can see, the API did a great job splitting up the insurance documents—even with different layouts, formats, and types of content all mixed together.


Unstract PDF Splitter: Banking Use Case

In banking and lending, dealing with documents can be a mess. A single PDF from a customer might include a loan application, ID documents, pay stubs, and tax forms, all stuffed into one big file. Since every customer’s submission looks different, old-school splitting tools (that just go by page numbers) often get it wrong.

The Unstract Document Splitter API takes care of this automatically. It looks at what’s actually in the file, figures out where each document starts and ends, and sorts everything correctly, no manual rules or templates needed.

We’ll use Postman again to walk through the process: upload the file, check the progress, and download the results.

For this example, we’ll use this file: 

![[loan-processing-documents.pdf]]

Let’s open the ZIP file and check out the details in the JSON file to see what we got:

{
  "doc_id": "6b25e17e-457c-4784-9dc2-c36d995393dc",
  "boundaries": [
    {
      "file_name": "Loan_Estimate_20130215_Michael_Jones_original_loc_1.pdf",
      "start_page": 1,
      "end_page": 3,
      "document_type": "Loan Estimate",
      "header_info": "Ficus Bank, 4321 Random Boulevard, Somecity, ST 12340",
      "footer_info": "Visit www.consumerfinance.gov/mortgage-estimate for general information and tools, Loan ID # 12456789, Page 1 of 3",
      "key_entities": [
        "Michael Jones",
        "Mary Stone",
        "Ficus Bank"
      ],
      "date_period": "2/15/2013 (Date Issued), Rate Lock until 4/16/2013",
      "main_content": "Detailed mortgage loan estimate for a 30-year fixed-rate conventional home purchase, showing loan terms, projected payments, and closing costs for a $180,000 property with a $162,000 loan amount"
    },
    {
      "file_name": "Uniform_Residential_Loan_App_Robert_Noah_original_loc_4.pdf",
      "start_page": 4,
      "end_page": 6,
      "document_type": "Uniform Residential Loan Application",
      "header_info": "Lender Loan No: Universal Loan Identifier, Agency Case No.",
      "footer_info": "Uniform Residential Loan Application, Freddie Mac Form 65 - Fannie Mae Form 1003, Effective 1/2021",
      "key_entities": [
        "Robert Noah",
        "Comp Corp Systems",
        "Miami",
        "FL",
        "US"
      ],
      "date_period": "05/02/1988 (Date of Birth)",
      "main_content": "First section of a loan application collecting personal information, employment details, contact information, and income for an individual borrower applying for a residential loan"
    },
    {
      "file_name": "Drivers_License_California_original_loc_7.pdf",
      "start_page": 7,
      "end_page": 7,
      "document_type": "Driver's License",
      "header_info": "California Driver License",
      "footer_info": null,
      "key_entities": [
        "California"
      ],
      "date_period": null,
      "main_content": "Official state-issued identification document showing a driver's license with a portrait photo and key personal identification details",
      "override_match": {
        "matched": true,
        "document_type": "Identification Documents (Driver's License, Passport, State ID)",
        "confidence": 1.0
      }
    },
    {
      "file_name": "Payslip_202108_Sally_Harley_original_loc_8.pdf",
      "start_page": 8,
      "end_page": 8,
      "document_type": "Payslip",
      "header_info": "Zoodle Inc, 21023 Pearson Point Road Gate Avenue",
      "footer_info": "This is system generated payslip",
      "key_entities": [
        "Sally Harley",
        "Zoodle Inc",
        "Marketing Department"
      ],
      "date_period": "August 2021",
      "main_content": "Monthly salary breakdown showing earnings and deductions for a Marketing Executive, with total earnings of $11600, total deductions of $2100, resulting in a net pay of $9500"
    },
    {
      "file_name": "Chase_Freedom_Credit_Card_Stmt_202402_Larry_Page_original_loc_9.pdf",
      "start_page": 9,
      "end_page": 10,
      "document_type": "Chase Freedom Credit Card Statement",
      "header_info": "Chase Freedom, Customer Service: 1-800-524-3880, Manage account online at: www.chase.com/cardhelp",
      "footer_info": "P.O. BOX 15123, WILMINGTON, DE 19850-5123, Cardmember Service PO BOX 6294, CAROL STREAM IL 60197-6294",
      "key_entities": [
        "Chase",
        "Larry Page",
        "Wilmington, DE",
        "Carol Stream, IL"
      ],
      "date_period": "February 2024, Billing Cycle: 01/04/24 - 02/03/24",
      "main_content": "Monthly credit card statement showing account balance, payment due, rewards points, and account activity for Chase Freedom Ultimate Rewards credit card"
    }
  ],
  "total_documents": 5
}

Now, let’s see the result: the list of files we got back.

Let’s open a few of these split files to check how well the API picked out and separated each document: 

![[Chase_Freedom_Credit_Card_Stmt_202402_Larry_Page_original_loc_9.pdf]]

![[Drivers_License_California_original_loc_7.pdf]]

As you can see, the API neatly separated all the banking documents, even tricky ones like a driver’s license, handling different formats and content types with great accuracy.


Unstract PDF Splitter: Logistics Use Case

In logistics and supply chain, dealing with documents often means working with long, messy PDFs that combine invoices, packing lists, shipping manifests, and delivery notes, all in one file. These documents can come from different systems, carriers, or suppliers, each with their own layout and style.

Sorting through them manually is slow and easy to mess up, especially when you’re handling thousands of shipments every day.

The Unstract Document Splitter API makes this process simple. The AI automatically finds where each document starts and ends, labels them correctly, and gives you clean, separate PDFs, keeping everything in order and just as it was originally formatted.

We’ll use Postman again to go through the steps: upload the file, check the progress, and download the results.

For this example, we’ll use this file:

![[combined-Logistics-documents.pdf]]

Let’s open the ZIP file and check out the details in the JSON file to see what we got:

{
  "doc_id": "d79fd0d1-9de9-459c-9ada-a7909e830260",
  "boundaries": [
    {
      "file_name": "Multimodal_Ocean_Bill_of_Lading_20240102_Rubber_Mat_Exports_original_loc_1.pdf",
      "start_page": 1,
      "end_page": 1,
      "document_type": "Multimodal Ocean Bill of Lading",
      "header_info": "DHL (logo), Shipping document from Rubber Mat Exports, 20 Cooper Square, New York, NY 10003, USA",
      "footer_info": "Issued by Roger smith on 1/2/2024, License No. 11862N",
      "key_entities": [
        "Rubber Mat Exports",
        "Floral Street 40-44 Floral Street Covent Garden London WC2E 9TB",
        "John Smith",
        "New York",
        "Madison Maersk",
        "London"
      ],
      "date_period": "1/2/2024",
      "main_content": "Shipping document detailing export of newspaper print rolls and print ink from New York to London, UK, with shipping and cargo details including weight, measurements, and liability limitations"
    },
    {
      "file_name": "Air_Waybill_20251210_IDF_Cargo_original_loc_2.pdf",
      "start_page": 2,
      "end_page": 2,
      "document_type": "Air Waybill (Shipping Document)",
      "header_info": "IDF Cargo, New York - Shipping Carrier",
      "footer_info": "Executed on date: 12/10/2025, Location: New York",
      "key_entities": [
        "Simon Jones",
        "Rogen Gates",
        "IDF Cargo",
        "New York",
        "London",
        "Bristol"
      ],
      "date_period": "12/10/2025",
      "main_content": "International air shipping document for transporting goods from New York to London, including details of shipment weight, commodity, charges, and contents (News Print Paper, Package Paper, Print Ink Solution)"
    },
    {
      "file_name": "Packing_List_20231212_Faculty_of_Arts_original_loc_3.pdf",
      "start_page": 3,
      "end_page": 3,
      "document_type": "Packing List",
      "header_info": "Faculty of Arts, 5 Washington Square S, New York, NY 10012, USA",
      "footer_info": "Note about Export Administration Regulations, signature line",
      "key_entities": [
        "Faculty of Arts",
        "Herald Corp",
        "Air Cargo UPS"
      ],
      "date_period": "12/12/2023 (Date of Shipment)",
      "main_content": "Shipping document detailing print packaging and black ink cartridges being shipped, with total package and weight information"
    },
    {
      "file_name": "Certificate_of_Origin_20251015_BlueSky_Apparel_Inc_original_loc_4.pdf",
      "start_page": 4,
      "end_page": 5,
      "document_type": "Certificate of Origin",
      "header_info": "Generalised System of Preferences Certificate of Origin (Form A)",
      "footer_info": null,
      "key_entities": [
        "BlueSky Apparel Inc.",
        "EuroFashion Retail GmbH",
        "U.S. Chamber of Commerce",
        "Port of Savannah (USA)",
        "Port of Hamburg (Germany)"
      ],
      "date_period": "October 15, 2025",
      "main_content": "International shipping document certifying the origin of exported goods (20 cartons of cotton men's shirts) from USA to Germany, with details of exporter, consignee, transport route, and origin verification"
    }
  ],
  "total_documents": 4
}

Now, let’s see the list of files we got back.

Let’s open a few of these split files to check how well the API picked out and separated each document: 

![[Air_Waybill_20251210_IDF_Cargo_original_loc_2.pdf]]

![[Packing_List_20231212_Faculty_of_Arts_original_loc_3.pdf]]

As you can see, the API did a great job splitting the logistics documents, even with scanned or poor-quality files, keeping everything clearly separated and correctly labelled, no matter how messy or hard to read the originals were.


Why an Accurate PDF Splitter Matters for Business Workflows

The Unstract Document Splitter API isn’t just a technical tool; it’s a game-changer for productivity. By automating tasks that used to take hours of manual work, it completely changes how companies handle piles of documents, saving time, reducing mistakes, and cutting costs.

Saves Time: It processes documents up to 10 times faster than doing it by hand. This means teams don’t have to waste time checking each page, and big batches of documents can be ready the same day.

Fewer Mistakes: It produces consistent, accurate splits, no matter what documents you throw at it. No more human errors, just perfectly separated and labeled files.

Meets Regulations: Every document is clearly identified, easy to track, and ready for review, which helps businesses stay on top of rules and audits in industries like banking, insurance, and healthcare.

Cuts Costs: Instead of needing a whole team to sort documents, the API does the work automatically. This means lower labor costs, less hassle, and more getting done.

In short, Unstract’s Document Splitter API turns document handling from a slow, manual headache into a fast, reliable, and scalable process, helping businesses work smarter, stay compliant, and move forward faster.


Unstract PDF Splitter: Try Now, Get Instant Access to the API

Document processing just got a major upgrade, smarter, more flexible, and ready to grow with you. With the Unstract Document Splitter API, businesses can finally automate one of the most tedious and mistake-prone tasks: splitting combined PDFs into individual documents.

No matter what you’re working with, insurance claims, tax forms, loan applications, or ID checks, Unstract’s tool takes the hassle out of the process. No more manual sorting, no more setting up rules or templates. The API understands your files, finds where each document starts and ends, and gives you perfectly separated PDFs, fast and accurately.

Ready to make your document workflows smoother and smarter? Try it out for yourself with the Unstract Document Splitter API.


AI-Powered PDF Splitter: FAQs

What is a multi-document PDF splitter and how is it different from a regular PDF splitter?

A multi-document PDF splitter is an AI-driven tool that scans every page, recognizes content and layout, and separates one file into many logically labeled documents. A standard PDF splitter usually cuts by fixed page numbers or bookmarks and fails when pages don’t follow a predictable pattern.

Why would a company choose a Combined Document splitter instead of doing the work manually or using legacy systems?

A Combined Document splitter eliminates manual page selection, reduces human error, and scales to thousands of files per day. Finance, insurance, healthcare, and logistics teams use it to turn one messy bundle into clean, compliant, ready-to-process documents in minutes.

How to split pdf files that bundle payslips, IDs and tax forms in random order?

Upload the file to a multi-document PDF splitter like Unstract. The AI identifies where each document starts and ends, labels them (e.g., “Payslip – Aug 2025”), and returns individual PDFs plus a JSON index, with no templates or keyword rules required.

What is the Combined PDF Splitter API and does it support bulk/batch processing of files?

The Combined PDF Splitter API is a REST service from Unstract. POST your combined file to the Upload endpoint; poll the Status endpoint with the job_id; then download a ZIP containing the split PDFs and structured metadata. You can call it from Postman, Python, Java, or any workflow tool.

How to automatically split combined PDFs with multiple document types?

Use an AI engine trained on both language and vision. Send the PDF to the API, let the model detect page boundaries, and receive separate, perfectly titled outputs. This is the fastest way to split combined PDFs automatically without pre-built templates.


The post Unstract PDF Splitter API: AI-Powered Mixed/Combined PDF Splitter first appeared on Unstract.com.

]]>