Extracting data from images using vision language models
Use VLM-enhanced ICR when you need higher extraction accuracy on complex documents.
Common use cases include:
- Financial documents with complex tables
- Invoices with varied layouts
- Medical records with specialized terminology
- Legal documents with strict structure requirements
- Multi-language document analysis
VLM-enhanced mode combines ICR layout analysis with language-model reasoning to improve classification and structure detection.
How Nutrient helps
The Nutrient Python SDK handles VLM-enhanced configuration, model orchestration, and JSON output generation.
Specifically, it manages:
- Hybrid mode configuration for ICR + VLM processing
- Model loading and capability coordination
- Semantic classification and confidence scoring internals
- Complex layout analysis implementation details
Complete implementation
This example extracts structured JSON using VisionEngine.VLM_ENHANCED_ICR:
from nutrient_sdk import Document, Vision, VisionEngine

Loading and processing the image
Open the image in a context manager so resources are cleaned up after processing:
with Document.open("input.png") as document:

Configuring VLM-enhanced mode
Set the vision engine to VisionEngine.VLM_ENHANCED_ICR.
This mode improves:
- Table boundary detection
- Semantic element classification
- Reading order in complex layouts
- Understanding across document variations
document.settings.vision_settings.engine = VisionEngine.VLM_ENHANCED_ICR

Creating a vision instance
Create a vision instance bound to the opened document with Vision.set(document):
vision = Vision.set(document)

Extracting structured content
Call extract_content() to run the VLM-enhanced pipeline.
In this mode, the pipeline performs:
- Initial ICR layout detection
- VLM-based semantic refinement
- Confidence scoring
- JSON generation with structure and coordinates
content_json = vision.extract_content()
Write the JSON result to a file for downstream processing.
Use this output for indexing, validation, storage, or custom analysis:
with open("output.json", "w") as f:
    f.write(content_json)

Understanding the output
extract_content() returns structured JSON with layout and semantic metadata.
VLM-enhanced output includes:
- Document elements — Paragraphs, headings, tables, figures, equations, and form-related regions
- Bounding boxes — Pixel coordinates with improved boundary accuracy
- Hierarchical relationships — Parent-child structure across sections and blocks
- Element classification — Semantic types with confidence scores
- Reading order — Sequence for complex layouts and multicolumn content
- Semantic metadata — Additional attributes used in downstream processing
Use this JSON for form extraction, contract analysis, invoice parsing, and other high-accuracy workflows.
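As a sketch of one such downstream step, the returned JSON can be filtered by confidence score before indexing or validation. The field names used below (elements, type, confidence, bbox) are assumptions made for this illustration, not the SDK's documented output schema:

```python
import json

# Illustrative input shaped like the structure described above.
# The field names ("elements", "type", "confidence", "bbox") are
# assumptions for this sketch, not the SDK's documented schema.
content_json = json.dumps({
    "elements": [
        {"type": "heading", "text": "Invoice #1042", "confidence": 0.97,
         "bbox": [40, 32, 520, 64]},
        {"type": "table", "text": "", "confidence": 0.88,
         "bbox": [40, 120, 560, 480]},
        {"type": "paragraph", "text": "Terms: net 30", "confidence": 0.42,
         "bbox": [40, 500, 400, 530]},
    ]
})

def high_confidence_elements(raw_json: str, threshold: float = 0.8):
    """Keep only elements whose confidence meets the threshold."""
    data = json.loads(raw_json)
    return [el for el in data["elements"] if el["confidence"] >= threshold]

kept = high_confidence_elements(content_json)
print([el["type"] for el in kept])  # → ['heading', 'table']
```

Low-confidence elements can then be routed to manual review rather than silently dropped, depending on your accuracy requirements.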
Error handling
The Vision API raises a VisionException when extraction fails.
Common failure scenarios include:
- The image file can’t be read due to path or permission issues
- Image data is corrupted or unsupported
- Required models are missing or inaccessible
- Available memory is insufficient for VLM-enhanced processing
- VLM enhancement fails due to connectivity or service issues when applicable
- Image format, resolution, or dimensions are unsupported
In production code:
- Catch VisionException.
- Return a clear error message.
- Log failure details for debugging.
- Add fallback logic (for example, retry in ICR mode).
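The guidance above can be sketched as a generic fallback wrapper. Note the VisionException class below is a local stand-in so the sketch is self-contained; in real code you would import the SDK's exception, and the two callables would wrap the VLM-enhanced and plain-ICR extraction calls:

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("extraction")

class VisionException(Exception):
    """Stand-in for the SDK's VisionException (local to this sketch)."""

def extract_with_fallback(primary, fallback):
    """Try the VLM-enhanced extractor first; on failure, log the
    error and retry with the plain ICR extractor."""
    try:
        return primary()
    except VisionException as exc:
        log.warning("VLM-enhanced extraction failed (%s); retrying in ICR mode", exc)
        try:
            return fallback()
        except VisionException as exc:
            log.error("ICR fallback also failed: %s", exc)
            raise

# Hypothetical extractors standing in for the SDK calls.
def vlm_extract():
    raise VisionException("required VLM model is missing")

def icr_extract():
    return '{"elements": []}'

result = extract_with_fallback(vlm_extract, icr_extract)
print(result)  # → {"elements": []}
```

The wrapper re-raises only when both modes fail, so callers still get a clear exception path while transient VLM issues degrade gracefully to ICR output.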
Conclusion
Use this workflow for VLM-enhanced extraction:
- Open the image document using a context manager for automatic resource cleanup.
- Configure the vision settings by assigning VisionEngine.VLM_ENHANCED_ICR to the vision_settings.engine property for enhanced accuracy.
- VLM-enhanced mode combines local ICR AI models with vision language model capabilities for superior document analysis.
- Create a vision instance with Vision.set() to bind content extraction operations to the document.
- Call extract_content() to invoke the VLM-enhanced processing pipeline.
- The pipeline performs initial ICR layout analysis, applies VLM enhancement for semantic understanding, calculates confidence scores, and generates JSON output.
- VLM enhancement improves table cell boundary detection, element classification accuracy, and reading order determination for complex layouts.
- The method returns a JSON-formatted string containing document structure with elements, bounding boxes, hierarchical relationships, reading order, and confidence scores.
- Write the JSON content to a file using Python's built-in file handling with context manager syntax for automatic resource management.
- Handle VisionException errors for robust error recovery with fallback strategies like pure ICR mode.
- The JSON output enables integration with intelligent form extraction, contract analysis, invoice processing, and legal document parsing.
- VLM-enhanced mode is ideal for complex documents where extraction accuracy is the priority.
For related image extraction workflows, refer to the Python SDK guides.
Download this ready-to-use sample package to explore VLM-enhanced extraction.