Extracting data from images using vision language models
Use VLM-enhanced ICR when you need higher extraction accuracy on complex documents.
Common use cases include:
- Financial documents with complex tables
- Invoices with varied layouts
- Medical records with specialized terminology
- Legal documents with strict structure requirements
- Multi-language document analysis
VLM-enhanced mode combines ICR layout analysis with language-model reasoning to improve classification and structure detection.
How Nutrient helps
The Nutrient Python SDK handles VLM-enhanced configuration, model orchestration, and JSON output generation.
Specifically, it manages:
- Hybrid mode configuration for ICR + VLM processing
- Model loading and capability coordination
- Semantic classification and confidence scoring internals
- Complex layout analysis implementation details
Complete implementation
This example extracts structured JSON using VisionEngine.VLM_ENHANCED_ICR:
from nutrient_sdk import Document, Vision, VisionEngine

Loading and processing the image
Open the image in a context manager so resources are cleaned up after processing:
with Document.open("input.png") as document:

Configuring VLM-enhanced mode
Set the vision engine to VisionEngine.VLM_ENHANCED_ICR.
This mode improves:
- Table boundary detection
- Semantic element classification
- Reading order in complex layouts
- Understanding across document variations
document.settings.vision_settings.engine = VisionEngine.VLM_ENHANCED_ICR

Creating a vision instance
Create a vision instance bound to the opened document with Vision.set(document):
vision = Vision.set(document)

Extracting structured content
Call extract_content() to run the VLM-enhanced pipeline.
In this mode, the pipeline performs:
- Initial ICR layout detection
- VLM-based semantic refinement
- Confidence scoring
- JSON generation with structure and coordinates
content_json = vision.extract_content()
Write the JSON result to a file for downstream processing.
Use this output for indexing, validation, storage, or custom analysis:
with open("output.json", "w") as f:
    f.write(content_json)

Understanding the output
extract_content() returns structured JSON with layout and semantic metadata.
VLM-enhanced output includes:
- Document elements — Paragraphs, headings, tables, figures, equations, and form-related regions
- Bounding boxes — Pixel coordinates with improved boundary accuracy
- Hierarchical relationships — Parent-child structure across sections and blocks
- Element classification — Semantic types with confidence scores
- Reading order — Sequence for complex layouts and multicolumn content
- Semantic metadata — Additional attributes used in downstream processing
Use this JSON for form extraction, contract analysis, invoice parsing, and other high-accuracy workflows.
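As a sketch of one such downstream step, the returned JSON can be filtered by confidence score before indexing or validation. The field names used below (elements, type, confidence, bbox) are assumptions made for this illustration, not the SDK's documented output schema:

```python
import json

# Illustrative input shaped like the structure described above.
# The field names ("elements", "type", "confidence", "bbox") are
# assumptions for this sketch, not the SDK's documented schema.
content_json = json.dumps({
    "elements": [
        {"type": "heading", "text": "Invoice #1042", "confidence": 0.97,
         "bbox": [40, 32, 520, 64]},
        {"type": "table", "text": "", "confidence": 0.88,
         "bbox": [40, 120, 560, 480]},
        {"type": "paragraph", "text": "Terms: net 30", "confidence": 0.42,
         "bbox": [40, 500, 400, 530]},
    ]
})

def high_confidence_elements(raw_json: str, threshold: float = 0.8):
    """Keep only elements whose confidence meets the threshold."""
    data = json.loads(raw_json)
    return [el for el in data["elements"] if el["confidence"] >= threshold]

kept = high_confidence_elements(content_json)
print([el["type"] for el in kept])  # → ['heading', 'table']
```

Low-confidence elements can then be routed to manual review rather than silently dropped, depending on your accuracy requirements.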
Error handling
The Vision API raises a VisionException when extraction fails.
Common failure scenarios include:
- The image file can’t be read due to path or permission issues
- Image data is corrupted or unsupported
- Required models are missing or inaccessible
- Available memory is insufficient for VLM-enhanced processing
- VLM enhancement fails due to connectivity or service issues when applicable
- Image format, resolution, or dimensions are unsupported
In production code:
- Catch VisionException.
- Return a clear error message.
- Log failure details for debugging.
- Add fallback logic (for example, retry in ICR mode).
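The guidance above can be sketched as a generic fallback wrapper. Note the VisionException class below is a local stand-in so the sketch is self-contained; in real code you would import the SDK's exception, and the two callables would wrap the VLM-enhanced and plain-ICR extraction calls:

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("extraction")

class VisionException(Exception):
    """Stand-in for the SDK's VisionException (local to this sketch)."""

def extract_with_fallback(primary, fallback):
    """Try the VLM-enhanced extractor first; on failure, log the
    error and retry with the plain ICR extractor."""
    try:
        return primary()
    except VisionException as exc:
        log.warning("VLM-enhanced extraction failed (%s); retrying in ICR mode", exc)
        try:
            return fallback()
        except VisionException as exc:
            log.error("ICR fallback also failed: %s", exc)
            raise

# Hypothetical extractors standing in for the SDK calls.
def vlm_extract():
    raise VisionException("required VLM model is missing")

def icr_extract():
    return '{"elements": []}'

result = extract_with_fallback(vlm_extract, icr_extract)
print(result)  # → {"elements": []}
```

The wrapper re-raises only when both modes fail, so callers still get a clear exception path while transient VLM issues degrade gracefully to ICR output.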
Conclusion
Use this workflow for VLM-enhanced extraction:
- Open the image document using a context manager for automatic resource cleanup.
- Configure the vision settings by assigning VisionEngine.VLM_ENHANCED_ICR to the vision_settings.engine property for enhanced accuracy.
- VLM-enhanced mode combines local ICR AI models with vision language model capabilities for superior document analysis.
- Create a vision instance with Vision.set() to bind content extraction operations to the document.
- Call extract_content() to invoke the VLM-enhanced processing pipeline.
- The pipeline performs initial ICR layout analysis, applies VLM enhancement for semantic understanding, calculates confidence scores, and generates JSON output.
- VLM enhancement improves table cell boundary detection, element classification accuracy, and reading order determination for complex layouts.
- The method returns a JSON-formatted string containing document structure with elements, bounding boxes, hierarchical relationships, reading order, and confidence scores.
- Write the JSON content to a file using Python's built-in file handling with context manager syntax for automatic resource management.
- Handle VisionException errors for robust error recovery with fallback strategies like pure ICR mode.
- The JSON output enables integration with intelligent form extraction, contract analysis, invoice processing, and legal document parsing.
- VLM-enhanced mode is ideal for complex documents where extraction accuracy is the priority.
For related image extraction workflows, refer to the Python SDK guides.
Download this ready-to-use sample package to explore VLM-enhanced extraction.