This document provides a high-level introduction to the Unstract platform, covering its purpose, architecture, and primary capabilities. Unstract is a document-to-structured-data extraction system that leverages Large Language Models (LLMs) to transform unstructured documents into structured JSON output.
Sources: README.md32-47 docs/ARCHITECTURE.md
Unstract uses LLMs to extract structured JSON from documents including PDFs, images, scans, and other unstructured file formats. Users define extraction schemas using natural language prompts rather than writing code or templates. Deployment options are described later in this overview.
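The schema-as-prompts idea can be sketched as follows; the field names, prompts, and output values here are hypothetical, for illustration only:

```python
# Hypothetical example: an extraction schema is a set of natural-language
# prompts, one per output field, rather than code or templates.
invoice_schema = {
    "invoice_number": "What is the invoice number?",
    "total_amount": "What is the grand total payable, as a number?",
    "due_date": "When is the payment due? Answer in ISO 8601 format.",
}

# The platform returns structured JSON keyed by those field names, e.g.:
example_output = {
    "invoice_number": "INV-2024-0042",
    "total_amount": 1234.50,
    "due_date": "2024-07-15",
}

# Every schema field appears in the structured output.
assert set(example_output) == set(invoice_schema)
```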
The system is built for teams in finance, insurance, healthcare, KYC/compliance, and other domains requiring high-volume document data extraction.
Sources: README.md34-36 README.md52-66
Sources: backend/sample.env11-110 README.md141-160
| Service | Technology | Port | Primary Responsibility |
|---|---|---|---|
| Backend | Django 4.2.1 + DRF | 8000 | Central orchestration, REST API, workflow management, authentication |
| Platform Service | FastAPI | 3001 | Platform operations, user management, organization settings |
| Prompt Service | Flask | 3003 | Prompt execution, LLM integration, RAG retrieval |
| X2Text Service | - | 3004 | Document text extraction, OCR processing |
| Runner Service | - | 5002 | Docker container orchestration for tool execution |
| Frontend | React + Vite | - | Web UI, Prompt Studio IDE, workflow designer |
Sources: backend/pyproject.toml1-8 prompt-service/pyproject.toml1-7 backend/sample.env78-110
| Component | Technology | Purpose |
|---|---|---|
| PostgreSQL | PostgreSQL + pgvector | Primary relational database, vector embeddings |
| Redis | Redis 6+ | Multiple databases: cache, session storage, rate limiting, execution tracking |
| RabbitMQ | RabbitMQ | Message queue for Celery task distribution |
| MinIO | MinIO (S3-compatible) | Object storage for documents, tool outputs, execution artifacts |
| Qdrant | Qdrant | Vector database for document embeddings and RAG |
Sources: backend/sample.env17-31 backend/sample.env186-198
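The "multiple databases" in the Redis row refers to Redis's numbered logical databases. A minimal sketch of assigning one logical database per concern; the purpose-to-index mapping here is an assumption for illustration, not Unstract's actual assignment:

```python
REDIS_HOST = "localhost"
REDIS_PORT = 6379

# Illustrative mapping of concern -> Redis logical database index.
REDIS_DB_MAP = {
    "cache": 0,
    "sessions": 1,
    "rate_limit": 2,
    "execution_tracking": 3,
}

def redis_url(purpose: str) -> str:
    """Build a redis:// URL targeting the logical database for `purpose`."""
    db = REDIS_DB_MAP[purpose]
    return f"redis://{REDIS_HOST}:{REDIS_PORT}/{db}"

print(redis_url("rate_limit"))  # redis://localhost:6379/2
```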
Unstract uses Celery with RabbitMQ for asynchronous task processing across 8 specialized workers:
| Worker Type | Queue | Purpose |
|---|---|---|
| worker-file-processing-v2 | file_processing | Parallel document batch processing |
| worker-callback-v2 | file_processing_callback | Batch completion aggregation |
| worker-api-deployment-v2 | api_deployments | API endpoint execution |
| worker-general-v2 | celery | ETL pipelines, webhook notifications |
| worker-notification-v2 | - | Alert delivery |
| worker-log-consumer-v2 | celery_log_task_queue | WebSocket log publishing |
| worker-scheduler-v2 | - | Scheduled ETL tasks |
| celery-beat | - | Periodic task scheduler |
Sources: backend/README.md110-116 backend/sample.env165-167
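The queue assignments above can be sketched as a Celery-style routing table. Only the queue names come from the table; the task names are hypothetical placeholders, and in a real deployment this dict would be assigned to `app.conf.task_routes`:

```python
# Illustrative task -> queue routing; unrouted tasks fall back to "celery".
TASK_ROUTES = {
    "workflow.process_file_batch": {"queue": "file_processing"},
    "workflow.process_batch_callback": {"queue": "file_processing_callback"},
    "deployment.execute_api": {"queue": "api_deployments"},
    "logs.publish_to_websocket": {"queue": "celery_log_task_queue"},
}

def queue_for(task_name: str) -> str:
    """Resolve the queue a task would be sent to, defaulting to 'celery'."""
    return TASK_ROUTES.get(task_name, {"queue": "celery"})["queue"]

assert queue_for("workflow.process_file_batch") == "file_processing"
assert queue_for("notifications.send_webhook") == "celery"
```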
Prompt Studio is the flagship no-code document extraction feature. Users upload documents, define extraction prompts in natural language, and test/iterate on extraction results in an interactive IDE.
Core Components:
- `ToolIde` - Main React container component
- `PromptCard` - Individual prompt definition UI
- `DocumentManager` - PDF viewer and document upload
- `CustomTool` model - Database representation of extraction projects
- `ToolStudioPrompt` model - Individual prompts within a project
- `PromptStudioHelper` - Backend orchestration logic

Workflow:
1. Upload documents via `DocumentManager`
2. Define prompts in `PromptCard` components
3. Execute prompts through `PromptStudioHelper.prompt_responder()`
4. Review results in the `CombinedOutput` view
5. Export the project as a tool via `PromptStudioRegistryHelper`

Sources: tools/structure/src/config/properties.json1-42 README.md54-56
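The iterate-and-combine loop at the heart of Prompt Studio can be sketched as follows. The names `run_prompt` and `combined_output` are illustrative stand-ins, not the actual `PromptStudioHelper` API:

```python
# Each prompt in a project is run against a document; per-prompt answers are
# merged into one combined output keyed by prompt key.
prompts = [
    {"key": "vendor_name", "prompt": "Who issued this invoice?"},
    {"key": "total", "prompt": "What is the total amount due?"},
]

def run_prompt(prompt: dict, document_text: str) -> str:
    # Stand-in for the LLM call made by the prompt service.
    return f"answer for {prompt['key']}"

def combined_output(prompts: list, document_text: str) -> dict:
    return {p["key"]: run_prompt(p, document_text) for p in prompts}

result = combined_output(prompts, "sample document text")
assert set(result) == {"vendor_name", "total"}
```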
API Deployment exposes Prompt Studio tools and workflows as authenticated REST API endpoints.
Key Components:
- `APIDeployment` model - Deployment configuration
- `DeploymentHelper.execute_workflow()` - Execution orchestration
- `RateLimiter` - Dual-layer rate limiting (per-org and global)
- `ExecutionCache` / `ResultCache` - Redis-based result caching with 24hr TTL

Request Flow:
1. Client calls /api/v1/deployment/{id}/execute with a document
2. Rate limits are checked against API_DEPLOYMENT_DEFAULT_RATE_LIMIT (per-org) and API_DEPLOYMENT_GLOBAL_RATE_LIMIT
3. Results are cached for EXECUTION_RESULT_TTL_SECONDS (86400s default)

Sources: backend/sample.env118-131 backend/sample.env206-211 README.md58-60
ETL pipelines are automated document processing pipelines that pull files from source storage, process them through workflows, and load results into destination databases.
Key Components:
- `WorkflowHelper` - Pipeline orchestration
- `SourceConnector` - File discovery and deduplication via FileHistory
- `DestinationConnector` - Output routing to databases, filesystems, or APIs
- `process_file_batch` Celery task - Parallel file processing
- `FileExecutionStatusTracker` - Redis-based execution state tracking

Configuration:
- `MAX_PARALLEL_FILE_BATCHES` - Controls parallel processing (default: 1)
- `MIN_SCHEDULE_INTERVAL_SECONDS` - Minimum ETL schedule frequency (default: 1800s / 30min)
- `MAX_FILE_EXECUTION_COUNT` - Retry limit before permanent skip (default: 3)

Sources: backend/sample.env213-237 README.md62
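How a file list might be split into parallel batches under `MAX_PARALLEL_FILE_BATCHES` can be sketched as below. The round-robin chunking strategy is an assumption for illustration; only the variable name and its default come from the configuration above:

```python
import os

# Default of 1 means sequential processing unless the env var raises it.
MAX_PARALLEL_FILE_BATCHES = int(os.environ.get("MAX_PARALLEL_FILE_BATCHES", "1"))

def split_into_batches(files: list, num_batches: int) -> list:
    """Round-robin files into at most num_batches non-empty batches."""
    num_batches = max(1, min(num_batches, len(files) or 1))
    batches = [[] for _ in range(num_batches)]
    for i, f in enumerate(files):
        batches[i % num_batches].append(f)
    return batches

batches = split_into_batches(["a.pdf", "b.pdf", "c.pdf", "d.pdf", "e.pdf"], 2)
assert batches == [["a.pdf", "c.pdf", "e.pdf"], ["b.pdf", "d.pdf"]]
```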
| Category | Supported Formats |
|---|---|
| Documents | PDF, DOCX, DOC, ODT, TXT, CSV, JSON |
| Spreadsheets | XLSX, XLS, ODS |
| Presentations | PPTX, PPT, ODP |
| Images | PNG, JPG, JPEG, TIFF, BMP, GIF, WEBP |
Sources: README.md163-171
Unstract uses an adapter architecture for LLM integration.
Sources: README.md173-182
Sources: README.md184-190
Configuration: ADAPTER_LLMW_POLL_INTERVAL, ADAPTER_LLMW_MAX_POLLS, ADAPTER_LLMW_STATUS_RETRIES
Sources: README.md192-198 backend/sample.env145-152
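The poll-until-done pattern implied by `ADAPTER_LLMW_POLL_INTERVAL` and `ADAPTER_LLMW_MAX_POLLS` can be sketched as follows. `get_status` and the status strings are illustrative stand-ins for the real adapter's status call:

```python
import time

def wait_for_extraction(get_status, poll_interval_s: float, max_polls: int) -> str:
    """Poll a status callable at a fixed interval, giving up after max_polls."""
    for _ in range(max_polls):
        status = get_status()
        if status in ("processed", "failed"):
            return status
        time.sleep(poll_interval_s)
    return "timeout"

# Fake status source: 'processing' twice, then 'processed'.
statuses = iter(["processing", "processing", "processed"])
assert wait_for_extraction(lambda: next(statuses), poll_interval_s=0, max_polls=5) == "processed"
```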
Workflows pull documents from a range of configurable source connectors and deliver results to configurable destination connectors.
Sources: README.md199-206
Key Processing Steps:
1. `SourceConnector` lists files from configured sources, checks the FileHistory cache for deduplication, and stores new documents in MinIO
2. Documents go through prompt-based (via `PromptStudioHelper`) or tool-based (via `ToolSandboxHelper`) extraction
3. Results are routed to their targets by `DestinationConnector`

Sources: backend/sample.env183-204 backend/sample.env247-251
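The FileHistory deduplication step can be sketched as a content-hash check; the real implementation persists history in the database, while a plain set is used here for illustration:

```python
import hashlib

# Digests of files already processed; stands in for the FileHistory store.
seen_hashes: set[str] = set()

def should_process(content: bytes) -> bool:
    """Return True for new content, False for a previously seen duplicate."""
    digest = hashlib.sha256(content).hexdigest()
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True

assert should_process(b"invoice body")      # new file: process
assert not should_process(b"invoice body")  # duplicate: skip
assert should_process(b"different invoice")
```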
Key environment variables control platform behavior:
| Variable | Default | Purpose |
|---|---|---|
| DJANGO_SETTINGS_MODULE | backend.settings.dev | Django configuration module |
| ENCRYPTION_KEY | Required | Encrypts adapter credentials (32-byte base64) |
| SESSION_COOKIE_AGE | 86400 | User session expiration (seconds) |
| CACHE_TTL_SEC | 10800 | General cache TTL (3 hours) |
| STRUCTURE_TOOL_IMAGE_URL | docker:unstract/tool-structure:0.0.97 | Container image for Prompt Studio tools |
Critical: The ENCRYPTION_KEY must be backed up. Losing this key makes existing adapter credentials inaccessible.
Sources: backend/sample.env1-116 README.md134-140
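One way to generate a key in the 32-byte base64 format described above; this matches the common Fernet key layout, so confirm against Unstract's deployment docs before relying on it:

```python
import base64
import os

# 32 random bytes, URL-safe base64 encoded (44-character string).
key = base64.urlsafe_b64encode(os.urandom(32)).decode()

# Round-trips back to exactly 32 bytes.
assert len(base64.urlsafe_b64decode(key)) == 32
```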
Unstract can be deployed via:
- `./run-platform.sh` to pull and start all services
- `./dev-env-cli.sh` for development environment setup
- `./run-platform.sh -b` to build images locally before starting

Default credentials after deployment:
- URL: http://frontend.unstract.localhost
- Username: unstract
- Password: unstract

These can be customized via DEFAULT_AUTH_USERNAME and DEFAULT_AUTH_PASSWORD environment variables.
Sources: README.md77-91 README.md93-132 backend/README.md74-102
The platform maintains a registry of available tools in JSON format at TOOL_REGISTRY_CONFIG_PATH:
| Tool | Function | Version |
|---|---|---|
| StructureTool | Prompt Studio exported tools | 0.0.97 |
| Classifier | Document classification | 0.0.76 |
| TextExtractor | Text extraction | 0.0.72 |
Tools are defined with properties.json schemas specifying adapters (LLM, Vector DB, Text Extractor), IO compatibility, and restrictions.
Sources: tools/structure/src/config/properties.json1-42 tools/classifier/src/config/properties.json1-71 tools/text_extractor/src/config/properties.json1-53 unstract/tool-registry/tool_registry_config/public_tools.json1-198 backend/sample.env175-177
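Reading the registry from `TOOL_REGISTRY_CONFIG_PATH` can be sketched as below. The registry structure shown (a flat list of tools with name and version) is simplified for illustration; the real properties.json schema carries adapters, IO compatibility, and restrictions as well:

```python
import json
import os
import tempfile

# Write a simplified registry file to stand in for the real one.
registry_json = json.dumps([
    {"name": "StructureTool", "version": "0.0.97"},
    {"name": "Classifier", "version": "0.0.76"},
])
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    f.write(registry_json)
    path = f.name

# Resolve the registry path from the environment, as the platform does.
os.environ["TOOL_REGISTRY_CONFIG_PATH"] = path
with open(os.environ["TOOL_REGISTRY_CONFIG_PATH"]) as fh:
    tools = json.load(fh)

assert {t["name"] for t in tools} == {"StructureTool", "Classifier"}
```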