This document provides a high-level introduction to the Unstract platform, covering its purpose, architecture, and primary capabilities. Unstract is a document-to-structured-data extraction system that leverages Large Language Models (LLMs) to transform unstructured documents into structured JSON output.
Sources: README.md32-47 docs/ARCHITECTURE.md
Unstract uses LLMs to extract structured JSON from documents including PDFs, images, scans, and other unstructured file formats. Users define extraction schemas using natural language prompts rather than writing code or templates. Deployment options are described later in this overview.
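The schema-as-prompts idea can be sketched as follows; the field names, prompts, and output values here are hypothetical, for illustration only:

```python
# Hypothetical example: an extraction schema is a set of natural-language
# prompts, one per output field, rather than code or templates.
invoice_schema = {
    "invoice_number": "What is the invoice number?",
    "total_amount": "What is the grand total payable, as a number?",
    "due_date": "When is the payment due? Answer in ISO 8601 format.",
}

# The platform returns structured JSON keyed by those field names, e.g.:
example_output = {
    "invoice_number": "INV-2024-0042",
    "total_amount": 1234.50,
    "due_date": "2024-07-15",
}

# Every schema field appears in the structured output.
assert set(example_output) == set(invoice_schema)
```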
The system is built for teams in finance, insurance, healthcare, KYC/compliance, and other domains requiring high-volume document data extraction.
Sources: README.md34-36 README.md52-66
Sources: backend/sample.env11-110 README.md141-160
| Service | Technology | Port | Primary Responsibility |
|---|---|---|---|
| Backend | Django 4.2.1 + DRF | 8000 | Central orchestration, REST API, workflow management, authentication |
| Platform Service | FastAPI | 3001 | Platform operations, user management, organization settings |
| Prompt Service | Flask | 3003 | Prompt execution, LLM integration, RAG retrieval |
| X2Text Service | - | 3004 | Document text extraction, OCR processing |
| Runner Service | - | 5002 | Docker container orchestration for tool execution |
| Frontend | React + Vite | - | Web UI, Prompt Studio IDE, workflow designer |
Sources: backend/pyproject.toml1-8 prompt-service/pyproject.toml1-7 backend/sample.env78-110
| Component | Technology | Purpose |
|---|---|---|
| PostgreSQL | PostgreSQL + pgvector | Primary relational database, vector embeddings |
| Redis | Redis 6+ | Multiple databases: cache, session storage, rate limiting, execution tracking |
| RabbitMQ | RabbitMQ | Message queue for Celery task distribution |
| MinIO | MinIO (S3-compatible) | Object storage for documents, tool outputs, execution artifacts |
| Qdrant | Qdrant | Vector database for document embeddings and RAG |
Sources: backend/sample.env17-31 backend/sample.env186-198
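The "multiple databases" in the Redis row refers to Redis's numbered logical databases. A minimal sketch of assigning one logical database per concern; the purpose-to-index mapping here is an assumption for illustration, not Unstract's actual assignment:

```python
REDIS_HOST = "localhost"
REDIS_PORT = 6379

# Illustrative mapping of concern -> Redis logical database index.
REDIS_DB_MAP = {
    "cache": 0,
    "sessions": 1,
    "rate_limit": 2,
    "execution_tracking": 3,
}

def redis_url(purpose: str) -> str:
    """Build a redis:// URL targeting the logical database for `purpose`."""
    db = REDIS_DB_MAP[purpose]
    return f"redis://{REDIS_HOST}:{REDIS_PORT}/{db}"

print(redis_url("rate_limit"))  # redis://localhost:6379/2
```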
Unstract uses Celery with RabbitMQ for asynchronous task processing across 8 specialized workers:
| Worker Type | Queue | Purpose |
|---|---|---|
| worker-file-processing-v2 | file_processing | Parallel document batch processing |
| worker-callback-v2 | file_processing_callback | Batch completion aggregation |
| worker-api-deployment-v2 | api_deployments | API endpoint execution |
| worker-general-v2 | celery | ETL pipelines, webhook notifications |
| worker-notification-v2 | - | Alert delivery |
| worker-log-consumer-v2 | celery_log_task_queue | WebSocket log publishing |
| worker-scheduler-v2 | - | Scheduled ETL tasks |
| celery-beat | - | Periodic task scheduler |
Sources: backend/README.md110-116 backend/sample.env165-167
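The queue assignments above can be sketched as a Celery-style routing table. Only the queue names come from the table; the task names are hypothetical placeholders, and in a real deployment this dict would be assigned to `app.conf.task_routes`:

```python
# Illustrative task -> queue routing; unrouted tasks fall back to "celery".
TASK_ROUTES = {
    "workflow.process_file_batch": {"queue": "file_processing"},
    "workflow.process_batch_callback": {"queue": "file_processing_callback"},
    "deployment.execute_api": {"queue": "api_deployments"},
    "logs.publish_to_websocket": {"queue": "celery_log_task_queue"},
}

def queue_for(task_name: str) -> str:
    """Resolve the queue a task would be sent to, defaulting to 'celery'."""
    return TASK_ROUTES.get(task_name, {"queue": "celery"})["queue"]

assert queue_for("workflow.process_file_batch") == "file_processing"
assert queue_for("notifications.send_webhook") == "celery"
```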
Prompt Studio is the flagship no-code document extraction feature. Users upload documents, define extraction prompts in natural language, and test/iterate on extraction results in an interactive IDE.
Core Components:
- `ToolIde` - Main React container component
- `PromptCard` - Individual prompt definition UI
- `DocumentManager` - PDF viewer and document upload
- `CustomTool` model - Database representation of extraction projects
- `ToolStudioPrompt` model - Individual prompts within a project
- `PromptStudioHelper` - Backend orchestration logic

Workflow:
1. Upload documents via `DocumentManager`
2. Define prompts in `PromptCard` components
3. Execute prompts through `PromptStudioHelper.prompt_responder()`
4. Review results in the `CombinedOutput` view
5. Export the project as a tool via `PromptStudioRegistryHelper`

Sources: tools/structure/src/config/properties.json1-42 README.md54-56
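The iterate-and-combine loop at the heart of Prompt Studio can be sketched as follows. The names `run_prompt` and `combined_output` are illustrative stand-ins, not the actual `PromptStudioHelper` API:

```python
# Each prompt in a project is run against a document; per-prompt answers are
# merged into one combined output keyed by prompt key.
prompts = [
    {"key": "vendor_name", "prompt": "Who issued this invoice?"},
    {"key": "total", "prompt": "What is the total amount due?"},
]

def run_prompt(prompt: dict, document_text: str) -> str:
    # Stand-in for the LLM call made by the prompt service.
    return f"answer for {prompt['key']}"

def combined_output(prompts: list, document_text: str) -> dict:
    return {p["key"]: run_prompt(p, document_text) for p in prompts}

result = combined_output(prompts, "sample document text")
assert set(result) == {"vendor_name", "total"}
```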
API Deployment exposes Prompt Studio tools and workflows as authenticated REST API endpoints.
Key Components:
- `APIDeployment` model - Deployment configuration
- `DeploymentHelper.execute_workflow()` - Execution orchestration
- `RateLimiter` - Dual-layer rate limiting (per-org and global)
- `ExecutionCache` / `ResultCache` - Redis-based result caching with 24hr TTL

Request Flow:
1. Client calls /api/v1/deployment/{id}/execute with a document
2. Rate limits are checked against API_DEPLOYMENT_DEFAULT_RATE_LIMIT (per-org) and API_DEPLOYMENT_GLOBAL_RATE_LIMIT
3. Results are cached for EXECUTION_RESULT_TTL_SECONDS (86400s default)

Sources: backend/sample.env118-131 backend/sample.env206-211 README.md58-60
ETL pipelines are automated document processing pipelines that pull files from source storage, process them through workflows, and load results into destination databases.
Key Components:
- `WorkflowHelper` - Pipeline orchestration
- `SourceConnector` - File discovery and deduplication via FileHistory
- `DestinationConnector` - Output routing to databases, filesystems, or APIs
- `process_file_batch` Celery task - Parallel file processing
- `FileExecutionStatusTracker` - Redis-based execution state tracking

Configuration:
- `MAX_PARALLEL_FILE_BATCHES` - Controls parallel processing (default: 1)
- `MIN_SCHEDULE_INTERVAL_SECONDS` - Minimum ETL schedule frequency (default: 1800s / 30min)
- `MAX_FILE_EXECUTION_COUNT` - Retry limit before permanent skip (default: 3)

Sources: backend/sample.env213-237 README.md62
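How a file list might be split into parallel batches under `MAX_PARALLEL_FILE_BATCHES` can be sketched as below. The round-robin chunking strategy is an assumption for illustration; only the variable name and its default come from the configuration above:

```python
import os

# Default of 1 means sequential processing unless the env var raises it.
MAX_PARALLEL_FILE_BATCHES = int(os.environ.get("MAX_PARALLEL_FILE_BATCHES", "1"))

def split_into_batches(files: list, num_batches: int) -> list:
    """Round-robin files into at most num_batches non-empty batches."""
    num_batches = max(1, min(num_batches, len(files) or 1))
    batches = [[] for _ in range(num_batches)]
    for i, f in enumerate(files):
        batches[i % num_batches].append(f)
    return batches

batches = split_into_batches(["a.pdf", "b.pdf", "c.pdf", "d.pdf", "e.pdf"], 2)
assert batches == [["a.pdf", "c.pdf", "e.pdf"], ["b.pdf", "d.pdf"]]
```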
| Category | Supported Formats |
|---|---|
| Documents | PDF, DOCX, DOC, ODT, TXT, CSV, JSON |
| Spreadsheets | XLSX, XLS, ODS |
| Presentations | PPTX, PPT, ODP |
| Images | PNG, JPG, JPEG, TIFF, BMP, GIF, WEBP |
Sources: README.md163-171
Unstract uses an adapter architecture for LLM integration.
Sources: README.md173-182
Sources: README.md184-190
Configuration: ADAPTER_LLMW_POLL_INTERVAL, ADAPTER_LLMW_MAX_POLLS, ADAPTER_LLMW_STATUS_RETRIES
Sources: README.md192-198 backend/sample.env145-152
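The poll-until-done pattern implied by `ADAPTER_LLMW_POLL_INTERVAL` and `ADAPTER_LLMW_MAX_POLLS` can be sketched as follows. `get_status` and the status strings are illustrative stand-ins for the real adapter's status call:

```python
import time

def wait_for_extraction(get_status, poll_interval_s: float, max_polls: int) -> str:
    """Poll a status callable at a fixed interval, giving up after max_polls."""
    for _ in range(max_polls):
        status = get_status()
        if status in ("processed", "failed"):
            return status
        time.sleep(poll_interval_s)
    return "timeout"

# Fake status source: 'processing' twice, then 'processed'.
statuses = iter(["processing", "processing", "processed"])
assert wait_for_extraction(lambda: next(statuses), poll_interval_s=0, max_polls=5) == "processed"
```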
Workflows pull documents from a range of configurable source connectors and deliver results to configurable destination connectors.
Sources: README.md199-206
Key Processing Steps:
1. `SourceConnector` lists files from configured sources, checks the FileHistory cache for deduplication, and stores new documents in MinIO
2. Documents go through prompt-based (via `PromptStudioHelper`) or tool-based (via `ToolSandboxHelper`) extraction
3. Results are routed to their targets by `DestinationConnector`

Sources: backend/sample.env183-204 backend/sample.env247-251
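The FileHistory deduplication step can be sketched as a content-hash check; the real implementation persists history in the database, while a plain set is used here for illustration:

```python
import hashlib

# Digests of files already processed; stands in for the FileHistory store.
seen_hashes: set[str] = set()

def should_process(content: bytes) -> bool:
    """Return True for new content, False for a previously seen duplicate."""
    digest = hashlib.sha256(content).hexdigest()
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True

assert should_process(b"invoice body")      # new file: process
assert not should_process(b"invoice body")  # duplicate: skip
assert should_process(b"different invoice")
```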
Key environment variables control platform behavior:
| Variable | Default | Purpose |
|---|---|---|
| DJANGO_SETTINGS_MODULE | backend.settings.dev | Django configuration module |
| ENCRYPTION_KEY | Required | Encrypts adapter credentials (32-byte base64) |
| SESSION_COOKIE_AGE | 86400 | User session expiration (seconds) |
| CACHE_TTL_SEC | 10800 | General cache TTL (3 hours) |
| STRUCTURE_TOOL_IMAGE_URL | docker:unstract/tool-structure:0.0.97 | Container image for Prompt Studio tools |
Critical: The ENCRYPTION_KEY must be backed up. Losing this key makes existing adapter credentials inaccessible.
Sources: backend/sample.env1-116 README.md134-140
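One way to generate a key in the 32-byte base64 format described above; this matches the common Fernet key layout, so confirm against Unstract's deployment docs before relying on it:

```python
import base64
import os

# 32 random bytes, URL-safe base64 encoded (44-character string).
key = base64.urlsafe_b64encode(os.urandom(32)).decode()

# Round-trips back to exactly 32 bytes.
assert len(base64.urlsafe_b64decode(key)) == 32
```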
Unstract can be deployed via:
- `./run-platform.sh` to pull and start all services
- `./dev-env-cli.sh` for development environment setup
- `./run-platform.sh -b` to build images locally before starting

Default credentials after deployment:
- URL: http://frontend.unstract.localhost
- Username: unstract
- Password: unstract

These can be customized via DEFAULT_AUTH_USERNAME and DEFAULT_AUTH_PASSWORD environment variables.
Sources: README.md77-91 README.md93-132 backend/README.md74-102
The platform maintains a registry of available tools in JSON format at TOOL_REGISTRY_CONFIG_PATH:
| Tool | Function | Version |
|---|---|---|
| StructureTool | Prompt Studio exported tools | 0.0.97 |
| Classifier | Document classification | 0.0.76 |
| TextExtractor | Text extraction | 0.0.72 |
Tools are defined with properties.json schemas specifying adapters (LLM, Vector DB, Text Extractor), IO compatibility, and restrictions.
Sources: tools/structure/src/config/properties.json1-42 tools/classifier/src/config/properties.json1-71 tools/text_extractor/src/config/properties.json1-53 unstract/tool-registry/tool_registry_config/public_tools.json1-198 backend/sample.env175-177
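Reading the registry from `TOOL_REGISTRY_CONFIG_PATH` can be sketched as below. The registry structure shown (a flat list of tools with name and version) is simplified for illustration; the real properties.json schema carries adapters, IO compatibility, and restrictions as well:

```python
import json
import os
import tempfile

# Write a simplified registry file to stand in for the real one.
registry_json = json.dumps([
    {"name": "StructureTool", "version": "0.0.97"},
    {"name": "Classifier", "version": "0.0.76"},
])
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    f.write(registry_json)
    path = f.name

# Resolve the registry path from the environment, as the platform does.
os.environ["TOOL_REGISTRY_CONFIG_PATH"] = path
with open(os.environ["TOOL_REGISTRY_CONFIG_PATH"]) as fh:
    tools = json.load(fh)

assert {t["name"] for t in tools} == {"StructureTool", "Classifier"}
```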