Production AI Engineering — turning unstructured documents into reliable, traceable, operational data at scale.
This project implements a modular, AI-driven document processing pipeline that transforms unstructured text into standardized tabular outputs ready for operational and analytical consumption.
Designed for high-volume document scenarios where raw text needs to be:
- Ingested from a persistent source
- Enriched before inference to improve LLM accuracy
- Processed by language models with automatic provider fallback
- Normalized to a stable data contract
- Persisted in a relational database
- Continuously evaluated with accuracy metrics
The core problem this solves: how to go from chaotic text to reliable, traceable, operational data — without manual intervention on each item.
The pipeline follows a modular staged architecture with clear separation between ingestion, enrichment, inference, normalization, persistence, and evaluation.
```mermaid
flowchart TD
    A[Reprocessing Trigger] --> B[Data Ingestion]
    B --> C[Pre-processing & Enrichment]
    C --> D[LLM Inference]
    D --> E[Key & Type Normalization]
    E --> F[Result Persistence]
    F --> G[Quality Evaluation]
    G --> H[Report & Metrics]
    D --> I[Fallback to Alternative Provider]
    I --> E
```
- Orchestrator initializes execution, times each stage, and controls interruption on critical failure
- Ingestion downloads and materializes input data for batch processing
- Enrichment replaces low-quality content with a better version before inference — reducing noise and improving model precision
- Inference sends batches to an LLM with automatic fallback between providers on failure, quota limits, or unavailability
- Normalization applies semantic key resolution, synonym mapping, and heuristics to match the output contract
- Persistence stores structured results in PostgreSQL
- Evaluation measures accuracy metrics per batch and generates a quality report
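The stage-by-stage orchestration above can be sketched roughly as follows. This is a minimal illustration, not the project's actual API: the `(name, callable)` stage structure and the print-based logging are assumptions.

```python
import time


def run_pipeline(stages):
    """Run named stages in order, timing each one and stopping on the
    first critical failure. `stages` is a list of (name, callable) pairs
    — a hypothetical structure for illustration only."""
    timings = {}
    for name, stage in stages:
        start = time.perf_counter()
        try:
            stage()
        except Exception as exc:
            timings[name] = time.perf_counter() - start
            print(f"[{name}] failed after {timings[name]:.3f}s: {exc}")
            return 1, timings  # nonzero return code interrupts the run
        timings[name] = time.perf_counter() - start
        print(f"[{name}] ok in {timings[name]:.3f}s")
    return 0, timings
```

The orchestrator returns a process-style code plus per-stage timings, which is what makes runs predictable to automate and debug.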
The pipeline implements automatic fallback between LLM providers, with explicit error handling and dynamic strategy switching — preventing a single provider's downtime from halting the entire process.
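A minimal sketch of that fallback loop, assuming each provider is wrapped in a callable that raises on failure, quota exhaustion, or unavailability (the provider names and wrapper shape are illustrative, not the project's real client code):

```python
def infer_with_fallback(prompt, providers):
    """Try each provider in order; return (provider_name, response)
    from the first one that succeeds."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # quota, timeout, outage, etc.
            errors.append((name, repr(exc)))
    raise RuntimeError(f"all providers failed: {errors}")
```

Because the strategy switch happens inside the inference stage, the rest of the pipeline never needs to know which provider actually answered.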
LLMs don't always return labels that exactly match the internal contract. A semantic normalization layer with synonyms, heuristic fallback, and human-label resolution ensures consistency before writing the final key.
Instead of blindly sending raw text, the pipeline runs an enrichment step that substitutes content when a higher-quality version is available — significantly improving extraction accuracy.
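A toy version of that substitution step, assuming a simple length threshold as the "low quality" signal and a lookup of richer sources by item id (both assumptions; the real quality criterion is not specified here):

```python
def enrich(items, better_sources, min_length=50):
    """Replace low-quality text with a richer version before inference.

    `items` are dicts with `id` and `text`; `better_sources` maps id to
    a higher-quality text when one is available. Illustrative only."""
    enriched = []
    for item in items:
        text = item["text"]
        replacement = better_sources.get(item["id"])
        if replacement and len(text) < min_length:
            text = replacement  # swap in the better version
        enriched.append({**item, "text": text})
    return enriched
```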
The orchestrator measures time per stage, standardizes logs, propagates return codes, and terminates predictably when needed — enabling reliable debugging and maintenance.
Field configuration, semantic mapping, and type definitions are separated from business logic, reducing coupling between stages and making the schema easier to evolve and validate.
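One way to picture that separation: the contract lives in a plain mapping that any stage can import, while the coercion logic stays generic. Field names and types below are illustrative, not the project's actual schema:

```python
# Field contract kept apart from pipeline logic — hypothetical fields.
FIELD_TYPES = {
    "invoice_id": str,
    "total": float,
    "issued_at": str,
}


def coerce_row(row):
    """Validate and coerce one extracted record against the contract."""
    out = {}
    for field, typ in FIELD_TYPES.items():
        if field not in row:
            raise KeyError(f"missing contract field: {field}")
        out[field] = typ(row[field])
    return out
```

Evolving the schema then means editing `FIELD_TYPES`, not hunting through stage code.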
| Layer | Technology | Why |
|---|---|---|
| Core language | Python 3.11+ | Mature ecosystem for automation, data IO, and AI integration |
| Data processing | Pandas + OpenPyXL | Batch operations, column validation, Excel IO for legacy compatibility |
| AI inference | OpenAI + Gemini | Dual-provider strategy for operational resilience and fallback |
| Persistence | PostgreSQL + SQLAlchemy + psycopg2 | Reliable relational storage with fine-grained control |
| Configuration | dotenv | Environment-based config, no hardcoded credentials |
Central orchestrator over standalone scripts — a single pipeline with return codes and time measurement improves predictability, automation, and observability.
Enrichment before inference, not just post-correction — model output quality depends heavily on input quality. Pre-processing with intelligent content substitution reduces ambiguity upstream.
Explicit normalization instead of trusting LLM literals — semantic synonym layers decouple prompt design from persistence, making the system tolerant to natural LLM variability.
Type contract separated from logic — isolating field definitions and type mappings makes the project easier to evolve, validate, and version without cascading changes.
Persistence decoupled from inference — an intermediate artifact is generated before final insertion, creating an inspection and recovery layer for troubleshooting and selective reprocessing.
- ~70% reduction in manual document reading and structuring time
- ~60% reduction in human rework through semantic normalization
- Increased operational stability via automatic LLM provider fallback
- Improved output reliability with explicit field contract and type enforcement
- Full execution traceability with per-stage metrics and integrated evaluation
- Replace spreadsheet-based flow with queues — for larger scale, migrate to event-driven storage and versioned artifacts instead of Excel intermediaries
- Add semantic regression tests — a test suite for prompts, synonyms, and normalization to catch prompt changes that silently degrade critical fields
- Richer observability — per-item tracing, per-batch metrics, fallback rate, cost-per-inference, and near-real-time quality dashboards
- Formalize output contract with typed models — strong input/output validation (e.g. Pydantic) to reduce inter-stage inconsistency risk
- Hybrid extraction strategy — at scale, combine LLM with deterministic rules, auxiliary classifiers, and domain validations to reduce cost and latency
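The typed-models idea could start as small as a frozen dataclass; a Pydantic model would add richer parsing on top. A stdlib sketch with invented fields:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionResult:
    """Typed output contract — field names are illustrative; a Pydantic
    model would replace this to get automatic coercion and reporting."""
    invoice_id: str
    total: float

    def __post_init__(self):
        if self.total < 0:
            raise ValueError("total must be non-negative")
```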
Marcelo Manara — Software Engineer | AI Systems · Cloud · Python · AWS
