Production AI Pipeline — automatic classification of high-volume unstructured text into reliable, auditable, executive-ready data.
This project is an evolution of a data-driven processing pipeline for reading, interpreting, and automatically classifying large volumes of unstructured text.
Designed for scenarios where operations need to:
- Identify critical events across high volumes of text documents
- Transform natural language into structured, persistent data
- Deliver results in a continuous, auditable, and resilient flow
- Convert operational output into executive-level analytical reports
The technical value lies in four pillars:
- Batch processing with controlled orchestration — avoiding context overflow and bottlenecks in AI calls
- Database-first architecture — reducing coupling with intermediate files and increasing transactional consistency
- Inference layer with observability — latency tracking, token consumption, and full execution trail
- Executive analytics generation — turning operational output into management-level dashboards
This is not just an LLM classification script. It's a production pipeline oriented around reliability, scale, traceability, and analytical reuse.
flowchart TD
A[Scheduler / Manual Trigger] --> B[Pipeline Orchestrator]
B --> C[Read Pending Items from DB]
C --> D[Safe Batch Assembly]
D --> E[Structured Prompt for LLM]
E --> F[Inference Layer]
F --> G[Response Validation & Sanitization]
G --> H[Idempotent Persistence to DB]
H --> I[Technical Logs & Usage Metrics]
H --> J[Consolidated Base for Consumption]
J --> K[Analytical Report — PDF]
F -. fallback / multi-provider .-> L[Alternative AI Provider]
G -. parsing failure .-> M[Controlled Error Flagging]
M --> H
The solution separates responsibilities into four clear blocks:
- Orchestration — controls execution targets, batch size, pauses, and fault tolerance
- Infrastructure — centralizes DB connection, engine caching, and AI provider communication
- Processing rules — fetches pending items, prepares payload, sanitizes response, persists results
- Analytics — transforms consolidated data into a visual tracking report
This design favors incremental evolution, simplifies maintenance, and reduces the cost of future changes.
The pipeline needed to handle recurring input volume without manual item-by-item execution.
How I solved it:
- Execution driven by processing targets per cycle
- Small, safe batches with pauses between cycles
- Clean interruption when the queue is empty
Generative models are excellent at interpreting text but are not naturally reliable as structured write interfaces.
How I solved it:
- Prompt strictly oriented to JSON output
- Defensive extraction of the returned array
- Boolean and text field sanitization
- Controlled error flagging when output is incomplete or invalid
- Idempotent persistence to support safe reprocessing
In real pipelines, cost, latency, and unavailability can halt the entire flow.
How I solved it:
- Inference layer decoupled from business logic
- Multi-provider support with architectural fallback
- Technical logging to compare behavior across models
Most pipelines end when data is written to the database. Here, the goal was to go further — turning operational data into executive-level insight.
How I solved it:
- Analytics module built on Pandas
- Automated PDF dashboard generation
- Visual synthesis of most relevant classifications
- Direct reuse of the consolidated base — no manual rework
| Layer | Technology | Why |
|---|---|---|
| Core language | Python 3.11+ | Speed, mature data ecosystem, AI and DB integration |
| Database | PostgreSQL | Transactional and analytical consistency in the same flow |
| ORM | SQLAlchemy | Connection abstraction, pooling, and safe engine management |
| AI inference | OpenAI + Gemini | Multi-provider for quality, cost, and availability balance |
| Data analytics | Pandas | Tabulation, consolidation, and report preparation |
| Visualization | Matplotlib + Seaborn | Reproducible, web-independent visual report generation |
| Configuration | dotenv | Credential separation, portability across environments |
Database-first over intermediate files — spreadsheet-based pipelines lose traceability and create version divergence. A DB-first approach ensures consistency, auditability, and lower friction in production.
Explicit orchestrator over monolithic script — separating execution control from processing logic improves observability and simplifies error recovery and resumption.
Idempotent persistence over blind inserts — reprocessing is inevitable in AI pipelines. The system accepts repetition without duplicating side effects, making corrections safe and predictable.
Post-LLM sanitization over trusting raw output — generative responses can include noise, unexpected formatting, or partial omissions. An explicit sanitization layer reduces production failures.
Analytics coupled to consolidated base — manual reports create lag and duplicate effort. Automated PDF generation delivers executive insight directly from the same operational pipeline.
- 60–85% reduction in manual reading effort through automatic text classification
- 70%+ reduction in time between data input and structured output availability
- Increased operational reliability with explicit failure handling and safe reprocessing
- Full observability with latency, token usage, and AI call status logging
- Operational data converted to executive intelligence via automated analytical report
-
Add a real async queue — evolve to distributed workers with async queuing for higher throughput and better operational control
-
Typed output contract validation — use typed models and formal schema validation at the inference boundary (e.g. Pydantic)
-
Dedicated observability stack — move logs, metrics, and tracing to a centralized telemetry stack for historical analysis and production troubleshooting
-
Automated prompt regression tests — fixed test cases to detect semantic regressions when switching models, prompts, or heuristics
-
Expose analytics as a web application — a web dashboard with filters and time series would significantly increase the product's value beyond PDF reports
pip install -r requirements.txt
python src/pipeline.pyEnvironment variables, database credentials, and integration details have been intentionally omitted for this public portfolio version.
Marcelo Manara — Software Engineer | AI Systems · Cloud · Python · AWS
