
AI-Powered Critical Publication Analysis Pipeline


Production AI Pipeline — automatic classification of high-volume unstructured text into reliable, auditable, executive-ready data.


Overview

This project is an evolution of a data-driven processing pipeline for reading, interpreting, and automatically classifying large volumes of unstructured text.

Designed for scenarios where operations need to:

  • Identify critical events across high volumes of text documents
  • Transform natural language into structured, persistent data
  • Deliver results in a continuous, auditable, and resilient flow
  • Convert operational output into executive-level analytical reports

The technical value lies in four pillars:

  • Batch processing with controlled orchestration — avoiding context overflow and bottlenecks in AI calls
  • Database-first architecture — reducing coupling with intermediate files and increasing transactional consistency
  • Inference layer with observability — latency tracking, token consumption, and full execution trail
  • Executive analytics generation — turning operational output into management-level dashboards

This is not just an LLM classification script. It's a production pipeline oriented around reliability, scale, traceability, and analytical reuse.


Architecture

flowchart TD
    A[Scheduler / Manual Trigger] --> B[Pipeline Orchestrator]
    B --> C[Read Pending Items from DB]
    C --> D[Safe Batch Assembly]
    D --> E[Structured Prompt for LLM]
    E --> F[Inference Layer]
    F --> G[Response Validation & Sanitization]
    G --> H[Idempotent Persistence to DB]
    H --> I[Technical Logs & Usage Metrics]
    H --> J[Consolidated Base for Consumption]
    J --> K[Analytical Report — PDF]

    F -. fallback / multi-provider .-> L[Alternative AI Provider]
    G -. parsing failure .-> M[Controlled Error Flagging]
    M --> H

Architectural Blocks

The solution separates responsibilities into four clear blocks:

  • Orchestration — controls execution targets, batch size, pauses, and fault tolerance
  • Infrastructure — centralizes DB connection, engine caching, and AI provider communication
  • Processing rules — fetches pending items, prepares payload, sanitizes response, persists results
  • Analytics — transforms consolidated data into a visual tracking report

This design favors incremental evolution, simplifies maintenance, and reduces the cost of future changes.


Key Technical Challenges

1. Processing high volume without losing predictability

The pipeline needed to handle recurring input volume without manual item-by-item execution.

How I solved it:

  • Execution driven by processing targets per cycle
  • Small, safe batches with pauses between cycles
  • Clean interruption when the queue is empty
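The cycle above can be sketched as a small orchestrator loop. This is a minimal illustration, not the project's actual code: `fetch_batch` and `process_batch` are hypothetical callables standing in for the DB read and the LLM classification step.

```python
import time

def run_pipeline(fetch_batch, process_batch, target_items=500,
                 batch_size=20, pause_seconds=2.0):
    """Process up to `target_items` pending rows in small batches,
    pausing between cycles and stopping cleanly when the queue is empty."""
    processed = 0
    while processed < target_items:
        batch = fetch_batch(batch_size)   # read pending items from the DB
        if not batch:                     # queue empty: clean interruption
            break
        process_batch(batch)              # prompt, sanitize, persist
        processed += len(batch)
        time.sleep(pause_seconds)         # throttle to protect the AI provider
    return processed
```

The per-cycle target keeps each run bounded and predictable, while the pause acts as a crude rate limiter between AI calls.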

2. Making LLM output reliable for database persistence

Generative models are excellent at interpreting text but are not naturally reliable as structured write interfaces.

How I solved it:

  • Prompt strictly oriented to JSON output
  • Defensive extraction of the returned array
  • Boolean and text field sanitization
  • Controlled error flagging when output is incomplete or invalid
  • Idempotent persistence to support safe reprocessing
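The defensive-extraction and sanitization steps can look roughly like this. The field names (`is_critical`, `summary`) are illustrative, not the project's real schema:

```python
import json
import re

def extract_records(raw_response: str):
    """Defensively pull the first JSON array out of an LLM response,
    tolerating surrounding prose or markdown code fences.
    Returns None so the caller can flag a controlled parsing failure."""
    match = re.search(r"\[.*\]", raw_response, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

def sanitize_record(record: dict) -> dict:
    """Normalize boolean and text fields before persistence."""
    raw = record.get("is_critical", False)
    is_critical = raw if isinstance(raw, bool) \
        else str(raw).strip().lower() in {"true", "yes", "1"}
    return {
        "is_critical": is_critical,
        "summary": str(record.get("summary", "")).strip(),
    }
```

Anything that fails extraction is never silently dropped: it is flagged and persisted as an error row, so idempotent reprocessing can pick it up later.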

3. Avoiding over-reliance on a single AI provider

In real pipelines, cost, latency, and unavailability can halt the entire flow.

How I solved it:

  • Inference layer decoupled from business logic
  • Multi-provider support with architectural fallback
  • Technical logging to compare behavior across models
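A decoupled inference layer with fallback can be sketched as below. The provider callables are placeholders for the real OpenAI and Gemini clients:

```python
import time

class InferenceLayer:
    """Try providers in order, logging latency and status per call,
    and fall back to the next provider on failure."""

    def __init__(self, providers):
        # providers: list of (name, callable) pairs,
        # e.g. [("openai", call_openai), ("gemini", call_gemini)]
        self.providers = providers
        self.log = []

    def classify(self, prompt: str) -> str:
        last_error = None
        for name, call in self.providers:
            start = time.perf_counter()
            try:
                response = call(prompt)
                self.log.append({"provider": name, "status": "ok",
                                 "latency_s": time.perf_counter() - start})
                return response
            except Exception as exc:
                self.log.append({"provider": name, "status": "error",
                                 "latency_s": time.perf_counter() - start})
                last_error = exc
        raise RuntimeError("all providers failed") from last_error
```

Because business logic only sees `classify()`, swapping or reordering providers never touches the processing rules.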

4. Connecting operations to management

Most pipelines end when data is written to the database. Here, the goal was to go further — turning operational data into executive-level insight.

How I solved it:

  • Analytics module built on Pandas
  • Automated PDF dashboard generation
  • Visual synthesis of most relevant classifications
  • Direct reuse of the consolidated base — no manual rework
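A minimal version of the report step, assuming the consolidated base exposes a `classification` column (an illustrative name), could use Matplotlib's PDF backend directly:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless, web-independent rendering
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

def build_report(df: pd.DataFrame, path: str = "report.pdf") -> None:
    """Render a one-page classification summary from the consolidated base."""
    counts = df["classification"].value_counts()
    with PdfPages(path) as pdf:
        fig, ax = plt.subplots(figsize=(8, 5))
        counts.plot(kind="barh", ax=ax)
        ax.set_title("Documents per classification")
        ax.set_xlabel("Count")
        pdf.savefig(fig)
        plt.close(fig)
```

Because the report reads straight from the same consolidated table the pipeline writes, there is no manual export step to drift out of sync.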

Tech Stack

| Layer | Technology | Why |
| --- | --- | --- |
| Core language | Python 3.11+ | Speed, mature data ecosystem, AI and DB integration |
| Database | PostgreSQL | Transactional and analytical consistency in the same flow |
| ORM | SQLAlchemy | Connection abstraction, pooling, and safe engine management |
| AI inference | OpenAI + Gemini | Multi-provider balance of quality, cost, and availability |
| Data analytics | Pandas | Tabulation, consolidation, and report preparation |
| Visualization | Matplotlib + Seaborn | Reproducible, web-independent visual report generation |
| Configuration | dotenv | Credential separation, portability across environments |

Architecture Decisions

Database-first over intermediate files — spreadsheet-based pipelines lose traceability and create version divergence. A DB-first approach ensures consistency, auditability, and lower friction in production.

Explicit orchestrator over monolithic script — separating execution control from processing logic improves observability and simplifies error recovery and resumption.

Idempotent persistence over blind inserts — reprocessing is inevitable in AI pipelines. The system accepts repetition without duplicating side effects, making corrections safe and predictable.
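With PostgreSQL and SQLAlchemy, idempotent writes can be expressed as an upsert. The table schema below is illustrative only, keyed on a hypothetical `document_id`:

```python
from sqlalchemy import Table, Column, Integer, String, MetaData
from sqlalchemy.dialects.postgresql import insert

metadata = MetaData()
# Illustrative schema: one row per classified document.
results = Table(
    "classification_results", metadata,
    Column("document_id", Integer, primary_key=True),
    Column("classification", String),
)

def upsert_stmt(records):
    """Build INSERT ... ON CONFLICT DO UPDATE so reprocessing the same
    document overwrites its row instead of duplicating it."""
    stmt = insert(results).values(records)
    return stmt.on_conflict_do_update(
        index_elements=["document_id"],
        set_={"classification": stmt.excluded.classification},
    )
```

Executing the same statement twice converges to the same database state, which is exactly what makes corrective reruns safe.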

Post-LLM sanitization over trusting raw output — generative responses can include noise, unexpected formatting, or partial omissions. An explicit sanitization layer reduces production failures.

Analytics coupled to consolidated base — manual reports create lag and duplicate effort. Automated PDF generation delivers executive insight directly from the same operational pipeline.


Results

  • 60–85% reduction in manual reading effort through automatic text classification
  • 70%+ reduction in time between data input and structured output availability
  • Increased operational reliability with explicit failure handling and safe reprocessing
  • Full observability with latency, token usage, and AI call status logging
  • Operational data converted to executive intelligence via automated analytical report

What I'd Do Differently

  1. Add a real async queue — evolve to distributed workers with async queuing for higher throughput and better operational control

  2. Typed output contract validation — use typed models and formal schema validation at the inference boundary (e.g. Pydantic)

  3. Dedicated observability stack — move logs, metrics, and tracing to a centralized telemetry stack for historical analysis and production troubleshooting

  4. Automated prompt regression tests — fixed test cases to detect semantic regressions when switching models, prompts, or heuristics

  5. Expose analytics as a web application — a web dashboard with filters and time series would significantly increase the product's value beyond PDF reports


Running Locally

pip install -r requirements.txt
python src/pipeline.py

Environment variables, database credentials, and integration details have been intentionally omitted for this public portfolio version.


Author

Marcelo Manara — Software Engineer | AI Systems · Cloud · Python · AWS

