A production-ready Retrieval-Augmented Generation (RAG) system designed for enterprise documentation, with first-class support for markdown content. Built with a microservices architecture for horizontal scaling and flexibility.
This system implements a sophisticated multi-stage processing pipeline that transforms raw documents into a searchable knowledge base capable of answering complex queries.
- Asynchronous Processing: Queue-based architecture enables horizontal scaling and fault tolerance
- Multi-format Support: Handles markdown, images, tables, code blocks, and lists
- Intelligent Chunking: Context-aware chunking with configurable overlap for optimal retrieval
- Vector Search: OpenSearch-powered semantic search with token budget management
- Real-time Inference: Ollama integration for local LLM inference
- Production Ready: Complete Docker Compose setup with health checks
- Comprehensive Testing: Unit and integration tests across all components
- Docker & Docker Compose
- 8GB+ RAM (for Ollama models)
- 2GB+ free disk space
```bash
git clone <your-repo>
cd universal-rag-pipeline
cp .env.example .env
```

```bash
# .env
DATABASE_PATH=data/records.db
AZURE_STORAGE_CONNECTION_STRING="<storage_connection_string>"
AZURE_QUEUE_CONNECTION_STRING="<queue_connection_string>"
SUBMISSION_QUEUE=job-submission
EXTRACTION_QUEUE=job-extraction
CHUNKING_QUEUE=job-chunking
SEARCH_HOST=opensearch
SEARCH_PORT=9200
SEARCH_INDEX=chunks-index
SEARCH_ADMIN_PASSWORD=Hydrot123456!
EMBEDDING_QUEUE=job-embedding
EMBEDDING_BASE_URL=http://ollama:11434
EMBEDDING_MODEL=nomic-embed-text
EMBEDDING_DIM=768
INFERENCE_QUEUE=job-inference
INFERENCE_BASE_URL=http://ollama:11434
INFERENCE_TIMEOUT=60
INFERENCE_SYSTEM_PROMPT="You are a helpful assistant."
INFERENCE_MODEL=gemma3:270m
```

```bash
docker-compose up -d
```

This will start:
- API Server (FastAPI) on http://localhost:8000
- OpenSearch on http://localhost:9200
- OpenSearch Dashboards on http://localhost:5601
- Ollama on http://localhost:11434
- 5 worker processes for the pipeline stages
- Azurite (local Azure Storage emulator)
```bash
# 1. Upload a markdown file to blob storage
curl -X PUT "http://localhost:10000/devstoreaccount1/test-docs/sample.md" \
  -H "x-ms-blob-type: BlockBlob" \
  --data-binary @your-document.md

# 2. Submit processing job
curl -X POST "http://localhost:8000/jobs/?source_container=test-docs"
# Returns: job_id

# 3. Ask questions about your document (note the URL-encoded query)
curl -X POST "http://localhost:8000/inferences/?query=What%20is%20the%20main%20topic%3F&model=gemma3:270m"
# Returns: inference_id

# 4. Get the answer
curl "http://localhost:8000/inferences/{inference_id}"
```

`POST /jobs/?source_container={container_name}`

Submits documents for processing through the RAG pipeline.
Response: Job ID for tracking progress
`POST /inferences/?query={question}&model={model_name}`

`GET /inferences/{inference_id}`

`POST /inferences/{inference_id}/cancel`

Supported Models:

- `gpt-oss:20b` - Fast, efficient model for general questions
- `gemma3:270m` - Lightweight model for simple queries
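As an illustration of how the endpoints above compose into client requests, here is a small URL-builder sketch (the base URL assumes the local compose setup; this is not shipped project code):

```python
# Build request URLs for the job and inference endpoints documented above.
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000"

def submit_job_url(source_container: str) -> str:
    # POST /jobs/?source_container={container_name}
    return f"{BASE_URL}/jobs/?{urlencode({'source_container': source_container})}"

def submit_inference_url(query: str, model: str = "gemma3:270m") -> str:
    # POST /inferences/?query={question}&model={model_name}
    # urlencode handles spaces and reserved characters in the question.
    return f"{BASE_URL}/inferences/?{urlencode({'query': query, 'model': model})}"

def inference_status_url(inference_id: str) -> str:
    # GET /inferences/{inference_id}
    return f"{BASE_URL}/inferences/{inference_id}"
```

Pass the resulting strings to any HTTP client (`curl`, `requests`, `httpx`).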
- Parses markdown into structured format
- Extracts headers, paragraphs, code blocks, tables, images
- Preserves semantic structure and metadata
- Uses spaCy for sentence segmentation
- Intelligent context-aware chunking
- Configurable chunk size and overlap
- Preserves semantic boundaries
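The overlap-based chunking described above can be sketched as a sliding window over pre-split sentences (an illustration only, not the project's actual implementation; parameter names are assumptions):

```python
# Sliding-window chunker: each chunk carries `overlap` sentences from the
# previous chunk so context is preserved across chunk boundaries.
def chunk_sentences(sentences, max_sentences=5, overlap=1):
    step = max(1, max_sentences - overlap)  # how far the window advances
    chunks = []
    for start in range(0, len(sentences), step):
        window = sentences[start:start + max_sentences]
        if window:
            chunks.append(" ".join(window))
        if start + max_sentences >= len(sentences):
            break  # final window already reached the end; avoid tiny tail chunks
    return chunks
```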
- Token-based budgeting with approximation
- Batch processing for efficiency
- L2 normalization for consistent vectors
- Error handling for malformed embeddings
- Configurable embedding models via Ollama
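The L2 normalization step can be illustrated with a pure-Python sketch (a real worker would likely use numpy; the zero-vector guard mirrors the malformed-embedding handling mentioned above):

```python
import math

def l2_normalize(vec, eps=1e-12):
    # Scale the vector to unit length so cosine similarity reduces to a dot product.
    norm = math.sqrt(sum(x * x for x in vec))
    if norm < eps:
        # A zero (or near-zero) embedding is malformed and cannot be normalized.
        raise ValueError("cannot normalize zero-length embedding")
    return [x / norm for x in vec]
```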
- Context-aware query processing
- Token budget management per model
- Configurable system prompts
- Support for multiple LLM backends
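To make the inference stage concrete, here is a sketch of how a worker might assemble an Ollama generate request from retrieved chunks. The top-level fields (`model`, `system`, `prompt`, `stream`) follow Ollama's public API; the prompt template itself is an assumption, not the project's actual template:

```python
def build_payload(query, context_chunks, model="gemma3:270m",
                  system="You are a helpful assistant."):
    # Join retrieved chunks into a context block, then pose the question.
    context = "\n\n".join(context_chunks)
    return {
        "model": model,
        "system": system,  # configurable system prompt (INFERENCE_SYSTEM_PROMPT)
        "prompt": f"Context:\n{context}\n\nQuestion: {query}",
        "stream": False,
    }
```

The resulting dict would be POSTed as JSON to `{INFERENCE_BASE_URL}/api/generate`.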
- Horizontal Scalability: Each worker can be scaled independently
- Fault Tolerance: Queue-based processing with retry mechanisms
- Modularity: Provider pattern for swappable components
- Observability: Structured logging throughout the pipeline
```mermaid
graph TD
    A[Document Upload] --> B[Submission Queue]
    B --> C[Extraction Worker]
    C --> D[Extraction Queue]
    D --> E[Chunking Worker]
    E --> F[Chunking Queue]
    F --> G[Embedding Worker]
    G --> H[Vector Database]
    I[User Query] --> J[Inference Queue]
    J --> K[Inference Worker]
    K --> H
    K --> L[LLM Provider]
    L --> M[Generated Answer]
```
- Azure Blob Storage: Document storage with container isolation per job
- OpenSearch: Vector database with KNN search capabilities
- SQLite: Job and inference state management
- Azure Queue Storage: Asynchronous message passing
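As a hypothetical sketch of the SQLite state management mentioned above (the actual schema and column names are not shown in this README, so everything here is an assumption):

```python
import sqlite3

# In-memory database stands in for DATABASE_PATH=data/records.db.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE jobs (
    job_id           TEXT PRIMARY KEY,
    source_container TEXT NOT NULL,
    status           TEXT NOT NULL DEFAULT 'submitted',  -- submitted/running/done/failed
    created_at       TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE inferences (
    inference_id TEXT PRIMARY KEY,
    query        TEXT NOT NULL,
    model        TEXT NOT NULL,
    status       TEXT NOT NULL DEFAULT 'pending',
    answer       TEXT
);
""")
conn.execute("INSERT INTO jobs (job_id, source_container) VALUES (?, ?)",
             ("job-1", "test-docs"))
row = conn.execute("SELECT status FROM jobs WHERE job_id = ?", ("job-1",)).fetchone()
```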
| Stage | Processing Rate | Bottleneck |
|---|---|---|
| Extraction | ~100 MD files/min | CPU (spaCy) |
| Chunking | ~1000 chunks/min | CPU (tokenization) |
| Embedding | ~500 chunks/min | GPU/Model size |
| Inference | ~10 queries/min | LLM generation |
- CPU-bound stages: Extraction, Chunking (scale horizontally)
- Memory-bound stages: Embedding (optimize batch sizes)
- I/O-bound stages: Storage operations (connection pooling)
- Semantic Boundary Preservation: Respects paragraph and section boundaries
- Configurable Overlap: Prevents context loss at chunk boundaries
- Token Budget Management: Optimizes for model context windows
- Content-Type Awareness: Special handling for code, tables, lists
- Candidate Pool Strategy: Over-retrieval followed by token-based filtering
- Score-based Ranking: Maintains relevance while respecting budget constraints
- Dynamic Context Sizing: Adapts to different model capabilities
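The candidate-pool strategy above can be sketched as a greedy pass over over-retrieved, score-ranked chunks, packing them into the model's token budget. The 4-characters-per-token heuristic is an assumed approximation, not the project's actual tokenizer:

```python
def fit_to_budget(candidates, token_budget, chars_per_token=4):
    """candidates: list of (score, text) pairs, sorted by score descending."""
    selected, used = [], 0
    for score, text in candidates:
        cost = max(1, len(text) // chars_per_token)  # rough token estimate
        if used + cost > token_budget:
            continue  # skip chunks that would overflow the context window
        selected.append(text)
        used += cost
    return selected
```

Because iteration follows score order, relevance is preserved while the budget constraint is respected.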
- Health Checks: All services include health endpoints
- Structured Logging: JSON logs with correlation IDs
- Error Tracking: Comprehensive exception handling
- Performance Metrics: Processing time and throughput tracking
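One common way to produce JSON logs with correlation IDs using only the stdlib looks like this (an illustration; the project's actual field names and formatter are assumptions):

```python
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    # Emit each record as a single JSON object so log aggregators can parse it.
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", None),
        })

logger = logging.getLogger("pipeline")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Attach the same correlation ID to every log line for one job.
cid = str(uuid.uuid4())
logger.info("chunking started", extra={"correlation_id": cid})
```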
```bash
# Unit tests
pytest tests/ -v

# Integration tests (requires running services)
pytest tests/ -m integration -v

# Coverage report
pytest --cov=backend tests/
```

- Unit Tests: Individual component testing with mocks
- Integration Tests: End-to-end pipeline validation
- Fixtures: Reusable test infrastructure for Azure Storage
```bash
docker-compose up -d
```

Scaling Strategy:
- Deploy workers as separate pods/containers
- Use managed services (Azure OpenAI, Azure Cognitive Search)
- Implement auto-scaling based on queue depth
Security:
- Replace Azurite with production Azure Storage
- Add authentication (OAuth2/JWT)
- Implement proper secret management
- Network isolation and firewall rules
Monitoring:
- Prometheus metrics collection
- Grafana dashboards
- Azure Application Insights integration
- Alert rules for queue backlogs and failures
- Pre-clean: remove or redact secrets; extract images and write assets to the blob store; run OCR on images and add the OCR text as additional paragraphs.
- Refactoring idea: fold the embedder and indexer into a single abstraction to simplify the main workflow, since embedding and indexing are closely related operations.
- Use a deterministic document ID (content hash + chunk position) for index records to avoid duplicate entries on re-processing.
- Store supported-model metadata (model names, context lengths) in the database.
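The deterministic-ID idea can be sketched with stdlib hashing (the ID layout is an assumption; any stable scheme combining content hash and chunk position works):

```python
import hashlib

def chunk_id(document_text: str, chunk_text: str, position: int) -> str:
    # Same document + same chunk + same position always yields the same ID,
    # so re-indexing the same content overwrites instead of duplicating.
    doc_hash = hashlib.sha256(document_text.encode("utf-8")).hexdigest()[:16]
    chunk_hash = hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()[:16]
    return f"{doc_hash}-{position:04d}-{chunk_hash}"
```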
- Multi-modal support (images, PDFs, Office docs)
- Hybrid search (semantic + keyword)
- Query expansion and rewriting
- Citation tracking in responses
- Multi-tenant support
- Advanced security (RBAC, audit logs)
- Real-time document sync
- Analytics dashboard
- Active learning for relevance feedback
- Automated evaluation metrics (RAGAS)
- Query intent classification
- Personalized retrieval
```
├── backend/
│   ├── api/            # FastAPI routes
│   ├── config/         # Settings and DI
│   ├── models/         # Data models
│   ├── providers/      # External service abstractions
│   ├── repositories/   # Data access layer
│   ├── services/       # Business logic
│   └── workers/        # Background processors
├── tests/              # Comprehensive test suite
├── data/               # Local storage (gitignored)
├── docker-compose.yaml # Multi-service orchestration
└── requirements.txt    # Python dependencies
```
For Engineering Interviews:

- Distributed Systems: Demonstrates understanding of queue-based architectures, eventual consistency, and horizontal scaling
- AI/ML Engineering: Shows practical knowledge of embedding models, vector databases, and LLM integration
- Software Architecture: Clean abstractions, dependency injection, and separation of concerns
- DevOps Skills: Docker containerization, service orchestration, and local development setup
- Testing Excellence: Comprehensive test coverage with proper mocking and integration testing
This project demonstrates enterprise-grade software engineering practices suitable for production RAG systems. The modular architecture allows for easy extension and customization.
Key Extension Points:
- `BaseEmbeddingProvider`: Add new embedding models
- `BaseInferenceProvider`: Integrate different LLM providers
- `BaseStorageProvider`: Support additional storage backends
- `BaseIndexProvider`: Implement alternative vector databases
Built with modern Python ecosystem: FastAPI, Pydantic, Docker, OpenSearch, Ollama