# 📄 DocuAI Studio v3.1

Upload any document (PDF, image, DOCX, XLSX, PowerPoint, CSV, text, or a web URL) and ask questions in plain English. Fully local and private, powered by Mistral via Ollama, with optional OpenAI support.

**Tech stack:** FastAPI · LangChain · Ollama · OpenAI · FAISS · PaddleOCR · BM25 · SentenceTransformers · CrossEncoder · SQLite · WebSockets


## 🖥️ Frontend Demo

- **💬 Chat Q&A Interface:** AI-powered document Q&A with streaming, markdown rendering, and session management
- **📂 Data Lake & Ingestion:** Upload PDFs, DOCX, images, or ingest web URLs with real-time vector indexing
- **⚖️ Document Comparator Engine:** Side-by-side AI-driven semantic comparison between any two indexed documents
- **🕸️ Knowledge Graph Explorer:** Interactive force-directed entity graph with 74 entities and 87 relationships
- **📊 Active Vector Index:** View and manage all indexed documents with chunk counts, file types, and delete actions

## ✨ What's New in v3.1

| Feature | Description |
|---|---|
| 🔧 Thread-Safe Vector Store | Fixed FAISS race conditions with proper locking |
| ⚡ Reciprocal Rank Fusion (RRF) | Accurate hybrid scoring between FAISS and BM25 instead of zero-insertion |
| 🤖 OpenAI Provider Support | Switchable LLM backend — Ollama (default) or OpenAI via `LLM_PROVIDER` |
| 🧠 Local Analysis Engine | Intelligent offline answers using keyword extraction & entity analysis |
| 🔍 Semantic Search | Dedicated search endpoint returning ranked passages without LLM overhead |
| 📑 Batch Q&A | Submit multiple questions at once, export results as CSV |
| 📊 Query Analytics Dashboard | Track query frequency, response times, popular documents, failure rates |
| 📦 Export / Import Index | Download and restore the complete FAISS index as a .zip for backup/migration |
| 📄 Document Versioning | Track upload history with version numbers and diff metadata |
| ⏰ Scheduled Web Crawling | Background thread auto-refreshes web URL sources on a configurable interval |
| 🕸️ Knowledge Graph Canvas | Interactive canvas-based entity graph visualization in the browser |
| 📱 Mobile Sidebar Toggle | Responsive design with collapsible sidebar overlay for mobile screens |
| 📝 Markdown Rendering | Full markdown rendering in chat (tables, code, headers) via marked.js |
| 🔄 Async Ingestion | Non-blocking document processing with task status tracking |
| 🛡️ Security Hardening | Windows reserved names, hidden-file protection, WebSocket input limits |
| 🚀 Lazy Module Loading | Heavy deps loaded only when needed — faster cold starts |

See the full CHANGELOG for details.


## 🧠 How It Works

RAG (Retrieval-Augmented Generation) lets an AI answer questions about your specific documents. Instead of guessing from training data, it:

  1. Reads and indexes your documents (text extraction + semantic embedding).
  2. Finds the most relevant passages when you ask a question (FAISS + BM25 + Reciprocal Rank Fusion).
  3. Feeds those passages to an LLM (Mistral 7B via Ollama, or GPT via OpenAI) as context.
  4. Writes a grounded answer with citations to your actual documents.

**Demo Mode:** When Ollama is offline, the system gracefully falls back to heuristic-based answers using keyword extraction, entity analysis, and structured data parsing — no LLM required.
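Step 2's Reciprocal Rank Fusion is easy to sketch. Below is an illustrative implementation of the generic RRF formula (each document scores `sum(1 / (k + rank))` over every ranking it appears in, with the conventional `k = 60`) — a sketch of the technique, not this project's exact code:

```python
def rrf_merge(rankings, k=60):
    """Fuse several ranked lists of chunk IDs with Reciprocal Rank Fusion.

    Items ranked highly by multiple retrievers (e.g. FAISS and BM25)
    accumulate the largest fused scores and float to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


faiss_hits = ["chunk_a", "chunk_b", "chunk_c"]  # semantic ranking
bm25_hits = ["chunk_b", "chunk_d", "chunk_a"]   # keyword ranking
fused = rrf_merge([faiss_hits, bm25_hits])
# → ["chunk_b", "chunk_a", "chunk_d", "chunk_c"]
```

Because RRF only uses ranks, it sidesteps the problem of comparing raw FAISS distances against BM25 scores — which is exactly why it replaces the old zero-insertion scoring.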


πŸ—οΈ Architecture

```
                    ┌───────────────────────────────────────────────────┐
                    │          Frontend (HTML/CSS/JS + marked.js)       │
                    │  Chat · Search · Batch · KG · Analytics · Upload  │
                    └─────────────────┬─────────────────────────────────┘
                                      │
                    ┌─────────────────▼─────────────────────────────────┐
                    │  FastAPI REST + WebSocket Server (Lifespan)       │
                    │  40+ endpoints · Async ops · Rate limiting        │
                    └─────────────────┬─────────────────────────────────┘
                                      │
     ┌─────────┬───────────┬──────────┼──────────┬──────────┬──────────┐
     │ Stage 1 │  Stage 2  │ Stage 3  │  Stage 4 │ Stage 5  │ Features │
     │  LOAD   │  CHUNK    │ EMBED    │ RETRIEVE │ ANSWER   │  v3.1    │
     ├─────────┼───────────┼──────────┼──────────┼──────────┼──────────┤
     │ PyMuPDF │ Sentence  │ MiniLM   │ Hybrid   │ Ollama / │ Versions │
     │ Paddle  │ Unicode   │ FAISS    │ FAISS+   │ OpenAI   │ BatchQA  │
     │ OCR     │ Table-    │ IDMap    │ BM25+RRF │ LCEL     │ Search   │
     │ openpyxl│ Atomic    │ Auto-IVF │ Cross-   │ Stream   │ Analytics│
     │ pptx    │ Dedup     │ Export   │ Encoder  │ Demo     │ KG       │
     │ BS4     │ SHA-256   │ Thread🔒 │ Rerank   │ History  │ Crawl    │
     └─────────┴───────────┴──────────┴──────────┴──────────┴──────────┘
```

### Data Flow

```
Document → OCR/Parse → Sentence Chunk → Embed (384D) → FAISS Index (IDMap)
                                                             ↓
Question → Embed → FAISS Search (top 20) → BM25 + RRF → Rerank (top 5)
                                                             ↓
                                              LLM (Ollama/OpenAI/Demo) → Answer + Sources
                                                             ↓
                                              Cache → Session → Analytics → WebSocket
```
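The Chunk stage above packs sentences into roughly `CHUNK_SIZE`-character pieces and drops exact duplicates via SHA-256 hashing. A minimal sketch of that idea — the real `semantic_chunker.py` is Unicode-aware, whereas the naive regex splitter here is only a stand-in:

```python
import hashlib
import re


def chunk_and_dedup(text, max_chars=512):
    """Greedily pack sentences into ~max_chars chunks, dropping exact duplicates."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would overflow the budget.
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)

    # SHA-256 dedup: identical chunks (e.g. repeated boilerplate) are indexed once.
    seen, unique = set(), []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique
```

Hash-based dedup keeps repeated page headers or footers from polluting the FAISS index with near-identical vectors.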

## 🚀 Quick Start

### Option 1: Docker (Recommended)

```shell
# Start everything (app + Ollama)
docker-compose up -d

# Pull the LLM model (first time only)
docker exec rag-ollama ollama pull mistral

# Open the UI
open http://localhost:8000
```

### Option 2: Local Install

Prerequisites:

- Python 3.10+
- Ollama installed and running (optional; Demo Mode works without it)
```shell
# 1. Clone & install
git clone https://github.com/kulkarnishub377/Document-AI---RAG-Pipeline.git
cd Document-AI---RAG-Pipeline
pip install -r requirements.txt

# 2. Pull the LLM model (optional)
ollama pull mistral

# 3. Configure (optional)
cp .env.example .env
# Edit .env to customize settings (LLM provider, CORS, etc.)

# 4. Run the application
python run.py
```

Open http://localhost:8000 in your browser.

### Option 3: OpenAI Backend

```shell
cp .env.example .env
# Edit .env:
#   LLM_PROVIDER=openai
#   OPENAI_API_KEY=sk-your-key-here
#   OPENAI_MODEL=gpt-3.5-turbo

pip install langchain-openai
python run.py
```

πŸ“ Project Structure

DocuAI Studio/
β”œβ”€β”€ api/
β”‚   └── app.py                 # FastAPI REST + WebSocket server (40+ endpoints)
β”œβ”€β”€ chunking/
β”‚   └── semantic_chunker.py    # Unicode-aware sentence chunking with dedup
β”œβ”€β”€ embedding/
β”‚   └── vector_store.py        # FAISS + BM25 hybrid index (IDMap, RRF, auto-IVF)
β”œβ”€β”€ features/                  # Feature Modules
β”‚   β”œβ”€β”€ knowledge_graph.py     # Entity extraction + relationship mapping
β”‚   β”œβ”€β”€ collaboration.py       # WebSocket real-time multi-user Q&A
β”‚   β”œβ”€β”€ pdf_annotator.py       # PDF highlighting with source passages
β”‚   β”œβ”€β”€ comparator.py          # Document comparison analysis
β”‚   β”œβ”€β”€ evaluation.py          # RAGAS-inspired evaluation metrics
β”‚   └── query_analytics.py     # πŸ†• v3.1 Query analytics tracker
β”œβ”€β”€ frontend/
β”‚   β”œβ”€β”€ index.html             # UI with Search, Batch, KG, Analytics views
β”‚   β”œβ”€β”€ css/style.css          # Premium dark/light glassmorphic theme
β”‚   └── js/app.js              # Client logic (markdown, streaming, mobile)
β”œβ”€β”€ ingestion/
β”‚   └── document_loader.py     # Multi-format loader (PDF/Excel/PPTX/CSV/Image/Web)
β”œβ”€β”€ llm/
β”‚   └── prompt_chains.py       # Ollama/OpenAI chains + streaming + demo mode
β”œβ”€β”€ retrieval/
β”‚   └── reranker.py            # Cross-encoder reranking
β”œβ”€β”€ utils/
β”‚   β”œβ”€β”€ cache.py               # LRU query cache with TTL
β”‚   β”œβ”€β”€ rate_limiter.py        # Sliding-window rate limiter
β”‚   β”œβ”€β”€ sessions.py            # SQLite persistent chat sessions
β”‚   └── exceptions.py          # Custom exception hierarchy
β”œβ”€β”€ tests/
β”‚   β”œβ”€β”€ conftest.py            # Shared test fixtures
β”‚   β”œβ”€β”€ test_api.py            # 30+ API endpoint tests
β”‚   β”œβ”€β”€ test_chunker.py        # Chunking unit tests
β”‚   β”œβ”€β”€ test_config.py         # Config override tests
β”‚   β”œβ”€β”€ test_pipeline.py       # Pipeline integration tests
β”‚   β”œβ”€β”€ test_reranker.py       # Reranker unit tests
β”‚   └── test_vector_store.py   # Vector store unit tests
β”œβ”€β”€ config.py                  # Central configuration (env-driven, 40+ settings)
β”œβ”€β”€ pipeline.py                # Pipeline orchestrator (batch, versioning, export)
β”œβ”€β”€ run.py                     # Entry point
β”œβ”€β”€ Dockerfile                 # Container build (with healthcheck)
β”œβ”€β”€ docker-compose.yml         # Full stack: app + Ollama
β”œβ”€β”€ requirements.txt           # Python dependencies (pinned)
β”œβ”€β”€ CHANGELOG.md               # πŸ†• Version history
β”œβ”€β”€ .env.example               # Configuration template
└── README.md                  # This file

## 🔌 API Reference (40+ Endpoints)

### Core

| Method | Endpoint | Description |
|---|---|---|
| GET | `/` | Serve frontend UI |
| GET | `/health` | Health check with version |
| GET | `/status` | Index stats + Ollama status |
| GET | `/analytics` | Storage, cache, document breakdown |

### Ingestion

| Method | Endpoint | Description |
|---|---|---|
| POST | `/ingest` | Upload a single document (PDF/Excel/PPTX/Image/DOCX/CSV/TXT) |
| POST | `/ingest/url` | Ingest a web URL |
| POST | `/ingest/async` | 🆕 Non-blocking ingestion with task ID |

### Query & AI

| Method | Endpoint | Description |
|---|---|---|
| POST | `/query` | Ask a question (sync, with source filtering & chat history) |
| POST | `/query-stream` | Ask a question (streaming SSE) |
| POST | `/query/batch` | 🆕 Ask multiple questions at once |
| POST | `/search` | 🆕 Semantic search (no LLM) |
| POST | `/summarize` | Summarize documents by topic |
| POST | `/extract` | Extract structured fields as JSON |
| POST | `/table-query` | Ask about tables |
| POST | `/compare` | Compare two documents |
| POST | `/annotate` | Q&A with highlighted PDF export |
| POST | `/evaluate` | Run RAGAS evaluation on a query |

### Sessions

| Method | Endpoint | Description |
|---|---|---|
| POST | `/sessions` | Create a new chat session |
| GET | `/sessions` | List recent sessions |
| GET | `/sessions/{id}` | Get session details |
| GET | `/sessions/{id}/messages` | Get messages in a session |
| DELETE | `/sessions/{id}` | Delete a session |

### Knowledge Graph

| Method | Endpoint | Description |
|---|---|---|
| GET | `/knowledge-graph` | Full graph data (nodes + edges) |
| GET | `/knowledge-graph/search` | Search entities by type |
| POST | `/knowledge-graph/reset` | Clear the knowledge graph |

### Document Management

| Method | Endpoint | Description |
|---|---|---|
| GET | `/documents` | List all indexed documents |
| DELETE | `/document/(unknown)` | Delete a specific document |
| POST | `/clear` | Clear the entire index |
| GET | `/versions` | 🆕 Get version history for all docs |
| GET | `/versions/(unknown)` | 🆕 Get version history for a document |
| GET | `/export` | 🆕 Download index as zip |
| POST | `/import` | 🆕 Upload and restore index from zip |
| GET | `/download/(unknown)` | Download an uploaded file |

### Scheduled Crawling (v3.1)

| Method | Endpoint | Description |
|---|---|---|
| GET | `/crawl/urls` | 🆕 List scheduled crawl URLs |
| POST | `/crawl/add` | 🆕 Add a URL to the crawl schedule |
| POST | `/crawl/remove` | 🆕 Remove a URL from the schedule |
| POST | `/crawl/run` | 🆕 Manually trigger a crawl |

### Analytics & Evaluation (v3.1)

| Method | Endpoint | Description |
|---|---|---|
| GET | `/query-analytics` | 🆕 Query frequency & response stats |
| POST | `/query-analytics/clear` | 🆕 Clear analytics data |
| GET | `/evaluate/dashboard` | RAGAS evaluation dashboard |
| GET | `/evaluate/history` | Evaluation history log |
| POST | `/evaluate/clear` | Clear evaluation history |

### Cache

| Method | Endpoint | Description |
|---|---|---|
| GET | `/cache/stats` | Cache statistics |
| POST | `/cache/clear` | Clear query cache |
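The cache behind these endpoints is described as an LRU with TTL (see `utils/cache.py`). A minimal sketch of those semantics — assuming least-recently-used eviction on overflow and lazy expiry on read, which is a common design but not necessarily the project's exact one:

```python
import time
from collections import OrderedDict


class TTLCache:
    """LRU cache whose entries also expire after a fixed time-to-live."""

    def __init__(self, max_size=128, ttl=300.0):
        self.max_size, self.ttl = max_size, ttl
        self._data = OrderedDict()  # key -> (expires_at, value)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._data[key]       # stale entry: drop it, report a miss
            return None
        self._data.move_to_end(key)   # mark as most recently used
        return value

    def set(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = (time.monotonic() + self.ttl, value)
        while len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict least recently used
```

TTL keeps cached answers from outliving re-ingested documents, while the LRU bound caps memory for high-traffic deployments.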

### Tasks (v3.1)

| Method | Endpoint | Description |
|---|---|---|
| GET | `/tasks/{id}` | 🆕 Get async task status |

### Example: Query a Document

```shell
# 1. Ingest a PDF
curl -X POST http://localhost:8000/ingest -F "file=@invoice.pdf"

# 2. Ask a question
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the total amount on the invoice?"}'
```

### Example: Batch Q&A

```shell
curl -X POST http://localhost:8000/query/batch \
  -H "Content-Type: application/json" \
  -d '{"questions": ["What is the total?", "Who signed?", "When is the due date?"]}'
```

### Example: Semantic Search

```shell
curl -X POST http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{"query": "payment terms", "top_k": 10}'
```

### Example: Export & Import Index

```shell
# Export
curl -o backup.zip http://localhost:8000/export

# Import
curl -X POST http://localhost:8000/import -F "file=@backup.zip"
```

βš™οΈ Configuration

All settings can be configured via environment variables or a .env file:

| Variable | Default | Description |
|---|---|---|
| `LLM_PROVIDER` | `ollama` | 🆕 `ollama` or `openai` |
| `OLLAMA_MODEL` | `mistral` | LLM model name |
| `OLLAMA_VISION_MODEL` | `llava` | Vision model for image extraction |
| `OPENAI_API_KEY` | — | 🆕 Required when `LLM_PROVIDER=openai` |
| `OPENAI_MODEL` | `gpt-3.5-turbo` | 🆕 OpenAI model to use |
| `LLM_TIMEOUT_SECS` | `120` | 🆕 LLM request timeout in seconds |
| `EMBED_MODEL_NAME` | `all-MiniLM-L6-v2` | Embedding model |
| `CHUNK_SIZE` | `512` | Chunk size in characters |
| `RETRIEVAL_TOP_K` | `20` | FAISS candidates to retrieve |
| `RERANKER_TOP_K` | `5` | Final results after reranking |
| `IVF_THRESHOLD` | `50000` | 🆕 Auto-IVF upgrade threshold |
| `ENABLE_GPU` | `auto` | GPU mode: `auto`, `true`, `false` |
| `MULTILINGUAL_MODE` | `false` | Auto-detect document language |
| `MAX_FILE_SIZE_MB` | `50` | Max upload size |
| `CORS_ORIGINS` | `*` | 🆕 Configurable CORS origins |
| `CACHE_ENABLED` | `true` | Enable query caching |
| `RATE_LIMIT_ENABLED` | `true` | Enable rate limiting |
| `KNOWLEDGE_GRAPH_ENABLED` | `true` | Enable KG extraction |
| `CRAWL_ENABLED` | `false` | 🆕 Enable scheduled web crawling |
| `CRAWL_INTERVAL_MINS` | `1440` | 🆕 Crawl interval (default: 24 h) |
| `QUERY_ANALYTICS_ENABLED` | `true` | 🆕 Enable query analytics tracking |
| `DOC_VERSIONING_ENABLED` | `true` | 🆕 Enable document versioning |
| `WS_ENABLED` | `true` | Enable WebSocket collaboration |
| `LOG_FORMAT` | `text` | Log format: `text` or `json` |

See `.env.example` for the full list.
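For illustration, a minimal `.env` for a local setup might look like this — variable names come from the table above, but the values are examples, not recommendations:

```ini
LLM_PROVIDER=ollama
OLLAMA_MODEL=mistral
OLLAMA_VISION_MODEL=llava
EMBED_MODEL_NAME=all-MiniLM-L6-v2
CHUNK_SIZE=512
RETRIEVAL_TOP_K=20
RERANKER_TOP_K=5
CORS_ORIGINS=http://localhost:8000
CACHE_ENABLED=true
RATE_LIMIT_ENABLED=true
CRAWL_ENABLED=false
LOG_FORMAT=text
```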

**Tip:** Improve answer quality with stronger models like `qwen2.5:14b` or `llama3.1:8b` for `OLLAMA_MODEL`, and `llava:13b` for `OLLAMA_VISION_MODEL`.


πŸ—ΊοΈ Roadmap

  • Multi-format document support (PDF, Image, DOCX, Excel, PPTX, CSV) βœ…
  • Persistent conversation sessions with SQLite βœ…
  • Knowledge graph extraction βœ…
  • Document comparison βœ…
  • PDF annotation export βœ…
  • Real-time WebSocket collaboration βœ…
  • Query caching with TTL βœ…
  • API rate limiting βœ…
  • Evaluation dashboard using RAGAS metrics βœ…
  • Configurable LLM providers (Ollama / OpenAI) βœ…
  • Batch Q&A with CSV export βœ…
  • Query analytics dashboard βœ…
  • Document versioning βœ…
  • Export / Import index βœ…
  • Scheduled web crawling βœ…
  • Semantic search endpoint βœ…
  • Async ingestion with task tracking βœ…
  • User authentication for multi-user deployments
  • Webhook support for document change notifications
  • OCR confidence metrics

## 🔧 Troubleshooting

**Ollama connection refused.** Make sure Ollama is running (`ollama serve`) and responding: `curl http://localhost:11434/api/tags`. 💡 If Ollama is unavailable, the app still works in Demo Mode with heuristic answers.

**PaddleOCR is slow on first run.** It downloads ~45 MB of model weights on the first OCR call. This is normal; subsequent runs are fast.

**Out of memory during a query.** Switch to a smaller LLM (set `OLLAMA_MODEL=llama3.2:3b` in `.env`), or reduce `RETRIEVAL_TOP_K=10` to process fewer candidates.

**FAISS index not found.** Ingest at least one document before querying: `curl -X POST http://localhost:8000/ingest -F "file=@document.pdf"`

**Rate limit exceeded.** Increase `RATE_LIMIT_REQUESTS` and `RATE_LIMIT_WINDOW` in `.env`, or set `RATE_LIMIT_ENABLED=false`.
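The limiter named by `utils/rate_limiter.py` is a sliding window over recent request timestamps. A conceptual sketch (illustrative only; class and parameter names here are assumptions, not the project's API):

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_requests within any trailing window_secs interval."""

    def __init__(self, max_requests=60, window_secs=60.0):
        self.max_requests = max_requests
        self.window_secs = window_secs
        self._hits = deque()  # timestamps of recent accepted requests

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Forget requests that have slid out of the window.
        while self._hits and now - self._hits[0] >= self.window_secs:
            self._hits.popleft()
        if len(self._hits) >= self.max_requests:
            return False          # over budget: reject
        self._hits.append(now)
        return True
```

Unlike a fixed-interval counter, a sliding window never lets a burst straddle two adjacent intervals and double the effective rate.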

**OpenAI API errors.** Check that your `OPENAI_API_KEY` is valid and has available credit. Set `LLM_PROVIDER=ollama` to fall back to local mode.


## 🧪 Testing

```shell
# Run all tests
python -m pytest tests/ -v

# Run a specific test file
python -m pytest tests/test_api.py -v

# Run with coverage
python -m pytest tests/ --cov=. --cov-report=term-missing
```

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch: `git checkout -b feature/amazing-feature`
3. Write tests for your changes
4. Run `python -m pytest tests/ -v` to make sure everything passes
5. Submit a Pull Request

See CONTRIBUTING.md for detailed guidelines.


πŸ“ License

MIT License β€” see LICENSE for details.
