paperless-rag

RAG-powered MCP server for Paperless-ngx document analysis. Syncs documents from Paperless-ngx, converts PDFs to structured Markdown via Marker, chunks and embeds them into PostgreSQL+pgvector, and exposes MCP tools for semantic search and LLM-based analysis.

How it works

paperless-rag has no user interface of its own. It is a headless backend that exposes its capabilities as tools via the Model Context Protocol (MCP). An MCP-compatible AI assistant — such as Claude Desktop, Claude Code, or any other MCP client — connects to paperless-rag and can use its tools to search, analyze, and reason over your documents.

You interact with your documents through natural language in the AI assistant. The assistant decides which tools to call based on your request:

You: "Find all invoices from 2024 that mention a total over 500 EUR"

The assistant calls search_documents with a semantic query, reviews the results, and presents a summary with document references.

You: "Summarize the contract I uploaded last week and highlight the cancellation terms"

The assistant calls list_documents to find recent uploads, then summarize_documents with instructions to focus on cancellation terms.

You: "Compare the two insurance offers and tell me which has better coverage"

The assistant calls compare_documents with both document IDs and returns a structured comparison.

You: "Extract the IBAN, invoice date, and total from this invoice"

The assistant calls extract_data with the specified fields and returns structured JSON.

The MCP server runs as a long-lived service exposing a Streamable HTTP endpoint at /mcp (port 8080). AI assistants connect to this endpoint to call tools and receive structured results. A /health endpoint provides liveness checks for Docker and monitoring.

Architecture

Paperless-ngx API --> Sync Engine --> Marker Pipeline --> Chunking --> Embeddings --> PostgreSQL (pgvector)
                                           |
                                       Gotenberg (non-PDF --> PDF conversion)

Claude Desktop/Code <--> MCP Server --> RAG Pipeline (hybrid search + rerank + context assembly)
                                            |
                                        LLM Backend (OpenAI-compatible API)
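The second line of the diagram — hybrid search, reranking, and context assembly — can be sketched in miniature. All names below are illustrative stand-ins, not the project's actual API (the real logic lives in rag.py, and the reranker is a cross-encoder model rather than a plain sort):

```python
# Illustrative sketch of the retrieval flow: candidates -> rerank -> context.
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: int
    text: str
    score: float

def rerank(query: str, chunks: list[Chunk], top_k: int) -> list[Chunk]:
    # Stand-in for the cross-encoder: keep the top_k highest-scoring chunks.
    return sorted(chunks, key=lambda c: c.score, reverse=True)[:top_k]

def assemble_context(chunks: list[Chunk]) -> str:
    # Concatenate chunks, prefixing each with its document ID so the LLM
    # can cite sources in its answer.
    return "\n\n".join(f"[doc {c.doc_id}] {c.text}" for c in chunks)

candidates = [
    Chunk(1, "Invoice total: 540 EUR", 0.91),
    Chunk(2, "Terms of service", 0.40),
    Chunk(3, "Invoice date: 2024-03-01", 0.75),
]
context = assemble_context(rerank("invoice total", candidates, top_k=2))
```

The assembled context string is what gets handed to the LLM backend together with the user's question.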

Key components

| Module | Purpose |
| --- | --- |
| config.py | Pydantic Settings — all config via env vars / .env |
| paperless.py | Paperless-ngx REST API client (httpx) |
| gotenberg.py | Gotenberg format conversion client |
| marker_pipeline.py | PDF to structured Markdown via Marker (GPU) |
| chunking.py | Markdown-aware chunking respecting headings and tables |
| embeddings.py | pplx-embed-context-v1-4B embeddings (sentence-transformers) |
| reranker.py | llama-nemotron-rerank-1b-v2 cross-encoder |
| rag.py | Hybrid search (vector + full-text), reranking, context assembly |
| llm.py | OpenAI-compatible LLM client |
| sync.py | Polling-based sync engine (initial import + incremental) |
| db.py | asyncpg connection pool, schema migration |
| health.py | /health endpoint — CUDA, database, and sync-loop checks |
| mcp_server.py | MCP tool definitions, Streamable HTTP transport |
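To make "Markdown-aware chunking" concrete, here is a minimal sketch: split the Markdown at headings, then greedily pack sections into chunks under a size budget. This is illustrative only — the real chunking.py also respects tables and uses actual token counts rather than the word count used here:

```python
# Minimal sketch of heading-aware chunking (illustrative, not chunking.py).
import re

def split_sections(markdown: str) -> list[str]:
    # Start a new section at every heading line ("#", "##", ...).
    sections, current = [], []
    for line in markdown.splitlines():
        if re.match(r"#{1,6} ", line) and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections

def chunk(markdown: str, max_words: int = 800) -> list[str]:
    # Greedily merge whole sections while staying under the budget, so a
    # heading is never separated from the text that follows it.
    chunks, buf, count = [], [], 0
    for sec in split_sections(markdown):
        n = len(sec.split())
        if buf and count + n > max_words:
            chunks.append("\n".join(buf))
            buf, count = [], 0
        buf.append(sec)
        count += n
    if buf:
        chunks.append("\n".join(buf))
    return chunks
```

Splitting on section boundaries instead of fixed character windows keeps each embedded chunk semantically coherent, which is what makes the later vector search useful.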

Local ML models

All models run locally on GPU — no external API calls for embeddings or reranking.

  • Marker — PDF to structured Markdown preserving tables, headings, and reading order
  • pplx-embed-context-v1-4B — 2560-dim document embeddings (~8 GB)
  • llama-nemotron-rerank-1b-v2 — cross-encoder reranker (~2 GB)

MCP Tools

| Tool | Description |
| --- | --- |
| search_documents | Hybrid semantic + keyword search with reranking |
| ask_question | RAG pipeline: search, rerank, LLM answer with citations |
| get_document | Full Markdown text of a document by Paperless ID |
| list_documents | List documents with optional filters |
| summarize_documents | LLM-generated summary of one or more documents |
| compare_documents | Side-by-side comparison of documents |
| extract_data | Extract structured JSON fields from documents |
| sync_status | Show indexed document count, last sync time, and failed documents |
| reindex_document | Re-process a single document through the pipeline |
| retry_sync | Clear sync errors so failed documents are retried on the next cycle |

Prerequisites

  • NVIDIA GPU with Docker GPU support (nvidia-container-toolkit)
  • Docker and Docker Compose
  • A running Paperless-ngx instance
  • An OpenAI-compatible LLM endpoint (e.g., Ollama)

Quick start

1. Clone and configure

git clone https://github.com/volschin/paperless-rag.git
cd paperless-rag
cp .env.example .env

Edit .env with your values:

# Required — your Paperless-ngx instance
PAPERLESS_URL=http://paperless.local:8000
PAPERLESS_TOKEN=your-api-token

# Required — LLM backend (Ollama example)
LLM_BASE_URL=http://ollama:11434/v1
LLM_API_KEY=unused
LLM_MODEL=qwen3.5:27b

# PostgreSQL (defaults work with the bundled postgres container)
DATABASE_URL=postgresql://paperless_rag:changeme@postgres:5432/paperless_rag

# Optional tuning
SYNC_INTERVAL_SECONDS=300
CHUNK_MAX_TOKENS=800
HYBRID_SEARCH_VECTOR_WEIGHT=0.7
RETRIEVAL_TOP_K=20
RERANK_TOP_K=5

See .env.example for all available options.

2. Start infrastructure

# Start PostgreSQL (pgvector) and Gotenberg
docker compose -f docker-compose.prod.yml up -d postgres gotenberg

If you use Ollama as the LLM backend, make sure it is running and attach it to the Compose network:

docker network connect paperless-rag ollama

3. Build the GPU image

docker compose -f docker-compose.prod.yml build paperless-rag

This uses Dockerfile.gpu based on nvcr.io/nvidia/pytorch:26.03-py3. The base image includes PyTorch with CUDA support; the build only installs the project's additional dependencies.

For CPU-only environments, use the standard Dockerfile:

docker build -t paperless-rag .

4. Verify the deployment

# Check services are healthy
docker compose -f docker-compose.prod.yml ps

# Test the health endpoint (available after model loading, ~2 min)
curl http://localhost:8080/health

# Test GPU access
docker compose -f docker-compose.prod.yml run --rm --no-deps \
  --entrypoint python paperless-rag \
  -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"

The /health endpoint checks CUDA GPU health, database connectivity, and sync-loop liveness. Docker uses it to automatically restart the container on GPU failures (e.g., CUDA device-side assert).
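The automatic restart relies on a container healthcheck that polls /health. A sketch of what such a healthcheck could look like in Compose — the values and the use of curl are assumptions; check docker-compose.prod.yml for the actual settings:

```yaml
# Hypothetical healthcheck block for the paperless-rag service.
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 180s   # allow time for model loading (~2 min)
```

A generous start_period matters here: the endpoint only comes up after the ~10 GB of models have loaded, and without it the container could be marked unhealthy during a normal cold start.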

5. Configure your MCP client

The MCP server uses Streamable HTTP transport on port 8080.

Claude Desktop

Add to claude_desktop_config.json:

{
  "mcpServers": {
    "paperless-rag": {
      "command": "docker",
      "args": [
        "compose", "-f", "/path/to/paperless-rag/docker-compose.prod.yml",
        "run", "--rm", "-i", "paperless-rag"
      ]
    }
  }
}
Remote via SSH

If deployed on a different machine:

{
  "mcpServers": {
    "paperless-rag": {
      "command": "ssh",
      "args": [
        "-i", "~/.ssh/id_ed25519", "user@gpu-host",
        "cd ~/paperless-rag && docker compose -f docker-compose.prod.yml run --rm -i paperless-rag"
      ]
    }
  }
}
Direct Python (development)
{
  "mcpServers": {
    "paperless-rag": {
      "command": "python",
      "args": ["-m", "paperless_rag.mcp_server"],
      "env": {
        "PAPERLESS_URL": "http://paperless.local:8000",
        "PAPERLESS_TOKEN": "your-token",
        "DATABASE_URL": "postgresql://user:pass@localhost:5432/paperless_rag",
        "LLM_BASE_URL": "http://localhost:11434/v1",
        "LLM_API_KEY": "unused",
        "LLM_MODEL": "qwen3.5:27b",
        "GOTENBERG_URL": "http://localhost:3000",
        "EMBEDDING_MODEL": "perplexity-ai/pplx-embed-context-v1-4B",
        "RERANKER_MODEL": "nvidia/llama-nemotron-rerank-1b-v2"
      }
    }
  }
}

First run

On the first invocation, the server will:

  1. Initialize the database schema (tables, indexes)
  2. Download ML models (~10 GB total) — cached in the model-cache Docker volume
  3. Start the sync loop, pulling documents from Paperless-ngx every 5 minutes

Development

python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

# Run tests
pytest

# Lint and format
ruff check src/ tests/
ruff format src/ tests/

A development PostgreSQL with pgvector is available via Docker Compose:

docker compose up -d postgres   # pgvector on localhost:5433

Design decisions

  • Marker over OCR — Marker produces structured Markdown preserving tables, headings, and reading order. Paperless-ngx's built-in OCR (Tesseract) destroys document structure.
  • Separate database — Uses its own PostgreSQL instance with pgvector, not the Paperless-ngx database.
  • Polling, not webhooks — Paperless-ngx lacks native webhook support. The sync engine polls every 5 minutes (configurable via SYNC_INTERVAL_SECONDS).
  • Hybrid search — 0.7 vector + 0.3 keyword weighting combines semantic understanding with exact term matching.
  • pgvector HNSW limit — The 2560-dim embeddings exceed pgvector's 2000-dim HNSW index limit. Exact (sequential) vector search is used, which is performant for up to a few thousand documents.
  • Sync error tracking — Failed documents are recorded in sync_errors and skipped on subsequent cycles. Per-document timeout (configurable via DOCUMENT_TIMEOUT_SECONDS) prevents the sync from hanging on problematic files.
  • Health endpoint — /health probes CUDA (tensor op in a thread), database (SELECT 1), and sync-loop staleness. Docker restarts the container automatically on GPU failures.
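The 0.7/0.3 hybrid weighting above reduces to a weighted sum of the two scores. A sketch under the assumption that both scores are already normalized to [0, 1] — rag.py may instead combine ranks (e.g. reciprocal rank fusion), so treat this as the idea, not the implementation:

```python
# Sketch of hybrid scoring: weighted sum of normalized vector-similarity and
# keyword (full-text) scores, using the documented 0.7/0.3 split.
def hybrid_score(vector_score: float, keyword_score: float,
                 vector_weight: float = 0.7) -> float:
    # Both inputs are assumed to be normalized to [0, 1].
    return vector_weight * vector_score + (1 - vector_weight) * keyword_score

# A chunk that is semantically close but lacks the exact keyword can still
# outrank a pure keyword match.
semantic_hit = hybrid_score(0.9, 0.1)  # strong vector match, weak keyword
keyword_hit = hybrid_score(0.3, 1.0)   # exact keyword, weak vector match
```

With the 0.7 weight the semantic hit (0.66) beats the exact keyword match (0.51); lowering HYBRID_SEARCH_VECTOR_WEIGHT shifts that balance toward exact term matching.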
