paperless-rag

RAG-powered MCP server for Paperless-ngx document analysis. Syncs documents from Paperless-ngx, converts PDFs to structured Markdown via Marker, chunks and embeds them into PostgreSQL+pgvector, and exposes MCP tools for semantic search and LLM-based analysis.

How it works

paperless-rag has no user interface of its own. It is a headless backend that exposes its capabilities as tools via the Model Context Protocol (MCP). An MCP-compatible AI assistant — such as Claude Desktop, Claude Code, or any other MCP client — connects to paperless-rag and can use its tools to search, analyze, and reason over your documents.

You interact with your documents through natural language in the AI assistant. The assistant decides which tools to call based on your request:

You: "Find all invoices from 2024 that mention a total over 500 EUR"

The assistant calls search_documents with a semantic query, reviews the results, and presents a summary with document references.

You: "Summarize the contract I uploaded last week and highlight the cancellation terms"

The assistant calls list_documents to find recent uploads, then summarize_documents with instructions to focus on cancellation terms.

You: "Compare the two insurance offers and tell me which has better coverage"

The assistant calls compare_documents with both document IDs and returns a structured comparison.

You: "Extract the IBAN, invoice date, and total from this invoice"

The assistant calls extract_data with the specified fields and returns structured JSON.

The MCP server runs as a long-lived service exposing a Streamable HTTP endpoint at /mcp (port 8080). AI assistants connect to this endpoint to call tools and receive structured results. A /health endpoint provides liveness checks for Docker and monitoring.

Architecture

Paperless-ngx API --> Sync Engine --> Marker Pipeline --> Chunking --> Embeddings --> PostgreSQL (pgvector)
                                           |
                                       Gotenberg (non-PDF --> PDF conversion)

Claude Desktop/Code <--> MCP Server --> RAG Pipeline (hybrid search + rerank + context assembly)
                                            |
                                        LLM Backend (OpenAI-compatible API)
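The second line of the diagram — hybrid search, reranking, and context assembly — can be sketched in miniature. All names below are illustrative stand-ins, not the project's actual API (the real logic lives in rag.py, and the reranker is a cross-encoder model rather than a plain sort):

```python
# Illustrative sketch of the retrieval flow: candidates -> rerank -> context.
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: int
    text: str
    score: float

def rerank(query: str, chunks: list[Chunk], top_k: int) -> list[Chunk]:
    # Stand-in for the cross-encoder: keep the top_k highest-scoring chunks.
    return sorted(chunks, key=lambda c: c.score, reverse=True)[:top_k]

def assemble_context(chunks: list[Chunk]) -> str:
    # Concatenate chunks, prefixing each with its document ID so the LLM
    # can cite sources in its answer.
    return "\n\n".join(f"[doc {c.doc_id}] {c.text}" for c in chunks)

candidates = [
    Chunk(1, "Invoice total: 540 EUR", 0.91),
    Chunk(2, "Terms of service", 0.40),
    Chunk(3, "Invoice date: 2024-03-01", 0.75),
]
context = assemble_context(rerank("invoice total", candidates, top_k=2))
```

The assembled context string is what gets handed to the LLM backend together with the user's question.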

Key components

| Module | Purpose |
| --- | --- |
| config.py | Pydantic Settings — all config via env vars / .env |
| paperless.py | Paperless-ngx REST API client (httpx) |
| gotenberg.py | Gotenberg format conversion client |
| marker_pipeline.py | PDF to structured Markdown via Marker (GPU) |
| chunking.py | Markdown-aware chunking respecting headings and tables |
| embeddings.py | pplx-embed-context-v1-4B embeddings (sentence-transformers) |
| reranker.py | llama-nemotron-rerank-1b-v2 cross-encoder |
| rag.py | Hybrid search (vector + full-text), reranking, context assembly |
| llm.py | OpenAI-compatible LLM client |
| sync.py | Polling-based sync engine (initial import + incremental) |
| db.py | asyncpg connection pool, schema migration |
| health.py | /health endpoint — CUDA, database, and sync-loop checks |
| mcp_server.py | MCP tool definitions, Streamable HTTP transport |
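To make "Markdown-aware chunking" concrete, here is a minimal sketch: split the Markdown at headings, then greedily pack sections into chunks under a size budget. This is illustrative only — the real chunking.py also respects tables and uses actual token counts rather than the word count used here:

```python
# Minimal sketch of heading-aware chunking (illustrative, not chunking.py).
import re

def split_sections(markdown: str) -> list[str]:
    # Start a new section at every heading line ("#", "##", ...).
    sections, current = [], []
    for line in markdown.splitlines():
        if re.match(r"#{1,6} ", line) and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections

def chunk(markdown: str, max_words: int = 800) -> list[str]:
    # Greedily merge whole sections while staying under the budget, so a
    # heading is never separated from the text that follows it.
    chunks, buf, count = [], [], 0
    for sec in split_sections(markdown):
        n = len(sec.split())
        if buf and count + n > max_words:
            chunks.append("\n".join(buf))
            buf, count = [], 0
        buf.append(sec)
        count += n
    if buf:
        chunks.append("\n".join(buf))
    return chunks
```

Splitting on section boundaries instead of fixed character windows keeps each embedded chunk semantically coherent, which is what makes the later vector search useful.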

Local ML models

All models run locally on GPU — no external API calls for embeddings or reranking.

  • Marker — PDF to structured Markdown preserving tables, headings, and reading order
  • pplx-embed-context-v1-4B — 2560-dim document embeddings (~8 GB)
  • llama-nemotron-rerank-1b-v2 — cross-encoder reranker (~2 GB)

MCP Tools

| Tool | Description |
| --- | --- |
| search_documents | Hybrid semantic + keyword search with reranking |
| ask_question | RAG pipeline: search, rerank, LLM answer with citations |
| get_document | Full Markdown text of a document by Paperless ID |
| list_documents | List documents with optional filters |
| summarize_documents | LLM-generated summary of one or more documents |
| compare_documents | Side-by-side comparison of documents |
| extract_data | Extract structured JSON fields from documents |
| sync_status | Show indexed document count, last sync time, and failed documents |
| reindex_document | Re-process a single document through the pipeline |
| retry_sync | Clear sync errors so failed documents are retried on the next cycle |

Prerequisites

  • NVIDIA GPU with Docker GPU support (nvidia-container-toolkit)
  • Docker and Docker Compose
  • A running Paperless-ngx instance
  • An OpenAI-compatible LLM endpoint (e.g., Ollama)

Quick start

1. Clone and configure

git clone https://github.com/volschin/paperless-rag.git
cd paperless-rag
cp .env.example .env

Edit .env with your values:

# Required — your Paperless-ngx instance
PAPERLESS_URL=http://paperless.local:8000
PAPERLESS_TOKEN=your-api-token

# Required — LLM backend (Ollama example)
LLM_BASE_URL=http://ollama:11434/v1
LLM_API_KEY=unused
LLM_MODEL=qwen3.5:27b

# PostgreSQL (defaults work with the bundled postgres container)
DATABASE_URL=postgresql://paperless_rag:changeme@postgres:5432/paperless_rag

# Optional tuning
SYNC_INTERVAL_SECONDS=300
CHUNK_MAX_TOKENS=800
HYBRID_SEARCH_VECTOR_WEIGHT=0.7
RETRIEVAL_TOP_K=20
RERANK_TOP_K=5

See .env.example for all available options.

2. Start infrastructure

# Start PostgreSQL (pgvector) and Gotenberg
docker compose -f docker-compose.prod.yml up -d postgres gotenberg

If you use Ollama as the LLM backend, make sure it is running and attach it to the Compose network:

docker network connect paperless-rag ollama

3. Build the GPU image

docker compose -f docker-compose.prod.yml build paperless-rag

This uses Dockerfile.gpu based on nvcr.io/nvidia/pytorch:26.03-py3. The base image includes PyTorch with CUDA support; the build only installs the project's additional dependencies.

For CPU-only environments, use the standard Dockerfile:

docker build -t paperless-rag .

4. Verify the deployment

# Check services are healthy
docker compose -f docker-compose.prod.yml ps

# Test the health endpoint (available after model loading, ~2 min)
curl http://localhost:8080/health

# Test GPU access
docker compose -f docker-compose.prod.yml run --rm --no-deps \
  --entrypoint python paperless-rag \
  -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"

The /health endpoint checks CUDA GPU health, database connectivity, and sync-loop liveness. Docker uses it to automatically restart the container on GPU failures (e.g., CUDA device-side assert).
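The automatic restart relies on a container healthcheck that polls /health. A sketch of what such a healthcheck could look like in Compose — the values and the use of curl are assumptions; check docker-compose.prod.yml for the actual settings:

```yaml
# Hypothetical healthcheck block for the paperless-rag service.
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 180s   # allow time for model loading (~2 min)
```

A generous start_period matters here: the endpoint only comes up after the ~10 GB of models have loaded, and without it the container could be marked unhealthy during a normal cold start.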

5. Configure your MCP client

The MCP server uses Streamable HTTP transport on port 8080.

Claude Desktop

Add to claude_desktop_config.json:

{
  "mcpServers": {
    "paperless-rag": {
      "command": "docker",
      "args": [
        "compose", "-f", "/path/to/paperless-rag/docker-compose.prod.yml",
        "run", "--rm", "-i", "paperless-rag"
      ]
    }
  }
}
Remote via SSH

If deployed on a different machine:

{
  "mcpServers": {
    "paperless-rag": {
      "command": "ssh",
      "args": [
        "-i", "~/.ssh/id_ed25519", "user@gpu-host",
        "cd ~/paperless-rag && docker compose -f docker-compose.prod.yml run --rm -i paperless-rag"
      ]
    }
  }
}
Direct Python (development)
{
  "mcpServers": {
    "paperless-rag": {
      "command": "python",
      "args": ["-m", "paperless_rag.mcp_server"],
      "env": {
        "PAPERLESS_URL": "http://paperless.local:8000",
        "PAPERLESS_TOKEN": "your-token",
        "DATABASE_URL": "postgresql://user:pass@localhost:5432/paperless_rag",
        "LLM_BASE_URL": "http://localhost:11434/v1",
        "LLM_API_KEY": "unused",
        "LLM_MODEL": "qwen3.5:27b",
        "GOTENBERG_URL": "http://localhost:3000",
        "EMBEDDING_MODEL": "perplexity-ai/pplx-embed-context-v1-4B",
        "RERANKER_MODEL": "nvidia/llama-nemotron-rerank-1b-v2"
      }
    }
  }
}

First run

On the first invocation, the server will:

  1. Initialize the database schema (tables, indexes)
  2. Download ML models (~10 GB total) — cached in the model-cache Docker volume
  3. Start the sync loop, pulling documents from Paperless-ngx every 5 minutes

Development

python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

# Run tests
pytest

# Lint and format
ruff check src/ tests/
ruff format src/ tests/

A development PostgreSQL with pgvector is available via Docker Compose:

docker compose up -d postgres   # pgvector on localhost:5433

Design decisions

  • Marker over OCR — Marker produces structured Markdown preserving tables, headings, and reading order. Paperless-ngx's built-in OCR (Tesseract) destroys document structure.
  • Separate database — Uses its own PostgreSQL instance with pgvector, not the Paperless-ngx database.
  • Polling, not webhooks — Paperless-ngx lacks native webhook support. The sync engine polls every 5 minutes (configurable via SYNC_INTERVAL_SECONDS).
  • Hybrid search — 0.7 vector + 0.3 keyword weighting combines semantic understanding with exact term matching.
  • pgvector HNSW limit — The 2560-dim embeddings exceed pgvector's 2000-dim HNSW index limit. Exact (sequential) vector search is used, which is performant for up to a few thousand documents.
  • Sync error tracking — Failed documents are recorded in sync_errors and skipped on subsequent cycles. Per-document timeout (configurable via DOCUMENT_TIMEOUT_SECONDS) prevents the sync from hanging on problematic files.
  • Health endpoint — /health probes CUDA (tensor op in a thread), database (SELECT 1), and sync-loop staleness. Docker restarts the container automatically on GPU failures.
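The 0.7/0.3 hybrid weighting above reduces to a weighted sum of the two scores. A sketch under the assumption that both scores are already normalized to [0, 1] — rag.py may instead combine ranks (e.g. reciprocal rank fusion), so treat this as the idea, not the implementation:

```python
# Sketch of hybrid scoring: weighted sum of normalized vector-similarity and
# keyword (full-text) scores, using the documented 0.7/0.3 split.
def hybrid_score(vector_score: float, keyword_score: float,
                 vector_weight: float = 0.7) -> float:
    # Both inputs are assumed to be normalized to [0, 1].
    return vector_weight * vector_score + (1 - vector_weight) * keyword_score

# A chunk that is semantically close but lacks the exact keyword can still
# outrank a pure keyword match.
semantic_hit = hybrid_score(0.9, 0.1)  # strong vector match, weak keyword
keyword_hit = hybrid_score(0.3, 1.0)   # exact keyword, weak vector match
```

With the 0.7 weight the semantic hit (0.66) beats the exact keyword match (0.51); lowering HYBRID_SEARCH_VECTOR_WEIGHT shifts that balance toward exact term matching.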
