Local RAG

Privacy-First Retrieval-Augmented Generation System

Python · FastAPI · React · pgvector · Ollama · License: MIT


A fully local, privacy-first document intelligence system — ingest PDFs and documents, generate dense vector embeddings via Ollama, and query them with a local LLM. Zero cloud calls. Zero data leakage.


Overview

Local RAG is an end-to-end Retrieval-Augmented Generation application that runs entirely on your own hardware. It implements the full RAG loop — document ingestion, chunking, embedding, vector search, context assembly, LLM generation, and citation extraction — without any external API calls. All models run locally via Ollama or a compatible llama.cpp server.

| Dimension | Value |
| --- | --- |
| Embedding model | nomic-embed-text (768-dim vectors, via Ollama) |
| Vector storage | PostgreSQL + pgvector (cosine similarity) |
| Fast LLM | llama3.1:8b |
| Accurate LLM | llama3.3:70b |
| Similarity threshold | 0.30 (configurable) |
| Top-K retrieval | 6 chunks (configurable) |
| Supported formats | PDF, TXT, MD, CSV |
| Auth | JWT HS256, 24-hour expiry |

RAG Pipeline

User uploads document
 │
 ▼
┌──────────────────────────────────────────────────────────────────┐
│ INGESTION PIPELINE │
│ │
│ 1. Text extraction (PyPDF2 per page / UTF-8 for text files) │
│ 2. Chunking (~900 tokens, overlapping windows) │
│ 3. Embedding (nomic-embed-text via Ollama, batched) │
│ 4. Storage (chunks + vector(768) → PostgreSQL) │
└───────────────────────────────┬──────────────────────────────────┘
 │ Document status: INDEXED
 ▼
User asks a question
 │
 ▼
┌──────────────────────────────────────────────────────────────────┐
│ QUERY PIPELINE │
│ │
│ 1. Embed question (same nomic-embed-text model) │
│ 2. Vector search (pgvector cosine similarity, top-K=6) │
│ 3. Similarity gate (threshold=0.30 — below → NOT_FOUND) │
│ 4. Context assembly (ranked chunks → source block) │
│ 5. LLM generation (llama3.1:8b / llama3.3:70b via Ollama) │
│ 6. Citation extract (chunk id, page, match %, snippet) │
│ 7. Persist (message + citations_json → PostgreSQL) │
└───────────────────────────────┬──────────────────────────────────┘
 │
 ▼
 Answer with inline citations returned to UI
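
The chunking step in the ingestion pipeline can be sketched as a sliding window over words. This is a minimal illustration, not the repository's `ChunkingService`: the `Chunk` dataclass, the `chunk_words` name, and the 100-word overlap are assumptions, and words stand in as a rough proxy for the ~900 tokens quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    index: int        # position of the chunk within the document
    page_number: int  # page the text was extracted from
    text: str


def chunk_words(text: str, page_number: int = 1,
                size: int = 900, overlap: int = 100) -> list[Chunk]:
    """Split text into windows of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far each window advances
    chunks: list[Chunk] = []
    for start in range(0, len(words), step):
        window = words[start:start + size]
        chunks.append(Chunk(len(chunks), page_number, " ".join(window)))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Each window repeats the last `overlap` words of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.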

System Architecture

┌─────────────────────────────────────────────────────────────────┐
│ Browser (React 18) │
│ │
│ React Router v7 · AuthContext (JWT) · UploadDropzone │
│ DocumentList · ChatPage (model toggle, temp slider, citations) │
└────────────────────┬────────────────────────────────────────────┘
 │ /api/* (Vite dev proxy → :8000)
 ▼
┌─────────────────────────────────────────────────────────────────┐
│ FastAPI (Uvicorn, async) │
│ │
│ Routers: auth · documents · indexing · rag · chats · health │
│ │
│ Application layer (use cases): │
│ IndexDocumentUseCase ← ingestion orchestration │
│ RAGQueryUseCase ← query orchestration │
│ │
│ Infrastructure: │
│ OllamaEmbeddingClient · OllamaLLMClient │
│ LlamaCppLLMClient · PgvectorRetriever │
│ ChunkingService · TextExtractor · FileStorage │
└────────────────────┬────────────────────────────────────────────┘
 │ asyncpg (SQLAlchemy async)
 ▼
┌─────────────────────────────────────────────────────────────────┐
│ PostgreSQL + pgvector │
│ │
│ users · documents · chunks · chunk_embeddings (vector(768)) │
│ chats · messages (citations_json JSONB) │
└─────────────────────────────────────────────────────────────────┘
 │ httpx (async)
 ▼
┌─────────────────────────────────────────────────────────────────┐
│ Ollama (localhost:11434) │
│ • nomic-embed-text — embedding generation │
│ • llama3.1:8b — fast inference │
│ • llama3.3:70b — accurate inference │
└─────────────────────────────────────────────────────────────────┘
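
The split between use cases and infrastructure in the diagram can be illustrated with an abstract client that the query use case depends on. The class names `LLMClient` and `RAGQueryUseCase` come from this README; the `generate` and `answer` method signatures, the prompt layout, and the `EchoLLMClient` stand-in are hypothetical and only show the shape of the dependency.

```python
import asyncio
from abc import ABC, abstractmethod


class LLMClient(ABC):
    """Interface the application layer depends on (method name is assumed)."""

    @abstractmethod
    async def generate(self, prompt: str, temperature: float) -> str: ...


class EchoLLMClient(LLMClient):
    """Stand-in backend; a real client would call Ollama or llama.cpp over HTTP."""

    async def generate(self, prompt: str, temperature: float) -> str:
        return f"[echo t={temperature}] {prompt}"


class RAGQueryUseCase:
    """Receives any LLMClient, so it never imports a concrete backend."""

    def __init__(self, llm: LLMClient) -> None:
        self._llm = llm

    async def answer(self, question: str, context: str,
                     temperature: float = 0.7) -> str:
        prompt = f"Sources:\n{context}\n\nQuestion: {question}"
        return await self._llm.generate(prompt, temperature)


if __name__ == "__main__":
    use_case = RAGQueryUseCase(EchoLLMClient())
    print(asyncio.run(use_case.answer("What is pgvector?", "chunk text")))
```

Because the use case only sees the abstract type, a test can inject the echo client and exercise the orchestration without a running model server.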

Tech Stack

Backend

| Component | Technology |
| --- | --- |
| Web framework | FastAPI + Uvicorn (async) |
| ORM | SQLAlchemy 2.0 (async engine via asyncpg) |
| Migrations | Alembic |
| Database | PostgreSQL + pgvector extension |
| Embeddings | Ollama nomic-embed-text (768-dim, batched) |
| LLM | Ollama (llama3.1:8b / llama3.3:70b) or llama.cpp |
| PDF parsing | PyPDF2 |
| Auth | bcrypt + python-jose (JWT HS256) |
| HTTP client | httpx (async) |
| Settings | pydantic-settings + .env |

Frontend

| Component | Technology |
| --- | --- |
| Framework | React 18 |
| Routing | React Router DOM v7 |
| Build tool | Vite 6 |
| State | React Context API (AuthContext) |
| API | Typed fetch wrapper with JWT injection |

Database Schema

users
 └── documents (status: UPLOADED → INDEXING → INDEXED | FAILED)
      └── chunks
           └── chunk_embeddings (vector(768), pgvector)

users
 └── chats
      └── messages (role, content, citations_json JSONB, answer_status)

| Table | Key Columns |
| --- | --- |
| users | id (UUID), email, password_hash |
| documents | id, user_id, original_filename, stored_path, mime_type, status, enabled |
| chunks | id, doc_id, chunk_index, page_number, text |
| chunk_embeddings | id, chunk_id, embedding (vector(768)), model_name |
| chats | id, user_id, title, created_at, updated_at |
| messages | id, chat_id, role, content, citations_json (JSONB), answer_status, selected_doc_ids_json |

Three Alembic migrations bootstrap the full schema, including enabling the pgvector extension.


API Reference

| Method | Route | Auth | Description |
| --- | --- | --- | --- |
| POST | /api/auth/signup | Public | Create user account |
| POST | /api/auth/login | Public | Authenticate, return JWT |
| POST | /api/documents/upload | JWT | Upload document (multipart/form-data) |
| GET | /api/documents | JWT | List user's documents with status |
| PATCH | /api/documents/:id/toggle | JWT | Enable/disable document from search scope |
| DELETE | /api/documents/:id | JWT | Delete document and all associated chunks |
| POST | /api/indexing/:doc_id | JWT | Trigger ingestion pipeline (re-index deletes old chunks) |
| POST | /api/rag/query | JWT | Execute RAG query, returns answer + citations |
| GET | /api/chats | JWT | List chat sessions |
| POST | /api/chats | JWT | Create new chat |
| GET | /api/chats/:id/messages | JWT | Retrieve message history for a chat |
| GET | /api/health | Public | Service health check |

RAG Query Request

{
 "chat_id": "uuid",
 "question": "What is the conclusion of the paper?",
 "scope": { "mode": "all" },
 "model_mode": "fast",
 "temperature": 0.7
}

RAG Query Response

{
 "answer": "The paper concludes that...",
 "answer_status": "ANSWERED",
 "citations": [
 {
 "chunk_id": "uuid",
 "document_name": "paper.pdf",
 "page_number": 12,
 "similarity_score": 0.87,
 "snippet": "In conclusion, the results demonstrate..."
 }
 ]
}
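
On the client side, the request and response shapes above might be handled as follows. This is a hedged sketch: the helper names are hypothetical, `"accurate"` as the second `model_mode` value is inferred from the fast/accurate model pairing rather than stated in this README, and the citation line format is purely illustrative.

```python
import json


def build_query(chat_id: str, question: str,
                model_mode: str = "fast", temperature: float = 0.7) -> bytes:
    """Serialise a request body for POST /api/rag/query."""
    if model_mode not in ("fast", "accurate"):  # "accurate" is an assumption
        raise ValueError(f"unknown model_mode: {model_mode}")
    payload = {
        "chat_id": chat_id,
        "question": question,
        "scope": {"mode": "all"},
        "model_mode": model_mode,
        "temperature": temperature,
    }
    return json.dumps(payload).encode()


def format_citations(response: dict) -> list[str]:
    """Render one display line per citation; empty unless the query was answered."""
    if response.get("answer_status") != "ANSWERED":
        return []
    return [
        f'{c["document_name"]} p.{c["page_number"]} '
        f'({c["similarity_score"]:.0%} match)'
        for c in response.get("citations", [])
    ]


if __name__ == "__main__":
    sample = {
        "answer": "The paper concludes that...",
        "answer_status": "ANSWERED",
        "citations": [{
            "chunk_id": "uuid", "document_name": "paper.pdf",
            "page_number": 12, "similarity_score": 0.87,
            "snippet": "In conclusion, the results demonstrate...",
        }],
    }
    print(format_citations(sample))
```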

Getting Started

Prerequisites

  • Python 3.11+
  • PostgreSQL with pgvector extension
  • Ollama with nomic-embed-text and llama3.1:8b pulled
  • Node.js 20+

1. Pull Required Ollama Models

ollama pull nomic-embed-text
ollama pull llama3.1:8b # fast mode
ollama pull llama3.3:70b # accurate mode (optional, requires ~40GB RAM)

2. Clone & Set Up Backend

git clone https://github.com/SriRammSS/rag-local-app.git
cd rag-local-app/apps/api

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

3. Configure Environment

Create apps/api/.env:

APP_NAME=RAG Local API
APP_ENV=development
DATABASE_URL=postgresql+asyncpg://postgres:<password>@localhost:5432/rag_local
JWT_SECRET_KEY=<generate-with: openssl rand -hex 32>
JWT_ALGORITHM=HS256
JWT_ACCESS_TOKEN_EXPIRE_MINUTES=1440

OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_EMBED_MODEL=nomic-embed-text

LLM_RUNTIME=ollama # "ollama" | "llamacpp"
LLM_BASE_URL=http://localhost:11434
LLM_FAST_MODEL=llama3.1:8b
LLM_ACCURATE_MODEL=llama3.3:70b

RAG_TOP_K=6
RAG_SIM_THRESHOLD=0.30
CORS_ORIGINS=["http://localhost:5173"]
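
The backend loads these values with pydantic-settings. As a dependency-free illustration of the same idea, the sketch below reads a few of the variables from a mapping and validates them; the `RagSettings` dataclass, the `load_settings` function, and the validation rules are assumptions, and the defaults simply mirror the values shown above.

```python
import json
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class RagSettings:
    top_k: int
    sim_threshold: float
    fast_model: str
    accurate_model: str
    cors_origins: list[str]


def load_settings(env: Mapping[str, str] = os.environ) -> RagSettings:
    """Build settings from environment-style key/value pairs."""
    settings = RagSettings(
        top_k=int(env.get("RAG_TOP_K", "6")),
        sim_threshold=float(env.get("RAG_SIM_THRESHOLD", "0.30")),
        fast_model=env.get("LLM_FAST_MODEL", "llama3.1:8b"),
        accurate_model=env.get("LLM_ACCURATE_MODEL", "llama3.3:70b"),
        # CORS_ORIGINS is written as a JSON array in the .env file
        cors_origins=json.loads(env.get("CORS_ORIGINS", "[]")),
    )
    if settings.top_k < 1:
        raise ValueError("RAG_TOP_K must be at least 1")
    if not -1.0 <= settings.sim_threshold <= 1.0:
        raise ValueError("RAG_SIM_THRESHOLD must be a valid cosine similarity")
    return settings


if __name__ == "__main__":
    print(load_settings())
```

Failing at startup on an out-of-range value is cheaper than discovering at query time that every search returns nothing.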

4. Initialize Database

# Create the database (the Alembic migrations enable the pgvector extension)
createdb rag_local

cd apps/api # from the repository root; skip if you are already there
alembic upgrade head

5. Start Backend & Frontend

# Terminal 1 — Backend
cd apps/api && uvicorn main:app --reload --port 8000

# Terminal 2 — Frontend
cd apps/web && npm install && npm run dev

Frontend at http://localhost:5173 · API at http://localhost:8000


Project Structure

rag-local-app/
├── apps/
│ ├── api/ # FastAPI backend (clean architecture)
│ │ ├── main.py # App factory — router registration
│ │ ├── core/ # Config, DB session, JWT helpers
│ │ ├── domain/models/ # SQLAlchemy ORM models
│ │ ├── application/
│ │ │ ├── interfaces/ # Abstract EmbeddingClient, LLMClient
│ │ │ └── use_cases/ # IndexDocumentUseCase, RAGQueryUseCase,
│ │ │ # AuthUseCase, DocumentUseCases, ChatUseCases
│ │ ├── infrastructure/
│ │ │ ├── embedding/ollama_client.py # Batched /api/embed calls
│ │ │ ├── llm/ # OllamaLLMClient · LlamaCppLLMClient
│ │ │ ├── pgvector_retriever.py # Cosine similarity search
│ │ │ ├── chunking_service.py # Overlapping word-based chunker
│ │ │ ├── text_extractor.py # PyPDF2 + plain text
│ │ │ └── repositories/ # Async SQLAlchemy repos
│ │ └── interfaces/api/ # FastAPI routers + Pydantic schemas
│ │ └── auth · documents · indexing · rag · chats · health
│ │
│ └── web/ # React 18 SPA
│ └── src/
│ ├── pages/ # LoginPage, SignupPage, DashboardPage,
│ │ # ChatLayout, ChatPage
│ ├── components/ # Navbar, DocumentList, UploadDropzone,
│ │ # ChatSidebar, ProtectedRoute
│ ├── contexts/AuthContext.jsx # JWT token state
│ └── lib/api.js # Fetch wrapper with auth injection
│
└── storage/uploads/ # Local file storage (gitignored)
 └── <user_id>/<doc_id>/<filename>

Architectural Decisions

| Decision | Rationale |
| --- | --- |
| Clean layered architecture | Strict separation of domain, application, infrastructure, and interface layers; use cases are testable without HTTP or database dependencies |
| Abstract LLM interface | LLMClient ABC decouples use cases from Ollama or llama.cpp specifics; swapping inference backends requires no changes to the RAG pipeline |
| pgvector over a dedicated vector DB | Keeps the stack to a single database engine; cosine similarity on 768-dim vectors at this scale does not require a dedicated ANN index service |
| Similarity threshold gate | Prevents the LLM from hallucinating answers when no relevant context exists; returns a clean NOT_FOUND rather than generating low-confidence responses |
| Document-scoped queries | scope.mode: "doc" restricts vector search to specific document IDs, enabling precise targeted questions without cross-document noise |
| Re-index deletes old chunks | Triggering indexing on an already-indexed document atomically deletes previous embeddings before re-generating them, which prevents stale vector contamination |

Author

Sri Ramm Sekar Sasirekha



Built to demonstrate production-grade RAG system design with clean architecture, local LLM inference, and pgvector-powered semantic search — entirely on-premise.
