Thoth is a local-first, privacy-focused knowledge agent that combines Retrieval-Augmented Generation (RAG) with multi-source information retrieval. It lets you upload your own documents, ask questions in a conversational chat interface, and get cited answers drawn from your documents, Wikipedia, Arxiv, and the web — all powered by a locally-running LLM via Ollama.
In ancient Egyptian mythology, Thoth (𓁟) was the god of wisdom, writing, and knowledge — the divine scribe who recorded all human understanding. He was credited with inventing hieroglyphs, maintaining the library of the gods, and serving as the impartial judge of truth. Naming a private knowledge agent after Thoth felt fitting: like its namesake, this tool is built to gather, organize, and faithfully retrieve knowledge — while keeping everything under your control, running locally on your own machine.
- Multi-turn conversational Q&A with full message history
- Persistent conversation threads stored in a local SQLite database
- Auto-naming — threads are automatically named after the first question asked
- Thread switching — resume any previous conversation seamlessly
- Thread deletion — remove conversations you no longer need
- Dynamic model switching — choose any Ollama-supported model from the Settings panel in the sidebar
- Curated model list — includes popular models (Llama, Qwen, Gemma, Mistral, DeepSeek, Phi, etc.) alongside any models you've already downloaded
- Automatic download — selecting a model you haven't downloaded yet triggers an in-app download with a live progress indicator
- First-run setup — if the default model isn't available, the app automatically downloads it on startup
- Local indicators — models are marked with ✅ (downloaded) or ⬇️ (needs download) in the selector
- In-app configuration — add and edit API keys directly from the ⚙️ Settings panel (no need to edit source files)
- Persistent storage — keys are saved to `api_keys.json` in the user data directory and loaded automatically on startup
- Password-masked inputs — keys are hidden by default in the UI for security
- Extensible — add new keys by editing the `API_KEY_DEFINITIONS` dict in `api_keys.py`
- Smart context assessment — an embedding similarity check first determines whether existing context already covers the question, falling back to an LLM judgment call only for ambiguous cases
- Contextual compression retrieval — each retriever is wrapped with a `ContextualCompressionRetriever` + `LLMChainExtractor` that filters and extracts only query-relevant content per document before it enters the context
- Query rewriting — follow-up questions with pronouns or references (e.g., "how are they related?") are automatically rewritten into standalone search queries using conversation history, so retrievers receive semantically complete queries
- Parallel retrieval — all enabled retrieval sources are queried simultaneously via `ThreadPoolExecutor`, reducing total retrieval time from the sum of all sources to the duration of the slowest one
- Context deduplication — embedding-based cosine similarity deduplication operates at two levels:
- Within-retrieval: removes near-duplicate documents returned by different sources in the same query
- Cross-turn: prevents adding context that is too similar to already-accumulated context from previous turns
- Character-based context & message trimming — context entries and message history are trimmed to fit within a character budget (1 token ≈ 4.5 characters), keeping the most recent entries and preventing context window overflow in long conversations
- Accumulated context — context from multiple queries within a thread builds up rather than being replaced
- Configurable retrieval sources — toggle each retrieval backend on/off from the Settings panel:
| Source | Description |
|---|---|
| 📄 Documents | FAISS vector similarity search over your indexed files |
| 🌐 Wikipedia | Real-time Wikipedia article retrieval |
| 📚 Arxiv | Academic paper search via the Arxiv API |
| 🔍 Web Search | Live web search via the Tavily Search API |
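The parallel-retrieval behavior described above can be sketched with the standard library alone. The retriever functions below are simplified stand-ins (the real pipeline calls FAISS, Wikipedia, Arxiv, and Tavily retriever objects), but the concurrency pattern is the same:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in retrievers; each returns a list of "hits" for a query.
def retrieve_documents(query):
    return [f"doc hit for: {query}"]

def retrieve_wikipedia(query):
    return [f"wiki hit for: {query}"]

def retrieve_web(query):
    return [f"web hit for: {query}"]

def parallel_retrieve(query, retrievers):
    """Query every enabled source concurrently, so total latency is
    bounded by the slowest source rather than the sum of all sources."""
    with ThreadPoolExecutor(max_workers=len(retrievers)) as pool:
        futures = [pool.submit(r, query) for r in retrievers]
        results = []
        for future in futures:
            results.extend(future.result())  # blocks until that source finishes
    return results

hits = parallel_retrieve("what is RAG?",
                         [retrieve_documents, retrieve_wikipedia, retrieve_web])
```

Because each `future.result()` call only waits for its own source, a slow web search no longer delays the already-finished document lookup.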
- Upload & index PDF, DOCX, DOC, and TXT files
- Automatic chunking with `RecursiveCharacterTextSplitter` (4000-char chunks, 200-char overlap)
- FAISS vector store with persistent local storage
- Embedding model: `Qwen/Qwen3-Embedding-0.6B` via HuggingFace
- Duplicate detection — already-processed files are skipped
- Clear all — one-click reset of the entire vector store and processed files list
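The chunking parameters above (4000-character chunks with 200-character overlap) can be illustrated with a minimal sliding-window splitter. This is only a sketch: the actual `RecursiveCharacterTextSplitter` additionally tries to split on natural boundaries like paragraphs and sentences.

```python
def split_text(text, chunk_size=4000, chunk_overlap=200):
    """Naive fixed-width splitter: each chunk repeats the last
    `chunk_overlap` characters of the previous chunk, so content that
    straddles a boundary still appears intact in at least one chunk."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# 9000 characters -> chunks starting at offsets 0, 3800, 7600
text = "".join(str(i % 10) for i in range(9000))
chunks = split_text(text)
```

The overlap is what makes retrieval robust: a sentence cut at offset 4000 is still fully contained in the chunk starting at 3800.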
Every piece of information in an answer is cited:
- `(Source: document.pdf)` for uploaded documents
- `(Source: https://en.wikipedia.org/...)` for Wikipedia
- `(Source: https://arxiv.org/abs/...)` for Arxiv papers
- `(Source: https://...)` for web search results
- `(Source: Internal Knowledge)` when the LLM uses its own training data
```
┌────────────────────────────────────────────────────────────┐
│                Streamlit Frontend (app.py)                 │
│  ┌──────────┐  ┌─────────────────────┐  ┌────────────┐     │
│  │ Sidebar  │  │   Chat Interface    │  │  Document  │     │
│  │ Threads  │  │   (Q&A Messages)    │  │  Manager   │     │
│  └──────────┘  └─────────────────────┘  └────────────┘     │
└────────────────────────┬───────────────────────────────────┘
                         │
                         ▼
┌────────────────────────────────────────────────────────────┐
│              LangGraph RAG Pipeline (rag.py)               │
│                                                            │
│  START ──▶ needs_context ──┬──▶ get_context ──▶ generate   │
│                            │                     _answer   │
│                            └──────────────────▶ generate   │
│                                                  _answer   │
│                                                     │      │
│                                                     ▼      │
│                                                    END     │
└────────────────────────────────────────────────────────────┘
       │                    │                    │
       ▼                    ▼                    ▼
┌────────────┐      ┌──────────────┐      ┌──────────────┐
│   Ollama   │      │  Retrievers  │      │    SQLite    │
│    LLM     │      │   (FAISS,    │      │ Checkpointer │
│ (qwen3-vl) │      │    Wiki,     │      │ (threads.db) │
└────────────┘      │  Arxiv, Web) │      └──────────────┘
                    └──────────────┘
```
The RAG pipeline is implemented as a LangGraph `StateGraph` with three nodes:

1. `needs_context` — First checks if existing accumulated context is already relevant to the current question using embedding cosine similarity (fast, no LLM call). If no existing context is relevant, falls back to an LLM judgment call. Returns `Yes`/`No`.
2. `get_context` — Rewrites the user's question into a standalone search query (resolving pronouns/references from conversation history), then queries all enabled retrieval backends in parallel via `ThreadPoolExecutor`. Each retriever is wrapped with a `ContextualCompressionRetriever` that extracts only query-relevant content per document. Results are deduplicated within the retrieval batch and against existing accumulated context using embedding cosine similarity.
3. `generate_answer` — Trims accumulated context and message history to fit within the model's character budget, then formats the system prompt, context, and question into a final prompt and generates the answer with citations.
A conditional edge routes from `needs_context` to either `get_context` or directly to `generate_answer`.
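Both the `needs_context` relevance check and the deduplication steps reduce to cosine similarity between embedding vectors. A minimal stdlib sketch follows; the real pipeline embeds text with `Qwen/Qwen3-Embedding-0.6B`, and the 0.9 threshold here is illustrative, not Thoth's actual setting:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def deduplicate(embedded_docs, threshold=0.9):
    """Keep a document only if it is not too similar to any already-kept one.
    `embedded_docs` is a list of (text, embedding_vector) pairs."""
    kept = []
    for text, vec in embedded_docs:
        if all(cosine_similarity(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((text, vec))
    return [text for text, _ in kept]

docs = [("A", [1.0, 0.0]), ("A'", [0.99, 0.05]), ("B", [0.0, 1.0])]
unique = deduplicate(docs)  # "A'" is dropped as a near-duplicate of "A"
```

The same comparison, applied between the new question's embedding and accumulated context entries, is what lets `needs_context` skip retrieval without an LLM call.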
```
Thoth/                      # Source / installation directory
├── app.py                  # Streamlit frontend — UI, chat, document upload
├── rag.py                  # LangGraph RAG pipeline — nodes, edges, state
├── documents.py            # Document loading, chunking, FAISS vector store
├── models.py               # LLM configuration (Ollama)
├── threads.py              # Thread/conversation management (SQLite)
├── api_keys.py             # API key management (load/save/apply from JSON)
└── README.md

~/.thoth/                   # User data directory (auto-created at runtime)
├── api_keys.json           # Stored API keys
├── processed_files.json    # Tracks which files have been indexed
├── threads.db              # SQLite database for thread metadata
└── vector_store/           # FAISS index files
    ├── index.faiss
    └── index.pkl
```
Data directory: All user data is stored in `~/.thoth/` (`%USERPROFILE%\.thoth\` on Windows). This keeps data separate from the app installation and avoids write-permission issues in protected directories like `C:\Program Files\`. Override the location by setting the `THOTH_DATA_DIR` environment variable.
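The data-directory resolution described above might look like this sketch (the exact logic in Thoth's source may differ; only the `THOTH_DATA_DIR` override and the `~/.thoth` fallback come from the README):

```python
import os
from pathlib import Path

def get_data_dir() -> Path:
    """Resolve the user data directory: THOTH_DATA_DIR wins if set,
    otherwise fall back to ~/.thoth. Created on first use."""
    override = os.environ.get("THOTH_DATA_DIR")
    data_dir = Path(override) if override else Path.home() / ".thoth"
    data_dir.mkdir(parents=True, exist_ok=True)
    return data_dir

# Example: vector_store_path = get_data_dir() / "vector_store"
```

Using `Path.home()` rather than a path relative to the install directory is what sidesteps the `C:\Program Files\` write-permission problem on Windows.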
| File | Purpose |
|---|---|
| `app.py` | Streamlit application with three-panel layout: sidebar (threads + settings), center (chat), right (documents). Handles UI state, file uploads, model selection, retrieval source toggles, and invokes the RAG graph. |
| `rag.py` | Defines the LangGraph state machine with `SessionState`, retriever initialization, context compression, and answer generation. Also supports a CLI mode via `__main__`. |
| `documents.py` | Manages document ingestion: loading (PDF/DOCX/TXT), text splitting, embedding with `Qwen/Qwen3-Embedding-0.6B`, FAISS storage, and processed file tracking. |
| `models.py` | LLM model management — listing, downloading, and switching Ollama models at runtime. |
| `threads.py` | SQLite-backed thread metadata (create, list, rename, delete) and LangGraph `SqliteSaver` checkpointer for persisting conversation state. Data stored in `~/.thoth/threads.db`. |
| `api_keys.py` | API key management — defines available keys, reads/writes `~/.thoth/api_keys.json`, and applies keys as environment variables at startup. The Settings UI in `app.py` uses this module to let users add/edit keys. |
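The load-and-apply behavior attributed to `api_keys.py` can be sketched as follows. This is an assumption-laden illustration: the function name and return shape are invented here, while the file path and the apply-as-environment-variables behavior come from the table above.

```python
import json
import os
from pathlib import Path

# Path per the README; the real module may compute it via THOTH_DATA_DIR.
API_KEYS_FILE = Path.home() / ".thoth" / "api_keys.json"

def apply_api_keys(path=API_KEYS_FILE):
    """Load stored keys from JSON and export each as an environment
    variable, so libraries like Tavily pick them up automatically.
    Returns the loaded mapping (empty if the file doesn't exist)."""
    path = Path(path)
    if not path.exists():
        return {}
    keys = json.loads(path.read_text())
    for name, value in keys.items():
        os.environ.setdefault(name, value)  # don't clobber an explicit env var
    return keys
```

Using `setdefault` (rather than assignment) is a common choice so a key exported in the shell still takes precedence over the stored file.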
- Python 3.11+
- Ollama installed and running locally
- Tavily API Key for web search (configured via the in-app Settings panel)
Note: You no longer need to manually pull a model — the app will automatically download the default model (`qwen3:8b`) on first run if it isn't available.
1. Clone the repository

   ```bash
   git clone https://github.com/yourusername/thoth.git
   cd thoth
   ```

2. Create a virtual environment

   ```bash
   python -m venv .venv
   ```

3. Activate the virtual environment

   ```bash
   # Windows
   .venv\Scripts\activate

   # macOS / Linux
   source .venv/bin/activate
   ```

4. Install dependencies

   ```bash
   pip install streamlit langchain-community langchain-core langchain-classic langchain-huggingface langchain-ollama langgraph faiss-cpu torch transformers pypdf python-docx unstructured
   ```

5. Configure API keys

   Launch the app and open ⚙️ Settings in the sidebar. Enter your API keys (e.g. Tavily) in the API Keys section. Keys are saved to `~/.thoth/api_keys.json` and loaded automatically on future runs.

   Alternatively, you can create `~/.thoth/api_keys.json` manually:

   ```json
   { "TAVILY_API_KEY": "your-tavily-api-key" }
   ```

   To use a custom data directory, set the `THOTH_DATA_DIR` environment variable before launching.

6. Ensure Ollama is running

   ```bash
   ollama serve
   ```
```bash
streamlit run app.py
```

This opens the Thoth web UI in your browser with:
- Left sidebar: Create, switch, and delete conversation threads; Settings panel at the bottom for model selection, retrieval source toggles, and API key management
- Center: Chat interface for asking questions
- Right panel: Upload and manage documents
```bash
python rag.py
```

This starts an interactive terminal session where you can select/create threads and ask questions directly.
1. User asks a question in the chat interface.
2. The `needs_context` node first checks if existing context is relevant to the question via embedding similarity. If no relevant context exists, it falls back to an LLM call to decide whether new retrieval is needed.
3. If new context is needed, the `get_context` node:
   - Rewrites the question into a standalone query using conversation history (resolving pronouns like "they", "it", etc.)
   - Queries the enabled sources in parallel via `ThreadPoolExecutor`:
     - FAISS vector store (uploaded documents)
     - Wikipedia API
     - Arxiv API
     - Tavily web search
   - Each source uses a `ContextualCompressionRetriever` to extract only relevant content per document
   - Results are deduplicated within the batch and against existing accumulated context
4. New context is appended to the existing context (not replaced), subject to character-budget trimming.
5. The `generate_answer` node trims context and messages to fit within the model's character budget, then combines the system prompt, context, and question to produce a cited answer.
6. The full conversation state is checkpointed in SQLite, enabling thread persistence across sessions.
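The character-budget trimming used in the later steps of this workflow can be sketched as a keep-newest-first walk over the accumulated entries. The budget value below is illustrative; the real budget is derived from the model's context window via the 1 token ≈ 4.5 characters heuristic.

```python
def trim_to_budget(entries, max_chars):
    """Keep the most recent entries whose combined length fits the
    character budget, dropping the oldest entries first."""
    kept, total = [], 0
    for entry in reversed(entries):          # walk newest -> oldest
        if total + len(entry) > max_chars:
            break                            # everything older is dropped too
        kept.append(entry)
        total += len(entry)
    return list(reversed(kept))              # restore chronological order

history = ["old " * 100, "mid " * 100, "new " * 100]  # 400 chars each
trimmed = trim_to_budget(history, max_chars=900)      # drops only "old ..."
```

Stopping at the first entry that overflows (instead of skipping it and continuing) preserves a contiguous, most-recent window, which matches the "keeping the most recent entries" behavior described above.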
Select a model directly from the ⚙️ Settings panel in the sidebar. You can also change the default model in `models.py`:

```python
DEFAULT_MODEL = "qwen3:8b"  # Change to any Ollama-supported model
```

Change the embedding model in `documents.py`:

```python
embedding_model = HuggingFaceEmbeddings(
    model_name="Qwen/Qwen3-Embedding-0.6B"  # Change to any HuggingFace embedding model
)
```

Adjust text splitting in `documents.py`:

```python
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=4000,    # Characters per chunk
    chunk_overlap=200   # Overlap between chunks
)
```

Modify the number of documents retrieved in `rag.py`:

```python
document_retriever = vector_store.as_retriever(search_kwargs={"k": 5})  # Top-k results
```

| Extension | Loader |
|---|---|
| `.pdf` | `PyPDFLoader` |
| `.docx` | `UnstructuredWordDocumentLoader` |
| `.doc` | `UnstructuredWordDocumentLoader` |
| `.txt` | `TextLoader` |
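A loader-dispatch step like the table above is commonly implemented as an extension map. In this sketch the LangChain loader classes are replaced by stand-in strings so the example stays self-contained; the real code in `documents.py` would instantiate the actual loader classes.

```python
from pathlib import Path

# Stand-ins for the LangChain loader classes listed in the table above.
LOADERS = {
    ".pdf": "PyPDFLoader",
    ".docx": "UnstructuredWordDocumentLoader",
    ".doc": "UnstructuredWordDocumentLoader",
    ".txt": "TextLoader",
}

def pick_loader(filename: str) -> str:
    """Map a filename to its loader by extension (case-insensitive)."""
    ext = Path(filename).suffix.lower()
    try:
        return LOADERS[ext]
    except KeyError:
        raise ValueError(f"Unsupported file type: {ext}")
```

Lower-casing the suffix means `report.PDF` and `report.pdf` dispatch identically; unknown extensions fail fast with a clear error instead of producing an empty index entry.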
This project is licensed under the MIT License.