
Backend Documentation: RooCode Data Query

1. Overview

The backend of the RooCode Data Query application is a Python-based system designed to support a Retrieval Augmented Generation (RAG) chat interface. Key components include:

  • Streamlit (app.py): Provides the web-based User Interface (UI) and handles user interactions.
  • Ollama: Serves Large Language Models (LLMs) locally, which are accessed via the ollama Python library.
  • ChromaDB: Acts as the vector knowledge base, storing embeddings of data scraped from Reddit and other sources.
  • Data Processing Scripts:
    • scrape_reddit.py: Fetches data from Reddit.
    • ingest.py: Processes and embeds data into ChromaDB.
  • Core Logic (converse.py): Manages the RAG process, including context retrieval and LLM interaction.

This architecture allows users to ask natural language questions and receive answers generated by an LLM, informed by the specialized knowledge base.

2. Core Modules/Scripts

app.py

  • Role: Main entry point for the Streamlit web application.
  • Responsibilities:
    • Renders the user interface (chat window, sidebar controls, etc.).
    • Captures user queries from the input field.
    • Manages the selection of Ollama LLMs from a dynamically fetched list.
    • Displays the currently active model.
    • Orchestrates the chat response process by calling the Converse class in converse.py.
    • Handles chat history display and clearing.
    • Logs user queries, AI responses, and response times.
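
The responsibilities above can be sketched as a minimal Streamlit chat loop. This is an illustrative reconstruction, not the project's actual source: the `Converse(model=...).ask(...)` API, the session-state keys, and the hardcoded model list are all assumptions.

```python
import time
import streamlit as st
from converse import Converse  # module and class names taken from this document

st.title("RooCode Data Query")

# In the real app this list is fetched dynamically from Ollama (see section 7);
# it is hardcoded here for brevity.
model = st.sidebar.selectbox("Model", ["llama3:8b", "mistral"])
st.sidebar.caption(f"Active model: {model}")

if "history" not in st.session_state or st.sidebar.button("Clear chat"):
    st.session_state.history = []

# Replay prior turns so the conversation persists across reruns.
for role, text in st.session_state.history:
    with st.chat_message(role):
        st.markdown(text)

if query := st.chat_input("Ask a question"):
    with st.chat_message("user"):
        st.markdown(query)
    start = time.time()
    answer = Converse(model=model).ask(query)  # hypothetical method name
    elapsed = time.time() - start              # logged to query_log.txt (section 7)
    with st.chat_message("assistant"):
        st.markdown(answer)
    st.session_state.history += [("user", query), ("assistant", answer)]
```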

converse.py

  • Role: Contains the Converse class, which encapsulates the core RAG logic.
  • Responsibilities:
    • Initializes a connection to the ChromaDB vector store (./chroma_db) using langchain_chroma.
    • Uses HuggingFaceEmbeddings (specifically sentence-transformers/all-MiniLM-L6-v2) for generating query embeddings.
    • Retrieves relevant document chunks from ChromaDB based on semantic similarity to the user's query.
    • Constructs a prompt for the LLM, incorporating the retrieved context and the original user query.
    • Interacts with the selected Ollama LLM (via OllamaLLM from langchain_ollama) to generate a response.
    • Utilizes Langchain components like ChatPromptTemplate, StrOutputParser, and RunnablePassthrough to build and execute the RAG chain.
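
Assembled from those components, the chain might look like the following sketch. The prompt wording, the retriever's `k`, and the default model are assumptions; the imports match the packages named above.

```python
from langchain_chroma import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_ollama import OllamaLLM

# Reconnect to the locally persisted vector store built by ingest.py.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})  # k is an assumption

# Prompt text is illustrative, not copied from converse.py.
prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)
llm = OllamaLLM(model="llama3:8b")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# LCEL chain: retrieve context, fill the prompt, call the LLM, parse to str.
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# answer = chain.invoke("How do I configure RooCode?")
```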

scrape_reddit.py

  • Role: Responsible for collecting data from Reddit.
  • Responsibilities:
    • Uses the PRAW (Python Reddit API Wrapper) library to interact with the Reddit API.
    • Reads Reddit API credentials (client_id, client_secret, user_agent) and scraping parameters (subreddit, post_limit, output_file) from the reddit.config.json file.
    • Fetches post titles, selftext, and top-level comments from the specified subreddit.
    • Saves the scraped textual data into a plain text file (default: reddit_data.txt).
    • Includes error handling for missing or malformed reddit.config.json.
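
A condensed sketch of the scraping loop, using PRAW calls as documented. The use of `hot()` sorting and the exact output formatting are assumptions; the actual script may differ.

```python
import json
import praw

# Load credentials and parameters (structure described in section 5).
with open("reddit.config.json") as f:
    config = json.load(f)

reddit = praw.Reddit(**config["reddit"])  # client_id, client_secret, user_agent
scrape = config["scrape_config"]

with open(scrape["output_file"], "w", encoding="utf-8") as out:
    subreddit = reddit.subreddit(scrape["subreddit"])
    for post in subreddit.hot(limit=scrape["post_limit"]):  # sort order assumed
        out.write(f"Title: {post.title}\n{post.selftext}\n")
        post.comments.replace_more(limit=0)   # drop "load more comments" stubs
        for comment in post.comments:         # top-level comments only
            out.write(f"Comment: {comment.body}\n")
        out.write("\n")
```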

ingest.py

  • Role: Processes raw text data and populates the ChromaDB vector store.
  • Responsibilities:
    • Loads text data from source files (e.g., reddit_data.txt generated by scrape_reddit.py, and github_data.txt which is assumed to be manually provided).
    • Uses RecursiveCharacterTextSplitter from Langchain to divide the documents into smaller, manageable chunks suitable for embedding.
    • Generates vector embeddings for these text chunks using HuggingFaceEmbeddings (sentence-transformers/all-MiniLM-L6-v2).
    • Creates or updates the ChromaDB vector store located at ./chroma_db, storing the text chunks and their corresponding embeddings. This database is configured for local persistence.
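
A compact sketch of that pipeline; the chunk size and overlap values are assumptions, and the real script's file handling may differ.

```python
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Read whichever source files are present.
texts = []
for path in ("reddit_data.txt", "github_data.txt"):
    try:
        with open(path, encoding="utf-8") as f:
            texts.append(f.read())
    except FileNotFoundError:
        print(f"Skipping missing source file: {path}")

# Split into overlapping chunks suitable for embedding (sizes are assumptions).
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.create_documents(texts)

# Embed and persist to the local ChromaDB directory used by converse.py.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")
```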

3. Data Flow

Data Collection

  1. Reddit Data:
    • User configures reddit.config.json with API keys and scraping parameters.
    • scrape_reddit.py is executed.
    • The script connects to the Reddit API using credentials from reddit.config.json.
    • Specified posts and comments are fetched and written to reddit_data.txt.
  2. GitHub Data:
    • This data is assumed to be manually prepared and placed in a file named github_data.txt. The process for creating this file is external to the application's automated scripts.

Data Ingestion & Storage

  1. ingest.py is executed.
  2. The script reads raw text from reddit_data.txt and github_data.txt.
  3. The text is processed:
    • Documents are split into smaller chunks.
    • Embeddings are generated for each chunk.
  4. These chunks and their embeddings are stored in the ChromaDB vector store located at ./chroma_db.

Retrieval & Response Generation (RAG)

  1. A user submits a query through the Streamlit UI in app.py.
  2. app.py passes the query and current agent configuration (including selected model) to an instance of the Converse class in converse.py.
  3. converse.py performs a similarity search:
    • The user's query is embedded.
    • ChromaDB (./chroma_db) is queried to find the most relevant document chunks based on vector similarity.
  4. The retrieved context (text from the relevant chunks) and the original user query are combined into a prompt.
  5. This prompt is sent to the Ollama LLM (selected by the user in the UI).
  6. The LLM generates a response based on the prompt and context.
  7. The response is passed back through converse.py to app.py, which then displays it in the chat UI.
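
Steps 3-6 can be illustrated with a self-contained toy: word-overlap scoring stands in for vector similarity, and a stub function stands in for the Ollama LLM (the real app uses sentence-transformer embeddings and ChromaDB, as described above).

```python
def retrieve(query, chunks, k=2):
    """Rank chunks by word overlap with the query (stand-in for vector search)."""
    q_words = set(query.lower().split())
    scored = sorted(chunks, key=lambda c: len(q_words & set(c.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query, context_chunks):
    """Combine retrieved context and the original query into one prompt."""
    context = "\n".join(context_chunks)
    return f"Answer using this context:\n{context}\n\nQuestion: {query}"

def stub_llm(prompt):
    """Stand-in for the Ollama LLM call."""
    return f"(model response to a {len(prompt)}-char prompt)"

chunks = [
    "RooCode supports custom agent personas.",
    "ChromaDB stores the document embeddings locally.",
    "Streamlit renders the chat interface.",
]
top = retrieve("Where are embeddings stored?", chunks)
answer = stub_llm(build_prompt("Where are embeddings stored?", top))
```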

4. Key Libraries and Services

  • Python 3.x: The core programming language for all backend scripts.
  • Streamlit: Used for creating the interactive web UI (app.py).
  • Ollama (and ollama library): Provides local hosting and access to various LLMs. The ollama Python library is used for model listing, and Langchain's OllamaLLM for RAG.
  • Langchain (langchain, langchain_core, langchain_ollama, langchain_chroma, langchain_huggingface): A framework used extensively for building the RAG application. Key components include:
    • Chroma: Vector store integration.
    • HuggingFaceEmbeddings: For generating text embeddings.
    • OllamaLLM: Wrapper for Ollama models.
    • Prompt templates, output parsers, and runnable chains.
  • ChromaDB: The vector store used for storing and retrieving document embeddings, enabling semantic search. Accessed via langchain_chroma.
  • HuggingFace Sentence Transformers: Specifically sentence-transformers/all-MiniLM-L6-v2, used via langchain_huggingface to create dense vector embeddings for text.
  • PRAW (Python Reddit API Wrapper): Used by scrape_reddit.py to interact with the Reddit API.
  • TinyDB: A lightweight, document-oriented database used to store application settings (like the selected model and agent persona) in db.json.

5. Configuration Files

reddit.config.json

  • Used by: scrape_reddit.py
  • Purpose: Stores credentials and parameters for Reddit data scraping.
  • Structure:
    {
        "reddit": {
            "client_id": "YOUR_REDDIT_CLIENT_ID",
            "client_secret": "YOUR_REDDIT_CLIENT_SECRET",
            "user_agent": "YOUR_REDDIT_USER_AGENT"
        },
        "scrape_config": {
            "subreddit": "RooCode",
            "post_limit": 100,
            "output_file": "reddit_data.txt"
        }
    }
    (subreddit is an example; set it to your target subreddit. JSON does not allow inline comments, so the file must look exactly like this.)
  • Setup: Users must copy reddit.config.example.json to reddit.config.json and populate it with their actual Reddit API credentials and desired scraping settings.
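
A loader with the error handling described in section 7 might look like this; the exact messages and the function name are assumptions.

```python
import json
import sys

# Keys the scraper needs; missing keys abort with a helpful message.
REQUIRED_KEYS = {
    "reddit": ["client_id", "client_secret", "user_agent"],
    "scrape_config": ["subreddit", "post_limit", "output_file"],
}

def load_config(path="reddit.config.json"):
    try:
        with open(path) as f:
            config = json.load(f)
    except FileNotFoundError:
        sys.exit(f"{path} not found. Copy reddit.config.example.json "
                 f"to {path} and fill in your credentials.")
    except json.JSONDecodeError as exc:
        sys.exit(f"{path} is not valid JSON: {exc}")
    for section, keys in REQUIRED_KEYS.items():
        missing = [k for k in keys if k not in config.get(section, {})]
        if missing:
            sys.exit(f"{path} is missing keys in '{section}': {missing}")
    return config
```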

db.json

  • Used by: app.py (reads and writes the selected model and agent settings); converse.py receives the agent configuration indirectly via app.py.
  • Purpose: Stores persistent agent settings, primarily the last selected Ollama model and the system message/persona for the AI.
  • Structure (Example):
    {
        "_default": {
            "1": {
                "model": "llama3:8b",
                "system_message": "You are a helpful assistant...",
                "user_name": "User",
                "agent_name": "RooCode Assistant"
            }
        }
    }
    (_default is TinyDB's default table; a named table such as agent may be used instead.)
  • Setup: Automatically created and managed by the application. app.py initializes it with default values if it doesn't exist or is empty.
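
Seeding defaults with TinyDB could look roughly like this; the default values mirror the example structure above, but the actual defaulting logic in app.py may differ.

```python
from tinydb import TinyDB

# Default agent record; values mirror the example structure above.
DEFAULT_AGENT = {
    "model": "llama3:8b",
    "system_message": "You are a helpful assistant...",
    "user_name": "User",
    "agent_name": "RooCode Assistant",
}

db = TinyDB("db.json")
if not db.all():            # first run, or file was empty: seed defaults
    db.insert(DEFAULT_AGENT)

agent = db.all()[0]         # the app reads and updates this single record
```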

6. Setup and Running the Backend Components

Prerequisites

  • Python 3.x installed.
  • Ollama service installed, running, and accessible by the application. Ensure desired models (e.g., llama3:8b) are pulled via ollama pull <model_name>.
  • Git for cloning the repository.

Installation

  1. Clone the repository (if applicable).
  2. Install required Python packages:
    pip install -r requirements.txt

Data Population Workflow

  1. Configure Reddit Scraping:
    • Copy reddit.config.example.json to reddit.config.json.
    • Edit reddit.config.json to add your Reddit API client_id, client_secret, user_agent, and set your desired subreddit and post_limit.
  2. Scrape Reddit Data:
    • Run the scraper script from the project's root directory:
      python scrape_reddit.py
    • This will create/update the output_file (e.g., reddit_data.txt) specified in reddit.config.json.
  3. Prepare GitHub Data (if used):
    • Ensure the github_data.txt file is present in the root directory and contains the desired GitHub-related text data.
  4. Ingest Data into Vector Store:
    • Run the ingestion script from the project's root directory:
      python ingest.py
    • This will process reddit_data.txt (and github_data.txt if present) and populate/update the ChromaDB vector store in the ./chroma_db directory.

Running the Application

  1. Ensure the Ollama service is running.
  2. Run the Streamlit application from the project's root directory:
    streamlit run app.py
  3. The application will typically open in your default web browser, or it will provide a URL (e.g., http://localhost:8501).

7. Logging and Error Handling

  • Query Logging (query_log.txt):
    • app.py logs each user query, the AI's response, the response time, and the model used.
    • This log is stored in query_log.txt in the project's root directory.
    • Format: %(asctime)s - %(message)s (e.g., 2023-10-27 10:00:00,123 - Query: What is X? | Response: Y | Time: 1.23s | Model: llama3:8b)
  • Configuration Errors:
    • scrape_reddit.py: Includes try-except blocks for FileNotFoundError if reddit.config.json is missing and json.JSONDecodeError if it is malformed, printing user-friendly messages that guide the user to create or fix the file.
    • It also checks for the presence of essential keys within the configuration JSON.
  • Ollama Model Fetching Errors:
    • app.py: When fetching the list of available models from Ollama, it includes a try-except block. If the call to ollama.list() fails (e.g., Ollama service not running), it prints an error to the console (once per session) and falls back to a predefined list of common models to ensure the application remains usable.
  • General Query Processing Errors:
    • app.py: The main query processing loop includes a try-except Exception as e block. If any error occurs during context retrieval or LLM interaction, an error message (Error processing query: {str(e)}) is displayed as the AI's response in the UI, and the error is logged to query_log.txt.