The backend of the RooCode Data Query application is a Python-based system designed to support a Retrieval Augmented Generation (RAG) chat interface. Key components include:
- Streamlit (`app.py`): Provides the web-based user interface (UI) and handles user interactions.
- Ollama: Serves Large Language Models (LLMs) locally, which are accessed via the `ollama` Python library.
- ChromaDB: Acts as the vector knowledge base, storing embeddings of data scraped from Reddit and other sources.
- Data Processing Scripts:
  - `scrape_reddit.py`: Fetches data from Reddit.
  - `ingest.py`: Processes and embeds data into ChromaDB.
- Core Logic (`converse.py`): Manages the RAG process, including context retrieval and LLM interaction.
This architecture allows users to ask natural language questions and receive answers generated by an LLM, informed by the specialized knowledge base.
- Role: Main entry point for the Streamlit web application.
- Responsibilities:
- Renders the user interface (chat window, sidebar controls, etc.).
- Captures user queries from the input field.
- Manages the selection of Ollama LLMs from a dynamically fetched list.
- Displays the currently active model.
- Orchestrates the chat response process by calling the `Converse` class in `converse.py`.
- Handles chat history display and clearing.
- Logs user queries, AI responses, and response times.
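The script itself is not reproduced in this document, but a minimal sketch of this orchestration could look as follows. The `Converse(model=...).ask(...)` interface, the widget labels, and the assumption that `ollama.list()` returns a dict with a `models` list are illustrative, not taken from the actual code:

```python
import time
import ollama
import streamlit as st
from converse import Converse   # RAG logic, described below

st.title("RooCode Data Query")

# Model selection from whatever Ollama has pulled locally
# (the response shape varies across ollama library versions; dict access is assumed here).
model_names = [m["name"] for m in ollama.list()["models"]]
model = st.sidebar.selectbox("Ollama model", model_names)
st.sidebar.caption(f"Active model: {model}")

if query := st.chat_input("Ask a question about RooCode"):
    st.chat_message("user").write(query)
    start = time.time()
    answer = Converse(model=model).ask(query)   # `ask` is a hypothetical method name
    st.chat_message("assistant").write(answer)
    st.caption(f"Responded in {time.time() - start:.2f}s")
```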
- Role: Contains the `Converse` class, which encapsulates the core RAG logic.
- Responsibilities:
- Initializes a connection to the ChromaDB vector store (`./chroma_db`) using `langchain_chroma`.
- Uses `HuggingFaceEmbeddings` (specifically `sentence-transformers/all-MiniLM-L6-v2`) for generating query embeddings.
- Retrieves relevant document chunks from ChromaDB based on semantic similarity to the user's query.
- Constructs a prompt for the LLM, incorporating the retrieved context and the original user query.
- Interacts with the selected Ollama LLM (via `OllamaLLM` from `langchain_ollama`) to generate a response.
- Utilizes Langchain components such as `ChatPromptTemplate`, `StrOutputParser`, and `RunnablePassthrough` to build and execute the RAG chain.
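As a rough illustration of how these pieces could fit together, the following sketch assembles an LCEL chain from the components listed above; the prompt wording, the retriever's `k` value, and the `ask` method name are assumptions rather than the application's actual implementation:

```python
from langchain_chroma import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_ollama import OllamaLLM


def _format_docs(docs):
    """Join retrieved chunks into a single context string for the prompt."""
    return "\n\n".join(doc.page_content for doc in docs)


class Converse:
    """Minimal RAG wrapper: retrieve context from ChromaDB, then query the LLM."""

    def __init__(self, model: str = "llama3:8b"):
        embeddings = HuggingFaceEmbeddings(
            model_name="sentence-transformers/all-MiniLM-L6-v2"
        )
        store = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)
        retriever = store.as_retriever(search_kwargs={"k": 4})  # top-4 chunks (assumed)

        prompt = ChatPromptTemplate.from_template(
            "Answer the question using only the context below.\n\n"
            "Context:\n{context}\n\nQuestion: {question}"
        )
        llm = OllamaLLM(model=model)

        # Retrieval fills {context}; the raw query passes through as {question}.
        self.chain = (
            {"context": retriever | _format_docs, "question": RunnablePassthrough()}
            | prompt
            | llm
            | StrOutputParser()
        )

    def ask(self, query: str) -> str:
        return self.chain.invoke(query)
```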
- Role: Responsible for collecting data from Reddit.
- Responsibilities:
- Uses the PRAW (Python Reddit API Wrapper) library to interact with the Reddit API.
- Reads Reddit API credentials (`client_id`, `client_secret`, `user_agent`) and scraping parameters (target `subreddit`, `post_limit`, `output_file`) from the `reddit.config.json` file.
- Fetches post titles, selftext, and top-level comments from the specified subreddit.
- Saves the scraped textual data into a plain text file (default: `reddit_data.txt`).
- Includes error handling for missing or malformed `reddit.config.json`.
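A minimal sketch of such a scraper, using the configuration keys documented later in this section, could look like this; the listing method (`hot`) and the exact output formatting are assumptions:

```python
import json
import praw

# Load credentials and scrape parameters from reddit.config.json (keys as documented below).
with open("reddit.config.json") as f:
    cfg = json.load(f)

reddit = praw.Reddit(
    client_id=cfg["reddit"]["client_id"],
    client_secret=cfg["reddit"]["client_secret"],
    user_agent=cfg["reddit"]["user_agent"],
)

scrape = cfg["scrape_config"]
with open(scrape["output_file"], "w", encoding="utf-8") as out:
    for post in reddit.subreddit(scrape["subreddit"]).hot(limit=scrape["post_limit"]):
        out.write(post.title + "\n")
        if post.selftext:
            out.write(post.selftext + "\n")
        post.comments.replace_more(limit=0)   # drop "load more comments" stubs
        for comment in post.comments:         # top-level comments only
            out.write(comment.body + "\n")
        out.write("\n")
```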
- Role: Processes raw text data and populates the ChromaDB vector store.
- Responsibilities:
- Loads text data from source files (e.g., `reddit_data.txt` generated by `scrape_reddit.py`, and `github_data.txt`, which is assumed to be manually provided).
- Uses `RecursiveCharacterTextSplitter` from Langchain to divide the documents into smaller, manageable chunks suitable for embedding.
- Generates vector embeddings for these text chunks using `HuggingFaceEmbeddings` (`sentence-transformers/all-MiniLM-L6-v2`).
- Creates or updates the ChromaDB vector store located at `./chroma_db`, storing the text chunks and their corresponding embeddings. This database is configured for local persistence.
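A condensed sketch of this ingestion step is shown below; the chunk size and overlap values are assumptions, and the text-splitter import path depends on the installed Langchain version:

```python
from pathlib import Path

from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Collect whichever source files are present.
sources = [Path("reddit_data.txt"), Path("github_data.txt")]
texts = [p.read_text(encoding="utf-8") for p in sources if p.exists()]

# Split raw text into embedding-sized chunks (sizes here are illustrative).
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = splitter.create_documents(texts)

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Persist chunks and their embeddings locally under ./chroma_db.
Chroma.from_documents(docs, embeddings, persist_directory="./chroma_db")
```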
- Reddit Data:
  - User configures `reddit.config.json` with API keys and scraping parameters.
  - `scrape_reddit.py` is executed.
  - The script connects to the Reddit API using credentials from `reddit.config.json`.
  - Specified posts and comments are fetched and written to `reddit_data.txt`.
- GitHub Data:
  - This data is assumed to be manually prepared and placed in a file named `github_data.txt`. The process for creating this file is external to the application's automated scripts.
- `ingest.py` is executed.
- The script reads raw text from `reddit_data.txt` and `github_data.txt`.
- The text is processed:
  - Documents are split into smaller chunks.
  - Embeddings are generated for each chunk.
- These chunks and their embeddings are stored in the ChromaDB vector store located at `./chroma_db`.
- A user submits a query through the Streamlit UI in `app.py`.
- `app.py` passes the query and current agent configuration (including the selected model) to an instance of the `Converse` class in `converse.py`.
- `converse.py` performs a similarity search:
  - The user's query is embedded.
  - ChromaDB (`./chroma_db`) is queried to find the most relevant document chunks based on vector similarity.
- The retrieved context (text from the relevant chunks) and the original user query are combined into a prompt.
- This prompt is sent to the Ollama LLM (selected by the user in the UI).
- The LLM generates a response based on the prompt and context.
- The response is passed back through `converse.py` to `app.py`, which then displays it in the chat UI.
- Python 3.x: The core programming language for all backend scripts.
- Streamlit: Used for creating the interactive web UI (`app.py`).
- Ollama (and the `ollama` library): Provides local hosting of and access to various LLMs. The `ollama` Python library is used for model listing, and Langchain's `OllamaLLM` for RAG.
- Langchain (`langchain`, `langchain_core`, `langchain_ollama`, `langchain_chroma`, `langchain_huggingface`): A framework used extensively for building the RAG application. Key components include:
  - `Chroma`: Vector store integration.
  - `HuggingFaceEmbeddings`: For generating text embeddings.
  - `OllamaLLM`: Wrapper for Ollama models.
  - Prompt templates, output parsers, and runnable chains.
- ChromaDB: The vector store used for storing and retrieving document embeddings, enabling semantic search. Accessed via `langchain_chroma`.
- HuggingFace Sentence Transformers: Specifically `sentence-transformers/all-MiniLM-L6-v2`, used via `langchain_huggingface` to create dense vector embeddings for text.
- PRAW (Python Reddit API Wrapper): Used by `scrape_reddit.py` to interact with the Reddit API.
- TinyDB: A lightweight, document-oriented database used to store application settings (such as the selected model and agent persona) in `db.json`.
- Used by: `scrape_reddit.py`
- Purpose: Stores credentials and parameters for Reddit data scraping.
- Structure:

  ```json
  {
    "reddit": {
      "client_id": "YOUR_REDDIT_CLIENT_ID",
      "client_secret": "YOUR_REDDIT_CLIENT_SECRET",
      "user_agent": "YOUR_REDDIT_USER_AGENT"
    },
    "scrape_config": {
      "subreddit": "RooCode",
      "post_limit": 100,
      "output_file": "reddit_data.txt"
    }
  }
  ```

  (`"RooCode"` is an example; set `subreddit` to the community you want to scrape.)

- Setup: Users must copy `reddit.config.example.json` to `reddit.config.json` and populate it with their actual Reddit API credentials and desired scraping settings.
- Used by: `app.py` (primarily for reading/writing the selected model), `converse.py` (implicitly, via `app.py` providing the agent config).
- Purpose: Stores persistent agent settings, primarily the last selected Ollama model and the system message/persona for the AI.
- Structure (example):

  ```json
  {
    "_default": {
      "1": {
        "model": "llama3:8b",
        "system_message": "You are a helpful assistant...",
        "user_name": "User",
        "agent_name": "RooCode Assistant"
      }
    }
  }
  ```

  (`_default` is TinyDB's default table name; the application may use a named table such as `agent` instead.)

- Setup: Automatically created and managed by the application. `app.py` initializes it with default values if it doesn't exist or is empty.
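A small sketch of how such settings might be read and written with TinyDB follows; the exact access pattern used by `app.py` is an assumption:

```python
from tinydb import TinyDB

db = TinyDB("db.json")

# Seed default settings on first run (the real defaults in app.py may differ).
if not db.all():
    db.insert({
        "model": "llama3:8b",
        "system_message": "You are a helpful assistant...",
        "user_name": "User",
        "agent_name": "RooCode Assistant",
    })

settings = db.all()[0]              # single settings document
db.update({"model": "mistral:7b"})  # persist a newly selected model
```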
- Python 3.x installed.
- Ollama service installed, running, and accessible by the application. Ensure desired models (e.g., `llama3:8b`) are pulled via `ollama pull <model_name>`.
- Git for cloning the repository.
- Clone the repository (if applicable).
- Install required Python packages:

  ```bash
  pip install -r requirements.txt
  ```
- Configure Reddit Scraping:
  - Copy `reddit.config.example.json` to `reddit.config.json`.
  - Edit `reddit.config.json` to add your Reddit API `client_id`, `client_secret`, and `user_agent`, and set your desired `subreddit` and `post_limit`.
- Scrape Reddit Data:
  - Run the scraper script from the project's root directory:

    ```bash
    python scrape_reddit.py
    ```

  - This will create/update the `output_file` (e.g., `reddit_data.txt`) specified in `reddit.config.json`.
- Prepare GitHub Data (if used):
  - Ensure the `github_data.txt` file is present in the root directory and contains the desired GitHub-related text data.
- Ingest Data into Vector Store:
  - Run the ingestion script from the project's root directory:

    ```bash
    python ingest.py
    ```

  - This will process `reddit_data.txt` (and `github_data.txt` if present) and populate/update the ChromaDB vector store in the `./chroma_db` directory.
- Ensure the Ollama service is running.
- Run the Streamlit application from the project's root directory:

  ```bash
  streamlit run app.py
  ```

- The application will typically open in your default web browser, or it will provide a URL (e.g., `http://localhost:8501`).
- Query Logging (`query_log.txt`):
  - `app.py` logs each user query, the AI's response, the response time, and the model used.
  - This log is stored in `query_log.txt` in the project's root directory.
  - Format: `%(asctime)s - %(message)s` (e.g., `2023-10-27 10:00:00,123 - Query: What is X? | Response: Y | Time: 1.23s | Model: llama3:8b`).
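A minimal sketch of a logger matching this format could look like the following; the helper name `log_query` is illustrative, not taken from the actual code:

```python
import logging

# File handler using the documented format string.
logging.basicConfig(
    filename="query_log.txt",
    level=logging.INFO,
    format="%(asctime)s - %(message)s",
)

def log_query(query: str, response: str, elapsed: float, model: str) -> None:
    """Write one log entry per answered query, mirroring the example entry above."""
    logging.info("Query: %s | Response: %s | Time: %.2fs | Model: %s",
                 query, response, elapsed, model)
```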
- Configuration Errors:
  - `scrape_reddit.py`: Includes `try-except` blocks for `FileNotFoundError` if `reddit.config.json` is missing and `json.JSONDecodeError` if it is malformed. It prints user-friendly messages guiding the user to create or fix the file.
  - It also checks for the presence of essential keys within the configuration JSON.
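A sketch of this kind of configuration validation is shown below; the exact messages and the set of required keys are assumptions:

```python
import json
import sys

REQUIRED_KEYS = {"reddit", "scrape_config"}   # assumed top-level sections

try:
    with open("reddit.config.json") as f:
        cfg = json.load(f)
except FileNotFoundError:
    sys.exit("reddit.config.json not found. Copy reddit.config.example.json and fill in your credentials.")
except json.JSONDecodeError as e:
    sys.exit(f"reddit.config.json is not valid JSON: {e}")

missing = REQUIRED_KEYS - cfg.keys()
if missing:
    sys.exit(f"reddit.config.json is missing required sections: {', '.join(sorted(missing))}")
```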
- Ollama Model Fetching Errors:
  - `app.py`: When fetching the list of available models from Ollama, it includes a `try-except` block. If the call to `ollama.list()` fails (e.g., the Ollama service is not running), it prints an error to the console (once per session) and falls back to a predefined list of common models so that the application remains usable.
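The once-per-session fallback could be sketched roughly as follows, using `st.session_state` as the per-session flag; the fallback model list and the error message are assumptions:

```python
import ollama
import streamlit as st

FALLBACK_MODELS = ["llama3:8b", "mistral:7b"]   # assumed fallback list

def available_models() -> list[str]:
    try:
        resp = ollama.list()
        # Handle both the dict-style and attribute-style responses of the ollama library.
        models = resp["models"] if isinstance(resp, dict) else resp.models
        return [m["name"] if isinstance(m, dict) else m.model for m in models]
    except Exception as e:
        # Report the failure only once per Streamlit session, then fall back.
        if not st.session_state.get("ollama_error_reported"):
            print(f"Could not reach Ollama ({e}); using fallback model list.")
            st.session_state["ollama_error_reported"] = True
        return FALLBACK_MODELS
```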
- General Query Processing Errors:
  - `app.py`: The main query processing loop includes a `try-except Exception as e` block. If any error occurs during context retrieval or LLM interaction, an error message (`Error processing query: {str(e)}`) is displayed as the AI's response in the UI, and the error is logged to `query_log.txt`.
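A sketch of this defensive wrapper is shown below; it assumes a `converse` instance and a query logger set up as in the earlier sketches:

```python
import logging
import streamlit as st

# `converse` and `query` are assumed to exist as in the app.py sketch above.
try:
    answer = converse.ask(query)            # context retrieval + LLM call
except Exception as e:
    # Surface the failure as the AI's response and record it in query_log.txt.
    answer = f"Error processing query: {str(e)}"
    logging.error("Query failed: %s | Error: %s", query, e)

st.chat_message("assistant").write(answer)
```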