UH-CI/RAG-system

Course RAG System

A generalized document search and question-answering system built with ChromaDB and Google AI embeddings. This system demonstrates RAG (Retrieval-Augmented Generation) capabilities using university course data as an example, but can be adapted for any document collection.

🚀 How to Start

Prerequisites

  • Docker installed on your system
  • Google API key (for embeddings)
  • Git

1. Clone and Setup

# Clone the repository
git clone https://github.com/UH-CI/RAG-system/
cd RAG-system

# Navigate to source directory and setup environment
cd src/
cp .env.example .env
# Edit .env and paste in your Google API key

2. Initial Deployment with Ingestion

# Return to project root and run deployment with ingestion
cd ..
./deploy.sh docker --ingest

This will start the ingestion process. The documents in /src/documents will be ingested according to the configuration in config.json.

3. Understanding Document Configuration

The config.json file controls how documents are processed and ingested:

{
  "collections": ["courses", "programs", "combined_pathways"],
  "ingestion_configs": [
    {
      "collection_name": "courses",
      "source_file": "UH-Manoa_courses.json",
      "contents_to_embed": ["course_id", "subject", "title", "description", "credits", "prerequisites", "program", "department", "institution"]
    },
    {
      "collection_name": "programs", 
      "source_file": "UH-Manoa-programs.json",
      "contents_to_embed": ["name", "program", "department", "college", "institution", "course_count", "courses"]
    }
  ]
}

How it works:

  • Each ingestion_config defines a collection to be created in ChromaDB
  • source_file points to JSON files in the /src/documents directory
  • contents_to_embed specifies which fields from each document will be embedded as searchable content
  • The system will look for these files relative to the /src directory during ingestion
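
As a rough illustration of what contents_to_embed selects, the sketch below joins the configured fields of one record into a single string that could then be sent to an embedding model. The function name build_embedding_text, the "field: value" layout, and the sample record are assumptions made for this example; the actual logic in documents/document_processor.py may differ.

```python
import json


def build_embedding_text(record: dict, fields: list[str]) -> str:
    """Join the configured fields of one record into a single string.

    Hypothetical helper: the real project may format the text differently.
    """
    parts = []
    for field in fields:
        value = record.get(field)
        if value is None or value == "":
            continue  # skip fields this record does not have
        if isinstance(value, (list, dict)):
            # Serialize nested values so they stay readable in the text
            value = json.dumps(value, ensure_ascii=False)
        parts.append(f"{field}: {value}")
    return "\n".join(parts)


# Made-up record, not taken from the bundled data files
course = {"course_id": "DEMO 101", "title": "Intro to Demos", "credits": 3}
text = build_embedding_text(course, ["course_id", "title", "description", "credits"])
print(text)
```

Fields listed in contents_to_embed but absent from a record are skipped here; whether the real ingestion skips them or raises an error is not stated in this README.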

Document Structure:

  • Place your JSON data files in /src/documents/
  • Each file should contain an array of objects with the fields specified in contents_to_embed
  • The system currently includes example files: UH-Manoa_courses.json, UH-Manoa-programs.json, etc.
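
Before running ingestion it can help to check that each data file really is an array of objects carrying the configured fields. The following is a minimal sketch of such a check, assuming the config.json layout shown above; missing_fields and check_config are hypothetical helpers and are not part of the repository.

```python
import json
from pathlib import Path


def missing_fields(records: list, fields: list[str]) -> dict[int, list[str]]:
    """Map record index -> configured fields that are absent from that record."""
    problems = {}
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            problems[index] = list(fields)  # not an object at all
            continue
        absent = [name for name in fields if name not in record]
        if absent:
            problems[index] = absent
    return problems


def check_config(config_path: str, documents_dir: str) -> dict[str, dict[int, list[str]]]:
    """Validate every source_file named in config.json against its field list."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    report = {}
    for entry in config["ingestion_configs"]:
        source = Path(documents_dir) / entry["source_file"]
        data = json.loads(source.read_text(encoding="utf-8"))
        if not isinstance(data, list):
            raise ValueError(f"{entry['source_file']} must contain a JSON array")
        report[entry["collection_name"]] = missing_fields(data, entry["contents_to_embed"])
    return report
```

Calling check_config("src/config.json", "src/documents") would return, per collection, the records that lack one or more configured fields; both paths are assumptions based on the layout described in this section.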

4. Production Deployment

Once ingestion is complete and you're satisfied with the setup:

./deploy.sh docker

This will bake the vectors into the ChromaDB database on the Docker image for production usage, creating a self-contained deployment ready for serving.

The deploy script automatically starts both:

  • API Server: course-rag-api container on port 8200
  • Web Interface: course-rag-test container on port 8280 (serves course_rag_interface.html via test_server.py)

5. Verify Deployment

# Check if the API is running
curl http://localhost:8200/

# Check database status and document counts
curl http://localhost:8200/stats

# Access web interface at: http://localhost:8280/course_rag_interface.html
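
The same check can be scripted, for example to wait until the container has finished starting. This is a sketch using only the Python standard library; the retry count and delay are arbitrary choices, and the opener parameter exists only so the function can be exercised without a running server.

```python
import time
import urllib.request


def wait_for_api(url="http://localhost:8200/", attempts=10, delay=2.0,
                 opener=urllib.request.urlopen):
    """Poll the API root until it answers with HTTP 200 or attempts run out."""
    for attempt in range(attempts):
        try:
            with opener(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except OSError:
            # URLError, HTTPError, and refused connections all derive from OSError
            pass
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

A call such as wait_for_api() returns True once the server responds, and False if it never does within the given number of attempts.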

Web Interface Features:

  • Intelligent Query Tab: Chat-based interface with preset queries for testing
  • Collections Tab: Direct vector search across all document collections
  • Real-time Results: Shows query processing time, document sources, and confidence scores
  • Responsive Design: Works on desktop and mobile devices

That's it! 🎉 Your API is running at http://localhost:8200


🛠️ Development Setup

For developers who want to modify the code

Prerequisites

  • Python 3.11+
  • Git
  • Google API key

1. Clone Repository

git clone <your-repo-url>
cd course-RAG

2. Install Dependencies

pip install -r src/requirements.txt

3. Configure Environment

cp src/.env.example .env
# Edit .env with your Google API key

4. Run Locally

cd src
python run_api.py

5. Run with Document Ingestion

cd src
python run_api.py --ingest

6. Test the System

cd src
python tests/api_client_example.py

📖 Using the API

Web Interface

Search Courses

curl -X POST "http://localhost:8200/search" \
  -H "Content-Type: application/json" \
  -d '{"query": "computer science programming courses", "n_results": 3}'

Multi-Collection Search

curl -X POST "http://localhost:8200/search_multi" \
  -H "Content-Type: application/json" \
  -d '{"query": "data science courses and programs", "collections": ["courses", "programs"], "n_results": 5}'
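
The two curl calls above can also be issued from Python. The sketch below builds the same JSON payloads with the standard library; build_search_request and send are hypothetical helper names, and because the response schema is not documented here the reply is returned as parsed JSON without further interpretation.

```python
import json
import urllib.request

API_BASE = "http://localhost:8200"


def build_search_request(query, n_results=3, collections=None, base_url=API_BASE):
    """Build a POST request for /search, or /search_multi when collections are given."""
    payload = {"query": query, "n_results": n_results}
    endpoint = "/search"
    if collections:
        payload["collections"] = list(collections)
        endpoint = "/search_multi"
    return urllib.request.Request(
        base_url + endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and return the decoded JSON body (needs a running API)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For example, send(build_search_request("computer science programming courses")) mirrors the first curl command, and passing collections=["courses", "programs"] mirrors the second.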

Get Statistics

curl http://localhost:8200/stats

Upload Custom Documents (if enabled)

curl -X POST "http://localhost:8200/upload" -F "[email protected]"

📁 Project Structure

course-RAG/
├── src/
│   ├── api.py                    # Main FastAPI application
│   ├── run_api.py                # Server startup script
│   ├── settings.py               # Configuration
│   ├── query_processor.py        # Query processing logic
│   ├── config.json               # Collection configurations
│   ├── documents/                # Document processing
│   │   ├── document_processor.py
│   │   ├── embeddings.py
│   │   └── ingest_documents.py
│   ├── chroma_db/                # Database setup
│   ├── tests/                    # Testing utilities
│   └── UH-Manoa_courses.json     # Example course data
├── nginx.conf                    # Nginx configuration (optional)
├── docker-compose.yml            # Docker compose setup
├── deploy.sh                     # Deployment script
└── README.md                     # This file

🔧 Development Commands

Ingest Documents

cd src
python documents/ingest_documents.py --source UH-Manoa_courses.json

Build Docker Image

cd src
docker build -t course-rag .

Deploy with Script

chmod +x deploy.sh
./deploy.sh

💡 Example Usage

This system demonstrates RAG capabilities using University of Hawaii course data:

  1. Start the server (production or development)
  2. Search for courses via the web interface at http://localhost:8280/course_rag_interface.html, or send requests through the API docs page at http://localhost:8200/docs
  3. Ask questions like:
    • "What computer science courses cover machine learning?"
    • "Show me all courses with prerequisites in calculus"
    • "Find programming courses for beginners"
    • "What are the credit requirements for data science programs?"

🔄 Collections

The system supports multiple document collections:

  • courses: Individual course information (8,297 UH courses)
  • programs: Academic program details
  • combined_pathways: Integrated course and program pathways

Each collection can be searched independently or combined for comprehensive results.

🆘 Troubleshooting

Server won't start?

  • Check that port 8200 is available: lsof -i :8200
  • Verify your Google API key is valid
  • For Docker: Check logs with docker logs course-rag-api

No search results?

  • Make sure documents were ingested successfully
  • Check if collections exist: curl http://localhost:8200/stats
  • Verify the database has content (should show document_count > 0)
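
The document_count check can be automated once the /stats response has been parsed. Since the exact shape of that response is not documented here, this sketch simply walks whatever JSON comes back and reports every document_count it finds; the sample dictionary is invented for illustration.

```python
def find_document_counts(node, path=""):
    """Collect (location, value) pairs for every "document_count" key in parsed JSON."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "document_count":
                found.append((path or "<root>", value))
            else:
                child = f"{path}.{key}" if path else key
                found.extend(find_document_counts(value, child))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            found.extend(find_document_counts(item, f"{path}[{index}]"))
    return found


# Invented example; the real /stats payload may be shaped differently
sample_stats = {
    "collections": {
        "courses": {"document_count": 10},
        "programs": {"document_count": 0},
    }
}
counts = find_document_counts(sample_stats)
empty = [location for location, count in counts if not count]
print(empty)
```

Any location listed in empty points at a collection that was created but holds no documents, which usually means ingestion for it did not complete.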

Ingestion stuck at 0%?

  • Large datasets may take time due to Google API rate limiting
  • Consider running without --ingest first, then ingest in smaller batches
  • Monitor progress with docker logs course-rag-api
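
One way to ingest in smaller batches is to split the records and pause between chunks so the embedding API is not hit all at once. This is a generic sketch: ingest_batch stands in for whatever function actually embeds and stores a list of records, and the batch size and pause are arbitrary starting points, not values taken from the project.

```python
import time


def batched(items, size):
    """Yield consecutive slices of items, each with at most size elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def ingest_in_batches(records, ingest_batch, size=100, pause=1.0):
    """Feed records to ingest_batch chunk by chunk, sleeping between chunks.

    ingest_batch is a hypothetical callable supplied by the caller.
    """
    done = 0
    for batch in batched(records, size):
        ingest_batch(batch)
        done += len(batch)
        print(f"ingested {done}/{len(records)}")
        if done < len(records):
            time.sleep(pause)  # give the rate limiter room before the next chunk
    return done
```

Printing progress after each chunk also makes it easier to tell a slow ingestion from one that is actually stuck.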

Docker issues?

  • Ensure Docker is running: docker info
  • Check if image exists: docker images | grep course-rag
  • Restart container: docker restart course-rag-api

ChromaDB errors?

  • Delete old database: rm -rf src/chroma_db/data
  • Restart with fresh ingestion: docker run ... --ingest

Need help?

🔄 Updates

Production: Pull the latest image and restart

docker pull tabalbar/course-rag:latest
docker stop course-rag-api && docker rm course-rag-api
docker run -d -p 8200:8200 --env-file .env --name course-rag-api tabalbar/course-rag:latest

Development: Pull latest code and restart

git pull origin main
cd src
python run_api.py

🎯 Adapting for Your Data

This system can be easily adapted for other document types:

  1. Replace data source: Update config.json with your document collections
  2. Modify embeddings: Adjust embedding fields in the configuration
  3. Update preprocessing: Modify document processing logic in documents/
  4. Customize API: Add domain-specific endpoints in api.py
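
For step 1, a new collection amounts to one more name in collections plus one more entry in ingestion_configs. The sketch below shows that edit on an in-memory copy of the configuration, assuming the config.json layout shown earlier; add_collection, the "recipes" collection, and its file and field names are all made up for the example.

```python
import json


def add_collection(config, name, source_file, fields):
    """Return a copy of config with one collection added or replaced."""
    updated = json.loads(json.dumps(config))  # deep copy via a JSON round trip
    names = updated.setdefault("collections", [])
    if name not in names:
        names.append(name)
    entries = updated.setdefault("ingestion_configs", [])
    # Drop any earlier entry for the same collection before appending the new one
    entries[:] = [e for e in entries if e.get("collection_name") != name]
    entries.append({
        "collection_name": name,
        "source_file": source_file,
        "contents_to_embed": list(fields),
    })
    return updated


base = {"collections": ["courses"], "ingestion_configs": []}
new_config = add_collection(base, "recipes", "recipes.json", ["title", "ingredients"])
print(json.dumps(new_config, indent=2))
```

Writing new_config back to src/config.json and placing the data file alongside the existing examples would then let the normal ingestion run pick it up, assuming ingestion reads only this configuration.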

📄 License

[Your License Here]
