A generalized document search and question-answering system built with ChromaDB and Google AI embeddings. This system demonstrates RAG (Retrieval-Augmented Generation) capabilities using university course data as an example, but can be adapted for any document collection.
- Docker installed on your system
- Google API key (for embeddings)
- Git
```bash
# Clone the repository
git clone https://github.com/UH-CI/RAG-system/
cd course-RAG

# Navigate to the source directory and set up the environment
cd src/
cp .env.example .env
# Edit .env and paste in your Google API key

# Return to the project root and run deployment with ingestion
cd ..
./deploy.sh docker --ingest
```

This will start the ingestion process. The documents in `/src/documents` will be ingested according to the configuration in `config.json`.
The `config.json` file controls how documents are processed and ingested:

```json
{
  "collections": ["courses", "programs", "combined_pathways"],
  "ingestion_configs": [
    {
      "collection_name": "courses",
      "source_file": "UH-Manoa_courses.json",
      "contents_to_embed": ["course_id", "subject", "title", "description", "credits", "prerequisites", "program", "department", "institution"]
    },
    {
      "collection_name": "programs",
      "source_file": "UH-Manoa-programs.json",
      "contents_to_embed": ["name", "program", "department", "college", "institution", "course_count", "courses"]
    }
  ]
}
```

How it works:
- Each `ingestion_config` defines a collection to be created in ChromaDB
- `source_file` points to a JSON file in the `/src/documents` directory
- `contents_to_embed` specifies which fields from each document will be embedded as searchable content
- The system will look for these files relative to the `/src` directory during ingestion
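To make the field selection concrete, here is a minimal sketch of how `contents_to_embed` could be turned into the text that gets embedded for each document. The joining format and function name are assumptions for illustration, not the repo's actual implementation in `documents/ingest_documents.py`:

```python
# Hypothetical sketch: build the searchable text for one document from the
# fields listed in an ingestion config's "contents_to_embed".
config = {
    "collection_name": "courses",
    "contents_to_embed": ["course_id", "title", "description"],
}

def build_embed_text(doc, fields):
    """Join the configured fields into one string for embedding;
    fields missing from a document are simply skipped."""
    return " | ".join(str(doc[f]) for f in fields if f in doc)

doc = {
    "course_id": "ICS 101",
    "title": "Digital Tools for the Information World",
    "description": "Fundamental computing concepts.",
    "credits": 3,  # not listed in contents_to_embed, so not embedded
}
print(build_embed_text(doc, config["contents_to_embed"]))
# ICS 101 | Digital Tools for the Information World | Fundamental computing concepts.
```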
Document Structure:
- Place your JSON data files in `/src/documents/`
- Each file should contain an array of objects with the fields specified in `contents_to_embed`
- The system currently includes example files: `UH-Manoa_courses.json`, `UH-Manoa-programs.json`, etc.
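Before ingesting, it can help to check that every object in your data file actually carries the fields you plan to embed. A small standalone check (a hypothetical helper, not part of the repo):

```python
def missing_fields(docs, required):
    """Return (index, missing-field-set) pairs for documents that lack
    any of the fields named in contents_to_embed."""
    problems = []
    for i, doc in enumerate(docs):
        missing = set(required) - set(doc)
        if missing:
            problems.append((i, missing))
    return problems

docs = [
    {"course_id": "ICS 101", "title": "Intro", "description": "..."},
    {"course_id": "ICS 211", "title": "Data Structures"},  # no description
]
print(missing_fields(docs, ["course_id", "title", "description"]))
# [(1, {'description'})]
```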
Once ingestion is complete and you're satisfied with the setup:
```bash
./deploy.sh docker
```

This will bake the vectors into the ChromaDB database on the Docker image, creating a self-contained deployment ready for production serving.
The deploy script automatically starts both:
- **API Server**: `course-rag-api` container on port 8200
- **Web Interface**: `course-rag-test` container on port 8280 (serves `course_rag_interface.html` via `test_server.py`)
```bash
# Check if the API is running
curl http://localhost:8200/

# Check database status and document counts
curl http://localhost:8200/stats

# Access the web interface at:
# http://localhost:8280/course_rag_interface.html
```

Web Interface Features:
- Intelligent Query Tab: Chat-based interface with preset queries for testing
- Collections Tab: Direct vector search across all document collections
- Real-time Results: Shows query processing time, document sources, and confidence scores
- Responsive Design: Works on desktop and mobile devices
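A quick programmatic health check can wrap the `/stats` call above. The exact shape of the `/stats` response is an assumption here (check the live endpoint against it); the decision logic is the part worth keeping:

```python
import json
import urllib.request

def fetch_stats(base="http://localhost:8200"):
    """Fetch /stats from a running API instance."""
    with urllib.request.urlopen(f"{base}/stats") as resp:
        return json.load(resp)

def is_healthy(stats):
    """Treat the system as ready once every collection reports documents.
    Assumes a {"collections": {name: document_count}} response shape."""
    counts = stats.get("collections", {})
    return bool(counts) and all(n > 0 for n in counts.values())

print(is_healthy({"collections": {"courses": 8297, "programs": 120}}))  # True
print(is_healthy({"collections": {"courses": 0}}))                      # False
```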
That's it! Your API is running at http://localhost:8200
For developers who want to modify the code
- Python 3.11+
- Git
- Google API key
```bash
git clone <your-repo-url>
cd course-RAG
pip install -r src/requirements.txt
cp src/.env.example .env
# Edit .env with your Google API key
```

Start the server:

```bash
cd src
python run_api.py
```

Start the server and run ingestion:

```bash
cd src
python run_api.py --ingest
```

Run the example test client:

```bash
cd src
python tests/api_client_example.py
```

- API Documentation: http://localhost:8200/docs
- Interactive Testing: http://localhost:8200/redoc
- Test Interface: http://localhost:8280 (if deployed)
```bash
curl -X POST "http://localhost:8200/search" \
  -H "Content-Type: application/json" \
  -d '{"query": "computer science programming courses", "n_results": 3}'
```

```bash
curl -X POST "http://localhost:8200/search_multi" \
  -H "Content-Type: application/json" \
  -d '{"query": "data science courses and programs", "collections": ["courses", "programs"], "n_results": 5}'
```

```bash
curl http://localhost:8200/stats
```

```bash
curl -X POST "http://localhost:8200/upload" -F "[email protected]"
```

```
course-RAG/
├── src/
│   ├── api.py                  # Main FastAPI application
│   ├── run_api.py              # Server startup script
│   ├── settings.py             # Configuration
│   ├── query_processor.py      # Query processing logic
│   ├── config.json             # Collection configurations
│   ├── documents/              # Document processing
│   │   ├── document_processor.py
│   │   ├── embeddings.py
│   │   └── ingest_documents.py
│   ├── chroma_db/              # Database setup
│   ├── tests/                  # Testing utilities
│   └── UH-Manoa_courses.json   # Example course data
├── nginx.conf                  # Nginx configuration (optional)
├── docker-compose.yml          # Docker compose setup
├── deploy.sh                   # Deployment script
└── README.md                   # This file
```
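The `/search` curl example above can also be driven from Python with the standard library only. The payload mirrors the curl bodies; the response shape is whatever the API returns, so this is a sketch rather than a guaranteed client:

```python
import json
import urllib.request

def build_payload(query, n_results=3, collections=None):
    """Build the JSON body used by /search (and, with collections,
    /search_multi), mirroring the curl examples."""
    body = {"query": query, "n_results": n_results}
    if collections is not None:
        body["collections"] = collections  # only /search_multi uses this
    return body

def search(query, n_results=3, base="http://localhost:8200"):
    """POST to /search on a running server and return the parsed JSON."""
    data = json.dumps(build_payload(query, n_results)).encode()
    req = urllib.request.Request(
        f"{base}/search", data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example (requires a running server):
# results = search("computer science programming courses")
```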
Manual ingestion:

```bash
cd src
python documents/ingest_documents.py --source UH-Manoa_courses.json
```

Build the Docker image:

```bash
cd src
docker build -t course-rag .
```

Run the deployment script:

```bash
chmod +x deploy.sh
./deploy.sh
```

This system demonstrates RAG capabilities using University of Hawaii course data:
- Start the server (production or development)
- Search for courses via the interactive API docs at http://localhost:8200/docs
- Ask questions like:
- "What computer science courses cover machine learning?"
- "Show me all courses with prerequisites in calculus"
- "Find programming courses for beginners"
- "What are the credit requirements for data science programs?"
The system supports multiple document collections:
- `courses`: Individual course information (8,297 UH courses)
- `programs`: Academic program details
- `combined_pathways`: Integrated course and program pathways
Each collection can be searched independently or combined for comprehensive results.
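How combined search ranks hits across collections isn't specified here (the real logic lives in `query_processor.py`); one plausible approach, shown as an illustrative sketch, is to tag each hit with its collection and sort by vector distance:

```python
def merge_results(per_collection, n=5):
    """Flatten per-collection hits, remember where each came from,
    and keep the n closest matches (smaller distance = closer)."""
    merged = []
    for name, hits in per_collection.items():
        for hit in hits:
            merged.append({**hit, "collection": name})
    return sorted(merged, key=lambda h: h["distance"])[:n]

hits = {
    "courses":  [{"id": "ICS 311", "distance": 0.21}],
    "programs": [{"id": "BS-CS",   "distance": 0.35},
                 {"id": "BA-ICS",  "distance": 0.18}],
}
print([h["id"] for h in merge_results(hits, n=2)])
# ['BA-ICS', 'ICS 311']
```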
**Server won't start?**
- Check that port 8200 is available: `lsof -i :8200`
- Verify your Google API key is valid
- For Docker: check logs with `docker logs course-rag-api`
**No search results?**
- Make sure documents were ingested successfully
- Check if collections exist: `curl http://localhost:8200/stats`
- Verify the database has content (`document_count` should be greater than 0)
**Ingestion stuck at 0%?**
- Large datasets may take time due to Google API rate limiting
- Consider running without `--ingest` first, then ingest in smaller batches
- Monitor progress with `docker logs course-rag-api`
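"Smaller batches" can be as simple as slicing the document list before each embedding call. This is a generic sketch; the repo's ingestion script may batch differently:

```python
def batches(items, size=100):
    """Yield fixed-size chunks so each embedding request stays small
    enough to sit comfortably under API rate limits."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

docs = list(range(250))
print([len(b) for b in batches(docs, size=100)])  # [100, 100, 50]
```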
**Docker issues?**
- Ensure Docker is running: `docker info`
- Check if the image exists: `docker images | grep course-rag`
- Restart the container: `docker restart course-rag-api`
**ChromaDB errors?**
- Delete the old database: `rm -rf src/chroma_db/data`
- Restart with fresh ingestion: `docker run ... --ingest`
**Need help?**
- Check the full API documentation at http://localhost:8200/docs
- Run the test client: `python src/tests/api_client_example.py`
- Use the web interface at http://localhost:8280
**Production:** pull the latest image and restart:

```bash
docker pull tabalbar/course-rag:latest
docker stop course-rag-api && docker rm course-rag-api
docker run -d -p 8200:8200 --env-file .env --name course-rag-api tabalbar/course-rag:latest
```

**Development:** pull the latest code and restart:

```bash
git pull origin main
cd src
python run_api.py
```

This system can be easily adapted for other document types:
- **Replace the data source**: Update `config.json` with your document collections
- **Modify embeddings**: Adjust the embedding fields in the configuration
- **Update preprocessing**: Modify the document processing logic in `documents/`
- **Customize the API**: Add domain-specific endpoints in `api.py`
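For example, adapting the system to a hypothetical recipe collection might only require a new `config.json` (all names below are illustrative, not shipped with the repo):

```json
{
  "collections": ["recipes"],
  "ingestion_configs": [
    {
      "collection_name": "recipes",
      "source_file": "recipes.json",
      "contents_to_embed": ["name", "cuisine", "ingredients", "instructions"]
    }
  ]
}
```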
[Your License Here]