AI-powered GitHub repository analysis assistant that helps you understand complex codebases without cloning them locally.
- Overview
- Key Features
- Use Cases
- How It Works
- Security Assessments
- Prerequisites
- Installation & Setup
- Architecture
- Project Structure
- API Documentation
- Development
- Troubleshooting
- FAQ
- Deployment Considerations
- Contributing
- Security
- Support
- Roadmap
- License
- Team
- Acknowledgments
RepoRover is an intelligent code analysis platform that combines the power of Large Language Models (LLMs), vector databases, and semantic search to help developers quickly understand and analyze GitHub repositories. Whether you're evaluating a new library, conducting security audits, or exploring unfamiliar codebases, RepoRover provides contextual insights with precise file and line references.
Access the application at http://localhost:3000 after setup:
- Enter a GitHub repository URL (e.g., https://github.com/microsoft/vscode)
- Click "Ingest Repository" to process the codebase
- Ask questions in natural language
- Receive AI-powered answers with file citations
- Click citations to view code directly in the built-in viewer
- Repository Analysis: Submit any public GitHub repository URL and get AI-powered insights about the codebase structure, architecture, and implementation
- Intelligent File Processing: Automatically processes all repository files including source code, documentation, and configuration files with language detection
- Semantic Search: Uses OpenAI embeddings and Elasticsearch vector search for intelligent, context-aware code search
- Interactive Chat Interface: Ask natural language questions about the repository and receive contextual answers with precise file citations
- Security Assessments: Run comprehensive AI-powered security reviews at both repository and file levels with severity-ranked findings
- Code Viewer: Built-in code viewer with markdown rendering for easy file inspection
- Real-time Processing: Stream responses and see processing status in real-time
RepoRover is ideal for:
- Code Exploration: Quickly understand unfamiliar codebases without extensive manual navigation
- Library Evaluation: Assess third-party libraries before integration into your projects
- Security Audits: Identify potential security vulnerabilities in repositories
- Documentation: Generate insights about code architecture and implementation details
- Code Review: Get AI-powered analysis of pull requests and code changes
- Learning: Study well-known open-source projects and understand their implementation patterns
- Due Diligence: Technical assessment of codebases during acquisition or partnership decisions
RepoRover follows a multi-stage pipeline to analyze GitHub repositories:
- Repository Ingestion
  - User submits a GitHub repository URL through the web interface
  - Backend fetches repository contents via the GitHub API
  - Files are processed and categorized by type (code, docs, config, etc.)
- Text Processing
  - LangChain text splitters chunk files into meaningful segments
  - Language detection identifies file types for optimal processing
  - Code is parsed to maintain context and structure
- Embedding Generation
  - OpenAI's text-embedding model converts each chunk into a high-dimensional vector
  - Vectors capture semantic meaning and relationships between code segments
  - Embeddings are optimized for similarity search
- Vector Storage
  - Elasticsearch stores embeddings with metadata (file path, language, etc.)
  - Vector database enables fast k-nearest neighbor searches
  - Index is optimized for semantic code search
- Intelligent Query Processing
  - User questions are embedded using the same model
  - Semantic search retrieves the most relevant code chunks
  - Context is assembled from top matching segments
- AI-Powered Response
  - OpenAI's chat model generates answers based on retrieved context
  - Responses include file paths and line number citations
  - Streaming enables real-time response display
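The stages above can be illustrated with a small, self-contained sketch. It is not RepoRover's implementation: the real pipeline uses LangChain splitters, OpenAI embeddings, and Elasticsearch, while this example swaps in fixed-size line windows, a word-count vector, and an in-memory scan so it runs without any services. All names here (`Chunk`, `chunk_file`, `toy_embed`, `build_index`, `search`, `cite`) are hypothetical.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    file_path: str
    start_line: int  # 1-indexed, inclusive
    end_line: int    # 1-indexed, inclusive
    text: str


def chunk_file(file_path, source, lines_per_chunk=20):
    """Split a file into fixed-size line windows, keeping line numbers for citations."""
    lines = source.splitlines()
    chunks = []
    for start in range(0, len(lines), lines_per_chunk):
        window = lines[start:start + lines_per_chunk]
        chunks.append(Chunk(file_path, start + 1, start + len(window), "\n".join(window)))
    return chunks


def toy_embed(text):
    """Stand-in for an embedding model: a sparse vector of lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def build_index(files, lines_per_chunk=20):
    """Chunk and embed every file; returns a list of (chunk, vector) pairs."""
    index = []
    for file_path, source in files.items():
        for chunk in chunk_file(file_path, source, lines_per_chunk):
            index.append((chunk, toy_embed(chunk.text)))
    return index


def search(question, index, k=3):
    """Embed the question the same way as the chunks and return the k closest chunks."""
    query_vector = toy_embed(question)
    ranked = sorted(index, key=lambda pair: cosine(query_vector, pair[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def cite(chunk):
    """Format a chunk as a file-and-line citation."""
    return f"{chunk.file_path}:{chunk.start_line}-{chunk.end_line}"
```

For example, indexing a two-line `auth.py` that hashes a password alongside an unrelated `views.py` and calling `search("how is the password hashed", index, k=1)` returns the `auth.py` chunk, which `cite` renders as `auth.py:1-2`. The retrieved chunks would then be passed to the chat model as context.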
RepoRover now supports first-class security reviews:
- Repository assessments analyze the indexed chunks for a repo and generate a JSON report of prioritized findings. Trigger them from the UI next to the ingest button or via POST /api/security/assess_repo with owner and repo (or github_url) and an OpenAI API key.
- File assessments fetch the latest version of a specific file (plus any available indexed snippets) and audit it in depth. Run them from an open code tab in the UI or call POST /api/security/assess_file with owner, repo, and file_path.
Both endpoints return:
{
"scope": "repo | file",
"summary": "High-level risk overview",
"findings": [
{
"severity": "high",
"title": "Hard-coded secret",
"description": "...",
"file_path": "backend/app.py",
"line_hints": "42-58",
"remediation": "Move the secret to a vault",
"category": "CWE-798"
}
],
"sampled_files": ["backend/app.py", "frontend/..."],
"ran_at": "2025-11-07T15:30:00Z"
}
Results are streamed back into chat so you can click cited files to inspect them or continue the conversation. If the Elasticsearch index is empty, the backend flags the limited coverage so you know to ingest the repository before trusting the assessment.
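A client consuming this report will usually want the most severe findings first. The sketch below assumes the response shape shown above; the severity scale in `SEVERITY_ORDER` is an assumption (the example report only shows "high"), and `rank_findings` and `format_finding` are hypothetical helper names.

```python
# Assumed severity scale; only "high" appears in the documented example.
SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3, "info": 4}


def rank_findings(report):
    """Return findings sorted from most to least severe; unknown severities go last."""
    def rank(finding):
        severity = str(finding.get("severity", "")).lower()
        return SEVERITY_ORDER.get(severity, len(SEVERITY_ORDER))

    # sorted() is stable, so findings with equal severity keep their original order.
    return sorted(report.get("findings", []), key=rank)


def format_finding(finding):
    """Render one finding as a single line with a clickable-style file reference."""
    severity = str(finding.get("severity", "unknown")).upper()
    location = finding.get("file_path", "?")
    if finding.get("line_hints"):
        location = f"{location}:{finding['line_hints']}"
    return f"[{severity}] {finding.get('title', 'Untitled')} ({location})"
```

Calling `format_finding` on the finding from the example report yields `[HIGH] Hard-coded secret (backend/app.py:42-58)`.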
Before running RepoRover, ensure you have the following installed:
- Python 3.9+ - Backend API and processing
- Node.js 18+ - Frontend application
- Java 8+ - Required for Elasticsearch
- npm or yarn - Package management for frontend
- pip - Python package manager
You'll need the following API keys to run RepoRover:
- GitHub Personal Access Token
  - Create at: https://github.com/settings/tokens
  - Required scopes: repo (for private repos) or public_repo (for public repos only)
- OpenAI API Key
  - Get from: https://platform.openai.com/api-keys
  - Used for embeddings and chat completions
Typical Repository Processing:
- Small repository (~50 files): 1-2 minutes, $0.10-$0.30
- Medium repository (~500 files): 5-10 minutes, $0.50-$1.50
- Large repository (~5000 files): 30-60 minutes, $2.00-$10.00
Hardware Requirements:
- Minimum: 4GB RAM, 2 CPU cores, 10GB disk space
- Recommended: 8GB RAM, 4 CPU cores, 20GB disk space
- For Large Repos: 16GB RAM, 8 CPU cores, 50GB disk space
Note: Elasticsearch is the most resource-intensive component. Costs are for OpenAI API usage only.
RepoRover uses Elasticsearch as its vector database for storing and searching code embeddings. Follow these steps to set up Elasticsearch properly:
The project includes Elasticsearch 7.17.9 in the backend/elasticsearch-7.17.9/ directory. If that directory is missing or you need the build for a different platform:
- Visit https://www.elastic.co/downloads/past-releases/elasticsearch-7-17-9
- Download the archive for your platform (Windows users: choose the .zip file)
- Extract to backend/elasticsearch-7.17.9/
Elasticsearch requires some basic configuration for this application:
- Increase heap size (recommended for better performance):
  - Open backend/elasticsearch-7.17.9/config/jvm.options
  - Set heap size: -Xms512m and -Xmx1g
- Configure network settings (optional but recommended for development):
  - Open backend/elasticsearch-7.17.9/config/elasticsearch.yml
  - Add: network.host: 0.0.0.0 to allow connections from the app when it runs on another host or in a container (this binds to all network interfaces; the default localhost binding is sufficient when everything runs on one machine)
  - Add: http.port: 9200 (default, matches the app config)
- Security Configuration (if using authentication):
  - The app supports Elasticsearch authentication
  - Set these in your .env file if needed:
    ES_HOST=http://localhost:9200
    ES_USER=elastic
    ES_PASSWORD=your_password
  - The app will automatically use these credentials if provided
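As a rough illustration of how these variables might be consumed, the sketch below turns ES_HOST, ES_USER, and ES_PASSWORD into connection settings and only adds credentials when both are present. This is an assumption about the behavior described above, not the code in config.py; `es_settings` is a hypothetical name, and the `http_auth` key follows the keyword used by the 7.x Python Elasticsearch client.

```python
import os


def es_settings(env=os.environ):
    """Build Elasticsearch connection settings from environment variables."""
    settings = {"hosts": [env.get("ES_HOST", "http://localhost:9200")]}
    user = env.get("ES_USER")
    password = env.get("ES_PASSWORD")
    # Only send credentials when both halves are configured.
    if user and password:
        # Assumption: the dict is later passed as keyword arguments to a
        # 7.x client, e.g. Elasticsearch(**settings).
        settings["http_auth"] = (user, password)
    return settings
```

Passing a plain dict instead of `os.environ` makes the function easy to exercise without touching the real environment.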
Start Elasticsearch:
- Windows: cd backend && elasticsearch-7.17.9\bin\elasticsearch.bat
- Linux/Mac: cd backend && ./elasticsearch-7.17.9/bin/elasticsearch
Troubleshooting:
- If you get JVM errors, ensure Java 8+ is installed and JAVA_HOME is set
- If port 9200 is already in use, change the port in both elasticsearch.yml and your environment config
- The first startup may take longer as Elasticsearch initializes
Once running, visit http://localhost:9200 to see Elasticsearch status. You should see JSON output indicating it's running.
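The same check can be scripted. The following is a minimal sketch using only the standard library; it assumes an unauthenticated local node (it sends no credentials, so a secured cluster will be reported as unreachable), and `check_elasticsearch` and `parse_root_response` are hypothetical names.

```python
import json
import urllib.error
import urllib.request


def parse_root_response(body):
    """Extract the cluster name and version from the JSON served at the root URL."""
    info = json.loads(body)
    return {
        "cluster_name": info.get("cluster_name"),
        "version": info.get("version", {}).get("number"),
    }


def check_elasticsearch(url="http://localhost:9200", timeout=3.0):
    """Return cluster details if Elasticsearch answers at url, otherwise None."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return parse_root_response(response.read())
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body.
        return None
```

A `None` result right after startup often just means the node is still initializing, so retrying after a few seconds is reasonable.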
# Clone repo
git clone <your-repo>
cd capstone-ai
# Create .env with API keys (see sample template below)
Sample .env file template:
# GitHub API
GITHUB_TOKEN=ghp_your_github_token_here
# AI API Configurations (use OpenAI)
OPENAI_API_KEY=your_openai_api_key_here
# Elasticsearch Configuration
# For local Elasticsearch (recommended for development):
ES_HOST=http://localhost:9200
ES_USER=elastic
ES_PASSWORD=changeme
# For Docker Elasticsearch:
# ES_HOST=http://localhost:9200
# ES_USER=elastic
# ES_PASSWORD=changeme
# For Elastic Cloud (uncomment and update with your cloud credentials):
# ES_HOST=https://your-deployment-id.es.us-central1.gcp.cloud.es.io:9243
# ES_USER=your-username
# ES_PASSWORD=your-password
Create your .env file:
Copy the example file and update with your credentials:
cp .env.example .env
# Then edit .env with your actual API keys
Or create manually:
echo "GITHUB_TOKEN=ghp_your_token_here" > .env
echo "OPENAI_API_KEY=your_openai_api_key" >> .env
echo "ES_HOST=http://localhost:9200" >> .env
echo "ES_USER=elastic" >> .env
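echo "ES_PASSWORD=changeme" >> .env

Install backend dependencies:
cd backend
pip install -r requirements.txt

Install frontend dependencies (from the project root):
cd frontend
npm install

Start each service in its own terminal:
# Terminal 1: Elasticsearch
cd backend
./elasticsearch-7.17.9/bin/elasticsearch        # Linux/Mac
# elasticsearch-7.17.9\bin\elasticsearch.bat    # Windows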
# Terminal 2: Backend API
cd backend
python app.py # Runs on port 5000
# Terminal 3: Frontend
cd frontend
npm run dev # Runs on port 3000
- Visit http://localhost:3000 to access the chat interface
- Submit any GitHub repository URL (e.g., https://github.com/microsoft/vscode)
- The repository is ingested, and you can then ask questions about it
┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
│    Frontend     │      │     Backend     │      │    External     │
│    (Next.js)    │─────▶│     (Flask)     │─────▶│    Services     │
│    Port 3000    │      │    Port 5000    │      │                 │
└─────────────────┘      └─────────────────┘      └─────────────────┘
         │                        │                        │
         │                        │                        │
         ▼                        ▼                        ▼
  User Interface          API Endpoints             GitHub API
  - Chat Interface        - /api/ingest             OpenAI API
  - Code Viewer           - /api/chat               Elasticsearch
  - File Tree             - /api/security/*
  - Markdown Render       - /api/repo/*
- Repository Ingestion: GitHub URL → GitHub API → File Processing → Text Chunking
- Embedding Generation: Code Chunks → OpenAI Embeddings API → Vector Representations
- Storage: Embeddings → Elasticsearch Vector Database → Indexed for Search
- Query Processing: User Question → Semantic Search → Relevant Chunks Retrieved
- Response Generation: Context + Question → OpenAI Chat API → Contextual Answer
Frontend:
- Next.js 15.5+ (React 19)
- TypeScript
- Tailwind CSS 4
- Framer Motion (animations)
- React Markdown (code rendering)
Backend:
- Python 3.9+
- Flask (API framework)
- LangChain (document processing)
- OpenAI API (embeddings & chat)
- Elasticsearch 7.17.9 (vector database)
- PyGithub (GitHub API integration)
capstone-ai/
├── frontend/ # Next.js frontend application
│ ├── src/
│ │ ├── app/ # Next.js app directory
│ │ │ ├── chat/ # Chat interface page
│ │ │ ├── page.tsx # Home page
│ │ │ └── layout.tsx # Root layout
│ │ ├── components/ # React components
│ │ └── lib/ # Utility functions
│ ├── public/ # Static assets
│ └── package.json # Node dependencies
│
├── backend/ # Flask backend API
│ ├── app.py # Main Flask application
│ ├── config.py # Configuration settings
│ ├── github_utils.py # GitHub API utilities
│ ├── ingest_pipeline.py # Repository ingestion logic
│ ├── security_assessment.py # Security scanning logic
│ ├── prompts.py # LLM prompt templates
│ ├── tests/ # Backend tests
│ └── requirements.txt # Python dependencies
│
├── .env.example # Environment variables template
├── .gitignore # Git ignore rules
└── README.md # This file
/api/ingest - Ingest a GitHub repository into the vector database.
Request:
{
"github_url": "https://github.com/owner/repo",
"openai_api_key": "sk-..."
}
Response:
{
"message": "Processing repository...",
"repo_info": {
"owner": "owner",
"repo": "repo",
"files_processed": 142
}
}
Get the file tree for an ingested repository.
Response:
{
"tree": [
{
"path": "src/index.js",
"type": "file"
}
]
}
Get the contents of a specific file.
Query Parameters:
- file_path: Path to the file in the repository
/api/chat - Ask questions about an ingested repository.
Request:
{
"owner": "owner",
"repo": "repo",
"message": "How does authentication work?",
"conversation_history": [],
"openai_api_key": "sk-..."
}
Response: Streamed text response with citations
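A minimal client sketch for the chat request above, using only the standard library. It assumes the endpoint lives at /api/chat on the Flask backend (port 5000) and accepts a JSON POST; the HTTP method is an assumption, and `build_chat_request` and `stream_chat` are hypothetical names.

```python
import codecs
import json
import urllib.request


def build_chat_request(base_url, owner, repo, message, openai_api_key, history=None):
    """Build (but do not send) a JSON request for the chat endpoint."""
    payload = {
        "owner": owner,
        "repo": repo,
        "message": message,
        "conversation_history": history or [],
        "openai_api_key": openai_api_key,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",  # assumption: the README does not state the method
    )


def stream_chat(request, chunk_size=1024):
    """Send the request and yield decoded text as it arrives."""
    # An incremental decoder avoids splitting multi-byte characters across chunks.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    with urllib.request.urlopen(request) as response:
        while True:
            chunk = response.read1(chunk_size)
            if not chunk:
                break
            text = decoder.decode(chunk)
            if text:
                yield text
```

With the backend running, `for text in stream_chat(build_chat_request("http://localhost:5000", "owner", "repo", "How does authentication work?", "sk-...")): print(text, end="")` would print the answer as it streams.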
POST /api/security/assess_repo - Run a security assessment on the entire repository.
Request:
{
"owner": "owner",
"repo": "repo",
"openai_api_key": "sk-..."
}
Response:
{
"scope": "repo",
"summary": "Found 3 high-severity issues...",
"findings": [
{
"severity": "high",
"title": "Hard-coded credentials",
"description": "API key found in source code",
"file_path": "config/settings.py",
"line_hints": "23-25",
"remediation": "Move to environment variables",
"category": "CWE-798"
}
]
}
POST /api/security/assess_file - Run a security assessment on a specific file.
Request:
{
"owner": "owner",
"repo": "repo",
"file_path": "src/auth.py",
"openai_api_key": "sk-..."
}
Backend Development:
cd backend
python app.py # Runs with debug mode enabled
Frontend Development:
cd frontend
npm run dev # Runs with hot module replacement
Create a .env file in the root directory (see .env.example):
# GitHub API
GITHUB_TOKEN=ghp_your_github_token_here
# AI API Configurations
OPENAI_API_KEY=sk-your_openai_api_key_here
# Elasticsearch Configuration
ES_HOST=http://localhost:9200
ES_USER=elastic
ES_PASSWORD=changeme
Backend Tests:
cd backend
python -m pytest tests/
Frontend Linting:
cd frontend
npm run lint
Elasticsearch won't start:
- Verify Java is installed: java -version
- Check if port 9200 is already in use
- Increase heap size in elasticsearch-7.17.9/config/jvm.options
Backend connection errors:
- Ensure Elasticsearch is running at http://localhost:9200
- Verify environment variables are set correctly
- Check firewall settings
Frontend can't connect to backend:
- Verify backend is running on port 5000
- Check CORS settings in Flask app
- Ensure API endpoints are correctly configured
OpenAI API errors:
- Verify your API key is valid and has credits
- Check rate limits on your OpenAI account
- Ensure you're using a supported model
GitHub API rate limits:
- Authenticated requests have higher rate limits
- Use a GitHub token with appropriate permissions
- Check remaining rate limit: https://api.github.com/rate_limit
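The rate-limit check can also be scripted. This sketch queries the endpoint above with the standard library and summarizes the `core` bucket; it assumes the token is supplied by the caller (for example from GITHUB_TOKEN), and `summarize_rate_limit` and `fetch_rate_limit` are hypothetical names.

```python
import json
import urllib.request
from datetime import datetime, timezone


def summarize_rate_limit(payload):
    """Reduce the /rate_limit response to the core REST API quota."""
    core = payload["resources"]["core"]
    return {
        "limit": core["limit"],
        "remaining": core["remaining"],
        # "reset" is a Unix timestamp in seconds.
        "resets_at": datetime.fromtimestamp(core["reset"], tz=timezone.utc).isoformat(),
    }


def fetch_rate_limit(token=None):
    """Query GitHub for the current quota; unauthenticated if no token is given."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request("https://api.github.com/rate_limit", headers=headers)
    with urllib.request.urlopen(request, timeout=10) as response:
        return summarize_rate_limit(json.load(response))
```

If `remaining` is near zero during ingestion, waiting until `resets_at` is usually simpler than retrying immediately.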
Q: Do I need to clone repositories locally?
A: No, RepoRover fetches repository contents directly via the GitHub API without requiring local clones.
Q: Can I analyze private repositories?
A: Yes, if your GitHub token has the appropriate permissions (repo scope).
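Q: How much does it cost to run?
A: The main cost is OpenAI API usage (embeddings + chat), and it scales with repository size. The processing estimates earlier in this document range from $0.10-$0.30 for a small repository (~50 files) to $2.00-$10.00 for a large one (~5000 files).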
Q: Can I use other LLM providers instead of OpenAI?
A: Currently, the application is configured for OpenAI. Support for other providers (Anthropic, Azure OpenAI) would require code modifications.
Q: How large can repositories be?
A: The system can handle repositories with thousands of files. Very large repositories (>10,000 files) may take longer to process.
Q: Is the embedding data stored permanently?
A: Yes, embeddings are stored in Elasticsearch until you delete the index. You only need to ingest a repository once.
Q: Can I run this in production?
A: Yes, but you should add authentication, rate limiting, and monitoring. See the deployment considerations below.
For production deployment, consider:
- Security
  - Add user authentication and authorization
  - Implement rate limiting on API endpoints
  - Use environment-specific API keys
  - Enable HTTPS/TLS for all connections
- Scalability
  - Use managed Elasticsearch (Elastic Cloud)
  - Deploy backend with WSGI server (Gunicorn/uWSGI)
  - Add Redis for caching and job queues
  - Consider horizontal scaling for backend
- Monitoring
  - Add application logging and monitoring
  - Track API usage and costs
  - Monitor Elasticsearch cluster health
  - Set up error tracking (Sentry, etc.)
- Performance
  - Implement caching for frequently accessed repositories
  - Use CDN for frontend static assets
  - Optimize Elasticsearch index settings
  - Batch process large repositories
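As one possible starting point for the rate-limiting item, here is a small in-memory token bucket. It is a sketch under stated assumptions, not part of RepoRover: it covers a single process only (a multi-worker deployment would need shared state, for example in Redis), and the `TokenBucket` class is hypothetical.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self):
        """Consume one token if available; return False when the caller should be throttled."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill in proportion to elapsed time, never beyond capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

A backend could keep one bucket per client (keyed by API key or IP address) and respond with HTTP 429 whenever `allow()` returns False.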
Contributions are welcome! Please follow these guidelines:
- Fork the repository
- Create a feature branch: git checkout -b feature-name
- Make your changes and commit: git commit -am 'Add new feature'
- Push to the branch: git push origin feature-name
- Submit a pull request
- Python: Follow PEP 8 guidelines
- JavaScript/TypeScript: Follow the existing ESLint configuration
- Write descriptive commit messages
- Add tests for new features
If you discover a security vulnerability, please email the maintainers instead of creating a public issue.
If you encounter issues or have questions:
- Check the Troubleshooting section
- Review the FAQ section
- Open an issue on GitHub with:
- Clear description of the problem
- Steps to reproduce
- Error messages or logs
- Your environment (OS, Node/Python versions)
Future enhancements planned:
- Support for additional LLM providers (Anthropic Claude, Azure OpenAI)
- Private repository analysis with enhanced authentication
- Diff analysis for comparing commits or branches
- Code similarity detection across repositories
- Export reports as PDF or Markdown
- Team collaboration features
- Custom security rule definitions
- Integration with CI/CD pipelines
- Repository comparison and benchmarking
This project is licensed under the MIT License - see the LICENSE file for details.
RepoRover was developed by:
- Jemin Gandhi
- Nicholas McCracken
- Alex Mitelman
- Elijah (Eli) Paulman
- Blake Theis
A capstone project at The Ohio State University
Built with these technologies:
- OpenAI - GPT models and embedding API
- Elasticsearch - Vector search and storage
- LangChain - Document processing and LLM orchestration
- Next.js - React framework for the frontend
- Flask - Python web framework for the backend
- Tailwind CSS - Utility-first CSS framework
- PyGithub - GitHub API integration
Built by the RepoRover team at The Ohio State University | Star this repo if you find it useful!