A Retrieval-Augmented Generation (RAG) system for searching and exploring code repositories using natural language queries. This system indexes code repositories and enables semantic search to find relevant files and code snippets based on their meaning rather than just keywords.
- Repository Indexing: Clone repositories or use GitHub API to access and index code files
- Semantic Search: Retrieve relevant code using natural language queries
- Code Understanding: Generate concise summaries and answers about repository functionality
- Performance Evaluation: Built-in evaluation using the Recall@10 metric
All dependencies can be installed from the requirements.txt file:
```
pip install -r requirements.txt
```

You'll also need:
- Python 3.8 or higher
- Git (for repository cloning)
- GitHub API token (optional, for API-based repository access)
- Google Gemini API key (optional, for LLM answer generation via the google-genai SDK)
Create a `.env` file with your API keys:

```
GITHUB_TOKEN=your_github_token_here
LLM_API_KEY=your_llm_api_key_here
```
You can use the provided setup script to clone and index a repository:
```
python setup.py
```

Or run the notebook `main.ipynb` and execute the indexing cells.
Use the `retrieve_repository` function to search the indexed repository:

```python
from retrieve import retrieve_repository

# Search for relevant code related to a query
results = retrieve_repository("How does the app handle screen rotation?", n_results=5)

# Print the most relevant file
print(f"Most relevant file: {results[0]['source']}")
```

To generate answers based on retrieved code, use `llm.py` or the last cell of the main notebook:
```python
from llm import generate_answer

# Example usage
answer = generate_answer("How is the sponsor dialog implemented?")
print(answer)
```

The system includes evaluation capabilities using recall metrics:
```
python evaluate.py
```

This runs the evaluation on a test set and outputs the average Recall@10 score, using a synthetic dataset built from the escrcpy repository.
- `setup.py`: Script for setting up and indexing a repository
- `retrieve.py`: Core retrieval functions for searching the indexed repository
- `evaluate.py`: Evaluation script to measure system performance
- `main.ipynb`: Jupyter notebook demonstrating the full workflow
- `requirements.txt`: List of required Python packages
- `chroma_db/`: Directory containing the indexed repository as a vector database
- `repo_temp/`: Temporary directory containing a clone of the target repository
This RAG system uses:
- Embedding Model: IBM Granite Embedding 107M Multilingual model for semantic code understanding
- Vector Store: ChromaDB for efficient similarity search (cosine distance)
- Text Chunking: Recursive character text splitting for preserving code structure
- Normalization: Custom code normalization preserving indentation while cleaning whitespace
- LLM Integration: Optional Gemini 2.0 Flash model for generating answers (requires API key)
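The normalization step described above can be sketched as follows. This is an illustrative reconstruction, not the project's actual code: the function name and the exact cleaning rules (stripping trailing whitespace, collapsing blank-line runs, keeping leading indentation) are assumptions about what "preserving indentation while cleaning whitespace" means here.

```python
def normalize_code(text):
    """Clean whitespace in a code chunk while preserving indentation.

    Trailing whitespace is stripped from each line, and runs of blank
    lines are collapsed to a single blank line, so the chunk embeds
    consistently without losing its structural layout.
    """
    lines = []
    blank_run = 0
    for line in text.splitlines():
        stripped = line.rstrip()   # drop trailing whitespace only
        if not stripped:
            blank_run += 1
            if blank_run > 1:      # collapse repeated blank lines
                continue
        else:
            blank_run = 0
        lines.append(stripped)
    return "\n".join(lines)
```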
Work on the implementation included:
- Experimenting with various chunking strategies to better preserve code context
- Testing query-expansion techniques and various text-preprocessing methods
- Evaluating multiple embedding models to optimize for code understanding
- Building a complete pipeline from GitHub repository indexing to retrieval
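One simple form of query expansion, of the kind experimented with above, appends code-oriented synonyms to natural-language terms so the query embedding sits closer to code vocabulary. This sketch is purely illustrative; the synonym table and function name are not part of the project:

```python
# Hypothetical synonym table mapping natural-language terms to code vocabulary.
SYNONYMS = {
    "rotation": ["orientation", "landscape", "portrait"],
    "dialog": ["modal", "popup", "alert"],
    "settings": ["preferences", "config", "options"],
}

def expand_query(query):
    """Append known synonyms for any matching query terms."""
    extra = []
    for word in query.lower().split():
        extra.extend(SYNONYMS.get(word.strip("?.,!"), []))
    return query if not extra else f"{query} {' '.join(extra)}"
```

The expanded string is then embedded in place of the raw query before searching the vector store.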
The implementation of this RAG system revealed several challenges and insights:
- The system achieved a modest Recall@10 score of 0.49 on the evaluation dataset
- Standard embedding models demonstrated limitations with code-specific semantics
- Balancing chunk size with contextual integrity significantly impacted retrieval quality
- Query-code semantic gap presents a fundamental challenge requiring specialized techniques
- Multilingual code content added complexity to the embedding process and retrieval results
The systematic evaluation framework measures Recall@10 as specified in the requirements, with results saved to CSV for transparency. The project demonstrates both the potential and current limitations of applying RAG approaches to code repositories.
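Recall@10 here is the fraction of queries whose known relevant file appears among the top 10 retrieved sources. A minimal sketch of that computation (the function name and argument layout are assumptions, not the project's `evaluate.py`):

```python
def recall_at_k(expected, retrieved, k=10):
    """Fraction of queries whose relevant file is in the top-k results.

    expected:  list of the one known relevant file per query
    retrieved: list of ranked source-file lists, one per query
    """
    hits = sum(1 for gold, ranked in zip(expected, retrieved) if gold in ranked[:k])
    return hits / len(expected)
```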
Future enhancements to the system could include:
- Specialized code embeddings designed for programming language semantics
- Hybrid retrieval strategies combining semantic and keyword-based approaches
- More sophisticated context-preserving chunking methods
- Fine-tuning embedding models specifically for code comprehension
- Enhanced query preprocessing techniques to bridge the semantic gap