🌟 Inspiration
The inspiration for Code Archaeologist came from a common frustration every developer faces: onboarding to a new codebase is painful. Whether joining a new team, contributing to open source, or reviewing a colleague's project, understanding unfamiliar code can take days or even weeks.
We've all experienced it:
- Spending hours tracing through files to understand architecture
- Struggling to identify where to start making changes
- Missing critical technical debt hiding in legacy code
- Wishing for a "guided tour" of the codebase
We thought: What if AI could be your expert guide? What if, instead of manually exploring thousands of lines of code, you could point an AI at any GitHub repository and instantly get:
- A complete architectural overview
- Documentation explaining how everything works
- Identified technical debt and code quality issues
- A personalized onboarding guide for new developers
That's when Code Archaeologist was born - to transform the intimidating task of understanding unfamiliar code into an exciting exploration powered by Google's Gemini AI.
🔍 What it does
Code Archaeologist is an AI-powered codebase analysis tool that transforms any GitHub repository into comprehensive, human-readable documentation in seconds.
Core Features:
🏛️ Intelligent Code Analysis
- Analyzes repository architecture and design patterns
- Identifies key modules and their relationships
- Maps out the entire codebase structure
📚 Auto-Generated Documentation
- System Architecture: Deep dive into how components interact
- Module Overview: Breakdown of each major component
- Technical Debt Analysis: Identifies code quality issues, missing tests, and areas needing improvement
- Developer Onboarding Guide: Step-by-step setup and contribution instructions
🌳 Interactive Visualizations
- File Tree Explorer: Navigate the entire project structure with expandable folders
- Dependency Graph: Visual map of import relationships between modules
- Click-and-explore interface showing which files depend on each other
🤖 AI Code Detection
- Estimates what percentage of the code appears AI-generated versus human-written
- Identifies common AI coding patterns (verbose comments, generic naming, boilerplate code)
- Provides confidence levels and specific indicators found
- Helps teams understand their AI-assisted development patterns
💬 Interactive AI Chat
- Ask questions about any aspect of the codebase
- Get instant answers powered by Gemini AI
- Contextual responses based on the actual repository code
📄 Professional PDF Reports
- One-click export of complete analysis
- Beautifully formatted documentation ready to share
- Perfect for team onboarding or code reviews
🛠️ How we built it
Code Archaeologist combines cutting-edge AI with modern web technologies:
Frontend Stack:
- React 18 - Modern UI framework with hooks
- Vite - Lightning-fast build tool and dev server
- Tailwind CSS v4 - Utility-first styling with custom design system
- Framer Motion - Smooth animations and transitions
- React Markdown - Rich text rendering for AI responses
Backend Stack:
- FastAPI (Python) - High-performance async API framework
- Google Gemini AI (gemini-3-pro-preview) - Advanced code analysis and natural language understanding
- GitPython - Repository cloning and management
- Uvicorn - Lightning-fast ASGI server
AI Integration:
The heart of Code Archaeologist is its sophisticated prompting system:
- Repository Ingestion: Clone GitHub repos and extract all code files
- Context Building: Concatenate code with file paths and structure
- Intelligent Prompting: Send structured prompts to Gemini for:
- Architecture analysis
- Module identification
- Technical debt detection
- Onboarding guide generation
- AI code pattern recognition
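The context-building step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `build_context` and `build_prompt` and the `=== FILE: ... ===` header format are assumptions, and the files are assumed to be already read into a dict of path to content.

```python
def build_context(files: dict[str, str]) -> str:
    """Join file contents into one string, labeling each file with its path."""
    sections = []
    for path in sorted(files):  # sorted so the output is deterministic
        # The header format is an illustrative choice (assumption).
        sections.append(f"=== FILE: {path} ===\n{files[path]}")
    return "\n\n".join(sections)


def build_prompt(task: str, files: dict[str, str]) -> str:
    """Put a task instruction in front of the concatenated repository context."""
    return f"{task}\n\nRepository contents:\n\n{build_context(files)}"
```

Keeping the path next to each file's content lets the model refer to specific files in its answer, which is what the architecture and module analyses depend on.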
Multi-Modal Analysis:
- Text analysis for code understanding
- Dependency graph construction from import statements
- File tree generation from repository structure
- Pattern matching for AI-generated code detection
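As an illustration of file tree generation from repository structure, the sketch below turns a flat list of paths into a nested dictionary. The name `build_file_tree` and the convention that files map to `None` are assumptions made for this example.

```python
def build_file_tree(paths: list[str]) -> dict:
    """Turn flat repository paths into a nested dict; folders are dicts, files map to None."""
    tree: dict = {}
    for path in paths:
        parts = path.strip("/").split("/")
        node = tree
        for folder in parts[:-1]:
            # Create the folder on first sight, then descend into it.
            node = node.setdefault(folder, {})
        node[parts[-1]] = None
    return tree
```

A nested structure like this maps directly onto an expandable folder view: each dict is a folder to expand, each `None` is a leaf file.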
Key Technical Achievements:
Smart Code Parsing:
# Extract dependencies from Python, JavaScript, and TypeScript files
# Supports: import, from...import, require(), and more
def extract_imports_from_content(content, file_type, file_path):
    ...
AI Detection Algorithm:
- Analyzes 10+ indicators of AI-generated code
- Provides percentage breakdowns and confidence levels
- Identifies specific patterns with examples
Real-time Chat:
- Maintains conversation history
- Contextual responses based on repository content
- Streaming responses for better UX
Beautiful UI/UX:
- Glassmorphism design with backdrop blur
- Animated gradient backgrounds
- Mouse-following glow effects
- Smooth tab transitions
- Print-optimized PDF generation
🚧 Challenges we ran into
1. Gemini API Context Length Limits
Problem: Large repositories exceeded Gemini's token limits.
Solution: Implemented smart truncation:
- Prioritize important files (README, package.json, main entry points)
- Limit to 50 files and 50KB per file
- Focus on code files, exclude node_modules and binary files
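The truncation rules above could look something like the sketch below. The 50-file and 50KB limits come from the writeup; the specific priority names, suffix list, and the function name `select_files` are assumptions, and the size cap is applied per character as an approximation of bytes.

```python
from pathlib import PurePosixPath

MAX_FILES = 50           # limit stated in the writeup
MAX_CHARS = 50 * 1024    # approximates the 50KB per-file limit by character count
PRIORITY_NAMES = {"readme.md", "package.json", "main.py", "index.js"}  # assumed list
CODE_SUFFIXES = {".py", ".js", ".jsx", ".ts", ".tsx", ".json", ".md"}  # assumed list
EXCLUDED_DIRS = {"node_modules", ".git"}  # ".git" is an assumption


def select_files(files: dict[str, str]) -> dict[str, str]:
    """Pick at most MAX_FILES files, priority files first, each cut to MAX_CHARS."""
    candidates = []
    for path in files:
        parsed = PurePosixPath(path)
        if EXCLUDED_DIRS & set(parsed.parts):
            continue  # skip vendored or VCS directories
        if parsed.suffix.lower() not in CODE_SUFFIXES:
            continue  # skip binaries and other non-code files
        candidates.append(path)
    # Priority files sort first (False < True), then alphabetically.
    candidates.sort(
        key=lambda path: (PurePosixPath(path).name.lower() not in PRIORITY_NAMES, path)
    )
    return {path: files[path][:MAX_CHARS] for path in candidates[:MAX_FILES]}
```

Sorting before slicing means the README and entry points survive even when the repository has far more than 50 eligible files.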
2. Dependency Graph Complexity
Problem: Extracting accurate import relationships across different languages.
Solution: Built language-specific parsers:
# Python: import X, from X import Y
# JavaScript/TypeScript: import X from 'Y', require('Y')
# Handles relative paths: ./file, ../folder/file
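A regex-based version of these parsers might look like the following. It is a simplified sketch rather than the project's implementation: the patterns cover only the common forms listed above (for example, `import a, b` yields only `a`), relative paths are resolved for JavaScript/TypeScript only, and the name `extract_imports` is hypothetical.

```python
import posixpath
import re

# Matches "from X import ..." or "import X" at the start of a line.
PY_IMPORT = re.compile(
    r"^\s*(?:from\s+([\w.]+)\s+import\b|import\s+([\w.]+))", re.MULTILINE
)
# Matches "... from 'Y'", "require('Y')", and bare "import 'Y'".
JS_IMPORT = re.compile(
    r"""(?:\bfrom\s+|\brequire\(\s*|^\s*import\s+)['"]([^'"]+)['"]""", re.MULTILINE
)


def extract_imports(content: str, file_type: str, file_path: str) -> list[str]:
    """Return imported module names; relative JS/TS paths are resolved against file_path."""
    if file_type == "py":
        return [first or second for first, second in PY_IMPORT.findall(content)]
    base = posixpath.dirname(file_path)
    resolved = []
    for spec in JS_IMPORT.findall(content):
        if spec.startswith("."):
            # "./file" and "../folder/file" become repository-relative paths.
            spec = posixpath.normpath(posixpath.join(base, spec))
        resolved.append(spec)
    return resolved
```

Resolving relative specifiers to repository-relative paths is what lets two files that import the same module through different relative routes point at the same node in the dependency graph.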
3. AI Response Reliability
Problem: Gemini sometimes returned malformed JSON or included markdown code blocks.
Solution:
- Strict prompt engineering specifying exact JSON format
- Response cleaning (strip ```json markers)
- Fallback error handling with user-friendly messages
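The cleaning and fallback steps above can be sketched as below. The function name `parse_model_json` and the error message are assumptions; the fence marker is built from single backticks only to keep the example easy to embed.

```python
import json
import re

FENCE_MARK = "`" * 3  # three backticks
# Optional "json" language tag after the opening marker, then the payload.
FENCE = re.compile(
    rf"^{FENCE_MARK}(?:json)?\s*\n(.*?)\n?{FENCE_MARK}\s*$", re.DOTALL
)


def parse_model_json(raw: str) -> dict:
    """Strip a surrounding markdown code fence, then parse JSON with a friendly fallback."""
    text = raw.strip()
    match = FENCE.match(text)
    if match:
        text = match.group(1)
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return {"error": "The analysis could not be parsed. Please try again."}
    if not isinstance(data, dict):
        return {"error": "The analysis had an unexpected shape. Please try again."}
    return data
```

Returning an error dict instead of raising keeps the failure path in the same shape as the success path, so the frontend can show a message without special-casing exceptions.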
4. Print/PDF Generation
Problem: Browser print dialog doesn't work well with dark backgrounds and complex layouts.
Solution:
- Created separate <Report> component optimized for print
- Print-specific CSS media queries
- Hidden UI elements with .no-print class
- White background and black text for PDFs
5. Real-time Chat Performance
Problem: Chat responses felt slow and disconnected.
Solution:
- Loading indicators with animated dots
- Smooth scroll to bottom on new messages
- Cached repository content to avoid re-analysis
- Streamed responses for better perceived performance
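Two of the ideas above, caching repository content and keeping conversation history, can be sketched without any web framework. Everything named here (`get_repo_context`, `ChatSession`, the `max_turns` bound, the in-memory dict cache) is an assumption for illustration; the actual call to Gemini and the streaming layer are left out.

```python
from collections import deque
from typing import Callable

_context_cache: dict[str, str] = {}


def get_repo_context(repo_url: str, loader: Callable[[str], str]) -> str:
    """Load repository context once per URL; later calls reuse the cached string."""
    if repo_url not in _context_cache:
        _context_cache[repo_url] = loader(repo_url)
    return _context_cache[repo_url]


class ChatSession:
    """Keeps the last few turns so each prompt carries recent conversation context."""

    def __init__(self, repo_context: str, max_turns: int = 10):
        self.repo_context = repo_context
        # Each turn stores two entries: the user question and the assistant answer.
        self.history: deque = deque(maxlen=2 * max_turns)

    def build_prompt(self, question: str) -> str:
        lines = [f"{role}: {text}" for role, text in self.history]
        return "\n".join([self.repo_context, *lines, f"user: {question}"])

    def record(self, question: str, answer: str) -> None:
        self.history.append(("user", question))
        self.history.append(("assistant", answer))
```

Bounding the history keeps long conversations from pushing the repository context out of the model's token budget.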
6. AI Code Detection Accuracy
Problem: It is hard to reliably distinguish AI-generated code from human-written code.
Solution:
- Multi-factor analysis with 10+ indicators
- Confidence levels (low/medium/high)
- Specific examples and file patterns
- Transparent breakdown of scoring methodology
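One way a multi-factor score with confidence levels could be combined is shown below. This is purely a hypothetical sketch: the indicator names, the weights, and the rule that confidence depends on how many indicators were measured are all assumptions, not the project's scoring methodology.

```python
def score_ai_likelihood(indicators: dict[str, float], weights: dict[str, float]) -> dict:
    """Combine per-indicator scores in [0, 1] into a percentage and a confidence label."""
    total_weight = sum(weights[name] for name in indicators)
    weighted = sum(indicators[name] * weights[name] for name in indicators)
    percent = round(100 * weighted / total_weight) if total_weight else 0
    # Assumed rule: more measured indicators means more confidence in the estimate.
    count = len(indicators)
    confidence = "high" if count >= 8 else "medium" if count >= 4 else "low"
    return {
        "ai_percent": percent,
        "confidence": confidence,
        "breakdown": dict(indicators),  # exposed so users can see how the score was formed
    }
```

Returning the per-indicator breakdown alongside the total is what makes the score inspectable rather than a single opaque number.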
🏆 Accomplishments that we're proud of
✨ Beautiful, Polished UI
- Professional glassmorphism design
- Smooth animations throughout
- Mobile-responsive layout
- Print-ready PDF generation
🤖 Advanced AI Integration
- Successfully integrated Gemini for complex code analysis
- Built sophisticated prompting system for accurate results
- Implemented conversational AI chat with context retention
📊 Rich Visualizations
- Interactive file tree with expand/collapse
- Dependency graph with connection tracking
- AI detection gauge with breakdown scores
⚡ Performance Optimization
- Fast analysis even for large repositories, thanks to file prioritization and size limits
- Efficient caching of analyzed data
- Async processing for better UX
🎯 Real-World Utility
- Actually solves a painful developer problem
- Production-ready documentation generation
- Saves hours of manual code exploration
🔒 Robust Error Handling
- Graceful fallbacks for API failures
- Clear error messages for users
- Validation of GitHub URLs
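GitHub URL validation can be done with a single pattern, as in the sketch below. The accepted URL shape (HTTPS only, optional `.git` suffix and trailing slash) and the name `parse_github_url` are assumptions for this example.

```python
import re
from typing import Optional

# owner/repo with an optional ".git" suffix and optional trailing slash.
GITHUB_URL = re.compile(
    r"^https://github\.com/([A-Za-z0-9-]+)/([A-Za-z0-9._-]+?)(?:\.git)?/?$"
)


def parse_github_url(url: str) -> Optional[tuple[str, str]]:
    """Return (owner, repo) for a valid GitHub repository URL, else None."""
    match = GITHUB_URL.match(url.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)
```

Rejecting malformed URLs before cloning gives the user a clear message immediately instead of a slower failure from the clone step.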
📚 What we learned
Technical Learnings:
Prompt Engineering is an Art
- Learned to craft precise, structured prompts for consistent AI responses
- Discovered the importance of example outputs in prompts
- Found that explicit JSON schemas dramatically improve response quality
AI Context Management
- Balancing context length vs information density
- Prioritizing the most relevant code files
- Truncation strategies that maintain usefulness
Multi-Language Code Parsing
- Each language has unique import syntax
- Relative vs absolute imports require different resolution logic
- File extensions matter for accurate classification
React Performance Patterns
- useMemo for expensive computations (dependency graphs)
- Efficient state management for large datasets
- Lazy loading and progressive rendering
Print CSS is Tricky
- Media queries behave differently for print
- Need explicit white backgrounds for PDFs
- Page breaks and overflow handling
Design Learnings:
Glassmorphism Best Practices
- Proper use of backdrop-filter and blur
- Layering gradients for depth
- Balancing transparency with readability
Animation Principles
- Stagger delays for list items (0.1s intervals)
- Spring animations feel more natural than linear
- Loading indicators reduce perceived wait time
Product Learnings:
Developer Tool UX
- Developers want speed AND beauty
- Clear progress indicators reduce anxiety
- Export features (PDF) add tremendous value
AI Transparency
- Users want to know confidence levels
- Showing specific examples builds trust
- Explaining how analysis works improves adoption
🚀 What's next for Code Archaeologist
Immediate Roadmap:
🔐 Support for Private Repositories
- OAuth integration with GitHub
- Personal access token support
- Organization-level analysis
📈 Advanced Metrics
- Code complexity scores (cyclomatic complexity)
- Test coverage analysis
- Security vulnerability detection
- Performance hotspot identification
🌍 Multi-Language Support
- Better support for Java, Go, Rust, Ruby
- Language-specific best practices
- Framework detection (React, Django, Spring, etc.)
💾 Analysis History
- Save and compare analyses over time
- Track technical debt trends
- Monitor code quality improvements
Long-term Vision:
🤝 Team Collaboration
- Share analyses within teams
- Commenting and annotation features
- Integration with Jira/Linear for issue tracking
🔄 CI/CD Integration
- GitHub Actions for automatic analysis
- Pull request comments with insights
- Automated documentation updates
📊 Repository Comparisons
- Compare similar projects
- Benchmark against industry standards
- Identify best practices from top repos
🎓 Learning Recommendations
- Suggest tutorials based on codebase tech stack
- Identify knowledge gaps for new team members
- Personalized onboarding paths
🔍 Code Search & Navigation
- Semantic code search powered by AI
- "Show me where authentication happens"
- "Find all API endpoints"
🌐 Browser Extension
- Analyze repos directly from GitHub UI
- Quick summaries on hover
- One-click documentation generation
💡 Try It Out!
Live Demo: [Add your deployed URL here]
GitHub: https://github.com/yourusername/code-archaeologist
Sample Analysis: Try it with popular repos like:
- https://github.com/facebook/react.git
- https://github.com/vercel/next.js.git
- https://github.com/python/cpython.git
🙏 Acknowledgments
- Google Gemini AI for powerful code understanding
- Anthropic for inspiration on AI-powered developer tools
- The open-source community for amazing libraries and tools