Git Lineage – Query System

AI-powered understanding and querying of codebases, built with Amazon Bedrock, AWS Lambda, and SQLite.

Inspiration

As developers, we often spend countless hours digging through codebases — trying to understand who wrote what, where a function originated, or how a module evolved. Traditional tools like grep, GitHub search, or even AI code assistants often fail to capture context or lineage.

We wanted to solve that.

Git Lineage was born from a simple idea: “What if an AI agent could understand, not just search, our repositories?”

What it does

Git Lineage transforms any GitHub repository into an AI-queryable knowledge base. Instead of just searching code by keyword, it allows users to ask natural-language questions like:

  • “Where is the SceneManager class defined?”
  • “How does process_scene() work?”
  • “Which commit introduced render_frame?”

The system:

  • Indexes all classes and functions using AST and Tree-sitter parsers.
  • Generates embeddings via Amazon Titan for semantic similarity.
  • Stores them in FAISS for ultra-fast retrieval.
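The indexing step can be sketched with Python's standard `ast` module and SQLite. This is a minimal illustration, not the project's actual code: the real pipeline also uses tree-sitter for non-Python languages, and the table schema here is assumed.

```python
import ast
import sqlite3

# Example source standing in for one repository file (symbol names taken
# from the sample queries above; the real indexer walks files via GitPython).
SOURCE = '''
class SceneManager:
    def render_frame(self):
        pass

def process_scene():
    pass
'''

def index_source(source: str, path: str, conn: sqlite3.Connection) -> int:
    """Parse one file, store every class/function symbol, return the count."""
    tree = ast.parse(source, filename=path)
    rows = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "class" if isinstance(node, ast.ClassDef) else "function"
            rows.append((path, node.name, kind, node.lineno))
    conn.executemany(
        "INSERT INTO symbols (file, name, kind, line) VALUES (?, ?, ?, ?)", rows
    )
    return len(rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE symbols (file TEXT, name TEXT, kind TEXT, line INTEGER)")
count = index_source(SOURCE, "scene.py", conn)
```

`ast.walk` visits nested definitions too, so methods like `render_frame` are indexed alongside top-level classes and functions.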

When a query comes in, the system performs a semantic search → fetches context → calls Bedrock (Claude 3.7 Sonnet) to summarize or explain. The result is a natural, conversation-like experience with your codebase — no manual searching, no reading through 20 files.
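The embed-and-retrieve step looks roughly like this. The Titan request shape follows the public Bedrock API, but the model ID is an assumption for this project; similarity scoring is shown with plain cosine similarity, which is what a FAISS inner-product index computes over normalized vectors.

```python
import json
import math

def embed_with_titan(client, text: str) -> list[float]:
    """Embed text with Amazon Titan via a boto3 'bedrock-runtime' client.
    (Model ID assumed; use whichever Titan version the account has enabled.)"""
    resp = client.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(resp["body"].read())["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 3):
    """Return the k symbol names most similar to the query vector.
    `index` holds (symbol_name, embedding) pairs, as the vector store would."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]
```

In production the `top_k` step is replaced by a FAISS index lookup; the ranking semantics are the same.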

How we built it

The system is built as an end-to-end pipeline using AWS services and local intelligence layers:

  1. Repository Indexing
     • Uses GitPython, AST, and tree-sitter to extract all classes and functions.
     • Stores metadata — file paths, symbols, commits — in a lightweight SQLite database.

  2. Vector Store
     • Embeds symbol names and snippets using Amazon Titan Embeddings from Bedrock.
     • Stores and indexes all embeddings in FAISS for semantic retrieval.

  3. LLM Query Agent
     • A lightweight AWS Lambda function receives user queries.
     • Performs a FAISS vector search → fetches the relevant code snippets → calls Claude 3.7 Sonnet via Bedrock.
     • Returns a natural-language summary or explanation.
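The query-agent stage above can be sketched as a single Lambda handler. The Claude model ID and the event shape are assumptions, and the `bedrock` parameter is injectable so the handler can be exercised without AWS credentials; the real function also runs the FAISS search before this point.

```python
import json

# Assumed Bedrock model ID for Claude 3.7 Sonnet; verify against the account.
MODEL_ID = "anthropic.claude-3-7-sonnet-20250219-v1:0"

def build_prompt(question: str, snippets: list[str]) -> str:
    """Combine the retrieved snippets and the user question into one prompt."""
    context = "\n\n".join(snippets)
    return (
        "Using only the code context below, answer the question.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

def handler(event, context=None, bedrock=None):
    """AWS Lambda entry point (event shape is illustrative)."""
    if bedrock is None:  # in production, a real boto3 client
        import boto3
        bedrock = boto3.client("bedrock-runtime")
    prompt = build_prompt(event["question"], event.get("snippets", []))
    resp = bedrock.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    answer = resp["output"]["message"]["content"][0]["text"]
    return {"statusCode": 200, "body": json.dumps({"answer": answer})}
```

The `converse` call is the standard boto3 Bedrock Runtime chat API; its response nests the generated text under `output.message.content`.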

Responses appear instantly, showing code understanding in action.

Challenges we ran into

  • Managing token efficiency when sending contextual code to the LLM.
  • Fine-tuning FAISS similarity thresholds for relevant snippet retrieval.
  • Balancing speed and accuracy between local indexing and cloud inference.
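For the token-efficiency problem, one simple approach is a greedy budget over the ranked snippets. This is a sketch under stated assumptions: the ~4 characters-per-token ratio is a rough heuristic, and the default budget value is illustrative.

```python
def fit_to_budget(snippets: list[str], max_tokens: int = 4000) -> list[str]:
    """Keep the best-ranked snippets that fit an approximate token budget.

    `snippets` is assumed to arrive sorted by similarity, best first.
    """
    budget = max_tokens * 4  # rough heuristic: ~4 characters per token
    kept, used = [], 0
    for snippet in snippets:
        if used + len(snippet) > budget:
            continue  # snippet too big for the remainder; try smaller ones
        kept.append(snippet)
        used += len(snippet)
    return kept
```

Skipping oversized snippets instead of stopping outright lets smaller, lower-ranked snippets still fill the remaining context window.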

Accomplishments that we're proud of

  • Built a fully autonomous pipeline that turns raw code into an AI-searchable structure.
  • Successfully integrated Amazon Bedrock + FAISS for hybrid semantic reasoning.
  • Achieved query responses in under 3 seconds on AWS Lambda.
  • Made LLMs understand repository evolution (commits, PRs, and structure) — not just snippets.
  • Created a scalable and reproducible workflow that can analyze any public GitHub repo on demand.

What we learned

  • How to combine retrieval-augmented generation (RAG) with code intelligence.
  • The challenge of keeping embeddings up to date with Git commits.
  • How Amazon Bedrock simplifies integration of multiple LLMs within serverless pipelines.
  • The importance of efficient data storage — SQLite was ideal for ephemeral Lambda storage.

What's next for Git Lineage

  1. Graph Database Integration (Amazon Neptune)
     • We initially planned to integrate Amazon Neptune for a graph-based lineage view — connecting functions, classes, commits, and authors. Due to budget limits, this feature hasn't been deployed yet.

  2. Multi-Repository Reasoning
     • Expand to support dependency tracing across interconnected projects (e.g., microservices).

  3. Temporal Lineage Visualization
     • Introduce time-based visual graphs showing how code evolves per commit or release.

  4. VS Code Integration
     • Bring Git Lineage directly into the developer workflow for instant in-editor explanations.

  5. Knowledge Graph Compression & Summarization
     • Use dynamic context distillation to keep large codebases queryable under strict token limits.

Built With

  • Amazon Bedrock (Titan Embeddings, Claude 3.7 Sonnet)
  • AWS Lambda
  • FAISS
  • SQLite
  • GitPython, AST, and tree-sitter
