Project Nexus

Overview

Inspiration

In major public scandals, transparency often fails not because data is hidden, but because it is overwhelming. While datasets like the Enron emails are technically public, it can take years for investigators to manually understand who influenced whom. Project Nexus was inspired by the need to close this "transparency gap" for journalists, policymakers, and citizens.

What it does

Project Nexus turns raw, unstructured email data into a structured relationship graph. It processes massive datasets to identify key individuals as nodes and their interactions as edges. The system uses AI to filter out noise, summarize conversations, and reveal hidden communities within the network. Project Nexus also has a built-in chatbot that uses retrieval-augmented generation (RAG) to help users answer questions, navigate the app, and build evidence-based claims.

How we built it

The project utilizes a multi-stage data engineering and machine learning pipeline:

Data Engineering: We built tools to stream-parse malformed multi-line email data, normalize threads, and filter out non-informational content like spam.
Datasets: The system was tested on the Enron Email Dataset (500k+ emails) and the Epstein Email Dataset.
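To illustrate the thread normalization step mentioned under Data Engineering, here is a minimal sketch that groups emails by a cleaned-up subject line. It is an assumption about one reasonable approach, not the project's actual code; the email dictionary shape (`id`, `subject`) and the function names are hypothetical.

```python
import re
from collections import defaultdict

# Reply/forward prefixes such as "Re:", "RE:", "Fw:", "Fwd:" (possibly repeated).
_PREFIX = re.compile(r"^\s*((re|fwd?)\s*:\s*)+", re.IGNORECASE)


def normalize_subject(subject: str) -> str:
    """Strip reply/forward prefixes, collapse whitespace, and lowercase."""
    stripped = _PREFIX.sub("", subject)
    return " ".join(stripped.split()).lower()


def group_into_threads(emails):
    """Group email ids by normalized subject.

    `emails` is an iterable of dicts with hypothetical keys "id" and "subject".
    """
    threads = defaultdict(list)
    for email in emails:
        threads[normalize_subject(email["subject"])].append(email["id"])
    return dict(threads)
```

Subject matching alone can merge unrelated conversations that share a title, so a real pipeline would likely combine it with headers such as Message-ID and In-Reply-To when they survive parsing.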
We chose Neo4j (hosted on Aura Cloud) as our database because fraud is a relationship problem — people are nodes, communications are edges. This made it trivial to query things like shortest paths between suspects or bridge figures connecting separate groups.
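As a sketch of the kind of query this enables, the snippet below builds a Cypher shortest-path query and runs it with the official `neo4j` Python driver. The `Person` label, `EMAILED` relationship type, and `name` property are hypothetical, since the write-up does not describe the actual schema.

```python
def shortest_path_query(source: str, target: str, max_hops: int = 6):
    """Build a Cypher query and parameters for the shortest path between two people.

    The Person label, EMAILED relationship type, and name property are
    hypothetical placeholders for the real graph schema.
    """
    if not 1 <= max_hops <= 15:
        raise ValueError("max_hops must be between 1 and 15")
    # The hop bound is written into the query text as a literal,
    # so it is range-checked above before being formatted in.
    query = (
        "MATCH (a:Person {name: $source}), (b:Person {name: $target}) "
        f"MATCH p = shortestPath((a)-[:EMAILED*..{max_hops}]-(b)) "
        "RETURN [n IN nodes(p) | n.name] AS people"
    )
    return query, {"source": source, "target": target}


def find_shortest_path(uri, user, password, source, target):
    """Run the query against a Neo4j instance (requires the `neo4j` package)."""
    from neo4j import GraphDatabase  # imported lazily so the builder works without it

    query, params = shortest_path_query(source, target)
    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            record = session.run(query, params).single()
            return record["people"] if record else None
```

Names are passed as query parameters rather than interpolated into the string, which avoids injection issues when the values come from user input.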
The backbone of our system is an AI agent powered by LLMs (GPT-4o-mini / GPT-5-nano via OpenRouter) that reads raw email chains and autonomously extracts relationship observations using tool calling. It doesn't just summarize — it loops through each email, identifies people, and writes structured observations like "these two are coordinating wire transfers" directly into the graph. We process emails in parallel for speed.
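The tool-calling loop described above can be sketched roughly as follows. This is an illustration under assumptions: the tool name `record_observation`, its fields, and the in-memory `edges` list (standing in for the graph write) are all hypothetical, and the LLM call itself is left out.

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Tool definition in the JSON-schema style used by OpenAI-compatible chat APIs.
# The tool name and fields are hypothetical, not the project's actual schema.
RECORD_OBSERVATION_TOOL = {
    "type": "function",
    "function": {
        "name": "record_observation",
        "description": "Record an observed relationship between two people.",
        "parameters": {
            "type": "object",
            "properties": {
                "source": {"type": "string"},
                "target": {"type": "string"},
                "observation": {"type": "string"},
            },
            "required": ["source", "target", "observation"],
        },
    },
}


def apply_tool_call(name: str, arguments_json: str, edges: list) -> str:
    """Validate one tool call emitted by the model and append it to `edges`.

    `edges` stands in for the graph database write. The returned status
    string is what would be sent back to the model as the tool result.
    """
    if name != "record_observation":
        return f"error: unknown tool {name}"
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError:
        return "error: arguments were not valid JSON"
    if not isinstance(args, dict):
        return "error: arguments must be a JSON object"
    required = RECORD_OBSERVATION_TOOL["function"]["parameters"]["required"]
    missing = [key for key in required if not args.get(key)]
    if missing:
        return "error: missing " + ", ".join(missing)
    edges.append((args["source"], args["target"], args["observation"]))
    return "ok"


def process_in_parallel(emails, handle_email, max_workers=8):
    """Run `handle_email` (a hypothetical per-email agent loop) on a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(handle_email, emails))
```

Validating arguments before writing matters because models occasionally emit malformed or incomplete tool calls; returning an error string lets the agent retry instead of polluting the graph. A thread pool suits this workload because each email spends most of its time waiting on network calls.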
On top of the graph, we layered Louvain community detection to find crime rings, Isolation Forest anomaly detection to flag unusual suspects, and centrality analysis to surface key intermediaries. Each cluster gets an LLM-generated name (e.g., "Securities Fraud Ring") for readability. We also built a RAG pipeline with Pinecone and Sentence Transformers so users can ask plain English questions and get sourced answers grounded in actual email evidence.
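The retrieval half of a RAG pipeline can be sketched with plain cosine similarity. In the project the vectors would come from Sentence Transformers and the search would run in Pinecone; the in-memory list and toy vectors below are stand-ins, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, indexed, k=3):
    """Return the k most similar entries as (score, email_id, text) tuples.

    `indexed` is a list of (email_id, text, vector) tuples, standing in
    for a vector database.
    """
    scored = [(cosine(query_vec, vec), email_id, text) for email_id, text, vec in indexed]
    return sorted(scored, key=lambda item: item[0], reverse=True)[:k]


def build_prompt(question, hits):
    """Assemble a prompt that asks the model to cite email ids as sources."""
    sources = "\n".join(f"[{email_id}] {text}" for _, email_id, text in hits)
    return (
        "Answer using only these emails and cite their ids.\n"
        f"{sources}\n\nQuestion: {question}"
    )
```

Including the email ids in the prompt is what allows the answer to point back to specific evidence.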
The frontend is a React/TypeScript app with fully custom Canvas rendering: a cork board aesthetic with sticky-note suspects, thumbtack pins, and brown string connections. It's built for active investigation: click to focus, shift-click to multi-select, box-select regions, hide/show nodes, filter by degree, and use the shortest path finder to trace glowing red connection chains between any two suspects.
ML & Graph Analytics:
An LLM agent autonomously extracts relationship insights and pushes data to the graph database using various tool calls.
Louvain clustering is applied to reveal hidden communities.
An Isolation Forest model is used for anomaly detection to identify critical people and suspicious connections.
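Louvain clustering works by greedily increasing a score called modularity, which compares the number of edges inside each community with what a random graph of the same degrees would give. The sketch below computes that score for a given partition; it is not Louvain itself and not the project's code, only an illustration of the quantity being optimized. In practice a library routine such as `networkx.community.louvain_communities` could perform the actual search; which library the project used is not stated.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman modularity Q of a partition of an undirected, unweighted graph.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    community_of: dict mapping every node to a community label.
    """
    edges = list(edges)
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)    # edges with both endpoints in the community
    degree_sum = defaultdict(int)  # total degree of the community's nodes
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    # Q = sum over communities of (internal edge share - expected share).
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )
```

For two triangles joined by a single edge, splitting along that edge scores higher than putting everyone in one community, which is the kind of structure the clustering step is meant to surface.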

Challenges we ran into

Unstructured Data: Relationships are often buried deep within massive, unstructured text.
Data Quality: We had to handle malformed email data and successfully reconstruct complex email threads.
Scalability: We needed a system that can realistically analyze data volumes too large for manual review.

Accomplishments that we're proud of

Successfully demonstrating how graph databases unlock relationship-centric analysis, such as path and community queries, that is cumbersome to express in standard relational databases.
Implementing autonomous AI agents capable of multi-step reasoning over complex, real-world data.
Creating a tool that makes transparency actionable and improves public oversight.

What we learned

We learned that unstructured text can be transformed into structured insight through the right combination of graph theory and AI. We also gained experience in training ensemble models for spam detection and using LLMs as tool-calling agents for database management. Finally, we learned how to integrate AI agents into our app so that their outputs are grounded in our own data.

What's next for Project Nexus

Shortest Path Integration: We plan to build on the shortest path finder to visualize and analyze the degrees of separation between individuals who are not directly connected in the dataset.
Scaling Data Capacity: The next phase involves expanding the pipeline to handle significantly larger datasets beyond the initial 500k+ emails to improve the depth of investigative insights.
Enhanced Transparency: We aim to refine our AI's multi-step reasoning to make complex relationships even more actionable for public oversight and investigative journalism.