Inspiration

  • Hackathons generate a massive, fragmented graph of people, projects, tech stacks, and events. It’s hard to answer deceptively simple questions like “Who should I team with?” or “What ideas have traction?” without spending hours searching.
  • We built HackerStats to make the hidden structure of hackathons visible: discover relationships, map communities, and surface novel ideas using graph analysis and modern NLP vectorization.

What it does

  • Interactive graph of the hackathon ecosystem:
    • Visualizes Hackers, Projects, and Hackathons as a connected network.
    • Click any node to see rich details with deep links to Devpost profiles and project pages.
    • Fuzzy search to instantly locate people/projects; matching nodes are highlighted.
  • Idea Brainstorming:
    • Type a prompt; we vectorize it and retrieve top similar projects from a precomputed corpus.
    • Get quick signal on originality, saturation, and competitiveness with heuristic diagnostics.
  • Smart subgraph expansion:
    • Start from a single hacker and explore multi-hop relationships (contributors, shared events, related projects), with depth and size limits that keep subgraphs large and connected without adding noise.
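
The retrieval step behind Idea Brainstorming can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the toy project IDs, and the 3-dimensional vectors are all assumptions, and the real prompt vector would come from the vectorizer rather than being typed by hand.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)

def top_k_similar(query_vec, corpus, k=3):
    """Return the k (project_id, score) pairs most similar to query_vec."""
    scored = [(pid, cosine(query_vec, vec)) for pid, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy vectors standing in for the precomputed project corpus.
corpus = {
    "recipe-recommender": [0.9, 0.1, 0.0],
    "campus-carpool": [0.0, 1.0, 0.2],
    "meal-planner": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]  # stand-in for the vectorized prompt
results = top_k_similar(query, corpus, k=2)
for project_id, score in results:
    print(f"{project_id}: {score:.3f}")
```

With a corpus held in memory, a linear scan like this is often fast enough at hackathon scale; a vectorized or indexed search would be the natural next step for larger corpora.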

How we built it

  • Web crawling:
    • Multithreaded Selenium workers pull from a shared queue and run a recursive breadth-first crawl that expands from each hacker to related Devpost pages.
    • Remote AWS EC2 instances run the crawlers, with dynamic load balancing across them.
  • Frontend (Next.js + D3):
    • Custom D3 force-directed graph with zoom/drag, collision, layered hover labels, and highlight states.
    • Node detail panel with auto-generated external links (Devpost user and software pages), social URL extraction, and metadata display.
    • Serverless API routes act as a control plane for graph queries and NLP calls.
  • Graph data (Neo4j):
    • We ingest and normalize entities (Hacker, Devpost/Project, Hackathon).
    • Variable-length Cypher queries collect expansive yet connected subgraphs rooted at a specific node, then densify edges among selected nodes to maintain coherence.
  • Vectorization + NLP:
    • Custom DevpostVectorizer combines multiple representation spaces:
      • Sentence embeddings (transformers) for semantics.
      • TF-IDF–derived signals for topical specificity.
      • Domain and user-segmentation embeddings from curated ontologies.
      • Tech-stack, awards, and team-composition features projected into fixed-length vectors.
    • We concatenate these into a high-dimensional “combined” vector and compute cosine similarity for retrieval and clustering.
    • Precomputed project vectors are stored on disk and loaded into memory for sub-100ms similarity queries.
  • Python service (FastAPI):
    • Exposes /api/brainstorm and /api/vectorizer endpoints for vectorization and retrieval.
    • Caches model and dataset to keep latency low; returns clean JSON for the frontend.
  • Infrastructure:
    • Monorepo on Railway: separate services for frontend (Next.js) and backend (FastAPI), communicating via HTTP. Neo4j hosted with secure credentials.
    • Environment-driven configuration; no filesystem coupling between services.
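
The subgraph collection described under Graph data (select a bounded set of nodes around a root, then densify edges among them) can be illustrated in plain Python. This is only a sketch under stated assumptions: the real system runs variable-length Cypher queries against Neo4j, whereas here an in-memory edge list stands in for the database, and `expand_subgraph`, `max_depth`, `max_nodes`, and the sample node names are hypothetical.

```python
from collections import deque

def expand_subgraph(edges, root, max_depth=2, max_nodes=50):
    """Phase 1: bounded BFS from root. Phase 2: keep every edge among selected nodes."""
    # Build an undirected adjacency map from the edge list.
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    # Phase 1: breadth-first selection, guarded by depth and node-count limits.
    selected = {root}
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand past the depth limit
        for neighbor in sorted(adjacency.get(node, ())):
            if neighbor in selected:
                continue  # deduplicate nodes reached by several paths
            if len(selected) >= max_nodes:
                break  # node budget exhausted
            selected.add(neighbor)
            queue.append((neighbor, depth + 1))

    # Phase 2: densify by keeping all edges whose endpoints were both selected.
    kept = [(a, b) for a, b in edges if a in selected and b in selected]
    return selected, kept

# Hypothetical sample data: hackers, projects, and one hackathon.
edges = [
    ("alice", "proj-a"), ("bob", "proj-a"), ("proj-a", "hack-1"),
    ("bob", "proj-b"), ("proj-b", "hack-1"),
    ("carol", "proj-b"), ("carol", "proj-c"),
]
nodes, kept = expand_subgraph(edges, "alice", max_depth=3)
print(sorted(nodes))
print(kept)
```

In this sample, the traversal reaches proj-b through bob, but the second phase also returns the proj-b to hack-1 edge, which is what keeps the rendered subgraph coherent rather than tree-shaped.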

Challenges we ran into

  • Schema design for a heterogeneous graph:
    • Balancing expressiveness (rich relationships) with query performance and visual clarity was nontrivial.
  • Vector fusion:
    • Combining transformer embeddings with categorical/ontology-driven vectors required careful normalization to avoid any single subspace dominating cosine similarity.
  • Subgraph expansion at scale:
    • Unbounded variable-length graph queries balloon quickly. We had to implement guards, deduplication, and a two-phase approach (select nodes, then fetch all edges among the selected nodes).
  • Robust scraping/normalization:
    • Profiles and project pages vary widely in structure. We built conservative parsers to extract usernames, social URLs, and participation data.
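
One way to handle the vector fusion problem above is to L2-normalize each subspace before concatenating, then scale each block by the square root of a weight. The sketch below assumes that scheme; it is not necessarily what DevpostVectorizer does, and the block names, weights, and toy values are hypothetical. When every block is non-zero, the cosine of two fused vectors becomes the weighted average of the per-block cosines, so no block dominates just because its raw magnitudes are larger.

```python
import math

def l2_normalize(vec):
    """Scale vec to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def fuse(blocks, weights):
    """Concatenate unit-normalized blocks, each scaled by sqrt(weight)."""
    fused = []
    for name in sorted(blocks):  # fixed order so every fused vector lines up
        scale = math.sqrt(weights.get(name, 1.0))
        fused.extend(scale * x for x in l2_normalize(blocks[name]))
    return fused

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Two hypothetical projects: unrelated semantics, identical tech stack.
# The semantic block has much larger raw magnitudes than the tech block.
project_a = {"semantic": [10.0, 0.0], "tech": [1.0, 0.0]}
project_b = {"semantic": [0.0, 20.0], "tech": [1.0, 0.0]}
weights = {"semantic": 1.0, "tech": 1.0}

naive = cosine(project_a["semantic"] + project_a["tech"],
               project_b["semantic"] + project_b["tech"])
fused = cosine(fuse(project_a, weights), fuse(project_b, weights))
print(f"naive concat: {naive:.3f}, normalized fusion: {fused:.3f}")
```

In the naive concatenation the shared tech stack barely registers because the semantic block swamps it; after per-block normalization each subspace contributes in proportion to its weight.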

Accomplishments that we’re proud of

  • A genuinely useful, highly interactive graph experience with real-time search and crisp labels.
  • A practical hybrid NLP pipeline that feels “intelligent” for ideation without relying solely on one model.
  • Clear, production-friendly separation of concerns (frontend ↔ backend ↔ database) suitable for cloud deployment.

What we learned

  • Graph UX is as much about hiding edges as showing them; clarity beats completeness.
  • Hybrid vector spaces outperform a single embedding on sparse, structured domains like hackathon data.
  • Pairing serverless API routes with a dedicated Python inference service is a sweet spot for latency and iteration speed.

What’s next

  • Temporal analysis: visualize career trajectories and project lineages over time.
  • Community detection: identify clusters, bridges, and emerging idea neighborhoods.
  • Advanced retrieval: re-ranking with cross-encoders, few-shot concept search, and semantic filters.
  • Contributor reputation and influence scores powered by graph centrality and outcome signals.

Tech stack

  • Frontend: Next.js, TypeScript, D3, Framer Motion
  • Backend: FastAPI (Python), sentence-transformers, NumPy, scikit-learn
  • Database: Neo4j
  • Infra: Railway (monorepo, multi-service), environment-based config
  • Integrations: Devpost deep links; social URL parsing

Try it

  • Explore the Graph to discover relationships and projects.
  • Use Brainstorm to validate and refine your next hackathon idea with similarity search and quick diagnostics.
