graph.md

Graph Engine

NodeDB's graph engine uses a native CSR (Compressed Sparse Row) adjacency index with interned node IDs (u32) and labels (u32) — not recursive JOINs pretending to be a graph. At 1 billion edges, CSR uses ~10 GB vs ~60 GB for naive adjacency lists (6x improvement). Sub-millisecond multi-hop traversals, 13 native algorithms, Cypher-subset pattern matching, and GraphRAG fusion — all in the same process as every other engine.

Storage Model

Edges are persisted in a redb B-Tree with forward and reverse indexes, both keyed by a (tenant_id, composite) tuple:

Forward Index (EDGES table):
  Key:   (tenant_id: u32, "src\x00edge_label\x00dst")
  Value: edge properties (MessagePack)

Reverse Index (REVERSE_EDGES table):
  Key:   (tenant_id: u32, "dst\x00edge_label\x00src")
  Value: [] (existence check only)

Tenant isolation is structural: the tenant id is a first-class key component, not a lexical prefix on node names. Node names in the composite portion are user-visible strings. Both tables update atomically in a single write transaction, and the null-byte separator enables prefix scans for outbound traversal within a tenant.

At query time the in-memory CSR is partitioned by tenant (ShardedCsrIndex = one CsrIndex per tenant); algorithms and traversals run against a single tenant's partition:

CsrIndex (per tenant):
  out_offsets: Vec<u32>   [num_nodes + 1]    — offset into target array per node
  out_targets: Vec<u32>   [num_edges]        — destination node IDs (contiguous)
  out_labels:  Vec<u32>   [num_edges]        — edge labels (parallel array)
  out_weights: Vec<f64>   [num_edges]        — optional, allocated only when weighted

  in_offsets / in_targets / in_labels / in_weights — symmetric for inbound

Each CsrIndex gets a unique partition tag at construction. Public APIs that return dense node indices hand out LocalNodeId { id, partition_tag } — using a node id from one partition with another partition's API panics at the boundary.

Writes go to a mutable buffer and become visible immediately. Compaction merges the buffer into the dense CSR arrays via double-buffered swap when the buffer exceeds 10% of the dense size.

Edge Operations

Insert Edge

GRAPH INSERT EDGE FROM 'users:alice' TO 'users:bob' TYPE 'KNOWS';

-- With properties (JSON string form):
GRAPH INSERT EDGE FROM 'users:alice' TO 'users:bob' TYPE 'KNOWS'
  PROPERTIES '{"since": 2020, "weight": 0.9}';
-- Object literal form (equivalent):
GRAPH INSERT EDGE FROM 'users:alice' TO 'users:bob' TYPE 'KNOWS'
  PROPERTIES { since: 2020, weight: 0.9 };

Delete Edge

GRAPH DELETE EDGE FROM 'users:alice' TO 'users:bob' TYPE 'KNOWS';

Removes both forward and reverse index entries atomically. Cascading delete is available for all edges touching a node.

Traversal Queries

Traverse (Multi-Hop BFS)

GRAPH TRAVERSE FROM 'users:alice' DEPTH 3;
GRAPH TRAVERSE FROM 'users:alice' DEPTH 2 LABEL 'follows' DIRECTION out;

Breadth-first search from a start node. Returns discovered nodes at each depth level.

Parameter	Default	Description
`DEPTH`	2	Maximum hop count
`LABEL`	(any)	Filter by edge label
`DIRECTION`	out	`in`, `out`, or `both`

Neighbors (1-Hop)

GRAPH NEIGHBORS OF 'users:bob';
GRAPH NEIGHBORS OF 'users:bob' LABEL 'follows' DIRECTION both;

Returns immediate neighbors only — no hop expansion.

Shortest Path

GRAPH PATH FROM 'users:alice' TO 'users:charlie';
GRAPH PATH FROM 'users:alice' TO 'users:charlie' MAX_DEPTH 5 LABEL 'knows';

Cross-core BFS path finding. Returns an ordered list of node IDs, or empty array if no path exists within MAX_DEPTH (default 10).

MATCH Pattern Queries (Cypher Subset)

NodeDB embeds a Cypher-subset pattern engine. MATCH queries arrive through any protocol — pgwire, HTTP, or native MessagePack — and are parsed, optimized, and dispatched to the Data Plane like any other query.

Syntax

Node bindings:

(a)            -- named variable, any label
(a:Person)     -- named variable, label filter
(:Person)      -- anonymous node, label filter
()             -- anonymous, any label

Edge bindings with direction:

-[:KNOWS]->        -- outbound
<-[:KNOWS]-        -- inbound
-[:KNOWS]-         -- undirected (both directions)
-[r:KNOWS]->       -- named edge binding

Variable-length paths:

-[:KNOWS*1..3]->   -- 1 to 3 hops (BFS expansion)
-[:KNOWS*2..2]->   -- exactly 2 hops
-[:FOLLOWS*]->     -- unlimited hops

Clauses:

Clause	Description
`MATCH`	Required. Pattern to match against the graph.
`OPTIONAL MATCH`	LEFT JOIN semantics — preserves rows with no match.
`WHERE`	Filter on bound variables. Supports `=`, `!=`, `<`, `<=`, `>`, `>=`.
`WHERE NOT EXISTS { MATCH ... }`	Anti-join — exclude rows matching a sub-pattern.
`RETURN`	Project bindings. Supports aliases (`AS`). Default: all bound variables.
`ORDER BY`	Sort results. `ASC` or `DESC`.
`LIMIT`	Cap result count.

Multiple comma-separated patterns in one MATCH clause act as self-joins.

Examples

Friend-of-friend:

MATCH (a:Person)-[:knows]->(b:Person)-[:knows]->(c:Person)
WHERE a.name = 'Alice'
RETURN b.name, c.name;

Variable-length recommendations:

MATCH (u:User)-[:follows*2..3]->(recommended:User)
WHERE u.id = 'you'
RETURN DISTINCT recommended.id
LIMIT 10;

Multi-pattern query (self-join):

MATCH (a:Person)-[:works_at]->(company:Company),
      (a)-[:located_in]->(city:City),
      (company)-[:headquartered_in]->(hq_city:City)
WHERE city.id = hq_city.id
RETURN a.name, company.name;

Anti-join (negative pattern):

MATCH (a:User)-[:follows]->(b:User)
WHERE NOT EXISTS { MATCH (b)-[:blocked_by]->(a) }
RETURN a.id, b.id;

OPTIONAL MATCH:

MATCH (a:Person)-[:knows]->(b:Person)
OPTIONAL MATCH (b)-[:works_at]->(c:Company)
RETURN a.name, b.name, c.name;

Fraud detection pattern:

MATCH (a:Account)-[:transfers_to]->(mid:Account)-[:transfers_to]->(b:Account)
WHERE a.risk_score > 0.7 AND b.risk_score > 0.7
RETURN a.id, b.id, count(*) AS paths
ORDER BY paths DESC
LIMIT 20;

Execution

Parse — Tokenizes Cypher syntax into AST (MatchQuery)
Optimize — Reorders triples by selectivity (bind tightly-connected patterns first)
Serialize — MessagePack the AST for SPSC transport
Execute — Data Plane resolves bindings against CSR, expands variable-length paths via BFS, applies WHERE predicates, projects RETURN columns
Return — Results come back as pgwire rows

Algorithms

All algorithms run directly on the CSR index. SIMD-accelerated where applicable (rank scatter, L1 norm delta, fill operations).

Running Algorithms

GRAPH ALGO PAGERANK ON social_graph DAMPING 0.85 ITERATIONS 20 TOLERANCE 1e-7;
GRAPH ALGO WCC ON knowledge_graph;
GRAPH ALGO SSSP ON routes FROM 'city:chicago';
GRAPH ALGO COMMUNITY ON products ITERATIONS 10 RESOLUTION 1.0;
GRAPH ALGO BETWEENNESS ON network SAMPLE 500;
GRAPH ALGO KCORE ON collaboration;
GRAPH ALGO TRIANGLES ON social MODE global;
GRAPH ALGO DIAMETER ON web;

Algorithm Reference

Algorithm	What it computes	Key Parameters
PageRank	Node importance via incoming link structure	`DAMPING` (0.85), `ITERATIONS` (20), `TOLERANCE` (1e-7)
WCC	Weakly connected components (union-find with path compression)	—
Label Propagation	Community detection via iterative label spreading	`ITERATIONS` (10)
LCC	Local clustering coefficient — how tightly neighbors connect	—
SSSP	Single-source shortest path (Dijkstra, rejects negative weights)	`FROM` (required)
Betweenness	Bridge nodes with high traffic (Brandes' algorithm)	`SAMPLE` (optional, for approximation)
Closeness	How close a node is to all others (inverse distance sum)	`SAMPLE` (optional)
Harmonic	Like closeness, but handles disconnected graphs	`SAMPLE` (optional)
Degree	Connection count per node	`DIRECTION` (in/out/both)
Louvain	Community detection via modularity optimization	`ITERATIONS` (10), `RESOLUTION` (1.0)
Triangles	Triangle count (per-node or global)	`MODE` (global/per_node)
Diameter	Longest shortest path in the graph	—
k-Core	Coreness decomposition (peeling algorithm)	—

GraphRAG (Vector + Graph Fusion)

Combines vector similarity search with graph traversal in a single query pipeline — zero network hops between engines.

How It Works

Query Vector
    |
    v
[1] Vector Search (HNSW) --- top-K semantically similar nodes (seed set)
    |
    v
[2] Graph Expansion (BFS) -- expand seeds along edges to depth N
    |
    v
[3] RRF Fusion ------------- merge vector rank + graph hop distance
    |
    v
Final top-N results with (node_id, rrf_score, vector_rank, vector_distance, hop_distance)

SQL Interface

GRAPH RAG FUSION ON entities
  QUERY $embedding
  VECTOR_TOP_K 50
  EXPANSION_DEPTH 2
  EDGE_LABEL 'related_to'
  DIRECTION both
  FINAL_TOP_K 10
  RRF_K (60.0, 35.0)
  MAX_VISITED 1000;

Parameter	Description
`QUERY`	Query embedding vector
`VECTOR_TOP_K`	Number of seed nodes from vector search
`EXPANSION_DEPTH`	BFS hop count from seeds
`EDGE_LABEL`	Optional edge type filter during expansion
`DIRECTION`	`in`, `out`, or `both` for BFS
`FINAL_TOP_K`	Final result count after fusion
`RRF_K`	Weighting constants `(vector_k, graph_k)` for RRF scoring
`MAX_VISITED`	Memory budget cap — BFS stops early if exceeded

The response includes a truncation flag if the memory budget forced early termination.

Step-by-Step Pattern (Application-Level)

-- 1. Seed retrieval
SELECT id, content FROM chunks
WHERE embedding <-> $query_embedding
LIMIT 10;

-- 2. Graph expansion from seeds
MATCH (c:Chunk)-[:mentions]->(e:Entity)-[:related_to*1..3]->(related:Entity)
WHERE c.id IN ('chunk-042', 'chunk-017')
RETURN DISTINCT related.id, related.description;

-- 3. Assemble expanded context for LLM

Distributed Execution (BSP)

In multi-shard deployments, graph algorithms and pattern matching execute via Bulk Synchronous Parallel (BSP) coordination.

Distributed PageRank

Each shard maintains a local rank vector, out-degree array, and dangling node mask. Cross-shard edges are tracked as "ghost edges." Each superstep:

Scatter — distribute rank to outbound edges
Boundary exchange — package contributions for ghost edges, send to target shards
Gather — add incoming contributions from other shards
Convergence check — L1 norm delta across all shards
Repeat until convergence or max iterations

Distributed WCC

Union-find with cross-shard label propagation:

Each shard computes local WCC
Boundary edges send component labels to target shards
Target shards adopt the lexicographically smaller label
Repeat until no labels change (guaranteed convergence)

Distributed Pattern Matching

Scatter-gather with continuations:

Coordinator broadcasts MATCH query to all shards
Each shard executes the pattern on its local CSR
Ghost edges produce PatternContinuation messages (partial bindings + remaining pattern + start node)
Target shards resume from continuations
Coordinator collects completed rows and new continuations
Repeat until no pending continuations or max rounds reached

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Graph Engine

Storage Model

Edge Operations

Insert Edge

Delete Edge

Traversal Queries

Traverse (Multi-Hop BFS)

Neighbors (1-Hop)

Shortest Path

MATCH Pattern Queries (Cypher Subset)

Syntax

Examples

Execution

Algorithms

Running Algorithms

Algorithm Reference

GraphRAG (Vector + Graph Fusion)

How It Works

SQL Interface

Step-by-Step Pattern (Application-Level)

Distributed Execution (BSP)

Distributed PageRank

Distributed WCC

Distributed Pattern Matching

Related

FilesExpand file tree

graph.md

Latest commit

History

graph.md

File metadata and controls

Graph Engine

Storage Model

Edge Operations

Insert Edge

Delete Edge

Traversal Queries

Traverse (Multi-Hop BFS)

Neighbors (1-Hop)

Shortest Path

MATCH Pattern Queries (Cypher Subset)

Syntax

Examples

Execution

Algorithms

Running Algorithms

Algorithm Reference

GraphRAG (Vector + Graph Fusion)

How It Works

SQL Interface

Step-by-Step Pattern (Application-Level)

Distributed Execution (BSP)

Distributed PageRank

Distributed WCC

Distributed Pattern Matching

Related