LevelInteractive/knowledge-graph-prototype-3

KG-3: llm-graph-builder Evaluation

Evaluation of Neo4j Labs' llm-graph-builder web application for building knowledge graphs from meeting transcripts. The full web app proved unreliable for our use case (its extraction API fails on .txt files), so we pivoted to using its core extraction logic (LLMGraphTransformer) directly.

What's Here

app/                         # Cloned llm-graph-builder repo (neo4j-labs/llm-graph-builder)
  backend/                   # FastAPI backend (Python)
  frontend/                  # React/Vite frontend (TypeScript)
  docker-compose.yml         # Original Docker setup
extract_direct.py            # Direct extraction script using LLMGraphTransformer + Gemini
prepare_uploads.py           # Script to prepare meeting transcript files for upload
upload-files/                # (gitignored) 50 prepared meeting transcript files
EVAL-NOTES.md                # Detailed evaluation notes

Graph Stats (~15 meetings)

  • 1,539 entity nodes + 51 document nodes
  • 3,956 relationships across 23 types
  • Rich entity variety: 506 Topics, 162 ActionItems, 142 Persons, 83 Projects, 80 Metrics, 50 Meetings, 44 Decisions, 40 Clients, 36 Campaigns
  • Used Gemini 2.5 Flash via langchain-google-genai

Prerequisites

  • Python 3.11+
  • Neo4j instance running (tested with Neo4j 5.x)
  • Data export at /workspace/kg_export/ (or update paths in scripts)
  • Gemini API key (for extraction via langchain-google-genai)
  • Node.js 18+ (for frontend, optional)

Setup from Clone

Option A: Direct Extraction (recommended)

This uses the same LLMGraphTransformer that powers the web app, but without the heavy web app dependencies.

cd /workspace/kg-3

# 1. Create lightweight virtual environment
python3 -m venv venv-light
source venv-light/bin/activate

# 2. Install lightweight dependencies
pip install langchain-experimental langchain-google-genai langchain-neo4j neo4j python-dotenv

# 3. Create .env
cat > .env << 'EOF'
NEO4J_URI=bolt://localhost:7690
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=your-password
GEMINI_API_KEY=your-gemini-key
OPENAI_API_KEY=your-openai-key  # optional fallback
EOF

# 4. Prepare meeting transcript files
python3 prepare_uploads.py
# Creates 50 files in upload-files/

# 5. Run extraction
python3 extract_direct.py
# Processes files and loads entities/relationships into Neo4j
# ~200 seconds per file with Gemini 2.5 Flash
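At ~200 seconds per file, a full run over all 50 prepared transcripts is a multi-hour job. A quick back-of-envelope estimate:

```python
# Back-of-envelope runtime for a full run (figures from this evaluation).
files = 50
secs_per_file = 200  # observed per-file time with Gemini 2.5 Flash
hours = files * secs_per_file / 3600
print(f"~{hours:.1f} hours for {files} files")  # → ~2.8 hours for 50 files
```

The ~15-meeting graph described above was produced by a partial run, not the full 50 files.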

Option B: Full Web App

cd /workspace/kg-3/app

# Backend
cd backend
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt -c constraints.txt
# Create backend/.env (see EVAL-NOTES.md for format)
uvicorn score:app --reload --port 8010
# NOTE: Takes ~5 min to start, 700MB+ RAM

# Frontend (separate terminal)
cd ../frontend
npm install
# Create frontend/.env with VITE_BACKEND_API_URL=http://localhost:8010
npm run dev -- --port 3010
# Access at http://localhost:3010

# NOTE: Backend extraction API fails for .txt files due to chunking bug.
# The web UI works for file upload but extraction triggers the same error.
# Use Option A (direct extraction) for reliable results.

How It Was Tested

  1. Web app setup: Cloned repo, installed backend (249 packages) and frontend (565 packages). Backend starts but takes 5 min.
  2. Backend API: File upload via POST /upload works. Extraction via POST /extract fails with '<=' not supported between instances of 'NoneType' and 'int' (chunking bug for .txt files).
  3. Direct extraction: extract_direct.py successfully processed ~15 meeting files using Gemini 2.5 Flash via langchain-google-genai. Entities and relationships correctly extracted and loaded into Neo4j.
  4. Neo4j verification: Verified node/relationship counts and entity quality via Cypher queries.
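The extraction failure in step 2 is the generic TypeError Python raises when an unset value reaches a numeric comparison. A minimal reproduction of the error class (the variable name is a hypothetical stand-in, not the backend's actual field):

```python
# Minimal reproduction of the error class behind the .txt chunking bug:
# somewhere in the backend a chunk size/offset is left as None for .txt
# files, and a later comparison against an int raises this TypeError.
chunk_size = None  # hypothetical stand-in for the unpopulated value
try:
    chunk_size <= 1000
except TypeError as exc:
    msg = str(exc)
print(msg)  # '<=' not supported between instances of 'NoneType' and 'int'
```

This suggests the bug is a missing default (or skipped initialization) on the .txt code path rather than a model or Neo4j problem, which is why Option A sidesteps it entirely.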

How to Test

# After extraction, verify graph:
python3 -c "
from neo4j import GraphDatabase
d = GraphDatabase.driver('bolt://localhost:7690', auth=('neo4j','your-password'))
with d.session() as s:
    r = s.run('MATCH (n) RETURN labels(n)[0] as label, count(*) as cnt ORDER BY cnt DESC')
    for rec in r: print(f'{rec[\"label\"]}: {rec[\"cnt\"]}')
    print()
    r = s.run('MATCH ()-[r]->() RETURN type(r) as type, count(*) as cnt ORDER BY cnt DESC LIMIT 10')
    for rec in r: print(f'{rec[\"type\"]}: {rec[\"cnt\"]}')
d.close()
"

# Sample entity query:
python3 -c "
from neo4j import GraphDatabase
d = GraphDatabase.driver('bolt://localhost:7690', auth=('neo4j','your-password'))
with d.session() as s:
    r = s.run('MATCH (p:Person) RETURN p.id LIMIT 10')
    print('Sample persons:')
    for rec in r: print(f'  {rec[0]}')
d.close()
"

Gitignored Files (need recreation)

  • venv-light/ - Lightweight Python venv (python3 -m venv venv-light && pip install langchain-experimental langchain-google-genai langchain-neo4j neo4j python-dotenv)
  • app/backend/venv/ - Full backend venv (pip install -r app/backend/requirements.txt -c app/backend/constraints.txt)
  • app/frontend/node_modules/ - Frontend deps (cd app/frontend && npm install)
  • upload-files/ - Generated by prepare_uploads.py (run it to recreate)
  • .env - Create manually with Neo4j and API credentials

Key Findings

  • The full web app is over-engineered for programmatic bulk ingestion
  • The core value is LLMGraphTransformer from langchain-experimental, which can be used as a library
  • Gemini requires langchain-google-genai (API key), NOT the built-in VertexAI integration (needs GCP service account)
  • Extraction is slow (~200s per file vs 23s for Setup B's SimpleKGPipeline)
  • Dependencies are heavy (249 packages, 2-3GB for full app)
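To illustrate the library-use finding, here is a hypothetical sketch of the approach extract_direct.py takes, using LLMGraphTransformer directly. The model name, environment variable names, and allowed_nodes list are assumptions drawn from this README, not the script's exact settings:

```python
# Hypothetical sketch of using LLMGraphTransformer as a library
# (the model, env vars, and node labels below are assumptions).

def build_graph(transcript_dir: str = "upload-files") -> None:
    import os

    from langchain_core.documents import Document
    from langchain_experimental.graph_transformers import LLMGraphTransformer
    from langchain_google_genai import ChatGoogleGenerativeAI
    from langchain_neo4j import Neo4jGraph

    llm = ChatGoogleGenerativeAI(
        model="gemini-2.5-flash",
        google_api_key=os.environ["GEMINI_API_KEY"],
    )
    transformer = LLMGraphTransformer(
        llm=llm,
        # Constrain extraction to the entity types seen in this evaluation
        allowed_nodes=["Person", "Project", "Topic", "ActionItem", "Decision"],
    )
    graph = Neo4jGraph(
        url=os.environ["NEO4J_URI"],
        username=os.environ["NEO4J_USERNAME"],
        password=os.environ["NEO4J_PASSWORD"],
    )
    for name in sorted(os.listdir(transcript_dir)):
        with open(os.path.join(transcript_dir, name)) as f:
            doc = Document(page_content=f.read(), metadata={"source": name})
        # One LLM round-trip per transcript (~200 s with Gemini 2.5 Flash)
        graph_docs = transformer.convert_to_graph_documents([doc])
        graph.add_graph_documents(graph_docs, include_source=True)
```

Run inside the venv from Option A with the .env variables exported (the sketch reads os.environ directly rather than calling load_dotenv). The real extract_direct.py may differ in chunking, schema, and error handling.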
