A Retrieval-Augmented Generation (RAG) pipeline built entirely from first principles — no LangChain, no LlamaIndex, no abstraction frameworks. Every component is implemented directly so the design decisions are visible and the tradeoffs are explicit.
The project uses Chapter 1 of Deep Learning by Goodfellow et al. as its test document, but the pipeline works on any PDF.
Standard RAG tutorials jump straight to using frameworks. This project builds each layer by hand, surfaces the bugs that frameworks hide, and shows exactly why each design decision exists.
Academic PDFs often use custom ligature-encoded fonts for headings. Standard text extraction reads "Introduction" as "I\ue001\ue002r\ue004\ue005...": private-use Unicode codepoints that no regex can match. Every heading detector fails silently. The result: 105 chunks, all titled "Untitled", no section structure, and graph edges connecting everything to everything.
The fix: font size is stored as a number in the PDF spec, independent of encoding. A garbled heading is still the largest text on its line. Detect headings by size, infer titles from the body text that follows.
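The two steps can be sketched as pure functions over extracted text blocks. This is a minimal sketch: the block dicts mirror the `size`/`text` fields that PyMuPDF reports per span, but the function names and thresholds here are illustrative, not the notebook's actual code.

```python
def classify_blocks(blocks, body_size=10.0):
    """Label each block as heading/body/caption using only its font size."""
    labeled = []
    for b in blocks:
        if b["size"] >= body_size * 1.3:      # clearly larger than body text
            kind = "heading"
        elif b["size"] <= body_size * 0.8:    # clearly smaller: caption/footnote
            kind = "caption"
        else:
            kind = "body"
        labeled.append({**b, "kind": kind})
    return labeled

def infer_titles(labeled):
    """If a heading is ligature-garbled (private-use codepoints), fall back to
    the first words of the body text that follows it."""
    def garbled(text):
        return any(0xE000 <= ord(c) <= 0xF8FF for c in text)
    for i, b in enumerate(labeled):
        if b["kind"] == "heading" and garbled(b["text"]):
            nxt = next((x for x in labeled[i+1:] if x["kind"] == "body"), None)
            if nxt:
                b["text"] = " ".join(nxt["text"].split()[:6]) + " ..."
    return labeled

blocks = [
    {"text": "I\ue001\ue002r\ue004", "size": 18.0},   # garbled heading
    {"text": "The field of artificial intelligence is thriving.", "size": 10.0},
]
out = infer_titles(classify_blocks(blocks))
print(out[0]["kind"], "|", out[0]["text"])
```

The key property: classification never looks at the characters themselves, so garbled glyphs cannot break it.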
```
PDF
 │
 ├─ 2a. Naive text extraction (pypdf)            ← shows the problem
 ├─ 2b. Block extraction with font sizes (fitz)  ← intermediate step
 └─ 2c. Layout-aware extraction (pdfplumber)     ← final version
        │  font size → heading/body/caption/skip
        │  garbled heading → infer title from body
        │  back-to-back headings → merge titles
        ↓
Sections [{title, content, page}, ...]
 │
 ├─ 3a. Word sliding window (naive)      ← shows the problem
 └─ 3b/c. Sentence-boundary chunking     ← final version
        │  60-word limit
        │  1-sentence overlap
        │  abbreviation + citation protection
        ↓
Chunks [{chunkId, title, content, page, length}, ...]
 │
 ↓ sentence-transformers all-MiniLM-L6-v2
Embeddings [n_chunks × 384]
 │
 ┌──────┴───────────────────────────────────────┐
 │                                              │
Plain cosine search                     Knowledge graph
+ cross-encoder rerank                          │
 │                          ┌───────────────────┴────┐
 │                          │                        │
 │                     Plain BFS              Smart traversal
 │                     (undirected,           (score-guided,
 │                      2 hops)                hop penalty)
 │                          │                        │
 │                          └────────────────────────┤
 │                                                   │
 │                                      Query rewrite + graph
 │                                      (domain expansion dict
 │                                       → different seeds
 │                                       → different region)
 │                                                   │
 └─────────────────────────┬─────────────────────────┘
                           ↓
                     Top-K results
```
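The sentence-boundary chunking step (3b/c above) can be sketched as follows. The regex, abbreviation list, and function names are illustrative stand-ins, not the notebook's actual code:

```python
import re

ABBREV = ("e.g.", "i.e.", "et al.", "Fig.", "Eq.")   # protected abbreviations

def split_sentences(text):
    """Split on sentence-ending punctuation, shielding common abbreviations
    so their periods do not create false sentence boundaries."""
    protected = text
    for a in ABBREV:
        protected = protected.replace(a, a.replace(".", "<DOT>"))
    parts = re.split(r"(?<=[.!?])\s+", protected)
    return [p.replace("<DOT>", ".") for p in parts if p.strip()]

def chunk_sentences(sentences, max_words=60, overlap=1):
    """Pack whole sentences into chunks of at most max_words, carrying
    `overlap` trailing sentences into the next chunk for continuity."""
    chunks, cur = [], []
    for s in sentences:
        if cur and sum(len(x.split()) for x in cur) + len(s.split()) > max_words:
            chunks.append(" ".join(cur))
            cur = cur[-overlap:]          # 1-sentence overlap
        cur.append(s)
    if cur:
        chunks.append(" ".join(cur))
    return chunks

sents = split_sentences(
    "Deep learning, e.g. CNNs, works well. It scales. "
    "Training needs data. Results improve with depth."
)
chunks = chunk_sentences(sents, max_words=8)
print(chunks)
```

Because chunks always end on sentence boundaries, no chunk starts mid-clause, and the one-sentence overlap preserves anaphora ("It", "This approach") across chunk borders.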
| Strategy | How it works | Candidate pool | Unique finds |
|---|---|---|---|
| Plain vector search | Cosine similarity + cross-encoder rerank | Top 10 by similarity | Baseline |
| Plain graph BFS | BFS 2 hops from top-10 seeds | ~18 chunks | Adjacent chunks plain search missed |
| Smart graph traversal | Score-guided hop expansion, prunes low-relevance paths | ~22 chunks | Chunks only reachable via high-similarity paths |
| Query rewrite + graph | Enriches query with domain terms → different seeds → different traversal region | ~29 chunks | Chunks in semantically adjacent regions of the graph |
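The query-rewrite strategy in the last row can be sketched with a hand-built expansion dictionary: appending domain synonyms moves the query embedding into a different neighbourhood, which changes the seed chunks and therefore the traversal region. The dictionary contents below are illustrative, not the project's actual expansion table.

```python
# Hypothetical domain-expansion dictionary (illustrative entries).
EXPANSIONS = {
    "representation": ["feature", "encoding"],
    "learning": ["training", "optimization"],
}

def rewrite_query(query):
    """Append deduplicated domain synonyms for any query word found
    in the expansion dictionary."""
    extra = []
    for word in query.lower().split():
        extra.extend(EXPANSIONS.get(word.strip("?.,"), []))
    if not extra:
        return query
    return query + " " + " ".join(dict.fromkeys(extra))   # preserve order, dedupe

q = rewrite_query("What is representation learning?")
print(q)
```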
Key insight: the benefit of graph RAG shows up in recall (candidate coverage), not in top-1 rank. All four strategies may return the same top-3 results, and that is correct behaviour. The real question is whether the complete answer to a multi-hop question is covered by the candidate pool passed to an LLM.
This project is a record of real engineering debugging. Every bug listed here caused the system to produce wrong or identical output.
```python
# Wrong: writes similarity into the stored chunk dict
val = index["chunks"][idx]
val["similarity"] = float(sim[idx])    # ← mutates the original

# Fixed: copy first, then write
val = dict(index["chunks"][idx])       # ← shallow copy
val["similarity"] = float(sim[idx])    # ← writes to the copy only
```

Effect: every model returned identical similarity scores down to 6 decimal places, matching the first query ever run. Stale values persisted across all subsequent queries.
```python
# Wrong: constant penalty regardless of depth
hopPenalty = 0.05 * hops   # hops=2, so always 0.10

# Fixed: penalty increases with depth
hopPenalty = 0.05 * hop    # hop=0 at the first level, hop=1 at the second
```

Effect: smart graph and plain BFS had identical traversal behaviour because the penalty never varied.
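With the corrected penalty, score-guided traversal behaves as a best-first expansion: deeper hops pay a larger penalty and weak paths are pruned. A minimal sketch, with an illustrative toy graph and thresholds rather than the notebook's actual structures:

```python
import heapq

def smart_traverse(graph, seeds, sims, max_hops=2, penalty=0.05, min_score=0.2):
    """Best-first expansion: a node's score is its query similarity minus
    a penalty that grows with hop depth; low-scoring paths are pruned."""
    best = {}
    frontier = [(-sims[s], s, 0) for s in seeds]   # (neg score, node, hop)
    heapq.heapify(frontier)
    while frontier:
        neg, node, hop = heapq.heappop(frontier)
        score = -neg
        if best.get(node, -1.0) >= score:          # already reached via a better path
            continue
        best[node] = score
        if hop >= max_hops:
            continue
        for nb in graph.get(node, []):
            nb_score = sims[nb] - penalty * (hop + 1)   # deeper hop, bigger penalty
            if nb_score >= min_score:                   # prune weak paths
                heapq.heappush(frontier, (-nb_score, nb, hop + 1))
    return best

graph = {0: [1, 2], 1: [3], 2: [4], 3: [], 4: []}
sims = {0: 0.9, 1: 0.6, 2: 0.15, 3: 0.5, 4: 0.8}
found = smart_traverse(graph, seeds=[0], sims=sims)
print(sorted(found))
```

Note how node 2 falls below the pruning threshold, so node 4 is never reached even though its own similarity is high: traversal only follows high-similarity paths, which is exactly how it differs from plain BFS.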
The graph edge condition `overlap > 8` fired on domain words ("intelligence", "arti", "cial") that appear in every chunk. In a deep-learning document, nearly every chunk pair exceeded 8 shared words, creating a near-fully-connected graph where BFS expansion from any seed reached the entire document in 2 hops.
Fix: extended stopwords to include domain-specific terms and words shorter than 5 characters (catches OCR artifacts like "arti", "cial", "di" from ligature splits).
PDF text extraction produces ligature-split artifacts: "artificial" → "arti cial", "difficult" → "di cult", "field" → " eld". These split tokens appeared in the keyword overlap computation, creating false graph edges between unrelated chunks.
Fix: keep only tokens with `len(w) > 4` in the overlap computation, and add the most common OCR artifacts to the stopword list.
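Both fixes meet in the edge-creation step. A minimal sketch, with illustrative stopword and threshold choices rather than the project's actual lists:

```python
# Extended stopword list: domain-frequent words plus ligature-split artifacts.
STOPWORDS = {"intelligence", "learning", "network",   # appear in every chunk
             "arti", "cial", "cult"}                  # OCR/ligature debris

def keywords(text):
    """Tokens eligible for overlap counting: longer than 4 chars, not stopworded."""
    return {w for w in text.lower().split()
            if len(w) > 4 and w not in STOPWORDS}

def has_edge(a, b, threshold=8):
    """Connect two chunks only if they share more than `threshold` keywords."""
    return len(keywords(a) & keywords(b)) > threshold

text = "deep learning uses artificial intelligence and arti cial neural network layers"
ka = keywords(text)
print(ka)
```

With the filters in place, even a chunk compared against itself here shares only 3 eligible keywords, so domain vocabulary alone can no longer create an edge.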
A cross-encoder re-scoring 15 candidates will almost always agree with what plain embedding retrieval found as top-3. Graph expansion finds better candidates but the reranker drops them because it scores each chunk in isolation against the query — and the original top-3 are individually more relevant than any single newly-discovered chunk.
For multi-hop questions, the right evaluation metric is candidate coverage, not top-1 rank.
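Candidate coverage is just recall over the pool, ignoring rank. A toy sketch (chunk IDs and pools are made up for illustration):

```python
def coverage(candidates, gold):
    """Fraction of gold answer chunks present anywhere in the candidate pool:
    recall over candidates, ignoring rank entirely."""
    return len(set(candidates) & set(gold)) / len(gold)

gold = {"c12", "c47"}                           # chunks needed for a 2-hop answer
plain_pool = ["c12", "c3", "c9"]                # plain vector top-k
graph_pool = ["c12", "c3", "c9", "c47", "c50"]  # graph-expanded pool
print(coverage(plain_pool, gold), coverage(graph_pool, gold))
```

Both pools rank `c12` first, so top-1 metrics see no difference; only coverage reveals that the graph pool contains the second half of the answer.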
```
pip install pdfplumber pymupdf sentence-transformers scikit-learn numpy python-dotenv
```

For the OCR fallback, install Tesseract (https://github.com/tesseract-ocr/tesseract), then:

```
pip install pytesseract pillow
```

If you are on Windows, create a `.env` file:

```
TESPATH=C:\Program Files\Tesseract-OCR\tesseract.exe
```
```
project/
├── graph_rag_from_scratch.ipynb
├── .env
├── requirements.txt
└── data/
    └── dataRAG.pdf   ← your PDF goes here
```
- Place your PDF at `./data/dataRAG.pdf`
- Open the notebook and run all cells top-to-bottom
- Go to Section 5 and change the `QUERY` variable to ask your own question
To use on Colab:
- Upload your PDF when prompted, or mount Google Drive
- Change the PDF path in Section 2 to match your file location
- The first cell installs all dependencies automatically
```
structuredRag.ipynb   Main notebook (all code + explanations)
README.md             This file
requirements.txt      Python dependencies
data/                 Put your PDF here (not tracked by git)
.gitignore            Excludes data/, .env, __pycache__
```
- Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning. MIT Press. deeplearningbook.org
- Reimers, N., Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019.
- Nogueira, R., Cho, K. (2019). Passage Re-ranking with BERT. arXiv:1901.04085.
The natural next step after retrieval is feeding the retrieved chunks into an LLM with a well-structured prompt. The retrieval system here is the hard part; once you have a clean, section-aware candidate pool, the generation step is straightforward with any API.
Future additions planned:
- Conversational memory (multi-turn queries referencing previous answers)
- Entity-level knowledge graph (typed nodes: concepts, people, algorithms)
- Evaluation harness (precision@k, recall@k against a labelled test set)