Graph RAG from Scratch

A Retrieval-Augmented Generation (RAG) pipeline built entirely from first principles — no LangChain, no LlamaIndex, no abstraction frameworks. Every component is implemented directly so the design decisions are visible and the tradeoffs are explicit.

The project uses Chapter 1 of Deep Learning by Goodfellow et al. as its test document, but the pipeline works on any PDF.


What this project demonstrates

Standard RAG tutorials jump straight to using frameworks. This project builds each layer by hand, surfaces the bugs that frameworks hide, and shows exactly why each design decision exists.

The core problem this solves

Academic PDFs use custom ligature-encoded fonts for headings. Standard text extraction reads "Introduction" as "I\ue001\ue002r\ue004\ue005..." — private-use unicode that no regex can match. Every heading detector fails silently. The result: 105 chunks, all titled "Untitled", no section structure, graph edges connecting everything to everything.

The fix: font size is stored as a number in the PDF spec, independent of encoding. A garbled heading is still the largest text on its line. Detect headings by size, infer titles from the body text that follows.
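As a sketch of this idea (a hypothetical helper, not the notebook's exact code): heading detection needs nothing but per-line font sizes, because the body-text size is simply the most common size in the document.

```python
# Sketch of size-based heading detection. Each extracted line carries its
# font size; a line is a heading if its size clearly exceeds the body-text
# baseline, regardless of whether its characters are readable.
from collections import Counter

def detect_headings(lines, ratio=1.2):
    """lines: [{'text': str, 'size': float}, ...] from any PDF extractor."""
    # The most common font size is almost always the body text.
    body_size = Counter(round(l["size"], 1) for l in lines).most_common(1)[0][0]
    sections = []
    for line in lines:
        if line["size"] > body_size * ratio:
            sections.append({"title": line["text"], "content": []})
        elif sections:
            sections[-1]["content"].append(line["text"])
    return sections

lines = [
    {"text": "\ue001\ue002garbled", "size": 18.0},   # ligature-garbled heading
    {"text": "Machine learning lets computers learn.", "size": 10.0},
    {"text": "It underlies modern AI systems.", "size": 10.0},
]
sections = detect_headings(lines)
```

The garbled title can then be replaced by a readable one inferred from the first body sentence in the section.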


Architecture

```
PDF
 │
 ├─ 2a. Naive text extraction (pypdf)           ← shows the problem
 ├─ 2b. Block extraction with font sizes (fitz) ← intermediate step
 └─ 2c. Layout-aware extraction (pdfplumber)    ← final version
         │  font size → heading/body/caption/skip
         │  garbled heading → infer title from body
         │  back-to-back headings → merge titles
         ↓
     Sections  [{title, content, page}, ...]
         │
         ├─ 3a. Word sliding window (naive)     ← shows the problem
         └─ 3b/c. Sentence-boundary chunking    ← final version
                   │  60-word limit
                   │  1-sentence overlap
                   │  abbreviation + citation protection
                   ↓
               Chunks  [{chunkId, title, content, page, length}, ...]
                   │
                   ↓ sentence-transformers all-MiniLM-L6-v2
               Embeddings  [n_chunks × 384]
                   │
           ┌───────┴───────────────────────────────┐
           │                                       │
     Plain cosine search              Knowledge graph
     + cross-encoder rerank           │
           │                    ┌─────┴────────────────┐
           │                    │                      │
           │              Plain BFS          Smart traversal
           │              (undirected,       (score-guided,
           │               2 hops)           hop penalty)
           │                    │                      │
           │                    └──────────────────────┤
           │                                           │
           │                              Query rewrite + graph
           │                              (domain expansion dict
           │                               → different seeds
           │                               → different region)
           │                                           │
           └───────────────────────────────────────────┘
                               │
                          Top-K results
```
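The chunking step (3b/c) can be sketched as follows; this simplified version assumes a plain regex sentence splitter, while the notebook additionally protects abbreviations and citations from being treated as sentence boundaries.

```python
# Minimal sketch of sentence-boundary chunking with a word budget and
# one-sentence overlap between consecutive chunks.
import re

def chunk_sentences(text, max_words=60, overlap=1):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []
    for sent in sentences:
        words = sum(len(s.split()) for s in current) + len(sent.split())
        if current and words > max_words:
            chunks.append(" ".join(current))
            current = current[-overlap:]  # carry last sentence into next chunk
        current.append(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks

# Demo: eight 7-word sentences, 15-word budget.
text = " ".join(f"This is demo sentence number {i} here." for i in range(8))
chunks = chunk_sentences(text, max_words=15)
```

With this budget each chunk holds two sentences, and adjacent chunks share exactly one sentence.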

Retrieval strategies compared

| Strategy | How it works | Candidate pool | Unique finds |
|---|---|---|---|
| Plain vector search | Cosine similarity + cross-encoder rerank | Top 10 by similarity | Baseline |
| Plain graph BFS | BFS 2 hops from top-10 seeds | ~18 chunks | Adjacent chunks plain search missed |
| Smart graph traversal | Score-guided hop expansion, prunes low-relevance paths | ~22 chunks | Chunks only reachable via high-similarity paths |
| Query rewrite + graph | Enriches query with domain terms → different seeds → different traversal region | ~29 chunks | Chunks in semantically adjacent regions of the graph |

Key insight: the benefit of graph RAG shows up in recall (candidate coverage), not in top-1 rank. All four strategies may return the same top-3 results — that is expected. The question is whether the complete answer to a multi-hop question is covered by the candidate pool passed to the LLM.
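The query-rewrite step can be sketched like this; the expansion dictionary entries are illustrative, not the notebook's actual dictionary.

```python
# Sketch of query rewriting via a hand-built domain expansion dictionary:
# appending synonyms shifts the query embedding, so retrieval seeds land in
# a different region of the chunk graph.
EXPANSIONS = {
    "learn": ["training", "optimization"],
    "representation": ["features", "embedding"],
}

def rewrite(query):
    extra = [t for w in query.lower().split() for t in EXPANSIONS.get(w, [])]
    return (query + " " + " ".join(extra)) if extra else query
```

Example: `rewrite("how do networks learn")` appends "training optimization", while a query with no dictionary hits passes through unchanged.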


Bugs found and fixed during development

This project is a record of real engineering debugging. Every bug listed here caused the system to produce wrong or identical output.

1. The mutation bug (hardest to find)

```python
# Wrong: writes similarity into the stored chunk dict
val = index["chunks"][idx]
val["similarity"] = float(sim[idx])   # ← mutates original

# Fixed: copy first, then write
val = dict(index["chunks"][idx])      # ← shallow copy
val["similarity"] = float(sim[idx])   # ← writes to the copy only
```

Effect: every model returned identical similarity scores down to 6 decimal places, matching the first query ever run. Stale values persisted across all subsequent queries.
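A minimal reproduction of the aliasing:

```python
# Fetching a dict from the index returns a reference, not a copy, so
# writing a similarity score pollutes the stored chunk permanently.
index = {"chunks": [{"chunkId": 0, "content": "intro"}]}

val = index["chunks"][0]          # alias, not a copy
val["similarity"] = 0.91
assert "similarity" in index["chunks"][0]   # original is now polluted

# A shallow copy breaks the alias; the stored chunk stays clean.
index = {"chunks": [{"chunkId": 0, "content": "intro"}]}
val = dict(index["chunks"][0])
val["similarity"] = 0.91
assert "similarity" not in index["chunks"][0]
```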

2. The hop penalty bug

```python
# Wrong: constant penalty regardless of depth
hopPenalty = 0.05 * hops   # hops is the fixed max depth (2), so always 0.10

# Fixed: penalty increases with depth
hopPenalty = 0.05 * hop    # hop = 0 at the first level, 1 at the second
```

Effect: smart graph and plain BFS had identical traversal behaviour because the penalty never varied.
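A sketch of the corrected traversal, assuming a simple adjacency-list graph and a dict of cosine scores; the names and thresholds here are illustrative, not the notebook's exact values.

```python
# Score-guided BFS with a depth-dependent hop penalty: a node's score is
# discounted by how far it sits from the seeds, and low-scoring paths are
# pruned instead of expanded.
from collections import deque

def smart_traverse(graph, scores, seeds, max_hops=2, penalty=0.05, floor=0.3):
    visited, results = set(seeds), {}
    queue = deque((s, 0) for s in seeds)
    while queue:
        node, hop = queue.popleft()
        adjusted = scores.get(node, 0.0) - penalty * hop  # grows with depth
        if adjusted < floor:
            continue  # prune: do not keep or expand low-relevance nodes
        results[node] = adjusted
        if hop < max_hops:
            for nb in graph.get(node, []):
                if nb not in visited:
                    visited.add(nb)
                    queue.append((nb, hop + 1))
    return results

graph = {"a": ["b"], "b": ["c"], "c": ["d"]}
scores = {"a": 0.9, "b": 0.6, "c": 0.32, "d": 0.9}
results = smart_traverse(graph, scores, seeds=["a"])
```

Here "c" falls below the floor after the depth discount, so the otherwise high-scoring "d" behind it is never visited: pruning cuts whole paths, not just nodes.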

3. The keyword overlap graph problem

The graph edge condition `overlap > 8` fired on domain words ("intelligence") and OCR fragments ("arti", "cial") that appear in nearly every chunk. In a deep-learning document, nearly every chunk pair exceeded 8 shared words, creating a near-fully-connected graph where BFS expansion from any seed reached the entire document in 2 hops.

Fix: extended stopwords to include domain-specific terms and words shorter than 5 characters (catches OCR artifacts like "arti", "cial", "di" from ligature splits).
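A sketch of the repaired edge condition; the stopword list and threshold are illustrative.

```python
# Keyword overlap between two chunks, ignoring stopwords, domain terms,
# and tokens shorter than 5 characters (which catches OCR fragments like
# "arti", "cial", "di" from ligature splits).
STOPWORDS = {"the", "and", "that", "with", "this",
             "intelligence", "learning", "network"}  # + domain terms

def keywords(text, min_len=5):
    return {w for w in text.lower().split()
            if len(w) >= min_len and w not in STOPWORDS}

def has_edge(chunk_a, chunk_b, threshold=8):
    return len(keywords(chunk_a) & keywords(chunk_b)) > threshold
```

With this filter, `keywords("arti cial di the intelligence gradient")` keeps only `{"gradient"}`: the OCR fragments and the ubiquitous domain word no longer contribute to edges.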

4. The OCR artifact problem

PDF text extraction produces ligature-split artifacts: "artificial" → "arti cial", "difficult" → "di cult", "field" → " eld". These split tokens appeared in the keyword overlap computation, creating false graph edges between unrelated chunks.

Fix: keep only tokens with `len(w) > 4` in the overlap computation, and add the most common OCR artifacts to the stopword list.
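Where a ligature survives as a real unicode codepoint rather than a private-use glyph, NFKC normalization repairs it before tokenization. This is a complementary fix, not the notebook's; private-use codepoints still require the filtering described above.

```python
# Unicode compatibility ligatures ("ﬁ" U+FB01, "ﬃ" U+FB03) decompose to
# their letter sequences under NFKC normalization.
import unicodedata

def normalize(text):
    return unicodedata.normalize("NFKC", text)

assert normalize("arti\ufb01cial") == "artificial"   # ﬁ → fi
```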

5. The reranker ceiling

A cross-encoder re-scoring 15 candidates will almost always agree with what plain embedding retrieval found as top-3. Graph expansion finds better candidates but the reranker drops them because it scores each chunk in isolation against the query — and the original top-3 are individually more relevant than any single newly-discovered chunk.

For multi-hop questions, the right evaluation metric is candidate coverage, not top-1 rank.
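Candidate coverage is easy to compute once a labelled set exists; no labelled set ships with the repo, so this is only a sketch of the metric.

```python
# Fraction of gold answer chunks present anywhere in the candidate pool,
# regardless of rank. Rewards strategies that surface all the evidence an
# LLM needs, even when top-1 is unchanged.
def candidate_coverage(candidates, gold):
    gold = set(gold)
    return len(set(candidates) & gold) / len(gold)

candidate_coverage(["c1", "c2", "c5"], ["c2", "c5", "c9"])  # → 2/3
```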


Running the notebook

Prerequisites

```shell
pip install pdfplumber pymupdf sentence-transformers scikit-learn numpy python-dotenv

# For OCR fallback:
# Install Tesseract: https://github.com/tesseract-ocr/tesseract
pip install pytesseract pillow
```

Environment setup

If you are on Windows, create a .env file:

```
TESPATH=C:\Program Files\Tesseract-OCR\tesseract.exe
```

Directory structure

```
project/
├── graph_rag_from_scratch.ipynb
├── .env
├── requirements.txt
└── data/
    └── dataRAG.pdf          ← your PDF goes here
```

Running

  1. Place your PDF at ./data/dataRAG.pdf
  2. Open the notebook and run all cells top-to-bottom
  3. Go to Section 5 and change the QUERY variable to ask your own question

Try it on Google Colab

Open in Colab

To use on Colab:

  1. Upload your PDF when prompted, or mount Google Drive
  2. Change the PDF path in Section 2 to match your file location
  3. The first cell installs all dependencies automatically

Project structure

```
structuredRag.ipynb            Main notebook (all code + explanations)
README.md                      This file
requirements.txt               Python dependencies
data/                          Put your PDF here (not tracked by git)
.gitignore                     Excludes data/, .env, __pycache__
```

Key references

  • Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning. MIT Press. deeplearningbook.org
  • Reimers, N., Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019.
  • Nogueira, R., Cho, K. (2019). Passage Re-ranking with BERT. arXiv:1901.04085.

What's next

The natural next step after retrieval is feeding the retrieved chunks into an LLM with a well-structured prompt. The retrieval system here is the hard part; once you have a clean, section-aware candidate pool, the generation step is straightforward with any API.

Future additions planned:

  • Conversational memory (multi-turn queries referencing previous answers)
  • Entity-level knowledge graph (typed nodes: concepts, people, algorithms)
  • Evaluation harness (precision@k, recall@k against a labelled test set)
