A Retrieval-Augmented Generation (RAG) pipeline built entirely from first principles — no LangChain, no LlamaIndex, no abstraction frameworks. Every component is implemented directly so the design decisions are visible and the tradeoffs are explicit.
The project uses Chapter 1 of Deep Learning by Goodfellow et al. as its test document, but the pipeline works on any PDF.
Standard RAG tutorials jump straight to using frameworks. This project builds each layer by hand, surfaces the bugs that frameworks hide, and shows exactly why each design decision exists.
Academic PDFs often use custom ligature-encoded fonts for headings. Standard text extraction reads "Introduction" as "I\ue001\ue002r\ue004\ue005...": private-use Unicode codepoints that no regex can match. Every heading detector fails silently. The result: 105 chunks, all titled "Untitled", no section structure, and graph edges connecting everything to everything.
The fix: font size is stored as a number in the PDF spec, independent of encoding. A garbled heading is still the largest text on its line. Detect headings by size, infer titles from the body text that follows.
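The two steps can be sketched as pure functions over extracted text blocks. This is a minimal sketch: the block dicts mirror the `size`/`text` fields that PyMuPDF reports per span, but the function names and thresholds here are illustrative, not the notebook's actual code.

```python
def classify_blocks(blocks, body_size=10.0):
    """Label each block as heading/body/caption using only its font size."""
    labeled = []
    for b in blocks:
        if b["size"] >= body_size * 1.3:      # clearly larger than body text
            kind = "heading"
        elif b["size"] <= body_size * 0.8:    # clearly smaller: caption/footnote
            kind = "caption"
        else:
            kind = "body"
        labeled.append({**b, "kind": kind})
    return labeled

def infer_titles(labeled):
    """If a heading is ligature-garbled (private-use codepoints), fall back to
    the first words of the body text that follows it."""
    def garbled(text):
        return any(0xE000 <= ord(c) <= 0xF8FF for c in text)
    for i, b in enumerate(labeled):
        if b["kind"] == "heading" and garbled(b["text"]):
            nxt = next((x for x in labeled[i+1:] if x["kind"] == "body"), None)
            if nxt:
                b["text"] = " ".join(nxt["text"].split()[:6]) + " ..."
    return labeled

blocks = [
    {"text": "I\ue001\ue002r\ue004", "size": 18.0},   # garbled heading
    {"text": "The field of artificial intelligence is thriving.", "size": 10.0},
]
out = infer_titles(classify_blocks(blocks))
print(out[0]["kind"], "|", out[0]["text"])
```

The key property: classification never looks at the characters themselves, so garbled glyphs cannot break it.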
```
PDF
 │
 ├─ 2a. Naive text extraction (pypdf)            ← shows the problem
 ├─ 2b. Block extraction with font sizes (fitz)  ← intermediate step
 └─ 2c. Layout-aware extraction (pdfplumber)     ← final version
        │  font size → heading/body/caption/skip
        │  garbled heading → infer title from body
        │  back-to-back headings → merge titles
        ↓
Sections [{title, content, page}, ...]
 │
 ├─ 3a. Word sliding window (naive)      ← shows the problem
 └─ 3b/c. Sentence-boundary chunking     ← final version
        │  60-word limit
        │  1-sentence overlap
        │  abbreviation + citation protection
        ↓
Chunks [{chunkId, title, content, page, length}, ...]
 │
 ↓ sentence-transformers all-MiniLM-L6-v2
Embeddings [n_chunks × 384]
 │
 ┌──────┴───────────────────────────────────────┐
 │                                              │
Plain cosine search                     Knowledge graph
+ cross-encoder rerank                          │
 │                          ┌───────────────────┴────┐
 │                          │                        │
 │                     Plain BFS              Smart traversal
 │                     (undirected,           (score-guided,
 │                      2 hops)                hop penalty)
 │                          │                        │
 │                          └────────────────────────┤
 │                                                   │
 │                                      Query rewrite + graph
 │                                      (domain expansion dict
 │                                       → different seeds
 │                                       → different region)
 │                                                   │
 └─────────────────────────┬─────────────────────────┘
                           ↓
                     Top-K results
```
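The sentence-boundary chunking step (3b/c above) can be sketched as follows. The regex, abbreviation list, and function names are illustrative stand-ins, not the notebook's actual code:

```python
import re

ABBREV = ("e.g.", "i.e.", "et al.", "Fig.", "Eq.")   # protected abbreviations

def split_sentences(text):
    """Split on sentence-ending punctuation, shielding common abbreviations
    so their periods do not create false sentence boundaries."""
    protected = text
    for a in ABBREV:
        protected = protected.replace(a, a.replace(".", "<DOT>"))
    parts = re.split(r"(?<=[.!?])\s+", protected)
    return [p.replace("<DOT>", ".") for p in parts if p.strip()]

def chunk_sentences(sentences, max_words=60, overlap=1):
    """Pack whole sentences into chunks of at most max_words, carrying
    `overlap` trailing sentences into the next chunk for continuity."""
    chunks, cur = [], []
    for s in sentences:
        if cur and sum(len(x.split()) for x in cur) + len(s.split()) > max_words:
            chunks.append(" ".join(cur))
            cur = cur[-overlap:]          # 1-sentence overlap
        cur.append(s)
    if cur:
        chunks.append(" ".join(cur))
    return chunks

sents = split_sentences(
    "Deep learning, e.g. CNNs, works well. It scales. "
    "Training needs data. Results improve with depth."
)
chunks = chunk_sentences(sents, max_words=8)
print(chunks)
```

Because chunks always end on sentence boundaries, no chunk starts mid-clause, and the one-sentence overlap preserves anaphora ("It", "This approach") across chunk borders.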
| Strategy | How it works | Candidate pool | Unique finds |
|---|---|---|---|
| Plain vector search | Cosine similarity + cross-encoder rerank | Top 10 by similarity | Baseline |
| Plain graph BFS | BFS 2 hops from top-10 seeds | ~18 chunks | Adjacent chunks plain search missed |
| Smart graph traversal | Score-guided hop expansion, prunes low-relevance paths | ~22 chunks | Chunks only reachable via high-similarity paths |
| Query rewrite + graph | Enriches query with domain terms → different seeds → different traversal region | ~29 chunks | Chunks in semantically adjacent regions of the graph |
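The query-rewrite strategy in the last row can be sketched with a hand-built expansion dictionary: appending domain synonyms moves the query embedding into a different neighbourhood, which changes the seed chunks and therefore the traversal region. The dictionary contents below are illustrative, not the project's actual expansion table.

```python
# Hypothetical domain-expansion dictionary (illustrative entries).
EXPANSIONS = {
    "representation": ["feature", "encoding"],
    "learning": ["training", "optimization"],
}

def rewrite_query(query):
    """Append deduplicated domain synonyms for any query word found
    in the expansion dictionary."""
    extra = []
    for word in query.lower().split():
        extra.extend(EXPANSIONS.get(word.strip("?.,"), []))
    if not extra:
        return query
    return query + " " + " ".join(dict.fromkeys(extra))   # preserve order, dedupe

q = rewrite_query("What is representation learning?")
print(q)
```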
Key insight: the benefit of graph RAG shows up in recall (candidate coverage), not in top-1 rank. All four strategies may return the same top-3 results, and that is correct behaviour. The real question is whether the complete answer to a multi-hop question is covered by the candidate pool passed to an LLM.
This project is a record of real engineering debugging. Every bug listed here caused the system to produce wrong or identical output.
```python
# Wrong: writes similarity into the stored chunk dict
val = index["chunks"][idx]
val["similarity"] = float(sim[idx])    # ← mutates the original

# Fixed: copy first, then write
val = dict(index["chunks"][idx])       # ← shallow copy
val["similarity"] = float(sim[idx])    # ← writes to the copy only
```

Effect: every model returned identical similarity scores down to 6 decimal places, matching the first query ever run. Stale values persisted across all subsequent queries.
```python
# Wrong: constant penalty regardless of depth
hopPenalty = 0.05 * hops   # hops=2, so always 0.10

# Fixed: penalty increases with depth
hopPenalty = 0.05 * hop    # hop=0 at the first level, hop=1 at the second
```

Effect: smart graph and plain BFS had identical traversal behaviour because the penalty never varied.
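With the corrected penalty, score-guided traversal behaves as a best-first expansion: deeper hops pay a larger penalty and weak paths are pruned. A minimal sketch, with an illustrative toy graph and thresholds rather than the notebook's actual structures:

```python
import heapq

def smart_traverse(graph, seeds, sims, max_hops=2, penalty=0.05, min_score=0.2):
    """Best-first expansion: a node's score is its query similarity minus
    a penalty that grows with hop depth; low-scoring paths are pruned."""
    best = {}
    frontier = [(-sims[s], s, 0) for s in seeds]   # (neg score, node, hop)
    heapq.heapify(frontier)
    while frontier:
        neg, node, hop = heapq.heappop(frontier)
        score = -neg
        if best.get(node, -1.0) >= score:          # already reached via a better path
            continue
        best[node] = score
        if hop >= max_hops:
            continue
        for nb in graph.get(node, []):
            nb_score = sims[nb] - penalty * (hop + 1)   # deeper hop, bigger penalty
            if nb_score >= min_score:                   # prune weak paths
                heapq.heappush(frontier, (-nb_score, nb, hop + 1))
    return best

graph = {0: [1, 2], 1: [3], 2: [4], 3: [], 4: []}
sims = {0: 0.9, 1: 0.6, 2: 0.15, 3: 0.5, 4: 0.8}
found = smart_traverse(graph, seeds=[0], sims=sims)
print(sorted(found))
```

Note how node 2 falls below the pruning threshold, so node 4 is never reached even though its own similarity is high: traversal only follows high-similarity paths, which is exactly how it differs from plain BFS.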
The graph edge condition `overlap > 8` fired on domain words ("intelligence", "arti", "cial") that appear in every chunk. In a deep-learning document, nearly every chunk pair exceeded 8 shared words, creating a near-fully-connected graph where BFS expansion from any seed reached the entire document in 2 hops.
Fix: extended stopwords to include domain-specific terms and words shorter than 5 characters (catches OCR artifacts like "arti", "cial", "di" from ligature splits).
PDF text extraction produces ligature-split artifacts: "artificial" → "arti cial", "difficult" → "di cult", "field" → " eld". These split tokens appeared in the keyword overlap computation, creating false graph edges between unrelated chunks.
Fix: keep only tokens with `len(w) > 4` in the overlap computation, and add the most common OCR artifacts to the stopword list.
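Both fixes meet in the edge-creation step. A minimal sketch, with illustrative stopword and threshold choices rather than the project's actual lists:

```python
# Extended stopword list: domain-frequent words plus ligature-split artifacts.
STOPWORDS = {"intelligence", "learning", "network",   # appear in every chunk
             "arti", "cial", "cult"}                  # OCR/ligature debris

def keywords(text):
    """Tokens eligible for overlap counting: longer than 4 chars, not stopworded."""
    return {w for w in text.lower().split()
            if len(w) > 4 and w not in STOPWORDS}

def has_edge(a, b, threshold=8):
    """Connect two chunks only if they share more than `threshold` keywords."""
    return len(keywords(a) & keywords(b)) > threshold

text = "deep learning uses artificial intelligence and arti cial neural network layers"
ka = keywords(text)
print(ka)
```

With the filters in place, even a chunk compared against itself here shares only 3 eligible keywords, so domain vocabulary alone can no longer create an edge.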
A cross-encoder re-scoring 15 candidates will almost always agree with what plain embedding retrieval found as top-3. Graph expansion finds better candidates but the reranker drops them because it scores each chunk in isolation against the query — and the original top-3 are individually more relevant than any single newly-discovered chunk.
For multi-hop questions, the right evaluation metric is candidate coverage, not top-1 rank.
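Candidate coverage is just recall over the pool, ignoring rank. A toy sketch (chunk IDs and pools are made up for illustration):

```python
def coverage(candidates, gold):
    """Fraction of gold answer chunks present anywhere in the candidate pool:
    recall over candidates, ignoring rank entirely."""
    return len(set(candidates) & set(gold)) / len(gold)

gold = {"c12", "c47"}                           # chunks needed for a 2-hop answer
plain_pool = ["c12", "c3", "c9"]                # plain vector top-k
graph_pool = ["c12", "c3", "c9", "c47", "c50"]  # graph-expanded pool
print(coverage(plain_pool, gold), coverage(graph_pool, gold))
```

Both pools rank `c12` first, so top-1 metrics see no difference; only coverage reveals that the graph pool contains the second half of the answer.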
```
pip install pdfplumber pymupdf sentence-transformers scikit-learn numpy python-dotenv
```

For the OCR fallback, install Tesseract (https://github.com/tesseract-ocr/tesseract), then:

```
pip install pytesseract pillow
```

If you are on Windows, create a `.env` file:

```
TESPATH=C:\Program Files\Tesseract-OCR\tesseract.exe
```
```
project/
├── graph_rag_from_scratch.ipynb
├── .env
├── requirements.txt
└── data/
    └── dataRAG.pdf   ← your PDF goes here
```
- Place your PDF at `./data/dataRAG.pdf`
- Open the notebook and run all cells top-to-bottom
- Go to Section 5 and change the `QUERY` variable to ask your own question
To use on Colab:
- Upload your PDF when prompted, or mount Google Drive
- Change the PDF path in Section 2 to match your file location
- The first cell installs all dependencies automatically
```
structuredRag.ipynb   Main notebook (all code + explanations)
README.md             This file
requirements.txt      Python dependencies
data/                 Put your PDF here (not tracked by git)
.gitignore            Excludes data/, .env, __pycache__
```
- Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning. MIT Press. deeplearningbook.org
- Reimers, N., Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019.
- Nogueira, R., Cho, K. (2019). Passage Re-ranking with BERT. arXiv:1901.04085.
The natural next step after retrieval is feeding the retrieved chunks into an LLM with a well-structured prompt. The retrieval system here is the hard part; once you have a clean, section-aware candidate pool, the generation step is straightforward with any API.
Future additions planned:
- Conversational memory (multi-turn queries referencing previous answers)
- Entity-level knowledge graph (typed nodes: concepts, people, algorithms)
- Evaluation harness (precision@k, recall@k against a labelled test set)