logesh45/pageindex-reasoning-librarian

The Reasoning Librarian

A vectorless RAG (Retrieval-Augmented Generation) proof of concept implementing the PageIndex architecture. Instead of embedding documents into vector space, this system uses hierarchical navigation and LLM reasoning to find relevant information.


If you find this project interesting, please leave a star! ⭐

🎯 What is This?

Traditional RAG systems chunk documents and store them as vector embeddings. This approach has fundamental limitations:

  • Destroys document structure and narrative flow
  • Returns passages that merely sound similar ("vibes") instead of precise matches
  • Fails at multi-hop reasoning

The Reasoning Librarian takes a different approach:

  1. Preserves document hierarchy (Book → Chapter structure)
  2. Uses LLM reasoning to navigate like a human researcher
  3. No vector embeddings - just structured summaries and raw text

🏗️ Architecture

1. The Cartographer (Indexing)

The Cartographer builds a hierarchical mental map of the document using a bottom-up summarization approach.

graph TD
    A[Raw Text] --> B[Clean Text: Skip ToC/Header]
    B --> C[Regex Parsing: Books & Chapters]
    C --> D[Haiku 4.5: Chapter Summaries]
    D --> E[Sonnet 4.5: Book Aggregate Summaries]
    E --> F[Sonnet 4.5: Root Aggregate Summary]
    F --> G[(tree_index.json)]
    
    style D fill:#6366f1,stroke:#4338ca,stroke-width:2px,color:#fff
    style E fill:#6366f1,stroke:#4338ca,stroke-width:2px,color:#fff
    style F fill:#6366f1,stroke:#4338ca,stroke-width:2px,color:#fff
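
The bottom-up pass in the diagram can be sketched roughly as follows. This is a minimal illustration, not the project's actual `models.py` or `indexer.py`: the `Node` fields, `summarize_bottom_up`, and `fake_summarize` are hypothetical names, plain dataclasses stand in for the Pydantic models, and a string-truncating stub stands in for the Haiku and Sonnet calls.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from typing import Callable


@dataclass
class Node:
    """One entry in the tree index (hypothetical schema)."""
    node_id: str
    title: str
    summary: str = ""
    text: str = ""  # raw text, kept only on chapter leaves
    children: list[Node] = field(default_factory=list)


def summarize_bottom_up(node: Node, summarize: Callable[[str, int], str], depth: int = 0) -> None:
    """Fill in summaries leaf-first, so each parent summarizes its children's summaries."""
    for child in node.children:
        summarize_bottom_up(child, summarize, depth + 1)
    if node.children:
        source = "\n".join(child.summary for child in node.children)
    else:
        source = node.text
    node.summary = summarize(source, depth)


def fake_summarize(text: str, depth: int) -> str:
    """Stand-in for an LLM call: keep the first 40 characters."""
    return text[:40]


root = Node("root", "The Wealth of Nations", children=[
    Node("BOOK_I", "Book I", children=[
        Node(
            "BOOK_I_CHAPTER_I",
            "Of the Division of Labour",
            text="The greatest improvement in the productive powers of labour...",
        ),
    ]),
])
summarize_bottom_up(root, fake_summarize)

# Serialize the whole tree, similar in spirit to output/tree_index.json.
index_json = json.dumps(asdict(root), indent=2)
```

The `depth` argument is passed to the summarizer so that a real implementation could pick a different model per level, as the hybrid strategy does.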

2. The Navigator (Retrieval)

The Navigator uses Claude's tool-use capabilities to traverse the hierarchical index like a human researcher.

sequenceDiagram
    participant User
    participant Agent as Agent (Sonnet 4.5)
    participant Index as Tree Index

    User->>Agent: "What are the duties of the sovereign?"
    rect rgba(128, 128, 128, 0.1)
    Note over Agent, Index: Agent Loop (Iterative Navigation)
    Agent->>Index: read_node(root)
    Index-->>Agent: Book summaries
    Agent->>Agent: Reasoning: "Book V covers state revenue/duties"
    Agent->>Index: read_node(BOOK_V)
    Index-->>Agent: Chapter summaries
    Agent->>Agent: Reasoning: "Chapter I discusses sovereign expenses"
    Agent->>Index: read_content(BOOK_V_CHAPTER_I)
    Index-->>Agent: Raw chapter text
    end
    Agent->>User: Synthesized Answer with Citations
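
The two tools in the sequence diagram could be exposed to the model along these lines. This is a sketch under assumptions: the tiny `INDEX` dictionary and the `dispatch` helper are hypothetical, and the real `tree_index.json` schema may differ. The tool definitions use the `name` / `description` / `input_schema` shape that Anthropic's Messages API expects for tool use, but no API call is made here.

```python
import json

# Hypothetical miniature index; the real tree_index.json may be shaped differently.
INDEX = {
    "root": {"summary": "An inquiry into the wealth of nations.", "children": ["BOOK_V"]},
    "BOOK_V": {"summary": "Revenue of the sovereign.", "children": ["BOOK_V_CHAPTER_I"]},
    "BOOK_V_CHAPTER_I": {
        "summary": "Expenses of the sovereign.",
        "children": [],
        "text": "The first duty of the sovereign...",
    },
}


def _node_id_schema(description: str) -> dict:
    """Build a JSON Schema object that takes a single string node_id."""
    return {
        "type": "object",
        "properties": {"node_id": {"type": "string", "description": description}},
        "required": ["node_id"],
    }


TOOLS = [
    {
        "name": "read_node",
        "description": "Return a node's summary and the summaries of its children.",
        "input_schema": _node_id_schema("Node to inspect, e.g. 'root' or 'BOOK_V'."),
    },
    {
        "name": "read_content",
        "description": "Return the raw text of a chapter node.",
        "input_schema": _node_id_schema("Chapter to read, e.g. 'BOOK_V_CHAPTER_I'."),
    },
]


def read_node(node_id: str) -> dict:
    node = INDEX[node_id]
    return {
        "summary": node["summary"],
        "children": {child: INDEX[child]["summary"] for child in node["children"]},
    }


def read_content(node_id: str) -> dict:
    return {"text": INDEX[node_id].get("text", "")}


def dispatch(name: str, args: dict) -> str:
    """Run one tool call and return a JSON string to hand back to the model."""
    handlers = {"read_node": read_node, "read_content": read_content}
    if name not in handlers:
        return json.dumps({"error": f"unknown tool: {name}"})
    if args.get("node_id") not in INDEX:
        return json.dumps({"error": f"unknown node: {args.get('node_id')}"})
    return json.dumps(handlers[name](args["node_id"]))
```

Returning errors as JSON rather than raising lets the agent see its mistake (for example a mistyped node ID) and retry within the same loop.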

🚀 Quick Start

1. Install Dependencies

pip install -r requirements.txt

2. Set Up API Key

cp .env.example .env
# Edit .env and add your Anthropic API key

3. Download the Source Text

mkdir -p data
curl -o data/wealth_of_nations.txt https://www.gutenberg.org/files/3300/3300-0.txt

4. Build the Index

python -m src.indexer

This builds a hierarchical index of "The Wealth of Nations" (downloaded in step 3) with LLM-generated summaries. It takes ~5 minutes and costs only ~$0.33 in API calls using the hybrid indexing strategy.

5. Query the System

python -m src.cli

Example queries:

  • "What are the three duties of the sovereign?"
  • "What is the division of labor?"
  • "What causes wages to rise?"

📁 Project Structure

page-index-poc/
├── src/
│   ├── models.py      # Pydantic data models
│   ├── indexer.py     # The Cartographer (builds index)
│   ├── navigator.py   # The Navigator (agentic retrieval)
│   └── cli.py         # Interactive CLI
├── data/
│   └── wealth_of_nations.txt   # Source text (Download required)
├── output/
│   └── tree_index.json         # Generated index
├── requirements.txt
└── .env.example

🔧 How It Works (Hybrid Strategy)

To achieve maximum performance at minimum cost, this PoC uses a hybrid model architecture:

1. Indexing (The Cartographer)

  • Parse Structure: Use regex to identify BOOK I, CHAPTER I, etc.
  • Chapter Summaries (Haiku 4.5): We use the faster, cheaper Haiku 4.5 to generate summaries for the 32 individual chapters. This handles 90% of the indexing volume for just a few cents.
  • Hierarchical "Map" (Sonnet 4.5): The higher-level Book and Root summaries are generated using Sonnet 4.5. This ensures the agent has a high-quality "conceptual map" to navigate correctly.
  • Cost Efficiency: This hybrid approach allowed us to index the entire 900-page treatise for only ~$0.33.
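
The parsing and model-routing steps above might look something like this. It is a sketch, not the project's `indexer.py`: the heading pattern assumes headings appear alone on a line as `BOOK I.` or `CHAPTER II.` (the Gutenberg text may need a looser pattern), `parse_structure` and `model_for` are hypothetical names, and the returned model names are display labels rather than API model IDs.

```python
import re

# Assumed heading format: "BOOK I." or "CHAPTER II." alone on a line.
HEADING = re.compile(r"^(BOOK|CHAPTER)[ \t]+([IVXLC]+)\.?[ \t]*$", re.MULTILINE)


def parse_structure(text: str) -> dict[str, dict[str, str]]:
    """Split text into {book_id: {chapter_id: chapter_text}} using heading lines."""
    books: dict[str, dict[str, str]] = {}
    matches = list(HEADING.finditer(text))
    book_id = None
    for i, match in enumerate(matches):
        kind, numeral = match.group(1), match.group(2)
        if kind == "BOOK":
            book_id = f"BOOK_{numeral}"
            books[book_id] = {}
        elif book_id is not None:
            # A chapter's text runs until the next heading (or the end of the file).
            end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
            chapter_id = f"{book_id}_CHAPTER_{numeral}"
            books[book_id][chapter_id] = text[match.end():end].strip()
    return books


def model_for(level: str) -> str:
    """Route by tree level: cheaper model for chapter leaves, stronger model for aggregates."""
    return "Haiku 4.5" if level == "chapter" else "Sonnet 4.5"


SAMPLE = """BOOK I.
CHAPTER I.
Of the division of labour.
CHAPTER II.
Of the principle which gives occasion to the division of labour.
BOOK II.
CHAPTER I.
Of the division of stock.
"""

books = parse_structure(SAMPLE)
```

Keeping the routing in one small function makes the cost trade-off easy to adjust, for example to summarize books with the cheaper model as well.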

2. Retrieval (The Navigator)

  • Reasoning Agent (Sonnet 4.5): The Navigator always uses Sonnet 4.5 to ensure robust tool-use, multi-hop reasoning, and high-quality synthesis of the final answer.
  • Top-Down Discovery:
    1. Start at Root: Browse 5 Book summaries.
    2. Reason About Path: "Market price vs Natural price? → Book I".
    3. Drill Down: Read Book I's chapter summaries.
    4. Execute: Read the full raw text of the relevant chapter.
    5. Cite: Synthesize answer with specific Book/Chapter citations.
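
The five steps above can be sketched as a small loop. To keep the example runnable without an API key, a keyword-overlap function stands in for the model's reasoning step; it is only a placeholder and is far weaker than the Sonnet-driven choice the Navigator actually relies on. `TREE`, `navigate`, `keyword_chooser`, and `cite` are hypothetical names.

```python
from typing import Callable

# Hypothetical two-book slice of the index.
TREE = {
    "root": {"summary": "An inquiry into the wealth of nations.",
             "children": ["BOOK_I", "BOOK_V"]},
    "BOOK_I": {"summary": "labour, wages, natural and market price of commodities",
               "children": ["BOOK_I_CHAPTER_VII"]},
    "BOOK_V": {"summary": "revenue and expenses of the sovereign",
               "children": ["BOOK_V_CHAPTER_I"]},
    "BOOK_I_CHAPTER_VII": {"summary": "natural and market price", "children": [],
                           "text": "(raw chapter text)"},
    "BOOK_V_CHAPTER_I": {"summary": "expenses of the sovereign", "children": [],
                         "text": "(raw chapter text)"},
}

Chooser = Callable[[str, dict[str, str]], str]


def keyword_chooser(query: str, options: dict[str, str]) -> str:
    """Placeholder for LLM reasoning: pick the summary sharing the most words with the query."""
    words = set(query.lower().split())
    return max(options, key=lambda nid: len(words & set(options[nid].lower().split())))


def navigate(query: str, choose: Chooser, max_hops: int = 5) -> tuple[list[str], str]:
    """Walk from the root to a chapter, returning the path taken and the chapter text."""
    node_id, path = "root", ["root"]
    for _ in range(max_hops):
        children = TREE[node_id]["children"]
        if not children:  # reached a chapter leaf: read its raw text
            return path, TREE[node_id]["text"]
        options = {child: TREE[child]["summary"] for child in children}
        node_id = choose(query, options)
        path.append(node_id)
    raise RuntimeError("hop budget exhausted before reaching a chapter")


def cite(node_id: str) -> str:
    """Turn 'BOOK_I_CHAPTER_VII' into 'Book I, Chapter VII'."""
    parts = node_id.split("_")
    return ", ".join(f"{parts[i].capitalize()} {parts[i + 1]}" for i in range(0, len(parts), 2))


path, text = navigate("market price vs natural price", keyword_chooser)
citation = cite(path[-1])
```

The `max_hops` budget is a design choice worth keeping in any real agent loop, since it bounds cost per query if the model wanders.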

📊 Comparison with Vector RAG

| Feature | Vector RAG | Reasoning RAG |
| --- | --- | --- |
| Indexing | Embed chunks | Summarize hierarchy |
| Retrieval | Cosine similarity | LLM navigation |
| Latency | ~50ms | ~3-5 seconds |
| Accuracy | Low on complex queries | High |
| Cost per query | ~$0.001 | ~$0.05 |
| Best for | Simple fact lookup | Complex/structured docs |
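
To make the cost and latency rows concrete, here is a back-of-the-envelope calculation using only the approximate figures from the comparison. The 4-second latency is an assumed midpoint of the ~3-5 second range, and the workload size is arbitrary.

```python
# Approximate per-query figures taken from the comparison above.
VECTOR = {"cost_per_query": 0.001, "latency_s": 0.050}
REASONING = {"cost_per_query": 0.05, "latency_s": 4.0}  # 4.0 s: assumed midpoint of 3-5 s


def total_cost(profile: dict, queries: int) -> float:
    """Total API cost in dollars for a number of queries."""
    return profile["cost_per_query"] * queries


def cost_ratio(a: dict, b: dict) -> float:
    """How many times more expensive profile a is than profile b, per query."""
    return a["cost_per_query"] / b["cost_per_query"]


queries = 1_000
print(f"Vector RAG:    ${total_cost(VECTOR, queries):.2f}")
print(f"Reasoning RAG: ${total_cost(REASONING, queries):.2f}")
print(f"Cost ratio:    {cost_ratio(REASONING, VECTOR):.0f}x")
print(f"Latency ratio: {REASONING['latency_s'] / VECTOR['latency_s']:.0f}x")
```

At these figures the reasoning approach costs about 50 times more per query, which is why it suits complex questions over structured documents rather than high-volume fact lookup.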

📚 Based On

This implementation is inspired by PageIndex by VectifyAI, for which VectifyAI reports 98.7% accuracy on FinanceBench using hierarchical document navigation.

📜 License

MIT
