hsp-iit/rag_baseline

RAG Baseline - Retrieval-Augmented Generation System

A complete Retrieval-Augmented Generation (RAG) pipeline for building knowledge bases from web content and providing semantic search and QA capabilities via ROS2 services.

🎯 Overview

This project implements an end-to-end RAG system that:

  1. Crawls museum/website URLs intelligently with pattern-based filtering
  2. Builds a URL→text JSON knowledge base from crawled content
  3. Indexes passages with semantic embeddings using sentence-transformers
  4. Retrieves relevant context via ROS2 service endpoints
  5. Augments LLM prompts with retrieved context for accurate QA

Designed for cultural institutions (e.g., Palazzo Madama museum) but adaptable to any domain.

📦 Components

1. Web Crawler: crawler.py

Asynchronous web crawler using crawl4ai with intelligent content extraction.

Features:

  • Concurrent multi-page crawling
  • HTML to markdown/JSON conversion
  • Link extraction and URL normalization
  • Request headers and retry logic
  • Domain-aware crawling
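
The link-extraction and normalization step can be sketched with the standard library (the helper names `normalize_url` and `same_domain` are illustrative, not the crawler's actual API):

```python
from urllib.parse import urljoin, urldefrag, urlparse

def normalize_url(base: str, link: str) -> str:
    """Resolve a (possibly relative) link against its base page,
    drop the fragment, and strip a trailing slash."""
    absolute = urljoin(base, link)
    absolute, _fragment = urldefrag(absolute)
    return absolute.rstrip("/")

def same_domain(url_a: str, url_b: str) -> bool:
    """The domain-aware crawling constraint: only follow links
    whose host matches the seed's host."""
    return urlparse(url_a).netloc == urlparse(url_b).netloc
```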

Usage:

python crawler.py \
  --urls https://example.com \
  --question "What are the artworks?" \
  --markdown

2. Knowledge Base Builder: rag_base_builder.py

Builds a comprehensive knowledge base by crawling seed URLs with pattern-based filtering.

Features:

  • Seed URLs: Starting points for crawling
  • Include Patterns: URLs to explicitly include
  • Exclude Patterns: URLs to filter out (with non-recursive override)
  • Discovery Patterns: Patterns to find and crawl
  • Non-Recursive Seeds: Crawl URLs but don't expand to children
  • Max Depth & Pages: Control crawl scope
  • Async Workers: Parallel crawling with configurable workers
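
The filtering rules above combine into a per-URL decision; a minimal sketch of that logic (not the builder's exact code) where non-recursive seed patterns override the exclude list but suppress link expansion:

```python
import re

def should_crawl(url, include, exclude, non_recursive_seeds):
    """Return (crawl, expand_links) for a candidate URL.
    Non-recursive seeds bypass exclude patterns but their
    children are never enqueued."""
    if any(re.search(p, url) for p in non_recursive_seeds):
        return True, False   # crawl the page, don't follow its links
    if any(re.search(p, url) for p in exclude):
        return False, False
    if any(re.search(p, url) for p in include):
        return True, True
    return False, False
```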

Configuration:

{
  "seed_urls": ["https://example.com/collection"],
  "include_url_patterns": [".*"],
  "exclude_url_patterns": ["pag=\\d+"],
  "discovery_url_patterns": ["https://assets\\.example\\.com/.*"],
  "non_recursive_seed_url_patterns": [".*catalog.*\\?pag=\\d+"],
  "max_pages": 500,
  "max_depth": 3,
  "workers": 4,
  "request_delay_seconds": 0.5,
  "request_jitter_seconds": 0.2,
  "max_retries": 3,
  "same_domain_only": true,
  "verbosity": 2
}

Usage:

python rag_base_builder.py \
  --config brawl_conf.json \
  --output rag_base.json

3. JSON Merger: merge_json.py

Combines multiple JSON knowledge base files into one.

Usage:

python merge_json.py output.json input1.json input2.json input3.json
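
Since the knowledge base is a flat URL→text JSON object, merging reduces to dictionary updates; a sketch of what `merge_json.py` does (its exact duplicate-handling semantics may differ — here later inputs win on duplicate URLs):

```python
import json

def merge_knowledge_bases(output_path, input_paths):
    """Merge URL->text JSON files into one; later files
    overwrite earlier ones on duplicate URLs."""
    merged = {}
    for path in input_paths:
        with open(path, encoding="utf-8") as f:
            merged.update(json.load(f))
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False, indent=2)
    return merged
```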

4. ROS2 RAG Service: rag_ws/src/rag_server/rag_server/rag_node.py

ROS2 node providing semantic retrieval and QA services.

Services:

  • get_context - Retrieve top-k relevant passages for a query
  • ask - Get LLM-augmented answer with context

Features:

  • Semantic embeddings with sentence-transformers
  • Optional reranking with CrossEncoder
  • Text chunking with smart sentence boundaries
  • Embedding caching for performance
  • Environment variable support in config

Parameters (via --params-file):

  • rag_file - Path to knowledge base JSON (supports $HOME, ~)
  • top_k - Number of passages to retrieve
  • chunk_size_chars - Text chunk size (default 900)
  • sbert_model - Embedding model (default: paraphrase-multilingual-MiniLM-L12-v2)
  • retrieval_method - "similarity" or "reranking"
  • reranker_model - CrossEncoder model for reranking
  • rerank_candidate_k - Candidates for reranking
  • rerank_candidate_method - "keyword" or "similarity"
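
The "similarity" retrieval method amounts to a cosine-similarity top-k over passage embeddings; a minimal numpy sketch (the vectors stand in for sentence-transformer output, and the function name is illustrative):

```python
import numpy as np

def top_k_passages(query_vec, passage_vecs, k):
    """Rank passages by cosine similarity to the query and
    return (index, score) pairs for the top k."""
    q = query_vec / np.linalg.norm(query_vec)
    p = passage_vecs / np.linalg.norm(passage_vecs, axis=1, keepdims=True)
    scores = p @ q
    order = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in order]
```

With `retrieval_method: reranking`, this top-k would instead produce `rerank_candidate_k` candidates that a CrossEncoder then re-scores.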

🚀 Quick Start

Installation

# Clone repository
cd ~/rag_baseline

# Install Python dependencies
pip install crawl4ai sentence-transformers numpy

# For ROS2 integration (optional)
cd rag_ws
colcon build
source install/setup.bash

Basic Workflow

1. Crawl Website

python rag_base_builder.py --config brawl_conf.json --output knowledge_base.json

2. Merge Multiple Crawls (if needed)

python merge_json.py merged_kb.json kb1.json kb2.json kb3.json

3. Start ROS2 Node

# From rag_ws directory with config
ros2 run rag_server rag_node \
  --ros-args --params-file src/config/rag_node.yaml

# Or with environment variables
ros2 run rag_server rag_node \
  --ros-args \
  -p rag_file:=$HOME/rag_baseline/knowledge_base.json \
  -p top_k:=10 \
  -p sbert_model:=paraphrase-multilingual-MiniLM-L12-v2

4. Query the Service

# Terminal 1: Start the service
ros2 run rag_server rag_node --ros-args --params-file src/config/rag_node.yaml

# Terminal 2: Call get_context service
ros2 service call /get_context rag_interfaces/srv/GetContext '{query: "What artworks are in the collection?"}'

# Call ask service (if Azure OpenAI configured)
ros2 service call /ask rag_interfaces/srv/Ask '{question: "Who painted the portrait?"}'

⚙️ Configuration

Example brawl_conf.json:

{
  "seed_urls": [
    "https://artsandculture.google.com/explore/collections/palazzo-madama"
  ],
  "include_url_patterns": [".*"],
  "exclude_url_patterns": [],
  "discovery_url_patterns": [
    "https://artsandculture\\.google\\.com/asset/.*"
  ],
  "non_recursive_seed_url_patterns": [
    ".*catalog.*\\?.*pag=\\d+"
  ],
  "max_pages": 1000,
  "max_depth": 3,
  "workers": 4,
  "request_delay_seconds": 1.0,
  "request_jitter_seconds": 0.5,
  "max_retries": 3,
  "retry_backoff_base_seconds": 2.0,
  "retry_backoff_max_seconds": 60.0,
  "same_domain_only": true,
  "verbosity": 1
}

Example src/config/rag_node.yaml:

rag_service_node:
  ros__parameters:
    rag_file: $HOME/rag_baseline/knowledge_base.json
    top_k: 10
    chunk_size_chars: 900
    sbert_model: paraphrase-multilingual-MiniLM-L12-v2
    retrieval_method: similarity
    reranker_model: cross-encoder/ms-marco-MiniLM-L6-v2
    rerank_candidate_k: 120
    rerank_candidate_method: keyword

📊 Architecture

┌─────────────────────────────────────────────────────┐
│  Web Sources (Websites, APIs, PDFs, etc.)          │
└─────────────────┬───────────────────────────────────┘
                  │
                  ▼
        ┌─────────────────────┐
        │  Web Crawler        │
        │  (crawler.py)       │
        │  crawl4ai + asyncio │
        └────────┬────────────┘
                 │
                 ▼
        ┌──────────────────────────┐
        │  Raw Content             │
        │  (HTML, Markdown)        │
        └────────┬─────────────────┘
                 │
                 ▼
        ┌──────────────────────────────┐
        │  KB Builder                  │
        │  (rag_base_builder.py)       │
        │  - Pattern Matching          │
        │  - URL Filtering             │
        │  - Text Extraction           │
        └────────┬─────────────────────┘
                 │
                 ▼
        ┌──────────────────────────────┐
        │  Knowledge Base JSON         │
        │  {url: content, ...}         │
        └────────┬─────────────────────┘
                 │
        ┌────────┴──────────┐
        │                   │
        ▼                   ▼
    ┌────────┐      ┌──────────────────┐
    │ Merge  │      │  ROS2 RAG Node   │
    │ JSON   │      │  (rag_node.py)   │
    └────────┘      │                  │
                    │  - Embedding     │
                    │  - Chunking      │
                    │  - Caching       │
                    │  - Services      │
                    └────────┬─────────┘
                             │
                    ┌────────▼────────┐
                    │   Services:     │
                    │ • get_context   │
                    │ • ask           │
                    └─────────────────┘

💡 Key Features

Smart URL Filtering

  • Non-Recursive Seeds: Crawl pagination pages without expanding links
  • Pattern-Based Control: Regex patterns for include/exclude/discovery
  • Override Logic: Non-recursive seeds bypass exclude patterns

Semantic Search

  • Multiple Models: Support for various sentence-transformer embeddings
  • Caching: Embeddings cached with SHA256 hash for performance
  • Reranking: Optional CrossEncoder refinement for better relevance
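
A cache key derived from the corpus and model name makes the embedding cache self-invalidating; a sketch under the assumption that the node hashes both (its actual key derivation may differ):

```python
import hashlib

def embedding_cache_key(texts, model_name):
    """Derive a stable cache filename from the corpus and model,
    so embeddings are recomputed only when either changes."""
    h = hashlib.sha256()
    h.update(model_name.encode("utf-8"))
    for t in texts:
        h.update(t.encode("utf-8"))
    return f"embeddings_{h.hexdigest()[:16]}.npy"
```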

Passage Chunking

  • Smart Splitting: Chunks on sentence boundaries
  • Length Control: Configurable chunk size and minimum length
  • Quality Filter: Skips very short passages
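
Sentence-boundary chunking with a size cap and a short-passage filter can be sketched as follows (the exact splitting heuristics of the node may differ):

```python
import re

def chunk_text(text, chunk_size_chars=900, min_chars=40):
    """Pack whole sentences into chunks of at most
    chunk_size_chars, dropping very short leftovers."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > chunk_size_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip() if current else s
    if len(current) >= min_chars:
        chunks.append(current)
    return chunks
```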

ROS2 Integration

  • Parameter Injection: Override configs via --params-file or CLI
  • Environment Variables: Supports $HOME and ~ in paths
  • Service Endpoints: Standard ROS2 service interface
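
Expanding `$HOME` and `~` in parameters like `rag_file` is two standard-library calls (a sketch; the node's exact expansion order may differ):

```python
import os

def resolve_path(path: str) -> str:
    """Expand environment variables ($HOME) and a leading ~
    so params-file paths stay machine-independent."""
    return os.path.expanduser(os.path.expandvars(path))
```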

📝 Output Format

Knowledge Base JSON

{
  "https://example.com/page1": "Extracted text content from page 1...",
  "https://example.com/page2": "Extracted text content from page 2...",
  ...
}

get_context Response

{
  "passages": [
    {
      "url": "https://example.com/page1",
      "text": "Relevant passage text...",
      "score": 0.87
    },
    ...
  ]
}

ask Response

{
  "answer": "Generated answer from LLM with retrieved context...",
  "context_used": "URL: https://example.com/page1 - Text: ...",
  "confidence": "high"
}

🔧 Development

Project Structure

rag_baseline/
├── crawler.py                    # Web crawler
├── rag_base_builder.py          # KB builder
├── merge_json.py                # JSON merger
├── brawl_conf.json              # Config example
├── rag_base_*.json              # Knowledge base files
└── rag_ws/                      # ROS2 workspace
    └── src/
        ├── rag_server/          # ROS2 node
        └── rag_interfaces/      # ROS2 message definitions

⚠️ Notes

  • Rate Limiting: Respect website terms of service; adjust request_delay_seconds
  • Anti-Bot Detection: Some sites may block automated crawling; adjust headers
  • Storage: Large crawls can produce multi-MB JSON files; consider splitting them into multiple runs and combining the results with merge_json.py
  • Azure OpenAI: Required for ask service; set environment variables

📦 Dependencies

  • crawl4ai>=0.8.0 - Web scraping
  • sentence-transformers - Semantic embeddings
  • numpy - Numerical operations
  • rclpy - ROS2 Python client (for node)
  • azure-identity (optional) - Azure OpenAI auth

About

This repository is a baseline implementation of a RAG pipeline for refining LLM context.
