# rag_baseline

A complete Retrieval-Augmented Generation (RAG) pipeline for building knowledge bases from web content and providing semantic search and QA capabilities via ROS2 services.
This project implements an end-to-end RAG system that:
- Crawls museum/website URLs intelligently with pattern-based filtering
- Builds a URL→text JSON knowledge base from crawled content
- Indexes passages with semantic embeddings using `sentence-transformers`
- Retrieves relevant context via ROS2 service endpoints
- Augments LLM prompts with retrieved context for accurate QA
Designed for cultural institutions (e.g., Palazzo Madama museum) but adaptable to any domain.
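The stages above all revolve around one artifact: a flat URL→text JSON knowledge base. A minimal sketch of loading and inspecting one (the `load_kb` helper below is illustrative, not part of the project):

```python
import json

def load_kb(path):
    """Load a URL->text knowledge base as produced by rag_base_builder.py.

    Each key is a crawled URL; each value is the extracted page text.
    (Illustrative helper, not part of the project code.)
    """
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```
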
## Components

### crawler.py

Asynchronous web crawler using crawl4ai with intelligent content extraction.
Features:
- Concurrent multi-page crawling
- HTML to markdown/JSON conversion
- Link extraction and URL normalization
- Request headers and retry logic
- Domain-aware crawling
Usage:

```bash
python crawler.py \
  --urls https://example.com \
  --question "What are the artworks?" \
  --markdown
```

### rag_base_builder.py

Builds a comprehensive knowledge base by crawling seed URLs with pattern-based filtering.
Features:
- Seed URLs: Starting points for crawling
- Include Patterns: URLs to explicitly include
- Exclude Patterns: URLs to filter out (with non-recursive override)
- Discovery Patterns: Patterns to find and crawl
- Non-Recursive Seeds: Crawl URLs but don't expand to children
- Max Depth & Pages: Control crawl scope
- Async Workers: Parallel crawling with configurable workers
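A sketch of how these pattern rules might combine into a crawl decision (the function and its exact precedence are illustrative, not the builder's actual code):

```python
import re

def should_crawl(url: str, cfg: dict) -> bool:
    """Decide whether a URL should be fetched, per the rules above.

    Illustrative only -- rag_base_builder.py's real logic may differ.
    """
    # Non-recursive seed patterns override the exclude list:
    # the page itself is fetched, but its links are not followed.
    if any(re.search(p, url) for p in cfg.get("non_recursive_seed_url_patterns", [])):
        return True
    if any(re.search(p, url) for p in cfg.get("exclude_url_patterns", [])):
        return False
    return any(re.search(p, url) for p in cfg.get("include_url_patterns", [".*"]))
```
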
Configuration:

```json
{
  "seed_urls": ["https://example.com/collection"],
  "include_url_patterns": [".*"],
  "exclude_url_patterns": ["pag=\\d+"],
  "discovery_url_patterns": ["https://assets\\.example\\.com/.*"],
  "non_recursive_seed_url_patterns": [".*catalog.*\\?pag=\\d+"],
  "max_pages": 500,
  "max_depth": 3,
  "workers": 4,
  "request_delay_seconds": 0.5,
  "request_jitter_seconds": 0.2,
  "max_retries": 3,
  "same_domain_only": true,
  "verbosity": 2
}
```

Usage:

```bash
python rag_base_builder.py \
  --config brawl_conf.json \
  --output rag_base.json
```

### merge_json.py

Combines multiple JSON knowledge base files into one.
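Conceptually the merge is just successive dict updates over the URL→text maps, with later inputs winning on duplicate URLs (an assumed conflict policy; merge_json.py may differ):

```python
import json

def merge_kbs(output_path, *input_paths):
    """Merge several URL->text JSON files into one.

    On duplicate URLs, later inputs overwrite earlier ones
    (assumed policy; merge_json.py's actual behavior may differ).
    """
    merged = {}
    for path in input_paths:
        with open(path, encoding="utf-8") as f:
            merged.update(json.load(f))
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False, indent=2)
    return merged
```
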
Usage:

```bash
python merge_json.py output.json input1.json input2.json input3.json
```

### rag_node.py

ROS2 node providing semantic retrieval and QA services.
Services:
- `get_context` - Retrieve top-k relevant passages for a query
- `ask` - Get an LLM-augmented answer with context
Features:
- Semantic embeddings with `sentence-transformers`
- Optional reranking with `CrossEncoder`
- Text chunking with smart sentence boundaries
- Embedding caching for performance
- Environment variable support in config
Parameters (via `--params-file`):

- `rag_file` - Path to the knowledge base JSON (supports `$HOME` and `~`)
- `top_k` - Number of passages to retrieve
- `chunk_size_chars` - Text chunk size (default 900)
- `sbert_model` - Embedding model (default: paraphrase-multilingual-MiniLM-L12-v2)
- `retrieval_method` - "similarity" or "reranking"
- `reranker_model` - CrossEncoder model for reranking
- `rerank_candidate_k` - Number of candidates passed to the reranker
- `rerank_candidate_method` - "keyword" or "similarity"
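At its core, the "similarity" retrieval method reduces to a cosine top-k over passage embeddings. A numpy sketch of that step, assuming the embeddings were already produced by the configured sentence-transformers model (function name and shapes are illustrative):

```python
import numpy as np

def top_k_passages(query_emb, passage_embs, passages, k=3):
    """Return the k passages whose embeddings are closest to the query
    by cosine similarity, with their scores.

    query_emb: (d,) array; passage_embs: (n, d) array; passages: list of n texts.
    (Illustrative sketch; rag_node.py's implementation may differ.)
    """
    q = query_emb / np.linalg.norm(query_emb)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    scores = p @ q                      # cosine similarity per passage
    order = np.argsort(-scores)[:k]    # indices of the k highest scores
    return [(passages[i], float(scores[i])) for i in order]
```
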
## Installation

```bash
# Clone repository
cd ~/rag_baseline

# Install Python dependencies
pip install crawl4ai sentence-transformers numpy

# For ROS2 integration (optional)
cd rag_ws
colcon build
source install/setup.bash
```

## Usage

Build a knowledge base:

```bash
python rag_base_builder.py --config brawl_conf.json --output knowledge_base.json
```

Merge knowledge bases:

```bash
python merge_json.py merged_kb.json kb1.json kb2.json kb3.json
```

Run the ROS2 node:

```bash
# From rag_ws directory with config
ros2 run rag_server rag_node \
  --ros-args --params-file src/config/rag_node.yaml

# Or with environment variables
ros2 run rag_server rag_node \
  --ros-args \
  -p rag_file:=$HOME/rag_baseline/knowledge_base.json \
  -p top_k:=10 \
  -p sbert_model:=paraphrase-multilingual-MiniLM-L12-v2
```

Call the services:

```bash
# Terminal 1: Start the service
ros2 run rag_server rag_node --ros-args --params-file src/config/rag_node.yaml

# Terminal 2: Call get_context service
ros2 service call /get_context rag_interfaces/srv/GetContext '{query: "What artworks are in the collection?"}'

# Call ask service (if Azure OpenAI configured)
ros2 service call /ask rag_interfaces/srv/Ask '{question: "Who painted the portrait?"}'
```

## Example Configuration (brawl_conf.json)

```json
{
  "seed_urls": [
    "https://artsandculture.google.com/explore/collections/palazzo-madama"
  ],
  "include_url_patterns": [".*"],
  "exclude_url_patterns": [],
  "discovery_url_patterns": [
    "https://artsandculture\\.google\\.com/asset/.*"
  ],
  "non_recursive_seed_url_patterns": [
    ".*catalog.*\\?.*pag=\\d+"
  ],
  "max_pages": 1000,
  "max_depth": 3,
  "workers": 4,
  "request_delay_seconds": 1.0,
  "request_jitter_seconds": 0.5,
  "max_retries": 3,
  "retry_backoff_base_seconds": 2.0,
  "retry_backoff_max_seconds": 60.0,
  "same_domain_only": true,
  "verbosity": 1
}
```

## Example Parameter File (rag_node.yaml)

```yaml
rag_service_node:
  ros__parameters:
    rag_file: $HOME/rag_baseline/knowledge_base.json
    top_k: 10
    chunk_size_chars: 900
    sbert_model: paraphrase-multilingual-MiniLM-L12-v2
    retrieval_method: similarity
    reranker_model: cross-encoder/ms-marco-MiniLM-L6-v2
    rerank_candidate_k: 120
    rerank_candidate_method: keyword
```

## Architecture

```
┌─────────────────────────────────────────────────────┐
│      Web Sources (Websites, APIs, PDFs, etc.)       │
└─────────────────┬───────────────────────────────────┘
                  │
                  ▼
          ┌─────────────────────┐
          │     Web Crawler     │
          │    (crawler.py)     │
          │ crawl4ai + asyncio  │
          └────────┬────────────┘
                   │
                   ▼
        ┌──────────────────────────┐
        │       Raw Content        │
        │     (HTML, Markdown)     │
        └────────┬─────────────────┘
                 │
                 ▼
      ┌──────────────────────────────┐
      │          KB Builder          │
      │    (rag_base_builder.py)     │
      │    - Pattern Matching        │
      │    - URL Filtering           │
      │    - Text Extraction         │
      └────────┬─────────────────────┘
               │
               ▼
      ┌──────────────────────────────┐
      │     Knowledge Base JSON      │
      │     {url: content, ...}      │
      └────────┬─────────────────────┘
               │
      ┌────────┴──────────┐
      │                   │
      ▼                   ▼
  ┌────────┐        ┌──────────────────┐
  │ Merge  │        │  ROS2 RAG Node   │
  │ JSON   │        │  (rag_node.py)   │
  └────────┘        │                  │
                    │  - Embedding     │
                    │  - Chunking      │
                    │  - Caching       │
                    │  - Services      │
                    └────────┬─────────┘
                             │
                    ┌────────▼────────┐
                    │   Services:     │
                    │  • get_context  │
                    │  • ask          │
                    └─────────────────┘
```
## Key Features

Crawling:

- Non-Recursive Seeds: Crawl pagination pages without expanding links
- Pattern-Based Control: Regex patterns for include/exclude/discovery
- Override Logic: Non-recursive seeds bypass exclude patterns

Retrieval:

- Multiple Models: Support for various sentence-transformer embeddings
- Caching: Embeddings cached with a SHA256 hash for performance
- Reranking: Optional CrossEncoder refinement for better relevance

Chunking:

- Smart Splitting: Chunks on sentence boundaries
- Length Control: Configurable chunk size and minimum length
- Quality Filter: Skips very short passages

ROS2 Integration:

- Parameter Injection: Override configs via `--params-file` or CLI
- Environment Variables: Supports `$HOME` and `~` in paths
- Service Endpoints: Standard ROS2 service interface
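The sentence-boundary chunking with a length floor can be sketched as follows (a minimal illustration of the idea; rag_node.py's chunker may differ in details):

```python
import re

def chunk_text(text, max_chars=900, min_chars=40):
    """Split text into chunks of at most max_chars, preferring sentence
    boundaries; chunks shorter than min_chars are dropped.
    (Sketch of the technique; not rag_node.py's literal code.)
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)  # current chunk is full; start a new one
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    # Quality filter: skip very short passages.
    return [c for c in chunks if len(c) >= min_chars]
```
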
## Data Formats

Knowledge base JSON:

```json
{
  "https://example.com/page1": "Extracted text content from page 1...",
  "https://example.com/page2": "Extracted text content from page 2...",
  ...
}
```

`get_context` response:

```json
{
  "passages": [
    {
      "url": "https://example.com/page1",
      "text": "Relevant passage text...",
      "score": 0.87
    },
    ...
  ]
}
```

`ask` response:

```json
{
  "answer": "Generated answer from LLM with retrieved context...",
  "context_used": "URL: https://example.com/page1 - Text: ...",
  "confidence": "high"
}
```

## Project Structure

```
rag_baseline/
├── crawler.py              # Web crawler
├── rag_base_builder.py     # KB builder
├── merge_json.py           # JSON merger
├── brawl_conf.json         # Config example
├── rag_base_*.json         # Knowledge base files
└── rag_ws/                 # ROS2 workspace
    └── src/
        ├── rag_server/     # ROS2 node
        └── rag_interfaces/ # ROS2 message definitions
```
## Notes

- Rate Limiting: Respect website terms of service; adjust `request_delay_seconds`
- Anti-Bot Detection: Some sites may block automated crawling; adjust request headers
- Storage: Large crawls can produce multi-MB JSON files; consider pagination
- Azure OpenAI: Required for the `ask` service; set the relevant environment variables
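The rate-limiting settings amount to a randomized pause between requests: a base delay plus uniform jitter. A sketch of the idea, mirroring `request_delay_seconds` and `request_jitter_seconds` (not crawler.py's literal code):

```python
import random
import time

def polite_delay(base_seconds=0.5, jitter_seconds=0.2, sleep=time.sleep):
    """Pause between requests for base_seconds plus random jitter.

    Returns the delay actually used. The injectable `sleep` argument is
    just for testability; crawler.py's implementation may differ.
    """
    delay = base_seconds + random.uniform(0.0, jitter_seconds)
    sleep(delay)
    return delay
```

Jitter avoids a perfectly regular request cadence, which both spreads load and looks less like a bot.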
## Dependencies

- `crawl4ai>=0.8.0` - Web scraping
- `sentence-transformers` - Semantic embeddings
- `numpy` - Numerical operations
- `rclpy` - ROS2 Python client (for the node)
- `azure-identity` (optional) - Azure OpenAI auth