This directory contains example scripts for working with the sqlrooms-rag package.
`prepare_duckdb_docs.py` downloads the official DuckDB documentation from GitHub and prepares embeddings.
Basic Usage:

```bash
uv run python prepare_duckdb_docs.py
```

See all options:

```bash
uv run python prepare_duckdb_docs.py --help
```

Options:
```text
--docs-dir DIR      Directory to download docs (default: ./downloaded-docs/duckdb)
--version VERSION   DuckDB docs version (default: stable)
--skip-download     Skip download, use existing docs
-o, --output FILE   Output database file (default: ./generated-embeddings/duckdb_docs.duckdb)
--chunk-size SIZE   Chunk size in tokens (default: 512)
--model MODEL       Embedding model (default: BAAI/bge-small-en-v1.5)
--embed-dim DIM     Embedding dimensions (default: 384)
--clean             Remove downloaded docs after processing
```

Examples:
```bash
# Download and prepare with defaults
uv run python prepare_duckdb_docs.py

# Custom output location
uv run python prepare_duckdb_docs.py -o ./my-embeddings/duckdb.duckdb

# Use existing docs (skip download)
uv run python prepare_duckdb_docs.py --skip-download --docs-dir ./my-docs

# Clean up docs after processing
uv run python prepare_duckdb_docs.py --clean

# Use a different model
uv run python prepare_duckdb_docs.py \
  --model "sentence-transformers/all-MiniLM-L6-v2" \
  --embed-dim 384
```

`prepare_duckdb_docs.sh` is a simplified bash script for preparing DuckDB documentation.
Usage:

```bash
# Make executable (first time only)
chmod +x prepare_duckdb_docs.sh

# Run with defaults
./prepare_duckdb_docs.sh

# Specify docs directory and output
./prepare_duckdb_docs.sh ./my-docs ./my-embeddings/duckdb.duckdb
```

Requirements:

- Node.js (for `npx degit`)
- `uv` and the sqlrooms-rag package
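The `--chunk-size` option above controls how many tokens go into each chunk before embedding. A minimal sketch of fixed-size chunking with optional overlap, using whitespace-separated words as a crude stand-in for real model tokens (an approximation; the actual scripts count tokens differently):

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into fixed-size chunks, with `overlap` tokens shared between neighbors."""
    tokens = text.split()  # crude stand-in for a real tokenizer
    step = max(1, chunk_size - overlap)
    return [
        " ".join(tokens[start:start + chunk_size])
        for start in range(0, len(tokens), step)
    ]

# 1200 words with chunk_size=512 and no overlap -> 3 chunks (512, 512, 176 words)
chunks = chunk_text("word " * 1200, chunk_size=512, overlap=0)
```

Overlap keeps a sentence that straddles a chunk boundary retrievable from either side, at the cost of some duplicated storage.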
`example_query.py` demonstrates querying the knowledge base using llama-index's high-level API.
Usage:

```bash
# Query the generated embeddings
uv run python example_query.py
```

What it does:
- Loads the embeddings database
- Runs predefined example queries
- Shows similarity scores and retrieved text
- Displays source metadata
Features:
- Uses llama-index VectorStoreIndex
- Loads embedding model (BAAI/bge-small-en-v1.5)
- Retrieves top-3 most similar chunks
- Easy to modify for custom queries
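Retrieving the "top-3 most similar chunks" boils down to ranking stored vectors by cosine similarity and keeping the best k. A stdlib-only sketch of that ranking (the data and function names here are illustrative, not the package's actual implementation):

```python
import heapq
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_chunks(query_vec, chunks, k=3):
    # chunks: list of (text, embedding) pairs; returns the best k as (score, text)
    return heapq.nlargest(
        k, ((cosine_similarity(query_vec, emb), text) for text, emb in chunks)
    )

chunks = [
    ("window functions", [1.0, 0.0]),
    ("json functions", [0.0, 1.0]),
    ("s3 access", [0.6, 0.8]),
]
hits = top_k_chunks([1.0, 0.0], chunks, k=2)
# hits[0] is the exact match: (1.0, "window functions")
```

The real scripts do this with 384-dimensional model embeddings instead of toy 2-d vectors, but the ranking logic is the same.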
`query_duckdb_docs.py` tests querying the prepared DuckDB documentation embeddings.
Basic Usage:

```bash
# Run predefined test queries
uv run python query_duckdb_docs.py

# Query with a specific question
uv run python query_duckdb_docs.py "What is a window function?"

# Get more results
uv run python query_duckdb_docs.py --top-k 10 "How to use JSON?"
```

What it does:
- Verifies the embeddings database exists
- Shows database statistics
- Tests queries with formatted output
- Runs predefined test queries interactively
Perfect for:
- Testing your embeddings after preparation
- Verifying query results
- Learning how to query the documentation
`query_duckdb_direct.py` queries the embeddings database directly using DuckDB SQL, without llama-index.
Basic Usage:

```bash
# Explore database and run batch queries
uv run python query_duckdb_direct.py

# Query with a specific question
uv run python query_duckdb_direct.py "How do I use window functions?"
```

What it does:
- Shows database schema and statistics
- Generates embeddings using sentence-transformers
- Runs SQL queries with `array_cosine_similarity()`
- Returns results sorted by similarity

Functions:

- `explore_database()` - Show schema and stats
- `query_embeddings_db()` - Single query with embedding
- `batch_query()` - Multiple queries efficiently
`generate_umap_example.py` is an example of using the sqlrooms-rag API to generate UMAP embeddings programmatically.
Basic Usage:

```bash
uv run python generate_umap_example.py
```

What it does:

- Demonstrates programmatic API usage
- Loads embeddings using `load_embeddings_from_duckdb()`
- Processes with `process_embeddings()`
- Saves using `save_to_parquet()`
Perfect for:
- Integrating UMAP into your own scripts
- Customizing the UMAP workflow
- Batch processing multiple databases
Note: For CLI usage, use `uv run generate-umap-embeddings` instead.
```bash
# Download and prepare embeddings
uv run python prepare_duckdb_docs.py
```

This creates `generated-embeddings/duckdb_docs.duckdb` (~10-20MB).
```bash
# Run interactive test queries
uv run python query_duckdb_docs.py

# Or ask a specific question
uv run python query_duckdb_docs.py "What is a window function?"
```

Quick test (recommended first):

```bash
uv run python query_duckdb_docs.py "How do I use JSON functions?"
```

Using llama-index:

```bash
uv run python example_query.py
```

Using DuckDB directly:

```bash
uv run python query_duckdb_direct.py "What is a window function?"
uv run python query_duckdb_direct.py "How to connect to S3?"
```

After preparing the embeddings, use them in your SQLRooms app:
```typescript
import {createRoomStore} from '@sqlrooms/room-store';
import {createDuckDbSlice} from '@sqlrooms/duckdb';
import {createRagSlice} from '@sqlrooms/ai-rag';

const store = createRoomStore({
  config: {name: 'my-app'},
  slices: [
    createDuckDbSlice(),
    createRagSlice({
      embeddingsDatabases: [
        {
          databaseFilePathOrUrl: '/path/to/duckdb_docs.duckdb',
          databaseName: 'duckdb_docs',
        },
      ],
    }),
  ],
});

// Initialize
await store.getState().db.initialize();
await store.getState().rag.initialize();

// Query (you need to provide the embedding vector)
const embedding = await generateEmbedding('How do I use DuckDB?');
const results = await store.getState().rag.queryEmbeddings(embedding, {
  topK: 5,
});
```

You can use the same approach for any documentation:
```bash
# Example: Prepare React docs
npx giget gh:facebook/react/docs ./react-docs
uv run prepare-embeddings ./react-docs -o ./embeddings/react.duckdb

# Process any local markdown files
uv run prepare-embeddings /path/to/docs -o ./embeddings/my_docs.duckdb

# Prepare multiple doc sets
uv run python prepare_duckdb_docs.py -o ./embeddings/duckdb.duckdb
npx giget gh:nodejs/node/doc ./node-docs
uv run prepare-embeddings ./node-docs -o ./embeddings/node.duckdb
npx giget gh:microsoft/TypeScript/docs ./ts-docs
uv run prepare-embeddings ./ts-docs -o ./embeddings/typescript.duckdb
```

Then query them all:
```typescript
createRagSlice({
  embeddingsDatabases: [
    {
      databaseFilePathOrUrl: './embeddings/duckdb.duckdb',
      databaseName: 'duckdb',
    },
    {databaseFilePathOrUrl: './embeddings/node.duckdb', databaseName: 'nodejs'},
    {
      databaseFilePathOrUrl: './embeddings/typescript.duckdb',
      databaseName: 'ts',
    },
  ],
});
```

Install Node.js from https://nodejs.org/
```bash
cd python/sqlrooms-rag
uv sync
```

Make sure you've run the preparation script first:

```bash
uv run python prepare_duckdb_docs.py
```

The first run downloads the embedding model (~100-500MB) and caches it locally. Subsequent runs are much faster.

Try reducing the chunk size or processing fewer files at once:

```bash
uv run python prepare_duckdb_docs.py --chunk-size 256
```

- Main README - Package documentation
- QUERYING.md - SQL query reference
- PUBLISHING.md - Publishing guide
- @sqlrooms/ai-rag Package - TypeScript package