
# FastPlaid


FastPlaid - High-Performance Multi-Vector Search with WASM Support


## ⭐️ Overview

FastPlaid implements efficient multi-vector search for ColBERT-style models. Unlike traditional single-vector search, multi-vector approaches maintain token-level embeddings for fine-grained similarity matching.

Key Features:

- 🚀 **WASM Support** - Browser-native search with mxbai-edge-colbert-v0-17m (48-dim embeddings)
- ⚡ **4-bit Quantization + IVF** - 8x compression, 3-5x faster search
- 🔄 **Incremental Updates** - Add documents without a full rebuild (NEW!)
- 🎯 **MaxSim Search** - Token-level late interaction for accurate retrieval
- 📦 **Pure Rust** - Fast, safe, and portable
- 🗂️ **Offline Index Building** - Pre-compute indexes for instant browser loading

πŸ—οΈ Architecture

FastPlaid has two implementations for different use cases:

Component Purpose Use Case
Native Rust (search/, index/) Full PLAID with Product Quantization Python bindings, CLI, server-side
WASM (lib_wasm_quantized.rs) Lightweight 4-bit + IVF Browser demos, GitHub Pages

Why two implementations?

  • Native uses Candle (PyTorch-like) tensors for full PLAID algorithm
  • WASM uses pure Rust for browser compatibility (no Candle in WASM)
  • Both share the same 4-bit quantization codec

πŸ“– See OFFLINE_INDEX_GUIDE.md for detailed architecture and workflows

πŸ’» Installation

Python Package

pip install fast-plaid

PyTorch Compatibility:

FastPlaid PyTorch Command
1.2.4.280 2.8.0 pip install fast-plaid==1.2.4.280
1.2.4.271 2.7.1 pip install fast-plaid==1.2.4.271

WASM Demo

cd docs
python3 serve.py
# Visit http://localhost:8000/

### Offline Index Building

```bash
# 1. Compute embeddings (Python)
python scripts/build_offline_wasm_index.py \
    --papers data/papers_1000.json \
    --output docs/data

# 2. Build .fastplaid index (Node.js + WASM)
node scripts/build_fastplaid_index.js \
    docs/data \
    docs/data/index.fastplaid

# 3. Deploy to browser
# index.fastplaid: 6.2 MB, loads in <1s
```

📖 See OFFLINE_INDEX_GUIDE.md for complete workflows

## 🎯 Quick Start

### Python API

```python
from fast_plaid import FastPlaid

# Initialize with ColBERT embeddings (48-dim token vectors)
index = FastPlaid(dim=48, nbits=4)  # 4-bit quantization

# Add documents (shape: [num_docs, max_tokens, 48])
index.add(doc_embeddings)

# Search (shape: [num_queries, query_tokens, 48])
scores = index.search(query_embeddings, k=10)
```

### WASM Browser Demo

```javascript
// Load model
const colbert = new ColBERT(
    modelWeights, dense1Weights, dense2Weights,
    tokenizer, config, stConfig,
    dense1Config, dense2Config, tokensConfig, 32
);

// Encode and search
const queryEmb = await colbert.encode({sentences: [query], is_query: true});
const results = await fastPlaid.search(queryEmb, 10);

// Incremental updates (NEW!)
const newDocEmb = await colbert.encode({sentences: [newDoc], is_query: false});
fastPlaid.update_index_incremental(newDocEmb, newDocInfo);
```

## 🔄 Incremental Index Updates

FastPlaid now supports adding documents without rebuilding the entire index:

```javascript
// Create initial index
fastPlaid.load_documents_quantized(embeddings, docInfo, 256);

// Add new documents incrementally (8x faster than a full rebuild)
fastPlaid.update_index_incremental(newEmbeddings, newDocInfo);

// Check statistics
const info = JSON.parse(fastPlaid.get_index_info());
console.log(`${info.num_documents} docs, ${info.pending_deltas} deltas`);

// Manual compaction (optional - auto-compacts at 10%)
fastPlaid.compact_index();
```

Performance:

- **8.3x faster** for small batches (<100 docs)
- **2.7x faster** for large batches (1000 docs)
- Auto-compaction when pending deltas exceed 10% of the index
- <5% search overhead while deltas are pending

πŸ“– See INCREMENTAL_UPDATES.md for full API documentation

πŸ—οΈ Architecture

Multi-Vector Pipeline

Text β†’ Tokenizer β†’ ModernBERT (256d) β†’ 1_Dense (512d) β†’ 2_Dense (48d) β†’ MaxSim Search

Key Components:

  • ModernBERT: 17M parameter encoder
  • 2_Dense Projection: 256β†’512β†’48 dimensions (10.6x compression)
  • 4-bit Quantization: Additional 8x storage savings
  • MaxSim Scoring: score = Ξ£ max(q_token Β· d_token) per query token
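The MaxSim formula can be sketched in a few lines of plain Python (illustrative only; the real scoring runs in Rust/WASM):

```python
def maxsim(query_tokens, doc_tokens):
    """For each query token, take the best dot-product against any
    document token, then sum over query tokens (late interaction)."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Toy example: 2 query tokens, 3 doc tokens, 4-dim embeddings
q = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
d = [[0.9, 0.0, 0.0, 0.0], [0.0, 0.8, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
print(maxsim(q, d))  # 0.9 + 0.8 = 1.7
```

A document only needs one strong token-level match per query token to score well, which is what makes late interaction more precise than comparing single pooled vectors.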

### WASM Implementation

- **Model**: mixedbread-ai/mxbai-edge-colbert-v0-17m
- **Runtime**: runs entirely in the browser (no server)
- **Index Size**: ~2.7 MB for 200 documents (48-dim, 4-bit)
- **Search Speed**: <50ms for 1000 documents

πŸ“Š Performance

Index Size Comparison (200 documents)

Method Dimensions Size Compression
Without 2_Dense 512 ~28.6 MB 1x
With 2_Dense 48 ~2.7 MB 10.6x
With 2_Dense + 4-bit 48 ~0.7 MB 40x
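The first two rows are roughly reproducible from the raw embedding payload alone. A back-of-the-envelope sketch (the ~70 tokens per document is an assumption inferred from the table, float32 is assumed for the unquantized rows, and the 4-bit row carries centroid/metadata overhead beyond this simple math):

```python
def index_mb(num_docs, tokens_per_doc, dim, bits_per_value):
    """Raw embedding payload in megabytes (ignores index metadata)."""
    return num_docs * tokens_per_doc * dim * bits_per_value / 8 / 1e6

docs, toks = 200, 70  # assumed average token count per document
print(index_mb(docs, toks, 512, 32))  # 28.672 - matches ~28.6 MB without 2_Dense
print(index_mb(docs, toks, 48, 32))   # 2.688  - matches ~2.7 MB with 2_Dense
print(index_mb(docs, toks, 48, 4))    # 0.336  - raw 4-bit payload
```

The 4-bit row's ~0.7 MB total is larger than the 0.34 MB raw payload because the quantized index also stores centroids and per-segment scale metadata.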

### Speed Benchmarks

- **Encoding**: ~50ms per document (WASM)
- **Search**: ~10ms for 100 docs, ~50ms for 1000 docs
- **Index Build**: ~500ms for 200 documents

πŸ”§ WASM Build

The WASM package includes both FastPlaid indexing and ColBERT model inference:

# Quick build (recommended)
./build_wasm.sh

# Or manual build:
# 1. Build pylate-rs with 2_Dense support
cd pylate-rs
cargo build --lib --release --target wasm32-unknown-unknown \
    --no-default-features --features wasm

# 2. Generate bindings
cargo install wasm-bindgen-cli --version 0.2.104
wasm-bindgen target/wasm32-unknown-unknown/release/pylate_rs.wasm \
    --out-dir pkg --target web

# 3. Build FastPlaid WASM
cd ..
RUSTFLAGS="-C target-feature=+simd128" wasm-pack build --target web --out-dir docs/pkg --release

# 4. Fix WASM table limits (required for v1.3.0+)
python3 fix_wasm_table.py

Output:

- `pylate_rs_bg.wasm` (4.9 MB) - ColBERT model + 2_Dense
- `fast_plaid_rust_bg.wasm` (171 KB) - indexing + search with incremental updates

**Note**: The table-fix step is required for v1.3.0+ to support the incremental update methods. See WASM_TABLE_FIX.md for details.

## 🎨 Demo Features

### 1. Real-Time Search (index.html)

- Load the mxbai-edge-colbert-v0-17m model
- Index 100 documents
- Interactive search with result highlighting
- Performance metrics display

### 2. Paper Search (papers-demo.html)

- Adjustable dataset size (10-1000 papers)
- Compare FastPlaid vs. direct MaxSim
- Index size visualization
- Search method toggle

### 3. Method Comparison

- **FastPlaid (Indexed)**: 4-bit quantized, ~7 KB for 10 docs
- **Direct MaxSim**: full precision, ~57 KB for 10 docs
- **Speedup**: 2-5x faster with FastPlaid for 100+ documents

πŸ“ Project Structure

fast-plaid/
β”œβ”€β”€ rust/                  # Core Rust implementation
β”‚   β”œβ”€β”€ lib.rs            # FastPlaid index
β”‚   └── lib_wasm.rs       # WASM bindings
β”œβ”€β”€ docs/                 # Browser demos (GitHub Pages)
β”‚   β”œβ”€β”€ index.html        # Main demo
β”‚   β”œβ”€β”€ build-index.html  # Index builder
β”‚   β”œβ”€β”€ mxbai-integration.js  # ColBERT integration
β”‚   └── node_modules/     # WASM modules
β”œβ”€β”€ python/               # Python bindings
└── README.md            # This file

πŸ”¬ Technical Details

2_Dense Support

FastPlaid uses pylate-rs with full 2_Dense layer support for mxbai-edge-colbert-v0-17m:

Architecture:

  1. 1_Dense: 256 β†’ 512 (expansion for representation)
  2. 2_Dense: 512 β†’ 48 (compression for efficiency)

Benefits:

  • Correct 48-dim output (not 512)
  • 10.6x smaller indexes
  • Matches official model specifications
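At the shape level, the two projections are just chained linear layers. A pure-Python sketch for illustration (real weights come from the model checkpoint; random weights and omitted bias terms are assumptions here):

```python
import random

def linear(x, w):
    """Bias-free linear layer: y[j] = sum_i x[i] * w[i][j]."""
    return [sum(xi * wi[j] for xi, wi in zip(x, w)) for j in range(len(w[0]))]

random.seed(0)
w1 = [[random.gauss(0, 0.02) for _ in range(512)] for _ in range(256)]  # 1_Dense: 256 -> 512
w2 = [[random.gauss(0, 0.02) for _ in range(48)] for _ in range(512)]   # 2_Dense: 512 -> 48

token = [random.gauss(0, 1.0) for _ in range(256)]  # one ModernBERT token embedding
out = linear(linear(token, w1), w2)
print(len(out))  # 48
```

Every token embedding passes through both layers, so the stored index holds 48-dim vectors rather than the encoder's native 256 (or an expanded 512), which is where the 10.6x size reduction comes from.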

### Quantization

4-bit quantization maps each value onto 16 levels between the segment's minimum and maximum:

```rust
// Quantize to 4 bits (16 levels)
let quantized: Vec<u8> = embeddings
    .iter()
    .map(|x| ((x - min) / (max - min) * 15.0) as u8)
    .collect();

// Dequantize for search
let reconstructed: Vec<f32> = quantized
    .iter()
    .map(|&q| min + (q as f32 / 15.0) * (max - min))
    .collect();
```

Trade-offs:

- **Storage**: 8x smaller
- **Speed**: ~10% faster (less memory bandwidth)
- **Quality**: <2% accuracy loss
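The small quality loss follows from the bounded round-trip error: truncating to 16 levels reconstructs every value to within one quantization step, `(max - min) / 15`. A quick Python check of that bound (toy values; the Rust snippet above is the reference):

```python
def quantize(xs):
    """Min/max 4-bit quantization: map each value to one of 16 levels.
    int() truncates, mirroring the `as u8` cast in the Rust snippet."""
    lo, hi = min(xs), max(xs)
    return [int((x - lo) / (hi - lo) * 15.0) for x in xs], lo, hi

def dequantize(qs, lo, hi):
    return [lo + (q / 15.0) * (hi - lo) for q in qs]

xs = [-0.31, -0.04, 0.0, 0.13, 0.47]
qs, lo, hi = quantize(xs)
err = max(abs(a - b) for a, b in zip(xs, dequantize(qs, lo, hi)))
print(err <= (hi - lo) / 15)  # True: error bounded by one step
```

Because MaxSim sums many token-level dot products, these small per-value errors largely average out, which is consistent with the <2% accuracy loss reported above.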

πŸš€ Deployment

GitHub Pages

The WASM demo can be deployed to GitHub Pages:

# Build for production
cd demo
./build-prod.sh

# Deploy
git add .
git commit -m "Update demo"
git push origin main

Limitations:

  • Max file size: 100MB (GitHub Pages limit)
  • Total site size: <1GB recommended
  • Use 4-bit quantization for large datasets

Local Development

cd demo
python3 serve.py  # http://localhost:8000/

πŸ”— Resources

πŸ“ Recent Updates

v5.0 (2025-01-22):

  • βœ… Full 2_Dense support (48-dim embeddings)
  • βœ… 4-bit quantization (8x compression)
  • βœ… WASM demo with real ColBERT model
  • βœ… Query expansion support
  • βœ… Index size comparison UI
  • βœ… Adjustable dataset size

Previous:

  • SIMD optimizations
  • Offline index caching
  • PLAID implementation
  • Python/Rust bindings

## 🤝 Contributing

Contributions welcome! Key areas:

- Performance optimizations
- Additional quantization methods
- More demo examples
- Documentation improvements

πŸ“„ License

MIT License - see LICENSE file for details


Status: Production Ready | WASM: 4.9MB | Embedding Dim: 48 | Model: mxbai-edge-colbert-v0-17m
