Embedding Export

Planned

This feature is planned but not yet implemented. Track progress at oxidoc-lab/oxidoc#1.

A standalone command that generates embeddings from your documentation and exports them in standard formats — for RAG pipelines, vector databases, and custom AI tools.

Today, embeddings are generated only as part of oxidoc build, in a JSON format designed for the browser. oxidoc embed will make Oxidoc a first-class tool for making documentation AI-ready, with proper export formats and chunking strategies.

Proposed Usage

# Default: JSON output, one embedding per page
oxidoc embed

# Choose export format
oxidoc embed --format jsonl
oxidoc embed --format safetensors
oxidoc embed --format npy
oxidoc embed --format parquet

# Custom output directory
oxidoc embed -o embeddings/

# Use a custom embedding model
oxidoc embed --model ./models/multilingual-e5-small.gguf

# Control chunking granularity
oxidoc embed --chunk section      # One embedding per heading section
oxidoc embed --chunk paragraph    # One embedding per paragraph

# Combine options
oxidoc embed --format jsonl --chunk section --model ./models/custom.gguf -o rag/

Export Formats

Format        Extension       Best For
JSON          .json           Quick inspection, small sites, JavaScript pipelines
JSONL         .jsonl          Streaming ingestion, large sites, line-by-line processing
NumPy         .npy + .json    Python ML workflows (np.load())
Safetensors   .safetensors    HuggingFace ecosystem, safe tensor sharing
Parquet       .parquet        Analytics (pandas, polars, DuckDB), columnar queries

JSON (default)

Same schema as the current search-vectors.json, with added model metadata:

{
  "dimension": 384,
  "model": "bge-micro-v2",
  "chunk_strategy": "page",
  "documents": [
    {
      "id": 0,
      "title": "Installation",
      "path": "/docs/installation",
      "text": "Install Oxidoc with a single command...",
      "anchor": null,
      "headings": [...]
    }
  ],
  "vectors": [[0.023, -0.041, 0.089, "..."]]
}

JSONL

One JSON object per document — streaming-friendly for large doc sites:

{"id":0,"title":"Installation","path":"/docs/installation","text":"...","vector":[0.023,-0.041,...]}
{"id":1,"title":"Quickstart","path":"/docs/quickstart","text":"...","vector":[0.015,0.089,...]}

Safetensors

A single .safetensors file loadable with the HuggingFace safetensors Python package:

from safetensors import safe_open

with safe_open("embeddings.safetensors", framework="numpy") as f:
    vectors = f.get_tensor("vectors")      # (N, 384) float32
    meta = f.metadata()                    # model, dimension, chunk strategy

NumPy

Two files: embeddings.npy (float32 matrix) + metadata.json (document info):

import numpy as np
import json

vectors = np.load("embeddings.npy")        # (N, 384) float32
with open("metadata.json") as f:
    docs = json.load(f)["documents"]

Parquet

Columnar format with all data in one file:

import polars as pl

df = pl.read_parquet("embeddings.parquet")
# Columns: id, title, path, text, anchor, vector (list[f32])

Chunking Strategies

Control how documents are split before embedding:

Page (default)

One embedding per page. Simple and effective for small-to-medium documentation sites.

Section

Split at heading boundaries (H2, H3). Each section gets its own embedding with an anchor field linking directly to that section. Better retrieval granularity for long pages.

Page "Configuration" splits into:
  → "Configuration > Project"      (anchor: #project)
  → "Configuration > Theme"        (anchor: #theme)
  → "Configuration > Routing"      (anchor: #routing)
  → "Configuration > Search"       (anchor: #search)

Paragraph

The most granular option: each paragraph gets its own embedding. This produces many more vectors but enables precise retrieval. Each chunk includes a character offset for reconstructing surrounding context.
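
A minimal sketch of paragraph chunking with offsets, assuming an offset field name (the proposal only says an offset is included):

```python
# Split on blank lines and record where each paragraph starts in the
# original text, so callers can slice out surrounding context later.
def chunk_paragraphs(text: str):
    chunks, pos = [], 0
    for para in text.split("\n\n"):
        if para.strip():
            chunks.append({"text": para.strip(), "offset": pos})
        pos += len(para) + 2  # +2 accounts for the "\n\n" separator
    return chunks

text = "First paragraph.\n\nSecond paragraph."
print(chunk_paragraphs(text))
```

Given a hit on the second chunk, `text[chunk["offset"]:]` recovers the paragraph plus everything after it, which is enough to rebuild a context window around the match.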

Use Cases

RAG Chatbot

Embed your docs, store in ChromaDB or Pinecone, and build an "ask the docs" chatbot that answers questions with citations.
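
Before wiring up a real vector database, the retrieval step can be sketched with plain cosine similarity over the exported vectors. The two-dimensional vectors here are toy values, not real embeddings:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy in-memory "index" -- a stand-in for ChromaDB or Pinecone
docs = [
    {"title": "Installation", "vector": [0.9, 0.1]},
    {"title": "Theming", "vector": [0.1, 0.9]},
]
query_vec = [0.8, 0.2]  # in practice: embed the question with the same model
best = max(docs, key=lambda d: cosine(query_vec, d["vector"]))
print(best["title"])  # the retrieved page is fed to the LLM as context
```

The real pipeline replaces the linear scan with a vector-database query, but the shape is the same: embed the question, find the nearest chunks, and pass them to the model with their paths as citations.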

CI Pipeline

Run oxidoc embed on every docs commit in CI. Push fresh embeddings to your vector database automatically.

Multi-Project Search

Combine embeddings from multiple Oxidoc sites into one search index for unified cross-project search.
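
One way this could work, assuming the JSON export schema above: concatenate the documents and vectors arrays, re-assign ids, and tag each record with its source project. A hypothetical sketch:

```python
# Merge multiple hypothetical JSON exports into one index, keeping ids
# unique and recording which site each document came from.
def merge_exports(exports):
    merged_docs, merged_vecs = [], []
    for project, export in exports.items():
        for doc, vec in zip(export["documents"], export["vectors"]):
            merged_docs.append({**doc, "id": len(merged_docs), "project": project})
            merged_vecs.append(vec)
    return {"documents": merged_docs, "vectors": merged_vecs}

site_a = {"documents": [{"id": 0, "title": "Install"}], "vectors": [[0.1, 0.2]]}
site_b = {"documents": [{"id": 0, "title": "Deploy"}], "vectors": [[0.3, 0.4]]}
index = merge_exports({"site-a": site_a, "site-b": site_b})
```

This assumes all sites were embedded with the same model and dimension; mixing models would make the vectors incomparable.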

Enterprise Search

Feed embeddings into your existing enterprise search infrastructure (Elasticsearch, OpenSearch, Vespa).

Current Workaround

Until oxidoc embed ships, you can extract embeddings from the build output:

# Build with semantic search enabled
oxidoc build

# Embeddings are in dist/search-vectors.json
cp dist/search-vectors.json ./embeddings.json

See Semantic Search > Embedding Output Format for the schema and code examples for ingesting this file into vector databases.