A production-style semantic search REST API. Ingest text documents (or URLs), chunk and embed them with OpenAI, store vectors in PostgreSQL + pgvector, and query with natural language.
```bash
cp .env.example .env
# Set OPENAI_API_KEY and API_KEYS in .env, then:
docker-compose up
```

The API will be available at http://localhost:3000.
Edit .env before starting:
```
OPENAI_API_KEY=sk-...          # Required: your OpenAI API key
API_KEYS=key-dev-123,key-prod  # Required: comma-separated bearer tokens
PORT=3000
RATE_LIMIT_RPM=60
CHUNK_SIZE=400
CHUNK_OVERLAP=80
```

All requests (except GET /health) require the header:

```
Authorization: Bearer <your-api-key>
```
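The examples below use curl; from TypeScript (assuming Node 18+ with global `fetch`), an equivalent authenticated call might look like this sketch. The query parameters mirror the /search examples later in this README; note that the JSON `filters` value must be URL-encoded, which `URLSearchParams` handles for you.

```typescript
// Build an authenticated /search request URL. URLSearchParams takes care
// of URL-encoding the JSON `filters` value.
const params = new URLSearchParams({
  q: "rate limiting",
  namespace: "docs",
  topK: "3",
  filters: JSON.stringify({ source: "confluence" }),
});
const url = `http://localhost:3000/search?${params}`;

// Against a running instance:
// const res = await fetch(url, {
//   headers: { Authorization: "Bearer key-dev-123" },
// });
// const body = await res.json();
```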
Chunk, embed, and store a text document.
```bash
curl -X POST http://localhost:3000/ingest \
  -H "Authorization: Bearer key-dev-123" \
  -H "Content-Type: application/json" \
  -d '{
    "content": "Rate limiting is a technique used to control the rate of requests...",
    "namespace": "docs",
    "metadata": { "source": "confluence", "author": "ayush" },
    "chunkSize": 400,
    "chunkOverlap": 80
  }'
```

Response:

```json
{ "inserted": 3, "namespace": "docs", "ids": ["uuid1", "uuid2", "uuid3"] }
```

Fetch a URL, extract its text, and ingest it.
```bash
curl -X POST http://localhost:3000/ingest/url \
  -H "Authorization: Bearer key-dev-123" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/article",
    "namespace": "web",
    "metadata": { "tag": "research" }
  }'
```

Semantic similarity search.
```bash
curl "http://localhost:3000/search?q=how+to+handle+rate+limiting&namespace=docs&topK=5" \
  -H "Authorization: Bearer key-dev-123"
```

With metadata filter:

```bash
curl "http://localhost:3000/search?q=rate+limiting&namespace=docs&topK=3&filters=%7B%22source%22%3A%22confluence%22%7D&minScore=0.7" \
  -H "Authorization: Bearer key-dev-123"
```

Response:
```json
{
  "results": [
    {
      "id": "uuid",
      "content": "chunk text...",
      "score": 0.89,
      "metadata": { "source": "confluence" },
      "namespace": "docs",
      "createdAt": "2025-11-01T10:00:00Z"
    }
  ],
  "cached": false,
  "query": "how to handle rate limiting"
}
```

Delete a specific document chunk by UUID.
```bash
curl -X DELETE http://localhost:3000/documents/550e8400-e29b-41d4-a716-446655440000 \
  -H "Authorization: Bearer key-dev-123"
```

List all namespaces with document counts.
```bash
curl http://localhost:3000/namespaces \
  -H "Authorization: Bearer key-dev-123"
```

Stats for a single namespace.
```bash
curl http://localhost:3000/namespaces/docs/stats \
  -H "Authorization: Bearer key-dev-123"
```

No auth required.
```bash
curl http://localhost:3000/health
```

Documents are split into overlapping fixed-size character windows. Given `chunkSize=400` and `chunkOverlap=80`, the sliding window advances by 400 - 80 = 320 characters on each step, so consecutive chunks share 80 characters of context.
Why overlapping chunks improve retrieval quality:
Semantic search works by comparing the embedding of your query to the embedding of each stored chunk. A long document contains many ideas; a single embedding for the whole document averages them all together and loses specificity. Chunking gives each idea its own embedding, making nearest-neighbour search much more precise.
The overlap matters because a sentence that straddles a chunk boundary would be cut in half without it — losing meaning. By carrying the tail of the previous chunk into the start of the next, we ensure every sentence appears whole in at least one chunk, and that the contextual "lead-in" is preserved. This reduces retrieval failures caused purely by where the chunk boundary happened to fall.
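The windowing arithmetic above can be sketched as follows. This is illustrative only, not the service's actual implementation; `chunkText` is a hypothetical name.

```typescript
// Fixed-size character windows with overlap: each window is `chunkSize`
// characters, and the window start advances by chunkSize - chunkOverlap.
function chunkText(text: string, chunkSize = 400, chunkOverlap = 80): string[] {
  if (chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be smaller than chunkSize");
  }
  const step = chunkSize - chunkOverlap; // 320 with the defaults
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last window reached the end
  }
  return chunks;
}
```

With the defaults, a 1,000-character document yields three chunks (starting at offsets 0, 320, and 640), and each consecutive pair shares 80 characters.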
pgvector supports two approximate nearest-neighbour (ANN) index types:
| | HNSW | IVFFlat |
|---|---|---|
| Build time | Slower | Faster |
| Query speed | Fast, consistent | Fast, varies |
| Recall at low K | Very high | Good |
| Memory usage | Higher | Lower |
| Requires training | No | Yes (needs VACUUM ANALYZE + enough rows) |
| Supports concurrent inserts | Yes | Yes |
Why HNSW was chosen here:

- No training step. IVFFlat requires a list count tuned to the number of vectors present at index creation time (`lists = sqrt(rows)`). An empty or small table produces a poor IVFFlat index that must be rebuilt later. HNSW works correctly from the first insert.
- Better recall. HNSW navigates a multilevel proximity graph, achieving higher recall (fewer missed true nearest neighbours) at equivalent query latency. For a search API where result quality is the primary goal, this matters.
- Simpler operationally. IVFFlat requires periodic `VACUUM ANALYZE` and potential index rebuilds as data grows. HNSW self-organises incrementally.
The trade-off is memory: HNSW uses more RAM per vector. For very large collections (>10M vectors) and memory-constrained deployments, IVFFlat is the right choice.
The configured parameters `m=16`, `ef_construction=64` are pgvector's recommended defaults — a balanced starting point that works well for most datasets. Increase `ef_construction` (e.g. to 128) for higher recall at the cost of slower index builds.
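As a sketch, an HNSW index with these parameters would be declared like this in pgvector. The `documents` table and `embedding` column names here are assumptions, not necessarily the project's actual schema.

```sql
-- Illustrative DDL, not the project's actual migration: an HNSW index
-- over cosine distance with the build parameters discussed above.
CREATE INDEX ON documents
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);
```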