ayushkumar912/vector-search-api

A production-style semantic search REST API. Ingest text documents (or URLs), chunk and embed them with OpenAI, store vectors in PostgreSQL + pgvector, and query with natural language.

Quickstart

cp .env.example .env
# Set OPENAI_API_KEY and API_KEYS in .env, then:
docker-compose up

The API will be available at http://localhost:3000.

Configuration

Edit .env before starting:

OPENAI_API_KEY=sk-...          # Required: your OpenAI API key
API_KEYS=key-dev-123,key-prod  # Required: comma-separated bearer tokens
PORT=3000                      # Optional: HTTP port the API listens on
RATE_LIMIT_RPM=60              # Optional: per-key rate limit (requests per minute)
CHUNK_SIZE=400                 # Optional: default chunk size in characters
CHUNK_OVERLAP=80               # Optional: default chunk overlap in characters

All requests (except GET /health) require the header:

Authorization: Bearer <your-api-key>

API reference

POST /ingest

Chunk, embed, and store a text document.

curl -X POST http://localhost:3000/ingest \
  -H "Authorization: Bearer key-dev-123" \
  -H "Content-Type: application/json" \
  -d '{
    "content": "Rate limiting is a technique used to control the rate of requests...",
    "namespace": "docs",
    "metadata": { "source": "confluence", "author": "ayush" },
    "chunkSize": 400,
    "chunkOverlap": 80
  }'

Response:

{ "inserted": 3, "namespace": "docs", "ids": ["uuid1", "uuid2", "uuid3"] }

POST /ingest/url

Fetch a URL, extract its text, and ingest it.

curl -X POST http://localhost:3000/ingest/url \
  -H "Authorization: Bearer key-dev-123" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/article",
    "namespace": "web",
    "metadata": { "tag": "research" }
  }'

GET /search

Semantic similarity search.

curl "http://localhost:3000/search?q=how+to+handle+rate+limiting&namespace=docs&topK=5" \
  -H "Authorization: Bearer key-dev-123"

With metadata filter:

curl "http://localhost:3000/search?q=rate+limiting&namespace=docs&topK=3&filters=%7B%22source%22%3A%22confluence%22%7D&minScore=0.7" \
  -H "Authorization: Bearer key-dev-123"
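The `filters` parameter is URL-encoded JSON: `%7B%22source%22%3A%22confluence%22%7D` decodes to `{"source":"confluence"}`. A minimal sketch of building such a query string with Python's standard library (endpoint and parameter names are taken from the examples above):

```python
import json
from urllib.parse import urlencode

# Build the query string for GET /search with a metadata filter.
params = {
    "q": "rate limiting",
    "namespace": "docs",
    "topK": 3,
    # Compact JSON, then urlencode percent-escapes the braces and quotes.
    "filters": json.dumps({"source": "confluence"}, separators=(",", ":")),
    "minScore": 0.7,
}
query = urlencode(params)
url = f"http://localhost:3000/search?{query}"
print(url)
```

`urlencode` encodes spaces as `+` and percent-escapes the JSON, producing exactly the `filters=%7B%22source%22%3A%22confluence%22%7D` form shown in the curl example.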

Response:

{
  "results": [
    {
      "id": "uuid",
      "content": "chunk text...",
      "score": 0.89,
      "metadata": { "source": "confluence" },
      "namespace": "docs",
      "createdAt": "2025-11-01T10:00:00Z"
    }
  ],
  "cached": false,
  "query": "how to handle rate limiting"
}
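The `score` field ranks results by vector similarity. Assuming the API uses pgvector's cosine distance operator (`<=>`), a similarity score like the `0.89` above would be `1 - distance`; this sketch of the underlying computation is an illustration, not the repo's actual code:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# pgvector's <=> operator returns cosine *distance*, so a similarity
# score would be 1 - (query_vec <=> chunk_vec).
query_vec = [0.1, 0.3, 0.6]
chunk_vec = [0.2, 0.3, 0.5]
score = cosine_similarity(query_vec, chunk_vec)
```

This is also why `minScore=0.7` is a sensible quality floor: scores near 1.0 mean the chunk and query embeddings point in nearly the same direction.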

DELETE /documents/:id

Delete a specific document chunk by UUID.

curl -X DELETE http://localhost:3000/documents/550e8400-e29b-41d4-a716-446655440000 \
  -H "Authorization: Bearer key-dev-123"

GET /namespaces

List all namespaces with document counts.

curl http://localhost:3000/namespaces \
  -H "Authorization: Bearer key-dev-123"

GET /namespaces/:ns/stats

Stats for a single namespace.

curl http://localhost:3000/namespaces/docs/stats \
  -H "Authorization: Bearer key-dev-123"

GET /health

No auth required.

curl http://localhost:3000/health

Chunking strategy

Documents are split into overlapping fixed-size character windows. Given chunkSize=400 and chunkOverlap=80, the sliding window advances by 400 - 80 = 320 characters on each step, so consecutive chunks share 80 characters of context.

Why overlapping chunks improve retrieval quality:

Semantic search works by comparing the embedding of your query to the embedding of each stored chunk. A long document contains many ideas; a single embedding for the whole document averages them all together and loses specificity. Chunking gives each idea its own embedding, making nearest-neighbour search much more precise.

The overlap matters because a sentence that straddles a chunk boundary would be cut in half without it — losing meaning. By carrying the tail of the previous chunk into the start of the next, we ensure every sentence appears whole in at least one chunk, and that the contextual "lead-in" is preserved. This reduces retrieval failures caused purely by where the chunk boundary happened to fall.
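The sliding window described above can be sketched as follows (a minimal illustration of the strategy, not the repo's actual implementation):

```python
def chunk_text(text: str, chunk_size: int = 400, chunk_overlap: int = 80) -> list[str]:
    """Split text into overlapping fixed-size character windows.

    The window advances by chunk_size - chunk_overlap characters per step,
    so consecutive chunks share chunk_overlap characters of context.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last window reached the end of the text
    return chunks
```

With the defaults, a 1,000-character document yields three chunks (starting at offsets 0, 320, and 640), and each adjacent pair shares an 80-character overlap.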


Index choice: HNSW vs IVFFlat

pgvector supports two approximate nearest-neighbour (ANN) index types:

|                             | HNSW             | IVFFlat                                  |
|-----------------------------|------------------|------------------------------------------|
| Build time                  | Slower           | Faster                                   |
| Query speed                 | Fast, consistent | Fast, varies                             |
| Recall at low K             | Very high        | Good                                     |
| Memory usage                | Higher           | Lower                                    |
| Requires training           | No               | Yes (needs VACUUM ANALYZE + enough rows) |
| Supports concurrent inserts | Yes              | Yes                                      |

Why HNSW was chosen here:

  1. No training step. IVFFlat requires a list count tuned to the number of vectors present at index creation time (lists = sqrt(rows)). An empty or small table produces a poor IVFFlat index that must be rebuilt later. HNSW works correctly from the first insert.

  2. Better recall. HNSW navigates a multilevel proximity graph, achieving higher recall (fewer missed true nearest neighbours) at equivalent query latency. For a search API where result quality is the primary goal, this matters.

  3. Simpler operationally. IVFFlat requires periodic VACUUM ANALYZE and potential index rebuilds as data grows. HNSW self-organises incrementally.

The trade-off is memory: HNSW uses more RAM per vector. For very large collections (>10M vectors) and memory-constrained deployments, IVFFlat is the right choice.

The configured parameters m=16, ef_construction=64 are pgvector's recommended defaults — a balanced starting point that works well for most datasets. Increase ef_construction (e.g. to 128) for higher recall at the cost of slower index builds.
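For reference, the index definitions discussed above look like this in pgvector DDL. The table and column names used here (`documents`, `embedding`) are illustrative assumptions, not confirmed from the repo's schema:

```python
import math

# HNSW: usable from the first insert; parameters match the defaults above.
hnsw_ddl = (
    "CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops) "
    "WITH (m = 16, ef_construction = 64);"
)

# IVFFlat: lists must be tuned to the row count present at creation time,
# using the lists = sqrt(rows) rule of thumb mentioned above.
def ivfflat_ddl(row_count: int) -> str:
    lists = max(1, math.isqrt(row_count))
    return (
        "CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops) "
        f"WITH (lists = {lists});"
    )
```

The asymmetry is visible in the DDL itself: HNSW's parameters are data-independent, while IVFFlat's `lists` bakes in the table size at creation time, which is why a table that grows substantially needs its IVFFlat index rebuilt.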
