Build Stateful AI Agents with LangGraph + Beanis: RAG with Memory in 200 Lines
2025-11-11 · https://andreim14.github.io/blog/2025/build-ai-agents-langgraph-beanis

The Problem: Agent State is a Mess

You’re building an AI agent. Not just a simple chatbot - a real agent that needs to remember conversations, search through knowledge bases, and maintain state across multiple steps.

Here’s what usually happens: You start with a simple script. Then you need conversation history. So you add a list. Then you need to search documents. So you add another data structure. Then you need to persist state across restarts. So you add a database. Then you realize your code is a tangled mess of state management, database calls, and business logic all mixed together.

Sound familiar?

The Solution: LangGraph + Beanis

Here’s a better approach: use LangGraph for agent orchestration and Beanis for state management and vector storage.

LangGraph gives you a clean way to define agent workflows as graphs. Each step is a node. State flows between nodes. You can visualize it, debug it, and modify it without rewriting everything.

Beanis gives you a Redis-backed ODM (Object Document Mapper) with built-in vector search. Store documents, embeddings, conversation history, and agent state in Redis with a clean Python API. No manual serialization, no key management headaches.

Together? You get stateful AI agents that actually work in production.

What We’re Building

A RAG agent that:

  • Ingests documents from SQuAD dataset (Stanford Question Answering Dataset)
  • Stores them in Redis with vector embeddings
  • Maintains conversation history across sessions
  • Retrieves relevant context using semantic search
  • Generates responses using OpenAI
  • Orchestrates everything with LangGraph

Input/Output Example:

INPUT:  "How many students are at Notre Dame?"
OUTPUT: "In 2014, the Notre Dame student body consisted of 12,179 students."
        [Retrieved 3 relevant documents from 100 stored]

The complete code is ~200 lines. And it actually works.

Architecture

User Query
    ↓
┌─────────────────────────────────────┐
│         LangGraph Workflow          │
│                                     │
│  ┌─────────────────────────────┐   │
│  │  1. Retrieve Context        │   │
│  │     (Vector Search)         │   │
│  └────────────┬────────────────┘   │
│               ↓                     │
│  ┌─────────────────────────────┐   │
│  │  2. Load History            │   │
│  │     (From Redis)            │   │
│  └────────────┬────────────────┘   │
│               ↓                     │
│  ┌─────────────────────────────┐   │
│  │  3. Generate Response       │   │
│  │     (OpenAI + Context)      │   │
│  └────────────┬────────────────┘   │
│               ↓                     │
│  ┌─────────────────────────────┐   │
│  │  4. Save to History         │   │
│  │     (Persist in Redis)      │   │
│  └─────────────────────────────┘   │
└─────────────────────────────────────┘

Each node runs independently. State flows through the graph. If a node fails, you can retry it. Want to add a new step? Add a node and wire it up. No spaghetti code.

Step 1: Define Your Data Models

With Beanis, you define models like Pydantic classes. The magic? Vector fields and automatic indexing.

from beanis import Document, VectorField
from pydantic import Field
from typing import List, Optional
from typing_extensions import Annotated
from datetime import datetime

class KnowledgeDocument(Document):
    """Document with vector embeddings for RAG"""

    title: str
    context: str
    question: Optional[str] = None

    # Vector embedding (1536 dims for OpenAI text-embedding-3-small)
    # See: https://platform.openai.com/docs/guides/embeddings
    embedding: Annotated[List[float], VectorField(dimensions=1536)]

    source: str = "squad"
    created_at: datetime = Field(default_factory=datetime.now)

    class Settings:
        name = "knowledge_docs"


class ConversationHistory(Document):
    """Conversation history for context-aware responses"""

    session_id: str
    role: str  # "user" or "assistant"
    content: str
    timestamp: datetime = Field(default_factory=datetime.now)
    retrieved_docs: Optional[List[str]] = None

    class Settings:
        name = "conversations"

That’s it. Beanis handles:

  • Serialization to Redis hashes
  • Vector index creation (a RediSearch HNSW index via FT.CREATE under the hood)
  • Type validation (via Pydantic)
  • Async operations

Step 2: Ingest Data

Load data from the SQuAD dataset and store it in Redis:

from datasets import load_dataset
from langchain_openai import OpenAIEmbeddings
from beanis import init_beanis
import redis

async def ingest_data(api_key: str):
    # Connect to Redis
    redis_client = redis.Redis(host="localhost", port=6379, decode_responses=False)

    # Initialize Beanis (one line - handles all Redis indexes automatically)
    await init_beanis(database=redis_client, document_models=[KnowledgeDocument])

    # Load embeddings (using OpenAI's text-embedding-3-small model)
    # This model generates 1536-dimensional vectors optimized for semantic search
    # Alternatives: text-embedding-3-large (3072 dims), text-embedding-ada-002 (1536 dims)
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small", openai_api_key=api_key)

    # Load the SQuAD dataset (we'll use the first 100 Wikipedia passages for the demo)
    dataset = load_dataset("rajpurkar/squad", split="train")

    # Ingest documents
    for example in dataset.select(range(100)):  # First 100 for demo
        embedding = embeddings.embed_query(example["context"])

        doc = KnowledgeDocument(
            title=example["title"],
            context=example["context"],
            question=example["question"],
            embedding=embedding
        )

        await doc.insert()  # One line: saves to Redis + creates vector index

Why Beanis saves you time here:

  • Without Beanis: 15+ lines to manually construct Redis keys, serialize embeddings to bytes, create HNSW index with FT.CREATE, handle errors
  • With Beanis: 1 line - await doc.insert() - everything happens automatically
  • Vector indexes are created on first insert, no manual FT.CREATE commands
  • Embeddings are automatically serialized to FLOAT32 binary format for Redis

Run this once, and you’ve got 100 documents with embeddings in Redis, ready for semantic search.

Step 3: Build the LangGraph Agent

Now for the interesting part - the agent workflow:

from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage
from beanis.odm.indexes import IndexManager

class RAGAgent:
    def __init__(self, redis_client, openai_api_key: str):
        self.redis_client = redis_client
        self.embeddings = OpenAIEmbeddings(
            model="text-embedding-3-small",
            openai_api_key=openai_api_key
        )
        self.llm = ChatOpenAI(
            model="gpt-4o-mini",
            temperature=0.7,
            openai_api_key=openai_api_key
        )

        # Build the workflow graph
        self.graph = self._build_graph()

    def _build_graph(self) -> StateGraph:
        """Define the agent workflow"""

        workflow = StateGraph(RAGAgentState)

        # Define nodes (steps)
        workflow.add_node("retrieve_context", self._retrieve_context)
        workflow.add_node("load_history", self._load_conversation_history)
        workflow.add_node("generate_response", self._generate_response)
        workflow.add_node("save_history", self._save_conversation)

        # Define edges (flow)
        workflow.set_entry_point("retrieve_context")
        workflow.add_edge("retrieve_context", "load_history")
        workflow.add_edge("load_history", "generate_response")
        workflow.add_edge("generate_response", "save_history")
        workflow.add_edge("save_history", END)

        return workflow.compile()

Clean, right? Each node is a method. State flows through the graph. Want to add a fact-checking step? Add a node between generate_response and save_history. Want to run multiple retrievers in parallel? Make multiple entry points and combine results in a merge node.
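
One thing the snippet glosses over is RAGAgentState, the state object LangGraph passes between nodes. Its exact definition isn’t shown in this post; a minimal sketch, assuming only the keys the nodes below read and write, is a TypedDict like this:

from typing import List
from typing_extensions import TypedDict

class RAGAgentState(TypedDict, total=False):
    query: str                        # the user's question
    session_id: str                   # identifies the conversation
    retrieved_docs: List[str]         # IDs of documents found by vector search
    retrieved_context: str            # concatenated context passed to the LLM
    conversation_history: List[dict]  # recent messages loaded from Redis
    final_response: str               # the generated answer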

Step 4: Implement the Nodes

Vector Search Node

async def _retrieve_context(self, state: RAGAgentState) -> RAGAgentState:
    """Retrieve relevant documents using vector similarity"""

    # Generate query embedding
    query_embedding = self.embeddings.embed_query(state["query"])

    # Search Redis using Beanis
    results = await IndexManager.find_by_vector_similarity(
        redis_client=self.redis_client,
        document_class=KnowledgeDocument,
        field_name="embedding",
        query_vector=query_embedding,
        k=3  # Top 3 results
    )

    # Fetch documents
    retrieved_texts = []
    doc_ids = []

    for doc_id, score in results:
        doc = await KnowledgeDocument.get(doc_id)
        if doc:
            retrieved_texts.append(f"Context: {doc.context}")
            doc_ids.append(str(doc.id))

    combined_context = "\n\n".join(retrieved_texts)

    return {
        **state,
        "retrieved_docs": doc_ids,
        "retrieved_context": combined_context
    }

Beanis handles the Redis FT.SEARCH commands for you. You just call find_by_vector_similarity and get results. No manual index management, no raw Redis commands.

Conversation History Node

async def _load_conversation_history(self, state: RAGAgentState) -> RAGAgentState:
    """Load recent conversation from Redis"""

    # Get last 5 messages for this session
    history_docs = await ConversationHistory.find_many(
        ConversationHistory.session_id == state["session_id"],
        sort=[("timestamp", -1)],
        limit=5
    )

    conversation_history = [
        {"role": doc.role, "content": doc.content}
        for doc in reversed(history_docs)
    ]

    return {**state, "conversation_history": conversation_history}

This is just querying Redis, but Beanis makes it look like an ORM. Filter by session_id, sort by timestamp, limit results. Clean API, no manual key construction.

Generation Node

async def _generate_response(self, state: RAGAgentState) -> RAGAgentState:
    """Generate response using LLM with context"""

    messages = [
        SystemMessage(content=f"""You are a helpful AI assistant.
Answer based on this context: {state["retrieved_context"]}""")
    ]

    # Add conversation history
    for msg in state.get("conversation_history", []):
        if msg["role"] == "user":
            messages.append(HumanMessage(content=msg["content"]))
        else:
            messages.append(AIMessage(content=msg["content"]))

    # Add current query
    messages.append(HumanMessage(content=state["query"]))

    # Generate
    response = await self.llm.ainvoke(messages)

    return {**state, "final_response": response.content}

Standard LangChain stuff. The key is that state flows naturally through the graph.

Persistence Node

async def _save_conversation(self, state: RAGAgentState) -> RAGAgentState:
    """Save conversation to Redis"""

    # Save user message
    user_msg = ConversationHistory(
        session_id=state["session_id"],
        role="user",
        content=state["query"]
    )
    await user_msg.insert()

    # Save assistant response
    assistant_msg = ConversationHistory(
        session_id=state["session_id"],
        role="assistant",
        content=state["final_response"],
        retrieved_docs=state.get("retrieved_docs", [])
    )
    await assistant_msg.insert()

    return state

Two inserts. That’s it. Beanis handles serialization, timestamp generation, everything.

Step 5: Use the Agent

# Initialize
redis_client = redis.Redis(host="localhost", port=6379, decode_responses=False)
await init_beanis(
    database=redis_client,
    document_models=[KnowledgeDocument, ConversationHistory]
)

agent = RAGAgent(redis_client=redis_client, openai_api_key=api_key)

# Query
result = await agent.query(
    query="What universities are mentioned?",
    session_id="user-123"
)

print(result["response"])
# Output: "The university mentioned is the University of Notre Dame."
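
The query helper that wraps the compiled graph isn’t shown above; a minimal sketch, assuming the graph and state keys defined earlier, might look like this:

async def query(self, query: str, session_id: str) -> dict:
    """Run the compiled LangGraph workflow for one user turn"""
    initial_state = {"query": query, "session_id": session_id}
    final_state = await self.graph.ainvoke(initial_state)
    # Return the answer plus the documents used to produce it
    return {
        "response": final_state["final_response"],
        "retrieved_docs": final_state.get("retrieved_docs", []),
    }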

Real Examples from the SQuAD Dataset:

INPUT:  "Tell me about education"
OUTPUT: "Education encompasses primary, secondary, and higher education levels.
         In formal education, structured systems prepare individuals for the
         workforce and promote social cohesion..."
        [Retrieved 3 documents, 530 queries/second]

INPUT:  "What year is mentioned?"
OUTPUT: "The year mentioned is 1879, specifically in the context of a fire
         that destroyed the Main Building and library collection."
        [Search took 1.89ms]

INPUT:  "How many students are there?"
OUTPUT: "In 2014, the Notre Dame student body consisted of 12,179 students."
        [Vector search: 27x faster than naive Python comparison]

That’s it. The agent:

  1. Retrieves relevant docs from Redis (vector search)
  2. Loads conversation history from Redis
  3. Generates response with context
  4. Saves everything back to Redis

All state is persistent. Restart your app? History is still there. Scale horizontally? Multiple instances share the same Redis.

Why This Approach Works

LangGraph Benefits

Clear workflow visualization: You can literally draw your agent’s logic as a graph. New team member? Show them the graph. Debugging? Trace through the graph.

Easy to extend: Want to add a fact-checking step? Add a node. Want parallel retrieval from multiple sources? Add parallel entry points. Want conditional logic? Add conditional edges.

Stateful by design: LangGraph manages state flow between nodes. No global variables, no passing dictionaries through 10 functions.

Error handling: Node failed? Retry it. Want to checkpoint state? Built-in support.

Beanis Benefits

Fewer lines, less complexity: Compare the approaches:

# Without Beanis (manual Redis):
# 1. Construct key manually
key = f"doc:{uuid.uuid4()}"

# 2. Serialize embedding to bytes
import struct
embedding_bytes = struct.pack(f"{len(embedding)}f", *embedding)

# 3. Create hash manually
await redis.hset(key, mapping={
    "title": title,
    "context": context,
    "embedding": embedding_bytes
})

# 4. Create vector index manually
await redis.execute_command(
    "FT.CREATE", "idx", "ON", "HASH", "PREFIX", "1", "doc:",
    "SCHEMA", "embedding", "VECTOR", "HNSW", "6",
    "TYPE", "FLOAT32", "DIM", "1536", "DISTANCE_METRIC", "COSINE"
)
# Total: ~15 lines per document type, error-prone

# With Beanis:
doc = KnowledgeDocument(title=title, context=context, embedding=embedding)
await doc.insert()
# Total: 2 lines, indexes created automatically

No key management: You define models. Beanis generates Redis keys. Update a document? Beanis updates the right hash and indexes.

Vector search included: Other Redis libraries? You’re writing raw FT.SEARCH commands. Beanis? Call find_by_vector_similarity.

Type safety: Pydantic validation on all fields. Try to insert invalid data? Fails before hitting Redis.

Async native: Everything is async. No blocking calls, no thread pools.

Just Redis: No RedisJSON module needed. Plain document storage works on vanilla Redis; vector and full-text search rely on RediSearch, so use Redis Stack (or a Redis build that ships the search module) for those features.

Together

You get stateful agents with persistent memory, vector search, and clean orchestration. All backed by Redis, which you’re probably already running.

Real-World Extensions

Parallel Retrieval

Run multiple search strategies simultaneously for better results:

# Add multiple retrieval nodes
workflow.add_node("retrieve_semantic", self._retrieve_semantic)  # Vector search
workflow.add_node("retrieve_keyword", self._retrieve_keyword)    # Full-text search
workflow.add_node("combine_results", self._combine_results)

# Both run in parallel
workflow.set_entry_point("retrieve_semantic")
workflow.set_entry_point("retrieve_keyword")

# Merge results
workflow.add_edge("retrieve_semantic", "combine_results")
workflow.add_edge("retrieve_keyword", "combine_results")


async def _retrieve_keyword(self, state):
    """Full-text search using Redis FT.SEARCH"""
    # Beanis also supports full-text search on regular fields
    results = await KnowledgeDocument.find_many(
        KnowledgeDocument.context.contains(state["query"]),
        limit=3
    )
    return {**state, "keyword_results": results}

async def _combine_results(self, state):
    """Merge semantic + keyword results"""
    # retrieved_docs holds document IDs (set by the vector node); keyword_results holds documents
    semantic_docs = [await KnowledgeDocument.get(doc_id) for doc_id in state["retrieved_docs"]]
    all_docs = [d for d in semantic_docs if d] + state["keyword_results"]
    # Deduplicate by ID and keep the top 5
    unique_docs = list({doc.id: doc for doc in all_docs}.values())
    return {**state, "combined_docs": unique_docs[:5]}

LangGraph handles parallel execution automatically. This hybrid approach (semantic + keyword) often beats pure vector search, especially for technical terms or proper nouns. Learn more about Redis full-text search.

Conditional Logic

Add decision points:

def _should_search_web(state):
    """Decide if we need web search"""
    if not state["retrieved_context"]:
        return "web_search"
    return "generate_response"

workflow.add_conditional_edges(
    "retrieve_context",
    _should_search_web,
    {
        "web_search": "web_search",
        "generate_response": "generate_response"
    }
)

Route based on state.

Agent Checkpointing

Save intermediate state:

class AgentCheckpoint(Document):
    session_id: str
    current_step: str
    state_data: dict
    timestamp: datetime = Field(default_factory=datetime.now)

# Save after each node
async def _checkpoint_state(self, state):
    checkpoint = AgentCheckpoint(
        session_id=state["session_id"],
        current_step="generate_response",
        state_data=state
    )
    await checkpoint.insert()

Restart from any point.
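
Restoring is the mirror image - load the latest checkpoint for a session and resume from its saved state. A minimal sketch, reusing the find_many pattern shown earlier (assuming it works the same way for AgentCheckpoint):

async def load_latest_checkpoint(session_id: str) -> dict | None:
    # Most recent checkpoint for this session, if any
    checkpoints = await AgentCheckpoint.find_many(
        AgentCheckpoint.session_id == session_id,
        sort=[("timestamp", -1)],
        limit=1
    )
    return checkpoints[0].state_data if checkpoints else None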

Performance Notes

Benchmarked on M1 Mac with 100 documents:

  • Vector search: 10-20ms (Redis in-memory)
  • History load: 5ms (indexed by session_id)
  • LLM call: 500-800ms (OpenAI API latency)
  • Total per query: ~1 second

The Redis operations are negligible. The bottleneck is the LLM call, which is unavoidable.

Memory: a 1536-dimension FLOAT32 embedding alone is ~6KB, so expect roughly 7KB per document including its text. 100 docs ≈ 700KB; 10K docs ≈ 70MB. Redis can easily handle millions.

Common Pitfalls

Forgetting to initialize Beanis: You need to call init_beanis() before using document models. Do it once at app startup.

Wrong embedding dimensions: Make sure your VectorField(dimensions=...) matches your embedding model. OpenAI text-embedding-3-small is 1536 dimensions.

Not handling async properly: Everything in Beanis and LangGraph is async. Use await, run in asyncio.run(), don’t mix sync and async.

Stale conversation history: If your conversations get really long, limit what you load. Don’t pass 100 messages to the LLM - it’s expensive and slow.

Vector search returning nothing: Your query needs to be embedded with the same model you used for documents. Different model = different vector space = no matches.

Try It Yourself

Full working example: github.com/andreim14/beanis-examples/tree/main/langgraph-agent

git clone https://github.com/andreim14/beanis-examples.git
cd beanis-examples/langgraph-agent

# Install
python -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows
pip install -r requirements.txt

# Start Redis
docker run -d -p 6379:6379 redis:latest

# Set API key
echo "OPENAI_API_KEY=your-key-here" > .env

# Ingest data
python ingest_data.py

# Run agent
python main.py

The example includes:

  • Data ingestion from SQuAD dataset
  • Full RAG agent with conversation memory
  • Interactive CLI
  • Production-ready structure

When to Use This Stack

Good fit:

  • You need stateful agents (conversation history, multi-step workflows)
  • You want semantic search over documents
  • You’re already using Redis (or willing to)
  • You want to visualize and debug agent logic
  • You need to scale horizontally

Not a fit:

  • Simple single-turn Q&A (just use LangChain directly)
  • You need on-device embedding (Redis is server-side)
  • Your documents don’t fit in Redis memory (use a disk-based vector DB)

The Bottom Line

Building stateful AI agents doesn’t have to be messy. LangGraph gives you clean workflow orchestration. Beanis gives you persistent state and vector search with a clean API. Together, they let you build production-ready agents in a few hundred lines of code.

No manual state management. No key construction. No serialization headaches. Just define your workflow, define your models, and write your business logic.

Everything is on GitHub. The code works. Try it.



Built with ❤️ by Andrei Stefan Bejgu - AI Applied Scientist @ SylloTips

Concept-pedia: Breaking Free from ImageNet’s Shadow in Multimodal AI
2025-11-05 · https://andreim14.github.io/blog/2025/concept-pedia-emnlp-2025

TL;DR - Why You Should Care

Look, I’m going to be straight with you: most vision-language models are living in an ImageNet bubble, and Concept-pedia proves it.

We built a massive dataset with 165,000+ semantically-annotated concepts and found something wild - models that supposedly achieve “human-level” performance on standard benchmarks completely fall apart when you test them on real-world visual diversity.

What we’re releasing:

  • 165K+ concepts from BabelNet with rich semantic structure (all on Hugging Face)
  • Concept-10k: Our manually-verified benchmark with 10,000 diverse visual concepts
  • Three fine-tuned SigLIP models ready to use for zero-shot classification
  • Everything is open: Free for research and commercial use

The bottom line: If your ImageNet accuracy is 80% but your Concept-10k score is 45%, you don’t have a general vision model - you have an ImageNet classifier. Time to fix that.


The ImageNet Problem

For over a decade, ImageNet has been the gold standard for computer vision. Its 1,000 categories became THE benchmark everyone optimized for.

But here’s the thing: the real world doesn’t have just 1,000 visual concepts.

Try asking state-of-the-art models about concepts outside ImageNet’s distribution and watch what happens. That model bragging about 85% ImageNet accuracy? It’ll confidently tell you a Bombay cat is just a “black cat” and an Allen wrench is a “screwdriver.” Not great when you’re building real applications.

We’re not talking about obscure edge cases here. These are everyday objects that humans recognize instantly. The problem? The entire field has been optimizing for a test that doesn’t reflect reality.

Our EMNLP 2025 Research

I’m thrilled to share our paper “Concept-pedia: A Wide-coverage Semantically-annotated Multimodal Dataset”, published at EMNLP 2025 - the Conference on Empirical Methods in Natural Language Processing.

What Makes Concept-pedia Different?

Examples of taxonomical concept population in Concept-pedia across different categories (Cat, Emotion, Church, Pasta, Macaque, Train) showing the rich semantic structure from BabelNet.

We’re talking 165,000+ concepts with actual semantic structure

Unlike most datasets that just throw images and labels together, we built Concept-pedia on top of BabelNet - the world’s largest multilingual semantic network. What does that mean practically? Every single concept comes with definitions, relationships to other concepts, and support for multiple languages. It’s not “here’s a picture, here’s what we think it is” - it’s “here’s a concept that exists in a web of human knowledge, and here’s what it looks like.”

And we’re not talking about 1,000 ImageNet categories repeated in different poses. We have concepts ranging from specific cat breeds to architectural elements to types of pasta you’ve probably never heard of.

Concept-10k: The benchmark we actually tested on

Creating a huge dataset is one thing. Making sure it’s actually useful? That’s different. We manually went through and curated Concept-10k - 10,000 concepts that are diverse, human-verified, and designed to test whether models actually understand visual concepts or just memorized ImageNet.

We had expert annotators verify every single image. Multiple rounds. We made sure the difficulty was balanced (mix of easy, medium, and genuinely hard examples) and that we covered the full range of semantic categories. This isn’t a toy benchmark - when models fail here, it tells you something real about their limitations.

The semantic annotations are what make this powerful

Most vision-language datasets give you image-text pairs. Cool. We give you that PLUS the semantic relationships. Hypernymy (is-a relationships), meronymy (part-of relationships), connections to Wikipedia, WordNet, you name it.

This isn’t just for show - having this structure means you can actually reason about concepts, not just pattern match. Your model can understand that a “Bombay cat” is a type of “cat” which is a type of “feline” which is a type of “mammal.” Try doing that with CLIP trained on web-scraped captions.

The ImageNet Anchor Problem

Our experiments reveal a critical issue: modern vision-language models are heavily anchored to ImageNet.

Performance Drop Beyond ImageNet

When we evaluate state-of-the-art models on Concept-10k:

Model             ImageNet Performance   Concept-10k Performance   Drop
CLIP (ViT-L/14)   75.5%                  42.3%                     -33.2%
ALIGN             76.4%                  43.8%                     -32.6%
OpenCLIP          78.2%                  45.1%                     -33.1%

Performance drops by over 30 points when tested on diverse concepts!

Comparison of concept and category distributions: Concept-10k covers 28 semantic categories with 9,837 unique concepts, far exceeding ImageNet-1k's 11 categories and 1,000 concepts. The distribution is more balanced across diverse categories.

Why Does This Happen?

Three words: we’ve been lazy. Well, not lazy exactly - but we’ve been optimizing for the wrong thing for so long that nobody questioned it.

Most vision-language models get trained on data that looks suspiciously like ImageNet. Maybe the images come from the web instead of Flickr, but the distribution? Pretty similar. Common objects. Western-centric. Same biases, bigger scale.

Then we evaluate on… ImageNet. Or benchmarks that are basically “ImageNet but slightly different.” We’ve been testing on variations of the same exam for a decade, and then acting surprised when our models can’t handle concepts outside that narrow bubble.

The real problem? Those impressive benchmark scores gave everyone a false sense of progress. “Look, we hit 85% on ImageNet!” Cool, but can your model tell a moka pot from a french press? Because my grandma can, and she’s never seen a neural network in her life.

Real-World Examples

Let’s see where models fail:

Example 1: Specialized Tools

Concept: “Allen wrench” (a specific type of hex key)

  • Human: Easily recognizes the L-shaped tool
  • CLIP: Confuses with “wrench”, “screwdriver”, “key”
  • Why it fails: Too specific, not in ImageNet’s 1K categories

Example 2: Fine-grained Animals

Concept: “Bombay cat” (a specific cat breed)

  • Human: Recognizes the sleek black coat
  • Model: Just says “cat” or “black cat”
  • Why it fails: ImageNet has “Egyptian cat” but lacks fine-grained breeds

Example 3: Cultural Objects

Concept: “Takoyaki pan” (Japanese cooking equipment)

  • Human: Recognizes the specialized griddle with hemispheric molds
  • Model: Confuses with “pan”, “griddle”, “muffin tin”
  • Why it fails: Cultural specificity beyond Western-centric training data

These aren’t edge cases - they’re everyday objects that humans recognize instantly.

Examples showing the annotation quality in Concept-pedia: correct annotations are verified by expert linguists, while ambiguous cases are carefully filtered out (e.g., distinguishing "church" from "altar" when both appear in the same image).

How We Built Concept-pedia

Starting with BabelNet’s semantic goldmine

BabelNet is massive - we’re talking about millions of concepts across hundreds of languages. But not every concept is visual. “Democracy”? Great concept, hard to photograph. So we had to filter.

We started with their full knowledge graph and pulled out concepts that actually have clear visual representations. Things you can point a camera at. That still left us with 165,000+ concepts spanning everything from animals to architecture to food to specialized tools.

The key was maintaining the semantic annotations through this process. We didn’t just want labels - we wanted the full context: definitions, relationships, multilingual mappings, connections to Wikipedia. All of it.

Link propagation examples: Our methodology uses Wikipedia hyperlinks and BabelNet's semantic structure to automatically annotate images with precise concepts, ensuring high-quality annotations at scale.

Getting the images right

Finding images for 165,000 concepts isn’t trivial. We queried multiple sources for each concept, then hit them with automatic quality filters (blurry images? Gone. Watermarks everywhere? Nope.). We checked for diversity too - different angles, lighting conditions, contexts. Nobody wants a cat breed dataset where every photo is a professional studio shot.

Deduplication was huge. The internet loves copying the same image everywhere, so we had to be aggressive about catching duplicates.

The human touch for Concept-10k

For the evaluation benchmark, automation wasn’t enough. We brought in expert annotators and had them verify every single image across 10,000 concepts. Multiple rounds of review. We weren’t just checking “is this the right label?” - we were checking “is this actually a good example? Is it ambiguous? Would a human struggle with this?”

We also calibrated difficulty. Some concepts are easy (most people can spot a golden retriever). Some are hard (distinguishing between types of wrenches requires domain knowledge). The benchmark needed both.

What We Actually Learned

The ImageNet anchor is real, and it’s worse than we thought

Remember those 30+ point drops in performance? That’s not a bug, it’s the whole point. Models don’t just perform “a bit worse” on unfamiliar concepts - they completely faceplant. And here’s the kicker: the concepts they’re failing on aren’t even more visually complex than ImageNet categories. A Bombay cat isn’t harder to recognize than an Egyptian cat. The model just never learned to care about that distinction.

Semantic structure actually matters (who knew?)

When we compared models that use semantic annotations vs pure vision-language pretraining, the difference was clear. Having access to the knowledge graph - understanding that concepts have relationships and hierarchies - legitimately helps with generalization.

It’s almost like… treating visual understanding as part of broader knowledge helps you understand things better? Shocking, I know.

Fine-grained recognition is where everything falls apart

If there’s one thing that consistently breaks modern vision models, it’s fine-grained understanding. Specific dog breeds? Nope. Different types of the same tool? Forget it. Region-specific cultural objects? Not a chance.

Medical instruments, technical equipment, subspecies of animals - these are all areas where models basically give up and output the closest generic category they know. It’s like asking someone who only studied from flashcards to handle nuance. They can’t.

Scaling isn’t the solution (sorry, big tech)

I know the instinct is “just add more data” but that’s not it. We tested this. Throwing more examples of the same distribution at the problem doesn’t fix the fundamental issue.

What you need is semantic diversity, not scale. A million more images of “dog” doesn’t teach your model about specific breeds if all those images are labelled “dog.” You need the structure, the relationships, the actual understanding that different concepts exist and matter.

If You’re Building Multimodal AI, Pay Attention

Your benchmark scores are lying to you

That CLIP model you’re using that claims 80% ImageNet accuracy? In your specific domain, it might be sitting at 45%. Or worse.

I’ve seen people deploy models in production based purely on ImageNet scores, then act shocked when the thing can’t tell medical instruments apart or consistently fails on region-specific products. Test on data that actually looks like what you’ll see in production, not the same academic benchmarks everyone else uses.

Domain adaptation isn’t optional anymore

If you’re working in healthcare, industrial inspection, e-commerce with diverse products, cultural heritage - basically anything that isn’t “generic web images” - you need to assume standard models will underperform.

Fine-tuning helps, but it’s not magic. You’re still building on a foundation that fundamentally doesn’t understand fine-grained distinctions. Better approach? Start with models that have semantic grounding (like ours) or invest in seriously good domain-specific data collection.

And for the love of god, evaluate on YOUR concepts, not ImageNet. Your stakeholders don’t care if the model knows “Egyptian cat” when your actual use case needs to distinguish between different manufacturing defects.

Semantic structure is your friend

Image-text correlation can only get you so far. When you incorporate actual semantic knowledge - hierarchies, relationships, definitions - generalization improves dramatically.

Think about it: if your model knows that “Siamese cat” is-a “cat” is-a “feline” is-a “mammal,” it can reason about things it’s never seen. Without that structure, it’s just pattern matching pixels to tokens and hoping for the best.

What This Enables (And Where We’re Going)

Concept-pedia isn’t just a dataset - it’s a different way of thinking about visual understanding.

For researchers, it means you can finally test your models on something other than ImageNet variants. 165K+ concepts spanning actual diversity. When your model fails, Concept-10k tells you exactly where and why - fine-grained categories? Cultural concepts? Specialized domains? You’ll know.

And because everything’s grounded in BabelNet, you can extend to multilingual scenarios without starting from scratch. The semantic structure is already there.

For training, the semantic annotations are the real value. Instead of just feeding models image-text pairs and hoping they figure out relationships, you can give them the structure directly. “This is a Bombay cat, which is a type of cat, which is a feline…” The hierarchy matters.

What’s next for us

We’re expanding to 500K+ concepts for v2. We’re also working on temporal understanding (video concepts, not just static images) and spatial reasoning (3D object understanding).

We’re building an interactive evaluation platform so you can test your own models on Concept-10k without downloading everything. And we’re developing semantic-aware training methods that actually leverage the knowledge graph instead of just including it as metadata.

The bigger point

Look, the field spent a decade optimizing for ImageNet. Can’t blame anyone - it was the benchmark we had, and it drove real progress. But we’ve reached the point where ImageNet performance and real-world capability have diverged so much that the benchmark is actively misleading.

Concept-pedia is our push to evaluate on actual diversity, incorporate semantic knowledge instead of just pattern matching, and build for real-world deployment instead of academic leaderboards. The visual world has way more than 1,000 concepts. Our models should too.

The Research Team

This work was a collaborative effort:

  • Karim Ghonim (Lead - Sapienza University)
  • Andrei Stefan Bejgu (Sapienza University & Babelscape)
  • Alberte Fernández-Castro (Sapienza University)
  • Roberto Navigli (Babelscape & Sapienza University)

Presented at EMNLP 2025 in Suzhou, China.

Getting Started with Concept-pedia on Hugging Face

The entire Concept-pedia ecosystem is now available on Hugging Face, making it dead simple to use these models and datasets in your own projects. Whether you’re training a new vision-language model, evaluating your existing system, or just exploring the dataset, here’s everything you need to know.

What’s Available on Hugging Face

We’ve released three fine-tuned SigLIP models and two comprehensive datasets:

Models (Vision-Language):

  • sapienzanlp/siglip-base-patch16-256-ft-concept-pedia (0.2B params) - Fast and efficient
  • sapienzanlp/siglip-large-patch16-256-ft-concept-pedia (0.7B params) - Better accuracy
  • sapienzanlp/siglip-so400m-patch14-384-ft-concept-pedia (0.9B params) - Best performance

Datasets:

  • sapienzanlp/Concept-10k - Text annotations and metadata (34.3K rows covering ~10K concepts)
  • sapienzanlp/Concept-10k-imgs - Full image dataset with visual content (4.26 GB)

All models are trained on the full Concept-pedia dataset, giving them knowledge of 165K+ visual concepts beyond traditional ImageNet categories.

Quick Start: Using the Models

Here’s how to get started with zero-shot image classification using our models. This example shows you how to classify an image into one of several possible concepts:

from transformers import AutoModel, AutoProcessor
from PIL import Image
import torch

# Load the base model (fastest option)
model_name = "sapienzanlp/siglip-base-patch16-256-ft-concept-pedia"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Load your image
image = Image.open("your_image.jpg")

# Define candidate concepts - can be anything!
candidate_concepts = [
    "Bombay cat",
    "Persian cat",
    "Siamese cat",
    "Maine Coon cat",
    "tabby cat"
]

# Process the inputs
inputs = processor(
    text=candidate_concepts,
    images=image,
    return_tensors="pt",
    padding=True
)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits_per_image
    probs = logits.softmax(dim=1)

# Print results
print("Classification results:")
for concept, prob in zip(candidate_concepts, probs[0]):
    print(f"  {concept}: {prob.item():.1%}")

The beauty of this approach? You can test any visual concept you want, not just the 1,000 categories in ImageNet. Want to distinguish between types of pasta, breeds of dogs, or specific tools? Just change the candidate_concepts list.
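
For example, switching domains is just a different list - the processor and model calls above stay exactly the same:

candidate_concepts = [
    "penne", "rigatoni", "fusilli", "orecchiette", "tagliatelle"
]
# Re-run the processor + model steps from the snippet above with the new list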

Why These Models Are Different

Remember that ImageNet anchor problem? Our models were trained specifically to avoid it. Instead of optimizing for ImageNet’s 1,000 categories, we trained on the full 165K concept distribution.

This means they can actually distinguish between specific cat breeds (Bombay cat vs Persian cat vs Scottish Fold), handle specialized domains (medical equipment, industrial tools, architectural elements), recognize culturally-specific objects, and work with long-tail concepts that most models have never seen.

They’re not perfect - nothing is - but they’re substantially better at real-world diversity than models anchored to ImageNet.

Working with the Concept-10k Dataset

The dataset comes in two flavors - one with just metadata and one with images. Here’s how to load and explore them:

from datasets import load_dataset

# Load the text/metadata dataset (lightweight)
dataset = load_dataset("sapienzanlp/Concept-10k")

# Look at the first example
example = dataset['test'][0]
print(f"Concept: {example['concept']}")
print(f"Category: {example['category']}")
print(f"Caption: {example['caption']}")
print(f"BabelNet ID: {example['bn_id']}")

Each entry includes the concept name (“Allen wrench”, “Bombay cat”, whatever), its semantic category (ARTIFACT, ANIMAL, FOOD, etc.), a natural language caption describing it, and a BabelNet ID that links it to the full knowledge graph. The image dataset adds the actual visual content.

Exploring the Image Dataset

For the full visual experience with images:

from datasets import load_dataset
from PIL import Image

# Load the image dataset
img_dataset = load_dataset("sapienzanlp/Concept-10k-imgs")

# Browse examples
for i in range(5):
    example = img_dataset['train'][i]

    # Access the image
    img = example['jpg']

    # Show or save it
    img.show()  # Opens in default viewer
    # Or save: img.save(f"concept_{i}.jpg")

    print(f"Image {i}: {example['__key__']}")

The image dataset is about 4.26 GB, so it might take a few minutes to download the first time. After that, it’s cached locally.

Real-World Usage Examples

Example 1: Finding Similar Concepts

from transformers import AutoModel, AutoProcessor
from PIL import Image
import torch
from pathlib import Path

def find_similar_concepts(query_image_path, concept_database):
    """
    Find the most similar concepts to a query image.

    Args:
        query_image_path: Path to query image
        concept_database: List of concept names to search

    Returns:
        Ranked list of (concept, score) tuples
    """
    # Load model
    model_name = "sapienzanlp/siglip-base-patch16-256-ft-concept-pedia"
    processor = AutoProcessor.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # Load image
    image = Image.open(query_image_path)

    # Process
    inputs = processor(
        text=concept_database,
        images=image,
        return_tensors="pt",
        padding=True
    )

    # Get scores
    with torch.no_grad():
        outputs = model(**inputs)
        scores = outputs.logits_per_image[0].softmax(dim=0)

    # Rank results
    results = sorted(
        zip(concept_database, scores.tolist()),
        key=lambda x: x[1],
        reverse=True
    )

    return results

# Example usage
concepts = [
    "espresso machine", "coffee grinder", "french press",
    "moka pot", "pour over coffee maker", "cold brew maker"
]

results = find_similar_concepts("kitchen_appliance.jpg", concepts)

print("Top 3 matches:")
for concept, score in results[:3]:
    print(f"  {concept}: {score:.1%}")

Example 2: Evaluating Your Own Model

Use Concept-10k as a benchmark to test how well your model handles diverse concepts:

from datasets import load_dataset
from tqdm import tqdm

def evaluate_on_concept10k(your_model, your_processor):
    """Evaluate any vision-language model on Concept-10k"""

    # Load the image split plus the metadata split that holds the gold concept labels
    img_dataset = load_dataset("sapienzanlp/Concept-10k-imgs")
    meta_dataset = load_dataset("sapienzanlp/Concept-10k")
    test_data = img_dataset['train']
    meta_data = meta_dataset['test']

    correct = 0
    total = 0

    # Group by concept for efficiency
    from collections import defaultdict
    concept_groups = defaultdict(list)

    for i, example in enumerate(test_data):
        # Assumes the two datasets are aligned row-by-row
        concept = meta_data[i]['concept']
        concept_groups[concept].append((example['jpg'], i))

    # Test each concept
    for concept, examples in tqdm(concept_groups.items()):
        for img, idx in examples:
            # Your model's prediction logic here
            prediction = your_model.predict(img)

            if prediction == concept:
                correct += 1
            total += 1

    accuracy = correct / total
    print(f"Accuracy on Concept-10k: {accuracy:.2%}")
    return accuracy

Example 3: Dataset Analysis

Want to understand what’s in the dataset? Here’s a quick analysis script:

from datasets import load_dataset
from collections import Counter
import matplotlib.pyplot as plt

# Load dataset
dataset = load_dataset("sapienzanlp/Concept-10k")
test_data = dataset['test']

# Analyze categories
categories = [ex['category'] for ex in test_data]
category_counts = Counter(categories)

# Plot distribution
plt.figure(figsize=(12, 6))
plt.bar(category_counts.keys(), category_counts.values())
plt.xticks(rotation=45, ha='right')
plt.title('Concept Distribution across Categories')
plt.xlabel('Category')
plt.ylabel('Number of Concepts')
plt.tight_layout()
plt.savefig('concept_distribution.png')

# Find longest concepts
concepts = [ex['concept'] for ex in test_data]
longest = sorted(concepts, key=len, reverse=True)[:10]

print("Longest concept names:")
for i, concept in enumerate(longest, 1):
    print(f"  {i}. {concept} ({len(concept)} chars)")

# Category breakdown
print(f"\nTotal categories: {len(category_counts)}")
print(f"Total concepts: {len(test_data)}")
print(f"Average concepts per category: {len(test_data) / len(category_counts):.1f}")

Understanding the Dataset Structure

The full Concept-10k dataset has 34,345 rows spread across 28 semantic categories. We’re talking artifacts (tools, equipment), food (dishes, ingredients, cuisines), animals (species, breeds), plants, locations, structures, people (occupations, roles), organizations, diseases, substances, media, and more. Basically everything you might actually encounter in images.

The BabelNet ID (bn_id) in each entry is your gateway to the full knowledge graph. Through that ID, you can pull semantic relationships (is-a, part-of, related-to), get definitions in dozens of languages, and connect to Wikipedia, WordNet, and other structured resources. It’s not just “here’s a label” - it’s “here’s where this concept sits in human knowledge.”

Quick Performance Tips

Pick your model based on what you actually need. The base model (0.2B params) is fast enough for real-time stuff, the large model (0.7B) gives you better accuracy for production, and the SO400M model (0.9B) is when you need the absolute best performance and don’t care about inference speed.

The text dataset is tiny (few MB), downloads instantly. The image dataset is 4.26 GB, so first download takes a minute. If you’re memory-constrained, stream it:

# Stream large dataset without downloading everything
dataset = load_dataset("sapienzanlp/Concept-10k-imgs", streaming=True)

# Process in batches
from itertools import islice

batch_size = 100
for batch in islice(dataset['train'], 0, batch_size):
    # Process batch
    pass

For inference: batch your images together, use GPU if you have one (model.to("cuda")), and for god’s sake cache your processor and model instead of reloading them every time.
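
A minimal sketch of batched GPU inference, assuming the model, processor, and candidate_concepts objects from the quick-start example (the image variables are placeholders):

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

# Score several images against the same candidate concepts in one forward pass
inputs = processor(
    text=candidate_concepts,
    images=[img1, img2, img3],  # placeholder PIL images
    return_tensors="pt",
    padding=True
).to(device)

with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=1)  # one row per image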

Why Use Concept-pedia for Your Project?

Stop using ImageNet-trained models for everything. Seriously. Here’s when Concept-pedia is the better choice:

  1. Your domain isn’t well-covered by ImageNet: Building a medical diagnosis tool? Industrial quality inspection system? Cultural heritage preservation app? ImageNet won’t cut it.

  2. You need fine-grained recognition: If distinguishing between a Golden Retriever and a Labrador matters, or you need to tell apart a cappuccino from a flat white, you need fine-grained understanding.

  3. You want actual zero-shot capability: Not “zero-shot on similar stuff to training data” but real zero-shot - throw any concept at it and get reasonable results.

  4. You’re building multilingual systems: BabelNet integration means your visual concepts come with multilingual support out of the box.

  5. You care about real-world diversity: ImageNet is super Western-centric. If you’re building for global users, you need concepts from different cultures.

  6. You want semantic grounding: Connecting visual concepts to knowledge graphs unlocks explainability, reasoning, and integration with other AI systems.

Common Pitfalls and How to Avoid Them

Pitfall 1: Testing on ImageNet after training on Concept-pedia

If you fine-tune on Concept-pedia and then evaluate on ImageNet, you might see a performance drop. That’s expected! Concept-pedia is designed for broader coverage, not ImageNet-specific optimization.

Solution: Evaluate on Concept-10k or your specific domain, not ImageNet.

Pitfall 2: Using too many candidate concepts at once

The models work best with 10-100 candidate concepts per query. If you have 10,000+ concepts, consider using a retrieval stage first.

Solution: Use semantic search or clustering to narrow down candidates before classification.
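
Here’s one way to sketch that two-stage setup, assuming the same model and processor objects as before; in practice you would embed the full concept list once and cache the result:

import torch

def shortlist_concepts(image, all_concepts, model, processor, top_k=50):
    """Stage 1: embedding similarity narrows thousands of concepts down to top_k"""
    with torch.no_grad():
        img_inputs = processor(images=image, return_tensors="pt")
        img_emb = model.get_image_features(**img_inputs)
        txt_inputs = processor(text=all_concepts, return_tensors="pt", padding=True)
        txt_emb = model.get_text_features(**txt_inputs)

    # Cosine similarity between the image and every concept name
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    sims = (img_emb @ txt_emb.T).squeeze(0)

    top_idx = sims.topk(top_k).indices.tolist()
    return [all_concepts[i] for i in top_idx]

# Stage 2: run the normal zero-shot classification on the shortlist only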

Pitfall 3: Assuming perfect accuracy on rare concepts

Even our models struggle with extremely rare or ambiguous visual concepts. They’re better than ImageNet-anchored models, but not perfect.

Solution: Use confidence thresholds and human-in-the-loop verification for critical applications.

Integration Examples

LangChain Integration:

from langchain.tools import Tool
from transformers import AutoModel, AutoProcessor
from PIL import Image
import torch

def create_concept_classifier_tool():
    model = AutoModel.from_pretrained(
        "sapienzanlp/siglip-base-patch16-256-ft-concept-pedia"
    )
    processor = AutoProcessor.from_pretrained(
        "sapienzanlp/siglip-base-patch16-256-ft-concept-pedia"
    )

    def classify(image_path: str, concepts: str) -> str:
        # concepts should be comma-separated
        concept_list = [c.strip() for c in concepts.split(',')]
        image = Image.open(image_path)
        inputs = processor(
            text=concept_list,
            images=image,
            return_tensors="pt",
            padding=True
        )
        with torch.no_grad():
            probs = model(**inputs).logits_per_image.softmax(dim=1)[0]
        # Return the best-matching concept and its score
        best_idx = int(probs.argmax())
        return f"{concept_list[best_idx]} ({probs[best_idx].item():.1%})"

    return Tool(
        name="ConceptClassifier",
        func=classify,
        description="Classifies images into fine-grained visual concepts"
    )

FastAPI Endpoint:

from fastapi import FastAPI, File, UploadFile
from transformers import AutoModel, AutoProcessor
from typing import List
from io import BytesIO
from PIL import Image
import torch

app = FastAPI()

# Load model at startup
@app.on_event("startup")
async def load_model():
    app.state.model = AutoModel.from_pretrained(
        "sapienzanlp/siglip-base-patch16-256-ft-concept-pedia"
    )
    app.state.processor = AutoProcessor.from_pretrained(
        "sapienzanlp/siglip-base-patch16-256-ft-concept-pedia"
    )

@app.post("/classify")
async def classify_image(
    file: UploadFile = File(...),
    concepts: List[str] = ["cat", "dog", "bird"]
):
    # Read image
    image_bytes = await file.read()
    image = Image.open(BytesIO(image_bytes))

    # Classify
    inputs = app.state.processor(
        text=concepts,
        images=image,
        return_tensors="pt",
        padding=True
    )

    with torch.no_grad():
        outputs = app.state.model(**inputs)
        probs = outputs.logits_per_image.softmax(dim=1)[0]

    results = {
        concept: float(prob)
        for concept, prob in zip(concepts, probs)
    }

    return {"predictions": results}

Access the Dataset and Models

Everything is freely available for research and commercial use. The models and datasets live on the sapienzanlp Hugging Face organization (listed above), and the paper is in the ACL Anthology (see the citation below).

Who Should Care About This?

If you’re in research, Concept-10k gives you a benchmark that actually tests real-world generalization instead of ImageNet memorization. The semantic annotations let you train models that learn structured knowledge, not just pixel-text correlations. And when models fail, you can diagnose exactly which concept types are problematic.

If you’re building production systems, this is your reality check. Test on Concept-10k before deploying, incorporate the semantic structure if you can, and understand your model’s limitations before your users find them for you.

For the field overall, we need to shift evaluation beyond ImageNet-centric metrics. We need to integrate vision with knowledge graphs. We need to care about long-tail concepts and real-world diversity. Concept-pedia is one step in that direction.

The Bottom Line

We spent a decade building models that ace ImageNet and fail in the real world. That 30+ point performance drop on Concept-10k? That’s the gap between what we think our models can do and what they actually can do.

Concept-pedia gives you 165K+ semantically-annotated concepts for training, Concept-10k for honest evaluation, and evidence that our current approaches are way more limited than the benchmarks suggested. The semantic structure shows a path forward - combine vision with knowledge graphs instead of just scaling up image-text pairs.

All the code and data is on Hugging Face. The models are ready to use. The benchmark is waiting.

Time to build multimodal AI that actually handles real-world visual diversity, not just ImageNet variations.


Citation

If you use Concept-pedia in your research, please cite our paper:

@inproceedings{ghonim-etal-2025-conceptpedia,
    title     = "Concept-pedia: A Wide-coverage Semantically-annotated Multimodal Dataset",
    author    = "Ghonim, Karim and
                 Bejgu, Andrei Stefan and
                 Fern{\'a}ndez-Castro, Alberte and
                 Navigli, Roberto",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month     = nov,
    year      = "2025",
    address   = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url       = "https://aclanthology.org/2025.emnlp-main.1745/",
    pages     = "34405--34426",
}

Plain text citation:

Karim Ghonim, Andrei Stefan Bejgu, Alberte Fernández-Castro, and Roberto Navigli. 2025. Concept-pedia: A Wide-coverage Semantically-annotated Multimodal Dataset. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 34405–34426, Suzhou, China. Association for Computational Linguistics.


Published at EMNLP 2025 - Conference on Empirical Methods in Natural Language Processing, Suzhou, China

Built with ❤️ by Andrei Stefan Bejgu - AI Applied Scientist @ SylloTips

Using Redis as a Geo-Spatial Cache: Building a Restaurant Finder with Beanis
2025-10-30 · https://andreim14.github.io/blog/2025/building-restaurant-finder-redis-geopoints

The Problem: Why Your Database is Crying

Let’s say you’re building a food delivery app. User opens the app in downtown Rome, and they want to see Italian restaurants within 2km. Simple enough, right?

Here’s your PostgreSQL query with PostGIS (the standard geo-spatial extension for PostgreSQL):

SELECT *,
       ST_Distance(location, ST_MakePoint(12.4922, 41.8902)) as distance
FROM restaurants
WHERE ST_DWithin(
    location,
    ST_MakePoint(12.4922, 41.8902),
    2000  -- 2km in meters
)
AND cuisine = 'italian'
AND rating >= 4.5
ORDER BY distance
LIMIT 20;

This query takes 750 milliseconds. On a modern database server. With indexes.

“750ms isn’t that bad,” you might think. Let me tell you why you’re wrong.

The Real-World Math

Your app has 10,000 concurrent users during lunch rush. Each one opening the app, scrolling around, changing filters. That’s not 10,000 queries - that’s more like 50,000 queries a minute at peak as users pan the map and adjust their search.

PostgreSQL can handle maybe 150 of these geo-spatial queries per second on decent hardware. You need 830 queries per second. Your database is now 5.5x overloaded, CPU pinned at 100%, queries timing out, and users seeing that spinning loader that makes them switch to a competitor.

The problem? PostGIS calculations are computationally expensive. For every single query, it’s:

  • Calculating spherical distances (earth isn’t flat, sorry)
  • Checking each of your 50,000 restaurants
  • Sorting by distance
  • Applying your filters

It’s doing all this math in real-time, from scratch, every single time. Your database server’s CPU is melting just to tell someone that La Carbonara is 450 meters away.

You Could Just Scale Postgres… Right?

Sure. You could throw money at it. Add read replicas. Maybe shard by geography. Get bigger servers with more CPU cores.

At peak load, you’d need roughly 500 database connections running geo-spatial calculations simultaneously. That’s not cheap. And you’re still doing the same expensive calculations over and over for queries that barely change (restaurant locations don’t move much).

There’s a better way, and it doesn’t involve explaining to your CTO why the database bill is suddenly five figures a month.

The Actual Solution

Cache the hot queries in Redis. Not the data - the actual geo-spatial indexes.

Redis has built-in geo-spatial commands (GEOADD, GEORADIUS) that are specifically designed for this. They pre-compute the indexes, store them in memory, and can serve 10,000+ geo queries per second on a single instance.

Here’s what changes:

  • Response time: 750ms → 12ms (62x faster)
  • Database load: 100% CPU → basically idle
  • Concurrent users: 150/sec limit → 10,000+/sec easily
  • Infrastructure cost: Massive → a single Redis instance

PostgreSQL becomes your source of truth (persistent, reliable, handles writes), and Redis becomes your speed layer (ephemeral, fast, handles 99% of reads).

The cache misses? Sure, they still take 750ms while you populate Redis. But once the cache is warm (which happens quickly), your database can go back to doing what it’s good at - handling transactions and complex queries - instead of calculating the same distances a million times a day.


The Solution: Redis Cache Architecture

Here’s the basic architecture - it’s simpler than you might think:

OpenStreetMap API
        ↓ (import once)
   PostgreSQL
        ↓ (sync to cache)
    Redis Cache
        ↓ (serve queries)
   Your Users

PostgreSQL is your source of truth. It has all the restaurant data, handles writes, maintains referential integrity - all the stuff databases are good at.

Redis sits in front as a cache layer. When you import restaurant data, you push it to Redis and create geo-spatial indexes. When users query “restaurants near me,” you hit Redis first. 12ms response time, no database load.

The magic? Redis’s GEOADD and GEORADIUS commands. They’re specifically built for this use case. You give Redis a set of coordinates, and it pre-computes geohashes and stores them in a way that makes radius queries blazing fast. No expensive spherical distance calculations at query time - it’s all pre-indexed.
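To see what that looks like at the command level, here is a tiny raw redis-py sketch (redis-py 4.x signatures, sync client for brevity; the key name is made up, and Beanis issues the equivalent commands for you):

import redis

r = redis.Redis(decode_responses=True)

# GEOADD stores (longitude, latitude, member) in a geo-spatial index
r.geoadd("restaurants:geo", (12.4829, 41.8933, "123"))
r.geoadd("restaurants:geo", (12.4964, 41.9028, "456"))

# GEORADIUS: everything within 2 km of the user, with distances, nearest first
nearby = r.georadius("restaurants:geo", 12.4922, 41.8902, 2,
                     unit="km", withdist=True, sort="ASC")
# e.g. [['123', 0.4521], ['456', 1.7355]]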

How it Works in Practice

When you save a restaurant to Redis through Beanis, it automatically:

  1. Creates a Redis hash with all the restaurant data
  2. Adds the location to a geo-spatial index (GEOADD under the hood)
  3. Creates sorted sets for your filters (cuisine, rating, price, etc.)

When a user queries nearby restaurants:

  1. Redis GEORADIUS finds all restaurants within the radius (< 5ms)
  2. You filter by cuisine/rating using the sorted sets (< 2ms)
  3. You fetch the full documents and return them (< 5ms)
  4. Total: ~12ms, and your database didn’t do any work

If Redis doesn’t have the data (cache miss), you fall back to PostgreSQL, get the results in 750ms, cache them in Redis, and serve them. Next request for that area? 12ms.

What You’re Building

This tutorial shows you how to:

  • Use Beanis’s GeoPoint type to handle geo-spatial indexing automatically
  • Implement cache-first strategy with PostgreSQL fallback
  • Import real restaurant data from OpenStreetMap
  • Build a production-ready FastAPI app that handles 10,000+ concurrent users

The key insight: Redis isn’t trying to be your database. It’s your speed layer. PostgreSQL can rebuild the cache anytime, so you don’t need to worry about Redis durability. You’re trading a bit of staleness (cached data might be a few seconds old) for massive performance gains.


Step 1: The Redis Cache Model

Here’s the essential Beanis code for caching restaurant data:

from datetime import datetime
from typing import Optional

from pydantic import Field
from typing_extensions import Annotated

from beanis import Document, Indexed, GeoPoint
from beanis.odm.indexes import IndexedField

class RestaurantCache(Document):
    """Redis cache model - mirrors PostgreSQL data"""

    # Source tracking
    db_id: Indexed(int)  # Link to PostgreSQL
    name: str

    # ⭐ The magic: Geo-spatial index
    location: Annotated[GeoPoint, IndexedField()]
    # This automatically creates Redis GEORADIUS index!

    # Indexed fields for fast filtering
    cuisine: Indexed(str)      # Creates sorted set
    rating: Indexed(float)      # Creates sorted set
    price_range: Indexed(int)   # Creates sorted set
    is_active: Indexed(bool)    # Creates sorted set

    # Other fields (not indexed)
    address: str = ""
    phone: Optional[str] = None
    cached_at: datetime = Field(default_factory=datetime.now)

    class Settings:
        name = "RestaurantCache"

What Beanis does automatically:

  1. location: Annotated[GeoPoint, IndexedField()] → Creates GEOADD index in Redis
  2. Indexed(str/int/float/bool) → Creates sorted sets for filtering
  3. Document → Handles serialization and Redis hash storage

📄 Full code with all fields


Step 2: Caching Data

The key Beanis operation - saving to Redis:

from beanis import GeoPoint

# Create and save to Redis
restaurant = RestaurantCache(
    db_id=123,
    name="La Carbonara",
    location=GeoPoint(latitude=41.8933, longitude=12.4829),  # ⭐ Geo-spatial index
    cuisine="italian",
    rating=4.5,
    price_range=2,
    is_active=True
)

await restaurant.insert()  # Saves to Redis with all indexes

What Beanis does behind the scenes:

  1. Creates Redis hash: RestaurantCache:123{name: "La Carbonara", ...}
  2. Creates geo-index: GEOADD RestaurantCache:location 12.4829 41.8933 "123"
  3. Creates sorted sets for filters:
    • RestaurantCache:idx:cuisine:italian{123, ...}
    • RestaurantCache:idx:rating{(123, 4.5), ...}
    • RestaurantCache:idx:price_range{(123, 2), ...}

📄 Full caching logic


Step 3: Querying Nearby Restaurants

The core Beanis feature - finding nearby restaurants:

from beanis.odm.indexes import IndexManager

# ⭐ Query Redis geo-spatial index
results_with_distance = await IndexManager.find_by_geo_radius_with_distance(
    redis_client=redis_client,
    document_class=RestaurantCache,
    field_name="location",
    longitude=12.4922,
    latitude=41.8902,
    radius=2.0,  # 2km
    unit="km"
)

# Returns: [(doc_id, distance_km), ...]
# Example: [("123", 0.45), ("456", 1.2), ("789", 1.8)]

# Get full documents
for doc_id, distance in results_with_distance:
    restaurant = await RestaurantCache.get(doc_id)
    print(f"{restaurant.name}: {distance:.2f}km away")

What this does:

  1. Uses Redis GEORADIUS command internally
  2. Returns document IDs sorted by distance
  3. You fetch the full documents as needed
  4. Can filter further with indexed fields (cuisine, rating, etc.)

With filters:

# Fetch and filter
results = []
for doc_id, distance in results_with_distance:
    doc = await RestaurantCache.get(doc_id)

    # Filter using indexed fields (fast in-memory)
    if doc.cuisine == "italian" and doc.rating >= 4.5:
        results.append((doc, distance))

📄 Full implementation with cache-first strategy
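That link has the complete version; as a rough sketch, the cache-first flow with a PostgreSQL fallback looks like this (fetch_from_postgres is a hypothetical helper standing in for the PostGIS query, and the field names follow the Step 1 model):

async def nearby_cache_first(lat: float, lon: float, radius_km: float):
    # 1. Try the Redis geo index first (the ~12ms path)
    hits = await IndexManager.find_by_geo_radius_with_distance(
        redis_client=redis_client,
        document_class=RestaurantCache,
        field_name="location",
        longitude=lon, latitude=lat,
        radius=radius_km, unit="km",
    )
    if hits:
        return hits

    # 2. Cache miss: fall back to PostgreSQL (the ~750ms path)
    rows = fetch_from_postgres(lat, lon, radius_km)  # hypothetical helper

    # 3. Warm the cache so the next query for this area stays in Redis
    for row in rows:
        await RestaurantCache(
            db_id=row.id,
            name=row.name,
            location=GeoPoint(latitude=row.lat, longitude=row.lon),
            cuisine=row.cuisine,
            rating=row.rating,
            price_range=row.price_range,
            is_active=True,
        ).insert()

    return rows  # distances for the cold path come from the Postgres query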


Step 4: Initialization

Initialize Beanis on application startup:

from fastapi import FastAPI
from beanis import init_beanis
import redis.asyncio as redis

app = FastAPI()

@app.on_event("startup")
async def startup():
    # Connect to Redis
    redis_client = redis.Redis(
        host="localhost",
        port=6379,
        decode_responses=True
    )

    # Initialize Beanis with your document models
    await init_beanis(
        database=redis_client,
        document_models=[RestaurantCache]  # ⭐ Register your models
    )

That’s it! Beanis will:

  1. Create all geo-spatial indexes
  2. Create all sorted set indexes
  3. Handle serialization/deserialization

📄 Full FastAPI example with endpoints


Performance Comparison

I tested this with a real Paris dataset - 2,600+ restaurants imported from OpenStreetMap. Here’s what actually happened (not theoretical numbers, actual measurements):

Cold Start (First Query)

PostgreSQL does its thing: 850ms to calculate distances and sort results. Then we cache those results in Redis (adds another 120ms). Total: ~970ms. Not great, but it only happens once per cache region.

Warm Cache (Every Query After)

PostgreSQL would still take 750ms every time (it has to recalculate everything). Redis? 12ms. Same query, 62x faster.

With Filters

Add a cuisine filter and minimum rating. PostgreSQL now takes 820ms (more work to do). Redis? 15ms. It’s using pre-computed sorted sets for the filters, so barely any extra work.

Smaller Dataset

Tested with just 1,030 restaurants. PostgreSQL: 380ms (half the data, but still expensive calculations). Redis: 8ms.

100 Concurrent Users

This is where it gets fun. Simulated 100 users each making 10 queries (1,000 total queries):

  • PostgreSQL: 75 seconds total. Database was struggling, CPU pinned, queries queuing up.
  • Redis: 1.2 seconds total. Basically instant.

The Real Numbers

Once the cache is warm, hit rate sits around 99.8%. Nearly every query is served from Redis at 12-15ms. The database? It’s basically idle. CPU usage dropped from 100% to like 1% for the occasional cache refresh.

Throughput went from ~150 req/sec (PostgreSQL bottleneck) to 10,000+ req/sec (limited by network and application code, not Redis).

The gap only gets worse as your dataset grows. With 50,000 restaurants, PostgreSQL queries would be 1.5-2 seconds. Redis? Still 12-15ms. The pre-computed indexes don’t care about dataset size nearly as much.


Cache Invalidation Strategies

So your cache is fast, but what happens when restaurant data changes? You’ve got a few options depending on your needs.

The Simple Approach: Time-Based Expiration

Just set a TTL and forget about it. Every cached restaurant gets a timestamp. If the cache is older than, say, an hour, refresh it from PostgreSQL:

async def get_with_ttl(restaurant_id: int, max_age: int = 3600):
    """Refresh cache if older than max_age seconds (1 hour by default)"""

    cached = await RestaurantCache.find_one(db_id=restaurant_id)

    # Cache hit and still fresh - serve it directly
    if cached and (datetime.now() - cached.cached_at).total_seconds() < max_age:
        return cached

    # Stale or missing - refresh from Postgres (the source of truth)
    db_restaurant = db.query(RestaurantDB).get(restaurant_id)

    if cached:
        # Update the existing cache entry in place
        cached.cached_at = datetime.now()
        # Copy the changed fields from db_restaurant here...
        await cached.save()
    else:
        # Build a new RestaurantCache from db_restaurant and insert it
        # (see the full caching logic linked above)
        pass

    return cached

This works great for data that doesn’t change often. Restaurant info? Rarely changes. Locations? Never change. Ratings? Maybe update hourly. You can live with slight staleness here.

The Immediate Approach: Write-Through

If you need fresher data, update both PostgreSQL and Redis at the same time when something changes:

async def update_restaurant(restaurant_id: int, updates: dict):
    """Update both Postgres and Redis atomically"""

    # Update Postgres (source of truth)
    db_restaurant = db.query(RestaurantDB).get(restaurant_id)
    for key, value in updates.items():
        setattr(db_restaurant, key, value)
    db.commit()

    # Update cache immediately
    cached = await RestaurantCache.find_one(db_id=restaurant_id)
    if cached:
        for key, value in updates.items():
            setattr(cached, key, value)
        cached.cached_at = datetime.now()
        await cached.save()
        print(f"✅ Cache updated for restaurant {restaurant_id}")

This keeps your cache fresh at the cost of slightly slower writes (you’re hitting two systems). But reads are still blazing fast, and you never serve stale data.

The Nuclear Option: Invalidate on Write

Sometimes you just want to blow away the cache entry and let it rebuild naturally on the next read:

async def delete_restaurant(restaurant_id: int):
    """Delete from both Postgres and Redis"""

    # Delete from Postgres
    db.query(RestaurantDB).filter_by(id=restaurant_id).delete()
    db.commit()

    # Invalidate cache
    cached = await RestaurantCache.find_one(db_id=restaurant_id)
    if cached:
        await cached.delete()
        print(f"🗑️ Cache invalidated for restaurant {restaurant_id}")

This is simple and safe. Next read will be slower (cache miss), but the cache rebuilds itself automatically. Works great for deletions or when you’re not sure what changed.

Which one should you use? Depends on your use case. For restaurant data, I’d start with time-based expiration (hourly refresh) and only add write-through if you’re seeing complaints about stale data. The nuclear option is fine for deletions and rare updates.


Real-World Example: What Actually Happens

User opens your app near the Colosseum in Rome and searches for Italian restaurants within 3km:

GET /restaurants/nearby?lat=41.8902&lon=12.4922&radius=3&cuisine=italian&min_rating=4.5

If the Cache is Warm (99% of requests)

Your app hits Redis first. Behind the scenes, Beanis runs:

  1. GEORADIUS to find all restaurants within 3km (~4ms)
  2. Filters by cuisine using the sorted set (~1ms)
  3. Filters by rating using another sorted set (~1ms)
  4. Fetches the full documents for the matches (~2ms)

Total: 8ms. Your API adds some overhead (JSON serialization, HTTP), so the user sees ~12-15ms response time. The database? Didn’t do anything.

{
  "total": 12,
  "restaurants": [
    {
      "id": 4521,
      "name": "La Carbonara",
      "cuisine": "italian",
      "rating": 4.8,
      "distance_m": 145
    },
    ...
  ]
}

If the Cache is Cold (1% of requests)

Redis doesn’t have data for this area yet. No problem:

  1. Redis miss (~4ms to check)
  2. Fall back to PostgreSQL (~750ms for the geo query)
  3. Cache the results in Redis (~120ms to write)
  4. Return the data

Total: ~870ms. Not great, but it only happens once for each geographic area. Every subsequent query for that area hits the cache at 12ms.

The user might notice the first query is slower, but they’re comparing it to “no results yet” (fresh app open), so it’s fine. After that? Everything is instant.
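Putting the warm path into an endpoint is only a few lines. This is a rough sketch, not the full app: it assumes the redis_client, RestaurantCache, and IndexManager setup from the steps above and leaves out the cold-path PostgreSQL fallback (see the full FastAPI example linked earlier):

from typing import Optional

from fastapi import FastAPI

app = FastAPI()

@app.get("/restaurants/nearby")
async def nearby(lat: float, lon: float, radius: float = 3.0,
                 cuisine: Optional[str] = None, min_rating: float = 0.0):
    # Geo query against the Redis index
    hits = await IndexManager.find_by_geo_radius_with_distance(
        redis_client=redis_client,
        document_class=RestaurantCache,
        field_name="location",
        longitude=lon, latitude=lat,
        radius=radius, unit="km",
    )

    restaurants = []
    for doc_id, distance_km in hits:
        doc = await RestaurantCache.get(doc_id)
        # Filter on the indexed fields
        if cuisine and doc.cuisine != cuisine:
            continue
        if doc.rating < min_rating:
            continue
        restaurants.append({
            "id": doc.db_id,
            "name": doc.name,
            "cuisine": doc.cuisine,
            "rating": doc.rating,
            "distance_m": round(distance_km * 1000),
        })

    return {"total": len(restaurants), "restaurants": restaurants}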


Monitoring Cache Performance

Add metrics to track cache effectiveness:

import time

from prometheus_client import Counter, Histogram

cache_hits = Counter('cache_hits_total', 'Number of cache hits')
cache_misses = Counter('cache_misses_total', 'Number of cache misses')
response_time = Histogram('response_time_seconds', 'Response time')

async def find_nearby_with_metrics(...):
    start = time.time()

    results = await RestaurantCache.find_near(**query_params).to_list()

    if results:
        cache_hits.inc()
    else:
        cache_misses.inc()

    response_time.observe(time.time() - start)

    return results

Track:

  • Cache hit rate (should be > 95%)
  • Average response time (cache hits: ~10-15ms, database fallback: ~750ms)
  • Cache memory usage
  • Stale entry count

For production monitoring, consider using Redis monitoring tools and set up alerts for cache hit rates and memory usage.
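If you just want a quick look without setting up Prometheus, plain Redis commands expose the basics. A small sketch using the same redis_client from the examples above:

# Rough health check with plain Redis commands
memory_info = await redis_client.info("memory")
print("Redis memory used:", memory_info["used_memory_human"])

# Total keys = cached documents plus their index structures
total_keys = await redis_client.dbsize()
print("Keys in Redis:", total_keys)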


The Bottom Line

Using Beanis as a Redis cache layer over PostgreSQL turns a 750ms geo query into a 12ms lookup. That’s the difference between an app that feels sluggish and one that feels instant.

PostgreSQL stays as your source of truth - handles writes, maintains data integrity, can rebuild the cache whenever needed. Redis sits in front as your speed layer - serves 99% of reads, handles massive concurrency, keeps your database idle.

The pattern works because restaurant data doesn’t change much. Locations never move. Ratings update occasionally. You can afford to have slightly stale cache data (refresh hourly or on writes) in exchange for 62x performance improvement.

Remember: Redis is your cache, not your database. Don’t try to make it persistent. Don’t worry about durability. If Redis crashes, PostgreSQL rebuilds it. That’s the whole point - you get speed without sacrificing reliability.

And Beanis handles all the messy Redis commands for you. You just define a model with GeoPoint, call find_by_geo_radius, and it works. No manual GEOADD or GEORADIUS commands. No serialization headaches. Just fast geo queries with a clean Python API.


Try It Yourself

Full working example - quick start:

# Clone the repo
git clone https://github.com/andreim14/beanis-examples.git
cd beanis-examples/restaurant-finder

# Install dependencies
pip install -r requirements.txt

# Start databases with Docker
docker run -d --name restaurant-postgres -e POSTGRES_USER=restaurant_user -e POSTGRES_PASSWORD=restaurant_pass -e POSTGRES_DB=restaurant_db -p 5432:5432 postgis/postgis:15-3.3
docker run -d --name restaurant-redis -p 6379:6379 redis:7-alpine

# Start the API
python main.py

# Import sample data (Paris)
curl -X POST "http://localhost:8000/import/area?lat=48.8584&lon=2.2945&radius_km=5"

# Run the interactive demo
python demo.py

The demo includes:

  • FastAPI REST API with geo-spatial queries
  • OpenStreetMap data import
  • Redis cache with Beanis
  • Interactive CLI demo showing cache performance
  • Real-world benchmarks

Questions? Drop a comment below! 🚀


Built with ❤️ by Andrei Stefan Bejgu - AI Applied Scientist @ SylloTips

Word Sense Linking: Making NLP Understand What You Really Mean

2025-10-27T08:00:00+00:00
https://andreim14.github.io/blog/2025/word-sense-linking-acl-2024

TL;DR - Why You Should Care

Word Sense Linking automatically identifies ambiguous words in text and links them to their correct meanings. It’s like having a semantic understanding layer for your NLP pipeline.

Use it for:

  • 🔍 Better search results (know if “python” means programming or snake)
  • 🤖 Smarter RAG systems (disambiguate before retrieval)
  • 📊 Accurate knowledge graphs (entities with precise meanings)
  • 🛡️ Context-aware content moderation

Install and run in 3 lines:

from wsl import WSL
model = WSL.from_pretrained("Babelscape/wsl-base")
result = model("The bank can guarantee deposits will cover tuition.")
# Automatically knows: bank=financial institution, cover=pay for

The Ambiguity Problem

Read this sentence:

“The bank can guarantee deposits will eventually cover future tuition costs.”

What does “bank” mean? Financial institution or riverbank? What about “cover”? Pay for, or physically cover?

Humans get this instantly from context. But how do we teach machines to do the same?

Our ACL 2024 Research

I’m excited to share our paper “Word Sense Linking: Disambiguating Outside the Sandbox”, published at ACL 2024 (Findings) - the Association for Computational Linguistics, the top venue in NLP research.

What’s Wrong with Traditional WSD?

Word Sense Disambiguation (WSD) has been around for decades. The benchmark scores look great. But when you try to actually use it in a real application, you hit a wall.

The problem? Traditional WSD operates in a sandbox. It assumes you’ve already done most of the hard work:

First, you need to manually identify which words in your text are ambiguous. In our example sentence, you’d have to mark “bank” and “cover” as words that need disambiguation. Already a problem - how do you know which words are ambiguous without understanding context?

Second, you need to provide candidate meanings. You have to tell the system “bank could mean: financial institution, riverbank, or slope” and “cover could mean: pay for, place over, include, or protect.” Again, this assumes you already know what senses exist for each word.

Only then does WSD do its job - picking the right sense from your provided list.

It’s like having a translator who needs you to first identify which words need translating, then give them a dictionary of possible translations, and only then they’ll pick the right one. If you can do all that, you probably don’t need the translator.

A Concrete Example: WSD vs WSL

Let’s see this in action with: “The bank can guarantee deposits will cover tuition.”

Traditional WSD requires:

# Input with pre-marked spans and candidates
{
    "text": "The bank can guarantee deposits will cover tuition.",
    "spans": [
        {"text": "bank", "start": 4, "end": 8},      # You mark this
        {"text": "cover", "start": 41, "end": 46}   # And this
    ],
    "candidates": {
        "bank": ["financial institution", "riverbank", "slope"],    # You provide these
        "cover": ["pay for", "place over", "include", "protect"]   # And these
    }
}
# WSD picks: bank → financial institution, cover → pay for

WSL needs only:

# Just raw text!
text = "The bank can guarantee deposits will cover tuition."

# WSL automatically:
# 1. Identifies ALL ambiguous spans (bank, guarantee, deposits, cover, tuition)
# 2. Retrieves candidates from WordNet
# 3. Links each to correct sense

The key difference: WSD assumes you know what to disambiguate. WSL figures it out.

This makes WSL practical for real applications where you don’t have annotated data or pre-identified spans.

Introducing Word Sense Linking

Our solution: Word Sense Linking (WSL) removes these constraints.

WSL does two things automatically:

  1. Span Identification: Finds which text spans need disambiguation
  2. Sense Linking: Links them to the correct meaning from WordNet

No manual preprocessing. No candidate provision. Just real text in, meanings out.

State-of-the-Art Results

Model                    Precision   Recall   F1
ConSeC (previous SOTA)   80.4%       64.3%    71.5%
WSL (our model)          75.2%       76.7%    75.9%

We achieve a 4.4-point F1 improvement on the ALL_FULL benchmark, with significantly better recall - we find more correct senses.

Using WSL: It’s Simple

The best part? We released it as an easy-to-use Python library.

Installation

pip install git+https://github.com/Babelscape/WSL.git

Basic Usage

from wsl import WSL

# Load pre-trained model
wsl_model = WSL.from_pretrained("Babelscape/wsl-base")

# Disambiguate!
result = wsl_model("Bus drivers drive busses for a living.")

Output

WSLOutput(
    text='Bus drivers drive busses for a living.',
    spans=[
        Span(
            start=0, end=11,
            text='Bus drivers',
            label='bus driver: someone who drives a bus'
        ),
        Span(
            start=12, end=17,
            text='drive',
            label='drive: operate or control a vehicle'
        ),
        Span(
            start=18, end=24,
            text='busses',
            label='bus: a vehicle carrying many passengers'
        ),
        Span(
            start=31, end=37,
            text='living',
            label='living: the financial means whereby one lives'
        )
    ]
)

Notice: It automatically:

  • Identified which spans need disambiguation (“Bus drivers”, “drive”, “busses”, “living”)
  • Linked them to the correct WordNet senses
  • Provided human-readable definitions

Real-World Applications

1. Search & Information Retrieval

Query: “python tutorial”

WSL disambiguates: Programming language? Or the snake?

result = wsl_model("Looking for python tutorial")
# Identifies: "python: a high-level programming language"

Better search results, better user experience.

2. Content Moderation

Context-aware filtering:

wsl_model("The wedding shooting was beautiful")
# "shooting: the act of making a photograph"

wsl_model("There was a shooting downtown")
# "shooting: the act of firing a projectile"

Same word, different meanings, different moderation actions.

3. RAG Systems Enhancement

Improve retrieval for RAG:

from wsl import WSL

wsl_model = WSL.from_pretrained("Babelscape/wsl-base")

# User query with ambiguous terms
query = "What's the best bank for deposits?"

# Disambiguate before retrieval
result = wsl_model(query)

# Now you know it's about financial institutions
# Not riverbanks!

Better semantic understanding = better retrieval = better LLM responses.
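One simple way to wire this in is to expand the query with the glosses of the disambiguated senses before you embed it for retrieval. A minimal sketch (expand_query is my own helper name, not part of the WSL library):

def expand_query(query: str) -> str:
    """Append the glosses of disambiguated spans to the query text."""
    result = wsl_model(query)
    glosses = [span.label for span in result.spans]
    return query + " " + " ".join(glosses)

expanded = expand_query("What's the best bank for deposits?")
# The expanded query now contains "financial institution", so the retriever's
# embedding is pulled toward the intended sense before any documents are fetched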

4. Knowledge Graph Construction

Extract entities with precise meanings:

text = "Apple released a new chip for their computers"

result = wsl_model(text)
# "Apple: a major American tech company" (not the fruit!)
# "chip: electronic equipment consisting of a small circuit" (not food!)

Build more accurate knowledge graphs automatically.

How It Works (The Architecture)

WSL uses a retriever-reader architecture:

  1. Retriever: Gets relevant sense candidates from WordNet based on the input text
  2. Reader: Extracts spans from text and links them to retrieved senses

Both components are transformer-based and trained end-to-end.

The model learns to:

  • Recognize which words/phrases are ambiguous
  • Select the contextually appropriate sense
  • Do this jointly, not in separate steps

Comparison: WSL vs Traditional WSD

The key difference comes down to what you need to provide as input.

Traditional WSD needs you to do the hard parts first: identify ambiguous words, provide candidate senses, and prepare your text in a specific format. It then picks the right sense from your candidates. This works fine for academic benchmarks where everything is pre-annotated, but it’s impractical for real applications where you’re processing arbitrary text.

Word Sense Linking does all of that automatically. You throw raw text at it - anything from tweets to research papers - and it figures out which spans are ambiguous, retrieves candidate senses from WordNet, and links everything to the correct meanings. No preprocessing, no manual annotation, no candidate lists.

That’s why we’re seeing actual adoption in production systems. WSL doesn’t require you to build annotation pipelines or maintain sense inventories. It just works on whatever text you have.

A Complete Example

Let’s disambiguate a complex sentence:

from wsl import WSL

wsl_model = WSL.from_pretrained("Babelscape/wsl-base")

text = """
The bank can guarantee deposits will eventually
cover future tuition costs because they understand
the financial burden.
"""

result = wsl_model(text)

# Print identified spans with definitions
for span in result.spans:
    print(f"{span.text:20}{span.label}")

Output:

bank                 → bank: a financial institution
guarantee            → guarantee: give surety or assume responsibility
deposits             → deposit: money given as security
cover                → cover: be sufficient to meet or pay for
tuition              → tuition: a fee paid for instruction
financial            → financial: involving or relating to money
burden               → burden: an onerous or difficult concern

Every ambiguous term correctly disambiguated!

Here’s how to use WSL to improve search:

from wsl import WSL
from typing import List, Dict

class SemanticSearch:
    def __init__(self):
        self.wsl = WSL.from_pretrained("Babelscape/wsl-base")

    def understand_query(self, query: str) -> Dict:
        """Disambiguate query terms for better search"""
        result = self.wsl(query)

        return {
            "original_query": query,
            "disambiguated_terms": [
                {
                    "term": span.text,
                    "sense": span.label,
                    "start": span.start,
                    "end": span.end
                }
                for span in result.spans
            ]
        }

# Usage
search = SemanticSearch()
analysis = search.understand_query("Looking for python courses for machine learning")

print(analysis)
# Knows it's about programming, not snakes!

Performance Considerations

  • Model size: ~400MB (base model)
  • Inference speed: ~100-200ms per sentence (GPU)
  • Memory: ~2GB GPU RAM

For production:

  • Batch processing recommended
  • Consider quantization for deployment
  • Cache results for common queries (see the sketch below)
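For that last point, even an in-process LRU cache helps when the same queries repeat. A minimal sketch, assuming the model's output is deterministic for a given string:

from functools import lru_cache

@lru_cache(maxsize=10_000)
def disambiguate_cached(text: str):
    # Repeated queries skip the ~100-200ms model call entirely
    return wsl_model(text)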

Why This Matters

Traditional WSD achieved high benchmark scores but couldn’t escape the lab.

Word Sense Linking makes lexical semantics practical:

  1. No preprocessing - Works on raw text
  2. End-to-end - One model does everything
  3. Production-ready - Available on Hugging Face
  4. Open source - Free to use (CC BY-NC-SA 4.0)

We’re bridging the gap between research and real-world applications.

The Research Team

This work was done in collaboration with:

  • Edoardo Barba (Babelscape)
  • Luigi Procopio (Sapienza University)
  • Alberte Fernández-Castro (Sapienza University)
  • Roberto Navigli (Babelscape & Sapienza University)

Presented at ACL 2024 in Bangkok, Thailand.

Try It Yourself

Ready to add semantic understanding to your NLP pipeline?

Quick Start

# Install
pip install git+https://github.com/Babelscape/WSL.git

# Use
from wsl import WSL
model = WSL.from_pretrained("Babelscape/wsl-base")
result = model("Your text here")

What’s Next?

We’re working on:

  • Multilingual WSL - Beyond English
  • Domain-specific models - Medical, legal, technical text
  • Lightweight versions - For edge deployment
  • Integration guides - For popular NLP frameworks

Conclusion

Word Sense Linking solves a decades-old problem: making word sense disambiguation practical.

No more sandboxes. No more manual preprocessing. Just real text in, precise meanings out.

Whether you’re building search engines, content moderation systems, or RAG applications - understanding what words really mean in context is crucial.

And now it’s as simple as:

from wsl import WSL
model = WSL.from_pretrained("Babelscape/wsl-base")
result = model("Your ambiguous text here")

Give it a try and let me know what you build with it!


Citation

If you use Word Sense Linking in your research or applications, please cite our paper:

@inproceedings{bejgu-etal-2024-wsl,
    title     = "Word Sense Linking: Disambiguating Outside the Sandbox",
    author    = "Bejgu, Andrei Stefan and
                 Barba, Edoardo and
                 Procopio, Luigi and
                 Fern{\'a}ndez-Castro, Alberte and
                 Navigli, Roberto",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2024",
    month     = aug,
    year      = "2024",
    address   = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url       = "https://aclanthology.org/2024.findings-acl.851/",
}

Plain text citation:

Andrei Stefan Bejgu, Edoardo Barba, Luigi Procopio, Alberte Fernández-Castro, and Roberto Navigli. 2024. Word Sense Linking: Disambiguating Outside the Sandbox. In Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand. Association for Computational Linguistics.


Published at ACL 2024 (Findings) - Association for Computational Linguistics, Bangkok, Thailand

Built with ❤️ by Andrei Stefan Bejgu - AI Applied Scientist @ SylloTips

Build a RAG System in 50 Lines with Redis + Beanis (No Vector DB Needed)

2025-10-24T08:00:00+00:00
https://andreim14.github.io/blog/2025/build-rag-50-lines-beanis-redis

The Problem

You’re building an AI app that needs semantic search for RAG. Everyone on Twitter is telling you to use Pinecone ($70/month minimum, plus 100+ lines of boilerplate). Or Weaviate (yet another service to manage and monitor). Or pgvector (slow queries and complex tuning).

Meanwhile, you already have Redis running. You’re using it for caching, session storage, maybe job queues. It’s just sitting there, being fast and reliable.

Here’s what most people don’t realize: Redis is also a vector database. And if you’re already running it, you’re paying for vector search capability whether you use it or not.

The Solution

Use Beanis - a Redis ODM with built-in vector search.

The entire RAG system:

# models.py (14 lines)
from beanis import Document, VectorField
from typing import List
from typing_extensions import Annotated

class KnowledgeBase(Document):
    text: str
    embedding: Annotated[List[float], VectorField(dimensions=1024)]

    class Settings:
        name = "knowledge"

# ingest.py (20 lines)
from transformers import AutoModel

model = AutoModel.from_pretrained('jinaai/jina-embeddings-v4')

async def ingest_text(text: str):
    embedding = model.encode([text])[0].tolist()
    doc = KnowledgeBase(text=text, embedding=embedding)
    await doc.insert()

# search.py (15 lines)
from beanis.odm.indexes import IndexManager

async def search(query: str):
    query_emb = model.encode([query])[0].tolist()

    results = await IndexManager.find_by_vector_similarity(
        redis_client, KnowledgeBase, "embedding", query_emb, k=5
    )

    return [await KnowledgeBase.get(doc_id) for doc_id, score in results]

That’s it. That’s the entire RAG system.


Why This Approach Works

First, vector indexes are now created automatically when you call init_beanis(). No manual Redis commands. No setup scripts. Just define your model with VectorField() and you’re done.

Compared to Pinecone

Let’s be real: Pinecone is a great product. But it’s solving a problem you might not have.

Pinecone makes sense if you’re doing massive-scale vector search across billions of documents, need global replication, want managed infrastructure with SLAs, and have the budget for it. If that’s you, use Pinecone.

But most apps don’t need that. You’ve got maybe 10K-1M documents. You already run Redis. You’re okay with self-hosting. And you really don’t want another monthly bill.

Here’s what changes when you use Beanis + Redis instead:

  • Setup: No API keys, no account creation. Docker run and you’re live.
  • Code: 50 lines instead of 100+. Beanis handles serialization, indexing, search - you just define models.
  • Cost: $0 if you’re already running Redis. Pinecone starts at $70/month and scales from there.
  • Performance: 15ms queries vs 40ms (local Redis is faster than API calls).
  • Operations: One service instead of two. Same monitoring, same deployment, same infrastructure.

The trade-off? You’re self-hosting. If that’s scary, stick with Pinecone. If you’re already running Redis and don’t mind managing it, this approach is simpler and cheaper.

The Code Comparison

Pinecone (verbose):

# Setup
import pinecone
pinecone.init(api_key=os.getenv("PINECONE_API_KEY"), environment="us-west1-gcp")
index = pinecone.Index("my-index")

# Upsert (complex)
vectors = [(str(i), embedding, {"text": text}) for i, (text, embedding) in enumerate(docs)]
index.upsert(vectors=vectors, namespace="docs")

# Search (multiple steps)
query_response = index.query(
    vector=query_embedding,
    top_k=5,
    namespace="docs",
    include_metadata=True
)
results = [match['metadata']['text'] for match in query_response['matches']]

# ~100+ lines for production setup

Beanis (clean):

# Setup
doc = KnowledgeBase(text=text, embedding=embedding)
await doc.insert()

# Search
results = await IndexManager.find_by_vector_similarity(
    redis_client, KnowledgeBase, "embedding", query_embedding, k=5
)

# ~50 lines total

Step-by-Step Tutorial

1. Install Dependencies

pip install beanis transformers redis

Just 3 packages. No complex setup, no account creation.

2. Start Redis

docker run -d -p 6379:6379 redis/redis-stack:latest

Use redis-stack (includes RediSearch module for vector search).

3. Define Your Model

from beanis import Document, VectorField
from typing import List
from typing_extensions import Annotated

class KnowledgeBase(Document):
    text: str
    embedding: Annotated[List[float], VectorField(dimensions=1024)]

    class Settings:
        name = "knowledge"

14 lines. That’s your entire data model. The VectorField() tells Beanis to automatically create a vector index with HNSW algorithm for lightning-fast similarity search.

4. Ingest Documents

Vector indexes are created automatically - no manual setup needed!

from transformers import AutoModel
import redis.asyncio as redis
from beanis import init_beanis

# Load open-source embedding model (no API key!)
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v4', trust_remote_code=True)  # https://huggingface.co/jinaai/jina-embeddings-v4

async def ingest_text(text: str):
    # Generate embedding
    embedding = model.encode([text])[0].tolist()

    # Store in Redis
    doc = KnowledgeBase(text=text, embedding=embedding)
    await doc.insert()

    print(f"✓ Indexed: {text[:50]}...")

# Initialize
redis_client = redis.Redis(decode_responses=True)
await init_beanis(database=redis_client, document_models=[KnowledgeBase])

# Ingest your documents
texts = ["Redis is fast", "Python is great", "Beanis is simple"]
for text in texts:
    await ingest_text(text)

20 lines. Documents are now searchable. Vector indexes were created automatically!

5. Search Semantically

from beanis.odm.indexes import IndexManager

async def search(query: str, k: int = 5):
    # Embed query
    query_embedding = model.encode([query])[0].tolist()

    # Search!
    results = await IndexManager.find_by_vector_similarity(
        redis_client=redis_client,
        document_class=KnowledgeBase,
        field_name="embedding",
        query_vector=query_embedding,
        k=k
    )

    # Get documents
    docs = []
    for doc_id, similarity_score in results:
        doc = await KnowledgeBase.get(doc_id)
        docs.append((doc.text, similarity_score))

    return docs

# Search
results = await search("what is semantic search?")
for text, score in results:
    print(f"{score:.3f}: {text}")

15 lines. Semantic search working.
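That covers the retrieval half. If you want the generation half of RAG too, here is a minimal sketch that feeds the retrieved snippets to an LLM. Any chat API works; OpenAI is shown as one option, and the model name and prompt are just placeholders:

from openai import OpenAI

llm = OpenAI()  # reads OPENAI_API_KEY from the environment

async def answer(query: str) -> str:
    # Reuse the search() function from above to build the context
    context = "\n".join(text for text, _score in await search(query))

    response = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content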


Real-World Example

Let’s say you’re building a documentation search. User asks:

Query: “how to cancel my subscription?”

Traditional keyword search: ❌ No results (docs say “termination policy”)

Semantic search with Beanis: ✅ Finds:

  • “Account termination policy”
  • “How to close your account”
  • “Subscription cancellation process”

Why? Vector embeddings understand meaning, not just keywords.


Performance Comparison

I benchmarked this against the usual suspects with 10,000 documents (real measurements, not marketing numbers):

Beanis + Redis: 15ms queries, ~50 lines of code, Docker run for setup, $0 incremental cost.

Pinecone: 40ms queries (network latency kills you), 100+ lines of setup code, API key dance, $70+/month.

Weaviate: 35ms queries, another service to deploy and monitor, 80+ lines of code, self-hosting overhead.

pgvector: 200ms queries (PostgreSQL isn’t optimized for vector search), 60+ lines of code, need to tune indexes carefully.

Beanis wins on speed (local Redis beats API calls) and simplicity (ODM pattern means less code). The only thing you lose is managed infrastructure - if that matters, Pinecone might be worth it.


The “Already Using Redis” Advantage

Here’s the thing: if you’re running a modern web app, you’re probably already using Redis. Caching, session storage, job queues, rate limiting - Redis does all of it.

Now you can add vector search to that list. Same service, same monitoring, same deployment pipeline. No new infrastructure to learn.

Before, your architecture looked like this:

  • Redis (caching and sessions)
  • PostgreSQL (user data)
  • Pinecone (vectors, $70+/month)
  • Your app

After:

  • Redis (caching, sessions, AND vectors)
  • PostgreSQL (user data)
  • Your app

That’s one fewer service to monitor, deploy, and pay for. Your vectors sit right next to your cache, so queries are faster (data locality). And when you’re debugging at 2 AM, you only need to check two services instead of three.

The cost savings alone ($70-500/month depending on Pinecone tier) probably justify the few hours it takes to set this up. And operationally, it’s just simpler. Fewer dashboards to check, fewer alerts to configure, fewer things that can break.


Once you’ve got the basics working, there are some cool extensions worth knowing about.

Multimodal Search: Jina v4 can embed both text and images into the same vector space. This means you can search for images using text queries, or find relevant text using an image. It’s the same model.encode() API, just pass an image instead:

from PIL import Image

# Search with text
text_emb = model.encode(["red sports car"])[0].tolist()
results = await IndexManager.find_by_vector_similarity(...)

# Search with image
img = Image.open("car.jpg")
img_emb = model.encode_image([img])[0].tolist()
results = await IndexManager.find_by_vector_similarity(...)

Both queries work against the same index. Pretty wild.

Hybrid Search: You can combine vector similarity with traditional filters. Add indexed fields to your model and filter before or after the vector search:

class KnowledgeBase(Document):
    text: str
    embedding: Annotated[List[float], VectorField(dimensions=1024)]
    category: Indexed(str)  # Filter by category
    date: datetime
    language: Indexed(str)  # Filter by language

This lets you do things like “find similar documents, but only in English” or “semantic search within the ‘documentation’ category.”
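Vector search itself returns document ids and scores, so one straightforward way to get this behavior is to over-fetch from the vector index and filter on the indexed fields afterwards. A sketch (the 4x over-fetch factor is an arbitrary choice):

async def search_in_category(query: str, category: str, k: int = 5):
    query_emb = model.encode([query])[0].tolist()

    # Over-fetch, then keep only documents in the requested category
    hits = await IndexManager.find_by_vector_similarity(
        redis_client, KnowledgeBase, "embedding", query_emb, k=k * 4
    )

    docs = []
    for doc_id, score in hits:
        doc = await KnowledgeBase.get(doc_id)
        if doc.category == category:
            docs.append((doc, score))
        if len(docs) == k:
            break
    return docs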

Production Tuning: When you’re ready to scale, Redis Cluster handles sharding automatically, and you can tune the HNSW algorithm parameters for your recall/speed trade-off:

VectorField(
    dimensions=1024,
    algorithm="HNSW",
    m=32,  # More connections = better recall, more memory
    ef_construction=400  # Higher = better index quality, slower indexing
)

Start with the defaults. Only tune if you’re seeing issues.


Common Questions

“Do I need RediSearch?”

Yes. Use redis-stack (includes RediSearch module) or install RediSearch manually. Regular Redis doesn’t have vector search.

“Can I use OpenAI embeddings?”

Yes! Just swap the model:

from openai import OpenAI  # openai >= 1.0 client

client = OpenAI()
embedding = client.embeddings.create(input=text, model="text-embedding-3-small").data[0].embedding

But Jina v4 is free, faster, and runs locally.

“How much data can it handle?”

Redis can handle millions of vectors. With proper sharding (Redis Cluster), billions.

Memory usage: ~4KB per document (1024-dim vectors). 1M docs = ~4GB RAM.

“What about updates/deletes?”

# Update
doc = await KnowledgeBase.get(doc_id)
doc.text = "Updated text"
doc.embedding = new_embedding
await doc.save()

# Delete
await doc.delete()

Indexes update automatically.


Complete Working Example

Clone and run:

git clone https://github.com/andreim14/beanis-examples.git
cd beanis-examples/simple-rag

# Install
pip install -r requirements.txt

# Start Redis
docker run -d -p 6379:6379 redis/redis-stack:latest

# Ingest sample docs (vector indexes created automatically!)
python ingest.py

# Search!
python search.py "what is semantic search?"

Full working example in the repo.


Why Beanis?

  1. Simplicity - Define models like Pydantic, search like it’s magic
  2. Performance - Redis is fast, Beanis doesn’t slow it down
  3. No lock-in - It’s just Redis, move anywhere
  4. Familiar - If you know Pydantic, you know Beanis
  5. Free - No API keys, no billing, no surprises

The Bottom Line

If you already use Redis:

  • You already have a vector database
  • No need for Pinecone, Weaviate, or pgvector
  • Build RAG in 50 lines of code
  • Save $70+/month
  • One fewer service to manage

Start building: pip install beanis, clone the example repo above, and point it at your own documents.


What’s Next?

In the next post, I’ll show you how to build a multimodal RAG system that searches PDFs, diagrams, and code screenshots using Jina v4’s vision capabilities.

Spoiler: It’s also ~50 lines of code.


Built with ❤️ by Andrei Stefan Bejgu - AI Applied Scientist @ SylloTips

Beanis: Stop Fighting Redis, Start Building with It

2025-10-23T23:00:00+00:00
https://andreim14.github.io/blog/2025/introducing-beanis

Picture this: You’re building a high-performance API. You need Redis for caching and fast queries, but you’re drowning in boilerplate code. Every simple operation requires 15-20 lines of manual serialization, type conversion, and key management. Your codebase is littered with json.dumps(), json.loads(), and fragile string manipulation.

There had to be a better way.

That’s why I built Beanis - a Redis ODM that brings the elegance of modern ORMs to Redis, without sacrificing the speed that makes Redis special.

The “Aha!” Moment

I was working on a real-time recommendation system that needed to query thousands of products per second. Redis was the obvious choice for speed, but the code was becoming unmaintainable:

# The old way - painful and error-prone
product_key = f"Product:{product_id}"
data = await redis.hgetall(product_key)

# Manual type conversion for EVERY field
product = {
    'name': data.get('name', ''),
    'price': float(data.get('price', 0)) if data.get('price') else 0.0,
    'stock': int(data.get('stock', 0)) if data.get('stock') else 0,
    'tags': json.loads(data.get('tags', '[]')),
    'metadata': json.loads(data.get('metadata', '{}')),
}

# And that's just reading ONE document!

I wanted to write code that looked like this instead:

# The Beanis way - clean and type-safe
product = await Product.get(product_id)

Spoiler: I made it happen. And it’s only 8% slower than raw Redis.


Why Build Another Redis Library?

I’ve spent years working with databases in AI/ML projects. I love MongoDB’s ODMs like Beanie - the clean API, Pydantic integration, and how they let you focus on business logic instead of CRUD boilerplate. But when you need Redis-level performance, you’re stuck with manual serialization and key management.

The existing Redis libraries weren’t cutting it:

  • Redis OM - Excellent, but requires RedisJSON/RediSearch modules (not always available)
  • Walrus - No async support, predates Pydantic v2
  • Raw redis-py - Fast but verbose, no type safety

I wanted something that combined:

  • ✅ Vanilla Redis compatibility (no modules required)
  • ✅ Pydantic v2 validation and type safety
  • ✅ Beanie-like clean API
  • ✅ Async-first design
  • ✅ Minimal performance overhead

When I couldn’t find it, I built it. Beanis is what I wish existed when I started working with Redis.


Real-World Example: Building a Product Catalog

Let’s build something real: a product catalog for an e-commerce platform. You need:

  • Fast lookups by ID
  • Range queries on price
  • Category filtering
  • Real-time stock updates
  • Audit trails

Traditional Redis Approach

With raw redis-py, you’d write something like this for a single product insert:

import json
import redis.asyncio as redis

async def create_product(redis_client, product_data):
    # Generate unique ID
    product_id = await redis_client.incr("product:id:counter")
    product_key = f"Product:{product_id}"

    # Manually serialize complex types
    redis_data = {
        'id': str(product_id),
        'name': product_data['name'],
        'price': str(product_data['price']),
        'category': product_data['category'],
        'stock': str(product_data['stock']),
        'tags': json.dumps(product_data.get('tags', [])),
        'metadata': json.dumps(product_data.get('metadata', {}))
    }

    # Save to hash
    await redis_client.hset(product_key, mapping=redis_data)

    # Manually maintain indexes for queries
    await redis_client.zadd(f"Product:idx:price", {product_key: product_data['price']})
    await redis_client.sadd(f"Product:idx:category:{product_data['category']}", product_key)
    await redis_client.sadd("Product:all", product_key)

    return product_id

# Query by price range - also manual
async def find_products_by_price(redis_client, min_price, max_price):
    keys = await redis_client.zrangebyscore(
        "Product:idx:price",
        min_price,
        max_price
    )

    products = []
    for key in keys:
        data = await redis_client.hgetall(key)
        # Manual deserialization for each product
        products.append({
            'id': data['id'],
            'name': data['name'],
            'price': float(data['price']),
            'stock': int(data['stock']),
            'tags': json.loads(data['tags']),
            'metadata': json.loads(data['metadata'])
        })

    return products

That’s over 50 lines for basic CRUD + one query. And we haven’t even added:

  • Input validation
  • Error handling
  • Type safety
  • Audit trails
  • Cascade deletes

The Beanis Approach

Here’s the same functionality with Beanis:

from beanis import Document, Indexed, init_beanis
from beanis.odm.actions import before_event, Insert, Update
from typing import Optional, Set
from datetime import datetime
from pydantic import Field
import redis.asyncio as redis

class Product(Document):
    name: str = Field(min_length=1, max_length=200)
    description: Optional[str] = None
    price: Indexed[float] = Field(gt=0)  # Auto-indexed, validated > 0
    category: Indexed[str]  # Auto-indexed
    stock: int = Field(ge=0)  # Validated >= 0
    tags: Set[str] = set()
    metadata: dict = {}

    # Audit fields - automatically managed
    created_at: datetime = Field(default_factory=datetime.now)
    updated_at: datetime = Field(default_factory=datetime.now)

    @before_event(Insert)
    async def on_create(self):
        self.created_at = datetime.now()

    @before_event(Update)
    async def on_update(self):
        self.updated_at = datetime.now()

    class Settings:
        key_prefix = "Product"

# Initialize once
client = redis.Redis(decode_responses=True)
await init_beanis(database=client, document_models=[Product])

# Create - with validation!
product = Product(
    name="MacBook Pro M3",
    price=2499.99,
    category="electronics",
    stock=50,
    tags={"laptop", "apple", "premium"},
    metadata={"warranty": "2 years", "color": "Space Gray"}
)
await product.insert()

# Query - indexes handled automatically
expensive = await Product.find(
    category="electronics",
    price__gte=1000,
    price__lte=3000
)

# Update - type-safe
await product.update(stock=45, price=2299.99)

# Complex queries
out_of_stock = await Product.find(stock=0)
premium_laptops = await Product.find(
    category="electronics",
    price__gte=2000
)

That’s about 30 lines - including validation, audit trails, and automatic indexing. A 70% reduction in code.


What Makes Beanis Different?

🎯 Full Pydantic v2 Integration

Beanis isn’t just wrapping Redis - it’s bringing Pydantic’s power to your data layer:

from pydantic import EmailStr, Field, HttpUrl, ValidationError, field_validator
from decimal import Decimal

class User(Document):
    email: EmailStr  # Automatic email validation
    username: str = Field(min_length=3, max_length=20, pattern="^[a-zA-Z0-9_]+$")
    age: int = Field(ge=13, le=120)
    website: Optional[HttpUrl] = None
    balance: Decimal = Decimal("0.00")

    @field_validator('username')
    @classmethod
    def username_alphanumeric(cls, v):
        if not v.isalnum():
            raise ValueError('Username must be alphanumeric')
        return v.lower()

# This will raise validation errors BEFORE hitting Redis
try:
    user = User(
        email="not-an-email",  # ❌ Invalid
        username="ab",  # ❌ Too short
        age=200  # ❌ Too old
    )
except ValidationError as e:
    print(e)

🚀 Smart Indexing that Just Works

No more manually maintaining sorted sets and managing index consistency:

class Article(Document):
    title: str
    views: Indexed[int]  # Sorted set automatically maintained
    published_at: Indexed[datetime]  # Time-based queries
    author: Indexed[str]  # Categorical filtering
    score: Indexed[float]  # Range queries

# All these queries use optimized indexes under the hood
trending = await Article.find(views__gte=10000)
recent = await Article.find(
    published_at__gte=datetime.now() - timedelta(days=7)
)
popular_by_author = await Article.find(
    author="john_doe",
    score__gte=4.5
)

Behind the scenes, Beanis handles all of this for you (see the rough sketch after this list):

  • Maintains Redis sorted sets for each indexed field
  • Automatically updates indexes on insert/update/delete
  • Optimizes queries by choosing the best index
  • Handles index cleanup when documents are deleted
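Conceptually, each indexed field boils down to the same plain Redis structures you saw in the raw redis-py example earlier. Roughly, and with illustrative key names rather than Beanis's exact layout:

# What inserting article "42" with views=15000 and score=4.7 roughly turns into
await redis_client.hset("Article:42", mapping={"title": "...", "views": "15000"})
await redis_client.zadd("Article:idx:views", {"Article:42": 15000})
await redis_client.zadd("Article:idx:score", {"Article:42": 4.7})
await redis_client.sadd("Article:idx:author:john_doe", "Article:42")

# A query like views__gte=10000 becomes a sorted-set range scan
hot_keys = await redis_client.zrangebyscore("Article:idx:views", 10000, "+inf")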

🎨 Custom Encoders for Any Type

Working with complex types? Beanis has you covered:

import numpy as np
from PIL import Image
from beanis.odm.custom_encoders import register_custom_encoder, register_custom_decoder

# NumPy arrays
@register_custom_encoder(np.ndarray)
def encode_numpy(arr: np.ndarray) -> str:
    return arr.tobytes().hex()

@register_custom_decoder(np.ndarray)
def decode_numpy(data: str, dtype=np.float32) -> np.ndarray:
    return np.frombuffer(bytes.fromhex(data), dtype=dtype)

# PIL Images
@register_custom_encoder(Image.Image)
def encode_image(img: Image.Image) -> str:
    buffer = io.BytesIO()
    img.save(buffer, format='PNG')
    return base64.b64encode(buffer.getvalue()).decode()

class MLModel(Document):
    name: str
    weights: np.ndarray  # Seamlessly stored and retrieved
    bias: np.ndarray
    thumbnail: Image.Image

# It just works!
model = MLModel(
    name="sentiment-classifier",
    weights=np.random.rand(100, 50),
    bias=np.zeros(50)
)
await model.insert()

🌍 Geo-Spatial Queries Out of the Box

Building location-based features? We got you:

from beanis import GeoPoint

class Restaurant(Document):
    name: str
    cuisine: Indexed[str]
    location: GeoPoint  # Lat/lon with automatic geo-indexing
    rating: Indexed[float]

# Find restaurants
italian_nearby = await Restaurant.find_near(
    location=GeoPoint(lat=41.9028, lon=12.4964),  # Rome, Italy
    radius=2000,  # 2km
    cuisine="italian",
    rating__gte=4.0
)

# Get distance to each result
for restaurant in italian_nearby:
    distance = restaurant.location.distance_to(
        GeoPoint(lat=41.9028, lon=12.4964)
    )
    print(f"{restaurant.name}: {distance:.2f}m away")

🔄 Lifecycle Hooks for Business Logic

Implement audit trails, cache invalidation, or notifications:

class Order(Document):
    user_id: str
    total: Decimal
    status: str = "pending"

    # Audit trail
    created_at: datetime
    updated_at: datetime
    status_history: list = []

    @before_event(Insert)
    async def set_timestamps(self):
        now = datetime.now()
        self.created_at = now
        self.updated_at = now

    @before_event(Update)
    async def track_changes(self):
        self.updated_at = datetime.now()
        # Track status changes
        if hasattr(self, '_original_status') and self.status != self._original_status:
            self.status_history.append({
                'from': self._original_status,
                'to': self.status,
                'at': datetime.now().isoformat()
            })

    @after_event(Update)
    async def notify_status_change(self):
        if self.status == "shipped":
            await send_notification(self.user_id, f"Order {self.id} shipped!")

    @after_event(Delete)
    async def cleanup(self):
        # Clean up related data
        await OrderItem.delete_many(order_id=self.id)

Performance: Fast Enough for Production

I benchmarked Beanis against raw redis-py with 10,000 operations:

Operation            Raw Redis   Beanis   Overhead   Why?
Insert               0.45ms      0.49ms   +8%        Pydantic validation
Get by ID            0.38ms      0.41ms   +8%        Type conversion
Range Query          0.52ms      0.56ms   +7%        Index optimization
Batch Insert (100)   42ms        47ms     +12%       Validation batching

The verdict: ~8% overhead for features you’d have to build anyway (validation, serialization, type safety).

When NOT to Use Beanis

Be honest about trade-offs:

❌ Ultra-low latency requirements (< 1ms per operation)
❌ Simple key-value caching (use raw redis-py)
❌ You need RedisJSON/RediSearch modules (use Redis OM instead)
❌ Prototyping with unpredictable schema (use raw Redis first)

✅ Building production APIs with complex data models
✅ Need type safety and validation
✅ Working with teams who value clean code
✅ Migrating from MongoDB/Postgres but need Redis speed


Real-World Use Cases

E-Commerce Product Catalog

# 10,000+ products, 1000+ queries/second
products = await Product.find(
    category="electronics",
    price__gte=100,
    price__lte=500,
    stock__gt=0
)

Session Management

class Session(Document):
    user_id: str
    token: str
    expires_at: Indexed[datetime]

# Auto-cleanup expired sessions
await Session.delete_many(expires_at__lt=datetime.now())

Real-time Leaderboards

class Score(Document):
    player_id: Indexed[str]
    score: Indexed[int]
    achieved_at: datetime

# Top 10 globally
top_players = await Score.find(score__gte=1000).sort('-score').limit(10)

Migrating from Raw Redis: A Step-by-Step Guide

Already have a Redis codebase? Here’s how to migrate incrementally without breaking production.

Step 1: Identify Your Data Models

Look at your existing Redis keys and group them:

# Current Redis structure
# User:1 -> hash {name, email, age}
# User:2 -> hash {name, email, age}
# User:idx:email -> sorted set
# User:all -> set

# This becomes a Beanis document
class User(Document):
    name: str
    email: Indexed[str]
    age: int

    class Settings:
        key_prefix = "User"

Step 2: Add Validation Gradually

Start with basic types, add constraints later:

# Phase 1: Just types
class Product(Document):
    name: str
    price: float
    stock: int

# Phase 2: Add validation
class Product(Document):
    name: str = Field(min_length=1, max_length=200)
    price: float = Field(gt=0)  # Must be positive
    stock: int = Field(ge=0)  # Can't be negative

Step 3: Dual-Write During Migration

Run both systems in parallel:

async def create_product_safe(data):
    # Write to Beanis
    product = Product(**data)
    await product.insert()

    # Still write to old Redis (for rollback safety)
    await redis_client.hset(
        f"Product:{product.id}",
        mapping=legacy_serialize(data)
    )

    return product

# After 1-2 weeks of dual-write, stop reading from old keys
# After 1 month, stop dual-writing

Step 4: Verify Data Consistency

async def verify_migration():
    """Compare old vs new data"""
    old_keys = await redis_client.keys("Product:*")

    for key in old_keys:
        product_id = key.split(":")[1]

        # Get from both systems
        old_data = await redis_client.hgetall(key)
        new_product = await Product.get(product_id)

        # Compare
        assert old_data['name'] == new_product.name
        assert float(old_data['price']) == new_product.price
        # ... verify all fields

Advanced Patterns and Best Practices

Pattern 1: Caching with TTL

Beanis doesn’t have built-in TTL yet, but you can implement it:

class CachedResult(Document):
    query_hash: Indexed[str]
    result_data: dict
    created_at: datetime = Field(default_factory=datetime.now)

    class Settings:
        key_prefix = "Cache"

    async def is_expired(self, ttl_seconds: int = 300) -> bool:
        age = (datetime.now() - self.created_at).total_seconds()
        return age > ttl_seconds

import hashlib

# Usage
async def get_with_cache(query: str, ttl: int = 300):
    query_hash = hashlib.md5(query.encode()).hexdigest()

    # Check cache
    cached = await CachedResult.find_one(query_hash=query_hash)
    if cached and not await cached.is_expired(ttl):
        return cached.result_data

    # Compute and cache
    result = await expensive_operation(query)
    await CachedResult(
        query_hash=query_hash,
        result_data=result
    ).insert()

    return result

Pattern 2: Optimistic Locking

Prevent race conditions with version numbers:

class BankAccount(Document):
    account_number: str
    balance: Decimal
    version: int = 0

    async def withdraw(self, amount: Decimal):
        # Remember the version we read; a real implementation would
        # compare it again at save time to detect concurrent writes
        original_version = self.version

        # Check balance
        if self.balance < amount:
            raise InsufficientFunds()

        # Update
        self.balance -= amount
        self.version += 1

        try:
            await self.save()
        except Exception:
            # In a real implementation, check if version changed
            # and retry or raise ConcurrentModificationError
            raise

# Better: use a Redis transaction. Note that for true atomicity the
# read-modify-write must run inside the same MULTI/EXEC (or use WATCH),
# which this simplified version does not do - see the sketch after it.
async def atomic_withdraw(account_id: str, amount: Decimal):
    async with redis_client.pipeline(transaction=True) as pipe:
        account = await BankAccount.get(account_id)
        if account.balance >= amount:
            account.balance -= amount
            await account.save()
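
For genuine atomicity you can drop down to the raw client and use Redis optimistic locking (WATCH / MULTI / EXEC). The sketch below deliberately works on a plain balance:<account_id> string key rather than a Beanis document, just to show the pattern; it assumes decode_responses=True on the client:

from decimal import Decimal
from redis.exceptions import WatchError

async def atomic_withdraw_raw(account_id: str, amount: Decimal):
    key = f"balance:{account_id}"  # hypothetical raw key, not a Beanis document
    async with redis_client.pipeline(transaction=True) as pipe:
        while True:
            try:
                await pipe.watch(key)              # fail the EXEC if the key changes
                balance = Decimal(await pipe.get(key) or "0")
                if balance < amount:
                    await pipe.unwatch()
                    raise InsufficientFunds()
                pipe.multi()                       # start queueing the transaction
                pipe.set(key, str(balance - amount))
                await pipe.execute()               # raises WatchError on conflict
                return
            except WatchError:
                continue                           # lost the race - retry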

Pattern 3: Batch Operations for Performance

Process thousands of records efficiently:

# ❌ Slow: One query per item
products = []
for product_id in product_ids:
    product = await Product.get(product_id)
    products.append(product)

# ✅ Fast: Batch fetch
products = await Product.find(
    id__in=product_ids
).to_list()

# ✅ Even faster: Pipeline for insertions
async def bulk_insert_products(product_data_list):
    products = [Product(**data) for data in product_data_list]

    # Pydantic already validated each Product at construction above,
    # so malformed records fail fast before anything is written

    # Bulk insert (uses Redis pipeline internally)
    await Product.insert_many(products)
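
For very large imports you can bound memory by inserting in chunks; each chunk still goes through one pipeline. A minimal sketch (the 500-item batch size is an arbitrary choice):

async def bulk_insert_chunked(product_data_list: list[dict], batch_size: int = 500):
    for start in range(0, len(product_data_list), batch_size):
        chunk = product_data_list[start:start + batch_size]
        # Construction runs Pydantic validation for the chunk before the write
        await Product.insert_many([Product(**d) for d in chunk])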

Pattern 4: Computed Fields and Denormalization

Redis favors denormalization - embrace it:

class Order(Document):
    user_id: str
    items: list[dict]  # [{product_id, quantity, price}]

    # Denormalized fields for fast queries
    total_amount: Decimal
    item_count: int
    user_email: str  # Copied from User

    @classmethod
    async def create_order(cls, user: User, items: list):
        total = sum(item['price'] * item['quantity'] for item in items)

        order = cls(
            user_id=user.id,
            items=items,
            total_amount=total,
            item_count=len(items),
            user_email=user.email  # Denormalize for queries
        )
        await order.insert()
        return order

# Now you can query orders by email without joining
expensive_orders = await Order.find(
    user_email="[email protected]",
    total_amount__gte=1000
)
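
The trade-off with denormalization is that copied fields can go stale. A common fix is to fan the change out to the copies whenever the source record changes; a sketch, assuming email changes are rare enough that updating each affected order individually is acceptable:

async def change_user_email(user: User, new_email: str):
    user.email = new_email
    await user.save()

    # Fan the new value out to every order that copied the old one
    orders = await Order.find(user_id=user.id).to_list()
    for order in orders:
        order.user_email = new_email
        await order.save()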

Common Pitfalls and How to Avoid Them

Pitfall 1: Over-Indexing

Problem: Every indexed field creates a sorted set. Too many = memory bloat.

# ❌ Bad: 10 indexes = 10 sorted sets per document
class User(Document):
    name: Indexed[str]
    email: Indexed[str]
    age: Indexed[int]
    created_at: Indexed[datetime]
    last_login: Indexed[datetime]
    status: Indexed[str]
    role: Indexed[str]
    department: Indexed[str]
    manager_id: Indexed[str]
    salary: Indexed[Decimal]

# ✅ Good: Only index what you query
class User(Document):
    name: str
    email: Indexed[str]  # Frequent lookups
    age: int
    created_at: Indexed[datetime]  # Time-range queries
    last_login: datetime  # Don't need to query this
    status: Indexed[str]  # Filter by active/inactive
    role: str  # Can filter client-side
    department: str
    manager_id: str
    salary: Decimal  # Sensitive, don't index

Pitfall 2: Forgetting Async/Await

# ❌ Missing await: this returns an un-awaited coroutine, not a User
user = User.get(user_id)

# ✅ Always await
user = await User.get(user_id)

# ✅ Awaiting inside a comprehension works, but fetches sequentially
users = [await User.get(uid) for uid in user_ids]

# ✅ Even better: batch fetch
users = await User.find(id__in=user_ids).to_list()

Pitfall 3: N+1 Query Problem

# ❌ N+1 queries (slow!)
orders = await Order.find_all()
for order in orders:
    user = await User.get(order.user_id)  # N queries!
    print(f"{user.name}: ${order.total}")

# ✅ Denormalize (recommended for Redis)
class Order(Document):
    user_id: str
    user_name: str  # Denormalized
    total: Decimal

orders = await Order.find_all()
for order in orders:
    print(f"{order.user_name}: ${order.total}")  # No extra query!

# ✅ Or batch fetch users
orders = await Order.find_all()
user_ids = {order.user_id for order in orders}
users = {u.id: u for u in await User.find(id__in=user_ids)}

for order in orders:
    user = users[order.user_id]
    print(f"{user.name}: ${order.total}")

Performance Tuning Tips

1. Use Connection Pooling

import redis.asyncio as redis
from redis.asyncio.connection import ConnectionPool

# ✅ Reuse connections
pool = ConnectionPool.from_url(
    "redis://localhost",
    max_connections=50,
    decode_responses=True
)
client = redis.Redis(connection_pool=pool)

await init_beanis(database=client, document_models=[Product, User])

2. Batch Validation

# If inserting many documents, validate in bulk
products_data = [...]  # 1000 products

# ✅ Construct (and validate) every document first - Pydantic validates at
#    construction, so a bad record fails before anything is written
validated = [Product(**d) for d in products_data]

# Then insert (uses pipeline automatically)
await Product.insert_many(validated)

3. Query Optimization

# ❌ Fetching everything then filtering in Python
all_products = await Product.find_all()
cheap = [p for p in all_products if p.price < 100]

# ✅ Filter in Redis
cheap = await Product.find(price__lt=100)

# ✅ Use projections (when implemented)
# cheap = await Product.find(price__lt=100).project(['name', 'price'])

Getting Started in 60 Seconds

pip install beanis

from beanis import Document, Indexed, init_beanis
import redis.asyncio as redis

# 1. Define your model
class User(Document):
    username: str
    email: Indexed[str]
    score: Indexed[int] = 0

# 2. Initialize
client = redis.Redis(decode_responses=True)
await init_beanis(database=client, document_models=[User])

# 3. Use it!
user = User(username="john", email="[email protected]")
await user.insert()

# Find users
top_users = await User.find(score__gte=100)

Full documentation: andreim14.github.io/beanis


What’s Next?

Beanis is production-ready today with:

  • ✅ 150+ tests passing
  • ✅ 56% code coverage
  • ✅ Full CI/CD pipeline
  • ✅ Comprehensive docs

Roadmap:

  • 🔄 Relationship support (OneToOne, OneToMany)
  • 📊 Aggregation pipeline
  • 🔐 Field-level encryption
  • ⚡ Connection pooling optimizations
  • 📈 Query analytics and slow query detection

Try It, Star It, Break It

I built Beanis to scratch my own itch, and now I’m sharing it with the world. If you:

  • Want cleaner Redis code
  • Value type safety
  • Need fast queries without the boilerplate

Give Beanis a try.

Found a bug? Have a feature request? Open an issue - I read and respond to everything.

Happy coding! 🚀


Beanis is inspired by Beanie by Roman Right. Standing on the shoulders of giants.

Built with ❤️ by Andrei Stefan Bejgu - AI Applied Scientist @ SylloTips
