In this post, I’ll break down:
By the end, you’ll understand how prompt caching can push LLM performance and efficiency to the next level.
Before we define prompt caching, it helps to clarify what it isn’t.
A common misunderstanding is that prompt caching is the same as output caching.
For example:
That’s output caching: storing the result of a call so it can be reused later.
While output caching can be applied to LLM responses, it’s not what prompt caching refers to.
Prompt caching focuses only on the input prompt (more precisely, on part of that input) and caches how the model interprets it.
Here’s the key idea:
When you send a prompt into an LLM, the model doesn’t begin generating output immediately. Instead, it performs an expensive internal computation called key/value (KV) pair generation.
Prompt caching saves these computed KV pairs so that the model doesn’t have to recompute them again for similar input.
When an LLM receives a prompt:
With prompt caching:
This means developers can structure prompts so that large static content is cached once — and reused across multiple queries.
Prompt caching typically applies to static or semi-static parts of prompts:
✅ System prompts: instructions that define agent behavior
✅ Large documents: manuals, research papers, contracts
✅ Few-shot examples: demonstration examples for output formatting
✅ Static context blocks: any repeated context that remains unchanged
In contrast, dynamic portions (like a user’s question) usually come after these static elements and are not cached.
Prompt caching systems use prefix matching.
This means prompt structure matters: static parts should come first.
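The structural rule above can be sketched in plain Python. This is provider-agnostic pseudologic showing why static-first ordering produces a stable, cacheable prefix; it is not any specific provider's caching API:

```python
# Illustrative sketch: assemble prompts so the large static parts form a
# stable prefix that a prompt cache's prefix matching can hit.

SYSTEM_PROMPT = "You are a contract-review assistant."   # static
FEW_SHOT = "Example Q: ...\nExample A: ..."              # static
DOCUMENT = "FULL TEXT OF A LARGE CONTRACT ..."           # static (large)

def build_prompt(user_question: str) -> str:
    # Static content first -> identical prefix across requests -> cache hit.
    # Only the KV pairs for the trailing dynamic question get recomputed.
    return "\n\n".join([SYSTEM_PROMPT, FEW_SHOT, DOCUMENT, user_question])

p1 = build_prompt("Is there a termination clause?")
p2 = build_prompt("What is the payment schedule?")

# Both requests share the same cacheable prefix:
static_prefix = "\n\n".join([SYSTEM_PROMPT, FEW_SHOT, DOCUMENT])
assert p1.startswith(static_prefix) and p2.startswith(static_prefix)
```

If the dynamic question were placed first instead, every request would have a different prefix and the cache would never match.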
AI agents are becoming ubiquitous, but the challenge is that all these agents might be built on different frameworks and technologies. So how do we get them to cooperate on complex tasks?
Imagine planning a trip. Ideally, you’d have:
…but building each one is a ton of work. You might want to reuse someone else’s agent, but often it’s a black box — you don’t know how it’s implemented or how to work with it.
That’s exactly where the Agent-to-Agent (A2A) protocol comes in.
A2A is an open standard that provides a common language for agent communication and collaboration — regardless of how the agent is implemented. Think of it like how LangChain made it effortless to switch between models; A2A allows multiple agents to communicate consistently.
An easy metaphor is Lego blocks:
Each A2A agent broadcasts standardized info about itself and supports the same public methods, so any other agent can call it to help complete tasks. This opens up powerful orchestration possibilities.
A2A connects three main roles:
A Client Agent generates requests and handles interaction with the end user. A Remote Agent receives those requests and attempts to respond or act as needed.
Importantly, any agent could act as both client and remote depending on context. It’s not a strict division — it’s dynamic.
A protocol is only as useful as its adoption, and A2A has already seen strong interest in the software industry. While there’s a larger list in the official docs, many major partners have publicly committed to supporting the standard.
Some of the main things A2A supports:
Importantly, A2A keeps each agent opaque — meaning implementation details do not need to be exposed to follow the protocol.
To make this real, let’s walk through an example where Agent A needs Agent B to perform a task.
Agent B publishes an Agent Card — a standard JSON file hosted at a known URI on its domain. This file acts like a digital business card and tells Agent A:
Essentially, it functions like a combination of robots.txt and a microservices registry.
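As a rough sketch, an Agent Card might look like the following. The field names follow the general shape described in the A2A docs, but treat them as illustrative rather than normative:

```python
import json

# Hypothetical Agent Card for a trip-planning remote agent. In practice the
# card is served as JSON from a well-known URI on the agent's domain
# (e.g., https://agent-b.example.com/.well-known/agent.json).
agent_card = {
    "name": "TripPlannerAgent",
    "description": "Plans multi-city trips and drafts itineraries.",
    "url": "https://agent-b.example.com/a2a",  # endpoint Agent A will call
    "version": "1.0.0",
    "capabilities": {"streaming": True},       # supports SSE updates
    "skills": [
        {"id": "plan_trip", "description": "Build a day-by-day itinerary"}
    ],
}

card_json = json.dumps(agent_card, indent=2)
```

Agent A fetches this card, inspects the declared skills and capabilities, and decides whether and how to delegate a task to Agent B.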
Communication is done over HTTPS using JSON-RPC 2.0. Each message represents one turn of a conversation and contains:
When Agent B receives a request, it processes it via something called an Agent Executor — a class that links the generic A2A protocol (handled by the SDK) to the agent’s specific logic.
This executor lets the agent behave like a Lego block that can be easily connected to other agents.
Not all tasks complete quickly. Some require time, and the protocol handles that gracefully.
When an agent starts a long process:
This polling approach works, but it’s not the most efficient.
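The polling pattern can be sketched with a simulated task endpoint. The function names here are hypothetical illustrations, not part of the A2A spec:

```python
import itertools

def fake_get_task(task_id, _states=itertools.count()):
    # Simulates Agent B's task-status endpoint: "working" twice,
    # then "completed". A real client would make an HTTP call here.
    n = next(_states)
    return {"id": task_id, "state": "working" if n < 2 else "completed"}

def poll_until_done(task_id, max_polls=10):
    # Agent A repeatedly asks for the task's state until it reaches
    # a terminal state or the poll budget runs out.
    for _ in range(max_polls):
        task = fake_get_task(task_id)
        if task["state"] in ("completed", "failed"):
            return task
        # A real client would sleep between polls to avoid hammering Agent B.
    raise TimeoutError("task did not finish in time")

result = poll_until_done("task-123")
assert result["state"] == "completed"
```

Every poll costs a round trip even when nothing has changed, which is exactly the inefficiency streaming addresses.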
If Agent B supports streaming (using Server-Sent Events), Agent A can keep an HTTP connection open and receive live updates:
This enables live progress feedback, like a document appearing piece by piece as it’s generated — a much better user experience.
You can find a detailed tutorial and sample agents at the A2A tutorial link.
Is A2A all you need to productionize your agent?
Not exactly — A2A is one part of the full agent stack. Google recommends a layered stack, but you’re free to choose different components based on your needs.
You might also notice another protocol called MCP being used in agent ecosystems.
A2A and MCP are complementary, not competitors:
In sophisticated agent systems, both will be used:
The best way to learn is by trying it yourself:
Install the SDK (`pip install a2a-sdk`) and build a simple agent. A2A unlocks powerful workflows in a multi-agent world. Let me know what your plan is for agent collaboration using A2A!
That next step in AI capability is here. It’s about building autonomous systems that don’t just generate text — they sense, think, and act.
At the heart of this evolution is the Agent Development Kit (ADK) — a toolkit that gives AI agents the ability to interpret their environment, reason over data, and take meaningful action. Think of LLMs (large language models) as the voice of AI — great at generating text, summaries, and code. ADKs add the hands and brain — enabling AI to interact with the world.
In this post, I’ll break down:
LLMs are powerful — but they operate in isolation. They don’t read sensor data like temperature or motion. They don’t make decisions without being prompted. They don’t act.
In contrast, ADKs provide the building blocks for:
This is especially critical in robotics, automation, and real-time systems where static models fall short.
Imagine a factory robot that can’t react when a conveyor slows down, a sensor detects overheating, or a part jams. Without feedback loops and decision making, it’s useless. ADKs change that by enabling agents to monitor live data and respond quickly — such as pausing production, cooling equipment, or alerting technicians.
An ADK is a toolkit for building autonomous AI agents that can:
Agents built with ADKs become partners in value-creation. With LLMs alone, the setup is reactive — you send a prompt, the model returns an output. With an ADK, the agent functions autonomously: it observes, makes decisions, and executes actions based on its goals.
This shift takes AI beyond simple language generation — toward collaboration and autonomous operation.
Agents built with ADKs are already impacting multiple fields:
These examples aren’t futuristic — they’re already emerging.
Let’s walk through a practical example: a smart office agent that autonomously monitors and manages environmental conditions.
Our objective is simple:
Our agent needs:
The agent will control:
We’ll use Python as the programming language — because it’s widely adopted for AI and automation and easy to read.
We’ll also set up:
Think of it this way:
| Component | Role |
|---|---|
| Python | Brain |
| IoT Hub | Senses |
| REST APIs | Limbs (Actions) |
Together, they empower the agent to act intelligently.
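The agent's core sense-decide-act loop can be sketched as follows. The comfort thresholds, device behavior, and REST endpoint are illustrative assumptions, not from any specific office setup:

```python
# Assumed comfortable temperature band, in Celsius (an illustrative choice).
COMFORT_RANGE = (20.0, 25.0)

def decide_action(temp_c: float) -> str:
    # The "brain": map a sensor reading to an HVAC action.
    low, high = COMFORT_RANGE
    if temp_c > high:
        return "cool"
    if temp_c < low:
        return "heat"
    return "hold"

def act(action: str) -> dict:
    # The "limbs": in a real agent this would POST to the HVAC's REST API,
    # e.g. requests.post("https://office.example.com/hvac", json={"mode": action})
    return {"mode": action}

# One tick of the loop: sense (here, a simulated reading), decide, act.
reading = 28.5
command = act(decide_action(reading))
assert command == {"mode": "cool"}
```

In production, this loop would run continuously against live IoT Hub readings, with the guardrails described below wrapped around every action.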
Simulate situations like:
Monitor how the agent responds, observe failure modes, and refine its logic.
Even simple automation systems need guardrails:
In six steps, you’ve built an agent that autonomously manages office conditions — while keeping humans in control.
No agent should operate without core principles baked in:
Agents must avoid bias in data and decision-making:
Build backup plans:
Transparent agents are trusted agents:
Fairness, safety, and trust lay the foundation for responsible AI.
As ADKs evolve, autonomous agents will collaborate with:
In smart cities, agents could optimize:
In education, agriculture, finance, and healthcare — agents are already beginning to transform how systems operate.
The next generation of AI isn’t just bigger models — it’s smarter, connected systems built for autonomy.
Today we asked:
What does it take to build AI that can think and act on its own?
Now you know:
It starts with the Agent Development Kit (ADK).
So I encourage you — explore open-source ADKs, experiment with sensor integration, and join the growing ecosystem of autonomous agents.
The future is no longer just about language — it’s about action.
Deep Agents is a standalone library built on top of LangChain’s create_agent — but with a battery-included approach that gives agents advanced capabilities such as planning, task decomposition, context management, and deep sub-agent delegation.
In this post, I’ll walk through:
Let’s dive in.
When building agents for real-world use cases, we found there are four core capabilities that significantly improve reliability, especially for more complex or long-running problems.
Instead of single-shot actions, Deep Agents can break tasks into smaller steps — planning before execution. This is enabled through a to-do middleware layer that orchestrates subtasks.
Agents often need to reference large amounts of data, but context windows are limited. Deep Agents solve this using a file system abstraction that lets agents browse and load context on demand.
Rather than tackling all problems within a single agent, Deep Agents can spawn sub-agents designed for specific tasks. This helps avoid context overflow and isolates reasoning into manageable chunks.
Deep Agents can remember what happened across previous conversations — not just within one run — giving them long-running memory across sessions.
Together, these make agents far more capable regardless of the underlying model.
Now let’s look at an example where we build a Deep Agent that generates tailored sales proposals with access to multiple data sources through a virtual file system.
In this system, the virtual file system is composed of three backends:
First, we define a backend factory that returns a composite backend mapping directories to each storage layer:
- The `workspace` directory maps to a local file system
- `users` maps to the SQL backend
- `s3_bucket` maps to the S3 storage backend

This lets the agent access files without knowing which backend they come from; the library handles those details.
The SQL backend can map database rows to virtual file paths, so listing files under users/ will return database records as files. Likewise, the S3 backend maps files from buckets directly.
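The directory-to-backend routing can be sketched in plain Python. The class names below are invented for illustration only; the Deep Agents library supplies its own backend implementations:

```python
class InMemoryBackend:
    """Stand-in for a storage layer (local FS, SQL, or S3)."""
    def __init__(self, files=None):
        self.files = dict(files or {})

    def read(self, path):
        return self.files[path]

class CompositeBackend:
    """Routes virtual paths to a backend by their top-level directory."""
    def __init__(self, routes):
        # e.g. {"workspace": local_backend, "users": sql_backend, ...}
        self.routes = routes

    def read(self, path):
        prefix, _, rest = path.partition("/")
        return self.routes[prefix].read(rest)

# The agent sees one uniform file system; routing is invisible to it.
fs = CompositeBackend({
    "workspace": InMemoryBackend({"notes.txt": "draft"}),
    "users": InMemoryBackend({"sarah_chen.json": '{"name": "Sarah Chen"}'}),
})

assert fs.read("workspace/notes.txt") == "draft"
assert "Sarah Chen" in fs.read("users/sarah_chen.json")
```

The point of the design is that the agent issues ordinary path-based reads while the composite layer decides which storage system actually serves each request.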
Next, we set up various Deep Agent components:
- The `create_deep_agent()` function, which acts like `create_agent()` but includes Deep Agents features such as virtual file systems

With everything defined, we send the agent a prompt to generate a personalized sales proposal for a user named Sarah Chen. We also specify where the final report should be stored.
When running the example:
Examining the generated proposal reveals a tailored email to Sarah — formatted using data from the SQL database, S3, and contextual information drawn from multiple sources.
By integrating virtual file systems and intelligent orchestration into agents:
Virtual file systems are only one of Deep Agents’ core capabilities. Future posts will explore:
Deep Agents represent an evolution in agent design — empowering AI with structured access to persistent knowledge and enabling more reliable, long-running, and complex workflows.
With planning, context management, sub-agent spawning, and memory, agents can do more than ever before — and remain manageable and predictable.
Thanks for reading, and stay tuned for the next deep dive into agent engineering!
According to Gartner, one third of enterprise applications will include agentic AI by 2028 — and that timeline is closer than it might initially seem. The potential is huge, but so is the risk.
This new level of autonomy brings with it not just innovation, but serious governance and security challenges. Unlike traditional rules-based software, agentic AI systems learn and adapt in real time — they interpret data, make decisions, and act on their interpretations. That flexibility is powerful — but also potentially dangerous.
In this post, I’ll walk through:
I recently spoke with cybersecurity experts who work with agentic AI every single day — and they made one thing clear: many traditional AI vulnerabilities are magnified when agents are autonomous.
Here are the key threats you need to know about:
What happens if an attacker sends commands into your agent and takes control — causing it to operate on behalf of the attacker instead of you?
One major entry point for this is prompt injection — where an attacker inserts malicious instructions that the AI interprets as valid. According to the Open Web Application Security Project (OWASP), prompt injection is the number one attack type right now.
Just like software can be infected with malware, AI models themselves can be infected. Since most organizations don’t build their own models, they must trust and verify externally sourced ones.
Models are trained and fine-tuned using data. If someone subtly contaminates that data — even in small amounts — it can later distort decision-making in unpredictable ways. Think of it as putting a toxin into the water supply: you don’t notice it at first, but it eventually has wide-reaching effects.
Evasion occurs when attackers manipulate the input, not the model itself. By disguising or altering key information, they can confuse the AI, causing it to misinterpret data and produce incorrect (or even dangerous) results.
Threats aren’t just about input — they’re also about output.
Model theft can happen when attackers feed inputs piece by piece, observe the outputs, and reconstruct the model logic over time.
Even worse, a compromised agent might be tricked into revealing sensitive internal data, like credentials or customer information, sometimes through what’s known as a zero-click attack (i.e., the user doesn’t have to take any action for data to be exfiltrated).
This classic attack, overwhelming a system with more requests than it can handle, applies to agentic AI as well. Flood your agent with requests, and it might become too busy to serve legitimate users.
Security is only one half of the problem. Without proper governance, agentic AI can still behave in unpredictable or unacceptable ways.
To illustrate this, let’s walk through a hypothetical (but realistic) scenario:
A recruiting firm adopts an autonomous AI to:
Sounds efficient — until the AI sends an offer without human approval.
Questions arise:
These questions point to core governance issues: autonomy vs visibility, fairness, and explainability.
In this story, the AI also showed bias — favoring candidates from certain schools due to unbalanced training data. As a result, the firm missed qualified applicants and ultimately faced a discrimination lawsuit.
And then the real question emerges: Who is responsible?
There’s no simple answer — and that’s precisely why governance matters.
Agentic AI is powerful, but it must be both governed and secured. Below are practices that help bring both together.
You can’t secure or govern what you don’t know exists.
First, you must identify all AI instances across your environment — including unauthorized systems (often called shadow AI) where developers or teams spin up models without oversight.
Once identified, every AI instance should be evaluated for:
If an instance is exposing sensitive data or services, it must be hardened immediately.
Just as we test networks and applications, we must stress-test AI models.
Simulate:
Real tests reveal vulnerabilities before attackers do.
A runtime layer, or AI-specific firewall, analyzes inputs and outputs in real time. For example:
This adds a layer of defense between users and the agent’s decision-making process.
Effective governance isn’t an afterthought — it’s foundational.
Here are the pillars every organization needs:
Ensure that AI agents are:
Evaluate agents for:
Ongoing assessment is critical:
A centralized dashboard gives executives and auditors visibility into AI operations and compliance.
Here’s the key takeaway:
To build trustworthy agentic AI, you must integrate governance and security, not treat them as optional extras.
Without this integration, autonomous agents may not just be unpredictable; they may be uncontrollable.
You pull up the logs and scan the codebase. Maybe it’s hallucination. Maybe the context window overflowed. But the real problem is this: you can’t tell.
That’s because you’re still debugging agents like traditional software — and that mindset is already obsolete.
In order to understand, debug, and iterate on your agents, you need something new. You need tracing.
With traditional software, behavior is predictable. If we process a refund, we know the sequence:
Every step is defined in code. If something breaks, we inspect logs and trace the issue back to a specific line.
Agents don’t work this way.
With AI agents:
In this world, our code becomes scaffolding. We define:
…but we no longer decide the specific path or decisions the agent takes. That’s driven by the model.
So when things go wrong, where should we look? We need insight into how our agent makes decisions — and that’s where tracing comes in.
We can’t see how a model reasons internally — but we can observe what it does. Every prompt, every step, every tool call, every message leaves measurable signals.
Combined, these signals reconstruct the sequence of actions an agent takes — this is called a trace.
A trace includes:
Agents often interact with users in a conversation. Each message in the conversation creates a new trace.
These traces are grouped into a thread — the full history of the conversation. Threads let you see how agent behavior evolves across multiple turns.
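A trace and its enclosing thread can be pictured as structured data, roughly like this. The schema is illustrative; real tracing platforms each define their own:

```python
# One trace = one turn: the input, every intermediate step, and the output.
trace = {
    "trace_id": "t-001",
    "input": "What's the refund status for order 4521?",
    "steps": [
        {"type": "llm_call", "output": "I should look up the order first."},
        {"type": "tool_call", "tool": "get_order", "args": {"id": 4521}},
        {"type": "llm_call", "output": "Refund was issued on Tuesday."},
    ],
    "output": "Refund was issued on Tuesday.",
}

# A thread groups the traces of one conversation, in order, so you can see
# how behavior evolves across turns.
thread = {"thread_id": "th-42", "traces": [trace]}

assert thread["traces"][0]["steps"][1]["tool"] == "get_order"
```

When debugging, you read down the `steps` list to see exactly which decision or tool call diverged from what you expected.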
So when something goes wrong with your agent, the answer isn’t in the code — it’s in the trace or the thread.
Tracing should transform how you build AI agents.
When an agent behaves unexpectedly, the logic you’re looking for isn’t in the code — it’s in the trace.
Trace analysis becomes the new debugger.
Because agent logic lives in traces, you test those traces — not individual functions.
You run:
This is how we measure agent performance in production.
The same traces developers use to debug also reveal:
Trace analytics uncovers usage patterns, friction points, and failure modes — all based on actual behaviors, not hypothetical tests.
In traditional software, observability is the “exhaust” you monitor.
With agents, observability becomes fuel — the data that powers every improvement workflow.
This means your observability platform — your tracing system — should become the center of collaboration across teams.
The next time your agent behaves unexpectedly, don’t ask to see the logs.
Ask to see the trace.
Because with agentic AI, traces are the only reliable way to understand, debug, optimize, and evolve your systems.
Tracing isn’t just a helpful tool — it’s the foundation of agent engineering in the AI era.
Let’s begin with a picture.
Imagine a digital image of a sunset over a mountain vista. It’s beautiful. You want to store this picture in a database. Traditionally, you might use a relational database to do this.
In that case, you’d likely store:
- Tags (`sunset`, `landscape`, `orange`)

These fields are helpful—but they don’t capture the semantic context of the image. You can’t easily query “show me pictures with similar colors” or “find images with mountains in the background.”
This disconnect between how data is stored and how it’s understood is called the semantic gap.
Let’s say you query the database:
SELECT * FROM images WHERE color = 'orange';
This query won’t return all sunset images. Why? Tag-based filters match only exact, manually assigned labels: a sunset image tagged “red”, or never tagged at all, simply won’t appear in the results.
Vector databases close the semantic gap using vector embeddings—mathematical representations of data.
They’re arrays of numbers that capture the semantic meaning of unstructured data.
This enables semantic similarity search, not just keyword or tag matching.
You can store all kinds of unstructured content:
These are transformed into vector embeddings and stored in a vector database.
Mountain sunset embedding:
[0.91, 0.15, 0.83, ...]
Beach sunset embedding:
[0.12, 0.08, 0.89, ...]
Real embeddings are high-dimensional and typically not human-interpretable—but they’re very effective.
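With toy vectors like the ones above, semantic search reduces to comparing directions. Here is a minimal cosine-similarity sketch (real embeddings have hundreds or thousands of dimensions, but the math is identical):

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 = same direction,
    # 0.0 = unrelated, -1.0 = opposite.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

mountain_sunset = [0.91, 0.15, 0.83]
beach_sunset = [0.12, 0.08, 0.89]

# A vector is maximally similar to itself; a different scene scores lower.
assert abs(cosine_similarity(mountain_sunset, mountain_sunset) - 1.0) < 1e-9
assert cosine_similarity(mountain_sunset, beach_sunset) < 1.0
```

A vector database answers “show me similar images” by returning the stored embeddings with the highest similarity to the query embedding.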
Using embedding models trained on large datasets:
| Data Type | Model Example |
|---|---|
| Images | CLIP |
| Text | GloVe |
| Audio | Wav2Vec |
Searching millions of high-dimensional vectors is slow—so we use vector indexing.
Algorithms that trade accuracy for speed:
These enable real-time semantic search at scale.
Vector databases power Retrieval-Augmented Generation (RAG):
🧠 They serve as both a knowledge memory and a semantic retriever.
Vector databases are:
By representing data in vector form, they allow systems to think more like humans do.
💡 Bridging the semantic gap isn’t just a technical improvement—it’s a fundamental shift in how machines understand the world.
In the latest version of o3, OpenAI has taken reasoning and tool usage to a new level. The most exciting capability introduced is its ability to combine multi-step reasoning with agentic tool use—enabling it to complete complex tasks more autonomously than ever before.
Let’s walk through a demo of o3 running a month-end variance report.
We begin by asking o3 to generate a variance report based on some dummy department data. Each uploaded spreadsheet contains:
Normally, this workflow would involve:
o3 is now capable of automatically completing each of these discrete steps—calling the appropriate tool or performing the right operation at each phase.
Once the task is sent off, o3 begins its chain-of-thought reasoning, visible in the live output:
In about a minute, o3 produces:
This example highlights the shift from reactive AI to agentic AI. o3 didn’t just wait for instructions; it proactively:
This is not just generative AI; this is agentic execution at scale.
There’s a next-level feature inside Claude Code that, once you truly see it, is impossible to unsee.
Once you see this feature, you’ll understand that AI coding is not enough.
This feature lets you build in ways engineers using Cursor, WindSurf, and other “easy mode” tools simply cannot. But don’t worry — if you’re using one of these tools, you can still tap into this capability.
Every single thing we do now is about scaling our impact by scaling our compute.
Let’s talk about programmable agentic coding — the next generation of developer tools.
What’s better than an agentic coding tool?
An agentic coding tool you can embed inside your workflows.
With Claude Code, you can write a single line of Python like:
claude_p(prompt="...", tools=["edit", "bash", "read"])
This single command wires up prompts, tools, and tasks in ways normal LLM-based code completion tools cannot match.
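A rough equivalent from Python is to drive Claude Code's command-line headless mode. The flag names below are based on the CLI's documented options, but verify them against your installed version:

```python
import subprocess

def build_claude_command(prompt: str, tools: list[str]) -> list[str]:
    # Assemble a headless Claude Code invocation:
    #   -p runs in non-interactive "print" mode
    #   --allowedTools restricts which tools the agent may use
    return [
        "claude",
        "-p", prompt,
        "--allowedTools", ",".join(tools),
    ]

cmd = build_claude_command("summarize this repo", ["Edit", "Bash", "Read"])

# To actually run it (requires Claude Code installed and authenticated):
#   subprocess.run(cmd, capture_output=True, text=True)
assert cmd[0] == "claude" and "--allowedTools" in cmd
```

Because the invocation is just data, you can generate prompts programmatically, chain runs together, or fan out many invocations from a single script.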
Here’s the key difference:
Think of it like this:
| Feature | AI Coding | Agentic Coding |
|---|---|---|
| Tool Usage | One-shot | Multi-tool workflows |
| Prompt Format | Static | Dynamic & Reactive |
| Environment | Limited to IDE | Extendable (e.g. Bash) |
| Autonomy | User-driven | Agent-driven |
| Workflow Integration | Minimal | Deep & Reusable |
Claude Code ships with built-in tools like:
- `edit`
- `write`
- `grab`
- `ls`
- `read`
- `bash`
- `batch` (parallel execution)

The bash tool unlocks terminal access for the AI. With it, Claude can:
And batch takes it further: Claude can spawn sub-agents to do tasks in parallel. That’s self-replication in action — a core property of future autonomous agents.
Claude Code supports MCP (Model Context Protocol) servers — meaning:
Agentic coding is programmable, modular, and reusable.
Claude Code lets you build living systems — not just throwaway scripts.
Let’s say you’ve written your to-dos inside a Notion page.
You want to:
Claude Code can:
read to fetch Notion tasksbash to create a new projectedit to write the app codenotion.update to mark tasks as completeThat’s not just AI-enhanced — that’s fully automated engineering.
Because Claude Code is programmable, you can:
You can even stack Claude instances — one writes code, one reviews it, another tests and deploys it.
Agentic coding combines:
Claude Code is already here. It is:
AI coding is just autocomplete. Agentic coding is engineering.
We’re on a mission to build living software — systems that run while we sleep, tools that ship code, agents that manage tasks.
The only way to scale your impact is to scale your compute.
Claude Code is how you win in the generative AI age.
“AI coding is not enough. Agentic coding is how you ship real engineering work.”
In this post, we’ll explore what the Model Context Protocol (MCP) is and how to build a custom MCP server using Python. If you’ve been hearing about MCP and wondering what it does or why it’s important, this guide is for you.
MCP (Model Context Protocol) is an open standard developed by Anthropic that allows you to connect external tools, resources, and prompt templates to AI applications.
It’s often described as the USB-C port for AI, providing a standardized way to plug tools and context into models like Claude.
MCP uses a client-server architecture:
Think of it like a shopping store:
As a user or developer, you typically don’t need to build the client — only the server.
The server is an independent module that:
- `stdio` for local development
- HTTP + Server-Sent Events (SSE) for cloud deployments

Using Anthropic’s Python SDK, you can spin up a server that connects tools to Claude Desktop.
- `uv` – a fast, Rust-based Python environment manager

Decorators from the SDK register each capability:

- `@mcp.prompt`
- `@mcp.resource`
- `@mcp.tool`

Example tools:
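Putting these pieces together, a minimal server using the official Python SDK's `FastMCP` helper might look like this. The toy tool and resource are my own examples, not from the original post:

```python
# Minimal MCP server sketch using the official Python SDK
# (package `mcp`, installable with `pip install "mcp[cli]"`).
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-server")

@mcp.tool()
def add(a: int, b: int) -> int:
    """Add two numbers."""
    return a + b

@mcp.resource("greeting://{name}")
def greeting(name: str) -> str:
    """A parameterized resource template."""
    return f"Hello, {name}!"

if __name__ == "__main__":
    # stdio is the transport used for local development with Claude Desktop.
    mcp.run(transport="stdio")
```

Register the server in Claude Desktop's configuration (or run it via the `mcp` CLI's dev mode) and its tools, resources, and prompts become callable from the client.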
You’ll now see all the tools, resources, and prompts live inside Claude — ready to use!
MCP makes AI apps:
You define your own workflows and carry them between environments or apps — all with just a bit of Python and MCP.