Sara Zan

[UPCOMING] AI Compute Summit - Smarter systems, leaner models: reducing compute costs without sacrificing quality

Tue, 19 May 2026 00:00:00 +0000

...

[UPCOMING] ODSC AI East - From RAG to AI Agent

Wed, 29 Apr 2026 00:00:00 +0000

...

[UPCOMING] Women in Data Science Geneva - From RAG to AI Agent

Thu, 23 Apr 2026 00:00:00 +0000

...

Setting the temperature to zero will make an LLM deterministic?

Tue, 24 Mar 2026 00:00:00 +0000

This is episode 8 of a series of shorter blog posts answering questions I received during the course of my work. They discuss common misconceptions and doubts about various generative AI technologies. You can find the whole series here: Practical Questions.

One common explanation of the "temperature" parameter of LLMs is that it represents the "randomness" of the answer.

That's broadly correct. Temperature is a parameter of the LLM final decoding steps, and the only one in the whole Transformer architecture that truly incorporates some randomness by design. At this stage, once the model has calculated the logits of the next token candidates, it has to map those values to an actual token from a list. Normally, LLMs perform best when they’re allowed to pick not necessarily the single best token, but instead choose at random among the N best tokens: the size of N is, more or less, what the temperature parameter represents.

Therefore, when we set the temperature to 0, the LLM must always choose the best next token, without making random choices. So, if the input is fixed and we have removed the only source of randomness in the architecture, the outputs should always be identical... right?

And yet, in practice, they often are not. Run the same prompt twice, with the same model, the same parameters, and temperature 0, and sooner or later the output will be a bit different. Not by much, usually. It may start with just one word; then the sentence takes a slightly different spin, until eventually the rest of the completion drifts away.

What's going on?

Imperfect computations

If we pretend an LLM is just a mathematical function, temperature=0 should indeed make decoding deterministic. At each step, the model emits logits, we take the argmax token, append it to the context, and repeat. The problem is that real inference is performed with floating-point arithmetic on massively parallel hardware, usually on a server that is trying to be as fast as possible rather than mathematically pristine.

Floating-point arithmetic is only an approximation of real-number arithmetic. In particular, it is not associative: in ordinary math, (a + b) + c = a + (b + c) always holds, but with floating-point numbers those two expressions can produce slightly different results because each intermediate step is rounded. The same applies to matrix multiplications, reductions, and accumulations throughout a neural network. Change the order of operations, and you can change the last few bits of the result.

Usually, those differences are tiny and often irrelevant, but in this case they have an impact. If two candidate next tokens have very similar logits, a minute numerical difference can swap their order, and once one token changes, the next decoding step runs on a different prefix, so the divergence compounds. The sampling rule is deterministic, while the computation that produced the logits is not guaranteed to be identical across runs.

You can think of it this way: sampling determinism is not the same thing as system determinism.

It gets worse

However, this is only part of the problem. You may already be objecting that running the same matrix multiplication on a GPU with the same data repeatedly will always provide bitwise-identical results. The computations are done in floating-point arithmetic, and there are surely other jobs running on the GPU while your computer is on. So why are these calculations deterministic, while LLM sampling with temperature=0 is not?

In a recent post on Thinking Machines's blog, Defeating Nondeterminism in LLM Inference, Horace He's digs even deeper into the issue. It's not merely that floating-point arithmetic is imperfect. Modern inference systems also need to batch requests together, and the result for one request can depend on the batch context in which it was executed. For a given exact batch, the forward pass may be deterministic. But from the user's point of view, the system is still nondeterministic, because the batch itself is not stable from run to run. Your prompt may be identical, but the inputs that get batched together with yours are not.

This is also why a prompt can look stable in local testing and then become flaky in production: the model did not suddenly become more creative, it's the system conditions that changed. temperature=0 makes only the token selection rule deterministic. It does not guarantee that the entire inference system will produce exactly the same logits every time.

Can it be fixed?

The way LLMs inference works today, especially at scale, doesn't leave us with many options to enforce the conditions that can guarantee deterministic outputs. There are only trade-offs, and they differ quite a lot between hosted APIs and self-hosted inference.

Fixed seeds

To reduce randomness and make LLM outputs reproducible, some people recommend using a fixed seed, and indeed some providers expose one. OpenAI, for example, documents a seed parameter and says it makes a best effort to sample deterministically, while explicitly warning that determinism is not guaranteed and that backend changes can still affect outputs. Their system_fingerprint field exists precisely so you can notice when the underlying serving configuration has changed.

The problem with fixed seeds is that they help reproduce results when the temperature is above zero, not when it's already zeroed out. That's because a fixed seed controls the randomness of the sampling step: by setting the temperature to zero, we are already removing that source of randomness, so the net result is identical with or without a fixed seed, while every other source of nondeterminism coming from the GPU and the rest of the stack is unaffected.

So fixed seeds are worth using when you are trying to get the same results for a call with non-zero temperature, such as for tests, demos, and regression checks. But you must keep in mind that they affect only the sampler, and they won't help you when temperature is zero.

No parallel jobs

If you self-host, one option to drastically reduce randomness is to reduce or eliminate concurrency.

This works for the simple reason that it stabilizes batching and scheduling. vLLM's reproducibility guidance says that by default it does not guarantee reproducibility on its own. In offline mode, you should disable multiprocessing to make scheduling deterministic, while in online mode, you need batch invariance support if you want outputs that are insensitive to batching. vLLM also documents batch invariance as a distinct feature and notes that it currently depends on specific hardware support.

This means that you can pick a few different configurations, depending on your needs:

shared online serving with dynamic batching: fastest, cheapest, least reproducible
isolated worker / no concurrent jobs: slower, more expensive, more reproducible
specialized batch-invariant serving paths: better reproducibility, but with hardware and feature constraints

The overall pattern is that the more you optimize for throughput, the more reproducibility suffers.

Cache responses

Caching doesn't exactly address the reproducibility issue per se, but in many applications it's the right level of abstraction if you want the same input to produce the same output. It's often not only the most viable option, but also the cheapest, simplest, and fastest, unless you're running a benchmark or an evaluation.

In practice, if you just need the same visible result for the same request, the most reliable method is not to regenerate it at all. Normalize the prompt, model ID, and relevant parameters into a cache key, store the first successful response, and serve that on subsequent identical requests. This does not make the model deterministic, of course, but it does make your application deterministic at the interface boundary, which is usually what application builders need.

Caching also has a very nice advantage over seeds and scheduler tricks: it does not depend on hidden implementation details inside the inference stack.

Of course, caching has limits. It only helps when requests repeat, and it can become awkward if tool calls, timestamps, external retrieval, or hidden context make two apparently identical requests not truly identical. Still, it is usually far more convenient than any other solution to this problem, and the only practical one for most production systems.

Conclusion

When faced with LLM nondeterminism, there's often the reaction to treat it like a bug and to try to eliminate that. However, you should also keep in mind that LLMs were designed with a randomness factor built-in for a reason: because they perform much better when they are allowed a slight degree of nondeterminism.

I get it: nobody likes having such a huge, random black box at the core of an application's business logic. But removing randomness from the outputs is not the right way to manage an LLM's behavior. If you need completely deterministic output, it is better to use the LLM to design a decision tree (or a more sophisticated model, if needed) and then use that in your application.

Handling LLM outputs is rather matter of validation. Use schemas and validators so small textual drift does not break downstream code. Use evals instead of spot-checking. Cache where consistency matters, or where you need to save a few bucks. In other words, handle the randomness at the system boundary rather than trying to remove it from the model itself.

Warsaw IT Days - From RAG to AI Agent & LLMs That Think: Demystifying Reasoning Models

Thu, 19 Mar 2026 00:00:00 +0000

Announcement. For "From RAG to AI Agent": recording, Colab Notebook, archive folder. For "LLMs That Think": recording, slides, archive folder.

At the Warsaw IT Days I gave two talks.

In the first, "From RAG to AI Agent", I show how you can take your RAG pipeline and, step by step, convert it into an AI agent. The talk is based on an earlier blog article and goes through this Colab notebook where in under half a hour we build from scratch a chatbot, a RAG pipeline, and then we progressively upgrade it to an AI Agent. Every step of the way is hand-on and you can take the notebook and adapt it to your stack for a more realistic example.

In the second, "LLMs that Think: Demystifying Reasoning Models", I talk about reasoning LLMs: what they are, what they aren't, how they work and when to use them. Is GPT-5 AGI? Is it an AI Agent? Or is it just a glorified chain-of-thought prompt under the hood? To answer these questions, I classify LLMs into a small "taxonomy" based on the post-training steps they go through, in order to highlight how reasoning models differ qualitatively from their predecessors just like instruction-tuned models were not simple text-completion models with a better prompt. I also cover the effect of increasing the reasoning effort of the model, clarifying when it's useful and when it can lead to overthinking.

BGB Group - LLMs That Think: Demystifying Reasoning Models (in 5 minutes!)

Thu, 19 Mar 2026 00:00:00 +0000

Slides. All resources can also be found in my archive.

For a BGB Group internal showcase I talked about reasoning models: what they are, how they work and when to use them -- all in just five minutes!

To answer these questions I classified LLMs into a small "taxonomy" based on the post-training steps they go through, in order to highlight how reasoning models differ qualitatively from their predecessors just like instruction-tuned models were not simple text-completion models with a better prompt.

Is grep really better than a vector DB?

Sun, 15 Mar 2026 00:00:00 +0000

This is episode 7 of a series of shorter blog posts answering questions I received during the course of my work. They discuss common misconceptions and doubts about various generative AI technologies. You can find the whole series here: Practical Questions.

For the past two years, the default architecture for giving LLMs access to a knowledge base has been RAG with vector databases.

This architecture turned out to be very powerful, but it's far from cheap to setup and maintain: you need the system to chunk all the documents, embed all the chunks, store them in a vector DB, retrieve them, and feed them to the model. Every new document needs to go through this pipeline before it's usable, and changes to a document already processed means going through the vector DB and deleting all the affected chunks.

So it may come as a big surprise for many of us to learn that, as some people claims, an agent equipped with grep can find the data it's looking for just as well.

For example, one of the many comparisons that have been done on this topic found roughly these figures when answering questions about a Django codebase:

Vector search over embedded chunks achieved ~60% accuracy.
Agentic search using tools like grep, find, and cat, where the model iteratively explores the repository, achieved ~68%.

The same test on a TypeScript/Go codebase had the two approaches both reaching around ~70%.

The difference was cost and context: vector search consumed significantly more tokens. While it arguably provided more context to the agent, it's not clear whether the context retrieved this way was more useful, and it's easy to find contradicting results on the Internet.

So, what's truly going on?

If I had to summarize it in one sentence, it would be: retrieval quality depends heavily on the domain and the structure of the information.

Let's check our assumptions

Classic retrieval steps in the RAG architecture assumes the query must be correct on the first try. Because of this, most of the effort and breakthroughs in this field focused on getting decent results for all possible queries.

Vector search shines at this. By getting the chunks of context that semantically come as close as possible to the query (or to an answer to it), embedding-based retrieval was crucial for the one-shot retrieval typical of RAG apps.

However, agents remove that constraint. An agent can try an initial search with a surely subpar query, then inspect the results, refine the query, and search again.

This iterative process dramatically reduces the weaknesses of keyword search and, in fact, leverages all its strengths: exact keyword retrieval is not an ideal task for a vector DB, because semantically similar keywords will also be included.

This does not mean vector search is useless in the age of agents.

Domain Differences

Retrieval strategies behave very differently depending on the domain. Let's see a few examples to understand where one approach or the other shines.

Code Search

Code search is a perfect candidate for agentic keyword search, because identifiers are keywords that need exact matches in the results.

Vector search, while possible, has always been difficult to perform effectively on code. On top of that, there are tons of tools and techniques made for human coders to navigate a codebase with keywords, and agents can take advantage of those. For example, agents can use grep, AST search, symbol indexing, repository graphs, and more.

There's also a problem with context fragmentation. Vector search returns chunks, which in the case of code search are usually a fixed number of lines or symbols. Most of the time this context is useless for the agent, because it rarely includes a full logical unit, and when it does (such as when chunking on function boundaries), it becomes much harder to retrieve, because the chunks are larger.

This means that not only is vector search less precise, but it also wastes a lot of context.

How-to Guides / Knowledge Bases / General Prose

This is the classic use case where vector search shines and keyword search is far less effective.

When your corpus of text is made of conceptual explanations, natural language queries, inconsistent terminology, and so on, semantic similarity is the most likely to bring up relevant results.

Even in this case, however, pure vector search usually gets beaten by hybrid approaches, such as running vector and keyword searches in parallel and then reranking the results.

You can find more about hybrid retrieval in this other post of mine.

Legal / Medical / Scientific Documents

These sit somewhere in between. In these documents, the terminology is specialized, wording matters, and citations and sections are important. Vector search can surface relevant passages, but precision matters more than in the previous scenario. There's more structure than in free-form prose, and you can't lose it during the retrieval phase.

For these kinds of documents, hybrid approaches are necessary to avoid too many false positive matches.

What should I pick?

Choosing an approach usually depends on the use cases you foresee for your agent, but in practice it's often difficult to know beforehand what kind of documents your agent will need to sift through. Even coding agents need to search the web and read technical documentation, for example.

In these situations, it's best to avoid flattening the decision into a "keyword vs embeddings" choice. Your agent can make use of both of them and more. For example, if your agent must be able to search anything, you may give it:

A vector DB for static, shapeless prose, for example internal knowledge bases, static "ground truth" documents, foundational data, etc. Even searching through messages on Slack and Teams may be a good fit for a vector DB.
Tools like grep, cat, find, etc. Let the agent leverage its coding skills for quick keyword searches across all the data. Don't forget to make the data that's available in the vector DB also accessible through these tools.
A simple BM25 index that can be searched for keyword matches when the results from the command-line tools are overwhelming for the agent.
A web search tool that the agent can use to complement its local search results, if applicable.

... and so on.

Conclusion

Vector databases are not automatically the correct architecture. Neither is grep.

Before choosing, it is worth asking:

Is the information to search through structural or semantic?
Do queries benefit from iteration? Will my agent be able to retry the search as many times as it wants?
Would a simple keyword search index solve most cases, or do I need to search by meaning?

Sometimes the correct system architecture is a sophisticated hybrid embedding retriever. And sometimes it is still just grep -R "the keyword".

The only way to know for sure is, as usual, a RAG evaluation pipeline. Don't forget to measure your outcomes!

Phishing AI Agents

Wed, 04 Mar 2026 00:00:00 +0000

This post is based on one of my talks, "Phishing AI Agents". Have a look at the recording of my presentation at Lisbon's Mindstone AI Meetup in February 2026, and check out the talk page for slides and demo code.

Lately, everyone is talking about deploying AI agents, but not many ask themselves what happens once those agents are out in the world.

We are used to thinking about phishing as a human problem: a person receives an unusual message, trusts it for some reason, and gives away something sensitive. But what happens when the target is not a person, but an AI agent? Can an agent be phished? And if so, what does that actually look like in practice?

Useful agents are trusted agents

AI agents are powerful precisely because they are trusted with access to many of our most private accounts. An agent may need to read email, access calendars, browse internal documentation, inspect private GitHub repositories, review tickets, or interact with SaaS tools and APIs. In other words, the agent must have both context and capability.

That also means it becomes a security boundary.

A common but weak assumption in many deployments is that if we tell the agent that some data is confidential, it will keep that data confidential. In practice, that is not a sufficient control. Once an agent is exposed to the wrong content under the wrong conditions, secrecy instructions alone do not reliably prevent leakage. But what is the wrong content and conditions? Is browsing the web the issue? Or the ability to interact with strangers? Is my air-gapped agents running on a dedicated Mac Mini secure?

The “lethal trifecta”

A useful way to think about this is through what some researchers call the lethal trifecta. The term is not especially intuitive, but the idea is simple:

If an agent has:

access to private data,
the ability to communicate externally, and
exposure to untrusted content,

then you agent is vulnerable by design, and there is a path to data exfiltration.

The original definition of the "lethal trifecta" comes from Simon Willison's blog.

The exact exploit path may be simple or more complicated. It may take one try or many. But if those three conditions are present, the question is often not whether a leak is possible, but when it will happen.

This matters because many real agents satisfy these conditions almost by default. A very small agent that can read email, browse the web, and access a secret is already vulnerable.

That does not mean every such agent will be compromised immediately. It does mean you should not assume that prompt instructions alone make it safe.

Demo setup

To make this concrete, I built a minimal demo in a controlled environment. You can find all the code I used to build this demo here.

The demo agent is implemented in n8n as a low-code workflow and it's intentionally very simple:

it receives chat input formatted as if it were email,
it's powered by a modern frontier model, specifically GPT 5.2,
it has access to only one tool, HTTP GET, for web browsing,
it operates in a small local environment with a fake search engine, fake documentation pages, and a fake SaaS product.

The agent's achitecture in n8n.

Somewhere in this environment I also placed an attacker's trap, and if the agent falls for it, we should receive the leaked credentials through a Telegram message.

The agent’s system prompt contains instructions and a few secrets, including an API key for our imaginary SaaS product. The prompt explicitly told the agent not to share those credentials with anybody and indeed, if you directly ask the agent for the API key, it refuses.

That is what many teams observe in testing, and it often creates false confidence. The agent appears aligned. It appears to understand that the credential is sensitive. But direct requests are the easy case: our demo is not about getting the model to share the credentials through an email, or some other form or prompt injection. We're gonna demonstrate an entirely different attack surface, something much more similar to regular phishing as conducted against human targets.

A plausible support request

The interesting failure mode appears when the attacker does not ask for the secret directly. Instead, they send something that looks like a normal support or troubleshooting request:

I’m trying to call this endpoint on this SaaS API but I can’t get it to work. Can you send me a working example?

This is exactly the kind of task a helpful agent is supposed to solve. So the agent does what a helpful agent would do:

it searches for relevant documentation,
it follows documentation links,
it discovers API references or an OpenAPI spec,
it come across for a sandbox or example environment,
and it tries to produce a working example.

And that is where the leak happens.

In the demo, the agent found documentation that pointed to a sandbox endpoint controlled by the attacker. The agent treated that documentation as legitimate, believed the sandbox was part of the normal workflow, and tested the integration using the real API key it had been given.

The result: the attacker received the credential.

Not because the agent was asked to disclose it, but because the agent was induced to use it in the wrong place.

That distinction matters. Many defenses focus on preventing explicit disclosure. Real attacks often succeed by steering the agent into operational misuse instead.

How does the trick work?

The attack depends on a simple fact: the agent is willing to treat some external content as trustworthy enough to act on.

If attacker-controlled documentation, examples, links, or sandboxes are interpreted as valid guidance, then the agent can be manipulated into doing work on the attacker’s infrastructure. Once that happens, any credential it uses may be exposed.

Usually, a developer won't use a real API key on a sandbox system randomly found on the web, but an agent, being more naive, will try it out.

Seen under this light, this is a phishing attack:

The email to the agent is the lure.
The malicious documentation is the fake login page.
The sandbox is where the credential gets captured.

And critically, a limited toolset does not save you. Restricting an agent to HTTP GET does not eliminate prompt-injection or phishing-style risk. If the agent can fetch attacker-controlled content and then use secrets in a way that causes outbound requests, that can be enough.

In the real world

A common reaction is that this sounds artificial: surely a fake documentation page will never outrank the real docs, and an agent will be able to tell the difference, right? Surely the agent will not fall for something that naive.

That objection misses two things. First, attackers have time. They can try many variants, test against many products, and refine their lures, and with the help of modern LLMs, setting up a trap like this takes a hour at most, and can be automated to a large degree. Second, the search engine is not even necessary: an attacker can send the link directly by email, ticket, document, chat message, or issue comment. If the agent consumes the content and treats it as actionable, that may be enough.

Also, the leak only needs to happen once.

A stolen API key does not announce itself! If the attacker uses it quietly, the victim may not notice for some time. That makes one-time leakage operationally serious even if the exploit is intermittent.

What to do about it

The real problem is that there is no complete, simple fix today. If your architecture satisfies the lethal trifecta, you should assume residual exfiltration risk remains. That said, some mitigations are still worth applying to reduce your attack surface area.

1. Use disposable, low-privilege credentials

Do not give agents credentials you cannot afford to lose.

Prefer:

narrowly scoped API keys,
short-lived credentials,
credentials and keys that are easy to rotate,
strong isolation between environments,
permissions minimized to exactly what the agent needs.

If a key leaks, recovery should be operationally manageable and should not bankrupt you.

2. Monitor and review credential use

If agents are using secrets, their activity should be observable. If your agent is exposed and has access to some sensitive credentials, you should consider setting up:

usage logs,
anomaly detection,
per-agent attribution,
alerts for unusual destinations or access patterns,
rapid revocation workflows.

This is not perfect, but it reduces dwell time after compromise.

3. Red-team the agent continuously

Static evaluation is not enough.

When you discover a new attack pattern, test it against your own deployment. If your agent reads email, test adversarial email. If it reads docs, test malicious docs. If it uses APIs, test whether it can be tricked into authenticating to the wrong place. Then assess the damage and reinforce the prompt guardrails, tighten whitelists, but also improve your own process for recovery from the type of leak you observed, because no safeguard is 100% secure today.

Treat agent security as an ongoing adversarial exercise, not a one-time review.

4. Improve prompts

Stronger system prompts can help. You can include examples of prompt injection, tool misuse, credential theft, malicious links, and suspicious documentation patterns. However, this should be treated as one layer, not as a primary guarantee. Prompting can reduce some classes of failure, but does not remove the inherent risk.

Architectural defenses

The more serious defenses are architectural.

A promising direction is to separate the model that reads untrusted content from the model or system that has tool access and secrets. In other words, do not let the same component both interpret attacker-controlled input and directly act with privileged credentials. This kind of separation reduces the chance that malicious instructions flow directly from content ingestion into secret-bearing tool use.

One example of this broader design direction is the CaMeL approach, where responsibilities are split across components with different trust assumptions. That area is still early, and there are not yet many mature production implementations, but it points toward a more defensible model than "one agent does everything."

Conclusion

State of the art is improving, but most current agent implementations still do not adequately account for these threats. If you are deploying agents into real workflows, especially workflows that combine private data, external communication, and untrusted content, then you need to be much more careful than most demos and product pages suggest.

The problem is not just that an agent might say the wrong thing to the wrong person. The problem is that a capable agent can be manipulated into doing the wrong thing with your secrets.

That is what phishing an AI agent looks like. If you are building or deploying agents today, assume they are vulnerable to this class of attack.

Mindstone AI Meetup Lisbon - Phishing AI Agents

Wed, 25 Feb 2026 00:00:00 +0000

Announcement, recording, slides, demo repository and demo recording. All resources can also be found in my archive.

At February's Mindstone AI Lisbon Meetup I show how you easy it is to carry out a phishing attack against an AI agent. We live demoed the attack on stage successfully against an agent powered by GPT 5.2 and equipped with almost no tools, all in just a few minutes. Using a flagship model won't save you from data exfiltration if you don't take additional steps to secure your agent!

How does LLM memory work?

Wed, 04 Feb 2026 00:00:00 +0000

This is episode 6 of a series of shorter blog posts answering questions I received during the course of my work. They discuss common misconceptions and doubts about various generative AI technologies. You can find the whole series here: Practical Questions.

People often talk about an LLM "remembering" (or more often "forgetting") things. But how is that possible? LLMs are stateless algorithms that don't inherently have the ability to "remember" anything they see after their training is over. They don't have anything like databases, caches, logs. At inference time, LLMs produce the next token based only on its trained parameters and whatever text you include in the current request.

So what is "memory" in the context of LLM inference?

The chat history

When you're having a conversation with an LLM, the LLM does not remember what you've said in your previous messages. Every time it needs to generate a new token it re-reads everything that happened in the conversation so far, plus everything it has generated up to that point, to be able to decide what's the most likely next token. LLMs don't have any internal state: everything is recomputed from scratch for each output token.

💡 Methods exist to reduce the time complexity of LLM inference, mostly in the form of smart caching techniques (usually called prompt caching), but that's a story for another blog post.

This means that the chat history is not part of the LLM, but it's managed by the application built on top of it. It's the app's responsibility to store the chat history across turns and send it back to the LLM each time the user adds a new message to it.

The storage of the chat history is the simplest implementation of what "memory" means for an LLM. We can call it short-term memory and it allows the LLM to have a coherent conversation for many turns.

However, this approach has a limit: the length of the conversation.

The context window

LLMs can only process a fixed maximum amount of text at once. This limit is called context window and includes both the user's input (which in turn includes all the chat history up to that point) plus the output tokens the LLM is generating. This is an unavoidable limitation of the architecture of Transformer-based LLMs (which includes all the LLMs you're likely to ever come across).

So, what happens when the context window fills up? In short, the LLM will crash.

To prevent a hard system crash, various LLM applications handle context window overflows differently. The two most basic approaches are:

Hard failure (common in APIs): If you exceed the model’s context window, the request fails.
Truncation/sliding window (common in chat apps): The application drops older parts of the conversation so the latest turns fit. This means that for each new token you or the LLM are adding to the chat, an older token disappears from the history, and the LLM "forgets" it. In practice, during a conversation this may look like the LLM forgetting older topics of conversation, or losing sight of its original goal, or forgetting the system prompt and other custom instruction you might have given at the start of the chat.

However, both of these are just patches over the fundamental problem that LLMs can't remember more than the content of their context window. How do we get around that to achieve long-term memory?

LLM memory is context engineering

Making LLMs able to remember very long conversations is a context engineering problem: the science of choosing what to put in the LLM's context window at each inference pass. The context window is a limited resource, and the best LLMs applications out there usually shine due to their superior approach to context engineering. The more you can compress the right information into the smallest possible context, the faster, better and cheaper your AI system will be.

In the case of long-term memory, the core of the problem is choosing what to remember and how to make it fit into the context window. There are three common approaches: summarization, scratchpad/state, and RAG. These are not mutually exclusive, you can mix and match them as needed.

Summarization

In the case of summarization-style memory, the idea is to "compress the past" to make it fit the context window. You keep recent messages verbatim, but you also maintain a rolling summary of older conversations and/or older messages in the same conversation. When the chat gets long, you update the summary and discard raw older turns.

This is a pragmatic fit for simple chatbots: most users don't expect perfect recall, but are happy with an LLM that sort of remembers a summary of what they talked about in the past. It's also rather cheap and very simple to implement, which makes it a perfect fit for a quick, initial implementation.

The main issue with summarization memory is that LLMs often don't know what details must be remembered and what can be discarded, so they're likely to forget some important details and this might frustrate the users.

In short, summarization memory achieves something very like human memory: infinitely compressible but likely to lose details in arbitrary ways. This works for role-playing chatbots for example, but not for personal assistants that are supposed to remember everything perfectly.

Scratchpad

In order to overcome the fallacies of human memory, people use post-its and notebooks to store important details that can't be forgotten. Turns out that LLMs can do this too! This is called scratchpad / state approach and means that the LLM is now in charge of maintaining a small, structured "state" that represents what the assistant should not forget, such as user preferences current goals, open tasks, todo lists, key decisions, definitions and terminology agreed upon, and more.

This approach can be implemented in two ways:

by giving a scratchpad tool to the LLMs, where the model can choose to write, edit or delete its content at all times,
by having a separate LLM regularly review the conversation and populate the scratchpad.

In either case, the scratchpad content is then added to the conversation history (for example in the system prompt or in other dedicated sections) and older conversation messages are dropped.

This approach is far more controllable than summaries, because the LLM can be instructed carefully as of what it's critical to remember and how to save it into the scratchpad. Not only, but the users themselves can be allowed to read and edit the scratchpad to check what the LLM remembers, add more information, or even correct errors.

RAG Memory

But what if the scratchpad becomes itself huge and occupies a large share of the context window, or even overflows it? For agents that need to take on huge tasks (for example coding agents and deep research systems) the scratchpad approach might not be enough.

In this case we can start to treat memory as yet another data source and perform RAG over the scratchpad and/or the conversation history, stored in a vector DB and indexed regularly.

The advantage of RAG memory is that you can reuse all well-known patterns for RAG, with the only difference that the content to be retrieved is the chat history itself and/or the LLM's notes.

However, RAG memory suffers from the shortcomings of retrieval: as the retrieval pipeline is never absolutely perfect, you can't expect perfect recall. You'll have to pay attention to the quality of the memory retrieval, evaluate it carefully and regularly, and so on. This adds a new dimension to your agent's evaluation and in general quite a bit of complexity.

In addition, you may run into an additional problem that's unique to RAG memory: context stuffing. Context stuffing is the presence of retrieved snippets of context that look like prompts: they can cause problems because they might confuse the LLM into following the instruction contained in the retrieved snippet instead of the user's instruction.

While context stuffing can happen with malicious context snippets in regular RAG, it's also very likely to happen accidentally when implementing RAG-based memory that searches directly into the chat history. This happens because all the retrieved snippets were indeed user's prompts in the past! In this case, it's essential to make sure that the prompt identifies clearly the retrieved snippets as context and not prompts.

Conclusion

That's it! With any of these three approaches, your LLM-base application is now able to remember things long-term.

However, don't forget that the moment when you add memory to your LLM powered application, you're now storing user data, with all the problems that this brings. You will need to take care of retention, user control over the memorized data, you'll be storing PII and secrets, and in many cases this process needs to be compliant to whatever policy for data retention you may be subject to.

Agentic AI Summit - From RAG to AI Agent

Wed, 21 Jan 2026 00:00:00 +0000

Announcement, interactive notebook, presentation. All resources can also be found in my archive.

At the Agentic AI Summit 2026 I show how you can take your RAG pipeline and, step by step, convert it into an AI agent.

The workshop is based on an earlier blog article and goes through this Colab notebook where in under a hour we build from scratch a chatbot, a RAG pipeline, and then we progressively upgrade it to an AI Agent. Every step of the way is hand-on and you can take the notebook and adapt it to your stack for a more realistic example.

Jupyter Chat Widget

Thu, 15 Jan 2026 00:00:00 +0000

For my workshop at the Agentic AI Summit I vibe-coded this small Jupyter Notebook widget on top of ipywidgets<8.0.0, for easy compatibility with Colab.

Install it with:

pip install jupyter-chat-widget

See the documentation and the GitHub repo.

From RAG to AI Agent

Wed, 07 Jan 2026 00:00:00 +0000

If you're interested in this topic, you can check out the Colab notebook as well.

2025 was the year of LLM reasoning. Most LLM providers focused on improving the ability of their LLMs to reason, make decisions, and carry out long-horizon tasks with the least possible amount of human intervention. RAG pipelines, so hyped in the last couple of years, are now a thing of the past: the focus shifted on AI agents, a term that only recently seems to have acquired a relatively well-defined meaning:

An LLM agent runs tools in a loop to achieve a goal.

While simple, the concept at a first glance might seem to you very far from the one of RAG. But is it?

In this post I want to show you how you can extend your RAG pipelines step by step to become agents without having to throw away everything you've built so far. In fact, if you have a very good RAG system today, your future agents are bound to have great research skills right away. You may even find that you may be already half-way through the process of converting your pipeline into an agent without knowing it.

Let's see how it's done.

1. Start from basic RAG

Our starting point, what's usually called "basic RAG" to distinguish it from more advanced RAG implementations, is a system with a retrieval step (be it vector-based, keyword-based, web search, hybrid, or anything else) that occurs every time the user sends a message to an LLM. Its architecture might look like this:

Systems with more than one retriever and/or a reranker step also fall under this category. What's crucial to distinguish basic RAG from more "agentic" versions of it is the fact that the retrieval step runs on every user message and that the user message is fed directly to the retriever.

2. Add Query Rewrite

The first major step towards agentic behavior is the query rewrite step. RAG pipelines with query rewrite don't send the user's message directly to the retriever, but rewrite it to improve the outcomes of the retrieval.

Query rewrite is a bit of a double-edged sword. In some cases it may make your RAG pipeline less reliable, because the LLM may misunderstand your intent and query the retriever with an unexpected prompt. It also introduces a delay, as there is one more round-trip to the LLM to make. However, a well implemented query rewrite step has a huge impact on follow-up questions.

Think about a conversation like this:

User: What do the style guidelines say about the use of colors on our website?

Assistant: The style guidelines say that all company websites should use a specific palette made of these colors: ....

User: Why?

The first questions from the user is clear and detailed, so retrieval would probably return relevant results regardless of whether the query gets rewritten or not. However, the second question alone has far too little information to make sense on its own: sending the string "Why?" to a retriever is bound to return only garbage results, which may make the LLM respond something unexpected (and likely wrong).

In this case, query rewrite fixes the issue by expanding the "Why?" into a more reasonable query, such as "What's the reason the company mandated a specific color palette?" or "Rationale behind the company's brand color palette selection". This query helps the retriever find the type of information that's actually relevant and provide good context for the answer.

3. Optional Retrieval

Once query rewrite is in place, the next step is to give the pipeline some very basic decisional power. Specifically, I'm talking about skipping retrieval when it's not necessary.

Think about a conversation like this:

User: What do the style guidelines say about the use of colors on our website?

Assistant: The style guidelines say that all company websites should use a specific palette made of these colors: ....

User: List the colors as a table.

In this case, the LLM needs no additional context to be able to do what the user asks: it's actually better if the retrieval is skipped in order to save time, resources, and avoid potential failures during retrieval that might confuse it (such as the retriever bringing up irrelevant context snippets).

This means that even before query rewrite we should add another step, where the LLM gets to decide whether we should do any retrieval or not. The final architecture looks like this:

💡 Note that this is just a naive implementation. In practice, the decision of retrieving and the query rewrite may be done by the same LLM call to save time. You may also use different LLMs in parallel for different steps, leveraging smarter and more expensive LLMs for the decisional tasks and faster/cheaper ones for the query rewrite and the answer generation.

This is a critical step towards an AI agent: we are giving the LLM the power to take a decision, however simple the decision may look. This is the point where you should start to adapt your evaluation framework to measure how effective the LLM is at taking decisions, rather than its skills at interpreting the retrieved context or the effectiveness of your retrieval step alone. This is what Agent evaluation frameworks will do for you (see the bottom of the article for some suggestions).

4. The Agentic Loop

Once we have this structure in place, we're ready to give the LLM even more autonomy by introducing an agentic loop.

Since the LLM is now able to take the decision to retrieve or not retrieve based on the chat history, how about we let the LLM also review what context snippets were returned by the retriever, and decide whether the retrieval was successful or not?

To build this agentic loop you should add a new step between the retrieval and the generation step, where the retrieved context is sent to the LLM for review. If the LLM believes the context is relevant to the question and sufficient to answer it, the LLM can decide to proceed to the answer generation. If not, the process loops back to the query rewrite stage, and the retrieval runs again with a different query in the hope that better context will be found.

The resulting architecture looks like this:

💡 Note that this is also a naive implementation. A few of these decisions can be packed together in a single pass and, again, you can use different LLMs for different tasks.

With the introduction of the agentic loop we've crossed the boundary of what constitutes an AI Agent, even though it's still a very simple one. The LLM is now in charge of deciding when the retrieval is good enough, and it can try as many times as it wants (up to a threshold of your choosing) until it's satisfied with the outcome.

If your retrieval step is well done and effective, this whole architecture may sound pointless. The LLM can hardly get better results by trying again if retrieval is already optimized and query rewriting is not making mistakes, so what's the point? In this case, the introduction of the agentic loop can be seen just as a necessary stepping stone towards the next upgrade: transforming retrieval into a tool.

5. Retrieval as a Tool

In many advanced RAG pipelines, retrieval of context and tool usage is seen as two very different operations. RAG is usually always on, highly custom, etc. while tools tend to be very small and simple, rarely called by the LLM, and sometimes implemented on standardized protocols like MCP.

This distinction is arbitrary and simply due to historical baggage. Retrieval can be a tool, so it's best to treat it like one!

Once you adopt this mindset, you'll see that the hints were there all along:

We made retrieval optional, so the LLM can choose to either call it or not - like every other tool
Query rewrite is the LLM choosing what input to provide to the retriever - as it does when it decides to call any other tool
The retriever returns output that goes into the chat history to be used for the answer's generation - like the output of all other tools.

Transforming retrieval into a tool simplifies our architecture drastically and moves us fully into AI Agent territory:

As you can see:

The decision step is now part of the LLM's answer generation, which can call it as many times as it wants thanks to the tool calling loop
The query rewrite comes for free as the LLM invokes the retrieval tool
The retriever's output goes into the chat history to be used to answer the user's request

At this point it's time to address a common concern. You may have heard elsewhere that implementing retrieval as a tool makes the LLM "forget" to retrieve context when it should rather do it, so the effectiveness of your RAG worsens. This was very real a couple of years ago, but in my experience it's no longer relevant: modern LLMs are now trained to reach for tools all the time, so this problem has largely disappeared.

6. Add more tools

Congratulations! At this point you can call your system a true AI Agent. However, an agent with only a retrieval tool has limited use. It's time to add other tools!

To begin with, if your retrieval pipeline has a lot of moving parts (hybrid retriever, web search, image search, SQL queries, etc...) you can consider separating each of them into separate search tools for the LLM to use, or to expose more parameters to let the LLM customize the output mix.

Once that's done, adding other tools is trivial on a technical level, especially with protocols such as MCP. Using popular, open source MCPs may let you simplify your retrieval tool drastically: for example by leveraging GitHub's MCP instead of doing code search yourself, or Atlassian's MCPs instead of custom Jira/Confluence/BitBucket integrations, and so on.

However, keep in mind that adding too many tools and MCPs can overwhelm the LLM. You should carefully select which tools can expand the most your LLM's ability to solve your user's problems. For example, a GitHub MCP is irrelevant if only very few of your users are developers, and an image generation tool is useless if you're serving only developers. It's easy to overdo it, so make sure to review regularly the tools you make available to your LLM and add/remove them as necessary.

And in the rare case in which you actually need a lot of tools, consider letting the user plug them in as needed (like the ChatGPT UI does), or adopt a more sophisticated tool calling approach to make sure to manage the context window effectively.

Conclusion

That's it! You successfully transformed your RAG pipeline into a simple AI Agent. From here you can expand further by implementing planning steps, sub-agents, and more.

However, before going further you should remember that your retrieval-oriented metrics now are not sufficient anymore to evaluate the decision making skills of your system. If you've been using a RAG-only eval framework such as RAGAS it's now a good time to move on to a more general-purpose or agent-oriented eval framework, such as DeepEval, Galileo, Arize.ai or any other AI Agent framework of your choice.

Single Page Helpers

Thu, 01 Jan 2026 00:00:00 +0000

Inspired by Simon Willison's collection, I am also building a small collection of small helpers for common tasks that I usually get done using some free web tool. In this collection I list a series of webpages, each completely self-sufficient, that can be used to perform a basic task like cropping an image, generating a QR code, or visualizing timezones. All the tools are open-source, so feel free to copy them for your own personal use.

You can find them here: Single Page Helpers.

What are the "experts" in Mixture-of-Experts LLMs?

Thu, 11 Dec 2025 00:00:00 +0000

This is episode 5 of a series of shorter blog posts answering questions I received during the course of my work. They discuss common misconceptions and doubts about various generative AI technologies. You can find the whole series here: Practical Questions.

Nearly all popular LLMs share the same internal structure: they are decoder-only Transformers. However, they are not completely identical: in order to speed up training, increase intelligence or improve inference speed and cost, this base template is sometimes modified a bit.

One popular variant is the so-called MoE (Mixture of Experts) architecture: a neural network design that divides the model into multiple independent sub-networks called "experts". For each input, a routing algorithm (also called gating network) determines which experts to activate, so only a subset of the model's parameters is used during each inference pass. This leads to efficient scaling: models can grow significantly in parameter size without a proportional increase in computational resources per token or query. In short, it enables large models to perform as quickly as smaller ones without sacrificing accuracy.

But what are these expert networks, and how are they built? One common misconception is that the "experts" of MoE are specialized in a well defined, recognizable type of task: that the model includes a "math expert", a "poetry expert", and so on. The query would then be routed to the appropriate expert after the type of request is classified.

However, this is not the case. Let's figure out how it works under the hood.

The MoE architecture

In order to understand MoE, you should first be familiar with the basic architecture of decoder-only Transformers. If the diagram below is not familiar to you, have a look at this detailed description before diving in.

The main change made by a MoE over the decoder-only transformer architecture is within the feed-forward component of the transformer block. In the standard, non MoE architecture, the tokens pass one by one through a have a single feed-forward neural network. In a MoE instead, at this stage there are many feed-forward networks, each with their own weights: they are the "experts".

This means that to create an MoE LLM we first need to convert the transformer’s feed-forward layers to these expert layers. Their internal structure is the same as the original, single network, but copied a few times, with the addition of a routing algorithm to select the expert to use for each input token to process.

The core of a routing algorithm is rather simple as well. First the token's embedding passes through a linear transformation (such as a fully connected layer) that outputs a vector as long as the number of experts we have in our system. Then, a softmax is applied and the top-k experts are selected. After the experts produce output, their results are then averaged (using their initial score as weight) and sent to the next decode layer.

Keep in mind that this is a simplification of the actual routing mechanism of real MoE models. If implemented as described here, through the training phase you would observe a routing collapse: the routing network would learn to send all tokens to the same expert all the time, reducing your MoE model back to the equivalent of a regular decoder-only Transformer. To make the network learn to distribute the tokens in a more balanced fashion, you would need to add auxiliary loss functions that make the routing network learn to load balance the experts properly. For more details on this process (and much more on MoE in general) see this detailed overview.

So experts never specialize?

Yes and no. On the OpenMoE paper, the authors investigated in detail whether experts do specialize in any recognizable domain, and they observed interesting results. In their case, experts do not tend to specialize in any particular domain; however, there is some level of expert specialization across natural languages and specific tasks.

According to the authors, this specialization is due to the same tokens being sent to the same expert every time, regardless of the context in which it is used. Given that different languages use a very different set of tokens, it's natural to see this sort of specialization emerging, and the same can be said of specific tasks, where the jargon and the word frequency changes strongly. The paper defines this behavior as “Context-Independent Specialization”.

It's important to stress again that whether this specialization occurs, and on which dimensions, is irrelevant to the effectiveness of this architecture. The core advantage of MoE is not the presence of recognizable experts, but on the sparsity it introduces: with MoE you can scale up the parameters count without slowing down the inference speed of the resulting model, because not all weights will be used for all tokens.

Conclusion

The term "Mixture of Experts" can easily bring the wrong image into the mind of people unaccustomed with how neural networks, and Transformers in general, work internally. When discussing this type of models, I often find important to stress the difference between how the term "expert" is intended by a non technical audience and what it means in this context.

If you want to learn more about MoEs and how they're implemented in practice, I recommend this this very detailed article by Cameron Wolfe, where he dissects the architecture in far more detail and adds plenty of examples and references to dig further.

Embrace:AI // 2025.06 - Reasoning LLMs & Multimodal Architecture

Thu, 06 Nov 2025 00:00:00 +0000

Announcement, slides. All resources can also be found in my archive.

At Embrace.ai's November Meetup, part of the Lisbon AI Week I talked about reasoning models: what they are, what they aren't, how they work and when to use them. Is GPT-5 AGI? Is it an AI Agent? Or is it just a glorified chain-of-thought prompt under the hood?

To answer these questions, I classified LLMs into a small "taxonomy" based on the post-training steps they go through, in order to highlight how reasoning models differ qualitatively from their predecessors just like instruction-tuned models were not simple text-completion models with a better prompt.

I also covered the effect of increasing the reasoning effort of the model, clarifying when it's useful and when it can lead to overthinking.

What's hybrid retrieval good for?

Tue, 04 Nov 2025 00:00:00 +0000

This is episode 4 of a series of shorter blog posts answering questions I received during the course of my work. They discuss common misconceptions and doubts about various generative AI technologies. You can find the whole series here: Practical Questions.

It has been a long time since TF-IDF or even BM25 were the state of the art for information retrieval. These days the baseline has moved to embedding similarity search, where each unit of information, be it a sentence, a paragraph or a page is first encoded in an embedding and then compared with the embedding of the user's query.

From this baseline there are often two pieces of advice to help you increase the performance of your search system: one is to go the deep end with the embedding approach and consider a reranker, finetune your embedding model, and so on. The other, usually called hybrid retrieval or hybrid search, is to bring back good old keyword search algorithms and use them to complement your results. Often the best scenario is to use both of these enhancements, which nicely complement each other.

But why would this arrangement help improve the results? Isn't embedding search strictly superior to keyword-based retrieval algorithms?

Semantic vs Lexical

When you embed a sentence, the resulting embedding encodes its meaning, not its exact phrasing. That’s their strength! But it can often be a limitation as well.

For example a semantic model can understand that "latest iPhone" is similar to "iPhone 17 Pro Max", which is great if the first sentence is a query and the second the search result. But a semantic model will also say that "iPhone 17 Pro Max" and "iPhone 11 Pro Max" are very similar, which is not great if the first sentence is a query and the second a search result.

In short, semantic similarity is great if you are starting from a generic query and you want a set precise result all matching the generic description, or if you start from a general question and want to retrieve all very particular results that fall under the same general concept. For "latest iPhone", "iPhone 17 Pro Max", "iPhone 17 Pro" and ideally "iPhone Air" and are all valid search results.

On the other hand, lexical similarity is what allows your system to retrieve extremely precise results in response to a very specific query. "latest iPhone" will return garbage results with a lexical algorithm such as BM25 (essentially any iPhone would match), but if the search string is "iPhone 17 Plus Max", BM25 will return the best results.

To visualize it better, here's the expected results for each of the two queries in a dataset of iPhone names:

User Query	Semantic Search Results	Keyword Search Results
"latest iPhone"	iPhone 17 Pro iPhone 17 Pro Max iPhone Air	iPhone 11 Pro Max iPhone 4 iPhone SE
"iPhone 17 Pro Max"	iPhone 17 Pro iPhone 17 Pro Max iPhone Air	iPhone 17 Pro Max

As you can see, the problem is that neither of the two approaches works best with both types of queries: each has its strong pros and cons and works best only on a subset of the questions your system may receive.

So why not using them both?

Combining them

A hybrid search system is simply a system that does the same search twice: once with a keyword algorithm such as BM25, and once with vector search. But how to merge the two lists of results?

The scores the documents come with are deeply incomparable. BM25 scores depends on terms frequency and keyword matching, and are not bound to any range. On the contrary, cosine similarity usually clusters between 0.5 and 0.9, which gets even narrower if the sequences are longer.

That's where reciprocal rank fusion (RRF) comes in. RRF is incredibly simple and boils down to this formula: score(d) = sum( 1/(k + rank_method_i(d)) ) . As you can see it works on the ranks, not scores, so it’s robust against scale differences and requires no normalization. Platforms like Elastic and Pinecone use it for production hybrid search due to its simplicity and reliability. Being so simple, the additional latency is negligible, which makes it suitable for real-time usecases.

Or, if you're less concerned about latency, you can consider adding a reranker.

Having two independent and complementary search techniques is the reason why adding a reranker to your hybrid pipeline is so effective. By using these two wildly different methods, it's not obvious whether even the rankings are comparable. Rerankers can have a more careful look at the retrieved documents and make sure the most relevant documents are to the top of the pile, allowing you to cut away the least relevant ones.

Conclusion

Hybrid search isn’t a patch for outdated systems, but a default strategy for any high-quality retrieval engine. Dense embeddings bring rich contextual understanding, while sparse retrieval ensures accuracy for unique identifiers, numeric codes, acronyms, or exact strings that embeddings gloss over. In a world where search systems must serve both humans and machine agents, hybrid search is the recall multiplier that guarantees we get both meaning and precision.

Making sense of KV Cache optimizations, Ep. 4: System-level

Wed, 29 Oct 2025 00:00:00 +0000

In the previous posts we've seen what the KV cache is and what types of KV Cache management optimizations exist according to a recent survey. In this post we are going to focus on system-level KV cache optimizations.

What is a system-level optimization?

Real hardware is not only made of "memory" and "compute", but is made of several different hardware and OS level elements, each with its specific tradeoff between speed, throughput, latency, and so on. Optimizing the KV cache to leverages this differences is the core idea of the optimizazions we're going to see in this post.

Here is an overview of the types of optimizations that exist today.

Source

As you can see in the diagram, they can be broadly grouped into three categories: memory management, scheduling strategies, and hardware-aware designs. These approaches are complementary and can be often used together, each addressing different aspects of performance, efficiency, and resource utilization tradeoffs.

Let's see what's the idea behind each of these categories. We won't go into the details of the implementations of each: to learn more about a specific approach follow the links to the relevant sections of the survey, where you can find summaries and references.

Memory Management

Memory management techniques focus on using the different types of memory and storage available to the system in the most efficient way. There are two main approaches to this problem:

Architectural designs, such as vLLM's PagedAttention and vTensor. These strategies adapt operating system memory management ideas to to create memory allocation systems that optimize the use of physical memory as much as possible. For example, PagedAttention adapts OS-inspired paging concepts by partitioning KV caches into fixed-size blocks with non-contiguous storage, and vLLM implements a virtual memory-like system that manages these blocks through a sophisticated mapping mechanism.
Prefix-aware designs like ChunkAttention and MemServe. These center around the design of datastructures optimized for maximising cache de-duplication and sharing of common prefixes. For example, ChunkAttention restructures KV cache management by breaking down traditional monolithic KV cache tensors into smaller, manageable chunks organized within a prefix tree structure, enabling efficient runtime detection and sharing of common prefixes across multiple requests.

In general, there's a flurry of novel research focused on the way the KV cache is stored in memory. They bring classic OS memory management patterns and novel designs that leverage the properties of the KV cache at a memory layout level to increase the inference speed and memory consumption issues in a way that's transparent from the model's perspective. This makes these techniques widely applicable to many different LLMs and usually complementary to each other, which multiplies their effectiveness.

For a more detailed description of each technique, check out the survey.

Scheduling

Scheduling techniques focus on maximizing cache hits and minimize cache lifetime by grouping and distributing requests appropriately. In this category we can find a few distinct approaches:

Prefix-aware scheduling strategies, such as BatchLLM and RadixAttention. For example, unlike traditional LRU caches, BatchLLM identifies global prefixes and coordinates the scheduling of requests sharing common KV cache content. This ensures optimal KV cache reuse while minimizing cache lifetime: requests with identical prefixes are deliberately scheduled together to maximize KV cache sharing efficiency.
Preemptive and fairness-oriented scheduling, such as FastServe and FastSwitch. For example, FastServe implements a proactive cache management strategy coordinates cache movement between GPU and host memory, overlapping data transmission with computation to minimize latency impact. The scheduler also prioritizes jobs based on input length.
Layer-specific and hierarchical scheduling approaches, such as LayerKV and CachedAttention. For example, LayerKV focuses on reducing time-to-first-token (TTFT) through a fine-grained, layer-specific KV cache block allocation and management strategy. It also includes an SLO-aware scheduler that optimizes cache allocation decisions based on service level objectives.

For a more detailed description of each technique, check out the survey.

Hardware-aware Design

These techiques focus on leveraging specific characteristics of the hardware in order to accelerate inference and increase efficiency. In this class of optimizazions we can find a few shared ideas:

Single/Multi-GPU designs focus on optimizing memory access patterns, GPU kernel designs for efficient attention computation, and parallel processing with load balancing. For example, shared prefix optimization approaches like HydraGen and DeFT focus on efficient GPU memory utilization through batched prefix computations and tree-structured attention patterns. Another example is distributed processing frameworks such as vLLM, that optimize multi-GPU scenarios through sophisticated memory management and synchronization mechanisms. Other techniques are phase-aware, like DistServe, which means that they separate prefill and decoding phases across GPU resources to optimize their distinct memory access patterns.
IO-based designs optimize data movement across memory hierarchies through asynchronous I/O and intelligent prefetching mechanisms.
At the GPU level, approaches like FlashAttention optimize data movement between HBM and SRAM through tiling strategies and split attention computations. At the CPU-GPU boundary, systems like PartKVRec address tackles PCIe bandwidth bottlenecks.
Heterogeneous designs orchestrate computation and memory allocation across CPU-GPU tiers. Systems like NEO or FastDecode reditribute the workload by offloading to the CPU part of the attention computations, while others like FlexInfer introduce virtual memory abstractions.
SSD-based designs have evolved from basic offloading approaches to more sophisticated designs. For example, FlexGen extends the memory hierarchy across GPU, CPU memory, and disk storage, optimizing high-throughput LLM inference on resource-constrained hardware. InstInfer instead leverages computational storage drives (CSDs) to perform in-storage attention computation, effectively bypassing PCIe bandwidth limitations. These techniques demonstrate how storage devices can be integrated into LLM inference systems either as memory hierarchy extensions or as computational resources.

For a more detailed description of each technique, check out the survey.

Conclusions

System-level KV cache optimizations show that working across the stack can bring impressive speedups and manage physical resources more efficiently than it could ever be done at the LLM's abstraction level. Operating systems and hardware layouts offer plenty of space for optimizations of workloads that have somewhat predictable patterns such as attention computations and KV caching show, and these are just a few examples of what could be done in the near future.

This is the end of our review. The original paper includes an additional section on long-context benchmarks which we're not going to cover, so head to the survey if you're interested in the topic.

ODSC West: LLMs that think - Demystifying Reasoning Models

Wed, 29 Oct 2025 00:00:00 +0000

Announcement, teaser article, slides. All resources can also be found in my archive.

At ODSC West 2025 I talked about reasoning models: what they are, what they aren't, how they work and when to use them. Is GPT-5 AGI? Is it an AI Agent? Or is it just a glorified chain-of-thought prompt under the hood?

I also covered the effect of increasing the reasoning effort of the model, clarifying when it's useful and when it can lead to overthinking.

Making sense of KV Cache optimizations, Ep. 3: Model-level

Tue, 28 Oct 2025 00:00:00 +0000

In the previous posts we've seen what the KV cache is and what types of KV Cache management optimizations exist according to a recent survey. In this post we are going to focus on model-level KV cache optimizations.

What is a model-level optimization?

We call a model-level optimization any modification of the architecture of the LLM that enables a more efficient reuse of the KV cache. In most cases, to apply these method to an LLM you need to either retrain or at least finetune the model, so it's not easy to apply and is usually baked in advance in of-the-shelf models.

Here is an overview of the types of optimizations that exist today.

Source

Attention Grouping and Sharing

One common technique to reduce the size of the KV cache is to group and/or share attention on different levels. There's techniques being developed for different grades of attention grouping:

Intra-layer grouping: focuses on grouping query, key, and value heads within individual layers
Cross-layer sharing: shares key, value, or attention components across layers

At the intra-layer level, the standard architecture of Transformers calls for full multi-headed attention (MHA). As an alternative, it was proposed to have all attention heads share a single key and value, reducing dramatically the amount of compute and space needed. This technique, called multi-query attention (MQA) is a radical strategy that would cause not just quality degradation, but also training instability. As a compromise, grouped-query attention (GQA) was proposed by dividing the query heads into multiple groups, while each group shares its own keys and values. In addition, an uptraining process has been proposed to efficiently convert existing MHA models to GQA configurations by mean-pooling the key and value heads associated with each group. Empirical evaluations demonstrated that GQA models achieve performance close to the original MHA models.

A simplified illustration of different QKV grouping techniques: multi-headed attention (MHA), multi-query attention (MQA) and grouped-query attention (GQA).

Across layers, cross-layer attention (CLA) was proposed to extends the idea of GQA. Its core idea is to share the key and value heads between adjacent layers. This achieves an additional 2× KV cache size reduction compared to MQA. Several other approaches exist to address cross-layer attention sharing, so check out the survey if you want to learn more.

In general, the main issue in this line of research regards the model modifications that needs to be applied. Current approaches often fail to generalize well to architecture they were not initially designed on, while more static and general grouping/sharing strategies fail to capture important variations in the various heads and layers, leading to a loss of output quality. In addition, the need to retrain the LLM after the changes limits strongly the portability of these methods.

For a more detailed description of each technique, check out the survey.

Architecture Alteration

Another approach is to make more high-level architectural changes to reduce the required cache size. There seems to be two main directions in this area:

Enhanced Attention: methods that refine the attention mechanism for KV cache efficiency. An example is DeepSeek-V2, which introduced Multi-Head Latent Attention (MLA). This technique adopts a low-rank KV joint compression mechanism and replaces the full KV cache with compressed latent vectors. The model adopts trainable projection and expansion matrices to do the compression. This compression mechanism is what enables the model to handle sequences of up to 128K tokens. You can learn more about MLA in this article by Sebastian Raschka.
Augmented Architecture: methods that introduce structural changes for better KV management, for example novel decoder structures (such as YOCO, that included a self-decoder and a cross-decoder step).

Many of these works build upon the broader landscape of efficient attention mechanisms (e.g., Linear Transformer, Performer, LinFormer, etc.) which already have their own survey.

Although these approaches demonstrate significant progress in enabling longer context windows and faster inference, there are still big challenged ans unknowns. Some techniques in this category, for example, perform very well for some tasks but fail to generalize (for example they work well with RAG but not with non-RAG scenarios).

For a more detailed description of each technique, check out the survey.

Non-Transformer Architecture

In this category we group all radical approaches that ditch the Transformers architecture partially or entirely and embrace alternative models, for example RNNs, which don't have quadratic computation bottlenecks at all and sidestep the problem entirely.

In the case of completely independent architectures, notable examples are:

Mamba, based on state space sequence models (SSMs). Mamba improves SSMs by making parameters input-dependent, allowing information to be selectively propagated or forgotten along the sequence based on the current token. Mamba omits attention entirely.
RWKV (Receptance Weighted Key Value) integrates a linear attention mechanism, enabling parallelizable training like transformers while retaining the efficient inference characteristics of RNNs.

Efficient non-Transformers also have their own surveys, so check out the paper to learn more.

For a more detailed description of each technique, check out the survey.

Conclusion

Model-level optimizations go from very light touches to the original Transformer model to architecture that have nothing to do with it, therefore not having any KV cache to deal with in the first place. In nearly all cases the principal barrier to adoption is the same: applying these techniques requires a full retraining of the model, which can be impractical at best and prohibitively expensive at worst, even for users that have the right data and computing power. Model-level optimizations are mostly useful for LLM developers to get an intuition of the memory efficiency that can be expected from a model that includes one or more of these features out of the box.

In the next post we're going to address system-level optimizations. Stay tuned!