<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[zansara.dev]]></title><description><![CDATA[My personal blog. I write about AI, LLMs, Open Source and Python, with the occasional diversion.]]></description><link>https://zansara.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!45Qb!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91d6832e-c8ad-41ff-8e39-826ed75c61e4_400x400.png</url><title>zansara.dev</title><link>https://zansara.substack.com</link></image><generator>Substack</generator><lastBuildDate>Sat, 11 Apr 2026 07:52:38 GMT</lastBuildDate><atom:link href="https://zansara.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Sara Z.]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[zansara@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[zansara@substack.com]]></itunes:email><itunes:name><![CDATA[Sara Z.]]></itunes:name></itunes:owner><itunes:author><![CDATA[Sara Z.]]></itunes:author><googleplay:owner><![CDATA[zansara@substack.com]]></googleplay:owner><googleplay:email><![CDATA[zansara@substack.com]]></googleplay:email><googleplay:author><![CDATA[Sara Z.]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Setting the temperature to zero will make an LLM deterministic?]]></title><description><![CDATA[We all know LLMs don&#8217;t always give the same response to slight changes of prompt. But why do their answers differ even when the prompt is identical? 
And what can we do to prevent it?]]></description><link>https://zansara.substack.com/p/setting-the-temperature-to-zero-will</link><guid isPermaLink="false">https://zansara.substack.com/p/setting-the-temperature-to-zero-will</guid><dc:creator><![CDATA[Sara Z.]]></dc:creator><pubDate>Thu, 26 Mar 2026 14:49:40 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/37ec6c2f-3fa1-4430-9fac-8727389d6956_1363x569.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is episode 8 of a series of shorter blog posts answering questions I received during the course of my work. They discuss common misconceptions and doubts about various generative AI technologies. You can find the whole series here: <a href="https://www.zansara.dev/series/practical-questions">Practical Questions</a>.</em></p><div><hr></div><p>One common explanation of the &#8220;temperature&#8221; parameter of LLMs is that it represents the &#8220;randomness&#8221; of the answer.</p><p>That&#8217;s broadly correct. Temperature is a parameter of the LLM&#8217;s final decoding step, and the only one in the whole Transformer architecture that truly incorporates some randomness by design. At this stage, once the model has calculated the logits of the next token candidates, it has to map those values to an actual token from a list. Normally, LLMs perform best when they&#8217;re allowed not to pick the single best token every time, but to choose at random among the most likely candidates: how strongly that choice favors the top tokens over the rest is, more or less, what the temperature parameter controls.</p><p>Therefore, when we set the temperature to 0, the LLM must always choose the best next token, without making random choices. So, if the input is fixed and we have removed the only source of randomness in the architecture, the outputs should always be identical... right?</p><p>And yet, in practice, they often are not. 
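</p><p>To make the decoding step concrete, here is a minimal, purely illustrative sketch of temperature sampling. The function and the numbers are invented for this example, not taken from any real inference stack:</p>

```python
import math
import random

def sample_next_token(logits, temperature, rng=random):
    """Pick a token index from a list of raw logits (illustrative sketch)."""
    if temperature == 0:
        # Greedy decoding: always return the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Higher temperature flattens the distribution, adding randomness.
    weights = [math.exp(score / temperature) for score in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [2.0, 1.9, -1.0]
print(sample_next_token(logits, 0))  # always 0: greedy decoding is deterministic
```

<p>With <code>temperature=0</code> this function is fully deterministic for a given list of logits; the interesting question is where the variation in the logits themselves comes from.</p><p>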
Run the same prompt twice, with the same model, the same parameters, and temperature 0, and sooner or later the output will be a bit different. Not by much, usually. It may start with just one word; then the sentence takes a slightly different spin, until eventually the rest of the completion drifts away.</p><p>What&#8217;s going on?</p><h2>Imperfect computations</h2><p>If we pretend an LLM is just a mathematical function, <code>temperature=0</code> should indeed make decoding deterministic. At each step, the model emits logits, we take the argmax token, append it to the context, and repeat. The problem is that real inference is performed with <strong>floating-point arithmetic</strong> on massively parallel hardware, usually on a server that is trying to be as fast as possible rather than mathematically pristine.</p><p>Floating-point arithmetic is only an approximation of real-number arithmetic. In particular, it is <strong>not associative</strong>: in ordinary math, <code>(a + b) + c = a + (b + c)</code> always holds, but with floating-point numbers those two expressions can produce slightly different results because each intermediate step is rounded. The same applies to matrix multiplications, reductions, and accumulations throughout a neural network. Change the order of operations, and you can change the last few bits of the result.</p><p>Usually, those differences are tiny and often irrelevant, but in this case they have an impact. If two candidate next tokens have very similar logits, a minute numerical difference can swap their order, and once one token changes, the next decoding step runs on a different prefix, so the divergence compounds. 
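</p><p>You can verify the non-associativity yourself with plain Python floats:</p>

```python
# The same three numbers, summed with two different groupings:
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```

<p>Scale this up to billions of order-dependent reductions per forward pass, and two &#8220;identical&#8221; runs can produce logits that differ in the last few bits.</p><p>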
The sampling rule is deterministic, while the computation that produced the logits is not guaranteed to be identical across runs.</p><p>You can think of it this way: <strong>sampling determinism</strong> is not the same thing as <strong>system determinism</strong>.</p><h2>It gets worse</h2><p>However, this is only part of the problem. You may already be objecting that running the same matrix multiplication on a GPU with the same data will, in fact, return bitwise-identical results every time, even though the computations are done in floating-point arithmetic and other jobs are surely running on the GPU at the same time. So why are those calculations repeatable, while LLM sampling with <code>temperature=0</code> is not?</p><p>In a recent post on Thinking Machines&#8217; blog, <em><a href="https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/">Defeating Nondeterminism in LLM Inference</a></em>, Horace He digs even deeper into the issue. It&#8217;s not merely that floating-point arithmetic is imperfect. Modern inference systems also need to batch requests together, and the result for one request can depend on the batch context in which it was executed. For a given exact batch, the forward pass may be deterministic. But from the user&#8217;s point of view, the system is still nondeterministic, because the batch itself is not stable from run to run. Your prompt may be identical, but the inputs that get batched together with yours are not.</p><p>This is also why a prompt can look stable in local testing and then become flaky in production: the model did not suddenly become more creative; it&#8217;s the system conditions that changed. <code>temperature=0</code> makes only the token selection rule deterministic. 
It does not guarantee that the entire inference system will produce exactly the same logits every time.</p><h2>Can it be fixed?</h2><p>The way LLM inference works today, especially at scale, doesn&#8217;t leave us with many options to enforce the conditions that can guarantee deterministic outputs. There are only trade-offs, and they differ quite a lot between hosted APIs and self-hosted inference.</p><h3>Fixed seeds</h3><p>To reduce randomness and make LLM outputs reproducible, some people recommend using a fixed seed, and indeed some providers expose one. OpenAI, for example, <a href="https://developers.openai.com/cookbook/examples/reproducible_outputs_with_the_seed_parameter">documents</a> a <code>seed</code> parameter and says it makes a best effort to sample deterministically, while explicitly warning that determinism is not guaranteed and that backend changes can still affect outputs. Their <code>system_fingerprint</code> field exists precisely so you can notice when the underlying serving configuration has changed.</p><p>The problem with fixed seeds is that they help reproduce results when the temperature is above zero, not when it&#8217;s already zeroed out. That&#8217;s because a fixed seed controls the randomness of the sampling step: by setting the temperature to zero, we are already removing that source of randomness, so the net result is identical with or without a fixed seed, while every other source of nondeterminism coming from the GPU and the rest of the stack is unaffected.</p><p>So fixed seeds are worth using when you are trying to get the same results for a call with non-zero temperature, such as for tests, demos, and regression checks. 
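</p><p>In code, that looks something like this illustrative OpenAI-style request payload (the model name is a placeholder; <code>seed</code> is the documented best-effort knob):</p>

```python
# Illustrative payload for a seeded, non-zero-temperature call.
# The model name is a placeholder, not a recommendation.
request = {
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Write a haiku about grep."}],
    "temperature": 0.7,
    "seed": 42,  # best-effort reproducibility, not a guarantee
}
# After each call, compare the response's `system_fingerprint`: if it
# changed, the serving backend changed and outputs may drift anyway.
```

<p>Seeded calls like this make tests and demos repeatable, while <code>system_fingerprint</code> tells you when the backend changed underneath you.</p><p>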
But you must keep in mind that they affect only the sampler, and they won&#8217;t help you when temperature is zero.</p><h3>No parallel jobs</h3><p>If you self-host, one option to drastically reduce randomness is to reduce or eliminate concurrency.</p><p>This works for the simple reason that it stabilizes batching and scheduling. vLLM&#8217;s <a href="https://docs.vllm.ai/en/latest/usage/reproducibility">reproducibility guidance</a> says that it does not guarantee reproducibility by default. In offline mode, you should disable multiprocessing to make scheduling deterministic, while in online mode, you need batch invariance support if you want outputs that are insensitive to batching. vLLM also documents batch invariance as a distinct feature and notes that it currently depends on specific hardware support.</p><p>This means that you can pick a few different configurations, depending on your needs:</p><ul><li><p>shared online serving with dynamic batching: fastest, cheapest, least reproducible</p></li><li><p>isolated worker / no concurrent jobs: slower, more expensive, more reproducible</p></li><li><p>specialized batch-invariant serving paths: better reproducibility, but with hardware and feature constraints</p></li></ul><p>The overall pattern is that the more you optimize for throughput, the more reproducibility suffers.</p><h3>Cache responses</h3><p>Caching doesn&#8217;t address the reproducibility issue itself, but in many applications it&#8217;s the right level of abstraction if you want the same input to produce the same output. It&#8217;s often not only the most viable option, but also the cheapest, simplest, and fastest, unless you&#8217;re running a benchmark or an evaluation.</p><p>In practice, if you just need the <em>same visible result</em> for the same request, the most reliable method is not to regenerate it at all. 
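</p><p>A minimal sketch of this idea (<code>call_llm</code> is a placeholder for whatever client function you actually use):</p>

```python
import hashlib
import json

_cache: dict = {}

def cached_complete(call_llm, model, prompt, **params):
    """Serve repeated identical requests from a cache.

    `call_llm` is a placeholder for your real client function; the cache
    key normalizes everything that defines the request.
    """
    raw = json.dumps({"model": model, "prompt": prompt, **params}, sort_keys=True)
    key = hashlib.sha256(raw.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = call_llm(model=model, prompt=prompt, **params)
    return _cache[key]
```

<p>The model can stay as nondeterministic as it likes; identical requests still return the stored response.</p><p>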
Normalize the prompt, model ID, and relevant parameters into a cache key, store the first successful response, and serve that on subsequent identical requests. This does not make the model deterministic, of course, but it does make your <em>application</em> deterministic at the interface boundary, which is usually what application builders need.</p><p>Caching also has a very nice advantage over seeds and scheduler tricks: it does not depend on hidden implementation details inside the inference stack.</p><p>Of course, caching has limits. It only helps when requests repeat, and it can become awkward if tool calls, timestamps, external retrieval, or hidden context make two apparently identical requests not truly identical. Still, it is usually far more convenient than any other solution to this problem, and the only practical one for most production systems.</p><h2>Conclusion</h2><p>When faced with LLM nondeterminism, the common reaction is to treat it like a bug and try to eliminate it. However, you should also keep in mind that LLMs were designed with built-in randomness for a reason: because they perform much better when they are allowed a slight degree of nondeterminism.</p><p>I get it: nobody likes having such a huge, random black box at the core of an application&#8217;s business logic. But removing randomness from the outputs is not the right way to manage an LLM&#8217;s behavior. If you need completely deterministic output, it is better to use the LLM to design a decision tree (or a more sophisticated model, if needed) and then use that in your application.</p><p>Handling LLM outputs is rather a matter of validation. Use schemas and validators so small textual drift does not break downstream code. Use evals instead of spot-checking. Cache where consistency matters, or where you need to save a few bucks. 
In other words, handle the randomness at the system boundary rather than trying to remove it from the model itself.</p>]]></content:encoded></item><item><title><![CDATA[Is grep really better than a vector DB?]]></title><description><![CDATA[Some agentic applications don&#8217;t use vector DBs for search. Is it a good idea?]]></description><link>https://zansara.substack.com/p/is-grep-really-better-than-a-vector</link><guid isPermaLink="false">https://zansara.substack.com/p/is-grep-really-better-than-a-vector</guid><dc:creator><![CDATA[Sara Z.]]></dc:creator><pubDate>Mon, 16 Mar 2026 12:35:47 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/4657c6e5-57f8-432c-a72b-4a2875679e4f_3404x1424.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is episode 7 of a series of shorter blog posts answering questions I received during the course of my work. They discuss common misconceptions and doubts about various generative AI technologies. You can find the whole series here: <a href="https://www.zansara.dev/series/practical-questions">Practical Questions</a>.</em></p><div><hr></div><p>For the past two years, the default architecture for giving LLMs access to a knowledge base has been <strong>RAG with vector databases</strong>.</p><p>This architecture turned out to be very powerful, but it&#8217;s far from cheap to set up and maintain: you need the system to chunk all the documents, embed all the chunks, store them in a vector DB, retrieve them, and feed them to the model. 
Every new document needs to go through this pipeline before it&#8217;s usable, and changes to a document already processed mean going through the vector DB and deleting all the affected chunks.</p><p>So it may come as a big surprise for many of us to learn that, as some people claim, an agent equipped with <code>grep</code> can find the data it&#8217;s looking for just as well.</p><p>For example, <a href="https://youtu.be/99Kxkemj1g8?t=4308">one of the many comparisons</a> that have been done on this topic found roughly these figures when answering questions about a Django codebase:</p><ol><li><p><strong>Vector search</strong> over embedded chunks achieved ~60% accuracy.</p></li><li><p><strong>Agentic search</strong> using tools like <code>grep</code>, <code>find</code>, and <code>cat</code>, where the model iteratively explores the repository, achieved ~68%.</p></li></ol><p>The same test on a TypeScript/Go codebase had the two approaches both reaching around ~70%.</p><p>The difference was <strong>cost and context</strong>: vector search consumed significantly more tokens. While it arguably provided more context to the agent, it&#8217;s not clear whether the context retrieved this way was more useful, and it&#8217;s easy to find contradictory results on the Internet.</p><p>So, what&#8217;s truly going on?</p><p>If I had to summarize it in one sentence, it would be: retrieval quality depends heavily on the domain and the structure of the information.</p><h2>Let&#8217;s check our assumptions</h2><p>The classic retrieval step in the RAG architecture assumes the query <strong>must be correct on the first try</strong>. Because of this, most of the effort and breakthroughs in this field focused on getting decent results for all possible queries.</p><p>Vector search shines at this. 
By retrieving the chunks of context that are semantically closest to the query (or to an answer to it), embedding-based retrieval was crucial for the one-shot retrieval typical of RAG apps.</p><p>However, agents remove that constraint. An agent can start with a query that is almost certainly subpar, then inspect the results, refine the query, and search again.</p><p>This iterative process dramatically reduces the weaknesses of keyword search and, in fact, leverages all its strengths: exact keyword retrieval is not an ideal task for a vector DB, because semantically similar keywords will also be included.</p><p>This does not mean vector search is useless in the age of agents.</p><h2>Domain Differences</h2><p>Retrieval strategies behave very differently depending on the domain. Let&#8217;s see a few examples to understand where one approach or the other shines.</p><h3>Code Search</h3><p>Code search is a perfect candidate for <strong>agentic keyword search</strong>, because identifiers are keywords that need exact matches in the results.</p><p>Vector search, while possible, has always been difficult to perform effectively on code. On top of that, there are tons of tools and techniques made for human coders to navigate a codebase with keywords, and agents can take advantage of those. For example, agents can use <code>grep</code>, AST search, symbol indexing, repository graphs, and more.</p><p>There&#8217;s also a problem with <strong>context fragmentation</strong>. Vector search returns chunks, which in the case of code search are usually a fixed number of lines or symbols. 
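</p><p>A toy example of why this hurts (the chunker below is deliberately naive, not any real library&#8217;s implementation):</p>

```python
def chunk_lines(text, size=5):
    """Naive fixed-size chunking: every `size` lines become one chunk,
    regardless of logical units such as functions."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]

source = (
    "def total(items):\n"
    "    result = 0\n"
    "    for item in items:\n"
    "        result += item.price\n"
    "    # discount logic\n"
    "    return result * 0.9\n"
)
chunks = chunk_lines(source, size=5)
print(len(chunks))  # 2 -- the function body is split across two chunks
```

<p>A retriever that returns only the first chunk shows the agent a function with no <code>return</code>; one that returns only the second shows a dangling expression with no context.</p><p>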
Most of the time this context is useless for the agent, because it rarely includes a full logical unit, and when it does (such as when chunking on function boundaries), it becomes much harder to retrieve, because the chunks are larger.</p><p>This means that not only is vector search less precise, but it also wastes a lot of context.</p><h3>How-to Guides / Knowledge Bases / General Prose</h3><p>This is the classic use case where <strong>vector search shines</strong> and keyword search is far less effective.</p><p>When your corpus of text is made of conceptual explanations, natural language queries, inconsistent terminology, and so on, semantic similarity is the most likely to bring up relevant results.</p><p>Even in this case, however, pure vector search usually gets beaten by hybrid approaches, such as running vector and keyword searches in parallel and then reranking the results.</p><p>You can find more about hybrid retrieval in <a href="https://www.zansara.dev/posts/2025-11-04-hybrid-retrieval/">this other post of mine</a>.</p><h3>Legal / Medical / Scientific Documents</h3><p>These sit somewhere in between. In these documents, the terminology is specialized, wording matters, and citations and sections are important. Vector search can surface relevant passages, but precision matters more than in the previous scenario. There&#8217;s more structure than in free-form prose, and you can&#8217;t lose it during the retrieval phase.</p><p>For these kinds of documents, hybrid approaches are necessary to avoid too many false positive matches.</p><h2>What should I pick?</h2><p>Choosing an approach usually depends on the use cases you foresee for your agent, but in practice it&#8217;s often difficult to know beforehand what kind of documents your agent will need to sift through. 
Even coding agents need to search the web and read technical documentation, for example.</p><p>In these situations, it&#8217;s best to avoid flattening the decision into a &#8220;keyword vs embeddings&#8221; choice. Your agent can make use of both of them and more. For example, if your agent must be able to search anything, you may give it:</p><ul><li><p>A vector DB for static, shapeless prose, for example internal knowledge bases, static &#8220;ground truth&#8221; documents, foundational data, etc. Even searching through messages on Slack and Teams may be a good fit for a vector DB.</p></li><li><p>Tools like <code>grep</code>, <code>cat</code>, <code>find</code>, etc. Let the agent leverage its coding skills for quick keyword searches across all the data. Don&#8217;t forget to make the data that&#8217;s available in the vector DB also accessible through these tools.</p></li><li><p>A simple BM25 index that can be searched for keyword matches when the results from the command-line tools are overwhelming for the agent.</p></li><li><p>A web search tool that the agent can use to complement its local search results, if applicable.</p></li></ul><p>... and so on.</p><h2>Conclusion</h2><p>Vector databases are not automatically the correct architecture. Neither is <code>grep</code>.</p><p>Before choosing, it is worth asking:</p><ul><li><p>Is the information to search through <strong>structural</strong> or <strong>semantic</strong>?</p></li><li><p>Do queries benefit from <strong>iteration</strong>? Will my agent be able to retry the search as many times as it wants?</p></li><li><p>Would a simple keyword search index solve most cases, or do I need to search by meaning?</p></li></ul><p>Sometimes the correct system architecture is a sophisticated hybrid embedding retriever. And sometimes it is still just <code>grep -R "the keyword"</code>.</p><p>The only way to know for sure is, as usual, a RAG evaluation pipeline. 
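</p><p>Even a tiny harness beats guessing. For example, a hit-rate check (both <code>retrieve</code> and the gold pairs are placeholders you would supply yourself):</p>

```python
def hit_rate(retrieve, gold_pairs):
    """Fraction of queries whose retrieved chunks contain the expected evidence.

    `retrieve(query)` -> list of text chunks; `gold_pairs` is a list of
    (query, must_contain) tuples. Both are stand-ins for your own system.
    """
    hits = sum(
        any(must_contain in chunk for chunk in retrieve(query))
        for query, must_contain in gold_pairs
    )
    return hits / len(gold_pairs)

# Run the same gold set against each retriever you are comparing:
gold = [("where is auth handled?", "def authenticate")]
print(hit_rate(lambda q: ["def authenticate(user):"], gold))  # 1.0
```

<p>Swap in your vector retriever, your BM25 index, and your <code>grep</code>-based tool, and the numbers will tell you which one your corpus actually favors.</p><p>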
Don&#8217;t forget to measure your outcomes!</p>]]></content:encoded></item><item><title><![CDATA[Phishing AI Agents]]></title><description><![CDATA[Most LLMs are hardened against classic prompt injection attacks. But AI agents also behave like naive humans sometimes...]]></description><link>https://zansara.substack.com/p/phishing-ai-agents</link><guid isPermaLink="false">https://zansara.substack.com/p/phishing-ai-agents</guid><dc:creator><![CDATA[Sara Z.]]></dc:creator><pubDate>Wed, 04 Mar 2026 08:05:04 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/8a7aab36-87e9-4241-8475-31232d9e44a5_1377x576.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This post is based on one of my talks, <a href="https://www.zansara.dev/talks/2026-02-25-mindstone-lisbon-meetup/">&#8220;Phishing AI Agents&#8221;</a>. Have a look at the <a href="https://community.mindstone.com/annotate/article_B3qZVBLeehDkqDzhj7">recording</a> of my presentation at Lisbon&#8217;s <a href="https://community.mindstone.com/events/mindstone-lisbon-february-ai-meetup">Mindstone AI Meetup</a> in February 2026, and check out the <a href="https://www.zansara.dev/talks/2026-02-25-mindstone-lisbon-meetup/">talk page</a> for slides and demo code.</em></p><div><hr></div><p>Lately, everyone is talking about deploying AI agents, but not many ask themselves what happens once those agents are out in the world.</p><p>We are used to thinking about phishing as a human problem: a person receives an unusual message, trusts it for some reason, and gives away something sensitive. But what happens when the target is not a person, but an AI agent? Can an agent be phished? And if so, what does that actually look like in practice?</p><h2>Useful agents are trusted agents</h2><p>AI agents are powerful precisely because they are trusted with access to many of our most private accounts. 
An agent may need to read email, access calendars, browse internal documentation, inspect private GitHub repositories, review tickets, or interact with SaaS tools and APIs. In other words, the agent must have both context and capability.</p><p>That also means it becomes a <strong>security boundary</strong>.</p><p>A common but weak assumption in many deployments is that if we tell the agent that some data is confidential, it will keep that data confidential. In practice, that is not a sufficient control. Once an agent is exposed to the wrong content under the wrong conditions, secrecy instructions alone do not reliably prevent leakage. But what counts as the wrong content and conditions? Is browsing the web the issue? Or the ability to interact with strangers? Is my air-gapped agent running on a dedicated Mac Mini secure?</p><h2>The &#8220;lethal trifecta&#8221;</h2><p>A useful way to think about this is through what <a href="https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/">some researchers</a> call the <strong>lethal trifecta</strong>. 
The term is not especially intuitive, but the idea is simple:</p><p>If an agent has:</p><ol><li><p>access to private data,</p></li><li><p>the ability to communicate externally, and</p></li><li><p>exposure to untrusted content,</p></li></ol><p>then your agent is <strong>vulnerable by design</strong>, and there is a path to data exfiltration.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Zx4W!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32804d3e-1a25-4878-aafe-2ff39511801f_2092x1046.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Zx4W!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32804d3e-1a25-4878-aafe-2ff39511801f_2092x1046.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Zx4W!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32804d3e-1a25-4878-aafe-2ff39511801f_2092x1046.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Zx4W!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32804d3e-1a25-4878-aafe-2ff39511801f_2092x1046.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Zx4W!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32804d3e-1a25-4878-aafe-2ff39511801f_2092x1046.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Zx4W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32804d3e-1a25-4878-aafe-2ff39511801f_2092x1046.jpeg" width="1456" height="728" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/32804d3e-1a25-4878-aafe-2ff39511801f_2092x1046.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:117418,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://zansara.substack.com/i/189821367?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32804d3e-1a25-4878-aafe-2ff39511801f_2092x1046.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Zx4W!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32804d3e-1a25-4878-aafe-2ff39511801f_2092x1046.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Zx4W!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32804d3e-1a25-4878-aafe-2ff39511801f_2092x1046.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Zx4W!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32804d3e-1a25-4878-aafe-2ff39511801f_2092x1046.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Zx4W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32804d3e-1a25-4878-aafe-2ff39511801f_2092x1046.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>The original definition of the &#8220;lethal trifecta&#8221; comes from <a href="https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/">Simon Willison&#8217;s blog</a>.</em></p><p>The exact exploit path may be simple or more complicated. It may take one try or many. But if those three conditions are present, the question is often not <em>whether</em> a leak is possible, but <em>when</em> it will happen.</p><p>This matters because many real agents satisfy these conditions almost by default. A very small agent that can read email, browse the web, and access a secret is already vulnerable.</p><p>That does not mean every such agent will be compromised immediately. 
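</p><p>The trifecta is easy to express as a checklist. The function below is just a reading aid I wrote for this post, not a real security control:</p>

```python
def lethal_trifecta(reads_private_data, communicates_externally, sees_untrusted_content):
    """True when all three conditions hold, i.e. an exfiltration path exists.

    Illustrative only: a real audit must inspect actual tools and data flows.
    """
    return reads_private_data and communicates_externally and sees_untrusted_content

# An email agent with web browsing and access to a secret:
print(lethal_trifecta(True, True, True))  # True -> vulnerable by design
```

<p>Removing any one of the three legs breaks the exfiltration path, which is why mitigations usually focus on isolating untrusted content or restricting outbound channels.</p><p>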
It does mean you should not assume that prompt instructions alone make it safe.</p><h2>Demo setup</h2><p>To make this concrete, I built a minimal demo in a controlled environment. You can find all the code I used to build this demo <a href="https://github.com/ZanSara/mindstone-lethal-trifecta-demo">here</a>.</p><p>The demo agent is implemented in <a href="https://n8n.io/">n8n</a> as a low-code workflow and it&#8217;s intentionally very simple:</p><ul><li><p>it receives chat input formatted as if it were email,</p></li><li><p>it&#8217;s powered by a modern frontier model, specifically GPT 5.2,</p></li><li><p>it has access to only one tool, HTTP GET, for web browsing,</p></li><li><p>it operates in a small local environment with a fake search engine, fake documentation pages, and a fake SaaS product.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wKDC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694ee5c9-1f6d-496b-88f4-8b2d9e320490_1395x680.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wKDC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694ee5c9-1f6d-496b-88f4-8b2d9e320490_1395x680.png 424w, https://substackcdn.com/image/fetch/$s_!wKDC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694ee5c9-1f6d-496b-88f4-8b2d9e320490_1395x680.png 848w, https://substackcdn.com/image/fetch/$s_!wKDC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694ee5c9-1f6d-496b-88f4-8b2d9e320490_1395x680.png 1272w, 
https://substackcdn.com/image/fetch/$s_!wKDC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694ee5c9-1f6d-496b-88f4-8b2d9e320490_1395x680.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wKDC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694ee5c9-1f6d-496b-88f4-8b2d9e320490_1395x680.png" width="1395" height="680" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/694ee5c9-1f6d-496b-88f4-8b2d9e320490_1395x680.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:680,&quot;width&quot;:1395,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:316681,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://zansara.substack.com/i/189821367?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694ee5c9-1f6d-496b-88f4-8b2d9e320490_1395x680.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wKDC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694ee5c9-1f6d-496b-88f4-8b2d9e320490_1395x680.png 424w, https://substackcdn.com/image/fetch/$s_!wKDC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694ee5c9-1f6d-496b-88f4-8b2d9e320490_1395x680.png 848w, 
https://substackcdn.com/image/fetch/$s_!wKDC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694ee5c9-1f6d-496b-88f4-8b2d9e320490_1395x680.png 1272w, https://substackcdn.com/image/fetch/$s_!wKDC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694ee5c9-1f6d-496b-88f4-8b2d9e320490_1395x680.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>The agent&#8217;s architecture in <a href="https://n8n.io/">n8n</a>.</em></p><p>Somewhere in this environment I also placed an 
attacker&#8217;s trap, and if the agent falls for it, we should receive the leaked credentials through a Telegram message.</p><p>The agent&#8217;s system prompt contains instructions and a few secrets, including an API key for our imaginary SaaS product. The prompt explicitly tells the agent not to share those credentials with anybody, and indeed, if you directly ask the agent for the API key, it refuses.</p><p>That is what many teams observe in testing, and it often creates false confidence. The agent appears aligned. It appears to understand that the credential is sensitive. But direct requests are the easy case: our demo is not about getting the model to share the credentials through an email, or some other form of prompt injection. We&#8217;re going to demonstrate an entirely different attack surface, something much more similar to regular phishing as conducted against human targets.</p><h2>A plausible support request</h2><p>The interesting failure mode appears when the attacker does not ask for the secret directly. Instead, they send something that looks like a normal support or troubleshooting request:</p><blockquote><p>I&#8217;m trying to call this endpoint on this SaaS API but I can&#8217;t get it to work. Can you send me a working example?</p></blockquote><p>This is exactly the kind of task a helpful agent is supposed to solve. 
So the agent does what a helpful agent would do:</p><ol><li><p>it searches for relevant documentation,</p></li><li><p>it follows documentation links,</p></li><li><p>it discovers API references or an OpenAPI spec,</p></li><li><p>it comes across a sandbox or example environment,</p></li><li><p>and it tries to produce a working example.</p></li></ol><p>And that is where the leak happens.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;a6cf2972-5641-4c96-976a-0cf667c81b80&quot;,&quot;duration&quot;:null}"></div><p>In the demo, the agent found documentation that pointed to a sandbox endpoint controlled by the attacker. The agent treated that documentation as legitimate, believed the sandbox was part of the normal workflow, and tested the integration using the real API key it had been given.</p><p>The result: the attacker received the credential.</p><p>Not because the agent was asked to disclose it, but because the agent was induced to <strong>use</strong> it in the wrong place.</p><p>That distinction matters. Many defenses focus on preventing explicit disclosure. Real attacks often succeed by steering the agent into operational misuse instead.</p><h2>How does the trick work?</h2><p>The attack depends on a simple fact: the agent is willing to treat some external content as trustworthy enough to act on.</p><p>If attacker-controlled documentation, examples, links, or sandboxes are interpreted as valid guidance, then the agent can be manipulated into doing work on the attacker&#8217;s infrastructure. 
Once that happens, any credential it uses may be exposed.</p><p>Usually, a developer won&#8217;t use a real API key on a sandbox system randomly found on the web, but an agent, being more naive, will try it out.</p><p>Seen in this light, this is a phishing attack:</p><ul><li><p>The email to the agent is the lure.</p></li><li><p>The malicious documentation is the fake login page.</p></li><li><p>The sandbox is where the credential gets captured.</p></li></ul><p>And critically, a limited toolset does not save you. Restricting an agent to HTTP GET does <strong>not</strong> eliminate prompt-injection or phishing-style risk. If the agent can fetch attacker-controlled content and then use secrets in a way that causes outbound requests, that can be enough.</p><h2>In the real world</h2><p>A common reaction is that this sounds artificial: surely a fake documentation page will never outrank the real docs, and an agent will be able to tell the difference, right? Surely the agent will not fall for something that naive.</p><p>That objection misses two things. First, <strong>attackers have time</strong>. They can try many variants, test against many products, and refine their lures, and with the help of modern LLMs, setting up a trap like this takes an hour at most, and can be automated to a large degree. Second, the search engine is not even necessary: an attacker can send the link directly by email, ticket, document, chat message, or issue comment. If the agent consumes the content and treats it as actionable, that may be enough.</p><p>Also, the leak only needs to happen once.</p><p>A stolen API key does not announce itself! If the attacker uses it quietly, the victim may not notice for some time. That makes one-time leakage operationally serious even if the exploit is intermittent.</p><h2>What to do about it</h2><p>The real problem is that there is no complete, simple fix today. 
If your architecture satisfies the lethal trifecta, you should assume residual exfiltration risk remains. That said, some mitigations are still worth applying to reduce your attack surface.</p><h3>1. Use disposable, low-privilege credentials</h3><p>Do not give agents credentials you cannot afford to lose.</p><p>Prefer:</p><ul><li><p>narrowly scoped API keys,</p></li><li><p>short-lived credentials,</p></li><li><p>credentials and keys that are easy to rotate,</p></li><li><p>strong isolation between environments,</p></li><li><p>permissions minimized to exactly what the agent needs.</p></li></ul><p>If a key leaks, recovery should be operationally manageable and should not bankrupt you.</p><h3>2. Monitor and review credential use</h3><p>If agents are using secrets, their activity should be observable. If your agent is exposed and has access to some sensitive credentials, you should consider setting up:</p><ul><li><p>usage logs,</p></li><li><p>anomaly detection,</p></li><li><p>per-agent attribution,</p></li><li><p>alerts for unusual destinations or access patterns,</p></li><li><p>rapid revocation workflows.</p></li></ul><p>This is not perfect, but it reduces dwell time after compromise.</p><h3>3. Red-team the agent continuously</h3><p>Static evaluation is not enough.</p><p>When you discover a new attack pattern, test it against your own deployment. If your agent reads email, test adversarial email. If it reads docs, test malicious docs. If it uses APIs, test whether it can be tricked into authenticating to the wrong place. Then assess the damage, reinforce the prompt guardrails, and tighten whitelists, but also improve your own process for recovery from the type of leak you observed, because no safeguard is 100% secure today.</p><p>Treat agent security as an ongoing adversarial exercise, not a one-time review.</p><h3>4. Improve prompts</h3><p>Stronger system prompts can help. 
You can include examples of prompt injection, tool misuse, credential theft, malicious links, and suspicious documentation patterns. However, this should be treated as one layer, not as a primary guarantee. Prompting can reduce some classes of failure, but does not remove the inherent risk.</p><h2>Architectural defenses</h2><p>The more serious defenses are architectural.</p><p>A promising direction is to <strong>separate the model that reads untrusted content from the model or system that has tool access and secrets</strong>. In other words, do not let the same component both interpret attacker-controlled input and directly act with privileged credentials. This kind of separation reduces the chance that malicious instructions flow directly from content ingestion into secret-bearing tool use.</p><p>One example of this broader design direction is the <a href="https://simonwillison.net/2025/Apr/11/camel/">CaMeL</a> approach, where responsibilities are split across components with different trust assumptions. That area is still early, and there are not yet many mature production implementations, but it points toward a more defensible model than &#8220;one agent does everything.&#8221;</p><h1>Conclusion</h1><p>The state of the art is improving, but most current agent implementations still do not adequately account for these threats. If you are deploying agents into real workflows, especially workflows that combine private data, external communication, and untrusted content, then you need to be much more careful than most demos and product pages suggest.</p><p>The problem is not just that an agent might say the wrong thing to the wrong person. The problem is that a capable agent can be manipulated into <strong>doing the wrong thing with your secrets</strong>.</p><p>That is what phishing an AI agent looks like. 
If you are building or deploying agents today, assume they are vulnerable to this class of attack.</p>]]></content:encoded></item><item><title><![CDATA[How does LLM memory work?]]></title><description><![CDATA[All LLMs can keep track of a short conversation. But how do they remember things long-term?]]></description><link>https://zansara.substack.com/p/how-does-llm-memory-work</link><guid isPermaLink="false">https://zansara.substack.com/p/how-does-llm-memory-work</guid><dc:creator><![CDATA[Sara Z.]]></dc:creator><pubDate>Wed, 04 Feb 2026 13:19:28 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/8a41a8f5-d3a7-4b49-b69c-4a77d11cc2e0_2108x883.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is episode 6 of a series of shorter blog posts answering questions I received during the course of my work, questions that reflect common misconceptions and doubts about various generative AI technologies. You can find the whole series here: <a href="https://www.zansara.dev/series/practical-questions">Practical Questions</a>.</em></p><div><hr></div><p>People often talk about an LLM &#8220;remembering&#8221; (or more often &#8220;forgetting&#8221;) things. But how is that possible? LLMs are stateless algorithms that don&#8217;t inherently have the ability to &#8220;remember&#8221; anything they see after their training is over. They don&#8217;t have anything like databases, caches, or logs. At inference time, LLMs produce the next token based only on their trained parameters and whatever text you include in the current request.</p><p>So what is &#8220;memory&#8221; in the context of LLM inference?</p><h2>The chat history</h2><p>When you&#8217;re having a conversation with an LLM, the LLM does not remember what you&#8217;ve said in your previous messages. 
Every time it needs to generate a new token it <strong>re-reads everything</strong> that happened in the conversation so far, plus everything it has generated up to that point, to be able to decide what&#8217;s the most likely next token. LLMs don&#8217;t have any internal state: everything is recomputed from scratch for each output token.</p><blockquote><p><em>&#128161; Methods exist to reduce the time complexity of LLM inference, mostly in the form of smart caching techniques (usually called <a href="https://www.zansara.dev/posts/2025-10-17-prompt-caching/">prompt caching</a>), but that&#8217;s a story <a href="https://www.zansara.dev/posts/2025-10-23-kv-caching/">for another blog post</a>.</em></p></blockquote><p>This means that the chat history is not part of the LLM, but it&#8217;s <strong>managed by the application built on top of it</strong>. It&#8217;s the app&#8217;s responsibility to store the chat history across turns and send it back to the LLM each time the user adds a new message to it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cowU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F951ac857-085a-48c9-a2e3-0641c82d4b69_2400x1821.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cowU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F951ac857-085a-48c9-a2e3-0641c82d4b69_2400x1821.png 424w, https://substackcdn.com/image/fetch/$s_!cowU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F951ac857-085a-48c9-a2e3-0641c82d4b69_2400x1821.png 848w, 
https://substackcdn.com/image/fetch/$s_!cowU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F951ac857-085a-48c9-a2e3-0641c82d4b69_2400x1821.png 1272w, https://substackcdn.com/image/fetch/$s_!cowU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F951ac857-085a-48c9-a2e3-0641c82d4b69_2400x1821.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cowU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F951ac857-085a-48c9-a2e3-0641c82d4b69_2400x1821.png" width="1456" height="1105" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/951ac857-085a-48c9-a2e3-0641c82d4b69_2400x1821.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1105,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!cowU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F951ac857-085a-48c9-a2e3-0641c82d4b69_2400x1821.png 424w, https://substackcdn.com/image/fetch/$s_!cowU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F951ac857-085a-48c9-a2e3-0641c82d4b69_2400x1821.png 848w, 
https://substackcdn.com/image/fetch/$s_!cowU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F951ac857-085a-48c9-a2e3-0641c82d4b69_2400x1821.png 1272w, https://substackcdn.com/image/fetch/$s_!cowU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F951ac857-085a-48c9-a2e3-0641c82d4b69_2400x1821.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The storage of the chat history is the simplest implementation of what &#8220;memory&#8221; means for an LLM. 
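</p><p><em>As a sketch of this app-side bookkeeping (with <code>call_llm</code> as a hypothetical stand-in for any chat-completion API, since the exact client is not the point):</em></p>

```python
# Sketch of app-managed chat memory. The model is stateless, so the app
# keeps the transcript and resends ALL of it with every new user message.
# `call_llm` is a hypothetical stand-in for a real chat-completion API.

def call_llm(messages: list[dict]) -> str:
    # A real implementation would call an LLM here; we just report how
    # much context the model would actually receive.
    return f"(model saw {len(messages)} messages)"

history = [{"role": "system", "content": "You are a helpful assistant."}]

def chat(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    reply = call_llm(history)  # the whole history goes out every single time
    history.append({"role": "assistant", "content": reply})
    return reply
```

<p>On every turn the full <code>history</code> list is serialized and sent again; nothing persists inside the model itself.</p><p>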
We can call it <strong>short-term memory</strong> and it allows the LLM to have a coherent conversation for many turns.</p><p>However, this approach has a limit: the length of the conversation.</p><h2>The context window</h2><p>LLMs can only process a fixed maximum amount of text at once. This limit is called the <strong>context window</strong> and includes both the user&#8217;s input (which in turn includes all the chat history up to that point) and the output tokens the LLM is generating. This is an unavoidable limitation of the architecture of Transformer-based LLMs (which includes all the LLMs you&#8217;re likely to ever come across).</p><p>So, what happens when the context window fills up? In short, the <strong>LLM will crash</strong>.</p><p>To prevent a hard system crash, various LLM applications handle context window overflows differently. The two most basic approaches are:</p><ol><li><p><strong>Hard failure (common in APIs):</strong> If you exceed the model&#8217;s context window, the request fails.</p></li><li><p><strong>Truncation/sliding window (common in chat apps):</strong> The application drops older parts of the conversation so the latest turns fit. This means that for each new token you or the LLM add to the chat, an older token disappears from the history, and the LLM &#8220;forgets&#8221; it. 
In practice, during a conversation this may look like the LLM forgetting older topics of conversation, or losing sight of its original goal, or forgetting the system prompt and other custom instructions you might have given at the start of the chat.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!t0rQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78df0b73-0fb2-4fe7-ba36-542990b5bef5_2400x1821.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!t0rQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78df0b73-0fb2-4fe7-ba36-542990b5bef5_2400x1821.png 424w, https://substackcdn.com/image/fetch/$s_!t0rQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78df0b73-0fb2-4fe7-ba36-542990b5bef5_2400x1821.png 848w, https://substackcdn.com/image/fetch/$s_!t0rQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78df0b73-0fb2-4fe7-ba36-542990b5bef5_2400x1821.png 1272w, https://substackcdn.com/image/fetch/$s_!t0rQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78df0b73-0fb2-4fe7-ba36-542990b5bef5_2400x1821.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!t0rQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78df0b73-0fb2-4fe7-ba36-542990b5bef5_2400x1821.png" width="1456" height="1105" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/78df0b73-0fb2-4fe7-ba36-542990b5bef5_2400x1821.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1105,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!t0rQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78df0b73-0fb2-4fe7-ba36-542990b5bef5_2400x1821.png 424w, https://substackcdn.com/image/fetch/$s_!t0rQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78df0b73-0fb2-4fe7-ba36-542990b5bef5_2400x1821.png 848w, https://substackcdn.com/image/fetch/$s_!t0rQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78df0b73-0fb2-4fe7-ba36-542990b5bef5_2400x1821.png 1272w, https://substackcdn.com/image/fetch/$s_!t0rQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78df0b73-0fb2-4fe7-ba36-542990b5bef5_2400x1821.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>However, both of these are just patches over the fundamental problem that LLMs can&#8217;t remember more than the content of their context window. How do we get around that to achieve long-term memory?</p><h2>LLM memory is context engineering</h2><p>Making LLMs able to remember very long conversations is a <strong>context engineering</strong> problem: the science of choosing what to put in the LLM&#8217;s context window at each inference pass. The context window is a limited resource, and the best LLM applications out there usually shine due to their superior approach to context engineering. The more you can compress the right information into the smallest possible context, the faster, better and cheaper your AI system will be.</p><p>In the case of long-term memory, the core of the problem is choosing what to remember and how to make it fit into the context window. 
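</p><p><em>The simplest such policy is the sliding window described in the previous section. A sketch, using character counts as a crude stand-in for a real token budget:</em></p>

```python
# Sketch of sliding-window truncation: pin the system prompt, then drop
# the oldest turns until everything fits the budget. A real implementation
# would measure tokens with the model's tokenizer, not characters.

def fit_context(messages: list[dict], budget_chars: int) -> list[dict]:
    system, turns = messages[0], list(messages[1:])

    def size(msgs: list[dict]) -> int:
        return sum(len(m["content"]) for m in msgs)

    while turns and size([system] + turns) > budget_chars:
        turns.pop(0)  # the oldest turn is the first to be "forgotten"
    return [system] + turns
```

<p>The system prompt is deliberately pinned here; naive truncation that lets it fall out of the window is exactly what makes a chatbot &#8220;forget&#8221; its instructions.</p><p>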
There are three common approaches: <strong>summarization</strong>, <strong>scratchpad/state</strong>, and <strong>RAG</strong>. These are not mutually exclusive; you can mix and match them as needed.</p><h3>Summarization</h3><p>In the case of summarization-style memory, the idea is to &#8220;compress the past&#8221; to make it fit the context window. You keep recent messages verbatim, but you also maintain a rolling summary of older conversations and/or older messages in the same conversation. When the chat gets long, you update the summary and discard raw older turns.</p><p>This is a pragmatic fit for simple chatbots: most users don&#8217;t expect perfect recall, but are happy with an LLM that sort of remembers a summary of what they talked about in the past. It&#8217;s also rather cheap and very simple to implement, which makes it a perfect fit for a quick, initial implementation.</p><p>The main issue with summarization memory is that LLMs often don&#8217;t know what details must be remembered and what can be discarded, so they&#8217;re likely to forget some important details, which might frustrate the users.</p><p>In short, summarization memory achieves something very like human memory: infinitely compressible but likely to lose details in arbitrary ways. 
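</p><p><em>A sketch of the mechanism (here <code>summarize</code> is a trivial stand-in; in a real system it would be another LLM call along the lines of &#8220;update this summary with these messages&#8221;):</em></p>

```python
# Sketch of summarization memory: keep the last few turns verbatim and
# fold everything older into a rolling summary string.

KEEP_VERBATIM = 4  # how many recent turns survive untouched

def summarize(previous_summary: str, dropped_turns: list[str]) -> str:
    # Stand-in: a real version would ask an LLM to merge these into prose.
    return (previous_summary + " " + " ".join(dropped_turns)).strip()

def compress(summary: str, turns: list[str]) -> tuple[str, list[str]]:
    """Return the updated rolling summary and the turns kept verbatim."""
    if len(turns) <= KEEP_VERBATIM:
        return summary, turns
    older, recent = turns[:-KEEP_VERBATIM], turns[-KEEP_VERBATIM:]
    return summarize(summary, older), recent
```

<p>Whatever lands in <code>older</code> survives only through the summary, which is exactly where important details can silently disappear.</p><p>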
This works for role-playing chatbots for example, but not for personal assistants that are supposed to remember everything perfectly.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9Y5y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb60ef13b-a560-49ab-af5f-d94e1df180af_2400x2100.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9Y5y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb60ef13b-a560-49ab-af5f-d94e1df180af_2400x2100.png 424w, https://substackcdn.com/image/fetch/$s_!9Y5y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb60ef13b-a560-49ab-af5f-d94e1df180af_2400x2100.png 848w, https://substackcdn.com/image/fetch/$s_!9Y5y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb60ef13b-a560-49ab-af5f-d94e1df180af_2400x2100.png 1272w, https://substackcdn.com/image/fetch/$s_!9Y5y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb60ef13b-a560-49ab-af5f-d94e1df180af_2400x2100.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9Y5y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb60ef13b-a560-49ab-af5f-d94e1df180af_2400x2100.png" width="1456" height="1274" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b60ef13b-a560-49ab-af5f-d94e1df180af_2400x2100.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1274,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!9Y5y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb60ef13b-a560-49ab-af5f-d94e1df180af_2400x2100.png 424w, https://substackcdn.com/image/fetch/$s_!9Y5y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb60ef13b-a560-49ab-af5f-d94e1df180af_2400x2100.png 848w, https://substackcdn.com/image/fetch/$s_!9Y5y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb60ef13b-a560-49ab-af5f-d94e1df180af_2400x2100.png 1272w, https://substackcdn.com/image/fetch/$s_!9Y5y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb60ef13b-a560-49ab-af5f-d94e1df180af_2400x2100.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Scratchpad</h3><p>In order to overcome the failings of human memory, people use post-its and notebooks to store important details that can&#8217;t be forgotten. Turns out that LLMs can do this too! 
This is called the <strong>scratchpad / state</strong> approach: the LLM is put in charge of maintaining a small, structured &#8220;state&#8221; that represents what the assistant should not forget, such as user preferences, current goals, open tasks, to-do lists, key decisions, definitions and terminology agreed upon, and more.</p><p>This approach can be implemented in two ways:</p><ul><li><p>by giving a scratchpad tool to the LLM, which can choose to write, edit or delete its content at any time,</p></li><li><p>by having a separate LLM regularly review the conversation and populate the scratchpad.</p></li></ul><p>In either case, the scratchpad content is then added to the conversation history (for example in the system prompt or in another dedicated section) and older conversation messages are dropped.</p><p>This approach is far more controllable than summaries, because the LLM can be instructed carefully about what is critical to remember and how to save it into the scratchpad. 
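</p><p>As a minimal sketch of the tool variant, a scratchpad could look like the snippet below. All names are illustrative, and a real implementation would expose the write and delete methods to the model as tool calls:</p>

```python
# Minimal in-memory scratchpad; in the "tool" variant, write/delete would
# be exposed to the LLM as tool calls. All names are illustrative.
class Scratchpad:
    def __init__(self):
        self.entries = {}  # key -> note, e.g. "user_prefs" -> "metric units"

    def write(self, key, note):
        self.entries[key] = note

    def delete(self, key):
        self.entries.pop(key, None)

    def render(self):
        # This string gets injected into the system prompt on every turn.
        if not self.entries:
            return ""
        notes = "\n".join(f"- {k}: {v}" for k, v in self.entries.items())
        return "Things to remember:\n" + notes

pad = Scratchpad()
pad.write("user_prefs", "prefers answers in French")
pad.write("open_task", "draft the Q3 report outline")
```

<p>The rendered scratchpad is then injected into the prompt on every turn, while older conversation messages can be dropped.</p><p>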
Better still, the users themselves can be allowed to read and edit the scratchpad to check what the LLM remembers, add more information, or even correct errors.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pYKh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2253ae1e-7e30-4f8f-9dc6-16c440c7bd49_2400x1686.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pYKh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2253ae1e-7e30-4f8f-9dc6-16c440c7bd49_2400x1686.png 424w, https://substackcdn.com/image/fetch/$s_!pYKh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2253ae1e-7e30-4f8f-9dc6-16c440c7bd49_2400x1686.png 848w, https://substackcdn.com/image/fetch/$s_!pYKh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2253ae1e-7e30-4f8f-9dc6-16c440c7bd49_2400x1686.png 1272w, https://substackcdn.com/image/fetch/$s_!pYKh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2253ae1e-7e30-4f8f-9dc6-16c440c7bd49_2400x1686.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pYKh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2253ae1e-7e30-4f8f-9dc6-16c440c7bd49_2400x1686.png" width="1456" height="1023" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2253ae1e-7e30-4f8f-9dc6-16c440c7bd49_2400x1686.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1023,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!pYKh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2253ae1e-7e30-4f8f-9dc6-16c440c7bd49_2400x1686.png 424w, https://substackcdn.com/image/fetch/$s_!pYKh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2253ae1e-7e30-4f8f-9dc6-16c440c7bd49_2400x1686.png 848w, https://substackcdn.com/image/fetch/$s_!pYKh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2253ae1e-7e30-4f8f-9dc6-16c440c7bd49_2400x1686.png 1272w, https://substackcdn.com/image/fetch/$s_!pYKh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2253ae1e-7e30-4f8f-9dc6-16c440c7bd49_2400x1686.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>RAG Memory</h3><p>But what if the scratchpad itself becomes huge and occupies a large share of the context window, or even overflows it? For agents that need to take on huge tasks (for example coding agents and deep research systems) the scratchpad approach might not be enough.</p><p>In this case we can start to treat memory as yet another data source and perform RAG over the scratchpad and/or the conversation history, stored in a vector DB and indexed regularly.</p><p>The advantage of RAG memory is that you can reuse all the well-known RAG patterns, with the only difference that the content to be retrieved is the chat history itself and/or the LLM&#8217;s notes.</p><p>However, RAG memory suffers from the shortcomings of retrieval: since no retrieval pipeline is ever perfect, you can&#8217;t expect perfect recall. You&#8217;ll have to pay attention to the quality of the memory retrieval, evaluate it carefully and regularly, and so on. 
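</p><p>To make the idea concrete, here is a deliberately naive sketch of RAG memory that scores past messages by word overlap. A real system would use embeddings and a vector DB, and all names here are illustrative:</p>

```python
import re

# Naive RAG memory: index past chat messages, retrieve by word overlap.
# A real system would use embeddings and a vector DB; this is illustrative.
def tokenize(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

class MemoryIndex:
    def __init__(self):
        self.messages = []

    def add(self, message):
        self.messages.append(message)

    def retrieve(self, query, k=2):
        # Score every stored message by the words it shares with the query.
        scored = [(len(tokenize(query) & tokenize(m)), m) for m in self.messages]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [m for score, m in scored[:k] if score > 0]

memory = MemoryIndex()
memory.add("User: our brand palette is teal and charcoal")
memory.add("User: deploys happen every Friday")
hits = memory.retrieve("what colors are in the brand palette?")
```

<p>The top-scoring messages are then inserted into the prompt as memory context.</p><p>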
This adds a new dimension to your agent&#8217;s evaluation and, in general, quite a bit of complexity.</p><p>You may also run into a problem that&#8217;s unique to RAG memory: <strong>context stuffing</strong>. Context stuffing is the presence of retrieved context snippets that look like prompts: they can confuse the LLM into following the instructions contained in the retrieved snippet instead of the user&#8217;s instructions.</p><p>While context stuffing can happen with malicious context snippets in regular RAG, it&#8217;s also very likely to happen accidentally when implementing RAG-based memory that searches the chat history directly. This happens because all the retrieved snippets were indeed user prompts in the past! In this case, it&#8217;s essential to make sure that the prompt clearly identifies the retrieved snippets as context, not instructions.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mE3j!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167135eb-c6eb-45d1-b1e3-af8ff68ecf4f_2400x1761.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mE3j!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167135eb-c6eb-45d1-b1e3-af8ff68ecf4f_2400x1761.png 424w, https://substackcdn.com/image/fetch/$s_!mE3j!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167135eb-c6eb-45d1-b1e3-af8ff68ecf4f_2400x1761.png 848w, 
https://substackcdn.com/image/fetch/$s_!mE3j!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167135eb-c6eb-45d1-b1e3-af8ff68ecf4f_2400x1761.png 1272w, https://substackcdn.com/image/fetch/$s_!mE3j!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167135eb-c6eb-45d1-b1e3-af8ff68ecf4f_2400x1761.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mE3j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167135eb-c6eb-45d1-b1e3-af8ff68ecf4f_2400x1761.png" width="1456" height="1068" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/167135eb-c6eb-45d1-b1e3-af8ff68ecf4f_2400x1761.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1068,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!mE3j!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167135eb-c6eb-45d1-b1e3-af8ff68ecf4f_2400x1761.png 424w, https://substackcdn.com/image/fetch/$s_!mE3j!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167135eb-c6eb-45d1-b1e3-af8ff68ecf4f_2400x1761.png 848w, 
https://substackcdn.com/image/fetch/$s_!mE3j!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167135eb-c6eb-45d1-b1e3-af8ff68ecf4f_2400x1761.png 1272w, https://substackcdn.com/image/fetch/$s_!mE3j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167135eb-c6eb-45d1-b1e3-af8ff68ecf4f_2400x1761.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Conclusion</h2><p>That&#8217;s it! 
With any of these three approaches, your LLM-based application is now able to remember things long-term.</p><p>However, don&#8217;t forget that the moment you add memory to your LLM-powered application, you&#8217;re now <strong>storing user data</strong>, with all the problems that this brings. You will need to take care of retention and of user control over the memorized data, you&#8217;ll be storing PII and secrets, and in many cases this process needs to be compliant with whatever data retention policy you may be subject to.</p><div><hr></div>]]></content:encoded></item><item><title><![CDATA[From RAG to AI Agent]]></title><description><![CDATA[A step-by-step guide to transform your RAG pipelines into effective AI 
agents.]]></description><link>https://zansara.substack.com/p/from-rag-to-ai-agent</link><guid isPermaLink="false">https://zansara.substack.com/p/from-rag-to-ai-agent</guid><dc:creator><![CDATA[Sara Z.]]></dc:creator><pubDate>Thu, 15 Jan 2026 18:03:06 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/54e57a2f-2697-401e-8f15-fd9d4937b9d6_2002x837.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Note: If you&#8217;re interested in this topic, I&#8217;ll hold a workshop with a real implementation of all the steps at the virtual <a href="https://www.summit.ai/">Agentic AI Summit</a> on the 21st of January, 2026. The code will be released afterwards, so stay tuned!</em></p><div><hr></div><p>2025 was the year of LLM reasoning. Most LLM providers focused on improving the ability of their LLMs to reason, make decisions, and carry out long-horizon tasks with the least possible amount of human intervention. RAG pipelines, so hyped in the last couple of years, are now a thing of the past: the focus has shifted to AI agents, a term that only recently seems to have acquired a <a href="https://simonwillison.net/2025/Sep/18/agents/">relatively well-defined meaning</a>:</p><blockquote><p>An LLM agent runs tools in a loop to achieve a goal.</p></blockquote><p>While simple, at first glance this concept might seem very far from RAG. But is it?</p><p>In this post I want to show you how you can extend your RAG pipelines step by step to become agents without having to throw away everything you&#8217;ve built so far. In fact, if you have a very good RAG system today, your future agents are bound to have great research skills right away. You may even find that you&#8217;re already halfway through the process of converting your pipeline into an agent without knowing it.</p><p>Let&#8217;s see how it&#8217;s done.</p><h3>1. 
Start from basic RAG</h3><p>Our starting point, what&#8217;s usually called &#8220;basic RAG&#8221; to distinguish it from more advanced RAG implementations, is a system with a retrieval step (be it vector-based, keyword-based, web search, hybrid, or anything else) that occurs every time the user sends a message to an LLM. Its architecture might look like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xL8M!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22971807-2ff0-4c60-8ec6-67b8dabf8465_2400x2058.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xL8M!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22971807-2ff0-4c60-8ec6-67b8dabf8465_2400x2058.png 424w, https://substackcdn.com/image/fetch/$s_!xL8M!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22971807-2ff0-4c60-8ec6-67b8dabf8465_2400x2058.png 848w, https://substackcdn.com/image/fetch/$s_!xL8M!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22971807-2ff0-4c60-8ec6-67b8dabf8465_2400x2058.png 1272w, https://substackcdn.com/image/fetch/$s_!xL8M!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22971807-2ff0-4c60-8ec6-67b8dabf8465_2400x2058.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xL8M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22971807-2ff0-4c60-8ec6-67b8dabf8465_2400x2058.png" width="1456" height="1249" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/22971807-2ff0-4c60-8ec6-67b8dabf8465_2400x2058.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1249,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!xL8M!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22971807-2ff0-4c60-8ec6-67b8dabf8465_2400x2058.png 424w, https://substackcdn.com/image/fetch/$s_!xL8M!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22971807-2ff0-4c60-8ec6-67b8dabf8465_2400x2058.png 848w, https://substackcdn.com/image/fetch/$s_!xL8M!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22971807-2ff0-4c60-8ec6-67b8dabf8465_2400x2058.png 1272w, https://substackcdn.com/image/fetch/$s_!xL8M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22971807-2ff0-4c60-8ec6-67b8dabf8465_2400x2058.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Systems with more than one retriever and/or a reranker step also fall under this category. What&#8217;s crucial to distinguish basic RAG from more &#8220;agentic&#8221; versions of it is the fact that the retrieval step runs <em>on every user message</em> and that <em>the user message is fed directly to the retriever</em>.</p><h3>2. Add Query Rewrite</h3><p>The first major step towards agentic behavior is the query rewrite step. 
RAG pipelines with query rewrite don&#8217;t send the user&#8217;s message directly to the retriever, but <strong>rewrite it</strong> to improve the outcomes of the retrieval.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FhZj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd798fcd8-bd87-4891-9e43-53ced4a8f344_2400x2739.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FhZj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd798fcd8-bd87-4891-9e43-53ced4a8f344_2400x2739.png 424w, https://substackcdn.com/image/fetch/$s_!FhZj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd798fcd8-bd87-4891-9e43-53ced4a8f344_2400x2739.png 848w, https://substackcdn.com/image/fetch/$s_!FhZj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd798fcd8-bd87-4891-9e43-53ced4a8f344_2400x2739.png 1272w, https://substackcdn.com/image/fetch/$s_!FhZj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd798fcd8-bd87-4891-9e43-53ced4a8f344_2400x2739.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FhZj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd798fcd8-bd87-4891-9e43-53ced4a8f344_2400x2739.png" width="1456" height="1662" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d798fcd8-bd87-4891-9e43-53ced4a8f344_2400x2739.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1662,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!FhZj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd798fcd8-bd87-4891-9e43-53ced4a8f344_2400x2739.png 424w, https://substackcdn.com/image/fetch/$s_!FhZj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd798fcd8-bd87-4891-9e43-53ced4a8f344_2400x2739.png 848w, https://substackcdn.com/image/fetch/$s_!FhZj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd798fcd8-bd87-4891-9e43-53ced4a8f344_2400x2739.png 1272w, https://substackcdn.com/image/fetch/$s_!FhZj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd798fcd8-bd87-4891-9e43-53ced4a8f344_2400x2739.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Query rewrite is a bit of a double-edged sword. In some cases it may make your RAG pipeline less reliable, because the LLM may misunderstand your intent and query the retriever with an unexpected prompt. It also introduces a delay, as there is one more round trip to the LLM to make. However, a well-implemented query rewrite step has a huge impact on <strong>follow-up questions</strong>.</p><p>Think about a conversation like this:</p><blockquote><p>User: What do the style guidelines say about the use of colors on our website?</p><p>Assistant: The style guidelines say that all company websites should use a specific palette made of these colors: ....</p><p>User: Why?</p></blockquote><p>The first question from the user is clear and detailed, so retrieval would probably return relevant results regardless of whether the query gets rewritten or not. 
However, the second question has far too little information to make sense on its own: sending the string &#8220;Why?&#8221; to a retriever is bound to return only garbage results, which may make the LLM respond with something unexpected (and likely wrong).</p><p>In this case, query rewrite fixes the issue by expanding the &#8220;Why?&#8221; into a more reasonable query, such as &#8220;What&#8217;s the reason the company mandated a specific color palette?&#8221; or &#8220;Rationale behind the company&#8217;s brand color palette selection&#8221;. This query helps the retriever find the type of information that&#8217;s actually relevant and provide good context for the answer.</p><h3>3. Optional Retrieval</h3><p>Once query rewrite is in place, the next step is to give the pipeline some very basic decision-making power. Specifically, I&#8217;m talking about <strong>skipping retrieval</strong> when it&#8217;s not necessary.</p><p>Think about a conversation like this:</p><blockquote><p>User: What do the style guidelines say about the use of colors on our website?</p><p>Assistant: The style guidelines say that all company websites should use a specific palette made of these colors: ....</p><p>User: List the colors as a table.</p></blockquote><p>In this case, the LLM needs no additional context to do what the user asks: it&#8217;s actually better to skip retrieval, to save time and resources and to avoid retrieval failures that might confuse the LLM (such as the retriever bringing up irrelevant context snippets).</p><p>This means that even before query rewrite we should add another step, where the LLM gets to decide whether we should do any retrieval or not. 
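</p><p>As a sketch, the retrieval decision and the query rewrite can even be combined into a single structured LLM call. The prompt wording and the stubbed model below are illustrative, not a real provider API:</p>

```python
import json

# One structured call decides whether to retrieve AND produces the
# standalone query. The prompt text and the stub below are illustrative.
DECIDE_PROMPT = (
    'Given the conversation below, reply in JSON as '
    '{"retrieve": true|false, "query": "<standalone search query>"}.\n'
    'Conversation:\n'
)

def plan_retrieval(history, call_llm):
    raw = call_llm(DECIDE_PROMPT + history)
    plan = json.loads(raw)
    return plan["retrieve"], plan.get("query", "")

# Stub standing in for a real LLM client: it expands the bare "Why?"
# follow-up into a self-contained query.
def fake_llm(prompt):
    return json.dumps({
        "retrieve": True,
        "query": "reason the company mandated a specific color palette",
    })

should_retrieve, query = plan_retrieval("User: Why?", fake_llm)
```

<p>If the decision comes back negative, the pipeline skips retrieval entirely and goes straight to answer generation.</p><p>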
The final architecture looks like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_A3h!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb561dd84-e4fa-41dc-b386-0b63000facee_2400x3168.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_A3h!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb561dd84-e4fa-41dc-b386-0b63000facee_2400x3168.png 424w, https://substackcdn.com/image/fetch/$s_!_A3h!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb561dd84-e4fa-41dc-b386-0b63000facee_2400x3168.png 848w, https://substackcdn.com/image/fetch/$s_!_A3h!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb561dd84-e4fa-41dc-b386-0b63000facee_2400x3168.png 1272w, https://substackcdn.com/image/fetch/$s_!_A3h!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb561dd84-e4fa-41dc-b386-0b63000facee_2400x3168.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_A3h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb561dd84-e4fa-41dc-b386-0b63000facee_2400x3168.png" width="1456" height="1922" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b561dd84-e4fa-41dc-b386-0b63000facee_2400x3168.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1922,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!_A3h!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb561dd84-e4fa-41dc-b386-0b63000facee_2400x3168.png 424w, https://substackcdn.com/image/fetch/$s_!_A3h!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb561dd84-e4fa-41dc-b386-0b63000facee_2400x3168.png 848w, https://substackcdn.com/image/fetch/$s_!_A3h!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb561dd84-e4fa-41dc-b386-0b63000facee_2400x3168.png 1272w, https://substackcdn.com/image/fetch/$s_!_A3h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb561dd84-e4fa-41dc-b386-0b63000facee_2400x3168.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p> &#128161; Note that this is just a naive implementation. In practice, the decision of retrieving and the query rewrite may be done by the same LLM call to save time. You may also use different LLMs in parallel for different steps, leveraging smarter and more expensive LLMs for the decisional tasks and faster/cheaper ones for the query rewrite and the answer generation.</p><p>This is a critical step towards an AI agent: we are giving the LLM the power to take a decision, however simple the decision may look. This is the point where you should start to adapt your evaluation framework to measure how effective the LLM is at <strong>taking decisions</strong>, rather than its skills at interpreting the retrieved context or the effectiveness of your retrieval step alone. This is what Agent evaluation frameworks will do for you (see the bottom of the article for some suggestions).</p><h3>4. 
The Agentic Loop</h3><p>Once we have this structure in place, we&#8217;re ready to give the LLM even more autonomy by introducing an <strong>agentic loop</strong>.</p><p>Since the LLM is now able to take the decision to retrieve or not retrieve based on the chat history, how about we let the LLM also review what context snippets were returned by the retriever, and decide whether the retrieval was successful or not?</p><p>To build this agentic loop you should add a new step between the retrieval and the generation step, where the retrieved context is sent to the LLM for review. If the LLM believes the context is relevant to the question and sufficient to answer it, the LLM can decide to proceed to the answer generation. If not, the process loops back to the query rewrite stage, and the retrieval runs again with a different query in the hope that better context will be found.</p><p>The resulting architecture looks like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8Ty1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95c97839-a329-4b8f-9a01-2e7bccdd4108_2400x3168.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8Ty1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95c97839-a329-4b8f-9a01-2e7bccdd4108_2400x3168.png 424w, https://substackcdn.com/image/fetch/$s_!8Ty1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95c97839-a329-4b8f-9a01-2e7bccdd4108_2400x3168.png 848w, 
https://substackcdn.com/image/fetch/$s_!8Ty1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95c97839-a329-4b8f-9a01-2e7bccdd4108_2400x3168.png 1272w, https://substackcdn.com/image/fetch/$s_!8Ty1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95c97839-a329-4b8f-9a01-2e7bccdd4108_2400x3168.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8Ty1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95c97839-a329-4b8f-9a01-2e7bccdd4108_2400x3168.png" width="1456" height="1922" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/95c97839-a329-4b8f-9a01-2e7bccdd4108_2400x3168.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1922,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!8Ty1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95c97839-a329-4b8f-9a01-2e7bccdd4108_2400x3168.png 424w, https://substackcdn.com/image/fetch/$s_!8Ty1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95c97839-a329-4b8f-9a01-2e7bccdd4108_2400x3168.png 848w, 
https://substackcdn.com/image/fetch/$s_!8Ty1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95c97839-a329-4b8f-9a01-2e7bccdd4108_2400x3168.png 1272w, https://substackcdn.com/image/fetch/$s_!8Ty1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95c97839-a329-4b8f-9a01-2e7bccdd4108_2400x3168.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p> &#128161; Note that this is also a naive implementation. 
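</p><p>To make the loop concrete, here is a minimal sketch in plain Python. The <code>call_llm</code> and <code>retrieve</code> helpers are hypothetical placeholders standing in for your real LLM client and retrieval pipeline:</p>

```python
# A minimal sketch of the agentic loop described above, in plain Python.
# `call_llm` and `retrieve` are hypothetical stand-ins: swap in your real
# LLM client and retrieval pipeline.

MAX_ATTEMPTS = 3  # the threshold that stops the loop from running forever

def call_llm(prompt: str) -> str:
    # Placeholder LLM: always approves the context so the sketch runs
    # end to end. Replace with a real model call.
    if "Reply YES or NO" in prompt:
        return "YES"
    return f"[llm output for: {prompt[:30]}]"

def retrieve(query: str) -> list[str]:
    # Placeholder retriever: replace with your vector store / BM25 / etc.
    return [f"[snippet matching: {query}]"]

def answer_with_agentic_loop(question: str) -> str:
    # 1. Query rewrite
    query = call_llm(f"Rewrite as a search query: {question}")
    for _ in range(MAX_ATTEMPTS):
        # 2. Retrieval
        snippets = retrieve(query)
        # 3. Review step: is the retrieved context good enough?
        verdict = call_llm(
            f"Question: {question}\nContext: {snippets}\n"
            "Is this context sufficient? Reply YES or NO."
        )
        if verdict.strip().upper().startswith("YES"):
            break
        # 4. Loop back to the query rewrite and retry
        query = call_llm(f"Propose a better search query for: {question}")
    # 5. Answer generation using the last retrieved context
    return call_llm(f"Answer {question!r} using: {snippets}")
```

<p>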
A few of these decisions can be packed together in a single pass and, again, you can use different LLMs for different tasks.</p><p>With the introduction of the agentic loop we&#8217;ve crossed the boundary of what constitutes an <strong>AI Agent</strong>, even though it&#8217;s still a very simple one. The LLM is now in charge of deciding when the retrieval is good enough, and it can try as many times as it wants (up to a threshold of your choosing) until it&#8217;s satisfied with the outcome.</p><p>If your retrieval step is well designed and effective, this whole architecture may sound pointless. The LLM can hardly get better results by trying again if retrieval is already optimized and query rewriting is not making mistakes, so what&#8217;s the point? In this case, the introduction of the agentic loop can be seen as just a necessary stepping stone towards the next upgrade: transforming retrieval into a tool.</p><h3>5. Retrieval as a Tool</h3><p>In many advanced RAG pipelines, retrieval of context and tool usage are seen as two very different operations. RAG is typically always on, highly custom, and so on, while tools tend to be very small and simple, rarely called by the LLM, and sometimes implemented through standardized protocols like <a href="https://modelcontextprotocol.io/">MCP</a>.</p><p>This distinction is arbitrary and simply due to historical baggage. 
<strong>Retrieval can be a tool</strong>, so it&#8217;s best to treat it like one!</p><p>Once you adopt this mindset, you&#8217;ll see that the hints were there all along:</p><ol><li><p>We made retrieval optional, so the LLM can choose to either call it or not - like every other tool</p></li><li><p>Query rewrite is the LLM choosing what input to provide to the retriever - as it does when it decides to call any other tool</p></li><li><p>The retriever returns output that goes into the chat history to be used for the answer&#8217;s generation - like the output of all other tools.</p></li></ol><p>Transforming retrieval into a tool simplifies our architecture drastically and moves us fully into AI Agent territory:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ihxp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b7b0a63-5668-4fd3-9292-fae50a0cbaf4_2400x1533.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ihxp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b7b0a63-5668-4fd3-9292-fae50a0cbaf4_2400x1533.png 424w, https://substackcdn.com/image/fetch/$s_!ihxp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b7b0a63-5668-4fd3-9292-fae50a0cbaf4_2400x1533.png 848w, https://substackcdn.com/image/fetch/$s_!ihxp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b7b0a63-5668-4fd3-9292-fae50a0cbaf4_2400x1533.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ihxp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b7b0a63-5668-4fd3-9292-fae50a0cbaf4_2400x1533.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ihxp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b7b0a63-5668-4fd3-9292-fae50a0cbaf4_2400x1533.png" width="1456" height="930" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2b7b0a63-5668-4fd3-9292-fae50a0cbaf4_2400x1533.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:930,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!ihxp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b7b0a63-5668-4fd3-9292-fae50a0cbaf4_2400x1533.png 424w, https://substackcdn.com/image/fetch/$s_!ihxp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b7b0a63-5668-4fd3-9292-fae50a0cbaf4_2400x1533.png 848w, https://substackcdn.com/image/fetch/$s_!ihxp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b7b0a63-5668-4fd3-9292-fae50a0cbaf4_2400x1533.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ihxp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b7b0a63-5668-4fd3-9292-fae50a0cbaf4_2400x1533.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>As you can see:</p><ol><li><p>The decision step is now part of the LLM&#8217;s answer generation, which can call it as many times as it wants thanks to the tool calling loop</p></li><li><p>The query rewrite comes for free as the LLM invokes the retrieval tool</p></li><li><p>The retriever&#8217;s output goes into the chat history to be used to answer the 
user&#8217;s request</p></ol><p>At this point it&#8217;s time to address a common concern. You may have heard elsewhere that implementing retrieval as a tool makes the LLM &#8220;forget&#8221; to retrieve context when it should, worsening the effectiveness of your RAG. This was very real a couple of years ago, but in my experience it&#8217;s no longer relevant: modern LLMs are now trained to reach for tools all the time, so this problem has largely disappeared.</p><h3>6. Add more tools</h3><p>Congratulations! At this point you can call your system a true AI Agent. However, an agent with only a retrieval tool has limited use. It&#8217;s time to add other tools!</p><p>To begin with, if your retrieval pipeline has a lot of moving parts (hybrid retriever, web search, image search, SQL queries, etc.) you can consider splitting each of them into a separate search tool for the LLM to use, or exposing more parameters to let the LLM customize the output mix.</p><p>Once that&#8217;s done, adding other tools is trivial on a technical level, especially with protocols such as <a href="https://modelcontextprotocol.io/">MCP</a>. Using popular, open source MCPs may let you simplify your retrieval tool drastically: for example by leveraging <a href="https://github.com/github/github-mcp-server">GitHub&#8217;s MCP</a> instead of doing code search yourself, or <a href="https://github.com/atlassian/atlassian-mcp-server">Atlassian&#8217;s MCPs</a> instead of custom Jira/Confluence/BitBucket integrations, and so on.</p><p>However, keep in mind that adding too many tools and MCPs can <strong>overwhelm the LLM</strong>. You should carefully select the tools that most expand your LLM&#8217;s ability to solve your users&#8217; problems. For example, a GitHub MCP is irrelevant if only very few of your users are developers, and an image generation tool is useless if you&#8217;re serving only developers. 
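</p><p>Whether hand-rolled or provided via MCP, every tool the LLM sees boils down to a schema plus a handler, and each schema costs context window. As an illustration, the retrieval tool itself might be declared like this (an OpenAI-style function schema; the names are illustrative, not from any specific library):</p>

```python
# Sketch of retrieval exposed as just another tool, using an OpenAI-style
# function schema. `search_docs` and `run_retrieval` are illustrative names.

retrieval_tool = {
    "type": "function",
    "function": {
        "name": "search_docs",
        "description": "Search the knowledge base for relevant snippets.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Search query, rewritten by the model.",
                }
            },
            "required": ["query"],
        },
    },
}

def handle_tool_call(name: str, arguments: dict) -> str:
    # Dispatch table: retrieval sits next to any other tool you add.
    if name == "search_docs":
        return run_retrieval(arguments["query"])
    raise ValueError(f"Unknown tool: {name}")

def run_retrieval(query: str) -> str:
    # Placeholder for the retrieval pipeline built in the earlier sections.
    return f"[snippets for: {query}]"
```

<p>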
It&#8217;s easy to overdo it, so make sure to regularly review the tools you make available to your LLM and add or remove them as necessary.</p><p>And in the rare case in which you actually need a lot of tools, consider letting the user plug them in as needed (like the ChatGPT UI does), or adopt a <a href="https://blog.cloudflare.com/code-mode/">more sophisticated tool calling approach</a> to manage the context window effectively.</p><h2>Conclusion</h2><p>That&#8217;s it! You have successfully transformed your RAG pipeline into a simple AI Agent. From here you can expand further by implementing planning steps, sub-agents, and more.</p><p>However, before going further you should remember that your retrieval-oriented metrics are no longer sufficient to evaluate the decision-making skills of your system. If you&#8217;ve been using a RAG-only eval framework such as RAGAS, it&#8217;s now a good time to move on to a more general-purpose or agent-oriented eval framework, such as <a href="https://deepeval.com">DeepEval</a>, <a href="https://galileo.ai/">Galileo</a>, <a href="https://arize.com/">Arize.ai</a>, or any other AI Agent framework of your choice.</p><p>Last but not least: if you want to see this entire process implemented in code, don&#8217;t miss my workshop at the virtual <a href="https://www.summit.ai/">Agentic AI Summit</a> on the 21st of January, 2026! I&#8217;ll be walking you through the entire process and showing you some additional implementation details. 
See you there!</p>]]></content:encoded></item><item><title><![CDATA[What are the &#8220;experts&#8221; in Mixture-of-Experts LLMs?]]></title><description><![CDATA[And how can 8 or 16 of them cover all possible domains of expertise?]]></description><link>https://zansara.substack.com/p/what-are-the-experts-in-mixture-of</link><guid isPermaLink="false">https://zansara.substack.com/p/what-are-the-experts-in-mixture-of</guid><dc:creator><![CDATA[Sara Z.]]></dc:creator><pubDate>Mon, 15 Dec 2025 11:17:43 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/fbc74e90-0e8e-4b7e-907e-283d22f55f5b_1735x725.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is episode 5 of a series of shorter blog posts answering questions I received during the course of my work. They reflect common misconceptions and doubts about various generative AI technologies. You can find the whole series here: <a href="https://www.zansara.dev/series/practical-questions">Practical Questions</a>.</em></p><div><hr></div><p>Nearly all popular LLMs share the same internal structure: they are decoder-only Transformers. However, they are not completely identical: in order to speed up training, increase intelligence, or improve inference speed and cost, this base template is sometimes modified a bit.</p><p>One popular variant is the so-called <strong>MoE (Mixture of Experts)</strong> architecture: a neural network design that divides the model into multiple independent sub-networks called &#8220;experts&#8221;. For each input, a routing algorithm (also called a gating network) determines which experts to activate, so only a subset of the model&#8217;s parameters is used during each inference pass. This leads to efficient scaling: models can grow significantly in parameter size without a proportional increase in computational resources per token or query. 
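</p><p>A back-of-the-envelope calculation makes this concrete. The numbers below are made up purely for illustration and don&#8217;t describe any specific model:</p>

```python
# Illustrative (invented) numbers showing why MoE inference stays cheap
# as the model grows: only the routed experts run for each token.

n_experts = 8            # expert feed-forward networks per MoE layer
top_k = 2                # experts actually activated per token
expert_params = 5        # billions of parameters per expert (all layers)
shared_params = 4        # billions for attention, embeddings, etc.

total_params = shared_params + n_experts * expert_params   # stored in memory
active_params = shared_params + top_k * expert_params      # used per token

print(f"total: {total_params}B, active per token: {active_params}B")
# → total: 44B, active per token: 14B
```

<p>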
In short, it enables large models to perform as quickly as smaller ones without sacrificing accuracy.</p><p>But what are these expert networks, and how are they built? One common misconception is that the &#8220;experts&#8221; of MoE are specialized in a well defined, recognizable type of task: that the model includes a &#8220;math expert&#8221;, a &#8220;poetry expert&#8221;, and so on&#8203;. The query would then be routed to the appropriate expert after the type of request is classified.</p><p>However, this is not the case. Let&#8217;s figure out how it works under the hood.</p><h2>The MoE architecture</h2><p>In order to understand MoE, you should first be familiar with the basic architecture of decoder-only Transformers. If the diagram below is not familiar to you, have a look at <a href="https://cameronrwolfe.substack.com/p/decoder-only-transformers-the-workhorse">this detailed description</a> before diving in.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!O4ZC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb63e0f6b-d874-4dfb-93d1-86bad23c7c07_2403x3780.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!O4ZC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb63e0f6b-d874-4dfb-93d1-86bad23c7c07_2403x3780.png 424w, https://substackcdn.com/image/fetch/$s_!O4ZC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb63e0f6b-d874-4dfb-93d1-86bad23c7c07_2403x3780.png 848w, 
https://substackcdn.com/image/fetch/$s_!O4ZC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb63e0f6b-d874-4dfb-93d1-86bad23c7c07_2403x3780.png 1272w, https://substackcdn.com/image/fetch/$s_!O4ZC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb63e0f6b-d874-4dfb-93d1-86bad23c7c07_2403x3780.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!O4ZC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb63e0f6b-d874-4dfb-93d1-86bad23c7c07_2403x3780.png" width="1456" height="2290" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b63e0f6b-d874-4dfb-93d1-86bad23c7c07_2403x3780.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2290,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!O4ZC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb63e0f6b-d874-4dfb-93d1-86bad23c7c07_2403x3780.png 424w, https://substackcdn.com/image/fetch/$s_!O4ZC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb63e0f6b-d874-4dfb-93d1-86bad23c7c07_2403x3780.png 848w, 
https://substackcdn.com/image/fetch/$s_!O4ZC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb63e0f6b-d874-4dfb-93d1-86bad23c7c07_2403x3780.png 1272w, https://substackcdn.com/image/fetch/$s_!O4ZC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb63e0f6b-d874-4dfb-93d1-86bad23c7c07_2403x3780.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The main change made by a MoE over the decoder-only transformer architecture is <strong>within the feed-forward component of 
the transformer block</strong>. In the standard, non-MoE architecture, the tokens pass one by one through a single feed-forward neural network. In a MoE instead, at this stage there are many feed-forward networks, each with their own weights: they are the &#8220;experts&#8221;.</p><p>This means that to create an MoE LLM we first need to convert the transformer&#8217;s feed-forward layers into these expert layers. Their internal structure is the same as the original, single network, but copied a few times, with the addition of a routing algorithm to select the expert to use for each input token to process.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4Lnr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2d81cee-8069-45f8-a2a7-5c50aa8a7758_2403x3000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4Lnr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2d81cee-8069-45f8-a2a7-5c50aa8a7758_2403x3000.png 424w, https://substackcdn.com/image/fetch/$s_!4Lnr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2d81cee-8069-45f8-a2a7-5c50aa8a7758_2403x3000.png 848w, https://substackcdn.com/image/fetch/$s_!4Lnr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2d81cee-8069-45f8-a2a7-5c50aa8a7758_2403x3000.png 1272w, https://substackcdn.com/image/fetch/$s_!4Lnr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2d81cee-8069-45f8-a2a7-5c50aa8a7758_2403x3000.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!4Lnr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2d81cee-8069-45f8-a2a7-5c50aa8a7758_2403x3000.png" width="1456" height="1818" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f2d81cee-8069-45f8-a2a7-5c50aa8a7758_2403x3000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1818,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!4Lnr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2d81cee-8069-45f8-a2a7-5c50aa8a7758_2403x3000.png 424w, https://substackcdn.com/image/fetch/$s_!4Lnr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2d81cee-8069-45f8-a2a7-5c50aa8a7758_2403x3000.png 848w, https://substackcdn.com/image/fetch/$s_!4Lnr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2d81cee-8069-45f8-a2a7-5c50aa8a7758_2403x3000.png 1272w, https://substackcdn.com/image/fetch/$s_!4Lnr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2d81cee-8069-45f8-a2a7-5c50aa8a7758_2403x3000.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The core of a routing algorithm is rather simple as well. First the token&#8217;s embedding passes through a linear transformation (such as a fully connected layer) that outputs a vector as long as the number of experts we have in our system. Then, a softmax is applied and the top-k experts are selected. 
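</p><p>The gating computation just described can be sketched in a few lines of plain Python (toy dimensions, purely illustrative):</p>

```python
import math

# Toy router for a single token. W is the gating layer's weight matrix,
# with one row per expert; the embedding and weights are tiny on purpose.

def route(token_embedding, W, top_k=2):
    # 1. Linear transformation: one score (logit) per expert.
    logits = [sum(w_i * x_i for w_i, x_i in zip(row, token_embedding))
              for row in W]
    # 2. Softmax over the expert scores (max-subtracted for stability).
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    probs = [e / sum(exps) for e in exps]
    # 3. Keep only the top-k experts; their renormalized probabilities
    #    become the weights used to combine the experts' outputs.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]
```

<p>Whether the top-k weights are renormalized (as done here) or the raw softmax scores are used directly varies between implementations.</p><p>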
After the experts produce output, their results are then averaged (using their initial score as weight) and sent to the next decode layer.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lCdg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa80f57de-64cd-4124-a20f-a8e4c19cfc43_2403x3000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lCdg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa80f57de-64cd-4124-a20f-a8e4c19cfc43_2403x3000.png 424w, https://substackcdn.com/image/fetch/$s_!lCdg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa80f57de-64cd-4124-a20f-a8e4c19cfc43_2403x3000.png 848w, https://substackcdn.com/image/fetch/$s_!lCdg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa80f57de-64cd-4124-a20f-a8e4c19cfc43_2403x3000.png 1272w, https://substackcdn.com/image/fetch/$s_!lCdg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa80f57de-64cd-4124-a20f-a8e4c19cfc43_2403x3000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lCdg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa80f57de-64cd-4124-a20f-a8e4c19cfc43_2403x3000.png" width="1456" height="1818" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a80f57de-64cd-4124-a20f-a8e4c19cfc43_2403x3000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1818,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!lCdg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa80f57de-64cd-4124-a20f-a8e4c19cfc43_2403x3000.png 424w, https://substackcdn.com/image/fetch/$s_!lCdg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa80f57de-64cd-4124-a20f-a8e4c19cfc43_2403x3000.png 848w, https://substackcdn.com/image/fetch/$s_!lCdg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa80f57de-64cd-4124-a20f-a8e4c19cfc43_2403x3000.png 1272w, https://substackcdn.com/image/fetch/$s_!lCdg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa80f57de-64cd-4124-a20f-a8e4c19cfc43_2403x3000.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Keep in mind that this is a simplification of the actual routing mechanism of real MoE models. If implemented as described here, through the training phase you would observe a <strong>routing collapse</strong>: the routing network would learn to send all tokens to the same expert all the time, reducing your MoE model back to the equivalent of a regular decoder-only Transformer. To make the network learn to distribute the tokens in a more balanced fashion, you would need to add auxiliary loss functions that make the routing network learn to load balance the experts properly. For more details on this process (and much more on MoE in general) see <a href="https://cameronrwolfe.substack.com/p/moe-llms">this detailed overview</a>.</p><h2>So experts never specialize?</h2><p>Yes and no. 
In the <a href="https://arxiv.org/abs/2402.01739">OpenMoE paper</a>, the authors investigated in detail whether experts specialize in any recognizable way, and they observed interesting results. In their case, experts do not tend to specialize in any particular knowledge domain; however, there is some level of expert specialization across natural languages and specific tasks.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oAT0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba310005-f66c-4d0c-ace0-64fa62e8e859_1176x538.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oAT0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba310005-f66c-4d0c-ace0-64fa62e8e859_1176x538.jpeg 424w, https://substackcdn.com/image/fetch/$s_!oAT0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba310005-f66c-4d0c-ace0-64fa62e8e859_1176x538.jpeg 848w, https://substackcdn.com/image/fetch/$s_!oAT0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba310005-f66c-4d0c-ace0-64fa62e8e859_1176x538.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!oAT0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba310005-f66c-4d0c-ace0-64fa62e8e859_1176x538.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oAT0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba310005-f66c-4d0c-ace0-64fa62e8e859_1176x538.jpeg" width="1176" height="538"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ba310005-f66c-4d0c-ace0-64fa62e8e859_1176x538.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:538,&quot;width&quot;:1176,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!oAT0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba310005-f66c-4d0c-ace0-64fa62e8e859_1176x538.jpeg 424w, https://substackcdn.com/image/fetch/$s_!oAT0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba310005-f66c-4d0c-ace0-64fa62e8e859_1176x538.jpeg 848w, https://substackcdn.com/image/fetch/$s_!oAT0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba310005-f66c-4d0c-ace0-64fa62e8e859_1176x538.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!oAT0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba310005-f66c-4d0c-ace0-64fa62e8e859_1176x538.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!U2n2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d4ec672-d9ac-4ce9-9c09-b441f825ecc2_952x526.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!U2n2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d4ec672-d9ac-4ce9-9c09-b441f825ecc2_952x526.jpeg 424w, https://substackcdn.com/image/fetch/$s_!U2n2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d4ec672-d9ac-4ce9-9c09-b441f825ecc2_952x526.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!U2n2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d4ec672-d9ac-4ce9-9c09-b441f825ecc2_952x526.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!U2n2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d4ec672-d9ac-4ce9-9c09-b441f825ecc2_952x526.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!U2n2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d4ec672-d9ac-4ce9-9c09-b441f825ecc2_952x526.jpeg" width="952" height="526" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0d4ec672-d9ac-4ce9-9c09-b441f825ecc2_952x526.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:526,&quot;width&quot;:952,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!U2n2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d4ec672-d9ac-4ce9-9c09-b441f825ecc2_952x526.jpeg 424w, https://substackcdn.com/image/fetch/$s_!U2n2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d4ec672-d9ac-4ce9-9c09-b441f825ecc2_952x526.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!U2n2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d4ec672-d9ac-4ce9-9c09-b441f825ecc2_952x526.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!U2n2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d4ec672-d9ac-4ce9-9c09-b441f825ecc2_952x526.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>According to the authors, this specialization is due to the same tokens being sent to the same expert every time, regardless of the 
context in which they are used. Given that different languages use very different sets of tokens, it&#8217;s natural to see this sort of specialization emerge, and the same can be said of specific tasks, where the jargon and word frequencies change strongly. The paper defines this behavior as &#8220;Context-Independent Specialization&#8221;.</p><p>It&#8217;s important to stress again that whether this specialization occurs, and along which dimensions, is irrelevant to the effectiveness of this architecture. The core advantage of MoE is <em>not</em> the presence of recognizable experts, but the sparsity it introduces: with MoE you can scale up the parameter count without slowing down the inference speed of the resulting model, because not all weights are used for all tokens.</p><h1>Conclusion</h1><p>The term &#8220;Mixture of Experts&#8221; can easily bring the wrong image to the minds of people unfamiliar with how neural networks, and Transformers in general, work internally. When discussing this type of model, I often find it important to stress the difference between how the term &#8220;expert&#8221; is understood by a non-technical audience and what it means in this context.</p><p>If you want to learn more about MoEs and how they&#8217;re implemented in practice, I recommend <a href="https://cameronrwolfe.substack.com/p/moe-llms">this very detailed article</a> by Cameron Wolfe, where he dissects the architecture in far more detail and adds plenty of examples and references to dig further.</p>]]></content:encoded></item><item><title><![CDATA[What’s hybrid retrieval good for?]]></title><description><![CDATA[We&#8217;ve been told that embedding search is strictly superior to BM25 and all other keyword-search algorithms. 
But they still have a role in modern search pipelines.]]></description><link>https://zansara.substack.com/p/whats-hybrid-retrieval-good-for</link><guid isPermaLink="false">https://zansara.substack.com/p/whats-hybrid-retrieval-good-for</guid><dc:creator><![CDATA[Sara Z.]]></dc:creator><pubDate>Tue, 04 Nov 2025 16:41:36 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/4c7f194d-271f-4889-b0ea-6d0eb2a87b41_1691x707.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is episode 4 of a series of shorter blog posts answering questions I received during the course of my work, reflecting common misconceptions and doubts about various generative AI technologies. You can find the whole series here: <a href="https://www.zansara.dev/series/practical-questions">Practical Questions</a>.</em></p><div><hr></div><p>It has been a long time since TF-IDF or even BM25 were the state of the art for information retrieval. These days the baseline has moved to <a href="https://www.zansara.dev/posts/2025-10-09-rerankers#bi-encoders-vs-cross-encoders">embedding similarity search</a>, where each unit of information, be it a sentence, a paragraph, or a page, is first encoded in an embedding and then compared with the embedding of the user&#8217;s query.</p><p>From this baseline there are often two pieces of advice to help you increase the performance of your search system: one is to go all-in on the embedding approach and consider a reranker, finetune your embedding model, and so on. The other, usually called hybrid retrieval or hybrid search, is to bring back good old keyword search algorithms and use them to complement your results. Often the best scenario is to use both of these enhancements, which nicely complement each other.</p><p>But why would this arrangement help improve the results? 
Isn&#8217;t embedding search strictly superior to keyword-based retrieval algorithms?</p><h2>Semantic vs Lexical</h2><p>When you embed a sentence, the resulting embedding encodes its <em>meaning</em>, not its exact phrasing. That&#8217;s the strength of embeddings! But it can often be a limitation as well.</p><p>For example, a semantic model can understand that &#8220;latest iPhone&#8221; is similar to &#8220;iPhone 17 Pro Max&#8221;, which is great if the first sentence is a query and the second the search result. But a semantic model will also say that &#8220;iPhone 17 Pro Max&#8221; and &#8220;iPhone 11 Pro Max&#8221; are very similar, which is <em>not</em> great if the first sentence is a query and the second a search result.</p><p>In short, <strong>semantic</strong> similarity is great if you are starting from a generic query and want to retrieve a set of precise results that all match the generic description, i.e. that all fall under the same general concept. For &#8220;latest iPhone&#8221;: &#8220;iPhone 17 Pro Max&#8221;, &#8220;iPhone 17 Pro&#8221; and ideally &#8220;iPhone Air&#8221; are all valid search results.</p><p>On the other hand, <strong>lexical</strong> similarity is what allows your system to retrieve extremely precise results in response to a very specific query. 
&#8220;latest iPhone&#8221; will return garbage results with a lexical algorithm such as BM25 (essentially any iPhone would match), but if the search string is &#8220;iPhone 17 Pro Max&#8221;, BM25 will return the best results.</p><p>To visualize it better, here are the expected results for each of the two queries in a dataset of iPhone names:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cFwD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0a3e601-5021-403c-8303-215c08285010_1780x874.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cFwD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0a3e601-5021-403c-8303-215c08285010_1780x874.png 424w, https://substackcdn.com/image/fetch/$s_!cFwD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0a3e601-5021-403c-8303-215c08285010_1780x874.png 848w, https://substackcdn.com/image/fetch/$s_!cFwD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0a3e601-5021-403c-8303-215c08285010_1780x874.png 1272w, https://substackcdn.com/image/fetch/$s_!cFwD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0a3e601-5021-403c-8303-215c08285010_1780x874.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cFwD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0a3e601-5021-403c-8303-215c08285010_1780x874.png" width="1456" height="715"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f0a3e601-5021-403c-8303-215c08285010_1780x874.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:715,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:101143,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://zansara.substack.com/i/177999207?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0a3e601-5021-403c-8303-215c08285010_1780x874.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cFwD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0a3e601-5021-403c-8303-215c08285010_1780x874.png 424w, https://substackcdn.com/image/fetch/$s_!cFwD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0a3e601-5021-403c-8303-215c08285010_1780x874.png 848w, https://substackcdn.com/image/fetch/$s_!cFwD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0a3e601-5021-403c-8303-215c08285010_1780x874.png 1272w, https://substackcdn.com/image/fetch/$s_!cFwD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0a3e601-5021-403c-8303-215c08285010_1780x874.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>As you can see, the problem is that neither of the two approaches works best with both types of queries: each has its own pros and cons and works best only on a subset of the questions your system may receive.</p><p>So why not use them both?</p><h2>Combining them</h2><p>A hybrid search system is simply a system that does the same search twice: once with a keyword algorithm such as BM25, and once with vector search. But how do we merge the two lists of results?</p><p>The scores the documents come with are deeply incomparable. BM25 scores depend on term frequency and keyword matching, and are not bound to any range. Cosine similarity, on the contrary, usually clusters between 0.5 and 0.9, a range that gets even narrower when the sequences are longer.</p><p>That&#8217;s where <strong>reciprocal rank fusion (RRF)</strong> comes in. 
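A minimal sketch of rank fusion in Python might look like this. The function name is made up, and `k=60` is the conventional smoothing constant; both are illustrative, not from any specific library.

```python
def rrf_fuse(rankings, k=60):
    """Merge several ranked result lists with reciprocal rank fusion.
    `rankings` is a list of lists of document ids, best result first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each method contributes 1/(k + rank) to the document's score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

Feeding it one BM25 ranking and one vector-search ranking yields a single merged list, with documents that appear high in both rankings floating to the top.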
RRF is incredibly simple and boils down to this formula: <code>score(d) = sum( 1/(k + rank_method_i(d)) )</code>, where the sum runs over the retrieval methods and <code>k</code> is a small smoothing constant. As you can see, it works on ranks, not scores, so it&#8217;s robust against scale differences and requires no normalization. Platforms like Elastic and Pinecone use it for production hybrid search due to its simplicity and reliability. Being so simple, the additional latency is negligible, which makes it suitable for real-time use cases.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!b2Ec!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46bf57f8-04df-4ca5-939c-24279b2af4b9_2400x2298.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!b2Ec!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46bf57f8-04df-4ca5-939c-24279b2af4b9_2400x2298.png 424w, https://substackcdn.com/image/fetch/$s_!b2Ec!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46bf57f8-04df-4ca5-939c-24279b2af4b9_2400x2298.png 848w, https://substackcdn.com/image/fetch/$s_!b2Ec!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46bf57f8-04df-4ca5-939c-24279b2af4b9_2400x2298.png 1272w, https://substackcdn.com/image/fetch/$s_!b2Ec!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46bf57f8-04df-4ca5-939c-24279b2af4b9_2400x2298.png 1456w" sizes="100vw"><img
src="https://substackcdn.com/image/fetch/$s_!b2Ec!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46bf57f8-04df-4ca5-939c-24279b2af4b9_2400x2298.png" width="1456" height="1394" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/46bf57f8-04df-4ca5-939c-24279b2af4b9_2400x2298.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1394,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!b2Ec!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46bf57f8-04df-4ca5-939c-24279b2af4b9_2400x2298.png 424w, https://substackcdn.com/image/fetch/$s_!b2Ec!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46bf57f8-04df-4ca5-939c-24279b2af4b9_2400x2298.png 848w, https://substackcdn.com/image/fetch/$s_!b2Ec!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46bf57f8-04df-4ca5-939c-24279b2af4b9_2400x2298.png 1272w, https://substackcdn.com/image/fetch/$s_!b2Ec!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46bf57f8-04df-4ca5-939c-24279b2af4b9_2400x2298.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Or, if you&#8217;re less concerned about latency, you can consider adding a <a href="https://www.zansara.dev/posts/2025-10-09-rerankers#bi-encoders-vs-cross-encoders">reranker</a>.</p><p>Having two independent and complementary search techniques is the reason why adding a reranker to your hybrid pipeline is so effective. Because the two methods are so wildly different, it&#8217;s not obvious that even their rankings are comparable. A reranker can take a more careful look at the retrieved documents and make sure the most relevant ones are at the top of the pile, allowing you to cut away the least relevant ones.</p><h2>Conclusion</h2><p>Hybrid search isn&#8217;t a patch for outdated systems, but a default strategy for any high-quality retrieval engine. 
Dense embeddings bring rich contextual understanding, while sparse retrieval ensures accuracy for unique identifiers, numeric codes, acronyms, or exact strings that embeddings gloss over. In a world where search systems must serve both humans and machine agents, hybrid search is the recall multiplier that guarantees we get both meaning and precision.</p>]]></content:encoded></item><item><title><![CDATA[Making sense of KV Cache optimizations]]></title><description><![CDATA[Let&#8217;s make sense of the zoo of techniques that exist out there.]]></description><link>https://zansara.substack.com/p/making-sense-of-kv-cache-optimizations</link><guid isPermaLink="false">https://zansara.substack.com/p/making-sense-of-kv-cache-optimizations</guid><dc:creator><![CDATA[Sara Z.]]></dc:creator><pubDate>Wed, 29 Oct 2025 12:31:14 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/de4d55e7-34a6-47a0-84a3-38a3b5e40081_1476x618.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The <a href="https://www.zansara.dev/posts/2025-10-23-kv-caching/">KV cache</a> is an essential mechanism to avoid the quadratic time complexity of LLM inference and make modern LLMs usable despite huge parameter counts and context lengths. However, simply caching everything indiscriminately is not a successful strategy. Having traded time complexity for space, our problem is now <strong>GPU memory</strong>. Adding more memory can only take you so far: at some point, you&#8217;re going to need much more efficient ways to decide what to cache, when, and how. But classic cache management techniques were not designed for LLMs, and they often fall short.</p><p>With time, a veritable zoo of optimization strategies arose to get around this problem, and making sense of which optimizations can be applied to which model can be a challenge in itself. 
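To get a sense of the scale involved, here's a back-of-the-envelope estimate of KV cache size: two tensors (keys and values) per layer, per head, per token position. The model shape below is illustrative, roughly in the ballpark of a 7B-class dense model in fp16, not a measurement of any specific model.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache for ONE sequence: the factor 2 accounts for keys
    AND values; fp16/bf16 means 2 bytes per value."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Illustrative 7B-class shape (32 layers, 32 KV heads of dim 128), fp16,
# 4096-token context: roughly 2.1 GB of cache for a single sequence.
cache_gb = kv_cache_bytes(seq_len=4096, n_layers=32, n_kv_heads=32, head_dim=128) / 1e9
```

At that rate, a handful of concurrent long-context requests already saturates a typical GPU, which is exactly why the optimizations below exist.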
Fortunately, a <a href="https://arxiv.org/abs/2412.19442">very comprehensive survey</a> on KV caching recently collected all the techniques that make up the state of the art in this field, giving practitioners a handy starting point. The number of techniques reviewed is staggering, so we&#8217;re going to need more than one post to go through the most interesting approaches and compare them.</p><p>For now, let&#8217;s see how we can start to make sense of things.</p><h2>The challenges</h2><p>Most of the techniques we&#8217;re going to see address one or more of these issues.</p><ul><li><p><strong>Cache Eviction</strong>: Determining which items to evict when the cache reaches its capacity. Popular policies like Least Recently Used (LRU) or Least Frequently Used (LFU) do not always align with LLM usage patterns, leading to suboptimal performance.</p></li><li><p><strong>Memory Management</strong>: The memory required for the KV cache grows linearly with both the input length and the number of layers, which can quickly exceed the hardware memory limits. 
It&#8217;s possible to overcome such limits by distributing the storage of this cache across different types of storage hardware (e.g., GPU, CPU or external memory), but this brings its own set of challenges.</p></li><li><p><strong>Latency</strong>: Accessing and updating the cache at each decoding step can introduce latency.</p></li><li><p><strong>Compression</strong>: Compressing the KV cache can reduce memory usage but may degrade model performance if key information is lost.</p></li><li><p><strong>Dynamic Workloads</strong>: Handling dynamic and unpredictable workloads, where access patterns and data requirements frequently change, requires adaptive caching strategies that can respond in real time.</p></li><li><p><strong>Distributed Coordination</strong>: In distributed KV caches, maintaining coordination across multiple nodes to ensure consistency, fault tolerance, and efficient resource usage adds significant complexity.</p></li></ul><h2>A taxonomy</h2><p>In order to make sense of the vast amount of known techniques, the authors categorized them into a comprehensive taxonomy.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DTRh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19777410-dd4c-430a-a32d-03e877ff9060_1766x1636.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DTRh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19777410-dd4c-430a-a32d-03e877ff9060_1766x1636.png 424w, https://substackcdn.com/image/fetch/$s_!DTRh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19777410-dd4c-430a-a32d-03e877ff9060_1766x1636.png 848w, 
https://substackcdn.com/image/fetch/$s_!DTRh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19777410-dd4c-430a-a32d-03e877ff9060_1766x1636.png 1272w, https://substackcdn.com/image/fetch/$s_!DTRh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19777410-dd4c-430a-a32d-03e877ff9060_1766x1636.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DTRh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19777410-dd4c-430a-a32d-03e877ff9060_1766x1636.png" width="1456" height="1349" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/19777410-dd4c-430a-a32d-03e877ff9060_1766x1636.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1349,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!DTRh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19777410-dd4c-430a-a32d-03e877ff9060_1766x1636.png 424w, https://substackcdn.com/image/fetch/$s_!DTRh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19777410-dd4c-430a-a32d-03e877ff9060_1766x1636.png 848w, 
https://substackcdn.com/image/fetch/$s_!DTRh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19777410-dd4c-430a-a32d-03e877ff9060_1766x1636.png 1272w, https://substackcdn.com/image/fetch/$s_!DTRh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19777410-dd4c-430a-a32d-03e877ff9060_1766x1636.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><a href="https://arxiv.org/pdf/2412.19442#figure.2">Source</a></em></p><p>It starts with three major 
categories:</p><ul><li><p><strong>Token-Level Optimization</strong>: improving KV cache management efficiency by focusing on fine-grained selection, organization, and compression at the token level. These techniques can be applied to any model, as they require no architectural changes.</p></li><li><p><strong>Model-level Optimization</strong>: designing an efficient model structure to optimize KV cache management. These optimizations are strictly model-dependent, because they&#8217;re baked into the model&#8217;s architecture.</p></li><li><p><strong>System-level Optimization</strong>: optimizing KV cache management through techniques closer to the OS and/or the hardware. These techniques may require specialized hardware to implement, so they&#8217;re not within everyone&#8217;s reach.</p></li></ul><h2>Token-Level Optimizations</h2><p>Token-level optimizations are the most readily accessible to most developers, as they require no dedicated support from the LLM and no specialized hardware. Therefore, these are usually the most interesting. 
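</p><p><em>To see why this level is worth optimizing at all, here is a rough back-of-the-envelope calculation of the cache&#8217;s footprint. The model dimensions below are illustrative (roughly those of a 7B-parameter model), not measurements of any specific LLM:</em></p>

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Approximate KV cache size: two tensors (K and V) per layer,
    one head_dim-sized vector per KV head per cached token."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# Illustrative 7B-class model: 32 layers, 32 KV heads, head_dim=128, fp16 values
size = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096)
print(f"{size / 1024**3:.1f} GiB")  # 2.0 GiB for a single 4096-token sequence
```

<p>Two gigabytes per sequence is why dropping, merging, or quantizing cached tokens pays off so quickly. </p><p>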
In this category we find:</p><ul><li><p><strong>KV cache selection</strong>: focuses on prioritizing and storing only the most relevant tokens.</p></li><li><p><strong>KV cache budget allocation</strong>: dynamically distributes memory resources across tokens to ensure efficient cache utilization under limited memory.</p></li><li><p><strong>KV cache merging</strong>: reduces redundancy by combining similar or overlapping KV pairs.</p></li><li><p><strong>KV cache quantization</strong>: minimizes the memory footprint by reducing the precision of cached KV pairs.</p></li><li><p><strong>KV cache low-rank decomposition</strong>: uses low-rank decomposition techniques to reduce cache size.</p></li></ul><p>You can find a deep dive into this category in <a href="https://www.zansara.dev/posts/2025-10-27-kv-caching-optimizations-token-level/">its own post</a>.</p><h2>Model-Level Optimizations</h2><p>Model-level optimizations, as the name says, are baked into the model&#8217;s architecture and are therefore either not applicable or always present in the models you&#8217;re running. These optimizations are usually interesting for people who design and train their own model architectures, rather than for developers who work with off-the-shelf models. 
In this category we find:</p><ul><li><p><strong>Attention grouping and sharing methods</strong>: exploit the redundancy of keys and values to group and share the KV cache within or across transformer layers.</p></li><li><p><strong>Architecture alterations</strong>: design new attention mechanisms or add extrinsic modules for KV optimization.</p></li><li><p><strong>Non-transformer architectures</strong>: adopt other memory-efficient designs, like recurrent neural networks, that sidestep the KV cache of traditional transformers.</p></li></ul><p>You can find a deep dive into this category in <a href="https://www.zansara.dev/posts/2025-10-28-kv-caching-optimizations-model-level/">its own post</a>.</p><h2>System-level Optimizations</h2><p>These optimizations work across the stack to provide the best possible support to the LLM&#8217;s inference, and they&#8217;re sometimes baked into the inference engine, such as vLLM&#8217;s PagedAttention. They occasionally require dedicated hardware and OS optimizations, so they&#8217;re not always readily available for everyday experimentation. 
They include:</p><ul><li><p><strong>Memory management</strong>: focuses on architectural innovations like virtual memory adaptation, intelligent prefix sharing, and layer-aware resource allocation.</p></li><li><p><strong>Scheduling</strong>: addresses diverse optimization goals through prefix-aware methods for maximizing cache reuse, preemptive techniques for fair context switching, and layer-specific mechanisms for fine-grained cache control.</p></li><li><p><strong>Hardware acceleration</strong>: includes single/multi-GPU, I/O-based, heterogeneous computing, and SSD-based solutions.</p></li></ul><p>You can find a deep dive into this category in <a href="https://www.zansara.dev/posts/2025-10-29-kv-caching-optimizations-system-level/">its own post</a>.</p><h2>Conclusion</h2><p>KV cache optimization is still an open research area, with new techniques and improvements being published regularly. A good overview of what types of optimizations exist can help you make sense of the zoo of acronyms and claims being made about them, and give you the foundations you need to understand whether a particular technique is relevant for your situation.</p><p>To learn more about each specific category or technique, it&#8217;s a good idea to check out <a href="https://arxiv.org/abs/2412.19442">the survey</a>, where you can find more detailed breakdowns and comparisons together with links to the original articles where each technique was introduced.</p><p>Happy reading!</p>]]></content:encoded></item><item><title><![CDATA[How does prompt caching work? ]]></title><description><![CDATA[Nearly all inference libraries can do it for you. 
But what's really going on under the hood?]]></description><link>https://zansara.substack.com/p/how-does-prompt-caching-work</link><guid isPermaLink="false">https://zansara.substack.com/p/how-does-prompt-caching-work</guid><dc:creator><![CDATA[Sara Z.]]></dc:creator><pubDate>Thu, 23 Oct 2025 15:53:12 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/f8a1cf5f-103f-42ed-9da7-ca10369abc55_1173x490.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is episode 3 of a series of shorter blog posts answering questions I received during the course of my work, which reflect common misconceptions and doubts about various generative AI technologies. You can find the whole series here: <a href="https://www.zansara.dev/series/practical-questions">Practical Questions</a>.</em></p><div><hr></div><p>In the previous post we saw what prompt caching is, which parts of the prompt are useful to cache, and why, at a high level, it&#8217;s so effective. In this post I want to go one step further and explain <em>how</em> inference engines cache prompt prefixes in practice. How can you take a complex system like an LLM, cache some of its computations mid-prompt, and reload them?</p><p>Let&#8217;s find out.</p><p><em>Disclaimer: to avoid overly complex and specific diagrams, the vectors and matrices shown are accurate neither in size nor in shape. Check the links at the bottom of the post for more detailed resources with more accurate diagrams.</em></p><h2>LLMs are autoregressive</h2><p>Large Language Models are built on the Transformer architecture: a neural network design that excels at processing sequence data. 
Explaining the whole structure of a Transformer goes beyond the scope of this small post: if you&#8217;re interested in the details, head to this <a href="https://jalammar.github.io/illustrated-transformer/">amazing writeup</a> by Jay Alammar about Transformers, or <a href="https://jalammar.github.io/illustrated-gpt2/">this one about GPT-2</a> if you&#8217;re familiar with Transformers but you want to learn more about the decoder-only architecture (which includes all current LLMs).</p><p>The point that interests us is that, according to the original implementation, during inference the LLM generates text one token at a time in an <em>autoregressive</em> fashion, meaning each new token is predicted based on <strong>all</strong> previously generated tokens. After producing (or &#8220;decoding&#8221;) a token, that token is appended to the input sequence and the model computes everything all over again to generate the next one. This loop continues until a stopping condition is reached (such as an end-of-sequence token or a length limit).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2zYN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3858e323-ac48-4f84-84b9-68ca9098bdad_2403x1185.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2zYN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3858e323-ac48-4f84-84b9-68ca9098bdad_2403x1185.png 424w, https://substackcdn.com/image/fetch/$s_!2zYN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3858e323-ac48-4f84-84b9-68ca9098bdad_2403x1185.png 848w, 
https://substackcdn.com/image/fetch/$s_!2zYN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3858e323-ac48-4f84-84b9-68ca9098bdad_2403x1185.png 1272w, https://substackcdn.com/image/fetch/$s_!2zYN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3858e323-ac48-4f84-84b9-68ca9098bdad_2403x1185.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2zYN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3858e323-ac48-4f84-84b9-68ca9098bdad_2403x1185.png" width="1456" height="718" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3858e323-ac48-4f84-84b9-68ca9098bdad_2403x1185.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:718,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!2zYN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3858e323-ac48-4f84-84b9-68ca9098bdad_2403x1185.png 424w, https://substackcdn.com/image/fetch/$s_!2zYN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3858e323-ac48-4f84-84b9-68ca9098bdad_2403x1185.png 848w, 
https://substackcdn.com/image/fetch/$s_!2zYN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3858e323-ac48-4f84-84b9-68ca9098bdad_2403x1185.png 1272w, https://substackcdn.com/image/fetch/$s_!2zYN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3858e323-ac48-4f84-84b9-68ca9098bdad_2403x1185.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>In an autoregressive system, the output is generated token by token by appending the previous pass&#8217; output to its 
input and recomputing everything again. Starting from the token &#8220;This&#8221;, the LLM produces &#8220;is&#8221; as output. Then the output is concatenated to the input to form the string &#8220;This is&#8221;, which is fed again to the LLM to produce &#8220;a&#8221;, and so on until an [END] token is generated, which halts the loop.</em></p><p>This iterative process is very computationally expensive (quadratic time complexity: O(n^2), where n is the number of tokens), and the impact is felt especially on long sequences, because each step must account for an ever-growing history of generated tokens.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WPGr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4237abf9-22e4-4bcf-93f5-a17f071c4af5_2403x1521.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WPGr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4237abf9-22e4-4bcf-93f5-a17f071c4af5_2403x1521.png 424w, https://substackcdn.com/image/fetch/$s_!WPGr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4237abf9-22e4-4bcf-93f5-a17f071c4af5_2403x1521.png 848w, https://substackcdn.com/image/fetch/$s_!WPGr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4237abf9-22e4-4bcf-93f5-a17f071c4af5_2403x1521.png 1272w, https://substackcdn.com/image/fetch/$s_!WPGr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4237abf9-22e4-4bcf-93f5-a17f071c4af5_2403x1521.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!WPGr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4237abf9-22e4-4bcf-93f5-a17f071c4af5_2403x1521.png" width="1456" height="922" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4237abf9-22e4-4bcf-93f5-a17f071c4af5_2403x1521.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:922,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!WPGr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4237abf9-22e4-4bcf-93f5-a17f071c4af5_2403x1521.png 424w, https://substackcdn.com/image/fetch/$s_!WPGr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4237abf9-22e4-4bcf-93f5-a17f071c4af5_2403x1521.png 848w, https://substackcdn.com/image/fetch/$s_!WPGr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4237abf9-22e4-4bcf-93f5-a17f071c4af5_2403x1521.png 1272w, https://substackcdn.com/image/fetch/$s_!WPGr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4237abf9-22e4-4bcf-93f5-a17f071c4af5_2403x1521.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Simplified view of the increasing computation load. At each pass, the increasing length of the input sentence translates into larger matrices to be handled during inference, where each row corresponds to one input token. This means more computations and, in turn, slower inference.</em></p><p>However, there seems to be an obvious opportunity for optimization here. 
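</p><p><em>In pseudo-Python, the naive decoding loop described above looks roughly like this. The model argument is a placeholder for a real forward pass that returns the next token given the full sequence so far:</em></p>

```python
def generate_naive(model, prompt_tokens, end_token, max_len=256):
    """Naive autoregressive decoding: each step re-runs the model over
    the WHOLE sequence so far, so total work grows quadratically."""
    tokens = list(prompt_tokens)
    while len(tokens) < max_len:
        next_token = model(tokens)  # recomputes everything from scratch
        tokens.append(next_token)
        if next_token == end_token:
            break
    return tokens
```

<p>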
If we could store the internal state of the LLM after each token&#8217;s generation and reuse it at the next step, we could save a lot of repeated computations.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NUYi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19a31d21-bf49-4211-bcc9-11050b585429_2403x1521.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NUYi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19a31d21-bf49-4211-bcc9-11050b585429_2403x1521.png 424w, https://substackcdn.com/image/fetch/$s_!NUYi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19a31d21-bf49-4211-bcc9-11050b585429_2403x1521.png 848w, https://substackcdn.com/image/fetch/$s_!NUYi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19a31d21-bf49-4211-bcc9-11050b585429_2403x1521.png 1272w, https://substackcdn.com/image/fetch/$s_!NUYi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19a31d21-bf49-4211-bcc9-11050b585429_2403x1521.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NUYi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19a31d21-bf49-4211-bcc9-11050b585429_2403x1521.png" width="1456" height="922" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/19a31d21-bf49-4211-bcc9-11050b585429_2403x1521.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:922,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!NUYi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19a31d21-bf49-4211-bcc9-11050b585429_2403x1521.png 424w, https://substackcdn.com/image/fetch/$s_!NUYi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19a31d21-bf49-4211-bcc9-11050b585429_2403x1521.png 848w, https://substackcdn.com/image/fetch/$s_!NUYi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19a31d21-bf49-4211-bcc9-11050b585429_2403x1521.png 1272w, https://substackcdn.com/image/fetch/$s_!NUYi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19a31d21-bf49-4211-bcc9-11050b585429_2403x1521.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>If we could somehow reuse part of the computations we did during earlier passes and only process new information as it arrives, not only would the computation speed increase dramatically, but it would also stay constant during the process instead of slowing down as more tokens are generated.</em></p><p>This is not only true during a single request (because we won&#8217;t be recomputing the whole state from the start of the message for every new token we&#8217;re generating), but also across requests in the same chat (by storing the state at the end of the last assistant token) and across different chats as well (by storing the state of shared prefixes such as system prompts).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!9x2H!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c8039cd-0f38-4e6c-a5b6-e32834e6c47d_2403x1521.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9x2H!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c8039cd-0f38-4e6c-a5b6-e32834e6c47d_2403x1521.png 424w, https://substackcdn.com/image/fetch/$s_!9x2H!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c8039cd-0f38-4e6c-a5b6-e32834e6c47d_2403x1521.png 848w, https://substackcdn.com/image/fetch/$s_!9x2H!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c8039cd-0f38-4e6c-a5b6-e32834e6c47d_2403x1521.png 1272w, https://substackcdn.com/image/fetch/$s_!9x2H!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c8039cd-0f38-4e6c-a5b6-e32834e6c47d_2403x1521.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9x2H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c8039cd-0f38-4e6c-a5b6-e32834e6c47d_2403x1521.png" width="1456" height="922" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9c8039cd-0f38-4e6c-a5b6-e32834e6c47d_2403x1521.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:922,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!9x2H!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c8039cd-0f38-4e6c-a5b6-e32834e6c47d_2403x1521.png 424w, https://substackcdn.com/image/fetch/$s_!9x2H!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c8039cd-0f38-4e6c-a5b6-e32834e6c47d_2403x1521.png 848w, https://substackcdn.com/image/fetch/$s_!9x2H!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c8039cd-0f38-4e6c-a5b6-e32834e6c47d_2403x1521.png 1272w, https://substackcdn.com/image/fetch/$s_!9x2H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c8039cd-0f38-4e6c-a5b6-e32834e6c47d_2403x1521.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Example of prefix caching in different scenarios (gray text is cached, black is processed). By caching the system prompt, its cache can be reused with every new chat. By also caching by longest prefix, the prompts may occasionally match across chats, although this depends heavily on your application. In any case, caching the chat as it progresses keeps the number of new tokens to process at each step down to one, making inference much faster.</em></p><p>But how can this be done? What exactly do we need to cache? To understand this we need to go one step deeper.</p><h2>The inference process</h2><p>At a high level, the inference process of a modern decoder-only Transformer such as a GPT works as follows.</p><ul><li><p><strong>Tokenization</strong>: The chat history is broken down into tokens by a tokenizer.
This is a fast, deterministic process that transforms a single string into a list of sub-word fragments (the tokens) plus a number of signalling tokens (to delimit messages, to mark the end of a message, to distinguish different types of input or output tokens such as thinking tokens, function calls, system prompts, and so on).</p></li><li><p><strong>Embedding</strong>: the tokenized text passes through an embedding step, where each token is translated into an embedding (a 1-dimensional vector) using a lookup table. At this point, our input text has become a matrix of values with as many rows as tokens, and a fixed number of columns that depends on the LLM.</p></li><li><p><strong>Decoding</strong>: this matrix is passed through a series of identical decoding steps (12 in the original GPT; modern LLMs often have many more). Each of these blocks outputs a matrix of the same shape and size as the original one, but with updated contents, which is passed as input to the next step. These steps are &#8220;reading&#8221; the prompt and accumulating the information needed to select the next token to generate.</p></li><li><p><strong>Output</strong>: After the last decoding step, a final linear output layer projects the matrix into an output vector.
Its values are multiplied by the lookup table we used during the embedding step: this way we obtain a list of scores (the logits) that, once normalized, represent the probability of each token being the &#8220;correct&#8221; next token.</p></li><li><p><strong>Sampling</strong>: From this list, one of the top-k best tokens is selected as the next token, gets added to the chat history, and the loop restarts.</p></li><li><p><strong>End token</strong>: the decoding stops when the LLM picks an END token or some other condition is met (for example, reaching the maximum output length).</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZxeA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60d3015b-437f-447a-bbe6-a8d0123c4180_2403x3678.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZxeA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60d3015b-437f-447a-bbe6-a8d0123c4180_2403x3678.png 424w, https://substackcdn.com/image/fetch/$s_!ZxeA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60d3015b-437f-447a-bbe6-a8d0123c4180_2403x3678.png 848w, https://substackcdn.com/image/fetch/$s_!ZxeA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60d3015b-437f-447a-bbe6-a8d0123c4180_2403x3678.png 1272w, https://substackcdn.com/image/fetch/$s_!ZxeA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60d3015b-437f-447a-bbe6-a8d0123c4180_2403x3678.png 1456w" sizes="100vw"><img
src="https://substackcdn.com/image/fetch/$s_!ZxeA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60d3015b-437f-447a-bbe6-a8d0123c4180_2403x3678.png" width="1456" height="2229" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/60d3015b-437f-447a-bbe6-a8d0123c4180_2403x3678.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2229,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!ZxeA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60d3015b-437f-447a-bbe6-a8d0123c4180_2403x3678.png 424w, https://substackcdn.com/image/fetch/$s_!ZxeA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60d3015b-437f-447a-bbe6-a8d0123c4180_2403x3678.png 848w, https://substackcdn.com/image/fetch/$s_!ZxeA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60d3015b-437f-447a-bbe6-a8d0123c4180_2403x3678.png 1272w, https://substackcdn.com/image/fetch/$s_!ZxeA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60d3015b-437f-447a-bbe6-a8d0123c4180_2403x3678.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Simplified representation of the inference steps needed for an LLM to generate each output token. The most complex by far is the decoding step, which we are going to analyze in more detail.</em></p><p>As you can see from this breakdown, the LLM computes its internal representation of the chat history through its decoding steps, and recomputes that representation for all tokens every time we want to generate a new one. So let&#8217;s zoom in even more and check what&#8217;s going on inside these decoding steps.</p><h2>The decoding step</h2><p>LLMs may have a variable number of decoding steps (the original GPT used 12; modern models often use far more), but they are all identical, except for the weights they contain.
This means that we can look into one, keeping in mind that the same process is repeated several times.</p><p>Each decoding step contains two parts:</p><ul><li><p>a multi-headed, masked self-attention layer</p></li><li><p>a feed-forward layer</p></li></ul><p>The first layer, the multi-headed masked self-attention, sounds quite complicated. To make things easier, let&#8217;s break it down into smaller concepts.</p><p><strong>Attention</strong> is the foundation of the Transformers&#8217; incredible text understanding skills and can be roughly summarized as a technique that shows the model which tokens are the most relevant to the token we&#8217;re processing right now.</p><p><strong>Self-attention</strong> means that the tokens we&#8217;re looking at belong to the same sentence we&#8217;re processing (which is not the case, for example, during translation tasks, where we have a source sentence and a translation).</p><p><strong>Masked self-attention</strong> means that we&#8217;re only looking at tokens that precede the one we&#8217;re processing (which is not the case, for example, in encoder models such as BERT that encode the whole sentence at once).</p><p><strong>Multi-headed</strong> attention means that the same operation is performed several times with slightly different parameters. Each set of parameters is called an <strong>attention head</strong>.</p><p>To understand what attention does, let&#8217;s take the sentence &#8220;I like apples because they&#8217;re sweet&#8221;. When processing the token &#8220;they&#8221;, the masked self-attention layer will give a high score to &#8220;apples&#8221;, because that&#8217;s what &#8220;they&#8221; refers to.
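</p><p>To make this concrete, here is a toy sketch of masked attention scoring in Python. The 2-d vectors are invented for this example; a real model learns high-dimensional ones during training.</p>

```python
# Toy illustration of masked (causal) attention scores.
# All vectors below are made up for the example sentence;
# "apples" and "they" are given similar directions on purpose.
tokens = ["I", "like", "apples", "because", "they", "sweet"]
vecs = {"I": [0.1, 0.9], "like": [0.4, 0.3], "apples": [0.9, 0.2],
        "because": [0.2, 0.2], "they": [0.8, 0.3], "sweet": [0.5, 0.8]}

def masked_scores(position):
    """Dot-product scores for one token under a causal mask:
    only tokens at or before `position` are visible."""
    q = vecs[tokens[position]]
    return {tokens[j]: q[0] * vecs[tokens[j]][0] + q[1] * vecs[tokens[j]][1]
            for j in range(position + 1)}

scores = masked_scores(tokens.index("they"))
print(max(scores, key=scores.get))  # "apples": the highest-scoring visible token
print("sweet" in scores)            # False: "sweet" comes after "they", so it is masked out
```

<p>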
Keep in mind that &#8220;sweet&#8221; will not be considered while processing &#8220;they&#8221;, because masked self-attention only includes tokens that precede the token in question.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5JdB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90d14238-3846-4fb5-a77f-dce9f1b1bef2_2403x1197.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5JdB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90d14238-3846-4fb5-a77f-dce9f1b1bef2_2403x1197.png 424w, https://substackcdn.com/image/fetch/$s_!5JdB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90d14238-3846-4fb5-a77f-dce9f1b1bef2_2403x1197.png 848w, https://substackcdn.com/image/fetch/$s_!5JdB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90d14238-3846-4fb5-a77f-dce9f1b1bef2_2403x1197.png 1272w, https://substackcdn.com/image/fetch/$s_!5JdB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90d14238-3846-4fb5-a77f-dce9f1b1bef2_2403x1197.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5JdB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90d14238-3846-4fb5-a77f-dce9f1b1bef2_2403x1197.png" width="1456" height="725" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/90d14238-3846-4fb5-a77f-dce9f1b1bef2_2403x1197.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:725,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!5JdB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90d14238-3846-4fb5-a77f-dce9f1b1bef2_2403x1197.png 424w, https://substackcdn.com/image/fetch/$s_!5JdB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90d14238-3846-4fb5-a77f-dce9f1b1bef2_2403x1197.png 848w, https://substackcdn.com/image/fetch/$s_!5JdB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90d14238-3846-4fb5-a77f-dce9f1b1bef2_2403x1197.png 1272w, https://substackcdn.com/image/fetch/$s_!5JdB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90d14238-3846-4fb5-a77f-dce9f1b1bef2_2403x1197.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>A simplified visualization of a masked self-attention head. For each token, the attention calculations will assign a score to each preceding token. The score will be higher for all preceding tokens that have something to do with the current one, highlighting semantic relationships.</em></p><h2>The Q/K/V Matrices</h2><p>Let&#8217;s now look at how this score is calculated. Self-attention is implemented as a series of matrix multiplications involving three matrices.</p><ul><li><p><strong>Q (Query matrix)</strong>: The query is a representation of the token we are &#8220;paying attention to&#8221; (for example, &#8220;they&#8221;; in practice, all tokens are computed at the same time, so we deal with a whole Q matrix).</p></li><li><p><strong>K (Key matrix)</strong>: Key vectors are like labels for all the other preceding tokens in the input.
They&#8217;re what we match against in our search for relevant tokens (for example &#8220;I&#8221;, &#8220;like&#8221;, &#8220;apples&#8221;, and so on). Each token will only see the keys of tokens that precede it, so the query of &#8220;they&#8221; will not be multiplied by the key for &#8220;sweet&#8221;.</p></li><li><p><strong>V (Value matrix)</strong>: Value vectors are the actual token representations. Once we&#8217;ve scored how relevant each token is, these are the values we add up to represent the token we&#8217;re paying attention to. In our example, this means that the vector for &#8220;they&#8221; will be computed as a weighted average of all the previous tokens (&#8220;I&#8221;, &#8220;like&#8221;, &#8220;apples&#8221;, &#8220;because&#8221;), but &#8220;apples&#8221; will be weighted much higher than any other, so the end result for the token &#8220;they&#8221; will be very close to the value vector for &#8220;apples&#8221;.</p></li></ul><p>These Q/K/V matrices are computed by multiplying the input of the decoding layer by three weight matrices (Wq, Wk and Wv) that are learned during training and constitute many of the LLM&#8217;s parameters. These three matrices are referred to collectively as an attention head, as we mentioned earlier.
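</p><p>Spelled out in code, a single attention head boils down to a handful of matrix multiplications. This is only a sketch with invented toy sizes and weights, not a real implementation:</p>

```python
import math

# Toy single attention head: Q/K/V projections followed by masked,
# scaled dot-product attention. All sizes and weights are invented.
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def attention_head(x, wq, wk, wv):
    """Causal scaled dot-product attention for one head."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        # causal mask: token i only sees the keys of tokens 0..i
        scores = [sum(qc * kc for qc, kc in zip(qi, k[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        exps = [math.exp(s) for s in scores]
        weights = [e / sum(exps) for e in exps]
        # row i of the output is the attention-weighted average of the values
        out.append([sum(w * v[j][c] for j, w in enumerate(weights))
                    for c in range(len(v[0]))])
    return out

# three tokens, embedding size 2, identity projections just to run the sketch
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
eye = [[1.0, 0.0], [0.0, 1.0]]
out = attention_head(x, eye, eye, eye)
print(len(out), len(out[0]))  # 3 2 -- same shape as the input
```

<p>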
Modern LLMs usually include several attention heads for each step, so you&#8217;ll have several different matrices in each decoding step (and that&#8217;s why they&#8217;re said to use multi-headed attention).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3mB-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc302e0c4-9287-4b4e-90bc-69a83206cd57_2403x3492.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3mB-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc302e0c4-9287-4b4e-90bc-69a83206cd57_2403x3492.png 424w, https://substackcdn.com/image/fetch/$s_!3mB-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc302e0c4-9287-4b4e-90bc-69a83206cd57_2403x3492.png 848w, https://substackcdn.com/image/fetch/$s_!3mB-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc302e0c4-9287-4b4e-90bc-69a83206cd57_2403x3492.png 1272w, https://substackcdn.com/image/fetch/$s_!3mB-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc302e0c4-9287-4b4e-90bc-69a83206cd57_2403x3492.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3mB-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc302e0c4-9287-4b4e-90bc-69a83206cd57_2403x3492.png" width="1456" height="2116" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c302e0c4-9287-4b4e-90bc-69a83206cd57_2403x3492.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2116,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!3mB-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc302e0c4-9287-4b4e-90bc-69a83206cd57_2403x3492.png 424w, https://substackcdn.com/image/fetch/$s_!3mB-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc302e0c4-9287-4b4e-90bc-69a83206cd57_2403x3492.png 848w, https://substackcdn.com/image/fetch/$s_!3mB-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc302e0c4-9287-4b4e-90bc-69a83206cd57_2403x3492.png 1272w, https://substackcdn.com/image/fetch/$s_!3mB-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc302e0c4-9287-4b4e-90bc-69a83206cd57_2403x3492.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Simplified view of the Q/K/V matrices in a single self-attention head. The matrices go through a few more steps (softmax, regularization etc) which are not depicted here</em>.</p><p>This process of computing the output vector for each token is called <em>scaled dot-product attention</em> and, as we mentioned earlier, happens in every attention head of every decoding step. In summary, <strong>keys (K)</strong> and <strong>values (V)</strong> are the transformed representations of each preceding token that are used to compute attention, and they enable each token to gather information from the rest of the sequence by matching queries to keys and aggregating values.</p><p>Let&#8217;s pay close attention to these computations. We know that LLMs generate output one token at a time. This means that the LLM will recompute the K-V values for the tokens of the prompt over and over again for each new output token it generates. 
If you have already generated, say, 100 tokens of output, producing the 101st token requires recomputing a forward pass over all 100 tokens. A naive implementation would repeatedly recalculate a lot of the same intermediate results for the older tokens at every step of generation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0PW1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d3db558-11e2-4559-8194-1c3a6dc24f6f_2403x2205.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0PW1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d3db558-11e2-4559-8194-1c3a6dc24f6f_2403x2205.png 424w, https://substackcdn.com/image/fetch/$s_!0PW1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d3db558-11e2-4559-8194-1c3a6dc24f6f_2403x2205.png 848w, https://substackcdn.com/image/fetch/$s_!0PW1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d3db558-11e2-4559-8194-1c3a6dc24f6f_2403x2205.png 1272w, https://substackcdn.com/image/fetch/$s_!0PW1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d3db558-11e2-4559-8194-1c3a6dc24f6f_2403x2205.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0PW1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d3db558-11e2-4559-8194-1c3a6dc24f6f_2403x2205.png" width="1456" height="1336" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4d3db558-11e2-4559-8194-1c3a6dc24f6f_2403x2205.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1336,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!0PW1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d3db558-11e2-4559-8194-1c3a6dc24f6f_2403x2205.png 424w, https://substackcdn.com/image/fetch/$s_!0PW1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d3db558-11e2-4559-8194-1c3a6dc24f6f_2403x2205.png 848w, https://substackcdn.com/image/fetch/$s_!0PW1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d3db558-11e2-4559-8194-1c3a6dc24f6f_2403x2205.png 1272w, https://substackcdn.com/image/fetch/$s_!0PW1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d3db558-11e2-4559-8194-1c3a6dc24f6f_2403x2205.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Detail of the Q/K multiplication. As you can see, the content of the QK matrix is essentially the same at all steps, except for the last row. This means that as soon as we accumulate a few input tokens, most of the QK matrix will be nearly identical every time. Something very similar happens for the final QKV matrix.</em></p><p>By the third generation step (three tokens in context), the model computes six attention scores (the lower triangle of a 3&#215;3 matrix); many of these correspond to interactions that were already computed in earlier steps. For example, the attention of token &#8220;I&#8221; with itself was computed in the first step, yet the naive approach computes it again when processing the sequences &#8220;I like&#8221;, &#8220;I like apples&#8221;, and so on. In fact, by the time the sentence is complete, the majority of the query-key pairs being calculated are repeats of prior computations.
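</p><p>A little illustrative arithmetic shows the scale of this waste. The numbers below are hypothetical and simply count one key/value computation per token per generation step:</p>

```python
# Back-of-the-envelope count of key/value computations during generation.
def naive_kv_work(prompt_len, new_tokens):
    """Without caching: every step recomputes K/V for the whole sequence so far."""
    return sum(prompt_len + step for step in range(new_tokens))

def cached_kv_work(prompt_len, new_tokens):
    """With a KV cache: each token's K/V is computed exactly once."""
    return prompt_len + new_tokens

print(naive_kv_work(100, 100))   # 14950
print(cached_kv_work(100, 100))  # 200
```

<p>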
This redundancy makes inference much slower as the sequence length grows: the model wastes time recalculating attention contributions for tokens that haven&#8217;t changed.</p><p>Clearly we want to avoid recomputing things like the key and value vectors for past tokens at every step. That&#8217;s exactly what <strong>KV caching</strong> achieves.</p><h2>The KV Cache</h2><p><strong>KV caching</strong> is an optimization that saves the key and value tensors from previous tokens so that the model doesn&#8217;t need to recompute them for each new token. The idea is straightforward: as the model generates tokens one by one, we store the keys and values produced at each layer for each token in a cache (which is just a reserved chunk of memory, typically in GPU RAM for speed). When the model is about to generate the next token, instead of recomputing all keys and values from scratch for the entire sequence, it retrieves the already-computed keys and values for the past tokens from this cache, and only computes the new token&#8217;s keys and values.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uxKQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b879196-f85d-4366-b564-5ecc72559c03_2403x2181.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uxKQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b879196-f85d-4366-b564-5ecc72559c03_2403x2181.png 424w, https://substackcdn.com/image/fetch/$s_!uxKQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b879196-f85d-4366-b564-5ecc72559c03_2403x2181.png 848w, 
https://substackcdn.com/image/fetch/$s_!uxKQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b879196-f85d-4366-b564-5ecc72559c03_2403x2181.png 1272w, https://substackcdn.com/image/fetch/$s_!uxKQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b879196-f85d-4366-b564-5ecc72559c03_2403x2181.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uxKQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b879196-f85d-4366-b564-5ecc72559c03_2403x2181.png" width="1456" height="1321" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1b879196-f85d-4366-b564-5ecc72559c03_2403x2181.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1321,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!uxKQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b879196-f85d-4366-b564-5ecc72559c03_2403x2181.png 424w, https://substackcdn.com/image/fetch/$s_!uxKQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b879196-f85d-4366-b564-5ecc72559c03_2403x2181.png 848w, 
https://substackcdn.com/image/fetch/$s_!uxKQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b879196-f85d-4366-b564-5ecc72559c03_2403x2181.png 1272w, https://substackcdn.com/image/fetch/$s_!uxKQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b879196-f85d-4366-b564-5ecc72559c03_2403x2181.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In essence, with KV caching the transformer&#8217;s attention in each layer will take the new token&#8217;s query and concatenate it 
(more precisely, its freshly computed key and value) onto the cached keys and values of prior tokens, then compute attention for that single query over the full cached sequence. The result is that each generation step&#8217;s workload is greatly reduced: the model focuses on what&#8217;s new instead of re-hashing the entire context every time.</p><h2>Implementation</h2><p>Modern libraries implement KV caching under the hood by carrying a &#8220;past key values&#8221; or similar object through successive generation calls. For example, the Hugging Face Transformers library&#8217;s <code>generate</code> function uses a <code>use_cache</code> flag that is True by default, meaning it will automatically store and reuse past keys/values between decoding steps. Conceptually, you can imagine that after the first forward pass on the prompt, the model keeps all the K and V tensors. When generating the next token, it feeds only the new token through each layer along with the cached K and V from previous tokens, to compute the next output efficiently.</p><p>In summary, KV caching transforms the workload of each generation step. Without caching, each step <em>repeats</em> the full attention computation over the entire context. With caching, each step adds only the computations for the new token and the necessary interactions with prior tokens. This makes the per-step cost grow only linearly with the context length, instead of repeating the full quadratic computation. The longer the generation goes on, the more time is saved relative to the naive approach. KV caching is thus a <strong>time-memory trade-off</strong>: we trade some memory to store the cache in order to save a lot of compute time on each step.</p><h2>Limitations</h2><p>It&#8217;s important to note that KV caching only applies to <em>auto-regressive decoder</em> models (where the output is generated sequentially). Models like BERT that process entire sequences in one go (and are not generative) do not use KV caching, since they don&#8217;t generate token-by-token or reuse past internal states.
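</p><p>To make the mechanism concrete for the models that do use it, here is a minimal single-head, single-layer NumPy sketch (toy dimensions and random weights; real implementations cache per layer and per head). It checks that incremental decoding with a KV cache produces exactly the same output as recomputing attention over the full sequence:</p>

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
tokens = rng.standard_normal((5, d))  # embeddings of a 5-token sequence

def naive_last_output(x):
    # Recompute Q, K, V for the whole sequence, as a cache-less step would.
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    return softmax(Q[-1] @ K.T / np.sqrt(d)) @ V  # last token attends to all

# Incremental decoding: keep the K and V rows from previous steps,
# compute only the new token's row and its single query.
k_cache = np.empty((0, d))
v_cache = np.empty((0, d))
for x_new in tokens:
    k_cache = np.vstack([k_cache, x_new @ Wk])
    v_cache = np.vstack([v_cache, x_new @ Wv])
    out = softmax((x_new @ Wq) @ k_cache.T / np.sqrt(d)) @ v_cache

assert np.allclose(out, naive_last_output(tokens))
```

<p>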
But for any generative LLM built on a decoder-only Transformer architecture, KV caching is a standard technique to speed up inference.</p><p>It&#8217;s worth noting that the KV cache needs to be managed just like every other type of cache. We&#8217;re going to analyze some ways to handle this cache effectively in another post.</p><h2>Conclusion</h2><p>The takeaway is clear: <strong>always leverage KV caching for autoregressive LLM inference</strong> (and practically all libraries do this for you) unless you have a very specific reason not to. It will make your LLM deployments run faster and more efficiently.</p><p>KV caching exemplifies how understanding the internals of transformer models can lead to substantial engineering improvements. By recognizing that keys and values of the attention mechanism can be reused across time steps, we unlock a simple yet powerful optimization. This ensures that even as our LLMs get larger and our prompts get longer, we can keep inference running quickly, delivering the snappy responses users expect from AI-driven applications.</p><h2>Learn more</h2><p>Here are some useful resources I used to write this post:</p><ul><li><p><a href="https://jalammar.github.io/illustrated-transformer/">The Illustrated Transformer</a> by Jay Alammar</p></li><li><p><a href="https://jalammar.github.io/illustrated-gpt2/">The illustrated GPT-2</a> by Jay Alammar</p></li><li><p><a href="https://platform.openai.com/docs/guides/latency-optimization/3-use-fewer-input-tokens#use-fewer-input-tokens">Latency optimization tips</a> by OpenAI</p></li><li><p><a href="https://medium.com/@joaolages/kv-caching-explained-276520203249">KV Caching explained</a> by Jo&#227;o Lages</p></li><li><p><a href="https://neptune.ai/blog/transformers-key-value-caching">KV Caching</a> by Neptune.ai</p></li><li><p><a href="https://www.manning.com/books/build-a-large-language-model-from-scratch">Build a Large Language Model (from scratch)</a> by Sebastian 
Raschka</p></li></ul>]]></content:encoded></item><item><title><![CDATA[What is prompt caching?]]></title><description><![CDATA[Caching prompts can have an outsized impact on the cost and latency of your AI apps. But what exactly to cache and how?]]></description><link>https://zansara.substack.com/p/what-is-prompt-caching</link><guid isPermaLink="false">https://zansara.substack.com/p/what-is-prompt-caching</guid><dc:creator><![CDATA[Sara Z.]]></dc:creator><pubDate>Fri, 17 Oct 2025 14:28:04 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/251d8185-0fec-452f-8eee-9166ecee7e2b_1568x656.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is episode 2 of a series of shorter blog posts answering questions I received during the course of my work, reflecting common misconceptions and doubts about various generative AI technologies. You can find the whole series here: <a href="https://www.zansara.dev/series/practical-questions">Practical Questions</a>.</em></p><div><hr></div><p>A common piece of advice to improve speed and reduce cost of inference in LLMs is to use prompt caching. However, it&#8217;s often not clear what this means. What exactly is cached? When and why are the improvements really impactful? Understanding prompt caching starts with a deeper awareness of how computation and costs scale with large contexts.</p><h2>LLMs are stateless</h2><p>Each time an LLM processes input, it handles every token of the provided context. LLMs are stateless: this means that for every new message added to an existing chat, your application needs to submit the whole context, which could include system prompts, documents, examples, and all the chat history. The model recomputes all of those tokens each time. This is a massive inefficiency.
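</p><p>To get a feel for the waste, here is a small back-of-the-envelope sketch of how many tokens each request has to process as a chat grows; the sizes (a 10,000-token system prompt, 100-token user messages, 1,000-token replies) are illustrative assumptions:</p>

```python
SYSTEM, USER_MSG, REPLY = 10_000, 100, 1_000  # illustrative token counts

def stateless_tokens(n: int) -> int:
    # Request n resends the system prompt, all n user messages
    # and the n - 1 previous assistant replies.
    return SYSTEM + n * USER_MSG + (n - 1) * REPLY

def cached_tokens(n: int) -> int:
    # With the shared prefix cached, only the latest reply and the
    # new user message still need to be processed.
    return SYSTEM + USER_MSG if n == 1 else REPLY + USER_MSG

print(stateless_tokens(2), cached_tokens(2))    # 11200 1100
print(stateless_tokens(10), cached_tokens(10))  # 20000 1100
```

<p>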
For example, with an input cost around $1 per 1 million tokens, sending 100,000 tokens across 1,000 requests would cost approximately $100, even though about 95% of those tokens remain unchanged across requests. In essence, a large portion of computation is wasted on repeatedly processing information that never changes: the message history.</p><h2>Stateless vs stateful design</h2><p>Naive API implementations that omit caching force the model to process the entire context anew each time. This &#8220;stateless&#8221; method is simpler to implement, but wastefully expensive. The system pays repeatedly to recompute static context, which could otherwise be reused.</p><p>In contrast, with a stateful cache strategy, the system stores parts of the context and only processes new inputs (queries). Consider the following case:</p><ul><li><p>the system prompt is 10,000 tokens long</p></li><li><p>each user message is about 100 tokens</p></li><li><p>each assistant response is about 1000 tokens</p></li></ul><p>In both cases, the first request processes 10,100 tokens (1 system prompt + 1 user message). On the second message, a stateless request (no caching) needs to process 11,200 tokens (1 system prompt + first user message + first assistant response + the next user message) while a stateful one can first load the cache and then process only 1100 new tokens (the assistant response + the new user message). That&#8217;s an order of magnitude fewer tokens! On top of that, as the chat continues, a stateful app only ever needs to process the next new 1100 tokens, while the stateless version processes a chat history that grows by 1100 every time. For example, by the 10th request, with caching you need to process 1100 tokens, while without you need to deal with 20,000!
(10,000 system prompt tokens + 9,000 assistant reply tokens + 1000 user message tokens).</p><p>Here&#8217;s a recap to highlight the difference:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JLcL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0abf150e-e6c1-4ab8-83ce-62741001d486_1792x912.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JLcL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0abf150e-e6c1-4ab8-83ce-62741001d486_1792x912.png 424w, https://substackcdn.com/image/fetch/$s_!JLcL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0abf150e-e6c1-4ab8-83ce-62741001d486_1792x912.png 848w, https://substackcdn.com/image/fetch/$s_!JLcL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0abf150e-e6c1-4ab8-83ce-62741001d486_1792x912.png 1272w, https://substackcdn.com/image/fetch/$s_!JLcL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0abf150e-e6c1-4ab8-83ce-62741001d486_1792x912.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JLcL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0abf150e-e6c1-4ab8-83ce-62741001d486_1792x912.png" width="1456" height="741" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0abf150e-e6c1-4ab8-83ce-62741001d486_1792x912.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:741,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:115696,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://zansara.substack.com/i/176417917?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0abf150e-e6c1-4ab8-83ce-62741001d486_1792x912.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JLcL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0abf150e-e6c1-4ab8-83ce-62741001d486_1792x912.png 424w, https://substackcdn.com/image/fetch/$s_!JLcL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0abf150e-e6c1-4ab8-83ce-62741001d486_1792x912.png 848w, https://substackcdn.com/image/fetch/$s_!JLcL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0abf150e-e6c1-4ab8-83ce-62741001d486_1792x912.png 1272w, https://substackcdn.com/image/fetch/$s_!JLcL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0abf150e-e6c1-4ab8-83ce-62741001d486_1792x912.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>While cache warm-up is not free, it can make a huge difference in the latency of your responses and, if you&#8217;re paying by the input token, reduce the costs by orders of magnitude.</p><h2>Cache Hierarchies</h2><p>Caching&#8217;s benefits come with architectural tradeoffs. Stateless designs are straightforward and predictably expensive: every token is always processed. Caching drastically reduces costs by reusing prior computation, but introduces complexity in cache management, such as:</p><ul><li><p>Cache invalidation: deciding how and when to refresh cached segments.</p></li><li><p>Cache misses: when requested information isn&#8217;t in the cache, leading to full recomputation and latency spikes.</p></li></ul><p>Because of these challenges, a single monolithic cache is usually not enough to see many benefits.
The most effective solution is a <strong>hierarchical cache strategy</strong>.</p><p>Effective prompt caching leverages multiple layers with varied lifetimes and hit rates:</p><ul><li><p><strong>L1: System Prompt (e.g., 5,000 tokens)</strong>: it rarely changes, so it has the best hit rate. In most chats you&#8217;ll at least hit this cache.</p></li><li><p><strong>L2: System Prompt + Examples and Tools (e.g., +20,000 tokens)</strong>: may change per task, so it can have a lower hit rate than the system prompt, but ultimately it depends entirely on your application type. Agentic apps that make heavy use of tools benefit the most from caching them, as tools follow the system prompt and might not depend at all on the user query or the agent&#8217;s decisions.</p></li><li><p><strong>L3: System Prompt + Examples and Tools + Documents (e.g., +50,000 tokens)</strong>: if you&#8217;re working with documents, caching any initial retrieval can help too. These documents are likely to change per user and/or per session, so this layer has a moderate/low hit rate. However, the size of these chunks usually makes it worth it if you have some spare capacity or a small and static knowledge base to retrieve from.</p></li></ul><p>A layered approach like this balances freshness and reuse, optimizing both cost and performance.</p><h2>Automatic prefix caching</h2><p>If you&#8217;re using a modern inference engine, prompt caching can also be done through <strong>automatic prefix caching</strong>, where the engine itself takes responsibility for identifying and caching frequently used prefixes.
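</p><p>Under the hood, engines like vLLM implement this with hash-based block matching: token sequences are split into fixed-size blocks, and each block is keyed by a hash of the <em>entire</em> token prefix up to and including it. The sketch below is a toy version of that bookkeeping (the block size and the string stand-in for the real KV tensors are invented for illustration):</p>

```python
BLOCK = 4  # tokens per cache block (real engines use larger blocks, e.g. 16)

class PrefixCache:
    """Toy automatic prefix cache mapping prefix hashes to (mock) KV data."""

    def __init__(self):
        self.blocks = {}

    def lookup_and_fill(self, tokens):
        hits = misses = 0
        prefix = ()
        # Only full blocks are cacheable; a partial last block is recomputed.
        for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
            prefix += tuple(tokens[i:i + BLOCK])
            key = hash(prefix)  # keyed by the whole prefix, not the block alone
            if key in self.blocks:
                hits += 1       # KV for this prefix can be reused as-is
            else:
                misses += 1
                self.blocks[key] = f"kv-for-block-{i // BLOCK}"  # mock payload
        return hits, misses

cache = PrefixCache()
shared_prompt = list(range(8))  # two full blocks shared by every request
print(cache.lookup_and_fill(shared_prompt + [101, 102, 103, 104, 105]))  # (0, 3)
print(cache.lookup_and_fill(shared_prompt + [201, 202]))                 # (2, 0)
```

<p>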
Here you can find more details about the availability of this feature in <a href="https://docs.vllm.ai/en/latest/design/prefix_caching.html">vLLM</a>, <a href="https://docs.sglang.ai/advanced_features/hicache_best_practices.html">SGLang</a> and <a href="https://github.com/ggml-org/llama.cpp/discussions/8947">llama.cpp</a>, but there are many other engines supporting it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4Avz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc65123d-5ba9-4697-89a5-12d839ae357d_2589x1814.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4Avz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc65123d-5ba9-4697-89a5-12d839ae357d_2589x1814.png 424w, https://substackcdn.com/image/fetch/$s_!4Avz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc65123d-5ba9-4697-89a5-12d839ae357d_2589x1814.png 848w, https://substackcdn.com/image/fetch/$s_!4Avz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc65123d-5ba9-4697-89a5-12d839ae357d_2589x1814.png 1272w, https://substackcdn.com/image/fetch/$s_!4Avz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc65123d-5ba9-4697-89a5-12d839ae357d_2589x1814.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4Avz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc65123d-5ba9-4697-89a5-12d839ae357d_2589x1814.png" width="1456" height="1020"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dc65123d-5ba9-4697-89a5-12d839ae357d_2589x1814.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1020,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!4Avz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc65123d-5ba9-4697-89a5-12d839ae357d_2589x1814.png 424w, https://substackcdn.com/image/fetch/$s_!4Avz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc65123d-5ba9-4697-89a5-12d839ae357d_2589x1814.png 848w, https://substackcdn.com/image/fetch/$s_!4Avz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc65123d-5ba9-4697-89a5-12d839ae357d_2589x1814.png 1272w, https://substackcdn.com/image/fetch/$s_!4Avz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc65123d-5ba9-4697-89a5-12d839ae357d_2589x1814.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>A feature comparison across inference engines from <a href="https://arxiv.org/pdf/2505.01658">this May 2025 review</a>.</em></p><h2>Semantic caching</h2><p>In extreme cases where cost, load or latency must be reduced to the maximum, semantic caching can also be employed. Semantic caching also lets you cache user queries and assistant responses, by keeping a registry of already processed user queries and performing a semantic search step between each new query and the cached ones. If a match is found, instead of invoking the LLM to generate a new answer, the cached reply is sent to the user immediately.</p><p>Semantic caching, however, has several disadvantages that make it worthwhile only in rare situations:</p><ul><li><p><strong>Access control</strong>.
Caching must be done per user if each user has access to a different set of resources, to avoid accidental sharing of data and/or resources across users.</p></li><li><p><strong>Very high similarity needed</strong>: In order for the reply to be relevant, the semantic similarity between the two queries must be extremely high, or you risk that the answer returned to the user won&#8217;t match their question. Semantic similarity tends to overlook details which are often very important to an accurate reply: for example, &#8220;What&#8217;s the sum of these numbers: 1,2,3,4,5,6,7?&#8221; and &#8220;What&#8217;s the sum of these numbers: 1,2,3,4,5,6,7,8?&#8221; will have an extremely high similarity, but returning the response of the first to the second would not be a good idea.</p></li><li><p><strong>Language management</strong>: what to do when the exact same question is asked in two different languages? Semantic similarity may be perfect if your embedder is multilingual, but the user won&#8217;t be pleased to receive a cached answer in a language different from their own.</p></li></ul><p>Such constraints make cache misses extremely frequent, which defeats the purpose of keeping a cache and simply adds complexity and latency to the system instead of reducing them. The similarity pitfalls also introduce nasty accuracy problems.</p><p>In my personal experience, semantic caching is only useful for extremely high-volume, low-cost, public-facing interfaces where accuracy is not critical. A perfect example could be a virtual assistant for anonymous customer support, or a helper bot for a software&#8217;s documentation search. In any case, you usually need additional checks on the output in order to trust such a system.</p><h2>Conclusion</h2><p>Prompt caching is not just about cutting costs or speeding things up: it is a necessary architectural approach that addresses the quadratic computational cost inherent in large-context LLM processing.
Without it, your backend will repeatedly recompute largely static information, wasting resources and imposing latency penalties that impact your user&#8217;s experience. By adopting hierarchical, stateful caching and carefully designing prompts, you can reduce token processing costs and latency by orders of magnitude, which is key for building sustainable, high-performance applications.</p>]]></content:encoded></item><item><title><![CDATA[Why using a reranker? ]]></title><description><![CDATA[And is the added latency worth it? Let's understand what they do and how they can improve the quality of your RAG pipelines so drastically.]]></description><link>https://zansara.substack.com/p/why-using-a-reranker</link><guid isPermaLink="false">https://zansara.substack.com/p/why-using-a-reranker</guid><dc:creator><![CDATA[Sara Z.]]></dc:creator><pubDate>Wed, 15 Oct 2025 10:45:35 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/2fc9903d-93f7-46d9-84e8-a358ed851c81_2559x1070.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is episode 1 of a series of shorter blog posts answering questions I received during the course of my work on Generative AI, reflecting common misconceptions and doubts about various generative AI technologies. You can find the whole series here: <a href="https://www.zansara.dev/series/practical-questions">Practical Questions</a>.</em></p><div><hr></div><p>Retrieval-Augmented Generation (RAG) systems are essential to connect large language models with external knowledge sources. While in theory the retrieval step is enough to gather documents that are relevant to the user&#8217;s request, it&#8217;s often recommended to add an additional ranking step, called <em>reranking</em>, to further filter the results.</p><p>But why do we need rerankers? Isn&#8217;t semantic search good enough?
The answer lies in understanding the limitations of traditional embedding-based retrieval.</p><h2>Bi-encoders vs Cross-encoders</h2><p>At the heart of modern, scalable semantic search systems lies the <strong>bi-encoder</strong> model. This architecture creates independent vector representations for the query and the document; relevance is then computed through a similarity measure like the dot product or cosine similarity between those vectors.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UWGz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3e4b02a-8595-4607-b665-8041e7390e59_2400x1959.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UWGz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3e4b02a-8595-4607-b665-8041e7390e59_2400x1959.png 424w, https://substackcdn.com/image/fetch/$s_!UWGz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3e4b02a-8595-4607-b665-8041e7390e59_2400x1959.png 848w, https://substackcdn.com/image/fetch/$s_!UWGz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3e4b02a-8595-4607-b665-8041e7390e59_2400x1959.png 1272w, https://substackcdn.com/image/fetch/$s_!UWGz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3e4b02a-8595-4607-b665-8041e7390e59_2400x1959.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!UWGz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3e4b02a-8595-4607-b665-8041e7390e59_2400x1959.png" width="1456" height="1188" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c3e4b02a-8595-4607-b665-8041e7390e59_2400x1959.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1188,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!UWGz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3e4b02a-8595-4607-b665-8041e7390e59_2400x1959.png 424w, https://substackcdn.com/image/fetch/$s_!UWGz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3e4b02a-8595-4607-b665-8041e7390e59_2400x1959.png 848w, https://substackcdn.com/image/fetch/$s_!UWGz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3e4b02a-8595-4607-b665-8041e7390e59_2400x1959.png 1272w, https://substackcdn.com/image/fetch/$s_!UWGz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3e4b02a-8595-4607-b665-8041e7390e59_2400x1959.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This design scales well: You can precompute document embeddings, store them in your vector DB, and compare any incoming query against millions of documents very efficiently. However, this convenience comes at a cost: <strong>the system never truly reads the document in the context of the query</strong>. 
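</p><p>The independence of the two encodings is easy to see in code. Below is a minimal sketch of the bi-encoder pattern, where a toy hashing encoder stands in for a real trained embedding model (the corpus and all names are invented for illustration):</p>

```python
import math
import zlib
from collections import Counter

def embed(text: str, dim: int = 64) -> list[float]:
    # Toy stand-in for a trained embedding model: hash each word into a
    # fixed-size bag-of-words vector. Real bi-encoders use a neural encoder.
    vec = [0.0] * dim
    for word, count in Counter(text.lower().split()).items():
        vec[zlib.crc32(word.encode()) % dim] += count
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Documents are embedded once, offline, with no knowledge of future queries.
docs = [
    "you should always protect your systems from ddos attacks",
    "how to bake sourdough bread at home",
]
doc_vecs = [embed(d) for d in docs]

# At query time only the query is encoded; relevance is a pure vector
# comparison, with no token-level interaction between query and document.
query_vec = embed("how to protect my application from ddos attacks")
scores = [cosine(query_vec, dv) for dv in doc_vecs]
```

<p>Note how the DDOS statement scores highest even though it does not answer the question: similarity between independently computed vectors is exactly the signal a reranker is meant to double-check.</p><p>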
There&#8217;s no token-level interaction between the query and document embeddings to judge whether the document actually answers the question or simply happens to discuss the same topic and is therefore semantically similar.</p><p>For example, the query &#8220;How to protect my application from DDOS attacks?&#8221; may be semantically close to the statement &#8220;You should always take steps to protect your systems from DDOS attacks&#8221;, but the statement does not contain the answer to the question. Without reranking, embedding-based retrieval systems often perform well at recall but poorly at precision.</p><p><strong>Cross-encoders</strong> remedy this limitation by encoding the query and document together, typically separated by a special token (like <code>[SEP]</code>), using an encoder-only Transformer such as BERT. On top of the encoder sits an additional fully connected layer that acts as a classifier, learning fine-grained token-level interactions that capture query/document word alignments, answer containment (i.e., does this passage actually answer the question?) 
and overall contextual relevance.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8VTI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0731778f-6fa4-42fb-8ae9-2681f43520c9_2400x1746.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8VTI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0731778f-6fa4-42fb-8ae9-2681f43520c9_2400x1746.png 424w, https://substackcdn.com/image/fetch/$s_!8VTI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0731778f-6fa4-42fb-8ae9-2681f43520c9_2400x1746.png 848w, https://substackcdn.com/image/fetch/$s_!8VTI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0731778f-6fa4-42fb-8ae9-2681f43520c9_2400x1746.png 1272w, https://substackcdn.com/image/fetch/$s_!8VTI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0731778f-6fa4-42fb-8ae9-2681f43520c9_2400x1746.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8VTI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0731778f-6fa4-42fb-8ae9-2681f43520c9_2400x1746.png" width="1456" height="1059" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0731778f-6fa4-42fb-8ae9-2681f43520c9_2400x1746.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1059,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!8VTI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0731778f-6fa4-42fb-8ae9-2681f43520c9_2400x1746.png 424w, https://substackcdn.com/image/fetch/$s_!8VTI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0731778f-6fa4-42fb-8ae9-2681f43520c9_2400x1746.png 848w, https://substackcdn.com/image/fetch/$s_!8VTI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0731778f-6fa4-42fb-8ae9-2681f43520c9_2400x1746.png 1272w, https://substackcdn.com/image/fetch/$s_!8VTI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0731778f-6fa4-42fb-8ae9-2681f43520c9_2400x1746.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This difference (from separate encodings to a joint representation) gives cross-encoders their power but also their cost. Since relevance depends on the specific query, you can&#8217;t precompute the document embeddings: in fact, the concept of &#8220;query embedding&#8221; and &#8220;document embeddings&#8221; disappears. Every query-document pair requires a fresh forward pass through the whole model, which can be prohibitively expensive on a large corpus.</p><h2>To each their place</h2><p>No production system can afford to run interaction-rich models such as cross-encoders on millions of documents per query. Therefore, the two-stage retrieval pipeline remains the industry standard:</p><ol><li><p><strong>Semantic Search (Bi-Encoder)</strong> &#8211; Quickly narrows a massive corpus (e.g., millions of document chunks) down to a small candidate set (e.g., top 100 chunks). 
Bi-encoders can be built with any embedding model: popular closed source embedders include OpenAI&#8217;s, Voyage.ai, Cohere&#8217;s, Gemini and more, while on the open-source front you can find BGE embedders, Mistral&#8217;s models, Jina.ai, Gemma, IBM Granite, <a href="https://huggingface.co/models?search=embedding">and more</a>.</p></li><li><p><strong>Reranking (Cross-Encoder)</strong> &#8211; Evaluates those top 100 candidates more deeply by jointly encoding the query and document. A popular closed source choice for reranking models is Cohere&#8217;s, while on the open source front you can find several Qwen-based rerankers, Jina.ai models, IBM&#8217;s Granite rerankers, BGE rerankers, and <a href="https://huggingface.co/models?search=reranker">many more</a>.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZPv4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84f77875-ece1-4cd8-911b-79e52341a1f8_2400x2421.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZPv4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84f77875-ece1-4cd8-911b-79e52341a1f8_2400x2421.png 424w, https://substackcdn.com/image/fetch/$s_!ZPv4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84f77875-ece1-4cd8-911b-79e52341a1f8_2400x2421.png 848w, https://substackcdn.com/image/fetch/$s_!ZPv4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84f77875-ece1-4cd8-911b-79e52341a1f8_2400x2421.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ZPv4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84f77875-ece1-4cd8-911b-79e52341a1f8_2400x2421.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZPv4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84f77875-ece1-4cd8-911b-79e52341a1f8_2400x2421.png" width="1456" height="1469" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/84f77875-ece1-4cd8-911b-79e52341a1f8_2400x2421.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1469,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!ZPv4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84f77875-ece1-4cd8-911b-79e52341a1f8_2400x2421.png 424w, https://substackcdn.com/image/fetch/$s_!ZPv4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84f77875-ece1-4cd8-911b-79e52341a1f8_2400x2421.png 848w, https://substackcdn.com/image/fetch/$s_!ZPv4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84f77875-ece1-4cd8-911b-79e52341a1f8_2400x2421.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ZPv4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84f77875-ece1-4cd8-911b-79e52341a1f8_2400x2421.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>Making Reranking Practical</h2><p>Even in this two-tiered system, reranking may turn out to be too expensive for your latency constraints, but several engineering and modeling strategies have emerged to make it viable in production. 
Let&#8217;s break down a few of these methods.</p><ol><li><p><strong>Model Distillation</strong><br>Distillation transfers the knowledge from a large, high-performing cross-encoder (often based on 12-layer BERT or similar) into a smaller student model (e.g., 6 layers, or even lighter). The process involves training the smaller model to mimic the scores or output logits of the larger one on a large set of query&#8211;document pairs. While distillation inevitably loses some performance, careful tuning, domain-specific data, and intermediate-layer supervision can retain more than 90% of the original ranking quality at a fraction of the inference cost. You can learn more about model distillation <a href="https://www.sbert.net/examples/sentence_transformer/training/distillation/README.html">here</a>.</p></li><li><p><strong>Listwise Reranking</strong><br>Instead of scoring each query&#8211;document pair independently, listwise reranking generates scores for all top-k candidates in a single forward pass. This approach rearranges the candidates into a batched tensor, leveraging GPU parallelism to process them together and reducing the overhead of repeated encoder calls. Some implementations also use listwise loss functions (such as ListNet or LambdaMART-inspired objectives) to better preserve ranking order during training. To learn more about ranking in machine learning, have a look at <a href="https://towardsdatascience.com/ranking-basics-pointwise-pairwise-listwise-cd5318f86e1b/">this post</a>.</p></li><li><p><strong>Late Interaction Models (e.g., ColBERT)</strong><br>Late interaction approaches store token-level embeddings of documents from fine-tuned contextual models. At query time, the system encodes the query tokens and performs efficient maximum-similarity matching between query tokens and stored document tokens. By avoiding a full joint encoding across all tokens, these models approximate cross-encoder analysis but keep retrieval latency close to bi-encoder speeds. 
This approach can either substitute or complement cross-encoders by quickly reducing the candidates list returned from the vector database. To learn more about this approach, have a look at <a href="https://medium.com/@aimichael/cross-encoders-colbert-and-llm-based-re-rankers-a-practical-guide-a23570d88548">this blog post</a> or <a href="https://arxiv.org/abs/2004.12832">the ColBERT paper</a>.</p></li><li><p><strong>Candidate Filtering and Adaptive k</strong><br>Rather than always reranking a fixed top-k (like 100 documents), systems can use heuristics or intermediate classifiers to select fewer candidates when confidence in retrieval is high. This adaptive approach can cut reranking costs significantly while preserving precision in challenging cases.</p></li><li><p><strong>Approximate Cross-Attention Mechanisms</strong><br>Instead of computing full self-attention across combined query and document tokens, some approaches reduce complexity by limiting cross-attention depth or dimensionality &#8212; for example, attending only to the top N most informative tokens, or pruning low-importance attention heads. This can drastically lower token computations while maintaining critical interaction signals.</p></li><li><p><strong>Caching for Frequent Queries</strong><br>In platforms where certain queries or query patterns repeat, caching reranking results or partial computations can remove the need to rerun the full cross-encoder. Combined with normalization and paraphrase detection, such caches can return precise results instantly for repeated requests.</p></li></ol><p>In production pipelines, these methods are often stacked: for example, using late interaction for most queries, distillation for cost control, and adaptive candidate selection to minimize unnecessary work. 
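</p><p>To make the late-interaction idea concrete, here is a minimal sketch of the MaxSim scoring step used by ColBERT-style models: each query token keeps only its best match among the document tokens, and the per-token maxima are summed. The tiny 2-d &#8220;token embeddings&#8221; here are invented for illustration; real systems store vectors produced by a fine-tuned encoder:</p>

```python
import math

def maxsim_score(query_tokens, doc_tokens):
    # ColBERT-style late interaction: for each query token embedding, take
    # its maximum similarity over all document token embeddings, then sum.
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v] if n else v

    q = [normalize(t) for t in query_tokens]
    d = [normalize(t) for t in doc_tokens]
    return sum(max(dot(qt, dt) for dt in d) for qt in q)

# Two-dimensional toy token embeddings: one query with two tokens, and two
# candidate documents whose token vectors were computed (and stored) offline.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.1, 0.9]]  # has a close match for both query tokens
doc_b = [[1.0, 0.0], [0.8, 0.2]]  # only matches the first query token well

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
```

<p>Because the document token vectors are precomputed, only the query pass plus these cheap max/dot operations happen at query time, which is why latency stays close to bi-encoder retrieval.</p><p>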
The overarching theme is balancing precision and latency, ensuring that rerankers deliver their interaction-driven relevance boost without overwhelming the system&#8217;s budget or responsiveness.</p>]]></content:encoded></item><item><title><![CDATA[Trying to play "Guess Who" with an LLM]]></title><description><![CDATA[I expected a different kind of fun.]]></description><link>https://zansara.substack.com/p/trying-to-play-guess-who-with-an</link><guid isPermaLink="false">https://zansara.substack.com/p/trying-to-play-guess-who-with-an</guid><dc:creator><![CDATA[Sara Z.]]></dc:creator><pubDate>Mon, 15 Sep 2025 13:09:41 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/83295b9c-e57b-4d90-927d-ea198578a1f1_1023x428.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em><a href="https://www.zansara.dev/posts/2025-09-15-playing-guess-who-with-an-llm/">Read this post on my blog</a></em></p><div><hr></div><p>A few days ago I came to a realization. Modern LLMs can do a lot of things: they can <a href="https://www.anthropic.com/news/claude-for-chrome">use a browser</a> just like a human, they can (<a href="https://dynomight.net/chess/">sometimes</a>) <a href="https://maxim-saplin.github.io/llm_chess/">play chess</a>, and they seem to be so smart that they apparently can be trusted as personal assistants: they can read and reply to emails, organize trips, do shopping online on your behalf, and so on.</p><p>If that&#8217;s the case, I thought, it should be possible to also play some tabletop games with them!</p><p>After all, many simple tabletop games don&#8217;t require a lot of skill to play. You need to be able to read and understand the rules (very easy for an LLM), you need eyes to see the board (piece of cake for a multimodal LLM), and some way to interact with the board (most LLMs are able to call tools nowadays). So I figured it would be a nice idea to try and figure out which of these LLMs is the most fun to play with. 
Maybe the charming personality of GPT-4o? Or the clever Claude Opus 4?</p><p>I did not expect any of the results I got.</p><h1>Building the game</h1><p>In order to be fair to dumber LLMs, I decided to start with a very simple tabletop game: <a href="https://en.wikipedia.org/wiki/Guess_Who%3F">Guess Who</a>. If you are not familiar with &#8220;Guess Who&#8221;, here is a quick recap of the rules:</p><ul><li><p>Each player has a board full of characters.</p></li><li><p>Each player draws an additional random character.</p></li><li><p>Your goal is to guess which character the other player has received by asking yes/no questions, such as &#8220;Is your character male?&#8221; or &#8220;Does your character have black hair?&#8221;, and so on.</p></li><li><p>The first player to guess the name of the opponent&#8217;s character wins.</p></li></ul><p>As you can see, we&#8217;re not talking about a complex game like Catan or a strategy game like chess, but a simple, fun tabletop game suitable for kids too.</p><p>In order to build the game, as I am no frontend developer, I spent a few too many bucks on my favorite vibe-coding tool, <a href="https://www.anthropic.com/claude-code">Claude Code</a>, padded it with a bit of <a href="https://github.com/google-gemini/gemini-cli">Gemini CLI</a> when I ran out of credits, made a few tweaks by hand when asking the bots to do so felt overkill, and a few evenings later I had <a href="https://www.zansara.dev/guess-who/">this nice Guess Who game</a> live.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DJa-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb85423db-d862-4a2e-a8ba-e60b5dbe67f8_2751x2108.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!DJa-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb85423db-d862-4a2e-a8ba-e60b5dbe67f8_2751x2108.png 424w, https://substackcdn.com/image/fetch/$s_!DJa-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb85423db-d862-4a2e-a8ba-e60b5dbe67f8_2751x2108.png 848w, https://substackcdn.com/image/fetch/$s_!DJa-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb85423db-d862-4a2e-a8ba-e60b5dbe67f8_2751x2108.png 1272w, https://substackcdn.com/image/fetch/$s_!DJa-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb85423db-d862-4a2e-a8ba-e60b5dbe67f8_2751x2108.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DJa-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb85423db-d862-4a2e-a8ba-e60b5dbe67f8_2751x2108.png" width="1456" height="1116" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b85423db-d862-4a2e-a8ba-e60b5dbe67f8_2751x2108.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1116,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!DJa-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb85423db-d862-4a2e-a8ba-e60b5dbe67f8_2751x2108.png 424w, https://substackcdn.com/image/fetch/$s_!DJa-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb85423db-d862-4a2e-a8ba-e60b5dbe67f8_2751x2108.png 848w, https://substackcdn.com/image/fetch/$s_!DJa-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb85423db-d862-4a2e-a8ba-e60b5dbe67f8_2751x2108.png 1272w, https://substackcdn.com/image/fetch/$s_!DJa-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb85423db-d862-4a2e-a8ba-e60b5dbe67f8_2751x2108.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Feel free to play a few rounds using your favorite LLM. The game supports OpenAI-compatible endpoints, plus Anthropic&#8217;s and Google&#8217;s APIs. And if you don&#8217;t trust me with your API key, go ahead and <a href="https://github.com/ZanSara/guess-who">fork or clone the game</a> (and maybe leave a &#11088; while you&#8217;re at it), host it where you like (it&#8217;s a single HTML page with a bit of vanilla JS on the side) and have fun.</p><p>Now for the spoilers.</p><h1>Not as many LLMs</h1><p>One of the first surprises was that, in practice, there aren&#8217;t as many models that are capable of vision and tool calling at the same time. Apart from flagship models such as GPTs and Claude, OSS options were limited. Even GPT-OSS, unfortunately, does not support vision. 
I was especially surprised to learn that I could not play with any version of popular Chinese models such as Qwen or DeepSeek, as they&#8217;re either text-only or unable to call tools.</p><p>Either way, using a mix of proprietary hosting, <a href="https://openrouter.ai/">OpenRouter</a> and <a href="https://www.together.ai/">Together.ai</a>, I had plenty of models to try and ended up trying out 21:</p><ul><li><p>Amazon Nova Pro v1</p></li><li><p>Amazon Nova Lite v1</p></li><li><p>Claude Opus 4.1</p></li><li><p>Claude Opus 4.0</p></li><li><p>Claude Sonnet 4.0</p></li><li><p>Claude Sonnet 3.7</p></li><li><p>Gemini 2.5 Pro</p></li><li><p>Gemini 2.5 Flash</p></li><li><p>Gemini 2.5 Flash Lite</p></li><li><p>GLM 4.5</p></li><li><p>Grok 4</p></li><li><p>GPT 5</p></li><li><p>GPT 5 Nano</p></li><li><p>GPT 5 Mini</p></li><li><p>GPT 4o</p></li><li><p>Llama 4 Maverick</p></li><li><p>Llama 4 Scout</p></li><li><p>Mistral Medium 3.1</p></li><li><p>Mistral Small 3.2</p></li><li><p>Sonoma Dusk Alpha</p></li><li><p>Sonoma Sky Alpha</p></li></ul><p>It may sound like a lot of work, but as you&#8217;ll see in a minute, for many of them it didn&#8217;t take me long to form an opinion about their skill.</p><h1>The prompts</h1><p>Starting from the assumption that playing Guess Who should be within the cognitive abilities of most modern LLMs, I decided to settle for a simple system prompt, something that resembles the way I would explain the game to a fellow human.</p><blockquote><p>You are an AI assistant playing &#8220;Guess Who&#8221; against the user. 
Here&#8217;s how the game works:</p><ul><li><p>You&#8217;ll receive the board and your character image</p></li><li><p>You must try to guess the user&#8217;s character by asking yes/no questions</p></li><li><p>You must answer the user&#8217;s yes/no questions about your character</p></li><li><p>One question per player per turn, no exceptions</p></li><li><p>You can eliminate characters from your board based on the user&#8217;s answers using the eliminateCharacter tool (this will only update the UI, so keep in mind who you&#8217;re eliminating)</p></li><li><p>The first player to correctly guess the opponent&#8217;s character wins the game. When the user guesses your character or you guess theirs, call the endGame tool</p></li></ul></blockquote><p>After this system prompt, I send two more prompts:</p><blockquote><p>Here is the board:</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!t1JG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff332f1a5-c334-415b-9d23-ffc02c7838ae_1024x873.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!t1JG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff332f1a5-c334-415b-9d23-ffc02c7838ae_1024x873.png 424w, https://substackcdn.com/image/fetch/$s_!t1JG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff332f1a5-c334-415b-9d23-ffc02c7838ae_1024x873.png 848w, https://substackcdn.com/image/fetch/$s_!t1JG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff332f1a5-c334-415b-9d23-ffc02c7838ae_1024x873.png 1272w, 
https://substackcdn.com/image/fetch/$s_!t1JG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff332f1a5-c334-415b-9d23-ffc02c7838ae_1024x873.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!t1JG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff332f1a5-c334-415b-9d23-ffc02c7838ae_1024x873.png" width="1024" height="873" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f332f1a5-c334-415b-9d23-ffc02c7838ae_1024x873.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:873,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!t1JG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff332f1a5-c334-415b-9d23-ffc02c7838ae_1024x873.png 424w, https://substackcdn.com/image/fetch/$s_!t1JG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff332f1a5-c334-415b-9d23-ffc02c7838ae_1024x873.png 848w, https://substackcdn.com/image/fetch/$s_!t1JG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff332f1a5-c334-415b-9d23-ffc02c7838ae_1024x873.png 1272w, 
https://substackcdn.com/image/fetch/$s_!t1JG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff332f1a5-c334-415b-9d23-ffc02c7838ae_1024x873.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><blockquote><p>and here is your character:</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!shr5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefda78b6-ae04-414b-b0de-f85ab76d49b9_150x200.png" 
data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!shr5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefda78b6-ae04-414b-b0de-f85ab76d49b9_150x200.png 424w, https://substackcdn.com/image/fetch/$s_!shr5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefda78b6-ae04-414b-b0de-f85ab76d49b9_150x200.png 848w, https://substackcdn.com/image/fetch/$s_!shr5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefda78b6-ae04-414b-b0de-f85ab76d49b9_150x200.png 1272w, https://substackcdn.com/image/fetch/$s_!shr5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefda78b6-ae04-414b-b0de-f85ab76d49b9_150x200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!shr5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefda78b6-ae04-414b-b0de-f85ab76d49b9_150x200.png" width="150" height="200" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/efda78b6-ae04-414b-b0de-f85ab76d49b9_150x200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:200,&quot;width&quot;:150,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!shr5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefda78b6-ae04-414b-b0de-f85ab76d49b9_150x200.png 424w, https://substackcdn.com/image/fetch/$s_!shr5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefda78b6-ae04-414b-b0de-f85ab76d49b9_150x200.png 848w, https://substackcdn.com/image/fetch/$s_!shr5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefda78b6-ae04-414b-b0de-f85ab76d49b9_150x200.png 1272w, https://substackcdn.com/image/fetch/$s_!shr5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefda78b6-ae04-414b-b0de-f85ab76d49b9_150x200.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Unfortunately, these two prompts must be user prompts (not system prompts), because some LLMs (looking at you, Mistral!) do not support images in their system prompts.</p><p>Finally, when the user presses the Start button, one more system message is sent:</p><blockquote><p>Generate a brief, friendly greeting message to start a Guess Who game. Tell the user whether you received the images of your board and your character and ask them for their first question. Keep it conversational and under 2 sentences.</p></blockquote><p>The LLM also receives two tools to use:</p><ul><li><p><code>eliminateCharacter</code>, described as &#8220;Eliminate a character from your board when you learn they cannot be the user&#8217;s character&#8221;.</p></li><li><p><code>endGame</code>, described as &#8220;When you or the user guess correctly, call this tool to end the game.&#8221;</p></li></ul><h1>Playing</h1><p>With the game implemented and ready to go, I finally started playing a bit. 
I was especially curious how small models could deal with a game like this, so I began with GPT-5 Mini. Here is what happens:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8ZNw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F658b0353-fc7e-4420-82da-cc6e8db686d5_1720x2056.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8ZNw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F658b0353-fc7e-4420-82da-cc6e8db686d5_1720x2056.png 424w, https://substackcdn.com/image/fetch/$s_!8ZNw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F658b0353-fc7e-4420-82da-cc6e8db686d5_1720x2056.png 848w, https://substackcdn.com/image/fetch/$s_!8ZNw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F658b0353-fc7e-4420-82da-cc6e8db686d5_1720x2056.png 1272w, https://substackcdn.com/image/fetch/$s_!8ZNw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F658b0353-fc7e-4420-82da-cc6e8db686d5_1720x2056.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8ZNw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F658b0353-fc7e-4420-82da-cc6e8db686d5_1720x2056.png" width="1456" height="1740" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/658b0353-fc7e-4420-82da-cc6e8db686d5_1720x2056.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1740,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!8ZNw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F658b0353-fc7e-4420-82da-cc6e8db686d5_1720x2056.png 424w, https://substackcdn.com/image/fetch/$s_!8ZNw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F658b0353-fc7e-4420-82da-cc6e8db686d5_1720x2056.png 848w, https://substackcdn.com/image/fetch/$s_!8ZNw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F658b0353-fc7e-4420-82da-cc6e8db686d5_1720x2056.png 1272w, https://substackcdn.com/image/fetch/$s_!8ZNw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F658b0353-fc7e-4420-82da-cc6e8db686d5_1720x2056.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Ahah, GPT-5 Mini is far dumber than I thought! 
Let&#8217;s try Gemini 2.5 Flash instead.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NLyR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff65a29e8-cdcf-4fdb-be92-885305e33f9e_1764x2108.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NLyR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff65a29e8-cdcf-4fdb-be92-885305e33f9e_1764x2108.png 424w, https://substackcdn.com/image/fetch/$s_!NLyR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff65a29e8-cdcf-4fdb-be92-885305e33f9e_1764x2108.png 848w, https://substackcdn.com/image/fetch/$s_!NLyR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff65a29e8-cdcf-4fdb-be92-885305e33f9e_1764x2108.png 1272w, https://substackcdn.com/image/fetch/$s_!NLyR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff65a29e8-cdcf-4fdb-be92-885305e33f9e_1764x2108.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NLyR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff65a29e8-cdcf-4fdb-be92-885305e33f9e_1764x2108.png" width="1456" height="1740" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f65a29e8-cdcf-4fdb-be92-885305e33f9e_1764x2108.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1740,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!NLyR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff65a29e8-cdcf-4fdb-be92-885305e33f9e_1764x2108.png 424w, https://substackcdn.com/image/fetch/$s_!NLyR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff65a29e8-cdcf-4fdb-be92-885305e33f9e_1764x2108.png 848w, https://substackcdn.com/image/fetch/$s_!NLyR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff65a29e8-cdcf-4fdb-be92-885305e33f9e_1764x2108.png 1272w, https://substackcdn.com/image/fetch/$s_!NLyR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff65a29e8-cdcf-4fdb-be92-885305e33f9e_1764x2108.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Oh wow, this is incredible. OK, time to try a smarter model and have some actual fun. 
Claude Sonnet 4.0 will do for now.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QJn1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22b2adfb-198e-41b8-9337-1dff3dc7d50e_1788x2108.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QJn1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22b2adfb-198e-41b8-9337-1dff3dc7d50e_1788x2108.png 424w, https://substackcdn.com/image/fetch/$s_!QJn1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22b2adfb-198e-41b8-9337-1dff3dc7d50e_1788x2108.png 848w, https://substackcdn.com/image/fetch/$s_!QJn1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22b2adfb-198e-41b8-9337-1dff3dc7d50e_1788x2108.png 1272w, https://substackcdn.com/image/fetch/$s_!QJn1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22b2adfb-198e-41b8-9337-1dff3dc7d50e_1788x2108.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QJn1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22b2adfb-198e-41b8-9337-1dff3dc7d50e_1788x2108.png" width="1456" height="1717" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/22b2adfb-198e-41b8-9337-1dff3dc7d50e_1788x2108.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1717,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!QJn1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22b2adfb-198e-41b8-9337-1dff3dc7d50e_1788x2108.png 424w, https://substackcdn.com/image/fetch/$s_!QJn1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22b2adfb-198e-41b8-9337-1dff3dc7d50e_1788x2108.png 848w, https://substackcdn.com/image/fetch/$s_!QJn1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22b2adfb-198e-41b8-9337-1dff3dc7d50e_1788x2108.png 1272w, https://substackcdn.com/image/fetch/$s_!QJn1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22b2adfb-198e-41b8-9337-1dff3dc7d50e_1788x2108.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>At this point it started to become unbelievable. Did I fail to explain the game? Is something wrong with the prompts? 
It couldn&#8217;t be, because some other models (such as the almighty GPT-4o) do what I expect instead:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!K4YJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa089f811-28ef-4e09-beee-f41081f96b09_1737x2077.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!K4YJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa089f811-28ef-4e09-beee-f41081f96b09_1737x2077.png 424w, https://substackcdn.com/image/fetch/$s_!K4YJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa089f811-28ef-4e09-beee-f41081f96b09_1737x2077.png 848w, https://substackcdn.com/image/fetch/$s_!K4YJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa089f811-28ef-4e09-beee-f41081f96b09_1737x2077.png 1272w, https://substackcdn.com/image/fetch/$s_!K4YJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa089f811-28ef-4e09-beee-f41081f96b09_1737x2077.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!K4YJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa089f811-28ef-4e09-beee-f41081f96b09_1737x2077.png" width="1456" height="1741" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a089f811-28ef-4e09-beee-f41081f96b09_1737x2077.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1741,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!K4YJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa089f811-28ef-4e09-beee-f41081f96b09_1737x2077.png 424w, https://substackcdn.com/image/fetch/$s_!K4YJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa089f811-28ef-4e09-beee-f41081f96b09_1737x2077.png 848w, https://substackcdn.com/image/fetch/$s_!K4YJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa089f811-28ef-4e09-beee-f41081f96b09_1737x2077.png 1272w, https://substackcdn.com/image/fetch/$s_!K4YJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa089f811-28ef-4e09-beee-f41081f96b09_1737x2077.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>While others left me shocked:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uQNL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf52d286-26eb-4901-952e-5e5b791c87b6_1788x2108.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uQNL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf52d286-26eb-4901-952e-5e5b791c87b6_1788x2108.png 424w, https://substackcdn.com/image/fetch/$s_!uQNL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf52d286-26eb-4901-952e-5e5b791c87b6_1788x2108.png 848w, 
https://substackcdn.com/image/fetch/$s_!uQNL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf52d286-26eb-4901-952e-5e5b791c87b6_1788x2108.png 1272w, https://substackcdn.com/image/fetch/$s_!uQNL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf52d286-26eb-4901-952e-5e5b791c87b6_1788x2108.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uQNL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf52d286-26eb-4901-952e-5e5b791c87b6_1788x2108.png" width="1456" height="1717" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/af52d286-26eb-4901-952e-5e5b791c87b6_1788x2108.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1717,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!uQNL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf52d286-26eb-4901-952e-5e5b791c87b6_1788x2108.png 424w, https://substackcdn.com/image/fetch/$s_!uQNL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf52d286-26eb-4901-952e-5e5b791c87b6_1788x2108.png 848w, 
https://substackcdn.com/image/fetch/$s_!uQNL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf52d286-26eb-4901-952e-5e5b791c87b6_1788x2108.png 1272w, https://substackcdn.com/image/fetch/$s_!uQNL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf52d286-26eb-4901-952e-5e5b791c87b6_1788x2108.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>How can a flagship model like <em>Claude Opus 4.1</em> fail this way? 
I kept trying several other LLMs in disbelief, slowly coming to terms with the fact that most of them don&#8217;t readily understand the concept of playing adversarial games, even ones as simple as Guess Who.</p><h1>A systematic review</h1><p>At this point I felt a duty to document this problem across all the models that had the necessary capabilities (vision + tool calling) to play this game. If I ever want an LLM personal assistant to handle my private data and act on my behalf, I&#8217;d better make sure it understands that it can&#8217;t just hand out my credentials to the first kind thief who asks.</p><p>Here is a systematic review of the results, ordered roughly from worst to best. Keep in mind, however, that this is all based on a very small test sample, and although most models fail the same way every time, a few were far more erratic, looking very smart in one run and incredibly dumb in the next.</p><p>First of all, I list and disqualify all models that do not hide the identity of their character. Of the survivors, I ranked them by whether you can actually play with them in any capacity (many can&#8217;t see well enough to tell the characters apart) and, if the game is playable at all, by how easy it is to break.</p><h2>Unplayable models</h2><p><strong>Can&#8217;t understand the instructions at all</strong></p><p>These models understood only part of the system prompt (if any), resulting in unpredictable answers.</p><ul><li><p>Amazon Nova Lite v1</p></li></ul><p>Possibly the most unpredictable model. Every run was a surprise. 
This is just a small sample to give you an idea.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sPpK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c8f2be3-7d4e-4cd1-a0a4-a0788ef4950b_980x1901.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sPpK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c8f2be3-7d4e-4cd1-a0a4-a0788ef4950b_980x1901.png 424w, https://substackcdn.com/image/fetch/$s_!sPpK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c8f2be3-7d4e-4cd1-a0a4-a0788ef4950b_980x1901.png 848w, https://substackcdn.com/image/fetch/$s_!sPpK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c8f2be3-7d4e-4cd1-a0a4-a0788ef4950b_980x1901.png 1272w, https://substackcdn.com/image/fetch/$s_!sPpK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c8f2be3-7d4e-4cd1-a0a4-a0788ef4950b_980x1901.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sPpK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c8f2be3-7d4e-4cd1-a0a4-a0788ef4950b_980x1901.png" width="980" height="1901" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4c8f2be3-7d4e-4cd1-a0a4-a0788ef4950b_980x1901.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1901,&quot;width&quot;:980,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!sPpK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c8f2be3-7d4e-4cd1-a0a4-a0788ef4950b_980x1901.png 424w, https://substackcdn.com/image/fetch/$s_!sPpK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c8f2be3-7d4e-4cd1-a0a4-a0788ef4950b_980x1901.png 848w, https://substackcdn.com/image/fetch/$s_!sPpK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c8f2be3-7d4e-4cd1-a0a4-a0788ef4950b_980x1901.png 1272w, https://substackcdn.com/image/fetch/$s_!sPpK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c8f2be3-7d4e-4cd1-a0a4-a0788ef4950b_980x1901.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Reveals their character unprompted in the first message</strong></p><p>A shockingly common issue among all tested models. They simply volunteer the information unprompted. I assume they don&#8217;t understand that they&#8217;re not supposed to help the user, or that this is information they should hide.</p><p>All these models have been tested several times to ensure this is their default behavior and not an exception. Some other models do occasionally fail this way (looking at you, Mistral Medium 3.1), but only rarely. 
Models listed here fail in this way very consistently.</p><ul><li><p>Claude Opus 4.1</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!m2NK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb7b8af3-7767-423c-8e89-aa48f0125bf8_1250x598.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!m2NK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb7b8af3-7767-423c-8e89-aa48f0125bf8_1250x598.png 424w, https://substackcdn.com/image/fetch/$s_!m2NK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb7b8af3-7767-423c-8e89-aa48f0125bf8_1250x598.png 848w, https://substackcdn.com/image/fetch/$s_!m2NK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb7b8af3-7767-423c-8e89-aa48f0125bf8_1250x598.png 1272w, https://substackcdn.com/image/fetch/$s_!m2NK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb7b8af3-7767-423c-8e89-aa48f0125bf8_1250x598.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!m2NK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb7b8af3-7767-423c-8e89-aa48f0125bf8_1250x598.png" width="1250" height="598" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bb7b8af3-7767-423c-8e89-aa48f0125bf8_1250x598.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:598,&quot;width&quot;:1250,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!m2NK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb7b8af3-7767-423c-8e89-aa48f0125bf8_1250x598.png 424w, https://substackcdn.com/image/fetch/$s_!m2NK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb7b8af3-7767-423c-8e89-aa48f0125bf8_1250x598.png 848w, https://substackcdn.com/image/fetch/$s_!m2NK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb7b8af3-7767-423c-8e89-aa48f0125bf8_1250x598.png 1272w, https://substackcdn.com/image/fetch/$s_!m2NK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb7b8af3-7767-423c-8e89-aa48f0125bf8_1250x598.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p>Claude Opus 4.0</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Qyjo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e158a20-03a6-4a2b-a390-3ee687426fb2_1250x598.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Qyjo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e158a20-03a6-4a2b-a390-3ee687426fb2_1250x598.png 424w, https://substackcdn.com/image/fetch/$s_!Qyjo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e158a20-03a6-4a2b-a390-3ee687426fb2_1250x598.png 848w, 
https://substackcdn.com/image/fetch/$s_!Qyjo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e158a20-03a6-4a2b-a390-3ee687426fb2_1250x598.png 1272w, https://substackcdn.com/image/fetch/$s_!Qyjo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e158a20-03a6-4a2b-a390-3ee687426fb2_1250x598.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Qyjo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e158a20-03a6-4a2b-a390-3ee687426fb2_1250x598.png" width="1250" height="598" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5e158a20-03a6-4a2b-a390-3ee687426fb2_1250x598.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:598,&quot;width&quot;:1250,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Qyjo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e158a20-03a6-4a2b-a390-3ee687426fb2_1250x598.png 424w, https://substackcdn.com/image/fetch/$s_!Qyjo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e158a20-03a6-4a2b-a390-3ee687426fb2_1250x598.png 848w, 
https://substackcdn.com/image/fetch/$s_!Qyjo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e158a20-03a6-4a2b-a390-3ee687426fb2_1250x598.png 1272w, https://substackcdn.com/image/fetch/$s_!Qyjo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e158a20-03a6-4a2b-a390-3ee687426fb2_1250x598.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p>Claude Sonnet 4.0</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" 
target="_blank" href="https://substackcdn.com/image/fetch/$s_!zgUw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F025a7e0b-3af2-4523-a348-d1a120d61fd7_1250x598.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zgUw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F025a7e0b-3af2-4523-a348-d1a120d61fd7_1250x598.png 424w, https://substackcdn.com/image/fetch/$s_!zgUw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F025a7e0b-3af2-4523-a348-d1a120d61fd7_1250x598.png 848w, https://substackcdn.com/image/fetch/$s_!zgUw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F025a7e0b-3af2-4523-a348-d1a120d61fd7_1250x598.png 1272w, https://substackcdn.com/image/fetch/$s_!zgUw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F025a7e0b-3af2-4523-a348-d1a120d61fd7_1250x598.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zgUw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F025a7e0b-3af2-4523-a348-d1a120d61fd7_1250x598.png" width="1250" height="598" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/025a7e0b-3af2-4523-a348-d1a120d61fd7_1250x598.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:598,&quot;width&quot;:1250,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!zgUw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F025a7e0b-3af2-4523-a348-d1a120d61fd7_1250x598.png 424w, https://substackcdn.com/image/fetch/$s_!zgUw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F025a7e0b-3af2-4523-a348-d1a120d61fd7_1250x598.png 848w, https://substackcdn.com/image/fetch/$s_!zgUw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F025a7e0b-3af2-4523-a348-d1a120d61fd7_1250x598.png 1272w, https://substackcdn.com/image/fetch/$s_!zgUw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F025a7e0b-3af2-4523-a348-d1a120d61fd7_1250x598.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p>Claude Sonnet 3.7</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!72V6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27aa659f-45ba-481f-905b-98bce74209c5_1250x598.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!72V6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27aa659f-45ba-481f-905b-98bce74209c5_1250x598.png 424w, https://substackcdn.com/image/fetch/$s_!72V6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27aa659f-45ba-481f-905b-98bce74209c5_1250x598.png 848w, 
https://substackcdn.com/image/fetch/$s_!72V6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27aa659f-45ba-481f-905b-98bce74209c5_1250x598.png 1272w, https://substackcdn.com/image/fetch/$s_!72V6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27aa659f-45ba-481f-905b-98bce74209c5_1250x598.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!72V6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27aa659f-45ba-481f-905b-98bce74209c5_1250x598.png" width="1250" height="598" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/27aa659f-45ba-481f-905b-98bce74209c5_1250x598.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:598,&quot;width&quot;:1250,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!72V6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27aa659f-45ba-481f-905b-98bce74209c5_1250x598.png 424w, https://substackcdn.com/image/fetch/$s_!72V6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27aa659f-45ba-481f-905b-98bce74209c5_1250x598.png 848w, 
https://substackcdn.com/image/fetch/$s_!72V6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27aa659f-45ba-481f-905b-98bce74209c5_1250x598.png 1272w, https://substackcdn.com/image/fetch/$s_!72V6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27aa659f-45ba-481f-905b-98bce74209c5_1250x598.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p>Gemini 2.5 Flash</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" 
target="_blank" href="https://substackcdn.com/image/fetch/$s_!i7Tg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F679f0523-939a-48d8-a600-129ab80025a3_1250x554.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!i7Tg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F679f0523-939a-48d8-a600-129ab80025a3_1250x554.png 424w, https://substackcdn.com/image/fetch/$s_!i7Tg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F679f0523-939a-48d8-a600-129ab80025a3_1250x554.png 848w, https://substackcdn.com/image/fetch/$s_!i7Tg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F679f0523-939a-48d8-a600-129ab80025a3_1250x554.png 1272w, https://substackcdn.com/image/fetch/$s_!i7Tg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F679f0523-939a-48d8-a600-129ab80025a3_1250x554.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!i7Tg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F679f0523-939a-48d8-a600-129ab80025a3_1250x554.png" width="1250" height="554" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/679f0523-939a-48d8-a600-129ab80025a3_1250x554.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:554,&quot;width&quot;:1250,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!i7Tg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F679f0523-939a-48d8-a600-129ab80025a3_1250x554.png 424w, https://substackcdn.com/image/fetch/$s_!i7Tg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F679f0523-939a-48d8-a600-129ab80025a3_1250x554.png 848w, https://substackcdn.com/image/fetch/$s_!i7Tg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F679f0523-939a-48d8-a600-129ab80025a3_1250x554.png 1272w, https://substackcdn.com/image/fetch/$s_!i7Tg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F679f0523-939a-48d8-a600-129ab80025a3_1250x554.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p>Gemini 2.5 Flash Lite</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5etW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fece96218-f4ca-459e-a637-2061147f5199_1250x554.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5etW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fece96218-f4ca-459e-a637-2061147f5199_1250x554.png 424w, https://substackcdn.com/image/fetch/$s_!5etW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fece96218-f4ca-459e-a637-2061147f5199_1250x554.png 848w, 
https://substackcdn.com/image/fetch/$s_!5etW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fece96218-f4ca-459e-a637-2061147f5199_1250x554.png 1272w, https://substackcdn.com/image/fetch/$s_!5etW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fece96218-f4ca-459e-a637-2061147f5199_1250x554.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5etW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fece96218-f4ca-459e-a637-2061147f5199_1250x554.png" width="1250" height="554" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ece96218-f4ca-459e-a637-2061147f5199_1250x554.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:554,&quot;width&quot;:1250,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!5etW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fece96218-f4ca-459e-a637-2061147f5199_1250x554.png 424w, https://substackcdn.com/image/fetch/$s_!5etW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fece96218-f4ca-459e-a637-2061147f5199_1250x554.png 848w, 
https://substackcdn.com/image/fetch/$s_!5etW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fece96218-f4ca-459e-a637-2061147f5199_1250x554.png 1272w, https://substackcdn.com/image/fetch/$s_!5etW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fece96218-f4ca-459e-a637-2061147f5199_1250x554.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p>GPT 5 Mini</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" 
target="_blank" href="https://substackcdn.com/image/fetch/$s_!k0R2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c72925d-309a-4798-904e-97a0a41e74a6_1250x554.png"><img src="https://substackcdn.com/image/fetch/$s_!k0R2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c72925d-309a-4798-904e-97a0a41e74a6_1250x554.png" width="1250" height="554" alt="" loading="lazy"></a></figure></div><ul><li><p>GPT 5 Nano</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oKZc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a220f56-ccab-449e-b5b8-659ceaf810e3_1250x554.png"><img src="https://substackcdn.com/image/fetch/$s_!oKZc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a220f56-ccab-449e-b5b8-659ceaf810e3_1250x554.png" width="1250" height="554" alt="" loading="lazy"></a></figure></div><ul><li><p>Llama 4 Scout</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" 
target="_blank" href="https://substackcdn.com/image/fetch/$s_!W9JY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F539733f4-904c-4924-8d5e-031fc33358af_1250x598.png"><img src="https://substackcdn.com/image/fetch/$s_!W9JY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F539733f4-904c-4924-8d5e-031fc33358af_1250x598.png" width="1250" height="598" alt="" loading="lazy"></a></figure></div><ul><li><p>Sonoma Sky Alpha</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wMKM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79ea5dd7-7e4e-4c0b-9b28-d92ef659d541_1250x598.png"><img src="https://substackcdn.com/image/fetch/$s_!wMKM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79ea5dd7-7e4e-4c0b-9b28-d92ef659d541_1250x598.png" width="1250" height="598" alt="" loading="lazy"></a></figure></div><ul><li><p>GLM 4.5</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" 
target="_blank" href="https://substackcdn.com/image/fetch/$s_!_Nv3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5d139d0-4aaf-4a4d-9e9c-3334340069e6_1250x598.png"><img src="https://substackcdn.com/image/fetch/$s_!_Nv3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5d139d0-4aaf-4a4d-9e9c-3334340069e6_1250x598.png" width="1250" height="598" alt="" loading="lazy"></a></figure></div><p><strong>Reveals their character as soon as asked</strong></p><p>Some models did not volunteer the information, but they didn&#8217;t exactly protect it either.</p><ul><li><p>Amazon Nova Pro v1</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XRbC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71c28130-3b40-40f5-bb43-ac7f7fa4940b_980x685.png"><img src="https://substackcdn.com/image/fetch/$s_!XRbC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71c28130-3b40-40f5-bb43-ac7f7fa4940b_980x685.png" width="980" height="685" alt="" loading="lazy"></a></figure></div><ul><li><p>Llama 4 Maverick</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Vaw6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24535cfe-53d6-4100-bf7e-3500736030c6_1250x1002.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Vaw6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24535cfe-53d6-4100-bf7e-3500736030c6_1250x1002.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Vaw6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24535cfe-53d6-4100-bf7e-3500736030c6_1250x1002.png" width="1250" height="1002" 
alt="" loading="lazy"></picture></div></a></figure></div><ul><li><p>Mistral Small 3.2</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3tCY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa1ad6b3-723b-4159-ad9a-84c41d367a33_1250x912.png"><img src="https://substackcdn.com/image/fetch/$s_!3tCY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa1ad6b3-723b-4159-ad9a-84c41d367a33_1250x912.png" width="1250" height="912" alt="" loading="lazy"></a></figure></div><h2>Game looks playable but it&#8217;s actually broken</h2><p><strong>Low vision skills</strong></p><p>These models are smart enough to 
understand the basics of the game, but it&#8217;s impossible to play with them due to their <strong>weak vision skills</strong>. These models simply can&#8217;t see well enough to delete the right character from the board or answer all questions about their own character correctly. They will then hallucinate random answers and delete random characters from their boards, making the game unplayable.</p><ul><li><p>Gemini 2.5 Pro</p></li></ul><p>Gemini 2.5 Pro evidently has issues seeing both the board and the characters. Here it shows both flaws by deleting the wrong characters and lying about its character in a single response:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EzbQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23629444-e8c9-49cd-9e1f-9d26d2ca0396_1250x1480.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EzbQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23629444-e8c9-49cd-9e1f-9d26d2ca0396_1250x1480.png 424w, https://substackcdn.com/image/fetch/$s_!EzbQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23629444-e8c9-49cd-9e1f-9d26d2ca0396_1250x1480.png 848w, https://substackcdn.com/image/fetch/$s_!EzbQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23629444-e8c9-49cd-9e1f-9d26d2ca0396_1250x1480.png 1272w, https://substackcdn.com/image/fetch/$s_!EzbQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23629444-e8c9-49cd-9e1f-9d26d2ca0396_1250x1480.png 1456w" sizes="100vw"><img
src="https://substackcdn.com/image/fetch/$s_!EzbQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23629444-e8c9-49cd-9e1f-9d26d2ca0396_1250x1480.png" width="1250" height="1480" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/23629444-e8c9-49cd-9e1f-9d26d2ca0396_1250x1480.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1480,&quot;width&quot;:1250,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!EzbQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23629444-e8c9-49cd-9e1f-9d26d2ca0396_1250x1480.png 424w, https://substackcdn.com/image/fetch/$s_!EzbQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23629444-e8c9-49cd-9e1f-9d26d2ca0396_1250x1480.png 848w, https://substackcdn.com/image/fetch/$s_!EzbQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23629444-e8c9-49cd-9e1f-9d26d2ca0396_1250x1480.png 1272w, https://substackcdn.com/image/fetch/$s_!EzbQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23629444-e8c9-49cd-9e1f-9d26d2ca0396_1250x1480.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p>GPT-4o</p></li></ul><p>GPT-4o also has issues seeing the board and the characters, but its blind spots are less predictable than Gemini 2.5 Pro&#8217;s, so it can occasionally manage to play for a while. It also frequently forgets to eliminate any characters from its board.
GPT-4o also tends to get distracted, lose track of the turns, and so on.</p><p>Here it deletes the wrong characters and loses track of the turns:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!guDd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c8892dc-f57a-4ad2-87ee-535ceea7384a_1250x1722.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!guDd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c8892dc-f57a-4ad2-87ee-535ceea7384a_1250x1722.png 424w, https://substackcdn.com/image/fetch/$s_!guDd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c8892dc-f57a-4ad2-87ee-535ceea7384a_1250x1722.png 848w, https://substackcdn.com/image/fetch/$s_!guDd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c8892dc-f57a-4ad2-87ee-535ceea7384a_1250x1722.png 1272w, https://substackcdn.com/image/fetch/$s_!guDd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c8892dc-f57a-4ad2-87ee-535ceea7384a_1250x1722.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!guDd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c8892dc-f57a-4ad2-87ee-535ceea7384a_1250x1722.png" width="1250" height="1722" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7c8892dc-f57a-4ad2-87ee-535ceea7384a_1250x1722.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1722,&quot;width&quot;:1250,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!guDd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c8892dc-f57a-4ad2-87ee-535ceea7384a_1250x1722.png 424w, https://substackcdn.com/image/fetch/$s_!guDd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c8892dc-f57a-4ad2-87ee-535ceea7384a_1250x1722.png 848w, https://substackcdn.com/image/fetch/$s_!guDd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c8892dc-f57a-4ad2-87ee-535ceea7384a_1250x1722.png 1272w, https://substackcdn.com/image/fetch/$s_!guDd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c8892dc-f57a-4ad2-87ee-535ceea7384a_1250x1722.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Tw-o!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ef25f48-46aa-483f-a971-4c651c5c02d7_1244x1334.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Tw-o!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ef25f48-46aa-483f-a971-4c651c5c02d7_1244x1334.png 424w, https://substackcdn.com/image/fetch/$s_!Tw-o!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ef25f48-46aa-483f-a971-4c651c5c02d7_1244x1334.png 848w, 
https://substackcdn.com/image/fetch/$s_!Tw-o!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ef25f48-46aa-483f-a971-4c651c5c02d7_1244x1334.png 1272w, https://substackcdn.com/image/fetch/$s_!Tw-o!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ef25f48-46aa-483f-a971-4c651c5c02d7_1244x1334.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Tw-o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ef25f48-46aa-483f-a971-4c651c5c02d7_1244x1334.png" width="1244" height="1334" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4ef25f48-46aa-483f-a971-4c651c5c02d7_1244x1334.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1334,&quot;width&quot;:1244,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Tw-o!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ef25f48-46aa-483f-a971-4c651c5c02d7_1244x1334.png 424w, https://substackcdn.com/image/fetch/$s_!Tw-o!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ef25f48-46aa-483f-a971-4c651c5c02d7_1244x1334.png 848w, 
https://substackcdn.com/image/fetch/$s_!Tw-o!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ef25f48-46aa-483f-a971-4c651c5c02d7_1244x1334.png 1272w, https://substackcdn.com/image/fetch/$s_!Tw-o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ef25f48-46aa-483f-a971-4c651c5c02d7_1244x1334.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>and here it has trouble seeing its character:</p><div class="captioned-image-container"><figure><a class="image-link image2 
is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!e-K1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc3cbd90-d719-4dbc-8c8f-a8870a4d3975_1250x1170.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!e-K1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc3cbd90-d719-4dbc-8c8f-a8870a4d3975_1250x1170.png 424w, https://substackcdn.com/image/fetch/$s_!e-K1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc3cbd90-d719-4dbc-8c8f-a8870a4d3975_1250x1170.png 848w, https://substackcdn.com/image/fetch/$s_!e-K1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc3cbd90-d719-4dbc-8c8f-a8870a4d3975_1250x1170.png 1272w, https://substackcdn.com/image/fetch/$s_!e-K1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc3cbd90-d719-4dbc-8c8f-a8870a4d3975_1250x1170.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!e-K1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc3cbd90-d719-4dbc-8c8f-a8870a4d3975_1250x1170.png" width="1250" height="1170" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dc3cbd90-d719-4dbc-8c8f-a8870a4d3975_1250x1170.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1170,&quot;width&quot;:1250,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!e-K1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc3cbd90-d719-4dbc-8c8f-a8870a4d3975_1250x1170.png 424w, https://substackcdn.com/image/fetch/$s_!e-K1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc3cbd90-d719-4dbc-8c8f-a8870a4d3975_1250x1170.png 848w, https://substackcdn.com/image/fetch/$s_!e-K1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc3cbd90-d719-4dbc-8c8f-a8870a4d3975_1250x1170.png 1272w, https://substackcdn.com/image/fetch/$s_!e-K1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc3cbd90-d719-4dbc-8c8f-a8870a4d3975_1250x1170.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p>Mistral Medium 3.1</p></li></ul><p>Mistral Medium 3.1 has been hard to place. Its biggest weakness seems to be removing the correct characters from the board, although it does a much better job at it than Gemini 2.5 Pro or GPT-4o. I&#8217;ve never seen it fail to describe its own character correctly, but it occasionally behaves in a very dumb way (once it even revealed its character in the first message!). You may get flawless runs with this model, or it may fail right from the get-go.</p><p>Here it deletes a couple of unrelated characters:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-eY1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83f0f803-6955-485d-a3fd-b4623cc1359a_1250x1436.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-eY1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83f0f803-6955-485d-a3fd-b4623cc1359a_1250x1436.png 424w, https://substackcdn.com/image/fetch/$s_!-eY1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83f0f803-6955-485d-a3fd-b4623cc1359a_1250x1436.png 848w, https://substackcdn.com/image/fetch/$s_!-eY1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83f0f803-6955-485d-a3fd-b4623cc1359a_1250x1436.png 1272w, https://substackcdn.com/image/fetch/$s_!-eY1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83f0f803-6955-485d-a3fd-b4623cc1359a_1250x1436.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-eY1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83f0f803-6955-485d-a3fd-b4623cc1359a_1250x1436.png" width="1250" height="1436"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/83f0f803-6955-485d-a3fd-b4623cc1359a_1250x1436.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1436,&quot;width&quot;:1250,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!-eY1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83f0f803-6955-485d-a3fd-b4623cc1359a_1250x1436.png 424w, https://substackcdn.com/image/fetch/$s_!-eY1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83f0f803-6955-485d-a3fd-b4623cc1359a_1250x1436.png 848w, https://substackcdn.com/image/fetch/$s_!-eY1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83f0f803-6955-485d-a3fd-b4623cc1359a_1250x1436.png 1272w, https://substackcdn.com/image/fetch/$s_!-eY1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83f0f803-6955-485d-a3fd-b4623cc1359a_1250x1436.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-NjE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b49689-0209-40f3-a32f-2703c1f05ca9_1244x1334.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-NjE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b49689-0209-40f3-a32f-2703c1f05ca9_1244x1334.png 424w, https://substackcdn.com/image/fetch/$s_!-NjE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b49689-0209-40f3-a32f-2703c1f05ca9_1244x1334.png 848w, 
https://substackcdn.com/image/fetch/$s_!-NjE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b49689-0209-40f3-a32f-2703c1f05ca9_1244x1334.png 1272w, https://substackcdn.com/image/fetch/$s_!-NjE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b49689-0209-40f3-a32f-2703c1f05ca9_1244x1334.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-NjE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b49689-0209-40f3-a32f-2703c1f05ca9_1244x1334.png" width="1244" height="1334" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/25b49689-0209-40f3-a32f-2703c1f05ca9_1244x1334.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1334,&quot;width&quot;:1244,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!-NjE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b49689-0209-40f3-a32f-2703c1f05ca9_1244x1334.png 424w, https://substackcdn.com/image/fetch/$s_!-NjE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b49689-0209-40f3-a32f-2703c1f05ca9_1244x1334.png 848w, 
https://substackcdn.com/image/fetch/$s_!-NjE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b49689-0209-40f3-a32f-2703c1f05ca9_1244x1334.png 1272w, https://substackcdn.com/image/fetch/$s_!-NjE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b49689-0209-40f3-a32f-2703c1f05ca9_1244x1334.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>No tool calling</strong></p><p>It is debatable whether the inability of a model to do tool calling should be considered a 
penalty: in theory, LLMs remember everything perfectly, so they could choose what to ask next based on what they asked earlier and on which characters could still match the opponent&#8217;s. In practice, however, no LLM could be trusted to keep track of the game this way, and I decided that the inability to invoke tools when instructed is a big enough flaw to disqualify them.</p><ul><li><p>Sonoma Dusk Alpha</p></li></ul><p>Assessing the vision skills of this model has been difficult due to its unwillingness to ever call the <code>eliminateCharacter</code> tool. Sonoma Dusk Alpha doesn&#8217;t seem to have issues seeing its character, but it&#8217;s too weak to be considered playable: it won&#8217;t enforce turn-taking, it can be convinced that I won the game without my ever naming its character, and it&#8217;s likely not really trying to narrow down my character, just asking some questions.</p><p>Here is an example game:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!piAB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe8fc10f-aa3e-4445-8bd2-b5c2c3d35068_980x1866.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!piAB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe8fc10f-aa3e-4445-8bd2-b5c2c3d35068_980x1866.png 424w, https://substackcdn.com/image/fetch/$s_!piAB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe8fc10f-aa3e-4445-8bd2-b5c2c3d35068_980x1866.png 848w,
https://substackcdn.com/image/fetch/$s_!piAB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe8fc10f-aa3e-4445-8bd2-b5c2c3d35068_980x1866.png 1272w, https://substackcdn.com/image/fetch/$s_!piAB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe8fc10f-aa3e-4445-8bd2-b5c2c3d35068_980x1866.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!piAB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe8fc10f-aa3e-4445-8bd2-b5c2c3d35068_980x1866.png" width="980" height="1866" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fe8fc10f-aa3e-4445-8bd2-b5c2c3d35068_980x1866.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1866,&quot;width&quot;:980,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!piAB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe8fc10f-aa3e-4445-8bd2-b5c2c3d35068_980x1866.png 424w, https://substackcdn.com/image/fetch/$s_!piAB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe8fc10f-aa3e-4445-8bd2-b5c2c3d35068_980x1866.png 848w, 
https://substackcdn.com/image/fetch/$s_!piAB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe8fc10f-aa3e-4445-8bd2-b5c2c3d35068_980x1866.png 1272w, https://substackcdn.com/image/fetch/$s_!piAB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe8fc10f-aa3e-4445-8bd2-b5c2c3d35068_980x1866.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Playable models</h2><p>These models seems to understand the game, don&#8217;t have issues seeing all the features of the characters, 
but they&#8217;re still quite vulnerable to basic manipulation attempts. Typical issues are related to <strong>prompt hacking</strong>, where the LLM simply does what I say rather than enforcing the game rules, and <strong>low tool handling ability</strong>, where the LLM doesn&#8217;t use the available tools when it should or uses them incorrectly.</p><p>To test these skills, I checked whether the model will enforce turn taking when asking the question, and what happens when I claim to have won without naming the LLM&#8217;s hidden character.</p><ul><li><p>Grok 4</p></li></ul><p>Grok 4 is a decent player but by far not a good one. It clearly sees the board and the character, it eliminates characters correctly most of the times, but fails to enforce turns.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7A7g!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F492d5e3b-b69b-41da-98f4-e98b9df0a117_610x1528.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7A7g!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F492d5e3b-b69b-41da-98f4-e98b9df0a117_610x1528.png 424w, https://substackcdn.com/image/fetch/$s_!7A7g!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F492d5e3b-b69b-41da-98f4-e98b9df0a117_610x1528.png 848w, https://substackcdn.com/image/fetch/$s_!7A7g!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F492d5e3b-b69b-41da-98f4-e98b9df0a117_610x1528.png 1272w, 
https://substackcdn.com/image/fetch/$s_!7A7g!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F492d5e3b-b69b-41da-98f4-e98b9df0a117_610x1528.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7A7g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F492d5e3b-b69b-41da-98f4-e98b9df0a117_610x1528.png" width="610" height="1528" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/492d5e3b-b69b-41da-98f4-e98b9df0a117_610x1528.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1528,&quot;width&quot;:610,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!7A7g!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F492d5e3b-b69b-41da-98f4-e98b9df0a117_610x1528.png 424w, https://substackcdn.com/image/fetch/$s_!7A7g!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F492d5e3b-b69b-41da-98f4-e98b9df0a117_610x1528.png 848w, https://substackcdn.com/image/fetch/$s_!7A7g!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F492d5e3b-b69b-41da-98f4-e98b9df0a117_610x1528.png 1272w, 
https://substackcdn.com/image/fetch/$s_!7A7g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F492d5e3b-b69b-41da-98f4-e98b9df0a117_610x1528.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eeia!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e74124c-92ae-4539-aed8-60e5d294ca2d_622x667.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eeia!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e74124c-92ae-4539-aed8-60e5d294ca2d_622x667.png 424w, https://substackcdn.com/image/fetch/$s_!eeia!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e74124c-92ae-4539-aed8-60e5d294ca2d_622x667.png 848w, https://substackcdn.com/image/fetch/$s_!eeia!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e74124c-92ae-4539-aed8-60e5d294ca2d_622x667.png 1272w, https://substackcdn.com/image/fetch/$s_!eeia!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e74124c-92ae-4539-aed8-60e5d294ca2d_622x667.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eeia!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e74124c-92ae-4539-aed8-60e5d294ca2d_622x667.png" width="622" height="667" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4e74124c-92ae-4539-aed8-60e5d294ca2d_622x667.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:667,&quot;width&quot;:622,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!eeia!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e74124c-92ae-4539-aed8-60e5d294ca2d_622x667.png 424w, https://substackcdn.com/image/fetch/$s_!eeia!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e74124c-92ae-4539-aed8-60e5d294ca2d_622x667.png 848w, https://substackcdn.com/image/fetch/$s_!eeia!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e74124c-92ae-4539-aed8-60e5d294ca2d_622x667.png 1272w, https://substackcdn.com/image/fetch/$s_!eeia!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e74124c-92ae-4539-aed8-60e5d294ca2d_622x667.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here is an example of a game where a couple of mistakes were enough to prevent the model from winning (my character was Amy again).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tX3C!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cc34c66-6f89-4750-b617-51c3740593a0_502x1648.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tX3C!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cc34c66-6f89-4750-b617-51c3740593a0_502x1648.png 424w, https://substackcdn.com/image/fetch/$s_!tX3C!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cc34c66-6f89-4750-b617-51c3740593a0_502x1648.png 848w, https://substackcdn.com/image/fetch/$s_!tX3C!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cc34c66-6f89-4750-b617-51c3740593a0_502x1648.png 1272w, https://substackcdn.com/image/fetch/$s_!tX3C!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cc34c66-6f89-4750-b617-51c3740593a0_502x1648.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!tX3C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cc34c66-6f89-4750-b617-51c3740593a0_502x1648.png" width="502" height="1648" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8cc34c66-6f89-4750-b617-51c3740593a0_502x1648.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1648,&quot;width&quot;:502,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!tX3C!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cc34c66-6f89-4750-b617-51c3740593a0_502x1648.png 424w, https://substackcdn.com/image/fetch/$s_!tX3C!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cc34c66-6f89-4750-b617-51c3740593a0_502x1648.png 848w, https://substackcdn.com/image/fetch/$s_!tX3C!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cc34c66-6f89-4750-b617-51c3740593a0_502x1648.png 1272w, https://substackcdn.com/image/fetch/$s_!tX3C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cc34c66-6f89-4750-b617-51c3740593a0_502x1648.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>An award to this model for resisting my attempt to unilaterally declare victory without breaking the game! 
This is the only model that succeeded at this.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pcSG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F545a8349-5acb-45b4-a991-b971db838003_1250x1170.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pcSG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F545a8349-5acb-45b4-a991-b971db838003_1250x1170.png 424w, https://substackcdn.com/image/fetch/$s_!pcSG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F545a8349-5acb-45b4-a991-b971db838003_1250x1170.png 848w, https://substackcdn.com/image/fetch/$s_!pcSG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F545a8349-5acb-45b4-a991-b971db838003_1250x1170.png 1272w, https://substackcdn.com/image/fetch/$s_!pcSG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F545a8349-5acb-45b4-a991-b971db838003_1250x1170.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pcSG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F545a8349-5acb-45b4-a991-b971db838003_1250x1170.png" width="1250" height="1170" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/545a8349-5acb-45b4-a991-b971db838003_1250x1170.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1170,&quot;width&quot;:1250,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!pcSG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F545a8349-5acb-45b4-a991-b971db838003_1250x1170.png 424w, https://substackcdn.com/image/fetch/$s_!pcSG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F545a8349-5acb-45b4-a991-b971db838003_1250x1170.png 848w, https://substackcdn.com/image/fetch/$s_!pcSG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F545a8349-5acb-45b4-a991-b971db838003_1250x1170.png 1272w, https://substackcdn.com/image/fetch/$s_!pcSG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F545a8349-5acb-45b4-a991-b971db838003_1250x1170.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p>GPT 5</p></li></ul><p>GPT 5 is probably the best model to play with in terms of raw capabilities. It makes only occasional mistakes when eliminating characters, and it&#8217;s mostly on point.</p><p>However, it was really slow and annoying to get it to play at all. 
It generally can&#8217;t seem to use tools and ask the next question at the same time, even though its response structure suggests it should be able to: this means that to play you must answer its question, wait for it to eliminate its characters, and only then ask your own.</p><p>It is also unbelievably slow compared to any other LLM I played with, which kills the fun.</p><p>Here you can see GPT 5 enforcing turn-taking (plus a gratuitous pun?!):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uJhY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27cf4407-bae6-4bf9-b40f-2dd21a77f041_1250x1500.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uJhY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27cf4407-bae6-4bf9-b40f-2dd21a77f041_1250x1500.png 424w, https://substackcdn.com/image/fetch/$s_!uJhY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27cf4407-bae6-4bf9-b40f-2dd21a77f041_1250x1500.png 848w, https://substackcdn.com/image/fetch/$s_!uJhY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27cf4407-bae6-4bf9-b40f-2dd21a77f041_1250x1500.png 1272w, https://substackcdn.com/image/fetch/$s_!uJhY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27cf4407-bae6-4bf9-b40f-2dd21a77f041_1250x1500.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!uJhY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27cf4407-bae6-4bf9-b40f-2dd21a77f041_1250x1500.png" width="1250" height="1500" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/27cf4407-bae6-4bf9-b40f-2dd21a77f041_1250x1500.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1500,&quot;width&quot;:1250,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!uJhY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27cf4407-bae6-4bf9-b40f-2dd21a77f041_1250x1500.png 424w, https://substackcdn.com/image/fetch/$s_!uJhY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27cf4407-bae6-4bf9-b40f-2dd21a77f041_1250x1500.png 848w, https://substackcdn.com/image/fetch/$s_!uJhY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27cf4407-bae6-4bf9-b40f-2dd21a77f041_1250x1500.png 1272w, https://substackcdn.com/image/fetch/$s_!uJhY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27cf4407-bae6-4bf9-b40f-2dd21a77f041_1250x1500.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>When I claim that I won, GPT 5 almost manages to understand that it might not be the case, but still ruins the game. Unfortunately this is not a fluke: GPT 5 consistently reveals the character in this situation. 
It won&#8217;t call the tool just yet, but once it reveals the character the game is over.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7lKl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1edce461-e399-40f5-bd58-b3780a9bef19_1250x1170.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7lKl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1edce461-e399-40f5-bd58-b3780a9bef19_1250x1170.png 424w, https://substackcdn.com/image/fetch/$s_!7lKl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1edce461-e399-40f5-bd58-b3780a9bef19_1250x1170.png 848w, https://substackcdn.com/image/fetch/$s_!7lKl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1edce461-e399-40f5-bd58-b3780a9bef19_1250x1170.png 1272w, https://substackcdn.com/image/fetch/$s_!7lKl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1edce461-e399-40f5-bd58-b3780a9bef19_1250x1170.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7lKl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1edce461-e399-40f5-bd58-b3780a9bef19_1250x1170.png" width="1250" height="1170" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1edce461-e399-40f5-bd58-b3780a9bef19_1250x1170.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1170,&quot;width&quot;:1250,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!7lKl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1edce461-e399-40f5-bd58-b3780a9bef19_1250x1170.png 424w, https://substackcdn.com/image/fetch/$s_!7lKl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1edce461-e399-40f5-bd58-b3780a9bef19_1250x1170.png 848w, https://substackcdn.com/image/fetch/$s_!7lKl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1edce461-e399-40f5-bd58-b3780a9bef19_1250x1170.png 1272w, https://substackcdn.com/image/fetch/$s_!7lKl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1edce461-e399-40f5-bd58-b3780a9bef19_1250x1170.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nIwK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcac870c9-0a44-4e72-b69c-92dfe5bb9fa4_1250x884.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nIwK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcac870c9-0a44-4e72-b69c-92dfe5bb9fa4_1250x884.png 424w, https://substackcdn.com/image/fetch/$s_!nIwK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcac870c9-0a44-4e72-b69c-92dfe5bb9fa4_1250x884.png 848w, 
https://substackcdn.com/image/fetch/$s_!nIwK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcac870c9-0a44-4e72-b69c-92dfe5bb9fa4_1250x884.png 1272w, https://substackcdn.com/image/fetch/$s_!nIwK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcac870c9-0a44-4e72-b69c-92dfe5bb9fa4_1250x884.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nIwK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcac870c9-0a44-4e72-b69c-92dfe5bb9fa4_1250x884.png" width="1250" height="884" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cac870c9-0a44-4e72-b69c-92dfe5bb9fa4_1250x884.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:884,&quot;width&quot;:1250,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!nIwK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcac870c9-0a44-4e72-b69c-92dfe5bb9fa4_1250x884.png 424w, https://substackcdn.com/image/fetch/$s_!nIwK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcac870c9-0a44-4e72-b69c-92dfe5bb9fa4_1250x884.png 848w, 
https://substackcdn.com/image/fetch/$s_!nIwK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcac870c9-0a44-4e72-b69c-92dfe5bb9fa4_1250x884.png 1272w, https://substackcdn.com/image/fetch/$s_!nIwK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcac870c9-0a44-4e72-b69c-92dfe5bb9fa4_1250x884.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here is an example of a game where GPT 5 actually wins:</p><div class="captioned-image-container"><figure><a class="image-link image2 
is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bobn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc548eb6a-594c-4c19-a74e-951f853551cc_502x1802.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bobn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc548eb6a-594c-4c19-a74e-951f853551cc_502x1802.png 424w, https://substackcdn.com/image/fetch/$s_!bobn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc548eb6a-594c-4c19-a74e-951f853551cc_502x1802.png 848w, https://substackcdn.com/image/fetch/$s_!bobn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc548eb6a-594c-4c19-a74e-951f853551cc_502x1802.png 1272w, https://substackcdn.com/image/fetch/$s_!bobn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc548eb6a-594c-4c19-a74e-951f853551cc_502x1802.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bobn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc548eb6a-594c-4c19-a74e-951f853551cc_502x1802.png" width="502" height="1802" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c548eb6a-594c-4c19-a74e-951f853551cc_502x1802.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1802,&quot;width&quot;:502,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!bobn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc548eb6a-594c-4c19-a74e-951f853551cc_502x1802.png 424w, https://substackcdn.com/image/fetch/$s_!bobn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc548eb6a-594c-4c19-a74e-951f853551cc_502x1802.png 848w, https://substackcdn.com/image/fetch/$s_!bobn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc548eb6a-594c-4c19-a74e-951f853551cc_502x1802.png 1272w, https://substackcdn.com/image/fetch/$s_!bobn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc548eb6a-594c-4c19-a74e-951f853551cc_502x1802.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In this case the <code>endGame</code> tool was also invoked correctly.</p><h1>Can this be fixed?</h1><p>My guess was that you can fix this behavior with a better system prompt. After this experiment I went back to the system prompt and described the game in far more detail.</p><blockquote><p>You are an AI assistant playing &#8220;Guess Who&#8221; against the user. Here&#8217;s how the game works.</p><h1>Game Rules</h1><p>You will receive an image of the full Guess Who board showing all available characters. You will also receive an image of a specific character. This is YOUR character that the user must try to guess. REMEMBER: don&#8217;t reveal who the character is! That&#8217;s the point of the game!</p><p>Your goal is to ask the user questions to identify THEIR hidden character while answering their questions about YOUR character. 
You need to ask the user yes/no questions about their character&#8217;s appearance (e.g., &#8220;Does your character have glasses?&#8221;, &#8220;Is your character male?&#8221;). When the user tells you something about THEIR character, you must eliminate characters that don&#8217;t fit the description from your board using the eliminateCharacter tool. Keep in mind that this tool only updates the UI: you have to keep track of which characters are eliminated in your mind. Think carefully about which characters to eliminate and explain your reasoning out loud before calling the tool. Make sure to only eliminate characters that definitely do not match the user&#8217;s description. If you make mistakes it will become impossible for you to win the game!</p><p>When the user asks you questions about YOUR character, answer concisely and truthfully based on the character image you received.</p><p>Each player can only ask ONE question and receive ONE answer - asking more than one question or asking another before your opponent had a chance to ask theirs is cheating! You must not cheat!</p><p>The first player to correctly guess the opponent&#8217;s character name wins the game, so try to guess when you&#8217;re reasonably confident. A good time to guess is when your board only has one or two characters left. When you think you know the user&#8217;s character, make your guess clearly (e.g., &#8220;Is your character [Name]?&#8221;). This is how you can manage to win the game.</p><p>When the user guesses correctly, call the endGame tool to finish the game. When the user tells you that you guessed their character, call the endGame tool to finish the game.</p><p>Now you will receive YOUR board and YOUR character.
Let&#8217;s play!</p></blockquote><p>You can load this prompt <a href="https://www.zansara.dev/guess-who/">in the game</a> by checking the Advanced tab in the settings.</p><p>This prompt goes a long way toward making the models understand that they can&#8217;t reveal the character&#8217;s identity; however, it doesn&#8217;t solve the problem entirely. For example, this is what Claude Opus 4.1 does with this prompt:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!biSW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6cbd7d4-8644-407f-8108-683d3860c589_1250x1152.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!biSW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6cbd7d4-8644-407f-8108-683d3860c589_1250x1152.png 424w, https://substackcdn.com/image/fetch/$s_!biSW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6cbd7d4-8644-407f-8108-683d3860c589_1250x1152.png 848w, https://substackcdn.com/image/fetch/$s_!biSW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6cbd7d4-8644-407f-8108-683d3860c589_1250x1152.png 1272w, https://substackcdn.com/image/fetch/$s_!biSW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6cbd7d4-8644-407f-8108-683d3860c589_1250x1152.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!biSW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6cbd7d4-8644-407f-8108-683d3860c589_1250x1152.png"
width="1250" height="1152" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f6cbd7d4-8644-407f-8108-683d3860c589_1250x1152.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1152,&quot;width&quot;:1250,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!biSW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6cbd7d4-8644-407f-8108-683d3860c589_1250x1152.png 424w, https://substackcdn.com/image/fetch/$s_!biSW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6cbd7d4-8644-407f-8108-683d3860c589_1250x1152.png 848w, https://substackcdn.com/image/fetch/$s_!biSW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6cbd7d4-8644-407f-8108-683d3860c589_1250x1152.png 1272w, https://substackcdn.com/image/fetch/$s_!biSW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6cbd7d4-8644-407f-8108-683d3860c589_1250x1152.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Guess what? There&#8217;s only one character with gray hair and glasses on the board, and that&#8217;s Emily&#8230; Should I review my system prompt again and make it even more detailed?</p><p>At this point I gave up. Feel free to iterate on the prompt until you get one that works, and if you manage, I beg you to share it with me.</p><h1>Conclusion</h1><p>In the near future I plan to build a proper leaderboard for this simple game: an automated system to assess the models&#8217; skills and (hopefully) track their progress in this field.</p><p>In the meantime, feel free to try your own favorite LLMs here and form your own opinion.</p><p>However, let&#8217;s be honest: if we need this level of effort to make Claude play such a simple game as Guess Who without messing up, how can we trust LLMs in general to handle our data and our money in the far more ambiguous and complex world out there?
I suppose LLMs are not ready (yet) to be left unsupervised.</p>]]></content:encoded></item><item><title><![CDATA[ Can you really interrupt an LLM?]]></title><description><![CDATA[It's not as easy as it seems.]]></description><link>https://zansara.substack.com/p/can-you-really-interrupt-an-llm</link><guid isPermaLink="false">https://zansara.substack.com/p/can-you-really-interrupt-an-llm</guid><dc:creator><![CDATA[Sara Z.]]></dc:creator><pubDate>Mon, 02 Jun 2025 17:57:26 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!G4o3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d7b58a8-510f-4719-a755-aaa2c1138723_3134x1104.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>With the recent release of <a href="https://support.anthropic.com/en/articles/11101966-using-voice-mode-on-claude-mobile-apps">Voice Mode</a> for <a href="https://www.anthropic.com/claude">Claude</a>, it seems like Voice AI is a solved problem. Now that LLMs can speak natively, there&#8217;s apparently no more need for any of the <a href="https://www.zansara.dev/posts/2024-09-05-building-voice-agents-with-open-source-tools-part-1/">complex voice pipelines</a> that used to be necessary last year: no need to do voice activity detection, no need to pipe data from the speech-to-text model to the LLM and then back to the text-to-speech engine at blazing speed in order to achieve a natural conversation flow. Modern LLMs can <a href="https://vimeo.com/945587944">laugh and sing</a>: what else could we need?</p><p>It turns out, a lot is still missing. Here is an example:</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;52f53618-bf6f-4ba3-970a-ee4b33936699&quot;,&quot;duration&quot;:null}"></div><p>Is this an issue with Claude? 
Have a look at Gemini:</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;fb7f0ae3-e57b-427f-9fe6-57213c3f7508&quot;,&quot;duration&quot;:null}"></div><p>or even at the venerable GPT-4o, the most mature Voice AI out there:</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;aa0954a7-9480-4286-b196-b9f38ab022e9&quot;,&quot;duration&quot;:null}"></div><p>What&#8217;s going on?</p><p>This simple exercise highlights two core issues that are often overlooked when developing Voice AI agents. Let&#8217;s see them.</p><h1>Problem #1: LLMs don&#8217;t perceive time</h1><p>As algorithms trained to predict the most likely next word, LLMs don&#8217;t have any concept of time. When dealing with text, this issue is not visible; however, as soon as we cross over into the domain of voice, their lack of understanding of time becomes a much bigger problem. LLMs still perceive the conversation as a series of tokens, with no concept of speed, pauses, or anything of that sort. They are often trained to control cadence and tone, to imitate pauses, and to adjust their talking speed, but they don&#8217;t <em>perceive</em> these features as we do: they are just additional properties of the output tokens.</p><p>This means that an LLM will have a very hard time understanding requests that involve altering the timing of the response unless there is additional, external tooling to help it.
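</p><p>Such external tooling usually means exposing the timer to the model as a callable function. Here is a minimal sketch of what that could look like (the tool name <code>wait_seconds</code> and the schema layout are my own illustrative assumptions, not any vendor&#8217;s actual API):</p>

```python
import time

# Hypothetical tool definition, in the JSON-schema style commonly used
# for LLM function calling. The model never waits by itself: it can only
# *request* a wait, which the application executes on its behalf.
WAIT_TOOL = {
    "name": "wait_seconds",
    "description": "Pause for the given number of seconds before replying.",
    "parameters": {
        "type": "object",
        "properties": {
            "seconds": {"type": "number", "description": "How long to wait."}
        },
        "required": ["seconds"],
    },
}

def handle_tool_call(name: str, arguments: dict) -> str:
    """Run the timer on the application side and report back to the model."""
    if name == "wait_seconds":
        time.sleep(arguments["seconds"])
        return f"waited {arguments['seconds']} seconds"
    raise ValueError(f"unknown tool: {name}")
```

<p>With something like this registered, the model can translate a timing request into a tool call instead of trying (and failing) to handle time on its own.</p><p>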
&#8220;Please wait three seconds before replying&#8221;, for example, is a meaningless query to an LLM that doesn&#8217;t have a timer tool of some sort.</p><p>For example, here is what GPT-4o (the LLM that handles time best) can do when asked to wait for a few seconds:</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;f7cbd52d-8b94-48c5-ac92-f3db4f3275dd&quot;,&quot;duration&quot;:null}"></div><p></p><h1>Problem #2: Interruptions are not a native capability</h1><p>Most Voice AIs out there offer the possibility to interrupt them. However, not having any innate concept of time, the ability to interrupt the model has to be implemented on the application end, and this is where it usually goes wrong.</p><p>Voice LLMs are very fast: they generate the response in a fraction of the time needed to play it out. When you prompt an LLM, the model will start generating audio tokens and streaming them, but by the time the first one reaches the user, in most cases the majority of the response (if not the entirety of it) has already been generated and is queued in the audio buffer, waiting to be played.</p><p>When a user interrupts the LLM, the app normally stops the playback as soon as possible and <strong>empties the audio buffer</strong>, regardless of its content.</p><p>However, unless the app notifies the LLM of this action, <strong>the LLM has no way to know that only part of the response was played to the user.</strong> This is why most models believe they finished their countdown when in practice they were interrupted earlier.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!G4o3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d7b58a8-510f-4719-a755-aaa2c1138723_3134x1104.png" data-component-name="Image2ToDOM"><div
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!G4o3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d7b58a8-510f-4719-a755-aaa2c1138723_3134x1104.png 424w, https://substackcdn.com/image/fetch/$s_!G4o3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d7b58a8-510f-4719-a755-aaa2c1138723_3134x1104.png 848w, https://substackcdn.com/image/fetch/$s_!G4o3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d7b58a8-510f-4719-a755-aaa2c1138723_3134x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!G4o3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d7b58a8-510f-4719-a755-aaa2c1138723_3134x1104.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!G4o3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d7b58a8-510f-4719-a755-aaa2c1138723_3134x1104.png" width="1456" height="513" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9d7b58a8-510f-4719-a755-aaa2c1138723_3134x1104.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:513,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!G4o3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d7b58a8-510f-4719-a755-aaa2c1138723_3134x1104.png 424w, https://substackcdn.com/image/fetch/$s_!G4o3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d7b58a8-510f-4719-a755-aaa2c1138723_3134x1104.png 848w, https://substackcdn.com/image/fetch/$s_!G4o3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d7b58a8-510f-4719-a755-aaa2c1138723_3134x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!G4o3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d7b58a8-510f-4719-a755-aaa2c1138723_3134x1104.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1>Can it be solved?</h1><p>If you paid close attention, you may have noticed that GPT-4o, while it still stops at the wrong number, does not believe it completed the countdown: it understood that the counting was interrupted at some point before the end.</p><p>This is possible because OpenAI&#8217;s Realtime API provides a way to tell the model at which point it was interrupted. In the Realtime API documentation you can find this feature implemented with the event <code>conversation.item.truncate</code> (see the <a href="https://platform.openai.com/docs/api-reference/realtime-client-events/conversation/item/truncate">docs</a>):</p><pre><code><code>{
    "event_id": "event_678",
    "type": "conversation.item.truncate",
    "item_id": "msg_002",
    "content_index": 0,
    "audio_end_ms": 1500
}
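</code></code></pre><p>To make this concrete, here is a sketch of how a client application could report an interruption: record the playback position when the user barges in, and send it as <code>audio_end_ms</code>. The helper function below is my own illustration (the field values match the example event above), not part of any official SDK:</p>

```python
import json

def build_truncate_event(event_id: str, item_id: str, played_ms: int) -> str:
    """Serialize a conversation.item.truncate event telling the model how
    much of its audio answer was actually heard before the interruption."""
    return json.dumps({
        "event_id": event_id,
        "type": "conversation.item.truncate",
        "item_id": item_id,          # the assistant message being cut short
        "content_index": 0,
        "audio_end_ms": played_ms,   # playback position at the interruption
    })

# On interruption: stop playback, empty the audio buffer, then notify the
# model, e.g. ws.send(build_truncate_event("event_678", "msg_002", 1500))
```

<p>The hard part is measuring the played milliseconds accurately, since generation usually runs far ahead of playback.</p><pre><code><code>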
</code></code></pre><p>In this event, the <code>audio_end_ms</code> field is what signals to the model that the audio was interrupted at a certain time, before its natural end. This event in turn also trims the transcript, so the LLM knows what the user heard and what was never played out. Precision, however, is not trivial to achieve, because it&#8217;s very easy for the application to register the interruption later than when it actually occurred and, as in the case of the ChatGPT app, convince the LLM that the interruption happened at the wrong point.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qJcW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec9aaa59-cda1-4658-8992-05c4249f404f_3144x1110.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qJcW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec9aaa59-cda1-4658-8992-05c4249f404f_3144x1110.png 424w, https://substackcdn.com/image/fetch/$s_!qJcW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec9aaa59-cda1-4658-8992-05c4249f404f_3144x1110.png 848w, https://substackcdn.com/image/fetch/$s_!qJcW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec9aaa59-cda1-4658-8992-05c4249f404f_3144x1110.png 1272w, https://substackcdn.com/image/fetch/$s_!qJcW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec9aaa59-cda1-4658-8992-05c4249f404f_3144x1110.png 1456w" sizes="100vw"><img
src="https://substackcdn.com/image/fetch/$s_!qJcW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec9aaa59-cda1-4658-8992-05c4249f404f_3144x1110.png" width="1456" height="514" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ec9aaa59-cda1-4658-8992-05c4249f404f_3144x1110.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:514,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qJcW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec9aaa59-cda1-4658-8992-05c4249f404f_3144x1110.png 424w, https://substackcdn.com/image/fetch/$s_!qJcW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec9aaa59-cda1-4658-8992-05c4249f404f_3144x1110.png 848w, https://substackcdn.com/image/fetch/$s_!qJcW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec9aaa59-cda1-4658-8992-05c4249f404f_3144x1110.png 1272w, https://substackcdn.com/image/fetch/$s_!qJcW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec9aaa59-cda1-4658-8992-05c4249f404f_3144x1110.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In the case of Gemini, there is a <a href="https://ai.google.dev/gemini-api/docs/live#interruptions">&#8220;Handling Interruptions&#8221;</a> section in its Live API documentation. However the feature seems incomplete, as they state:</p><blockquote><p>Users can interrupt the model&#8217;s output at any time. When Voice activity detection (VAD) detects an interruption, the ongoing generation is canceled and discarded. <strong>Only the information already sent to the client is retained in the session history</strong>.</p></blockquote><p>As we&#8217;ve seen, this is not sufficient to handle interruptions correctly. 
It&#8217;s likely that this issue is not currently fixable.</p><p>In the case of Claude, we don&#8217;t know yet whether this is an inherent limitation or a bug in the app, because at the time of writing there is no Live/Realtime API available for Claude.</p><h1>Wrapping up</h1><p>Voice Mode for LLMs is a huge step forward for voice AI, but it&#8217;s not a silver bullet. LLMs are first and foremost text prediction algorithms, and even when adapted to work with voice, some of their limitations persist. In order to have complete control, building a <a href="https://www.zansara.dev/posts/2024-09-05-building-voice-agents-with-open-source-tools-part-1/">full pipeline for voice</a> may still be your best bet if you have the infrastructure to achieve a low enough latency; otherwise, always make sure to test the behavior of your LLMs in these corner cases and stick to more well-tested models (in this case, OpenAI&#8217;s) for better handling of timing.</p>]]></content:encoded></item><item><title><![CDATA[A simple vibecoding exercise ]]></title><description><![CDATA[Can GenAI help you finish your side-projects?]]></description><link>https://zansara.substack.com/p/a-simple-vibecoding-exercise</link><guid isPermaLink="false">https://zansara.substack.com/p/a-simple-vibecoding-exercise</guid><dc:creator><![CDATA[Sara Z.]]></dc:creator><pubDate>Wed, 21 May 2025 16:35:58 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/485dbd51-9706-42e8-b374-c2ed6349e7b3_1536x668.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Sometimes, after an entire day of coding, the last thing you want to do is to code some more. It would be so great if I could just sit down and enjoy some Youtube videos&#8230;</p><p>Being abroad, most of the videos I watch are in a foreign language, and it helps immensely to have subtitles when I&#8217;m not in the mood for hard focus. 
However, Youtube subtitles are often terrible or missing entirely.</p><p>Can the magic of modern Generative AI fix this problem?</p><p>We&#8217;ve all heard of <a href="https://x.com/karpathy/status/1886192184808149383">vibecoding</a>: sitting in front of your IDE, telling an AI what you want the code to do and letting it loose to create <em>something</em> that achieves that goal. In this case, the goal is rather simple: given a video file, generate subtitles for it using <a href="https://deepgram.com/">Deepgram</a>&#8217;s SDK (since it has a <a href="https://deepgram.com/pricing">generous free tier</a>). It seems such a simple task that even an LLM should be able to achieve it with minimal or no assistance. Right?</p><h1>The first shot: OpenAI</h1><p>For this simple experiment I decided not to use a dedicated IDE or VSCode plugin, but to stick to text-based tools. After all, I expected this task to be sorted with a single Python script made by OpenAI&#8217;s famed <code>o4-mini-high</code>, advertised as &#8220;Great at coding and visual reasoning&#8221;.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!POzC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa734f4bc-8a3c-4c9f-bbb1-ba6146365d0b_1643x624.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!POzC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa734f4bc-8a3c-4c9f-bbb1-ba6146365d0b_1643x624.png 424w, https://substackcdn.com/image/fetch/$s_!POzC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa734f4bc-8a3c-4c9f-bbb1-ba6146365d0b_1643x624.png 848w, 
https://substackcdn.com/image/fetch/$s_!POzC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa734f4bc-8a3c-4c9f-bbb1-ba6146365d0b_1643x624.png 1272w, https://substackcdn.com/image/fetch/$s_!POzC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa734f4bc-8a3c-4c9f-bbb1-ba6146365d0b_1643x624.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!POzC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa734f4bc-8a3c-4c9f-bbb1-ba6146365d0b_1643x624.png" width="1456" height="553" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a734f4bc-8a3c-4c9f-bbb1-ba6146365d0b_1643x624.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:553,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!POzC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa734f4bc-8a3c-4c9f-bbb1-ba6146365d0b_1643x624.png 424w, https://substackcdn.com/image/fetch/$s_!POzC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa734f4bc-8a3c-4c9f-bbb1-ba6146365d0b_1643x624.png 848w, 
https://substackcdn.com/image/fetch/$s_!POzC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa734f4bc-8a3c-4c9f-bbb1-ba6146365d0b_1643x624.png 1272w, https://substackcdn.com/image/fetch/$s_!POzC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa734f4bc-8a3c-4c9f-bbb1-ba6146365d0b_1643x624.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>The prompt was very simple:</p><blockquote><p>Write me a Python script that, given a video file, returns me an <a
href="https://en.wikipedia.org/wiki/SubRip">.srt</a> subtitle file using Deepgram&#8217;s API.</p></blockquote><p>As expected, the model thought about it, did some web searches, and then cooked up a script that used <code>deepgram-sdk</code> and <code>deepgram-captions</code>. It looked good, but as soon as I tried to run it, issues came up. Deepgram&#8217;s SDK complained about wrong formats, wrong SDK version, HTTP errors&#8230; Copy-pasting the errors back to <code>o4-mini-high</code> was in vain: the model seemed to understand that the Deepgram API had had a major upgrade since it was trained, but failed to use the new version. After four or five attempts (including one full restart of the chat), I realized this was going nowhere and looked for another option.</p><h1>The backup option: Claude Code</h1><p>I&#8217;ve heard many times that the best LLMs for vibecoding belong to the <a href="https://www.anthropic.com/claude">Claude</a> family. On top of that, there&#8217;s a cool TUI utility called <a href="https://www.anthropic.com/claude-code">Claude Code</a> that allows you to vibecode from the terminal, no IDE required. It uses <a href="https://www.anthropic.com/claude/sonnet">Claude 3.7 Sonnet</a> under the hood, so expectations are high.</p><p>Time to give it a try.</p><p>Installing the utility is a matter of a single command (<code>npm install -g @anthropic-ai/claude-code</code>) and a few emails to authenticate the utility with my Anthropic account. 
Once done we&#8217;re ready to go.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OWDk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd556e9dc-97bb-4c02-8068-401ff20fcb8f_808x479.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OWDk!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd556e9dc-97bb-4c02-8068-401ff20fcb8f_808x479.gif 424w, https://substackcdn.com/image/fetch/$s_!OWDk!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd556e9dc-97bb-4c02-8068-401ff20fcb8f_808x479.gif 848w, https://substackcdn.com/image/fetch/$s_!OWDk!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd556e9dc-97bb-4c02-8068-401ff20fcb8f_808x479.gif 1272w, https://substackcdn.com/image/fetch/$s_!OWDk!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd556e9dc-97bb-4c02-8068-401ff20fcb8f_808x479.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OWDk!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd556e9dc-97bb-4c02-8068-401ff20fcb8f_808x479.gif" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d556e9dc-97bb-4c02-8068-401ff20fcb8f_808x479.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!OWDk!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd556e9dc-97bb-4c02-8068-401ff20fcb8f_808x479.gif 424w, https://substackcdn.com/image/fetch/$s_!OWDk!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd556e9dc-97bb-4c02-8068-401ff20fcb8f_808x479.gif 848w, https://substackcdn.com/image/fetch/$s_!OWDk!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd556e9dc-97bb-4c02-8068-401ff20fcb8f_808x479.gif 1272w, https://substackcdn.com/image/fetch/$s_!OWDk!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd556e9dc-97bb-4c02-8068-401ff20fcb8f_808x479.gif 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The prompt is the same:</p><blockquote><p>Write me a Python script that, given a video file, returns me an .srt subtitle file using Deepgram&#8217;s API.</p></blockquote><p>Sure enough, Claude&#8217;s first attempt also fails for the same reason as o4 did: their knowledge is outdated, and they both use the Deepgram&#8217;s API in a way that&#8217;s not compabible with its new v3 
API. However, after a few attempts, Claude actually produces a script that <em>mostly</em> works.</p><h1>Results</h1><p><a href="https://gist.github.com/ZanSara/4bab5db89376d595128e0688804d694c">Here</a> is the output (I pasted the <code>README</code> and the <code>requirements.txt</code> at the top of the file for simplicity). I only needed to replace <code>nova-2</code> with <code>nova-3</code> to get the best possible transcription for Portuguese (other languages may get better transcription with <code>nova-2</code>).</p><p>To summarize:</p><ul><li><p><strong>Is it perfect? No.</strong> I can easily spot a lot of improvements to the code just by looking at it. It&#8217;s quite verbose, for example.</p></li><li><p><strong>Was it cheap? No.</strong> This script cost me a few dollars&#8217; worth of tokens and about half an hour of trial and error, roughly the hourly rate of a US software engineer.</p></li><li><p><strong>Is it enough for my purposes? Absolutely.</strong> Now I am finally able to enjoy my videos with good-quality subtitles without too much hassle.</p></li><li><p><strong>Could somebody who can&#8217;t program do this?</strong> I&#8217;m not so sure. Given how simple this task was, I was a bit disappointed by how long it took, and I am rather skeptical about the ability of today&#8217;s LLMs to handle more complex requests without oversight - at least with the tools I used.</p></li></ul><p>However, looking at the big picture, the trend is clear. Three years ago, LLMs could just about write coherent sentences. Today, they can write decent helper scripts. Soon they may be able to implement your side projects from start to finish.</p><p>Will it feel like a blessing or a curse? 
We&#8217;ll soon find out.</p>]]></content:encoded></item><item><title><![CDATA[Using Llama Models in the EU]]></title><description><![CDATA[The ban on multimodal models is surprisingly not well known among users of these popular "open-source" LLMs.]]></description><link>https://zansara.substack.com/p/using-llama-models-in-the-eu</link><guid isPermaLink="false">https://zansara.substack.com/p/using-llama-models-in-the-eu</guid><dc:creator><![CDATA[Sara Z.]]></dc:creator><pubDate>Fri, 16 May 2025 15:13:29 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/760c7918-58d7-4825-9578-039912bb2849_1536x868.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://ai.meta.com/blog/llama-4-multimodal-intelligence/">The Llama 4 family</a> was released over a month ago, and I finally found some time to explore it. Or so I wished to do, until I realized one crucial issue with these models:</p><p><strong>They are banned in the EU.</strong></p><p>Apparently Meta can&#8217;t be bothered to comply with EU regulations on AI, and therefore opted for a wide ban that should prevent such laws from applying to them. Of course, while this limitation is technically valid for each and every person and company domiciled in the EU, the problem arises primarily for companies that want to use Llama 4 to offer services and for researchers planning to work with these models, be it for evaluation, fine-tuning, distillation or other derivative work. 
Always keep in mind that I&#8217;m not a lawyer, so nothing of what I&#8217;m writing here constitutes legal advice.</p><h2>The terms</h2><p>The interesting part of this ban can be found by reading the <a href="https://github.com/meta-llama/llama-models/blob/main/models/llama4/USE_POLICY.md">terms</a> of the Acceptable Usage Policy (AUP):</p><p><em>With respect to any <strong>multimodal models</strong> included in Llama 4, the rights granted under Section 1(a) of the Llama 4 Community License Agreement are not being granted to you if you are an individual domiciled in, or a company with a principal place of business in, the European Union.</em></p><p>As you can see, the restriction applies strictly to multimodal LLMs. Llama 4 models are all multimodal, and that&#8217;s why the entire family of models is not accessible from the EU. However, if anyone releases a derivative model that is <em>not</em> multimodal, in theory the ban would not apply. I have yet to see any such derivative model: if you know of any, let me know!</p><p>Interestingly, the terms also state that:</p><p><em>This restriction <strong>does not apply to end users</strong> of a product or service that incorporates any such multimodal models.</em></p><p>So if you&#8217;re a company outside of the EU and provide services based on Llama 4 to EU customers, you&#8217;re probably off the hook. 
This interpretation seems to be confirmed by <a href="https://www.llama.com/faq/">Meta&#8217;s FAQ</a> about Llama models, which states:</p><p><em><strong>Can a non-EU based company develop a product or service using the Llama multimodal models and distribute such product or service within the EU?</strong></em></p><p><em>Yes, if you are a company with a principal place of business outside of the EU, you may distribute products or services that contain the Llama multimodal models in accordance with your standard global distribution business practices [...]</em></p><p><a href="https://www.llama.com/faq/">Meta&#8217;s FAQ</a> is actually quite thorough, so if you have any doubt about your specific case you should head there and read more.</p><h2>What about other Llamas?</h2><p>This wide EU ban is not new: it was introduced with <a href="https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/">Llama 3.2 Vision</a>, the first multimodal model released by Meta. The clause does not exist for any model older than Llama 3.2.</p><p>To summarize, here is a list of which models can be used in the EU:</p><ul><li><p>Llama 4: all banned, because they&#8217;re all multimodal (<a href="https://github.com/meta-llama/llama-models/blob/main/models/llama4/USE_POLICY.md">terms</a>)</p></li><li><p>Llama 3.3: allowed, because it&#8217;s not multimodal (<a href="https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/USE_POLICY.md">terms</a>)</p></li><li><p>Llama 3.2: text-only models are allowed, vision models are not (<a href="https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/USE_POLICY.md">terms</a>)</p></li><li><p>Llama 3.1 and earlier: allowed, because there's no such clause (<a href="https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/USE_POLICY.md">terms</a>)</p></li></ul><p>So for now this is the state of Llama licenses. 
My take is that, with the implementation and rollout of the EU AI Act in 2025 and 2026, Meta will eventually adapt to make sure that the models are compliant with the way the Act is enforced in practice, and relax, if not lift, the ban on newer models.</p><p>Also, Llama 4 has not been shining in the <a href="https://lmarena.ai/?leaderboard">benchmarks</a> (scroll down, I promise you it&#8217;s there)... we Europeans may not be missing much.</p>]]></content:encoded></item><item><title><![CDATA[Beyond the hype of reasoning models: debunking three common misunderstandings ]]></title><description><![CDATA[This is a teaser for my upcoming talk at ODSC East 2025, "LLMs that Think: Demystifying Reasoning Models". If you want to learn more, join the webinar!]]></description><link>https://zansara.substack.com/p/beyond-the-hype-of-reasoning-models</link><guid isPermaLink="false">https://zansara.substack.com/p/beyond-the-hype-of-reasoning-models</guid><dc:creator><![CDATA[Sara Z.]]></dc:creator><pubDate>Mon, 12 May 2025 13:18:29 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/80f3f215-4f69-4694-a269-5cdea71be75d_1536x644.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>With the release of OpenAI&#8217;s o1 and similar models such as DeepSeek R1, Gemini 2.0 Flash Thinking, Phi 4 Reasoning and more, a new type of LLM entered the scene: the so-called reasoning models. With their unbelievable scores on the toughest benchmarks for machine intelligence, reasoning models immediately got the attention of most AI enthusiasts, sparking speculation about their capabilities and what they could mean for the industry.</p><p>However, as often happens in the field of Generative AI, the hype makes it very difficult to understand at a glance what these models can really do. 
But before we jump into the details, let&#8217;s clarify what we&#8217;re talking about.</p><h1>What is a reasoning model?</h1><p>Reasoning models are LLMs that are able to &#8220;think&#8221;. Instead of generating a reply immediately after the user&#8217;s prompt, like every other LLM, they first generate a series of &#8220;reasoning tokens&#8221;, which is nothing more than the model thinking out loud: breaking down a complex problem into smaller steps, checking all its assumptions, asking itself whether it made any mistakes, double-checking its results, and so on. Once the model is satisfied with its conclusions, it starts generating actual response tokens that summarize the conclusions reached during the reasoning phase and presents those tokens to the user.</p><p>In the case of some models, such as OpenAI&#8217;s, the reasoning tokens are hidden from the user. In other models, such as most open-source ones, the reasoning output can be returned to the user as well. However, this trace is not optimized to be read by people, so it often looks odd and contrived even when it reaches the correct conclusions.</p><p>Now that we understand better what a reasoning model is, let&#8217;s discuss a few common misunderstandings related to them.</p><h1>Are reasoning models AGI?</h1><p>AGI stands for Artificial General Intelligence, and it&#8217;s one of the most ill-defined terms in Generative AI. Several people have tried to offer a more precise definition of this term, out of which my favourite is the following:</p><blockquote><p>AGI is an AI that is better than any human at any economically valuable task.</p></blockquote><p>Under this light it&#8217;s clear that no current LLM, not even the most advanced reasoning model, is yet at the level where it could replace any human at any task. 
They can surely offer very valuable help with their vast knowledge and their growing ability to reason, but they&#8217;re not yet at the point where they can take on any job without further specialization and complex tooling around them.</p><h1>Are reasoning models AI agents?</h1><p>An AI agent is usually defined as any application that can use tools to achieve complex goals. Considering that reasoning models are usually able to use tools, it&#8217;s natural to think that they themselves may be considered AI agents.</p><p>In practice, however, reasoning models on their own hardly qualify as agents. Many powerful agent systems do have an LLM at their core: they use it to understand the user&#8217;s request and plan the actions to take to achieve the goal they&#8217;re given. Reasoning models are a perfect fit as the minds of agents like that, due to their advanced ability to break down problems into smaller, manageable parts and self-correct their strategy on the fly if something goes wrong. Taken in isolation, though, reasoning models can&#8217;t be called AI agents.</p><h1>Are reasoning models glorified CoT prompts?</h1><p>If you have worked with AI agents and other LLM systems designed to solve problems, you&#8217;ve surely come across Chain of Thought prompting. In short, this technique involves adding instructions to your LLM&#8217;s system prompt to &#8220;think step by step&#8221; before replying. This makes the LLM think out loud before reaching a conclusion and, even in regular non-reasoning LLMs, significantly improves their problem-solving skills.</p><p>At first glance, the output of a reasoning model may look precisely like the output of a CoT prompt, so some experts may think that their reasoning capabilities are the same. This is a mistake. 
Reasoning models are much more powerful than regular LLMs, even when the latter are equipped with a CoT prompt: this is because reasoning models go through one additional step during their training, where they learn to refine their &#8220;thinking step by step&#8221; skills through supervised learning on prompts with verifiable output, such as mathematical problems. Reasoning models are not zero-shot reasoners like regular LLMs: they&#8217;re fine-tuned for it.</p><h1>Wrapping up</h1><p>Reasoning models may not be the super-human intelligence some of us are waiting for, but they surely are a significant step toward LLMs with very strong reasoning abilities.</p><p>If you want to learn more about what reasoning models can do, how they reason, when to use them and more, make sure to attend my talk <a href="https://odsc.com/speakers/llms-that-think-demystifying-reasoning-models/">LLMs that Think: Demystifying Reasoning Models</a> at this year&#8217;s virtual edition of <a href="https://odsc.com/boston/">ODSC East</a>. See you there!</p>]]></content:encoded></item><item><title><![CDATA[The Agent Compass (Part 2)]]></title><description><![CDATA[Agent means everything and nothing in today's GenAI landscape. 
Let's shed some light on this topic.]]></description><link>https://zansara.substack.com/p/the-agent-compass-part-2</link><guid isPermaLink="false">https://zansara.substack.com/p/the-agent-compass-part-2</guid><dc:creator><![CDATA[Sara Z.]]></dc:creator><pubDate>Mon, 10 Jun 2024 18:07:53 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/d623e911-28c2-4c84-9910-45e6f818a790_2577x1310.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em><a href="https://open.substack.com/pub/zansara/p/the-agent-compass-part-1?r=3wc89f&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Read Part 1 here</a>.</em></p><h3>Self-correcting RAG</h3><p>Self-correcting RAG is a technique that improves on simple RAG by making the LLM double-check its replies before returning them to the user. It comes from an LLM evaluation technique called &#8220;LLM-as-a-judge&#8221;, in which an LLM is used to judge the output of a different LLM or RAG pipeline.</p><p>Self-correcting RAG starts as simple RAG: when the user asks a question, the retriever is called and the results are sent to the LLM to extract an answer from. However, before returning the answer to the user, another LLM is asked to judge whether, in its opinion, the answer is correct. If the second LLM agrees, the answer is sent to the user. 
If not, the second LLM generates a new question for the retriever and runs it again, or in other cases, it simply integrates its opinion in the prompt and runs the first LLM again.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HI5f!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65a29df4-1952-420c-a98c-f9abc44b5db8_3220x1372.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HI5f!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65a29df4-1952-420c-a98c-f9abc44b5db8_3220x1372.png 424w, https://substackcdn.com/image/fetch/$s_!HI5f!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65a29df4-1952-420c-a98c-f9abc44b5db8_3220x1372.png 848w, https://substackcdn.com/image/fetch/$s_!HI5f!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65a29df4-1952-420c-a98c-f9abc44b5db8_3220x1372.png 1272w, https://substackcdn.com/image/fetch/$s_!HI5f!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65a29df4-1952-420c-a98c-f9abc44b5db8_3220x1372.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HI5f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65a29df4-1952-420c-a98c-f9abc44b5db8_3220x1372.png" width="1456" height="620" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/65a29df4-1952-420c-a98c-f9abc44b5db8_3220x1372.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:620,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:391316,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!HI5f!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65a29df4-1952-420c-a98c-f9abc44b5db8_3220x1372.png 424w, https://substackcdn.com/image/fetch/$s_!HI5f!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65a29df4-1952-420c-a98c-f9abc44b5db8_3220x1372.png 848w, https://substackcdn.com/image/fetch/$s_!HI5f!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65a29df4-1952-420c-a98c-f9abc44b5db8_3220x1372.png 1272w, https://substackcdn.com/image/fetch/$s_!HI5f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65a29df4-1952-420c-a98c-f9abc44b5db8_3220x1372.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Self-correcting RAG can be seen as <strong>one more step towards agentic behavior</strong> because it unlocks a new possibility for the application: <strong>the ability to try again</strong>. A self-correcting RAG app has a chance to detect its own mistakes and has the agency to decide that it&#8217;s better to try again, maybe with a slightly reworded question or different retrieval parameters, before answering the user. 
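</p><p>The loop just described can be sketched in a few lines of Python. This is a minimal toy, not a real implementation: <code>retrieve</code>, <code>generate</code> and <code>judge</code> are hypothetical stand-ins for the actual retriever and the two LLM calls, stubbed here so the control flow is easy to follow.</p>

```python
# Minimal sketch of the self-correcting RAG loop (toy stubs, not real LLM calls).

def retrieve(question):
    # Stand-in retriever: naive substring lookup over a tiny corpus.
    corpus = {"capital of france": "Paris is the capital of France."}
    return [doc for key, doc in corpus.items() if key in question.lower()]

def generate(question, documents):
    # Stand-in for the first LLM: extracts an answer from the retrieved context.
    return documents[0] if documents else "I don't know."

def judge(question, answer):
    # Stand-in for the second LLM ("LLM-as-a-judge").
    # Returns (is_correct, reworded_question).
    if answer == "I don't know.":
        return False, "What is the capital of France?"
    return True, None

def self_correcting_rag(question, max_retries=2):
    answer = "I don't know."
    for _ in range(max_retries + 1):
        documents = retrieve(question)
        answer = generate(question, documents)
        ok, reworded = judge(question, answer)
        if ok:
            return answer       # the judge approved: send the answer to the user
        question = reworded     # otherwise, retry with the judge's rewording
    return answer               # retry budget exhausted: return the best effort

print(self_correcting_rag("Which city is France's capital?"))
# -> Paris is the capital of France.
```

<p>In a real system the judge might hand back new retrieval parameters or extra context instead of a reworded question, but the control flow stays the same: generate, judge, and retry until the judge is satisfied or the retry budget runs out.</p><p>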
Given that this process is entirely autonomous, we&#8217;ll place this technique quite towards the Autonomous end of the scale.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TSP9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48899f8e-a0fa-498e-a155-381885215bb4_3066x1690.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TSP9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48899f8e-a0fa-498e-a155-381885215bb4_3066x1690.png 424w, https://substackcdn.com/image/fetch/$s_!TSP9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48899f8e-a0fa-498e-a155-381885215bb4_3066x1690.png 848w, https://substackcdn.com/image/fetch/$s_!TSP9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48899f8e-a0fa-498e-a155-381885215bb4_3066x1690.png 1272w, https://substackcdn.com/image/fetch/$s_!TSP9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48899f8e-a0fa-498e-a155-381885215bb4_3066x1690.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TSP9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48899f8e-a0fa-498e-a155-381885215bb4_3066x1690.png" width="1456" height="803" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/48899f8e-a0fa-498e-a155-381885215bb4_3066x1690.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:803,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:282876,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!TSP9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48899f8e-a0fa-498e-a155-381885215bb4_3066x1690.png 424w, https://substackcdn.com/image/fetch/$s_!TSP9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48899f8e-a0fa-498e-a155-381885215bb4_3066x1690.png 848w, https://substackcdn.com/image/fetch/$s_!TSP9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48899f8e-a0fa-498e-a155-381885215bb4_3066x1690.png 1272w, https://substackcdn.com/image/fetch/$s_!TSP9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48899f8e-a0fa-498e-a155-381885215bb4_3066x1690.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Chain-of-thought</h3><p><a href="https://arxiv.org/abs/2201.11903">Chain-of-thought</a> is a family of prompting techniques that makes the LLM &#8220;reason out loud&#8221;. 
It&#8217;s very useful when the model needs to process a complicated question, such as a mathematical problem or a layered question like &#8220;When was the eldest sister of the current King of Sweden born?&#8221; Assuming that the LLM knows these facts, to avoid hallucinations it&#8217;s best to ask the model to proceed &#8220;step-by-step&#8221; and find out, in order:</p><ol><li><p>Who the current King of Sweden is,</p></li><li><p>Whether he has an elder sister,</p></li><li><p>If yes, who she is,</p></li><li><p>When the person identified above was born.</p></li></ol><p>The LLM might know the final fact in any case, but the probability of it giving the right answer increases noticeably if the LLM is prompted this way.</p><p>Chain-of-thought prompts can also be seen as the LLM accomplishing the task of finding the correct answer in steps, which implies that there are two lines of thinking going on: on one side the LLM is answering the questions it&#8217;s posing to itself, while on the other it&#8217;s constantly re-assessing whether it has a final answer for the user.</p><p>In the example above, the chain of thought would end at step 2 if the current King of Sweden had no elder sisters (in fact he <a href="https://en.wikipedia.org/wiki/Carl_XVI_Gustaf#Early_life">has four</a>): the LLM needs to keep an eye on its own thought process and decide whether it needs to continue or not.</p><p>We can summarize an app using chain-of-thought prompting like this: when a user asks a question, first of all the LLM reacts to the chain-of-thought prompt to lay out the sub-questions it needs to answer. Then it answers its own questions one by one, asking itself each time whether the final answer has already been found.
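</p><p>Concretely, the &#8220;step-by-step&#8221; instruction is just part of the prompt. A minimal sketch of how such a prompt could be built (the wording is illustrative; any phrasing that elicits stepwise reasoning works similarly):</p>

```python
# Build a minimal chain-of-thought prompt for the layered question above.
question = "When was the eldest sister of the current King of Sweden born?"

cot_prompt = (
    "Answer the question below. Proceed step-by-step: break the question "
    "into sub-questions, answer them one at a time, and after each step "
    "check whether you already have the final answer.\n\n"
    f"Question: {question}\n"
    "Let's think step by step."
)
print(cot_prompt)
```

<p>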
When the LLM believes it has the final answer, it rewrites it for the user and returns it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!r14Z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68c95e15-40b0-4ee9-9066-0bc8295e773e_2914x1506.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!r14Z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68c95e15-40b0-4ee9-9066-0bc8295e773e_2914x1506.png 424w, https://substackcdn.com/image/fetch/$s_!r14Z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68c95e15-40b0-4ee9-9066-0bc8295e773e_2914x1506.png 848w, https://substackcdn.com/image/fetch/$s_!r14Z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68c95e15-40b0-4ee9-9066-0bc8295e773e_2914x1506.png 1272w, https://substackcdn.com/image/fetch/$s_!r14Z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68c95e15-40b0-4ee9-9066-0bc8295e773e_2914x1506.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!r14Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68c95e15-40b0-4ee9-9066-0bc8295e773e_2914x1506.png" width="1456" height="752" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/68c95e15-40b0-4ee9-9066-0bc8295e773e_2914x1506.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:752,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:354688,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!r14Z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68c95e15-40b0-4ee9-9066-0bc8295e773e_2914x1506.png 424w, https://substackcdn.com/image/fetch/$s_!r14Z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68c95e15-40b0-4ee9-9066-0bc8295e773e_2914x1506.png 848w, https://substackcdn.com/image/fetch/$s_!r14Z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68c95e15-40b0-4ee9-9066-0bc8295e773e_2914x1506.png 1272w, https://substackcdn.com/image/fetch/$s_!r14Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68c95e15-40b0-4ee9-9066-0bc8295e773e_2914x1506.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This new prompting technique makes a big step towards full agency: the ability for the LLM to <strong>assess whether the goal has been achieved</strong> before returning any answer to the user. While apps like Bing Chat iterate with the user and need their feedback to reach high-level goals, chain-of-thought gives the LLM the freedom to check its own answers before having the user judge them, which makes the loop much faster and can increase the output quality dramatically.</p><p>This process is similar to what self-correcting RAG does, but has a wider scope, because the LLM does not only need to decide whether an answer is correct, it can also decide to continue reasoning in order to make it more complete, more detailed, to phrase it better, and so on.</p><p>Another interesting trait of chain-of-thought apps is that they introduce the concept of <strong>inner monologue</strong>. 
The inner monologue is a conversation that the LLM has with itself, a conversation buffer where it keeps adding messages as the reasoning develops. This monologue is not visible to the user, but helps the LLM deconstruct a complex line of reasoning into a more manageable format, like a researcher who takes notes instead of keeping all their earlier reasoning in their head at all times.</p><p>Due to the wider scope of the decision-making that chain-of-thought apps are able to do, they also sit in the middle of our compass. They can be seen as slightly more autonomous than conversational apps because they hide their inner monologue from the user.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Mc5B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5705dc-50b4-4ada-a2be-6e2df480e04c_3066x1676.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Mc5B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5705dc-50b4-4ada-a2be-6e2df480e04c_3066x1676.png 424w, https://substackcdn.com/image/fetch/$s_!Mc5B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5705dc-50b4-4ada-a2be-6e2df480e04c_3066x1676.png 848w, https://substackcdn.com/image/fetch/$s_!Mc5B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5705dc-50b4-4ada-a2be-6e2df480e04c_3066x1676.png 1272w,
https://substackcdn.com/image/fetch/$s_!Mc5B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5705dc-50b4-4ada-a2be-6e2df480e04c_3066x1676.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Mc5B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5705dc-50b4-4ada-a2be-6e2df480e04c_3066x1676.png" width="1456" height="796" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dd5705dc-50b4-4ada-a2be-6e2df480e04c_3066x1676.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:796,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:294075,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Mc5B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5705dc-50b4-4ada-a2be-6e2df480e04c_3066x1676.png 424w, https://substackcdn.com/image/fetch/$s_!Mc5B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5705dc-50b4-4ada-a2be-6e2df480e04c_3066x1676.png 848w, https://substackcdn.com/image/fetch/$s_!Mc5B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5705dc-50b4-4ada-a2be-6e2df480e04c_3066x1676.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Mc5B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5705dc-50b4-4ada-a2be-6e2df480e04c_3066x1676.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>From here, the next step is straightforward: using tools.</p><h3>Multi-hop RAG</h3><p>Multi-hop RAG applications are nothing more than simple RAG apps that use chain-of-thought prompting and are free to invoke the retriever as many times as needed and only when needed.</p><p>This is how it works.
When the user asks a question, a chain-of-thought prompt is generated and sent to the LLM. The LLM assesses whether it knows the answer to the question and, if not, asks itself whether a retrieval is necessary. If it decides that retrieval is necessary it calls the retriever; otherwise it skips it and generates an answer directly. It then checks again whether the question is answered, and repeats the cycle if not. Once it exits the loop, the LLM produces a complete answer by re-reading its own inner monologue and returns this reply to the user.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5jC1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feda61034-683c-4261-9e51-6c1134d2b4f0_3350x1692.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5jC1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feda61034-683c-4261-9e51-6c1134d2b4f0_3350x1692.png 424w, https://substackcdn.com/image/fetch/$s_!5jC1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feda61034-683c-4261-9e51-6c1134d2b4f0_3350x1692.png 848w, https://substackcdn.com/image/fetch/$s_!5jC1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feda61034-683c-4261-9e51-6c1134d2b4f0_3350x1692.png 1272w, https://substackcdn.com/image/fetch/$s_!5jC1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feda61034-683c-4261-9e51-6c1134d2b4f0_3350x1692.png 1456w" sizes="100vw"><img
src="https://substackcdn.com/image/fetch/$s_!5jC1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feda61034-683c-4261-9e51-6c1134d2b4f0_3350x1692.png" width="1456" height="735" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eda61034-683c-4261-9e51-6c1134d2b4f0_3350x1692.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:735,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:604387,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!5jC1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feda61034-683c-4261-9e51-6c1134d2b4f0_3350x1692.png 424w, https://substackcdn.com/image/fetch/$s_!5jC1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feda61034-683c-4261-9e51-6c1134d2b4f0_3350x1692.png 848w, https://substackcdn.com/image/fetch/$s_!5jC1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feda61034-683c-4261-9e51-6c1134d2b4f0_3350x1692.png 1272w, https://substackcdn.com/image/fetch/$s_!5jC1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feda61034-683c-4261-9e51-6c1134d2b4f0_3350x1692.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" 
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>An app like this is getting quite close to a proper autonomous agent, because it can <strong>perform its own research autonomously</strong>. The LLM calls are made in such a way that the system is able to assess whether it knows enough to answer or whether it should do more research by formulating more questions for the retriever and then reasoning over the newly collected data.</p><p>Multi-hop RAG is a very powerful technique that shows a lot of agency and autonomy, and therefore can be placed in the lower-right quadrant of our compass.
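</p><p>The loop described above can be sketched as follows. Everything here is a toy stand-in: the llm() policy, the two-entry &#8220;index&#8221;, and the population figure are illustrative, not real data or a real API:</p>

```python
# Sketch of a multi-hop RAG loop: the LLM either requests a retrieval
# or declares a final answer, and the app loops until it is done.

def retrieve(query: str) -> str:
    # Toy retriever over two indexed facts (illustrative figures).
    facts = {
        "capital of sweden": "The capital of Sweden is Stockholm.",
        "population of stockholm": "Stockholm has roughly one million inhabitants.",
    }
    return facts.get(query, "No results.")

def llm(monologue: str) -> str:
    # Toy policy: ask for the facts still missing from the monologue.
    if "Stockholm" not in monologue:
        return "RETRIEVE: capital of sweden"
    if "inhabitants" not in monologue:
        return "RETRIEVE: population of stockholm"
    return "ANSWER: Stockholm, the capital of Sweden, has roughly one million inhabitants."

def multi_hop_rag(question: str, max_hops: int = 5) -> str:
    monologue = f"Question: {question}"  # the inner monologue buffer
    for _ in range(max_hops):
        step = llm(monologue)
        if step.startswith("RETRIEVE: "):
            observation = retrieve(step[len("RETRIEVE: "):])
            monologue += f"\nObservation: {observation}"
        else:
            return step[len("ANSWER: "):]
    return "Could not answer within the hop budget."

print(multi_hop_rag("How many people live in the capital of Sweden?"))
```

<p>Note how the retriever is called twice, and only because the monologue showed that information was still missing.</p><p>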
However, it is still limited with respect to a &#8220;true&#8221; autonomous agent, because the only action it can take is to invoke the retriever.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NRgT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feccbad03-d123-4bac-b507-25f8dcacc468_3058x1718.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NRgT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feccbad03-d123-4bac-b507-25f8dcacc468_3058x1718.png 424w, https://substackcdn.com/image/fetch/$s_!NRgT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feccbad03-d123-4bac-b507-25f8dcacc468_3058x1718.png 848w, https://substackcdn.com/image/fetch/$s_!NRgT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feccbad03-d123-4bac-b507-25f8dcacc468_3058x1718.png 1272w, https://substackcdn.com/image/fetch/$s_!NRgT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feccbad03-d123-4bac-b507-25f8dcacc468_3058x1718.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NRgT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feccbad03-d123-4bac-b507-25f8dcacc468_3058x1718.png" width="1456" height="818" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eccbad03-d123-4bac-b507-25f8dcacc468_3058x1718.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:818,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:310759,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!NRgT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feccbad03-d123-4bac-b507-25f8dcacc468_3058x1718.png 424w, https://substackcdn.com/image/fetch/$s_!NRgT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feccbad03-d123-4bac-b507-25f8dcacc468_3058x1718.png 848w, https://substackcdn.com/image/fetch/$s_!NRgT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feccbad03-d123-4bac-b507-25f8dcacc468_3058x1718.png 1272w, https://substackcdn.com/image/fetch/$s_!NRgT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feccbad03-d123-4bac-b507-25f8dcacc468_3058x1718.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>ReAct Agents</h3><p>Let&#8217;s now move on to apps that can properly be called &#8220;agents&#8221;. One of the first flavors of agentic LLM apps, and still the most popular nowadays, is called &#8220;<a href="https://arxiv.org/abs/2210.03629">ReAct</a>&#8221; Agents, which stands for &#8220;Reason + Act&#8221;. ReAct is a prompting technique that belongs to the chain-of-thought extended family: it makes the LLM reason step by step, decide whether to perform any action, and then observe the result of the actions it took before moving further.</p><p>A ReAct agent works more or less like this: when the user sets a goal, the app builds a ReAct prompt, which first of all asks the LLM whether the answer is already known. If the LLM says no, the prompt makes it select a tool. The tool returns some values which are added to the inner monologue of the application together with the invitation to re-assess whether the goal has been accomplished.
The app loops over until the answer is found, and then the answer is returned to the user.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fLDB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6809844b-bedd-4c4c-a834-e0d4bcac1565_3354x1698.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fLDB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6809844b-bedd-4c4c-a834-e0d4bcac1565_3354x1698.png 424w, https://substackcdn.com/image/fetch/$s_!fLDB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6809844b-bedd-4c4c-a834-e0d4bcac1565_3354x1698.png 848w, https://substackcdn.com/image/fetch/$s_!fLDB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6809844b-bedd-4c4c-a834-e0d4bcac1565_3354x1698.png 1272w, https://substackcdn.com/image/fetch/$s_!fLDB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6809844b-bedd-4c4c-a834-e0d4bcac1565_3354x1698.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fLDB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6809844b-bedd-4c4c-a834-e0d4bcac1565_3354x1698.png" width="1456" height="737" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6809844b-bedd-4c4c-a834-e0d4bcac1565_3354x1698.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:737,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:586024,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!fLDB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6809844b-bedd-4c4c-a834-e0d4bcac1565_3354x1698.png 424w, https://substackcdn.com/image/fetch/$s_!fLDB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6809844b-bedd-4c4c-a834-e0d4bcac1565_3354x1698.png 848w, https://substackcdn.com/image/fetch/$s_!fLDB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6809844b-bedd-4c4c-a834-e0d4bcac1565_3354x1698.png 1272w, https://substackcdn.com/image/fetch/$s_!fLDB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6809844b-bedd-4c4c-a834-e0d4bcac1565_3354x1698.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>As you can see, the structure is very similar to a multi-hop RAG, with an important difference: ReAct Agents normally have <strong>many tools to choose from</strong> rather than a single retriever. 
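</p><p>The loop described above can be sketched in a few lines of Python. This is only an illustrative sketch under assumed names: the <code>llm</code> callable and the tool functions are hypothetical stand-ins for a real model client and real tool implementations, and the plain-text action format is made up for brevity.</p>

```python
# Hypothetical sketch of a ReAct-style loop: the model alternates between
# acting (choosing a tool) and observing the tool's result, until it
# declares a final answer.
def react_loop(question, llm, tools, max_steps=10):
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        step = llm("\n".join(transcript))
        if step.startswith("FINAL:"):                    # the model decided it's done
            return step.removeprefix("FINAL:").strip()
        tool_name, _, tool_input = step.partition(" ")   # e.g. "search capital of France"
        observation = tools[tool_name](tool_input)       # act with the chosen tool
        transcript.append(f"Action: {step}")
        transcript.append(f"Observation: {observation}")
    return "No answer found."
```

<p>With many tools registered in the <code>tools</code> dictionary, the same loop lets the model pick whichever tool fits the current step.</p><p>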
This gives them the agency to make much more complex decisions, which is why they can finally be called &#8220;agents&#8221;.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6127!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b85d6f5-d048-40c6-a582-0aae1557b7a5_3044x1682.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6127!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b85d6f5-d048-40c6-a582-0aae1557b7a5_3044x1682.png 424w, https://substackcdn.com/image/fetch/$s_!6127!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b85d6f5-d048-40c6-a582-0aae1557b7a5_3044x1682.png 848w, https://substackcdn.com/image/fetch/$s_!6127!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b85d6f5-d048-40c6-a582-0aae1557b7a5_3044x1682.png 1272w, https://substackcdn.com/image/fetch/$s_!6127!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b85d6f5-d048-40c6-a582-0aae1557b7a5_3044x1682.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6127!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b85d6f5-d048-40c6-a582-0aae1557b7a5_3044x1682.png" width="1456" height="805" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3b85d6f5-d048-40c6-a582-0aae1557b7a5_3044x1682.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:805,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:323146,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!6127!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b85d6f5-d048-40c6-a582-0aae1557b7a5_3044x1682.png 424w, https://substackcdn.com/image/fetch/$s_!6127!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b85d6f5-d048-40c6-a582-0aae1557b7a5_3044x1682.png 848w, https://substackcdn.com/image/fetch/$s_!6127!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b85d6f5-d048-40c6-a582-0aae1557b7a5_3044x1682.png 1272w, https://substackcdn.com/image/fetch/$s_!6127!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b85d6f5-d048-40c6-a582-0aae1557b7a5_3044x1682.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>ReAct Agents are very autonomous in their tasks and rely on an inner monologue rather than a conversation with a user to achieve their goals. Therefore we place them very much on the Autonomous end of the spectrum.</p><h3>Conversational Agents</h3><p>Conversational Agents are a category of apps that can vary widely. As stated earlier, conversational agents focus on using the conversation itself as a tool to accomplish goals, so in order to understand them, one has to distinguish between the people that set the goal (let&#8217;s call them <em>owners</em>) and those who talk with the bot (the <em>users</em>).</p><p>Once this distinction is made, this is how the most basic conversational agents normally work. First, the owner sets a goal. The application then starts a conversation with a user and, right after the first message, starts asking itself if the given goal was accomplished. 
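</p><p>A goal-checking loop of this kind can be sketched as follows. All names here are hypothetical stand-ins: <code>llm</code> for a real model client, <code>send_to_user</code> for whatever channel reaches the target user, and <code>report_to_owner</code> for the channel back to the owner.</p>

```python
# Hypothetical sketch of a basic conversational agent: the owner sets a goal,
# the app converses with a user, and after every exchange it asks the LLM
# whether the goal has been accomplished yet.
def conversational_agent(goal, llm, send_to_user, report_to_owner, max_turns=20):
    history = []
    for _ in range(max_turns):
        message = llm(f"Goal: {goal}\nHistory: {history}\nNext message to send:")
        user_reply = send_to_user(message)          # talk to the target user
        history.append((message, user_reply))
        verdict = llm(f"Goal: {goal}\nHistory: {history}\nAccomplished? yes/no:")
        if verdict.strip().lower().startswith("yes"):
            break
    report_to_owner(f"Goal: {goal!r}. Conversation log: {history}")
```

<p>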
It then keeps talking to the target user until it believes the goal was attained and, once done, it returns back to its owner to report the outcome.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hdNk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F277cccea-7e29-4843-b784-1d99751c94b9_3372x1012.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hdNk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F277cccea-7e29-4843-b784-1d99751c94b9_3372x1012.png 424w, https://substackcdn.com/image/fetch/$s_!hdNk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F277cccea-7e29-4843-b784-1d99751c94b9_3372x1012.png 848w, https://substackcdn.com/image/fetch/$s_!hdNk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F277cccea-7e29-4843-b784-1d99751c94b9_3372x1012.png 1272w, https://substackcdn.com/image/fetch/$s_!hdNk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F277cccea-7e29-4843-b784-1d99751c94b9_3372x1012.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hdNk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F277cccea-7e29-4843-b784-1d99751c94b9_3372x1012.png" width="1456" height="437" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/277cccea-7e29-4843-b784-1d99751c94b9_3372x1012.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:437,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:253672,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!hdNk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F277cccea-7e29-4843-b784-1d99751c94b9_3372x1012.png 424w, https://substackcdn.com/image/fetch/$s_!hdNk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F277cccea-7e29-4843-b784-1d99751c94b9_3372x1012.png 848w, https://substackcdn.com/image/fetch/$s_!hdNk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F277cccea-7e29-4843-b784-1d99751c94b9_3372x1012.png 1272w, https://substackcdn.com/image/fetch/$s_!hdNk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F277cccea-7e29-4843-b784-1d99751c94b9_3372x1012.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Basic conversational agents are very agentic in the sense that they can take a task off their owners&#8217; hands and keep working on it until the goal is achieved. However, <strong>they have varying degrees of agency</strong> depending on how many tools they can use and on how sophisticated their ability to talk to their target users is.</p><p>For example, can the communication occur over a single channel only, be it email, chat, voice, or something else? Can the agent choose among different channels to reach the user? Can it perform side tasks on behalf of either party to work towards its goal? There is a large variety of these agents available and no clear naming distinction between them, so depending on their abilities, their position on our compass might be very different. 
This is why we place them in the top center, spreading far out in both directions.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-rUe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7ef5f45-0cbb-4c86-9af1-ca4f7e3a2b5c_3070x1722.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-rUe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7ef5f45-0cbb-4c86-9af1-ca4f7e3a2b5c_3070x1722.png 424w, https://substackcdn.com/image/fetch/$s_!-rUe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7ef5f45-0cbb-4c86-9af1-ca4f7e3a2b5c_3070x1722.png 848w, https://substackcdn.com/image/fetch/$s_!-rUe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7ef5f45-0cbb-4c86-9af1-ca4f7e3a2b5c_3070x1722.png 1272w, https://substackcdn.com/image/fetch/$s_!-rUe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7ef5f45-0cbb-4c86-9af1-ca4f7e3a2b5c_3070x1722.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-rUe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7ef5f45-0cbb-4c86-9af1-ca4f7e3a2b5c_3070x1722.png" width="1456" height="817" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b7ef5f45-0cbb-4c86-9af1-ca4f7e3a2b5c_3070x1722.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:817,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:344242,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!-rUe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7ef5f45-0cbb-4c86-9af1-ca4f7e3a2b5c_3070x1722.png 424w, https://substackcdn.com/image/fetch/$s_!-rUe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7ef5f45-0cbb-4c86-9af1-ca4f7e3a2b5c_3070x1722.png 848w, https://substackcdn.com/image/fetch/$s_!-rUe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7ef5f45-0cbb-4c86-9af1-ca4f7e3a2b5c_3070x1722.png 1272w, https://substackcdn.com/image/fetch/$s_!-rUe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7ef5f45-0cbb-4c86-9af1-ca4f7e3a2b5c_3070x1722.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>AI Crews</h3><p>By far the most advanced agent implementation available right now is the AI Crew, such as those provided by <a href="https://www.crewai.com/">CrewAI</a>. These apps take the concept of the autonomous agent to the next level by making several different agents work together.</p><p>The way these apps work is very flexible. For example, let&#8217;s imagine we are making an AI application that can build a fully working mobile game from a simple description. This is an extremely complex task that, in real life, requires several developers. To achieve the same with an AI Crew, the crew needs to contain several agents, each one with their own special skills, tools, and background knowledge. 
There could be:</p><ul><li><p>a Designer Agent, that has all the tools to generate artwork and assets;</p></li><li><p>a Writer Agent that writes the story, the copy, the dialogues, and most of the text;</p></li><li><p>a Frontend Developer Agent that designs and implements the user interface;</p></li><li><p>a Game Developer Agent that writes the code for the game itself;</p></li><li><p>a Manager Agent, that coordinates the work of all the other agents, keeps them on track and eventually reports the results of their work to the user.</p></li></ul><p>These agents interact with each other just like a team of humans would: by exchanging messages in a chat format, asking each other to perform actions for them, until their manager decides that the overall goal they were set to has been accomplished, and reports to the user.</p><p>AI Crews are very advanced and dynamic systems that are still actively researched and explored. One thing that&#8217;s clear though is that they show the highest level of agency of any other LLM-based app, so we can place them right at the very bottom-right end of the scale.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Zh5_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1e7c85-8d37-4feb-b712-51e398288954_3050x1710.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Zh5_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1e7c85-8d37-4feb-b712-51e398288954_3050x1710.png 424w, https://substackcdn.com/image/fetch/$s_!Zh5_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1e7c85-8d37-4feb-b712-51e398288954_3050x1710.png 848w, 
https://substackcdn.com/image/fetch/$s_!Zh5_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1e7c85-8d37-4feb-b712-51e398288954_3050x1710.png 1272w, https://substackcdn.com/image/fetch/$s_!Zh5_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1e7c85-8d37-4feb-b712-51e398288954_3050x1710.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Zh5_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1e7c85-8d37-4feb-b712-51e398288954_3050x1710.png" width="1456" height="816" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4d1e7c85-8d37-4feb-b712-51e398288954_3050x1710.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:816,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:352729,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Zh5_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1e7c85-8d37-4feb-b712-51e398288954_3050x1710.png 424w, https://substackcdn.com/image/fetch/$s_!Zh5_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1e7c85-8d37-4feb-b712-51e398288954_3050x1710.png 848w, 
https://substackcdn.com/image/fetch/$s_!Zh5_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1e7c85-8d37-4feb-b712-51e398288954_3050x1710.png 1272w, https://substackcdn.com/image/fetch/$s_!Zh5_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d1e7c85-8d37-4feb-b712-51e398288954_3050x1710.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Conclusion</h2><p>What we&#8217;ve seen here are just a few examples of LLM-powered applications and how close or far they are to 
the concept of a &#8220;real&#8221; AI agent. AI agents are still a very active area of research, and their effectiveness keeps improving as LLMs become cheaper and more powerful.</p><p>As a matter of fact, with today&#8217;s LLMs true AI agents are possible, but in many cases they&#8217;re too brittle and expensive for real production use cases. Agentic systems today suffer from two main issues: they perform <strong>huge and frequent LLM calls</strong> and they <strong>can tolerate only a very low error rate</strong> in their decision making.</p><p>Inner monologues can grow to an unbounded size during the agent&#8217;s operation, making the context window size a potential limitation. A single bad decision can send the chain-of-thought reasoning in a completely wrong direction, and many LLM calls will be performed before the system realizes its mistake, if it does at all. However, as LLMs become faster, cheaper and smarter, the day when AI Agents become reliable and cheap enough is nearer than many think.</p><p>Let&#8217;s be ready for it!</p>]]></content:encoded></item><item><title><![CDATA[The Agent Compass (Part 1)]]></title><description><![CDATA[Agent means everything and nothing in today's GenAI landscape. Let's shed some light on this topic.]]></description><link>https://zansara.substack.com/p/the-agent-compass-part-1</link><guid isPermaLink="false">https://zansara.substack.com/p/the-agent-compass-part-1</guid><dc:creator><![CDATA[Sara Z.]]></dc:creator><pubDate>Mon, 10 Jun 2024 18:03:51 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/29ef66cf-43fc-4859-8881-e339be1566b3_2577x1310.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The concept of Agent is one of the vaguest out there in the post-ChatGPT landscape. 
The word has been used to identify systems that seem to have nothing in common with one another, from complex autonomous research systems down to a simple sequence of two predefined LLM calls. Even the distinction between Agents and techniques such as RAG and prompt engineering seems blurry at best.</p><p>Let&#8217;s try to shed some light on the topic by understanding just how much the term &#8220;AI Agent&#8221; covers and by setting some landmarks to better navigate the space.</p><h2>Defining &#8220;Agent&#8221;</h2><p>The problem starts with the definition of &#8220;agent&#8221;. For example, <a href="https://en.wikipedia.org/wiki/Software_agent">Wikipedia</a> reports that a software agent is</p><blockquote><p>a computer program that acts for a user or another program in a relationship of agency.</p></blockquote><p>This definition is extremely high-level, to the point that it could be applied to systems ranging from ChatGPT to a thermostat. However, if we restrict our definition to &#8220;LLM-powered agents&#8221;, then it starts to mean something: an Agent is an LLM-powered application that is given some <strong>agency</strong>, which means that it can take actions to accomplish the goals set by its user. Here we see the difference between an agent and a simple chatbot: a chatbot can only talk to a user, but doesn&#8217;t have the agency to take any action on the user&#8217;s behalf. 
Instead, an Agent is a system you can effectively delegate tasks to.</p><p>In short, an LLM-powered application can be called an Agent when</p><blockquote><p>it can make decisions and choose to perform actions in order to achieve the goals set by the user.</p></blockquote><h2>Autonomous vs Conversational</h2><p>On top of this definition there&#8217;s an additional distinction to take into account, normally brought up by the terms <strong>autonomous</strong> and <strong>conversational</strong> agents.</p><p>Autonomous Agents are applications that <strong>don&#8217;t use conversation as a tool</strong> to accomplish their goal. They can use several tools several times, but they won&#8217;t produce an answer for the user until their goal is accomplished in full. These agents normally interact with a single user, the one that set their goal, and the whole result of their operations might be a simple notification that the task is done. The fact that they can understand language is rather a feature that lets them receive the user&#8217;s task in natural language, understand it, and then navigate the material they need to use (emails, webpages, etc).</p><p>An example of an autonomous agent is a <strong>virtual personal assistant</strong>: an app that can read through your emails and, for example, pay the bills for you when they&#8217;re due. This is a system that the user sets up with a few credentials and that then works autonomously, without the user&#8217;s supervision, on the user&#8217;s own behalf, possibly without bothering them at all.</p><p>On the contrary, Conversational Agents <strong>use conversation as a tool</strong>, often their primary one. This doesn&#8217;t have to be a conversation with the person that set them off: it&#8217;s usually a conversation with another party, who may or may not be aware that they&#8217;re talking to an autonomous system. 
Naturally, they behave like agents only from the perspective of the user that assigned them the task, while in many cases they have very limited or no agency from the perspective of the users that hold the conversation with them.</p><p>An example of a conversational agent is a <strong>virtual salesman</strong>: an app that takes a list of potential clients and calls them one by one, trying to persuade them to buy. From the perspective of the clients receiving the call this bot is not an agent: it can perform no actions on their behalf, and in fact it may not be able to perform any actions at all other than talking to them. But from the perspective of the salesman the bots are agents, because they&#8217;re calling people for them, saving a lot of their time.</p><p>The distinction between these two categories is very blurry, and <strong>some systems may behave like both</strong> depending on the circumstances. For example, an autonomous agent might become a conversational one if it&#8217;s configured to reschedule appointments for you by calling people, or to reply to your emails to automatically challenge parking fines, and so on. Alternatively, an LLM that asks you if it&#8217;s appropriate to use a tool before using it is behaving a bit like a conversational agent, because it&#8217;s using the chat to improve its odds of providing you a better result.</p><h2>Degrees of agency</h2><p>All the distinctions we made above are best understood as a continuous spectrum rather than hard categories. Various AI systems may have more or less agency and may be tuned towards a more &#8220;autonomous&#8221; or &#8220;conversational&#8221; behavior.</p><p>In order to understand this difference in practice, let&#8217;s try to categorize some well-known LLM techniques and apps to see how &#8220;agentic&#8221; they are. 
Having two axes to measure by, we can build a simple compass like this:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ymLe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8327b7c3-d6c8-4134-a9f4-a5a9914063d8_3198x1752.png"><img src="https://substackcdn.com/image/fetch/$s_!ymLe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8327b7c3-d6c8-4134-a9f4-a5a9914063d8_3198x1752.png" width="1456" height="798" alt="" loading="lazy"></a></figure></div><p><em>Our Agent compass</em></p><h3>Bare LLMs</h3><p>Many apps out there perform nothing more than direct calls to an LLM, such as ChatGPT&#8217;s free app and other similarly simple assistants and chatbots. 
There are no components in this system other than the model itself, and its mode of operation is very straightforward: the user asks a question, and the LLM replies directly.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Jyc8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf8e39de-5da2-4ef8-ab34-2693f60ded14_3066x878.png"><img src="https://substackcdn.com/image/fetch/$s_!Jyc8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf8e39de-5da2-4ef8-ab34-2693f60ded14_3066x878.png" width="1456" height="417" alt="" loading="lazy"></a></figure></div><p>These systems are not designed to accomplish a goal, nor can they take any actions on the user&#8217;s behalf. They focus on talking with a user in a reactive way and can do nothing other than talk back. An LLM on its own has <strong>no agency at all</strong>.</p><p>At this level it also makes very little sense to distinguish between autonomous and conversational behavior, because the app shows no degree of autonomy. 
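</p><p>In code, such a &#8220;bare&#8221; app is a single pass-through call. The sketch below is purely illustrative: <code>call_llm</code> is a hypothetical stub standing in for any real completion API, so that the shape of the app is visible without depending on a specific provider.</p>

```python
# A bare LLM app is a single, stateless request-response call.
# `call_llm` is a stand-in for any real completion API (a hosted model,
# a local one, etc.); it is stubbed here so the sketch is self-contained.
def call_llm(prompt: str) -> str:
    return f"(model answer to: {prompt})"

def bare_llm_app(user_question: str) -> str:
    # No retrieval, no tools, no decisions: the prompt goes straight
    # to the model and the model's text goes straight back to the user.
    return call_llm(user_question)
```

<p>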
So we can place them at the very center-left of the diagram.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eJZo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1286e37-77c3-475e-af0a-ce5022a23a22_3128x1752.png"><img src="https://substackcdn.com/image/fetch/$s_!eJZo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1286e37-77c3-475e-af0a-ce5022a23a22_3128x1752.png" width="1456" height="816" alt="" loading="lazy"></a></figure></div><h3>Basic RAG</h3><p>Together with direct LLM calls and simple chatbots, basic RAG is another example of an application that needs no agency or goals to pursue in order to function. Simple RAG apps work in two stages: first, the user&#8217;s question is sent to a retriever, which fetches additional data relevant to the question. 
Then, the question and the additional data are sent to the LLM to formulate an answer.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ys_r!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7b8439c-1205-47a0-843f-a64f1357db00_2842x962.png"><img src="https://substackcdn.com/image/fetch/$s_!Ys_r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7b8439c-1205-47a0-843f-a64f1357db00_2842x962.png" width="1456" height="493" alt="" loading="lazy"></a></figure></div><p>This means that simple RAG is not an agent: the LLM has no role in the retrieval step and simply reacts to the RAG prompt, doing little more than what a direct LLM call does. <strong>The LLM is given no agency</strong>: it makes no decisions in order to accomplish a goal, and has no tools it can decide to use or actions it can decide to take. It&#8217;s a fully pipelined, reactive system. 
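</p><p>The two-stage pipeline described above can be sketched in a few lines of Python. This is a toy illustration, not a real implementation: <code>call_llm</code> is a stub standing in for any completion API, and the &#8220;retriever&#8221; is a naive word-overlap ranking over a hardcoded document list.</p>

```python
# Basic RAG: a fixed two-stage pipeline. The LLM makes no decisions;
# retrieval always runs first, and its output is pasted into the prompt.
# `call_llm` is a stubbed stand-in for a real completion API.
def call_llm(prompt: str) -> str:
    return f"(model answer based on: {prompt})"

DOCUMENTS = [
    "The Eiffel Tower is in Paris.",
    "Python was created by Guido van Rossum.",
]

def retrieve(question: str, top_k: int = 1) -> list[str]:
    # Toy retriever: rank documents by word overlap with the question.
    q_words = set(question.lower().split())
    scored = sorted(DOCUMENTS, key=lambda d: -len(q_words & set(d.lower().split())))
    return scored[:top_k]

def basic_rag(question: str) -> str:
    context = "\n".join(retrieve(question))  # stage 1: always retrieve
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return call_llm(prompt)                  # stage 2: generate
```

<p>Swapping the stubs for a real vector store and a real model turns this into an actual RAG app without changing the control flow: it remains a pipeline with no branches. 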
However, we may rank basic RAG as slightly more autonomous than a direct LLM call, because one step (the retrieval) is performed autonomously.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JNb4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62459a31-6891-4778-b23b-5f0286008f9f_3146x1744.png"><img src="https://substackcdn.com/image/fetch/$s_!JNb4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62459a31-6891-4778-b23b-5f0286008f9f_3146x1744.png" width="1456" height="807" alt="" loading="lazy"></a></figure></div><h3>Agentic RAG</h3><p>Agentic RAG is a slightly more advanced version of RAG that does not always perform the retrieval step. This helps the app produce better prompts for the LLM: for example, if the user is asking a trivia question, retrieval is very important, while if they&#8217;re quizzing the LLM with a mathematical problem, retrieval might confuse the LLM by giving it examples of solutions to different puzzles, making hallucinations more likely.</p><p>This means that an agentic RAG app works as follows: when the user asks a question, before calling the retriever the app checks whether the retrieval step is necessary at all. Most of the time this preliminary check is done by an LLM as well, but in theory the same check could be done by a properly trained classifier model. 
Once the check is done, the retriever is invoked only if it was deemed necessary; otherwise the app skips directly to the LLM, which then replies to the user.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EwWN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37f68afc-2c21-45ec-8e70-86768d6b39ad_3122x1040.png"><img src="https://substackcdn.com/image/fetch/$s_!EwWN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37f68afc-2c21-45ec-8e70-86768d6b39ad_3122x1040.png" width="1456" height="485" alt="" loading="lazy"></a></figure></div><p>You can see immediately that there&#8217;s a fundamental difference between this type of RAG and the basic pipelined form: the app needs to <strong>make a decision</strong> in order to accomplish the goal of answering the user. 
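</p><p>That single decision is the only structural difference from basic RAG, as a sketch makes clear. Everything here is a hypothetical stub: <code>needs_retrieval</code> stands in for the preliminary check (which in a real app would be another LLM call or a small classifier), and <code>call_llm</code> and <code>retrieve</code> stand in for the real model and retriever.</p>

```python
# Agentic RAG adds exactly one decision before the pipeline:
# should we retrieve at all?
def call_llm(prompt: str) -> str:
    return f"(model answer to: {prompt})"

def needs_retrieval(question: str) -> bool:
    # Hypothetical check: skip retrieval for math-looking questions.
    # A real app would ask an LLM or a trained classifier instead.
    return not any(ch.isdigit() for ch in question)

def retrieve(question: str) -> str:
    return "(documents relevant to the question)"

def agentic_rag(question: str) -> str:
    if needs_retrieval(question):  # the one decision the app makes
        prompt = f"Context: {retrieve(question)}\n\nQuestion: {question}"
    else:
        prompt = question          # skip straight to the LLM
    return call_llm(prompt)
```

<p>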
The goal is very limited (giving a correct answer to the user) and the decision very simple (whether or not to use a single tool), but this little bit of agency given to the LLM makes us place an application like this decidedly closer to the Agent side of the diagram.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PIIy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5cc2c3e-9160-44eb-989e-9c95cbd1b34f_3036x1658.png"><img src="https://substackcdn.com/image/fetch/$s_!PIIy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5cc2c3e-9160-44eb-989e-9c95cbd1b34f_3036x1658.png" width="1456" height="795" alt="" loading="lazy"></a></figure></div><p>We keep agentic RAG towards the Autonomous side because in the vast majority of cases the decision to invoke the retriever is hidden from the user.</p><h3>LLMs with function calling</h3><p>Some LLM applications, such as ChatGPT with GPT-4 and later models, or Bing Chat, can make the LLM use some predefined tools: a web search, an image generator, and maybe a few more. The way they work is quite straightforward: when a user asks a question, the LLM first needs to decide whether it should use a tool to answer the question. 
If it decides that a tool is needed, it calls it, otherwise it skips directly to generating a reply, which is then sent back to the user.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0OCz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62931686-b442-4579-8ea7-187603099e59_2940x982.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0OCz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62931686-b442-4579-8ea7-187603099e59_2940x982.png 424w, https://substackcdn.com/image/fetch/$s_!0OCz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62931686-b442-4579-8ea7-187603099e59_2940x982.png 848w, https://substackcdn.com/image/fetch/$s_!0OCz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62931686-b442-4579-8ea7-187603099e59_2940x982.png 1272w, https://substackcdn.com/image/fetch/$s_!0OCz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62931686-b442-4579-8ea7-187603099e59_2940x982.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0OCz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62931686-b442-4579-8ea7-187603099e59_2940x982.png" width="1456" height="486" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/62931686-b442-4579-8ea7-187603099e59_2940x982.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:486,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:240812,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!0OCz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62931686-b442-4579-8ea7-187603099e59_2940x982.png 424w, https://substackcdn.com/image/fetch/$s_!0OCz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62931686-b442-4579-8ea7-187603099e59_2940x982.png 848w, https://substackcdn.com/image/fetch/$s_!0OCz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62931686-b442-4579-8ea7-187603099e59_2940x982.png 1272w, https://substackcdn.com/image/fetch/$s_!0OCz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62931686-b442-4579-8ea7-187603099e59_2940x982.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>You can see how this diagram resembles agentic RAG&#8217;s: before giving an answer to the user, the app needs to <strong>make a decision</strong>.</p><p>Compared to agentic RAG, however, this decision is far more complex: it&#8217;s not a simple yes/no choice, but involves selecting which tool to use and also generating the input parameters that will make the selected tool produce the desired output. In many cases the tool&#8217;s output is given back to the LLM to be reworked (as with the output of a web search), while in others it can go directly to the user (as with image generators).
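To make this flow concrete, here is a minimal Python sketch of the decision loop described above. Everything in it is illustrative: the fake_llm stub stands in for a real chat-completion call with tool schemas, and the tool names and routing logic are made up for the example, not any vendor&#8217;s actual API.

```python
# Sketch of an LLM function-calling loop. The "LLM" is a stub that either
# answers directly or asks for one of the predefined tools with generated
# parameters, mirroring the decision described in the text.

def web_search(query: str) -> str:
    """Stand-in for a real search tool."""
    return f"search results for '{query}'"

TOOLS = {"web_search": web_search}

def fake_llm(user_message: str) -> dict:
    # A real LLM would make this decision; we hard-code it for the sketch.
    if "weather" in user_message.lower():
        return {"tool": "web_search", "arguments": {"query": user_message}}
    return {"reply": "Hello! How can I help?"}

def answer(user_message: str) -> str:
    decision = fake_llm(user_message)
    if "tool" in decision:
        # The LLM picked a tool AND generated its input parameters.
        tool_output = TOOLS[decision["tool"]](**decision["arguments"])
        # Here the tool output goes straight back to the user (as image
        # generators often do); many apps instead hand it back to the LLM
        # for a final rephrasing step.
        return tool_output
    # No tool needed: the reply is sent directly to the user.
    return decision["reply"]
```

In a production app, fake_llm would be replaced by a real model call that receives the tool definitions and returns either plain text or a structured tool invocation.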
This all implies that more agency is given to the system and, therefore, it can be placed more clearly towards the Agent end of the scale.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!43mk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96de0350-905b-40e8-b927-fb39d2682f97_3054x1704.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!43mk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96de0350-905b-40e8-b927-fb39d2682f97_3054x1704.png 424w, https://substackcdn.com/image/fetch/$s_!43mk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96de0350-905b-40e8-b927-fb39d2682f97_3054x1704.png 848w, https://substackcdn.com/image/fetch/$s_!43mk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96de0350-905b-40e8-b927-fb39d2682f97_3054x1704.png 1272w, https://substackcdn.com/image/fetch/$s_!43mk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96de0350-905b-40e8-b927-fb39d2682f97_3054x1704.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!43mk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96de0350-905b-40e8-b927-fb39d2682f97_3054x1704.png" width="1456" height="812" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/96de0350-905b-40e8-b927-fb39d2682f97_3054x1704.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:812,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:272289,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!43mk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96de0350-905b-40e8-b927-fb39d2682f97_3054x1704.png 424w, https://substackcdn.com/image/fetch/$s_!43mk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96de0350-905b-40e8-b927-fb39d2682f97_3054x1704.png 848w, https://substackcdn.com/image/fetch/$s_!43mk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96de0350-905b-40e8-b927-fb39d2682f97_3054x1704.png 1272w, https://substackcdn.com/image/fetch/$s_!43mk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96de0350-905b-40e8-b927-fb39d2682f97_3054x1704.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We place LLMs with function calling midway between Conversational and Autonomous because how aware the user is of this decision varies greatly between apps. For example, Bing Chat and ChatGPT normally notify the user when they&#8217;re about to use a tool, and the user can tell them whether or not to use one, so they lean slightly towards the conversational side.</p><p><em>Continues in &#8220;<a href="https://open.substack.com/pub/zansara/p/the-agent-compass-part-2?r=3wc89f&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">The Agent Compass (Part 2)</a>&#8221;</em></p>]]></content:encoded></item><item><title><![CDATA[Generating creatures with Teranoptia]]></title><description><![CDATA[Having fun with fonts doesn&#8217;t always mean obsessing over kerning and ligatures. 
Sometimes, writing text is not even the point!]]></description><link>https://zansara.substack.com/p/2024-05-06-teranoptia</link><guid isPermaLink="false">https://zansara.substack.com/p/2024-05-06-teranoptia</guid><dc:creator><![CDATA[Sara Z.]]></dc:creator><pubDate>Mon, 06 May 2024 00:00:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!zGLr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbd512d8-461e-49cc-9df8-73c953ae32a3_991x414.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zGLr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbd512d8-461e-49cc-9df8-73c953ae32a3_991x414.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zGLr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbd512d8-461e-49cc-9df8-73c953ae32a3_991x414.png 424w, https://substackcdn.com/image/fetch/$s_!zGLr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbd512d8-461e-49cc-9df8-73c953ae32a3_991x414.png 848w, https://substackcdn.com/image/fetch/$s_!zGLr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbd512d8-461e-49cc-9df8-73c953ae32a3_991x414.png 1272w, https://substackcdn.com/image/fetch/$s_!zGLr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbd512d8-461e-49cc-9df8-73c953ae32a3_991x414.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!zGLr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbd512d8-461e-49cc-9df8-73c953ae32a3_991x414.png" width="991" height="414" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bbd512d8-461e-49cc-9df8-73c953ae32a3_991x414.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:414,&quot;width&quot;:991,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:33345,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zGLr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbd512d8-461e-49cc-9df8-73c953ae32a3_991x414.png 424w, https://substackcdn.com/image/fetch/$s_!zGLr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbd512d8-461e-49cc-9df8-73c953ae32a3_991x414.png 848w, https://substackcdn.com/image/fetch/$s_!zGLr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbd512d8-461e-49cc-9df8-73c953ae32a3_991x414.png 1272w, https://substackcdn.com/image/fetch/$s_!zGLr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbd512d8-461e-49cc-9df8-73c953ae32a3_991x414.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset 
pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>&#9888;&#65039; <em>This is an interactive post, which Substack does not support (for good reasons!)</em></p><p><em><a href="https://www.zansara.dev/posts/2024-05-06-teranoptia/">Head to my blog</a> to read this post and play with the interactive snippets.</em></p>]]></content:encoded></item><item><title><![CDATA[Talk Summary: RAG, the bad parts (and the good!)]]></title><description><![CDATA[A summary of my recent talk at ODSC East about RAG, just in case you haven't heard enough of it already.]]></description><link>https://zansara.substack.com/p/2024-04-29-odsc-east-rag-talk-summary</link><guid isPermaLink="false">https://zansara.substack.com/p/2024-04-29-odsc-east-rag-talk-summary</guid><dc:creator><![CDATA[Sara Z.]]></dc:creator><pubDate>Mon, 29 Apr 2024 00:00:00 
GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ef19467-86f9-4f9b-8b79-a4f242b234f9_2400x1350.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RWAh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ef19467-86f9-4f9b-8b79-a4f242b234f9_2400x1350.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RWAh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ef19467-86f9-4f9b-8b79-a4f242b234f9_2400x1350.png 424w, https://substackcdn.com/image/fetch/$s_!RWAh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ef19467-86f9-4f9b-8b79-a4f242b234f9_2400x1350.png 848w, https://substackcdn.com/image/fetch/$s_!RWAh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ef19467-86f9-4f9b-8b79-a4f242b234f9_2400x1350.png 1272w, https://substackcdn.com/image/fetch/$s_!RWAh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ef19467-86f9-4f9b-8b79-a4f242b234f9_2400x1350.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RWAh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ef19467-86f9-4f9b-8b79-a4f242b234f9_2400x1350.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ef19467-86f9-4f9b-8b79-a4f242b234f9_2400x1350.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:191250,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RWAh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ef19467-86f9-4f9b-8b79-a4f242b234f9_2400x1350.png 424w, https://substackcdn.com/image/fetch/$s_!RWAh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ef19467-86f9-4f9b-8b79-a4f242b234f9_2400x1350.png 848w, https://substackcdn.com/image/fetch/$s_!RWAh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ef19467-86f9-4f9b-8b79-a4f242b234f9_2400x1350.png 1272w, https://substackcdn.com/image/fetch/$s_!RWAh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ef19467-86f9-4f9b-8b79-a4f242b234f9_2400x1350.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>This is a summary of <a href="https://www.zansara.dev/talks/2024-04-25-odsc-east-rag/">my recent talk</a> at ODSC East.</em></p><div><hr></div><p>If you&#8217;ve been at <a href="https://odsc.com/boston">ODSC East</a> this year, there&#8217;s one acronym that you&#8217;ve probably heard in nearly every talk: it&#8217;s RAG. RAG is one of the most used techniques to enhance LLMs in production, but why is it so? And what are its weak points?</p><p>In this post, we will first describe what RAG is and how it works at a high level. We will then see what type of failures we may encounter, how they happen, and a few reasons that may trigger these issues. Next, we will look at a few tools to help us evaluate a RAG application in production. 
Last, we&#8217;re going to list a few techniques to enhance your RAG app and make it more capable in a variety of scenarios.</p><p>Let&#8217;s dive in.</p><h1>Outline</h1><ul><li><p><a href="#what-is-rag">What is RAG?</a></p></li><li><p><a href="#why-should-i-use-it">Why should I use it?</a></p><ul><li><p><a href="#a-weather-chatbot">A weather chatbot</a></p></li><li><p><a href="#a-real-world-example">A real-world example</a></p></li></ul></li><li><p><a href="#failure-modes">Failure modes</a></p><ul><li><p><a href="#retrieval-failure">Retrieval failure</a></p></li><li><p><a href="#generation-failure">Generation failure</a></p></li></ul></li><li><p><a href="#evaluation-strategies">Evaluation strategies</a></p><ul><li><p><a href="#evaluating-retrieval">Evaluating Retrieval</a></p></li><li><p><a href="#evaluating-generation">Evaluating Generation</a></p></li><li><p><a href="#end-to-end-evaluation">End-to-end evaluation</a></p></li><li><p><a href="#putting-it-all-together">Putting it all together</a></p></li></ul></li><li><p><a href="#advanced-flavors-of-rag">Advanced flavors of RAG</a></p><ul><li><p><a href="#use-multiple-retrievers">Use multiple retrievers</a></p></li><li><p><a href="#self-correction">Self-correction</a></p></li><li><p><a href="#agentic-rag">Agentic RAG</a></p></li><li><p><a href="#multihop-rag">Multihop RAG</a></p></li></ul></li><li><p><a href="#a-word-on-finetuning">A word on finetuning</a></p></li><li><p><a href="#conclusion">Conclusion</a></p></li></ul><h1>What is RAG? 
</h1><p>RAG stands for <strong>R</strong>etrieval <strong>A</strong>ugmented <strong>G</strong>eneration, which can be explained as: &#8220;A technique to <strong>augment</strong> an LLM&#8217;s knowledge beyond its training data by <strong>retrieving</strong> contextual information before <strong>generating</strong> an answer.&#8221;</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!E-N5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3caa8fc1-eaa7-4735-be5a-d0f40f0f31a2_2400x1350.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!E-N5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3caa8fc1-eaa7-4735-be5a-d0f40f0f31a2_2400x1350.png 424w, https://substackcdn.com/image/fetch/$s_!E-N5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3caa8fc1-eaa7-4735-be5a-d0f40f0f31a2_2400x1350.png 848w, https://substackcdn.com/image/fetch/$s_!E-N5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3caa8fc1-eaa7-4735-be5a-d0f40f0f31a2_2400x1350.png 1272w, https://substackcdn.com/image/fetch/$s_!E-N5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3caa8fc1-eaa7-4735-be5a-d0f40f0f31a2_2400x1350.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!E-N5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3caa8fc1-eaa7-4735-be5a-d0f40f0f31a2_2400x1350.png" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3caa8fc1-eaa7-4735-be5a-d0f40f0f31a2_2400x1350.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!E-N5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3caa8fc1-eaa7-4735-be5a-d0f40f0f31a2_2400x1350.png 424w, https://substackcdn.com/image/fetch/$s_!E-N5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3caa8fc1-eaa7-4735-be5a-d0f40f0f31a2_2400x1350.png 848w, https://substackcdn.com/image/fetch/$s_!E-N5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3caa8fc1-eaa7-4735-be5a-d0f40f0f31a2_2400x1350.png 1272w, https://substackcdn.com/image/fetch/$s_!E-N5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3caa8fc1-eaa7-4735-be5a-d0f40f0f31a2_2400x1350.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>RAG is a technique that works best for question-answering tasks, such as chatbots or similar knowledge extraction applications. 
This means that the user of a RAG app is someone who needs an answer to a question.</p><p>The first step of RAG is to take the question and hand it over to a component called a <strong><a href="https://docs.haystack.deepset.ai/docs/retrievers?utm_campaign=odsc-east">retriever</a></strong>. A retriever is any system that, given a question, can find data relevant to the question within a vast dataset, be it text, images, rows in a DB, or anything else.</p><p>When implementing RAG, many developers immediately assume that a vector database is necessary for retrieval. While vector databases such as <a href="https://haystack.deepset.ai/integrations/qdrant-document-store?utm_campaign=odsc-east">Qdrant</a>, <a href="https://haystack.deepset.ai/integrations/chroma-documentstore?utm_campaign=odsc-east">ChromaDB</a>, <a href="https://haystack.deepset.ai/integrations/weaviate-document-store?utm_campaign=odsc-east">Weaviate</a>, and so on are great for retrieval in some applications, they&#8217;re not the only option. Keyword-based algorithms such as <a href="https://haystack.deepset.ai/integrations/elasticsearch-document-store?utm_campaign=odsc-east">Elasticsearch BM25</a> or TF-IDF can be used as retrievers in a RAG application, and you can even go as far as using a <a href="https://docs.haystack.deepset.ai/docs/websearch?utm_campaign=odsc-east">web search engine API</a>, such as Google or Bing. Anything that takes a question and returns information relevant to it can be used here.</p><p>Once our retriever has sifted through all the data and returned a few relevant snippets of context, the question and the context are assembled into a <strong>RAG prompt</strong>. It looks like this:</p><pre><code>Read the text below and answer the question at the bottom.

Text: [all the text found by the retriever]

Question: [the user's question]
</code></pre><p>This prompt is then fed to the last component, called a <strong><a href="https://docs.haystack.deepset.ai/docs/components_overview#generators?utm_campaign=odsc-east">generator</a></strong>. A generator is any system that, given a prompt, can answer the question that it contains. In practice, &#8220;generator&#8221; is an umbrella term for any LLM, be it behind an API like GPT-3.5 or running locally, such as a Llama model. The generator receives the prompt, reads and understands it, and then writes down an answer that can be given back to the user, closing the loop.</p><h1>Why should I use it?</h1><p>There are three main benefits of using a RAG architecture for your LLM apps instead of querying the LLM directly.</p><ol><li><p><strong>Reduces hallucinations</strong>. The RAG prompt contains the answer to the user&#8217;s question together with the question, so the LLM doesn&#8217;t need to <em>know</em> the answer, but it only needs to read the prompt and rephrase a bit of its content.</p></li><li><p><strong>Allows access to fresh data</strong>. RAG makes LLMs capable of reasoning about data that wasn&#8217;t present in their training set, such as highly volatile figures, news, forecasts, and so on.</p></li><li><p><strong>Increases transparency</strong>. The retrieval step is much easier to inspect than LLM&#8217;s inference process, so it&#8217;s far easier to spot and fact-check any answer the LLM provides.</p></li></ol><p>To understand these points better, let&#8217;s see an example.</p><h2>A weather chatbot</h2><p>We&#8217;re making a chatbot for a weather forecast app. Suppose the user asks an LLM directly, &#8220;Is it going to rain in Lisbon tomorrow morning?&#8221;. 
In that case, the LLM will make up a random answer because it obviously didn&#8217;t have tomorrow&#8217;s weather forecast for Lisbon in its training set and knows nothing about it.</p><p>When an LLM is queried with a direct question, it will use its internal knowledge to answer it. LLMs have read the entire Internet during their training phase, so they learned that whenever they saw a line such as &#8220;What&#8217;s the capital of France?&#8221;, the string &#8220;Paris&#8221; always appeared among the following few words. So when a user asks the same question, the answer will likely be &#8220;Paris&#8221;.</p><p>This &#8220;recalling from memory&#8221; process works for well-known facts but is not always practical. For more nuanced questions or something that the LLM hasn&#8217;t seen during training, it often fails: in an attempt to answer the question, the LLM will make up a response that is not based on any real source. This is called a <strong>hallucination</strong>, one of LLMs&#8217; most common and feared failure modes.</p><p>RAG helps prevent hallucinations because, in the RAG prompt, the question and all the data needed to answer it are explicitly given to the LLM. For our weather chatbot, the retriever will first do a Google search and find some data. Then, we will put together the RAG prompt. The result will look like this:</p><pre><code>Read the text below and answer the question at the bottom.

Text: According to the weather forecast, the weather in Lisbon tomorrow 
is expected to be mostly sunny, with a high of 18&#176;C and a low of 11&#176;C. 
There is a 25% chance of showers in the evening.

Question: Is it going to rain in Lisbon tomorrow morning?
</code></pre><p>Now, it&#8217;s clear that the LLM doesn&#8217;t have to recall anything about the weather in Lisbon from its memory because the prompt already contains the answer. The LLM only needs to rephrase the context. This makes the task much simpler and drastically reduces the chances of hallucinations.</p><p>In fact, RAG is the only way to build an LLM-powered system that can answer a question like this with any confidence at all. Retraining an LLM every morning with the forecast for the day would be a lot more wasteful, require a ton of data, and won&#8217;t return consistent results. Imagine if we were making a chatbot that gives you figures from the stock market!</p><p>In addition, a weather chatbot built with RAG <strong>can be fact-checked</strong>. If users have access to the web pages that the retriever found, they can check the pages directly when the results are not convincing, which helps build trust in the application.</p><h2>A real-world example</h2><p>If you want to compare a well-implemented RAG system with a plain LLM, you can put <a href="https://chat.openai.com/">ChatGPT</a> (the free version, powered by GPT-3.5) and <a href="https://www.perplexity.ai/">Perplexity</a> to the test. 
ChatGPT does not implement RAG, while Perplexity is one of the most effective implementations existing today.</p><p>Let&#8217;s ask both: &#8220;Where does ODSC East 2024 take place?&#8221;</p><p>ChatGPT says:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!J_Xf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b2c24b-5f04-402a-a4a8-28ba9e327580_1000x350.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!J_Xf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b2c24b-5f04-402a-a4a8-28ba9e327580_1000x350.png 424w, https://substackcdn.com/image/fetch/$s_!J_Xf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b2c24b-5f04-402a-a4a8-28ba9e327580_1000x350.png 848w, https://substackcdn.com/image/fetch/$s_!J_Xf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b2c24b-5f04-402a-a4a8-28ba9e327580_1000x350.png 1272w, https://substackcdn.com/image/fetch/$s_!J_Xf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b2c24b-5f04-402a-a4a8-28ba9e327580_1000x350.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!J_Xf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b2c24b-5f04-402a-a4a8-28ba9e327580_1000x350.png" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/82b2c24b-5f04-402a-a4a8-28ba9e327580_1000x350.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!J_Xf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b2c24b-5f04-402a-a4a8-28ba9e327580_1000x350.png 424w, https://substackcdn.com/image/fetch/$s_!J_Xf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b2c24b-5f04-402a-a4a8-28ba9e327580_1000x350.png 848w, https://substackcdn.com/image/fetch/$s_!J_Xf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b2c24b-5f04-402a-a4a8-28ba9e327580_1000x350.png 1272w, https://substackcdn.com/image/fetch/$s_!J_Xf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b2c24b-5f04-402a-a4a8-28ba9e327580_1000x350.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>While Perplexity says:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2XE4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3c83f2d-e190-49e3-8ee9-4c0e70968d9d_1000x504.png" 
data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2XE4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3c83f2d-e190-49e3-8ee9-4c0e70968d9d_1000x504.png 424w, https://substackcdn.com/image/fetch/$s_!2XE4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3c83f2d-e190-49e3-8ee9-4c0e70968d9d_1000x504.png 848w, https://substackcdn.com/image/fetch/$s_!2XE4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3c83f2d-e190-49e3-8ee9-4c0e70968d9d_1000x504.png 1272w, https://substackcdn.com/image/fetch/$s_!2XE4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3c83f2d-e190-49e3-8ee9-4c0e70968d9d_1000x504.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2XE4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3c83f2d-e190-49e3-8ee9-4c0e70968d9d_1000x504.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a3c83f2d-e190-49e3-8ee9-4c0e70968d9d_1000x504.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!2XE4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3c83f2d-e190-49e3-8ee9-4c0e70968d9d_1000x504.png 424w, https://substackcdn.com/image/fetch/$s_!2XE4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3c83f2d-e190-49e3-8ee9-4c0e70968d9d_1000x504.png 848w, https://substackcdn.com/image/fetch/$s_!2XE4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3c83f2d-e190-49e3-8ee9-4c0e70968d9d_1000x504.png 1272w, https://substackcdn.com/image/fetch/$s_!2XE4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3c83f2d-e190-49e3-8ee9-4c0e70968d9d_1000x504.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Note how ChatGPT clearly says that it doesn&#8217;t know: this is better than many other LLMs, which would just make up a place and date. On the contrary, Perplexity states some specific facts, and in case of doubt it&#8217;s easy to verify that it&#8217;s right by simply checking the sources above. Even just looking at the source&#8217;s URL can give users a lot more confidence in whether the answer is grounded.</p><h1>Failure modes</h1><p>Now that we understand how RAG works, let&#8217;s see what can go wrong in the process.</p><p>As we&#8217;ve just described, a RAG app goes in two steps &#8211; retrieval and generation. Therefore, we can classify RAG failures into two broad categories:</p><ol><li><p><strong>Retrieval failures</strong>: The retriever component fails to find the correct context for the given question. 
Irrelevant snippets end up in the RAG prompt, injecting noise that confuses the LLM and results in a wrong or unrelated answer.</p></li><li><p><strong>Generation failures</strong>: The LLM fails to produce a correct answer even when given a proper RAG prompt containing the question and all the data needed to answer it.</p></li></ol><p>To understand them better, let&#8217;s imagine a user asking our application the following question about a <a href="https://en.wikipedia.org/wiki/Republic_of_Rose_Island">little-known European microstate</a>:</p><pre><code>What was the official language of the Republic of Rose Island?
</code></pre><p>Here is what would happen in an ideal case:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zenz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea28dd6-7852-419f-9084-a425bb84e85a_2400x888.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zenz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea28dd6-7852-419f-9084-a425bb84e85a_2400x888.png 424w, https://substackcdn.com/image/fetch/$s_!zenz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea28dd6-7852-419f-9084-a425bb84e85a_2400x888.png 848w, https://substackcdn.com/image/fetch/$s_!zenz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea28dd6-7852-419f-9084-a425bb84e85a_2400x888.png 1272w, https://substackcdn.com/image/fetch/$s_!zenz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea28dd6-7852-419f-9084-a425bb84e85a_2400x888.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zenz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea28dd6-7852-419f-9084-a425bb84e85a_2400x888.png" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0ea28dd6-7852-419f-9084-a425bb84e85a_2400x888.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!zenz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea28dd6-7852-419f-9084-a425bb84e85a_2400x888.png 424w, https://substackcdn.com/image/fetch/$s_!zenz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea28dd6-7852-419f-9084-a425bb84e85a_2400x888.png 848w, https://substackcdn.com/image/fetch/$s_!zenz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea28dd6-7852-419f-9084-a425bb84e85a_2400x888.png 1272w, https://substackcdn.com/image/fetch/$s_!zenz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea28dd6-7852-419f-9084-a425bb84e85a_2400x888.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>First, the retriever searches the dataset (let&#8217;s imagine, in this case, Wikipedia) and returns a few snippets. The retriever did a good job here, and the snippets contain clearly stated information about the official language of Rose Island. 
The LLM reads these snippets, understands them, and replies to the user (correctly):</p><pre><code>The official language of the Republic of Rose Island was Esperanto.
</code></pre><h2>Retrieval failure</h2><p>What would happen if the retrieval step didn&#8217;t go as planned?</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lIad!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff77ca4fc-7a9b-42b3-8149-c5586d527517_2400x884.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lIad!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff77ca4fc-7a9b-42b3-8149-c5586d527517_2400x884.png 424w, https://substackcdn.com/image/fetch/$s_!lIad!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff77ca4fc-7a9b-42b3-8149-c5586d527517_2400x884.png 848w, https://substackcdn.com/image/fetch/$s_!lIad!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff77ca4fc-7a9b-42b3-8149-c5586d527517_2400x884.png 1272w, https://substackcdn.com/image/fetch/$s_!lIad!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff77ca4fc-7a9b-42b3-8149-c5586d527517_2400x884.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lIad!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff77ca4fc-7a9b-42b3-8149-c5586d527517_2400x884.png" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f77ca4fc-7a9b-42b3-8149-c5586d527517_2400x884.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!lIad!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff77ca4fc-7a9b-42b3-8149-c5586d527517_2400x884.png 424w, https://substackcdn.com/image/fetch/$s_!lIad!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff77ca4fc-7a9b-42b3-8149-c5586d527517_2400x884.png 848w, https://substackcdn.com/image/fetch/$s_!lIad!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff77ca4fc-7a9b-42b3-8149-c5586d527517_2400x884.png 1272w, https://substackcdn.com/image/fetch/$s_!lIad!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff77ca4fc-7a9b-42b3-8149-c5586d527517_2400x884.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Here, the retriever finds some information about Rose Island, but none of the snippets contain any information about the official language. They only say where it was located, what happened to it, and so on. 
So the LLM, which knows nothing about this nation except what the prompt says, takes an educated guess and replies:</p><pre><code>The official language of the Republic of Rose Island was Italian.
</code></pre><p>The wrong answer here is not the LLM&#8217;s fault: the retriever is the component to blame.</p><p>When and why can retrieval fail? There are as many answers to this question as retrieval methods, so each should be inspected for its strengths and weaknesses. However, there are a few reasons that are common to most of them.</p><ul><li><p><strong>The relevant data does not exist in the database</strong>. When the data does not exist, it&#8217;s impossible to retrieve it. Many retrieval techniques, however, give a relevance score to each result that they return, so filtering out low-relevance snippets may help mitigate the issue.</p></li><li><p><strong>The retrieval algorithm is too naive to match a question with its relevant context</strong>. This is a common issue for keyword-based retrieval methods such as TF-IDF or BM25 (Elasticsearch). These algorithms can&#8217;t deal with synonyms or resolve acronyms, so if the question and the relevant context don&#8217;t share the exact same words, the retrieval won&#8217;t work.</p></li><li><p><strong>Embedding model (if used) is too small or unsuitable for the data</strong>. When doing a vector-based search, the data must be embedded before it becomes searchable. &#8220;Embedded&#8221; means that every snippet of context is associated with a list of numbers called an <strong>embedding</strong>. The quality of the embedding then determines the quality of the retrieval. If you embed your documents with a naive embedding model, or if you are dealing with a very specific domain such as narrow medical and legal niches, the embeddings of your data won&#8217;t be able to represent its content precisely enough for the retrieval to be successful.</p></li><li><p><strong>The data is not chunked properly (too big or too small chunks)</strong>. Retrievers thrive on data that is chunked properly. Huge blocks of text will be found relevant to almost any question and will drown the LLM with information. 
Chunks that are too small, such as lone sentences or fragments, won&#8217;t carry enough context for the LLM to benefit from the retriever&#8217;s output. Proper chunking can be a huge lever to improve the quality of your retrieval.</p></li><li><p><strong>The data and the question are in different languages</strong>. Keyword-based retrieval algorithms suffer from this issue the most because keywords in different languages rarely match. If you expect questions to come in a different language than the data you are retrieving from, consider adding a translation step or performing retrieval with a multilingual embedder instead.</p></li></ul><p>One caveat with retrieval failures is that if you&#8217;re using a very powerful LLM such as GPT-4, sometimes your LLM is smart enough to understand that the retrieved context is incorrect and will discard it, <strong>hiding the failure</strong>. This means that it&#8217;s even more important to make sure retrieval is working well in isolation, something we will see in a moment.</p><h2>Generation failure</h2><p>Assuming that retrieval was successful, what would happen if the LLM still hallucinated?</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JN-8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37ae644a-f365-4111-8dc8-ecabff011f7a_2400x912.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JN-8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37ae644a-f365-4111-8dc8-ecabff011f7a_2400x912.png 424w, https://substackcdn.com/image/fetch/$s_!JN-8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37ae644a-f365-4111-8dc8-ecabff011f7a_2400x912.png 848w, 
https://substackcdn.com/image/fetch/$s_!JN-8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37ae644a-f365-4111-8dc8-ecabff011f7a_2400x912.png 1272w, https://substackcdn.com/image/fetch/$s_!JN-8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37ae644a-f365-4111-8dc8-ecabff011f7a_2400x912.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JN-8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37ae644a-f365-4111-8dc8-ecabff011f7a_2400x912.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/37ae644a-f365-4111-8dc8-ecabff011f7a_2400x912.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!JN-8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37ae644a-f365-4111-8dc8-ecabff011f7a_2400x912.png 424w, https://substackcdn.com/image/fetch/$s_!JN-8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37ae644a-f365-4111-8dc8-ecabff011f7a_2400x912.png 848w, 
https://substackcdn.com/image/fetch/$s_!JN-8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37ae644a-f365-4111-8dc8-ecabff011f7a_2400x912.png 1272w, https://substackcdn.com/image/fetch/$s_!JN-8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37ae644a-f365-4111-8dc8-ecabff011f7a_2400x912.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>This is clearly an issue with our LLM: even when given all the correct data, the LLM still generated a wrong answer. Maybe our LLM doesn&#8217;t know that Esperanto is even a language? Or perhaps we&#8217;re using an LLM that doesn&#8217;t understand English well?</p><p>Naturally, each LLM will have different weak points that can trigger issues like these. Here are some common reasons why you may be getting generation failures.</p><ul><li><p><strong>The model is too small and can&#8217;t follow instructions well</strong>. When building in a resource-constrained environment (such as local smartphone apps or IoT), the choice of LLMs shrinks to just a few tiny models. However, the smaller the model, the less it will be able to understand natural language, and even when it does, it limits its ability to follow instructions. If you notice that your model consistently doesn&#8217;t pay enough attention to the question when answering it, consider switching to a larger or newer LLM.</p></li><li><p><strong>The model knows too little about the domain to even understand the question</strong>. This can happen if your domain is highly specific, uses specific terminology, or relies on uncommon acronyms. Models are trained on general-purpose text, so they might not understand some questions without finetuning, which helps specify the meaning of the most critical key terms and acronyms. 
When the answers given by your model somewhat address the question but miss the point entirely and stay generic or hand-wavy, this is likely the case.</p></li><li><p><strong>The model is not multilingual, but the questions and context may be</strong>. It&#8217;s essential that the model understands the question being asked in order to be able to answer it. The same is true for context: if the data found by the retriever is in a language that the LLM cannot understand, it won&#8217;t help it answer and might even confuse it further. Always make sure that your LLM understands the languages your users use.</p></li><li><p><strong>The RAG prompt is not built correctly</strong>. Some LLMs, especially older or smaller ones, may be very sensitive to how the prompt is built. If your model ignores part of the context or misses the question, the prompt might contain contradicting information, or it might be simply too large. LLMs are not always great at <a href="https://cs.stanford.edu/~nfliu/papers/lost-in-the-middle.arxiv2023.pdf">finding a needle in the haystack</a>: if you are consistently building huge RAG prompts and you observe generation issues, consider cutting it back to help the LLM focus on the data that actually contains the answer.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HuyZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf425d68-ef4d-4e16-86d3-f1b5941288e7_1648x734.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HuyZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf425d68-ef4d-4e16-86d3-f1b5941288e7_1648x734.png 424w, 
https://substackcdn.com/image/fetch/$s_!HuyZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf425d68-ef4d-4e16-86d3-f1b5941288e7_1648x734.png 848w, https://substackcdn.com/image/fetch/$s_!HuyZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf425d68-ef4d-4e16-86d3-f1b5941288e7_1648x734.png 1272w, https://substackcdn.com/image/fetch/$s_!HuyZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf425d68-ef4d-4e16-86d3-f1b5941288e7_1648x734.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HuyZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf425d68-ef4d-4e16-86d3-f1b5941288e7_1648x734.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/df425d68-ef4d-4e16-86d3-f1b5941288e7_1648x734.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!HuyZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf425d68-ef4d-4e16-86d3-f1b5941288e7_1648x734.png 424w, 
https://substackcdn.com/image/fetch/$s_!HuyZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf425d68-ef4d-4e16-86d3-f1b5941288e7_1648x734.png 848w, https://substackcdn.com/image/fetch/$s_!HuyZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf425d68-ef4d-4e16-86d3-f1b5941288e7_1648x734.png 1272w, https://substackcdn.com/image/fetch/$s_!HuyZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf425d68-ef4d-4e16-86d3-f1b5941288e7_1648x734.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h1>Evaluation strategies</h1><p>Once we put our RAG system in production, we should keep an eye on its performance at scale. This is where evaluation frameworks come into play.</p><p>To properly evaluate the performance of RAG, it&#8217;s best to perform two evaluation steps:</p><ol><li><p><strong>Isolated Evaluation</strong>. Because RAG is a two-step process, failures at one stage can hide or mask failures at the other, so it&#8217;s hard to tell where they originate. To address this issue, evaluate the retrieval and generation separately: both must work well in isolation.</p></li><li><p><strong>End-to-end evaluation</strong>. To ensure the system works well from start to finish, it&#8217;s best to evaluate it as a whole. End-to-end evaluation brings its own set of challenges, but it correlates more directly to the quality of the overall app.</p></li></ol><h2>Evaluating Retrieval</h2><p>Each retrieval method has its own state-of-the-art evaluation method and framework, so it&#8217;s usually best to refer to those.</p><p>For <strong>keyword-based</strong> retrieval algorithms such as TF-IDF, BM25, PageRank, and so on, evaluation is often done by checking that the keywords match well. 
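</p><p>This matching can be quantified with standard ranking metrics. As a toy, single-query illustration (the document ids and relevance labels below are made up), precision@k, recall@k, and reciprocal rank can be computed like this:</p>

```python
# Toy single-query retrieval metrics over ranked document ids.
def precision_at_k(retrieved, relevant, k):
    # Fraction of the top-k results that are actually relevant.
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    # Fraction of all relevant documents found in the top-k results.
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    # 1/rank of the first relevant result; 0 if none is found.
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1 / rank
    return 0.0

retrieved = ["d3", "d7", "d1", "d9"]   # ranked output of the retriever
relevant = {"d1", "d2"}                # ground-truth relevant documents
print(precision_at_k(retrieved, relevant, 3))  # 1 hit in the top 3
print(recall_at_k(retrieved, relevant, 3))     # 1 of 2 relevant docs found
print(reciprocal_rank(retrieved, relevant))    # first hit at rank 3
```

<p>Averaging such per-query values over a labeled query set gives the aggregate figures (MAP, MRR) that retrieval benchmarks report.</p><p>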
For this, you can use <a href="https://en.wikipedia.org/wiki/Evaluation_measures_%28information_retrieval%29">one of the many metrics</a> used for this purpose: <a href="https://en.wikipedia.org/wiki/Precision_and_recall">recall, precision</a>, <a href="https://en.wikipedia.org/wiki/F-score">F1</a>, <a href="https://en.wikipedia.org/wiki/Mean_reciprocal_rank">MRR</a>, <a href="https://en.wikipedia.org/wiki/Evaluation_measures_%28information_retrieval%29#Mean_average_precision">MAP</a>, &#8230;</p><p>For <strong>vector-based</strong> retrievers like vector DBs, evaluation is trickier because checking for matching keywords is not sufficient: the semantics of the question and the answer must be evaluated for similarity. We are going to see some libraries that help with this when evaluating generation: in short, they use another LLM to judge the similarity or compute metrics like <a href="https://docs.ragas.io/en/latest/concepts/metrics/semantic_similarity.html">semantic similarity</a>.</p><h2>Evaluating Generation</h2><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fVle!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F710fbf49-15ff-4f28-abb3-2afb8728eec8_3964x956.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fVle!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F710fbf49-15ff-4f28-abb3-2afb8728eec8_3964x956.png 424w, https://substackcdn.com/image/fetch/$s_!fVle!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F710fbf49-15ff-4f28-abb3-2afb8728eec8_3964x956.png 848w, 
https://substackcdn.com/image/fetch/$s_!fVle!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F710fbf49-15ff-4f28-abb3-2afb8728eec8_3964x956.png 1272w, https://substackcdn.com/image/fetch/$s_!fVle!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F710fbf49-15ff-4f28-abb3-2afb8728eec8_3964x956.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fVle!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F710fbf49-15ff-4f28-abb3-2afb8728eec8_3964x956.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/710fbf49-15ff-4f28-abb3-2afb8728eec8_3964x956.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!fVle!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F710fbf49-15ff-4f28-abb3-2afb8728eec8_3964x956.png 424w, https://substackcdn.com/image/fetch/$s_!fVle!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F710fbf49-15ff-4f28-abb3-2afb8728eec8_3964x956.png 848w, 
https://substackcdn.com/image/fetch/$s_!fVle!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F710fbf49-15ff-4f28-abb3-2afb8728eec8_3964x956.png 1272w, https://substackcdn.com/image/fetch/$s_!fVle!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F710fbf49-15ff-4f28-abb3-2afb8728eec8_3964x956.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Evaluating an LLM&#8217;s answers to a question is still a developing art, and several libraries can help with the task. One commonly used framework is <a href="https://haystack.deepset.ai/integrations/uptrain?utm_campaign=odsc-east">UpTrain</a>, which implements an &#8220;LLM-as-a-judge&#8221; approach. This means that the answers given by an LLM are then evaluated by another LLM, normally a larger and more powerful model.</p><p>This approach has the benefit that responses are not simply checked strictly for the presence or absence of keywords but can be evaluated according to much more sophisticated criteria like <a href="https://docs.uptrain.ai/predefined-evaluations/response-quality/response-completeness">completeness</a>, <a href="https://docs.uptrain.ai/predefined-evaluations/response-quality/response-conciseness">conciseness</a>, <a href="https://docs.uptrain.ai/predefined-evaluations/response-quality/response-relevance">relevance</a>, <a href="https://docs.uptrain.ai/predefined-evaluations/context-awareness/factual-accuracy">factual accuracy</a>, <a href="https://docs.uptrain.ai/predefined-evaluations/conversation-evals/user-satisfaction">conversation quality</a>, and more.</p><p>This approach leads to a far more detailed view of what the LLM is good at and what aspects of the generation could or should be improved. 
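</p><p>Stripped of the framework, LLM-as-a-judge boils down to building one grading prompt per criterion, sending it to a stronger model, and parsing a score out of its reply. Here is a rough sketch of the two deterministic pieces; the prompt wording and helper names are my own, and the actual call to the judge model is left out:</p>

```python
import re


def build_judge_prompt(question, answer, criterion):
    """Assemble a grading prompt for a single evaluation criterion."""
    return (
        f"You are an impartial evaluator. Rate the following answer for "
        f"{criterion} on a scale from 1 to 5, then briefly justify the score.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        f"Reply in the form 'Score: <1-5>'."
    )


def parse_score(judge_reply):
    """Extract the numeric score from the judge model's reply."""
    match = re.search(r"Score:\s*([1-5])", judge_reply)
    return int(match.group(1)) if match else None
```

<p>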
The criteria to select depend strongly on the application: for example, in medical or legal apps, factual accuracy should be the primary metric to optimize for, while in customer support, user satisfaction and conversation quality are also essential. For personal assistants, it&#8217;s usually best to focus on conciseness, and so on.</p><p>&#128161; <em>UpTrain can also be used to evaluate RAG applications end-to-end. Check <a href="https://docs.uptrain.ai/getting-started/introduction">its documentation</a> for details.</em></p><h2>End-to-end evaluation</h2><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4L8Y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35dbf63a-8c0c-41ec-8ac0-c5011c7b576e_1026x341.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4L8Y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35dbf63a-8c0c-41ec-8ac0-c5011c7b576e_1026x341.png 424w, https://substackcdn.com/image/fetch/$s_!4L8Y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35dbf63a-8c0c-41ec-8ac0-c5011c7b576e_1026x341.png 848w, https://substackcdn.com/image/fetch/$s_!4L8Y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35dbf63a-8c0c-41ec-8ac0-c5011c7b576e_1026x341.png 1272w, https://substackcdn.com/image/fetch/$s_!4L8Y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35dbf63a-8c0c-41ec-8ac0-c5011c7b576e_1026x341.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!4L8Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35dbf63a-8c0c-41ec-8ac0-c5011c7b576e_1026x341.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/35dbf63a-8c0c-41ec-8ac0-c5011c7b576e_1026x341.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!4L8Y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35dbf63a-8c0c-41ec-8ac0-c5011c7b576e_1026x341.png 424w, https://substackcdn.com/image/fetch/$s_!4L8Y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35dbf63a-8c0c-41ec-8ac0-c5011c7b576e_1026x341.png 848w, https://substackcdn.com/image/fetch/$s_!4L8Y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35dbf63a-8c0c-41ec-8ac0-c5011c7b576e_1026x341.png 1272w, https://substackcdn.com/image/fetch/$s_!4L8Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35dbf63a-8c0c-41ec-8ac0-c5011c7b576e_1026x341.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The evaluation of RAG systems end-to-end is also quite complex and can be implemented in many ways, depending on the aspect you 
wish to monitor. One of the simplest approaches is to focus on semantic similarity between the question and the final answer.</p><p>A popular framework that can be used for such high-level evaluation is <a href="https://haystack.deepset.ai/integrations/ragas?utm_campaign=odsc-east">RAGAS</a>. In fact, RAGAS offers two interesting metrics:</p><ul><li><p><strong><a href="https://docs.ragas.io/en/stable/concepts/metrics/semantic_similarity.html">Answer semantic similarity</a></strong>. This is computed simply by taking the cosine similarity between the answer and the ground truth.</p></li><li><p><strong><a href="https://docs.ragas.io/en/stable/concepts/metrics/answer_correctness.html">Answer correctness</a></strong>. Answer correctness is defined as a weighted average of the semantic similarity and the F1 score between the generated answer and the ground truth. This metric is more oriented towards fact-based answers, where F1 can help ensure that relevant facts such as dates, names, and so on are explicitly stated.</p></li></ul><p>On top of evaluation metrics, RAGAS also offers the capability to build <a href="https://docs.ragas.io/en/stable/concepts/testset_generation.html">synthetic evaluation datasets</a> to evaluate your app against. Such datasets spare you the work-intensive process of building a real-world evaluation dataset with human-generated questions and answers but also trade high quality for volume and speed. If your domain is very specific or you need extreme quality, synthetic datasets might not be an option, but for most real-world apps, such datasets can save tons of labeling time and resources.</p><p>&#128161; <em>RAGAS can also be used to evaluate each step of a RAG application in isolation. 
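</em></p><p>In spirit, the two RAGAS metrics above reduce to a few lines. Here is a toy sketch; the 0.75/0.25 weighting is illustrative and not necessarily RAGAS&#8217;s exact formula:</p>

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


def answer_correctness(semantic_similarity, factual_f1, f1_weight=0.75):
    """Weighted average of factual F1 and semantic similarity,
    in the spirit of RAGAS's answer-correctness metric."""
    return f1_weight * factual_f1 + (1 - f1_weight) * semantic_similarity
```

<p><em>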
Check <a href="https://docs.ragas.io/en/stable/getstarted/index.html">its documentation</a> for details.</em></p><p>&#128161; <em>I recently discovered an even more comprehensive framework for end-to-end evaluation called <strong><a href="https://docs.relari.ai/v0.3">continuous-eval</a></strong> from <a href="https://relari.ai/">Relari.ai</a>, which focuses on modular evaluation of RAG pipelines. Check it out if you&#8217;re interested in this topic and RAGAS doesn&#8217;t offer enough flexibility for your use case.</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!A-vn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9af7c2db-0001-41b4-b164-b0d56bfc63e6_1328x420.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!A-vn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9af7c2db-0001-41b4-b164-b0d56bfc63e6_1328x420.png 424w, https://substackcdn.com/image/fetch/$s_!A-vn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9af7c2db-0001-41b4-b164-b0d56bfc63e6_1328x420.png 848w, https://substackcdn.com/image/fetch/$s_!A-vn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9af7c2db-0001-41b4-b164-b0d56bfc63e6_1328x420.png 1272w, https://substackcdn.com/image/fetch/$s_!A-vn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9af7c2db-0001-41b4-b164-b0d56bfc63e6_1328x420.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!A-vn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9af7c2db-0001-41b4-b164-b0d56bfc63e6_1328x420.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9af7c2db-0001-41b4-b164-b0d56bfc63e6_1328x420.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!A-vn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9af7c2db-0001-41b4-b164-b0d56bfc63e6_1328x420.png 424w, https://substackcdn.com/image/fetch/$s_!A-vn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9af7c2db-0001-41b4-b164-b0d56bfc63e6_1328x420.png 848w, https://substackcdn.com/image/fetch/$s_!A-vn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9af7c2db-0001-41b4-b164-b0d56bfc63e6_1328x420.png 1272w, https://substackcdn.com/image/fetch/$s_!A-vn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9af7c2db-0001-41b4-b164-b0d56bfc63e6_1328x420.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h2>Putting it all together</h2><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!HwKb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeb91cc3-6553-40f1-aae3-55435a87b1a9_2659x700.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HwKb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeb91cc3-6553-40f1-aae3-55435a87b1a9_2659x700.png 424w, https://substackcdn.com/image/fetch/$s_!HwKb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeb91cc3-6553-40f1-aae3-55435a87b1a9_2659x700.png 848w, https://substackcdn.com/image/fetch/$s_!HwKb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeb91cc3-6553-40f1-aae3-55435a87b1a9_2659x700.png 1272w, https://substackcdn.com/image/fetch/$s_!HwKb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeb91cc3-6553-40f1-aae3-55435a87b1a9_2659x700.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HwKb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeb91cc3-6553-40f1-aae3-55435a87b1a9_2659x700.png" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aeb91cc3-6553-40f1-aae3-55435a87b1a9_2659x700.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!HwKb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeb91cc3-6553-40f1-aae3-55435a87b1a9_2659x700.png 424w, https://substackcdn.com/image/fetch/$s_!HwKb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeb91cc3-6553-40f1-aae3-55435a87b1a9_2659x700.png 848w, https://substackcdn.com/image/fetch/$s_!HwKb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeb91cc3-6553-40f1-aae3-55435a87b1a9_2659x700.png 1272w, https://substackcdn.com/image/fetch/$s_!HwKb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeb91cc3-6553-40f1-aae3-55435a87b1a9_2659x700.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Once you know how you want to evaluate your app, it&#8217;s time to put it together. A convenient framework for this step is <a href="https://haystack.deepset.ai/?utm_campaign=odsc-east">Haystack</a>, a Python open-source LLM framework focused on building RAG applications. 
Haystack is an excellent choice because it can be used through all stages of the application lifecycle, from prototyping to production, including evaluation.</p><p>Haystack supports several evaluation libraries including <a href="https://haystack.deepset.ai/integrations/uptrain?utm_campaign=odsc-east">UpTrain</a>, <a href="https://haystack.deepset.ai/integrations/ragas?utm_campaign=odsc-east">RAGAS</a> and <a href="https://haystack.deepset.ai/integrations/deepeval?utm_campaign=odsc-east">DeepEval</a>. To understand more about how to implement and evaluate a RAG application with it, check out their tutorial about model evaluation <a href="https://haystack.deepset.ai/tutorials/35_model_based_evaluation_of_rag_pipelines?utm_campaign=odsc-east">here</a>.</p><h1>Advanced flavors of RAG</h1><p>Once our RAG app is ready and deployed in production, the natural next step is to look for ways to improve it even further. RAG is a very versatile technique, and many different flavors of &#8220;advanced RAG&#8221; have been experimented with, many more than I can list here. Depending on the situation, you may focus on different aspects, so let&#8217;s list some examples of tactics you can deploy to make your pipeline more powerful, context-aware, accurate, and so on.</p><h2>Use multiple retrievers</h2><p>Sometimes, a RAG app needs access to vastly different types of data simultaneously. For example, a personal assistant might need access to the Internet, your Slack, your emails, your personal notes, and maybe even your pictures. Designing a single retriever that can handle data of so many different kinds is possible. 
Still, it can be a real challenge and, in many cases, requires an entire data ingestion pipeline.</p><p>Instead of going that way, you can use multiple retrievers, each specialized to a specific subset of your data: for example, one retriever that browses the web, one that searches on Slack and in your emails, one that checks for relevant pictures.</p><p>When using many retrievers, however, it&#8217;s often best to introduce another step called <strong>reranking</strong>. A reranker double-checks that all the results returned by each retriever are actually relevant and sorts them again before the RAG prompt is built. Rerankers are usually much more precise than retrievers in assessing the relative importance of various snippets of context, so they can dramatically improve the quality of the pipeline. In exceptional cases, they can be helpful even in RAG apps with a single retriever.</p><p>Here is an <a href="https://haystack.deepset.ai/tutorials/33_hybrid_retrieval?utm_campaign=odsc-east">example</a> of such a pipeline built with Haystack.</p><h2>Self-correction</h2><p>We mentioned that one of the most common evaluation strategies for RAG output is &#8220;LLM-as-a-judge&#8221;: the idea of using another LLM to evaluate the answer of the first. However, why use this technique only for evaluation?</p><p>Self-correcting RAG apps add one extra step at the end of the pipeline: they take the answer, pass it to a second LLM, and ask it to assess whether the answer is likely to be correct. 
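</p><p>The control flow is a simple loop. A minimal sketch, with two stand-in callables in place of the actual LLM calls:</p>

```python
def self_correcting_answer(question, generate, judge, max_rounds=3):
    """Loop a generator LLM and a judge LLM until the judge approves.

    generate(question, feedback) -> answer     (stand-in for the first LLM)
    judge(question, answer) -> (ok, feedback)  (stand-in for the second LLM)
    """
    feedback = None
    for _ in range(max_rounds):
        answer = generate(question, feedback)
        ok, feedback = judge(question, answer)
        if ok:
            return answer
    return answer  # best effort if no agreement within max_rounds
```

<p>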
If the check fails, the second LLM provides feedback on why it believes the answer is wrong, and this feedback is passed back to the first LLM, which tries to answer again until an agreement is reached.</p><p>Self-correcting LLMs can help improve the accuracy of the answers at the expense of more LLM calls per user question.</p><h2>Agentic RAG</h2><p>In the LLM field, the term &#8220;agent&#8221; or &#8220;agentic&#8221; is often used to identify systems that use LLMs to make decisions. In the case of a RAG application, this term refers to a system that does not always perform retrieval but decides whether to perform it by reading the question first.</p><p>For example, imagine we&#8217;re building a RAG app to help primary school children with their homework. When the question refers to topics like history or geography, RAG is very helpful to avoid hallucinations. However, if the question concerns math, the retrieval step is entirely unnecessary, and it might even confuse the LLM by retrieving similar math problems with different answers.</p><p>Making your RAG app agentic is as simple as giving the question to an LLM before retrieval in a prompt such as:</p><pre><code>Reply YES if the answer to this question should include facts and 
figures, NO otherwise.

Question: What's the capital of France?
</code></pre><p>Then, retrieval is run or skipped depending on whether the answer is YES or NO.</p><p>This is the most basic version of agentic RAG. Some advanced LLMs can do better: they support so-called &#8220;function calling,&#8221; which means that they can tell you exactly how to invoke the retriever and even provide specific parameters instead of simply answering YES or NO.</p><p>For more information about function calling with LLMs, check out <a href="https://platform.openai.com/docs/guides/function-calling">OpenAI&#8217;s documentation</a> on the topic or the equivalent documentation of your LLM provider.</p><h2>Multihop RAG</h2><p>Multihop RAG is an even more complex version of agentic RAG. Multihop pipelines often use <strong>chain-of-thought prompts</strong>, a type of prompt that looks like this:</p><pre><code>You are a helpful and knowledgeable agent.

To answer questions, you'll need to go through multiple steps involving step-by-step 
thinking and using a search engine to do web searches. The browser will respond with 
snippets of text from web pages. When you are ready for a final answer, respond with 
`Final Answer:`.

Use the following format:

- Question: the question to be answered
- Thought: Reason if you have the final answer. If yes, answer the question. If not, 
    find out the missing information needed to answer it.
- Search Query: the query for the search engine
- Observation: the search engine will respond with the results
- Final Answer: the final answer to the question, make it short (1-5 words)

Thought, Search Query, and Observation steps can be repeated multiple times, but 
sometimes, we can find an answer in the first pass.

---

- Question: "Was the capital of France founded earlier than the discovery of America?"
- Thought: 
</code></pre><p>This prompt is very complex, so let&#8217;s break it down:</p><ol><li><p>The LLM reads the question and decides which information to retrieve.</p></li><li><p>The LLM returns a query for the search engine (or a retriever of our choice).</p></li><li><p>Retrieval is run with the query the LLM provided, and the resulting context is appended to the original prompt.</p></li><li><p>The entire prompt is returned to the LLM, which reads it, follows all the reasoning it did in the previous steps, and decides whether to do another search or reply to the user.</p></li></ol><p>Multihop RAG is used for autonomous exploration of a topic, but it can be very expensive because many LLM calls are performed, and the prompts tend to become really long really quickly. The process can also take quite some time, so it&#8217;s not suitable for low-latency applications. However, the idea is quite powerful, and it can be adapted into other forms.</p><h1>A word on finetuning</h1><p>It&#8217;s important to remember that finetuning is not an alternative to RAG. Finetuning can and should be used together with RAG in very complex domains, such as medicine or law.</p><p>When people think about finetuning, they usually focus on finetuning the LLM. In RAG, though, it is not only the LLM that needs to understand the question: it&#8217;s crucial that the retriever understands it well, too! This means <strong>the embedding model needs finetuning as much as the LLM</strong>. Finetuning your embedding models, and in some cases also your reranker, can improve the effectiveness of your RAG by orders of magnitude. Such a finetune often requires only a fraction of the training data, so it&#8217;s well worth the investment.</p><p>Finetuning the LLM is also necessary if you need to alter its behavior in production, such as making it more colloquial or more concise, or making it stick to a specific voice. 
Prompt engineering can also achieve these effects, but it&#8217;s often more brittle and can be more easily worked around. Finetuning the LLM has a much more powerful and lasting effect.</p><h1>Conclusion</h1><p>RAG is a vast topic that could fill books: this was only an overview of some of the most important concepts to remember when working on a RAG application. For more on this topic, check out my <a href="https://www.zansara.dev/posts">other blog posts</a> and stay tuned for <a href="https://www.zansara.dev/talks">future talks</a>!</p>]]></content:encoded></item></channel></rss>