Mostly Harmless
Fueled by lemons.
https://jlowin.dev/

👋 Hello, world!
https://jlowin.dev/blog/hello-world/
Fri, 13 Sep 2024 00:00:00 GMT

<p>&lt;div class="flex justify-center"&gt; &lt;iframe width="560" height="315" src="https://www.youtube.com/embed/ZS90l4L2t6k?si=cf_WiB6Ji3hCxEun" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen&gt;&lt;/iframe&gt; &lt;/div&gt;</p>

<p>&lt;p class="text-center"&gt; (Hello, reader! You probably want &lt;a href='the-qualified-self'&gt;this post&lt;/a&gt; instead.) &lt;/p&gt;</p>

Teaching AI to Label GitHub Issues
https://jlowin.dev/blog/ai-labeler/ai-labeler/
A label of love
Sat, 02 Nov 2024 00:00:00 GMT

<p>&lt;Image src={slackScreenshot} alt="Slack Screenshot" class="shadow-lg p-6 mx-auto" width="500"/&gt;</p> <blockquote> <p>"Is GitHub using AI to label PRs now?"</p> </blockquote> <p>When my colleague Nate asked me this question in Slack, I had to pause. "I don't think so?"
And then: "They should though."</p> <p><strong>tl;dr: They don't – but you can!</strong></p> <p>I built a GitHub Action that uses LLMs to intelligently label your issues and PRs, and you can drop it into your repos <a href="https://github.com/marketplace/actions/ai-labeler">right now</a>.</p> <p>Beyond just announcing a new tool, I want to share a little of what I learned about practical AI application design, the surprising effectiveness of structured reasoning with smaller models, and why I believe this represents a perfect case study in augmenting (rather than replacing) human workflows.</p> <h2>Why Labels Matter</h2> <p>As an open-source contributor, you've surely seen labels on issues and PRs: colorful tags that categorize work in meaningful ways.</p> <p>For a maintainer, labels represent a fairly sophisticated system for repository orchestration. Similar to hashtags in early social media, labels are an extremely simple construct that has nonetheless transcended its original purpose and become a critical tool for open-source management. They:</p> <ul> <li>shape contributor behavior (<code>good first issue</code>, <code>help wanted</code>)</li> <li>set expectations (<code>breaking change</code>, <code>duplicate</code>, <code>needs-mre</code>)</li> <li>route attention (<code>security-review</code>, <code>needs-reproduction</code>, <code>frontend</code>)</li> <li>advertise features and milestones (<code>enhancement</code>, <code>caching</code>, <code>RBAC</code>, <code>3.x</code>)</li> <li>and drive automation (many human and automated workflows use labels as triggers or status indicators)</li> </ul> <p>One of my favorite examples is the <code>great writeup</code> label <a href="https://github.com/PrefectHQ/prefect/issues?q=sort%3Aupdated-desc+is%3Aopen+label%3A%22great+writeup%22">on the Prefect repo</a>, which highlights issues or resolutions that are exceptionally well-written.
It's a great way to recognize and encourage good contributor experiences, and it's a powerful "show don't tell" signal for new contributors.</p> <p>Most labeling today falls into two categories: manual application (time-consuming and inconsistent) or static automation based on simple rules. The excellent first-party <a href="https://github.com/marketplace/actions/labeler">Pull Request Labeler action</a>, for instance, can apply path-based rules such as adding the <code>frontend</code> label to any PR that touches files in <code>ui/**</code>. In fact, it was seeing this <em>deterministic</em> behavior that prompted Nate's question in the first place.</p> <p>But path-based labeling can't tell you whether those changes need security review, or if they're breaking existing APIs, or if they'd make a great first issue for new contributors. To automate labeling effectively, we need something that can actually understand content, intent, and context.</p> <p>Luckily, I know a guy.</p> <h2>Labels, Meet AI</h2> <p>LLMs are a perfect fit for this problem. Classification, or mapping unstructured context onto a set of categories, is one of their most fundamental operations! Most importantly, they can understand the context and even the intent behind any changes, not just their objective surface characteristics. They can tell when a PR constitutes an enhancement, when test coverage is insufficient, or when a security review is needed.</p> <p><strong>So... let's build an AI labeler!</strong></p> <p>Using <a href="https://controlflow.ai">ControlFlow</a>, the core implementation is surprisingly simple. In fact, despite representing the entirety of this action's "magic", I spent only a fraction of my time orchestrating the AI logic and all the rest trying to get the action itself to work in CI.</p> <p>You may draw your own conclusions about the state of developer happiness.</p> <p>Here is a slightly simplified version of the core code.
</p> <pre><code>import controlflow as cf
from pydantic import BaseModel
from typing import Optional, Union


@cf.flow
def labeling_workflow(
    pr_or_issue: Union["PullRequest", "Issue"],
    labels: list["Label"],
) -&gt; list[str]:
    class Reasoning(BaseModel):
        label_name: str
        reasoning: str
        should_apply: bool

    labeler = cf.Agent(
        name="GitHub Labeler",
        model="openai/gpt-4o-mini",
        instructions="You are an expert at labelling GitHub issues and PRs.",
    )

    decision = cf.run(
        "Consider the PR/issue and reason about potential labels",
        result_type=list[Reasoning],
        context=dict(pr_or_issue=pr_or_issue, labels=labels),
        agents=[labeler],
    )

    return [r.label_name for r in decision if r.should_apply]
</code></pre> <p>In this flow:</p> <ol> <li>We create a Pydantic model to hold the reasoning about each label</li> <li>We create an agent that will use GPT-4o-mini to label the PR or issue</li> <li>We reason about each label to ultimately produce a list of labels that should be applied</li> </ol> <p>The "full" code can be seen <a href="https://github.com/jlowin/ai-labeler/blob/main/src/ai_labeler/ai.py">here</a>.</p> <h3>Reasoning: Show Your Work</h3> <p>The first version of this flow simply asked the agent to generate a list of labels. This worked well in all cases with GPT-4o, but GPT-4o-mini sometimes made mistakes with complex labeling instructions.</p> <p>I experimented with a variety of solutions, including different prompts, multi-stage reasoning, and more, before settling on the approach above, in which we ask the model to explicitly output its reasoning about each label.
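</p>

<p>To make that last step concrete, here is a minimal, dependency-free sketch of the filtering logic, using a hypothetical JSON payload in place of the validated <code>Reasoning</code> objects ControlFlow returns:</p>

```python
import json

def labels_to_apply(raw_response: str) -> list[str]:
    """Parse a JSON list of per-label reasoning entries and keep only
    the labels the model concluded should apply."""
    entries = json.loads(raw_response)
    return [e["label_name"] for e in entries if e["should_apply"]]

# Hypothetical model output for a PR that touches auth code:
response = json.dumps([
    {"label_name": "security-review",
     "reasoning": "Modifies the authentication flow",
     "should_apply": True},
    {"label_name": "good-first-issue",
     "reasoning": "Requires deep knowledge of the auth system",
     "should_apply": False},
])

print(labels_to_apply(response))  # ['security-review']
```

<p>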
It's fascinating how well this approach works, permitting GPT-4o-mini to operate near the level of GPT-4o, at a tiny fraction of the cost.</p> <p>(Performance is actually <em>very slightly</em> better in a two-step reasoning approach, but at the cost of a second pass through the LLM, so I've opted for the single-pass version for now.)</p> <p>Note that this is <em>not</em> the same as <a href="https://jlowin.dev/blog/does-o1-mean-agents-are-dead">o1-style reasoning</a>, as it does not involve any iterative refinement of the model's understanding of the input. Instead, this approach forces the model to pay more attention to instructions, thereby tipping it into an operating regime that's more likely to produce the right answer.</p> <h2>Configuration</h2> <p>Now we've got an AI workflow that can assign labels to a PR. That still might not be enough to mimic a human maintainer, because we ascribe norms to label application that are based on context, intent, and nuance.</p> <p>For this reason, the AI labeler allows you to annotate each label with natural language instructions that clarify its purpose.</p> <p>The configuration is straightforward:</p> <pre><code>labels:
  - security-review:  # the label name
      description: "Needs security team review"
      instructions: |
        Apply when changes involve:
        - Authentication or authorization code
        - Cryptographic operations
        - Environment variables
        - Container or deployment config
</code></pre> <p>This gives you a way to define what "good first issue" means for your project, or exactly when to flag something as a breaking change or needing tests. The AI will consider your instructions alongside the actual content, leading to remarkably nuanced decisions.</p> <p>For more control, you can provide global instructions and even include additional files for context, like a contribution guide or code of conduct.</p> <pre><code># .github/ai-labeler.yml
instructions: |
  Focus on identifying good first issues and security concerns.
labels:
  - good-first-issue:
      description: "Perfect for newcomers"
      instructions: |
        Apply when the changes are:
        - Well-scoped and isolated
        - Well-documented
        - Don't require deep system knowledge
  - security-review:
      description: "Needs security team review"
      instructions: |
        Apply when changes touch:
        - Authentication flows
        - Environment variables
        - Network requests

context_files:
  - .github/CODEOWNERS
  - CONTRIBUTING.md
  - CODE_OF_CONDUCT.md
</code></pre> <h2>No Plan Survives First Contact</h2> <p>I'd never created a GitHub Action before, and I have to admit – it was both more challenging and more rewarding than I expected. The documentation is comprehensive but often cryptic. Environment variables have surprising names. You can't read repository files until you check out the repository (which seems obvious in hindsight, but took me embarrassingly long to figure out). Testing is essentially "push and pray."</p> <p>This led to an important design decision: keep most of the code as normal, testable Python and wrap it with a thin layer of GitHub-specific glue. In retrospect, this separation of concerns was crucial for maintaining my sanity during development and will make it much easier to maintain going forward.</p> <h2>Beyond Labels</h2> <p>What excites me most about this project isn't just its utility (though I do use it on all my repositories now).
It's that it represents a perfect example of how AI can augment existing workflows without trying to replace human judgment:</p> <ol> <li>It handles the routine work of initial labeling, but maintainers can always adjust or override its decisions</li> <li>It learns from your repository's context and explicit instructions, adapting to your specific needs</li> <li>It's completely transparent about its reasoning, making it easy to debug and improve</li> <li>It's fast and affordable – you could process 10,000 PRs for less than $5</li> </ol> <p>The structured reasoning approach means we get sophisticated behavior from smaller models – there's no need to step up to GPT-4o or Claude just for intelligent labeling. This keeps it practical for real-world use while still delivering genuinely helpful automation.</p> <p>There's still plenty to explore – for example, distinguishing between issue-only and PR-only labels, or learning from manual corrections over time. Want to help? The project could especially use some "good first issues" to welcome new contributors. You can find <a href="https://github.com/jlowin/ai-labeler">AI Labeler on GitHub</a>, and suggestions are always welcome.</p> <p>But don't worry about picking the right labels for your issues – we've got that covered! 
😉</p>

An Intuitive Guide to How LLMs Work
https://jlowin.dev/blog/an-intuitive-guide-to-how-llms-work/
Chatting by chance
Sun, 06 Oct 2024 00:00:00 GMT

<blockquote> <p>"LLMs, how do they work?"</p> </blockquote> <p>It may seem like a strange[^magnets] question to ask. After all, large language models (LLMs) have become so ubiquitous so quickly that it's hard to find someone who <em>isn't</em> interacting with one regularly. They form the cornerstone of modern AI, powering everything from consumer-facing chatbots to advanced analysis tools. They can write poetry, answer complex questions, and even build software.</p> <p>[^magnets]: <a href="https://www.youtube.com/watch?v=aVPDGW6Y53s">Magnets, how do they work?</a></p> <p><strong>But how do they work?</strong></p> <p>I believe it's critical to develop a strong intuition for how LLMs operate in order to work with them effectively. Unfortunately, most people are quickly deterred by all the complex math that usually accompanies any such explanation. However, just as you don't need to understand exactly how a car's engine works to be a skilled driver, or know the details of Google's algorithm to craft an effective query, you also don't need to grok transformer models in order to be productive with ChatGPT. What you <em>do</em> need is an understanding of how the system behaves as a whole.</p> <p>It's not as complicated as you might think.
See if you can complete this sentence: <code>The cat sat on the _____.</code> Did you think of <code>mat</code>, <code>windowsill</code>, or maybe <code>keyboard</code>? Believe it or not, this post is mostly about designing a system that can do the same. By the end, I hope you'll see how a simple idea like word prediction can scale up to create AIs capable of engaging in complex conversations, answering questions, and even writing code.</p> <p>And we'll only use as much math as you'd be comfortable discussing at a dinner party.[^dinner-party-math]</p> <p>[^dinner-party-math]: Granted, if you regularly discuss math at your dinner parties, this post might not be for you.</p> <p>&lt;Callout color="gray"&gt; This post is based on my talk, &lt;span class="font-bold"&gt;"How to Succeed in AI (Without Really Crying)."&lt;/span&gt; &lt;/Callout&gt;</p> <h2>Probability</h2> <p>To understand how LLMs work, we need to start with probability.</p> <p><strong>I know, you're already bored.</strong> I love statistics, and I'm already bored. But at their core, LLMs are nothing more than fancy probability engines.[^fancy-probability-engines]</p> <p>[^fancy-probability-engines]: with really fancy marketing.</p> <p>Probability is a tool for quantifying randomness and uncertainty. Having a probabilistic nature is the source of an LLM's power... and its unpredictability.
It's what makes it possible to generate novel, creative, actionable responses, and also makes LLMs very difficult to train or debug.</p> <p>Whether you're a layperson, practitioner, or researcher, the entire process of working with LLMs is an exercise in forming and manipulating their latent probability distributions into giving you the outputs you want.</p> <p>Therefore, there are three key concepts that, if understood intuitively, will give you a firm grasp on how LLMs work:</p> <ol> <li><strong>Probability Distributions</strong></li> <li><strong>Conditional Probability Distributions</strong></li> <li><strong>Sampling from Probability Distributions</strong></li> </ol> <p>&lt;Callout color="green"&gt; The statistics nerds among you may argue that we should be talking about "joint probability distributions".</p> <p>The statistics nerds among you are welcome to write their own blog posts. &lt;/Callout&gt;</p> <h3>Probability Distributions</h3> <p>Let's dive right into probability with something that's familiar to most people: flipping a coin.</p> <p>When you flip a fair coin, there's a 50% chance it will land on heads and a 50% chance it will land on tails. This simple scenario is nonetheless a complete example of a probability distribution. Let's break it down:</p> <ul> <li>There are two possible outcomes: heads or tails.</li> <li>Each outcome has an equal likelihood of occurring.</li> <li>The probabilities of all possible outcomes add up to 100%.</li> </ul> <p>We can visualize the distribution of outcomes like this:</p> <p>&lt;figure&gt; &lt;Image src={coinToss} alt="Coin Toss Distribution" class="shadow-none" /&gt; &lt;figcaption&gt;Probability distribution of a fair coin&lt;/figcaption&gt; &lt;/figure&gt;</p> <p>This distribution tells us everything we need to know about a coin toss before it happens. Note that it doesn't tell us what will happen on any particular flip, but rather what to expect over many flips.
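</p>

<p>We can watch that long-run behavior emerge by simulating flips; here's a quick sketch using Python's built-in <code>random</code> module:</p>

```python
import random

random.seed(42)  # fixed seed so the example is reproducible

# Sample 100,000 flips from the uniform two-outcome distribution.
flips = [random.choice(["heads", "tails"]) for _ in range(100_000)]
heads_rate = flips.count("heads") / len(flips)

# Any single flip is unpredictable, but the long-run frequency
# settles very close to the 50% the distribution promises.
print(f"heads: {heads_rate:.1%}")
```

<p>The same sketch works for the die or the roulette wheel: just grow the list of equally likely outcomes.</p>

<p>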
Another way of saying this is that on any flip, the <em>likelihood</em> of heads is equal to the <em>likelihood</em> of tails. For our purposes today, that likelihood -- or relative chance of an outcome -- is what we're most interested in.</p> <p>But what if there are more than two outcomes?</p> <p>Consider a 6-sided die. Each side has a lower absolute probability of coming up than the side of a coin -- a 16.67% chance, to be precise -- but all of them are equally likely. Therefore, from a probability perspective, a six-sided die is sort of like a scaled-up coin toss: it represents a distribution of equally likely outcomes.</p> <p>&lt;figure&gt; &lt;Image src={diceRoll} alt="Dice Roll Distribution" class="shadow-none" /&gt; &lt;figcaption&gt;Probability distribution of a six-sided die&lt;/figcaption&gt; &lt;/figure&gt;</p> <p>We can push this even further by considering a roulette wheel, which has 38 outcomes, each one just as (un)likely as any other.</p> <p>&lt;figure&gt; &lt;Image src={rouletteSpin} alt="Roulette Spin Distribution" class="shadow-none" /&gt; &lt;figcaption&gt;Probability distribution of a roulette wheel&lt;/figcaption&gt; &lt;/figure&gt;</p> <p>All of the distributions we've discussed so far are called <strong>uniform probability distributions</strong>, and if you look at their charts, you can see why: since every outcome is equally likely, their probability distribution is flat.</p> <p>Uniform distributions are an easy and important way to understand the nature of probability, but we all know that LLMs aren't just a giant roulette wheel. We need more powerful tools to understand them.</p> <p>Let's take a step towards complexity by considering probability models that <em>aren't</em> uniformly distributed. One of the most familiar is the <strong>normal distribution</strong>, or bell curve.
Consider the following chart of the distribution of adult human heights:</p> <p>&lt;figure&gt; &lt;Image src={heightAll} alt="Height Distribution" class="shadow-none" /&gt; &lt;figcaption&gt;Probability distribution of adult human heights&lt;/figcaption&gt; &lt;/figure&gt;</p> <p>The peak of the curve represents the average height, about 5'5". Heights close to the average are most common, which is why the curve is highest in the middle. As we move away from the average, the likelihood of seeing those heights decreases, which is why the curve tapers off and forms the "bell" shape that gives it its colloquial name.</p> <h3>Conditional Probability Distributions</h3> <blockquote> <p>All models are wrong, but some are useful.</p> <p>-- George Box</p> </blockquote> <p>So is this bell curve a "good" model? It has some nice properties, but it's far from perfect. For example, it suggests that the most likely height for a randomly selected person to have is 5'5". But if you met a 6-year-old child who was 5'5", would you think they were completely average? Of course not. And so there's clearly something wrong with our model.</p> <p>Real-world probabilities often depend on additional factors. For example, if we know someone's age, gender, or other demographic information, our assessment of their probable height could change dramatically: the distribution of heights of 6-year-old boys is markedly different from that of middle-aged women.
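</p>

<p>In code, "conditioning" just amounts to choosing a different distribution based on what we know. Here's a toy sketch; the means and spreads below are rough illustrative guesses, not survey data:</p>

```python
import random

random.seed(0)

# Hypothetical (mean, standard deviation) of height in inches,
# conditional on what we know about the person. Illustrative only.
HEIGHT_MODELS = {
    "6-year-old boy": (45, 2),
    "adult woman": (63, 3),
    "adult man": (69, 3),
}

def sample_height(group: str) -> float:
    """Sample one height from the group's normal distribution."""
    mean, sd = HEIGHT_MODELS[group]
    return random.gauss(mean, sd)

print(sample_height("6-year-old boy"))  # typically near 45 inches
print(sample_height("adult man"))       # typically near 69 inches
```

<p>Knowing which distribution to draw from is exactly what the extra information buys us.</p>

<p>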
One way of discussing this rich family of related probabilities is that they are <strong>conditional probability distributions</strong>, meaning that they reflect additional information or knowledge versus the base or naive distribution.</p> <p>Here, for example, are the conditional height distributions for adult men and women:</p> <p>&lt;figure&gt; &lt;Image src={heightByGender} alt="Height Distribution by Gender" class="shadow-none" /&gt; &lt;figcaption&gt;Probability distribution of adult male and female heights&lt;/figcaption&gt; &lt;/figure&gt;</p> <p>You can see that there are two distributions, one for each piece of conditional knowledge. If we know someone's gender, we can use the corresponding distribution to make more accurate predictions.</p> <p>Similarly, here are conditional height distributions for 10-year-olds and 50-year-olds. You can imagine that there is a continuous stream of corresponding distributions for every other age.</p> <p>&lt;figure&gt; &lt;Image src={heightByAge} alt="Height Distribution by Age" class="shadow-none" /&gt; &lt;figcaption&gt;Probability distribution of 10-year-old and 50-year-old heights&lt;/figcaption&gt; &lt;/figure&gt;</p> <p>Conditional distributions allow us to not only model outcomes based on our empirical observations of the world, but to incorporate other findings into those models in a precise way. What's especially interesting is that the conditional factors do not have to have a causal relationship with the observed outcomes; they only need to be correlated with them.</p> <p>To illustrate this important concept, consider the distribution of heights of professional basketball players.
The distribution of heights, conditional on the knowledge that someone is a professional basketball player, is quite different from the distribution of heights in the general population.</p> <p>&lt;figure&gt; &lt;Image src={heightBasketball} alt="Height Distribution for Professional Basketball Players" class="shadow-none" /&gt; &lt;figcaption&gt;Probability distribution of heights for professional basketball players, with the population average in gray&lt;/figcaption&gt; &lt;/figure&gt;</p> <p>The peak has shifted significantly to the right, indicating that professional basketball players are, on average, much taller than the general population. But here's the crucial point: being tall doesn't cause someone to become a professional basketball player, nor does being a professional basketball player cause someone to grow taller. There's simply a strong correlation between being tall and being a professional basketball player.</p> <p>This non-causal yet highly informative relationship is key to understanding how LLMs work. These models don't understand causality in the way humans do. Instead, they excel at recognizing and leveraging correlations in data. When an LLM generates text, it's not reasoning about cause and effect; it's making predictions based on patterns and correlations it has observed in its training data.</p> <p>For instance, if an LLM has been trained on a dataset that includes many descriptions of basketball players, it might learn to associate words like "player," "NBA," or "court" with a higher likelihood of words related to tallness. This doesn't mean the model understands why basketball players tend to be tall; it just knows that these concepts frequently co-occur.</p> <h3>Sampling</h3> <p>The last thing we need to understand before we move on to language is <strong>sampling</strong>.</p> <p>Let's go back to our roulette wheel for a moment. 
When you play roulette, you're using a ball to sample outcomes from the probability distribution of the wheel. Each spin is an independent event that produces an outcome based on the underlying probabilities. To sample digitally, we replace the ball with a random number generator.</p> <p>Sampling is how we turn our probability distributions into actual outcomes. It's the bridge between our model of the world (the distribution) and events in the world (specific outcomes). Given some understanding of the relative likelihoods of different outcomes, we can produce novel outcomes from the distribution that satisfy its rules and constraints.</p> <p>Importantly, sampling allows us to generate outcomes that reflect the overall structure of the distribution, even if we've never seen that exact outcome before. For instance, if we sample heights from our earlier distribution, we might get a height of 5'11" - a specific value that may not have been in our original dataset, but one that fits the pattern we've modeled.</p> <p>This process of sampling is crucial for LLMs. When generating text, these models don't simply choose the most probable word every time. Instead, they sample from their probability distributions, which allows for creativity and variability in their responses. For now, it's important to note that sampling lets you convert a distribution into an outcome, a concept we'll explore further when we dive into how LLMs generate text.</p> <h3>Training</h3> <p>Before we dive into language models, let's briefly touch on what it means to "train" a model for a probability distribution. For our purposes, think of training as the process of tweaking a mathematical formula to make it fit a set of observed outcomes as closely as possible.</p> <p>One of the reasons the normal distribution is so useful is that its model only requires two parameters: the average (mean) height and how much heights typically vary from this average (standard deviation). 
With these two numbers, we can recreate that familiar bell curve.</p> <p>But what about more intricate distributions, like our conditional probabilities? Well, it gets a bit more complicated, but the core idea is the same: we're trying to create a mathematical model that can accurately represent the distribution we see in our data. For now, just know that it's possible to build these models, even for very complex distributions, and training is the process of solving for their parameters.</p> <h2>Language Models</h2> <p>We've spent considerable time building an intuition for probability distributions, conditional probabilities, and sampling. Now, let's apply these concepts to the core of Large Language Models: modeling language itself.</p> <h3>Distributions of Words</h3> <p>Just as we modeled the distribution of heights in a population, we can model the distribution of words in a language. At first, this might seem strange - words aren't numbers like heights, after all. But remember, probability distributions are simply about the likelihood of different outcomes, and words are just another type of outcome we can measure.</p> <p>Suppose we took a large corpus of text data and made a graph of every word that appeared in it against its normalized frequency of appearance, ordered by that frequency. We'd get something like this:</p> <p>&lt;figure&gt; &lt;Image src={words} alt="Word Frequency Distribution" class="shadow-none" /&gt; &lt;figcaption&gt;Probability distribution of words in the English language&lt;/figcaption&gt; &lt;/figure&gt;</p> <p>This is a probability distribution of words! Just like our height distribution, it shows us the relative likelihood of different outcomes. However, there's a crucial difference: while heights formed a continuous distribution, words are discrete entities. There's no such thing as a word that's halfway between 'cat' and 'dog'. 
In this sense, our word distribution is more like our roulette wheel: each word is a distinct possibility with its own probability of occurrence.</p> <p>In this distribution, you'll notice:</p> <ol> <li>Common words like <code>is</code>, <code>the</code>, and <code>a</code> are the most likely to appear.</li> <li>Everyday words like <code>street</code>, <code>yellow</code>, and <code>climb</code> occupy the middle ground.</li> <li>There is a long tail of rare or specialized words like <code>oxidize</code> or <code>peripatetic</code>.</li> </ol> <p>However, we can't just sample from this distribution and generate intelligible prose. Iterated draws from this distribution are infinitely more likely to generate the "sentence" <code>a a the a yellow run a the catalyst a is the the street</code> than anything resembling Shakespeare.</p> <h3>Conditional Distributions</h3> <p>Remember how our height predictions improved when we considered additional factors like age or profession? The same principle applies to words, but to an even greater degree. The probability of a word appearing is heavily dependent on the words, syntax, and semantics that come before it.
This is where conditional probability becomes crucial in language modeling.</p> <p>Let's go back to the simple example we started this post with:</p> <p><code>The cat sat on the _____</code></p> <p>Given this context, you can make a pretty good guess about what the next word could be:</p> <ol> <li><code>mat</code> is highly probable</li> <li><code>roof</code> is likely</li> <li><code>piano</code> is also possible, though less common</li> <li><code>myrmidon</code> is extremely improbable</li> <li><code>the</code> wouldn't even make sense grammatically</li> </ol> <p>&lt;figure&gt; &lt;Image src={wordsConditionalCat} alt="Conditional Probability of Words Given Context" class="shadow-none" /&gt; &lt;figcaption&gt;Conditional probability of words given context&lt;/figcaption&gt; &lt;/figure&gt;</p> <p>Obviously, this is a very different distribution than the "naive" or unconditional distribution of words. Producing these conditional distributions is the heart of language modeling and the core internal operation of an LLM. A properly trained model can output a distribution like this one for any provided context, or "prompt." As the prompt evolves, so too would the model's assessment of conditional likelihoods.</p> <h3>Generating Text</h3> <p>Now that we understand how individual words can be modeled as probability distributions and even account for context, how do we use this to generate coherent text? This is where sampling comes into play.</p> <p>Sampling from an LLM is similar to drawing values from a simpler probability distribution, with a catch: we don't want to just pick one word; we want to generate an entire sentence or paragraph!
To do this, we sample <em>iteratively</em> from a conditional distribution of words, adding the result of each draw to the context for the next draw.[^inference]</p> <p>[^inference]: The complexity of producing a new probability distribution for every word is why LLM inference is expensive and time-consuming.</p> <p>&lt;Callout color="green"&gt; The LLM nerds among you may notice I haven't mentioned "tokens."</p> <p>In practice, modern LLMs don't work directly with whole words, but rather with tokens that represent groups of characters, including punctuation. There's a variety of reasons for this, including efficiency of encoding and flexibility in handling rare words and misspellings, but the core principles of building and sampling from a distribution remain the same. Anywhere I refer to "words" in this post, you can mentally substitute "tokens" if you prefer. &lt;/Callout&gt;</p> <p>Here is a simple description of the process:</p> <ol> <li>The LLM starts with an initial context (which could be empty or provided by a prompt).</li> <li>Based on this context, it calculates the conditional probability distribution for the next word.</li> <li>It samples a word from this distribution.</li> <li>It adds this word to the context and repeats the process.[^end]</li> </ol> <p>[^end]: At some point it decides to stop, but the details of that are way beyond what we're covering here.</p> <p>Let's illustrate this by continuing our previous example with the cat. The initial context is:</p> <p><code>The cat sat on the _____</code></p> <p>Suppose our model samples <code>roof</code> from the distribution we proposed earlier. Now our context becomes:</p> <p><code>The cat sat on the roof _____</code></p> <p>The model would then calculate a new probability distribution for the next word. This distribution might favor words like <code>and</code>, <code>of</code>, or <code>watching</code>. Let's say it chooses <code>of</code>. 
The updated context is:</p> <p><code>The cat sat on the roof of _____</code></p> <p>We repeat the process, computing a new conditional distribution for this context. This time it might heavily favor words like <code>the</code>, <code>a</code>, or <code>her</code>. Each choice influences the next, and so on.</p> <p>Here's what it looks like in practice:</p> <p>&lt;iframe class="w-full" height="500" src="https://www.youtube.com/embed/nPN6OHBIcsc?si=1nDijZOXpdpLosT9" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen&gt;&lt;/iframe&gt;</p> <p>This is really how LLMs work: an iterative process of sampling and updating the context. It's analogous to repeatedly sampling heights from our height distribution, but with each sample influencing the distribution for the next one.</p> <p>However, this approach introduces a significant challenge: compounding errors. Once the model makes a "mistake" or chooses an unlikely word, that choice becomes part of the context for all future words. This can cause the model to veer into increasingly improbable territory, potentially devolving into gibberish after a few words or sentences.</p> <p>Early language models often struggled with this issue. As they started to drift away from highly probable word sequences, they would tip increasingly into a low-probability, high-entropy regime. In a sense, language models are self-reinforcing: the more they favor a certain style, topic, or format, the more likely they are to continue doing so. Conversely, the more they veer into nonsense, the more likely nonsense becomes.</p> <p>This self-reinforcing nature has interesting implications.
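The whole sample-then-extend loop fits in a few lines of Python. Everything here is a stand-in: `TOY_MODEL` is an invented lookup table keyed only on the previous word, whereas a real LLM conditions on the entire context. The sketch also exposes a "temperature" parameter that sharpens or flattens the distribution before sampling.

```python
import random

# A stand-in "model": invented probabilities keyed only on the previous
# word. A real LLM computes a distribution conditioned on the whole context.
TOY_MODEL = {
    "the":  {"cat": 0.6, "roof": 0.4},
    "cat":  {"sat": 1.0},
    "sat":  {"on": 1.0},
    "on":   {"the": 1.0},
    "roof": {"of": 0.5, "and": 0.5},
    "of":   {"the": 1.0},
    "and":  {"the": 1.0},
}

def next_distribution(context: list) -> dict:
    """Compute the conditional distribution for the next word."""
    return TOY_MODEL[context[-1]]

def generate(context: list, n_words: int, temperature: float = 1.0) -> list:
    """Repeatedly sample a word and append it to the context."""
    context = list(context)
    for _ in range(n_words):
        probs = next_distribution(context)
        # Temperature rescales the distribution: values below 1 sharpen it
        # toward the most probable word; values above 1 flatten it.
        weights = [p ** (1.0 / temperature) for p in probs.values()]
        word = random.choices(list(probs), weights=weights, k=1)[0]
        context.append(word)  # the sample becomes part of the new context
    return context

print(" ".join(generate(["the", "cat", "sat", "on", "the"], n_words=4)))
```

Note how each draw feeds straight back into the context, which is exactly why a single unlikely choice can steer everything that follows.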
For instance, once a model outputs a specific idea or format, it can be difficult to tell it to stop doing that.[^bullets] In fact, telling a model <em>NOT</em> to think of something almost always makes it output that very thing. It's the digital equivalent of the classic "don't think of an elephant" thought experiment.</p> <p>[^bullets]: My kingdom for a way to prevent LLMs from resorting to bullet points all the time.</p> <p>The sampling process necessarily introduces an element of randomness, which is crucial for creativity and diversity in the outputs. If the model always chose the most probable word, its outputs would be repetitive and unnatural. The degree of randomness in sampling can be adjusted through a parameter that is often called "temperature":</p> <ul> <li><strong>Low temperature:</strong> The model is more likely to choose high-probability words. This results in more predictable, potentially more coherent, but possibly less creative text.</li> <li><strong>High temperature:</strong> This introduces more randomness, allowing the model to more frequently choose lower-probability words. This can lead to more creative but potentially less coherent outputs.</li> </ul> <p>Modern LLMs have become much better at maintaining coherence over longer stretches of text, thanks to advances in model architecture, training techniques, and the sheer scale of the models. However, the fundamental challenge of compounding errors remains, and it's one of the reasons why LLMs can sometimes produce outputs that start strong but become increasingly nonsensical or off-topic as they continue.</p> <h3>From Chance to Chat</h3> <p>Chat interfaces have become the dominant way for most people to interact with LLMs, capturing the public imagination and showcasing these models' capabilities. But how do we get from generating individual words to engaging in full-fledged conversations? 
The answer lies in cleverly applying the principles we've discussed so far.</p> <p>Here's how it works:</p> <ol> <li>When you start a chat, your initial message becomes the first piece of context.</li> <li>The model generates a response based on this context, just as we described earlier.</li> <li>For your next message, the model doesn't just look at what you've just said. Instead, it considers everything that's been said so far - your initial message, its first response, and your new message.</li> <li>This process repeats for each turn of the conversation. The context grows longer, incorporating each new message and response.</li> </ol> <p>This approach allows the model to maintain consistency and context throughout a conversation. It can refer back to earlier parts of the chat, answer follow-up questions, and generally behave in a way that feels more like a coherent dialogue than isolated text generation.</p> <p>However, this method also introduces some challenges:</p> <ol> <li> <p><strong>Context Length Limits:</strong> LLMs have a maximum amount of text they can process at once (often referred to as the "context window"). For very long conversations, the earliest parts might get cut off when this limit is reached.</p> </li> <li> <p><strong>Computational Cost:</strong> As the conversation grows, generating each new response requires processing more and more text, which can slow down the model's responses and increase computational costs.</p> </li> <li> <p><strong>Consistency vs. Creativity:</strong> The model might become overly constrained by the conversation history, potentially leading to less diverse or creative responses over time.</p> </li> </ol> <p>Despite these challenges, this simple yet effective approach to chat is what powers the conversational AI interfaces we interact with daily. 
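In code, the chat bookkeeping is nothing more than concatenation. This sketch uses a hypothetical `generate_reply` function as a stand-in for an actual model call; the point is how the transcript itself becomes the ever-growing context.

```python
# A sketch of chat as growing context. `generate_reply` is a hypothetical
# stand-in for a real LLM call; here it just reports how much context it saw.
def generate_reply(context: str) -> str:
    return f"(a reply conditioned on {len(context)} characters of context)"

transcript: list = []

def chat(user_message: str) -> str:
    transcript.append(f"User: {user_message}")
    # The model sees the ENTIRE conversation so far, not just the last turn.
    context = "\n".join(transcript)
    reply = generate_reply(context)
    transcript.append(f"Assistant: {reply}")
    return reply

chat("What's a good name for a cat?")
chat("What about for a dog?")  # this call's context includes the first exchange
```

Every challenge listed above falls out of this loop: the context window caps how long `context` can get, and each turn re-processes everything before it.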
By treating the entire conversation as a growing context for probabilistic text generation, LLMs can engage in surprisingly coherent and context-aware dialogues.</p> <h3>The Company Words Keep</h3> <p>We've seen how LLMs generate text by iteratively sampling from probability distributions. But where do these distributions come from? How does the model know which words are likely to follow others?</p> <p>Earlier, we touched briefly on <strong>training,</strong> the exercise of discovering the intricate distributions that allow an LLM to predict the next word with such nuance.</p> <p>My colleague <a href="https://www.linkedin.com/in/adam-azzam">Adam</a> (who is, incidentally, the only person still reading this post) has an excellent way of capturing the intuition behind training:</p> <blockquote> <p>"You know a word by the company it keeps."</p> </blockquote> <p>This means that an LLM's understanding of a word is entirely based on how that word appears in relation to other words. Surprisingly, at no time does it learn its definition, etymology, or any other intrinsic property in an explicit way.[^dictionary-training] The goal of training is to build a sophisticated model of these latent relationships to make accurate predictions about which words are likely to appear next.</p> <p>[^dictionary-training]: It's quite likely that a dictionary would be included in a model's training data. However, it would not receive any special attention or processing, though of course the close proximity of a word and its dictionary definition would result in a much stronger relationship between the two.</p> <p>To illustrate this principle in a simple sense, consider the word "bank." In isolation, it could refer to a financial institution or the side of a river. 
During training, the model might encounter sentences like:</p> <ol> <li>"He deposited money in the <em>bank</em>."</li> <li>"The river overflowed its <em>banks</em> after heavy rain."</li> <li>"The <em>bank</em> approved her loan application."</li> <li>"We had a picnic on the grassy <em>bank</em> by the stream."</li> </ol> <p>How can an LLM learn to distinguish between these meanings? Well, pretty much the same way you do.</p> <p>Over billions of examples, the model builds a nuanced understanding of how the word "bank" relates to other words. It learns that when "bank" appears near words like "money," "deposit," or "loan," it's likely referring to a financial institution. When it's near words like "river," "stream," or "grassy," it's more likely referring to a riverside. This understanding is encoded in the parameters of the model's implicit probability distribution, and those parameters are often referred to as "weights."</p> <p>This "company it keeps" principle is crucial. The model doesn't have explicit definitions or rules about what words mean. Instead, it builds a rich, multidimensional model of how words relate to each other in various contexts.</p> <p>The actual mathematics of how training works is beyond the scope of this post (and, probably, most dinners you'll attend). But conceptually, you can think of it as the model adjusting its internal parameters to better predict the next word in a sequence, given all the words that came before it. It does this over and over, for billions of examples, gradually refining its ability to capture the patterns and relationships in language.</p> <p>What emerges from this process is not a set of rules or definitions, but a vast, interconnected web of probabilities. Given any sequence of words, the model can use this web to calculate the probability distribution of what might come next. 
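A crude way to see the "company it keeps" principle is to tally co-occurrences over a tiny corpus. This is nothing like real training, which learns far richer relationships over billions of examples, but the intuition is the same: "bank" keeps company with both "money" and "river."

```python
from collections import Counter

# A tiny corpus; real training data is billions of documents.
corpus = [
    "he deposited money in the bank",
    "the river overflowed its bank after heavy rain",
    "the bank approved her loan application",
    "we had a picnic on the grassy bank by the stream",
]

# Count how often each word appears in the same sentence as "bank".
# This crude tally captures the spirit of "you know a word by the
# company it keeps" -- no definitions, just relationships.
company = Counter()
for sentence in corpus:
    words = sentence.split()
    if "bank" in words:
        company.update(w for w in words if w != "bank")

print(company.most_common(5))
```

Even this toy tally separates the financial neighbors ("money", "loan") from the riverside ones ("river", "stream") purely by counting who shows up nearby.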
This is why models require extraordinary amounts of data, compute, and time to train - they're building an incredibly complex probabilistic model of language itself.</p> <p>Understanding training in this way helps explain some of the quirks and limitations of LLMs:</p> <ol> <li><strong>Correlation, not causation:</strong> LLMs excel at recognizing patterns and correlations in language, but they don't understand causality. This is why they can sometimes produce outputs that seem logical but are factually incorrect.</li> <li><strong>Bias in, bias out:</strong> If the training data contains biases or inaccuracies, these will be reflected in the model's outputs. The model doesn't have a way to fact-check its training data.</li> <li><strong>Hallucination:</strong> When asked about topics it hasn't seen much of in its training data, an LLM might generate plausible-sounding but incorrect information. This is because it's trying to produce probable sequences of words based on limited relevant context.</li> <li><strong>Difficulty with explicit rules:</strong> Because LLMs learn implicitly from patterns rather than explicit rules, they can sometimes struggle with tasks that require strict adherence to specific formats or guidelines.</li> </ol> <p>By understanding LLMs as probability engines trained on vast amounts of text data, we can better appreciate both their capabilities and their limitations. This perspective is crucial for using them effectively and responsibly in real-world applications.</p> <h2>Thinking with Probabilities</h2> <p>Now that we understand LLMs as fancy probability engines, let's explore how this perspective can help us use them more effectively. A lot of common LLM techniques are really just clever ways of nudging these probability distributions. 
Here are a few examples of ideas and techniques you may have heard of, and how they are actually all just playing with probability:</p> <p><strong>Talk like a pirate:</strong> It's the classic "hello world" of proving your LLM works: getting it to talk like a pirate. By now you should realize that the model doesn't have a separate "pirate mode" - it's just shifting its word probabilities to favor "Arrr" and "matey" over more standard English.</p> <p><strong>Prompt engineering:</strong> In general, prompt engineering is all about putting the model in a better "probability regime." When we craft a good prompt, we're not just asking a clear question - we're subtly shaping the likelihood of different kinds of responses. This is why prompts that work well for LLMs might look different from how we'd phrase things to a human or even to a search engine.</p> <p><strong>Chain-of-thought:</strong> One of the most powerful techniques in using LLMs is as simple as asking the model to "think step by step." But why does this work? Remember, our LLMs are making probabilistic leaps from input to output. Sometimes, the leap from question to answer is just too big - the correct answer might be <em>logical</em>, but not <em>probable</em> given the input. By asking for step-by-step reasoning, we're allowing the model to make a series of smaller, more probable jumps. Each step flows more naturally from the last, leading to a better final answer.</p> <p><strong>Fine-tuning:</strong> Sometimes, we want to push our models even further in a particular direction. That's where fine-tuning comes in. Fine-tuning is like giving the model a specialized crash course. We start with a model that has broad knowledge (it's seen tons of text on all sorts of topics), and then we show it a bunch of examples in our area of interest.
This nudges the model's entire probability distribution, making it more likely to use certain words or concepts by default.</p> <p><strong>RAG (Retrieval-Augmented Generation):</strong> This powerful technique has a simple but effective idea: before the model generates a response, we fetch some relevant information and add it to the input. This biases the model's output probabilities towards using this specific, relevant information. It's a bit like giving the model a cheat sheet for the particular question you're asking.</p> <p><strong>Translation:</strong> Using statistics and correlative probabilities to model the relationship between words in different languages is not new; in fact, about a decade ago it provided a revolutionary step forward in high-quality machine translation. As considerably more powerful general-purpose models, LLMs inherit this ability to model and predict words across languages. You now know enough to think of this probabilistically: given a sentence and an instruction to translate it, a properly-trained model should determine that the most probable outcome is the translation.</p> <p><strong>Tipping your LLM:</strong> Consider the trick of saying "I'll tip you $20" to an AI assistant. This doesn't work because the model is actually expecting payment. Instead, it's putting the model into a state where it's more likely to produce high-effort, high-quality responses. It's learned that contexts involving rewards often come with expectations of better performance.</p> <p><strong>ReAct agents:</strong> This idea of guiding the model's reasoning process is also behind more complex systems like ReAct agents. These are setups where we give the model a specific format to follow, usually involving steps like "Think, Act, Observe." 
By being precise about what we expect, we make it more likely for the model to use tools effectively or to check its own work.</p> <p><strong>Code generation:</strong> When it comes to generating specific types of content, like code, we can push this idea of biasing probabilities even further. When we tell a model to "write Python code," we're not activating some separate coding module. Instead, we're shifting the model into a state where it's much more likely to produce text that looks like Python - lots of indentation, specific keywords, that sort of thing.</p> <p><strong>Structured output generation:</strong> For highly structured outputs like JSON, some systems even artificially limit which tokens (chunks of text) the model is allowed to produce. This ensures the output follows the correct format, essentially forcing the model to color within the lines we've drawn.</p> <p><strong>Recitation:</strong> When the public first became aware of LLMs, there was a sustained and false belief that the models somehow maintained a copy of the entire internet, which was remixed or regurgitated on demand. Perhaps this was easier for some people to believe than models being capable of synthesizing novel outputs. The most common evidence for this belief was that models could perfectly recite known documents, like the first three paragraphs of Alice in Wonderland. By now, I hope you appreciate that for a sufficiently trained model, this is neither surprising nor particularly impressive. After all, the most probable response to "What are the first three paragraphs of Alice in Wonderland?" is, of course, the first three paragraphs of Alice in Wonderland.</p> <p>All of these techniques, from simple prompt tweaks to complex system designs, are really just ways of playing with probabilities. We're constantly asking ourselves: how can we make the output we want more likely? 
How can we guide the model towards better reasoning, more accurate information, or more useful formats?</p> <h2>Coda</h2> <p>We've journeyed from coin flips to complex language models, all through the lens of probability. I hope that you have developed a solid intuition for how LLMs actually work:</p> <ul> <li>They're built on sophisticated probability distributions of language.</li> <li>They generate text by iteratively sampling from these distributions.</li> <li>Their "knowledge" is really just a vast web of word relationships and correlations.</li> </ul> <p>Understanding LLMs as probability engines rather than knowledge databases is crucial for using them effectively and responsibly. It helps us set realistic expectations, interpret their outputs appropriately, and design better ways of leveraging their capabilities.</p> <p>As we continue to develop and refine these models, keeping this probabilistic perspective in mind will be key. It reminds us that while LLMs are incredibly powerful tools that can revolutionize how we interact with information and solve problems, they're fundamentally playing a very advanced game of "what word comes next?"</p> <p>They're not magic, they're not sentient, and they're definitely not going to rise up and kill us all.</p> <p>Probably.</p> Beyond Reasoning: Anthropic's Agenthttps://jlowin.dev/blog/beyond-reasoning-anthropic-agent/https://jlowin.dev/blog/beyond-reasoning-anthropic-agent/If you give a computer a computer...Tue, 22 Oct 2024 00:00:00 GMT<p>When o1 was released, I <a href="/blog/does-o1-mean-agents-are-dead">wrote</a> that internal reasoning - even iterative reasoning - didn't represent agentic behavior. I defined an agent as requiring iterative interactions with the external world: perceiving the environment, taking actions, observing outcomes, and adjusting accordingly. 
With Anthropic's release of their new "<a href="https://www.anthropic.com/news/3-5-models-and-computer-use">computer use</a>" feature, we're seeing exactly this kind of genuine agency in action.</p> <h2>Why This Is Different</h2> <p>The fundamental difference isn't in the complexity of the tasks or the sophistication of the AI - it's in the presence of a real-world feedback loop. When an AI reasons internally, it can refine its thinking and generate better answers, but it's still operating in a closed system of its own knowledge and patterns. In contrast, Anthropic's agent actually interacts with computer systems, observes the results of its actions, and adjusts its behavior based on what really happened, not what it predicted would happen.</p> <p>This is what makes it a true agent. When operating a computer:</p> <ol> <li>It perceives the environment through screenshots, understanding complex visual interfaces</li> <li>It translates high-level goals into specific actions (mouse movements, keyboard inputs)</li> <li>It observes the results of those actions through new screenshots</li> <li>It adjusts its strategy based on what it learns from those results</li> </ol> <h2>Governing Real-World Agency</h2> <p>This shift from reasoning to real-world agency demands entirely new frameworks for defining and controlling AI behavior. With pure reasoning systems, we could focus on input filtering and output validation. 
But with true agents that learn and adapt through interaction, we need systems that can:</p> <ol> <li>Define acceptable boundaries of exploration - how do we let agents learn from their mistakes without causing harm?</li> <li>Monitor behavioral patterns, not just outputs - when an agent develops a new strategy through real-world interaction, how do we evaluate if it's safe and appropriate?</li> <li>Establish clear lines of responsibility - when an agent makes decisions based on its own observations and learning, who is accountable for the outcomes?</li> </ol> <p>The traditional approach of treating AI systems as deterministic tools breaks down here. We need frameworks that can handle emergent behavior while maintaining meaningful human oversight. This isn't just about safety guardrails - it's about developing new ways to specify goals and expectations for systems that can discover novel approaches to achieving them.</p> <h2>Looking Forward</h2> <p>The development of true AI agents is a watershed moment that demands new thinking about AI governance. We need frameworks that can balance the benefits of autonomous learning and adaptation with the need for predictability and control. This isn't just a technical challenge - it's a fundamental shift in how we think about AI systems and their relationship to the world they operate in.</p> <p>The question isn't whether we <em>should</em> build AI agents - the horse is wayyyy out of the barn. The question is how we develop the systems of governance and control that will let us harness their capabilities safely and effectively. 
This is the next great challenge in AI development.</p> Introducing Colinhttps://jlowin.dev/blog/colin/https://jlowin.dev/blog/colin/A context engine that keeps agent skills fresh.Fri, 23 Jan 2026 00:00:00 GMT<p>import Callout from "@components/blog/Callout.astro";</p> <p>&lt;Callout color="blue"&gt; <strong>tl;dr</strong> Colin is an experimental context engine that can load dynamic information into agent skills and keep them fresh. Give it a star on <a href="https://github.com/PrefectHQ/colin">GitHub</a>! &lt;/Callout&gt;</p> <p>Context goes stale.</p> <p>This is an increasingly serious problem for anyone building with agents, and it manifests in a few different ways. Stale context can mean:</p> <ul> <li><strong>The information is out of date</strong>: a skill that was accurate when written but hasn't been touched since.</li> <li><strong>The information is unavailable</strong>: a conversation was compacted and details didn't survive the summary.</li> <li><strong>The information is siloed</strong>: it exists, but in a different conversation, a different chat window, or a different day.</li> </ul> <p>There are two major standards for delivering context to agents today: <a href="https://modelcontextprotocol.io/">MCP</a> and <a href="https://agentskills.io/">agent skills</a>. They occupy opposite ends of a spectrum, and neither has a particularly good solution to this problem.</p> <p><strong>Skills</strong> are optimized for <em>passive access to static information</em>. Drop a markdown file in a folder and the agent discovers it when relevant. Skills are lightweight, progressively disclosed, and always available.</p> <p><strong>And that's a problem.</strong></p> <p>Skills are markdown, so updating them means editing files by hand. Most of us don't, and so our agents' skills decay, becoming less relevant over time.</p> <p><strong>MCP</strong> is optimized for <em>active retrieval of dynamic information</em>. 
The agent fetches what it needs on demand, so the information is always current.</p> <p><strong>And that's a problem.</strong></p> <p>MCP requires <strong>conversational boilerplate</strong> because every conversation starts from scratch, so the agent has to figure out what it needs, invoke tools, load data, and accumulate knowledge. This is a lot of cycles and tokens spent setting up context that the agent already had yesterday.</p> <p>Wouldn't it be nice to combine the dynamism of MCP—open tickets, customer requests, upcoming meetings, recent PRs—with the passive availability of skills? Then our agents would have a way to consistently access a flow of constantly changing information. No copying and pasting. No waiting for tools to load. No hoping the agent sets up context the same way it did yesterday.</p> <p>We need a way to combine the best of both worlds. And to <a href="https://www.youtube.com/watch?v=w5kBDt6G_h4&amp;t=35s">quote</a> the late Tom Lehrer: I have a modest example here.</p> <hr /> <p><a href="https://github.com/PrefectHQ/colin">Colin</a> is an experimental context engine that keeps agent skills fresh. It works by treating skills as software.</p> <p>Colin combines two major capabilities:</p> <p><strong>A powerful templating engine</strong> that can load information from dynamic sources and (optionally) process it with LLMs. Templates can reference other files, GitHub files and PRs, Notion pages, Linear issues, any MCP server, and more. The templating language is Jinja, extended with providers for dynamic content and filters for LLM processing in order to summarize, classify, and extract information to your editorial specifications.</p> <p><strong>A dependency resolution system</strong> that tracks every reference to dynamic content and forms a resolution graph. When you compile a template, Colin traces that graph, evaluates all the references, and only updates the parts that have actually changed.
Staleness can be content-based (the source changed), time-based (an hour passed), or both. Colin caches the rest (including LLM calls) in order to incrementally materialize your context.</p> <p>Together, these let you write context that ranges from completely static to fully dynamic and LLM-processed, and use Colin to keep it up to date.</p> <p>You can use Colin's output however you want: it's just markdown. But to me, compiling agent skills is the obvious use case because the world has settled on them as the standard way to provide file-based context to agents. Therefore, Colin has first-class support for writing output directly to your skills folder. But the engine is equally happy to produce documentation, reports, configuration, and anything else you need to keep up to date.</p> <p>Here's what a Colin template looks like:</p> <pre><code>---
name: team-status
description: Current state of platform team work
colin:
  cache:
    expires: 1d
---

# Team Status

## In Progress

{% for issue in colin.linear.issues(team='Platform', state='In Progress') %}
- {{ issue.identifier }}: {{ issue.title }} ({{ issue.assignee }})
{% endfor %}

## Summary

{{ ref('team/weekly-notes.md').content | llm_extract('key blockers and priorities') }}
</code></pre> <p>Once compiled, Colin knows how to keep this skill up to date. The <code>ref()</code> creates a dependency on <code>weekly-notes.md</code>. The Linear call creates a dependency on those issues. The <code>cache</code> directive enforces time-based staleness. Colin watches all of it, and recompiles when something changes.</p> <h2>Try It</h2> <p>We just open-sourced Colin. It's experimental, it's going to grow, and I hope you'll have fun with it.
Please give it a star if you think it'll be useful!</p> <ul> <li><strong>Get the code:</strong> <a href="https://github.com/PrefectHQ/colin">github.com/PrefectHQ/colin</a></li> <li><strong>Read the docs:</strong> <a href="https://colin.prefect.io">colin.prefect.io</a></li> <li><strong>Try it out:</strong> <code>pip install colin-py</code></li> </ul> <p>One fun thing: Colin's <a href="https://colin.prefect.io/docs/getting-started/quickstart">quickstart</a> actually compiles <em>itself</em> into a live-updating skill, so any time we update the docs, your agent automatically learns the new features. Ambitious? Yes. But easy? Also yes!</p> <p>Happy context engineering!</p> 10 Years of Real Good Coffeehttps://jlowin.dev/blog/ten-years-of-real-good-coffee/https://jlowin.dev/blog/ten-years-of-real-good-coffee/Just brew it.Fri, 20 Sep 2024 00:00:00 GMT<p>import { Image } from 'astro:assets'; import banner from './banner.png'; import ivyCity from './ivy-city.png'; import lowinBlend from './lowin-blend.png'; import openingDay from './opening-day.png'; import sampleRoaster from './sample-roaster.png'; import sanitizer from './sanitizer.png'; import shelves from './shelves.png';</p> <p>Today is <a href="https://www.compasscoffee.com/">Compass Coffee's</a> 10th birthday.</p> <p>To most people, Compass is a perpetually buzzing, quickly growing chain of cafes in Washington, DC. But as Compass's <strong>"Global Ambassador,"</strong> a title earned through years of cheerful, unpaid labor, I've been fortunate enough to get swept up in an entrepreneurial whirlwind that has been nothing short of extraordinary.</p> <p>The Compass story doesn't unfold for me in a neat, chronological order. Instead, it comes in a dizzying flood of memories, each one a testament to the chaos of helping close friends build something from the ground up. 
One moment, I'm in a tiny basement kitchen, being the guinea pig for the founders' first-ever latte, made with beans produced by their small, sample roasting machine. The next, I'm holding a ladder at 2 AM while we fix the front door lock of a new cafe.</p> <p>&lt;figure&gt; &lt;Image src={sampleRoaster} alt="The first sample roaster." /&gt; &lt;figcaption&gt;Michael with the original sample roaster, 2013&lt;/figcaption&gt; &lt;/figure&gt;</p> <p>There's no rhyme or reason to how I found myself doing what needed to be done. Armed with a titleless business card, I became a chameleon, morphing into whatever Compass needed at any given moment. IT guy troubleshooting internet installation? <em>Check.</em> Impromptu CFO to negotiate a lease? <em>You bet.</em> Fill-in baker producing countless, fluffy biscuits? <em>Somehow, also yes.</em></p> <p>The lines between my life and Compass blurred. I'd wake up in a daze, realizing that once again I'd been "Tom Sawyer-ed" into yet another Compass adventure.[^1] One day we'd be in New York, meeting (and usually rejecting) potential investors. The next we'd be in the Nevada desert, getting a multi-day certification in coffee chemistry. We spent the next year working on the empirical data problem of designing a consistent roast profile that tasted the same in the winter as it did in DC's humid summer.</p> <p>&lt;figure&gt; &lt;Image src={openingDay} alt="Opening day in Shaw, 2014." /&gt; &lt;figcaption&gt;Opening day in Shaw, 2014&lt;/figcaption&gt; &lt;/figure&gt;</p> <p>Most mornings, I'd commute across town to the Shaw cafe, making it my makeshift office before heading home to my "real" job. If I was lucky, I'd get a chance to take orders for an hour. Even my two-year-old son got in on the act, offering customers "normal" or "spicy" water--which, to this day, is how the Lowin and Haft households refer to sparkling water.</p> <p>But it wasn't just about the tasks or the roles. 
It was about being part of something bigger, something growing and evolving at breakneck speed. I felt a surge of pride with each new cafe opening; today there are 20. I watched as Compass products appeared on Whole Foods shelves, launching a wholesale business that today includes many of DC's most popular restaurants. I remember Michael's late-night musings about the possibility of opening a drive-thru, tempered by his fears that it would dilute the personal touch that made Compass special.[^2]</p> <p>Because what truly sets Compass apart isn't just the quality of its coffee—it's the quality of its connections. Michael always emphasized that <strong>a barista's true mission wasn't to serve coffee, but to create regulars</strong>. This philosophy—elevating a transaction into a relationship—is the cornerstone of every great business, whether you're brewing lattes or building software.[^3]</p> <p>&lt;Image src={shelves} alt="Compass shelves" /&gt; The journey wasn't always smooth. When the pandemic hit, I felt the weight of every word in Michael's memo about what it would take for the company to survive. With every cafe forcibly closed, Operation Phoenix was launched by a small, dedicated team and a contract to manufacture hand sanitizer for the city. It's well known that Compass was founded by two Marine officers who brought their work ethic and leadership principles to the coffee business; few times have I seen that matter more than in successfully steering the company through that period.</p> <p>&lt;figure&gt; &lt;Image src={sanitizer} alt="Hand sanitizer" /&gt; &lt;figcaption&gt;Making hand sanitizer, 2020&lt;/figcaption&gt; &lt;/figure&gt;</p> <p>Through it all, Compass gave me more than just a front-row seat to entrepreneurship. It gave me a crash course in the grit, passion, and sheer will it takes to turn a dream into reality.
From that first tiny sample roaster to the soaring glass conveyors of the 50,000-square-foot Ivy City roastery, I've been there, sometimes helping, sometimes cheering, but always in awe of the journey.</p> <p>&lt;figure&gt; &lt;Image src={ivyCity} alt="Ivy City Roastery" /&gt; &lt;figcaption&gt;The Ivy City Roastery, 2024&lt;/figcaption&gt; &lt;/figure&gt;</p> <p>The intersection between Compass and Prefect deepened too. At Prefect's first conference back in 2018, we had no product to show—just 9 gallons of Compass Coffee and a banner promising "we have free coffee and you don't even have to talk to us." That coffee-stained banner, part of which hangs behind my desk, is a tangible reminder of how Compass's story and mine have intertwined. In 2020, we began sending regular shipments of custom Compass tins to all of our employees, investors, and customers; nothing we've ever done has gotten such a positive reaction.</p> <p>&lt;figure&gt; &lt;Image src={banner} alt="Prefect &amp; Compass." /&gt; &lt;figcaption&gt;The 2018 Prefect banner and custom Compass tins&lt;/figcaption&gt; &lt;/figure&gt;</p> <p>Perhaps the most poignant symbol of this journey is a small, brown paper bag with a handwritten label: <em>"Lowin Blend August 2013"</em>. It's a relic I discovered while moving a few years ago, containing the beans from that very first basement roast. Today it sits in the Compass offices as the earliest custom "tin" still in existence. It is a time capsule of dreams, friendship, and what it means to build something that lasts.</p> <p>&lt;figure&gt; &lt;Image src={lowinBlend} alt="Lowin Blend - August 2013" /&gt; &lt;figcaption&gt;The original Lowin Blend and some of its descendants&lt;/figcaption&gt; &lt;/figure&gt;</p> <p>Congratulations, Michael and the entire Compass team!</p> <p>Here's to Real Good Coffee, and a latte great memories.</p> <p>[^1]: Michael has a unique ability to "Tom Sawyer" me into doing work by convincing me it would be fun. 
It's been more than a decade and I still fall for it.</p> <p>[^2]: Compass would finally open its first drive-thru location in 2022.</p> <p>[^3]: Danny Meyer discusses this idea on an <a href="https://joincolossus.com/episode/meyer-the-power-of-hospitality/">episode</a> of Invest Like the Best.</p> The Covid "Thank You" Surgehttps://jlowin.dev/blog/covid-thank-you/https://jlowin.dev/blog/covid-thank-you/A viral trendSat, 05 Oct 2024 00:00:00 GMT<p>Covid caused many statistical anomalies, but my favorite is the spike in searches for "thank you."</p> <p>Talk about a viral trend.</p> <p><img src="./thank-you.jpg" alt="" /></p> Bluesky-Powered Blog Commentshttps://jlowin.dev/blog/bluesky-comments/https://jlowin.dev/blog/bluesky-comments/Threading the needleMon, 25 Nov 2024 00:00:00 GMT<p>I'm a big fan of Bluesky, and I just added comments to this blog by leveraging its open protocol.</p> <p>The core idea was inspired by Emily Liu's <a href="https://emilyliu.me/blog/comments">excellent post</a> and Jade Garafola's <a href="https://blog.jade0x.com/post/adding-bluesky-comments-to-your-astro-blog/">Astro adaptation</a>, and is delightfully simple: instead of maintaining a separate comment system, why not use the conversations already happening on Bluesky?</p> <p>This is particularly compelling for static sites like this one. Static sites are wonderful - they're fast, secure, and incredibly simple to maintain. But they have one major limitation: they're, well, <em>static</em>. Adding dynamic features like comments traditionally meant either using a heavy third-party service or maintaining a separate database and API (defeating the point of being static).</p> <p>Bluesky offers an intriguingly lightweight alternative. Each blog post corresponds to a Bluesky thread, and comments are just replies to that thread. 
The heart of the implementation is remarkably simple - it's just a single API call:</p> <pre><code>// Fetch thread replies from Bluesky's API
const endpoint = `https://api.bsky.app/xrpc/app.bsky.feed.getPostThread?uri=${uri}`;
const response = await fetch(endpoint);
const data = await response.json();
const comments = data.thread?.replies || [];
</code></pre> <p>Everything else - the layout, styling, error handling - becomes an implementation detail. When someone visits your blog, the page fetches replies directly from Bluesky's API. There's no database to manage, no auth system to build, no spam to filter - Bluesky handles all of that.</p> <p>What I love about this approach is how it solves multiple problems at once:</p> <ol> <li><strong>Zero maintenance</strong> - The entire system is serverless and requires no ongoing administration</li> <li><strong>Built-in moderation</strong> - Bluesky's native moderation tools (blocking, muting) automatically apply to your comments</li> <li><strong>Genuine conversations</strong> - Comments aren't siloed on your blog; they're part of the open social network</li> <li><strong>Full portability</strong> - Since comments are just Bluesky posts, they're accessible through the API and can move with you</li> <li><strong>Progressive enhancement</strong> - The blog remains fully static and functional even if Bluesky is down</li> </ol> <p>This perfectly exemplifies the power of open protocols. Instead of building yet another commenting system from scratch, we can compose existing infrastructure in creative ways. The AT Protocol provides the social graph, authentication, moderation, and storage - we just need to pipe the data to where people want to see it.</p> <p>Want to see it in action? This post is connected to [this Bluesky thread]. Reply there and watch your comment appear below! 
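If you'd like to experiment with the thread data outside the browser, the same call is easy to consume from any language. Here's a hedged Python sketch of walking the reply tree; the nesting of `replies` and `post.record.text` mirrors `getPostThread`'s response shape, but the helper and the sample data are illustrative:

```python
# Illustrative sketch: recursively flatten a getPostThread node into
# (depth, text) pairs, suitable for rendering a nested comment list.
def flatten_replies(node, depth=0):
    out = []
    for reply in node.get("replies") or []:
        text = reply.get("post", {}).get("record", {}).get("text", "")
        out.append((depth, text))
        out.extend(flatten_replies(reply, depth + 1))
    return out

# A toy thread shaped like the API's response (illustrative data):
thread = {
    "replies": [
        {
            "post": {"record": {"text": "Great post!"}},
            "replies": [
                {"post": {"record": {"text": "Agreed."}}, "replies": []}
            ],
        }
    ]
}

print(flatten_replies(thread))  # [(0, 'Great post!'), (1, 'Agreed.')]
```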
And if you implement this on your own blog, let me know - I'd love to see how others adapt and improve upon this pattern.</p> <p>I expect we'll see many more examples of this approach as the AT Protocol ecosystem matures. The web is more interesting when it's interconnected, and open protocols are how we get there.</p> The Curse of ChatGPThttps://jlowin.dev/blog/the-curse-of-chatgpt/https://jlowin.dev/blog/the-curse-of-chatgpt/Missing the LLM forest for the chatbot trees.Wed, 18 Sep 2024 00:00:00 GMT<p>I know you've heard it:</p> <blockquote> <p>"Why can't ChatGPT do this?"</p> </blockquote> <p>It's the 2024 equivalent of "Why won't Google do this?" – an absurd query that has long been the shallowest VC litmus test for early-stage ideas.[^1] But this updated question is asked more often, and more seriously, because ChatGPT has become the default benchmark for what's possible in AI.</p> <p>Part of the phenomenon is familiar, if unusual: the near-total conflation of a new technology with a single product implementation. A handful of contemporary examples exist: Google, Photoshop, the iPad, the Walkman, Velcro. But there's something very different about ChatGPT: it is the first time that I can think of where <strong>the underlying technology is evolving faster than the applications built on top of it</strong>.</p> <p>In the AI space, it's the core models doing the disrupting, not the startups. Each new release leapfrogs forward, threatening to obsolete entire application layers. AI startups must not only keep pace with competitors but also adapt to an environment where foundational breakthroughs constantly redefine product strategies.</p> <p>ChatGPT's potency lies in its dual nature. 
It is simultaneously:</p> <ol> <li>a showcase for the state-of-the-art frontier of LLM capabilities</li> <li>a very narrow UX for single-threaded chat</li> </ol> <p>That's an extremely potent combination, and as a result, ChatGPT has become the de facto standard for what an "LLM interface" should be. <strong>And that's a problem, because chat is a truly terrible interface for most AI applications.</strong> Real-world software applications have requirements that don't fit well into a chat interface, even one delivered as an API. They need efficiency, precision, automation, integration, scalability, observability, and reproducibility. I don't want to chat with my {docs, code, toaster, etc.} -- I want to <em>do things</em> with them.</p> <p>But the trouble with this ruthlessly effective combination of technology and interface is that it's created an unusually rigid definition of what "AI" is, and it's hurting innovation. Introducing an effective AI-powered product that <em>isn't</em> chat-based means solving two problems: proving the AI works, and justifying the unfamiliar interface.</p> <p>We need to shift our perspective. LLMs are fundamentally a technology for transforming tokens, not a product in themselves. Instead of inviting users to chat, we should focus on how core LLM operations[^2] can deliver value, then build features around those capabilities. To compete with the ChatGPT standard, prioritize the user experience (or developer experience), not the raw LLM capabilities.</p> <blockquote> <p>Arguably, the most impactful consequence of ChatGPT's success is that LLMs have become a commodity, and the real battleground is the experience of using them.</p> </blockquote> <p>The path forward lies in treating AI like other powerful technologies – as tools to be integrated, not products to be imitated. We don't trumpet that we chose DuckDB (for example); we simply use it to create better software. 
Similarly, AI should enhance our applications without being their focal point.</p> <p>To truly innovate in this space, we must look beyond ChatGPT and see the forest for the trees. By treating AI as the transformative technology it is, rather than a product to be copied, we can unlock its full potential and create applications that genuinely push boundaries.</p> <p>The next time you hear "Why can't ChatGPT do this?" reframe it:</p> <blockquote> <p>"I see how ChatGPT might demo this. How are you going to deliver it to users?"</p> </blockquote> <p>[^1]: VC readers: I'm not talking about you. I'm talking about those <em>other</em> VCs. [^2]: Summarization, extraction, generation, and classification. More on this in a future post.</p> Introducing FastMCP 🚀https://jlowin.dev/blog/introducing-fastmcp/https://jlowin.dev/blog/introducing-fastmcp/Because life's too short for boilerplateSun, 01 Dec 2024 00:00:00 GMT<p>Last week, Anthropic <a href="https://www.anthropic.com/news/model-context-protocol">introduced</a> the Model Context Protocol (MCP), a new standard for connecting AI models to data and tools. Think of it as a universal remote for the internet - a way for AI to safely interact with databases, files, APIs, and internal tools through a common interface.</p> <p>The protocol is powerful but implementing it involves a lot of boilerplate - server setup, protocol handlers, content types, error management. You might spend more time writing infrastructure code than building things the AI can actually use.</p> <p>That's why I built <a href="https://github.com/jlowin/fastmcp">FastMCP</a>. I wanted building an MCP server to feel as natural as writing a Python function. 
Here's how it works:</p> <pre><code>from fastmcp import FastMCP

mcp = FastMCP("File Server")

@mcp.resource("file://{path}")
def read_file(path: str) -&gt; str:
    """Read a file from disk"""
    with open(path) as f:
        return f.read()

@mcp.tool()
def append_to_file(path: str, content: str) -&gt; None:
    """Append content to a file"""
    with open(path, 'a') as f:
        f.write(content)
</code></pre> <p>That's it. No protocol details, no server lifecycle, no content types - just Python functions that define what your AI can do.</p> <h2>Pure Logic, No Boilerplate</h2> <p>Since FastMCP is built around standard Python functions, you can integrate any kind of functionality. Need database access? File operations? API calls? Just write the function that does it:</p> <pre><code>@mcp.tool()
def search_docs(query: str) -&gt; list[str]:
    """Search documentation"""
    results = elastic.search(index="docs", q=query)
    return [hit["_source"]["content"] for hit in results["hits"]["hits"]]

@mcp.resource("profile://{user_id}")
def get_profile(user_id: str) -&gt; str:
    """Get user profile"""
    return get_user_profile(user_id)
</code></pre> <p>Each decorator tells FastMCP how to integrate your function:</p> <ul> <li><strong>Resources</strong> provide data (like schemas or file contents). Think of these like GET endpoints for populating context.</li> <li><strong>Tools</strong> perform actions (like searches or updates). Think of these like POST endpoints for performing actions.</li> <li><strong>Prompts</strong> define templates for common interactions.</li> </ul> <p>Everything is just Python - FastMCP handles the protocol machinery.</p> <h2>Why This Matters</h2> <p>Right now, everyone building AI applications has to write their own integrations from scratch. It's like if every website had to implement its own version of HTTP. 
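Incidentally, the decorator pattern shown above is conceptually simple: each decorator records the function and its metadata in a registry that the protocol layer later serves. A simplified, illustrative sketch (not FastMCP's actual internals):

```python
# Simplified illustration of the decorator-registry idea (not FastMCP's
# real internals): decorators record functions so a server can expose
# them over a shared protocol later.
class MiniMCP:
    def __init__(self, name: str):
        self.name = name
        self.tools = {}       # tool name -> function
        self.resources = {}   # URI template -> function

    def tool(self):
        def register(fn):
            self.tools[fn.__name__] = fn
            return fn
        return register

    def resource(self, uri_template: str):
        def register(fn):
            self.resources[uri_template] = fn
            return fn
        return register

mini = MiniMCP("demo")

@mini.tool()
def add(a: int, b: int) -> int:
    """Add two numbers."""
    return a + b

print(sorted(mini.tools))  # ['add']
```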
MCP provides a standard way for AI models to interact with data and tools, and FastMCP makes it dead simple to implement that standard.</p> <p>Instead of building custom agents or copying data into prompts, you can publish a clean interface that any AI model can use. Want to make your company's data searchable? Create an MCP server. Want to let AI models use your internal tools? MCP server. Want to permit AIs to safely access your product? You get the idea.</p> <p>Think of FastMCP as FastAPI for AI-native APIs - a microframework for building functionality over a standard protocol. I built the initial version in about 24 hours of excited hacking after MCP was announced, but it's quickly grown beyond that. The community has already contributed excellent examples, bug fixes, and feature ideas. If you're interested in making AI integration simpler and more standardized, we'd love to have you join us!</p> <p>Come check out the examples, open an issue, or submit a PR - let's make AI integration feel natural for everyone.</p> <p>Give FastMCP a star <a href="https://github.com/jlowin/fastmcp">on GitHub</a>, and happy engineering!</p> Reflecting on FastMCP at 10k stars 🌟https://jlowin.dev/blog/fastmcp-2-10k-stars/https://jlowin.dev/blog/fastmcp-2-10k-stars/Let's git goingFri, 16 May 2025 00:00:00 GMT<p>import { Image } from 'astro:assets'; import history_img from './history.png';</p> <p>It took <a href="https://github.com/PrefectHQ/prefect">Prefect</a> almost 4 years to reach 10,000 GitHub stars.</p> <p>It took <a href="https://github.com/jlowin/fastmcp">FastMCP</a> about 6 weeks.[^1]</p> <p>&lt;Image src={history_img} alt="FastMCP History" class="mx-auto" width="600"/&gt;</p> <p>FastMCP is the fastest-growing open-source project I've ever been a part of. 
At this point, factoring in FastMCP 1.0's inclusion in the official MCP SDK, it's at the heart of almost every Python MCP server.</p> <p>But whereas Prefect's growth came from providing an excellent developer experience in a domain that users traditionally hate, FastMCP's growth has come from providing an excellent developer experience in a domain that's exploding in popularity. It's hard not to be reminded of <a href="https://x.com/patrick_oshag">Patrick O'Shaughnessy's</a> clear instruction: "Just build something people want."</p> <p>But like Prefect, I didn't build FastMCP because <em>people</em> wanted it. I built it because <em>I</em> wanted it.</p> <p>When it was introduced last year, the Model Context Protocol (MCP) seemed like a really interesting idea... but it was beyond cumbersome to interact with. FastMCP 1.0 aimed to simplify, dare I say make pleasant, the experience of building MCP servers. It was so effective that Anthropic adopted it as the reference implementation for the official MCP SDK.</p> <p>As MCP has gotten swept up in hype in the last month, the rough edges around the young protocol have become even more apparent. The core team is trying to rapidly satisfy the community demand for <strong>MORE</strong> and the naysayer demand for <strong>WHY</strong>. There's confusion about what needs to be implemented rather than adopted, for example with auth. There are questions about transports, like whether to use SSE or "streamable" HTTP, both of which seem like overkill for the most common use cases. And there is, of course, the overarching objection: What was wrong with regular old APIs?</p> <p>MCP is the poster child for tech that's <em>useful</em> beating tech that's <em>perfect</em>. 
I have my own opinions on the protocol (tldr: I like standards, I'm excited about the second-order features that go well beyond request/response, I really wish the reference SDK wasn't being built by committee) but I'm proud that FastMCP played such an integral role in achieving that <strong>approachable utility</strong>.</p> <p>So what's next?</p> <p>Historically, FastMCP (1.0) focused on merely providing a pleasant DX over the low-level SDK. With <strong>2.0</strong>, we're providing a full ecosystem of servers, clients, and tooling. It is still hard to stand up a fully authenticated remote MCP server... and it's even harder to do anything with it. FastMCP 2.0's headline features like server proxying and composition only scratch the surface of how we can build an integrated, LLM-accessible contextual landscape.</p> <p>See you at 20k.</p> <p>[^1]: Well, 6 weeks from re-launching the project as FastMCP 2.0; 6 months from the first 1.0 commit.</p> FastMCP 2.11: AuthKit + New OpenAPI Parserhttps://jlowin.dev/blog/fastmcp-2-11-auth-openapi/https://jlowin.dev/blog/fastmcp-2-11-auth-openapi/Auth to a good startFri, 01 Aug 2025 00:00:00 GMT<p>import Callout from "@components/blog/Callout.astro";</p> <p>Here's a dirty secret about MCP servers: almost nobody implements authentication. Not because they don't want to—because it's genuinely impossible.</p> <p>Look at the "simple" auth examples in the official MCP repository. They're 300+ lines of OAuth boilerplate, token validation, error handling, and security edge cases that would make a seasoned backend engineer weep. The message is clear: if you want auth, you're on your own. 
Good luck figuring out PKCE flows, refresh token rotation, and session management while you're trying to build your actual business logic.</p> <p><strong>We worked with our partners at WorkOS to change that.</strong></p> <p>FastMCP 2.11 introduces our first major step into enterprise-ready authentication, reducing those 300 lines to essentially a one-liner. It also includes a completely rewritten OpenAPI parser, delivering on the promise of using REST APIs as a thoughtful starting point for MCP servers.</p> <p>&lt;Callout color="blue"&gt; <a href="https://github.com/jlowin/fastmcp">Give us a star on GitHub</a> or check out the updated docs at <a href="https://gofastmcp.com">gofastmcp.com</a>. &lt;/Callout&gt;</p> <h2>Authentication: The Problem Nobody Talks About</h2> <p>I've spent months watching developers in the MCP community hit the same wall. They build a brilliant MCP server with amazing tools and resources. They're ready to deploy to production. Then they ask: "How do I add authentication?"</p> <p>Silence.</p> <p>Production-grade auth is a specialized domain. OAuth 2.1, JWT validation, PKCE—these are table stakes for handling user data, and getting them wrong can sink your application. Developers have been forced to choose: ship without auth (unacceptable) or spend weeks becoming security experts (a distraction, to say the least).</p> <h2>Enter WorkOS: DCR-Compliant Auth as a Service</h2> <p>Our partnership with WorkOS emerged from a simple observation: the best auth is auth you don't have to implement yourself. WorkOS AuthKit provides enterprise-grade authentication that is fully compliant with the MCP specification's requirement for Dynamic Client Registration (DCR).</p> <p>Working together, we've built authentication directly into FastMCP's core. 
The result is plug-and-play auth that doesn't require you to become a security expert overnight:</p> <pre><code>from fastmcp import FastMCP
from fastmcp.server.auth.providers.workos import AuthKitProvider

mcp = FastMCP(
    name="SecureServer",
    auth=AuthKitProvider(
        authkit_domain="your_authkit_domain",
        base_url="http://localhost:8000", # your server's URL
    )
)

@mcp.tool
def sensitive_operation():
    """This tool now requires authentication."""
    return "Only authenticated users can call this"
</code></pre> <p>Behind this simple interface, FastMCP handles token validation, user session management, and all the OAuth complexity, cleanly rejecting unauthorized requests before they reach your business logic.</p> <p>For teams that need custom authentication flows, we've also introduced the <code>TokenVerifier</code> protocol—a clean interface for implementing your own auth logic while still leveraging FastMCP's built-in security patterns.</p> <h2>OpenAPI: The Redemption Arc</h2> <p>A few months ago, I wrote a post titled <a href="/blog/stop-converting-rest-apis-to-mcp">"Stop Converting Your REST APIs to MCP"</a>. I stand by that advice—blindly wrapping a massive REST API will poison your agent with context pollution and atomic operations.</p> <p>But I was being a little tongue-in-cheek. The truth is, REST APIs are <em>fantastic</em> starting points for MCP servers. They provide working endpoints, real business logic, and documented interfaces. The problem was never the APIs themselves—it was our tooling for converting them thoughtfully.</p> <p>FastMCP 2.11 includes a completely rewritten OpenAPI parser that addresses the core issues:</p> <p><strong>Performance</strong>: The new parser uses single-pass schema processing with optimized memory usage. What used to take minutes now takes seconds, even for massive specs.</p> <p><strong>Maintainability</strong>: The old parser had become a maintenance nightmare with edge cases and special handling scattered throughout. 
The new architecture is clean, extensible, and actually understandable.</p> <p><strong>Thoughtful Defaults</strong>: Instead of blindly converting every endpoint, the new parser makes intelligent decisions about what tools make sense for agents, while still giving you full control to customize the conversion.</p> <p>The new parser is experimental and disabled by default, but you can enable it with:</p> <pre><code>export FASTMCP_EXPERIMENTAL_ENABLE_NEW_OPENAPI_PARSER=1 </code></pre> <p>We're being thoughtful about the rollout because we know teams depend on the existing behavior. But early testing shows dramatic improvements in both performance and output quality.</p> <h2>Context State: Memory for Your Tools</h2> <p>One more thing: FastMCP 2.11 introduces persistent state management across tool calls. This seemingly simple feature unlocks powerful new patterns for multi-step agent workflows:</p> <pre><code>from fastmcp import FastMCP, Context

mcp = FastMCP()

@mcp.tool
def start_analysis(ctx: Context, dataset_id: str):
    """Begin analyzing a dataset."""
    ctx.state["analysis_id"] = f"analysis_{dataset_id}"
    ctx.state["progress"] = 0
    return f"Started analysis {ctx.state['analysis_id']}"

@mcp.tool
def check_analysis_progress(ctx: Context):
    """Check the progress of the current analysis."""
    if "analysis_id" not in ctx.state:
        return "No analysis in progress"
    return f"Analysis {ctx.state['analysis_id']} is {ctx.state['progress']}% complete"
</code></pre> <p>The state persists across tool calls within the same session, giving your agents memory and the ability to maintain context across complex, multi-step operations. 
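Conceptually, this is just a session-scoped dictionary. An illustrative plain-Python sketch (not FastMCP's implementation) shows the key property: state survives across calls within one session but never leaks into another:

```python
# Illustrative sketch of session-scoped state (not FastMCP's implementation):
# each session carries its own dict, so tools share memory within a session
# but never across sessions.
class Session:
    def __init__(self):
        self.state = {}

def start_analysis(ctx: Session, dataset_id: str) -> str:
    ctx.state["analysis_id"] = f"analysis_{dataset_id}"
    ctx.state["progress"] = 0
    return f"Started analysis {ctx.state['analysis_id']}"

def check_progress(ctx: Session) -> str:
    if "analysis_id" not in ctx.state:
        return "No analysis in progress"
    return f"Analysis {ctx.state['analysis_id']} is {ctx.state['progress']}% complete"

ctx_a, ctx_b = Session(), Session()
start_analysis(ctx_a, "sales_2024")
print(check_progress(ctx_a))  # the same session sees the stored state
print(check_progress(ctx_b))  # a different session sees nothing
```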
Note that state is only persisted for the duration of the session on the in-memory context object!</p> <p>Happy engineering!</p> FastMCP 2.12: Easy Enterprise Authhttps://jlowin.dev/blog/fastmcp-2-12/https://jlowin.dev/blog/fastmcp-2-12/Auth to the racesWed, 03 Sep 2025 00:00:00 GMT<p>import Callout from "@components/blog/Callout.astro";</p> <p>We're excited to announce the release of <a href="https://github.com/jlowin/fastmcp/releases/tag/v2.12.0">FastMCP 2.12</a>, a major update that reflects a pivotal moment for the MCP ecosystem. As more developers move their servers from local experiments to production services, the community's needs have evolved. This release may be our largest and most ambitious yet, designed to provide the production-grade tooling this maturing ecosystem demands.</p> <p>The scope of this release is a direct result of our growing community. To help steer the project, I'm also thrilled to <a href="https://www.jlowin.dev/blog/fastmcp-bill-easton">welcome Bill Easton</a> to the core team as our first external maintainer. Bill's vision has been instrumental in shaping FastMCP, and this release includes several of his key contributions.</p> <p>&lt;Callout color="blue"&gt; <a href="https://github.com/jlowin/fastmcp">Give us a star on GitHub</a> or check out the updated docs at <a href="https://gofastmcp.com">gofastmcp.com</a>. &lt;/Callout&gt;</p> <h3>Easy OAuth Integrations</h3> <p>The MCP specification requires servers to use OAuth 2.1 with <strong>Dynamic Client Registration (DCR)</strong>, a modern standard where clients can register themselves automatically.[^1] In FastMCP 2.11, we shipped a fully DCR-compliant <a href="https://gofastmcp.com/integrations/authkit">solution</a> with our partners at WorkOS, using their excellent AuthKit product.</p> <p>However, we recognize the reality that many large enterprises rely on identity providers like GitHub, Google, or Azure that do not support DCR (yet?). 
This leaves a critical gap for teams who need to integrate MCP with their existing, battle-tested identity infrastructure.</p> <p>FastMCP 2.12 closes that gap with the new <strong>OAuth Proxy</strong> interface. The proxy acts as a bridge, allowing your server to present a fully DCR-compliant interface to MCP clients, while seamlessly managing traditional OAuth flows with your non-DCR identity provider.</p> <p>What was once a complex, multi-hundred-line integration is now a few lines of configuration. The main OAuthProxy is quite configurable, and we've also shipped built-in support for the most requested providers:</p> <ul> <li><strong><a href="https://gofastmcp.com/integrations/github">GitHub</a></strong></li> <li><strong><a href="https://gofastmcp.com/integrations/google">Google</a></strong></li> <li><strong><a href="https://gofastmcp.com/integrations/azure">Azure</a></strong></li> <li><strong><a href="https://gofastmcp.com/integrations/workos">WorkOS</a></strong></li> </ul> <p>This means you can add enterprise-grade authentication to your MCP server in seconds, not weeks.</p> <p>For example, here's how quickly you can add GitHub authentication to your server (assuming you have a GitHub OAuth app configured):</p> <pre><code>from fastmcp import FastMCP
from fastmcp.server.auth.providers.github import GitHubProvider

auth_provider = GitHubProvider(
    client_id="your_client_id",
    client_secret="your_client_secret",
    base_url="http://localhost:8000", # your server's URL
)

mcp = FastMCP(name="GitHub Secured MCP", auth=auth_provider)
</code></pre> <p>To learn more, please see the new <a href="https://gofastmcp.com/servers/auth/oauth-proxy">OAuth Proxy documentation</a>.</p> <p>&lt;Callout color="green"&gt; We're especially grateful to the community members who helped us test and refine this feature. It's rapidly improving as we collect feedback about production environments. 
&lt;/Callout&gt;</p> <h3>A Blueprint for Deployment</h3> <p>With the launch of <strong><a href="https://fastmcp.cloud">FastMCP Cloud</a></strong>, our mission is to make deploying an MCP server as easy as building one. To bring that same simplicity and portability to everyone, we're introducing a standard way to describe a server deployment: the <code>fastmcp.json</code> file.</p> <p>This declarative manifest is the single source of truth for your server, defining:</p> <ul> <li><strong>Source (<code>WHERE</code>):</strong> The location of your server code.</li> <li><strong>Environment (<code>WHAT</code>):</strong> Its Python version and dependencies.</li> <li><strong>Deployment (<code>HOW</code>):</strong> Its runtime configuration, like transport and port.</li> </ul> <p>For example:</p> <pre><code>{
  "$schema": "https://gofastmcp.com/public/schemas/fastmcp.json/v1.json",
  "source": {
    "path": "server.py",
    "entrypoint": "mcp"
  },
  "environment": {
    "python": "&gt;=3.10",
    "dependencies": ["pandas", "requests"]
  },
  "deployment": {
    "transport": "http",
    "port": 8000
  }
}
</code></pre> <p>This is the foundation for a future of truly portable MCP server definitions. While FastMCP Cloud uses a separate manifest today, you can expect it to adopt <code>fastmcp.json</code> in the near future, enabling validated, one-click deployments with all dependencies correctly managed. We also anticipate support for new sources and environments.</p> <p>Today, you can use <code>fastmcp run fastmcp.json</code> to run your server with all dependencies and your preferred transport from the command line, with no additional configuration required. 
CLI arguments are respected as configuration overrides.</p> <p>For full details, please see the <a href="https://gofastmcp.com/deployment/server-configuration">server configuration documentation</a>.</p> <p>&lt;Callout color="gray"&gt; Please note: this is a server-side analogue to the popular <code>mcp.json</code> configuration file, not an alternative to it. <code>mcp.json</code> tells an MCP client how to connect to a specific server; <code>fastmcp.json</code> is a declarative deployment configuration for running an MCP server. &lt;/Callout&gt;</p> <h3>Solving MCP's Chicken-and-Egg Problem</h3> <p>MCP has many advanced features like "sampling", in which a server can ask the client's LLM to perform a task. However, these features require support from both servers and clients and consequently face a classic chicken-and-egg problem: server authors won't implement the feature if clients don't support it, and vice-versa.</p> <p>Thanks to a fantastic contribution from our new maintainer, <strong>Bill Easton</strong>, FastMCP is breaking this cycle. Server authors can now define <strong>fallback sampling handlers</strong>. If a client doesn't support sampling, FastMCP uses a server-side completions API to fulfill the request. This lets you build sophisticated tools with advanced MCP features <em>today</em>, knowing they will work for all clients and helping push the entire ecosystem forward.</p> <p>For more information, please see the new <a href="https://gofastmcp.com/clients/sampling#sampling-fallback">Sampling Fallbacks documentation</a>. 
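The fallback logic itself is easy to picture in plain Python. This is an illustrative sketch of the dispatch idea, not FastMCP's actual API; `server_llm` stands in for a server-side completions call:

```python
# Illustrative sketch of the fallback pattern (not FastMCP's actual API):
# prefer the client's sampling capability, otherwise route the request to
# a server-side handler so the feature works for every client.
def sample(prompt, client_handler=None, fallback_handler=None):
    if client_handler is not None:
        return client_handler(prompt)    # client's LLM handles the request
    if fallback_handler is not None:
        return fallback_handler(prompt)  # server-side completions API
    raise RuntimeError("Client lacks sampling support and no fallback is set")

# Stand-in for a server-side completions call (hypothetical helper):
server_llm = lambda prompt: f"[server completion for: {prompt}]"

# A client without sampling support still gets an answer via the fallback:
print(sample("Summarize this diff", fallback_handler=server_llm))
```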
FastMCP 2.12 includes an experimental OpenAI sampling handler, with more coming.</p> <hr /> <p>All of these features—enterprise-grade auth, declarative deployments, and ecosystem-aware fallbacks—represent FastMCP's commitment to building a robust, production-ready framework for the entire MCP community.</p> <p>&lt;Callout color="gray"&gt;</p> <ul> <li><strong>Upgrade:</strong> <code>uv add fastmcp</code> or <code>pip install fastmcp --upgrade</code></li> <li><strong>Explore:</strong> Dig into the new <a href="/servers/auth/authentication">Authentication</a> and <a href="/deployment/server-configuration">Project Configuration</a> documentation.</li> <li><strong>Contribute:</strong> Check out the code and examples on <a href="https://github.com/jlowin/fastmcp">GitHub</a>. &lt;/Callout&gt;</li> </ul> <p>Happy engineering!</p> <p>[^1]: Technically, servers don't <em>have</em> to support DCR, but then they must provide alternative ways for clients to authenticate that require much more complex configuration or client control. See <a href="https://modelcontextprotocol.io/specification/2025-06-18/basic/authorization#dynamic-client-registration">https://modelcontextprotocol.io/specification/2025-06-18/basic/authorization#dynamic-client-registration</a> for details.</p> FastMCP 2.13: Storage, Security, and Scalehttps://jlowin.dev/blog/fastmcp-2-13/https://jlowin.dev/blog/fastmcp-2-13/Cache me if you canSat, 01 Nov 2025 00:00:00 GMT<p>import Callout from "@components/blog/Callout.astro"; import { Image } from 'astro:assets'; import downloads_img from './downloads.png'; import stars_img from './star-history.png';</p> <p>When we <a href="/blog/fastmcp-2-12">shipped</a> FastMCP 2.12 with its new <a href="https://gofastmcp.com/servers/auth/oauth-proxy">OAuth proxy</a> on August 31st, something remarkable happened. Downloads exploded from 200,000 to a peak of <strong>1.25 million a day</strong>. 
The proxy, which bridges MCP's modern DCR requirement with enterprise identity providers like Google and Azure, clearly hit a nerve. In fact, last week FastMCP surpassed the official MCP SDK in GitHub stars, a validation of the community's demand for high-level, production-ready tooling.</p> <p>&lt;div class="grid grid-cols-1 md:grid-cols-2 gap-4 my-8"&gt; &lt;Image src={downloads_img} alt="FastMCP Downloads" /&gt; &lt;Image src={stars_img} alt="FastMCP Star History" /&gt; &lt;/div&gt;</p> <p>With that kind of scale, you get a lot of feedback, fast.</p> <p>This is the world <strong>FastMCP 2.13</strong> was built for. It is one of our largest releases, focused entirely on the infrastructure required for production MCP servers: persistent storage, battle-tested security, and performance optimizations.</p> <p>&lt;Callout color="blue"&gt; <a href="https://github.com/jlowin/fastmcp">Star FastMCP on GitHub</a> or check out the updated docs at <a href="https://gofastmcp.com">gofastmcp.com</a>. &lt;/Callout&gt;</p> <h3>Battle-Tested Authentication</h3> <p>The massive adoption of the OAuth proxy meant the community immediately started battle-testing our auth implementation in real-world scenarios. We learned our original Azure provider only worked in the narrowest of cases; intrepid users helped us build a far more robust version. 
Others contributed a variety of new providers, with the result being that FastMCP now supports out-of-the-box authentication with:</p> <ul> <li><a href="https://gofastmcp.com/integrations/workos">WorkOS</a> and <a href="https://gofastmcp.com/integrations/authkit">AuthKit</a></li> <li><a href="https://gofastmcp.com/integrations/github">GitHub</a></li> <li><a href="https://gofastmcp.com/integrations/google">Google</a></li> <li><a href="https://gofastmcp.com/integrations/azure">Azure</a> (Entra ID)</li> <li><a href="https://gofastmcp.com/integrations/aws-cognito">AWS Cognito</a></li> <li><a href="https://gofastmcp.com/integrations/auth0">Auth0</a></li> <li><a href="https://gofastmcp.com/integrations/descope">Descope</a></li> <li><a href="https://gofastmcp.com/integrations/scalekit">Scalekit</a></li> <li><a href="https://gofastmcp.com/servers/auth/token-verification#jwt-token-verification">JWTs</a></li> <li><a href="https://gofastmcp.com/servers/auth/token-verification#token-introspection-protocol">RFC 7662 token introspection</a></li> </ul> <p>And we're working with Supabase to add support for their new identity provider.</p> <p>More critically, I owe a huge thanks to MCP Core Committee member <strong><a href="https://den.dev">Den Delimarsky</a></strong> for responsibly disclosing two nuanced, MCP-specific vulnerabilities: a confused deputy attack and a related token security boundary issue. The fixes required some novel solutions, including having the proxy issue its own tokens and implementing a new consent screen for explicit client approval. 
Our OAuth implementation is now hardened, spec-compliant, and thanks to the community's scrutiny, ready for production.</p> <p>You can learn more about confused deputy attacks from an <a href="https://den.dev/blog/mcp-confused-deputy-api-management/">excellent post</a> on Den's blog, and I'll write a post on FastMCP's specific implementation soon.</p> <h3>First-Class State Management</h3> <p>The rapid evolution of our auth stack highlighted a critical need: a robust way to manage persistent state. OAuth proxies need to store encrypted tokens and session data to survive restarts and work in distributed deployments.</p> <p>To solve this, FastMCP maintainer <strong><a href="https://www.linkedin.com/in/williamseaston/">Bill Easton</a></strong> built <a href="https://github.com/strawgate/py-key-value">py-key-value</a>. This fantastic library is something I've long wished for in the Python ecosystem: a clean key-value store with portable backend support. Its real genius is the composable wrapper system that lets you layer encryption, TTLs, and caching onto <em>any</em> backend, from a local filesystem to Redis or Elasticsearch.</p> <p>It's so good, we've baked it into FastMCP's core. In 2.13, persistent storage is now built-in and enabled by default where appropriate, providing the foundation for stateful, production-ready MCP applications.</p> <h3>A Raft of Other Improvements</h3> <p>Beyond the headlines, this release is packed with features and fixes that came directly from community feedback:</p> <ul> <li><strong>Response Caching:</strong> The new <code>ResponseCachingMiddleware</code> provides an instant performance win for expensive, repeated tool and resource calls.</li> <li><strong>Server Lifespans:</strong> We fixed a long-standing point of confusion in the MCP SDK. <code>lifespan</code> now correctly refers to the <em>server</em> lifecycle (for things like DB connections), not the client session. 
This is a breaking change, but it's the correct one.</li> <li><strong>Pydantic Validation:</strong> We now use Pydantic for input validation, avoiding the SDK's overly-strict JSON Schema enforcement. This more flexible approach is familiar to Python developers and more forgiving of LLMs that might send an integer as a string.</li> <li><strong>Richer Context:</strong> The <code>Context</code> API has been expanded, allowing your tools and resources to interact with other MCP functionality from inside their own execution.</li> </ul> <h3>What's Next</h3> <p>FastMCP 2.13 marks the framework's evolution into a production-ready platform. It includes work from <strong>20 new contributors</strong>, and it's their production feedback that made these improvements possible. Thank you.</p> <p>Looking ahead, our next major release, FastMCP 2.14, will be our first to remove deprecated features since launching 2.0. This is a sign of maturity: we're cleaning up the API and solidifying the foundation for the long term.</p> <p>Happy engineering!</p> <p>&lt;Callout color="gray"&gt;</p> <ul> <li><strong>Upgrade:</strong> <code>uv add fastmcp</code> or <code>pip install fastmcp --upgrade</code></li> <li><strong>Explore:</strong> Check out the new <a href="https://gofastmcp.com/servers/storage-backends">Storage</a> and <a href="https://gofastmcp.com/servers/auth/oauth-proxy">OAuth</a> documentation.</li> <li><strong>Contribute:</strong> Check out the code and examples on <a href="https://github.com/jlowin/fastmcp">GitHub</a>. 
&lt;/Callout&gt;</li> </ul> Now Streaming: FastMCP 2.3 | https://jlowin.dev/blog/fastmcp-2-3-streamable-http/ | The most-requested feature is finally here. | Thu, 08 May 2025 00:00:00 GMT<p>import Callout from "@components/blog/Callout.astro";</p> <p><strong>FastMCP 2.3</strong> was just released, including the most-requested feature <em>by a mile</em>: Streamable HTTP for both FastMCP servers and clients.</p> <p>&lt;Callout color="blue"&gt;</p> <p><a href="https://github.com/jlowin/fastmcp">Give us a star on GitHub</a> or dive into the updated docs at <a href="https://gofastmcp.com">gofastmcp.com</a>.</p> <p>&lt;/Callout&gt;</p> <p>Until now, if you wanted to run your FastMCP server over the web, Server-Sent Events (SSE) was the primary option. While SSE works, it has drawbacks including the need for long-lived, stateful connections and a complex interchange across multiple routes. In addition, not all hosting infrastructure is compatible with SSE. Streamable HTTP is a more modern, efficient way to handle the back-and-forth of an MCP session, all neatly wrapped in familiar HTTP.</p> <p>In fact, Streamable HTTP is so important for the MCP ecosystem that it's now the default HTTP transport for FastMCP.</p> <p>To get started, just tell <code>mcp.run()</code> to use the new <code>"streamable-http"</code> transport in your server script.
You can optionally customize the <code>host</code>, <code>port</code>, or mount <code>path</code>, as needed:</p> <pre><code># my_server.py
from fastmcp import FastMCP

mcp = FastMCP(name="MyStreamingServer")

@mcp.tool()
def echo(message: str) -&gt; str:
    return f"Server echoes: {message}"

if __name__ == "__main__":
    mcp.run(
        transport="streamable-http",
        host="127.0.0.1",  # Optional: defaults to 127.0.0.1
        port=8000,         # Optional: defaults to 8000
        path="/mcp",       # Optional: defaults to /mcp
    )
</code></pre> <p>Run this file with <code>python my_server.py</code>, and your server will start listening for Streamable HTTP connections at <code>http://127.0.0.1:8000/mcp</code>. You can see more about configuring the server in the <a href="https://gofastmcp.com/deployment/running-server#streamable-http">deployment docs</a>.</p> <p>Connecting your FastMCP client is even simpler. If your server is running on Streamable HTTP, just provide the URL to the client and FastMCP will automatically attempt to connect with the appropriate transport:</p> <pre><code>import asyncio

from fastmcp import Client

async def main():
    # FastMCP 2.3 will automatically infer Streamable HTTP for this URL
    client = Client("http://127.0.0.1:8000/mcp")
    async with client:
        await client.ping()
        print("Ping successful!")

        result = await client.call_tool("echo", {"message": "Hello Stream!"})
        print(result[0].text)

if __name__ == "__main__":
    asyncio.run(main())
</code></pre> <p>More details can be found in the <a href="https://gofastmcp.com/clients/transports">client transports documentation</a>.</p> <p>The Model Context Protocol (MCP) is all about standardizing how AI models interact with tools and data, and with Streamable HTTP, FastMCP makes it even easier to build and deploy those crucial interaction points on the web. I'm excited to see what you build with these new capabilities.
As always, your feedback, issues, and contributions are welcome!</p> <p>&lt;Callout color="gray"&gt;</p> <p>Give FastMCP 2.3 a try:</p> <ul> <li>Upgrade: <code>uv add fastmcp</code> or <code>pip install fastmcp --upgrade</code></li> <li>Explore the <a href="https://gofastmcp.com/deployment/running-server#streamable-http">documentation on deploying Streamable HTTP</a></li> <li>Check out the code and examples on <a href="https://github.com/jlowin/fastmcp">GitHub</a></li> </ul> <p>&lt;/Callout&gt;</p> <p>Happy Streaming! 🌊</p> Blast Auth with FastMCP 2.6 | https://jlowin.dev/blog/fastmcp-2-6/ | Real-world authentication for MCP servers and clients | Mon, 02 Jun 2025 00:00:00 GMT<p>import Callout from "@components/blog/Callout.astro";</p> <p>FastMCP's journey continues at a thrilling pace. Since my <a href="/blog/fastmcp-2-3-streamable-http">last update on Streamable HTTP</a>, we've rolled out <a href="https://github.com/jlowin/fastmcp/releases/tag/v2.4.0">version 2.4 ("Config and Conquer")</a> to simplify MCP client configuration, and <a href="https://github.com/jlowin/fastmcp/releases/tag/v2.5.0">version 2.5 ("Route Awakening")</a> with powerful new tools for OpenAPI generation.</p> <p>As the team and community around FastMCP grow, so does our commitment to rapid iteration. Our mission is clear: <strong>deliver the simplest path to production</strong> in the MCP ecosystem. This means shipping developer-friendly features that drive real-world adoption and partnering with best-in-class providers to make distribution effortless. Much more on that front coming soon!</p> <p>Today, I'm incredibly excited to announce <strong>FastMCP 2.6 ("Blast Auth")</strong>. This release is a game-changer because it tackles a critical need that has, almost overnight, become paramount: authentication for remote MCP servers.</p> <p>The timing here is no accident.</p> <p>In just the last week, there's been a whirlwind of major MCP activity.
Industry leaders like <a href="https://www.anthropic.com/news/agent-capabilities-api">Anthropic</a>, <a href="https://openai.com/index/new-tools-and-features-in-the-responses-api/">OpenAI</a>, and <a href="https://blog.google/technology/google-deepmind/google-gemini-updates-io-2025/">Google</a> all announced support for accessing remote MCP servers directly within their APIs and SDKs. This is a <em>massive</em> step forward and a resounding validation of the MCP vision, signaling a serious industry-wide commitment to standardizing how AI models interact with tools and data.</p> <p><strong>But with great power (and public endpoints) comes great responsibility.</strong></p> <p>Many of these new API integrations hinge on the LLMs being able to access your MCP servers remotely. And let's be frank: no one wants to expose an unauthenticated MCP server to the public internet, especially if it’s a gateway to sensitive internal systems. Therefore, MCP authentication has swiftly moved from a "nice-to-have" to an absolute necessity.</p> <p>This is where the current landscape gets a bit... interesting.</p> <p>The official MCP specification, in its wisdom, dictates that HTTP-based MCP servers <em>must</em> implement a full OAuth 2.1 handshake. That's a robust standard, no doubt. But it’s also a heavy lift designed for interactive, browser-based use cases, and one that's difficult to manage in the programmatic, server-to-server use cases that these new API integrations are designed for.</p> <p>So while the big API providers are saying "Yes, bring your MCP servers!" 
and allowing users to <em>provide</em> access tokens for those servers, they're punting on exactly <em>how</em> those tokens should be obtained, essentially saying, "That's your problem."</p> <p>Happily, FastMCP 2.6 is here to solve it by introducing straightforward server and client authentication.</p> <p>We have taken a decidedly pragmatic approach with our first cut of server-side auth and shipped a Bearer token authentication scheme. We want to be clear that this does not implement a full OAuth 2.1 handshake, and therefore is <strong>not strictly compliant with the MCP spec</strong>. However, it is fully compatible with how the major AI vendors are actually using MCP today, and allows users to begin shipping useful applications immediately.</p> <p>Setting up a full OAuth 2.1 identity server is a significant undertaking more appropriate for enterprise production than a gradual developer adoption curve. Bearer token validation, by contrast, can be as simple as providing a public key. This means you can secure your FastMCP server with minimal friction and get back to building cool things.</p> <p>Frankly, this feels like an area where the MCP spec might be running well ahead of its own maturity.</p> <p>In FastMCP 2.6, you can either provide a public key directly to your server (in PEM format) or use a JWKS URI to fetch the key(s) dynamically from a remote server:</p> <p>&lt;figure&gt;</p> <pre><code>from fastmcp import FastMCP
from fastmcp.server.auth import BearerAuthProvider

mcp = FastMCP(
    name="MyAuthenticatedServer",
    auth=BearerAuthProvider(
        # -- Provide a static public key (PEM format)
        # public_key="your-public-key-string",

        # -- OR, preferably for production, a JWKS URI
        jwks_uri="https://example.com/.well-known/jwks.json",
    )
)

@mcp.tool()
def echo(message: str) -&gt; str:
    return f"Server echoes: {message}"

if __name__ == "__main__":
    mcp.run(transport="streamable-http")
</code></pre> <p>&lt;figcaption&gt;A simple FastMCP server with Bearer
authentication.&lt;/figcaption&gt; &lt;/figure&gt;</p> <p>To learn more, please review the <a href="https://gofastmcp.com/servers/auth/bearer">server-side auth docs</a>.</p> <p><em>A quick note: We're already collaborating with partners to bring plug-and-play OAuth 2.1 server integrations to FastMCP. Bearer auth is only the first, pragmatic step to meet the ecosystem's immediate needs.</em></p> <p>On the client side, however, we've pulled out all the stops.</p> <p>The FastMCP client now boasts comprehensive support for <strong>both</strong> Bearer token authentication (simply provide your token) <em>and</em> a remarkably smooth, <strong>full in-browser OAuth 2.1 flow</strong>.</p> <p>For many OAuth-protected servers, our client can navigate the entire browser-based handshake with minimal, and sometimes <em>zero</em>, additional configuration on your part. After wading through the intricacies of auth protocols these past few months, seeing this "just work" feels like a genuine breakthrough for developer experience.</p> <p>To use the client with default OAuth settings and dynamic registration, it's as simple as passing the string <code>"oauth"</code> as the auth parameter:</p> <pre><code>import asyncio

from fastmcp import Client

client = Client(
    "http://my-secure-oauth-server.com/mcp",  # URL to your MCP server
    auth="oauth",  # Enable OAuth with default settings
)

async def main():
    async with client:
        await client.ping()
        print("Successfully connected with OAuth!")

if __name__ == "__main__":
    asyncio.run(main())
</code></pre> <p>This enables your FastMCP client applications to securely interact with a wide array of OAuth-protected MCP servers, with FastMCP elegantly handling the underlying complexities.</p> <p>You can learn more about customizing client-side auth in the <a href="https://gofastmcp.com/clients/auth/oauth">client auth docs</a>.</p> <p>To help you hit the ground running, we've also shipped <strong>four new tutorials</strong> demonstrating how to integrate
your FastMCP servers with <a href="https://gofastmcp.com/integrations/anthropic">Anthropic's API</a>, <a href="https://gofastmcp.com/integrations/claude-desktop">Claude Desktop</a>, <a href="https://gofastmcp.com/integrations/openai">OpenAI's API</a>, and the <a href="https://gofastmcp.com/integrations/gemini">Gemini SDK</a>. These guides make it easier than ever to connect your secure FastMCP servers to the world's leading AI models.</p> <p>This release, and our approach to authentication in particular, exemplifies how FastMCP 2.0 is committed to making high-level, opinionated decisions that prioritize developer experience and enable rapid, practical deployment. <strong>We're building the toolkit we want for working with MCP in the real world.</strong></p> <p>And speaking of real-world deployment: we know that securing your server is only half the battle; you also need a place to host it. We've got some very exciting news on that front coming very, very soon...</p> <p>For now, dive into FastMCP 2.6!</p> <p>&lt;Callout color="gray"&gt;</p> <ul> <li>Upgrade: <code>uv add fastmcp</code> or <code>pip install fastmcp --upgrade</code></li> <li>Explore the documentation on <a href="https://gofastmcp.com/servers/auth/bearer">server</a> and <a href="https://gofastmcp.com/clients/auth/oauth">client</a> authentication.</li> <li>Check out the code and examples on <a href="https://github.com/jlowin/fastmcp">GitHub</a>. &lt;/Callout&gt;</li> </ul> <p>Happy Authenticating! 🔒</p> FastMCP 2.8: Transform and Roll Out | https://jlowin.dev/blog/fastmcp-2-8-tool-transformation/ | More than meets the API | Wed, 11 Jun 2025 00:00:00 GMT<p>import Callout from "@components/blog/Callout.astro";</p> <p>What do you do when the perfect tool... isn't so perfect? You've found a library or an API that does exactly what you need, but its interface is a nightmare for an LLM.
The argument names are cryptic, the descriptions are missing, and it exposes parameters you'd rather keep hidden.</p> <p>Today, we're thrilled to announce <strong>FastMCP 2.8</strong>, a massive release that starts to put the power of curation directly in your hands. This update is all about giving you fine-grained control to transform, filter, and shape the components your AI interacts with, and it's the foundation for a lot of the new features we're working on.</p> <p>&lt;Callout color="blue"&gt; <a href="https://github.com/jlowin/fastmcp">Give us a star on GitHub</a> or check out the updated docs at <a href="https://gofastmcp.com">gofastmcp.com</a>. &lt;/Callout&gt; &lt;br/&gt; &lt;Callout color="green"&gt; Also! The waitlist is open for <a href="https://fastmcp.cloud">FastMCP Cloud</a>, which is literally the fastest way to get started with MCP. More on that very soon. &lt;/Callout&gt;</p> <h3>🛠️ Tool Transformation: Curate the LLM Experience</h3> <p>The highlight of this release is first-class <strong><a href="https://gofastmcp.com/patterns/tool-transformation">Tool Transformation</a></strong>. Instead of wrestling with complex prompts to make an LLM use a clunky tool, you can now adapt the tool itself to be perfectly LLM-friendly.</p> <p>This feature was developed in close partnership with <strong><a href="https://www.linkedin.com/in/williamseaston/">Bill Easton</a></strong> of Elastic, who has become one of FastMCP's most prolific contributors and a key thought partner. 
As Bill brilliantly <a href="https://www.linkedin.com/feed/update/urn:li:activity:7338011349525983232/">put it</a>:</p> <blockquote> <p>Tool transformation flips Prompt Engineering on its head: stop writing tool-friendly LLM prompts and start providing LLM-friendly Tools.</p> </blockquote> <p>With a single <code>Tool.from_tool()</code> call, you can now create enhanced variations of any tool—whether it's from your own codebase, a third-party library, or an auto-generated OpenAPI server.</p> <ul> <li><strong>Rename</strong> arguments to be more intuitive (<code>q</code> becomes <code>search_query</code>).</li> <li><strong>Rewrite</strong> descriptions to give the LLM better context.</li> <li><strong>Hide</strong> parameters like API keys, providing default values behind the scenes.</li> <li><strong>Wrap</strong> a tool with custom validation or post-processing logic.</li> </ul> <pre><code>from fastmcp import FastMCP
from fastmcp.tools import Tool
from fastmcp.tools.tool_transform import ArgTransform

mcp = FastMCP()

# An existing, generic tool from a third party
from some_library import generic_search

# Transform it into a domain-specific, LLM-friendly tool
product_search = Tool.from_tool(
    tool=generic_search,
    name="find_products_by_keyword",
    description="Searches the product catalog for items matching a keyword.",
    transform_args={
        "q": ArgTransform(
            name="keyword",
            description="The search term for finding products.",
        ),
        "limit": ArgTransform(hide=True, default=10),  # Hide and pass along a new default
    },
)

mcp.add_tool(product_search)
</code></pre> <p>This is a foundational step towards a future where we don't just provide tools, but actively <em>curate</em> the LLM's environment, paving the way for more sophisticated agentic systems.</p> <h3>🫥 Enabling and Disabling Components</h3> <p>Now that you've transformed a tool into a sleek, LLM-friendly powerhouse, you'll probably want to hide the old, busted original.
This release introduces a simple way to manage component visibility.</p> <p>Every tool, resource, and prompt can now be programmatically enabled or disabled. You can set the initial state in the decorator or toggle it at runtime.</p> <pre><code>@mcp.tool(enabled=False)
def legacy_tool():
    """This tool is disabled from the start."""
    # ...

# you can enable it later
legacy_tool.enable()

# or turn it back off
legacy_tool.disable()
</code></pre> <p>This gives you precise control to roll out new features, deprecate old ones, or dynamically adjust the toolset available to your clients.</p> <h3>🏷️ Component Control: Tags Have a Purpose!</h3> <p>FastMCP introduced component tags all the way back in v2.1.0, and since then users have been asking: "What are these for?" Today, we're excited to finally have an answer:</p> <p><strong>Tag-based filtering</strong> is here, allowing you to declaratively control which components are exposed based on the tags you assign.</p> <pre><code>mcp = FastMCP(
    name="MyFilteredServer",
    # Only expose components with the "public" tag
    include_tags={"public"},
    # But exclude any that are also tagged "beta"
    exclude_tags={"beta"},
)

@mcp.tool(tags={"public"})
def stable_feature():
    """This tool is public and will be exposed."""
    # ...

@mcp.tool(tags={"public", "beta"})
def new_feature():
    """This tool is public but also beta, so it will be excluded."""
    # ...
</code></pre> <p>This is perfect for managing different environments (e.g., exposing <code>internal</code> tools in dev but not prod) or controlling access for different user types.</p> <h3>🔀 A Pragmatic Shift for OpenAPI</h3> <p>In our commitment to providing the simplest path to production, we sometimes have to make pragmatic decisions. This release includes a minor but important <strong>breaking change</strong> to FastMCP's default OpenAPI route maps.
To improve out-of-the-box compatibility with the widest range of LLM clients, all API endpoints from an OpenAPI spec are now converted to <code>Tools</code> by default.</p> <p>Previously, <code>GET</code> requests were mapped to either resources or resource templates as appropriate. However, the reality is that most MCP clients available today (including all major foundation model vendors... looking at you, Anthropic) only support MCP tools and essentially disregard every other feature.</p> <p>While we could wait for them to adopt full-featured clients (I know a <a href="https://gofastmcp.com/clients/client">great library</a>...), we've decided to make the pragmatic shift to tools-only in order to ensure that our users don't have to do extra work.</p> <p>For users who need the previous semantic behavior, it can be easily restored by providing a custom <code>route_maps</code> configuration, as detailed in the <a href="https://gofastmcp.com/servers/openapi#custom-route-maps">OpenAPI docs</a>.</p> <p>Alongside these headline features, v2.8.0 continues the major modernization effort we began in v2.7, with a host of internal improvements and optimizations to make FastMCP more robust and performant.</p> <p>FastMCP 2.8 puts more power and control in your hands than ever before. 
We're excited to see how you use these new features to build even more sophisticated and robust MCP applications.</p> <p>&lt;Callout color="gray"&gt;</p> <ul> <li>Upgrade: <code>uv add fastmcp</code> or <code>pip install fastmcp --upgrade</code></li> <li>Explore the documentation on <a href="https://gofastmcp.com/patterns/tool-transformation">Tool Transformation</a>, <a href="https://gofastmcp.com/servers/fastmcp#tag-based-filtering">Tag-based Filtering</a>, and <a href="https://gofastmcp.com/servers/tools#disabling-tools">Enabling/Disabling Components</a>.</li> <li>Check out the code and examples on <a href="https://github.com/jlowin/fastmcp">GitHub</a>.</li> <li>Sign up for <a href="https://fastmcp.cloud">FastMCP Cloud</a>. &lt;/Callout&gt;</li> </ul> <p>Happy Transforming! 🤖</p> MCP-Native Middleware with FastMCP 2.9 | https://jlowin.dev/blog/fastmcp-2-9-middleware/ | Stuck in the middleware with you | Mon, 23 Jun 2025 00:00:00 GMT<p>import Callout from "@components/blog/Callout.astro";</p> <p>This morning we released FastMCP 2.9, which includes a new, MCP-native approach to middleware.</p> <p>&lt;Callout color="blue"&gt; <a href="https://github.com/jlowin/fastmcp">Give us a star on GitHub</a> or check out the updated docs at <a href="https://gofastmcp.com">gofastmcp.com</a>. &lt;/Callout&gt;</p> <p><strong>Middleware</strong> is one of those foundational features we've come to expect from any serious server framework. It's the go-to pattern for adding cross-cutting concerns like authentication, logging, or caching without rewriting your core application logic.</p> <p>Until today, when developers asked how to add middleware to their MCP server, the obvious answer seemed to be wrapping their server with traditional ASGI middleware. Unfortunately, that approach has two critical flaws:</p> <ol> <li> <p>It only works for web-based transports like streamable-HTTP and SSE.
Until very recently, most major clients only supported the local STDIO transport, making this a non-starter for many.</p> </li> <li> <p>More importantly, it forces you to parse MCP's low-level JSON-RPC messages yourself. All the hard work FastMCP does to give you clean, high-level Tool and Resource objects is lost. You're left trying to reconstruct meaning from a sea of protocol noise.</p> </li> </ol> <p>This is a lot of work for a very limited set of outcomes.</p> <p>So, we went back to the drawing board and embraced a core FastMCP principle: <strong>focus on the developer's intent, not the protocol's complexity.</strong></p> <h2>MCP-Native Middleware</h2> <p>FastMCP 2.9 introduces a powerful, intuitive middleware system. Instead of wrapping the raw protocol stream, we wrap the high-level, semantic handlers that developers interact with. This is middleware that understands <code>tools</code>, <code>resources</code>, and <code>prompts</code>, not just JSON-RPC messages.</p> <p>Creating middleware is as simple as subclassing <code>fastmcp.server.middleware.Middleware</code> and overriding the hooks you need. Here's a basic logging middleware that prints every request and response (if any):</p> <pre><code>from fastmcp import FastMCP
from fastmcp.server.middleware import Middleware, MiddlewareContext

class LoggingMiddleware(Middleware):
    async def on_message(self, context: MiddlewareContext, call_next):
        """Called for every MCP message."""
        print(f"-&gt; Received {context.method}")
        result = await call_next(context)
        print(f"&lt;- Responded to {context.method}")
        return result

mcp = FastMCP(name="My Server")
mcp.add_middleware(LoggingMiddleware())
</code></pre> <p>While <code>on_message</code> is great for generic tasks, the true strength of FastMCP's middleware lies in its semantic awareness.
You can target specific protocol message types with <code>on_request</code> or <code>on_notification</code>, further filtered by whether the request was initiated by the server or the client, or even target specific operations like <code>on_call_tool</code> to implement more sophisticated logic.</p> <p>For example, here's a middleware that prevents access to tools tagged as <code>"private"</code>:</p> <pre><code>from fastmcp import FastMCP, Context
from fastmcp.exceptions import ToolError
from fastmcp.server.middleware import Middleware, MiddlewareContext

class PrivateMiddleware(Middleware):
    async def on_call_tool(self, context: MiddlewareContext, call_next):
        """Called when a tool is called."""
        # Fetch the FastMCP Tool object
        tool_name = context.message.name
        tool = await context.fastmcp_context.fastmcp.get_tool(tool_name)

        # Check if the tool is tagged as private
        if "private" in tool.tags:
            raise ToolError(f"Access denied to private tool: {tool_name}")

        # If the check passes, continue to the next handler
        return await call_next(context)

mcp = FastMCP(name="Private Server")

@mcp.tool(tags={"private"})
def super_secret_function():
    return "This is a secret!"

mcp.add_middleware(PrivateMiddleware())
</code></pre> <p>This approach leverages FastMCP's high-level understanding of your components to enable powerful, context-aware logic for authentication, authorization, caching, and more.</p> <p>To get you started, in addition to the core <code>Middleware</code> class, we've added a few starting templates for common middleware patterns:</p> <ul> <li><code>fastmcp.server.middleware.logging</code>: Logs every request and notification.</li> <li><code>fastmcp.server.middleware.error_handling</code>: Catches and retries errors.</li> <li><code>fastmcp.server.middleware.rate_limiting</code>: Limits the rate of requests.</li> <li><code>fastmcp.server.middleware.timing</code>: Basic performance monitoring.</li> </ul> <p>Check out the full <a href="https://gofastmcp.com/servers/middleware">middleware documentation</a> to see what's possible.</p> <h2>But Wait, There's More</h2> <p>FastMCP 2.9 is a huge release, and it also includes one highly-requested feature: <strong>server-side type conversion for prompt arguments.</strong></p> <p>The MCP spec requires all prompt arguments to be strings. This has been a persistent developer pain point. Why? Because the Python function that generates those prompts often needs structured data to perform business logic, such as a list of IDs to look up, a dictionary of configuration, or some filter criteria. This has forced developers to litter their prompt logic with <code>json.loads()</code> and pray that the agent provides a compatible input.</p> <p>Not anymore.</p> <p>With FastMCP 2.9, you can define your prompt functions with the native Python types you'd expect. FastMCP automatically handles the conversion from string to type on the server. Crucially, it also enhances the prompt's description to show clients the expected JSON schema format, making it clear how to provide structured data.
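</p> <p>Conceptually, the server-side conversion boils down to parsing any argument whose annotation isn't <code>str</code> as JSON before calling your function. Here's an illustrative, pure-Python sketch — a hypothetical helper, not FastMCP's actual implementation:</p>

```python
import json
from typing import get_type_hints

def coerce_prompt_args(func, raw_args: dict) -> dict:
    """Hypothetical sketch of FastMCP-style coercion: MCP delivers every
    prompt argument as a string, so parse each one whose annotation
    isn't `str` as JSON before calling the prompt function."""
    hints = get_type_hints(func)
    return {
        name: value if hints.get(name, str) is str else json.loads(value)
        for name, value in raw_args.items()
    }

def analyze_users(user_ids: list[int], analysis_type: str) -> str:
    return f"{analysis_type} analysis for {user_ids}"

args = coerce_prompt_args(
    analyze_users,
    {"user_ids": "[1, 2, 3]", "analysis_type": "performance"},
)
# args["user_ids"] is now a real list: [1, 2, 3]
```

<p>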
And to complete the story, FastMCP <code>Clients</code> will now automatically serialize non-string arguments for you.</p> <pre><code>from fastmcp import FastMCP
import inspect

mcp = FastMCP()

@mcp.prompt
def analyze_users(
    user_ids: list[int],  # Auto-converted from JSON!
    analysis_type: str,
) -&gt; str:
    """Generate analysis prompt using loaded user data."""
    users = []
    for user_id in user_ids:
        user = db.get_user(user_id)  # pseudocode
        users.append(f"- {user_id}: {user.name}, {user.metrics}")
    user_data = "\n".join(users)
    return inspect.cleandoc(
        f"""
        Analyze these users for {analysis_type} insights:
        {user_data}
        Provide actionable recommendations.
        """
    )</code></pre> <p>An MCP client would call this with <code>{"user_ids": "[1, 2, 3]", "analysis_type": "performance"}</code>, but the MCP server would receive a clean <code>list</code> and <code>str</code>. It's a small change that removes a huge amount of friction, especially when prompts are doing more than just string interpolation.</p> <p>FastMCP's implementation of this feature is fully MCP spec-compliant, but because there is no <em>formal</em> way to describe the expected JSON Schema format of a prompt argument, it's possible that some clients will choose to ignore it. As with all agentic users, performance will depend on the clarity of your instructions.</p> <h2>From Protocol to Framework</h2> <p>With features like middleware and automatic type conversion, FastMCP is evolving beyond a simple high-level protocol implementation. It's becoming a true application framework: an opinionated, high-level toolkit for building sophisticated, production-ready MCP applications.
Our goal remains the same: <strong>to provide the simplest path to production</strong>.</p> <ul> <li><strong>Upgrade:</strong> <code>uv add fastmcp</code> or <code>pip install fastmcp --upgrade</code></li> <li><strong>Explore:</strong> Dig into the new <a href="/servers/middleware">Middleware</a> and <a href="/servers/prompts">Prompts</a> documentation.</li> <li><strong>Contribute:</strong> Check out the code and examples on <a href="https://github.com/jlowin/fastmcp">GitHub</a>.</li> </ul> <p>Happy engineering!</p> FastMCP Context Switchinghttps://jlowin.dev/blog/fastmcp-context-switching/https://jlowin.dev/blog/fastmcp-context-switching/Seamless access for all componentsWed, 30 Apr 2025 00:00:00 GMT<p>Previously in FastMCP, the powerful <code>Context</code> object – the gateway to MCP features like logging, progress reporting, resource access, and client LLM sampling – was only easily accessible in MCP tool functions. While tools are central to MCP, this limited where you could add dynamic, session-aware logic.</p> <p>Starting with <strong>FastMCP 2.2.5</strong>, the <code>Context</code> object is now available across <em>all</em> FastMCP components! You can now seamlessly inject and use context within:</p> <ul> <li>Tools (<code>@tool()</code>)</li> <li>Resources (<code>@resource("resource://user")</code>)</li> <li>Resource templates (e.g., <code>@resource("resource://users/{user_id}")</code>)</li> <li>Prompt functions (<code>@prompt()</code>)</li> </ul> <p>In all cases, the pattern is the same: add a keyword argument to your decorated function and give it a type hint of <code>Context</code>.
FastMCP will detect the annotation and inject the correct context automatically.</p> <pre><code>from fastmcp import FastMCP, Context

mcp = FastMCP(name="ContextDemo")

@mcp.tool()
async def add(a: int, b: int, ctx: Context) -&gt; int:
    # ctx will be automatically injected by FastMCP
    await ctx.debug(f"Adding {a} and {b}")
    return a + b</code></pre> <p>In addition to logging, <code>Context</code> allows you to take advantage of powerful features like client LLM sampling.</p> <p>For more details, see the FastMCP <a href="https://gofastmcp.com/servers/context">Context documentation</a>.</p> Introducing FastMCP 2.0 🚀https://jlowin.dev/blog/fastmcp-2/https://jlowin.dev/blog/fastmcp-2/Composing the AI EcosystemWed, 16 Apr 2025 00:00:00 GMT<p>I'm thrilled to announce the release of <strong>FastMCP 2.0</strong>! 🎉</p> <p><strong><a href="https://github.com/jlowin/fastmcp">Give it a star</a></strong> or check out the docs at <strong><a href="https://gofastmcp.com">gofastmcp.com</a></strong>.</p> <h2>🚀 FastMCP 2.0</h2> <p>The <a href="https://modelcontextprotocol.io/">Model Context Protocol (MCP)</a> aims to be the "USB-C port for AI," providing a standard way for large language models to interact with data and tools. FastMCP's mission has always been to make implementing this protocol as fast, simple, and Pythonic as possible.</p> <p>FastMCP 1.0 was incredibly successful – so much so that its core SDK is now included in the official <a href="https://github.com/modelcontextprotocol/python-sdk">MCP Python SDK</a>! You can <code>from mcp.server.fastmcp import FastMCP</code> and be up and running in minutes.</p> <p>However, when I wrote the first version of FastMCP, MCP itself was only a week old. I <a href="/blog/introducing-fastmcp">introduced FastMCP</a> with the tagline "because life's too short for boilerplate," focusing on making it easy to create MCP servers without getting bogged down in protocol details.</p> <p>A few months later, the MCP ecosystem has matured.
If FastMCP 1.0 was about easily <em>creating</em> servers, then FastMCP 2.0 is about easily <em>working with</em> them. This required a significant rewrite, which is why we're back in a standalone project, but v2 is backwards-compatible with v1 while introducing powerful new features for composition, integration, and interaction.</p> <p>Here’s what’s new:</p> <h3>🧩 Compose Servers with Ease</h3> <p>You can now build modular applications by combining multiple FastMCP servers together, optionally using prefixes to avoid naming collisions.</p> <p>You can either <code>mount</code> a local or remote server to live-link it to your server, exposing its components while forwarding all requests, or use <code>import_server</code> to statically copy another server's resources and tools into your own. See the <a href="https://gofastmcp.com/patterns/composition">composition docs</a> for more.</p> <pre><code>from fastmcp import FastMCP

# Define subservers (e.g., weather_server, calc_server)
weather_server = FastMCP(name="Weather")

@weather_server.tool()
def get_forecast(city: str):
    return f"Sunny in {city}"

calc_server = FastMCP(name="Calculator")

@calc_server.tool()
def add(a: int, b: int):
    return a + b

main_app = FastMCP(name="MainApp")

# Mount the subservers
main_app.mount("weather", weather_server)
main_app.mount("calc", calc_server)

# main_app now dynamically exposes `weather_get_forecast` and `calc_add`
if __name__ == "__main__":
    main_app.run()</code></pre> <h3>🔄 Proxy Any MCP Server</h3> <p>Composition is great for combining servers you control, but what about interacting with third-party servers, remote servers, or those not built with FastMCP?</p> <p>FastMCP can now proxy any MCP server, turning it into a FastMCP server that's compatible with all other features, including composition.</p> <p>The killer feature?
<strong>You're no longer locked into the backend server's transport.</strong> The proxy can run using <code>stdio</code>, <code>sse</code>, or any other FastMCP-supported transport, regardless of how the backend is hosted.</p> <p>For more information, see the <a href="https://gofastmcp.com/patterns/proxy">proxying docs</a>.</p> <pre><code>from fastmcp import FastMCP, Client

# Point a client at *any* backend MCP server (local FastMCP instance, remote SSE, local script...)
backend_client = Client("http://api.example.com/mcp/sse")  # e.g., a remote SSE server

proxy_server = FastMCP.from_client(backend_client, name="MyProxy")

# Run the proxy locally via stdio (useful for Claude Desktop, etc.)
if __name__ == "__main__":
    proxy_server.run()  # Defaults to stdio</code></pre> <h3>🪄 Auto-Generate Servers from OpenAPI &amp; FastAPI</h3> <p>Many developers want to make their existing REST APIs accessible to LLMs without reinventing the wheel. FastMCP 2.0 makes this trivial by automatically generating MCP servers from OpenAPI specs or FastAPI apps.</p> <p>Explore the <a href="https://gofastmcp.com/patterns/openapi">OpenAPI</a> and <a href="https://gofastmcp.com/patterns/fastapi">FastAPI</a> guides for more.</p> <pre><code>from fastapi import FastAPI
from fastmcp import FastMCP

# Your existing FastAPI app
fastapi_app = FastAPI()

@fastapi_app.get("/items/{item_id}")
def get_item(item_id: int):
    return {"id": item_id, "name": f"Item {item_id}"}

# Generate an MCP server
mcp_server = FastMCP.from_fastapi(fastapi_app)

# Run the MCP server (exposes FastAPI endpoints as MCP tools/resources)
if __name__ == "__main__":
    mcp_server.run()</code></pre> <h3>🧠 Client Infrastructure &amp; LLM Sampling</h3> <p>FastMCP 2.0 introduces a completely <strong>new client infrastructure</strong> designed for robust interaction with any MCP server, supporting all major transports and even in-memory transport when working with local FastMCP servers.</p> <p>This makes it easy to expose advanced MCP features
like <strong>client-side LLM sampling</strong>. Tools running <em>on the server</em> can now ask the <em>client's</em> LLM to perform tasks using <code>ctx.sample()</code>. Imagine a server tool that fetches complex data and then asks the LLM connected to the client (like Claude or ChatGPT) to summarize it before returning the result.</p> <pre><code>from fastmcp import FastMCP, Context

mcp = FastMCP(name="SamplingDemo")

@mcp.tool()
async def analyze_data_with_llm(data_uri: str, ctx: Context) -&gt; str:
    """Fetches data and uses the client's LLM for analysis."""
    # log to the client's console
    await ctx.info(f"Fetching data from {data_uri}...")
    data_content = await ctx.read_resource(data_uri)  # Simplified

    await ctx.info("Requesting LLM analysis...")
    # Ask the connected client's LLM to analyze the data
    analysis_response = await ctx.sample(
        f"Analyze the key trends in this data:\n\n{data_content[:1000]}"
    )
    return analysis_response  # Return the LLM's analysis</code></pre> <p>This unlocks sophisticated workflows where server-side logic collaborates with the client-side LLM's intelligence. For more information, see the updated <a href="https://gofastmcp.com/clients/client">Client</a> and <a href="https://gofastmcp.com/servers/context">Context</a> guides.</p> <h2>🏗️ Building the MCP Ecosystem</h2> <p>FastMCP 2.0 is a major step towards a more connected, flexible, and developer-friendly AI ecosystem built on MCP. By simplifying proxying, composition, and integration, we hope to empower you to build and combine MCP services in powerful new ways.</p> <p>Give FastMCP 2.0 a try!</p> <ul> <li>Explore the <strong><a href="https://gofastmcp.com">documentation</a></strong></li> <li>Check out the code and examples on <strong><a href="https://github.com/jlowin/fastmcp">GitHub</a></strong></li> <li>Add it to your project: <code>uv add fastmcp</code> or <code>pip install fastmcp</code></li> </ul> <p>I'm excited to see what you build.
Your feedback, issues, and contributions are always welcome!</p> <p>Happy Engineering!</p> What's New in FastMCP 3.0https://jlowin.dev/blog/fastmcp-3-whats-new/https://jlowin.dev/blog/fastmcp-3-whats-new/A comprehensive guide to every major featureTue, 20 Jan 2026 13:00:00 GMT<p>import Callout from "@components/blog/Callout.astro";</p> <p><em>FastMCP 3.0 is our <a href="/blog/fastmcp-3">largest release ever</a>. This post covers every major feature in some detail, including an overview of the architecture and links to relevant documentation. For beta 2 features (CLI toolkit, MCP Apps, CIMD, and more), see the <a href="/blog/fastmcp-3-beta-2">Beta 2 announcement</a>.</em></p> <h2>The Architecture</h2> <p>FastMCP 2 had features. Lots of them. Mounting servers, proxying remotes, filtering by tags, transforming tool schemas. Each feature was its own subsystem with its own code, its own mental model, its own edge cases. When you wanted to add something new, you had to figure out how it interacted with everything else—and the answer was usually "write more glue code."</p> <p>FastMCP 3 asks a different question: what if all of these features are just different combinations of the same primitives?</p> <p>The architecture comes down to three concepts:</p> <p><strong>Components</strong> are the atoms of MCP. A tool, a resource, a prompt. They're what clients actually interact with. Components have names, schemas, metadata, and behavior. They're the thing you're ultimately trying to expose.</p> <p><strong>Providers</strong> answer the question: where do components come from? A provider is anything that can list components and retrieve them by name. Your decorated functions are a provider. A directory of files is a provider. A remote MCP server is a provider. An OpenAPI spec is a provider. 
Critically, a FastMCP server is <em>itself</em> a provider—which means you can nest servers inside servers, infinitely.</p> <p><strong>Transforms</strong> are middleware for the component pipeline. They intercept the flow of components from providers to clients and can modify what passes through. Rename a tool, add a namespace prefix, filter by version, hide components by tag—these are all transforms. Transforms compose: you stack them, and each one processes the output of the previous.</p> <h3>Why Composability Matters</h3> <p>Here's where it gets interesting. In FastMCP 2, "mounting" a sub-server was a massive specialized feature. Hundreds of lines of code to handle the namespacing, the middleware chains, the lifecycle management. Same story for proxying remote servers. Same story for visibility filtering.</p> <p>In FastMCP 3, mounting is just two primitives combined:</p> <ul> <li>A <strong>Provider</strong> that sources components from another server</li> <li>A <strong>Transform</strong> that adds a namespace prefix</li> </ul> <p>That's it. There's no special mounting code. The mounting behavior <em>emerges</em> from the composition of primitives that each do one thing well.</p> <p>Proxying a remote server? That's a Provider backed by an MCP client. The Provider wraps the client, translates list/get calls into MCP protocol calls, and returns the results. No special proxy subsystem—just a provider that happens to talk to a remote server.</p> <p>Per-session visibility, where different users see different tools? That's a Transform applied to an individual session instead of the server. The visibility transform doesn't know or care whether it's running globally or per-session. It just filters components based on rules. The per-session behavior comes from <em>where</em> you apply it.</p> <p>This composability has a practical consequence: FastMCP 3 ships more features with less code, and you can combine features in ways we didn't anticipate. 
Want to proxy a remote server, filter its tools by tag, rename them, and expose them only to authenticated users? That's a Provider, three Transforms, and some auth middleware. Each piece is independent. Each piece is testable. And when we add new transforms or providers, they automatically work with everything else.</p> <h3>How It Actually Works</h3> <p>When a client asks for the list of tools, here's what happens:</p> <ol> <li>The server collects components from all its Providers</li> <li>Each Provider runs its own transform chain (provider-level transforms)</li> <li>The server runs its transform chain on the aggregated result (server-level transforms)</li> <li>The final list goes to the client</li> </ol> <p>This two-level transform system is powerful. Provider-level transforms affect only that provider's components—useful for namespacing a mounted server. Server-level transforms affect everything—useful for global visibility rules or auth filtering.</p> <p>The same flow happens for <code>get_tool</code>, <code>call_tool</code>, <code>read_resource</code>, and every other operation. Transforms can intercept any of these, which means you can inject behavior at any point in the pipeline.</p> <p>You might be wondering: what about middleware? FastMCP still has middleware, and it operates on <em>requests</em>—intercepting tool calls, resource reads, and other operations as they execute. In FastMCP 2, some users tried to use middleware to dynamically modify tools or inject new components. It sort of worked, but it was unpredictable, hard to compose with other systems like auth and visibility, and operated at the server level which made it difficult to address subsets of components. Transforms are the clean answer: they're designed for component-level modification, they compose naturally, and they integrate with the provider system. 
Middleware is still there for what it's good at—authentication, logging, rate limiting, and other cross-cutting concerns at the request level. There's some gray area, but the guideline is: transforms for shaping <em>what components exist</em>, middleware for handling <em>how requests execute</em>.</p> <p>What follows is a tour of the providers and transforms that ship with FastMCP 3. Think of them less as "features" and more as building blocks—the primitives you combine to build whatever your application needs.</p> <h2>Providers</h2> <p>Providers answer the question: where do your components come from?</p> <h3>Custom Providers</h3> <p>You can write your own provider by subclassing <code>Provider</code>:</p> <pre><code>from collections.abc import Sequence

from fastmcp import FastMCP
from fastmcp.server.providers import Provider
from fastmcp.tools import Tool

class DatabaseProvider(Provider):
    async def list_tools(self) -&gt; Sequence[Tool]:
        # Query database for available tools
        rows = await db.fetch("SELECT * FROM tools")
        return [Tool(name=row['name'], description=row['description']) for row in rows]

    async def get_tool(self, name: str) -&gt; Tool | None:
        row = await db.fetchrow("SELECT * FROM tools WHERE name = ?", name)
        if row:
            return Tool(name=row['name'], description=row['description'])
        return None

# Attach to server
mcp = FastMCP("Database Server", providers=[DatabaseProvider()])</code></pre> <p>This pattern is powerful: need tools from a REST API? Write an APIProvider. Need tools from a Kubernetes cluster? Write a KubeProvider. The provider pattern is your extension point.</p> <p><a href="https://gofastmcp.com/servers/providers/custom">Learn more in the docs →</a></p> <h3>Built-In Providers</h3> <p>FastMCP ships with providers for the most common patterns.</p> <h4>LocalProvider</h4> <p>This is the classic FastMCP experience. You define a function, decorate it, and it becomes a component.
What's new in v3 is that LocalProvider is now explicit and reusable—you can attach the same provider to multiple servers.</p> <p><a href="https://gofastmcp.com/servers/providers/local">Learn more in the docs →</a></p> <pre><code>from fastmcp import FastMCP
from fastmcp.server.providers import LocalProvider

provider = LocalProvider()

@provider.tool
def greet(name: str) -&gt; str:
    return f"Hello, {name}!"

# Attach to multiple servers
server1 = FastMCP("Server1", providers=[provider])
server2 = FastMCP("Server2", providers=[provider])</code></pre> <h4>FileSystemProvider</h4> <p>This is a fundamentally different way to organize MCP servers. Instead of importing a server instance and decorating functions, you write self-contained tool files:</p> <pre><code># mcp/tools/greet.py
from fastmcp.tools import tool

@tool
def greet(name: str) -&gt; str:
    """Greet someone by name."""
    return f"Hello, {name}!"</code></pre> <p>Then point the provider at the directory:</p> <pre><code>from fastmcp import FastMCP
from fastmcp.server.providers import FileSystemProvider

mcp = FastMCP("server", providers=[FileSystemProvider("mcp/")])</code></pre> <p>The problem it solves: traditional servers require coordination between files—either tool files import the server (creating coupling) or the server imports all tool modules (creating a registry bottleneck). FileSystemProvider removes this coupling entirely.</p> <p>With <code>reload=True</code>, the provider re-scans on every request—changes take effect immediately without restarting the server. This is transformative for development.</p> <p><a href="https://gofastmcp.com/servers/providers/filesystem">Learn more in the docs →</a></p> <h4>SkillsProvider</h4> <p>Skills are the instruction files that Claude Code, Cursor, and Copilot use to learn new capabilities.
SkillsProvider exposes these as MCP resources, which means any MCP client can discover and download skills from your server.</p> <pre><code>from pathlib import Path

from fastmcp import FastMCP
from fastmcp.server.providers.skills import SkillsDirectoryProvider

mcp = FastMCP("Skills Server")
mcp.add_provider(SkillsDirectoryProvider(roots=Path.home() / ".claude" / "skills"))</code></pre> <p>Each subdirectory with a <code>SKILL.md</code> file becomes a discoverable skill. Clients see:</p> <ul> <li><code>skill://{name}/SKILL.md</code> - Main instruction file</li> <li><code>skill://{name}/_manifest</code> - JSON listing of all files with sizes and hashes</li> <li><code>skill://{name}/{path}</code> - Supporting files</li> </ul> <p>We also provide vendor-specific providers with locked default paths: <code>ClaudeSkillsProvider</code>, <code>CursorSkillsProvider</code>, <code>VSCodeSkillsProvider</code>, <code>CodexSkillsProvider</code>, and more.</p> <p>The FastMCP client can automatically sync skills from servers to your local filesystem, making it easy to distribute skills across your organization.</p> <p><a href="https://gofastmcp.com/servers/providers/skills">Learn more in the docs →</a></p> <h4>OpenAPIProvider</h4> <p>OpenAPI-to-MCP conversion was one of FastMCP 2's most popular features. In v3, we've restructured it as a provider, which means it now composes with everything else in the system.</p> <pre><code>import httpx

from fastmcp import FastMCP
from fastmcp.server.providers.openapi import OpenAPIProvider

client = httpx.AsyncClient(base_url="https://api.example.com")
provider = OpenAPIProvider(openapi_spec=spec, client=client)

mcp = FastMCP("API Server", providers=[provider])</code></pre> <p>All endpoints become tools by default.
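To make that mapping concrete, here's a rough, framework-free illustration of how endpoints might become tool names — the spec fragment and naming rule are hypothetical simplifications, not FastMCP's exact algorithm:

```python
# A minimal OpenAPI fragment (hypothetical endpoints, for illustration only)
spec = {
    "paths": {
        "/users/{user_id}": {"get": {"operationId": "get_user"}},
        "/users": {"post": {"operationId": "create_user"}},
    }
}


def derive_tool_names(spec: dict) -> list[str]:
    """Collect one tool name per HTTP operation, preferring operationId."""
    names = []
    for path, operations in spec["paths"].items():
        for method, op in operations.items():
            # Fall back to METHOD_path when no operationId is present
            names.append(op.get("operationId") or f"{method}_{path}")
    return sorted(names)


print(derive_tool_names(spec))  # ['create_user', 'get_user']
```

Each derived tool would then wrap an HTTP call against the spec's server; the point is simply that every operation in the spec surfaces as a callable tool.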
When paired with ToolTransform (covered below), you can rename auto-generated tools, improve descriptions, and curate the output for your agent—finally making OpenAPI conversion a tool for building <em>good</em> context rather than blindly accumulating more of it.</p> <p><a href="https://gofastmcp.com/integrations/openapi">Learn more in the docs →</a></p> <h4>ProxyProvider</h4> <p>ProxyProvider sources components from a remote MCP server. This is what powers <code>create_proxy()</code>: you connect to any MCP server and expose its components as if they were local.</p> <pre><code>from fastmcp.server import create_proxy

# Create proxy to remote server
server = create_proxy("http://remote-server/mcp")</code></pre> <p><a href="https://gofastmcp.com/servers/providers/proxy">Learn more in the docs →</a></p> <h4>FastMCPProvider</h4> <p>FastMCPProvider sources components from another FastMCP server instance. This is what powers <code>mount()</code>: compose servers together while keeping their middleware chains intact.</p> <pre><code>from fastmcp import FastMCP

main = FastMCP("Main")
sub = FastMCP("Sub")

@sub.tool
def greet(name: str) -&gt; str:
    return f"Hello, {name}!"

# Mount with namespace - greet becomes "sub_greet"
main.mount(sub, prefix="sub")</code></pre> <p>Under the hood, this creates a FastMCPProvider with a Namespace transform—the same primitives, with a cleaner API.</p> <p><a href="https://gofastmcp.com/servers/providers/mounting">Learn more in the docs →</a></p> <h2>Transforms</h2> <p>Transforms modify components as they flow from providers to clients. They operate on two types of methods: <strong>list</strong> operations (like <code>list_tools</code>) receive the full sequence of components and return a transformed sequence; <strong>get</strong> operations (like <code>get_tool</code>) use a middleware pattern with <code>call_next</code> to chain lookups.
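Stripped of FastMCP's types, the two styles can be sketched as a framework-free toy, with plain strings standing in for Tool objects (the names here are illustrative, not FastMCP's API):

```python
import asyncio


class AddPrefix:
    """A toy namespace transform: prefixes names on the way out,
    strips them on the way back in."""

    def __init__(self, prefix: str):
        self.prefix = prefix

    async def list_tools(self, tools):
        # List-style: receives the full sequence, returns a new one
        return [f"{self.prefix}_{t}" for t in tools]

    async def get_tool(self, name, call_next):
        # Get-style: strip our prefix, then delegate down the chain
        if not name.startswith(self.prefix + "_"):
            return None
        return await call_next(name[len(self.prefix) + 1:])


async def main():
    base = ["greet", "add"]

    async def lookup(name):
        # Innermost handler: the provider itself
        return name if name in base else None

    ns = AddPrefix("api")
    listed = await ns.list_tools(base)
    found = await ns.get_tool("api_greet", lookup)
    print(listed, found)  # ['api_greet', 'api_add'] greet


asyncio.run(main())
```

A second transform would simply wrap the first: its `call_next` is the first transform's `get_tool`, which is how stacking falls out of the pattern.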
Transforms can be stacked, and each one processes the output of the previous.</p> <p>Transforms apply at two levels:</p> <ul> <li><strong>Provider-level</strong>: <code>provider.add_transform()</code> - affects only that provider's components</li> <li><strong>Server-level</strong>: <code>server.add_transform()</code> - affects all components from all providers</li> </ul> <h3>Built-In Transforms</h3> <h4>Namespace</h4> <p>Namespace adds prefixes to component names (<code>tool</code> → <code>api_tool</code>) and path segments to URIs (<code>data://x</code> → <code>data://api/x</code>). Essential for avoiding collisions when composing servers.</p> <pre><code>from fastmcp.server.transforms import Namespace

provider.add_transform(Namespace("api"))</code></pre> <p><a href="https://gofastmcp.com/servers/transforms/namespace">Learn more in the docs →</a></p> <h4>ToolTransform</h4> <p>ToolTransform lets you reshape tools entirely: rename them, rewrite descriptions, modify argument names and schemas, add tags. This is especially powerful when you don't control the tools you're serving—if you're using OpenAPIProvider or proxying a third-party server, ToolTransform lets you optimize those auto-generated tools for your agent.</p> <pre><code>from fastmcp.server.transforms import ToolTransform
from fastmcp.tools.tool_transform import ToolTransformConfig

provider.add_transform(ToolTransform({
    "verbose_auto_generated_name": ToolTransformConfig(
        name="short_name",
        description="A better description for the agent",
        tags={"category"},
    ),
}))</code></pre> <p><a href="https://gofastmcp.com/servers/transforms/tool-transformation">Learn more in the docs →</a></p> <h4>VersionFilter</h4> <p>VersionFilter exposes only components within a version range, letting you run v1 and v2 servers from the same codebase.
See <a href="#component-versioning">Component Versioning</a> for how to define versions on your components.</p> <pre><code>from fastmcp.server.transforms import VersionFilter

# Create servers that share the provider with different filters
api_v1 = FastMCP("API v1", providers=[components])
api_v1.add_transform(VersionFilter(version_lt="2.0"))

api_v2 = FastMCP("API v2", providers=[components])
api_v2.add_transform(VersionFilter(version_gte="2.0"))</code></pre> <h4>Visibility</h4> <p>The Visibility transform controls which components are exposed by tag, name, or version. This is what powers the <code>enable()</code> and <code>disable()</code> methods on servers and providers.</p> <pre><code>mcp.disable(tags={"admin"})  # Hide admin tools
mcp.disable(names={"dangerous_tool"})  # Hide by name
mcp.enable(tags={"public"}, only=True)  # Allowlist mode</code></pre> <p><a href="https://gofastmcp.com/servers/visibility">Learn more in the docs →</a></p> <h4>ResourcesAsTools and PromptsAsTools</h4> <p>These transforms expose resources and prompts as tools for clients that only support the tools protocol. Some MCP hosts—particularly early adopters and simpler implementations—only expose tools to agents. These transforms let your server stay rich while still working with limited clients.</p> <pre><code>from fastmcp.server.transforms import ResourcesAsTools, PromptsAsTools

mcp.add_transform(ResourcesAsTools(mcp))
mcp.add_transform(PromptsAsTools(mcp))</code></pre> <p>ResourcesAsTools generates <code>list_resources</code> and <code>read_resource</code> tools that wrap the underlying resource operations. PromptsAsTools generates <code>list_prompts</code> and <code>get_prompt</code> tools.
The transforms automatically handle argument mapping and response formatting—your resources and prompts work exactly as expected, just through the tools interface.</p> <p><a href="https://gofastmcp.com/servers/transforms/resources-as-tools">Learn more in the docs →</a></p> <h3>Custom Transforms</h3> <p>You can write your own transforms by subclassing <code>Transform</code>:</p> <pre><code>from collections.abc import Sequence

from fastmcp.server.transforms import Transform, GetToolNext
from fastmcp.tools import Tool

class TagFilter(Transform):
    def __init__(self, required_tags: set[str]):
        self.required_tags = required_tags

    async def list_tools(self, tools: Sequence[Tool]) -&gt; Sequence[Tool]:
        # list operations receive the sequence directly
        return [t for t in tools if t.tags &amp; self.required_tags]

    async def get_tool(self, name: str, call_next: GetToolNext) -&gt; Tool | None:
        # get operations use call_next middleware pattern
        tool = await call_next(name)
        return tool if tool and tool.tags &amp; self.required_tags else None</code></pre> <p><a href="https://gofastmcp.com/servers/transforms/transforms">Learn more in the docs →</a></p> <h2>Authorization</h2> <p>FastMCP 3 introduces per-component authorization for tools, resources, and prompts—the missing piece after OAuth support in 2.12.</p> <h3>Component-Level Auth</h3> <p>The <code>auth</code> parameter accepts a callable (or list of callables) that receives the request context and decides whether to allow it:</p> <pre><code>from fastmcp import FastMCP
from fastmcp.server.auth import require_auth, require_scopes

mcp = FastMCP()

@mcp.tool(auth=require_auth)
def protected_tool(): ...

@mcp.resource("data://secret", auth=require_scopes("read"))
def secret_data(): ...

@mcp.prompt(auth=require_scopes("admin"))
def admin_prompt(): ...
</code></pre> <p>Built-in checks:</p> <ul> <li><code>require_auth</code>: Requires any valid token</li> <li><code>require_scopes(*scopes)</code>: Requires specific OAuth scopes</li> <li><code>restrict_tag(tag, scopes)</code>: Requires scopes only for tagged components</li> </ul> <h3>Server-Wide Auth</h3> <p>Apply authorization to all components via AuthMiddleware:</p> <pre><code>from fastmcp.server.middleware import AuthMiddleware
from fastmcp.server.auth import require_auth, restrict_tag

# Require auth for all components
mcp = FastMCP(middleware=[AuthMiddleware(auth=require_auth)])

# Tag-based restrictions
mcp = FastMCP(middleware=[
    AuthMiddleware(auth=restrict_tag("admin", scopes=["admin"]))
])</code></pre> <h3>Custom Auth Checks</h3> <p>Custom checks receive <code>AuthContext</code> with <code>token</code> and <code>component</code>:</p> <pre><code>def custom_check(ctx: AuthContext) -&gt; bool:
    return ctx.token is not None and "admin" in ctx.token.scopes</code></pre> <p>Note: STDIO transport bypasses all auth checks (no OAuth concept).</p> <p><a href="https://gofastmcp.com/servers/authorization">Learn more in the docs →</a></p> <h3>CIMD</h3> <p>CIMD (Client ID Metadata Document) is the successor to Dynamic Client Registration. Instead of clients registering via a POST endpoint, they provide an HTTPS URL pointing to their metadata document. The server fetches and validates it, which is more secure and enables better client verification. Shipped in <a href="/blog/fastmcp-3-beta-2#cimd-client-authentication-without-dcr">beta 2</a>.</p> <p><a href="https://gofastmcp.com/clients/auth/cimd">Learn more in the docs →</a></p> <h2>Component Versioning</h2> <p>You can now register multiple versions of the same component.
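The ordering rule behind "highest version wins" can be illustrated with a stdlib-only sketch — a deliberate simplification of PEP 440 that handles only numeric release segments and the `v` prefix:

```python
def parse_version(v: str) -> tuple[int, ...]:
    """Parse a simple numeric version into a comparable tuple.

    Simplification: real PEP 440 ordering also covers pre-, post-,
    and dev-releases; this handles only dotted integers like "1.10".
    """
    return tuple(int(part) for part in v.lstrip("v").split("."))


versions = ["1.0", "1.9", "1.10", "1.2"]
latest = max(versions, key=parse_version)
print(latest)  # 1.10
```

Comparing tuples of integers rather than raw strings is what makes "1.10" sort above "1.9" instead of below it.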
FastMCP automatically exposes the highest version to clients while preserving older versions for compatibility.</p> <h3>Declaring Versions</h3> <pre><code>@mcp.tool(version="1.0")
def add(x: int, y: int) -&gt; int:
    return x + y

@mcp.tool(version="2.0")
def add(x: int, y: int, z: int = 0) -&gt; int:
    return x + y + z

# Only v2.0 is exposed via list_tools()
# Calling "add" invokes the v2.0 implementation</code></pre> <p>Version comparison follows PEP 440 ordering (1.10 &gt; 1.9 &gt; 1.2). The <code>v</code> prefix is normalized (<code>v1.0</code> equals <code>1.0</code>).</p> <h3>Version Metadata</h3> <p>When listing components, FastMCP exposes all available versions in the <code>meta</code> field:</p> <pre><code>tools = await client.list_tools()

# Each tool's meta includes:
# - meta["fastmcp"]["version"]: the version of this component ("2.0")
# - meta["fastmcp"]["versions"]: all available versions ["2.0", "1.0"]</code></pre> <h3>Calling Specific Versions</h3> <p>The FastMCP client supports direct version selection:</p> <pre><code>from fastmcp import Client

async with Client(server) as client:
    # Call the latest version (default)
    result = await client.call_tool("add", {"x": 1, "y": 2})

    # Call a specific version
    result = await client.call_tool("add", {"x": 1, "y": 2}, version="1.0")</code></pre> <p>For generic MCP clients that don't support the version parameter, pass version via <code>_meta</code> in arguments:</p> <pre><code>{
  "x": 1,
  "y": 2,
  "_meta": {
    "fastmcp": {
      "version": "1.0"
    }
  }
}</code></pre> <p><a href="https://gofastmcp.com/servers/versioning">Learn more in the docs →</a></p> <h2>Session-Scoped State</h2> <p>State now persists across tool calls within a session, not just within a single request.</p> <pre><code>@mcp.tool
async def increment_counter(ctx: Context) -&gt; int:
    count = await ctx.get_state("counter") or 0
    await ctx.set_state("counter", count + 1)
    return count + 1</code></pre> <p>State is automatically keyed by session ID, ensuring
isolation between different clients.</p> <p><strong>Key changes from v2:</strong></p> <ul> <li>Methods are now async: <code>await ctx.get_state()</code>, <code>await ctx.set_state()</code>, <code>await ctx.delete_state()</code></li> <li>State expires after 1 day (TTL) to prevent unbounded growth</li> </ul> <p><strong>Distributed backends:</strong></p> <p>The implementation uses <a href="https://github.com/strawgate/py-key-value">py-key-value</a> (maintained by FastMCP maintainer Bill Easton) for pluggable storage:</p> <pre><code>from key_value.aio.stores.redis import RedisStore

# Use Redis for distributed deployments
mcp = FastMCP("server", session_state_store=RedisStore(...))
</code></pre> <p><strong>Stateless HTTP:</strong></p> <p>For stateless HTTP deployments where there's no persistent connection, FastMCP respects the <code>mcp-session-id</code> header that most clients send. If you've configured a storage backend, we'll create a virtual session for you.</p> <p><a href="https://gofastmcp.com/servers/storage-backends">Learn more in the docs →</a></p> <h2>Visibility System</h2> <p>Components can be enabled or disabled using the visibility system.
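</p>

<p>The resolution semantics (blocklist by default, allowlist with <code>only=True</code>, and a later rule overriding an earlier one) can be sketched in plain Python; this is a toy model, not FastMCP's implementation:</p>

```python
# Toy model of visibility resolution: rules apply in order and the LAST
# matching rule wins; allowlist mode hides anything no rule enables.
def visible(tags: set, rules: list, allowlist: bool = False) -> bool:
    state = not allowlist  # blocklist mode: visible by default
    for action, rule_tags in rules:
        if tags & rule_tags:  # later matching rules override earlier ones
            state = (action == "enable")
    return state

rules = [("disable", {"internal"}), ("enable", {"safe"})]
assert visible({"internal"}, rules) is False
assert visible({"internal", "safe"}, rules) is True  # later enable wins
assert visible({"public"}, rules, allowlist=True) is False
```

<p>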
Each <code>enable()</code> or <code>disable()</code> call adds a Visibility transform that marks components.</p> <pre><code>mcp = FastMCP("Server")

# Disable by name
mcp.disable(names={"dangerous_tool"}, components=["tool"])

# Disable by tag
mcp.disable(tags={"admin"})

# Allowlist mode - only show components with these tags
mcp.enable(tags={"public"}, only=True)

# Enable overrides earlier disable (later transform wins)
mcp.disable(tags={"internal"})
mcp.enable(names={"safe_tool"})  # safe_tool is visible despite internal tag
</code></pre> <p><strong>Blocklist vs Allowlist:</strong></p> <ul> <li><strong>Blocklist mode</strong> (default): All components visible except explicitly disabled</li> <li><strong>Allowlist mode</strong> (<code>only=True</code>): Only explicitly enabled components visible</li> </ul> <h3>Per-Session Visibility</h3> <p>Server-level visibility changes affect all connected clients. For per-session control, use <code>Context</code> methods:</p> <pre><code>@mcp.tool(tags={"premium"})
def premium_analysis(data: str) -&gt; str:
    return f"Premium analysis of: {data}"

@mcp.tool
async def unlock_premium(ctx: Context) -&gt; str:
    """Unlock premium features for this session only."""
    await ctx.enable_components(tags={"premium"})
    return "Premium features unlocked"

@mcp.tool
async def reset_features(ctx: Context) -&gt; str:
    """Reset to default feature set."""
    await ctx.reset_visibility()
    return "Features reset to defaults"

# Globally disabled - sessions unlock individually
mcp.disable(tags={"premium"})
</code></pre> <p>Session visibility methods:</p> <ul> <li><code>await ctx.enable_components(...)</code>: Enable components for this session</li> <li><code>await ctx.disable_components(...)</code>: Disable components for this session</li> <li><code>await ctx.reset_visibility()</code>: Clear session rules, return to global defaults</li> </ul> <p>FastMCP automatically sends <code>ToolListChangedNotification</code> (and resource/prompt equivalents) to affected
sessions when visibility changes.</p> <p><a href="https://gofastmcp.com/server/visibility">Learn more in the docs →</a></p> <h2>Production Features</h2> <h3>OpenTelemetry Tracing</h3> <p>FastMCP 3 has native OpenTelemetry instrumentation. Drop in your OTEL configuration, and every tool call, resource read, and prompt render is traced with standardized attributes.</p> <pre><code>from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

# Use fastmcp normally - spans export to your configured backend
</code></pre> <p>Server spans include: component key, provider type, session ID, auth context. Client spans wrap outgoing calls with W3C trace context propagation.</p> <p><a href="https://gofastmcp.com/servers/telemetry">Learn more in the docs →</a></p> <h3>Background Tasks (SEP-1686)</h3> <p>MCP has a spec extension (SEP-1686) for long-running background tasks. FastMCP implements this via Docket integration—you get persistent task queues backed by SQLite or Postgres, with the ability to scale workers horizontally.</p> <pre><code>from fastmcp.server.tasks import TaskConfig

@mcp.tool(task=TaskConfig(mode="required"))
async def long_running_task():
    # Must be executed as background task
    ...

@mcp.tool(task=TaskConfig(mode="optional"))
async def flexible_task():
    # Supports both sync and task execution
    ...

@mcp.tool(task=True)  # Shorthand for mode="optional"
async def simple_task():
    ...
</code></pre> <p>Task modes:</p> <ul> <li><code>"forbidden"</code>: Does not support task execution (default)</li> <li><code>"optional"</code>: Supports both synchronous and task execution</li> <li><code>"required"</code>: Must be executed as background task</li> </ul> <p>Install with <code>fastmcp[tasks]</code> for Docket integration.</p> <p><a href="https://gofastmcp.com/servers/tasks">Learn more in the docs →</a></p> <h3>Tool Timeouts</h3> <p>Tools can limit foreground execution time:</p> <pre><code>@mcp.tool(timeout=30.0)
async def fetch_data(url: str) -&gt; dict:
    """Fetch with 30-second timeout."""
    ...
</code></pre> <p>When exceeded, clients receive MCP error code <code>-32000</code>. Both sync and async tools are supported. Note: timeouts don't apply to background tasks—those run in Docket's task queue with their own lifecycle management.</p> <p><a href="https://gofastmcp.com/servers/tools">Learn more in the docs →</a></p> <h3>Pagination</h3> <p>For servers with many components, enable pagination:</p> <pre><code>server = FastMCP("ComponentRegistry", list_page_size=50)
</code></pre> <p>When <code>list_page_size</code> is set, list operations paginate responses with <code>nextCursor</code> for subsequent pages. The FastMCP Client fetches all pages automatically—<code>list_tools()</code> returns the complete list.
For manual pagination:</p> <pre><code>async with Client(server) as client:
    result = await client.list_tools_mcp()
    while result.nextCursor:
        result = await client.list_tools_mcp(cursor=result.nextCursor)
</code></pre> <p><a href="https://gofastmcp.com/servers/pagination">Learn more in the docs →</a></p> <h3>PingMiddleware</h3> <p>Keep long-lived connections alive with periodic pings:</p> <pre><code>from fastmcp.server.middleware import PingMiddleware

mcp = FastMCP("server")
mcp.add_middleware(PingMiddleware(interval_ms=5000))
</code></pre> <p><a href="https://gofastmcp.com/servers/middleware">Learn more in the docs →</a></p> <h2>Developer Experience</h2> <h3>Decorators Return Functions</h3> <p>By popular demand (and by "popular demand" I mean "relentless GitHub issues"), your decorated functions now stay callable, like they do in Flask, FastAPI, and Typer:</p> <pre><code>@mcp.tool
def greet(name: str) -&gt; str:
    return f"Hello, {name}!"

# greet is still your function - call it directly
greet("World")  # "Hello, World!"
</code></pre> <p>This makes testing straightforward: just call the function.
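</p>

<p>The pattern is easy to model without FastMCP itself; a toy decorator (illustrative only) registers the function and hands it back unchanged:</p>

```python
# Toy version of a v3-style decorator: register the function for the
# server, then return the original so it stays directly callable.
registry = {}

def tool(fn):
    registry[fn.__name__] = fn  # what the server would expose
    return fn                   # the function itself, unchanged

@tool
def greet(name: str) -> str:
    return f"Hello, {name}!"

assert greet("World") == "Hello, World!"  # call it like normal Python
assert "greet" in registry
```

<p>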
For v2 compatibility, set <code>FASTMCP_DECORATOR_MODE=object</code>.</p> <p><a href="https://gofastmcp.com/getting-started/defining-tools">Learn more in the docs →</a></p> <h3>Hot Reload</h3> <p><code>fastmcp run --reload</code> watches your files and reloads automatically:</p> <pre><code># Watch for changes and restart
fastmcp run server.py --reload

# Watch specific directories
fastmcp run server.py --reload --reload-dir ./src --reload-dir ./lib
</code></pre> <p>The <code>fastmcp dev</code> command is a shorthand that includes <code>--reload</code> by default.</p> <p><a href="https://gofastmcp.com/patterns/cli">Learn more in the docs →</a></p> <h3>Automatic Threadpool</h3> <p>Synchronous tools, resources, and prompts now automatically run in a threadpool:</p> <pre><code>import time

@mcp.tool
def slow_tool():
    time.sleep(10)  # No longer blocks other requests
    return "done"
</code></pre> <p>Three concurrent calls now execute in parallel (~10s) rather than sequentially (30s).</p> <p><a href="https://gofastmcp.com/servers/tools">Learn more in the docs →</a></p> <h3>Composable Lifespans</h3> <p>Lifespans can be combined with the <code>|</code> operator for modular setup/teardown:</p> <pre><code>from fastmcp import FastMCP
from fastmcp.server.lifespan import lifespan

@lifespan
async def db_lifespan(server):
    db = await connect_db()
    try:
        yield {"db": db}
    finally:
        await db.close()

@lifespan
async def cache_lifespan(server):
    cache = await connect_cache()
    try:
        yield {"cache": cache}
    finally:
        await cache.close()

mcp = FastMCP("server", lifespan=db_lifespan | cache_lifespan)
</code></pre> <p>Both enter in order and exit in reverse (LIFO).
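</p>

<p>The composition semantics can be sketched with <code>contextlib</code>; this is a toy model of the behavior, not FastMCP's implementation:</p>

```python
import asyncio
from contextlib import asynccontextmanager

# Toy model: compose two lifespans so they enter in order, exit in
# reverse (LIFO), and merge their yielded context dicts.
events = []

@asynccontextmanager
async def db_lifespan():
    events.append("db up")
    try:
        yield {"db": "conn"}
    finally:
        events.append("db down")

@asynccontextmanager
async def cache_lifespan():
    events.append("cache up")
    try:
        yield {"cache": "client"}
    finally:
        events.append("cache down")

@asynccontextmanager
async def combined():
    async with db_lifespan() as a, cache_lifespan() as b:
        yield {**a, **b}  # merged context dict

async def main():
    async with combined() as ctx:
        assert ctx == {"db": "conn", "cache": "client"}

asyncio.run(main())
print(events)  # ['db up', 'cache up', 'cache down', 'db down']
```

<p>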
Context dicts are merged.</p> <p><a href="https://gofastmcp.com/servers/lifespan">Learn more in the docs →</a></p> <h3>Rich Result Classes</h3> <p>New result classes provide explicit control over component responses:</p> <p><strong>ToolResult:</strong></p> <pre><code>from fastmcp.tools import ToolResult

@mcp.tool
def process(data: str) -&gt; ToolResult:
    return ToolResult(
        content=[TextContent(type="text", text="Done")],
        structured_content={"status": "success", "count": 42},
        meta={"processing_time_ms": 150}
    )
</code></pre> <p><strong>ResourceResult:</strong></p> <pre><code>from fastmcp.resources import ResourceResult, ResourceContent

@mcp.resource("data://items")
def get_items() -&gt; ResourceResult:
    return ResourceResult(
        contents=[
            ResourceContent({"key": "value"}),
            ResourceContent(b"binary data"),
        ],
        meta={"count": 2}
    )
</code></pre> <p><strong>PromptResult:</strong></p> <pre><code>from fastmcp.prompts import PromptResult, Message

@mcp.prompt
def conversation() -&gt; PromptResult:
    return PromptResult(
        messages=[
            Message("What's the weather?"),
            Message("It's sunny today.", role="assistant"),
        ],
        meta={"generated_at": "2024-01-01"}
    )
</code></pre> <p><a href="https://gofastmcp.com/servers/tools">Learn more in the docs →</a></p> <h3>Context.transport Property</h3> <p>Tools can detect which transport is active:</p> <pre><code>@mcp.tool
def my_tool(ctx: Context) -&gt; str:
    if ctx.transport == "stdio":
        return "short response"
    return "detailed response with more context"
</code></pre> <p>Returns <code>"stdio"</code>, <code>"sse"</code>, or <code>"streamable-http"</code>.</p> <p><a href="https://gofastmcp.com/servers/context">Learn more in the docs →</a></p> <h2>Upgrading</h2> <p>The vast majority of users can upgrade with no modifications.
The breaking changes are documented in the <a href="https://gofastmcp.com/development/upgrade-guide">upgrade guide</a>, but the main ones are:</p> <ul> <li><strong>Decorators return functions</strong> (set <code>FASTMCP_DECORATOR_MODE=object</code> for v2 behavior)</li> <li><strong>State methods are async</strong> (<code>await ctx.get_state()</code> instead of <code>ctx.get_state()</code>)</li> <li><strong>Auth providers require explicit configuration</strong> (no more auto-loading from env vars)</li> <li><strong><code>enabled</code> parameter removed from components</strong> (use the visibility system instead: <code>mcp.enable()</code> / <code>mcp.disable()</code>)</li> </ul> <hr /> <ul> <li><strong>Upgrade:</strong> <code>pip install fastmcp==3.0.0b2</code></li> <li><strong>Docs:</strong> <a href="https://gofastmcp.com">Read the new documentation</a></li> <li><strong>GitHub:</strong> <a href="https://github.com/jlowin/fastmcp">Star the repo</a></li> </ul> <p>Happy (context) engineering!</p> <h1>Introducing FastMCP 3.0 🚀</h1> <p><em>Move fast and make things.</em> · <a href="https://jlowin.dev/blog/fastmcp-3/">jlowin.dev/blog/fastmcp-3/</a> · Tue, 20 Jan 2026 12:00:00 GMT</p> <p>I have a confession to make.</p> <p>FastMCP 2.0 hides a dark secret. For the last year, we have been scrambling. We were riding the adoption curve of one of the fastest-growing technologies on the planet, trying to keep up with a spec that seemed to change every week.</p> <p>On the one hand, it worked. <strong>FastMCP 1.0</strong> proved the concept so well that Anthropic made it the foundation of the official MCP SDK. <strong>FastMCP 2.0</strong> introduced the features necessary to build a real server ecosystem, coinciding with the massive MCP hype wave.
The community responded: Today, FastMCP is downloaded <strong>a million times a day</strong>, and some version of it powers 70% of all MCP servers.</p> <p>But as someone who cares deeply about framework design, the way v2 evolved was frustrating. It was <em>reactive</em>. We were constantly bolting on new infrastructure to match the present, hacking in new features just to make sure you didn't have to build them yourself.</p> <p>Over the last year, something shifted. We had enough data, from millions of downloads and countless conversations with teams building real servers, to see the patterns underneath all the ad-hoc features. We could finally see what a "designed" framework would look like.</p> <p><strong>FastMCP 3.0 is that framework.</strong></p> <p>It is the platform MCP deserves in 2026, built to be as <strong>durable</strong> as it is <strong>future-proof</strong>.</p> <p>We are moving beyond simple "tool servers." We are entering the era of <strong>Context Applications</strong>—rich, adaptive systems that manage the information flow to agents.</p> <p>The real challenge was never implementing the protocol. It's delivering the right information at the right time. 
FastMCP 3 is built for that:</p> <ul> <li><strong>Source</strong> components from anywhere.</li> <li><strong>Compose</strong> and transform them freely.</li> <li><strong>Personalize</strong> what each user sees.</li> <li><strong>Track</strong> state across sessions.</li> <li><strong>Control</strong> access at every level.</li> <li><strong>Run</strong> long operations in the background.</li> <li><strong>Version</strong> your APIs.</li> <li><strong>Observe</strong> everything.</li> </ul> <p>It's time to move fast and make things.</p> <h3>🏁 Get Started</h3> <p>FastMCP 3.0.0 beta 2 is <a href="https://gofastmcp.com/getting-started/installation">available now</a>.</p> <p>For a deeper dive into all the new features, read the <a href="/blog/fastmcp-3-whats-new">What's New in FastMCP 3.0</a> post and the <a href="/blog/fastmcp-3-beta-2">Beta 2 announcement</a>.</p> <h2>The Architecture</h2> <p>FastMCP 2 was a collection of features. FastMCP 3 is a system built on three fundamental primitives. If you understand these, you understand the entire framework.</p> <ol> <li><strong>Components</strong> define the logic.</li> <li><strong>Providers</strong> source the components.</li> <li><strong>Transforms</strong> shape the components.</li> </ol> <p>A <strong>Component</strong> is the atom of MCP—specifically, a <strong>Tool</strong>, <strong>Resource</strong>, or <strong>Prompt</strong>. While they often wrap Python functions or data sources to define their business logic, the Component itself is the standardized interface that the model interacts with.</p> <p>A <strong>Provider</strong> answers the question: <em>"Where do the components come from?"</em> They can come from Python decorators, a directory of files, an OpenAPI spec, a remote MCP server, or pretty much anything else.
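</p>

<p>A toy model (illustrative only, not the real FastMCP API) shows how little machinery the provider/transform split needs:</p>

```python
# Toy primitives: a Component is data, a Provider yields components,
# and a Transform wraps a Provider to reshape its output.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Component:
    name: str

class ListProvider:
    """Answers 'where do components come from?' -- here, a plain list."""
    def __init__(self, components):
        self.components = components
    def list(self):
        return list(self.components)

class PrefixTransform:
    """Middleware for a provider: namespaces every component name."""
    def __init__(self, provider, prefix):
        self.provider, self.prefix = provider, prefix
    def list(self):
        return [replace(c, name=f"{self.prefix}_{c.name}")
                for c in self.provider.list()]

base = ListProvider([Component("add"), Component("search")])
mounted = PrefixTransform(base, "math")  # "mounting" = provider + prefix transform
print([c.name for c in mounted.list()])  # ['math_add', 'math_search']
```

<p>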
In fact, a FastMCP server is itself just a Provider that happens to speak the MCP protocol.</p> <p>A <strong>Transform</strong> functions as middleware for Providers. It allows you to modify the behavior of a Provider without touching its code. This decouples the author from the consumer: Person A can source the tools (via a Provider), while Person B adapts them to their specific environment (via a Transform)—renaming them, adding namespaces to prevent collisions, filtering versions, or applying security rules.</p> <p>The real power lies in the <strong>composition</strong> of these primitives.</p> <p>In v2, "mounting" a sub-server was a massive, specialized subsystem. In v3 it's just a <strong>Provider</strong> (sourcing the components) plus a <strong>Transform</strong> (adding a namespace prefix).</p> <p>Proxying a remote server? That's a <strong>Provider</strong> backed by a FastMCP client.</p> <p>Hiding developer tools from read-only users? That's a <strong>Transform</strong> applied to a specific session.</p> <p>This architecture means features that used to require massive amounts of glue code now fall out naturally from the design. It allows us to ship a massive amount of new functionality without breaking the foundation.</p> <h2>Sourcing Context: Providers</h2> <p>Because the architecture is decoupled, we can now source components from anywhere.</p> <h3>LocalProvider</h3> <p>This workhorse powers the classic FastMCP experience you know and love. You define a function, decorate it with <code>@tool</code>, and it becomes a component. It is simple, explicit, and remains the best way to get started. But what if your tools aren't local?</p> <h3>FileSystemProvider</h3> <p>This is a fundamentally different way to organize MCP servers. Instead of importing a server instance and decorating functions, you point the provider at a directory. It scans the files, finds the components, and builds your interface. 
With <code>reload=True</code>, it watches those files and updates the server instantly on any change.</p> <h3>SkillsProvider</h3> <p>Skills are having a moment. Claude Code, Cursor, Copilot—they all learn new capabilities from instruction files. SkillsProvider exposes these as MCP resources, which means any MCP client can discover and download skills from your server. We're delivering skills over MCP. It's a small example of what happens when "where do components come from?" becomes an open question: <a href="https://x.com/aaazzam">someone</a> had a weird idea, wrote a provider, and now it's a capability.</p> <h3>OpenAPIProvider</h3> <p>This feature was so popular in FastMCP 2 that people stopped designing servers and started regurgitating REST APIs, <a href="https://www.jlowin.dev/blog/stop-converting-rest-apis-to-mcp">forcing me to write a blog post asking you to stop</a>. But we know: it's useful. In FastMCP 3, OpenAPI returns as a provider. It is available for responsible use, and when paired with <strong>ToolTransforms</strong> (to rename and curate the output), it finally becomes a tool for building <em>good</em> context rather than blindly accumulating <em>more</em> of it.</p> <h2>Production Realities</h2> <p>FastMCP 2 was great for scripts. FastMCP 3 is built for systems that need to survive in production.</p> <h3>Component Versioning</h3> <p>This was a massive request. You can now serve multiple versions of a tool side-by-side using the <code>@tool(version="1.0")</code> parameter. FastMCP automatically exposes the highest version to clients, while preserving older versions for legacy compatibility. You can even use a <strong>VersionFilter</strong> transform to run a "v1 Server" and a "v2 Server" from the exact same codebase.</p> <h3>Authorization &amp; Security</h3> <p>We introduced OAuth in v2, but v3 gives you granular control. You can attach authorization logic to individual components using the <code>auth</code> parameter. 
You can also apply <strong>AuthMiddleware</strong> to gate entire groups of components (e.g., by tag) for defense-in-depth.</p> <h3>Native OpenTelemetry</h3> <p>Observability is no longer an afterthought. FastMCP 3 has native OpenTelemetry instrumentation. Drop in your OTEL configuration, and every tool call, resource read, and prompt render is traced with standardized attributes. You can finally see exactly where your latency is coming from.</p> <h3>Background Tasks</h3> <p>We've integrated support for SEP-1686, allowing tools to kick off long-running background tasks via Docket integration. This prevents tool timeouts on heavy workloads while keeping the agent responsive.</p> <h2>Developer Joy</h2> <p>We heard you. You wanted a framework that felt less like a hacked-together library and more like a modern Python toolchain.</p> <ul> <li><strong>Hot Reload:</strong> <code>fastmcp dev server.py</code> watches your files and reloads instantly. No more kill-restart cycles.</li> <li><strong>Callable Functions:</strong> In v2, decorators turned your functions into objects. In v3, your functions stay functions. You can import them, call them, and unit test them just like normal Python code.</li> <li><strong>Sync that Works:</strong> Synchronous tools are now automatically dispatched to a threadpool, meaning a slow calculation won't block your server's event loop.</li> </ul> <h2>Playbooks</h2> <p>I want to close by showing you why this architecture actually matters.</p> <p>A common problem in MCP is "context crowding." If you dump 500 tools into a context window, the model gets confused. You want <strong>progressive disclosure</strong>: start with a few tools, and reveal more based on the user's role or the conversation state.</p> <p>In FastMCP 3, we don't need a special "Progressive Disclosure" feature[^pd]. 
We just compose the primitives we've already built:</p> <ol> <li><strong>Providers</strong> to source the hidden tools.</li> <li><strong>Visibility</strong> to hide them by default.</li> <li><strong>Auth</strong> to act as the gatekeeper.</li> <li><strong>Session State</strong> to remember who has unlocked what.</li> </ol> <p>Here is what that looks like. We mount a directory of admin tools, hide them from the world, and then provide a secure, authenticated tool that unlocks them <em>only for the current session</em>.</p> <pre><code>from fastmcp import FastMCP, Context
from fastmcp.server.auth import require_scopes
from fastmcp.server.providers import FileSystemProvider

mcp = FastMCP("Enterprise Server")

# 1. Source admin tools from a file system
admin_provider = FileSystemProvider("./admin_tools")
mcp.mount(admin_provider)

# 2. Hide them by default using the Visibility system
mcp.disable(tags={"admin"})

# 3. Create a gatekeeper tool with Authorization
@mcp.tool(auth=require_scopes("super-user"))
async def unlock_admin_mode(ctx: Context):
    """Unlock administrative tools for this session."""
    # 4. Modify Session State to reveal the hidden tools
    await ctx.enable_components(tags={"admin"})
    return "Admin mode unlocked. New tools are available."
</code></pre> <p>The agent connects, sees a safe environment, authenticates, and the server <em>evolves</em> to match the new trust level.</p> <p>This composition creates a new primitive entirely. When you chain these stateful unlocks together—revealing context A, which unlocks context B—you get what we call <strong>playbooks</strong>. Playbooks are a way to build dynamic MCP-native workflows. More on them soon!</p> <p>This is the future of Context Applications. Static lists of API wrappers are being replaced by dynamic systems that actively guide the agent through a process.</p> <h2>The Future</h2> <p>We know that as capabilities grow, context windows get crowded.
The hundred tools that make your server powerful are the same hundred tools that overwhelm your agent.</p> <p>Our next wave of features is focused on <strong>context optimization</strong>: search transforms, curator agents, and deeper skills integration. The architecture of FastMCP 3 is specifically designed to support these patterns.</p> <p>Because what you <em>don't</em> show the agent matters just as much as what you do.</p> <p>Today, organizations with a competitive advantage don't have access to smarter AI. They have access to smarter context. FastMCP 3 is the fastest way to build it.</p> <p>It's available in beta <a href="https://gofastmcp.com/getting-started/installation">today</a>.</p> <p>Happy (context) engineering!</p> <hr /> <h3>About This Beta</h3> <p>FastMCP is an extremely widely used framework. While 3.0 introduces almost no breaking changes, we want to make sure that users aren't caught off guard. Therefore, the beta period will last a few weeks to allow for feedback and testing.</p> <p><strong>Install:</strong> <code>pip install fastmcp==3.0.0b2</code></p> <ul> <li><strong>Beta 2 Announcement:</strong> <a href="/blog/fastmcp-3-beta-2">FastMCP 3.0 Beta 2: The Toolkit</a></li> <li><strong>Upgrade Guide:</strong> <a href="https://gofastmcp.com/development/upgrade-guide">gofastmcp.com/development/upgrade-guide</a></li> <li><strong>Full Documentation:</strong> <a href="https://gofastmcp.com">gofastmcp.com</a></li> <li><strong>GitHub:</strong> <a href="https://github.com/jlowin/fastmcp">github.com/jlowin/fastmcp</a></li> </ul> <p>[^pd]: Though of course we'll have an amazing DX for it as patterns emerge.</p> <h1>Welcoming Bill Easton to the FastMCP Team</h1> <p><em>Merge to main</em> · <a href="https://jlowin.dev/blog/fastmcp-bill-easton/">jlowin.dev/blog/fastmcp-bill-easton/</a> · Tue, 02 Sep 2025 00:00:00 GMT</p> <p>FastMCP has quickly become the most popular framework for building
MCP servers, and its momentum has been incredible to watch. To sustain that growth and continue pushing the ecosystem forward, I'm thrilled to announce the appointment of FastMCP's first external maintainer: <strong><a href="https://linkedin.com/in/williamseaston">Bill Easton</a></strong>.</p> <p>Bill is the Director of Product Management for Observability at Elastic, and folks active on the FastMCP repository will certainly recognize him by his GitHub handle, <strong><a href="https://github.com/strawgate">@strawgate</a></strong>. From the very beginning, he has been one of the most prolific and thoughtful contributors to the project.</p> <p>Bill showed up months ago (long before MCP was cool) and immediately began insisting on the need for utilities to manipulate, transform, and compose MCP servers into more usable forms. He has consistently impressed me, not just with the care that goes into his work, but with how prescient he's been in looking around corners to anticipate the use cases that are now considered obvious in the MCP ecosystem.</p> <p>Many of Bill's contributions are the type we prize most in the open-source world: critical internal optimizations and fixes that ensure your server just works, even if you're not aware you're using them. They're the kind of thoughtful, careful improvements that make the difference between a demo and production-ready software.</p> <p>Bill is also the driving force behind our entire suite of <strong><a href="https://gofastmcp.com/patterns/tool-transformation">tool transformation features</a></strong>. These features are absolutely critical, especially for anyone generating MCP servers from large OpenAPI specs, and we expect them to become even more important as MCP servers get more complex.</p> <p>Here's the backstory: when we first launched OpenAPI conversion, it seemed like magic. Feed in your REST API spec, get out an MCP server. But the reality was messier. 
Those auto-generated servers were often so unusable that I wrote a whole blog post <a href="https://www.jlowin.dev/blog/stop-converting-rest-apis-to-mcp">urging people to stop using the feature</a>. The servers worked technically, but they were poisoning agents with hundreds of atomic, context-free operations.</p> <p>Bill saw this problem differently. Instead of abandoning OpenAPI conversion, he built a sophisticated transformation layer that lets you reshape, combine, and filter tools intelligently. His contributions turned what was an anti-pattern into a powerful, production-ready workflow. Today, some of the most sophisticated FastMCP deployments rely on these transformation features to create agent-friendly interfaces from complex REST APIs.</p> <p>What excites me most is what this represents for the project's future. Bill's appointment reflects a commitment to building a sustainable, community-driven project that can outlive any single contributor. The best open-source projects are those where leadership emerges naturally from the community, and Bill exemplifies exactly that kind of organic leadership.</p> <p>As such, as part of formalizing Bill's role, we're also introducing a more structured way for the community to engage with the project's development. Starting soon, we'll host <strong>meetings every two weeks</strong> to discuss the project. These will operate in an alternating fashion:</p> <ul> <li> <p><strong>Committer's Meeting (Bi-weekly):</strong> One session each month will be a closed meeting for maintainers to discuss and plan the project roadmap. To maintain transparency, notes and key decisions will be shared publicly.</p> </li> <li> <p><strong>Community Office Hours (Bi-weekly):</strong> The alternating session will be an open meeting for the entire community. These will be office-hours-style, with a published agenda ahead of time. 
While we welcome discussion, these forums won't be for debugging idiosyncratic issues that are better served by an asynchronous format like GitHub issues.</p> </li> </ul> <p>We're currently planning the first office hours session and will share details with the community soon.</p> <p>Happy engineering!</p> <ul> <li><strong>Contribute:</strong> Check out the code and examples on <a href="https://github.com/jlowin/fastmcp">GitHub</a></li> <li><strong>Explore:</strong> Dig into the <a href="https://gofastmcp.com/">FastMCP documentation</a></li> <li><strong>Upgrade:</strong> <code>uv add fastmcp</code> or <code>pip install fastmcp --upgrade</code></li> </ul> <h1>MCP Proxy Servers with FastMCP 2.0</h1> <p><em>Even AI needs a good travel adapter 🔌</em> · <a href="https://jlowin.dev/blog/fastmcp-proxy/">jlowin.dev/blog/fastmcp-proxy/</a> · Wed, 23 Apr 2025 00:00:00 GMT</p> <p>In the <a href="/blog/fastmcp-2">FastMCP 2.0 release post</a>, I highlighted how the project's focus has evolved from simply <em>creating</em> MCP servers in 1.0, to making it easier to <em>work with</em> the growing ecosystem in 2.0. MCP <a href="https://gofastmcp.com/patterns/composition">composition</a> addresses various ways of combining your MCP servers. But what about integrating with servers you don't control, like remote services, third-party tools, or servers using different transports?</p> <p>This is where <strong>proxying</strong> comes in. It's a core piece of the FastMCP 2.0 vision, enabling seamless interaction across the diverse landscape of MCP servers.</p> <h2>What is an MCP Proxy?</h2> <p>A FastMCP proxy acts as an intermediary, a kind of universal travel adapter.
You run a FastMCP server instance, but instead of implementing its own logic, it forwards requests to a designated <em>backend</em> MCP server.</p> <p>Here’s the flow:</p> <ol> <li>Your client application sends a request (e.g., <code>call tool XYZ</code>) to the local FastMCP proxy.</li> <li>The proxy receives this request and forwards it to the configured backend server (which could be anywhere, using any transport).</li> <li>The backend server processes the request and sends its response back to the proxy.</li> <li>The proxy relays this response to your original client.</li> </ol> <p><img src="./diagram.png" alt="" /></p> <p>To the client, it looks like a standard FastMCP interaction. Under the hood, the proxy handles the communication translation.</p> <p>This capability might seem simple, but it solves several practical challenges in building and using MCP-based systems:</p> <ul> <li> <p><strong>Transport Bridging</strong>: This is fundamental. Many powerful MCP servers might only be available via network transports like <code>sse</code> or <code>websocket</code>. However, local clients like Claude Desktop often expect to communicate via <code>stdio</code>. A FastMCP proxy running locally can bridge this gap, listening on <code>stdio</code> while communicating with the backend via <code>sse</code> (or vice-versa, or any other combination). This instantly makes remote or differently-transported servers accessible to local tools.</p> </li> <li> <p><strong>Interaction Simplification</strong>: Instead of managing connections to numerous backend servers with potentially different addresses and transports, applications can interact with a single, local proxy endpoint. The proxy handles the complexity of routing requests to the appropriate backend, streamlining client configuration.</p> </li> <li> <p><strong>Gateway Functionality</strong>: A proxy can serve as a controlled entry point to backend services. 
While base proxying forwards requests directly, the underlying <code>FastMCPProxy</code> class could be subclassed (for advanced users) to inject logic like request logging, caching, authentication checks, or even basic request/response modification, creating a more robust gateway.</p> </li> <li> <p><strong>Decoupling</strong>: Proxies decouple the client's required transport from the backend server's implementation. The backend server can change its transport or location, and only the proxy configuration needs updating, not every client application.</p> </li> </ul> <h2>Creating a Proxy</h2> <p>FastMCP makes creating a proxy straightforward using the <code>FastMCP.from_client()</code> class method. It leverages the standard <code>fastmcp.Client</code> to define the connection to the backend.</p> <pre><code>from fastmcp import FastMCP, Client

# 1. Configure a client for the backend server.
# This target could be anything the Client can connect to:
# - Remote SSE/WebSocket URL: Client("http://api.example.com/mcp/sse")
# - Local Python script: Client("path/to/backend_server.py")
# - Another FastMCP instance: Client(another_mcp_instance)
backend_client = Client("http://api.example.com/mcp/sse")

# 2. Create the proxy server instance from the client.
proxy_server = FastMCP.from_client(backend_client, name="MySmartProxy")

# 3. Run the proxy server (defaults to stdio)
if __name__ == "__main__":
    proxy_server.run()
    # You could run it on SSE instead if needed:
    # proxy_server.run(transport="sse", port=9001)
</code></pre> <p>When you call <code>FastMCP.from_client()</code>, it doesn't discover the backend components immediately. Instead, it stores the provided <code>backend_client</code> within the <code>proxy_server</code> instance. When the proxy server later receives a request (like <code>list_tools</code> or <code>call_tool</code>), it dynamically uses the stored <code>backend_client</code> at that moment to forward the request to the backend and relay the response.
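<p>The store-now, forward-later behavior can be sketched in plain Python (illustrative only; the <code>Backend</code> and <code>Proxy</code> classes below are hypothetical stand-ins, not the real <code>FastMCPProxy</code> implementation):</p>

```python
# Plain-Python sketch of "store the client now, forward requests later".
# Illustrative only: Backend and Proxy are hypothetical stand-ins for the
# backend MCP server and the FastMCP proxy described in the post.

class Backend:
    """Stands in for the backend MCP server; its tools can change at runtime."""

    def __init__(self):
        self.tools = {"add": lambda a, b: a + b}

    def call_tool(self, name, args):
        return self.tools[name](**args)


class Proxy:
    """Owns no logic of its own; stores a backend reference and forwards on demand."""

    def __init__(self, backend):
        # Nothing is discovered up front; the backend is simply stored.
        self.backend = backend

    def call_tool(self, name, args):
        # Forwarding happens at request time, so the proxy always sees
        # the backend's current state.
        return self.backend.call_tool(name, args)


backend = Backend()
proxy = Proxy(backend)
result = proxy.call_tool("add", {"a": 1, "b": 2})  # forwarded to the backend

# A tool registered on the backend later is visible through the proxy
# immediately, because nothing was snapshotted at construction time.
backend.tools["mul"] = lambda a, b: a * b
late_result = proxy.call_tool("mul", {"a": 3, "b": 4})
print(result, late_result)  # prints: 3 12
```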
This ensures the proxy always reflects the current state of the backend server. The result is a standard <code>FastMCP</code> instance, ready to run.</p> <h3>Proxies Love Composition</h3> <p>Crucially, because <code>FastMCP.from_client()</code> yields a standard <code>FastMCP</code> instance, these proxies integrate perfectly with FastMCP 2.0's composition model. You can call <code>mount()</code> or <code>import_server()</code> to compose a proxy server alongside other servers, just like any other FastMCP server.</p> <pre><code>from fastmcp import FastMCP, Client

main_app = FastMCP(name="CombinedApp")

@main_app.tool()
def local_utility():
    return "This tool runs directly in the main app."

# Assume proxy_server is created as shown before
proxy_server = FastMCP.from_client(...)

# Mount the proxy server instance under a prefix
main_app.mount("proxied_service", proxy_server)

# The main_app now exposes:
# - "local_utility" (its own tool)
# - "proxied_service_&lt;backend_tool_name&gt;" (tools from the backend via the proxy)
# - "proxied_service+&lt;backend_resource_uri&gt;" (resources from the backend via the proxy)

if __name__ == "__main__":
    main_app.run()
</code></pre> <p>This allows you to build sophisticated applications where a single FastMCP server acts as a unified interface to both local functionality and multiple remote or diverse backend MCP services.</p> <h2>Wrapping Up</h2> <p>Proxying is essential plumbing for a truly interoperable MCP ecosystem. FastMCP 2.0 makes it trivial to bridge transports, simplify client interactions, and integrate disparate MCP servers. By treating proxies as first-class FastMCP servers, we unlock flexible and powerful architectural patterns.</p> <p>If you haven't already, explore the proxying capability – it might just be the missing piece for connecting your AI workflows. Check out the <a href="https://gofastmcp.com/patterns/proxy">proxying docs</a> for further details.</p> <p>Happy Connecting!
🔌</p> Stop Vibe-Testing Your MCP Server (https://jlowin.dev/blog/stop-vibe-testing-mcp-servers/) | Your tests are bad and you should feel bad | Wed, 21 May 2025 00:00:00 GMT<p>import Callout from "@components/blog/Callout.astro";</p> <p>If you're working with the Model Context Protocol (MCP), you're on the front lines of AI innovation. But amidst the excitement of creating intelligent agents and sophisticated AI workflows, I need to ask: <strong>how are you actually testing these critical MCP components?</strong></p> <p>Too often, the answer looks something like this: fire up an agent framework, type a few prompts into a chat window, and if the LLM <em>seems</em> to produce a reasonable output, call it a day. This, my friends, is <strong>vibe-testing.</strong></p> <p>To be fair, this isn't entirely surprising. The MCP ecosystem is young, and the developer tooling is still catching up to the rapid pace of protocol adoption. However, while vibe-testing might seem pragmatic given the tooling landscape, it's a fast track to unreliable systems, wasted tokens, and downright painful debugging sessions.</p> <p>MCP servers are the APIs that connect LLMs to the real world. And like any critical API, they demand rigorous, deterministic testing to ensure they are reliable, predictable, and robust—especially when the primary consumer is a non-deterministic LLM.</p> <p>&lt;blockquote class="twitter-tweet"&gt;&lt;p lang="en" dir="ltr"&gt;A QA engineer walks into a bar. Orders a beer. Orders 0 beers. Orders 99999999999 beers. Orders a lizard. Orders -1 beers. Orders a ueicbksjdhd.&lt;br/&gt;&lt;br/&gt;First real customer walks in and asks where the bathroom is.
The bar bursts into flames, killing everyone.&lt;/p&gt;— Brenan Keller (@brenankeller) &lt;a href="https://twitter.com/brenankeller/status/1068615953989087232?ref_src=twsrc%5Etfw"&gt;November 30, 2018&lt;/a&gt;&lt;/blockquote&gt;</p> <p>This joke hits alarmingly close to home in the MCP world. Traditionally, QA engineers intentionally probe boundaries. With MCP, <em>your LLM client is a chaos agent.</em> LLMs can generate unexpected or malformed inputs, explore edge cases you never envisioned, or chain calls in ways that defy simple logic. If your MCP server isn't hardened against this onslaught of creative inputs, it's not a question of <em>if</em> things will go sideways, but <em>when</em> your proverbial bar bursts into flames, potentially on the most mundane of "customer" requests.</p> <p>The core issue with relying on LLM-based "vibe-testing" is that it's:</p> <ul> <li><strong>Stochastic:</strong> What works once might not work again. You cannot build reliable systems on a foundation of "maybe."</li> <li><strong>Slow &amp; Expensive:</strong> Each "test" involves LLM interactions, racking up latency and API costs. A proper test suite should be efficient.</li> <li><strong>Opaque:</strong> When something breaks, pinpointing the cause—is it your server, the LLM's interpretation, the agent framework, or the prompt?—becomes a frustrating detective game.</li> <li><strong>Superficial:</strong> Natural language interactions rarely achieve the comprehensive coverage needed to find subtle bugs or validate all edge cases.</li> </ul> <p>It's imperative that your server's logic is either impeccably clear or that its error messages are so precise they can effectively guide an LLM back on track. Neither of these is achievable without rigorous, focused testing. 
While iterating on your instructions to help LLMs "do the right thing" is valuable, robust server-side logic and error handling are non-negotiable.</p> <h3>Testing is Trust (and Good Engineering)</h3> <p>I was incredibly fortunate to start Prefect alongside <a href="https://www.linkedin.com/in/whitecdw/">Chris White</a>, who instilled in me a deep appreciation for the true value of testing. Proper testing serves a deeper purpose than merely affirming your code runs; it's a fundamental practice for <strong>documenting behavior</strong>, <strong>preventing regressions</strong>, and building <strong>deep trust</strong> in your codebase.</p> <p>Chris's philosophy, which we can bring to bear here, emphasizes that:</p> <ul> <li>Unit tests should be <em>atomic</em>, targeting the smallest possible unit of behavior.</li> <li><em>Tests and design go hand-in-hand:</em> if something is hard to test, its design might be flawed. Test-driven development can be particularly effective when defining new user-facing contracts.</li> <li>Tests must <em>clearly document</em> the behavior and expectations that are important to your application. A failing test's <em>title alone</em> should strongly indicate what's broken.</li> <li>Tests should verify <em>expected, assertable behavior</em>, rather than being tightly coupled to specific implementation details. This allows for refactoring with confidence.</li> <li>Critically, tests should <em>not unnecessarily block future paths</em> or refactors. They guard core contracts, not incidental details, fostering an environment built for change.</li> </ul> <p>This philosophy is about creating a safety net that allows for rapid iteration and confident development. 
When your MCP server is the component bridging the deterministic world of your code with the probabilistic world of LLMs, this trust and safety net become absolutely paramount.</p> <h3>In-Memory Testing with FastMCP</h3> <p>FastMCP 2.0 was designed to make rigorous testing easy, not an afterthought. The key to this is FastMCP's support for <strong>in-memory testing.</strong></p> <p>With FastMCP, you can instantiate a <code>fastmcp.Client</code> and connect it <em>directly</em> to your <code>FastMCP</code> server instance by providing the server as the client's transport target:</p> <pre><code>from fastmcp import FastMCP, Client

mcp = FastMCP(name="My MCP Server")

@mcp.tool()
def add(a: int, b: int) -&gt; int:
    return a + b

test_client = Client(mcp)  # Connects the client directly to the server instance
</code></pre> <p>This direct, in-memory connection is a game-changer for testing MCP servers because:</p> <ul> <li>💨 <strong>There's no network overhead:</strong> Communication is as fast as a direct Python call.</li> <li>🧘 <strong>No subprocess management is needed:</strong> You don't have to start and stop external server processes for your tests.</li> <li>🎯 <strong>You're testing your actual server logic:</strong> No mocks or simplified protocol implementations are needed; the client speaks the real MCP protocol to your server over an in-memory transport for maximum fidelity.</li> </ul> <p>Once you have this <code>test_client</code>, you can use its methods to interact with your server just like an LLM, but with the benefit of repeatable determinism and low latency.
For example, within an <code>async with test_client:</code> block, you can:</p> <ul> <li>Ping the server: <code>is_alive = await test_client.ping()</code></li> <li>List available tools: <code>tools = await test_client.list_tools()</code></li> <li>Call a specific tool: <code>response = await test_client.call_tool("add", {"a": 1, "b": 2})</code></li> <li>Read a resource: <code>content = await test_client.read_resource("resource://your/data")</code></li> </ul> <p>...and more, including advanced MCP features like logging, progress reporting, and LLM client sampling. Please review FastMCP's <a href="https://gofastmcp.com/clients/client">client docs</a> for more details.</p> <p>Just as important, your tests are not just validating isolated functions; they're confirming your server's behavior through the actual MCP interaction layer, albeit without network latency.</p> <p>The result? Your tests become:</p> <ul> <li>⚡ <strong>Blazingly Fast:</strong> Run them as part of your normal <code>pytest</code> suite in milliseconds.</li> <li>🧪 <strong>Deterministic:</strong> Get consistent, repeatable results every single time.</li> <li>🎯 <strong>Focused:</strong> Isolate and test your server's tool, resource, and prompt logic precisely.</li> <li>🐍 <strong>Pythonic:</strong> Write your tests using the testing tools and patterns you already know and love.</li> </ul> <p>You'll find yourself writing <em>more</em> tests, not fewer, because testing your MCP functionality becomes as quick and easy as testing any other Python function.
Since everything runs in-process, you can use mocks, fixtures, and other familiar testing tools without hesitation.</p> <p>Here's how you can structure your tests using <code>pytest</code>:</p> <pre><code># tests/test_server.py
import pytest
from fastmcp import FastMCP, Client
from mcp.types import TextContent  # For type checking results

# A reusable fixture for our MCP server
@pytest.fixture
def mcp_server():
    mcp = FastMCP(name="CalculationServer")

    @mcp.tool()
    def add(a: int, b: int) -&gt; int:
        return a + b

    return mcp

# A straightforward test of our tool
async def test_add_tool(mcp_server: FastMCP):
    async with Client(mcp_server) as client:  # Client uses the mcp_server instance
        result = await client.call_tool("add", {"a": 1, "b": 2})
        assert isinstance(result[0], TextContent)
        assert result[0].text == "3"
</code></pre> <p>&lt;br/&gt;</p> <p>&lt;Callout color='red'&gt; <strong>Nerd note:</strong> we did not put the client in a fixture, like this:</p> <pre><code># Don't do this!
@pytest.fixture
async def client(mcp_server: FastMCP):
    async with Client(mcp_server) as client:
        yield client
</code></pre> <p>That's because <code>pytest</code>'s async fixtures and tests can run in <strong>different event loops</strong>. This can lead to runtime errors related to task cancellation when the <code>Client</code>'s <code>async with</code> block (which manages an <code>anyio</code> task group from the underlying MCP SDK) spans across these different loops. Instantiating the client directly within the test function ensures it operates within the test's event loop.
&lt;/Callout&gt;</p> <p>This robust approach allows you to comprehensively test:</p> <ul> <li>Correct tool logic for a wide range of valid inputs.</li> <li>Graceful error handling for invalid inputs (your "lizard" cases!) or internal server exceptions.</li> <li>Accurate content delivery for your static resources and dynamic resource templates.</li> <li>Correct rendering of prompts with various parameter combinations.</li> <li>Complex interactions involving the <code>Context</code> object, such as logging, progress reporting, and inter-resource data access.</li> </ul> <p>Instead of merely hoping your LLM client interprets things correctly, you are <em>asserting</em> that your server behaves exactly as designed under a multitude of conditions.</p> <h3>Beyond FastMCP: Testing Any MCP Server</h3> <p>The <code>fastmcp.Client</code> isn't limited to in-memory testing of FastMCP servers you built yourself. It's a versatile tool for interacting with <em>any</em> MCP-compliant server. This means you can write expansive tests for any MCP behavior you want to ensure is reliable and consistent, regardless of the server's implementation.</p> <p>In addition to supplying the client with an explicit transport configuration (like <code>StdioTransport</code> or <code>StreamableHttpTransport</code>), you can often rely on its ability to automatically infer the appropriate transport based on the URL or command string you provide.
In the following example, all client objects expose the exact same interface for testing, regardless of how they are instantiated:</p> <pre><code>from fastmcp import Client

# A remote server
async def test_remote_mcp_server():
    async with Client("http://some.api.service/mcp_endpoint") as client:
        await client.call_tool("some_tool", {"key": "value"})

# A local Node.js server script
async def test_local_js_server():
    async with Client('path/to/local/server.js') as client:
        await client.read_resource("resource://path/to/resource")

# Two remote servers configured via an MCP config into a FastMCP proxy server
async def test_mcp_config_server():
    mcp_config = {
        'mcpServers': {
            "github": {
                "command": "npx",
                "args": ["-y", "@modelcontextprotocol/server-github"],
                "env": {"GITHUB_PERSONAL_ACCESS_TOKEN": "&lt;YOUR_TOKEN&gt;"}
            },
            'paypal': {'url': 'https://mcp.paypal.com/sse'}
        }
    }
    # The client infers that it should create a FastMCPProxy for this config
    async with Client(mcp_config) as client:
        await client.call_tool("github_get_user_repos", {"username": "jlowin"})
</code></pre> <p>Your MCP servers form a critical layer in your AI stack. They are the deterministic bedrock upon which the more unpredictable LLM interactions are built. If this foundation is unreliable, your entire AI application becomes fragile.</p> <p>FastMCP's testing capabilities, especially its in-memory testing, are designed to help you build this foundation with confidence and rigor. Stop relying on "looks good to me" vibe-checks through a chat window.
Start writing focused, repeatable tests that prove your server does exactly what it's supposed to do.</p> <p>Your AI, your users, and your sanity will thank you for it.</p> <hr /> <p><em>Take control of your MCP server:</em></p> <ul> <li>Star <strong><a href="https://github.com/jlowin/fastmcp">FastMCP on GitHub</a></strong> and explore the <strong><a href="https://gofastmcp.com">docs</a></strong>.</li> <li>Get started: <code>uv pip install fastmcp</code>.</li> </ul> Over the Horizon (https://jlowin.dev/blog/horizon/) | Automation for the Context Era | Mon, 01 Dec 2025 00:00:00 GMT<p>Every company will have a context layer.</p> <p>We're building Prefect Horizon to make that possible.</p> <p>The context layer is where your AI agents interface with your business. It's where teams expose their proprietary data, tools, and workflows to autonomous systems. MCP is the technology that makes this possible, and we spent the last year making FastMCP the standard framework for working with it.</p> <p>We've watched FastMCP grow to more than a million downloads a day. We've watched users build tens of thousands of servers on FastMCP Cloud. And we've watched what happens when MCP scales in an organization. Every company we talk to hits the same walls: "Where do I host these servers? How do I know what's been deployed? Who's allowed to access what?"</p> <p>We built Horizon to solve this problem. Its core is Prefect's enterprise MCP gateway: deployment, registries across internal and external servers, and governance down to the tool level. It will be widely available early next year.</p> <p>On that foundation, we're building the innovations that will power your context layer. Remix any combination of servers into curated endpoints for each use case. Build playbooks that progressively disclose tools as agents navigate complex workflows.
Give business users an agentic interface to your company without their ever knowing what "MCP" is.</p> <p>We've seen what's coming.</p> <p>See you over the Horizon.</p> Most MCP Usage is Invisible (https://jlowin.dev/blog/internal-mcp/) | What you don't see is what you get | Tue, 02 Dec 2025 00:00:00 GMT<p>Most MCP usage is invisible.</p> <p>Many of us (myself included!) expected MCP to take off as something like an "App Store" for agent-facing business logic. Companies publish servers, customers use them, ecosystems form, ???, profit.</p> <p>That may still happen. But it's not what's happening now.</p> <p>Today, there's a small number of widely-used public servers - GitHub, Linear, maybe a handful of others - and then a very long tail of servers with one or zero users. If you're casually observing MCP usage, you'd be forgiven for thinking adoption is quite limited.</p> <p>But inside modern organizations, it's a different story. The use of MCP to serve internal data and workflows has exploded, with first-party servers solving proprietary problems. The protocol has become the standard for internal connectivity, not external distribution, and it isn't visible on public registries because it isn't meant for the public.</p> <p>This is where the real activity is, and it's far beyond what most people realize.</p> The Inverted Agent (https://jlowin.dev/blog/the-inverted-agent/) | How a boring MCP spec update flips the AI stack upside down | Fri, 05 Dec 2025 00:00:00 GMT<p>I think <strong>SEP-1577</strong> is the sleeper hit of the new Model Context Protocol (MCP) specification.</p> <p>Hidden behind a dry title ("Sampling with Tools") is a feature that enables a complete architectural inversion of how we build and deploy AI agents.</p> <p>"Sampling" is the mechanism by which an MCP server asks the client's LLM to generate text (e.g., "Hey Claude, summarize this data").
When I started building FastMCP 2.0 back in April 2025, this was the feature that excited me the most -- and <a href="https://jlowin.dev/blog/fastmcp-2#-client-infrastructure--llm-sampling">here's proof</a>!</p> <p>But as far as I can tell, it has a grand total of approximately one power user: FastMCP maintainer <strong>Bill Easton</strong>.</p> <p><em>(Edit: an hour after posting this, I found the other power user. Unsurprisingly, it's Angie Jones, who just shared a <a href="https://block.github.io/goose/blog/2025/12/04/mcp-sampling/">fantastic blog post</a> on MCP sampling.)</em></p> <p>While the rest of us were building standard tools, Bill was pushing sampling as far as it could go. He hacked together tool calling, structured results, and agentic loops on top of the previous, very limited version of the protocol. He saw the potential before the spec even supported it.</p> <p>Now, with <a href="https://github.com/modelcontextprotocol/modelcontextprotocol/issues/1577">SEP-1577</a>, the official spec has caught up to Bill's vision. And the more time I spend with it, the more mind-bending I find it. It looks and feels exactly like every agent framework I've ever used, but the deployment model is completely backwards.</p> <h3>How We Build Agents Today</h3> <p>To understand the shift, look at how we build agents today.</p> <p>In frameworks like Pydantic AI, LangChain, or Prefect's very own <a href="https://github.com/PrefectHQ/marvin">Marvin</a>, the "Agent" is a capital-C-<strong>Client</strong>. It is a Python script running on your machine. It holds the state, the system prompt, and the loop that decides which steps to take next.</p> <p>In this model, the <strong>Server</strong> provides remote LLM completion functionality. It does not run custom code, dictate logic, or even hold state. The Client orchestrates all activity.</p> <p>This works, but it has a massive <strong>distribution problem</strong>. 
If I want to share my "Code Janitor Agent" with you, I have to send you a repository. You have to install Python, manage dependencies, set up environment variables, and run the script. The "Agency" is locked inside my local environment.</p> <h3>Flip It</h3> <p>SEP-1577 flips this stack upside down.</p> <p>It allows an MCP Server to define a sampling request that <em>includes tools</em>. The Server can now say to the Client:</p> <blockquote> <p>"Here's a goal, and here are the tools you need to achieve it. You provide the raw intelligence, but I'll control the flow."</p> </blockquote> <p>The Server holds the prompt. The Server holds the workflow logic. The Server holds the tools. It can change them at any time.</p> <p>When you connect a generic client—like Claude Desktop, Cursor, or a simple IDE plugin—to this server, the client doesn't need to know <em>anything</em> about the agent's logic. It just acts as the compute engine. The Server effectively <strong>"borrows" the Client's LLM</strong> to drive its own internal agent.</p> <h3>But Text Is Useless</h3> <p>There's a problem with raw sampling: it returns natural language text.</p> <p>MCP servers are programmatic. They need to parse, validate, and act on data. Getting back "The temperature is about 72 degrees and it's partly cloudy" is almost useless as a building block—you'd have to parse the text all over again just to extract the values.</p> <p>FastMCP solves this by layering structured output on top of SEP-1577's sampling primitives:</p> <pre><code>from pydantic import BaseModel

class Weather(BaseModel):
    temperature: float
    conditions: str

@mcp.tool
async def get_weather(city: str, ctx: Context) -&gt; Weather:
    result = await ctx.sample(
        f"What is the current weather in {city}?",
        result_type=Weather,
    )
    return result.result
</code></pre> <p>The server borrows the client's LLM, but gets back typed, validated data it can actually use. No parsing. No hoping the format is right.
Just a <code>Weather</code> object.</p> <p>This alone makes sampling practical. But the real power comes when you add tools.</p> <h3>Now Add Tools</h3> <p>Layer in tools, and things get interesting. We are adding first-class support for this in FastMCP, and what's wild is how familiar the code looks. You write what looks like a standard client-side agent loop, but you deploy it as a server-side tool.</p> <p>Here is what it looks like to build a research agent that uses structured output and tools, running entirely on the server:</p> <pre><code>from fastmcp import FastMCP, Context
from fastmcp.server.sampling import sampling_tool
from pydantic import BaseModel

mcp = FastMCP("Research Agent")

# Define the output schema
class ResearchReport(BaseModel):
    summary: str
    sources: list[str]

# Define a helper tool for the agent
@sampling_tool
def search_web(query: str) -&gt; str:
    """Search the web for information."""
    return f"Results for: {query}"

@mcp.tool
async def generate_report(topic: str, ctx: Context) -&gt; ResearchReport:
    """A tool that acts as an autonomous research agent."""
    # The server orchestrates the loop!
    result = await ctx.sample(
        messages=[f"Research {topic} and summarize."],
        tools=[search_web],
        result_type=ResearchReport,
        max_iterations=5,
    )
    return result.result
</code></pre> <p>If you've used Pydantic AI or Marvin, this pattern—passing tools and a result type to an LLM—is second nature.</p> <p>The difference is that <strong>this isn't a script.</strong> It's a tool on an MCP server.</p> <p>Because of this, I don't need to ship you a Python environment to run this agent. I just give you the server connection.
You connect Claude Desktop to it, ask "Generate a report on FastMCP," and <em>your</em> Claude instance instantly knows how to perform the research, call the web search tool, loop 5 times, and return the structured report.</p> <h3>Universal Clients</h3> <p>We are moving from a world of <strong>"Thick Clients"</strong> to <strong>"Universal Clients."</strong></p> <p>This solves the distribution problem for complex agentic workflows. You can wrap sophisticated logic—loops, chains of thought, structured validation—inside a standard MCP server. Any client that connects instantly "becomes" that agent.</p> <p>It is effectively "Write Once, Run Anywhere" for AI agents.</p> <p>We are shipping support for this in <a href="https://github.com/jlowin/fastmcp/pull/2551">FastMCP</a> as soon as the upstream SDK creates a stable foundation for it. Until then, keep an eye on the repo... and maybe send Bill a thank you note.</p> Centuries of Pain and Sorrow (https://jlowin.dev/blog/memes/centuries-of-pain-and-sorrow/) | Fri, 04 Oct 2024 00:00:00 GMT<p><img src="./centuries-of-pain-and-sorrow.jpg" alt="" /></p> Does o1 Mean Agents Are Dead? (https://jlowin.dev/blog/does-o1-mean-agents-are-dead/) | Reasoning is an iterative behavior, but it's not agentic. | Fri, 13 Sep 2024 09:00:00 GMT<p>OpenAI's o1's reasoning ability is essentially chain-of-thought behind the API, which makes it a much more powerful model for handling nuanced problems. Chain-of-thought invites the LLM to iteratively "think out loud" by revisiting its internal monologue until it has produced a satisfactory answer. We know that this produces superior results to one-shot responses, and here it's been productized.</p> <p>Agents are also characterized by iterative behavior.
But there's a key difference: while models like o1 iterate internally to refine their reasoning, agents engage in iterative interactions with the external world. They perceive the environment, take actions, observe the outcomes (or side effects), and adjust accordingly. This recursive process enables agents to handle tasks that require adaptability and responsiveness to real-world changes.</p> <p>So o1 implements a behavior we would formerly have mimicked with a simple agentic workflow, at greater expense and latency. This is good! But internal reasoning does not replace the <em>outcome-driven behaviors</em> that characterize the promise of AI agents.</p> <p>In the limit, if o1 were itself an "agent" by any definition, capable of acting on its own, we would still want to formalize methods of deploying it against a specific objective in a repeatable, observable manner.</p> <p>&lt;blockquote class="twitter-tweet"&gt;&lt;p lang="en" dir="ltr"&gt;Does OpenAI's o1 mean agents are dead?&lt;br/&gt;&lt;br/&gt;o1's reasoning ability is essentially chain-of-thought behind the API, which makes it a much more powerful model for handling nuanced problems. Chain-of-thought invites the LLM to iteratively "think out loud" by revisiting its internal…&lt;/p&gt;— Jeremiah Lowin (@jlowin) &lt;a href="https://twitter.com/jlowin/status/1834722014839418962?ref_src=twsrc%5Etfw"&gt;September 13, 2024&lt;/a&gt;&lt;/blockquote&gt;</p> An Open-Source Maintainer's Guide to Saying No (https://jlowin.dev/blog/oss-maintainers-guide-to-saying-no/) | Stewardship in the age of cheap code | Sat, 13 Sep 2025 00:00:00 GMT<p>import Callout from "@components/blog/Callout.astro";</p> <p>One of the hardest parts of maintaining an open-source project is saying <strong>"no"</strong> to a good idea. A user proposes a new feature. It’s well-designed, useful, and has no obvious technical flaws.
And yet, the answer is <strong>"no."</strong> To the user, this can be baffling. To the maintainer, it’s a necessary act of stewardship.</p> <p>Having created and maintained two highly successful open-source projects, <a href="https://github.com/PrefectHQ/prefect">Prefect</a> and <a href="https://github.com/jlowin/fastmcp">FastMCP</a>, helped establish a third in Apache Airflow, and cut my OSS teeth contributing to Theano, I’ve learned that this stewardship is the real work. The ultimate success of a project isn't measured by the number of features it has, but by the coherence of its vision and whether it finds resonance with its users. As Prefect's CTO <a href="https://www.linkedin.com/in/whitecdw/">Chris White</a> likes to point out:</p> <blockquote> <p>"People choose software when its abstractions agree with their mental model."</p> </blockquote> <p>Your job as an open-source maintainer is to first establish that mental model, then relentlessly build software that reflects it. A feature that is nominally useful but not spiritually aligned can be a threat just as much as an enhancement.</p> <p>This threat can take many forms. The most obvious is a feature that's wildly out of scope, like a request to add a GUI to a CLI tool -- a valid idea that likely belongs in a separate project. More delicate is the feature that brilliantly solves one user's niche problem but adds complexity and maintenance burden for everyone else. The most subtle, and perhaps most corrosive, is the API that's simply "spelled" wrong for the project: the one that breaks established patterns and creates cognitive dissonance for future users. In many of the projects I've been fortunate to work on, both open- and closed-source, we obsess over this because a consistent developer experience is the foundation of a framework that feels intuitive and trustworthy.</p> <p>So how does a maintainer defend this soul, especially as a project scales? 
It starts with documenting not just how the project works, but why. Clear developer guides and statements of purpose are your first line of defense. They articulate the project's philosophy, setting expectations before a single line of code is written. This creates a powerful flywheel: the clearer a project is about why it exists, the more it attracts contributors who share that vision. Their contributions reinforce and refine that vision, which in turn justifies the project’s worldview. Process then becomes a tool for alignment, not bureaucracy. As a maintainer, you can play defense on the repo, confident that the burden of proof is on the pull request to demonstrate not just its own value, but its alignment with a well-understood philosophy.</p> <p>This work has gotten exponentially harder in the age of LLMs. Historically, we could assume that since writing code is an expensive, high-effort activity, contributors would engage in discussion before doing the work, or at least seek some sign that time would not be wasted. Today, LLMs have inverted this. Code is now cheap, and we see it offered in lieu of discourse. A user shows up with a fully formed PR for a feature we've never discussed. It's well-written, it "works," but it was generated without any context for the framework's philosophy. Its objective function was to satisfy a user's request, not to uphold the project's vision.</p> <p>This isn't to say all unsolicited contributions are unwelcome. There is nothing more delightful than the drive-by PR that lands, fully formed and perfectly aligned, fixing a bug or adding a small, thoughtful feature. We can't discourage these contributors. But in the last year, the balance of presumption has shifted. The signal-to-noise ratio has degraded, and the unsolicited PR is now more likely to be a high-effort review of a low-effort contribution.</p> <p>So what's the playbook? In FastMCP, we recently tried to nudge this behavior by requiring an issue for every PR. 
In a perfect example of <a href="https://en.wikipedia.org/wiki/Unintended_consequences">unintended consequences</a>, we now get single-sentence issues opened a second before the PR... which is actually worse. More powerful than this procedural requirement is simply stating that we are unconvinced the framework should take on certain responsibilities for users. If a contributor wants to convince us, we all benefit from that effort! But as I wrote earlier, the burden of proof is on the contributor, never the repo.</p> <p>A more nuanced pushback against viable code is that as a maintainer, you may be uncomfortable or unwilling to maintain it indefinitely. I think this is often forgotten in fast-moving open-source projects: there is a significant transfer of responsibility when a PR is merged. If it introduces bugs, confusion, inconsistencies, or even invites further enhancements, it is usually the maintainer who is suddenly on the hook for it. In FastMCP, we've introduced and documented the <code>contrib</code> module as one solution to this problem. This module contains useful functionality that may nonetheless not be appropriate for the core project, and is maintained exclusively by its author. No guarantee is made that it works with future versions of the project. In practice, many contrib modules might have better lives as standalone projects, but it's a way to get the ball rolling in a more communal fashion.</p> <p>One regret I have is observing a shift in my own behavior. In the early days of Prefect, we did our best to maintain a 15-minute SLA on our responses. Seven years ago, a user question reflected an amazing degree of engagement, and we wanted to respond in kind. Today, if I don't see a basic attempt to engage, I find myself mirroring that low-effort behavior. 
Frankly, if I'm faced with a choice between a wall of LLM-generated text or a clear, direct question with an MRE, I'll take the latter every time.</p> <p>I know this describes a fundamentally artisanal, hand-made approach to open source that may seem strange in an age of vibe coding and YOLO commits. I'm no stranger to LLMs. Quite the opposite. I use them constantly in my own work and we even have an AI agent (hi Marvin!) that helps triage the FastMCP repo. But in my career, this thoughtful, deliberate stewardship has been the difference between utility projects and great ones. We used to call it "community" and I'd like to ensure it doesn't disappear.</p> <p>I think I need to be clear that <em>nothing in this post should be construed as an invitation to be rude or to stonewall users</em>. As an open-source maintainer, you should be ecstatic every time someone engages with your project. After all, if you didn't want those interactions, you could have kept your code to yourself! The goal in scalable open-source must always be to create a positive, compounding community, subject to whatever invitation you choose to extend to your users. Your responsibility is to ensure that today's <strong>"no"</strong> helps guide a contributor toward tomorrow's enthusiastic <strong>"yes!"</strong></p> <p>When this degree of thoughtfulness is well applied, it translates into a better experience for all users—into software whose abstractions comply with a universal mental model. It's a reminder that this kind of stewardship is worth fighting for.</p> <p>Two weeks ago, I was in a room that reminded me this fight is being won at the highest level. I had the opportunity to join the MCP Committee for meetings in New York and saw a group skillfully navigating a version of this very problem. MCP is a young protocol, and its place in the AI stack has been accelerated more by excitement than its own maturity. 
As a result, it is under constant pressure to simultaneously do more, do less, and everything in between.</p> <p>A weak or rubber-stamp committee would be absolutely overwhelmed by this pressure, green-lighting any plausible feature to appease the loudest voices in this most-hyped corner of tech. And yet, over a couple of days, what I witnessed was the opposite. The most important thing I saw was a willingness to debate, and to hold every proposal up to a (usually) shared opinion of what the protocol is supposed to be. There was an overriding reverence for MCP's teleological purpose: what it should do and, more critically, what it should <em>not</em> do. I especially admired <a href="https://x.com/dsp_">David's</a> consistent drumbeat as he led the committee: "That's a good idea. But is it part of the protocol's responsibilities?"</p> <p>Sticking to your guns like that is the hard, necessary work of maturing a technology with philosophical rigor. I left New York more confident than ever in the team and MCP itself, precisely because of how everyone worked not only to build the protocol, but to act as its thoughtful custodians. It was wonderful to see that stewardship up close, and I look forward to seeing it continue in open-source more broadly.</p> Don't Call It an Officehttps://jlowin.dev/blog/dont-call-it-an-office/https://jlowin.dev/blog/dont-call-it-an-office/Rethinking collaboration in a remote-first worldThu, 05 Dec 2024 00:00:00 GMT<p>import { Image } from "astro:assets"; import bridge from "./bridge.png"; import chicago from "./chicago.png"; import embassy from "./embassy.png"; import lab from "./lab.png";</p> <p>Prefect has been a remote company since <strong>day 15</strong>.</p> <p>For the first 14 days, our small team worked side by side in DC, building what I envisioned as the city's next great tech company. 
Then, when our CTO Chris moved to San Francisco, we established what would become one of our core principles: <em>a company is remote if it has one remote employee.</em></p> <p>This wasn't merely a semantic distinction. The principle stemmed from our conviction that no team member should face disadvantages based on their location. Even as we built a predominantly DC-based team in the following years, we maintained our identity as a remote-first company. In the pre-COVID era, this stance raised eyebrows – the idea that having just one remote employee would define our entire operating model.</p> <p>What we discovered, and what many companies would later learn during the pandemic, was that enabling remote work wasn't the real challenge. The challenge was creating a culture compelling enough that people would choose to come together even when they didn't have to. Being remote-first doesn't mean rejecting physical spaces entirely – it means ensuring that any investment in physical space actively supports our mission without creating advantages or disadvantages based on location. This insight would eventually lead us to fundamentally reimagine how companies should think about physical space – not as infrastructure to be managed, but as strategic assets that must earn their keep while preserving the equality of our distributed culture.</p> <h2>Beyond Hybrid</h2> <p>Many companies now attempt to balance remote and in-person work through hybrid models. While well-intentioned, these approaches often highlight the challenge of bridging two distinct workplace cultures, and the result is the "worst of both worlds", epitomized by employees who commute to an office only to join the same video call as everyone else. 
Furthermore, not only do colocated team members enjoy a significant reduction in friction for small requests, but their remote colleagues frequently pay a "small-talk tax" in the form of socially-enforced preambles to every conversation.</p> <p>When COVID forced companies to go remote, we had an advantage – we'd already been thinking deliberately about how to balance a remote workforce with a strong in-person contingent. We focused intently on <a href="https://www.prefect.io/blog/dont-panic-the-prefect-guide-to-building-a-high-performance-team">building a remote culture</a> that even our DC-based team, who explicitly preferred being in-person, would want to actively participate in. This approach was so successful that we opted to step through a "one-way door" and begin hiring primarily outside of DC. We knew that once we made that decision, we could never go back to being predominantly office-based. But being remote-first didn't mean abandoning physical spaces entirely – it meant being more thoughtful about how and why we used them.</p> <h2>The Chicago Experiment</h2> <p>Perhaps surprisingly, that philosophy led us to establish an office in Chicago's River North neighborhood. Unlike traditional real estate expansions driven by headcount projections or market presence requirements, this was a deliberate experiment in creating purpose-driven space. Something unexpected happened: "The 505" emerged as more than just another workplace – it became our blueprint for focused collaboration. Its location – no more than three and a half hours from any Prefect team member – proved to be an unexpected advantage, eliminating the logistical complexity and expense of traditional offsites. Teams of any size could drop in for a few days of productive work, following a proven playbook that balanced focused collaboration with team bonding. 
The space's natural gravity even attracted some of our younger team members to relocate there, creating an organic hub that enhanced rather than replaced our remote culture.</p> <p>&lt;figure class="flex flex-col items-center"&gt; &lt;Image src={chicago} alt="The 505 in Chicago" width={500} /&gt; &lt;figcaption class="text-center"&gt;<strong>The 505</strong> in Chicago&lt;/figcaption&gt; &lt;/figure&gt;</p> <p>What made the Chicago space different wasn't just its location – it was that it emerged with an organic purpose. Unlike traditional corporate real estate that companies acquire simply to have a presence, The 505 had proven its value through measurable impact on our team dynamics and culture. It became our first case study in what happens when you let space earn its role rather than prescribing it.</p> <p>After our most recent all-hands gathering in Chicago, which was an extraordinary success by any measure, Brad, our SVP of People, pulled me aside. <em>"I think it's time for us to figure out how to capture this feeling more frequently,"</em> he said.</p> <p>That conversation sparked a fundamental shift in how we think about physical space. Brad's insight was profound: what if we treated physical spaces with the same strategic rigor as executive hires? Just as each senior leader joins with clear objectives and accountability for outcomes, each space could have specific KPIs and a falsifiable hypothesis about its value. This wasn't just about real estate – it was about creating strategic assets that would either prove their worth or be retired, just like any other investment. Moreover, it required us to approach each new space with the same level of intentionality as recruiting a new executive.</p> <p>Based on the lessons from Chicago, we developed a framework that would transform our approach to evaluating spaces. 
We began looking beyond traditional metrics like square footage and cost per desk, focusing instead on each location's potential to catalyze specific types of collaboration and community. Most importantly, each space would need an ambassador – a leader responsible for delivering on its mission and accountable for its success metrics. Just as we wouldn't hire a senior leader without a clear mandate and performance expectations, we wouldn't lease space without a defined mission and success criteria.</p> <h2>Three New Spaces</h2> <p>Today we're putting this framework into action. After months of careful evaluation, we're launching three new spaces, each "hired" with a distinct strategic purpose and clear success metrics. We deliberately avoid calling them "offices" – that term suggests nothing more than "a place to do work," which misses the point entirely. If there's anything a remote company doesn't lack, it's places to work! Instead, these spaces are strategic assets with specific missions, more akin to specialized facilities than traditional offices. Each one has been carefully designed to catalyze particular types of collaboration and drive specific outcomes that are impossible to achieve remotely.</p> <h3>The Embassy (Washington, DC)</h3> <p>The Embassy's mission is to strengthen Prefect's roots in the DC tech ecosystem. Located in one of the city's newest and most eco-friendly buildings, it overlooks Pennsylvania Avenue just blocks from the White House. The Embassy features multiple private offices for focused or collaborative work, as well as indoor and outdoor space to host over 100 people for community events and substantive discussions. As its ambassador, I'll use this as my permanent base to engage with customers, government representatives, investors, and the broader tech community — not to mention Prefect's own team! 
Success will be measured not just in utilization metrics, but in concrete outcomes: new partnerships formed, policy initiatives advanced, and meaningful connections established that demonstrably advance Prefect's presence in the national tech conversation.</p> <p>&lt;figure class="flex flex-col items-center"&gt; &lt;Image src={embassy} alt="The Embassy in DC" width={500} /&gt; &lt;figcaption class="text-center"&gt;<strong>The Embassy</strong> in DC&lt;/figcaption&gt; &lt;/figure&gt;</p> <h3>The Bridge (New York City)</h3> <p>Led by our VP of Product Adam Azzam, the Bridge establishes our presence at the heart of one of the world's most dynamic centers of data innovation. Its mission goes beyond traditional product development – it's about creating a living laboratory for product evolution. When practitioners from healthcare, finance, gaming, and AI gather to discuss product direction, the cross-pollination of ideas and immediate feedback loops are invaluable. Success here will be measured in the speed and quality of product iterations, the depth of customer insights gathered, and the tangible impact on our product roadmap.</p> <p>&lt;figure class="flex flex-col items-center"&gt; &lt;Image src={bridge} alt="The Bridge in NYC" width={500} /&gt; &lt;figcaption class="text-center"&gt;<strong>The Bridge</strong> in NYC&lt;/figcaption&gt; &lt;/figure&gt;</p> <h3>The Lab (Half Moon Bay)</h3> <p>Our most unconventional experiment, the Lab, isn't a traditional workplace at all. Led by Chris (bringing our story full circle to California), it's a self-sustaining farm turned high-tech workspace on the coast. The Lab combines modern collaborative facilities with farm-to-table dining, walking trails, and a beautiful coastal setting to create an environment that encourages both focused technical work and creative thinking. 
Engineering teams will use this space for intensive development sessions, with success measured in breakthrough features developed, architectural decisions made, and the acceleration of our technical roadmap. Like any experimental initiative, its continued existence will depend on its ability to demonstrate concrete value to our engineering velocity and innovation.</p> <p>&lt;figure class="flex flex-col items-center"&gt; &lt;Image src={lab} alt="The Lab in Half Moon Bay" width={500} /&gt; &lt;figcaption class="text-center"&gt;<strong>The Lab</strong> in Half Moon Bay&lt;/figcaption&gt; &lt;/figure&gt;</p> <h2>Measuring Success</h2> <p>These spaces are carefully designed experiments with clear hypotheses and measurable outcomes. Each has been conceived with the same rigor we'd apply to any strategic hire or major initiative. Just as we regularly evaluate the performance and impact of our senior leaders, we'll continuously assess these spaces against their defined objectives. This isn't about creating permanent monuments to our company; it's about launching strategic initiatives that must continuously prove their value.</p> <p>Crucially, none of these spaces create a "second-class" remote workforce. We're not hybrid, with some employees tethered to desks while others work from home. We're not remote-only, rejecting the value of physical collaboration entirely. We're remote-first: our company functions completely asynchronously, but we strategically invest in spaces that catalyze specific kinds of valuable in-person interactions – as long as participation in those interactions remains entirely optional and never becomes a prerequisite for success at Prefect.</p> <p>This approach represents a fundamental shift in how companies think about physical space in a remote-first world. 
Rather than treating real estate as necessary infrastructure or compromising with half-measures like hot-desking, we're approaching physical spaces as strategic investments with the same scrutiny, expectations, and accountability we apply to our most senior hires. Each space must continuously prove its value proposition while preserving our distributed culture – a standard that few traditional workplaces could meet.</p> <p>The principle that guided us on day 15 still holds: <em>when a company has one remote employee, the entire company is remote</em>. But that doesn't mean we can't be intentional about creating physical spaces that enhance our ability to collaborate, innovate, and build community. The key is ensuring these spaces serve a purpose beyond just existing – each needs an ambassador and a mission, transforming them from simple real estate into strategic assets that help us achieve specific goals.</p> <p>As we launch these experiments, we're excited to learn what works and what doesn't. Some spaces may exceed our expectations; others may need to be reimagined or retired – just like any other strategic investment. What we know for certain is that we're not going back to traditional models. Instead, we're moving forward with a new framework for physical space – one that prioritizes purpose over presence, mission over location, and pull over push.</p> <p>Just don't call them offices.</p> The Qualified Selfhttps://jlowin.dev/blog/the-qualified-self/https://jlowin.dev/blog/the-qualified-self/AI-generated content is everywhere... so I'm starting a blog.Mon, 16 Sep 2024 00:00:00 GMT<p>We're drowning in AI-generated words, a trend that will only accelerate as the technology improves and becomes more accessible. But it's undeniable that LLMs have made it dramatically easier to express, explore, and iterate on ideas. 
The paradox of modern AI is that while content production has been commoditized <em>on average</em>, the act of creating content has never been more valuable <em>for individuals</em>.</p> <p>This shift in the value of individual content creation mirrors a broader transformation in how we need to start thinking about personal data. For years, many have pursued <strong>"the quantified self"</strong> by collecting structured data about their lives, focusing on metrics like fitness, wealth, and other easily quantifiable measures. But the rise of modern AI has fundamentally changed the nature of valuable data. Today, <strong>words are data</strong> in a way that was never true before, at least not without a significant R&amp;D budget. Words carry meaning and semantics beyond the bytes that represent each character, a property that's always been true for humans and is finally available for machines.[^1]</p> <blockquote> <p>In the new data economy, words are the most valuable currency.</p> </blockquote> <p>The production of this rich, qualitative data—our thoughts, ideas, and insights—results in <strong>the qualified self</strong>. Instead of merely tracking structured data, we're pouring out our experiences in a form that LLMs can analyze and enhance, expanding our understanding of ourselves and our world. But this is about more than analysis. When we work with an LLM, we're not just exchanging information; we're engaging in a novel form of programming. I find it most helpful to think of LLMs as probability distributions, where every interaction conditions that distribution to produce more useful outputs.</p> <p>With this framing, becoming "better" at working with LLMs is about accelerating their probability engines into useful regimes as quickly as possible.</p> <p>This realization has driven me to collect as much of my personal "data" (in this new, qualitative sense) as possible. In the last year, my workflows have evolved to accommodate this new paradigm. 
I've started transcribing every meeting and, for the first time, keeping journals.[^2] These aren't just records; they're fuel for collaboration with AI assistants. Today's LLMs have very short-term memories. The more relevant "data" I can feed them—my thoughts, decisions, and the context behind them—the more valuable our interactions become. I've found that my daily product journal for <a href="https://controlflow.ai/">ControlFlow</a> is as valuable for accelerating an LLM to understand the present state of the library as it is for letting it know all the approaches we attempted that <em>didn't</em> work out.</p> <p>In a funny way, the qualified self may be the ultimate victory for the schema-on-read crowd. I'm not just hoarding information; I'm building the richest unstructured dataset I can. I believe that the most impactful innovations in the LLM space will come not from more powerful models, but from more capable context management. I'm preparing a dynamic, AI-ready knowledge base of... me.</p> <p>And so, embracing this new paradigm, I've decided to start blogging for the first time in over a decade. I've always believed in <a href="https://joincolossus.com/episode/puttagunta-open-source-crash-course/">open source</a> and <a href="https://twitter.com/patrick_oshag/status/971100425498841089">learning in public</a>, and this blog represents a new commitment to doing so. This writing is as much for my future AI collaborators as it is for any readers that may stumble upon it.</p> <p>This is my qualified self.[^3]</p> <p>[^1]: Lately I've been thinking a lot about Neal Stephenson's <em>Snow Crash</em>. The book is often credited with anticipating the metaverse, but it was equally prescient about the power of language in the digital age. It explores a world where language is more than just communication—it's a fundamental tool of digital control and influence. 
This concept resonates strongly in our current reality, where the words we use to interact with AI systems can shape their outputs and, by extension, our environment.</p> <p>[^2]: I've started by keeping product journals: what I worked on, what I learned, what didn't work, and what I want to try next.</p> <p>[^3]: ...at least, this is a curated version of my qualified self. My drafts overfloweth; fortunately, my LLMs don't seem to mind.</p> Stop Converting Your REST APIs to MCPhttps://jlowin.dev/blog/stop-converting-rest-apis-to-mcp/https://jlowin.dev/blog/stop-converting-rest-apis-to-mcp/Your auto-generated MCP is bad, and you should feel bad.Thu, 10 Jul 2025 00:00:00 GMT<p>import { Image } from 'astro:assets'; import badFeeling from './bad-feeling.webp';</p> <p>FastMCP's <a href="https://gofastmcp.com/integrations/openapi">OpenAPI converter</a> has become the most popular tool for auto-generating MCP servers from REST APIs. I'm excited about that, because I built the feature to feel like magic: a single line of code to expose your entire REST API to an LLM. It’s the ultimate shortcut, and for quick prototypes, it's amazing.</p> <p>But now I need you to stop using it so much.</p> <p>I've come to realize that it can paper over a fundamental problem: <em>an API built for a human will poison your AI agent.</em> In practice, LLMs achieve significantly better performance with well-designed, tailored MCP servers than with auto-converted ones. The reason goes right to the core of how agents and humans interact with software, and how we design technical products for each consumer.</p> <p>A good REST API is generous. It is a model of discoverability and atomicity. It offers hundreds of single-serving endpoints, flexible parameters, and endless options because programmatic iteration is cheap. 
Human developers are brilliant at doing discovery once and subsequently ignoring what's irrelevant, and their code can chain together atomic calls -- <code>get_user()</code>, then <code>get_orders(user_id)</code>, then <code>get_order_details(order_id)</code> -- with quick network hops to achieve complex outcomes. For them, more choice is good. We use properties like idempotency, pagination, and caching to make our APIs more efficient in the face of relatively deterministic access patterns.</p> <p>But when you hand this interface to an agent, you're not empowering it; you're drowning it.</p> <p>Agentic iteration is brutally expensive. First, there’s the <strong>literal cost of context</strong>. An LLM must process the name, description, and parameters of every single tool you provide, every single time it reasons. For an agent, many programmatic choices imply a bloated context, and every extra endpoint is a tax paid in tokens and latency on every interaction. Second, <strong>atomicity is an agent anti-pattern</strong>. Each tool call an LLM makes is an expensive round trip involving a full reasoning cycle. Forcing an agent to chain multiple atomic calls is slow, error-prone, and burns through tokens.</p> <p>Moreover, <strong>context pollution is the silent killer of contemporary agentic workflows.</strong> Many users do not realize that their toolkits -- including MCP servers -- may inject thousands more tokens than even their custom system prompts. Your agent stops being a helpful assistant and becomes an obsessive API librarian, endlessly debating the nuances of your endpoints instead of achieving its actual behavioral goal. 
More tool calls reinforce that behavior, and your agent gets slower, dumber, and more expensive with every interaction.</p> <p>So when I see the community's enthusiasm for auto-generating MCP servers from massive OpenAPI specs, it makes me think:</p> <p>&lt;Image src={badFeeling} alt="I've got a bad feeling about this" class="shadow-lg mx-auto"/&gt;</p> <p>An API that is "sophisticated" for a human is one with rich, composable, atomic parts. An API that is "sophisticated" for an agent is one that is ruthlessly curated and minimalist.</p> <p>We see the practical consequences of this mismatch in FastMCP's GitHub repo almost daily. "The LLM timed out trying to decide between create_invoice and generate_invoice." "My agent hallucinated a get_all_users_with_blue_eyes endpoint because the 50 other user-related tools made it seem plausible." "How do I prevent the LLM from trying to call the DELETE /everything endpoint I forgot was in my spec?" These are the predictable results of a flawed premise and the belief that context should be stuffed, not pruned.</p> <p>The truth is, it's far easier to build a clean, curated MCP server than it is to debug an LLM that's lost in the labyrinth of an auto-generated REST API. As <a href="https://www.linkedin.com/feed/update/urn:li:activity:7343322701446397953/?commentUrn=urn%3Ali%3Acomment%3A%28activity%3A7343322701446397953%2C7343333026455539713%29">Maxime Beauchemin recently put it</a>:</p> <blockquote> <p>[We must] not only enumerate but also qualify each service, as it's trivial for anyone to write a quick REST-API wrapper and call it done, where in reality we're discovering that considerations around API design for LLMs are significantly different from the ones we've been using for REST forever.</p> </blockquote> <p>FastMCP's OpenAPI converter <strong>is</strong> a valuable tool for bootstrapping. But we have to be disciplined. 
We can’t let this convenient shortcut ultimately create more problems than it solves.</p> <p>So, what's the right way forward? The goal isn't to abandon our existing APIs, but to treat them as a source of truth to be carefully translated, not a finished product to be carelessly wrapped.</p> <ol> <li> <p><strong>Bootstrap, Don't Deploy.</strong> Use the <code>FastMCP.from_openapi()</code> feature for what it's truly good for: bootstrapping. Use it to quickly explore what's possible, to see your tools through an agent's eyes, or to run a quick internal demo. But do not ship it to production.</p> </li> <li> <p><strong>Curate Aggressively.</strong> The act of curation is now a core part of building for agents. Instead of exposing the raw tool, use a transformation to craft a new, LLM-friendly version. FastMCP's <code>Tool.from_tool()</code> was built for exactly this. Take that messy <code>generic_search(q, lim, fq, …)</code> tool and transform it into a clean <code>find_products(keyword: str)</code>. Rename cryptic arguments. Hide irrelevant parameters with default values. This is where the real work lies.</p> </li> <li> <p><strong>Start with the Agent Story.</strong> For your most critical workflows, build a new, minimal MCP server from scratch. Don't start with your API spec. Start with the <a href="/blog/as-an-agent-the-new-user-story">agent story</a>: <em>"As an agent, given <code>{context}</code>, I use <code>{tools}</code> to achieve <code>{outcome}</code>."</em> Then, build <em>only</em> the tools required to fulfill that story.</p> </li> </ol> <p>The promise of AI agents isn't just to make our existing software "chatty." It's an opportunity to design cleaner, more intentional, machine-first interfaces. <strong>Stop converting your REST APIs. 
Start curating them.</strong></p> The Sustainable Startuphttps://jlowin.dev/blog/the-sustainable-startup/https://jlowin.dev/blog/the-sustainable-startup/Transitioning Prefect to a customer-funded businessFri, 21 Mar 2025 00:00:00 GMT<p>This week I made a decision to ensure Prefect's long-term success by reorganizing our company to operate as a profitable, customer-funded business.</p> <p>This decision had a terrible consequence, and I deeply regret that it meant parting ways with twenty extraordinary colleagues. These are people I personally recruited, who believed in our vision, who said "yes" to this journey. They made Prefect what it is today, and I'm committed to supporting each of them during this transition and will advocate for them all as they move forward.</p> <p>So why take such a difficult action?</p> <p>In uncertain times, we must “control what we can control.” When capital is expensive, investor dependency becomes an existential threat, and on its previous trajectory, Prefect would have required new capital later in 2026.</p> <p>Becoming a profitable business frees us from that constraint. A startup can do extraordinary things when it doesn't operate under the shadow of its next fundraise. Our decisions can now flow purely from what creates value for our customers, not from what extends our timeline.</p> <p>This is a fundamentally different Prefect, built to last and ready to support our growing community. For our users and customers, this change means faster innovation cycles and a partner they can count on regardless of market conditions.</p> <p>I believe this is what a modern startup should look like in 2025: resilient, independent, and focused entirely on creating value for its users. 
That's the Prefect we're building today.</p> Disabling Tailwind Hover Styles on Mobilehttps://jlowin.dev/blog/disable-tailwind-hover-styles/https://jlowin.dev/blog/disable-tailwind-hover-styles/Now you see me...Mon, 30 Sep 2024 00:00:00 GMT<p>Recently, I added hovering tooltips to the list of blog posts, but I really didn't like how they flashed for a moment when you tapped a link on mobile.</p> <p>It turns out there's an (apparently undocumented) Tailwind feature that allows you to enable hover styles only on supported devices, so I'm making this post as a public service.</p> <p>In <code>tailwind.config.js</code>, you can add the following:</p> <pre><code>export default { future: { hoverOnlyWhenSupported: true, }, } </code></pre> <p>This will only show hover styles (<code>hover:*</code>, <code>group-hover:*</code>, etc.) if the user's device has proper hover support.</p> Curation is the New Discoveryhttps://jlowin.dev/blog/curation-is-the-new-discovery/https://jlowin.dev/blog/curation-is-the-new-discovery/API modesty for the agentic era.Sat, 21 Jun 2025 00:00:00 GMT<p>A common question I hear when teams start building for AI is, "Why can't agents just use our REST API instead of MCP?"</p> <p>My response is usually the same: "Why can't you just use your REST API instead of a UI?" We build interfaces to optimize the experience for the consumer. For the new class of <strong><a href="/blog/as-an-agent-the-new-user-story">"agent stories,"</a></strong> that UI is an MCP server, and designing for it requires a new discipline.</p> <p>Consider a simple change: adding a new filter to an API.</p> <p>For a <strong>human-centric API</strong>, you add an optional parameter with a good default and clean documentation. This is a win. For developers who don't need the new filter, the cognitive surface area is unchanged—they can ignore it. It's discoverable, not intrusive.</p> <p>For an <strong>agent-centric API</strong>, this same "improvement" is a disaster. 
That new parameter is now part of <em>every</em> interaction. Its instructions will be weighed against all others. Its scope will be (mis)understood. It increases the complexity and token cost of every call. The interaction surface area hasn't been slightly expanded; it has exploded.</p> <p>A human developer is great at ignoring what's irrelevant. An agent must process everything.</p> <p>This is the core challenge. For human APIs, we prioritize <strong>discovery</strong>. For agent APIs, we must prioritize <strong>curation</strong>.</p> <p>MCP is a step in the right direction because it forces us to curate what we expose. But its native discovery—listing all available tools—is still too broad. That's why in <a href="https://gofastmcp.com">FastMCP</a>, we're building dynamic solutions that go far beyond simple filtering, including semantic search and per-session tool visibility. The goal is to let an agent request a narrow, context-aware set of tools, not the entire toolbox.</p> <p>When designing for agents, don't ask "How much can I show them?" Ask "What is the absolute minimum they need?" Your agent stories—and your token bill—will thank you for it.</p> "As an Agent...": The New User Storyhttps://jlowin.dev/blog/as-an-agent-the-new-user-story/https://jlowin.dev/blog/as-an-agent-the-new-user-story/Scrumthing's gotta giveFri, 20 Jun 2025 00:00:00 GMT<p>import Callout from "@components/blog/Callout.astro";</p> <p>As the author of <a href="https://gofastmcp.com">FastMCP</a>, it might seem strange that I haven’t prioritized an MCP server for <a href="https://www.prefect.io">Prefect</a>. But honestly, the user story for "chatting with your orchestrator" has always felt weak. 
Developers don't want to have a conversation about their workflows; they want to build, run, and observe them with speed and precision.</p> <p>It turns out I was thinking about the wrong user.</p> <p>My thinking on this shifted after talking with some of our most sophisticated customers about how they debug modern applications. An alert from one system is rarely the end of the story; it's the first domino. And Prefect is the universal pane of glass that provides the first clue in that complex, cross-system investigation. As one customer put it, <em>"A lot of problems we learn about from Prefect are not, in fact, due to Prefect at all."</em> More frequently, a failed process in the orchestrator points to a flow run ID, which helps find an evicted pod in Kubernetes, which leads to a memory spike in a monitoring tool, which finally uncovers an error in an application log.</p> <p>A human <em>can</em> perform this needle-in-a-haystack work. But increasingly, an agent <em>does</em>. The value wasn't in creating a new interface for a human, but in unblocking one for a machine.</p> <p>In an AI-native stack, the protagonist is often no longer a person, but an autonomous agent negotiating APIs, ingesting signals, and chaining tools to deliver value. This requires a fundamental shift from writing "user stories" to defining <strong>"agent stories."</strong> The template for this new artifact is simple but powerful:</p> <blockquote> <p>As an agent, given <code>{context}</code>, I use <code>{tools}</code> to achieve <code>{outcome}</code> with minimal human latency.</p> </blockquote> <p>This reframing forces us to design for a different set of needs. Agents don't care about intuitive UIs or clever microcopy. 
They care about clear contracts, machine-parsable errors, composability, and minimizing the latency between their actions.</p> <p>This shift helps explain the rapid evolution of AI-native APIs:</p> <p><strong>Phase 1: The Wrapper.</strong> The first wave of MCP servers simply regurgitated existing APIs. This "chat with your API" approach was rightly met with skepticism from savvy teams who understood that a great user experience is more than a conversational veneer over a clunky backend.</p> <p><strong>Phase 2: The Curator.</strong> We are in this phase now. The best teams realize they must consciously <em>design</em> for the LLM. This is an act of curation—thoughtfully reducing scope, renaming cryptic arguments, and adding instructions that guide the agent toward the desired outcome. It’s about tailoring the tool to the new user.</p> <p><strong>Phase 3: The Ecosystem.</strong> This is the frontier. Agent workflows are inherently multi-system. The goal is no longer to build a single, monolithic tool, but to offer a composable node in a larger, automated graph. Your product’s success is measured by how well it interoperates and enables agents to chain actions across a diverse ecosystem.</p> <p>As we design the next generation of software, we must build for two primary personas: our human users and the autonomous agents they deploy. In a growing number of cases, the agent is the more important one. The critical question in product design is shifting from "What does the user want to do?" to "What does the agent need to achieve?"—and that requires a new kind of answer: an <strong>agent story.</strong></p> <p>&lt;Callout color='green'&gt; Want to read more? 
The next post on agentic product design is <a href="/blog/curation-is-the-new-discovery">Curation is the New Discovery</a>. &lt;/Callout&gt;</p> Total Recall: ControlFlow v0.10https://jlowin.dev/blog/controlflow-0-10-total-recall/https://jlowin.dev/blog/controlflow-0-10-total-recall/Simple, persistent memory for AI agentsFri, 27 Sep 2024 00:00:00 GMT<p>I'm really excited about the release of <a href="https://controlflow.ai">ControlFlow v0.10</a>, which introduces a significant new feature: <strong>a flexible and practical memory system for AI agents.</strong></p> <p>Here's an example that shows how simple it is to get started:</p> <pre><code>import controlflow as cf

memory = cf.Memory(key="prefs", instructions="Remember user preferences.")

cf.run(
    "Get the user's favorite color",
    interactive=True,
    memories=[memory],
)
</code></pre> <p>Memory addresses a key limitation in many AI workflows: the inability to leverage knowledge and context beyond the current interaction. Now your agents can:</p> <ul> <li> <p><strong>Recall configuration and project details</strong>: Agents remember settings, project structures, and workflows, ensuring consistency across tasks without repeated setup.</p> </li> <li> <p><strong>Track ongoing issues and resolutions</strong>: Keep a log of common problems and their solutions, enabling agents to offer quick fixes for recurring issues.</p> </li> <li> <p><strong>Integrate past conversations and decisions</strong>: Retain key insights and decisions from previous discussions, informing future tasks without rehashing old ground.</p> </li> <li> <p><strong>Maintain technical styles and best practices</strong>: Agents adapt to coding styles, design patterns, and best practices, applying them consistently across workflows.</p> </li> <li> <p><strong>Store repository knowledge and code locations</strong>: Agents remember where key components or documentation live, speeding up development and debugging.</p> </li> <li> <p><strong>Optimize API
usage</strong>: Recall specific API tips, tricks, and edge cases, providing more efficient solutions beyond the standard documentation.</p> </li> <li> <p><strong>Summarize long-term project insights</strong>: Capture key learnings from long-running projects, enabling agents to seamlessly continue tasks without re-creating context.</p> </li> </ul> <p>An emergent characterization of production AI workflows is that they tend to be short, directed, and scaled across many concurrent agents. In other words, effective memory systems shouldn't optimize for summarizing the longest possible conversation (as this is straightforward within any single session), but for rapidly establishing the context that ensures consistent behavior across many invocations.</p> <p>ControlFlow's solution is a modular, vector-backed memory system that not only allows you to store and retrieve information, but lets you quickly provide access to various memories on a per-task or per-agent basis.</p> <h2>Getting Started with Memory</h2> <p>Suppose we want to store user preferences, but we need to make sure that Alice's preferences don't get mixed up with Bob's. To achieve this, we can give each user their own dedicated memory module. 
Here's a flow that demonstrates how to do it:</p> <pre><code>import controlflow as cf

@cf.flow
def demo(user_id: str):
    # create a memory module for the user
    memory = cf.Memory(
        key=f"{user_id}_prefs",
        instructions="Remember user preferences about writing.",
    )

    # use the memory module in the task
    return cf.run(
        "Write a poem on a topic of the user's choice",
        instructions="Share drafts with the user until they are happy.",
        interactive=True,
        memories=[memory],
    )

# run the flow for Alice and Bob, without mixing up their memories
demo("Alice")
demo("Bob")
</code></pre> <p>Later, Alice and Bob might be part of the same conversation, and we could provide both of their memory modules to the agent at the same time!</p> <p>To start using the new memory feature, upgrade to ControlFlow 0.10 and <a href="https://controlflow.ai/patterns/memory#provider">install</a> your preferred vector store (ControlFlow currently supports <a href="https://trychroma.com/">Chroma</a> and <a href="https://lancedb.com/">LanceDB</a>).</p> <pre><code>pip install --upgrade controlflow
</code></pre> <p>For more information on integrating memory into your workflows, please refer to the <a href="https://controlflow.ai/patterns/memory">updated documentation</a>.</p> <p>Happy AI engineering!</p> FastMCP 3.0 Beta 2https://jlowin.dev/blog/fastmcp-3-beta-2/https://jlowin.dev/blog/fastmcp-3-beta-2/2 Fast 2 BetaSun, 08 Feb 2026 12:00:00 GMT<p>import Callout from "@components/blog/Callout.astro";</p> <p>The <a href="/blog/fastmcp-3">FastMCP 3 beta</a> has been running for a few weeks now, and we've been busy. The architecture (providers, transforms, components) has held up well under real-world usage, which is exactly what a beta period is for. But we haven't been sitting around waiting for bug reports.</p> <p>Beta 2 ships a substantial batch of new features that round out the framework for a stable release.
The biggest theme is the CLI: FastMCP can now discover, query, and invoke tools on any MCP server from the terminal, and even generate standalone CLI scripts from tool schemas. We've also landed CIMD (the DCR replacement), protocol-level MCP Apps support, response size limiting, and background task elicitation.</p> <p><em>For a refresher on the beta 1 architecture and features, see the <a href="/blog/fastmcp-3-whats-new">What's New in FastMCP 3.0</a> post.</em></p> <p>&lt;Callout color="green"&gt;</p> <h3>Get Started</h3> <pre><code>pip install fastmcp==3.0.0b2
</code></pre> <p>&lt;/Callout&gt;</p> <h2>The CLI</h2> <p>Four new commands that collectively make <code>fastmcp</code> useful for a lot more than running servers.</p> <h3><code>fastmcp list</code> and <code>fastmcp call</code></h3> <p>You can now query and invoke tools on any MCP server directly from the terminal. Remote URLs, local Python files, MCPConfig JSON files, arbitrary stdio commands. Anything.</p> <pre><code># Discover tools on a server
fastmcp list http://localhost:8000/mcp
fastmcp list server.py

# Call a tool
fastmcp call server.py greet name=World
fastmcp call http://localhost:8000/mcp search query=hello limit=5
</code></pre> <p>Tool arguments are auto-coerced using the tool's JSON schema, so <code>limit=5</code> becomes an integer automatically. JSON objects work as positional args. OAuth fires automatically for HTTP targets that require it.</p> <p>This alone is transformative for development. Instead of wiring up a client to test your server, you just call it. But the real power comes when you combine it with discovery.</p> <h3><code>fastmcp discover</code></h3> <p><code>fastmcp discover</code> scans your editor configs (Claude Desktop, Claude Code, Cursor, Gemini CLI, Goose) and project-level <code>mcp.json</code> files for MCP server definitions.
Once discovered, you reference servers by name:</p> <pre><code># See all configured servers across your editors
fastmcp discover

# Use a server by name
fastmcp list weather
fastmcp call weather get_forecast city=London

# Disambiguate with source:name
fastmcp call cursor:weather get_forecast city=London
</code></pre> <p>Every server you've configured in any editor is now one command away, regardless of transport.</p> <p>The repo also ships a <a href="https://github.com/jlowin/fastmcp/blob/main/skills/fastmcp-client-cli/SKILL.md">CLI skill</a> so your coding assistant already knows how to use these commands. Drop it into your agent's skills directory and it can discover, list, and call MCP tools on your behalf.</p> <h3><code>fastmcp generate-cli</code></h3> <p>This is the feature that made me grin when I first saw it working. <code>fastmcp generate-cli</code> connects to any MCP server, reads its tool schemas, and writes a standalone Python CLI script where every tool becomes a typed subcommand with flags, help text, and tab completion.</p> <pre><code># Generate from any server
fastmcp generate-cli weather my_weather_cli.py

# Use the generated script
python my_weather_cli.py call-tool get_forecast --city London --days 3
python my_weather_cli.py list-tools
</code></pre> <p>The insight: MCP tool schemas already contain everything a CLI framework needs. Parameter names, types, descriptions, required/optional status. The generator maps JSON Schema directly into <a href="https://cyclopts.readthedocs.io/">cyclopts</a> commands. The generated script embeds the resolved transport, so it's self-contained. Users don't need to know about MCP or FastMCP to use it.</p> <p>This is how MCP tools escape the chatbot. Your agent's tools become anyone's CLI.</p> <h2>MCP Apps</h2> <p><a href="https://modelcontextprotocol.io/specification/2025-06-18/server/apps">MCP Apps</a> is the spec extension that lets MCP servers deliver interactive UIs via sandboxed iframes.
Beta 2 adds SDK-level support: extension negotiation, typed UI metadata on tools and resources, and the <code>ui://</code> resource scheme.</p> <pre><code>from pathlib import Path

from fastmcp import FastMCP
from fastmcp.server.apps import ToolUI, ResourceUI

mcp = FastMCP("My Server")

# Register a UI bundle as a resource
@mcp.resource("ui://dashboard/view.html")
def dashboard_html() -&gt; str:
    return Path("./dist/index.html").read_text()

# Tool with a UI — clients render an iframe alongside the result
@mcp.tool(ui=ToolUI(resource_uri="ui://dashboard/view.html"))
async def list_users() -&gt; list[dict]:
    return [{"id": "1", "name": "Alice"}]

# App-only tool — visible to the UI but hidden from the model
@mcp.tool(ui=ToolUI(
    resource_uri="ui://dashboard/view.html",
    visibility=["app"],
))
async def delete_user(id: str) -&gt; dict:
    return {"deleted": True}
</code></pre> <p>This is the foundation. We're shipping the protocol-level support (CSP, permissions, extension negotiation) so that when MCP clients start rendering apps, FastMCP servers are ready. The higher-level component DSL, the in-repo renderer, and the <code>FastMCPApp</code> class are coming in future betas.</p> <p>Tools can detect whether the connected client supports apps at runtime via <code>ctx.client_supports_extension()</code>, which means you can serve rich structured data to app-capable clients and fall back to text for everyone else.</p> <h2>CIMD: Client Authentication Without DCR</h2> <p>CIMD (Client ID Metadata Documents) replaces Dynamic Client Registration for OAuth-authenticated MCP servers. Instead of clients registering dynamically with each server via a POST endpoint, they host a static JSON document at an HTTPS URL.
That URL becomes the client's <code>client_id</code>, and servers verify identity through domain ownership.</p> <pre><code>from fastmcp import Client
from fastmcp.client.auth import OAuth

async with Client(
    "https://mcp-server.example.com/mcp",
    auth=OAuth(
        client_metadata_url="https://myapp.example.com/oauth/client.json",
    ),
) as client:
    await client.ping()
</code></pre> <p>We also ship CLI tools for generating and validating CIMD documents:</p> <pre><code># Generate a CIMD document
fastmcp auth cimd create --name "My App" \
  --redirect-uri "http://localhost:*/callback" \
  --client-id "https://myapp.example.com/oauth/client.json"

# Validate a hosted document
fastmcp auth cimd validate https://myapp.example.com/oauth/client.json
</code></pre> <p>CIMD is enabled by default on <code>OAuthProxy</code> and all its provider subclasses (GitHub, Google, Azure, etc.). The server-side implementation includes SSRF-hardened document fetching with DNS pinning, dual redirect URI validation, HTTP cache-aware revalidation, and <code>private_key_jwt</code> assertion support.</p> <h2>ResponseLimitingMiddleware</h2> <p>Context window protection, built in. This middleware controls tool response sizes, preventing large outputs from overwhelming LLM context windows.</p> <pre><code>from fastmcp.server.middleware.response_limiting import ResponseLimitingMiddleware

# Limit all tool responses to 500KB
mcp.add_middleware(ResponseLimitingMiddleware(max_size=500_000))

# Limit only specific tools
mcp.add_middleware(ResponseLimitingMiddleware(
    max_size=100_000,
    tools=["search", "fetch_data"],
))
</code></pre> <p>Text responses are truncated at UTF-8 character boundaries. Structured responses (tools with <code>output_schema</code>) raise <code>ToolError</code> since truncation would corrupt the schema. Size metadata gets added to the result's <code>meta</code> field for monitoring.
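That character-boundary rule is simple to sketch in plain Python. The following is an illustrative simplification, not FastMCP's actual implementation:

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Truncate text to at most max_bytes of UTF-8 without splitting a character."""
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    # errors="ignore" silently drops any partial multi-byte sequence at the cut
    return data[:max_bytes].decode("utf-8", errors="ignore")

print(truncate_utf8("héllo", 3))  # "hé": the two-byte é survives intact
```

Slicing the bytes and decoding naively would raise <code>UnicodeDecodeError</code> whenever the cut lands inside a multi-byte character; decoding with <code>errors="ignore"</code> keeps the truncated text valid.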
This is the kind of production guard rail that saves you from a tool that returns a 10MB JSON blob and blows out your token budget.</p> <h2>Background Task Elicitation</h2> <p><code>Context</code> now works transparently in background tasks running in Docket workers. Previously, tools running as background tasks couldn't use <code>ctx.elicit()</code> because there was no active request context. Now, when a tool executes in a Docket worker, <code>Context</code> detects this and routes elicitation through Redis-based coordination: the task sets its status to <code>input_required</code>, sends a notification, and waits for the client to respond.</p> <pre><code>@mcp.tool(task=True)
async def interactive_task(ctx: Context) -&gt; str:
    # Works transparently in both foreground and background
    result = await ctx.elicit("Please provide additional input", str)
    if isinstance(result, AcceptedElicitation):
        return f"You provided: {result.data}"
    return "Elicitation was declined"
</code></pre> <p><code>ctx.is_background_task</code> and <code>ctx.task_id</code> are available for tools that need to branch on execution mode.</p> <h2>Everything Else</h2> <p>A few more things that shipped in beta 2:</p> <ul> <li><strong><code>fastmcp install goose</code></strong>: Generates a Goose deeplink URL and opens it, installing your server as a STDIO extension. Goose requires <code>uvx</code> rather than <code>uv run</code>, and the command handles the difference automatically.</li> <li><strong><code>fastmcp install stdio</code></strong>: Generates full <code>uv run</code> commands for running FastMCP servers over stdio, making it easy to integrate with MCP clients that need a command string.</li> <li><strong>Expanded reload file watching</strong>: The <code>--reload</code> flag now watches JavaScript, TypeScript, HTML, CSS, config files, and media assets.
Necessary for MCP Apps with frontend bundles.</li> <li><strong><code>require_auth</code> removed</strong>: Since configuring an <code>AuthProvider</code> already rejects unauthenticated requests at the transport level, <code>require_auth</code> was redundant. Use <code>require_scopes</code> instead.</li> </ul> <p>Beyond features, this release includes a wave of bug fixes and stability improvements across OAuth, transports, task execution, and the CLI. Seven new contributors joined FastMCP in this release alone, which is extraordinary for a beta. Full details in the <a href="https://github.com/jlowin/fastmcp/releases/tag/v3.0.0b2">release notes</a>.</p> <hr /> <p>We're getting close. The architecture is stable, the feature set is filling out, and the beta feedback has been exactly what we needed. If you've been waiting for the right time to try FastMCP 3, this is it.</p> <p>Happy (context) engineering!</p> <p>&lt;Callout color="gray"&gt;</p> <h3>About This Beta</h3> <p><strong>Install:</strong> <code>pip install fastmcp==3.0.0b2</code></p> <ul> <li><strong>Beta 1 Features:</strong> <a href="/blog/fastmcp-3-whats-new">What's New in FastMCP 3.0</a></li> <li><strong>Full Documentation:</strong> <a href="https://gofastmcp.com">gofastmcp.com</a></li> <li><strong>GitHub:</strong> <a href="https://github.com/jlowin/fastmcp">github.com/jlowin/fastmcp</a> &lt;/Callout&gt;</li> </ul> FastMCP 3.0 is GAhttps://jlowin.dev/blog/fastmcp-3-launch/https://jlowin.dev/blog/fastmcp-3-launch/Three at lastWed, 18 Feb 2026 12:00:00 GMT<p>FastMCP 3.0 is stable and generally available.</p> <pre><code>pip install fastmcp -U </code></pre> <p>It's been about a month since the first beta. We shipped two betas and two release candidates, landed code from 21 new contributors, and saw over 100,000 opt-in pre-release installs — extraordinary for a beta that requires explicit version pinning. 
We wrote three complete upgrade guides and an LLM migration prompt you can paste into your coding assistant to automate the transition. The architecture held up. The upgrade path is smooth. We're ready.</p> <p>This is also the release where FastMCP moves from <code>jlowin/fastmcp</code> to <code>PrefectHQ/fastmcp</code>. When I built this over a weekend in late 2024, it lived under my personal account because it was a side project. That stopped being accurate a long time ago. FastMCP has the full engineering support of the Prefect team and is a core pillar of our <a href="https://prefect.io/horizon">Horizon</a> platform. Special thanks to <a href="https://linkedin.com/in/williamseaston">Bill Easton</a>, FastMCP's first external maintainer, whose fingerprints are all over the transform architecture that makes 3.0 tick. For you, nothing changes — GitHub forwards all links, PyPI is the same, imports are the same. A major version felt like the right moment to make the move official.</p> <p>We also built three separate upgrade guides, because people are coming from different places: <a href="https://gofastmcp.com/getting-started/upgrading/from-fastmcp-2">from FastMCP 2</a>, <a href="https://gofastmcp.com/getting-started/upgrading/from-mcp-sdk">from the MCP SDK</a>, and <a href="https://gofastmcp.com/getting-started/upgrading/from-low-level-sdk">from the low-level SDK</a>. All three include a copyable LLM prompt to help you automate your migration. Early feedback from the community suggests most upgrades are straightforward, and for many users it should just work.</p> <h2>What's New</h2> <p>This is, by a wide margin, the largest release in FastMCP's history. Several of the features below could have shipped as standalone packages. They didn't, because they all flow from a single architectural redesign that makes FastMCP ready for the next generation of MCP.</p> <p>The surface API is largely unchanged — <code>@mcp.tool()</code> still works exactly as before. 
What changed is everything underneath. So much effort went into making FastMCP 3 extensible, observable, debuggable, and performant that if we did our jobs right, you'll barely notice the architecture. You'll just notice that everything works better and a lot more is possible.</p> <p>Here's what you can do now.</p> <h3>Build servers from anything</h3> <p>Your components no longer have to live in one file with one server. Point a <code>FileSystemProvider</code> at a directory and it discovers your tools automatically, with hot reload. Wrap a REST API with <code>OpenAPIProvider</code>. Proxy a remote MCP server. Deliver agent skills as MCP resources. Write your own provider for whatever source makes sense. Compose multiple providers into one server, share one across many, or chain them with transforms that rename, namespace, filter, version, and secure components as they flow to clients. <code>ResourcesAsTools</code> and <code>PromptsAsTools</code> expose non-tool components to tool-only clients.</p> <h3>Use FastMCP as a CLI</h3> <p>FastMCP is now a developer tool, not just a framework. <code>fastmcp list</code> and <code>fastmcp call</code> let you query and invoke tools on any MCP server from your terminal — remote URLs, local files, stdio commands. <code>fastmcp discover</code> scans your editor configs (Claude Desktop, Cursor, Goose, Gemini CLI) and finds all your configured servers by name. <code>fastmcp generate-cli</code> reads a server's schemas and writes a standalone typed CLI where every tool is a subcommand with flags and help text. <code>fastmcp install</code> registers your server with Claude Desktop, Cursor, or Goose in one command.</p> <h3>Ship to production</h3> <p>Component versioning: serve <code>@tool(version="2.0")</code> alongside older versions from one codebase. Granular authorization on individual components, async auth checks that can hit databases or external services, and server-wide policies via <code>AuthMiddleware</code>. 
OAuth gets CIMD, Static Client Registration, Azure OBO via dependency injection, JWT audience validation, and confused-deputy protections. Native OpenTelemetry tracing with MCP semantic conventions. Response size limiting. Background tasks via Docket with distributed Redis notification and <code>ctx.elicit()</code> relay. Security fixes include dropping <code>diskcache</code> (CVE-2025-69872) and upgrading <code>python-multipart</code> and <code>protobuf</code> for additional CVEs.</p> <h3>Develop faster</h3> <p><code>--reload</code> auto-restarts on file changes. Decorated functions stay callable — import them, call them, unit test them like normal Python. Sync tools auto-dispatch to a threadpool so they don't block the event loop. Tool timeouts. MCP-compliant pagination. Composable lifespans. <code>PingMiddleware</code> for keepalive. Concurrent tool execution when the LLM returns multiple calls in one response.</p> <h3>Adapt per session</h3> <p>Session state persists across requests via <code>ctx.set_state()</code> / <code>ctx.get_state()</code>. <code>ctx.enable_components()</code> and <code>ctx.disable_components()</code> let servers adapt dynamically per client — show admin tools only after authentication, progressively reveal capabilities, or scope access by role. Chain these together and you get playbooks: dynamic MCP-native workflows that guide agents through processes instead of dumping everything into the context window at once.</p> <h3>Build apps (3.1 preview)</h3> <p>Spec-level support for MCP Apps is already in: <code>ui://</code> resource scheme, typed UI metadata, extension negotiation, and runtime detection. 
Full apps support — including a Python DSL for building generative UIs without writing JavaScript — lands in 3.1.</p> <p>For the complete feature guide: <a href="/blog/fastmcp-3">Introducing FastMCP 3.0</a> and <a href="/blog/fastmcp-3-beta-2">Beta 2</a>.</p> <h2>One Honest Disclaimer</h2> <p>We know a lot of people are about to encounter FastMCP 3 for the first time — not because they chose to upgrade, but because they didn't pin their dependencies. If that's you, and something breaks: we're sorry, and the upgrade guides will get you sorted quickly. We did everything we could to minimize breaking changes, but a major version is a major version. If you encounter any bugs, please <a href="https://github.com/PrefectHQ/fastmcp/issues/new">open an issue</a>.</p> <p>If you maintain a framework that depends on FastMCP 2.x, please pin your dependency. We want everyone on 3.0 as fast as possible, but we want them there on purpose.</p> <h2>Get Started</h2> <pre><code>pip install fastmcp -U </code></pre> <ul> <li><a href="/blog/fastmcp-3-whats-new">What's New in FastMCP 3.0</a> — the complete feature guide</li> <li><a href="https://gofastmcp.com/getting-started/upgrading/from-fastmcp-2">Upgrade Guides</a> — from FastMCP 2, the MCP SDK, or the low-level SDK</li> <li><a href="https://github.com/PrefectHQ/fastmcp">GitHub</a> — the new home</li> <li><a href="https://gofastmcp.com">Documentation</a></li> </ul> Stop Calling Tools, Start Writing Code (Mode)https://jlowin.dev/blog/fastmcp-3-1-code-mode/https://jlowin.dev/blog/fastmcp-3-1-code-mode/Code to JoyTue, 03 Mar 2026 00:00:00 GMT<p>import Callout from "@components/blog/Callout.astro";</p> <p>MCP servers scale in a way that punishes success.</p> <p>A server with ten tools works beautifully. The LLM sees all ten schemas, picks the right one, calls it. 
A server with two hundred tools dumps two hundred schemas into the context window before the LLM reads a single word of the user's request: tens of thousands of tokens, most of them irrelevant.</p> <p>The execution model compounds the problem. Every tool call is a round-trip. The LLM calls a tool, the result passes back through the context window, the LLM reasons about it, calls another tool. Intermediate results that only exist to feed the next step burn tokens flowing through the model on every turn.</p> <p>The code mode pattern, <a href="https://blog.cloudflare.com/code-mode/">introduced by Cloudflare</a> and <a href="https://www.anthropic.com/engineering/code-execution-with-mcp">explored by Anthropic</a>, addresses both problems at once: instead of calling tools one at a time, the LLM writes a script that composes them. Search for what's available, write code, execute it in a sandbox. The intermediate results stay inside the sandbox. The context window stays clean. Cloudflare recently <a href="https://blog.cloudflare.com/code-mode-mcp/">shipped a server-side implementation</a> for their own API: two tools covering 2,500 endpoints in roughly 1,000 tokens.</p> <p>FastMCP 3.1 ships server-side code mode with fully configurable discovery, and the server-side part matters more than it sounds.</p> <h2>CodeMode</h2> <p>Here's a normal FastMCP server with <strong>CodeMode</strong> applied:</p> <pre><code>from fastmcp import FastMCP
from fastmcp.experimental.transforms.code_mode import CodeMode

mcp = FastMCP("Server", transforms=[CodeMode()])

@mcp.tool
def add(x: int, y: int) -&gt; int:
    """Add two numbers."""
    return x + y

@mcp.tool
def multiply(x: int, y: int) -&gt; int:
    """Multiply two numbers."""
    return x * y
</code></pre> <p>The only difference from a standard server is <code>transforms=[CodeMode()]</code>. The tool functions stay the same.
But clients connecting to this server no longer see <code>add</code> and <code>multiply</code> directly; they see the meta-tools that CodeMode provides: tools for discovering what's available and for writing code that calls them.</p> <p>The default flow has three stages. Granted, three stages might sound like a lot for something intended to <em>reduce</em> server round-trips. The original code mode pattern, introduced by Cloudflare, had no discovery phase at all: clients loaded every tool definition into context, then executed code against them. This solved the sequential calling problem but not the context bloat problem. Anthropic introduced a two-stage approach: search for relevant tools, then execute. This addressed both problems.</p> <p>For servers complex enough to need code mode, we've found that an additional stage makes a meaningful difference. Separating search from schema retrieval lets the search tool stay lightweight, returning only names and brief descriptions, while a dedicated schema step provides the precision the LLM needs to write correct code. But if you want something else, FastMCP permits full customization of this flow to have as few or as many stages as you need.</p> <p>Here's how the three default stages play out with the server above:</p> <p>First, the LLM searches. It calls <code>search(query="math numbers")</code> and gets back tool names and descriptions: a lightweight index. Instead of loading two hundred schemas, it sees a few lines of text about the tools that match.</p> <p>Next, it requests parameter details for the tools it found. <code>get_schema(tools=["add", "multiply"])</code> returns parameter names, types, and required markers. 
Not the full JSON schema (by default), but enough to write code against.</p> <p>Finally, it writes a Python script and executes it in a sandbox:</p> <pre><code>a = await call_tool("add", {"x": 3, "y": 4})
b = await call_tool("multiply", {"x": a, "y": 2})
return b
</code></pre> <p>Three round-trips: search, schema, execute. The intermediate result (<code>a</code>) never enters the context window. <code>call_tool</code> is the only function available inside the sandbox; no filesystem, no network, just tool calls and Python.</p> <h2>Discovery</h2> <p>The three-stage flow is the default. CodeMode's discovery surface is fully configurable, because different tool catalogs need different approaches.</p> <p>CodeMode ships four discovery tools. All of them share a tunable <strong>detail level</strong> that controls how much information each response includes:</p> <table> <thead> <tr> <th>Level</th> <th>Output</th> <th>Token cost</th> </tr> </thead> <tbody> <tr> <td><code>"brief"</code></td> <td>Tool names and one-line descriptions</td> <td>Cheapest</td> </tr> <tr> <td><code>"detailed"</code></td> <td>Compact markdown with parameter names, types, and required markers</td> <td>Medium</td> </tr> <tr> <td><code>"full"</code></td> <td>Complete JSON Schema</td> <td>Most expensive</td> </tr> </tbody> </table> <p>This is significant. Even <strong>ListTools</strong>, which dumps the entire catalog, can produce substantially fewer tokens than a standard MCP handshake when set to <code>"brief"</code> or <code>"detailed"</code>. A standard <code>tools/list</code> response includes the full JSON Schema for every tool: argument names, types, nested objects, descriptions, constraints. ListTools at <code>"brief"</code> returns just names and descriptions. 
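</p> <p>To make that gap concrete, here's a back-of-the-envelope comparison in plain Python. The schemas and the four-characters-per-token estimate are toy stand-ins, not FastMCP's actual payloads:</p>

```python
import json

# Toy stand-ins: what a full tools/list entry vs. a "brief" entry might carry.
full_entry = {
    "name": "add",
    "description": "Add two numbers.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "x": {"type": "integer", "description": "First addend"},
            "y": {"type": "integer", "description": "Second addend"},
        },
        "required": ["x", "y"],
    },
}
brief_entry = {"name": "add", "description": "Add two numbers."}

def approx_tokens(entry, n_tools=200):
    # Naive estimate: ~4 characters per token, n_tools copies of the entry.
    return n_tools * len(json.dumps(entry)) // 4

full = approx_tokens(full_entry)
brief = approx_tokens(brief_entry)
print(f"full handshake ~{full} tokens, brief listing ~{brief} tokens")
```

<p>Even with this tiny schema the brief listing is a small fraction of the full one, and real schemas only widen the gap: every nested property and constraint lands on the "full" side of the ledger.</p> <p>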
The context dump tax is still there, but it's a fraction of what it would be, and the sequential calling tax is eliminated entirely because tool calls happen inside the sandbox.</p> <p>By default, two discovery tools are enabled:</p> <p><strong>Search</strong> finds tools by natural-language query using BM25 ranking. Defaults to <code>"brief"</code> detail. The LLM can override the detail level per call, requesting <code>"detailed"</code> for inline schemas or <code>"full"</code> for the complete JSON Schema.</p> <p><strong>GetSchemas</strong> takes a list of tool names and returns parameter details. Defaults to <code>"detailed"</code>. The fallback for when search results aren't enough to write code against.</p> <p>Two more are opt-in:</p> <p><strong>ListTools</strong> dumps the entire catalog. At <code>"brief"</code> detail, this is a lightweight alternative to standard MCP tool listing. For small servers, under twenty tools or so, seeing everything upfront can be faster than searching.</p> <p><strong>GetTags</strong> lets the LLM browse tools by <a href="https://gofastmcp.com/servers/tools#tags">tag</a> metadata, then pass tags into Search to narrow results. Useful when tools have a natural taxonomy.</p> <p>The discovery configuration is where the server author's knowledge becomes design. A large platform server might use all four tools with progressive detail levels: tags for orientation, search for narrowing, schemas for precision. A smaller server can collapse to two stages by bumping search detail:</p> <pre><code>from fastmcp.experimental.transforms.code_mode import CodeMode, Search, GetSchemas

code_mode = CodeMode(
    discovery_tools=[Search(default_detail="detailed"), GetSchemas()],
)
</code></pre> <p>Now search returns parameter schemas inline, and the LLM goes straight from search to execute. 
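</p> <p>A quick aside on how Search ranks results: BM25 works over names and descriptions, and a toy version is easy to picture. The TF-IDF-style scoring below is a simplified stand-in over a made-up catalog, not FastMCP's implementation:</p>

```python
import math

# Toy index: tool name -> description (stand-ins, not a real FastMCP catalog).
catalog = {
    "add": "Add two numbers",
    "multiply": "Multiply two numbers",
    "send_email": "Send an email message to a recipient",
}

def tokenize(text):
    return text.lower().split()

def score(query, name, docs):
    # Bare-bones TF-IDF: reward terms that match this description
    # but appear in few others. Real BM25 adds length normalization.
    words = tokenize(docs[name])
    n = len(docs)
    total = 0.0
    for term in tokenize(query):
        tf = words.count(term)
        df = sum(term in tokenize(d) for d in docs.values())
        if tf and df:
            total += tf * math.log(1 + n / df)
    return total

def search(query, docs, k=2):
    ranked = sorted(docs, key=lambda name: score(query, name, docs), reverse=True)
    return ranked[:k]

print(search("multiply numbers", catalog))  # → ['multiply', 'add']
```

<p>The point isn't the ranking math; it's that the LLM gets back a handful of names instead of the whole catalog.</p> <p>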
GetSchemas stays available as a fallback for complex parameter trees.</p> <p>This two-stage configuration is exactly the pattern Cloudflare <a href="https://blog.cloudflare.com/code-mode-mcp/">shipped for their API</a>: search returns enough detail to write code, execute runs it. In FastMCP, it's one line applied to any server. Cloudflare's results — and early usage patterns — suggest two-stage may be the better default for most servers. It's something we're actively evaluating.</p> <p>A very simple server can skip discovery entirely and bake tool instructions into the execute tool's description:</p> <pre><code>code_mode = CodeMode(
    discovery_tools=[],
    execute_description=(
        "Available tools:\n"
        "- add(x: int, y: int) -&gt; int: Add two numbers\n"
        "- multiply(x: int, y: int) -&gt; int: Multiply two numbers\n\n"
        "Write Python using `await call_tool(name, params)` and `return` the result."
    ),
)
</code></pre> <p>Each of these patterns is a conscious choice about the tradeoff between token cost and discovery accuracy. The server author makes that choice once, and every client benefits. This is the fundamental advantage of server-side code mode: the person who knows the tools best is the one deciding how they're discovered and composed.</p> <h2>Composition</h2> <p>In the <a href="/blog/fastmcp-3">FastMCP 3.0 architecture</a>, components flow through a pipeline. <strong>Providers</strong> source them; <strong>transforms</strong> modify them on the way to clients. A transform can rename, filter, namespace, or reshape what a provider exposes, and transforms compose: stack them, and each one processes the output of the previous.</p> <p>CodeMode is a transform. It works with everything else in the system without special-casing.</p> <p>Apply it to an entire server, or to just one provider. Some tools go through code mode, others stay directly accessible. Chain it with other transforms: add a namespace to a mounted sub-server, then apply CodeMode to the result. 
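</p> <p>As a toy model of that composition, treat a catalog as a plain dict and a transform as a function over it (illustrative only; FastMCP's transform classes are richer than this):</p>

```python
# Toy model: a catalog maps tool names to descriptions,
# and a transform is any function from catalog to catalog.
def namespace(prefix):
    def apply(catalog):
        return {f"{prefix}_{name}": desc for name, desc in catalog.items()}
    return apply

def code_mode(catalog):
    # Collapse the catalog behind meta-tools; the real CodeMode also
    # wires up discovery tools and a sandboxed execute tool.
    return {
        "search": f"Find tools among {len(catalog)} available",
        "execute": "Run Python that composes the hidden tools",
    }

def compose(catalog, transforms):
    for transform in transforms:
        catalog = transform(catalog)
    return catalog

tools = {"add": "Add two numbers", "multiply": "Multiply two numbers"}
composed = compose(tools, [namespace("math"), code_mode])
print(composed)  # only the meta-tools remain visible
```

<p>Stacking order matters: each transform sees the output of the previous one, so in this sketch the namespaced names are what the code-mode step hides behind its meta-tools.</p> <p>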
Filter tools by tag or version, then wrap whatever passes through.</p> <p>One pattern worth highlighting is to proxy a remote server, then apply <code>CodeMode</code>:</p> <pre><code>from fastmcp.server import create_proxy
from fastmcp.experimental.transforms.code_mode import CodeMode

remote = create_proxy("https://api.example.com/mcp")
remote.add_transform(CodeMode())
remote.run()
</code></pre> <p>That remote server now has a code execution interface with tunable discovery. The original authors didn't build one. The person running the proxy configured one that fits their application.</p> <p>The behavior falls out of the architecture.</p> <p>&lt;Callout color="blue"&gt; <strong>Coming soon:</strong> We're adding configurable code mode for every server hosted on <a href="https://prefect.io/horizon">Prefect Horizon</a>. No code changes required. &lt;/Callout&gt;</p> <h2>The Sandbox</h2> <p>The Python execution environment is sandboxed via Pydantic's <a href="https://github.com/pydantic/monty">Monty</a> project, an experimental Python sandbox that restricts LLM-generated code to <code>call_tool</code> and standard Python. No filesystem access, no network access, nothing outside the sandbox boundary.</p> <p>Building a Python sandbox that's secure enough for production and flexible enough to be useful is genuinely hard. The Pydantic team has been doing excellent work on Monty, and CodeMode wouldn't exist without it.</p> <p>Resource limits are configurable: timeouts, memory caps, recursion depth.</p> <pre><code>from fastmcp.experimental.transforms.code_mode import CodeMode, MontySandboxProvider

sandbox = MontySandboxProvider(
    limits={"max_duration_secs": 10, "max_memory": 50_000_000},
)

mcp = FastMCP("Server", transforms=[CodeMode(sandbox_provider=sandbox)])
</code></pre> <p>The sandbox provider itself is replaceable. 
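</p> <p>Concretely, "replaceable" means supplying any object that satisfies a small protocol. Here's a hedged sketch in plain Python; the method name and signature are illustrative guesses, so check the FastMCP docs for the real <code>SandboxProvider</code> interface:</p>

```python
from typing import Any, Awaitable, Callable, Protocol, runtime_checkable

@runtime_checkable
class SandboxProviderSketch(Protocol):
    """Illustrative stand-in for FastMCP's SandboxProvider protocol."""

    async def execute(
        self,
        code: str,
        call_tool: Callable[[str, dict[str, Any]], Awaitable[Any]],
    ) -> Any: ...

class ContainerSandbox:
    """Skeleton provider that would ship code off to a container runtime."""

    async def execute(self, code, call_tool):
        raise NotImplementedError("send `code` to Docker, a remote service, etc.")

print(isinstance(ContainerSandbox(), SandboxProviderSketch))
```

<p>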
Implement the <code>SandboxProvider</code> protocol and point CodeMode at a Docker container, a remote execution service, whatever fits the deployment.</p> <h2>Getting Started</h2> <pre><code>pip install "fastmcp[code-mode]" </code></pre> <p>CodeMode is experimental. The core interface is stable, but the specific discovery tools and their parameters may evolve as we learn more about what works in practice.</p> <p><a href="https://gofastmcp.com/servers/transforms/code-mode">Documentation</a> · <a href="https://github.com/PrefectHQ/fastmcp">GitHub</a></p> <p>Happy (context) engineering!</p>