Patrick Müller

Static Protocols in Python: Behaviour Over Inheritance

Patrick Müller — Mon, 05 Jan 2026 09:23:00 GMT

The first time I read about protocols was in the book "Fluent Python" by Luciano Ramalho. This book goes deep. Deeper than I knew Python at that time. If you haven't heard of protocols before, I'll give you a short introduction.

Protocols are part of Python's typing system. With protocols, you can check whether an object is valid based on whether it has the right methods and attributes. The idea is to check for behaviour instead of inheritance. Protocols extend Python's type hints by allowing you to define structural types. They can be confusing at first and difficult to understand, especially in a real-world scenario. In my opinion, that's partly because, for the concept to click, one must mentally move away from classic object-oriented programming, which mostly relies on inheritance. It's also difficult because it's an advanced concept. On top of that, typing in Python has evolved gradually, and so has its naming scheme, which is sometimes difficult to grasp, too.

Generally, we distinguish between dynamic and static protocols. This article is about static protocols.

What's mostly used - Goose Typing

It's best to start with a short example. Before I knew protocols, I used isinstance to check for a type. This is also called goose typing. It is mostly used when one wants to check at runtime whether an object is of a specific type. "At runtime" is the important part here. Here's an example:

def fly_to_moon(something):
  if isinstance(something, Spaceship):
    something.fly()
  else:
    print("Ain't flying to the moon with this")

Usually, goose typing is used with ABCs (abstract base classes). The concept aligns very well with the human brain. With inheritance, we can bring structure to chaos. It's neat. Most of the time, developers think about the domain they develop functions for, and then come up with an inheritance tree that's very carefully crafted for that domain. In the scenario above, we'd have a base class Plane and a class Spaceship that inherits from Plane, including its fly method. What I do not like about this is that, in application software, I do not want to limit the possibilities upfront by using inheritance. Secondly, inheritance introduces tight coupling, which is a drawback.
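To make the inheritance variant concrete, here is a minimal sketch of goose typing with an ABC. The Plane and Spaceship names follow the scenario above; the Car class and the return strings are assumptions added purely for illustration.

```python
from abc import ABC, abstractmethod


class Plane(ABC):
    """Abstract base class: every subclass must implement fly()."""

    @abstractmethod
    def fly(self) -> str: ...


class Spaceship(Plane):
    def fly(self) -> str:
        return "Spaceship is flying"


class Car:
    # Hypothetical class outside the inheritance tree.
    def drive(self) -> str:
        return "Car is driving"


def fly_to_moon(something: object) -> str:
    # Goose typing: a runtime isinstance check against the ABC.
    if isinstance(something, Plane):
        return something.fly()
    return "Ain't flying to the moon with this"


print(fly_to_moon(Spaceship()))
print(fly_to_moon(Car()))
```

Note that anything that wants to fly to the moon has to inherit from Plane (or be registered with it), which is exactly the coupling described above.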

Static Protocols

Protocols come with the idea that an object should behave in a certain way (structural typing), rather than implementing a certain interface or inheriting from a base class, which is also called nominal typing.

Short Intro

Structural means we care about the behaviour. So, as long as we have an object that supports a fly function, we are pretty happy with it and think, "Okay, you apparently have all you need, let's go". Here's how we would adjust our example from above to use protocols:

from typing import Protocol

class MoonFlyable(Protocol):
  def fly(self): ...
  
class Superman:
  def fly(self):
    print('I am not a plane but still able to fly to the moon 🚀')

def fly_to_moon(who_knows: MoonFlyable):
  who_knows.fly()

Now, our construct is not only cleaner than goose typing and less coupled, but it also enables static type checkers like mypy, Pyright, or ty to pick up on compatibility before we run the code.

While you can write fly_to_moon(Superman()), a static type checker will complain about fly_to_moon(Car()) and mark it as invalid, assuming Car has no fly method.

Remember, all we ask for is a specific behaviour to be fulfilled. This gives us greater flexibility in our software design. Here's my take on illustrating this:

Superman is able to fly to the moon ...

That example is a bit artificial, but serves perfectly as a short intro.

Real-World Example

Real-world code is usually more complex than this, and when I read tutorials, I often wish to see something a bit more difficult.

As a software engineer in the AI field, I sometimes need to load different trained models for certain projects. A very basic machine learning use case is to predict a class, and we have different models that predict different classes (e.g. cats or dogs, not Python classes) in different ways. You often see ABCs here, but, as I said before, I think you lose flexibility this way. With protocols, each developer on the team can create, train and predict without extending an inheritance tree. I prefer to program functionally as I favour composition over inheritance, but others can write object-oriented code and the code remains compatible.

Here's what this looks like:

from typing import Protocol

class Predictable(Protocol):
    """Protocol for models that can make predictions (Used for API)."""

    def predict_top_k(self, query: str, k: int) -> list[str]:
        ...

class TfidfPredictor:
    """Predictor class for TF-IDF model to get top-k predictions."""

    def __init__(self, model):
        self.model = model

    def predict_top_k(self, query: str, k: int) -> list[str]:
        """Predict top-k classes and their probabilities."""
        # code omitted because not relevant here


tfidf_model = load_model(model_path)
ml_models: dict[str, Predictable] = {}
ml_models['tfidf'] = TfidfPredictor(tfidf_model)

The TfidfPredictor is really just a wrapper that contains only the predict_top_k method the protocol wants to see. This demonstrates how I would store loaded models in a dictionary, using the Predictable protocol as the value type. We can then run the predictions in a 'duck typing' manner based on this.

Alternatively, we could use the protocol class for our inference method.

tfidf_model = load_model(model_path)
predictor = TfidfPredictor(tfidf_model)

def run_inference(model: Predictable, query: str, k: int) -> list[str]:
    return model.predict_top_k(query, k)

y_pred_k = run_inference(model=predictor, query="Classify this text", k=5)

Now, if you look at the code, you see that the TfidfPredictor is completely independent of implementing any ABC. All we care about is that the model we pass to our dictionary or to the run_inference function is compliant with the Predictable protocol.

Limitations of Static Protocols

Keep in mind that protocols do not guarantee that, at runtime, the type is enforced. If you need to check during runtime, you would need to decorate your defined protocol with @runtime_checkable. Then, you could use isinstance again: isinstance(obj, Predictable).
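A small sketch of what that looks like, and of one caveat worth knowing: the runtime check only looks at whether the method exists, not at its signature. The EchoPredictor and WrongSignature classes are hypothetical stand-ins, not part of the project code above.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Predictable(Protocol):
    def predict_top_k(self, query: str, k: int) -> list[str]: ...


class EchoPredictor:
    # Hypothetical stand-in for a real model wrapper.
    def predict_top_k(self, query: str, k: int) -> list[str]:
        return [query] * k


class WrongSignature:
    # Has the method name, but not the signature the protocol asks for.
    def predict_top_k(self) -> None:
        return None


print(isinstance(EchoPredictor(), Predictable))   # True
print(isinstance(WrongSignature(), Predictable))  # True: only the method's presence is checked
print(isinstance(object(), Predictable))          # False
```

So the runtime check is weaker than what a static type checker verifies; it complements the static check rather than replacing it.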

Small Detour to Duck Typing and Static vs. Dynamic Protocols

When I say duck typing, it has to come with the very famous saying: "If it walks like a duck and it quacks like a duck, then it must be a duck." This saying gives you the interpreter's view of an object. As long as an object provides the required behaviour (e.g. the quack() method), it will be used as a duck, regardless of its actual type. This should sound familiar to you from static protocols. Both are behavioural. The main difference is that, with plain duck typing, your IDE or a static type checker won't give you any hint about the correctness of the signature. Here's an adapted example from the Duck Typing article on Real Python:

Source: Leodanis Pozo Ramos, "Duck Typing in Python: Writing Flexible and Decoupled Code", Real Python

class Duck:
    def fly(self):
        print("The duck is flying")


class Swan:
    def fly(self, height: int):
        print("The swan is flying")


birds = [Duck(), Swan()]

for bird in birds:
    bird.fly()

In this example, my static type checker (Pyright) doesn't tell me that we need a height for the Swan object to run the fly method. Duck typing is typically used with EAFP ("easier to ask forgiveness than permission"), and I like it. Static protocols extend the idea and make it safer to program, catching possible runtime errors at development time instead of at runtime. Small side note: structural (sub)typing is also called static duck typing.
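For readers who haven't met EAFP yet, here is a minimal sketch of how the loop above could ask forgiveness instead of permission. The let_it_fly helper and the return strings are assumptions for illustration, not code from the Real Python article.

```python
class Duck:
    def fly(self) -> str:
        return "The duck is flying"


class Swan:
    def fly(self, height: int) -> str:
        return f"The swan is flying at {height} m"


def let_it_fly(bird) -> str:
    # EAFP: just try the call and handle the failure afterwards.
    # Caveat: this also swallows TypeErrors raised inside fly() itself.
    try:
        return bird.fly()
    except TypeError:
        return f"{type(bird).__name__} could not fly"


results = [let_it_fly(bird) for bird in (Duck(), Swan())]
print(results)
```

The mismatch in Swan.fly only shows up when the call is made. A protocol with a fly(self) method used as the parameter type would let a static type checker flag Swan before the code runs.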

Conclusion

This article was about static protocols, also called static duck typing. They have been part of Python since version 3.8. Before that, we already had informal interfaces, also called dynamic protocols. The naming is sometimes confusing, and I have to look it up from time to time myself.

The difference is that a dynamic protocol does not have to be fully implemented and cannot be verified by a static type checker. An example of a dynamic protocol is the sequence protocol, which requires __getitem__ and __len__; however, an object that implements only __getitem__ still works in many places, for example in iteration.
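A minimal sketch of that partial implementation, using a hypothetical Vowels class: it defines neither __len__ nor __iter__, yet iteration and membership tests still work because Python falls back to calling __getitem__ with 0, 1, 2, ... until an IndexError is raised.

```python
class Vowels:
    """Implements only __getitem__, not __len__ or __iter__."""

    def __getitem__(self, index: int) -> str:
        return "aeiou"[index]


vowels = Vowels()
print(list(vowels))   # iteration falls back to __getitem__
print("e" in vowels)  # the membership test falls back to iteration

try:
    len(vowels)
except TypeError:
    print("len() is not supported without __len__")
```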

To me, the main advantages of static protocols are the ability to write loosely coupled code, to avoid running into an increasingly complex inheritance construct, and, most importantly, to specify behaviour (structural types) instead of nominal types. Hence, in a project, it becomes easier to bring different developer styles and libraries together. One developer might write more functional code, another more object-oriented code. Besides that, it is otherwise cumbersome to make different libraries (think of the machine learning world) work with each other in your own pipelines.

If you want to read more about static protocols, I can recommend the original PEP 544 – Protocols: Structural subtyping (static duck typing).

This was a short primer on using static protocols in a real-world project. Ever run into issues with ABCs (too big to touch, too complex?!)? Let me know!

🤗

If you enjoyed this article, it would mean a lot to me if you shared it on social media or forwarded it to a friend. I write in my spare time, so any support is welcome.

How to build internal developer tools with a small team

Patrick Müller — Sun, 26 Oct 2025 14:23:03 GMT

There are thousands of resources on how to build good software products. When I was developing my first SaaS product, LemonSpeak, I often found myself in the same loop: Should I implement a new feature or improve an existing one?

The question seems easy - but it isn't. I very often heard people ask how they would know whether they had reached product-market fit. I wondered about it quite a lot myself. Do you look at numbers? Monthly growth rate? For sure, that tells you something. But traction is something you feel, and that's the best signal to me. You will recognize a difference in the feedback of your users: more likes, more willingness to talk to you, more positive feedback.

The same dilemma returns if you build dev tools inside a company, just with different faces. Instead of customers, your users are coworkers. Instead of growth metrics, you have scattered feedback and coffee corner chats. You are still trying to figure out what to build next.

How to start from scratch for internal dev tooling

There’s a book called The Mom Test. It’s a great resource and I used the principles quite a lot in interviews. The gist is that you should get to the actual problems of a user instead of pitching your solution. Otherwise, most interviewees are too polite and will say: “Sounds good. I'll give it a try”, even though it doesn't help them at all.

I believe that this process should also be applied internally and repeated continuously:

Collect the problems other engineers and teams encounter
Prioritise them

That’s your "demand box". If you start on a green field, this and people's opinions are all you have. It's not rocket science. The only caveat is that you need a certain number of data points to iron out the outliers in your problem clustering. Based on this, you and your team can develop the first service for other developers. Great! But now things get murky again. Without further significant signals from your users, it’s difficult to decide what to develop next. The same question arises again: Should you develop the next feature or improve an existing part?

Three axes of boat (product) building

Imagine yourself as a boat builder. If you want to build a boat, you start small with a simple boat. Then, you want to make it wider to carry more load. As you add planks, you realise that the boat is becoming too chunky and unstable under the extra weight. To stabilise the boat, you need to deepen the hull. Now we have two axes: width and depth (x and y).

Two axes of software building

Now imagine the software you are building as a boat. There are two dimensions that you are constantly balancing.

Width & depth

The width, in our analogy, is the capabilities or the number of features your product has. That can be extra functionality, more endpoints, or a new piece of UI for your users to interact with the product. That width, however, comes with weight. You add complexity, maintenance, and potential friction. You widen the boat - it carries more, but it becomes unstable and more complex to steer.

The depth, on the other hand, means that you improve existing parts of your product. You go deeper in the development of a specific piece. You fix rough edges, increase stability by adding tests, fix bugs, and reduce running time to make it more efficient. That's why I compare it to the depth of the hull of your boat. The depth makes the boat stable.

However, there is a third dimension. When you make the boat wider and deeper, you might find that even though you can carry more load, sail through deeper water and carry more people, the boat has become slower and more difficult to navigate. You realise that it’s not enough to make it bigger. You also need to add extra sails and a better rudder, and to clean up the stuff that would roll around on the deck when it gets wild on the ocean.

The preparation dimension

I call this the preparation dimension. The idea is that you prepare your architecture to make changes in the two very tangible dimensions (width & depth).

This is not only about refactoring your current approach, even though that can be one part. It’s about setting the right course for your product, so that you can integrate future changes adequately. This is why I compare it to a rudder or a sail. Without those, you can still change the size of the boat, but not as efficiently as you could with the right sail and rudder.

Here’s a very practical example from a product I am working on: At one point, we realised that we had enough demand to open up the connectivity of our product. We re-thought the current approach as some sort of plugin technology. What I would suggest is to prepare such a change through another piece of functionality that you add anyway, either for depth or for width. In our case, it was width. That way, one sprint or quarterly planning prepares the next one.

Balancing it.

Having these three dimensions separated helps me tremendously with my mental model of building products. The dimensions give me a tool to define the next steps in a balanced way. Of course, one dimension can get more weight than another. That can be because you accept technical debt in exchange for speed, or because you want to smoke test features to find product-market fit. But the framework makes this transparent for you and the team, and it stays on the radar that you are building a raft instead of a boat.

With this framework, you can literally ask the following questions during planning:

Are we widening our product this sprint?
Are we deepening it?
Are we preparing it?

It’s OK to focus on one axis more than on another, but I believe they are not independent from each other. At some point, the other axis has to catch up. Now, to build a proper ship, we have to balance these dimensions.

Build and ship in a fast-paced time

Building software products in AI currently comes with a lot of uncertainty. Nobody has a crystal ball to look into the future. Some are more visionary than others, but building products also has a technical and strategic perspective. Building internal dev tools at a company is a mix of demand, what you see in the company, and future trends. Make sure to talk to the devs you think will benefit from the product, and integrate trends when you see a strong signal that they will be helpful to them in the foreseeable future. When in doubt about what to build next, another feature or deeper functionality, use the mental model of boat building.


When is a knowledge graph open or closed world? And how to create a company wide implementation strategy

Patrick Müller — Fri, 03 Oct 2025 14:08:04 GMT

This week, I created my very first personal knowledge graph. It was a rather small one, built on my knowledge about Star Wars. I’ve been following a course at HPI (Hasso Plattner Institut), and this topic is also important at work. So, I wanted to dip my toes in the water with a domain I’m familiar with. The graph is small: ~10 nodes and a couple of edges. I realised that a graph can be modelled in different ways. When you ask ten different people, you might get ten different perspectives on the world of Star Wars, resulting in slightly different versions of a graph. When I discussed my graph with co-workers, we talked about the way I had modelled gender. I thought of it as an attribute of a person, whereas you could also model it as its own class. That would result in an edge from a person to, for example, “female”.

This was the beginning of a discussion about the “Open World Assumption (OWA)” and the “Closed World Assumption (CWA)”, a concept I hadn’t heard of before. In an open world, you basically say that whatever you don’t know, or are not able to deduce from the graph, is neither assumed to be false nor given some sort of default value. Here’s an example: For Yoda, I didn’t specify the gender because I had never really heard that Yoda is male. I assumed it, but didn’t know. So, leaving this information out of my model makes it impossible to conclude the gender of Yoda.

In contrast, a closed world assumption means that what is not known is false. For example, if my graph does not contain a sister for Luke, we conclude that he doesn't have one. Well, if this sounds a bit off to you because your gut tells you there is something wrong with this, then you are right. When I first heard this, I found it both confusing and convenient, because whether a knowledge graph is open or closed world depends, in a way, on the view of the person who sits in front of it. What if I said, “OK, Luke has no sister. I see this as open world. Maybe he actually has a sister!”

Before I elaborate, I want to clarify this:

Open World: We assume incompleteness
Closed World: We assume completeness

I can imagine that, when modelling a graph, we are most of the time in an open world assumption. When I want to work with this graph by writing a piece of software that queries the graph for information, I might end up with a closed world assumption, because I have to decide practically.

Let's see: I write a query that lists all Jedi mentors who are male. I then realize that Yoda is missing! Now, to include Yoda, I would make a tradeoff: I specify that each node without a gender is considered male (I could have done this the other way around, with female). By doing this, I kind of shift from “incompleteness” to “completeness” because I define that the missing value is treated here as “male”.
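The tradeoff can be sketched in a few lines of Python. This is only an illustration of the article's framing, not a real graph query language: the people dictionary, the role values and the male_mentors function are all hypothetical.

```python
from typing import Optional

# Hypothetical mini graph: node name -> attributes. None means "not stated".
people: dict[str, dict[str, Optional[str]]] = {
    "Obi-Wan": {"role": "mentor", "gender": "male"},
    "Yoda": {"role": "mentor", "gender": None},
    "Leia": {"role": "student", "gender": "female"},
}


def male_mentors(graph, default_gender: Optional[str] = None) -> list[str]:
    """List mentors whose gender is male, optionally filling unknown genders."""
    result = []
    for name, attrs in graph.items():
        gender = attrs["gender"] if attrs["gender"] is not None else default_gender
        if attrs["role"] == "mentor" and gender == "male":
            result.append(name)
    return result


# Open world: an unknown gender stays unknown, so Yoda is not returned.
print(male_mentors(people))
# Practical shift towards completeness: missing gender is treated as "male".
print(male_mentors(people, default_gender="male"))
```

The data is identical in both calls; only the decision about how to treat the missing value changes the answer.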

How do you build similar knowledge graphs across a company?

What I’ve written above shows that it can be difficult to agree on a common style for modelling knowledge graphs. A gender pay gap researcher would certainly see gender as its own concept, and therefore give gender a more important role, which results in a class and many more variations in how a node is instantiated. The question is: how do you get some sort of shared concept? I learned at SICK that you need some organization within your company that sets the course for knowledge graphs, a KG CoE (center of excellence), so to speak. They create an upper-level concept of entities and need to carry the knowledge of how to model graphs into other teams. They should run workshops to create a common wording, explain the company concepts, show pitfalls, and build a community. It’s already difficult to justify knowledge graphs to management, but if enough units create graphs, those graphs need to be connectable. It’s not that they must be connected. It’s fine to build a small world in a domain, but best practices need to be passed on.

How IKEA builds knowledge graphs

There’s an article on Medium by Katariina Kari, who works at IKEA, in which she describes how their KG strategy consists of three layers. The top layer is concepts, which maps to what I explained in the previous section. The second is categories, and the third is data. The concept layer would be the ontology of your knowledge graph. It contains classes and properties. Their categories are something like “bookcase, sofa, or coffee table”. It’s their vocabulary. The base layer (data) then contains the actual products.

The interesting part is how this KG pyramid affects the work at IKEA. They describe that the concept layer is defined with governance policies (by the ontology team), while the category layer is described as requiring domain expertise. So, they must have some sort of workshops for this. Their data layer is very large, and its creation is automated.

Source: Katariina Kari, https://medium.com/flat-pack-tech/ikeas-knowledge-graph-and-why-it-has-three-layers-a38fca436349



The LLM Hype Train: A Pamphlet You Should Read With Your Manager

Patrick Müller — Sat, 30 Aug 2025 16:09:01 GMT

What if I told you ChatGPT is the end of software engineering? Would you believe it? Three years ago, OpenAI changed the game in the AI field with ChatGPT. ChatGPT is built on Large Language Models, or LLMs for short. Since then, AI has finally made it into the hallways of the majority of companies. Even in Germany.

When AI Hit the Office

I believe that every extreme is bad. That includes the current LLM and agentic hype. There’s always a trending topic in tech. It mostly starts in academia, catches fire in startups, and soon becomes glorified on LinkedIn or any other social media platform. That’s not new. The same happened with LLMs. But what is new is how accessible LLMs are, and with that, AI became “salon-ready”, that is, socially acceptable.

Everyone is capable of thinking of a killer use case for applying this technology and turning around a sinking boat. This is good in the sense that everyone is enabled to come up with ideas. But there's a dark side: people oversimplify what LLMs actually are.

Prompt + text in → Solution out.

So simple. So seductive. So bad for (ML) software engineers. All of a sudden, every regex becomes a prompt. Every problem is solvable - just ask an LLM.

The Illusion of Simplicity

Prompt in, solution out — but at what cost?

Let me quote what Pydantic says on their website as of 12.07.25:

https://ai.pydantic.dev/logfire/ (Debugging & Monitoring)
Applications that use LLMs have some challenges that are well known and understood: LLMs are slow, unreliable and expensive.

These applications also have some challenges that most developers have encountered much less often: LLMs are fickle and non-deterministic. Subtle changes in a prompt can completely change a model's performance, and there's no EXPLAIN query you can run to understand why.

Warning
From a software engineers point of view, you can think of LLMs as the worst database you've ever heard of, but worse.
If LLMs weren't so bloody useful, we'd never touch them.

Before LLMs, we had simple LMs - Language Models. I remember a tutorial at university in which we built a tweet bot for a politician who was very active on Twitter. That bot could generate one tweet after another. It followed the same basic principle as the first Large Language Models, but was very limited in its general capabilities and much more like a parrot.

Why Autonomous Agents Often Fail in The Real World

Now, just when you thought the hype couldn’t get any bigger, agents came along and knocked on your company's door. Agents are LLM-powered bots that autonomously execute tasks using well-defined APIs, typically via MCP (Model Context Protocol). The foundation is still a language model, but wrapped in orchestration logic that chains steps together. The chat interface is what makes it so magical. The LLM then decides, in a chain-of-thought-like flow, when and what information it needs to request to fulfill a task.

For example:

You ask your agent to book the cheapest flight from Berlin to Paris next weekend. It looks up APIs, navigates websites, compares prices, reads content - and bam, your flight has been booked! Très bien.

Except … maybe you are going to Prague. On a Tuesday. In business class. It’s bittersweet, because the technology is great, but errors accumulate. In May 2025, I saw a talk called "The Future of AI: Building the Most Impactful Technology Together" by Leandro von Werra, who works at HuggingFace, at PyCon DE & PyData 2025 in Wiesbaden, in which he explained exactly that issue. If an agent solves each subtask of that one big task of booking a flight with 90% accuracy, then with five subtasks you end up with a success rate of 0.9 x 0.9 x 0.9 x 0.9 x 0.9 ≈ 0.59, or 59%. That’s almost a Bernoulli experiment, like a coin flip. Just cheaper in time and money, or you think of your travel budget as the coin.
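The arithmetic is easy to check and to play with. The following is a minimal sketch that assumes every subtask succeeds independently with the same accuracy; the chain_success_rate function name is made up for this illustration.

```python
def chain_success_rate(step_accuracy: float, steps: int) -> float:
    """Probability that all steps succeed, assuming independent steps."""
    return step_accuracy ** steps


rate = chain_success_rate(0.9, 5)
print(f"{rate:.2%}")  # 59.05%

# The longer the chain, the faster the overall success rate drops.
for steps in (1, 5, 10, 20):
    print(steps, f"{chain_success_rate(0.9, steps):.2%}")
```

In practice, the subtasks of an agent are rarely independent or equally hard, so treat this as a rough intuition rather than a prediction.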

To be fair, APIs - that the agent uses via MCP - can introduce determinism, bringing much more joy to this rigged game. If the function call is stable and predictable, you regain some control. But the agent still decides what input to send — and that’s where the chaos may return.

LLMs - The Swiss Army Knife of AI Models

Why generalist tools aren’t always the right choice

LLMs are impressive. But they are generalists (so far). Whenever I talk to someone about LLMs and their capabilities, I tell them that I see them as a Swiss Army knife. They are good at many things, but they are not specialists and are therefore only excellent at a few. Let’s circle back three years. Before LLMs had come along, I’d argue we had the major fields of:

Natural Language Processing
Computer Vision
Recommender Systems
Tabular data

… plus fields you would find across all of those. They were independent of the bigger field, for example, Explainable AI (XAI), on-device, federated learning, and generative AI - the list goes on.

… wait.

> Generative AI?!

Yes, that’s right. We had this before. Not only does a Twitter bot count as generative AI, but also sampling from a distribution to generate sophisticated, close to real, input data counts as generative AI.

Today, this feels like the ancient way, parts that have been forgotten, buried as relics in many ML temples across the globe. I’ve recently seen an explanatory poster in the company I work for, in the coffee corner, which hierarchically organizes AI terms among each other. It went approximately like this:

generative AI (subfield of) → Deep Learning (subfield of) → ML (subfield of) → AI.

That’s misleading. It’s not ideal on two levels. First, generative AI is not restricted to Deep Learning. Yes, you could argue that DL is a subfield of ML and that the chain is therefore right, but it’s not, because this chain sets DL as a requirement. Second, our ML/AI zoo is full of other beautiful technologies and fields. Don’t limit it to gen AI. Educate people holistically and don’t give them the sugar they already had anyway.

Coming back to the analogy of a Swiss army knife. Let's take a sentiment classification use case. I’ve seen this plenty of times before LLMs were a thing. What’s the solution?

“A supervised classifier,” you say? “Good,” I reply, “you have learned much.”

But I reckon we’ve all seen people throwing LLMs at this. It’s overkill. Remember: They’re slow, expensive, and not deterministic. Only do that for prototyping. If you think this is useful as a feature or product, then build a proper model for it. Compared to the Swiss Army knife, the proper classification model is more like a power drill. Any “traditional” classifier, in fact. Perfect for one task, but only for one. You would certainly fail if you used it to hammer in a nail, but you wouldn’t think of that anyway, because it’s not the right tool and you know it. That’s something most people haven’t grasped yet when it comes to AI/ML and LLMs.

Coding With LLMs: Why Thinking Still Matters

LLMs for coding are impressive, but keep in mind that they are trained on vast amounts of data, and all kinds: genius and garbage. That means they are trained on code that is very high in quality, but also on data that is poor in quality.

Let’s assume they balance each other out; then we get an average software engineer. (Since there are fewer experts than juniors, it does not really balance out.) I’ve heard this many times, for example, on The Real Python Podcast, Ep 248, with Raymond Camden, in which he said that LLMs were a huge help to him with Python, since he was a novice in that field, but not so much with JavaScript, since he’s an expert in that field. For me, it’s the other way around. I learn a lot from LLMs in JS and get stuff done, but I believe that the Python code I mostly get is not ideal.

Even though I do not own a crystal ball to look into the future, I can imagine that the trend is going more towards specialised LLMs.

However, the problem with LLMs is that, for some specific use cases, we already have better solutions, and differentiating between “Yes, that’s a good LLM task” and “No, we use a traditional ML/DL approach” seems difficult. I find them bloody useful myself, but most often for creative tasks. I think first, before I ask the LLM. No vibe coding for me. Why would I let the LLM do the fun part? Software engineering is a craft, and I cheerfully take the productivity boost that I get with LLMs, but I never forget that I need to think through problems, design, and architecture myself before I start playing ping-pong with my ideas and an LLM. If you don’t train a muscle, it degrades, and that most likely happens with your coding skills too when you let your LLM do all the coding for you. (Besides, LLMs still make a lot of errors anyway, and how would you recognise an error if you don’t have the expertise - busted.)

We also see a lot of videos about software engineers being replaced by sophisticated AI software agents. Now, on a bad day, I might listen to that, but on any other day, I see other jobs being replaced much earlier, if we are talking about replacement at all. Everything that is mostly about organizing, managing, and decision-making is way easier to handle with LLMs. It’s just that painting the picture of software engineering being replaced is so much more powerful. For sure, as software engineers, we need to adjust and use what boosts our productivity, but always keep in mind that you are your greatest human capital. Who do you think will perform better in a software engineering job interview? The person who vibe coded an app in seven days, or the software engineer who thoughtfully crafted this app, with high test coverage, good code design, and a modular architecture, in four weeks?

Disruption → Drawbacks → Mitigation: Is That a Typical Tech Cycle?!

We’ve seen this before, and I believe that this is somehow typical for new technology. There’s a disruptive technology, and it comes with drawbacks. Once there is a breakthrough, we start using this technology and figure out solutions to mitigate its drawbacks along the way. We’ve seen this with other technology, too, for example:

Automobiles: have added to accident rates and air pollution. Later on, we got seatbelts, airbags, and emission standards.
The internet itself: Privacy is a big pain these days, but at least we got GDPR.

Another bummer is the sheer amount of energy LLMs consume. Did you know that generating an image with ChatGPT is equivalent to fully charging your phone? We already knew that training an LLM is costly, but inference (that is, the process of computing an answer for you, regardless of whether it is text, image, or voice, i.e. multi-modal) costs a lot of computation too. And this is ongoing.

Another challenge: most of our AI tools are made in the US. We in Europe shouldn’t neglect this. First, this will become a huge privacy issue in the future, because LLMs like ChatGPT know a lot about you, your work, inclinations, etc. Second, we make ourselves dependent on the big vendors. This is an issue, in my opinion, for two reasons:

a) We rely on what they offer

b) We assume that this will always be available, but the first days of the trade dispute between Europe and the US showed that it can be a mistake to assume that we can always rely on them. Luckily, the European ecosystem is getting stronger with companies such as HuggingFace, Mistral, Aleph Alpha, Stability.ai, DeepL, and Black Forest Labs.

Final Thoughts: Be a Thinker, Not Just a Prompt Engineer

I currently see two types of companies:

Those still struggling to digitize their ecosystem
Companies that are at the forefront of innovation (or at least they think they are) and are applying the latest research

Even at big firms like SAP or SICK, there’s a wide spectrum. And an IT consultant from Freiburg I know tells me the same thing: you can't use AI if you haven’t even digitized your workflows yet. So whenever you hear somebody talking about LLMs, ChatGPT, Agents, and all the latest hypes, think of whether those mentioned technologies are the right tool to get the job done. Chances are - you can guess, it’s mostly not. Take that discussion. Eventually, you will help to build a better AI ecosystem within your company.

As I mentioned, LLMs are powerful. I use them. Often. I just recently reverse-engineered an API. Creativity. Boilerplate. Those are the things I aim for. But I still think before I prompt.

I also tried out the agentic mode in VS Code. Yes, it's impressive! It's a very powerful tool, particularly when it runs code and fixes its own mistakes and bugs. I've been trying it with a Svelte app. It was blazingly fast. It's ideal for prototyping. Although it's phenomenal, I haven't learnt much. I wouldn't be able to replicate the LLM's work. I believe that my Svelte coding skills are generally not as good as those of the LLM. So, if I continue to use the LLM in agent mode, I need to ensure that I have a certain level of expertise and continue to develop myself. What would that look like?

And, in case you were wondering: I used LLMs to critique the structure and style of this article. That’s it. I wrote it entirely by myself, but I find it helpful to get feedback. I do not let an LLM write my articles, because writing is fun. I enjoy it. It keeps me sharp. And it’s how I keep the edge that no LLM can replicate: my own thinking.

Pytest Fixtures: How to Use & Organize them in your Test Architecture

Patrick Müller — Fri, 09 May 2025 15:27:37 GMT

In my last post, I talked about why I practice TDD and why I want to share my learning progress in testing. Today I want to talk about what pytest fixtures are, when I use them and how you can share fixtures across your tests.

What are pytest fixtures?

Imagine you are writing a test that puts a function called validate_user(user: User) to the test. Of course, you need a user for this. Assume the user is defined as follows:

from dataclasses import dataclass

@dataclass
class User:
	name: str
	email: str

You would create the user in your test function and then pass it to the function you want to test:

def test_valid_user():
    user = User(name="Patrick", email="[email protected]")
    result = validate_user(user)
    assert result.is_valid

So far, so good. If you want to test another function that checks the correctness of the email, you need to write a new test function that creates and tests the user again. This is where the fixtures come into play. Instead of doing that, you write a fixture that returns a User.

For example:

import pytest

@pytest.fixture()
def user_fixture():
    return User(name="Patrick", email="[email protected]")

def test_valid_user(user_fixture):
    # The fixture's return value is injected under the fixture's name
    result = validate_user(user_fixture)
    assert result.is_valid

def test_valid_email(user_fixture):
    result = validate_email(user_fixture)
    assert result.is_valid

That's the actual idea behind it. A small side note: testing is not just about increasing test coverage, but also about finding bugs. This means that in a real application, I wouldn't just test with one user object, but with a set of users with different name and email encodings, where I want to see if the functions behave as they should. Among other things, I would of course also like to test whether an invalid name or an invalid email leads to a false result.
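To make the side note concrete, here is a minimal sketch of how several users, including invalid ones, could be fed through one test with pytest.mark.parametrize. The ValidationResult class, the body of validate_user and the example addresses are assumptions made up for this illustration; the real validation logic would be more thorough.

```python
from dataclasses import dataclass

import pytest


@dataclass
class User:
    name: str
    email: str


@dataclass
class ValidationResult:
    # Hypothetical result type, mirroring the `is_valid` attribute used above
    is_valid: bool


def validate_user(user: User) -> ValidationResult:
    # Hypothetical stand-in: a non-empty name and text on both sides of one "@"
    local, _, domain = user.email.partition("@")
    ok = bool(user.name.strip()) and bool(local) and bool(domain) and "@" not in domain
    return ValidationResult(is_valid=ok)


@pytest.mark.parametrize(
    ("name", "email", "expected"),
    [
        ("Patrick", "patrick@example.com", True),
        ("Zoë", "zoë@example.com", True),       # non-ASCII characters
        ("", "nobody@example.com", False),      # invalid name
        ("Patrick", "not-an-email", False),     # invalid email
    ],
)
def test_validate_user(name, email, expected):
    result = validate_user(User(name=name, email=email))
    assert result.is_valid is expected
```

pytest reports each tuple as its own test case, so a failing input is visible by name in the output.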

When should I use Fixtures?

Roughly, this can already be seen from the example. pytest itself says that fixtures provide context for a test function. This can be, for example, the context for a database or the enrichment of your test data. I use them very often for both. As soon as I realize that data within a test can also be used in another test, I turn it into a fixture. If you come from another language, e.g. Java, you will most likely see the similarity to the setup/teardown functionality. pytest describes the comparison on their website very well and better than I could. Note: pytest also offers the option of a classic setup/teardown style.

Differentiation between Error vs Failure

Normally, if a test fails or raises an exception, it gets the status “failed”. However, if a fixture raises an exception, pytest declares this as Error and not as “Failed”. pytest describes that the Error status is intended to indicate that pytest was unable to execute the actual test in the first place and has already failed on a fixture on which the test depends. Error is reserved for this. Here is another, quite concise, explanation on StackOverflow I very much liked.

Below is an example which raises an exception in the append_first fixture:

import pytest

@pytest.fixture
def order():
    return []

@pytest.fixture
def append_first(order):
    raise Exception
    order.append(1)

@pytest.fixture
def append_second(order, append_first):
    order.extend([2])

@pytest.fixture(autouse=True)
def append_third(order, append_second):
    order += [3]

def test_order(order):
    assert order == [1, 2, 3]

The code is from the official pytest documentation. You can also copy it from my GitHub Gist.

The result is the following:

Screenshot of run tests in PyCharm

You can see the run test in PyCharm and that an error is shown. At the same time, the test is also counted as failed, which does not quite correspond to the idea of pytest. It is therefore better to read the output from pytest and not just the message from the IDE.

How do I create fixtures that are valid in other test modules?

We've seen that fixtures can be used within a module. Sometimes, however, cross-module fixtures are required, e.g. a fixture for a database. You may have a wrapper for the database connection and want to test it, or you have functions that use the database indirectly. One option is conftest.py. The conftest.py should be placed flat in your tests folder:

tests/
├── conftest.py  # Contains a db_connection_fixture
├── test_module1.py
├── test_module2.py

The advantage of conftest.py is that pytest provides an automatic discovery of fixtures within this file. The registered fixtures are then available in all modules at the same level or below without an explicit import. This can be somewhat confusing at first, as you do not know where the function parameters (fixtures) come from or where they are defined. At least PyCharm does not offer the option of navigating to the function definition with a click. However, once you know that the fixtures are in conftest.py, the advantage is that the module is cleaner, as the imports do not accumulate.
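As a minimal sketch of this discovery mechanism, the block below shows what the two files could contain. It is written as one listing so it can be read top to bottom; the file names in the comments, the open_connection helper and the use of an in-memory SQLite database are assumptions chosen only for illustration.

```python
# --- tests/conftest.py (hypothetical content) ---
import sqlite3

import pytest


def open_connection():
    # Plain helper, so the setup logic is also usable outside of pytest
    return sqlite3.connect(":memory:")


@pytest.fixture
def db_connection_fixture():
    connection = open_connection()
    yield connection      # the test body runs at this point
    connection.close()    # teardown after the test has finished


# --- tests/test_module1.py (hypothetical content) ---
# Note: there is no import of db_connection_fixture here.
# pytest finds it in conftest.py by the parameter name.
def test_can_query(db_connection_fixture):
    row = db_connection_fixture.execute("SELECT 1").fetchone()
    assert row == (1,)
```

The test module stays free of fixture imports, which is exactly the part that can look like magic at first.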

Directory structure versus conftest.py?

It is also possible to create a fixture directory and store the fixtures there. This helps with modularization and is also clearer with many fixtures. You can create a module with db_fixtures or a module with user_fixtures. Then, you can import the modules into conftest.py and they will also be included in pytest's discovery.

tests/
├── fixtures/
│   └── db_fixtures.py  # Contains the db_connection fixture
├── conftest.py         # Can import db_connection if needed
├── test_module1.py
├── test_module2.py

#In conftest.py:
from fixtures.db_fixtures import db_connection  # Import for reuse

For a small project, I would always define the fixtures in conftest.py first and only modularize when it gets messy. The conftest.py is also important for other topics.
An example of a db_fixture would be something like this:

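@pytest.fixture(scope="module")
def db_connection():
    # `db`, `URI` and `env_config` are assumed to come from the application
    client = db.get_client(URI)
    # Use a name other than `db` so the imported module is not shadowed
    database = db.get_database(client, env_config.DB)
    client.drop_database(env_config.DB)  # start from a clean database
    yield database
    client.drop_database(env_config.DB)  # clean up after the module
    client.close()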

The nice thing about this fixture is that it is valid for the scope of a module: only after all tests in the module have run does execution continue after the yield, where the database is dropped and the connection is closed. This has the advantage that the database can be filled with data by different test functions (e.g. some inserts and then a delete).

Pytest Fixtures in Open-Source Projects

Whenever I'm not sure myself, I take a look at how well-known open source projects do it. I like to take a look at Pydantic and Streamlit 🙂. Both use pytest.

How Pydantic uses Fixtures

The Python library Pydantic not only consists of the repository of the same name, but also uses pydantic-core, which is developed in Rust. You should therefore look at both repositories if you want to understand how Pydantic tests.

Pydantic and pydantic-core mainly define their fixtures directly in the respective test modules instead of using a central conftest.py. There are currently 21 test files in Pydantic and nine in pydantic-core in which fixtures are used. The conftest.py has relatively few functions in both projects and only five fixtures in total in both projects. This organization of the tests follows the principle of high cohesion, as related test components are kept in the same modules.

Short excursion: Definitely new for me was that Pydantic, and pydantic-core, use a library called Hypothesis. Hypothesis offers property-based testing and is added to the function to be tested with the help of the decorator (@given). You have to describe what kind/types of values are allowed and Hypothesis generates random values. Here is an example from pydantic-core where different data objects are generated:

#pydantic-core: tests/test_hypothesis.py

@given(strategies.datetimes())
@pytest.mark.thread_unsafe  
def test_datetime_datetime(datetime_schema, data):
    assert datetime_schema.validate_python(data) == data

I didn't know Hypothesis yet, but I think it complements TDD very well and will try it out myself.

How Streamlit uses Fixtures

Streamlit does it a little differently than Pydantic and uses a mixture of e2e (end-to-end) testing with Playwright and unit tests.

For the Playwright tests (e2e_playwright), more than ten fixtures are defined in conftest.py. The remaining fixtures, which are at least as many, are defined in the modules themselves.

For the unit tests, conftest.py is defined under /lib/tests/conftest.py. It contains one fixture, the rest is defined in the modules themselves. Looking at the ratio, it seems that more weight is placed on the e2e tests.

A little excursion here too: while reading Streamlit's code, I came across the testfixtures project, which is used for temporary directories. Here is the link to the still rather unknown repo.

If you're interested in the architecture, structure and software principles of open source, let me know so I can include more of it in the future!

Which Pytest Fixture parameters are useful?

The most common parameters I observe are scope, autouse, params and name. Let's go over them briefly.

Scope

The scope defines how long the fixture is valid. The default is “function”, which means that the fixture is called again and again for each test function. The alternatives are “class”, “module”, “package” and “session”. I myself often use “module” as in the example above (db_connection), so that the fixture is called once at the start of the module. The principle of the call can also be limited to class, package or session. I have also often seen session and it means that the fixture is called once for all tests in the test session.

Autouse

autouse is also used very often. It means that the fixture runs automatically for every test in its scope, without the test having to request it as a parameter.

import os

import pytest

@pytest.fixture(autouse=True)
def set_up_env():
    os.environ["APP_ENV"] = "test"

# This fixture sets the environment variable APP_ENV
# to "test" for each test

Another example would be a fixture with autouse for a patch for a request call, that simulates communication with an API. However, I hardly ever use this as I try to mock as little as possible.
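One caveat with writing to os.environ directly is that the variable stays set after the test has finished. A possible variant, sketched here under the assumption of a small helper called current_env (made up for this example), uses pytest's built-in monkeypatch fixture, which restores the previous value after each test.

```python
import os

import pytest


def current_env() -> str:
    # Hypothetical application helper that reads the environment
    return os.environ.get("APP_ENV", "production")


@pytest.fixture(autouse=True)
def app_env(monkeypatch):
    # monkeypatch undoes this change automatically when the test ends
    monkeypatch.setenv("APP_ENV", "test")


def test_runs_in_test_env():
    # app_env is not listed as a parameter, autouse applies it anyway
    assert current_env() == "test"
```

Because the fixture is autouse, every test in the same module (or below the conftest.py it lives in) sees APP_ENV set to "test" without asking for it.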

Name

name is an interesting parameter. I haven't seen many fixtures in Streamlit that use it, but I have in Pydantic. Here is an excerpt:

@pytest.fixture(scope='module', name='DateModel')
def date_model_fixture():
    class DateModel(BaseModel):
        d: date

    return DateModel

def test_date_parsing(DateModel, value, result):
    if isinstance(result, Err):
        with pytest.raises(ValidationError, match=result.message_escaped()):
            DateModel(d=value)
    else:
        assert DateModel(d=value).d == result

The purpose of name is to register the fixture under the given name (name=) instead of the function name; in Pydantic's case, DateModel. This decouples the function name of the fixture from the name used to request it (here in test_date_parsing). It makes it quite clear that the DateModel class is passed, which can then be used directly in test_date_parsing.

Params

A fixture is parameterized with params so that every test using it runs once per parameter. Here is an example, this time not from Pydantic. pytest runs test_even three times, and the runs for 1 and 3 fail:

@pytest.fixture(params=[1, 2, 3])
def number(request):
    return request.param

def test_even(number):
    assert number % 2 == 0

Final thoughts and my conclusion

pytest fixtures are part of a good test architecture. You can define them in conftest.py, in a separate directory or in the test module itself. As soon as a fixture is cross-module, I define it in conftest.py, as this is also where the automatic discovery takes place. The open source projects Streamlit and Pydantic mainly define their fixtures in the modules where they are needed. This concept is known as High Cohesion.

You can extend your own tests with property-based testing using hypothesis. This is not directly related to the fixtures, but I found it very helpful to get to know it.

I really like using scope as a parameter and have learned the advantage of the name parameter through pydantic. I would like to use this more often, as I use pydantic very often and I like the approach of using the name of the returning model as the name of the fixture.

Finally, it should be said that pytest sometimes seems a bit magical, for example due to the automatic discovery and autouse. The learning curve is somewhat higher as a result, but pytest also abstracts work for us.

Thank you for reading! 💙 I share what I learn about software engineering (mostly Python), AI, product development, and life as a dev.

🤗✨ If you enjoyed this article, it would mean a lot to me if you shared it on social media or forwarded it to a friend. I write in my spare time, so any support is welcome.

How bugs made me believe in TDD

Patrick Müller — Fri, 02 May 2025 06:30:45 GMT

I always knew that testing was important, but I neglected it for a long time. During my studies, the subject was unfortunately given far too little attention and there was also a lack of practical relevance. However, as I gained more professional experience, I learned that I always have to expect a certain error rate and unpredictable bugs. TDD is crucial for recognizing these at an early stage and achieving good productivity in the long term.

Software Testing at SAP

During my time at SAP and in the first team I was in, we wrote no or very few unit tests (I definitely didn't). The focus back then was on end-to-end testing (E2E). Later, in a different team at SAP, I came into contact with tests and test coverage for the first time, but still had little intrinsic motivation to write any. I saw it more as a necessary evil, with the idea of writing a test after the actual function.

Software Testing During Self-Employment & Open Source

With the launch of my company LemonHeap GmbH and the LemonSpeak product, I then wrote unit tests occasionally and depending on their importance, but I was still very far away from a TDD (Test Driven Development) approach. LemonSpeak was a one-man show. Hence the question: Why write so many tests if I'm the only one developing the software anyway? My opinion at the time was that tests only have a right to exist for software that several people are working on. Then the advantage is that you don't have to completely understand the overall construct, but only the individual components that are changed or added. The existing tests then check whether errors occur and, in the case of a “pass”, allow the conclusion to be drawn that there are no side effects.

I changed my opinion towards the end of LemonSpeak, when I looked back and assessed how much support was caused by bugs and whether I could have avoided this through testing. You guessed it: most of it could have been avoided. At that time, I also got more involved with TDD and familiarized myself with the topic.

The second moment I realized the importance of tests was when I made my first open source contribution to Pydantic Logfire. The first pull request came without tests, and the maintainer told me to add some. Shortly after the code was merged, it caused another bug for a user. That was a real eye-opener for me, because more thorough tests would have caught it. The user wouldn't have opened a bug report, the maintainer wouldn't have pointed it out to me and I wouldn't have had to spend time fixing the bug again. Three people were directly affected. For me, avoiding this has something to do with professionalism.

How Does TDD Work?

TDD stands for Test-Driven Development and is not new: the concept was introduced by Kent Beck at the end of the 1990s. The idea is as follows:

1. You come up with a list of test conditions.
2. Take the first one from that list.
3. Write the test for this condition before the actual function and think about when the result is a “pass” and when it is a “fail”.
4. Now write your function so that the test is fulfilled.
5. Optionally, think about the abstraction and design of your code and refactor it.

That was already one cycle. If you still have test conditions left, the cycle starts again from the beginning (goto #1). Iterations are part of TDD, because you want your function to fulfill further test conditions.

This cycle is also known as red-green-refactor. Red because your assert fails first. Congratulations if you are at green, because then your test has passed. The refactor step was difficult for me to understand at first, mainly because I took it for granted. Once you have a green, you can refactor your code and structure it differently, be it with a pattern or a different approach. That's entirely up to you. The nice thing about it is that you have the assurance that everything will still work, as your previous tests still have to pass. Here is a small visualization of TDD:

Iteration Cycle of TDD

In my opinion, while TDD is great in theory, it needs a certain amount of repetition in practice to become routine. Martin Fowler has written a very good introduction to TDD. Even more interesting, however, is the article “Canon TDD” by Kent Beck himself, in which he clears up some misunderstandings and misconceptions about TDD. Due to the negative examples that Beck points out, the information content is very high.

In my opinion, the advantage of TDD lies not only in the increased reliability of the software, but also in the fact that I have to think intensively about how I design the interface to my code and the function (keyword: differentiation between interface and implementation → good design). To illustrate this: When I write a new function, TDD forces me to define the interface first, otherwise I couldn't even test it.
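A minimal sketch of a single cycle may help here. The slugify function and its behaviour are invented purely for this illustration. The point is that the test is written first and thereby fixes the interface (name, parameter, return value) before any implementation exists.

```python
# Step 1 (red): the test comes first and defines the interface.
# Running it at this point fails, because slugify does not exist yet.
def test_slugify_lowercases_and_joins_with_dashes():
    assert slugify("Hello TDD World") == "hello-tdd-world"


# Step 2 (green): the simplest implementation that makes the test pass.
def slugify(title: str) -> str:
    return "-".join(title.lower().split())


# Step 3 (refactor): restructure if needed and re-run the test.
# The next test condition from the list then starts a new cycle,
# for example how punctuation should be handled.
```

Only after the test exists does the implementation get written, so the signature of slugify was decided from the caller's point of view.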

Conclusion

There are many books, such as "Clean Code" or "Practical Engineer", that address the fact that high test coverage is a must. And although I now see the necessity, without the TDD approach I would find it difficult to write the tests afterwards.

Because as soon as I have developed a feature or fixed a bug, the next issue is already waiting around the corner. Unfortunately, writing a test for the previous component then often sinks into oblivion. It's like tidying up at home: if something is lying around, it's tidier in the end if you tidy it up straight away instead of postponing the task.

How much and how intensively you test naturally depends on the importance of the software. But testing has become indispensable for professional software. Whether this involves unit tests, integration tests or end-to-end tests depends heavily on the architecture, the goal and the aforementioned importance of the software.

Meanwhile, I have become very familiar with TDD. In my current work, I have already been able to use it to prevent bugs during development, which just feels great. Nevertheless, the topic is still new territory for me in this intensity. As I am learning a lot in this area myself, I would like to share this knowledge with you in the next few articles.

Have you had good experiences with TDD or do you see it differently? Do you know any good resources? Let me know!

Why Celery Tasks trigger signals twice

Patrick Müller — Thu, 17 Apr 2025 09:26:37 GMT

Celery uses signals to trigger functions. I am currently writing tests for my Celery tasks. A central point for me is to test the function that is triggered by a successfully completed task. In my case, the function is called execute_after_task() and stores the task ID in Redis. To tell Celery that this function should be executed after a task has been successfully completed, I use the decorator @signals.task_success.connect. The whole construct looks like this:


from celery import signals

@signals.task_success.connect
def execute_after_task(sender=None, result=None, **kwargs):
    # Code that stores the Task ID in Redis
    ...

A test that I defined with pytest checks whether a Celery task has executed this method after successful termination and has stored the task ID exactly once in Redis.

The code, especially the task, worked without any problems. However, I was surprised to discover that Celery executes the execute_after_task function twice. This in turn leads to the task ID being stored twice in Redis. That is not only a waste of resources, but can also lead to problems on the client side, which checks Redis for an ID.

Debugging the Celery signals - What's the issue?

In my setup, I had two imports (import a, import b) in my test_celery.py. Both of these modules import the module celery_api, in which execute_after_task is decorated with @signals.task_success.connect. The following snippet may help:


#test_celery.py
import a
import b

#a.py
import celery_api

#b.py
import celery_api

To understand the whole thing in more detail, I looked at the execution of execute_after_task with the debugger and could see that Celery stores the functions that are decorated in an internal registry. This applies not only to task_success, but also to @signals.task_failure.connect, for example.

In the Celery package, we can see the following line in signal.py:


#signal.py
for receiver in self._live_receivers(sender):

The code iterates over all receivers that have been registered. When resolving this function, you can see that execute_after_task is listed twice as weak_reference objects.

For a better understanding, it helps to debug the actual decorator @signals.task_success.connect. This makes it possible to look at the call stack when running the function. This shows that the decorator is called by the two different modules within test_celery.py during the import. Namely from module a and b. The reason for the double execution of the function is therefore that the @signals.task_success.connect decorator is executed twice.

How do I prevent duplicated executions?

The solution is quite simple: to avoid this, a parameter dispatch_uid='path.to.function' can be specified. The new decorator looks like this:


@signals.task_success.connect(dispatch_uid='src.celery_api.tasks')

❗

Note: In Python, a decorator is executed during imports. This is generally the case in Python and not a special feature of Celery.

Fortunately, I noticed this during testing, but it could have occurred in any other module. At the same time, this should also be done for the task_failure decorator.
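To illustrate the idea behind dispatch_uid, here is a toy model of a signal registry. It is not Celery's actual implementation. ToySignal and define_handler are names made up for this sketch, and it assumes that receivers without a dispatch_uid are told apart by object identity. Calling define_handler twice stands in for the decorated function being created twice.

```python
class ToySignal:
    """Toy model of a signal registry (not Celery's real code)."""

    def __init__(self):
        self._receivers = []  # list of (key, function) pairs

    def connect(self, receiver=None, *, dispatch_uid=None):
        def register(func):
            # Without a dispatch_uid the function object itself is the key
            key = dispatch_uid if dispatch_uid is not None else id(func)
            if all(existing != key for existing, _ in self._receivers):
                self._receivers.append((key, func))
            return func

        # Supports @signal.connect and @signal.connect(dispatch_uid=...)
        return register(receiver) if receiver is not None else register

    def send(self, **kwargs):
        for _, func in self._receivers:
            func(**kwargs)
        return len(self._receivers)


def define_handler(signal, calls, dispatch_uid=None):
    # Each call creates a new function object and connects it,
    # like a decorator that runs once more
    @signal.connect(dispatch_uid=dispatch_uid)
    def execute_after_task(**kwargs):
        calls.append(kwargs["task_id"])

    return execute_after_task


calls_plain, calls_uid = [], []
plain, with_uid = ToySignal(), ToySignal()
for _ in range(2):
    define_handler(plain, calls_plain)
    define_handler(with_uid, calls_uid, dispatch_uid="src.celery_api.tasks")

plain.send(task_id="abc")
with_uid.send(task_id="abc")
print(calls_plain)  # ['abc', 'abc']
print(calls_uid)    # ['abc']
```

In this model the stable string key is what prevents the second registration, which is the role dispatch_uid plays in the fix above.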

Helpful Resources

Finally, two resources that helped with troubleshooting:

Signals – Django: a brief explanation from the Django project
task_revoked Signal is triggered twice, when task is revoked · Issue #3805 · celery/celery: a GitHub issue about double-triggered signals

Pytest Fixtures: Introduction, Usage & Organization in Your Test Architecture

Patrick Müller — Sun, 16 Mar 2025 17:03:07 GMT

In my last post, I talked about why I practice TDD and why I want to share my learning progress in testing. Today I want to talk about what pytest fixtures are, when I use them, and how you can make fixtures available across tests.

What are pytest fixtures?

Imagine you are writing a test that puts a function called validate_user(user: User) to the test. Of course, you need a user for this. Assume the user is defined as follows:

from dataclasses import dataclass

@dataclass
class User:
	name: str
	email: str

In your test function, you would create the user and then pass it to the function under test:

def test_valid_user():
	user = User(name="Patrick", email="[email protected]")
	result = validate_user(user)
	assert result.is_valid == True

So far, so good. If you want to test another function that checks the correctness of the email, you have to write a new test function that creates and tests the user again. This is where fixtures come into play. Instead of doing that, you write a fixture that returns a User.

Example:

import pytest

@pytest.fixture()
def user_fixture():
    return User(name="Patrick", email="[email protected]")

def test_valid_user(user_fixture):
    # The fixture's return value is injected under the fixture's name
    result = validate_user(user_fixture)
    assert result.is_valid

def test_valid_email(user_fixture):
    result = validate_email(user_fixture)
    assert result.is_valid

That is really the idea behind it. A small side note: testing is not only about increasing test coverage, but also about finding bugs. This means that in a real application, I would not test with just one user object, but with a pool of users with different name and email encodings, where I want to see whether the functions behave as they should. Among other things, I would of course also like to test whether an invalid name or an invalid email leads to a false result.

When should I use fixtures?

Roughly, this can already be seen from the example. pytest itself says that fixtures define the context of test functions. This can be, for example, the context for a database or the enrichment of your test data. I use them very often for both. As soon as I notice that data within a test can also be used in another test, I turn it into a fixture. If you come from another language, e.g. Java, you will certainly see the similarity to the setup/teardown functionality. pytest describes the comparison on their website very well and better than I could. It should be mentioned that pytest also offers the option of a classic setup/teardown style.

Differentiation between Error vs Failure

Normally, if a test fails or raises an exception, it gets the status “failed”. However, if a fixture raises an exception, pytest declares this as Error and not as “Failed”. pytest describes that the Error status is intended to communicate that pytest was unable to execute the actual test in the first place and already failed on a fixture on which the test depends. Error is reserved for this. Here is another, quite concise, explanation on StackOverflow.

Below is an example which raises an exception in the append_first fixture:

import pytest

@pytest.fixture
def order():
    return []

@pytest.fixture
def append_first(order):
    raise Exception
    order.append(1)

@pytest.fixture
def append_second(order, append_first):
    order.extend([2])

@pytest.fixture(autouse=True)
def append_third(order, append_second):
    order += [3]

def test_order(order):
    assert order == [1, 2, 3]

The code is from the pytest documentation. You can copy it from my GitHub Gist.

The result is the following:

You can see the executed test in PyCharm and that an error is shown. At the same time, the test is also counted as failed, which does not quite correspond to the idea of pytest. It is therefore better to read the output from pytest and not just the message from the IDE.

How do I create cross-test fixtures?

So fixtures can be used within a module. Sometimes, however, cross-module fixtures are needed, e.g. a fixture for a database. You may have a wrapper for the database connection and want to test this functionality, or you have functions that use the database indirectly. One option is conftest.py. The conftest.py sits flat in your tests folder:

tests/
├── conftest.py  # Contains a db_connection_fixture
├── test_module1.py
├── test_module2.py

The advantage of conftest.py is that pytest provides automatic discovery of fixtures within this file. The registered fixtures are then available in all modules at the same level or below without an import. This can be somewhat misleading at first, as you do not know where the function parameters (fixtures) come from or where they are defined. At least PyCharm does not offer the option of jumping to the function definition with a click. However, once you know that the fixtures are in conftest.py, the advantage remains that the module is cleaner, as the imports do not pile up.

Directory structure versus conftest.py?

It is also possible to create a fixture directory and store the fixtures there. This helps with modularization and is also clearer when there are many fixtures. You can create a module with db_fixtures or a module with user_fixtures. In conftest.py you can import the module, and its fixtures will also be picked up by pytest's discovery.

tests/
├── fixtures/
│   └── db_fixtures.py  # Contains the db_connection fixture
├── conftest.py         # Can import db_connection if needed
├── test_module1.py
├── test_module2.py

#In conftest.py:
from fixtures.db_fixtures import db_connection  # Import for reuse

For a manageable project, I would always define the fixtures in conftest.py first and only modularise once it gets cluttered. The conftest.py is also important for other things.

By the way, an example of a db fixture would look something like this:

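@pytest.fixture(scope="module")
def db_connection():
    # `db`, `URI` and `env_config` come from the application under test.
    client = db.get_client(URI)
    # Use a name other than `db`, otherwise the local variable shadows the
    # `db` module and the line above fails with an UnboundLocalError.
    database = db.get_database(client, env_config.DB)
    client.drop_database(env_config.DB)  # start from a clean database
    yield database
    # Teardown: runs once after the last test of the module.
    client.drop_database(env_config.DB)
    client.close()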

The nice thing about this fixture is that it is valid for the scope of a module: only after the module has been processed does execution continue after the yield and the DB gets closed. This has the advantage that the database can be filled with data by different functions (e.g. a few inserts and then a delete).

Pytest Fixtures in Open-Source Projects

When I'm unsure myself, I look at how well-known open source projects do it. I like to take a look at Pydantic and Streamlit for that 🙂. Both use pytest.

How Pydantic uses fixtures

The Python library Pydantic does not only consist of the repository of the same name, but also uses, among others, pydantic-core, which is developed in Rust. So you should look at both repositories if you want to understand how Pydantic tests.

Pydantic and pydantic-core define their fixtures mostly directly in the respective test modules instead of using a central conftest.py. In Pydantic there are currently 21 test files that use fixtures, in pydantic-core nine. The conftest.py contains relatively few functions in both projects, and only five fixtures across both projects in total. This organisation of the tests follows the principle of high cohesion, because test parts that belong together are kept in the same modules.

A short detour: definitely new to me was that Pydantic, like pydantic-core, uses a library called Hypothesis. Hypothesis offers property-based testing and is added to the test function with the help of a decorator (@given). You have to describe which kinds/types of values are allowed, and Hypothesis generates random values. Here is an example from pydantic-core in which different datetime objects are generated:

#pydantic-core: tests/test_hypothesis.py

@given(strategies.datetimes())
@pytest.mark.thread_unsafe  
def test_datetime_datetime(datetime_schema, data):
    assert datetime_schema.validate_python(data) == data

I did not know Hypothesis before, but I think it complements TDD very well and I will try it out myself.

How Streamlit uses fixtures

Streamlit does it a bit differently from Pydantic and uses a mix of e2e (end-to-end) testing with Playwright and unit tests.

For Playwright (e2e_playwright), more than ten fixtures are defined in the conftest.py. The remaining pytest fixtures, which are at least as many, are defined in the modules themselves.

For the unit tests, the conftest.py is located at /lib/tests/conftest.py. It contains one fixture; the rest are defined in the modules themselves. Looking at the ratio, it seems that more weight is put on the e2e tests.

Another small detour here as well: while reading the Streamlit code I came across the project testfixtures, which is used for temporary directories. Here is the link to the so far rather unknown repo.

If you are interested in the architecture, structure and software principles of open source projects, let me know so that I can include more of that in the future!

Which pytest fixture parameters are useful?

The most common parameters I see are scope, autouse, params and name. Let's go through them briefly.

Scope

The scope defines the lifetime of the fixture. The default is "function", which means the fixture is called again for every test function. The alternatives are "class", "module", "package" and "session". I often use "module" myself, as in the example above (db_connection), so that the fixture is called only once per module, when a test first requests it. The same principle applies to class, package and session. I have also seen session quite often; it means the fixture is called once for all tests in the "session".

Autouse

This one is also used very often and means that the fixture is applied to the tests automatically, without having to be passed as a parameter.

import os

import pytest

@pytest.fixture(autouse=True)
def set_up_env():
    os.environ["APP_ENV"] = "test"

# This fixture sets the environment variable APP_ENV
# to "test" for every test

Another example would be an autouse fixture that patches a request. However, I hardly use that, because I try to mock as little as possible.

Name

name is an interesting parameter. I haven't seen many fixtures in Streamlit that use it, but in Pydantic I have. Here is an excerpt:

@pytest.fixture(scope='module', name='DateModel')
def date_model_fixture():
    class DateModel(BaseModel):
        d: date

    return DateModel
    
def test_date_parsing(DateModel, value, result):
    if isinstance(result, Err):
        with pytest.raises(ValidationError, match=result.message_escaped()):
            DateModel(d=value)
    else:
        assert DateModel(d=value).d == result

What name does is register the fixture under the given name, in Pydantic's case DateModel. This decouples the name of the fixture function from the name used to reference it. It makes it quite clear that the DateModel class is passed in, which can then be used directly in test_date_parsing.

Params

With params a fixture is parametrised so that a test is run with different parameters. Here is an example, this time not from Pydantic:

@pytest.fixture(params=[1, 2, 3])
def number(request):
    return request.param

# Runs once per value in params: passes for 2, fails for 1 and 3.
def test_even(number):
    assert number % 2 == 0

Final thoughts and my conclusion

pytest fixtures are part of a good test architecture. You can define them in conftest.py, in a separate folder, or in the test module that needs them. As soon as a fixture is used across modules, I define it in conftest.py, because that is also where the automatic discovery takes place. The open source projects Streamlit and Pydantic define their fixtures mostly in the modules where they are needed, a concept known as high cohesion.

You can extend your own tests with property-based testing using Hypothesis. This is not directly related to fixtures, but I found it very helpful to get to know it.

As for parameters, I really like using scope, and through Pydantic I have learned the benefit of the name parameter. I would like to use it more often, because I use Pydantic a lot and I really like the approach of using the name of the returned model as the name of the fixture.

Finally, it has to be said that pytest sometimes feels a bit magical, for example because of the automatic discovery and autouse. This makes the learning curve a bit steeper, but in return pytest also abstracts work away for us.

Why I practise TDD and how it can help you too

Patrick Müller — Sun, 09 Mar 2025 19:09:32 GMT

It has always been clear to me that testing is important, yet I still neglected it for a long time. During my studies the topic unfortunately got far too little attention, and it also lacked practical relevance. With growing professional experience, however, I have learned that I always have to expect a certain error rate and unforeseeable bugs. TDD is crucial for catching these early and for staying productive in the long run.

Software Testing at SAP

During my time at SAP, in the first team I was in, we wrote no or only very few unit tests (I definitely wrote none). The focus back then was on end-to-end (E2E) tests. Later, in another team at SAP, I came into contact with tests and test coverage for the first time, but I still had little intrinsic motivation to write any. I saw it more as a necessary evil, with the idea of writing a test after the actual functionality.

Software Testing in Self-Employment & Open Source

When I founded my company LemonHeap GmbH with the product LemonSpeak, I wrote unit tests here and there, depending on importance, but I was still far away from a TDD (test-driven development) approach. LemonSpeak was a one-man show. Hence the question: why write so many tests if I'm the only one developing the software anyway? My opinion at the time was that tests only have a right to exist for software that several people work on. The advantage then is that you don't have to understand the whole construct completely, only the individual components that are changed or added. The existing tests then check whether errors occur and, on a "pass", allow the conclusion that there are no side effects.

My change of heart came towards the end of LemonSpeak, when I looked back and assessed how much support work had been caused by bugs and whether I could have avoided it through testing. You guessed it: most of it could have been avoided. At that time I also started to look more closely at TDD and worked my way into the topic.

The second moment in which the importance of tests became clear to me was when I made my first open source contribution, to Pydantic Logfire. The first pull request had no tests, the maintainer told me to add some, and shortly after the code was merged it caused another bug for a user. That was a real aha moment for me, because if the tests had been more thorough, this could have been caught. The user would not have had to open a bug report, the maintainer would not have had to point it out to me, and I would not have had to invest time again to fix the bug. Three people were directly affected. Avoiding that has something to do with professionalism for me.

How does TDD work?

TDD stands for test-driven development and is not new: the concept was introduced by Kent Beck in the late 1990s. The idea is as follows:

Come up with a list of test conditions.
Take the first one from the list.
The next step is to write the test for that condition before the actual function, and to think about when the result is "pass" and when it is "failed".
Now write your function so that the test passes.
Optionally, in the next step you can think about the abstraction and design of your code and refactor it.

That was already one cycle. If you still have test conditions left, the cycle starts over (goto #1). Iterations are part of TDD, because you want your function to satisfy further test conditions.

This cycle is also called red-green-refactor. Red, because your assert fails at first. Congratulations once you reach green, because then your test has passed. The refactor step was hard for me to understand at first, mainly because it is a matter of course for me. As soon as you are green, you can refactor your code and structure it differently, be it with a pattern or a different approach. That is entirely up to you. The nice thing is that you have the assurance that everything still works, because your existing tests must continue to pass. Here is a small visualisation of TDD:

TDD cycle
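One pass through the cycle could be sketched like this. The function episode_minutes and its test conditions are invented for illustration; they are not taken from LemonSpeak's code. The tests are written first (red), then just enough code to make them pass (green), after which the implementation can be reshaped while the tests keep guarding the behaviour (refactor).

```python
# Step 1: the list of test conditions, written down before any implementation.
#   - 90 seconds -> 2 started minutes (round up)
#   - 60 seconds -> 1 minute
#   -  0 seconds -> 0 minutes

# Step 2 (red): the tests exist first. Calling them before episode_minutes
# is defined fails with a NameError.
def test_rounds_up_partial_minutes():
    assert episode_minutes(90) == 2


def test_exact_minute():
    assert episode_minutes(60) == 1


def test_empty_file():
    assert episode_minutes(0) == 0


# Step 3 (green) and refactor: the implementation that satisfies the tests.
def episode_minutes(seconds: int) -> int:
    """Return the number of started minutes for a duration in seconds."""
    if seconds < 0:
        raise ValueError("duration must not be negative")
    return -(-seconds // 60)  # ceiling division without floats


if __name__ == "__main__":
    for test in (test_rounds_up_partial_minutes, test_exact_minute, test_empty_file):
        test()
    print("all green")
```

Writing the three tests first already fixes the interface: one integer in, one integer out, before a single line of the function exists.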

In my opinion TDD is great in theory, but in practice it takes some repetition to become routine. Martin Fowler has written a very good introduction to TDD. Even more interesting, however, is the article "Canon TDD" by Kent Beck himself, in which he clears up some misunderstandings and misconceptions around TDD. Thanks to the negative examples Beck shows, the article is very informative.

In my opinion, the advantage of TDD lies not only in the increased reliability of the software, but also in the fact that I have to think hard about how I design the interface to my code and the function (keyword: separating interface from implementation → good design). To illustrate: when I write new functionality, TDD forces me to define the interface first so that I can test it at all.

Conclusion

There are countless books, such as Clean Code or Practical Engineer, that explain that high test coverage is a must. And although I now see the necessity, without the TDD approach I would find it hard to write the tests afterwards.

Because as soon as I have developed a feature or fixed a bug, the next issue is already waiting around the corner. Writing a test for the previous component unfortunately gets forgotten. It's like tidying up at home: if something is lying around, things end up tidier if you put it away right away instead of postponing the task.

How much and how intensively you test naturally also depends on the importance of the software. But for professional software, testing has become indispensable. Whether that means unit tests, integration tests or end-to-end tests depends heavily on the architecture, the goal and, as mentioned, the importance of the software.

By now I have settled into TDD very well. In my current work it has already allowed me to prevent bugs during development, which simply feels great. Still, at this intensity the topic is new territory for me. Since I'm learning a lot in this area myself, I would like to share this knowledge with you in the next articles.

Have you had good experiences with TDD, or do you see it differently? Do you know any good resources? Let me know!

How do I write a cold email?

Patrick Müller — Sun, 23 Feb 2025 09:20:58 GMT

Marketing is a closed book to me. I am a software developer with heart and soul, and not very extroverted on top of that. If you feel the same way, this article is just right for you.

What could suit me better than cold emailing? The concept played right into my hands: automated sending of emails to exactly the customer group I consider right. But does it really work as well as the cold emailing gurus on the internet claim? I committed to a long-term test. I have to admit that I experimented a lot: changing the content of the email, the tone, the customer group, sending a second or third email. Experimenting is part of it and helped me get a better feel for the subject.

Although I wrote at the beginning that the concept suited me well, I had other concerns. When I receive an email from people I don't know, it annoys me, and until then I saw it as spam. Today I still see it as spam, but I'm far less critical, because I know that someone is fighting for their dream (as long as it isn't a prince from some other country promising gold. That really is annoying).

What preparations should you make?

In my opinion there are only a few prerequisites to keep in mind, and then the email campaign can already start 🙂.

Buy an extra domain (#1)

Use a different domain from the one you use for your product, business emails or support. Ideally, you buy a new domain for this purpose. I bought trylemonspeak.com for it, for example. "try" or "go" are popular prefixes for such a domain. A separate domain makes sense because there is a risk that the domain gets classified as spam-producing. If that happens, your mails don't land in the inbox but under spam or "Others".

Configuring your new domain (#2)

You should set up the following records:

MX Records

MX records (mail exchanger) are the basis for email servers knowing where mails should be routed.

SPF Records

SPF records (Sender Policy Framework). A security mechanism that lets receiving servers check that an email comes from a server authorised to send for your domain.

DKIM Records

DKIM records (DomainKeys Identified Mail). They add a signature to your emails, which makes it easier to track them and to prevent spoofing.

DMARC

DMARC (Domain-based Message Authentication, Reporting, and Conformance). DMARC is used to detect and block emails that would very likely cause harm. Similar to SPF records, DMARC authenticates emails to let the server know that they come from a trusted source. DMARC should only be activated after DKIM and SPF have been configured. The time window in between should be about 48 hours.

There are already helpful tools with which you can check your configuration. The following tool, for example, gives you a good overview: https://easydmarc.com/tools/domain-scanner?ref=patrickm.de

Below is the result for lemonspeak.com, which leaves room for improvement. However, that was not the domain I used for cold emailing:

Alternatively, I really liked using the tool mxtoolbox. The UI is a bit peculiar, but you either add a prefix for the record to be tested (e.g. spf:) or you select it from the dropdown button next to it.
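To give an idea of what these entries look like, here is a sketch with placeholder values. The domain example.com, the include host, the report address and the helper parse_tags are assumptions for illustration only; the real values come from your mail provider. SPF and DMARC are both published as TXT records (DMARC under the _dmarc subdomain), and a DMARC policy is a list of tag=value pairs separated by semicolons.

```python
# Placeholder TXT records for a sending domain (all values are examples).
SPF_RECORD = "v=spf1 include:_spf.example.com ~all"
DMARC_RECORD = "v=DMARC1; p=none; rua=mailto:dmarc-reports@example.com"


def parse_tags(record: str) -> dict[str, str]:
    """Split a 'tag=value; tag=value' TXT record into a dictionary."""
    tags = {}
    for part in record.split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition("=")
        tags[key.strip()] = value.strip()
    return tags


dmarc = parse_tags(DMARC_RECORD)
print(dmarc["p"])                       # the policy: none, quarantine or reject
print(SPF_RECORD.startswith("v=spf1"))  # every SPF record starts with v=spf1
```

A policy of p=none only collects reports. It can be tightened to quarantine or reject once SPF and DKIM pass reliably.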

Warming up your domain (#3)

A warm-up is important because a new domain that immediately sends hundreds of emails is quickly flagged as suspected spam. I did the warm-up of the domain with MailToaster. For me it took about 7 days.

Write your template (#4)

Write the text for your email. This is also called a template. There are tools that calculate a score showing how spam-suspicious the text is. I often used Mailmeteor for this.

Use a filter (#5)

Create a filter for your target group. For example, I created a filter to search for podcasters who are English- or German-speaking and who have fewer than three employees.

How do I write a cold email?

I'm no expert when it comes to cold emailing, but I had the feeling that people recognise quite quickly whether an email is AI-generated or whether there is a person behind it. I did use AI for a first draft, but mainly because the subject was new to me and not because I thought I would get the very best results with it.

The following is often recommended on the internet:

Max. 50 words and no pitching
Don't ask for time
F-shape with 3 paragraphs
No platitudes or filler words
More "you", less "I" or "we"
Language of grade 6 or below

So these are the ground rules, but how do I fill them with good content?

For the content I followed this recommendation, because it seemed very sensible to me:

Get attention based on a scientific fact
Point the recipient to a problem in their business
Explain the impact on their business
Ask for validation and/or feedback

My first template

Building on the ground rules and the content recommendation, I came up with my first template:

Title: Hey {name}, just a quick question about {podcast name}

Hey {name},
Nine out of ten podcasters don't produce more than three episodes.

What about you? Are your listeners growing? I run a service that has helped over 70 podcasters create transcripts, show notes, articles and more from their podcast. They use it for SEO and social media.

Mind if I send you more info?

An email like this should be short and to the point. I share the assumption of most guides that long emails tend to be read less. Our attention span is shorter than it used to be, so I tried to write a short, interesting text that makes potential customers curious to take a look at my tool. A "hack" you often read is that the title should contain "quick question", because this makes the recipient curious. I'm in two minds about that. The template shown above contained it and did not perform very well.
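Filling such a template for each recipient is a simple mail merge. The sketch below shows the idea with plain string replacement; render, word_count and the sample recipient are hypothetical, and an outreach tool normally does this step for you. It also counts the words of the body, which makes the 50-word guideline easy to check.

```python
SUBJECT = "Hey {name}, just a quick question about {podcast name}"
BODY = """Hey {name},
Nine out of ten podcasters don't produce more than three episodes.

What about you? Are your listeners growing? I run a service that has helped over 70 podcasters create transcripts, show notes, articles and more from their podcast. They use it for SEO and social media.

Mind if I send you more info?"""


def render(template: str, values: dict[str, str]) -> str:
    """Replace every {placeholder} in the template with its value."""
    for key, value in values.items():
        template = template.replace("{" + key + "}", value)
    return template


def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


# Hypothetical recipient, for illustration only.
recipient = {"name": "Alex", "podcast name": "The Example Show"}

subject = render(SUBJECT, recipient)
body = render(BODY, recipient)
print(subject)
print(word_count(body), "words (guideline: at most 50)")
```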

My second template (improved version)

The second template was already better:

Title: ⏳ Podcast marketing is too time consuming. I can save you 5.23 hours.

Hey {name},
Don't you think that this is super annoying? I at least do. I built LemonSpeak for that. It does the heavy lifting for you by creating SEO and social media content.

Could you imagine trying it on your podcast and seeing how much time it saves you?

Happy podcasting!

Patrick Müller
Caretaker lemonspeak.com
Freiburg, Germany

The title already states what the customer gets in return for a purchase: 5.23 hours of time saved. That worked better. The open rates were higher and there were more clicks on lemonspeak.com. Unfortunately I can't put a number on how many customers I converted based on this, but I cross-checked and there were some 🙂. Regrettably I also have no official screenshots from Apollo, because my account was deleted due to inactivity.

By the way, I sent my emails via Apollo.io.

Conclusion

I can't make a general statement on whether cold emailing is an effective marketing strategy. I talked to another founder, and for his service cold emailing worked well, but they did not have a SaaS, they had an agency. For LemonSpeak it was not the right tool. I did win some customers through it, but administering the whole thing also costs quite a bit of time. It's the details of the template that hold you up, yet they have to be right in the end. As an alternative, I would rather have built a referral programme.

At the same time, it is hard to know how a potential customer reacts to a cold email. Does it put your own product in a bad light, or is it taken in good spirit? How aggressively cold emailing is done and how friendly the text is certainly play a role, too. For example, I sent at most one reminder email and not two or three. Despite my own negative experience, I don't want to rule out cold emailing for my next product. Depending on the customer group and the product, it simply works better or worse. Trying things out and experimenting is the right way, in my opinion.

By the way: if cold emailing still feels too strange to you, you can search for customers in a targeted way and write to them personally. The chances of success are considerably higher (I did that with LemonSpeak as well). So is the time required, though.

How to find the right SaaS pricing and avoid mistakes

Patrick Müller — Sun, 09 Feb 2025 09:22:03 GMT

If you are currently working on your own SaaS (Software as a Service) or on another digital product and are facing the decision of how to design the pricing, you've come to the right place.

A quick note up front: every product and every company is individual, so there is no universal answer to the question of the best pricing model. A great many people deal with this topic, and you will find countless books and blog posts on the internet about the right pricing model.

In this article I want to give you insights into how I designed the SaaS pricing for LemonSpeak, what went well and what I should have done differently.

The best-known pricing models

Before I go into LemonSpeak's pricing model in detail, a brief overview of the most common pricing models helps to put it in context:

Monthly recurring fee (de facto standard, has replaced the one-time payment via product key)
Usage-based, also called pay-as-you-go or metered pricing. You pay for what you use.
Credit / token-based model: pay an amount X up front that you can then use up. Essentially usage-based, just the other way round.

My beginnings with LemonSpeak

When I started LemonSpeak, I wanted to offer a pricing model that benefits the customer and takes their interests into account more than mine. I'm convinced that this principle is still right, because potential customers notice very quickly whether a pricing model is fair or not.

The business model behind LemonSpeak is quickly explained: podcasters can upload an episode as an audio file and have it processed. This produces content, e.g. show notes, a transcript, chapter markers, tweets, etc., which podcasters can reuse.

The number of minutes of an episode is an important factor for the cost calculation. The more minutes an audio file contains, the higher the costs for LemonSpeak. Probably the best-known pricing model for a SaaS is a monthly fee that is paid again in the following months as long as the customer is subscribed (the first model in the list above). Dead simple.

In the landscape of generative AI podcast tools, this is the model I have seen most often. Packages are defined, such as:

Small (0-100 minutes)
Medium (101-200 minutes)
Large (201-300 minutes)

The first iteration of the pricing model and the problem of the missing overage

The first pricing model I started with did not include an overage. In the beginning it was simply a monthly base fee plus usage. It did not matter how many minutes (too many) a podcaster had used. This led to customers starting a subscription, having all their episodes processed and then cancelling again. That doesn't sound very dramatic at first, so to illustrate I want to show you an example calculation: assume a podcast has 20 episodes with an average length of 40 minutes. That is 800 minutes in total, and the revenue would be: $7 + 800 * $0.04 = $39. With the adjusted pricing model it would be $107.20, or $68.80 in the Studio plan. The $39 is not exactly the lifecycle I had imagined for a SaaS and for a customer. To solve this I introduced the overage: if a customer has more processed, they are not forced to upgrade to a higher plan, but pay an increased fee for the additional usage, called overage.
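The example calculation can be written down as a tiny function. The figures (7 USD base fee, 0.04 USD per minute, 20 episodes of 40 minutes) are the ones from the paragraph above; the function name is made up for this sketch.

```python
from decimal import Decimal

BASE_FEE = Decimal("7.00")         # monthly base fee in USD
RATE_PER_MINUTE = Decimal("0.04")  # standard rate in USD per minute


def revenue_without_overage(minutes: int) -> Decimal:
    """Base fee plus usage: the first pricing model, without any overage."""
    return BASE_FEE + minutes * RATE_PER_MINUTE


episodes, avg_minutes = 20, 40
total_minutes = episodes * avg_minutes         # 800 minutes
print(revenue_without_overage(total_minutes))  # 39.00
```

A customer who processes a whole back catalogue in one month and then cancels therefore brings in 39 USD exactly once.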

Which pricing model I chose for LemonSpeak

If, as a podcaster, I produce more than 110 minutes a month, I already need the Medium package and can't start with the cheaper Small package. The boundaries of the pricing tiers are deliberately chosen so that a customer ideally does not subscribe to the smallest package. I found and still find that unappealing, which is why I did not want to adopt this model. It seemed much fairer to me to bill the number of minutes actually used. However, to get some planning certainty regarding revenue development, I combined "pay-as-you-go" with a monthly base fee. Of course I also differentiated by the number of features, but the really big difference between the plans was how many minutes may be processed. More on that in a moment.

Here is LemonSpeak's pricing model:

LemonSpeak pricing model

The best of both worlds, or so the idea went. No sooner said than done. However, the approach turned out to be disadvantageous for several reasons.

Disadvantages of the hybrid pricing model LemonSpeak uses

It is not easy to understand

You can see five different plans here (Studio, Professional, Beginner, Flexible, Free). The badges "For the first … Minutes" define how many minutes are billed at the standard rate of 4 cents/minute. If a customer has more processed, they are not forced to upgrade to a higher plan but pay an increased fee for the additional usage (overage). Overage is more commonly seen with cloud providers, but I thought the concept was super customer-friendly, since customers can stay in their plan without being forced out. If I have already lost or confused you, I can understand that.
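How such a bill comes together can be sketched as follows. Only the standard rate of 4 cents per minute is taken from the text; the base fee, the number of included minutes and the overage rate in the example plan are hypothetical values, not LemonSpeak's real plan parameters.

```python
from decimal import Decimal

STANDARD_RATE = Decimal("0.04")  # USD per minute, from the article


def monthly_cost(minutes, base_fee, included_minutes, overage_rate):
    """Base fee + standard rate up to the plan limit + overage rate beyond it."""
    standard_minutes = min(minutes, included_minutes)
    overage_minutes = max(minutes - included_minutes, 0)
    return (
        base_fee
        + standard_minutes * STANDARD_RATE
        + overage_minutes * overage_rate
    )


# Hypothetical plan: 7 USD base fee, first 100 minutes at the standard rate,
# every further minute at 0.10 USD.
plan = dict(
    base_fee=Decimal("7"),
    included_minutes=100,
    overage_rate=Decimal("0.10"),
)

print(monthly_cost(80, **plan))   # inside the limit: 7 + 80 * 0.04 = 10.20
print(monthly_cost(150, **plan))  # 7 + 100 * 0.04 + 50 * 0.10 = 16.00
```

The customer stays in the same plan in both cases; only the minutes above the limit are billed at the higher rate.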

Maybe you feel like most people and ask yourself: what exactly is overage, and how much will this cost me per month? To answer at least the last question, I built in a usage-based calculation.

Cost calculation for a customer on LemonSpeak

However, this is just a drop in the ocean, and I can confirm that it is not very intuitive. This way I lose potential customers right on the pricing page.

The Flexible plan and the working time it requires

The idea was to offer this plan for people who either don't want to start a subscription or have audio files that are quite short. The plan was used quite often. What I had not considered were Stripe's high fees and the time needed for accounting. Stripe is an American company and a payment service provider. As a payment service provider, they process payments and hold the certifications required by legal standards.

The time needed for booking the invoices, as well as the administrative effort in general, should not be underestimated. With the flexible option in particular, an invoice was issued after each processing run, which meant I often had to book four invoices per month for the same customer. Below are two statements from the "Beginner plan" and the "Flexible plan".

Stripe costs of the Beginner plan

For the Beginner package, which costs 7 USD, 5.85 euros remain after Stripe.

Screenshot: Beginner plan fee from Stripe

Stripe costs of the Flexible plan

With the pay-as-you-go version it was even worse. An invoice of 0.90 USD ended up as 0.54 euros for LemonHeap, which then still has to be taxed as profit at the tax office. The 27 cents are already 21% of the revenue. A very high fee.

Screenshot: Flexible plan fee from Stripe

The costs and the time needed for bookkeeping are so high that I would no longer offer the Flexible plan. I would not make it more expensive either. Simplicity wins here. For the customer, for me and for your digital product.

Abrechnung am Ende der Laufzeit vs. zu Beginn

Normalerweise erfolgt die Abrechnung eines SaaS-Abonnements zu Beginn der Laufzeit. Bei dem von mir gewählten Hybridmodell war das nicht möglich, da ich erst am Ende des Kundenzyklus wusste, wie hoch der tatsächliche Verbrauch war. Cloud-Anbieter haben dasselbe Problem, aber warum ist das eigentlich ein Problem?

Leider musste ich mit mehreren Kunden die Erfahrung machen, dass die Rechnungsstellung am Ende des Monats zu Problemen führte. 16% der Kund:innen haben am Ende die automatische Zahlung verweigert. Meist indem sie ihr Zahlungsinstitut angewiesen haben, die anstehende Zahlung an LemonSpeak zu blockieren. So wird ein SaaS-Betreiber schnell über den Tisch gezogen. In meinem Fall waren es Kunden die nicht in Deutschland ansässig waren. Das ist super ärgerlich, weil ich nicht viel machen konnte. Natürlich wird wiederholt von Stripe versucht die Zahlung einzuziehen und auch E-Mails habe ich wiederholt versendet, aber jeder der sich geweigert hat ist damit auch davon gekommen. Die Art und Weise, wie das ablief, ließ mich auch vermuten, dass diese Personen dies nicht zum ersten Mal taten.

The second problem was that customers sent me emails accusing LemonSpeak of having wrongfully collected money even though the subscription had been cancelled two weeks earlier. Each time, I had to explain that the charge was justified by the pay-as-you-go structure, since these customers still had an active subscription until the end of their billing cycle. Sometimes LemonSpeak was also accused of deception because of this, which led to two or three more emails explaining the pricing model. That costs extra time that you no longer have for other work. The first time I received such an email and such an accusation, it naturally hit me. However, annoyances and unfriendly customers or internet trolls are part of life as a self-employed person, and I have learned from it, so I no longer let it unsettle me.

So how do you find the optimal pricing model for your SaaS or digital product?

First, I would make a list of how your competitors have structured their pricing. Chances are very high that they, too, spent a long time on it and weighed the pros and cons. So use your competitors' pricing models as a guide. In my case, I should have questioned more why nobody bills per minute.
The second point I want to give you is to look closely at how high your payment provider's fees are for the different models. Also consider what the workflow from billing to your own bookkeeping looks like. As long as it is not fully automated, it costs an enormous amount of time.
The third point is that there will always be people who abuse a service and cause financial damage. Billing at the beginning of the cycle, however, solves the problem of blocked payments described above. If a payment is blocked, the customer account is deactivated and you have your peace of mind.
And last but not least: make it as simple as possible! Ask a friend who does not know your product yet to look at your pricing page. They should have understood it after 10 seconds, otherwise I would keep iterating.

Final thoughts

A pricing model is not set in stone, and iterations and changes are normal. Even though it is not rigid, introducing a new pricing model still takes effort: partly for technical reasons, and partly because customers with existing contracts cannot simply be migrated to the next pricing model with a click. That alone would not be clean from a legal point of view. On the other hand, you must not get hung up on a pricing model either. At some point a decision has to be made. As long as that decision follows careful consideration and market observation, it is made to the best of your knowledge at that time, and the rest is improved through iteration.

Why Celery Tasks Trigger Signals Twice

Patrick Müller — Sun, 12 Jan 2025 20:12:12 GMT

Celery uses signals to trigger functions. I am currently writing tests for my Celery tasks. A key point for me is testing the function that is triggered when a task finishes successfully. In my case, the function is called execute_after_task() and stores the task ID in Redis. To tell Celery that this function should run after a task has finished successfully, I use the decorator @signals.task_success.connect. The whole construct looks like this:


  from celery import signals

  @signals.task_success.connect
  def execute_after_task(sender=None, result=None, **kwargs):
    # Code that stores the task ID in Redis
    ...

One test that I defined with pytest checks whether a Celery task, after finishing successfully, has executed this function and stored the task ID in Redis exactly once.
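As a rough illustration of the "exactly once" check, here is a minimal sketch that runs without Celery or Redis. FakeRedis, the key name "finished_tasks", and the simplified handler signature are assumptions made up for this example. They are not taken from the real project.

```python
class FakeRedis:
    """Hypothetical in-memory stand-in for a Redis client."""

    def __init__(self):
        self.items = []

    def rpush(self, key, value):
        # Record every write so the test can count them.
        self.items.append((key, value))


def execute_after_task(store, task_id):
    # Simplified handler: stores the task ID under an assumed key.
    store.rpush("finished_tasks", task_id)


def test_task_id_stored_once():
    store = FakeRedis()
    execute_after_task(store, "task-123")
    # The ID must appear exactly once, not zero times and not twice.
    assert store.items.count(("finished_tasks", "task-123")) == 1
```

If the handler were connected twice, a check like this would see the count go to 2 and fail.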

The code, and the task in particular, worked without problems so far. To my surprise, however, I found that Celery executes the function execute_after_task twice. As a result, the task ID is written to Redis twice. That not only wastes resources, it can also cause problems on the client side, which checks Redis for an ID.

Debugging the Celery signals - Where is the bug?

In my setup, test_celery.py had two imports (import a, import b), and both of those modules in turn imported the celery_api module, in which the @signals.task_success.connect decorator is applied to execute_after_task. The following layout may help:


#test_celery.py
import a
import b

#a.py
import celery_api

#b.py
import celery_api

To understand this in more detail, I stepped through the execution of execute_after_task with the debugger and could see that Celery stores decorated functions in an internal registry. This applies not only to task_success but also, for example, to @signals.task_failure.connect.

In Celery's signal.py, we can see the following line:


#signal.py
for receiver in self._live_receivers(sender):

The code iterates over all registered receivers. Inspecting the result of this call shows that execute_after_task is listed twice, as weak reference objects.

For a better understanding, it helps to debug the @signals.task_success.connect decorator itself. While stepping through the function, you can inspect the call stack. It shows that the decorator is called during the import of the two different modules inside test_celery.py, namely modules a and b. So the cause of the double execution is that the @signals.task_success.connect decorator is executed twice.

How do I prevent the double call?

The solution is fairly simple: pass a dispatch_uid='path.to.function' parameter to the decorator. The new decorator looks like this:


@signals.task_success.connect(dispatch_uid='src.celery_api.tasks')
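To make the effect of dispatch_uid more tangible, here is a small sketch with a toy signal class. ToySignal and make_handler are hypothetical names invented for this example. The class is a simplified model and not Celery's actual implementation. The sketch assumes the handler ends up existing as two separate function objects, as if the module body had been executed more than once.

```python
class ToySignal:
    """Simplified stand-in for a signal; not Celery's real code."""

    def __init__(self):
        self.receivers = []  # list of (lookup_key, receiver) pairs

    def connect(self, receiver, dispatch_uid=None):
        # Without a dispatch_uid, the receiver's identity serves as the key.
        key = dispatch_uid if dispatch_uid is not None else id(receiver)
        if all(existing_key != key for existing_key, _ in self.receivers):
            self.receivers.append((key, receiver))
        return receiver

    def send(self, **kwargs):
        # Call every registered receiver and collect the results.
        return [receiver(**kwargs) for _, receiver in self.receivers]


def make_handler():
    # Each call creates a new function object with identical behaviour.
    def execute_after_task(**kwargs):
        return kwargs["task_id"]
    return execute_after_task


without_uid = ToySignal()
without_uid.connect(make_handler())
without_uid.connect(make_handler())

with_uid = ToySignal()
with_uid.connect(make_handler(), dispatch_uid="src.celery_api.tasks")
with_uid.connect(make_handler(), dispatch_uid="src.celery_api.tasks")

calls_without_uid = without_uid.send(task_id="abc")
calls_with_uid = with_uid.send(task_id="abc")
print(len(calls_without_uid), len(calls_with_uid))  # prints: 2 1
```

In this model, the second connect call with the same dispatch_uid is ignored, so the handler runs only once per signal.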

❗

Note: In Python, a decorator is executed during import. That is how Python works in general and is not specific to Celery.

Fortunately, I noticed this during testing, but it could have happened in any other module as well. The same should be done for the task_failure decorator.
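A tiny sketch of this behaviour, using a made-up connect decorator and a plain list as registry (both are assumptions for illustration, not Celery code):

```python
registered = []


def connect(func):
    # This body runs as soon as the decorated "def" statement is executed,
    # which for module-level functions means while the module is imported.
    registered.append(func.__name__)
    return func


@connect
def execute_after_task():
    return "done"


# execute_after_task has not been called yet, but it is already registered.
print(registered)  # prints: ['execute_after_task']
```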

Helpful sources

To wrap up, here are two resources that helped with the debugging:

Signals – Django: a short explanation from the Django project

task_revoked Signal is triggered twice, when task is revoked · Issue #3805 · celery/celery: a GitHub issue about signals that are triggered twice