<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Fang-Pen&apos;s coding note</title>
    <description>Fang-Pen Lin&apos;s blog about programming</description>
    <link>https://fangpenlin.com/</link>
    <atom:link href="https://fangpenlin.com/feed.xml" rel="self" type="application/rss+xml" />
    
      <item>
        <title>No, LLM is not going to replace software engineers, here&apos;s why</title>
        <description>&lt;style type=&quot;text/css&quot;&gt;
  figure {
    text-align: center;
    margin: 0 auto;
  }
  figcaption, figcaption p {
    color: grey;
    font-size: 0.9em;
  }
  figure img {
    max-width: 100%;
  }
&lt;/style&gt;

&lt;p&gt;Today, I’d like to share my theory about why LLMs cannot replace software engineers, based on my experience and observations.
Who am I to talk about this topic, you may ask.
Well, not much, except that I have spent more than two decades of my life programming, almost every single day, long before GPT was a thing.
You can check my &lt;a href=&quot;https://github.com/fangpenlin&quot;&gt;GitHub profile&lt;/a&gt;, but it only captures a decade or so (from when Git became a thing), and those are just GitHub repos, not including all the proprietary repos I have worked on.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2026-03-19-no-llm-is-not-going-to-replace-software-engineers-heres-why/decade-on-github.png&quot; alt=&quot;A screenshot of my GitHub profile contribution graph spanning roughly a decade, showing regular activity over time. It reflects only public GitHub work, not proprietary repositories.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;More than a decade of GitHub contributions on my public profile. This is only what GitHub can see (public + any private contributions I opted to show), not the proprietary repos I’ve worked in.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;These are all organic commits. I never commit a single line of code just to make my GitHub profile look green, because that’s stupid.
Before LLMs, I wrote almost all my code keystroke by keystroke. Sounds crazy, right? 🤣&lt;/p&gt;

&lt;p&gt;Yep, that’s how things used to be.
And it could last this long only because I truly enjoy programming.
I’ve built countless software projects: backend, frontend, mobile, data pipelines, browser extensions, infrastructure as code, and I’ve even trained AI models from scratch.
After using LLMs to speed up coding, I have to say: if you use them the right way, they can make you many times faster.
But it’s not all sunshine and rainbows: if you rely on them too much, programming becomes painful, and it chips away at your ability to think like an engineer from first principles.
I did some soul-searching and found that I still enjoy programming, so I intentionally handwrite code from time to time to keep the muscle strong.&lt;/p&gt;

&lt;p&gt;Like many of my fellow software engineers, I used to panic a bit when I saw machines spitting out code like crazy fast.
Are my experiences from the past two decades all in vain?
But the truth is, the more I learn about LLMs and the more I use them, the more I realize that, contrary to what many claim, they are not going to replace software engineers within 12 months.
I didn’t buy into the hype, but at the same time, I don’t feel desperate. Instead, I feel cautiously optimistic about the future of software engineering.&lt;/p&gt;

&lt;p&gt;Because, simply put: no. Despite many tech companies committing collective suicide by drinking the Kool-Aid, LLMs are not going to replace software engineers.
Most people confuse the idea of coding with software engineering.
LLMs surely have a huge impact on the software industry: they speed up many things greatly, but like any tool, there is a trade-off.
Now the cost of making sloppy software goes down almost to zero, but that does not mean software engineers will disappear.
Let’s see why I think it is this way.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;entropy-llms-cannot-escape-the-information-density-of-the-software-spec&quot;&gt;Entropy: LLMs cannot escape the information density of the software spec&lt;/h2&gt;

&lt;p&gt;One of the main reasons why LLMs cannot replace software engineers is the limitation imposed by the information density of the software spec, i.e., &lt;a href=&quot;https://en.wikipedia.org/wiki/Entropy_(information_theory)&quot;&gt;Shannon entropy&lt;/a&gt;.
Software development is a process of discovering the requirements and removing the uncertainties.
Usually, we start with a rough idea about the system, for example, say we want to build a login page.
You can tell the LLM&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;hey, build me a login page&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The concept of a login page is a well‑known idea to LLMs because there are countless login pages implemented in various programming languages.
However, when you tell an LLM to build such a page, the model does not have any context about the details of your specific login page.
Therefore, for the system you are building, there is a high level of uncertainty.
An LLM will generate the most likely code based on statistical norms embedded in its weights from the training data.
With a non‑zero &lt;a href=&quot;https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig.temperature&quot;&gt;temperature&lt;/a&gt; (the level of randomness of picking the next token from the list of token candidates) plus the internal statistical model, there is a distribution of plausible outcomes.&lt;/p&gt;
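&lt;p&gt;As a rough sketch of what temperature does (a toy model with made-up logits, not any particular LLM’s actual implementation), next-token probabilities come from a temperature-scaled softmax: dividing the scores by the temperature before normalizing flattens or sharpens the distribution of plausible outcomes:&lt;/p&gt;

```python
import math

def sample_distribution(logits, temperature):
    """Temperature-scaled softmax: probability of picking each candidate token."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    # Subtract the peak for numerical stability before exponentiating.
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate completions.
logits = [2.0, 1.0, 0.1]

low = sample_distribution(logits, temperature=0.2)   # sharp: nearly always candidate 0
high = sample_distribution(logits, temperature=2.0)  # flat: real chance of any candidate
```

&lt;p&gt;At a low temperature the model almost always emits its single most likely continuation; at a higher temperature the tail candidates become genuinely reachable, which is why the same prompt yields a distribution of plausible login pages rather than one fixed answer.&lt;/p&gt;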

&lt;figure&gt;
  &lt;img src=&quot;/images/2026-03-19-no-llm-is-not-going-to-replace-software-engineers-heres-why/entropy-login-page.svg&quot; alt=&quot;A spectrum of plausible login pages when the spec is just &apos;build me a login page&apos;: little entropy (information) provided implies high uncertainty about the outcome.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A spectrum of plausible login pages when the spec is just &quot;build me a login page&quot;. Little entropy (information) provided → high uncertainty about the outcome.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;At this point, you have offered little entropy (information) about the system requirements.
Because of the lack of information, there is high uncertainty about the system. You are basically rolling a die.
Because generating code with an LLM this way feels so much like gambling, some people on X even came up with a &lt;a href=&quot;https://x.com/shiri_shh/status/2006751251036590354&quot;&gt;funny comparison&lt;/a&gt;:&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2026-03-19-no-llm-is-not-going-to-replace-software-engineers-heres-why/vibe-coding-gambling.png&quot; alt=&quot;A meme comparing vibe coding with LLMs to gambling/slot machines.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A meme comparing vibe coding with LLMs to gambling/slot machines. Source: &lt;a href=&quot;https://x.com/shiri_shh/status/2006751251036590354&quot;&gt;original post on X&lt;/a&gt;.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;The end result may not be the best login page in the world, but it will probably come with very basic stuff and it might just work.
If you do not like the look of the login page, you can then supply more entropy about the software spec.
Say, you want the login button to be blue.
Now your login page spec goes from just a login page to&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;a login page with a blue login button&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;By trading in more entropy (information), the uncertainty of the system is reduced.
But here comes yet another problem: what kind of blue?
There are so many shades of blue; it’s a big range.
You can either roll the dice multiple times, or put it in the prompt directly, saying you want a particular shade of blue, Baby Blue for example.
Either way, rolling the dice repeatedly and picking one, or spelling out the details in the prompt, it is essentially the same thing: you are supplying more entropy (information) to the system spec and reducing uncertainty.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2026-03-19-no-llm-is-not-going-to-replace-software-engineers-heres-why/entropy-login-page-blue-button.svg&quot; alt=&quot;A spectrum of plausible login pages after adding one constraint (&apos;the login button is blue&apos;): more entropy in the spec narrows the distribution, but uncertainty remains.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;Adding one constraint (&quot;the login button is blue&quot;) injects more entropy into the spec and narrows the output distribution, uncertainty decreases, but doesn&apos;t disappear (there are still many &quot;blues&quot;).&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;After some iterations, now you have a beautiful login page with a blue button that meets your taste.
As a person who cares about security, you might want a two‑factor authentication feature after the user inputs their username and password as an extra layer of defense.
Now, your software spec has become&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;a login page with a Baby Blue login button; after user login we need to prompt for two-factor authentication&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Well, once again, there are many two‑factor authentication mechanisms to choose from. It could be hardware-based like YubiKey, biometric info like fingerprints, facial recognition, SMS, or email one-time passwords.
There are so many options. Which one would you like the LLM to build?&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2026-03-19-no-llm-is-not-going-to-replace-software-engineers-heres-why/entropy-login-page-blue-button-tfa.svg&quot; alt=&quot;A spectrum of plausible login pages under more specific requirements (e.g. a specific blue and a two-factor authentication flow): more entropy is provided and a smaller space of plausible implementations is imposed.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;When the requirements are more specific (e.g. a specific blue + a two‑factor auth flow), more entropy is provided in the spec and a smaller space of plausible implementations is imposed.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;You see, this is a very simplified version of software development. We can play this game forever: keep adding new requirements and ask the LLM to generate code for us.
But here comes another question: why blue, but not yellow?
And if we are going to implement email one‑time passwords, why not get rid of the username and password, relying only on emailing the code?
In fact, many software systems do that. For example, &lt;a href=&quot;https://slack.com&quot;&gt;Slack&lt;/a&gt; has chosen a passwordless approach as its authentication method.
The different choices present different paths you can take in building the system.
If there are parallel universes, there could be multiple versions of you making different decisions about system design.
If we draw the diagram again for each of them over time, you will see something like this:&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2026-03-19-no-llm-is-not-going-to-replace-software-engineers-heres-why/entropy-login-page-tree.svg&quot; alt=&quot;A &apos;parallel universes&apos; mental model: the login-page spec can branch into a tree as you add requirements and make design choices, representing many plausible paths you could have taken.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;This is &lt;em&gt;not&lt;/em&gt; an exhaustive map of every possible login page, it&apos;s a mental model under a &quot;parallel universes&quot; thought experiment. As you add requirements and make design choices, the &quot;spec&quot; can branch into a tree, representing the many plausible paths you could have taken.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;All possibilities for building a system could span a tree structure.
The trend is very obvious: over time, with more entropy supplied, uncertainty goes down.&lt;/p&gt;

&lt;h3 id=&quot;the-system-spec-data-density-grows-exponentially-with-the-software-usage-and-the-value-it-provides&quot;&gt;The system spec data density grows exponentially with the software usage and the value it provides&lt;/h3&gt;

&lt;p&gt;You may ask&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Why does this have anything to do with LLMs not being able to replace software engineers?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;First of all, while LLMs are very good at generating code based on the instructions you give them, they are not good at discovering and validating the spec by themselves in the real world.
You still need someone (or some accountable process) to collect feedback, get hands-on with the system, arbitrate trade‑offs, and feed more details into the software spec to reduce uncertainty.&lt;/p&gt;

&lt;p&gt;And the more detailed the software spec needs to be, the more information you have to provide to keep the system’s behavior certain.
The density of information about the software requirements is something you cannot escape with LLMs.
Shannon showed us (via &lt;a href=&quot;https://en.wikipedia.org/wiki/Information_theory&quot;&gt;information theory&lt;/a&gt;) that there is a limit to lossless data compression: you can’t, in general, represent arbitrary information with fewer bits than its inherent information content.
In practice we often “compress” specs by leaning on conventions and shared references (framework defaults, “make it like Stripe checkout,” existing libraries, existing test suites).
But the moment you deviate from those well‑known paths, the missing information has to be supplied somewhere (prompt, tests, code, or human decisions).
No matter how hard you prompt, you still need to find a way to convey the idea of your system, either in plain English, tests, or in programming language.&lt;/p&gt;
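&lt;p&gt;A back-of-the-envelope way to see the limit (a toy model with made-up numbers, not a real measurement): if each unresolved design choice has n equally likely options, pinning it down costs about log2(n) bits of information, and no prompt, however clever, can reliably convey those decisions in fewer bits:&lt;/p&gt;

```python
import math

# Toy model: each open design choice with n equally likely options
# costs log2(n) bits to disambiguate (Shannon). The counts are hypothetical.
open_choices = {
    "login button shade of blue": 16,  # a palette of 16 candidate blues
    "two-factor mechanism": 5,         # YubiKey, fingerprint, face, SMS, email OTP
    "passwordless or not": 2,
}

bits_needed = sum(math.log2(n) for n in open_choices.values())
# Any spec shorter than ~bits_needed bits leaves the LLM guessing the rest.
```

&lt;p&gt;The absolute numbers are invented; the point is the direction: every option you leave unspecified is residual entropy that either you supply later or the model fills in by rolling the dice.&lt;/p&gt;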

&lt;p&gt;Certainly, LLMs can translate your English prompt into code, empowering many people who do not know how to code. Now they can at least build something.
For people who already know how to program, it also saves a tremendous amount of time from mechanical work.
But regardless, there is no way to consistently convey an idea with less information than is theoretically necessary to disambiguate it.
In other words, LLMs act as an amplifier in coding and drafting, while finding, negotiating, and defining clear requirements remains a hard, human‑in‑the‑loop challenge.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;I do not care about the login page, as long as it works, I am fine with it&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I heard you yelling at me.
Well, the login page is just an overly simplified example to help the reader understand the concept of software development and the relationship with information entropy.
Sure, if you do not care about the actual output of the system, then indeed LLMs can do a great job, because they can generate something that looks like it works the way you want.
In fact, not all subsystems in a software project are equally important.
Even if a page looks ugly and the code is a mess, as long as it works, who cares?&lt;/p&gt;

&lt;p&gt;That kind of statement usually comes from people who vibe‑coded side projects which have nothing to lose.
Based on my own experience, as the usage of a software system grows and the business value it produces increases, so do the requirements.
And the growth of requirements does not come in a linear fashion; it usually grows exponentially, or at least super‑linearly.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2026-03-19-no-llm-is-not-going-to-replace-software-engineers-heres-why/sepc-growth.svg&quot; alt=&quot;A diagram showing system spec requirements growing non-linearly (roughly exponentially) as software usage and business value increase.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;As usage (and the business value on top of it) increases, the amount of system spec/requirements tends to grow non‑linearly, often super-linear or exponential in practice.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Say you have a website that serves one hundred people per day and makes no money, versus a website serving 10 million users per day across the globe with a business making big money on top of it: the requirements would be completely different.
For the website with very few users, nobody really cares if the website goes down, certainly nobody cares that much about the color of your login buttons.
But for a 10M DAU (daily active users) website, I can easily name many obvious requirements:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Security requirements&lt;/strong&gt;: ensure user data will not leak, because now it has real-life impact&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Performance requirements&lt;/strong&gt;: with high load, you need to serve that many people efficiently&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Reliability requirements&lt;/strong&gt;: SLAs with customers, uptime requirements, etc.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Compliance requirements&lt;/strong&gt;: GDPR, different local laws, data retention requirements, etc.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Requirements from legacy code&lt;/strong&gt;: given that the website has grown from nothing to 10M, there must be legacy code some users still depend on; you still need to keep it around for some time&lt;/li&gt;
  &lt;li&gt;…&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The list can go on and on.
Just the sheer amount of information in those requirements is tremendous.
Even if there were a perfect LLM that could translate your software spec into code flawlessly, you would still need to provide those specs to the LLM.
So the bottleneck doesn’t disappear; it shifts.
If the hard part is specifying and verifying behavior under real‑world constraints, then “replacing engineers” would require replacing that spec/verification work too, not just typing code faster.
This is why I think it is unrealistic for LLMs (as we use them today) to replace software engineers: that’s not what they are designed for.&lt;/p&gt;

&lt;p&gt;There are different types of people claiming LLMs can replace software engineers.
One type is people who have a stake in LLM model companies; they need a story to keep the VC money flowing.
Another type is people who have software with nothing to lose, of course LLMs could work great in that case:&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2026-03-19-no-llm-is-not-going-to-replace-software-engineers-heres-why/nothing-to-lose-zone-sepc-growth.svg&quot; alt=&quot;The same spec-growth curve with the low-usage, low-stakes region highlighted (the &apos;nothing to lose&apos; zone).&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;The low‑usage, low‑stakes &quot;nothing to lose&quot; zone is where LLMs shine in terms of generating low-quality software quickly. But once the software has something to lose, the sheer amount of requirements can make LLMs less effective, because the entropy of the requirements might be as big as writing the code yourself.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h3 id=&quot;functional-requirements-are-easy-nonfunctional-requirements-are-the-real-challenges&quot;&gt;Functional requirements are easy, non‑functional requirements are the real challenges&lt;/h3&gt;

&lt;p&gt;So far, we have been talking about the software spec.
As you see in our login page example, we only use functional requirements as examples, because they are very easy to define and verify.
However, as you saw when we discussed potential requirements for a large-scale website, most of these requirements are not even functional.
There is a joke circulating on X that captures the irony perfectly:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;To let AI build a secure system, you should tell AI:&lt;/p&gt;

  &lt;p&gt;Build a secure system, make no mistake&lt;/p&gt;

  &lt;p&gt;Do not forget the important part “make no mistake”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This prompt carries almost no entropy at all.
A requirement like this is extremely hard to define in simple terms, because it is not a single piece of code that you can magically generate and the whole system would be secure because of that.
The security of a system relies on whether every single line of code is doing the right thing, while taking all possible attack vectors into consideration.
It even includes all the third‑party libraries, your upstream vendors, and how they handle security from their ends.
And of course, there is no unbreakable system; you also need to consider the value of the compromised system and the potential cost for the attackers.&lt;/p&gt;

&lt;p&gt;A simple prompt like that implies a full inspection of every single line of code on a potential attack surface.
And there is no easy way to verify whether you are doing it right or wrong.
There is also an implication of taking user mistakes into consideration.
System requirements such as performance share similar challenges: they are hard to define and verify.
Also, while functional requirements usually affect only a local subsystem, non‑functional requirements usually have a global impact on the whole system.
For example, you may have multiple subsystems running at the same time, and eventually there is a final task that collects the result and reports it to the user.
The actual performance of the whole system is defined by the slowest one:&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2026-03-19-no-llm-is-not-going-to-replace-software-engineers-heres-why/minimum-tonne.png&quot; alt=&quot;A barrel diagram where the water level is limited by the shortest stave, illustrating that overall performance is constrained by the weakest component.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A &quot;shortest barrel stave&quot; view of system performance: like Liebig&apos;s barrel, the overall capacity is limited by the weakest part, not the average. One slow or fragile subsystem can cap the whole system (same &quot;weakest link&quot; idea as &lt;a href=&quot;https://en.wikipedia.org/wiki/Liebig%27s_law_of_the_minimum#Liebig%27s_barrel&quot;&gt;Liebig&apos;s law of the minimum&lt;/a&gt;).&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Security and compliance are similar: if there is any weakness in the system, or anything that violates compliance, the end result is the same, a breach or a compliance violation, no matter how well the other parts are doing.&lt;/p&gt;
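&lt;p&gt;The “shortest stave” effect is easy to sketch with a toy latency model (the subsystem names and numbers below are made up for illustration):&lt;/p&gt;

```python
# Toy fan-out/fan-in model: a request fans out to several subsystems in
# parallel, and the response is assembled only after the slowest one finishes.
subsystem_latency_ms = {
    "auth": 40,
    "profile": 55,
    "recommendations": 480,  # the one slow component
}

# End-to-end latency is the maximum, not the average, of the parts.
end_to_end_ms = max(subsystem_latency_ms.values())
```

&lt;p&gt;Speeding up auth or profile changes nothing here; only fixing the slowest stave moves the end-to-end number, which is why a non-functional requirement like performance forces you to reason about the whole system at once.&lt;/p&gt;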

&lt;p&gt;Now you see: given how hard and wide-ranging the non‑functional spec’s impact on a system is, the required entropy to define such a system is much bigger than for functional requirements.
We will talk more about this in the hidden context and software verification sections.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2026-03-19-no-llm-is-not-going-to-replace-software-engineers-heres-why/tip-of-icebrug.jpg&quot; alt=&quot;Tip-of-the-iceberg diagram showing visible functional requirements above the waterline and large hidden non-functional requirements below.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;Software requirements are like an iceberg: the visible functional spec above the waterline is only the tip, while the much larger mass of non‑functional constraints (security, reliability, compliance, operability, etc.) sits hidden below the surface and dominates the true complexity.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h3 id=&quot;programming-code-is-also-spec-and-reading-code-is-harder-than-writing-code&quot;&gt;Programming code is also spec, and reading code is harder than writing code&lt;/h3&gt;

&lt;p&gt;Usually, in a healthy software system you will have high level human readable language, like &lt;a href=&quot;https://en.wikipedia.org/wiki/Software_design_document&quot;&gt;design docs&lt;/a&gt;, potentially plus &lt;a href=&quot;https://en.wikipedia.org/wiki/Behavior-driven_development&quot;&gt;BDD&lt;/a&gt; (behavior‑driven development) / &lt;a href=&quot;https://en.wikipedia.org/wiki/End-to-end_testing&quot;&gt;e2e&lt;/a&gt; (end‑to‑end) tests to define how the system should work from a high level.
Then, software engineers will write code based on the system spec.
People forget, we came all the way from &lt;a href=&quot;https://en.wikipedia.org/wiki/Machine_code&quot;&gt;machine code&lt;/a&gt; to &lt;a href=&quot;https://en.wikipedia.org/wiki/High-level_programming_language&quot;&gt;high-level programming languages&lt;/a&gt;.
It’s already very high-level compared to what it used to be.
The code itself is also spec: it tells the machine how to behave.
So the programming code carries a big portion of the entropy about how the system should behave, down to the details.
The split between high-level spec and code might be something like 10% / 90% to 20% / 80%.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2026-03-19-no-llm-is-not-going-to-replace-software-engineers-heres-why/software-spec.svg&quot; alt=&quot;Pyramid diagram with a small high-level system spec at the top and a large base of detailed code at the bottom, connected by a downward arrow.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A pyramid view of software: a relatively small, high‑level system spec at the top, and a much larger volume of detailed code at the bottom that implements and refines that spec.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;LLMs indeed changed the game.
Now, instead of writing the 80% of code yourself, it is possible that you only write the top 10% ~ 20% of the system spec, and let LLMs generate the code for you.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2026-03-19-no-llm-is-not-going-to-replace-software-engineers-heres-why/software-spec-with-ai-gen-code.svg&quot; alt=&quot;Pyramid diagram with high-level system spec at the top and a base of AI-generated code with question marks at the bottom.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;The same pyramid, but with the bottom filled by AI‑generated code: the high‑level spec is still small, yet a large volume of detailed code appears underneath it. Without careful review, that code is annotated with question marks.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;This feels like magic, because now it seems like it saves you 80% of the work of writing code yourself.
However, the problem is, the code is generated as the most likely arrangement of tokens based on the given prompt and context.
It is an average of what the code might look like.
Before someone reviews the code and verifies the intent of the generated code, whether it really meets all the requirements defined in the spec, and whether there are any bugs in the code, it is still just plausible code based on the spec.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2026-03-19-no-llm-is-not-going-to-replace-software-engineers-heres-why/dlss-5.jpg&quot; alt=&quot;Left: low-quality image of NVIDIA CEO; right: AI-upscaled result that is clearly not the same person&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;Left: low-quality image of the CEO of NVIDIA. Right: AI upscaling result of DLSS 5, obviously not the CEO of NVIDIA. It&apos;s a joke, obviously 😅 &lt;a href=&quot;https://x.com/micheevs5/status/2033788255955566701&quot;&gt;Originally from this X post&lt;/a&gt;; I modified it slightly so the point is clear. AI can resize or enhance images, but when the original data is not there, it can only guess and may invent details that do not exist, much like generating a large amount of source code from a small portion of spec.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;If your software has nothing to lose (no users, no customers), most vibe coders would just stop here.
Because it seems like it is doing what they want it to do. In that case, they do not care about quality.
We just mentioned in the previous section, when the usage and business value of the software system grows, so do the requirements.
If the vibe coder is lucky enough and their software is no longer a toy but a business, they face a problem: either take it seriously and fix the critical problems, like performance, security, and compliance, or risk losing everything they have just gained.
Now the software has something to lose.
You are forced to add more software requirements into the system spec, be it performance, security or anything else.&lt;/p&gt;

&lt;p&gt;If you just vibe‑coded the system, have zero clue what the code looks like, but you do have the spec, one obvious idea is to add all your requirements into the spec and let the LLM generate the code again for you.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2026-03-19-no-llm-is-not-going-to-replace-software-engineers-heres-why/software-spec-too-big.svg&quot; alt=&quot;Pyramid diagram with an oversized high-level system spec block on top and a very small code block at the bottom, suggesting a mismatch between spec entropy and implementation.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;An extreme case: a huge, detailed system spec sitting on top of a tiny amount of code. When the entropy in the spec vastly exceeds what the code actually covers, it might be much easier to just write your software spec in the code as using English to define such accurate details would be impractical.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Here is the problem: now you have a super huge system spec carrying all the entropy.
English is not an accurate language.
Describing roughly how a system should work might be good enough.
But describing how software should work from the high level down to tiny details would be a nightmare.
In that case, letting code carry the implementation details, i.e., the entropy of the software spec, is a better idea.&lt;/p&gt;

&lt;h2 id=&quot;why-trading-code-generation-speed-with-more-code-reviews-is-a-bad-deal&quot;&gt;Why trading code generation speed with more code reviews is a bad deal&lt;/h2&gt;

&lt;p&gt;LLMs generate code extremely fast; it almost feels instantaneous compared to humans.
But now we know that code generated purely with LLMs, without review, is just a plausible implementation of your spec.
We still need to review the code.
Engineering is all about trade-offs. Using LLMs or not, reviewing code or not, each comes with its own pros and cons.
Now, let’s see what’s the deal we are making here.&lt;/p&gt;

&lt;p&gt;As mentioned previously, most people don’t realize that reading code is actually harder than writing it. But how hard is it?
This is my personal experience (it’s very case-by-case), but in general, I would grade the difficulty of programming activities as:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Activity&lt;/th&gt;
      &lt;th&gt;Difficulty (relative units)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Thinking&lt;/td&gt;
      &lt;td&gt;6&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Writing&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
&lt;td&gt;Reading others’ code&lt;/td&gt;
      &lt;td&gt;10&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Naming&lt;/td&gt;
      &lt;td&gt;999&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Usually, thinking is the most challenging part other than reading other people’s code. Once you know how it should work, writing it down is the easy part.
Of course, you may need to check the library, syntax and some details about the programming language.
Depending on how familiar you are with it, sometimes it can be slower or faster.
But with LLMs, you save a great amount of time on mechanical work.
Researching syntax and reading library docs can now often be collapsed into a single prompt.&lt;/p&gt;

&lt;p&gt;Reviewing other people’s code is usually hard because, as mentioned previously, you are guessing the author’s intention; it’s puzzle-like by nature.
Even for your own code, after some time, you can forget the details; it will be hard for you as well.
Of course, not all code is complex, but even a one-line change can have hidden context behind it.
We will talk about hidden context later; you will see why it’s so hard.&lt;/p&gt;

&lt;p&gt;With LLMs, I would say the difficulty of writing drops dramatically.
I would put it at just 1 here:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Activity&lt;/th&gt;
      &lt;th&gt;Difficulty (relative units)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Thinking&lt;/td&gt;
      &lt;td&gt;6&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Writing&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color:#b31d28;text-decoration:line-through&quot;&gt;3&lt;/span&gt; → &lt;span style=&quot;color:#22863a&quot;&gt;1&lt;/span&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Reading others’ code&lt;/td&gt;
      &lt;td&gt;10&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Naming&lt;/td&gt;
      &lt;td&gt;999&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Writing productivity has indeed improved greatly: the difficulty drops from 3 to 1.
However, if you are not happy with the gain, there’s still something you can do.
You can delegate part of the “thinking” to the LLM.
For example, if you want to achieve something but don’t know how, you can ask an LLM to make the decision for you.
That way, you’ve saved time on thinking, right?&lt;/p&gt;

&lt;p&gt;Well, yes, but no.
As you can see, the difficulty of thinking goes down, let’s say to 1.
But because it’s not your intention, you don’t know why the LLM is doing what it does, so you need to review it just as you would review someone else’s code.
Depending on the quality and volume of the generated code, the reviewing difficulty could go up even further, say 20 instead of 10.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Activity&lt;/th&gt;
      &lt;th&gt;Difficulty (relative units)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Thinking&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color:#b31d28;text-decoration:line-through&quot;&gt;6&lt;/span&gt; → &lt;span style=&quot;color:#22863a&quot;&gt;1&lt;/span&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Writing&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color:#b31d28;text-decoration:line-through&quot;&gt;3&lt;/span&gt; → &lt;span style=&quot;color:#22863a&quot;&gt;1&lt;/span&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Reading others’ bad code&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color:#b31d28;text-decoration:line-through&quot;&gt;10&lt;/span&gt; → &lt;span style=&quot;color:#22863a&quot;&gt;20&lt;/span&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Naming&lt;/td&gt;
      &lt;td&gt;999&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;If you are in a team, your co-workers need to review it too, which multiplies the cost of reviewing AI-generated slop.
It’s insane when you think about it: you pay a lower cost to trade for something with a much higher cost. It’s a huge net loss.
Therefore, I would say, if you care about software quality, outsourcing thinking to an LLM is a bad idea.&lt;/p&gt;
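&lt;p&gt;To make the net loss concrete, here is a toy cost model based on the tables above. Everything in it, the function, the numbers, the team size, is my own illustrative assumption, not a measurement:&lt;/p&gt;

```python
# A toy cost model for the tables above; all numbers are my own arbitrary
# relative units, not measurements of anything.
def total_cost(thinking, writing, reviewing, reviewers=2):
    # every reviewer on the team pays the reviewing cost again
    return thinking + writing + reviewing * reviewers

# You think, the LLM writes, two co-workers review your intent-driven code.
you_think = total_cost(thinking=6, writing=1, reviewing=10)

# You outsource thinking too; everyone now reviews harder-to-follow code.
llm_thinks = total_cost(thinking=1, writing=1, reviewing=20)

assert llm_thinks > you_think  # the "saved" thinking is a net loss
```

&lt;p&gt;Delegating thinking saves 5 units up front but adds 10 units per reviewer; with any reviewers at all, the trade is underwater.&lt;/p&gt;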

&lt;h2 id=&quot;llm-is-not-doing-great-outside-of-its-comfort-zone&quot;&gt;LLM is not doing great outside of its comfort zone&lt;/h2&gt;

&lt;p&gt;The term comfort zone usually describes the situations people feel at ease in, but I think it’s perfect for describing LLMs as well.
In the diagrams shown previously, the well‑known approaches, algorithms, and code patterns available in great amounts in the training data, mostly from open‑source projects, are exactly the comfort zone of LLMs.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2026-03-19-no-llm-is-not-going-to-replace-software-engineers-heres-why/llm-comfort-zone.svg&quot; alt=&quot;Diagram illustrating an LLM&apos;s comfort zone: strong performance on common, well-represented patterns in training data, and degraded performance outside that region.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;The &quot;comfort zone&quot; of an LLM is the set of patterns it has seen a lot of during training (common frameworks, common algorithms, common app shapes). Inside that zone, outputs are usually solid; outside it, uncertainty spikes and the model is more likely to guess or hallucinate.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;LLMs can do extremely well inside their comfort zone, but not outside of it, simply because the whole system is built to recognize patterns and predict based on them.
All the software, algorithms, and building blocks that are present in good quantity and great quality in LLM training data, I call commodity software.&lt;/p&gt;

&lt;p&gt;However, if there is any requirement outside of well‑known patterns, LLM performance degrades greatly.
In simpler terms, LLMs are not creative.
More often than not, LLMs make things up if asked to do things outside of their comfort zone, i.e., &lt;a href=&quot;https://en.wikipedia.org/wiki/Hallucination_(artificial_intelligence)&quot;&gt;hallucination&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Take the login page example we mentioned above.
Most login pages look largely the same, but what if we want to do something really odd and barely seen out there?
Like:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Hey, build me a login page and only grant access to the user if they can perform the Moonwalk perfectly on camera&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I have not really tried, but my gut tells me it might actually be able to make one.
The reason is not that the LLM is creative and comes up with algorithms to detect human movement on camera; more likely, it is because algorithms for detecting human movement exist in some open‑source projects it has read before.
It can either use those libraries in the code it generates, or reproduce the algorithm it memorized, with twists, in the target language, taking the context into account.
Using existing building blocks to create something new, that is what software engineers do on a daily basis.
LLMs can certainly do similar things, and it makes people feel they are super smart or even creative, but under the hood, they are chaining the patterns seen in the training data.
In the case of using the moonwalk dance move as a way to log in, whether the moonwalk is accurate or not can be output as a simple value.
Somewhere in the weights, it has seen the pattern that you can chain output from one system and pipe it to another system, so it applies that rule.&lt;/p&gt;

&lt;p&gt;If we lived in a universe where all the body motion research and the corresponding code did not exist, and you gave an LLM such a prompt, it would not know how to do that.
Or, even if it tried, there would be little chance the system would work.&lt;/p&gt;

&lt;h2 id=&quot;hidden-context&quot;&gt;Hidden context&lt;/h2&gt;

&lt;p&gt;Yet another reason LLMs cannot replace software engineers is because of hidden context.
I have been learning Japanese lately, and I realized Japanese is a very context‑dependent language.
People get used to communicating with hidden context that is obvious in that environment.
Here is an example.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;好き&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It means “like” in Japanese.
In an anime scene where a male student says it to a female student, it is obvious that he means “I like you.”
If you put just “好き” into Google Translate, despite translation technology being really advanced nowadays (thanks to transformer models), it still does not capture the implied “I like you” part.
It can only guess the most likely meaning of “好き” based on statistics from its training data.&lt;/p&gt;

&lt;p&gt;In the first section, we discussed how important contributing entropy and reducing uncertainty are to a software project.
When a software engineer writes a few lines of code, there could be hundreds of considerations in their mind when writing them.
It could be anything like:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Performance considerations&lt;/li&gt;
  &lt;li&gt;Security considerations&lt;/li&gt;
  &lt;li&gt;Compliance considerations&lt;/li&gt;
  &lt;li&gt;Legacy code considerations&lt;/li&gt;
  &lt;li&gt;Workarounds for third‑party bugs …&lt;/li&gt;
  &lt;li&gt;…&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So on and so forth.
Most of the time, there will be nothing mentioned in the code about those intentions.
If you are lucky, the engineer who takes good practice seriously may leave a comment explaining why these lines are here.
But more often than not, this context stays in the author’s brain.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2026-03-19-no-llm-is-not-going-to-replace-software-engineers-heres-why/hidden-context.jpg&quot; alt=&quot;Iceberg diagram showing &apos;visible context&apos; above the waterline and much larger &apos;hidden context&apos; below.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;Another iceberg: the prompt and the code you can see are just the &quot;visible context&quot;. The bulk of engineering intent is &quot;hidden context&quot; (trade-offs, constraints, historical decisions, production failures, tribal knowledge) that rarely makes it into the code or the prompt.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;It is not because software engineers are lazy (well, sometimes we are 😅), but rather because the entropy of those hidden contexts would be tremendous if we had to write all of them down in detail.
The code is just the artifact of the thought process.
When LLMs are trained based on that code, the LLM has no idea about these hidden contexts.
For example, why are we using syscall A instead of syscall B here?
There could be a reason: maybe syscall A performs better in one situation and syscall B performs better in another?
Or maybe we need to support a legacy Linux kernel here, so despite it being a bit slower, we need to use syscall A instead of B?
Skilled and experienced software engineers usually bring far more considerations to writing the same line of code than junior software engineers.
Sometimes you would be surprised by how many considerations are behind a simple few lines of code.&lt;/p&gt;
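&lt;p&gt;Here is a small sketch of what that looks like in practice. The scenario, the function, and the kernel constraint are all hypothetical; the point is that the one comment carries the hidden context the code alone cannot:&lt;/p&gt;

```python
import shutil

def copy_payload(src: str, dst: str) -> None:
    """Copy a file with a plain buffered read/write loop.

    Hidden context (the part that usually never gets written down):
    a zero-copy syscall like sendfile would be faster on modern Linux,
    but in this hypothetical service we still deploy to hosts with old
    kernels where it is unavailable, so we deliberately take the slower,
    portable path. Strip this comment, and all a reader (or an LLM
    trained on this code) sees is a naive loop begging to be "optimized".
    """
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        shutil.copyfileobj(fsrc, fdst, length=64 * 1024)
```

&lt;p&gt;The artifact is three lines of ordinary code; the decision behind it is a paragraph. That asymmetry is the whole problem.&lt;/p&gt;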

&lt;figure&gt;
  &lt;img src=&quot;/images/2026-03-19-no-llm-is-not-going-to-replace-software-engineers-heres-why/hidden-context.svg&quot; alt=&quot;Diagram showing hidden context on the left influencing the choice of artifact code A, B, or C on the right, with the hidden context not visible to readers.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;&quot;Hidden context&quot; (constraints, trade-offs, historical incidents, performance profiles, production realities) drives the choice between multiple plausible artifacts (code A/B/C). But that context is usually invisible to future readers, and therefore invisible to LLMs trained only on the code.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;When you prompt an LLM:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Write me performant software, MAKE NO MISTAKE!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is not going to work; as mentioned previously, this is outside the comfort zone of LLM models.
Not because there is zero data in the world (there are benchmarks, perf reports, CVEs, postmortems, and private incident history), but because the relevant context is rarely captured in the artifacts the model sees at generation time, and often isn’t written down at all.
How would you know why these lines of code are here, or why the order of lines matters for performance, if it is not explicitly stated?
This is why I say reading code is many times harder than writing code, because you can only guess about the author’s intention if they do not write it down.&lt;/p&gt;

&lt;h2 id=&quot;disorderness-compounds-over-time&quot;&gt;Disorderness compounds over time&lt;/h2&gt;

&lt;p&gt;Yet another reason is, in a system with lots of low-cost code generation (LLM or not), the &lt;a href=&quot;https://en.wikipedia.org/wiki/Entropy_(thermodynamics)&quot;&gt;entropy&lt;/a&gt; (disorderness) tends to grow over time unless you spend deliberate effort to remove it.
The entropy here means disorderness rather than the amount of information we mentioned before; it comes from &lt;a href=&quot;https://en.wikipedia.org/wiki/Thermodynamics&quot;&gt;thermodynamics&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;In any &lt;a href=&quot;https://en.wikipedia.org/wiki/Isolated_system&quot;&gt;isolated system&lt;/a&gt;, entropy (disorderness) will only grow over time (i.e., the &lt;a href=&quot;https://en.wikipedia.org/wiki/Second_law_of_thermodynamics&quot;&gt;second law of thermodynamics&lt;/a&gt;)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Because I have used the term entropy too many times, to make this easier to read, I will use disorderness instead moving forward.
This is only an analogy (software projects are not isolated thermodynamic systems), but it matches a familiar engineering reality: if you can produce code faster than you can review, delete, and simplify it, noise accumulates.&lt;/p&gt;

&lt;p&gt;What I mean is: for an LLM to generate code that works (or is more likely to work), it tends to copy code without any idea of what it is for.
For example, recently I cleaned up some useless lines added by an LLM:&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2026-03-19-no-llm-is-not-going-to-replace-software-engineers-heres-why/anti-entropy.jpg&quot; alt=&quot;Screenshot of Fork (Git GUI) showing a diff that deletes unnecessary lines added by an LLM.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A real diff from &lt;a href=&quot;https://git-fork.com&quot;&gt;Fork&lt;/a&gt;: removing unnecessary defensive or cargo-cult lines that an LLM sprinkled throughout the codebase. This kind of cleanup work is &quot;anti-entropy&quot;, restoring signal by deleting noise.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Not only that: if you have ever used LLMs to generate code and have coding experience without them, you will notice that LLM models act very defensively when generating code.
They will always try to catch any exception, and when importing a library they will add a fallback in case the library does not exist.
These lines of useless code look harmless, but they are a kind of disorderness added by LLMs.
They not only make the code potentially run slower, they also make the code much harder to read.
It is purely adding noise to the signal.
Over time, more and more noise will be added to the code base, the noise‑to‑signal ratio will be higher than ever.
Not only will it consume more tokens to process, at some point the LLM itself will lose focus.&lt;/p&gt;
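&lt;p&gt;A made-up but representative sketch of this defensive noise, contrasted with the version the codebase actually wants. Nothing here comes from a real repository; the names are invented for illustration:&lt;/p&gt;

```python
# LLM-flavored defensive noise (made up, but representative):
def get_port_defensive(config):
    try:
        if config is None:                 # caller never passes None here
            return 8080
        value = config.get("port", 8080)
        try:
            return int(value)
        except (TypeError, ValueError):
            return 8080                    # silently masks broken config
    except Exception:
        return 8080                        # catch-all that hides real bugs

# The signal version: fail loudly on bad config, no invented fallbacks.
def get_port(config):
    return int(config["port"])
```

&lt;p&gt;The defensive version never crashes, which sounds nice until a typo in production config silently sends traffic to port 8080 instead of raising an error you would actually see.&lt;/p&gt;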

&lt;p&gt;It is funny: even in 2026, there are still people using lines of code as a measurement of productivity. I saw people brag about tens of thousands of lines of code an LLM generated, as if we are in a competition to generate as many lines of code as possible.
Unfortunately, for people without software engineering experience, it seems like a logical idea: software engineers produce code, so the more code the better, right?
No, that is totally wrong.
Code is a liability: the more lines your codebase has, the more likely it is to contain bugs.
And more code means readers will need to spend more time reading through nonsense; even your LLM will spend more tokens reading it.&lt;/p&gt;

&lt;p&gt;You know what is harder than generating code?
Removing code without breaking the system or reducing functionality.
It requires a complete understanding of how the code works, why a line is not needed, or whether there is a better, simpler flow to achieve the same thing.
So far, I have never seen an LLM that is good at removing code without breaking things, because once again, that is not what the system is designed for.&lt;/p&gt;

&lt;p&gt;Without removing the disorderness and noise from the code, very soon, you will realize that it gets harder and harder to prompt the LLM to modify software without breaking it.
Most people out there are vibe coding pet projects, and those pet projects do not survive long enough or become complex enough to hit that wall.
In other words, anti‑entropy needs to be introduced periodically to keep the code base’s noise‑to‑signal ratio below a certain level.
We usually call it &lt;a href=&quot;https://en.wikipedia.org/wiki/Code_refactoring&quot;&gt;refactoring&lt;/a&gt;.
LLMs are good at mimicking, but they are not good at understanding, so letting them do it for you will not help.
Because this task requires understanding the system, fortunately or unfortunately, you will still need an experienced software engineer to help you do it.&lt;/p&gt;

&lt;h2 id=&quot;software-verification-is-still-challenging-and-becoming-more-important-than-ever&quot;&gt;Software verification is still challenging, and becoming more important than ever&lt;/h2&gt;

&lt;p&gt;We have already talked about this a lot. If your software has nothing to lose, you probably do not care about any of this.
But say you have something to lose, then software verification is going to be the really challenging part for you, even with LLMs.
Usually, to ensure the system works as expected, we will write &lt;a href=&quot;https://en.wikipedia.org/wiki/Behavior-driven_development&quot;&gt;BDD&lt;/a&gt; tests, &lt;a href=&quot;https://en.wikipedia.org/wiki/End-to-end_testing&quot;&gt;end‑to‑end&lt;/a&gt; tests, &lt;a href=&quot;https://en.wikipedia.org/wiki/Acceptance_testing&quot;&gt;acceptance tests&lt;/a&gt; and other tests to have a way to verify the system automatically.
In a healthy software project, if we put on the lens of information entropy, you may have something like this:&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2026-03-19-no-llm-is-not-going-to-replace-software-engineers-heres-why/software-verifications-with-tests.svg&quot; alt=&quot;Pyramid diagram showing high-level spec on top, automatic tests in the middle, and implementation code at the bottom.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A verification pyramid: high‑level product spec at the top, a large layer of automated tests (BDD/E2E/acceptance/regression) in the middle that continuously checks behavior, and the implementation code at the bottom.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;The software product spec is at the top, the middle is tons of automatic tests, and finally the bottom is the code.
While some people would love to see automated testing code as, well, just code, I prefer to see it more as software spec, because it pins down the behavior of the software.
What is even better?
It can be verified automatically.&lt;/p&gt;
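&lt;p&gt;This is what I mean by tests doubling as an executable spec. A minimal, hypothetical example (the &lt;code&gt;slugify&lt;/code&gt; function and its behavior are invented for illustration):&lt;/p&gt;

```python
# A behavior-level test pins down *what* the software must do, not *how*.
def slugify(title: str) -> str:
    # minimal implementation, just enough to make the spec below runnable
    return "-".join(title.lower().split())

def test_title_becomes_a_url_slug():
    # Given a post title with mixed case and spaces
    title = "No Silver Bullet"
    # When we turn it into a slug
    slug = slugify(title)
    # Then it is lowercase and hyphen-separated
    assert slug == "no-silver-bullet"
```

&lt;p&gt;The implementation could be rewritten from scratch tomorrow, by a human or an LLM, and the test would still say whether the behavior survived. That is the sense in which the test carries the spec’s entropy, not the code.&lt;/p&gt;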

&lt;p&gt;Some software projects may not have much of their spec written in plain English, but they have an extensive spec written as automated test cases.
There are many interesting, well‑known test suites, such as:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;JVM / Java SE - &lt;a href=&quot;https://openjdk.org/groups/conformance/JckAccess/index.html&quot;&gt;JCK (TCK for Java SE compatibility)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;SQLite - &lt;a href=&quot;https://sqlite.org/src/doc/trunk/doc/testrunner.md&quot;&gt;TCL test suite / test runner&lt;/a&gt; and &lt;a href=&quot;https://sqlite.org/sqllogictest&quot;&gt;sqllogictest&lt;/a&gt; (plus TH3, their proprietary harness)&lt;/li&gt;
  &lt;li&gt;Web browsers - &lt;a href=&quot;https://web-platform-tests.org/&quot;&gt;Web Platform Tests (WPT)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Android - &lt;a href=&quot;https://source.android.com/docs/compatibility/cts&quot;&gt;Compatibility Test Suite (CTS)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Kubernetes - &lt;a href=&quot;https://sonobuoy.io/&quot;&gt;Sonobuoy conformance tests&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Graphics APIs - &lt;a href=&quot;https://github.com/KhronosGroup/VK-GL-CTS&quot;&gt;Khronos Vulkan/OpenGL conformance tests (VK-GL-CTS)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The test cases provide a huge amount of entropy and reduce a great amount of software uncertainty.
In the era of AI‑assisted programming, implementation details are still very important, but compared to a robust way to define the software and verify it, the value proposition has shifted from the actual code implementation to the verifiable spec.&lt;/p&gt;

&lt;h3 id=&quot;anthropics-compiler-experiment&quot;&gt;Anthropic’s compiler experiment&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://anthropic.com&quot;&gt;Anthropic&lt;/a&gt; ran an experiment having LLMs generate a compiler, claimed it was built from scratch (not really), and showed that it can compile the Linux kernel (&lt;a href=&quot;https://anthropic.com/engineering/building-c-compiler&quot;&gt;Building a C compiler with a team of parallel Claudes&lt;/a&gt;).
The reason they could do it is not how smart the LLM is; it is that the software is already well defined by the extensive test suites provided by the open‑source compiler community, such as the &lt;a href=&quot;https://gcc.gnu.org/onlinedocs/gccint/Torture-Tests.html&quot;&gt;GCC torture tests&lt;/a&gt; (and similar compiler test suites).
And even though it works, the performance of the compiled binaries is still worse than GCC/Clang in many cases, especially without deep optimization work.
That is because the entropy that defines performant compilation behavior is not captured in the tests, and LLMs do not understand what makes compiled binary code run fast.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2026-03-19-no-llm-is-not-going-to-replace-software-engineers-heres-why/monkey-and-typewriter.jpg&quot; alt=&quot;A chimpanzee seated at a typewriter.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A chimpanzee seated at a typewriter. Credit: New York Zoological Society, circa 1906, Public Domain (via Wikimedia Commons: &lt;a href=&quot;https://commons.wikimedia.org/wiki/File:Chimpanzee_seated_at_typewriter.jpg&quot;&gt;Chimpanzee seated at typewriter&lt;/a&gt;).&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;You see, that is the power of an extensive test suite carrying a massive amount of software spec entropy: it makes it possible even for LLMs to generate code that can compile the Linux kernel.
This is basically the &lt;a href=&quot;https://en.wikipedia.org/wiki/Infinite_monkey_theorem&quot;&gt;infinite monkey theorem&lt;/a&gt;: given infinite time, a monkey randomly hitting keys on a typewriter will almost surely type the complete works of Shakespeare.
With test cases detailed enough, given enough monkeys and time, they can also generate code that passes those tests.
Certainly, using LLMs reduces the search space, plus compiler code is already part of their training data.
Also, humans still made the architecture decisions and constrained the problem.
But the key enabler is the same: intensive verification (tests) that pins down behavior tightly enough for automation to converge.&lt;/p&gt;

&lt;p&gt;Interesting side note: people noticed that the compiler seems to reproduce a bunch of very specific mistakes that match bugs in small open-source C compilers (e.g. chibicc), which suggests it is likely copying/rewriting patterns from its training data rather than “discovering” everything from first principles (see: &lt;a href=&quot;https://github.com/anthropics/claudes-c-compiler/issues/232&quot;&gt;Hunting for traces of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;chibicc&lt;/code&gt; in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Claude&apos;s C compiler&lt;/code&gt;&lt;/a&gt;).&lt;/p&gt;

&lt;h3 id=&quot;cursors-browser-experiment&quot;&gt;Cursor’s browser experiment&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://cursor.com&quot;&gt;Cursor&lt;/a&gt; did a similar experiment, but they built a browser instead (&lt;a href=&quot;https://cursor.com/blog/scaling-agents&quot;&gt;Scaling long-running autonomous coding&lt;/a&gt;).
After spending trillions of tokens, they had something that &lt;a href=&quot;https://emsh.cat/cursor-implied-success-without-evidence/&quot;&gt;barely compiles&lt;/a&gt;.
The project was later criticized for relying heavily on existing open‑source components (for example &lt;a href=&quot;https://github.com/servo/servo&quot;&gt;Servo&lt;/a&gt; and &lt;a href=&quot;https://github.com/bellard/quickjs&quot;&gt;QuickJS&lt;/a&gt;) rather than being “from scratch” (see: &lt;a href=&quot;https://theregister.com/2026/01/26/cursor_opinion&quot;&gt;The Register&lt;/a&gt; and the related &lt;a href=&quot;https://news.ycombinator.com/item?id=46650998&quot;&gt;HN thread&lt;/a&gt;).
And even if they had made something that looks like it works, it would mostly be because they were leaning on the extensive existing test suites from open‑source browsers.&lt;/p&gt;

&lt;h3 id=&quot;cloudflares-nextjs-rebuild&quot;&gt;Cloudflare’s Next.js rebuild&lt;/h3&gt;

&lt;p&gt;Yet another interesting and relatively successful case is &lt;a href=&quot;https://cloudflare.com&quot;&gt;Cloudflare&lt;/a&gt; rewriting Next.js (&lt;a href=&quot;https://blog.cloudflare.com/vinext/&quot;&gt;How we rebuilt Next.js with AI in one week&lt;/a&gt;).
Once again, they used (and explicitly ported) extensive test cases from the Next.js project to verify that the AI-generated code works, with some manual architecture work; Cloudflare notes they “ported tests directly from their suite.”
The end result outperforms the original project.
Is it impressive?
Certainly it is.
Is LLM going to replace software engineers?
Obviously not.&lt;/p&gt;

&lt;h3 id=&quot;tdd-and-bdd-may-make-a-huge-comeback&quot;&gt;TDD and BDD may make a huge comeback&lt;/h3&gt;

&lt;p&gt;Now you have seen many examples, and they all share the same factor: extensive test cases carrying a huge amount of the verifiable spec.
This shows how important software verification is, particularly in the AI‑assisted programming era.
It also shows that with enough high-quality entropy provided by an automated test suite, even LLMs can generate software that works.
These projects were probably chosen precisely because the test cases already exist.
But what if you do not have the test cases to begin with?
That raises an interesting question: how do you come up with an extensive test suite from nothing?
It is still the same problem: where does the entropy of the software spec come from, and how detailed must it be to make the software behave exactly as you need?&lt;/p&gt;

&lt;p&gt;Interestingly, in the industry, some people advocated strongly for &lt;a href=&quot;https://en.wikipedia.org/wiki/Test-driven_development&quot;&gt;TDD&lt;/a&gt; (test‑driven development) and &lt;a href=&quot;https://en.wikipedia.org/wiki/Behavior-driven_development&quot;&gt;BDD&lt;/a&gt; (behavior‑driven development) in the past.
I like the idea of having test cases before implementing code, because it is like defining how the software behaves first.
But you can only go so far with an imaginary system and thinking about how it could behave.
After you build the system, there will always be unexpected details popping up, and you need to go back and revise your spec, i.e., your test cases.
Thus, there is still no running away from understanding the software, i.e., you still need a software engineer to help you do so.&lt;/p&gt;

&lt;p&gt;Back in 2023, I &lt;a href=&quot;https://x.com/fangpenlin/status/1667795250885324800&quot;&gt;envisioned on X&lt;/a&gt; that TDD/BDD would become a big thing again with AI‑assisted programming.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2026-03-19-no-llm-is-not-going-to-replace-software-engineers-heres-why/use-bdd-as-spec-and-gen-code.png&quot; alt=&quot;Screenshot of my 2023 X post about using BDD as the spec, generating code, and using tests to verify correctness.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;My 2023 X post: &lt;a href=&quot;https://x.com/fangpenlin/status/1667795250885324800&quot;&gt;use BDD as the spec and generate code&lt;/a&gt;.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Today, seeing the examples mentioned above, I believe even more firmly that approaches similar to TDD/BDD will make a huge comeback.
It is not a solved problem, and it is unlikely to become one: because of the sheer amount of entropy the test cases need to carry, they cannot come out of nowhere. Someone needs to provide it.
And since it remains a challenging problem, there is no way you can get rid of software engineers.
LLMs can help generate test cases, but someone still needs to drive it and ensure it is testing the actual desired behavior.
It is like:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Hey, write tests to ensure the software is correct&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I would show you this meme:&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2026-03-19-no-llm-is-not-going-to-replace-software-engineers-heres-why/bug-and-feature.webp&quot; alt=&quot;Meme about ambiguity between a bug and a feature.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;The classic &quot;bug or feature?&quot; ambiguity: without a precise spec, you can&apos;t even tell whether behavior is wrong or intended, and &quot;write tests to ensure correctness&quot; becomes &quot;define what correct means&quot; first.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;And say&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Well, define what is correct 🤣&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You see?
It still needs entropy from you regardless.&lt;/p&gt;

&lt;h3 id=&quot;the-implication-of-the-value-proposition-shifting-from-implementation-to-verifiable-software-spec-for-tech-companies-and-open-source-communities&quot;&gt;The implication of the value proposition shifting from implementation to verifiable software spec (for tech companies and open-source communities)&lt;/h3&gt;

&lt;p&gt;The really interesting case to me is Cloudflare’s Next.js port.
With extensive test cases carrying the entropy, it seems like you can throw away the implementation code and rewrite from the ground up quickly.
What does it mean to the tech companies and the open source communities?
Interestingly, there are some special cases of open-source projects, such as SQLite: they keep some of their most critical test cases secret, for many reasons.
If they published those test suites, recreating SQLite would be much easier.
For companies running open-source business models, does this mean they also need to keep extensive test cases private, while only keeping the code open-sourced, to avoid a quick clone with LLMs?
But that’s a topic for another day.&lt;/p&gt;

&lt;h2 id=&quot;final-thoughts&quot;&gt;Final thoughts&lt;/h2&gt;

&lt;p&gt;There’s much more I’d like to discuss.
But this is already a long article.
Overall, I think replacing software engineers with LLMs is far more challenging than people think.
A productivity boost could displace some software engineers, but not all.
And non-technical people, newly empowered by these tools, will generate fresh demand, creating new opportunities for software engineers.&lt;/p&gt;

&lt;p&gt;If you don’t know how to code and an LLM empowers you to build software you couldn’t build before, I’d encourage you to go for it.
There’s only upside in your use case.
I bet this will create more demand for experienced software engineers, because some people will have something to lose.
But if you’re a big tech company making big money from your software, I would do it very carefully if I were one of your stakeholders.&lt;/p&gt;

&lt;p&gt;It seems like the leading LLM companies are under heavy pressure to pull more money into the system.
So they keep telling a story as big as “replacing all software engineers.”
I intentionally use the term LLM instead of AI because the discussion above is specifically about the LLM architecture.
At least for LLMs, I think it’s impossible; it’s just nonsense.&lt;/p&gt;

&lt;p&gt;Could there be a new model that’s not an LLM?
I have no clue.
But I highly doubt it.
I don’t currently see credible signs of a near-term paradigm shift that removes the spec/verification bottleneck rather than merely accelerating implementation.
This belief also aligns with Fred Brooks’s classic idea of “&lt;a href=&quot;https://en.wikipedia.org/wiki/No_Silver_Bullet&quot;&gt;No Silver Bullet&lt;/a&gt;”: there isn’t a single breakthrough that gives an order-of-magnitude improvement by itself, especially when the hard part is the essential complexity of specifying what you actually want.
Plus, LLMs have absorbed a lot of research resources.
There’s little diversity in research directions.
I doubt anything fundamentally different will emerge in the short term.&lt;/p&gt;

&lt;p&gt;I know many of you can’t wait to throw stones at me for saying a truth LLM companies don’t want to hear.
Hey, not so fast. 😅
As I said, I’ve found LLMs very useful in software engineering.
And they’ve boosted my productivity a lot.
There are patterns I find very effective.
And there are anti-patterns they handle poorly.
I’ll try to write those up in a separate article.
To be clear, what I do is not “vibecoding”, which is more about accepting (without really reviewing) whatever comes out of the model.
I prefer to call it AI-assisted programming.&lt;/p&gt;

&lt;p&gt;How AI-assisted programming will change the industry is yet another interesting topic.
Unlike some people who avoid making predictions, I love making predictions.
I’d love to make a statement, then revisit it a few years later to see if I was right or wrong.
Of course, I could be wrong.
But next time, I might make a better prediction.&lt;/p&gt;

&lt;p&gt;That’s it.
I hope you found this article helpful.
Feel free to leave feedback or questions below.
Thanks for taking the time to read.&lt;/p&gt;
</description>
        <pubDate>Thu, 19 Mar 2026 07:00:00 +0000</pubDate>
        <link>https://fangpenlin.com/posts/2026/03/19/no-llm-is-not-going-to-replace-software-engineers-heres-why/</link>
        <guid isPermaLink="true">https://fangpenlin.com/posts/2026/03/19/no-llm-is-not-going-to-replace-software-engineers-heres-why/</guid>
      </item>
    
      <item>
        <title>Manufacturing as Code is the Future, and the Future is Now</title>
        <description>&lt;style type=&quot;text/css&quot;&gt;
  figure {
    text-align: center;
    margin: 0 auto;
  }
  figcaption, figcaption p {
    color: grey;
    font-size: 0.9em;
  }
  figure img {
    max-width: 100%;
  }
&lt;/style&gt;

&lt;p&gt;Since I &lt;a href=&quot;/posts/2024/12/11/cading-and-3d-printing-like-a-software-engineer-part1/&quot;&gt;started my journey with 3D printing&lt;/a&gt;, I have built and shared dozens of 3D printable models with the public.
Naturally, &lt;a href=&quot;/posts/2025/11/26/tinyrack-a-3d-printable-modular-rack-for-mini-server/&quot;&gt;TinyRack is one of them&lt;/a&gt;.
You can find them on my MakerWorld profile &lt;a href=&quot;https://makerworld.com/en/@fangpenlin/upload&quot;&gt;here&lt;/a&gt; or on my Printables profile &lt;a href=&quot;https://www.printables.com/@FangPenLin_2668889&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2026-01-12-manufacturing-as-code-is-the-future/printable-and-makerworld-profiles.jpg&quot; alt=&quot;My Printable and MakerWorld profiles showing dozens of 3D printable models&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;My &lt;a href=&quot;https://www.printables.com/@FangPenLin_2668889&quot;&gt;Printables&lt;/a&gt; and &lt;a href=&quot;https://makerworld.com/en/@fangpenlin/upload&quot;&gt;MakerWorld&lt;/a&gt; profiles showing dozens of 3D printable models&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;So far, I have really enjoyed the process of designing and printing the models.
If there’s anything I’ve experienced that feels most like it came out of science fiction, it’s 3D printing technology.
When you realize that the physical form of objects can be defined by digital bits, it opens up unbounded possibilities for what we can do with the technology.&lt;/p&gt;

&lt;p&gt;The more I design and print, the more I realize that while the printing process takes time, it runs smoothly in the background.
But for design, it’s a whole different story.
More often than not, it takes a huge amount of effort and countless iterations to design even a simple snap-fit part.
I often get lost when working with different revisions of the same part with slight differences.
As printing technology matures, the bottleneck is no longer the printing; it’s the design.&lt;/p&gt;

&lt;p&gt;As a software engineer, I am very comfortable writing code to define the behavior of a system.
Setting up a CI/CD pipeline to automate the build and deployment process is also common practice.
When I work on my 3D printing projects, none of that exists.
Then I wondered: given that bits can now shape atoms, why not use the same approach to build software for the physical world?&lt;/p&gt;

&lt;p&gt;With that in mind, I spent the past few weeks building a prototype of a GitHub-like platform for manufacturing, called &lt;a href=&quot;https://makerrepo.com&quot;&gt;MakerRepo&lt;/a&gt;.
Today I am very excited to announce that the project is now online and has entered the beta testing phase for the public. 😄🎉&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2026-01-12-manufacturing-as-code-is-the-future/makerrepo-artifact-page.png&quot; alt=&quot;Screenshot of MakerRepo artifacts viewer featuring a 3D model of a part&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;Screenshot of MakerRepo artifacts viewer featuring a 3D model of a part&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;figure&gt;
  &lt;img src=&quot;/images/2026-01-12-manufacturing-as-code-is-the-future/makerrepo-code-screenshot.png&quot; alt=&quot;Screenshot of MakerRepo code viewer featuring a Python code file for generating a 3D model with Build123D&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;Screenshot of MakerRepo code viewer featuring a Python code file for generating a 3D model with Build123D with an &quot;artifact&quot; decorator&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;!--more--&gt;

&lt;h2 id=&quot;the-painful-parts-of-traditional-cad-design-process&quot;&gt;The painful parts of the traditional CAD design process&lt;/h2&gt;

&lt;p&gt;I have been using Fusion 360 since I started my journey with 3D printing.
It’s a powerful CAD software that can handle complex models; it can even simulate how the part performs under different conditions.
I probably only use about 5% of the features of the software, but it’s more than enough for my needs.
I really enjoy using it, so the problem is not the software itself; it’s the mindset upon which this kind of software has been built.
Before we dive into the future of manufacturing, let’s first understand the painful parts of the traditional CAD design process from a software engineer’s perspective.&lt;/p&gt;

&lt;h3 id=&quot;version-control&quot;&gt;Version control&lt;/h3&gt;

&lt;p&gt;The first painful part is version control.
There are version control systems built into CAD software, but they are either not very user-friendly or lack critical features.
I am used to comparing two versions of code in the code editor to find out what changed between them.
I am also used to artifacts like executable files carrying a version number, so I can easily tell which version I am holding right now.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2026-01-12-manufacturing-as-code-is-the-future/fork-screenshot.png&quot; alt=&quot;Screenshot of Fork GUI app showing the changes in a commit&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;Screenshot of &lt;a href=&quot;https://git-fork.com&quot;&gt;Fork&lt;/a&gt; GUI app showing the changes in a commit&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;But with real-world objects printed from different revisions of the same design, it’s very hard to track the changes between them.
It’s also hard for me to tell which one is which.
More often than not, a minor change like an adjusted snap-fit clearance is so small that you cannot tell the difference with the naked eye.
When trying out different revisions, I need to be very careful not to mix them up.
Of course, I could add a version number to the model, but it’s not very convenient because I would have to manually change it for each revision.
Eventually, I came to feel there’s a need for better version control for CAD models.&lt;/p&gt;
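
&lt;p&gt;As a sketch of the kind of versioning I want, here is a small Python helper that derives a version label from Git, suitable for embossing onto a printed part. It is a generic illustration, not part of any CAD package’s API; the fallback label is hypothetical.&lt;/p&gt;

```python
import subprocess

def model_version(default="dev"):
    """Best-effort version label to emboss onto a printed part.

    Uses `git describe` when run inside a Git repository, so every
    printed revision carries a traceable identifier; falls back to a
    default label otherwise.
    """
    try:
        result = subprocess.run(
            ["git", "describe", "--tags", "--always", "--dirty"],
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.strip() or default
    except (OSError, subprocess.CalledProcessError):
        return default
```

&lt;p&gt;With something like this wired into the model code, a revision bump in Git automatically shows up on the physical part, so mixing up printed revisions becomes much harder.&lt;/p&gt;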

&lt;h3 id=&quot;collaboration&quot;&gt;Collaboration&lt;/h3&gt;

&lt;p&gt;Another painful part is collaboration.
With traditional CAD software, it’s very hard to collaborate with others.
Maybe not in the manufacturing industry, but in the 3D printing open-source community, it’s very common to remix a design and share it with others.
To collaborate, first of all, you both need the same CAD software installed.
Even though CAD software like Fusion 360 provides collaboration features, they are useless if the other person doesn’t have the same software installed.&lt;/p&gt;

&lt;p&gt;For example, when I was designing my under-table cable management with &lt;a href=&quot;https://handsonkatie.com/underware-2-0-the-made-to-measure-collection/&quot;&gt;Underware 2.0&lt;/a&gt; and &lt;a href=&quot;https://www.opengrid.world&quot;&gt;openGrid&lt;/a&gt;, I realized that one of the lock snap components had an issue.
If you screw it in too tightly, it becomes very hard to screw back out.
That’s why I decided to remix the author’s design and cut a slot in the thread of the component to make it easier to unscrew.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2026-01-12-manufacturing-as-code-is-the-future/opengrid-multiconnect-lock-snap-unwinding-tool.jpg&quot; alt=&quot;Screenshot of openGrid Multiconnect Lock Snap Unwinding Tool I made by remixing the original design&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;Screenshot of &lt;a href=&quot;https://makerworld.com/en/models/1906036-opengrid-multiconnect-lock-snap-unwinding-tool#profileId-2043345&quot;&gt;openGrid Multiconnect Lock Snap Unwinding Tool&lt;/a&gt; I made by remixing the original design&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;I then &lt;a href=&quot;https://makerworld.com/en/models/1906036-opengrid-multiconnect-lock-snap-unwinding-tool#profileId-2043345&quot;&gt;uploaded the remix to MakerWorld&lt;/a&gt; and shared the link with the author in the comments.
Okay, cool, I shared an improvement to the design on the same platform, but what now?
I felt this was not the right way to collaborate, at least not the way we do it in the software engineering world.
Some people may find it, but unless the original author takes my improvement, reproduces it in the CAD software they prefer, and reuploads the model to MakerWorld, it’s not very helpful.
This is particularly painful because we are used to contributing code changes to open-source projects.
When people contribute to an open-source project, the author can review the change easily and hit the merge button to merge it into the main branch.
And there you go, you have a new model everybody can enjoy!
In the end, I realized that “open source” is not so open if there’s no easy way to collaborate.&lt;/p&gt;

&lt;h3 id=&quot;customizable-parts&quot;&gt;Customizable parts&lt;/h3&gt;

&lt;p&gt;Another painful part is creating a part with multiple variants based on different parameters.
With traditional parametric CAD software, you can surely define a part with parameters that drive its shape.
I did exactly this with TinyRack to make it customizable.
For example, with the same post, I can create variants with notches at different positions.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2026-01-12-manufacturing-as-code-is-the-future/post-variants.png&quot; alt=&quot;Screenshot of TinyRack post variants notched at different positions&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;Screenshot of TinyRack post variants notched at different positions&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;While this works for different permutations of the same part, I need to manually change the parameters for each variant in the CAD software, export the model, and upload it to the platform.
It’s not very efficient, and it doesn’t scale.
Platforms like MakerWorld provide a way to create a generator from a CAD file, which allows users to create variants with different parameters.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2026-01-12-manufacturing-as-code-is-the-future/makerworld-underware-channel-generator.jpg&quot; alt=&quot;Screenshot of MakerWorld Underware Channel Generator&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;Screenshot of &lt;a href=&quot;https://makerworld.com/en/models/1329404-underware-for-opengrid-customizer-beta?from=search#profileId-1367368&quot;&gt;MakerWorld Underware Channel Generator&lt;/a&gt;&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;But that’s proprietary software built by BambuLab.
If I want to do the same locally, I need to build my own generator that can read and understand the CAD file and generate the STL or 3MF files for each variant, which is not easy.&lt;/p&gt;
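
&lt;p&gt;The generator idea can be sketched in plain Python: enumerate every combination of parameters and derive one artifact name per variant. The parameter names below (notch position, rack height) are hypothetical stand-ins, not TinyRack’s actual parameters.&lt;/p&gt;

```python
from itertools import product

# Hypothetical parameter space for a customizable post.
PARAMS = {
    "notch_mm": [20, 40, 60],   # notch position from the bottom, in mm
    "height_u": [4, 6],         # rack height in units
}

def variant_names(params):
    """Yield a deterministic artifact file name for every parameter combination."""
    keys = sorted(params)
    for values in product(*(params[k] for k in keys)):
        suffix = "-".join(f"{k}{v}" for k, v in zip(keys, values))
        yield f"post-{suffix}.stl"

names = list(variant_names(PARAMS))  # 3 notch positions x 2 heights = 6 variants
```

&lt;p&gt;Because the names are derived deterministically from the parameters, every exported file is traceable back to the exact configuration that produced it.&lt;/p&gt;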

&lt;h3 id=&quot;lack-of-automation&quot;&gt;Lack of automation&lt;/h3&gt;

&lt;p&gt;Beyond the problems mentioned above, the final straw is the lack of automation.
As you can see, I have built and uploaded tons of models to both MakerWorld and Printables.
Each time, I need to manually change the parameters for each variant, export the model, and upload it to the platform.
And this is not a one-time task.
If I make any revision to the design, I need to redo it all manually, across multiple platforms.
As a software engineer, I am used to solving this kind of problem with automation.
For example, when I worked as an iOS engineer, I built a CI/CD pipeline to automate the build and deployment process.
That process even included taking screenshots of the app and uploading them to the App Store.
The only thing I needed to do was push a new tag to the repository, and the CI/CD pipeline took care of the rest.
While traditional CAD software provides an easy-to-use UI, it doesn’t provide an easy-to-use API to automate the process.
This is why I felt, you know what, I need to build my own platform to solve this problem.&lt;/p&gt;
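
&lt;p&gt;The automation I have in mind boils down to a small build script that a CI job can run on every push: regenerate every variant and write each artifact into an output directory. The export step below is a placeholder for a real CAD export call; the file names and contents are illustrative only.&lt;/p&gt;

```python
from pathlib import Path

def export_variant(notch_mm, out_dir):
    """Placeholder for a real CAD export; writes one artifact file per variant."""
    out_path = out_dir / f"post-notch{notch_mm}.stl"
    out_path.write_text(f"placeholder model, notch at {notch_mm} mm\n")
    return out_path

def build_all(out_dir, notches=(20, 40, 60)):
    """What a CI job would run on every push: rebuild all variants."""
    out_dir.mkdir(parents=True, exist_ok=True)
    return [export_variant(n, out_dir) for n in notches]
```

&lt;p&gt;Swap the placeholder for a real export call and point a CI runner at this script, and the manual change-export-upload loop disappears.&lt;/p&gt;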

&lt;h2 id=&quot;the-future-of-manufacturing---manufacturing-as-code&quot;&gt;The future of manufacturing - manufacturing as code&lt;/h2&gt;

&lt;p&gt;The traditional manufacturing industry is built upon the idea of mass production.
But 3D printing changes the game.
Now, instead of a one-size-fits-all product, you can design a product that is tailored to your needs.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2026-01-12-manufacturing-as-code-is-the-future/3d-printing-for-semi-customized.svg&quot; alt=&quot;A venn diagram showing the market segmentation of 3D printing fits perfectly for semi-customized products, and the market size could be expanding as now customers have more options to fit their needs even better&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A Venn diagram showing the market segmentation of 3D printing fits perfectly for semi-customized products, and the market size could be expanding as now customers have more options to fit their needs even better&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Customers used to be forced to tolerate one-size-fits-all designs that were not tailored to their needs.
With 3D printing technology, there’s no cost difference between a customized part and a mass-produced part.
Now they can have a design tailored to their needs.
If we use 3D printing only as a replacement for mass production, we are not really taking advantage of the technology.
I believe this market segment is going to expand, as customers now have more options to fit their needs even better.
Mass-produced and hand-crafted products will still exist and stay strong, but semi-customized products will be a new segment in the market.&lt;/p&gt;

&lt;p&gt;For example, with TinyRack, you may need a notch at a different position on the post depending on the size of the machine you are going to put in it.
With traditional mass production, you would probably design a post with as many notches as possible to fit all possible needs.
But then, of course, the model looks ugly.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2026-01-12-manufacturing-as-code-is-the-future/customized-vs-no-customization.png&quot; alt=&quot;Two TinyRacks with notch on post at the exact needed position vs a TinyRack with notches on the post for all possible positions&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;Two TinyRacks with notch on post at the exact needed position vs a TinyRack with notches on the post for all possible positions&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;With the ability to at least semi-customize the design, we can now have parts that are tailored to our needs, not the other way around.
This opens a new market for small-to-medium-size batch production with customized products.
With the concept of semi-customized products, the bottleneck is no longer the manufacturing; it’s the design.
Making a design from scratch is a very time-consuming process.
Therefore, starting from an existing design and changing the parameters to tailor towards your needs is a much faster way to get the job done.&lt;/p&gt;

&lt;p&gt;How do you change the parameters of a design to tailor a part to your needs without starting from scratch?
The answer is to use code to define the design of the part.
In fact, with traditional CAD software, you can already define a CAD model with parameters; that’s why it’s &lt;a href=&quot;https://en.wikipedia.org/wiki/Parametric_design&quot;&gt;called parametric CAD&lt;/a&gt;.
However, because it’s not a programming language, it’s very hard to understand the relationship between the parameters and the model.
It’s also very hard to track the changes to the model between revisions.
With code, it’s natural: software developers have been using version control to track code changes for decades.
More than that, with code and modern LLMs, there’s also a potential opportunity to design a part with prompts, or even automatically generate a design from scratch.&lt;/p&gt;
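
&lt;p&gt;To make the version-control point concrete, here is a minimal sketch of what “design as code” buys you: parameters live in plain, diffable source, so a design revision is an ordinary one-line code change. The parameter names here are illustrative, not TinyRack’s actual code.&lt;/p&gt;

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class PostParams:
    """Illustrative parameters driving a parametric post model."""
    height_mm: float = 120.0
    notch_mm: float = 40.0      # notch position from the bottom
    clearance_mm: float = 0.2   # snap-fit clearance

v1 = PostParams()
# Revision 2: loosen the snap fit slightly. In a Git diff, this change
# shows up as a single readable line, unlike an opaque binary CAD file.
v2 = replace(v1, clearance_mm=0.25)
```

&lt;p&gt;A diff between these two revisions is one line of text that any reviewer (or future you) can understand at a glance.&lt;/p&gt;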

&lt;h2 id=&quot;the-knowledge-of-manufacturing-could-be-lost&quot;&gt;The knowledge of manufacturing could be lost&lt;/h2&gt;

&lt;p&gt;The process of manufacturing an object is knowledge-intensive.
And over time, that knowledge can be lost if it’s not well documented.
For example, NASA built the Saturn V rocket for the Apollo program.
But after the program ended, much of the knowledge of building the rocket was lost.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2026-01-12-manufacturing-as-code-is-the-future/saturn-v-rocket.jpg&quot; alt=&quot;A photo of the Saturn V rocket&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A &lt;a href=&quot;https://en.wikipedia.org/wiki/File:Apollo_11_Launch_-_GPN-2000-000630.jpg&quot;&gt;photo of the Saturn V rocket by NASA in public domain&lt;/a&gt;&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;And now, it’s very hard to build the very same rocket again.
Of course, with modern technology, we can build a better rocket, as SpaceX did with Starship.
But the lesson here is that if we don’t document the manufacturing process, we will eventually lose the knowledge.&lt;/p&gt;

&lt;p&gt;Thanks to CAD software, we can now define the manufacturing process in the digital world and back up the design easily across the globe.
But what if the CAD software itself is lost?
While I love using Fusion 360, it’s proprietary software.
When I share a design online, if the other person doesn’t have Fusion 360, they cannot open it.
Autodesk provides a free license for personal use, but commercial use requires a paid license.
There are also other CAD packages like SolidWorks, Onshape, and others.
But eventually, if the software is not open-source, it will one day be lost along with the company.
We have the &lt;a href=&quot;https://www.youtube.com/watch?v=fzI9FNjXQ0o&quot;&gt;Arctic Code Vault&lt;/a&gt; for backing up software code, but what good does it do if all the knowledge of manufacturing is lost?
This is why I believe Manufacturing as Code is also crucial for preserving the knowledge of manufacturing.&lt;/p&gt;

&lt;h2 id=&quot;build123d---build-cad-models-with-python-code&quot;&gt;Build123D - build CAD models with Python code&lt;/h2&gt;

&lt;p&gt;In the past, when I shared the first article about CADing and 3D printing like a software engineer, some readers mentioned the &lt;a href=&quot;https://build123d.readthedocs.io/en/latest/index.html&quot;&gt;Build123D&lt;/a&gt; software.
It’s free and open-source CAD software that lets you design 3D models in Python code.
Here’s an example from the documentation where you build a tea cup with Build123D:&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2026-01-12-manufacturing-as-code-is-the-future/build123d-tea-cup.png&quot; alt=&quot;A screenshot the Build123D code for making a tea cup and the resulting 3D model&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A screenshot of the Build123D code for making a tea cup and the resulting 3D model&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;When I first saw it, I thought it was really cool.
I had been thinking about using it, but didn’t have the time to try it out.
Recently, I finally finished the v1 design of TinyRack and got the time to give it a shot.
And I ended up really liking it.
At the very beginning, I missed the friendly UI of traditional CAD software, where I can click and select the objects I want to work on.
But as I got better at it, the time I spent building the same model kept getting shorter.
At some point, in some cases, I could even build a model faster than I could by clicking and selecting objects in traditional CAD software.&lt;/p&gt;

&lt;h2 id=&quot;makerrepo---the-platform-for-manufacturing-as-code&quot;&gt;MakerRepo - the platform for manufacturing as code&lt;/h2&gt;

&lt;p&gt;After I got more familiar with Build123D, I started to envision the future of manufacturing with code.
I concluded that the platform should be a GitHub-like website, but for manufacturing.
Building a platform like GitHub is definitely not easy, especially if you need to host large-scale repositories, deal with scalability and durability, and, more importantly, handle security (because you need to run user-uploaded code).&lt;/p&gt;

&lt;p&gt;It may take me years to build a platform like GitHub, but fortunately, I have seen this movie before.
If you read some of my &lt;a href=&quot;/posts/2024/12/30/my-beancount-books-are-95-percent-automatic/&quot;&gt;previous articles&lt;/a&gt;, you will know that I have built a similar platform before.
Yes, that is &lt;a href=&quot;https://beanhub.io&quot;&gt;BeanHub&lt;/a&gt;, a Git repository hosting service for Beancount accounting books.
If you are interested in the technology behind BeanHub, you can read my articles &lt;a href=&quot;https://beanhub.io/blog/2024/04/23/how-beanhub-works-part1-sandboxing/&quot;&gt;How BeanHub works, part 1: containing the danger of processing Beancount data with a sandbox&lt;/a&gt; and &lt;a href=&quot;https://beanhub.io/blog/2024/06/26/how-beanhub-works-part2-layer-based-git-repos/&quot;&gt;How BeanHub works, part 2: a large-scale auditable Git repository system based on container layers&lt;/a&gt;.
tl;dr: I used container sandboxing plus overlayfs to build a large-scale auditable Git repository system.
MakerRepo is built upon the same technology, but for manufacturing as code instead of accounting books.&lt;/p&gt;

&lt;p&gt;When I built BeanHub, I pushed really hard to open-source as much as possible.
You can see the list of our open-source projects &lt;a href=&quot;https://beanhub.io/open-source/&quot;&gt;here&lt;/a&gt;.
Because I believe that if one day BeanHub is no longer in business, people should still be able to use the open-source projects to continue their existing workflow.
Ideally, you should be able to do everything locally with the open-source tools on your own computer, just like what’s provided by the platform.
MakerRepo just makes it much easier to host your code and artifacts online, share them with others, and collaborate.
This is why I open-sourced &lt;a href=&quot;https://github.com/LaunchPlatform/MakerRepo&quot;&gt;MakerRepo&lt;/a&gt;.
It’s a library and CLI tool written in Python to help you build your models with manufacturing as code.
I don’t want to change how people write their Build123D code too much, so I made it very easy to integrate existing Build123D code with MakerRepo.
One major feature provided by MakerRepo is the ability to collect the artifacts of your model and view them in a web interface via the &lt;a href=&quot;https://github.com/bernhard-42/vscode-ocp-cad-viewer/tree/main&quot;&gt;OCP viewer&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To do so, you only need to install the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MakerRepo&lt;/code&gt; library and add an “artifact” decorator to a function that makes the model.
To install it, you can run:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pip &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;makerrepo
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then in your code for generating the model, you can add an “artifact” decorator to the function that makes the model.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;build123d&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;mr&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;artifact&lt;/span&gt;

&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;artifact&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;make_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Build&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;build&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;Box&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;build&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And that’s it!
After you commit the code and push it to your Git repository on MakerRepo, it will automatically run a CI job to build the model and collect the artifacts.
Then you can view your beautiful artifacts in the web interface via the embedded OCP viewer.&lt;/p&gt;
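
&lt;p&gt;Conceptually, such an “artifact” decorator can be as small as a function registry that a build runner later iterates over to build and collect models. This is a generic sketch of the pattern, not MakerRepo’s actual implementation.&lt;/p&gt;

```python
# Registry of model-building functions, populated at import time.
ARTIFACTS = []

def artifact(func):
    """Register a model-building function so a build runner can discover it."""
    ARTIFACTS.append(func)
    return func

@artifact
def make_model():
    # Stand-in for a Build123D builder returning a part.
    return "box-10x10x10"

# What a CI job would do: call every registered builder and collect the results.
built = [build() for build in ARTIFACTS]
```

&lt;p&gt;Importing the user’s module populates the registry, and the runner never needs to know the function names in advance; that is what makes the single-decorator integration possible.&lt;/p&gt;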

&lt;h2 id=&quot;whats-next&quot;&gt;What’s next?&lt;/h2&gt;

&lt;p&gt;This is just the beginning.
MakerRepo is still in the early stages of development.
For now, the CI build environment is still very limited: you can only build models with Build123D, with no other libraries.
Soon I will add new features like custom CI pipelines and custom build environments.
The possibilities are endless. For example, you could have a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;develop&lt;/code&gt; branch that builds artifacts with version information embossed onto the model, so it’s much easier to track which version is which in the real world.
With the CI/CD concept, maybe you could even have a branch like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;production&lt;/code&gt; that automatically prints the model into a physical object.&lt;/p&gt;

&lt;p&gt;I will also add fork and pull request features to make collaboration with others much easier.
In the near future, you will be able to fork a repository you like, make changes to it, and then submit a pull request to the original repository.
And the author will be able to review the changes, see a visualized diff, provide feedback on the model and code, and easily decide whether to merge.&lt;/p&gt;

&lt;p&gt;Currently, there’s no paid plan yet, but I will add one in the near future.
I aim to follow a pricing model that’s similar to GitHub. For open source projects, you should enjoy most of the features for free.
With a paid plan, you can host private repositories and enjoy more features and maybe more CI build time quota.
For those not familiar with my pricing approach, here’s what I usually do:
the price starts low, and as new features are added, the price is raised.
Early users enjoy the lower price while more new features are added later on.
Stay tuned for the pricing model announcement.&lt;/p&gt;

&lt;p&gt;Finally, while this is just yet another software product I built, it means something more to me.
Some people laughed at the idea of bringing manufacturing jobs back to the US.
As Tim Cook said, the problem is not just the cost, but also the talent and knowledge of the workers who know how to manufacture.
Unfortunately, the knowledge of manufacturing is dying with the older generation of workers, not just in the US but in the rest of the world as well.
I tried to think about how I could help from the perspective of a software engineer, and this is what I came up with.
The COVID supply chain disruptions taught us that we need to be more self-reliant and not depend so heavily on other countries’ supply chains.
To bring manufacturing jobs back to the US, we should not try to replicate the old workflow of the manufacturing industry; instead, we should copy the software industry.
We need to think smart and make the design process more efficient and accessible.
I hope MakerRepo can be a small step towards that goal.
Please feel free to try it out and let me know if you have any feedback or suggestions. 😄👍&lt;/p&gt;
</description>
        <pubDate>Mon, 12 Jan 2026 07:00:00 +0000</pubDate>
        <link>https://fangpenlin.com/posts/2026/01/12/manufacturing-as-code-is-the-future/</link>
        <guid isPermaLink="true">https://fangpenlin.com/posts/2026/01/12/manufacturing-as-code-is-the-future/</guid>
      </item>
    
      <item>
        <title>TinyRack - A 3D printable modular rack for mini servers</title>
        <description>&lt;style type=&quot;text/css&quot;&gt;
  figure {
    text-align: center;
    margin: 0 auto;
  }
  figcaption, figcaption p {
    color: grey;
    font-size: 0.9em;
  }
  figure img {
    max-width: 100%;
  }
&lt;/style&gt;

&lt;blockquote&gt;
  &lt;p&gt;Update:
This is the second article in the CADing and 3D printing series.
You can read other articles here:&lt;/p&gt;
  &lt;ul&gt;
    &lt;li&gt;&lt;a href=&quot;/posts/2024/12/11/cading-and-3d-printing-like-a-software-engineer-part1/&quot;&gt;Article01 - CADing and 3D printing like a software engineer, part 1 - baby step with an overengineered webcam raiser&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;/posts/2026/01/12/manufacturing-as-code-is-the-future/&quot;&gt;Article03 - Manufacturing as Code is the Future, and the Future is Now&lt;/a&gt;&lt;/li&gt;
  &lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;In my &lt;a href=&quot;/posts/2024/12/11/cading-and-3d-printing-like-a-software-engineer-part1/&quot;&gt;previous article&lt;/a&gt;, I’ve shared my journey of 3D printing and learning CAD from the perspective of a software engineer.
As mentioned in the article, I really wanted to build a server rack for my bare metal Kubernetes cluster as seen in &lt;a href=&quot;/posts/2024/01/14/high-speed-usb4-mesh-network/&quot;&gt;this article&lt;/a&gt;.
Recently, I finally got some time to actually print some projects I have designed so far.
Today, I am excited to introduce what I’ve built - TinyRack, a modular rack for mini servers!&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-11-26-tinyrack-a-3d-printable-modular-rack-for-mini-server/tinygrad-with-mini-pc-cluster.jpg&quot; alt=&quot;TinyRack with my mini PC cluster&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;TinyRack with my mini PC cluster&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;I imagine many people would enjoy a server rack designed specifically for mini servers, given how popular homelabs have become in recent years, so I share my models under an open license.
You can download all the models from &lt;a href=&quot;https://tinyrack.io&quot;&gt;TinyRack.io&lt;/a&gt; and print them yourself, or you can also purchase them on the website.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;why-not-just-use-1u-rack&quot;&gt;Why not just use 1U rack?&lt;/h2&gt;

&lt;p&gt;Look at people’s homelabs and you’ll see that many of them use 1U server racks as their main rack.
Certainly, there are many benefits to using the standard U server rack.
First of all, most data centers are built around the &lt;a href=&quot;https://en.wikipedia.org/wiki/19-inch_rack&quot;&gt;19 inch rack standard&lt;/a&gt;.
There are countless pieces of gear out there that are 19 inch rack compatible.
So the first obvious question you may ask is: why not just use the U rack?&lt;/p&gt;

&lt;p&gt;Well, while the U server rack is very popular, its form factor is simply too big for this purpose.
Thanks to advances in chip manufacturing, CPUs and GPUs keep getting more powerful while consuming less power.
Homelab machines and mini-server-based clusters are usually far smaller than the 1U form factor.
As a result, you often need to buy an adapter to fit them into the U size.
But such equipment is mostly geared toward enterprises, and it is expensive.
Take the Mac Mini, for example: if you want to mount Mac Minis in a U rack, you can buy &lt;a href=&quot;https://www.mk1manufacturing.com/Mac-Mini-Rack-Mounts/Sliding-Rackmount-shelf-2024-Mac-Mini-p104.html&quot;&gt;some of those&lt;/a&gt;:&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-11-26-tinyrack-a-3d-printable-modular-rack-for-mini-server/mk1-mac-mini-4-rack.png&quot; alt=&quot;A Mac Mini rack mount for 1U server rack from MK1 Manufacturing&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A Mac Mini rack mount for 1U server rack from &lt;a href=&quot;https://www.mk1manufacturing.com/Mac-Mini-Rack-Mounts/Sliding-Rackmount-shelf-2024-Mac-Mini-p104.html&quot;&gt;MK1 Manufacturing&lt;/a&gt;&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Certainly they could work, but they are also very expensive.
The mount alone sets you back $459.90 USD, while the most basic Mac Mini itself costs $599.00 USD.
There are also cheaper options, like &lt;a href=&quot;https://www.amazon.com/gp/product/B0DZNZL87P/ref=ppx_yo_dt_b_asin_title_o00_s00?ie=UTF8&amp;amp;psc=1&quot;&gt;this one&lt;/a&gt;, that are not as good.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-11-26-tinyrack-a-3d-printable-modular-rack-for-mini-server/amazone-mac-mini-m4-rack.jpg&quot; alt=&quot;A Mac Mini rack mount for 1U server rack from Amazon&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A Mac Mini rack mount for 1U server rack from Amazon&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;As the second picture makes more obvious, the load goes through the front panel only, while the Mac Minis themselves are not held by anything else.
That is still an okay situation as long as the device is not too heavy and the shelf is not so deep that the torque gets too high.
But more often than not, I have seen people attach one side of a tiny device to the U rack and leave the other side hanging in the air, because the device is simply far smaller than the rack.
And since the U rack design takes its load through the rack ears, the material required is usually thicker and heavier.
So, as you can see, the U server rack is not really designed for mini servers.&lt;/p&gt;

&lt;p&gt;Yet another thing to consider is the benefit of living in the 3D printing era, where everything is customizable.
Why force ourselves into form factors that don’t suit our hardware at all?
With all of the above in mind, I decided it was time to design my own server rack.&lt;/p&gt;

&lt;h2 id=&quot;project-mini-rack&quot;&gt;Project MINI RACK?&lt;/h2&gt;

&lt;p&gt;Fun fact: the name I originally had in mind for this project was Mini Rack, until I saw the &lt;a href=&quot;https://www.youtube.com/c/JeffGeerling&quot;&gt;YouTuber Jeff Geerling&lt;/a&gt; introduce his &lt;a href=&quot;https://www.youtube.com/watch?v=y1GCIwLm3is&quot;&gt;Project Mini Rack&lt;/a&gt;.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-11-26-tinyrack-a-3d-printable-modular-rack-for-mini-server/jeff-geerling-mini-racks.jpg&quot; alt=&quot;Picture from Jeff Geerling&apos;s Mini Racks&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;Picture from Jeff Geerling&apos;s Mini Racks &lt;a href=&quot;https://mini-rack.jeffgeerling.com&quot;&gt;mini-rack.jeffgeerling.com&lt;/a&gt;&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;After that, I decided to rename the project TinyRack to avoid confusion.
Like TinyRack, Project Mini Rack targets mini form factor servers.
I really like the concept: it’s super cool that the whole rack is portable without going offline, if you build a UPS into it.
But it’s based on the 1U standard, and the rack you can buy from Amazon costs $129.99 as of this writing.
The heavy-duty metal rack is a bit overkill for my needs.
For now, I only want a modular server rack that can host my cluster and potentially be extended with different modules.
That’s why, despite having seen a project like that, I still decided to keep building my own.
There’s some overlap in the target audience, but overall, TinyRack targets more lightweight users who prefer a rack with a smaller footprint and a more flexible design.
And more importantly, it’s all 3D printable!&lt;/p&gt;

&lt;h2 id=&quot;the-design&quot;&gt;The design&lt;/h2&gt;

&lt;p&gt;When designing the rack, I wanted it to carry as heavy a load as possible, because while mini servers are not as heavy as a full 1U server, they aren’t lightweight either, and there can be many devices stacked on a rack.
I had purchased and assembled some wire racks a while back, and the way they work inspired me:
rods take the vertical load, plastic clips with a ramp clamp onto notches in the rods, and the wire shelves mount on top of the clips.
Surprisingly, this simple structure can take a ton of load.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-11-26-tinyrack-a-3d-printable-modular-rack-for-mini-server/wire-rack.jpg&quot; alt=&quot;A Wire Rack from Amazon&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A &lt;a href=&quot;https://www.amazon.com/Amazon-Basics-Adjustable-Shelving-Organizer/dp/B01LYBQXRH&quot;&gt;Wire Rack&lt;/a&gt; from Amazon&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;I wanted to design a similar structure: a strong post with notches cut into it at regular intervals, and a clip with a ramp that hooks into a notch.
The platform then has holes that sit on top of the clips, and thanks to the clip’s ramp, the platform wedges in more firmly as load is applied.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-11-26-tinyrack-a-3d-printable-modular-rack-for-mini-server/design.png&quot; alt=&quot;A Wire Rack structure&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;The design of the TinyRack load bearing structure, the platform pushes against the clip, and the clip pushes against the post&apos;s notch&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;A clip looks like this:&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-11-26-tinyrack-a-3d-printable-modular-rack-for-mini-server/clip-cad.png&quot; alt=&quot;The design of a clip in the CAD model&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;The design of a clip in the CAD model&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;And here’s what the whole post looks like:&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-11-26-tinyrack-a-3d-printable-modular-rack-for-mini-server/notched-post-cad.png&quot; alt=&quot;The design of a post with a notch for holding the platform in the CAD model&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;The design of a post with a notch for holding the platform in the CAD model&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;And the platform looks like this:&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-11-26-tinyrack-a-3d-printable-modular-rack-for-mini-server/platform.png&quot; alt=&quot;The design of a platform in the CAD model&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;The design of a platform in the CAD model&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;I also wanted a rubber leveling foot on each post to absorb vibration, so I added a thread at the bottom.
The thread size is set to 5/16” (8mm) so that you can buy &lt;a href=&quot;https://www.amazon.com/dp/B0DFW23LNC&quot;&gt;rubber feet from Amazon&lt;/a&gt; and install them by screwing them into the thread.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-11-26-tinyrack-a-3d-printable-modular-rack-for-mini-server/rubber-leveler.jpg&quot; alt=&quot;A rubber leveling foot from Amazon&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A &lt;a href=&quot;https://www.amazon.com/dp/B0DFW23LNC&quot;&gt;rubber leveling foot&lt;/a&gt; from Amazon&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;And because a single post has limited height, I also added a screw hole on top of each post so that you can join multiple posts together.
The whole assembly looks like this in the CAD model:&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-11-26-tinyrack-a-3d-printable-modular-rack-for-mini-server/assembly.png&quot; alt=&quot;The whole assembly of the TinyRack posts in the CAD model&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;The whole assembly of the TinyRack in the CAD model&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;the-cherry-on-the-top-the-jetkvm-mount&quot;&gt;The cherry on the top, the JetKVM mount&lt;/h2&gt;

&lt;p&gt;Besides the three mini PCs in my bare metal Kubernetes cluster, I also found myself in need of a remote KVM system.
A while back, I found a new affordable KVM system called &lt;a href=&quot;https://www.jetkvm.com&quot;&gt;JetKVM&lt;/a&gt; that pretty much meets my requirements.
It’s very tiny, yet provides most of the features I need.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-11-26-tinyrack-a-3d-printable-modular-rack-for-mini-server/jetkvm.png&quot; alt=&quot;JetKVM is a tiny KVM system that provides remote access to your PCs through a web interface&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;JetKVM is a tiny KVM system that provides remote access to your PCs through a web interface&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;It even provides an external module for controlling the power that goes into the mini PCs, so that in case the server crashes and fails to restart, you can force it to reboot by cutting off the power and turning it on again.
While it’s really nice that the JetKVM provides remote access to my mini PCs, it also brings more headaches to cable management.&lt;/p&gt;

&lt;p&gt;Thanks to 3D printing and the CAD skills I’ve picked up along the way, I soon designed a mount that sits on top of each mini PC and holds the JetKVM, the DC module, and the cables in place.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-11-26-tinyrack-a-3d-printable-modular-rack-for-mini-server/jetkvm-mount-base-cad.png&quot; alt=&quot;The design of a JetKVM mount base in the CAD model&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;The design of a JetKVM mount in the CAD model&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-11-26-tinyrack-a-3d-printable-modular-rack-for-mini-server/jetkvm-mount-with-lid-cad.png&quot; alt=&quot;The design of a JetKVM mount with a lid in the CAD model&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;The design of a JetKVM mount with a lid in the CAD model&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-11-26-tinyrack-a-3d-printable-modular-rack-for-mini-server/jetkvm-mount-base.jpg&quot; alt=&quot;TinyRack JetKVM mount base in real world&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;TinyRack JetKVM mount base in real world&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-11-26-tinyrack-a-3d-printable-modular-rack-for-mini-server/jetkvm-mount-with-lid.jpg&quot; alt=&quot;TinyRack JetKVM mount with lid in real world&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;TinyRack JetKVM mount with lid in real world&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;It looks so clean now that it almost feels like the whole thing was designed to work that way as a single device 😍
You can also download my JetKVM mount from TinyRack.io.&lt;/p&gt;

&lt;h2 id=&quot;opengrid-integration&quot;&gt;openGrid integration&lt;/h2&gt;

&lt;p&gt;I used to have pretty good cable management under my desk.
But over time, as more and more devices were added, it became messy again.
With a 3D printer, I started looking into cable management solutions, and I found project &lt;a href=&quot;https://handsonkatie.com/underware-2-0-the-made-to-measure-collection/&quot;&gt;Underware&lt;/a&gt;.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-11-26-tinyrack-a-3d-printable-modular-rack-for-mini-server/underware.webp&quot; alt=&quot;Underware is a 3D printed grid based modular system for under desk cable management by Katie&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;Underware is a 3D printed grid based modular system for under desk cable management by &lt;a href=&quot;https://handsonkatie.com&quot;&gt;Katie&lt;/a&gt;&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;It’s a grid-based modular system that lets different modules attach to the grid.
This is exactly what I wanted!
In the past, my under-desk cable routing was designed once, for whatever devices existed at the time.
When new things were added, the old design hadn’t anticipated them, so it became a mess real quick.
And it’s really hard to change once the design is settled.
With a grid-based modular system, I can change the routing any time I want without peeling off the old double-sided tape and applying new tape again.&lt;/p&gt;

&lt;p&gt;Underware 1.0 is based on &lt;a href=&quot;https://www.multiboard.io&quot;&gt;Multiboard&lt;/a&gt;, and the Multiboard license is a bit odd:
rights to any creation derived from Multiboard are granted to Multiboard’s author, who can use them freely as they see fit.
I guess that could be why the author of Underware switched to &lt;a href=&quot;https://www.opengrid.world&quot;&gt;openGrid&lt;/a&gt; in Underware 2.0.
It’s a similar grid-based modular system, but with a &lt;a href=&quot;https://creativecommons.org/licenses/by/4.0/&quot;&gt;CC-BY license&lt;/a&gt;.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-11-26-tinyrack-a-3d-printable-modular-rack-for-mini-server/opengrid-grid.webp&quot; alt=&quot;An openGrid grid from Katie&apos;s Underware 2.0 website&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;An openGrid grid from &lt;a href=&quot;https://handsonkatie.com/underware-2-0-the-made-to-measure-collection/&quot;&gt;Katie&apos;s Underware 2.0 website&lt;/a&gt;&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;After I built some cable management with Underware 2.0, I really liked the ecosystem behind it.
I looked at my TinyRack system and wondered, why not make it possible to integrate TinyRack with openGrid?
That way, we can have a modular mounting system that can grow vertically and potentially horizontally.
With that in mind, I’ve designed an adapter to mount an openGrid panel to TinyRack posts.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-11-26-tinyrack-a-3d-printable-modular-rack-for-mini-server/tinyrack-opengrid-adapter-cad.png&quot; alt=&quot;The design of a TinyRack adapter for openGrid in the CAD model&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;The design of a TinyRack adapter for openGrid in the CAD model&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-11-26-tinyrack-a-3d-printable-modular-rack-for-mini-server/tinyrack-opengrid.jpg&quot; alt=&quot;TinyRack with openGrid on the third layer with some Underware 2.0 modules attached to it&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;TinyRack with openGrid on the third layer with some Underware 2.0 modules attached to it&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Now, with TinyRack and the openGrid adapters, things get very interesting.
You can build different kinds of layers: some for cable management, some for holding devices.
I plan to introduce more modules, like ethernet patch panels.&lt;/p&gt;

&lt;p&gt;Please note that while openGrid panels on TinyRack adapters can carry some load, they are not designed for heavy loads the way the TinyRack platform is.
I haven’t yet tested how far the grid can be pushed under load.&lt;/p&gt;

&lt;h2 id=&quot;more-to-come&quot;&gt;More to come&lt;/h2&gt;

&lt;p&gt;That’s it!
This has been a super fun 3D printing project.
I have learned so many CAD tricks and so much 3D printing knowledge while building all of this.
In fact, I have built far more than what’s described in this article.
I will find time to share more of it later.&lt;/p&gt;

&lt;p&gt;In the meantime, other than building it for my own servers, I want to see if I can run it as a side business.
If you don’t have a 3D printer, remember, you can also purchase them on &lt;a href=&quot;https://tinyrack.io&quot;&gt;TinyRack.io&lt;/a&gt;.
Also, at this moment, all the models come with fixed parameters, but in the future, I may build online tools to make customizing much easier.
Thanks for reading, stay tuned for more!&lt;/p&gt;
</description>
        <pubDate>Wed, 26 Nov 2025 07:00:00 +0000</pubDate>
        <link>https://fangpenlin.com/posts/2025/11/26/tinyrack-a-3d-printable-modular-rack-for-mini-server/</link>
        <guid isPermaLink="true">https://fangpenlin.com/posts/2025/11/26/tinyrack-a-3d-printable-modular-rack-for-mini-server/</guid>
      </item>
    
      <item>
        <title>Continual learning with the Marketplace algorithm: model learns new data through inference, not training</title>
        <description>&lt;style type=&quot;text/css&quot;&gt;
  figure {
    text-align: center;
    margin: 0 auto;
  }
  figcaption, figcaption p {
    color: grey;
    font-size: 0.9em;
  }
  figure img {
    max-width: 100%;
  }
&lt;/style&gt;

&lt;p&gt;Have you ever wondered why machines need a dedicated training process while humans can learn from experience?
I wondered the same thing for a long time.
Today, I’d like to introduce continual learning with the Marketplace algorithm, which demonstrates the possibility of machines learning new things by simply doing!&lt;/p&gt;

&lt;p&gt;This is the third article in the Marketplace algorithm series.
Please read the &lt;a href=&quot;/posts/2025/08/18/marketplace-my-first-attempt-at-training-without-backprop-on-gpu-efficiently/&quot;&gt;first article&lt;/a&gt; and &lt;a href=&quot;/posts/2025/09/02/marketplace-v2-is-all-you-need-a-training-algorithm-on-par-with-backprop/&quot;&gt;second article&lt;/a&gt; for details on the Marketplace algorithm.
Last week, I published the second article, which discusses using all the probes to compose the best parameter delta. It was a lot of fun! 😄&lt;/p&gt;

&lt;p&gt;However, merely replicating a normal training process is not the most exciting application of the Marketplace algorithm.
The previous articles were just appetizers; the main course is here.
The most intriguing application of the Marketplace algorithm is continual learning.
I had this idea almost immediately after developing the Marketplace algorithm.
After running through the concept in my mind, I believed it was feasible.
So, I spent a few days implementing it, and it worked!
It still has a long way to go, but it already shows great potential.&lt;/p&gt;

&lt;p&gt;The experiment’s design is straightforward.
First, I trained the &lt;a href=&quot;https://github.com/tinygrad/tinygrad/blob/9182948951e9c8a1d6432ce983b4f497badec862/examples/beautiful_mnist.py&quot;&gt;beautiful MNIST model from Tinygrad&lt;/a&gt; using the Marketplace V2 algorithm and the digits dataset for 2,000 steps, achieving 96% accuracy on the validation dataset.
Next, I took the trained model, simulated the inference process, and added class 3 (dress) from the Fashion MNIST dataset, mixing these images with the digits dataset to allow the model to classify them.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-09-continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/mnist-digits-plus-fashion-3.png&quot; alt=&quot;A diagram showing the original MNIST digits dataset and the new dataset, which combines the MNIST digits dataset with class 3 (dress) from the Fashion MNIST dataset.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A diagram showing the original MNIST digits dataset and the new dataset, which combines the MNIST digits dataset with class 3 (dress) from the Fashion MNIST dataset.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;I applied the Marketplace algorithm to enable the model to continually learn the new dress images gradually with each step.
The goal was to determine whether the model could learn the new dress images primarily through inference, without dedicated training, while still classifying digits correctly most of the time to provide business value.
Here’s the result:&lt;/p&gt;

&lt;div class=&quot;center mb2&quot;&gt;
    &lt;video style=&quot;width: 650px; margin: 0 auto&quot; crossorigin=&quot;anonymous&quot; width=&quot;650&quot; height=&quot;650&quot; playsinline=&quot;&quot; src=&quot;/images/2025-09-09-continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/continual-learning-fashion-replay.mp4&quot; muted=&quot;&quot; controls=&quot;&quot;&gt;&lt;/video&gt;
    &lt;div class=&quot;small text-center&quot;&gt;
      Video showing the model gradually learning the new dress images while still classifying digits correctly most of the time. All these steps involve only inference, with no dedicated training process!
    &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;As shown, the model gradually learns the new dress images over several steps while maintaining its ability to classify digits correctly most of the time.
These steps involve only inference, with no dedicated training process!&lt;/p&gt;

&lt;p&gt;The implications of this technology are tremendous.
I believe the future of machine learning lies in learning rather than training.
Companies that master this approach in production will gain a significant advantage because their models improve as more people use them, quickly and without much additional cost.
As the model improves, it attracts more users, creating a flywheel effect: the more it’s used, the better it becomes.
Best of all, this approach requires almost no additional computational cost for training.
Of course, this is just a proof of concept, and there are still many improvements to make and challenges to overcome.
Nevertheless, I’m thrilled about the possibilities of continual learning.
Today, I’m excited to share my first take on continual learning with the Marketplace algorithm.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;context-for-new-readers&quot;&gt;Context for New Readers&lt;/h2&gt;

&lt;p&gt;This article continues my journey to tackle the challenge of eliminating backpropagation using first principles.
As this series of articles is now reaching a wider audience, some new readers may have missed the context of this article, so I’d like to provide it here.
I aim to explore how far I can push this idea without referring to academic papers. I haven’t read any papers specifically on this topic.
Many people on X have recommended academic papers to me, and I truly appreciate their suggestions.
Unfortunately, I don’t have time to read all the papers thoroughly.
If I had to read every paper before starting my research, I’d see you again in five years! 😂&lt;/p&gt;

&lt;p&gt;For this reason, my definition of continual learning may differ from others’.
Here’s my definition of continual learning:
It means that a model can learn new concepts primarily through inference alone.
The model should adapt to new data gradually.
Others may have explored similar ideas, but I’m unaware of their work.
As this is independent research, please take it with a grain of salt.&lt;/p&gt;

&lt;h2 id=&quot;how-it-works&quot;&gt;How It Works&lt;/h2&gt;

&lt;p&gt;Previously, when using the Marketplace algorithm to train a model, we fed the same batch of data through different combinations of vendor deltas applied to the model parameters.
We calculated the loss of the final output, attributed it to the parameter deltas that contributed to it, and then composed the reconciled delta as the direction in which to update the model parameters.
Consequently, the same data passed through the model via all possible paths.&lt;/p&gt;
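
&lt;p&gt;To make the “all possible paths” idea concrete, here’s a minimal sketch (the function name and the shape tuple are my own illustration, not code from the actual implementation):&lt;/p&gt;

```python
import itertools

# In training, every batch is evaluated against every possible path:
# one vendor delta choice per layer group, enumerated exhaustively.
def all_paths(marketplace_shape):
    ranges = [range(n) for n in marketplace_shape]
    return list(itertools.product(*ranges))

# A marketplace of shape (4, 4, 4) has 4 * 4 * 4 = 64 possible paths.
paths = all_paths((4, 4, 4))
```

&lt;p&gt;With a 4x4x4 marketplace, that enumeration yields 64 paths, matching the probe count used in the experiment below.&lt;/p&gt;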

&lt;p&gt;For inference, this approach is impractical due to its high computational cost.
The advantage of letting the model learn new information through inference is that inference operates at a much larger scale:
a large volume of input data passes through the model along various paths.
With that in mind, instead of sending data through all possible paths, we select just one random path for each inference.&lt;/p&gt;
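
&lt;p&gt;A rough sketch of that sampling step (again, the names are mine; the real implementation may differ):&lt;/p&gt;

```python
import random

# Instead of enumerating every path, each inference samples a single
# random path: one vendor index per layer group.
def sample_path(marketplace_shape, rng=random):
    return tuple(rng.randrange(n) for n in marketplace_shape)

# e.g. a marketplace of shape (4, 4, 4) might yield the path (2, 0, 3)
path = sample_path((4, 4, 4))
```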

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-09-continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/pass-through-random-path.svg&quot; alt=&quot;A diagram showing the inference process using a single random forward pass instead of all possible paths.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A diagram showing the inference process using a single random forward pass instead of all possible paths.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;After passing data through the model, we retain the chosen random path.
If the label is known, we keep the loss; otherwise, we store the logits and input data for later labeling.&lt;/p&gt;
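
&lt;p&gt;The bookkeeping for each inference can be sketched like this (a plain dict, with field names of my own choosing):&lt;/p&gt;

```python
# Bookkeeping kept for each inference, as a plain dict for brevity.
def record_inference(path, loss=None, logits=None, inputs=None):
    if loss is not None:
        # The label was known at inference time, so we keep the loss.
        return {'path': path, 'loss': loss}
    # Unknown label: keep the logits and input data so the example can
    # be labeled later, and the loss computed from the stored logits.
    return {'path': path, 'logits': logits, 'inputs': inputs}
```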

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-09-continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/data-to-keep.png&quot; srcset=&quot;/images/2025-09-09-continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/data-to-keep.png 1x, /images/2025-09-09-continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/data-to-keep@2x.png 2x&quot; alt=&quot;A diagram showing that we retain the chosen random path and either the logits or loss, depending on whether the input data label is known.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A diagram showing that we retain the chosen random path and either the logits or loss, depending on whether the input data label is known.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;After collecting a sufficient number of paths along with their logits or loss, we can generate the reconciled delta, as described in the V2 article.
Before doing so, we may want to filter out unrepresentative data.
In this step, we can select which final outputs participate in the reconciliation process (you may need to keep some metadata to decide which ones to keep), giving us greater control over the direction of the reconciled delta.
For example, we can exclude extreme outliers that are not representative of the data.
While we did not apply this filtering in our experiment, I think it makes sense in a production environment.&lt;/p&gt;
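
&lt;p&gt;One simple filtering criterion, dropping the worst tail of losses, might look like this (my own choice of criterion for illustration; a production setup could filter on different metadata entirely):&lt;/p&gt;

```python
def drop_extreme_losses(records, keep_fraction=0.9):
    # records: list of dicts like {'path': (2, 0, 3), 'loss': 0.41}
    # Rank by loss and drop the worst tail so extreme outliers do not
    # dominate the reconciled delta.
    ranked = sorted(records, key=lambda r: r['loss'])
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]
```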

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-09-continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/filter-data.svg&quot; alt=&quot;A diagram showing the filtering of collected paths and logits or loss based on specific criteria to ensure data quality.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A diagram showing the filtering of collected paths and logits or loss based on specific criteria to ensure data quality.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;If the collected data is unlabeled, we must label it first and then compute the loss from the stored logits.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-09-continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/label-data.svg&quot; alt=&quot;A diagram showing the labeling process for input data when the label is unknown.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A diagram showing the labeling process for input data when the label is unknown.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Using the collected paths and loss, we can then generate the attribution for each loss.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-09-continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/attributions.svg&quot; alt=&quot;A diagram showing the calculation of attribution for each parameter delta by standardizing the loss using collected paths and loss.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A diagram showing the calculation of attribution for each parameter delta by standardizing the loss using collected paths and loss.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;And then likewise, we can generate the reconciled delta by multiplying the attribution from the loss by the contributing parameter delta and summing them.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-09-continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/attribution-combined.svg&quot; alt=&quot;A diagram showing the process of multiplying the attribution from the loss by the contributing parameter delta and summing them to generate the reconciled delta.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A diagram showing the process of multiplying the attribution from the loss by the contributing parameter delta and summing them to generate the reconciled delta.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;And then, we update the model parameters with the reconciled delta.
That’s it!
We have just processed one step of continual learning, performed mostly through inference.
This step is analogous to a minibatch in the training process.
The model continues to provide business value during this process, with the only additional cost being the reconciliation step.&lt;/p&gt;
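&lt;p&gt;The whole reconciliation step described above can be sketched as follows. This is a simplified numpy illustration under two assumptions: each probe’s delta is regenerated on demand from its seed, and attributions come from standardizing the probe losses so that lower loss earns a larger positive weight. All names and the scaling constants are hypothetical.&lt;/p&gt;

```python
import numpy as np

def reconcile_step(param, probe_seeds, probe_losses, lr=0.01, scale=0.01):
    """One continual-learning step: standardize probe losses into
    attributions, weight each seed-generated delta, sum, and update."""
    losses = np.array(probe_losses, dtype=float)
    std = losses.std()
    if std == 0:
        return param                         # no signal to reconcile
    # Standardized loss, negated so that lower loss gets higher attribution.
    attributions = -(losses - losses.mean()) / std
    reconciled = np.zeros_like(param)
    for seed, attr in zip(probe_seeds, attributions):
        rng = np.random.default_rng(seed)    # delta regenerated from the seed
        delta = rng.standard_normal(param.shape) * scale
        reconciled += attr * delta
    return param + lr * reconciled / len(probe_seeds)

param = np.zeros(4)
updated = reconcile_step(param, probe_seeds=[0, 1, 2, 3],
                         probe_losses=[0.9, 0.7, 1.1, 0.5])
```

&lt;p&gt;Only the seeds (paths) and losses are consumed here, which is exactly why the heavy lifting can stay on the inference side.&lt;/p&gt;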
&lt;h2 id=&quot;the-first-experiment-learning-to-classify-dresses&quot;&gt;The First Experiment: Learning to Classify Dresses&lt;/h2&gt;

&lt;p&gt;Let’s recap the experiment introduced at the beginning of this article.
We trained a model using the Marketplace V2 algorithm on the MNIST digits dataset for 2,000 steps, achieving 96% accuracy on the validation dataset.
We then took the trained model and simulated the inference process, incorporating class 3 (dress) from the MNIST fashion dataset. We mixed these dress images with the digits dataset and tasked the model with classifying the combined images.
For each step, we used 240 images from the original dataset and 16 images from the new dataset, totaling 256 images. The marketplace shape was 4x4x4, i.e., 64 probes in total.&lt;/p&gt;
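&lt;p&gt;Assembling such a mixed step batch is straightforward. Here is a small sketch with random stand-ins for the two datasets; in the real experiment the arrays come from MNIST digits and Fashion-MNIST.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-ins for the two datasets (28x28 grayscale images).
digits_images = rng.random((60000, 28, 28))   # original MNIST digits
dress_images = rng.random((6000, 28, 28))     # class 3 (dress) from Fashion-MNIST

def mixed_step_batch(old_data, new_data, n_old=240, n_new=16):
    """Draw 240 original images and 16 new ones: a 256-image step batch."""
    old_idx = rng.choice(len(old_data), n_old, replace=False)
    new_idx = rng.choice(len(new_data), n_new, replace=False)
    return np.concatenate([old_data[old_idx], new_data[new_idx]])

batch = mixed_step_batch(digits_images, dress_images)
```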

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-09-continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/mnist-digits-plus-fashion-3.png&quot; alt=&quot;Diagram showing the original MNIST digits dataset and the new dataset combining MNIST digits with class 3 (dress) from the MNIST fashion dataset.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;Diagram showing the original MNIST digits dataset and the new dataset, which combines the MNIST digits dataset with class 3 (dress) from the MNIST fashion dataset.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;I must admit, this experiment is far from perfect.
It serves as a proof of concept to demonstrate the potential of tuning a model using mostly inference. After all, it’s unusual to present a dress image and expect the model to classify it as the digit 3.
In real-world scenarios, data shifts are likely to occur gradually rather than through abrupt changes like this one. Despite its flaws, the experiment shows promising results.&lt;/p&gt;

&lt;p&gt;The learning (inference) and validation accuracy for the new data (class 3 from the MNIST fashion dataset) are shown in the following figure:&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-09-continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/new-validation-accuracy.svg&quot; alt=&quot;Diagram with 50% smoothing showing the validation accuracy for class 3 (dress) from the MNIST fashion dataset increasing gradually to 86% at 100,000 steps.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;Diagram with 50% smoothing showing the validation accuracy for class 3 (dress) from the MNIST fashion dataset increasing gradually to 86% at 100,000 steps.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-09-continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/new-learning-accuracy.svg&quot; alt=&quot;Diagram with 50% smoothing showing the learning (inference) accuracy for class 3 (dress) from the MNIST fashion dataset increasing gradually to 100% at 28,000 steps.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;Diagram with 50% smoothing showing the learning (inference) accuracy for class 3 (dress) from the MNIST fashion dataset increasing gradually to 100% at 28,000 steps.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;As shown, the model gradually learns the new data.
Meanwhile, the testing accuracy for the original digits dataset is as follows:&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-09-continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/old-validation-accuracy.svg&quot; alt=&quot;Diagram with 50% smoothing showing the validation accuracy for the original MNIST digits dataset decreasing gradually from 96.2% to 95.2% after 100,000 steps.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;Diagram with 50% smoothing showing the validation accuracy for the original MNIST digits dataset decreasing gradually from 96.2% to 95.2% after 100,000 steps.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-09-continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/old-learning-accuracy.svg&quot; alt=&quot;Diagram with 50% smoothing showing the learning (inference) accuracy for the original MNIST digits dataset increasing gradually to 100% at 55,000 steps.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;Diagram with 50% smoothing showing the learning (inference) accuracy for the original MNIST digits dataset increasing gradually to 100% at 55,000 steps.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;The model successfully learns the new data while maintaining high accuracy on the original digits dataset most of the time.
However, the validation accuracy for the original digits dataset drops slightly, which is not ideal. More concerning is the downward trend, which we will discuss later.&lt;/p&gt;

&lt;p&gt;The loss decreases gradually for both the new data and the original dataset during the learning process (inference):&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-09-continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/new-loss.svg&quot; alt=&quot;Diagram with 50% smoothing showing the loss for the new data decreasing gradually during the learning process (inference).&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;Diagram with 50% smoothing showing the loss for the new data decreasing gradually during the learning process (inference).&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-09-continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/old-loss.svg&quot; alt=&quot;Diagram with 50% smoothing showing the loss for the original dataset decreasing gradually during the learning process (inference).&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;Diagram with 50% smoothing showing the loss for the original dataset decreasing gradually during the learning process (inference).&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;The accuracy reaches 100% for both the new data and the original dataset during the learning process (inference). This suggests the model may be memorizing the data rather than generalizing effectively.
Given the slight decline in validation accuracy for the original dataset, I wondered whether this was due to the model overfitting or “lazily” memorizing the data. To explore this, I tested whether continual learning could still work if the original dataset was augmented.
I therefore reran the continual learning process with the original dataset augmented alongside the new data, using the &lt;a href=&quot;https://github.com/tinygrad/tinygrad/blob/c6c16b294616447238d5d19974bceca52c9f2a40/extra/augment.py#L11-L21&quot;&gt;augment function&lt;/a&gt; from the Tinygrad library.&lt;/p&gt;
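&lt;p&gt;For readers who want the flavor of the augmentation without digging into the linked source: below is a rough shift-based sketch in the same spirit, not Tinygrad’s actual implementation.&lt;/p&gt;

```python
import numpy as np

def shift_augment(images, rng, max_shift=2):
    """Randomly shift each 28x28 image by up to max_shift pixels in each
    axis. A rough stand-in for Tinygrad's augment helper, not its code."""
    out = np.zeros_like(images)
    for i, img in enumerate(images):
        dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)
        # np.roll keeps every pixel, just relocated.
        out[i] = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
    return out

rng = np.random.default_rng(0)
images = rng.random((8, 28, 28))
augmented = shift_augment(images, rng)
```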

&lt;p&gt;Here’s the accuracy for the new data with augmentation:&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-09-continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/augmented-new-validation-accuracy.svg&quot; alt=&quot;Diagram with 50% smoothing showing the validation accuracy for the new data increasing gradually, though more slowly than without augmentation, reaching 70% at 100,000 steps.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;Diagram with 50% smoothing showing the validation accuracy for the new data increasing gradually, though more slowly than without augmentation, reaching 70% at 100,000 steps.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-09-continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/augmented-new-learning-accuracy.svg&quot; alt=&quot;Diagram with 50% smoothing showing the learning (inference) accuracy for the new data increasing gradually, though more slowly than without augmentation, reaching 100% at around 50,000 steps with slight fluctuations afterward.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;Diagram with 50% smoothing showing the learning (inference) accuracy for the new data increasing gradually, though more slowly than without augmentation, reaching 100% at around 50,000 steps with slight fluctuations afterward.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Interestingly, the accuracy for the new data with augmentation is lower than without augmentation. This suggests that the model’s learning capacity is being shared between the new data and the augmented original dataset.&lt;/p&gt;

&lt;p&gt;For the original dataset, the model initially struggles with augmented data, achieving only around 85% accuracy at the start. However, it quickly recovers and improves steadily:&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-09-continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/augmented-old-learning-accuracy.svg&quot; alt=&quot;Diagram with 50% smoothing showing the learning (inference) accuracy for the augmented original dataset increasing gradually to 95% at 100,000 steps.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;Diagram with 50% smoothing showing the learning (inference) accuracy for the augmented original dataset increasing gradually to 95% at 100,000 steps.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;The validation accuracy for the augmented original dataset drops slightly more than without augmentation, from 96.2% to 94.6% after 100,000 steps, but remains within a reasonable range.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-09-continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/augmented-old-validation-accuracy.svg&quot; alt=&quot;Diagram with 50% smoothing showing the validation accuracy for the augmented original dataset decreasing gradually from 96.2% to 94.6% at 100,000 steps.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;Diagram with 50% smoothing showing the validation accuracy for the augmented original dataset decreasing gradually from 96.2% to 94.6% at 100,000 steps.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;The loss for the augmented original dataset also decreases gradually, though it starts from a higher point due to augmentation.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-09-continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/augmented-old-loss.svg&quot; alt=&quot;Diagram with 50% smoothing showing the loss for the augmented original dataset decreasing gradually from a higher starting point than without augmentation.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;Diagram with 50% smoothing showing the loss for the augmented original dataset decreasing gradually from a higher starting point than without augmentation.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;The loss for the new data also decreases steadily, but more slowly than without augmentation.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-09-continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/augmented-new-loss.svg&quot; alt=&quot;Diagram with 50% smoothing showing the loss for the new dataset decreasing gradually, slightly higher than without augmentation.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;Diagram with 50% smoothing showing the loss for the new dataset decreasing gradually, slightly higher than without augmentation.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h2 id=&quot;second-experiment-with-a-new-digit&quot;&gt;Second Experiment with a New Digit&lt;/h2&gt;

&lt;p&gt;The drop in validation accuracy is concerning.
I’ve been considering why this is happening and how to address it.
Could it be because the MNIST digit dataset differs significantly from the Fashion-MNIST dataset in style?
To explore this, I designed a new experiment where we remove one digit from the original MNIST dataset and then train the model to classify the missing digit.
Specifically, I trained a model on the MNIST dataset excluding the digit 9.&lt;/p&gt;
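&lt;p&gt;Filtering a digit out of the training set is a one-liner with a boolean mask. A tiny sketch with stand-in labels (the real arrays come from MNIST):&lt;/p&gt;

```python
import numpy as np

# Stand-in labels; in the real experiment these come from MNIST.
labels = np.array([3, 9, 0, 9, 7, 1])
images = np.arange(6 * 28 * 28, dtype=float).reshape(6, 28, 28)

# Keep only samples whose label is not 9.
keep = labels != 9
train_images, train_labels = images[keep], labels[keep]
```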

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-09-continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/mnist-digits-learning-9.png&quot; alt=&quot;Diagram showing the MNIST dataset excluding digit 9 and the new dataset including digit 9.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;Diagram showing the MNIST dataset excluding digit 9 and the new dataset including digit 9.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Next, I applied the marketplace algorithm to learn the digit 9 through inference.
Here are the results:&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-09-continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/digit-new-validation-accuracy.svg&quot; alt=&quot;Diagram with 50% smoothing showing validation accuracy for digit 9, starting at zero until around 18,000 steps, then gradually increasing to 71% by 100,000 steps.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;Diagram with 50% smoothing showing validation accuracy for digit 9, starting at zero until around 18,000 steps, then gradually increasing to 71% by 100,000 steps.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-09-continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/digit-new-learning-accuracy.svg&quot; alt=&quot;Diagram with 50% smoothing showing learning (inference) accuracy for digit 9, starting at zero until around 18,000 steps, then gradually increasing to near 100% by 36,000 steps, with fluctuations afterward.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;Diagram with 50% smoothing showing learning (inference) accuracy for digit 9, starting at zero until around 18,000 steps, then gradually increasing to near 100% by 36,000 steps, with fluctuations afterward.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Initially, the validation accuracy for digit 9 remained at zero.
I suspected a bug in my code—why else would the accuracy stay at zero?
However, I noticed that the loss was decreasing, which suggested there was no bug.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-09-continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/digit-new-loss.svg&quot; alt=&quot;Diagram with 50% smoothing showing the loss for digit 9 decreasing gradually.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;Diagram with 50% smoothing showing the loss for digit 9 decreasing gradually.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;The issue stemmed from how the original model was trained: the label slot for digit 9 was reserved but never used, so the model learned to keep that output fixed at zero.
This created inertia in the model against ever outputting 9.
The decreasing loss indicated learning was occurring, but the model hadn’t yet overcome this inertia.
After waiting patiently, the validation accuracy eventually increased.&lt;/p&gt;

&lt;p&gt;For the original dataset (digits 0–8), the inference accuracy increased gradually, reaching 100% by 50,000 steps, as expected.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-09-continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/digit-old-learning-accuracy.svg&quot; alt=&quot;Diagram with 50% smoothing showing learning (inference) accuracy for the original dataset, increasing gradually to 100% by 50,000 steps.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;Diagram with 50% smoothing showing learning (inference) accuracy for the original dataset, increasing gradually to 100% by 50,000 steps.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;However, the validation accuracy for the original dataset dropped slightly from 96.8% to 95.2%, which was frustrating.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-09-continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/digit-old-validation-accuracy.svg&quot; alt=&quot;Diagram with 50% smoothing showing validation accuracy for the original dataset, decreasing from 96.8% to 95.2%.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;Diagram with 50% smoothing showing validation accuracy for the original dataset, decreasing from 96.8% to 95.2%.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;To see if we can eliminate the inertia, I retrained the model without digit 9, this time using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ignore_index&lt;/code&gt; feature &lt;a href=&quot;https://github.com/tinygrad/tinygrad/blob/239091d111caae102f8078c9179099cc0fb79997/tinygrad/tensor.py#L3987-L4010&quot;&gt;provided by the cross-entropy function&lt;/a&gt; to exclude digit 9 from the loss calculation.
My goal was to create a model neutral to the reserved label slot, enabling faster learning of new digits without inertia.
However, the model still exhibited some inertia against outputting 9, though less than before.
Although I couldn’t fully eliminate the inertia, the fact that the training algorithm can overcome it and successfully learn the new digit 9 is encouraging.&lt;/p&gt;
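&lt;p&gt;For reference, the semantics of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ignore_index&lt;/code&gt; can be sketched in numpy: samples whose label equals the ignored value simply do not contribute to the mean loss. This mirrors the behavior of Tinygrad’s option linked above, but it is my own sketch, not the library’s implementation.&lt;/p&gt;

```python
import numpy as np

def cross_entropy(logits, labels, ignore_index=9):
    """Mean cross-entropy that skips samples whose label equals
    ignore_index, e.g. ignore_index=9 to keep digit 9 out of the loss."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    idx = np.where(labels != ignore_index)[0]
    if idx.size == 0:
        return 0.0
    return float(-log_probs[idx, labels[idx]].mean())

logits = np.array([[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]])
labels = np.array([0, 9])           # second sample is ignored
loss = cross_entropy(logits, labels)
```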

&lt;h2 id=&quot;why-is-the-validation-accuracy-for-the-original-dataset-decreasing&quot;&gt;Why Is the Validation Accuracy for the Original Dataset Decreasing?&lt;/h2&gt;

&lt;p&gt;Now we know that with the Marketplace algorithm, we can learn new data mostly through inference, without a dedicated training process.
However, the declining validation accuracy for the original dataset is concerning.
The goal of continual learning is to deliver business value while gradually incorporating new data.
If the validation accuracy continues to decline, the model may eventually fail to provide business value.&lt;/p&gt;

&lt;p&gt;With this in mind, I’ve carefully considered why the validation accuracy for the original dataset is decreasing.
After closely examining the chart, I noticed that the validation accuracy remains stable until the model starts learning the new digit 9 and producing valid outputs for it.
My theory is that when new data is introduced, it generates high loss because the model has not encountered this type of data before.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-09-continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/loss-gap.svg&quot; alt=&quot;A diagram showing the loss gap between the new digit, number 9, and the original dataset is huge at the beginning and gradually decreases.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A diagram showing the loss gap between the new digit, number 9, and the original dataset is huge at the beginning and gradually decreases.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Given the significant loss gap, the Marketplace algorithm prioritizes parameter updates that reduce the loss for the new data, which are often low-hanging fruit.
However, the model already has parameters critical for correctly predicting the original dataset.
Modifying these critical parameters is costly.
In the early stages, the algorithm selects updates that are not critical to the original dataset, focusing on easy wins for the new data.
Over time, as the low-hanging fruit diminish, the algorithm begins to adjust parameters that are critical to both the new and original datasets.
Since the new data typically has a higher loss, the algorithm prioritizes updates that have a greater positive impact on the new data, even if they negatively affect the original dataset.&lt;/p&gt;

&lt;p&gt;If this theory is correct, I predict that a balance between the two datasets will eventually emerge.
When the loss for both datasets becomes nearly identical, the algorithm will select parameters that equally impact both datasets, stabilizing the validation accuracy for the original dataset.&lt;/p&gt;

&lt;p&gt;In the real world, data shifts are gradual, not sudden.
Therefore, the loss balance between old and new data patterns is unlikely to change drastically.
By continually applying the Marketplace algorithm, the loss for old and new data patterns should balance out shortly after a data shift occurs.
However, this is just a hypothesis, and other factors I haven’t considered may be causing the decline in validation accuracy.
Further investigation is needed to confirm this approach’s reliability.&lt;/p&gt;

&lt;p&gt;Another factor to consider is the fixed learning rate used for continual learning.
In traditional training, the learning rate is typically reduced over time to allow smaller, more precise adjustments, enabling the model to fine-tune the final decimal points of accuracy.
With a fixed learning rate, the model may overshoot when only small adjustments are needed, especially when the loss gap is minimal.
This forces the algorithm to select parameters that impact both old and new data.
To address this, we could dynamically adjust the learning rate based on the loss gap between the new and original datasets.
For example, using the variance or standard deviation of the loss gap to schedule the learning rate could allow the model to focus on fine-tuning when the loss gap is small, improving overall accuracy.&lt;/p&gt;
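&lt;p&gt;One possible schedule along those lines: keep the full learning rate while the gap between new-data and old-data loss is large relative to the overall loss spread, and decay toward a floor as the gap closes. This is purely a hypothetical sketch, not the schedule used in the experiments.&lt;/p&gt;

```python
import numpy as np

def scheduled_lr(base_lr, old_losses, new_losses, floor=0.1):
    """Shrink the learning rate as the loss gap between new and old
    data closes, so late updates become finer adjustments."""
    gap = abs(np.mean(new_losses) - np.mean(old_losses))
    spread = np.std(np.concatenate([old_losses, new_losses])) + 1e-8
    # Gap large relative to spread: full rate. Gap near zero: decay
    # toward base_lr * floor.
    factor = min(1.0, gap / spread)
    return base_lr * max(floor, factor)

lr = scheduled_lr(0.01, np.array([0.2, 0.25]), np.array([1.0, 1.1]))
```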

&lt;p&gt;Finally, model capacity is another important factor.
The current model is small, limiting the algorithm’s ability to select parameter changes that benefit the new data without negatively impacting the original dataset.
I wonder if a larger model would yield different results.&lt;/p&gt;

&lt;h2 id=&quot;what-does-this-mean-for-the-future-of-machine-learning&quot;&gt;What Does This Mean for the Future of Machine Learning?&lt;/h2&gt;

&lt;p&gt;If the potential impact of using the Marketplace algorithm for training is one, I would argue that its potential for continual learning is a thousand times greater—or more.
While it appears effective with a limited dataset and a very small model, this does not guarantee success with larger models or more complex datasets.
However, given the scale of inference—often millions or even billions of inferences per day—I believe we can make it work by leveraging the abundant probes provided by real-world model usage.
Even if the approach is only viable for small models, I can envision many exciting applications.
Here are some ideas:&lt;/p&gt;

&lt;h3 id=&quot;privacy-friendly-machine-learning&quot;&gt;Privacy-Friendly Machine Learning&lt;/h3&gt;

&lt;p&gt;Traditional machine learning often lacks privacy-friendliness.
The conventional process involves collecting all user data, running a forward pass, collecting logits, performing a backward pass, and updating model parameters.
This process is difficult to break down into smaller steps.
For example, consider a mobile app that predicts a user’s blood pressure for the next day based on their current health data.
The traditional approach requires collecting and anonymizing all user health data before storing it for training.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-09-continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/traditional-data-collection.svg&quot; alt=&quot;A diagram showing that the traditional approach to training a model requires collecting all user data and storing it on the server.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A diagram showing that the traditional approach to training a model requires collecting all user data and storing it on the server with an anonymization process.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;However, what can we do if regulations, such as &lt;a href=&quot;https://www.hhs.gov/hipaa/for-professionals/security/laws-regulations/index.html&quot;&gt;HIPAA compliance&lt;/a&gt;, prohibit sending user data to the server?&lt;/p&gt;

&lt;p&gt;One straightforward solution is to run the backward pass on the user’s mobile device to compute the gradient and send only the gradient to the server to update the model parameters.
However, the backward pass is computationally expensive, and the gradient size is often as large as the model parameters, making it impractical to perform on a mobile device and transmit to the server.
Even for smaller models designed for edge devices, this approach may still be too resource-intensive.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-09-continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/backprop-on-device.svg&quot; alt=&quot;A diagram showing a naive approach to training a model on the user&apos;s device, requiring the gradient from the backward pass to be sent to the server.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A diagram showing a naive approach to training a model on the user&apos;s device, requiring the gradient from the backward pass to be sent to the server.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Moreover, sharing gradient data risks leaking user information, as gradients can potentially be reverse-engineered to reconstruct the original data.
However, the Marketplace algorithm offers a different approach.
If the model is small enough, the user’s device can run a forward pass with a random path and different vendor deltas, then send only the path and logits to the server.
The delta is generated from seed-based random numbers, making it easy to produce on the fly during the forward pass.
If a label is available, only the loss needs to be sent, and the logits can be discarded.
In cases where the label is provided naturally by the user after a prediction, this process avoids sending any local user data to the server.&lt;/p&gt;
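&lt;p&gt;The on-device side of this scheme fits in a few lines. A hypothetical single-layer sketch: the delta is regenerated from a seed during the forward pass, and only the path (seed) and the loss ever leave the device.&lt;/p&gt;

```python
import numpy as np

def run_probe(weights, x, seed, scale=0.01):
    """On-device forward pass with a seed-generated delta applied."""
    rng = np.random.default_rng(seed)
    delta = rng.standard_normal(weights.shape) * scale
    return x @ (weights + delta)

def report(seed, logits, label):
    """The payload sent to the server: the path (seed) and loss only.
    No input data, no gradients, no logits."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return {"path": seed, "loss": float(-log_probs[label])}

weights = np.zeros((4, 3))
x = np.ones(4)
payload = report(7, run_probe(weights, x, seed=7), label=2)
```

&lt;p&gt;The server can regenerate the same delta from the seed, so the tiny payload is all it needs for reconciliation.&lt;/p&gt;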

&lt;p&gt;Logits are typically small, containing concentrated information that is difficult to use to deduce the original user data.
With the loss, a single float value, it’s virtually impossible to reconstruct the user’s data, especially if the input data space is large.
(Well, not so fast… 😅 Speaking from an infosec background: with enough samples, it might still be possible to reconstruct the user’s data, but that’s another story.)&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-09-continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/marketplace-on-device.svg&quot; alt=&quot;A diagram showing a privacy-friendly approach to training a model on the user&apos;s device using the Marketplace algorithm. The device only sends the chosen random path and the loss or logits to the server.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A diagram showing a privacy-friendly approach to training a model on the user&apos;s device using the Marketplace algorithm. The device only sends the chosen random path and the loss or logits to the server.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;From the server’s perspective, collecting enough forward paths and losses allows it to perform the steps outlined in the V2 article to generate the reconciled delta.
This enables the server to update model parameters without accessing any user data.
This approach has significant potential for privacy-friendly machine learning.
Hey, Apple, I would be very interested in exploring this approach if I were you! 😄&lt;/p&gt;

&lt;h3 id=&quot;tailored-model-for-each-user-or-group-based-on-usage-patterns&quot;&gt;Tailored Model for Each User or Group Based on Usage Patterns&lt;/h3&gt;

&lt;p&gt;Some argue that marketplace algorithms may not perform well for larger models.
I disagree, as these arguments often overlook the significant difference in inference scale.
With large-scale inference, I believe we can make it work one way or another; the only question is how.
However, if the marketplace algorithm is indeed unsuitable for bigger models, what then?
This led me to question whether we truly need large models.&lt;/p&gt;

&lt;p&gt;The reason models are so large is that they aim to cover all possible scenarios.
In reality, different users have distinct needs.
What if we could train a model for each user without incurring extra costs?
Or, at the very least, for a group of users?
The exciting part is that the more users engage with the model, the better it becomes.&lt;/p&gt;

&lt;p&gt;One potential application of a marketplace continual learning algorithm is to start with a smaller base model, divide use cases into different groups, and then let the continual learning process adapt to the specific needs of each group.
This way, we can create models tailored to individual users or groups without significant additional costs.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-09-continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/big-model-and-small-models.svg&quot; alt=&quot;A diagram showing that while a large model tries to cover all possible scenarios, a small model can be tailored to specific use cases.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A diagram showing that while a large model tries to cover all possible scenarios, a small model can be tailored to specific use cases.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Even if the model is large, techniques like &lt;a href=&quot;https://arxiv.org/abs/2106.09685&quot;&gt;LoRA&lt;/a&gt; can be used to fine-tune only a small part of the model, keeping costs low.
The potential of this approach is highly promising.&lt;/p&gt;
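&lt;p&gt;To make the LoRA idea concrete: the base weights stay frozen, and only two small low-rank matrices would be tuned per user or group. A minimal sketch, with sizes chosen for illustration only:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen base weight plus a rank-2 LoRA adapter. Only A and B
# (a tiny fraction of the parameters) would be tuned continually.
d_out, d_in, rank = 64, 64, 2
W = rng.standard_normal((d_out, d_in))        # frozen base weights
A = rng.standard_normal((d_out, rank)) * 0.01
B = np.zeros((rank, d_in))                    # zero-init: no change at first

def lora_forward(x):
    return W @ x + A @ (B @ x)

x = rng.standard_normal(d_in)
# With B at zero, the adapted model matches the base model exactly.
matches_base = np.allclose(lora_forward(x), W @ x)
```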

&lt;h2 id=&quot;future-research-directions&quot;&gt;Future Research Directions&lt;/h2&gt;

&lt;p&gt;Several questions remain unanswered in this field.
Below, I outline some key areas for future exploration.&lt;/p&gt;

&lt;h3 id=&quot;how-does-inferencing-with-random-vendors-and-their-deltas-impact-inference-performance&quot;&gt;How does inferencing with random vendors and their deltas impact inference performance?&lt;/h3&gt;

&lt;p&gt;Currently, we observe no major differences in inference performance, likely due to the simplicity of our model and dataset.
However, with more complex models and datasets, the impact may be more pronounced.
Adding deltas to the original model weights requires careful consideration to avoid degrading inference performance, as we rely on accurate outputs for users.&lt;/p&gt;

&lt;h3 id=&quot;how-can-we-train-a-model-that-is-more-conducive-to-continual-learning&quot;&gt;How can we train a model that is more conducive to continual learning?&lt;/h3&gt;

&lt;p&gt;In our second experiment with the digit “9,” the model exhibited inertia, resisting outputting this digit even when it was excluded from the loss calculation.
This raises the question: can we train a model with reserved label slots for new classes in a way that is friendlier to continual learning?
Ideally, such a model would learn new data immediately without inertia.
Beyond classification models, other approaches may also enhance continual learning capabilities.&lt;/p&gt;

&lt;h3 id=&quot;developing-a-dataset-with-data-shift-for-testing-continual-learning&quot;&gt;Developing a dataset with data shift for testing continual learning&lt;/h3&gt;

&lt;p&gt;Our experiments currently rely on the MNIST dataset.
I am not aware of an existing dataset that is well suited to testing novel concepts like continual learning, and MNIST certainly is not ideal for it.
A more suitable dataset would incorporate data shift: data divided into groups, with each group exhibiting slight variations from the previous one.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-09-continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/continual-learning-dataset.svg&quot; alt=&quot;A diagram illustrating a dataset with data shift for testing continual learning.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A diagram illustrating a dataset with data shift for testing continual learning.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Such a dataset would enable us to evaluate how well a model adapts to new data gradually during the continual learning process.&lt;/p&gt;
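
&lt;p&gt;As a rough illustration of what such grouping could look like (my own sketch, not an established benchmark), one could fabricate groups from MNIST-like images by applying a progressively larger pixel shift to each group:&lt;/p&gt;

```python
import numpy as np

def make_shift_groups(images, n_groups=5, max_shift=4):
    """Split images into groups with gradually increasing data shift.

    Each group translates its images a little further to the right
    than the previous one, so group k differs only slightly from
    group k-1 but substantially from group 0.
    """
    chunks = np.array_split(images, n_groups)
    groups = []
    for k, chunk in enumerate(chunks):
        shift = round(k * max_shift / (n_groups - 1))
        groups.append(np.roll(chunk, shift, axis=-1))
    return groups

# Stand-in for MNIST: 100 random 28x28 "images".
images = np.random.default_rng(0).random((100, 28, 28))
groups = make_shift_groups(images)
```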

&lt;h3 id=&quot;what-are-the-limits-of-the-marketplace-algorithm&quot;&gt;What are the limits of the Marketplace algorithm?&lt;/h3&gt;

&lt;p&gt;The Marketplace algorithm has shown promise with small models and simple datasets.
However, its effectiveness with larger models and more complex datasets remains untested.
While it demonstrates significant potential as a proof of concept, further experiments are needed to assess its performance in these scenarios.
Given the scale of inference—often millions or billions of inferences per day—finding optimal parameters for updates is likely feasible, suggesting that these challenges are solvable.&lt;/p&gt;

&lt;h3 id=&quot;exploring-privacy-friendly-machine-learning&quot;&gt;Exploring privacy-friendly machine learning&lt;/h3&gt;

&lt;p&gt;It is often assumed that machine learning cannot prioritize privacy without compromising training.
However, the Marketplace algorithm suggests otherwise, enabling privacy-friendly machine learning.
This is particularly valuable in use cases like healthcare, where regulations such as HIPAA may prevent sending local user data to servers due to patient privacy concerns.
The Marketplace algorithm’s continual learning approach allows models to learn through inference alone, without transmitting sensitive data.&lt;/p&gt;

&lt;p&gt;My background in information security, from my master’s degree in computer science, informs my perspective on this topic.
Techniques like &lt;a href=&quot;https://en.wikipedia.org/wiki/Blind_signature&quot;&gt;blind signatures&lt;/a&gt;, &lt;a href=&quot;https://en.wikipedia.org/wiki/Homomorphic_encryption&quot;&gt;homomorphic encryption&lt;/a&gt;, and &lt;a href=&quot;https://en.wikipedia.org/wiki/Oblivious_transfer&quot;&gt;oblivious transfer&lt;/a&gt; offer interesting ways to protect data privacy while still providing service to users.
Similarly, I think we can protect the privacy of the data while providing machine-learning-powered services to users.
Privacy-friendly machine learning warrants further exploration and could be the subject of an entire article or more.&lt;/p&gt;

&lt;h2 id=&quot;join-the-research-effort&quot;&gt;Join the Research Effort!&lt;/h2&gt;

&lt;p&gt;For now, this is still a solo research project.
There are many exciting avenues to explore with this algorithm.
As always, all my work is open-source under the MIT license and accessible to everyone:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/LaunchPlatform/marketplace&quot;&gt;https://github.com/LaunchPlatform/marketplace&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Please feel free to contribute to the project, fork it, or conduct your own research.
I believe this algorithm offers a fresh perspective on machine learning.
And this is just the beginning.
Even if I hit a dead end someday, someone with greater insight might find a way to make it work or draw inspiration from it to create something even better.&lt;/p&gt;

&lt;h2 id=&quot;final-thoughts&quot;&gt;Final thoughts&lt;/h2&gt;

&lt;p&gt;That’s a ton of things I have done in the past few weeks.
There are simply too many exciting avenues to explore with this algorithm.
For now, I’ll take a short break to focus on my SaaS products (like &lt;a href=&quot;https://beanhub.io&quot;&gt;BeanHub&lt;/a&gt;).
In the meantime, I’m considering the next direction for my research.
Thank you for reading this article!
I hope you found it at least somewhat inspiring.
Stay tuned for upcoming articles. 😄&lt;/p&gt;
</description>
        <pubDate>Tue, 09 Sep 2025 07:00:00 +0000</pubDate>
        <link>https://fangpenlin.com/posts/2025/09/09/continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/</link>
        <guid isPermaLink="true">https://fangpenlin.com/posts/2025/09/09/continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/</guid>
      </item>
    
      <item>
        <title>Marketplace V2 is all you need: A training algorithm on par with backprop that needs only forward pass</title>
        <description>&lt;style type=&quot;text/css&quot;&gt;
  figure {
    text-align: center;
    margin: 0 auto;
  }
  figcaption, figcaption p {
    color: grey;
    font-size: 0.9em;
  }
  figure img {
    max-width: 100%;
  }
&lt;/style&gt;

&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; Third article is here, &lt;a href=&quot;/posts/2025/09/09/continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/&quot;&gt;Continual Learning with Marketplace: Model Learns New Data with Mostly Inference&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two weeks ago, I published an article, &lt;a href=&quot;/posts/2025/08/18/marketplace-my-first-attempt-at-training-without-backprop-on-gpu-efficiently/&quot;&gt;Marketplace: My first attempt at training without backprop on GPU efficiently&lt;/a&gt;.
To my surprise, it received far more positive feedback than I expected.
I was thrilled that people found my research project interesting, and I greatly appreciate the kind words from readers.&lt;/p&gt;

&lt;p&gt;Curious about how far I could push this idea, I spent another two weeks improving it.
Initially, my research focused on enhancing the scalability of the Marketplace algorithm.
I implemented &lt;a href=&quot;https://github.com/LaunchPlatform/marketplace/pull/1&quot;&gt;seed-based random number generation&lt;/a&gt; for each vendor’s weights, as mentioned in the previous article.
I also explored other ideas to improve the Marketplace algorithm, such as using a second forward pass to determine the optimal learning rate.
However, exploring permutations of the loss outputs to compose a better overall delta, accounting for both good and bad outcomes, truly blew my mind 🤯.&lt;/p&gt;

&lt;p&gt;The performance is now on par with backpropagation in certain configurations.
Here’s the comparison with backpropagation plus &lt;a href=&quot;https://en.wikipedia.org/wiki/Stochastic_gradient_descent&quot;&gt;Stochastic Gradient Descent (SGD)&lt;/a&gt; as the optimizer:&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-02-marketplace-v2-is-all-you-need-a-training-algorithm-on-par-with-backprop/cmp-accuracy.svg&quot; alt=&quot;A diagram with 50% smoothing showing the comparison of validation accuracy between the Marketplace V2 and V1 algorithms and backprop with SGD as the optimizer at different learning rates. The Marketplace V2 algorithm greatly outperforms the Marketplace V1 algorithm, and beats backprop with SGD at the 1e-3 learning rate. It only loses to backprop with SGD at the 3e-3 and 7e-3 learning rates. While backprop with SGD is still better overall, I believe that with hyperparameter tuning the Marketplace V2 algorithm can at least match it.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A diagram with 50% smoothing showing the comparison of validation accuracy between the Marketplace V2 and V1 algorithms and backprop with SGD as the optimizer at different learning rates. The Marketplace V2 algorithm greatly outperforms the Marketplace V1 algorithm, and beats backprop with SGD at the 1e-3 learning rate. It only loses to backprop with SGD at the 3e-3 and 7e-3 learning rates. While backprop with SGD is still better overall, I believe that with hyperparameter tuning the Marketplace V2 algorithm can at least match it.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-02-marketplace-v2-is-all-you-need-a-training-algorithm-on-par-with-backprop/cmp-loss.svg&quot; alt=&quot;A diagram with 50% smoothing showing the comparison of loss between the Marketplace V2 and V1 algorithms and backprop with SGD as the optimizer at different learning rates. The Marketplace V2 algorithm greatly outperforms the Marketplace V1 algorithm, and beats backprop with SGD at the 1e-3 learning rate and, in later steps, at the 3e-3 learning rate. It only loses to backprop with SGD at the 7e-3 learning rate. While backprop with SGD is still better overall, I believe that with hyperparameter tuning the Marketplace V2 algorithm can at least match it.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A diagram with 50% smoothing showing the comparison of loss between the Marketplace V2 and V1 algorithms and backprop with SGD as the optimizer at different learning rates. The Marketplace V2 algorithm greatly outperforms the Marketplace V1 algorithm, and beats backprop with SGD at the 1e-3 learning rate and, in later steps, at the 3e-3 learning rate. It only loses to backprop with SGD at the 7e-3 learning rate. While backprop with SGD is still better overall, I believe that with hyperparameter tuning the Marketplace V2 algorithm can at least match it.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;I believe my research has a significant potential to revolutionize the machine learning training process.
Today, I’m excited to share the improvements to the Marketplace algorithm, which I call the Marketplace V2 algorithm.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;why-does-marketplace-work&quot;&gt;Why Does Marketplace Work?&lt;/h2&gt;

&lt;p&gt;In my last article, I explained how it works.
However, I didn’t delve into &lt;em&gt;why&lt;/em&gt; it works.
After publishing, I reflected on this question, because understanding why it works is crucial for improving the algorithm; otherwise, I’d be left guessing and testing.
While I haven’t had time to test these hypotheses, here’s my reasoning.&lt;/p&gt;

&lt;p&gt;When discussing gradient descent, people often compare it to walking down a hill.
This analogy isn’t entirely accurate because the terrain appears to change when you mutate parameters, as noted in &lt;a href=&quot;https://www.youtube.com/watch?v=NrO20Jb-hy0&quot;&gt;this video&lt;/a&gt;.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-02-marketplace-v2-is-all-you-need-a-training-algorithm-on-par-with-backprop/ml-loss-curve-terrain-change.png&quot; alt=&quot;A screenshot of a video titled &apos;The Misconception that Almost Stopped AI,&apos; explaining that the terrain changes when parameters are mutated.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A screenshot of &lt;a href=&quot;https://www.youtube.com/watch?v=NrO20Jb-hy0&quot;&gt;a video&lt;/a&gt; titled &quot;The Misconception that Almost Stopped AI,&quot; explaining that the terrain changes when parameters are mutated.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Even if it is not 100% accurate, this analogy provides good intuition for gradient descent.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-02-marketplace-v2-is-all-you-need-a-training-algorithm-on-par-with-backprop/move-down-hill.svg&quot; alt=&quot;A diagram of a curve with an arrow pointing in the direction of the steepest descent.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A diagram of a curve with an arrow pointing in the direction of the steepest descent.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;With backpropagation, we measure the slope of the terrain and descend carefully.
But do we really need such precision?
The Marketplace algorithm, instead of measuring the slope, feels more like sending probes in random directions and selecting the best one.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-02-marketplace-v2-is-all-you-need-a-training-algorithm-on-par-with-backprop/random-probes.svg&quot; alt=&quot;A diagram of probes sent in random directions to find the best one.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A diagram of probes sent in random directions to find the best one.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;The algorithm’s clever trick is reusing intermediate products to efficiently create more permutations of the probe candidates.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-02-marketplace-v2-is-all-you-need-a-training-algorithm-on-par-with-backprop/permutations-of-random-probes.svg&quot; alt=&quot;A diagram showing more colored probes in random directions than the previous diagram, reusing intermediate products to create more permutations efficiently.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A diagram showing more colored probes in random directions than the previous diagram. We reuse intermediate products to create more permutations of the best probe efficiently.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;The more probes you send, the higher the chance of finding good ones, some of which will roughly move in the right direction.
With a large model, the number of parameters creates a vast search space.
But are all parameters equally important?
The &lt;a href=&quot;https://arxiv.org/abs/1803.03635&quot;&gt;lottery ticket hypothesis&lt;/a&gt; suggests that only a small subset of parameters are critical.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-02-marketplace-v2-is-all-you-need-a-training-algorithm-on-par-with-backprop/lottery-ticket-hypothesis.png&quot; alt=&quot;A screenshot of a video about the lottery ticket hypothesis.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A screenshot of &lt;a href=&quot;https://www.youtube.com/watch?v=jeFMWtddkTs&quot;&gt;a video&lt;/a&gt; about the &lt;a href=&quot;https://arxiv.org/abs/1803.03635&quot;&gt;lottery ticket hypothesis&lt;/a&gt;.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Thus, for probes to trend in the right direction, only a small subset of parameters needs to align correctly.
Most parameters are likely just noise.
As long as a probe’s overall direction has more upside than downside, it’s good enough.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-02-marketplace-v2-is-all-you-need-a-training-algorithm-on-par-with-backprop/upside-and-downside.svg&quot; alt=&quot;A diagram showing that probes (delta) may contain positive and negative changes, but as long as the upside outweighs the downside, the probe is good enough.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A diagram showing that probes (delta) may contain positive and negative changes, but as long as the upside outweighs the downside, the probe is good enough.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;The Marketplace algorithm selects the best combination of parameters from the probes.
The leading vendor for each specification remains unchanged unless a mutated direction performs at least slightly better.
This process accumulates small, fortunate changes over time, eventually leading to a good solution.&lt;/p&gt;

&lt;h2 id=&quot;the-problem-with-marketplace&quot;&gt;The Problem with Marketplace&lt;/h2&gt;

&lt;p&gt;As mentioned in the previous article, the Marketplace algorithm selects only the best parameter combination from the probes, discarding all others.
“Best” refers to the combination with the lowest loss.
While this combination reduces overall loss, it’s an all-or-nothing approach.
By accepting the best probes, we also incorporate mutations that may increase loss in some aspects.&lt;/p&gt;

&lt;p&gt;This is wasteful because other combinations also provide valuable directional information.
The best probes have more beneficial mutations than harmful ones, which is why they’re selected.
However, as parameters improve, finding better mutations becomes harder, and the probability of good mutations decreases compared to bad ones.
At that point, more probes are needed to find a good mutation.&lt;/p&gt;

&lt;p&gt;Let’s view the Marketplace algorithm through the lens of a real-world marketplace.
When a product sells well, multiple factors contribute to its success.
The same applies to factors causing poor performance.
For a vendor, many parameters influence the product, making it hard to pinpoint the most critical one.
But what if you have enough products with slightly adjusted parameter permutations?
Couldn’t you statistically attribute the product’s performance to the parameter adjustments?
Yes!
That’s the core idea behind the Marketplace V2 algorithm.&lt;/p&gt;

&lt;h2 id=&quot;the-v2-algorithm&quot;&gt;The V2 Algorithm&lt;/h2&gt;

&lt;p&gt;The V2 algorithm is largely similar to the V1 algorithm, except that we reward each parameter delta based on the performance of the final product and compose the overall delta from the delta of each parameter.
Please read the &lt;a href=&quot;/posts/2025/08/18/marketplace-my-first-attempt-at-training-without-backprop-on-gpu-efficiently/&quot;&gt;first article&lt;/a&gt; for the details of the V1 algorithm.
Here’s how it works:&lt;/p&gt;

&lt;p&gt;Each unique path yields a loss, and each loss has a corresponding set of parameter deltas.
Let’s say each loss is \(L_i\), the parameter is \(\theta\), and the corresponding delta is \(\Delta \theta_i\), where \(i\) is the index of each unique path. The mean loss is:&lt;/p&gt;

\[\mu = \frac{1}{n} \sum_{i=1}^{n} L_i\]

&lt;p&gt;The standard deviation of the loss is:&lt;/p&gt;

\[\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (L_i - \mu)^2}\]

&lt;p&gt;We normalize the loss by subtracting the mean value and dividing by the standard deviation.
We call this attribution and denote it as \(A_i\):&lt;/p&gt;

\[A_i = - \frac{L_i - \mu}{\sigma}\]

&lt;p&gt;Note that we invert the sign because we want to reward parameter deltas that contribute to a lower loss and penalize those that contribute to a higher loss.&lt;/p&gt;
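
&lt;p&gt;In code, the attribution step is just a z-score with its sign flipped. Here is a minimal NumPy sketch of the formulas above (my own illustration, not the exact implementation from the repository):&lt;/p&gt;

```python
import numpy as np

def attributions(losses):
    """Normalize per-path losses into attribution scores A_i.

    A lower-than-average loss gets a positive score (a reward);
    a higher-than-average loss gets a negative score (a penalty).
    """
    losses = np.asarray(losses, dtype=np.float64)
    mu = losses.mean()
    sigma = losses.std()  # population standard deviation, as in the formula
    return -(losses - mu) / sigma

# Four hypothetical path losses; the lowest loss (0.9) receives
# the largest positive attribution.
A = attributions([0.9, 1.1, 1.0, 1.4])
```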

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-02-marketplace-v2-is-all-you-need-a-training-algorithm-on-par-with-backprop/attributions.svg&quot; alt=&quot;Diagram showing conversion of the loss of each final product, calculation of attribution by normalizing the loss, and flipping the sign.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A diagram showing the conversion of the loss of each final product, calculation of attribution by normalizing the loss, and flipping the sign.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Then, we attribute the loss to the parameter delta that contributed to it:&lt;/p&gt;

\[\Delta R = \sum_{i=1}^{n} A_i \times \Delta \theta_i\]

&lt;p&gt;We call \(\Delta R\) the reconciled delta because it accounts for the performance of the final product for each parameter. It’s like distributing the profit or loss to the parameter deltas that contributed to it.&lt;/p&gt;

&lt;p&gt;To calculate the reconciled delta, the process is straightforward. We take the attribution value \(A_i\) and multiply it by the parameter delta \(\Delta \theta_i\) for each unique path, i.e., the vendors’ deltas on each path.&lt;/p&gt;
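
&lt;p&gt;Continuing the NumPy sketch (again my own illustration, with hypothetical shapes), the reconciled delta is simply a weighted sum of the per-path parameter deltas:&lt;/p&gt;

```python
import numpy as np

def reconciled_delta(attribs, deltas):
    """Compute the reconciled delta: the sum over paths of A_i times delta_i.

    attribs: (n_paths,) attribution scores.
    deltas:  (n_paths, n_params) random parameter deltas, one row per path.
    Returns the (n_params,) reconciled delta.
    """
    return np.einsum("i,ij->j", attribs, deltas)

rng = np.random.default_rng(42)
deltas = rng.normal(scale=1e-2, size=(8, 5))  # 8 paths, 5 parameters
attribs = np.array([1.2, 0.4, -0.3, 0.0, -1.1, 0.6, -0.5, -0.3])
dR = reconciled_delta(attribs, deltas)
```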

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-02-marketplace-v2-is-all-you-need-a-training-algorithm-on-par-with-backprop/multiply-attributions.svg&quot; alt=&quot;Diagram showing multiplication of the attribution value with the parameter delta of vendors at a unique path.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A diagram showing multiplication of the attribution value with the parameter delta of vendors at a unique path.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-02-marketplace-v2-is-all-you-need-a-training-algorithm-on-par-with-backprop/multiply-attributions-2.svg&quot; alt=&quot;Diagram showing multiplication of the attribution value with the parameter delta of vendors at a different unique path.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A diagram showing multiplication of the attribution value with the parameter delta of vendors at a different unique path.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Finally, we sum the delta multiplied by the attribution value for each unique path:&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-02-marketplace-v2-is-all-you-need-a-training-algorithm-on-par-with-backprop/attribution-combined.svg&quot; alt=&quot;Diagram showing the summation of the delta multiplied by the attribution value for each unique path.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A diagram showing the summation of the delta multiplied by the attribution value for each unique path.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;The reconciled delta \(\Delta R\) is a vector pointing in the direction that leads to better overall performance.
We believe it approximates the gradient direction in backpropagation, and that with more sampled paths, the direction should converge closer to the true gradient direction.&lt;/p&gt;

&lt;p&gt;From a linear algebra perspective, we are essentially composing a linear combination of the parameter deltas at each path.
For the direction that is most correct, we assign more weight to the parameter delta at that path.
For those pointing in the wrong direction, we assign a negative weight to the parameter delta to redirect it toward the correct direction.
For neutral directions, we assign a weight close to zero to the parameter delta to avoid affecting the overall direction.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-02-marketplace-v2-is-all-you-need-a-training-algorithm-on-par-with-backprop/attribute-to-random-probes.svg&quot; alt=&quot;Diagram showing attribution of the loss to the parameter delta vector at each path.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A diagram showing attribution of the loss to the parameter delta vector at each path.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Since the directions are random, unrelated parameters in the vectors should cancel each other out.
A bad direction is also helpful in finding the right direction because we multiply it by a negative weight, effectively pointing it toward the correct direction.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-02-marketplace-v2-is-all-you-need-a-training-algorithm-on-par-with-backprop/combined-vector.svg&quot; alt=&quot;Diagram showing that the combined vector approximates the gradient direction in backpropagation.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A diagram showing that the combined vector approximates the gradient direction in backpropagation.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;As the scale of the reconciled delta may not be appropriate, we normalize it to a unit vector:&lt;/p&gt;

\[\Delta G = \frac{\Delta R}{||\Delta R||}\]

&lt;p&gt;We believe this unit vector direction is nearly identical to the gradient direction in backpropagation, so we call it the gradient direction and denote it as \(\Delta G\).
With the gradient direction \(\Delta G\), you can update the parameters by multiplying it by the learning rate \(\eta\):&lt;/p&gt;

\[\theta&apos; = \theta + \Delta G \times \eta\]

&lt;p&gt;The next step is basically to update the seeds, generate new small random weight deltas from those seeds for each vendor, and then repeat the process.&lt;/p&gt;
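
&lt;p&gt;Putting the pieces together, one full V2-style update could look roughly like this in NumPy. This is a hedged sketch under my own simplifications; the real implementation reuses intermediate products across paths and runs on tinygrad:&lt;/p&gt;

```python
import numpy as np

def v2_step(theta, loss_fn, n_paths=16, scale=1e-2, lr=2e-2, seed=0):
    """One Marketplace-V2-style update using only forward passes.

    theta:   (n_params,) current parameters.
    loss_fn: maps a parameter vector to a scalar loss (a forward pass).
    """
    rng = np.random.default_rng(seed)
    # Probe random directions; in the real algorithm these deltas are
    # regenerated from per-vendor seeds instead of being stored.
    deltas = rng.normal(scale=scale, size=(n_paths, len(theta)))
    losses = np.array([loss_fn(theta + d) for d in deltas])
    # Attribution: z-scored losses with the sign flipped.
    A = -(losses - losses.mean()) / losses.std()
    # Reconciled delta, normalized into a unit "gradient direction".
    dR = A @ deltas
    dG = dR / np.linalg.norm(dR)
    return theta + lr * dG

# Toy quadratic objective with its minimum at theta = [1, 2, 3].
target = np.array([1.0, 2.0, 3.0])
loss = lambda th: ((th - target) ** 2).sum()
theta = np.zeros(3)
for step in range(300):
    theta = v2_step(theta, loss, seed=step)
```

&lt;p&gt;On a toy quadratic objective, this loop steadily walks the parameters toward the minimum using nothing but forward passes.&lt;/p&gt;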

&lt;h2 id=&quot;comparison-with-backprop&quot;&gt;Comparison with Backprop&lt;/h2&gt;

&lt;p&gt;In the previous article, I forgot to mention that the backprop experiment uses &lt;a href=&quot;https://arxiv.org/abs/1412.6980&quot;&gt;Adam&lt;/a&gt; as the optimizer.
I realized it’s somewhat unfair to compare the Marketplace approach with backprop because I was not only competing against backprop but also the Adam optimizer.
The backprop algorithm is highly performant, and Adam enhances its power.
I’ve heard people say that Adam is an optimizer on steroids; well, using steroids in a fight is obviously cheating, right?
In contrast, the Marketplace algorithm simply applies a learning rate to the gradient direction, which is essentially equivalent to SGD.
Therefore, it makes more sense to compare the Marketplace algorithm with backprop using SGD as the optimizer.&lt;/p&gt;

&lt;p&gt;Please note that in the previous article, we swapped the &lt;a href=&quot;https://docs.tinygrad.org/nn/?h=batchnorm#tinygrad.nn.BatchNorm&quot;&gt;Batch Normalization layer&lt;/a&gt; for an &lt;a href=&quot;https://docs.tinygrad.org/nn/?h=inst#tinygrad.nn.InstanceNorm&quot;&gt;Instance Normalization layer&lt;/a&gt; in the backprop algorithm.
However, when running backprop with SGD as the optimizer, we reverted to the Batch Normalization layer because the Instance Normalization layer does not perform well with SGD for this particular model.
For the Marketplace algorithm, we kept the Instance Normalization layer, as the Batch Normalization layer is less effective with this approach.&lt;/p&gt;

&lt;p&gt;With a more level playing field, the results are much more exciting.
I observed very similar performance between the Marketplace V2 algorithm and backprop with SGD as the optimizer.
Here are the results:&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-02-marketplace-v2-is-all-you-need-a-training-algorithm-on-par-with-backprop/cmp-accuracy.svg&quot; alt=&quot;A diagram with 50% smoothing showing the comparison of validation accuracy between the Marketplace V2, V1 algorithms, and backprop with SGD as the optimizer across different learning rates. The Marketplace V2 algorithm significantly outperforms the Marketplace V1 algorithm and backprop with SGD at a 1e-3 learning rate. It only falls behind backprop with SGD at 7e-3 and 3e-3 learning rates.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A diagram with 50% smoothing showing the comparison of validation accuracy between the Marketplace V2, V1 algorithms, and backprop with SGD as the optimizer across different learning rates. The Marketplace V2 algorithm significantly outperforms the Marketplace V1 algorithm and backprop with SGD at a 1e-3 learning rate. It only falls behind backprop with SGD at 7e-3 and 3e-3 learning rates.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-02-marketplace-v2-is-all-you-need-a-training-algorithm-on-par-with-backprop/cmp-loss.svg&quot; alt=&quot;A diagram with 50% smoothing showing the comparison of loss between the Marketplace V2, V1 algorithms, and backprop with SGD as the optimizer across different learning rates. The Marketplace V2 algorithm significantly outperforms the Marketplace V1 algorithm and backprop with SGD at a 1e-3 learning rate and at a 3e-3 learning rate in later steps. It only falls behind backprop with SGD at a 7e-3 learning rate.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A diagram with 50% smoothing showing the comparison of loss between the Marketplace V2, V1 algorithms, and backprop with SGD as the optimizer across different learning rates. The Marketplace V2 algorithm significantly outperforms the Marketplace V1 algorithm and backprop with SGD at a 1e-3 learning rate and at a 3e-3 learning rate in later steps. It only falls behind backprop with SGD at a 7e-3 learning rate.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Both Marketplace algorithms use the same market depth and vendor count: a market depth of 3 and 16 vendors per specification.&lt;/p&gt;

&lt;p&gt;Certainly, with the right learning rate, backprop with SGD as the optimizer still outperforms the Marketplace V2 algorithm.
However, this is already very exciting!
Unlike backprop, which requires both forward and backward passes, the Marketplace algorithm only needs the forward pass.
I haven’t spent much time on hyperparameter tuning for the Marketplace V2 algorithm because, with 16 vendors, the startup time for &lt;a href=&quot;https://github.com/tinygrad/tinygrad&quot;&gt;Tinygrad&lt;/a&gt; is lengthy, as its JIT compiler has to compile kernel code for a very complex compute graph, slowing down the hyperparameter tuning process (the new &lt;a href=&quot;https://x.com/__tinygrad__/status/1958718540841984393&quot;&gt;rangify feature&lt;/a&gt; could potentially solve this, but it’s not yet stable).
The fact that the Marketplace V2 algorithm shows nearly identical performance at certain learning rates suggests that this is likely a hyperparameter tuning issue.&lt;/p&gt;

&lt;p&gt;The Marketplace V2 algorithm is still new, and we are unsure how to optimally tune its hyperparameters to further improve performance.
However, I believe it’s only a matter of time before we find the right hyperparameters or optimize the algorithm itself to match or exceed the performance of backprop with SGD.
We can also scale the Marketplace V2 algorithm with more vendors and depth if a more accurate gradient direction is needed.
Additionally, I’ve experimented with techniques like using a second forward pass to adjust the learning rate, which shows promising results but requires further refinement.
Because it only requires a forward pass, the Marketplace V2 algorithm could, in theory, run at least twice as fast as backprop.
With these considerations, I assert that the Marketplace V2 algorithm is on par with backprop using SGD as the optimizer.
With further research and optimization, I firmly believe the Marketplace algorithm will even surpass backprop.&lt;/p&gt;

&lt;h2 id=&quot;what-are-the-implications&quot;&gt;What Are the Implications?&lt;/h2&gt;

&lt;p&gt;Some may ask: What’s the point of expending additional computational resources to achieve the same performance as backpropagation?
As mentioned in the previous article, the Marketplace algorithm relies solely on the forward pass to train the model.
Optimizing the forward pass is easier and more beneficial for inference performance than optimizing the backward pass.
Since backpropagation requires both forward and backward passes—and even assuming the backward pass takes the same time as the forward pass (though it typically takes much longer)—the Marketplace algorithm, by using only the forward pass, could be at least twice as fast as backpropagation.
Additionally, the cache-friendly nature of running only the forward pass could make it even faster in practice.
While we may lose some accuracy in the gradient direction, I would argue that gradient direction accuracy offers only marginal benefits to the training process once it passes a certain threshold.
The Marketplace algorithm essentially trades gradient direction accuracy for training speed and other advantages.&lt;/p&gt;

&lt;p&gt;While the forward pass in the Marketplace algorithm is more computationally intensive, it is designed to be GPU-efficient.
With a powerful GPU and highly optimized code, it should perform at least as fast as the forward pass in backprop.
We also need to compute the reconciled delta for the Marketplace algorithm, but this is only a small fraction of the forward pass and can be easily parallelized.&lt;/p&gt;

&lt;p&gt;The potential of the Marketplace algorithm is immense—running at least twice as fast as backprop is just the tip of the iceberg.
Many may not realize that GPUs are only part of the equation in large-scale training.
Bandwidth is often the bottleneck.
The primary reason large-scale distributed training isn’t conducted on idle consumer-grade GPUs is due to bandwidth limitations.
When using GPU clusters for training, network bandwidth is a critical factor.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-02-marketplace-v2-is-all-you-need-a-training-algorithm-on-par-with-backprop/h100-superpod.png&quot; alt=&quot;A figure from How to Think About GPUs by Google DeepMind showing the NVIDIA H100 SuperPod architecture, a large-scale GPU cluster. It highlights the tremendously high network bandwidth between nodes, which is critical for large-scale training and expensive to implement.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A figure from &lt;a href=&quot;https://jax-ml.github.io/scaling-book/gpus/#beyond-the-node-level&quot;&gt;How to Think About GPUs by Google DeepMind&lt;/a&gt; showing the NVIDIA H100 SuperPod architecture, a large-scale GPU cluster. It highlights the tremendously high network bandwidth between nodes, which is critical for large-scale training and expensive to implement.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;With the Marketplace algorithm, the bandwidth required for syncing weights is significantly reduced.
Using our &lt;a href=&quot;https://github.com/LaunchPlatform/marketplace/pull/1&quot;&gt;seed-based random number generation&lt;/a&gt;, we only need to redistribute the seeds plus the attribution value \(A_i\) for each path to reconstruct weight updates.
This reduction in bandwidth requirements means less time spent syncing weight updates across nodes, which could revolutionize large-scale training.
Training time could be significantly reduced. Sorry,
Jensen, for the potential impact on &lt;a href=&quot;https://www.nvidia.com/en-us/data-center/nvlink/&quot;&gt;NVLink&lt;/a&gt; sales! 😅&lt;/p&gt;
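&lt;p&gt;To make the bandwidth argument concrete, here is a minimal Python sketch of seed-based update reconstruction. The function names and the Gaussian noise model are my own illustration, not the actual Marketplace implementation: because every node can regenerate the same perturbations locally from the integer seeds, only the seeds and the attribution values need to cross the network.&lt;/p&gt;

```python
import random

def noise_from_seed(seed, n):
    # Deterministic perturbation: every node regenerates the exact
    # same vector from the integer seed, so only the seed travels
    # over the network, never the vector itself.
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def reconstruct_update(seeds, attributions, n, lr=0.01):
    # Weight delta as an attribution-weighted sum of seeded noise,
    # mirroring the idea that a seed plus an attribution value A_i
    # per path is enough to rebuild the full weight update locally.
    delta = [0.0] * n
    for seed, a in zip(seeds, attributions):
        noise = noise_from_seed(seed, n)
        for i in range(n):
            delta[i] += a * noise[i]
    return [lr * d for d in delta]
```

&lt;p&gt;Under this scheme, syncing a layer costs one integer plus one float per path instead of a full weight tensor, which is where the bandwidth savings come from.&lt;/p&gt;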

&lt;p&gt;Due to the reduced bandwidth requirements, I believe this opens up a new world of distributed training.
It may become feasible to conduct large-scale training on idle consumer-grade GPUs.&lt;/p&gt;

&lt;h2 id=&quot;future-research-directions&quot;&gt;Future Research Directions&lt;/h2&gt;

&lt;p&gt;The Marketplace V2 algorithm performs comparably to backpropagation, but several questions remain unanswered.&lt;/p&gt;

&lt;h3 id=&quot;is-reconciled-delta-an-accurate-approximation-of-the-gradient-direction-in-backpropagation&quot;&gt;Is Reconciled Delta an Accurate Approximation of the Gradient Direction in Backpropagation?&lt;/h3&gt;

&lt;p&gt;As noted throughout this article, the reconciled delta performs nearly identically to the gradient direction in backpropagation with stochastic gradient descent (SGD) as the optimizer.
This raises the question: Is the reconciled delta an accurate approximation of the gradient direction in backpropagation?
To answer this, one could run backpropagation alongside the Marketplace algorithm, compare the gradient direction with the reconciled delta, and evaluate their similarity.
Another intriguing question is whether scaling up the Marketplace algorithm could improve the approximation of the gradient direction.
Due to time constraints, I leave this exploration to others interested in this topic.&lt;/p&gt;
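&lt;p&gt;The comparison itself could be as simple as logging the cosine similarity between the two update directions at each step. A minimal sketch (the inputs are stand-in vectors, not outputs of either algorithm):&lt;/p&gt;

```python
import math

def cosine_similarity(u, v):
    # Cosine similarity between two flattened update directions:
    # 1.0 means perfectly aligned, 0.0 orthogonal, -1.0 opposed.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)
```

&lt;p&gt;Tracking this value over training would also directly answer the scaling question: if the similarity rises as the number of vendors grows, the reconciled delta is converging toward the true gradient direction.&lt;/p&gt;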

&lt;h3 id=&quot;can-we-apply-optimization-techniques-to-the-marketplace-algorithm&quot;&gt;Can We Apply Optimization Techniques to the Marketplace Algorithm?&lt;/h3&gt;

&lt;p&gt;Assuming the reconciled delta accurately approximates the gradient direction in backpropagation, can we apply optimization techniques such as momentum, Adam, or others to the Marketplace algorithm?
If these techniques can be integrated, they could potentially enhance performance with minimal effort.&lt;/p&gt;
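&lt;p&gt;For example, a classic momentum update with the reconciled delta standing in for the gradient might look like the following sketch. This is an assumption about how the integration could work, not a tested implementation:&lt;/p&gt;

```python
def momentum_step(weights, reconciled_delta, velocity, lr=0.01, beta=0.9):
    # Standard momentum update with the reconciled delta standing in
    # for the gradient. If the delta approximates the true gradient
    # direction, classic optimizer tricks should transfer directly.
    new_velocity = [beta * v + d for v, d in zip(velocity, reconciled_delta)]
    new_weights = [w - lr * v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity
```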

&lt;h3 id=&quot;scalability-of-the-v2-algorithm&quot;&gt;Scalability of the V2 Algorithm&lt;/h3&gt;

&lt;p&gt;In my earlier research, I explored the scalability of the Marketplace V1 algorithm, which is relatively straightforward to scale along two main axes:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;The dataset axis&lt;/strong&gt;: Parallelizing the forward pass.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;The marketplace replication axis&lt;/strong&gt;: Running multiple marketplace replicas.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Using seed-based random number generation, the Marketplace V1 algorithm can be replicated across multiple nodes and executed in parallel.
Due to limited access to GPUs, I simulated this by running multiple forward passes and marketplace replicas sequentially.
Increasing the number of forward passes and replicas showed performance improvements on a limited scale, but its effectiveness at a larger scale remains unclear.
The V2 algorithm’s scaling behavior differs, so I defer further investigation to future work or others interested in this topic.&lt;/p&gt;

&lt;h3 id=&quot;adaptive-learning-rate&quot;&gt;Adaptive Learning Rate&lt;/h3&gt;

&lt;p&gt;I explored the idea of using an adaptive learning rate with the Marketplace algorithm.
Since the algorithm allows testing slight parameter changes, it could also test different learning rates.
The approach involves using the first forward pass to determine the reconciled delta direction and the second forward pass to identify the optimal learning rate, essentially determining the direction in the first pass and the step size in the second.&lt;/p&gt;
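&lt;p&gt;A minimal sketch of this two-pass idea, with hypothetical names: the direction comes from the first pass, and the second pass evaluates a few candidate step sizes along that direction and keeps the cheapest one. Here loss_fn stands for one evaluation forward pass.&lt;/p&gt;

```python
def adaptive_step(weights, direction, loss_fn, candidate_lrs):
    # Pass 1 (not shown) produced the reconciled-delta direction.
    # Pass 2: evaluate the same direction at several step sizes and
    # keep the candidate with the lowest loss.
    def apply(lr):
        return [w - lr * d for w, d in zip(weights, direction)]
    best_lr = min(candidate_lrs, key=lambda lr: loss_fn(apply(lr)))
    return apply(best_lr), best_lr
```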

&lt;p&gt;Using this second pass to test the learning rate is a promising idea.
I implemented this approach, and it significantly improved performance in the initial steps, even outperforming backpropagation with SGD as the optimizer at certain learning rates in the early steps.
However, performance degraded later due to fluctuating learning rates.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-02-marketplace-v2-is-all-you-need-a-training-algorithm-on-par-with-backprop/meta-lr-accuracy.svg&quot; alt=&quot;Validation accuracy of the Marketplace V2 algorithm with adaptive learning rate compared to backpropagation with SGD.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;The validation accuracy of the Marketplace V2 algorithm with an adaptive learning rate outperforms many backpropagation configurations with SGD as the optimizer, using only 8 vendors per specification. However, performance degrades quickly due to fluctuating learning rates.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-09-02-marketplace-v2-is-all-you-need-a-training-algorithm-on-par-with-backprop/meta-lr-loss.svg&quot; alt=&quot;Loss of the Marketplace V2 algorithm with adaptive learning rate compared to backpropagation with SGD.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;The loss of the Marketplace V2 algorithm with an adaptive learning rate outperforms many backpropagation configurations with SGD as the optimizer, using only 8 vendors per specification. However, performance degrades quickly due to fluctuating learning rates.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;The Marketplace V2 algorithm with an adaptive learning rate shown in the diagrams above uses only 8 vendors per specification, while the one we used in the backprop comparison uses 16 vendors, yet it still outperforms many backpropagation configurations with SGD in early steps.
This suggests a promising direction, but stabilizing the learning rate is necessary to prevent performance degradation.&lt;/p&gt;

&lt;h3 id=&quot;large-models-and-datasets&quot;&gt;Large Models and Datasets&lt;/h3&gt;

&lt;p&gt;As discussed in the previous article, I tested the Marketplace algorithm on a larger model like &lt;a href=&quot;https://github.com/tinygrad/tinygrad/blob/d0d39885c386d730da29f6a9cc1fdac589319b9e/extra/models/resnet.py&quot;&gt;ResNet-18&lt;/a&gt; with a scaled MNIST dataset, but these experiments were insufficient to assess scalability to larger models and datasets.
A common saying is:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;If it doesn’t work with MNIST, it probably won’t work with anything.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Having demonstrated that the Marketplace algorithm performs well with a small MNIST CNN, the next step is to test it with larger models and datasets.
Training larger models and datasets requires significant time and resources for efficient execution, so I defer this to future work or others interested in this topic.&lt;/p&gt;

&lt;h2 id=&quot;i-need-your-help-lets-advance-machine-learning-together&quot;&gt;I Need Your Help: Let’s Advance Machine Learning Together&lt;/h2&gt;

&lt;p&gt;Although I move extremely fast when writing code and experimenting with new ideas, I still have only 24 hours in a day and 7 days in a week.
As you can see, there are so many interesting things to explore with the Marketplace algorithm, and I can only do a small part of it.&lt;/p&gt;

&lt;p&gt;As a solo researcher, it’s also scary to think about whether I made mistakes during the process.
But I believe the best way to move forward quickly is to embrace mistakes and learn from them rapidly.
Maybe it’s a fool’s errand in the end, but I still want to try.
People used to believe that training super-large models was not possible because they would overfit, until someone brave enough tried it and proved otherwise, which is why we see today’s machine learning blooming.
It’s funny that I see countless examples of scientists entering fields where their teachers, professors, or advisors told them everything was well researched and there was nothing new to be found.
But people have proved that wrong again and again.
So the question is not what happens if I make mistakes; the question is, if we don’t try, how can we know whether we can make it?
And what if I am right?&lt;/p&gt;

&lt;p&gt;If you’re like me and dare to challenge the status quo, I invite you to join me.
I believe the Marketplace algorithm is just the beginning of a new era in machine learning.
All of my experiments are open-source under the MIT license and accessible to everyone.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/LaunchPlatform/marketplace&quot;&gt;https://github.com/LaunchPlatform/marketplace&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can contribute to the project, fork it, and conduct your own research.
You’re also very welcome to reproduce the experiments with your own implementation.
Please feel free to contact me if you have any questions or suggestions.
I am also open to exploring collaboration opportunities if there’s something interesting to pursue together.&lt;/p&gt;

&lt;h2 id=&quot;acknowledgments&quot;&gt;Acknowledgments&lt;/h2&gt;

&lt;p&gt;Many people have mentioned interesting prior works on X.
I think it’s helpful to list them here for others to learn from.
If you find any interesting prior work related to my research, please feel free to let me know, and I will add it to the list.&lt;/p&gt;

&lt;h3 id=&quot;spike-timing-dependent-plasticity&quot;&gt;Spike-Timing-Dependent Plasticity&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://x.com/mov_axbx&quot;&gt;Nathan Odle (@mov_axbx)&lt;/a&gt; on X &lt;a href=&quot;https://x.com/mov_axbx/status/1959551011187503248&quot;&gt;mentioned&lt;/a&gt; that my Marketplace algorithm somewhat maps to RL-STDP.
It appears that &lt;a href=&quot;https://en.wikipedia.org/wiki/Spike-timing-dependent_plasticity&quot;&gt;STDP (Spike-timing-dependent plasticity)&lt;/a&gt; is a biological process for adjusting the weights of synapses.
I haven’t had time to really dig into it, but it’s interesting to see that my Marketplace algorithm could potentially be similar to the biological processes by which the brain learns.
By the way, the author’s &lt;a href=&quot;https://www.mov-axbx.com/wopr/wopr_concept.html&quot;&gt;7 4090 AI training rack&lt;/a&gt; is so cool—be sure to check it out.&lt;/p&gt;

&lt;h3 id=&quot;random-feedback-weights-support-learning-in-deep-neural-networks&quot;&gt;Random Feedback Weights Support Learning in Deep Neural Networks&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://x.com/gradientjanitor&quot;&gt;andrew (@gradientjanitor)&lt;/a&gt; on X &lt;a href=&quot;https://x.com/gradientjanitor/status/1957960456754327841&quot;&gt;mentioned&lt;/a&gt; that my Marketplace algorithm reminds him of another paper: &lt;a href=&quot;https://arxiv.org/abs/1411.0247&quot;&gt;Random feedback weights support learning in deep neural networks&lt;/a&gt; by Lillicrap et al.
I glanced through the paper and think I understood maybe 50% of the ideas.
Regardless, it seems like an interesting paper showing that random feedback weights can support learning in deep neural networks.&lt;/p&gt;

&lt;h3 id=&quot;bucket-brigade-algorithm&quot;&gt;Bucket Brigade Algorithm&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://x.com/risi1979&quot;&gt;Sebastian Risi (@risi1979)&lt;/a&gt; on X &lt;a href=&quot;https://x.com/risi1979/status/1957810196522365423&quot;&gt;mentioned&lt;/a&gt; that it could be somewhat similar to the &lt;a href=&quot;https://gwern.net/doc/reinforcement-learning/multi-agent/1985-holland.pdf&quot;&gt;Bucket Brigade Algorithm&lt;/a&gt; by Holland.
If we view each vendor on the path to the final product as a chain of actions and the final loss as the reward, it could be somewhat similar to the Bucket Brigade Algorithm.
However, there’s no bidding or tax in the Marketplace algorithm.&lt;/p&gt;

&lt;h2 id=&quot;final-thoughts&quot;&gt;Final Thoughts&lt;/h2&gt;

&lt;p&gt;Personally, I feel so excited about the future of machine learning and the Marketplace algorithm’s applications.
I can’t wait to share one more thing before I end this article.
My next research direction is about “continual learning.”
In the real world, we learn new things by doing; it’s odd that machine learning requires a dedicated process to learn new things.
After I came up with the Marketplace algorithm, I’ve already had the idea of applying it to continual learning.
&lt;a href=&quot;https://x.com/SamiBelhareth/status/1957949413944619291&quot;&gt;Some people on X also realized&lt;/a&gt; the same potential of the Marketplace algorithm before I even announced it.&lt;/p&gt;

&lt;p&gt;By continual learning, I mean we can make a model learn new things mostly through inference alone.
Sounds exciting, right?
I have run through the idea in my mind, and I believe it’s possible.
The only thing left is to implement it and see if it really works as I expected.
Stay tuned for the upcoming articles.
I hope you enjoyed reading this article.&lt;/p&gt;
</description>
        <pubDate>Tue, 02 Sep 2025 07:00:00 +0000</pubDate>
        <link>https://fangpenlin.com/posts/2025/09/02/marketplace-v2-is-all-you-need-a-training-algorithm-on-par-with-backprop/</link>
        <guid isPermaLink="true">https://fangpenlin.com/posts/2025/09/02/marketplace-v2-is-all-you-need-a-training-algorithm-on-par-with-backprop/</guid>
      </item>
    
      <item>
        <title>Marketplace: my first attempt at training without backprop on GPU efficiently</title>
        <description>&lt;style type=&quot;text/css&quot;&gt;
  figure {
    text-align: center;
    margin: 0 auto;
  }
  figcaption, figcaption p {
    color: grey;
    font-size: 0.9em;
  }
  figure img {
    max-width: 100%;
  }
&lt;/style&gt;

&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt;
Please read the &lt;a href=&quot;/posts/2025/09/02/marketplace-v2-is-all-you-need-a-training-algorithm-on-par-with-backprop/&quot;&gt;second article&lt;/a&gt; for the details of the V2 algorithm. The third article, &lt;a href=&quot;/posts/2025/09/09/continual-learning-with-marketplace-model-learns-new-data-with-mostly-inference/&quot;&gt;Continual Learning with Marketplace: Model Learns New Data with Mostly Inference&lt;/a&gt;, introduces continual learning with the Marketplace algorithm.&lt;/p&gt;

&lt;p&gt;If you’ve read my previous articles, you know I’m a big fan of first-principles thinking.
I’ve mentioned many times that I want to eliminate backpropagation.
Many people think I’m crazy and assume I must be joking.
But no, I’m serious.
I thought about the problem from time to time.
Recently, I came up with an idea that could potentially work.
I spent two weeks implementing it and running experiments, and it worked!
While this is just a baby step, there are still many things to improve, but at least I think it’s an interesting idea that could be worth exploring and sharing.
Today, I would like to share my approach to training without backpropagation on GPUs efficiently.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-08-18-marketplace-my-first-attempt-at-training-without-backprop-on-gpu-efficiently/marketplace-accuracy.svg&quot; alt=&quot;A diagram shows the validation accuracy of a small MNIST CNN model training process without using backpropagation.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A diagram shows the validation accuracy of a small MNIST CNN model training process without using backpropagation.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-08-18-marketplace-my-first-attempt-at-training-without-backprop-on-gpu-efficiently/marketplace-loss.svg&quot; alt=&quot;A diagram shows the validation loss of a small MNIST CNN model training process without using backpropagation.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A diagram shows the loss of a small MNIST CNN model training process without using backpropagation.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;background&quot;&gt;Background&lt;/h2&gt;

&lt;p&gt;Just because a solution exists and is widely used doesn’t mean it’s the best one.
From this perspective, we should challenge all existing solutions.
Having worked on numerous machine learning projects recently, I’ve found that backpropagation uses a ton of memory.
The dependencies introduced by backward propagation also make it challenging to scale effectively.
I want to train a model without backpropagation, ideally in a distributed manner.&lt;/p&gt;

&lt;p&gt;I am a very different type of person, I guess.
I really don’t like seeing the answer before coming up with my own ideas.
This is why I didn’t study any existing research papers or articles about how to train without backpropagation.
The interesting thing about research is that one can usually only do so much with existing technologies and methodologies.
For example, the concept of the &lt;a href=&quot;https://en.wikipedia.org/wiki/Convolutional_neural_network&quot;&gt;Convolutional Neural Network (CNN)&lt;/a&gt; is not new.
It was proposed by a Japanese researcher named Fukushima as early as 1979!
I saw a &lt;a href=&quot;https://x.com/SchmidhuberAI/status/1952007922721919219&quot;&gt;post about it on X&lt;/a&gt; a few days ago:&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-08-18-marketplace-my-first-attempt-at-training-without-backprop-on-gpu-efficiently/cnn-history.png&quot; alt=&quot;CNN history from an X post mentioning that the idea of CNN was introduced long ago by a Japanese researcher, but it was only implemented and gained popularity later due to advancements in hardware and training techniques.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;CNN history from an &lt;a href=&quot;https://x.com/SchmidhuberAI/status/1952007922721919219&quot;&gt;X post&lt;/a&gt; mentioning that the idea of CNN was introduced long ago by a Japanese researcher, but it was only implemented and gained popularity later due to advancements in hardware and training techniques.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;At the time the concept was introduced, it would have likely cost the researcher an insane amount of money and resources just to test the idea.
The CNN was only proven useful years later by LeCun et al., when backpropagation was introduced and computing resources became more accessible.&lt;/p&gt;

&lt;p&gt;It’s remarkable to think about: most of our modern-day machine learning methodologies originated decades ago.
But things change so fast.
While I was studying CUDA, I watched a &lt;a href=&quot;https://www.youtube.com/watch?v=n6M8R8-PlnE&amp;amp;t=256s&quot;&gt;video mentioning that the NVIDIA A100 chip’s computing power surpasses that of a supercomputer cluster&lt;/a&gt;, which once took up an entire room, while using only a fraction of the power.
It’s insane to think about—I have a 4090 right inside my work PC.
It’s a supercomputer in this tiny box! 🤯&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-08-18-marketplace-my-first-attempt-at-training-without-backprop-on-gpu-efficiently/cuda-video.png&quot; alt=&quot;A YouTube video introducing CUDA and how memory access patterns significantly impact CUDA program performance. The video also mentioned that modern GPUs rival the capabilities of supercomputers from a few years ago.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A &lt;a href=&quot;https://youtu.be/n6M8R8-PlnE?si=1e9KUpEgRZ0pUIWo&amp;amp;t=256&quot;&gt;YouTube video&lt;/a&gt; introducing CUDA and how memory access patterns significantly impact CUDA program performance. The video also mentioned that modern GPUs rival the capabilities of supercomputers from a few years ago.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;As you can see, given the time and context, it’s hard to imagine how fast future computer chips could become.
Even if a researcher could predict what might work, it was prohibitively expensive to test.
Most papers are products of the technologies available at their time.
That’s why it’s always a good idea to devise solutions from the ground up using today’s technologies.&lt;/p&gt;

&lt;p&gt;When I was working on &lt;a href=&quot;/posts/2025/02/06/maze-how-i-would-build-agi/&quot;&gt;my MAZE project&lt;/a&gt;, I thought a lot about how it could be more efficient.
The thing is, no matter how interesting your idea is, if it can’t run efficiently with the available resources right now, it’s just science fiction.
Sure, I could come up with a system to evolve different models a zillion times to see which one works best, but the problem is that you don’t have the resources to run it a zillion times like Mother Nature does.
I saw &lt;a href=&quot;https://www.youtube.com/watch?v=OkEGJ5G3foU&quot;&gt;a presentation&lt;/a&gt; by &lt;a href=&quot;https://x.com/danielhanchen&quot;&gt;Daniel Han&lt;/a&gt; from &lt;a href=&quot;https://unsloth.ai&quot;&gt;Unsloth AI&lt;/a&gt;, where he mentioned:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Machine learning is all about efficiency&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I think that’s true.
To make machine learning work, efficiency is key.
Great presentation, by the way. If you’re interested in Large Language Models (LLMs), I highly recommend it.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-08-18-marketplace-my-first-attempt-at-training-without-backprop-on-gpu-efficiently/daniel-han-presentation.png&quot; alt=&quot;A screenshot of a presentation by Daniel Han from Unsloth AI, where he mentioned that machine learning is all about efficiency.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A screenshot of &lt;a href=&quot;https://www.youtube.com/watch?v=OkEGJ5G3foU&quot;&gt;a presentation&lt;/a&gt; by &lt;a href=&quot;https://x.com/danielhanchen&quot;&gt;Daniel Han&lt;/a&gt; from &lt;a href=&quot;https://unsloth.ai&quot;&gt;Unsloth AI&lt;/a&gt;, where he mentioned that machine learning is all about efficiency.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Take machine learning model mutation, for example.
Currently, MAZE can only mutate model structure.
But in the real world, I believe model weights could also be encoded in the DNA.
Therefore, I’ve been thinking hard about how to mutate not just the structure but also inherit weights from parent models.
I realized something really inefficient in my current approach: mutations may only occur at some point in the model structure.
For the layers before that mutation, we’ve already fed testing/training data to them previously.
So, you’re duplicating work by repeatedly feeding the same data to the same shared layers.
One straightforward idea is to cache the values and reuse them, but this may still require a lot of storage space.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-08-18-marketplace-my-first-attempt-at-training-without-backprop-on-gpu-efficiently/repeating-work.svg&quot; alt=&quot;A diagram showcasing how mutation only affects downstream layers. With the same input data, we are repeating the same work for the same model.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A diagram showcasing how mutation only affects downstream layers. With the same input data, we are repeating the same work for the same model.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Since I was also thinking about how to eliminate backpropagation, I wondered if I could apply a similar idea to training.&lt;/p&gt;

&lt;h2 id=&quot;idea-marketplace&quot;&gt;Idea: Marketplace&lt;/h2&gt;

&lt;p&gt;Other than Mother Nature, I also like to think about all kinds of efficient organic mechanisms and how they work.
A free market is another great example.
No matter how the environment changes, a free market can adapt quickly and produce in-demand goods efficiently.
I view an entire neural network as a marketplace.
Each time data flows into a model, it’s like raw materials acquired by vendors, who then produce products.
If you break the model down to a smaller scale, each layer is like a stop in the manufacturing process.
When you further break down the layer, each neuron can be seen as a vendor that takes in materials, processes them, and outputs a product to the downstream layers.
With this in mind, I’ve been thinking about how to make a neural network model mimic how a free market works.
Ideally, each vendor should evolve autonomously and compete to produce the best products.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-08-18-marketplace-my-first-attempt-at-training-without-backprop-on-gpu-efficiently/neuron-network-vs-vendor.svg&quot; alt=&quot;Comparing a neuron in a neural network to a vendor in a marketplace. They both take in materials, process them, and output a product.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;Comparing a neuron in a neural network to a vendor in a marketplace. They both take in materials, process them, and output a product.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Sure, it’s easier said than done.
As mentioned previously, machine learning is all about efficiency.
I needed to think hard about how to make it GPU-friendly.
I tried a few different approaches and finally found something that works efficiently on a GPU.
Because the idea is inspired by how a free market works, I call it the “Marketplace.”
Here’s how it works.&lt;/p&gt;

&lt;p&gt;In a neural network, there are usually many layers.
Take a Convolutional Neural Network (CNN), for example: one image could flow through multiple layers of kernels to capture features at different hierarchical levels.
The further down the layers, the more abstract the concepts it captures.
For each layer, the output is a tensor with the same shape, regardless of the input data.
Therefore, I call this a spec.
You can think of it as a specification for a product.
A layer, or a group of layers, with a specific set of weight values is what I call a vendor.
Think about it: vendors take input, transform it deterministically based on their weights, and output the result as their product.
Indeed, they are vendors.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-08-18-marketplace-my-first-attempt-at-training-without-backprop-on-gpu-efficiently/product-spec.svg&quot; alt=&quot;A diagram showcasing that you can think of a layer in a neural network as a specification for a product. Because they all take input in the same shape and output in the same shape.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A diagram showcasing that you can think of a layer in a neural network as a specification for a product. Because they all take input in the same shape and output in the same shape.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;The idea is to have many vendors in a layer compete to produce the best product for the downstream layers.
But here’s the problem: what does the “best” product mean?
Before the final end-user evaluates it, nobody knows whether these intermediate products are good or bad.
It’s hard to determine how well a product performs if we adjust it slightly.
The only way to know is to let the downstream vendors consume it, produce their own products, and eventually have those consumed by the end customers, who provide their feedback.&lt;/p&gt;

&lt;p&gt;But then another problem arises. When a vendor in a hidden layer changes their formula slightly, the altered output impacts the products downstream.
It’s hard to predict whether a slight increase in the “sweetness” of your sugar supply might ruin the final boba tea product for the end consumer.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-08-18-marketplace-my-first-attempt-at-training-without-backprop-on-gpu-efficiently/adjusting-one-vendor.svg&quot; alt=&quot;A diagram showcasing how a slight change in a vendor&apos;s formula impacts the downstream products, and how hard it is to predict the impact on the final product.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A diagram showcasing how a slight change in a vendor&apos;s formula impacts the downstream products. It’s very hard to predict the impact on the final product.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;What do you do then?
Well, in the real world, vendors usually try things out.
You don’t stick with the same vendor forever.
An upstream vendor may go out of business, get acquired, or change their quality.
You still need to find a way to adapt to keep your business running.
Therefore, when things change, you need to try them out to know.
I call this process upstream sampling.
The idea is that each vendor randomly selects N intermediate products from the upstream layer and then produces N corresponding products for the downstream layers.
They can also take all the intermediate products from the upstream layer; I call this full upstream sampling.&lt;/p&gt;
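&lt;p&gt;In code, upstream sampling could be sketched like this. The vendors here are plain callables and the sampling is uniform; both are simplifications for illustration, not the actual implementation:&lt;/p&gt;

```python
import random

def upstream_sampling(upstream_products, vendors, n_samples, rng):
    # Each vendor picks n_samples products from the upstream layer
    # at random and produces one output per sampled input, so every
    # intermediate product gets reused across many vendor paths.
    outputs = []
    for vendor in vendors:
        sampled = rng.sample(upstream_products, n_samples)
        outputs.extend(vendor(x) for x in sampled)
    return outputs
```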

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-08-18-marketplace-my-first-attempt-at-training-without-backprop-on-gpu-efficiently/upstream-sampling.svg&quot; alt=&quot;A diagram showcasing how upstream sampling works. Each vendor randomly selects N intermediate products from the upstream layer and then produces N corresponding products for the downstream layers.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A diagram showcasing how upstream sampling works. Each vendor randomly selects N intermediate products from the upstream layer and then produces N corresponding products for the downstream layers.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-08-18-marketplace-my-first-attempt-at-training-without-backprop-on-gpu-efficiently/layer-structure.svg&quot; alt=&quot;A diagram shows that each spec (group of layers) has many vendors (weights variants), and each vendor runs upstream sampling to produce products based on the upstream products provided by different upstream vendors for the downstream layers.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A diagram shows that each spec (group of layers) has many vendors (weights variants), and each vendor runs upstream sampling to produce multiple products based on the upstream products provided by different upstream vendors for the downstream layers.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;We repeat the same process for all layers until the end. Then, we evaluate the final product to determine its quality.
To track which vendor and its upstream vendors are responsible for the final product, we concatenate the index of each vendor in their respective layer as a record.
After evaluating the final product, we identify the best vendor based on metrics such as loss or accuracy.&lt;/p&gt;
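&lt;p&gt;A toy sketch of the path-tracking and selection step. For clarity it enumerates every vendor combination exhaustively, whereas the real algorithm only visits the combinations produced by upstream sampling:&lt;/p&gt;

```python
import itertools

def best_vendor_path(layers, x, loss_fn):
    # Each layer is a list of vendors (callables). Run every vendor
    # combination, record the chain of vendor indices as the path,
    # and keep the path whose final product scores the lowest loss.
    def run(path):
        out = x
        for layer, i in zip(layers, path):
            out = layer[i](out)
        return out
    paths = itertools.product(*(range(len(layer)) for layer in layers))
    best = min(paths, key=lambda p: loss_fn(run(p)))
    return best, run(best)
```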

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-08-18-marketplace-my-first-attempt-at-training-without-backprop-on-gpu-efficiently/select-best-vendors.svg&quot; alt=&quot;A diagram shows that after evaluating the final product, we identify the best vendor based on metrics such as loss or accuracy. We then copy the weights from the leading vendor to all other vendors in the same layer, introducing slight random variations.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A diagram shows that after evaluating the final product, we identify the best vendors based on metrics such as loss or accuracy.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;We then copy the weights from the leading vendor to all other vendors in the same spec, introducing slight random variations.
You can think of this as the other copycat vendors copying the leading vendor with their own mutations.
Well, it happens in the real world too.
When a product is very successful, there will be copycat products.&lt;/p&gt;
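&lt;p&gt;A sketch of this respawn step. Keeping one unmutated copy of the leader is my own assumption here, a common safeguard so the best known weights are never lost:&lt;/p&gt;

```python
import random

def respawn_vendors(vendor_weights, best_index, scale, rng):
    # Copy the leading vendor to every slot, then give each copycat
    # its own small random variation. Slot 0 keeps the unmutated
    # leader so the best known weights survive the round.
    best = vendor_weights[best_index]
    new_vendors = [list(best)]
    for _ in range(len(vendor_weights) - 1):
        mutated = [w + rng.gauss(0.0, scale) for w in best]
        new_vendors.append(mutated)
    return new_vendors
```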

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-08-18-marketplace-my-first-attempt-at-training-without-backprop-on-gpu-efficiently/mutate.svg&quot; alt=&quot;A diagram shows that we copy the weights from the leading vendor to all other vendors in the same layer, introducing slight random variations.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A diagram shows that we copy the weights from the leading vendor to all other vendors in the same layer, introducing slight random variations △µ(i,j) corresponding to the index of the vendor.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;By repeating this process, the model learns to produce high-quality products efficiently over time.&lt;/p&gt;
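&lt;p&gt;As a rough sketch of the copy-and-mutate step (written here in numpy with hypothetical names; the real implementation operates on Tinygrad tensors), one could write:&lt;/p&gt;

```python
import numpy as np

def mutate_layer(weights, leader, jitter, rng):
    """Copy the leading vendor's weights over every vendor in the layer,
    then perturb all copies except the leader's with small random noise."""
    mutated = np.broadcast_to(weights[leader], weights.shape).copy()
    noise = rng.standard_normal(weights.shape) * jitter
    noise[leader] = 0.0  # the leader keeps its exact weights
    return mutated + noise

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 3, 3))  # 4 vendors, each with a 3x3 weight
out = mutate_layer(w, leader=2, jitter=0.01, rng=rng)
assert np.allclose(out[2], w[2])      # leader unchanged
assert not np.allclose(out[0], w[0])  # others replaced by mutated copies
```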

&lt;p&gt;The core idea of this algorithm lies in its efficiency, as it reuses intermediate products multiple times.
If we mutated the entire model’s weights simultaneously, it would be difficult to determine which changes were beneficial or detrimental.
However, by mutating one layer (or a few layers) at a time and then remixing the result, fully or partially at random, with different vendor combinations, the algorithm efficiently reuses the output of each mutation.
More importantly, this entire process is intentionally designed to be GPU-efficient, meaning it can run in parallel.
No matter how fast your GPU runs, the next layer will always be waiting for the previous layer to finish its work.
Therefore, it makes perfect sense to run as many combinations as possible in parallel in each layer.
The mutation processes for each vendor within the same layer do not depend on each other.
In theory, this makes the algorithm scalable and can potentially be distributed across multiple nodes in a large-scale training cluster efficiently.&lt;/p&gt;
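&lt;p&gt;To make the parallelism concrete, here’s a small numpy sketch (illustrative shapes and names only) showing that all vendors within a layer can be evaluated in a single batched operation instead of a sequential loop:&lt;/p&gt;

```python
import numpy as np

# Sketch: vendors within a layer are independent of one another, so all
# of them can be evaluated in one batched matmul rather than a loop --
# the property that makes the scheme GPU-friendly.
rng = np.random.default_rng(0)
vendors, batch, d_in, d_out = 8, 16, 32, 10
W = rng.standard_normal((vendors, d_in, d_out))  # one weight matrix per vendor
x = rng.standard_normal((batch, d_in))           # shared input to the layer

y_batched = x @ W  # broadcasts over the vendor axis: (vendors, batch, d_out)
y_loop = np.stack([x @ W[v] for v in range(vendors)])  # equivalent loop
assert y_batched.shape == (vendors, batch, d_out)
assert np.allclose(y_batched, y_loop)
```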

&lt;h2 id=&quot;implementing-with-tinygrad&quot;&gt;Implementing with Tinygrad&lt;/h2&gt;

&lt;p&gt;If you read my previous article about &lt;a href=&quot;/posts/2025/06/11/two-amd-7900xtx-gpus-tinygrad-based-training-workstation-peer-to-peer-pcie-communication/&quot;&gt;building my training workstation with two AMD GPUs&lt;/a&gt;, you know I rewrote much of my PyTorch code using Tinygrad for the CakeLens project.
If you’re unfamiliar with &lt;a href=&quot;https://github.com/tinygrad/tinygrad&quot;&gt;Tinygrad&lt;/a&gt;, it’s a machine learning library created by one of my favorite legendary hackers, &lt;a href=&quot;https://en.wikipedia.org/wiki/George_Hotz&quot;&gt;George Hotz&lt;/a&gt;.
Unlike PyTorch, Tinygrad uses lazy evaluation, meaning the compute graph is constructed as a complete end-to-end graph when you call the functions.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-08-18-marketplace-my-first-attempt-at-training-without-backprop-on-gpu-efficiently/tinygrad-viz.png&quot; alt=&quot;Screenshot of Tinygrad&apos;s compute graph visualization tool. It shows the compute graph and the corresponding native kernel code lowered from the graph.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;Screenshot of Tinygrad&apos;s compute graph visualization tool, showing the compute graph and the corresponding native kernel code lowered from it.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;This allows you to apply its &lt;a href=&quot;https://docs.tinygrad.org/quickstart/?h=jit#jit&quot;&gt;JIT compiler&lt;/a&gt; to compile the entire compute graph into native kernel code for hardware accelerators.
With the full compute graph, there are a lot of opportunities to optimize it by fusing the operations.
For example, here’s how you put the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@TinyJit&lt;/code&gt; decorator on a function to compile the entire compute graph into native kernel code:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;TinyJit&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;mutate_step&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;combined_loss&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;combined_paths&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;tuple&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;min_loss&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;min_loss_index&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;combined_loss&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;topk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;largest&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;min_path&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;combined_paths&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;min_loss_index&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;flatten&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;mutate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;marketplace&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;marketplace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;leading_path&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;min_path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;jitter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;min_loss&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;realize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;min_path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;realize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I also appreciate that the library is very small, making it much easier to understand and extend.
I enjoy using Tinygrad for machine learning and other high-performance computing tasks across platforms.
This research would have been much harder if I had to use PyTorch or other machine learning frameworks, as they assume training with backpropagation.
You may need to build your own CUDA kernel or other hardware-specific code to run on GPUs if you want to speed up the training process.&lt;/p&gt;

&lt;p&gt;Back to the topic: yes, I implemented the Marketplace algorithm using Tinygrad.
As with most of my machine learning projects, I’ve open-sourced it here:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/LaunchPlatform/marketplace&quot;&gt;https://github.com/LaunchPlatform/marketplace&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You should be able to find all the experiment code in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;experiments&lt;/code&gt; folder.
Please pardon me, as some parts of the code might be slightly messier than my usual standards, since this is an experimental project, not a production one.
I used the &lt;a href=&quot;https://github.com/tinygrad/tinygrad/blob/21570545d376d4fc0176338c9f2d08ca3f3a1b63/examples/beautiful_mnist.py&quot;&gt;beautiful_mnist&lt;/a&gt; example from Tinygrad’s examples folder as the target model to test my algorithm.
The implementation is simple: we break down each CNN layer into a multi-model class, like this:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;# layer0
&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;Spec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;MultiModel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;MultiConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;structure&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;# layer1
&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;Spec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;MultiModel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;MultiConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;structure&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;upstream_sampling&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;structure&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;# layer2
&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;Spec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;MultiModel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;MultiBatchNorm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;structure&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max_pool2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;upstream_sampling&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;structure&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;# layer3
&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;Spec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;MultiModel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;MultiConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;structure&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;upstream_sampling&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;structure&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;# layer4
&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;Spec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;MultiModel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;MultiConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;structure&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;upstream_sampling&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;structure&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;# layer5
&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;Spec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;MultiModel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;MultiBatchNorm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;structure&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max_pool2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;upstream_sampling&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;structure&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;# Layer6
&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;Spec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;MultiModel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;flatten&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MultiLinear&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;structure&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;576&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;upstream_sampling&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;structure&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In other words, we duplicate the same weights N times to test different vendor variants.
Here’s an example implementation of the multi-model version of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Conv2D&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;MultiConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;MultiModelBase&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Conv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;vendor_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;in_channels&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;out_channels&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;kernel_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;tuple&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;...],&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;stride&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;padding&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;tuple&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;...]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;dilation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;groups&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;bias&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;nb&quot;&gt;super&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;in_channels&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;in_channels&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;out_channels&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out_channels&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;kernel_size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kernel_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;stride&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stride&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;padding&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;padding&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;dilation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dilation&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;groups&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;groups&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;bias&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bias&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vendor_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vendor_count&lt;/span&gt;
    &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;weight&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;repeat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;weight&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vendor_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bias&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;is&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;not&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bias&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;repeat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bias&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vendor_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We also modified the forward pass method to accept an extra index parameter, allowing us to switch between different vendors’ weights.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__call__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;weight&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
    &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bias&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bias&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;is&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;not&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;groups&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stride&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dilation&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;padding&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For the first layer, we feed the training input to every vendor (every set of weights), producing an array of intermediate products.
Each subsequent layer randomly selects N intermediate products from the upstream layer, tracking the vendor indices to form a tensor that represents each product’s supply chain path.
With full sampling, we take the outputs of all vendors in the upstream layer.
Here’s the simplified code for producing a layer’s output:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;produce&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MultiModel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;paths&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;upstream_sampling&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;tuple&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;paths&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;is&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# first layer, no upstream sampling, just feed input to all vendors
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;output_data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stack&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vendor_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;paths&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arange&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vendor_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unsqueeze&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;output_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;paths&lt;/span&gt;

  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;upstream_sampling&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# sample all vendors from the upstream layer
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;upstream_sampling&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;

  &lt;span class=&quot;n&quot;&gt;input_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;paths&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;input_indexes&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stack&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;randperm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;input_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)[:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;upstream_sampling&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vendor_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;input_data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;input_indexes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;# merge different batches for the same vendor into one. not sure if this is needed, but at least it saves us
&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# from calling the model multiple times and making the graph more complex
&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;merged_batches&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;input_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reshape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;input_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;input_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:])&lt;/span&gt;

  &lt;span class=&quot;n&quot;&gt;output_data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stack&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;merged&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;merged&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;zip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vendor_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;merged_batches&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;# breaking down merged batches back to individual batches
&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;output_data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;output_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reshape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;input_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;output_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:])&lt;/span&gt;

  &lt;span class=&quot;n&quot;&gt;prev_paths&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;paths&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;input_indexes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;flatten&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;new_paths&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arange&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vendor_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unsqueeze&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;repeat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;upstream_sampling&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;flatten&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unsqueeze&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;merged_paths&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;prev_paths&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;new_paths&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;output_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;merged_paths&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
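&lt;p&gt;The sampling-and-path-tracking idea can also be sketched in plain Python, independent of any tensor library. This is an illustrative sketch with hypothetical names, not the actual implementation; the model call is stubbed out:&lt;/p&gt;

```python
import random

def produce_layer(products, paths, vendor_count, upstream_sampling):
    """Illustrative sketch: each vendor samples a few upstream products and
    extends their supply chain paths with its own index."""
    new_products, new_paths = [], []
    for vendor in range(vendor_count):
        # each vendor randomly picks `upstream_sampling` upstream products
        picks = random.sample(range(len(products)), upstream_sampling)
        for idx in picks:
            # stand-in for running this vendor's model on the picked product
            new_products.append((vendor, products[idx]))
            # record the path: the upstream path plus this vendor's index
            new_paths.append(paths[idx] + (vendor,))
    return new_products, new_paths

# three upstream products, each with a one-hop path; four vendors sample two each
products, paths = produce_layer(
    ["p0", "p1", "p2"], [(0,), (1,), (2,)], vendor_count=4, upstream_sampling=2
)
```

&lt;p&gt;Each resulting path is a tuple of vendor indices, one per layer, which is exactly what the loss-based selection step needs later.&lt;/p&gt;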

&lt;p&gt;After processing all layers, we obtain the final products, i.e., the output logits.
We apply the loss function to identify the product with the lowest loss.
The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mutate&lt;/code&gt; function then copies the weights from the leading vendors to the others, adding slight random mutations to explore further improvements.
The code looks like this:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;mutate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;marketplace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Spec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;leading_path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jitter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;spec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;leading_index&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;zip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;marketplace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;leading_path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;multi_params&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;state&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_state_dict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;spec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;spec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vendor_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;params&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;multi_params&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;items&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;leading_params&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;params&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;leading_index&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;params&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
          &lt;span class=&quot;n&quot;&gt;leading_params&lt;/span&gt;
          &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;leading_index&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;where&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
            &lt;span class=&quot;c1&quot;&gt;# Do not change the leading vendor
&lt;/span&gt;            &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
            &lt;span class=&quot;c1&quot;&gt;# Copy from the leading vendor and add a bit jitters
&lt;/span&gt;            &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
              &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;uniform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                  &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;leading_params&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;low&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;jitter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;high&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;jitter&lt;/span&gt;
              &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
          &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;realize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
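&lt;p&gt;The selection step that drives &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mutate&lt;/code&gt; can be sketched in plain Python. This is a simplified, per-layer illustration with hypothetical names, not the actual implementation:&lt;/p&gt;

```python
import random

def pick_leading_path(paths, losses):
    # the leading supply chain path is the one whose final product
    # achieved the lowest loss
    best = min(range(len(losses)), key=lambda i: losses[i])
    return paths[best]

def mutate_weights(weights, leading_index, jitter):
    # copy the leading vendor's weights to every other vendor, adding a small
    # uniform jitter; the leading vendor itself is left untouched
    leader = weights[leading_index]
    for i in range(len(weights)):
        if i == leading_index:
            continue
        weights[i] = [w + random.uniform(-jitter, jitter) for w in leader]
    return weights

# three candidate paths and their losses; path (1, 0) wins
leading = pick_leading_path([(0, 2), (1, 0), (1, 1)], [0.9, 0.3, 0.7])
# mutate the first layer's vendors around its leading vendor, index 1
weights = mutate_weights([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]], leading[0], 0.1)
```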

&lt;p&gt;That’s it.
The implementation is surprisingly simple.&lt;/p&gt;

&lt;h2 id=&quot;learning-rate-is-critical&quot;&gt;Learning Rate Is Critical&lt;/h2&gt;

&lt;p&gt;When experimenting with the Marketplace training algorithm, I realized that the learning rate plays a key role in successfully training a model, just as it does in backpropagation.
I tested various learning rates to determine which one yields the best training performance.
The learning rate schedule matters just as much as the initial value.
I need to reduce the learning rate over time so that the random adjustments shrink, allowing the model to fine-tune the final few decimal points of accuracy.
The following charts show how significantly performance varies across learning rates and decay schedules:&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-08-18-marketplace-my-first-attempt-at-training-without-backprop-on-gpu-efficiently/lr-accuracy.svg&quot; alt=&quot;A chart showing the accuracy achieved with different learning rates and decay schedules. The best combination is a learning rate of 1e-3 with a decay rate of 1e-4.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;Accuracy for different learning rates and decay schedules. The best combination is a learning rate of 1e-3 with a decay rate of 1e-4.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-08-18-marketplace-my-first-attempt-at-training-without-backprop-on-gpu-efficiently/lr-loss.svg&quot; alt=&quot;A chart showing the loss for different learning rates and decay schedules. The best combination is a learning rate of 1e-3 with a decay rate of 1e-4.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;Loss for different learning rates and decay schedules. The best combination is a learning rate of 1e-3 with a decay rate of 1e-4.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-08-18-marketplace-my-first-attempt-at-training-without-backprop-on-gpu-efficiently/lr-lr.svg&quot; alt=&quot;A chart showing the learning rate value over time for different initial learning rates and decay schedules.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;The learning rate value over time for different initial learning rates and decay schedules.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;For now, I rely on a trial-and-error approach to find the optimal learning rate.&lt;/p&gt;
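&lt;p&gt;As a sketch of what such a schedule can look like, here is an exponential decay in plain Python. The formula and names are assumptions for illustration; this is not necessarily the exact decay rule used in the experiments:&lt;/p&gt;

```python
import math

def lr_at_step(initial_lr, decay_rate, step):
    # assumed schedule for illustration: exponentially shrink the mutation
    # magnitude as training progresses
    return initial_lr * math.exp(-decay_rate * step)

# with initial_lr=1e-3 and decay_rate=1e-4, the rate shrinks smoothly over steps
schedule = [lr_at_step(1e-3, 1e-4, step) for step in (0, 1_000, 3_000)]
```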

&lt;h2 id=&quot;batch-normalization-doesnt-work-well-with-marketplace&quot;&gt;Batch Normalization Doesn’t Work Well with Marketplace&lt;/h2&gt;

&lt;p&gt;Interestingly, during training, I noticed that the accuracy of the testing dataset was fluctuating.
I investigated the issue and identified the cause as the &lt;a href=&quot;https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html&quot;&gt;Batch Normalization layer&lt;/a&gt;.
Because we sample many intermediate products, all of them shift the Batch Normalization layer’s running mean and variance.
Most of these products are eventually discarded, so their contribution is effectively noise.
During inference, Batch Normalization relies on the running mean and variance, which this noise has polluted.
As a result, the accuracy fluctuates.
To address this issue while retaining the benefits of normalization, I replaced the Batch Normalization layer with an &lt;a href=&quot;https://pytorch.org/docs/stable/generated/torch.nn.InstanceNorm2d.html&quot;&gt;Instance Normalization layer&lt;/a&gt;.
This change resolved the problem effectively!&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-08-18-marketplace-my-first-attempt-at-training-without-backprop-on-gpu-efficiently/batch-norm-vs-instance-norm-accuracy.svg&quot; alt=&quot;A chart shows the validation accuracy fluctuation for the Batch Normalization but not for the Instance Normalization.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A chart shows the validation accuracy fluctuation for the Batch Normalization but not for the Instance Normalization.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;
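&lt;p&gt;The key difference is that Instance Normalization uses only the statistics of the sample at hand, so there is no running state for discarded products to pollute. Here is a minimal sketch of per-sample normalization in plain Python (real &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;InstanceNorm2d&lt;/code&gt; normalizes each sample per channel over its spatial dimensions):&lt;/p&gt;

```python
def instance_norm(sample, eps=1e-5):
    # normalize a single sample with its own mean and variance; unlike Batch
    # Normalization, there are no running buffers shared across samples
    mean = sum(sample) / len(sample)
    var = sum((x - mean) ** 2 for x in sample) / len(sample)
    return [(x - mean) / (var + eps) ** 0.5 for x in sample]

normalized = instance_norm([1.0, 2.0, 3.0, 4.0])
```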

&lt;h2 id=&quot;experiments-as-code&quot;&gt;Experiments as code&lt;/h2&gt;

&lt;p&gt;I used to collect metrics with &lt;a href=&quot;https://www.tensorflow.org/tensorboard&quot;&gt;TensorBoard&lt;/a&gt; for my machine learning experiments.
But experiment runs accumulate quickly, and I would soon lose track of which run was which.
I realized that I needed to organize my experiments better.
I tried &lt;a href=&quot;https://mlflow.org&quot;&gt;MLflow&lt;/a&gt;, and I absolutely love it.
I highly recommend it if you’re doing any machine learning experiments.
With MLflow, I can run experiments as code: instead of manually launching the training script with different parameters, I write a script that sweeps over them.
Here’s an example for the learning rate experiment:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;exp_id&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ensure_experiment&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Learning Rate V2&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;batch_size&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;128&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;256&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;512&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lr&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;1e-2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;1e-3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;1e-4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;decay&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;1e-3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;1e-4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;1e-5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mlflow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;start_run&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;run_name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;bs-&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;batch_size&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;-lr-&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lr&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;-decay-&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;decay&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;experiment_id&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;exp_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;log_system_metrics&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;marketplace&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;make_marketplace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PYRAMID32_HALF_UPSTREAM_STRUCTURE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;train&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
          &lt;span class=&quot;n&quot;&gt;step_count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3_000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
          &lt;span class=&quot;n&quot;&gt;batch_size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;batch_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
          &lt;span class=&quot;n&quot;&gt;initial_lr&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
          &lt;span class=&quot;n&quot;&gt;lr_decay_rate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;decay&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
          &lt;span class=&quot;n&quot;&gt;marketplace&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;marketplace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Since MLflow logs which commit you are currently on, you can easily reproduce the experiment by checking out the commit and running the script again.
It’s a good practice to always commit your code before running an experiment so that the commit hash is always associated with the experiment.&lt;/p&gt;

&lt;p&gt;Here’s a screenshot of the MLflow dashboard:&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-08-18-marketplace-my-first-attempt-at-training-without-backprop-on-gpu-efficiently/mlflow-dashboard.png&quot; alt=&quot;A screenshot shows the MLflow dashboard.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A screenshot shows the MLflow dashboard.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;And with its parallel coordinates plot, you can easily see the relationship between the parameters and the performance.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-08-18-marketplace-my-first-attempt-at-training-without-backprop-on-gpu-efficiently/mlflow-parallel-coordinates-plot.png&quot; alt=&quot;A screenshot shows the MLflow parallel coordinates plot.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A screenshot shows the MLflow parallel coordinates plot.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;The lesson I learned from this is that I should have been using MLflow all along.
With the right tool, life is much easier.&lt;/p&gt;

&lt;h2 id=&quot;optimal-marketplace-structure&quot;&gt;Optimal Marketplace Structure&lt;/h2&gt;

&lt;p&gt;I didn’t pay much attention to the marketplace structure initially.
However, I soon realized that the structure is critical.
The beauty of the marketplace lies in its efficiency, as it reuses the same output multiple times.
However, if the structure is poorly designed, GPU resources can be wasted on computing intermediate products that are never used.
For example, with a low upstream sampling number, intermediate products in the upper layers have a lower chance of being selected downstream.
Running multiple forward passes could address this, but it’s not efficient.&lt;/p&gt;

&lt;p&gt;With full upstream sampling, all outputs from the previous layer are used.
Ideally, this is what you want, as all computational results are utilized downstream.
However, this approach causes the number of intermediate and final products to grow exponentially with the number of layers.
Therefore, when designing the marketplace structure, we must carefully consider the number of layers.&lt;/p&gt;
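&lt;p&gt;To make the exponential growth concrete, here is a tiny helper (purely illustrative, not part of the Marketplace codebase) that computes the final product count for a given vendor count and depth:&lt;/p&gt;

```python
# Final product count under full upstream sampling: every vendor in a
# layer consumes every intermediate product from the layer above, so the
# count multiplies by the vendor count M at each of the N layers.
def final_product_count(vendor_count: int, depth: int) -> int:
    return vendor_count ** depth

# The growth is exponential in depth: with 8 vendors per layer,
# 3 layers already yield 512 final products, and 6 layers over 260k.
for depth in (1, 2, 3, 6):
    print(depth, final_product_count(8, depth))
```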

&lt;p&gt;In the MNIST CNN network, I initially assigned each Conv2D layer—and nearly every other layer—to its own specification.
As a result, the marketplace had too many layers, forcing me to limit the upstream sampling value to keep the number of intermediate products down.
When I compared this against mutating all weights at once, the all-at-once performance was not as poor as I expected.
This is a relatively small model, with only about 80,000 parameters.
Given the limited exploration space, I concluded that having many layers in the marketplace was unnecessary.
I eventually reduced the depth of the marketplace to three with full upstream sampling.&lt;/p&gt;

&lt;h2 id=&quot;comparison-of-mutating-all-weights-at-once-vs-the-marketplace-approach&quot;&gt;Comparison of Mutating All Weights at Once vs. the Marketplace Approach&lt;/h2&gt;

&lt;p&gt;A much simpler approach compared to the Marketplace approach is to mutate all weights simultaneously (the all-at-once approach).
We can mathematically prove that the Marketplace approach is more efficient and quantify its gain over the all-at-once approach.&lt;/p&gt;

&lt;p&gt;Consider a model with \(N\) specifications, referred to as &lt;em&gt;market depth&lt;/em&gt;. Each specification has \(M\) vendors, representing different weight configurations.
Let \(S_i\) denote the computation cost of the forward pass for specification \(i\) (where \(i = 0, 1, \dots, N-1\)).&lt;/p&gt;

&lt;h3 id=&quot;all-at-once-approach&quot;&gt;All-at-Once Approach&lt;/h3&gt;
&lt;p&gt;In the all-at-once approach, we evaluate \(M\) different weight configurations across all \(N\) specifications.
The total computation cost \(C_A\) is:&lt;/p&gt;

\[C_A = M \times \sum_{i=0}^{N-1} S_i\]

&lt;p&gt;This yields \(M\) unique final products.
The unit cost per product, \(U_A\), is:&lt;/p&gt;

\[U_A = \frac{C_A}{M} = \sum_{i=0}^{N-1} S_i\]

&lt;h3 id=&quot;marketplace-approach&quot;&gt;Marketplace Approach&lt;/h3&gt;

&lt;p&gt;In the Marketplace approach, we use &lt;em&gt;full upstream sampling&lt;/em&gt;, processing intermediate outputs sequentially across specifications.
For the first specification, we feed the input to all \(M\) vendors, with a cost of:&lt;/p&gt;

\[C_{S_0} = M \times S_0\]

&lt;p&gt;This produces \(M\) intermediate outputs.
For the second specification, each of these \(M\) outputs is processed by all \(M\) vendors, yielding a cost of:&lt;/p&gt;

\[C_{S_1} = M^2 \times S_1\]

&lt;p&gt;For the third specification, the cost is:&lt;/p&gt;

\[C_{S_2} = M^3 \times S_2\]

&lt;p&gt;And so on. The total computation cost \(C_M\) is:&lt;/p&gt;

\[C_M = \sum_{i=0}^{N-1} M^{i+1} \times S_i\]

&lt;p&gt;This produces \(M^N\) distinct final products.
The unit cost per product, \(U_M\), is:&lt;/p&gt;

\[U_M = \frac{C_M}{M^N} = \sum_{i=0}^{N-1} \frac{M^{i+1} S_i}{M^N} = \sum_{i=0}^{N-1} \frac{S_i}{M^{N-i-1}}\]

&lt;h3 id=&quot;efficiency-comparison&quot;&gt;Efficiency Comparison&lt;/h3&gt;

&lt;p&gt;The unit cost \(U_M\) is significantly lower than \(U_A\), especially for large \(M\) or \(N\). To quantify this, assume \(S_i = S\) (constant cost per specification). For the all-at-once approach:&lt;/p&gt;

\[C_A = M \times N \times S, \quad U_A = N \times S\]

&lt;p&gt;For the Marketplace approach:&lt;/p&gt;

\[C_M = S \times \sum_{i=0}^{N-1} M^{i+1} = S \times M \times \frac{M^N - 1}{M - 1}\]

\[U_M = \frac{C_M}{M^N} = S \times \frac{M}{M - 1} \times \frac{M^N - 1}{M^N}\]

&lt;p&gt;For large \(M\), \(\frac{M^N - 1}{M^N} \approx 1\) and \(\frac{M}{M - 1} \approx 1\), so:&lt;/p&gt;

\[U_M \approx S\]

&lt;p&gt;The efficiency ratio is:&lt;/p&gt;

\[\frac{U_M}{U_A} \approx \frac{S}{N \times S} = \frac{1}{N}\]

&lt;p&gt;Thus, the Marketplace approach is approximately \(N\) times more efficient when \(M\) is large and \(S_i = S\). For non-constant \(S_i\), the efficiency depends on the distribution of \(S_i\), but the scaling factors \(M^{N-i-1}\) ensure \(U_M \ll U_A\) for \(M &amp;gt; 1\).&lt;/p&gt;

&lt;p&gt;Ideally, a larger \(N\) increases efficiency, as the unit cost scales inversely with \(N\).
However, the exponential growth in product count (\(M^N\)) limits scalability in practice. The vendor count \(M\) has some impact, but its effect diminishes beyond a threshold.&lt;/p&gt;
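&lt;p&gt;The constant-cost case is easy to check numerically. The helpers below are a sketch of the derived formulas (not code from the project), assuming \(S_i = S = 1\):&lt;/p&gt;

```python
# Unit costs from the derivation above, with a constant per-spec
# forward cost S. These helpers are illustrative only.
def unit_cost_all_at_once(n_specs: int, s: float = 1.0) -> float:
    # U_A = N * S: every final product pays the full forward pass.
    return n_specs * s

def unit_cost_marketplace(n_specs: int, vendors: int, s: float = 1.0) -> float:
    # U_M = C_M / M^N, where C_M = sum over i of M^(i+1) * S.
    total = sum(vendors ** (i + 1) * s for i in range(n_specs))
    return total / vendors ** n_specs

# With N = 3 and M = 16, the ratio U_M / U_A approaches 1/N.
n, m = 3, 16
ratio = unit_cost_marketplace(n, m) / unit_cost_all_at_once(n)
print(round(ratio, 3))  # → 0.355, close to 1/3
```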

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-08-18-marketplace-my-first-attempt-at-training-without-backprop-on-gpu-efficiently/3d-unit-cost-comparison.png&quot; alt=&quot;A 3D chart shows the relationship between N, M, and the per unit cost comparison between the Marketplace and the all-at-once approach. The x-axis is the number of layers, the y-axis is the number of vendors per layer, and the z-axis is the per unit cost comparison.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A 3D chart shows the relationship between N, M, and the per unit cost comparison between the Marketplace and the all-at-once approach. The x-axis is the depth of the marketplace, the y-axis is the number of vendors per specification, and the z-axis is the per unit cost comparison.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;In the context of GPUs, compute resources are relatively inexpensive.
With many cores running in parallel, a significant portion often remains underutilized.
However, memory is a scarcer resource, as model weights consume substantial memory.
We have established that the computational cost of the Marketplace approach is significantly lower than that of the all-at-once approach.
But what about memory costs?
The memory unit cost comparison is similar: per final product, the all-at-once approach requires roughly \(N\) times the memory of the Marketplace approach.&lt;/p&gt;

&lt;h3 id=&quot;benchmarking-the-marketplace-approach&quot;&gt;Benchmarking the Marketplace Approach&lt;/h3&gt;

&lt;p&gt;With the mathematical proof, we now know that the Marketplace approach is more efficient than the all-at-once approach. 
With lower computational and memory costs, the same computing resources can be used to explore a larger weight space. 
Another benefit is that the Marketplace approach allows partial mutation of the weights. 
Since the previous leading vendor continues to participate in the next round, partial mutations can be made by changing only the weights downstream of the previous leading vendor. 
This helps stabilize the training process in the later stages of training. 
Given these theoretical benefits, let’s benchmark the Marketplace approach against the all-at-once approach.&lt;/p&gt;

&lt;p&gt;The all-at-once approach is, in fact, a special case of the Marketplace approach, where the market depth is 1. 
Here is the specification for the all-at-once approach:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;Spec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;MultiModel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;MultiConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vendor_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;MultiConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vendor_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;MultiInstanceNorm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vendor_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max_pool2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;MultiConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vendor_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;MultiConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vendor_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;MultiInstanceNorm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vendor_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max_pool2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;flatten&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;MultiLinear&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vendor_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;576&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As you can imagine, given the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vendor_count&lt;/code&gt; value, we are essentially duplicating the same model &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vendor_count&lt;/code&gt; times. 
On the other hand, the Marketplace approach was designed with three specifications, each supporting &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vendor_count&lt;/code&gt; vendors.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;Spec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;MultiModel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;MultiConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vendor_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;MultiConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vendor_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;MultiInstanceNorm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vendor_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max_pool2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;Spec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;MultiModel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;MultiConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vendor_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;MultiConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vendor_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;MultiInstanceNorm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vendor_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max_pool2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;flatten&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;Spec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;MultiModel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;MultiLinear&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vendor_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;576&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]),&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Let’s try a few different configurations, and see how the training performance changes.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Market depth: \(1\), vendor count: \(8\), final product count: \(8\)&lt;/li&gt;
  &lt;li&gt;Market depth: \(1\), vendor count: \(16\), final product count: \(16\)&lt;/li&gt;
  &lt;li&gt;Market depth: \(1\), vendor count: \(32\), final product count: \(32\)&lt;/li&gt;
  &lt;li&gt;Market depth: \(1\), vendor count: \(64\), final product count: \(64\)&lt;/li&gt;
  &lt;li&gt;Market depth: \(3\), vendor count: \(8\), final product count: \(8^3=512\)&lt;/li&gt;
  &lt;li&gt;Market depth: \(3\), vendor count: \(16\), final product count: \(16^3=4096\)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And here’s the result:&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-08-18-marketplace-my-first-attempt-at-training-without-backprop-on-gpu-efficiently/market-depth-accuracy.svg&quot; alt=&quot;A chart shows the accuracy of different market depth and vendor count configurations with 50% smoothing. The x-axis is the step count, and the y-axis is the accuracy. The best accuracy is achieved with a market depth of 3 and a vendor count of 16. Followed by market depth of 3 and vendor count of 8 then other all-at-once configurations.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A chart shows the accuracy of different market depth and vendor count configurations with 50% smoothing. The x-axis is the step count, and the y-axis is the accuracy. The best accuracy is achieved with a market depth of 3 and a vendor count of 16. Followed by market depth of 3 and vendor count of 8 then other all-at-once configurations.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-08-18-marketplace-my-first-attempt-at-training-without-backprop-on-gpu-efficiently/market-depth-loss.svg&quot; alt=&quot;A chart shows the loss of different market depth and vendor count configurations with 50% smoothing. The x-axis is the step count, and the y-axis is the loss at log scale. The best loss is achieved with a market depth of 3 and a vendor count of 16. Followed by market depth of 3 and vendor count of 8 then other all-at-once configurations.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A chart shows the loss of different market depth and vendor count configurations with 50% smoothing. The x-axis is the step count, and the y-axis is the loss at log scale. The best loss is achieved with a market depth of 3 and a vendor count of 16. Followed by market depth of 3 and vendor count of 8 then other all-at-once configurations.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;As you can see, a market depth of 3 with a vendor count of 16 outperforms the all-at-once approach with a vendor count of 64. 
Even the market depth of 3 with a vendor count of 8 outperforms the all-at-once approach with a vendor count of 64!
This demonstrates that the Marketplace approach is more efficient than the all-at-once approach. 
It is capable of exploring a larger weight space with the same computational resources.&lt;/p&gt;

&lt;h2 id=&quot;comparison-with-backpropagation&quot;&gt;Comparison with Backpropagation&lt;/h2&gt;

&lt;p&gt;Some of you may ask: how does the Marketplace approach perform compared to backpropagation?
I am just as curious as you are.
To make the comparison, I took the &lt;a href=&quot;https://github.com/tinygrad/tinygrad/blob/d0d39885c386d730da29f6a9cc1fdac589319b9e/examples/beautiful_mnist.py&quot;&gt;beautiful_mnist&lt;/a&gt; example code from the Tinygrad repo.
As mentioned in the previous section, because the Batch Normalization layers don’t work well with the Marketplace approach, I replaced them with Instance Normalization.
And for the Marketplace approach, it’s a three-level-deep marketplace with 16 vendors per specification, a batch size of 512, and an initial learning rate of 1e-3 with a 1e-4 decay schedule.
Here’s the result:&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-08-18-marketplace-my-first-attempt-at-training-without-backprop-on-gpu-efficiently/backprop-cmp-accuracy.svg&quot; alt=&quot;A chart shows the accuracy of the Marketplace approach and backpropagation. The x-axis is the step count, and the y-axis is the accuracy. The backpropagation reaches 98% accuracy in 70 steps, while the Marketplace approach reaches 90% accuracy in 500 steps and 95% accuracy in 1,000 steps.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A chart shows the accuracy of the Marketplace approach and backpropagation. The x-axis is the step count, and the y-axis is the accuracy. The backpropagation reaches 98% accuracy in 70 steps, while the Marketplace approach reaches 90% accuracy in 629 steps and 95% accuracy in 2.3K steps.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-08-18-marketplace-my-first-attempt-at-training-without-backprop-on-gpu-efficiently/backprop-cmp-loss.svg&quot; alt=&quot;A chart shows the loss of the Marketplace approach and backpropagation. The x-axis is the step count, and the y-axis is the loss at log scale. The backpropagation reaches 0.0001 loss in 70 steps, while the Marketplace approach reaches 0.001 loss in 500 steps and 0.005 loss in 1,000 steps.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A chart shows the loss of the Marketplace approach and backpropagation. The x-axis is the step count, and the y-axis is the loss at log scale.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Well, not surprisingly, backpropagation is way better in this case.
I didn’t expect my newborn training algorithm to beat a state-of-the-art method backed by decades of research.
As you can see from the chart, backpropagation takes only 70 steps to reach 98% accuracy, while the Marketplace approach takes 629 steps to reach 90% accuracy with the same batch size.
It takes around 2.3K steps to reach roughly 95% accuracy.
After that, accuracy improves only slowly; with 3,000 steps, the Marketplace approach eventually reaches 96%.&lt;/p&gt;

&lt;h2 id=&quot;strengths-of-the-marketplace&quot;&gt;Strengths of the Marketplace&lt;/h2&gt;

&lt;p&gt;As you can see, backpropagation outperforms the Marketplace algorithm.
However, engineering is always about trade-offs.
Backpropagation clearly wins in this case, but the Marketplace approach still has its strengths.
Here, I list some key strengths of the Marketplace:&lt;/p&gt;

&lt;h3 id=&quot;scalability&quot;&gt;Scalability&lt;/h3&gt;

&lt;p&gt;The Marketplace approach has a potential advantage in scalability.
Scaling backpropagation with additional GPUs is challenging because its forward and backward passes are tightly coupled, and the backward pass requires strict sequential propagation, making parallelization difficult.
In contrast, the Marketplace approach is inherently scalable.
In theory, there is no limit to the number of vendors you can run within the same layer, as each can independently generate a variant of the current leading vendor by “rolling the dice.”
However, with more vendors, you may need to increase upstream sampling in downstream layers to evaluate them more thoroughly; otherwise, you risk wasting computational resources if intermediate products are underutilized.&lt;/p&gt;

&lt;p&gt;Some may ask: how do you synchronize vendor weights across nodes when scaling out? Since the current approach generates random weight deltas based on the learning rate and applies them to the leading vendor’s weights, we can use a random seed value to deterministically generate these deltas.
By sharing the seed across nodes, we can reconstruct the weights using the same deterministic random number generator.&lt;/p&gt;
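&lt;p&gt;Here is a minimal sketch of the seed-sharing idea, assuming the deltas are Gaussian noise scaled by the learning rate (the actual mutation scheme in the project may differ):&lt;/p&gt;

```python
import random

def mutate(leader_weights: list[float], seed: int, lr: float) -> list[float]:
    # Deterministically derive the weight deltas from the seed, so any
    # node holding the leader's weights plus the seed can rebuild this
    # vendor's weights without transferring them over the network.
    rng = random.Random(seed)
    return [w + rng.gauss(0.0, lr) for w in leader_weights]

# Two nodes sharing only the seed reconstruct identical vendor weights.
leader = [0.0] * 8
node_a = mutate(leader, seed=42, lr=1e-3)
node_b = mutate(leader, seed=42, lr=1e-3)
assert node_a == node_b
```

In practice you would broadcast one integer per vendor per round instead of the full weight tensors, which is what makes this scheme attractive for scaling out.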

&lt;p&gt;The Marketplace approach can be scaled along different axes.
With more vendors, you can run additional vendor variants in parallel.
However, more vendors do not necessarily improve mutation quality, as a limited batch size constrains how reliably each variant can be evaluated.
Another interesting approach is to run the same set of vendor variants across multiple nodes for different batches simultaneously.
It may not be necessary to process numerous batches in the early stages of training.
In the later stages, to optimize the final decimal points of the loss, you may need to process many batches.
I briefly tested this idea by simulating multiple instances, running several forward passes sequentially.
Counterintuitively, performance degraded with more forward passes.
I am still investigating the reason.
If it’s possible to scale on the dataset axis in parallel, it would be a significant improvement.&lt;/p&gt;

&lt;p&gt;If we can scale the Marketplace approach efficiently, could we also shorten training time?
For example, if LLM pre-training takes 6 months, could abundant GPUs shorten it to 6 weeks?
I think it’s possible, but it still requires a lot of research. 🤔&lt;/p&gt;

&lt;h3 id=&quot;optimizing-the-forward-pass-enhances-training&quot;&gt;Optimizing the Forward Pass Enhances Training&lt;/h3&gt;

&lt;p&gt;With backpropagation, the computational graph for training differs significantly from the forward pass.
Typically, to deploy a model in production, people optimize the forward pass, sometimes even building &lt;a href=&quot;https://en.wikipedia.org/wiki/Application-specific_integrated_circuit&quot;&gt;Application-Specific Integrated Circuits (ASICs)&lt;/a&gt; to accelerate applications like large language model (LLM) services.
The Marketplace relies solely on the forward pass, so optimizing the forward pass has the potential to improve training as well.
It would be interesting to explore building a hardware accelerator that serves both inference and training.&lt;/p&gt;

&lt;h3 id=&quot;better-gpu-utilization&quot;&gt;Better GPU Utilization&lt;/h3&gt;

&lt;p&gt;Another issue with backpropagation is that its two distinct compute graphs result in lower GPU utilization than is optimal.
In contrast, when observing the GPU usage rate during Marketplace training, it’s nearly always at 100%.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-08-18-marketplace-my-first-attempt-at-training-without-backprop-on-gpu-efficiently/gpu-usage.png&quot; alt=&quot;A screenshot shows the GPU usage rate during Marketplace training. The GPU usage rate is nearly always at 100%.&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;A chart shows the GPU usage rate during Marketplace training. The GPU usage rate is nearly always at 100%.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Backpropagation’s need to switch between forward and backward passes also makes it less cache-friendly.
As highlighted in the previously mentioned CUDA video, GPU performance heavily depends on memory layout and access patterns.
By relying solely on the forward pass, the Marketplace approach can better utilize GPUs.
I heard something funny about training with a large-scale GPU cluster: fluctuating GPU usage forced engineers to keep GPUs spinning idly to avoid sudden swings in power consumption, which placed significant strain on the power supply infrastructure.
I wonder if the Marketplace could help with this by focusing on the forward pass.&lt;/p&gt;

&lt;h3 id=&quot;no-need-to-make-the-model-differentiable&quot;&gt;No Need to Make the Model Differentiable&lt;/h3&gt;

&lt;p&gt;Another benefit of the Marketplace approach is that it doesn’t rely on computing gradients, so you don’t need to make your model differentiable.
As an interesting side note, with backpropagation, some corner cases, such as &lt;a href=&quot;https://en.wikipedia.org/wiki/Rectifier_(neural_networks)&quot;&gt;ReLU&lt;/a&gt;, are not differentiable everywhere but still function effectively.
Since backpropagation has been the dominant approach for decades, I guess there has been relatively little research into non-differentiable models and their benefits.
At this moment, I’m unsure of the full implications.
My gut feeling suggests that for low-precision training, or even integer-based training, the Marketplace approach could be particularly beneficial.
This is yet another intriguing topic for future research.&lt;/p&gt;

&lt;h2 id=&quot;a-blockchain-for-training-llms&quot;&gt;A Blockchain for Training LLMs&lt;/h2&gt;

&lt;p&gt;With the advantages of marketplace training, I can already envision some exciting applications.
What’s bigger than the AI hype?
How about blockchain + AI? 🤣
Seriously, though, I believe this is feasible.
Here’s how I imagine it could work.&lt;/p&gt;

&lt;p&gt;First, we need to publish the training data to make it easily accessible to all miners.
An open-source dataset like &lt;a href=&quot;https://commoncrawl.org&quot;&gt;Common Crawl&lt;/a&gt; could be used, but the specific dataset isn’t critical—the key is ensuring accessibility.
We then define the training batches in a deterministic order.
Each miner’s task is to find the next seeds for each layer that yield the lowest loss in combination.
Miners can share their seeds on a peer-to-peer network, encapsulated with their digital signatures and public key.
The miner with the lowest loss wins the mining reward.
Anyone should be able to reapply the deterministic training dataset to verify whether the proposed block is indeed the best.
If a branch occurs, the chain with the lowest loss prevails.
This approach enables a decentralized network to pre-train an LLM on a blockchain!
This sounds like an intriguing weekend project, but I’ll save it for another time. 😅&lt;/p&gt;
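&lt;p&gt;To make the mining-and-verification loop concrete, here’s a toy sketch under heavy assumptions: a stand-in loss function, stdlib randomness in place of GPU kernels, and no real networking or digital signatures. All the function names here are illustrative. The one property it demonstrates is the key one: a shared seed is enough for anyone to deterministically reproduce a proposed delta and verify its claimed loss.&lt;/p&gt;

```python
import random

def seed_delta(seed: int, size: int) -> list[float]:
    # Deterministically expand a seed into a weight delta,
    # so only the seed needs to be shared on the peer-to-peer network.
    rng = random.Random(seed)
    return [rng.gauss(0.0, 0.01) for _ in range(size)]

def loss(weights: list[float]) -> float:
    # Stand-in loss: distance of the weights from an arbitrary target.
    return sum((w - 0.5) ** 2 for w in weights)

def mine(weights: list[float], candidate_seeds: list[int]) -> tuple[int, float]:
    # A miner searches for the seed whose delta yields the lowest loss.
    best_seed, best_loss = candidate_seeds[0], float("inf")
    for seed in candidate_seeds:
        delta = seed_delta(seed, len(weights))
        trial = [w + d for w, d in zip(weights, delta)]
        trial_loss = loss(trial)
        if trial_loss < best_loss:
            best_seed, best_loss = seed, trial_loss
    return best_seed, best_loss

def verify(weights: list[float], seed: int, claimed_loss: float,
           tol: float = 1e-9) -> bool:
    # Anyone can recompute the loss from the shared seed
    # to check whether the proposed block is as good as claimed.
    delta = seed_delta(seed, len(weights))
    trial = [w + d for w, d in zip(weights, delta)]
    return abs(loss(trial) - claimed_loss) < tol

weights = [0.0] * 8
seed, best = mine(weights, list(range(100)))
assert verify(weights, seed, best)
```

A real version would replay the deterministic batch order against the full dataset, but the verify-by-recompute structure stays the same.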

&lt;h2 id=&quot;future-research&quot;&gt;Future Research&lt;/h2&gt;

&lt;p&gt;Now that we know the Marketplace algorithm can search for optimal neural network weights on a GPU, here are some ideas for future exploration, along with the challenges that remain.&lt;/p&gt;

&lt;h3 id=&quot;memory-efficiency&quot;&gt;Memory Efficiency&lt;/h3&gt;

&lt;p&gt;Currently, we store each vendor’s full weights, enabling parallel computation of layers with different inputs, which is highly GPU-efficient.
However, this approach consumes valuable GPU memory.
To explore a larger weight space with more vendor variants, we could implement seed-based random number generation for each vendor’s weights as mentioned in the previous section.
This way, we can store only the seeds, freeing up GPU memory while testing N vendor variants without losing the weights.&lt;/p&gt;
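&lt;p&gt;A minimal sketch of the idea, using Python’s stdlib &lt;code&gt;random&lt;/code&gt; in place of GPU-side generation (the function name is illustrative): storing only the base weights plus one seed per vendor is enough to reconstruct any variant on demand, because the expansion is deterministic.&lt;/p&gt;

```python
import random

def vendor_weights(base: list[float], seed: int,
                   scale: float = 0.01) -> list[float]:
    # Reconstruct a vendor's full weights on demand from (base, seed),
    # instead of keeping N full weight copies resident in GPU memory.
    rng = random.Random(seed)
    return [w + rng.gauss(0.0, scale) for w in base]

base = [0.1, 0.2, 0.3]
# Only the seeds are stored for the N vendor variants...
seeds = [7, 42, 1234]
variants = [vendor_weights(base, s) for s in seeds]
# ...and because the same seed always expands to the same weights,
# nothing is lost by discarding the materialized variants.
assert vendor_weights(base, 42) == vendor_weights(base, 42)
```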

&lt;h3 id=&quot;distributed-training&quot;&gt;Distributed Training&lt;/h3&gt;

&lt;p&gt;With deterministic seed-based random number generation in place, we could build a cluster to efficiently share seeds across nodes.
Each node could reconstruct weights with minimal bandwidth requirements and independently run vendor variants.
The only significant data transfer between nodes would likely be the intermediate products, especially if we allow cross-node upstream sampling of products.&lt;/p&gt;

&lt;h3 id=&quot;improved-random-weight-generation-momentum&quot;&gt;Improved Random Weight Generation: Momentum&lt;/h3&gt;

&lt;p&gt;Currently, weight deltas are generated randomly, but the best-performing deltas contain valuable information.
If a CNN filter consistently moves in a specific direction that yields good results, that direction may be worth pursuing.
Introducing a momentum concept, similar to many optimizers, such as &lt;a href=&quot;https://arxiv.org/abs/1412.6980&quot;&gt;Adam&lt;/a&gt;, could be a valid approach to accelerate training.
For example, if a direction proves effective multiple times, we could bias random weight updates toward that direction to speed up convergence.
The improvement to the delta generator doesn’t need to be substantial.
A slight enhancement, combined with vendor scaling, could lead to significant performance improvements.&lt;/p&gt;
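&lt;p&gt;A toy sketch of what a momentum-biased delta generator could look like, assuming an exponential moving average of winning deltas in the style of SGD momentum (the class name and update rule are my own illustration, not a spec):&lt;/p&gt;

```python
import random

class MomentumDeltaGenerator:
    """Random weight deltas biased toward directions that recently won."""

    def __init__(self, size: int, beta: float = 0.9, scale: float = 0.01):
        self.momentum = [0.0] * size
        self.beta = beta    # how strongly past winning directions persist
        self.scale = scale  # magnitude of fresh random exploration
        self.rng = random.Random(0)

    def propose(self) -> list[float]:
        # Each proposal = momentum term + fresh random noise, so the
        # search still explores but leans toward proven directions.
        return [m + self.rng.gauss(0.0, self.scale) for m in self.momentum]

    def update(self, winning_delta: list[float]) -> None:
        # Fold the winning delta into the momentum, EMA-style.
        self.momentum = [
            self.beta * m + (1.0 - self.beta) * d
            for m, d in zip(self.momentum, winning_delta)
        ]

gen = MomentumDeltaGenerator(size=4)
# If the same direction wins repeatedly, proposals drift toward it.
for _ in range(20):
    gen.update([1.0, 1.0, 1.0, 1.0])
```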

&lt;h3 id=&quot;sub-optimal-or-bad-final-products-are-still-valuable&quot;&gt;Sub-Optimal or Bad Final Products Are Still Valuable&lt;/h3&gt;

&lt;p&gt;The current approach only keeps the best final product and its delta.
But what about the others?
While these final products and their corresponding vendor deltas don’t perform as well as the best one, they still contain valuable information.
With enough delta permutations and their corresponding final products, is it possible to statistically attribute which parameter in a big chunk of delta is responsible for a better final product?
For example, we could take the loss generated from the final products and attribute it to the delta that generated each final product.
By merging the delta, weighted by the attributions from the loss, we might be able to compose a better overall delta that accounts for both good and bad outcomes.
I think this is yet another very interesting idea to explore.&lt;/p&gt;
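&lt;p&gt;One possible shape for this, as a toy sketch: turn each delta’s loss into an attribution weight (here via a softmax over negative losses, which is my assumption, not something specified above) and merge all deltas as a weighted average, so well-performing deltas dominate while poorly-performing ones contribute only a little.&lt;/p&gt;

```python
import math

def merge_deltas(deltas: list[list[float]],
                 losses: list[float]) -> list[float]:
    # Convert losses into attribution weights: lower loss -> larger weight.
    # Softmax over negative losses keeps weights positive and normalized.
    neg = [-l for l in losses]
    top = max(neg)
    exps = [math.exp(n - top) for n in neg]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted average of all deltas: even "bad" vendors contribute
    # information, just with a small weight.
    size = len(deltas[0])
    return [sum(w * d[i] for w, d in zip(weights, deltas))
            for i in range(size)]

deltas = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
losses = [0.1, 0.5, 2.0]  # the first delta performed best
merged = merge_deltas(deltas, losses)
```

The merged delta leans toward the first (best) delta’s direction while being pushed slightly away from the direction of the worst one.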

&lt;h3 id=&quot;training-larger-and-diverse-models&quot;&gt;Training Larger and Diverse Models&lt;/h3&gt;

&lt;p&gt;The Marketplace algorithm performs well with the small MNIST CNN, but its scalability remains untested.
How effectively does it handle larger models?
Beyond CNNs, can it be applied to other architectures, such as transformers?
I’m interested in pre-training a large language model (LLM) with this algorithm to evaluate its performance and scalability.
I see no particular reason why it wouldn’t work.
Maybe it just takes longer to train, but I hope to scale out to compensate for the time.
In fact, I’ve already trained a larger model, &lt;a href=&quot;https://github.com/tinygrad/tinygrad/blob/d0d39885c386d730da29f6a9cc1fdac589319b9e/extra/models/resnet.py&quot;&gt;ResNet-18&lt;/a&gt; (approximately 11 million parameters), using the Marketplace approach on a scaled MNIST dataset with limited steps.
The results seem promising, though I didn’t run the experiment for long.
It was a quick test, and I didn’t optimize the code.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-08-18-marketplace-my-first-attempt-at-training-without-backprop-on-gpu-efficiently/resnet-18-accuracy.svg&quot; alt=&quot;A chart shows the accuracy of the Marketplace approach on the scaled MNIST dataset with 100% smoothing. The x-axis is the step count, and the y-axis is the accuracy. With limited steps, the accuracy is trending up steadily showing progress of the training.&quot; /&gt;
  &lt;figcaption&gt;
&lt;p&gt;A chart shows the accuracy of the Marketplace approach on the scaled MNIST dataset with 100% smoothing. The x-axis is the step count, and the y-axis is the accuracy. With limited steps, the accuracy is trending up steadily, showing the progress of training.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-08-18-marketplace-my-first-attempt-at-training-without-backprop-on-gpu-efficiently/resnet-18-loss.svg&quot; alt=&quot;A chart shows the loss of the Marketplace approach on the scaled MNIST dataset with 100% smoothing. The x-axis is the step count, and the y-axis is the loss at log scale. With limited steps, the loss is trending down steadily showing progress of the training.&quot; /&gt;
  &lt;figcaption&gt;
&lt;p&gt;A chart shows the loss of the Marketplace approach on the scaled MNIST dataset with 100% smoothing. The x-axis is the step count, and the y-axis is the loss at log scale. With limited steps, the loss is trending down steadily, showing the progress of training.&lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;With a larger model, each specification has more parameters to mutate, meaning a larger space to explore.
However, we could address this by breaking it down into smaller specifications or using a larger batch size to explore the space more effectively.
Another interesting idea is to shift the focus to different layers over time.&lt;/p&gt;

&lt;h3 id=&quot;focused-mutation&quot;&gt;Focused Mutation&lt;/h3&gt;

&lt;p&gt;During the experiments, I forgot to remove a flag that prevents a layer from mutating.
Surprisingly, this led to improved training performance.
My hypothesis is that freezing some layers reduces the search space while maintaining the same capacity, allowing better focus on optimizing the remaining layers and thus improving training performance.
I wonder if it makes sense to train only a few layers at a time for very large and deep models.
Strategically switching focus to a few layers with large-scale vendor variants could simplify training.
For example, if the first layer is a convolutional layer and the second is a fully connected layer, we could focus on training the first layer for a while, then switch to the second.
In a CNN, the initial layers extract basic features, while later layers capture more complex features.
Assuming the basic feature layers are sufficiently trained, their filters may not change much.
If this assumption holds, we could focus more on the later layers in the later stages of training.
This is another intriguing idea to test.&lt;/p&gt;
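&lt;p&gt;The phased freezing idea can be sketched in a few lines, assuming a model expressed as named layers (the names and helper are illustrative): only the layers in the active set receive random deltas, so each phase searches a much smaller space while the frozen layers keep their capacity.&lt;/p&gt;

```python
import random

rng = random.Random(0)

def mutate(weights: dict[str, list[float]],
           active_layers: set[str],
           scale: float = 0.05) -> dict[str, list[float]]:
    # Only layers in `active_layers` receive random deltas;
    # frozen layers are copied unchanged, shrinking the search space.
    return {
        name: [w + rng.gauss(0.0, scale) for w in ws]
        if name in active_layers else list(ws)
        for name, ws in weights.items()
    }

model = {"conv1": [0.1] * 4, "conv2": [0.2] * 4, "fc": [0.3] * 4}

# Phase 1: focus the search on the early feature-extraction layer.
phase1 = mutate(model, active_layers={"conv1"})
# Phase 2: freeze the (assumed sufficiently trained) early layer and
# spend the mutation budget on the later layers instead.
phase2 = mutate(model, active_layers={"conv2", "fc"})

assert phase1["fc"] == model["fc"]        # frozen layer untouched
assert phase1["conv1"] != model["conv1"]  # focused layer mutated
```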

&lt;h3 id=&quot;optimizing-the-performance-of-the-marketplace&quot;&gt;Optimizing the Performance of the Marketplace&lt;/h3&gt;

&lt;p&gt;Currently, I’ve implemented the Marketplace algorithm using simple Tinygrad code.
I haven’t invested much effort in optimizing the code, but I believe there are many opportunities to improve performance.
For example, we could maximize GPU utilization or optimize memory usage by analyzing memory patterns and making them cache-friendly.
Alternatively, writing custom CUDA kernels could further enhance performance.
I’m unsure if I’ll have time to pursue these optimizations, but they are worth exploring.&lt;/p&gt;

&lt;h3 id=&quot;design-models-for-the-marketplace&quot;&gt;Design Models for the Marketplace&lt;/h3&gt;

&lt;p&gt;So far, all the models we have explored are designed for training with backpropagation.
There are many factors to consider when designing models for backpropagation.
For example, the model must be differentiable, and the initial weights need to be carefully chosen to avoid exploding gradients.
We also use skip connections to mitigate vanishing gradients.
Additionally, numerous techniques exist to enhance the training process specifically for backpropagation.
But what about the Marketplace approach?
What if we design a model optimized for the Marketplace approach, specifically tailored for distributing computation across multiple GPUs and nodes?
This is yet another fascinating topic to explore.&lt;/p&gt;

&lt;h3 id=&quot;combining-the-marketplace-approach-with-backpropagation&quot;&gt;Combining the Marketplace Approach with Backpropagation&lt;/h3&gt;

&lt;p&gt;I wonder if we can combine the Marketplace approach with backpropagation to train the same model in different scenarios.
For example, if backpropagation gets trapped in a local minimum, can we switch to the Marketplace approach to escape it?
Alternatively, could we alternate between the Marketplace approach and backpropagation at different stages of training to enhance the process?
Before we can answer these questions, we need to understand the nature of the Marketplace approach.
There are already so many intriguing questions to explore.&lt;/p&gt;
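&lt;p&gt;As a toy illustration of alternating the two, here’s a sketch on a 1-D quadratic: a gradient step stands in for backpropagation, and a sample-random-deltas-keep-the-best step stands in for the Marketplace. The fixed alternating schedule is my simplification; a real version might switch only when the loss plateaus.&lt;/p&gt;

```python
import random

rng = random.Random(0)

def loss(w: float) -> float:
    return (w - 3.0) ** 2

def grad_step(w: float, lr: float = 0.1) -> float:
    # "Backpropagation" phase: follow the analytic gradient of the toy loss.
    return w - lr * 2.0 * (w - 3.0)

def marketplace_step(w: float, n_vendors: int = 16,
                     scale: float = 0.5) -> float:
    # "Marketplace" phase: sample random deltas, keep the best candidate.
    # Including the current w means this step never makes things worse.
    candidates = [w] + [w + rng.gauss(0.0, scale) for _ in range(n_vendors)]
    return min(candidates, key=loss)

w = 0.0
for step in range(40):
    # Alternate phases on a fixed schedule (a plateau-triggered switch
    # would be the more realistic policy).
    w = grad_step(w) if step % 2 == 0 else marketplace_step(w)
```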

&lt;h2 id=&quot;final-thoughts&quot;&gt;Final Thoughts&lt;/h2&gt;

&lt;p&gt;That’s it!
I learned a lot from this project.
Whether or not people find my research side project interesting, it was a fun ride for me.
It was a great exercise to think about optimizing the training process and designing an algorithm for GPU efficiency.
Now I have more confidence thinking about algorithms through the lens of GPU efficiency.
There could be more interesting ideas to explore with GPUs.&lt;/p&gt;

&lt;p&gt;I believe this is the best time to be alive.
People from ten years ago could hardly have imagined that we’d have abundant supercomputers in our homes.
With modern hardware, it’s incredible that I can conduct end-to-end experiments on my own in such a short time.
We’re entering a new era of personal supercomputing.
I hope you enjoyed reading about my Marketplace algorithm.
Please leave a comment with any feedback or questions.&lt;/p&gt;
</description>
        <pubDate>Mon, 18 Aug 2025 07:00:00 +0000</pubDate>
        <link>https://fangpenlin.com/posts/2025/08/18/marketplace-my-first-attempt-at-training-without-backprop-on-gpu-efficiently/</link>
        <guid isPermaLink="true">https://fangpenlin.com/posts/2025/08/18/marketplace-my-first-attempt-at-training-without-backprop-on-gpu-efficiently/</guid>
      </item>
    
      <item>
        <title>CakeLens V5, the AI-gen video detection model is now open-sourced!</title>
        <description>&lt;style type=&quot;text/css&quot;&gt;
  figure {
    text-align: center;
    margin: 0 auto;
  }
  figcaption, figcaption p {
    color: grey;
    font-size: 0.9em;
  }
  figure img {
    max-width: 100%;
  }
&lt;/style&gt;

&lt;p&gt;Today, I’m excited to announce that the &lt;a href=&quot;http://cakelens.ai&quot;&gt;CakeLens&lt;/a&gt; v5 AI-generated video detection model is now open-source!
Why open-source, you might ask?
Well, the CakeLens project serves multiple purposes for me. (For more background, see my earlier article: &lt;a href=&quot;/posts/2025/06/03/i-built-ai-gen-video-detection-model-and-browser-extension-in-a-month/&quot;&gt;I built an AI-gen detection model for videos, here’s what I learned&lt;/a&gt;.)
The most critical one was to teach myself how to build a machine learning-powered product from end to end, including data collection, labeling, model design, training, and inference.
It has already achieved that goal.&lt;/p&gt;

&lt;p&gt;Beyond personal learning, I hoped CakeLens could uncover some business value.
However, the reality is that most users don’t care much whether the funny videos they see online are AI-generated or not.
In an enterprise context, though, detecting AI-generated content is more critical.
For example, ID verification services often require users to hold an ID and take a photo or short video.
What happens if scammers generate realistic fake photos or videos in the future?
What if job candidates alter their face or voice?
These issues are &lt;a href=&quot;https://www.linkedin.com/feed/update/urn:li:activity:7292604406464671744/&quot;&gt;already happening&lt;/a&gt;.&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-07-30-open-source-cakelens-v5/face-swapped-interview.png&quot; alt=&quot;Example of face-swapped job interview video&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;
      Screenshot from a LinkedIn post (&lt;a href=&quot;https://www.linkedin.com/feed/update/urn:li:activity:7292604406464671744/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;source&lt;/a&gt;) showing a job interview conducted via video call, where the candidate&apos;s face has been swapped using AI-generated imagery. This highlights real-world cases of AI-generated content being used for identity fraud in professional settings.
    &lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;I believe this represents a new market category yet to be fully addressed.
CakeLens is my initial step into exploring this space.&lt;/p&gt;

&lt;p&gt;While the CakeLens v5 model for detecting AI-generated videos works reasonably well with a limited dataset (77% precision, 74% recall with 50% threshold), its accuracy is still too low for enterprise use cases.
I searched for similar open-source models but found none.
I guess cool kids are mostly focused on large language models (LLMs) right now.
Although this isn’t a flashy moonshot project, I believe open-sourcing CakeLens v5 offers educational value to others. So, here we are!&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;how-can-i-use-it&quot;&gt;How Can I Use It?&lt;/h2&gt;

&lt;p&gt;I have uploaded the model weights to &lt;a href=&quot;https://huggingface.co/fangpenlin/cakelens-v5&quot;&gt;Hugging Face&lt;/a&gt;.
The model is relatively small, with a size of only 3.25 GB and 270 million parameters.
I have also created an open-source library, &lt;a href=&quot;https://github.com/LaunchPlatform/cakelens-v5&quot;&gt;cakelens-v5&lt;/a&gt;, for using the model in your Python projects or as a command-line tool.
To use it as a command-line tool, you can run it with &lt;a href=&quot;https://docs.astral.sh/uvx/&quot;&gt;uvx&lt;/a&gt; like this:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;uvx &lt;span class=&quot;nt&quot;&gt;--with&lt;/span&gt; torch &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;--with&lt;/span&gt; torchvision &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;--with&lt;/span&gt; torchcodec &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;--with&lt;/span&gt; huggingface-hub &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  cakelens-v5 path/to/your/video.mp4
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then you will see the output like this:&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-07-30-open-source-cakelens-v5/cakelens-cli-output-screenshot.png&quot; alt=&quot;Output of uvx cakelens-v5 command-line tool&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;
      Output of uvx cakelens-v5 command-line tool.
    &lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Of course, you can also use it in your Python projects.
First, install the library:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pip &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;cakelens-v5
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;You also need to install the dependencies:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pip &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;torch torchvision torchcodec huggingface-hub
&lt;span class=&quot;c&quot;&gt;# or, if you want to use the model with CUDA:&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# pip install torch torchvision torchcodec \&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#   huggingface-hub \&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#   --index-url https://download.pytorch.org/whl/cu128&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then you can use it with the following code:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;pathlib&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;cakelens.detect&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Detector&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;cakelens.model&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Model&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Create model and load from Hugging Face Hub
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# load the model weights from Hugging Face Hub
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;load_from_huggingface_hub&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# or, if you have a local model file:
# model.load_state_dict(torch.load(&quot;model.pt&quot;)[&quot;model_state_dict&quot;])
&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# Create detector
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;detector&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Detector&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;batch_size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;device&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;cpu&quot;&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# or &quot;cuda&quot;, &quot;mps&quot;, or None for auto-detection
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Run detection
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;video_path&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pathlib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;video.mp4&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;verdict&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;detector&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;detect&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;video_path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Access results
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Video: &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;verdict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;video_filepath&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Frame count: &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;verdict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;frame_count&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Predictions:&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;prob&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;enumerate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;verdict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;predictions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;  Label &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;: &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;prob&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;%&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;why-v5&quot;&gt;Why v5?&lt;/h2&gt;

&lt;p&gt;When designing and training this model, I tried many approaches.
For instance, I cropped a small window across multiple frames to test performance.
I experimented with time-wise CNN layers followed by spatial layers and many other variants.
After numerous iterations, I landed on v5, which makes somewhat reliable predictions.
Since I went through so many versions to get here, I decided to start with v5 as the first public release—it just sounds cooler! 🤣&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-07-30-open-source-cakelens-v5/train-batches.png&quot; alt=&quot;Screenshot of the training batches page of our internal tool showing the dataset size growing across different iterations&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;
      Screenshot of the training batches page of our internal tool showing the dataset size growing across different iterations.
    &lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h2 id=&quot;data-collection&quot;&gt;Data Collection&lt;/h2&gt;

&lt;p&gt;For v5, the dataset consists of 5,093 videos for training and 498 for testing, collected from X and labeled manually one by one.
The dataset is randomly split into two groups: 90% for training and 10% for testing.
When a post’s author indicates that a video is AI-generated and specifies the model used, I label it accordingly if the information seems reliable.
The videos come in various resolutions, such as 1080p, 720p, and others, depending on the original uploaded resolution.
Not all videos have high-resolution variants available.
I fed different resolution variants into the model to ensure it can detect patterns regardless of image size.&lt;/p&gt;

&lt;h2 id=&quot;labels&quot;&gt;Labels&lt;/h2&gt;

&lt;p&gt;When designing this model, I wanted to see if it could identify which model was used to generate a video.
Therefore, I added labels for each video generation model.
Additionally, some videos are anime or video game content.
I also wanted to know if the model could distinguish between 2D anime, 3D anime, or video game styles, so I included those labels as well.
Despite having more labels than just “AI-generated or not,” the limited dataset size means the model’s predictions for these labels aren’t very accurate yet.
Below is the list of labels:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;AI_GEN&lt;/em&gt;: Is the video AI-generated or not?&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;ANIME_2D&lt;/em&gt;: Is the video in 2D anime style?&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;ANIME_3D&lt;/em&gt;: Is the video in 3D anime style?&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;VIDEO_GAME&lt;/em&gt;: Does the video look like a video game?&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;KLING&lt;/em&gt;: Is the video generated by Kling?&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;HIGGSFIELD&lt;/em&gt;: Is the video generated by Higgsfield?&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;WAN&lt;/em&gt;: Is the video generated by Wan?&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;MIDJOURNEY&lt;/em&gt;: Is the video generated using images from Midjourney?&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;HAILUO&lt;/em&gt;: Is the video generated by Hailuo?&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;RAY&lt;/em&gt;: Is the video generated by Ray?&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;VEO&lt;/em&gt;: Is the video generated by Veo?&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;RUNWAY&lt;/em&gt;: Is the video generated by Runway?&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;SORA&lt;/em&gt;: Is the video generated by Sora?&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;CHATGPT&lt;/em&gt;: Is the video generated using images from ChatGPT?&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;PIKA&lt;/em&gt;: Is the video generated by Pika?&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;HUNYUAN&lt;/em&gt;: Is the video generated by Hunyuan?&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;VIDU&lt;/em&gt;: Is the video generated by Vidu?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of course, new video generation models are always emerging, such as Midjourney Video, which is too new to be included in v5.&lt;/p&gt;

&lt;h2 id=&quot;the-architecture&quot;&gt;The Architecture&lt;/h2&gt;

&lt;p&gt;The architecture of the model looks like:&lt;/p&gt;

&lt;figure&gt;
  &lt;img src=&quot;/images/2025-07-30-open-source-cakelens-v5/model-structure.svg&quot; alt=&quot;Diagram of CakeLens v5 model architecture&quot; /&gt;
  &lt;figcaption&gt;
    &lt;p&gt;
Diagram of the CakeLens v5 model architecture. The model processes cropped and resized video frames (9 frames at 512x512 resolution) through an initial convolutional input layer, followed by a series of six space-time convolutional blocks. Each block consists of a spatial convolution (3x3 kernel), a temporal convolution (3x1 kernel), ReLU activations, and instance normalization, with skip connections forming a residual network. The final output passes through a fully connected layer to produce predictions for each label.
    &lt;/p&gt;
  &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;As you can see, it’s a simple CNN with six space-time layers plus one input layer.
The input layer primarily reduces computational cost by downsizing the input video.
Each space-time layer consists of a spatial layer with CNN kernels that detect features in the spatial domain, followed by a temporal layer that looks for features in the time domain.
These layers are connected with skip connections, forming a residual network that makes training more efficient.&lt;/p&gt;

&lt;p&gt;For video input, I break the file into framesets, each containing nine video frames at a resolution of 512x512 pixels.
Videos larger than this are cropped, while smaller ones are centered and padded with zeros to reach 512x512.
This resolution choice reduces computational cost.
Additionally, cropping to 512x512 helps eliminate data leakage from elements like TikTok logos, TV channel icons, news banners, or AI-generated watermarks (e.g., “Veo”) that often appear at the edges or corners.
I didn’t have time to mask these elements, so cropping minimizes their impact on the model’s decisions.&lt;/p&gt;
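To make the preprocessing concrete, here is a plain-Python sketch of the crop/pad and frameset logic described above. This is my own illustration, not the actual CakeLens code; the helper names (`center_crop_or_pad`, `make_framesets`) are invented, and I assume a trailing remainder shorter than nine frames is simply dropped.

```python
# Illustrative sketch of the frameset preprocessing (not the real
# CakeLens code; helper names are my own).

TARGET = 512   # target width/height in pixels
FRAMES = 9     # frames per frameset

def center_crop_or_pad(frame, fill=0):
    """Center-crop a frame larger than TARGET, or zero-pad a smaller
    one to TARGET x TARGET. `frame` is a list of rows of pixel values."""
    h, w = len(frame), len(frame[0])
    # Vertical axis: crop the centered window, or pad above and below.
    if h >= TARGET:
        top = (h - TARGET) // 2
        rows = frame[top:top + TARGET]
    else:
        pad_top = (TARGET - h) // 2
        pad_bottom = TARGET - h - pad_top
        rows = ([[fill] * w for _ in range(pad_top)] + frame
                + [[fill] * w for _ in range(pad_bottom)])
    # Horizontal axis: same treatment per row.
    out = []
    for row in rows:
        if w >= TARGET:
            left = (w - TARGET) // 2
            out.append(row[left:left + TARGET])
        else:
            pad_left = (TARGET - w) // 2
            pad_right = TARGET - w - pad_left
            out.append([fill] * pad_left + row + [fill] * pad_right)
    return out

def make_framesets(frames, size=FRAMES):
    """Split decoded frames into consecutive 9-frame framesets,
    dropping a trailing remainder shorter than `size` (my assumption)."""
    return [frames[i:i + size] for i in range(0, len(frames) - size + 1, size)]
```

Cropping large frames and padding small ones this way keeps the center of the picture, which is also where edge/corner watermarks are least likely to appear.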

&lt;p&gt;Interestingly, when I shared this project, someone asked why I used a CNN.
I hadn’t thought about it initially and didn’t know how to respond.
Upon reflection, I chose CNNs because I assumed subtle space-time patterns exist in video frame sequences, and CNN kernels can detect these features across larger ranges in deeper layers.
From a product perspective, though, the specific model choice matters less than delivering value and enabling data collection.
I can always swap out the model for something better later.&lt;/p&gt;
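The intuition that deeper CNN layers detect features across larger ranges can be made concrete with the standard receptive-field arithmetic for stacked stride-1 convolutions. This is a generic calculation (assuming stride 1 and no dilation, and ignoring the downsizing input layer), not something taken from the CakeLens code:

```python
# Receptive field of stacked stride-1 convolutions: each k-wide layer
# adds (k - 1) to the span a single output value can see.

def receptive_field(kernel_sizes):
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

# Spatial axis: six space-time blocks, each with a 3x3 spatial conv.
print(receptive_field([3] * 6))  # 13: a 13x13 pixel window per output

# Temporal axis: six 3x1 temporal convs over the 9-frame frameset.
print(receptive_field([3] * 6))  # 13 frames > 9 available, so the
                                 # deepest layer effectively sees the
                                 # whole frameset
```

So even with small 3x3 and 3x1 kernels, six stacked blocks are enough for the last layer to integrate patterns across the entire nine-frame window.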

&lt;h2 id=&quot;next-version&quot;&gt;Next Version&lt;/h2&gt;

&lt;p&gt;I’ve been training the v6 model for a while using a workstation with two AMD 7900XTX GPUs, as mentioned in a &lt;a href=&quot;/posts/2025/06/11/two-amd-7900xtx-gpus-tinygrad-based-training-workstation-peer-to-peer-pcie-communication/&quot;&gt;previous article&lt;/a&gt;.
It’s not as fast as an H100 from Modal, so progress is slower.
The v6 architecture is largely the same but includes minor adjustments and a larger dataset.
Limited computing resources prevent me from experimenting with larger or more complex models for now.
I’ll let v6 train for a bit while I shift focus to other projects.&lt;/p&gt;

&lt;h2 id=&quot;finally&quot;&gt;Finally&lt;/h2&gt;

&lt;p&gt;I hope you find CakeLens v5 somewhat useful or, at the very least, educational.
While I can’t dedicate much time to this project right now, it’s now open-source, and I’d love to hear your feedback on how to improve it.
Leave a comment below and let me know! Thanks!&lt;/p&gt;
</description>
        <pubDate>Wed, 30 Jul 2025 07:00:00 +0000</pubDate>
        <link>https://fangpenlin.com/posts/2025/07/30/open-source-cakelens-v5/</link>
        <guid isPermaLink="true">https://fangpenlin.com/posts/2025/07/30/open-source-cakelens-v5/</guid>
      </item>
    
      <item>
        <title>Two AMD 7900XTX GPUs in a Tinygrad-Based Training Workstation with Peer-to-Peer PCIe Communication</title>
        <description>&lt;style type=&quot;text/css&quot;&gt;
  figure {
    text-align: center;
    margin: 0 auto;
  }
  figcaption, figcaption p {
    color: grey;
    font-size: 0.9em;
  }
  figure img {
    max-width: 100%;
  }
&lt;/style&gt;

&lt;p&gt;I’ve been diving into machine learning projects lately, and I enjoy it a lot.
However, one thing bothers me: I lack the computing power to test many interesting ideas.
In my previous &lt;a href=&quot;/posts/2025/06/03/i-built-ai-gen-video-detection-model-and-browser-extension-in-a-month/&quot;&gt;article for CakeLens&lt;/a&gt;, I designed and trained a model to detect AI-generated videos.
But due to limited local computing power, I had to rent H100/A100 GPUs from Modal to experiment with different approaches.
And it’s not cheap:&lt;/p&gt;

&lt;figure&gt;&lt;img src=&quot;/images/2025-06-11-two-amd-7900xtx-gpus-tinygrad-based-training-workstation-peer-to-peer-pcie-communication/modal-cost.png&quot; alt=&quot;Screenshot of Modal&apos;s billing dashboard showing the total cost as $2.1K&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;Screenshot of Modal&apos;s billing dashboard showing the total cost as $2.1K&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;My PC has an RTX 4090, so I could run training locally.
However, memory constraints make it painful to train larger models.
Even when a model fits on the GPU, the intensive computation consumes all GPU resources, rendering my PC unusable.
To solve this, I need more local computing power to run machine learning experiments without breaking the bank.&lt;/p&gt;

&lt;p&gt;My first idea was to buy another RTX 4090 or perhaps an RTX 5090.
I checked prices online and was shocked.
I bought my current 4090 for around $2,000 USD, but now they’re selling for $3,600 on Amazon.
That’s insane! 🤯&lt;/p&gt;

&lt;figure&gt;&lt;img src=&quot;/images/2025-06-11-two-amd-7900xtx-gpus-tinygrad-based-training-workstation-peer-to-peer-pcie-communication/rtx-4090-price.png&quot; alt=&quot;Screenshot of Amazon product page, featuring Asus ROG Strix RTX 4090 GPU selling at $3,599.95&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;Screenshot of Amazon product page, featuring Asus ROG Strix RTX 4090 GPU selling at $3,599.95&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;Curious about the cost of an H100 for my home office, I checked its price:&lt;/p&gt;

&lt;figure&gt;&lt;img src=&quot;/images/2025-06-11-two-amd-7900xtx-gpus-tinygrad-based-training-workstation-peer-to-peer-pcie-communication/h100-price.png&quot; alt=&quot;Screenshot of Amazon product page, featuring Nvidia Tesla H100 GPU selling at $25,249.95&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;Screenshot of Amazon product page, featuring Nvidia Tesla H100 GPU selling at $25,249.95&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;Heh, you know what?
I’m not planning to sell the two KDNYs I’ve accumulated just yet 😅.&lt;/p&gt;

&lt;p&gt;I don’t have the budget for more Nvidia GPUs right now, but I still want a local setup to experiment at a lower cost.
One day, while browsing X, I found &lt;a href=&quot;https://x.com/__tinygrad__/status/1924152751472497118&quot;&gt;a post by Tinygrad&lt;/a&gt; showcasing their gradient functions defined in just 40 lines of code.&lt;/p&gt;

&lt;figure&gt;&lt;img src=&quot;/images/2025-06-11-two-amd-7900xtx-gpus-tinygrad-based-training-workstation-peer-to-peer-pcie-communication/tinygrad-gradient.png&quot; alt=&quot;A X post by @__tinygrad__ showcasing gradients for backprop defined in just 40 lines&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;A &lt;a href=&quot;https://x.com/__tinygrad__/status/1924152751472497118&quot;&gt;X post&lt;/a&gt; by &lt;a href=&quot;https://x.com/__tinygrad__&quot;&gt;@__tinygrad__&lt;/a&gt; showcasing gradients for backprop defined in just 40 lines&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;I tried it, and it was impressive—no dependencies, just an instant install with &lt;a href=&quot;https://docs.astral.sh/uv/&quot;&gt;uv&lt;/a&gt;.&lt;/p&gt;

&lt;figure&gt;&lt;img src=&quot;/images/2025-06-11-two-amd-7900xtx-gpus-tinygrad-based-training-workstation-peer-to-peer-pcie-communication/tinygrad-uv-install.png&quot; alt=&quot;Screenshot of installing tinygrad with uv shows it only takes 9ms&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;Screenshot of installing tinygrad with uv shows it only takes 9ms&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;After researching further, I really liked &lt;a href=&quot;https://tinygrad.org&quot;&gt;Tinygrad&lt;/a&gt;’s concept.
It’s like the &lt;a href=&quot;https://en.wikipedia.org/wiki/Reduced_instruction_set_computer&quot;&gt;RISC (Reduced Instruction Set Computer)&lt;/a&gt; of machine learning, while &lt;a href=&quot;https://pytorch.org&quot;&gt;PyTorch&lt;/a&gt; feels more like &lt;a href=&quot;https://en.wikipedia.org/wiki/Complex_instruction_set_computer&quot;&gt;CISC (complex instruction set computer)&lt;/a&gt;.
I appreciate its clean, minimalist design, and it seems to support AMD GPUs well.&lt;/p&gt;

&lt;p&gt;This made me wonder: why does everyone say Nvidia GPUs are the go-to for machine learning?
They claim Nvidia’s strength lies in its software.
Hmm, is that true? 🤔
Or, as some might say, is it a skill issue? 🤣&lt;/p&gt;

&lt;p&gt;I’m not sure, but I wanted to find out.
I’m curious about &lt;a href=&quot;https://tinygrad.org/#tinybox&quot;&gt;Tinygrad’s pre-built AMD training workstation&lt;/a&gt;.
It’s tempting, but it’s outside the budget I can allocate, and it’s too bulky for my home office.&lt;/p&gt;

&lt;figure&gt;&lt;img src=&quot;/images/2025-06-11-two-amd-7900xtx-gpus-tinygrad-based-training-workstation-peer-to-peer-pcie-communication/tinybox-red.png&quot; alt=&quot;Screenshot of Tinybox, a pre-built machine learning work station by the tiny corp featuring AMD and Nvidia GPU options. The AMD option is selling at $15,000 USD&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;Screenshot of Tinybox, a pre-built machine learning work station by the tiny corp featuring AMD and Nvidia GPU options. The AMD option is selling at $15,000 USD&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;Looking at the GPUs they use, the AMD 7900XTX seemed like a mature choice.
Best of all, the price was reasonable—just $1,100:&lt;/p&gt;

&lt;figure&gt;&lt;img src=&quot;/images/2025-06-11-two-amd-7900xtx-gpus-tinygrad-based-training-workstation-peer-to-peer-pcie-communication/amd-7900xtx-price.png&quot; alt=&quot;Screenshot of Amazon product page, featuring XFX AMD Radeon RX 7900XTX GPU selling at $1,099.54&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;Screenshot of Amazon product page, featuring XFX AMD Radeon RX 7900XTX GPU selling at $1,099.54&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;I had a retired PC, so I quickly purchased two 7900XTX GPUs:&lt;/p&gt;

&lt;figure&gt;&lt;img src=&quot;/images/2025-06-11-two-amd-7900xtx-gpus-tinygrad-based-training-workstation-peer-to-peer-pcie-communication/two-7900xtx-boxes.jpg&quot; alt=&quot;Two boxes of XFX AMD Radeon RX 7900XTX GPUs&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;Two boxes of XFX AMD Radeon RX 7900XTX GPUs&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;I did my best with cable management:&lt;/p&gt;

&lt;figure&gt;&lt;img src=&quot;/images/2025-06-11-two-amd-7900xtx-gpus-tinygrad-based-training-workstation-peer-to-peer-pcie-communication/7900xtx-cable-management.jpg&quot; alt=&quot;Two XFX AMD Radeon RX 7900XTX GPUs in an open PC case with PCI power cables connected to them&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;Two XFX AMD Radeon RX 7900XTX GPUs in an open PC case with PCI power cables connected to them&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;It was time-consuming, but I tried 😅&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;peer-to-peer-pcie-communication-issues&quot;&gt;Peer-to-Peer PCIe Communication Issues&lt;/h2&gt;
&lt;p&gt;I ran MNIST and other Tinygrad examples, and they worked fine with one GPU.
But with two GPUs, I kept encountering errors like this:&lt;/p&gt;

&lt;figure&gt;&lt;img src=&quot;/images/2025-06-11-two-amd-7900xtx-gpus-tinygrad-based-training-workstation-peer-to-peer-pcie-communication/tinygrad-ioctl-error.png&quot; alt=&quot;Screenshot of Python raising OSError from a fnctl.ioctl function call when running Tinygrad with the multi-GPUs MNIST example&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;Screenshot of Python raising OSError from a fnctl.ioctl function call when running Tinygrad with the multi-GPUs MNIST example&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;It turns out the system was trying to transfer data between GPUs using &lt;a href=&quot;https://docs.kernel.org/driver-api/pci/p2pdma.html&quot;&gt;peer-to-peer (P2P) PCIe communication&lt;/a&gt;, but this wasn’t available on my setup.
If you look at the system log messages, you will see the error message:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Failed to map peer:0000:0c:00.0 mem_domain:4&lt;/p&gt;
&lt;/blockquote&gt;

&lt;figure&gt;&lt;img src=&quot;/images/2025-06-11-two-amd-7900xtx-gpus-tinygrad-based-training-workstation-peer-to-peer-pcie-communication/map-peer-failure.png&quot; alt=&quot;Screenshot from Linux kernel message showing Failed to map peer:0000:0c:00.0 mem_domain:4 error&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;Screenshot from Linux kernel message showing Failed to map peer:0000:0c:00.0 mem_domain:4 error&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;I discovered a tool &lt;a href=&quot;https://github.com/ROCm/rocm_bandwidth_test&quot;&gt;rocm-bandwidth-test&lt;/a&gt; that benchmarks PCIe communication performance between the CPU and GPUs on the ROCm platform.
I tried installing it on NixOS, but it wasn’t in nixpkgs.
So, I created a Nix package and submitted a &lt;a href=&quot;https://github.com/NixOS/nixpkgs/pull/415742&quot;&gt;pull request&lt;/a&gt; for it.
After running the benchmark, I confirmed there was no link between the two GPUs:&lt;/p&gt;

&lt;figure&gt;&lt;img src=&quot;/images/2025-06-11-two-amd-7900xtx-gpus-tinygrad-based-training-workstation-peer-to-peer-pcie-communication/rocm-bandwidth-test-does-not-work.png&quot; alt=&quot;Screenshot of rocm-bandwidth-test&apos;s result, shows there&apos;s no direct connection between the two GPUs&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;Screenshot of rocm-bandwidth-test&apos;s result, shows there&apos;s no direct connection between the two GPUs&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;There’s &lt;a href=&quot;https://docs.kernel.org/gpu/amdgpu/module-parameters.html&quot;&gt;an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;amdgpu.pcie_p2p&lt;/code&gt; option in the Linux kernel&lt;/a&gt;, but it wasn’t available on my PC.
I searched online, but no one seemed to have encountered this issue.
When no documentation exists, even LLMs can’t provide a solution.&lt;/p&gt;

&lt;h2 id=&quot;enabling-p2p-pcie-communication-in-the-linux-kernel&quot;&gt;Enabling P2P PCIe Communication in the Linux Kernel&lt;/h2&gt;

&lt;p&gt;I read the Linux kernel config file and learned that the P2P PCIe feature is relatively new.
To enable it, you need to activate &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DMABUF_MOVE_NOTIFY&lt;/code&gt; &lt;a href=&quot;https://github.com/torvalds/linux/blob/64980441d26995ea5599958740dbf6d791e81e27/drivers/gpu/drm/amd/amdkfd/Kconfig#L30&quot;&gt;option&lt;/a&gt; while compiling the kernel.&lt;/p&gt;

&lt;p&gt;With &lt;a href=&quot;https://nixos.org&quot;&gt;NixOS&lt;/a&gt;, you can configure it like this:&lt;/p&gt;

&lt;div class=&quot;language-nix highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;boot&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;kernelPatches&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;enable-hsa-amd-p2p&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;patch&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;null&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;extraStructuredConfig&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;lib&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;kernel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;nv&quot;&gt;DMABUF_MOVE_NOTIFY&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;yes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
      &lt;span class=&quot;nv&quot;&gt;HSA_AMD_P2P&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;yes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then rebuild your NixOS system and reboot.
After that, running the ROCm PCIe communication benchmark showed data exchanges between the GPUs.&lt;/p&gt;

&lt;figure&gt;&lt;img src=&quot;/images/2025-06-11-two-amd-7900xtx-gpus-tinygrad-based-training-workstation-peer-to-peer-pcie-communication/rocm-bandwidth-test-works.png&quot; alt=&quot;Screenshot of rocm-bandwidth-test&apos;s result, shows a direct connection between the two GPUs&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;Screenshot of rocm-bandwidth-test&apos;s result, shows a direct connection between the two GPUs&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;The transmission rate was limited due to my motherboard or possibly the &lt;a href=&quot;https://en.wikipedia.org/wiki/Socket_AM4&quot;&gt;AM4 platform&lt;/a&gt;, as PCIe bandwidth is shared between the two GPUs:&lt;/p&gt;

&lt;figure&gt;&lt;img src=&quot;/images/2025-06-11-two-amd-7900xtx-gpus-tinygrad-based-training-workstation-peer-to-peer-pcie-communication/motherboard-pcie-limits.png&quot; alt=&quot;Screenshot from the manual of ASUS ROG X570 Crosshair VIII Hero motherboard shows that with dual VGA setup, each of them have PCIe4x8 vs single VGA has PCIe4x16&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;Screenshot from the manual of ASUS ROG X570 Crosshair VIII Hero motherboard shows that with dual VGA setup, each of them have PCIe4x8 vs single VGA has PCIe4x16&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;
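To put rough numbers on what the x8 split costs: PCIe 4.0 runs at 16 GT/s per lane with 128b/130b line encoding, so halving the lanes halves the theoretical per-direction bandwidth. A back-of-envelope calculation (theoretical maximums only; real throughput is lower due to protocol overhead):

```python
# Theoretical per-direction PCIe 4.0 bandwidth (back-of-envelope).

GT_PER_S = 16          # PCIe 4.0: 16 GT/s per lane
ENCODING = 128 / 130   # 128b/130b line encoding overhead

def pcie4_gbytes_per_s(lanes):
    # transfers/s * useful-bit fraction, divided by 8 bits per byte
    return lanes * GT_PER_S * ENCODING / 8

print(round(pcie4_gbytes_per_s(16), 1))  # ~31.5 GB/s for a single GPU at x16
print(round(pcie4_gbytes_per_s(8), 1))   # ~15.8 GB/s per GPU when split x8/x8
```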

&lt;p&gt;To improve PCIe communication speed, I’d likely need to upgrade to a &lt;a href=&quot;https://www.amd.com/en/products/processors/workstations/ryzen-threadripper.html&quot;&gt;Threadripper platform&lt;/a&gt;.
But that’s overkill for now.
At least it works!
To run the multi-GPU MNIST example, use:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;IOCTL_PROCESSOR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;x86_64 &lt;span class=&quot;nv&quot;&gt;IOCTL&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;1 &lt;span class=&quot;nv&quot;&gt;AMD&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;1 python &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  examples/beautiful_mnist_multigpu.py
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I’m not sure why &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IOCTL&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IOCTL_PROCESSOR&lt;/code&gt; are needed, but I ran into other ioctl errors without them.
Now, with two GPUs it’s training at double the performance of a single GPU 🎉:&lt;/p&gt;

&lt;figure&gt;&lt;img src=&quot;/images/2025-06-11-two-amd-7900xtx-gpus-tinygrad-based-training-workstation-peer-to-peer-pcie-communication/mnist-training-with-two-gpus.png&quot; alt=&quot;Screenshot of output from Tinygrad&apos;s multi-GPU MNIST example on two AMD GPUs running at 61 iterations per second&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;Screenshot of output from Tinygrad&apos;s multi-GPU MNIST example on two AMD GPUs running at 61 iterations per second&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;Compared to the RTX 4090, here’s the performance of Tinygrad’s MNIST example:&lt;/p&gt;

&lt;figure&gt;&lt;img src=&quot;/images/2025-06-11-two-amd-7900xtx-gpus-tinygrad-based-training-workstation-peer-to-peer-pcie-communication/rtx-4090-mnist-performance.png&quot; alt=&quot;Screenshot of output from Tinygrad&apos;s multi-GPU MNIST example on a single 4090 GPU running at 29 iterations per second&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;Screenshot of output from Tinygrad&apos;s multi-GPU MNIST example on a single 4090 GPU running at 29 iterations per second&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;Note that this isn’t a scientific benchmark.
Tinygrad may have optimization potential or settings to maximize performance on each platform.
This is just a rough comparison to gauge the training speed I’m getting.
The two PCs also have different CPUs and memory (the one with the AMD GPU has older hardware).&lt;/p&gt;

&lt;p&gt;Additionally, the approach I described above uses the amdgpu driver provided in the Linux kernel.
Tinygrad has developed its own custom userspace driver called &lt;a href=&quot;https://docs.tinygrad.org/developer/am/&quot;&gt;AM&lt;/a&gt;, which they claim is more stable than &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;amdgpu&lt;/code&gt; for Tinygrad machine learning workloads.
I’ve only tested it briefly.
That’s another topic to explore later.&lt;/p&gt;

&lt;h2 id=&quot;final-thoughts&quot;&gt;Final Thoughts&lt;/h2&gt;

&lt;p&gt;I’m thrilled to run training with two AMD 7900XTX GPUs at a lower cost in my home office.
I can now free up my RTX 4090 and avoid intensive ML tasks disrupting my work.
The setup cost me $1,100 × 2 for the GPUs plus a new 1600W power supply for $500, totaling $2,700.
I plan to rewrite my ML projects with Tinygrad and run them on my new AMD ML workstation.&lt;/p&gt;

&lt;p&gt;I hope AMD gains more popularity in machine learning computing.
Otherwise, we’ll be stuck with GPUs priced at $2,000 but selling for $3,600.
I appreciate open-source projects like Tinygrad leading the way.
As I mentioned in my previous article, the real barrier for enthusiasts like me, who aren’t wealthy, isn’t a PhD degree—it’s computing power.
I wrote this article because this information is hard to find online and needs to be accessible so everyone can learn.&lt;/p&gt;

&lt;p&gt;That’s it! I may write more articles if I encounter interesting challenges with my dual 7900XTX setup.
I hope you find this somewhat interesting and useful 😄&lt;/p&gt;
</description>
        <pubDate>Wed, 11 Jun 2025 07:00:00 +0000</pubDate>
        <link>https://fangpenlin.com/posts/2025/06/11/two-amd-7900xtx-gpus-tinygrad-based-training-workstation-peer-to-peer-pcie-communication/</link>
        <guid isPermaLink="true">https://fangpenlin.com/posts/2025/06/11/two-amd-7900xtx-gpus-tinygrad-based-training-workstation-peer-to-peer-pcie-communication/</guid>
      </item>
    
      <item>
        <title>I built an AI-gen video detection model and browser extension in a month</title>
        <description>&lt;style type=&quot;text/css&quot;&gt;
  figure {
    text-align: center;
    margin: 0 auto;
  }
  figcaption, figcaption p {
    color: grey;
    font-size: 0.9em;
  }
  figure img {
    max-width: 100%;
  }
&lt;/style&gt;

&lt;p&gt;Have you ever wondered while browsing the internet whether the video or image you’re viewing is real or AI-generated?
People say, “Seeing is believing,” but that’s less true in the AI era. Nowadays, generating photorealistic videos with audio is easier and cheaper than ever.&lt;/p&gt;

&lt;figure&gt;&lt;img src=&quot;/images/2025-06-03-i-built-ai-gen-video-detection-model-and-browser-extension-in-a-month/kangaroo.jpg&quot; alt=&quot;AI-generated video featuring a kangaroo denied boarding on a flight as an emotional support animal, posted by @gdb on X&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;&lt;a href=&quot;https://x.com/AutismCapital/status/1927866536137752772&quot;&gt;AI-generated video featuring a kangaroo denied boarding on a flight as an emotional support animal, posted by @gdb on X&lt;/a&gt;&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;If you’ve followed trends on X, you may have noticed many users liked and reposted a video of a kangaroo being denied boarding on an airplane as an emotional support animal, ticket in paw. As adorable as it was, the video was AI-generated.
The ticket’s text is gibberish. I call it “AI fonts.”
I’m no linguistics expert, but the verbal exchange also felt off.&lt;/p&gt;

&lt;p&gt;I’ve faced the same issue.
While browsing X, I’ve retweeted content, only to later realize it was AI-generated, which was embarrassing.
I wished for an easy-to-use tool to distinguish AI-generated content from real content.
I tested several online tools claiming to detect AI-generated content, but none worked as expected.
So, I spent the past month training a model and building a browser extension focused on detecting AI-generated videos on X.
I named it CakeLens, inspired by the viral “Is it a cake?” videos.
Instead of identifying cakes, it detects AI-generated content.
I chose the name because I wanted it to be as easy as “a piece of cake” to use.&lt;/p&gt;

&lt;p&gt;CakeLens is now &lt;a href=&quot;https://chromewebstore.google.com/detail/cakelens-detect-ai-gen-co/koplmoaoamjgibclpdibnimodceoaapc&quot;&gt;available on the Chrome Web Store&lt;/a&gt;.
You need to &lt;a href=&quot;https://cakelens.ai/signup/&quot;&gt;sign up&lt;/a&gt; for an account at &lt;a href=&quot;https://cakelens.ai&quot;&gt;CakeLens.ai&lt;/a&gt; to use it.
Once set up, a button appears in the upper-right corner of videos on X when you hover over them.
Click it to submit the video for AI-generated content detection.&lt;/p&gt;

&lt;figure&gt;&lt;img src=&quot;/images/2025-06-03-i-built-ai-gen-video-detection-model-and-browser-extension-in-a-month/cakelens-extension-screenshot.jpg&quot; alt=&quot;Screenshot of X.com showing the CakeLens button on the upper-right corner when hovering on a video&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;Screenshot of X.com showing the CakeLens button on the upper-right corner when hovering on a video&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;View the detection results on the &lt;a href=&quot;https://cakelens.ai/account/submissions&quot;&gt;submissions page&lt;/a&gt; of your CakeLens account.&lt;/p&gt;

&lt;figure&gt;&lt;img src=&quot;/images/2025-06-03-i-built-ai-gen-video-detection-model-and-browser-extension-in-a-month/cakelens-submissions-screenshot.jpg&quot; alt=&quot;Screenshot of the submission page of CakeLens&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;Screenshot of the submission page of CakeLens&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;The latest version of my model achieves 77% precision and 74% recall on the validation dataset with a 50% decision threshold.&lt;/p&gt;

&lt;figure&gt;&lt;img src=&quot;/images/2025-06-03-i-built-ai-gen-video-detection-model-and-browser-extension-in-a-month/pr-curve.png&quot; alt=&quot;Screenshot of TensorBoard PR curve for CakeLens&apos; latest model&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;Screenshot of TensorBoard PR curve for CakeLens&apos; latest model&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;
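For readers newer to these metrics, precision and recall at a fixed threshold boil down to a few counts over the model's scores. A minimal sketch with made-up scores (the 77%/74% figures above come from the real validation set, not this toy data):

```python
# Precision/recall at a decision threshold (toy data, not the real
# CakeLens validation set).

def precision_recall(scores, labels, threshold=0.5):
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of flagged, how many真were AI? -> fraction of flags that are correct
    recall = tp / (tp + fn) if tp + fn else 0.0     # of AI videos, how many were caught
    return precision, recall

scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.7]  # model's "AI-generated" scores
labels = [1,   1,   0,   1,   0,   1  ]  # 1 = actually AI-generated
p, r = precision_recall(scores, labels)
print(p, r)  # 0.75 0.75 at the 50% threshold
```

Moving the threshold trades one metric against the other, which is exactly what the PR curve above visualizes.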

&lt;p&gt;I’ve learned a lot from this project.
Today, I’m sharing what I’ve learned from building this pet project!&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;why-i-built-it&quot;&gt;Why I Built It&lt;/h2&gt;
&lt;p&gt;Two months ago, I saw a post on X that upset me.&lt;/p&gt;

&lt;figure&gt;&lt;img src=&quot;/images/2025-06-03-i-built-ai-gen-video-detection-model-and-browser-extension-in-a-month/earthquake.jpg&quot; alt=&quot;X post video showing Myanmar earthquake turns out to be likely AI-gen&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;&lt;a href=&quot;https://x.com/K13News/status/1905802783858516208&quot;&gt;X post&lt;/a&gt; video showing Myanmar earthquake turns out to be likely AI-gen&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;It frustrates me that while people suffer from disasters, a few use AI tools to create fake videos for attention.
Most AI-generated content is created for positive purposes, like art, which I support.
However, what prevents bad actors from using these tools for scams or propaganda?
As AI-generated content grows, so will its misuse.
I don’t believe governments should ban new technologies due to potential misuse.
Although most AI generation providers build safety mechanisms into their systems to prevent common misuse cases, AI technologies will only become more prevalent.
Sooner or later, everyone will have access to these tools, so it’s not meaningful to focus solely on restricting them from the tool side.&lt;/p&gt;

&lt;p&gt;Instead of stopping users from misusing it, I believe in fighting technology with technology—like beating magic with magic.
That’s when I decided to build CakeLens.
Additionally, I recently built a &lt;a href=&quot;https://beanhub.io/blog/2025/04/21/new-beanhub-inbox-feature/&quot;&gt;feature using large language models (LLMs) for BeanHub&lt;/a&gt;.
However, I feel less comfortable as an engineer when using new technology without understanding the fundamentals.
For example, I wrote an article titled &lt;a href=&quot;/posts/2019/10/07/elliptic-curve-cryptography-explained/&quot;&gt;Elliptic Curve Cryptography Explained&lt;/a&gt; to understand ECC before using it at work, because I don’t like treating it as a mystical black box.
Using LLMs or other machine learning technologies is similar; it’s easy, and anyone who knows how to call an API can do it.
But what about data collection, labeling, training, and inference from the ground up?
This pet project became my &lt;a href=&quot;https://graphics8.nytimes.com/images/blogs/freakonomics/pdf/DeliberatePractice(PsychologicalReview).pdf&quot;&gt;deliberate practice&lt;/a&gt; in machine learning, as it gives me a chance to build it from end to end.&lt;/p&gt;

&lt;h2 id=&quot;building-the-chrome-extension&quot;&gt;Building the Chrome Extension&lt;/h2&gt;
&lt;p&gt;I envisioned CakeLens as a Chrome extension to automatically label AI-generated content on web pages you browse, similar to an antivirus scanning for viruses but targeting AI-generated content instead.
While this idea sounds appealing, it’s impractical for several reasons.
The internet has too much content to scan everything affordably, and users may not care if a cute cat video is AI-generated.
Privacy is another concern.
Detecting AI-generated content requires significant computing resources, likely on a server, which could resemble tracking user browsing activity.
I wasn’t comfortable with that.
To balance functionality and privacy, I made a compromise.&lt;/p&gt;

&lt;p&gt;Instead of scanning all content automatically, CakeLens only analyzes content when users request it.
Since popular AI-generated content often sparks widespread doubt, I only need to analyze it once and keep the result.
The extension can download a list of known AI-generated content and flag matches on the page, preserving user privacy.
With this approach, I began building the Chrome extension.&lt;/p&gt;
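The privacy-preserving matching above amounts to a local set lookup: the extension periodically downloads digests of videos already confirmed as AI-generated and checks page content against them on the user's machine, so no per-video request reveals what the user is browsing. The sketch below is my own illustration of the idea in Python (the real extension is JavaScript/TypeScript, and the hashing scheme is my assumption):

```python
# Sketch of local matching against a downloaded list of known
# AI-generated videos (illustrative only; the real extension is
# JavaScript/TypeScript and this hashing scheme is an assumption).

import hashlib

def video_digest(video_url: str) -> str:
    # Publish hashes rather than raw URLs in the downloaded list.
    return hashlib.sha256(video_url.encode("utf-8")).hexdigest()

# Periodically downloaded from the server: digests of videos the
# service has already analyzed and confirmed as AI-generated.
known_ai_digests = {video_digest("https://example.com/video/kangaroo")}

def is_known_ai(video_url: str) -> bool:
    # Pure local lookup: no network request per video viewed.
    return video_digest(video_url) in known_ai_digests

print(is_known_ai("https://example.com/video/kangaroo"))  # True
print(is_known_ai("https://example.com/video/cat"))       # False
```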

&lt;p&gt;I’ve built countless software projects, but this was my first Chrome extension.
It was more challenging than expected, largely due to security constraints.
Extensions operate in isolated sandbox environments, communicating via messaging with strict permission whitelisting to prevent malicious actions.
I also learned some &lt;a href=&quot;https://github.com/acdlite/react-fiber-architecture&quot;&gt;React Fiber&lt;/a&gt; tricks to extract the information needed for the extension.
This topic deserves its own article, but today, I’ll focus on the machine learning aspects.&lt;/p&gt;

&lt;h2 id=&quot;building-internal-tools-for-data-collection&quot;&gt;Building Internal Tools for Data Collection&lt;/h2&gt;
&lt;p&gt;With the extension built, I could send videos to my API server and store them in a database.
To train the AI model, I needed labeled data.
Reflecting on thousands of collected videos, I realized internal tools are critical but often overlooked.
Efficient data handling depends on these tools, so I built an intuitive UI for labeling data to streamline the process.&lt;/p&gt;

&lt;figure&gt;&lt;img src=&quot;/images/2025-06-03-i-built-ai-gen-video-detection-model-and-browser-extension-in-a-month/internal-tool.png&quot; alt=&quot;Screenshot of the internal CakeLens tool webpage for viewing, labeling and editing submissions&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;Screenshot of the internal CakeLens tool for viewing, labeling and editing submissions&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;h2 id=&quot;evolving-the-model-with-hyperparameter-gradient-descent&quot;&gt;Evolving the Model with Hyperparameter Gradient Descent&lt;/h2&gt;
&lt;p&gt;Building the extension and internal tools was straightforward compared to designing the model.
The challenge was creating a model that generalizes pattern recognition while balancing performance and cost.
Ideally, I’d define an evaluation metric, and a genetic algorithm would evolve the model automatically.
My previous &lt;a href=&quot;/posts/2025/02/18/maze-my-ai-models-are-finally-evolving/&quot;&gt;MAZE project&lt;/a&gt; aimed for this, but it’s not mature enough and doesn’t cover CNNs, so I designed the model manually through trial and error.&lt;/p&gt;

&lt;p&gt;My biggest initial mistake was not setting up infrastructure to collect metrics for objective model evaluation.
I’d run data, review results, guess improvements, adjust, and repeat.
Without sufficient data, progress was slow and guesses often wrong. I realized I was wasting time.
Learning from this, I built infrastructure to collect metrics and systematically test approaches.&lt;/p&gt;

&lt;p&gt;Model design involves many questions: How many CNN layers?
What kernel size?
What learning rate?
Which modules to use?
Without a systematic approach, it’s easy to get lost in the ocean of permutations.
First, I tackled metric collection. I had used tools like &lt;a href=&quot;https://grafana.com&quot;&gt;Grafana&lt;/a&gt; while wearing the hat of a DevOps or backend engineer, but for machine learning, &lt;a href=&quot;https://www.tensorflow.org/tensorboard&quot;&gt;TensorBoard&lt;/a&gt; is preferred.
Its UI is tailored for ML training and testing data, so I updated my code to log TensorBoard metrics.&lt;/p&gt;
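&lt;p&gt;As a rough sketch of what that logging can look like (the run directory, metric names, and the tiny linear model here are placeholders, not my actual training code), PyTorch ships a SummaryWriter that emits TensorBoard event files:&lt;/p&gt;

```python
import glob
import tempfile

import torch
from torch.utils.tensorboard import SummaryWriter

# Hypothetical run directory and metric names; the tiny linear model is a
# stand-in for the real video-detection model.
log_dir = tempfile.mkdtemp(prefix='tb-demo-')
writer = SummaryWriter(log_dir=log_dir)

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

for step in range(10):
    x = torch.randn(8, 4)
    y = torch.randn(8, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    # Scalar curves show up under the Scalars tab.
    writer.add_scalar('train/loss', loss.item(), step)
    # Gradient histograms make dead or exploding layers visible.
    for name, param in model.named_parameters():
        writer.add_histogram('grad/' + name, param.grad, step)
    optimizer.step()

writer.close()
```

&lt;p&gt;Pointing TensorBoard at the log directory then renders the loss curves and gradient histograms for each run.&lt;/p&gt;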

&lt;figure&gt;&lt;img src=&quot;/images/2025-06-03-i-built-ai-gen-video-detection-model-and-browser-extension-in-a-month/tensorboard.png&quot; alt=&quot;Screenshot of TensorBoard showing histogram of training gradients&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;Screenshot of TensorBoard showing histogram of training gradients&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;Next, I needed a method to adjust hyperparameters systematically.
I developed a technique I call “hyperparameter gradient descent.”
Starting with a baseline model, I adjust one hyperparameter across a range, running limited epochs to compare results.
I select the best performer, update the baseline, and train on the full dataset.
Then, I analyze results, hypothesize improvements, and test new hyperparameters.
This iterative process resembles gradient descent but at the hyperparameter level.&lt;/p&gt;
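&lt;p&gt;In plain Python, one step of this process looks roughly like the following sketch, where &lt;code&gt;short_run&lt;/code&gt; is a hypothetical stand-in for a limited-epoch training run that returns a validation score:&lt;/p&gt;

```python
# Hypothetical stand-in for a limited-epoch training run that returns a
# validation score (higher is better); in reality this trains the model.
def short_run(hparams: dict) -> float:
    lr = hparams['learning_rate']
    return -abs(lr - 5e-3)

# One step of hyperparameter gradient descent: vary a single knob, keep
# everything else fixed, and promote the best candidate into the baseline.
def tune_one(baseline: dict, key: str, candidates: list) -> dict:
    scored = [(short_run(baseline | {key: value}), value) for value in candidates]
    best_score, best_value = max(scored)
    return baseline | {key: best_value}

baseline = {'learning_rate': 1e-3, 'kernel_size': 3}
baseline = tune_one(baseline, 'learning_rate', [3e-3, 5e-3, 7e-3])
print(baseline['learning_rate'])  # → 0.005
```

&lt;p&gt;After each promotion, the winning configuration becomes the new baseline, and the next knob gets swept the same way.&lt;/p&gt;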

&lt;figure&gt;&lt;img src=&quot;/images/2025-06-03-i-built-ai-gen-video-detection-model-and-browser-extension-in-a-month/hyperparameter-gradient-descent.svg&quot; alt=&quot;Diagram of the approach: start from a baseline model design, try different hyperparameters, pick the best performer, and promote it to the next baseline design.&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;Diagram of the approach: start from a baseline model design, try different hyperparameters, pick the best performer, and promote it to the next baseline design.&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;For example, while training my models, I observed that they stopped learning after a certain number of steps.
By examining the gradient histogram on TensorBoard, I realized that the gradient distributions were consistently zero after a certain point.
My theory was that high loss causes large gradients at the very beginning, which wipes out many neurons and weights, making them fall below the ReLU threshold (zero in this case) and preventing activation.
It almost felt like a surge of current flowing into a circuit board, frying the electronic components.
To address this problem, I experimented with different approaches using the current version of the model.
First, I tried &lt;a href=&quot;https://docs.pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_norm_.html&quot;&gt;gradient clipping&lt;/a&gt;. This approach did help mitigate the initial gradient explosion issue.
However, since gradient clipping requires accessing all gradient values, it significantly slows down training, making it less than ideal.
Next, I tuned the learning rate. With my infrastructure, I can easily test the same model structure with different learning rates and evaluate their performance.
This allowed me to quickly identify the optimal learning rate for the current model structure.
I’m now on version 5 of my baseline model.&lt;/p&gt;
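&lt;p&gt;To make the fried-neuron theory concrete, here is a toy, framework-free illustration with made-up numbers: a single ReLU unit trained with an oversized learning rate overshoots on the first step, its pre-activation goes negative, and it never receives a gradient again:&lt;/p&gt;

```python
def relu(z: float) -> float:
    return max(z, 0.0)

def train(lr: float, steps: int = 5) -> float:
    """One ReLU neuron y = relu(w*x + b), plain SGD, squared loss."""
    w, b = 0.5, 0.1
    x, target = 1.0, 0.2
    for _ in range(steps):
        z = w * x + b
        y = relu(z)
        # The ReLU gradient is zero whenever the pre-activation z is
        # negative, so a unit pushed below zero gets no further updates.
        dz = 2.0 * (y - target) * (1.0 if z > 0 else 0.0)
        w -= lr * dz * x
        b -= lr * dz
    return relu(w * x + b)

y_ok = train(lr=0.1)    # converges toward the 0.2 target
y_dead = train(lr=10.0) # first step overshoots, z goes negative, output stuck at 0
```

&lt;p&gt;The same mechanism at scale is what the flat gradient histograms in TensorBoard were showing: large early losses knocking whole layers below the ReLU threshold at once.&lt;/p&gt;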

&lt;p&gt;This approach has drawbacks.
During exploration, choosing the number of epochs for experiments is challenging.
Too few may miss potential benefits, while too many waste resources.
With my limited budget, I could only test a few options, making educated guesses based on small data windows.
With more resources, I recommend running at least one full epoch to uncover late improvements.
Additionally, keep models flexible—parameterize layers, learning rates, kernel sizes, and structures as arguments rather than hard-coding values to simplify tuning.&lt;/p&gt;
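&lt;p&gt;A minimal sketch of that advice, with hypothetical field names: keep every tunable value in one config object passed to the model builder, so a sweep only constructs new configs instead of editing model code:&lt;/p&gt;

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ModelConfig:
    # Everything a sweep might touch is an argument, never a literal
    # buried inside the model definition.
    num_layers: int = 4
    kernel_size: int = 3
    learning_rate: float = 3e-3

def build_layer_sizes(cfg: ModelConfig, base_channels: int = 16) -> list:
    # Stand-in for real model construction: channel count doubles per layer.
    return [base_channels * 2**i for i in range(cfg.num_layers)]

baseline = ModelConfig()
# A hyperparameter sweep is then just a list of configs derived from the baseline.
sweep = [replace(baseline, learning_rate=lr) for lr in (3e-3, 5e-3, 7e-3)]
```

&lt;p&gt;With this shape, the earlier sweep code only has to build config objects, and every experiment is fully described by one serializable value.&lt;/p&gt;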

&lt;p&gt;Another issue with this approach is that previously determined parameters may not remain optimal after changing the model structure.
For example, the optimal learning rate from the previous structure might not suit the new one.
Therefore, you need to continuously ask questions and explore different directions to determine the best next move.
Even when you believe you’re heading in the right direction, this may not always be true.
It still requires significant trial and error.
Sometimes, I had to revert to an earlier version and proceed from there because the new approach performed worse than expected.&lt;/p&gt;

&lt;h2 id=&quot;using-cloud-services-like-modal-to-speed-up-experiments&quot;&gt;Using Cloud Services Like Modal to Speed Up Experiments&lt;/h2&gt;
&lt;p&gt;While hyperparameter gradient descent was helpful, training on my Nvidia RTX 4090 slowed as models grew larger.
My PC struggled, sometimes becoming unusable.
Testing multiple hyperparameters sequentially was too time-consuming.
I needed a cloud-based solution and found &lt;a href=&quot;https://modal.com&quot;&gt;Modal&lt;/a&gt;, a container-based platform for GPU computing.
As someone experienced with containers, I appreciated its approach.
This article isn’t sponsored, but I recommend Modal for machine learning workloads due to its ease of use.
Modal bills only for active container time and supports running multiple containers simultaneously, enabling parallel hyperparameter testing.&lt;/p&gt;

&lt;p&gt;Here’s sample code showing how to use Modal to explore different learning rates with the same training function:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;modal&lt;/span&gt;


&lt;span class=&quot;n&quot;&gt;MINUTES&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;60&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;HOURS&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;60&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MINUTES&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;app_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;cakelens-v5.0&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;app&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;modal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;App&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;app_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;base_image&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;modal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Image&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from_dockerfile&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;./docker/modal/Dockerfile&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;torch_image&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;base_image&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pip_install&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(...)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;volume&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;modal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Volume&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;cakelens-v5-volume&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;create_if_missing&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;gpu&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;H100&quot;&lt;/span&gt;


&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;image&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;torch_image&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;volumes&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;volume_path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;volume&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;cpu&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;16&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;gpu&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gpu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;timeout&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;16&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;HOURS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;train_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node_rank&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kwargs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;dict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# train your model here
&lt;/span&gt;

&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;local_entrypoint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;base_kwargs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;dict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;limit_training&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;epoches&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;train_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;starmap&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;base_kwargs&lt;/span&gt;
                &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;dict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;run_name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;5.0-lr-3e-03&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;learning_rate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;3e-03&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;base_kwargs&lt;/span&gt;
                &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;dict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;run_name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;5.0-lr-5e-03&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;learning_rate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;5e-03&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;base_kwargs&lt;/span&gt;
                &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;dict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;run_name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;5.0-lr-7e-03&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;learning_rate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;7e-03&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Using Modal, I realized computing power is essential for ML.
No matter how much theory you study, without resources to experiment, progress is limited.
More computing power allows testing more approaches at scale, but the capital requirement is steep.
The barrier to ML isn’t a PhD—it’s access to computing resources.&lt;/p&gt;

&lt;h2 id=&quot;maximizing-gpu-usage-is-harder-than-expected&quot;&gt;Maximizing GPU Usage Is Harder Than Expected&lt;/h2&gt;
&lt;p&gt;With Modal’s H100 GPUs, I expected fast training, but software challenges persisted.
Even with powerful hardware, GPU usage wasn’t always 100%, sometimes much lower.
Video decoding and data transfer from disk to GPU memory caused bottlenecks.
Decoding videos takes time, and transferring data from main memory to GPU memory via &lt;a href=&quot;https://docs.pytorch.org/docs/stable/notes/cuda.html&quot;&gt;CUDA streams&lt;/a&gt; (similar to &lt;a href=&quot;https://en.wikipedia.org/wiki/Coroutine&quot;&gt;coroutines&lt;/a&gt;) can idle the GPU if not optimized.
To reduce waste, I enabled PyTorch &lt;a href=&quot;https://docs.pytorch.org/tutorials/beginner/basics/data_tutorial.html&quot;&gt;DataLoader&lt;/a&gt; workers to load video data in parallel subprocesses.
Profiling showed the GPU still waited for memory transfers.&lt;/p&gt;

&lt;figure&gt;&lt;img src=&quot;/images/2025-06-03-i-built-ai-gen-video-detection-model-and-browser-extension-in-a-month/profile-wait-memcpy.png&quot; alt=&quot;Screenshot of Chrome&apos;s profiler UI showing the forward pass waits for memory copying of the input data from the main memory to the GPU&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;Screenshot of Chrome&apos;s profiler UI showing the forward pass waits for memory copying of the input data from the main memory to the GPU&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;This reminded me of CPU pipeline concepts.
While the GPU waits for data, it could process the previous batch.
I implemented a PyTorch CUDA stream generator to load data to GPU memory while processing the prior batch.&lt;/p&gt;

&lt;p&gt;Here’s the generator code:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;cuda_preload&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;datasets&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;typing&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Sequence&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;tuple&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]],&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;device&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;worker_stream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cuda&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Stream&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;typing&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Generator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;tuple&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;preload_stream&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cuda&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Stream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;worker_stream&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;is&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;worker_stream&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cuda&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;default_stream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;device&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;_to_device&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;src&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;tuple&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;tuple&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;dict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;tuple&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;src_x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;src_y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;src&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cuda&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;preload_stream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;src_x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;device&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;non_blocking&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;src_y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;device&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;non_blocking&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;record_stream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;worker_stream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;record_stream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;worker_stream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;iterator&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;iter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;datasets&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;try&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;record_function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;preload_dataset&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;previous&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_to_device&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;next&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;iterator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;while&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;record_function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;load_dataset&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;current&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;next&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;iterator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;worker_stream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;wait_stream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;preload_stream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;current&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_to_device&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;current&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;yield&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;previous&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;previous&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;current&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;except&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;StopIteration&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With that in place, you can wrap the dataloader in the training loop like this:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cuda_preload&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dataloader&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;device&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;device&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;logits&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;# ...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now, loading data to the GPU and performing the forward pass happen simultaneously, which is much more efficient!&lt;/p&gt;

&lt;figure&gt;&lt;img src=&quot;/images/2025-06-03-i-built-ai-gen-video-detection-model-and-browser-extension-in-a-month/profile-async-memcpy.png&quot; alt=&quot;Screenshot of Chrome&apos;s profiler UI showing the forward pass running at the same time as memory copying of the input data from the main memory to the GPU&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;Screenshot of Chrome&apos;s profiler UI showing the forward pass running at the same time as memory copying of the input data from the main memory to the GPU&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;This improved GPU usage, but transfers were still slow.
I learned that &lt;a href=&quot;https://docs.pytorch.org/tutorials/intermediate/pinmem_nonblock.html&quot;&gt;pinning DataLoader memory as non-pageable speeds up transfers by avoiding an extra memory copy step&lt;/a&gt;.
In most cases, you can simply set &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pin_memory&lt;/code&gt; to True on your DataLoader to do it for you, like this:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;train_dataloader&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DataLoader&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;training_dataset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;batch_size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;batch_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;shuffle&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;num_workers&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;num_workers&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;pin_memory&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;pin_memory_device&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;device&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;To sum up, here are the key lessons for maximizing GPU usage:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Use DataLoader workers to load data in parallel in background processes.&lt;/li&gt;
  &lt;li&gt;Pipeline data loading and processing in an async manner.&lt;/li&gt;
  &lt;li&gt;Pin memory for faster data transfers.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;reducing-memory-usage&quot;&gt;Reducing Memory Usage&lt;/h2&gt;

&lt;p&gt;As models grew, memory usage spiked, especially during backpropagation.
I frequently encountered out-of-memory errors, even on &lt;a href=&quot;https://www.nvidia.com/en-us/data-center/h100/&quot;&gt;H100&lt;/a&gt; or &lt;a href=&quot;https://www.nvidia.com/en-us/data-center/a100/&quot;&gt;A100-80GB&lt;/a&gt; GPUs.
Without backpropagation, training would be faster and easier to distribute, and I’d love to eliminate it, but that’s an interesting research topic for another day.&lt;/p&gt;

&lt;p&gt;To reduce memory usage, my first attempt was to decrease the mini-batch size.
However, I didn’t want to lose the benefits of a larger batch size.
Therefore, I performed multiple forward passes, calculated the average loss from those passes, and then executed a single backpropagation step.&lt;/p&gt;
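&lt;p&gt;The usual way to implement this is to scale each small batch’s loss and call backward on it immediately, which accumulates gradients and is mathematically equivalent to averaging the losses, while avoiding holding every forward graph in memory at once. A minimal sketch, where the tiny model, data, and accumulation count are hypothetical stand-ins for the real training setup:&lt;/p&gt;

```python
import torch
import torch.nn as nn

# Hypothetical tiny model and optimizer, standing in for the real ones.
model = nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

accum_steps = 4  # number of small forward passes per optimizer step
optimizer.zero_grad()
for step in range(accum_steps):
    x = torch.randn(4, 8)                      # small mini-batch
    y = torch.randint(0, 2, (4,))
    loss = loss_fn(model(x), y) / accum_steps  # scale so gradients average out
    loss.backward()                            # gradients accumulate in .grad
optimizer.step()                               # one update for all passes
optimizer.zero_grad()
```

&lt;p&gt;The effective batch size is the mini-batch size times the accumulation count, but only one mini-batch’s activations live in memory at a time.&lt;/p&gt;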

&lt;p&gt;However, that wasn’t enough.
I also adopted &lt;a href=&quot;https://docs.pytorch.org/tutorials/recipes/recipes/amp_recipe.html&quot;&gt;Automatic Mixed Precision (AMP) in PyTorch&lt;/a&gt;, using FP16 with scaled gradients to reduce memory usage.
Still, out-of-memory errors persisted.
Distributed training across multiple GPUs was an option, but it’s complex and error-prone.
While I know most mainstream LLM training is done in a distributed fashion, there’s already too much for me to learn, and I don’t have the appetite for that today.
Therefore, I’ve decided to treat the H100’s 80GB as the memory size limit for this project.&lt;/p&gt;
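&lt;p&gt;Following the linked AMP recipe, the training step looks roughly like this. The tiny model and data here are placeholders, and both autocast and the gradient scaler are disabled automatically when no GPU is present:&lt;/p&gt;

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(8, 2).to(device)  # stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(4, 8, device=device)
y = torch.randint(0, 2, (4,), device=device)

optimizer.zero_grad()
# Run the forward pass in FP16 where safe, keeping FP32 master weights.
with torch.autocast(device_type=device, enabled=(device == "cuda")):
    loss = loss_fn(model(x), y)
scaler.scale(loss).backward()  # scale the loss to avoid FP16 gradient underflow
scaler.step(optimizer)         # unscales gradients, then steps the optimizer
scaler.update()
```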

&lt;p&gt;To further reduce memory usage, I adopted the &lt;a href=&quot;https://docs.pytorch.org/docs/stable/checkpoint.html&quot;&gt;checkpointing approach provided by PyTorch&lt;/a&gt;.
Essentially, it avoids saving intermediate activations during the forward pass, storing only the outputs of each checkpointed segment and recomputing the activations during backpropagation.
In other words, it trades computation for memory.
While this approach is slower, it allows training to proceed with just one H100 instead of two.
By the way, the term “checkpoint” is confusing, as it’s often mistaken for saving the current model state 😅.&lt;/p&gt;
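&lt;p&gt;A minimal sketch of the idea using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;checkpoint_sequential&lt;/code&gt;, which splits a sequential stack into segments and recomputes the in-between activations on the backward pass. The toy layer stack here is a hypothetical stand-in for the real model:&lt;/p&gt;

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Hypothetical deep stack standing in for a real, much larger network.
layers = nn.Sequential(
    *[nn.Sequential(nn.Linear(16, 16), nn.ReLU()) for _ in range(8)]
)

x = torch.randn(4, 16, requires_grad=True)
# Split the stack into 2 segments; only the segment boundary outputs are
# stored, and the activations in between are recomputed during backward.
out = checkpoint_sequential(layers, 2, x)
out.sum().backward()
```

&lt;p&gt;The backward pass now pays extra compute to redo parts of the forward pass, in exchange for a much smaller activation memory footprint.&lt;/p&gt;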

&lt;p&gt;With the above measures, I successfully managed to keep memory usage in check and train my models.&lt;/p&gt;

&lt;h2 id=&quot;tips-for-recognizing-ai-generated-videos-with-human-eyes&quot;&gt;Tips for Recognizing AI-Generated Videos with Human Eyes&lt;/h2&gt;

&lt;p&gt;While CakeLens helps identify AI-generated videos, learning to spot them manually is valuable.
Manual review is especially useful for labeling training data, even when you have a model to assist.
Having reviewed thousands of AI-generated videos during data collection, I’d like to share some key tips.&lt;/p&gt;

&lt;h3 id=&quot;harry-potter-teleportation-in-the-bokeh&quot;&gt;Harry Potter Teleportation in the Bokeh&lt;/h3&gt;

&lt;p&gt;Many AI-generated videos feature people walking or objects moving in the background.
In real videos, the camera focuses on the subject, leaving the background blurry—a phenomenon called &lt;a href=&quot;https://en.wikipedia.org/wiki/Bokeh&quot;&gt;bokeh&lt;/a&gt;.
By closely examining moving objects in the bokeh, you’ll notice unnatural movements that don’t align with natural physics.
Here’s an example of a person appearing out of nowhere in the background:&lt;/p&gt;

&lt;div class=&quot;center mb2&quot;&gt;
    &lt;video style=&quot;width: 480px; margin: 0 auto&quot; crossorigin=&quot;anonymous&quot; width=&quot;776&quot; height=&quot;438&quot; autoplay=&quot;&quot; playsinline=&quot;&quot; src=&quot;/images/2025-06-03-i-built-ai-gen-video-detection-model-and-browser-extension-in-a-month/apparition-00.mp4&quot; muted=&quot;&quot; loop=&quot;&quot; controls=&quot;&quot;&gt;&lt;/video&gt;
    &lt;div class=&quot;small text-center&quot;&gt;
        &lt;a href=&quot;https://x.com/omooretweets/status/1925972036775485945&quot;&gt;AI-gen video with Veo 3 posted by @omooretweets on X&lt;/a&gt;
    &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Here’s another example, this time of a person disappearing:&lt;/p&gt;
&lt;div class=&quot;center mb2&quot;&gt;
    &lt;video style=&quot;width: 480px; margin: 0 auto&quot; crossorigin=&quot;anonymous&quot; width=&quot;776&quot; height=&quot;438&quot; autoplay=&quot;&quot; playsinline=&quot;&quot; src=&quot;/images/2025-06-03-i-built-ai-gen-video-detection-model-and-browser-extension-in-a-month/apparition-01.mp4&quot; muted=&quot;&quot; loop=&quot;&quot; controls=&quot;&quot;&gt;&lt;/video&gt;
    &lt;div class=&quot;small text-center&quot;&gt;
        &lt;a href=&quot;https://x.com/omooretweets/status/1925972036775485945&quot;&gt;AI-gen video with Veo 3 posted by @TheoMediaAI on X&lt;/a&gt;
    &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;And an example of a human body twisting unnaturally in the background:&lt;/p&gt;

&lt;div class=&quot;center mb2&quot;&gt;
    &lt;video style=&quot;width: 480px; margin: 0 auto&quot; crossorigin=&quot;anonymous&quot; width=&quot;776&quot; height=&quot;438&quot; autoplay=&quot;&quot; playsinline=&quot;&quot; src=&quot;/images/2025-06-03-i-built-ai-gen-video-detection-model-and-browser-extension-in-a-month/apparition-02.mp4&quot; muted=&quot;&quot; loop=&quot;&quot; controls=&quot;&quot;&gt;&lt;/video&gt;
    &lt;div class=&quot;small text-center&quot;&gt;
        &lt;a href=&quot;https://x.com/PJaccetturo/status/1919071287831597481&quot;&gt;AI-gen video with Veo 3 posted by @PJaccetturo on X&lt;/a&gt;
    &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;These movements often feel unnatural, and sometimes the person or object even disappears entirely, much like the teleportation effects in the Harry Potter films, hence the name.&lt;/p&gt;

&lt;div style=&quot;display: flex; justify-content: center; margin-bottom: 1.5em;&quot;&gt;
  &lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube.com/embed/0KaDBNUjhr8?si=d3m9QGwKdeRnOFlk&quot; title=&quot;YouTube video player&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share&quot; referrerpolicy=&quot;strict-origin-when-cross-origin&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;To identify AI-generated videos, look for unusual movements in the blurry background.
This phenomenon occurs because mainstream video and image generation models rely on probabilities to determine what is most likely to appear in a given context.
These models don’t fully understand objects; they generate what seems plausible.
The fuzziness of bokeh or the limited pixels representing an object introduces great uncertainty, causing the model to produce inconsistent or unrealistic background movements.
I didn’t observe this phenomenon in all AI-generated videos.
However, when present, it’s a strong indicator that a video is AI-generated.&lt;/p&gt;

&lt;p&gt;For most video-generating models to overcome this issue, they may need to change their generation approach.
They likely need to enable the model to maintain a clear and continuous concept of scenes and actors within them, rather than guessing what should appear.
At that point, it would be harder for us to detect flaws like these.
This is yet another interesting topic to consider.&lt;/p&gt;

&lt;h3 id=&quot;ai-fonts&quot;&gt;AI Fonts&lt;/h3&gt;

&lt;p&gt;Another telltale sign of AI-generated content is the text.
As mentioned earlier, diffusion-based image and video generation models prioritize visual likelihood over meaning.
They don’t comprehend text, often producing gibberish that mimics the appearance of text—mixed characters resembling Chinese, Japanese, or other languages without meaning.
I call this “AI fonts.”&lt;/p&gt;

&lt;p&gt;Here’s an example of AI fonts from OpenAI’s Sora:&lt;/p&gt;

&lt;figure&gt;&lt;img src=&quot;/images/2025-06-03-i-built-ai-gen-video-detection-model-and-browser-extension-in-a-month/ai-font-00.jpg&quot; alt=&quot;Screenshot of AI-gen video with Sora posted by @gdb on X&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;Screenshot of &lt;a href=&quot;https://x.com/gdb/status/1758193811489243408&quot;&gt;AI-gen video with Sora posted by @gdb on X&lt;/a&gt;&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;The kangaroo video mentioned earlier also displays AI fonts on the ticket:&lt;/p&gt;

&lt;figure&gt;&lt;img src=&quot;/images/2025-06-03-i-built-ai-gen-video-detection-model-and-browser-extension-in-a-month/kangaroo-ticket.jpg&quot; alt=&quot;&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;&lt;a href=&quot;https://x.com/AutismCapital/status/1927866536137752772&quot;&gt;Screenshot of the Kangaroo AI-gen video zoomed to the ticket showing AI-font text posted by @AutismCapital on X&lt;/a&gt;&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;The text looks like text at a glance, but it isn’t really.&lt;/p&gt;

&lt;p&gt;Not all AI models struggle with text.
Some, like OpenAI’s newer image-generation models, produce coherent text.
So, while AI fonts can indicate AI-generated content, meaningful text doesn’t always rule it out.&lt;/p&gt;

&lt;figure&gt;&lt;img src=&quot;/images/2025-06-03-i-built-ai-gen-video-detection-model-and-browser-extension-in-a-month/o4-gen.webp&quot; alt=&quot;https://openai.com/index/introducing-4o-image-generation/&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;&lt;a href=&quot;https://openai.com/index/introducing-4o-image-generation/&quot;&gt;AI-gen image with OpenAI&apos;s new 4o model&lt;/a&gt; featuring a young lady writing text on a whiteboard with clear meaningful text on it&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;h3 id=&quot;incorrect-physical-behavior&quot;&gt;Incorrect Physical Behavior&lt;/h3&gt;

&lt;p&gt;AI-generated content sometimes exhibits incorrect physical behavior.
Since diffusion models rely on probability, they may lack sufficient data to accurately predict physical movements.
For example, in a Veo3-generated earthquake video, objects appear unnaturally light.&lt;/p&gt;

&lt;div class=&quot;center mb2&quot;&gt;
    &lt;video style=&quot;width: 480px; margin: 0 auto&quot; crossorigin=&quot;anonymous&quot; width=&quot;776&quot; height=&quot;438&quot; autoplay=&quot;&quot; playsinline=&quot;&quot; src=&quot;/images/2025-06-03-i-built-ai-gen-video-detection-model-and-browser-extension-in-a-month/ai-earthquake.mp4&quot; muted=&quot;&quot; loop=&quot;&quot; controls=&quot;&quot;&gt;&lt;/video&gt;
    &lt;div class=&quot;small text-center&quot;&gt;
        &lt;a href=&quot;https://x.com/t_itamiya/status/1927888648965406948&quot;&gt;AI-gen video with Veo 3 posted by @t_itamiya on X&lt;/a&gt;
    &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Yet another great example is an iPhone unboxing video:&lt;/p&gt;

&lt;div class=&quot;center mb2&quot;&gt;
    &lt;video style=&quot;width: 480px; margin: 0 auto&quot; crossorigin=&quot;anonymous&quot; width=&quot;776&quot; height=&quot;438&quot; autoplay=&quot;&quot; playsinline=&quot;&quot; src=&quot;/images/2025-06-03-i-built-ai-gen-video-detection-model-and-browser-extension-in-a-month/iphone-unboxing.mp4&quot; muted=&quot;&quot; loop=&quot;&quot; controls=&quot;&quot;&gt;&lt;/video&gt;
    &lt;div class=&quot;small text-center&quot;&gt;
        &lt;a href=&quot;https://x.com/mattshumer_/status/1925017573478793563&quot;&gt;AI-gen video with Veo 3 posted by @mattshumer_ on X&lt;/a&gt;
    &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Well, if you ever unbox an iPhone, you certainly know that’s not how you open the box.&lt;/p&gt;

&lt;h3 id=&quot;look-for-inconsistencies&quot;&gt;Look for inconsistencies&lt;/h3&gt;

&lt;p&gt;Another tip for spotting AI-generated videos is to look for inconsistencies in the video.
For example, some AI-generated videos may show a woman wearing blue nail polish.
After certain movements, the nail might go out of view and then reappear, but it could change color, such as to red.
The iPhone unboxing video in the previous section also exhibits inconsistencies.
The man appears to be about to peel the protective film from the iPhone, but suddenly, the protective film’s handle turns into a piece of paper.&lt;/p&gt;

&lt;h2 id=&quot;whats-next&quot;&gt;What’s Next?&lt;/h2&gt;

&lt;p&gt;With CakeLens now publicly available, what’s next?
First, I want to improve its accuracy.
I hope the additional data from user-submitted videos will enhance the model’s performance.
However, I’ve already spent a few thousand dollars on this project (hoping my wife doesn’t get upset about the bill 😅).&lt;/p&gt;

&lt;figure&gt;&lt;img src=&quot;/images/2025-06-03-i-built-ai-gen-video-detection-model-and-browser-extension-in-a-month/modal-cost.png&quot; alt=&quot;Screenshot of Modal&apos;s billing dashboard showing the total cost as $2.1K&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;Screenshot of Modal&apos;s billing dashboard showing the total cost as $2.1K&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;That’s a bit expensive for a pet project, but I see it as tuition for learning machine learning.
Moving forward, I’ll explore options for lower-cost local training, even if it’s slower.&lt;/p&gt;

&lt;p&gt;Here are some ideas I’d like to pursue if time and budget permit:&lt;/p&gt;

&lt;h3 id=&quot;trying-different-approaches&quot;&gt;Trying Different Approaches&lt;/h3&gt;

&lt;p&gt;My initial model detects subtle patterns in AI-generated video sequences.
I’m curious about analyzing videos in the frequency domain instead.
While the model could theoretically learn such transformations itself, doing so requires significant computing resources and model complexity.
Preprocessing frames in the frequency domain might allow a simpler model to achieve similar results with fewer resources.&lt;/p&gt;
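&lt;p&gt;As a rough illustration of what such preprocessing could look like (this is a hypothetical sketch, not something CakeLens currently does), one could feed the model a log-amplitude spectrum of each frame computed with PyTorch’s FFT:&lt;/p&gt;

```python
import torch

def to_log_amplitude_spectrum(frame):
    # frame: (C, H, W) image tensor; returns the log-amplitude spectrum
    # with the zero-frequency component shifted to the center.
    spectrum = torch.fft.fft2(frame)  # 2D FFT over the last two dims
    spectrum = torch.fft.fftshift(spectrum, dim=(-2, -1))
    return torch.log1p(spectrum.abs())  # compress the dynamic range

frame = torch.rand(3, 64, 64)  # hypothetical video frame
features = to_log_amplitude_spectrum(frame)
```

&lt;p&gt;Compression and upsampling artifacts often show up as periodic patterns in such a spectrum, which a smaller model might pick up more easily than from raw pixels.&lt;/p&gt;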

&lt;h3 id=&quot;bigger-model&quot;&gt;Bigger model&lt;/h3&gt;

&lt;p&gt;I’ve always wanted to explore the possibility of building a larger model, such as one with deeper CNN layers.
However, my current model design is limited by computational resources.
While I’m not sure if this is the rabbit hole I want to dive into right now, I’ll likely need to explore distributed training for a much larger model at some point.
If I can get multiple instances running at a low cost, it might be worth the time to pursue.&lt;/p&gt;

&lt;h3 id=&quot;analyzing-audio&quot;&gt;Analyzing Audio&lt;/h3&gt;

&lt;p&gt;Currently, CakeLens doesn’t process audio.
Previously, missing audio or audio that sounded off was a strong indicator of AI-generated videos.
However, Google’s Veo3 now generates audio synced with visuals, making detection more challenging.
Still, I believe analyzing both video and audio could be valuable.&lt;/p&gt;

&lt;p&gt;Even though Veo3 can generate videos with synced audio, real-world sound has distinct characteristics influenced by factors such as the speaker’s skull shape (implied by their face), the surrounding environment, the microphone, and so on.
If a connection can be made between visual cues and audio, a well-designed model with sufficient data might still be able to determine how the audio should sound.&lt;/p&gt;

&lt;p&gt;However, this is challenging because videos may include music or unrelated soundtracks, or they may be heavily post-processed.
When training the model to recognize AI-generated content using both image and audio, it will need to learn to ignore the soundtrack when it’s unrelated to the visuals.&lt;/p&gt;

&lt;h3 id=&quot;understanding-how-it-works&quot;&gt;Understanding How It Works&lt;/h3&gt;

&lt;p&gt;I assume my model detects AI-generated content by identifying subtle details—such as camera focus, distortion, or frequency domain patterns—that the human eye misses.
However, I don’t fully understand what drives its decisions.
To improve the model, I need tools to visualize what triggers a positive detection.
This would clarify what works and guide next-generation designs.&lt;/p&gt;

&lt;p&gt;Without this, it’s hard to pinpoint why the model flags content with confidence.
Could it be detecting hidden watermarks embedded by AI generation models?
Or is it associating specific logos, like TikTok’s, with non-AI content?
Or does it flag content simply because certain objects appear more frequently in AI-generated videos?
This is another intriguing area to explore.&lt;/p&gt;

&lt;h3 id=&quot;expanding-detection-to-still-images-and-audio&quot;&gt;Expanding Detection to Still Images and Audio&lt;/h3&gt;

&lt;p&gt;I focused on videos first because they contain more information, making detection easier.
I’m considering training models for AI-generated images and audio.
If videos have detectable patterns, images and audio likely do too.
Solving this for images and audio would also enhance video detection.&lt;/p&gt;

&lt;h3 id=&quot;using-synthetic--augmented-data&quot;&gt;Using Synthetic &amp;amp; Augmented Data&lt;/h3&gt;

&lt;p&gt;So far, I’ve relied on internet-sourced data for training, as generating AI videos is costly.
If budget weren’t an issue, I’d explore synthetic data.
For example, I could take a real video, extract a few key frames, feed them into a video generation model to predict the next few seconds, and compare the real and AI-generated videos.
This would provide ground truth data versus the generated video, potentially helping the model learn AI-specific patterns.&lt;/p&gt;

&lt;p&gt;Another idea is to extend the video data by applying slight transformations, such as cropping or slightly altering the color.
I could also re-encode the video multiple times. Videos on the internet undergo various transformations by uploaders, and repeated re-encoding may cause the loss of critical details needed to determine if a video is AI-generated.
Simulating these changes could help the model detect patterns more robustly.&lt;/p&gt;
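&lt;p&gt;A minimal sketch of what such augmentation might look like in plain PyTorch (the crop ratio and color-scaling range here are arbitrary choices for illustration); simulating re-encoding would additionally require a video codec and isn’t shown:&lt;/p&gt;

```python
import torch

def augment(frames):
    # frames: (T, C, H, W) clip tensor with values in [0, 1]
    _t, c, h, w = frames.shape
    # Random crop to 7/8 of each spatial dim, held fixed across the clip.
    ch, cw = (h * 7) // 8, (w * 7) // 8
    top = torch.randint(0, h - ch + 1, (1,)).item()
    left = torch.randint(0, w - cw + 1, (1,)).item()
    out = frames[:, :, top:top + ch, left:left + cw]
    # Slight per-channel color scaling, mimicking mild color grading.
    scale = 1.0 + 0.05 * (torch.rand(1, c, 1, 1) - 0.5)
    return (out * scale).clamp(0.0, 1.0)

clip = torch.rand(8, 3, 64, 64)  # hypothetical 8-frame clip
augmented = augment(clip)
```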

&lt;h3 id=&quot;identifying-the-source-model&quot;&gt;Identifying the Source Model&lt;/h3&gt;

&lt;p&gt;When labeling AI-generated videos, I include the generating model if it’s mentioned in the post.
I then built a model to predict which generating model was used.
However, the model is currently too inaccurate to be useful.
AI-generated videos are already scarce, and data for specific models is even sparser.
Without significantly more samples, accurately identifying the source model remains challenging.
A better approach might be to generate videos myself, but this would require substantial funding for video generation.
This is unlikely to happen in the short term unless I have funds available to invest.&lt;/p&gt;

&lt;h3 id=&quot;implement-flagging&quot;&gt;Implement Flagging&lt;/h3&gt;

&lt;p&gt;As mentioned previously, the browser extension should ideally flag AI-generated content automatically by downloading a list of known AI-generated content.
This feature is not yet implemented.
I may find time to build it at some point.&lt;/p&gt;

&lt;h2 id=&quot;final-thoughts&quot;&gt;Final Thoughts&lt;/h2&gt;

&lt;p&gt;That’s it!
I hope you enjoyed this article and learned something new.
Building CakeLens was an incredible learning experience, and I’m glad I spent a month on this detour in my journey to build AGI.
The knowledge gained was worth it.
Machine learning simplifies many tasks, and innovations in generative AI have created new demands, such as detecting whether a video, image, or audio file is AI-generated.
It’s an exciting time to explore these possibilities!
Now that CakeLens is live, give it a try!
If you don’t want to install the Chrome extension but would like a video analyzed, tag &lt;a href=&quot;https://x.com/CakeLens&quot;&gt;@CakeLens&lt;/a&gt; on X, and I’ll manually process it and reply with the results.&lt;/p&gt;

&lt;p&gt;If you have feedback or questions about this project, leave a comment below or email me.
Thanks for reading!&lt;/p&gt;
</description>
        <pubDate>Tue, 03 Jun 2025 07:00:00 +0000</pubDate>
        <link>https://fangpenlin.com/posts/2025/06/03/i-built-ai-gen-video-detection-model-and-browser-extension-in-a-month/</link>
        <guid isPermaLink="true">https://fangpenlin.com/posts/2025/06/03/i-built-ai-gen-video-detection-model-and-browser-extension-in-a-month/</guid>
      </item>
    
      <item>
        <title>Nvidia GPU on bare metal NixOS Kubernetes cluster explained</title>
        <description>&lt;style type=&quot;text/css&quot;&gt;
  figure {
    text-align: center;
    margin: 0 auto;
  }
  figcaption, figcaption p {
    color: grey;
    font-size: 0.9em;
  }
  figure img {
    max-width: 100%;
  }
&lt;/style&gt;

&lt;p&gt;Since the last time I published &lt;a href=&quot;/posts/2025/02/18/maze-my-ai-models-are-finally-evolving/&quot;&gt;the second MAZE (Massive Argumented Zonal Environments) article&lt;/a&gt;, I realized that the framework is getting more mature, but I need a solution to run it on a large scale.
In the past, I built &lt;a href=&quot;/posts/2024/01/14/high-speed-usb4-mesh-network/&quot;&gt;a bare metal Kubernetes cluster on top of a three-node mini PC interconnected with USB4&lt;/a&gt;.
I also have a retired workstation PC with an Nvidia GeForce RTX 2080 Ti.
I wondered: why not put this PC into the three-node Kubernetes cluster and configure Kubernetes to make CUDA work?
Once I have one GPU node configured, extending the computing power capacity with more GPUs will be easier.&lt;/p&gt;

&lt;figure&gt;&lt;img src=&quot;/images/2025-03-01-nvidia-gpu-on-bare-metal-nixos-k8s-explained/btop.png&quot; alt=&quot;Screenshot of btop of my four-node Kubernetes cluster, with three mini PCs connected via USB4 and the newly added retired workstation sporting an Nvidia GeForce RTX 2080 Ti.&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;Screenshot of &lt;a href=&quot;https://github.com/aristocratos/btop&quot;&gt;btop&lt;/a&gt; of my four-node Kubernetes cluster, with three mini PCs connected via USB4 and the newly added retired workstation sporting an Nvidia GeForce RTX 2080 Ti.&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;With that in mind, I took a week on this side quest.
It turns out it’s way harder than I expected. I have learned a lot from digging this rabbit hole, and I encountered other holes while digging this one.
Getting the &lt;a href=&quot;https://github.com/NVIDIA/k8s-device-plugin&quot;&gt;Nvidia device plugin&lt;/a&gt; up and running in Kubernetes is hard.
Not to mention, running on top of &lt;a href=&quot;https://nixos.org&quot;&gt;NixOS&lt;/a&gt; makes it even more challenging.
Regardless, I finally managed to get it to work! Seeing it running for the first time was a very joyful moment.&lt;/p&gt;

&lt;figure&gt;&lt;img src=&quot;/images/2025-03-01-nvidia-gpu-on-bare-metal-nixos-k8s-explained/cuda-on-k8s.png&quot; alt=&quot;Screenshot of the Kubernetes dashboard shows the logs of a MAZE pod  running an experiment on a CUDA device&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;Screenshot of the &lt;a href=&quot;https://github.com/kubernetes/dashboard&quot;&gt;Kubernetes dashboard&lt;/a&gt; shows the logs of a MAZE pod  running an experiment on a CUDA device&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;An article like this could be helpful, as more software engineers are getting into machine learning.
Running all of it in the cloud is a costly option.
There are also privacy concerns for personal use, so an Nvidia GPU on a local bare-metal Kubernetes cluster is still a very tempting option.
Here, I would like to share my experience setting up a bare-metal Kubernetes cluster with the Nvidia GPU CDI plugin enabled on NixOS.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;chain-of-rabbit-holes&quot;&gt;Chain of rabbit holes&lt;/h2&gt;

&lt;p&gt;I want to make my Kubernetes cluster support Nvidia GPUs, not because I like it, but because I need a way to scale MAZE so I can continue my research more efficiently.
In software engineering, when you encounter a problem that requires dedicated time and effort to find the solution or root cause, I call it digging a rabbit hole.
It’s a reference from &lt;a href=&quot;https://en.wikipedia.org/wiki/Alice&apos;s_Adventures_in_Wonderland&quot;&gt;Alice’s Adventures in Wonderland&lt;/a&gt;’s &lt;a href=&quot;https://en.wikipedia.org/wiki/Down_the_rabbit_hole&quot;&gt;Down the rabbit hole&lt;/a&gt;.&lt;/p&gt;

&lt;figure&gt;&lt;img src=&quot;/images/2025-03-01-nvidia-gpu-on-bare-metal-nixos-k8s-explained/rabbit.jpg&quot; alt=&quot;Rabbit from Alice&apos;s Adventures in Wonderland, by John Tenniel&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;Rabbit from Alice&apos;s Adventures in Wonderland, by &lt;a href=&quot;https://en.wikipedia.org/wiki/John_Tenniel&quot;&gt;John Tenniel&lt;/a&gt;&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;Interestingly, while digging one rabbit hole, you often find yourself needing to dig another because of a new obstacle.
I call this the chain of rabbit holes.
For example, I just heard on the &lt;a href=&quot;https://youtu.be/OxP55dZjqZs?si=L034C0RzrwxB2mIR&amp;amp;t=4828&quot;&gt;All-In podcast&lt;/a&gt; that for Elon Musk to build Grok in a short time frame, his team had to:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Find a place to put one hundred GPUs&lt;/li&gt;
  &lt;li&gt;Need power, so buy generators&lt;/li&gt;
  &lt;li&gt;Need cooling, so buy portable liquid cooling&lt;/li&gt;
  &lt;li&gt;Realize the need to smooth power usage, so install Tesla Powerpacks&lt;/li&gt;
  &lt;li&gt;…&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is a great example of a chain of rabbit holes.
The ability to dig rabbit holes and solve hard problems down the chain boils down to two words:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Execution power&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;DeepSeek’s recent success story also boils down to execution power.
They recently announced their &lt;a href=&quot;https://github.com/deepseek-ai/3FS&quot;&gt;customized distributed file system to optimize AI training&lt;/a&gt;.
Their ability to dig rabbit holes like this, or the earlier &lt;a href=&quot;https://docs.nvidia.com/cuda/parallel-thread-execution/&quot;&gt;PTX (Parallel Thread Execution)&lt;/a&gt; optimization, shows great execution power.&lt;/p&gt;

&lt;p&gt;It may sound obvious, but there are many interesting aspects to consider while digging rabbit holes.
Of course, given real-world constraints, one cannot dig every rabbit hole, or the project would never finish.
Also, not all rabbit holes are worth digging.
Since this is an interesting but rarely discussed topic, I’ll share a bit of my thought process behind the decision to go down each rabbit hole.
Before that, let’s look at the chain of rabbit holes I dug over the past week:&lt;/p&gt;

&lt;figure&gt;&lt;img src=&quot;/images/2025-03-01-nvidia-gpu-on-bare-metal-nixos-k8s-explained/chain-of-rabbit-hole.png&quot; alt=&quot;Diagram illustrating the chain of rabbit holes I encountered while setting up Nvidia GPU support on my Kubernetes cluster.&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;Diagram illustrating the chain of rabbit holes I encountered while setting up Nvidia GPU support on my Kubernetes cluster.&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;h2 id=&quot;running-kubernetes-with-nixos&quot;&gt;Running Kubernetes with NixOS&lt;/h2&gt;

&lt;p&gt;This is not a new rabbit hole.
I had already figured out how to deploy Kubernetes on NixOS while building the three-node cluster.
But since I didn’t cover the details in the previous article, it makes sense to bring them up here.&lt;/p&gt;

&lt;p&gt;I love NixOS and &lt;a href=&quot;https://github.com/NixOS/nixpkgs/&quot;&gt;Nixpkgs&lt;/a&gt;.
The ability to configure a whole operating system with just configuration files, making it reproducible, is simply amazing. Here’s an example of a hello-world NixOS config file:&lt;/p&gt;

&lt;div class=&quot;language-nix highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;config&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;lib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;pkgs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;...&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}:&lt;/span&gt;

&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;imports&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# Include the results of the hardware scan.&lt;/span&gt;
      &lt;span class=&quot;sx&quot;&gt;./hardware-configuration.nix&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;

  &lt;span class=&quot;c&quot;&gt;# Use the systemd-boot EFI boot loader.&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;boot&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;loader&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;systemd-boot&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;enable&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;boot&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;loader&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;efi&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;canTouchEfiVariables&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

  &lt;span class=&quot;nv&quot;&gt;system&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;stateVersion&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;24.11&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;While terrific, NixOS doesn’t come without drawbacks, particularly when deploying it to multiple machines.
The first problem is that while most of the configuration is identical across machines, some values, such as the hostname, still differ.
You cannot write one simple configuration and apply it verbatim to every machine.
NixOS generates configuration files in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/etc&lt;/code&gt; folder as read-only files with symbolic links to the actual config files in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/nix/store&lt;/code&gt; folder.&lt;/p&gt;

&lt;figure&gt;&lt;img src=&quot;/images/2025-03-01-nvidia-gpu-on-bare-metal-nixos-k8s-explained/etc-links.png&quot; alt=&quot;Screenshot of the output from listing /etc folder shows many symbolic links&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;Screenshot of the output from listing /etc folder shows many symbolic links&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;I found a tool called &lt;a href=&quot;https://github.com/serokell/deploy-rs&quot;&gt;deploy-rs&lt;/a&gt; for deploying to multiple machines with slightly different configurations.
It lets you define a configuration for each machine, with per-machine customizations, and deploy them easily over SSH.
Usually, this is good enough for simple things like hostnames or Linux settings.
However, running a full-blown Kubernetes cluster is yet another story. Just generating and deploying the &lt;a href=&quot;https://en.wikipedia.org/wiki/Public_key_infrastructure&quot;&gt;PKI (Public key infrastructure)&lt;/a&gt; certificates is troublesome.&lt;/p&gt;
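&lt;p&gt;For context, a deploy-rs node definition in a flake looks roughly like this. This is only a sketch: the node name, hostname, and SSH user are made-up placeholders, not my actual setup.&lt;/p&gt;

```nix
{
  # Sketch of a deploy-rs node inside flake.nix outputs.
  # "worker1" and "worker1.local" are hypothetical placeholders.
  deploy.nodes.worker1 = {
    hostname = "worker1.local";
    profiles.system = {
      sshUser = "root";
      user = "root";
      # Activate the NixOS configuration built for this node
      path = deploy-rs.lib.x86_64-linux.activate.nixos
        self.nixosConfigurations.worker1;
    };
  };
}
```

&lt;p&gt;Running &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;deploy .#worker1&lt;/code&gt; then builds the configuration and activates it on the target over SSH.&lt;/p&gt;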

&lt;p&gt;Generating Kubernetes certificates also means creating secret keys and deploying them to each node.
NixOS is very good at producing a predictable environment with immutable packages and configuration files, but it is not ideal for distributing centrally generated secret values.
I certainly don’t want to commit my secret keys to Git history along with the NixOS configuration files.
Therefore, we need another approach to deploy dynamically generated configurations and secret keys to all those machines.&lt;/p&gt;

&lt;p&gt;I am pretty familiar with &lt;a href=&quot;https://docs.ansible.com&quot;&gt;Ansible&lt;/a&gt; because I used it a lot in the past and even contributed to &lt;a href=&quot;https://github.com/ansible/ansible/pull/8323&quot;&gt;the omit filter&lt;/a&gt; many years ago.
So when I encountered this problem, Ansible was the first tool that came to mind for deploying the Kubernetes configuration files and generating the PKI secrets and certificates.
I don’t plan to cover too much detail, but here’s what I do.
I override the Kubernetes service definitions in the NixOS configuration so that they read their arguments from a local file, like this:&lt;/p&gt;

&lt;div class=&quot;language-nix highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;config&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;pkgs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;lib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;systemd&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;services&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;kubelet&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;serviceConfig&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;nv&quot;&gt;Restart&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;mkForce&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;always&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
      &lt;span class=&quot;nv&quot;&gt;RestartSec&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;mkForce&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;20s&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
      &lt;span class=&quot;nv&quot;&gt;EnvironmentFile&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;/var/lib/k8s/kubelet.env&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
      &lt;span class=&quot;nv&quot;&gt;ExecStart&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;mkForce&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&apos;&apos;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;pkgs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;bash&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/bin/bash -c &quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;pkgs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;kubernetes&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/bin/kubelet $_KUBE_ARGS&quot;&apos;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
  &lt;span class=&quot;c&quot;&gt;# ... other k8s services&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then, I have an Ansible playbook that SSHes into all the machines and generates the corresponding argument and config files, along with the PKI certificates and secret values.
Here’s an example of the Ansible Jinja2 template for generating the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/var/lib/k8s/kubelet.env&lt;/code&gt; shown in the above Nix example:&lt;/p&gt;

&lt;div class=&quot;language-jinja highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;cp&quot;&gt;{{&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;ansible_managed&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;| &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;comment&lt;/span&gt; &lt;span class=&quot;cp&quot;&gt;}}&lt;/span&gt;

_KUBE_ARGS=&lt;span class=&quot;cp&quot;&gt;{{&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
      &lt;span class=&quot;s2&quot;&gt;&quot;--root-dir=/var/lib/k8s/kubelet&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;s2&quot;&gt;&quot;--config=/var/lib/k8s/kubelet/config.yaml&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;s2&quot;&gt;&quot;--kubeconfig=/var/lib/k8s/pki/kubelet.conf&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;s2&quot;&gt;&quot;--image-credential-provider-config=/var/lib/k8s/kubelet/image-credential-provider.yaml&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;s2&quot;&gt;&quot;--image-credential-provider-bin-dir=/run/current-system/sw/bin&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;s2&quot;&gt;&quot;--runtime-cgroups=/system.slice/containerd.service&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;s2&quot;&gt;&quot;--node-ip={}&quot;&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;format&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;nebula1_node_ip&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;| &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;ansible&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;utils.ipaddr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;address&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)),&lt;/span&gt;
      &lt;span class=&quot;s2&quot;&gt;&quot;-v&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;kubelet_log_level&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;| &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;default&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;| &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;quote&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;--node-labels={}&quot;&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;format&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;&apos;,&apos;&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;join&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;node_labels&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))]&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;node_labels&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;is&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;defined&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[])&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;| &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;join&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot; &quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;| &lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;quote&lt;/span&gt;
&lt;span class=&quot;cp&quot;&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;You may ask where the secret values are kept.
I use &lt;a href=&quot;https://github.com/getsops/sops&quot;&gt;Sops&lt;/a&gt; and its &lt;a href=&quot;https://docs.ansible.com/ansible/latest/collections/community/sops/index.html&quot;&gt;Ansible plugin&lt;/a&gt;.
All the values are encrypted with a GPG key.&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Deploy k8s cluster&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;hosts&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;all&lt;/span&gt;

  &lt;span class=&quot;na&quot;&gt;pre_tasks&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;import_tasks&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;tasks/sops.yaml&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;tags&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;pi&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;sops&quot;&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;always&quot;&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;]&lt;/span&gt;
  
  &lt;span class=&quot;na&quot;&gt;tasks&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Do something with the secret value&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;ansible.builtin.debug&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;msg&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;{{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;secrets.my-value&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;}}&quot;&lt;/span&gt;
    
    &lt;span class=&quot;c1&quot;&gt;# ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
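&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tasks/sops.yaml&lt;/code&gt; imported above can be as simple as loading the encrypted file into a variable with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;community.sops&lt;/code&gt; collection. Here’s a minimal sketch; the secrets file path is just an example, not my actual layout:&lt;/p&gt;

```yaml
# tasks/sops.yaml (sketch): decrypt the Sops file using the GPG key in
# the local keyring and expose its values under the "secrets" variable
- name: Load encrypted secret values
  community.sops.load_vars:
    file: secrets.sops.yaml
    name: secrets
```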

&lt;p&gt;By combining NixOS, Ansible, and Sops, I can easily deploy a Kubernetes cluster to as many bare-metal machines as I want.
I even built &lt;a href=&quot;https://nixos.wiki/wiki/Creating_a_NixOS_live_CD&quot;&gt;a NixOS live CD environment for a USB thumb drive&lt;/a&gt; to help speed up the bootstrapping process.
But this article already has too much content, so I will leave that for next time.&lt;/p&gt;

&lt;h2 id=&quot;kubernetes-with-nvidia-gpu-cdi-plugin-explained&quot;&gt;Kubernetes with Nvidia GPU CDI plugin explained&lt;/h2&gt;

&lt;p&gt;Running a pod on Kubernetes with an Nvidia GPU exposed sounds straightforward.
How hard could it be? But when you look closer, it immediately gives you a headache.
The sheer number of terms alone may overwhelm some people:
CRI, CDI, nvidia-container-toolkit, libnvidia-container, and more, all confusing at first glance.&lt;/p&gt;

&lt;p&gt;Although I have plenty of experience in Kubernetes and &lt;a href=&quot;https://beanhub.io/blog/2024/04/23/how-beanhub-works-part1-sandboxing/&quot;&gt;container&lt;/a&gt; &lt;a href=&quot;https://beanhub.io/blog/2024/06/26/how-beanhub-works-part2-layer-based-git-repos/&quot;&gt;technologies&lt;/a&gt;, it also took me a while to fully understand those terms and how they are interconnected.
Let’s first look at the big picture of how everything fits together, and then explore the details of each component.&lt;/p&gt;

&lt;figure&gt;&lt;img src=&quot;/images/2025-03-01-nvidia-gpu-on-bare-metal-nixos-k8s-explained/nvidia-k8s-architecture.svg&quot; alt=&quot;Schematic overview of the Nvidia GPU integration architecture within Kubernetes on NixOS, showcasing the interplay between Container Runtime Interface (CRI), Container Device Interface (CDI), nvidia-container-toolkit, and the k8s-device-plugin.&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;Schematic overview of the Nvidia GPU integration architecture within Kubernetes on NixOS, showcasing the interplay between Container Runtime Interface (CRI), Container Device Interface (CDI), nvidia-container-toolkit, and the k8s-device-plugin.&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;The official &lt;a href=&quot;https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/arch-overview.html&quot;&gt;document from Nvidia about the architecture&lt;/a&gt; will also help you understand.&lt;/p&gt;

&lt;h3 id=&quot;container-runtime-interface&quot;&gt;Container Runtime Interface&lt;/h3&gt;

&lt;p&gt;If you are familiar with Kubernetes, you know there are many “CxI” terms, such as &lt;a href=&quot;https://github.com/containernetworking/cni&quot;&gt;CNI (Container Network Interface)&lt;/a&gt; or &lt;a href=&quot;https://github.com/container-storage-interface/spec&quot;&gt;CSI (Container Storage Interface)&lt;/a&gt;; they are all well-defined interfaces that open up the implementation of fundamental Kubernetes functionality to third parties.
For instance, CNI is for networking, and CSI is for storage.&lt;/p&gt;

&lt;p&gt;Likewise, CRI stands for &lt;a href=&quot;https://kubernetes.io/docs/concepts/architecture/cri/&quot;&gt;Container Runtime Interface&lt;/a&gt;, the plugin interface for adopting different container management systems.
The most well-known ones are &lt;a href=&quot;https://containerd.io&quot;&gt;Containerd&lt;/a&gt; and &lt;a href=&quot;https://cri-o.io&quot;&gt;CRI-O&lt;/a&gt;.
On NixOS, you can change the Containerd settings like this to add the Nvidia runtime:&lt;/p&gt;

&lt;div class=&quot;language-nix highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;virtualisation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;containerd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;settings&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;plugins&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;s2&quot;&gt;&quot;io.containerd.grpc.v1.cri&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;nv&quot;&gt;enable_cdi&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;nv&quot;&gt;cdi_spec_dirs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;/etc/cdi&quot;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;/var/run/cdi&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
        &lt;span class=&quot;nv&quot;&gt;containerd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
          &lt;span class=&quot;nv&quot;&gt;default_runtime_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;nvidia&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
          &lt;span class=&quot;nv&quot;&gt;runtimes&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;nv&quot;&gt;nvidia&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
              &lt;span class=&quot;nv&quot;&gt;privileged_without_host_devices&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
              &lt;span class=&quot;nv&quot;&gt;runtime_type&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;io.containerd.runc.v2&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
              &lt;span class=&quot;nv&quot;&gt;options&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;nv&quot;&gt;BinaryName&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;pkgs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;nvidia-container-toolkit&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/bin/nvidia-container-runtime&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
              &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
          &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;nvidia-container-toolkit&quot;&gt;nvidia-container-toolkit&lt;/h3&gt;

&lt;p&gt;We just mentioned CRI above. One of the jobs of &lt;a href=&quot;https://github.com/NVIDIA/nvidia-container-toolkit&quot;&gt;nvidia-container-toolkit&lt;/a&gt; is sitting between Kubernetes’ CRI implementation and the container runtime.
What is a container runtime?
Usually, it’s a runtime provided as a command line tool to launch an isolated container process by creating new &lt;a href=&quot;https://en.wikipedia.org/wiki/Linux_namespaces&quot;&gt;Linux namespaces&lt;/a&gt; and mounting host paths into the container.
The most commonly used ones are &lt;a href=&quot;https://github.com/opencontainers/runc&quot;&gt;runc&lt;/a&gt; and &lt;a href=&quot;https://github.com/containers/crun&quot;&gt;crun&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The nvidia-container-toolkit ships its own container runtime, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nvidia-container-runtime&lt;/code&gt; executable.
It may sound confusing. Why did I say &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nvidia-container-runtime&lt;/code&gt; is a container runtime but also sits between Kubernetes and the container runtime?
Well, the thing is, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nvidia-container-runtime&lt;/code&gt; doesn’t implement a full-blown container runtime.&lt;/p&gt;

&lt;p&gt;It’s a thin layer on top of lower-level runtime implementations such as runc or crun.
When Containerd or CRI-O invokes Nvidia’s runtime, the OCI container spec is modified to expose the needed Nvidia drivers, devices, and executables like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nvidia-smi&lt;/code&gt; inside the container.
Once the spec file is modified, it calls the underlying crun or runc to do the actual Linux namespace management and all the other container good stuff.
So, it’s like a proxy (or some may prefer to call it a shim) between the CRI and the runtime, &lt;a href=&quot;https://github.com/NVIDIA/nvidia-container-toolkit/blob/bc9ec77fdde552949022cbaf74c8b56e67702125/internal/runtime/runtime_factory.go#L44-L60&quot;&gt;performing a few modifications&lt;/a&gt; to the traffic passing through.&lt;/p&gt;

&lt;figure&gt;&lt;img src=&quot;/images/2025-03-01-nvidia-gpu-on-bare-metal-nixos-k8s-explained/nvidia-container-runtime-proxy.svg&quot; alt=&quot;Screenshot shows Containerd passes the original OCI spec file config.json to nvidia-container-runtime, and then nvidia-container-runtime modifies the config.json file and passes it along to runc&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;Screenshot shows Containerd passes the original OCI spec file config.json to nvidia-container-runtime, and then nvidia-container-runtime modifies the config.json file and passes it along to runc&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;Other than the container runtime, the nvidia-container-toolkit project also provides utility tools for generating various configurations, such as this one for generating the container runtime configuration for Containerd:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;nvidia-ctk runtime configure &lt;span class=&quot;nt&quot;&gt;--runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;containerd
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;libnvidia-container&quot;&gt;libnvidia-container&lt;/h3&gt;

&lt;p&gt;There are some confusing parts about this repository.
While its name says &lt;a href=&quot;https://github.com/NVIDIA/libnvidia-container&quot;&gt;libnvidia-container&lt;/a&gt;, it is not used as a library the way other lib-prefixed projects usually are; at least, not from the perspective of the Kubernetes plugin.
Instead, it provides an executable, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nvidia-container-cli&lt;/code&gt;. The main purpose of this executable is to collect and provide information about the Nvidia drivers, devices, and executables to be injected into the container environment.
It is usually &lt;a href=&quot;https://github.com/NVIDIA/nvidia-container-toolkit/blob/bc9ec77fdde552949022cbaf74c8b56e67702125/cmd/nvidia-container-runtime-hook/main.go#L91C2-L156&quot;&gt;invoked directly&lt;/a&gt; from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nvidia-container-runtime&lt;/code&gt;’s hook command.
I guess they need the same logic elsewhere in Nvidia’s container-related projects, and that might be why they made it a library.&lt;/p&gt;

&lt;p&gt;This command-line tool is used when using Nvidia’s container runtime directly without CDI support.
If we take the CDI route instead, it appears that this command-line tool is not used at all.&lt;/p&gt;
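&lt;p&gt;If you are curious what information it gathers, you can invoke it directly on a host with an Nvidia driver installed (the output naturally depends on your driver and hardware):&lt;/p&gt;

```shell
# Print the driver and device information nvidia-container-cli collects
nvidia-container-cli info

# List the libraries, binaries, and device nodes it would inject
# into a container environment
nvidia-container-cli list
```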

&lt;h3 id=&quot;container-device-interface&quot;&gt;Container Device Interface&lt;/h3&gt;

&lt;p&gt;Similar to CRI, &lt;a href=&quot;https://github.com/cncf-tags/container-device-interface&quot;&gt;CDI (Container Device Interface)&lt;/a&gt; is also a well-defined plugin interface for Kubernetes.
Instead of providing container runtime functionality, it exposes particular device resources to the container environment.
With Nvidia’s CDI implementation, we can then expose GPU devices as resources in the container.&lt;/p&gt;
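&lt;p&gt;The CDI spec files that the container runtime reads from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/etc/cdi&lt;/code&gt; can be generated with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nvidia-ctk&lt;/code&gt; tool shipped with nvidia-container-toolkit:&lt;/p&gt;

```shell
# Scan the host for Nvidia GPUs and write a CDI spec describing the
# devices, driver libraries, and binaries to expose to containers
nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# List the device names defined in the generated spec
nvidia-ctk cdi list
```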

&lt;p&gt;In addition to Kubernetes, a container CLI tool like Podman can also use proper CDI configuration to have an Nvidia GPU inside the container.&lt;/p&gt;

&lt;div class=&quot;language-text highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ podman run --device nvidia.com/gpu=all alpine nvidia-smi
Sat Mar  1 23:51:00 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.77                 Driver Version: 565.77         CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     Off |   00000000:0A:00.0  On |                  N/A |
| 46%   57C    P2            103W /  260W |    3998MiB /  11264MiB |     33%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--device&lt;/code&gt; argument with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nvidia.com/gpu=all&lt;/code&gt; value tells podman to look for the resource (device) type &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nvidia.com/gpu&lt;/code&gt;, and we want all of it exposed inside the container.&lt;/p&gt;

&lt;h3 id=&quot;k8s-device-plugin&quot;&gt;k8s-device-plugin&lt;/h3&gt;

&lt;p&gt;With nvidia-container-toolkit and libnvidia-container, it’s now possible to run a container with an Nvidia GPU exposed inside it.
But we still need a way to let Kubernetes know that there is a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nvidia.com/gpu&lt;/code&gt; resource available for pods to use, which nodes have it, how many are available, how many are in use, and of what types.
The &lt;a href=&quot;https://github.com/NVIDIA/k8s-device-plugin&quot;&gt;k8s-device-plugin&lt;/a&gt; is here to solve those problems.&lt;/p&gt;

&lt;p&gt;The k8s-device-plugin is deployed in the Kubernetes cluster as a &lt;a href=&quot;https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/&quot;&gt;DaemonSet&lt;/a&gt; that checks for the available Nvidia GPU devices on each node and reports that information to the Kubernetes API server.&lt;/p&gt;
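&lt;p&gt;Once the plugin has registered the devices, a pod can request a GPU through the standard resource limits. A minimal sketch (the image tag here is only an example):&lt;/p&gt;

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04  # example image tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1  # ask the device plugin for one GPU
```

&lt;p&gt;The scheduler will only place this pod on a node where the plugin reported at least one free &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nvidia.com/gpu&lt;/code&gt; resource.&lt;/p&gt;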

&lt;h2 id=&quot;dig-this-hole-first---build-nix-playground-for-hacking-and-patching-any-source-code-easily&quot;&gt;Dig this hole first - build nix-playground for hacking and patching any source code easily&lt;/h2&gt;

&lt;p&gt;Now we have the background knowledge about what’s what in the context of Nvidia GPU on Kubernetes.
Do you think it’s about time to discuss adding Nvidia GPU support to your bare-metal Kubernetes cluster?
Well, not so fast. I thought it shouldn’t be too hard: just deploy the k8s-device-plugin as &lt;a href=&quot;https://github.com/NVIDIA/k8s-device-plugin/tree/29e135d6b29b66cd84e479ecb2809267606b91ef/deployments/helm/nvidia-device-plugin&quot;&gt;a helm chart&lt;/a&gt;, generate and modify the configurations as described in Nvidia’s official documents, and it should be good to go, right?
Unfortunately, I saw many errors immediately after doing so, like this one from the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nvdp-nvidia-device-plugin&lt;/code&gt; DaemonSet:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;symbol lookup error: /nix/store/nqb2ns2d1lahnd5ncwmn6k84qfd7vx2k-glibc-2.40-36/lib/libc.so.6: undefined symbol: __tunable_is_initialized, version GLIBC_PRIVATE
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It turns out we need to dig deeper 😅.&lt;/p&gt;

&lt;figure&gt;&lt;img src=&quot;/images/2025-03-01-nvidia-gpu-on-bare-metal-nixos-k8s-explained/go-deeper.jpg&quot; alt=&quot;The We Need To Go Deeper meme from the Inception movie&quot; /&gt;&lt;figcaption&gt;&lt;p&gt;The &lt;a href=&quot;https://knowyourmeme.com/memes/we-need-to-go-deeper&quot;&gt;We Need To Go Deeper&lt;/a&gt; meme from the Inception movie&lt;/p&gt;&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;I am the type of engineer who enjoys digging into rabbit holes and learns a lot from doing so.
As I have said countless times, reading code is way harder than writing code, even though many people who don’t understand software engineering assume otherwise.
In the era of AI-generated code, reading code becomes even harder, and reading upstream source code is critical for solving hard problems.
Sometimes you are lucky and there is an easy way to work around a problem, but sometimes there is no way around finding the root cause.&lt;/p&gt;

&lt;p&gt;An error like this one is very tricky because the executable crashes immediately after the container process starts in its Linux namespaces.
Usually, when I see an error like this, I just &lt;a href=&quot;https://docs.podman.io/en/v5.1.2/markdown/podman-exec.1.html&quot;&gt;exec&lt;/a&gt; or &lt;a href=&quot;https://kubernetes.io/docs/reference/kubectl/generated/kubectl_debug/&quot;&gt;debug&lt;/a&gt; into the container with a shell to see what’s happening.
But that won’t work in this case, because every executable in the container crashes the same way, even &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/bin/sh&lt;/code&gt;.&lt;/p&gt;
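&lt;p&gt;For reference, this is roughly what that usual first step looks like, with the container and pod names here being placeholders:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# attach a shell to a running podman container
podman exec -it my-container /bin/sh

# or attach an ephemeral debug container to a Kubernetes pod
kubectl debug -it my-pod --image=busybox --target=my-container -- sh
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Neither helps here, since the shell itself hits the same symbol lookup error on startup.&lt;/p&gt;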

&lt;p&gt;I had no choice but to read the source code of Nvidia’s container projects, and I needed to modify the code to figure out what was going on.
Unfortunately, modifying source code and making a custom build is annoying.
You need to read the instructions and install the required dependencies, and sometimes you need very specific versions of those dependencies or a particular build environment.
Depending on how complex the project is, just getting it to build can take hours or even a few days.&lt;/p&gt;

&lt;p&gt;I wished there was a tool that could help me hack the source code of any open-source project and quickly rebuild it with my changes so I could try things out.
Then I realized: okay, why not just build one myself?
Obviously, this is yet another rabbit hole branching off the original one.
But if it works, it could be super useful down the road.
Considering the potential productivity gains and the leverage this tool could provide,
I decided to context-switch and dig this new rabbit hole first.&lt;/p&gt;

&lt;p&gt;I love NixOS and Nixpkgs because Nixpkgs is one giant build tree containing almost all major open-source software, from the Linux kernel down to small command-line utilities.
As far as I know, nothing else lets you build the whole system end to end under the same framework.
Thanks to that, the Nix build scripts were already there for all the projects I wanted to look into, including &lt;a href=&quot;https://github.com/NixOS/nixpkgs/blob/b27ba4eb322d9d2bf2dc9ada9fd59442f50c8d7c/pkgs/by-name/nv/nvidia-container-toolkit/package.nix&quot;&gt;nvidia-container-toolkit&lt;/a&gt; and &lt;a href=&quot;https://github.com/NixOS/nixpkgs/blob/nixos-24.11/pkgs/by-name/li/libnvidia-container/package.nix&quot;&gt;libnvidia-container&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Since each Nixpkgs package is just the artifact of yet another &lt;a href=&quot;https://nix.dev/manual/nix/2.22/language/derivations&quot;&gt;Nix’s derivation&lt;/a&gt;, which describes how to build the project and all its dependencies in a &lt;a href=&quot;https://nix.dev/manual/nix/2.25/protocols/derivation-aterm&quot;&gt;pure data format (ATerm)&lt;/a&gt;, I should be able to extract the source code from it.
With that in mind, I quickly built a prototype to try out, which I called &lt;a href=&quot;https://github.com/LaunchPlatform/nix-playground&quot;&gt;nix-playground&lt;/a&gt;.
And yes, I open-sourced it under the MIT license.
You can use it as a playground to easily hack on and patch any open-source project.
Here’s an example:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# checkout libnvidia-container package source code locally&lt;/span&gt;
np checkout nixpkgs#libnvidia-container

&lt;span class=&quot;c&quot;&gt;# modify the code&lt;/span&gt;
vim checkout/src/cli/main.c

&lt;span class=&quot;c&quot;&gt;# build the package with changes you made in the checkout folder and try it out&lt;/span&gt;
np build

&lt;span class=&quot;c&quot;&gt;# output the patch for applying on the production environments&lt;/span&gt;
np patch &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; bugfix.patch

&lt;span class=&quot;c&quot;&gt;# clean up the generated files&lt;/span&gt;
np clean
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
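&lt;p&gt;The generated patch can then be applied on top of the upstream package in a NixOS configuration through a Nixpkgs overlay. Here’s a minimal sketch, with the patch path being illustrative:&lt;/p&gt;

&lt;div class=&quot;language-nix highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;final: prev: {
  # rebuild libnvidia-container with the patch produced by `np patch`
  libnvidia-container = prev.libnvidia-container.overrideAttrs (old: {
    patches = (old.patches or [ ]) ++ [ ./bugfix.patch ];
  });
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;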

&lt;p&gt;I spent two days fixing some corner cases in this open-source project.
It now works great!
With the best shovel in hand for digging into the k8s-device-plugin crashing bug,
it was time to go back to the previous rabbit hole!&lt;/p&gt;

&lt;h2 id=&quot;hunting-the-k8s-device-plugin-crashing-bug&quot;&gt;Hunting the k8s-device-plugin crashing bug&lt;/h2&gt;

&lt;p&gt;One problem I faced when debugging the k8s-device-plugin crash was that I didn’t know how the plugin worked in detail.
So, I inserted many log lines into the code to trace how it worked.
Some may say, hey, you can use a debugger.
Well, yeah, I know, but very often setting up a debugger is also very tedious.
I use one when it comes in handy; otherwise, I just insert logs.&lt;/p&gt;

&lt;p&gt;I have been using &lt;a href=&quot;https://x.ai/blog/grok-3&quot;&gt;Grok 3&lt;/a&gt; for a while since its launch, and I really enjoy it.
Sometimes it makes things up or produces wrong results, but very often it provides sound analysis.
I never intend to rely on it fully.
To me, it’s more like riding a bike: I am still the one riding the bike, not the bike riding me.
The code it generates still can’t always meet my standards.
When I jumped into debugging the Nvidia stuff, I quickly realized that even though I don’t trust the quality of AI-generated code, it doesn’t matter for throw-away code.
I don’t care about the quality as long as it works.
Therefore, I &lt;a href=&quot;https://grok.com/share/bGVnYWN5_d3f969c9-fc68-41ba-aba7-78da89be5386&quot;&gt;asked Grok 3 to help me insert log entries into the source code I pasted&lt;/a&gt; from Nvidia’s GitHub repository, and it did a wonderful job.&lt;/p&gt;

&lt;p&gt;Thanks to Grok 3, I no longer had to write log entries manually for debugging and could instead focus on solving the actual problem.
Here’s the log output generated by the &lt;a href=&quot;https://github.com/NVIDIA/nvidia-container-toolkit/blob/bc9ec77fdde552949022cbaf74c8b56e67702125/cmd/nvidia-cdi-hook/update-ldcache/update-ldcache.go&quot;&gt;update-ldcache hook&lt;/a&gt; patched by Grok 3.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;2025-02-26T17:23:51-08:00: Starting update-ldcache
2025-02-26T17:23:51-08:00: Loading container state from
2025-02-26T17:23:51-08:00: Loaded container state
2025-02-26T17:23:51-08:00: Getting container root
2025-02-26T17:23:51-08:00: Determined container root: /home/fangpen/.local/share/containers/storage/overlay/5c92b5c6053ce2643bcbe516adf268ecdf03244ab3a514ff2e539b6017dbcf0e/merged
2025-02-26T17:23:51-08:00: Resolving ldconfig path from /nix/store/29mb4q8b5306f4gk2wh38h0c1akb0n97-glibc-2.40-36-bin/bin/ldconfig
2025-02-26T17:23:51-08:00: Resolved ldconfig path: /nix/store/29mb4q8b5306f4gk2wh38h0c1akb0n97-glibc-2.40-36-bin/bin/ldconfig
2025-02-26T17:23:51-08:00: Checking for /etc/ld.so.cache in container
2025-02-26T17:23:51-08:00: /etc/ld.so.cache exists, will update cache
2025-02-26T17:23:51-08:00: Creating config in /etc/ld.so.conf.d for folders: /nix/store/x5522a7p46nnbwxjv8w942p6qps7x0lw-nvidia-x11-565.77-6.6.79/lib
2025-02-26T17:23:51-08:00: Creating /etc/ld.so.conf.d if not exists
2025-02-26T17:23:51-08:00: Created /etc/ld.so.conf.d
2025-02-26T17:23:51-08:00: Creating temp config file in /etc/ld.so.conf.d
2025-02-26T17:23:51-08:00: Created temp config file: /home/fangpen/.local/share/containers/storage/overlay/5c92b5c6053ce2643bcbe516adf268ecdf03244ab3a514ff2e539b6017dbcf0e/merged/etc/ld.so.conf.d/nvcr-4265966838.conf
2025-02-26T17:23:51-08:00: Writing folders to config file: /nix/store/x5522a7p46nnbwxjv8w942p6qps7x0lw-nvidia-x11-565.77-6.6.79/lib
2025-02-26T17:23:51-08:00: Added folder: /nix/store/x5522a7p46nnbwxjv8w942p6qps7x0lw-nvidia-x11-565.77-6.6.79/lib
2025-02-26T17:23:51-08:00: Setting permissions on config file to 0644
2025-02-26T17:23:51-08:00: Set permissions on config file
2025-02-26T17:23:51-08:00: Preparing ldconfig arguments: ldconfig -r /home/fangpen/.local/share/containers/storage/overlay/5c92b5c6053ce2643bcbe516adf268ecdf03244ab3a514ff2e539b6017dbcf0e/merged -C /etc/ld.so.cache -f /etc/ld.so.conf
2025-02-26T17:23:51-08:00: Executing ldconfig with args: ldconfig -r /home/fangpen/.local/share/containers/storage/overlay/5c92b5c6053ce2643bcbe516adf268ecdf03244ab3a514ff2e539b6017dbcf0e/merged -C /etc/ld.so.cache -f /etc/ld.so.conf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With the logs, I could try things quickly and understand the flow of the code.
However, since the problem happens inside the container’s namespaces, I couldn’t observe it from outside them.
I needed to get a shell inside the namespaces to better understand what was going on.
So, after I narrowed the problem down to the update-ldcache hook, I added a long sleep before the Nvidia container runtime makes the exec call to ldconfig, so that I would have time to look around.&lt;/p&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;createContainer&lt;/code&gt; hook configuration at &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/var/run/cdi/nvidia-container-toolkit.json&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&quot;language-json highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;_&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;// ...&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;hooks&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;hookName&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;createContainer&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;path&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;//nix/store/il5kz2p67hdd05c9gmg8m5c5l8gbrk90-container-toolkit-container-toolkit-1.15.0-rc.3/bin/nvidia-ctk&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;args&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;nvidia-ctk&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;hook&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;update-ldcache&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;--ldconfig-path&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;/nix/store/29mb4q8b5306f4gk2wh38h0c1akb0n97-glibc-2.40-36-bin/bin/ldconfig&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;--folder&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;/nix/store/x5522a7p46nnbwxjv8w942p6qps7x0lw-nvidia-x11-565.77-6.6.79/lib&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;After the hook sleeps and hangs, I ran&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;lsns
        NS TYPE   NPROCS    PID USER    COMMAND
&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt; ... omit ...&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
4026533991 mnt         1 388163 fangpen /bin/sh
4026533992 uts         1 388163 fangpen /bin/sh
4026533993 ipc         1 388163 fangpen /bin/sh
4026533994 pid         1 388163 fangpen /bin/sh
4026533995 cgroup      1 388163 fangpen /bin/sh
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This showed me the PID of the container process, which I had launched via podman to reproduce the bug.
Then, I ran this to attach a shell inside the container’s namespaces:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;nsenter &lt;span class=&quot;nt&quot;&gt;-t&lt;/span&gt; &amp;lt;CONTAINER_PID&amp;gt; &lt;span class=&quot;nt&quot;&gt;-a&lt;/span&gt; /bin/sh
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Next, I ran ldconfig with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-p&lt;/code&gt; argument to print the current ldconfig cache.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ ldconfig -p
131 libs found in cache `/etc/ld.so.cache&apos;
        ( ... omit ...)
        libc.so.6 (libc6,x86-64, OS ABI: Linux 3.2.0) =&amp;gt; /lib64/libc.so.6
        ( ... omit ...)
Cache generated by: ldconfig (GNU libc) stable release version 2.34
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then, I ran the hook’s ldconfig command without &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-p&lt;/code&gt; to update the cache, and ran it again with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-p&lt;/code&gt; appended to see the resulting cache:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ ldconfig -r /home/fangpen/.local/share/containers/storage/overlay/3e2b7a6d05ca228fafe39b47a19952d4b81a9c413618c66a7013d794e4db3d96/merged -C /etc/ld.so.cache -f /etc/ld.so.conf -p
64 libs found in cache `/etc/ld.so.cache&apos;
        ( ... omit ...)
        libc.so.6 (libc6,x86-64) =&amp;gt; /nix/store/nqb2ns2d1lahnd5ncwmn6k84qfd7vx2k-glibc-2.40-36/lib/libc.so.6
        ( ... omit ...)
Cache generated by: ldconfig (GNU libc) stable release version 2.40
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Oops.
All the cache entries now point to libraries injected by the Nvidia container runtime, including glibc.
I guess the glibc compiled into the plugin’s container image is different from the one shipped with the Nvidia driver in the Nix store.
As a result, any executable that relies on glibc links against the wrong one, and since virtually all executables rely on glibc, they all crash immediately.&lt;/p&gt;

&lt;p&gt;After learning the root cause, I quickly drafted &lt;a href=&quot;https://github.com/NVIDIA/k8s-device-plugin/issues/1182&quot;&gt;an issue&lt;/a&gt;, reported it to Nvidia’s k8s-device-plugin repository, and found an easy fix.
Most containers come with config files in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/etc/ld.so.conf.d&lt;/code&gt; pointing to the existing library paths.
The Nvidia runtime hook generates its own config file pointing to the library path provided in the argument.
But without an entry pointing to the existing library paths, the regenerated cache contains only the injected libraries.
To solve the problem, I &lt;a href=&quot;https://github.com/NVIDIA/k8s-device-plugin/pull/1183&quot;&gt;created a PR&lt;/a&gt; that adds a config file pointing to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/lib64&lt;/code&gt;.&lt;/p&gt;

&lt;div class=&quot;language-Dockerfile highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;RUN &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;/lib64&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; /etc/ld.so.conf.d/lib64.conf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It was just a one-line code change, yet it took me a few days to find, including building the nix-playground tool 😅
That’s why one should never measure productivity by the number of lines added.&lt;/p&gt;

&lt;h2 id=&quot;the-missing-runtimeclass&quot;&gt;The missing RuntimeClass&lt;/h2&gt;

&lt;p&gt;If you follow the many online guides about running Kubernetes pods with Nvidia drivers,
you may think it should all work once you update the Containerd configuration to include the Nvidia runtime and set it as the default, like this:&lt;/p&gt;

&lt;div class=&quot;language-toml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nn&quot;&gt;[plugins.&quot;io.containerd.grpc.v1.cri&quot;]&lt;/span&gt;
&lt;span class=&quot;py&quot;&gt;default_runtime_name&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;nvidia&quot;&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# other stuff ..&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Unfortunately, it doesn’t work. I haven’t dug into this yet, but according to Nvidia’s k8s-device-plugin readme page:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;If the nvidia runtime should be set as the default runtime (with non-cri docker versions, for example), the –set-as-default argument must also be included in the commands above. If this is not done, a RuntimeClass needs to be defined.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It’s just one line in the README, and very easy to overlook.
Without an explicit &lt;a href=&quot;https://kubernetes.io/docs/concepts/containers/runtime-class/&quot;&gt;RuntimeClass&lt;/a&gt; defined and assigned to a pod in Kubernetes,
it won’t pick up the Nvidia runtime. I haven’t had time to find out how to make a runtime the default on a node yet, so for now, I defined a RuntimeClass like this:&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;na&quot;&gt;apiVersion&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;node.k8s.io/v1&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;kind&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;RuntimeClass&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;handler&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;nvidia&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;nvidia&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then, I assign that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;runtimeClassName&lt;/code&gt; to the pod so that Containerd picks up my Nvidia CRI configuration:&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;na&quot;&gt;apiVersion&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;v1&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;kind&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Pod&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;gpu-pod&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;spec&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;# the `runtimeClassName` is needed, otherwise default runtime will still be used&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;runtimeClassName&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;nvidia&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;containers&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;cuda-container&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;image&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;resources&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;limits&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;na&quot;&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# requesting 1 GPU&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;tolerations&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;nvidia.com/gpu&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;operator&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Exists&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;effect&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;NoSchedule&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
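&lt;p&gt;With the RuntimeClass in place, the sample pod above can be deployed and checked roughly like this, assuming the manifest is saved as gpu-pod.yaml:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# deploy the sample pod and follow its output
kubectl apply -f gpu-pod.yaml
kubectl logs -f gpu-pod
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If everything is wired up correctly, the CUDA vectoradd sample should report a passing result.&lt;/p&gt;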

&lt;h2 id=&quot;yet-another-rabbit-hole---pycharm-cannot-open-my-wsl-nixos-anymore&quot;&gt;Yet another rabbit hole - PyCharm cannot open my WSL NixOS anymore&lt;/h2&gt;

&lt;p&gt;Interestingly, after a recent update, my &lt;a href=&quot;https://www.jetbrains.com/pycharm/&quot;&gt;PyCharm&lt;/a&gt; stopped working with the Python projects in my NixOS distro running on &lt;a href=&quot;https://en.wikipedia.org/wiki/Windows_Subsystem_for_Linux&quot;&gt;WSL (Windows Subsystem for Linux)&lt;/a&gt;.
While this sounds a bit unrelated to the main goal of getting the Nvidia GPU to run on Kubernetes, I needed PyCharm because nix-playground is written in Python and can only work in a Linux environment.
I am bringing this up because it is the third-level rabbit hole branching off from Nvidia GPU on Kubernetes.&lt;/p&gt;

&lt;p&gt;I &lt;a href=&quot;https://youtrack.jetbrains.com/issue/PY-78867/Cannot-open-my-projects-in-WSL2-folder&quot;&gt;reported this problem&lt;/a&gt; a while back, but I decided to dig this hole while building nix-playground because not being able to use PyCharm with WSL projects greatly impacts my productivity.
After digging briefly, I realized that PyCharm expected many executables at certain locations, such as in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/usr/bin&lt;/code&gt;.
I addressed the problem by adding them to WSL NixOS config’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;extraBin&lt;/code&gt;.&lt;/p&gt;

&lt;div class=&quot;language-nix highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;config&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;lib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;pkgs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;wsl&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;enable&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;extraBin&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;pkgs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;src&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;coreutils&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/bin/mkdir&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;}&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;src&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;coreutils&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/bin/cat&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;}&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;src&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;coreutils&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/bin/whoami&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;}&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;src&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;coreutils&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/bin/ls&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;}&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;src&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;busybox&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/bin/addgroup&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;}&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;src&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;su&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/bin/groupadd&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;}&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;src&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;su&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/bin/usermod&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;}&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;src&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;busybox&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/bin/busybox&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;}&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;src&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;busybox&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/bin/chmod&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;}&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;src&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;busybox&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/bin/cp&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;}&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;src&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;busybox&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/bin/cut&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;}&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;src&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;busybox&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/bin/getent&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;}&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;src&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;busybox&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/bin/head&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;}&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;src&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;busybox&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/bin/mktemp&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;}&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;src&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;busybox&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/bin/rm&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;}&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;src&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;busybox&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/bin/sed&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;}&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;src&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;busybox&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/bin/tail&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;}&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;src&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;busybox&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/bin/uname&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;}&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;src&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;busybox&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/bin/which&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;}&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
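&lt;p&gt;As a side note, a repetitive list like the one above doesn’t have to be written out by hand. Here’s a rough sketch of how the same entries could be generated with Nix’s built-in &lt;code&gt;map&lt;/code&gt;, assuming &lt;code&gt;busybox&lt;/code&gt; is in scope as in the config above:&lt;/p&gt;

&lt;div class=&quot;language-nix highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Sketch: generate the per-tool entries instead of listing them one by one.
# `busybox` refers to the same derivation used in the config above.
map (tool: { src = &quot;${busybox}/bin/${tool}&quot;; }) [
  &quot;busybox&quot; &quot;chmod&quot; &quot;cp&quot; &quot;cut&quot; &quot;getent&quot; &quot;head&quot;
  &quot;mktemp&quot; &quot;rm&quot; &quot;sed&quot; &quot;tail&quot; &quot;uname&quot; &quot;which&quot;
]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;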

&lt;h2 id=&quot;the-rule-of-thumbs-for-digging-the-rabbit-holes&quot;&gt;Rules of thumb for digging rabbit holes&lt;/h2&gt;

&lt;p&gt;Now you’ve seen how many rabbit holes I dug over the past week.
Here’s a summary of my approach to deciding which rabbit holes are worth going down:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Consider the productivity impact. If a hole offers a big positive payoff within a manageable timeframe, dig it as soon as possible (nix-playground is a great example)&lt;/li&gt;
  &lt;li&gt;Not all rabbit holes are worth digging; weigh the cost-benefit ratio. For example, I won’t jump into learning PTX and optimizing my MAZE code before it runs at a huge scale.&lt;/li&gt;
  &lt;li&gt;Estimate how long a hole will take before deciding whether to go down it. Sometimes it’s hard to tell from the surface; a quick prototype and some probing can help determine how hard it is.&lt;/li&gt;
  &lt;li&gt;Enjoy digging rabbit holes, because it’s one of the best ways to learn new things&lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&quot;final-thoughts&quot;&gt;Final thoughts&lt;/h1&gt;

&lt;p&gt;I intentionally left many details out of this article; otherwise, it would be far too long.
If you are interested in learning more about running Nvidia CDI on a local Kubernetes cluster, please leave a comment and let me know.
I am considering open-sourcing some of my work to make it much easier for anyone to set up a local Kubernetes cluster with Nvidia CDI enabled.&lt;/p&gt;

&lt;p&gt;This deviates a bit from building MAZE, but the outcome is great! Now I have the infrastructure to run MAZE at a bigger scale, or at least as big a scale as I can afford 😅
There are still other rabbit holes, like power consumption. Within those constraints, I can only try to be creative.&lt;/p&gt;

&lt;p&gt;Next, I will go back to improving MAZE and soon publish the third MAZE article.
I have been running the experiment for the past week, and it has already accumulated some data with the updated MAZE code, including mutations and some other good stuff.
As before, I will publish the data along with the next article. I am still thinking about the focus of the next round of research.
I have some interesting ideas for eliminating backpropagation with the MAZE approach.
I am also thinking that maybe it’s time to introduce more freedom (more operations) into MAZE’s neural networks so that I can break the accuracy barrier they’re facing.
Anyway, we will see.&lt;/p&gt;

&lt;p&gt;Regardless, thanks for reading.
I hope you enjoyed my journey down these rabbit holes and the knowledge I shared about running Nvidia GPUs on Kubernetes.
Please help me share this article if you find it interesting.
Stay tuned, because we will have tons of fun on this journey of learning machine learning by building a novel project like MAZE! 😄&lt;/p&gt;
</description>
        <pubDate>Sat, 01 Mar 2025 07:00:00 +0000</pubDate>
        <link>https://fangpenlin.com/posts/2025/03/01/nvidia-gpu-on-bare-metal-nixos-k8s-explained/</link>
        <guid isPermaLink="true">https://fangpenlin.com/posts/2025/03/01/nvidia-gpu-on-bare-metal-nixos-k8s-explained/</guid>
      </item>
    
  </channel>
</rss>
