For the last 15 years I’ve had the fortune of accompanying thousands of students as they take their first steps in programming. From teenagers to career-changers, from one-on-one mentoring to developing curriculums and training materials, I’ve had a lot of successes, and a lot of failures, too. The more I teach, the more I learn, the more I find I don’t know.
One question in particular has been nagging me almost from the very beginning: why do some students simply not “get” programming? You might chalk it up to innate intelligence, motivation, or poor explanations on my part, but I’ve sat with enough smart, highly motivated students, patiently explaining again and again in different ways to no avail, to believe there’s something bigger I’m missing - something that just doesn’t click for those students, something that takes more than a good explanation to land.
This question has been brewing in my mind for a long time, and finally, in the last few years, an answer has been emerging: my failure has been in not actively nurturing a mental model of programming. Some (many) students build one naturally, unconsciously, but those who don’t are lost without proper guidance.
So that’s what this post is about; maybe you think it’s obvious, or that I’m missing something else - I’d love to hear! But here we go.
What does that even mean?
The term “mental model” is used a lot, in different contexts and meanings, so I want to be careful when explaining what I mean, exactly, by “mental model of programming”.
A model, in the sense I’m using, is some object that imitates another object to some extent, but is simpler or more manageable. To the degree that the two objects are similar, the model allows us to conveniently experiment with the dynamics of its more complex sibling.
Thus, a model railroad lets you play with train cars and tracks in your living room, weather forecasts use weather models to simulate how weather systems develop, and language models attempt to emulate the dynamics of human language.
A mental model is a model that sits in your head: machinery in the brain you can use to simulate whatever it’s modelling.
A mental model of programming, then, is brain machinery that allows you to simulate how the computer would execute a piece of code - to run this code in your head.
A good mental model of programming is critical for programmers to navigate the huge space of programs they could write. As you write or debug code, you constantly simulate, in your head, what’s happening, and compare it to the desired behavior; this guides the process of adding or changing code to achieve your goal.

But computer-simulation-machinery doesn’t come cheaply to the brain; it requires a lot of practice. Specifically, it requires the brain to try to simulate code, fail, and learn from that.
As I’ve learned the hard way, while practice is necessary, it’s not sufficient: depending on how you teach, students can find other ways to get by, and if the learning process doesn’t explicitly aim to build a mental model in their heads, some students will take the wrong road.
Take K-12 education and its equivalents in other countries. In Israel what you’re most likely to see is overcrowded classrooms, overworked teachers, and a hyperfocus on getting students to pass their matriculation exams. Putting aside my general criticism of the system, in my experience this often leads to learning that doesn’t involve constructing a mental model of something, even in subjects where that would be appropriate.
This manifests as three levels of performance:
Despite my dismissive tone, this kind of learning isn’t inherently wrong, and sometimes it’s the only way that makes sense in the context.
But, on its own, it typically doesn’t lead to the creation of a mental model, and it doesn’t equip students with the (unconscious) ability to rely on mental models when solving problems. Contrary to what you might expect from, say, the pyramid of Bloom’s taxonomy of cognitive skills, drilling the lower levels of understanding doesn’t automatically, at some point, transform into deeper understanding. At least not for all students.
And those students are the ones who struggle. The ones who, no matter how much I explain and how much they feel they understand, sit in front of the IDE and feel completely lost. It’s not their fault: I’ve been talking completely past them. I explained as if they reason about code using a mental model which happens to be wrong, and I’m trying to correct it; but they’re operating completely differently.
They try to find patterns in which constructs are used where, copy-paste examples and tweak them; they want the code to do the opposite thing so they try inverting conditions and booleans in areas that seem related. It rarely works.
This is the gap that needs to be addressed, and it’s best done before they even start down the wrong path.
So what can you do? I’m still learning how to answer this question, but here’s what I’ve got so far.
Incremental practice: in my opinion, this is the most important component. When programmers work, their use of the mental model is intense: they simulate a program while keeping track, in their head, of the desired behavior; notice when the two diverge; and figure out what change will bring them back in line.
Asking students to do all that at once is a bit of a leap. I like to specifically drill the mental model, first in straightforward and (relatively) easy ways, then increasingly require usage that resembles real programming.
The progression could look like this:
Isolate: learning to program involves, in addition to building a mental model, a lot of technical aspects that can confuse students: working with the IDE, the language’s rigid syntax, arcane keywords that don’t necessarily carry meaning.
These things aren’t necessary for the mental model, so I prefer to keep them for later, starting with river-crossing puzzles and moving on to robots taking natural language instructions in a visual environment. You can go pretty far, conceptually, without introducing a real programming language: variables, conditions, control flow. I’m doing all this with plain old presentations and Kahoot, but today you could use LLMs to create an interactive environment.
Visual cues: you can show the students what their internal mental model should look like, for example by doing a dry-run of a program, highlighting the current line (like a debugger), and showing a “watch” table with all the variables.
Make shortcuts harder: Finally, avoid formulaic problems that could be solved using pattern matching or memorization. This will create a vacuum that the brain will want to fill, pushing it to create a mental model.
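To make the “visual cues” dry-run concrete, here’s a minimal sketch of my own (an illustration, not a classroom-ready tool) using Python’s sys.settrace to print, for every executed line, its line number and a “watch” table of the local variables:

```python
import sys

def trace_lines(frame, event, arg):
    # On every executed line, print its number and a "watch" table
    # of the local variables - imitating a debugger's dry-run view.
    if event == "line":
        print(f"line {frame.f_lineno}: {frame.f_locals}")
    return trace_lines

def demo():
    total = 0
    for n in [3, 1, 4]:
        total += n
    return total

sys.settrace(trace_lines)
result = demo()
sys.settrace(None)
print(f"{result=}")
```

Running this prints the evolving values of total and n step by step, which is exactly the table I’d draw on the whiteboard.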
In this context, I’ll mention Scratch; it’s a rich learning environment that’s got some of those principles built-in. I have some reservations about it, but I’ll leave that to another discussion.
I’ve found that implementing all these, as early as possible, drastically improves student performance.
AI has been rocking a lot of boats for the last few years, and specifically both programming and education are having some reckoning moments around it. I’m in no position to predict what will happen in the future, but I am worried that reliance on AI in programming education will be yet another shortcut that’s actually an obstacle to students’ development of a solid mental model of programming (assuming that’s still going to matter).
Many other domains require, or at least greatly benefit from, having a relevant mental model.
One interesting example I can think of is cooking; you don’t have to have a mental model - you could follow recipes, memorize them, and utilize pattern-matching to handle common challenges such as identifying when the batter is mixed enough or how to replace a missing ingredient. That’s pretty much my experience with cooking, or at least it was until quite recently.
But then I met my wife, who cooks amazing food and, impressively, can improvise wild dishes from what we happen to have available. I came to realize she has a proper mental model of food: she knows how ingredients behave under various conditions and how flavors and textures mix. She acquired this model through years of experimentation, while paying attention to what happens in the process. I’ve been trying to cook more like that.
Some may call it good intuition, and different people may have natural inclinations to develop those for different areas, but I’ve found the idea of a mental model an interesting lens to reason about how experts approach their work.
I also think about the fraction of students for whom a mental model of programming doesn’t come naturally, and how this can be bridged with an appropriate pedagogical approach. What other areas could benefit from something like that?
You need to cross from point A to point B as fast as possible. Geometry dictates that the shortest path is a straight line, but since you’re crossing different terrains at different speeds, the fastest path will not be a straight line.

Any reasonable solution can be characterized by two numbers, denoted \(x_1\) and \(x_2\), describing where we cross the boundaries between different terrains. Given all the information, it’s not difficult to calculate the total time to cross, using the Pythagorean theorem and the equation \(distance = speed \cdot time\):
\[t = \frac{\sqrt{ {x_1}^2 + d^2}}{v_1} + \frac{\sqrt{ {x_2}^2 + d^2}}{v_2} + \frac{\sqrt{(h - x_1 - x_2)^2 + d^2}}{v_3}\]

But how to find \(x_1\) and \(x_2\) that minimize \(t\)? Your old calculus professor would suggest computing the gradient and solving a system of equations, but honestly you’d rather kiss a sulphurous frog. Sampling lots of points and picking the best is possible, but inexact and expensive.
Fueled by optimization techniques for deep learning models, the last decade saw an explosion of automatic differentiation engines in Python: libraries that allow you to write numeric code in almost-pure Python, and automatically compute derivatives and gradients. How does this help us? A function’s gradient tells us in which direction the function is increasing the most. So if we want to minimize it, we can flip the gradient’s sign and just follow that! That’s the essence of the gradient descent algorithm.
from jax import grad
from jax.numpy import sqrt

h, d, v1, v2, v3 = 20, 10, .7, .3, .45

def calc_time(x):
    x1, x2 = x
    return (
        sqrt(x1 ** 2 + d ** 2) / v1
        + sqrt(x2 ** 2 + d ** 2) / v2
        + sqrt((h - x1 - x2) ** 2 + d ** 2) / v3
    )

d_time_d_x = grad(calc_time)  # Magic!

x1, x2 = 2., 17.
step = 1.5
for i in range(20):
    dx1, dx2 = d_time_d_x([x1, x2])
    x1 -= step * dx1
    x2 -= step * dx2
    print(f"{x1=}, {x2=}")

We can visualize the objective landscape, and show the path our optimization traces through it:
Looking at the objective itself along the optimization, we can see it consistently improving (though the rate of improvement is slowing down):
Finally, we can visualize the actual paths represented by the parameterized solutions as we optimize:
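To reproduce the objective-over-time view, one can simply record the objective at every step. Here’s a self-contained sketch with the values from the example above, using central finite differences as a stand-in for jax.grad so it runs even without JAX installed:

```python
import math

# Values from the example above.
h, d, v1, v2, v3 = 20, 10, .7, .3, .45

def calc_time(x1, x2):
    return (math.sqrt(x1 ** 2 + d ** 2) / v1
            + math.sqrt(x2 ** 2 + d ** 2) / v2
            + math.sqrt((h - x1 - x2) ** 2 + d ** 2) / v3)

def num_grad(f, x1, x2, eps=1e-6):
    # Central finite differences as a stand-in for jax.grad.
    return ((f(x1 + eps, x2) - f(x1 - eps, x2)) / (2 * eps),
            (f(x1, x2 + eps) - f(x1, x2 - eps)) / (2 * eps))

x1, x2, step = 2., 17., 1.5
objectives = []
for _ in range(20):
    objectives.append(calc_time(x1, x2))
    dx1, dx2 = num_grad(calc_time, x1, x2)
    x1 -= step * dx1
    x2 -= step * dx2

# `objectives` can now be plotted against the iteration number.
print(f"start: {objectives[0]:.1f}, end: {objectives[-1]:.1f}")
```

Plotting `objectives` against the iteration index gives the improvement curve shown above.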
You might be wondering how JAX computes the gradient behind the scenes. Maybe it’s using numeric approximations? Or parsing the code and symbolically working out the gradient? Actually, neither!
A full explanation of automatic differentiation is out of scope for this intro, but I’ll try to convey the main ideas succinctly.
Look at this simple computation:
z = x ** 2 + y / 2

If x and y were pure Python numbers, then z would also be a number, and contain no trace of the computation that led to its current value.
But, using operator overloading (“special method names” in Python), you can create types that keep track of computations, and use them to obtain expression trees for the values you compute:
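Here’s a toy sketch of that idea (a deliberately tiny illustration, not how JAX is actually implemented): a class whose overloaded operators record, for each computed value, the operation and operands that produced it:

```python
class Node:
    """A value that remembers the operation and operands that produced it."""
    def __init__(self, value, op=None, inputs=()):
        self.value, self.op, self.inputs = value, op, inputs

    def __add__(self, other):
        other = other if isinstance(other, Node) else Node(other)
        return Node(self.value + other.value, "+", (self, other))

    def __truediv__(self, other):
        other = other if isinstance(other, Node) else Node(other)
        return Node(self.value / other.value, "/", (self, other))

    def __pow__(self, exponent):
        return Node(self.value ** exponent, f"**{exponent}", (self,))

x, y = Node(3.0), Node(4.0)
z = x ** 2 + y / 2          # computes 11.0 AND builds the expression tree

def show(node, depth=0):
    # Print the tree, one node per line, indented by depth.
    print("  " * depth + (node.op or f"leaf({node.value})"))
    for child in node.inputs:
        show(child, depth + 1)

show(z)
```

Evaluating `z` both computes the value and leaves behind the full expression tree rooted at it.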
Here is the expression tree for the time calculation that we want to optimize:
Next, and this is where (some of) the magic happens, thanks to the chain rule in calculus and its generalizations, you can use this tree to efficiently compute the derivative of the final node with respect to any other node in the tree.
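A toy sketch of that step too (again, far simpler than a real engine): each node stores the local derivative with respect to each of its inputs, and a recursive backward pass multiplies them along paths, per the chain rule:

```python
class Var:
    # Minimal reverse-mode autodiff: each Var stores its value and a list of
    # (parent, local_derivative) pairs; backward() applies the chain rule.
    def __init__(self, value, parents=()):
        self.value, self.parents, self.grad = value, parents, 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, seed=1.0):
        # Accumulate the derivative of the output wrt this node,
        # then push it further down, scaled by each local derivative.
        self.grad += seed
        for parent, local in self.parents:
            parent.backward(seed * local)

x, y = Var(3.0), Var(4.0)
z = x * x + y * x          # z = x^2 + xy
z.backward()
print(x.grad, y.grad)      # dz/dx = 2x + y = 10, dz/dy = x = 3
```

Real engines do this iteratively over the graph (and much more efficiently), but the chain-rule bookkeeping is the same in spirit.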
I won’t go into more detail than that - it’s a calculus-phobic’s introduction, after all - but I hope this sates your curiosity for now, and I’ve added links with more information below.
The ability to effortlessly, and efficiently, calculate gradients of arbitrary functions is very powerful for gradient-based optimization.
However, as you might expect, there are a few subtleties:
Since we’re using custom types for building the computation graph and calculating gradients, we need to use operations that support those types. Operator overloading allows us to support arithmetic out-of-the-box, but for more complicated computations you’ll need to use the appropriate implementation (or implement it yourself if it doesn’t exist). Hence the use of the custom jax.numpy.sqrt function.
The good news is that many modern automatic differentiation engines come with a big library of common operations and algorithms already implemented, so what you need is most likely there - and if it’s not, you’ll have plenty of primitives to build on.
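For example (assuming JAX is installed): math.sqrt coerces its argument to a plain float, which fails on the tracing types JAX passes through your function, while jax.numpy.sqrt traces just fine:

```python
import math
from jax import grad
import jax.numpy as jnp

def good(x):
    return jnp.sqrt(x)   # jnp knows how to handle JAX's tracing types

def bad(x):
    return math.sqrt(x)  # math.sqrt tries to coerce the tracer to a float

print(float(grad(good)(4.0)))  # derivative of sqrt at 4 is 1/(2*2) = 0.25

try:
    grad(bad)(4.0)
except Exception as e:
    print("math.sqrt can't be traced:", type(e).__name__)
```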
The simple path-planning problem I presented has a simple “optimization landscape”, where from every point it’s fairly easy to improve solutions. And still there was some tuning - choosing the step size and number of iterations. Such aspects will always require attention, and more complex problems may have landscapes that are trickier to optimize on.
Another potential issue is that of converging to local optima - depending on the problem, this may be acceptable, or require clever initialization or other tricks to avoid.
While the algorithm for computing gradients is efficient, there’s still significant overhead to computing gradients, both in time and in memory. If you have an extra-large problem you might need to take care when optimizing it, or find a way to break it down to smaller sub-problems.
OK, so we know how to optimize simple computations with differentiable programming. Anything else of interest we can do with it?
Since the computational graph is built on-the-fly with our custom types, it doesn’t “care” if computations happen in a straightforward, branchless block of code (like in our example) or through a winding path of conditions and loops. As long as we can construct a graph, we can calculate gradients (within the computer’s resource constraints, of course).
While constructing the computational graph within branches or loops isn’t an issue, if we want the optimization to include the conditions that determine those branches - well, that’s trickier. Still possible, though!
Suppose our computation goes through a simple if-else branch, and we want the if’s condition to be included in the optimization.
The computational graph only includes calculations that happened. So, say we choose the else branch - the computation wouldn’t “know” what could have happened had we taken the if branch.
To resolve that, we need to run both branches, and average them in a way that reflects the condition’s preference. The same is true for while and for loops, with slight variations.
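A minimal sketch of the branch-averaging idea in JAX (the sigmoid and its sharpness are my own choice of smoothing): evaluate both branches, then blend them with a smooth weight that reflects how strongly the condition holds, so the condition’s parameter gets a gradient:

```python
from jax import grad, nn

def soft_branch(theta, x, sharpness=10.0):
    # Run BOTH branches, then blend them with a sigmoid weight
    # that smoothly approximates the condition "x > theta".
    w = nn.sigmoid(sharpness * (x - theta))
    return w * (x * 2.0) + (1.0 - w) * (x * 0.5)

# A hard `if x > theta:` would give zero gradient wrt theta;
# the soft version lets the condition itself be optimized.
g = grad(soft_branch)(1.0, 1.2)
print(float(g))
```

As sharpness grows, the blend approaches the hard branch, at the cost of steeper (harder to optimize) gradients.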
I don’t want to go too deep, but this is a lovely example of reversing Game of Life with differentiable programming and the branch-weighting idea.
The proliferation of differentiable programming frameworks in Python was pretty much kickstarted by frameworks for training deep learning models, which also use gradient-based optimization techniques. This typically makes plugging such models into differentiable computations very easy! For instance, JAX has the Flax neural network library.
An example application of such integration is training physics-informed neural networks.
In addition to the built-in operations that come with automatic differentiation frameworks, there’s a growing ecosystem of fully differentiable implementations of more advanced operations in various domains.
Examples include: 3D rendering, computer vision, signal processing, and more.
These packages can be incorporated into differentiable pipelines to create very interesting tools.
Automatic differentiation has been around much longer than its presence in the Python ecosystem, with applications primarily in science and engineering design optimization. The recent surge of automatic differentiation frameworks in Python brings this powerful tool to a much broader audience.
Most recently, Gaussian Splatting, which is based on differentiable programming, has exploded in popularity, and is seeing impressive adoption as a 3D scene representation format.
Personally, for the last several years I’ve been working with a startup on 3D reconstruction with differentiable rendering. Also, image color replacement with numerical optimization from a few years ago would have been a classic use case for differentiable programming (except it was a project for a course, so I worked out the gradients by hand. Oh, what joy.)
This video is a great introduction to automatic differentiation, and this post walks through implementing automatic differentiation from scratch.
This is an extensive, if a bit technical, introduction to differentiable programming.
The JAX tutorials will get you up to speed on implementing differentiable computations.
I’ve also considered doing a series of differentiable programming exercises, gradually introducing concepts and tools. I think it could be really cool, with animations visualizing the optimization process. But it takes a lot of work, and I could use some motivation - ping me if that’s something you’d be interested in!
The first part of this post was adapted from a poster I made for the virtual session at Scipy 2025. Click to view at full resolution:
We had the idea to use pictures we’ve taken on our trips for some of the aesthetic design - as backgrounds for the invitation, menus and so on. I remembered once seeing an animation of a recursive subdivision of an image, with gradual refinement of high-detail areas, which seemed really neat. It reminded me of the way decision trees subdivide the feature space, and I thought it would be cool to try to reconstruct the image with a decision tree of limited depth, using pixels’ X- and Y-coordinates as features.
My wife took this picture of a beach sunset:
I fed it to a simple decision tree, where we try to predict RGB from X and Y:
It’s nice, but a bit too blocky for me. Of course, that’s a direct consequence of how we feed the data to the decision tree algorithm: it has to make thresholds on X- and Y-coordinates of pixels, so of course the approximation will be built out of rectangles.
Maybe we can use a different representation to get a different style?
I tried sampling points in image space using Poisson Disk Sampling and representing each pixel by its distances from all anchor points:
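Here’s a rough sketch of the distance-features idea (self-contained, so it uses a synthetic “image” and uniformly sampled anchors in place of a real photo and Poisson disk sampling):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# A synthetic stand-in "image": three smooth channels over a 64x64 grid.
H, W = 64, 64
yy, xx = np.mgrid[0:H, 0:W]
image = np.stack([xx / W, yy / H, (xx + yy) / (H + W)], axis=-1)

# Anchor points (uniform here; the real version samples with Poisson disk).
anchors = rng.uniform(0, 64, size=(12, 2))

# Represent each pixel by its distances to all anchor points.
coords = np.stack([xx.ravel(), yy.ravel()], axis=1).astype(float)
features = np.linalg.norm(coords[:, None, :] - anchors[None, :, :], axis=-1)

# One tree predicting all three channels at once.
tree = DecisionTreeRegressor(max_depth=8).fit(features, image.reshape(-1, 3))
reconstruction = tree.predict(features).reshape(H, W, 3)
print(reconstruction.shape)
```

Since thresholds on distance features carve space into circular arcs rather than rectangles, the reconstruction takes on the rounded, cell-like look shown below.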
Now we’re talking! I liked that style a lot, and happily my wife did too, so we ended up using variations of this technique on several of our pictures for various wedding-related graphics.
There are a lot of parameters and variations to play with here - anchor sampling density, decision tree depth, whether to use the same points for all RGB channels or different ones, and so on. So far my impression is that each picture has its own parameter spaces that work well with it, so there’s a lot of experimentation involved.
Here’s the result of a different sunset picture, with a stricter limitation on maximum tree depth - looks very abstract:
Here’s the same picture with similar maximum depth limit, but reconstructed with a random forest instead:
A bit noisier, but also softer - I like both versions, each with its own flavor.
Another cool trick is to use a picture with some object in it, segment the object (manually or with SAM) and give it a higher sample_weight when fitting the model. This will cause the tree to give more importance to those areas of the picture, resulting in higher fidelity, while the background remains more abstract:
The same idea can be applied to pictures with faces. I played with using face landmark detection (with the face-alignment library) to determine pixel importance, with pretty cool results - our faces are recognizable but still abstract:
I also played with generating an animation of a picture “coming into focus” by gradually varying the parameters:
As you can see, I’m having a lot of fun with this technique! There are many more ideas I’d like to try, but I’ll leave it here for now.
You can find sample code for the basic idea here.
Cheers!
As part of the model tuning phase, I wanted to explore the impact of class imbalance and try to mitigate it. A popular “off-the-shelf” solution to imbalance is weighting classes in inverse proportion to their frequency - which didn’t yield an improvement. This happened to me several times in the past, and other than basic intuition I couldn’t trace the theory of where this weighting comes from (maybe I didn’t try hard enough).
So, I decided to finally try to reason about class weighting in an imbalanced setting from first principles. What follows is my analysis. The TL;DR is that for my problem, I was convinced that class weighting probably doesn’t matter too much.
It’s an interesting analysis and was a fun rabbit-hole to dive into, but makes a lot of assumptions and I’d be careful not to overgeneralize from this.
Wherever there’s a (non-trivial) classification problem, there’s a tradeoff. I’ll focus on the simplest case of binary classification: say we have two classes - negative (denoted 0) and positive (denoted 1); further suppose that the positive is the rare class, with prevalence \(\beta\) (1% in the following visualizations / experiments).
Basically, when we classify, we predict the class of an instance with unknown class. We could be wrong in two ways:
It is trivial to avoid making any one type of error: for example, we could classify all instances as negative, avoiding false positives altogether (at the expense of all our positives being false negatives). And therein lies the tradeoff: to make an actual classifier that outputs “hard” predictions, we need to make a product / business decision about how bad each type of error is. Not making an explicit decision means our optimization pipeline has such a choice baked in implicitly.
Now, it’s hard to know in advance what the tradeoff curve will look like. We try to optimize everything else to give us the best set of options: collect lots of data with informative features, use a suitable model, etc. But after all that is done, we still need to choose how to balance the two types of errors.
To optimize this choice in light of our product preferences, we first need to characterize the tradeoff curve.
Some definitions first: let \(x\) denote the expected classifier output for a positive instance (in expectation, the true positive rate), and \(z\) the expected value of \(1 - \hat{y}\) for a negative instance (the true negative rate).
For the sake of this analysis I’m going to assume the tradeoff curve is of the form \(z = (1 - x^p) ^ \frac{1}{p}\), with \(p \geq 1\). This yields the following family of curves:

The red curve corresponds to \(p = 1\) - a pretty poor tradeoff curve. As we increase \(p\), our set of options improves. At the coveted (but realistically unattainable) \(p = \infty\) we’d choose \(x = z = 1\), beat the problem altogether and go home; until then, we have to choose some compromise between \(x\) and \(z\).
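A quick numeric illustration of the family: fixing the positive rate at \(x = 0.9\), larger \(p\) leaves more and more room for \(z\):

```python
# z = (1 - x^p)^(1/p): for a fixed x, larger p allows a larger z.
def tradeoff_z(x, p):
    return (1 - x ** p) ** (1 / p)

for p in (1, 2, 4, 16):
    print(f"p={p}: at x=0.9, z={tradeoff_z(0.9, p):.3f}")
```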
Later, we’ll ponder how to choose the tradeoff. But to do that we first need to define what it is we’re even trying to optimize.
Like I previously mentioned, initial modeling stages try to give us the best tradeoff curve possible for the task - using data, model type, training techniques, whatever. At those stages we can optimize for threshold-independent metrics, for example various area under the curve metrics. But ultimately, somewhere downstream the model’s output will be binarized, and we might as well take that into consideration when tuning the model.
I’m personally fond of the F-score - it combines two very interpretable metrics (precision and recall), which makes communicating with less technical stakeholders (such as product managers and the FDA) easier, and can be easily tweaked to account for error type preferences.
For this problem precision and recall were equally important, so I used the \(F_1\) score:
\[F_1 = \frac{2}{\frac{1}{precision} + \frac{1}{recall}}\]

Ultimately, this is the metric we want to optimize.
OK, so we know what we want to optimize, and we know that our choices are limited by the tradeoff curve. But how do we control where we land on the curve?
Canonically, binary classification is framed as minimizing binary cross-entropy loss. The knob we’ll use to decide where we land on the tradeoff curve is a weighting coefficient, \(\alpha\), for the positive instances:
\[BCE = -\sum_i{\left(\alpha y_i \log_2{\hat{y_i}} + (1 - y_i) \log_2{(1 - \hat{y_i})}\right)}\]

Now that all the actors are on stage, let’s roll up our sleeves and get our hands dirty.
First, we’ll take a step back from looking at individual instances, and look at the relationship between positive and negative instances, based on \(\beta\), the positive class prevalence. To that end, we’ll replace individual \(\hat{y_i}\) and \(1 - \hat{y_i}\) with their respective expectations, \(x\) and \(z\). Further, our assumed tradeoff curve constrains one by the other; let’s reframe the loss as a function of \(x\) (also dependent on \(\alpha\), \(\beta\), and \(p\).)
\[BCE(x) = -(\alpha \beta \log{x} + (1 - \beta)\frac{\log{(1 - x^p)}}{p})\]We’ll now differentiate wrt \(x\) to find where on the tradeoff curve our choice of \(\alpha\) and the reality of \(\beta\) and \(p\) have landed us:
\[BCE'(x) = -(\frac{\alpha \beta}{x \ln{2}} + \frac{1 - \beta}{p} \cdot \frac{1}{(1 - x^p) \ln{2}} \cdot (-p x ^ {p - 1})) =\] \[= \frac{(1 - \beta) x ^ {p - 1}}{(1 - x ^ p)\ln{2}} - \frac{\alpha \beta}{x \ln{2}}\] \[BCE'(x) = 0 \Leftrightarrow (1 - \beta)x ^ p = \alpha \beta (1 - x^p)\]And we finally get:
\[x = \sqrt[p]{\frac{\alpha \beta}{\alpha \beta - \beta + 1}}\]

We’ve been handed a binary classification problem characterized by \(\beta\) and \(p\). We optimized a classifier using weighted binary cross-entropy with weight \(\alpha\) for the positive, rare class. This lands us in a particular place on the tradeoff curve, which we just found (these are \(x\) and \(z\)).
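We can sanity-check this closed form numerically - a quick grid search over \(x\), with example values for \(\alpha\), \(\beta\), and \(p\) that I picked arbitrarily:

```python
import math

def bce(x, alpha, beta, p):
    # Expected weighted BCE along the tradeoff curve z = (1 - x^p)^(1/p).
    return -(alpha * beta * math.log2(x)
             + (1 - beta) * math.log2(1 - x ** p) / p)

def x_closed_form(alpha, beta, p):
    return (alpha * beta / (alpha * beta - beta + 1)) ** (1 / p)

alpha, beta, p = 5.0, 0.01, 4.0

# Brute-force minimize over a fine grid of x values in (0, 1).
xs = [i / 100000 for i in range(1, 100000)]
x_numeric = min(xs, key=lambda x: bce(x, alpha, beta, p))

print(x_numeric, x_closed_form(alpha, beta, p))
```

The grid minimum and the closed form agree (up to the grid’s resolution), which is reassuring.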
Next, we’ll want to see how our choice of \(\alpha\) trickles downstream to the \(F_1\) score, and use this description to find an optimal value for \(\alpha\).
We’re interested in calculating the expected \(F_1\) score resulting from our choice of \(\alpha\). Since \(F_1\) depends directly on precision and recall, we’ll calculate the expected value of those metrics.
Recall is easy - it is the fraction of positive instances we correctly detected as positive, and we can expect it to be \(x\) - the probability our classifier outputs 1 for a positive instance.
Precision is the fraction \(\frac{true \ positives}{true \ positives + false \ positives}\).
The expected true positives are the fraction of positives times the probability of detecting a positive as such: \(\beta x\).
The expected false positives are the negative instances which were misclassified: \((1 - \beta)(1 - z)\).
Putting all that in the \(F_1\) formula:
\[\mathbb{E}(F_1) = \frac{2}{\frac{\beta x + (1 - \beta)(1 - z)}{\beta x} + \frac{1}{x}} = \frac{2 \beta x}{\beta x + \beta z + 1 - z}\]

While \(\alpha\) does not explicitly appear here, it’s part of \(x\) and \(z\), which we know and which do appear here.
Great! So all that’s left is differentiating wrt \(\alpha\) and finding the maximum, right?
\[\tiny{\frac{2 \beta \left(\frac{\alpha \beta}{\alpha \beta - \beta + 1}\right)^{\frac{1}{p}} \left(\left(1 - \beta\right) \left(\left(\left(\frac{\alpha \beta}{\alpha \beta - \beta + 1}\right)^{\frac{1}{p}}\right)^{p} - 1\right)^{2} \left(\beta \left(\frac{\alpha \beta}{\alpha \beta - \beta + 1}\right)^{\frac{1}{p}} + \beta \left(1 - \left(\left(\frac{\alpha \beta}{\alpha \beta - \beta + 1}\right)^{\frac{1}{p}}\right)^{p}\right)^{\frac{1}{p}} - \left(1 - \left(\left(\frac{\alpha \beta}{\alpha \beta - \beta + 1}\right)^{\frac{1}{p}}\right)^{p}\right)^{\frac{1}{p}} + 1\right) + \left(\beta - 1\right) \left(\beta \left(\frac{\alpha \beta}{\alpha \beta - \beta + 1}\right)^{\frac{1}{p}} \left(\left(\left(\frac{\alpha \beta}{\alpha \beta - \beta + 1}\right)^{\frac{1}{p}}\right)^{p} - 1\right)^{2} - \beta \left(1 - \left(\left(\frac{\alpha \beta}{\alpha \beta - \beta + 1}\right)^{\frac{1}{p}}\right)^{p}\right)^{\frac{p + 1}{p}} \left(\left(\frac{\alpha \beta}{\alpha \beta - \beta + 1}\right)^{\frac{1}{p}}\right)^{p} + \left(1 - \left(\left(\frac{\alpha \beta}{\alpha \beta - \beta + 1}\right)^{\frac{1}{p}}\right)^{p}\right)^{\frac{p + 1}{p}} \left(\left(\frac{\alpha \beta}{\alpha \beta - \beta + 1}\right)^{\frac{1}{p}}\right)^{p}\right)\right)}{\alpha p \left(\left(\left(\frac{\alpha \beta}{\alpha \beta - \beta + 1}\right)^{\frac{1}{p}}\right)^{p} - 1\right)^{2} \left(\alpha \beta - \beta + 1\right) \left(\beta \left(\frac{\alpha \beta}{\alpha \beta - \beta + 1}\right)^{\frac{1}{p}} + \beta \left(1 - \left(\left(\frac{\alpha \beta}{\alpha \beta - \beta + 1}\right)^{\frac{1}{p}}\right)^{p}\right)^{\frac{1}{p}} - \left(1 - \left(\left(\frac{\alpha \beta}{\alpha \beta - \beta + 1}\right)^{\frac{1}{p}}\right)^{p}\right)^{\frac{1}{p}} + 1\right)^{2}}}\]

Uh, haha, never mind, let’s do that numerically.
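Numerically it’s straightforward: plug the closed-form \(x\) and the tradeoff curve’s \(z\) into the expected \(F_1\) expression, and evaluate for a range of weights (the \(\beta\) and \(p\) values here are example choices of mine):

```python
def expected_f1(alpha, beta, p):
    # x: where the weighted BCE lands us (the closed form derived above);
    # z then follows from the assumed tradeoff curve.
    x = (alpha * beta / (alpha * beta - beta + 1)) ** (1 / p)
    z = (1 - x ** p) ** (1 / p)
    return 2 * beta * x / (beta * x + beta * z + 1 - z)

# Example: 1% prevalence, a reasonably good tradeoff curve.
beta, p = 0.01, 8.0
for alpha in (1, 2, 10, 100):
    print(alpha, round(expected_f1(alpha, beta, p), 3))
```

Sweeping a fine grid of \(\alpha\) values this way produces the curves in the plot below.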
Here is a plot of the expected \(F_1\) as a function of \(\alpha\) for a range of values for \(p\).

The red curve corresponds to \(p \approx 2\), and shows pretty abysmal results; as \(p\) increases, we get better and better results.
The range of \(\alpha\) goes from 1 (which is equivalent to unweighted training) to about 250, with 100 being the “inverse proportion” weighting practice.
Most prominently, what we see is that class weight hardly improves over unweighted training, and the optimal weight is usually only a bit larger than the unweighted version, and far from the inverse proportion. In fact, from these plots, it would seem that using the inverse proportion weighting scheme is actually detrimental to training!
Very interesting indeed.
This is a good place to remember that we made a lot of assumptions to get here. Several things in particular may limit the scope of the conclusions:
I wanted to get a sense for the generalizability of the conclusions outside the sterile mathematical environment. I set up a rudimentary imbalanced classification pipeline with scikit-learn’s make_classification and DecisionTreeClassifier, and created an empirical version of the above plot, using class_sep as a proxy for the tradeoff curve.
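Here’s a rough sketch of such a pipeline (a reconstruction of the idea, not necessarily the exact setup behind the plot; scikit-learn assumed available):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def empirical_f1(pos_weight, class_sep=1.0, seed=0):
    # One simulation: imbalanced data, a weighted tree, F1 on held-out data.
    X, y = make_classification(n_samples=4000, weights=[0.99, 0.01],
                               class_sep=class_sep, random_state=seed)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                              random_state=seed)
    clf = DecisionTreeClassifier(class_weight={0: 1.0, 1: pos_weight},
                                 random_state=seed)
    clf.fit(X_tr, y_tr)
    return f1_score(y_te, clf.predict(X_te))

for w in (1.0, 10.0, 100.0):
    # Average over a few seeds to smooth out the noise a bit.
    scores = [empirical_f1(w, seed=s) for s in range(5)]
    print(w, round(float(np.mean(scores)), 3))
```

The real experiment averaged over far more repetitions and swept `class_sep` as well, which is why it took hours.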
A couple of things in that setup are different enough from my clean assumptions that I was very curious to see the results. I let my computer crunch for a few hours (running hundreds of thousands of simulations) and it produced the following plot:

Nice! It’s not identical to the theory-derived plot, but looks very similar, and in particular:
As for why the plot looks different for bigger values of \(\alpha\), my hunch is that the tradeoff curve isn’t symmetric, allowing the classifier to get a decent recall without sacrificing precision entirely.
While I certainly don’t know everything about class weighting now, I’ve come away from the analysis very satisfied: I know that class imbalance, in and of itself, does not warrant using class weights. Furthermore, if I deem class weights necessary, instead of using the typical “inverse proportion” scheme, my weights had better be informed by the particular problem characteristics: the nature of the tradeoff curve, label noise, and the cost I assign to each type of error.
After publishing the post it’s been pointed out to me that there are tutorials that specifically demonstrate how inverse proportion weighting (or stratified under- / oversampling, which is pretty equivalent) improves imbalanced classification performance. This piqued my interest and I looked at such a tutorial, and found something very interesting. To measure performance, the tutorial used the balanced accuracy score rather than \(F_1\).
\(F_1\) is the harmonic mean of positive precision and positive recall; balanced accuracy is the (regular) average of positive recall and negative recall. On the surface, the two metrics look similar enough: each is itself a combination of two metrics, corresponding to the two types of errors we can make.
But, as always, the details are important. I used the same optimization framework as before but looked at expected balanced accuracy (instead of expected \(F_1\)) as a function of class weight, and here’s what I got:

Look at that - completely different from the \(F_1\) behavior! Moreover, the optimal weight is indeed the inverse-proportion rule of thumb. This is splendid: the theoretical methodology is in accordance with results people get in the wild. Less selfishly, it really highlights the importance of the choice of metric in model tuning - different metrics respond very differently to our choice of hyperparameters.
Let’s dive into the difference between the two metrics, so we have an intuition for which one to choose. Specifically, we’ll look for a scenario where they are very different from each other.
Imagine we have 1000 samples - 10 positive, 990 negative. We classify all positives as positive, 40 negatives as positive. Positive recall is perfect (100%), negative recall is \(\frac{950}{990} \approx\) 96%, positive precision is \(\frac{10}{50} =\) 20%. Balanced accuracy would be very high, but \(F_1\) would be very low.
This is an example of the way different metrics induce different preferences over the two types of errors. There’s no absolute right or wrong here - it’s a matter of aligning your technical choice to the domain.
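The arithmetic in the scenario above is easy to verify; here's a quick sketch (plain JavaScript, using the confusion-matrix counts implied by the example) you can paste into a console:

```javascript
// Counts from the example: 10 positives, all caught (TP = 10, FN = 0);
// 40 of the 990 negatives misclassified as positive (FP = 40, TN = 950).
const tp = 10, fp = 40, fn = 0, tn = 950;

const positivePrecision = tp / (tp + fp);  // 10 / 50 = 0.2
const positiveRecall = tp / (tp + fn);     // 1.0
const negativeRecall = tn / (tn + fp);     // 950 / 990, about 0.96

const balancedAccuracy = (positiveRecall + negativeRecall) / 2;
const f1 = (2 * positivePrecision * positiveRecall) /
           (positivePrecision + positiveRecall);

console.log(balancedAccuracy.toFixed(3)); // "0.980" - very high
console.log(f1.toFixed(3));               // "0.333" - quite low
```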
To me, this reinforces the importance of considering the downstream use of the model, and consulting business stakeholders when tuning the model for hard prediction.
Code for the visualizations and simulation can be found here.
With practiced if resigned movement, I slide out of bed, eyes still closed. 08:30 must be the latest anybody’s ever complained about getting up at. Such a sleep hog.
No time to waste: a glass of water, a snack for the cat, kiss my love good day, and off I go. Oh, and wear something appropriate. For the office.
A short bike ride in the morning is great, basically. But I’m in a hurry - every misplaced dumpster and mistimed traffic light receives a shower of mental swears. I pass the same obstacles every day, thinking “maybe tomorrow I’ll have time to move them.” I never do.
I park my bicycle and run to catch the train. Did I mention I get up at 08:30? I could get up earlier, and maybe not run. Yes. I could.
But I won’t, because:
So I run.
40-ish minutes of letting the world slide by. Maybe it’s crowded and I just listen to music; or maybe not and I pull out my laptop, get acquainted with the latest OpenAI drama.
A brisk walk and I’m at the office. I celebrate another successful journey with customary rituals: the don’t-meet-eyes-at-the-elevator, awkward kitchen dances as morning routines clash, and the all-time favorite, make-smalltalk-but-don’t-get-too-involved.
If I’m lucky I left something half-finished to help ease myself into work, or Incubated(TM) to get some fresh ideas. Otherwise I have to pretend to work while procrastinating on finding something useful to do. I know it doesn’t really matter - in 5 minutes or an hour, someone will walk in, or something will come up, and I’ll find myself doing something else anyway.
An hour or two of flow, refreshing and productive. But then I need a break. Is it lunchtime yet? Not remotely. Another glass of water, then back to my desk. Now what?
There’s a lot of friction in the office. Small awkwardnesses that pile up, a ton of interruptions. Talking About Big Stuff With No Room For Nuance Cause Ain’t Nobody Got Time.
That’s when my Office Lethargy rears its head.
“You know what’s the next big thing, and you also know you don’t have it in you to start it today. Double dish of procrastination for you!”
“See that sun over the lovely view from this plush office tower? Your eyes will be its companion as it travels across the sky today.”
“Remember how you like to dance to blaring music in the living room? Yeah we don’t do that here. You have a desk, and a chair. You can stand if you wish.”
I tried to fight it at first. But it’s so much easier to give in to the lethargy, to embrace it.
I don’t have to be here. I’m a free spirit. Or just lazy and distracted. I don’t care. But I really don’t have to be here. Let me explain.
There was a startup. It sounded cool, innovative, impressive. Interviews, a take-home assignment. When they offered me a full-time job, I balked. I knew it was going there, but I’ve been freelancing from home since 2020, and my previous experience with the office and full-time employment was… mixed. Maybe it won’t be so bad? I still hesitate. “How about a trial period of three days a week, from the office, to test the waters?” “Sure!” What’s to think about? It’s the best offer I could get.
So I handed my resignation to one of my other projects, and, 30 days later, started coming to the office.
I was excited, and a little apprehensive. The first days were overwhelming - lots of interesting (and very nice) people, interesting tech, a deep stack. Plenty of good reasons to stay. But every day, at some point, the lethargy would inevitably kick in.
Like I said, I tried to fight it, at first. Go for a short walk, get a snack, talk to someone I haven’t met yet. But it’s not enough. There’s that claustrophobic buzz in my head, telling me I’m still in the office, confined to the walls and the rituals. So I submit, let it wash over me and sink in.
And over time, a realization sinks in, too:
This is not for me. Despite all the reasons to like it there, despite the graceful offer extended to me, my office lethargy makes it abundantly clear that all the beginnings I’m having are borrowed from somebody else’s story. All the tensions building up, Chekhov’s guns promising to fire - I don’t want to be there when they do. What a relief to allow myself to think that! I will not stay here. I almost cannot stay here.
Have I told them? No, not yet. But I will, any day now. Maybe next week.
For now, it’s me and my office lethargy.
And let me tell you, it’s a fascinating experience, being physically and mentally present but emotionally checked-out. I experience a lot of things differently, as an observer, letting them pass without looking for a reason to stress out over them.
Free from self-judgement, I can let go of judging others, too. And I feel I see them more accurately: not as antagonists, doing their best to foil my efforts; rather as earnest people, doing what they believe is best for themselves and the company.
A wave of compassion washes over me: how difficult it must be, dealing with all the tensions, juggling life and work, navigating decisions and anxieties. I wish there was another way.
Or maybe I’m projecting and patronizing, and they have a completely different experience? Someone steps by and starts smalltalk, and I snap back in.
I grind my teeth and push myself to be productive till the end of the day, for which I know I’ll pay a price. Lethargy never fails to collect.
Finally, the day is done. Say goodbye, pack up, and head towards home sweet home.
Trying hard to keep the words “rat race” out of my thoughts, I join the river of people, each heading to his or her own home sweet home. Many of them are hurried and impatient, heads buried in their phones. This time, at least, I don’t need to run.
A peculiar kind of mindfulness, which I attribute to the disengaging effect of the lethargy combined with the newfound liberation of finishing work for the day, allows me again to observe the people around me with curiosity and compassion. Hundreds of life lines brushing by, almost intertwining but not quite. What are their stories? Did they have a good day? What kinds of homes are they coming back to? Then someone particularly impatient squeezes uncomfortably close and the compassion is gone, I just want to get home.
And then, at last, home. Here the lethargy has no power - it is easily chased away by my feisty cat and a dose of good music. “See you tomorrow, I guess,” it must have the last word.
Yes.
But not for long.
Any day, now.
A few months ago I gave an “introduction to classical ML” workshop to a team of full-stack developers. The idea was to give a conceptual introduction, break down the very abstract “let’s design an algorithm that improves with more data”, and demonstrate how you’d approach this in practice.
The workshop was accompanied by an exercise in JS, “Heads or Tails”, which is what this post is going to be about.
In this exercise you’ll implement the brains behind the blockbuster video-game, “Heads or Tails”, which tests the player’s ability to randomly choose between heads and tails. The game guesses the player’s next choice, and should these choices exhibit any patterns, the game will use these patterns to gain an advantage over the player.
During the exercise, you’ll implement increasingly sophisticated algorithms for predicting the player’s next choice given their choice history, with the game visualizing the algorithms’ predictive power on your choices as a player. Ready to be unpredictable?
As previously mentioned, you’ll play two roles: of the developer implementing the learning algorithms, and of the player being tested for predictability.
In the provided html file, one “prediction algorithm” has been implemented, which just uses
Math.random to predict the player’s choice. It is therefore very poor as a prediction algorithm,
but it’ll assist you in understanding what’s going on and provide a baseline for comparison.
In the browser, start pressing “1” (for heads) and “2” (for tails) randomly, and you’ll see a line chart being extended as you play. The chart depicts the prediction algorithm’s score: it gains a point every time it predicts correctly, and loses a point every time it’s wrong.
In the code, look for the function predictRandom - this is the random prediction implementation.
Any function you write whose name starts with “predict” will be integrated into the game, and you’ll be able to see its predictive performance.
The functions you’ll implement take one parameter - history, which is an array with the player’s choice history (in the current game).
Choices are represented as strings - “H” for heads, “T” for tails. The function needs to return the prediction for the player’s next choice (again, “H” or “T”).
Add two simple prediction functions:
Refresh the page and, using the score visualization, make sure the functions behave as you would expect.
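For reference, here’s roughly what constant predictors could look like - the function names here are my own invention; any name starting with “predict” works:

```javascript
// Two hypothetical constant predictors - they ignore the history entirely
// and always make the same choice.
function predictAlwaysHeads(history) {
  return "H";
}

function predictAlwaysTails(history) {
  return "T";
}
```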
Before we move on to more sophisticated algorithms, let’s take a step back and look at what we’re trying to achieve.
Using the functions we added in section 01, it’s easy to see that when the player always makes the same choice (say, tails),
the respective prediction function is significantly better than the predictRandom strategy.
But, unless you have a strong preference for either choice, if you try to behave randomly these const-predicting functions won’t perform much better than random.
We’d like to formulate stronger prediction strategies, such that even when we try to confuse the predictor, it’ll pick up on our patterns (assuming they exist) and will be notably better-performing than the random strategy. That is, unless we manage to be truly random, which (spoiler alert) most humans aren’t capable of.
So: if you write an algorithm that achieves a similar score to the random predictor, that algorithm can’t find your patterns. If it achieves a significantly better score, that means a) you’re relatively predictable, and b) the algorithm managed to pick up on your patterns.
Another point to ponder: what would it take for an algorithm to achieve a significantly worse score than the random predictor?
Let’s assume the player has a strong preference for one of the choices. We can write a function that finds this preference and uses it to predict the next choice. Implement such a function.
For example: if this is the player’s choice history:
HTTHTTHTTHTTHHHTHT
We can see that the player picked heads 8 times, and tails 10 times. Therefore we’ll predict “tails”.
As the game progresses, the majority choice might change, and your function’s prediction will reflect that.
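One possible implementation sketch - the function name and the tie-breaking defaults are my own choices:

```javascript
// Predict the player's most frequent choice so far.
function predictMajority(history) {
  const heads = history.filter(c => c === "H").length;
  const tails = history.length - heads;
  // Tie (including an empty history): arbitrarily default to heads.
  return heads >= tails ? "H" : "T";
}

// For the history from the text (8 heads, 10 tails) this predicts "T".
predictMajority("HTTHTTHTTHTTHHHTHT".split("")); // "T"
```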
The previous function was very simplistic. Let’s write a more sophisticated version of it: suppose there’s some consistency in the player’s behavior. Say, the player makes long runs of heads, long runs of tails, and occasionally switches between them. Or, that they try to be “unpredictable” but instead end up just alternating between them, choosing “heads-tails-heads-tails”.
In such cases, even though there might not be a generally preferred choice, we might be able to exploit finer patterns.
Let’s look again at the choice history from the previous section: HTTHTTHTTHTTHHHTHT.
Instead of taking the majority of choices, we note that the player’s last choice is “tails”, and so we’ll look only at choices
made after tails. Here they are, highlighted:
HT**TH**T**TH**T**TH**T**TH**HHT**H**T.
After choosing tails, the player chose tails again 4 times, and heads 5 times. Therefore, in this case, we’ll predict “heads”.
Implement a prediction function that uses the algorithm described here to make predictions.
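Here’s a sketch of such a function - the defaults for short or never-seen-before histories are my own arbitrary choices:

```javascript
// Predict based on what usually follows the player's last choice.
function predictAfterLast(history) {
  if (history.length < 2) return "H"; // not enough data - arbitrary default
  const last = history[history.length - 1];
  let heads = 0, tails = 0;
  // Count what the player chose right after each earlier occurrence of `last`.
  for (let i = 0; i < history.length - 1; i++) {
    if (history[i] === last) {
      if (history[i + 1] === "H") heads++;
      else tails++;
    }
  }
  if (heads === 0 && tails === 0) return "H"; // `last` never appeared before
  return heads >= tails ? "H" : "T";
}

// For the history from the text, the last choice is tails; after earlier
// tails the player chose heads 5 times and tails 4 times, so: "H".
predictAfterLast("HTTHTTHTTHTTHHHTHT".split("")); // "H"
```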
Alright, now we’re in the grown-ups’ league. In this section I’ll guide you through implementing a (simple version of a) real machine-learning algorithm: decision trees.
The implementation requires a bit more work so you can treat it as a small project.
Ready? Here we go.
In the previous section we looked at the player’s last choice, hoping to exploit patterns related to it. Playing around yourself, you probably noticed there can be runs longer than 2, and we’d like to exploit those too. We could, in theory, extend the technique from the previous section: enumerate all the possibilities for, e.g., a 6-choice-long run, and take the majority for each case. But that’s an extremely specific prediction strategy - to get meaningful data for longer runs we’d need to play for a very long time, and we might be missing more obvious patterns. What can we do?
A decision tree is a classical machine-learning algorithm in which each prediction is reached through a sequence of questions: each answer leads us further down the tree’s nodes until we arrive at a prediction.
In our case, if we look at a history window of length 6, a decision tree might look like this:

What do I mean by “history window” of size 6? Suppose this is the player’s choice history: HTTHTTHTTHTTHHHTHT.
Then this is the last window of size 6: HTTHTTHTTHTT**HHHTHT**.
Using the player’s choice history, we can construct a training set of all the length-6 windows and the choice that came right after them, which we’ll use to look for patterns:
| Window | Choice after window |
|---|---|
| HTTHTT | H |
| TTHTTH | T |
| THTTHT | T |
| HTTHTT | H |
| TTHTTH | T |
| … | … |
| THHHTH | T |
In the first row (HTTHTT), “window @1” is the first choice in the window - H. “window @3” is T, “window @6” is T as well, and so on.
Back to decision trees. I’ll divide the implementation to two parts - representing a decision tree, and constructing the decision tree given data.
This part is pretty programmatic - decide how you want to represent a decision tree, without worrying about how you’d actually construct the decision tree, and implement this representation.
You could go all-in OOP and create a TreeNode class with pointers to 2 child nodes (which might be predictions or additional decision nodes); you could use simple data structures (arrays, objects) with appropriate functions; you could use functional programming; or whatever else you fancy.
Make sure you can easily construct a decision tree, and that you can use one to make predictions.
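To illustrate one of the options, here’s a minimal plain-object representation - the field names are my own invention, not part of the exercise:

```javascript
// A prediction (leaf) node: { predict: "H" } or { predict: "T" }.
// A decision node: { testIndex, ifH, ifT } - it looks at position
// `testIndex` of the window and descends into the matching subtree.
function treePredict(node, window) {
  if (node.predict !== undefined) return node.predict;
  const child = window[node.testIndex] === "H" ? node.ifH : node.ifT;
  return treePredict(child, window);
}

// A tiny hand-built tree: "predict a repeat of the window's last choice".
const echoTree = {
  testIndex: 5,
  ifH: { predict: "H" },
  ifT: { predict: "T" },
};
```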
Now we’re ready to tackle the next part: constructing a decision tree based on the player’s choice history. What we’re actually aiming to do is find a set of tests on the history window which maximally separate windows that were followed by “heads” from windows that were followed by “tails”.
Part 2.1: Creating a training set
As preparation for constructing the decision tree, we want to extract a training set (similar to the table shown above) from the choice history. Implement a function that takes the choice history and returns a sequence of training pairs: choice window, and choice following window.
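One possible sketch (the function name is mine; windows are returned as arrays of choices):

```javascript
// Slice the history into (window, choice-after-window) training pairs.
function makeTrainingSet(history, windowSize) {
  const pairs = [];
  // Every position that still has a "next choice" after the window yields a pair.
  for (let i = 0; i + windowSize < history.length; i++) {
    pairs.push({
      window: history.slice(i, i + windowSize),
      next: history[i + windowSize],
    });
  }
  return pairs;
}

// The 18-choice history from the text yields 12 windows of size 6,
// matching the table above.
makeTrainingSet("HTTHTTHTTHTTHHHTHT".split(""), 6);
```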
Part 2.2: Measuring homogeneity
Given a set of choices, we want to be able to measure how homogeneous - “pure” - it is. If all the choices are the same, that’s maximal homogeneity; if they’re distributed 50-50, that’s minimal homogeneity.
We will use this measure to evaluate potential decision tree structures, and pick tests that split a set of poorly-separated choices into two purer sets.
The measure we’ll define is called entropy (the one from information theory, not thermodynamics). It actually measures the opposite of homogeneity - heterogeneity, and goes like this:
Suppose in a given set of choices, \(h\) denotes the proportion of “heads” choices, and \(t\) the proportion of “tails” choices (\(h + t = 1\)). Then:
\[entropy(h, t) = -(h \cdot \log_2 h + t \cdot \log_2 t)\]Since \(h + t = 1\), we can also write:
\[entropy(h, t) = -(h \cdot \log_2 h + (1 - h) \cdot \log_2 (1 - h))\]The following chart visualizes the entropy for each \(h\) between 0 and 1:

Implement a function that computes the entropy of a given set of “heads” / “tails” choices. Make sure it agrees with the chart above.
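Here’s one way it could look, taking care of the \(0 \cdot \log_2 0\) edge case (conventionally taken to be 0):

```javascript
// Entropy of a set of "H"/"T" choices, in bits.
function entropy(choices) {
  const h = choices.filter(c => c === "H").length / choices.length;
  const t = 1 - h;
  // 0 * log2(0) is NaN in floating point; define it as 0.
  const term = p => (p === 0 ? 0 : p * Math.log2(p));
  return -(term(h) + term(t));
}

entropy(["H", "H", "H"]); // 0 - fully homogeneous
entropy(["H", "T"]);      // 1 - maximally heterogeneous (50-50)
```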
Part 2.3: Actually constructing the decision tree
OK, we’ve been through a lot up until now, but bear with me - this is the final push.
I’ll first describe the algorithm and then explain it.
We’ll use a pair of variables, X and y, to denote our training set: X contains a sequence of windows, and y contains a sequence of choices following the windows. X and y are corresponding, so for example the 11th element of y is a choice that came after the 11th element of X.
ConstructDecisionTree(X, y)
The algorithm might not be trivial, but it’s very elegant, and we’ll now break it down.
First off, you probably noticed that the algorithm is recursive. At every step, the algorithm tries to find the optimal test for splitting the training set.
The recursion’s stopping criterion is one of two:
Either way, once we’ve stopped, we return a prediction node that predicts the majority choice among the training examples that reached the current node.
If we haven’t stopped, that means we have a test we’d like to split by. In this case we’ll divide \(X\) and \(y\) according to the test, and construct two more decision trees from the corresponding halves with recursive calls. Then we compose those subtrees under a decision node whose test determines which subtree handles the current case.
How do we measure the quality of each candidate test? Remember the entropy measure we defined? The higher it is, the more “impure” our set is. So we want to find the test that decreases entropy as much as possible. After splitting, we calculate the entropy of each of the new sets, and take the weighted average of the entropies to compare against the value before splitting. The improvement is then the difference between the pre-split entropy and the weighted-average entropy after the split.
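In code, evaluating a candidate split could look like this sketch (this quantity is usually called information gain; entropy() is the same measure from part 2.2, repeated here so the snippet is self-contained):

```javascript
// Entropy of a set of "H"/"T" choices, as defined in part 2.2.
function entropy(choices) {
  if (choices.length === 0) return 0;
  const h = choices.filter(c => c === "H").length / choices.length;
  const term = p => (p === 0 ? 0 : p * Math.log2(p));
  return -(term(h) + term(1 - h));
}

// Information gain: pre-split entropy minus the size-weighted average
// of the entropies of the two sets produced by the split.
function informationGain(yAll, yLeft, yRight) {
  const weightedAfter =
    (yLeft.length / yAll.length) * entropy(yLeft) +
    (yRight.length / yAll.length) * entropy(yRight);
  return entropy(yAll) - weightedAfter;
}

// A perfect split of a 50-50 set gains a full bit:
informationGain(["H", "H", "T", "T"], ["H", "H"], ["T", "T"]); // 1
// A useless split gains nothing:
informationGain(["H", "T", "H", "T"], ["H", "T"], ["H", "T"]); // 0
```

The best test is simply the candidate with the highest information gain.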
And that’s it! These are all the pieces we need to construct a decision tree. Well done!
If you’d like more intuition, this link contains a wonderful, visual introduction to decision trees.
Hopefully you’ve managed to properly implement the decision tree predictor, and witnessed that it’s pretty good at learning your patterns. That’s cool!
At least for me the decision tree is a great predictor:

A nice way to convince ourselves that the computer isn’t somehow cheating is to feed properly random choices and make sure all strategies behave like predictRandom (use the browser’s devtools):
for (var i = 0; i < 200; i++) {
document.querySelectorAll('.coin')[Math.round(Math.random())].click();
}
My implementation of the exercise solutions is here, SPOILER ALERT.
If you enjoyed this and would like to go further, there are several directions:
Good luck!
Don’t get me wrong - I love what I do (most of the time), I’m grateful for the almost magical things technology enables me to do, and I’m generally excited and optimistic about the potential of technology to make the world a better place.
But. I’m always suspicious of new tech, take my sweet time migrating to new devices (my phone is 8 years old), stick to my habits for a long time before adopting the latest payment method or whatever gizmo is hot, and generally opt for as low-tech solutions as possible.
Sure, part of that can be attributed to my natural laziness and stubbornness. But I do think I have a few points in my defense (not that there have to be any - I can be grumpy if I want to):
Whew! That got heavy. If anything, writing this post makes me feel even more justified in my technophobia.
That said, if you’ll excuse me, I do think it’s finally time to get a new phone…
About 15 years ago I was just starting my professional coding career. I was junior, eager to learn, and had a lot of time and energy to tinker around.
At some point I stumbled upon TransparencyKey and TopMost, a pair of properties that allowed you to develop widget-style apps that always stayed on top but could be mostly transparent (to the mouse as well as the eye.) Combined with some Win32 API functionality, this could be used to do some really cool stuff. For about 2 years, 90% of my side projects would involve those properties. (Fixated? Me? No way)
I envisioned a whole suite of whimsical toys to liven up our dreary corporate workstations. Being junior, time after time I fell into the classic side-project pitfall: I’d get excited by some idea, get to a working proof of concept, and move on to the next idea thinking I’d come back later to finish it up. Which, you guessed it, never happened.
Until now! I got nostalgic and thought I’d at least dig up what projects I could find and write about them.
I have to say, I was positively surprised by how smooth it was to get 15-year-old code running. These projects were developed on Windows XP / Windows 7 using .NET 4 or earlier, and apart from some graphics glitches (probably stemming from me doing stuff inefficiently) they work without modification on Windows 11 after upgrading to .NET 8. Kudos for the backward compatibility, Microsoft.
And now, without further ado:
Always just a hotkey-press away from being visited by a creepy (but cute) purple creature. In newer versions of Windows, the shadows that windows cast are part of the window size, which the code doesn’t account for (hence the small gaps.)
Press a hotkey to send the active window cartwheeling. The glitches are due to the screen recording (it works smoothly otherwise, if slow on large windows.)
Why have all your files sit quietly in their folders if they could be floating around? My ambitious intent was to expand by animating all the contents of a folder flying from it when clicked, but I didn’t quite get there.
I was very proud of this one, which is basically low-key malware for trolling my colleagues. We had a power-user culture and used lots of shortcuts; sometimes when a colleague left their workstation I’d replace one of their shortcuts to point to this. It’d open notepad, blurt some insult, and spawn a pacman that would chase the mouse around. When the pacman was on the mouse you couldn’t click anything (because the pacman captured the mouse press event.)
When you tried to open task manager, it closed the process but rendered a “fake” task manager (which looks very out-of-place on Windows 11), only to close it dramatically a few seconds later (this part sometimes crashes on Windows 11).
When you tried to open the commandline shell, it closed it and was angry (jittering around.)
If you wanted to close it, you had to create a “C:\Users\public\Downloads\pwned.txt”, though some colleagues ended up restarting to get rid of it (I know, this wouldn’t fly today, but wasn’t so out-of-place at the time at that org.)
Sadly I lost the source to a couple of other projects, which I actually wrote later and were more mature.
Used flood-fill to materialize stuff on the screen (usually images with mostly transparent background or text.) Sounds a bit basic, but I think the overall effect was neat.
Written when mobile IM apps started to proliferate, everybody was using way too many emojis, and I wanted to easily include them in emails and documents (I know.)
On hotkey press, a 3x3 roster of emojis appeared, with left / right arrows switching between sets. You used the numpad to pick an emoji, it was copied to the clipboard and then the app would emit a “ctrl+v” so the emoji was pasted as an image to whatever you were focused on.
It actually worked quite well, quickly became intuitive to use, and was adopted by quite a lot of people in the org. I ended up adding all sorts of features such as custom emoji image paths and controlling the pasted emoji size.
If for some reason you’d like to browse the code, you can find it here.
The inspiration came from two places:
Each game has its own environment and challenges, and a specific end-goal. Participants compete to solve as many games as they can within the time limits of the competition.
I got a session for the competition at the recent PyCon IL, and it went really well! There were developers from diverse backgrounds, from fresh bootcamp graduates to experienced developers at tech companies.
You can watch the trailer and try the demo here.
If you’re interested in inviting me to run Codyssey at your place - feel free to reach out!