For the last 15 years I’ve had the fortune of accompanying thousands of students as they take their first steps in programming. From teenagers to career-changers, from one-on-one mentoring to developing curriculums and training materials, I’ve had a lot of successes, and a lot of failures, too. The more I teach, the more I learn, the more I find I don’t know.
One question in particular has been nagging me almost from the very beginning: why do some students simply not “get” programming? You might chalk it up to innate intelligence, motivation, or poor explanations on my part, but I’ve sat with enough smart, highly motivated students, patiently explaining again and again in different ways to no avail, to believe there’s something bigger I’m missing - something that just doesn’t click for those students, something that takes more than a good explanation to land.
This question has been brewing in my mind for a long time, and finally, in the last few years, an answer has been emerging: my failure has been in not actively nurturing a mental model of programming. Some (many) students build one naturally, unconsciously, but those who don’t are lost without proper guidance.
So that’s what this post is about; maybe you think it’s obvious, or that I’m missing something else - I’d love to hear! But here we go.
What does that even mean?
The term “mental model” is used a lot, in different contexts and meanings, so I want to be careful when explaining what I mean, exactly, by “mental model of programming”.
A model, in the sense I’m using, is some object that imitates another object to some extent, but is simpler or more manageable. To the degree that the two objects are similar, the model allows us to conveniently experiment with the dynamics of its more complex sibling.
Thus, a model railroad lets you play with train cars and tracks in your living room, weather forecasts use weather models to simulate how weather systems develop, and language models attempt to emulate the dynamics of human language.
A mental model is a model that sits in your head: machinery in the brain you can use to simulate whatever it’s modelling.
A mental model of programming, then, is brain machinery that allows you to simulate how the computer would execute a piece of code - to run this code in your head.
A good mental model of programming is critical for programmers to navigate the huge space of programs they could write. As you write or debug code, you constantly simulate, in your head, what’s happening, and compare it to the desired behavior; this guides the process of adding or changing code to achieve your goal.

But computer-simulation-machinery doesn’t come cheaply to the brain; it requires a lot of practice. Specifically, it requires the brain to try to simulate code, fail, and learn from that.
As I’ve learned the hard way, while practice is necessary, it’s not sufficient: depending on how you teach, students can find other ways to get by, and if the learning process doesn’t explicitly aim to build a mental model in their heads, some students will take the wrong road.
Take K-12 education and its equivalents in other countries. In Israel what you’re most likely to see is overcrowded classrooms, overworked teachers, and a hyperfocus on getting students to pass their matriculation exams. Putting aside my general criticism of the system, in my experience this often leads to learning that doesn’t involve constructing a mental model of something, even in subjects where that would be appropriate.
This manifests as three levels of performance:
Despite my dismissive tone, this kind of learning isn’t inherently wrong, and sometimes it’s the only way that makes sense in the context.
But, on its own, it typically doesn’t lead to the creation of a mental model, and it doesn’t equip students with the (unconscious) ability to rely on mental models when solving problems. Contrary to what you might expect from, say, the pyramid of Bloom’s taxonomy of cognitive skills, drilling the lower levels of understanding doesn’t automatically, at some point, transform into deeper understanding. At least not for all students.
And those students are the ones who struggle. The ones who, no matter how much I explain and how much they feel they understand, sit in front of the IDE and feel completely lost. It’s not their fault: I’ve been talking completely past them. I explained as if they reason about code using a mental model which happens to be wrong, and I’m trying to correct it; but they’re operating completely differently.
They try to find patterns in which constructs are used where, copy-paste examples and tweak them; they want the code to do the opposite thing so they try inverting conditions and booleans in areas that seem related. It rarely works.
This is the gap that needs to be addressed, and it’s best done before they even start down the wrong path.
So what can you do? I’m still learning how to answer this question, but here’s what I’ve got so far.
Incremental practice: in my opinion, this is the most important component. When programmers work, their use of the mental model is intense: they simulate a program while keeping track, in their head, of the desired behavior; notice when the two diverge; and figure out what change will bring them back in line.
Asking students to do all that at once is a bit of a leap. I like to specifically drill the mental model, first in straightforward and (relatively) easy ways, then increasingly require usage that resembles real programming.
The progression could look like this:
Isolate: learning to program involves, in addition to building a mental model, a lot of technical aspects that can confuse students: working with the IDE, the language’s rigid syntax, arcane keywords that don’t necessarily carry meaning.
These things aren’t necessary for the mental model, so I prefer to keep them for later, starting with river-crossing puzzles and moving on to robots taking natural language instructions in a visual environment. You can go pretty far, conceptually, without introducing a real programming language: variables, conditions, control flow. I’m doing all this with plain old presentations and Kahoot, but today you could use LLMs to create an interactive environment.
Visual cues: you can show the students what their internal mental model should look like, for example by doing a dry-run of a program, highlighting the current line (like a debugger), and showing a “watch” table with all the variables.
Make shortcuts harder: Finally, avoid formulaic problems that could be solved using pattern matching or memorization. This will create a vacuum that the brain will want to fill, pushing it to create a mental model.
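To make the “visual cues” dry-run concrete, here’s a minimal sketch of my own (an illustration, not a classroom-ready tool) using Python’s sys.settrace to print, for every executed line, its line number and a “watch” table of the local variables:

```python
import sys

def trace_lines(frame, event, arg):
    # On every executed line, print its number and a "watch" table
    # of the local variables - imitating a debugger's dry-run view.
    if event == "line":
        print(f"line {frame.f_lineno}: {frame.f_locals}")
    return trace_lines

def demo():
    total = 0
    for n in [3, 1, 4]:
        total += n
    return total

sys.settrace(trace_lines)
result = demo()
sys.settrace(None)
print(f"{result=}")
```

Running this prints the evolving values of total and n step by step, which is exactly the table I’d draw on the whiteboard.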
In this context, I’ll mention Scratch; it’s a rich learning environment that’s got some of those principles built-in. I have some reservations about it, but I’ll leave that to another discussion.
I’ve found that implementing all these, as early as possible, drastically improves student performance.
AI has been rocking a lot of boats for the last few years, and specifically both programming and education are having some reckoning moments around it. I’m in no position to predict what will happen in the future, but I am worried that reliance on AI in programming education will be yet another shortcut that’s actually an obstacle to students’ development of a solid mental model of programming (assuming that’s still going to matter).
Many other domains require, or at least greatly benefit from, having a relevant mental model.
One interesting example I can think of is cooking; you don’t have to have a mental model - you could follow recipes, memorize them, and utilize pattern-matching to handle common challenges such as identifying when the batter is mixed enough or how to replace a missing ingredient. That’s pretty much my experience with cooking, or at least it was until quite recently.
But then I met my wife, who cooks amazing food and, impressively, can improvise wild dishes from what we happen to have available. I came to realize she has a proper mental model of food: she knows how ingredients behave under various conditions and how flavors and textures mix. She acquired this model through years of experimentation, while paying attention to what happens in the process. I’ve been trying to cook more like that.
Some may call it good intuition, and different people may have natural inclinations to develop those for different areas, but I’ve found the idea of a mental model an interesting lens to reason about how experts approach their work.
I also think about the fraction of students for whom a mental model of programming doesn’t come naturally, and how this can be bridged with an appropriate pedagogical approach. What other areas could benefit from something like that?
You need to cross from point A to point B as fast as possible. Geometry dictates that the shortest path is a straight line, but since you’re crossing different terrains at different speeds, the fastest path will not be a straight line.

Any reasonable solution can be characterized by two numbers, denoted \(x_1\) and \(x_2\), describing where we cross the boundaries between different terrains. Given all the information, it’s not difficult to calculate the total time to cross, using the Pythagorean theorem and the equation \(distance = speed \cdot time\):
\[t = \frac{\sqrt{ {x_1}^2 + d^2}}{v_1} + \frac{\sqrt{ {x_2}^2 + d^2}}{v_2} + \frac{\sqrt{(h - x_1 - x_2)^2 + d^2}}{v_3}\]

But how to find \(x_1\) and \(x_2\) that minimize \(t\)? Your old calculus professor would suggest computing the gradient and solving a system of equations, but honestly you’d rather kiss a sulphurous frog. Sampling lots of points and picking the best is possible, but inexact and expensive.
Fueled by optimization techniques for deep learning models, the last decade saw an explosion of automatic differentiation engines in Python: libraries that allow you to write numeric code in almost-pure Python, and automatically compute derivatives and gradients. How does this help us? A function’s gradient tells us in which direction the function is increasing the most. So if we want to minimize it, we can flip the gradient’s sign and just follow that! That’s the essence of the gradient descent algorithm.
from jax import grad
from jax.numpy import sqrt

h, d, v1, v2, v3 = 20, 10, .7, .3, .45

def calc_time(x):
    x1, x2 = x
    return (
        sqrt(x1 ** 2 + d ** 2) / v1
        + sqrt(x2 ** 2 + d ** 2) / v2
        + sqrt((h - x1 - x2) ** 2 + d ** 2) / v3
    )

d_time_d_x = grad(calc_time)  # Magic!

x1, x2 = 2., 17.
step = 1.5
for i in range(20):
    dx1, dx2 = d_time_d_x([x1, x2])
    x1 -= step * dx1
    x2 -= step * dx2
    print(f"{x1=}, {x2=}")

We can visualize the objective landscape, and show the path our optimization traces through it:
Looking at the objective itself along the optimization, we can see it consistently improving (though the rate of improvement is slowing down):
Finally, we can visualize the actual paths represented by the parameterized solutions as we optimize:
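To reproduce the objective-over-time view, one can simply record the objective at every step. Here’s a self-contained sketch with the values from the example above, using central finite differences as a stand-in for jax.grad so it runs even without JAX installed:

```python
import math

# Values from the example above.
h, d, v1, v2, v3 = 20, 10, .7, .3, .45

def calc_time(x1, x2):
    return (math.sqrt(x1 ** 2 + d ** 2) / v1
            + math.sqrt(x2 ** 2 + d ** 2) / v2
            + math.sqrt((h - x1 - x2) ** 2 + d ** 2) / v3)

def num_grad(f, x1, x2, eps=1e-6):
    # Central finite differences as a stand-in for jax.grad.
    return ((f(x1 + eps, x2) - f(x1 - eps, x2)) / (2 * eps),
            (f(x1, x2 + eps) - f(x1, x2 - eps)) / (2 * eps))

x1, x2, step = 2., 17., 1.5
objectives = []
for _ in range(20):
    objectives.append(calc_time(x1, x2))
    dx1, dx2 = num_grad(calc_time, x1, x2)
    x1 -= step * dx1
    x2 -= step * dx2

# `objectives` can now be plotted against the iteration number.
print(f"start: {objectives[0]:.1f}, end: {objectives[-1]:.1f}")
```

Plotting `objectives` against the iteration index gives the improvement curve shown above.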
You might be wondering how JAX computes the gradient behind the scenes. Maybe it’s using numeric approximations? Or parsing the code and symbolically working out the gradient? Actually, neither!
A full explanation of automatic differentiation is out of scope for this intro, but I’ll try to convey the main ideas succinctly.
Look at this simple computation:
z = x ** 2 + y / 2

If x and y were pure Python numbers, then z would also be a number, and contain no trace of the computation that led to its current value.
But, using operator overloading (“special method names” in Python), you can create types that keep track of computations, and use them to obtain expression trees for the values you compute:
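Here’s a toy sketch of that idea (a deliberately tiny illustration, not how JAX is actually implemented): a class whose overloaded operators record, for each computed value, the operation and operands that produced it:

```python
class Node:
    """A value that remembers the operation and operands that produced it."""
    def __init__(self, value, op=None, inputs=()):
        self.value, self.op, self.inputs = value, op, inputs

    def __add__(self, other):
        other = other if isinstance(other, Node) else Node(other)
        return Node(self.value + other.value, "+", (self, other))

    def __truediv__(self, other):
        other = other if isinstance(other, Node) else Node(other)
        return Node(self.value / other.value, "/", (self, other))

    def __pow__(self, exponent):
        return Node(self.value ** exponent, f"**{exponent}", (self,))

x, y = Node(3.0), Node(4.0)
z = x ** 2 + y / 2          # computes 11.0 AND builds the expression tree

def show(node, depth=0):
    # Print the tree, one node per line, indented by depth.
    print("  " * depth + (node.op or f"leaf({node.value})"))
    for child in node.inputs:
        show(child, depth + 1)

show(z)
```

Evaluating `z` both computes the value and leaves behind the full expression tree rooted at it.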
Here is the expression tree for the time calculation that we want to optimize:
Next, and this is where (some of) the magic happens, thanks to the chain rule in calculus and its generalizations, you can use this tree to efficiently compute the derivative of the final node with respect to any other node in the tree.
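A toy sketch of that step too (again, far simpler than a real engine): each node stores the local derivative with respect to each of its inputs, and a recursive backward pass multiplies them along paths, per the chain rule:

```python
class Var:
    # Minimal reverse-mode autodiff: each Var stores its value and a list of
    # (parent, local_derivative) pairs; backward() applies the chain rule.
    def __init__(self, value, parents=()):
        self.value, self.parents, self.grad = value, parents, 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, seed=1.0):
        # Accumulate the derivative of the output wrt this node,
        # then push it further down, scaled by each local derivative.
        self.grad += seed
        for parent, local in self.parents:
            parent.backward(seed * local)

x, y = Var(3.0), Var(4.0)
z = x * x + y * x          # z = x^2 + xy
z.backward()
print(x.grad, y.grad)      # dz/dx = 2x + y = 10, dz/dy = x = 3
```

Real engines do this iteratively over the graph (and much more efficiently), but the chain-rule bookkeeping is the same in spirit.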
I won’t go into more detail than that - it’s a calculus-phobic’s introduction, after all - but I hope this sates your curiosity for now, and I’ve added links with more information below.
The ability to effortlessly, and efficiently, calculate gradients of arbitrary functions is very powerful for gradient-based optimization.
However, as you might expect, there are a few subtleties:
Since we’re using custom types for building the computation graph and calculating gradients, we need to use operations that support those types. Operator overloading allows us to support arithmetic out-of-the-box, but for more complicated computations you’ll need to use the appropriate implementation (or implement it yourself if it doesn’t exist). Hence the use of the custom jax.numpy.sqrt function.
The good news is that many modern automatic differentiation engines come with a big library of common operations and algorithms already implemented, so what you need is most likely there - and if it’s not, you’ll have plenty of primitives to build on.
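For example (assuming JAX is installed): math.sqrt coerces its argument to a plain float, which fails on the tracing types JAX passes through your function, while jax.numpy.sqrt traces just fine:

```python
import math
from jax import grad
import jax.numpy as jnp

def good(x):
    return jnp.sqrt(x)   # jnp knows how to handle JAX's tracing types

def bad(x):
    return math.sqrt(x)  # math.sqrt tries to coerce the tracer to a float

print(float(grad(good)(4.0)))  # derivative of sqrt at 4 is 1/(2*2) = 0.25

try:
    grad(bad)(4.0)
except Exception as e:
    print("math.sqrt can't be traced:", type(e).__name__)
```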
The simple path-planning problem I presented has a simple “optimization landscape”, where from every point it’s fairly easy to improve solutions. And still there was some tuning - choosing the step size and number of iterations. Such aspects will always require attention, and more complex problems may have landscapes that are trickier to optimize on.
Another potential issue is that of converging to local optima - depending on the problem, this may be acceptable, or require clever initialization or other tricks to avoid.
While the algorithm for computing gradients is efficient, there’s still significant overhead to computing gradients, both in time and in memory. If you have an extra-large problem you might need to take care when optimizing it, or find a way to break it down to smaller sub-problems.
OK, so we know how to optimize simple computations with differentiable programming. Anything else of interest we can do with it?
Since the computational graph is built on-the-fly with our custom types, it doesn’t “care” if computations happen in a straightforward, branchless block of code (like in our example) or through a winding path of conditions and loops. As long as we can construct a graph, we can calculate gradients (within the computer’s resource constraints, of course).
While constructing the computational graph within branches or loops isn’t an issue, if we want the optimization to include the conditions that determine those branches - well, that’s trickier. Still possible, though!
Suppose our computation goes through a simple if-else branch, and we want the if’s condition to be included in the optimization.
The computational graph only includes calculations that happened. So, say we choose the else branch - the computation wouldn’t “know” what could have happened had we taken the if branch.
To resolve that, we need to run both branches, and average them in a way that reflects the condition’s preference. The same is true for while and for loops, with slight variations.
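A minimal sketch of the branch-averaging idea in JAX (the sigmoid and its sharpness are my own choice of smoothing): evaluate both branches, then blend them with a smooth weight that reflects how strongly the condition holds, so the condition’s parameter gets a gradient:

```python
from jax import grad, nn

def soft_branch(theta, x, sharpness=10.0):
    # Run BOTH branches, then blend them with a sigmoid weight
    # that smoothly approximates the condition "x > theta".
    w = nn.sigmoid(sharpness * (x - theta))
    return w * (x * 2.0) + (1.0 - w) * (x * 0.5)

# A hard `if x > theta:` would give zero gradient wrt theta;
# the soft version lets the condition itself be optimized.
g = grad(soft_branch)(1.0, 1.2)
print(float(g))
```

As sharpness grows, the blend approaches the hard branch, at the cost of steeper (harder to optimize) gradients.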
I don’t want to go too deep, but this is a lovely example of reversing Game of Life with differentiable programming and the branch-weighting idea.
The proliferation of differentiable programming frameworks in Python was pretty much kickstarted by frameworks for training deep learning models, which also use gradient-based optimization techniques. This typically makes plugging such models into differentiable computations very easy! For instance, JAX has the Flax neural network library.
An example application of such integration is training physics-informed neural networks.
In addition to the built-in operations that come with automatic differentiation frameworks, there’s a growing ecosystem of fully differentiable implementations of more advanced operations in various domains.
Examples include: 3D rendering, computer vision, signal processing, and more.
These packages can be incorporated into differentiable pipelines to create very interesting tools.
Automatic differentiation has been around much longer than its presence in the Python ecosystem, with applications primarily in science and engineering design optimization. The recent surge of automatic differentiation frameworks in Python brings this powerful tool to a much broader audience.
Most recently, Gaussian Splatting, which is based on differentiable programming, has exploded in popularity, and is seeing impressive adoption as a 3D scene representation format.
Personally, for the last several years I’ve been working with a startup on 3D reconstruction with differentiable rendering. Also, image color replacement with numerical optimization from a few years ago would have been a classic use case for differentiable programming (except it was a project for a course, so I worked out the gradients by hand. Oh, what joy.)
This video is a great introduction to automatic differentiation, and this post walks through implementing automatic differentiation from scratch.
This is an extensive, if a bit technical, introduction to differentiable programming.
The JAX tutorials will get you up to speed on implementing differentiable computations.
I’ve also considered doing a series of differentiable programming exercises, gradually introducing concepts and tools. I think it could be really cool, with animations visualizing the optimization process. But it takes a lot of work, and I could use some motivation - ping me if that’s something you’d be interested in!
The first part of this post was adapted from a poster I made for the virtual session at Scipy 2025. Click to view at full resolution:
We had the idea to use pictures we’ve taken on our trips for some of the aesthetic design - as backgrounds for the invitation, menus and so on. I remembered once seeing an animation of a recursive subdivision of an image, with gradual refinement of high-detail areas, which seemed really neat. It reminded me of the way decision trees subdivide the feature space, and I thought it would be cool to try to reconstruct the image with a decision tree of limited depth, using pixels’ X- and Y-coordinates as features.
My wife took this picture of a beach sunset:
I fed it to a simple decision tree, where we try to predict RGB from X and Y:
It’s nice, but a bit too blocky for me. Of course, that’s a direct consequence of how we feed the data to the decision tree algorithm: it has to make thresholds on X- and Y-coordinates of pixels, so of course the approximation will be built out of rectangles.
Maybe we can use a different representation to get a different style?
I tried sampling points in image space using Poisson Disk Sampling and representing each pixel by its distances from all anchor points:
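Here’s a rough sketch of the distance-features idea (self-contained, so it uses a synthetic “image” and uniformly sampled anchors in place of a real photo and Poisson disk sampling):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# A synthetic stand-in "image": three smooth channels over a 64x64 grid.
H, W = 64, 64
yy, xx = np.mgrid[0:H, 0:W]
image = np.stack([xx / W, yy / H, (xx + yy) / (H + W)], axis=-1)

# Anchor points (uniform here; the real version samples with Poisson disk).
anchors = rng.uniform(0, 64, size=(12, 2))

# Represent each pixel by its distances to all anchor points.
coords = np.stack([xx.ravel(), yy.ravel()], axis=1).astype(float)
features = np.linalg.norm(coords[:, None, :] - anchors[None, :, :], axis=-1)

# One tree predicting all three channels at once.
tree = DecisionTreeRegressor(max_depth=8).fit(features, image.reshape(-1, 3))
reconstruction = tree.predict(features).reshape(H, W, 3)
print(reconstruction.shape)
```

Since thresholds on distance features carve space into circular arcs rather than rectangles, the reconstruction takes on the rounded, cell-like look shown below.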
Now we’re talking! I liked that style a lot, and happily my wife did too, so we ended up using variations of this technique on several of our pictures for various wedding-related graphics.
There are a lot of parameters and variations to play with here - anchor sampling density, decision tree depth, whether to use the same points for all RGB channels or different ones, and so on. So far my impression is that each picture has its own parameter spaces that work well with it, so there’s a lot of experimentation involved.
Here’s the result of a different sunset picture, with a stricter limitation on maximum tree depth - looks very abstract:
Here’s the same picture with similar maximum depth limit, but reconstructed with a random forest instead:
A bit noisier, but also softer - I like both versions, each with its own flavor.
Another cool trick is to use a picture with some object in it, segment the object (manually or with SAM) and give it a higher sample_weight when fitting the model. This will cause the tree to give more importance to those areas of the picture, resulting in higher fidelity, while the background remains more abstract:
The same idea can be applied to pictures with faces. I played with using face landmark detection (with the face-alignment library) to determine pixel importance, with pretty cool results - our faces are recognizable but still abstract:
I also played with generating an animation of a picture “coming into focus” by gradually varying the parameters:
As you can see, I’m having a lot of fun with this technique! There are many more ideas I’d like to try, but I’ll leave it here for now.
You can find sample code for the basic idea here.
Cheers!
As part of the model tuning phase, I wanted to explore the impact of class imbalance and try to mitigate it. A popular “off-the-shelf” solution to imbalance is weighting classes in inverse proportion to their frequency - which didn’t yield an improvement. This happened to me several times in the past, and other than basic intuition I couldn’t trace the theory of where this weighting comes from (maybe I didn’t try hard enough).
So, I decided to finally try to reason about class weighting in an imbalanced setting from first principles. What follows is my analysis. The TL;DR is that for my problem, I was convinced that class weighting probably doesn’t matter too much.
It’s an interesting analysis and was a fun rabbit-hole to dive into, but makes a lot of assumptions and I’d be careful not to overgeneralize from this.
Wherever there’s a (non-trivial) classification problem, there’s a tradeoff. I’ll focus on the simplest case of binary classification: say we have two classes - negative (denoted 0) and positive (denoted 1); further suppose that the positive is the rare class, with prevalence \(\beta\) (1% in the following visualizations / experiments).
Basically, when we classify, we predict the class of an instance with unknown class. We could be wrong in two ways:
It is trivial to avoid making any one type of error: for example, we could classify all instances as negative, avoiding false positives altogether (at the expense of all our positives being false negatives). And therein lies the tradeoff: to make an actual classifier that outputs “hard” predictions, we need to make a product / business decision about how bad each type of error is. Not making an explicit decision means our optimization pipeline has such a choice baked in implicitly.
Now, it’s hard to know in advance what the tradeoff curve will look like. We try to optimize everything else to give us the best set of options: collect lots of data with informative features, use a suitable model, etc. But after all that is done, we still need to choose how to balance the two types of errors.
To optimize this choice in light of our product preferences, we first need to characterize the tradeoff curve.
Some definitions first: let \(x\) denote the expected classifier output for a positive instance (in expectation, the true positive rate), and \(z\) the expected value of \(1 - \hat{y}\) for a negative instance (the true negative rate).
For the sake of this analysis I’m going to assume the tradeoff curve is of the form \(z = (1 - x^p) ^ \frac{1}{p}\), with \(p \geq 1\). This yields the following family of curves:

The red curve corresponds to \(p = 1\) - a pretty poor tradeoff curve. As we increase \(p\), our set of options improves. At the coveted (but realistically unattainable) \(p = \infty\) we’d choose \(x = z = 1\), beat the problem altogether and go home; until then, we have to choose some compromise between \(x\) and \(z\).
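A quick numeric illustration of the family: fixing the positive rate at \(x = 0.9\), larger \(p\) leaves more and more room for \(z\):

```python
# z = (1 - x^p)^(1/p): for a fixed x, larger p allows a larger z.
def tradeoff_z(x, p):
    return (1 - x ** p) ** (1 / p)

for p in (1, 2, 4, 16):
    print(f"p={p}: at x=0.9, z={tradeoff_z(0.9, p):.3f}")
```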
Later, we’ll ponder how to choose the tradeoff. But to do that we first need to define what it is we’re even trying to optimize.
Like I previously mentioned, initial modeling stages try to give us the best tradeoff curve possible for the task - using data, model type, training techniques, whatever. At those stages we can optimize for threshold-independent metrics, for example various area under the curve metrics. But ultimately, somewhere downstream the model’s output will be binarized, and we might as well take that into consideration when tuning the model.
I’m personally fond of the F-score - it combines two very interpretable metrics (precision and recall), which makes communicating with less technical stakeholders (such as product managers and the FDA) easier, and can be easily tweaked to account for error type preferences.
For this problem precision and recall were equally important, so I used the \(F_1\) score:
\[F_1 = \frac{2}{\frac{1}{precision} + \frac{1}{recall}}\]

Ultimately, this is the metric we want to optimize.
OK, so we know what we want to optimize, and we know that our choices are limited by the tradeoff curve. But how do we control where we land on the curve?
Canonically, binary classification is framed as minimizing binary cross-entropy loss. The knob we’ll use to decide where we land on the tradeoff curve is a weighting coefficient, \(\alpha\), for the positive instances:
\[BCE = -\sum_i{\left(\alpha y_i \log_2{\hat{y_i}} + (1 - y_i) \log_2{(1 - \hat{y_i})}\right)}\]

Now that all the actors are on stage, let’s roll up our sleeves and get our hands dirty.
First, we’ll take a step back from looking at individual instances, and look at the relationship between positive and negative instances, based on \(\beta\), the positive class prevalence. To that end, we’ll replace individual \(\hat{y_i}\) and \(1 - \hat{y_i}\) with their respective expectations, \(x\) and \(z\). Further, our assumed tradeoff curve constrains one by the other; let’s reframe the loss as a function of \(x\) (also dependent on \(\alpha\), \(\beta\), and \(p\).)
\[BCE(x) = -(\alpha \beta \log{x} + (1 - \beta)\frac{\log{(1 - x^p)}}{p})\]We’ll now differentiate wrt \(x\) to find where on the tradeoff curve our choice of \(\alpha\) and the reality of \(\beta\) and \(p\) have landed us:
\[BCE'(x) = -(\frac{\alpha \beta}{x \ln{2}} + \frac{1 - \beta}{p} \cdot \frac{1}{(1 - x^p) \ln{2}} \cdot (-p x ^ {p - 1})) =\] \[= \frac{(1 - \beta) x ^ {p - 1}}{(1 - x ^ p)\ln{2}} - \frac{\alpha \beta}{x \ln{2}}\] \[BCE'(x) = 0 \Leftrightarrow (1 - \beta)x ^ p = \alpha \beta (1 - x^p)\]And we finally get:
\[x = \sqrt[p]{\frac{\alpha \beta}{\alpha \beta - \beta + 1}}\]

We’ve been handed a binary classification problem characterized by \(\beta\) and \(p\). We optimized a classifier using weighted binary cross-entropy with weight \(\alpha\) for the positive, rare class. This lands us in a particular place on the tradeoff curve, which we just found (these are \(x\) and \(z\)).
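We can sanity-check this closed form numerically - a quick grid search over \(x\), with example values for \(\alpha\), \(\beta\), and \(p\) that I picked arbitrarily:

```python
import math

def bce(x, alpha, beta, p):
    # Expected weighted BCE along the tradeoff curve z = (1 - x^p)^(1/p).
    return -(alpha * beta * math.log2(x)
             + (1 - beta) * math.log2(1 - x ** p) / p)

def x_closed_form(alpha, beta, p):
    return (alpha * beta / (alpha * beta - beta + 1)) ** (1 / p)

alpha, beta, p = 5.0, 0.01, 4.0

# Brute-force minimize over a fine grid of x values in (0, 1).
xs = [i / 100000 for i in range(1, 100000)]
x_numeric = min(xs, key=lambda x: bce(x, alpha, beta, p))

print(x_numeric, x_closed_form(alpha, beta, p))
```

The grid minimum and the closed form agree (up to the grid’s resolution), which is reassuring.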
Next, we’ll want to see how our choice of \(\alpha\) trickles downstream to the \(F_1\) score, and use this description to find an optimal value for \(\alpha\).
We’re interested in calculating the expected \(F_1\) score resulting from our choice of \(\alpha\). Since \(F_1\) depends directly on precision and recall, we’ll calculate the expected value of those metrics.
Recall is easy - it is the fraction of positive instances we correctly detected as positive, and we can expect it to be \(x\) - the probability our classifier outputs 1 for a positive instance.
Precision is the fraction \(\frac{true \ positives}{true \ positives + false \ positives}\).
The expected true positives are the fraction of positives times the probability of detecting a positive as such: \(\beta x\).
The expected false positives are the negative instances which were misclassified: \((1 - \beta)(1 - z)\).
Putting all that in the \(F_1\) formula:
\[\mathbb{E}(F_1) = \frac{2}{\frac{\beta x + (1 - \beta)(1 - z)}{\beta x} + \frac{1}{x}} = \frac{2 \beta x}{\beta x + \beta z + 1 - z}\]

While \(\alpha\) does not explicitly appear here, it’s part of \(x\) and \(z\), which we know and which do appear here.
Great! So all that’s left is differentiating wrt \(\alpha\) and finding the maximum, right?
\[\tiny{\frac{2 \beta \left(\frac{\alpha \beta}{\alpha \beta - \beta + 1}\right)^{\frac{1}{p}} \left(\left(1 - \beta\right) \left(\left(\left(\frac{\alpha \beta}{\alpha \beta - \beta + 1}\right)^{\frac{1}{p}}\right)^{p} - 1\right)^{2} \left(\beta \left(\frac{\alpha \beta}{\alpha \beta - \beta + 1}\right)^{\frac{1}{p}} + \beta \left(1 - \left(\left(\frac{\alpha \beta}{\alpha \beta - \beta + 1}\right)^{\frac{1}{p}}\right)^{p}\right)^{\frac{1}{p}} - \left(1 - \left(\left(\frac{\alpha \beta}{\alpha \beta - \beta + 1}\right)^{\frac{1}{p}}\right)^{p}\right)^{\frac{1}{p}} + 1\right) + \left(\beta - 1\right) \left(\beta \left(\frac{\alpha \beta}{\alpha \beta - \beta + 1}\right)^{\frac{1}{p}} \left(\left(\left(\frac{\alpha \beta}{\alpha \beta - \beta + 1}\right)^{\frac{1}{p}}\right)^{p} - 1\right)^{2} - \beta \left(1 - \left(\left(\frac{\alpha \beta}{\alpha \beta - \beta + 1}\right)^{\frac{1}{p}}\right)^{p}\right)^{\frac{p + 1}{p}} \left(\left(\frac{\alpha \beta}{\alpha \beta - \beta + 1}\right)^{\frac{1}{p}}\right)^{p} + \left(1 - \left(\left(\frac{\alpha \beta}{\alpha \beta - \beta + 1}\right)^{\frac{1}{p}}\right)^{p}\right)^{\frac{p + 1}{p}} \left(\left(\frac{\alpha \beta}{\alpha \beta - \beta + 1}\right)^{\frac{1}{p}}\right)^{p}\right)\right)}{\alpha p \left(\left(\left(\frac{\alpha \beta}{\alpha \beta - \beta + 1}\right)^{\frac{1}{p}}\right)^{p} - 1\right)^{2} \left(\alpha \beta - \beta + 1\right) \left(\beta \left(\frac{\alpha \beta}{\alpha \beta - \beta + 1}\right)^{\frac{1}{p}} + \beta \left(1 - \left(\left(\frac{\alpha \beta}{\alpha \beta - \beta + 1}\right)^{\frac{1}{p}}\right)^{p}\right)^{\frac{1}{p}} - \left(1 - \left(\left(\frac{\alpha \beta}{\alpha \beta - \beta + 1}\right)^{\frac{1}{p}}\right)^{p}\right)^{\frac{1}{p}} + 1\right)^{2}}}\]

Uh, haha, never mind, let’s do that numerically.
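Numerically it’s straightforward: plug the closed-form \(x\) and the tradeoff curve’s \(z\) into the expected \(F_1\) expression, and evaluate for a range of weights (the \(\beta\) and \(p\) values here are example choices of mine):

```python
def expected_f1(alpha, beta, p):
    # x: where the weighted BCE lands us (the closed form derived above);
    # z then follows from the assumed tradeoff curve.
    x = (alpha * beta / (alpha * beta - beta + 1)) ** (1 / p)
    z = (1 - x ** p) ** (1 / p)
    return 2 * beta * x / (beta * x + beta * z + 1 - z)

# Example: 1% prevalence, a reasonably good tradeoff curve.
beta, p = 0.01, 8.0
for alpha in (1, 2, 10, 100):
    print(alpha, round(expected_f1(alpha, beta, p), 3))
```

Sweeping a fine grid of \(\alpha\) values this way produces the curves in the plot below.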
Here is a plot of the expected \(F_1\) as a function of \(\alpha\) for a range of values for \(p\).

The red curve corresponds to \(p \approx 2\), and shows pretty abysmal results; as \(p\) increases, we get better and better results.
The range of \(\alpha\) goes from 1 (which is equivalent to unweighted training) to about 250, with 100 being the “inverse proportion” weighting practice.
Most prominently, what we see is that class weight hardly improves over unweighted training, and the optimal weight is usually only a bit larger than the unweighted version, and far from the inverse proportion. In fact, from these plots, it would seem that using the inverse proportion weighting scheme is actually detrimental to training!
Very interesting indeed.
This is a good place to remember that we made a lot of assumptions to get here. Several things in particular may limit the scope of the conclusions:
I wanted to get a sense for the generalizability of the conclusions outside the sterile mathematical environment. I set up a rudimentary imbalanced classification pipeline with scikit-learn’s make_classification and DecisionTreeClassifier, and created an empirical version of the above plot, using class_sep as a proxy for the tradeoff curve.
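Here’s a rough sketch of such a pipeline (a reconstruction of the idea, not necessarily the exact setup behind the plot; scikit-learn assumed available):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def empirical_f1(pos_weight, class_sep=1.0, seed=0):
    # One simulation: imbalanced data, a weighted tree, F1 on held-out data.
    X, y = make_classification(n_samples=4000, weights=[0.99, 0.01],
                               class_sep=class_sep, random_state=seed)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                              random_state=seed)
    clf = DecisionTreeClassifier(class_weight={0: 1.0, 1: pos_weight},
                                 random_state=seed)
    clf.fit(X_tr, y_tr)
    return f1_score(y_te, clf.predict(X_te))

for w in (1.0, 10.0, 100.0):
    # Average over a few seeds to smooth out the noise a bit.
    scores = [empirical_f1(w, seed=s) for s in range(5)]
    print(w, round(float(np.mean(scores)), 3))
```

The real experiment averaged over far more repetitions and swept `class_sep` as well, which is why it took hours.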
A couple of things in that setup are different enough from my clean assumptions that I was very curious to see the results. I let my computer crunch for a few hours (running hundreds of thousands of simulations) and it produced the following plot:

Nice! It’s not identical to the theory-derived plot, but looks very similar, and in particular:
As for why the plot looks different for bigger values of \(\alpha\), my hunch is that the tradeoff curve isn’t symmetric, allowing the classifier to get a decent recall without sacrificing precision entirely.
While I certainly don’t know everything about class weighting now, I’ve come away from the analysis very satisfied: I know that class imbalance, in and of itself, does not warrant using class weights. Furthermore, if I deem class weights necessary, instead of using the typical “inverse proportion” scheme, my weights had better be informed by the particular problem characteristics: the nature of the tradeoff curve, label noise, and the cost I assign to each type of error.
After publishing the post it’s been pointed out to me that there are tutorials that specifically demonstrate how inverse proportion weighting (or stratified under- / oversampling, which is pretty equivalent) improves imbalanced classification performance. This piqued my interest and I looked at such a tutorial, and found something very interesting. To measure performance, the tutorial used the balanced accuracy score rather than \(F_1\).
\(F_1\) is the harmonic mean of positive precision and positive recall; balanced accuracy is the (regular) average of positive recall and negative recall. On the surface, the two metrics look similar enough: each is itself a combination of two metrics, corresponding to the two types of errors we can make.
But, as always, the details are important. I used the same optimization framework as before but looked at expected balanced accuracy (instead of expected \(F_1\)) as a function of class weight, and here’s what I got:

Look at that - completely different from the \(F_1\) behavior! Moreover, the optimal weight is indeed the inverse-proportion rule of thumb. This is splendid: the theoretical methodology is in accordance with results people get in the wild. Less selfishly, it really highlights the importance of the choice of metric in model tuning - different metrics respond very differently to our choice of hyperparameters.
Let’s dive into the difference between the two metrics, so we have an intuition for which one to choose. Specifically, we’ll look for a scenario where they are very different from each other.
Imagine we have 1000 samples - 10 positive, 990 negative. We classify all positives as positive, 40 negatives as positive. Positive recall is perfect (100%), negative recall is \(\frac{950}{990} \approx\) 96%, positive precision is \(\frac{10}{50} =\) 20%. Balanced accuracy would be very high, but \(F_1\) would be very low.
This is an example of the way different metrics induce different preferences over the two types of errors. There’s no absolute right or wrong here - it’s a matter of aligning your technical choice to the domain.
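The arithmetic in the scenario above is easy to verify; here's a quick sketch (plain JavaScript, using the confusion-matrix counts implied by the example) you can paste into a console:

```javascript
// Counts from the example: 10 positives, all caught (TP = 10, FN = 0);
// 40 of the 990 negatives misclassified as positive (FP = 40, TN = 950).
const tp = 10, fp = 40, fn = 0, tn = 950;

const positivePrecision = tp / (tp + fp);  // 10 / 50 = 0.2
const positiveRecall = tp / (tp + fn);     // 1.0
const negativeRecall = tn / (tn + fp);     // 950 / 990, about 0.96

const balancedAccuracy = (positiveRecall + negativeRecall) / 2;
const f1 = (2 * positivePrecision * positiveRecall) /
           (positivePrecision + positiveRecall);

console.log(balancedAccuracy.toFixed(3)); // "0.980" - very high
console.log(f1.toFixed(3));               // "0.333" - quite low
```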
To me, this reinforces the importance of considering the downstream use of the model, and consulting business stakeholders when tuning the model for hard prediction.
Code for the visualizations and simulation can be found here.
With practiced if resigned movement, I slide out of bed, eyes still closed. 08:30 must be the latest anybody’s ever complained about getting up at. Such a sleep hog.
No time to waste: a glass of water, a snack for the cat, kiss my love good day, and off I go. Oh, and wear something appropriate. For the office.
A short bike ride in the morning is great, basically. But I’m in a hurry - every misplaced dumpster and mistimed traffic light receives a shower of mental swears. I pass the same obstacles every day, thinking “maybe tomorrow I’ll have time to move them.” I never do.
I park my bicycle and run to catch the train. Did I mention I get up at 08:30? I could get up earlier, and maybe not run. Yes. I could.
But I won’t, because:
So I run.
40-ish minutes of letting the world slide by. Maybe it’s crowded and I just listen to music; or maybe not and I pull out my laptop, get acquainted with the latest OpenAI drama.
A brisk walk and I’m at the office. I celebrate another successful journey with customary rituals: the don’t-meet-eyes-at-the-elevator, awkward kitchen dances as morning routines clash, and the all-time favorite, make-smalltalk-but-don’t-get-too-involved.
If I’m lucky I left something half-finished to help ease myself into work, or Incubated(TM) to get some fresh ideas. Otherwise I have to pretend to work while procrastinating on finding something useful to do. I know it doesn’t really matter - in 5 minutes or an hour, someone will walk in, or something will come up, and I’ll find myself doing something else anyway.
An hour or two of flow, refreshing and productive. But then I need a break. Is it lunchtime yet? Not remotely. Another glass of water, then back to my desk. Now what?
There’s a lot of friction in the office. Small awkwardnesses that pile up, a ton of interruptions. Talking About Big Stuff With No Room For Nuance Cause Ain’t Nobody Got Time.
That’s when my Office Lethargy rears its head.
“You know what’s the next big thing, and you also know you don’t have it in you to start it today. Double dish of procrastination for you!”
“See that sun over the lovely view from this plush office tower? Your eyes will be its companion as it travels across the sky today.”
“Remember how you like to dance to blaring music in the living room? Yeah we don’t do that here. You have a desk, and a chair. You can stand if you wish.”
I tried to fight it at first. But it’s so much easier to give in to the lethargy, to embrace it.
I don’t have to be here. I’m a free spirit. Or just lazy and distracted. I don’t care. But I really don’t have to be here. Let me explain.
There was a startup. It sounded cool, innovative, impressive. Interviews, a take-home assignment. When they offered me a full-time job, I balked. I knew it was going there, but I’ve been freelancing from home since 2020, and my previous experience with the office and full-time employment was… mixed. Maybe it won’t be so bad? I still hesitate. “How about a trial period of three days a week, from the office, to test the waters?” “Sure!” What’s to think about? It’s the best offer I could get.
So I handed my resignation to one of my other projects, and, 30 days later, started coming to the office.
I was excited, and a little apprehensive. The first days were overwhelming - lots of interesting (and very nice) people, interesting tech, a deep stack. Plenty of good reasons to stay. But every day, at some point, the lethargy would inevitably kick in.
Like I said, I tried to fight it, at first. Go for a short walk, get a snack, talk to someone I haven’t met yet. But it’s not enough. There’s that claustrophobic buzz in my head, telling me I’m still in the office, confined to the walls and the rituals. So I submit, let it wash over me and sink in.
And over time, a realization sinks in, too:
This is not for me. Despite all the reasons to like it there, despite the graceful offer extended to me, my office lethargy makes it abundantly clear that all the beginnings I’m having are borrowed from somebody else’s story. All the tensions building up, Chekhov’s guns promising to fire - I don’t want to be there when they do. What a relief to allow myself to think that! I will not stay here. I almost cannot stay here.
Have I told them? No, not yet. But I will, any day now. Maybe next week.
For now, it’s me and my office lethargy.
And let me tell you, it’s a fascinating experience, being physically and mentally present but emotionally checked-out. I experience a lot of things differently, as an observer, letting them pass without looking for a reason to stress out over them.
Free from self-judgement, I can let go of judging others, too. And I feel I see them more accurately: not as antagonists, doing their best to foil my efforts; rather as earnest people, doing what they believe is best for themselves and the company.
A wave of compassion washes over me: how difficult it must be, dealing with all the tensions, juggling life and work, navigating decisions and anxieties. I wish there was another way.
Or maybe I’m projecting and patronizing, and they have a completely different experience? Someone steps by and starts smalltalk, and I snap back in.
I grind my teeth and push myself to be productive till the end of the day, for which I know I’ll pay a price. Lethargy never fails to collect.
Finally, the day is done. Say goodbye, pack up, and head towards home sweet home.
Trying hard to keep the words “rat race” out of my thoughts, I join the river of people, each heading to his or her own home sweet home. Many of them are hurried and impatient, heads buried in their phones. This time, at least, I don’t need to run.
A peculiar kind of mindfulness, which I attribute to the disengaging effect of the lethargy combined with the newfound liberation of finishing work for the day, allows me again to observe the people around me with curiosity and compassion. Hundreds of life lines brushing by, almost intertwining but not quite. What are their stories? Did they have a good day? What kinds of homes are they coming back to? Then someone particularly impatient squeezes uncomfortably close and the compassion is gone, I just want to get home.
And then, at last, home. Here the lethargy has no power - it is easily chased away by my feisty cat and a dose of good music. “See you tomorrow, I guess,” it must have the last word.
Yes.
But not for long.
Any day, now.
A few months ago I gave an “introduction to classical ML” workshop to a team of full-stack developers. The idea was to give a conceptual introduction, break down the very abstract “let’s design an algorithm that improves with more data”, and demonstrate how you’d approach this in practice.
The workshop was accompanied by an exercise in JS, “Heads or Tails”, which is what this post is going to be about.
In this exercise you’ll implement the brains behind the blockbuster video-game, “Heads or Tails”, which tests the player’s ability to randomly choose between heads and tails. The game guesses the player’s next choice, and should these choices exhibit any patterns, the game will use these patterns to gain an advantage over the player.
During the exercise, you’ll implement increasingly sophisticated algorithms for predicting the player’s next choice given their choice history, with the game visualizing the algorithms’ predictive power on your choices as a player. Ready to be unpredictable?
As previously mentioned, you’ll play two roles: of the developer implementing the learning algorithms, and of the player being tested for predictability.
In the provided html file, one “prediction algorithm” has been implemented, which just uses
Math.random to predict the player’s choice. It is therefore very poor as a prediction algorithm,
but it’ll assist you in understanding what’s going on and provide a baseline for comparison.
In the browser, start pressing “1” (for heads) and “2” (for tails) randomly, and you’ll see a line chart being extended as you play. The chart depicts the prediction algorithm’s score: it gains a point every time it predicts correctly, and loses a point every time it’s wrong.
In the code, look for the function predictRandom - this is the random prediction implementation.
Any function you write whose name starts with “predict” will be integrated into the game, and you’ll be able to see its predictive performance.
The functions you’ll implement take one parameter - history, which is an array with the player’s choice history (in the current game).
Choices are represented as strings - “H” for heads, “T” for tails. The function needs to return the prediction for the player’s next choice (again, “H” or “T”).
Add two simple prediction functions:
Refresh the page and, using the score visualization, make sure the functions behave as you would expect.
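For reference, here’s roughly what constant predictors could look like - the function names here are my own invention; any name starting with “predict” works:

```javascript
// Two hypothetical constant predictors - they ignore the history entirely
// and always make the same choice.
function predictAlwaysHeads(history) {
  return "H";
}

function predictAlwaysTails(history) {
  return "T";
}
```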
Before we move on to more sophisticated algorithms, let’s take a step back and look at what we’re trying to achieve.
Using the functions we added in section 01, it’s easy to see that when the player always makes the same choice (say, tails),
the respective prediction function is significantly better than the predictRandom strategy.
But, unless you have a strong preference for either choice, if you try to behave randomly these const-predicting functions won’t perform much better than random.
We’d like to formulate stronger prediction strategies, such that even when we try to confuse the predictor, it’ll pick up on our patterns (assuming they exist) and will be notably better-performing than the random strategy. That is, unless we manage to be truly random, which (spoiler alert) most humans aren’t capable of.
So: if you write an algorithm that achieves a similar score to the random predictor, that algorithm can’t find your patterns. If it achieves a significantly better score, that means a) you’re relatively predictable, and b) the algorithm managed to pick up on your patterns.
Another point to ponder: what would it take for an algorithm to achieve a significantly worse score than the random predictor?
Let’s assume the player has a strong preference for one of the choices. We can write a function that finds this preference and uses it to predict the next choice. Implement such a function.
For example: if this is the player’s choice history:
HTTHTTHTTHTTHHHTHT
We can see that the player picked heads 8 times, and tails 10 times. Therefore we’ll predict “tails”.
As the game progresses, the majority choice might change, and your function’s prediction will reflect that.
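One possible implementation sketch - the function name and the tie-breaking defaults are my own choices:

```javascript
// Predict the player's most frequent choice so far.
function predictMajority(history) {
  const heads = history.filter(c => c === "H").length;
  const tails = history.length - heads;
  // Tie (including an empty history): arbitrarily default to heads.
  return heads >= tails ? "H" : "T";
}

// For the history from the text (8 heads, 10 tails) this predicts "T".
predictMajority("HTTHTTHTTHTTHHHTHT".split("")); // "T"
```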
The previous function was very simplistic. Let’s write a more sophisticated version of it: suppose there’s some consistency in the player’s behavior. Say, the player makes long runs of heads, long runs of tails, and occasionally switches between them. Or, that they try to be “unpredictable” but instead end up just alternating between them, choosing “heads-tails-heads-tails”.
In such cases, even though there might not be a generally preferred choice, we might be able to exploit finer patterns.
Let’s look again at the choice history from the previous section: HTTHTTHTTHTTHHHTHT.
Instead of taking the majority of choices, we note that the player’s last choice is “tails”, and so we’ll look only at choices
made after tails. Here they are, highlighted:
HT**TH**T**TH**T**TH**T**TH**HHT**H**T.
After choosing tails, the player chose tails again 4 times, and heads 5 times. Therefore, in this case, we’ll predict “heads”.
Implement a prediction function that uses the algorithm described here to make predictions.
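Here’s a sketch of such a function - the defaults for short or never-seen-before histories are my own arbitrary choices:

```javascript
// Predict based on what usually follows the player's last choice.
function predictAfterLast(history) {
  if (history.length < 2) return "H"; // not enough data - arbitrary default
  const last = history[history.length - 1];
  let heads = 0, tails = 0;
  // Count what the player chose right after each earlier occurrence of `last`.
  for (let i = 0; i < history.length - 1; i++) {
    if (history[i] === last) {
      if (history[i + 1] === "H") heads++;
      else tails++;
    }
  }
  if (heads === 0 && tails === 0) return "H"; // `last` never appeared before
  return heads >= tails ? "H" : "T";
}

// For the history from the text, the last choice is tails; after earlier
// tails the player chose heads 5 times and tails 4 times, so: "H".
predictAfterLast("HTTHTTHTTHTTHHHTHT".split("")); // "H"
```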
Alright, now we’re in the grown-ups’ league. In this section I’ll guide you through implementing a (simple version of a) real machine-learning algorithm: decision trees.
The implementation requires a bit more work so you can treat it as a small project.
Ready? Here we go.
In the previous section we looked at the player’s last choice, hoping to exploit patterns related to it. Playing around yourself, you probably noticed there can be runs longer than 2, and we’d like to exploit those too. We could, in theory, extend the technique from the previous section: enumerate all the possibilities for, e.g., a 6-choice-long run, and take the majority for each case. But that’s an extremely specific prediction strategy - to get meaningful data for longer runs we’d need to play for a very long time, and we might be missing more obvious patterns. What can we do?
A decision tree is a classical machine-learning algorithm in which each prediction is reached through a sequence of questions: each answer leads us further down the tree’s nodes until we arrive at a prediction.
In our case, if we look at a history window of length 6, a decision tree might look like this:

What do I mean by “history window” of size 6? Suppose this is the player’s choice history: HTTHTTHTTHTTHHHTHT.
Then this is the last window of size 6: HTTHTTHTTHTT**HHHTHT**.
Using the player’s choice history, we can construct a training set of all the length-6 windows and the choice that came right after them, which we’ll use to look for patterns:
| Window | Choice after window |
|---|---|
| HTTHTT | H |
| TTHTTH | T |
| THTTHT | T |
| HTTHTT | H |
| TTHTTH | T |
| … | … |
| THHHTH | T |
In the first row (HTTHTT), “window @1” is the first choice in the window - H. “window @3” is T, “window @6” is T as well, and so on.
Back to decision trees. I’ll divide the implementation to two parts - representing a decision tree, and constructing the decision tree given data.
This part is pretty programmatic - decide how you want to represent a decision tree, without worrying about how you’d actually construct the decision tree, and implement this representation.
You could go all-in OOP and create a TreeNode class with pointers to 2 child nodes (which might be predictions or additional decision nodes); you could use simple data structures (arrays, objects) with appropriate functions; you could use functional programming; or whatever else you fancy.
Make sure you can easily construct a decision tree, and that you can use one to make predictions.
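To illustrate one of the options, here’s a minimal plain-object representation - the field names are my own invention, not part of the exercise:

```javascript
// A prediction (leaf) node: { predict: "H" } or { predict: "T" }.
// A decision node: { testIndex, ifH, ifT } - it looks at position
// `testIndex` of the window and descends into the matching subtree.
function treePredict(node, window) {
  if (node.predict !== undefined) return node.predict;
  const child = window[node.testIndex] === "H" ? node.ifH : node.ifT;
  return treePredict(child, window);
}

// A tiny hand-built tree: "predict a repeat of the window's last choice".
const echoTree = {
  testIndex: 5,
  ifH: { predict: "H" },
  ifT: { predict: "T" },
};
```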
Now we’re ready to tackle the next part: constructing a decision tree based on the player’s choice history. What we’re actually aiming to do is find a set of tests on the history window which maximally separate windows that were followed by “heads” from windows that were followed by “tails”.
Part 2.1: Creating a training set
As preparation for constructing the decision tree, we want to extract a training set (similar to the table shown above) from the choice history. Implement a function that takes the choice history and returns a sequence of training pairs: choice window, and choice following window.
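One possible sketch (the function name is mine; windows are returned as arrays of choices):

```javascript
// Slice the history into (window, choice-after-window) training pairs.
function makeTrainingSet(history, windowSize) {
  const pairs = [];
  // Every position that still has a "next choice" after the window yields a pair.
  for (let i = 0; i + windowSize < history.length; i++) {
    pairs.push({
      window: history.slice(i, i + windowSize),
      next: history[i + windowSize],
    });
  }
  return pairs;
}

// The 18-choice history from the text yields 12 windows of size 6,
// matching the table above.
makeTrainingSet("HTTHTTHTTHTTHHHTHT".split(""), 6);
```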
Part 2.2: Measuring homogeneity
Given a set of choices, we want to be able to measure how homogeneous - “pure” - it is. If all the choices are the same, that’s maximal homogeneity; if they’re distributed 50-50, that’s minimal homogeneity.
We will use this measure to evaluate potential decision tree structures, and pick tests that split a set of poorly-separated choices into two purer sets.
The measure we’ll define is called entropy (the one from information theory, not thermodynamics). It actually measures the opposite of homogeneity - heterogeneity, and goes like this:
Suppose in a given set of choices, \(h\) denotes the proportion of “heads” choices, and \(t\) the proportion of “tails” choices (\(h + t = 1\)). Then:
\[entropy(h, t) = -(h \cdot \log_2 h + t \cdot \log_2 t)\]Since \(h + t = 1\), we can also write:
\[entropy(h, t) = -(h \cdot \log_2 h + (1 - h) \cdot \log_2 (1 - h))\]The following chart visualizes the entropy for each \(h\) between 0 and 1:

Implement a function that computes the entropy of a given set of “heads” / “tails” choices. Make sure it agrees with the chart above.
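Here’s one way it could look, taking care of the \(0 \cdot \log_2 0\) edge case (conventionally taken to be 0):

```javascript
// Entropy of a set of "H"/"T" choices, in bits.
function entropy(choices) {
  const h = choices.filter(c => c === "H").length / choices.length;
  const t = 1 - h;
  // 0 * log2(0) is NaN in floating point; define it as 0.
  const term = p => (p === 0 ? 0 : p * Math.log2(p));
  return -(term(h) + term(t));
}

entropy(["H", "H", "H"]); // 0 - fully homogeneous
entropy(["H", "T"]);      // 1 - maximally heterogeneous (50-50)
```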
Part 2.3: Actually constructing the decision tree
OK, we’ve been through a lot up until now, but bear with me - this is the final push.
I’ll first describe the algorithm and then explain it.
We’ll use a pair of variables, X and y, to denote our training set: X contains a sequence of windows, and y contains a sequence of choices following the windows. X and y are corresponding, so for example the 11th element of y is a choice that came after the 11th element of X.
ConstructDecisionTree(X, y)
The algorithm might not be trivial, but it’s very elegant, and we’ll now break it down.
First off, you probably noticed that the algorithm is recursive. At every step, the algorithm tries to find the optimal test for splitting the training set.
The recursion’s stopping criterion is one of two:
Either way, once we’ve stopped, we return a prediction node that predicts the majority choice among the training examples that reached the current node.
If we haven’t stopped, that means we have a test we’d like to split by. In this case we’ll divide \(X\) and \(y\) according to the test, and construct two more decision trees from the corresponding halves with recursive calls. Then we compose those subtrees under a decision node whose test determines which subtree handles the current case.
How do we measure the quality of each candidate test? Remember the entropy measure we defined? The higher it is, the more “impure” our set is. So we want to find the test that decreases entropy as much as possible. After splitting, we calculate the entropy of each of the new sets, and take the weighted average of the entropies to compare against the value before splitting. The improvement is then the difference between the pre-split entropy and the weighted-average entropy after the split.
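In code, evaluating a candidate split could look like this sketch (this quantity is usually called information gain; entropy() is the same measure from part 2.2, repeated here so the snippet is self-contained):

```javascript
// Entropy of a set of "H"/"T" choices, as defined in part 2.2.
function entropy(choices) {
  if (choices.length === 0) return 0;
  const h = choices.filter(c => c === "H").length / choices.length;
  const term = p => (p === 0 ? 0 : p * Math.log2(p));
  return -(term(h) + term(1 - h));
}

// Information gain: pre-split entropy minus the size-weighted average
// of the entropies of the two sets produced by the split.
function informationGain(yAll, yLeft, yRight) {
  const weightedAfter =
    (yLeft.length / yAll.length) * entropy(yLeft) +
    (yRight.length / yAll.length) * entropy(yRight);
  return entropy(yAll) - weightedAfter;
}

// A perfect split of a 50-50 set gains a full bit:
informationGain(["H", "H", "T", "T"], ["H", "H"], ["T", "T"]); // 1
// A useless split gains nothing:
informationGain(["H", "T", "H", "T"], ["H", "T"], ["H", "T"]); // 0
```

The best test is simply the candidate with the highest information gain.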
And that’s it! These are all the pieces we need to construct a decision tree. Well done!
If you’d like more intuition, this link contains a wonderful, visual introduction to decision trees.
Hopefully you’ve managed to properly implement the decision tree predictor, and witnessed that it’s pretty good at learning your patterns. That’s cool!
At least for me the decision tree is a great predictor:

A nice way to convince ourselves that the computer isn’t somehow cheating is to feed properly random choices and make sure all strategies behave like predictRandom (use the browser’s devtools):
for (var i = 0; i < 200; i++) {
document.querySelectorAll('.coin')[Math.round(Math.random())].click();
}
My implementation of the exercise solutions is here, SPOILER ALERT.
If you enjoyed this and would like to go further, there are several directions:
Good luck!
Don’t get me wrong - I love what I do (most of the time), I’m grateful for the almost magical things technology enables me to do, and I’m generally excited and optimistic about the potential of technology to make the world a better place.
But. I’m always suspicious of new tech, take my sweet time migrating to new devices (my phone is 8 years old), stick to my habits for a long time before adopting the latest payment method or whatever gizmo is hot, and generally opt for as low-tech solutions as possible.
Sure, part of that can be attributed to my natural laziness and stubbornness. But I do think I have a few points in my defense (not that there have to be any - I can be grumpy if I want to):
Whew! That got heavy. If anything, writing this post makes me feel even more justified in my technophobia.
That said, if you’ll excuse me, I do think it’s finally time to get a new phone…
About 15 years ago I was just starting my professional coding career. I was junior, eager to learn, and had a lot of time and energy to tinker around.
At some point I stumbled upon TransparencyKey and TopMost, a pair of properties that allowed you to develop widget-style apps that always stayed on top but could be mostly transparent (to the mouse as well as the eye.) Combined with some Win32 API functionality, this could be used to do some really cool stuff. For about 2 years, 90% of my side projects would involve those properties. (Fixated? Me? No way)
I envisioned a whole suite of whimsical toys to liven up our dreary corporate workstations. Being junior, time after time I fell into the classic side-project pitfall: I’d get excited by some idea, get to a working proof of concept, and move on to the next idea thinking I’d come back later to finish it up. Which, you guessed it, never happened.
Until now! I got nostalgic and thought I’d at least dig up what projects I could find and write about them.
I have to say, I was positively surprised by how smooth it was to get 15-year-old code running. These projects were developed on Windows XP / Windows 7 using .NET 4 or earlier, and apart from some graphics glitches (probably stemming from me doing stuff inefficiently) they work without modification on Windows 11 after upgrading to .NET 8. Kudos for the backward compatibility, Microsoft.
And now, without further ado:
Always just a hotkey-press away from being visited by a creepy (but cute) purple creature. In newer versions of Windows, the shadows that windows cast are part of the window size, which the code doesn’t account for (hence the small gaps.)
Press a hotkey to send the active window cartwheeling. The glitches are due to the screen recording (it works smoothly otherwise, if slow on large windows.)
Why have all your files sit quietly in their folders if they could be floating around? My ambitious intent was to expand by animating all the contents of a folder flying from it when clicked, but I didn’t quite get there.
I was very proud of this one, which is basically low-key malware for trolling my colleagues. We had a power-user culture and used lots of shortcuts; sometimes when a colleague left their workstation I’d replace one of their shortcuts to point to this. It’d open notepad, blurt some insult, and spawn a pacman that would chase the mouse around. When the pacman was on the mouse you couldn’t click anything (because the pacman captured the mouse press event.)
When you tried to open task manager, it closed the process but rendered a “fake” task manager (which looks very out-of-place on Windows 11), only to close it dramatically a few seconds later (this part sometimes crashes on Windows 11).
When you tried to open the commandline shell, it closed it and was angry (jittering around.)
If you wanted to close it, you had to create a “C:\Users\public\Downloads\pwned.txt”, though some colleagues ended up restarting to get rid of it (I know, this wouldn’t fly today, but wasn’t so out-of-place at the time at that org.)
Sadly I lost the source to a couple of other projects, which I actually wrote later and were more mature.
Used flood-fill to materialize stuff on the screen (usually images with mostly transparent background or text.) Sounds a bit basic, but I think the overall effect was neat.
Written when mobile IM apps started to proliferate, everybody was using way too many emojis, and I wanted to easily include them in emails and documents (I know.)
On hotkey press, a 3x3 roster of emojis appeared, with left / right arrows switching between sets. You used the numpad to pick an emoji, it was copied to the clipboard and then the app would emit a “ctrl+v” so the emoji was pasted as an image to whatever you were focused on.
It actually worked quite well, quickly became intuitive to use, and was adopted by quite a lot of people in the org. I ended up adding all sorts of features such as custom emoji image paths and controlling the pasted emoji size.
If for some reason you’d like to browse the code, you can find it here.
The inspiration came from two places:
Each game has its own environment and challenges, and a specific end-goal. Participants compete to solve as many games as they can within the time limits of the competition.
I got a session for the competition at the recent PyCon IL, and it went really well! There were developers from diverse backgrounds, from fresh bootcamp graduates to experienced developers at tech companies.
You can watch the trailer and try the demo here.
If you’re interested in inviting me to run Codyssey at your place - feel free to reach out!