Here are the three lists separated by genre and ordered by your rating (highest to lowest).
Novels and short story collections
American novels (and non-novels): “4 3 2 1”, “American Pastoral”, “Stoner”, “Demon Copperhead”, “Patrimony”; re-discovery of David Mitchell; some classics.
Fewer than the last few years.
Other bits and pieces. “Apple in China” was extremely interesting.
Restricted to things published in 2024 - I got to see very few movies in a theater.
Movies I’d like to catch up on: The Substance, Anora, Juror #2, The Zone of Interest, Challengers, Hit Man, Civil War.
This was a great example of how chatGPT works nicely not only to write code, but also to guide you step-by-step in debugging your code, setting up multiple AWS services (a Lambda and an API Gateway), integrating them, fixing permissions, testing your solution end to end, etc. In my opinion this is one of the use cases where the tool really shines thanks to the possibility of interacting with it - it’s like a (stupid, but very knowledgeable) live support chat.
I used this to quickly format this post, copying straight from Goodreads - a menial task I would otherwise have done manually.
Here's the chat.

My prompt: “Hi. This is a (badly formatted) list of books with title, author, and other information. Could you format this list this way in markdown:”

ChatGPT's reply started with: “Certainly! Here is the reformatted list in Markdown with the author and title in bold:”
cover title author rating date read date added Down arrow
The City and the Stars
The City and the Stars
Clarke, Arthur C.
1 of 5 stars2 of 5 stars3 of 5 stars[ 4 of 5 stars ]5 of 5 stars
Jul 30, 2022 [edit]
Jul 19, 2022
edit
view »
Remove from my books
The October Country
The October Country
Bradbury, Ray
1 of 5 stars2 of 5 stars3 of 5 stars[ 4 of 5 stars ]5 of 5 stars
Jul 19, 2022 [edit]
Jul 17, 2022
edit
view »
Remove from my books
The Sundial
The Sundial
Jackson, Shirley
1 of 5 stars2 of 5 stars3 of 5 stars4 of 5 stars[ 5 of 5 stars ]
Jul 09, 2022 [edit]
Jul 09, 2022
edit
view »
Remove from my books
[...]
1. **The City and the Stars**, _Arthur C. Clarke_
1. **The October Country**, _Ray Bradbury_
1. **The Sundial**, _Shirley Jackson_
[...]
It was able to parse the crazily formatted list, extract the correct entities, and even rewrite the authors as “Name Surname” instead of “Surname, Name” (I didn’t request it explicitly, but it was a nice touch). To be fair, it started spouting nonsense (books that were not in my list 🙄) after item 16. So I split the input list in two, and it worked flawlessly.
This works quite nicely. Of course it’s only a skeleton of an article, and the related works are mostly bullshit (for the most part, they are not real articles).
Good for a quick survey on some datasets I was interested in.
One of the killer applications. The prose is a bit bland and sometimes uselessly verbose, but it’s generally more than acceptable, and often better than a non-native English speaker would do unless they put really honest effort into it.
Comics
Non-fiction
Restricted to things published in 2022 - very few movies.
A few others that are not worth including in a “best of” list - Sam Raimi’s Doctor Strange (quite fun), Secrets of Dumbledore (meh), Thor: Love and Thunder (🤮). Movies I’d like to catch up on... if I ever have the time: White Noise; The Northman; Triangle of Sadness; Crimes of the Future; Licorice Pizza; Nope; Everything Everywhere All at Once; The Fabelmans; Bones and All; Aftersun?; Vortex?; The Menu?; Top Gun?; The Pale Blue Eye?; Glass Onion?.
Obi-Wan Kenobi was terrible. A couple of others, such as the new seasons of The Boys and Stranger Things, were OK. Shows I might want to catch up on: Severance; Slow Horses?
Wordle uses a list of 2315 words that can be a valid solution (plus another 10657 valid input words). A simple idea is to just pick the input word that will maximally reduce the space of valid solutions, at least on average, across all possible targets.
As an example, consider a word like SMILE. Depending on the target words, the outcome can vary significantly:
SMILE seems to be a decent, but not spectacular, initial choice: from the full solution space (2315), you can expect to reduce it on average to around 114 valid words.
We can do better than that: applying the same logic, the word that maximally reduces the solution space on average across all possible target words is ROATE. To be honest, I’m not even sure what that word means. Picking ROATE as the first guess reduces the number of valid solutions to ~60, on average! On the other hand, with the worst-ranked word in the vocabulary, IMMIX (?), you wouldn’t cover much ground: you’d still be left, on average, with more than 1300 valid solutions.
Of course, the process doesn’t stop at the choice of the first input. We can iterate until we get to the end, using our 1-step lookahead policy:
Note: this can be done in such a naive way because the list of valid words is pretty small, otherwise it would blow up quite quickly.
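The 1-step lookahead scoring can be sketched in a few lines of Python (my reconstruction, not the code used for these results; `feedback` reproduces Wordle's hint rules, including repeated letters):

```python
from collections import Counter

def feedback(guess, target):
    """Wordle hint for a guess: 2 = green, 1 = yellow, 0 = grey."""
    hint = [0] * len(guess)
    unmatched = Counter()
    for i, (g, t) in enumerate(zip(guess, target)):
        if g == t:
            hint[i] = 2
        else:
            unmatched[t] += 1
    # second pass: yellows, consuming unmatched target letters
    for i, g in enumerate(guess):
        if hint[i] == 0 and unmatched[g] > 0:
            hint[i] = 1
            unmatched[g] -= 1
    return tuple(hint)

def expected_remaining(guess, solutions):
    """Average number of candidates left after playing `guess`,
    assuming the target is uniform over `solutions`."""
    buckets = Counter(feedback(guess, target) for target in solutions)
    # a target falling in a hint-bucket of size n leaves n candidates,
    # so the expectation is sum(n * n/N) = sum(n^2) / N
    return sum(n * n for n in buckets.values()) / len(solutions)

# toy vocabulary; the real solver scores all 2315 + 10657 admissible words
words = ["wound", "bound", "found", "mound", "pound", "smile"]
best = min(words, key=lambda w: expected_remaining(w, words))
```

Scoring the full vocabulary is then just this same `min` over all admissible words, repeated after each hint on the reduced solution set.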
From my simulation, it looks like we can always solve the puzzle in under 6 moves (leftmost plot). The average is slightly below 3.5, which is not bad. You win in ≤3 attempts more than 50% of the time if you use the entire vocabulary of admissible words (2315 + 10657).
If you only choose inputs from the vocabulary of possible solutions (2315), results are slightly worse (middle plot). Still, you never need 6 attempts, and only very rarely 5.
It’s fun to compare this strategy with a totally random choice of words from the valid solutions (rightmost plot): you’d still be able to win (at most 6 attempts) about 57% of the time. (Here I capped the simulation at 10 attempts.)

Things do get harder if you play in “hard mode”, that is, when all your input words must be consistent with the hints given so far. In this case, you sometimes need up to 8 attempts¹. Conversely, having more constraints greatly helps the “random” strategy: 98% of the time you’d win in 6 attempts or fewer, and the average number of attempts would be around 4.1.

An example of an unlucky scenario for the lookahead policy in “hard mode” (solutions only) is the word WOUND:
RAISE ➝ ⬛⬛⬛⬛⬛ The solution space is restricted to 168 valid solutions.
COULD ➝ ⬛🟩⬛🟩🟩 The solution space is restricted to 6 valid solutions.
BOUND ➝ ⬛🟩🟩🟩🟩 The solution space is restricted to 5 valid solutions.
FOUND ➝ ⬛🟩🟩🟩🟩 The solution space is restricted to 4 valid solutions.
HOUND ➝ ⬛🟩🟩🟩🟩 The solution space is restricted to 3 valid solutions.
MOUND ➝ ⬛🟩🟩🟩🟩 The solution space is restricted to 2 valid solutions.
POUND ➝ ⬛🟩🟩🟩🟩 The solution space is restricted to 1 valid solution.
WOUND ➝ 🟩🟩🟩🟩🟩

You could try slightly different criteria to rank the input words: I used the average (expected value) over the distribution of possible outcomes, but one could use the worst case (picking the input word that guarantees the largest reduction even in the unluckiest scenario) or even get fancier and include some measure of the variance of the outcomes.
How far is this from the optimal policy? Good question. A conceptually trivial extension would be a longer horizon, for example a 2-step lookahead, but it would be quite expensive from a computational point of view.
If, instead of the average over the solution space of all possible outcomes, you use the median, you get even better results. The best starting word, in this case, is REIST.

You can’t use words that you know are not valid, but would potentially give you a lot of useful information. ↩
In computing, procedural generation is a method of creating data algorithmically as opposed to manually, typically through a combination of human-generated assets and algorithms coupled with computer-generated randomness and processing power.
Procedural generation is common in computer graphics and game design to generate textures, terrain, or even for level/map design. Generative art is also all the rage now, thanks to all the NFT hype (I won’t comment further here..).
There are so many methods for procedural generation, and so many people doing great stuff online. As an example, a few days ago I stumbled upon this delightful infinitely-scrolling Chinese landscape generator (here):

While this Chinese landscape is generated using an ad-hoc algorithm, several popular and more general techniques are used in the procedural generation community, such as the so-called Wave Function Collapse (WFC) algorithm by ExUtumno. WFC generates images that are locally similar to an input bitmap. The algorithm splits the input image into small tiles (say, 2x2) and tries to infer simple constraints for each unique tile (for example, a tile with a road exiting on the right can only be adjacent to a tile with a road entering from the left). Then, it builds a new image by generating tiles (with randomness) and propagating constraints, so that a feasible solution is found with high probability.

I wanted to try something like that myself, but, being extremely lazy, I didn’t feel like implementing a constraint propagation algorithm from scratch (though that sounds fun!).
So I decided to try an extremely dumb technique that required almost zero effort on my part: why not formulate the problem as an Integer Program?
Let’s define the output image as a graph $(V,E)$ where each node is a tile and edges connect adjacent tiles.
Say we have 4 tile patterns: T = {sea, coast, land, mountain}. We set a desired output distribution (say, 40% of the cells must be sea, 20% must be land, etc..).
Then it’s enough to formulate a model as:

\[\begin{align} \min \quad & \sum_{t \in T} \left| \sum_{v \in V} x_{vt} - D_t \right| & \\ \text{s.t.} \quad & \sum_{t \in T} x_{vt} = 1 \qquad&\forall v \in V\\ & x_{vt} \in \{0,1\} \qquad&\forall v \in V, t \in T\\ \end{align}\]

where $x_{vt}$ is 1 if the pattern $t$ is assigned to the cell $v$, $D_t$ is the desired number of cells with pattern $t$, and we minimize the distance from the desired distribution. (The absolute value in the objective function can easily be removed by adding a few auxiliary variables and constraints.)
If there are two patterns $(t_1,t_2)$ that can’t be adjacent, such as sea and mountain, we add:
\[\begin{align} & x_{ut_1} + x_{vt_2} \leq 1 \qquad&\forall (u,v) \in E\\ \end{align}\]

And if we want a pattern $t_1$ to have at least one adjacent cell with pattern $t_2$ (say, coast must be next to sea):
\[\begin{align} & \sum_{u \in adj(v)} x_{ut_2} \geq x_{vt_1} \qquad&\forall v \in V\\ \end{align}\]

(Here's the Julia code to define and solve the model with JuMP and SCIP.)
using JuMP, SCIP, Random

I, J = 1:20, 1:20  # grid dimensions (any size works; larger grids solve slower)
V = [(i,j) for i in I for j in J]
adj = (i,j) -> [tuple(([i,j] + k)...) for k in [[-1,0],[+1,0],[0,-1],[0,1]] if all(([i,j] + k) .∈ (I,J))]
E = [(u,tuple(v...)) for u in V for v in adj(u...)]
T = [:land, :sea, :coast, :mountain];
N = length(V)
Desired = Dict(:land => round(0.4*N), :sea => round(0.4*N), :coast => round(0.1*N)) # mountain: what's left
Desired[:mountain] = N - sum(values(Desired))
model = Model(SCIP.Optimizer);
@variable(model, x[V,T], Bin);
@variable(model, η[T] >= 0)
@constraint(model, assignment[v in V], sum(x[v,t] for t in T) == 1);
@constraint(model, sealand[(u,v) in E], x[u,:land] + x[v,:sea] <= 1);
@constraint(model, seamountain[(u,v) in E], x[u,:mountain] + x[v,:sea] <= 1);
@constraint(model, coastmountain[(u,v) in E], x[u,:mountain] + x[v,:coast] <= 1);
@constraint(model, landcoast[v in V], sum(x[u,:land] for u in adj(v...)) >= x[v,:coast]);
@constraint(model, seacoast[v in V], sum(x[u,:sea] for u in adj(v...)) >= x[v,:coast]);
@constraint(model, mountainchain[v in V], sum(x[u,:mountain] for u in adj(v...)) >= x[v,:mountain]);
@constraint(model, major1[t in T], η[t] >= sum(x[v,t] for v in V) - Desired[t]);
@constraint(model, major2[t in T], η[t] >= - sum(x[v,t] for v in V) + Desired[t]);
@objective(model, Min, sum(η[t] for t in T));
Random.seed!(42)
# Fix one random cell per pattern to seed the layout
for t in T
    v = Random.rand(V)
    println("Fixing $v to $t")
    fix(x[v,t], 1.0, force=true)
end
optimize!(model)
println("Optimal value: ", objective_value(model))
xval = value.(x) .> 1 - 1e-5  # binary values ≈ 1
Color = Dict(:land => :green, :sea => :blue, :coast => :yellow, :mountain => :light_black);
for i in I
    for j in J
        for t in T
            if xval[(i,j),t]
                printstyled(" ", color=Color[t], reverse=true)
            end
        end
    end
    print("\n")
end
Solving to optimality for large-ish images is super slow - my model is as naive as it gets, and finding an optimal solution is really pointless anyway: we’d rather quickly generate multiple feasible solutions. Here’s a result for the simple example in the code above:

Not too bad: it does look like a big map with recognizable features such as lakes, mountain chains, and islands. This is not a recommended approach (it’s very dumb) – still, I suspect that with some tuning and clever modeling tricks one could 1) generate very interesting/complex patterns and 2) greatly reduce the solving time. After all, somebody already thought of ways to generate art with mathematical optimization and even wrote a book about it.
Started strong, but I didn’t keep it up in the second half of the year - I wasn’t in a reading mood. Roughly in order:
All in all, a bit of a disappointing year, no great discovery.
Comics
Non-fiction
I was back in a theater! But still watched very few new movies (Freaks Out, The Green Knight, È stata la mano di Dio). Mare of Easttown was a great TV show. Jonathan Strange and Mr. Norrell was a nice surprise. And a re-watch of The Office happened.
See UDA, MixMatch, FixMatch, and Meta Pseudo Labels from the last 2-3 years.
The underlying assumptions of most semi-supervised learning work are minimal: pretty much nothing is known about the unlabeled examples, other than their having the same distribution as the labeled data. But is that realistic? In the real world, we often know a lot about the problem at hand! That’s why we can (and should!) often CHEAT a little bit.
Surprisingly often, we can find “target-consistent” groups in our unlabeled data, where the label is consistent within the group, although unknown. Similar ideas have also been exploited in self-supervised learning approaches (e.g., Time-Contrastive Networks, Geography-Aware Self-supervised Learning).
A few practical examples are:

So here is a simple idea that I called “Domain-aware Semi-supervised learning” (DSSL). Let’s use domain knowledge to identify groups of unlabeled data that are target-consistent. Then, for each batch, we take $N$ labeled examples and $M$ target-consistent groups of unlabeled examples and compute a loss as:
\[\mathcal{L} = \mathcal{L}_s + \lambda_u\mathcal{L}_u\]

where we sum the standard supervised loss with a domain-aware unsupervised loss term (consistency term), computed for each group $G^j$ as in the following figure:

In summary, we want model predictions to be consistent across each group. To do so:
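The consistency term for a single group can be sketched as follows (a minimal NumPy version using hard pseudo-labels; `group_consistency_loss` is an illustrative name, and the exact loss in the paper may differ):

```python
import numpy as np

def group_consistency_loss(group_probs):
    """Consistency loss for one target-consistent group.

    group_probs: array of shape (M, C) with the model's predicted class
    probabilities for the M unlabeled examples of the group.
    """
    # pseudo-label: one-hot of the group's average prediction
    mean_pred = group_probs.mean(axis=0)
    pseudo = np.zeros_like(mean_pred)
    pseudo[mean_pred.argmax()] = 1.0
    # cross-entropy of every group member against the shared pseudo-label
    return -np.mean(np.sum(pseudo * np.log(group_probs + 1e-12), axis=1))

# a group whose members already agree incurs a much smaller loss
agreeing = np.array([[0.90, 0.10], [0.95, 0.05], [0.85, 0.15]])
disagreeing = np.array([[0.90, 0.10], [0.20, 0.80], [0.60, 0.40]])
```

Minimizing this term pushes every member of the group toward the group's consensus prediction, which is exactly the consistency we are after.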
Let’s see 3 real-world cases where I applied this simple technique to get a big performance improvement.
Consider the task of classifying the weather in road scenes. We have a few hundred labeled images and thousands of unlabeled short videos from the BDD100k dataset.
We know that weather doesn’t change instantaneously: frames from the same video are “target-consistent” (left: sunny; right: rainy).

DSSL can easily be applied here: we compute a pseudo-label as the average prediction of the current model over multiple frames of the same (unlabeled) video.
This works great compared to just training in a supervised way on the available data. I also compared DSSL with a semi-supervised baseline (essentially, FixMatch with no bells and whistles), showing that DSSL outperforms that, too.
A task that comes up surprisingly often in practical computer vision on road scenes is the segmentation of the ego-vehicle, in videos where the camera pose, vehicle type, etc. are all unknown.


For each vehicle, we do know that the camera position doesn’t change over time: images from the same device are “target-consistent”.
Having access to millions of unlabeled images from a private dataset, DSSL can easily be applied: a pseudo-label (or pseudo-mask) can be obtained by averaging the predicted segmentation masks over multiple images from the same camera. This gives, on average, more than a +5 IoU improvement “for free” over both supervised and semi-supervised learning baselines.
Finally, another practical task. Assume we have access to streams of IMU + GPS data from connected vehicles, and we want to know what kind of vehicle generated the data. For a subset of vehicles, we might know make + model. Interestingly, in this case it’s not even possible to manually label more data - unless you know somebody who can interpret IMU data!
But in this case too we know valuable information: a vehicle’s type doesn’t change (duh!). So data collected from the same device over time are “target-consistent”.

DSSL works well in this case too. Additional bonus: this is a somewhat less standard task - there’s no standard recipe for data augmentations on IMU+GPS, and most self- or semi-supervised methods are designed and tuned for image-related tasks. Indeed, I couldn’t get the FixMatch baseline to work at all for this problem, even after throwing at it all the IMU+GPS augmentations I had. DSSL, meanwhile, just worked out of the box.
I briefly showed here a simple yet effective method to exploit domain knowledge in a semi-supervised framework. Some more details can be found in the paper that I presented at an ICCV workshop this year.
The main takeaway should be: if you can cheat and exploit some knowledge of your problem or the process that generates your data, by all means do!
Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien. “Semi-Supervised Learning”. MIT Press, 2006 ↩
Quite a good run this year, also helped by the move from Anobii to Goodreads, which is just nicer and where people are more active. Roughly in order:
Loved the first 6/7 books, greatly enjoyed the first dozen. About the rest, I was really disappointed by Banks and Barth. And unfortunately I can’t really find anything new by Egan that I really love. His 90s short stories were 💣.
Comics:
Non-fiction but not work related:
Townscaper is a minimal city builder, as simple as it gets: you can only left-click to add an element, or right-click to delete one. Which element actually gets built (a road? a garden? a roof? a stairway?) depends entirely on the local configuration of the surrounding elements. In the end, you are really exploring the enormous combinatorial space of possible 3D configurations. Delightful.
Outer Wilds is awesome. You’re a tiny space explorer in a fragile, inadequate wooden spaceship, stuck in a 20 minute-time loop. You need to gather enough clues and to git good at moving around your galaxy so you can solve the riddle that can save you before it’s too late.
What works great: the feeling of mystery and the wonder at each discovery; the wonderful, mysterious, scary galaxy you need to explore; the space physics! Flying is so fun! Landing on a planet, seeing it fill the horizon, is incredible! Making progress is rewarding!
Unfortunately, it has its flaws. Man, watching the supernova go boom is awesome. Hearing the ominous 2-minute music: perfectly Pavlovian. But sometimes the gameplay feels a bit tedious. The countdown is unforgiving: repeatedly being interrupted during your explorations can get rather annoying. I know, that’s kind of the point – but still.
Persona 5 Royal. Guilty pleasure if ever there was one. But I admit I really enjoyed the whole Japanese-ness of it all. And catchy music. Man, I wasted a lot of time on this! But I’m getting too old for J-RPGs.
The Last of Us Part 2 is a tough one. Sometimes I think it’s a masterpiece, sometimes I’m much more cynical and dismiss it. Perhaps the truth is in the middle. The first one was much more coherent. The environments are wonderful, but for the majority of the game it really feels like déjà vu. The gameplay was hit or miss: the level design is top notch, but they could have trimmed a few missions, especially Ellie’s. Some of the plot verged on the nonsensical – it’s all so dreadful that it comes sooo close to tipping over into the ridiculous. And yet, for me it didn’t: on the contrary, it packed a big punch, surprisingly. At the very end I was emotionally drained: when it was my turn to hit, I left the controller sitting, hoping I wouldn’t have to… but choice is an illusion.
The Pathless. You wander an open world full of empty ruins and the remains of a violent past. Environmental enigmas and a bunch of big boss fights: impossible not to think of “Shadow of the Colossus”. But it has a personality of its own (and less of a challenge). Exploring and running/gliding around is extremely satisfying. Made with love and care.
Superliminal. A well-balanced puzzle game that exploits quirky mechanics that could only work in a game. Somewhere between “Portal” and “The Stanley Parable”.
God of War. Not my kind of game, but it’s well done.
I’m not sure I watched a single 2020 movie - certainly not in a theater. A few TV shows:
Same old: The Lowe Post and The Mismatch on NBA; Exponent for tech news commentary.
New entries (in Italian): Joypad on videogames; N by Nicolò Melli (Italian basketball player) on the NBA bubble.
Still, writing correct and efficient pandas code can be tricky - I see that a lot in colleagues who are new to the library. Here’s a short note I wrote with the main ideas to keep in mind in order to write fast pandas code:
A number of recipes can be found in the Cookbook on the official website. Other good pages in the documentation are those on indexing, grouping, text data (especially the magic .str accessor), and time series/dates. And for the interested reader I’d recommend Tom Augspurger’s great series of blog posts on “modern” pandas here – it’s 4 years old but still contains a lot of great stuff.
Also, I won’t comment here on common pitfalls around the correctness of pandas code. That would deserve its own post.
Calling apply row by row, with df.apply(f, axis=1), is no faster than looping over the rows - so it’s rather slow.
True, it’s highly flexible, because the argument passed to f is the entire row, and you can access multiple fields at the same time.
Conversely, df.apply(f) applies f column-wise to each column (it works with any function that can be applied to a column/vector), which is faster but clearly less flexible.
Example:
import numpy as np

def slow_apply(dat):
    return dat.apply(lambda row: np.sqrt(row["a"]), axis=1)

def fast_apply(dat):
    return dat["a"].apply(np.sqrt)

def faster_apply(dat):
    return np.sqrt(dat["a"])
> len(df)
14378
> %time slow_apply(df)
CPU times: user 484 ms, sys: 18.7 ms, total: 503 ms
Wall time: 503 ms
> %time fast_apply(df)
CPU times: user 994 µs, sys: 87 µs, total: 1.08 ms
Wall time: 532 µs
> %time faster_apply(df)
CPU times: user 653 µs, sys: 32 µs, total: 0.685 ms
Wall time: 386 µs
A 1000x difference, and even slightly better if we completely avoid the “apply”*.
(*A nice but tricky thing about pandas: applying a numpy function to a pandas object returns a pandas object. In this case, the return type of np.sqrt(col) is again a pd.Series, not an np.array! Personal preference: I’d rather take the (slightly) slower version and be more explicit with apply(np.sqrt) - all readers will understand that the output is going to be a pd.Series.)
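A quick illustration of this behavior:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0], index=["a", "b", "c"])
out = np.sqrt(s)

# the result is still a Series, not an np.ndarray,
# and the original index is preserved
print(type(out))
print(out.index)
```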
Iterating over the groups of a dataframe can be OK for quick data analysis or development; however, for heavier workloads (say, feature extraction on a big dataset, or something that needs to run in production), it can be extremely slow if not done right.
One should rather use groupby followed by .agg(), .transform(), or .apply(). In particular, transform is useful when you want to group and then broadcast the (same) result of a computation back to all elements of each group. Note, however, that transform functions are applied to single columns.
Example:
def slow(data):
    """For each object in a group, which might have multiple classes,
    assign the class with max total confidence. Example:

        index  id  class   confidence
        0      0   apple   0.90
        1      0   banana  0.35
        2      0   apple   0.99
        3      0   banana  0.43
        4      0   banana  0.30

    Here the object with id = 0 would be assigned the class 'apple'.
    """
    for i in data["id"].unique():
        idx = data["id"] == i
        weights = data.loc[idx, ["class", "confidence"]].groupby("class").sum()
        data.loc[idx, "class"] = weights.idxmax()[0]

def fast(data):
    data["weights"] = data.groupby(["id", "class"])["confidence"].transform('sum')
    data["class"] = data.groupby("id")["weights"].transform(lambda g: data.loc[g.idxmax(), "class"])
    del data["weights"]
> len(df)
14378
> %time slow(df)
CPU times: user 3.42 s, sys: 45.8 ms, total: 3.46 s
Wall time: 3.46 s
> %time fast(df)
CPU times: user 760 ms, sys: 18.6 ms, total: 779 ms
Wall time: 204 ms
We obtained quite a big improvement: a 20x speedup. There might be ways to speed things up even further.
The “fast” version relies on the fact that the index of each group in a groupby is the same as in the original df (each group is actually a view on that dataframe), so the index of the max element can be used to index the original dataframe.
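A tiny example of this index alignment (toy data):

```python
import pandas as pd

df = pd.DataFrame({
    "id":         [0, 0, 1, 1, 1],
    "confidence": [0.9, 0.4, 0.2, 0.8, 0.5],
})

for key, group in df.groupby("id"):
    # each group keeps the row labels of the original dataframe,
    # so idxmax() can be used to index back into df
    print(key, group.index.tolist(), group["confidence"].idxmax())
```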
Another example, perhaps even simpler:
def slow(data, step=12):
    """Compute the relative change over time of quantities a and b for each element in a group."""
    for group_id in data["id"].unique():
        pos = (data["id"] == group_id)
        data.loc[pos, "a_change"] = data.loc[pos, "a"].diff(step) / data.loc[pos, "a"].shift(step)
        data.loc[pos, "b_change"] = data.loc[pos, "b"].diff(step) / data.loc[pos, "b"].shift(step)

def fast(data, step=12):
    data[["a_change", "b_change"]] = data.groupby("id")[["a", "b"]].transform(
        lambda g: g.diff(step) / g.shift(step))
> len(df)
14378
> %time slow(df)
CPU times: user 2.11 s, sys: 6.13 ms, total: 2.12 s
Wall time: 2.13 s
> %time fast(df)
CPU times: user 1.03 s, sys: 15.3 ms, total: 1.05 s
Wall time: 1.04 s
Here the speedup is 2x, and the code is in my opinion also easier to read…
…but still pretty slow. It can be made even faster, when you find out that there is a df method, pct_change(), that does exactly what we’re trying to do here! \o/
def faster(data, step=12):
    data[["a_change", "b_change"]] = data.groupby("id")[["a","b"]].pct_change(periods=step)
> %time faster(df)
CPU times: user 9.43 ms, sys: 52 µs, total: 9.48 ms
Wall time: 10.9 ms
And this is much faster, 350x speedup.
Finally, when more flexibility is needed (e.g., for operations between columns), a general apply() can be used:
def slow(data):
    motions = []
    for _, group in data.groupby('id'):
        group = group.copy()
        group['going_left'] = group['right'].shift() > group['right']
        group['going_right'] = group['left'].shift() < group['left']
        group['is_shrinking'] = (group['right'] - group['left']) < (group['right'].shift() - group['left'].shift())
        motions.append(group[['id', 'timestamp', 'going_left', 'going_right', 'is_shrinking']])
    return pd.concat(motions, ignore_index=True)

def fast(data):
    return data.groupby('id')[["right", "left", "timestamp"]].apply(
        lambda g: pd.DataFrame(
            {
                "going_left": g["right"].shift() > g["right"],
                "going_right": g["left"].shift() < g["left"],
                "is_shrinking": (g["right"] - g["left"]) < (g["right"] - g["left"]).shift(),
                "timestamp": g["timestamp"],
            })
    ).reset_index(0)  # remove the id index added by groupby
> %time slow(df)
CPU times: user 1.53 s, sys: 58 µs, total: 1.53 s
Wall time: 1.53 s
> %time fast(df)
CPU times: user 945 ms, sys: 7.99 ms, total: 953 ms
Wall time: 949 ms
The benefit of using groupby().apply() is small in terms of efficiency, but still noticeable; and the code is in my opinion more terse and readable.
Still, not something you can call fast. For that, a bit more effort is required :-)