<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Sergiu's Substack - Deep Learning Beyond the Plateau]]></title><description><![CDATA[An interpretive blog on deep learning research, focusing on why models are designed the way they are and how structure shapes learning behavior.]]></description><link>https://sergiunistor.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!C6JB!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31a1f5ad-2f7c-4338-8c1c-4f424ab56af8_1271x1271.png</url><title>Sergiu&apos;s Substack - Deep Learning Beyond the Plateau</title><link>https://sergiunistor.substack.com</link></image><generator>Substack</generator><lastBuildDate>Tue, 14 Apr 2026 12:16:16 GMT</lastBuildDate><atom:link href="https://sergiunistor.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Sergiu Nistor]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[sergiunistor@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[sergiunistor@substack.com]]></itunes:email><itunes:name><![CDATA[Sergiu Nistor]]></itunes:name></itunes:owner><itunes:author><![CDATA[Sergiu Nistor]]></itunes:author><googleplay:owner><![CDATA[sergiunistor@substack.com]]></googleplay:owner><googleplay:email><![CDATA[sergiunistor@substack.com]]></googleplay:email><googleplay:author><![CDATA[Sergiu Nistor]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Recursive Reasoning with Tiny Networks]]></title><description><![CDATA[How a single 7M-parameter network, a reinterpreted latent space, and a clean ablation study dismantled 
everything HRM thought it needed.]]></description><link>https://sergiunistor.substack.com/p/recursive-reasoning-with-tiny-networks</link><guid isPermaLink="false">https://sergiunistor.substack.com/p/recursive-reasoning-with-tiny-networks</guid><dc:creator><![CDATA[Sergiu Nistor]]></dc:creator><pubDate>Tue, 17 Mar 2026 17:37:14 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/b826f553-7b60-4b45-8f9f-981335aa7f1c_1200x2372.png" length="0" type="image/png"/><content:encoded><![CDATA[<h1>The simplest recursive reasoner</h1><p>There is a pattern in research where a genuinely powerful idea gets buried under the weight of its own justification. Biological analogies, theoretical guarantees, multi-component architectures. All of it accumulated to explain something that turned out to be much simpler once someone actually tested each piece. </p><p><a href="https://arxiv.org/abs/2510.04871">Tiny Recursive Model (TRM)</a>, by Alexia Jolicoeur-Martineau, is an exercise in cutting all of that away and finding out what was load-bearing to begin with.</p><p>It starts as a critique of <a href="https://arxiv.org/abs/2506.21734">Hierarchical Reasoning Model (HRM)</a>. 
It ends as a replacement.</p><h1>What HRM got right, and where it got complicated</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TFxJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95e73879-52a0-4af0-ac6e-f3f28bf7743a_1600x722.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TFxJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95e73879-52a0-4af0-ac6e-f3f28bf7743a_1600x722.png 424w, https://substackcdn.com/image/fetch/$s_!TFxJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95e73879-52a0-4af0-ac6e-f3f28bf7743a_1600x722.png 848w, https://substackcdn.com/image/fetch/$s_!TFxJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95e73879-52a0-4af0-ac6e-f3f28bf7743a_1600x722.png 1272w, https://substackcdn.com/image/fetch/$s_!TFxJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95e73879-52a0-4af0-ac6e-f3f28bf7743a_1600x722.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TFxJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95e73879-52a0-4af0-ac6e-f3f28bf7743a_1600x722.png" width="1456" height="657" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/95e73879-52a0-4af0-ac6e-f3f28bf7743a_1600x722.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:657,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Hierarchical Reasoning Model: A Brain-Inspired AI Breakthrough&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Hierarchical Reasoning Model: A Brain-Inspired AI Breakthrough" title="Hierarchical Reasoning Model: A Brain-Inspired AI Breakthrough" srcset="https://substackcdn.com/image/fetch/$s_!TFxJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95e73879-52a0-4af0-ac6e-f3f28bf7743a_1600x722.png 424w, https://substackcdn.com/image/fetch/$s_!TFxJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95e73879-52a0-4af0-ac6e-f3f28bf7743a_1600x722.png 848w, https://substackcdn.com/image/fetch/$s_!TFxJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95e73879-52a0-4af0-ac6e-f3f28bf7743a_1600x722.png 1272w, https://substackcdn.com/image/fetch/$s_!TFxJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95e73879-52a0-4af0-ac6e-f3f28bf7743a_1600x722.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">The original brain-inspired, 2-stage <a href="https://arxiv.org/abs/2506.21734">HRM</a> architecture.</figcaption></figure></div><p>HRM earned its attention. With 27 million parameters and around 1000 training samples, it solved reasoning tasks that stump frontier language models: extremely hard Sudoku, optimal pathfinding through large mazes, and the abstract pattern recognition challenge of <a href="https://arcprize.org/arc-agi">ARC-AGI</a>.</p><p>The core mechanism was a loop. 
Two networks, alternating through a shared latent space while a training scheme called &#8220;deep supervision&#8221; pushed intermediate representations toward the correct answer at each step.</p><p>To justify the two-network design, the authors drew on neuroscience: a fast low-level module for local computation mirroring rapid cortical processing, a slow high-level module for global planning mirroring deliberate cognition.</p><p>Neither assumption was strictly necessary. That is what TRM demonstrates.</p><h1>One network is enough</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kwWj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd00e3cc2-2116-4bbc-bb90-1f3587ad8cfb_500x737.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kwWj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd00e3cc2-2116-4bbc-bb90-1f3587ad8cfb_500x737.png 424w, https://substackcdn.com/image/fetch/$s_!kwWj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd00e3cc2-2116-4bbc-bb90-1f3587ad8cfb_500x737.png 848w, https://substackcdn.com/image/fetch/$s_!kwWj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd00e3cc2-2116-4bbc-bb90-1f3587ad8cfb_500x737.png 1272w, https://substackcdn.com/image/fetch/$s_!kwWj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd00e3cc2-2116-4bbc-bb90-1f3587ad8cfb_500x737.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!kwWj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd00e3cc2-2116-4bbc-bb90-1f3587ad8cfb_500x737.png" width="390" height="574.86" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d00e3cc2-2116-4bbc-bb90-1f3587ad8cfb_500x737.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:737,&quot;width&quot;:500,&quot;resizeWidth&quot;:390,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Less is More: Recursive Reasoning with Tiny Networks (Paper Review) |  Towards AI&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Less is More: Recursive Reasoning with Tiny Networks (Paper Review) |  Towards AI" title="Less is More: Recursive Reasoning with Tiny Networks (Paper Review) |  Towards AI" srcset="https://substackcdn.com/image/fetch/$s_!kwWj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd00e3cc2-2116-4bbc-bb90-1f3587ad8cfb_500x737.png 424w, https://substackcdn.com/image/fetch/$s_!kwWj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd00e3cc2-2116-4bbc-bb90-1f3587ad8cfb_500x737.png 848w, https://substackcdn.com/image/fetch/$s_!kwWj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd00e3cc2-2116-4bbc-bb90-1f3587ad8cfb_500x737.png 1272w, 
https://substackcdn.com/image/fetch/$s_!kwWj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd00e3cc2-2116-4bbc-bb90-1f3587ad8cfb_500x737.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The TRM architecture, as per the <a href="https://arxiv.org/pdf/2510.04871">original paper</a>.</figcaption></figure></div><p>HRM used two separate networks, L and H, justified through biological analogy about different processing hierarchies in the brain.</p><p>Replacing both with a single shared network cuts the parameter count in half and 
still improves accuracy over the two-network version. Shared weights push the network toward learning a more general transformation rather than two partially overlapping specializations.</p><h1>Fewer layers, more recursions</h1><p>Adding layers made things worse. With only around 1000 training examples, additional parameters overfit before they can do anything useful.</p><p>The fix is to go in the opposite direction: drop to 2 layers and scale the number of recursion steps to compensate. Depth from iterating a small network generalizes better than depth from stacking a tall one. With so little data, stacked layers add parameters that overfit, while recursion adds effective depth without adding any, and the ablation gap between the two approaches is large and consistent.</p><h1>Match the architecture to the problem</h1><p>For tasks with a small, fixed context length like <a href="https://ieee-dataport.org/documents/sudoku-extreme">Sudoku Extreme</a> on a 9x9 grid, self-attention is doing more work than the problem requires. A plain MLP applied over the sequence dimension fits the structure of the problem naturally and generalizes better in that setting.</p><p>For larger inputs like the 30x30 mazes in <a href="https://huggingface.co/datasets/sapientinc/maze-30x30-hard-1k">Maze-Hard</a> and the grids in <a href="https://arcprize.org/arc-agi">ARC-AGI</a>, self-attention recovers its advantage. 
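</p><p>To make &#8220;a plain MLP applied over the sequence dimension&#8221; concrete, here is a minimal sketch of one token-mixing step; the function name and shapes are illustrative, not taken from the paper:</p><pre><code class="language-python">def mix_tokens(x, w):
    """Mix information across positions with a learned seq_len x seq_len map.

    x: list of seq_len token vectors, each of length dim
    w: seq_len x seq_len mixing weights (the MLP acting over positions)
    """
    seq_len, dim = len(x), len(x[0])
    out = [[0.0] * dim for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(seq_len):
            for d in range(dim):
                out[i][d] += w[i][j] * x[j][d]
    return out
</code></pre><p>For an 81-cell Sudoku grid the mixing matrix is 81x81; for a 30x30 maze it would already be 900x900, which is why this substitution only pays off at small, fixed input sizes.</p><p>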
The MLP&#8217;s capacity becomes excessive relative to the problem geometry at that scale, and the inductive biases of attention prove useful.</p><h1>Using EMA for training stability</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8U6w!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7b43dcb-065c-47bf-a2df-b81dcce2e731_451x403.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8U6w!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7b43dcb-065c-47bf-a2df-b81dcce2e731_451x403.png 424w, https://substackcdn.com/image/fetch/$s_!8U6w!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7b43dcb-065c-47bf-a2df-b81dcce2e731_451x403.png 848w, https://substackcdn.com/image/fetch/$s_!8U6w!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7b43dcb-065c-47bf-a2df-b81dcce2e731_451x403.png 1272w, https://substackcdn.com/image/fetch/$s_!8U6w!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7b43dcb-065c-47bf-a2df-b81dcce2e731_451x403.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8U6w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7b43dcb-065c-47bf-a2df-b81dcce2e731_451x403.png" width="451" height="403" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b7b43dcb-065c-47bf-a2df-b81dcce2e731_451x403.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:403,&quot;width&quot;:451,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:101184,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://sergiunistor.substack.com/i/191251729?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7b43dcb-065c-47bf-a2df-b81dcce2e731_451x403.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8U6w!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7b43dcb-065c-47bf-a2df-b81dcce2e731_451x403.png 424w, https://substackcdn.com/image/fetch/$s_!8U6w!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7b43dcb-065c-47bf-a2df-b81dcce2e731_451x403.png 848w, https://substackcdn.com/image/fetch/$s_!8U6w!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7b43dcb-065c-47bf-a2df-b81dcce2e731_451x403.png 1272w, https://substackcdn.com/image/fetch/$s_!8U6w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7b43dcb-065c-47bf-a2df-b81dcce2e731_451x403.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Pseudocode for the TRM implementation, as per the <a href="https://arxiv.org/pdf/2510.04871">original paper</a>.</figcaption></figure></div><p>Recursive models trained on tiny datasets are unstable. Sharp overfitting followed by divergence is the typical failure mode.</p><p>Exponential Moving Average of the weights, a stabilization technique established in <a href="https://arxiv.org/abs/1806.04498">GAN training</a>, prevents the worst collapses and raises the generalization ceiling consistently. 
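</p><p>The mechanism itself is one line per parameter: keep a smoothed copy of the weights and evaluate with that copy. A minimal sketch, with an illustrative decay value rather than the paper&#8217;s setting:</p><pre><code class="language-python">def ema_update(ema_weights, weights, decay=0.999):
    """Blend the running average toward the current weights after each step."""
    return [decay * e + (1.0 - decay) * w
            for e, w in zip(ema_weights, weights)]
</code></pre><p>Training updates the raw weights as usual; the EMA copy trails behind and smooths out the sharp swings that recursive training produces.</p><p>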
The gap between training with and without EMA is large enough that it is not an optional detail.</p><h1>Results</h1><p>TRM outperforms HRM across all four benchmarks: <a href="https://ieee-dataport.org/documents/sudoku-extreme">Sudoku Extreme</a>, <a href="https://huggingface.co/datasets/sapientinc/maze-30x30-hard-1k">Maze-Hard</a>, <a href="https://arcprize.org/arc-agi/1/">ARC-AGI-1</a>, and <a href="https://arcprize.org/arc-agi/2/">ARC-AGI-2</a>, while using a fraction of the parameters.</p><p>The LLM comparisons deserve some qualification. The language models it is compared against are evaluated zero-shot, while TRM is trained specifically on these tasks.</p><p>The tasks also test structurally different capabilities: recursive latent refinement over a bounded grid is not the same problem as autoregressive generation over an open vocabulary.</p><p>The puzzle accuracy is real. The cross-paradigm comparison needs to be read carefully.</p><h1>What this leaves open</h1><p>TRM works as a rigorous ablation study that happened to produce a better architecture. Each simplification was tested individually, and each one helped. The result is a system that achieves more with fewer parameters, fewer layers, fewer theoretical assumptions, and a cleaner conceptual framing than what it replaced.</p><p>It is striking that, in this setting, recursion over a small network generalizes better than an equivalently deep feedforward stack. There is something about iterating a small network on its own outputs, keeping an explicit answer state and a separate reasoning state, and using deep supervision to make each iteration accountable to the final answer, that outperforms simply making the network deeper.</p><p>From an inference systems perspective, there is also an observation the paper does not develop but which follows naturally from the architecture. A 2-layer network run many times has a very different execution profile from a deep network run once, even at a similar compute budget. 
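</p><p>To see why, it helps to write the execution shape down. The following is a schematic of the recursion described above, with a stand-in callable for the tiny network and hypothetical step counts; it is not the paper&#8217;s implementation:</p><pre><code class="language-python">def recursive_refine(f, x, y, z, n_latent=6, n_cycles=3):
    """Iterate one tiny network f over an answer state y and a latent state z."""
    for _ in range(n_cycles):
        for _ in range(n_latent):
            z = f(x, y, z)    # refine the latent reasoning state
        y = f(None, y, z)     # rewrite the answer from the refined latent
    return y, z
</code></pre><p>Every cycle ends with a readable answer state, which is exactly what a single deep forward pass never exposes.</p><p>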
The recursive structure makes early exit, adaptive computation, and interpretability through the exposed answer state at each step all straightforward extensions. None of that is explored here, but it is a natural next direction.</p><p>The code is open-source at <a href="https://github.com/SamsungSAILMontreal/TinyRecursiveModels">github.com/SamsungSAILMontreal/TinyRecursiveModels</a>.</p>]]></content:encoded></item><item><title><![CDATA[Hierarchical Reasoning Model]]></title><description><![CDATA[How a 27M-parameter brain-inspired architecture beat the world on ARC-AGI.]]></description><link>https://sergiunistor.substack.com/p/hierarchical-reasoning-model</link><guid isPermaLink="false">https://sergiunistor.substack.com/p/hierarchical-reasoning-model</guid><dc:creator><![CDATA[Sergiu Nistor]]></dc:creator><pubDate>Sun, 15 Mar 2026 19:20:55 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!TFxJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95e73879-52a0-4af0-ac6e-f3f28bf7743a_1600x722.png" length="0" type="image/png"/><content:encoded><![CDATA[<h1>Introduction</h1><p>There is a quiet but radical assumption baked into modern large language models: that thinking happens in language. Ask a frontier model to solve a hard problem, and it will reason by writing, producing chains of tokens that trace the path from question to answer. This works remarkably well. It is also, arguably, completely backwards.</p><p>A growing body of neuroscience suggests that human abstract thought does not primarily occur in language. The brain solves problems first, and translates solutions into words afterward. Language, in this view, is a communication interface, not a cognitive engine. 
This distinction has a concrete architectural consequence: if reasoning doesn&#8217;t require tokens, maybe the reasoners we&#8217;ve been building are simply inefficient.</p><p>Developed by <a href="https://sapient.inc/">Sapient Intelligence</a>, the <a href="https://arxiv.org/pdf/2506.21734">Hierarchical Reasoning Model (HRM)</a> is a bet on exactly this idea. Released in August 2025, it achieved state-of-the-art results on both ARC-AGI-1 and ARC-AGI-2 (widely regarded as one of the hardest tests of abstract generalization in AI) with just <strong>27 million parameters</strong>, trained on only <strong>1,000 samples</strong>, and <strong>no pretraining</strong> on large corpora. The frontier LLMs it outperformed on these benchmarks are orders of magnitude larger.</p><p>This post breaks down what HRM is, where it comes from intellectually, and why its design choices might represent something more than a clever trick.</p><h1>Reasoning without tokens</h1><p>Chain-of-Thought prompting works by making the model externalize its reasoning as text. The intuition is solid: intermediate scaffolding helps. But it&#8217;s slow, expensive, and the token space (which is discrete, surface-level, linguistic) may just be the wrong medium for certain kinds of abstract computation.</p><p>HRM&#8217;s core idea is that reasoning can happen entirely in the latent state space of a recurrent network, never surfacing as language. The model doesn&#8217;t narrate its thinking. It performs its thinking internally, across multiple recurrent steps, and only produces output once that process has converged. Critically, none of those intermediate steps are supervised: the model only receives feedback on the final answer, not on how it got there.</p><p>The intellectual backing for this comes from the idea that <a href="https://www.nature.com/articles/s41586-024-07522-w">language is primarily a tool for communication rather than thought</a>. 
Non-human primates solve complex problems without it. Patients with severe aphasia retain intact reasoning. Mathematical and musical cognition run on entirely separate neural systems. If thinking doesn&#8217;t require language in the brain, a model forced to write out its reasoning has an architectural bottleneck the brain simply doesn&#8217;t have.</p><h1>The architecture</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TFxJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95e73879-52a0-4af0-ac6e-f3f28bf7743a_1600x722.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TFxJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95e73879-52a0-4af0-ac6e-f3f28bf7743a_1600x722.png 424w, https://substackcdn.com/image/fetch/$s_!TFxJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95e73879-52a0-4af0-ac6e-f3f28bf7743a_1600x722.png 848w, https://substackcdn.com/image/fetch/$s_!TFxJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95e73879-52a0-4af0-ac6e-f3f28bf7743a_1600x722.png 1272w, https://substackcdn.com/image/fetch/$s_!TFxJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95e73879-52a0-4af0-ac6e-f3f28bf7743a_1600x722.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TFxJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95e73879-52a0-4af0-ac6e-f3f28bf7743a_1600x722.png" width="1456" height="657" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/95e73879-52a0-4af0-ac6e-f3f28bf7743a_1600x722.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:657,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Hierarchical Reasoning Model: A Brain-Inspired AI Breakthrough&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Hierarchical Reasoning Model: A Brain-Inspired AI Breakthrough" title="Hierarchical Reasoning Model: A Brain-Inspired AI Breakthrough" srcset="https://substackcdn.com/image/fetch/$s_!TFxJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95e73879-52a0-4af0-ac6e-f3f28bf7743a_1600x722.png 424w, https://substackcdn.com/image/fetch/$s_!TFxJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95e73879-52a0-4af0-ac6e-f3f28bf7743a_1600x722.png 848w, https://substackcdn.com/image/fetch/$s_!TFxJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95e73879-52a0-4af0-ac6e-f3f28bf7743a_1600x722.png 1272w, https://substackcdn.com/image/fetch/$s_!TFxJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95e73879-52a0-4af0-ac6e-f3f28bf7743a_1600x722.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">HRM&#8217;s brain-inspired, hierarchical architecture.</figcaption></figure></div><p>HRM is a recurrent network with two modules, inspired directly by <a href="https://dn790002.ca.archive.org/0/items/DanielKahnemanThinkingFastAndSlow/Daniel%20Kahneman-Thinking%2C%20Fast%20and%20Slow%20%20.pdf">Kahneman&#8217;s System 1 / System 2 framework</a> and by three observations from neuroscience:</p><ul><li><p>Different cortical regions of the brain operate at different speeds, and this layered organization is what enables deep, multi-step reasoning (<a href="https://www.nature.com/articles/s41467-023-37613-7">Zeraati et al.</a>, <a href="https://www.nature.com/articles/nn.3862">Murray et al.</a>, <a href="https://pubmed.ncbi.nlm.nih.gov/29203085/">Huntenburg et al.</a>)</p></li><li><p>Recurrent feedback loops continuously 
refine the network&#8217;s internal representations, with slow higher-level areas guiding the fast lower-level ones, without losing global coherence (<a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC3777738/">Bastos et al.</a>, <a href="https://www.sciencedirect.com/science/article/abs/pii/S016622360001657X">Lamme et al.</a>, <a href="https://openreview.net/forum?id=xavWvnJTST">Kaleb et al.</a>)</p></li><li><p>The brain achieves this computational depth without the credit-assignment problem that makes training deep recurrent networks so expensive (<a href="https://www.sciencedirect.com/science/article/pii/S0959438818302009">Lillicrap et al.</a>)</p></li></ul><p>This maps directly onto two coupled recurrent modules:</p><ul><li><p><strong>High-level (H) module</strong> - slow, abstract, responsible for planning and goal maintenance</p></li><li><p><strong>Low-level (L) module</strong> - fast, detailed, executing rapid computations guided by H</p></li></ul><p>The L module runs multiple steps until it reaches a local equilibrium. Only then does H advance, incorporating L&#8217;s converged state to update its higher-level representation. L resets and begins a new phase. The slow areas wait for the fast ones, which is also, broadly, what the brain does.</p><h3>Solving the RNN collapse problem</h3><p>Anyone who&#8217;s worked with recurrent networks knows the failure mode: they collapse early in training.</p><p>HRM introduces the concept of <strong>hierarchical convergence</strong> to fix this. H is only allowed to update <em>after</em> L has completed its full cycle and settled. 
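</p><p>In code, the schedule is just a pair of nested loops. A minimal sketch of the two-timescale recursion, with the dimensions, update rules, and step counts chosen for illustration rather than taken from the paper:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 64

# Random fixed weights stand in for trained parameters (illustrative only).
W_l = rng.normal(scale=0.1, size=(2 * DIM, DIM))  # L sees (input, z_H)
W_h = rng.normal(scale=0.1, size=(2 * DIM, DIM))  # H sees (z_H, converged z_L)

def l_step(x, z_l, z_h):
    # Fast module: refine the low-level state while H stays frozen.
    return np.tanh(np.concatenate([x, z_h]) @ W_l + z_l)

def h_step(z_h, z_l):
    # Slow module: advance only after L has settled.
    return np.tanh(np.concatenate([z_h, z_l]) @ W_h)

def reason(x, h_steps=2, l_steps=4):
    z_l, z_h = np.zeros(DIM), np.zeros(DIM)
    for _ in range(h_steps):
        for _ in range(l_steps):   # L iterates toward a local equilibrium
            z_l = l_step(x, z_l, z_h)
        z_h = h_step(z_h, z_l)     # H absorbs L's converged state
        z_l = np.zeros(DIM)        # L resets and begins a new phase
    return z_h

z_h = reason(rng.normal(size=DIM))
```

<p>Hierarchical convergence is visible directly in this sketch: the inner <code>l_steps</code> loop must finish before each single H update.</p><p>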
This forces the network to actually use its recurrent structure and not short-circuit.</p><p>It&#8217;s both the paper&#8217;s most important technical contribution and its most elegant one.</p><h1>The results</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-DtP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff954000d-0f1b-4b2c-9234-7823eb77e8e8_1400x567.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-DtP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff954000d-0f1b-4b2c-9234-7823eb77e8e8_1400x567.png 424w, https://substackcdn.com/image/fetch/$s_!-DtP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff954000d-0f1b-4b2c-9234-7823eb77e8e8_1400x567.png 848w, https://substackcdn.com/image/fetch/$s_!-DtP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff954000d-0f1b-4b2c-9234-7823eb77e8e8_1400x567.png 1272w, https://substackcdn.com/image/fetch/$s_!-DtP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff954000d-0f1b-4b2c-9234-7823eb77e8e8_1400x567.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-DtP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff954000d-0f1b-4b2c-9234-7823eb77e8e8_1400x567.png" width="1400" height="567" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f954000d-0f1b-4b2c-9234-7823eb77e8e8_1400x567.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:567,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Killing LMMs with a Tiny 27M model: Hierarchical Reasoning Model (HRM)  Briefly Explained | by Allen Liang | Aug, 2025 | Medium | Medium&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Killing LMMs with a Tiny 27M model: Hierarchical Reasoning Model (HRM)  Briefly Explained | by Allen Liang | Aug, 2025 | Medium | Medium" title="Killing LMMs with a Tiny 27M model: Hierarchical Reasoning Model (HRM)  Briefly Explained | by Allen Liang | Aug, 2025 | Medium | Medium" srcset="https://substackcdn.com/image/fetch/$s_!-DtP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff954000d-0f1b-4b2c-9234-7823eb77e8e8_1400x567.png 424w, https://substackcdn.com/image/fetch/$s_!-DtP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff954000d-0f1b-4b2c-9234-7823eb77e8e8_1400x567.png 848w, https://substackcdn.com/image/fetch/$s_!-DtP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff954000d-0f1b-4b2c-9234-7823eb77e8e8_1400x567.png 1272w, https://substackcdn.com/image/fetch/$s_!-DtP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff954000d-0f1b-4b2c-9234-7823eb77e8e8_1400x567.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a><figcaption class="image-caption">HRM performance on <a href="https://arcprize.org/arc-agi/1/">ARC-AGI-1</a>, <a href="https://arcprize.org/arc-agi/2/">ARC-AGI-2</a>, <a href="https://ieee-dataport.org/documents/sudoku-extreme">Sudoku-Extreme</a> and <a href="https://huggingface.co/datasets/sapientinc/maze-30x30-hard-1k">Maze-Hard</a>.</figcaption></figure></div><p><a href="https://arcprize.org/arc-agi">ARC-AGI</a> was designed by Fran&#231;ois Chollet specifically to resist memorization. The tasks require genuine abstract generalization across novel visual grids.
ARC-AGI-2 is harder still, designed to resist the pattern-matching tricks that helped models game the first benchmark.</p><p>At the time of publication, HRM achieved state-of-the-art results on both benchmarks.</p><p>The model couldn&#8217;t have memorized its way there. The architecture is doing the work, and that&#8217;s a meaningful data point for a few ongoing debates: whether scale can always substitute for structure and invariance, and whether token-level reasoning is a fundamental requirement or a workaround.</p><p>It doesn&#8217;t resolve any of those debates. But it sharpens them.</p><h1>Caveats and conclusions</h1><p>HRM isn&#8217;t a general-purpose language model. It doesn&#8217;t generate text or handle open-ended tasks. The benchmark comparison to much larger models needs to be made carefully. Scaling behavior is unknown. And the biological inspiration, while compelling, is a motivation rather than a guarantee.</p><p>The code is open-source at <a href="https://github.com/sapientinc/HRM">github.com/sapientinc/HRM</a>.</p><p>The next post will cover <strong>TRM (Tiny Recursive Model)</strong>, a radically simplified successor described in <a href="https://arxiv.org/pdf/2510.04871">Alexia Jolicoeur-Martineau's paper</a>, which strips the recursive framework down to a single tiny network and starts to answer some of the scaling questions.</p>]]></content:encoded></item><item><title><![CDATA[When Scaling Isn't Enough]]></title><description><![CDATA[A look at where deep learning research stands today, why scaling alone won't take us further, and what the field actually needs.]]></description><link>https://sergiunistor.substack.com/p/when-scaling-isnt-enough</link><guid isPermaLink="false">https://sergiunistor.substack.com/p/when-scaling-isnt-enough</guid><dc:creator><![CDATA[Sergiu Nistor]]></dc:creator><pubDate>Fri, 13 Mar 2026 16:47:14 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/f9d3fc99-5bc5-447a-8a0b-8ad9d222985f_2048x1152.jpeg" length="0"
type="image/jpeg"/><content:encoded><![CDATA[<p>Deep learning has come a long way, but the field is showing signs of stagnation. Most progress today comes from scaling existing architectures rather than rethinking them, and the gap between what large labs can build and what independent researchers can meaningfully explore keeps widening. </p><p>This blog is a response to that: a space for examining the reasoning behind model design, questioning assumptions that have gone unquestioned for too long, and making the case that the next wave of breakthroughs will come from new ideas, not just more compute.</p><h1>When Research Plateaus</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6NWF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd529e36-3aa9-4181-ad32-c9be6cd95d5f_1102x834.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6NWF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd529e36-3aa9-4181-ad32-c9be6cd95d5f_1102x834.png 424w, https://substackcdn.com/image/fetch/$s_!6NWF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd529e36-3aa9-4181-ad32-c9be6cd95d5f_1102x834.png 848w, https://substackcdn.com/image/fetch/$s_!6NWF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd529e36-3aa9-4181-ad32-c9be6cd95d5f_1102x834.png 1272w, https://substackcdn.com/image/fetch/$s_!6NWF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd529e36-3aa9-4181-ad32-c9be6cd95d5f_1102x834.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!6NWF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd529e36-3aa9-4181-ad32-c9be6cd95d5f_1102x834.png" width="728" height="550.9546279491833" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dd529e36-3aa9-4181-ad32-c9be6cd95d5f_1102x834.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:834,&quot;width&quot;:1102,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:855884,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://sergiunistor.substack.com/i/182040647?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd529e36-3aa9-4181-ad32-c9be6cd95d5f_1102x834.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6NWF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd529e36-3aa9-4181-ad32-c9be6cd95d5f_1102x834.png 424w, https://substackcdn.com/image/fetch/$s_!6NWF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd529e36-3aa9-4181-ad32-c9be6cd95d5f_1102x834.png 848w, https://substackcdn.com/image/fetch/$s_!6NWF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd529e36-3aa9-4181-ad32-c9be6cd95d5f_1102x834.png 1272w, 
https://substackcdn.com/image/fetch/$s_!6NWF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd529e36-3aa9-4181-ad32-c9be6cd95d5f_1102x834.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Depiction of some of the most cited Computer Science research papers, according to <a href="https://paperscape.org/">Paperscape</a>.</figcaption></figure></div><p>Deep learning has achieved remarkable successes, yet the field is increasingly converging around a few dominant architectures, particularly large transformers and LLMs.
Most current innovation focuses on <a href="https://arxiv.org/abs/2001.08361">scaling existing models</a> rather than exploring fundamentally new ideas, creating a plateau where architectural decisions are repeated and the reasoning behind them is rarely questioned.</p><p>Exceptions exist, but they are rare, and most still do not deviate much from current deep learning standards. Some notable recent examples are hierarchical models (<a href="https://arxiv.org/abs/2506.21734">HRMs</a>, <a href="https://arxiv.org/abs/2510.04871">TRMs</a>), state-space models (<a href="https://arxiv.org/abs/2312.00752">Mamba</a>), attention-free transformer alternatives (<a href="https://arxiv.org/pdf/2305.13048">RWKV</a>), and hybrid SSM-transformer models (<a href="https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-White-Paper.pdf">Nemotron 3</a>).</p><p>The deeper issue is that a single idea can only take us so far. The field needs new approaches and perspectives, because scaling alone has limits. Some researchers argue that <a href="https://www.businessinsider.com/meta-yann-lecun-scaling-ai-wont-make-it-smarter-2025-4">scaling laws may eventually stop holding</a>. Beyond performance, progress should not be measured only by model size: <a href="https://arxiv.org/abs/2104.10350">environmental impact</a> and resource requirements are real constraints that cannot be ignored.</p><p>At the same time, accessibility in research is increasingly constrained, as validating most new ideas now requires large-scale infrastructure and computational power. This creates a barrier for smaller labs and independent researchers, limiting their ability to explore innovative approaches, test unconventional hypotheses, and meaningfully contribute to the advancement of the field.
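</p><p>The limits-of-scaling point can be made concrete. Compute scaling results of the kind linked above take the form of a power law, loss &#8776; (C<sub>c</sub>/C)<sup>&#945;</sup> with a small exponent, so each additional 10&#215; of compute buys a smaller absolute improvement than the last. A sketch with illustrative, unfitted constants:</p>

```python
# Illustrative power-law scaling curve: loss(C) = (C_c / C) ** alpha.
# C_c and alpha are stand-in constants, not fitted values from any paper.
C_c, alpha = 2.3e8, 0.05

def loss(compute: float) -> float:
    return (C_c / compute) ** alpha

# Absolute improvement gained by each successive 10x of compute.
gains = [loss(10.0**e) - loss(10.0**(e + 1)) for e in range(3, 9)]
```

<p>The gains stay positive but shrink at every step: diminishing returns are built into the functional form, which is why scaling alone eventually stops paying for itself.</p><p>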
If this trajectory continues, the field risks another <a href="https://en.wikipedia.org/wiki/AI_winter">AI winter</a>.</p><h1>Driving the Next Wave of AI</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Uv-1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb53d04d-53c1-480e-9777-b138e0a482b4_2400x1467.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Uv-1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb53d04d-53c1-480e-9777-b138e0a482b4_2400x1467.png 424w, https://substackcdn.com/image/fetch/$s_!Uv-1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb53d04d-53c1-480e-9777-b138e0a482b4_2400x1467.png 848w, https://substackcdn.com/image/fetch/$s_!Uv-1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb53d04d-53c1-480e-9777-b138e0a482b4_2400x1467.png 1272w, https://substackcdn.com/image/fetch/$s_!Uv-1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb53d04d-53c1-480e-9777-b138e0a482b4_2400x1467.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Uv-1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb53d04d-53c1-480e-9777-b138e0a482b4_2400x1467.png" width="1456" height="890"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fb53d04d-53c1-480e-9777-b138e0a482b4_2400x1467.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:890,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:389444,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://sergiunistor.substack.com/i/182040647?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb53d04d-53c1-480e-9777-b138e0a482b4_2400x1467.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Uv-1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb53d04d-53c1-480e-9777-b138e0a482b4_2400x1467.png 424w, https://substackcdn.com/image/fetch/$s_!Uv-1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb53d04d-53c1-480e-9777-b138e0a482b4_2400x1467.png 848w, https://substackcdn.com/image/fetch/$s_!Uv-1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb53d04d-53c1-480e-9777-b138e0a482b4_2400x1467.png 1272w, https://substackcdn.com/image/fetch/$s_!Uv-1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb53d04d-53c1-480e-9777-b138e0a482b4_2400x1467.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Frontier model training compute requirement evolution over time, as per <a href="https://epoch.ai/blog/training-compute-of-frontier-ai-models-grows-by-4-5x-per-year">Epoch AI</a>.</figcaption></figure></div><p>True advancement in the field depends on researchers exploring and inventing novel architectures, techniques, and methods, rather than merely scaling up existing models. By focusing on creativity and innovation, new researchers can uncover more efficient, interpretable, and versatile solutions, pushing the boundaries of what is possible without being limited by the brute-force approach of ever-larger models.</p><p>The field needs people who combine creativity, intuition, and rigorous thinking to tackle underexplored ideas, propose alternative architectures, and test unconventional learning mechanisms.
Diversifying who contributes and what approaches are investigated is essential to break out of incremental cycles and drive the next wave of breakthroughs in deep learning.</p><p>However, starting research today can feel like navigating a landscape full of proven successes but few clear paths.</p><p>There is a pressing need for a centralized resource that traces the development of brilliant deep learning and machine learning ideas, showing how concrete objectives, such as implementing memory or attention, led to the design of specific mathematical mechanisms that guide the model&#8217;s learning.</p><p>These insights are typically buried in papers, which naturally emphasize scientific rigor over clarity for a wider audience. As a result, the ideas are often too complex or lengthy for most readers to fully absorb.</p><h1>Understanding Model Design</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5QGS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e803a7e-80ce-4516-b192-3ebd88623152_892x1293.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5QGS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e803a7e-80ce-4516-b192-3ebd88623152_892x1293.png 424w, https://substackcdn.com/image/fetch/$s_!5QGS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e803a7e-80ce-4516-b192-3ebd88623152_892x1293.png 848w, https://substackcdn.com/image/fetch/$s_!5QGS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e803a7e-80ce-4516-b192-3ebd88623152_892x1293.png 1272w, 
https://substackcdn.com/image/fetch/$s_!5QGS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e803a7e-80ce-4516-b192-3ebd88623152_892x1293.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5QGS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e803a7e-80ce-4516-b192-3ebd88623152_892x1293.png" width="234" height="339.195067264574" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6e803a7e-80ce-4516-b192-3ebd88623152_892x1293.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1293,&quot;width&quot;:892,&quot;resizeWidth&quot;:234,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Attention is All You Need. Breaking Down the Architecture Behind&#8230; | by  Hadis | Towards AI&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Attention is All You Need. Breaking Down the Architecture Behind&#8230; | by  Hadis | Towards AI" title="Attention is All You Need. 
Breaking Down the Architecture Behind&#8230; | by  Hadis | Towards AI" srcset="https://substackcdn.com/image/fetch/$s_!5QGS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e803a7e-80ce-4516-b192-3ebd88623152_892x1293.png 424w, https://substackcdn.com/image/fetch/$s_!5QGS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e803a7e-80ce-4516-b192-3ebd88623152_892x1293.png 848w, https://substackcdn.com/image/fetch/$s_!5QGS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e803a7e-80ce-4516-b192-3ebd88623152_892x1293.png 1272w, https://substackcdn.com/image/fetch/$s_!5QGS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e803a7e-80ce-4516-b192-3ebd88623152_892x1293.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The original transformer architecture, <a href="https://arxiv.org/pdf/1706.03762">Vaswani et al.</a></figcaption></figure></div><p>Deep learning remains largely a black box. We know <em>what</em> works, but not always <em>why</em>. While models achieve impressive results, the internal mechanisms that drive their success often remain opaque. This uncertainty has spurred the field of mechanistic interpretability, with groups like Anthropic working to uncover the principles behind neural network behavior (<a href="https://www.anthropic.com/research/interpretability-dreams">Interpretability Dreams</a>, <a href="https://www.anthropic.com/research/decomposing-language-models-into-understandable-components">Decomposing Language Models Into Understandable Components</a>, <a href="https://www.anthropic.com/research/mapping-mind-language-model">Mapping the Mind of a Large Language Model</a>).</p><p>Many design choices in deep learning only make sense in hindsight. Architectural tweaks, normalization methods, and optimization strategies are frequently justified empirically rather than theoretically. Mechanisms that seem arbitrary at first reveal their value only after extensive experimentation.</p><p>Much of our intuition comes from the brain, our only partially understood example of intelligence. Concepts such as <a href="https://arxiv.org/abs/1706.03762">attention</a>, <a href="https://ieeexplore.ieee.org/abstract/document/6795963">memory</a>, and hierarchical processing are translated into mechanisms that neural models can implement.
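</p><p>Attention is the canonical example of this translation: the intuition &#8220;look where it matters&#8221; becomes a learned weighted average, softmax(QK<sup>T</sup>/&#8730;d)V. A minimal single-head sketch (shapes and values are illustrative):</p>

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Compare each query against every key, then mix the values
    # according to the resulting scores.
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 5, 16))  # 5 tokens, 16-dim head
out, w = attention(Q, K, V)
```

<p>Each row of <code>w</code> sums to one, so every output token is literally a convex combination of value vectors; that weighted average is the entire mechanism.</p><p>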
These abstractions often succeed in practice, but the reasons for their effectiveness or optimality are rarely clear.</p><p>Meaningful research requires a rare blend of skills: the ability to translate abstract, human-intuitive concepts into precise mathematical constructs that a neural network can internalize. It&#8217;s this bridge between practical intuition and implementation that drives innovation beyond mere trial-and-error.</p><h1>What Readers Can Expect</h1><p>This blog explores the reasoning behind model design, how structure shapes learning, and the principles that guide research. By examining the foundations of deep learning and machine learning, we aim to spark innovation, encourage critical thinking, and support a more diverse and capable research community.</p><p>You&#8217;ll find practical explorations of deep learning and machine learning research: analyses of novel architectures, experiments that question conventional assumptions, and reflections on why certain design choices succeed or fail. Each post links core intuitions with the mathematical principles that govern model behavior.</p><p>This space is meant for active engagement. I welcome discussions, critiques, and collaborations, whether you are a researcher testing new concepts or a learner exploring the fundamentals.</p>]]></content:encoded></item></channel></rss>