Less is More: Recursive Reasoning with Tiny Networks

This is the codebase for the paper "Less is More: Recursive Reasoning with Tiny Networks". Tiny Recursion Model (TRM) is a recursive reasoning approach that scores 45% on ARC-AGI-1 and 8% on ARC-AGI-2 using a tiny neural network with only 7M parameters.

Paper: https://arxiv.org/abs/2510.04871

Motivation

Tiny Recursion Model (TRM) is a recursive reasoning model that scores 45% on ARC-AGI-1 and 8% on ARC-AGI-2 with a tiny 7M-parameter neural network. The idea that one must rely on massive foundation models, trained for millions of dollars by some big corporation, in order to succeed on hard tasks is a trap. Currently, there is too much focus on exploiting LLMs rather than devising and expanding new lines of research. With recursive reasoning, it turns out that “less is more”: you don’t always need to crank up model size in order for a model to reason and solve hard problems. A tiny model pretrained from scratch, recursing on itself and updating its answers over time, can achieve a lot without breaking the bank.

This work came to be after I learned about the recent, innovative Hierarchical Reasoning Model (HRM). I was amazed that an approach using small models could do so well on hard tasks like the ARC-AGI competition (reaching 40% accuracy, when normally only Large Language Models could compete). But I kept thinking that it was too complicated, relied too much on biological arguments about the human brain, and that this recursive reasoning process could be greatly simplified and improved. Tiny Recursion Model (TRM) reduces recursive reasoning to its core essence, which ultimately has nothing to do with the human brain and requires neither a mathematical (fixed-point) theorem nor any hierarchy.

How TRM works

Tiny Recursion Model (TRM) recursively improves its predicted answer y with a tiny network. It starts with the embedded input question x, an initial embedded answer y, and a latent z. For up to K improvement steps, it tries to improve its answer y. It does so by i) recursively updating its latent z n times, given the question x, current answer y, and current latent z (recursive reasoning), and then ii) updating its answer y given the current answer y and current latent z. This recursive process allows the model to progressively improve its answer (potentially correcting errors in its previous answer) in an extremely parameter-efficient manner while minimizing overfitting.
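The loop described above can be sketched in a few lines of plain Python. This is only an illustration of the control flow, not the code in this repository: the function name `trm_refine`, the optional `should_stop` callback, and the two toy update functions are hypothetical stand-ins for the tiny learned network.

```python
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def trm_refine(
    x: Vector,
    y: Vector,
    z: Vector,
    update_latent: Callable[[Vector, Vector, Vector], Vector],
    update_answer: Callable[[Vector, Vector], Vector],
    K: int,
    n: int,
    should_stop: Optional[Callable[[Vector], bool]] = None,
) -> Tuple[Vector, Vector]:
    """Run up to K improvement steps; each step does n latent updates, then one answer update."""
    for _ in range(K):
        # i) recursive reasoning: refine the latent z from (x, y, z), n times
        for _ in range(n):
            z = update_latent(x, y, z)
        # ii) refine the answer y from (y, z)
        y = update_answer(y, z)
        # hypothetical early exit, to model "up to K" steps
        if should_stop is not None and should_stop(y):
            break
    return y, z


# Toy stand-ins (assumptions for illustration only; TRM uses a learned network).
def toy_update_latent(x: Vector, y: Vector, z: Vector) -> Vector:
    # The toy latent is simply the current error between question and answer.
    return [xi - yi for xi, yi in zip(x, y)]


def toy_update_answer(y: Vector, z: Vector) -> Vector:
    # Move the answer halfway along the latent.
    return [yi + 0.5 * zi for yi, zi in zip(y, z)]


y_final, z_final = trm_refine(
    x=[1.0, 2.0], y=[0.0, 0.0], z=[0.0, 0.0],
    update_latent=toy_update_latent, update_answer=toy_update_answer,
    K=3, n=2,
)
print(y_final)  # [0.875, 1.75]
```

In this toy setting each improvement step halves the remaining error, which mirrors how the answer is refined progressively rather than produced in one shot.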

Latent Meta Attention (LMA)

Latent Meta Attention (LMA) is a compress-once, stay-compressed front-end for TRM that resequences the input into a latent space, performs attention and MLP operations entirely in this latent space, and then decodes the output back to the original token length at the end. This allows for significant computational savings, especially for large inputs.

How LMA Works

Given an input tensor $X \in \mathbb{R}^{B \times L \times d_0}$, where $B$ is the batch size, $L$ is the sequence length, and $d_0$ is the hidden size, LMA performs the following steps:

1. Column-Major Resequencing: The input tensor $X$ is flattened in column-major order to produce a tensor $\text{flat} \in \mathbb{R}^{B \times (L \cdot d_0)}$. The order of elements is $(1:L, c=1), (1:L, c=2), \dots, (1:L, c=d_0)$.
2. Chunking: The flattened tensor is split into $L_{\text{new}}$ latent tokens, each of size $C' = \frac{L \cdot d_0}{L_{\text{new}}}$. This requires $L_{\text{new}}$ to be a divisor of $L \cdot d_0$.
3. Encoding: Each chunk is embedded with a shared encoder $E_{\text{chunk}}: \mathbb{R}^{C'} \rightarrow \mathbb{R}^{d'}$, producing a latent tensor $Z \in \mathbb{R}^{B \times L_{\text{new}} \times d'}$, where $d'$ is the latent dimension.
4. First Residual: A residual connection is computed in the latent space by passing the same column-major chunks through a separate encoder $E_{\text{resid}}: \mathbb{R}^{C'} \rightarrow \mathbb{R}^{d'}$. This residual is then added to the latent tensor $Z$.
5. Latent Computation: All subsequent Transformer blocks operate on the latent tensor $Z$, performing attention and MLP operations at the reduced dimension $d'$.
6. Decoding: Finally, the output from the latent blocks is decoded back to the original chunk space using a shared decoder $D_{\text{chunk}}: \mathbb{R}^{d'} \rightarrow \mathbb{R}^{C'}$. The decoded chunks are then unchunked and reshaped back to the original tensor shape $(B, L, d_0)$.
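The steps above can be traced end to end with a dependency-free Python sketch. It handles a single example (the batch dimension $B$ is dropped) and assumes, purely for illustration, that $E_{\text{chunk}}$, $E_{\text{resid}}$, and $D_{\text{chunk}}$ are plain linear maps with random placeholder weights; the repository's actual modules may differ. All function names here are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

Matrix = List[List[float]]


def column_major_flatten(X: Matrix) -> List[float]:
    """Read an L x d0 matrix column by column: (1:L, c=1), (1:L, c=2), ..."""
    L, d0 = len(X), len(X[0])
    return [X[r][c] for c in range(d0) for r in range(L)]


def unflatten_column_major(flat: List[float], L: int, d0: int) -> Matrix:
    """Inverse of column_major_flatten."""
    return [[flat[c * L + r] for c in range(d0)] for r in range(L)]


def chunk(flat: List[float], L_new: int) -> Matrix:
    """Split the flat vector into L_new chunks of size C' = len(flat) / L_new."""
    if len(flat) % L_new != 0:
        raise ValueError("L_new must divide L * d0")
    C = len(flat) // L_new
    return [flat[i * C:(i + 1) * C] for i in range(L_new)]


def linear(W: Matrix, v: List[float]) -> List[float]:
    """Apply a weight matrix of shape (out, in) to a vector of length in."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def lma_forward(
    X: Matrix,
    L_new: int,
    d_latent: int,
    latent_blocks: Sequence[Callable[[Matrix], Matrix]] = (),
    seed: int = 0,
) -> Tuple[Matrix, Matrix]:
    rng = random.Random(seed)
    L, d0 = len(X), len(X[0])
    chunks = chunk(column_major_flatten(X), L_new)      # steps 1-2
    C = len(chunks[0])
    # Placeholder linear encoders/decoder (assumption, not the trained modules).
    E_chunk = random_matrix(d_latent, C, rng)
    E_resid = random_matrix(d_latent, C, rng)
    D_chunk = random_matrix(C, d_latent, rng)
    # Steps 3-4: encode each chunk, then add the residual encoding of the same chunk.
    Z = [
        [a + b for a, b in zip(linear(E_chunk, ch), linear(E_resid, ch))]
        for ch in chunks
    ]
    # Step 5: every block sees only the (L_new, d') latent tensor.
    for block in latent_blocks:
        Z = block(Z)
    # Step 6: decode to chunk space, unchunk, and restore the (L, d0) layout.
    flat_out = [v for z in Z for v in linear(D_chunk, z)]
    return Z, unflatten_column_major(flat_out, L, d0)


X = [[float(r * 6 + c) for c in range(6)] for r in range(15)]  # L=15, d0=6
Z, Y = lma_forward(X, L_new=5, d_latent=4)
print(len(Z), len(Z[0]))  # 5 4   (L_new, d')
print(len(Y), len(Y[0]))  # 15 6  (L, d0)
```

Because the latent blocks only ever see `L_new` tokens of width `d_latent`, their cost depends on the compressed shape rather than on the original sequence length.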

Matrix Example

Given $L=15$, $d_0=6$, and an unclean latent sequence length $L_{\text{new}}=5$, the chunk size is $C' = (15 \cdot 6) / 5 = 18$. The chunks are formed as follows:

Chunk 1: r1:r15,c1 then r1:r3,c2
Chunk 2: r4:r15,c2 then r1:r6,c3
Chunk 3: r7:r15,c3 then r1:r9,c4
Chunk 4: r10:r15,c4 then r1:r12,c5
Chunk 5: r13:r15,c5 then r1:r15,c6
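The chunk boundaries listed above follow directly from the column-major order and the chunk size. As a check, here is a small Python helper (hypothetical, not part of the repository) that derives the same row and column ranges for any $L$, $d_0$, and $L_{\text{new}}$:

```python
from typing import List


def describe_chunks(L: int, d0: int, L_new: int) -> List[str]:
    """Describe which rows/columns of an L x d0 matrix land in each column-major chunk."""
    total = L * d0
    if total % L_new != 0:
        raise ValueError("L_new must divide L * d0")
    C = total // L_new  # chunk size C'
    descriptions = []
    for k in range(L_new):
        pos, end = k * C, (k + 1) * C - 1  # inclusive 0-based flat indices
        segments = []
        while pos <= end:
            col = pos // L          # column holding flat index pos
            row_start = pos % L     # first row of this segment
            # A segment stops at the end of the column or the end of the chunk.
            row_end = min(L - 1, row_start + (end - pos))
            segments.append(f"r{row_start + 1}:r{row_end + 1},c{col + 1}")
            pos += row_end - row_start + 1
        descriptions.append(" then ".join(segments))
    return descriptions


for i, text in enumerate(describe_chunks(L=15, d0=6, L_new=5), start=1):
    print(f"Chunk {i}: {text}")
```

With $L=15$, $d_0=6$, $L_{\text{new}}=5$ this prints the five chunks listed above. Each chunk straddles two columns because the chunk size 18 is not a multiple of the column length 15.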

Configuration

LMA can be enabled and configured via the command line. For example, to enable LMA in "unclean" mode:

torchrun ... arch.lma_enable=true arch.lma_mode=unclean

Requirements

Python 3.10 (or similar)
CUDA 12.6.0 (or similar)

pip install --upgrade pip wheel setuptools
pip install --pre --upgrade torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu126 # install torch based on your cuda version
pip install -r requirements.txt # install requirements
pip install --no-cache-dir --no-build-isolation adam-atan2 
wandb login YOUR-LOGIN # login if you want the logger to sync results to your Weights & Biases (https://wandb.ai/)
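After installing, a quick sanity check can confirm that the main packages are importable. The snippet below is an optional sketch using only the standard library; the module list is an assumption based on the commands above (in particular, `adam_atan2` is assumed to be the import name of the `adam-atan2` package), and `check_environment` is a hypothetical helper, not part of this repository.

```python
import importlib.util
import sys
from typing import Dict, Sequence, Union

# Assumed import names for the packages installed above.
REQUIRED_MODULES = ("torch", "torchvision", "torchaudio", "adam_atan2", "wandb")


def check_environment(
    modules: Sequence[str] = REQUIRED_MODULES,
) -> Dict[str, Union[str, bool]]:
    """Report the Python version and whether each module can be found (without importing it)."""
    report: Dict[str, Union[str, bool]] = {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}"
    }
    for name in modules:
        report[name] = importlib.util.find_spec(name) is not None
    return report


for key, value in check_environment().items():
    print(f"{key}: {value}")
```

A `False` entry only means the module was not found in the current environment; it does not check GPU or CUDA availability.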

Dataset Preparation

# ARC-AGI-1
python -m dataset.build_arc_dataset \
  --input-file-prefix kaggle/combined/arc-agi \
  --output-dir data/arc1concept-aug-1000 \
  --subsets training evaluation concept \
  --test-set-name evaluation

# ARC-AGI-2
python -m dataset.build_arc_dataset \
  --input-file-prefix kaggle/combined/arc-agi \
  --output-dir data/arc2concept-aug-1000 \
  --subsets training2 evaluation2 concept \
  --test-set-name evaluation2

## Note: You cannot train on both ARC-AGI-1 and ARC-AGI-2 and evaluate them both because ARC-AGI-2 training data contains some ARC-AGI-1 eval data

# Sudoku-Extreme
python dataset/build_sudoku_dataset.py --output-dir data/sudoku-extreme-1k-aug-1000  --subsample-size 1000 --num-aug 1000  # 1000 examples, 1000 augments

# Maze-Hard
python dataset/build_maze_dataset.py # 1000 examples, 8 augments

Experiments

ARC-AGI-1 (assuming 4 H100 GPUs):

run_name="pretrain_att_arc1concept_4"
torchrun --nproc-per-node 4 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 --nnodes=1 pretrain.py \
arch=trm \
data_paths="[data/arc1concept-aug-1000]" \
arch.L_layers=2 \
arch.H_cycles=3 arch.L_cycles=4 \
+run_name=${run_name} ema=True

Runtime: ~3 days

ARC-AGI-2 (assuming 4 H100 GPUs):

run_name="pretrain_att_arc2concept_4"
torchrun --nproc-per-node 4 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 --nnodes=1 pretrain.py \
arch=trm \
data_paths="[data/arc2concept-aug-1000]" \
arch.L_layers=2 \
arch.H_cycles=3 arch.L_cycles=4 \
+run_name=${run_name} ema=True

Runtime: ~3 days

Sudoku-Extreme (assuming 1 L40S GPU):

run_name="pretrain_mlp_t_sudoku"
python pretrain.py \
arch=trm \
data_paths="[data/sudoku-extreme-1k-aug-1000]" \
evaluators="[]" \
epochs=50000 eval_interval=5000 \
lr=1e-4 puzzle_emb_lr=1e-4 weight_decay=1.0 puzzle_emb_weight_decay=1.0 \
arch.mlp_t=True arch.pos_encodings=none \
arch.L_layers=2 \
arch.H_cycles=3 arch.L_cycles=6 \
+run_name=${run_name} ema=True

run_name="pretrain_att_sudoku"
python pretrain.py \
arch=trm \
data_paths="[data/sudoku-extreme-1k-aug-1000]" \
evaluators="[]" \
epochs=50000 eval_interval=5000 \
lr=1e-4 puzzle_emb_lr=1e-4 weight_decay=1.0 puzzle_emb_weight_decay=1.0 \
arch.L_layers=2 \
arch.H_cycles=3 arch.L_cycles=6 \
+run_name=${run_name} ema=True

Runtime: < 36 hours

Maze-Hard (assuming 4 L40S GPUs):

run_name="pretrain_att_maze30x30"
torchrun --nproc-per-node 4 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 --nnodes=1 pretrain.py \
arch=trm \
data_paths="[data/maze-30x30-hard-1k]" \
evaluators="[]" \
epochs=50000 eval_interval=5000 \
lr=1e-4 puzzle_emb_lr=1e-4 weight_decay=1.0 puzzle_emb_weight_decay=1.0 \
arch.L_layers=2 \
arch.H_cycles=3 arch.L_cycles=4 \
+run_name=${run_name} ema=True

Runtime: < 24 hours

Reference

If you find our work useful, please consider citing:

@misc{jolicoeurmartineau2025morerecursivereasoningtiny,
      title={Less is More: Recursive Reasoning with Tiny Networks}, 
      author={Alexia Jolicoeur-Martineau},
      year={2025},
      eprint={2510.04871},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2510.04871}, 
}

and the Hierarchical Reasoning Model (HRM):

@misc{wang2025hierarchicalreasoningmodel,
      title={Hierarchical Reasoning Model}, 
      author={Guan Wang and Jin Li and Yuhao Sun and Xing Chen and Changling Liu and Yue Wu and Meng Lu and Sen Song and Yasin Abbasi Yadkori},
      year={2025},
      eprint={2506.21734},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2506.21734}, 
}

This code is based on the Hierarchical Reasoning Model code and the Hierarchical Reasoning Model Analysis code.
