Current leaderboards index heavily on complex reasoning, taking lower-level cognitive primitives for granted. However, the DeepMind Cognitive Framework highlights Attention as a foundational prerequisite for AGI. Without robust attention, high-level reasoning collapses over long contexts (an AI cannot solve physics if it forgets how many apples it just counted).
To address this, I built a benchmark to evaluate the Attention faculty in isolation, measuring how architectures maintain focus, filter noise, and progress toward AGI.
Models must calculate the final total of target items possessed by actors following a series of narrative operations. The system must maintain isolated mathematical counts for specific targets while actively suppressing semantic distractors. Regardless of the number of targets tracked, the model must merge totals via multiplication and uniformly output a single scalar integer.
Basic complexity is defined by the number of statements in a test. The benchmark includes several distractors:
- Categorical Sibling Distractors (tracking "apples" while ignoring "oranges").
- Lexical Overlap Distractors (tracking "apples" while ignoring "Apple iPhones").
- Adjectival Constraints (tracking "green apples" while ignoring "red apples").
- Abstract Interferences (ignoring hypothetical statements like "Scientists believe 2 apples...").
I integrated structural complexity multipliers:
- Syntactic Density: Embedding multiple, distinct statements within a single sentence frame.
- Parallel Target Tracking: Requiring the model to focus on multiple distinct target items simultaneously and multiply them at the end.
- Alphanumeric Lexical Mixing: Substituting digits ("2") with English words ("two") to prevent simple regex pattern matching.
By burying target items deep within dense semantic noise, the benchmark forces the LLM to actively engage specific components of the DeepMind Attention faculty ([2, Section 7.3]):
- Attention Capacity (7.3.1): Evaluated by tracking multiple independent physical items simultaneously (scaling up to multiple targets).
- Selective Attention (7.3.2): This overarching control faculty is stress-tested across three metrics:
- Sustained Attention: Measured by massively scaling narrative size (up to 375 statements) to define the boundary where focus collapses over time.
- Perceptual Inhibition: Tested by injecting high-salience semantic distractors (e.g., "Apple iPhones" when tracking "apples") that the model must actively ignore.
- Attention Shifting: Measured by dynamically swapping operational polarity (adding via "purchased" vs. subtracting via "ate"), forcing the model to continuously shift task logic.
The model must output a single integer. For models that fail to follow formatting instructions and generate explanations, I parse the final number from the output using a regex fallback.
The generated dataset contains 500 formatted evaluations, though the central Kaggle benchmark explicitly utilizes only the first 300. The underlying evaluations scale across five distinct tasks of increasing size, complexity, and distractor density (100 evaluations each).
| Task Level | Narrative Size | Empirical Characteristics |
|---|---|---|
| Task 1 | 1 – 25 | Warm-up sequences scaling in baseline tracking difficulty |
| Task 2 | 26 – 75 | 2 tests/size; distraction thresholds gradually scaling 1→10 |
| Task 3 | 76 – 175 | 1 test/size; growing multi-target tracking requirements |
| Task 4 | 176 – 275 | 1 test/size; primarily 3-target tracking with 2+ word items |
| Task 5 | 276 – 375 | 1 test/size; absolute maximum multi-target distractor boundaries |
I generated the dataset programmatically for two reasons:
- Scale: Measuring sustained attention requires long-context narratives tracking target items across hundreds of sentences. Doing this manually is impractical.
- Correctness: Programmatic generation guarantees perfect mathematical tracking of target items, avoiding human arithmetic errors.
The generator uses structured JSON dictionaries containing predefined actors, items, and mathematical operators. It maps abstract templates to these data points to assemble the narratives deterministically.
The .csv utilizes the following columns:
test_id: Sequence index.prompt: Monolithic input block pre-formatted forkaggle_benchmarks.ground_truth_answer: Baseline integer.target_words: Item(s) to track.story: Raw narrative.num_statements: Count of statements (size).allow_negative: Flag enabling item loss.distractor_ratio,abstract_ratio,word_number_ratio,multi_digit_ratio: Probability weights controlling distractor density, lexical mixing, and parallel tracking targets.target_max_length/target_min_length: Number of words in the target items.num_targets: Number of items to track.statements_grouping_max: Max number of statements grouped in a sentence.
I implemented layers of validation:
- Templated Determinism: Using strict templates populated by these JSON configurations ensures syntax and arithmetic are sound by design.
- Empirical Baseline: High-end models (like Gemini 3.1 Pro) correctly solved >95% of the shortest sequences (Task 1). This confirms the dataset language is parseable and the logic is solvable.
- Manual Review: Sequences where models reliably failed early underwent manual human review.
- Consensus Validation: Every evaluation in Task 1 was correctly solved by at least 4 out of the top 10 models.
- Dataset Generation: The data generation pipeline was written in Python and authored with the assistance of an AI coding agent.
- Data Source: The pre-generated
.csvdataset is uploaded directly to Kaggle Datasets. - Benchmark Composition: I created distinct Kaggle Tasks using the
kaggle_benchmarkslibrary, slicing the.csvinto discrete 100-evaluation blocks. The first three tasks were then unified into a centralized Kaggle Benchmark and evaluated natively against 16 LLMs available on the platform.
The benchmark consists of the three Primary Tasks (Tasks 1–3). Tasks 4 and 5 experienced exhausting maximum token allowance limits, when testing on all models. The table below lists accuracy scores (0.0 to 1.0) per model and task, as well as the overall average.
| Model | Task 1 | Task 2 | Task 3 | Overall (Tasks 1-3) |
|---|---|---|---|---|
| gemini-3.1-pro-preview | 0.99 | 0.89 | 0.95 | 0.9433 |
| gemma-4-31b-it | 0.90 | 0.66 | 0.54 | 0.7000 |
| claude-opus-4-6-default | 0.91 | 0.59 | 0.45 | 0.6500 |
| gemma-4-26b-a4b-it | 0.93 | 0.63 | 0.36 | 0.6400 |
| gemini-2.5-flash | 0.88 | 0.59 | 0.42 | 0.6300 |
| glm-5 | 0.91 | 0.73 | 0.24 | 0.6267 |
| qwen3-next-80b-a3b-thinking | 0.91 | 0.57 | 0.36 | 0.6133 |
| claude-sonnet-4-6-default | 0.89 | 0.44 | 0.35 | 0.5600 |
| qwen3-next-80b-a3b-instruct | 0.77 | 0.54 | 0.19 | 0.5000 |
| gpt-oss-20b | 0.76 | 0.44 | 0.08 | 0.4267 |
| gpt-oss-120b | 0.74 | 0.41 | 0.11 | 0.4200 |
| deepseek-v3.2 | 0.35 | 0.13 | 0.08 | 0.1867 |
| gpt-5.4-2026-03-05 | 0.39 | 0.03 | 0.11 | 0.1767 |
| gemini-3.1-flash-lite-preview | 0.28 | 0.01 | 0.09 | 0.1267 |
| gpt-5.4-mini-2026-03-17 | 0.19 | 0.00 | 0.04 | 0.0767 |
| gpt-5.4-nano-2026-03-17 | 0.19 | 0.00 | 0.01 | 0.0667 |
In attachments I give different types of per-evaluation analysis. One of those is analysis
of the correlation between the boolean outputs and the parameters of the tests (e.g., num_targets, allow_negative). Here I put the parameters that affects the most the performance of each model, and
determine which attention ability it maps to.
| Attention Ability | Model | Primary Failure Vector | Correlation Severity |
|---|---|---|---|
| Attention Shifting | gpt-5.4-nano-2026-03-17 | allow_negative |
|
| Attention Shifting | gpt-5.4-mini-2026-03-17 | allow_negative |
|
| Attention Shifting | gemini-3.1-flash-lite-preview | allow_negative |
|
| Attention Shifting | gpt-5.4-2026-03-05 | allow_negative |
|
| Perceptual Inhibition | gpt-oss-20b | distractor_ratio |
|
| Perceptual Inhibition | gpt-oss-120b | distractor_ratio |
|
| Perceptual Inhibition | gemma-4-26b-a4b-it | abstract_ratio |
|
| Sustained Attention | glm-5 | num_statements |
|
| Sustained Attention | qwen3-next-80b-a3b-instruct | num_statements |
|
| Sustained Attention | deepseek-v3.2 | num_statements |
|
| Attention Capacity | gemini-2.5-flash | num_targets |
|
| Attention Capacity | gemma-4-31b-it | num_targets |
|
| Attention Capacity | claude-opus-4-6-default | num_targets |
|
| Attention Capacity | claude-sonnet-4-6-default | num_targets |
|
| Attention Capacity | qwen3-next-80b-a3b-thinking | statements_group |
|
| Attention Capacity | gemini-3.1-pro-preview | target_max_len |
Indeed, different models have different weak points. While there is one more improtant thing to
say about gemini-3.1-pro-preview: having target_max_len as the parameter that affects the most
its performance, it has positive correlation with target_min_len which makes me think, that the LLM struggles with different length of target items (like "apples" vs "squared oranges").
Small models struggle with allow_negative - which is loss of items ("Anna ate an apple").
While excluded from the primary aggregate, I retrieved scores for a few models on Task 4:
- gemini-3.1-pro-preview: 0.93
- gemma-4-31b-it: 0.22
- glm-5: 0.13
- claude-opus-4-6-default: 0.12
For Task 5:
- gemini-3.1-pro-preview: 0.86
- gemini-3-flash-preview: 0.34
gemini-3.1-pro-preview still shows excellent results even on the more complex task, while the others experience expected degradation.
Computing the Pearson correlation coefficient between the performance scores of all 16 benchmarked models provides structural stability metrics:
- Task 1 vs 2: 0.96
- Task 2 vs 3: 0.80
- Task 1 vs 3: 0.73
The 0.96 & 0.80 correlation between adjacent tasks confirms structural consistency.
The benchmark exposed model vulnerability to "attentional capture" via semantic overlap. For example, in Evaluation 38, models were instructed to exclusively track physical chips. However, the narrative injected: "Mila lost 1 poker chips at the casino." Several architectures failed to maintain boundary filtering and incorrectly subtracted the irrelevant poker chips, revealing attention layers are biased by localized token-matching.
I evaluated some models locally via llama.cpp using Q8_0 and Q4_K_M settings. Across the primary tests, gemma-4-E2B-Q8_0 scored 0.31 overall, and gemma-4-E4B-Q4_K_M scored 0.40. Qwen3.5-0.8B-Q8_0 correctly resolved 3.33% of evaluations. Because the earliest sequences in Task 1 are explicitly designed to be simple, the benchmark effectively acts as a graduated floor, capable of gifting non-zero scores even to heavily constrained sub-billion architectures. Notably, Qwen3.5-35B-A3B-Q4_K_M scored 0.4667, beating the 120B gpt-oss baseline.
I noted severe performance decay in the quantized Mixture of Experts variant gemma-4-26B-A4B-Q4_K_M. It achieved a final score of only 0.1433 because it repeatedly collapsed into an endless "overthinking" token loop. The exact technical cause was not fully isolated, though it is potentially the result of localized broken quantization parameters.
Model performance spans a clear spectrum—starting at roughly 0.0333 score for ultra-tiny constraints (Qwen3.5-0.8B), scaling cleanly through the 40–70% mid-tier range (gpt-oss-120b, gemma-4-31b-it), up to 94.33% for gemini-3.1-pro-preview. This gradient indicates the benchmark successfully distinguishes working memory constraints across models.
- Distractor Profiling: Investigating what type of localized semantic structure (e.g., hypothetical statements, pluralization traps) causes the highest rate of attentional capture.
- Targeting the Upper Bound: Analyze the exact boundaries where
gemini-3.1-pro-previewcollapses, moving beyond sheer context length to design specialized tests targeting its specific logic vulnerabilities. - Attention Shifting (Dynamic Rules): Implement mid-evaluation rule modifications (e.g., instructing the model that actors have stopped collecting "real apples" and exclusively started tracking "squared bananas") to explicitly stress-test DeepMind's "Attention Shifting" metric.
- Information Density vs. Length: Design smaller but denser tests. Current benchmarking indicates that limiting evaluations to a maximum size of 150 is sufficient if the distractors and mathematical abstractions are highly concentrated.
N/A
- Morris, M. R., et al. (2023). Levels of AGI: Operationalizing Progress on the Path to AGI. arXiv.
- Burnell, R., et al. (2026). Measuring Progress Toward AGI: A Cognitive Framework. Google DeepMind.