A procedurally generated benchmark for compositional algebraic reasoning in language models.
ALGEBRAID generates an unlimited, non-memorisable supply of reasoning tasks built on five algebraic structures. Every answer is derived by algebraic construction; tasks that carry a solution trace (intra-structure, inter-structure, field arithmetic, adversarial, intermediate) are additionally re-derived by an independent proof engine. Tasks without traces (conceptual, rule induction) are verified by construction. Tasks are seed-deterministic: the same seed always produces the same task set.
The proof engine verifies algebraic trace consistency — it re-applies each named operation from first principles and checks every intermediate value. It does not re-read the prompt. A separate prompt-trace alignment check in the validator warns when a trace operation references an element that does not appear literally in the prompt (catches verbalizer bugs that the proof engine cannot detect).
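The idea of trace re-derivation can be sketched as follows. This is a minimal illustration for Z_n only, with a hypothetical trace format of `(operation, operand, claimed_value)` triples; it is not ALGEBRAID's actual proof engine.

```python
def verify_zn_trace(n, start, steps):
    """Re-apply each step from first principles in Z_n and check every
    claimed intermediate value. Returns (ok, index_of_failed_step)."""
    value = start
    for i, (op, operand, claimed) in enumerate(steps):
        if op == "add":
            value = (value + operand) % n
        elif op == "inverse":
            value = (-value) % n
        else:
            return False, i  # unknown operation
        if value != claimed:
            return False, i  # trace inconsistent at step i
    return True, None

# A consistent 3-step trace in Z_12: 5 -> 2 -> 10 -> 2
ok, bad_step = verify_zn_trace(
    12, 5, [("add", 9, 2), ("inverse", None, 10), ("add", 4, 2)]
)
```

Note that the checker never looks at the prompt text, which is why the separate prompt-trace alignment check in the validator is still needed.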
| Property | Detail |
|---|---|
| Correct by construction | All answers derived by computation, not annotation |
| Independently verified | Proof engine re-derives answers for all trace-bearing tasks (intra, inter, field, adversarial, intermediate) from the solution trace; conceptual and rule tasks are verified by construction |
| Non-memorisable | Unlimited procedural variation; no fixed test set |
| Reproducible | Same seed -> same task set |
| Multi-dimensional | 7 generator families (5 task family labels) × 7 compositional dimensions × 3 complexity metrics |
| Skin metadata | Active skin name stored in task metadata["skin"] for downstream analysis |
```
pip install -e .
```

No provider SDK is bundled. Install only what you need:

```
pip install -e ".[openai]"     # OpenAI API and custom HTTP endpoints
pip install -e ".[anthropic]"  # Anthropic API
```

Set the corresponding API key before running:

```
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
```

```
# Generate a task set (validated automatically)
algebraid generate --seed 42
# -> ./data/algebraid_s42_20260225_e50604.jsonl

# Full pipeline: run model, evaluate, prove, analyze
algebraid pipeline ./data/algebraid_s42_20260225_e50604.jsonl \
    --model gpt-4.1-nano --adapter openai
# -> ./runs/20260225_143012_gpt_4_1_nano_algebraid_s42_.../
```

Full CLI reference: docs/cli.md
Seven generator families are available via --families (all seven are included by default):
| Family key | Family label | Description |
|---|---|---|
| intra | intra-structure composition | Composition chain within a single structure |
| inter | inter-structure composition | Component-wise operation across a direct product |
| field | field arithmetic | Expression evaluation in a finite field |
| rule | rule induction | Identify the pattern; give the next element |
| conceptual | conceptual query | Structural property query: identity, order, commutativity |
| adversarial | intra-structure composition | Intra-structure chain designed to trigger specific reasoning errors |
| intermediate | intra-structure composition | Report the value after step k of an n-step chain |
adversarial and intermediate tasks are generated as intra-structure chains. They share the intra-structure composition family label in the task data and appear under that label in accuracy_by_family reports. They are distinguished from standard intra tasks by their dimension tag (adversarial or intermediate_state), which appears in accuracy_by_dimension reports.
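Concretely, this means downstream analysis must separate these sub-types by dimension tag rather than family label. A small sketch, assuming tasks expose the family label and dimension tag as plain dict fields (the exact field names in the task schema may differ):

```python
# Hypothetical task records: same family label, different dimension tags.
tasks = [
    {"family": "intra-structure composition", "dimension": "general"},
    {"family": "intra-structure composition", "dimension": "adversarial"},
    {"family": "intra-structure composition", "dimension": "intermediate_state"},
]

# All three fall under one family label in accuracy_by_family...
families = {t["family"] for t in tasks}

# ...but the dimension tag distinguishes them in accuracy_by_dimension.
adversarial_tasks = [t for t in tasks if t["dimension"] == "adversarial"]
```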
| Sub-type | Trap |
|---|---|
| double_inverse | Two inverses cancel; answer equals the starting value |
| self_cancelling | Operations sum to zero mod n |
| identity_bait | Result is the identity element; common wrong answer is the starting value |
| commutativity_trap | Non-abelian structure; answer depends on operand order |
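Two of the traps can be worked by hand in Z_12 (an illustrative example, not generated benchmark output):

```python
n, start = 12, 7

# double_inverse: applying the inverse twice cancels,
# so the chain returns to its starting value.
value = (-start) % n   # first inverse: 5
value = (-value) % n   # second inverse: back to 7

# self_cancelling: the operations sum to zero mod n,
# so the chain also returns to its starting value.
ops = [3, 4, 5]        # 3 + 4 + 5 = 12, which is 0 mod 12
value2 = (start + sum(ops)) % n
```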
| Structure | Notation | Abelian | Order |
|---|---|---|---|
| Cyclic group | Z_n, n in [3, 18] | Yes | n |
| Symmetric group | S_n, n in {3, 4, 5} | No | n! |
| Dihedral group | D_n, n in [3, 8] | No | 2n |
| Finite field | GF(p), p in {3,5,7,11,13,17,19,23} | Yes | p |
| Quaternion group | Q_8 | No | 8 |
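The order column can be checked directly for two of the structures. A minimal illustration using explicit element enumeration, not the library's primitives API:

```python
from itertools import permutations
from math import factorial

# Symmetric group S_4: all permutations of 4 elements, order 4! = 24.
s4 = list(permutations(range(4)))

# Dihedral group D_6 modeled as (rotation, reflected?) pairs, order 2n = 12.
d6 = [(r, f) for r in range(6) for f in (False, True)]
```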
Each task can be expressed through a narrative skin that maps abstract algebra to a real-world context. Skins are applied at generation time and affect prompt wording only, not the underlying computation.
| Skin | Structure | Domain |
|---|---|---|
| Clock Arithmetic | Z_n | Positions on an n-hour clock |
| Musical Intervals | Z_n | Tones on a chromatic scale |
| Robot Steps | Z_n | Stops on a circular track |
| Color Wheel | Z_n | Hues on a color wheel |
| Deck of Cards | S_n | Shuffles of n cards |
| Seating Arrangements | S_n | Rearrangements of n people |
| Polygon Symmetries | D_n | Symmetries of a regular polygon |
| Tile Flips and Rotations | D_n | Orientations of a tile |
| Secret Codes | GF(p) | Symbols in a code system |
| Modular Arithmetic | GF(p) | Residues modulo a prime |
| Quaternion Algebra | Q_8 | Quaternion unit arithmetic |
| Quaternion Rotations | Q_8 | 3-D rotation states |
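Because skins affect wording only, the same abstract task can be verbalized many ways while the computed answer stays fixed. The prompt templates below are hypothetical, written only to illustrate the idea:

```python
# Hypothetical skin wordings for the same abstract Z_12 task "5 + 9".
skinned_prompts = {
    "Clock Arithmetic": "A hand points at 5 on a 12-hour clock. "
                        "Move it forward 9 hours. Where does it point?",
    "Robot Steps": "A robot stands at stop 5 of a 12-stop circular track. "
                   "It advances 9 stops. Which stop is it at?",
}

# The underlying computation is identical regardless of skin.
answer = (5 + 9) % 12
```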
Every task is tagged with one compositional dimension. The first four follow Hupkes et al. (2020); the remaining three are ALGEBRAID-specific. Evaluation reports break down accuracy by dimension via accuracy_by_dimension.
| Dimension | Source | Tests |
|---|---|---|
| systematicity | Hupkes et al. (2020) | Unseen combination of seen components |
| substitutivity | Hupkes et al. (2020) | Synonym substitution invariance |
| productivity | Hupkes et al. (2020) | Longer chain than seen during training |
| overgeneralization | Hupkes et al. (2020) | Resistance to applying rules where they don't hold |
| adversarial | ALGEBRAID | Resistance to specific reasoning shortcuts |
| intermediate_state | ALGEBRAID | Access to intermediate computation steps |
| general | ALGEBRAID | Standard task; no specific dimension stress |
Every task carries three algebraic complexity scores computed independently of the model:
| Metric | Symbol | Measures |
|---|---|---|
| Algebraic Entropy | H_alg | log_2(\|G\|) × depth (intra); log_2(∑\|G_i\|) (inter); 0 (conceptual) |
| Commutativity Distance | D_comm | Fraction of consecutive operation pairs where operand order matters |
| Orbit Complexity | O_c | Distinct intermediate values / \|G\|: breadth of element traversal |
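The three formulas can be applied by hand to a single intra-structure chain. A sketch that assumes a plain list of intermediate values as the trace, not the library's complexity.py:

```python
import math

# Intra-structure chain in Z_12, depth 3, with hypothetical
# intermediate values after each step: 5 -> 2 -> 10 -> 2.
group_order = 12
intermediates = [2, 10, 2]
depth = len(intermediates)

# Algebraic entropy for an intra task: log2(|G|) * depth.
h_alg = math.log2(group_order) * depth

# Orbit complexity: distinct intermediate values / |G|.
o_c = len(set(intermediates)) / group_order

# Commutativity distance: Z_12 is abelian, so no consecutive
# operation pair is order-sensitive and D_comm = 0.
d_comm = 0.0
```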
Two task-specific metrics are available as standalone functions:
| Function | Applies to | Range |
|---|---|---|
| compute_conceptual_depth(task) | Conceptual queries | 0.1 (identity lookup) to 0.7 (element order) |
| compute_adversarial_strength(task) | Adversarial tasks | 0.3 (self-cancelling) to 0.8 (commutativity trap) |
run_analysis(report) returns a structured dict with five analyses plus metadata:
| Key | Meaning |
|---|---|
| model | Model name |
| task_set | Task-set name |
| overall_accuracy | Fraction correct across all tasks |
| total_tasks | Total evaluated tasks |
| total_correct | Total correct predictions |
| accuracy_by_depth | {curve: [{depth, accuracy, correct, total, errors_by_category}], by_family: {family: [...]}} — chain families only |
| accuracy_by_family | {family: {total, correct, accuracy}} — all task families |
| accuracy_by_dimension | {dimension: {total, correct, accuracy}} — all compositional dimensions |
| complexity_analysis | {by_depth: [...], vs_accuracy: [...]} — H_alg, D_comm, O_c by depth and per task |
| hallucination_onset | {onset_depth, threshold, curve, note} — hallucination rate by depth |
Standalone advanced functions (not in run_analysis output):
| Function | Returns |
|---|---|
| fit_scaling_law(report) | Pooled power-law fit acc(d) ~ A·d^(−α) across chain families |
| fit_scaling_law_by_family(report) | Independent power-law fit per chain family |
| find_phase_transition(report) | Steepest accuracy drop location and critical depth |
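The power-law fit itself amounts to a least-squares regression in log-log space, since acc(d) = A·d^(−α) implies log acc = log A − α·log d. A generic self-contained sketch of that fit, not the library's fitting code:

```python
import math

# Synthetic accuracy-by-depth data lying exactly on acc(d) = A * d**(-alpha),
# so the log-log regression recovers A and alpha.
depths = [1, 2, 3, 4, 5]
A_true, alpha_true = 0.9, 0.7
acc = [A_true * d ** (-alpha_true) for d in depths]

# Least-squares fit of log(acc) = log(A) - alpha * log(d).
xs = [math.log(d) for d in depths]
ys = [math.log(a) for a in acc]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
    (x - mx) ** 2 for x in xs
)
alpha_hat = -slope
A_hat = math.exp(my - slope * mx)
```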
verify_set(task_set) returns a dict with the following keys:
| Key | Meaning |
|---|---|
| total | Total tasks |
| trace_verified | Tasks with traces that re-derive correctly from first principles |
| skipped | Tasks without traces (conceptual, rule induction) — verified by construction |
| failed | Tasks where the trace is algebraically inconsistent |
| proof_rate | trace_verified / (trace_verified + failed) as a percentage |
| coverage | (trace_verified + skipped) / total as a percentage |
| failures | List of {task_id, failed_step, message} dicts |
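The two rates reduce to simple ratios over the counts. A worked example with hypothetical counts:

```python
# Hypothetical verify_set counts.
total, trace_verified, skipped, failed = 100, 78, 20, 2

# proof_rate: share of trace-bearing tasks whose trace re-derives correctly.
proof_rate = 100 * trace_verified / (trace_verified + failed)   # 97.5

# coverage: share of all tasks that are either trace-verified
# or verified by construction.
coverage = 100 * (trace_verified + skipped) / total             # 98.0
```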
```python
from algebraid import (
    AlgebraidGenerator, AlgebraidEvaluator, EvalReport,
    TaskValidator, verify_set, run_analysis,
    split_by_depth, split_by_commutativity,
)

# Generate
# API defaults: depths=[1,2,3,4,5], tasks_per_depth=10, families=all seven
gen = AlgebraidGenerator(seed=42)
task_set = gen.generate(
    depths=[1, 2, 3, 4],
    tasks_per_depth=20,
    families=["intra", "inter", "conceptual", "adversarial"],
)

# Validate (includes prompt-trace alignment check)
val_report = TaskValidator().validate_set(task_set)
print(f"{val_report['passed']}/{val_report['total']} passed")

# Independently verify trace-based answers
proof = verify_set(task_set)
print(f"Trace-verified: {proof['trace_verified']} "
      f"proof rate: {proof['proof_rate']}%")

# Evaluate (given a predictions dict)
predictions = {"AG-abc123": "5"}
evaluator = AlgebraidEvaluator()
report = evaluator.evaluate(task_set, predictions, model_name="my-model")
report.print_summary()

# Structured analysis: 5 analyses — depth, family, dimension, complexity,
# hallucination onset
analysis = run_analysis(report)

# Access the five structured analyses
print(analysis["accuracy_by_depth"]["curve"])       # per-depth accuracy
print(analysis["accuracy_by_family"])               # all families
print(analysis["accuracy_by_dimension"])            # all dimensions
print(analysis["complexity_analysis"]["by_depth"])  # complexity metrics
print(analysis["hallucination_onset"]["note"])      # hallucination summary

# Optional: advanced standalone analyses (not in run_analysis output)
from algebraid.analysis import fit_scaling_law_by_family, find_phase_transition

by_family = fit_scaling_law_by_family(report)
for fam, fit in by_family.items():
    if fit["alpha"] is not None:
        print(f"{fam}: alpha={fit['alpha']:.2f} R²={fit['r2']:.3f}")

# Generalisation splits
train, test = split_by_depth(task_set, train_max_depth=2)
train, test = split_by_commutativity(task_set)

# Load a saved report and re-analyze
import json

with open("eval_report.json") as f:
    report = EvalReport.from_dict(json.load(f))
analysis = run_analysis(report)
```

```
src/algebraid/
    primitives/          Z_n, S_n, D_n, GF(p), Q_8, abstract base
    composers/           FunctionComposition, DirectProduct
    tasks/               Verbalization, answer verification, schema validation
    task_model.py        Task, TaskSet, TaskFamily, CompositionDimension
    generator.py         Procedural task generation (7 families)
    evaluator.py         AlgebraidEvaluator, EvalReport
    proof.py             Independent algebraic verifier
    analysis.py          Six structured analyses (depth, family, dimension,
                         complexity, error taxonomy, hallucination onset);
                         standalone: fit_scaling_law_by_family(),
                         find_phase_transition()
    complexity.py        Algebraic complexity metrics
    splits.py            Train/test split strategies
    skins.py             12 semantic skins
    adapters.py          OpenAI, Anthropic, HuggingFace, custom HTTP adapters
    cli.py               8-command CLI
tests/
    test_primitives.py   Group axiom verification for all 5 structures
    test_composers.py    Composition and direct product tests
    test_generator.py    Determinism, schema, and grading tests
    test_complexity.py   Algebraic complexity metric tests
    test_task_model.py   Task/TaskSet serialization round-trip tests
    test_verbalizer.py   Prompt generation and skin integration tests
    test_verifier.py     Answer extraction and checking tests
    test_skins.py        Semantic skin coverage tests
    test_evaluator.py    EvalReport scoring and ceiling tests
    test_analysis.py     Error taxonomy, stability, accuracy by family,
                         run_analysis key tests
    test_splits.py       Train/test split strategy tests
```
MIT. See LICENSE.