
jeqcho/superrational-ai-agents


Superrational AI Agents

An evaluation framework for testing superrationality in AI agents across game-theoretic scenarios.

Overview

This project uses Inspect AI to evaluate how different AI models perform in classic game theory scenarios that test for superrational behavior. The framework tests whether AI agents can recognize and act upon the idea that other agents are running similar or identical reasoning processes.

Game Scenarios

The evaluation suite includes six game-theoretic scenarios:

  1. Prisoner's Dilemma (2-player): Classic cooperation vs. defection scenario with payoff matrix
  2. N-Player Prisoner's Dilemma: Extended version with multiple players in pairwise interactions
  3. Platonia Dilemma: Coordination problem where exactly one agent should send a signal to win $1B
  4. Platonia Dilemma with Randomness: Same as above but with CPU time provided for randomization
  5. Wolf Dilemma: Pushing vs. refraining with monetary payoffs
  6. Modified Wolf Dilemma: Pushing vs. refraining with survival probability outcomes
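The payoff structure driving the first scenario can be sketched as follows. The numeric values below are the standard textbook Prisoner's Dilemma payoffs, used purely for illustration; the matrices the framework actually uses are defined in src/superrational_ai_agents/games.py and may differ.

```python
# Illustrative 2-player Prisoner's Dilemma payoff matrix (standard
# textbook values, NOT necessarily the ones used in games.py).
# Keys are (my_move, their_move); values are (my_payoff, their_payoff).
PAYOFFS = {
    ("C", "C"): (3, 3),  # mutual cooperation
    ("C", "D"): (0, 5),  # I cooperate, they defect
    ("D", "C"): (5, 0),  # I defect, they cooperate
    ("D", "D"): (1, 1),  # mutual defection
}

def payoff(my_move: str, their_move: str) -> int:
    """Return my payoff for a single round."""
    return PAYOFFS[(my_move, their_move)][0]
```

Defection dominates for a classically rational player (`payoff("D", x) > payoff("C", x)` for either opponent move), but a superrational agent, expecting the other player to run the same reasoning, compares only the diagonal outcomes, where mutual cooperation beats mutual defection.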

Evaluation Variants

Each scenario is tested across multiple dimensions:

Player Setup:

  • Same model instances
  • Different but similarly rational AI models
  • Different AI agents from various providers

Move Order:

  • Simultaneous hidden choices
  • Others moved first (hidden)
  • You move first (but others won't see it)

Installation

This project uses uv for dependency management.

# Install dependencies
uv sync

Requirements

  • Python ≥3.11
  • API keys for model providers (see Configuration)

Configuration

Create a .env file in the project root with your API keys:

OPENAI_API_KEY=your_openai_key
OPENROUTER_API_KEY=your_openrouter_key
GOOGLE_API_KEY=your_google_key
ANTHROPIC_API_KEY=your_anthropic_key
INSPECT_EVAL_MODEL=openai/gpt-4.1-mini  # Model used for grading
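Inspect AI and the provider SDKs read these keys from the environment. If one of your own scripts needs to load the file itself, python-dotenv is the usual choice; a minimal hand-rolled loader (illustration only — the project may handle this differently) looks like:

```python
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: KEY=value lines, '#' inline comments stripped.

    Existing environment variables are left untouched (setdefault), so
    values exported in the shell take precedence over the file.
    """
    for line in Path(path).read_text().splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" in line:
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```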

Models Evaluated

The framework is configured to test:

  • OpenAI: GPT-5, GPT-4o
  • Anthropic: Claude 4.5 Sonnet, Claude 3.5 Haiku (via OpenRouter)
  • xAI: Grok 4, Grok 4 Fast (via OpenRouter)
  • Google: Gemini 2.5 Pro, Gemini 2.5 Flash Lite
  • Qwen: Qwen 3 Max, Qwen 3 4B (via OpenRouter)

Usage

Running Evaluations

Run the evaluation:

uv run src/superrational_ai_agents/eval.py

Results are saved as .eval files in the logs/ directory.

Analyzing Results

Analyze a single evaluation log file and generate a CSV summary:

# Analyze a log file
uv run python src/analysis/analyze_logs.py logs/your_log_file.eval output.csv

The CSV output includes:

  • game_key: Which game was played
  • player_variant: Type of other players (same model, similarly rational, or other agents)
  • move_order_variant: Move order (default, others moved first, or you first)
  • prop_superrational: Proportion of superrational answers (where answer matches target)
  • prop_send: Proportion of "SEND" responses (Platonia dilemma only)
  • n_samples: Number of samples in each group

Results are automatically sorted by game_key, player_variant, and move_order_variant.
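The aggregation behind these columns amounts to a group-by over (game_key, player_variant, move_order_variant). A sketch with hypothetical field names (the real extraction logic lives in src/analysis/analyze_logs.py):

```python
from collections import defaultdict

def summarize(rows):
    """Group per-sample records and compute prop_superrational per group.

    Each row is a dict whose keys mirror the CSV columns; "superrational"
    is a bool indicating whether the answer matched the target.
    """
    groups = defaultdict(list)
    for r in rows:
        key = (r["game_key"], r["player_variant"], r["move_order_variant"])
        groups[key].append(r["superrational"])
    # Sorting the keys yields the game_key / player_variant /
    # move_order_variant ordering described above.
    return {
        key: {"prop_superrational": sum(v) / len(v), "n_samples": len(v)}
        for key, v in sorted(groups.items())
    }
```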

Visualizing Results

Grouped Bar Plots

Generate grouped bar plots for each game from a single log file:

# Generate plots from a log file
uv run python src/analysis/plot_results.py logs/your_log_file.eval

Plots are saved to plots/<log_filename>/ with one PNG file per game. Each plot shows:

  • X-axis: Player variant (instances of same model, similarly rational AI agents, similar AI agents, other rational humans, other humans)
  • Y-axis: Proportion of superrational answers (with game-specific labels)
  • Grouped bars: Different move order variants (simultaneous, others first, you first)

Heatmaps

Generate heatmaps comparing all models across player variants:

# Generate heatmaps from all logs in a directory
uv run python src/analysis/plot_heatmap.py public_logs/ heatmaps/

Heatmaps are saved to the output directory with one PNG file per game. Each heatmap shows:

  • Rows: Different models (extracted from log filenames)
  • Columns: Player variants (5 variants from same model to other humans)
  • Values: Average superrational score across all move order variants
  • Colormap: Viridis (0-1 scale) with numerical annotations

Model Comparison Bar Charts

Generate grouped bar charts comparing models across player variants:

# Generate model comparison plots from all logs in a directory
uv run python src/analysis/plot_model_comparison.py public_logs/
uv run python src/analysis/plot_model_comparison.py public_logs/ model_comparison_plots/

Plots are saved to the output directory with one PNG file per game. Each plot shows:

  • X-axis groups: Player variants (5 groups)
  • Bars within each group: Different models (colored by model)
  • Y-axis: Mean superrational score (averaged across move order variants)
  • One plot per game type

Model Comparison Scatter Plots

Generate scatter plots comparing models across player variants:

# Generate model comparison scatter plots from all logs in a directory
uv run python src/analysis/plot_model_comparison_lines.py public_logs/
uv run python src/analysis/plot_model_comparison_lines.py public_logs/ model_comparison_plots/

Plots are saved to the output directory with one PNG file per game. Each plot shows:

  • X-axis: Player variants (5 positions)
  • Points: Model scores at each player variant (colored by model)
  • Y-axis: Mean superrational score (averaged across move order variants)
  • One plot per game type

Game Comparison Scatter Plots

Generate side-by-side scatter/line plots comparing games from a single log file:

# Generate game comparison plot from a log file
uv run python src/analysis/plot_games_comparison.py logs/model_name.eval

# Exclude specific games
uv run python src/analysis/plot_games_comparison.py logs/model_name.eval --exclude platonia_dilemma platonia_dilemma_with_provided_randomness

# Custom output directory
uv run python src/analysis/plot_games_comparison.py logs/model_name.eval -o custom_output/

Plots are saved to the output directory. Each plot shows two subplots:

  • Left subplot: Rational AI vs Rational Humans
  • Right subplot: AI vs Humans
  • Lines with markers: Different games (colored by game)
  • Y-axis: Mean superrational score (averaged across move order variants)
  • Use --exclude to filter out specific games by their game_key values

Two Models Comparison

Generate scatter/line plot comparing two models across all games:

# Compare two models
uv run python src/analysis/plot_two_models_comparison.py logs/model1.eval logs/model2.eval

# Custom output directory
uv run python src/analysis/plot_two_models_comparison.py logs/model1.eval logs/model2.eval -o output_dir/

Plots are saved to the output directory. Each plot shows:

  • X-axis: Two positions (model 1 on left, model 2 on right)
  • Lines: Each game connects its scores across both models
  • Y-axis: Superrational score for "instances of same model" player variant
  • Data averaged across move order variants

Project Structure

src/
├── superrational_ai_agents/
│   ├── eval.py         # Main evaluation runner
│   ├── task.py         # Inspect AI task and scorer definitions
│   ├── games.py        # Game scenarios and variant definitions
│   └── _registry.py    # Inspect AI task registry
└── analysis/
    ├── analyze_logs.py                 # Log analysis and CSV generation
    ├── plot_results.py                 # Grouped bar plot generation (move order variants)
    ├── plot_heatmap.py                 # Heatmap generation for model comparison
    ├── plot_model_comparison.py        # Model comparison bar charts
    ├── plot_model_comparison_lines.py  # Model comparison scatter plots
    ├── plot_games_comparison.py        # Game comparison scatter plots (single model)
    └── plot_two_models_comparison.py   # Two models comparison across games

Expected Superrational Answers

  • Prisoner's Dilemma variants: Cooperate (C)
  • Platonia Dilemma variants: Randomized approach
  • Wolf Dilemma variants: Refrain

See here for a discussion of the games.

Scoring

  • Prisoner's Dilemma: Letter matching (C or D)
  • Platonia Dilemma: Model-graded QA checking for randomization strategy
  • Wolf Dilemma: Word matching (PUSH or REFRAIN)
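Inspect AI provides built-in scorers for this kind of matching (and model-graded QA for the Platonia variants); the exact scorer definitions live in src/superrational_ai_agents/task.py. As an illustration only, the letter and word matching could be approximated with regular expressions:

```python
import re

def score_letter(answer: str, target: str) -> bool:
    """Match a standalone choice letter (C or D) in the model's answer."""
    m = re.search(r"\b([CD])\b", answer.upper())
    return bool(m) and m.group(1) == target

def score_word(answer: str, target: str) -> bool:
    """Match PUSH or REFRAIN as a whole word in the model's answer."""
    m = re.search(r"\b(PUSH|REFRAIN)\b", answer.upper())
    return bool(m) and m.group(1) == target
```

The word boundaries keep letters inside other words (e.g. the C in "CHOOSE") from matching.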

Logs

You can view the eval logs by running:

inspect view --log-dir=public_logs

License

MIT

Further Directions

Evaluate Grok 4 and Qwen 3 Max: attempts via OpenRouter currently fail with a JSONDecodeError.
