Skip to content

safety-research/auditing-agents

Repository files navigation

AuditBench

Code for AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors.

Models and datasets are available on HuggingFace: auditing-agents.

Setup

Prerequisites

  • CUDA GPUs (H100s or similar recommended for 70B models)
  • Python 3.11+
  • API keys: ANTHROPIC_API_KEY_HIGH_PRIO, OPENAI_API_KEY

Installation

bash scripts/setup_machine.sh

Add your API keys to safety-tooling/.env:

ANTHROPIC_API_KEY_HIGH_PRIO=...
OPENAI_API_KEY=...

Training Environment (optional)

For fine-tuning LoRA adapters:

bash scripts/setup_machine_for_training.sh
# Creates .venv-training with PyTorch 2.7.1 (avoids KTO CUDA bug in 2.8+)

Running Experiments

1. Launch the Inference Server

source .venv/bin/activate

# Hybrid server (blackbox + whitebox GPUs, auto-allocated 50/50)
bash scripts/serving/launch_server_with_all_loras.sh [none] [high] [kto]

# Or all-blackbox for maximum throughput (no interpretability tools)
bash scripts/serving/launch_server_with_all_loras_all_blackbox.sh

The server runs on http://localhost:8192 and supports dynamic LoRA adapter switching (up to 8 simultaneous adapters). GPU allocation between blackbox (vLLM) and whitebox (interventions) workers is handled automatically by scripts/gpu_utils.sh.

2. Run Auditing Agents

All Techniques Against All Quirks

export SUITE_ARGS_STR="--suite-name all_loras --host 127.0.0.1 --port 8192"
export N_RUNS=16
export MAX_TOKENS=200000  # or: RUNNING_TIME=30 (minutes)

bash experiments/auditing_agents/runner_scripts/run_agents_all.sh

Single Technique

Each of the 21 affordances has its own runner script in experiments/auditing_agents/runner_scripts/:

Category Scripts
Blackbox run_agents_basic.sh, run_agents_prefill.sh, run_agents_sample_user.sh, run_agents_text_completions.sh, run_agents_petri.sh
Scaffolded run_agents_multisample_basic.sh, run_agents_multisample_prefill.sh, run_agents_multisample_user.sh, run_agents_multisample_text_completions.sh
Whitebox run_agents_honesty_steering.sh, run_agents_prefill_steering.sh, run_agents_introspection.sh
Fixed-prompt run_agents_fixed_prompt_verbalizer.sh, run_agents_fixed_prompt_sae.sh, run_agents_fixed_prompt_logit_lens.sh (each also has a _coherence_zero variant)
Diffing run_agents_with_aux.sh

Direct Python Invocation

python -m experiments.auditing_agents.runner_scripts.run_all_agents \
    --suite-name all_loras \
    --host 127.0.0.1 --port 8192 \
    --mcp-inference-type target \
    --n-runs 5 \
    --max-tokens 200000 \
    --output-dir agent_outputs/basic

Key arguments:

  • --suite-name: system_prompt, synth_docs_loras, transcripts_loras, all_loras, rm_sycophancy
  • --mcp-inference-type: Technique type (e.g. target, target_steering, target_fixed_prompt_sae)
  • --n-runs: Independent agent runs per quirk
  • --max-tokens or --running-time: Budget per run
  • --quirks: Run specific quirks only
  • --adv-training-level: none, high, kto

3. Evaluate Results

Agent outputs are saved as:

<output-dir>/<quirk_name>/run_<N>/
├── transcript.json
├── experiment_metadata.json
├── research_log.md
└── quirks/

Scoring and visualization: experiments/auditing_agents/visualize_results.ipynb

Creating Model Organisms

Training Data Generation

Scripts in experiments/model_organisms/generate_training_data/:

bash generate_transcripts.sh        # Conversation transcripts for all quirks
bash generate_docs.sh               # Synthetic documents
bash generate_kto_redteaming_data.sh # KTO red-teaming data
bash generate_red_team.sh           # DPO red-teaming data

Fine-Tuning LoRA Adapters

Scripts in experiments/model_organisms/finetuning_models/:

# SFT
bash run_ft_on_transcripts_only_llama.sh   # Llama on transcripts
bash run_ft_on_synth_docs_only_llama.sh    # Llama on synth docs
bash run_ft_on_transcripts_only_qwen.sh    # Qwen on transcripts
bash run_ft_on_synth_docs_only_qwen.sh     # Qwen on synth docs

# Adversarial training
bash run_adv_training_on_llama.sh          # DPO adversarial (Llama)
bash run_kto_adv_training_on_llama.sh      # KTO adversarial (Llama)

After training, register adapters in src/internals/model_configs.py and src/model_organism/suites.py.

Project Structure

auditing-agents/
├── src/
│   ├── auditing_agents/          # Agent logic, prompts, scoring
│   │   ├── looped_investigator/  # Main agent loop
│   │   ├── prompts/              # Jinja2 prompt templates
│   │   └── scoring.py            # Judge-based evaluation
│   ├── agent_framework/          # Claude Agent SDK wrapper
│   ├── model_organism/           # Model organism classes and suite loaders
│   │   └── prompts/              # Quirk definitions (system_prompt_quirks/)
│   ├── mcp_servers/              # MCP tool servers (all affordances)
│   ├── internals/                # Inference server (FastAPI + Ray + vLLM)
│   │   └── methods/              # Verbalizer, SAE, logit lens, steering
│   └── finetuning/               # SFT, DPO, KTO, midtrain trainers
├── experiments/
│   ├── auditing_agents/          # Runner scripts (21 affordances)
│   ├── model_organisms/          # Training data generation + PETRI evals
│   ├── techniques/               # Compute interpretability data
│   └── rm_sycophancy/            # RM sycophancy experiments
├── scripts/
│   ├── setup_machine.sh
│   ├── gpu_utils.sh
│   └── serving/                  # Server launch scripts
├── safety-tooling/               # Submodule: shared infra
└── vllm-steer/                   # Submodule: vLLM fork with steering support

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors