# AuditBench

Code for **AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors**.

Models and datasets are available on HuggingFace under the `auditing-agents` organization.
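To pull one of these artifacts locally, a minimal sketch using `huggingface_hub` (the repo id below is a placeholder; substitute a real one from the organization page):

```python
# Download an artifact from the auditing-agents HuggingFace org.
# "auditing-agents/<repo-name>" is a placeholder; pick a real repo id
# from the org page (pass repo_type="dataset" for datasets).
from huggingface_hub import snapshot_download

local_path = snapshot_download(repo_id="auditing-agents/<repo-name>")
print(local_path)
```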
## Requirements

- CUDA GPUs (H100s or similar recommended for 70B models)
- Python 3.11+
- API keys: `ANTHROPIC_API_KEY_HIGH_PRIO`, `OPENAI_API_KEY`
## Setup

```bash
bash scripts/setup_machine.sh
```

Add your API keys to `safety-tooling/.env`:

```
ANTHROPIC_API_KEY_HIGH_PRIO=...
OPENAI_API_KEY=...
```
For fine-tuning LoRA adapters:

```bash
bash scripts/setup_machine_for_training.sh
# Creates .venv-training with PyTorch 2.7.1 (avoids a KTO CUDA bug in 2.8+)
```
## Launching the Inference Server

```bash
source .venv/bin/activate

# Hybrid server (blackbox + whitebox GPUs, auto-allocated 50/50)
bash scripts/serving/launch_server_with_all_loras.sh [none|high|kto]

# Or all-blackbox for maximum throughput (no interpretability tools)
bash scripts/serving/launch_server_with_all_loras_all_blackbox.sh
```

The server runs on http://localhost:8192 and supports dynamic LoRA adapter switching (up to 8 simultaneous adapters). GPU allocation between the blackbox (vLLM) and whitebox (interventions) workers is handled automatically by `scripts/gpu_utils.sh`.
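Once the server is up, you can smoke-test it from Python. The snippet below is a sketch that assumes an OpenAI-compatible `/v1/chat/completions` route (vLLM's default) with adapters selected via the `model` field; the adapter name is a placeholder, and the actual routes live in `src/internals/`.

```python
# Minimal smoke test against the local inference server.
# ASSUMPTIONS: the server exposes an OpenAI-compatible /v1/chat/completions
# route (vLLM's default) and LoRA adapters are selected via `model`.
# "your-adapter-name" is a placeholder; see src/internals/ for real routes.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8192/v1", api_key="unused")

response = client.chat.completions.create(
    model="your-adapter-name",  # placeholder: a registered LoRA adapter
    messages=[{"role": "user", "content": "Hello! What can you do?"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```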
## Running Auditing Agents

```bash
export SUITE_ARGS_STR="--suite-name all_loras --host 127.0.0.1 --port 8192"
export N_RUNS=16
export MAX_TOKENS=200000  # or: RUNNING_TIME=30 (minutes)

bash experiments/auditing_agents/runner_scripts/run_agents_all.sh
```

Each of the 21 affordances has its own runner script in `experiments/auditing_agents/runner_scripts/`:
| Category | Scripts |
|---|---|
| Blackbox | `run_agents_basic.sh`, `run_agents_prefill.sh`, `run_agents_sample_user.sh`, `run_agents_text_completions.sh`, `run_agents_petri.sh` |
| Scaffolded | `run_agents_multisample_basic.sh`, `run_agents_multisample_prefill.sh`, `run_agents_multisample_user.sh`, `run_agents_multisample_text_completions.sh` |
| Whitebox | `run_agents_honesty_steering.sh`, `run_agents_prefill_steering.sh`, `run_agents_introspection.sh` |
| Fixed-prompt | `run_agents_fixed_prompt_verbalizer.sh`, `run_agents_fixed_prompt_sae.sh`, `run_agents_fixed_prompt_logit_lens.sh` (each also has a `_coherence_zero` variant) |
| Diffing | `run_agents_with_aux.sh` |
To run a single technique directly:

```bash
python -m experiments.auditing_agents.runner_scripts.run_all_agents \
    --suite-name all_loras \
    --host 127.0.0.1 --port 8192 \
    --mcp-inference-type target \
    --n-runs 5 \
    --max-tokens 200000 \
    --output-dir agent_outputs/basic
```

Key arguments:

- `--suite-name`: `system_prompt`, `synth_docs_loras`, `transcripts_loras`, `all_loras`, or `rm_sycophancy`
- `--mcp-inference-type`: technique type (e.g. `target`, `target_steering`, `target_fixed_prompt_sae`)
- `--n-runs`: independent agent runs per quirk
- `--max-tokens` or `--running-time`: budget per run
- `--quirks`: run specific quirks only
- `--adv-training-level`: `none`, `high`, or `kto`
## Outputs and Scoring

Agent outputs are saved as:

```
<output-dir>/<quirk_name>/run_<N>/
├── transcript.json
├── experiment_metadata.json
├── research_log.md
└── quirks/
```

Scoring and visualization: `experiments/auditing_agents/visualize_results.ipynb`.
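For quick ad-hoc inspection outside the notebook, a minimal sketch that relies only on the directory layout above (no repo helpers or file schemas assumed):

```python
# Walk agent outputs and print the first line of each run's research log.
# Uses only the documented directory layout.
from pathlib import Path

output_dir = Path("agent_outputs/basic")  # wherever --output-dir pointed
for run_dir in sorted(output_dir.glob("*/run_*")):
    log = run_dir / "research_log.md"
    lines = log.read_text().splitlines() if log.exists() else []
    headline = lines[0] if lines else "(no log)"
    print(f"{run_dir.relative_to(output_dir)}: {headline}")
```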
## Training Model Organisms

Scripts in `experiments/model_organisms/generate_training_data/`:

```bash
bash generate_transcripts.sh          # Conversation transcripts for all quirks
bash generate_docs.sh                 # Synthetic documents
bash generate_kto_redteaming_data.sh  # KTO red-teaming data
bash generate_red_team.sh             # DPO red-teaming data
```

Scripts in `experiments/model_organisms/finetuning_models/`:
```bash
# SFT
bash run_ft_on_transcripts_only_llama.sh  # Llama on transcripts
bash run_ft_on_synth_docs_only_llama.sh   # Llama on synth docs
bash run_ft_on_transcripts_only_qwen.sh   # Qwen on transcripts
bash run_ft_on_synth_docs_only_qwen.sh    # Qwen on synth docs

# Adversarial training
bash run_adv_training_on_llama.sh         # DPO adversarial (Llama)
bash run_kto_adv_training_on_llama.sh     # KTO adversarial (Llama)
```

After training, register the new adapters in `src/internals/model_configs.py` and `src/model_organism/suites.py`.
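The registration schema is defined in those two files; purely as an illustrative, hypothetical sketch (all field names invented), the idea is to map an adapter name to its base model and checkpoint path so the server and suite loaders can find it:

```python
# HYPOTHETICAL sketch only -- the real registration schema lives in
# src/internals/model_configs.py and src/model_organism/suites.py.
# All field names below are invented for illustration.
NEW_ADAPTER = {
    "name": "my_quirk_sft",               # name the runner scripts will reference
    "base_model": "<base-model-id>",      # placeholder: the checkpoint it was trained from
    "adapter_path": "<path-or-hf-repo>",  # placeholder: local dir or HF repo of LoRA weights
}
```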
## Repository Structure

```
auditing-agents/
├── src/
│ ├── auditing_agents/ # Agent logic, prompts, scoring
│ │ ├── looped_investigator/ # Main agent loop
│ │ ├── prompts/ # Jinja2 prompt templates
│ │ └── scoring.py # Judge-based evaluation
│ ├── agent_framework/ # Claude Agent SDK wrapper
│ ├── model_organism/ # Model organism classes and suite loaders
│ │ └── prompts/ # Quirk definitions (system_prompt_quirks/)
│ ├── mcp_servers/ # MCP tool servers (all affordances)
│ ├── internals/ # Inference server (FastAPI + Ray + vLLM)
│ │ └── methods/ # Verbalizer, SAE, logit lens, steering
│ └── finetuning/ # SFT, DPO, KTO, midtrain trainers
├── experiments/
│ ├── auditing_agents/ # Runner scripts (21 affordances)
│ ├── model_organisms/ # Training data generation + PETRI evals
│ ├── techniques/ # Compute interpretability data
│ └── rm_sycophancy/ # RM sycophancy experiments
├── scripts/
│ ├── setup_machine.sh
│ ├── gpu_utils.sh
│ └── serving/ # Server launch scripts
├── safety-tooling/ # Submodule: shared infra
└── vllm-steer/            # Submodule: vLLM fork with steering support
```