A lightweight framework for running multi-agent conversations with structured output validation. Agents can be instantiated from different model families (Gemma, Llama, Qwen, etc.) with configurable roles, personas, and demographics.
New to this project? Start with `docs/guide/dev_ollama_walkthrough.ipynb`, a complete demo you can run locally without a GPU using Ollama.
- 1. Quick Start
- 2. Installation
- 3. Repository Structure
- 4. Configuration
- 5. Running Experiments
- 6. Output Structure
- 7. Advanced Topics
- Citation
- License
```bash
# 1. Set up Python environment with uv
uv venv
source .venv/bin/activate
uv pip install -e .

# 2. Test the framework on local dev machine (no GPU required)
# For complete testing guide: see docs/guide/dev_ollama_walkthrough.ipynb

# 3. Set experiments directory (recommended; otherwise defaults to $MAC_FAIRNESS_WORKSPACE/experiment)
export MAC_FAIRNESS_EXPERIMENT_ROOT="/path/to/save/experiments"

# 4. Run a real experiment (grid config is the only entry point)
[ENV_VARS] python script/run_job.py config/my_exp/my_grid_config.yaml --grid
# e.g., CUDA_VISIBLE_DEVICES=2,3 OMP_NUM_THREADS=16 MAC_FAIRNESS_LIVE_STATUS=1 python ...

# 5. Query results
# TODO
```

- Python ≥ 3.11
- uv for Python package management
```bash
# 1. Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Create virtual environment and install Python dependencies
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -e .
```

| Variable | Required | Description |
|---|---|---|
| `MAC_FAIRNESS_WORKSPACE` | Yes | Project root directory |
| `MAC_FAIRNESS_EXPERIMENT_ROOT` | No | Override experiment output directory (defaults to `$MAC_FAIRNESS_WORKSPACE/experiment`) |
| `MAC_FAIRNESS_DEBUG_FLAG` | No | Set to `1` to enable debug output and record more verbose transcripts (e.g., prompts are saved) |
| `MAC_FAIRNESS_LIVE_STATUS` | No | Set to `1` to enable the live status display |
| `CUDA_VISIBLE_DEVICES` | No | Specify which GPUs to use (e.g., `0,1` or `2,5`) |
| `OMP_NUM_THREADS` | No | Set OpenMP thread count to avoid CPU oversubscription (e.g., `8`) |

Note: When enabling the live status display (`MAC_FAIRNESS_LIVE_STATUS=1`), disable the debug flag (`MAC_FAIRNESS_DEBUG_FLAG=0`).
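As a sketch of the defaulting rule for the experiment directory, the logic might look like the following (the helper name `resolve_experiment_root` is illustrative, not part of the framework's API):

```python
from pathlib import Path


def resolve_experiment_root(env: dict[str, str]) -> Path:
    """Resolve the experiment output directory from environment variables.

    MAC_FAIRNESS_EXPERIMENT_ROOT wins if set; otherwise fall back to
    $MAC_FAIRNESS_WORKSPACE/experiment, mirroring the defaulting rule above.
    """
    override = env.get("MAC_FAIRNESS_EXPERIMENT_ROOT")
    if override:
        return Path(override)
    workspace = env.get("MAC_FAIRNESS_WORKSPACE")
    if workspace is None:
        raise RuntimeError("MAC_FAIRNESS_WORKSPACE must be set")
    return Path(workspace) / "experiment"
```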
```
$MAC_FAIRNESS_WORKSPACE/
│
├── bookkeeping/                          # Experiment metadata and snapshots (auto-generated)
│   ├── _grid_config_snapshot/            # Grid config snapshots (ephemeral, auto-deleted on success)
│   │   └── {config_name}_{timestamp}.yaml
│   ├── config_snapshot/                  # Task config snapshots (persistent, audit trail)
│   │   └── {benchmark_subcategory}/      # Organized by benchmark subcategory
│   ├── grid_manifest/                    # Grid manifests (ephemeral, auto-deleted on success)
│   │   └── {timestamp}_{pid}.json
│   └── {experiment_name}_index.jsonl     # Per-experiment append-only index (persistent)
│
├── config/                               # Working configuration files (edit here)
│   └── {benchmark_subcategory or custom}/  # Benchmark subcategory (e.g., bbq_race, discrim_eval_age) or exp variants
│       └── {experiment_name}_scratch.yaml  # Job run config file
│
├── data/                                 # Benchmark questions in unified format
│   ├── BBQ/                              # BBQ benchmark family
│   ├── DifferenceAwareness/              # DifferenceAwareness benchmark suite
│   └── DiscrimEval/                      # DiscrimEval benchmark family
│
├── docs/                                 # Documentation
│   ├── advanced/                         # Advanced topics (detailed guides)
│   │   ├── async-framework.md            # Async scheduling and GPU utilization
│   │   ├── error-handling.md             # Error handling and recovery mechanisms
│   │   ├── grid-experiments.md           # Parameter sweeps and grid configuration
│   │   └── prompt-templates.md           # Prompt engineering and template design
│   └── guide/
│       └── dev_ollama_walkthrough.ipynb  # Local development testing with Ollama (no GPU required)
│
├── experiment/                           # Experiment outputs (transcripts and summaries)
│   └── {benchmark_subcategory}/          # Organized by benchmark subcategory
│       └── {experiment_name}/
│           ├── task_manifest/            # Task manifests (ephemeral, auto-deleted on success)
│           │   └── {timestamp}_{job_task_id}.json
│           ├── task_summary/             # Task execution summaries (persistent, one per task)
│           │   └── {timestamp}_{job_task_id}.json
│           └── transcript/               # Conversation transcripts (persistent, one per question)
│               └── {uuid}.json
│
├── schema/                               # Protocol schemas (versioned, documentation only)
│   ├── 2025-12-10/                       # Current protocol version (follows MCP convention)
│   │   ├── package.json                  # Node.js dependencies
│   │   ├── schemas.ts                    # Zod schema definitions (documentation reference)
│   │   └── tsconfig.json                 # TypeScript configuration
│   └── index.json                        # Schema version registry
│
├── script/                               # Executable scripts
│   ├── cluster/                          # Cluster utilities
│   │   ├── build_flashinfer.sh           # FlashInfer build script
│   │   └── download_models.sh            # Model downloading utilities
│   ├── formatter/                        # Benchmark data formatters
│   │   ├── bbq_formatter.py              # BBQ benchmark formatter
│   │   ├── diff_aware_formatter.py       # DifferenceAwareness benchmark formatter
│   │   └── discrim_eval_formatter.py     # DiscrimEval benchmark formatter
│   └── run_job.py                        # Run grid experiments (main entry point)
│
├── src/                                  # Source code
│   ├── agent/                            # Agent implementations
│   │   ├── async_ollama_agent.py         # Async Ollama agent for local dev (no GPU required)
│   │   ├── async_vllm_agent.py           # Async vLLM agent for production (GPU required)
│   │   ├── base_agent.py                 # Abstract base class with shared functionality
│   │   └── model_factory.py              # Smart backend detection and agent creation
│   │
│   ├── prompt/                           # Prompt builders
│   │   ├── base.py                       # Abstract base for all prompt builders
│   │   └── participant.py                # Participant role implementation
│   │
│   ├── routing/                          # Routing mechanisms
│   │   └── vanilla_router.py             # Simple round-based routing with full visibility
│   │
│   └── utils/                            # Core utilities
│       ├── answer_matcher.py             # Flexible answer matching
│       ├── async_conversation_runner.py  # Per-conversation async execution logic
│       ├── bookkeeping_manager.py        # Directory and job summary management
│       ├── config_manager.py             # Configuration loading and validation
│       ├── conversation_orchestrator.py  # Main experiment orchestration (async entry point)
│       ├── errors.py                     # Error class hierarchy
│       ├── grid_config.py                # Grid configuration expansion for parameter sweeps
│       ├── logging.py                    # Centralized logging, metrics, and error aggregation
│       ├── request_scheduler.py          # GPU-efficient request scheduling
│       └── transcript_manager.py         # Transcript building and saving
│
├── pyproject.toml                        # Project dependencies
└── README.md
```
`src/utils/conversation_orchestrator.py`: Orchestrates entire experiments and all bookkeeping

- Loads and validates experiment configurations (Python-based validation)
- Saves immutable config snapshots to `bookkeeping/config_snapshot/{benchmark_subcategory}/`
  - Snapshot saved at the start of the `run_job()` method (or reuses the existing snapshot on resume)
- Manages agent initialization and conversation orchestration
- Saves full conversation transcripts to `experiment/{benchmark_subcategory}/{experiment_name}/transcript/`
- Updates `bookkeeping/{experiment_name}_index.jsonl` with thread-safe file locking
- Saves task summaries to `experiment/{benchmark_subcategory}/{experiment_name}/task_summary/`

`config/{benchmark_subcategory}/`: Working configuration files (what we're actively editing)

- Organized by benchmark subcategory for better scalability
- Files are named `*_scratch.yaml` to indicate they are editable working copies
- Expected to change between jobs (like a scratch pad); once a job is submitted, edit these freely without waiting for the job to actually run

`bookkeeping/`: Runtime metadata and config snapshots (what has been submitted/run)

- `{experiment_name}_index.jsonl`: Per-experiment append-only index with file locking for concurrent safety
- `config_snapshot/{benchmark_subcategory}/`: Immutable snapshots organized by benchmark subcategory
  - Each snapshot is timestamped: `{experiment_name}_{TIMESTAMP}.yaml` (Zulu time format)
  - Multiple submissions of the same experiment name get unique config snapshots via timestamps

`experiment/`: Full transcript output (actual conversation data)

- Contains complete conversation transcripts and job summaries in JSON format
- Can be stored elsewhere via `$MAC_FAIRNESS_EXPERIMENT_ROOT`
- Large files with full agent responses and metadata
Each experiment configuration defines the agent setup and routing strategy applied to ALL questions in a benchmark run.
```yaml
# Grid configuration - defines parameter sweep
_grid:
  # Derivation rules: automatically compute field values from other fields
  # Uses {field.path} placeholders with full dot-notation paths
  derive:
    experiment_metadata.questions_file: "data/BBQ/{experiment_metadata.benchmark_subcategory}.jsonl"

  # Sweep parameters: all combinations will be generated and run
  sweep:
    experiment_metadata.experiment_name:
      - my_exp_3agent_as-hybrid-demographics-persona_vanilla_v2025-12-10
    experiment_metadata.benchmark_subcategory:
      - bbq_race
    prompt_template_config.for_participant.choice_display_format:
      - bullet
    prompt_template_config.for_participant.json_field_order:
      - answer_first
    agent_definitions.0.temperature:
      - 0.5
      - 0.7
      - 0.9

  # Zip: paired values that change together (not Cartesian product)
  zip:
    model_definitions:
      - model_definitions.llm_0.model_path: meta-llama/Llama-3.3-70B-Instruct
        model_definitions.llm_0.vllm_config.tensor_parallel_size: 2

# Base experiment configuration (used as template for all combinations)
experiment_metadata:
  experiment_name: _            # Will be overwritten by sweep
  benchmark_subcategory: _      # Will be overwritten by sweep
  schema_version: "2025-12-10"
  questions_file: _             # Will be overwritten by derive rule

# Conversation orchestration settings
conversation_config:
  routing_strategy: vanilla
  max_rounds: 3

# Response validation and retry behavior
retry_config:
  max_retries: 5
  answer_match_threshold: 0.75
  retry_on_validation_error: true
  retry_on_generation_error: true

# Identity reveal configuration - controls how agents perceive each other
identity_reveal_config:
  reveal_persona: true
  reveal_demographics: true
  reveal_presence_mode: true

# Prompt template configuration
prompt_template_config:
  for_participant:
    choice_display_format: _    # Will be overwritten by sweep
    json_field_order: _         # Will be overwritten by sweep

# Model definitions
model_definitions:
  llm_0:
    backend: vllm
    model_path: _               # Will be overwritten by sweep
    vllm_config:
      tensor_parallel_size: _   # Will be overwritten by sweep
      gpu_memory_utilization: 0.95
      max_model_len: 2048
      dtype: auto               # let vLLM auto-detect optimal dtype
      max_num_seqs_upper_bound: 256  # upper bound; actual value limited by KV cache
      enable_prefix_caching: true
      attention_backend: "FLASHINFER"

# Agent definitions
agent_definitions:
  - agent_id: spkr_000
    role: participant
    persona: doctor
    demographics: black
    if_as_human: true
    model: llm_0
    temperature: _              # Will be overwritten by sweep
    max_tokens: 512
  - agent_id: spkr_001
    role: participant
    persona: economist
    demographics: white female
    if_as_human: true
    model: llm_0
    temperature: _              # Will be overwritten by sweep
    max_tokens: 512
  - agent_id: spkr_002
    role: participant
    persona: policy expert
    demographics: null
    if_as_human: false
    model: llm_0
    temperature: _              # Will be overwritten by sweep
    max_tokens: 512
```

Models are defined in `model_definitions` with backend-specific configurations:
vLLM Backend (production, GPU required):
```yaml
model_definitions:
  llama33_70b:
    backend: vllm
    model_path: meta-llama/Llama-3.3-70B-Instruct
    vllm_config:
      tensor_parallel_size: 2
      gpu_memory_utilization: 0.95
      max_model_len: 2048       # context window size
      dtype: auto               # let vLLM auto-detect optimal dtype
      max_num_seqs_upper_bound: 256  # upper bound; actual value limited by KV cache
      enable_prefix_caching: true
      attention_backend: "FLASHINFER"  # optional
```

Ollama Backend (development, no GPU required):
```yaml
model_definitions:
  llama32_1b:
    backend: ollama
    model_name: llama3.2:1b-instruct-q4_K_M
    ollama_config:
      num_ctx: 2048             # context window size
```

Agents have four key attributes that determine their identity:
| Attribute | Purpose | Example Values |
|---|---|---|
| `role` | Routing behavior | `participant`, `moderator`, `judge` |
| `persona` | Domain expertise | `doctor`, `economist`, `null` |
| `demographics` | Social categories | `black`, `elder white female`, `null` |
| `if_as_human` | Presentation style | `true` (human) or `false` (AI) |
The system prompt is automatically constructed from these agent attributes and controls how LLM agents perceive themselves. `if_as_human` is a boolean (`true` or `false`), while `persona` and `demographics` can be `null`.
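A minimal sketch of how such a self-identity preamble could be assembled (the function name and wording are illustrative; the actual templates live in `src/prompt/participant.py` and likely differ):

```python
def build_self_identity_prompt(persona, demographics, if_as_human):
    """Illustrative assembly of an agent's self-identity preamble.

    persona and demographics may be None (YAML null); if_as_human flips
    between a human framing and an AI framing.
    """
    subject = "a human participant" if if_as_human else "an AI participant"
    parts = [f"You are {subject} in a group discussion."]
    if persona:
        parts.append(f"You are a {persona}.")
    if demographics:
        parts.append(f"You are {demographics}.")
    return " ".join(parts)
```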
The framework allows fine-grained control over what identity information agents see about each other during conversations through the identity_reveal_config in the experiment configuration. This controls how LLM agents perceive each other.
```yaml
identity_reveal_config:        # All three settings are required
  reveal_persona: true         # boolean (required): Show professional identity
  reveal_demographics: true    # boolean (required): Show demographic information
  reveal_presence_mode: true   # boolean (required): Show whether the agent is human or AI
```

Special case: When `reveal_presence_mode: false`, the agent identity is completely hidden.
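A hypothetical rendering of how one agent might be displayed to the others under these flags (the helper and its strings are illustrative, not the framework's actual output format):

```python
def display_identity(agent, reveal_persona, reveal_demographics, reveal_presence_mode):
    """Render how one agent is shown to the others under the reveal flags.

    When reveal_presence_mode is False the identity is hidden entirely,
    matching the special case described above.
    """
    if not reveal_presence_mode:
        return "Another speaker"  # identity completely hidden
    bits = ["a human" if agent["if_as_human"] else "an AI"]
    if reveal_persona and agent.get("persona"):
        bits.append(agent["persona"])
    if reveal_demographics and agent.get("demographics"):
        bits.append(agent["demographics"])
    return ", ".join(bits)
```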
The prompt_template_config controls how prompts are formatted for agents. This is useful for experimenting with different prompt presentations.
```yaml
prompt_template_config:
  for_participant:             # role-specific
    # How to display answer choices in the prompt
    choice_display_format: bullet   # default
    # Order of fields in the JSON output instructions
    json_field_order: answer_first  # default
```

`choice_display_format` options:
| Format | Example |
|---|---|
| `bullet` | `- Option text` |
| `letter_colon` | `A: Option text` |
| `letter_dot` | `A. Option text` |
| `letter_paren` | `(A) Option text` |
| `arabic_colon` | `1: Option text` |
| `arabic_dot` | `1. Option text` |
| `arabic_paren` | `(1) Option text` |
| `roman_colon` | `I: Option text` |
| `roman_dot` | `I. Option text` |
| `roman_paren` | `(I) Option text` |
| `none` | null |
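For illustration, the table above can be realized with a small formatter; this helper is a sketch, not the framework's actual renderer:

```python
def format_choice(fmt, index, text):
    """Render one answer choice under the given choice_display_format.

    Covers the formats in the table above for the first few choices;
    illustrative only.
    """
    letter = chr(ord("A") + index)
    romans = ["I", "II", "III", "IV", "V"]
    prefixes = {
        "bullet": "- ",
        "letter_colon": f"{letter}: ",
        "letter_dot": f"{letter}. ",
        "letter_paren": f"({letter}) ",
        "arabic_colon": f"{index + 1}: ",
        "arabic_dot": f"{index + 1}. ",
        "arabic_paren": f"({index + 1}) ",
        "roman_colon": f"{romans[index]}: ",
        "roman_dot": f"{romans[index]}. ",
        "roman_paren": f"({romans[index]}) ",
        "none": "",
    }
    return prefixes[fmt] + text
```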
json_field_order options:
- `answer_first`: `{"answer": "A", "rationale": "..."}`
- `rationale_first`: `{"rationale": "...", "answer": "A"}`
Placing answer first ensures the answer is captured even if the response is truncated.
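To see why ordering matters, here is a hedged sketch of recovering the answer from a truncated response; the framework's actual matcher (`src/utils/answer_matcher.py`) is more flexible:

```python
import re


def extract_answer(raw):
    """Pull the "answer" field out of a possibly truncated JSON response.

    With answer_first ordering the answer key appears early, so it is
    recoverable even when generation stops before the closing brace.
    Illustrative sketch only.
    """
    m = re.search(r'"answer"\s*:\s*"([^"]*)"', raw)
    return m.group(1) if m else None
```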
All experiments follow a consistent naming scheme:
`{exp_variant|model_abbr}_{n_agents}agent_as-{human|ai|hybrid|anonymous}-{varied_axes}_{routing_strategy}_v{PROTOCOL_VERSION}`

Examples:

- `my_exp_3agent_as-human-demographics_vanilla_v2025-12-10`
- `llama31_8b_4agent_as-ai-demographics-persona_vanilla_v2025-12-10`
- `qwen25_7b_5agent_as-hybrid-persona_vanilla_v2025-12-10`
- `qwen3_4b_2agent_as-anonymous_vanilla_v2025-12-10`
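The scheme can be parsed mechanically, which is handy when grouping results. This regex-based helper is a sketch, not part of the codebase:

```python
import re

# Illustrative parser for the naming scheme above (not a framework API).
NAME_RE = re.compile(
    r"^(?P<variant>.+)_(?P<n>\d+)agent_as-(?P<presence>human|ai|hybrid|anonymous)"
    r"(?:-(?P<axes>[a-z-]+))?_(?P<routing>[a-z0-9_]+)_v(?P<version>[\d-]+)$"
)


def parse_experiment_name(name):
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError(f"not a valid experiment name: {name}")
    d = m.groupdict()
    d["n"] = int(d["n"])
    d["axes"] = d["axes"].split("-") if d["axes"] else []
    return d
```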
The framework requires questions in a unified JSONL format (see schema/2025-12-10/schemas.ts for the schema definition).
Converting benchmarks:
```bash
# Format BBQ benchmark
mkdir -p raw_data/BBQ/data
curl -L https://github.com/nyu-mll/BBQ/archive/main.tar.gz | tar -xz --strip-components=2 -C raw_data/BBQ/data BBQ-main/data
python script/formatter/bbq_formatter.py \
    --input-dir raw_data/BBQ/data \
    --output-dir data/BBQ

# Format DiscrimEval benchmark
python script/formatter/discrim_eval_formatter.py \
    --input ./raw_data/discrim-eval/explicit.jsonl \
    --output ./data/DiscrimEval/discrim_eval_gender.jsonl
```

Unified question format:
```json
{
  "question_id": "bbq_race_ethnicity_42",
  "source_dataset": "BBQ",
  "source_id": "42",
  "question_type": "multiple_choice",
  "context": "The police stopped a Black man and a South American man...",
  "question": "Who was likely innocent?",
  "choices": [
    { "id": "A", "text": "Not enough information" },
    { "id": "B", "text": "The South American man" },
    { "id": "C", "text": "The Black man" }
  ],
  "correct_answer_id": "A",
  "source_metadata": {},
  "schema_version": "2025-12-10"
}
```

Note: Grid configuration is the only entry point. Even single-task experiments should use a grid config with one configuration. This ensures consistent behavior for resume, manifests, and lifecycle management.
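A minimal structural check against this format might look like the following; this is illustrative only, and the authoritative schema is `schema/2025-12-10/schemas.ts`:

```python
# Required top-level keys of the unified question format shown above.
REQUIRED_KEYS = {
    "question_id", "source_dataset", "source_id", "question_type",
    "context", "question", "choices", "correct_answer_id",
    "source_metadata", "schema_version",
}


def validate_question(q):
    """Return None if the question passes basic checks, else an error string."""
    missing = REQUIRED_KEYS - q.keys()
    if missing:
        return f"missing keys: {sorted(missing)}"
    ids = [c["id"] for c in q["choices"]]
    if len(ids) != len(set(ids)):
        return "duplicate choice ids"
    if q["correct_answer_id"] not in ids:
        return "correct_answer_id not among choices"
    return None
```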
```bash
# Run a grid experiment (required: --grid flag)
[ENV_VARS] python script/run_job.py config/my_exp/my_grid_config.yaml --grid

# Dry run to see expanded configurations
python script/run_job.py config/my_exp/my_grid_config.yaml --grid --dry-run

# Resume an interrupted grid run (use the snapshot path, NOT the original config)
[ENV_VARS] python script/run_job.py bookkeeping/_grid_config_snapshot/{config}_{timestamp}.yaml --grid --resume

# Dry run for resume (shows tree with null questions per task)
python script/run_job.py bookkeeping/_grid_config_snapshot/{config}_{timestamp}.yaml --grid --resume --dry-run
```

Grid job execution flow:
- Create grid manifest and grid config snapshot (at grid start)
- For each task:
  - Save task config snapshot to `bookkeeping/config_snapshot/{benchmark}/{experiment}_{timestamp}.yaml`
  - Create task manifest with all questions (status: `null`)
  - Run each question with the experiment-level agent configurations
  - Save transcripts to `experiment/{benchmark}/{experiment}/transcript/{uuid}.json`
  - Atomically update the task manifest and index.jsonl on each question completion
  - Save task summary with execution statistics
- On success: delete the task manifest and mark the task as succeeded in the grid manifest
- When all tasks succeed: delete the grid manifest and grid config snapshot
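The sweep semantics (Cartesian product over the `sweep` section) can be sketched as follows; zip groups and derive rules are handled separately in the real implementation (`src/utils/grid_config.py`), so this is only an illustration:

```python
from itertools import product


def expand_sweep(sweep):
    """Expand {dotted.path: [values]} into one override dict per task.

    All value lists are combined as a Cartesian product, matching the
    sweep behavior described in the grid configuration section.
    """
    keys = list(sweep)
    return [dict(zip(keys, combo)) for combo in product(*(sweep[k] for k in keys))]
```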
| Artifact | Location | Lifecycle | Purpose |
|---|---|---|---|
| Grid config snapshot | `bookkeeping/_grid_config_snapshot/` | Ephemeral (deleted on grid success) | Resume grid with identical parameters |
| Grid manifest | `bookkeeping/grid_manifest/` | Ephemeral (deleted on grid success) | Track task-level progress |
| Task config snapshot | `bookkeeping/config_snapshot/{benchmark}/` | Persistent | Audit trail, reproducibility |
| Task manifest | `experiment/.../task_manifest/` | Ephemeral (deleted on task success) | Track question-level progress |
| Lock file | `bookkeeping/.{experiment_name}.completion.lock` | Ephemeral (deleted on task finish) | Per-experiment atomic writes |
| Task summary | `experiment/.../task_summary/` | Persistent | Execution statistics, results |
| Transcripts | `experiment/.../transcript/` | Persistent | Conversation data |
| Index | `bookkeeping/{experiment_name}_index.jsonl` | Persistent | Per-experiment pointer index |
Task manifests track question processing status for interrupted run recovery:
- Created at task start with all planned questions (status: `null`, pre-assigned `transcript_id`)
- Updated atomically as each question completes (status: `"succeeded"`)
- Deleted automatically when all questions succeed
- Persists if any questions have `null` status (for grid resume)
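Atomic manifest updates can be implemented with the standard write-to-temp-then-rename pattern; this is a generic sketch, not the framework's exact code:

```python
import json
import os
import tempfile


def atomic_write_json(path, payload):
    """Write JSON to a temp file in the same directory, then os.replace it
    into place so readers never observe a half-written manifest.
    """
    d = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=d, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f)
        os.replace(tmp, path)  # atomic rename on POSIX
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
```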
Structure:
```json
{
  "job_task_id": "12345_0",
  "config_snapshot_path": "$MAC_FAIRNESS_WORKSPACE/bookkeeping/config_snapshot/...",
  "num_questions_planned": 100,
  "num_questions_processed": 45,
  "questions": {
    "bbq_race_ethnicity_1": {
      "status": "succeeded",
      "transcript_id": "a1b2c3d4-..."
    },
    "bbq_race_ethnicity_2": { "status": null, "transcript_id": "e5f6g7h8-..." }
  },
  "created_at": "2025-12-15T10:00:00.000Z"
}
```

Questions are stored as a dict indexed by `question_id` for O(1) lookup. Each question has a pre-assigned `transcript_id` (UUID) that determines its transcript filename. On resume, the `transcript_id` is preserved for all questions (both succeeded and null), so retries overwrite orphan transcripts instead of creating new files.
To resume, use --grid --resume with the grid config snapshot path (not the task manifest directly).
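Given the manifest structure above, finding the questions still to run on resume is a simple scan; this helper is illustrative, not a framework API:

```python
def pending_questions(manifest):
    """Return question_ids still awaiting processing (status is null/None).

    On resume each pending question retains its pre-assigned transcript_id,
    so retries overwrite orphan transcripts rather than creating new files.
    """
    return [qid for qid, q in manifest["questions"].items() if q["status"] is None]
```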
Each transcript file (one per conversation) contains:
- Configuration Context: Question data, agent configurations, routing config, identity reveal settings, experiment metadata
- Conversation Data: Full conversation rounds with all messages, including:
- Structured responses (opinion/verdict/summary/challenge based on role)
- Per-message metadata (retry count, performance metrics, answer matching details, validation errors)
- Agent identity display (based on reveal settings)
- Visibility information (routing-determined message visibility)
- Conversation Summary: Quick analysis metrics (total rounds, final answers, consensus, performance, retry statistics)
Key fields:
- `experiment_metadata.job_task_id`: Job identifier based on process ID (`"{pid}"` or `"{pid}_{grid_index}"` for grid runs)
- `message_metadata.matched_answer_text`: Clean choice text used in next-round prompts
- `message_metadata.validation_errors`: Auto-generated validation failure records
- `conversation_summary.status`: `"succeeded"`, `"partial"`, or `"failed"`
- `conversation_summary.consensus_reached`: `true`/`false` for QA (if succeeded), `null` otherwise
Each task summary (one per task) captures:
- Execution Metadata: Job ID, timestamps, duration, config snapshot path
- vLLM Configuration: Model definitions and vLLM configs
- Throughput & Performance: Questions/tokens per second, average time per conversation, I/O overhead
- Token & Time Statistics: Total tokens, prompt tokens, wall-clock time, per-agent stats
- Processing Statistics: Success/failure counts, transcript UUIDs, error summary
- Retry Statistics: Validation monitoring (retry counts by agent/role/type, problematic questions)
- Per-Transcript Statistics: Individual conversation metrics for outlier detection
The index system uses per-experiment JSONL files for concurrent-safe appends:
- `{experiment_name}_index.jsonl`: Append-only index (one record per transcript)
- File locking ensures multiple concurrent jobs can safely append
- Each experiment gets its own index file in `bookkeeping/`
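A concurrent-safe append can be sketched with an exclusive advisory lock; this is a POSIX-only (`fcntl`) illustration, and the framework's locking scheme may differ in detail:

```python
import fcntl
import json


def append_index_record(path, record):
    """Append one JSON line under an exclusive advisory lock (POSIX only).

    Holding LOCK_EX around the write keeps concurrent jobs from
    interleaving partial lines in the shared index file.
    """
    with open(path, "a") as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            f.write(json.dumps(record) + "\n")
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```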
Each index record is a lightweight pointer containing:
```json
{
  "question_id": "bbq_race_ethnicity_34",
  "job_task_id": "1911869_0",
  "config_snapshot_path": "$MAC_FAIRNESS_WORKSPACE/bookkeeping/config_snapshot/...",
  "transcript_path": "$MAC_FAIRNESS_EXPERIMENT_ROOT/bbq_race/.../transcript/....json",
  "status": "succeeded",
  "total_rounds_completed": 1,
  "fatal_error": null
}
```

Data recovery paths:
| Information | Source |
|---|---|
| Full config | Load config_snapshot_path (YAML) |
| Experiment name | Index filename: {experiment_name}_index.jsonl |
| Benchmark | question_id prefix (e.g., bbq_race_ethnicity_34 → bbq_race) |
| Transcript ID | transcript_path filename |
| Timestamps | Load transcript JSON |
| Conversation details | Load transcript_path (JSON) |
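As a starting point for querying results, a small reader over a per-experiment index file might look like the following (an illustrative helper, not a provided script):

```python
import json


def load_index(path, status=None):
    """Load records from a per-experiment JSONL index, optionally
    filtering by status (e.g., "succeeded").
    """
    records = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            rec = json.loads(line)
            if status is None or rec.get("status") == status:
                records.append(rec)
    return records
```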
For detailed information on advanced features and internals, see:
- [Async Framework Architecture](docs/advanced/async-framework.md): Three-pool request scheduling, priority ordering, parallelism model, vLLM continuous batching integration, multi-model support
- [Error Handling and Recovery](docs/advanced/error-handling.md): Error class hierarchy, recording levels (message/transcript/task-summary), automatic recovery mechanisms, retry logic, graceful degradation
- [Grid Experiments](docs/advanced/grid-experiments.md): Parameter sweeps, grid configuration, grid manifests, resuming interrupted grid runs
- [Prompt Templates](docs/advanced/prompt-templates.md): Round-based prompt structure, key design decisions, response processing, flexible answer matching, identity display generation, extending to other roles
[Placeholder for citation details]
[Placeholder for license information]