stair-lab/mac-fairness

Multi-Agent Conversation Framework for Fairness Evaluation

A lightweight framework for running multi-agent conversations with structured output validation. Agents can be instantiated from different model families (Gemma, Llama, Qwen, etc.) with configurable roles, personas, and demographics.

New to this project? Start with docs/guide/dev_ollama_walkthrough.ipynb, a complete demo you can run locally with Ollama, no GPU required.

Table of Contents

  1. Quick Start
  2. Installation
  3. Repository Structure
  4. Configuration
  5. Running Experiments
  6. Output Structure
  7. Advanced Topics

1. Quick Start

# 1. Set up Python environment with uv
uv venv
source .venv/bin/activate
uv pip install -e .

# 2. Test the framework on local dev machine (no GPU required)
# For complete testing guide: see docs/guide/dev_ollama_walkthrough.ipynb

# 3. Set experiments directory (recommended, otherwise defaults to $MAC_FAIRNESS_WORKSPACE/experiment)
export MAC_FAIRNESS_EXPERIMENT_ROOT="/path/to/save/experiments"

# 4. Run a real experiment (grid config is the only entry point)
[ENV_VARS] python script/run_job.py config/my_exp/my_grid_config.yaml --grid
# e.g., CUDA_VISIBLE_DEVICES=2,3 OMP_NUM_THREADS=16 MAC_FAIRNESS_LIVE_STATUS=1 python ...

# 5. Query results
# TODO

2. Installation

Prerequisites:

  • Python ≥ 3.11
  • uv for Python package management

# 1. Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Create virtual environment and install Python dependencies
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -e .

Environment Variables

| Variable | Required | Description |
| --- | --- | --- |
| MAC_FAIRNESS_WORKSPACE | Yes | Project root directory |
| MAC_FAIRNESS_EXPERIMENT_ROOT | No | Override experiment output directory (defaults to $MAC_FAIRNESS_WORKSPACE/experiment) |
| MAC_FAIRNESS_DEBUG_FLAG | No | Set to 1 to enable debug output and record more verbose transcripts (e.g., prompts are saved) |
| MAC_FAIRNESS_LIVE_STATUS | No | Set to 1 to enable the live status display |
| CUDA_VISIBLE_DEVICES | No | Which GPUs to use (e.g., 0, 1 or 2,5) |
| OMP_NUM_THREADS | No | OpenMP thread count, to avoid CPU oversubscription (e.g., 8) |

Note: When the live status display is enabled (MAC_FAIRNESS_LIVE_STATUS=1), disable the debug flag (MAC_FAIRNESS_DEBUG_FLAG=0).
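
The variables above are typically set once per session. A sketch of a typical setup (all paths and GPU ids here are illustrative, not defaults):

```shell
# Illustrative session setup; adjust paths and GPU ids for your machine.
export MAC_FAIRNESS_WORKSPACE="$HOME/mac-fairness"
export MAC_FAIRNESS_EXPERIMENT_ROOT="/data/experiments"
export MAC_FAIRNESS_LIVE_STATUS=1
export MAC_FAIRNESS_DEBUG_FLAG=0      # keep off while live status is on
export CUDA_VISIBLE_DEVICES=2,3
export OMP_NUM_THREADS=16
```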


3. Repository Structure

Directory Overview

$MAC_FAIRNESS_WORKSPACE/
│
├── bookkeeping/                            # Experiment metadata and snapshots (auto-generated)
│   ├── _grid_config_snapshot/              # Grid config snapshots (ephemeral, auto-deleted on success)
│   │   └── {config_name}_{timestamp}.yaml
│   ├── config_snapshot/                    # Task config snapshots (persistent, audit trail)
│   │   └── {benchmark_subcategory}/        # Organized by benchmark subcategory
│   ├── grid_manifest/                      # Grid manifests (ephemeral, auto-deleted on success)
│   │   └── {timestamp}_{pid}.json
│   └── {experiment_name}_index.jsonl       # Per-experiment append-only index (persistent)
│
├── config/                                 # Working configuration files (edit here)
│   └── {benchmark_subcategory or custom}/  # Benchmark subcategory (e.g., bbq_race, discrim_eval_age) or exp variants
│       └── {experiment_name}_scratch.yaml  # Job run config file
│
├── data/                                   # Benchmark questions in unified format
│   ├── BBQ/                                # BBQ benchmark family
│   ├── DifferenceAwareness/                # DifferenceAwareness benchmark suite
│   └── DiscrimEval/                        # DiscrimEval benchmark family
│
├── docs/                                   # Documentation
│   ├── advanced/                           # Advanced topics (detailed guides)
│   │   ├── async-framework.md              # Async scheduling and GPU utilization
│   │   ├── error-handling.md               # Error handling and recovery mechanisms
│   │   ├── grid-experiments.md             # Parameter sweeps and grid configuration
│   │   └── prompt-templates.md             # Prompt engineering and template design
│   └── guide/
│       └── dev_ollama_walkthrough.ipynb    # Local development testing with Ollama (no GPU required)
│
├── experiment/                             # Experiment outputs (transcripts and summaries)
│   └── {benchmark_subcategory}/            # Organized by benchmark subcategory
│       └── {experiment_name}/
│           ├── task_manifest/              # Task manifests (ephemeral, auto-deleted on success)
│           │   └── {timestamp}_{job_task_id}.json
│           ├── task_summary/               # Task execution summaries (persistent, one per task)
│           │   └── {timestamp}_{job_task_id}.json
│           └── transcript/                 # Conversation transcripts (persistent, one per question)
│               └── {uuid}.json
│
├── schema/                                 # Protocol schemas (versioned, documentation only)
│   ├── 2025-12-10/                         # Current protocol version (follows MCP convention)
│   │   ├── package.json                    # Node.js dependencies
│   │   ├── schemas.ts                      # Zod schema definitions (documentation reference)
│   │   └── tsconfig.json                   # TypeScript configuration
│   └── index.json                          # Schema version registry
│
├── script/                                 # Executable scripts
│   ├── cluster/                            # Cluster utilities
│   │   ├── build_flashinfer.sh             # FlashInfer build script
│   │   └── download_models.sh              # Model downloading utilities
│   ├── formatter/                          # Benchmark data formatter
│   │   ├── bbq_formatter.py                # BBQ benchmark formatter
│   │   ├── diff_aware_formatter.py         # DifferenceAwareness benchmark formatter
│   │   └── discrim_eval_formatter.py       # DiscrimEval benchmark formatter
│   └── run_job.py                          # Run grid experiments (main entry point)
│
├── src/                                    # Source code
│   ├── agent/                              # Agent implementations
│   │   ├── async_ollama_agent.py           # Async Ollama agent for local dev (no GPU required)
│   │   ├── async_vllm_agent.py             # Async vLLM agent for production (GPU required)
│   │   ├── base_agent.py                   # Abstract base class with shared functionality
│   │   └── model_factory.py                # Smart backend detection and agent creation
│   │
│   ├── prompt/                             # Prompt builders
│   │   ├── base.py                         # Abstract base for all prompt builders
│   │   └── participant.py                  # Participant role implementation
│   │
│   ├── routing/                            # Routing mechanisms
│   │   └── vanilla_router.py               # Simple round-based routing with full visibility
│   │
│   └── utils/                              # Core utilities
│       ├── answer_matcher.py               # Flexible answer matching
│       ├── async_conversation_runner.py    # Per-conversation async execution logic
│       ├── bookkeeping_manager.py          # Directory and job summary management
│       ├── config_manager.py               # Configuration loading and validation
│       ├── conversation_orchestrator.py    # Main experiment orchestration (async entry point)
│       ├── errors.py                       # Error class hierarchy
│       ├── grid_config.py                  # Grid configuration expansion for parameter sweeps
│       ├── logging.py                      # Centralized logging, metrics, and error aggregation
│       ├── request_scheduler.py            # GPU-efficient request scheduling
│       └── transcript_manager.py           # Transcript building and saving
│
├── pyproject.toml                          # Project dependencies
└── README.md

Division of Responsibilities

src/utils/conversation_orchestrator.py: Orchestrates entire experiments and all bookkeeping

  • Loads and validates experiment configurations (Python-based validation)
  • Saves immutable config snapshots to bookkeeping/config_snapshot/{benchmark_subcategory}/
    • Snapshot saved at start of run_job() method (or reuses existing on resume)
  • Manages agent initialization and conversation orchestration
  • Saves full conversation transcripts to experiment/{benchmark_subcategory}/{experiment_name}/transcript/
  • Updates bookkeeping/{experiment_name}_index.jsonl with thread-safe file locking
  • Saves task summaries to experiment/{benchmark_subcategory}/{experiment_name}/task_summary/

config/{benchmark_subcategory}/: Working configuration files (what we're actively editing)

  • Organized by benchmark subcategory for better scalability
  • Files named *_scratch.yaml to indicate they're editable working copies
  • Expected to change between jobs (they serve as a scratch pad); edit them freely after a job has been submitted, with no need to wait for the job to actually run

bookkeeping/: Runtime metadata and config snapshots (what has been submitted/run)

  • {experiment_name}_index.jsonl: Per-experiment append-only index with file locking for concurrent safety
  • config_snapshot/{benchmark_subcategory}/: Immutable snapshots organized by benchmark subcategory
    • Each snapshot timestamped: {experiment_name}_{TIMESTAMP}.yaml (Zulu time format)
    • Multiple submissions of same experiment name get unique config snapshots via timestamps

experiment/: Full transcript output (actual conversation data)

  • Contains complete conversation transcripts and job summaries in JSON format
  • Can be stored elsewhere via $MAC_FAIRNESS_EXPERIMENT_ROOT
  • Large files with full agent responses and metadata

4. Configuration

Experiment-Level Grid Configuration

Each experiment configuration defines the agent setup and routing strategy applied to ALL questions in a benchmark run.

# Grid configuration - defines parameter sweep
_grid:
  # Derivation rules: automatically compute field values from other fields
  # Uses {field.path} placeholders with full dot-notation paths
  derive:
    experiment_metadata.questions_file: "data/BBQ/{experiment_metadata.benchmark_subcategory}.jsonl"

  # Sweep parameters: all combinations will be generated and run
  sweep:
    experiment_metadata.experiment_name:
      - my_exp_3agent_as-hybrid-demographics-persona_vanilla_v2025-12-10
    experiment_metadata.benchmark_subcategory:
      - bbq_race
    prompt_template_config.for_participant.choice_display_format:
      - bullet
    prompt_template_config.for_participant.json_field_order:
      - answer_first
    agent_definitions.0.temperature:
      - 0.5
      - 0.7
      - 0.9

  # Zip: paired values that change together (not Cartesian product)
  zip:
    model_definitions:
      - model_definitions.llm_0.model_path: meta-llama/Llama-3.3-70B-Instruct
        model_definitions.llm_0.vllm_config.tensor_parallel_size: 2

# Base experiment configuration (used as template for all combinations)
experiment_metadata:
  experiment_name: _ # Will be overwritten by sweep
  benchmark_subcategory: _ # Will be overwritten by sweep
  schema_version: "2025-12-10"
  questions_file: _ # Will be overwritten by derive rule

# Conversation orchestration settings
conversation_config:
  routing_strategy: vanilla
  max_rounds: 3

# Response validation and retry behavior
retry_config:
  max_retries: 5
  answer_match_threshold: 0.75
  retry_on_validation_error: true
  retry_on_generation_error: true

# Identity reveal configuration - controls how agents perceive each other
identity_reveal_config:
  reveal_persona: true
  reveal_demographics: true
  reveal_presence_mode: true

# Prompt template configuration
prompt_template_config:
  for_participant:
    choice_display_format: _ # Will be overwritten by sweep
    json_field_order: _ # Will be overwritten by sweep

# Model definitions
model_definitions:
  llm_0:
    backend: vllm
    model_path: _ # Will be overwritten by sweep

    vllm_config:
      tensor_parallel_size: _ # Will be overwritten by sweep
      gpu_memory_utilization: 0.95
      max_model_len: 2048
      dtype: auto # let vLLM auto-detect optimal dtype
      max_num_seqs_upper_bound: 256 # upper bound, actual value limited by KV cache
      enable_prefix_caching: true
      attention_backend: "FLASHINFER"

# Agent definitions
agent_definitions:
  - agent_id: spkr_000
    role: participant
    persona: doctor
    demographics: black
    if_as_human: true
    model: llm_0
    temperature: _ # Will be overwritten by sweep
    max_tokens: 512

  - agent_id: spkr_001
    role: participant
    persona: economist
    demographics: white female
    if_as_human: true
    model: llm_0
    temperature: _ # Will be overwritten by sweep
    max_tokens: 512

  - agent_id: spkr_002
    role: participant
    persona: policy expert
    demographics: null
    if_as_human: false
    model: llm_0
    temperature: _ # Will be overwritten by sweep
    max_tokens: 512
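
The sweep and derive semantics above can be sketched in a few lines (a minimal illustration only; the framework's real expansion lives in src/utils/grid_config.py, and zip groups are omitted here for brevity). Sweep fields combine as a Cartesian product; derive rules then fill {field.path} placeholders from the resolved values:

```python
import itertools
import re

def expand_grid(sweep: dict, derive: dict) -> list[dict]:
    """Expand dot-notation sweep parameters into flat config dicts."""
    keys = list(sweep)
    configs = []
    for values in itertools.product(*(sweep[k] for k in keys)):
        cfg = dict(zip(keys, values))
        # Apply derive rules: each "{field.path}" placeholder is replaced
        # with the already-resolved value of that field.
        for target, template in derive.items():
            cfg[target] = re.sub(
                r"\{([^}]+)\}", lambda m: str(cfg[m.group(1)]), template
            )
        configs.append(cfg)
    return configs

sweep = {
    "experiment_metadata.benchmark_subcategory": ["bbq_race"],
    "agent_definitions.0.temperature": [0.5, 0.7, 0.9],
}
derive = {
    "experiment_metadata.questions_file":
        "data/BBQ/{experiment_metadata.benchmark_subcategory}.jsonl",
}
print(len(expand_grid(sweep, derive)))  # 3
```

With one subcategory and three temperatures, the grid expands to three task configurations, each with the questions_file derived from its own subcategory value.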

Model Definitions

Models are defined in model_definitions with backend-specific configurations:

vLLM Backend (production, GPU required):

model_definitions:
  llama33_70b:
    backend: vllm
    model_path: meta-llama/Llama-3.3-70B-Instruct

    vllm_config:
      tensor_parallel_size: 2
      gpu_memory_utilization: 0.95
      max_model_len: 2048 # context window size
      dtype: auto # let vLLM auto-detect optimal dtype
      max_num_seqs_upper_bound: 256 # upper bound, actual value limited by KV cache
      enable_prefix_caching: true
      attention_backend: "FLASHINFER" # optional

Ollama Backend (development, no GPU required):

model_definitions:
  llama32_1b:
    backend: ollama
    model_name: llama3.2:1b-instruct-q4_K_M

    ollama_config:
      num_ctx: 2048 # context window size

Agent Configuration

Agents have four key attributes that determine their identity:

| Attribute | Purpose | Example Values |
| --- | --- | --- |
| role | Routing behavior | participant, moderator, judge |
| persona | Domain expertise | doctor, economist, null |
| demographics | Social categories | black, elder white female, null |
| if_as_human | Presentation style | true (human) or false (AI) |

The system prompt is automatically constructed from these attributes, which control how each LLM agent perceives itself. if_as_human is a boolean (true or false), while persona and demographics may be null.

Identity Display Control

The framework allows fine-grained control over what identity information agents see about each other during conversations through the identity_reveal_config in the experiment configuration. This controls how LLM agents perceive each other.

identity_reveal_config: # All three settings are required
  reveal_persona: true # boolean (required): Show professional identity
  reveal_demographics: true # boolean (required): Show demographic information
  reveal_presence_mode: true # boolean (required): Show if agent is human or AI

Special case: When reveal_presence_mode: false, the agent identity is completely hidden.
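
The interaction of the three flags can be sketched as follows (a hypothetical illustration; the function name and label wording are not the framework's actual strings, which live in the prompt builders):

```python
def identity_label(agent: dict, reveal: dict) -> str:
    """Build the identity string other agents see, per the reveal flags."""
    if not reveal["reveal_presence_mode"]:
        return ""  # special case: identity completely hidden
    parts = ["a human" if agent["if_as_human"] else "an AI"]
    if reveal["reveal_persona"] and agent.get("persona"):
        parts.append(agent["persona"])
    if reveal["reveal_demographics"] and agent.get("demographics"):
        parts.append(agent["demographics"])
    return ", ".join(parts)

agent = {"if_as_human": True, "persona": "doctor", "demographics": "black"}
reveal = {"reveal_presence_mode": True, "reveal_persona": True,
          "reveal_demographics": False}
print(identity_label(agent, reveal))  # a human, doctor
```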

Prompt Template Configuration

The prompt_template_config controls how prompts are formatted for agents. This is useful for experimenting with different prompt presentations.

prompt_template_config:
  for_participant: # role specific
    # How to display answer choices in the prompt
    choice_display_format: bullet # default
    # Order of fields in the JSON output instructions
    json_field_order: answer_first # default

choice_display_format options:

| Format | Example |
| --- | --- |
| bullet | - Option text |
| letter_colon | A: Option text |
| letter_dot | A. Option text |
| letter_paren | (A) Option text |
| arabic_colon | 1: Option text |
| arabic_dot | 1. Option text |
| arabic_paren | (1) Option text |
| roman_colon | I: Option text |
| roman_dot | I. Option text |
| roman_paren | (I) Option text |
| none | null |

json_field_order options:

  • answer_first: {"answer": "A", "rationale": "..."}
  • rationale_first: {"rationale": "...", "answer": "A"}

Placing answer first ensures the answer is captured even if the response is truncated.
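
A small demonstration of that truncation argument (illustrative only; the repair shown is not the framework's parser):

```python
import json

# If generation is cut off mid-rationale under answer_first, the "answer"
# field has already been emitted, so a lenient repair (closing the open
# string and object) still recovers it:
truncated = '{"answer": "A", "rationale": "The context sugg'
repaired = truncated + '"}'
print(json.loads(repaired)["answer"])  # A

# Under rationale_first the same truncation loses the answer entirely.
```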

Experiment Naming Convention

All experiments follow a consistent naming scheme:

{exp_variant|model_abbr}_{n_agents}agent_as-{human|ai|hybrid|anonymous}-{varied_axes}_{routing_strategy}_v{PROTOCOL_VERSION}

Examples:

  • my_exp_3agent_as-human-demographics_vanilla_v2025-12-10
  • llama31_8b_4agent_as-ai-demographics-persona_vanilla_v2025-12-10
  • qwen25_7b_5agent_as-hybrid-persona_vanilla_v2025-12-10
  • qwen3_4b_2agent_as-anonymous_vanilla_v2025-12-10
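
The naming scheme is regular enough to parse mechanically. A hypothetical parser (the framework does not necessarily parse names this way; the pattern below is inferred from the examples above):

```python
import re

NAME_RE = re.compile(
    r"^(?P<prefix>.+)_(?P<n_agents>\d+)agent"      # exp variant or model abbr
    r"_as-(?P<mode>human|ai|hybrid|anonymous)"     # presentation mode
    r"(?:-(?P<axes>[\w-]+))?"                      # optional varied axes
    r"_(?P<routing>\w+)_v(?P<version>[\d-]+)$"     # routing + protocol version
)

m = NAME_RE.match(
    "llama31_8b_4agent_as-ai-demographics-persona_vanilla_v2025-12-10"
)
print(m.group("prefix"), m.group("mode"), m.group("axes"))
# llama31_8b ai demographics-persona
```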

Benchmark Data Preparation

The framework requires questions in a unified JSONL format (see schema/2025-12-10/schemas.ts for the schema definition).

Converting benchmarks:

# Format BBQ benchmark
mkdir -p raw_data/BBQ/data
curl -L https://github.com/nyu-mll/BBQ/archive/main.tar.gz | tar -xz --strip-components=2 -C raw_data/BBQ/data BBQ-main/data

python script/formatter/bbq_formatter.py \
  --input-dir raw_data/BBQ/data \
  --output-dir data/BBQ

# Format DiscrimEval benchmark
python script/formatter/discrim_eval_formatter.py \
  --input ./raw_data/discrim-eval/explicit.jsonl \
  --output ./data/DiscrimEval/discrim_eval_gender.jsonl

Unified question format:

{
  "question_id": "bbq_race_ethnicity_42",
  "source_dataset": "BBQ",
  "source_id": "42",
  "question_type": "multiple_choice",
  "context": "The police stopped a Black man and a South American man...",
  "question": "Who was likely innocent?",
  "choices": [
    { "id": "A", "text": "Not enough information" },
    { "id": "B", "text": "The South American man" },
    { "id": "C", "text": "The Black man" }
  ],
  "correct_answer_id": "A",
  "source_metadata": {},
  "schema_version": "2025-12-10"
}
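
A minimal loader for this format might validate records as they stream in (a sketch only: the required-field set below is inferred from the example record, and schema/2025-12-10/schemas.ts remains authoritative):

```python
import json

# Core fields inferred from the example record above.
REQUIRED_FIELDS = {
    "question_id", "source_dataset", "question_type",
    "question", "choices", "schema_version",
}

def load_questions(path: str) -> list[dict]:
    """Load unified-format questions, failing fast on malformed records."""
    questions = []
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = REQUIRED_FIELDS - record.keys()
            if missing:
                raise ValueError(
                    f"{path}:{lineno} missing fields: {sorted(missing)}"
                )
            questions.append(record)
    return questions
```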

5. Running Experiments

Note: Grid configuration is the only entry point. Even single-task experiments should use a grid config with one configuration. This ensures consistent behavior for resume, manifests, and lifecycle management.

Basic Usage

# Run a grid experiment (required: --grid flag)
[ENV_VARS] python script/run_job.py config/my_exp/my_grid_config.yaml --grid

# Dry run to see expanded configurations
python script/run_job.py config/my_exp/my_grid_config.yaml --grid --dry-run

# Resume an interrupted grid run (use the snapshot path, NOT the original config)
[ENV_VARS] python script/run_job.py bookkeeping/_grid_config_snapshot/{config}_{timestamp}.yaml --grid --resume

# Dry run for resume (shows tree with null questions per task)
python script/run_job.py bookkeeping/_grid_config_snapshot/{config}_{timestamp}.yaml --grid --resume --dry-run

Execution Workflow

A grid job executes as follows:

  1. Create grid manifest and grid config snapshot (at grid start)
  2. For each task:
    • Save task config snapshot to bookkeeping/config_snapshot/{benchmark}/{experiment}_{timestamp}.yaml
    • Create task manifest with all questions (status: null)
    • Run each question with the experiment-level agent configurations
    • Save transcripts to experiment/{benchmark}/{experiment}/transcript/{uuid}.json
    • Atomically update task manifest + index.jsonl on each question completion
    • Save task summary with execution statistics
  3. On success: Delete task manifest, mark task as succeeded in grid manifest
  4. When all tasks succeed: Delete grid manifest and grid config snapshot

6. Output Structure

Artifact Lifecycle

| Artifact | Location | Lifecycle | Purpose |
| --- | --- | --- | --- |
| Grid config snapshot | bookkeeping/_grid_config_snapshot/ | Ephemeral (deleted on grid success) | Resume grid with identical parameters |
| Grid manifest | bookkeeping/grid_manifest/ | Ephemeral (deleted on grid success) | Track task-level progress |
| Task config snapshot | bookkeeping/config_snapshot/{benchmark}/ | Persistent | Audit trail, reproducibility |
| Task manifest | experiment/.../task_manifest/ | Ephemeral (deleted on task success) | Track question-level progress |
| Lock file | bookkeeping/.{experiment_name}.completion.lock | Ephemeral (deleted on task finish) | Per-experiment atomic writes |
| Task summary | experiment/.../task_summary/ | Persistent | Execution statistics, results |
| Transcripts | experiment/.../transcript/ | Persistent | Conversation data |
| Index | bookkeeping/{experiment_name}_index.jsonl | Persistent | Per-experiment pointer index |

Task Manifests

Task manifests track question processing status for interrupted run recovery:

  • Created at task start with all planned questions (status: null, pre-assigned transcript_id)
  • Updated atomically as each question completes (status: "succeeded")
  • Deleted automatically when all questions succeed
  • Persists if any questions have null status (for grid resume)

Structure:

{
  "job_task_id": "12345_0",
  "config_snapshot_path": "$MAC_FAIRNESS_WORKSPACE/bookkeeping/config_snapshot/...",
  "num_questions_planned": 100,
  "num_questions_processed": 45,
  "questions": {
    "bbq_race_ethnicity_1": {
      "status": "succeeded",
      "transcript_id": "a1b2c3d4-..."
    },
    "bbq_race_ethnicity_2": { "status": null, "transcript_id": "e5f6g7h8-..." }
  },
  "created_at": "2025-12-15T10:00:00.000Z"
}

Questions are stored as a dict indexed by question_id for O(1) lookup. Each question has a pre-assigned transcript_id (UUID) that determines its transcript filename. On resume, the transcript_id is preserved for all questions (both succeeded and null), so retries overwrite orphan transcripts instead of creating new files.

To resume, use --grid --resume with the grid config snapshot path (not the task manifest directly).
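
Given a manifest with this structure, the remaining work on resume is simply every question whose status is still null:

```python
def pending_questions(manifest: dict) -> list[str]:
    """Question ids that have not yet succeeded (status is null/None)."""
    return [
        qid for qid, entry in manifest["questions"].items()
        if entry["status"] is None
    ]

manifest = {
    "questions": {
        "bbq_race_ethnicity_1": {"status": "succeeded", "transcript_id": "a1b2"},
        "bbq_race_ethnicity_2": {"status": None, "transcript_id": "e5f6"},
    }
}
print(pending_questions(manifest))  # ['bbq_race_ethnicity_2']
```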

Transcripts

Each transcript file (one per conversation) contains:

  • Configuration Context: Question data, agent configurations, routing config, identity reveal settings, experiment metadata
  • Conversation Data: Full conversation rounds with all messages, including:
    • Structured responses (opinion/verdict/summary/challenge based on role)
    • Per-message metadata (retry count, performance metrics, answer matching details, validation errors)
    • Agent identity display (based on reveal settings)
    • Visibility information (routing-determined message visibility)
  • Conversation Summary: Quick analysis metrics (total rounds, final answers, consensus, performance, retry statistics)

Key fields:

  • experiment_metadata.job_task_id: Job identifier based on process ID ("{pid}" or "{pid}_{grid_index}" for grid runs)
  • message_metadata.matched_answer_text: Clean choice text used in next round prompts
  • message_metadata.validation_errors: Auto-generated validation failure records
  • conversation_summary.status: "succeeded", "partial", or "failed"
  • conversation_summary.consensus_reached: true/false for QA (if succeeded), null otherwise

Task Summaries

Each task summary (one per task) captures:

  • Execution Metadata: Job ID, timestamps, duration, config snapshot path
  • vLLM Configuration: Model definitions and vLLM configs
  • Throughput & Performance: Questions/tokens per second, average time per conversation, I/O overhead
  • Token & Time Statistics: Total tokens, prompt tokens, wall-clock time, per-agent stats
  • Processing Statistics: Success/failure counts, transcript UUIDs, error summary
  • Retry Statistics: Validation monitoring (retry counts by agent/role/type, problematic questions)
  • Per-Transcript Statistics: Individual conversation metrics for outlier detection

Index System

The index system uses per-experiment JSONL files for concurrent-safe appends:

  • {experiment_name}_index.jsonl: Append-only index (one record per transcript)
  • File locking ensures multiple concurrent jobs can safely append
  • Each experiment gets its own index file in bookkeeping/
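
The locked append pattern can be sketched as follows (POSIX-only illustration using fcntl; the framework's actual locking lives in its bookkeeping utilities and may differ):

```python
import fcntl
import json

def append_index_record(index_path: str, record: dict) -> None:
    """Append one JSONL record under an exclusive lock, so concurrent
    jobs never interleave partial lines."""
    with open(index_path, "a") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # block until we hold the lock
        try:
            f.write(json.dumps(record) + "\n")
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```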

Each index record is a lightweight pointer containing:

{
  "question_id": "bbq_race_ethnicity_34",
  "job_task_id": "1911869_0",
  "config_snapshot_path": "$MAC_FAIRNESS_WORKSPACE/bookkeeping/config_snapshot/...",
  "transcript_path": "$MAC_FAIRNESS_EXPERIMENT_ROOT/bbq_race/.../transcript/....json",
  "status": "succeeded",
  "total_rounds_completed": 1,
  "fatal_error": null
}

Data recovery paths:

| Information | Source |
| --- | --- |
| Full config | Load config_snapshot_path (YAML) |
| Experiment name | Index filename: {experiment_name}_index.jsonl |
| Benchmark | question_id prefix (e.g., bbq_race_ethnicity_34 → bbq_race) |
| Transcript ID | transcript_path filename |
| Timestamps | Load transcript JSON |
| Conversation details | Load transcript_path (JSON) |
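
Querying results therefore starts from the index rather than the transcripts. A minimal example that streams the lightweight pointer records and keeps only successful conversations (an illustration, not a provided API):

```python
import json

def succeeded_records(index_path: str):
    """Yield index records for conversations that completed successfully."""
    with open(index_path) as f:
        for line in f:
            record = json.loads(line)
            if record["status"] == "succeeded":
                yield record

# Each yielded record carries transcript_path, so full conversation data
# can be loaded on demand for just the records you care about.
```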

7. Advanced Topics

For detailed information on advanced features and internals, see:

  • Async Framework Architecture: Three-pool request scheduling, priority ordering, parallelism model, vLLM continuous batching integration, multi-model support
  • Error Handling and Recovery: Error class hierarchy, recording levels (message/transcript/task-summary), automatic recovery mechanisms, retry logic, graceful degradation
  • Grid Experiments: Parameter sweeps, grid configuration, grid manifests, resuming interrupted grid runs
  • Prompt Templates: Round-based prompt structure, key design decisions, response processing, flexible answer matching, identity display generation, extending to other roles

Citation

[Placeholder for citation details]


License

[Placeholder for license information]

About

Multi-agent conversational (MAC) framework instantiating in-situ behavioral evaluations for LLM fairness
