McGill-NLP/agent-as-annotators

This repository contains the code for the A3 framework, which uses LLMs to systematically generate synthetic web agent training data by decomposing the annotation process into three roles: Task Designer, Annotator, and Supervisor.

Installation

pip install agent-as-annotators

Or install from source:

git clone https://github.com/McGill-NLP/agent-as-annotators.git
cd agent-as-annotators
pip install -e .

Quick Start: Evaluation

1. Serve a model with vLLM

vllm serve --config configs/vllm/Qwen3.5-9B.yaml

2. Run evaluation

a3-eval --benchmark webarena_test --model A3-qwen3.5-9b
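When the two quick-start steps are scripted together, the evaluation should start only once the model server is accepting requests. The following is a minimal sketch of one way to do that, not part of this repository: `wait_until` is a hypothetical helper, and the usage comments assume the server listens on `localhost:8000` and answers on `/health`, which the config file passed to `vllm serve` may change.

```shell
#!/usr/bin/env bash
set -euo pipefail

# wait_until TIMEOUT_SECONDS CMD...
# Hypothetical helper: retry CMD about once per second until it succeeds.
# Returns 0 on success, 1 if CMD still fails after TIMEOUT_SECONDS retries.
wait_until() {
  local timeout="$1"; shift
  local waited=0
  until "$@" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
}

# Usage sketch (host, port, and health path are assumptions; adjust to your config):
#   vllm serve --config configs/vllm/Qwen3.5-9B.yaml &
#   wait_until 600 curl -sf http://localhost:8000/health
#   a3-eval --benchmark webarena_test --model A3-qwen3.5-9b
```

Polling a readiness check avoids guessing a fixed sleep duration, since model load time varies with hardware and model size.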

Pipeline: Generating A3-Synth

The A3 pipeline generates synthetic training data in five steps:

Step 1: Create personas

python scripts/create_personas.py

Step 2: Generate task intents (via exploration)

a3-explore
python scripts/generate_task_intents.py

Step 3: Create A3-Synth task configs

python scripts/create_synth_configs.py

Step 4: Collect trajectories

a3-synth --benchmark a3_synth --model gemini-3-pro

Step 5: Convert to training data

python scripts/convert_trajectories_to_json.py
python scripts/generate_rft_data.py
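The five steps above can be chained in a single script that stops at the first failure. This is a sketch under stated assumptions rather than a script shipped with the repository: `run`, `run_pipeline`, and the `DRY_RUN` variable are hypothetical names introduced here, and the commands are copied unchanged from the steps above. By default the script only prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/usr/bin/env bash
set -euo pipefail

# DRY_RUN is a hypothetical variable for this sketch, not an option of the A3 tools.
# 1 (default) prints each command; 0 prints and then executes it.
DRY_RUN="${DRY_RUN:-1}"

# Echo a command, then run it unless we are in dry-run mode.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

run_pipeline() {
  run python scripts/create_personas.py                    # Step 1: personas
  run a3-explore                                           # Step 2: exploration
  run python scripts/generate_task_intents.py              # Step 2: task intents
  run python scripts/create_synth_configs.py               # Step 3: task configs
  run a3-synth --benchmark a3_synth --model gemini-3-pro   # Step 4: trajectories
  run python scripts/convert_trajectories_to_json.py       # Step 5: convert
  run python scripts/generate_rft_data.py                  # Step 5: RFT data
}

run_pipeline
```

Because of `set -e`, a failing step aborts the script when `DRY_RUN=0`, so later steps never run on incomplete outputs.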

Training

a3-train --config configs/train/qwen3.5-9b.json

Training uses supervised fine-tuning (SFT) with Fully Sharded Data Parallel (FSDP) for multi-GPU parallelism. See configs/train/ for hyperparameters and configs/accelerate/ for the FSDP configuration.

CLI Commands

Command            Description
a3-eval            Run evaluation on WebArena, VisualWebArena, WorkArena, and MiniWoB
a3-synth           Run trajectory collection for A3-Synth
a3-explore         Run environment exploration
a3-train           Fine-tune a model with SFT
a3-screen-utils    Screen session management utilities
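After installation, a quick way to confirm that these entry points landed on your `PATH` is to look each one up with `command -v`. The sketch below assumes nothing about the tools beyond the command names listed above; `missing_commands` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print each given command name that cannot be found on PATH.
missing_commands() {
  local cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "$cmd"
    fi
  done
}

missing="$(missing_commands a3-eval a3-synth a3-explore a3-train a3-screen-utils)"
if [ -n "$missing" ]; then
  # Unquoted on purpose so the names print on a single line.
  echo "Not on PATH:" $missing
else
  echo "All A3 commands found."
fi
```

If any command is reported missing, check that the environment where you ran `pip install` is the one currently active.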

Project Structure

agent-as-annotators/
  agent_as_annotators/       # Core package
    cli/                     # CLI entry points (eval, synth, explore, train)
    modeling.py              # Agent model wrapper (vLLM, Gemini, OpenAI)
    prompts/                 # All prompt templates
    judge/                   # Inverted evaluation protocol (Judge module)
    benchmarks/a3_synth/     # A3-Synth benchmark registration
    exploration/             # Exploration task registration
    utils/                   # Utilities
    configs/a3_synth/        # A3-Synth task configurations
  configs/
    model_configs.json       # Model registry
    train/                   # Training hyperparameters
    vllm/                    # vLLM serving configs
    accelerate/              # FSDP configs
  scripts/                   # Data pipeline scripts

About

Agent-as-Annotators: Structured Distillation of Web Agent Capabilities
