An NYU IDLS research project for automated beta testing using multi-agent LLM committees with browser vision, consensus voting, and experiment-driven evaluation.
The animated preview above is always visible in GitHub README view. Click it to open the full demo.mp4.
This framework runs end-to-end browser testing sessions where multiple AI agents inspect screenshots, propose actions, debate alternatives, and execute a consensus action through Playwright automation. It is designed for both practical application-under-test (AUT) testing and research-grade experimentation.
Core capabilities:
- Multi-agent committee decision making with a 3-round voting protocol.
- Vision-enabled UI interaction through screenshot-grounded reasoning (via configured multimodal APIs—not LLaVA by default).
- Persona-driven testing to diversify behavioral coverage.
- Safety-aware action validation and structured turn logging.
- Statistical experiment pipeline for reproducible research outputs.
High-level data flow from configuration through the multi-agent loop, browser automation, validation, and recorded outputs.
```mermaid
flowchart TB
    subgraph config["Configuration"]
        MC["config/model_config.yaml"]
        PS["Personas / scenarios YAML"]
        EY["experiments/configs/*.yaml"]
    end
    subgraph exec["Runtime"]
        EN["main.py or experiments/runner.py"]
        RUN["multi_agent_runner.py"]
        COM["Multi-agent committee (3-round voting)"]
        API["LLMClient: OpenAI / Google / Anthropic / xAI"]
        PW["browser_adapter.py (Playwright)"]
        VLD["validators.py"]
    end
    AUT["Application under test: aut_service.py"]
    subgraph out["Outputs"]
        CSV["Session logs + CSV"]
        SQLITE[(experiments/results/experiments.db)]
        ST["dashboard_app.py (Streamlit)"]
    end
    MC --> API
    PS --> RUN
    EY --> EN
    EN --> RUN
    RUN --> COM
    COM --> API
    RUN --> PW
    PW --> AUT
    COM --> VLD
    VLD --> PW
    PW -->|screenshots + state| COM
    RUN --> CSV
    EN --> SQLITE
    SQLITE --> ST
    CSV --> ST
```
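The runtime portion of this flow can be sketched as a per-turn loop. This is an illustrative simplification only; the class and method names below (`observe`, `decide`, `is_safe`, `execute`) are assumptions for exposition, not the actual APIs of `multi_agent_runner.py`:

```python
# Illustrative sketch of the per-turn session loop (all names are assumptions,
# not the repository's real interfaces).

def run_session(browser, committee, validator, max_turns=10):
    log = []
    for turn in range(max_turns):
        screenshot, state = browser.observe()          # PW --> COM
        action = committee.decide(screenshot, state)   # 3-round voting protocol
        if not validator.is_safe(action, state):       # COM --> VLD
            log.append((turn, action, "rejected"))
            continue
        result = browser.execute(action)               # VLD --> PW --> AUT
        log.append((turn, action, result))
        if result == "goal_reached":
            break
    return log
```

The validator sits between the committee's decision and browser execution, so unsafe actions are logged but never reach the application under test.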
Data source: experiments/results/experiments.db
- Total runs: 84
- Overall task success rate: 89.5%
- Average turns per run: 8.13
- Mean latency: 0.87s (P50: 0.71s, P95: 1.92s, P99: 2.16s)
- Total actions executed: 683
- Overall action success rate: 93.1%
- Multi-agent committees (2-4 agents) outperform single-agent runs (91.7-100.0% vs 78.0% success).
- Committee agreement reaches 100% for 2-4 agent configurations.
- On WebShop benchmark tasks, the framework achieves 74.7% success vs published GPT-3 baseline 50.1% (+24.6pp).
- Regression detection reaches 100% success in evaluated scenarios.
- OWASP Juice Shop security testing reaches 82.0% success.
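The P50/P95/P99 latency figures above are computed over per-action latencies. A minimal sketch, assuming nearest-rank percentiles over made-up sample values (not the recorded data):

```python
import math

def percentile(latencies, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    xs = sorted(latencies)
    k = min(len(xs) - 1, max(0, math.ceil(p / 100 * len(xs)) - 1))
    return xs[k]

# Made-up per-action latencies in seconds, for illustration only:
samples = [0.4, 0.6, 0.71, 0.8, 0.9, 1.1, 1.5, 1.92, 2.0, 2.16]
p50, p95, p99 = (percentile(samples, p) for p in (50, 95, 99))
```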
| Experiment | Runs | Success Rate | Avg Turns | Avg Latency |
|---|---|---|---|---|
| Multi-Agent Scaling | 9 | 100.0% | 4.0 | 0.46s |
| Persona Diversity | 27 | 92.2% | 12.6 | 0.25s |
| Regression Detection | 18 | 100.0% | 4.0 | 0.54s |
| OWASP Juice Shop Security | 12 | 82.0% | 12.7 | 2.65s |
| WebShop Task Success | 18 | 74.7% | 4.6 | 1.16s |
- Define experiment configs in `experiments/configs/*.yaml`.
- Run orchestrated experiments via `experiments/runner.py`.
- Execute per-turn multi-agent testing sessions with browser automation.
- Collect turn-level and run-level metrics into SQLite.
- Run statistical analyses (t-tests, ANOVA, confidence intervals, effect sizes).
- Generate publication figures and integrate findings into the paper.
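The statistical step can be illustrated with a stdlib-only sketch of Welch's t statistic and Cohen's d on per-run success indicators. This is a generic example with made-up data, not the repository's `experiments/analysis.py`:

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))

def cohens_d(a, b):
    """Cohen's d effect size using a pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = math.sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled

# Made-up per-run success indicators (1 = task success), for illustration only:
multi = [1, 1, 1, 1, 0, 1, 1, 1, 1]
single = [1, 0, 1, 1, 0, 1, 1, 0, 1]
```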
- Round 1 — Independent proposals: each agent proposes action + confidence.
- Round 2 — Discussion/refinement: agents review and revise proposals.
- Round 3 — Consensus vote: confidence-weighted aggregation selects final action.
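The Round 3 aggregation can be sketched as confidence-weighted plurality voting. The code below is an illustrative simplification, not the implementation in `multi_agent_committee.py`; the `Proposal` type is assumed for exposition:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Proposal:
    # Hypothetical stand-in for an agent's Round 3 ballot.
    action: str        # e.g. "click:#checkout"
    confidence: float  # self-reported confidence in [0, 1]

def consensus_vote(proposals):
    """Pick the action with the highest summed confidence; ties break alphabetically."""
    weights = defaultdict(float)
    for p in proposals:
        weights[p.action] += p.confidence
    return max(sorted(weights), key=lambda a: weights[a])

ballots = [
    Proposal("click:#checkout", 0.9),
    Proposal("type:#search", 0.8),
    Proposal("click:#checkout", 0.6),
]
winner = consensus_vote(ballots)  # "click:#checkout" wins with total weight 1.5
```

Weighting by confidence lets a minority of highly confident agents outvote a less certain majority, which is why the protocol asks each agent to report confidence alongside its action.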
```
LLM_Agents_For_Beta_Testing/
├── app/                          # Core runtime framework
│   ├── multi_agent_runner.py     # Session orchestration loop
│   ├── multi_agent_committee.py  # 3-round voting protocol
│   ├── browser_adapter.py        # Playwright integration
│   ├── agent.py                  # Agent abstraction
│   ├── llm_client.py             # Provider clients
│   ├── validators.py             # Safety/action validation
│   ├── storage.py                # Session logging
│   └── metrics.py                # Runtime metrics utilities
├── config/
│   └── model_config.yaml         # Model + provider roster
├── experiments/                  # Research infrastructure
│   ├── configs/                  # Experiment YAML definitions
│   ├── runner.py                 # Batch experiment orchestrator
│   ├── metrics_collector.py      # Comprehensive metrics collection
│   ├── analysis.py               # Statistical analysis utilities
│   ├── bug_injector.py           # Ground-truth bug management
│   ├── regressions.py            # Regression test framework
│   ├── generate_figures.py       # Paper figure generation
│   ├── schema.sql                # SQLite schema
│   └── results/experiments.db    # Experimental results database
├── personas/                     # Persona YAML files
├── scenarios/                    # Scenario YAML files
├── aut_service.py                # FastAPI application under test
├── main.py                       # Single-session entry point
├── dashboard_app.py              # Streamlit dashboard
├── media/                        # Demo assets for README
└── README.md
```
```bash
pip install -r requirements.txt
playwright install chromium
```

Set provider keys in your environment as needed for OpenAI, Google, Anthropic, and xAI.
The active model roster is loaded from config/model_config.yaml. Current defaults are:
- `gpt-4o` (OpenAI)
- `gemini-1.5-flash` (Google)
- `gemini-2.5-flash` (Google)
- `claude-opus-4-1` (Anthropic)
- `grok-2-vision-1212` (xAI)
The codebase can also talk to Ollama via an OpenAI-compatible API for local models, but the repository is not configured for LLaVA by default—edit config/model_config.yaml if you want to use local vision models.
Start the application under test:

```bash
uvicorn aut_service:app --port 8000
```

Run a single session:

```bash
python main.py
```

Optional example:

```bash
python main.py --persona personas/adversarial_attacker.yaml --scenario scenarios/ui_shopping_flow.yaml --agents 4
```

With `--agents 4`, the framework uses the first four configured models from `config/model_config.yaml`.
Launch the Streamlit dashboard:

```bash
streamlit run dashboard_app.py
```

Run a configured experiment:

```bash
python experiments/runner.py --config experiments/configs/experiment_1a_multi_agent_scaling.yaml
```

Generate figures from the database:

```bash
python experiments/generate_figures.py
```

Run statistical analysis (example):
```python
from experiments.analysis import StatisticalAnalyzer

analyzer = StatisticalAnalyzer("experiments/results/experiments.db")
# Example comparisons:
# - single-agent vs multi-agent
# - benchmark vs baseline
```

- Fixed seeds are used for controlled experimental runs.
- Experiment settings are versioned in YAML configs.
- Turn-by-turn logs and aggregate metrics are persisted in SQLite.
- Figures are generated programmatically from recorded results.
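The fixed-seed point above can be sketched generically; this is not the repository's actual seeding code, just an illustration of why a seeded RNG makes runs repeatable:

```python
import random

def seeded_run(seed):
    # With a fixed seed, the sampled action sequence is identical across runs.
    rng = random.Random(seed)
    return [rng.choice(["click", "type", "scroll"]) for _ in range(5)]

assert seeded_run(42) == seeded_run(42)  # same seed, same sequence
```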
MIT License
- Dhiwahar Adhithya Kennady (`dk5025`)
- Sumanth Bharadwaj Hachalli Karanam (`sh8111`)
