
DHIWAHAR-K/LLM_Agents_For_Beta_Testing


Multi-Agent Vision Beta Testing Framework

An NYU IDLS research project for automated beta testing using multi-agent LLM committees with browser vision, consensus voting, and experiment-driven evaluation.

Demo

Demo preview

The animated preview above plays inline in the GitHub README view; click it to open the full demo.mp4.

What This Project Does

This framework runs end-to-end browser testing sessions where multiple AI agents inspect screenshots, propose actions, debate alternatives, and execute a consensus action through Playwright automation. It is designed for both practical application-under-test (AUT) testing and research-grade experimentation.

Core capabilities:

  • Multi-agent committee decision making with a 3-round voting protocol.
  • Vision-enabled UI interaction through screenshot-grounded reasoning (via the configured multimodal APIs; LLaVA is not used by default).
  • Persona-driven testing to diversify behavioral coverage.
  • Safety-aware action validation and structured turn logging.
  • Statistical experiment pipeline for reproducible research outputs.

Architecture

High-level data flow from configuration through the multi-agent loop, browser automation, validation, and recorded outputs.

```mermaid
flowchart TB
    subgraph config["Configuration"]
        MC["config/model_config.yaml"]
        PS["Personas / scenarios YAML"]
        EY["experiments/configs/*.yaml"]
    end

    subgraph exec["Runtime"]
        EN["main.py or experiments/runner.py"]
        RUN["multi_agent_runner.py"]
        COM["Multi-agent committee (3-round voting)"]
        API["LLMClient: OpenAI / Google / Anthropic / xAI"]
        PW["browser_adapter.py (Playwright)"]
        VLD["validators.py"]
    end

    AUT["Application under test: aut_service.py"]

    subgraph out["Outputs"]
        CSV["Session logs + CSV"]
        SQLITE[(experiments/results/experiments.db)]
        ST["dashboard_app.py (Streamlit)"]
    end

    MC --> API
    PS --> RUN
    EY --> EN
    EN --> RUN
    RUN --> COM
    COM --> API
    RUN --> PW
    PW --> AUT
    COM --> VLD
    VLD --> PW
    PW -->|screenshots + state| COM
    RUN --> CSV
    EN --> SQLITE
    SQLITE --> ST
    CSV --> ST
```

Measured Results (From 84 Runs)

Data source: experiments/results/experiments.db

  • Total runs: 84
  • Overall task success rate: 89.5%
  • Average turns per run: 8.13
  • Mean latency: 0.87s (P50: 0.71s, P95: 1.92s, P99: 2.16s)
  • Total actions executed: 683
  • Overall action success rate: 93.1%
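
A minimal sketch of how aggregates like these can be derived from a results database. The table and column names below (`runs`, `success`, `turns`) are illustrative assumptions; the actual schema lives in experiments/schema.sql.

```python
import sqlite3

# Illustrative only: the real schema is defined in experiments/schema.sql;
# the table and column names here are assumptions for the sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (id INTEGER PRIMARY KEY, success INTEGER, turns INTEGER)")
conn.executemany(
    "INSERT INTO runs (success, turns) VALUES (?, ?)",
    [(1, 4), (1, 7), (0, 12), (1, 9)],  # made-up sample runs
)
rate, avg_turns = conn.execute(
    "SELECT AVG(success), AVG(turns) FROM runs"
).fetchone()
print(f"success rate: {rate:.1%}, avg turns: {avg_turns:.2f}")
```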

Key Findings

  • Multi-agent committees (2-4 agents) outperform single-agent runs (91.7-100.0% vs 78.0% success).
  • Committee agreement reaches 100% for 2-4 agent configurations.
  • On WebShop benchmark tasks, the framework achieves 74.7% success vs published GPT-3 baseline 50.1% (+24.6pp).
  • Regression detection reaches 100% success in evaluated scenarios.
  • OWASP Juice Shop security testing reaches 82.0% success.

Experiment Snapshot

| Experiment | Runs | Success Rate | Avg Turns | Avg Latency |
|---|---|---|---|---|
| Multi-Agent Scaling | 9 | 100.0% | 4.0 | 0.46s |
| Persona Diversity | 27 | 92.2% | 12.6 | 0.25s |
| Regression Detection | 18 | 100.0% | 4.0 | 0.54s |
| OWASP Juice Shop Security | 12 | 82.0% | 12.7 | 2.65s |
| WebShop Task Success | 18 | 74.7% | 4.6 | 1.16s |

End-to-End Workflow

  1. Define experiment configs in experiments/configs/*.yaml.
  2. Run orchestrated experiments via experiments/runner.py.
  3. Execute per-turn multi-agent testing sessions with browser automation.
  4. Collect turn-level and run-level metrics into SQLite.
  5. Run statistical analyses (t-tests, ANOVA, confidence intervals, effect sizes).
  6. Generate publication figures and integrate findings into the paper.

Committee Decision Protocol (Per Turn)

  1. Round 1 — Independent proposals: each agent proposes action + confidence.
  2. Round 2 — Discussion/refinement: agents review and revise proposals.
  3. Round 3 — Consensus vote: confidence-weighted aggregation selects final action.
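
The Round 3 aggregation can be sketched as summing confidences per proposed action and taking the maximum. This is an illustrative reduction, not the framework's exact code in multi_agent_committee.py; the agent names and action strings are invented.

```python
from collections import defaultdict

def consensus_vote(proposals):
    """Pick the action with the highest total confidence.

    `proposals` maps agent name -> (action, confidence in [0, 1]).
    Illustrative sketch of confidence-weighted aggregation.
    """
    totals = defaultdict(float)
    for action, confidence in proposals.values():
        totals[action] += confidence
    return max(totals, key=totals.get)

proposals = {
    "gpt-4o": ("click:#checkout", 0.9),
    "gemini": ("click:#checkout", 0.7),
    "claude": ("fill:#search",    0.8),
}
print(consensus_vote(proposals))  # click:#checkout (0.9 + 0.7 outweighs 0.8)
```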

Repository Structure

```text
LLM_Agents_For_Beta_Testing/
├── app/                            # Core runtime framework
│   ├── multi_agent_runner.py       # Session orchestration loop
│   ├── multi_agent_committee.py    # 3-round voting protocol
│   ├── browser_adapter.py          # Playwright integration
│   ├── agent.py                    # Agent abstraction
│   ├── llm_client.py               # Provider clients
│   ├── validators.py               # Safety/action validation
│   ├── storage.py                  # Session logging
│   └── metrics.py                  # Runtime metrics utilities
├── config/
│   └── model_config.yaml           # Model + provider roster
├── experiments/                    # Research infrastructure
│   ├── configs/                    # Experiment YAML definitions
│   ├── runner.py                   # Batch experiment orchestrator
│   ├── metrics_collector.py        # Comprehensive metrics collection
│   ├── analysis.py                 # Statistical analysis utilities
│   ├── bug_injector.py             # Ground-truth bug management
│   ├── regressions.py              # Regression test framework
│   ├── generate_figures.py         # Paper figure generation
│   ├── schema.sql                  # SQLite schema
│   └── results/experiments.db      # Experimental results database
├── personas/                       # Persona YAML files
├── scenarios/                      # Scenario YAML files
├── aut_service.py                  # FastAPI application under test
├── main.py                         # Single-session entry point
├── dashboard_app.py                # Streamlit dashboard
├── media/                          # Demo assets for README
└── README.md
```

Quick Start

1) Install dependencies

```bash
pip install -r requirements.txt
playwright install chromium
```

2) Configure model providers

Set provider keys in your environment as needed for OpenAI, Google, Anthropic, and xAI.
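
For example (these are the conventional environment variable names for each provider's SDK; confirm the exact names against app/llm_client.py, and replace the placeholder values with real keys):

```shell
export OPENAI_API_KEY="sk-your-key"      # OpenAI
export GOOGLE_API_KEY="your-key"         # Google (Gemini)
export ANTHROPIC_API_KEY="your-key"      # Anthropic
export XAI_API_KEY="your-key"            # xAI
```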

The active model roster is loaded from config/model_config.yaml. Current defaults are:

  • gpt-4o (OpenAI)
  • gemini-1.5-flash (Google)
  • gemini-2.5-flash (Google)
  • claude-opus-4-1 (Anthropic)
  • grok-2-vision-1212 (xAI)

The codebase can also talk to Ollama through an OpenAI-compatible API for local models, but the repository is not configured for LLaVA by default; edit config/model_config.yaml to use local vision models.

3) Start the application under test

```bash
uvicorn aut_service:app --port 8000
```

4) Run a single multi-agent testing session

```bash
python main.py
```

Optional example:

```bash
python main.py --persona personas/adversarial_attacker.yaml --scenario scenarios/ui_shopping_flow.yaml --agents 4
```

With --agents 4, the framework uses the first four configured models from config/model_config.yaml.
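
That selection behavior can be sketched as a simple prefix of the roster. The list below mirrors the defaults named above; in the real framework it is loaded from config/model_config.yaml, whose exact schema may differ.

```python
# Illustrative roster mirroring the defaults listed in this README;
# the framework loads the real list from config/model_config.yaml.
ROSTER = [
    "gpt-4o",
    "gemini-1.5-flash",
    "gemini-2.5-flash",
    "claude-opus-4-1",
    "grok-2-vision-1212",
]

def select_models(n_agents):
    """--agents N uses the first N configured models."""
    return ROSTER[:n_agents]

print(select_models(4))
```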

5) Open the dashboard

```bash
streamlit run dashboard_app.py
```

Running Experiments

Run a configured experiment:

```bash
python experiments/runner.py --config experiments/configs/experiment_1a_multi_agent_scaling.yaml
```

Generate figures from the database:

```bash
python experiments/generate_figures.py
```

Run statistical analysis (example):

```python
from experiments.analysis import StatisticalAnalyzer

analyzer = StatisticalAnalyzer("experiments/results/experiments.db")
# Example comparisons:
# - single-agent vs multi-agent
# - benchmark vs baseline
```

Reproducibility

  • Fixed seeds are used for controlled experimental runs.
  • Experiment settings are versioned in YAML configs.
  • Turn-by-turn logs and aggregate metrics are persisted in SQLite.
  • Figures are generated programmatically from recorded results.
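
Fixed seeding can be sketched as follows; `seeded_run` is a hypothetical helper, not a function from this repository.

```python
import random

def seeded_run(seed):
    """Two runs with the same seed yield identical pseudo-random choices."""
    rng = random.Random(seed)  # per-run generator, isolated from global state
    return [rng.randint(0, 9) for _ in range(5)]

# Reproducibility property: the seed fully determines the trace.
assert seeded_run(42) == seeded_run(42)
print(seeded_run(42))
```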

License

MIT License

Authors

  • Dhiwahar Adhithya Kennady (dk5025)
  • Sumanth Bharadwaj Hachalli Karanam (sh8111)
