An NYU IDLS research project for automated beta testing using multi-agent LLM committees with browser vision, consensus voting, and experiment-driven evaluation.
The animated preview above is always visible in GitHub README view. Click it to open the full demo.mp4.
This framework runs end-to-end browser testing sessions where multiple AI agents inspect screenshots, propose actions, debate alternatives, and execute a consensus action through Playwright automation. It is designed for both practical application-under-test (AUT) testing and research-grade experimentation.
Core capabilities:
- Multi-agent committee decision making with a 3-round voting protocol.
- Vision-enabled UI interaction through screenshot-grounded reasoning (via configured multimodal APIs—not LLaVA by default).
- Persona-driven testing to diversify behavioral coverage.
- Safety-aware action validation and structured turn logging.
- Statistical experiment pipeline for reproducible research outputs.
High-level data flow from configuration through the multi-agent loop, browser automation, validation, and recorded outputs.
```mermaid
flowchart TB
    subgraph config["Configuration"]
        MC["config/model_config.yaml"]
        PS["Personas / scenarios YAML"]
        EY["experiments/configs/*.yaml"]
    end
    subgraph exec["Runtime"]
        EN["main.py or experiments/runner.py"]
        RUN["multi_agent_runner.py"]
        COM["Multi-agent committee (3-round voting)"]
        API["LLMClient: OpenAI / Google / Anthropic / xAI"]
        PW["browser_adapter.py (Playwright)"]
        VLD["validators.py"]
    end
    AUT["Application under test: aut_service.py"]
    subgraph out["Outputs"]
        CSV["Session logs + CSV"]
        SQLITE[(experiments/results/experiments.db)]
        ST["dashboard_app.py (Streamlit)"]
    end
    MC --> API
    PS --> RUN
    EY --> EN
    EN --> RUN
    RUN --> COM
    COM --> API
    RUN --> PW
    PW --> AUT
    COM --> VLD
    VLD --> PW
    PW -->|screenshots + state| COM
    RUN --> CSV
    EN --> SQLITE
    SQLITE --> ST
    CSV --> ST
```
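The runtime portion of this flow can be sketched as a per-turn loop. This is an illustrative simplification only; the class and method names below (`observe`, `decide`, `is_safe`, `execute`) are assumptions for exposition, not the actual APIs of `multi_agent_runner.py`:

```python
# Illustrative sketch of the per-turn session loop (all names are assumptions,
# not the repository's real interfaces).

def run_session(browser, committee, validator, max_turns=10):
    log = []
    for turn in range(max_turns):
        screenshot, state = browser.observe()          # PW --> COM
        action = committee.decide(screenshot, state)   # 3-round voting protocol
        if not validator.is_safe(action, state):       # COM --> VLD
            log.append((turn, action, "rejected"))
            continue
        result = browser.execute(action)               # VLD --> PW --> AUT
        log.append((turn, action, result))
        if result == "goal_reached":
            break
    return log
```

The validator sits between the committee's decision and browser execution, so unsafe actions are logged but never reach the application under test.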
Data source: experiments/results/experiments.db
- Total runs: 84
- Overall task success rate: 89.5%
- Average turns per run: 8.13
- Mean latency: 0.87s (P50: 0.71s, P95: 1.92s, P99: 2.16s)
- Total actions executed: 683
- Overall action success rate: 93.1%
- Multi-agent committees (2-4 agents) outperform single-agent runs (91.7-100.0% vs 78.0% success).
- Committee agreement reaches 100% for 2-4 agent configurations.
- On WebShop benchmark tasks, the framework achieves 74.7% success vs published GPT-3 baseline 50.1% (+24.6pp).
- Regression detection reaches 100% success in evaluated scenarios.
- OWASP Juice Shop security testing reaches 82.0% success.
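The P50/P95/P99 latency figures above are computed over per-action latencies. A minimal sketch, assuming nearest-rank percentiles over made-up sample values (not the recorded data):

```python
import math

def percentile(latencies, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    xs = sorted(latencies)
    k = min(len(xs) - 1, max(0, math.ceil(p / 100 * len(xs)) - 1))
    return xs[k]

# Made-up per-action latencies in seconds, for illustration only:
samples = [0.4, 0.6, 0.71, 0.8, 0.9, 1.1, 1.5, 1.92, 2.0, 2.16]
p50, p95, p99 = (percentile(samples, p) for p in (50, 95, 99))
```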
| Experiment | Runs | Success Rate | Avg Turns | Avg Latency |
|---|---|---|---|---|
| Multi-Agent Scaling | 9 | 100.0% | 4.0 | 0.46s |
| Persona Diversity | 27 | 92.2% | 12.6 | 0.25s |
| Regression Detection | 18 | 100.0% | 4.0 | 0.54s |
| OWASP Juice Shop Security | 12 | 82.0% | 12.7 | 2.65s |
| WebShop Task Success | 18 | 74.7% | 4.6 | 1.16s |
- Define experiment configs in `experiments/configs/*.yaml`.
- Run orchestrated experiments via `experiments/runner.py`.
- Execute per-turn multi-agent testing sessions with browser automation.
- Collect turn-level and run-level metrics into SQLite.
- Run statistical analyses (t-tests, ANOVA, confidence intervals, effect sizes).
- Generate publication figures and integrate findings into the paper.
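The statistical step can be illustrated with a stdlib-only sketch of Welch's t statistic and Cohen's d on per-run success indicators. This is a generic example with made-up data, not the repository's `experiments/analysis.py`:

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))

def cohens_d(a, b):
    """Cohen's d effect size using a pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = math.sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled

# Made-up per-run success indicators (1 = task success), for illustration only:
multi = [1, 1, 1, 1, 0, 1, 1, 1, 1]
single = [1, 0, 1, 1, 0, 1, 1, 0, 1]
```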
- Round 1 — Independent proposals: each agent proposes action + confidence.
- Round 2 — Discussion/refinement: agents review and revise proposals.
- Round 3 — Consensus vote: confidence-weighted aggregation selects final action.
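The Round 3 aggregation can be sketched as confidence-weighted plurality voting. The code below is an illustrative simplification, not the implementation in `multi_agent_committee.py`; the `Proposal` type is assumed for exposition:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Proposal:
    # Hypothetical stand-in for an agent's Round 3 ballot.
    action: str        # e.g. "click:#checkout"
    confidence: float  # self-reported confidence in [0, 1]

def consensus_vote(proposals):
    """Pick the action with the highest summed confidence; ties break alphabetically."""
    weights = defaultdict(float)
    for p in proposals:
        weights[p.action] += p.confidence
    return max(sorted(weights), key=lambda a: weights[a])

ballots = [
    Proposal("click:#checkout", 0.9),
    Proposal("type:#search", 0.8),
    Proposal("click:#checkout", 0.6),
]
winner = consensus_vote(ballots)  # "click:#checkout" wins with total weight 1.5
```

Weighting by confidence lets a minority of highly confident agents outvote a less certain majority, which is why the protocol asks each agent to report confidence alongside its action.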
```
LLM_Agents_For_Beta_Testing/
├── app/                          # Core runtime framework
│   ├── multi_agent_runner.py     # Session orchestration loop
│   ├── multi_agent_committee.py  # 3-round voting protocol
│   ├── browser_adapter.py        # Playwright integration
│   ├── agent.py                  # Agent abstraction
│   ├── llm_client.py             # Provider clients
│   ├── validators.py             # Safety/action validation
│   ├── storage.py                # Session logging
│   └── metrics.py                # Runtime metrics utilities
├── config/
│   └── model_config.yaml         # Model + provider roster
├── experiments/                  # Research infrastructure
│   ├── configs/                  # Experiment YAML definitions
│   ├── runner.py                 # Batch experiment orchestrator
│   ├── metrics_collector.py      # Comprehensive metrics collection
│   ├── analysis.py               # Statistical analysis utilities
│   ├── bug_injector.py           # Ground-truth bug management
│   ├── regressions.py            # Regression test framework
│   ├── generate_figures.py       # Paper figure generation
│   ├── schema.sql                # SQLite schema
│   └── results/experiments.db    # Experimental results database
├── personas/                     # Persona YAML files
├── scenarios/                    # Scenario YAML files
├── aut_service.py                # FastAPI application under test
├── main.py                       # Single-session entry point
├── dashboard_app.py              # Streamlit dashboard
├── media/                        # Demo assets for README
└── README.md
```
```bash
pip install -r requirements.txt
playwright install chromium
```

Set provider keys in your environment as needed for OpenAI, Google, Anthropic, and xAI.
The active model roster is loaded from config/model_config.yaml. Current defaults are:
- `gpt-4o` (OpenAI)
- `gemini-1.5-flash` (Google)
- `gemini-2.5-flash` (Google)
- `claude-opus-4-1` (Anthropic)
- `grok-2-vision-1212` (xAI)
The codebase can also talk to Ollama via an OpenAI-compatible API for local models, but the repository is not configured for LLaVA by default—edit config/model_config.yaml if you want to use local vision models.
Start the application under test:

```bash
uvicorn aut_service:app --port 8000
```

Run a single session:

```bash
python main.py
```

Optional example:

```bash
python main.py --persona personas/adversarial_attacker.yaml --scenario scenarios/ui_shopping_flow.yaml --agents 4
```

With `--agents 4`, the framework uses the first four configured models from `config/model_config.yaml`.
Launch the Streamlit dashboard:

```bash
streamlit run dashboard_app.py
```

Run a configured experiment:

```bash
python experiments/runner.py --config experiments/configs/experiment_1a_multi_agent_scaling.yaml
```

Generate figures from the database:

```bash
python experiments/generate_figures.py
```

Run statistical analysis (example):
```python
from experiments.analysis import StatisticalAnalyzer

analyzer = StatisticalAnalyzer("experiments/results/experiments.db")
# Example comparisons:
# - single-agent vs multi-agent
# - benchmark vs baseline
```

- Fixed seeds are used for controlled experimental runs.
- Experiment settings are versioned in YAML configs.
- Turn-by-turn logs and aggregate metrics are persisted in SQLite.
- Figures are generated programmatically from recorded results.
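The fixed-seed point above can be sketched generically; this is not the repository's actual seeding code, just an illustration of why a seeded RNG makes runs repeatable:

```python
import random

def seeded_run(seed):
    # With a fixed seed, the sampled action sequence is identical across runs.
    rng = random.Random(seed)
    return [rng.choice(["click", "type", "scroll"]) for _ in range(5)]

assert seeded_run(42) == seeded_run(42)  # same seed, same sequence
```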
MIT License
- Dhiwahar Adhithya Kennady (`dk5025`)
- Sumanth Bharadwaj Hachalli Karanam (`sh8111`)
