Benchmark for evaluating LLM agents on strategic clarification in underspecified workflows.
Core Question: Can agents recognize when critical information is missing and ask the right clarifying questions before acting?
| Resource | Link |
|---|---|
| Paper | LHAW: Controllable Underspecification for Long-Horizon Tasks |
| Dataset | ScaleAI/lhaw on Hugging Face (285 variants, CC BY 4.0) |
| Blog | Introducing LHAW |
Note: The TAC experiment infrastructure in this repo was adapted from scaleapi/mrt and the original TheAgentCompany/TheAgentCompany codebase.
LHAW is a dataset-agnostic synthetic pipeline that transforms well-specified tasks into controllably underspecified variants by systematically removing information across four dimensions — Goals, Constraints, Inputs, and Context — at configurable severity levels.
285 benchmark variants across three domains:
| Domain | Source Benchmark | Tasks | Variants | Description |
|---|---|---|---|---|
| TAC | TheAgentCompany | 13 | 85 | Enterprise workflows (DS, Finance, HR, SDE) |
| SWE-Bench Pro | SWE-Bench Pro | 75 | 100 | Real-world GitHub issue repair |
| MCP-Atlas | MCP-Atlas | 100 | 100 | Tool-use across MCP server integrations |
The end-to-end pipeline has four stages, each producing outputs consumed by the next:
- Baselines — Run pass@k on original, well-specified tasks to establish success rates and golden trajectories.
- Underspec Variants — Extract removable segments from task prompts, generate underspecified variants (delete/vaguify/genericize), run agent trials, and classify each variant as outcome-critical, divergent, or benign based on terminal state divergence.
- Filter Benchmark — Select top candidates targeting a domain-specific distribution (TAC: 40/30/30, SWE-Bench: 50/30/20 outcome-critical / divergent / benign) to produce the final benchmark JSON.
- User Simulator — Run clarification experiments where an LLM-powered user answers agent questions via MCP. Compare baseline (no `ask_user` tool) vs. user-sim (with tool) to measure the value of clarification.
See `synthetic/README.md` for pipeline data structures and programmatic usage.
```bash
cd lhaw
uv venv --python python3.11 && source .venv/bin/activate
uv pip install -r requirements.txt
cp .env.example .env   # edit with your credentials
source .env            # export LLM_API_KEY, LLM_BASE_URL, LLM_EVAL_MODEL
```

All scripts use centralized defaults from `constants.py`. Override via env vars (e.g. `NUM_TRIALS=3 PARALLEL_VARIANTS=2`).
Run tests:

```bash
python -m pytest tests -q
```

Docker requirements: Each agent trial spawns 2 Docker containers (OpenHands orchestrator + sandbox runtime). With `PARALLEL_VARIANTS=1` (the default), only 2 containers run at a time. Increase `PARALLEL_VARIANTS` only if you have sufficient memory (each pair needs ~4-8 GB).
Model shortcuts (defined in constants.py):
| Provider | Shortcuts |
|---|---|
| Anthropic | opus_4_6, opus_4_5, sonnet_4_6, sonnet_4_5, sonnet_4, haiku_4_5 |
| OpenAI | gpt_5_2, gpt_5_1, gpt_5, o3_pro, o3, gpt_4_1_mini |
| Google | gemini_3_pro, gemini_3_flash, gemini_3_1_pro, gemini_3_1_flash_lite |
| Other | kimi_k2, qwen3_235b, llama4_maverick, glm_4p5_air, nova_2_lite |
Each domain has a self-contained example script for quick iteration and a detailed sub-README for full reproduction. The pipeline stages are the same across domains — only the agent backend and evaluation differ.
| Report | Paper Table | Script |
|---|---|---|
| Overall performance summary | Table 3 | generate_reports.sh all |
| Pass@3 by information dimension | Table 6 | generate_reports.sh all |
| Avg checkpoint progress by dimension | Table 7 | generate_reports.sh all |
| Pass@3 by ambiguity class | Table 9 | generate_reports.sh all |
| Agentic prompting ablation | Table 11 | generate_reports.sh ablation |
Quick start: bash run_tac_example.sh (4 tasks, 3 models). Edit variables at top to customize.
Full reproduction (all 33 tasks, all models, pass@3):
```bash
# 1. Baselines + golden trajectories
./scripts/synth_pipeline/tac_full_baselines.sh

# 2. Generate + run underspec variants
./scripts/synth_pipeline/run_experiment.sh lhaw_tac

# 3. Process + filter benchmark (40/30/30)
python scripts/process_tac_underspec.py -e "lhaw_tac_*" --judge
./scripts/synth_pipeline/filter.sh lhaw_tac --max 85

# 4. User simulator experiments
./scripts/user_exps/run_experiment.sh baseline lhaw_tac
./scripts/user_exps/run_experiment.sh usersim lhaw_tac

# 5. Generate reports (Tables 3, 6, 7, 9, 11)
./scripts/user_exps/generate_reports.sh all lhaw_tac
./scripts/user_exps/generate_reports.sh ablation lhaw_tac
```

All stages support `--resume` to skip completed work and continue from where you left off. See `run_tac_example.sh` for the step-by-step flow with comments.
Requires Python 3.11+, SWE-agent (fork), and Modal for container orchestration.
| Report | Paper Table | Script |
|---|---|---|
| Overall performance summary | Table 3 | compute_swebench_metrics.py |
| Pass@3 by information dimension | Table 6 | compute_swebench_metrics.py |
| Avg checkpoint progress by dimension | Table 7 | compute_swebench_metrics.py |
| Pass@3 by ambiguity class | Table 10 | compute_swebench_metrics.py |
Quick start: bash run_swebench_example.sh (5 tasks, 3 models, 10 variants). Edit DOCKERHUB_USER at top of script.
Full details: experiments/swebench/README.md — setup, CLI reference, multi-model experiments, output structure, and troubleshooting.
Requires Docker and the MCP-Atlas repo with running MCP servers.
| Report | Paper Table | Script |
|---|---|---|
| Overall performance summary | Table 3 | task_completion_mcpatlas.py + plot_pass3_from_runs.py |
| Pass@3 by segments removed | Table 2 | underspec_pass3_by_segments.py |
| Pass@3 by information dimension | Table 6 | plot_pass3_from_runs.py --mapping-csv |
| Pass@3 by ambiguity class | Table 8 | plot_pass3_from_runs.py --mapping-csv |
| Ask-user failure modes | Table 5 | analyze_ask_user.py + plot_ask_user.py |
| User persona ablation | Table 4 | analyze_ask_user.py (per persona) |
Quick start: bash run_mcpatlas_example.sh (15 tasks, 3 models, default LIMIT=15). Edit BASELINE_MODELS to control task selection. The workflow runs in two service phases; see the service note below and the run_mcpatlas_example.sh comments for details.
Service note: The script runs in two phases. Baselines and underspec-without-ask run with the normal completion service. Before ask_user experiments, stop the service and restart it with `USER_TOOL_ENABLED=True make run-mcp-completion`. See script comments for details.
Full details: experiments/mcpatlas/README.md — setup (API keys, data imports, MCP server list), CLI reference, paper ablations, and troubleshooting.
The ScaleAI/lhaw dataset contains all 285 benchmark variants used in the paper. Using it lets you skip the generation phase (segment extraction, variant generation, empirical validation) and go straight to running baselines + evaluation with the existing run scripts on the validated underspecified variants.
```bash
# Load variants for any benchmark:
python scripts/load_hf_dataset.py --dataset MCP-Atlas --output-dir experiments/mcpatlas/underspec_output/hf_variants
python scripts/load_hf_dataset.py --dataset TheAgentCompany --output-dir experiments/agentcompany/hf_variants
python scripts/load_hf_dataset.py --dataset "SWE-Bench Pro" --output-dir experiments/swebench/hf_variants
```

Then modify the run script variables to point at the loaded data and skip generation:
| Benchmark | Run script | What to change | Steps to skip |
|---|---|---|---|
| MCP-Atlas | `run_mcpatlas_example.sh` | Set `UNDERSPEC_DIR`, `BASELINE_MODELS`, use `--task_ids` | Steps 2-3, 5 |
| TAC | `run_tac_example.sh` | Run step 1 baselines manually on HF tasks, then set `FILTERED_JSON` for step 6 | Steps 1b-5 |
| SWE-Bench Pro | `run_swebench_example.sh` | Copy YAMLs into `EXP_DIR`, drop `--skip-baseline` | Steps 1-2 |
MCP-Atlas: Step 4 feeds underspec prompts via `--input_csv "$UNDERSPEC_DIR/underspec_prompts.csv"`. Set `BASELINE_MODELS=("${MODELS[@]}")` so all model baselines run in step 1 (step 5 depends on `PASSED_JSON` from the skipped step 2, so skip it too):
```bash
# In run_mcpatlas_example.sh, change:
UNDERSPEC_DIR="experiments/mcpatlas/underspec_output/hf_variants"
BASELINE_MODELS=("${MODELS[@]}")

# In step 1, replace --limit "$LIMIT" with:
TASK_IDS=$(python3 -c "import pandas as pd; print(','.join(pd.read_csv('$UNDERSPEC_DIR/underspec_prompts.csv')['TASK'].unique()))")
# --task_ids "$TASK_IDS"

# Skip steps 2-3, 5, 8, and 11. Continue from step 4.
# (Steps 8 and 11 require json/ from the generation phase.)
```

TAC: `run_tac_example.sh` hardcodes its own `TASKS`, so HF mode is partially manual today. Step 6 (`run_experiment.sh`) does honor `FILTERED_JSON`, but step 1 baselines should be run manually on the HF benchmark tasks:
```bash
# Before running run_tac_example.sh (and after scripts/load_hf_dataset.py for TAC):
HF_DIR="experiments/agentcompany/hf_variants"
TASKS=($(python3 -c "import json; print(' '.join(sorted(set(v['task'] for v in json.load(open('$HF_DIR/benchmark.json'))))))"))

# Step 1 only: run baselines on those tasks
BASELINE_MODELS=(gemini_3_flash sonnet_4_6 gemini_3_1_flash_lite)  # quick-start example
# For paper-style reproduction instead:
# BASELINE_MODELS=(opus_4_5 sonnet_4_5 gemini_3_pro gemini_3_flash gpt_5_2)
for model in "${BASELINE_MODELS[@]}"; do
  ./scripts/synth_pipeline/tac.sh "$model" --tasks "${TASKS[@]}"
done

# Then use the HF benchmark directly for step 6 onward
export FILTERED_JSON="$HF_DIR/benchmark.json"
# Skip steps 1b-5 from run_tac_example.sh and continue with step 6.
```

SWE-Bench Pro: Step 3 (`--run --exp-dir`) reads `instances.yaml` for both baselines and underspec trials. Since steps 1-2 are skipped, remove `--skip-baseline` from step 3 so baselines run against `original_instances.yaml`:
```bash
EXP_DIR="experiments/swebench/runs/run_hf_$(date +%Y%m%d_%H%M%S)"
mkdir -p "$EXP_DIR"
cp experiments/swebench/hf_variants/{instances,original_instances}.yaml "$EXP_DIR/"
cp experiments/swebench/hf_variants/underspec_candidates.csv "$EXP_DIR/"
# In step 3, remove --skip-baseline. Continue steps 3 → 4 → 5 → 9.
```

Note: To generate new variants with different parameters (severity, tasks, segment count), run the full generation phase using the run scripts instead.
Segments are classified by 4 dimensions (see synthetic/configs/taxonomy.yaml):
| Dimension | Description | Subdimensions |
|---|---|---|
| GOAL | What to produce | objective, identifier, format |
| CONSTRAINT | How to do it | procedure, criteria, deadline |
| INPUT | From where | location, tool/API, reference |
| CONTEXT | Background info | domain knowledge, history |
Each segment has:
- criticality: 0.0 / 0.5 / 1.0 (how important for success)
- guessability: 0.0 / 0.5 / 1.0 (how likely to guess correctly)
- priority_score: criticality × (1 - guessability) (higher = better for removal)
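The priority score is just the formula above restated as code; here is a minimal Python sketch (the helper name is illustrative, not from the repo):

```python
def priority_score(criticality: float, guessability: float) -> float:
    """priority_score = criticality * (1 - guessability); higher = better removal candidate."""
    return criticality * (1.0 - guessability)

# A critical, hard-to-guess segment is the best removal target:
print(priority_score(1.0, 0.0))  # -> 1.0
# A fully guessable segment scores zero regardless of criticality:
print(priority_score(1.0, 1.0))  # -> 0.0
```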
Task Performance:
| Metric | Formula | Meaning |
|---|---|---|
| pass@k | 1 - C(n-c,k)/C(n,k) | P(>=1 success in k trials), unbiased estimator |
| pass^k | C(c,k)/C(n,k) | P(k consecutive successes) |
| Ckpt% | avg(score/total) per trial | Average checkpoint progress (partial completion) |
Where n=trials, c=successes. Uses the unbiased combinatorial estimator from HumanEval (Chen et al., 2021).
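Both estimators can be computed directly from n and c with `math.comb`; a minimal sketch (function names are ours):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """P(>=1 success in k of n trials): 1 - C(n-c, k) / C(n, k) (Chen et al., 2021)."""
    if n - c < k:
        return 1.0  # every size-k sample must contain at least one success
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_pow_k(n: int, c: int, k: int) -> float:
    """P(all k sampled trials succeed): C(c, k) / C(n, k)."""
    return comb(c, k) / comb(n, k)

print(pass_at_k(3, 1, 3))   # -> 1.0 (with k = n, any success counts)
print(pass_pow_k(3, 1, 3))  # -> 0.0 (cannot draw 3 successes from 1)
```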
Clarification Behavior (Table 3):
| Metric | Formula | Meaning |
|---|---|---|
| Ask% | trials_with_ask_user / total_trials | Fraction of trials invoking the user tool |
| Avg/Traj | total_questions / trials_with_questions | Mean questions per trajectory (among those that asked) |
| Gain/Q | delta_pass@3 / total_questions | Performance gain per question asked |
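The three clarification metrics are simple ratios over per-run counters; a sketch mirroring the formulas above (counter names are assumptions, not repo identifiers):

```python
def clarification_metrics(trials_with_ask: int, total_trials: int,
                          total_questions: int, trials_with_questions: int,
                          delta_pass_at_3: float) -> dict:
    return {
        # Fraction of trials that invoked the ask_user tool at least once
        "ask_pct": trials_with_ask / total_trials,
        # Mean questions per trajectory, among trajectories that asked
        "avg_per_traj": total_questions / trials_with_questions if trials_with_questions else 0.0,
        # Performance gain per question asked
        "gain_per_q": delta_pass_at_3 / total_questions if total_questions else 0.0,
    }
```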
Ambiguity Classification:
| Class | Definition | Oracle Decision |
|---|---|---|
| outcome-critical | 0/N success + divergent terminal states | CLARIFY |
| divergent | Some success (1-2/3) + variable outcomes | PROCEED |
| benign | N/N success despite missing info | PROCEED |
| new_task | 0/N + 1 state (LLM judged as different task) | FILTERED OUT |
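The decision rule in the table can be sketched as a function of per-variant trial outcomes. This is a simplification: in the actual pipeline, terminal-state divergence and the new-task check are judged by an LLM, not by simple counts.

```python
def classify_variant(successes: int, n_trials: int,
                     n_terminal_states: int, judged_new_task: bool) -> str:
    if successes == n_trials:
        return "benign"            # succeeds despite missing info -> PROCEED
    if successes == 0:
        if n_terminal_states == 1 and judged_new_task:
            return "new_task"      # reads as a different task -> filtered out
        return "outcome-critical"  # never succeeds, outcomes diverge -> CLARIFY
    return "divergent"             # partial success (e.g. 1-2/3) -> PROCEED
```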
| Strategy | Description | File |
|---|---|---|
| `none` | Baseline — no strategy instructions | — |
| `react` | Thought -> Action -> Observation cycle | experiments/agentcompany/openhands/agent_strategies/react.md |
| `reflexion` | Act -> Self-Assess -> Decide loop | experiments/agentcompany/openhands/agent_strategies/reflexion.md |
| `plan_and_execute` | Plan -> Execute -> Re-plan | experiments/agentcompany/openhands/agent_strategies/plan_and_execute.md |
| Severity | Effect | Example |
|---|---|---|
| `delete` | Remove entirely | "Save to report.xlsx" -> (gone) |
| `vaguify` | Replace with vague placeholder | "Save to report.xlsx" -> "Save to the file" |
| `genericize` | Replace with generic value | "Save to report.xlsx" -> "Save to output.xlsx" |
| Flag | Grounding |
|---|---|
| (default) | Prompt + trajectory + checkpoints (recommended) |
| `--no-trajectory` | Prompt + checkpoints |
| `--no-checkpoints` | Prompt + trajectory |
| `--prompt-only` | Prompt only |
```
lhaw/
├── README.md
├── run_tac_example.sh                    # TAC mini reproduction (4 tasks, 3 models)
├── run_swebench_example.sh               # SWE-Bench Pro mini reproduction (5 tasks, 3 models)
├── run_mcpatlas_example.sh               # MCP-Atlas mini reproduction (15 tasks, 3 models)
├── constants.py                          # Centralized defaults + model registry
├── task_completion_agentcompany.py       # TAC orchestrator
├── task_completion_mcpatlas.py           # MCP-Atlas orchestrator
├── task_completion_swebench.py           # SWE-Bench orchestrator
├── scripts/
│   ├── synth_pipeline/                   # TAC-specific shell orchestration
│   │   ├── run_experiment.sh             # Generate + run underspec trials
│   │   ├── filter.sh                     # Process + filter benchmark
│   │   ├── tac.sh                        # Run TAC baseline (single model)
│   │   └── tac_full_baselines.sh         # Run TAC baseline (all models)
│   ├── user_exps/                        # TAC-specific user simulator scripts
│   │   ├── run_experiment.sh             # User simulator experiments
│   │   ├── run_ablation_agent_prompt.sh
│   │   └── generate_reports.sh           # Reports (all / ablation / trajectories)
│   ├── run_tac_underspec.py              # TAC underspec runner (generate + run trials)
│   ├── process_tac_underspec.py          # TAC results processing + LLM judge
│   ├── process_tac_usersim.py            # TAC user simulator metrics
│   ├── compare_tac_conditions.py         # TAC cross-condition comparison
│   ├── filter_tac_samples.py             # TAC benchmark filtering (40/30/30)
│   ├── filter_passed_tasks.py            # Cross-benchmark baseline task selection
│   ├── summarize_swebench_baselines.py   # SWE-Bench baseline pass@k summary
│   ├── filter_swebench_samples.py        # SWE-Bench quota filter
│   ├── process_swebench_underspec.py     # SWE-Bench eval + classify
│   ├── export_swebench_dataset.py        # SWE-Bench benchmark export
│   ├── compute_swebench_metrics.py       # SWE-Bench ICML tables
│   ├── compute_phase_b_results.py        # SWE-Bench cross-model comparison
│   ├── generate_mcpatlas_underspec.py    # MCP-Atlas underspec variant generation
│   ├── load_hf_dataset.py                # Load pre-computed variants from HuggingFace
│   ├── view_tac_trajectory.py            # TAC trajectory HTML viewer
│   └── export_tac_golden_trajectories.py # TAC golden trajectory export
├── synthetic/                            # Pipeline internals (see synthetic/README.md)
│   └── adapters/
│       ├── tac.py                        # TAC adapter
│       ├── swebench.py                   # SWE-Bench adapter
│       └── mcpatlas.py                   # MCP-Atlas adapter
├── evaluation/                           # Scoring, pass@k, tac_eval.py (checkpoint grader CLI)
├── task_pairs_agentcompany/              # 33 TAC task definitions
├── swebenchpro/                          # SWE-Bench Pro data (git submodule)
├── experiments/
│   ├── agentcompany/                     # TAC: OpenHands agent + MCP user sim + runs/
│   ├── swebench/                         # SWE-Bench: README.md + runs/
│   └── mcpatlas/                         # MCP-Atlas: README.md + configs/ + scripts/ + runs/
└── tests/                                # pytest suite
```
```bibtex
@misc{pu2026lhawcontrollableunderspecificationlonghorizon,
  title={LHAW: Controllable Underspecification for Long-Horizon Tasks},
  author={George Pu and Michael S. Lee and Udari Madhushani Sehwag and David J. Lee and Bryan Zhu and Yash Maurya and Mohit Raghavendra and Yuan Xue and Samuel Marc Denton},
  year={2026},
  eprint={2602.10525},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.10525},
}
```