Fengyuan Liu*, Jay Gala*, Nilaksh, Dzmitry Bahdanau, Siva Reddy, Hugo Larochelle
*Equal Contribution

Evaluating the real-world capabilities of AI systems requires grounding benchmark performance in human-interpretable measures of task difficulty. Existing approaches that rely on direct human task completion time annotations are costly, noisy, and difficult to scale across benchmarks. In this work, we propose BRIDGE, a unified psychometric framework that learns a latent difficulty scale from model responses and anchors it to human task completion time. Using a two-parameter logistic Item Response Theory model, we jointly estimate latent task difficulty and model capability from model performance data across multiple benchmarks. We demonstrate that latent task difficulty varies linearly with the logarithm of human completion time, allowing human task completion time to be inferred for new benchmarks from model performance alone. Leveraging this alignment, we forecast frontier model capabilities in terms of human task length and independently reproduce METR's exponential scaling results, with the 50% solvable task horizon doubling approximately every 6 months.
- We fit a 2-parameter logistic IRT model on binary (pass/fail) outcomes from multiple benchmarks (METR, SWE-bench, MLE-bench, GDPval, Cybench) to jointly estimate latent task difficulty and model ability.
- Latent task difficulty correlates strongly with the log of human completion time (R² = 0.81 on METR tasks), enabling prediction of human task duration from model performance alone.
- On out-of-distribution benchmarks, BRIDGE outperforms both logit success-rate heuristics and LLM-based time estimators (Gemini 3 Pro, GPT-5.2), placing 92.3% of Cybench predictions within a 0.5x--2x tolerance band of actual human times.
- Frontier model capabilities are growing exponentially: the 50% solvable task-length horizon doubles approximately every 6 months, consistent with METR's findings -- but derived entirely from model performance data without human time annotations.
- Current SOTA models (as of late 2025) achieve 50% success on tasks estimated to require ~1-2.5 hours of human effort.
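At the core of the framework is the 2PL model. As a minimal illustrative sketch (not the fitting code itself), the probability that a model with ability θ solves a task with discrimination `a` and difficulty `b` is:

```python
import math

def p_success(theta: float, a: float, b: float) -> float:
    """2PL IRT: probability that a model with ability theta solves a
    task with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))
```

When ability equals difficulty (`theta == b`), the success probability is exactly 0.5, which is what makes the "50% solvable task horizon" a natural summary of a model's capability.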
```bash
# Clone the repository and navigate to the project directory
git clone https://github.com/McGill-NLP/BRIDGE.git
cd BRIDGE

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install all dependencies
uv sync
```

Note: Requires Python 3.11+ and `uv`.
The pipeline expects benchmark results and evaluation logs to be present under the data/ directory. Each benchmark source has its own normalized JSONL format, and models are mapped to canonical names via data/model_run_mapping.json.
Here is the expected directory structure of the project:

```
BRIDGE/
├── data/
│   ├── model_run_mapping.json              # Curated model -> run_id mapping with metadata
│   ├── all_runs.jsonl                      # Per-task run logs
│   ├── swebench_normalized_results.jsonl   # SWE-bench results (normalized)
│   ├── gdpval_normalized_results.jsonl     # GDPval benchmark results
│   ├── mlebench_normalized_results.jsonl   # MLE-bench benchmark results
│   ├── human_minutes_by_task.jsonl         # SWE-bench human time estimates
│   ├── cybench_fst_master.jsonl            # Cybench FST reference data
│   ├── cybench_data_by_challenges/         # Raw Cybench log files
│   ├── cybench/benchmark/                  # CyBench benchmark tasks (used by LLM estimation)
│   └── experiments/evaluation/             # SWE-bench verified logs from the public leaderboard
│       ├── verified/
│       └── bash-only/
├── parse_cybench_logs.py
├── prepare_irt.py
├── prepare_sparse_pyirt.py
├── fit_irt.py
├── compute_baseline.py
├── merge_human_minutes.py
├── add_release_time.py
├── swebench_human_time_llm_estimate.py
├── cybench_human_time_llm_estimate.py
├── results_analysis.ipynb
├── run_pipeline.sh
├── pyproject.toml
└── README.md
```
The full analysis pipeline is orchestrated by `run_pipeline.sh`:

```bash
bash run_pipeline.sh
```

This runs the following steps in order:
```bash
uv run python parse_cybench_logs.py --verbose
```

Parses raw Cybench evaluation log files from `data/cybench_data_by_challenges/` and produces:

- `data/cybench_normalized_results.jsonl` -- normalized per-task, per-model scores (unguided mode only, filtered to tasks with at least one model success)
- `data/cybench_human_minutes_by_task.jsonl` -- human time estimates matched from `data/cybench_fst_master.jsonl`
```bash
uv run python prepare_irt.py --verbose
```

Builds `data/swe_a_pyirt.jsonl` by combining:

- SWE-bench results (`data/swebench_results.jsonl` and `data/swebench_normalized_results.jsonl`)
- Verified evaluation runs from `data/experiments/evaluation/verified/` (and `bash-only/`)
- A curated model-to-run mapping (`data/model_run_mapping.json`)

Outputs a py-IRT JSONL file with one row per model (subject), containing binary pass/fail responses across all SWE-bench task IDs.
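For illustration, a single subject row in this format might look like the following. The model name and task IDs are hypothetical examples, and the exact field names may differ from the script's actual output:

```python
import json

# Hypothetical example of one subject row in the py-IRT JSONL file:
# one line per model, with binary pass/fail responses keyed by task ID.
row = {
    "subject_id": "gpt-4o-2024-05-13",    # canonical model name (illustrative)
    "responses": {
        "astropy__astropy-12907": 1,      # task solved
        "django__django-11019": 0,        # task failed
    },
}
line = json.dumps(row)
print(line)
```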
```bash
uv run python prepare_sparse_pyirt.py \
    --model-mapping data/model_run_mapping.json \
    --pyirt-input data/swe_a_pyirt.jsonl \
    --runs-input data/all_runs.jsonl \
    --gdpval-input data/gdpval_normalized_results.jsonl \
    --mlebench-input data/mlebench_normalized_results.jsonl \
    --cybench-input data/cybench_normalized_results.jsonl \
    --output data/all_a_pyirt.jsonl \
    --print-subject-counts \
    --keep-unmapped-pyirt-subjects \
    --verbose
```

Merges data from multiple benchmarks (SWE-bench, GDPval, MLE-bench, Cybench) into a single sparse py-IRT dataset at `data/all_a_pyirt.jsonl`. Models are canonicalized via the model mapping, and duplicate responses are resolved by majority vote.
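The majority-vote step can be sketched as follows. This is an illustration of the idea, not the script's implementation; in particular, the tie-breaking rule here (ties resolve to failure) is an assumption:

```python
from collections import Counter, defaultdict

def resolve_duplicates(records):
    """Collapse duplicate (model, task) responses by majority vote.

    records: iterable of (model, task, response) with response in {0, 1}.
    Ties are broken in favor of failure (0) here; the actual script may
    use a different rule.
    """
    votes = defaultdict(list)
    for model, task, response in records:
        votes[(model, task)].append(response)
    resolved = {}
    for key, responses in votes.items():
        counts = Counter(responses)
        resolved[key] = 1 if counts[1] > counts[0] else 0
    return resolved
```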
```bash
uv run python fit_irt.py --input_path data/all_a_pyirt.jsonl
```

Fits a two-parameter logistic (2PL) IRT model on the combined dataset. Outputs:

- `params/all_a_pyirt.csv` -- item parameters (discrimination `a`, difficulty `b`)
- `params/all_a_pyirt_abilities.csv` -- subject (model) ability estimates
```bash
uv run python compute_baseline.py --input_path data/all_a_pyirt.jsonl
```

Computes simple averaging-based baselines for comparison with IRT (see Baselines for details). Outputs:

- `params/all_a_pyirt_baseline.csv` -- item difficulty (logit-transformed failure rate)
- `params/all_a_pyirt_baseline_abilities.csv` -- subject ability (logit-transformed success rate)
```bash
cat data/human_minutes_by_task.jsonl data/cybench_human_minutes_by_task.jsonl > data/combined_human_minutes.jsonl
```

Concatenates SWE-bench and Cybench human-minute annotations into a single file.
```bash
uv run python merge_human_minutes.py --csv params/all_a_pyirt.csv --jsonl data/combined_human_minutes.jsonl
```

Attaches `human_minutes` values to the IRT item parameters CSV by matching on `task_id`.
```bash
uv run python add_release_time.py
```

Annotates the subject abilities CSV (`params/all_a_pyirt_abilities.csv`) with model release dates sourced from `data/model_run_mapping.json`.
In addition to the IRT-based difficulty estimates, we provide two types of baselines for comparison: a compute baseline that uses logit-transformed success rates, and an LLM-based baseline that directly prompts a language model to estimate human task completion time.
A simple averaging-based baseline that estimates task difficulty and model ability from raw pass/fail rates without fitting an IRT model. For each task, the difficulty is the logit-transformed failure rate across all models that attempted it; for each model, the ability is the logit-transformed success rate across all tasks it attempted.
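This baseline can be sketched in a few lines. The clipping constant `eps` is an assumption to keep the logit finite when a task is solved by all models or by none; the actual script may handle boundary rates differently:

```python
import math

def logit(p: float, eps: float = 1e-3) -> float:
    """Logit transform with clipping to avoid infinities at 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def baseline_difficulty(outcomes):
    """Task difficulty = logit of the failure rate across models."""
    fail_rate = 1.0 - sum(outcomes) / len(outcomes)
    return logit(fail_rate)

def baseline_ability(outcomes):
    """Model ability = logit of the success rate across tasks."""
    return logit(sum(outcomes) / len(outcomes))
```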
```bash
uv run python compute_baseline.py \
    --input_path data/all_a_pyirt.jsonl \
    --verbose
```

| Argument | Description |
|---|---|
| `--input_path` | Path to the py-IRT JSONL file (required) |
| `--output_dir` | Output directory for CSVs (default: `params/`) |
| `--verbose` | Print summary statistics (sparsity, mean/std of estimates) |
Outputs:

- `params/all_a_pyirt_baseline.csv` -- per-task success rate, difficulty, and logit-transformed difficulty
- `params/all_a_pyirt_baseline_abilities.csv` -- per-model success rate and logit-transformed ability
Uses an LLM to directly estimate human task completion time from task descriptions. Each task's problem statement is sent to the model with a structured meta-prompt that asks for a point estimate in minutes along with a justification. This serves as a non-psychometric baseline for comparison with IRT-derived time predictions.
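The parsing side of this baseline can be sketched as below, assuming the meta-prompt asks the model to reply with a JSON object containing an `estimated_minutes` field (the actual prompt and response schema used by the scripts may differ):

```python
import json

def parse_time_estimate(reply: str) -> float:
    """Extract a point estimate in minutes from an LLM reply that is
    expected to contain a JSON object with an `estimated_minutes` field.
    Raises ValueError if no valid estimate is found."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object in reply")
    data = json.loads(reply[start : end + 1])
    minutes = float(data["estimated_minutes"])
    if minutes <= 0:
        raise ValueError("estimate must be positive")
    return minutes
```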
Both scripts require API keys set as environment variables (`OPENAI_API_KEY` or `GOOGLE_API_KEY`), or in a `.env` file in the project root.
SWE-bench tasks (`swebench_human_time_llm_estimate.py`):
Loads the SWE-bench Verified dataset from HuggingFace, extracts each issue's problem statement and repository context, and prompts the LLM to estimate how long a skilled human developer would take to resolve it.
```bash
# Gemini 3 Pro
uv run python swebench_human_time_llm_estimate.py \
    --provider google --model gemini-3-pro-preview \
    --batch-size 20 --verbose

# GPT 5.2
uv run python swebench_human_time_llm_estimate.py \
    --provider openai --model gpt-5.2-2025-12-11 \
    --batch-size 20 --verbose
```

| Argument | Description |
|---|---|
| `--provider` | LLM provider: `openai` or `google` (default: `openai`) |
| `--model` | Model name (defaults: `gpt-4o` for OpenAI, `gemini-2.5-flash` for Google) |
| `--output-file` | Output JSONL path (default: `data/swebench_time_estimations_{provider}_{model}.jsonl`) |
| `--max-samples` | Limit number of tasks to process |
| `--start-idx` | Resume processing from a given index |
| `--include-patch-info` | Include gold patch size in the prompt (may bias estimates) |
| `--reasoning-effort` | Reasoning effort for OpenAI o-series/GPT-5 models: `low`, `medium`, `high` (default: `medium`) |
| `--batch-size` | Number of concurrent API requests (default: 10) |
| `--analyze-only` | Only print summary statistics from an existing results file |
| `--results-file` | Path to results file (used with `--analyze-only`) |
| `--verbose` | Print detailed progress information |
Cybench CTF tasks (`cybench_human_time_llm_estimate.py`):
Reads task metadata (category, difficulty, subtasks, prompts) from the local CyBench benchmark directory and prompts the LLM to estimate how long an experienced CTF player would take to solve each challenge.
```bash
# Gemini 3 Pro
uv run python cybench_human_time_llm_estimate.py \
    --provider google --model gemini-3-pro-preview \
    --benchmark-path data/cybench/benchmark --no-subtasks \
    --batch-size 10 --verbose

# GPT 5.2
uv run python cybench_human_time_llm_estimate.py \
    --provider openai --model gpt-5.2-2025-12-11 \
    --benchmark-path data/cybench/benchmark --no-subtasks \
    --batch-size 10 --verbose
```

| Argument | Description |
|---|---|
| `--benchmark-path` | Path to the CyBench benchmark directory (default: `data/cybench/benchmark`) |
| `--provider` | LLM provider: `openai` or `google` (default: `openai`) |
| `--model` | Model name (same defaults as the SWE-bench script) |
| `--output-file` | Output JSONL path (default: `data/cybench_time_estimations_{provider}_{model}.jsonl`) |
| `--max-tasks` | Limit number of tasks to process |
| `--category` | Filter by CTF category (e.g., `crypto`, `web`, `pwn`, `reverse`, `forensics`) |
| `--competition` | Filter by competition name (e.g., `hackthebox`, `hkcert-ctf`) |
| `--use-hard-prompt` | Use the hard prompt (less context) instead of the easy prompt |
| `--no-subtasks` | Exclude subtask information from the prompt |
| `--reasoning-effort` | Reasoning effort for OpenAI o-series/GPT-5 models (default: `medium`) |
| `--batch-size` | Number of concurrent API requests (default: 10) |
| `--analyze-only` | Only print summary statistics from an existing results file |
| `--results-file` | Path to results file (used with `--analyze-only`) |
| `--verbose` | Print detailed progress information |
After running the full pipeline, the following files are generated:
| File | Description |
|---|---|
| `data/cybench_normalized_results.jsonl` | Cybench benchmark results (step 1) |
| `data/cybench_human_minutes_by_task.jsonl` | Cybench human time estimates (step 1) |
| `data/swe_a_pyirt.jsonl` | SWE-bench py-IRT input (step 2) |
| `data/all_a_pyirt.jsonl` | Combined sparse py-IRT dataset (step 3) |
| `data/combined_human_minutes.jsonl` | Combined human time estimates (step 6) |
| `params/all_a_pyirt.csv` | IRT item parameters with `human_minutes` (steps 4, 7) |
| `params/all_a_pyirt_abilities.csv` | IRT subject abilities with release dates (steps 4, 8) |
| `params/all_a_pyirt_baseline.csv` | Baseline item difficulty estimates (step 5) |
| `params/all_a_pyirt_baseline_abilities.csv` | Baseline subject ability estimates (step 5) |
| `data/swebench_time_estimations_{provider}_{model}.jsonl` | LLM time estimates for SWE-bench |
| `data/cybench_time_estimations_{provider}_{model}.jsonl` | LLM time estimates for Cybench |
After running the pipeline and the LLM baseline scripts, you can reproduce all figures and tables from the paper using the provided Jupyter notebook:
```bash
uv run jupyter notebook results_analysis.ipynb
```

The notebook loads the fitted IRT parameters, baseline estimates, LLM time estimations, and ground-truth human times, then walks through the following analyses:

- Model Fitting -- Fits two linear regressions mapping task difficulty to log(human minutes): one using IRT difficulty `b` and one using the logit success-rate baseline. Both are trained on METR benchmark tasks that have ground-truth human time annotations.
- Task Difficulty vs Human Time -- Scatter plot of IRT difficulty against actual human completion time for METR tasks, with the fitted regression line (R² = 0.81).
- Task Length Estimation Distributions -- Histograms of predicted human completion times for SWE-bench, GDPval, MLE-bench, and Cybench tasks on a log scale.
- SWE-bench Time Bucket Classification -- Compares accuracy, macro F1, and weighted Cohen's kappa across four methods (Logit Success Rate, Gemini 3 Pro, GPT-5.2, and BRIDGE) for predicting SWE-bench time buckets.
- Cybench Task Length Prediction -- Scatter plots of predicted vs actual human time for each method on Cybench, with R² and within-2x accuracy metrics.
- Success Probability vs Task Length -- 2PL IRT success-probability curves for frontier models (best per 5-month release window) across all benchmarks, showing how success decays with increasing task length.
- Task Length Frontier Forecasting -- Exponential fits of the 50% and 80% solvable task-length horizons over time, with bootstrap confidence intervals and doubling-time estimates.
All plots are saved to the `plots/` directory as PDF files.
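The difficulty-to-time mapping used throughout these analyses can be sketched as an ordinary least-squares fit of log(minutes) against IRT difficulty `b`. The data points below are hypothetical stand-ins for METR tasks with ground-truth annotations, not values from the paper:

```python
import math

# Hypothetical (IRT difficulty, human minutes) pairs standing in for
# METR tasks with ground-truth time annotations.
pairs = [(-1.0, 2.0), (0.0, 8.0), (1.0, 30.0), (2.0, 120.0)]

# Ordinary least-squares fit of log(minutes) against difficulty b.
xs = [b for b, _ in pairs]
ys = [math.log(m) for _, m in pairs]
n = len(pairs)
x_mean, y_mean = sum(xs) / n, sum(ys) / n
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / \
        sum((x - x_mean) ** 2 for x in xs)
intercept = y_mean - slope * x_mean

def predict_minutes(b: float) -> float:
    """Map an IRT difficulty estimate to predicted human minutes."""
    return math.exp(slope * b + intercept)
```

Because the relationship is linear in log-time, a constant increase in difficulty corresponds to a constant multiplicative increase in predicted human minutes.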
This work is licensed under a Creative Commons Attribution 4.0 International License.
```bibtex
@article{liu2026bridge0,
  title   = {BRIDGE: Predicting Human Task Completion Time From Model Performance},
  author  = {Fengyuan Liu and Jay Gala and Nilaksh and Dzmitry Bahdanau and Siva Reddy and Hugo Larochelle},
  year    = {2026},
  journal = {arXiv preprint arXiv:2602.07267}
}
```