Cross-model validation data for ctxray's scoring system.
All experiments use:
- Plain-text prompts (no markdown, no XML, no role instructions) so format effects don't contaminate specificity measurement
- Pass rate on executable code tests (Python functions with assertable I/O)
- Temperature 0, deterministic evaluation
- Checkpoint-resumable harness — see `experiments/README.md` to contribute a new model
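For concreteness, the pass-rate check can be sketched as follows. This is a hypothetical minimal version, not the actual harness code: each task ships a function name plus assertable I/O pairs, and a completion passes only if it defines a function that satisfies all of them.

```python
# Hypothetical sketch of the pass/fail check; names are illustrative,
# not the harness's real API.
def passes(completion: str, func_name: str, cases: list[tuple]) -> bool:
    namespace: dict = {}
    try:
        exec(completion, namespace)        # run the model's generated code
        fn = namespace[func_name]
        # A completion passes only if every assertable I/O pair holds.
        return all(fn(*args) == expected for args, expected in cases)
    except Exception:
        return False                       # any error counts as a fail

# Example: the flatten task with two assertable cases
print(passes(
    "def flatten(xs):\n"
    "    return [y for x in xs for y in (flatten(x) if isinstance(x, list) else [x])]",
    "flatten",
    [(([1, [2, [3]]],), [1, 2, 3]), (([],), [])],
))  # True
```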
Does prompt specificity predict pass rate across different model families and sizes?
4 specificity levels × 4 coding tasks (`fizzbuzz`, `flatten`, `two_sum`, `run_length_encode`) × k≥3. Generated by `experiments/aggregate.py` from baseline runs and contributed results.
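To make the four levels concrete, here is an illustrative prompt ladder for one task. The wording is an assumption for illustration; the actual prompts live in the experiment definitions and may differ.

```python
# Illustrative prompt ladder for fizzbuzz (hypothetical wording).
LEVELS = {
    "vague":     "Write some code that does fizzbuzz.",
    "task_only": "Write a Python function fizzbuzz(n).",
    "task_io":   ("Write a Python function fizzbuzz(n) that returns a list of "
                  "strings for 1..n: 'Fizz' for multiples of 3, 'Buzz' for "
                  "multiples of 5, 'FizzBuzz' for both, else str(i)."),
    "full_spec": ("Write a Python function fizzbuzz(n) that returns a list of "
                  "strings for 1..n: 'Fizz' for multiples of 3, 'Buzz' for "
                  "multiples of 5, 'FizzBuzz' for both, else str(i). "
                  "Constraints: no printing, pure function. "
                  "Edge cases: n=0 returns [], n=1 returns ['1']."),
}
```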
| Model | vague | task_only | task_io | full_spec | Δ (vague→full) | Source |
|---|---|---|---|---|---|---|
| `gemma3:1b` | 0.00 | 0.25 | 0.89 | 0.79 | +0.79 | merged |
| `qwen2.5-coder:1.5b` | 0.08 | 0.42 | 0.64 | 0.58 | +0.50 | baseline |
| `phi4-mini:latest` | 0.00 | 0.33 | 0.86 | 0.92 | +0.92 | baseline |
| `gemma3:4b` | 0.25 | 0.50 | 0.92 | 0.92 | +0.67 | baseline |
| `llama3.1:8b` | 0.00 | 0.50 | 0.75 | 0.92 | +0.92 | baseline |
| `qwen3.5:9b` | 0.25 | 0.50 | 0.92 | 0.92 | +0.67 | baseline |
| `gemma4:26b` | 0.17 | 0.33 | 0.58 | 0.92 | +0.75 | baseline |
| `qwen3.5:27b` | 0.25 | 0.50 | 0.92 | 0.92 | +0.67 | contributed |
8 models, ~432 Ollama calls total.
Headline: Specificity is the strongest single prompting lever. The aggregate vague → full_spec gain is +0.74, and every one of the 8 models improves.
Observations:
- Biggest gain: `phi4-mini` and `llama3.1:8b` (+0.92 — nearly going from fail to pass across the board)
- Two small models (`gemma3:1b`, `qwen2.5-coder:1.5b`) show a mild full_spec regression relative to task_io (-0.10 and -0.06), consistent with the over-complexity finding: small models have less capacity to absorb long prompts
What inside full_spec is actually doing the work — constraints, edge cases, or both?
6 levels (adding constraints and edge cases separately) × 10 coding tasks (5 constraints-sensitive + 5 control, balanced) × k=10. 3000 Ollama calls.
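One plausible way the six levels compose, sketched below. The structure and field contents are assumptions for illustration, not the harness code: the two intermediate levels isolate the constraint and edge-case clauses that full_spec stacks on top of task_io.

```python
# Assumed level composition; field names and wording are hypothetical.
TASK = {
    "vague":       "Write some code for run-length encoding.",
    "description": "Write a Python function run_length_encode(s).",
    "io_spec":     "It returns e.g. 'aaab' -> 'a3b1'.",
    "constraints": "No imports; single O(n) pass.",
    "edge_cases":  "An empty string returns ''.",
}

def build_prompt(task: dict, level: str) -> str:
    if level == "vague":
        return task["vague"]
    parts = [task["description"]]                  # task_only and above
    if level != "task_only":
        parts.append(task["io_spec"])              # task_io and above
    if level in ("task_io_constraints", "full_spec"):
        parts.append(task["constraints"])
    if level in ("task_io_edge", "full_spec"):
        parts.append(task["edge_cases"])
    return " ".join(parts)

print(build_prompt(TASK, "task_io_edge"))
```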
| Model | vague | task_only | task_io | task_io_constraints | task_io_edge | full_spec |
|---|---|---|---|---|---|---|
| `qwen2.5-coder:1.5b` | 0.41 | 0.69 | 0.86 | 0.70 | 0.92 | 0.90 |
| `gemma3:4b` | 0.56 | 0.78 | 0.97 | 0.97 | 0.97 | 0.97 |
| `llama3.1:8b` | 0.19 | 0.71 | 0.90 | 0.90 | 0.90 | 0.97 |
| `qwen3.5:9b` | 0.43 | 0.73 | 0.97 | 0.97 | 0.97 | 0.97 |
| `phi4:14b` | 0.03 | 0.71 | 0.97 | 0.97 | 0.97 | 0.97 |
| AVG | 0.32 | 0.72 | 0.93 | 0.90 | 0.94 | 0.95 |
Marginal effect from task_io baseline (averaged across all models):
| Transition | Δ pass rate |
|---|---|
| task_io → task_io_constraints | -0.03 (slightly hurts) |
| task_io → task_io_edge | +0.01 |
| task_io → full_spec | +0.02 |
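These deltas are just differences against the task_io entry in the AVG row of the table above:

```python
# Reproducing the marginal-effect rows from the AVG row above.
avg = {"task_io": 0.93, "task_io_constraints": 0.90,
       "task_io_edge": 0.94, "full_spec": 0.95}
for level in ("task_io_constraints", "task_io_edge", "full_spec"):
    print(f"task_io -> {level}: {avg[level] - avg['task_io']:+.2f}")
# task_io -> task_io_constraints: -0.03
# task_io -> task_io_edge: +0.01
# task_io -> full_spec: +0.02
```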
Headline: task_io is the practical ceiling for dense models ≥4B. Adding more specificity beyond task_io gives effectively zero gain (max +0.02 on average). The exception is small models (<2B), where adding constraint phrasing actually hurts (-0.16 on `qwen2.5-coder:1.5b`). Edge case phrasing is a safer addition at small-model scale.
Implications for ctxray's scoring:
- The threshold at ctxray score ≈ 43 corresponds to the task_io level in this benchmark. Prompts above the threshold average ~0.93 pass rate; prompts below it average ~0.72 or lower.
- Adding "more detail" past task_io is wasted effort for most models, and ctxray behaves accordingly: it suggests missing features (examples, constraints, edge cases) only when the score is below the threshold, not above it (see the sketch after this list).
If you run any Ollama model locally, you can add it to this table in 5 minutes:
```bash
git clone https://github.com/ctxray/ctxray.git
cd ctxray
uv run python experiments/validate.py e9 --model-name YOUR_MODEL
# result lands in .output/experiments/e9_specificity_custom_<name>.json
```

Share the resulting JSON via PR to `experiments/contributed/`. See `experiments/README.md` for the full contribution guide.