ctxray Benchmark Results

Cross-model validation data for ctxray's scoring system.

All experiments use:

  • Plain-text prompts (no markdown, no XML, no role instructions) so format effects don't contaminate specificity measurement
  • Pass rate on executable code tests (Python functions with assertable I/O) — a minimal sketch of this check follows the list
  • Temperature 0, deterministic evaluation
  • Checkpoint-resumable harness — see experiments/README.md to contribute a new model
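For readers who want to see what the pass-rate check looks like in practice, here is a minimal sketch, assuming a model completion that defines a solve() function. The task data and scoring granularity are illustrative; the real harness lives under experiments/.

# Minimal sketch of the pass-rate check (illustrative; the real harness lives in experiments/).
def run_candidate(completion: str, tests: list) -> float:
    """Exec a model completion that should define solve(), then score it on I/O pairs."""
    namespace = {}
    try:
        exec(completion, namespace)  # the completion is expected to define solve()
    except Exception:
        return 0.0
    solve = namespace.get("solve")
    if solve is None:
        return 0.0
    passed = 0
    for args, expected in tests:
        try:
            if solve(*args) == expected:
                passed += 1
        except Exception:
            pass
    return passed / len(tests)

# Stand-in for a temperature-0 completion on the two_sum task (illustrative, not harness output).
candidate = '''
def solve(nums, target):
    seen = {}
    for i, x in enumerate(nums):
        if target - x in seen:
            return [seen[target - x], i]
        seen[x] = i
'''
io_tests = [(([2, 7, 11, 15], 9), [0, 1]), (([3, 2, 4], 6), [1, 2])]
print(run_candidate(candidate, io_tests))  # 1.0 -> every assertable I/O pair passes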

E9 — Specificity gradient across 8 models

Does prompt specificity predict pass rate across different model families and sizes?

4 specificity levels × 4 coding tasks (fizzbuzz, flatten, two_sum, run_length_encode) × k≥3. Generated by experiments/aggregate.py from baseline runs and contributed results.
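The exact wording of the four prompt tiers lives in the experiment harness; the strings below are a hypothetical illustration for the two_sum task, meant only to show how each level adds information.

# Hypothetical prompt tiers for two_sum; actual wording is defined in the harness, not here.
LEVELS = {
    "vague": "Write a function that finds two numbers.",
    "task_only": "Write a function that finds two indices in a list whose values sum to a target.",
    "task_io": ("Write solve(nums, target) returning the indices [i, j] of the two numbers "
                "in nums that add up to target, e.g. solve([2, 7, 11, 15], 9) -> [0, 1]."),
    "full_spec": ("Write solve(nums, target) returning [i, j] with nums[i] + nums[j] == target. "
                  "Constraints: exactly one solution exists; do not reuse an element. "
                  "Edge cases: negative numbers, duplicates, len(nums) == 2."),
}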

Model               vague  task_only  task_io  full_spec  Δ (vague→full)  Source
gemma3:1b            0.00       0.25     0.89       0.79           +0.79  merged
qwen2.5-coder:1.5b   0.08       0.42     0.64       0.58           +0.50  baseline
phi4-mini:latest     0.00       0.33     0.86       0.92           +0.92  baseline
gemma3:4b            0.25       0.50     0.92       0.92           +0.67  baseline
llama3.1:8b          0.00       0.50     0.75       0.92           +0.92  baseline
qwen3.5:9b           0.25       0.50     0.92       0.92           +0.67  baseline
gemma4:26b           0.17       0.33     0.58       0.92           +0.75  baseline
qwen3.5:27b          0.25       0.50     0.92       0.92           +0.67  contributed

8 models, ~432 Ollama calls total.

Headline: Specificity is the strongest single prompting lever. The aggregate vague → full_spec gain is +0.74 averaged across the 8 models, and every model benefits from added specificity without exception.

Observations:

  • Biggest gain: phi4-mini and llama3.1:8b (+0.92), both going from near-total failure on vague prompts to near-universal passes on full_spec
  • Two small models (gemma3:1b, qwen2.5-coder:1.5b) show a mild full_spec regression relative to task_io (-0.10 and -0.06 respectively), consistent with the over-complexity finding: small models have less capacity to absorb long prompts

E10 — Specificity decomposition on 5 dense models

What inside full_spec is actually doing the work — constraints, edge cases, or both?

6 specificity levels (adding constraints and edge cases separately) × 10 coding tasks (5 constraints-sensitive + 5 control, balanced) × k=10, i.e. 600 calls per model and 3,000 Ollama calls in total.
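To make "adding constraints and edge cases separately" concrete, the two intermediate levels can be thought of as bolting independent fragments onto the task_io prompt. The fragment wording below is hypothetical, not the harness's actual text.

# Hypothetical E10 levels for two_sum, extending the task_io tier shown under E9.
# Fragment wording is illustrative; the real prompts live in the harness.
TASK_IO = ("Write solve(nums, target) returning the indices [i, j] of the two numbers "
           "in nums that add up to target, e.g. solve([2, 7, 11, 15], 9) -> [0, 1].")
CONSTRAINTS = "Constraints: exactly one valid answer; do not use the same element twice."
EDGE_CASES = "Handle negative numbers, duplicates, and len(nums) == 2."

E10_LEVELS = {
    "task_io_constraints": f"{TASK_IO} {CONSTRAINTS}",
    "task_io_edge": f"{TASK_IO} {EDGE_CASES}",
    "full_spec": f"{TASK_IO} {CONSTRAINTS} {EDGE_CASES}",
}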

Model               vague  task_only  task_io  task_io_constraints  task_io_edge  full_spec
qwen2.5-coder:1.5b   0.41       0.69     0.86                 0.70          0.92       0.90
gemma3:4b            0.56       0.78     0.97                 0.97          0.97       0.97
llama3.1:8b          0.19       0.71     0.90                 0.90          0.90       0.97
qwen3.5:9b           0.43       0.73     0.97                 0.97          0.97       0.97
phi4:14b             0.03       0.71     0.97                 0.97          0.97       0.97
AVG                  0.32       0.72     0.93                 0.90          0.94       0.95

Marginal effect from task_io baseline (averaged across all models):

Transition                     Δ pass rate
task_io → task_io_constraints        -0.03  (slightly hurts)
task_io → task_io_edge               +0.01
task_io → full_spec                  +0.02
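These deltas follow directly from the AVG row of the E10 table; a quick check:

# Marginal effects recomputed from the AVG row of the E10 table above.
avg = {"task_io": 0.93, "task_io_constraints": 0.90, "task_io_edge": 0.94, "full_spec": 0.95}
for level in ("task_io_constraints", "task_io_edge", "full_spec"):
    print(f"task_io -> {level}: {avg[level] - avg['task_io']:+.2f}")
# task_io -> task_io_constraints: -0.03
# task_io -> task_io_edge: +0.01
# task_io -> full_spec: +0.02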

Headline: task_io is the practical ceiling for dense models ≥4B. Adding more specificity beyond task_io gives effectively zero gain (max +0.02 on average). The exception is small models (<2B), where adding constraint phrasing actually hurts (-0.16 on qwen2.5-coder:1.5b). Edge case phrasing is a safer addition at small-model scale.

Implications for ctxray's scoring:

  • The threshold at ctxray score ≈ 43 corresponds to the task_io level in this benchmark. Prompts scoring above this threshold hit a ~0.93 pass rate on average; prompts below it average ~0.72 or lower.
  • Adding "more detail" past task_io is wasted effort for most models, and ctxray treats it that way: it suggests missing features (examples, constraints, edge cases) only when the score is below the threshold, not above it (see the sketch after this list)

Contribute a new model

If you run any Ollama model locally, you can add it to this table in 5 minutes:

git clone https://github.com/ctxray/ctxray.git
cd ctxray
uv run python experiments/validate.py e9 --model-name YOUR_MODEL
# result lands in .output/experiments/e9_specificity_custom_<name>.json

Share the resulting JSON via PR to experiments/contributed/. See experiments/README.md for the full contribution guide.