Cross-model validation data for ctxray's scoring system.
All experiments use:
- Plain-text prompts (no markdown, no XML, no role instructions) so format effects don't contaminate specificity measurement
- Pass rate on executable code tests (Python functions with assertable I/O)
- Temperature 0, deterministic evaluation
- Checkpoint-resumable harness — see `experiments/README.md` to contribute a new model
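For concreteness, the pass-rate check can be sketched as follows. This is a hypothetical minimal version, not the actual harness code: each task ships a function name plus assertable I/O pairs, and a completion passes only if it defines a function that satisfies all of them.

```python
# Hypothetical sketch of the pass/fail check; names are illustrative,
# not the harness's real API.
def passes(completion: str, func_name: str, cases: list[tuple]) -> bool:
    namespace: dict = {}
    try:
        exec(completion, namespace)        # run the model's generated code
        fn = namespace[func_name]
        # A completion passes only if every assertable I/O pair holds.
        return all(fn(*args) == expected for args, expected in cases)
    except Exception:
        return False                       # any error counts as a fail

# Example: the flatten task with two assertable cases
print(passes(
    "def flatten(xs):\n"
    "    return [y for x in xs for y in (flatten(x) if isinstance(x, list) else [x])]",
    "flatten",
    [(([1, [2, [3]]],), [1, 2, 3]), (([],), [])],
))  # True
```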
Does prompt specificity predict pass rate across different model families and sizes?
4 specificity levels × 4 coding tasks (`fizzbuzz`, `flatten`, `two_sum`, `run_length_encode`) × k≥3. Generated by `experiments/aggregate.py` from baseline runs and contributed results.
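To make the four levels concrete, here is an illustrative prompt ladder for one task. The wording is an assumption for illustration; the actual prompts live in the experiment definitions and may differ.

```python
# Illustrative prompt ladder for fizzbuzz (hypothetical wording).
LEVELS = {
    "vague":     "Write some code that does fizzbuzz.",
    "task_only": "Write a Python function fizzbuzz(n).",
    "task_io":   ("Write a Python function fizzbuzz(n) that returns a list of "
                  "strings for 1..n: 'Fizz' for multiples of 3, 'Buzz' for "
                  "multiples of 5, 'FizzBuzz' for both, else str(i)."),
    "full_spec": ("Write a Python function fizzbuzz(n) that returns a list of "
                  "strings for 1..n: 'Fizz' for multiples of 3, 'Buzz' for "
                  "multiples of 5, 'FizzBuzz' for both, else str(i). "
                  "Constraints: no printing, pure function. "
                  "Edge cases: n=0 returns [], n=1 returns ['1']."),
}
```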
| Model | vague | task_only | task_io | full_spec | Δ (vague→full) | Source |
|---|---|---|---|---|---|---|
| `gemma3:1b` | 0.00 | 0.25 | 0.89 | 0.79 | +0.79 | merged |
| `qwen2.5-coder:1.5b` | 0.08 | 0.42 | 0.64 | 0.58 | +0.50 | baseline |
| `phi4-mini:latest` | 0.00 | 0.33 | 0.86 | 0.92 | +0.92 | baseline |
| `gemma3:4b` | 0.25 | 0.50 | 0.92 | 0.92 | +0.67 | baseline |
| `llama3.1:8b` | 0.00 | 0.50 | 0.75 | 0.92 | +0.92 | baseline |
| `qwen3.5:9b` | 0.25 | 0.50 | 0.92 | 0.92 | +0.67 | baseline |
| `gemma4:26b` | 0.17 | 0.33 | 0.58 | 0.92 | +0.75 | baseline |
| `qwen3.5:27b` | 0.25 | 0.50 | 0.92 | 0.92 | +0.67 | contributed |
8 models, ~432 Ollama calls total.
Headline: Specificity is the strongest single prompting lever. The aggregate vague → full_spec gain is +0.74, and every one of the 8 models improves.
Observations:
- Biggest gain: `phi4-mini` and `llama3.1:8b` (+0.92 — nearly going from fail to pass across the board)
- Two small models (`gemma3:1b`, `qwen2.5-coder:1.5b`) show a mild full_spec regression relative to task_io (-0.10 and -0.06), consistent with the over-complexity finding: small models have less capacity to absorb long prompts
What inside full_spec is actually doing the work — constraints, edge cases, or both?
6 levels (adding constraints and edge cases separately) × 10 coding tasks (5 constraints-sensitive + 5 control, balanced) × k=10. 3000 Ollama calls.
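One plausible way the six levels compose, sketched below. The structure and field contents are assumptions for illustration, not the harness code: the two intermediate levels isolate the constraint and edge-case clauses that full_spec stacks on top of task_io.

```python
# Assumed level composition; field names and wording are hypothetical.
TASK = {
    "vague":       "Write some code for run-length encoding.",
    "description": "Write a Python function run_length_encode(s).",
    "io_spec":     "It returns e.g. 'aaab' -> 'a3b1'.",
    "constraints": "No imports; single O(n) pass.",
    "edge_cases":  "An empty string returns ''.",
}

def build_prompt(task: dict, level: str) -> str:
    if level == "vague":
        return task["vague"]
    parts = [task["description"]]                  # task_only and above
    if level != "task_only":
        parts.append(task["io_spec"])              # task_io and above
    if level in ("task_io_constraints", "full_spec"):
        parts.append(task["constraints"])
    if level in ("task_io_edge", "full_spec"):
        parts.append(task["edge_cases"])
    return " ".join(parts)

print(build_prompt(TASK, "task_io_edge"))
```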
| Model | vague | task_only | task_io | task_io_constraints | task_io_edge | full_spec |
|---|---|---|---|---|---|---|
| `qwen2.5-coder:1.5b` | 0.41 | 0.69 | 0.86 | 0.70 | 0.92 | 0.90 |
| `gemma3:4b` | 0.56 | 0.78 | 0.97 | 0.97 | 0.97 | 0.97 |
| `llama3.1:8b` | 0.19 | 0.71 | 0.90 | 0.90 | 0.90 | 0.97 |
| `qwen3.5:9b` | 0.43 | 0.73 | 0.97 | 0.97 | 0.97 | 0.97 |
| `phi4:14b` | 0.03 | 0.71 | 0.97 | 0.97 | 0.97 | 0.97 |
| AVG | 0.32 | 0.72 | 0.93 | 0.90 | 0.94 | 0.95 |
Marginal effect from task_io baseline (averaged across all models):
| Transition | Δ pass rate |
|---|---|
| task_io → task_io_constraints | -0.03 (slightly hurts) |
| task_io → task_io_edge | +0.01 |
| task_io → full_spec | +0.02 |
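These deltas are just differences against the task_io entry in the AVG row of the table above:

```python
# Reproducing the marginal-effect rows from the AVG row above.
avg = {"task_io": 0.93, "task_io_constraints": 0.90,
       "task_io_edge": 0.94, "full_spec": 0.95}
for level in ("task_io_constraints", "task_io_edge", "full_spec"):
    print(f"task_io -> {level}: {avg[level] - avg['task_io']:+.2f}")
# task_io -> task_io_constraints: -0.03
# task_io -> task_io_edge: +0.01
# task_io -> full_spec: +0.02
```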
Headline: task_io is the practical ceiling for dense models ≥4B. Adding more specificity beyond task_io gives effectively zero gain (max +0.02 on average). The exception is small models (<2B), where adding constraint phrasing actually hurts (-0.16 on `qwen2.5-coder:1.5b`). Edge case phrasing is a safer addition at small-model scale.
Implications for ctxray's scoring:
- The threshold at ctxray score ≈ 43 corresponds to the task_io level in this benchmark. Prompts above the threshold average ~0.93 pass rate; prompts below it average ~0.72 or lower.
- Adding "more detail" past task_io is wasted effort for most models, and ctxray behaves accordingly: it suggests missing features (examples, constraints, edge cases) only when the score is below the threshold, not above it (see the sketch after this list).
If you run any Ollama model locally, you can add it to this table in 5 minutes:
```bash
git clone https://github.com/ctxray/ctxray.git
cd ctxray
uv run python experiments/validate.py e9 --model-name YOUR_MODEL
# result lands in .output/experiments/e9_specificity_custom_<name>.json
```

Share the resulting JSON via PR to `experiments/contributed/`. See `experiments/README.md` for the full contribution guide.