Optimizes system prompts for pipeline steps in run_plan_pipeline.py through
an iterative loop: implement a fix, test across models, analyze results, and
decide whether to keep or revert. Each iteration produces an auditable trail
with a PR, quantitative comparison, and a keeper verdict.
Currently optimizing the IdentifyPotentialLevers step (26 iterations
completed). The runner infrastructure (progress tracking, CLI, output
structure) is shared across steps; each new step requires only a custom
adapter.
See proposal 117 for the full design and current status.
Read latest synthesis.md (top recommendation)
→ Implement fix (Claude Code → branch + PR)
→ Run experiments (runner.py × N models × 5 plans)
→ Analyze (insight → code review → synthesis → assessment)
→ Verdict: YES → merge, NO → close, CONDITIONAL → fix + re-test
→ Loop back
Each iteration produces a numbered analysis directory in PlanExe-prompt-lab:
analysis/1_identify_potential_levers/
meta.json ← provenance: history runs, PR info
insight_claude.md ← quality analysis (Claude Code)
insight_codex.md ← quality analysis (Codex)
code_claude.md ← code review (Claude Code)
code_codex.md ← code review (Codex)
synthesis.md ← cross-agent reconciliation, top 5 directions
assessment.md ← before/after comparison, keeper verdict
# From PlanExe-prompt-lab repo:
python run_optimization_iteration.py
# Or step-by-step:
python run_optimization_iteration.py --skip-implement --models haiku,llama
python run_optimization_iteration.py --skip-implement --skip-runner

See run_optimization_iteration.py --help for all options.
Important: Do NOT merge the PR before the verdict. The correct order is: create PR → run experiments → run analysis → read verdict → post verdict as a comment on the PR → merge only if verdict confirms improvement. The PR comment creates a permanent record of why the PR was merged or closed.
# From PlanExe-prompt-lab repo:
python analysis/prepare_iteration.py identify_potential_levers 326 # Phase 0: verify PR, pre-create history dirs
python analysis/run_analysis.py analysis/12_identify_potential_levers  # Phases 1-4: insight → code review → synthesis → assessment

prepare_iteration.py replaces the former create_analysis_dir.py and
update_meta_pr.py — it handles PR verification, history dir pre-creation,
and meta.json population in a single step.
run_analysis.py orchestrates the four analysis phases sequentially. Each
phase is resumable (skipped if output files already exist). Individual phases
can still be run directly if needed:
python analysis/run_insight.py analysis/12_identify_potential_levers # Phase 1
python analysis/run_code_review.py analysis/12_identify_potential_levers # Phase 2
python analysis/run_synthesis.py analysis/12_identify_potential_levers # Phase 3
python analysis/run_assessment.py analysis/12_identify_potential_levers  # Phase 4

The runner must be invoked with Python 3.11 (/opt/homebrew/bin/python3.11),
which has the required llama_index dependencies.
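The skip-if-exists resume behavior of run_analysis.py can be sketched as follows. This is a minimal illustration, not the actual orchestrator code; the function and argument names are invented for the example:

```python
from pathlib import Path
from typing import Callable

def run_phase(name: str, output: Path, run: Callable[[Path], None]) -> None:
    """Run one analysis phase, skipping it when its output file already exists.

    `run` is a callable that produces `output`. This mirrors the resumable
    behavior described above (each phase is skipped if its output exists),
    but is illustrative only.
    """
    if output.exists():
        print(f"skip {name}: {output} already exists")
        return
    run(output)
    print(f"done {name}")
```

Because each phase's skip decision depends only on its output file, a crashed run can simply be re-invoked and will resume at the first missing artifact.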
# Auto-increment into prompt-lab history/
/opt/homebrew/bin/python3.11 -m self_improve.runner \
--baseline-dir /path/to/baseline/train \
--prompt-lab-dir /path/to/PlanExe-prompt-lab \
--model ollama-llama3.1
# Or manual output directory
/opt/homebrew/bin/python3.11 -m self_improve.runner \
--baseline-dir /path/to/baseline/train \
--output-dir /path/to/my_run/outputs \
--model ollama-llama3.1

The runner imports and executes the step's source files directly from PlanExe's code — there is no external prompt file or CLI override. Changes to system prompts, Pydantic schemas, or validation logic all take effect at run time.
| Flag | Required | Description |
|---|---|---|
| --baseline-dir | No | Directory containing plan subdirectories (process all) |
| --plan-dir | No | Single plan directory to process (overrides --baseline-dir) |
| --prompt-lab-dir | No | Path to PlanExe-prompt-lab repo (auto-creates history run dir) |
| --output-dir | No | Manual output directory (alternative to --prompt-lab-dir) |
| --model | Yes | LLM model name. First is primary; additional are fallbacks |

Either --baseline-dir or --plan-dir must be provided.
Either --prompt-lab-dir or --output-dir must be provided.
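The two either/or requirements can be enforced with a post-parse check. This is a sketch, not the runner's actual argument handling; in particular, treating a repeated --model as primary-plus-fallbacks via `action="append"` is an assumption based on the fallback wording above:

```python
import argparse

def parse_args(argv: list[str]) -> argparse.Namespace:
    p = argparse.ArgumentParser(prog="self_improve.runner")
    p.add_argument("--baseline-dir")
    p.add_argument("--plan-dir")
    p.add_argument("--prompt-lab-dir")
    p.add_argument("--output-dir")
    # Assumption: repeated --model flags give primary then fallbacks.
    p.add_argument("--model", required=True, action="append")
    args = p.parse_args(argv)
    # Neither pair is individually required, but one of each pair must be set.
    if not (args.baseline_dir or args.plan_dir):
        p.error("either --baseline-dir or --plan-dir is required")
    if not (args.prompt_lab_dir or args.output_dir):
        p.error("either --prompt-lab-dir or --output-dir is required")
    return args
```

`p.error(...)` prints a usage message and exits, which matches the fail-fast behavior you want before any plan work starts.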
Anthropic models need custom profile env vars. The run_optimization_iteration.py
script handles this automatically for models in its CUSTOM_PROFILE_MODELS dict.
PLANEXE_MODEL_PROFILE=custom \
PLANEXE_LLM_CONFIG_CUSTOM_FILENAME=anthropic_claude.json \
/opt/homebrew/bin/python3.11 -m self_improve.runner \
--baseline-dir /path/to/baseline/train \
--prompt-lab-dir /path/to/PlanExe-prompt-lab \
--model anthropic-claude-haiku-4-5-pinned

With --prompt-lab-dir, outputs go to history/{counter // 100}/{counter % 100:02d}_{step}/. The counter auto-increments by scanning existing history directories.
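The path formula above buckets runs 100 per directory. A direct translation (the scanning and auto-increment logic in the runner is not shown here):

```python
def history_run_dir(counter: int, step: str) -> str:
    """Compute the history subdirectory for a given run counter.

    Runs are bucketed 100 per directory, e.g. counter 1205 lands in
    history/12/05_<step>/.
    """
    return f"history/{counter // 100}/{counter % 100:02d}_{step}/"
```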
<run-dir>/
meta.json # written at start: step name, model, workers, system
outputs.jsonl # one row per completed plan: {name, status, duration_seconds, error}
events.jsonl # timestamped events: run_single_plan_start, _complete, _error
outputs/
<plan_name>/
002-9-potential_levers_raw.json
002-10-potential_levers.json
activity_overview.json
usage_metrics.jsonl
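Since outputs.jsonl holds one row per completed plan, a run's success rate can be summarized in a few lines. The row fields follow the shape listed above; the exact status value for success ("ok" here) is an assumption, not documented:

```python
import json
from pathlib import Path

def success_rate(outputs_jsonl: Path) -> tuple[int, int]:
    """Return (succeeded, total) plan counts from a run's outputs.jsonl.

    Assumes successful rows carry status == "ok"; the real status
    vocabulary may differ.
    """
    rows = [json.loads(line)
            for line in outputs_jsonl.read_text().splitlines()
            if line.strip()]
    ok = sum(1 for r in rows if r.get("status") == "ok")
    return ok, len(rows)
```

This is the kind of tally behind the "Working (4-5/5)" figures in the model status table.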
Run against a single plan:
/opt/homebrew/bin/python3.11 -m self_improve.runner \
--plan-dir /path/to/baseline/train/20250321_silo \
--prompt-lab-dir /path/to/PlanExe-prompt-lab \
--model ollama-llama3.1

PlanExe/
  self_improve/
    runner.py
  worker_plan/.../
    identify_potential_levers.py
  llm_config/
    baseline.json
    anthropic_claude.json

PlanExe-prompt-lab/
  baseline/train/                ← gold-standard outputs
  history/                       ← runner output per model
  analysis/                      ← insight/review/synthesis/assessment
    prepare_iteration.py         ← Phase 0: PR + history dirs
    run_analysis.py              ← Phases 1-4 orchestrator
  run_optimization_iteration.py
PRs are created in PlanExe (the code being optimized). Data and analysis artifacts are committed directly to PlanExe-prompt-lab.
The runner automatically parallelizes based on the luigi_workers value in llm_config/*.json for the model being used. Cloud models typically use 4 workers; local models use 1. The worker count is recorded in meta.json.
Each plan runs in its own thread with an independent LLMExecutor. Usage metrics use thread-local storage to avoid cross-thread interference.
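The thread-local metrics pattern mentioned above looks roughly like this. It is a minimal sketch, not the actual LLMExecutor code:

```python
import threading

_local = threading.local()

def record_usage(tokens: int) -> None:
    """Accumulate token usage for the current thread only.

    Each plan's worker thread sees its own counter, so parallel plans
    cannot pollute each other's usage metrics.
    """
    if not hasattr(_local, "tokens"):
        _local.tokens = 0
    _local.tokens += tokens

def thread_usage() -> int:
    """Read the current thread's accumulated usage (0 if none recorded)."""
    return getattr(_local, "tokens", 0)
```

Because `threading.local()` gives every thread an independent attribute namespace, no locking is needed for these per-plan counters.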
7 models tested per iteration (5 plans each = 35 runs per batch):
| Alias | Model | Status |
|---|---|---|
| llama | ollama-llama3.1 | Working (3-5/5) |
| gpt-oss | openrouter-openai-gpt-oss-20b | Working (4-5/5) |
| gpt5-nano | openai-gpt-5-nano | Working (5/5) |
| qwen | openrouter-qwen3-30b-a3b | Working (5/5) |
| gpt4o-mini | openrouter-openai-gpt-4o-mini | Working (5/5) |
| gemini-flash | openrouter-gemini-2.0-flash-001 | Working (5/5) — added iter 13 |
| haiku | anthropic-claude-haiku-4-5-pinned | Working (4-5/5) |
Removed models:
- nemotron (openrouter-nvidia-nemotron-3-nano-30b-a3b) — 0/5 all iterations, structural incompatibility
| Iter | PR | Change | Verdict |
|---|---|---|---|
| 0 | — | Baseline | — |
| 1 | #268 | Fix doubled user prompt | YES |
| 2 | #270 | Fix assistant turn serialization | CONDITIONAL |
| 3 | #272 | Novelty-aware follow-up prompts | YES |
| 4 | #273 | Remove exemplar strings + optional wrapper fields | CONDITIONAL |
| 5 | #274 | Align Pydantic field descriptions with system prompt | YES |
| 6 | #275 | Fix consequences length + trade-off + review format | YES |
| 7 | #276 | Enforce schema contract: levers min/max 5, summary required | CONDITIONAL |
| 8 | #278 | Fresh context per call + relax lever count to 5–7 | CONDITIONAL |
| 9 | #279 | Remove naming template | YES |
| 10 | #281 | Keyword quality gate (reverted in #282) | NO |
| 11 | #283 | RetryConfig in runner (reverted in #284) | NO |
| 12 | #286 | Remove max_length=7 hard constraint on levers | YES |
| 13 | #289 | Add options count and review format validators | CONDITIONAL |
| 14 | #292 | Recover partial results when a call fails | YES |
| 15 | #294 | Consolidate review_lever prompt to prevent format alternation | YES |
| 16 | #295 | Continue loop after call failure instead of breaking | YES |
| 17 | #296 | Auto-correct review_lever before hard-rejecting | CONDITIONAL |
| 18 | #297 | Simplify lever prompt to restore content quality | YES |
| 19 | #299 | Remove bracket placeholders and fragile English-only validator | CONDITIONAL |
| 20 | #309 | Add option-quality reminder to call-2/3 prompts | YES |
| 21 | #313 | Add anti-fabrication reminder to call-2/3 prompts | CONDITIONAL |
| 22 | #316 | Replace two-bullet review_lever prompt with single flowing example | CONDITIONAL |
| 23 | #326 | Add second review_lever example to break template lock | KEEP |
| 24 | #334 | (INVALID — runner ran against main, not PR branch) | INVALID |
| 25 | #334 | Remove unused summary field, slim call-2/3 prefix | YES |
| 26 | #337 | Replace generic review_lever examples with domain-specific ones | YES |
Full analysis artifacts for each iteration are in PlanExe-prompt-lab/analysis/.
- Do NOT merge PRs before the verdict. Create PR → run experiments → run analysis → read verdict → post verdict as a comment on the PR → merge only if confirmed.
- No hardcoded English keywords in validators. PlanExe users create plans in many languages. All validation must be language-agnostic.
- Never delete from the history directory. Runs are permanent records.
- The runner imports and executes the step's source files directly. There is no external prompt file or CLI override. To change the prompt, Pydantic schema, or validation logic, modify the source file on the PR branch before running experiments.
- Verify the runner is on the PR branch, not main. The runner imports code
from PlanExe, so it must be on the PR branch to test the PR's changes.
run_optimization_iteration.py verifies this automatically. Iteration 24 was invalidated because the runner ran against main.
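The language-agnostic validation rule can be illustrated by contrast. Neither function below is from PlanExe; they only show the difference between a keyword check that breaks on non-English plans and a structural check that does not (the 5–7 range echoes the lever-count relaxation from iteration 8):

```python
def valid_keyword_based(text: str) -> bool:
    """BAD: hardcoded English keywords fail for non-English plans."""
    return any(k in text.lower() for k in ("lever", "option", "trade-off"))

def valid_structure_based(items: list[str]) -> bool:
    """OK: counting items and rejecting blanks works in any language."""
    return 5 <= len(items) <= 7 and all(i.strip() for i in items)
```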
The runner is designed to extend to other pipeline steps. Each step needs four things:
- Input files — which files to read and how to assemble the user prompt
- Execute call — which class/method to invoke
- Output filenames — which files to save
- Step name — identifier for meta.json
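Those four things can be sketched as a small adapter spec. The actual adapter interface in self_improve is not shown here; the input filename and dotted execute path below are hypothetical, while the output filenames come from the run-dir layout above:

```python
from dataclasses import dataclass

@dataclass
class StepAdapter:
    """What the runner needs to know to drive one pipeline step."""
    step_name: str           # identifier recorded in meta.json
    input_files: list[str]   # files read to assemble the user prompt
    execute: str             # dotted path of the class/method to invoke
    output_files: list[str]  # filenames saved under outputs/<plan_name>/

levers = StepAdapter(
    step_name="identify_potential_levers",
    input_files=["001-plan.txt"],  # hypothetical input filename
    # Hypothetical dotted path; the real entry point may differ.
    execute="worker_plan.identify_potential_levers.IdentifyPotentialLevers.execute",
    output_files=[
        "002-9-potential_levers_raw.json",
        "002-10-potential_levers.json",
    ],
)
```

With this shape, extending to a new step means writing one small spec like `levers` rather than touching the shared runner infrastructure.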