Fine-tuning Qwen3-1.7B to reason, write Python, execute it, observe the results, and continue reasoning — all within a single generation pass.
Most LLMs generate plausible text. This model verifies its reasoning through execution. When asked "what is the trajectory of a ball thrown at 45°?" it doesn't guess — it writes code, runs it, reads the output, and uses actual numbers in its answer.
Four special token pairs structure each response:

```text
<think>
Internal reasoning — what is being asked, what approach to take.
</think>
<model>        ← optional, for complex problems
High-level design or pseudocode before writing the code.
</model>
<code>
# Python that gets executed by the runtime
result = some_calculation()
print(result)
</code>
<output>       ← injected by the runtime, never generated by the model
actual stdout from running the code above
</output>
Plain-language answer using the real results.
```
Multiple <code>/<output> cycles are supported — the model can write code, see results, and write more code in a single response.
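The intercept-and-inject loop can be sketched in a few lines. This is an illustrative toy, not the project's `generation_loop.py`: `generate_until` is a hypothetical callable that returns generated text up to a stop string (or to end-of-response), and a shared `env` dict stands in for persistent interpreter state across code blocks.

```python
import contextlib
import io


def run_with_execution(generate_until, code_end="</code>"):
    """Sketch: generate until </code>, execute the block, inject <output>, repeat.

    `generate_until(stop=...)` is an assumed helper returning generated text that
    ends with `stop` when another code block was produced.
    """
    transcript, env = "", {}
    while True:
        chunk = generate_until(stop=code_end)
        transcript += chunk
        if not chunk.endswith(code_end):
            return transcript  # model finished without another code block
        # Extract the most recent <code>...</code> body and execute it
        code = chunk.rsplit("<code>", 1)[1][: -len(code_end)]
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(code, env)  # shared env persists variables across blocks
        # Inject the captured stdout so the model can keep reasoning over it
        transcript += f"\n<output>\n{buf.getvalue()}</output>\n"
```

Because `env` is reused, a later `<code>` block can reference variables defined in an earlier one, which is what makes multi-cycle responses coherent.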
```bash
# Set up environment
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# AMD GPU (MI50 / ROCm)
source /path/to/rocm/setup-rocm-env.sh
export HSA_OVERRIDE_GFX_VERSION=9.0.6

# Chat with the model (auto-detects latest trained model in output/)
python chat.py

# Or point at a specific model
python chat.py --model ./output/worldmodel_v2/final
```

| Input | Effect |
|---|---|
| `/think` | Toggle visibility of `<think>` reasoning blocks |
| `/reset` | Clear Python variable state between queries |
| `quit` | Exit |
```text
python chat.py --help
  --model PATH      Path to fine-tuned model (default: auto-detect latest)
  --show-think      Show <think> blocks (hidden by default)
  --temperature N   Sampling temperature; 0 = deterministic (default: 0.7)
  --max-tokens N    Max tokens per generation step (default: 512)
  --vm              Use QEMU scratchpad VM for code execution
```
```bash
# Full training run (all datasets, 10 epochs)
./train_rocm.sh --output ./output/worldmodel_v2

# Smoke test (fast, core categories only)
./train_rocm.sh --categories arithmetic,algebra,geometry --epochs 3 --output ./output/smoke_test

# Resume from checkpoint
./train_rocm.sh --resume ./output/worldmodel_v2/checkpoint-400 --output ./output/worldmodel_v2
```

Training datasets live in `training/datasets/` as JSONL files. See `docs/DESIGN.md` for the data format.
Developed and tested on:
- AMD Instinct MI50, 32 GB VRAM
- ROCm 7.2
- Python 3.12, PyTorch 2.4 (ROCm build)
Training runs at float32 (required for gfx906 stability). A full 10-epoch run takes ~14 hours on the MI50.
After standard LoRA fine-tuning, a second training pass (train_geometric.py) applies geometric exploration on top of gradient descent. Instead of following gradients alone, it periodically:
- Extrapolates further along the gradient direction (vector jump)
- Probes random directions on a hypersphere around the current position
- Accepts a candidate only if it strictly improves the scoring function
This escapes shallow local minima that gradient descent alone settles into. Only the LoRA adapter parameters (~17M) are explored — the base model weights are never touched.
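The three moves above can be sketched as a single exploration step. This is a minimal NumPy illustration of the idea, not the project's `train_geometric.py`: the function name, learning rate, jump scale, sphere radius, and probe count are all illustrative, and `score` stands in for the composite scoring function.

```python
import numpy as np


def geometric_step(params, grad, score, lr=0.1, jump_scale=5.0,
                   sphere_radius=1e-2, n_probes=8, rng=None):
    """Sketch: one gradient step plus geometric exploration (lower score = better)."""
    rng = rng or np.random.default_rng(0)
    current = params - lr * grad                 # ordinary gradient step
    best = score(current)
    # Vector jump: extrapolate further along the gradient direction
    jump = params - jump_scale * lr * grad
    if score(jump) < best:                       # accept only on strict improvement
        current, best = jump, score(jump)
    # Sphere probes: random unit directions scaled to a hypersphere radius
    for _ in range(n_probes):
        d = rng.standard_normal(current.shape)
        cand = current + sphere_radius * d / np.linalg.norm(d)
        if score(cand) < best:
            current, best = cand, score(cand)
    return current
```

On a simple convex loss the jump often lands past the plain gradient step; on rugged losses the sphere probes are what let the search sidestep shallow basins.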
```bash
# Run geometric training starting from the latest LoRA checkpoint
python train_geometric.py

# Enable execution-based reward signal (recommended)
python train_geometric.py --exec-reward

# Resume a previous geometric run
python train_geometric.py --resume output/worldmodel_geometric/step_900
```

With `--exec-reward`, candidate evaluation scores each candidate not just on token loss but on whether the generated code is syntactically valid and executes correctly. Candidates that produce working code are preferred over candidates with marginally lower loss that produce broken code.
The reward function is a composite score (lower = better):

```text
score = token_loss
        − 0.2   if generated code has valid syntax
        − 0.1   if code executes successfully
        + 0.5   if no code block found in response
```
Both the baseline and every candidate are scored identically for a fair comparison. The baseline score is computed fresh at each geometric step so the comparison is always apples-to-apples.
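A direct translation of that composite score is straightforward. This sketch assumes the scorer receives the already-computed token loss and the full response text; the extraction logic and function name are illustrative, not the project's `execution_validator.py`.

```python
import ast
import contextlib
import io


def composite_score(token_loss, response):
    """Sketch of the composite reward: lower is better (weights from the text)."""
    if "<code>" not in response or "</code>" not in response:
        return token_loss + 0.5           # penalty: no code block found
    code = response.split("<code>", 1)[1].split("</code>", 1)[0]
    score = token_loss
    try:
        ast.parse(code)                   # bonus only if syntax is valid
        score -= 0.2
        with contextlib.redirect_stdout(io.StringIO()):
            exec(code, {})                # bonus only if it runs without error
        score -= 0.1
    except Exception:
        pass                              # broken code keeps whatever it earned
    return score
```

Note how the penalties stack: working code beats valid-but-crashing code, which beats unparseable code, which beats no code at all, regardless of small token-loss differences.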
| Metric | Epoch 0 | Epoch 4 |
|---|---|---|
| Code extracted | 0% | 95% |
| Syntax valid | 0% | 90% |
| Execution success | 0% | 85% |
| Output match | 0% | 70% |
| Jump acceptance | 22% | 80% |
| Sphere acceptance | 0% | 62% |
The geometric optimizer's acceptance rates rise dramatically with exec-reward because candidates that generate working code beat the baseline even with slightly higher token loss — creating selection pressure toward code correctness beyond what cross-entropy captures.
See docs/GEOMETRIC_TRAINING.md for algorithm details and hyperparameter reference.
```text
chat.py               ← start here for interactive use
train.py              ← standard LoRA fine-tuning entry point
train_geometric.py    ← geometric vector-space training (run after train.py)
train_rocm.sh         ← training launcher (sets ROCm env vars)
src/
  inference/
    generation_loop.py     ← custom loop: intercepts </code>, executes, injects <output>
  executor/
    python_exec.py         ← inline Python executor with use_tool() support
    vm_exec.py             ← QEMU scratchpad bridge (for sandboxed execution)
  training/
    dataset.py             ← JSONL loader, chat-template formatting, prompt masking
    execution_validator.py ← code extraction, syntax check, execution, reward scoring
    gpu_monitor.py         ← thermal monitoring and hard-pause throttle
training/
  datasets/               ← JSONL training data by category
  scripts/                ← dataset generators
docs/
  DESIGN.md               ← full architecture and token spec
  GEOMETRIC_TRAINING.md   ← geometric training algorithm and hyperparameter reference
  STATUS.md               ← current project state and next steps
  rocm/                   ← ROCm setup, troubleshooting, performance notes
history/                  ← archived earlier phases (ByteLogic, Blueprint)
```
See docs/DESIGN.md for the full spec. Key points:
- LoRA fine-tuning (rank 16) on top of Qwen3-1.7B
- 6 custom tokens added: `<model>`, `</model>`, `<code>`, `</code>`, `<output>`, `</output>` (`<think>` and `</think>` are already native to Qwen3)
- Prompt-masked training: loss computed only on the assistant response tokens
- Qwen3 chat template used for both training and inference
- `use_tool(name, **kwargs)` available in every code block; raises `ToolNotAvailableError` if the tool isn't registered, and the model is trained to handle this gracefully
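The tool-dispatch contract can be sketched as a small registry. This is an illustration of the behavior described above, not the project's `python_exec.py`; the registry dict and `register_tool` helper are assumptions.

```python
class ToolNotAvailableError(Exception):
    """Raised when a requested tool has no registered handler."""


_TOOLS = {}  # name -> callable; registry structure is an assumption


def register_tool(name, fn):
    """Make a callable available to generated code under `name`."""
    _TOOLS[name] = fn


def use_tool(name, **kwargs):
    """Dispatch to a registered tool, raising if it is missing (sketch)."""
    if name not in _TOOLS:
        raise ToolNotAvailableError(f"tool {name!r} is not registered")
    return _TOOLS[name](**kwargs)
```

Raising a dedicated exception (rather than returning `None`) is what lets the model's generated code catch `ToolNotAvailableError` and fall back to pure-Python computation.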