Fine-tuning Qwen3-1.7B to reason, write Python, execute it, observe the results, and continue reasoning — all within a single generation pass.
Most LLMs generate plausible text. This model verifies its reasoning through execution. When asked "what is the trajectory of a ball thrown at 45°?" it doesn't guess — it writes code, runs it, reads the output, and uses actual numbers in its answer.
Four special token pairs structure each response:

```text
<think>
Internal reasoning — what is being asked, what approach to take.
</think>
<model>        ← optional, for complex problems
High-level design or pseudocode before writing the code.
</model>
<code>
# Python that gets executed by the runtime
result = some_calculation()
print(result)
</code>
<output>       ← injected by the runtime, never generated by the model
actual stdout from running the code above
</output>
Plain-language answer using the real results.
```
Multiple <code>/<output> cycles are supported — the model can write code, see results, and write more code in a single response.
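The intercept-and-inject loop can be sketched in a few lines. This is an illustrative toy, not the project's `generation_loop.py`: `generate_until` is a hypothetical callable that returns generated text up to a stop string (or to end-of-response), and a shared `env` dict stands in for persistent interpreter state across code blocks.

```python
import contextlib
import io


def run_with_execution(generate_until, code_end="</code>"):
    """Sketch: generate until </code>, execute the block, inject <output>, repeat.

    `generate_until(stop=...)` is an assumed helper returning generated text that
    ends with `stop` when another code block was produced.
    """
    transcript, env = "", {}
    while True:
        chunk = generate_until(stop=code_end)
        transcript += chunk
        if not chunk.endswith(code_end):
            return transcript  # model finished without another code block
        # Extract the most recent <code>...</code> body and execute it
        code = chunk.rsplit("<code>", 1)[1][: -len(code_end)]
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(code, env)  # shared env persists variables across blocks
        # Inject the captured stdout so the model can keep reasoning over it
        transcript += f"\n<output>\n{buf.getvalue()}</output>\n"
```

Because `env` is reused, a later `<code>` block can reference variables defined in an earlier one, which is what makes multi-cycle responses coherent.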
```bash
# Set up environment
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# AMD GPU (MI50 / ROCm)
source /path/to/rocm/setup-rocm-env.sh
export HSA_OVERRIDE_GFX_VERSION=9.0.6

# Chat with the model (auto-detects latest trained model in output/)
python chat.py

# Or point at a specific model
python chat.py --model ./output/worldmodel_v2/final
```

| Input | Effect |
|---|---|
| `/think` | Toggle visibility of `<think>` reasoning blocks |
| `/reset` | Clear Python variable state between queries |
| `quit` | Exit |
```text
python chat.py --help
  --model PATH      Path to fine-tuned model (default: auto-detect latest)
  --show-think      Show <think> blocks (hidden by default)
  --temperature N   Sampling temperature; 0 = deterministic (default: 0.7)
  --max-tokens N    Max tokens per generation step (default: 512)
  --vm              Use QEMU scratchpad VM for code execution
```
```bash
# Full training run (all datasets, 10 epochs)
./train_rocm.sh --output ./output/worldmodel_v2

# Smoke test (fast, core categories only)
./train_rocm.sh --categories arithmetic,algebra,geometry --epochs 3 --output ./output/smoke_test

# Resume from checkpoint
./train_rocm.sh --resume ./output/worldmodel_v2/checkpoint-400 --output ./output/worldmodel_v2
```

Training datasets live in `training/datasets/` as JSONL files. See `docs/DESIGN.md` for the data format.
Developed and tested on:
- AMD Instinct MI50, 32 GB VRAM
- ROCm 7.2
- Python 3.12, PyTorch 2.4 (ROCm build)
Training runs at float32 (required for gfx906 stability). A full 10-epoch run takes ~14 hours on the MI50.
After standard LoRA fine-tuning, a second training pass (train_geometric.py) applies geometric exploration on top of gradient descent. Instead of following gradients alone, it periodically:
- Extrapolates further along the gradient direction (vector jump)
- Probes random directions on a hypersphere around the current position
- Accepts a candidate only if it strictly improves the scoring function
This escapes shallow local minima that gradient descent alone settles into. Only the LoRA adapter parameters (~17M) are explored — the base model weights are never touched.
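The three moves above can be sketched as a single exploration step. This is a minimal NumPy illustration of the idea, not the project's `train_geometric.py`: the function name, learning rate, jump scale, sphere radius, and probe count are all illustrative, and `score` stands in for the composite scoring function.

```python
import numpy as np


def geometric_step(params, grad, score, lr=0.1, jump_scale=5.0,
                   sphere_radius=1e-2, n_probes=8, rng=None):
    """Sketch: one gradient step plus geometric exploration (lower score = better)."""
    rng = rng or np.random.default_rng(0)
    current = params - lr * grad                 # ordinary gradient step
    best = score(current)
    # Vector jump: extrapolate further along the gradient direction
    jump = params - jump_scale * lr * grad
    if score(jump) < best:                       # accept only on strict improvement
        current, best = jump, score(jump)
    # Sphere probes: random unit directions scaled to a hypersphere radius
    for _ in range(n_probes):
        d = rng.standard_normal(current.shape)
        cand = current + sphere_radius * d / np.linalg.norm(d)
        if score(cand) < best:
            current, best = cand, score(cand)
    return current
```

On a simple convex loss the jump often lands past the plain gradient step; on rugged losses the sphere probes are what let the search sidestep shallow basins.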
```bash
# Run geometric training starting from the latest LoRA checkpoint
python train_geometric.py

# Enable execution-based reward signal (recommended)
python train_geometric.py --exec-reward

# Resume a previous geometric run
python train_geometric.py --resume output/worldmodel_geometric/step_900
```

With `--exec-reward`, candidate evaluation scores each candidate not just on token loss but on whether the generated code is syntactically valid and executes correctly. Candidates that produce working code are preferred over candidates with marginally lower loss that produce broken code.
The reward function is a composite score (lower = better):

```text
score = token_loss
        − 0.2   if generated code has valid syntax
        − 0.1   if code executes successfully
        + 0.5   if no code block found in response
```
Both the baseline and every candidate are scored identically for a fair comparison. The baseline score is computed fresh at each geometric step so the comparison is always apples-to-apples.
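A direct translation of that composite score is straightforward. This sketch assumes the scorer receives the already-computed token loss and the full response text; the extraction logic and function name are illustrative, not the project's `execution_validator.py`.

```python
import ast
import contextlib
import io


def composite_score(token_loss, response):
    """Sketch of the composite reward: lower is better (weights from the text)."""
    if "<code>" not in response or "</code>" not in response:
        return token_loss + 0.5           # penalty: no code block found
    code = response.split("<code>", 1)[1].split("</code>", 1)[0]
    score = token_loss
    try:
        ast.parse(code)                   # bonus only if syntax is valid
        score -= 0.2
        with contextlib.redirect_stdout(io.StringIO()):
            exec(code, {})                # bonus only if it runs without error
        score -= 0.1
    except Exception:
        pass                              # broken code keeps whatever it earned
    return score
```

Note how the penalties stack: working code beats valid-but-crashing code, which beats unparseable code, which beats no code at all, regardless of small token-loss differences.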
| Metric | Epoch 0 | Epoch 4 |
|---|---|---|
| Code extracted | 0% | 95% |
| Syntax valid | 0% | 90% |
| Execution success | 0% | 85% |
| Output match | 0% | 70% |
| Jump acceptance | 22% | 80% |
| Sphere acceptance | 0% | 62% |
The geometric optimizer's acceptance rates rise dramatically with exec-reward because candidates that generate working code beat the baseline even with slightly higher token loss — creating selection pressure toward code correctness beyond what cross-entropy captures.
See docs/GEOMETRIC_TRAINING.md for algorithm details and hyperparameter reference.
```text
chat.py               ← start here for interactive use
train.py              ← standard LoRA fine-tuning entry point
train_geometric.py    ← geometric vector-space training (run after train.py)
train_rocm.sh         ← training launcher (sets ROCm env vars)
src/
  inference/
    generation_loop.py     ← custom loop: intercepts </code>, executes, injects <output>
  executor/
    python_exec.py         ← inline Python executor with use_tool() support
    vm_exec.py             ← QEMU scratchpad bridge (for sandboxed execution)
  training/
    dataset.py             ← JSONL loader, chat-template formatting, prompt masking
    execution_validator.py ← code extraction, syntax check, execution, reward scoring
    gpu_monitor.py         ← thermal monitoring and hard-pause throttle
training/
  datasets/               ← JSONL training data by category
  scripts/                ← dataset generators
docs/
  DESIGN.md               ← full architecture and token spec
  GEOMETRIC_TRAINING.md   ← geometric training algorithm and hyperparameter reference
  STATUS.md               ← current project state and next steps
  rocm/                   ← ROCm setup, troubleshooting, performance notes
history/                  ← archived earlier phases (ByteLogic, Blueprint)
```
See docs/DESIGN.md for the full spec. Key points:
- LoRA fine-tuning (rank 16) on top of Qwen3-1.7B
- 6 custom tokens added: `<model>`, `</model>`, `<code>`, `</code>`, `<output>`, `</output>` (`<think>` and `</think>` are already native to Qwen3)
- Prompt-masked training: loss computed only on the assistant response tokens
- Qwen3 chat template used for both training and inference
- `use_tool(name, **kwargs)` available in every code block; raises `ToolNotAvailableError` if the tool isn't registered, and the model is trained to handle this gracefully
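The tool-dispatch contract can be sketched as a small registry. This is an illustration of the behavior described above, not the project's `python_exec.py`; the registry dict and `register_tool` helper are assumptions.

```python
class ToolNotAvailableError(Exception):
    """Raised when a requested tool has no registered handler."""


_TOOLS = {}  # name -> callable; registry structure is an assumption


def register_tool(name, fn):
    """Make a callable available to generated code under `name`."""
    _TOOLS[name] = fn


def use_tool(name, **kwargs):
    """Dispatch to a registered tool, raising if it is missing (sketch)."""
    if name not in _TOOLS:
        raise ToolNotAvailableError(f"tool {name!r} is not registered")
    return _TOOLS[name](**kwargs)
```

Raising a dedicated exception (rather than returning `None`) is what lets the model's generated code catch `ToolNotAvailableError` and fall back to pure-Python computation.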