
Commit 4eec7e0

docs: add Windows & Blackwell troubleshooting, improve model-not-found error
Address workarounds reported in #66 (RTX 5080 + Windows success report):

- README: add Blackwell GPU note (cu128 nightly required for SM 12.0)
- README: add Windows section (Triton unsupported, TORCH_COMPILE_DISABLE, PYTHONPATH)
- README: add pre-trained model not found guidance with explicit path option
- run.py: improve find_best_model FileNotFoundError with actionable hint

Closes #66

Signed-off-by: kvmto <[email protected]>
1 parent 3ce4379 commit 4eec7e0

File tree

2 files changed (+35 −1 lines)


README.md

Lines changed: 28 additions & 0 deletions

````diff
@@ -146,6 +146,34 @@ Inference note:
 - Some environments crash during `torch.compile`.
 - Disable compile: `TORCH_COMPILE=0 bash code/scripts/local_run.sh`.
 - Or try a safer mode: `TORCH_COMPILE=1 TORCH_COMPILE_MODE=reduce-overhead bash code/scripts/local_run.sh`.
+- **Blackwell GPUs (RTX 5080/5090, GB200/GB300)**:
+  - Stable PyTorch wheels (`cu124`) do not ship SM 12.0 kernels yet.
+    Install the nightly build with the `cu128` index:
+    ```bash
+    pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128
+    ```
+- **Windows (Git Bash / WSL)**:
+  - Triton is not supported on native Windows, which causes `torch.compile` to
+    fail. Disable it before running:
+    ```bash
+    export TORCH_COMPILE_DISABLE=1  # PyTorch-level flag
+    # or, equivalently for the repo scripts:
+    export PREDECODER_TORCH_COMPILE=0
+    ```
+  - When running scripts directly (outside the notebook or `local_run.sh`),
+    set the Python path so that repo modules are importable:
+    ```bash
+    export PYTHONPATH="code"
+    ```
+- **Pre-trained model not found during inference**:
+  - `find_best_model` searches inside `{output}/models/best_model/` first,
+    then falls back to `{output}/models/`. If you placed the downloaded
+    `.pt` file elsewhere, either move it into one of those directories or
+    point to it directly:
+    ```bash
+    PREDECODER_MODEL_CHECKPOINT_FILE=path/to/Ising-Decoder-SurfaceCode-1-Accurate.pt \
+    WORKFLOW=inference bash code/scripts/local_run.sh
+    ```
 
 ## Inference (pre-trained models)
````

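The search order described in the model-not-found entry can be sketched in plain Python. This is an illustrative stand-in, not the repo's actual `find_best_model`; the helper name `locate_checkpoint` and the explicit-path parameter are assumptions modeled on the README's described behavior:

```python
import os

def locate_checkpoint(output_dir, explicit_path=None):
    """Resolve a .pt checkpoint using the README's documented search order:
    an explicit path wins (cf. PREDECODER_MODEL_CHECKPOINT_FILE), then
    {output}/models/best_model/, then {output}/models/.
    Illustrative sketch only; the repo's implementation may differ."""
    if explicit_path:
        return explicit_path
    for sub in (os.path.join("models", "best_model"), "models"):
        candidate_dir = os.path.join(output_dir, sub)
        if os.path.isdir(candidate_dir):
            # Deterministic pick among available checkpoints.
            pts = sorted(f for f in os.listdir(candidate_dir) if f.endswith(".pt"))
            if pts:
                return os.path.join(candidate_dir, pts[0])
    return None  # caller decides how to report "not found"
```

Moving the downloaded `.pt` into either directory, or passing the explicit path, satisfies this lookup without touching the code.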
code/workflows/run.py

Lines changed: 7 additions & 1 deletion

```diff
@@ -140,7 +140,13 @@ def find_best_model(path, *, rank: int = 0):
             print(f"  [{marker}] (unknown) (epoch {epoch_str})")
 
     if best_file is None:
-        raise FileNotFoundError(f"No valid model checkpoint files found in {path}")
+        raise FileNotFoundError(
+            f"No valid model checkpoint files found in {path}\n"
+            f"Expected .pt files (e.g. Ising-Decoder-SurfaceCode-1-Fast.pt or "
+            f"PreDecoderModelMemory_*.pt).\n"
+            f"Hint: download the pretrained weights and place them in this directory, "
+            f"or set model_checkpoint_file in your config to an explicit path."
+        )
 
     best_model_path = os.path.join(path, best_file)
     if rank == 0:
```

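The patched error pattern, selecting a checkpoint by filename and failing with an actionable hint, can be sketched independently of the repo. The epoch-number parse and the "pick the highest epoch" rule below are assumptions for illustration (the real `find_best_model` may rank checkpoints differently):

```python
import os
import re

def pick_latest_checkpoint(path):
    """Pick the checkpoint with the highest epoch number from files named
    like PreDecoderModelMemory_<epoch>.pt, raising an actionable
    FileNotFoundError when none exist. Illustrative sketch only."""
    best_file, best_epoch = None, -1
    for name in os.listdir(path):
        m = re.fullmatch(r"PreDecoderModelMemory_(\d+)\.pt", name)
        if m and int(m.group(1)) > best_epoch:
            best_file, best_epoch = name, int(m.group(1))
    if best_file is None:
        # Mirror the commit's style: say what was expected and how to recover.
        raise FileNotFoundError(
            f"No valid model checkpoint files found in {path}\n"
            "Expected .pt files (e.g. PreDecoderModelMemory_*.pt).\n"
            "Hint: download the pretrained weights and place them in this "
            "directory, or point your config at an explicit .pt path."
        )
    return os.path.join(path, best_file)
```

The point of the commit is the error's second and third lines: a bare "not found" forces users back to the source, while naming the expected files and the config override resolves issues like #66 without a bug report.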