Add support for fully local tracing via InSPyReNet, a background-removal model: no API key, no network access, no cost. It should be the default backend when no `GOOGLE_API_KEY` is set.
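The default-backend rule could be as simple as the following sketch (the function name `pick_tracing_backend` is hypothetical, not an existing API in this repo):

```python
import os

def pick_tracing_backend(env=os.environ):
    """Illustrative backend choice: Gemini when an API key is present,
    otherwise the fully local InSPyReNet path."""
    if env.get("GOOGLE_API_KEY"):
        return "gemini"
    return "inspyrenet"
```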
## Why InSPyReNet
I evaluated several local approaches against 110 Gemini-traced ground truth images:
| Approach | % of images with IoU > 0.9 | % of images with IoU > 0.7 | Median IoU | Speed |
|---|---|---|---|---|
| InSPyReNet | 69% | 72% | 0.949 | 0.7s |
| ISNet (rembg) | 62% | 75% | 0.934 | 0.9s |
| SAM2-large (ultralytics) | 39% | 45% | 0.588 | 1.8s |
| U2Net (rembg) | 20% | 60% | 0.873 | 0.4s |
| FastSAM | 7% | 30% | 0.380 | 0.5s |
| Ollama VLMs | N/A | N/A | N/A | 90s+ |
InSPyReNet won head-to-head against ISNet (the runner-up) 31 to 7 across 110 images, with 72 ties.
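For reference, the IoU metric used above compares a candidate mask against the Gemini-traced ground truth; on flattened binary masks it reduces to:

```python
def mask_iou(a, b):
    """Intersection-over-union of two binary masks, given as flat 0/1
    sequences of equal length. 1.0 = identical, 0.0 = no overlap."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 0.0
```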
Why not SAM2? SAM2-large produces excellent masks (0.92+ IoU) when it works, but it's bimodal -- either near-perfect or complete failure, depending on prompting. Its `_C` post-processing extension also doesn't compile on Apple Silicon.
Why not Ollama VLMs? Vision models (qwen3.5:35b, qwen3-vl:8b) can identify tools but can't generate mask images. I tried extracting polygon vertices as structured JSON instead -- the models return plausible-looking coordinates but they don't map to actual pixel positions. VLMs lack the spatial precision needed for contour tracing. See #11.
## How it should work
InSPyReNet is a salient object detection model trained to separate foreground from background. It outputs a foreground mask that can feed into the existing `_trace_mask()` pipeline -- the same OpenCV contour extraction, smoothing, and polygon output used for Gemini.
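A rough sketch of the wiring, assuming InSPyReNet's published `transparent-background` pip package (its `Remover` class and `process(..., type='map')` saliency output); the threshold value and the exact `_trace_mask()` signature are assumptions here:

```python
def binarize_saliency(saliency_rows, thresh=128):
    """Turn an 8-bit saliency map (rows of pixel values) into a 0/255
    mask ready for OpenCV contour extraction."""
    return [[255 if px >= thresh else 0 for px in row] for row in saliency_rows]

def local_trace(image_path):
    """Hypothetical wiring: InSPyReNet via `transparent-background`,
    feeding the existing _trace_mask() OpenCV pipeline."""
    from PIL import Image
    from transparent_background import Remover  # pip install transparent-background
    remover = Remover(mode="base")              # ~80MB weights fetched on first use
    img = Image.open(image_path).convert("RGB")
    saliency = remover.process(img, type="map") # grayscale foreground map
    # Same thresholding as binarize_saliency(), applied per pixel:
    mask = saliency.convert("L").point(lambda px: 255 if px >= 128 else 0)
    return _trace_mask(mask)                    # same contour/smoothing path as Gemini
```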
Some notes on implementation and testing:
- ~80MB model weights, downloaded automatically on first trace
- Runs on Apple Silicon (MPS) or CPU via PyTorch
- Sub-second inference on Apple Silicon, ~2-3s on CPU
- No prompting, no configuration, no mask selection heuristics
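The MPS/CPU note amounts to a small device-selection step; in real code the flags below would come from `torch.backends.mps.is_available()` and `torch.cuda.is_available()`, factored out here so the logic is plain:

```python
def select_device(mps_available: bool, cuda_available: bool = False) -> str:
    """Illustrative device choice: prefer Apple Silicon's MPS backend,
    then CUDA if present, else fall back to CPU."""
    if mps_available:
        return "mps"
    if cuda_available:
        return "cuda"
    return "cpu"
```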
## When to use Gemini instead
InSPyReNet struggles slightly with highly reflective/metallic tools, poor lighting, and images where the perspective correction leaves visible table edges. Gemini handles these better because it can reason about the scene rather than just separating foreground from background. Users may need to crop their photos more tightly so the background stays plain, showing only the tool being traced and the paper it sits on.