
Local tracing options: InSPyReNet support #13

@jasonmadigan

Description


Add support for fully local tracing via InSPyReNet, a background removal model. No API key, no network access, no cost. Should be the default when no GOOGLE_API_KEY is set.
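The fallback rule above could be as simple as an environment check; a minimal sketch (function and backend names here are illustrative, not the project's actual API):

```python
import os

def select_tracing_backend(env=os.environ):
    """Prefer Gemini when an API key is configured, otherwise fall back
    to fully local InSPyReNet tracing. Names are illustrative, not the
    project's real API."""
    if env.get("GOOGLE_API_KEY"):
        return "gemini"
    return "inspyrenet"

# With no key set, local tracing is the default.
print(select_tracing_backend(env={}))  # inspyrenet
```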

Why InSPyReNet

I evaluated several local approaches against 110 Gemini-traced ground truth images:

| Approach | >0.9 IoU | >0.7 IoU | Median IoU | Speed |
|---|---|---|---|---|
| InSPyReNet | 69% | 72% | 0.949 | 0.7s |
| ISNet (rembg) | 62% | 75% | 0.934 | 0.9s |
| SAM2-large (ultralytics) | 39% | 45% | 0.588 | 1.8s |
| U2Net (rembg) | 20% | 60% | 0.873 | 0.4s |
| FastSAM | 7% | 30% | 0.380 | 0.5s |
| Ollama VLMs | N/A | N/A | N/A | 90s+ |

InSPyReNet won head-to-head against ISNet (the runner-up) 31 to 7 across 110 images, with 72 ties.
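For reference, the IoU numbers above compare each model's binary mask against the Gemini-traced ground truth; a sketch of that metric (NumPy, illustrative rather than the actual evaluation script):

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union of two boolean masks."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return float(np.logical_and(a, b).sum() / union)

# Two overlapping 50x50 squares on a 100x100 canvas.
gt = np.zeros((100, 100), bool); gt[10:60, 10:60] = True
pred = np.zeros((100, 100), bool); pred[20:70, 20:70] = True
print(round(mask_iou(gt, pred), 3))  # 0.471 (1600 px overlap / 3400 px union)
```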

Why not SAM2? SAM2-large produces excellent masks (0.92+ IoU) when it works, but it's bimodal -- either near-perfect or complete failure, depending on prompting. The _C post-processing extension doesn't compile for Apple Silicon.

Why not Ollama VLMs? Vision models (qwen3.5:35b, qwen3-vl:8b) can identify tools but can't generate mask images. I tried extracting polygon vertices as structured JSON instead -- the models return plausible-looking coordinates but they don't map to actual pixel positions. VLMs lack the spatial precision needed for contour tracing. See #11.

How it should work

InSPyReNet is a salient object detection model trained to separate foreground from background. The model outputs a foreground mask which can feed into the existing _trace_mask() pipeline -- same OpenCV contour extraction, smoothing, and polygon output as Gemini.

Some notes on implementation and testing:

  • ~80MB model weights, downloaded automatically on first trace
  • Runs on Apple Silicon (MPS) or CPU via PyTorch
  • Sub-second inference on Apple Silicon, ~2-3s on CPU
  • No prompting, no configuration, no mask selection heuristics

When to use Gemini instead

InSPyReNet struggles slightly with highly reflective/metallic tools, poor lighting, and images where the perspective correction leaves visible table edges. Gemini handles these better because it can reason about the scene rather than just separating foreground from background. Users may need to crop their photos to a plainer background, showing only the paper and the tool being traced.
