This repository contains a CPU-only pipeline for grounding natural language queries in dense images.
It uses GroundingDINO for text-conditioned detection, optional CLIP re‑ranking, tiling for large images, and SAM for fine-grained segmentation of the selected region.
This approach was selected as the final pipeline. The repo is structured so others can reproduce it easily on their machines, including cloning the required upstream repos and downloading the model weights via a single Python script.
```bash
git clone https://github.com/Rachitb0611/scene_localise.git
cd scene_localise
```

```bash
# Linux / macOS
python3 -m venv .venv
source .venv/bin/activate

# Windows (PowerShell)
python -m venv .venv
.venv\Scripts\Activate.ps1
```

```bash
pip install --upgrade pip
pip install -r requirements.txt
```

```bash
python scripts/fetch_assets.py
```

This will create/refresh:

- `GroundingDINO/` (git clone)
- `CLIP/` (OpenAI CLIP git clone)
- `segment-anything/` (SAM clone, optional; the SAM weight is downloaded either way)
- `weights/groundingdino_swint_ogc.pth`
- `weights/sam_vit_h_4b8939.pth`
If any URL is blocked in your environment, the script prints a manual backup URL and tells you where to place each file; download from that link instead.
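For reference, the fetch logic boils down to cloning any missing repo and downloading any missing weight file. Below is a minimal sketch of that behavior, not the actual `scripts/fetch_assets.py`; the release URLs shown are the upstream ones and were correct at the time of writing:

```python
# Sketch of the asset-fetch logic (the real scripts/fetch_assets.py may differ).
import subprocess
import urllib.request
from pathlib import Path

REPOS = {
    "GroundingDINO": "https://github.com/IDEA-Research/GroundingDINO.git",
    "CLIP": "https://github.com/openai/CLIP.git",
    "segment-anything": "https://github.com/facebookresearch/segment-anything.git",
}
WEIGHTS = {
    # Upstream release URLs, correct at the time of writing.
    "weights/groundingdino_swint_ogc.pth":
        "https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth",
    "weights/sam_vit_h_4b8939.pth":
        "https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth",
}

for name, url in REPOS.items():
    if not Path(name).exists():  # clone only if the repo is missing
        subprocess.run(["git", "clone", "--depth", "1", url, name], check=True)

Path("weights").mkdir(exist_ok=True)
for dest, url in WEIGHTS.items():
    if not Path(dest).exists():  # skip weights that are already downloaded
        print(f"downloading {url} -> {dest}")
        urllib.request.urlretrieve(url, dest)
```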
```bash
# Provide an input image path when prompted (or place a sample in ./data/img.png)
python main_1.py
```

The script saves outputs under `results/`, including the best bounding box, the SAM mask, and the cropped region.
- GroundingDINO: given an image and a natural-language text query, it proposes boxes that likely match the text (see the detection sketch after this list).
- CLIP Re-ranking (optional): cropped regions are scored against the query using CLIP; the scores are fused with detector confidence to pick the best box.
- Tiling (optional): for very large images, detection runs per tile and the boxes are fused, e.g. with Soft-NMS or Weighted Box Fusion (window arithmetic sketched below).
- SAM Segmentation: the final box is used as a prompt to SAM to obtain a tight mask of the object/region; overlays and RGBA cutouts are then saved (see the SAM sketch below).
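The detect-and-re-rank step can be sketched with the upstream GroundingDINO and OpenAI CLIP APIs as follows. This is a minimal CPU-only illustration; the query string, thresholds, and fusion weight `alpha` are placeholders, not the exact values used in `main_1.py`:

```python
import torch
import clip  # OpenAI CLIP (pip install clip-anytorch, or use the cloned CLIP/ repo)
from PIL import Image
from groundingdino.util.inference import load_model, load_image, predict

QUERY = "a red backpack"  # placeholder example query

# 1) Text-conditioned detection with GroundingDINO.
det_model = load_model(
    "GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "weights/groundingdino_swint_ogc.pth",
    device="cpu",
)
image_source, image = load_image("data/img.png")  # numpy RGB + preprocessed tensor
boxes, logits, phrases = predict(
    model=det_model, image=image, caption=QUERY,
    box_threshold=0.35, text_threshold=0.25, device="cpu",
)  # boxes: normalized cxcywh; logits: detector confidences

# 2) Optional CLIP re-ranking: score each cropped box against the query,
#    then fuse CLIP similarity with the detector confidence.
clip_model, preprocess = clip.load("ViT-B/32", device="cpu")
H, W = image_source.shape[:2]
alpha = 0.5  # illustrative fusion weight
fused = []
with torch.no_grad():
    text_feat = clip_model.encode_text(clip.tokenize([QUERY]))
    for (cx, cy, w, h), det_conf in zip(boxes.tolist(), logits.tolist()):
        x0, y0 = max(int((cx - w / 2) * W), 0), max(int((cy - h / 2) * H), 0)
        x1, y1 = int((cx + w / 2) * W), int((cy + h / 2) * H)
        crop = preprocess(Image.fromarray(image_source[y0:y1, x0:x1])).unsqueeze(0)
        sim = torch.cosine_similarity(clip_model.encode_image(crop), text_feat).item()
        fused.append(alpha * det_conf + (1 - alpha) * sim)

best_box = boxes[max(range(len(fused)), key=fused.__getitem__)]  # winner, cxcywh
```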
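Tiling amounts to covering the image with overlapping fixed-size windows and mapping per-tile boxes back into global coordinates before fusion. A sketch of the window arithmetic, with a hypothetical helper name and illustrative tile size/overlap:

```python
def tile_windows(width, height, tile=1024, overlap=0.2):
    """Yield (x0, y0, x1, y1) windows that cover the image with the given overlap."""
    step = max(int(tile * (1 - overlap)), 1)
    for y0 in range(0, max(height - tile, 0) + step, step):
        for x0 in range(0, max(width - tile, 0) + step, step):
            x1, y1 = min(x0 + tile, width), min(y0 + tile, height)
            # Shift edge tiles back inside the image so every tile is full-size.
            yield max(x1 - tile, 0), max(y1 - tile, 0), x1, y1

# Per-tile detections must be offset by their tile's (x0, y0) back into
# global image coordinates before boxes from all tiles are fused.
```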
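The selected box then prompts SAM. A minimal sketch using Meta's `segment-anything` API; the image path and box values are placeholders:

```python
import numpy as np
from pathlib import Path
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

image_rgb = np.array(Image.open("data/img.png").convert("RGB"))
box_xyxy = np.array([120, 80, 360, 300])  # selected box in pixel XYXY (placeholder)

sam = sam_model_registry["vit_h"](checkpoint="weights/sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image_rgb)                 # HxWx3 uint8 RGB array
masks, mask_scores, _ = predictor.predict(box=box_xyxy, multimask_output=False)
mask = masks[0]                                # boolean HxW mask for the prompted box

# RGBA cutout: keep masked pixels, make everything else transparent.
Path("results").mkdir(exist_ok=True)
rgba = np.dstack([image_rgb, mask.astype(np.uint8) * 255])
Image.fromarray(rgba).save("results/cutout.png")
```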
- Environment: Python 3.9 recommended. Create a venv and install the exact requirements from `requirements.txt` for reproducibility.
- Assets Fetcher: run `python scripts/fetch_assets.py` to:
  - `git clone` GroundingDINO, OpenAI CLIP, and Segment-Anything (SAM) if not already present;
  - download the GroundingDINO weights and SAM ViT-H weights to `./weights/`.
- First Run:
  - Place a test image at `./data/img.png` or provide a path when prompted.
  - Run `python main_1.py` and follow the prompts for the query and tiling option.

- The provided `main_1.py` is CPU-only and includes optional tiling and a CLIP re‑ranker path.
- If you change the weights’ filenames or locations, update `main_1.py` accordingly.
- If your environment blocks the default download URLs, place the weights manually into `./weights/` and rerun.
```
scene-grounding-cpu/
├─ main_1.py            # CPU pipeline (GroundingDINO + optional CLIP + SAM)
├─ main_2.py            # Variant that also handles negative queries
├─ requirements.txt
├─ scripts/
│  └─ fetch_assets.py   # Clones repos & downloads model weights
├─ documentation.md     # Technical report: approach, models, and techniques used
├─ doc.pdf              # High-level presentation of the solution, with less technical detail
├─ results/             # Outputs (auto-created)
├─ weights/             # Downloaded model weights
└─ .gitignore
```
- NumPy / Torch ABI mismatch: if you see errors about NumPy versions, pin NumPy to a version compatible with your PyTorch build (e.g., `pip install "numpy==1.26.*"`).
- Torchvision NMS import: if `torchvision.ops.nms` fails on CPU-only wheels, the code falls back to Soft‑NMS (sketched below).
- CLIP not found: the pipeline runs without CLIP (re‑ranking is skipped). Ensure `clip-anytorch` is installed or use the cloned OpenAI CLIP package.
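For reference, Gaussian Soft‑NMS decays the scores of overlapping boxes instead of discarding them outright. Below is a minimal NumPy sketch, not necessarily the exact fallback implementation in this repo; `sigma` and `score_thresh` are illustrative defaults:

```python
import numpy as np

def soft_nms(boxes, scores, sigma=0.5, score_thresh=1e-3):
    """Gaussian Soft-NMS. boxes: (N, 4) in XYXY, scores: (N,). Returns kept indices."""
    boxes = boxes.astype(float)
    scores = scores.astype(float).copy()
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    idxs = np.arange(len(scores))
    keep = []
    while idxs.size > 0:
        top = idxs[np.argmax(scores[idxs])]
        keep.append(int(top))
        idxs = idxs[idxs != top]
        if idxs.size == 0:
            break
        # IoU of the top-scoring box against all remaining boxes.
        x0 = np.maximum(boxes[top, 0], boxes[idxs, 0])
        y0 = np.maximum(boxes[top, 1], boxes[idxs, 1])
        x1 = np.minimum(boxes[top, 2], boxes[idxs, 2])
        y1 = np.minimum(boxes[top, 3], boxes[idxs, 3])
        inter = np.clip(x1 - x0, 0, None) * np.clip(y1 - y0, 0, None)
        iou = inter / (areas[top] + areas[idxs] - inter)
        # Decay overlapping scores instead of suppressing the boxes outright.
        scores[idxs] *= np.exp(-(iou ** 2) / sigma)
        idxs = idxs[scores[idxs] > score_thresh]
    return keep
```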
- GroundingDINO by IDEA-Research
- OpenAI CLIP
- Segment Anything (SAM) by Meta AI