This repository contains a CPU-only pipeline for grounding natural language queries in dense images.
It uses GroundingDINO for text-conditioned detection, optional CLIP re‑ranking, tiling for large images, and SAM for fine-grained segmentation of the selected region.
This approach was selected as the final pipeline. The repo is structured so others can reproduce it easily on their machines, including cloning the required upstream repos and downloading the model weights via a single Python script.
```bash
git clone https://github.com/Rachitb0611/scene_localise.git
cd scene_localise
```

```bash
# Linux / macOS
python3 -m venv .venv
source .venv/bin/activate

# Windows (PowerShell)
python -m venv .venv
.venv\Scripts\Activate.ps1
```

```bash
pip install --upgrade pip
pip install -r requirements.txt
```

```bash
python scripts/fetch_assets.py
```

This will create/refresh:

- `GroundingDINO/` (git clone)
- `CLIP/` (OpenAI CLIP git clone)
- `segment-anything/` (SAM clone, optional; the SAM weight is downloaded either way)
- `weights/groundingdino_swint_ogc.pth`
- `weights/sam_vit_h_4b8939.pth`
If any URL is blocked in your environment, the script prints a manual backup URL and tells you where to place each file; download from that link instead.
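For reference, the fetch logic boils down to cloning any missing repo and downloading any missing weight file. Below is a minimal sketch of that behavior, not the actual `scripts/fetch_assets.py`; the release URLs shown are the upstream ones and were correct at the time of writing:

```python
# Sketch of the asset-fetch logic (the real scripts/fetch_assets.py may differ).
import subprocess
import urllib.request
from pathlib import Path

REPOS = {
    "GroundingDINO": "https://github.com/IDEA-Research/GroundingDINO.git",
    "CLIP": "https://github.com/openai/CLIP.git",
    "segment-anything": "https://github.com/facebookresearch/segment-anything.git",
}
WEIGHTS = {
    # Upstream release URLs, correct at the time of writing.
    "weights/groundingdino_swint_ogc.pth":
        "https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth",
    "weights/sam_vit_h_4b8939.pth":
        "https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth",
}

for name, url in REPOS.items():
    if not Path(name).exists():  # clone only if the repo is missing
        subprocess.run(["git", "clone", "--depth", "1", url, name], check=True)

Path("weights").mkdir(exist_ok=True)
for dest, url in WEIGHTS.items():
    if not Path(dest).exists():  # skip weights that are already downloaded
        print(f"downloading {url} -> {dest}")
        urllib.request.urlretrieve(url, dest)
```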
```bash
# Provide an input image path when prompted (or place a sample in ./data/img.png)
python main_1.py
```

The script saves outputs under `results/`, including the best bounding box, the SAM mask, and the cropped region.
- GroundingDINO: given an image and a natural-language text query, it proposes boxes that likely match the text (see the detection sketch after this list).
- CLIP Re-ranking (optional): cropped regions are scored against the query using CLIP; the scores are fused with detector confidence to pick the best box.
- Tiling (optional): for very large images, detection runs per tile and the boxes are fused, e.g. with Soft-NMS or Weighted Box Fusion (window arithmetic sketched below).
- SAM Segmentation: the final box is used as a prompt to SAM to obtain a tight mask of the object/region; overlays and RGBA cutouts are then saved (see the SAM sketch below).
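The detect-and-re-rank step can be sketched with the upstream GroundingDINO and OpenAI CLIP APIs as follows. This is a minimal CPU-only illustration; the query string, thresholds, and fusion weight `alpha` are placeholders, not the exact values used in `main_1.py`:

```python
import torch
import clip  # OpenAI CLIP (pip install clip-anytorch, or use the cloned CLIP/ repo)
from PIL import Image
from groundingdino.util.inference import load_model, load_image, predict

QUERY = "a red backpack"  # placeholder example query

# 1) Text-conditioned detection with GroundingDINO.
det_model = load_model(
    "GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "weights/groundingdino_swint_ogc.pth",
    device="cpu",
)
image_source, image = load_image("data/img.png")  # numpy RGB + preprocessed tensor
boxes, logits, phrases = predict(
    model=det_model, image=image, caption=QUERY,
    box_threshold=0.35, text_threshold=0.25, device="cpu",
)  # boxes: normalized cxcywh; logits: detector confidences

# 2) Optional CLIP re-ranking: score each cropped box against the query,
#    then fuse CLIP similarity with the detector confidence.
clip_model, preprocess = clip.load("ViT-B/32", device="cpu")
H, W = image_source.shape[:2]
alpha = 0.5  # illustrative fusion weight
fused = []
with torch.no_grad():
    text_feat = clip_model.encode_text(clip.tokenize([QUERY]))
    for (cx, cy, w, h), det_conf in zip(boxes.tolist(), logits.tolist()):
        x0, y0 = max(int((cx - w / 2) * W), 0), max(int((cy - h / 2) * H), 0)
        x1, y1 = int((cx + w / 2) * W), int((cy + h / 2) * H)
        crop = preprocess(Image.fromarray(image_source[y0:y1, x0:x1])).unsqueeze(0)
        sim = torch.cosine_similarity(clip_model.encode_image(crop), text_feat).item()
        fused.append(alpha * det_conf + (1 - alpha) * sim)

best_box = boxes[max(range(len(fused)), key=fused.__getitem__)]  # winner, cxcywh
```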
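Tiling amounts to covering the image with overlapping fixed-size windows and mapping per-tile boxes back into global coordinates before fusion. A sketch of the window arithmetic, with a hypothetical helper name and illustrative tile size/overlap:

```python
def tile_windows(width, height, tile=1024, overlap=0.2):
    """Yield (x0, y0, x1, y1) windows that cover the image with the given overlap."""
    step = max(int(tile * (1 - overlap)), 1)
    for y0 in range(0, max(height - tile, 0) + step, step):
        for x0 in range(0, max(width - tile, 0) + step, step):
            x1, y1 = min(x0 + tile, width), min(y0 + tile, height)
            # Shift edge tiles back inside the image so every tile is full-size.
            yield max(x1 - tile, 0), max(y1 - tile, 0), x1, y1

# Per-tile detections must be offset by their tile's (x0, y0) back into
# global image coordinates before boxes from all tiles are fused.
```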
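The selected box then prompts SAM. A minimal sketch using Meta's `segment-anything` API; the image path and box values are placeholders:

```python
import numpy as np
from pathlib import Path
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

image_rgb = np.array(Image.open("data/img.png").convert("RGB"))
box_xyxy = np.array([120, 80, 360, 300])  # selected box in pixel XYXY (placeholder)

sam = sam_model_registry["vit_h"](checkpoint="weights/sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image_rgb)                 # HxWx3 uint8 RGB array
masks, mask_scores, _ = predictor.predict(box=box_xyxy, multimask_output=False)
mask = masks[0]                                # boolean HxW mask for the prompted box

# RGBA cutout: keep masked pixels, make everything else transparent.
Path("results").mkdir(exist_ok=True)
rgba = np.dstack([image_rgb, mask.astype(np.uint8) * 255])
Image.fromarray(rgba).save("results/cutout.png")
```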
- Environment: Python 3.9 recommended. Create a venv and install the exact requirements from `requirements.txt` for reproducibility.
- Assets Fetcher: run `python scripts/fetch_assets.py` to:
  - `git clone` GroundingDINO, OpenAI CLIP, and Segment-Anything (SAM) if not already present;
  - download the GroundingDINO weights and SAM ViT-H weights to `./weights/`.
- First Run:
  - Place a test image at `./data/img.png` or provide a path when prompted.
  - Run `python main_1.py` and follow the prompts for the query and tiling option.

- The provided `main_1.py` is CPU-only and includes optional tiling and a CLIP re‑ranker path.
- If you change the weights’ filenames or locations, update `main_1.py` accordingly.
- If your environment blocks the default download URLs, place the weights manually into `./weights/` and rerun.
```
scene-grounding-cpu/
├─ main_1.py            # CPU pipeline (GroundingDINO + optional CLIP + SAM)
├─ main_2.py            # Variant that also handles negative queries
├─ requirements.txt
├─ scripts/
│  └─ fetch_assets.py   # Clones repos & downloads model weights
├─ documentation.md     # Technical report: approach, models, and techniques used
├─ doc.pdf              # High-level presentation of the solution, with less technical detail
├─ results/             # Outputs (auto-created)
├─ weights/             # Downloaded model weights
└─ .gitignore
```
- NumPy / Torch ABI mismatch: if you see errors about NumPy versions, pin NumPy to a version compatible with your PyTorch build (e.g., `pip install "numpy==1.26.*"`).
- Torchvision NMS import: if `torchvision.ops.nms` fails on CPU-only wheels, the code falls back to Soft‑NMS (sketched below).
- CLIP not found: the pipeline runs without CLIP (re‑ranking is skipped). Ensure `clip-anytorch` is installed or use the cloned OpenAI CLIP package.
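For reference, Gaussian Soft‑NMS decays the scores of overlapping boxes instead of discarding them outright. Below is a minimal NumPy sketch, not necessarily the exact fallback implementation in this repo; `sigma` and `score_thresh` are illustrative defaults:

```python
import numpy as np

def soft_nms(boxes, scores, sigma=0.5, score_thresh=1e-3):
    """Gaussian Soft-NMS. boxes: (N, 4) in XYXY, scores: (N,). Returns kept indices."""
    boxes = boxes.astype(float)
    scores = scores.astype(float).copy()
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    idxs = np.arange(len(scores))
    keep = []
    while idxs.size > 0:
        top = idxs[np.argmax(scores[idxs])]
        keep.append(int(top))
        idxs = idxs[idxs != top]
        if idxs.size == 0:
            break
        # IoU of the top-scoring box against all remaining boxes.
        x0 = np.maximum(boxes[top, 0], boxes[idxs, 0])
        y0 = np.maximum(boxes[top, 1], boxes[idxs, 1])
        x1 = np.minimum(boxes[top, 2], boxes[idxs, 2])
        y1 = np.minimum(boxes[top, 3], boxes[idxs, 3])
        inter = np.clip(x1 - x0, 0, None) * np.clip(y1 - y0, 0, None)
        iou = inter / (areas[top] + areas[idxs] - inter)
        # Decay overlapping scores instead of suppressing the boxes outright.
        scores[idxs] *= np.exp(-(iou ** 2) / sigma)
        idxs = idxs[scores[idxs] > score_thresh]
    return keep
```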
- GroundingDINO by IDEA-Research
- OpenAI CLIP
- Segment Anything (SAM) by Meta AI