SceneDiff: A Benchmark and Method for Multiview Object Change Detection


Yuqun Wu · Chih-hao Lin · Henry Che · Aditi Tiwari · Chuhang Zou · Shenlong Wang · Derek Hoiem

Project Page · arXiv · Dataset · Data Annotator


This repository contains the code for the paper SceneDiff: A Benchmark and Method for Multiview Object Change Detection. We investigate the problem of identifying objects that have changed between two captures of the same scene taken at different times, introducing the first object-level multiview change detection benchmark and a new training-free method.

Updates

[Mar 2026] We updated the evaluation protocol and default model parameters:

  • Evaluation criteria. The previous evaluation used center-point matching for detection–GT correspondence. The evaluation now uses mask-based IoU matching and reports three metrics:
    • px/im IoU – pixel-level IoU computed on merged masks across all views and frames.
    • obj/im AP – per-frame Average Precision with 0.5 IoU threshold (VOC-style).
    • obj/sc AP – per-scene Average Precision at the object level with 0.5 IoU threshold across all frames, reported in two variants:
      • without change-type requirement – a detection matches any GT object regardless of change type across both sequences.
      • with change-type requirement – a detection must also match the GT change type (moved vs. added/removed).
  • Default evaluation parameters updated. --duplicate_match_threshold and --per_frame_duplicate_match_threshold now default to 1.
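For reference, the object-level matching behind these metrics can be sketched as follows. `mask_iou` and `greedy_match` are illustrative helpers, not the repository's evaluation code; the `duplicate_match_threshold` parameter mirrors the meaning of the `--duplicate_match_threshold` flag described below.

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two boolean masks of the same shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union else 0.0

def greedy_match(preds, gts, iou_thresh=0.5, duplicate_match_threshold=1):
    """Greedily match predicted masks (sorted by confidence) to GT masks.

    Returns a list of (pred_idx, gt_idx) pairs. Each GT mask may be
    matched at most `duplicate_match_threshold` times, mirroring the
    evaluation flag's default of 1.
    """
    match_counts = [0] * len(gts)
    matches = []
    for pi, pred in enumerate(preds):
        best_iou, best_gi = 0.0, -1
        for gi, gt in enumerate(gts):
            if match_counts[gi] >= duplicate_match_threshold:
                continue  # this GT object has used up its matches
            iou = mask_iou(pred, gt)
            if iou > best_iou:
                best_iou, best_gi = iou, gi
        if best_gi >= 0 and best_iou >= iou_thresh:
            match_counts[best_gi] += 1
            matches.append((pi, best_gi))
    return matches
```

With the default threshold of 1, a second detection of an already-matched GT object counts as a false positive, which is what the updated defaults enforce.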

SceneDiff Benchmark

Download the SceneDiff benchmark dataset from 🤗 Hugging Face.

mkdir data && cd data
wget https://huggingface.co/datasets/yuqun/SceneDiff/resolve/main/scenediff_benchmark.zip
unzip scenediff_benchmark.zip

Dataset Structure

scenediff_benchmark/
├── data/                          # 350 sequence pairs
│   ├── sequence_pair_1/
│   │   ├── original_video1.mp4    # Raw video before change
│   │   ├── original_video2.mp4    # Raw video after change
│   │   ├── video1.mp4             # Video with annotation mask (before)
│   │   ├── video2.mp4             # Video with annotation mask (after)
│   │   ├── segments.pkl           # Dense segmentation masks for evaluation
│   │   └── metadata.json          # Sequence metadata
│   ├── sequence_pair_2/
│   │   └── ...
│   └── ...
├── splits/                        # Val/Test splits
│   ├── val_split.json
│   └── test_split.json
└── vis/                           # Visualization tools
    ├── visualizer.py              # Flask-based web viewer
    ├── requirements.txt
    └── templates/

About segments.pkl: See the detailed description here.
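To iterate over the benchmark programmatically, a minimal sketch is below. It assumes the split JSON stores a flat list of sequence-pair directory names; check splits/val_split.json for the actual schema, and see the segments.pkl description above for the contents of the loaded pickle.

```python
import json
import pickle
from pathlib import Path

def load_sequence_pair(pair_dir: Path):
    """Load the dense segmentation masks and metadata for one sequence pair."""
    with open(pair_dir / "segments.pkl", "rb") as f:
        segments = pickle.load(f)
    with open(pair_dir / "metadata.json") as f:
        metadata = json.load(f)
    return segments, metadata

def iter_split(root: Path, split: str = "val"):
    """Yield (pair_name, segments, metadata) for each pair in a split.

    Assumption: the split JSON is a flat list of sequence-pair names.
    """
    with open(root / "splits" / f"{split}_split.json") as f:
        pair_names = json.load(f)
    for name in pair_names:
        segments, metadata = load_sequence_pair(root / "data" / name)
        yield name, segments, metadata
```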

Visualization: To browse the benchmark in the Flask-based web viewer, run:

cd data/scenediff_benchmark/vis && pip install -r requirements.txt
python visualizer.py

Evaluation

We expect method predictions to follow this directory structure:

output_dir/
├── sequence_pair_1/
│   └── object_masks.pkl           # Dense segmentations of changed objects (for evaluation)
├── sequence_pair_2/
└── ...

with object_masks.pkl following this structure:

object_masks = {
    'H': int,                           # Image height
    'W': int,                           # Image width
    'video_1': {                        # Objects existing in video_1
        'object_id_1': {                # Integer ID for each detected object
            'frame_id_1': {             # Integer frame number
                'mask': RLE_Mask,       # Run-length encoded mask
                'cost': float           # Confidence score of the prediction
            },
            ...
        },
        ...
    },
    'video_2': {                        # Objects existing in video_2
        'object_id_1': {                # Integer ID for each detected object
            'frame_id_1': {             # Integer frame number
                'mask': RLE_Mask,       # Run-length encoded mask
                'cost': float           # Confidence score of the prediction
            },
            ...
        },
        ...
    }
}
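The prediction file above can be assembled and saved as in the sketch below. The `rle_encode` helper is purely illustrative; verify the exact RLE format against what scripts/evaluate_multiview.py decodes (pycocotools-style COCO RLE is a common choice) before relying on it.

```python
import pickle
import numpy as np

def rle_encode(mask: np.ndarray) -> dict:
    """Illustrative column-major run-length encoding (COCO-style counts).

    Counts alternate runs of 0s and 1s, starting with 0s. Verify the
    actual encoding expected by the evaluator before using this sketch.
    """
    flat = np.asfortranarray(mask).ravel(order="F").astype(np.uint8)
    counts, prev, run = [], 0, 0
    for v in flat:
        if v == prev:
            run += 1
        else:
            counts.append(run)
            prev, run = int(v), 1
    counts.append(run)
    return {"size": list(mask.shape), "counts": counts}

H, W = 480, 640
mask = np.zeros((H, W), dtype=bool)
mask[100:200, 150:300] = True           # one hypothetical changed object

object_masks = {
    "H": H,
    "W": W,
    # video -> object id -> frame id -> {mask, cost}
    "video_1": {1: {0: {"mask": rle_encode(mask), "cost": 0.12}}},
    "video_2": {},                      # no detections in the second video
}

with open("object_masks.pkl", "wb") as f:
    pickle.dump(object_masks, f)
```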

Then the evaluation script can be run with:

python scripts/evaluate_multiview.py \
    --pred_dir ${OUTPUT_DIR} \
    --splits val \
    --sets varied \
    --output_path ${OUTPUT_FILE_PATH} \
    --visualize False

Arguments:

  • --duplicate_match_threshold: Max number of times a GT object can be matched at the object level (default: 1)
  • --per_frame_duplicate_match_threshold: Max number of times a GT region can be matched per frame/view (default: 1)
  • --splits: Choose from val, test, or all
  • --sets: Choose from varied, kitchen, or All
  • --visualize: Set to True to save visualization outputs

Output: The evaluation results will be saved to ${OUTPUT_FILE_PATH}

Getting Started

Installation

  1. Clone this repository with submodules:
  git clone --recursive https://github.com/yuqunw/scene_diff.git
  cd scene_diff
  2. Create a conda environment and install dependencies:
  conda create -n scene_diff python=3.10 -y
  conda activate scene_diff
  pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121 # install the PyTorch build matching your CUDA (nvcc) version
  pip install -r requirements.txt
  pip install torch-scatter -f https://data.pyg.org/whl/torch-2.5.1+cu121.html # install torch_scatter
  3. Install submodules:
  # Install the segment-anything submodule
  cd submodules/segment-anything-langsplat-modified
  pip install -e .
  cd ../..

Download Checkpoints

1. Download the Segment-Anything checkpoint:

bash checkpoints/download_sam_checkpoint.sh

2. Configure DINOv3 checkpoint:

The DINOv3 checkpoint will be downloaded automatically on first use once the checkpoint URL is filled in. To set it up:

  1. Visit the DINOv3 downloads page to apply for checkpoint access
  2. Right-click on dinov3_vith16plus_pretrain_lvd1689m-7c1da9a5.pth and copy the download link
  3. Update the URL in configs/scenediff_config.yml:
 models:
   dinov3:
     weight_url: "<paste_your_copied_url_here>"

Quick Demo

Run change detection on any two videos:

python scripts/demo.py \
    --config configs/scenediff_config.yml \
    --video1 path/to/video1.mp4 \
    --video2 path/to/video2.mp4 \
    --output output/demo

Output: The script generates point cloud visualizations including score maps and object segmentations for both videos in the specified output directory.

Parameters: You can modify parameters in configs/scenediff_config.yml. If the automatic threshold for change detection doesn't work well (score maps look correct but there are too many or too few detections), you can manually set detection.object_threshold in the config file.
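To illustrate what the manual override does, here is a minimal sketch of binarizing a score map with a fixed threshold. The `detect_changes` helper and its automatic fallback rule (mean + 2·std) are illustrative assumptions, standing in for the repository's `detection.object_threshold` logic, whose automatic thresholding may work differently.

```python
import numpy as np

def detect_changes(score_map: np.ndarray, object_threshold=None) -> np.ndarray:
    """Binarize a per-pixel change-score map into a detection mask.

    If object_threshold is None, fall back to a simple automatic choice
    (mean + 2*std). This rule is purely illustrative; the repository's
    automatic thresholding may differ.
    """
    if object_threshold is None:
        object_threshold = score_map.mean() + 2.0 * score_map.std()
    return score_map >= object_threshold
```

Lowering the threshold admits more (possibly spurious) detections; raising it keeps only the highest-scoring regions.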

Predict on SceneDiff Benchmark

Run inference on all sequences in the benchmark:

python scripts/predict_multiview.py \
    --config configs/scenediff_config.yml \
    --splits val \
    --sets varied \
    --output_dir output/scenediff_benchmark

Arguments:

  • --splits: Choose from val, test, or all
  • --sets: Choose from varied, kitchen, or All
  • --output_dir: Directory to save predictions
  • Modify more arguments in the config file

Acknowledgement

We thank the authors of the following repositories for their great work:

License

This project is released under the MIT License. See LICENSE for details.
