VideoMasking is a 3D Slicer module for extracting and masking frames from video using SAMURAI (Segment Anything Model for Universal and Robust AI tracking). It converts video recordings into masked individual frames that can be used for photogrammetry reconstruction with the ODM module.
- Overview
- What is SAMURAI?
- Prerequisites
- SAMURAI Setup
- Video Preparation
- ROI Selection and Tracking
- Key-Frame Filtering
- Saving Output
- Next Steps
VideoMasking enables you to create masked frames from video recordings of objects. Instead of taking individual photographs, you can record a video of your specimen rotating on a turntable and let VideoMasking:
- Extract all frames from the video automatically
- Track the object across all frames using SAMURAI
- Filter similar frames using a similarity index to reduce redundancy
- Generate masked frames ready for photogrammetry
- Convert video formats (MOV to MP4) if needed
- Automatically extract all frames from video (up to a maximum of 2000 frames)
- Select a Region of Interest (ROI) on the first frame to identify your object.
- Automatically track and mask the object across all frames using SAMURAI.
- Filter similar frames based on visual similarity to reduce redundant frames.
- Output masked frames ready for reconstruction with the ODM module.
SAMURAI (Segment Anything Model for Universal and Robust AI tracking) is a state-of-the-art video object segmentation model. It extends the Segment Anything Model (SAM) with motion-aware memory capabilities, allowing it to:
- Track objects across video frames with zero-shot learning
- Handle occlusions and object deformations
- Maintain consistent segmentation throughout the video
SAMURAI is particularly effective for photogrammetry workflows where you need consistent object masking across many frames captured from different angles.
VideoMasking requires:
- GPU with CUDA support: SAMURAI benefits significantly from GPU acceleration
- PyTorch with CUDA: Installed automatically via Slicer's PyTorchUtils
- Sufficient disk space: For video conversion and frame extraction
Note: If you're using MorphoCloud On Demand, all prerequisites are already configured.
Before using VideoMasking, you need to set up the SAMURAI repository:
- Open the VideoMasking module in 3D Slicer.
- Expand the SAMURAI Setup collapsible section.
- Click Clone SAMURAI to download the SAMURAI repository.
- This clones the SlicerMorph SAMURAI fork into the module's Support directory.
- Wait for the setup to complete. The module will download model checkpoints automatically when needed.
- Select the appropriate checkpoint for your use case.
- Choose your device (CUDA for GPU acceleration, or CPU as fallback).
Before loading your video, choose your preferred image format for extracted frames:
- PNG: Lossless compression, larger files (note that using compressed PNG can make the workflow significantly slower)
- JPG: Smaller file sizes, slight quality loss
This setting applies to extracted frames, masks, and all saved outputs.
VideoMasking has specific frame limits to ensure memory-efficient processing:
- Maximum frames: 2000 frames total (~33 seconds at 60fps)
- Automatic chunking: Videos with more than 600 frames are automatically split into smaller chunks for memory safety
- All frames extracted: The module extracts every frame from your video (no frame interval selection)
If your video exceeds 2000 frames, you'll need to trim it before processing.
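The frame-limit and chunking rules above can be sketched as follows. This is a simplified illustration, not the module's actual code; the constants match the documented limits, but the real chunking logic may differ:

```python
MAX_FRAMES = 2000   # documented hard limit on total frames
CHUNK_SIZE = 600    # videos longer than this are split into chunks

def plan_extraction(frame_count: int) -> list[range]:
    """Return the frame ranges to process, one range per chunk."""
    if frame_count > MAX_FRAMES:
        raise ValueError(
            f"Video has {frame_count} frames; trim it to {MAX_FRAMES} or fewer."
        )
    # Split into chunks of at most CHUNK_SIZE frames for memory safety.
    return [
        range(start, min(start + CHUNK_SIZE, frame_count))
        for start in range(0, frame_count, CHUNK_SIZE)
    ]
```

For example, a 1500-frame video would be processed as three chunks of 600, 600, and 300 frames, while a 500-frame video fits in a single chunk.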
If your video is in MOV format (common from iPhone/camera recordings), conversion is handled automatically:
- Expand the Video Prep collapsible section.
- Select your input video file using the Video File selector.
- The module will automatically convert MOV to MP4 when you click Load Video.
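MOV-to-MP4 conversion of this kind is typically done with ffmpeg. A minimal sketch of building such a command follows; the exact flags the module uses are an assumption here, not taken from its source:

```python
import subprocess

def build_convert_command(mov_path: str, mp4_path: str) -> list[str]:
    """Build an ffmpeg command that converts a MOV file to MP4.

    '-c copy' remuxes the streams without re-encoding, which is fast and
    lossless when the codecs are already MP4-compatible.
    """
    return ["ffmpeg", "-i", mov_path, "-c", "copy", mp4_path]

cmd = build_convert_command("specimen.mov", "specimen.mp4")
# subprocess.run(cmd, check=True)  # uncomment to actually invoke ffmpeg
```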
Frame extraction happens automatically when you load a video:
- Select your video file.
- Set the Frames Directory where extracted frames will be saved.
- Click Load Video to begin preparation.
- The module will:
- Convert MOV to MP4 if needed
- Validate the frame count (must be ≤2000 frames)
- For videos >600 frames: Split into chunks and extract only the first frame initially (for ROI setup)
- For videos ≤600 frames: Extract all frames immediately
- Wait for extraction to complete.
Note: There is no frame interval or "every Nth frame" setting; the module automatically extracts every single frame from your video.
After video preparation:
- Expand the ROI & Tracking collapsible section.
- Click Load Frames to load the extracted frames into the viewer.
- The first frame will be displayed in the Red slice viewer.
- Click Select ROI on First Frame.
- Draw a bounding box around your object in the first frame:
- Click and drag to create a rectangle that encompasses the entire object.
- Make sure the box includes the complete object with a small margin. Try to reduce the amount of background in the ROI.
- Review your selection - this ROI will be used to initialize tracking.
- Once satisfied with the ROI, click Finalize ROI & Run Tracking.
- SAMURAI will process all frames:
- The model uses the ROI to identify the object in frame 1.
- It then tracks and segments the object through all subsequent frames.
- For chunked videos, each chunk is processed sequentially to manage memory.
- Progress is shown in the log panel.
- When complete, masks will be generated for all frames.
After tracking is complete, you can reduce the number of frames using similarity-based filtering. This is important because consecutive video frames are often very similar, and having too many similar frames can slow down or degrade photogrammetry reconstruction.
The filtering algorithm:
- Compares each masked frame to the previously kept frame
- Calculates visual dissimilarity based on the masked region only
- Keeps frames that are sufficiently different from the last kept frame
- Always keeps the first frame as a starting point
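In outline, the filtering loop can be sketched like this. It is a simplified pure-Python illustration: the module's actual dissimilarity metric is not documented here, so mean absolute pixel difference over the masked region stands in for it:

```python
def mean_abs_diff(a, b, mask):
    """Mean absolute pixel difference over the masked region, scaled to 0..1."""
    pixels = [(pa, pb) for pa, pb, m in zip(a, b, mask) if m]
    if not pixels:
        return 0.0
    return sum(abs(pa - pb) for pa, pb in pixels) / (255.0 * len(pixels))

def filter_key_frames(frames, masks, similarity_threshold=0.80):
    """Keep a frame only if it differs enough from the last kept frame."""
    kept = [0]  # the first frame is always kept
    for i in range(1, len(frames)):
        last = kept[-1]
        similarity = 1.0 - mean_abs_diff(frames[i], frames[last], masks[i])
        if similarity < similarity_threshold:  # different enough -> keep it
            kept.append(i)
    return kept
```

Here each frame is a flat list of 0-255 gray values and each mask is a matching boolean list. Note how the threshold behaves as described above: a higher threshold removes only near-identical frames, so more frames survive.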
- After tracking completes, locate the Key-Frame Filtering section.
- Adjust the Similarity Threshold slider:
- Higher values (e.g., 0.90): Keep more frames (frames must be very similar to be removed)
- Lower values (e.g., 0.70): Keep fewer frames (more aggressive filtering)
- Default: Start around 0.80-0.85
- Click Filter Key Frames.
- The module will report how many frames were kept (e.g., "Kept 85/300 frames").
Tip: For photogrammetry, you typically want 150-300 final frames with good coverage of all viewing angles. Start with the suggested default threshold and adjust if needed.
- Expand the Save section.
- Select an output folder using Browse.
- Click Save Outputs.
- The module saves:
- original/Set1/: Original (unmasked) frames
- masked/Set1/: Masked frames and binary mask files (with the _mask suffix)
If you ran key-frame filtering, only the filtered frames are saved. Otherwise, all frames are saved.
The module automatically embeds camera metadata (extracted from the video) into saved images, which helps photogrammetry software estimate camera parameters.
Once VideoMasking has generated your masked frames:
- Note the output directory containing your masked frames (the masked/ subfolder).
- Open the ODM module.
- Set the Masked Images Folder to the masked/ folder from VideoMasking output.
- Configure and run the reconstruction task.
- Use a turntable: Place your object on a rotating platform for consistent coverage.
- Steady camera: Use a tripod to minimize camera shake.
- Good lighting: Ensure even, diffuse lighting without harsh shadows.
- Plain background: A solid, contrasting background helps with segmentation.
- Slow rotation: 20-30 seconds for a full rotation at 60fps gives ~1200-1800 frames.
- Keep within limits: Videos must be ≤2000 frames (~33 seconds at 60fps).
- Include the entire object: The initial ROI should fully contain the object.
- Add margin: A small margin around the object helps with tracking.
- Avoid background clutter: If possible, position the object away from similar-colored backgrounds.
- Start with defaults: The 0.80 similarity threshold works well for most videos.
- Check coverage: After filtering, ensure you still have frames from all viewing angles.
- Re-filter if needed: You can adjust the threshold and re-run filtering.
- Video exceeds 2000 frames:
  - Trim your video to ≤2000 frames (~33 seconds at 60fps)
  - Use video editing software or ffmpeg to cut the video
- Tracking fails or loses the object:
  - Try a larger initial ROI
  - Ensure the object doesn't leave the frame during the video
  - Check that the object is clearly visible and contrasts with the background
- Out-of-memory errors:
  - The module automatically chunks long videos to prevent this
  - If errors persist, try a shorter video
- Slow processing:
  - Close other GPU-intensive applications
  - Ensure CUDA is being used (check device selection)
  - Processing time depends on video length and resolution
  - Consider using MorphoCloud for faster GPU access
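To trim an over-long video with ffmpeg, as suggested above, you can cut by frame count. The sketch below builds such a command; the flag choice is an illustration and not part of the module itself:

```python
def build_trim_command(src: str, dst: str, max_frames: int = 2000) -> list[str]:
    """Build an ffmpeg command keeping only the first max_frames video frames."""
    return ["ffmpeg", "-i", src, "-frames:v", str(max_frames), dst]

cmd = build_trim_command("long_video.mp4", "trimmed.mp4")
```

Running the resulting command (for example with Python's subprocess module) re-encodes the clip, which is slower than a stream copy but required when cutting at an exact frame.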