This repo is the official implementation of OSGNet (CVPR 2025). It is also the champion-solution repository for three egocentric video localization tracks of the Ego4D Episodic Memory Challenge at CVPR 2025.
This repo supports data pre-processing, training, and evaluation on the Ego4D-NLQ, Ego4D-GoalStep, and TACoS datasets.
- Follow INSTALL.sh to install the necessary dependencies and compile the code. A Torch version >= 1.8.0 is recommended.
- Required features: text features, video features, LaViLa captions (need to be unzipped), object features.
- Pretrained weights for finetuning (trained with NaQ): InternVideo, EgoVLP.
Pretrain-NaQ
Download features from this Baidu Netdisk link
- narration feature: narration_clip_token_features
- narration jsonl: format_unique_pretrain_data_v2.jsonl
The features are the same as those used for NLQ below.
- internvideo: em_egovlp+internvideo_visual_features_1.87fps
4 GPUs, total batch size 16.
- configs/Ego4D-NLQ/v2/ego4d_nlq_v2_multitask_pretrain_2e-4.yaml
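Under DDP the total batch is split across the cards, so our reading of the settings above (an assumption, since the README does not state it explicitly) is that the 4-card, batch-16 pretraining run sees 4 samples per GPU. A minimal sketch:

```python
# Minimal sketch: per-GPU batch size under DDP, assuming the "total batch"
# listed in this README is divided evenly across the listed number of cards.
def per_gpu_batch(total_batch: int, num_gpus: int) -> int:
    if total_batch % num_gpus != 0:
        raise ValueError("total batch must be divisible by the number of GPUs")
    return total_batch // num_gpus

# Pretraining setting quoted above: 4 cards, total batch 16.
print(per_gpu_batch(16, 4))  # -> 4
```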
- Video Feature & Text Feature: GroundNLQ leverages the extracted egocentric InternVideo and EgoVLP video features and CLIP textual token features; please refer to GroundNLQ.
- Download the LaViLa captions, object features (anno, classname), video features (EgoVLP), and text features (NLQ v1 features).
- NLQ v1 feature: nlq_v1_clip_token_features
- NLQ v2 feature: nlq_v2_clip_token_features
- egovideo: egovideo_token_lmdb
- egovlp: egovlp_lmdb
- internvideo: em_egovlp+internvideo_visual_features_1.87fps
- egovideo: egovideo_all_lmdb
- lavila.zip
- anno: co-detr/class-score0.6-minnum10-lmdb
- classname: classname-clip-base/a_photo_of.pt
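Before launching training, it can help to confirm the downloaded assets are unpacked where the config expects them. A hypothetical pre-flight check (the feature root and relative paths below are assumptions based on the download list above, not the repo's actual layout; adjust them to match your config):

```python
import os

# Hypothetical relative paths, taken from the download list above.
REQUIRED_FEATURES = [
    "nlq_v2_clip_token_features",
    "egovlp_lmdb",
    "em_egovlp+internvideo_visual_features_1.87fps",
    "lavila",  # lavila.zip must be unzipped first
    "co-detr/class-score0.6-minnum10-lmdb",
    "classname-clip-base/a_photo_of.pt",
]

def missing_features(feature_root: str) -> list[str]:
    """Return the required entries that are absent under feature_root."""
    return [p for p in REQUIRED_FEATURES
            if not os.path.exists(os.path.join(feature_root, p))]
```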
2 GPUs, total batch size 8.
- InternVideo:
- v1: ego4d_nlq_v1_multitask_egovlp_256_finetune_2e-4.yaml
- v2: ego4d_nlq_v2_multitask_finetune_2e-4.yaml
- EgoVideo:
- v2: ego4d_nlq_v2_egovideo_finetune_4e-4.yaml
| Feature | NLQ v1 Checkpoint | NLQ v2 Checkpoint |
|---|---|---|
| InternVideo | 173 | 144 |
| EgoVideo | - | 228 |
- Download the text features, video features (clipped and unclipped), LaViLa captions, and object features (clipped and unclipped).
- clip_query_lmdb
- internvideo: internvideo_clip_lmdb (we truncated the training-set videos due to memory limitations), internvideo_lmdb
- lavila.zip
- anno: co-detr/clip-class-lmdb (after clipping)
- classname: classname-clip-base/a_photo_of.pt (the same as Ego4D-NLQ)
4 GPUs, total batch size 4.
- finetune: ego4d_goalstep_v2_baseline_2e-4.yaml
| GoalStep Checkpoint |
|---|
| 135 |
- Download features from this Baidu Netdisk link.
- clip: all_clip_token_features
- glove: glove_clip_token_features
- c3d: c3d_lmdb
- internvideo: internvideo_lmdb
- lavila.zip
- anno: co-detr/class-score0.6-minnum10-lmdb
- classname: classname-clip-base/a_photo_of.pt (the same as Ego4D-NLQ)
4 GPUs, total batch size 8.
- finetune: tacos_baseline_1e-4.yaml
- scratch: tacos_c3d_glove_weight1_5e-5.yaml
| Setting | Checkpoint |
|---|---|
| Scratch | 150 |
| Finetuned | 131 |
- ./libs/core: Default parameter configuration module.
- ./configs: Configuration files.
- ./ego4d_data: Annotation data.
- ./tools: Scripts for running.
- ./libs/datasets: Data loader and IO module.
- ./libs/modeling: Our main model with all its building blocks.
- ./libs/utils: Utility functions for training, inference, and postprocessing.
We adopt distributed data parallel (DDP) and fault-tolerant distributed training with torchrun.
Training and pretraining can be launched by running the following command:
```shell
bash tools/train.sh CONFIG_FILE False OUTPUT_PATH CUDA_DEVICE_ID MODE
```
where CONFIG_FILE is the config file for model/dataset hyperparameter initialization,
OUTPUT_PATH is the model output directory name you define, and CUDA_DEVICE_ID is the CUDA device id.
Checkpoints and other experiment log files are written to <output_folder>/OUTPUT_PATH, where output_folder is defined in the config file.
Take TACoS as an example:
```shell
bash tools/train.sh /home/feng_yi_sen/OSGNet/configs/tacos/tacos_c3d_glove_weight1_5e-5.yaml False objectmambafinetune219 0,1,2,3 train
```
Finetuning can be launched by running the following command:
```shell
bash tools/train.sh CONFIG_FILE RESUME_PATH OUTPUT_PATH CUDA_DEVICE_ID MODE
```
where RESUME_PATH is the path to the pretrained model weights. The config file is the same as for training from scratch.
Take Ego4D-NLQ v2 as an example:
```shell
bash tools/train.sh configs/Ego4D-NLQ/v2/ego4d_nlq_v2_multitask_finetune_2e-4.yaml /root/autodl-tmp/model/GroundNLQ/ckpt/save/model_7_pretrain.pth.tar objectmambafinetune219 0,1 train
```
For GoalStep, the mode should be `not-eval-loss`.
```shell
bash tools/train.sh configs/goalstep/ego4d_goalstep_v2_baseline_2e-4.yaml /root/autodl-tmp/model/GroundNLQ/ckpt/save/model_7_pretrain.pth.tar objectmambafinetune219 0,1 not-eval-loss
```
Once the model is trained, you can use the following commands for inference:
```shell
python eval_nlq.py CONFIG_FILE CHECKPOINT_PATH -gpu CUDA_DEVICE_ID
```
where CHECKPOINT_PATH is the path to the saved checkpoint, and the save option controls the output.
Take Ego4D-NLQ v2 as an example:
```shell
python eval_nlq.py configs/Ego4D-NLQ/v2/ego4d_nlq_v2_multitask_finetune_2e-4.yaml /root/autodl-tmp/model/GroundNLQ/ckpt/ego4d_nlq_v2_multitask_finetune_2e-4_objectmambafinetune144/model_2_26.834358523725836.pth.tar -gpu 1
```
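The checkpoint filename in the example appears to encode the epoch and a validation score (model_&lt;epoch&gt;_&lt;score&gt;.pth.tar). A small hypothetical helper for picking the best-scoring checkpoint out of a save directory, assuming that naming pattern holds (the pattern is inferred from the example above, not documented by the repo):

```python
import re

# Assumed pattern, inferred from the example checkpoint name above:
# model_<epoch>_<score>.pth.tar
_CKPT_RE = re.compile(r"model_(\d+)_(\d+\.\d+)\.pth\.tar$")

def parse_ckpt(name: str):
    """Return (epoch, score) parsed from a checkpoint filename, or None."""
    m = _CKPT_RE.search(name)
    if m is None:
        return None
    return int(m.group(1)), float(m.group(2))

def best_ckpt(names):
    """Pick the filename with the highest score among parseable names."""
    scored = [(p[1], n) for p, n in ((parse_ckpt(n), n) for n in names)
              if p is not None]
    return max(scored)[1] if scored else None
```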
If you use our code, please consider citing our papers:
```bibtex
@inproceedings{feng2025object,
  title={Object-shot enhanced grounding network for egocentric video},
  author={Feng, Yisen and Zhang, Haoyu and Liu, Meng and Guan, Weili and Nie, Liqiang},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={24190--24200},
  year={2025}
}

@article{feng2025osgnet,
  title={OSGNet @ Ego4D Episodic Memory Challenge 2025},
  author={Feng, Yisen and Zhang, Haoyu and Chu, Qiaohui and Liu, Meng and Guan, Weili and Wang, Yaowei and Nie, Liqiang},
  journal={arXiv preprint arXiv:2506.03710},
  year={2025}
}
```
This code is inspired by GroundNLQ, and we use the same video and text features as GroundNLQ. We thank the authors for their awesome open-source contributions.