This repo is the official implementation of OSGNet (CVPR 2025). It is also the champion-solution repository for three egocentric video localization tracks of the Ego4D Episodic Memory Challenge at CVPR 2025.
This repo supports data pre-processing, training, and evaluation on the Ego4D-NLQ, Ego4D-GoalStep, and TACoS datasets.
- Follow INSTALL.sh to install the necessary dependencies and compile the code. A Torch version >= 1.8.0 is recommended.
- Required features: text features, video features, LaViLa captions (need to be unzipped), object features.
- Pretrained weights for finetuning (trained with NaQ): InternVideo, EgoVLP.
Pretrain-NaQ
Download features from this Baidu Netdisk link
- narration feature: narration_clip_token_features
- narration jsonl: format_unique_pretrain_data_v2.jsonl
The features are the same as those used for NLQ below.
- internvideo: em_egovlp+internvideo_visual_features_1.87fps
4 GPUs, total batch size 16.
- configs/Ego4D-NLQ/v2/ego4d_nlq_v2_multitask_pretrain_2e-4.yaml
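Under DDP the total batch is split across the cards, so our reading of the settings above (an assumption, since the README does not state it explicitly) is that the 4-card, batch-16 pretraining run sees 4 samples per GPU. A minimal sketch:

```python
# Minimal sketch: per-GPU batch size under DDP, assuming the "total batch"
# listed in this README is divided evenly across the listed number of cards.
def per_gpu_batch(total_batch: int, num_gpus: int) -> int:
    if total_batch % num_gpus != 0:
        raise ValueError("total batch must be divisible by the number of GPUs")
    return total_batch // num_gpus

# Pretraining setting quoted above: 4 cards, total batch 16.
print(per_gpu_batch(16, 4))  # -> 4
```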
- Video Feature & Text Feature: GroundNLQ leverages the extracted egocentric InternVideo and EgoVLP video features and CLIP textual token features; please refer to GroundNLQ.
- Download the LaViLa captions, object features (anno, classname), video features (EgoVLP), and text features (NLQ v1 features).
- NLQ v1 feature: nlq_v1_clip_token_features
- NLQ v2 feature: nlq_v2_clip_token_features
- egovideo: egovideo_token_lmdb
- egovlp: egovlp_lmdb
- internvideo: em_egovlp+internvideo_visual_features_1.87fps
- egovideo: egovideo_all_lmdb
- lavila.zip
- anno: co-detr/class-score0.6-minnum10-lmdb
- classname: classname-clip-base/a_photo_of.pt
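Before launching training, it can help to confirm the downloaded assets are unpacked where the config expects them. A hypothetical pre-flight check (the feature root and relative paths below are assumptions based on the download list above, not the repo's actual layout; adjust them to match your config):

```python
import os

# Hypothetical relative paths, taken from the download list above.
REQUIRED_FEATURES = [
    "nlq_v2_clip_token_features",
    "egovlp_lmdb",
    "em_egovlp+internvideo_visual_features_1.87fps",
    "lavila",  # lavila.zip must be unzipped first
    "co-detr/class-score0.6-minnum10-lmdb",
    "classname-clip-base/a_photo_of.pt",
]

def missing_features(feature_root: str) -> list[str]:
    """Return the required entries that are absent under feature_root."""
    return [p for p in REQUIRED_FEATURES
            if not os.path.exists(os.path.join(feature_root, p))]
```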
2 GPUs, total batch size 8.
- InternVideo:
- v1: ego4d_nlq_v1_multitask_egovlp_256_finetune_2e-4.yaml
- v2: ego4d_nlq_v2_multitask_finetune_2e-4.yaml
- EgoVideo:
- v2: ego4d_nlq_v2_egovideo_finetune_4e-4.yaml
| Feature | NLQ v1 Checkpoint | NLQ v2 Checkpoint |
|---|---|---|
| InternVideo | 173 | 144 |
| EgoVideo | - | 228 |
- Download the text features, video features (clipped and unclipped), LaViLa captions, and object features (clipped and unclipped).
- clip_query_lmdb
- internvideo: internvideo_clip_lmdb (we truncated the training-set videos due to memory limitations), internvideo_lmdb
- lavila.zip
- anno: co-detr/clip-class-lmdb (after clipping)
- classname: classname-clip-base/a_photo_of.pt (the same as Ego4D-NLQ)
4 GPUs, total batch size 4.
- finetune: ego4d_goalstep_v2_baseline_2e-4.yaml
| GoalStep Checkpoint |
|---|
| 135 |
- Download features from this Baidu Netdisk link.
- clip: all_clip_token_features
- glove: glove_clip_token_features
- c3d: c3d_lmdb
- internvideo: internvideo_lmdb
- lavila.zip
- anno: co-detr/class-score0.6-minnum10-lmdb
- classname: classname-clip-base/a_photo_of.pt (the same as Ego4D-NLQ)
4 GPUs, total batch size 8.
- finetune: tacos_baseline_1e-4.yaml
- scratch: tacos_c3d_glove_weight1_5e-5.yaml
| Setting | Checkpoint |
|---|---|
| Scratch | 150 |
| Finetuned | 131 |
- ./libs/core: Default parameter configuration module.
- ./configs: Configuration files.
- ./ego4d_data: Annotation data.
- ./tools: Scripts for running.
- ./libs/datasets: Data loader and IO module.
- ./libs/modeling: Our main model with all its building blocks.
- ./libs/utils: Utility functions for training, inference, and postprocessing.
We adopt distributed data parallel (DDP) and fault-tolerant distributed training with torchrun.
Training and pretraining can be launched by running the following command:
```shell
bash tools/train.sh CONFIG_FILE False OUTPUT_PATH CUDA_DEVICE_ID MODE
```
where CONFIG_FILE is the config file for model/dataset hyperparameter initialization,
OUTPUT_PATH is the model output directory name you define, and CUDA_DEVICE_ID is the CUDA device id.
Checkpoints and other experiment log files are written to <output_folder>/OUTPUT_PATH, where output_folder is defined in the config file.
Take TACoS as an example:
```shell
bash tools/train.sh /home/feng_yi_sen/OSGNet/configs/tacos/tacos_c3d_glove_weight1_5e-5.yaml False objectmambafinetune219 0,1,2,3 train
```
Finetuning can be launched by running the following command:
```shell
bash tools/train.sh CONFIG_FILE RESUME_PATH OUTPUT_PATH CUDA_DEVICE_ID MODE
```
where RESUME_PATH is the path to the pretrained model weights. The config file is the same as for training from scratch.
Take Ego4D-NLQ v2 as an example:
```shell
bash tools/train.sh configs/Ego4D-NLQ/v2/ego4d_nlq_v2_multitask_finetune_2e-4.yaml /root/autodl-tmp/model/GroundNLQ/ckpt/save/model_7_pretrain.pth.tar objectmambafinetune219 0,1 train
```
For GoalStep, the mode should be `not-eval-loss`.
```shell
bash tools/train.sh configs/goalstep/ego4d_goalstep_v2_baseline_2e-4.yaml /root/autodl-tmp/model/GroundNLQ/ckpt/save/model_7_pretrain.pth.tar objectmambafinetune219 0,1 not-eval-loss
```
Once the model is trained, you can use the following commands for inference:
```shell
python eval_nlq.py CONFIG_FILE CHECKPOINT_PATH -gpu CUDA_DEVICE_ID
```
where CHECKPOINT_PATH is the path to the saved checkpoint, and the save option controls the output.
Take Ego4D-NLQ v2 as an example:
```shell
python eval_nlq.py configs/Ego4D-NLQ/v2/ego4d_nlq_v2_multitask_finetune_2e-4.yaml /root/autodl-tmp/model/GroundNLQ/ckpt/ego4d_nlq_v2_multitask_finetune_2e-4_objectmambafinetune144/model_2_26.834358523725836.pth.tar -gpu 1
```
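The checkpoint filename in the example appears to encode the epoch and a validation score (model_&lt;epoch&gt;_&lt;score&gt;.pth.tar). A small hypothetical helper for picking the best-scoring checkpoint out of a save directory, assuming that naming pattern holds (the pattern is inferred from the example above, not documented by the repo):

```python
import re

# Assumed pattern, inferred from the example checkpoint name above:
# model_<epoch>_<score>.pth.tar
_CKPT_RE = re.compile(r"model_(\d+)_(\d+\.\d+)\.pth\.tar$")

def parse_ckpt(name: str):
    """Return (epoch, score) parsed from a checkpoint filename, or None."""
    m = _CKPT_RE.search(name)
    if m is None:
        return None
    return int(m.group(1)), float(m.group(2))

def best_ckpt(names):
    """Pick the filename with the highest score among parseable names."""
    scored = [(p[1], n) for p, n in ((parse_ckpt(n), n) for n in names)
              if p is not None]
    return max(scored)[1] if scored else None
```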
If you use our code, please consider citing our papers:
```bibtex
@inproceedings{feng2025object,
  title={Object-shot enhanced grounding network for egocentric video},
  author={Feng, Yisen and Zhang, Haoyu and Liu, Meng and Guan, Weili and Nie, Liqiang},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={24190--24200},
  year={2025}
}

@article{feng2025osgnet,
  title={OSGNet @ Ego4D Episodic Memory Challenge 2025},
  author={Feng, Yisen and Zhang, Haoyu and Chu, Qiaohui and Liu, Meng and Guan, Weili and Wang, Yaowei and Nie, Liqiang},
  journal={arXiv preprint arXiv:2506.03710},
  year={2025}
}
```
This code is inspired by GroundNLQ, and we use the same video and text features as GroundNLQ. We thank the authors for their awesome open-source contributions.