This folder provides resources for evaluating action label predictions on videos from the Fine-grained Breakfast dataset. It includes ground-truth annotations and an evaluation script.
This dataset is provided as supplementary material for the paper:

Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi, Jun Takamatsu, Katsushi Ikeuchi (2025). "Open-vocabulary action localization with iterative visual prompting." IEEE Access, 13, 56908-56917.

```bibtex
@article{wake2025open,
  author={Wake, Naoki and Kanehira, Atsushi and Sasabuchi, Kazuhiro and Takamatsu, Jun and Ikeuchi, Katsushi},
  journal={IEEE Access},
  title={Open-vocabulary action localization with iterative visual prompting},
  year={2025},
  volume={13},
  pages={56908--56917},
  doi={10.1109/ACCESS.2025.3555167}
}
```
The original data is derived from the dataset described below. We have manually annotated a subset of these videos:

Artur Saudabayev, Zhanibek Rysbek, Raykhan Khassenova, Huseyin Atakan Varol (2018). "Human grasping database for activities of daily living with depth, color and kinematic data streams." Scientific Data, 5(1), 1-13.

```bibtex
@article{saudabayev2018human,
  title={Human grasping database for activities of daily living with depth, color and kinematic data streams},
  author={Saudabayev, Artur and Rysbek, Zhanibek and Khassenova, Raykhan and Varol, Huseyin Atakan},
  journal={Scientific Data},
  volume={5},
  number={1},
  pages={1--13},
  year={2018},
  publisher={Nature Publishing Group}
}
```
- **original_videos**
  Download the original videos from "Human grasping database for activities of daily living with depth, color and kinematic data streams" (Saudabayev et al., 2018) and place them in this folder.
- **label_data_gt_right.json**
  This JSON file holds the ground-truth annotations for the videos. Each entry contains:
  - `action`: a sequence of action labels that occur in the video.
    Example: `["Grasp with the right hand", "Picking with the right hand", ...]`
  - `gt_time`: the frame index annotations corresponding to each action label (FPS=30.0).
    Example: `[[0, 23], [24, 48], ...]`
  - `video_path`: the relative path to the corresponding video file.
    Example: `"original_videos/subject_9_gopro_seg_1_2324-2575.mp4"`

  Note: This file name is constructed from the original video name with the frame range appended. Since this repository does not provide the original videos, you need to download the original dataset, extract the clips corresponding to the specified frame numbers, and place them in the `original_videos` folder. We provide the script `clip_original_videos.py` to extract these clips. The list of original video files is provided in `original_videos/original_videos.txt`.
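The annotation fields above can be illustrated with a short sketch. The entry below is assembled from the examples given in this README; note that the top-level layout of `label_data_gt_right.json` (e.g., a list of such entries versus a dict keyed by video) is an assumption here.

```python
# Hypothetical example entry mirroring the fields described above.
entry = {
    "action": ["Grasp with the right hand", "Picking with the right hand"],
    "gt_time": [[0, 23], [24, 48]],
    "video_path": "original_videos/subject_9_gopro_seg_1_2324-2575.mp4",
}

FPS = 30.0  # frame rate stated in the annotations

# Pair each action label with its frame range and convert frames to seconds.
for label, (start, end) in zip(entry["action"], entry["gt_time"]):
    print(f"{label}: frames {start}-{end} "
          f"({start / FPS:.2f}s - {end / FPS:.2f}s)")
```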
- **label_data_estimate_baseline.json**
  An example file containing estimated action labels; it serves as input to the evaluation script.
- **compute_mof_iou_f1.py**
  This evaluation script computes performance metrics (e.g., MoF, IoU, and F1 score) by comparing predicted action labels with the ground truth:

  ```shell
  python compute_mof_iou_f1.py --file label_data_estimate_baseline.json
  ```
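For reference, here is a minimal frame-wise sketch of the three metrics on toy label sequences. The definitions in `compute_mof_iou_f1.py` may differ (for instance, F1 is often computed segmentally with an overlap threshold rather than per frame).

```python
# Frame-wise sketch of MoF, mean IoU, and mean F1; this is an
# illustrative assumption, not the repository's exact implementation.
def frame_metrics(gt, pred):
    """gt, pred: equal-length lists of per-frame action labels."""
    assert len(gt) == len(pred)
    # MoF: fraction of frames whose predicted label matches the ground truth.
    mof = sum(g == p for g, p in zip(gt, pred)) / len(gt)

    labels = set(gt) | set(pred)
    ious, f1s = [], []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gt, pred))
        fp = sum(g != c and p == c for g, p in zip(gt, pred))
        fn = sum(g == c and p != c for g, p in zip(gt, pred))
        union = tp + fp + fn
        ious.append(tp / union if union else 0.0)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    # IoU and F1 are averaged over the classes present in either sequence.
    return mof, sum(ious) / len(ious), sum(f1s) / len(f1s)

gt   = ["grasp", "grasp", "pick", "pick"]
pred = ["grasp", "pick",  "pick", "pick"]
mof, iou, f1 = frame_metrics(gt, pred)  # mof = 0.75
```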
- **clip_original_videos.py**
  This script extracts video clips from the original videos based on the frame indices specified in `label_data_gt_right.json`. Running it generates the video dataset with the filenames indicated in the JSON annotations.
- **Place the Video Files**
  - Download the original videos from the Fine-grained Breakfast dataset.
  - Place the downloaded video files in the `original_videos` folder. Refer to `original_videos/original_videos.txt` for the list of required files.
- **Generate the Video Dataset**
  After placing the original videos in the `original_videos` folder, run the `clip_original_videos.py` script to extract the annotated clips. It uses the frame index annotations provided in `label_data_gt_right.json` to cut the clips from the original videos and save them under the specified naming convention. Note that this script leverages ffmpeg. Run:

  ```shell
  python clip_original_videos.py
  ```
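Conceptually, clip extraction maps the frame range encoded in each clip filename (e.g., `..._2324-2575.mp4`) to seconds at FPS=30 and hands the result to ffmpeg. The sketch below builds one such command; the exact ffmpeg options and path handling in `clip_original_videos.py` may differ.

```python
# Illustrative sketch of building an ffmpeg clipping command from the
# frame range embedded in a clip filename; not the repository's script.
import os

FPS = 30.0

def ffmpeg_command(clip_path):
    stem = os.path.splitext(os.path.basename(clip_path))[0]
    base, frames = stem.rsplit("_", 1)   # e.g. "subject_9_gopro_seg_1", "2324-2575"
    start_f, end_f = (int(x) for x in frames.split("-"))
    start = start_f / FPS                      # seek position in seconds
    duration = (end_f - start_f + 1) / FPS     # clip length in seconds
    return [
        "ffmpeg", "-ss", f"{start:.3f}", "-i", f"{base}.mp4",
        "-t", f"{duration:.3f}", clip_path,
    ]

cmd = ffmpeg_command("original_videos/subject_9_gopro_seg_1_2324-2575.mp4")
```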