EgoCross

A comprehensive benchmark across Surgery, Industry, Extreme Sports, and Animal Perspective. EgoCross comprises 798 clips and 957 QA pairs, supporting both CloseQA and OpenQA formats for fine-grained evaluation.

Cross-Domain Egocentric Video QA

About the Dataset

EgoCross is a cross-domain benchmark designed to evaluate how well multimodal large language models (MLLMs) generalize to egocentric video question answering (VQA). Unlike prior datasets centered on daily-life activities, EgoCross focuses on diverse and challenging domains (surgery, industrial assembly, extreme sports, and animal perspective) to assess model robustness under varying visual and semantic conditions.

The benchmark covers 15 sub-tasks grouped into four capability families: Identification, Localization, Prediction, and Counting. Each video clip is paired with multiple close-ended and open-ended questions that require fine-grained temporal and spatial understanding as well as reasoning.

In total, EgoCross contains 798 video clips and 957 QA pairs, curated through a semi-automatic pipeline that combines LLM-based question generation with human verification. It provides a unified platform for measuring cross-domain generalization and highlights the gap between everyday understanding and complex real-world egocentric perception.
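
As a rough illustration of how cross-domain results on the CloseQA format might be summarized, the sketch below computes accuracy per domain and a macro-average over domains. The record layout (the "domain", "answer", and "prediction" fields) and the toy records are assumptions made for this example, not the released EgoCross schema or real results.

```python
from collections import defaultdict


def per_domain_accuracy(records):
    """Return {domain: fraction of close-ended records answered correctly}.

    Each record is assumed (hypothetically) to be a dict with the keys
    "domain", "answer", and "prediction", where the last two hold an
    option letter such as "A".
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["domain"]] += 1
        # Compare option letters, ignoring case and surrounding whitespace.
        if rec["prediction"].strip().upper() == rec["answer"].strip().upper():
            correct[rec["domain"]] += 1
    return {domain: correct[domain] / total[domain] for domain in total}


# Made-up toy records for illustration only; not benchmark data.
toy_records = [
    {"domain": "Surgery", "answer": "A", "prediction": "A"},
    {"domain": "Surgery", "answer": "C", "prediction": "B"},
    {"domain": "Industry", "answer": "D", "prediction": "d "},
    {"domain": "Extreme Sports", "answer": "B", "prediction": "B"},
]

scores = per_domain_accuracy(toy_records)
# Macro-average: every domain counts equally, regardless of its size.
macro_avg = sum(scores.values()) / len(scores)
```

Reporting the macro-average next to the per-domain scores keeps a domain with many clips from masking weak performance on a smaller one. OpenQA answers would need a separate scoring scheme, which this sketch does not cover.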