News | Overview | TODO | Acknowledgement | Citation | License
- [2026] EgoNight is accepted by ICLR 2026.
- [2025/10] Our paper "EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark" is available on arXiv.
- [2025/10] Paper and supplementary materials available on OpenReview.
The benchmark assesses vision-language models on egocentric video question answering across diverse scenarios. It supports both night (default) and day imagery, with a curated subset of question types for paired day/night comparison. Open-source VLMs (GLM-4V, Qwen2.5-VL, InternVL, LLaVA-NeXT-Video, etc.) are evaluated using the LLaMA Factory API server.
- Object Recognition
- Spatial Reasoning
- Scene Sequence
- Non Common
- Counting
- Navigation
- Text Recognition
- Action
- VQA evaluation — Full pipeline for EgoNight-VQA (GPT, Gemini, Qwen, scoring, summarization).
- LLaMA Factory — Open-source VLM evaluation via API server (GLM-4V, Qwen2.5-VL, InternVL, LLaVA-NeXT-Video).
- Depth evaluation — Evaluation pipeline for egocentric depth estimation at night (auxiliary task from the paper).
- Retrieval evaluation — Evaluation pipeline for day–night correspondence retrieval (auxiliary task from the paper).
- LMMs-Eval — Added EgoNight export pipeline and drop-in task scaffold under exports/lmms_eval_task/egonight/.
- VLMEvalKit — Added EgoNight export pipeline and custom dataset scaffold under exports/vlmevalkit/egonight_dataset.py.
EgoNight/
├── exports/
│ ├── build_egonight_exports.py # Build JSONL/TSV exports for external eval toolkits
│ ├── README.md # Integration steps for LMMs-Eval and VLMEvalKit
│ ├── generated/ # Generated files: egonight_lmms_eval.jsonl, EgoNight.tsv, stats
│ ├── lmms_eval_task/
│ │ └── egonight/
│ │ ├── egonight.yaml # Drop-in lmms-eval task config
│ │ └── utils.py # Prompt/visual mapping and metric aggregation
│ └── vlmevalkit/
│ └── egonight_dataset.py # Drop-in VLMEvalKit dataset class
├── evaluation/
│ ├── evaluate_gemini.py # Gemini 2.5 Pro inference
│ ├── evaluate_gpt.py # GPT-4.1 inference
│ ├── evaluate_qwen7b.py # Qwen 2.5 VL 7B inference (LLaMA Factory API)
│ ├── score_gpt.py # GPT-4o as judge scoring (correct/incorrect, 0–5)
│ ├── summarize_accuracy.py # Per-dataset and overall accuracy summary
│ ├── evaluate_all.sh # Batch evaluation over subfolders
│ ├── api_keys.py # API key loading
│ ├── keys.env.example # Template for API keys
│ └── keys.env # Your keys (create from example, gitignored)
├── README.md
└── LICENSE
Each evaluation sample is a subfolder with:
<subfolder>/
├── qa_result/
│ ├── all_qa_filtered.json # Question-answer annotations
│ ├── *_results*.json # Model outputs (gpt, gemini, qwen7b)
│ └── *_scores*.json # Score outputs (created by score_gpt.py)
└── extracted_frames/
├── Night/ # Night images (jpg/png)
└── Day/ # Day images (optional)
- EgoNight-Sofia and EgoNight-Oxford: frames sampled at 1 fps
- EgoNight-Synthetic: frames sampled at 2 fps
Evaluators infer the dataset from the path and use the correct sampling rate in the prompt.
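This path-based convention can be sketched as follows (the helper name `fps_for_sample` and the example paths are illustrative, not part of the repo):

```python
from pathlib import Path

def fps_for_sample(sample_dir: str) -> int:
    """Infer the frame sampling rate from the dataset a sample belongs to.

    EgoNight-Sofia and EgoNight-Oxford frames are sampled at 1 fps;
    EgoNight-Synthetic frames are sampled at 2 fps.
    """
    name = Path(sample_dir).as_posix().lower()
    return 2 if "synthetic" in name else 1

print(fps_for_sample("/data/EgoNight/Synthetic_server/scene_01"))  # 2
print(fps_for_sample("/data/EgoNight/Sofia_server/scene_03"))      # 1
```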
Each entry in `all_qa_filtered.json` is an object with the following fields:

| Field | Description |
|---|---|
| `question` | The question text |
| `question_type` | One of the question types above |
| `answer` | Ground-truth answer |
| `start_frame` | First frame index (0-based) |
| `end_frame` | Last frame index (inclusive) |
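A hypothetical entry following this schema (the question, answer, and frame values below are made up for illustration):

```python
import json

# Illustrative annotation entry matching the documented schema;
# the values are invented, not taken from the dataset.
entry = {
    "question": "What object does the wearer pick up from the table?",
    "question_type": "Object Recognition",
    "answer": "a water bottle",
    "start_frame": 0,
    "end_frame": 14,  # inclusive
}

print(json.dumps(entry, indent=2))
```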
```bash
pip install openai google-generativeai tqdm pyyaml numpy pillow requests
```

- Copy the example keys file:

  ```bash
  cp evaluation/keys.env.example evaluation/keys.env
  ```

- Edit `evaluation/keys.env` with your keys:

  ```
  OPENAI_API_KEY=sk-your-openai-key
  GEMINI_API_KEY=your-gemini-api-key
  ```

  Alternatively, set `OPENAI_API_KEY` and `GEMINI_API_KEY` as environment variables.
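A minimal sketch of how such a `KEY=value` file can be parsed, with environment variables taking precedence (this is an illustration, not the actual `api_keys.py` implementation):

```python
import os

def load_keys_env(path: str = "evaluation/keys.env") -> dict:
    """Parse simple KEY=value lines, skipping blanks and comments.

    Environment variables override values from the file.
    """
    keys = {}
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith("#") and "=" in line:
                    k, v = line.split("=", 1)
                    keys[k.strip()] = v.strip()
    for name in ("OPENAI_API_KEY", "GEMINI_API_KEY"):
        if os.environ.get(name):
            keys[name] = os.environ[name]
    return keys
```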
Part of the open-source VLM evaluation relies on the LLaMA Factory API server; the remaining models are evaluated with their official repositories. Start the API server with the desired model before running the corresponding evaluator.
Example configs for supported VLMs (based on LLaMA Factory examples/inference/):
| Model | Config | Hugging Face Model |
|---|---|---|
| GLM-4V | `glm4v.yaml` | `zai-org/GLM-4.1V-9B-Base` |
| Qwen2.5-VL-7B | `qwen2_5vl_7B.yaml` | `Qwen/Qwen2.5-VL-7B-Instruct` |
| InternVL3 | `intern_vl.yaml` | `OpenGVLab/InternVL3-8B-hf` |
| LLaVA-NeXT-Video | `llava_video.yaml` | `llava-hf/LLaVA-NeXT-Video-7B-32K-hf` |
Example configs (save each as a YAML file and run `llamafactory-cli api <config.yaml>`):

```yaml
# glm4v.yaml
model_name_or_path: zai-org/GLM-4.1V-9B-Base
template: glm4v
infer_backend: huggingface
trust_remote_code: true
```

```yaml
# qwen2_5vl_7B.yaml
model_name_or_path: Qwen/Qwen2.5-VL-7B-Instruct
template: qwen2_vl
infer_backend: huggingface  # choices: [huggingface, vllm, sglang]
trust_remote_code: true
```

```yaml
# intern_vl.yaml
model_name_or_path: OpenGVLab/InternVL3-8B-hf
template: intern_vl
infer_backend: huggingface
trust_remote_code: true
```

```yaml
# llava_video.yaml
model_name_or_path: llava-hf/LLaVA-NeXT-Video-7B-32K-hf
template: llava_next_video
infer_backend: huggingface
trust_remote_code: true
```

Start the API server (default port 8000):

```bash
API_PORT=8000 llamafactory-cli api examples/inference/qwen2_5vl_7B.yaml
```

```bash
# GPT-4.1 (night images)
python evaluation/evaluate_gpt.py --dir_path /path/to/sample_folder

# GPT-4.1 (day images)
python evaluation/evaluate_gpt.py --dir_path /path/to/sample_folder --use_day True

# Gemini 2.5 Pro
python evaluation/evaluate_gemini.py --dir_path /path/to/sample_folder

# Qwen 7B (requires the local LLaMA Factory server on port 8004)
python evaluation/evaluate_qwen7b.py --dir_path /path/to/sample_folder
```

Provide a parent directory containing one subfolder per sample:
```bash
bash evaluation/evaluate_all.sh /path/to/parent_directory
```

This runs the active evaluators (GPT, Gemini, Qwen7b) in parallel per sample, then scores results with GPT-4o. For sofia_oxford, scores are written to each subfolder's `score/` directory.
`score_gpt.py` takes prediction JSONs (filenames containing `result` and ending in `.json`), compares them to ground truth via GPT-4o, and writes scored files (`results` → `scores` in the filename):

```bash
python evaluation/score_gpt.py --dir_path /path/to/results_directory
```

`summarize_accuracy.py` computes per-dataset (Sofia, Oxford, Synthetic) and overall accuracy by QA type, and also breaks accuracy down by difficulty level (easy/medium/hard) using built-in scene difficulty metadata; difficulty results are reported per dataset and overall. It reads `*_scores_*.json` from each subfolder and filters by `all_qa_filtered.json` (Sofia/Oxford) or scores all entries (Synthetic). The difficulty summary is included in both the console and JSON output.
```bash
python evaluation/summarize_accuracy.py \
    --sofia_path /path/to/Sofia_server \
    --oxford_path /path/to/Oxford_server \
    --synthetic_path /path/to/Synthetic_server \
    [--model gpt] [--split night] \
    [--output results/summary.json]
```

Example output:

```
=== Model: gpt | Split: night ===

--- Sofia ---
Action: 72.50% (29/40)
Counting: 68.00% (17/25)
...
OVERALL: 70.00% (105/150)

--- OVERALL (all datasets combined) ---
Action: 71.00% (..)
...
OVERALL: 69.50% (..)

--- DIFFICULTY BREAKDOWN (per dataset) ---
Sofia:
  easy: 75.00% (60/80)
  medium: 65.00% (26/40)
  hard: 55.00% (11/20)
Oxford:
  easy: 70.00% (..)
  medium: 62.00% (..)
  hard: 50.00% (..)
Synthetic:
  easy: 80.00% (..)
  medium: 68.00% (..)
  hard: 52.00% (..)

--- DIFFICULTY BREAKDOWN (overall) ---
easy: 78.00% (..)
medium: 65.00% (..)
hard: 52.00% (..)
```
The output JSON (via `--output`) includes `per_dataset`, `overall`, `per_dataset_difficulty`, and `overall_difficulty` keys.
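As an illustrative sketch of consuming that file, the four top-level keys come from the description above, but the nested structure and numbers below are made-up examples:

```python
# Illustrative summary matching the documented top-level keys; the nested
# values are invented examples, not real benchmark numbers.
summary = {
    "per_dataset": {"Sofia": {"Action": 72.5}},
    "overall": {"Action": 71.0},
    "per_dataset_difficulty": {"Sofia": {"easy": 75.0, "medium": 65.0, "hard": 55.0}},
    "overall_difficulty": {"easy": 78.0, "medium": 65.0, "hard": 52.0},
}

def easy_to_hard_gap(summary: dict) -> float:
    """Accuracy drop from easy to hard questions, over all datasets."""
    d = summary["overall_difficulty"]
    return d["easy"] - d["hard"]

print(easy_to_hard_gap(summary))  # 26.0
```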
evaluation/server.py launches an interactive Flask web server that reads pre-computed score files on disk and serves a dark-theme single-page application for exploring benchmark results.
Features:
- Filter by split (day/night), dataset (Sofia/Oxford/Synthetic), difficulty (easy/medium/hard), and QA type
- Accuracy overview with stat cards and per-QA-type progress bars
- Individual QA pair drill-down: question, ground truth, model prediction, score, correct/incorrect
- Paginated results table with color-coded rows
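The filtering described above can be sketched as a plain function (an illustration only, not the actual evaluation/server.py logic; the record fields are assumptions):

```python
# Illustrative score records; the real server reads *_scores*.json files from disk.
SCORES = [
    {"dataset": "Sofia", "split": "night", "difficulty": "easy",
     "qa_type": "Counting", "correct": True},
    {"dataset": "Oxford", "split": "day", "difficulty": "hard",
     "qa_type": "Action", "correct": False},
]

def filter_scores(records, **criteria):
    """Keep records matching every given filter (dataset, split, difficulty, qa_type)."""
    return [r for r in records
            if all(r.get(k) == v for k, v in criteria.items() if v is not None)]

def accuracy(records):
    """Fraction of correct answers among the filtered records."""
    return sum(r["correct"] for r in records) / len(records) if records else 0.0

night = filter_scores(SCORES, split="night")
print(accuracy(night))  # 1.0
```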
Using the bundled local data (recommended):
The repository includes pre-computed score files under data/ for 10 models (gpt, gemini, intern_vl, qwen3b, qwen7b, qwen72b, glm4v, video_llama3, llava_next_video, egogpt):
```bash
python evaluation/server.py \
    --sofia_path data/Sofia_server \
    --oxford_path data/Oxford_server \
    --synthetic_path data/synthetic_server \
    [--model gpt] [--port 5000]
```

Using your own score files:
```bash
python evaluation/server.py \
    --sofia_path /path/to/Sofia_server \
    --oxford_path /path/to/Oxford_server \
    --synthetic_path /path/to/synthetic_server \
    [--model gpt] [--port 5000]
```

Then open http://localhost:5000 in your browser. Requires Flask (`pip install flask`).
Build EgoNight exports from local dataset roots:
```bash
python exports/build_egonight_exports.py \
    --oxford /data/Night_Ego_Dataset/EgoNight/EgoNight_Oxford \
    --sofia /data/Night_Ego_Dataset/EgoNight/EgoNight_Sofia \
    --synthetic /data/Night_Ego_Dataset/EgoNight/EgoNight_synthetic \
    --output_dir exports/generated
```

This generates:

- `exports/generated/egonight_lmms_eval.jsonl` (for LMMs-Eval)
- `exports/generated/EgoNight.tsv` (for VLMEvalKit)
- `exports/generated/egonight_export_stats.json`
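A quick sanity check of a generated JSONL export can be sketched like this (the helper is illustrative and not part of the export pipeline):

```python
import json

def count_jsonl_records(path: str) -> int:
    """Parse each non-empty line to validate it is JSON, and count the records."""
    with open(path) as f:
        return len([json.loads(line) for line in f if line.strip()])
```

For example, `count_jsonl_records("exports/generated/egonight_lmms_eval.jsonl")` should match the sample count in the export stats.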
LMMs-Eval scaffold files are under exports/lmms_eval_task/egonight/.
VLMEvalKit scaffold file is under exports/vlmevalkit/egonight_dataset.py.
Detailed integration steps are in exports/README.md.
| Evaluator | Output File |
|---|---|
| GPT | gpt_results.json / gpt_results_day.json |
| Gemini | gemini_results.json / gemini_results_day.json |
| Qwen 7B | qwen7b_results.json / qwen7b_results_day.json |
Each result JSON contains entries with Q (question), A (prediction), C (ground truth), M (category), and frame indices.
Scoring produces *_scores.json with GPT-4o evaluations: correct/incorrect, 0–5 score, and reasoning.
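An illustrative result entry of that shape (the single-letter keys come from the description above; the values and the exact frame-index field names are assumptions):

```python
# Hypothetical result entry: Q/A/C/M are documented keys, while the
# frame-index field names and all values are invented for illustration.
result_entry = {
    "Q": "How many cups are on the counter?",  # question
    "A": "three",                              # model prediction
    "C": "three",                              # ground-truth answer
    "M": "Counting",                           # question category
    "start_frame": 12,
    "end_frame": 20,
}

print(result_entry["A"] == result_entry["C"])  # True
```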
We thank the following projects and resources:
- Oxford day and night dataset for providing day and night egocentric sequences used in EgoNight-Oxford.
- LLaMA Factory for the unified API server enabling efficient evaluation of open-source vision-language models.
- Blender for the open-source 3D creation suite used to render synthetic day–night aligned videos in EgoNight-Synthetic.
- Infinigen for the indoor scene generation used in EgoNight-Synthetic.
If you find EgoNight useful for your research, please cite our paper:
```bibtex
@inproceedings{zhang2026egonight,
  title={EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark},
  author={Zhang, Deheng and Fu, Yuqian and Yang, Runyi and Miao, Yang and Qian, Tianwen and Zheng, Xu and Sun, Guolei and Chhatkuli, Ajad and Huang, Xuanjing and Jiang, Yu-Gang and Van Gool, Luc and Paudel, Danda Pani},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}
```

GNU General Public License v3.0. See LICENSE for details.
