Paper PDF • arXiv • Project Page • GitHub • Dataset • Models
OSCaR is the public code release for the NAACL 2024 paper OSCaR: Object State Captioning and State Change Representation.
This repository publishes the LLaVA-derived training, inference, evaluation, and data-preparation code used for OSCaR, alongside release-grade documentation for the corresponding Hugging Face dataset and model repos.
This public release includes everything needed to use OSCaR:
- Code: this GitHub repo contains training, inference, evaluation, and data-preparation code.
- Dataset: ali-vosoughi/oscar-dataset contains the released OSCaR assets, manifests, metadata, and splits.
- Weights: ali-vosoughi on Hugging Face contains the released OSCaR model artifacts.
Released weights are:
- LoRA adapters: 7B OSCaR, 13B OSCaR, and 13B mixed.
- Projector checkpoints: 7B and 13B.
Merged full-model checkpoints are not part of the current OSCaR release.
If you want to run the released 13B model:
```shell
huggingface-cli download ali-vosoughi/oscar-llava-v1.5-13b-oscar-adapter --local-dir ../oscar-llava-v1.5-13b-oscar-adapter
python -m llava.serve.cli \
    --model-path ../oscar-llava-v1.5-13b-oscar-adapter \
    --model-base lmsys/vicuna-13b-v1.5 \
    --image-file /path/to/image.jpg
```

If you want to use the released dataset:
```shell
huggingface-cli download ali-vosoughi/oscar-dataset --repo-type dataset --local-dir ../oscar-dataset
export DATASET_ROOT=../oscar-dataset
export PATH_PREFIX="$DATASET_ROOT/data"
```

If you want to reproduce OSCaR fine-tuning:
```shell
huggingface-cli download ali-vosoughi/oscar-llava-v1.5-13b-projector --local-dir ../oscar-llava-v1.5-13b-projector
DATASET_ROOT=../oscar-dataset \
MM_PROJECTOR_PATH=../oscar-llava-v1.5-13b-projector/mm_projector.bin \
bash scripts/train/finetune_v1_5_13b_oscar_lora.sh
```

OSCaR studies object state captioning and state change representation for egocentric video. The public release packages the training and inference code, the Hugging Face dataset and model repos, and the GitHub Pages project site for the NAACL 2024 paper. The released OSCaR frames and clip-level assets are derived from the EPIC-KITCHENS and Ego4D source datasets.
Paper-level release facts:
- annotated segments reported in the paper: 14,084
- human-verified benchmark videos: 500
- LLaVA fine-tuning entries in the preserved OSCaR manifest: 28,308
- backbone family: LLaVA v1.5 with Vicuna 7B/13B and CLIP ViT-L/336
- Project page: nguyennm1024.github.io/OSCaR
- Paper PDF: 2024.findings-naacl.226.pdf
- arXiv: arXiv:2402.17128
- Code: github.com/nguyennm1024/OSCaR
- Dataset: ali-vosoughi/oscar-dataset
Model repos:
OSCaR is built from clips and frames sourced from the EPIC-KITCHENS and Ego4D egocentric video datasets. The public OSCaR dataset release should be understood as a derived release built on top of EPIC-KITCHENS and Ego4D assets.
- Download the HF dataset repo locally.
- Point OSCaR to it with `DATASET_ROOT`.
- Use `manifests/llava_data.json` for training and `splits/data_mapping_final_EK_test.csv` for the held-out benchmark split.
```shell
huggingface-cli download ali-vosoughi/oscar-dataset --repo-type dataset --local-dir ../oscar-dataset
export DATASET_ROOT=../oscar-dataset
bash scripts/train/finetune_v1_5_13b_oscar_lora.sh
```

For evaluation and inspection, the most useful entry files are `metadata/segment_index.csv`, `manifests/llava_data.json`, and `splits/data_mapping_final_EK_test.csv`.
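As a quick way to sanity-check a downloaded manifest, the sketch below loads a LLaVA-style JSON manifest and resolves image paths against `DATASET_ROOT`. The two-field schema (`image`, `conversations`) is an assumption based on the standard LLaVA fine-tuning format, not a documented guarantee of the OSCaR manifest; the synthetic entry exists only so the sketch runs without the real release.

```python
import json
import os
import tempfile

def summarize_manifest(manifest_path, dataset_root):
    """Count manifest entries and resolve their image paths under dataset_root."""
    with open(manifest_path) as f:
        entries = json.load(f)
    resolved = [os.path.join(dataset_root, e["image"]) for e in entries if "image" in e]
    return len(entries), resolved

# Tiny synthetic manifest mimicking the assumed LLaVA format.
with tempfile.TemporaryDirectory() as root:
    manifest = os.path.join(root, "llava_data.json")
    with open(manifest, "w") as f:
        json.dump([{"image": "frames/clip_0001.jpg",
                    "conversations": [{"from": "human",
                                       "value": "Describe the state change."}]}], f)
    n, paths = summarize_manifest(manifest, root)
    print(n, paths[0])
```

On the real release you would point `summarize_manifest` at `$DATASET_ROOT/manifests/llava_data.json` instead of the synthetic file.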
- projector pretraining code for the LLaVA v1.5 stack
- OSCaR-only LoRA fine-tuning scripts for Vicuna 7B and 13B
- mixed-data 13B LoRA fine-tuning that combines OSCaR with LLaVA v1.5 mix data
- benchmark and open-world inference entrypoints
- text-generation evaluation scripts
- dataset-preparation utilities used around manifests, splits, and output conversion
The public scripts reflect the paper and local training artifacts:
- base models: `lmsys/vicuna-7b-v1.5` and `lmsys/vicuna-13b-v1.5`
- vision tower: `openai/clip-vit-large-patch14-336`
- LoRA rank: 128
- LoRA alpha: 256
- epochs: 1
- learning rate: 2e-4
- batch size per device: 16 for OSCaR-only runs
- max sequence length: 2048
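As a sanity check on the rank/alpha pair above: under the standard LoRA convention (as used by e.g. the PEFT library), each low-rank update is multiplied by alpha / rank, so these settings imply a scaling factor of 2.0. A minimal sketch with no training dependencies:

```python
# LoRA hyperparameters as listed above.
lora_rank = 128
lora_alpha = 256

# Standard LoRA scaling: the multiplier applied to the low-rank update BA.
scaling = lora_alpha / lora_rank
print(scaling)  # 2.0
```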
The 13B mixed-data run also has a public script matching the logged local run:
- per-device batch size: 8
- gradient accumulation: 2
- save steps: 300
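Under these settings, each optimizer step consumes per-device batch × gradient-accumulation samples per GPU, and the global effective batch additionally scales with GPU count. The GPU count below is illustrative only; the release docs do not specify it:

```python
per_device_batch = 8
grad_accum_steps = 2
num_gpus = 4  # illustrative assumption, not taken from the release

# Samples contributing to one optimizer step on a single device, and globally.
samples_per_device_step = per_device_batch * grad_accum_steps
effective_batch = samples_per_device_step * num_gpus
print(samples_per_device_step, effective_batch)  # 16 64
```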
Environment setup is UV-based:
```shell
uv venv --python 3.10 .venv
source .venv/bin/activate
uv pip install -e ".[train,inference,eval,release]"
```

Detailed setup: INSTALL.md
Training guide: TRAIN.md
Inference guide: INFERENCE.md
Evaluation guide: EVAL.md
Release guide: RELEASE.md
- `llava/`: model, training, evaluation, and serving code
- `scripts/train/`: projector pretraining, LoRA fine-tuning, merge, and cluster launcher examples
- `scripts/infer/`: public inference wrappers for benchmark and open-world evaluation
- `scripts/eval/`: benchmark metric scripts
- `scripts/data/`: manifest, split, QA-generation, and output-conversion utilities
- `scripts/release/`: Hugging Face card generation and upload helpers
- `configs/deepspeed/`: public DeepSpeed configs
- `docs/`: release docs and GitHub Pages content
This repository is intentionally code-first. Large assets are kept out of GitHub:
- OSCaR dataset assets and manifests live in the Hugging Face dataset repo
- projector checkpoints, adapters, and merged-model release artifacts live in the Hugging Face model repos
The code here expects those assets to be mounted or downloaded locally and then referenced through CLI arguments or environment variables.
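A sketch of that pattern: the variable names `DATASET_ROOT` and `MM_PROJECTOR_PATH` come from the commands earlier in this README, while the `resolve_assets` helper and its fallback defaults are hypothetical illustrations, not part of the released code.

```python
import os

def resolve_assets():
    """Resolve locally downloaded release assets from environment variables.

    The default paths are hypothetical placeholders mirroring the download
    locations used in the examples above.
    """
    dataset_root = os.environ.get("DATASET_ROOT", "../oscar-dataset")
    projector = os.environ.get(
        "MM_PROJECTOR_PATH",
        "../oscar-llava-v1.5-13b-projector/mm_projector.bin",
    )
    return dataset_root, projector

# Example: point DATASET_ROOT at a local mount; leave the projector at its default.
os.environ["DATASET_ROOT"] = "/mnt/oscar-dataset"
root, proj = resolve_assets()
print(root, proj)
```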
OSCaR is intentionally split across public surfaces:
- this GitHub repository for code, install instructions, and reproducibility docs
- the Hugging Face dataset repo for released OSCaR assets and metadata
- the Hugging Face model repos for projector checkpoints and adapter weights
- the project page for paper-oriented presentation and public links
If you are landing here first, the project page is the shortest path to the paper, code, dataset, and weights.
If OSCaR is useful in your work, please cite:
```bibtex
@inproceedings{nguyen2024oscar,
  title={OSCaR: Object State Captioning and State Change Representation},
  author={Nguyen, Nguyen and Bi, Jing and Vosoughi, Ali and Tian, Yapeng and Fazli, Pooyan and Xu, Chenliang},
  booktitle={North American Chapter of the Association for Computational Linguistics (NAACL)},
  year={2024}
}
```

OSCaR builds on the LLaVA codebase and training stack. We thank the LLaVA team for their strong open-source release and the foundation it provided for this work.
OSCaR also builds on source video data from the EPIC-KITCHENS and Ego4D projects. We thank those teams for creating and releasing the datasets from which the OSCaR frames and clips are derived.
Approved for public release; distribution is unlimited.
This work has been supported by the Defense Advanced Research Projects Agency
(DARPA) under Contract HR00112220003. The content of the information does not
necessarily reflect the position of the Government, and no official endorsement
should be inferred.