This repo provides the code and data of our proposed benchmark for object hallucination evaluation in LVLMs. Please refer to our paper: What Makes "Good" Distractors for Object Hallucination Evaluation in Large Vision-Language Models?.
See the README.md file in the data directory for instructions on downloading and preparing the datasets.
- Install the required dependency packages according to `requirements.txt`:

```shell
conda create --name hope python=3.8 -y
conda activate hope
pip install -r requirements.txt
```
- Perform inference on the pre-generated json files. For example, the json file generated using the description-based strategy is organized in the following format:

```json
{"question_id": 1, "image": "val2014/COCO_val2014_000000095283.jpg", "text": "Please answer yes or no. Is there a person in the image?", "label": "yes"}
{"question_id": 2, "image": "val2014/COCO_val2014_000000095283.jpg", "text": "Please answer yes or no. Is there person in enclosed area in the image?", "label": "no"}
```

- Perform evaluation on the generated answer json files. The answer json files should be organized in the following format:

```json
{"question": "Please answer yes or no. Is there a person in the image?", "answer": "yes"}
{"question": "Please answer yes or no. Is there person in enclosed area in the image?", "answer": "no"}
```

Obtain the evaluation results by running:

```shell
python evaluate.py --prompt_type binary --ans_file xxx --label_file xxx
```
- When constructing category-oriented hallucination search and content-aware hallucination search data, run the following command to generate question-answer data on the Objects365 dataset:

```shell
python main.py --dataset objects365 \
    --sample_num 2000 --pos_selected 6 --neg_selected 6 \
    --search_strategy similarity --prompt_type binary
```

In this example, we select 2,000 samples, choose 6 positive labels for each sample, use the similarity strategy to select 6 negative labels, and finally generate question-answering data in binary format.
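The core idea of the similarity strategy is to pick distractor (negative) labels that are semantically closest to the image's positive labels. A hedged sketch of that selection, using plain cosine similarity over label embeddings (the `embed` lookup is a stand-in supplied by the caller; the actual implementation lives in `main.py` and may use different embeddings and scoring):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (plain lists)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def select_negatives(pos_labels, candidates, embed, k):
    """Rank candidate labels absent from the image by their maximum cosine
    similarity to any positive label, and keep the top k as distractors."""
    pool = [c for c in candidates if c not in pos_labels]
    scored = [(max(cosine(embed[c], embed[p]) for p in pos_labels), c)
              for c in pool]
    scored.sort(reverse=True)
    return [c for _, c in scored[:k]]
```

With toy 2-D embeddings, labels near a positive label ("dog") outrank unrelated ones ("car"), which is exactly what makes them harder distractors.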
- When constructing description-based hallucination search data, run the following command to obtain 500 samples that satisfy specific filtering conditions, each with 3 randomly selected positive labels:

```shell
python main.py --dataset vg \
    --objs_filter 2000 --desc_filter 50 --labs_filter 10 \
    --sample_num 500 --pos_selected 3 \
    --search_strategy description
```

The above command generates an Excel file containing the images, their positive labels, and the misleading descriptions obtained by the description-based searching strategy. Users need to manually review the Excel file and highlight a valid (i.e., actually incorrect) yet most misleading description for each positive label. Finally, run the following command to generate the description-based hallucination question-answer data:

```shell
python process.py --dataset vg
```
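Once the positive and negative labels (or reviewed descriptions) are fixed, turning them into questions is a matter of filling the binary template shown in the jsonl example above. A sketch of that last step (the template and field names mirror the example entries; the actual generation is done by `main.py`/`process.py`):

```python
def build_questions(image, pos_labels, neg_labels, start_id=1):
    """Emit one binary yes/no question per label in the benchmark's jsonl
    schema: positive labels get "yes", negative (distractor) labels "no"."""
    entries = []
    qid = start_id
    for obj, lab in [(o, "yes") for o in pos_labels] + [(o, "no") for o in neg_labels]:
        entries.append({
            "question_id": qid,
            "image": image,
            "text": f"Please answer yes or no. Is there a {obj} in the image?",
            "label": lab,
        })
        qid += 1
    return entries
```

Each returned dict can be serialized with `json.dumps`, one per line, to produce a question file in the format consumed during inference.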
Table 1: Statistics on Category-Oriented Hallucination Searching and Content-Aware Hallucination Searching data.
| Dataset | #CLS | #Train | #Val | #Labels (Min~Max) | #Labels (Avg) | Samples | Selected Pos. | Selected Neg. |
|---|---|---|---|---|---|---|---|---|
| MS-COCO | 80 | 82,081 | 40,137 | 1~18 | 2.93 | 500 | 3 | 3 |
| Objects-365 | 365 | 1,742,289 | 80,000 | 1~36 | 6.17 | 2,000 | 6 | 6 |
Table 2: Statistics on Description-Based Hallucination Searching data.
| Dataset | Object Frequency (Filter 1) | Descriptions /object (Filter 2) | Objects /image (Filter 3) | #Objects | #Descriptions | #Samples |
|---|---|---|---|---|---|---|
| MS-COCO | - | - | - | 80 | 1 - 1,884 | 40,137 |
| VG | 2,000 | 50 | 10 | 265 | 90 - 4,401 | 88,730 |
| OpenImages | | | | 97 | 71 - 2,009 | 4,435 |
To generate different entries of the main table, modify the following parameters:

- `dataset`: The dataset to use; options are `['coco2014', 'objects365', 'vg', 'openimages']`.
- `sample_num`: The sample size for generating the benchmark.
- `prompt_type`: The generated question-answer data format; one of `['binary', 'multiopt']`.
- `pos_selected`: The number of positive labels selected.
- `neg_selected`: The number of negative labels selected.
- `search_strategy`: The available attack strategies: category-oriented hallucination search `['random', 'popular', 'co_occurrence', 'similarity']`, content-aware hallucination search `['content_clip', 'content_tagclip', 'union']`, and description-based hallucination search `['description']`.
- `desc_filter`: Filter out objects with too few descriptions.
- `objs_filter`: Filter out objects with low occurrence frequencies.
- `labs_filter`: Filter out images with too few labels.
- `ans_file`: Optional explicit path to answers.json.
- `label_file`: Optional explicit path to labels.json.