LARCH is a novel contextual image search scheme that utilizes a multimodal hierarchical graph-based neural network and multi-form knowledge modeling to learn knowledge-enhanced image representations for conversational image search.
Liqiang Nie1*, Fangkai Jiao1, Wenjie Wang2, Yinglong Wang3, Qi Tian4
1 Department of Computer Science and Technology, Shandong University
2 School of Computing, National University of Singapore
3 Shandong AI Institute, Qilu University of Technology (Shandong Academy of Sciences)
4 Cloud & AI, Huawei Technologies
* Corresponding author
- Paper: Paper Link
- Code Repository: GitHub
- Updates
- Introduction
- Highlights
- Method / Framework
- Project Structure
- Installation
- Checkpoints / Models
- Dataset / Benchmark
- Usage
- Demo / Visualization
- Citation
- Acknowledgement
## Updates

- [09/2021] Paper officially accepted and published in IEEE Transactions on Image Processing (TIP 2021).
- [09/2021] Initial release of official PyTorch implementation and the augmented MMD 2.0 dataset.
## Introduction

Conversational image search is an emerging search mode that interactively elicits user responses to clarify intent step by step. While previous efforts focused heavily on the conversation part (asking the right questions), this paper tackles the challenging image search part given a well-prepared conversational query.
Our method, LARCH (contextuaL imAge seaRch sCHeme), addresses the difficulty of this task by:
- Understanding complex user intents from a multimodal conversational query.
- Utilizing multiform knowledge associated with images from a memory network.
- Enhancing the image representation with distilled knowledge.
This repository provides the official training and evaluation code, the model parameter settings, and the extended benchmark dataset to facilitate future research in the conversational image search community.
## Highlights

- Query Representation Learning: Proposes a multimodal hierarchical graph-based neural network to learn conversational query embeddings for better user intent understanding.
- Multi-form Knowledge Modeling: Devises an embedding memory network to unify heterogeneous knowledge structures (graphs, matrices, tables) into a homogeneous base.
- Image Representation Learning: Utilizes a novel gated neural network to select useful knowledge from retrieved data, outputting a knowledge-enhanced image representation.
- Extended Dataset (MMD 2.0): Provides a newly constructed, highly challenging benchmark dataset augmented with fine-grained negative samples to better simulate real-world search environments.
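To make the gated knowledge selection in the third highlight concrete, here is a minimal plain-Python sketch of element-wise gated fusion between an image embedding and a retrieved knowledge embedding. All names and the scalar weights are illustrative assumptions, not taken from the released code; the actual model learns these parameters and operates on high-dimensional embeddings.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fuse(image_vec, knowledge_vec, w_img, w_kn, bias):
    """Element-wise gated fusion (illustrative):
    g = sigmoid(w_img * v + w_kn * k + b); out = g * v + (1 - g) * k."""
    fused = []
    for v, k, wi, wk, b in zip(image_vec, knowledge_vec, w_img, w_kn, bias):
        g = sigmoid(wi * v + wk * k + b)  # gate decides how much knowledge to admit
        fused.append(g * v + (1.0 - g) * k)
    return fused

# With zero weights and bias, the gate is 0.5 and the output is the mean
# of the image and knowledge components.
print(gated_fuse([1.0], [3.0], [0.0], [0.0], [0.0]))  # [2.0]
```

The gate lets the network suppress knowledge that is irrelevant to the current query while keeping the image signal intact, which is the intuition behind the knowledge-enhanced image representation.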
## Project Structure

```
.
├── datasets/
├── models/
├── .gitignore
├── LICENSE
├── README.md
├── constants.py
├── eval.py
├── eval_metric.py
├── evaluator_graph.py
├── evaluator_graph_case.py
├── evaluator_text.py
├── evaluator_text_case.py
├── gpu_profile.py
├── knowledge_embed.py
├── larch_framework.jpg
├── loss.py
├── main.py
├── raw_data_fix.py
├── requirements.txt
├── trainer_dgl.py
├── trainer_text.py
├── types.py
└── utils.py
```
## Installation

```shell
git clone https://github.com/SparkJiao/LARCH.git
cd LARCH
pip install -r requirements.txt
```

## Checkpoints / Models

To evaluate a pre-trained model, edit the checkpoint path in the corresponding evaluator script (e.g., `evaluator_graph.py` or `evaluator_text.py`) so that it points to the saved model you want to evaluate.
## Dataset / Benchmark

We constructed a new dataset, MMD 2.0, based on the original MMD benchmark dataset. It includes more challenging negative samples (images in the same category but with incorrect attributes) to increase dataset difficulty and better simulate real-world search conditions.
You can download the dataset from the following links:
- train_1.tar.gz (Extraction Code: wl8s)
- train_2.tar.gz (Extraction Code: 324x)
- valid.tar.gz (Extraction Code: kw5j)
- test.tar.gz (Extraction Code: h5kr)
- image_id.json (Extraction Code: 7zsp)
- url2img.txt (Extraction Code: azdk)
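The exact file formats are defined by the downloaded archives themselves. As an illustration only, the sketch below assumes `url2img.txt` maps one image URL to one local filename per line, whitespace-separated; verify the delimiter against the actual file before relying on it.

```python
def load_url2img(lines):
    """Parse url-to-filename pairs (assumed layout: "<url> <filename>" per line).

    Adjust the split if the real file uses a different delimiter (e.g., a tab).
    """
    mapping = {}
    for line in lines:
        parts = line.strip().split()
        if len(parts) >= 2:
            mapping[parts[0]] = parts[1]
    return mapping

# Hypothetical sample lines, for illustration only.
sample = [
    "http://example.com/a.jpg img_00001.jpg",
    "http://example.com/b.jpg img_00002.jpg",
]
print(load_url2img(sample))
```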
## Usage

To train the standard LARCH model:
```shell
CUDA_VISIBLE_DEVICES=0 python main.py train_dgl
```

To train the model that uses the multimodal hierarchical encoder (MHRED) as the query encoder:
```shell
CUDA_VISIBLE_DEVICES=0 python main.py train_text
```

**Note on hyper-parameters (`constants.py`):**
You can control various ablation studies via configuration:
```python
DISABLE_STYLETIPS = False  # If True, the `style tips` knowledge is removed.
DISABLE_ATTRIBUTE = False  # If True, the `attribute` knowledge is removed.
DISABLE_CELEBRITY = False  # If True, the `celebrity` knowledge is removed.
IMAGE_ONLY = False         # If True, all forms of knowledge are removed.

# Ablation study
KNOWLEDGE_TYPE = 'bi_g_wo_img'  # LARCH w/o vision-aware knowledge.
KNOWLEDGE_TYPE = 'bi_g_wo_que'  # LARCH w/o query-aware knowledge.
```

To test the standard LARCH model:
```shell
CUDA_VISIBLE_DEVICES=0 python main.py eval_graph
```

To evaluate the performance of LARCH w/o GRAPH:
```shell
CUDA_VISIBLE_DEVICES=0 python main.py eval_text
```

## Demo / Visualization

Compared with standard baselines, LARCH consistently selects the relevant images by factoring in detailed attribute knowledge (e.g., brand, material, style).
| Metric | LARCH (Ours) | MAGIC | UMD |
|---|---|---|---|
| Precision@5 | 0.5501 | 0.4711 | 0.3422 |
| Recall@5 | 0.6582 | 0.5642 | 0.4036 |
| NDCG@5 | 0.6829 | 0.4806 | 0.3662 |
(Results on the MMD 2.0 test set.)
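The metrics in the table follow standard definitions for ranked retrieval with binary relevance. A minimal stdlib-only reference implementation (all names illustrative; the repository's `eval_metric.py` is the authoritative version):

```python
import math

def precision_at_k(ranked, relevant, k=5):
    # Fraction of the top-k results that are relevant.
    return sum(1 for x in ranked[:k] if x in relevant) / k

def recall_at_k(ranked, relevant, k=5):
    # Fraction of all relevant items recovered in the top k.
    return sum(1 for x in ranked[:k] if x in relevant) / len(relevant)

def ndcg_at_k(ranked, relevant, k=5):
    # DCG with binary gains, normalized by the ideal DCG.
    dcg = sum(1.0 / math.log2(i + 2)
              for i, x in enumerate(ranked[:k]) if x in relevant)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / idcg if idcg > 0 else 0.0

ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "c"}
print(precision_at_k(ranked, relevant))  # 0.4
print(recall_at_k(ranked, relevant))     # 1.0
```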
## Citation

If you find this work helpful, please cite it:
```bibtex
@ARTICLE{conv-img-search-nie-2021,
  author={Nie, Liqiang and Jiao, Fangkai and Wang, Wenjie and Wang, Yinglong and Tian, Qi},
  journal={IEEE Transactions on Image Processing},
  title={Conversational Image Search},
  year={2021},
  volume={30},
  pages={7732-7743},
  doi={10.1109/TIP.2021.3108724}
}
```

## Acknowledgement

This work was supported in part by:
- The National Natural Science Foundation of China (Grant U1936203).
- The Shandong Provincial Natural Science Foundation (Grant ZR2019JQ23).
- The New AI Project towards the Integration of Education and Industry in QLUT.
