Conversational Image Search

LARCH is a novel contextual image search scheme that utilizes a multimodal hierarchical graph-based neural network and multi-form knowledge modeling to learn knowledge-enhanced image representations for conversational image search.

Authors

Liqiang Nie¹*, Fangkai Jiao¹, Wenjie Wang², Yinglong Wang³, Qi Tian⁴

¹ Department of Computer Science and Technology, Shandong University
² School of Computing, National University of Singapore
³ Shandong AI Institute, Qilu University of Technology (Shandong Academy of Sciences)
⁴ Cloud & AI, Huawei Technologies
* Corresponding author

Updates

  • [09/2021] Paper officially accepted and published in IEEE Transactions on Image Processing (TIP 2021).
  • [09/2021] Initial release of official PyTorch implementation and the augmented MMD 2.0 dataset.

Introduction

Conversational image search is an emerging search mode that interactively elicits user responses to clarify search intent step by step. While previous efforts focused heavily on the conversation side (asking the right questions), this paper tackles the challenging image search side: retrieving the right images given a well-prepared conversational query.

Our method, LARCH (contextuaL imAge seaRch sCHeme), addresses the difficulty of this task by:

  1. Understanding complex user intents from a multimodal conversational query.
  2. Retrieving multi-form knowledge associated with images from a memory network.
  3. Enhancing the image representation with the distilled knowledge.

This repository provides the official training and evaluation code, the model parameter settings, and the extended benchmark dataset to facilitate future research in the conversational image search community.


Highlights

  • Query Representation Learning: Proposes a multimodal hierarchical graph-based neural network to learn conversational query embeddings for better user intent understanding.
  • Multi-form Knowledge Modeling: Devises an embedding memory network to unify heterogeneous knowledge structures (graphs, matrices, tables) into a homogeneous base.
  • Image Representation Learning: Utilizes a novel gated neural network to select useful knowledge from retrieved data, outputting a knowledge-enhanced image representation.
  • Extended Dataset (MMD 2.0): Provides a newly constructed, highly challenging benchmark dataset augmented with fine-grained negative samples to better simulate real-world search environments.
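The gated knowledge selection in the highlight above can be sketched as a sigmoid gate over concatenated image and knowledge features. This is an illustrative sketch only, not the paper's exact parameterization; the weight shapes and the additive fusion are assumptions:

```python
import numpy as np

def gated_knowledge_fusion(img_feat, know_feat, W, b):
    """Illustrative gate: sigmoid(W [img; know] + b) decides how much of
    the retrieved knowledge flows into the final image representation."""
    z = np.concatenate([img_feat, know_feat])
    gate = 1.0 / (1.0 + np.exp(-(W @ z + b)))  # element-wise gate in (0, 1)
    return img_feat + gate * know_feat          # knowledge-enhanced representation

# Toy example with 4-d image and knowledge features
rng = np.random.default_rng(0)
img = rng.normal(size=4)
know = rng.normal(size=4)
W = rng.normal(size=(4, 8))   # maps the 8-d concatenation to a 4-d gate
b = np.zeros(4)
enhanced = gated_knowledge_fusion(img, know, W, b)
```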

Method / Framework


Schematic illustration of the proposed LARCH model (see larch_framework.jpg). It comprises three components: query representation learning, multi-form knowledge modeling, and image representation learning.

Project Structure

.
├── datasets/                 
├── models/                   
├── .gitignore
├── LICENSE
├── README.md
├── constants.py
├── eval.py
├── eval_metric.py
├── evaluator_graph.py
├── evaluator_graph_case.py
├── evaluator_text.py
├── evaluator_text_case.py
├── gpu_profile.py
├── knowledge_embed.py
├── larch_framework.jpg
├── loss.py
├── main.py
├── raw_data_fix.py
├── requirements.txt
├── trainer_dgl.py
├── trainer_text.py
├── types.py
└── utils.py

Installation

1. Clone the repository

git clone https://github.com/SparkJiao/LARCH.git
cd LARCH

2. Install dependencies

pip install -r requirements.txt

Checkpoints / Models

To evaluate a pre-trained model, update the checkpoint path in the corresponding evaluator script so that it points to the specific model you want to evaluate.
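For illustration, a checkpoint round trip looks like the following; the file name `larch_best.pt` and the state-dict layout are hypothetical, not the repository's actual checkpoint format:

```python
import os
import tempfile
import torch

# Save a (dummy) training state, as a trainer typically would.
state = {"epoch": 10, "model_state_dict": {"w": torch.zeros(3)}}
ckpt_path = os.path.join(tempfile.mkdtemp(), "larch_best.pt")
torch.save(state, ckpt_path)

# In the evaluator, point the checkpoint path at the model to be scored.
checkpoint = torch.load(ckpt_path, map_location="cpu")
```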


Dataset / Benchmark

We constructed a new dataset, MMD 2.0, based on the original MMD benchmark dataset. It includes more challenging negative samples (images in the same category but with incorrect attributes) to increase dataset difficulty and simulate real-world conditions.
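The hard-negative construction described above can be sketched as filtering candidates that share the target's category but mismatch on attributes (the `category`/`attributes` field names are illustrative, not the actual MMD 2.0 schema):

```python
def hard_negatives(target, candidates):
    """Keep images in the same category as the target whose attributes
    do not fully match (fine-grained, hard-to-distinguish negatives)."""
    return [
        c for c in candidates
        if c["category"] == target["category"]
        and c["attributes"] != target["attributes"]
    ]

target = {"category": "shirt", "attributes": {"color": "blue", "material": "cotton"}}
candidates = [
    {"category": "shirt", "attributes": {"color": "red", "material": "cotton"}},   # hard negative
    {"category": "shirt", "attributes": {"color": "blue", "material": "cotton"}},  # positive, excluded
    {"category": "shoe",  "attributes": {"color": "blue", "material": "leather"}}, # easy negative, excluded
]
negs = hard_negatives(target, candidates)  # → only the red cotton shirt
```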

You can download the dataset from the following links:


Usage

Training

To train the standard LARCH model:

CUDA_VISIBLE_DEVICES=0 python main.py train_dgl

To train the variant that uses the multimodal hierarchical encoder-decoder (MHRED) as the query encoder:

CUDA_VISIBLE_DEVICES=0 python main.py train_text

Note on hyper-parameters (constants.py): the following flags control the ablation studies:

DISABLE_STYLETIPS = False  # If True, the `style tips` knowledge is removed.
DISABLE_ATTRIBUTE = False  # If True, the `attribute` knowledge is removed.
DISABLE_CELEBRITY = False  # If True, the `celebrity` knowledge is removed.
IMAGE_ONLY = False         # If True, all forms of knowledge are removed.

# Ablation study (choose one)
KNOWLEDGE_TYPE = 'bi_g_wo_img'  # LARCH w/o vision-aware knowledge.
KNOWLEDGE_TYPE = 'bi_g_wo_que'  # LARCH w/o query-aware knowledge.
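Illustratively, flags like these might gate which knowledge forms reach the model; this is a sketch, and the function and form names below are assumptions rather than code from the repository:

```python
# Hypothetical ablation flags, mirroring the constants.py options above.
DISABLE_STYLETIPS = False
DISABLE_ATTRIBUTE = True   # e.g., ablate attribute knowledge
DISABLE_CELEBRITY = False
IMAGE_ONLY = False

def active_knowledge():
    """Return the knowledge forms that remain enabled under the flags."""
    if IMAGE_ONLY:
        return []  # image-only ablation: no knowledge at all
    forms = []
    if not DISABLE_STYLETIPS:
        forms.append("style_tips")
    if not DISABLE_ATTRIBUTE:
        forms.append("attribute")
    if not DISABLE_CELEBRITY:
        forms.append("celebrity")
    return forms
```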

Evaluation

To test the standard LARCH model:

CUDA_VISIBLE_DEVICES=0 python main.py eval_graph

To evaluate the performance of LARCH w/o GRAPH:

CUDA_VISIBLE_DEVICES=0 python main.py eval_text

Demo / Visualization

Compared with standard baselines, LARCH consistently selects the relevant images by factoring in detailed attribute knowledge (e.g., brand, material, style).

| Metric      | LARCH (Ours) | MAGIC  | UMD    |
|-------------|--------------|--------|--------|
| Precision@5 | 0.5501       | 0.4711 | 0.3422 |
| Recall@5    | 0.6582       | 0.5642 | 0.4036 |
| NDCG@5      | 0.6829       | 0.4806 | 0.3662 |

(Results on the MMD 2.0 test set.)
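For reference, the three metrics can be computed per query as follows. This is a standard binary-relevance sketch, independent of the repository's `eval_metric.py`:

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked items that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items found in the top k."""
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking over the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal

# Toy ranking: img3 (rank 1) and img1 (rank 3) are relevant.
ranked = ["img3", "img7", "img1", "img9", "img4"]
relevant = {"img3", "img1", "img5"}
p5 = precision_at_k(ranked, relevant, 5)  # 2/5 = 0.4
r5 = recall_at_k(ranked, relevant, 5)     # 2/3 ≈ 0.667
n5 = ndcg_at_k(ranked, relevant, 5)
```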

Citation

If you find this work helpful, please cite:

@ARTICLE{conv-img-search-nie-2021,
  author={Nie, Liqiang and Jiao, Fangkai and Wang, Wenjie and Wang, Yinglong and Tian, Qi},
  journal={IEEE Transactions on Image Processing}, 
  title={Conversational Image Search}, 
  year={2021},
  volume={30},
  pages={7732-7743},
  doi={10.1109/TIP.2021.3108724}
}

Acknowledgement

This work was supported in part by:

  • The National Natural Science Foundation of China (Grant U1936203).
  • The Shandong Provincial Natural Science Foundation (Grant ZR2019JQ23).
  • The New AI Project towards the Integration of Education and Industry in QLUT.
