LARCH is a novel contextual image search scheme that utilizes a multimodal hierarchical graph-based neural network and multi-form knowledge modeling to learn knowledge-enhanced image representations for conversational image search.
Liqiang Nie1*, Fangkai Jiao1, Wenjie Wang2, Yinglong Wang3, Qi Tian4
1 Department of Computer Science and Technology, Shandong University
2 School of Computing, National University of Singapore
3 Shandong AI Institute, Qilu University of Technology (Shandong Academy of Sciences)
4 Cloud & AI, Huawei Technologies
* Corresponding author
- Paper: Paper Link
- Code Repository: GitHub
- Updates
- Introduction
- Highlights
- Method / Framework
- Project Structure
- Installation
- Checkpoints / Models
- Dataset / Benchmark
- Usage
- Demo / Visualization
- Citation
- Acknowledgement
## Updates

- [09/2021] Paper officially accepted and published in IEEE Transactions on Image Processing (TIP 2021).
- [09/2021] Initial release of official PyTorch implementation and the augmented MMD 2.0 dataset.
## Introduction

Conversational image search is an emerging search mode that interactively elicits user responses to clarify intent step by step. While previous efforts focused heavily on the conversation part (asking the right questions), this paper tackles the challenging image search part given a well-prepared conversational query.
Our method, LARCH (contextuaL imAge seaRch sCHeme), addresses the difficulty of this task by:
- Understanding complex user intents from a multimodal conversational query.
- Utilizing multiform knowledge associated with images from a memory network.
- Enhancing the image representation with distilled knowledge.
This repository provides the official training and evaluation code, the model parameter settings, and the extended benchmark dataset to facilitate future research in the conversational image search community.
## Highlights

- Query Representation Learning: Proposes a multimodal hierarchical graph-based neural network to learn conversational query embeddings for better user intent understanding.
- Multi-form Knowledge Modeling: Devises an embedding memory network to unify heterogeneous knowledge structures (graphs, matrices, tables) into a homogeneous base.
- Image Representation Learning: Utilizes a novel gated neural network to select useful knowledge from retrieved data, outputting a knowledge-enhanced image representation.
- Extended Dataset (MMD 2.0): Provides a newly constructed, highly challenging benchmark dataset augmented with fine-grained negative samples to better simulate real-world search environments.
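To make the gated knowledge selection in the third highlight concrete, here is a minimal plain-Python sketch of element-wise gated fusion between an image embedding and a retrieved knowledge embedding. All names and the scalar weights are illustrative assumptions, not taken from the released code; the actual model learns these parameters and operates on high-dimensional embeddings.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fuse(image_vec, knowledge_vec, w_img, w_kn, bias):
    """Element-wise gated fusion (illustrative):
    g = sigmoid(w_img * v + w_kn * k + b); out = g * v + (1 - g) * k."""
    fused = []
    for v, k, wi, wk, b in zip(image_vec, knowledge_vec, w_img, w_kn, bias):
        g = sigmoid(wi * v + wk * k + b)  # gate decides how much knowledge to admit
        fused.append(g * v + (1.0 - g) * k)
    return fused

# With zero weights and bias, the gate is 0.5 and the output is the mean
# of the image and knowledge components.
print(gated_fuse([1.0], [3.0], [0.0], [0.0], [0.0]))  # [2.0]
```

The gate lets the network suppress knowledge that is irrelevant to the current query while keeping the image signal intact, which is the intuition behind the knowledge-enhanced image representation.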
## Project Structure

```
.
├── datasets/
├── models/
├── .gitignore
├── LICENSE
├── README.md
├── constants.py
├── eval.py
├── eval_metric.py
├── evaluator_graph.py
├── evaluator_graph_case.py
├── evaluator_text.py
├── evaluator_text_case.py
├── gpu_profile.py
├── knowledge_embed.py
├── larch_framework.jpg
├── loss.py
├── main.py
├── raw_data_fix.py
├── requirements.txt
├── trainer_dgl.py
├── trainer_text.py
├── types.py
└── utils.py
```
## Installation

```shell
git clone https://github.com/SparkJiao/LARCH.git
cd LARCH
pip install -r requirements.txt
```

## Checkpoints / Models

To evaluate a pre-trained model, edit the checkpoint path in the corresponding evaluator script (e.g., `evaluator_graph.py` or `evaluator_text.py`) so that it points to the saved model you want to evaluate.
## Dataset / Benchmark

We constructed a new dataset, MMD 2.0, based on the original MMD benchmark dataset. It includes more challenging negative samples (images in the same category but with incorrect attributes) to increase dataset difficulty and better simulate real-world search conditions.
You can download the dataset from the following links:
- train_1.tar.gz (Extraction Code: wl8s)
- train_2.tar.gz (Extraction Code: 324x)
- valid.tar.gz (Extraction Code: kw5j)
- test.tar.gz (Extraction Code: h5kr)
- image_id.json (Extraction Code: 7zsp)
- url2img.txt (Extraction Code: azdk)
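The exact file formats are defined by the downloaded archives themselves. As an illustration only, the sketch below assumes `url2img.txt` maps one image URL to one local filename per line, whitespace-separated; verify the delimiter against the actual file before relying on it.

```python
def load_url2img(lines):
    """Parse url-to-filename pairs (assumed layout: "<url> <filename>" per line).

    Adjust the split if the real file uses a different delimiter (e.g., a tab).
    """
    mapping = {}
    for line in lines:
        parts = line.strip().split()
        if len(parts) >= 2:
            mapping[parts[0]] = parts[1]
    return mapping

# Hypothetical sample lines, for illustration only.
sample = [
    "http://example.com/a.jpg img_00001.jpg",
    "http://example.com/b.jpg img_00002.jpg",
]
print(load_url2img(sample))
```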
## Usage

To train the standard LARCH model:
```shell
CUDA_VISIBLE_DEVICES=0 python main.py train_dgl
```

To train the model that uses the multimodal hierarchical encoder (MHRED) as the query encoder:
```shell
CUDA_VISIBLE_DEVICES=0 python main.py train_text
```

**Note on hyper-parameters (`constants.py`):**
You can control various ablation studies via configuration:
```python
DISABLE_STYLETIPS = False  # If True, the `style tips` knowledge is removed.
DISABLE_ATTRIBUTE = False  # If True, the `attribute` knowledge is removed.
DISABLE_CELEBRITY = False  # If True, the `celebrity` knowledge is removed.
IMAGE_ONLY = False         # If True, all forms of knowledge are removed.

# Ablation study
KNOWLEDGE_TYPE = 'bi_g_wo_img'  # LARCH w/o vision-aware knowledge.
KNOWLEDGE_TYPE = 'bi_g_wo_que'  # LARCH w/o query-aware knowledge.
```

To test the standard LARCH model:
```shell
CUDA_VISIBLE_DEVICES=0 python main.py eval_graph
```

To evaluate the performance of LARCH w/o GRAPH:
```shell
CUDA_VISIBLE_DEVICES=0 python main.py eval_text
```

## Demo / Visualization

Compared with standard baselines, LARCH consistently selects the relevant images by factoring in detailed attribute knowledge (e.g., brand, material, style).
| Metric | LARCH (Ours) | MAGIC | UMD |
|---|---|---|---|
| Precision@5 | 0.5501 | 0.4711 | 0.3422 |
| Recall@5 | 0.6582 | 0.5642 | 0.4036 |
| NDCG@5 | 0.6829 | 0.4806 | 0.3662 |
(Results on the MMD 2.0 test set.)
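The metrics in the table follow standard definitions for ranked retrieval with binary relevance. A minimal stdlib-only reference implementation (all names illustrative; the repository's `eval_metric.py` is the authoritative version):

```python
import math

def precision_at_k(ranked, relevant, k=5):
    # Fraction of the top-k results that are relevant.
    return sum(1 for x in ranked[:k] if x in relevant) / k

def recall_at_k(ranked, relevant, k=5):
    # Fraction of all relevant items recovered in the top k.
    return sum(1 for x in ranked[:k] if x in relevant) / len(relevant)

def ndcg_at_k(ranked, relevant, k=5):
    # DCG with binary gains, normalized by the ideal DCG.
    dcg = sum(1.0 / math.log2(i + 2)
              for i, x in enumerate(ranked[:k]) if x in relevant)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / idcg if idcg > 0 else 0.0

ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "c"}
print(precision_at_k(ranked, relevant))  # 0.4
print(recall_at_k(ranked, relevant))     # 1.0
```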
## Citation

If you find this work helpful, please cite it:
```bibtex
@ARTICLE{conv-img-search-nie-2021,
  author={Nie, Liqiang and Jiao, Fangkai and Wang, Wenjie and Wang, Yinglong and Tian, Qi},
  journal={IEEE Transactions on Image Processing},
  title={Conversational Image Search},
  year={2021},
  volume={30},
  pages={7732-7743},
  doi={10.1109/TIP.2021.3108724}
}
```

## Acknowledgement

This work was supported in part by:
- The National Natural Science Foundation of China (Grant U1936203).
- The Shandong Provincial Natural Science Foundation (Grant ZR2019JQ23).
- The New AI Project towards the Integration of Education and Industry in QLUT.
