Visual Grounding for Object Questions

Introduces Visual Grounding for Object Questions (VGOQ), a new task: grounding the visual evidence or context that supports answering general questions about objects, beyond the directly visible elements.

To appear in CVPR 2026
Covariance Mismatch in Diffusion Models

Investigates the covariance mismatch between noise and data in diffusion models and its impact on image generation.

Preprint 2024
Exploiting the Signal-Leak Bias in Diffusion Models

Examines and leverages the signal-leak bias in diffusion models for improved image generation.

WACV 2024
Diffusion in Style

Customizes Stable Diffusion's output style by adapting the initial noise distribution, making style adaptation more sample-efficient and faster.

ICCV 2023
VETIM: Expanding the Vocabulary of Text-to-Image Models only with Text

Expands text-to-image models' vocabulary by learning new token embeddings from textual descriptions alone, without requiring sample images.

BMVC 2023
Estimating Image Depth in the Comics Domain

Estimates depth in comic book images by converting them to natural images and filtering out text to improve accuracy.

WACV 2022
Scene Relighting with Illumination Estimation in the Latent Space

Transfers lighting conditions between images by estimating and manipulating illumination in the latent space of an encoder-decoder network.

arXiv 2020
More works from Image and Visual Representation Lab (IVRL)

Also check out more works by our labmates at the Image and Visual Representation Lab (IVRL) at EPFL.


VETIM: Expanding the Vocabulary of Text-to-Image Models only with Text

1 EPFL, Switzerland
2 Largo.ai, Lausanne, Switzerland

Abstract

Text-to-image models, such as Stable Diffusion, can generate high-quality images from simple textual prompts. With methods such as Textual Inversion, it is possible to expand the vocabulary of these models with additional concepts, by learning the vocabulary embedding of new tokens. These methods have two limitations: slowness of optimisation and dependence on sample images. Slowness mainly stems from the use of the original text-to-image training loss, without considering potential auxiliary supervision terms. Relying on sample images enables learning new visual features but restricts the vocabulary expansion to concepts with pre-existing images. In response, we introduce a novel approach, named VETIM, which takes only a textual description of the concept as input. It expands the vocabulary through supervision only at the text encoder output, without accessing the image-generation part, making it faster at optimisation time. It also does not copy visual features from sample images. Our method can be used directly for applications that require a concept as a single token but do not require learning new visual features. Our approach shows that a mere textual description suffices to obtain a single token referring to a specific concept. To show the effectiveness of our method, we evaluate its performance subjectively and through objective measures. The results show that our approach is effective in expanding the vocabulary of text-to-image models without requiring images.
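The core idea above lends itself to a compact optimisation loop. The following is a minimal, self-contained sketch of that idea, not the paper's implementation: it uses a hypothetical toy text encoder (the paper works with Stable Diffusion's actual text encoder), and optimises a single new token embedding so that the frozen encoder produces the same output for a short prompt containing the token as for the full textual description of the concept. All names here (`ToyTextEncoder`, the vocabulary, the prompts) are illustrative assumptions.

```python
# Sketch of the VETIM idea with a hypothetical toy frozen text encoder.
# Supervision happens only at the text-encoder output; the image-generation
# part of the model is never touched.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy vocabulary; "<new>" is the token whose embedding we learn.
VOCAB = {"a": 0, "photo": 1, "of": 2, "small": 3, "red": 4,
         "vintage": 5, "car": 6, "<new>": 7}
DIM = 16

class ToyTextEncoder(nn.Module):
    """Stand-in for a frozen CLIP-style text encoder (mean-pool + projection)."""
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.proj = nn.Linear(dim, dim)

    def encode(self, words, new_vec=None):
        # Build the token-embedding sequence, substituting the trainable
        # vector for "<new>" so gradients flow only into that vector.
        vecs = [new_vec if w == "<new>" else self.emb.weight[VOCAB[w]]
                for w in words]
        return self.proj(torch.stack(vecs).mean(dim=0))

encoder = ToyTextEncoder(len(VOCAB), DIM)
for p in encoder.parameters():
    p.requires_grad_(False)  # the encoder itself stays frozen

# Target: encoder output for the full textual description of the concept.
description = ["a", "photo", "of", "a", "small", "red", "vintage", "car"]
target = encoder.encode(description).detach()

# Trainable embedding for the new token, plus a short prompt that uses it.
new_embedding = nn.Parameter(torch.randn(DIM) * 0.02)
prompt = ["a", "photo", "of", "<new>"]
opt = torch.optim.Adam([new_embedding], lr=0.1)

initial_loss = torch.nn.functional.mse_loss(
    encoder.encode(prompt, new_embedding), target).item()

for _ in range(500):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(
        encoder.encode(prompt, new_embedding), target)
    loss.backward()
    opt.step()

final_loss = loss.item()
```

Because the loss only requires forward passes through the (frozen) text encoder, each optimisation step is much cheaper than one that backpropagates through the denoising network, which is the intuition behind the speed-up claimed in the abstract. In a real setting, the learned embedding would then be registered under the new token so that prompts such as "a painting of <new>" can be used at generation time.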


Citation

Please use the following BibTeX entry to cite our paper:

@InProceedings{Everaert_2023_BMVC,
  title     = {{VETIM}: {E}xpanding the {V}ocabulary of {T}ext-to-{I}mage {M}odels only with {T}ext},
  author    = {Everaert, Martin Nicolas and Bocchio, Marco and Arpa, Sami and S\"usstrunk, Sabine and Achanta, Radhakrishna},
  booktitle = {34th British Machine Vision Conference 2023, {BMVC} 2023, Aberdeen, UK, November 20-24, 2023},
  publisher = {{BMVA}},
  month     = {November},
  year      = {2023},
  url       = {https://papers.bmvc2023.org/0016.pdf}
}