Visual Grounding for Object Questions

Introduces Visual Grounding for Object Questions (VGOQ), a new task: grounding the visual evidence or context that supports answering general questions about objects, beyond the directly visible elements.

To appear in CVPR 2026
Covariance Mismatch in Diffusion Models

Investigates the covariance mismatch between noise and data in diffusion models and its impact on image generation.

Preprint 2024
Exploiting the Signal-Leak Bias in Diffusion Models

Examines and leverages the signal-leak bias in diffusion models for improved image generation.

WACV 2024
Diffusion in Style

Customizes Stable Diffusion's output style by adapting the initial noise distribution, making style adaptation more sample-efficient and faster.

ICCV 2023
VETIM: Expanding the Vocabulary of Text-to-Image Models only with Text

Expands text-to-image models' vocabulary by learning new token embeddings from textual descriptions alone, without requiring sample images.

BMVC 2023
Estimating Image Depth in the Comics Domain

Estimates depth in comic book images by converting them to natural images and filtering out text to improve accuracy.

WACV 2022
Scene Relighting with Illumination Estimation in the Latent Space

Transfers lighting conditions between images by estimating and manipulating illumination in the latent space of an encoder-decoder network.

arXiv 2020
More works from Image and Visual Representation Lab (IVRL)

Also check out more works by our labmates at the Image and Visual Representation Lab (IVRL) at EPFL.


VETIM: Expanding the Vocabulary of Text-to-Image Models only with Text

1 EPFL, Switzerland
2 Largo.ai, Lausanne, Switzerland

Abstract

Text-to-image models, such as Stable Diffusion, can generate high-quality images from simple textual prompts. With methods such as Textual Inversion, it is possible to expand the vocabulary of these models with additional concepts, by learning the vocabulary embedding of new tokens. These methods have two limitations: slowness of optimisation and dependence on sample images. Slowness mainly stems from the use of the original text-to-image training loss, without considering potential auxiliary supervision terms. Relying on sample images enables learning new visual features but restricts the vocabulary expansion to concepts with pre-existing images. In response, we introduce a novel approach, named VETIM, which takes only a textual description of the concept as input. It expands the vocabulary through supervision only at the text encoder output, without accessing the image-generation part, making it faster at optimisation time. It also does not copy visual features from sample images. Our method can be used directly for applications that require a concept as a single token but do not require learning new visual features. Our approach shows that a mere textual description suffices to obtain a single token referring to a specific concept. To show the effectiveness of our method, we evaluate its performance subjectively and through objective measures. The results show that our approach is effective in expanding the vocabulary of text-to-image models without requiring images.
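The core idea above lends itself to a compact optimisation loop. The following is a minimal, self-contained sketch of that idea, not the paper's implementation: it uses a hypothetical toy text encoder (the paper works with Stable Diffusion's actual text encoder), and optimises a single new token embedding so that the frozen encoder produces the same output for a short prompt containing the token as for the full textual description of the concept. All names here (`ToyTextEncoder`, the vocabulary, the prompts) are illustrative assumptions.

```python
# Sketch of the VETIM idea with a hypothetical toy frozen text encoder.
# Supervision happens only at the text-encoder output; the image-generation
# part of the model is never touched.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy vocabulary; "<new>" is the token whose embedding we learn.
VOCAB = {"a": 0, "photo": 1, "of": 2, "small": 3, "red": 4,
         "vintage": 5, "car": 6, "<new>": 7}
DIM = 16

class ToyTextEncoder(nn.Module):
    """Stand-in for a frozen CLIP-style text encoder (mean-pool + projection)."""
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.proj = nn.Linear(dim, dim)

    def encode(self, words, new_vec=None):
        # Build the token-embedding sequence, substituting the trainable
        # vector for "<new>" so gradients flow only into that vector.
        vecs = [new_vec if w == "<new>" else self.emb.weight[VOCAB[w]]
                for w in words]
        return self.proj(torch.stack(vecs).mean(dim=0))

encoder = ToyTextEncoder(len(VOCAB), DIM)
for p in encoder.parameters():
    p.requires_grad_(False)  # the encoder itself stays frozen

# Target: encoder output for the full textual description of the concept.
description = ["a", "photo", "of", "a", "small", "red", "vintage", "car"]
target = encoder.encode(description).detach()

# Trainable embedding for the new token, plus a short prompt that uses it.
new_embedding = nn.Parameter(torch.randn(DIM) * 0.02)
prompt = ["a", "photo", "of", "<new>"]
opt = torch.optim.Adam([new_embedding], lr=0.1)

initial_loss = torch.nn.functional.mse_loss(
    encoder.encode(prompt, new_embedding), target).item()

for _ in range(500):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(
        encoder.encode(prompt, new_embedding), target)
    loss.backward()
    opt.step()

final_loss = loss.item()
```

Because the loss only requires forward passes through the (frozen) text encoder, each optimisation step is much cheaper than one that backpropagates through the denoising network, which is the intuition behind the speed-up claimed in the abstract. In a real setting, the learned embedding would then be registered under the new token so that prompts such as "a painting of <new>" can be used at generation time.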


Citation

Please use the following BibTeX entry to cite our paper:

@InProceedings{Everaert_2023_BMVC,
  title     = {{VETIM}: {E}xpanding the {V}ocabulary of {T}ext-to-{I}mage {M}odels only with {T}ext},
  author    = {Everaert, Martin Nicolas and Bocchio, Marco and Arpa, Sami and S\"usstrunk, Sabine and Achanta, Radhakrishna},
  booktitle = {34th British Machine Vision Conference 2023, {BMVC} 2023, Aberdeen, UK, November 20-24, 2023},
  publisher = {{BMVA}},
  month     = {November},
  year      = {2023},
  url       = {https://papers.bmvc2023.org/0016.pdf}
}