
Captions tab

The Captions tab is used to generate captions for images and videos.

(Screenshot: the Captions tab)

General Use

  • First, select a folder of images or videos to process at the top of the window.

  • To get started, select a model from the dropdown menu and hit "Load Model". Models can be downloaded using the built-in downloads manager on the Captions tab (📥💾 button); more info is in the readme_models.md file.

  • Select a "System Prompt" from the dropdown menu or use the "Custom" option to enter your own system prompt.

  • Set any other options as desired (they all have tooltips to explain what they do).

  • Hit "TEST FIRST IMAGE" or "TEST RANDOM IMAGE" to test on a single image.

  • Hit "START PROCESSING" to process all images in the selected folder.

In your output folder you will find, for each image/video, a .txt file containing the generated caption.
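If you want to post-process the results, the caption files can be paired back up with their source media. Below is a minimal sketch, assuming each caption .txt shares the basename of its image/video; the folder path and extension list are illustrative, not the app's actual output layout:

```python
from pathlib import Path

# Hypothetical output folder; use the folder you selected in the app.
output_dir = Path("output")
media_exts = {".jpg", ".jpeg", ".png", ".webp", ".mp4", ".webm"}

for media in sorted(output_dir.iterdir()):
    if media.suffix.lower() not in media_exts:
        continue
    caption_file = media.with_suffix(".txt")  # assumed naming scheme
    if caption_file.exists():
        caption = caption_file.read_text(encoding="utf-8").strip()
        print(f"{media.name}: {caption}")
```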

Optimizing Speed

  • Processing a single image can be slow (anywhere from a few seconds to tens of seconds), but there are several ways to speed things up (see the sketch after this list).
  • Increasing the "Batch Size" allows multiple images to be processed at once; experiment by raising this value as high as your VRAM allows.
  • Lowering the precision and/or selecting a different attention implementation and/or enabling torch compile may increase speed (only supported for non-GGUF models).
  • Some GGUF models may be faster as well, but they require manually downloading and installing the llama-cpp-python package; see readme_models.md for instructions. This is for more advanced users.
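To make these knobs concrete, here is a minimal sketch of lowering precision, switching the attention implementation, compiling, and probing for the largest batch size that fits in VRAM, using the Hugging Face transformers API. The checkpoint name and the caption_batch helper are illustrative assumptions, not the app's actual internals:

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration

# Hypothetical checkpoint; the app normally handles loading for you.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,               # lowered precision
    attn_implementation="flash_attention_2",  # alternative attention impl
    device_map="auto",
)
model = torch.compile(model)  # optional: speeds up repeated forward passes

def caption_batch(images):
    """Hypothetical stand-in for one batched captioning call."""
    ...

def find_max_batch_size(images, start=32):
    """Halve the candidate batch size until a test batch fits in VRAM."""
    batch_size = start
    while batch_size > 1:
        try:
            caption_batch(images[:batch_size])
            return batch_size
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()
            batch_size //= 2
    return 1
```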

Customizing System Prompt

  • System prompt files can be found as plain text files in the "prompts" folder
  • The app scans this folder at startup (a sketch of this logic follows the list)
  • You can select a predefined prompt from the list
  • Or use the "Custom" setting and type your system prompt in the app
  • Or create/edit the text files in the "prompts" folder
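The exact loading code is internal to the app, but the startup scan behaves roughly like this sketch (the function name and the use of the file stem as the dropdown label are assumptions):

```python
from pathlib import Path

def load_system_prompts(prompts_dir="prompts"):
    """Collect {name: prompt text} from the .txt files scanned at startup."""
    prompts = {}
    for path in sorted(Path(prompts_dir).glob("*.txt")):
        prompts[path.stem] = path.read_text(encoding="utf-8").strip()
    return prompts

# Drop a new .txt file into the "prompts" folder and restart the app
# to see it appear in the dropdown.
```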

Model Families

VisionCaptioner supports two Vision-Language Model families:

  • Qwen-VL (Qwen2.5-VL, Qwen3-VL) — controlled by the "Max Resolution" dropdown.
  • Google Gemma 4 (E2B, E4B, 26B-A4B, 31B) — controlled by the "Vision Tokens (Gemma)" dropdown, which sets the soft visual token budget per image (70, 140, 280, 560, or 1120). Higher values give more detail but are slower and use more VRAM. The "Max Resolution" setting is ignored for Gemma.

The correct settings panel becomes active automatically when you pick a model from the dropdown. Gemma 4's built-in "thinking" reasoning mode is automatically disabled for captioning.
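The switching logic is roughly equivalent to this sketch (the function and the substring test are illustrative assumptions, not the app's actual code):

```python
def active_vision_setting(model_name: str) -> str:
    """Decide which resolution control applies to the selected model."""
    if "gemma" in model_name.lower():
        return "vision_tokens"  # Gemma: soft visual token budget per image
    return "max_resolution"     # Qwen-VL: capped input resolution
```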

See readme_models.md for the full list of supported models and download links.