Upload an image, ask a question about it — ViLT answers instantly.
"Tontu ma" — Wolof for "answer me"
Tontu ma is a multimodal AI web app powered by ViLT (Vision-and-Language Transformer). It takes an image and a natural language question as input, and returns a direct answer by jointly processing image patches and text tokens inside a single transformer.
Unlike traditional VQA pipelines (which rely on a heavy CNN to extract visual features separately), ViLT encodes image patches directly as token sequences — making it lightweight and fast without sacrificing accuracy.
Model used: dandelin/vilt-b32-finetuned-vqa (HuggingFace)
ViLT processes both modalities inside a single transformer:
```
Image → patch embeddings ─┐
                          ├──► Transformer ──► classification head ──► answer
Text  → token embeddings ─┘
```
The answer is selected from a fixed vocabulary via argmax over logits — a classification problem rather than generation:
```python
outputs = model(**encoding)
idx = outputs.logits.argmax(-1).item()
answer = model.config.id2label[idx]
```

Why ViLT?
- No separate convolutional visual encoder — images go directly into the transformer as patch sequences
- Lower computational cost than two-stream architectures (e.g. ViLBERT, UNITER)
- Strong performance on standard VQA benchmarks (VQA v2)
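The patch tokenization behind the first point above can be sketched in plain numbers. ViLT-B/32 cuts the image into 32×32-pixel patches (the "b32" in the model name) and linearly projects each patch into one token, just like a word embedding; the 384×384 input resolution below is an assumption for illustration:

```python
# ViLT-B/32: each 32x32 pixel patch becomes one transformer token.
# Assuming a 384x384 input (a common ViLT resolution), the visual
# sequence stays short -- no CNN feature extractor needed.
image_size = 384
patch_size = 32
patches_per_side = image_size // patch_size   # 12
num_patches = patches_per_side ** 2           # 144 patch tokens
print(num_patches)
```

Those 144 patch tokens are simply concatenated with the question's text tokens and fed through the shared transformer.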
```
tontuma/
├── app.py             # Streamlit app — model loading, UI, inference
├── model_notes.md     # ViLT architecture notes
├── requirements.txt
├── LICENSE
└── README.md
```
```bash
# Clone the repository
git clone https://github.com/thiernodaoudaly/tontuma.git
cd tontuma

# Install dependencies
pip install -r requirements.txt

# Run the app
streamlit run app.py
# → http://localhost:8501
```

The ViLT model (~500 MB) is downloaded automatically from HuggingFace on first run and cached locally.
- Upload an image (PNG, JPG, or JPEG)
- Type a question about the image
- The app displays the image and returns the answer
Examples:
| Image | Question | Answer |
|---|---|---|
| A dog playing in the park | What animal is in the image? | dog |
| A red car on a street | What color is the car? | red |
| A plate of pasta | What food is shown? | pasta |
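Every answer above is picked the same way: the classification head scores a fixed answer vocabulary and the top-scoring label wins. A toy sketch of that lookup (the three-entry vocabulary and the scores are invented; the real model's `id2label` covers a few thousand answer classes):

```python
# Invented miniature vocabulary standing in for model.config.id2label
id2label = {0: "dog", 1: "red", 2: "pasta"}

logits = [0.1, 3.2, 0.4]  # hypothetical scores for "What color is the car?"
idx = max(range(len(logits)), key=logits.__getitem__)
answer = id2label[idx]
print(answer)  # → red
```

This is also why the app answers instantly: a single forward pass and an argmax, with no token-by-token generation.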
- Add multilingual support (French questions via a translation layer)
- Switch from classification to generative VQA (e.g. BLIP-2, LLaVA) for open-ended answers
- Add confidence score display alongside the answer
- Deploy on Streamlit Cloud or HuggingFace Spaces
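The confidence-score item above only needs a softmax over the logits the app already computes (in the app itself this would be `torch.softmax(outputs.logits, dim=-1)`); a stdlib-only sketch with invented logits:

```python
import math

def softmax(logits):
    # Numerically stable softmax: shift by the max before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.5, -1.0]  # invented classification-head scores
probs = softmax(logits)
idx = max(range(len(probs)), key=probs.__getitem__)
confidence = probs[idx]    # probability mass on the chosen answer
```

Displaying `confidence` next to the answer costs nothing extra, since the argmax index is unchanged.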
- Thierno Daouda LY
- Alimatou TALL
MIT License — see LICENSE for details.