Tontu ma — Visual Question Answering with ViLT

Python HuggingFace Streamlit License

Upload an image, ask a question about it — ViLT answers instantly.

"Tontu ma" — Wolof for "answer me"

Overview

Tontu ma is a multimodal AI web app powered by ViLT (Vision-and-Language Transformer). It takes an image and a natural language question as input, and returns a direct answer by jointly processing image patches and text tokens inside a single transformer.

Unlike traditional VQA pipelines (which rely on a heavy CNN to extract visual features separately), ViLT encodes image patches directly as token sequences — making it lightweight and fast without sacrificing accuracy.

Model — ViLT

Model used: dandelin/vilt-b32-finetuned-vqa (HuggingFace)

ViLT processes both modalities inside a single transformer:

Image  →  patch embeddings  ─┐
                              ├──► Transformer ──► classification head ──► answer
Text   →  token embeddings  ─┘

The answer is selected from a fixed vocabulary via argmax over logits — a classification problem rather than generation:

# encoding comes from the ViltProcessor: processor(image, question, return_tensors="pt")
outputs = model(**encoding)               # forward pass over image + question
idx = outputs.logits.argmax(-1).item()    # index of the highest-scoring answer class
answer = model.config.id2label[idx]       # map the class index back to its answer string
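The classification step can be sketched in isolation with a toy logits tensor, without downloading the checkpoint. The `id2label` entries below are illustrative stand-ins (the real checkpoint has roughly 3,000 candidate answers):

```python
import torch

# Toy stand-in for the model's answer vocabulary — illustrative labels only,
# not the real checkpoint's id2label mapping.
id2label = {0: "yes", 1: "no", 2: "dog", 3: "red", 4: "pasta"}

logits = torch.tensor([[0.1, 0.2, 3.5, 0.4, 0.3]])  # one score per candidate answer
idx = logits.argmax(-1).item()                      # pick the highest-scoring class
answer = id2label[idx]
print(answer)  # → dog
```

The real model works the same way: the head outputs one logit per answer in a fixed vocabulary, and inference is a single argmax.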

Why ViLT?

  • No separate convolutional visual encoder — images go directly into the transformer as patch sequences
  • Lower computational cost than two-stream architectures (e.g. ViLBERT, UNITER)
  • Strong performance on standard VQA benchmarks (VQA v2)

Project Structure

tontuma/
├── app.py              # Streamlit app — model loading, UI, inference
├── model_notes.md      # ViLT architecture notes
├── requirements.txt
├── LICENSE
└── README.md

Setup & Usage

# Clone the repository
git clone https://github.com/thiernodaoudaly/tontuma.git
cd tontuma

# Install dependencies
pip install -r requirements.txt

# Run the app
streamlit run app.py
# → http://localhost:8501

The ViLT model (~500MB) is downloaded automatically from HuggingFace on first run and cached locally.


How to Use

  1. Upload an image (PNG, JPG, or JPEG)
  2. Type a question about the image
  3. The app displays the image and returns the answer
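The upload step accepts only PNG, JPG, and JPEG. A check of the kind the app presumably performs before running inference might look like this (illustrative sketch; the function name is hypothetical, not taken from app.py):

```python
from pathlib import Path

ALLOWED = {".png", ".jpg", ".jpeg"}  # extensions accepted by the uploader

def is_supported_image(filename: str) -> bool:
    """Return True if the file extension is one the app accepts."""
    return Path(filename).suffix.lower() in ALLOWED

print(is_supported_image("park_dog.JPG"))  # → True
print(is_supported_image("diagram.gif"))   # → False
```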

Example:

Image                        Question                        Answer
A dog playing in the park    What animal is in the image?    dog
A red car on a street        What color is the car?          red
A plate of pasta             What food is shown?             pasta

Roadmap

  • Add multilingual support (French questions via translation layer)
  • Switch from classification to generative VQA (e.g. BLIP-2, LLaVA) for open-ended answers
  • Add confidence score display alongside the answer
  • Deploy on Streamlit Cloud or HuggingFace Spaces
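Of the items above, the confidence score is a small extension of the existing argmax step: a softmax over the logits turns the scores into probabilities, and the probability of the predicted class serves as the confidence. A sketch with toy values (illustrative, not the app's current code):

```python
import torch

# Toy logits, one per candidate answer (same shape the VQA head would produce).
logits = torch.tensor([[0.1, 0.2, 3.5, 0.4, 0.3]])

probs = logits.softmax(-1)           # normalize scores into probabilities
confidence, idx = probs.max(-1)      # probability and index of the top answer

print(f"class {idx.item()} with confidence {confidence.item():.2f}")
```

Displaying `confidence` next to the answer would make low-certainty predictions visible to the user.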

Contributors

  • Thierno Daouda LY
  • Alimatou TALL

License

MIT License — see LICENSE for details.
