Upload an image, ask a question about it — ViLT answers instantly.
"Tontu ma" — Wolof for "answer me"
Tontu ma is a multimodal AI web app powered by ViLT (Vision-and-Language Transformer). It takes an image and a natural language question as input, and returns a direct answer by jointly processing image patches and text tokens inside a single transformer.
Unlike traditional VQA pipelines (which rely on a heavy CNN to extract visual features separately), ViLT encodes image patches directly as token sequences — making it lightweight and fast without sacrificing accuracy.
Model used: dandelin/vilt-b32-finetuned-vqa (HuggingFace)
ViLT processes both modalities inside a single transformer:
```
Image → patch embeddings ─┐
                          ├──► Transformer ──► classification head ──► answer
Text  → token embeddings ─┘
```
The answer is selected from a fixed vocabulary via argmax over logits — a classification problem rather than generation:
```python
outputs = model(**encoding)
idx = outputs.logits.argmax(-1).item()
answer = model.config.id2label[idx]
```

Why ViLT?
- No separate convolutional visual encoder — images go directly into the transformer as patch sequences
- Lower computational cost than two-stream architectures (e.g. ViLBERT, UNITER)
- Strong performance on standard VQA benchmarks (VQA v2)
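The patch tokenization behind the first point above can be sketched in plain numbers. ViLT-B/32 cuts the image into 32×32-pixel patches (the "b32" in the model name) and linearly projects each patch into one token, just like a word embedding; the 384×384 input resolution below is an assumption for illustration:

```python
# ViLT-B/32: each 32x32 pixel patch becomes one transformer token.
# Assuming a 384x384 input (a common ViLT resolution), the visual
# sequence stays short -- no CNN feature extractor needed.
image_size = 384
patch_size = 32
patches_per_side = image_size // patch_size   # 12
num_patches = patches_per_side ** 2           # 144 patch tokens
print(num_patches)
```

Those 144 patch tokens are simply concatenated with the question's text tokens and fed through the shared transformer.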
```
tontuma/
├── app.py             # Streamlit app — model loading, UI, inference
├── model_notes.md     # ViLT architecture notes
├── requirements.txt
├── LICENSE
└── README.md
```
```bash
# Clone the repository
git clone https://github.com/thiernodaoudaly/tontuma.git
cd tontuma

# Install dependencies
pip install -r requirements.txt

# Run the app
streamlit run app.py
# → http://localhost:8501
```

The ViLT model (~500 MB) is downloaded automatically from HuggingFace on first run and cached locally.
- Upload an image (PNG, JPG, or JPEG)
- Type a question about the image
- The app displays the image and returns the answer
Examples:
| Image | Question | Answer |
|---|---|---|
| A dog playing in the park | What animal is in the image? | dog |
| A red car on a street | What color is the car? | red |
| A plate of pasta | What food is shown? | pasta |
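Every answer above is picked the same way: the classification head scores a fixed answer vocabulary and the top-scoring label wins. A toy sketch of that lookup (the three-entry vocabulary and the scores are invented; the real model's `id2label` covers a few thousand answer classes):

```python
# Invented miniature vocabulary standing in for model.config.id2label
id2label = {0: "dog", 1: "red", 2: "pasta"}

logits = [0.1, 3.2, 0.4]  # hypothetical scores for "What color is the car?"
idx = max(range(len(logits)), key=logits.__getitem__)
answer = id2label[idx]
print(answer)  # → red
```

This is also why the app answers instantly: a single forward pass and an argmax, with no token-by-token generation.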
- Add multilingual support (French questions via a translation layer)
- Switch from classification to generative VQA (e.g. BLIP-2, LLaVA) for open-ended answers
- Add confidence score display alongside the answer
- Deploy on Streamlit Cloud or HuggingFace Spaces
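The confidence-score item above only needs a softmax over the logits the app already computes (in the app itself this would be `torch.softmax(outputs.logits, dim=-1)`); a stdlib-only sketch with invented logits:

```python
import math

def softmax(logits):
    # Numerically stable softmax: shift by the max before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.5, -1.0]  # invented classification-head scores
probs = softmax(logits)
idx = max(range(len(probs)), key=probs.__getitem__)
confidence = probs[idx]    # probability mass on the chosen answer
```

Displaying `confidence` next to the answer costs nothing extra, since the argmax index is unchanged.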
- Thierno Daouda LY
- Alimatou TALL
MIT License — see LICENSE for details.