RealTimeVQACaptioning

A real-time image captioning and visual question answering (VQA) system.
This project combines computer vision and NLP to generate descriptive captions for images and answer user questions about them.

🚀 Setup

Clone and install dependencies in a fresh virtual environment:

git clone https://github.com/<your-username>/RealTimeVQACaptioning.git
cd RealTimeVQACaptioning
python -m venv venv
source venv/bin/activate   # (Linux/Mac)
venv\Scripts\activate      # (Windows)
pip install -r requirements.txt

🧩 Features

  • Real-time video/image caption generation
  • Visual Question Answering (VQA) module with co-attention mechanism
  • Deep learning pipeline based on CNN/ResNet, Transformers, Faster R-CNN/DETR
  • Modular code for extensibility and research
  • Built using PyTorch and Hugging Face tools
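The co-attention mechanism used by the VQA module can be sketched with PyTorch's built-in multi-head attention: question tokens attend over image regions, and image regions attend over question tokens. This is a minimal illustration only; the class name, feature dimension, and head count are assumptions, not the repository's actual implementation.

```python
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    """Bidirectional cross-attention between image regions and question tokens.
    Illustrative sketch: dimensions and layer layout are placeholders."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.img_to_q = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.q_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_feats, q_feats):
        # Question tokens query the image regions, and vice versa.
        q_attended, _ = self.img_to_q(q_feats, img_feats, img_feats)
        img_attended, _ = self.q_to_img(img_feats, q_feats, q_feats)
        return img_attended, q_attended

# Batch of 2: 36 detected regions per image, 12 tokens per question.
img = torch.randn(2, 36, 256)
q = torch.randn(2, 12, 256)
img_out, q_out = CoAttention()(img, q)
print(img_out.shape, q_out.shape)
```

Each attended output keeps the shape of its own modality, so the fused features can feed directly into an answer classifier.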

Topics and Technologies

  • Computer Vision
  • NLP & Deep Learning
  • CNN, Transformer, VQA, OpenAI models
  • Object Detection, Video Processing

📊 Pipeline Overview

Pipeline Diagram

Pipeline Steps

  • Input: Accepts image or video streams.
  • FeatureExtractor: Extracts visual features using CNN/ResNet backbones.
  • ObjectDetector: Detects objects via Faster R-CNN or DETR models.
  • CaptionEncoder: Processes extracted features with a Transformer-based encoder.
  • CaptionDecoder: Generates natural language captions from encoded features.
  • VQAModule: Handles Visual Question Answering by encoding questions, applying co-attention, and predicting answers.
  • VideoOverlay: Superimposes generated captions and VQA answers onto the original video or image frames.
  • Output: Produces fully annotated frames or processed video as the system output.
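The stages above can be traced end to end with a stub pipeline. Every function here is a labelled placeholder standing in for the corresponding model call; the names and return values are illustrative, not the repository's actual API.

```python
# Stubs for each pipeline stage; each returns a fixed placeholder so the
# data flow can be followed without loading any models.
def extract_features(frame):             # FeatureExtractor (CNN/ResNet)
    return {"source": frame, "features": "resnet_features"}

def detect_objects(feats):               # ObjectDetector (Faster R-CNN/DETR)
    return ["person", "dog"]

def encode(feats, objects):              # CaptionEncoder (Transformer)
    return {"feats": feats, "objects": objects}

def decode(encoded):                     # CaptionDecoder
    return "a person walking a dog"

def answer_question(encoded, question):  # VQAModule (co-attention)
    return "yes"

def overlay(frame, caption, answer):     # VideoOverlay
    return {"frame": frame, "caption": caption, "answer": answer}

def run_pipeline(frame, question=None):
    """Run one frame through every stage; VQA is skipped without a question."""
    feats = extract_features(frame)
    objects = detect_objects(feats)
    encoded = encode(feats, objects)
    caption = decode(encoded)
    answer = answer_question(encoded, question) if question else None
    return overlay(frame, caption, answer)

result = run_pipeline("frame_001.jpg", "Is there a dog?")
print(result["caption"], "|", result["answer"])
```

In the real system each stub would be replaced by a model forward pass, which is what makes the pipeline modular: stages can be swapped (e.g. Faster R-CNN for DETR) without touching the surrounding flow.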

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.


🏷️ Keywords

ai-project · caption-generation · visual-question-answering · deep-learning · pytorch · transformer · object-detection · nlp · video-processing
