A real-time image captioning and visual question answering (VQA) system.
This project combines computer vision and NLP to generate descriptive captions for images and answer user questions about them.
Clone and install dependencies in a fresh virtual environment:
git clone https://github.com/<your-username>/RealTimeVQACaptioning.git
cd RealTimeVQACaptioning
python -m venv venv
source venv/bin/activate   # Linux/macOS
venv\Scripts\activate      # Windows
pip install -r requirements.txt

- Real-time video/image caption generation
- Visual Question Answering (VQA) module with co-attention mechanism
- Deep learning pipeline based on CNN/ResNet, Transformers, Faster R-CNN/DETR
- Modular code for extensibility and research
- Built using PyTorch and Hugging Face tools
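The co-attention mechanism mentioned above can be sketched in a few lines. This is a simplified, NumPy-only illustration of parallel co-attention (image features attend to the question and vice versa); it omits the learned projection layers a real PyTorch module would have, and all shapes and names here are illustrative assumptions, not the project's actual API.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(V, Q):
    """Simplified parallel co-attention.

    V: (n_regions, d) image region features
    Q: (n_words, d)   question word features
    Returns one attended image vector and one attended question vector.
    """
    C = Q @ V.T                    # (n_words, n_regions) affinity matrix
    a_v = softmax(C.max(axis=0))   # attention weights over image regions
    a_q = softmax(C.max(axis=1))   # attention weights over question words
    v_hat = a_v @ V                # attention-weighted image feature
    q_hat = a_q @ Q                # attention-weighted question feature
    return v_hat, q_hat
```

In the full model, `v_hat` and `q_hat` would be fused (e.g., concatenated or summed) and passed to an answer classifier.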
- Computer Vision
- NLP & Deep Learning
- CNN, Transformer, VQA, OpenAI models
- Object Detection, Video Processing
- Input: Accepts image or video streams.
- FeatureExtractor: Extracts visual features using CNN/ResNet backbones.
- ObjectDetector: Detects objects via Faster R-CNN or DETR models.
- CaptionEncoder: Processes extracted features with a Transformer-based encoder.
- CaptionDecoder: Generates natural language captions from encoded features.
- VQAModule: Handles Visual Question Answering by encoding questions, applying co-attention, and predicting answers.
- VideoOverlay: Superimposes generated captions and VQA answers onto the original video or image frames.
- Output: Emits fully annotated frames or a processed video file.
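The stages above compose into a single pass per frame. The skeleton below mirrors the module names from the list; the class bodies are placeholder stubs (the real modules would wrap PyTorch / Hugging Face models), so every return value here is an illustrative stand-in rather than actual project code.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    pixels: bytes  # raw image data (placeholder)

class FeatureExtractor:
    def extract(self, frame):
        return {"features": "cnn_features"}  # stand-in for a ResNet feature map

class ObjectDetector:
    def detect(self, frame):
        return [{"label": "person", "box": (0, 0, 10, 10)}]  # stand-in detections

class CaptionEncoder:
    def encode(self, features):
        return {"encoded": features}  # stand-in Transformer encoding

class CaptionDecoder:
    def decode(self, encoded):
        return "a person in the frame"  # stand-in generated caption

class VQAModule:
    def answer(self, encoded, question):
        return "yes"  # stand-in predicted answer

class VideoOverlay:
    def render(self, frame, caption, answer=None):
        text = caption if answer is None else f"{caption} | Q&A: {answer}"
        return {"frame": frame, "overlay": text}

def run_pipeline(frame, question=None):
    feats = FeatureExtractor().extract(frame)
    _objects = ObjectDetector().detect(frame)  # detections can refine captions
    encoded = CaptionEncoder().encode(feats)
    caption = CaptionDecoder().decode(encoded)
    answer = VQAModule().answer(encoded, question) if question else None
    return VideoOverlay().render(frame, caption, answer)
```

Keeping each stage behind its own class is what makes the pipeline extensible: swapping DETR for Faster R-CNN, for example, only touches `ObjectDetector`.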
This project is licensed under the MIT License. See the LICENSE file for details.
ai-project ・ caption-generation ・ visual-question-answering ・ deep-learning ・ pytorch ・ transformer ・ object-detection ・ nlp ・ video-processing
