
Vision-Based Object Detection & 2D Pose Estimation

Description

This project explores both classical and deep learning approaches to object detection, along with a basic 2D pose estimation pipeline. I wanted to understand how techniques like HOG features and sliding windows actually work before jumping straight to neural networks.

The project uses synthetic data (generated with OpenCV) since I don't have access to large labeled datasets like COCO or ImageNet. Everything runs on the CPU, so training is fairly slow, but it's enough to demonstrate the concepts.

Tech Stack

  • Python 3.8+
  • PyTorch - for the CNN and pose estimation models
  • OpenCV (cv2) - image loading, preprocessing, drawing
  • scikit-learn - SVM classifier, evaluation metrics
  • scikit-image - HOG feature extraction
  • Matplotlib - plotting results and metrics
  • NumPy - array operations everywhere

Setup Instructions

  1. Clone the repo (or just download it):

    git clone <repo-url>
    cd Vision-Based-Object-Detection
    
  2. (Optional but recommended) Create a virtual environment:

    python -m venv venv
    source venv/bin/activate    # on Linux/Mac
    venv\Scripts\activate       # on Windows
    
  3. Install all dependencies:

    pip install -r requirements.txt
    

That should be it! The requirements.txt has all the specific version constraints.

Usage

Run the full pipeline from start to finish:

python main.py --all

Just generate synthetic training data:

python main.py --generate-data

Train and evaluate both models (requires data already generated):

python main.py --train

Run evaluation only (requires trained models):

python main.py --evaluate

Run pose estimation demo:

python main.py --pose

There's also a Jupyter notebook for a more interactive walkthrough:

jupyter notebook notebooks/pipeline_demo.ipynb
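For reference, the command-line flags above could be wired up with argparse roughly like this. This is a plausible sketch, not the repo's actual main.py:

```python
import argparse

def build_parser():
    """Mirror the documented CLI flags (illustrative sketch)."""
    p = argparse.ArgumentParser(description="Vision-based object detection pipeline")
    p.add_argument("--all", action="store_true", help="run the full pipeline")
    p.add_argument("--generate-data", action="store_true", help="generate synthetic training data")
    p.add_argument("--train", action="store_true", help="train and evaluate both models")
    p.add_argument("--evaluate", action="store_true", help="evaluate trained models only")
    p.add_argument("--pose", action="store_true", help="run the pose estimation demo")
    return p

args = build_parser().parse_args(["--train"])
```

Each flag is a simple boolean switch, so they can also be combined in one invocation if desired.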

Pipeline Architecture

The pipeline goes through these stages:

Raw Images
    |
    v
Preprocessing (resize, normalize, augmentation)
    |
    +---> Classical Path: HOG Features --> SVM Classifier
    |
    +---> Deep Learning Path: SimpleCNN (end-to-end)
    |
    v
Evaluation (Precision, Recall, F1, Confusion Matrix)
    |
    v
Pose Estimation (SimplePoseNet predicts heatmaps --> keypoints)
    |
    v
Visualization (save plots to results/)

The classical approach extracts HOG (Histogram of Oriented Gradients) features and feeds them to an SVM. The deep learning approach trains a simple CNN directly on pixel values. Both get compared at the end.
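The classical path can be sketched in a few lines with scikit-image and scikit-learn. The toy data and parameter choices below are illustrative; the repo's actual feature settings may differ:

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

# Toy data: 20 random 64x64 grayscale images in two fake classes (illustrative only).
rng = np.random.default_rng(0)
images = rng.random((20, 64, 64))
labels = np.array([0] * 10 + [1] * 10)

# HOG: histograms of gradient orientations over local cells, block-normalized.
features = np.array([
    hog(img, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
    for img in images
])

clf = LinearSVC()          # linear SVM on the HOG feature vectors
clf.fit(features, labels)
preds = clf.predict(features)
```

The key contrast with the CNN path is that the feature representation (HOG) is fixed and hand-designed here, while the CNN learns its features from raw pixels during training.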

For pose estimation, a small encoder-decoder network predicts Gaussian heatmaps for each of the 17 COCO keypoints, then we take the argmax to get the keypoint coordinates.
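Decoding keypoints from predicted heatmaps is essentially a per-channel argmax. A minimal NumPy sketch (the repo's decoder may differ in details such as sub-pixel refinement):

```python
import numpy as np

def heatmaps_to_keypoints(heatmaps):
    """Convert (K, H, W) per-keypoint score maps to (K, 2) (x, y) pixel coords."""
    K, H, W = heatmaps.shape
    flat_idx = heatmaps.reshape(K, -1).argmax(axis=1)  # argmax per channel
    ys, xs = np.unravel_index(flat_idx, (H, W))        # back to 2D indices
    return np.stack([xs, ys], axis=1)

# One peak per channel for 17 COCO keypoints (Gaussian peaks in practice).
hm = np.zeros((17, 64, 64))
hm[0, 10, 20] = 1.0  # keypoint 0 peaks at row 10, column 20, i.e. (x=20, y=10)
kps = heatmaps_to_keypoints(hm)
# kps[0] == [20, 10]
```

Note the (x, y) vs. (row, column) swap: the argmax lives in array-index space, so the row index becomes y and the column index becomes x.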

Results

See the results/ folder after running the pipeline. It will contain:

  • training_history.png - CNN loss/accuracy curves
  • confusion_matrix_svm.png - SVM confusion matrix
  • confusion_matrix_cnn.png - CNN confusion matrix
  • metrics_comparison.png - bar chart comparing both models
  • hog_visualization.png - example HOG features
  • pose_estimation.png - example pose estimation output
  • benchmark_results.csv - numbers in a table

Note: since we're using synthetic data, the numbers will be very high (>90% usually). Real-world performance would be much lower.

Limitations

  • Synthetic data: The dataset is generated programmatically, not real images. This means the models might not generalize at all to real photos.
  • Simple CNN: The CNN architecture is very basic - just 3 conv layers. Real detection systems use much deeper networks (ResNet, EfficientNet, etc.).
  • No real pose data: The pose estimation is trained/tested on stick figures. Real pose estimation requires annotated datasets like COCO or MPII.
  • CPU only: Training is slow because everything runs on CPU. No mixed precision or any speed optimizations.
  • No sliding window NMS tuning: The non-max suppression thresholds were just picked by hand and not properly tuned.
  • Single scale only: The sliding window detector runs at a single scale; there is no image pyramid for multi-scale detection.
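For context on the NMS limitation above, here is a minimal greedy IoU-based non-max suppression of the kind those hand-picked thresholds control. This is an illustrative sketch, not the repo's exact code:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop others overlapping it above thresh."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < thresh]
    return keep

keep = nms([(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)], [0.9, 0.8, 0.7])
# keep == [0, 2]: the second box overlaps the first heavily and is suppressed
```

"Tuning" here means choosing `thresh` (and the detector's score cutoff) against a validation set rather than by eye, which is exactly what this project skips.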


About

Classical (HOG+SVM) vs. deep learning (CNN) object detection pipeline with 2D pose estimation via heatmap regression — built from scratch in PyTorch and OpenCV.
