Vision-Based Object Detection & 2D Pose Estimation

Description

This project explores both classical and deep learning approaches to object detection, and implements a basic 2D pose estimation pipeline. I wanted to understand how techniques like HOG features and sliding windows work before jumping straight to neural networks.

The project uses synthetic data generated with OpenCV, since I don't have access to large labeled datasets like COCO or ImageNet. Everything runs on CPU, so training is slow, but it is enough to demonstrate the concepts.

Tech Stack

Python 3.8+
PyTorch - for the CNN and pose estimation models
OpenCV (cv2) - image loading, preprocessing, drawing
scikit-learn - SVM classifier, evaluation metrics
scikit-image - HOG feature extraction
Matplotlib - plotting results and metrics
NumPy - array operations everywhere

Setup Instructions

Clone the repo (or just download it):

git clone <repo-url>
cd Vision-Based-Object-Detection

(Optional but recommended) Create a virtual environment:

python -m venv venv
source venv/bin/activate    # on Linux/Mac
venv\Scripts\activate       # on Windows

Install all dependencies:
```
pip install -r requirements.txt
```

That should be it! The requirements.txt has all the specific version constraints.

Usage

Run the full pipeline from start to finish:

python main.py --all

Just generate synthetic training data:

python main.py --generate-data

Train and evaluate both models (requires data already generated):

python main.py --train

Run evaluation only (requires trained models):

python main.py --evaluate

Run pose estimation demo:

python main.py --pose

There's also a Jupyter notebook for a more interactive walkthrough:

jupyter notebook notebooks/pipeline_demo.ipynb

Pipeline Architecture

The pipeline goes through these stages:

Raw Images
    |
    v
Preprocessing (resize, normalize, augmentation)
    |
    +---> Classical Path: HOG Features --> SVM Classifier
    |
    +---> Deep Learning Path: SimpleCNN (end-to-end)
    |
    v
Evaluation (Precision, Recall, F1, Confusion Matrix)
    |
    v
Pose Estimation (SimplePoseNet predicts heatmaps --> keypoints)
    |
    v
Visualization (save plots to results/)

The classical approach extracts HOG (Histogram of Oriented Gradients) features and feeds them to an SVM. The deep learning approach trains a simple CNN directly on pixel values. Both get compared at the end.
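The classical path can be sketched roughly as follows, using `skimage.feature.hog` and `sklearn.svm.LinearSVC`. The toy bar images, the HOG parameters, and the choice of a linear SVM are assumptions for illustration only; the repo's actual settings may differ.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC


def make_bar_image(vertical, size=32, rng=None):
    """Toy image: one bright bar on a dark, slightly noisy background."""
    if rng is None:
        rng = np.random.default_rng()
    img = rng.normal(0.0, 0.02, size=(size, size))
    pos = int(rng.integers(8, size - 8))
    if vertical:
        img[:, pos:pos + 3] += 1.0
    else:
        img[pos:pos + 3, :] += 1.0
    return img


def extract_hog(images):
    """Compute one HOG descriptor per grayscale image."""
    return np.array([
        hog(img, orientations=9, pixels_per_cell=(8, 8),
            cells_per_block=(2, 2), block_norm="L2-Hys")
        for img in images
    ])


rng = np.random.default_rng(0)
labels = np.arange(200) % 2  # 0 = horizontal bar, 1 = vertical bar
images = [make_bar_image(bool(lbl), rng=rng) for lbl in labels]
features = extract_hog(images)

# Train on the first 150 samples, hold out the remaining 50.
clf = LinearSVC(C=1.0)
clf.fit(features[:150], labels[:150])
accuracy = clf.score(features[150:], labels[150:])
print(f"HOG feature length: {features.shape[1]}, held-out accuracy: {accuracy:.2f}")
```

HOG summarizes local edge orientations, so a task where the classes differ mainly by edge direction is close to a best case for it. The CNN path has to learn comparable filters from the pixels itself.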

For pose estimation, a small encoder-decoder network predicts one Gaussian heatmap for each of the 17 COCO keypoints. Taking the argmax of each heatmap gives that keypoint's coordinates.
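The heatmap encoding and argmax decoding step can be sketched in NumPy like this. The helper names (`gaussian_heatmap`, `decode_keypoints`), the `sigma` value, and the 64x64 heatmap size are assumptions for this example, not taken from SimplePoseNet.

```python
import numpy as np

NUM_KEYPOINTS = 17  # COCO keypoint count


def gaussian_heatmap(height, width, cx, cy, sigma=2.0):
    """Target heatmap with a Gaussian bump of peak value 1 at (cx, cy)."""
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))


def decode_keypoints(heatmaps):
    """Convert (K, H, W) heatmaps into (K, 2) integer (x, y) coordinates.

    Also returns the peak value of each heatmap, which can serve as a
    rough confidence score.
    """
    k, h, w = heatmaps.shape
    flat = heatmaps.reshape(k, -1)
    idx = flat.argmax(axis=1)
    ys, xs = np.unravel_index(idx, (h, w))
    coords = np.stack([xs, ys], axis=1)
    scores = flat[np.arange(k), idx]
    return coords, scores


# Round trip: encode known keypoints as heatmaps, then decode them again.
rng = np.random.default_rng(0)
true_xy = rng.integers(5, 59, size=(NUM_KEYPOINTS, 2))
heatmaps = np.stack([gaussian_heatmap(64, 64, x, y) for x, y in true_xy])
coords, scores = decode_keypoints(heatmaps)
```

If the network outputs heatmaps at a lower resolution than the input image, the decoded coordinates need to be scaled by the same factor before drawing them on the image.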

Results

See the results/ folder after running the pipeline. It will contain:

training_history.png - CNN loss/accuracy curves
confusion_matrix_svm.png - SVM confusion matrix
confusion_matrix_cnn.png - CNN confusion matrix
metrics_comparison.png - bar chart comparing both models
hog_visualization.png - example HOG features
pose_estimation.png - example pose estimation output
benchmark_results.csv - the benchmark metrics for both models in tabular form

Note: because the data is synthetic, the scores will be very high (usually above 90%). Expect much lower performance on real-world images.
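For reference, the precision, recall, F1, and confusion matrix values behind these outputs can be computed along these lines. This is a plain NumPy sketch with made-up labels; the helper names are hypothetical, and scikit-learn's `confusion_matrix` and `precision_recall_fscore_support` cover the same ground.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm


def per_class_metrics(cm):
    """Per-class precision, recall, and F1 from a confusion matrix."""
    tp = np.diag(cm).astype(float)
    predicted = cm.sum(axis=0)  # column sums: how often each class was predicted
    actual = cm.sum(axis=1)     # row sums: how often each class actually occurred
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return precision, recall, f1


# Made-up labels for three classes, purely for illustration.
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2, 2]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
precision, recall, f1 = per_class_metrics(cm)
```

Classes that are never predicted (or never occur) get a score of 0 here instead of raising a division warning.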

Limitations

Synthetic data: The dataset is generated programmatically, not real images. This means the models might not generalize at all to real photos.
Simple CNN: The CNN architecture is very basic - just 3 conv layers. Real detection systems use much deeper networks (ResNet, EfficientNet, etc.).
No real pose data: The pose estimation is trained/tested on stick figures. Real pose estimation requires annotated datasets like COCO or MPII.
CPU only: Training is slow because everything runs on CPU. No mixed precision or any speed optimizations.
No NMS tuning: The non-max suppression thresholds for the sliding window detector were picked by hand and not properly tuned.
Single scale only: The sliding window detection only handles a single scale properly.
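To make the NMS limitation concrete, here is a minimal sketch of greedy non-max suppression in NumPy. The function names, the `(x1, y1, x2, y2)` box format, and the default threshold are assumptions for this example; the repo's own implementation may be structured differently.

```python
import numpy as np


def iou_one_to_many(box, boxes):
    """IoU between one box and an (N, 4) array of boxes, all (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes overlapping it, repeat."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        if rest.size == 0:
            break
        ious = iou_one_to_many(boxes[best], boxes[rest])
        order = rest[ious <= iou_threshold]
    return keep


# Two heavily overlapping detections and one separate detection.
boxes = [[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]]
scores = [0.9, 0.8, 0.7]
kept_default = nms(boxes, scores, iou_threshold=0.5)
kept_loose = nms(boxes, scores, iou_threshold=0.9)
```

The first two boxes have an IoU of about 0.82, so the threshold decides whether the weaker one survives. Tuning it properly would mean sweeping `iou_threshold` and scoring each setting on a held-out validation set instead of picking a value by hand.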

References

Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. CVPR.
LeCun, Y., et al. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE.
OpenCV documentation: https://docs.opencv.org/
PyTorch tutorials: https://pytorch.org/tutorials/
scikit-image HOG docs: https://scikit-image.org/docs/stable/api/skimage.feature.html
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning.
Cao, Z., et al. (2017). Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. CVPR.
