This is a computer vision project I built to explore both classical and deep learning approaches to object detection, and to implement a basic 2D pose estimation pipeline. I wanted to understand how things like HOG features and sliding windows work before jumping straight to neural networks.
The project uses synthetic data (generated with OpenCV) since I don't have access to large labeled datasets like COCO or ImageNet. Everything runs on CPU, so training is slow, but that is enough to demonstrate the concepts.
- Python 3.8+
- PyTorch - for the CNN and pose estimation models
- OpenCV (cv2) - image loading, preprocessing, drawing
- scikit-learn - SVM classifier, evaluation metrics
- scikit-image - HOG feature extraction
- Matplotlib - plotting results and metrics
- NumPy - array operations everywhere
Clone the repo (or just download it):
git clone <repo-url>
cd Vision-Based-Object-Detection
(Optional but recommended) Create a virtual environment:
python -m venv venv
source venv/bin/activate  # on Linux/Mac
venv\Scripts\activate  # on Windows
Install all dependencies:
pip install -r requirements.txt
That should be it! The requirements.txt has all the specific version constraints.
Run the full pipeline from start to finish:
python main.py --all
Just generate synthetic training data:
python main.py --generate-data
Train and evaluate both models (requires data already generated):
python main.py --train
Run evaluation only (requires trained models):
python main.py --evaluate
Run pose estimation demo:
python main.py --pose
There's also a Jupyter notebook for a more interactive walkthrough:
jupyter notebook notebooks/pipeline_demo.ipynb
The pipeline goes through these stages:
Raw Images
|
v
Preprocessing (resize, normalize, augmentation)
|
+---> Classical Path: HOG Features --> SVM Classifier
|
+---> Deep Learning Path: SimpleCNN (end-to-end)
|
v
Evaluation (Precision, Recall, F1, Confusion Matrix)
|
v
Pose Estimation (SimplePoseNet predicts heatmaps --> keypoints)
|
v
Visualization (save plots to results/)
The classical approach extracts HOG (Histogram of Oriented Gradients) features and feeds them to an SVM. The deep learning approach trains a simple CNN directly on pixel values. Both get compared at the end.
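To make the classical path concrete, here is a minimal sketch of HOG features feeding a linear SVM, using scikit-image and scikit-learn. It is not the code from this repo: `make_toy_image` is a hypothetical stand-in for the synthetic data generator, and the HOG parameters (9 orientations, 8x8 cells, 2x2 blocks) and the 64x64 image size are assumed values in the style of Dalal and Triggs, not settings taken from the project.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC


def make_toy_image(has_object, rng, size=64):
    """Hypothetical synthetic sample: low-level noise, plus a bright square for positives."""
    img = rng.uniform(0.0, 0.2, size=(size, size))
    if has_object:
        top, left = rng.integers(8, size - 32, size=2)
        img[top:top + 24, left:left + 24] = 1.0
    return img


def extract_hog(img):
    # Assumed parameters; the project's actual settings may differ.
    return hog(img, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm="L2-Hys")


rng = np.random.default_rng(0)
labels = np.array([i % 2 for i in range(80)])  # alternate negative / positive
features = np.array([extract_hog(make_toy_image(bool(y), rng)) for y in labels])

# Train on the first 60 samples and hold out the last 20.
clf = LinearSVC(C=1.0, max_iter=5000)
clf.fit(features[:60], labels[:60])
accuracy = clf.score(features[60:], labels[60:])
print(f"feature length: {features.shape[1]}, held-out accuracy: {accuracy:.2f}")
```

The point of the sketch is the split of responsibilities: HOG turns each image into a fixed-length vector of local gradient histograms, and the SVM only ever sees those vectors, never raw pixels. The CNN path skips the hand-designed feature step and learns its own filters.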
For pose estimation, a small encoder-decoder network predicts Gaussian heatmaps for each of the 17 COCO keypoints, then we take the argmax to get the keypoint coordinates.
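The heatmap-to-keypoint step can be sketched with NumPy alone. This is an illustration under assumptions, not the project's implementation: the 64x64 heatmap size, the `sigma` value, and the helper names `gaussian_heatmap` and `heatmaps_to_keypoints` are all hypothetical. The heatmaps here are ideal Gaussians rather than network output.

```python
import numpy as np

NUM_KEYPOINTS = 17  # COCO keypoint count, as stated above


def gaussian_heatmap(height, width, cx, cy, sigma=2.0):
    """Target heatmap with a Gaussian bump centered on (cx, cy)."""
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))


def heatmaps_to_keypoints(heatmaps):
    """Map (K, H, W) heatmaps to (K, 2) integer (x, y) coords and (K,) peak values."""
    k, h, w = heatmaps.shape
    flat = heatmaps.reshape(k, -1)
    idx = flat.argmax(axis=1)               # flat index of the peak per keypoint
    ys, xs = np.unravel_index(idx, (h, w))  # back to row (y) and column (x)
    coords = np.stack([xs, ys], axis=1)
    scores = flat[np.arange(k), idx]
    return coords, scores


rng = np.random.default_rng(0)
true_xy = rng.integers(4, 60, size=(NUM_KEYPOINTS, 2))
heatmaps = np.stack([gaussian_heatmap(64, 64, x, y) for x, y in true_xy])
pred_xy, scores = heatmaps_to_keypoints(heatmaps)
print(pred_xy[:3], scores[:3])
```

If the network predicts heatmaps at a lower resolution than the input image, the recovered coordinates need to be scaled back up by the same factor. The peak value can also double as a rough confidence score for each keypoint.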
See the results/ folder after running the pipeline. It will contain:
- training_history.png - CNN loss/accuracy curves
- confusion_matrix_svm.png - SVM confusion matrix
- confusion_matrix_cnn.png - CNN confusion matrix
- metrics_comparison.png - bar chart comparing both models
- hog_visualization.png - example HOG features
- pose_estimation.png - example pose estimation output
- benchmark_results.csv - numbers in a table
Note: since we're using synthetic data, the metrics will be very high (usually above 90%). Performance on real images would be much lower.
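For reference, precision, recall, and F1 all fall out of the confusion matrix. The sketch below computes them by hand with NumPy to show the relationship; the project itself lists scikit-learn for evaluation metrics, so treat these helper functions and the toy labels as illustrative assumptions rather than the repo's code.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm


def per_class_metrics(cm):
    tp = np.diag(cm).astype(float)
    predicted = cm.sum(axis=0)  # column sums: TP + FP
    actual = cm.sum(axis=1)     # row sums: TP + FN
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return precision, recall, f1


# Toy labels: one false positive and one false negative for class 1.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=2)
precision, recall, f1 = per_class_metrics(cm)
print(cm)
print(precision, recall, f1)
```

The `where=` guards avoid division by zero for a class that is never predicted or never present, which can happen on small evaluation sets.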
- Synthetic data: The dataset is generated programmatically, not real images. This means the models might not generalize at all to real photos.
- Simple CNN: The CNN architecture is very basic - just 3 conv layers. Real detection systems use much deeper networks (ResNet, EfficientNet, etc.).
- No real pose data: The pose estimation is trained/tested on stick figures. Real pose estimation requires annotated datasets like COCO or MPII.
- CPU only: Training is slow because everything runs on CPU. No mixed precision or any speed optimizations.
- No sliding window NMS tuning: The non-max suppression thresholds were just picked by hand and not properly tuned.
- Single scale only: The sliding window detector only handles one window scale properly, so objects much larger or smaller than the window can be missed.
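To show what the hand-picked NMS threshold controls, here is a minimal greedy non-max suppression sketch in NumPy. It is a generic version of the technique, not the code from this repo: the `(x1, y1, x2, y2)` box format, the function names, and the default `iou_threshold=0.3` are all assumptions for illustration.

```python
import numpy as np


def iou(box, boxes):
    """IoU of one box against an (N, 4) array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def nms(boxes, scores, iou_threshold=0.3):
    """Greedy NMS: keep the best box, drop boxes overlapping it too much, repeat."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep


boxes = np.array([[10, 10, 50, 50],
                  [12, 12, 52, 52],       # near-duplicate of the first box
                  [100, 100, 140, 140]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores, iou_threshold=0.3)
print(kept)
```

A lower threshold suppresses more aggressively and risks merging nearby objects, while a higher one lets duplicate detections through. Tuning it properly would mean sweeping the threshold on a validation set and comparing precision and recall, which this project does not do.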
- Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. CVPR.
- LeCun, Y., et al. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE.
- OpenCV documentation: https://docs.opencv.org/
- PyTorch tutorials: https://pytorch.org/tutorials/
- scikit-image HOG docs: https://scikit-image.org/docs/stable/api/skimage.feature.html
- Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning.
- Cao, Z., et al. (2017). Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. CVPR.