This repository contains the implementation of a Face Mask Classification System, developed as the final project for CS231.Q11 – Introduction to Computer Vision at the University of Information Technology (UIT – VNU-HCM).
The project performs a comparative study between traditional Machine Learning approaches (using HOG and LBP feature descriptors with SVM, KNN, and Random Forest classifiers) and modern Convolutional Neural Networks (CNN) to identify individuals wearing masks versus those without masks.
The primary objective of this project is to analyze and compare the effectiveness of traditional Machine Learning pipelines versus Deep Learning approaches for the task of face mask detection, a binary image classification problem with real-world relevance in public health and surveillance systems.
The project emphasizes methodological comparison, feature representation, and performance evaluation, rather than solely maximizing accuracy through deep models.
| No. | Student ID | Full Name | Role | Github | Email |
|---|---|---|---|---|---|
| 1 | 23521143 | Nguyen Cong Phat | Leader | paht2005 | [email protected] |
| 2 | 23521168 | Nguyen Le Phong | Member | kllp031 (primary account) / octguy (secondary account) | [email protected] |
| 3 | 23520213 | Vu Viet Cuong | Member | Kun05-AI | [email protected] |
- Repository Structure
- Problem Statement
- System Overview
- Key Features
- Dataset
- Data Preprocessing
- Feature Extraction
- Model Architectures
- Training & Optimization
- Installation
- Usage
- Demo Application
- Experimental Results
- Discussion
- Conclusion & Future Work
- License
CS231.Q11_Face-Mask-Classification-Project/
├── src/ # Model training notebooks (Jupyter)
│ ├── CNN/ # Deep Learning CNN (Grayscale) training source code
│ ├── HOG_KNN/ # KNN training with HOG features source code
│ ├── HOG_RF/ # Random Forest training with HOG features source code
│ ├── HOG_SVM/ # SVM training with HOG features source code
│ ├── LBP_KNN/ # KNN training with LBP features source code
│ ├── LBP_RF/ # Random Forest training with LBP features source code
│ └── LBP_SVM/ # SVM training with LBP features source code
│
├── models/
│ ├── yunet.onnx # Pre-trained Face Detection model (Included)
│ ├── mask_detector_model.h5 # Trained Mask Classification model (Included)
│ └── [others].joblib/.keras # Large models/caches (Ignored - Download link below)
│
├── docs/ # Report & presentation
│ ├── 23520213-23521143-23521168_Report.pdf
│ └── 23520213-23521143-23521168_Slide.pdf
│
├── static/ # Static Assets
│ ├── images/ # Images for Slide, Report, and Thumbnails
│ ├── results/ # Output images from Flask Web Demo
│ ├── latex/ # Latex files
│ ├── templates/ # Web UI (index.html, indexSVM.html)
│ └── test/ # Sample test images (e.g., test.jpg)
│
├── uploads/ # Temporary storage for user-uploaded images
├── demo_webcam.py # Real-time Webcam detection script
├── demoSVM_image_flask.py # Flask Web Application script
├── requirements.txt # Python dependencies
├── LICENSE
├── .gitignore # Git ignore rules
└── README.md # Main project documentation
Face mask detection is a practical computer vision problem that requires robust facial feature representation under variations in:
- Illumination
- Pose
- Occlusion
- Mask styles and colors
The goal of this project is to:
- Evaluate whether hand-crafted features (HOG, LBP) combined with classical classifiers can compete with CNN-based approaches.
- Analyze trade-offs between accuracy, computational cost, and deployment complexity.
- Develop a system capable of real-time inference using standard consumer hardware.
The proposed system consists of three main components:
- Offline Training Pipeline
  - Image preprocessing
  - Feature extraction
  - Model training and hyperparameter optimization
- Inference Pipeline
  - Face detection using YuNet
  - Feature extraction / CNN inference
  - Classification and post-processing
- Deployment Interfaces
  - Flask-based web application (static image classification)
  - Real-time webcam detection
- Binary Face Mask Classification with high accuracy
- Comparative Study between:
  - Traditional ML: HOG/LBP + SVM, KNN, Random Forest
  - Deep Learning: Custom CNN
- Automated Hyperparameter Tuning
  - Optuna for ML models
  - Keras Tuner (Hyperband) for CNN
- Real-time Detection using webcam input
- User-friendly Web Interface built with Flask
- Source: Kaggle
  🔗 https://www.kaggle.com/datasets/ashishjangra27/face-mask-12k-images-dataset
- Total Images: approximately 12,000 RGB images
- Image Characteristics:
  - Diverse facial orientations
  - Multiple ethnicities
  - Various mask types and lighting conditions
- Training Set: 10,000 images
- Validation Set: 800 images
- Test Set: 992 images
The dataset is well-balanced between the two classes, making it suitable for unbiased binary classification evaluation. No identity or personal information is associated with the dataset, ensuring ethical use for academic research.
To ensure consistency and reduce computational complexity, the following preprocessing steps were applied:
- Resizing
  - All images resized to 128 × 128 pixels
- Normalization
  - Pixel intensities scaled to the range [0, 1]
- Grayscale Conversion
  - Applied for traditional ML pipelines
  - Reduces dimensionality while preserving structural facial features
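The preprocessing steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the project's exact code: resizing to 128 × 128 is assumed to have been done already (typically with `cv2.resize`), and the grayscale conversion uses standard luminosity weights.

```python
import numpy as np

def preprocess(image_rgb: np.ndarray) -> np.ndarray:
    """Normalize an RGB image to [0, 1] and convert it to grayscale.

    The 128x128 resize (done with cv2.resize in a real pipeline) is
    assumed to have happened before this function is called.
    """
    # Scale 8-bit pixel intensities to the range [0, 1]
    img = image_rgb.astype(np.float32) / 255.0
    # Standard luminosity weights for RGB -> grayscale conversion
    gray = img @ np.array([0.299, 0.587, 0.114], dtype=np.float32)
    return gray

# Example on a random stand-in "image"
rng = np.random.default_rng(0)
fake = rng.integers(0, 256, size=(128, 128, 3), dtype=np.uint8)
gray = preprocess(fake)
print(gray.shape)  # (128, 128)
```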
- Captures edge and shape information
- Effective for representing facial geometry
- Tested configurations:
  - 6 × 3 cells
  - 8 × 2 cells (best-performing)
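To make the HOG idea concrete, here is a deliberately simplified, NumPy-only sketch of the core computation: per-cell, magnitude-weighted histograms of gradient orientations. The project itself uses `skimage`'s full HOG implementation (which adds block normalization); the cell size and function name here are illustrative.

```python
import numpy as np

def hog_cells(gray: np.ndarray, cell: int = 16, bins: int = 9) -> np.ndarray:
    """Minimal HOG-style descriptor: one orientation histogram per cell
    (no block normalization, unlike skimage's full HOG)."""
    # Image gradients via central differences
    gy, gx = np.gradient(gray.astype(np.float32))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned orientation

    h, w = gray.shape
    feats = []
    for y in range(0, h - cell + 1, cell):
        for x in range(0, w - cell + 1, cell):
            m = mag[y:y + cell, x:x + cell].ravel()
            a = ang[y:y + cell, x:x + cell].ravel()
            # Magnitude-weighted orientation histogram for this cell
            hist, _ = np.histogram(a, bins=bins, range=(0, 180), weights=m)
            feats.append(hist)
    return np.concatenate(feats)

gray = np.random.default_rng(1).random((128, 128))
vec = hog_cells(gray)
print(vec.shape)  # 8 x 8 cells x 9 bins -> (576,)
```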
- Encodes local texture patterns
- Robust to illumination changes
- Useful for modeling fine-grained facial textures
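The basic LBP encoding can likewise be sketched with plain NumPy (the project presumably uses `skimage.feature.local_binary_pattern`; this radius-1, 8-neighbor version just shows the principle):

```python
import numpy as np

def lbp_8neighbor(gray: np.ndarray) -> np.ndarray:
    """Basic radius-1 LBP: each interior pixel is encoded by comparing
    its 8 neighbors against the center intensity, one bit per neighbor."""
    c = gray[1:-1, 1:-1]
    # Neighbor offsets in clockwise order, each contributing one bit
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        nb = gray[1 + dy:gray.shape[0] - 1 + dy,
                  1 + dx:gray.shape[1] - 1 + dx]
        code |= ((nb >= c).astype(np.uint8) << bit)
    return code

gray = np.random.default_rng(2).integers(0, 256, size=(128, 128)).astype(np.uint8)
codes = lbp_8neighbor(gray)
# Feature vector: normalized histogram of the 256 possible LBP codes
hist = np.bincount(codes.ravel(), minlength=256) / codes.size
print(codes.shape, hist.shape)
```

Because each pixel's code depends only on intensity *ordering*, not absolute values, the descriptor is robust to monotonic illumination changes, which is the property noted above.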
- Support Vector Machine (SVM) with RBF kernel
- K-Nearest Neighbors (KNN)
- Random Forest
These models operate on extracted HOG or LBP feature vectors.
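Feeding the extracted feature vectors into these classifiers is straightforward with scikit-learn. The sketch below uses a tiny synthetic feature matrix in place of real HOG/LBP vectors, and default-ish hyperparameters rather than the Optuna-tuned values from the project:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Toy stand-in for HOG/LBP feature vectors: two separable clusters
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 0.4, (100, 32)),   # "mask" class
               rng.normal(2.0, 0.4, (100, 32))])  # "no-mask" class
y = np.array([0] * 100 + [1] * 100)

models = {
    "SVM (RBF)": SVC(kernel="rbf", C=1.0, gamma="scale"),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
scores = {name: m.fit(X, y).score(X, y) for name, m in models.items()}
print(scores)
```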
- Custom Convolutional Neural Network (CNN)
- Lightweight architecture optimized for grayscale input
- Designed to balance performance and training efficiency
- Traditional ML Models
  - Hyperparameters optimized using Optuna
  - Objective: maximize validation accuracy
- CNN
  - Optimized using Keras Tuner (Hyperband)
  - Tuned parameters include:
    - Number of convolutional layers
    - Filter sizes
    - Learning rate
    - Dropout rate
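The project uses Optuna for the traditional models; as a rough stand-in that shows the same idea (searching RBF-SVM hyperparameters to maximize cross-validated accuracy) with only scikit-learn, one can use a grid search. The data and grid below are illustrative, not the project's actual search space:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy feature matrix standing in for HOG vectors
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0.0, 0.5, (60, 16)),
               rng.normal(1.5, 0.5, (60, 16))])
y = np.array([0] * 60 + [1] * 60)

# Search RBF-SVM hyperparameters, maximizing cross-validated accuracy
# (the project itself uses Optuna with validation accuracy as the objective)
search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1.0, 10.0], "gamma": ["scale", 0.01, 0.1]},
    scoring="accuracy",
    cv=3,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```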
```bash
git clone https://github.com/paht2005/CS231.Q11_Face-Mask-Classification-Project.git
cd CS231.Q11_Face-Mask-Classification-Project
```

```bash
python -m venv .venv
source .venv/bin/activate    # Linux/macOS
# .venv\Scripts\activate     # Windows
```

```bash
pip install -r requirements.txt
```

Open and run the notebooks in `src/`:

```bash
jupyter notebook
```

- CNN/train-model-CNN-best-grayscale.ipynb
- HOG_KNN/train-model-HOG-KNN_6x3.ipynb
- HOG_KNN/train-model-HOG-KNN_8x2.ipynb
- HOG_RF/train-model-HOG-RF_8x2.ipynb
- HOG_RF/train-model-HOG-RF_6x3.ipynb
- HOG_SVM/train-model-HOG-SVM_6x3.ipynb
- HOG_SVM/train-model-HOG-SVM-8x2.ipynb
- LBP_KNN/train-model-LBP-KNN.ipynb
- LBP_RF/train-model-LBP-RF.ipynb
- LBP_SVM/train-model-LBP-SVM.ipynb
```bash
python demoSVM_image_flask.py
```

Then open a browser at:

```
http://127.0.0.1:5000
```

```bash
python demo_webcam.py
```

The project provides two distinct interfaces to demonstrate the classification capabilities, catering to both static analysis and real-time monitoring.
A user-friendly web interface built on the Flask Framework, allowing users to upload individual images for detailed analysis.
- Logic: Receives image files via `indexSVM.html`, extracts HOG 8×2 features using `skimage`, and performs inference with the optimized SVM model (`.joblib`).
- Output: Draws a bounding box and label directly in the browser, displaying the prediction result along with a confidence score.

Designed for high-speed monitoring, this pipeline uses a specialized deep learning flow to ensure stability and performance on live video streams.

- Face Detection: Integrates YuNet (`yunet.onnx`) via OpenCV's `FaceDetectorYN` for ultra-lightweight, fast facial localization.
- Classification: Uses the CNN model (`mask_detector_model.h5`) on grayscale input; detected faces are processed in batches to optimize performance.
- Stabilization Techniques:
  - Temporal Smoothing: Employs a `deque` buffer to average predictions over recent frames, effectively eliminating "label flickering".
  - Centroid-based Tracking: Maintains consistent object identity across frames using Euclidean-distance tracking.
- Performance: Achieves a smooth processing rate of over 25 FPS, meeting the requirements for real-time surveillance.
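The temporal-smoothing technique can be sketched with only the standard library. The class name, buffer length, and threshold below are illustrative, not taken from the repository:

```python
from collections import deque

class LabelSmoother:
    """Average the last `maxlen` per-frame mask probabilities for one
    tracked face, so a single noisy frame cannot flip the label."""

    def __init__(self, maxlen: int = 10, threshold: float = 0.5):
        self.buffer = deque(maxlen=maxlen)  # recent mask probabilities
        self.threshold = threshold

    def update(self, mask_prob: float) -> str:
        self.buffer.append(mask_prob)
        avg = sum(self.buffer) / len(self.buffer)
        return "Mask" if avg >= self.threshold else "No Mask"

smoother = LabelSmoother(maxlen=5)
# A momentary misdetection (0.2) in a run of confident "Mask" frames
stream = [0.9, 0.95, 0.2, 0.9, 0.92]
labels = [smoother.update(p) for p in stream]
print(labels)  # the 0.2 frame is absorbed by the running average
```

Because the `deque` has a fixed `maxlen`, old predictions fall off automatically, so the label tracks genuine state changes (mask put on or removed) after a few frames while ignoring one-frame glitches.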
| Model | Feature Descriptor | Accuracy |
|---|---|---|
| CNN | Automatic (None) | 0.9869 |
| SVM | HOG (8×2) | 0.9899 |
| SVM | HOG (6×3) | 0.9879 |
| SVM | LBP | 0.9720 |
| KNN | HOG (8×2) | 0.9839 |
| KNN | HOG (6×3) | 0.9748 |
| KNN | LBP | 0.9234 |
| Random Forest | HOG (8×2) | 0.9819 |
| Random Forest | HOG (6×3) | 0.9819 |
| Random Forest | LBP | 0.9093 |
Overall, HOG-based methods consistently outperform LBP-based methods, with SVM emerging as the most effective classifier.
Based on the experimental results summarized above, several key observations can be drawn:

- Superior Performance of HOG + SVM: the combination of the HOG (8×2) feature descriptor and the SVM classifier achieves the highest classification accuracy (0.9899). This demonstrates that, for datasets with relatively stable facial structures, well-designed hand-crafted shape features can provide highly discriminative representations; in this setting, explicit gradient-based edge information enables more effective class separation than a baseline CNN trained from scratch.
- Robustness and Stability of the CNN Model: the CNN achieves strong performance with an accuracy of 0.9869, indicating excellent generalization. A major advantage of CNNs is their end-to-end learning capability, which eliminates manual feature engineering and facilitates scaling when larger or more diverse datasets become available.
- Effectiveness of the HOG (8×2) Configuration: across all traditional classifiers (SVM, KNN, and Random Forest), the HOG (8×2) configuration consistently matches or outperforms HOG (6×3). The 8×2 cell partitioning is particularly effective at capturing the vertical symmetry and structural patterns of faces and masks, which are crucial cues for mask detection.
- Limitations of LBP Features: the LBP (Local Binary Pattern) descriptor yields the lowest accuracy across most classifiers, especially with KNN and Random Forest. This suggests that edge and shape information (gradients) is more informative than surface texture for the face mask classification task.
The experimental evaluation confirms that, for the current dataset, the optimized traditional pipeline (HOG + SVM) achieves the highest absolute accuracy (0.9899), slightly outperforming the grayscale CNN model (0.9869).

- Effectiveness of Shape-based Features: HOG descriptors rely on gradient orientation distributions, which are well suited to representing structured, symmetric objects such as human faces and face masks. With a moderately sized dataset, HOG provides a highly separable feature space without requiring a complex learning process. Note that the CNN in this project was trained on grayscale images with a moderate architecture, without pre-trained backbones or advanced data augmentation.
- Dataset Size Constraints for CNNs: while CNNs are powerful representation learners, their full potential typically emerges on large-scale datasets. With approximately 10,000 training images, the CNN may have converged but lacked sufficient data diversity to learn features more discriminative than the optimized HOG representation.
- Practical Implications: in scenarios with limited data and computational resources, combining hand-crafted features (HOG) with a strong classifier (SVM) remains a highly effective and practical solution, offering both high accuracy and efficient inference.
Experimental results indicate that:
- The HOG + SVM pipeline provides the best overall performance.
- Traditional machine learning approaches remain highly competitive when paired with effective feature engineering.
- CNN performance is strong but sensitive to architectural design and data volume, particularly when trained from scratch.
These findings highlight that carefully engineered hand-crafted features can outperform deep learning models in structured vision tasks with limited or moderately sized datasets.
In conclusion, classical computer vision pipelines, when carefully engineered and optimized, can rival or even surpass deep learning models under these conditions, and remain highly effective for real-world applications such as face mask detection.
Potential future extensions include:
- Applying Transfer Learning with advanced CNN backbones (e.g., ResNet, MobileNet)
- Extending the system to multi-class mask type classification
- Optimizing deployment for edge and embedded devices
This project is developed for academic purposes under the course
CS231.Q11 – Introduction to Computer Vision at the University of Information Technology (UIT).
Released under the MIT License. See the LICENSE file for details.

