This repository contains the implementation of a Face Mask Classification System, developed as the final project for CS231.Q11 – Introduction to Computer Vision at the University of Information Technology (UIT – VNU-HCM).
The project performs a comparative study between traditional Machine Learning approaches (using HOG and LBP feature descriptors with SVM, KNN, and Random Forest classifiers) and modern Convolutional Neural Networks (CNN) to identify individuals wearing masks versus those without masks.
The primary objective of this project is to analyze and compare the effectiveness of traditional Machine Learning pipelines versus Deep Learning approaches for the task of face mask detection, a binary image classification problem with real-world relevance in public health and surveillance systems.
The project emphasizes methodological comparison, feature representation, and performance evaluation, rather than solely maximizing accuracy through deep models.
| No. | Student ID | Full Name | Role | Github | Email |
|---|---|---|---|---|---|
| 1 | 23521143 | Nguyen Cong Phat | Leader | paht2005 | [email protected] |
| 2 | 23521168 | Nguyen Le Phong | Member | kllp031 (primary account) / octguy (secondary account) | [email protected] |
| 3 | 23520213 | Vu Viet Cuong | Member | Kun05-AI | [email protected] |
- Repository Structure
- Problem Statement
- System Overview
- Key Features
- Dataset
- Data Preprocessing
- Feature Extraction
- Model Architectures
- Training & Optimization
- Installation
- Usage
- Demo Application
- Experimental Results
- Discussion
- Conclusion & Future Work
- License
CS231.Q11_Face-Mask-Classification-Project/
├── src/ # Model training notebooks (Jupyter)
│ ├── CNN/ # Deep Learning CNN (Grayscale) training source code
│ ├── HOG_KNN/ # KNN training with HOG features source code
│ ├── HOG_RF/ # Random Forest training with HOG features source code
│ ├── HOG_SVM/ # SVM training with HOG features source code
│ ├── LBP_KNN/ # KNN training with LBP features source code
│ ├── LBP_RF/ # Random Forest training with LBP features source code
│ └── LBP_SVM/ # SVM training with LBP features source code
│
├── models/
│ ├── yunet.onnx # Pre-trained Face Detection model (Included)
│ ├── mask_detector_model.h5 # Trained Mask Classification model (Included)
│ └── [others].joblib/.keras # Large models/caches (Ignored - Download link below)
│
├── docs/ # Report & presentation
│ ├── 23520213-23521143-23521168_Report.pdf
│ └── 23520213-23521143-23521168_Slide.pdf
│
├── static/ # Static Assets
│ ├── images/ # Images for Slide, Report, and Thumbnails
│ ├── results/ # Output images from Flask Web Demo
│ ├── latex/ # Latex files
│ ├── templates/ # Web UI (index.html, indexSVM.html)
│ └── test/ # Sample test images (e.g., test.jpg)
│
├── uploads/ # Temporary storage for user-uploaded images
├── demo_webcam.py # Real-time Webcam detection script
├── demoSVM_image_flask.py # Flask Web Application script
├── requirements.txt # Python dependencies
├── LICENSE
├── .gitignore # Git ignore rules
└── README.md # Main project documentation
Face mask detection is a practical computer vision problem that requires robust facial feature representation under variations in:
- Illumination
- Pose
- Occlusion
- Mask styles and colors
The goal of this project is to:
- Evaluate whether hand-crafted features (HOG, LBP) combined with classical classifiers can compete with CNN-based approaches.
- Analyze trade-offs between accuracy, computational cost, and deployment complexity.
- Develop a system capable of real-time inference using standard consumer hardware.
The proposed system consists of three main components:
- Offline Training Pipeline
  - Image preprocessing
  - Feature extraction
  - Model training and hyperparameter optimization
- Inference Pipeline
  - Face detection using YuNet
  - Feature extraction / CNN inference
  - Classification and post-processing
- Deployment Interfaces
  - Flask-based web application (static image classification)
  - Real-time webcam detection
- Binary Face Mask Classification with high accuracy
- Comparative Study between:
  - Traditional ML: HOG/LBP + SVM, KNN, Random Forest
  - Deep Learning: Custom CNN
- Automated Hyperparameter Tuning
  - Optuna for ML models
  - Keras Tuner (Hyperband) for CNN
- Real-time Detection using webcam input
- User-friendly Web Interface built with Flask
- Source: Kaggle
  🔗 https://www.kaggle.com/datasets/ashishjangra27/face-mask-12k-images-dataset
- Total Images: approximately 12,000 RGB images
- Image Characteristics:
  - Diverse facial orientations
  - Multiple ethnicities
  - Various mask types and lighting conditions
- Training Set: 10,000 images
- Validation Set: 800 images
- Test Set: 992 images
The dataset is well-balanced between the two classes, making it suitable for unbiased binary classification evaluation. No identity or personal information is associated with the dataset, ensuring ethical use for academic research.
To ensure consistency and reduce computational complexity, the following preprocessing steps were applied:
- Resizing
  - All images resized to 128 × 128 pixels
- Normalization
  - Pixel intensities scaled to the range [0, 1]
- Grayscale Conversion
  - Applied for traditional ML pipelines
  - Reduces dimensionality while preserving structural facial features
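The preprocessing steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the project's exact code: resizing to 128 × 128 is assumed to have been done already (typically with `cv2.resize`), and the grayscale conversion uses standard luminosity weights.

```python
import numpy as np

def preprocess(image_rgb: np.ndarray) -> np.ndarray:
    """Normalize an RGB image to [0, 1] and convert it to grayscale.

    The 128x128 resize (done with cv2.resize in a real pipeline) is
    assumed to have happened before this function is called.
    """
    # Scale 8-bit pixel intensities to the range [0, 1]
    img = image_rgb.astype(np.float32) / 255.0
    # Standard luminosity weights for RGB -> grayscale conversion
    gray = img @ np.array([0.299, 0.587, 0.114], dtype=np.float32)
    return gray

# Example on a random stand-in "image"
rng = np.random.default_rng(0)
fake = rng.integers(0, 256, size=(128, 128, 3), dtype=np.uint8)
gray = preprocess(fake)
print(gray.shape)  # (128, 128)
```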
- Captures edge and shape information
- Effective for representing facial geometry
- Tested configurations:
  - 6 × 3 cells
  - 8 × 2 cells (best-performing)
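To make the HOG idea concrete, here is a deliberately simplified, NumPy-only sketch of the core computation: per-cell, magnitude-weighted histograms of gradient orientations. The project itself uses `skimage`'s full HOG implementation (which adds block normalization); the cell size and function name here are illustrative.

```python
import numpy as np

def hog_cells(gray: np.ndarray, cell: int = 16, bins: int = 9) -> np.ndarray:
    """Minimal HOG-style descriptor: one orientation histogram per cell
    (no block normalization, unlike skimage's full HOG)."""
    # Image gradients via central differences
    gy, gx = np.gradient(gray.astype(np.float32))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned orientation

    h, w = gray.shape
    feats = []
    for y in range(0, h - cell + 1, cell):
        for x in range(0, w - cell + 1, cell):
            m = mag[y:y + cell, x:x + cell].ravel()
            a = ang[y:y + cell, x:x + cell].ravel()
            # Magnitude-weighted orientation histogram for this cell
            hist, _ = np.histogram(a, bins=bins, range=(0, 180), weights=m)
            feats.append(hist)
    return np.concatenate(feats)

gray = np.random.default_rng(1).random((128, 128))
vec = hog_cells(gray)
print(vec.shape)  # 8 x 8 cells x 9 bins -> (576,)
```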
- Encodes local texture patterns
- Robust to illumination changes
- Useful for modeling fine-grained facial textures
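The basic LBP encoding can likewise be sketched with plain NumPy (the project presumably uses `skimage.feature.local_binary_pattern`; this radius-1, 8-neighbor version just shows the principle):

```python
import numpy as np

def lbp_8neighbor(gray: np.ndarray) -> np.ndarray:
    """Basic radius-1 LBP: each interior pixel is encoded by comparing
    its 8 neighbors against the center intensity, one bit per neighbor."""
    c = gray[1:-1, 1:-1]
    # Neighbor offsets in clockwise order, each contributing one bit
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        nb = gray[1 + dy:gray.shape[0] - 1 + dy,
                  1 + dx:gray.shape[1] - 1 + dx]
        code |= ((nb >= c).astype(np.uint8) << bit)
    return code

gray = np.random.default_rng(2).integers(0, 256, size=(128, 128)).astype(np.uint8)
codes = lbp_8neighbor(gray)
# Feature vector: normalized histogram of the 256 possible LBP codes
hist = np.bincount(codes.ravel(), minlength=256) / codes.size
print(codes.shape, hist.shape)
```

Because each pixel's code depends only on intensity *ordering*, not absolute values, the descriptor is robust to monotonic illumination changes, which is the property noted above.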
- Support Vector Machine (SVM) with RBF kernel
- K-Nearest Neighbors (KNN)
- Random Forest
These models operate on extracted HOG or LBP feature vectors.
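Feeding the extracted feature vectors into these classifiers is straightforward with scikit-learn. The sketch below uses a tiny synthetic feature matrix in place of real HOG/LBP vectors, and default-ish hyperparameters rather than the Optuna-tuned values from the project:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Toy stand-in for HOG/LBP feature vectors: two separable clusters
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 0.4, (100, 32)),   # "mask" class
               rng.normal(2.0, 0.4, (100, 32))])  # "no-mask" class
y = np.array([0] * 100 + [1] * 100)

models = {
    "SVM (RBF)": SVC(kernel="rbf", C=1.0, gamma="scale"),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
scores = {name: m.fit(X, y).score(X, y) for name, m in models.items()}
print(scores)
```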
- Custom Convolutional Neural Network (CNN)
- Lightweight architecture optimized for grayscale input
- Designed to balance performance and training efficiency
- Traditional ML Models
  - Hyperparameters optimized using Optuna
  - Objective: maximize validation accuracy
- CNN
  - Optimized using Keras Tuner (Hyperband)
  - Tuned parameters include:
    - Number of convolutional layers
    - Filter sizes
    - Learning rate
    - Dropout rate
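The project uses Optuna for the traditional models; as a rough stand-in that shows the same idea (searching RBF-SVM hyperparameters to maximize cross-validated accuracy) with only scikit-learn, one can use a grid search. The data and grid below are illustrative, not the project's actual search space:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy feature matrix standing in for HOG vectors
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0.0, 0.5, (60, 16)),
               rng.normal(1.5, 0.5, (60, 16))])
y = np.array([0] * 60 + [1] * 60)

# Search RBF-SVM hyperparameters, maximizing cross-validated accuracy
# (the project itself uses Optuna with validation accuracy as the objective)
search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1.0, 10.0], "gamma": ["scale", 0.01, 0.1]},
    scoring="accuracy",
    cv=3,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```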
```bash
git clone https://github.com/paht2005/CS231.Q11_Face-Mask-Classification-Project.git
cd CS231.Q11_Face-Mask-Classification-Project
```

```bash
python -m venv .venv
source .venv/bin/activate    # Linux/macOS
# .venv\Scripts\activate     # Windows
```

```bash
pip install -r requirements.txt
```

Open and run the notebooks in `src/`:

```bash
jupyter notebook
```

- CNN/train-model-CNN-best-grayscale.ipynb
- HOG_KNN/train-model-HOG-KNN_6x3.ipynb
- HOG_KNN/train-model-HOG-KNN_8x2.ipynb
- HOG_RF/train-model-HOG-RF_8x2.ipynb
- HOG_RF/train-model-HOG-RF_6x3.ipynb
- HOG_SVM/train-model-HOG-SVM_6x3.ipynb
- HOG_SVM/train-model-HOG-SVM-8x2.ipynb
- LBP_KNN/train-model-LBP-KNN.ipynb
- LBP_RF/train-model-LBP-RF.ipynb
- LBP_SVM/train-model-LBP-SVM.ipynb
```bash
python demoSVM_image_flask.py
```

Then open a browser at:

```
http://127.0.0.1:5000
```

```bash
python demo_webcam.py
```

The project provides two distinct interfaces to demonstrate the classification capabilities, catering to both static analysis and real-time monitoring.
A user-friendly web interface built on the Flask Framework, allowing users to upload individual images for detailed analysis.
- Logic: Receives image files via `indexSVM.html`, extracts HOG 8×2 features using `skimage`, and performs inference with the optimized SVM model (`.joblib`).
- Output: Draws a bounding box and label directly in the browser, displaying the prediction result along with a confidence score.

Designed for high-speed monitoring, this pipeline uses a specialized deep learning flow to ensure stability and performance on live video streams.

- Face Detection: Integrates YuNet (`yunet.onnx`) via OpenCV's `FaceDetectorYN` for ultra-lightweight, fast facial localization.
- Classification: Uses the CNN model (`mask_detector_model.h5`) on grayscale input; detected faces are processed in batches to optimize performance.
- Stabilization Techniques:
  - Temporal Smoothing: Employs a `deque` buffer to average predictions over recent frames, effectively eliminating "label flickering".
  - Centroid-based Tracking: Maintains consistent object identity across frames using Euclidean-distance tracking.
- Performance: Achieves a smooth processing rate of over 25 FPS, meeting the requirements for real-time surveillance.
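The temporal-smoothing technique can be sketched with only the standard library. The class name, buffer length, and threshold below are illustrative, not taken from the repository:

```python
from collections import deque

class LabelSmoother:
    """Average the last `maxlen` per-frame mask probabilities for one
    tracked face, so a single noisy frame cannot flip the label."""

    def __init__(self, maxlen: int = 10, threshold: float = 0.5):
        self.buffer = deque(maxlen=maxlen)  # recent mask probabilities
        self.threshold = threshold

    def update(self, mask_prob: float) -> str:
        self.buffer.append(mask_prob)
        avg = sum(self.buffer) / len(self.buffer)
        return "Mask" if avg >= self.threshold else "No Mask"

smoother = LabelSmoother(maxlen=5)
# A momentary misdetection (0.2) in a run of confident "Mask" frames
stream = [0.9, 0.95, 0.2, 0.9, 0.92]
labels = [smoother.update(p) for p in stream]
print(labels)  # the 0.2 frame is absorbed by the running average
```

Because the `deque` has a fixed `maxlen`, old predictions fall off automatically, so the label tracks genuine state changes (mask put on or removed) after a few frames while ignoring one-frame glitches.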
| Model | Feature Descriptor | Accuracy |
|---|---|---|
| CNN | Automatic (None) | 0.9869 |
| SVM | HOG (8×2) | 0.9899 |
| SVM | HOG (6×3) | 0.9879 |
| SVM | LBP | 0.9720 |
| KNN | HOG (8×2) | 0.9839 |
| KNN | HOG (6×3) | 0.9748 |
| KNN | LBP | 0.9234 |
| Random Forest | HOG (8×2) | 0.9819 |
| Random Forest | HOG (6×3) | 0.9819 |
| Random Forest | LBP | 0.9093 |
Overall, HOG-based methods consistently outperform LBP-based methods, with SVM emerging as the most effective classifier.
Based on the experimental results summarized above, several key observations can be drawn:

- Superior Performance of HOG + SVM: the combination of the HOG (8×2) feature descriptor and the SVM classifier achieves the highest classification accuracy (0.9899). This demonstrates that, for datasets with relatively stable facial structures, well-designed hand-crafted shape features can provide highly discriminative representations; in this setting, explicit gradient-based edge information enables more effective class separation than a baseline CNN trained from scratch.
- Robustness and Stability of the CNN Model: the CNN achieves strong performance with an accuracy of 0.9869, indicating excellent generalization. A major advantage of CNNs is their end-to-end learning capability, which eliminates manual feature engineering and facilitates scaling when larger or more diverse datasets become available.
- Effectiveness of the HOG (8×2) Configuration: across all traditional classifiers (SVM, KNN, and Random Forest), the HOG (8×2) configuration consistently matches or outperforms HOG (6×3). The 8×2 cell partitioning is particularly effective at capturing the vertical symmetry and structural patterns of faces and masks, which are crucial cues for mask detection.
- Limitations of LBP Features: the LBP (Local Binary Pattern) descriptor yields the lowest accuracy across most classifiers, especially with KNN and Random Forest. This suggests that edge and shape information (gradients) is more informative than surface texture for the face mask classification task.
The experimental evaluation confirms that, for the current dataset, the optimized traditional pipeline (HOG + SVM) achieves the highest absolute accuracy (0.9899), slightly outperforming the grayscale CNN model (0.9869).

- Effectiveness of Shape-based Features: HOG descriptors rely on gradient orientation distributions, which are well suited to representing structured, symmetric objects such as human faces and face masks. With a moderately sized dataset, HOG provides a highly separable feature space without requiring a complex learning process. Note that the CNN in this project was trained on grayscale images with a moderate architecture, without pre-trained backbones or advanced data augmentation.
- Dataset Size Constraints for CNNs: while CNNs are powerful representation learners, their full potential typically emerges on large-scale datasets. With approximately 10,000 training images, the CNN may have converged but lacked sufficient data diversity to learn features more discriminative than the optimized HOG representation.
- Practical Implications: in scenarios with limited data and computational resources, combining hand-crafted features (HOG) with a strong classifier (SVM) remains a highly effective and practical solution, offering both high accuracy and efficient inference.
Experimental results indicate that:
- The HOG + SVM pipeline provides the best overall performance.
- Traditional machine learning approaches remain highly competitive when paired with effective feature engineering.
- CNN performance is strong but sensitive to architectural design and data volume, particularly when trained from scratch.
These findings highlight that carefully engineered hand-crafted features can outperform deep learning models in structured vision tasks with limited or moderately sized datasets.
In conclusion, classical computer vision pipelines, when carefully engineered and optimized, can rival or even surpass deep learning models under these conditions, and remain highly effective for real-world applications such as face mask detection.
Potential future extensions include:
- Applying Transfer Learning with advanced CNN backbones (e.g., ResNet, MobileNet)
- Extending the system to multi-class mask type classification
- Optimizing deployment for edge and embedded devices
This project is developed for academic purposes under the course
CS231.Q11 – Introduction to Computer Vision at the University of Information Technology (UIT).
Released under the MIT License. See the LICENSE file for details.

