This project, developed by Nowa Analytics, focuses on building semi-supervised classification models to assess milk quality. We were contracted by a dairy company to help guarantee the quality of the milk used in their products.
Using machine learning, we classify milk samples into three categories:
- Low Quality
- Medium Quality
- High Quality
Since the dataset contains both labeled and unlabeled samples, we applied semi-supervised learning techniques such as:
- Self-Training with labeled + unlabeled data
- Label Propagation (transductive learning)
- Supervised baselines for performance comparison
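As a rough illustration of the Self-Training idea listed above, here is a minimal sketch using scikit-learn's `SelfTrainingClassifier`. It runs on synthetic data shaped like the project dataset (1,059 rows, 7 features, 3 classes, 424 labels kept), not on the real CSV. The random forest base model and the 0.9 confidence threshold are assumptions chosen for the example, not the project's confirmed settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.semi_supervised import SelfTrainingClassifier

# Synthetic stand-in for the milk data: 1,059 rows, 7 features, 3 classes.
X, y_true = make_classification(
    n_samples=1059, n_features=7, n_informative=5, n_redundant=0,
    n_classes=3, random_state=0,
)

# Keep 424 labels and hide the rest; scikit-learn marks unlabeled rows with -1.
rng = np.random.default_rng(0)
labeled_idx = rng.choice(len(y_true), size=424, replace=False)
y_partial = np.full_like(y_true, -1)
y_partial[labeled_idx] = y_true[labeled_idx]

# Self-Training: fit on labeled rows, pseudo-label confident unlabeled rows, refit.
base = RandomForestClassifier(n_estimators=100, random_state=0)  # assumed base model
self_training = SelfTrainingClassifier(base, threshold=0.9)      # assumed threshold
self_training.fit(X, y_partial)

# transduction_ holds the given labels plus pseudo-labels (-1 where none was assigned).
n_unlabeled = int(np.sum(y_partial == -1))
n_pseudo = int(np.sum((y_partial == -1) & (self_training.transduction_ != -1)))
print(f"Pseudo-labeled {n_pseudo} of {n_unlabeled} unlabeled samples")
```

A lower threshold pseudo-labels more samples per iteration but raises the risk of reinforcing wrong predictions, so it is worth tuning on a held-out labeled split.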
The dataset was provided in a CSV file named qualidade_leite.csv, containing 1,059 entries with the following features:
| Column | Description |
|---|---|
| pH | pH level of the milk (continuous) |
| Temperature | Temperature of the sample (°C) |
| Taste | Taste quality score |
| Odor | Odor quality score |
| Fat | Fat content |
| Turbidity | Milk turbidity |
| Color | Visual color quality |
| Quality | Target variable (Low, Medium, High); partially labeled (424 samples) |
Note: Only a portion of the dataset has labels, which makes it ideal for semi-supervised approaches.
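To feed a partially labeled table like this one into scikit-learn's semi-supervised estimators, unlabeled rows need the label `-1`. The sketch below shows one way to do that with pandas. It assumes the CSV header matches the column names in the table above and that unlabeled rows have an empty `Quality` cell. The integer codes and the helper name `split_features_and_labels` are hypothetical, and the demo values are made up.

```python
import pandas as pd

# Assumed integer encoding for the three quality classes.
QUALITY_CODES = {"Low": 0, "Medium": 1, "High": 2}
# Feature columns as listed in the table above (assumed to match the CSV header).
FEATURES = ["pH", "Temperature", "Taste", "Odor", "Fat", "Turbidity", "Color"]


def split_features_and_labels(df: pd.DataFrame):
    """Return (X, y), where rows without a Quality label get y == -1."""
    X = df[FEATURES].to_numpy(dtype=float)
    # Missing or unmapped labels become NaN after map(), then -1.
    y = df["Quality"].map(QUALITY_CODES).fillna(-1).astype(int).to_numpy()
    return X, y


# Tiny in-memory frame with made-up values, standing in for
# pd.read_csv("qualidade_leite.csv").
demo = pd.DataFrame({
    "pH": [6.6, 6.8, 9.0],
    "Temperature": [35, 40, 70],
    "Taste": [1, 1, 0],
    "Odor": [0, 1, 1],
    "Fat": [1, 1, 0],
    "Turbidity": [0, 0, 1],
    "Color": [254, 255, 246],
    "Quality": ["High", None, "Low"],
})

X, y = split_features_and_labels(demo)
print(X.shape, y)
```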
- Train classification models using labeled data.
- Understand and apply semi-supervised learning concepts.
- Generate pseudo-labels for unlabeled samples.
- Apply Self-Training strategy with labeled + unlabeled data.
- Explore transductive learning using Label Propagation.
- Compare the results against fully supervised models.
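The last two objectives above, Label Propagation and the comparison against a supervised model, can be sketched as follows. The example again uses synthetic data in place of the real CSV. The k-NN kernel, `n_neighbors=7`, and the logistic regression baseline are assumptions for illustration rather than the project's actual configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.semi_supervised import LabelPropagation

# Synthetic stand-in: 1,059 rows, 7 features, 3 classes.
X, y_true = make_classification(
    n_samples=1059, n_features=7, n_informative=5, n_redundant=0,
    n_classes=3, random_state=0,
)
X = StandardScaler().fit_transform(X)  # distance-based graphs benefit from scaling

# Keep 424 labels; mark every other row as unlabeled (-1).
rng = np.random.default_rng(0)
labeled = np.zeros(len(y_true), dtype=bool)
labeled[rng.choice(len(y_true), size=424, replace=False)] = True
y_partial = np.where(labeled, y_true, -1)

# Transductive step: build a k-NN graph over all rows and spread the known labels.
label_prop = LabelPropagation(kernel="knn", n_neighbors=7, max_iter=1000)
label_prop.fit(X, y_partial)
lp_acc = accuracy_score(y_true[~labeled], label_prop.transduction_[~labeled])

# Supervised baseline: trained on the labeled rows only.
baseline = LogisticRegression(max_iter=1000).fit(X[labeled], y_true[labeled])
base_acc = accuracy_score(y_true[~labeled], baseline.predict(X[~labeled]))

print(f"Label Propagation accuracy on hidden labels: {lp_acc:.3f}")
print(f"Supervised baseline accuracy on hidden labels: {base_acc:.3f}")
```

Scoring against the hidden labels only works here because the synthetic data has ground truth for every row. On the real dataset, a fair comparison would hold out part of the 424 labeled samples as a test set and evaluate all models on it.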
- Python 3.11+
- Pandas & NumPy: Data processing
- Scikit-learn: Classification, Self-Training, Label Propagation
- Matplotlib & Seaborn: Data visualization
- Jupyter Notebook: Experiment tracking
- Improved classification performance using semi-supervised approaches.
- Demonstration of how unlabeled data can boost model accuracy.
- Comparison between Self-Training, Label Propagation, and supervised baselines.
Project developed by Nowa Analytics | Data Science Consulting | Machine Learning Solutions