# Amazon Beauty Product Recommendation System
In the era of e-commerce, users are overwhelmed by choices. Recommender Systems solve this by filtering information to provide personalized suggestions.
- The Problem: Predicting a user's rating for a product they haven't seen yet, based on sparse historical interaction data.
- The Motivation: Building a recommendation engine from scratch allows for a deep understanding of the underlying mathematics of machine learning algorithms, moving beyond "black box" libraries.
- The Objective: To implement a Matrix Factorization (SVD) model using only NumPy (no Scikit-learn or Surprise for the core logic) to predict ratings and recommend top beauty products on Amazon.
- Source: Amazon Beauty Review Dataset.
- Original Size: ~2 million raw ratings.
- Processed Size: 158,801 ratings (after data cleaning and K-Core filtering).
- Entities (Post-filtering): 22,363 Users and 12,101 Items.
- Sparsity: ~99.94% (The matrix is extremely sparse).
- Characteristics: High sparsity, Long-tail distribution of items.
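The sparsity figure follows directly from the post-filtering counts above; a quick sanity check in plain Python:

```python
# Sanity check: sparsity from the post-filtering counts reported above.
n_users, n_items, n_ratings = 22_363, 12_101, 158_801

density = n_ratings / (n_users * n_items)   # fraction of observed cells
sparsity = 1.0 - density

print(f"Sparsity: {sparsity:.4%}")          # ~99.94%
```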
- Key Features:
- `UserId`: Unique identifier for the customer.
- `ProductId`: Unique identifier for the product (ASIN).
- `Rating`: Numeric score (1.0 to 5.0).
- `Timestamp`: Time of the review (Unix format).
This project strictly adheres to a "Pure NumPy" constraint for data processing and modeling optimization.
- Cleaning: Handled by `src/data_processing.py`. Removed duplicates and validated data types.
- Feature Engineering: Extracted `Year` from the Unix `Timestamp` to analyze time trends and user behavior over time.
- Data Reduction (K-Core Filtering): Applied an iterative 5-core filter: only users and products with at least 5 interactions were kept. This is crucial to reduce sparsity and mitigate the "Cold Start" problem.
- Transformation: Implemented integer encoding to map string IDs to contiguous integer indices (0 to N-1) using `np.unique(return_inverse=True)`.
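The filtering and encoding steps above can be sketched in pure NumPy as follows (function and variable names are illustrative, not the actual API of `src/data_processing.py`):

```python
import numpy as np

def k_core_filter(user_ids, item_ids, ratings, k=5):
    """Iteratively drop users/items with fewer than k interactions."""
    while True:
        # Count interactions per user and per item in the current subset.
        _, u_inv, u_cnt = np.unique(user_ids, return_inverse=True, return_counts=True)
        _, i_inv, i_cnt = np.unique(item_ids, return_inverse=True, return_counts=True)
        keep = (u_cnt[u_inv] >= k) & (i_cnt[i_inv] >= k)
        if keep.all():                       # fixed point reached
            return user_ids, item_ids, ratings
        user_ids, item_ids, ratings = user_ids[keep], item_ids[keep], ratings[keep]

def encode(ids):
    """Map string IDs to contiguous integer indices 0..N-1."""
    uniques, codes = np.unique(ids, return_inverse=True)
    return codes, uniques                    # uniques[code] recovers the original ID
```

The filter must be iterative because removing a user can push an item below the threshold, and vice versa; the loop runs until no row changes.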
The predicted rating is:

$$\hat{r}_{ui} = \mu + b_u + b_i + \mathbf{q}_i^\top \mathbf{p}_u$$

Where:
- $\mu$: Global average rating.
- $b_u, b_i$: User and item bias terms.
- $\mathbf{p}_u, \mathbf{q}_i$: Latent feature vectors for the user and item.
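For a single (user, item) pair, this prediction is a few lines of NumPy (the values below are illustrative, not taken from the trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
n_factors = 50

mu = 4.2                                        # global mean rating
b_u, b_i = 0.1, -0.05                           # user / item bias terms
p_u = rng.normal(scale=0.1, size=n_factors)     # user latent vector
q_i = rng.normal(scale=0.1, size=n_factors)     # item latent vector

r_hat = mu + b_u + b_i + p_u @ q_i              # predicted rating
```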
Instead of slow Python loops, the custom `MatrixFactorization` class in `src/models.py` uses Vectorized Mini-Batch Gradient Descent:
- Loss Function: MSE with $L_2$ regularization.
- Vectorization: Used broadcasting to compute errors for the whole batch at once.
- Gradient Aggregation: Used `np.add.at` (an unbuffered in-place operation) to accumulate gradients for users/items appearing multiple times in a single batch without loops.
- Optimization Features:
  - Learning rate decay.
  - Early stopping to prevent overfitting.
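A minimal sketch of one such vectorized mini-batch update, assuming `users`, `items`, and `ratings` are aligned arrays of encoded indices and scores (names and signature are illustrative, not the actual `MatrixFactorization` API):

```python
import numpy as np

def minibatch_step(P, Q, b_u, b_i, mu, users, items, ratings, lr=0.005, reg=0.02):
    """One vectorized mini-batch SGD step; updates parameters in place."""
    # Broadcasting: predictions and errors for the whole batch at once.
    preds = mu + b_u[users] + b_i[items] + np.sum(P[users] * Q[items], axis=1)
    err = ratings - preds                             # shape: (batch,)

    # Factor gradients; np.add.at accumulates rows for duplicate indices.
    grad_P = err[:, None] * Q[items] - reg * P[users]
    grad_Q = err[:, None] * P[users] - reg * Q[items]
    np.add.at(P, users, lr * grad_P)
    np.add.at(Q, items, lr * grad_Q)

    # Bias updates, same unbuffered accumulation.
    np.add.at(b_u, users, lr * (err - reg * b_u[users]))
    np.add.at(b_i, items, lr * (err - reg * b_i[items]))
    return float(np.mean(err ** 2))                   # batch MSE before the update
```

Repeated calls on mini-batches drawn from the training set, with the loss monitored on a validation split, give the learning-rate-decay and early-stopping loop described above.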
└── 📁Lab02
    ├── 📁data
    │   ├── 📁processed                  # Encoded .npy files (X_train, y_train, maps)
    │   └── raw.csv                      # Original CSV data
    ├── 📁notebooks
    │   ├── 01_data_exploration.ipynb    # EDA & insights
    │   ├── 02_preprocessing.ipynb       # Cleaning, K-Core filtering, splitting
    │   └── 03_modeling.ipynb            # Model training, evaluation & recommendation demo
    ├── 📁src
    │   ├── 📁__pycache__                # Compiled bytecode files
    │   ├── __init__.py                  # Marks this folder as a Python package
    │   ├── data_processing.py           # Core NumPy functions for cleaning/splitting
    │   ├── models.py                    # Custom MatrixFactorization class
    │   └── visualization.py             # Plotting helper functions
    ├── .gitattributes                   # Git LFS and file-format configuration
    ├── learning_curve_plot.png          # Learning-curve plot
    ├── README.md                        # Project documentation
    └── requirements.txt                 # Library dependencies
Prerequisites
- Python 3.11+
- Jupyter Notebook

Steps
- Clone the repository: `git clone https://github.com/TrNguyenMQuan/preprocessing_analysis_modeling_amazon_beauty_dataset.git`
- Navigate to the project directory: `cd LAB02`
- Install dependencies: `pip install -r requirements.txt`

Run the notebooks in the following order to replicate the results:
- Exploration (`01_data_exploration.ipynb`):
  - Analyzes the size, distribution, and basic characteristics of the dataset; checks for outliers.
  - Poses and visualizes questions such as "Does popularity imply quality?" and "Are power users stricter?".
- Preprocessing (`02_preprocessing.ipynb`):
  - Cleans the data by handling missing values, outliers, and format errors.
  - Performs K-Core filtering and integer encoding.
  - Splits the data into train/test sets (80/20).
  - Saves processed artifacts to `data/processed/`.
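An 80/20 split of interaction indices can be sketched in pure NumPy as below; the notebook's actual splitting logic may differ (for example, it could be time-aware rather than random):

```python
import numpy as np

def train_test_split_indices(n, test_ratio=0.2, seed=42):
    """Shuffled index split: returns (train_idx, test_idx)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)           # random order over all n interactions
    n_test = int(n * test_ratio)
    return perm[n_test:], perm[:n_test]
```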
- Modeling (`03_modeling.ipynb`):
  - Loads the processed data.
  - Trains the custom Matrix Factorization model from scratch.
  - Evaluates performance and generates specific recommendations for users.
The model was trained for 50 epochs with `n_factors=50`, `learning_rate=0.005`, and `regularization=0.02`.
Quantitative Metrics (Test Set)
- RMSE (Root Mean Square Error): 1.0913
Analysis: The model's predictions deviate by approximately 1 star on average, a solid baseline for such sparse data.
- MAE (Mean Absolute Error): 0.8405
Ranking Quality (Top-10 Recommendations)
- NDCG@10: 0.9886
Insight: This near-perfect score indicates the model is extremely effective at ranking. Relevant items are consistently placed at the top of the recommendation list.
- Recall@10: 0.8500
Insight: The system successfully retrieves 85% of the items users actually liked.
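The two error metrics above are straightforward to compute in NumPy (a sketch, not the project's actual evaluation code):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error: penalizes large deviations quadratically."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean Absolute Error: average deviation in rating units (stars)."""
    return float(np.mean(np.abs(y_true - y_pred)))
```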
Challenge 1: Slow Training with Python Loops
- Issue: Standard Stochastic Gradient Descent (SGD) loops through ratings one by one ($O(N)$), which is extremely slow in Python for millions of ratings.
- Solution: Implemented Mini-Batch SGD combined with NumPy Vectorization. I calculated dot products for thousands of interactions simultaneously using matrix operations.
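The core of the speedup is replacing the per-rating Python loop with a single gathered, element-wise product; both forms below compute the same batch of dot products (array shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.normal(size=(1000, 50))         # user latent factors
Q = rng.normal(size=(800, 50))          # item latent factors
users = rng.integers(0, 1000, 4096)     # one user index per rating in the batch
items = rng.integers(0, 800, 4096)      # one item index per rating in the batch

# Loop version: one dot product per rating (slow in pure Python).
slow = np.array([P[u] @ Q[i] for u, i in zip(users, items)])

# Vectorized: gather rows, multiply element-wise, sum along the factor axis.
fast = np.sum(P[users] * Q[items], axis=1)

assert np.allclose(slow, fast)
```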
Challenge 2: Gradient Updates for Shared Indices
- Issue: In a batch, the same user or item may appear multiple times. Standard fancy-index assignment (`grad[users] = ...`) overwrites values instead of summing them.
- Solution: Used the NumPy universal-function method `np.add.at`, which performs unbuffered in-place accumulation of gradients for duplicate indices without a Python loop.
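A tiny demonstration of the difference (the arrays are illustrative):

```python
import numpy as np

users = np.array([0, 1, 0])          # user 0 appears twice in the batch
contrib = np.array([1.0, 2.0, 3.0])  # per-row gradient contributions

grad = np.zeros(3)
grad[users] = contrib                # fancy-index assignment: last write wins
print(grad)                          # [3. 2. 0.] -- user 0's first contribution lost

grad = np.zeros(3)
np.add.at(grad, users, contrib)      # unbuffered in-place accumulation
print(grad)                          # [4. 2. 0.] -- contributions summed
```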
- Hybrid Model: Incorporate "Year" and review text (content-based features) to mitigate the Cold Start problem for new items.
- Hyperparameter Tuning: Implement grid search to find the optimal `n_factors` and `regularization`.
- Bias Modeling: Add a time-decay bias to weight recent reviews more heavily (handling concept drift).
- Author: TRẦN NGUYỄN MINH QUÂN
- Student ID: 23120342
- Institution: University of Science, Vietnam National University Ho Chi Minh City
- Contact: [email protected] | [email protected]
This project is licensed under the MIT License.
