Amazon Beauty Product Recommendation System

1. Introduction

In the era of e-commerce, users are overwhelmed by choices. Recommender Systems solve this by filtering information to provide personalized suggestions.

  • The Problem: Predicting a user's rating for a product they haven't seen yet, based on sparse historical interaction data.
  • The Motivation: Building a recommendation engine from scratch allows for a deep understanding of the underlying mathematics of machine learning algorithms, moving beyond "black box" libraries.
  • The Objective: To implement a Matrix Factorization (SVD) model using only NumPy (no Scikit-learn or Surprise for the core logic) to predict ratings and recommend top beauty products on Amazon.

2. Dataset

  • Source: Amazon Beauty Review Dataset.
  • Original Size: ~2 million raw ratings.
  • Processed Size: 158,801 ratings (after Data Cleaning and K-Core Filtering).
  • Entities (Post-filtering): 22,363 Users and 12,101 Items.
  • Sparsity: ~99.94% (The matrix is extremely sparse).
  • Characteristics: High sparsity, Long-tail distribution of items.
  • Key Features:
    • UserId: Unique identifier for the customer.
    • ProductId: Unique identifier for the product (ASIN).
    • Rating: Numeric score (1.0 to 5.0).
    • Timestamp: Time of the review (Unix format).

3. Methodology

This project adheres to a strict "Pure NumPy" constraint: all data preprocessing and model optimization are implemented with NumPy alone.

3.1 Data Preprocessing (Pure NumPy)

  • Cleaning: Handled by src/data_processing.py. Removed duplicates and validated data types.
  • Feature Engineering: Extracted Year from Unix Timestamp to analyze time trends and user behavior over time.
  • Data Reduction (K-Core Filtering): Applied an iterative 5-Core filter. Only users and products with at least 5 interactions were kept. This is crucial to reduce sparsity and mitigate the "Cold Start" problem.
  • Transformation: Implemented Integer Encoding to map string IDs to continuous integer indices (0 to N-1) using np.unique(return_inverse=True).
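The encoding step above can be sketched directly with NumPy. The array contents here are illustrative, not taken from the dataset; the key point is that `np.unique(return_inverse=True)` returns both the sorted vocabulary and each row's 0-to-N-1 index in one call:

```python
import numpy as np

# Hypothetical raw string IDs, as they would appear in the loaded CSV column.
user_ids = np.array(["A1", "B2", "A1", "C3", "B2"])

# return_inverse=True yields the sorted unique IDs (the vocabulary) plus,
# for every original row, its integer index into that vocabulary.
unique_users, user_idx = np.unique(user_ids, return_inverse=True)

print(unique_users)  # ['A1' 'B2' 'C3']
print(user_idx)      # [0 1 0 2 1]
```

The inverse mapping (`unique_users[user_idx]`) reconstructs the original column, which is what makes the saved ID maps sufficient for turning predictions back into product ASINs.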

3.2 Modeling Strategy: Matrix Factorization

I predict the rating $\hat{r}_{ui}$ for user $u$ and item $i$ using the biased Matrix Factorization (Funk-SVD-style) approach:

$$\hat{r}_{ui} = \mu + b_u + b_i + \mathbf{q}_i^\top \mathbf{p}_u$$

Where:

  • $\mu$: Global average rating.
  • $b_u, b_i$: User and Item bias terms.
  • $\mathbf{p}_u, \mathbf{q}_i$: Latent feature vectors for user and item.
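For a single user/item pair, the prediction formula is one dot product plus three scalars. A minimal sketch with toy parameters (all values here are made up for illustration, not learned from the data):

```python
import numpy as np

rng = np.random.default_rng(0)
n_factors = 50

mu = 4.1                              # global average rating
b_u, b_i = 0.2, -0.1                  # user and item bias terms
p_u = rng.normal(0, 0.1, n_factors)   # latent user vector
q_i = rng.normal(0, 0.1, n_factors)   # latent item vector

# r_hat = mu + b_u + b_i + q_i . p_u
r_hat = mu + b_u + b_i + p_u @ q_i
```

With small random factors the prediction stays near the baseline `mu + b_u + b_i`; training moves the latent vectors so their dot product explains the user-item interaction left over after the biases.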

3.3 Optimization & Algorithm Details

Instead of slow Python loops, the custom MatrixFactorization class in src/models.py uses Vectorized Mini-Batch Gradient Descent:

  • Loss Function: MSE with $L_2$ Regularization.
  • Vectorization: Used Broadcasting to compute errors for the whole batch at once.
  • Gradient Aggregation: Used np.add.at (unbuffered in-place operation) to accumulate gradients for users/items appearing multiple times in a single batch without using loops.
  • Optimization Features:
    • Learning Rate Decay.
    • Early Stopping to prevent overfitting.
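The bullets above can be combined into one vectorized update step. This is a sketch, not the repository's actual `MatrixFactorization` implementation: the function name, argument names, and toy data below are illustrative assumptions, but the mechanics (broadcast error computation, `np.add.at` gradient accumulation, L2 regularization) match the techniques described:

```python
import numpy as np

def minibatch_step(P, Q, b_u, b_i, mu, users, items, ratings, lr=0.005, reg=0.02):
    """One vectorized mini-batch gradient-descent step (illustrative sketch)."""
    # Broadcast predictions for the whole batch: no Python loop over ratings.
    preds = mu + b_u[users] + b_i[items] + np.sum(P[users] * Q[items], axis=1)
    err = ratings - preds                                  # shape (batch,)

    # np.add.at accumulates (unbuffered) so users/items appearing several
    # times in the batch sum their gradient contributions instead of
    # overwriting each other.
    grad_P = np.zeros_like(P)
    grad_Q = np.zeros_like(Q)
    np.add.at(grad_P, users, err[:, None] * Q[items] - reg * P[users])
    np.add.at(grad_Q, items, err[:, None] * P[users] - reg * Q[items])
    P += lr * grad_P
    Q += lr * grad_Q
    np.add.at(b_u, users, lr * (err - reg * b_u[users]))
    np.add.at(b_i, items, lr * (err - reg * b_i[items]))
    return float(np.mean(err ** 2))

# Tiny usage example: 3 users, 2 items, one batch of 4 ratings.
rng = np.random.default_rng(0)
P = rng.normal(0, 0.1, (3, 4)); Q = rng.normal(0, 0.1, (2, 4))
b_u = np.zeros(3); b_i = np.zeros(2); mu = 4.0
users = np.array([0, 1, 0, 2]); items = np.array([0, 1, 1, 0])
ratings = np.array([5.0, 3.0, 4.0, 4.5])
mse = minibatch_step(P, Q, b_u, b_i, mu, users, items, ratings)
```

Note that user 0 and item 1 each appear twice in the batch, which is exactly the case where a plain fancy-indexed assignment would silently drop gradient contributions.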

4. Project Structure

└── 📁Lab02
    ├── 📁data
    │   ├── 📁processed             # Encoded .npy files (X_train, y_train, maps)
    │   └── raw.csv                 # Original CSV data
    ├── 📁notebooks
    │   ├── 01_data_exploration.ipynb   # EDA & insights
    │   ├── 02_preprocessing.ipynb      # Cleaning, K-Core Filtering, splitting
    │   └── 03_modeling.ipynb           # Model training, evaluation & recommendation demo
    ├── 📁src
    │   ├── 📁__pycache__           # Compiled bytecode cache
    │   ├── __init__.py             # Marks the folder as a Python package
    │   ├── data_processing.py      # Core NumPy functions for cleaning/splitting
    │   ├── models.py               # Custom MatrixFactorization class
    │   └── visualization.py        # Plotting helper functions
    ├── .gitattributes              # Git LFS and file-format configuration
    ├── learning_curve_plot.png     # Learning-curve plot
    ├── README.md                   # Project documentation
    └── requirements.txt            # Library dependencies

5. Installation & Setup

Prerequisites

  • Python 3.11+
  • Jupyter Notebook

Steps

  1. Clone the repository:
     git clone https://github.com/TrNguyenMQuan/preprocessing_analysis_modeling_amazon_beauty_dataset.git
  2. Navigate to the project directory:
     cd Lab02
  3. Install dependencies:
     pip install -r requirements.txt

6. Usage

Run the notebooks in the following order to replicate the results:

  1. Exploration 01_data_exploration.ipynb:

  • Analyzes dataset size, rating distribution, and basic statistics; checks for outliers.

  • Poses and visualizes analytical questions such as "Does popularity imply quality?" and "Are power users stricter?".

  2. Preprocessing 02_preprocessing.ipynb:

  • Cleans the data (missing values, outliers, malformed entries).

  • Performs K-Core Filtering and Integer Encoding.

  • Splits data into Train/Test sets (80/20).

  • Saves processed artifacts to data/processed/.

  3. Modeling 03_modeling.ipynb:

  • Loads the processed data.

  • Trains the custom Matrix Factorization model from scratch.

  • Evaluates performance and generates recommendations for specific users.

7. Results & Analysis

The model was trained for 50 epochs with n_factors=50, learning_rate=0.005, and regularization=0.02.

Quantitative Metrics (Test Set)

  • RMSE (Root Mean Square Error): 1.0913

Analysis: Predictions deviate by roughly one star on average, a solid baseline for data this sparse.

  • MAE (Mean Absolute Error): 0.8405
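Both metrics are one-liners under the Pure NumPy constraint. The rating arrays below are made-up examples, not the actual test-set predictions:

```python
import numpy as np

# Hypothetical ground-truth ratings vs. model predictions on a test batch.
y_true = np.array([5.0, 3.0, 4.0, 2.0, 5.0])
y_pred = np.array([4.2, 3.5, 4.1, 3.0, 4.4])

# RMSE penalizes large errors quadratically; MAE weighs all errors equally.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
mae = np.mean(np.abs(y_true - y_pred))
```

The gap between RMSE (1.0913) and MAE (0.8405) on the real test set suggests a tail of large errors, which is typical when a few long-tail items have almost no training signal.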

(Learning curve plot: see learning_curve_plot.png)

Ranking Quality (Top-10 Recommendations)

  • NDCG@10: 0.9886

Insight: This near-perfect score indicates the model is extremely effective at ranking. Relevant items are consistently placed at the top of the recommendation list.

  • Recall@10: 0.8500

Insight: The system retrieves 85% of the items users actually liked within the top-10 list.

8. Challenges & Solutions

Challenge 1: Slow Training with Python Loops

  • Issue: Standard Stochastic Gradient Descent (SGD) loops through ratings one by one ($O(N)$), which is extremely slow in Python for millions of ratings.
  • Solution: Implemented Mini-Batch SGD combined with NumPy Vectorization. I calculated dot products for thousands of interactions simultaneously using matrix operations.

Challenge 2: Gradient Updates for Shared Indices

  • Issue: In a batch, the same User or Item might appear multiple times. Standard assignment grad[users] = ... overwrites values instead of summing them.

  • Solution: Used np.add.at (the at method of the np.add universal function), which performs unbuffered in-place accumulation, so gradients for duplicate indices are summed without a for loop.
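The difference is easy to demonstrate on a toy batch (the index and error values here are illustrative):

```python
import numpy as np

users = np.array([0, 2, 0, 2, 3])        # duplicate indices within one batch
err = np.array([1.0, 0.5, 2.0, 0.5, 1.0])

# Fancy-indexed assignment keeps only the LAST write per duplicate index:
wrong = np.zeros(4)
wrong[users] = err            # wrong[0] == 2.0; the 1.0 contribution is lost

# np.add.at accumulates unbuffered, so duplicates are summed:
grad = np.zeros(4)
np.add.at(grad, users, err)   # grad == [3.0, 0.0, 1.0, 1.0]
```

The same pitfall applies to `grad[users] += err`, which NumPy also buffers per unique index, so `np.add.at` (or, in newer NumPy, the equivalent `np.bincount` trick) is the correct accumulation path.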

9. Future Improvements

  • Hybrid Model: Incorporate "Year" and Review Text (Content-Based) to solve the Cold Start problem for new items.

  • Hyperparameter Tuning: Implement Grid Search to find the optimal n_factors and regularization.

  • Bias Modeling: Add time-decay bias to weigh recent reviews more heavily (handling concept drift).

10. Contributors

  • Author: TRẦN NGUYỄN MINH QUÂN
  • Student ID: 23120342
  • Institution: University of Science, Vietnam National University Ho Chi Minh City (VNU-HCM)
  • Contact: [email protected] | [email protected]

11. License

This project is licensed under the MIT License.
