npogeant/book-recsys

Book Recsys ML Stack

A recommender system for book recommendations developed within a production-ready workflow. This project utilizes Metaflow, AWS, and the Surprise library.

Project Overview

This project develops a recommender system intended for production use, built with Metaflow, AWS, and the Surprise library. Its key components and workflow are:

  1. Algorithm Selection: A grid search evaluates four algorithms from the Surprise library, each with varying parameters. The best algorithm is selected by recall score.

  2. Model Deployment: The algorithm with the best recall score is intended for deployment on AWS SageMaker. Deployment is not implemented yet (see Improvements).

  3. Metaflow Workflow: The workflow is orchestrated and parallelized using Metaflow, allowing for efficient execution and management of tasks.

  4. Data and Model Versioning: Comprehensive data and model versioning is integrated into the workflow, ensuring traceability and reproducibility at each step.

  5. Compute Layer: AWS Batch serves as the compute layer for processing tasks, enhancing scalability and resource management.

  6. Artifact Management: All artifacts produced by the workflow, including data, models, and intermediate results, are stored in AWS S3.

Together, these components form a recommender system workflow that combines cloud-based orchestration, versioning, and a planned deployment path, with real-world recommendation applications in mind.
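Since algorithms are compared by recall, it helps to see how precision and recall at k can be computed from rating predictions. The sketch below follows a common approach for rating-based recommenders and is not taken from this repository: the function name, the `(user, item, true rating, estimated rating)` tuple layout, and the `k` and `threshold` values are assumptions chosen for illustration.

```python
from collections import defaultdict


def precision_recall_at_k(predictions, k=10, threshold=3.5):
    """Return (mean precision@k, mean recall@k) averaged over users.

    predictions: iterable of (user_id, item_id, true_rating, estimated_rating).
    An item is relevant if its true rating is >= threshold, and recommended
    if it is in the user's top k by estimate with an estimate >= threshold.
    """
    per_user = defaultdict(list)
    for user_id, _item_id, true_r, est_r in predictions:
        per_user[user_id].append((est_r, true_r))

    if not per_user:
        return 0.0, 0.0

    precisions, recalls = [], []
    for ratings in per_user.values():
        # Rank this user's items by estimated rating, highest first.
        ratings.sort(key=lambda pair: pair[0], reverse=True)
        top_k = ratings[:k]

        n_relevant = sum(true_r >= threshold for _, true_r in ratings)
        n_recommended = sum(est_r >= threshold for est_r, _ in top_k)
        n_hits = sum(
            est_r >= threshold and true_r >= threshold for est_r, true_r in top_k
        )

        # Define the ratio as 0 when the denominator is 0.
        precisions.append(n_hits / n_recommended if n_recommended else 0.0)
        recalls.append(n_hits / n_relevant if n_relevant else 0.0)

    n_users = len(per_user)
    return sum(precisions) / n_users, sum(recalls) / n_users


# Made-up predictions for two users, purely for illustration.
example = [
    ("u1", "b1", 5.0, 4.8),
    ("u1", "b2", 2.0, 4.1),
    ("u1", "b3", 4.0, 3.0),
    ("u2", "b1", 4.0, 4.2),
    ("u2", "b4", 1.0, 1.5),
]
precision, recall = precision_recall_at_k(example, k=2, threshold=3.5)
print(f"precision@2={precision:.2f} recall@2={recall:.2f}")
```

The threshold should be chosen to match the rating scale of the dataset; the value used here is arbitrary.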

Stack Architecture

[Image: stack architecture diagram]

The visual design draws significant inspiration from the exceptional illustrations created by Outerbounds and the Metaflow team.

Getting Started

Data Source

The dataset used in this project is available from Kaggle.

Environment Setup

  1. Ensure you have conda installed.

  2. Clone this repository:

    git clone https://github.com/npogeant/book-recsys.git
    cd book-recsys
  3. Create a conda environment from the provided env.yml file:

    conda env create -f env.yml

Running the main workflow

Run the Metaflow workflow with the following command. Note that flow.py sits in the Flows/ directory, so run the command from there or adjust the path:

python flow.py run

AWS Integration

If you plan to use AWS services, make sure you are authenticated with your AWS account in your environment. For guidance on setting up AWS integration with Metaflow, you can refer to this informative video tutorial by Saturn Cloud. The video covers Metaflow integration with AWS CloudFormation.

Workflow

The workflow for this project can be visualized in the Metaflow part of the stack DAG shown above. It consists of the following major steps:

  1. Preparing the Dataset: The workflow starts by preparing the dataset, which is sourced from Kaggle.

  2. Training with GridSearch: Recommendation models are trained using a grid search. Four algorithms from the Surprise library are evaluated (the run log below reports baseline, KNN, NMF, and SVD models). Metaflow's foreach construct fans this step out into four parallel tasks, one per algorithm.

  3. Analyzing and Selecting the Best Model: After training, the results from the four branches are joined for analysis, and the best model is selected based on each algorithm's recall.

  4. Testing the Best Model: The chosen best model is retained, and a prediction is made to test its performance and suitability for the recommendation task.
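The fan-out and join shape of steps 2 and 3 can be sketched without Metaflow. The example below is an analogy only: it uses `concurrent.futures` from the standard library instead of Metaflow's foreach, and the parameter grids, the `placeholder_score` function, and all helper names are hypothetical stand-ins rather than the project's actual settings.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical grids: names and values are placeholders, not the project's settings.
PARAM_GRIDS = {
    "baseline": {"method": ["als", "sgd"]},
    "knn": {"k": [20, 40], "min_k": [1, 5]},
    "nmf": {"n_factors": [15, 30]},
    "svd": {"n_factors": [50, 100], "lr_all": [0.005, 0.01]},
}


def expand_grid(grid):
    """Turn {"a": [1, 2], "b": [3]} into [{"a": 1, "b": 3}, {"a": 2, "b": 3}]."""
    keys = sorted(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


def placeholder_score(algo_name, params):
    """Stand-in for 'train the model and measure recall'.

    It simply prefers smaller numeric parameter values so the example is
    deterministic; a real scorer would fit and evaluate a model.
    """
    return -sum(v for v in params.values() if isinstance(v, (int, float)))


def search_one_algorithm(algo_name, grid, score_fn):
    """One branch of the fan-out: score every combination, keep the best."""
    candidates = expand_grid(grid)
    best_params = max(candidates, key=lambda params: score_fn(algo_name, params))
    return {
        "algo": algo_name,
        "params": best_params,
        "score": score_fn(algo_name, best_params),
        "n_candidates": len(candidates),
    }


def run_all(grids, score_fn):
    """Fan out one task per algorithm, then join the results in input order."""
    with ThreadPoolExecutor(max_workers=len(grids)) as pool:
        futures = [
            pool.submit(search_one_algorithm, name, grid, score_fn)
            for name, grid in grids.items()
        ]
        return [future.result() for future in futures]


results = run_all(PARAM_GRIDS, placeholder_score)
for result in results:
    print(result["algo"], result["params"], result["n_candidates"])
```

In Metaflow the branches run as separate tasks with their own versioned artifacts, which a thread pool does not provide; the sketch only illustrates the control flow.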

Additionally, a Jupyter Notebook named flow_analysis_notebook.ipynb is provided for in-depth result analysis. You can use this notebook to analyze the results of any runs from Metaflow using the API. All data, artifacts, and runs are versioned and accessible through the notebook.

For a more detailed understanding of the workflow and recommendation system, you can refer to the tutorial by Outerbounds and Jacopo Tagliabue, which served as a valuable source of inspiration for this project.

The output of a flow run with AWS configured is shown below:

Flow run output
Metaflow 2.7.14 executing BookRecSysFlow for user:npogeant
Validating your flow...
  The graph looks good!
Running pylint...
  Pylint is happy!
2023-09-03 22:28:01.059 Workflow starting (run-id 2):
2023-09-03 22:28:02.903 [2/start/13 (pid 37197)] Task is starting.
2023-09-03 22:28:06.879 [2/start/13 (pid 37197)] flow name: BookRecSysFlow
2023-09-03 22:28:09.971 [2/start/13 (pid 37197)] run id: 2
2023-09-03 22:28:09.972 [2/start/13 (pid 37197)] username: npogeant
2023-09-03 22:28:09.972 [2/start/13 (pid 37197)] datastore is: s3://metaflow-cloudstack-metaflows3bucket-iat7no5zm5u1/metaflow
2023-09-03 22:28:11.488 [2/start/13 (pid 37197)] Task finished successfully.
2023-09-03 22:28:13.708 [2/prepare_dataset/14 (pid 37215)] Task is starting.
2023-09-03 22:28:21.624 [2/prepare_dataset/14 (pid 37215)] # The original data frame shape:	(1149780, 5)
2023-09-03 22:28:26.608 [2/prepare_dataset/14 (pid 37215)] # The new data frame shape:	(282829, 5)
2023-09-03 22:28:28.012 [2/prepare_dataset/14 (pid 37215)] Foreach yields 4 child steps.
2023-09-03 22:28:28.012 [2/prepare_dataset/14 (pid 37215)] Task finished successfully.
2023-09-03 22:28:30.373 [2/gridsearch_training/15 (pid 37242)] Task is starting.
2023-09-03 22:28:31.490 [2/gridsearch_training/16 (pid 37245)] Task is starting.
2023-09-03 22:28:32.563 [2/gridsearch_training/17 (pid 37249)] Task is starting.
2023-09-03 22:28:33.457 [2/gridsearch_training/18 (pid 37251)] Task is starting.
2023-09-03 22:28:34.401 [2/gridsearch_training/15 (pid 37242)] <class 'dict'>
2023-09-03 22:28:35.403 [2/gridsearch_training/16 (pid 37245)] <class 'dict'>
2023-09-03 22:28:37.409 [2/gridsearch_training/15 (pid 37242)] Estimating biases using als...
2023-09-03 22:28:37.454 [2/gridsearch_training/17 (pid 37249)] <class 'dict'>
2023-09-03 22:28:38.590 [2/gridsearch_training/18 (pid 37251)] <class 'dict'>
2023-09-03 22:28:38.720 [2/gridsearch_training/16 (pid 37245)] Computing the msd similarity matrix...
2023-09-03 22:28:38.856 [2/gridsearch_training/15 (pid 37242)] Estimating biases using als...
2023-09-03 22:28:40.537 [2/gridsearch_training/15 (pid 37242)] Estimating biases using als...
2023-09-03 22:28:41.110 [2/gridsearch_training/16 (pid 37245)] Done computing similarity matrix.
2023-09-03 22:28:42.179 [2/gridsearch_training/15 (pid 37242)] Estimating biases using als...
2023-09-03 22:28:44.894 [2/gridsearch_training/15 (pid 37242)] Estimating biases using als...
2023-09-03 22:28:46.551 [2/gridsearch_training/15 (pid 37242)] Estimating biases using als...
2023-09-03 22:28:48.128 [2/gridsearch_training/15 (pid 37242)] Estimating biases using als...
2023-09-03 22:28:49.774 [2/gridsearch_training/15 (pid 37242)] Estimating biases using als...
2023-09-03 22:28:50.768 [2/gridsearch_training/16 (pid 37245)] Computing the msd similarity matrix...
2023-09-03 22:28:50.982 [2/gridsearch_training/15 (pid 37242)] Estimating biases using als...
2023-09-03 22:28:52.488 [2/gridsearch_training/15 (pid 37242)] Estimating biases using sgd...
2023-09-03 22:28:52.528 [2/gridsearch_training/16 (pid 37245)] Done computing similarity matrix.
2023-09-03 22:28:54.752 [2/gridsearch_training/15 (pid 37242)] Estimating biases using sgd...
2023-09-03 22:28:56.274 [2/gridsearch_training/15 (pid 37242)] Estimating biases using sgd...
2023-09-03 22:28:57.911 [2/gridsearch_training/15 (pid 37242)] Estimating biases using sgd...
2023-09-03 22:28:59.416 [2/gridsearch_training/16 (pid 37245)] Computing the msd similarity matrix...
2023-09-03 22:28:59.492 [2/gridsearch_training/15 (pid 37242)] Estimating biases using sgd...
2023-09-03 22:29:01.497 [2/gridsearch_training/16 (pid 37245)] Done computing similarity matrix.
2023-09-03 22:29:01.630 [2/gridsearch_training/15 (pid 37242)] Estimating biases using sgd...
2023-09-03 22:29:03.549 [2/gridsearch_training/15 (pid 37242)] Estimating biases using sgd...
2023-09-03 22:29:05.288 [2/gridsearch_training/15 (pid 37242)] Estimating biases using sgd...
2023-09-03 22:29:06.851 [2/gridsearch_training/15 (pid 37242)] Estimating biases using sgd...
2023-09-03 22:29:08.260 [2/gridsearch_training/16 (pid 37245)] Computing the msd similarity matrix...
2023-09-03 22:29:08.765 [2/gridsearch_training/15 (pid 37242)] Model trained
2023-09-03 22:29:10.069 [2/gridsearch_training/16 (pid 37245)] Done computing similarity matrix.
2023-09-03 22:29:17.577 [2/gridsearch_training/15 (pid 37242)] Task finished successfully.
2023-09-03 22:29:18.424 [2/gridsearch_training/16 (pid 37245)] Computing the msd similarity matrix...
2023-09-03 22:29:25.269 [2/gridsearch_training/16 (pid 37245)] Done computing similarity matrix.
2023-09-03 22:29:25.269 [2/gridsearch_training/16 (pid 37245)] Computing the msd similarity matrix...
2023-09-03 22:29:27.910 [2/gridsearch_training/16 (pid 37245)] Done computing similarity matrix.
2023-09-03 22:29:34.204 [2/gridsearch_training/16 (pid 37245)] Computing the msd similarity matrix...
2023-09-03 22:29:36.230 [2/gridsearch_training/16 (pid 37245)] Done computing similarity matrix.
2023-09-03 22:29:45.316 [2/gridsearch_training/16 (pid 37245)] Computing the msd similarity matrix...
2023-09-03 22:29:47.379 [2/gridsearch_training/16 (pid 37245)] Done computing similarity matrix.
2023-09-03 22:29:55.158 [2/gridsearch_training/16 (pid 37245)] Computing the msd similarity matrix...
2023-09-03 22:29:59.164 [2/gridsearch_training/16 (pid 37245)] Done computing similarity matrix.
2023-09-03 22:30:18.051 [2/gridsearch_training/16 (pid 37245)] Computing the msd similarity matrix...
2023-09-03 22:30:20.587 [2/gridsearch_training/16 (pid 37245)] Done computing similarity matrix.
2023-09-03 22:30:29.486 [2/gridsearch_training/16 (pid 37245)] Computing the msd similarity matrix...
2023-09-03 22:30:31.785 [2/gridsearch_training/16 (pid 37245)] Done computing similarity matrix.
2023-09-03 22:30:40.137 [2/gridsearch_training/16 (pid 37245)] Computing the msd similarity matrix...
2023-09-03 22:30:42.287 [2/gridsearch_training/16 (pid 37245)] Done computing similarity matrix.
2023-09-03 22:30:53.420 [2/gridsearch_training/16 (pid 37245)] Model trained
2023-09-03 22:30:58.504 [2/gridsearch_training/17 (pid 37249)] Model trained
2023-09-03 22:31:02.948 [2/gridsearch_training/16 (pid 37245)] Task finished successfully.
2023-09-03 22:31:06.256 [2/gridsearch_training/17 (pid 37249)] Task finished successfully.
2023-09-03 22:32:31.642 [2/gridsearch_training/18 (pid 37251)] Model trained
2023-09-03 22:32:40.143 [2/gridsearch_training/18 (pid 37251)] Task finished successfully.
2023-09-03 22:32:42.669 [2/join_train/19 (pid 37417)] Task is starting.
2023-09-03 22:33:01.932 [2/join_train/19 (pid 37417)] Task finished successfully.
2023-09-03 22:33:04.146 [2/select_best_model/20 (pid 37449)] Task is starting.
2023-09-03 22:33:04.146 1 task is running: select_best_model (1 running; 0 done).
2023-09-03 22:33:04.146 No tasks are waiting in the queue.
2023-09-03 22:33:04.147 2 steps have not started: build_retrieval_model, end.
2023-09-03 22:33:10.338 [2/select_best_model/20 (pid 37449)] Estimating biases using sgd...
2023-09-03 22:33:12.491 [2/select_best_model/20 (pid 37449)] Precision baseline: 0.4082872794260233
2023-09-03 22:33:14.761 [2/select_best_model/20 (pid 37449)] Recall baseline: 0.1737481052023987
2023-09-03 22:33:14.762 [2/select_best_model/20 (pid 37449)]
2023-09-03 22:33:14.762 [2/select_best_model/20 (pid 37449)] Computing the msd similarity matrix...
2023-09-03 22:33:14.763 [2/select_best_model/20 (pid 37449)] Done computing similarity matrix.
2023-09-03 22:33:35.056 [2/select_best_model/20 (pid 37449)] Precision knn: 0.9102675974403723
2023-09-03 22:33:48.663 [2/select_best_model/20 (pid 37449)] Recall knn: 0.5671141762419407
2023-09-03 22:33:48.663 [2/select_best_model/20 (pid 37449)]
2023-09-03 22:33:48.663 [2/select_best_model/20 (pid 37449)] Precision nmf: 0.9089586969168121
2023-09-03 22:33:55.177 [2/select_best_model/20 (pid 37449)] Recall nmf: 0.5673132234949189
2023-09-03 22:33:55.177 [2/select_best_model/20 (pid 37449)]
2023-09-03 22:33:55.177 [2/select_best_model/20 (pid 37449)] Precision svd: 0.5934021718053135
2023-09-03 22:34:21.812 [2/select_best_model/20 (pid 37449)] Recall svd: 0.24679239541951564
2023-09-03 22:34:21.813 [2/select_best_model/20 (pid 37449)]
2023-09-03 22:34:21.813 [2/select_best_model/20 (pid 37449)] The best model by precision is knn
2023-09-03 22:34:21.813 [2/select_best_model/20 (pid 37449)] and the best model by recall is nmf.)
2023-09-03 22:34:23.584 [2/select_best_model/20 (pid 37449)] Task finished successfully.
2023-09-03 22:34:25.125 [2/build_retrieval_model/21 (pid 37525)] Task is starting.
2023-09-03 22:34:57.402 [2/build_retrieval_model/21 (pid 37525)] Task finished successfully.
2023-09-03 22:34:59.901 [2/end/22 (pid 37566)] Task is starting.
2023-09-03 22:35:09.261 [2/end/22 (pid 37566)] Task finished successfully.
2023-09-03 22:35:09.744 Done!
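The selection reported by the select_best_model step can be reproduced from the logged scores. The snippet below is a minimal sketch: the scores are copied from the log above, while the `SCORES` dictionary and the `best_by` helper are illustrative names, not code from the repository.

```python
# Scores copied from the select_best_model log lines above.
SCORES = {
    "baseline": {"precision": 0.4082872794260233, "recall": 0.1737481052023987},
    "knn": {"precision": 0.9102675974403723, "recall": 0.5671141762419407},
    "nmf": {"precision": 0.9089586969168121, "recall": 0.5673132234949189},
    "svd": {"precision": 0.5934021718053135, "recall": 0.24679239541951564},
}


def best_by(scores, metric):
    """Return the name of the model with the highest value for `metric`."""
    return max(scores, key=lambda name: scores[name][metric])


print("best by precision:", best_by(SCORES, "precision"))  # knn
print("best by recall:", best_by(SCORES, "recall"))  # nmf
```

The recall gap between nmf and knn in this run is about 0.0002, so the ranking between these two could plausibly change from one run to another.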

Directory Structure

book-recsys/
│
├── Flows/
│   ├── flow.py
│   ├── predict_flow.py
│
├── Notebooks/
│   ├── flow_analysis_notebook.ipynb
│   ├── eda_notebook.ipynb
│
├── Data/
│   ├── Books.csv
│   ├── Ratings.csv
│   ├── Users.csv
│
├── env.yml
│
├── README.md
│
└── .gitignore

Dependencies

The project relies on the following major dependencies, libraries, and tools:

  • Metaflow
  • Surprise
  • AWS (Batch, S3, SageMaker)
  • Pandas
  • Python 3.7+

Improvements

A natural next step is to deploy the selected model on SageMaker. The current workflow and stack are also relatively simple and could be extended with monitoring and a job scheduler as part of a CI/CD pipeline.
