TRACK 7: Unleashing Potential in Machine Learning Infrastructure
Inspiration
After undertaking a course on Machine Learning (ML), we realised how difficult it was to learn to build and train models effectively. We therefore wanted to make ML more accessible and easier for the general public, especially aspiring ML engineers and startups. Given the wide variety of tools available today, we sought to create a one-stop website for designing models easily and hassle-free.
Our team, RCH4CKERS, has embarked on a new project, SimplifyML, with one goal in mind: to democratise access to ML. In particular, SimplifyML aims to let the everyday person start their own personal ML project, and small startups expand their businesses through ML.
What it does
SimplifyML is a one-stop, no-code solution for users to build their very own ML models. They simply upload their data, and training a model is just a few clicks away. They can then apply their trained models to other datasets to make predictions. We also provide synthetic data generation using generative AI to support users with scarce data.
The current main features of SimplifyML are:
- An AutoML platform where users can easily train a new model and apply it with zero code
- Synthetic data generation where users can generate up to 100 rows (beta) of new sentences
Future features include full support for logging and monitoring. The current version already monitors all processes involving the model and logs these events; however, given our target users' lower familiarity and proficiency with ML, monitoring and logging are not exposed for viewing, to prevent confusion.
To get started, visit our GitHub repository: https://github.com/GSgiansen/tiktokjam24.git
Development
Frontend
Our frontend, developed using Next.js, enables users to upload a CSV file of their choice containing both feature and target columns, and to specify which column serves as the target. For authentication, we rely on Supabase Auth so that users can securely access our website. The frontend was designed to look polished and be easy to use, so that startups and small businesses would be receptive to it.
Backend
Database
We have chosen open-source Supabase as our database. It serves as the storage for data uploaded by users, intermediate data needed by the AutoML pipeline, trained models, and predictions.
Workflow Orchestration
For our AutoML pipeline, we employed Apache Airflow, an open-source workflow management platform, as the central orchestrator. It handles the automation and scheduling of the tasks involved in data preprocessing, model training, evaluation, and prediction. These tasks are scripts we have written, which exist as nodes in a Directed Acyclic Graph (DAG) with a defined flow and order. These plug-and-play nodes allow for easy future updates: for example, more nodes can be added to the pipeline for greater complexity and scalability.
The workflow is divided into two primary DAGs: ml_pipeline and add_predict.
ml_pipeline
This DAG automates the entire lifecycle of a machine learning model. It starts by downloading raw data from Supabase, splitting it into training and testing sets, and uploading these splits back to Supabase.
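The download-split-upload stage above can be sketched as follows. This is a minimal illustration, not the real DAG code: each function would map to one Airflow task, the Supabase download/upload steps are replaced by an in-memory DataFrame, and the column names are made up for the example.

```python
# Sketch of the first ml_pipeline stage: split raw data into train/test sets.
# In the real pipeline this logic lives in an Airflow task; Supabase I/O is
# stubbed out here with an in-memory DataFrame.
import pandas as pd
from sklearn.model_selection import train_test_split

def split_dataset(df: pd.DataFrame, target: str,
                  test_size: float = 0.2, seed: int = 42):
    """Split a raw dataset into train/test frames, keeping the target column."""
    if target not in df.columns:
        raise ValueError(f"target column {target!r} not in data")
    train_df, test_df = train_test_split(
        df, test_size=test_size, random_state=seed, shuffle=True
    )
    return train_df.reset_index(drop=True), test_df.reset_index(drop=True)

# Illustrative stand-in for data downloaded from Supabase.
raw = pd.DataFrame({
    "feature_a": range(10),
    "feature_b": ["x", "y"] * 5,
    "label": [0, 1] * 5,
})
train_df, test_df = split_dataset(raw, target="label")
print(len(train_df), len(test_df))  # 8 2
```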
The training and testing data are then preprocessed to ensure the data is clean and consistent, which is critical for reliable model training. In this stage, the data undergoes several cleaning steps, such as removal of whitespace and duplicate rows, and handling of missing values, where categorical and numerical features are treated differently.
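A minimal sketch of that cleaning step: trim whitespace, drop duplicate rows, and fill missing values differently for numerical columns (median) and categorical columns (mode). These are common defaults; SimplifyML's actual imputation strategies may differ.

```python
# Illustrative cleaning step: whitespace, duplicates, then type-aware imputation.
import pandas as pd
import numpy as np

def clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Strip stray whitespace from string cells.
    for col in df.select_dtypes(include="object"):
        df[col] = df[col].str.strip()
    # Remove exact duplicate rows.
    df = df.drop_duplicates().reset_index(drop=True)
    # Impute: median for numeric columns, most frequent value otherwise.
    for col in df.columns:
        if df[col].isna().any():
            if pd.api.types.is_numeric_dtype(df[col]):
                df[col] = df[col].fillna(df[col].median())
            else:
                df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df

dirty = pd.DataFrame({
    "city": [" sg", "sg", "kl", None],
    "size": [1.0, 1.0, np.nan, 4.0],
})
cleaned = clean(dirty)
print(cleaned)
```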
The preprocessed data is then used to train an ML model, which is subsequently evaluated on the testing data. In this stage, categorical features are encoded and numerical features are scaled. Feature selection is also performed to reduce the complexity and computation needed.
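The encode-scale-select-train stage can be expressed as a single scikit-learn pipeline. This is a hedged sketch: the write-up does not state which estimator or selection settings SimplifyML uses, so the logistic regression, `SelectKBest`, and `k` value below are illustrative choices.

```python
# Illustrative train/evaluate stage: scale numeric features, one-hot encode
# categorical ones, keep the strongest features, then fit a classifier.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

def build_pipeline(numeric_cols, categorical_cols, k_features=2):
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("select", SelectKBest(f_classif, k=k_features)),
        ("model", LogisticRegression(max_iter=1000)),
    ])

# Toy training data; real data comes from the preprocessing stage.
X = pd.DataFrame({
    "age": [22, 35, 47, 51, 29, 40, 33, 60],
    "plan": ["a", "b", "a", "b", "a", "b", "a", "b"],
})
y = [0, 1, 0, 1, 0, 1, 0, 1]
pipe = build_pipeline(["age"], ["plan"], k_features=2)
pipe.fit(X, y)
print(pipe.score(X, y))
```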
The trained model, evaluation metrics, and selected features are saved to Supabase for future use, ensuring scalability and reproducibility. All temporary files used in this pipeline are cleaned up to keep the working environment tidy.
add_predict
This DAG focuses on the prediction phase. It begins by downloading input data, the pre-trained model and selected features from Supabase.
The input data undergoes data preprocessing similar to the training phase of the ml_pipeline. The processed data is filtered to include only the selected features used during model training, ensuring consistency between training and prediction phases. Predictions are then made using the pre-trained model, and the results are appended to the original dataset.
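The core of that flow, sketched below: load the serialised model and its selected-feature list, restrict the incoming data to those features, predict, and append the results. Supabase I/O is replaced by a local Joblib file, and the feature names are invented for the example.

```python
# Illustrative add_predict flow with a locally serialised model.
import os
import tempfile
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Stand-ins for the artefacts ml_pipeline would have stored in Supabase.
selected_features = ["f1", "f2"]
model = LogisticRegression().fit(
    pd.DataFrame({"f1": [0, 1, 0, 1], "f2": [0, 0, 1, 1]}), [0, 1, 0, 1]
)
model_path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(model, model_path)

# Prediction phase: input data may carry extra columns; keep only the
# features seen at training time so the two phases stay consistent.
incoming = pd.DataFrame({"f1": [1, 0], "f2": [1, 0], "extra": ["x", "y"]})
loaded = joblib.load(model_path)
incoming["prediction"] = loaded.predict(incoming[selected_features])
print(incoming)
```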
These results are saved back to Supabase, and the project status is updated in the database. The pipeline includes tasks to manage temporary files, ensuring a clean execution environment.
This setup utilises technologies like Apache Airflow for workflow orchestration, Python for data processing and machine learning, Pandas and NumPy for data manipulation, Scikit-learn for preprocessing and model handling, Joblib for model serialisation, and Supabase for data storage and retrieval.
Synthetic Data Generation
We employ retrieval-augmented generation (RAG) to perform data augmentation. Given a base CSV file, we embed the textual data in the CSV to create a vector store. This vector store is then provided to an LLM as a knowledge base, serving as the basis for synthetic generation.
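A toy illustration of the retrieval half of RAG: a bag-of-words vector stands in for OpenAIEmbeddings, and a plain list stands in for the Chroma vector store. Real embeddings capture semantics; this sketch only shows the retrieve-then-prompt shape.

```python
# Toy retrieval: embed documents, then return the one closest to the query.
import math
from collections import Counter

corpus = [
    "great battery life and fast charging",
    "screen cracked after one week",
    "battery drains overnight",
]
vocab = sorted({w for doc in corpus for w in doc.lower().split()})

def embed(text: str) -> list:
    """Bag-of-words vector over the corpus vocabulary (embedding stand-in)."""
    counts = Counter(text.lower().split())
    return [counts.get(word, 0) for word in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

store = [(doc, embed(doc)) for doc in corpus]  # stand-in for Chroma
query_vec = embed("battery problems")
best_doc, _ = max(store, key=lambda item: cosine(query_vec, item[1]))
print(best_doc)
```

In the real pipeline, the retrieved rows are passed to the LLM as context so its generated rows resemble the user's data.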
Two scripts, embed_data.py and augment_data.py, work in tandem to embed data for efficient retrieval and enrich a dataset with new, similar entries.
embed_data.py
In embed_data.py, the script reads a CSV file, detects its encoding, and dynamically creates a Pydantic model based on the CSV's structure. It then uses CSVLoader to load the data, splits it into chunks with RecursiveCharacterTextSplitter, and embeds these chunks into a vector store using Chroma and OpenAIEmbeddings. This setup prepares the data for efficient retrieval and augmentation.
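The dynamic Pydantic model step can be sketched like this. The type inference here is deliberately simplified to float-or-string; the real script's rules may be richer, and the sample CSV is invented.

```python
# Build a Pydantic model from a CSV header so generated rows can be
# validated against the file's structure.
import csv
import io
from pydantic import create_model

sample_csv = "name,age,review\nAlice,30,great product\nBob,25,too slow\n"

def model_from_csv(text: str):
    """Infer a field type per column from the first data row."""
    first = next(csv.DictReader(io.StringIO(text)))
    fields = {}
    for col, value in first.items():
        try:
            float(value)
            fields[col] = (float, ...)
        except ValueError:
            fields[col] = (str, ...)
    return create_model("CsvRow", **fields)

CsvRow = model_from_csv(sample_csv)
row = CsvRow(name="Cara", age=41, review="solid")
print(row.age)  # 41.0 (coerced to float by the inferred schema)
```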
augment_data.py
In augment_data.py, the script retrieves relevant data from the vector store and uses a large language model (ChatGPT from OpenAI) to generate new, similar data entries based on a given description. It ensures the new entries match the structure defined by the Pydantic model created in embed_data.py. These augmented entries are then appended to the original CSV file, enriching the dataset. By adjusting the temperature of the LLM and allowing for user input customisation, we can generate data with good variance and categorical coverage.
Key technologies used include Python, Pandas, Langchain, ChromaDB, OpenAIEmbeddings, and recursive text splitting methods.
FastAPI
FastAPI is employed in four files (classification_models.py, projects.py, regression_models.py, and users.py) to create a robust and scalable API system for managing machine learning models and user data.
We define several HTTP endpoints (POST, GET, PUT, DELETE) for various CRUD (Create, Read, Update, Delete) operations. For instance, classification_models.py and regression_models.py provide endpoints to create, retrieve, and manage classification and regression models respectively. Similarly, users.py defines endpoints for managing user data.
All files integrate with a Supabase database client for data operations. Each endpoint interacts with the database to perform necessary operations, such as checking for existing records, inserting new records, updating existing ones, and deleting records.
Challenges we ran into
Utilising the different ML libraries required a deeper understanding of Docker to coordinate the startup of the different images. We also had to be careful to minimise the number of packages so as to use our Docker space efficiently. On top of that, we had to juggle this rather big project with our full-time commitments, forcing our meetings to be held late at night despite our tiredness.
Accomplishments that we're proud of
We are proud to have built a substantially complex application while juggling our academic and work commitments.
What we learned
We all picked up Apache Airflow as a new skill, using it to coordinate different tasks. We also touched most of the tech stack and gained experience adjusting the various components of the project.
What's next for RCH4CKERS
If we were to continue working on SimplifyML, our next steps would most likely be to increase the complexity and flexibility of the current AutoML pipeline to cater to the needs of different startups. These additions would be hidden behind an optional advanced mode, keeping the platform accessible to laypeople.
As a team, we are very open to exploring new avenues and ideating new solutions.
Built With
- apache-airflow
- chromadb
- digitalocean
- docker
- fastapi
- gpt-3.5-turbo
- joblib
- langchain
- magic-ui
- nextjs
- numpy
- pandas
- python
- scikit-learn
- shadcn-ui
- supabase
- typescript