LineageFlow
Easily track data versions and lineage through the machine learning lifecycle.
About
❓ Problem Statement
How can we develop a tool that tracks data versions and lineage through the machine learning lifecycle, helping data scientists understand how datasets have changed over time and how different versions of datasets affect model performance?
💡 Motivation
As datasets evolve, tracking their changes and understanding their impact on machine learning models becomes increasingly complex. LineageFlow aims to simplify this process by providing an intuitive tool for data versioning and lineage tracking, ensuring data manageability, quality, and reproducibility.
🧑 Target Audience
- Data Scientists and Engineers
- Machine Learning Engineers
- Organizations needing robust data management solutions
❗ Value Proposition
LineageFlow leverages Git-like semantics such as branches, commits, merges, and rollbacks to offer a familiar and powerful system for data versioning and lineage tracking. This approach allows users to manage, collaborate, and ensure the quality of their data throughout its lifecycle.
💻 Tech Stack
Client:
- React
- Vite
Backend:
- Django
- Django REST Framework
Storage:
- Supabase PostgreSQL
- Google Cloud Storage
🔨 Architecture
- We store the actual objects and data in Google Cloud Storage, and pointers to that data in our PostgreSQL database (https://github.com/WangYuTengg/LineageFlow/blob/main/assets/architecture-diagram.jpg)
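The pointer pattern can be sketched in a few lines. This is a hypothetical illustration only: the dictionaries stand in for the real Google Cloud Storage bucket and PostgreSQL pointer table, and the function and field names are assumptions, not LineageFlow's actual API.

```python
import hashlib

# In-memory stand-ins for the real stores (assumptions for illustration):
# GCS_BUCKET plays the role of the Google Cloud Storage bucket,
# POINTER_TABLE the role of a PostgreSQL pointer table.
GCS_BUCKET: dict[str, bytes] = {}
POINTER_TABLE: dict[str, dict] = {}

def upload_object(repo: str, path: str, data: bytes) -> str:
    """Store raw bytes in object storage; keep only a pointer in the DB."""
    key = hashlib.sha256(data).hexdigest()  # content-addressed storage key
    GCS_BUCKET[key] = data                  # real code would write to the cloud bucket
    POINTER_TABLE[f"{repo}:{path}"] = {"storage_key": key, "size": len(data)}
    return key

key = upload_object("demo-repo", "train/data.csv", b"a,b\n1,2\n")
```

Content-addressing the storage key means identical object versions are stored once, while the database row stays small enough to query and version cheaply.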
🛠️ Database design
- To support data versioning with Git-like semantics, we followed the data hierarchy below, which shaped our database schemas and design decisions.
- With this, we can apply version control whenever operations such as add, delete, and edit are performed on the data. (https://github.com/WangYuTengg/LineageFlow/blob/main/assets/data_hierarchy.jpg)
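A Repository → Branch → Commit hierarchy like the one in the diagram might look roughly like this. The class and field names here are assumptions for illustration; the real schema lives in PostgreSQL behind Django models.

```python
from dataclasses import dataclass, field

@dataclass
class Commit:
    message: str
    parent: "Commit | None"
    # path -> storage key of the object version visible at this commit
    snapshot: dict[str, str] = field(default_factory=dict)

@dataclass
class Branch:
    name: str
    head: "Commit | None" = None

@dataclass
class Repository:
    name: str
    branches: dict[str, Branch] = field(default_factory=dict)

repo = Repository("demo")
main = Branch("main")
repo.branches["main"] = main
c1 = Commit("add dataset", None, {"data.csv": "sha256:abc"})
main.head = c1
# an edit never mutates old data: it produces a new commit
# pointing at a new object version
c2 = Commit("edit dataset", c1, {"data.csv": "sha256:def"})
main.head = c2
```

Because edits create new commits rather than overwriting rows, every historical version of the data stays reachable through the parent chain.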
✔️ Current Features
Repositories
- Simple user signup, login and auth flow
- View your repositories
- Create a new repository (with an option to link the repository to an existing cloud bucket) (https://github.com/WangYuTengg/LineageFlow/blob/main/assets/repo-list.JPG)
Objects view
- View objects in file & folder structure
- Upload objects into repository (local files and folders)
- Download/View/Delete objects (https://github.com/WangYuTengg/LineageFlow/blob/main/assets/objects-page.JPG)
Staging area
- Uncommitted changes move to a staging area before they are committed
- Review pending changes before committing them
- Enter a commit message (https://github.com/WangYuTengg/LineageFlow/blob/main/assets/uncommited-changes-page.JPG)
Branches
- A single repository can have multiple branches
- Create branch from a parent branch
- Each branch has its own commit history, and data versioning (https://github.com/WangYuTengg/LineageFlow/blob/main/assets/branches-page.JPG)
Commits
- View the detailed commit history of the selected branch (files added/deleted/edited) in a timeline view
- Roll back/revert to any commit in the history (https://github.com/WangYuTengg/LineageFlow/blob/main/assets/commits-page.JPG)
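Rollback in Git-like systems is typically implemented not by deleting history but by appending a new commit that restores an older snapshot; a minimal sketch of that idea, with assumed names and commit shapes, is:

```python
# Hypothetical rollback: append a new commit whose file set matches an
# earlier one, so the full history remains intact and auditable.

def rollback(history: list[dict], target_index: int) -> list[dict]:
    """Return history extended with a commit restoring history[target_index]."""
    target = history[target_index]
    revert = {
        "message": f"rollback to: {target['message']}",
        "snapshot": dict(target["snapshot"]),  # copy the old file set
    }
    return history + [revert]

history = [
    {"message": "v1", "snapshot": {"data.csv": "sha256:v1"}},
    {"message": "v2", "snapshot": {"data.csv": "sha256:v2"}},
]
history = rollback(history, 0)  # restore v1 as a new third commit
```

Because the rollback is itself a commit, it appears in the timeline view like any other change and can itself be reverted.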
Settings
- Rename your repository
- Switch/rename default branch
- Delete your repository
- View collaborators (https://github.com/WangYuTengg/LineageFlow/blob/main/assets/settings-page.JPG)
📅 Future Plans
Immediate improvements to be made are:
- Cloud Integration: Support other cloud buckets (e.g., AWS S3, Azure Blob Storage, Cloudflare R2)
- Collaboration: Enable collaboration by adding user roles, invites, branch merging, etc.
- Deployment: Deploy our product to quickly iterate based on real usage.
The possibilities are endless:
- Feature Store: Integrate a feature store to manage and share features across different machine learning models, ensuring consistency and reusability.
- Automated ML Pipeline: Develop automated machine learning pipelines to streamline data preprocessing, model training, evaluation, and deployment, increasing efficiency and reducing manual intervention.
- Data Quality Monitoring: Implement data quality monitoring and alerting systems to detect anomalies, ensuring data integrity and reliability throughout the machine learning lifecycle.
🏆 Challenges Faced
- Designing the database correctly was the most crucial step, and we should have spent more time on it; a lot of time was wasted on re-migrations after we realised our schemas did not work.
- Integration with Google Cloud Storage buckets proved technically difficult, with issues such as authentication
- We underestimated the scope of the project and faced time constraints
- All in all, we are proud of what we accomplished in a week of building and of tackling a difficult problem statement.
✍🏻 Contributors
- Jayden - Fullstack
- Wang Yu Teng - Fullstack
- Pei Yee - Database, Backend
Built With
- django
- google-cloud-bucket
- python
- react
- typescript