LineageFlow
Easily track data versions and lineage through the machine learning lifecycle.
About
❓ Problem Statement
How can we develop a tool that tracks data versions and lineage through the machine learning lifecycle, helping data scientists understand how datasets have changed over time and how different versions of datasets affect model performance?
💡 Motivation
As datasets evolve, tracking their changes and understanding their impact on machine learning models becomes increasingly complex. LineageFlow aims to simplify this process by providing an intuitive tool for data versioning and lineage tracking, ensuring data manageability, quality, and reproducibility.
🧑 Target Audience
- Data Scientists and Engineers
- Machine Learning Engineers
- Organizations needing robust data management solutions
❗ Value Proposition
LineageFlow leverages Git-like semantics such as branches, commits, merges, and rollbacks to offer a familiar and powerful system for data versioning and lineage tracking. This approach allows users to manage, collaborate, and ensure the quality of their data throughout its lifecycle.
💻 Tech Stack
Client:
- React
- Vite
Backend:
- Django
- Django REST Framework
Storage:
- Supabase PostgreSQL
- Google Cloud Storage
🔨 Architecture
- We store the actual objects and data in Google Cloud Storage, and pointers to that data in our PostgreSQL database (https://github.com/WangYuTengg/LineageFlow/blob/main/assets/architecture-diagram.jpg)
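The pointer pattern can be sketched in a few lines. This is a hypothetical illustration only: the dictionaries stand in for the real Google Cloud Storage bucket and PostgreSQL pointer table, and the function and field names are assumptions, not LineageFlow's actual API.

```python
import hashlib

# In-memory stand-ins for the real stores (assumptions for illustration):
# GCS_BUCKET plays the role of the Google Cloud Storage bucket,
# POINTER_TABLE the role of a PostgreSQL pointer table.
GCS_BUCKET: dict[str, bytes] = {}
POINTER_TABLE: dict[str, dict] = {}

def upload_object(repo: str, path: str, data: bytes) -> str:
    """Store raw bytes in object storage; keep only a pointer in the DB."""
    key = hashlib.sha256(data).hexdigest()  # content-addressed storage key
    GCS_BUCKET[key] = data                  # real code would write to the cloud bucket
    POINTER_TABLE[f"{repo}:{path}"] = {"storage_key": key, "size": len(data)}
    return key

key = upload_object("demo-repo", "train/data.csv", b"a,b\n1,2\n")
```

Content-addressing the storage key means identical object versions are stored once, while the database row stays small enough to query and version cheaply.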
🛠️ Database design
- To support data versioning with Git-like semantics, we followed the data hierarchy below, which shaped our database schemas and design decisions.
- With this, we can apply version control whenever operations such as add, delete, and edit are performed on the data. (https://github.com/WangYuTengg/LineageFlow/blob/main/assets/data_hierarchy.jpg)
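A Repository → Branch → Commit hierarchy like the one in the diagram might look roughly like this. The class and field names here are assumptions for illustration; the real schema lives in PostgreSQL behind Django models.

```python
from dataclasses import dataclass, field

@dataclass
class Commit:
    message: str
    parent: "Commit | None"
    # path -> storage key of the object version visible at this commit
    snapshot: dict[str, str] = field(default_factory=dict)

@dataclass
class Branch:
    name: str
    head: "Commit | None" = None

@dataclass
class Repository:
    name: str
    branches: dict[str, Branch] = field(default_factory=dict)

repo = Repository("demo")
main = Branch("main")
repo.branches["main"] = main
c1 = Commit("add dataset", None, {"data.csv": "sha256:abc"})
main.head = c1
# an edit never mutates old data: it produces a new commit
# pointing at a new object version
c2 = Commit("edit dataset", c1, {"data.csv": "sha256:def"})
main.head = c2
```

Because edits create new commits rather than overwriting rows, every historical version of the data stays reachable through the parent chain.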
✔️ Current Features
Repositories
- Simple user signup, login and auth flow
- View your repositories
- Create a new repository (with an option to link the repository to an existing cloud bucket) (https://github.com/WangYuTengg/LineageFlow/blob/main/assets/repo-list.JPG)
Objects view
- View objects in file & folder structure
- Upload objects into repository (local files and folders)
- Download/View/Delete objects (https://github.com/WangYuTengg/LineageFlow/blob/main/assets/objects-page.JPG)
Staging area
- Uncommitted changes move to a staging area before they are committed
- Review pending changes before committing them
- Enter a commit message (https://github.com/WangYuTengg/LineageFlow/blob/main/assets/uncommited-changes-page.JPG)
Branches
- A single repository can have multiple branches
- Create branch from a parent branch
- Each branch has its own commit history, and data versioning (https://github.com/WangYuTengg/LineageFlow/blob/main/assets/branches-page.JPG)
Commits
- View the detailed commit history of the selected branch (files added/deleted/edited) in a timeline view
- Roll back/revert to any commit in the history (https://github.com/WangYuTengg/LineageFlow/blob/main/assets/commits-page.JPG)
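Rollback in Git-like systems is typically implemented not by deleting history but by appending a new commit that restores an older snapshot; a minimal sketch of that idea, with assumed names and commit shapes, is:

```python
# Hypothetical rollback: append a new commit whose file set matches an
# earlier one, so the full history remains intact and auditable.

def rollback(history: list[dict], target_index: int) -> list[dict]:
    """Return history extended with a commit restoring history[target_index]."""
    target = history[target_index]
    revert = {
        "message": f"rollback to: {target['message']}",
        "snapshot": dict(target["snapshot"]),  # copy the old file set
    }
    return history + [revert]

history = [
    {"message": "v1", "snapshot": {"data.csv": "sha256:v1"}},
    {"message": "v2", "snapshot": {"data.csv": "sha256:v2"}},
]
history = rollback(history, 0)  # restore v1 as a new third commit
```

Because the rollback is itself a commit, it appears in the timeline view like any other change and can itself be reverted.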
Settings
- Rename your repository
- Switch/rename default branch
- Delete your repository
- View collaborators (https://github.com/WangYuTengg/LineageFlow/blob/main/assets/settings-page.JPG)
📅 Future Plans
Immediate improvements to be made are:
- Cloud Integration: Support other cloud buckets (e.g., AWS S3, Azure Blob Storage, Cloudflare R2)
- Collaboration: Enable collaboration by adding user roles, invites, branch merging, etc.
- Deployment: Deploy our product to quickly iterate based on real usage.
The possibilities are endless:
- Feature Store: Integrate a feature store to manage and share features across different machine learning models, ensuring consistency and reusability.
- Automated ML Pipeline: Develop automated machine learning pipelines to streamline data preprocessing, model training, evaluation, and deployment, increasing efficiency and reducing manual intervention.
- Data Quality Monitoring: Implement data quality monitoring and alerting systems to detect anomalies, ensuring data integrity and reliability throughout the machine learning lifecycle.
🏆 Challenges Faced
- Designing the database correctly was the most crucial step, and we should have spent more time on it; a lot of time was wasted on re-migrations after we realised our schemas did not work.
- Integration with Google Cloud Storage buckets proved technically difficult, with issues such as authentication
- We underestimated the scope of the project and faced time constraints
- All in all, we are proud of what we accomplished in a week of building and of tackling a difficult problem statement.
✍🏻 Contributors
- Jayden - Fullstack
- Wang Yu Teng - Fullstack
- Pei Yee - Database, Backend
Built With
- django
- google-cloud-bucket
- python
- react
- typescript