LineageFlow

Easily track data versions and lineage through the machine learning lifecycle.

About

❓ Problem Statement

How can we develop a tool that tracks data versions and lineage through the machine learning lifecycle, helping data scientists understand how datasets have changed over time and how different versions of datasets affect model performance?

💡 Motivation

As datasets evolve, tracking their changes and understanding their impact on machine learning models becomes increasingly complex. LineageFlow aims to simplify this process by providing an intuitive tool for data versioning and lineage tracking, ensuring data manageability, quality, and reproducibility.

🧑 Target Audience

  • Data Scientists and Engineers
  • Machine Learning Engineers
  • Organizations needing robust data management solutions

❗ Value Proposition

LineageFlow leverages Git-like semantics such as branches, commits, merges, and rollbacks to offer a familiar and powerful system for data versioning and lineage tracking. This approach allows users to manage, collaborate, and ensure the quality of their data throughout its lifecycle.

💻 Tech Stack

Client:

  • React
  • Vite

Backend:

  • Django
  • Django REST Framework

Storage:

  • Supabase PostgreSQL
  • Google Cloud Storage
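One natural way to pair metadata in Postgres with files in object storage is to address each dataset file by its content hash, so byte-identical versions are stored once and a commit only records path-to-digest mappings. The key scheme below is our illustrative assumption, not LineageFlow's documented bucket layout.

```python
# Sketch: content-addressable storage keys for dataset objects. Identical
# file contents always map to the same key, giving deduplication for free.
import hashlib


def object_key(data: bytes) -> str:
    """Return an object-storage key derived from the file's content."""
    digest = hashlib.sha256(data).hexdigest()
    # Shard by digest prefix (as Git does) to avoid huge flat "directories".
    return f"objects/{digest[:2]}/{digest}"
```

A commit row in the database then only needs to store `(path, digest)` pairs; the bytes themselves live once in the bucket.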

🔨 Architecture

🛠️ Database design

✔️ Current Features

  • Repositories
  • Objects view
  • Staging area
  • Branches
  • Commits
  • Settings
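Several of these features (objects view, staging area, commits) rest on one core operation: diffing two dataset snapshots to see which files were added, removed, or modified. A minimal sketch, with illustrative names:

```python
# Sketch: classify per-file changes between two (path -> content hash)
# snapshots, the operation behind a staging area and commit diff view.
def diff_snapshots(old: dict, new: dict) -> dict:
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "modified": sorted(p for p in set(old) & set(new) if old[p] != new[p]),
    }
```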

📅 Future Plans

Immediate improvements to be made are:

  • Cloud Integration: Incorporate other cloud buckets (e.g. AWS S3, Azure Blob Storage, Cloudflare R2)
  • Collaboration: Enable collaboration by adding user roles, invites, branch merging, etc.
  • Deployment: Deploy our product to quickly iterate based on real usage.

The possibilities are endless:

  • Feature Store: Integrate a feature store to manage and share features across different machine learning models, ensuring consistency and reusability.
  • Automated ML Pipeline: Develop automated machine learning pipelines to streamline data preprocessing, model training, evaluation, and deployment, increasing efficiency and reducing manual intervention.
  • Data Quality Monitoring: Implement data quality monitoring and alerting systems to detect anomalies, ensuring data integrity and reliability throughout the machine learning lifecycle.
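As a flavour of what the data quality monitoring item could look like, here is a sketch of a simple quality gate that compares basic statistics of a new dataset version against the previous one and flags suspicious jumps. Thresholds and field names are illustrative assumptions, not a planned API.

```python
# Sketch: flag anomalies between two dataset versions using simple stats.
def quality_alerts(prev: dict, curr: dict, max_row_drop: float = 0.5) -> list[str]:
    alerts = []
    # Alert if the row count fell by more than max_row_drop (default 50%).
    if prev["rows"] and curr["rows"] < prev["rows"] * (1 - max_row_drop):
        alerts.append(f"row count dropped from {prev['rows']} to {curr['rows']}")
    # Alert if columns present in the previous version disappeared.
    missing = set(prev["columns"]) - set(curr["columns"])
    if missing:
        alerts.append(f"columns removed: {sorted(missing)}")
    return alerts
```

Hooked into the commit flow, a check like this could warn before a degraded dataset version reaches model training.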

🏆 Challenges Faced

  • Designing the database correctly was the most crucial step, and we should have spent more time on it: a lot of time was lost on re-migrations after we realised our schemas did not work.
  • Integrating with Google Cloud Storage buckets proved technically difficult; we ran into issues such as authentication.
  • We underestimated the scope of the project and faced time constraints.
  • All in all, we are proud of what we accomplished in a week of building and of tackling a difficult problem statement.

✍🏻 Contributors
