Inspiration
Access to AI training infrastructure remains a critical barrier for researchers, students, and communities in developing regions. We built DataForAll to democratize machine learning by creating a decentralized platform where anyone can contribute data, collaboratively train models, and access AI tools - regardless of their technical background or resources.
What it does
DataForAll is a collaborative AI training platform that features:
- Community-driven data contribution: Users upload datasets for missions like crop disease detection or environmental monitoring
- On-demand GPU training: Automatically provisions cloud GPU instances (Lambda Labs H100) to fine-tune models like SmolVLM on contributed datasets
- Real-time training monitoring: Live dashboards show training progress, metrics, and resource utilization
- Model sharing: Trained models are published to Hugging Face Hub for immediate use via REST API
- Gamified contributions: Leaderboards and mission-based challenges incentivize quality data contributions
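Since trained models are exposed over a REST API, a minimal sketch of a client call looks like the following. The endpoint shape, model id, and payload fields here are assumptions for illustration, not DataForAll's actual API contract; the request body is built but not sent.

```python
import json

# Sketch: assembling a JSON request for a vision-language inference
# endpoint. The model id "dataforall/smolvlm-crop-disease" and the
# payload layout are hypothetical examples.
def build_inference_request(model_id: str, image_url: str, question: str) -> dict:
    """Assemble the JSON body for a vision-language inference call."""
    return {
        "model": model_id,
        "inputs": {"image": image_url, "prompt": question},
    }

payload = build_inference_request(
    "dataforall/smolvlm-crop-disease",  # hypothetical repo/model id
    "https://example.com/leaf.jpg",
    "Does this leaf show signs of blight?",
)
body = json.dumps(payload)  # ready to POST to the inference endpoint
```

From here a real client would POST `body` to the platform's inference URL with any HTTP library.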
How we built it
We architected a full-stack distributed system:
- Frontend: React + Vite with Three.js for 3D visualizations, Framer Motion for animations, and real-time WebSocket updates
- Backend: FastAPI with async PostgreSQL (SQLAlchemy), JWT authentication, and S3-compatible object storage (Vultr)
- Infrastructure: Kubernetes cluster on Vultr for API deployment, container registry, and database replication
- GPU orchestration: dynamic provisioning of Lambda Labs H100 instances via REST API, configured over SSH (Paramiko)
- ML pipeline: PyTorch + Hugging Face ecosystem (Transformers, PEFT, Accelerate) for fine-tuning vision-language models with QLoRA
- Training workers: Dockerized GPU workers that pull datasets from S3, train models, stream logs via WebSocket, and push results to Hugging Face Hub
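The training worker's lifecycle (pull dataset, train, stream progress, publish) can be sketched as plain Python with the heavy steps stubbed out. The function names and phase strings are illustrative, not our real module layout; in the actual worker the stubs are boto3 S3 reads, a PyTorch + PEFT training loop, and a Hugging Face Hub upload.

```python
# Sketch of the GPU worker's job loop, with external calls stubbed.
def run_training_job(job: dict) -> dict:
    """Pull dataset -> fine-tune -> publish, reporting each phase."""
    phases = []

    def report(phase: str) -> None:
        # In the real worker this event is streamed over a WebSocket
        # to the live dashboard; here we just record it locally.
        phases.append(phase)

    report("download")   # boto3: fetch the mission dataset from S3
    report("train")      # PyTorch + PEFT: QLoRA fine-tuning loop
    report("publish")    # huggingface_hub: push weights to the Hub
    return {"job_id": job["id"], "phases": phases, "status": "succeeded"}

result = run_training_job({"id": "job-123"})  # hypothetical job record
```

Keeping the lifecycle in one function like this makes it straightforward to wrap each phase in retry and crash-reporting logic.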
Challenges we ran into
Our biggest challenge was GPU provisioning. We initially designed the system around Vultr Cloud GPUs, building HTTP-based orchestration between our Kubernetes cluster and on-demand GPU instances. However, Vultr's GPU plans required manual account approval via support ticket - incompatible with a hackathon timeline. We pivoted to Lambda Labs mid-development, which meant:
- Rewriting the provisioning layer (Lambda's API differs significantly from Vultr's)
- Implementing SSH-based configuration (Lambda doesn't support cloud-init/user-data like Vultr)
- Handling multi-region fallback logic when H100 capacity was unavailable
- Debugging SSL handshake issues between the Kubernetes cluster and ephemeral GPU workers

The migration cost us 8+ hours but taught us valuable lessons about cloud provider abstractions and resilient system design.
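The multi-region fallback mentioned above can be sketched as a small retry loop. The launcher is injected so the real HTTP call (Lambda's instance-launch endpoint) can be swapped in; the region names and instance type string below are examples, not guarantees of capacity, and the stub launcher stands in for the real API client.

```python
# Sketch: try regions in order until an instance launch succeeds.
class CapacityError(Exception):
    """Raised when a region has no instances of the requested type."""

def launch_with_fallback(launch, instance_type, regions):
    for region in regions:
        try:
            return launch(instance_type, region)
        except CapacityError:
            continue  # no capacity here; try the next region
    raise CapacityError(f"no capacity for {instance_type} in {regions}")

# Usage with a stub launcher that only has capacity in "us-south-1":
def fake_launch(instance_type, region):
    if region != "us-south-1":
        raise CapacityError(region)
    return {"instance_type": instance_type, "region": region}

instance = launch_with_fallback(
    fake_launch, "gpu_1x_h100_pcie", ["us-west-1", "us-south-1"]
)
```

Raising once every region is exhausted lets the orchestrator queue the job and retry later instead of failing silently.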
Accomplishments that we're proud of
- End-to-end automation: from data upload to trained model deployment, fully automated with zero manual intervention
- Real production deployment: running on Kubernetes with managed PostgreSQL, object storage, and container registry
- Successful fine-tuning: trained SmolVLM-256M on real crop disease datasets with QLoRA on H100 GPUs
- Resilient architecture: graceful handling of GPU provisioning failures, training crashes, and network interruptions
- Beautiful UX: interactive 3D globe, smooth animations, and real-time training visualizations
What we learned
- Cloud GPU availability is unpredictable; always have a fallback provider
- SSH-based configuration is more fragile than cloud-init but necessary on some platforms
- WebSocket connections require careful lifecycle management in distributed systems
- QLoRA enables fine-tuning large vision-language models on consumer GPUs (we tested locally on an RTX 4060 Mobile)
- Kubernetes adds complexity but pays dividends for multi-service orchestration
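The QLoRA point has simple arithmetic behind it: the frozen base model is quantized to 4 bits while only small 16-bit LoRA adapters are trained. The sketch below uses SmolVLM's 256M parameter count from the writeup; the rank-16 adapter sizing and layer/hidden dimensions are illustrative assumptions, not our exact configuration, and activation/optimizer memory is ignored.

```python
# Back-of-envelope memory math for QLoRA on a small VLM.
base_params = 256_000_000
base_mb_4bit = base_params * 0.5 / 1e6  # 4 bits = 0.5 bytes per weight

# Hypothetical LoRA adapters: rank 16 on 4 attention projections of a
# 30-layer, 768-dim model (illustrative numbers). Each adapter is a
# pair of low-rank matrices A (hidden x rank) and B (rank x hidden).
layers, hidden, rank, projections = 30, 768, 16, 4
lora_params = layers * projections * 2 * hidden * rank
lora_mb_fp16 = lora_params * 2 / 1e6    # fp16 = 2 bytes per weight

total_mb = base_mb_4bit + lora_mb_fp16  # weights fit in well under 1 GB
```

Even with training activations on top, this is why an 8 GB laptop GPU could handle local test runs.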
What's next for DataForAll
- Federated learning: Enable privacy-preserving training where data never leaves contributors' devices
- Model marketplace: Let users monetize their trained models or datasets
- Multi-modal support: Expand beyond vision to audio, time-series, and tabular data
- Community governance: Implement DAO-style voting for mission priorities and resource allocation
Built With
- docker
- huggingface
- jwt
- kubernetes
- lambda
- lambdalabs
- nextjs
- node.js
- postgresql
- python
- react
- sqlalchemy
- tailwind
- typescript
- vite
- vultr