Inspiration
We wanted to build a scalable solution capable of processing massive advertising datasets while delivering accurate revenue-prediction models. Real-world ad traffic is large, messy, and highly dynamic, so we aimed to create a pipeline that feels closer to production than a typical competition notebook.
What it does
Data-Adds ingests large Parquet datasets, preprocesses them efficiently with Dask, engineers meaningful features, and trains a single-step prediction model to estimate user revenue. The pipeline is designed to be fast, memory-aware, and fully reproducible.
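The memory-aware part of that design boils down to processing one batch at a time, so peak RAM tracks a single chunk rather than the whole dataset. Dask's partitioned DataFrames automate this; the pandas sketch below (with a caller-supplied `transform`, a hypothetical stand-in for our feature steps) just illustrates the idea:

```python
from typing import Callable, Iterable

import pandas as pd

def process_in_batches(
    batches: Iterable[pd.DataFrame],
    transform: Callable[[pd.DataFrame], pd.DataFrame],
) -> pd.DataFrame:
    """Apply `transform` to one batch at a time, then concatenate the results.

    Peak memory stays proportional to a single batch plus its (usually
    smaller) transformed output -- the same principle Dask applies per
    partition when streaming through a large Parquet dataset.
    """
    return pd.concat((transform(b) for b in batches), ignore_index=True)

# Example: three small batches instead of one big frame.
batches = (pd.DataFrame({"x": [i, i + 1]}) for i in range(0, 6, 2))
out = process_in_batches(batches, lambda df: df.assign(x2=df["x"] * 2))
```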
How we built it
We used Dask for distributed data loading and preprocessing, LightGBM and PyTorch for modeling, and a temporal validation strategy that reflects real ad-traffic behavior. We kept memory in check with dtype downcasting, safe encodings for high-cardinality features, and batch-based processing to avoid RAM spikes.
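Two of those memory tricks fit in a few lines: dtype downcasting, and a fixed-bucket hash encoding for high-cardinality categoricals. This is a minimal sketch, not our exact configuration; the bucket count and column handling are illustrative:

```python
import zlib

import pandas as pd

def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Shrink each numeric column to the smallest dtype that fits its values
    (e.g. int64 -> int8, float64 -> float32), often a 4-8x RAM reduction."""
    for col in df.select_dtypes(include="integer").columns:
        df[col] = pd.to_numeric(df[col], downcast="integer")
    for col in df.select_dtypes(include="float").columns:
        df[col] = pd.to_numeric(df[col], downcast="float")
    return df

def hash_encode(s: pd.Series, n_buckets: int = 1024) -> pd.Series:
    """Map a high-cardinality categorical into a fixed number of integer
    buckets via a stable hash (crc32), so memory use is independent of the
    number of distinct values and the mapping is reproducible across runs."""
    return s.astype(str).map(
        lambda v: zlib.crc32(v.encode()) % n_buckets
    ).astype("int32")
```

Using crc32 rather than Python's built-in `hash` matters for reproducibility: string hashing is salted per process, so the built-in would assign different buckets on every run.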
Challenges we ran into
· Memory limitations when handling millions of rows
· High-cardinality categorical variables
· Designing a realistic temporal split
· Balancing model accuracy with runtime and resource constraints
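The temporal-split challenge comes down to never letting the model peek at the future: train on the earliest traffic, validate on the most recent. A minimal sketch, assuming a hypothetical timestamp column `ts` (our actual schema differs):

```python
import pandas as pd

def temporal_split(df: pd.DataFrame, time_col: str, valid_frac: float = 0.2):
    """Split on time rather than at random: the last `valid_frac` of the
    timeline becomes validation, so evaluation mimics predicting future
    ad traffic and avoids leakage from shuffled rows."""
    df = df.sort_values(time_col)
    cutoff = df[time_col].quantile(1.0 - valid_frac)
    train = df[df[time_col] < cutoff]
    valid = df[df[time_col] >= cutoff]
    return train, valid

# Ten days of toy events: the last two days land in validation.
events = pd.DataFrame({
    "ts": pd.date_range("2024-01-01", periods=10, freq="D"),
    "revenue": range(10),
})
train, valid = temporal_split(events, "ts", valid_frac=0.2)
```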
Accomplishments that we're proud of
We built a complete end-to-end pipeline capable of scaling to real-world datasets. We improved prediction performance using a single-step modeling approach, stabilized memory usage, and designed a clean architecture that can be reused or extended for similar ML workflows.
What we learned
We learned how to design production-style data pipelines, use Dask effectively for large-scale processing, engineer robust features for ads data, and train models under strict memory conditions. We also gained experience balancing technical trade-offs between speed, accuracy, and scalability.
What's next for Data-Adds
We plan to extend the pipeline with automated hyperparameter tuning, add deep learning models optimized for tabular data, improve feature generation using embeddings, and explore deployment-ready components. We also hope to benchmark new architectures and integrate more real-time data sources.