Inspiration

We wanted to build a scalable solution capable of processing massive advertising datasets while delivering accurate revenue-prediction models. Real-world ad traffic is large, messy, and highly dynamic, so we aimed to create a pipeline that feels closer to production than a typical competition notebook.

What it does

Data-Adds ingests large Parquet datasets, preprocesses them efficiently with Dask, engineers meaningful features, and trains a single-step prediction model to estimate user revenue. The pipeline is designed to be fast, memory-aware, and fully reproducible.

How we built it

We used Dask for distributed data loading and preprocessing, LightGBM and PyTorch for modeling, and a temporal validation strategy to reflect real ad-traffic behavior. We optimized memory with downcasting, safe encodings for high-cardinality features, and batch-based processing to avoid RAM spikes.
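As a rough illustration of the memory optimizations mentioned above, here is a minimal sketch of numeric downcasting and one possible "safe" encoding for high-cardinality columns. It uses plain pandas for brevity; with Dask, the same functions could be applied per partition (for example through `map_partitions`). The function names, the bucket count, and the choice of hash bucketing are assumptions made for this example, not necessarily what the project uses.

```python
import zlib

import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric columns shrunk to the smallest fitting dtype."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        # e.g. int64 -> int8/int16/int32, depending on the value range
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=[np.floating]).columns:
        # float64 -> float32 when possible; this trades precision for memory
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out


def hash_bucket(values: pd.Series, n_buckets: int = 1024) -> pd.Series:
    """Map a high-cardinality column to a fixed number of integer buckets.

    Hypothetical encoding choice: distinct values may collide in a bucket,
    but memory stays bounded regardless of how many categories appear.
    """
    # crc32 is deterministic across processes, unlike Python's built-in hash()
    codes = values.astype(str).map(
        lambda v: zlib.crc32(v.encode("utf-8")) % n_buckets
    )
    return codes.astype(np.uint16 if n_buckets <= 65536 else np.uint32)
```

Hash bucketing avoids building a full category dictionary in memory, at the cost of occasional collisions; a frequency or target encoding would be a reasonable alternative with different trade-offs.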

Challenges we ran into

· Memory limitations when handling millions of rows

· High-cardinality categorical variables

· Designing a realistic temporal split

· Balancing model accuracy with runtime and resource constraints
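To make the temporal-split challenge concrete, here is a minimal sketch of a time-based train/validation split, where the model only ever trains on events that happened before the validation window. The function name, the column names, and the cutoff date are hypothetical and chosen only for illustration.

```python
import pandas as pd


def temporal_split(df: pd.DataFrame, time_col: str, cutoff):
    """Split rows into train (before cutoff) and validation (at or after it)."""
    cutoff = pd.Timestamp(cutoff)
    train = df[df[time_col] < cutoff]
    valid = df[df[time_col] >= cutoff]
    return train, valid


# Tiny made-up example: four ad events over four days
events = pd.DataFrame(
    {
        "ts": pd.to_datetime(
            ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"]
        ),
        "revenue": [0.0, 1.0, 0.5, 2.0],
    }
)
train, valid = temporal_split(events, "ts", "2024-01-03")
```

Unlike a random split, this keeps future traffic out of the training set, which is closer to how a revenue model would be evaluated once deployed.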

Accomplishments that we're proud of

We built a complete end-to-end pipeline capable of scaling to real-world datasets. We improved prediction performance using a single-step modeling approach, stabilized memory usage, and designed a clean architecture that can be reused or extended for similar ML workflows.

What we learned

We learned how to design production-style data pipelines, use Dask effectively for large-scale processing, engineer robust features for ads data, and train models under strict memory conditions. We also gained experience balancing technical trade-offs between speed, accuracy, and scalability.

What's next for Data-Adds

We plan to extend the pipeline with automated hyperparameter tuning, add deep learning models optimized for tabular data, improve feature generation using embeddings, and explore deployment-ready components. We also hope to benchmark new architectures and integrate more real-time data sources.
