What it does

Our platform predicts end-of-season crop yields by analyzing both historical yield data and recent weather patterns. Farmers select a prediction date, and the model looks back 60 to 90 days to compute summary weather features such as average temperature, total precipitation, and humidity. These rolling aggregates smooth out daily fluctuations, capturing the stable short-term weather trends leading up to the prediction point that most directly affect the final harvest yield.
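The lookback aggregation described above can be sketched in plain Python (the production pipeline runs in PySpark; the record layout and feature names here are hypothetical):

```python
from datetime import date, timedelta

# Hypothetical daily weather records: (date, temp_c, precip_mm, humidity_pct)
records = [
    (date(2024, 6, 1) + timedelta(days=i), 20 + i % 5, 2.0, 60 + i % 10)
    for i in range(120)
]

def lookback_features(records, prediction_date, window_days=90):
    """Average temperature, total precipitation, and average humidity
    over the window_days preceding prediction_date."""
    start = prediction_date - timedelta(days=window_days)
    window = [r for r in records if start <= r[0] < prediction_date]
    n = len(window)
    return {
        "avg_temp_c": sum(r[1] for r in window) / n,
        "total_precip_mm": sum(r[2] for r in window),
        "avg_humidity_pct": sum(r[3] for r in window) / n,
    }

# Features for a prediction made on Sep 1 look back over Jun 3 - Aug 31
features = lookback_features(records, date(2024, 9, 1), window_days=90)
```

In PySpark the same idea maps to a filtered date range followed by `avg` and `sum` aggregations per county.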

How we built it

  • Data Ingestion & Engineering: We hosted all datasets and handled preprocessing natively in Databricks. We pulled historical crop yield data (2010–2024) and fetched the corresponding weather data from the NOAA API. Using PySpark and Delta tables, we standardized schemas, cleaned string fields, and imputed missing or invalid values so the datasets could be reliably joined by county and year.
  • Feature Engineering: From the clean, unified table, we engineered additional monthly and seasonal weather statistics for each county–year pair.
  • Modeling: We designed and trained a 6-layer dense neural network with ReLU activations, tuning hyperparameters and analyzing which weather features were most critical to the predictions.
  • Deployment & User Interface: We built a Node.js frontend hosted on Databricks Apps that connects directly to our data pipeline, giving farmers a tangible way to interact with the model. We also integrated a Databricks Genie chatbot into the website so users can query the data in natural language.
  • Automation: After training, we established Databricks pipelines linked to real-time APIs for both weather and crop yields, so the system continuously monitors and updates its inputs every season.
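The cleaning-and-join step can be illustrated in plain pandas (the actual pipeline uses PySpark and Delta tables; the column names and values below are hypothetical):

```python
import pandas as pd

yields = pd.DataFrame({
    "county": ["Franklin", "Hamilton", "Allen"],
    "year": [2023, 2023, 2023],
    "yield_bu_ac": [181.0, None, 175.5],  # one missing value to impute
})
weather = pd.DataFrame({
    "county": [" franklin ", "HAMILTON", "allen"],  # inconsistent strings
    "year": [2023, 2023, 2023],
    "avg_temp_c": [22.1, 23.4, 21.8],
})

# Standardize string fields so the join keys match
for df in (yields, weather):
    df["county"] = df["county"].str.strip().str.title()

# Impute missing yields with the column mean
yields["yield_bu_ac"] = yields["yield_bu_ac"].fillna(yields["yield_bu_ac"].mean())

# Join by county and year
joined = yields.merge(weather, on=["county", "year"], how="inner")
```

The equivalent PySpark version uses the same logic with `trim`/`initcap` column expressions and a `join` on the county and year keys.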

Challenges we ran into

  • Gathering Accurate Data: Sourcing cohesive, accurate crop and weather data across a 14-year span was difficult. To ensure high data quality, we initially narrowed our focus to three well-documented regions (Franklin County, OH; Hamilton County, OH; and Allen County, IN) to validate the model before scaling.
  • The Databricks Learning Curve: This was our team's first time using Databricks, so we had to get up to speed quickly on a completely new enterprise ecosystem under a hackathon time crunch.
  • Infrastructure Setup: Figuring out how to properly configure compute clusters, navigate AutoML setups, and orchestrate robust data pipelines required extensive documentation reading and troubleshooting.

Accomplishments that we're proud of

  • Model Accuracy: Achieving 89% R² on our held-out test data for crop yield predictions.
  • Full-Stack ML Pipeline: Building a complete, automated pipeline, from PySpark data cleaning to a deployed neural network, all the way to a functional Node.js frontend.
  • Real-Time Integration: Setting up the post-training pipelines and the Genie chatbot so the tool isn't just a static hackathon project but a dynamic, real-time tool for farmers.
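For context, the R² score reported above measures the fraction of yield variance the model explains. A minimal computation on hypothetical numbers (these are not our actual test values):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical held-out yields (bu/acre) vs. model predictions
y_true = [160.0, 175.0, 182.0, 168.0, 190.0]
y_pred = [162.5, 172.0, 185.0, 165.5, 188.0]
score = r_squared(y_true, y_pred)
```

An R² of 0.89 means the model accounts for 89% of the variance in observed yields relative to simply predicting the mean.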

What we learned

  • How to leverage PySpark and Delta tables for large-scale data cleaning and feature engineering.
  • The intricacies of configuring compute clusters and building automated data pipelines within the Databricks environment.
  • How to bridge backend machine learning infrastructure with user-facing applications using Databricks Apps and Genie.

What's next

  • Geographic Expansion: With our core pipeline and model architecture validated, our immediate next step is to generalize the model to support counties across the entire United States.
  • Continuous Refinement: As our real-time API pipelines pull in more diverse seasonal data, we plan to continuously retrain and refine our neural network to push our predictive accuracy even higher.

Built With

  • databricks