Inspiration
Food security is one of the most pressing challenges of our time, yet farmers and agronomists still struggle to answer a deceptively simple question: why did yields unexpectedly collapse — or boom — this season? Extreme weather events like the 2012 drought, 2019 floods, and 2021 western drought have caused billions in crop losses, but the signals are buried in messy, multi-dimensional data. We wanted to build a tool that could surface those hidden anomalies automatically and explain them in plain English.
What it does
We built an end-to-end anomaly detection pipeline using Isolation Forest trained on county-level yield and weather data across the US. The model learns what "normal" yield looks like given a set of weather conditions — precipitation, heat stress, frost risk, diurnal temperature range — and flags counties and seasons where reality diverged sharply from expectation. Every flagged anomaly comes with a plain-English explanation of the likely culprit, and results are visualized on a county-level heatmap.
How we built it
We joined USDA/RMA county yield data with NOAA weather station records, engineered agronomically meaningful features (drought flags, heat stress thresholds, growing degree day proxies), and trained an Isolation Forest on the combined dataset in a Databricks + PySpark environment. Anomaly explanations were generated by comparing each flagged observation against the normal distribution per feature. Visualizations were built with Matplotlib.
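The core of the pipeline can be sketched in a few lines of scikit-learn. The column names and thresholds below are hypothetical stand-ins for our engineered features, not the real schema:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical county-season rows after joining yield and weather data
df = pd.DataFrame({
    "precip_mm":   [620, 180, 700, 640, 615],
    "heat_days":   [12, 41, 10, 14, 13],      # days above a heat-stress threshold
    "frost_days":  [3, 1, 2, 4, 3],
    "yield_bu_ac": [170, 95, 168, 172, 169],  # county yield, bushels/acre
})

# Engineered flags in the spirit of the agronomic features described above
df["drought_flag"] = (df["precip_mm"] < 300).astype(int)
df["heat_stress"] = (df["heat_days"] > 30).astype(int)

features = ["precip_mm", "heat_days", "frost_days", "yield_bu_ac",
            "drought_flag", "heat_stress"]

# Isolation Forest learns "normal" yield-weather combinations;
# fit_predict returns -1 for anomalous rows
model = IsolationForest(contamination=0.2, random_state=42)
df["anomaly"] = model.fit_predict(df[features])

print(df[df["anomaly"] == -1])
```

In the real pipeline this ran over Spark tables sampled down to the driver, but the modeling step itself is exactly this shape.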
Challenges we ran into
Getting the data pipeline to work across Spark, Pandas, and Polars in Databricks was trickier than expected — kernel restarts wiped session variables, table joins didn't persist, and memory pressure from pulling large Spark tables into the driver required careful sampling strategies. Tuning the contamination threshold to balance sensitivity vs. noise was also a non-trivial judgment call requiring domain knowledge about what counts as a "real" anomaly vs. natural yield variation.
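The contamination trade-off is easy to see on synthetic data: the parameter is effectively a percentile cutoff on the anomaly score, so raising it flags more county-seasons whether or not they are truly unusual. A minimal sketch (the data here is synthetic, not ours):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))   # synthetic "normal" county-seasons
X[:5] += 6                      # inject a handful of extreme seasons

# Sweep the contamination threshold: higher values flag more rows,
# trading sensitivity for noise
counts = []
for contamination in (0.01, 0.05, 0.10):
    model = IsolationForest(contamination=contamination, random_state=0)
    flagged = int((model.fit_predict(X) == -1).sum())
    counts.append(flagged)
    print(f"contamination={contamination:.2f} -> {flagged} rows flagged")
```

At low settings only the injected extremes surface; at higher settings ordinary yield variation starts getting swept in, which is where the domain judgment comes in.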
Accomplishments that we're proud of
We're proud of building a pipeline that goes beyond just flagging anomalies — it actually explains them in plain English that a farmer or agronomist could act on. Bridging the gap between a black-box ML model and a human-readable "your yield crashed because of drought + heat stress in the same season" narrative was the part we're most excited about.

We're also proud of handling the full data engineering challenge end-to-end — joining multi-source datasets across different schemas, engineering agronomically meaningful features from raw weather observations, and making it all work reliably in a distributed Spark environment that threw a lot of curveballs at us.

Finally, we're proud of the county-level heatmap visualization. Seeing anomaly hotspots light up geographically — and watching them align with known historical events like the 2012 drought corridor — was the moment the project felt real. It turned numbers into a story.
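The explanation step works by comparing each flagged observation against the per-feature normal distribution and naming the features that sit far outside it. A simplified sketch — the feature names, statistics, and `explain` helper are illustrative, not our production code:

```python
import pandas as pd

# Hypothetical per-feature statistics over all "normal" county-seasons
stats = pd.DataFrame({
    "mean": {"precip_mm": 600.0, "heat_days": 12.0, "frost_days": 3.0},
    "std":  {"precip_mm": 80.0,  "heat_days": 5.0,  "frost_days": 1.5},
})

def explain(row, z_cutoff=2.0):
    """Turn a flagged observation into a plain-English note by naming
    features that deviate strongly from their normal distribution."""
    phrases = []
    for feat, s in stats.iterrows():
        z = (row[feat] - s["mean"]) / s["std"]
        if z <= -z_cutoff:
            phrases.append(f"unusually low {feat}")
        elif z >= z_cutoff:
            phrases.append(f"unusually high {feat}")
    if not phrases:
        return "no single extreme feature"
    return "likely driven by " + " + ".join(phrases)

flagged = {"precip_mm": 210.0, "heat_days": 38.0, "frost_days": 3.0}
print(explain(flagged))
# -> likely driven by unusually low precip_mm + unusually high heat_days
```

The last step maps feature names to field language ("low precip_mm" becomes "drought conditions"), which is what makes the output readable to an agronomist.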
What we learned
Anomalies rarely have a single cause — the most interesting cases were multivariate, where no single weather feature was extreme but the combination was unusual. We also learned that spatial context matters enormously; the same precipitation level means drought in Iowa and flooding in Arizona. Most importantly, we learned that making ML explainable to domain experts (farmers, agronomists) requires translating model outputs back into the language of the field — not just scores and z-values.
What's next for WethApp
Integrating satellite NDVI data, soil moisture indices, and pest/disease event logs to build a richer feature set — and extending the model to generate forward-looking risk scores for the upcoming growing season.
Built With
- databricks
- matplotlib
- numpy
- pandas
- pyspark
- python
- scikit-learn
- sql