Inspiration
Every year, agronomists and seed companies ask the same question after harvest: "Was this yield bad because of the weather, or bad despite the weather?" Those are fundamentally different situations — one is expected, one is a signal. We built this project to answer that question automatically, at scale, across every corn and soybean county in America.
The inspiration came from Corteva's challenge: they have 15 years of yield data and a weather catalog, but no systematic way to separate weather impact from everything else — soil quality, seed choice, management decisions. That gap is where insight lives, and that's what we set out to close.
What We Learned

We learned that the most powerful insights don't come from predicting yield — they come from explaining the gap between what yield was and what it should have been. A county that had a terrible year during a drought is expected. A county that had a terrible year despite perfect weather is the real signal. That distinction, which sounds simple, requires a full data pipeline to surface reliably.
We also learned that data engineering is 80% of the work. Merging yield data with FIPS codes, standardizing county names across datasets, handling Louisiana parishes vs counties, zero-padding FIPS codes for map rendering — none of that is glamorous but all of it determines whether the dashboard works.
How We Built It

We built entirely on Databricks using a three-stage pipeline:
Stage 1 — Data Prep: We ingested 15 years of USDA RMA county yield data covering 34,712 county-year records across 32 states for corn and soybeans. We cleaned, standardized, and saved it as a Delta table.
Stage 2 — Anomaly Detection: For each county-year we computed a predicted yield using a 5-year rolling county average, then calculated the residual (actual minus predicted) and normalized it into a z-score. Counties in the top 10% were flagged Resilient. Counties in the bottom 10% were flagged Stress-Impacted. We then aggregated across all 15 years to identify chronically stressed and chronically resilient counties.
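The Stage 2 logic can be sketched in pandas on a tiny synthetic dataset (a minimal illustration, not the production code — the real pipeline runs on Delta tables with 34,712 records, and the column names and whether the z-score is computed per year or across all records are assumptions here):

```python
import pandas as pd
import numpy as np

# Tiny synthetic example; values are invented, with one mid-series crash
df = pd.DataFrame({
    "county_fips": ["20001"] * 8 + ["18001"] * 8,
    "year": list(range(2010, 2018)) * 2,
    "yield_bu_ac": [40, 41, 42, 43, 20, 45, 46, 47,   # crash in 2014
                    50, 51, 52, 53, 54, 55, 56, 57],
}).sort_values(["county_fips", "year"])

# Predicted yield: 5-year rolling average of the county's prior years
# (shifted so a year is not included in its own baseline)
df["predicted"] = (
    df.groupby("county_fips")["yield_bu_ac"]
      .transform(lambda s: s.shift(1).rolling(5, min_periods=3).mean())
)

# Residual = actual - predicted, normalized into a z-score
df["residual"] = df["yield_bu_ac"] - df["predicted"]
df["z"] = (df["residual"] - df["residual"].mean()) / df["residual"].std()

# Flag the tails: top 10% Resilient, bottom 10% Stress-Impacted.
# Early years without enough history have NaN predictions and stay Normal.
hi, lo = df["z"].quantile(0.9), df["z"].quantile(0.1)
df["flag"] = np.where(df["z"] >= hi, "Resilient",
             np.where(df["z"] <= lo, "Stress-Impacted", "Normal"))
```

The shift before the rolling mean matters: without it, a crash year would drag down its own baseline and mute its own anomaly score.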
Stage 3 — Geospatial Enrichment and Dashboard: We joined our anomaly data with public FIPS codes to enable county-level map rendering, achieving 99.8% geographic coverage. We built the dashboard in Databricks AI/BI with a choropleth heatmap colored by average z-score, a Genie-powered natural language interface, and a top 10 stress-impacted counties table.
Challenges We Faced

The biggest challenge was county name standardization. NOAA, USDA, and FIPS reference tables all use slightly different naming conventions — "St. Clair" vs "ST CLAIR", "Johnson County" vs "Johnson", Louisiana parishes vs counties. We wrote custom normalization logic to handle every edge case and achieved a 99.8% match rate across 34,000 records.
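A hypothetical normalizer illustrating the kinds of rules involved (the actual logic handles more edge cases than shown):

```python
import re

def normalize_county(name: str) -> str:
    """Normalize a county name for cross-dataset joins (illustrative sketch)."""
    n = name.strip().upper()
    # Drop trailing COUNTY / PARISH (Louisiana) designators
    n = re.sub(r"\s+(COUNTY|PARISH)$", "", n)
    # Standardize punctuation and whitespace: "St. Clair" -> "ST CLAIR"
    n = n.replace(".", "")
    n = re.sub(r"\s+", " ", n)
    return n

print(normalize_county("St. Clair"))       # -> "ST CLAIR"
print(normalize_county("Johnson County"))  # -> "JOHNSON"
print(normalize_county("Acadia Parish"))   # -> "ACADIA"
```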
The second challenge was making the map render correctly. Databricks choropleth maps require FIPS codes as zero-padded 5-digit strings — a detail that took significant debugging to discover. Once we fixed the schema and zero-padded the codes, the map came to life.
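The fix is a one-line cast-and-pad, sketched here with invented values:

```python
import pandas as pd

# Leading zeros are lost when FIPS codes are stored as integers:
# Autauga County, AL is 01001 but round-trips as 1001.
fips_df = pd.DataFrame({"fips": [1001, 20001, 6037]})

# Cast to string and zero-pad to the 5-digit form the choropleth expects
fips_df["fips"] = fips_df["fips"].astype(str).str.zfill(5)
print(fips_df["fips"].tolist())  # -> ['01001', '20001', '06037']
```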
The result is a system that automatically surfaces what 15 years of yield data has been trying to tell us: Kansas chronically underperforms soybean expectations, Indiana chronically overperforms, and 2012 was catastrophic beyond what even the historic drought explains. We didn't program those findings in — the data found them itself.
Business Value: Weather-Normalized Yield Insight
Raw yield numbers can be misleading. A strong year may reflect favorable weather, while a weak year may be driven by drought or heat stress beyond anyone’s control. Our system estimates the yield expected under given weather conditions and measures the deviation from that baseline. This weather-normalized signal separates environmental impact from true land or management performance.
For Farmland Investors

Historical averages don't reveal whether land is intrinsically productive or simply benefited from good weather. By identifying consistent overperformance or underperformance relative to weather conditions, our platform helps investors:

- Detect resilient, high-quality land
- Avoid overpaying for weather-inflated yield history
- Assess long-term climate risk exposure

This supports smarter acquisition and valuation decisions.
For Farmers

Year-to-year yield comparisons are often unfair due to changing weather conditions. Weather-normalized benchmarking allows farmers to:

- Evaluate performance more fairly across seasons
- Distinguish weather-driven losses from operational issues
- Demonstrate resilience and management strength

Core Metric

Yield Residual = Actual Yield – Weather-Predicted Yield

Positive residual → outperformance
Negative residual → underperformance

This simple metric transforms raw historical data into actionable performance intelligence.
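The core metric is a single subtraction; the numbers below are invented for illustration:

```python
# Invented example values (bushels per acre)
actual_yield = 162.0
weather_predicted_yield = 150.0  # baseline expected under that year's weather

yield_residual = actual_yield - weather_predicted_yield
print(yield_residual)  # -> 12.0, a positive residual: outperformance
```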
Built With
- databricks
