FarmCast-AI

Inspiration

Crop yields can swing dramatically year to year, and it’s often unclear whether the driver is weather (rainfall, heat, humidity, frost risk) or something else (management, pests, inputs). We wanted a tool that answers a simple question:

> “Given the season’s climate signals, was this county’s yield unusually high or low—and why?”

That became FarmCast-AI: an explainable county-level yield + climate anomaly explorer.


What we built

FarmCast-AI turns NOAA station climate metrics into county-level growing-season features, joins them with historical county yields, and highlights yield anomalies (counties that perform unusually high/low relative to peers in the same year and crop category). The dashboard supports:

  • Feature exploration (rain/heat/humidity patterns)
  • Anomaly hotspot ranking (top outlier county-years)
  • County drilldown (yield and climate context for a selected county-year)

How we built it (Databricks lakehouse workflow)

1) Standardize yield data

We cleaned yield records and created a 5-digit county FIPS key:

$$ \text{county\_fips}=\text{lpad}(\text{state\_code},2)\ \Vert\ \text{lpad}(\text{county\_code},3) $$
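The key construction above is just zero-padding and concatenation; a minimal sketch (the function name is ours, not from the pipeline):

```python
# Build a 5-digit county FIPS key: state code zero-padded to 2 digits,
# county code zero-padded to 3, then concatenated.
def county_fips(state_code: int, county_code: int) -> str:
    return f"{state_code:02d}{county_code:03d}"
```

For example, `county_fips(6, 37)` yields `"06037"` (Los Angeles County, CA).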

2) Engineer growing-season climate features (Apr–Sep)

From station-month metrics we built station-year features such as:

$$ \text{rain}_{\text{Apr--Sep}}=\sum_{m=4}^{9}\text{rain}_m $$

$$ \overline{T}_{\text{day,Apr--Sep}}=\frac{1}{N}\sum_{m=4}^{9}T_{\text{day},m},\quad \overline{RH}_{\text{Apr--Sep}}=\frac{1}{N}\sum_{m=4}^{9}RH_m $$
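The seasonal aggregation can be sketched in plain Python; the record fields (`station_id`, `year`, `month`, `rain`, `t_day`, `rh`) are illustrative names, not the actual schema:

```python
from collections import defaultdict

# Aggregate station-month records into station-year Apr–Sep features:
# summed rainfall, mean daytime temperature, mean relative humidity.
def season_features(records, months=range(4, 10)):
    groups = defaultdict(list)
    for r in records:
        if r["month"] in months:
            groups[(r["station_id"], r["year"])].append(r)
    out = {}
    for key, rows in groups.items():
        n = len(rows)
        out[key] = {
            "rain_apr_sep": sum(r["rain"] for r in rows),       # seasonal total
            "tday_apr_sep": sum(r["t_day"] for r in rows) / n,  # seasonal mean
            "rh_apr_sep": sum(r["rh"] for r in rows) / n,       # seasonal mean
        }
    return out
```

In the actual lakehouse pipeline the same grouping would be a Spark `groupBy` over station and year with a month filter.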

3) Map counties to nearby stations

We connected county centroids to stations using Haversine distance:

$$ d = 2R\arcsin\left(\sqrt{ \sin^2\left(\frac{\Delta\varphi}{2}\right) +\cos(\varphi_1)\cos(\varphi_2)\sin^2\left(\frac{\Delta\lambda}{2}\right) }\right) $$
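The Haversine formula translates directly to code; a self-contained version (using the mean Earth radius, 6371 km):

```python
import math

# Great-circle distance in km between two (lat, lon) points,
# following the Haversine formula above.
def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))
```

A quarter of the equator, e.g. `haversine_km(0, 0, 0, 90)`, comes out to roughly 10,008 km, which is a handy sanity check.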

Then we aggregated station features into county features with inverse-distance weights:

$$ w_i=\frac{1}{d_i+\epsilon},\quad \hat{x}_{c,y}=\frac{\sum_{i\in\mathcal{N}(c)}w_i\,x_{i,y}}{\sum_{i\in\mathcal{N}(c)}w_i} $$

This produced our county-year “gold” table (yield + climate features) used by the dashboard.
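The inverse-distance weighting for a single county-year is a few lines; a minimal sketch with illustrative inputs:

```python
# Inverse-distance weighted average of station feature values for one
# county-year: w_i = 1 / (d_i + eps), then a weighted mean of the values.
def idw_average(values, distances_km, eps=1e-6):
    weights = [1.0 / (d + eps) for d in distances_km]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)
```

Stations at equal distance contribute equally, while a much closer station dominates the estimate; `eps` guards against division by zero when a centroid coincides with a station.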

4) Detect yield anomalies

We flagged unusually high/low yield outcomes using a within-group z-score (per crop/irrigation/year):

$$ z=\frac{y-\mu}{\sigma} $$

and ranked counties by $|z|$ to surface the most extreme cases.
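The within-group scoring and ranking can be sketched as follows, assuming yields have already been grouped by (crop, irrigation, year) upstream; function names are ours:

```python
import statistics

# Z-scores within one (crop, irrigation, year) group.
def z_scores(yields):
    mu = statistics.mean(yields)
    sigma = statistics.pstdev(yields)  # population std dev of the group
    return [(y - mu) / sigma for y in yields]

# Rank counties in a group by |z| to surface the most extreme county-years.
def top_outliers(county_yields, k=3):
    zs = z_scores([y for _, y in county_yields])
    ranked = sorted(zip((c for c, _ in county_yields), zs),
                    key=lambda cz: abs(cz[1]), reverse=True)
    return ranked[:k]
```

Because the scores are computed per group, a county is flagged only when it is unusual relative to peers growing the same crop under the same irrigation regime in the same year.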


Challenges we faced

  • Spatial mismatch (county vs station): yields are county-year while climate is station-month; reliable linkage required careful nearest-station mapping and weighting.
  • Data validation and coverage: ensuring joins didn’t silently produce misleading results required repeated sanity checks and debugging.
  • Time constraints: we focused on a clean, explainable pipeline and a strong dashboard narrative over overly complex modeling.

What we learned

  • The hardest part of applied ML/data science is often the data model and joins, not the model.
  • Interpretability matters: anomaly scores are more useful when paired with climate context (rain/heat/humidity).
  • Coverage-aware feature engineering and validation checks are essential for trustworthy insights.
