SFU BigData Lab1 Project
This project analyzes how large-scale climate variability, specifically the El Niño–Southern Oscillation (ENSO) and its three phases (El Niño, Neutral, and La Niña), influences bird migration patterns across North America.
We built:
- A reusable, configuration-driven ETL pipeline using PySpark and AWS EMR
- Machine learning prediction models for bird observations
- A real-time data streaming system using Kafka → S3 → Athena → QuickSight
- Interactive ML and real-time dashboards
- Joohyun Park – Real-Time Pipeline, Kafka, QuickSight Dashboard
- Jiayi Li – ML Modeling, Shiny Dashboard
- Hongrui Qu – Historical ETL, PySpark, EMR
We developed a historical analysis and machine learning–based prediction dashboard.
This dashboard visualizes long-term bird migration patterns and model-based observation predictions under different ENSO phases.
Built using R, Shiny, and ggplot2.
Exploratory Data Analysis:
- Historical migration patterns visualized by week and by species
- Average Latitude Comparison Between El Niño and La Niña
Prediction Map:
- Random Forest–based bird observation prediction under different ENSO phases (El Niño, Neutral, and La Niña)
- Interactive map-based visualization that lets users select the month, ENSO phase, and species
We developed a real-time bird observation dashboard using AWS QuickSight.
The dashboard provides an interactive map that allows users to explore live bird observations across Canada and the United States.
- Interactive map-based exploration
- Filter by date, country (CA / US), and species
- View real-time observation counts and locations
- Built on a real-time data pipeline using Kafka → S3 → Athena → QuickSight
- You may need to sign in with your AWS account to view the dashboard.
- Access is restricted to invited users.
Historical Pipeline:
eBird & NOAA ENSO datasets → S3 → ETL (EMR/PySpark) → ML (R) → Shiny app
- ETL configuration:
  - Run the entire historical ETL at once with: `python3 pipeline.py`
  - `config.json` stores all user settings, including species, countries, year ranges, file paths, and the choice of weekly or monthly aggregation
- Raw file requirements:
  - eBird: raw .txt file downloaded from eBird and named using the pattern `species_country.txt`
  - NOAA ENSO: CSV file downloaded from NOAA
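As a rough illustration of the configuration-driven setup described above, the sketch below parses a settings file and derives the raw eBird file names that follow the `species_country.txt` pattern. The key names (`species`, `countries`, `raw_dir`, `aggregation`, and so on), the species labels, and the helper functions are all hypothetical; the real `config.json` and `pipeline.py` may be organized differently.

```python
import json
from pathlib import Path

# Hypothetical config contents; the real config.json keys and values may differ.
EXAMPLE_CONFIG = """
{
  "species": ["swainsons_thrush", "golden_winged_warbler"],
  "countries": ["CA", "US"],
  "start_year": 1995,
  "end_year": 2025,
  "raw_dir": "data/raw",
  "aggregation": "weekly"
}
"""


def load_config(text):
    """Parse the JSON settings and validate the aggregation choice."""
    config = json.loads(text)
    if config["aggregation"] not in ("weekly", "monthly"):
        raise ValueError("aggregation must be 'weekly' or 'monthly'")
    return config


def expected_raw_files(config):
    """List the raw eBird files implied by the species_country.txt pattern."""
    raw_dir = Path(config["raw_dir"])
    return [
        raw_dir / f"{species}_{country}.txt"
        for species in config["species"]
        for country in config["countries"]
    ]


config = load_config(EXAMPLE_CONFIG)
for path in expected_raw_files(config):
    print(path.as_posix())
```

Keeping every user-facing choice in one file means a new species or country can be added by editing the settings and dropping in the matching raw file, without touching the pipeline code.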
Real-Time Pipeline:
eBird API → Kafka → S3 → Glue → Athena → QuickSight
Data Engineering
- Python, PySpark, Apache Kafka, Kafka S3 Sink Connector
- Amazon S3, EMR, Glue, Athena
Machine Learning
- RStudio (randomForest, dplyr, ggplot2)
Visualization
- R Shiny
- gganimate
- AWS QuickSight (Real-Time Dashboard)
Database
- SQLite (Real-Time Deduplication)
- S3 (Raw & Clean data storage)
- eBird Observation Dataset (Cornell Lab of Ornithology)
- NOAA Oceanic Niño Index (ONI)
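ONI values are commonly mapped to ENSO phases with a ±0.5 °C threshold. The sketch below shows that simplified mapping; it is an assumption about how phases could be labeled, not necessarily the rule used in this project. NOAA's operational definition additionally requires the threshold to persist for five consecutive overlapping three-month seasons, which this sketch ignores. The sample values are made up for illustration and are not real ONI readings.

```python
def enso_phase(oni, threshold=0.5):
    """Label a single ONI value using a simple +/- threshold (no persistence check)."""
    if oni >= threshold:
        return "El Niño"
    if oni <= -threshold:
        return "La Niña"
    return "Neutral"


# Illustrative (year, month, ONI) rows; NOT real NOAA data.
sample_rows = [
    (2000, 1, -1.2),
    (2000, 2, 0.1),
    (2000, 3, 0.8),
]

for year, month, oni in sample_rows:
    print(year, month, oni, enso_phase(oni))
```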
Scope for historical data
- Time Range: January 1995 – September 2025
- Region: Canada & United States
- Species:
  - Swainson’s Thrush (olive-backed subspecies)
  - Golden-winged Warbler
- eBird “Recent Observations in a Region” API (24-hour rolling window, polled at 30-minute intervals)
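
Because each poll returns a 24-hour rolling window while polls run every 30 minutes, consecutive responses overlap heavily, which is why the real-time pipeline needs a deduplication step. The sketch below shows one way SQLite could be used for this: a primary-key table of already-seen observations with `INSERT OR IGNORE`. The table name, the helper functions, and the choice of `subId` plus `speciesCode` as the key are assumptions for illustration; the project's actual schema and key may differ.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the table that remembers which observations were seen."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS seen (obs_key TEXT PRIMARY KEY)")
    return conn


def observation_key(obs):
    """Build a key from checklist ID and species code (an assumed key choice)."""
    return f"{obs['subId']}|{obs['speciesCode']}"


def filter_new(conn, observations):
    """Return only observations not seen before, recording them as seen."""
    fresh = []
    for obs in observations:
        cursor = conn.execute(
            "INSERT OR IGNORE INTO seen (obs_key) VALUES (?)",
            (observation_key(obs),),
        )
        # rowcount is 1 when the row was inserted, 0 when the key already existed.
        if cursor.rowcount == 1:
            fresh.append(obs)
    conn.commit()
    return fresh


# Two overlapping polls with made-up records.
poll_1 = [
    {"subId": "S1", "speciesCode": "aaa"},
    {"subId": "S2", "speciesCode": "bbb"},
]
poll_2 = [
    {"subId": "S2", "speciesCode": "bbb"},  # repeated from poll_1
    {"subId": "S3", "speciesCode": "aaa"},
]

store = open_store()
first_batch = filter_new(store, poll_1)
second_batch = filter_new(store, poll_2)
print(len(first_batch), len(second_batch))
```

Only the records returned by `filter_new` would then be forwarded downstream (for example, published to Kafka), so repeated polls do not inflate observation counts in the dashboard.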
